Effective Kalman filtering algorithm for distributed multichannel speech enhancement

Jingxian Tu a, Youshen Xia b,∗

a Laboratory of Complex System Simulation & Intelligent Computing, School of Information and Electronic Engineering, Wuzhou University, Wuzhou, China
b Department of Software Engineering, College of Mathematics and Computer Science, Fuzhou University, Fuzhou, China

∗ Corresponding author. E-mail addresses: [email protected] (J. Tu), [email protected], [email protected] (Y. Xia).

This work is supported by the National Natural Science Foundation of China under Grant No. 61179037, in part by the Scientific Research Fund of the Education Department of Guangxi Zhuang Autonomous Region under Grant No. 2013YB223, and by the Construction Fund of Master's Degree Grant Unit under Gui Degree [2013] No. 4.

Article history: Received 11 May 2016; Revised 11 April 2017; Accepted 21 May 2017. Communicated by Dr. K. Chan.

Keywords: Colored noise reduction; Kalman filtering; Multichannel speech enhancement; Time domain

Abstract

Kalman filtering is known as an effective speech enhancement technique. Many Kalman filtering algorithms for single channel speech enhancement were developed in past decades, but few exist for multichannel speech enhancement. This paper proposes a Kalman filtering algorithm for distributed multichannel speech enhancement in the time domain under colored noise environments. Compared with conventional algorithms for distributed multichannel speech enhancement, the proposed algorithm has lower computational complexity and requires fewer computational resources. Simulation results show that the proposed algorithm is superior to the conventional algorithms for distributed multichannel speech enhancement in achieving higher noise reduction, less signal distortion and better speech intelligibility. Moreover, the proposed algorithm is faster than several multichannel speech enhancement algorithms.
1. Introduction

Since received signals are usually corrupted by background noise, noise reduction is required in signal processing tasks such as speech communication, speech recognition, and speaker identification. Over the past decades, many speech enhancement algorithms for noise reduction have been presented. According to the number of channels available, speech enhancement algorithms can be classified into single channel algorithms [1–21] and multichannel algorithms [22–34]. In terms of the microphone configuration, multichannel microphones can be classified as dual, array, and distributed microphones. This paper focuses on distributed multichannel speech enhancement.

Multichannel speech enhancement techniques have been actively studied as multi-microphone systems developed, and many multichannel speech enhancement algorithms have been presented. Since a multichannel microphone system can utilize much more information to improve the performance of speech enhancement,
the multichannel speech enhancement algorithms outperform the single channel ones. Multichannel speech enhancement algorithms mainly include the classic beamforming algorithm [22,23], the multichannel Wiener algorithm [24,25], the multichannel subspace algorithm [26–28], the multichannel minimum distortion algorithm [29], and the multichannel statistical estimation algorithm [30–32]. These algorithms can reduce background noise with less speech distortion as the number of microphones increases. The multichannel Wiener algorithm reduces the noise by assuming the noise is stationary and minimizing the mean square error [24,25]. It obtains good performance in the stationary noise case but poor performance in the nonstationary noise case. The multichannel subspace algorithm first decomposes the noisy space into a pure noise space and a speech-plus-noise space, and then reduces the noise in the speech-plus-noise space by minimizing the mean square error [26–28]. It can obtain better performance than the Wiener algorithm, especially in the nonstationary noise case. The multichannel statistical estimation algorithm reduces the background noise by assuming that the Fourier coefficients of the clean speech signal and the noise signal obey certain probability distributions and by applying an optimization strategy such as minimizing the mean square error or maximizing the posterior probability [30–32].

Kalman filtering is known as an effective speech enhancement technique, where the speech signal is usually modeled as an autoregressive (AR) process and represented in the state-space domain, and
the speech signal is then recovered by the Kalman filter. In comparison with other speech enhancement algorithms, the Kalman filtering algorithm has low computational complexity and does not assume signal stationarity, so the Kalman filter has been of great interest in speech enhancement. Many single channel speech enhancement algorithms based on the Kalman filter have been proposed. Among them, the earliest Kalman filtering algorithm for speech enhancement in white noise was proposed by Paliwal and Basu [5]. Kalman filtering algorithms for speech enhancement in colored noise were proposed by Gibson et al. [6] and Ning et al. [11]. Other existing Kalman filtering algorithms for speech enhancement were obtained by improving the accuracy of the AR parameters of the Kalman filter [7,8,12–16]. In contrast, Kalman filtering-based multichannel speech enhancement algorithms are scarce. Only one Kalman filtering-based frequency domain algorithm for multichannel speech enhancement, called the LPC-based multichannel speech enhancement algorithm, was found in a conference paper [33]. In general, a frequency domain algorithm requires more computational resources than a time domain algorithm.

In this paper, we propose a Kalman filtering algorithm for distributed multichannel speech enhancement in the time domain under colored noise environments. The proposed algorithm is the first multichannel speech enhancement algorithm based on Kalman filtering in the time domain. Compared with traditional multichannel speech enhancement algorithms, it has lower computational complexity and requires fewer computational resources, so it is easily used in practical applications. Simulation results show that the proposed algorithm is superior to several conventional algorithms for distributed multichannel speech enhancement in achieving higher noise reduction and lower signal distortion. Moreover, the proposed algorithm is faster than several multichannel speech enhancement algorithms.

This paper is organized as follows. Section 2 introduces the multichannel model and proposes a Kalman filtering-based multichannel speech enhancement algorithm under colored noise environments. Section 3 describes how to estimate the parameters of the Kalman filter. Section 4 presents the performance evaluation, and Section 5 gives the conclusion.
2. The model and the proposed multichannel algorithm

2.1. Distributed multichannel model

We are concerned with a distributed microphone system that can accurately time align the M noisy speech signals [31,32]. The distributed multichannel microphone model is described as

$$y_i(n) = c_i s(n) + v_i(n), \quad i = 1, 2, \ldots, M \tag{1}$$

where M is the number of channels, $y_i(n)$ and $v_i(n)$ are the noisy speech and background noise in the nth sample of channel i, $s(n)$ is the true source signal, and $c_i \in [0, 1]$ are time-invariant attenuation factors. In the special case $M = 1$ and $c_1 = 1$, the distributed multichannel model becomes the well-known single channel model. Our goal is to estimate the speech signal $s(n)$ from the M noisy signal observations $\{y_i(n)\}_{i=1}^{M}$.

Traditional distributed multichannel speech enhancement algorithms mainly include the multichannel Wiener algorithm, the multichannel subspace algorithm, the multichannel minimum distortion algorithm, and the multichannel statistical estimation-based algorithm. Recently, a Kalman filter-based frequency domain algorithm, called the LPC-based multichannel speech enhancement algorithm, was presented in a conference paper [33]. In general, a frequency domain algorithm requires more computational resources than a time domain algorithm.

2.2. Proposed multichannel algorithm

In this subsection, we propose a Kalman filtering algorithm for distributed multichannel speech enhancement in the time domain for colored noise cases. Let the speech signal $s(n)$ be modeled as the pth-order AR process

$$s(n) = \sum_{i=1}^{p} a_i\, s(n-i) + u(n) \tag{2}$$

where $a_i$ is the ith AR speech model parameter and $u(n)$ is the driving white noise with variance $\sigma_u^2(n)$. For our discussion, (2) is expressed in vector form as

$$\mathbf{s}(n) = \mathbf{F}\,\mathbf{s}(n-1) + \mathbf{u}(n) \tag{3}$$

where $\mathbf{s}(n) = [s(n-p+1), \ldots, s(n)]^T$ is a $p \times 1$ vector, $\mathbf{u}(n) = [0, \ldots, 0, u(n)]^T$ is a $p \times 1$ vector, and $\mathbf{F}$ is the $p \times p$ companion matrix

$$\mathbf{F} = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ a_p & a_{p-1} & a_{p-2} & \cdots & a_1 \end{pmatrix}.$$

Consider that the speech signal of each channel is corrupted by additive colored noise. Let the ith channel noise $v_i(n)$ be modeled as the qth-order AR process

$$v_i(n) = \sum_{j=1}^{q} b_{ij}\, v_i(n-j) + w_i(n), \quad i = 1, \ldots, M \tag{4}$$

where $b_{ij}$ is the jth AR noise model parameter of the ith channel and $w_i(n)$ is white Gaussian noise at the ith channel with zero mean and variance $\sigma_{w_i}^2(n)$. Eq. (4) can be written in vector form as

$$\mathbf{v}_i(n) = \mathbf{G}_i\,\mathbf{v}_i(n-1) + \mathbf{w}_i(n) \tag{5}$$

where $\mathbf{v}_i(n) = [v_i(n-q+1), \ldots, v_i(n)]^T$ and $\mathbf{w}_i(n) = [0, \ldots, 0, w_i(n)]^T$ are $q \times 1$ vectors, and $\mathbf{G}_i$ is the $q \times q$ companion matrix

$$\mathbf{G}_i = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ b_{iq} & b_{i(q-1)} & b_{i(q-2)} & \cdots & b_{i1} \end{pmatrix}.$$

The noise driving correlation matrix of the ith channel, $\mathbf{W}_i(n) = E[\mathbf{w}_i(n)\mathbf{w}_i(n)^T]$, is the $q \times q$ matrix whose entries are all zero except the last diagonal entry:

$$\mathbf{W}_i(n) = \begin{pmatrix} 0 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 0 & 0 \\ 0 & \cdots & 0 & \sigma_{w_i}^2(n) \end{pmatrix}.$$
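As a concrete illustration of the models (2)-(5), the short Python/NumPy sketch below builds the companion matrices F and G_i and the driving-noise correlation matrices. It is a minimal sketch of our own (the helper names `companion` and `driving_corr` and the AR parameter values are illustrative assumptions, not from the paper).

```python
import numpy as np

def companion(ar_coeffs):
    """Companion matrix of an AR(p) process, as in Eqs. (3) and (5).

    ar_coeffs = [a_1, ..., a_p]; the superdiagonal shifts the state
    [s(n-p+1), ..., s(n)] forward by one sample, and the last row
    [a_p, ..., a_1] produces the new AR sample.
    """
    p = len(ar_coeffs)
    F = np.zeros((p, p))
    F[:-1, 1:] = np.eye(p - 1)          # shift part
    F[-1, :] = ar_coeffs[::-1]          # [a_p, a_{p-1}, ..., a_1]
    return F

def driving_corr(order, variance):
    """Driving-noise correlation matrix: zero except the last diagonal entry."""
    W = np.zeros((order, order))
    W[-1, -1] = variance
    return W

# Example: a 3rd-order speech model and one 2nd-order noise channel.
F = companion([0.7, -0.2, 0.1])         # hypothetical AR parameters
G1 = companion([0.5, 0.3])
U = driving_corr(3, 0.01)               # sigma_u^2(n) = 0.01 (illustrative)
W1 = driving_corr(2, 0.05)              # sigma_{w_1}^2(n) = 0.05 (illustrative)
```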
Let $\mathbf{x}(n) = [\mathbf{s}^T(n), \mathbf{v}_1^T(n), \ldots, \mathbf{v}_M^T(n)]^T$ and let $\mathbf{w}(n) = [\mathbf{u}^T(n), \mathbf{w}_1^T(n), \ldots, \mathbf{w}_M^T(n)]^T$. Then the state equations for this system can be written as

$$\mathbf{x}(n) = \mathbf{G}\,\mathbf{x}(n-1) + \mathbf{w}(n) \tag{6}$$

$$\mathbf{y}(n) = \mathbf{L}\,\mathbf{x}(n) \tag{7}$$

where

$$\mathbf{G} = \begin{pmatrix} \mathbf{F} & 0 & \cdots & 0 \\ 0 & \mathbf{G}_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{G}_M \end{pmatrix},$$
$$\mathbf{L} = \begin{pmatrix} c_1 \mathbf{e}_1 & \mathbf{e}_2 & 0 & \cdots & 0 \\ c_2 \mathbf{e}_1 & 0 & \mathbf{e}_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_M \mathbf{e}_1 & 0 & 0 & \cdots & \mathbf{e}_2 \end{pmatrix},$$

$\mathbf{e}_1 = [0, \ldots, 0, 1]$ is a $1 \times p$ vector, and $\mathbf{e}_2 = [0, \ldots, 0, 1]$ is a $1 \times q$ vector. Let

$$\mathbf{W}(n) = E[\mathbf{w}(n)\mathbf{w}(n)^T] = \begin{pmatrix} \mathbf{U} & 0 & \cdots & 0 \\ 0 & \mathbf{W}_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{W}_M \end{pmatrix}$$

where $\mathbf{U}$ is the speech driving correlation matrix, defined as the $p \times p$ matrix that is zero except for the last diagonal entry:

$$\mathbf{U} = \begin{pmatrix} 0 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 0 & 0 \\ 0 & \cdots & 0 & \sigma_u^2(n) \end{pmatrix}.$$

From (6) and (7), the standard Kalman filtering estimate can be obtained by using the following recursive equations:

$$\hat{\mathbf{x}}(n+1|n) = \mathbf{G}\,\hat{\mathbf{x}}(n|n) \tag{8}$$

$$\hat{\mathbf{x}}(n+1|n+1) = \hat{\mathbf{x}}(n+1|n) + \mathbf{K}(n+1)\,\mathbf{e}(n+1) \tag{9}$$

$$\mathbf{e}(n+1) = \mathbf{y}(n+1) - \mathbf{L}\,\hat{\mathbf{x}}(n+1|n) \tag{10}$$

$$\mathbf{K}(n+1) = \mathbf{P}(n+1|n)\,\mathbf{L}^T\,[\mathbf{L}\,\mathbf{P}(n+1|n)\,\mathbf{L}^T]^{-1} \tag{11}$$

$$\mathbf{P}(n+1|n) = \mathbf{G}\,\mathbf{P}(n|n)\,\mathbf{G}^T + \mathbf{W}(n) \tag{12}$$

$$\mathbf{P}(n+1|n+1) = (\mathbf{I} - \mathbf{K}(n+1)\,\mathbf{L})\,\mathbf{P}(n+1|n) \tag{13}$$
where:
1. $\hat{\mathbf{x}}(n+1|n)$ is the minimum mean-square estimate of $\mathbf{x}(n+1)$ given the past observations $\mathbf{y}(1), \ldots, \mathbf{y}(n)$;
2. $\hat{\mathbf{x}}(n|n)$ is the filtered estimate of the state vector $\mathbf{x}(n)$;
3. $\mathbf{P}(n+1|n) = E[\tilde{\mathbf{x}}(n+1|n)\tilde{\mathbf{x}}(n+1|n)^T]$ is the predicted state-error correlation matrix, where $\tilde{\mathbf{x}}(n+1|n) = \mathbf{x}(n+1) - \hat{\mathbf{x}}(n+1|n)$ is the predicted state-error vector;
4. $\mathbf{P}(n|n) = E[\tilde{\mathbf{x}}(n|n)\tilde{\mathbf{x}}(n|n)^T]$ is the filtered state-error correlation matrix, where $\tilde{\mathbf{x}}(n|n) = \mathbf{x}(n) - \hat{\mathbf{x}}(n|n)$ is the filtered state-error vector;
5. $\mathbf{e}(n)$ is the innovation sequence;
6. $\mathbf{K}(n)$ is the Kalman gain.

Based on the Kalman filtering estimate above, we propose a Kalman filtering algorithm for distributed multichannel speech enhancement (KFADMSE), described by Algorithm 1, where L and $N_f$ denote the frame length and the number of frames, respectively, $\mathbf{I}_{(p+qM)\times(p+qM)}$ is a $(p+qM) \times (p+qM)$ identity matrix, and $\mathbf{e}_3 = [0, \ldots, 0, 1, 0, \ldots, 0]$ is a $1 \times (p+qM)$ vector whose pth element is 1 and whose other elements are 0, so that $\mathbf{e}_3 \hat{\mathbf{x}}$ extracts the speech estimate from the stacked state.

Algorithm 1 KFADMSE algorithm.
Require: $y_1(n), \ldots, y_M(n)$, $n = 1, \ldots, N$.
Ensure: $\hat{s}(n)$, $n = 1, \ldots, N$.
Initialize $n = 0$, $\mathbf{P}(0|0) = \mathbf{I}_{(p+qM)\times(p+qM)}$, $\hat{\mathbf{x}}(0|0) = \mathbf{0}_{(p+qM)\times 1}$, Max_iter = 4.
Estimate $c_1, \ldots, c_M$.
Compute the frame number $N_f = N/L$.
for $i = 1$ to $N_f$ do
  Apply voice activity detection (VAD) to the ith frame of noisy speech.
  if the ith frame contains noise only then
    Estimate $b_1, \ldots, b_M$ from the noisy speech frame.
  end if
  for $j = 1$ to Max_iter do
    if $j = 1$ then
      Estimate $\mathbf{a}$ from the current noisy speech frame.
    else
      Estimate $\mathbf{a}$ from the estimated clean speech frame.
    end if
    Update $\mathbf{G}$ and $\mathbf{W}$.
    for $n = (i-1)L$ to $iL - 1$ do
      Compute $\mathbf{P}(n+1|n)$ using (12).
      Compute $\mathbf{K}(n+1)$ using (11).
      Compute $\hat{\mathbf{x}}(n+1|n)$ using (8).
      Compute $\mathbf{e}(n+1)$ using (10).
      Compute $\hat{\mathbf{x}}(n+1|n+1)$ using (9).
      Compute $\mathbf{P}(n+1|n+1)$ using (13).
      $\hat{s}(n+1) = \mathbf{e}_3 \hat{\mathbf{x}}(n+1|n+1)$.
    end for
  end for
end for
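For concreteness, the following Python/NumPy sketch assembles the stacked matrices of (6)-(7) and runs the recursion (8)-(13) over one frame. It is our own minimal rendering under stated assumptions, not the authors' implementation: the function name `kfadmse_frame` and its interface are hypothetical, and it reuses the `companion` helper from the sketch in Section 2.2.

```python
import numpy as np

def kfadmse_frame(Y, a, b_list, c, sigma_u2, sigma_w2, x0, P0):
    """Run the Kalman recursion (8)-(13) over one frame.

    Y         : M x T array of time-aligned noisy samples for this frame.
    a         : speech AR coefficients [a_1, ..., a_p].
    b_list    : list of M noise AR coefficient vectors [b_i1, ..., b_iq].
    c         : attenuation factors [c_1, ..., c_M].
    sigma_u2  : speech driving-noise variance.
    sigma_w2  : list of M noise driving-noise variances.
    x0, P0    : state estimate and error covariance carried over.
    """
    M, T = Y.shape
    p, q = len(a), len(b_list[0])
    d = p + q * M

    # Stacked transition matrix G and driving correlation W, Eq. (6).
    G = np.zeros((d, d))
    W = np.zeros((d, d))
    G[:p, :p] = companion(a)
    W[p - 1, p - 1] = sigma_u2
    for i in range(M):
        s0 = p + i * q
        G[s0:s0 + q, s0:s0 + q] = companion(b_list[i])
        W[s0 + q - 1, s0 + q - 1] = sigma_w2[i]

    # Observation matrix L, Eq. (7): y_i(n) = c_i s(n) + v_i(n).
    L = np.zeros((M, d))
    L[:, p - 1] = c                        # c_i times last speech state
    for i in range(M):
        L[i, p + i * q + q - 1] = 1.0      # last state of noise channel i

    x, P = x0, P0
    s_hat = np.zeros(T)
    for n in range(T):
        P_pred = G @ P @ G.T + W                             # (12)
        K = P_pred @ L.T @ np.linalg.inv(L @ P_pred @ L.T)   # (11)
        x_pred = G @ x                                       # (8)
        e = Y[:, n] - L @ x_pred                             # (10)
        x = x_pred + K @ e                                   # (9)
        P = (np.eye(d) - K @ L) @ P_pred                     # (13)
        s_hat[n] = x[p - 1]                                  # e3 * x-hat
    return s_hat, x, P
```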
2.3. Comparison

First, the proposed algorithm significantly extends the conventional Kalman filtering-based time domain algorithm for single channel speech enhancement in colored background noise. Unlike the single channel Kalman filtering algorithm, the proposed algorithm can use spatial signal information, so it has the potential to outperform the single channel algorithm. Second, like the multichannel Wiener algorithm, the proposed algorithm belongs to the MMSE estimators. However, the multichannel Wiener algorithm assumes that the speech signal is stationary, while the proposed algorithm uses the speech production model to recover the clean speech and thus exploits more information about the clean speech when reducing noise. Third, the multichannel subspace algorithm requires not only the computation of the optimal filter but also the computation of eigenvalues and eigenvectors in each frame, so it incurs more computational cost than the proposed algorithm. Fourth, the multichannel statistical estimation-based frequency domain algorithms, such as the minimum mean-square error short-time log-spectral amplitude and spectral phase (MMSE-LSA) estimator proposed in [31] and the minimum mean-square error short-time magnitude-squared spectrum (MMSE-MSS) estimator proposed in [32], involve more statistical assumptions and require more computational resources. Fifth, Schwartz et al. presented a Kalman filter-based frequency domain (KEMD-LP) algorithm for multichannel speech enhancement [33]; since a frequency domain algorithm requires more computational resources than a time domain algorithm, the proposed time domain algorithm is easier to use in practice.

Finally, we compare the complexity of the multichannel speech enhancement algorithms mentioned above, measured by the total number of multiplication operations. To simplify the discussion, without loss of generality we consider the case M = 3 and neglect low-order terms. The comparison of the six algorithms in terms of algorithm complexity in the case of M = 3 is listed in Table 1, where p, q ≪ L and L is large.
Table 1
Comparison of six algorithms in terms of algorithm complexity in the case of M = 3.

Algorithm              Total number of multiplication operations
Wiener                 108 L^3 N_f
Subspace               355 L^3 N_f
KEMD-LP                (562,088 + 32p + 16 log(L)) L N_f
MMSE-LSA               (300,088 + 16 log(L)) L N_f
MMSE-MSS               (350 + 16 log(L)) L N_f
Proposed algorithm     16[(p + 3q)^3 + 3(p + 3q)^2 + 9(p + 3q) + 9] L N_f

From Table 1, we see that the proposed algorithm, MMSE-LSA and KEMD-LP have lower computational complexity than both the multichannel Wiener algorithm and the multichannel subspace algorithm, and that the proposed algorithm has lower computational complexity than KEMD-LP and MMSE-LSA. MMSE-MSS has the lowest computational complexity among the six algorithms. On the other hand, the proposed algorithm processes the observed signals in the time domain while MMSE-LSA and KEMD-LP process them in the frequency domain. Therefore, the proposed algorithm requires fewer computational resources than MMSE-LSA, MMSE-MSS and KEMD-LP.

3. Parameter estimation

For the implementation of the proposed algorithm, we need to estimate several parameters in advance. Using the first channel noisy speech, we apply the Yule–Walker equation technique [33,35] to estimate the AR parameter vector of the source speech signal, $\mathbf{a} = [a_1, \ldots, a_p]^T$. After estimating $\mathbf{a}$, the driving-noise variance $\sigma_u^2(n)$ can be estimated by averaging the squared prediction errors:

$$\sigma_u^2(n) = \frac{1}{L}\sum_{j=0}^{L-1}\Big(y_1(n+j) - \sum_{i=1}^{p} a_i\, y_1(n+j-i)\Big)^2. \tag{14}$$

To enhance the accuracy of $\mathbf{a}$ and $\sigma_u^2(n)$, we may update them iteratively. In the first iteration, they are estimated from the current noisy speech frame, and the source speech signal is then estimated by the Kalman filter. In subsequent iterations, they are re-estimated from the previously estimated source speech frame. The iterations stop when the termination condition is met [14]. The noise parameters $\mathbf{b}_i = [b_{i1}, \ldots, b_{iq}]^T$ and $\sigma_{w_i}^2(n)$ are also estimated by the Yule–Walker equation technique. The attenuation factors $c_i$ are estimated by the method proposed in [31]:

$$\hat{c}_i = \frac{\sigma_{y_i}^2 - \sigma_{v_i}^2}{\sigma_{y_1}^2 - \sigma_{v_1}^2}, \quad i = 1, \ldots, M \tag{15}$$

where $\sigma_{y_i}^2$ and $\sigma_{v_i}^2$ are the variances of the noisy observations and the noise of channel i, respectively. $\sigma_{v_i}^2$ is estimated from the variance of the noise-dominated segments of the noisy observations of channel i.

In the proposed algorithm, a short-time energy-based VAD is employed to determine whether the current frame is noise dominated or speech dominated. We apply the VAD to the first channel noisy speech signal. To improve the performance of the VAD, we first apply the single channel subspace speech enhancement algorithm to the first channel noisy speech signal. In this preprocessing, the first 1.5 s of the noisy speech signal are used to estimate the noise covariance matrix, and the noise covariance matrix is not updated during the whole processing. Moreover, to preserve speech information, the parameter μ that controls the tradeoff between noise reduction and speech distortion is chosen to be small; in this paper, we set μ = 1 in the preprocessing. We then apply the short-time energy-based VAD to the enhanced noisy speech signal. Let $\hat{y}_1(lL+1), \hat{y}_1(lL+2), \ldots, \hat{y}_1(lL+L)$ be the lth frame of the enhanced noisy speech signal, where L is the frame length. The short-time energy of the lth frame is computed as

$$\hat{E}_l = 10\log_{10}\sum_{i=1}^{L} \hat{y}_1(lL+i)^2. \tag{16}$$

The rule for VAD is described as

$$V_l = \begin{cases} 1, & \text{if } \hat{E}_l \ge T_v \\ 0, & \text{otherwise} \end{cases} \tag{17}$$

where $V_l$ is the VAD index of the lth frame ($V_l = 1$ indicates a speech-dominated frame and $V_l = 0$ a noise-dominated frame) and $T_v$ is the threshold.
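To illustrate the estimation steps in this section, the sketch below solves the Yule–Walker equations for the AR coefficients and implements the short-time energy VAD of (16)-(17). It is a minimal Python sketch under our own conventions (biased autocorrelation estimates, an illustrative threshold), not the authors' implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def yule_walker(frame, order):
    """AR coefficients [a_1, ..., a_p] and driving-noise variance via Yule-Walker."""
    frame = frame - frame.mean()
    # Biased autocorrelation estimates r(0), ..., r(p).
    r = np.array([frame[:len(frame) - k] @ frame[k:] for k in range(order + 1)])
    r /= len(frame)
    a = solve_toeplitz(r[:-1], r[1:])    # solve R a = r for the AR coefficients
    sigma2 = r[0] - a @ r[1:]            # prediction-error (driving-noise) variance
    return a, sigma2

def vad(y_enh, frame_len, threshold_db):
    """Short-time energy VAD, Eqs. (16)-(17): 1 = speech frame, 0 = noise frame."""
    n_frames = len(y_enh) // frame_len
    flags = np.zeros(n_frames, dtype=int)
    for l in range(n_frames):
        frame = y_enh[l * frame_len:(l + 1) * frame_len]
        E = 10.0 * np.log10(np.sum(frame ** 2) + 1e-12)  # guard against log(0)
        flags[l] = 1 if E >= threshold_db else 0
    return flags
```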
4. Performance evaluation

In this section, we give testing examples to demonstrate the effectiveness of the proposed algorithm. We compare the proposed algorithm (denoted KFADMSE) with the multichannel Wiener algorithm, the multichannel subspace algorithm, the multichannel speech enhancement algorithm based on the Kalman-EM scheme proposed in [33] (denoted KEMD-LP), the multichannel method proposed in [31] (denoted MMSE-LSA), and the multichannel method proposed in [32] (denoted MMSE-MSS). In the multichannel Wiener method, the clean speech signal and the noise signal in one frame are assumed to be wide-sense stationary, and the clean speech signal is estimated in the minimum mean square error (MMSE) sense in the time domain; this method is only suitable for speech corrupted by stationary noise. In the multichannel subspace method, the noisy signal space is first separated into two orthogonal subspaces, a noise subspace and a speech-plus-noise subspace, and enhancement then removes the noise subspace and estimates the clean speech signal from the speech-plus-noise subspace; this method outperforms the Wiener method, especially in the nonstationary noise case. MMSE-LSA and MMSE-MSS are speech enhancement methods based on statistical models in the frequency domain. In MMSE-LSA, Rayleigh and Gaussian statistical models are used for the speech prior and the noise likelihood. In MMSE-MSS, the real and imaginary parts of the Discrete Fourier Transform (DFT) coefficients of clean speech and noise are modeled as independent Gaussian random variables with equal variance. In KEMD-LP, the Kalman expectation-maximization (KEM) scheme is adopted in the short-time Fourier transform (STFT) domain: in the E-step, a Kalman smoother is applied to extract the clean signal; in the M-step, the parameters are updated according to the output of the Kalman smoother.

4.1. Methodology

The experimental room is 10 m long, 8 m wide and 6 m high (x × y × z). The source is located at (2, 4, 1.6). We consider a uniform linear distributed array of 10 omnidirectional microphones with a spacing of about 30 cm between adjacent microphones; the ith microphone is located at (2.2, 4 + 0.3 × (i − 1), 1.6). The locations of the microphones and the source are depicted in Fig. 1. Testing utterances and noise signals are selected from the NOIZEUS corpus [35]. All signals are sampled at 8 kHz. We randomly select 20 different speech sentences from the NOIZEUS database and join them into one clean signal of about 50 s. The image method [36] was used to obtain the impulse responses, and the impulse responses were convolved with the source speech to
Fig. 1. The locations of the microphones and the source in the experimental environment.
obtain the clean speech signals. These clean speech signals were then mixed with noise at input SNRs of −5 dB and 5 dB. We have tried different types of noise (babble, train, pink, f16 and factory) and obtained the same or similar conclusions; because of space limitations, we present here the results for babble and factory noise. A rectangular window was used with a frame size of 256 samples (32 ms) and half overlap. The reverberation condition T60 = 200 ms is considered herein. Voice activity detection (VAD) is based on the short-time energy described above. The multichannel Wiener algorithm, the multichannel subspace algorithm and the proposed algorithm use the same VAD technique; the other three algorithms do not need VAD. The VAD threshold $T_v$ is set to −78 dB when the input SNR is −5 dB and to −91 dB when the input SNR is 5 dB. For our simulations, we take the first microphone, nearest the speech source, as a reference. To remove the impact of time-delay estimation, we time align the M noisy speech signals with the true time delay. The simulation is conducted in MATLAB.

4.2. Objective measures

We use the segmental signal-to-noise ratio (SSNR) improvement, the log-likelihood ratio (LLR) [37], the perceptual evaluation of speech quality (PESQ) [37] and the short-time objective intelligibility (STOI) [38] as objective measures. SSNR improvement evaluates noise reduction and LLR evaluates signal distortion. The SSNR is defined as

$$\mathrm{SSNR} = \frac{10}{L}\sum_{l=0}^{L-1}\log_{10}\frac{\sum_{n=Nl}^{Nl+N-1} s(n)^2}{\sum_{n=Nl}^{Nl+N-1} [s(n)-\hat{s}(n)]^2} \tag{18}$$

where $s(n)$ is the source speech signal, $\hat{s}(n)$ is the enhanced signal, N is the frame length and L is the number of frames. A larger SSNR value implies better performance. The LLR measure for a particular frame l is defined as

$$d_{\mathrm{LLR}}^{l} = \log\left(\frac{\hat{\mathbf{a}}_l^T \mathbf{R}_{s,l}\, \hat{\mathbf{a}}_l}{\mathbf{a}_l^T \mathbf{R}_{s,l}\, \mathbf{a}_l}\right) \tag{19}$$

where $\mathbf{a}_l$ is the LPC vector of the original clean speech frame, $\hat{\mathbf{a}}_l$ is the LPC vector of the enhanced speech frame, and $\mathbf{R}_{s,l}$ is the autocorrelation matrix of the original clean speech frame. The average LLR value is computed over the frame values $d_{\mathrm{LLR}}^{l}$; to remove unrealistically high speech distortion levels, only the smallest 95% of the frame LLR values are used in the average. LLR has a range of 0–2, and a lower LLR value indicates better performance.
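A minimal Python sketch of the two measures (18) and (19) follows. The LPC vector convention $\mathbf{a}_l = [1, -a_1, \ldots, -a_p]^T$ is our assumption (the paper does not spell it out), and the sketch reuses the hypothetical `yule_walker` helper from the Section 3 sketch.

```python
import numpy as np
from scipy.linalg import toeplitz

def ssnr(s, s_hat, frame_len):
    """Segmental SNR, Eq. (18), averaged over whole frames."""
    n_frames = len(s) // frame_len
    vals = []
    for l in range(n_frames):
        seg = slice(l * frame_len, (l + 1) * frame_len)
        num = np.sum(s[seg] ** 2)
        den = np.sum((s[seg] - s_hat[seg]) ** 2) + 1e-12
        vals.append(np.log10(num / den + 1e-12))
    return 10.0 * np.mean(vals)

def llr_frame(s_frame, shat_frame, order=10):
    """Frame LLR, Eq. (19), using LPC vectors [1, -a_1, ..., -a_p]."""
    def lpc_vec(x):
        a, _ = yule_walker(x, order)      # from the Section 3 sketch
        return np.concatenate(([1.0], -a))
    # Autocorrelation matrix R_{s,l} of the clean frame.
    r = np.array([s_frame[:len(s_frame) - k] @ s_frame[k:]
                  for k in range(order + 1)]) / len(s_frame)
    R = toeplitz(r)
    a_clean, a_enh = lpc_vec(s_frame), lpc_vec(shat_frame)
    return np.log((a_enh @ R @ a_enh) / (a_clean @ R @ a_clean))
```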
Fig. 2. SSNR improvement results by six algorithms versus number of microphones in babble noise with input SNR value being −5 dB.
Fig. 3. LLR results by six algorithms versus number of microphones in babble noise with input SNR value being −5 dB.
PESQ is based on the mean opinion score (MOS), which covers a scale from 1 (bad) to 5 (excellent). The STOI measure is the measure most closely related to the intelligibility of the enhanced speech signal; STOI has a range of 0–1, and a larger STOI value indicates better performance.

In the first experiment, we compare the proposed algorithm with the other five algorithms in colored noise. The input SNR values are −5 dB and 5 dB, with the number of microphones M = 1, 2, ..., 10. We let p = 6 and q = 6. Figs. 2–17 depict the variations of SSNR improvement, LLR, PESQ and STOI for the proposed algorithm, MMSE-LSA, MMSE-MSS, Wiener, Subspace and KEMD-LP as the number of microphones increases from 1 to 10 under babble noise and factory noise environments with input SNRs of −5 dB and 5 dB. For comparison, all the figures also depict the results of the noisy speech signal at the first microphone. From Figs. 2 to 17, we first see that the proposed algorithm outperforms the other five algorithms in terms of SSNR improvement, LLR, PESQ and STOI, particularly as the number of microphones increases. This indicates that the proposed algorithm achieves the most noise reduction, the least speech distortion and the most speech intelligibility
Fig. 4. PESQ results by six algorithms versus number of microphones in babble noise with input SNR value being −5 dB.
Fig. 7. LLR results by six algorithms versus number of microphones in babble noise with input SNR value being 5 dB.
Fig. 5. STOI results by six algorithms versus number of microphones in babble noise with input SNR value being −5 dB.
Fig. 8. PESQ results by six algorithms versus number of microphones in babble noise with input SNR value being 5 dB.
Fig. 6. SSNR improvement results by six algorithms versus number of microphones in babble noise with input SNR value being 5 dB.
Fig. 9. STOI results by six algorithms versus number of microphones in babble noise with input SNR value being 5 dB.
Fig. 10. SSNR improvement results by six algorithms versus number of microphones in factory noise with input SNR value being −5 dB.
Fig. 13. STOI results by six algorithms versus number of microphones in factory noise with input SNR value being −5 dB.
Fig. 11. LLR results by six algorithms versus number of microphones in factory noise with input SNR value being −5 dB.
Fig. 14. SSNR improvement results by six algorithms versus number of microphones in factory noise with input SNR value being 5 dB.
Fig. 12. PESQ results by six algorithms versus number of microphones in factory noise with input SNR value being −5 dB.
Fig. 15. LLR results by six algorithms versus number of microphones in factory noise with input SNR value being 5 dB.
Fig. 16. PESQ results by six algorithms versus number of microphones in factory noise with input SNR value being 5 dB.
Fig. 18. Waveforms of clean signal, noisy signal, and enhanced signals by six algorithms in the case of M = 4 and babble noise with input SNR value being 5 dB.
Fig. 17. STOI results by six algorithms versus number of microphones in factory noise with input SNR value being 5 dB.
in the case of a large number of microphones. The performance of the proposed algorithm improves greatly as the number of microphones increases. Since the autoregressive (AR) processes that model the speech and noise signals are exploited in the proposed algorithm, its performance is improved. Second, increasing the number of microphones increases the SSNR improvement, PESQ and STOI values and reduces the LLR value for all six algorithms: with more microphones, much more temporal and spatial information can be utilized to improve the performance of multichannel speech enhancement algorithms. It is worth noting that the performance of the multichannel Wiener speech enhancement algorithm may deteriorate as the number of microphones increases, since the estimation error of the noise and speech covariance matrices may become too large. Figs. 18 and 19 display the waveforms of the clean signal, a noisy signal corrupted by colored noise, and the six enhanced signals, where the input SNR is 5 dB and M = 4. From Figs. 18 and 19, we see that the waveform of the enhanced signal produced by the proposed algorithm is closer to the source speech signal than those of the other five algorithms. Finally, Table 2 lists a comparison of the proposed algorithm, MMSE-LSA, MMSE-MSS, the multichannel Wiener algorithm, the multichannel
Fig. 19. Waveforms of clean signal, noisy signal, and enhanced signals by six algorithms in the case of M = 4 and factory noise with input SNR value being 5 dB.
subspace algorithm and KEMD-LP in terms of computer running time for the 20 speech sentences under the babble noise environment. From Table 2, we see that the proposed algorithm is faster than MMSE-LSA, the multichannel Wiener algorithm, the multichannel subspace algorithm and KEMD-LP. MMSE-MSS has the fastest speed because of its simple estimator.

In the second experiment, we compare the proposed algorithm with the Wiener algorithm and the subspace algorithm in the cases of the
Table 2
Comparison of six algorithms in terms of computer running time for 20 different speech sentences in the case of M = 4 and babble noise with the input SNR being 5 dB.

Performance index   Proposed algorithm   MMSE-LSA   MMSE-MSS   Wiener   Subspace   KEMD-LP
CPU time (s)        127                  132        21         356      743        315
Fig. 20. SSNR improvement results by three algorithms versus number of microphones in babble noise with input SNR value being 5 dB in the cases of perfect VAD and worst VAD.
Fig. 21. LLR results by three algorithms versus number of microphones in babble noise with input SNR value being 5 dB in the cases of perfect VAD and worst VAD.
perfect VAD and the worst VAD, to explore the impact of VAD. In the worst VAD, every frame except the first is assumed to be speech dominated. Since MMSE-LSA, MMSE-MSS and KEMD-LP are frequency domain approaches and do not need VAD, these three approaches are not included in this comparison. Figs. 20–23 depict the variations of SSNR improvement, LLR, PESQ and STOI for the proposed algorithm, Wiener and Subspace under perfect VAD and worst VAD as the number of microphones increases from 1 to 10 under the babble noise environment with an input SNR of 5 dB. From Figs. 20 to 23, we first see that the proposed algorithm outperforms the Wiener and subspace algorithms in terms of SSNR improvement, LLR, PESQ and STOI, particularly as the number of microphones increases, both with perfect VAD and with worst VAD. Second, we see that the results of the three algorithms with perfect VAD are better than those with worst VAD, which indicates that the performance of the three algorithms depends on the VAD.

4.3. Subjective measure

We use a subjective test based on the MUSHRA standard [39] as the subjective measure. Ten listeners took part in the test. The listeners were required to listen to the enhanced signals, judge their quality in terms of noise reduction, speech distortion, speech intelligibility and auditory comfort, and assign grades to the enhanced signals. The clean speech is used as the reference. The larger the grade, the better the quality of the test signal. The grade ranges from 0 to 100 (0–20 bad, 20–40 poor, 40–60 fair, 60–80 good, 80–100 excellent). To compare the algorithms, we compute the mean and standard deviation of the grades assigned by the ten listeners. Tables 3 and 4 list the results of the proposed algorithm, MMSE-LSA, MMSE-MSS, the multichannel Wiener algorithm, the multichannel subspace algorithm and KEMD-LP in terms of the mean and standard deviation of the subjective measure grade in the
Fig. 22. PESQ results by three algorithms versus number of microphones in babble noise with input SNR value being 5 dB in the cases of perfect VAD and worst VAD.
case of M = 4 under the babble noise environment and the factory noise environment, respectively. From Tables 3 and 4, we can see that the proposed algorithm achieves the highest mean grade of the subjective measure in all cases, which indicates that it outperforms the other five algorithms in terms of the subjective measure. Compared with objective measures such as SSNR, LLR, PESQ and STOI, the subjective measure is a comprehensive evaluation criterion for speech enhancement: it is associated not only with SSNR, LLR, PESQ and STOI, but also with the auditory comfort of the enhanced signals. In general, an algorithm that obtains larger SSNR, PESQ and STOI values and a smaller LLR value obtains a better subjective result in most cases. The proposed algorithm achieves the better values of SSNR, LLR,
Table 3
Comparison of six algorithms in terms of subjective measure under the babble noise environment where M = 4.

Input SNR   Index   Noisy   Proposed   MMSE-LSA   MMSE-MSS   Wiener   Subspace   KEMD-LP
−5 dB       Mean    45.2    62.3       54.1       60.2       48.1     58.4       52.3
            Std     1.33    1.28       1.31       1.42       1.62     1.61       1.34
5 dB        Mean    67.3    81.6       74.3       80.3       70.4     79.9       72.3
            Std     1.43    1.28       1.62       1.31       1.86     1.59       1.96

Table 4
Comparison of six algorithms in terms of subjective measure under the factory noise environment where M = 4.

Input SNR   Index   Noisy   Proposed   MMSE-LSA   MMSE-MSS   Wiener   Subspace   KEMD-LP
−5 dB       Mean    46.5    63.7       56.2       61.3       50.1     58.6       53.4
            Std     1.26    1.30       1.34       1.42       1.52     1.54       1.37
5 dB        Mean    68.8    84.6       73.3       81.3       72.3     80.9       74.3
            Std     1.51    1.25       1.43       1.37       1.55     1.62       1.66
Fig. 23. STOI results by three algorithms versus number of microphones in babble noise with input SNR value being 5 dB in the cases of perfect VAD and worst VAD.
PESQ and STOI in most cases when M = 4, so it obtains the better results of the subjective measure.

5. Conclusion

This paper has proposed a Kalman filtering-based distributed multichannel speech enhancement algorithm for colored noise. Compared with traditional algorithms for distributed multichannel speech enhancement, the proposed algorithm has lower computational complexity and requires fewer computational resources. Simulation results show that the proposed algorithm is superior to several traditional algorithms for distributed multichannel speech enhancement in achieving higher noise reduction, less signal distortion and better speech intelligibility. Moreover, the proposed algorithm is faster than several multichannel speech enhancement algorithms.

References

[1] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust. Speech Signal Process. 27 (2) (1979) 113–120.
[2] P.C. Hansen, S.H. Jensen, FIR filter representations of reduced rank noise reduction, IEEE Trans. Signal Process. 46 (6) (1998) 1737–1741.
[3] Y. Ephraim, H.L.V. Trees, A signal subspace approach for speech enhancement, IEEE Trans. Speech Audio Process. 3 (4) (1995) 251–266.
[4] Y. Hu, P.C. Loizou, A subspace approach for enhancing speech corrupted by colored noise, IEEE Signal Process. Lett. 9 (6) (2002) 204–206.
[5] K. Paliwal, A. Basu, A speech enhancement method based on Kalman filtering, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 12, USA, 1987, pp. 177–180.
[6] J.D. Gibson, B. Koo, S.D. Gray, Filtering of colored noise for speech enhancement and coding, IEEE Trans. Signal Process. 39 (1991) 1732–1742.
[7] M. Gabrea, E. Grivel, A. Najim, A single microphone Kalman filter-based noise canceller, IEEE Signal Process. Lett. 6 (1999) 55–59.
[8] M. Gabrea, An adaptive Kalman filter for the speech enhancement, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, October 16–19, 2005.
[9] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. 32 (1984) 1109–1121.
[10] Y. Ephraim, D. Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Trans. Acoust. Speech Signal Process. 33 (1985) 443–445.
[11] M. Ning, M. Bouchard, R.A. Goubran, Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementations, IEEE Trans. Audio Speech Lang. Process. 14 (1) (2006) 19–32.
[12] W. Bobillet, R. Diversi, E. Grivel, et al., Speech enhancement combining optimal smoothing and errors-in-variables identification of noisy AR processes, IEEE Trans. Signal Process. 55 (2007) 5564–5578.
[13] Y.S. Xia, Fast speech enhancement using a novel noise constrained least square estimation, in: Proceedings of the International Conference on Audio, Language and Image Processing, Shanghai, China, 2012, pp. 980–985.
[14] S. So, K.K. Paliwal, Suppressing the influence of additive noise on the Kalman gain for low residual noise speech enhancement, Speech Commun. 53 (2011) 355–378.
[15] D. Labarre, E. Grivel, M.H. Najim, E. Todini, Two-Kalman filters based instrumental variable techniques for speech enhancement, IEEE Trans. Signal Process. (2004).
[16] K.Y. Lee, S. Jung, Time-domain approach using multiple Kalman filters and EM algorithm to speech enhancement with nonstationary noise, IEEE Trans. Speech Audio Process. 8 (2000) 282–291.
[17] C. Zheng, R. Peng, J. Li, X.D. Li, A constrained MMSE LP residual estimator for speech dereverberation in noisy environments, IEEE Signal Process. Lett. 21 (2014) 1462–1466.
[18] Sunnydayal, K. Kumar, S. Cruces, An iterative posterior NMF method for speech enhancement in the presence of additive Gaussian noise, Neurocomputing 230 (2017) 312–315.
[19] Y.T. Chan, S. Nordholm, F.C.Y. Ka, Speech enhancement strategy for speech recognition microcontroller under noisy environments, Neurocomputing 118 (22) (2013) 279–288.
[20] Y. Zhou, H. Zhao, L. Shang, Immune k-SVD algorithm for dictionary learning in speech denoising, Neurocomputing 137 (5) (2014) 223–233.
[21] L. Dehyadegary, S.A. Seyyedsalehi, I. Nejadgholi, Nonlinear enhancement of noisy speech, using continuous attractor dynamics formed in recurrent neural networks, Neurocomputing 74 (17) (2011) 2716–2724.
[22] H. Cox, R.M. Zeskind, M.M. Owen, Robust adaptive beamforming, IEEE Trans. Acoust. Speech Signal Process. ASSP-35 (10) (1987) 1365–1376.
[23] J. Benesty, J. Chen, Y. Huang, J. Dmochowski, On microphone-array beamforming from a MIMO acoustic signal processing perspective, IEEE Trans. Audio Speech Lang. Process. 15 (3) (2007) 1053–1065.
[24] Y. Huang, J. Benesty, J. Chen, Analysis and comparison of multichannel noise reduction methods in a common framework, IEEE Trans. Audio Speech Lang. Process. 16 (5) (2008) 957–968.
[25] G. Rombouts, M. Moonen, QRD-based unconstrained optimal filtering for acoustic noise reduction, Signal Process. 83 (2003) 1889–1904.
[26] F. Jabloun, B. Champagne, A multi-microphone signal subspace approach for speech enhancement, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1, 2001, pp. 205–208.
[27] S. Doclo, M. Moonen, GSVD-based optimal filtering for single and multi-microphone speech enhancement, IEEE Trans. Signal Process. 50 (9) (2002) 2230–2244.
[28] S. Doclo, M. Moonen, Multimicrophone noise reduction using recursive GSVD-based optimal filtering with ANC postprocessing stage, IEEE Trans. Signal Process. 13 (2005) 53–69.
[29] J. Chen, J. Benesty, Y. Huang, A minimum distortion noise reduction algorithm with multiple microphones, IEEE Trans. Audio Speech Lang. Process. 16 (3) (2008) 481–493.
[30] R.C. Hendriks, R. Heusdens, U. Kjems, J. Jensen, On optimal multichannel mean-squared error estimators for speech enhancement, IEEE Signal Process. Lett. 16 (10) (2009) 885–888.
[31] M.B. Trawicki, M.T. Johnson, Distributed multichannel speech enhancement with minimum mean-square error short-time spectral amplitude, log-spectral amplitude, and spectral phase estimation, Signal Process. 92 (2012) 345–356.
[32] J. Tu, Y.S. Xia, Fast distributed multichannel speech enhancement using novel frequency domain estimators of magnitude-squared spectrum, Speech Commun. 72 (2015) 96–108.
[33] B. Schwartz, S. Gannot, E.A.P. Habets, LPC-based speech dereverberation using Kalman-EM algorithm, in: Proceedings of the International Workshop on Acoustic Signal Enhancement (IWAENC), 2014.
[34] J. Jeong, T.J. Moir, A real-time Kepstrum approach to speech enhancement and noise cancellation, Neurocomputing 71 (13–15) (2008) 2635–2649.
[35] P.C. Loizou, Speech Enhancement: Theory and Practice, CRC, Boca Raton, FL, USA, 2007.
[36] J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Amer. 65 (1979) 943–950.
[37] Y. Hu, P.C. Loizou, Evaluation of objective quality measures for speech enhancement, IEEE Trans. Audio Speech Lang. Process. 16 (1) (2008) 229–238.
[38] C.H. Taal, R.C. Hendriks, R. Heusdens, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Trans. Audio Speech Lang. Process. 19 (7) (2011) 2125–2136.
[39] G. Stoll, F. Kozamernik, EBU listening tests on internet audio codecs, EBU Tech. Rev. 283 (2000) 20–33.
Jingxian Tu was born in Heyuan, Guangdong province, China. He received the Master's and Ph.D. degrees in Applied Mathematics from Guangdong University of Technology and Fuzhou University in 2012 and 2016, respectively. His research interests include speech signal processing and image signal processing. He has published several papers in the area of speech signal processing.
Youshen Xia received the Ph.D. degree in Automation and Computer-Aided Engineering from the Chinese University of Hong Kong, China, in 2000. He has held various faculty/research/visiting positions at Nanjing Post University, China, the Chinese University of Hong Kong, the City University of Hong Kong, Calgary University, Canada, the University of Waterloo, Canada, and the University of Western Australia, Australia. He is a professor in the College of Mathematics and Computer Science, Fuzhou University, Fuzhou, China. His present research interests include the design and analysis of neural dynamical optimization approaches, and system blind identification with applications to signal and image processing.