Applied Acoustics 156 (2019) 101–112
Mask estimation incorporating phase-sensitive information for speech enhancement

Xianyun Wang, Changchun Bao*

Speech and Audio Signal Processing Laboratory, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Article history: Received 16 January 2019; Received in revised form 3 June 2019; Accepted 2 July 2019

Keywords: Monaural speech enhancement; Phase-sensitive; Mask estimation; MAP; Deep neural network
Abstract

For deep neural network (DNN)-based methods, time-frequency (T-F) masks are commonly used as the training target. However, most of them do not consider phase information, while recent studies have revealed that incorporating phase information into the T-F mask can effectively improve the quality of the enhanced speech. In this paper, we present two techniques for obtaining a T-F mask that considers phase information. In the first technique, the spectral structures of two phase differences (PDs), namely the PD between clean and noisy speech and the PD between noise and noisy speech, are first discussed. Then, considering the specific characteristics of the two PDs, a parametric ideal ratio mask (IRM) whose parameters are controlled by the cosines of the two aforementioned PDs is proposed, termed the bounded IRM with phase constraint (BIRMP). In the second technique, an optimal estimator based on the generalized maximum a posteriori (GMAP) probability of the complex speech spectrum is proposed, defined as the optimal GMAP estimation of the complex spectrum (OGMAPC). The OGMAPC estimator can dynamically adjust the scale of the prior information of the spectral magnitude and phase. Considering that the speech phase is difficult to predict in the DNN-based method, the second technique exploits the spectral magnitude part of the OGMAPC estimator to calculate an optimal magnitude mask with phase information, whose ideal value is used for DNN training. The experiments show that the proposed methods outperform the reference methods.

© 2019 Elsevier Ltd. All rights reserved.
1. Introduction

In a natural environment, a target sound, such as speech, is usually deteriorated by noise. Thus, in order to reduce the noise in noisy speech, a noise reduction module needs to be introduced. Single-channel noise reduction has been investigated over several decades and many classical algorithms have been proposed, such as spectral subtraction [1], Wiener filtering [2] and statistical-model-based methods [3,4]. Since these algorithms usually assume stationary background noise, good performance can only be achieved in stationary noise environments; for non-stationary noise, their performance becomes unsatisfactory. Thus, in order to improve the ability to deal with non-stationary noise, supervised learning approaches, which involve pre-training on prior information, have been intensively investigated. Generally, these supervised
methods can exploit prior information about speech and noise to develop better tracking ability for non-stationary noise. For supervised speech separation/enhancement, three key factors need to be considered: the learning machine, the acoustic features and the training target. Several learning machines have been investigated in the literature, for example, the hidden Markov model (HMM) [5,6], the auto-regressive (AR) model [7-9], the Gaussian mixture model (GMM) [10-13], the support vector machine (SVM) [14,15] and the DNN [16-36]. Generally, the DNN family includes a variety of learning machines, such as the deep denoising autoencoder (DDAE) [16,17], the feedforward multilayer perceptron (MLP) [18-25], the convolutional neural network (CNN) [26-28], the recurrent neural network (RNN) [29-32] and the generative adversarial network (GAN) [33-36]. For the DNN-based methods, discriminative features are very important. In [15,20,21], extensive features were studied and shown to be very helpful for improving speech quality in noisy and reverberant conditions. In addition, a good training target should not be too hard for the current learning machine to learn, while at the same time having good potential to improve speech restoration. In the DNN-based methods, the training targets are
often classified into two categories: training targets based on spectral mapping [17-19] and training targets based on the time-frequency (T-F) mask [20-25]. In the methods based on spectral mapping, the power spectrum [17] or log-power spectrum [18,19] of clean speech is usually used as the training target. In the methods based on the T-F mask, the ideal binary mask (IBM) and the IRM are commonly used as the training targets. The IBM was the first to be successfully applied in the supervised technique [13]. The idea of the binary mask originates from the masking phenomenon of the human auditory system, whereby one source cannot be perceived in the presence of a much louder interference. This perceptual effect has been utilized by computational auditory scene analysis (CASA), which employs the IBM as its computational goal. However, since the IBM is a two-dimensional 0-1 matrix over time and frequency, it is sensitive to estimation error and violates the smooth evolution of speech once its estimation accuracy is not guaranteed. In order to address this problem, the IRM, which can be thought of as a smoothed form of the IBM, is used instead. To further improve the performance of speech enhancement, some studies consider phase information in the T-F mask, such as the complex ideal ratio mask (cIRM) [24,25] and the phase-sensitive mask (PSM) [31]. Although these masks incorporating phase information can restore speech better, they need to be compressed because they are unbounded, which is possibly detrimental to supervised approaches based on gradient descent [20].

In this paper, in order to improve speech quality, two T-F masks used for training the DNN are proposed by incorporating phase information. In the first T-F mask, a parametric ratio mask whose parameters are controlled by two phase differences (PDs) is proposed as the training target of the DNN. Here, the first PD, $\varphi_1$, is defined between noisy speech and clean speech, and the second PD, $\varphi_2$, is defined between noisy speech and noise. For the high-SNR situation, we found that the phase $\varphi_y$ of noisy speech is nearly the same as the phase $\varphi_s$ of clean speech, so that $\cos\varphi_1 = \cos(\varphi_y-\varphi_s) \approx 1$. In contrast, for the low-SNR situation, $\varphi_1$ tends to be random. For $\varphi_2$, its absolute value is close to 0 in the low-SNR case, so that $\cos\varphi_2 = \cos(\varphi_y-\varphi_n) \approx 1$; similar to $\varphi_1$, when the SNR grows to a high level, $\varphi_2$ tends to be random as well. Because $\varphi_1$ and $\varphi_2$ show obvious differences between high- and low-SNR conditions, specific structures may be present in their spectra. Based on the study [25], information with a specific structure is more conducive to obtaining a better training target for a learning-based approach. Therefore, $\cos\varphi_1$ and $\cos\varphi_2$ are viewed as valuable parameters for generating a ratio mask with phase information in this paper. However, for the complex spectra of speech and noise under the discrete Fourier transform (DFT), the addition of their imaginary and real parts may produce a resultant spectral magnitude that is smaller than the spectral magnitude of speech (or of noise) [37,38], which corresponds to the case $\cos\varphi_1 < 0$ or $\cos\varphi_2 < 0$. In these cases, a filter based on magnitude attenuation is not appropriate for speech restoration. Thus, $\cos\varphi_1$ and $\cos\varphi_2$ should be restricted by a zero threshold, so that a bounded ratio mask with constrained PD information is proposed.
In the second T-F mask, considering that a super-Gaussian model better reflects the statistical distribution of the speech signal, a statistical-model-based speech restoration framework is used to generate the T-F mask under the super-Gaussian speech model. Based on this, a generalized maximum a posteriori (GMAP) estimator of the complex spectrum is proposed to obtain the optimal magnitude and phase under the super-Gaussian speech model. In our work, we attempted to use their ideal values as the training targets of the DNN. Unfortunately, the study in [25] has shown that the speech phase is hard to predict successfully with a DNN because its spectrum lacks a specific structure. Thus, the speech phase under the super-Gaussian speech model is not
considered for training the DNN, and a T-F mask generated only from the magnitude under the super-Gaussian speech model is used as the training target. In addition, as acoustic features of the DNN-based method, this paper uses some features with harmonic preservation at the input of the DNN model and views them as additional information to reinforce the predictive accuracy of the proposed two T-F masks.

The rest of this paper is organized as follows. In Section 2, the details of the two proposed techniques are described. In Section 3, we present the simulation examples, and we conclude the paper in Section 4.

2. The proposed method

The proposed speech enhancement algorithm provides two techniques to generate the T-F mask incorporating phase information: a bounded IRM with phase parameterization (BIRMP) and a T-F mask generated from the magnitude under the super-Gaussian speech model.

2.1. The framework of the proposed speech enhancement system

The block diagram of the proposed system is given in Fig. 1. As shown in Fig. 1, two time-frequency (T-F) masks are predicted by the DNN for speech restoration. The proposed system includes two stages: a training stage and an enhancement stage. In the training stage, a set of robust parameters [23] (i.e., amplitude modulation spectrogram (AMS), relative spectral transform and perceptual linear prediction (RASTA-PLP), Mel-frequency cepstral coefficients (MFCC) and Gammatone filterbank power spectra (GF)) are extracted from the noisy speech training sets and used as input features of the DNN. Moreover, the proposed T-F mask is extracted from the clean speech and noise data sets and used as the training target of the DNN.
[Fig. 1. The block diagram of the proposed method. Training stage: input features are extracted from the noisy speech training sets, and DFT-based and Gammatone-based T-F masks derived from the clean speech/noise data sets are used for DNN training. Enhancement stage: input features extracted from noisy speech are fed to the trained DNN to estimate the T-F mask for enhancing speech.]
For the input features of the DNN, this paper introduces features with harmonic preservation, which are generated from an artificial noisy signal. Thus, in this paper, the input features used in [23] are extracted in two ways: one directly from the noisy speech, the other from an artificial noisy signal with harmonic preservation, where the artificial noisy signal is obtained by applying a maximum function [39] to the noisy speech. For the training target, considering the finding in [40] that the cochleagram is more separable than the spectrogram, the two T-F masks are first obtained in the discrete Fourier transform (DFT) domain in our work. Then, they are transformed into the Gammatone domain [40] and eventually used as the desired output of the DNN. The DNN consists of three hidden layers (each with 1024 nodes) with sigmoid activation functions. The backpropagation algorithm with dropout regularization (dropout rate 0.2) is applied to train the network. Adaptive gradient descent along with a momentum term is used as the optimization technique. The momentum rate is 0.5 for the first 5 epochs and 0.9 for the remaining epochs. The learning rate is 0.08 for the first 10 epochs and decreases by 10% after every subsequent epoch. The total number of epochs is 50. Furthermore, the batch size is set to 128. The mean squared error is used as the cost function. The number of output units corresponds to the dimensionality of the training target. The sigmoid activation function is used at the output layer when the learning target lies in the range [0, 1]; otherwise, a linear activation function is used. In the enhancement stage, given the features of noisy speech, the trained DNN can be used to directly predict the proposed two T-F masks for enhancing noisy speech.
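As a concrete illustration of this configuration, the following sketch builds and trains such a network. It is our illustrative reconstruction in PyTorch, not the authors' code; the feature dimensionality, target dimensionality and data loader are hypothetical placeholders.

```python
# Minimal PyTorch sketch of the DNN described above (our reconstruction).
# Three 1024-unit sigmoid hidden layers with dropout 0.2, MSE loss, and the
# stated momentum and learning-rate schedules.
import torch
import torch.nn as nn

def build_mask_dnn(in_dim, out_dim, bounded_target=True):
    # Sigmoid output for targets in [0, 1]; linear output otherwise.
    out_act = nn.Sigmoid() if bounded_target else nn.Identity()
    return nn.Sequential(
        nn.Linear(in_dim, 1024), nn.Sigmoid(), nn.Dropout(0.2),
        nn.Linear(1024, 1024), nn.Sigmoid(), nn.Dropout(0.2),
        nn.Linear(1024, 1024), nn.Sigmoid(), nn.Dropout(0.2),
        nn.Linear(1024, out_dim), out_act,
    )

def train(model, loader, epochs=50):
    # `loader` is assumed to yield (features, target_mask) batches of size 128.
    criterion = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.08, momentum=0.5)
    for epoch in range(epochs):
        for g in opt.param_groups:
            g["momentum"] = 0.5 if epoch < 5 else 0.9  # momentum schedule
            if epoch >= 10:
                g["lr"] *= 0.9                         # 10% decay per epoch
        for features, target_mask in loader:
            opt.zero_grad()
            criterion(model(features), target_mask).backward()
            opt.step()
```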
2.2. The technique based on the BIRMP

2.2.1. The spectral structure of two PDs

For DNN-based speech separation/enhancement, the different source speech signals are usually converted into the T-F domain to extract acoustic parameters (e.g., log-power spectra [18,19] and Gammatone filterbank power spectra [21-23]) as input features of the DNN model. The T-F mask is usually used as the training target. Since the phase information of clean speech does not show a specific structure, it is not easy to directly estimate or train the phase of clean speech, so the enhanced signal in the T-F domain is usually synthesized by combining the spectral magnitude of the enhanced speech with the phase of the noisy speech. Fig. 2 gives the spectrograms of the speech magnitude, the phase, the PD $\varphi_1$ and the PD $\varphi_2$ under high- and low-SNR conditions. From Fig. 2, we can see that the phase of speech does not show an obvious specific structure like the magnitude. Fig. 2(c) and Fig. 2(d) show the spectra of the phase difference $|\varphi_1|$ at -5 dB and 10 dB SNR levels, respectively. Moreover, Fig. 2(e) and Fig. 2(f) give the spectra of $|\varphi_2|$ at -5 dB and 10 dB SNR levels, respectively. Different from the phase spectrum, the two PDs provide a spectrum similar to the magnitude structure at higher SNR levels (Fig. 2(d) and Fig. 2(f)). Moreover, the high-energy regions of speech are reflected to a certain extent at relatively low SNR levels (Fig. 2(c) and Fig. 2(e)). These magnitude-like characteristics of the PDs may help to improve the perceptual quality of the output speech when the above two PDs are used in the T-F mask. In the spectra of the two PDs, the blue regions represent a very small absolute value of the phase difference, which means that the noisy phase is similar to the speech or noise phase in the blue regions.

Fig. 2. Spectrograms of speech magnitude (a), phase (b) and the PDs. PD $\varphi_1$ between clean and noisy speech (mixed with babble noise) at -5 dB (c) and 10 dB (d) SNR levels. PD $\varphi_2$ between noise and noisy speech (mixed with babble noise) at -5 dB (e) and 10 dB (f) SNR levels.

In order to briefly explain why the phase $\varphi_s$ of clean speech is approximately equal to the phase $\varphi_y$ of noisy speech at a relatively high SNR level, the noisy phase in the DFT domain is defined as $\arctan(Y_i/Y_r)$, where $Y_i$ and $Y_r$ are the imaginary and real parts of the complex spectrum of noisy speech, respectively, and $\arctan(\cdot)$ is the arctangent function. So, the noisy phase is simply expressed as follows:

$$\varphi_y = \arctan\frac{Y_i}{Y_r} = \arctan\frac{S_i + N_i}{S_r + N_r} \tag{1}$$

where $S_i$ and $S_r$ (or $N_i$ and $N_r$) are the imaginary and real parts of speech (or noise), respectively. When the SNR level is high enough, that is, when $|S_i| \gg |N_i|$ and $|S_r| \gg |N_r|$, the noisy phase can be expressed approximately as
$$\varphi_y = \arctan\frac{Y_i}{Y_r} \approx \arctan\frac{S_i}{S_r} \tag{2}$$
As a result, there is a very small difference between the noisy phase and the clean phase at higher SNR levels, so the absolute difference $|\varphi_1| = |\varphi_y - \varphi_s|$ between the noisy phase $\varphi_y$ and the clean phase $\varphi_s$ approaches 0, and thus $\cos\varphi_1 = \cos(\varphi_y - \varphi_s) \approx 1$. In contrast, when the SNR level is low enough, the noisy phase can be expressed as
$$\varphi_y = \arctan\frac{Y_i}{Y_r} \approx \arctan\frac{N_i}{N_r} \tag{3}$$
Obviously, the noise phase $\varphi_n$ is approximately the same as the noisy phase $\varphi_y$ at lower SNR levels, so that $|\varphi_2| = |\varphi_y - \varphi_n| \approx 0$, i.e., $\cos\varphi_2 = \cos(\varphi_y - \varphi_n) \approx 1$. However, at higher SNR levels, $\varphi_2$ tends to be random due to the uncertainty of its value, as can be seen from Fig. 2(e) and Fig. 2(f). Similarly, at low SNR levels, $|\varphi_1|$ tends to be random, as can be seen from Fig. 2(c) and Fig. 2(d).

2.2.2. The BIRMP

According to the above analysis, $\varphi_1$ and $\varphi_2$ show obvious differences between high- and low-SNR conditions, so that specific structures are present in the spectra of $|\varphi_1|$ and $|\varphi_2|$. Since information with a specific structure is more conducive to obtaining a better training target for a learning-based approach [25], it may be very useful to incorporate the above two PDs into the T-F mask. Based on this, the two aforementioned PDs are viewed as valuable parameters for generating the BIRMP for speech restoration. In this paper, the BIRMP is generated by following the idea of introducing parameters into the Wiener filter in [41]. Thus, a parametric T-F mask whose parameters are controlled by $\cos\varphi_1$ and $\cos\varphi_2$ is generated as a training target of the DNN,
$$\mathrm{BIRMP}_{m,k} = \frac{S_{m,k}^2\cos\varphi_1}{S_{m,k}^2\cos\varphi_1 + N_{m,k}^2\cos\varphi_2} \tag{4}$$
where $S_{m,k}$ and $N_{m,k}$ are the spectral magnitudes of clean speech and noise, respectively, and $k$ and $m$ are the frequency and frame indices. However, when $\cos\varphi_1 < 0$, the sum of the amplitudes of noisy speech and clean speech is less than that of noise, by the triangle relation. In this case, the noise component is dominant and the BIRMP is expected to equal zero. Moreover, if $\cos\varphi_2 < 0$, the sum of the amplitudes of noisy speech and noise is less than that of clean speech. In this case, the speech component is dominant and the BIRMP is expected to equal one. Thus, $\cos\varphi_1$ and $\cos\varphi_2$ need to be restricted by a zero threshold. In addition, considering the constructive and destructive noise types pointed out in [37] and [38], once $\cos\varphi_1 < 0$ or $\cos\varphi_2 < 0$ occurs, the magnitude of noisy speech is smaller than that of noise or speech. In this case, a BIRMP based on magnitude attenuation is inappropriate, so the above two PDs also need to be restricted. In order to show that the zero-constraint on the phase difference does not cause excessive loss of phase difference information, Fig. 3 and Fig. 4 compare the spectra of the PDs with and without the constraint for babble and factory noise at different input SNRs. From Fig. 3 and Fig. 4, we can see a similar spectral structure of the PD with and without the constraint for different types of noise at different SNR levels, which implies that the hypothesis is feasible to a certain extent. Thus, in this paper, the modified BIRMP is given by
$$\mathrm{BIRMP}_{m,k} = \frac{S_{m,k}^2\max(\cos\varphi_1,\,0)}{S_{m,k}^2\max(\cos\varphi_1,\,0) + N_{m,k}^2\max(\cos\varphi_2,\,0)} \tag{5}$$

where $\max(\cdot)$ denotes the maximizing operation. Obviously, when $\cos\varphi_1 < 0$, the BIRMP equals 0, and when $\cos\varphi_2 < 0$, the BIRMP equals 1.
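During training, the ideal BIRMP can be computed directly from the clean, noise and noisy complex spectra. The following NumPy sketch is our illustration of Eq. (5); the function name, variable names and the small eps guard are ours, not from the paper.

```python
# Minimal sketch of Eq. (5) (our illustration). S, N and Y are complex STFTs
# of clean speech, noise and noisy speech with identical shapes.
import numpy as np

def birmp(S, N, Y, eps=1e-12):
    cos_phi1 = np.maximum(np.cos(np.angle(Y) - np.angle(S)), 0.0)  # max(cos phi1, 0)
    cos_phi2 = np.maximum(np.cos(np.angle(Y) - np.angle(N)), 0.0)  # max(cos phi2, 0)
    num = np.abs(S) ** 2 * cos_phi1
    den = num + np.abs(N) ** 2 * cos_phi2 + eps
    return num / den   # 0 when cos(phi1) < 0; 1 when cos(phi2) < 0
```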
2.3. The technique based on the OGMAPC

In this section, considering that a super-Gaussian model better reflects the statistical distribution of the speech signal, we attempt to use a statistical-model-based speech restoration framework to generate a T-F mask under the super-Gaussian speech model. In order to obtain the T-F mask, a generalized maximum a posteriori (GMAP) estimator of the complex spectrum is first proposed to obtain the optimal magnitude and phase. The optimal GMAP estimator of the complex spectrum (OGMAPC) is defined as follows,
$$\{\hat S_{m,k},\,\hat\varphi_s\} = \operatorname*{argmax}_{S_{m,k},\,\varphi_s} J_{\mathrm{GMAP}}(S_{m,k},\varphi_s) = \operatorname*{argmax}_{S_{m,k},\,\varphi_s} \frac{p\big(Y(m,k)\,|\,S_{m,k},\varphi_s\big)\,\big(p(S_{m,k})\big)^{\alpha}\big(p(\varphi_s)\big)^{\beta}}{p\big(Y(m,k)\big)} \tag{6}$$
where $J_{\mathrm{GMAP}}(S_{m,k},\varphi_s)$ is the cost function of the generalized maximum a posteriori probability of the spectral magnitude and phase, $\varphi_s$ is the speech phase at the $k$th frequency bin and the $m$th frame, and $Y(m,k)$ is the complex spectrum of noisy speech at the $k$th frequency bin of the $m$th frame. The magnitude and phase of speech are assumed to be uncorrelated. For convenience, the index symbols $(m,k)$ are omitted below. The purpose of introducing the parameters $\alpha$ and $\beta$ is to produce a generalized maximum a posteriori (GMAP) estimator of the complex speech spectrum for estimating an appropriate magnitude and phase of clean speech. Namely, the parameters $\alpha$ and $\beta$ are used to adjust the scale of the magnitude distribution and the phase distribution, respectively. Obviously, when $\alpha$ and $\beta$ both equal zero, the GMAP estimator turns into a maximum likelihood estimator of the spectral magnitude of speech; when they both equal one, the GMAP estimator degenerates into a maximum a posteriori estimator of the complex speech spectrum. The phase probability is usually considered to be a uniform distribution, which has the largest entropy. However, the study in [42] has pointed out that the harmonic phase obeys a von Mises distribution with relatively small entropy. This means that the entropy and uncertainty of the phase probability become small at higher SNR levels. Thus, the uniform distribution is more suitable for the phase at lower SNR levels, while the von Mises distribution is appropriate at higher SNR levels. So, in order to use $\beta$ to dynamically adjust the phase probability, $(p(\varphi_s))^{\beta}$ is given by
$$\big(p(\varphi_s)\big)^{\beta} = \left[\frac{1}{2\pi}\,\frac{\exp\big(\kappa\cos(\varphi_s - \varphi_c)\big)}{I_0(\kappa)}\right]^{\beta} \tag{7}$$
where $\varphi_c$ is the circular mean and $\kappa$ is the concentration parameter; $I_0(\cdot)$ is the zeroth-order modified Bessel function of the first kind. Obviously, when $\beta$ is zero, the clean speech phase is uniformly distributed; when $\beta = 1$ is selected, a von Mises prior is imposed on the clean speech phase. For the spectral magnitude of speech, a Gamma prior density function [38] is given as follows:
$$p(S_{m,k}) = \frac{\mu^{\nu+1}\,S_{m,k}^{\nu}}{\Gamma(\nu+1)\,\sigma_s^{\nu+1}}\exp\left(-\mu\,\frac{S_{m,k}}{\sigma_s}\right) \tag{8}$$
where $\sigma_s^2$ is the variance of the speech spectral magnitude, $\Gamma(\cdot)$ is the Gamma function, and $\mu$ and $\nu$ are the shape parameters of the spectral magnitude distribution. When the noise follows a complex Gaussian distribution, $p(Y\,|\,S_{m,k},\varphi_s)$ is expressed as follows [43]:
Fig. 3. PD comparison with and without constraint for babble noise at -5, 0, 5 and 10 dB SNR levels. (a) and (b) correspond to the constrained case; (c) and (d) correspond to the unconstrained case in (1)-(4).
$$p(Y\,|\,S_{m,k},\varphi_s) = \frac{1}{\pi\sigma_n^2}\exp\left(-\frac{\big|Y - S_{m,k}\exp(j\varphi_s)\big|^2}{\sigma_n^2}\right) \tag{9}$$

where $\sigma_n^2$ is the variance of the noise spectrum. Finally, by setting the partial derivatives of the logarithm of Eq. (6) with respect to $\varphi_s$ and $S_{m,k}$ to zero, the optimal GMAP estimates of the phase and magnitude of speech are given by:

$$\hat\varphi_s = \tan^{-1}\left(\frac{\sin\varphi_y + \beta\kappa\,\dfrac{\sigma_n^2}{2Y_{m,k}S_{m,k}}\,\sin\varphi_c}{\cos\varphi_y + \beta\kappa\,\dfrac{\sigma_n^2}{2Y_{m,k}S_{m,k}}\,\cos\varphi_c}\right) \tag{10}$$

and

$$\hat S_{m,k} = \left(A + \sqrt{A^2 + \frac{\nu\alpha}{2\gamma}}\right)|Y| \tag{11}$$

where

$$A = \frac{\cos\varphi_1}{2} - \frac{\mu\alpha}{4\sqrt{\xi\gamma}} \tag{12}$$

Here, $\xi = \sigma_s^2/\sigma_n^2$ and $\gamma = |Y|^2/\sigma_n^2$ are the a priori SNR and the a posteriori SNR, respectively. The expressions in Eqs. (10) and (11) are
viewed as the phase and magnitude based on the super-Gaussian speech model. In this paper, their ideal values are used as the training targets of the DNN. In fact, we planned to use the DNN to predict the phase and magnitude of clean speech. However, for the speech magnitude, the study in [31] has shown that using a DNN to predict the magnitude directly does not achieve better performance than the PSM [31], which implies that there is still much room for improvement in using the DNN to estimate the speech magnitude. Thus, in this section, we attempt to use the DNN to predict a variant of the magnitude that considers phase-sensitive information. In order to obtain this variant, the super-Gaussian model of speech is used to produce the GMAP estimator of the complex speech spectrum. Ultimately, not only the magnitude variant (termed the magnitude based on the super-Gaussian speech model, i.e., Eq. (11)) but also the phase variant (termed the phase based on the super-Gaussian speech model, i.e., Eq. (10)) is obtained. Since the study in [25] has shown that the clean speech phase is hard to predict successfully with a DNN because its spectrum lacks a specific structure, the phase based on the super-Gaussian speech model is not considered in the following DNN training.
Fig. 4. PD comparison with and without constraint for factory noise at -5, 0, 5 and 10 dB SNR levels. (a) and (b) correspond to the constrained case; (c) and (d) correspond to the unconstrained case in (1)-(4).
Since the ideal value of the magnitude variant is used as the training target of the DNN, the ideal values of the parameters $\xi$, $\gamma$ and $\cos\varphi_1$ in Eq. (12) are used in the training stage. In order to train the spectral magnitude based on the super-Gaussian speech model, $A + \sqrt{A^2 + \nu\alpha/(2\gamma)}$ is used as the training target and is termed the optimal GMAP-based mask (OGAM), i.e.,
$$\mathrm{OGAM} = A + \sqrt{A^2 + \frac{\nu\alpha}{2\gamma}} \tag{13}$$
When $\alpha$ is set to 1, the OGAM degenerates into a MAP mask based on the super-Gaussian speech model (named OGAMMAP); when $\alpha$ is set to 0, the OGAM turns into an ML mask based on the super-Gaussian speech model (named OGAMML). Regarding the purpose of introducing the scale factor $\alpha$, the study in [44] has shown that magnitude estimation based on the MAP may provide good noise reduction at lower SNR levels but possibly produces more speech distortion due to over-attenuation at higher SNR levels. Moreover, magnitude estimation based on the ML may maintain speech energy at higher SNR levels but obtains only limited noise attenuation at lower SNR levels. Thus, the scale factor $\alpha$ is
incorporated to find an effective compromise between the OGAMMAP and the OGAMML, so that the OGAM can better adapt to the difference between higher and lower SNR levels. The OGAMML can be represented as a constrained cosine function of $\varphi_1$:
$$\mathrm{OGAM_{ML}} = \frac{\cos\varphi_1}{2} + \left|\frac{\cos\varphi_1}{2}\right| = \max(\cos\varphi_1,\,0) \tag{14}$$
From the perspective of mask combination, the OGAM can obtain effective noise suppression at lower SNR levels and maintain speech energy at higher SNR levels. This may also partly explain the capability of the PSM to achieve better speech restoration [20]. In the OGAM, a sigmoid function is used to determine the scale factor $\alpha$ as follows:
$$\alpha = \frac{\alpha_{\max}}{1 + \exp\big(b_1(\xi - b_2)\big)} \tag{15}$$
where $\alpha_{\max}$ is set to 1, and $b_1$ and $b_2$ are set to 0.1 and 0.01, respectively. Although these parameters are empirically determined, some reasoning is also involved. That is, $\alpha_{\max}$ is the upper bound of $\alpha$, which is usually set to 1 because of its range of [0, 1]. The parameter $b_2$ is the sigmoid center, which is set approximately to 0. However, considering that it is more appropriate for the energy threshold
between speech and noise to be positive [45], $b_2$ is set to a small positive number in this work. The parameter $b_1$ controls the slope of the sigmoid. If $b_1$ is large, the sigmoid becomes very steep, so that $\alpha$ transitions quickly over the range [0, 1]. This may give the OGAM a disadvantage similar to the problem of the IBM, namely that the IBM is sensitive to estimation error and violates the smooth evolution of speech. Thus, a smaller value is chosen so that $\alpha$ transitions smoothly in this work.
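Putting Eqs. (11)-(13) and Eq. (15) together, the ideal OGAM used as the training target could be computed as sketched below. This is our reconstruction under the sign conventions adopted above for Eqs. (12) and (15); the shape parameters follow [43], and the function and variable names are ours.

```python
# Minimal sketch of the ideal OGAM, Eqs. (12), (13) and (15) (our
# reconstruction). S and Y are clean and noisy complex STFTs; xi and gamma
# are the a priori and a posteriori SNRs per T-F unit.
import numpy as np

MU, NU = 1.6248, 0.2              # Gamma-prior shape parameters [43]
ALPHA_MAX, B1, B2 = 1.0, 0.1, 0.01

def ogam(S, Y, xi, gamma, eps=1e-12):
    cos_phi1 = np.cos(np.angle(Y) - np.angle(S))
    alpha = ALPHA_MAX / (1.0 + np.exp(B1 * (xi - B2)))                   # Eq. (15)
    A = cos_phi1 / 2.0 - MU * alpha / (4.0 * np.sqrt(xi * gamma) + eps)  # Eq. (12)
    return A + np.sqrt(A ** 2 + NU * alpha / (2.0 * gamma + eps))        # Eq. (13)
```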
2.4. The acoustic features with harmonic preservation

In addition to the proposed two T-F masks, this paper also uses some acoustic features with harmonic preservation at the input of the DNN model. In our work, a fixed set of features [23] is extracted in two ways. One is to obtain them directly from the noisy speech; the other is to obtain them from an artificial noisy signal with harmonic preservation. The artificial noisy signal is obtained by applying a maximum function [39] to the noisy time signal. For the input feature extraction module in Fig. 1, Fig. 5 shows the details of the feature extraction, in which the artificial noisy signal is obtained as follows:

$$y_{ar}(n) = \max\big(y(n),\,0\big) = y(n)\,\mathrm{sign}\big(y(n)\big) \tag{16}$$
where $\max(\cdot)$ is the maximum function and $\mathrm{sign}(y(n))$ is defined by:
$$\mathrm{sign}\big(y(n)\big) = \begin{cases} 1, & y(n) \geq 0 \\ 0, & y(n) < 0 \end{cases} \tag{17}$$
According to Eq. (17), when the current frame is voiced speech with period $T$, $\mathrm{sign}(y(n))$ is a square wave with period $T$. In order to analyze the harmonic regeneration, the Fourier transform (FT) of the square waveform $\mathrm{sign}(y(n))$ is given by:
$$\mathrm{FT}\big(\mathrm{sign}(y(n))\big) = \frac{1}{T}\sum_{l=-\infty}^{+\infty}\mathrm{Sa}\left(\frac{l}{T}\right)\delta\left(f - \frac{l}{T}\right) \tag{18}$$
where $f$ denotes the frequency index, $\mathrm{Sa}(l/T)$ is the FT of the elementary square waveform taken at the discrete frequency $l/T$, and $\delta(\cdot)$ is the Dirac delta function. Thus, the FT of $y_{ar}(n)$ can be written as:
$$\mathrm{FT}\big(y_{ar}(n)\big) = \mathrm{FT}\big(\mathrm{sign}(y(n))\big) * \mathrm{FT}\big(y(n)\big) = \frac{1}{T}\sum_{l=-\infty}^{+\infty}\mathrm{Sa}\left(\frac{l}{T}\right)F_y\left(f - \frac{l}{T}\right) \tag{19}$$
where $F_y(\cdot)$ is the FT of $y(n)$ and the notation '$*$' denotes the convolution operation. For voiced speech, $\mathrm{sign}(y(n))$ is a periodic signal whose fundamental frequency coincides with that of the voiced speech. In this case, the spectrum of $\mathrm{sign}(y(n))$ becomes a harmonic comb [39], which explains the harmonic regeneration phenomenon of the artificial signal $y_{ar}(n)$. Although speech components are covered by noise in noisy speech, many harmonic structures in the low-frequency region still exist, which helps to extract an effective artificial noisy signal for voiced speech. For unvoiced speech, the study in [39] has shown that the unvoiced spectrum is not degraded by the harmonic regeneration process since periodic characteristics are not involved.
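The half-wave rectification of Eq. (16) is a one-line operation; the toy check below (our example, not from the paper) illustrates the harmonic regeneration: rectifying a noisy 100 Hz tone produces spectral peaks at DC, 100 Hz and the even harmonics, i.e., energy at multiples of the fundamental.

```python
# Sketch of Eqs. (16)-(17) with a toy harmonic-regeneration check (ours).
import numpy as np

def artificial_noisy_signal(y):
    # y_ar(n) = max(y(n), 0) = y(n) * sign(y(n)) with sign() as in Eq. (17)
    return np.maximum(y, 0.0)

fs = 8000
n = np.arange(fs)
y = np.sin(2 * np.pi * 100 * n / fs) + 0.3 * np.random.randn(fs)
spec = np.abs(np.fft.rfft(artificial_noisy_signal(y)))  # 1 Hz per bin
# spec now peaks at bins 0, 100 and 200 (DC, fundamental, second harmonic),
# whereas the unrectified tone only has a peak at bin 100.
```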
Fig. 5. Details of the feature extraction used in Fig. 1: the input features are extracted both from the noisy signal $y(n)$ and from the artificial noisy speech $\max(y(n),\,0)$.

Since the artificial noisy signal has the ability to preserve harmonics, the features derived from it can improve the predictive potential of the T-F mask. Therefore, in the proposed DNN-based technique, in addition to the features obtained in the first way, the features from the artificial noisy signal are also considered as additional information to improve the predictive accuracy of the training target.

According to the study in [40], the cochleagram is more separable than the spectrogram, so the BIRMP and OGAM are transformed from the DFT domain to the Gammatone auditory domain by the equation from [46]:

$$\mathrm{BIRMP}_{m,c} = \frac{1}{\sum_{k=1}^{K} G_{c,k}^2}\sum_{k=1}^{K}\mathrm{BIRMP}_{m,k}\,G_{c,k}^2 \tag{20}$$

and

$$\mathrm{OGAM}_{m,c} = \frac{1}{\sum_{k=1}^{K} G_{c,k}^2}\sum_{k=1}^{K}\mathrm{OGAM}_{m,k}\,G_{c,k}^2 \tag{21}$$
where $G_{c,k}$ is the frequency response of filter channel $c$ and $K$ is the number of DFT bins used for the T-F analysis. In the DNN training of our work, the transformed BIRMP and OGAM are used as the training targets instead of the BIRMP and OGAM in the DFT domain.
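A compact sketch of the domain conversion in Eqs. (20) and (21) is given below. The Gammatone magnitude-response matrix G is assumed to be precomputed (C channels by K DFT bins); building the filterbank itself is outside the scope of this sketch, and the function name is ours.

```python
# Minimal sketch of Eqs. (20)-(21) (our illustration).
import numpy as np

def dft_mask_to_gammatone(mask_dft, G):
    # mask_dft: (frames, K) DFT-domain mask; G: (C, K) channel responses.
    W = G ** 2                              # squared frequency responses
    return mask_dft @ W.T / W.sum(axis=1)   # (frames, C), Eq. (20)/(21)
```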
3. Experimental results

In this section, we give some experiments to evaluate the performance of the proposed scheme. In the experiments, two hours of utterances, approximately 1150 utterances chosen from the TIMIT training set [48], are used for the training stage. In the test stage, forty clean speech signals chosen from the TIMIT test set [48] are used for evaluation. All speech signals are down-sampled to 8 kHz. Four different types of background noise used for training are selected from the NOISEX-92 noise database [47], including white noise, babble noise, f16 noise and factory noise. In addition, two noises (i.e., factory2 noise and street noise) outside the training set are used to evaluate the generalization performance. The input SNR is set to -5 dB, 0 dB, 5 dB and 10 dB, respectively. In this work, we select the IRM from a classical DNN-based method [23] as the first reference method (named Ref.1). The PSM incorporating phase-sensitive information [31] is considered as the second reference method (named Ref.2), and the standard Wiener-filter IRM is considered as the third reference method (named Ref.3). In the experiments, the proposed BIRMP-based and OGAM-based methods are named BIRMP and OGAM, respectively. To help readers see the tuned parameters used in the proposed method, the parameter names and values are given in Table 1.

Table 1. Parameter values.

Parameter   Value
ν           0.2 [43]
μ           1.6248 [43]
α_max       1
b1          0.1
b2          0.01

Here, $\varphi_c$ is the circular mean and $\kappa$ is the concentration parameter of the phase distribution; since the speech phase based on the super-Gaussian speech model is not considered in the DNN training, $\varphi_c$ and $\kappa$ are not calculated in the proposed method. Several measures, including the segmental SNR improvement (SSNRI), the short-time objective intelligibility (STOI) and the perceptual evaluation of speech quality (PESQ), are used to evaluate the speech enhancement
performance. In the following parts, we give these evaluation results.

3.1. The comparison of the proposed masks with and without additional features

In order to evaluate the effectiveness of the additional features with harmonic preservation in the proposed T-F masks, Table 2 lists the average PESQ and STOI scores of BIRMP_wo, BIRMP, OGAM_wo and OGAM. Here, BIRMP_wo and BIRMP denote the BIRMP without and with the additional features based on harmonic preservation, respectively; likewise, OGAM_wo and OGAM denote the OGAM without and with the additional features. From Table 2, the enhanced speech processed by all masks obtains better PESQ and STOI results than noisy speech. By comparison, the proposed T-F masks with additional features give higher PESQ and STOI values than those without, which confirms that the additional features can effectively improve the prediction of the training target.
Table 3. Test results of SSNRI.

Noise type   Input SNR   Ref.1   Ref.2   Ref.3   BIRMP   OGAM
White        -5 dB       14.51   15.26   15.95   16.57   16.15
White         0 dB       12.96   13.91   14.13   14.70   14.21
White         5 dB       10.89   11.98   12.28   12.48   12.21
White        10 dB        8.68    9.69    9.96   10.08    9.89
Babble       -5 dB       13.79   14.37   14.96   16.57   15.26
Babble        0 dB       12.25   12.98   13.48   14.34   13.59
Babble        5 dB       10.23   10.82   11.55   12.01   11.65
Babble       10 dB        8.03    9.08    9.40    9.62    9.45
F16          -5 dB       14.91   15.25   15.94   17.06   16.15
F16           0 dB       12.99   13.86   14.22   14.94   14.51
F16           5 dB       11.04   11.92   12.21   12.55   12.10
F16          10 dB        9.16    9.97   10.26   10.46   10.23
Factory      -5 dB       13.36   14.09   14.99   16.51   15.84
Factory       0 dB       12.49   13.14   13.93   14.48   14.11
Factory       5 dB       10.57   11.48   12.07   12.30   11.96
Factory      10 dB        8.75    9.52    9.88    9.97    9.82
3.2. The comparison between the proposed method and other methods

3.2.1. The SSNR evaluation

The segmental signal-to-noise ratio (SSNR) [6] is often applied to evaluate the de-noising performance of a speech enhancement method. It is defined as:
$$\mathrm{SSNR} = \frac{1}{N_{um}}\sum_{j=1}^{N_{um}} 10\log_{10}\left(\frac{\sum_{n=1}^{N} x^2(n)}{\sum_{n=1}^{N}\big[x(n) - \hat x(n)\big]^2}\right) \tag{22}$$
where $N_{um}$ is the number of frames, $N$ is the frame length, $x(n)$ is the clean speech signal and $\hat x(n)$ is the enhanced speech signal. A higher SSNRI value indicates higher speech quality. Table 3 shows the SSNR improvement (SSNRI) results of the enhanced speech obtained by the proposed methods and the reference methods for the four types of trained noise at different input SNRs. As seen from Table 3, Ref.1, without phase-sensitive information, obtains relatively lower SSNRI results than the PSM and the proposed two masks in all conditions. The PSM introduces the PD between clean and noisy speech; however, it operates only in the real domain and the imaginary information is abandoned, which may weaken the ability to suppress background noise, so that the Ref.2 method cannot produce higher SSNRI results than the proposed two methods. As for Ref.3 and the BIRMP, which are based on the minimum mean square error (MMSE) estimator of speech, since the MMSE estimator implies that the SNR of the enhanced speech should be maximized to some extent, they have a better ability to suppress noise than Ref.1, Ref.2 and the OGAM in most cases. By contrast, the BIRMP, with phase-sensitive information, achieves better performance than Ref.3.
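For reference, a minimal sketch of the SSNR of Eq. (22) is given below; the frame length and the absence of frame overlap are our assumptions. The SSNRI is then the SSNR of the enhanced signal minus that of the noisy signal.

```python
# Minimal sketch of Eq. (22) (our illustration; no overlap assumed).
import numpy as np

def ssnr(x, x_hat, frame_len=256, eps=1e-12):
    n_frames = len(x) // frame_len
    vals = []
    for j in range(n_frames):
        seg = slice(j * frame_len, (j + 1) * frame_len)
        num = np.sum(x[seg] ** 2)
        den = np.sum((x[seg] - x_hat[seg]) ** 2) + eps
        vals.append(10.0 * np.log10(num / den + eps))
    return float(np.mean(vals))
```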
Table 2. Comparison of average STOI (%) and PESQ.

Method     Measure   -5 dB    0 dB    5 dB    10 dB
Noisy      PESQ       1.53    1.86    2.22    2.56
           STOI      61.87   71.70   80.86   88.08
BIRMP_wo   PESQ       2.31    2.63    2.92    3.19
           STOI      75.16   83.01   88.77   92.76
BIRMP      PESQ       2.33    2.65    2.93    3.21
           STOI      76.02   83.07   88.92   92.86
OGAM_wo    PESQ       2.33    2.61    2.92    3.18
           STOI      77.00   84.27   89.81   93.15
OGAM       PESQ       2.35    2.66    2.95    3.21
           STOI      77.13   84.35   89.84   93.26
Fig. 6. Average test results of SSNRI.
For each input SNR, the average SSNRI values over the different trained noises for the different methods are presented in Fig. 6. From Fig. 6, we can find that the average SSNRI of Ref.1 is relatively lower than that of Ref.2, the BIRMP and the OGAM, which may mean that incorporating phase-sensitive information helps to improve the predictive accuracy of the T-F mask for noise reduction. Furthermore, the proposed two systems obtain much higher average SSNRI than the PSM-based system. Ref.3 and the BIRMP both seek to maximize the SNR; moreover, the proposed BIRMP, considering phase-sensitive information, performs better than Ref.3 in reducing background noise.

3.2.2. The PESQ evaluation

The perceptual evaluation of speech quality (PESQ) [49] is an objective evaluation of speech quality, which is often used to evaluate the quality of the restored speech. The higher the PESQ score, the better the objective quality of the speech signal. Table 4 shows the PESQ results of the proposed methods and the reference methods for the different trained noises at different input SNRs. From Table 4, when phase-sensitive information is considered, the temporal continuity [50,51] of the speech spectral structure is probably protected, resulting in the improvement of the PESQ for Ref.2, the BIRMP and the OGAM compared with Ref.1.
Table 4. Test results of PESQ.

Noise type   Input SNR   Noisy   Ref.1   Ref.2   Ref.3   BIRMP   OGAM
White        -5 dB       1.251   2.052   2.161   2.196   2.17    2.21
White         0 dB       1.483   2.334   2.443   2.434   2.463   2.467
White         5 dB       1.773   2.602   2.711   2.732   2.793   2.799
White        10 dB       2.094   2.873   3.026   3.051   3.109   3.114
Babble       -5 dB       1.422   2.035   2.161   2.122   2.197   2.228
Babble        0 dB       1.747   2.306   2.441   2.402   2.485   2.524
Babble        5 dB       2.105   2.593   2.673   2.672   2.773   2.808
Babble       10 dB       2.444   2.882   3.042   3.041   3.092   3.111
F16          -5 dB       1.385   2.103   2.189   2.188   2.241   2.265
F16           0 dB       1.683   2.405   2.475   2.451   2.533   2.572
F16           5 dB       2.023   2.688   2.807   2.811   2.850   2.878
F16          10 dB       2.354   2.984   3.139   3.143   3.196   3.224
Factory      -5 dB       1.325   2.063   2.139   2.151   2.188   2.245
Factory       0 dB       1.636   2.333   2.437   2.427   2.504   2.540
Factory       5 dB       1.985   2.624   2.762   2.766   2.803   2.834
Factory      10 dB       2.333   2.906   3.081   3.083   3.107   3.145
However, since the speech energy carried by the imaginary information is abandoned in the PSM, its ability to maintain speech temporal continuity is limited, so that the PSM generates poorer PESQ results than Ref.3, the BIRMP and the OGAM in most conditions. By comparison, the BIRMP-based method with phase-sensitive information performs better than Ref.3. For the OGAM, the incorporation of an adjustable scale of the spectral magnitude information is beneficial for maintaining speech energy at higher SNR levels and obtaining effective noise suppression at lower SNR levels, which may help to retain the spectral temporal continuity. Thus, the proposed OGAM-based system produces higher PESQ values than the BIRMP-based system. Fig. 7 gives the average PESQ values of the different methods relative to noisy speech. From Fig. 7, we can see that, compared with noisy speech, all methods improve the PESQ results. Furthermore, the proposed two methods produce higher average PESQ scores than the reference methods; of these, the OGAM-based method obtains the best PESQ performance.

Fig. 7. The average results of PESQ related to noisy speech.
3.2.3. The speech intelligibility test

The short-time objective intelligibility (STOI) [52] is used to evaluate the intelligibility of our systems and the reference systems. Table 5 gives the STOI results for the different trained noises for the different methods, and Table 6 gives the average STOI results of the different methods relative to noisy speech.
Table 5. Test results of STOI (%).

Noise type   Input SNR   Noisy   Ref.1   Ref.2   Ref.3   BIRMP   OGAM
White        -5 dB       52.70   74.55   72.58   74.49   74.61   74.78
White         0 dB       64.25   81.63   80.97   81.60   81.90   82.04
White         5 dB       76.36   87.60   88.09   88.04   88.31   88.35
White        10 dB       86.56   92.53   92.79   92.75   92.87   92.89
Babble       -5 dB       54.36   75.55   73.61   75.24   75.63   76.65
Babble        0 dB       65.53   80.93   79.76   79.66   81.96   82.17
Babble        5 dB       75.96   85.99   86.28   86.12   87.43   87.45
Babble       10 dB       84.77   90.57   91.66   91.62   91.77   91.67
F16          -5 dB       53.30   75.41   72.91   74.51   75.42   75.75
F16           0 dB       64.63   81.90   81.07   81.06   82.19   82.60
F16           5 dB       76.03   87.41   87.45   87.42   88.13   88.14
F16          10 dB       85.42   91.94   92.57   92.17   92.68   92.58
Factory      -5 dB       52.25   74.29   71.20   71.30   74.51   74.95
Factory       0 dB       63.75   80.41   78.48   78.34   80.67   81.32
Factory       5 dB       75.35   86.25   86.27   86.25   86.93   86.96
Factory      10 dB       85.28   90.76   90.88   90.81   91.77   91.53
Table 6. Test results of STOI (%) related to noisy speech.

Method        -5 dB    0 dB    5 dB    10 dB
Noisy         53.15   64.54   75.92   85.50
Ref.1-Noisy   21.79   16.67   10.88    5.94
Ref.2-Noisy   19.42   15.53   11.09    6.46
Ref.3-Noisy   20.73   15.62   11.03    6.33
BIRMP-Noisy   21.89   17.14   11.77    6.76
OGAM-Noisy    22.38   17.49   11.80    6.66
In Tables 5 and 6, from a combination viewpoint, the PSM, BIRMP and OGAM can all be regarded as combined masks incorporating the PD-based mask. Since the PD-based mask seems helpful for maintaining speech energy at higher SNR levels, the PSM, BIRMP and OGAM have the potential to suppress more speech distortion and obtain higher STOI results than Ref.1 and Ref.3 in most cases. It is worth noting that Ref.2 generates higher STOI results than Ref.1 and Ref.3 only in higher SNR conditions, while poorer results occur when the input SNR is lower (e.g., -5, 0 dB). In view of this phenomenon, the PSM is analyzed in this paper as follows. It is well known that the real and imaginary parts of the complex speech spectrum are closely related to the speech phase information. According to the studies in [53,54], the effect of the clean speech phase on the enhanced speech is most pronounced in lower SNR cases and smaller in higher SNR cases. Thus, for the PSM, we make a possible hypothesis: the abandonment of the imaginary information has a significantly harmful effect on the enhanced speech in lower SNR conditions, resulting in more speech distortion, while abandoning the imaginary information in higher SNR conditions is acceptable. Obviously, if this assumption is valid, the difference of the PSM between higher and lower SNR conditions can be explained. To verify the above hypothesis, Fig. 8 compares the spectra of the ideal PSM and PSM_i in higher and lower SNR conditions. Herein, PSM_i is the imaginary information abandoned in the PSM. The PSM and PSM_i can be represented mathematically as follows:
$$\mathrm{PSM} = \mathrm{Re}\left(\frac{S}{Y}\right) = \frac{|S|}{|Y|}\cos\varphi_1 \tag{23}$$

$$\mathrm{PSM\_i} = \mathrm{Im}\left(\frac{S}{Y}\right) = \frac{|S|}{|Y|}\sin\varphi_1 \tag{24}$$
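Both quantities are easy to compute from the ideal complex spectra; the sketch below is our illustration, with variable names of our choosing.

```python
# Minimal sketch of Eqs. (23)-(24) (our illustration). S and Y are clean and
# noisy complex STFTs; returns the PSM and the discarded imaginary part PSM_i.
import numpy as np

def psm_parts(S, Y, eps=1e-12):
    ratio = S / (Y + eps)
    return np.real(ratio), np.imag(ratio)   # PSM, PSM_i
```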
Fig. 8. The comparison of ideal PSM and PSM_i. PSM (left), PSM_i (right).
Moreover, a comparison of the spectra of the speech obtained by the ideal PSM and PSM_i in higher and lower SNR conditions is also given in Fig. 9. From Fig. 8, we can see that the spectrum of the PSM in the 10 dB SNR condition shows an obvious speech spectral structure, while the spectrum of the PSM_i in the 10 dB SNR case shows a blurry speech spectral structure, which may mean that abandoning PSM_i is acceptable in this case. However, in the -5 dB SNR condition, a relatively obvious speech spectral structure is visible in the spectrum of the PSM_i.
Fig. 9. The comparison of output speech obtained by ideal PSM and PSM_i. Speech processed by PSM (left), speech processed by PSM_i (right).
Thus, there is more speech distortion in the output speech obtained by the PSM in the lower SNR case when the PSM_i is abandoned. In Fig. 9, the spectra of the speech obtained by the ideal PSM and PSM_i show a phenomenon similar to Fig. 8. By comparison, in the proposed method, the scalability of the speech magnitude information helps the OGAM to reduce more speech distortion, so that the OGAM achieves higher intelligibility than the BIRMP in most cases, except for the 10 dB input SNR; in that case, the reason may be that the ability of the BIRMP to maintain speech energy is superior to that of the OGAM under the 10 dB SNR condition.

3.2.4. The noise generalization ability test

In order to measure the robustness of the proposed method to the noise environment, two types of unseen noise (i.e., factory2 noise and street noise) and ten types of noise (named n1-n10 in [55]) from the 100 environmental noises [55] are used for the mismatch evaluation. Table 7 lists the relative improvements of the average PESQ and STOI scores for the unseen noises for the different methods relative to noisy speech at different input SNR conditions. The enhanced speech processed by all methods obtains better PESQ and STOI results than the noisy speech. For Ref.2, the BIRMP and the OGAM, the ability to handle unseen noise is not weakened, which suggests that incorporating phase-sensitive information helps to generate a robust T-F mask. However, in Ref.2, since more speech energy included in the imaginary information is discarded in low input SNR conditions, more speech distortion is generated, so that Ref.2 cannot produce higher STOI results than Ref.1 in low input SNR conditions. As for Ref.3, although it obtains a slightly better PESQ result than Ref.2, it does not achieve better STOI performance than Ref.2. In this paper, the methods based on the OGAM and BIRMP generate better results in both STOI and PESQ. Moreover, the OGAM slightly outperforms the BIRMP in terms of PESQ, while the STOI improvements of the OGAM are not consistently higher than those of the BIRMP in all SNR conditions, probably because the selection of the scale factor $\alpha$ of the speech magnitude information is not optimal in higher SNR conditions.

3.3. The sparseness property of the proposed method

In this section, the proposed method considering sparseness is discussed. Table 8 lists the relative improvements of the average PESQ and STOI scores of the proposed masks (including BIRMP_s, BIRMP, OGAM_s and OGAM) relative to noisy speech. Here, BIRMP_s and OGAM_s denote the BIRMP and the OGAM considering sparseness [56], respectively. From Table 8, BIRMP_s, BIRMP, OGAM_s and OGAM all obtain better PESQ and STOI results than noisy speech.
Table 7. Comparison of the relative improvements of average STOI (%) and PESQ related to noisy speech.

Method        Measure   -5 dB    0 dB    5 dB    10 dB
Noisy         PESQ      1.779   2.105   2.561   2.826
              STOI      71.47   79.62   86.33   91.26
Ref.1-Noisy   PESQ      0.602   0.532   0.416   0.371
              STOI       4.19    3.41    1.59    1.26
Ref.2-Noisy   PESQ      0.641   0.620   0.474   0.38
              STOI       4.90    4.31    2.15    2.16
Ref.3-Noisy   PESQ      0.67    0.663   0.534   0.426
              STOI       4.83    4.18    2.01    1.70
BIRMP-Noisy   PESQ      0.725   0.708   0.617   0.459
              STOI       6.84    5.17    4.04    3.19
OGAM-Noisy    PESQ      0.793   0.755   0.669   0.475
              STOI       7.02    4.85    3.68    2.45
Table 8. Comparison of the relative improvements of average STOI (%) and PESQ related to noisy speech.

Method          Measure   -5 dB    0 dB    5 dB    10 dB
Noisy           PESQ       1.64    1.97    2.37    2.67
                STOI      66.70   75.69   83.63   89.71
BIRMP_s-Noisy   PESQ       0.81    0.76    0.68    0.51
                STOI      10.55    8.32    6.08    4.01
BIRMP-Noisy     PESQ       0.76    0.74    0.66    0.56
                STOI      10.50    8.28    6.05    3.99
OGAM_s-Noisy    PESQ       0.85    0.81    0.71    0.55
                STOI      11.19    8.81    6.36    3.84
OGAM-Noisy      PESQ       0.80    0.77    0.70    0.57
                STOI      11.14    8.76    6.33    3.82
When sparseness is considered, the PESQ values obtained by BIRMP_s and OGAM_s are slightly better than those obtained by the BIRMP and OGAM in most cases. Moreover, BIRMP_s and OGAM_s achieve better STOI performance than the BIRMP and OGAM.
4. Conclusions

In this paper, considering that incorporating phase-sensitive information into the T-F mask can help to improve the quality of the enhanced speech, we presented two techniques for generating a T-F mask with phase-sensitive information for single-channel speech enhancement. In the first technique, the spectral structures of two phase differences are described and combined with the IRM to generate a bounded IRM under a constraint (i.e., the BIRMP). The second technique exploits a generalized maximum a posteriori probability of the complex speech spectrum to generate an optimal spectral estimator of the speech magnitude and phase (i.e., the OGMAPC), which can dynamically adjust the scale of the prior information of the spectral magnitude and phase. Unfortunately, since the clean phase information is hard to train successfully due to its random spectral structure, the second technique only exploits the spectral magnitude part of the OGMAPC estimator to generate a magnitude mask with phase-sensitive information (i.e., the OGAM). The phase part of the OGMAPC could possibly be processed based on the study in [43] and may be addressed in our future work. Obviously, the enhanced speech processed by the BIRMP and OGAM still uses the noisy phase, which is similar to the classical PSM. In addition to the proposed two masks, we also used some acoustic features with harmonic preservation, which are considered as additional information at the DNN input. These features are generated from an artificial noisy signal obtained by applying the maximum operation to the noisy signal. The experiments were performed for various noise environments at different SNR conditions and showed that the proposed methods achieve better performance than the reference methods.
Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61831019, 61471014 and 61231015).
Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.apacoust.2019.07.009.
References

[1] Boll SF. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 1979;27(2):113–20.
[2] Loizou PC. Speech enhancement: theory and practice. Boca Raton, FL, USA: CRC Press; 2007.
[3] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 1984;ASSP-32(6):1109–21.
[4] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 1985;33:443–5.
[5] Zhao DY, Kleijn WB. HMM-based gain modeling for enhancement of speech in noise. IEEE Trans Audio Speech Language Process 2007;15(3):882–92.
[6] Deng F, Bao CC, Kleijn WB. Sparse hidden Markov models for speech enhancement in non-stationary noise environments. IEEE Trans Audio Speech Language Process 2015;23(11):1973–87.
[7] Srinivasan S, Samuelsson J, Kleijn WB. Codebook driven short-term predictor parameter estimation for speech enhancement. IEEE Trans Audio Speech Language Process 2006;14(1):163–76.
[8] Srinivasan S, Samuelsson J, Kleijn WB. Codebook-based Bayesian speech enhancement for nonstationary environments. IEEE Trans Audio Speech Language Process 2007;15(2):441–52.
[9] Wang XY, Bao CC. Speech enhancement using a joint MAP estimation of LP parameters. Int Conf Signal Process Commun Comput 2015.
[10] Reddy A, Raj B. Soft mask methods for single-channel speaker separation. IEEE Trans Audio Speech Language Process 2007;15(6):1766–76.
[11] Radfar MH, Dansereau RM. Single-channel speech separation using soft mask filtering. IEEE Trans Audio Speech Language Process 2007;15(8):2299–310.
[12] Hu K, Wang DL. An iterative model-based approach to cochannel speech separation. EURASIP J Audio Speech Music Process 2013;14:1–11.
[13] Kim G, Lu Y, Hu Y, Loizou PC. An algorithm that improves speech intelligibility in noise for normal-hearing listeners. J Acoust Soc Am 2009;126:1486–94.
[14] Han K, Wang DL. A classification based approach to speech segregation. J Acoust Soc Am 2012;132:3475–83.
[15] Wang Y, Han K, Wang DL. Exploring monaural features for classification-based speech segregation. IEEE Trans Audio Speech Language Process 2013;21:270–9.
[16] Lu X, Tsao Y, Matsuda S, Hori C. Speech enhancement based on deep denoising autoencoder. Proc Interspeech 2013:436–40.
[17] Xia BY, Bao CC. Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Commun 2014;60:13–29.
[18] Xu Y, Du J, Dai L, Lee C. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Process Lett 2014;21(1):66–8.
[19] Xu Y, Du J, Dai L, Lee C. A regression approach to speech enhancement based on deep neural networks. IEEE Trans Audio Speech Lang Process 2015;23(1):7–19.
[20] Wang Z, Wang X, Li X, Fu Q, Yan Y. Oracle performance investigation of the ideal masks. IWAENC 2016:1–5.
[21] Chen J, Wang Y, Wang DL. A feature study for classification-based speech separation at low signal-to-noise ratios. IEEE/ACM Trans Audio Speech Lang Process 2014;22:1993–2002.
[22] Delfarah M, Wang DL. Features for masking-based monaural speech separation in reverberant conditions. IEEE/ACM Trans Audio Speech Lang Process 2017;25:1085–94.
[23] Wang YX, Narayanan A, Wang DL. On training targets for supervised speech separation. IEEE/ACM Trans Audio Speech Lang Process 2014;22(12):1849–58.
[24] Williamson DS, Wang YX, Wang DL. Complex ratio masking for joint enhancement of magnitude and phase. Proc ICASSP 2016:5220–4.
[25] Williamson DS, Wang Y, Wang DL. Complex ratio masking for monaural speech separation. IEEE/ACM Trans Audio Speech Lang Process 2016;24:483–92.
[26] Kounovsky T, Malek J. Single channel speech enhancement using convolutional neural network. Workshop ECMSM 2017:1–5.
[27] Park SR, Lee J. A fully convolutional neural network for speech enhancement. arXiv preprint arXiv:1609.07132; 2016.
[28] Fu SW, Tsao Y, Lu X, Kawai H. Raw waveform-based speech enhancement by fully convolutional networks. arXiv preprint arXiv:1703.02205; 2017.
[29] Chen J, Wang DL. Long short-term memory for speaker generalization in supervised speech separation. J Acoust Soc Am 2017;141(6):4705–14.
[30] Wang DL, Chen J. Supervised speech separation based on deep learning: an overview. arXiv preprint arXiv:1708.07524; 2017.
[31] Erdogan H, Hershey JR, Watanabe S, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. Proc ICASSP 2015:708–12.
[32] Sun L, Du J, Dai LR, Lee CH. Multiple-target deep learning for LSTM-RNN based speech enhancement. Hands-free Speech Communications and Microphone Arrays (HSCMA) 2017:136–40.
[33] Pascual S, Bonafonte A, Serra J. SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452; 2017.
[34] Michelsanti D, Tan ZH. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification. Proc Interspeech 2017:2008–12.
[35] Sriram A, Jun H, Gaur Y, Satheesh S. Robust speech recognition using generative adversarial networks. arXiv preprint arXiv:1711.01567; 2017.
[36] Donahue C, Li B, Prabhavalkar R. Exploring speech enhancement with generative adversarial networks for robust speech recognition. arXiv preprint arXiv:1711.05747; 2017.
[37] Soon IY, Koh SN. Low distortion speech enhancement. IEE Proc Vis Image Signal Process 2000;147(3):247–8.
[38] Hasan T, Hasan MK. MMSE estimator for speech enhancement considering the constructive and destructive interference of noise. IET Signal Process 2010;4(1):1–11.
[39] Plapous C, Marro C, Scalart P. Speech enhancement using harmonic regeneration. Proc ICASSP 2005:157–60.
[40] Gao B, Woo WL, Dlay SS. Unsupervised single-channel separation of nonstationary signals using gammatone filter-bank and Itakura–Saito nonnegative matrix two-dimensional factorizations. IEEE Trans Circuits Syst I 2013;60(3):662–75.
[41] Chehresa S, Savoji MH. Codebook constrained iterative and parametric Wiener filter speech enhancement. IEEE Int Conf Signal Image Process Appl 2010:548–53.
[42] Kulmer J, Mowlaee P. Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR. Proc ICASSP 2015:5063–7.
[43] Mowlaee P, Stahl J, Kulmer J. Iterative joint MAP single-channel speech enhancement given non-uniform phase prior. Speech Commun 2017;86:85–96.
[44] Su Y, Tsao Y, Wu J, Jean F. Speech enhancement using generalized maximum a posteriori spectral amplitude estimator. In: IEEE International Conference on Acoustics, Speech and Signal Processing. p. 7467–71.
[45] Geravanchizadeh M, Ahmadnia R. Monaural speech enhancement based on multi-threshold masking. Springer Berlin Heidelberg; 2014. p. 369–93.
[46] Narayanan A, Wang DL. A CASA-based system for long-term SNR estimation. IEEE/ACM Trans Audio Speech Lang Process 2012;20(9):2518–27.
[47] Varga A, Steeneken HJ. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 1993;12(3):247–51. https://doi.org/10.1016/0167-6393(93)90095-3.
[48] Zue V, Seneff S, Glass J. Speech database development at MIT: TIMIT and beyond. Speech Commun 1990;9(4):351–6. https://doi.org/10.1016/0167-6393(90)90010-7.
[49] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Recommendation P.862; 2001.
[50] Wang Y, Brookes M. Speech enhancement using an MMSE spectral amplitude estimator based on a modulation domain Kalman filter with a Gamma prior. Proc ICASSP 2016:5225–9.
[51] Wu B, Li K, Yang M, Lee CH. A reverberation-time-aware approach to speech dereverberation based on deep neural networks. IEEE/ACM Trans Audio Speech Lang Process 2017;25(1):102–11.
[52] Taal CH, Hendriks RC, Heusdens R, et al. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE/ACM Trans Audio Speech Lang Process 2011;19(7):2125–36.
[53] Mowlaee P, Saeidi R. On phase importance in parameter estimation in single-channel speech enhancement. Proc ICASSP 2013:7462–6.
[54] Mowlaee P, Martin R. On phase importance in parameter estimation for single-channel source separation. IWAENC 2012:1–4.
[55] Hu G. 100 nonspeech environmental sounds; 2014.
[56] Xu L, Choy CS, Li YW. Deep sparse rectifier neural networks for speech denoising. IWAENC 2016:1–5.