Digital Signal Processing 36 (2015) 174–183
Binaural source separation based on spatial cues and maximum likelihood model adaptation

Roohollah Abdipour a, Ahmad Akbari a,*, Mohsen Rahmani a,b, Babak Nasersharif a,c

a Audio & Speech Processing Lab, School of Computer Engineering, Iran University of Science & Technology, Tehran, Iran
b Computer Engineering Department, Faculty of Engineering, Arak University, Arak, Iran
c Electrical & Computer Engineering Department, K.N. Toosi University of Technology, Tehran, Iran

* Corresponding author.

http://dx.doi.org/10.1016/j.dsp.2014.09.001
Article history: Available online 17 September 2014

Keywords: Binaural source separation; Model adaptation; Maximum likelihood linear regression; Statistical signal processing; Speech enhancement

Abstract

This paper describes a system for separating multiple moving sound sources from two-channel recordings based on spatial cues and a model adaptation technique. We employ a statistical model of the observed interaural level and phase differences, where maximum likelihood estimation of the model parameters is achieved through an expectation-maximization algorithm. This model is used to partition the spectrogram points into several clusters (one cluster per source) and to generate spectrogram masks accordingly for isolating the individual sound sources. We follow a maximum likelihood linear regression (MLLR) approach for tracking source relocations and adapting the model parameters accordingly. The proposed algorithm is able to separate more sources than input channels, i.e. it works in the underdetermined setting. In simulated anechoic and reverberant environments with two and three speakers, the proposed model-adaptation algorithm yields more than 10 dB gain in signal-to-noise-ratio improvement for azimuthal source relocations of 15° or more. Moreover, this performance gain is achievable with only 0.6 seconds of the input mixture received after relocation.
1. Introduction

Sound source separation is a well-known challenge with important applications. For example, consider a hearing-aid device that must separate the utterances of a target speaker from competing sound sources; an effective source separation algorithm is highly valuable in this context. As a result, various algorithms have been proposed in the literature.

Independent component analysis (ICA) is one of the best-known source separation approaches that rely on the availability of multi-channel observations [1–6]. Commonly, the source signals are assumed to be statistically independent. This independence assumption makes it possible to use optimization methods based on higher-order statistics (HOS) [7]. Alternatively, ICA methods based on the maximum likelihood principle [8] and second-order statistics (SOS) [9–11] can be applied. SOS-based solutions are especially useful in situations with uncorrelated and non-stationary sources. Traditional ICA methods are popular due to their ability
to separate signals without any a priori knowledge about the sound sources and environmental conditions (such as the configuration of microphones). However, traditional ICA methods fail in underdetermined conditions (i.e., when the number of sources exceeds the number of input channels). Another major limitation is that the mixing coefficients must remain stationary for a period of time; this constraint is not satisfied in real situations where sound sources can move.

Model-based source separation is another well-studied approach that incorporates a priori knowledge about the sources. For example, codebook-based methods [12,13] and hidden Markov model (HMM) based methods [14–16] have been widely used for speech enhancement. In these methods, models are assumed for the noise and speech signals, and the model parameters are estimated in advance on a training set. Nonnegative matrix factorization (NMF) [17] is another model-based source separation approach, initially used in single-channel settings [18–24]. NMF-based methods decompose a nonnegative matrix of observations into the product of two nonnegative matrices: a basis matrix containing a set of basis vectors for each source, and a gain matrix containing the mixing coefficients. The source signals are usually obtained by calculating a Wiener-like filter based on the basis and gain matrices (e.g., see [21,23–27]).
NMF-based solutions have shown promising results, especially for non-stationary signals. However, the spectrograms of real signals are highly diverse and are often poorly modeled by a low-rank structure such as NMF. Furthermore, NMF is usually applied to whole excerpts of data, and hence is more appropriate for off-line applications than for real-time processing.

Other methods employ binaural cues, such as the interaural time or phase difference (ITD or IPD) and the interaural level difference (ILD), for separating sound sources [28–31]. Generally, for each spatially-fixed sound source, the corresponding subband-level (IPD, ILD) observations concentrate in a specific region of the IPD–ILD space. The IPD–ILD space of a multi-source environment can therefore be modeled as a set of observation clusters (one cluster per source). The position of each cluster depends on the location of the source and differs for spatially-disjoint sources. Building on this observation, many source separation methods aim to find clusters of observations in the IPD–ILD space and assign each cluster to a source. For example, in [28] a two-dimensional histogram of ITD and ILD features is constructed and each peak of the histogram is assigned to a source; a time-frequency mask is then constructed accordingly to partition the input mixture into the original signals. As another example, [29] describes a supervised classification algorithm that learns rules for separating the target source based on ITD and ILD features; this classifier is used to calculate a binary mask for source separation. The method described in [30] learns the probability distribution of a target speaker given its (ITD, ILD) observation pair. This distribution is used as a look-up table that gives the probability that each spectral bin is dominated by the target source, and a mask is calculated accordingly.

Incorporating source models in conjunction with spatial models is also common and usually improves performance. For example, in [32] spatial cues are employed as prior knowledge to separate sound sources based on nonnegative tensor factorization. As another example, in [31] spatial models of the sources are combined with a priori trained source models; these models give the likelihood of each source based on the current observation, and a time-frequency mask is built accordingly for source separation. In [33] a library of source models is employed to incorporate prior knowledge about each source, together with models that represent the spatial and environmental conditions; the proposed framework is shown to be flexible enough to be applicable in different conditions.

Promising results have been reported for these localization-based methods. However, they are only useful in offline scenarios where sources do not relocate. This is because they rely on localization models of spatially-fixed sources and need a relatively long segment of observations to estimate the model parameters. In effect, these methods fail in real situations with moving sources, where the model parameters should be updated over time according to the new source locations.

Our main idea is to exploit model-adaptation techniques to adjust the model parameters according to new source locations. We employ a bivariate Gaussian model to represent the observations of each source, where the model parameters are estimated using an expectation-maximization algorithm.
We also employ the maximum likelihood linear regression (MLLR) technique to adjust the model parameters after possible source relocations. We use this model to partition the observations into source-related clusters and build a separation mask accordingly. It is worth mentioning that although incorporating source models alongside the spatial models can improve performance, this study is limited to updating the spatial models and tracking source movements, in order to build a solution suitable for online applications.
Obviously, one can still utilize source models alongside our up-to-date spatial models to improve performance.

The remainder of the paper is organized as follows. In Section 2, we recall a state-of-the-art statistical model for sound source separation, which is useful for spatially-fixed sources. Then, in Section 3, we propose a model adaptation algorithm to update the parameters of this model as sources relocate over time. The performance of this algorithm is evaluated in Section 4 under different conditions. Finally, the paper concludes in Section 5.

2. Background

A state-of-the-art approach for building the spatial model of concurrent sound sources is proposed in [31]. Therein, a Gaussian mixture model (GMM) is employed to represent the localization cues of spatially-fixed sources, and an expectation-maximization (EM) algorithm is derived to estimate the model parameters. We use this model as our spatial model, and we use its parameter estimation algorithm to initialize our model. The model and its parameter estimation algorithm are detailed in this section. In the next section, we propose a model adaptation algorithm to track source movements and update the spatial model accordingly.

Consider $I$ spatially-fixed sources with signals $\{s_i(t),\ i = 1,\dots,I\}$. The binaural recordings $x_l(t)$ and $x_r(t)$, corresponding to the mixtures arriving at the left and right ears, respectively, are modeled as:
$$x_l(t) = \sum_{i=1}^{I} s_i\big(t - \tau_i^l\big) * h_i^l(t) \quad (1)$$

$$x_r(t) = \sum_{i=1}^{I} s_i\big(t - \tau_i^r\big) * h_i^r(t) \quad (2)$$
In this model, $\tau_i^{l,r}$ are the delays of the direct path from source $i$ to the left and right ears, and $h_i^{l,r}(t)$ capture the effects of the room and head-related impulse responses (RIR and HRIR), excluding the delay of arrival. Supposing that the spectra of these mixtures are approximately disjoint (i.e., that each time-frequency bin corresponds to one source), the element-wise ratio of the time-frequency units of these mixtures can be expressed as:
$$R(\lambda, f) = \frac{X_l(\lambda, f)}{X_r(\lambda, f)} = \frac{H_i^l(\lambda, f)}{H_i^r(\lambda, f)}\, e^{-j 2\pi f (\tau_i^l - \tau_i^r)} \quad (3)$$
where upper-case letters show the short-term Fourier transform (STFT) of their corresponding lower-case signals, and λ and f are the frame and frequency-bin indices, respectively. The interaural level and phase differences between the two ears are written as:
$$\mathrm{IPD}(\lambda, f) = 2\pi f\big(\tau_i^l - \tau_i^r\big) = 2\pi f\,\tau_i, \qquad \tau_i = \tau_i^l - \tau_i^r \quad (4)$$

$$\mathrm{ILD}(\lambda, f) = \ln\left|\frac{H_i^l(\lambda, f)}{H_i^r(\lambda, f)}\right| \quad (5)$$
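To make these cues concrete, the following sketch (our illustration, not part of the paper) computes the per-bin IPD and ILD observations of Eqs. (3)–(5) from the left- and right-ear STFTs; the function name and the numerical guard `eps` are our own choices.

```python
import numpy as np

def ipd_ild_features(X_l, X_r, eps=1e-12):
    """Per-bin spatial cues from the left/right STFTs (Eqs. (3)-(5)).

    X_l, X_r: complex STFT arrays of shape (frames, bins).
    Returns the IPD (wrapped to (-pi, pi] by np.angle) and the
    log-magnitude ILD of every time-frequency unit.
    """
    R = X_l / (X_r + eps)                  # element-wise interaural ratio, Eq. (3)
    ipd = np.angle(R)                      # interaural phase difference, Eq. (4)
    ild = np.log(np.abs(X_l) + eps) - np.log(np.abs(X_r) + eps)  # Eq. (5)
    return ipd, ild
```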
The IPD values are constrained to the interval $(-\pi, +\pi]$. For spatially-disjoint sources, the subband observations $o(\lambda, f) = [\mathrm{IPD}(\lambda, f), \mathrm{ILD}(\lambda, f)]$ form distinct clusters in the IPD–ILD space [28,31]. The cluster related to source $i$ can be modeled with a bivariate Gaussian distribution as:
$$p\big(o(\lambda,f)\,\big|\, i, \Phi_i(f)\big) = \frac{1}{2\pi |C_i(f)|^{1/2}}\, e^{-\frac{1}{2}\,(o(\lambda,f)-\mu_i(f))^{T} C_i^{-1}(f)\,(o(\lambda,f)-\mu_i(f))} \quad (6)$$
where $o(\lambda,f)$ is the subband observation, $C_i(f) = \begin{bmatrix} \sigma_{i,\mathrm{IPD}}^2(f) & 0 \\ 0 & \sigma_{i,\mathrm{ILD}}^2(f) \end{bmatrix}$, $\mu_i(f) = [\mu_{i,\mathrm{IPD}}(f), \mu_{i,\mathrm{ILD}}(f)]$, and $\Phi_i(f) = \{\mu_i(f), C_i(f)\}$ are the model parameters of source $i$. The covariance matrix $C_i(f)$ is taken to be diagonal due to the assumed independence of the IPD and ILD variables.

The parameters of the above model are associated with the source location. So, assuming that the sources are fixed for a sufficiently long time, the model parameters can be estimated from the observations. To do so, an expectation-maximization (EM) algorithm is employed [31]. The E-step consists of calculating the hidden variable $z_i(\lambda,f)$, which represents the posterior probability that the active source at that bin is $i$, given the model parameters of source $i$ estimated at iteration $j$, $\Phi_i^j(f)$:
$$z_i^j(\lambda,f) = \frac{p\big(o(\lambda,f)\,\big|\, i, \Phi_i^j(f)\big)}{\sum_{k=1}^{I} p\big(o(\lambda,f)\,\big|\, k, \Phi_k^j(f)\big)} \quad (7)$$
where $k$ is the source index. Consider the auxiliary function $Q(\Phi \mid \Phi^j)$ as the total log-likelihood of the parameters $\Phi = \bigcup_{i=1}^{I} \Phi_i$ given the current parameter estimate $\Phi^j = \bigcup_{i=1}^{I} \Phi_i^j$, i.e.:
$$Q(\Phi \mid \Phi^j) = \sum_{\lambda} \sum_{f} \sum_{i} z_i(\lambda,f) \ln p\big(o(\lambda,f)\,\big|\, i, \Phi_i(f)\big) \quad (8)$$

The above equation is derived under the assumption that time-frequency bins are independent. The M-step consists of finding the parameters $\Phi_i$ that maximize $Q(\Phi \mid \Phi^j)$. Taking the derivative of $Q(\Phi \mid \Phi^j)$ with respect to $\Phi_i$, setting it to zero, and solving for $\Phi_i$, the model parameters are derived as:

$$\hat{\mu}_i(f) = \frac{\sum_{\lambda} z_i(\lambda,f)\, o(\lambda,f)}{\sum_{\lambda} z_i(\lambda,f)} \quad (9)$$

$$\mathrm{diag}\big(C_i(f)\big) = \frac{\sum_{\lambda} z_i(\lambda,f)\, \big(o(\lambda,f)-\hat{\mu}_i(f)\big)^{T} \big(o(\lambda,f)-\hat{\mu}_i(f)\big)}{\sum_{\lambda} z_i(\lambda,f)} \quad (10)$$

where $\mathrm{diag}(C_i(f)) = \big[\sigma_{i,\mathrm{IPD}}^2(f), \sigma_{i,\mathrm{ILD}}^2(f)\big]$ are the entries on the main diagonal of $C_i(f)$. The EM algorithm is repeated until $|Q(\Phi^{j+1} \mid \Phi^j) - Q(\Phi^j \mid \Phi^{j-1})| \le \varepsilon$, where $\varepsilon$ is a small threshold value.

Based on the current model parameters $\Phi(f)$, the probabilistic spectral mask $M_i(\lambda,f)$ is derived to separate source $i$. To do so, the posterior probability of source $i$ given the current observation $o(\lambda,f)$ is computed and used as the de-mixing mask, i.e.:

$$M_i(\lambda,f) = p\big(as(\lambda,f) = i \,\big|\, o(\lambda,f), \Phi(f)\big) \quad (11)$$

where $as(\lambda,f)$ denotes the index of the active source at the current time-frequency unit. According to Bayes' theorem, the above equation can be written as:

$$M_i(\lambda,f) = \frac{p\big(as(\lambda,f)=i\big)\, p\big(o(\lambda,f) \,\big|\, as(\lambda,f)=i, \Phi_i(f)\big)}{\sum_{k=1}^{I} p\big(as(\lambda,f)=k\big)\, p\big(o(\lambda,f) \,\big|\, as(\lambda,f)=k, \Phi_k(f)\big)} \quad (12)$$

where $p(as(\lambda,f)=k)$ denotes the probability that the current time-frequency bin belongs to source $k$. Note that $p(o(\lambda,f) \mid as(\lambda,f)=i, \Phi_i(f))$ equals $p(o(\lambda,f) \mid i, \Phi_i(f))$ as defined in Eq. (6). Assuming that $p(as(\lambda,f)=k)$ is equal for all sources $k$, the above equation simplifies to:

$$M_i(\lambda,f) = \frac{p\big(o(\lambda,f) \,\big|\, i, \Phi_i(f)\big)}{\sum_{k=1}^{I} p\big(o(\lambda,f) \,\big|\, k, \Phi_k(f)\big)} = z_i(\lambda,f) \quad (13)$$

The spectrum of source $i$ is then estimated by multiplying the spectrum of the input mixture at each ear by this mask, i.e.:

$$\hat{S}_i^{l,r}(\lambda,f) = M_i(\lambda,f)\, X^{l,r}(\lambda,f) \quad (14)$$

Finally, the time-domain representation of the signal is obtained through inverse FFT and overlap-add operations.

We will show in Section 4 that the above EM algorithm needs relatively long segments (more than 1.2 seconds) of the mixtures to estimate the model parameters. Moreover, the algorithm iterates several times before convergence, so the computational cost, and equivalently the model training time, is high. These challenges also hold for other localization-based source separation methods. For example, in [28], long segments of observations are needed to build a reliable histogram of spatial cues. In [29], the supervised training of the classifier demands a large training set and high computational time. Similarly, building the look-up table of (ITD, ILD) observations in [30] requires long segments of input mixtures. These challenges prevent such methods from rapidly learning new localization information as sources move; they are therefore limited to situations with spatially-fixed sources and are not applicable in real conditions with moving sources. In the next section, we propose a model-adaptation algorithm for source tracking and separation that has low computational cost and adjusts the model parameters using a short segment of the input mixtures.
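The following sketch (ours, with hypothetical function names) implements the per-bin EM recursion of Eqs. (7), (9), and (10) for a single frequency bin; the posteriors it returns double as the separation mask of Eq. (13). The random initialisation, fixed iteration count (in place of the convergence test on $Q$), and variance floor are practical additions not specified in the text.

```python
import numpy as np

def em_spatial_model(obs, n_src, n_iter=20, eps=1e-8):
    """EM estimation of the spatial model of one frequency bin.

    obs: array of shape (frames, 2) holding the [IPD, ILD] observations
    of one bin; n_src: number of sources I.
    Returns means (n_src, 2), diagonal variances (n_src, 2), and the
    posteriors z of shape (frames, n_src).
    """
    rng = np.random.default_rng(0)
    mu = obs[rng.choice(len(obs), n_src, replace=False)]  # crude initialisation
    var = np.ones((n_src, 2))
    for _ in range(n_iter):
        # E-step: per-frame posteriors z_i(lambda, f), Eq. (7)
        log_p = -0.5 * (((obs[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
        log_p -= log_p.max(1, keepdims=True)
        z = np.exp(log_p)
        z /= z.sum(1, keepdims=True) + eps
        # M-step: posterior-weighted mean and diagonal variance, Eqs. (9)-(10)
        w = z.sum(0) + eps
        mu = (z.T @ obs) / w[:, None]
        var = np.einsum('ti,td->id', z, obs ** 2) / w[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)                        # variance floor
    return mu, var, z                  # z is also the mask of Eq. (13)
```

Running this independently for every frequency bin $f$ yields the complete model $\Phi(f)$.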
3. Spatial-model adaptation

Sound sources can relocate. Source locations (with respect to the microphones or ears) may also change in applications like hearing aids as the user turns his or her head. Since the statistical model of Eq. (6) is based on the spatial cues of fixed sources, it is no longer an accurate model after source displacements and should be modified accordingly. In addition, since the ILD and its statistics are related to the room-related transfer functions of the left and right ears (see Eq. (5)), the model also needs to adapt to RIR changes, e.g., due to the movements of people or appliances. For these reasons, the model parameters of Eq. (6) should be updated as new observations arrive. To achieve this goal, we propose a maximum likelihood linear regression (MLLR) model adaptation algorithm that conforms the model to the new observations in every successive short time slot.

A schematic representation of the proposed model-adaptation procedure is depicted in Fig. 1 for a two-source situation.

Fig. 1. A schematic view of the spatial-model adaptation algorithm for the case of two sound sources.

The subband (IPD, ILD) observations of the sources before movement form two clusters in the IPD–ILD space, shown with lighter colors. After source displacement, the new observations form new clusters whose means and variances differ from those of the initial clusters. As a result, the initial model is no longer a suitable model of the observations and should be adjusted accordingly. Our main idea is to find linear transformations that map the initial clusters to their corresponding clusters after relocation. To do so, we follow a maximum likelihood linear regression (MLLR) approach to calculate transformation matrices that adapt the mean vectors and covariance matrices of the model to the new observations.

According to the MLLR adaptation procedure [34], the new mean vector $\hat{\mu}_i(f)$ is calculated using the linear transformation matrix $W_b$ as:
$$\hat{\mu}_i(f) = W_b\, \mu_i(f) + b \quad (15)$$

where $W_b$ is the $2 \times 2$ mean-adaptation matrix and $b$ is the $2 \times 1$ bias vector. Eq. (15) can be represented equivalently as:

$$\hat{\mu}_i(f) = \begin{bmatrix} w_{11} & w_{12} & b_1 \\ w_{21} & w_{22} & b_2 \end{bmatrix} \begin{bmatrix} \mu_{i,\mathrm{IPD}}(f) \\ \mu_{i,\mathrm{ILD}}(f) \\ 1 \end{bmatrix} = W\, \xi_i(f) \quad (16)$$
where $W$ is the $2 \times 3$ adaptation matrix to be calculated. Writing the new model parameters as $\hat{\Phi}_i(f) = \{W \xi_i(f), C_i(f)\}$, the new model is:
$$p\big(o(\lambda,f)\,\big|\, i, \hat{\Phi}_i(f)\big) = \mathcal{N}\big(o(\lambda,f);\, W\xi_i(f),\, C_i(f)\big) = \frac{1}{2\pi |C_i(f)|^{1/2}}\, e^{-\frac{1}{2}\,(o(\lambda,f)-W\xi_i(f))^{T} C_i^{-1}(f)\,(o(\lambda,f)-W\xi_i(f))} \quad (17)$$
We define the auxiliary function as:
$$Q\big(\hat{\Phi}_i \mid \Phi_i\big) = \sum_{\lambda} z_i(\lambda,f) \ln p\big(o(\lambda,f)\,\big|\, i, \hat{\Phi}_i(f)\big) \quad (18)$$
To find the transformation matrix $W$, we take the derivative of $Q(\hat{\Phi}_i \mid \Phi_i)$ with respect to $W$, set it to zero, and solve for $W$. The general form for calculating $W$ is derived as:
$$\sum_{\lambda} z_i(\lambda,f)\, C_i^{-1}(f)\, o(\lambda,f)\, \xi_i^{T}(f) = \sum_{\lambda} z_i(\lambda,f)\, C_i^{-1}(f)\, W\, \xi_i(f)\, \xi_i^{T}(f) \quad (19)$$
Note that the left-hand side of Eq. (19) can be calculated from previous model parameters and new observations. Thus, the transformation matrix W is calculated easily as:
$$W = \frac{C_i(f)}{\sum_{\lambda} z_i(\lambda,f)} \left[\sum_{\lambda} z_i(\lambda,f)\, C_i^{-1}(f)\, o(\lambda,f)\, \xi_i^{T}(f)\right] \big(\xi_i(f)\, \xi_i^{T}(f)\big)^{-1} \quad (20)$$
We also adapt the model’s covariance matrix as:
$$\hat{C}_i(f) = L B L^{T} \quad (21)$$

where $L$ is the Cholesky factor of $C_i(f)$, i.e., $C_i(f) = L L^{T}$. Substituting Eq. (21) into the auxiliary function $Q(\hat{\Phi}_i \mid \Phi_i)$ (Eq. (18)) and maximizing the resulting function, the maximum likelihood estimate of $B$ is derived as:

$$B = \frac{L^{-1} \left[\sum_{\lambda} z_i(\lambda,f)\, \big(o(\lambda,f)-\hat{\mu}_i(f)\big)^{T} \big(o(\lambda,f)-\hat{\mu}_i(f)\big)\right] \big(L^{-1}\big)^{T}}{\sum_{\lambda} z_i(\lambda,f)} \quad (22)$$
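A minimal one-pass sketch of the adaptation step follows, under two assumptions that are ours rather than the paper's: the covariance matrices are diagonal (as in Section 2), and the statistics of Eq. (19) are pooled across frequency bins, a regression-class choice common in MLLR [34], since the per-bin rank-one normal equations of Eq. (20) would otherwise be singular. The function name is hypothetical.

```python
import numpy as np

def mllr_adapt(obs, z, mu, var, var_floor=1e-6):
    """One-pass MLLR update of one source's spatial model (Eqs. (16)-(22)).

    obs: (frames, bins, 2) new [IPD, ILD] observations;
    z:   (frames, bins) posteriors z_i(lambda, f) of this source;
    mu, var: (bins, 2) current per-bin means and diagonal variances.
    """
    n_bins = mu.shape[0]
    xi = np.concatenate([mu, np.ones((n_bins, 1))], axis=1)        # (bins, 3)
    W = np.zeros((2, 3))
    for d in range(2):                 # rows of W decouple for diagonal C_i
        inv_v = 1.0 / var[:, d]
        G = np.einsum('tb,b,bi,bj->ij', z, inv_v, xi, xi)          # lhs stats, Eq. (19)
        k = np.einsum('tb,b,tb,bi->i', z, inv_v, obs[..., d], xi)  # rhs stats
        W[d] = np.linalg.solve(G, k)
    mu_new = xi @ W.T                                              # Eq. (16)
    # For diagonal C_i the Cholesky factor is L = diag(sigma), so
    # Eqs. (21)-(22) collapse to the posterior-weighted residual variance.
    resid = obs - mu_new[None]
    var_new = np.einsum('tb,tbd->bd', z, resid ** 2) / (z.sum(0)[:, None] + 1e-8)
    return mu_new, np.maximum(var_new, var_floor)
```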
Comparing the model-adaptation algorithm of this section with the EM algorithm of Section 2, the proposed adaptation algorithm clearly has a lower computational cost: it is a one-pass procedure, whereas the EM algorithm iterates several times before convergence. Moreover, we show in Section 4 that the model-adaptation algorithm achieves approximately the same performance as the EM algorithm using shorter segments of the input mixture. These two points enable the proposed adaptation
algorithm to follow source displacements more rapidly than re-estimating the model parameters with the EM algorithm. The model-adaptation algorithm is especially beneficial because sources usually move slowly, so their spatial model needs only small, rapid modifications to keep the system performance unaffected.

In order to use the proposed model adaptation technique, one should first build an initial spatial model. The model initialization can be performed using the EM algorithm described in Section 2. The block diagram of such a system is shown in Fig. 2.

Fig. 2. The block diagram of the proposed system.

This is a two-phase statistical-model-based system that employs localization cues for the separation of moving sound sources. In the first phase, the algorithm finds the initial parameters of the spatial model from the binaural recordings, assuming that the sources are fixed during this initial period. Parameter initialization is performed using the algorithm described in Section 2. Then, in the second phase, the model parameters are repeatedly adapted to match the new observations in consecutive short time slots, and hence follow possible source relocations. The model adaptation is performed as described in this section. Finally, a probabilistic mask is constructed based on the model and is used to de-mix the input signals. The mask calculation is performed as described at the end of Section 2.
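To summarise the two-phase flow, here is a minimal control loop built on the earlier sketches; `ipd_ild_features`, `em_spatial_model`, and `mllr_adapt` are our hypothetical helpers, the 1.2 s initialisation window and 0.6 s slots follow the durations reported in Section 4, the frame bookkeeping is deliberately simplified, and only the left-ear signal is resynthesised.

```python
import numpy as np
from scipy.signal import stft, istft

def posteriors(obs, mu, var):
    """z_i(lambda, f) under the current model, Eqs. (6)-(7).
    obs: (frames, bins, 2); mu, var: (bins, n_src, 2)."""
    d = obs[:, :, None, :] - mu[None]
    log_p = -0.5 * ((d ** 2) / var[None] + np.log(2 * np.pi * var[None])).sum(-1)
    log_p -= log_p.max(-1, keepdims=True)
    z = np.exp(log_p)
    return z / z.sum(-1, keepdims=True)

def separate_stream(x_l, x_r, n_src, fs=16000, nperseg=1024,
                    init_sec=1.2, slot_sec=0.6):
    """Phase 1: EM initialisation; phase 2: per-slot MLLR adaptation."""
    _, _, Xl = stft(x_l, fs, nperseg=nperseg)
    _, _, Xr = stft(x_r, fs, nperseg=nperseg)
    Xl, Xr = Xl.T, Xr.T                                   # (frames, bins)
    ipd, ild = ipd_ild_features(Xl, Xr)
    obs = np.stack([ipd, ild], axis=-1)                   # (frames, bins, 2)
    to_frames = lambda sec: max(1, int(sec * fs / (nperseg // 2)))

    # Phase 1: per-bin EM on the leading, spatially-fixed segment.
    n0 = to_frames(init_sec)
    fits = [em_spatial_model(obs[:n0, b], n_src) for b in range(obs.shape[1])]
    mu = np.stack([m for m, _, _ in fits])                # (bins, n_src, 2)
    var = np.stack([v for _, v, _ in fits])

    # Phase 2: adapt in consecutive short slots and collect masks.
    masks = np.zeros(obs.shape[:2] + (n_src,))
    masks[:n0] = posteriors(obs[:n0], mu, var)            # Eq. (13)
    for t0 in range(n0, obs.shape[0], to_frames(slot_sec)):
        sl = slice(t0, t0 + to_frames(slot_sec))
        z = posteriors(obs[sl], mu, var)                  # E-step on the new slot
        for i in range(n_src):                            # one MLLR pass per source
            mu[:, i], var[:, i] = mllr_adapt(obs[sl], z[..., i], mu[:, i], var[:, i])
        masks[sl] = posteriors(obs[sl], mu, var)
    # Eq. (14) plus overlap-add resynthesis (left ear only, for brevity).
    return [istft((masks[..., i] * Xl).T, fs, nperseg=nperseg)[1]
            for i in range(n_src)]
```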
4. Experiments

We perform four experiments to evaluate the performance of the proposed algorithm. The first experiment examines the performance of the proposed algorithm in the case of spatially-fixed sources and compares it to four competing methods. In the second experiment, we evaluate the level of performance degradation of the competing algorithms for different degrees of source displacement, and show that the proposed model-adaptation algorithm is able to restore the system's performance after source relocations. Finally, in the third and fourth experiments, we evaluate the length of signal needed to initialize the spatial model using the EM algorithm, and the length of signal required to adjust the model parameters using the model-adaptation algorithm. Before these experiments, we describe the shared details of our experimental setup.

4.1. Shared experimental details

1) Data sources. We used speech signals from the TIMIT corpus [35]. From the 6300 utterances in the dataset, we randomly selected 40 utterances, each of approximately 3 seconds duration. All signals were normalized to have the same mean-square energy. The experiments consist of either two or three concurrent speakers. In the two-speaker case, the speakers were located symmetrically to the left and right of the dummy head at different azimuthal angles with 0° elevation. In the three-speaker case, the same placement was used for two of the speakers, and the third speaker was fixed directly in front of the dummy head. To simulate the anechoic signals, we borrowed the head-related impulse responses (HRIRs) from the CIPIC dataset [36] for a KEMAR dummy head with small pinnae. These HRIRs are measured at different azimuthal angles with 5° resolution; we used the measurements taken with sources at 1 meter distance from the listener. The reverberant binaural impulse responses came from AIR [37], an effort to record such impulse responses for a KEMAR dummy head in real environments (including a meeting room, lecture room, stairway, corridor, and a former church with very strong reverberation). We used the measurements for the stairway environment; the azimuthal angles in the stairway recordings range from 0° to 180° with 15° resolution. For the two-speaker configurations, we randomly selected 10 different utterances for each speaker; thus, 100 different sets of mixed utterances are used for each two-speaker configuration. For the three-speaker configurations, 5 randomly selected utterances are used for each speaker, resulting in 125 different sets of mixed recordings for each source placement.

2) Evaluation metrics. We measure the separation performance for source $i$ with the signal-to-noise ratio improvement (SNRI) metric, defined as:

$$\mathrm{SNRI}_i = 10 \log_{10} \frac{\sum_{\lambda} \sum_{f} \big|M_i(\lambda,f)\, S_i(\lambda,f)\big|^2}{\sum_{\lambda} \sum_{f} \big|S_i(\lambda,f) - M_i(\lambda,f) \sum_{k} S_k(\lambda,f)\big|^2} - 10 \log_{10} \frac{\sum_{\lambda} \sum_{f} \big|S_i(\lambda,f)\big|^2}{\sum_{\lambda} \sum_{f} \big|\sum_{k \neq i} S_k(\lambda,f)\big|^2} \quad (23)$$

where $S_i(\lambda,f)$ and $M_i(\lambda,f)$ are the STFT and the estimated mask of source $i$, respectively. This measure penalizes both the residual of the competing sources and the distortion of the target signal. We calculate the mean of Eq. (23) over all sources and use it as the performance evaluation metric:

$$\mathrm{SNRI} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{SNRI}_i \quad (24)$$

We also evaluate the quality of the separated speech signals using the Perceptual Evaluation of Speech Quality (PESQ) score [38]. The PESQ score is a psychoacoustics-based measure that correlates with subjective evaluation measures, with correlation values around 0.8 [38]. PESQ values range from −0.5 (worst case) to 4.5 (best case) [39]. The details of the SNRI and PESQ calculations can be found in [40] and [38], respectively.
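As an illustration of Eqs. (23)–(24), a sketch (function names ours) that scores a set of estimated masks against the clean-source STFTs:

```python
import numpy as np

def snri(S, M, i, eps=1e-12):
    """SNR improvement of source i, Eq. (23).

    S: (n_src, frames, bins) complex STFTs of the clean sources;
    M: (n_src, frames, bins) estimated masks.
    """
    mix = S.sum(0)                              # mixture = sum of all sources
    others = mix - S[i]                         # competing sources only
    snr_out = 10 * np.log10((np.abs(M[i] * S[i]) ** 2).sum()
                            / ((np.abs(S[i] - M[i] * mix) ** 2).sum() + eps))
    snr_in = 10 * np.log10((np.abs(S[i]) ** 2).sum()
                           / ((np.abs(others) ** 2).sum() + eps))
    return snr_out - snr_in

def mean_snri(S, M):
    """Average over all sources, Eq. (24)."""
    return np.mean([snri(S, M, i) for i in range(S.shape[0])])
```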
4.2. Evaluation for fixed sources
The first experiment compares the performance of the basic system of Section 2 with that of other methods in situations with non-moving speakers. The competing methods are those of Yilmaz and Rickard [28], Roman et al. [29], Harding et al. [30], and Mandel et al. [31], described in Section 1. We implemented the Roman et al. and Harding et al. methods ourselves. For the Yilmaz and Rickard method, we used the Matlab implementation provided with [41]. For the Mandel et al. method, we downloaded and used the original code of its authors.

For the supervised methods (i.e., the Yilmaz and Rickard [28], Roman et al. [29], and Harding et al. [30] methods), we considered two disjoint sets of 100 (125) mixtures as the train and test sets in the anechoic (reverberant) condition. For the unsupervised methods (i.e., Mandel et al. [31] and the proposed method), a set of 100 (125) mixtures was used as input in the anechoic (reverberant) condition. We also compare the performance of the different methods with that of the ideal binary mask (IBM). Since the mixtures are created synthetically, the IBM can be constructed as a binary mask that is 1 at time-frequency units where the local power of the target signal exceeds that of the competing signals, and 0 everywhere else (see the sketch below). Such a binary mask indicates the upper bound of the achievable performance.

To evaluate each method, the process of training and source separation is performed for each between-source angle separately, and the average SNRI and PESQ score for that angle are calculated. Finally, the means of these results are shown in Tables 1 and 2. According to these tables, the performance of the proposed algorithm is acceptable in both the two-speaker and three-speaker conditions. However, the performance of all algorithms decreases in reverberant conditions: there, the clustering operation in the IPD–ILD space is more difficult, so the separation is less accurate. The proposed method, as well as the Mandel et al. method, isolates the sources more effectively than the other algorithms in all conditions.
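A minimal sketch of the IBM construction described above (the function name is ours):

```python
import numpy as np

def ideal_binary_mask(S, i):
    """IBM for source i: 1 where the target's local power exceeds the
    combined power of the competing sources, 0 elsewhere.

    S: (n_src, frames, bins) complex STFTs of the clean sources.
    """
    target = np.abs(S[i]) ** 2
    others = (np.abs(S) ** 2).sum(0) - target
    return (target > others).astype(float)
```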
Fig. 3. The performance drops due to source relocations, and the corresponding SNR gain obtainable using the proposed model adaptation algorithm.

Table 1. SNRI results (dB) for spatially-fixed sources.

Method              | Anechoic, 2 sources | Anechoic, 3 sources | Reverberant, 2 sources | Reverberant, 3 sources
Ideal binary mask   | 14.62 | 11.83 |  7.92 |  6.28
Roman et al.        |  5.72 |  4.18 |  2.11 | −2.06
Yilmaz and Rickard  |  9.53 |  6.39 |  2.66 | −1.18
Harding et al.      |  7.49 |  5.13 |  1.88 | −1.43
Mandel et al.       | 11.94 |  9.27 |  6.12 |  3.24
The proposed method | 11.85 |  9.04 |  5.90 |  3.17

Table 2. PESQ results for spatially-fixed sources.

Method              | Anechoic, 2 sources | Anechoic, 3 sources | Reverberant, 2 sources | Reverberant, 3 sources
Ideal binary mask   | 3.31 | 3.11 | 2.89 | 2.72
Mixture             | 1.88 | 1.47 | 1.51 | 1.19
Roman et al.        | 2.01 | 1.44 | 1.03 | 0.86
Yilmaz and Rickard  | 2.48 | 1.92 | 1.55 | 1.17
Harding et al.      | 2.16 | 1.59 | 1.28 | 0.94
Mandel et al.       | 3.16 | 2.51 | 2.19 | 1.82
The proposed method | 3.09 | 2.46 | 2.08 | 1.77
These two methods achieve approximately 11.9 dB SNRI and a 3.1 PESQ score in the anechoic environment with two speakers, which is noticeably higher than the competing methods; similar gains are observed in the other conditions. The Mandel et al. method performs slightly better than the proposed method; we attribute this to its use of source models, which help when the spatial cues become less reliable.

4.3. Evaluation for displaced sources

As mentioned earlier, the values of the binaural cues depend on the locations of the sources. For a spatially-fixed source, its binaural observations cover a particular region (a cluster) in the observation
space. When a source relocates, the corresponding cluster relocates as well. Since existing binaural source separation methods are based on knowledge of the clusters of each source, their performance drops as sources move. Our second experiment measures the performance degradation of the studied methods when sources relocate.

In this experiment, we first trained all the studied methods for the configuration depicted at the side of Fig. 3, in which two sources are fixed at ±45° azimuth and, in the three-source case, the third source is situated in front of the dummy head. As in the previous experiment, we considered a set of 100 (125) mixtures as the training set of the supervised methods in the anechoic (reverberant) condition, and a set of 100 (125) mixtures as inputs of the unsupervised methods in the anechoic (reverberant) condition. The trained classifiers/models were then evaluated in anechoic situations where the dummy head was turned by 0°, 5°, 10°, 15°, ..., and 40° to the left, and in reverberant situations with 0°, 15°, 30°, 45°, ..., and 135° head rotations.1 Note that these rotations change the positions of the speakers with respect to the listener. For each rotation, the trained classifiers/models were evaluated using the same number (but different sets) of mixtures as in the training sets; the mixtures of the test set were synthesized using the HRIRs corresponding to the new source directions.

The results are shown in Figs. 3a to 3d for the anechoic and reverberant conditions with two and three speakers. For all competing methods and in all conditions, the SNRI drops noticeably; even for a 5° rotation, the SNRI of these systems is reduced by more than 4.5 dB.

To evaluate the SNRI gain achievable by the MLLR adaptation algorithm after each head rotation, we performed the following procedure.
1 The head rotation angles considered in the anechoic and reverberant conditions differ because of the different azimuthal resolutions of the corresponding HRIR datasets, as described in Section 4.1.
We employed a single input mixture (of 3 seconds duration) to calculate the adapted model parameters. The adapted model was then used to de-mix 10 other mixtures, and the SNRI was calculated after each separation. This process of model adaptation and source separation was repeated 10 times using different mixtures, and the mean of the SNRIs was calculated. This procedure was performed for both the two-speaker and three-speaker scenarios in anechoic as well as reverberant conditions. The means of the resulting SNRIs are included in Figs. 3a to 3d. Comparing the adaptation results with the SNRIs achieved on test signals of fixed sources (shown in Figs. 3a to 3d as 0° rotation), the proposed MLLR adaptation algorithm is capable of restoring the performance of the system to its level before movement, even for large rotations. For example, for a 30° azimuthal rotation in the anechoic two-speaker situation, the proposed model adaptation algorithm yields 10.16 dB SNRI, which is more than 12 dB higher than that of the competing methods.

To further investigate the effect of the model-adaptation algorithm, we compare sample spectrograms of relocated sources separated with the different methods. These spectrograms are shown in Figs. 4a to 4h for a 10° rotation of the dummy head; the sources were initially located at ±45° azimuth. The spectrograms of the clean target signal and the input mixture are also included for comparison.

Fig. 4. Spectrograms of the target signal and signals de-mixed using different methods.

The competing methods keep a large portion of the interfering signal or distort the target signal, whereas the proposed model adaptation algorithm separates the target signal with acceptable accuracy.

4.4. Evaluating the amount of data needed to build the initial model

We conducted an experiment on building the initial model of Section 2 under different conditions (i.e., anechoic and reverberant mixtures with 2 and 3 sources) and several angular separations. Here, the goal is to measure the amount of signal (in seconds)
that is required to build the initial model using the EM algorithm of Section 2. We consider ±15°, ±30°, and ±45° angular positions for the speakers in the two-source scenario; for the three-source case, the third speaker was fixed at 0°, as shown at the side of Fig. 5. For each angle, we repeatedly selected 0.2, 0.3, 0.4, 0.5, ..., and 3 seconds of signal and used them as the source signals. For each input mixture, the EM algorithm of Section 2 was run to de-mix the input mixture, and the resulting SNRI was measured. This process was repeated 100 times for each signal length and each condition, and the SNRIs were averaged. The results are shown in Figs. 5a to 5d.

The results of this experiment show that in anechoic conditions with two and three speakers, we need more than 1.2 and 1.4 seconds of signal, respectively, to bring the model to its highest achievable performance. Similarly, for reverberant conditions, at least 1.6 seconds of signal are needed for both the two-speaker and three-speaker scenarios.

The findings of Fig. 5 reveal that the creation of the initial spatial model (using the EM algorithm of Section 2) demands relatively long segments (1.2 seconds and more) of input mixtures. Moreover, the EM algorithm is an iterative process that usually runs for several iterations until convergence, so it is time-consuming. In the next subsection, we show that the proposed model-adaptation algorithm of Section 3 demands shorter segments of input mixtures, and so can react to source relocations in less time (compare this to re-estimating the model parameters with the EM algorithm in each consecutive segment of input mixtures). The proposed model-adaptation algorithm is also a single-pass process, and so has noticeably lower computational complexity than re-estimating the model parameters using the EM algorithm.
Fig. 5. The length of signal needed to train the initial model.
4.5. Evaluating the amount of signal required for model adaptation

Our last experiment measures the amount of signal needed to adapt the initial model after source relocation. We place two sources at ±45°, and the third at 0°, as shown at the side of Fig. 6. After building the initial model with 3 seconds of input mixture, we rotated the dummy head to the left by 15°, 30°, and 40°. For each rotation, we repeatedly employed 0.2, 0.3, 0.4, 0.5, ..., and 2 seconds of input mixtures (different from the mixtures used to train the initial model) to adapt the model. Each adapted model was employed to de-mix 100 input mixtures (each of 3 seconds duration), and the average SNRI was calculated. The results are shown in Figs. 6a to 6d for the different conditions.

Fig. 6. The length of signal needed to adapt the model for different degrees of rotation.

From this figure it is clear that in all conditions, only 0.6 seconds (or less) of the input mixture is sufficient to perform model adaptation with acceptable performance (near that of the initial model, per the results of Section 4.2). This amount of signal is less than half of the signal needed to re-estimate the model parameters with the EM algorithm.

5. Summary and conclusion

This paper has presented a novel source separation algorithm based on statistical models of source spatial cues and a model adaptation technique. We first built an initial spatial model of fixed sources and estimated its parameters using an expectation-maximization algorithm. This spatial model is employed to calculate a probabilistic mask for source separation. Then, in order to follow source displacements, we proposed an algorithm to iteratively
adjust the model parameters in every short time slot using a maximum likelihood linear regression approach. We performed several experiments to evaluate various aspects of the system in anechoic and reverberant conditions with multiple sound sources. We found that for fixed sources, the initial model achieves noticeably higher SNRI and PESQ scores than the competing methods, except for the Mandel et al. method [31] (which employs source models in addition to the spatial model), whose scores were slightly higher (see Tables 1 and 2). For example, in the anechoic two-source condition, 11.85 dB SNRI and a 3.09 PESQ score were achieved. We also showed that the performance of existing localization-based source separation methods drops noticeably as sources move (e.g., the SNRI of the basic system drops to −1.82 dB for a 39° rotation in the anechoic two-source condition), whereas the proposed model-adaptation algorithm updates the model parameters according to the new source locations and keeps the performance of the system near that of the initial model. Moreover, we showed that although the initial model needs long segments (1.2 seconds and more) of the input mixture for its parameter estimation, the model can be adapted to new observations using a short window (less than 0.6 seconds) of mixtures. The proposed model-adaptation technique is also a single-pass procedure and demands fewer computational resources than the iterative EM algorithm of the initial model.

There are a number of directions in which to continue this work in the future. The first is to automatically find the number of active sound sources and adapt the model accordingly. We would also like to add a model for reverberation as well as diffuse noise (like people talking in a restaurant) to improve the performance of the system in these conditions. Finally, to determine which of the separated
sources is the target signal and should be kept, we would like to employ monaural models of the target source. Such a system is useful in applications like speech enhancement where only the speech source should be retained.
Acknowledgments

The authors would like to thank the Iran Telecommunication Research Centre (Grant No. ITDR-88922018) for its support during this work. The authors also wish to thank Prof. Bijan Raahemi for proofreading the paper and for his pertinent comments.

References

[1] P. Comon, Independent component analysis, a new concept? Signal Process. 36 (3) (1994) 287–314 (Special issue on higher-order statistics).
[2] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, vol. 46, John Wiley & Sons, 2004.
[3] F. Asano, S. Ikeda, M. Ogawa, H. Asoh, N. Kitawaki, Combined approach of array processing and independent component analysis for blind separation of acoustic signals, IEEE Trans. Audio Speech Lang. Process. 11 (3) (2003) 204–215.
[4] P. Comon, C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, first ed., Elsevier Ltd, 2010.
[5] M. Yamada, G. Wichern, K. Kondo, M. Sugiyama, H. Sawada, Noise adaptive optimization of matrix initialization for frequency-domain independent component analysis, Digit. Signal Process. 23 (1) (2013) 1–8.
[6] P. Rajkishore, H. Saruwatari, K. Shikano, Enhancement of speech signals separated from their convolutive mixture by FDICA algorithm, Digit. Signal Process. 19 (1) (2009) 127–133.
[7] J.F. Cardoso, A. Souloumiac, Blind beamforming for non-Gaussian signals, IEE Proc. F, Radar Signal Process. 140 (6) (1993).
[8] J.F. Cardoso, Blind signal separation: statistical principles, Proc. IEEE 86 (10) (1998) 2009–2025.
[9] L. Parra, C. Spence, Convolutive blind separation of non-stationary sources, IEEE Trans. Speech Audio Process. 8 (3) (2000) 320–327.
[10] A.J. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Comput. 7 (1995) 1129–1159.
[11] R.R. Gharieb, A. Cichocki, Second-order statistics based blind source separation using a bank of subband filters, Digit. Signal Process. 13 (2) (2003) 252–274.
[12] T. Sreenivas, P. Kirnapure, Codebook constrained Wiener filtering for speech enhancement, IEEE Trans. Speech Audio Process. 4 (5) (1996) 383–389.
[13] S. Srinivasan, J. Samuelsson, W. Kleijn, Codebook driven short-term predictor parameter estimation for speech enhancement, IEEE Trans. Audio Speech Lang. Process. 14 (1) (2006) 163–176.
[14] Y. Ephraim, A Bayesian estimation approach for speech enhancement using hidden Markov models, IEEE Trans. Signal Process. 40 (4) (1992) 725–735.
[15] H. Sameti, H. Sheikhzadeh, L. Deng, R. Brennan, HMM-based strategies for enhancement of speech signals embedded in nonstationary noise, IEEE Trans. Speech Audio Process. 6 (5) (1998) 445–455.
[16] H. Veisi, H. Sameti, Speech enhancement using hidden Markov models in Mel-frequency domain, Speech Commun. 55 (2) (2013) 205–220.
[17] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: Advances in Neural Information Processing Systems (NIPS), MIT Press, 2001, pp. 556–562.
[18] N. Mohammadiha, J. Taghia, A. Leijon, Single channel speech enhancement using Bayesian NMF with recursive temporal updates of prior distributions, in: Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process. (ICASSP), 2012, pp. 4561–4564.
[19] S. Babji, A.K. Tangirala, Source separation in systems with correlated sources using NMF, Digit. Signal Process. 20 (2) (2010) 417–432.
[20] G.J. Mysore, P. Smaragdis, A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics, in: Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process. (ICASSP), May 2011, pp. 17–20.
[21] T. Virtanen, Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria, IEEE Trans. Audio Speech Lang. Process. 15 (3) (2007) 1066–1074.
[22] E.M. Grais, H. Erdogan, Source separation using regularized NMF with MMSE estimates under GMM priors with online learning for the uncertainties, Digit. Signal Process. 29 (2014) 20–34.
[23] C. Févotte, N. Bertin, J.L. Durrieu, Nonnegative matrix factorization with the Itakura–Saito divergence: with application to music analysis, Neural Comput. 21 (2009) 793–830.
[24] T. Virtanen, Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria, IEEE Trans. Audio Speech Lang. Process. 15 (3) (2007) 1066–1074.
[25] P. Smaragdis, J.C. Brown, Non-negative matrix factorization for polyphonic music transcription, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003, pp. 177–180.
[26] K.W. Wilson, B. Raj, P. Smaragdis, Regularized non-negative matrix factorization with temporal dependencies for speech denoising, in: Proc. Interspeech, 2008, pp. 411–414.
[27] M.N. Schmidt, R.K. Olsson, Single-channel speech separation using sparse non-negative matrix factorization, in: Proc. Interspeech, 2006.
[28] O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time–frequency masking, IEEE Trans. Signal Process. 52 (7) (2004) 1830–1847.
[29] N. Roman, D. Wang, G.J. Brown, A classification-based cocktail party processor, in: Proc. Neural Information Processing Systems, 2003, pp. 1425–1432.
[30] S. Harding, J. Barker, G.J. Brown, Mask estimation for missing data speech recognition based on statistics of binaural interaction, IEEE Trans. Audio Speech Lang. Process. 14 (1) (2006) 58–67.
[31] R.J. Weiss, M. Mandel, D.P. Ellis, Combining localization cues and source model constraints for binaural source separation, Speech Commun. 53 (5) (2011) 606–621.
[32] Y. Mitsufuji, A. Roebel, Sound source separation based on non-negative tensor factorization incorporating spatial cue as prior knowledge, in: Proc. IEEE Int. Conf. Acoustics, Speech, Signal Process. (ICASSP), 2013.
[33] A. Ozerov, E. Vincent, F. Bimbot, A general flexible framework for the handling of prior information in audio source separation, IEEE Trans. Audio Speech Lang. Process. 20 (4) (2012) 1118–1133.
[34] C.J. Leggetter, P.C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Comput. Speech Lang. 9 (2) (1995) 171–185.
[35] J.S. Garofolo, L.F. Lamel, W.M. Fisher, J.G. Fiscus, D.S. Pallett, N. Dahlgren, V. Zue, TIMIT Acoustic–Phonetic Continuous Speech Corpus, Linguistic Data Consortium, 1993.
[36] V.R. Algazi, R.O. Duda, D.M. Thompson, C. Avendano, The CIPIC HRTF database, in: IEEE Workshop on Applications of Signal Processing to Audio and Electroacoustics, Oct. 2001, pp. 99–102.
[37] M. Jeub, M. Schäfer, P. Vary, A binaural room impulse response database for the evaluation of dereverberation algorithms, in: Proc. Int. Conf. Digital Signal Processing (DSP), 2009.
[38] ITU-T Recommendation P.862, Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, 2001.
[39] T. Rohdenburg, V. Hohmann, B. Kollmeier, Objective perceptual quality measures for the evaluation of noise reduction schemes, in: Proc. 9th Int. Workshop on Acoustic Echo and Noise Control, 2005, pp. 169–172.
[40] E. Paajanen, V.V. Mattila, Improved objective measures for characterization of noise suppression algorithms, in: Proc. IEEE Workshop on Speech Coding, 2002, pp. 77–79.
[41] S. Rickard, The DUET blind source separation algorithm, in: S. Makino et al. (Eds.), Blind Speech Separation, Springer, 2007, pp. 217–237.
Roohollah Abdipour received his B.Sc. and M.Sc. degrees in computer engineering from the School of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran, in 2002 and 2004, where he now pursues his Ph.D. studies under the supervision of Prof. Ahmad Akbari. His research interests include audio and speech processing, especially speech enhancement methods, data mining, and network security.

Ahmad Akbari received his B.Sc. degree in electronics engineering and his M.Sc. degree in communications engineering from Isfahan University of Technology, Isfahan, Iran, in 1986 and 1989, respectively. He received his Diplôme des Hautes Études Technologiques (DHET) degree in computer networks from ENSEEIHT, Toulouse, France, in 1991, and his Diplôme d'Études Approfondies (DEA) and Ph.D. degrees in signal processing and telecommunications from the University of Rennes 1, Rennes, France, in 1992 and 1995, respectively. In 1996, he joined the Computer Engineering Department at the Iran University of Science and Technology as an assistant professor, where he is now an associate professor. His research interests include acoustic modeling of speech, robust speech recognition, speech enhancement, and network security.

Mohsen Rahmani received his B.Sc. degree in computer engineering from Shiraz University, Shiraz, Iran, in 2001, and his M.Sc. and Ph.D. degrees in computer engineering from the Iran University of Science and Technology in 2003 and 2008, respectively. In 2008, he joined the Engineering Department at Arak University, Arak, Iran, where he works as an assistant professor. His research interests include signal processing, especially speech enhancement.

Babak Nasersharif received the B.S. degree in hardware engineering from the Amirkabir University of Technology, Tehran, Iran, in 1997, and his M.Sc. and Ph.D. degrees in computer engineering (artificial intelligence) from the Iran University of Science and Technology, Tehran, Iran, in 2001 and 2007, respectively. He is currently an assistant professor in the Electrical and Computer Engineering Department at K.N. Toosi University of Technology. His research interests include speech processing and image processing.