Underdetermined blind separation of overlapped speech mixtures in time-frequency domain with estimated number of sources


Accepted Manuscript

Underdetermined Blind Separation of Overlapped Speech Mixtures in Time-Frequency Domain with Estimated Number of Sources

Haijian Zhang, Guang Hua, Lei Yu, Yunlong Cai, Guoan Bi

PII: S0167-6393(16)30039-5
DOI: 10.1016/j.specom.2017.02.003
Reference: SPECOM 2441

To appear in: Speech Communication

Received date: 29 February 2016
Revised date: 16 February 2017
Accepted date: 22 February 2017

Please cite this article as: Haijian Zhang, Guang Hua, Lei Yu, Yunlong Cai, Guoan Bi, Underdetermined Blind Separation of Overlapped Speech Mixtures in Time-Frequency Domain with Estimated Number of Sources, Speech Communication (2017), doi: 10.1016/j.specom.2017.02.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Underdetermined Blind Separation of Overlapped Speech Mixtures in Time-Frequency Domain with Estimated Number of Sources

Haijian Zhang (a,d,*), Guang Hua (b), Lei Yu (a), Yunlong Cai (c), Guoan Bi (d)

(a) School of Electronic Information, Wuhan University, China
(b) School of Electronic Information and Communications, Huazhong University of Science and Technology, China
(c) Department of Information Science and Electronic Engineering, Zhejiang University, China
(d) School of EEE, Nanyang Technological University, Singapore
(*) Corresponding author. Email: [email protected]

Abstract

Noise suppression and the estimation of the number of sources are two practical issues in applications of underdetermined blind source separation (UBSS). This paper proposes a noise-robust instantaneous UBSS algorithm for highly overlapped speech sources in the short-time Fourier transform (STFT) domain. The proposed algorithm first estimates the unknown complex-valued mixing matrix and the number of sources, which are then used to compute the STFT coefficients of the corresponding sources at each auto-source time-frequency (TF) point. After that, the original sources are recovered by the inverse STFT. To mitigate the noise effect on the detection of auto-source TF points, we propose a method to effectively detect the auto-term location of the sources by using the principal component analysis (PCA) of the STFTs of noisy mixtures. The PCA-based detection method can achieve a UBSS outcome similar to that of some filtering-based methods. More importantly, an efficient method to estimate the mixing matrix is proposed based on subspace projection and clustering approaches. The number of sources is obtained by counting the number of the resultant clusters. Evaluations have been carried out by using the speech corpus NOIZEUS and the experimental results have shown improved robustness and efficiency of the proposed algorithm.

Keywords: Underdetermined blind source separation; noise suppression; estimation of number of sources; estimation of mixing matrix; short-time Fourier transform.

Footnote 1: This work was supported by the National Natural Science Foundation of China under Grant No. 61501335 and the Natural Science Foundation of Hubei Province under Grant No. 2015CFB202.

Preprint submitted to Speech Communication, February 23, 2017.

1. Introduction

Blind source separation (BSS) aims to recover the underlying source signals from observed mixtures from a sensor array or a single sensor, without knowing the information of the sources and the mixing process. In many practical applications, the challenging case for source separation is when only one sensor is available, which is known as single-channel BSS [1, 2, 3]. This paper focuses upon the instantaneous
underdetermined BSS (UBSS) problem with a sensor array, i.e., the number of sensors is more than one but less than the number of sources. BSS with a sensor array has been researched more extensively than BSS with a single sensor, simply because more sensors collect more information from the sources, which helps the separation process. BSS problems have been widely encountered in audio, radar, communication, image processing, and other areas [4, 5, 6, 7, 8, 9, 10]. Most existing BSS algorithms reported in the literature achieve desirable performance in a high SNR environment. Real-world signals might be contaminated
by strong noise, and as a result, many reported algorithms obtain very poor BSS performance or fail to properly handle such severely distorted signals. Better methods to mitigate the noise effect are required to achieve robust solutions of BSS. Another practical problem of BSS algorithms lies in the unknown number of sources, which is often assumed to be available in theory. Generally, the information on the number of sources is not available in practical applications [4, 11], and thus blind estimation of the number of sources
from the received mixtures becomes crucial in achieving desirable BSS performance. This paper considers the above two practical issues and proposes a noise-robust BSS algorithm with the estimated number of sources by exploiting the spatial time-frequency distribution (STFD) of the sensor array output data.

The application of the STFD to BSS is an ever-growing research area. In [12], a BSS algorithm was introduced based on the joint diagonalization of multiple covariance matrices. Instead of using covariance matrices, Belouchrani et al. in [13] proposed the spatial time-frequency distribution based BSS (STFD-BSS) algorithm by using the diagonalization of a combined set of spatial STFD matrices, which has been demonstrated to be more robust to noise because the noise power spreads over the entire time-frequency
(TF) domain [14, 15, 16]. The main requirement of STFD-BSS is the selection of auto-term or cross-term TF points. Some efficient selection methods have been reported in [17, 18, 19, 20]. In [21, 22], the authors indicated another direction for efficient BSS that avoids the problem of TF point selection, i.e., applying
signal synthesis techniques in STFDs. However, this signal synthesis method requires the sources to be approximately disjoint in the TF domain. In [23], Aïssa-El-Bey et al. proposed two efficient STFD-based
underdetermined BSS (UBSS) algorithms for TF-nondisjoint source separation by subspace projection and signal synthesis. Compared to the STFD-BSS in [13], the STFD-UBSS in [23] does not require the TF point selection, and is more robust to noise because only the TF features of the localized source are used for signal
synthesis. Furthermore, the STFD-UBSS can deal with the underdetermined case. More information on recent research about the STFD-based BSS and UBSS can be found in [24, 25, 26].

Since the short-time Fourier transform (STFT) is easy to implement and does not have cross-terms in the TF domain, this paper is devoted to the development of the spatial STFT based UBSS (STFT-UBSS) algorithm which was originally reported in [23]. The STFT-UBSS in [23] separates the mixed sources in the STFT domain by assigning the estimated STFT values located at each auto-source TF point to their corresponding sources. Then each source is recovered by the TF synthesis using the estimated STFT values that have been allocated to this source. To minimize the implementation complexity, the STFT-UBSS in
[23] only deals with the auto-source TF points (see footnote 2) in the STFT domain, i.e., the TF points whose localized energy concentration exceeds a threshold value. However, due to an inappropriate choice of the threshold value, either certain TF points that are entirely from strong noise are detected as spurious auto-source TF points, or some true auto-source TF points are not detected. In [27, 28], Aziz-Sbaï et al. proposed a method to choose an optimal threshold value and to deal with the noise contribution in the recovery process by estimating the noise standard deviation. Alternatively, we can filter each noisy mixture
before the STFT-UBSS is applied. In [29], Andrianakis et al. proposed a speech enhancement algorithm that models the time and frequency dependencies of the speech STFT values with a Markov random field (MRF) prior. This MRF-based method can be used for the enhancement of the STFT of each noisy mixture, and then the auto-source TF points are detected based on the filtered STFT images with a relatively small threshold value at different SNRs. Other noise reduction techniques for speech signals have been reported in
[30] and the references therein. It is expected that a de-noising operation may mitigate the influence of noise and improve the UBSS performance. One objective of our algorithm is to efficiently detect auto-source TF points and achieve performance improvement in a low SNR environment, which is frequently encountered in many practical scenarios.

Another important issue is how to obtain an accurate estimation of the mixing matrix, which is the premise of the STFT-UBSS algorithm. Advanced methods for estimating the mixing matrix based on the STFTs
of mixtures have been reported in [31, 32, 33, 34, 35, 36, 37]. However, they have some limitations. Specifically, the method in [31, 32] is only suitable for two speech mixtures, and requires the sources to be W-disjoint
orthogonal in the TF domain. The methods in [33, 34, 35, 36, 37] are designed for the case of a real-valued mixing matrix. In [23], the complex-valued mixing matrix of sources with overlapped (weak-sparseness) spectral contents was estimated by clustering the single-source TF points (i.e., the TF points associated
with a single source), which are detected by selecting the TF points having sufficient energy. However, when the number of sources or the observation time increases, more multi-source TF points possessing strong
energy will appear, which significantly influences the estimation accuracy of the mixing matrix. In addition, the number of sources is generally assumed to be known and the estimation of the actual number of sources has not been adequately addressed. Two advanced clustering methods have been used for automatically
estimating the number of sources in [38] for the cases where the sources are assumed to be sparse in the TF domain. More sophisticated methods for accurately estimating the complex-valued mixing matrix and the number of highly-overlapped sources are needed.

Footnote 2: The auto-term locations of the sources in the STFT domain are termed auto-source TF points in this paper.

In this paper, we propose a robust STFT-UBSS scheme by first detecting auto-source TF points in low SNR environments. Specifically, the principal component analysis (PCA) technique is applied to compress the STFT images from received mixtures into one noise-removed STFT image, based on which auto-source
TF points are detected by assigning a relatively small threshold value. This PCA-based method not only achieves UBSS performance comparable to that of the filtering-based method in [29], but also requires much less computation time. More importantly, we propose an estimation method of the complex-valued mixing matrix based on subspace analysis and clustering methods [39]. The mixing matrix can be accurately estimated and the number of sources can be obtained by counting the number of columns in the estimated mixing matrix [11]. Compared to the algorithm in [23], the developed STFT-UBSS algorithm is of practical
importance, and is especially suitable for the sources which are significantly overlapped in STFT domain. Note that we only consider instantaneous UBSS problems in this paper, while reverberant situations are not in the scope of this paper.

This paper is organized as follows. We describe the proposed STFT-UBSS algorithm in Section 2, where the PCA-based detection of auto-source TF points, the estimation of the mixing matrix as well as the
number of sources are elaborated. In Section 3, the STFT-UBSS algorithm is evaluated by simulation with various speech data. The advantages and limitations of the proposed algorithm are discussed in Section 4. Finally, Section 5 concludes this paper.

Notation: We use $\{\cdot\}^T$ as the transpose operator, $\{\cdot\}^H$ as the conjugate transpose operator, and $\{\cdot\}^\dagger$ as the Moore-Penrose pseudoinverse operator. $\widehat{\{\cdot\}}$ denotes the estimate of $\{\cdot\}$, $\|\cdot\|$ denotes the norm operator, $|\cdot|$ denotes the absolute value operator, and $\max/\min\{\cdot\}$ denotes the maximum/minimum function.

2. The proposed STFT-UBSS algorithm

Let $s_n(t)$, $n = 1, \ldots, N$, denote the unknown sources, where $N$ is the number of sources impinging on an $M$-dimensional uniform linear array (ULA) from $N$ distinct directions. The array outputs $x_m(t)$, $m = 1, \ldots, M$, are modeled as

$$
\mathbf{x}(t) = \mathbf{A}\mathbf{s}(t) + \mathbf{n}(t), \tag{1}
$$

where $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_i, \ldots, \mathbf{a}_N]$ denotes the mixing matrix, $\mathbf{a}_i$ is the steering vector of the $i$th source, $\mathbf{x}(t) = [x_1(t), \ldots, x_M(t)]^T$ are the observations, $\mathbf{s}(t) = [s_1(t), \ldots, s_N(t)]^T$ are the far-field speech source signals, and $\mathbf{n}(t)$ is the additive white noise vector.

In many communication systems, speech modulated signals captured at a remote location may contain some speech overlaps, which happen when one person talks and the other person interrupts, or when one person starts before the other ends. The overlapping is especially frequent for the speech signals in unlicensed radio bands, e.g., the citizens band radio and amateur radio, where different users can share the same frequency band. It should be noted that the proposed STFT-UBSS algorithm is designed for instantaneous blind source separation, and cannot deal with convolutive mixtures. In addition, the signal model above has two limitations: 1) it is only suitable for a uniform linear array; 2) the sources are far-field speech modulated signals, and no reverberation is involved.
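As an illustration of the instantaneous model in (1), the following minimal Python sketch builds a noise-free underdetermined mixture. It is not part of the original algorithm description: the function names `steering_vector` and `mix`, the half-wavelength spacing, the DOAs, and the sinusoidal toy sources are all assumptions chosen for the example, and the steering vector follows the normalized ULA form given later in (10).

```python
import cmath
import math
import random


def steering_vector(theta_deg, num_sensors, spacing_over_lambda=0.5):
    """Normalized ULA steering vector (form of Eq. (10)); d/lambda is assumed."""
    phase = 2.0 * math.pi * spacing_over_lambda * math.sin(math.radians(theta_deg))
    scale = 1.0 / math.sqrt(num_sensors)
    return [scale * cmath.exp(-1j * phase * m) for m in range(num_sensors)]


def mix(sources, doas_deg, num_sensors, noise_std=0.0, seed=0):
    """Return (A, x) with x(t) = A s(t) + n(t); A is stored column by column."""
    rng = random.Random(seed)
    A = [steering_vector(theta, num_sensors) for theta in doas_deg]
    num_samples = len(sources[0])
    x = []
    for m in range(num_sensors):
        row = []
        for t in range(num_samples):
            value = sum(A[n][m] * sources[n][t] for n in range(len(sources)))
            noise = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
            row.append(value + noise)
        x.append(row)
    return A, x


# Underdetermined toy case: N = 4 sources observed by M = 3 sensors.
sources = [[math.sin(0.1 * (n + 1) * t) for t in range(8)] for n in range(4)]
A, x = mix(sources, doas_deg=[15, 30, 50, 75], num_sensors=3)
```

With `noise_std=0.0` each sensor signal is exactly the weighted sum of the sources, which makes the sketch easy to check; a positive `noise_std` adds complex white Gaussian noise to every sample.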

Under the assumption that different sources are approximately disjoint in the TF domain, each source can be separated by sequentially masking the TF region corresponding to an individual source, and then recovered through TF synthesis technique. This approach is not valid when different sources significantly overlap in the TF domain. The proposed STFT-UBSS algorithm is particularly designed to deal with source signals being non-disjoint in STFT domain. Specifically, the following assumptions are made:

• Different sources might be highly overlapped in the STFT domain, and the number of sources is unknown.
• The number of sources is strictly less than the number of sensors at any TF point (see footnote 3).

The flowchart of the proposed algorithm for blind speech separation is illustrated in Fig. 1. Based on the signal model defined in (1), the procedure of the proposed STFT-UBSS algorithm is described as follows:

- Step 1: Computing the STFT of the mixtures $\mathbf{x}(t)$ in (1), we obtain an $M \times 1$ STFT vector at each TF point $(t, f)$

$$
\mathbf{S}_x(t,f) =
\begin{bmatrix} S_{x_1}(t,f) \\ \vdots \\ S_{x_M}(t,f) \end{bmatrix}
=
\begin{bmatrix} a_{11} & \cdots & a_{N1} \\ \vdots & \ddots & \vdots \\ a_{1M} & \cdots & a_{NM} \end{bmatrix}
\begin{bmatrix} S_{s_1}(t,f) \\ \vdots \\ S_{s_N}(t,f) \end{bmatrix}
+
\begin{bmatrix} S_{n_1}(t,f) \\ \vdots \\ S_{n_M}(t,f) \end{bmatrix}
= \mathbf{A}\mathbf{S}_s(t,f) + \mathbf{S}_n(t,f), \tag{2}
$$

where $S$ denotes the STFT operator and the steering vector of the $i$th source is $\mathbf{a}_i = [a_{i1}, \ldots, a_{iM}]^T$.

- Step 2: In order to mitigate the effect of the noise term $\mathbf{S}_n$ and reduce the computational complexity, we need to precisely select the auto-source TF points. For this purpose, the $M$ STFT amplitude images $|S_{x_i}|$, $i = 1, \ldots, M$, computed in (2) are compressed by a dimensionality reduction technique into a single STFT image, wherein the energy of TF points from noise is suppressed. The PCA technique is applied in our study due to its simplicity. Based on the de-noised STFT image $|S_{pca}|$ obtained by PCA, to be discussed in the next subsection, we detect the auto-source TF points by using the following criterion at each time instant

$$
\frac{|S_{pca}(t,f)|}{\max_v |S_{pca}(t,v)|} > T_0, \tag{3}
$$

where $v$ denotes the frequency index and $T_0$ is an empirical threshold value for selection of the auto-source TF points. All the TF points which satisfy this criterion are included in the set $\Omega_a$.

Footnote 3: This implies that the main interest of our algorithm is for the cases when strictly more than 2 mixtures are observed.

- Step 3: The main premise of the STFT-UBSS algorithm is to estimate the number of sources $N$ and the mixing matrix $\mathbf{A}$. We design an efficient method to blindly estimate the mixing matrix based on subspace analysis and clustering methods, which will be elaborated in Subsection 2.2. The number of clusters, $N$, is determined by the number of columns of the estimated mixing matrix $\hat{\mathbf{A}}$.

- Step 4: It is assumed that there are at most $K$ ($K < M$) sources present at each auto-source TF point [23]. Based on the estimated mixing matrix $\hat{\mathbf{A}}$, we can compute the $K$ STFT values at each point in $\Omega_a$, and then associate them to their corresponding sources. According to (2), we have the following simplified expression

$$
\mathbf{S}_x(t,f) \approx \tilde{\mathbf{A}}\tilde{\mathbf{S}}_s(t,f)
= [\mathbf{a}_{n_1} \cdots \mathbf{a}_{n_K}]
\begin{bmatrix} S_{s_{n_1}}(t,f) \\ \vdots \\ S_{s_{n_K}}(t,f) \end{bmatrix},
\quad (t,f) \in \Omega_a, \tag{4}
$$

where the $M \times K$ matrix $\tilde{\mathbf{A}}$ denotes the steering vectors of the $K$ sources at each point $(t,f) \in \Omega_a$, and the $K \times 1$ vector $\tilde{\mathbf{S}}_s(t,f)$ contains the STFT values of these $K$ sources. Our task here is to find out which $K$ sources contribute at each auto-source TF point, i.e., the determination of $\tilde{\mathbf{A}}$ at each TF point, where $\tilde{\mathbf{A}}$ can be found by minimizing the following term

$$
\tilde{\mathbf{A}} = [\mathbf{a}_{n_1} \cdots \mathbf{a}_{n_K}]
= \arg\min_{\hat{\mathbf{a}}_{m_1},\ldots,\hat{\mathbf{a}}_{m_K}} \left\| \mathbf{P}\,\mathbf{S}_x(t,f) \right\|, \tag{5}
$$

where $\mathbf{P} = \mathbf{I} - \tilde{\mathbf{A}}_m (\tilde{\mathbf{A}}_m^H \tilde{\mathbf{A}}_m)^{-1} \tilde{\mathbf{A}}_m^H$ is the orthogonal projection matrix onto the noise subspace, and $\tilde{\mathbf{A}}_m = [\hat{\mathbf{a}}_{m_1} \cdots \hat{\mathbf{a}}_{m_K}]$ is a combination of $K$ arbitrary columns of the estimated mixing matrix $\hat{\mathbf{A}}$. Next, the STFT values of the $K$ sources at each TF point can be estimated by

$$
\tilde{\mathbf{S}}_s(t,f) =
\begin{bmatrix} S_{s_{n_1}}(t,f) \\ \vdots \\ S_{s_{n_K}}(t,f) \end{bmatrix}
\approx \tilde{\mathbf{A}}^{\dagger}\,\mathbf{S}_x(t,f),
\quad (t,f) \in \Omega_a, \tag{6}
$$

which indicates that the energy at each auto-source TF point in $\Omega_a$ is decomposed into $K$ STFT values that are associated to their corresponding sources.

- Step 5: Each source is recovered using inverse STFT (ISTFT) [40] from the estimated STFT associated to it.
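Step 4 above can be sketched in a few lines of Python. This is only an illustrative reading of (5) and (6), not the authors' implementation: it enumerates every combination of K estimated columns, uses a Gram-Schmidt least-squares solve in place of an explicit projector and pseudoinverse (the two are equivalent when the K columns are linearly independent), and all function names and toy values are assumptions.

```python
import cmath
import itertools
import math


def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def least_squares(columns, y):
    """Least-squares fit of y by the given columns via Gram-Schmidt QR.

    Returns the coefficients (the pseudoinverse solution of Eq. (6)) and the
    residual norm ||P y||, i.e. the cost minimized in Eq. (5).
    """
    K = len(columns)
    q_list = []
    r = [[0j] * K for _ in range(K)]
    for k, col in enumerate(columns):
        w = list(col)
        for j, q in enumerate(q_list):
            r[j][k] = inner(q, col)
            w = [wi - r[j][k] * qi for wi, qi in zip(w, q)]
        r[k][k] = math.sqrt(inner(w, w).real)
        q_list.append([wi / r[k][k] for wi in w])
    qty = [inner(q, y) for q in q_list]
    coeffs = [0j] * K
    for k in reversed(range(K)):  # back substitution for R c = Q^H y
        acc = qty[k] - sum(r[k][j] * coeffs[j] for j in range(k + 1, K))
        coeffs[k] = acc / r[k][k]
    residual = list(y)
    for q, c in zip(q_list, qty):
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return coeffs, math.sqrt(inner(residual, residual).real)


def separate_tf_point(a_hat, sx, K):
    """Pick the K columns of a_hat with the smallest residual, then unmix."""
    best = None
    for idx in itertools.combinations(range(len(a_hat)), K):
        coeffs, cost = least_squares([a_hat[i] for i in idx], sx)
        if best is None or cost < best[2]:
            best = (idx, coeffs, cost)
    return best


def steering(theta_deg, M):
    """Normalized ULA steering vector with assumed spacing d = lambda / 2."""
    phase = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * m) / math.sqrt(M) for m in range(M)]


# Toy TF point: sources 1 and 3 (0-based) are active among N = 4, with M = 3.
a_hat = [steering(theta, 3) for theta in (15, 30, 50, 75)]
sx = [2 * u + (0.5 - 1j) * v for u, v in zip(a_hat[1], a_hat[3])]
idx, coeffs, cost = separate_tf_point(a_hat, sx, K=2)
```

In this noise-free toy case the selected index pair identifies the two active sources and `coeffs` holds their STFT values at that point. The exhaustive search grows combinatorially with N and K, so it is shown here only for clarity.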

2.1. Detecting auto-source TF points by the PCA technique

In order to ease implementation complexity and improve the source recovery performance, we need to correctly select the auto-source TF points by eliminating the spurious auto-source points from random noise.
Generally, the energy of auto-source TF points is locally concentrated in the STFT domain, and thus they can be detected by setting an energy threshold. To determine an optimal threshold value, it is necessary to estimate the noise power, and different threshold values are required at different SNR levels [27, 28]. In a high SNR environment, auto-source TF points are readily detected by defining a small threshold value. However, in a low SNR environment, the property of localized concentration of signal energy is significantly destroyed by strong random noise [30]. The above described operation of choosing auto-source TF points
with a small threshold value detects many spurious auto-source points that are actually from noise. Because the speech energy varies at different TF locations, the use of a large threshold value to avoid the detection of spurious TF points will inevitably lead to missed detection of some weak auto-source TF points buried in strong noise. The requirement of the estimation of noise power and the difficulty in selecting an appropriate threshold value make the BSS impractical and inefficient.

The de-noising operation can substantially reduce the probability of selecting TF points pertaining to only noise, which will consequently improve the recovery performance. It is well known that the STFT amplitudes of speech signals are strongly correlated in the time and frequency domains. Such an observed correlation is mainly from the STFT frame overlap, the spectral leakage due to the windowing operation, and partly from the intrinsic property of speech signals. In [29], Andrianakis et al. proposed a speech enhancement algorithm by exploiting the time-frequency dependencies. The underlying STFT value of each TF point is estimated based
on the values of its neighboring TF points. Although the enhancement algorithm in [29] can significantly suppress noise, it is computationally inefficient in our case since we need to run this algorithm for each
noisy mixture and estimate the STFT value of each TF point. In addition, the estimation of noise power and pitch frequency is also required. Another problem of de-noising operations is that the signal content might be somewhat impaired.

Our study advocates the detection of auto-source TF points without filtering each noisy mixture. We propose to detect the auto-source TF points by exploiting the space diversity since the STFT amplitude
images of multiple mixture data from $M$ array sensors are available. Dimensionality reduction techniques, such as PCA [41] or manifold learning techniques [42], can be used to reduce the $M$-dimensional ($M$-D) STFT images in the spatial domain into lower-dimensional images, wherein the noise is suppressed. The PCA
technique is applied in our study since it is simple and fast from the viewpoint of implementation. The PCA is an orthogonal transformation which converts a set of observations of possibly correlated variables into a set of values of uncorrelated variables, i.e., principal components. The first principal component has the largest variance. The PCA can be realized by eigenvalue decomposition of the covariance matrix of the input data. The dimensionality reduction is realized by only extracting the first few principal components. In our case, the $M$ STFT amplitude images $|S_{x_i}|$, $i = 1, \ldots, M$, are transformed by using the PCA into $M$ principal components, and the first principal component is chosen to construct a de-noised STFT image. In this way, the energy of the TF points from random noise is well suppressed and the energy of
deterministic signal is enhanced. Assume the size of each STFT image is $N_t \times N_f$, where $N_t$ and $N_f$ are the dimensions of time and frequency, respectively. In order to satisfy the input requirement of the PCA, each STFT matrix $|S_{x_i}|$ is first transformed into an $N_t N_f \times 1$ vector by concatenating all of its columns, and the $M$ resulting vectors form an $N_t N_f \times M$ matrix $\mathbf{X}$. The empirical mean along each column of $\mathbf{X}$ is calculated and then subtracted from that column. The processed data matrix $\mathbf{X}$ after mean centering is set as the input of the PCA. Mathematically, the output of the PCA on $\mathbf{X}$ is expressed as

$$
\mathbf{Y} = [\mathbf{y}_1 \;\; \mathbf{y}_2 \;\; \cdots \;\; \mathbf{y}_M] = \mathbf{X}\mathbf{W}, \tag{7}
$$

where $\mathbf{W}$ is an $M \times M$ matrix comprised of all the eigenvectors of $\mathbf{X}^T\mathbf{X}$, and $\mathbf{Y}$ is an $N_t N_f \times M$ matrix which includes all the principal components. The first principal component is kept by replacing $\mathbf{W}$ with the eigenvector $\mathbf{w}_1$ corresponding to the largest eigenvalue

$$
\mathbf{y}_1 = \mathbf{X}\mathbf{w}_1, \tag{8}
$$

which realizes the dimensionality reduction from $M$-D data to 1-D data. Since the data in each column of $\mathbf{X}$ is contaminated by independent identically distributed (iid) Gaussian noise, the output $\mathbf{Y}$ also contains iid Gaussian noise because this distribution is invariant to $\mathbf{W}$. The first few principal components with larger variances represent the signal dynamics and those with smaller variances are dominated by noise.
When the data $\mathbf{X}$ is the sum of an information-bearing signal and a Gaussian noise, the PCA is optimal for dimensionality reduction from the information-theoretic point of view [43]. Only the first principal
component $\mathbf{y}_1$ in $\mathbf{Y}$ is kept since it has the highest SNR. Next, the $N_t N_f \times 1$ vector $\mathbf{y}_1$ is transformed back into an $N_t \times N_f$ STFT image $S_{pca}$. The resultant image $S_{pca}$ with the highest SNR highlights the intrinsic signal content hidden in random noise. It should be mentioned that the STFT image $S_{pca}$ filtered by PCA is only
used to detect the auto-source TF points, not for source recovery. The noise cancellation implies that the auto-source TF points can be readily detected based on a thresholding operation, defined in (3), with a relatively small threshold value.
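The PCA compression in (7) and (8) followed by the detection rule (3) can be sketched as below. This is a hedged toy illustration rather than the authors' code: the dominant eigenvector is obtained by power iteration instead of a full eigendecomposition, the images are vectorized row by row (the ordering does not matter as long as the reshape is consistent), and the image size, noise level, and threshold `T0=0.5` are assumptions picked for the example.

```python
import math
import random


def pca_first_component(images):
    """images[m][t][f] holds the STFT amplitude of sensor m; returns |S_pca|."""
    M, Nt, Nf = len(images), len(images[0]), len(images[0][0])
    # One row per TF point and one column per sensor (the matrix X).
    X = [[images[m][t][f] for m in range(M)] for t in range(Nt) for f in range(Nf)]
    means = [sum(row[m] for row in X) / len(X) for m in range(M)]
    X = [[row[m] - means[m] for m in range(M)] for row in X]
    # M x M matrix X^T X, whose dominant eigenvector plays the role of w1.
    C = [[sum(row[i] * row[j] for row in X) for j in range(M)] for i in range(M)]
    w = [1.0 / math.sqrt(M)] * M
    for _ in range(200):  # power iteration towards the dominant eigenvector
        v = [sum(C[i][j] * w[j] for j in range(M)) for i in range(M)]
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    y1 = [abs(sum(row[m] * w[m] for m in range(M))) for row in X]  # Eq. (8)
    return [y1[t * Nf:(t + 1) * Nf] for t in range(Nt)]


def detect_auto_source_points(s_pca, T0):
    """Eq. (3): keep (t, f) where |S_pca(t, f)| / max_v |S_pca(t, v)| > T0."""
    points = set()
    for t, frame in enumerate(s_pca):
        peak = max(frame)
        if peak == 0:
            continue
        points.update((t, f) for f, value in enumerate(frame) if value / peak > T0)
    return points


# Toy data: 3 sensors, 2 time frames, 4 frequency bins, two active TF points.
rng = random.Random(0)
active = {(0, 1), (1, 2)}
images = [[[(1.0 if (t, f) in active else 0.0) + rng.uniform(0.0, 0.05)
            for f in range(4)] for t in range(2)] for m in range(3)]
omega_a = detect_auto_source_points(pca_first_component(images), T0=0.5)
```

Because the columns are mean-centered before projection, background points keep a nonzero magnitude in the compressed image, which is why the toy threshold is not set close to zero here.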

2.2. Estimation method of A and N

The blind estimation of the mixing matrix $\mathbf{A}$ in underdetermined cases has been a challenging problem, especially when the sources are non-disjoint in the TF domain and the number of sources $N$ is unavailable. In [23], the single-source TF points are first selected by detecting the TF points with strong energy. Then these TF points are classified by using the k-means clustering method with the assumption of the known number of sources. The mixing matrix is finally estimated based on the clustering results. Specifically, the TF points in the STFT domain with sufficiently strong energy are found by using the criterion

$$
\frac{|S_{pca}(t,f)|}{\max_v |S_{pca}(t,v)|} > T_1, \tag{9}
$$
where $T_1$ is an empirical threshold value which selects the TF points with sufficiently concentrated energy. All the TF points satisfying the criterion in (9) are included in the set $\Omega_s$, which is supposed to be a subset of the auto-source point set $\Omega_a$ determined by (3). It is assumed that the ideal steering vector of the $i$th source is a normalized $M \times 1$ vector and its $m$th element is expressed as below

$$
a_{im} = \frac{1}{\sqrt{M}}\, e^{-\iota \frac{2\pi}{\lambda} d (m-1) \sin(\theta_i)}, \quad m \in \{1, \ldots, M\}, \tag{10}
$$

where $\iota$ denotes the imaginary unit, $d$ is the inter-element spacing, $\lambda$ is the wavelength, and $\theta_i$ denotes the direction of arrival (DOA) of the $i$th source. Next, the normalized spatial direction vector for each TF point in the set $\Omega_s$ is calculated by making its first entry real

$$
\mathbf{v}(t,f) =
\begin{bmatrix} \dfrac{S_{x_1}(t,f)}{\|\mathbf{S}_x(t,f)\|} \\ \vdots \\ \dfrac{S_{x_M}(t,f)}{\|\mathbf{S}_x(t,f)\|} \end{bmatrix}
\cdot \frac{|S_{x_1}(t,f)|}{S_{x_1}(t,f)}, \quad (t,f) \in \Omega_s. \tag{11}
$$
This normalized spatial vector at a single-source TF point is approximately equal to the steering vector of this single source, because in this case the expression in (2) reduces to

$$
\mathbf{S}_x(t,f) =
\begin{bmatrix} S_{x_1}(t,f) \\ \vdots \\ S_{x_M}(t,f) \end{bmatrix}
\approx \mathbf{a}_i\, S_{s_i}(t,f), \tag{12}
$$

when only the source $i$ exists at the TF point $(t, f)$.

The k-means clustering method is conducted to classify all the spatial direction vectors in $\Omega_s$ assuming the number of clusters $N$ is known. Based on the resultant $N$ clusters $\cup\,\Omega_i = \Omega_s$, $i = 1, \ldots, N$, the steering vector of each source is finally estimated by averaging the spatial vectors of all the TF points in the same cluster

$$
\hat{\mathbf{a}}_i = \frac{1}{\#\Omega_i} \sum_{(t,f) \in \Omega_i} \mathbf{v}(t,f), \quad i = 1, \ldots, N, \tag{13}
$$
where $\#\Omega_i$ denotes the number of TF points in the $i$th cluster set $\Omega_i$.

The above estimation method has two limitations. Firstly, the selection of appropriate single-source TF points has significant influence on the estimation of the mixing matrix. However, the selection criterion of single-source TF points by detecting strong TF points according to (9) depends on the assumption that the source signals are sparse in the STFT domain. When the speech sources highly overlap in the STFT domain, the energy of many strong TF points within $\Omega_s$ may come from more than one source. The existence of multi-source TF points will significantly impact the estimation accuracy of the mixing matrix. Secondly, the assumption of a known number of sources is not practical. Fig. 2 (a) shows a 3-D view of the strong TF points ($T_1 = 0.2$) of noisy data with overlapped speech sources by plotting the real parts of their spatial vectors. The resultant clusters of the k-means clustering method are indicated by various colors. In Fig. 2 (a), note that the detected strong TF points from different sources are mixed together in the 3-D space, thus it will be difficult to estimate the actual number of sources by using any clustering method which can automatically obtain the number of clusters. Even if the number of sources is known, we cannot achieve accurate estimation of the mixing matrix due to the ambiguous boundaries between sources.
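For reference, the baseline estimator of [23] summarized in (11) to (13) can be sketched as follows, under the idealized assumption that every selected TF point is a noise-free single-source point. The plain k-means below uses a deterministic farthest-point initialization so that the example is reproducible; that choice, the function names, and the toy gains are assumptions, not details taken from [23].

```python
import cmath
import math


def spatial_vector(sx):
    """Eq. (11): unit-norm vector rotated so that its first entry is real.

    Assumes sx[0] is nonzero, which holds for strong TF points.
    """
    norm = math.sqrt(sum(abs(c) ** 2 for c in sx))
    rotation = abs(sx[0]) / sx[0]
    return [c / norm * rotation for c in sx]


def dist2(u, v):
    """Squared Euclidean distance between two complex vectors."""
    return sum(abs(a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, num_clusters, iterations=20):
    """Plain k-means; each center is the cluster average, as in Eq. (13)."""
    centers = [vectors[0]]
    while len(centers) < num_clusters:  # farthest-point initialization
        centers.append(max(vectors, key=lambda v: min(dist2(v, c) for c in centers)))
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            k = min(range(len(centers)), key=lambda i: dist2(v, centers[i]))
            groups[k].append(v)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers


def steering(theta_deg, M):
    """Normalized ULA steering vector with assumed spacing d = lambda / 2."""
    phase = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * m) / math.sqrt(M) for m in range(M)]


# Toy single-source TF points: each observation is a_i scaled by a complex gain.
a1, a2 = steering(20, 3), steering(60, 3)
gains = [1.0 + 0.5j, -2.0j, 0.3 - 0.7j]
observations = [[g * c for c in a] for a in (a1, a2) for g in gains]
vectors = [spatial_vector(sx) for sx in observations]
centers = kmeans(vectors, num_clusters=2)
```

In this idealized setting the normalization in (11) removes the unknown complex gain, so every spatial vector coincides with one steering vector and the cluster centers recover both columns. With overlapped sources or noise the vectors spread out, which is the limitation discussed above.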

When different sources are spectrally overlapped in the STFT domain, the TF point set $\Omega_s$ obtained by (9) may contain types of TF points other than single-source TF points. We illustrate some possible situations of the TF points in $\Omega_s$ in Fig. 3, where an energy threshold is used to pick out the strong TF points. It is seen that the detected strong TF points contain the single-source TF points, the double-source TF points, and the TF points resulting from only noise or noise-dominant source, i.e., the energy of noise is
much greater than that of the source. According to (11) and (12), the spatial vectors of single-source TF points are proved to be ideal candidates for estimation of the mixing matrix; however, this conclusion holds only when the noise power in (2) is negligible. In this paper, we clarify that the desirable TF points for estimating the mixing matrix are not simply those associated with only one source. In the following, we demonstrate that the appropriate TF points are actually those whose dominant energy is from one source rather than from multiple sources or noise, which is justified by analysing the mean square error (MSE) of the steering vectors
of the estimated mixing matrix through (13).

Instead of considering the detected strong points in $\Omega_s$ displayed in Fig. 2 (a) as single-source TF points, we assume the points in $\Omega_s$ are either single-source TF points or double-source TF points (see footnote 4). Supposing the energy at each double-source TF point in $\Omega_s$ is mainly contributed by two STFT values, $S_{s_i}$ and $S_{s_j}$, we define a dominance parameter $\gamma$ as follows

$$
\gamma = \frac{|S_{s_i}|}{|S_{s_j}|}, \quad i, j \in \{1, 2, \ldots, N\}, \tag{14}
$$
where we assume $|S_{s_i}| \ge |S_{s_j}|$. When the value of $\gamma$ is much larger than one, we say this point is dominated by source $i$. Let $[\mathbf{a}_i, \mathbf{a}_j]$ denote the steering vectors of sources $i$ and $j$ at a double-source TF point $(t, f)$; the expression in (2) becomes

$$
\mathbf{S}_x(t,f) =
\begin{bmatrix} S_{x_1}(t,f) \\ \vdots \\ S_{x_M}(t,f) \end{bmatrix}
\approx
\begin{bmatrix} a_{i1} & a_{j1} \\ \vdots & \vdots \\ a_{iM} & a_{jM} \end{bmatrix}
\begin{bmatrix} S_{s_i}(t,f) \\ S_{s_j}(t,f) \end{bmatrix}. \tag{15}
$$

In order to facilitate analysis of the MSE of the estimated steering vector, we consider a scaled steering vector of source $i$ as below

$$
\mathbf{a}_i = \left[ 1,\; e^{-\iota \frac{2\pi}{\lambda} d \sin(\theta_i)},\; \cdots,\; e^{-\iota \frac{2\pi}{\lambda} d (M-1) \sin(\theta_i)} \right]^T. \tag{16}
$$

Footnote 4: Some TF points in $\Omega_s$ might be associated with more than two sources; we disregard this rare case for simplicity.

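In this case, the STFT observation vector of each point $(t, f) \in \Omega_i$ in (13) is approximated as the estimate of the steering vector of source $i$ by making its first element equal to one, i.e.,

$$
\hat{\mathbf{a}}_i(t,f) =
\begin{bmatrix} \hat{a}_{i1} \\ \vdots \\ \hat{a}_{iM} \end{bmatrix}
=
\begin{bmatrix} \dfrac{S_{x_1}(t,f)}{S_{x_1}(t,f)} \\ \vdots \\ \dfrac{S_{x_M}(t,f)}{S_{x_1}(t,f)} \end{bmatrix}
=
\begin{bmatrix} \dfrac{a_{i1} S_{s_i}(t,f) + a_{j1} S_{s_j}(t,f)}{a_{i1} S_{s_i}(t,f) + a_{j1} S_{s_j}(t,f)} \\ \vdots \\ \dfrac{a_{iM} S_{s_i}(t,f) + a_{jM} S_{s_j}(t,f)}{a_{i1} S_{s_i}(t,f) + a_{j1} S_{s_j}(t,f)} \end{bmatrix}. \tag{17}
$$

Defining $\frac{S_{s_i}(t,f)}{S_{s_j}(t,f)} = \gamma \exp(\iota\phi)$, the estimation error of the steering vector of source $i$ can be evaluated by computing the MSE according to (16) and (17)

$$
\begin{aligned}
\mathrm{MSE}_i &= \|\hat{\mathbf{a}}_i(t,f) - \mathbf{a}_i\|^2 \\
&= 0 + \left| \frac{a_{j2} - a_{i2}}{\gamma \exp(\iota\phi) + 1} \right|^2 + \cdots + \left| \frac{a_{jM} - a_{iM}}{\gamma \exp(\iota\phi) + 1} \right|^2 \\
&= \frac{\sum_{m=1}^{M} |a_{jm} - a_{im}|^2}{|\gamma \exp(\iota\phi) + 1|^2} \\
&= \frac{\sum_{m=1}^{M} |a_{jm} - a_{im}|^2}{\gamma^2 + 2\gamma\cos(\phi) + 1},
\end{aligned} \tag{18}
$$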
which is a decreasing function of the dominance parameter when γ ≥ 1. This signifies that the estimation accuracy depends strongly on the dominance parameter γ, i.e., the larger γ is, the lower the MSE. It should be noted that the estimation error is further mitigated because the final estimate of the steering vector of source i is obtained by the averaging operation in (13) over all the detected TF points dominated by source i. The above analysis also applies to single-source TF points by regarding $S_{s_j}$ in (14) as the STFT value of white noise at the TF point (t, f).
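As a quick numerical sanity check of (18), the following Python sketch builds a noise-free double-source observation as in (15), normalizes it by its first element as in (17), and compares the resulting squared error with the closed form. The half-wavelength spacing, the two DOAs, the phase φ = 0.7 and the helper names `steering`, `mse_empirical` and `mse_closed_form` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def steering(theta_deg, M, d_over_lambda=0.5):
    """Scaled ULA steering vector as in (16); the first element equals one."""
    m = np.arange(M)
    phase = 2 * np.pi * d_over_lambda * m * np.sin(np.deg2rad(theta_deg))
    return np.exp(-1j * phase)

def mse_empirical(a_i, a_j, gamma, phi):
    """Squared error of the normalized observation (17) with respect to a_i."""
    s_j = 1.0 + 0j
    s_i = gamma * np.exp(1j * phi) * s_j   # S_si / S_sj = gamma * exp(i * phi)
    x = a_i * s_i + a_j * s_j              # noise-free double-source point, cf. (15)
    a_hat = x / x[0]                       # first element normalized to one
    return np.sum(np.abs(a_hat - a_i) ** 2)

def mse_closed_form(a_i, a_j, gamma, phi):
    """Last line of (18)."""
    return np.sum(np.abs(a_j - a_i) ** 2) / (gamma**2 + 2 * gamma * np.cos(phi) + 1)

# Assumed example: M = 4 sensors, sources at 15 and 45 degrees.
a_i, a_j = steering(15.0, 4), steering(45.0, 4)
for gamma in (1.0, 2.0, 5.0, 20.0):
    print(gamma, mse_empirical(a_i, a_j, gamma, 0.7), mse_closed_form(a_i, a_j, gamma, 0.7))
```

For each γ the two printed values should agree up to rounding, and both shrink as γ grows, which is the behaviour the analysis relies on.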

The aforementioned analysis clarifies that the appropriate TF points for the estimation of the mixing matrix are not the single-source TF points but those whose energy is dominated by a particular source. Three observations can be made based on Fig. 3. Firstly, single-source TF points with a small value of γ are undesirable for mixing matrix estimation. Secondly, the TF points with strong energy in Ωs have a high probability of including double-source TF points. Lastly, some double-source TF points are also useful for precise estimation as long as these points have a sufficiently large value of γ.

In the following, we present a new method to accurately estimate the mixing matrix as well as the number of sources. Compared to existing methods, the proposed method relies on the following relaxed assumptions based on the set Ωs:

• It is assumed that there are at most two sources at each TF point in the set Ωs.
• For each source, we can always find some TF points where the energy from this source is dominant compared to that from other sources and noise.

Based on the previous analysis and discussion, we explain the motivation and the essential idea of the proposed method in the following steps.

1. The proposed method is an extension of the STFT-UBSS method in [23], which has two imperfections. Firstly, the number of sources is assumed to be known. Secondly, the estimation performance of the mixing matrix in [23] is limited when the spectral contents of the sources overlap in the TF domain. However, in practice, the number of sources needs to be estimated and the sources are prone to be spectrally overlapped in the TF domain. As shown in Fig. 1, the spectral contents (strong TF points in white color) of N = 4 sources are non-disjoint, and the 3-D view of these strong TF points is given in Fig. 2 (a). In this case, it is difficult to accurately estimate the actual number of sources by using any clustering method. Even if the number of sources is known, and some clustering method (e.g., k-means) is applied to classify the strong TF points into N clusters whose average spatial vectors are then taken as the estimated mixing matrix A, the existence of the multi-source TF points resulting from the overlapped spectral contents of sources will severely impact the estimation accuracy of A.

   There is no doubt that estimating N conforms to practical requirements and that the accurate estimation of A when sources are spectrally overlapped plays an important role in the final separation performance. However, few efforts have been devoted to addressing these issues. This motivates us to investigate a feasible solution which can accurately estimate N and A, especially for sources that are highly overlapped in the TF domain.

2. Therefore, the main objective of this paper is to propose a robust method of estimating the mixing matrix as well as the number of sources for underdetermined blind separation of non-disjoint sources in the STFT domain. The idea of the proposed method is motivated by the fact that the estimation accuracy can be well guaranteed if the sources are non-overlapped in the STFT domain (all the TF points are single-source points, which are useful for the estimation of N and A), while the performance is limited when the sources gradually overlap (the TF points include single-source and multi-source points). The preceding analysis indicates that the occurrence of multi-source points at the intersection of different sources is the reason why the estimation of N and A becomes difficult. As a result, the basic idea of the proposed method is to extract all or a subset of the single-source TF points from the whole set of strong TF points by getting rid of multi-source TF points and noise TF points.

3. Before introducing how the proposed method extracts single-source TF points, we first analyse the possible situations of auto-source TF points, as illustrated in Fig. 3. Certain auto-source TF points may involve more than two sources; however, this seldom happens and is neglected in our paper. Thus there are mainly three types of TF points: single-source points, double-source points, and noise TF points where only noise exists or the noise energy is dominant over that of the source. It has been shown above that the desirable TF points for estimating the mixing matrix are not exactly the single-source TF points where only one source exists; rather, they are the points where the energy of one source is dominant over that of the other sources and the noise. Such a point is named a dominant TF point, e.g., the last situation of TF points shown in Fig. 3.

4. Consequently, the essential idea of the proposed method is to detect the dominant TF points of each source when the number of sources is unknown. For instance, among the strong TF points in Fig. 1, our aim is to identify a set of effective dominant TF points for each source by eliminating non-dominant TF points. As illustrated in Fig. 3, the problem lies in how to distinguish dominant from non-dominant TF points. Since a dominant TF point is one where the energy of one source is dominant over that of the other sources and noise, we propose a heuristic method which first attempts to obtain the energy of the individual sources at each TF point, and then calculates the energy ratio of the largest one to the rest. Finally, the TF points whose energy ratios are greater than a predefined threshold are selected as dominant TF points.

   The challenge of this method lies in how to obtain the energy of the individual sources at each strong TF point. For simplicity, we assume that there are at most two sources at each TF point, i.e., the energy of each TF point is contributed by two STFT values. Motivated by the idea in [23], these two STFT values can be estimated by means of subspace projection provided that the steering vectors of all the sources are available, which does not mean that we must first obtain the mixing matrix. Actually, the two STFT values at each TF point can be successfully estimated based on a matrix which includes the steering vectors of all the sources.

5. The proposed method of extracting a set of dominant TF points is depicted in Fig. 1. At first, the k-means clustering method is applied to classify the strong TF points given a fixed number of clusters, N0, which is usually assigned a value larger than N. The TF points with similar spatial vectors will be grouped into the same cluster. The k-means clustering result is shown in Fig. 1, where we use various colors to indicate different clusters, based on which we can obtain a matrix A0 which includes at least one pure steering vector of each source. A more detailed description is given in Section 2.2.1. Based on the estimated matrix A0, the two STFT values at each strong TF point are calculated by subspace projection, and the dominant TF points can therefore be selected by computing the energy ratio and defining a dominance threshold, which is elaborated in Section 2.2.2. As shown in Fig. 1, a set of dominant TF points is selected from the strong TF points by subspace analysis. Based on these dominant TF points, the number of sources and the mixing matrix can be easily estimated by using a clustering method which can automatically determine the number of clusters. The clustering result is shown in Fig. 1, where it is clear that the actual number of sources N and the mixing matrix A can be precisely estimated based on the resultant clusters. The detail of this step is presented in Section 2.2.3.

2.2.1. Step 1

Apply the k-means clustering method to classify all the spatial direction vectors in the set Ωs given a fixed number of clusters, N0, that is generally larger than N (the optimal value of N0 will be discussed in the simulation section). A column vector, $\hat{\mathbf{v}}_i$, is calculated by averaging all the spatial direction vectors in the ith cluster in the same way as in (13), so that we obtain an M × N0 matrix A0. The estimated A0 by k-means clustering on Ωs can be mathematically expressed as a combination of various spatial vectors

$$\mathbf{A}_0 = [\hat{\mathbf{v}}_1 \ \ \hat{\mathbf{v}}_2 \ \ \cdots \ \ \hat{\mathbf{v}}_i \ \ \cdots \ \ \hat{\mathbf{v}}_{N_0}] \approx [\, \overbrace{\mathbf{a}_1 \ \ \mathbf{a}_2 \ \ \cdots \ \ \mathbf{a}_N}^{\text{pure}} \ \ \overbrace{\mathbf{a}_{12} \ \ \mathbf{a}_{13} \ \ \cdots \ \ \mathbf{a}_{(N-1)N}}^{\text{mixed}} \ \ \overbrace{\text{others}}^{\text{random}} \,], \tag{19}$$

which denotes that A0 is composed of three possible parts: pure steering vectors from the N sources, mixed spatial vectors among the N sources, and other situations, e.g., distorted spatial vectors due to random noise. This estimated A0 will be used for the identification of dominant TF points in Ωs.

2.2.2. Step 2

The two STFT values at each point ∈ Ωs can be estimated according to (15)

$$\hat{\mathbf{S}}_s(t,f) = \begin{bmatrix} S_{s_{n_1}}(t,f) \\ S_{s_{n_2}}(t,f) \end{bmatrix} = \mathbf{A}_2^{\dagger} \mathbf{S}_x(t,f), \quad (t,f) \in \Omega_s, \tag{20}$$

where $\mathbf{A}_2 = [\mathbf{a}_{n_1}, \mathbf{a}_{n_2}]$ contains the steering vectors of the two most likely sources present at each point in Ωs. For each TF point (t, f) ∈ Ωs, we determine the optimal $\mathbf{a}_{n_1}$ and $\mathbf{a}_{n_2}$ from the estimated A0 by minimizing the following subspace projection

$$\{\mathbf{a}_{n_1}, \mathbf{a}_{n_2}\} = \arg\min_{\mathbf{a}_{m_1}, \mathbf{a}_{m_2}} \left\| \mathbf{P} \mathbf{S}_x(t,f) \right\|, \tag{21}$$

where $\mathbf{P} = \mathbf{I} - \tilde{\mathbf{A}}_2 (\tilde{\mathbf{A}}_2^H \tilde{\mathbf{A}}_2)^{-1} \tilde{\mathbf{A}}_2^H$ is the orthogonal projection matrix onto the noise subspace of $\tilde{\mathbf{A}}_2$, and $\tilde{\mathbf{A}}_2 = [\mathbf{a}_{m_1}, \mathbf{a}_{m_2}]$, where $m_1, m_2 \in \{1, \ldots, N_0\}$ and $\mathbf{a}_{m_1}$ and $\mathbf{a}_{m_2}$ are two arbitrary column vectors of the matrix A0. A set of dominant TF points is selected by assigning a dominance threshold γ0, i.e.,

$$\frac{\max\{|S_{s_{n_1}}(t,f)|, |S_{s_{n_2}}(t,f)|\}}{\min\{|S_{s_{n_1}}(t,f)|, |S_{s_{n_2}}(t,f)|\}} > \gamma_0. \tag{22}$$

The TF points in Ωs satisfying (22) are collected in the set Ωd, and we have Ωd ⊂ Ωs ⊂ Ωa.

2.2.3. Step 3

Finally, the mean-shift clustering method [39], which does not require the number of sources to be known, is applied to the set Ωd. Fig. 2 (b) and (c) display the mean-shift clustering results of dominant TF points by choosing γ0 = 10 and γ0 = 30, respectively. The number of sources is determined by the number of clusters, and the mixing matrix can be obtained by averaging all the direction vectors in each cluster. The ideal centroids given by the N steering vectors in A, marked by red circles in Fig. 2 (b) and (c), verify that the dominant TF points can accurately estimate the steering vectors, since the clusters lie close to these ideal centroids. We also note that different clusters are clearly separated in the spatial domain, which guarantees a correct estimation of the number of sources. The proposed estimation method is summarized in Table 1.
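A minimal sketch of Step 3 is given below, assuming the dominant direction vectors are already available as rows of a complex array whose first element has been normalized to one. It uses a plain flat-kernel mean shift written in NumPy, not necessarily the variant of [39]; the bandwidth value, the synthetic data and the function names `mean_shift_labels` and `estimate_n_and_a` are assumptions made for illustration.

```python
import numpy as np

def mean_shift_labels(points, bandwidth, n_iter=30):
    """Flat-kernel mean shift on real-valued rows; returns one cluster label per row."""
    modes = points.copy()
    for _ in range(n_iter):
        for k in range(len(modes)):
            # Shift each mode to the mean of the data points inside its window.
            inside = np.linalg.norm(points - modes[k], axis=1) < bandwidth
            modes[k] = points[inside].mean(axis=0)
    # Modes that ended up (nearly) at the same location form one cluster.
    centers, labels = [], np.empty(len(modes), dtype=int)
    for k, mode in enumerate(modes):
        for c, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth:
                labels[k] = c
                break
        else:
            centers.append(mode)
            labels[k] = len(centers) - 1
    return labels

def estimate_n_and_a(directions, bandwidth):
    """Number of clusters gives N; per-cluster averages give the columns of A."""
    feats = np.hstack([directions.real, directions.imag])  # complex -> real features
    labels = mean_shift_labels(feats, bandwidth)
    n_hat = int(labels.max()) + 1
    a_hat = np.stack([directions[labels == c].mean(axis=0) for c in range(n_hat)], axis=1)
    return n_hat, a_hat

# Synthetic data (assumed): three sources on a 4-sensor half-wavelength ULA.
rng = np.random.default_rng(0)
doas = np.deg2rad([5.0, 30.0, 75.0])
A = np.exp(-1j * np.pi * np.outer(np.arange(4), np.sin(doas)))   # M x N
obs = np.repeat(A.T, 40, axis=0)                                  # 40 points per source
obs = obs + 0.02 * (rng.standard_normal(obs.shape) + 1j * rng.standard_normal(obs.shape))
obs = obs / obs[:, :1]                                            # first element set to one
n_hat, A_hat = estimate_n_and_a(obs, bandwidth=0.5)
print(n_hat)
```

The bandwidth plays the role that the number of clusters plays in k-means: it must be smaller than the spacing between clusters and larger than their spread, so in practice it would need tuning to the data.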

It should be mentioned that the two steering vectors of A2 in (20) for each TF point in Ωs can be effectively selected from the column vectors of A0 via the minimization process in (21). The optimal $\mathbf{a}_{n_1}$ and $\mathbf{a}_{n_2}$ will be selected from the first part of A0 since the minimization process will automatically choose the pure steering vectors. Specifically, for the TF points with two sources i and j ∈ {1, . . . , N}, the resultant $\mathbf{a}_{n_1}$ and $\mathbf{a}_{n_2}$ obtained from (21) will be the purest vectors of source i and source j among the first part of A0. For single-source TF points with source i in Ωs, one of $\mathbf{a}_{n_1}$ and $\mathbf{a}_{n_2}$ comes from the first part of A0, whereas the other one could be any column vector of A0 due to random noise. However, this random spatial vector will not have a detrimental effect on the ratio computation in (22). Note that it is not necessary to know which sources correspond to the two optimized steering vectors in (21), because our purpose here is to extract a set of dominant TF points from Ωs, which can be determined by a coarse estimation of the STFT values at each point in Ωs.

Different values of T1 and γ0 mainly influence the quantity of final dominant TF points (see Fig. 2 (b) and (c)). The value of N0 used to form the matrix A0 can be chosen from a wide range, because the two STFT values in (20) can be successfully estimated as long as A0 contains at least one pure steering vector for each individual source. The proposed estimation method can be straightforwardly extended to the case where some TF points ∈ Ωs include more than two sources by assuming the energy at each point is contributed by more than two STFT values. In this case, the dominant TF points are selected by computing the energy ratio between the largest one and the rest.
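The selection rule of (20)–(22) can be sketched as follows. This is only an illustrative NumPy sketch under stated assumptions: A0 is taken as given (here it simply holds five ideal steering vectors rather than k-means centroids), every pair of columns is searched exhaustively, and the names `best_pair` and `is_dominant` are hypothetical.

```python
import itertools
import numpy as np

def best_pair(x, A0):
    """Search all column pairs of A0 for the one minimizing ||P x|| as in (21)."""
    best_res, best_idx, best_s = np.inf, None, None
    for m1, m2 in itertools.combinations(range(A0.shape[1]), 2):
        A2 = A0[:, [m1, m2]]
        # s = pinv(A2) x as in (20); the residual x - A2 s equals P x.
        s = np.linalg.pinv(A2) @ x
        res = np.linalg.norm(x - A2 @ s)
        if res < best_res:
            best_res, best_idx, best_s = res, (m1, m2), s
    return best_idx, best_s

def is_dominant(x, A0, gamma0):
    """Dominance test (22), written without a division to tolerate a zero minimum."""
    _, s = best_pair(x, A0)
    mags = np.abs(s)
    return bool(mags.max() > gamma0 * mags.min())

# Toy example (assumed): M = 4 sensors, half-wavelength ULA, ideal steering vectors in A0.
doas = np.deg2rad([5.0, 15.0, 30.0, 45.0, 75.0])
A0 = np.exp(-1j * np.pi * np.outer(np.arange(4), np.sin(doas)))
x_dominant = 10.0 * A0[:, 0] + 1.0 * A0[:, 2]    # first source ten times stronger
x_balanced = 1.0 * A0[:, 0] + 0.8j * A0[:, 2]    # comparable magnitudes
print(best_pair(x_dominant, A0)[0],
      is_dominant(x_dominant, A0, 5.0),
      is_dominant(x_balanced, A0, 5.0))
```

The exhaustive search costs N0(N0 − 1)/2 pseudo-inverses per TF point, which is affordable for the moderate values of N0 considered here. For more than two sources per point, the same idea would apply with larger column subsets and the ratio of the largest magnitude to the rest.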

3. Numerical Results

Simulation results on various speech data from the NOIZEUS database [44] are presented to illustrate the efficiency and robustness of the proposed STFT-UBSS algorithm in the estimation of N and A, as well as the performance of underdetermined source separation. The proposed algorithm is compared with the UBSS algorithms reported in [32] and [23], and the PCA-based de-noising method is compared with the enhancement algorithm reported in [29]. The estimation accuracy of the mixing matrix and separated sources is assessed via the normalized MSE (NMSE) and the evaluation criteria proposed in [45, 46]: the mixing error ratio (MER) for the assessment of the mixing matrix, and the signal to distortion ratio (SDR), signal to interference ratio (SIR) and signal to artifacts ratio (SAR) for the assessment of separated sources.
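For orientation, one common way of computing a normalized MSE in decibels is sketched below. The exact normalization used in the simulations is not spelled out in this section, so the definition here (squared error divided by the squared norm of the reference) and the function name `nmse_db` are assumptions; it also presumes the estimate has already been aligned with the reference in column order and scaling.

```python
import numpy as np

def nmse_db(estimate, reference):
    """Normalized squared error in dB: ||estimate - reference||^2 / ||reference||^2."""
    estimate = np.asarray(estimate)
    reference = np.asarray(reference)
    err = np.linalg.norm(estimate - reference) ** 2   # Frobenius norm for matrices
    ref = np.linalg.norm(reference) ** 2
    return 10.0 * np.log10(err / ref)

# A 10 % amplitude error on every entry corresponds to about -20 dB.
reference = np.ones(4)
print(nmse_db(1.1 * reference, reference))
```

The same function applies to a mixing matrix or to a separated source signal, since `np.linalg.norm` handles vectors and matrices alike.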


We consider a uniform linear array of M = 4 sensors separated by half a wavelength. The number of speech sources is N = 5 and they are received from different DOAs: θ1 = 5°, θ2 = 15°, θ3 = 30°, θ4 = 45°, and θ5 = 75°. The speech duration is 2 s, the sampling rate is 8 kHz, and the speech sources are highly overlapped in the STFT domain. The window length of the STFT is L = 512 samples, and the overlap is 480 samples. The proposed UBSS algorithm is evaluated on synthetic speech mixtures generated with the NOIZEUS database, and each speech source has the same level.

3.1. Estimation of N and A

The number of sources N can be determined by counting the number of resultant clusters using some clustering method on the detected set Ωd illustrated in Fig. 2. For comparison purposes, Table 2 presents the performance of the estimated number of sources $\hat{N}$ versus different SNR levels, obtained by applying the mean-shift clustering method on Ωs and Ωd, where N = 5 and we choose T1 = 0.3, N0 = 12, and γ0 = 5. To avoid the interference of outliers, the resultant clusters with a small number of points are discarded (during our simulation, we found that, with a very low probability, outliers might interfere with the clustering results of the mean-shift method, but most of the time the number of outliers is 1 or 2; therefore, we set the threshold value to 5, which is an empirical value). It is noted that the proposed method (mean-shift on Ωd) can precisely estimate the number of sources even in a low SNR situation, e.g., SNR = −5 dB. This is due to the elimination of non-dominant TF points, as shown in Fig. 2 (b) and (c). In contrast, the mean-shift clustering on Ωs, which includes non-dominant TF points (shown in Fig. 2 (a)), tends to overestimate the number of sources, and can hardly estimate the actual number of clusters correctly.

It is worth discussing the sensitivity of the proposed estimation method to different selections of parameters, i.e., T1, N0 and γ0. Fig. 4 shows the correct estimation probabilities of N vs. SNR levels with different parameter settings, where the SNR levels tested are from −5 dB to 20 dB. It is seen that the proposed estimation method can tolerate a certain range of parameter variation: T1 ∈ [0.2, 0.5], N0 ∈ [8, 20], and γ0 ∈ [5, 20]. Specifically, the proposed method can achieve almost perfect estimation performance when the SNR level is greater than 10 dB, and the performance tends to degrade gradually as the SNR level decreases. Even so, the probability of correct estimation is still higher than 90%. Regarding Fig. 4 (c), it should be explained that the method might fail to detect any dominant TF point for a large value of γ0, especially at low SNRs. In this case, we cannot implement the estimation method, which explains the incomplete curves in Fig. 4 (c). As a result, a relatively large value of T1 can be selected to reduce the computational complexity. However, too large a value of T1, such as greater than 0.5, might provide insufficient information for the estimation of N. Our experiments have indicated that the appropriate selection range of T1 is between 0.2 and 0.4 and the value of N0 can be up to several times the true number of sources N. With regard to the selection range of γ0, a common threshold value of γ0 for all the SNR levels is around 5 since a large value (≥ 10) is inappropriate for low SNR cases. The reason is that the increase of the noise power will inevitably decrease the value of the dominance parameter γ, and therefore reduce the quantity of dominant TF points.

To evaluate the estimation performance of A, we apply the k-means clustering method on Ωs and Ωd assuming the number of sources is known, and then the steering vectors of the sources are estimated by averaging the spatial vectors of the TF points in the same cluster. The estimation performance versus SNR levels based on Ωs and Ωd under different parameter settings is shown in Fig. 5, where the normalized MSE at each SNR level is averaged over 50 trials. The effect of the parameter variation on the estimation of A is similar to that of N. We observe that the performance based on Ωs is limited due to the inaccurate detection of single-source TF points. In contrast, the desirable performance of the proposed method based on Ωd verifies that the dominant TF points are the appropriate TF points for accurate estimation of A. The performance gain of the proposed algorithm compared to the algorithm in [23] is dependent on the specific speech data. Fig. 6 (a) gives the performance of the estimated $\hat{\mathbf{A}}$ for three sets of different speech data. Data 1 and Data 2 have highly spectrally-overlapped speech sources, whereas Data 3 has a limited amount of overlap in the TF domain. It is concluded that the more the speech data overlap in the TF domain, the larger the performance gain we can achieve, i.e., the proposed algorithm is especially suitable for the separation of sources which significantly overlap in the TF domain.

3.2. Performance of UBSS

According to the aforementioned analysis of the impact of parameter variations on the estimation performance of N and A, let us set T1 = 0.3, N0 = 12, and γ0 = 8 for the evaluation of the UBSS performance. Since we assume the noise power is not available, a relatively small threshold value of T0 = 0.02 in (3) is used for the detection of the auto-source TF point set Ωa. It is assumed that there are at most K = 2 sources present at each auto-source TF point ∈ Ωa.

The proposed UBSS algorithm is compared to the algorithm in [23] in terms of NMSEs and the various evaluation criteria in [45, 46]. To reveal the advantage of using PCA for signal de-noising in the detection of auto-source TF points, the speech enhancement algorithm reported in [29] is used to de-noise each mixture, and the auto-source TF points are detected based on the sum of the STFTs of the de-noised mixtures. The performance comparison of different UBSS algorithms including the computation time is presented in Table 3 (5 trials for each algorithm). To evaluate the detection performance of auto-source TF points via PCA and the algorithm in [29], we also give the simulation result obtained by assuming the auto-source TF point set Ωa is known⁵. The computation time is recorded by running the Matlab code on a PC with a 3.1 GHz Intel Core i5 processor. Note that the UBSS algorithms considering the de-noising operations and using the dominant set Ωd substantially outperform the original UBSS algorithm in [23] in terms of the estimation of A and S. The UBSS algorithm based on PCA and Ωd achieves performance comparable to that achieved by the algorithm based on the filtering method in [29]. However, the PCA uses much less computation time than the filtering-based algorithm. Although the PCA-based algorithm does not remove the noise effect at the auto-source TF region for source recovery, the filtering-based algorithm not only increases computational complexity but also somewhat destroys the signal structure. In addition, we note that the performance of the PCA-based algorithm using the set Ωd is close to that of the case when a known Ωa is assumed.

⁵ In simulation, the known set Ωa is determined by detecting the TF points based on the noise-free STFT image.

Fig. 6 (b) displays the UBSS performance comparison of the proposed algorithm with the two existing methods in [32] and [23], using the three speech data sets mentioned for Fig. 6 (a). In [32], the authors proposed a demixing algorithm that separates an arbitrary number of sources using two anechoic mixtures. However, this method is only suitable for two speech mixtures. Therefore, the authors in [23] presented a cluster-based UBSS algorithm to extend the method in [32] to an arbitrary number of mixtures greater than one. On the one hand, it is observed that the proposed PCA-based method achieves a desirable performance gain compared to the original STFT-UBSS algorithm in [23], particularly at low SNRs. Although the proposed method has no significant performance improvement compared to the one in [23] when the SNR level increases, the latter uses prior information on the number of sources. On the other hand, although the extended method based on [32] outperforms the one in [23] at low SNRs, it encounters performance bottlenecks when the SNR level increases. The performance limitation of the extended method in [32] lies in that it requires the speech sources to be approximately W-disjoint orthogonal in the TF domain, whereas the speech data in our simulation do not satisfy this condition (Data 1 and Data 2 have highly spectrally-overlapped speech sources, whereas Data 3 has a limited amount of overlap in the TF domain).

It should be noted that although the estimation accuracy of the mixing matrix of our method at low SNRs is much higher than that of the other methods at high SNRs, the separation performance of our method at low SNRs is still worse than that of the other methods at high SNRs. The reason is that the separation performance of the proposed method follows the change of SNR: it is mainly dependent on the SNR level and only partially dependent on the estimation accuracy of the mixing matrix. In other words, even though we continuously improve the estimation of A, the separation performance might be limited due to the SNR level or the algorithmic mechanism.

3.3. More sources with small spacing separation

Finally, let us consider the challenging case with a fixed number of sensors M = 4 and a large number of sources, i.e., N ∈ {5, 6, 7, 8}. In this case, it is quite possible that some sources have close DOAs in the space domain, and the overlapped regions of sources in the STFT domain increase as the number of sources increases. We assume the DOAs of the eight speech sources are θ1 = 5°, θ2 = 10°, θ3 = 15°, θ4 = 20°, θ5 = 30°, θ6 = 40°, θ7 = 45°, and θ8 = 75°, respectively.

The estimation results in terms of N, A and S as a function of the number of sources are compared in Fig. 7. It is observed that the proposed algorithm can accurately estimate the number of sources as well as successfully implement the UBSS for the cases involving a large number of sources, and is more robust to the increase of the number of sources than the algorithm in [23].

Under the assumption that the number of sources is known, the performance gain between the proposed algorithm and the one in [23] can be further enlarged by increasing the number of sensors (for low SNR cases) or using a larger value of K (for high SNR cases). Fig. 8 (a) shows the NMSEs of estimated sources as a function of different sizes of the mixing matrix at SNR = 5 dB. It is seen that the performance gain between the proposed algorithm and the one in [23] gradually increases as the number of sensors increases. As illustrated in Fig. 6 (b) and Fig. 7 (c), the proposed algorithm has source separation performance similar to that of the one in [23], i.e., a large gain in the estimation of A does not give rise to a significant gain in the estimation of S. We then assume there are at most K = 3 sources present at each auto-source TF point instead of assuming K = 2, and the resulting source separation performance at SNR = 40 dB is shown in Fig. 8 (b), where we see that the assumption of K = 3 outperforms the case assuming K = 2 for the proposed algorithm due to significant overlaps in the TF domain. However, the superiority of the algorithm in [23] assuming K = 3 disappears when N > 5, because the performance of source separation becomes more sensitive to the estimation accuracy of the mixing matrix in the case of a large number of overlapped sources.


4. Discussion

To provide further insight into the proposed algorithm, let us discuss its advantages and limitations. Compared to existing UBSS algorithms, the proposed algorithm exhibits the following advantages:

• The performance of the algorithm in [23] is limited in low SNR environments because many TF points from noise are detected as auto-source TF points due to the inappropriate choice of the threshold value. Instead of applying a filtering operation as in [29] to mitigate the noise effect, we advocate the PCA not only due to its simplicity and low computational complexity, but also due to its suitability for various types of source signals, as the method in [29] is limited to speech signals. Moreover, we can avoid the estimation of noise power and pitch frequency by using the PCA.
• We propose an estimation method for the complex-valued mixing matrix based on the detection of dominant TF points, which are demonstrated to be appropriate candidates for accurate estimation of A. Although the detection of single-source TF points based on bilinear TFDs has been investigated in the literature [26], the detection of dominant TF points based on the linear STFT is addressed for the first time in our study for the estimation of the complex-valued mixing matrix.
• Most existing UBSS algorithms assume a known number of sources, N; however, practical applications may not provide accurate information on the value of N. Based on the detected dominant TF points in Fig. 2, existing clustering methods can easily obtain a correct estimate of N.
• The proposed estimation method is insensitive to parameter variation, and is effective especially when the overlapped regions of sources in the TF domain become significant.

There also exist some issues which are worth further investigation:

• The detection of auto-source TF points in low SNR environments could be realized using manifold learning (ML) techniques instead of the PCA. Specifically, by learning a nonlinear manifold on multiple STFT images in spatial space, the clean STFT image embedded in noise can be obtained. The principle is that manifold learning enhances the deterministic structure (signal) and ignores the random structure (noise). The de-noising operation by ML has been demonstrated to outperform some traditional filtering methods [47]. Although ML techniques are verified to be effective for dimensionality reduction, they suffer from high computational complexity.
• For simplicity, the algorithm in [23] assumes there are a fixed number K of sources present at each auto-source TF point, which implies that the noise energy will contribute to the source synthesis. Thus, it is better to determine the exact number of sources at each auto-source TF point, particularly in a low SNR environment.
• In our study, it is assumed that all sources have comparable energy. The proposed STFT-UBSS algorithm cannot deal well with the case where the sources have very different energies. Specifically, the algorithm might fail to detect any dominant TF point for a source with very weak energy, and therefore fail to estimate the steering vector of this source.
• It should be mentioned that the proposed UBSS algorithm will fail in separating convolutive mixtures. The reason is that the mixing matrix stays unchanged in the instantaneous case, which is why we can successfully obtain well-localized clusters to estimate N and A, as shown in Fig. 2 (b) and (c). In contrast, the channel matrices in the convolutive case depend on frequency [48], i.e., they have different values in different frequency bands. Therefore, the proposed method of estimating N and A under the assumption of a fixed mixing matrix is not suitable for the convolutive case.

Further work could consider convolutive mixtures [48, 49, 50, 51, 52, 53, 54, 55, 56] and the calibration error of array sensors [57, 58, 59, 60]. Besides, the proposed algorithm assumes there always exist some dominant TF points for each source, which might not hold for multipath fading signals. How to successfully separate source signals which almost superpose in the TF domain needs more research attention [61].

5. Conclusion

To relax the limitations of existing UBSS algorithms for sources that are highly overlapped in the TF domain and for an unknown number of sources, this paper proposes a noise-robust algorithm for underdetermined blind separation of non-disjoint sources in the STFT domain. The proposed method for estimation of the complex-valued mixing matrix simultaneously provides an estimate of the number of sources. The accurate estimation of the mixing matrix and the auto-source TF points improves the speech separation performance in a totally blind environment. Although the proposed algorithm is herein intended for speech-related applications, it may also be applied in the areas of radar, communication and sonar.

Acknowledgment

The authors would like to thank Prof. Aïssa-El-Bey et al. for kindly providing us with the program code of their paper [23]. The authors would also like to thank the anonymous reviewers and the associate editor for their contributions in greatly improving the quality of our paper.

References

[1] B. Gao, W. L. Woo, S. S. Dlay, Adaptive sparsity non-negative matrix factorization for single-channel source separation, IEEE Journal of Selected Topics in Signal Processing 5 (5) (2011) 989–1001.

[2] B. Gao, W. L. Woo, S. S. Dlay, Unsupervised single-channel separation of nonstationary signals using gammatone filterbank and Itakura-Saito nonnegative matrix two-dimensional factorizations, IEEE Transactions on Circuits and Systems I: Regular Papers 60 (3) (2013) 662–675.

[3] N. Tengtrairat, W. L. Woo, S. S. Dlay, B. Gao, Online noisy single-channel source separation using adaptive spectrum amplitude estimator and masking, IEEE Trans. Signal Process. 64 (7) (2016) 1881–1895.

[4] G. R. Naik, W. Wang, Blind Source Separation: Advances in Theory, Algorithms and Applications, Springer-Verlag Berlin Heidelberg, 2014.

[5] T. Xu, W. Wang, W. Dai, Sparse coding with adaptive dictionary learning for underdetermined blind speech separation, Speech Communication 55 (3) (2013) 432–450.

[6] X. Chen, W. Wang, Y. Wang, X. Zhong, A. Alinaghi, Reverberant speech separation with probabilistic time-frequency masking for B-format recordings, Speech Communication 68 (2015) 41–54.

[7] P. Pertila, J. Nikunen, Distant speech separation using predicted time-frequency masks from spatial features, Speech Communication 68 (2015) 97–106.

[8] P. Bofill, M. Zibulevsky, Underdetermined blind source separation using sparse representations, Signal Processing 81 (11) (2001) 2353–2362.

[9] S. Araki, H. Sawada, R. Mukai, S. Makino, Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors, Signal Processing 87 (8) (2007) 1833–1847.

[10] K. Abed-Meraim, Y. Xiang, J. H. Manton, Y. Hua, Blind source-separation using second-order cyclostationary statistics, IEEE Trans. Signal Process. 49 (4) (2001) 694–701.

[11] H. Zhang, G. Bi, S. G. Razul, C.-M. See, Estimation of underdetermined mixing matrix with unknown number of overlapped sources in short-time Fourier transform domain, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6486–6490.

[12] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, E. Moulines, A blind source separation technique using second-order statistics, IEEE Trans. Signal Process. 45 (2) (1997) 434–444.

[13] A. Belouchrani, M. Amin, Blind source separation based on time-frequency signal representations, IEEE Trans. Signal Process. 46 (11) (1998) 2888–2897.

[14] H. Zhang, L. Yu, G. S. Xia, Iterative time-frequency filtering of sinusoidal signals with updated frequency estimation, IEEE Signal Process. Lett. 23 (1) (2016) 139–143.

[15] H. Zhang, G. Bi, W. Yang, S. G. Razul, C. M. S. See, IF estimation of FM signals based on time-frequency image, IEEE Trans. Aerosp. Electron. Syst. 51 (1) (2015) 326–343.

[16] H. Zhang, G. Bi, S. G. Razul, C. M. S. See, Robust time-varying filtering and separation of some nonstationary signals in low SNR environments, Signal Process. 106 (2015) 141–158.

[17] C. Fevotte, C. Doncarli, Two contributions to blind source separation using time-frequency distributions, IEEE Signal Process. Lett. 11 (3) (2004) 386–389.

[18] A. Belouchrani, K. Abed-Meraim, M. Amin, A. Zoubir, Blind separation of nonstationary sources, IEEE Signal Process. Lett. 11 (7) (2004) 605–608.

[19] E. Fadaili, N. Moreau, E. Moreau, Nonorthogonal joint diagonalization/zero diagonalization for source separation based on time-frequency distributions, IEEE Trans. Signal Process. 55 (5) (2007) 1673–1687.

[20] L. Cirillo, A. Zoubir, M. Amin, Blind source separation in the time-frequency domain based on multiple hypothesis testing, IEEE Trans. Signal Process. 56 (6) (2008) 2267–2279.

[21] W. Mu, M. Amin, Y. Zhang, Bilinear signal synthesis in array processing, IEEE Trans. Signal Process. 51 (1) (2003) 90–100.

[22] N. Linh-Trung, A. Belouchrani, K. Abed-Meraim, B. Boashash, Separating more sources than sensors using time-frequency distributions, EURASIP J. Appl. Signal Process. 2005 (17) (2005) 2828–2847.

[23] A. Aïssa-El-Bey, N. Linh-Trung, K. Abed-Meraim, A. Belouchrani, Y. Grenier, Underdetermined blind separation of nondisjoint sources in the time-frequency domain, IEEE Trans. Signal Process. 55 (3) (2007) 897–907.

[24] D. Peng, Y. Xiang, Underdetermined blind source separation based on relaxed sparsity condition of sources, IEEE Trans. Signal Process. 57 (2) (2009) 809–814.

[25] S. Xie, L. Yang, J. Yang, G. Zhou, Y. Xiang, Time-frequency approach to underdetermined blind source separation, IEEE Trans. Neural Netw. Learn. Syst. 23 (2) (2012) 306–316.

[26] A. Belouchrani, M. Amin, N. Thirion-Moreau, Y. Zhang, Source separation and localization using time-frequency distributions: An overview, IEEE Signal Process. Mag. 30 (6) (2013) 97–107.

[27] S. M. Aziz-Sbaï, A. Aïssa-El-Bey, D. Pastor, Robust underdetermined blind audio source separation of sparse signals in the time-frequency domain, in: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 3716–3719.

[28] S. M. Aziz-Sbaï, A. Aïssa-El-Bey, D. Pastor, Contribution of statistical tests to sparseness-based blind source separation, EURASIP Journal on Advances in Signal Processing 2012 (1) (2012) 1–15.

[29] Y. Andrianakis, P. White, A speech enhancement algorithm based on a Chi MRF model of the speech STFT amplitudes, IEEE Trans. Audio, Speech, Lang. Process. 17 (8) (2009) 1508–1517.

[30] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, H. Sawada, A multichannel MMSE-based framework for speech source separation and noise reduction, IEEE Trans. Audio, Speech, Lang. Process. 21 (9) (2013) 1913–1928.

[31] A. Jourjine, S. Rickard, O. Yilmaz, Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures, in: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, 2000, pp. 2985–2988.

[32] O. Yilmaz, S. Rickard, Blind separation of speech mixtures via time-frequency masking, IEEE Trans. Signal Process. 52 (7) (2004) 1830–1847.

[33] F. Abrard, Y. Deville, A time-frequency blind signal separation method applicable to underdetermined mixtures of dependent sources, Signal Process. 85 (7) (2005) 1389–1403.

[34] Y. Li, S. Amari, A. Cichocki, D. Ho, S. Xie, Underdetermined blind source separation based on sparse representation, IEEE Trans. Signal Process. 54 (2) (2006) 423–437.

[35] S. Kim, C. Yoo, Underdetermined blind source separation based on subspace representation, IEEE Trans. Signal Process. 57 (7) (2009) 2604–2614.

[36] V. Reju, S. N. Koh, I. Y. Soon, An algorithm for mixing matrix estimation in instantaneous blind source separation, Signal Process. 89 (9) (2009) 1762–1773.

[37] J. J. Thiagarajan, K. N. Ramamurthy, A. Spanias, Mixing matrix estimation using discriminative clustering for blind source separation, Digital Signal Process. 23 (1) (2013) 9–18.

[38] Y. Luo, W. Wang, J. Chambers, S. Lambotharan, I. Proudler, Exploitation of source nonstationarity in underdetermined blind source separation with advanced clustering techniques, IEEE Trans. Signal Process. 54 (6) (2006) 2198–2212.

[39] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603–619.

[40] D. Griffin, J. Lim, Signal estimation from modified short-time Fourier transform, IEEE Transactions on Acoustics, Speech and Signal Process. 32 (2) (1984) 236–243.

[41] I. Jolliffe, Principal Component Analysis, Springer Series in Statistics, Springer, 2002.

[42] D. Lunga, S. Prasad, M. Crawford, O. Ersoy, Manifold-learning-based feature extraction for classification of hyperspectral data: A review of advances in manifold learning, IEEE Signal Process. Mag. 31 (1) (2014) 55–66.

[43] B. C. Geiger, G. Kubin, Relative information loss in the PCA, in: Proc. IEEE Information Theory Workshop, 2012, pp. 562–566.

[44] Y. Hu, P. Loizou, Evaluation of objective quality measures for speech enhancement, IEEE Trans. Audio, Speech, Lang. Process. 16 (1) (2008) 229–238.

[45] E. Vincent, R. Gribonval, C. Fevotte, Performance measurement in blind audio source separation, IEEE Trans. Audio, Speech, Lang. Process. 14 (4) (2006) 1462–1469.

[46] E. Vincent, S. Araki, F. Theis, G. Nolte, P. Bofill, H. Sawada, A. Ozerov, V. Gowreesunker, D. Lutter, N. Q. Duong, The signal separation evaluation campaign (2007-2010): Achievements and remaining challenges, Signal Processing 92 (8) (2012) 1928–1936.

[47] Q. He, Y. Liu, Q. Long, J. Wang, Time-frequency manifold as a signature for machine health diagnosis, IEEE Trans. Instrum. Meas. 61 (5) (2012) 1218–1230.

[48] A. Aïssa-El-Bey, K. Abed-Meraim, Y. Grenier, Blind separation of underdetermined convolutive mixtures using their time-frequency representation, IEEE Trans. Audio, Speech, Lang. Process. 15 (5) (2007) 1540–1550.

[49] A. Blin, S. Araki, S. Makino, Underdetermined blind separation of convolutive mixtures of speech using time-frequency mask and mixing matrix estimation, IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E88-A (7) (2005) 1693–1700.

[50] V. Reju, S. N. Koh, I. Soon, Underdetermined convolutive blind source separation via time-frequency masking, IEEE Trans. Audio, Speech, Lang. Process. 18 (1) (2010) 101–116.

[51] A. Alinaghi, W. Wang, P. J. Jackson, Spatial and coherence cues based time-frequency masking for binaural reverberant speech separation, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 684–688.

[52] A. Alinaghi, P. J. Jackson, Q. Liu, W. Wang, Joint mixing vector and binaural model based stereo source separation, IEEE/ACM Trans. Audio, Speech, Lang. Process. 22 (9) (2014) 1434–1448.

[53] H. Sawada, S. Araki, R. Mukai, S. Makino, Blind extraction of dominant target sources using ICA and time-frequency masking, IEEE Trans. Audio, Speech, Lang. Process. 14 (6) (2006) 2165–2173.

[54] H. Sawada, S. Araki, R. Mukai, S. Makino, Grouping separated frequency components by estimating propagation model parameters in frequency-domain blind source separation, IEEE Trans. Audio, Speech, Lang. Process. 15 (5) (2007) 1592–1604.

[55] H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment, IEEE Trans. Audio, Speech, Lang. Process. 19 (3) (2011) 516–527.

[56] H. Sawada, H. Kameoka, S. Araki, N. Ueda, Multichannel extensions of non-negative matrix factorization with complex-valued data, IEEE Trans. Audio, Speech, Lang. Process. 21 (5) (2013) 971–982.

[57] L. Zhao, G. Bi, L. Wang, H. Zhang, An improved auto-calibration algorithm based on sparse Bayesian learning framework, IEEE Signal Process. Lett. 20 (9) (2013) 889–892.

[58] L. Zhao, L. Wang, G. Bi, L. Yang, An autofocus technique for high-resolution inverse synthetic aperture radar imagery, IEEE Trans. Geosci. Remote Sens. 52 (10) (2014) 6392–6403.

[59] L. Zhao, L. Wang, G. Bi, L. Zhang, H. Zhang, Robust frequency-hopping spectrum estimation based on sparse Bayesian method, IEEE Transactions on Wireless Communications 14 (2) (2015) 781–793.

[60] L. Zhao, L. Wang, L. Yang, A. M. Zoubir, G. Bi, The race to improve radar imagery: An overview of recent progress in statistical sparsity-based techniques, IEEE Signal Process. Mag. 33 (6) (2016) 85–102.

[61] G. Fabrizio, A. Farina, An adaptive filtering algorithm for blind waveform estimation in diffuse multipath channels, IET Radar, Sonar & Navigation 5 (3) (2011) 322–330.

Table 1: Estimation method of the mixing matrix and the number of sources

| Step | Description |
|---|---|
| Step 1 | According to (9), detect a group of strong TF points ∈ Ωs, and then calculate the spatial vector of each point in Ωs by (11). Use the k-means clustering method on Ωs to obtain A0 shown in (19). |
| Step 2 | Based on A0 from Step 1, extract a group of dominant TF points ∈ Ωd from Ωs by successively implementing (21), (20) and (22). |
| Step 3 | Apply the mean-shift clustering method to Ωd. The number of resulting clusters gives the estimate of N, and the average of the spatial vectors of each cluster gives the estimate of A. |

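Table 2: Probabilities of estimation of the number of sources using mean-shift clustering for 200 trials (N = 5, M = 4)

Mean-shift on Ωs (for comparison purpose)

| SNR | N̂ = 3 | N̂ = 4 | N̂ = 5 | N̂ = 6 | N̂ = 7 | N̂ = 8 | N̂ ≥ 9 |
|---|---|---|---|---|---|---|---|
| 20 dB | 0 | 0 | 0 | 0 | 0 | 0 | 100% |
| 15 dB | 0 | 0 | 0 | 0 | 0 | 0 | 100% |
| 10 dB | 0 | 0 | 0 | 0 | 0 | 0 | 100% |
| 5 dB | 0 | 0 | 0 | 0 | 0 | 0 | 100% |
| 0 dB | 0 | 0 | 0 | 0 | 0 | 0 | 100% |
| -5 dB | 0 | 0 | 0 | 0 | 0 | 0 | 100% |

Mean-shift on Ωd (the proposed method)

| SNR | N̂ = 3 | N̂ = 4 | N̂ = 5 | N̂ = 6 | N̂ = 7 | N̂ = 8 | N̂ ≥ 9 |
|---|---|---|---|---|---|---|---|
| 20 dB | 0 | 0 | 100% | 0 | 0 | 0 | 0 |
| 15 dB | 0 | 0 | 100% | 0 | 0 | 0 | 0 |
| 10 dB | 0 | 0 | 100% | 0 | 0 | 0 | 0 |
| 5 dB | 0 | 0 | 99% | 1% | 0 | 0 | 0 |
| 0 dB | 0 | 0 | 98% | 2% | 0 | 0 | 0 |
| -5 dB | 0 | 0 | 96% | 4% | 0 | 0 | 0 |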
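The entries of Table 2 are empirical frequencies over 200 trials. The sketch below shows one way such a row could be tallied from per-trial estimates of N; the function name, its binning arguments and the trial list are hypothetical and chosen only so that the counts match the -5 dB row of the proposed method (96% and 4% of 200 trials correspond to 192 and 8 trials).

```python
from collections import Counter


def estimate_distribution(estimates, low=3, high=9):
    """Empirical probabilities of the estimated source count over all trials.

    Values >= `high` are pooled into one bin, as in the last table column.
    Values below `low` are not expected here and raise an error.
    """
    if not estimates:
        raise ValueError("need at least one trial")
    bins = Counter()
    for n in estimates:
        if n < low:
            raise ValueError(f"estimate {n} is below the first bin {low}")
        bins[min(n, high)] += 1
    total = len(estimates)
    return {k: bins[k] / total for k in range(low, high + 1)}


# Hypothetical outcome of 200 trials: 192 correct estimates, 8 over-estimates.
trials = [5] * 192 + [6] * 8
probs = estimate_distribution(trials)
print({k: f"{100 * v:.0f}%" for k, v in probs.items()})
```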

Table 3: Performance comparison of different algorithms (N = 5, M = 4, K = 2, T1 = 0.3, N0 = 12, and γ0 = 8)

| SNR | Metric | Algor. in [23] | Algor. in [29] + Ωd | PCA + Ωd | Known Ωa + Ωd |
|---|---|---|---|---|---|
| 10 dB | NMSE of Â | -22.5 dB | -33.8 dB | -33.7 dB | -33.7 dB |
| 10 dB | MER of Â | 25.4 dB | 36.9 dB | 36.6 dB | 36.8 dB |
| 10 dB | NMSE of Ŝ | -5.9 dB | -7.6 dB | -7.4 dB | -7.7 dB |
| 10 dB | SDR of Ŝ | 5.2 dB | 8.0 dB | 7.9 dB | 8.1 dB |
| 10 dB | SIR of Ŝ | 11.8 dB | 13.4 dB | 13.4 dB | 13.5 dB |
| 10 dB | SAR of Ŝ | 6.6 dB | 9.7 dB | 9.5 dB | 9.8 dB |
| 10 dB | Time | 23.8 s | 43.7 s | 11.8 s | 11.9 s |
| 5 dB | NMSE of Â | -22.2 dB | -30.8 dB | -30.7 dB | -30.5 dB |
| 5 dB | MER of Â | 25.1 dB | 33.5 dB | 33.3 dB | 33.6 dB |
| 5 dB | NMSE of Ŝ | -3.6 dB | -6.0 dB | -6.0 dB | -6.0 dB |
| 5 dB | SDR of Ŝ | 1.6 dB | 5.9 dB | 5.8 dB | 5.8 dB |
| 5 dB | SIR of Ŝ | 9.1 dB | 11.7 dB | 11.7 dB | 11.5 dB |
| 5 dB | SAR of Ŝ | 3.0 dB | 7.5 dB | 7.3 dB | 7.3 dB |
| 5 dB | Time | 41.1 s | 42.9 s | 13.6 s | 11.9 s |
| 0 dB | NMSE of Â | -22.6 dB | -26.1 dB | -27.8 dB | -26.9 dB |
| 0 dB | MER of Â | 25.4 dB | 28.8 dB | 29.6 dB | 29.6 dB |
| 0 dB | NMSE of Ŝ | -1.7 dB | -3.9 dB | -3.6 dB | -3.8 dB |
| 0 dB | SDR of Ŝ | -2.7 dB | 3.1 dB | 2.0 dB | 2.2 dB |
| 0 dB | SIR of Ŝ | 5.3 dB | 9.1 dB | 8.8 dB | 8.7 dB |
| 0 dB | SAR of Ŝ | -0.9 dB | 4.9 dB | 3.6 dB | 3.9 dB |
| 0 dB | Time | 43.3 s | 41.7 s | 16.9 s | 12.1 s |

Figure 1: The flowchart of the proposed STFT-UBSS algorithm for blind speech separation (N is the number of sources, M is the number of sensors, and A denotes the mixing matrix): 1) Compute the STFTs of the M mixtures; 2) Detect auto-source TF points by the PCA technique; 3) Estimate N and A based on clustering methods and subspace analysis; 4) Decompose the energy of each auto-source TF point; 5) Recover each source via ISTFT.

Figure 2: Three-dimensional view of TF points by plotting the real parts of spatial vectors defined in (11) (N = 4, M = 3, SNR = 10 dB). Different clusters are marked with different colors and the red circles show the domain of ideal steering vectors of N sources. (a) k-means clustering result on strong TF points with known number of clusters when T1 = 0.2. (b) Mean-shift clustering result on dominant TF points when γ0 = 10 without knowing the number of clusters. (c) Mean-shift clustering result on dominant TF points when γ0 = 30.

[Panels: (a) Strong TF points ∈ Ωs (T1 = 0.2). (b) Dominant TF points ∈ Ωd (γ0 = 10). (c) Dominant TF points ∈ Ωd (γ0 = 30).]

Figure 3: Four possible situations of strong TF points in the set Ωs.

Figure 4: Probability of correct estimation of the number of sources vs. different SNRs with various parameter settings (N = 5, M = 4). (a) Varying T1 (N0 = 12, γ0 = 8). (b) Varying N0 (T1 = 0.3, γ0 = 8). (c) Varying γ0 (T1 = 0.3, N0 = 12).

[Panels: (a) Different values of T1 (T1 = 0.2, 0.3, 0.4, 0.5). (b) Different values of N0 (N0 = 8, 11, 14, 17, 20). (c) Different values of γ0 (γ0 = 5, 10, 15, 20). Axes: SNR (dB) from −5 to 20 vs. probability of correct estimation of N.]

Figure 5: Normalized MSEs of the estimated Â assuming known N vs. different SNRs with various parameter settings (N = 5, M = 4). (a) Varying T1 (N0 = 12, γ0 = 8). (b) Varying N0 (T1 = 0.3, γ0 = 8). (c) Varying γ0 (T1 = 0.3, N0 = 12).

[Panels: (a) Different values of T1 (T1 = 0.2, 0.3, 0.4, 0.5, each on Ωs and on Ωd). (b) Different values of N0 (Ωs; N0 = 8, 11, 14, 17, 20 on Ωd). (c) Different values of γ0 (Ωs; γ0 = 5, 10, 15, 20 on Ωd). Axes: SNR (dB) from −5 to 20 vs. NMSE (dB).]

Figure 6: Performance comparison of the proposed algorithm with the methods in [32] and [23] with known N for three sets of speech data (N = 5, M = 4, K = 2, T1 = 0.3, N0 = 12, γ0 = 8). (a) Normalized MSEs of the estimated Â vs. different SNRs. (b) Normalized MSEs of the estimated Ŝ vs. SNRs.

[Panels: (a) NMSEs of Â (Ωs and Ωd for Data 1, Data 2 and Data 3). (b) NMSEs of Ŝ (method in [32], method in [23] and proposed method for Data 1, Data 2 and Data 3). Axes: SNR (dB) from −5 to 20 vs. NMSE (dB).]

Figure 7: Performance comparison of the algorithm in [23] with known N and the proposed algorithm with estimated N vs. different number of sources (M = 4, K = 2, T1 = 0.3, N0 = 15, γ0 = 10). (a) Correct estimation probability of N based on Ωd. (b) NMSEs of the estimated Â. (c) NMSEs of the estimated Ŝ.

[Panels: (a) Estimation of N (SNR = 0, 5, 10, 20 dB). (b) Estimation of A and (c) Estimation of S (Aïssa-El-Bey et al. and proposed, each at SNR = 20 dB and SNR = 5 dB). Horizontal axis: number of sources N from 5 to 8. Vertical axes: correct estimation probability of N, NMSE of A (dB), NMSE of S (dB).]

Figure 8: Performance comparison of the algorithm in [23] with known N and the proposed algorithm with estimated N (T1 = 0.3, N0 = 15, γ0 = 10). (a) Normalized MSEs of the estimated Ŝ with different sizes of A at SNR = 5 dB. (b) Normalized MSEs of the estimated Ŝ with varying A and K at SNR = 40 dB.

[Panels: (a) Different sizes of A (SNR = 5 dB, K = 2), with mixing matrix dimensions 4x5, 4x6, 5x6, 4x7, 5x7, 6x7, 4x8, 5x8, 6x8, 7x8. (b) Different sizes of A and varying K (SNR = 40 dB), with dimensions 4x5, 5x6, 6x7, 7x8 and K = 2, 3. Curves: Aïssa-El-Bey et al. and proposed. Axes: dimension of mixing matrix vs. NMSE of S (dB).]
