Single-channel separation using underdetermined blind autoregressive model and least absolute deviation

Neurocomputing 147 (2015) 412–425
N. Tengtrairat (a), W.L. Woo (b,*)

(a) Department of Computer Science, Payap University, Chiang Mai, Thailand
(b) School of Electrical and Electronic Engineering, Newcastle University, England, United Kingdom

Article history: Received 14 August 2013; received in revised form 3 May 2014; accepted 17 June 2014; available online 28 June 2014. Communicated by E.W. Lang.

Abstract

A novel "artificial stereo" mixture is proposed to resemble a synthetic stereo signal for solving the single-channel blind source separation (SCBSS) problem. The proposed SCBSS framework has the following desirable properties: one microphone; no training phase; no parameter tuning; independence of initialization and of a priori data on the sources. The artificial stereo mixture is formulated by weighting and time-shifting the single-channel observed mixture. A separability analysis of the proposed mixture model is also provided to establish that the artificial stereo mixture is separable. For the separation process, the mixing coefficients of the sources are estimated, where the source signals are modeled by an autoregressive (AR) process. Subsequently, a binary time–frequency mask is constructed by evaluating a least absolute deviation cost function. Finally, experimental testing on autoregressive sources shows that the proposed framework yields superior separation performance and is computationally very fast compared with existing SCBSS methods.

Keywords: Autoregressive process; Underdetermined system; Blind source separation; Machine learning; Sparse; Time–frequency

1. Introduction

Single-channel blind source separation (SCBSS) is the case where only one sensor is available; it corresponds to the extreme case of the underdetermined BSS problem. Under this scenario, most traditional BSS methods fail to recover the source signals from the single-channel observation, which has opened a research avenue into the SCBSS problem. SCBSS focuses on recovering the underlying source signals from an unknown mixing process given only a sole sensor, without any a priori data on the sources (other than known assumptions). SCBSS has interested many researchers during the last decade, and many approaches have been proposed to solve the problem in a ubiquitous range of disciplines, for example, speech processing, image processing, telecommunications, and electromagnetic (EM) brain analysis. Mathematically, it can be treated as one mixture of $N$ unknown source signals

$x(t) = y_1(t) + y_2(t) + \cdots + y_N(t)$  (1)

where $t = 1, 2, \ldots, T$ denotes the time index, and the goal is to estimate the sources $y_n(t)$, $\forall n \in \{1, \ldots, N\}$, of length $T$ when only the observation signal $x(t)$ is available.*

* Corresponding author. E-mail addresses: [email protected] (N. Tengtrairat), [email protected] (W.L. Woo). http://dx.doi.org/10.1016/j.neucom.2014.06.043. 0925-2312/© 2014 Elsevier B.V. All rights reserved.

Many SCBSS approaches have been proposed to solve the problem [1,2]. In general, they can be categorized into two groups, i.e. model-based and data-driven methodologies. In this paper we focus on data-driven SCBSS. These methods perform source separation without recourse to training information. Sparse non-negative matrix factorization (SNMF) is a well-known approach in this category. The SNMF method [3] establishes a set of bases for each speaker, and a mixture is mapped onto the joint bases of the speakers. This technique is a powerful linear model whose main advantage is simplicity: no assumption on the sources is required, such as statistical independence, non-Gaussianity, or a grammatical model. However, the drawback of the SNMF method is its lack of temporal structure. Additionally, a large amount of computation is required to determine the bases of the source signals. For audio signals, it is vital to account for the temporal variation that underlies human speech. The acoustic signal and high-level temporal parameters should be mapped not only into corresponding low-level durational variations, but also into modifications of fundamental frequency and intensity [4]. A two-dimensional model, the SNMF2D, has thus been developed in [5,6] to integrate the temporal feature into the SNMF. The SNMF2D applies a double convolution to model both the spreading of the spectral basis and the variation of the temporal structure inherent in the sources. Some success has already been reported in recent literature [7,8] showing the validity of SNMF2D in separating a single-channel mixture.


Among binaural BSS methods, many approaches have been proposed, for example in [9–11]. One of the successful approaches is the Degenerate Unmixing Estimation Technique (DUET) [9], a separation method using binary time–frequency (TF) masks. A major benefit of DUET is that the estimates from the two channels are combined inherently as part of the clustering process. DUET has been demonstrated to recover underlying sparse sources given two anechoic mixtures in the TF domain. However, the DUET algorithm is practically handicapped when only one channel is available. Additionally, constructing the masks blindly from only one mixture remains an open problem; in practical scenarios, this crux has not been developed sufficiently to make its way out of the laboratory. In this paper, a new framework for solving the above problem is presented by reformulating the binaural BSS problem as a monaural method. The paper contributes a novel method whose strengths are summarized as follows: (1) It is executed in "one go" without the need for iterative optimization or a training phase; hence the method is very fast and does not require any parameter tuning or a priori data on the sources. (2) It has low computational complexity and does not exploit higher-order statistics, which benefits implementation. (3) It is independent of the initialization condition, i.e. there is no need for random initial inputs or any predetermined structure on the sensors, which renders the proposed method robust. We term the proposed method the Single Observation Likelihood estimatiOn based Least Absolute Deviation (SOLO-LAD) algorithm. This paper is organized as follows: Section 2 presents the proposed artificial stereo mixing model. The proposed demixing method is elucidated in Section 3. The separability analysis of the proposed mixing model is derived in Section 4. Next, Section 5 presents experimental results and their analysis. Finally, Section 6 concludes the paper.


2. Single channel mixing model

2.1. Proposed artificial stereo mixture model

2.1.1. Time domain
In this paper, the case of a mixture of two sources is considered for simplicity. The single-channel mixture can be expressed as

$x_1(t) = y_1(t) + y_2(t)$  (2)

where $x_1(t)$ is the single-channel mixture, and $y_1(t)$ and $y_2(t)$ are the original source signals, which are assumed to be modeled by the autoregressive (AR) process [12,16]

$y_j(t) = -\sum_{m=1}^{M_j} a_{y_j}(m,t)\, y_j(t-m) + e_j(t)$  (3)

where $a_{y_j}(m,t)$ denotes the $m$th-order AR coefficient of the $j$th source at time $t$, $M_j$ is the maximum AR order, and $e_j(t)$ is an independent identically distributed (i.i.d.) random signal with zero mean and variance $\sigma_{e_j}^2$. The model (3) is particularly interesting in source separation as it enables us to formulate a virtual mixture by weighting and time-shifting the single-channel mixture $x_1(t)$ as

$x_2(t) = \dfrac{x_1(t) + \gamma\, x_1(t-\delta)}{1 + |\gamma|}$  (4)

In (4), $\gamma \in \mathbb{R}$ is the weight parameter and $\delta$ is the time delay. The mixture pair (2) and (4) is termed "artificial stereo" since it has an artificial semblance of a stereo signal, except that it is given at one location, which results in the same time delay but different attenuations of the source signals. To show this, we can express (4) in terms of the source signals, AR coefficients and time delay as

$x_1(t) = y_1(t) + y_2(t)$
$x_2(t) = a_1(t,\delta,\gamma)\, y_1(t-\delta) + a_2(t,\delta,\gamma)\, y_2(t-\delta) + r_1(t,\delta,\gamma) + r_2(t,\delta,\gamma)$  (5)

Define

$a_j(t,\delta,\gamma) = \dfrac{-a_{y_j}(\delta,t) + \gamma}{1 + |\gamma|}$  (6)

$r_j(t,\delta,\gamma) = \dfrac{e_j(t) - \sum_{m=1,\, m \neq \delta}^{M_j} a_{y_j}(m,t)\, y_j(t-m)}{1 + |\gamma|}$  (7)

where $a_j(t,\delta,\gamma)$ and $r_j(t,\delta,\gamma)$ represent the mixing attenuation and the residue of the $j$th source, respectively. The derivation of (5) is fully presented in Appendix A.

2.1.2. Time–frequency domain
The TF representation of the mixing model is obtained using the STFT of $x_j(t)$, $j = 1, 2$, as

$X_1(\tau,\omega) = Y_1(\tau,\omega) + Y_2(\tau,\omega)$
$X_2(\tau,\omega) = a_1(\tau)\, e^{-i\omega\delta}\, Y_1(\tau-\delta,\omega) + a_2(\tau)\, e^{-i\omega\delta}\, Y_2(\tau-\delta,\omega) - \left( \sum_{m=1,\, m \neq \delta}^{M_1} \dfrac{a_{y_1}(m,\tau)\, e^{-i\omega m}\, Y_1(\tau-m,\omega)}{1+|\gamma|} + \sum_{m=1,\, m \neq \delta}^{M_2} \dfrac{a_{y_2}(m,\tau)\, e^{-i\omega m}\, Y_2(\tau-m,\omega)}{1+|\gamma|} \right)$  (8)

for all $\tau, \omega$. In (8), we have used the fact that $e_j(t) \ll y_j(t)$; thus the TF counterparts of $a_j(t,\delta,\gamma)$ and $r_j(t,\delta,\gamma)$ in (6) and (7) simplify. The TF counterpart of the residue is

$R_j(\tau,\omega) = -\sum_{m=1,\, m \neq \delta}^{M_j} \dfrac{a_{y_j}(m,\tau)\, e^{-i\omega m}\, Y_j(\tau-m,\omega)}{1+|\gamma|}$  (9)

and the TF counterpart of $a_j(t,\delta,\gamma)$ can be expressed as

$a_j(t,\delta,\gamma) \xrightarrow{\text{STFT}} a_j(\tau)\, e^{-i\omega\delta}$  (10)

To facilitate further analysis, we also define

$C_j(\tau,\omega) = \sum_{m=1,\, m \neq \delta}^{M_j} \dfrac{a_{y_j}(m,\tau)\, e^{-i\omega(m-\delta)}}{1+|\gamma|}$  (11)

which forms the AR-coefficient residue of the $j$th source. Assuming that the $j$th source is dominant at a particular TF unit, (8) can be simplified by using (10) and (11) as follows:

$X_1(\tau,\omega) = Y_j(\tau,\omega)$
$X_2(\tau,\omega) = a_j(\tau)\, e^{-i\omega\delta}\, Y_j(\tau-\delta,\omega) - \sum_{m=1,\, m \neq \delta}^{M_j} \dfrac{a_{y_j}(m,\tau)}{1+|\gamma|}\, e^{-i\omega m}\, Y_j(\tau-m,\omega) \approx \left[a_j(\tau) - C_j(\tau,\omega)\right] e^{-i\omega\delta}\, Y_j(\tau,\omega), \quad (\tau,\omega) \in \Omega_j$  (12)

for $|\delta|, m \le \phi$, where $\Omega_j$ is the active area of $Y_j(\tau,\omega)$ defined as $\Omega_j := \{(\tau,\omega) : Y_j(\tau,\omega) \neq 0,\ Y_k(\tau,\omega) = 0\ \forall k \neq j\}$. From (12), it can be seen that the artificial stereo mixture comprises three components, i.e. $a_j e^{-i\omega\delta}$, $C_j(\tau,\omega)$ and $Y_j(\tau,\omega)$. A rigorous analysis of (12) discloses that even if $Y_j(\tau,\omega)$ is unknown, the signature of each source can be extracted directly from $X_1(\tau,\omega)$ using only the information of $a_j e^{-i\omega\delta}$ and $C_j(\tau,\omega)$. This constitutes the separability of the proposed mixing model, which will be analyzed in Section 4.

2.1.3. Assumptions of the proposed method
The proposed SOLO-LAD method aims to recover the original signals by estimating the mixing coefficients and constructing a

binary mask. To achieve this, the following assumptions will be used.

Assumption 1. Sources satisfy the windowed-disjoint orthogonality (WDO) [14] condition

$Y_i(\tau,\omega)\, Y_j(\tau,\omega) \approx 0, \quad \forall i \neq j,\ \forall \tau, \omega$  (13)

where $Y_j(\tau,\omega)$ is the Short-Time Fourier Transform (STFT) of $y_j(t)$. To ensure stationarity of the source within any TF unit, the shift size $\Delta\tau$ of the window function should be less than $B_l$ for all $l$. This is practically justified by choosing an appropriate length of the window function $W(\cdot)$. Hence, we can write $a_{y_j}(m, T_l) = a_{y_j}(m, \tau)$ provided that $\Delta\tau \le B_l$ and $\tau \in T_l$. Signals are sparsely distributed in a high-resolution time–frequency (TF) domain: different signals overlap at only a few points in the TF plane, for example a mixture of musical instruments such as drums and piano, or a mixture of music and speech. In such cases the WDO condition is satisfied and each TF unit is dominated by the active signal. On the other hand, a mixture of similar signals has a high degree of overlap, for example a mixture of speech signals with low-resolution TF units. When this happens, the orthogonality property between the source signals is violated and the WDO condition is not satisfied. Most available source separation algorithms are tested in anechoic environments. In reverberant and noisy environments, the influence of environmental signals degrades the orthogonality of the sources in the TF domain: the noise and room impulse responses smear the energy of specific TF bins in time and in frequency, disturbing the sparseness of the sources. As a result, the overlap of the TF spectra of the signals increases with increasing interference. Thus, the WDO condition does not hold in highly reverberant and noisy environments.

Assumption 2. Sources satisfy local stationarity of the time–frequency representation. This refers to the approximation $Y_j(\tau-\phi,\omega) \approx Y_j(\tau,\omega)$, where $\phi$ is the maximum time delay (shift) associated with $F_W(\cdot)$ for an appropriate $W(\cdot)$. If $\phi$ is small compared with the length of $W(\cdot)$, then $W(\cdot - \phi) \approx W(\cdot)$ [15]. Hence, the Fourier transform of a windowed function with shift $\phi$ is approximately the same as that without the shift. For the proposed method, the pseudo-stereo mixture is shifted by $\delta$, and invoking local stationarity leads to

$y_j(t-\delta) \xrightarrow{\text{STFT}} e^{-i\omega\delta}\, Y_j(\tau,\omega), \quad \forall \delta,\ |\delta| \le \phi$  (14)

Thus, the STFT of $y_j(t-\delta)$ with $|\delta| \le \phi$ is approximately $e^{-i\omega\delta} Y_j(\tau,\omega)$ according to local stationarity.
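To make the mixing model concrete, the artificial stereo pair of (2) and (4) can be sketched in a few lines of numpy. This is a minimal illustration under the paper's definitions; the function and variable names are ours, and the toy sinusoidal sources merely stand in for the AR sources used later.

```python
import numpy as np

def artificial_stereo(x1, gamma, delta):
    """Eq. (4): x2(t) = (x1(t) + gamma * x1(t - delta)) / (1 + |gamma|).
    The first `delta` samples have no delayed counterpart and are zero-padded."""
    if delta > 0:
        x1_shifted = np.concatenate([np.zeros(delta), x1[:-delta]])
    else:
        x1_shifted = x1.copy()
    return (x1 + gamma * x1_shifted) / (1.0 + abs(gamma))

# Single-channel mixture of two toy sources
t = np.arange(1600)
y1 = np.sin(2 * np.pi * 0.010 * t)
y2 = np.sin(2 * np.pi * 0.037 * t)
x1 = y1 + y2                                   # Eq. (2): the observed mixture
x2 = artificial_stereo(x1, gamma=2, delta=2)   # the weighted, time-shifted copy
```

The pair (x1, x2) then plays the role of the two "stereo" channels fed to the TF-domain estimation of Section 3.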

3. Single channel demixing method

The framework of the separating process is to estimate the mixing coefficients of the $j$th source and construct a binary mask for separating the mixtures.

3.1. Mixing coefficient estimation using complex 2D histogram

To begin, let us assume that the $j$th source is dominant at a particular TF unit where only the $j$th source is active. The estimate of $a_j(\tau,\omega) = a_j(\tau) - C_j(\tau,\omega)$ can be determined as

$a_j(\tau,\omega) = \dfrac{X_2(\tau,\omega)}{X_1(\tau,\omega)}\, e^{i\omega\delta} = a_j(\tau) - C_j(\tau,\omega) = a_j^{(r)}(\tau,\omega) + i\, a_j^{(i)}(\tau,\omega), \quad \forall (\tau,\omega) \in \Omega_j$  (15)

where $\Omega_j$ is the active TF area of the $j$th source, $a_j^{(r)}(\tau,\omega) = \mathrm{Re}\!\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} e^{i\omega\delta}\right]$ and $a_j^{(i)}(\tau,\omega) = \mathrm{Im}\!\left[\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} e^{i\omega\delta}\right]$ are the real and imaginary parts of $a_j(\tau,\omega)$, respectively, and $i = \sqrt{-1}$. Although the ratio $X_2(\tau,\omega)/X_1(\tau,\omega)$ seems straightforward, it is difficult to obtain $a_j(\tau,\omega)$ directly from this ratio because the term $C_j(\tau,\omega)$ varies with frequency frame by frame. A weighted complex two-dimensional (2D) histogram estimation is proposed to solve this problem. The weighted complex 2D histogram is a function of $(\tau,\omega)$ with weight $|X_1(\tau,\omega) X_2(\tau,\omega)|$, used to estimate $a_j(\tau,\omega)$ and cluster the estimates into $N$ groups (where $N$ is the total number of sources in the mixture). In particular, the real and imaginary parts of $a_j(\tau,\omega)$ can be estimated as

$\hat{a}_j^{(r)} = \dfrac{\sum_{(\tau,\omega) \in \Omega_j} |X_1(\tau,\omega) X_2(\tau,\omega)| \left[ \mathrm{Re}\!\left(\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}\right) \cos(\omega\delta) - \mathrm{Im}\!\left(\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}\right) \sin(\omega\delta) \right]}{\sum_{(\tau,\omega) \in \Omega_j} |X_1(\tau,\omega) X_2(\tau,\omega)|}$

$\hat{a}_j^{(i)} = \dfrac{\sum_{(\tau,\omega) \in \Omega_j} |X_1(\tau,\omega) X_2(\tau,\omega)| \left[ \mathrm{Im}\!\left(\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}\right) \cos(\omega\delta) + \mathrm{Re}\!\left(\frac{X_2(\tau,\omega)}{X_1(\tau,\omega)}\right) \sin(\omega\delta) \right]}{\sum_{(\tau,\omega) \in \Omega_j} |X_1(\tau,\omega) X_2(\tau,\omega)|}$  (16)

The above can then be combined to form the estimate of (15) as

$\hat{a}_j = \hat{a}_j^{(r)} + i\, \hat{a}_j^{(i)}$  (17)

Relating (17) with (15), a similar idea expresses $\hat{a}_j$ in terms of the complex 2D histogram estimates of $a_j(\tau)$ and $C_j(\tau,\omega)$. Alternatively, one can use the concept of the symmetric mixing attenuation $\alpha_j$, defined as $\alpha_j := a_j - 1/a_j$.

3.2. Construction of masks

Once the mixing coefficients of the $j$th source are computed, the binary TF mask for the $j$th source can be constructed by evaluating the proposed cost function, which labels each TF unit with one of the $k$ arguments by optimizing the cost for the $j$th source. Let us start from the mixture model (12), where the $j$th source is active in a particular TF area $\Omega_j$:

$X_1(\tau,\omega) = Y_j(\tau,\omega) + v_1(\tau,\omega)$
$X_2(\tau,\omega) \approx \left[a_j(\tau) - C_j(\tau,\omega)\right] e^{-i\omega\delta}\, Y_j(\tau,\omega) + v_2(\tau,\omega) = a_j(\tau,\omega)\, e^{-i\omega\delta}\, Y_j(\tau,\omega) + v_2(\tau,\omega), \quad (\tau,\omega) \in \Omega_j$  (18)

where $a_j(\tau,\omega) = a_j(\tau) - C_j(\tau,\omega)$, and $v_1(\tau,\omega)$ and $v_2(\tau,\omega)$ are i.i.d. white complex Gaussian noise signals with zero mean and variances $\sigma_1^2(\tau,\omega)$ and $\sigma_2^2(\tau,\omega)$, respectively. From this model, the maximum likelihood (ML) estimate is used for computing $a_j(\tau,\omega)$ by finding the $Y_j^{\mathrm{ML}}(\tau,\omega)$ of maximum likelihood. Thus, the proposed least absolute deviation (LAD) cost function is derived from the ML principle, where the Gaussian likelihood function $p(X_1(\tau,\omega), X_2(\tau,\omega) \mid Y_j(\tau,\omega), a_j(\tau,\omega), \delta)$ is formulated using (18) as

$L_j(\tau,\omega) := p(X_1(\tau,\omega), X_2(\tau,\omega) \mid Y_j(\tau,\omega), a_j(\tau,\omega), \delta)$
$= \prod_{(\tau,\omega) \in \Omega_j} f_{\sigma_1^2(\tau,\omega)}\, f_{\sigma_2^2(\tau,\omega)}\!\left(X_1(\tau,\omega) - Y_j(\tau,\omega),\ X_2(\tau,\omega) - a_j(\tau,\omega)\, e^{-i\omega\delta}\, Y_j(\tau,\omega)\right)$
$= C \cdot \exp\!\left( -\frac{1}{2} \sum_{(\tau,\omega) \in \Omega_j} \left[ \frac{|X_1(\tau,\omega) - Y_j(\tau,\omega)|^2}{\sigma_1^2(\tau,\omega)} + \frac{|X_2(\tau,\omega) - a_j(\tau,\omega)\, e^{-i\omega\delta}\, Y_j(\tau,\omega)|^2}{\sigma_2^2(\tau,\omega)} \right] \right)$  (19)

where $C$ is a normalizing constant and $X_1(\tau,\omega), X_2(\tau,\omega)$ are taken over $(\tau,\omega) \in \Omega_j$. Note that the above is derived by formulating the likelihood function using (18), maximizing it with respect to $Y_j(\tau,\omega)$, and then substituting the obtained result back into the Gaussian likelihood function. The instantaneous likelihood function $L_j(\tau,\omega)$ in (19) clusters every $(\tau,\omega)$ unit to the $j$th dominating source when $L_j(\tau,\omega) \ge L_k(\tau,\omega)$, $\forall k \neq j$. The full derivation is presented in Appendix A. This process is equivalent to minimizing the following function:

$J(\tau,\omega) = \arg\min_k \left| \hat{a}_k X_1(\tau,\omega) - e^{i\omega\delta} X_2(\tau,\omega) \right|$  (20)

Technically, the proposed cost function (20) partitions the TF plane of the mixed signal into $k$ groups of $(\tau,\omega)$ units by evaluating the cost function: for each TF unit, the argument $k$ that gives the minimum cost is assigned to the $k$th source. Subsequently, the mask can be built as

$M_j(\tau,\omega) := \begin{cases} 1 & J(\tau,\omega) = j \\ 0 & \text{otherwise} \end{cases}$  (21)

The original sources can then be recovered by computing

$\hat{Y}_j(\tau,\omega) = M_j(\tau,\omega)\, X_1(\tau,\omega)$  (22)

Finally, the estimated sources are converted back into the time domain using the inverse STFT. To sum up, an overview of the proposed method is tabulated in Table 1.
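The demixing steps just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: clustering the complex 2D histogram into $N$ peaks is skipped (the active region of each source is passed in as a boolean mask, and the final LAD demixing takes the attenuation estimates directly), and all function and variable names are ours.

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Hamming-windowed frames, FFT per frame (the STFT step).
    Rows index frames (tau), columns index frequency bins (omega)."""
    w = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def bin_freqs(n_bins):
    """Normalized angular frequency of each rfft bin (win_len = 2 * (n_bins - 1))."""
    return 2.0 * np.pi * np.arange(n_bins) / (2 * (n_bins - 1))

def estimate_attenuation(X1, X2, region, delta):
    """Weighted estimate in the spirit of Eqs. (16)-(17) over a TF region
    (boolean mask): average of (X2/X1) e^{i omega delta}, weighted by |X1 X2|."""
    ratio = (X2 / X1) * np.exp(1j * bin_freqs(X1.shape[1]) * delta)
    w = np.abs(X1 * X2) * region
    return np.sum(w * ratio) / np.sum(w)

def lad_demix(X1, X2, a_hat, delta):
    """LAD cost |a_k X1 - e^{i omega delta} X2| per candidate k (Eq. (20)),
    binary masks (Eq. (21)), and masked recovery (Eq. (22))."""
    phase = np.exp(1j * bin_freqs(X1.shape[1]) * delta)
    costs = np.stack([np.abs(a * X1 - phase * X2) for a in a_hat])
    labels = np.argmin(costs, axis=0)
    masks = [(labels == k).astype(float) for k in range(len(a_hat))]
    return [m * X1 for m in masks], masks, labels

# Synthetic TF data: each unit dominated by one of two sources with known attenuations
rng = np.random.default_rng(1)
delta, a_true = 1, np.array([0.5, 2.0])
true_labels = rng.integers(0, 2, size=(6, 9))
X1 = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=(6, 9)))   # unit-magnitude units
X2 = a_true[true_labels] * np.exp(-1j * bin_freqs(9) * delta) * X1
estimates, masks, labels = lad_demix(X1, X2, a_true, delta)
```

On this idealized data (exact WDO, no noise) the LAD cost is exactly zero for the true source index at every unit, so the recovered labels match the construction and the masked estimates partition $X_1$.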

Table 1
Overview of the proposed SOLO-LAD method.

(1) Model the artificial stereo mixtures: $x_1(t) = y_1(t) + y_2(t)$ and $x_2(t) = \dfrac{x_1(t) + \gamma\, x_1(t-\delta)}{1+|\gamma|}$.
(2) Compute the STFT of the mixture model: $X_i(\tau,\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} W(t-\tau)\, x_i(t)\, e^{-i\omega t}\, dt$, $i = 1, 2$.
(3) Compute $\hat{a}_j = \hat{a}_j^{(r)} + i\, \hat{a}_j^{(i)}$ using (17).
(4) Evaluate the LAD cost function: $J(\tau,\omega) = \arg\min_k |\hat{a}_k X_1(\tau,\omega) - e^{i\omega\delta} X_2(\tau,\omega)|$.
(5) Construct a binary TF mask: $M_j(\tau,\omega) := 1$ if $J(\tau,\omega) = j$, and $0$ otherwise.
(6) Demix the mixture: $\hat{Y}_j(\tau,\omega) = M_j(\tau,\omega)\, X_1(\tau,\omega)$.
(7) Transform the recovered sources back into the time domain using the inverse STFT: $\hat{Y}_j(\tau,\omega) \xrightarrow{\text{iSTFT}} \hat{y}_j(t)$.

4. Separability of the proposed artificial stereo mixture model

The separability of the proposed mixture model can be examined from the artificial stereo mixture by considering $a_j(t,\delta,\gamma)$ and $r_j(t,\delta,\gamma)$ in three cases. Case 1 refers to identical sources mixed in the single channel; Case 2 represents different sources with $\gamma$ and $\delta$ set such that $a_1(t,\delta,\gamma) = a_2(t,\delta,\gamma)$; and Case 3 corresponds to the most general case, where the sources are distinct and $\gamma$ and $\delta$ are selected arbitrarily so that the mixing attenuations and residues are also different. These cases are evaluated through the LAD cost function in (20). Eq. (20) can be further analyzed in terms of the $j$th source by using (12), where $X_1(\tau,\omega) = Y_j(\tau,\omega)$ and $X_2(\tau,\omega) = a_j(\tau)\, e^{-i\omega\delta}\, Y_j(\tau-\delta,\omega) - \sum_{m=1,\, m\neq\delta}^{M_j} a_{y_j}(m,\tau)\, e^{-i\omega m}\, Y_j(\tau-m,\omega)/(1+|\gamma|)$; therefore (20) becomes

$J(\tau,\omega) = \arg\min_k \left| a_k(\tau,\omega)\, X_1(\tau,\omega) - e^{i\omega\delta}\, X_2(\tau,\omega) \right|$
$= \arg\min_k \left| \left[a_k(\tau) - C_k(\tau,\omega)\right] Y_j(\tau,\omega) - a_j(\tau)\, Y_j(\tau-\delta,\omega) + \sum_{m=1,\, m\neq\delta}^{M_j} \dfrac{a_{y_j}(m,\tau)\, e^{-i\omega(m-\delta)}\, Y_j(\tau-m,\omega)}{1+|\gamma|} \right|$  (23)

Invoking the local stationarity of the source for $\delta, m \le \phi$, (23) becomes

$J(\tau,\omega) = \arg\min_k \left| \left[a_k(\tau) - C_k(\tau,\omega)\right] Y_j(\tau,\omega) - \left[a_j(\tau) - C_j(\tau,\omega)\right] Y_j(\tau,\omega) \right| = \arg\min_k \left| a_k(\tau) - C_k(\tau,\omega) - a_j(\tau) + C_j(\tau,\omega) \right| \left| Y_j(\tau,\omega) \right|$  (24)

To simplify notation, $a_j(t)$ and $r_j(t)$ will be used for $a_j(t,\delta,\gamma)$ and $r_j(t,\delta,\gamma)$, respectively; their dependence on $\delta$ and $\gamma$ is implied. Henceforth, we consider the following three cases.

Case 1. Identical sources mixed in the single channel:

If $a_1(t) = a_2(t) = a(t)$ and $r_1(t) = r_2(t) = r(t)$, then $x_2(t) = \left(\dfrac{-a(\delta,t) + \gamma}{1+|\gamma|}\right) x_1(t-\delta) + 2r(t)$.

In this case, the second mixture is simply a time-delayed version of the first mixture multiplied by a scalar, plus a redundant residue. The separability of this case is examined by substituting the artificial stereo mixture of Case 1 into the cost function. Since both residues are equal, $C_1(\tau,\omega) = C_2(\tau,\omega) = C(\tau,\omega) = \sum_{m=1,\, m\neq\delta}^{M} a_y(m,\tau)\, e^{-i\omega(m-\delta)}/(1+|\gamma|)$. Invoking the local stationarity of the sources, $Y_j(\tau-m,\omega) \approx Y_j(\tau,\omega)$ for $m \le \phi$, the cost function (24) becomes

$J(\tau,\omega) = \arg\min_k \left| a(\tau) - C(\tau,\omega) - a(\tau) + C(\tau,\omega) \right| \left| Y_j(\tau,\omega) \right| = 0 \quad \forall k$  (25)

As a result, the cost function $J(\tau,\omega)$ is zero for all $k$ arguments, i.e. $J_1 = J_2 = 0$, so nothing is gained: the cost function cannot distinguish the $k$ arguments and the mixture is not separable.

Case 2. Different sources, with $\gamma$ and $\delta$ set such that $a_1(t,\delta,\gamma) = a_2(t,\delta,\gamma)$:

If $a_1(t) = a_2(t) = a(t)$ and $r_1(t) \neq r_2(t)$, then $x_2(t) = \left(\dfrac{-a(\delta,t) + \gamma}{1+|\gamma|}\right) x_1(t-\delta) + r_1(t) + r_2(t)$.

This case is almost identical to the previous one and differs only in that $r_1(t) \neq r_2(t)$. As each residue $r_j(t)$ is related to the $j$th source via $C_j(\tau,\omega)$, the separability of this mixture can be analyzed using (24) as

$J(\tau,\omega) = \arg\min_k \left| a(\tau) - C_k(\tau,\omega) - a(\tau) + C_j(\tau,\omega) \right| \left| Y_j(\tau,\omega) \right| = \arg\min_k \left| -C_k(\tau,\omega) + C_j(\tau,\omega) \right| \left| Y_j(\tau,\omega) \right|$  (26)

For $k = j$,

$J(\tau,\omega) = \arg\min_k \left| -C_j(\tau,\omega) + C_j(\tau,\omega) \right| \left| Y_j(\tau,\omega) \right| = 0$  (27)

The cost function yields zero for $k = j$ and a nonzero value for $k \neq j$. Although the mixing attenuations of both sources are identical, the cost function can still distinguish the $k$ arguments using only the difference of the residues. Therefore, the mixture of Case 2 is separable.

Case 3. General case, where the sources are distinct and $\gamma$ and $\delta$ are selected arbitrarily such that the mixing attenuations and residues are also different:

If $a_1(t) \neq a_2(t)$ and $r_1(t) \neq r_2(t)$ (or $r_1(t) = r_2(t)$), then

$x_2(t) = \left(\dfrac{-a_{y_1}(\delta,t) + \gamma}{1+|\gamma|}\right) y_1(t-\delta) + \left(\dfrac{-a_{y_2}(\delta,t) + \gamma}{1+|\gamma|}\right) y_2(t-\delta) + r_1(t) + r_2(t)$  (28)

We first treat the situation $r_1(t) = r_2(t)$, so that $C_k(\tau,\omega) = C(\tau,\omega)$. Since the mixing attenuations $a_1(\tau)$ and $a_2(\tau)$ correspond respectively to $y_1(t)$ and $y_2(t)$, the cost function can be expressed as

$J(\tau,\omega) = \arg\min_k \left| a_k(\tau) - C(\tau,\omega) - a_j(\tau) + C(\tau,\omega) \right| \left| Y_j(\tau,\omega) \right| = \arg\min_k \left| a_k(\tau) - a_j(\tau) \right| \left| Y_j(\tau,\omega) \right|$

This function is nonzero only for $k \neq j$; the cost function can therefore separate the $k$ arguments owing to the difference between $a_k$ and $a_j$. The case $r_1(t) \neq r_2(t)$ follows the same line of argument, with the cost function

$J(\tau,\omega) = \arg\min_k \left| a_k(\tau) - C_k(\tau,\omega) - a_j(\tau) + C_j(\tau,\omega) \right| \left| Y_j(\tau,\omega) \right|$  (29)

which again is nonzero only for $k \neq j$; thus the cost function distinguishes the $k$ arguments and the mixture is separable.

In summary, considering $a_j(t)$ and $r_j(t)$ with respect to the above, the three cases are:

Case 1: $a_1(t) = a_2(t)$, $r_1(t) = r_2(t)$ — not separable.
Case 2: $a_1(t) = a_2(t)$, $r_1(t) \neq r_2(t)$ — separable.
Case 3: $a_1(t) \neq a_2(t)$, $r_1(t) \neq r_2(t)$ (or $r_1(t) = r_2(t)$) — separable.

From the analysis, if at least one parameter of the sources, i.e. $a_j(t)$ or $r_j(t)$, differs between the sources, the artificial stereo model is separable, as in Cases 2 and 3. Hence, the proposed artificial stereo mixing model can be separated to unveil the original sources.
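The three separability cases can be checked numerically at a single TF unit using the reduced cost (24). The attenuation and residue values below are illustrative placeholders, not quantities from the paper.

```python
# Per-unit LAD cost of Eq. (24): |a_k - C_k - a_j + C_j| * |Y_j|
def unit_cost(a_k, C_k, a_j, C_j, Y_j):
    """LAD cost contribution of one TF unit (Eq. (24))."""
    return abs(a_k - C_k - a_j + C_j) * abs(Y_j)

Y = 1.0 - 0.5j            # dominant-source TF value (illustrative)
a, C1 = 0.8, 0.10 + 0.05j

# Case 1: identical attenuations and residues -> zero cost for every k (not separable)
case1 = [unit_cost(a, C1, a, C1, Y) for _ in range(2)]

# Case 2: equal attenuations, distinct residues -> zero only for the true index j
C2 = 0.30 - 0.02j
case2 = [unit_cost(a, Ck, a, C1, Y) for Ck in (C1, C2)]

# Case 3: distinct attenuations -> again zero only for the true index j
a1, a2 = 0.5, 2.0
case3 = [unit_cost(ak, C1, a1, C1, Y) for ak in (a1, a2)]
```

Cases 2 and 3 produce a strictly positive cost for the wrong hypothesis, which is exactly what lets the argmin in (20) assign each TF unit to its dominant source.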

Fig. 1. Complex 2D histogram corresponding to two sources (axes: $\alpha^{(r)}$ and $\alpha^{(i)}$; vertical axis: weight).

5. Results and analysis

The performance of the proposed method is demonstrated by separating synthetic and real-audio sources. On the one hand, the synthetic sources represent stationary AR signals. On the other

Fig. 2. Two original sources, observed mixture and two estimated sources.

hand, the real-audio sources are inherently non-stationary and include voice and music signals. All experiments have been conducted under the same conditions: the sources are mixed with normalized power over the duration of the signals, and all mixed signals are sampled at a 16 kHz sampling rate. The TF representation is computed using the STFT with a 1024-point Hamming window and 50% overlap [15,17]. For the artificial stereo parameters $\delta$ and $\gamma$, the possible choice of $\delta$ is determined by

$\delta_{\max} < \dfrac{f_s}{2 f_{\max}}$  (30)

where $\delta_{\max}$ is the maximum time delay, $f_{\max}$ is the maximum frequency present in the sources and $f_s$ is the sampling frequency. This is because the factor $e^{-i\omega\delta}$ is only uniquely specified if $|\omega_{\max}\, \delta_{\max}| < \pi$, where $\omega_{\max} = 2\pi f_{\max}/f_s$; otherwise phase wrapping occurs [13]. As long as the delay parameter is less than $\delta_{\max}$, there is no phase ambiguity. For example, for a maximum frequency $f_{\max} = 4.0$ kHz and a sampling frequency $f_s = 16$ kHz, one obtains $\delta_{\max} < 2$ using (30); phase ambiguity is therefore avoided provided $\delta$ is limited to 1. Similarly, for a maximum frequency $f_{\max} = 2.0$ kHz, $\delta_{\max} < 4$ and $\delta$ can be selected as 1, 2 or 3. This condition will be used to determine the range of $\delta$ in formulating the pseudo-stereo mixture. For the weight parameter $\gamma$, rigorous Monte-Carlo testing has been conducted, and it was found that the range $\gamma \in \{1, 2, 3, 4\}$ [18] yields the best separation performance. Hence, $\gamma$ can be chosen arbitrarily from the recommended range.
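A small helper makes the delay bound (30) concrete (the function name is ours):

```python
def max_delay(fs, fmax):
    """Upper bound on the time delay from Eq. (30): delta_max < fs / (2 * fmax).
    Delays strictly below this bound keep |omega_max * delta| < pi, i.e. the
    factor exp(-i * omega * delta) is uniquely specified (no phase wrapping)."""
    return fs / (2.0 * fmax)

# fs = 16 kHz, fmax = 4 kHz -> bound 2.0, so delta = 1 is the only valid choice;
# fs = 16 kHz, fmax = 2 kHz -> bound 4.0, so delta in {1, 2, 3}.
```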

Table 2
Comparison of average SDR on a mixture of two AR sources with SOLO-LAD, SNMF2D, SCICA, and IBM.

Method      SDR S1 (dB)   SDR S2 (dB)
SOLO-LAD    19.6          17.1
SNMF2D       9.0           8.1
SCICA       18.4          10.5
IBM         20.3          17.4

The separation performance is evaluated by measuring the distortion between the original source and the estimated one according to the signal-to-distortion ratio (SDR) [19], defined as $\mathrm{SDR} = 10 \log_{10}\!\left( \|s_{\text{target}}\|^2 / \|e_{\text{interf}} + e_{\text{artif}}\|^2 \right)$, where $e_{\text{interf}}$ represents the interference from the other sources and $e_{\text{artif}}$ the artefacts. An average SDR result is computed from 100 runs on the same mixture. The proposed approach, termed SOLO-LAD, is compared with sparse non-negative matrix two-dimensional factorization (SNMF2D) [4], single-channel independent component analysis (SCICA) [20] and the ideal binary mask (IBM) [21], which represents the ideal separation performance. The SNMF2D parameters are set as follows [22,23]: the number of factors is 2, the sparsity weight is 1.1, and the numbers of phase shifts and time shifts are 31 and 7, respectively, for music; for speech, both shifts are set to 4. The TF domain used in SNMF2D is based on the log-frequency spectrogram, and its cost function is based on the Kullback–Leibler divergence. For SCICA, the number of blocks is 10 with the time delay set to unity. MATLAB is used as the programming platform, and all simulations and analyses are performed on a PC with an Intel Core 2 CPU at 3 GHz and 3 GB RAM.
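A minimal sketch of the SDR metric as defined above, assuming the target and error components have already been decomposed (as done, for example, by BSS evaluation toolkits); the function name is ours:

```python
import numpy as np

def sdr(s_target, e_interf, e_artif):
    """Signal-to-distortion ratio:
    SDR = 10 log10(||s_target||^2 / ||e_interf + e_artif||^2)."""
    num = np.sum(np.abs(s_target) ** 2)
    den = np.sum(np.abs(e_interf + e_artif) ** 2)
    return 10.0 * np.log10(num / den)
```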

5.1. AR sources

5.1.1. Separation of two AR sources
Two stationary AR sources $y_1(t)$ and $y_2(t)$ are synthesized using the model (3) with the following coefficients: $a_{y_1} = [-3.7864,\ 5.5051,\ -3.6365,\ 0.9224]$ and $a_{y_2} = [-2.7577,\ 3.8025,\ -2.6216,\ 0.9037]$, where $e_1(t)$ and $e_2(t)$ are zero-mean white Gaussian signals with average variances $1.1 \times 10^{-6}$ and $1.2 \times 10^{-4}$, respectively. The coefficients and the variances are randomly selected. It should be noted that $a_{y_1}(0) = a_{y_2}(0) = 1$ by definition, but this has not been included above to avoid cluttering the notation. The source signals are shown in Fig. 2. The artificial stereo parameters are selected as $\gamma = 2$ and $\delta = 2$. The histogram-resolution parameters are set at $\Delta\alpha^{(r)} = 3$, $\Delta\alpha^{(i)} = 200$, $\zeta^{(r)} = 101$ and $\zeta^{(i)} = 3$, where $\zeta^{(r)}$ and $\zeta^{(i)}$ denote the numbers of resolution bins, and $\Delta\alpha^{(r)}$ and $\Delta\alpha^{(i)}$ are the maximum values of $\alpha^{(r)}$ and $\alpha^{(i)}$, respectively.
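An AR source of the kind used here can be synthesized directly from the recursion in (3); the sketch below uses the reported coefficients of the first source (the function name and generation loop are ours):

```python
import numpy as np

def synth_ar(coeffs, noise_std, n, rng):
    """Generate an AR source per Eq. (3): y(t) = -sum_m a(m) y(t-m) + e(t),
    with e(t) i.i.d. zero-mean Gaussian. `coeffs` excludes the leading a(0) = 1."""
    M = len(coeffs)
    a = np.asarray(coeffs, dtype=float)
    y = np.zeros(n + M)
    e = rng.normal(0.0, noise_std, size=n + M)
    for t in range(M, n + M):
        # y[t - M:t][::-1] is [y(t-1), y(t-2), ..., y(t-M)]
        y[t] = -np.dot(a, y[t - M:t][::-1]) + e[t]
    return y[M:]

rng = np.random.default_rng(0)
# AR(4) coefficients of the paper's first synthetic source
a_y1 = [-3.7864, 5.5051, -3.6365, 0.9224]
y1 = synth_ar(a_y1, np.sqrt(1.1e-6), 8000, rng)
```

With $a_{y_1}(M_j) = 0.9224$, the poles sit just inside the unit circle, so the recursion produces a stationary, strongly resonant signal of the kind shown in the figures.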

Fig. 3. Zoomed view of the two original sources and their estimated sources.

In Fig. 1, $\hat{a}_1(\tau)$ and $\hat{a}_2(\tau)$ are plotted against $a_1(\tau)$ and $a_2(\tau)$, from top to bottom, respectively. Fig. 2 shows the mixed signal and the sources separated by the SOLO-LAD method. Visually, it can be seen that the mixture has been separated very well when compared with the original sources. The separation performance is tabulated in Table 2, which compares SNMF2D, SCICA, the proposed SOLO-LAD and the IBM; the SDR results of each method are averaged over 100 experiments on the same mixture. The proposed SOLO-LAD method successfully estimates the sources with high accuracy, as shown in Figs. 2 and 3. In particular, SOLO-LAD renders an average SDR improvement of 9.8 dB per source over SNMF2D and 3.9 dB per source over SCICA. Since the proposed method estimates the parameter $\hat{a}_j$ from the complex 2D histogram, the result is based on the averaged AR coefficients of each source; as such, the estimated $a_j$ suits the purpose of separating stationary AR sources very well. The IBM results are included for comparison, and the proposed SOLO-LAD method almost reaches the same separation performance as the IBM.

5.1.2. Separation of more than two sources

In this evaluation, the proposed method is tested by increasing the number of sources from 3 to 5. For each configuration of the number of sources, we generate 50 mixtures. Twenty sources are synthesized using the model (3), with the coefficients asj(m, t) and the variances ej(t) randomly selected. All experiments are conducted under the same conditions: Δα(r) = 3, Δα(i) = 200, ζ(r) = 101 and ζ(i) = 3. The SDR performance of the higher-order mixtures is tabulated in Table 3. The separation performance progressively deteriorates as the number of sources increases because the sources become more mutually correlated with each other, which renders the separation problem more difficult. The average SDRs for 3, 4 and 5 sources are 18.5 dB, 17.3 dB, and 15.4 dB per source, respectively.

Table 3
Average SDR results for mixtures of 3–5 sources.

Mixture                  γ  δ  SDR (dB)
                               y1    y2    y3    y4    y5
s1 + s2 + s3             2  2  18.7  18.1  18.8  –     –
s1 + s2 + s3 + s4        1  3  17.1  16.8  17.8  17.3  –
s1 + s2 + s3 + s4 + s5   1  3  17.1  15.6  16.9  13.5  13.7
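The source synthesis used in these experiments can be sketched as below; the AR order, the coefficient range, and the (γ, δ) values are illustrative placeholders rather than the paper's exact settings:

```python
import numpy as np

def ar_source(n, coeffs, noise_std, rng):
    """Synthesize one AR source y(t) = -sum_m a(m) y(t-m) + e(t) (cf. model (3))."""
    y = np.zeros(n)
    e = rng.normal(0.0, noise_std, size=n)
    for t in range(n):
        for m in range(1, min(len(coeffs), t) + 1):
            y[t] -= coeffs[m - 1] * y[t - m]
        y[t] += e[t]
    return y

def artificial_stereo(x1, gamma, delta):
    """Second channel of the artificial stereo mixture:
    x2(t) = (x1(t) + gamma * x1(t - delta)) / (1 + |gamma|)."""
    shifted = np.zeros_like(x1)
    shifted[delta:] = x1[:-delta]
    return (x1 + gamma * shifted) / (1.0 + abs(gamma))

rng = np.random.default_rng(1)
# three AR(2) sources with randomly drawn (stable) coefficients
sources = [ar_source(2000, rng.uniform(-0.4, 0.4, size=2), 1.0, rng) for _ in range(3)]
x1 = np.sum(sources, axis=0)                      # single-channel observed mixture
x2 = artificial_stereo(x1, gamma=2.0, delta=2)    # weighted, time-shifted copy
```

Keeping the AR coefficients small guarantees stable (stationary) sources, which is the regime this experiment targets.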

5.2. Real audio sources

Audio sources are used to test the proposed method. Three types of mixtures are generated, i.e., music + music (m+m), music + speech (m+s), and speech + speech (s+s). The male and female speech signals are randomly selected from the TIMIT database and the music sources from the RWC [24] database. Both sources are mixed with equal power to generate the mixture.

5.2.1. Mixture of music and music

The mixture of drum and jazz is presented in Fig. 4 along with the original sources. Using the proposed SOLO-LAD and SNMF2D methods, the estimated drum and jazz signals are well distinguished from the mixture when compared with the original sources (Figs. 5, 6, and 7). Visually, the SCICA method can separate only the jazz signal from the mixture; the estimated drum is not separated well compared with its original source, as shown in Fig. 8. The average SDR results on the m+m mixture using the proposed method compared with the SNMF2D and SCICA methods are shown in Fig. 9. The proposed SOLO-LAD method yields the best separation performance among the three methods, surpassing the SNMF2D and SCICA methods with average SDR improvements of 2.11 dB per source and 3.29 dB per source, respectively.

5.2.2. Mixture of speech and music

Two original sources with equal power and the mixture are shown in the top two rows of Fig. 10. Visually, in Figs. 11 and 12, the estimated sources have been clearly separated when compared with the original sources. On the other hand, the estimated sources of the SNMF2D method in Fig. 13 have been incorrectly separated, as indicated by the dashed boxes: the estimated male speech has lost some of its information, while the estimated drum is still mixed with the male speech. For the SCICA method, the original sources cannot be efficiently separated from the mixture, as shown in Fig. 14. Fig. 15 illustrates that the proposed SOLO-LAD method yields average SDR improvements of 2.15 dB per source and 3.92 dB per source over the SNMF2D and SCICA methods, respectively. For the proposed method, the estimates âj from the complex 2D histogram are based on the average AR coefficients of each source; since the AR coefficients of the drum signal are less non-stationary than those of the speech signal, the estimated drum has a better SDR than the estimated speech.

Fig. 4. Two original sources, observed mixture of music and music.
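The SDR figures quoted in this section follow BSS-Eval [19]; a simplified per-source sketch (projecting the estimate onto the reference and treating the residual as distortion, ignoring BSS-Eval's interference/artifact split) is:

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in dB: the target is the
    projection of the estimate onto the reference; everything else is
    counted as distortion."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    s_target = (estimate @ reference) / (reference @ reference) * reference
    distortion = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (distortion @ distortion))
```

Under this definition, an estimate equal to the reference plus orthogonal noise at one-tenth of the reference amplitude scores 20 dB.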

Fig. 5. Zoomed view of the original drum and jazz and their estimated sources.

Fig. 6. Estimated sources using proposed SOLO-LAD.

Fig. 7. Estimated sources using SNMF2D.

Fig. 8. Estimated sources using SCICA.

Fig. 9. Average SDR of m+m mixture for SOLO-LAD, SNMF2D, and SCICA.

5.2.3. Mixture of speech and speech

Two speech signals with equal power and the mixture are shown in Fig. 16. Visually, all three methods produce inefficient separation results compared with the original sources, as shown in Figs. 17–19.

For speech mixtures, the proposed SOLO-LAD method yields the best SDRs, with average improvements of 2.9 dB and 3.0 dB per source over the SCICA and SNMF2D methods, respectively. Separating the speech-and-speech mixture is the most difficult scenario for the SCBSS problem because speech signals are highly time-varying. Additionally, the speech signatures are more similar to each other than in the other types of mixtures; for example, the AR coefficients of male and female speech differ only slightly at each AR order. In the case of the SNMF2D and SCICA methods, speech signals may use the same basis components to reconstruct the original source. Finally, the average SDRs of the proposed SOLO-LAD, SNMF2D, and SCICA methods for each mixing type are shown in Fig. 20. Overall, the proposed method clearly shows better separation performance than SNMF2D and SCICA, with average SDR improvements of 2.38 dB per source (46%) and 3.41 dB per source (83%), respectively. As expected, for all three methods the m+m mixture obtains the best separation performance, followed by s+m and s+s,

Fig. 10. Two original sources, observed mixture of speech and music.

Fig. 11. Zoomed view of the original speech and drum and their estimated sources.

Fig. 12. Estimated sources using proposed SOLO-LAD.

Fig. 13. Estimated sources using SNMF2D.

Fig. 14. Estimated sources using SCICA.

Fig. 15. Average SDR of s+m mixture for SOLO-LAD, SNMF2D, and SCICA.

respectively. The reasons are two-fold: first, the AR coefficients of two music sources are more distinct from each other than in the other two mixture types; second, speech signals are highly non-stationary and thus more difficult to separate than music.

The computational complexities of the proposed SOLO-LAD, SNMF2D, and SCICA methods are expressed as functions of the signal sample size (N), the number of sources (Ns), the length of the STFT window (L), the number of frequency shifts (Nφ) and time shifts (Nτ) for SNMF2D, the number of iterations for SNMF2D (C) and for SCICA (I), and the number of SCICA blocks (K). This is given in Table 4. The computational complexity of the three algorithms is plotted in Fig. 21 with the following parameters: Ns = 2, L = 1024, Nφ = 31, Nτ = 7, C = 100, I = 100, K = 10, with N varying from 1 × 10⁴ to 8 × 10⁴. Fig. 21 shows that the complexity of SCICA is almost identical to that of SNMF2D, in the region of 10¹⁰ operations; the overall computational complexity of both algorithms is therefore significantly high. In contrast, the proposed SOLO-LAD requires the least computation, which renders it very fast while still yielding the best separation performance among the three methods. SOLO-LAD is computationally less demanding because it does not require any iterations for updating parameters, whereas SNMF2D must iteratively update the spectral bases and the mixing of the sources. As for SCICA, the computational complexity grows steadily with increasing sample size for three major reasons: (i) the extraction steps are repeated until all sources have been extracted; (ii) deflation is required to remove the contribution of each extracted source of interest; and (iii) the complexity of the ICA algorithm within SCICA grows exponentially with the number of blocks.
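Using the operation counts of Table 4 (as transcribed here; the exact grouping of the SCICA expression is an assumption), the gap at the experimental settings can be checked numerically:

```python
import math

def ops_solo_lad(N, Ns, L):
    """SOLO-LAD operation count from Table 4: 5N + L + 4N*Ns + 2N*log2(L)."""
    return 5 * N + L + 4 * N * Ns + 2 * N * math.log2(L)

def ops_scica(N, Ns, K, I):
    """SCICA operation count from Table 4 (grouping assumed):
    [2K(K+1)(N-K+1)IK + K^3 + 2K(N-K+1) + (K^2 + K(K-1))(N-K+1)] * Ns."""
    block = N - K + 1
    return (2 * K * (K + 1) * block * I * K + K ** 3
            + 2 * K * block + (K ** 2 + K * (K - 1)) * block) * Ns

for N in (10_000, 80_000):
    print(f"N={N}: SOLO-LAD {ops_solo_lad(N, 2, 1024):.2e} ops, "
          f"SCICA {ops_scica(N, 2, 10, 100):.2e} ops")
```

Already at N = 10⁴ this puts SCICA several orders of magnitude above SOLO-LAD, consistent with the trend plotted in Fig. 21.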

Fig. 20. Average SDR of three mixture types for SOLO-LAD, SNMF2D, and SCICA.

6. Conclusion

In this paper, a novel single-channel blind separation algorithm has been presented. The method assumes that the source signals are characterized as AR processes. An artificial stereo mixture, formed by time-delaying and weighting the observed mixture, is proposed for these AR processes. Additionally, the separability analysis has verified that the proposed mixture model can be separated. The proposed method has demonstrated a high level of separation performance on stationary and non-stationary sources. The key to the success of the proposed method is the source

Fig. 16. Two original sources, observed mixture of speech and speech.

Fig. 17. Estimated sources using proposed SOLO-LAD.

Fig. 18. Estimated sources using SNMF2D.

Fig. 19. Estimated sources using SCICA.

Table 4
Computation complexity of SNMF2D, SCICA, and SOLO-LAD.

Methods     Number of operations
SOLO-LAD    $5N + L + 4NN_s + 2N\log_2 L$
SNMF2D      $2N\log_2 L + CN_s\left[3\tau 2L + 2N_\phi N_\tau N + N_\phi(4NL + 2N_\tau N) + 2N_\phi N_\tau N + 2L + N_\tau N + 2L + N_s 2L\right]$
SCICA       $\left[2K(K+1)(N-K+1)IK + K^3 + 2K(N-K+1) + (K^2 + K(K-1))(N-K+1)\right]N_s$

Fig. 21. Comparison of computational complexity on a mixture of two audio sources between SOLO-LAD, SNMF2D, and SCICA.

parameters given by $a_j(t;\delta,\gamma)$ and $r_j(t;\delta,\gamma)$ obtained from the artificial stereo mixture. Based on the estimated TF mask, the proposed approach is able to capture the audio characteristics, particularly of music–music and music–speech mixtures. This renders the separation method robust. Experimental testing has shown that the proposed framework yields superior separation performance compared with existing SCBSS methods.

Appendix A

Derivation of the artificial-stereo mixture in terms of the sources

Eq. (4) can be expressed in terms of the source signals, the AR coefficients and the time-delay as

$$
\begin{aligned}
x_2(t) &= \frac{x_1(t) + \gamma x_1(t-\delta)}{1+|\gamma|}
        = \frac{y_1(t) + y_2(t) + \gamma\left[y_1(t-\delta) + y_2(t-\delta)\right]}{1+|\gamma|} \\
       &= \frac{1}{1+|\gamma|}\Big({-}\sum_{m=1}^{M_1} a_{y_1}(m)\,y_1(t-m) + e_1(t)\Big) + \frac{\gamma y_1(t-\delta)}{1+|\gamma|}
        + \frac{1}{1+|\gamma|}\Big({-}\sum_{m=1}^{M_2} a_{y_2}(m)\,y_2(t-m) + e_2(t)\Big) + \frac{\gamma y_2(t-\delta)}{1+|\gamma|} \\
       &= \frac{-a_{y_1}(\delta)+\gamma}{1+|\gamma|}\,y_1(t-\delta) + \frac{-a_{y_2}(\delta)+\gamma}{1+|\gamma|}\,y_2(t-\delta)
        + \frac{e_1(t) - \sum_{m=1,\,m\neq\delta}^{M_1} a_{y_1}(m)\,y_1(t-m)}{1+|\gamma|}
        + \frac{e_2(t) - \sum_{m=1,\,m\neq\delta}^{M_2} a_{y_2}(m)\,y_2(t-m)}{1+|\gamma|}
\end{aligned}
\tag{31}
$$

Define

$$a_j(t;\delta,\gamma) = \frac{-a_{y_j}(\delta,t)+\gamma}{1+|\gamma|} \tag{32}$$

$$r_j(t;\delta,\gamma) = \frac{e_j(t) - \sum_{m=1,\,m\neq\delta}^{M_j} a_{y_j}(m,t)\,y_j(t-m)}{1+|\gamma|} \tag{33}$$

where $a_j(t;\delta,\gamma)$ and $r_j(t;\delta,\gamma)$ represent the mixing attenuation and the residue of the $j$th source, respectively. Using (32) and (33), the overall proposed mixing model of SOLO-LAD can now be formulated in terms of the sources as

$$
\begin{aligned}
x_1(t) &= y_1(t) + y_2(t) \\
x_2(t) &= a_1(t;\delta,\gamma)\,y_1(t-\delta) + a_2(t;\delta,\gamma)\,y_2(t-\delta) + r_1(t;\delta,\gamma) + r_2(t;\delta,\gamma).
\end{aligned}
\tag{34}
$$

Derivation of the least absolute deviation cost function

The proposed least absolute deviation (LAD) cost function is derived from the maximum likelihood (ML) estimate of the $j$th source, where the Gaussian likelihood function is first formulated using (12) as

$$
\begin{aligned}
L_j(\tau,\omega) &:= p\big(X_1(\tau,\omega), X_2(\tau,\omega)\,\big|\,Y_j(\tau,\omega), a_j(\tau,\omega), \delta\big) \\
&= \prod_{(\tau,\omega)\in\Omega_j} f_{\sigma_1^2(\tau,\omega)}\,f_{\sigma_2^2(\tau,\omega)}\big(X_1(\tau,\omega)-Y_j(\tau,\omega),\; X_2(\tau,\omega)-a_j(\tau,\omega)e^{-i\omega\delta}Y_j(\tau,\omega)\big) \\
&= C\cdot\exp\!\left(-\frac{1}{2}\sum_{(\tau,\omega)\in\Omega_j}\left[\frac{|X_1(\tau,\omega)-Y_j(\tau,\omega)|^2}{\sigma_1^2(\tau,\omega)} + \frac{|X_2(\tau,\omega)-a_j(\tau,\omega)e^{-i\omega\delta}Y_j(\tau,\omega)|^2}{\sigma_2^2(\tau,\omega)}\right]\right)
\end{aligned}
\tag{35}
$$

where $a_j(\tau,\omega) = a_j(\tau) - C_j(\tau,\omega)$, $C$ is a normalizing constant, and $X_1(\tau,\omega)$, $X_2(\tau,\omega)$ with $(\tau,\omega)\in\Omega_j$. Maximizing (35) is equivalent to maximizing

$$
L_j(\tau,\omega) = -\sum_{(\tau,\omega)\in\Omega_j}\left[\frac{|X_1(\tau,\omega)-Y_j(\tau,\omega)|^2}{\sigma_1^2(\tau,\omega)} + \frac{|X_2(\tau,\omega)-a_j(\tau,\omega)e^{-i\omega\delta}Y_j(\tau,\omega)|^2}{\sigma_2^2(\tau,\omega)}\right].
\tag{36}
$$

Secondly, the Gaussian likelihood function is maximized with respect to $Y_j(\tau,\omega)$. The ML estimate of $Y_j(\tau,\omega)$ is obtained by solving $\partial L_j(\tau,\omega)/\partial Y_j(\tau,\omega) = 0$ for all $(\tau,\omega)\in\Omega_j$, with

$$
\frac{\partial L_j(\tau,\omega)}{\partial Y_j(\tau,\omega)} = \frac{X_1(\tau,\omega)-Y_j(\tau,\omega)}{\sigma_1^2(\tau,\omega)} + \frac{\big(X_2(\tau,\omega)-a_j(\tau,\omega)e^{-i\omega\delta}Y_j(\tau,\omega)\big)\,a_j(\tau,\omega)e^{i\omega\delta}}{\sigma_2^2(\tau,\omega)}, \quad (\tau,\omega)\in\Omega_j
\tag{37}
$$

Equating (37) to zero, i.e. $-\dfrac{X_1(\tau,\omega)-Y_j(\tau,\omega)}{\sigma_1^2(\tau,\omega)} = \dfrac{\big(X_2(\tau,\omega)-a_j(\tau,\omega)e^{-i\omega\delta}Y_j(\tau,\omega)\big)\,a_j(\tau,\omega)e^{i\omega\delta}}{\sigma_2^2(\tau,\omega)}$, and collecting terms gives $\big(\sigma_2^2 + \sigma_1^2 a_j^2\big)Y_j = \sigma_2^2 X_1 + \sigma_1^2 a_j e^{i\omega\delta} X_2$ (arguments suppressed), hence

$$
Y_j^{\mathrm{ML}}(\tau,\omega) = \frac{\sigma_2^2(\tau,\omega)X_1(\tau,\omega) + \sigma_1^2(\tau,\omega)\,a_j(\tau,\omega)e^{i\omega\delta}X_2(\tau,\omega)}{\sigma_2^2(\tau,\omega) + \sigma_1^2(\tau,\omega)\,a_j^2(\tau,\omega)}, \quad (\tau,\omega)\in\Omega_j
\tag{38}
$$

Assuming $\sigma_1^2(\tau,\omega)\approx\sigma_2^2(\tau,\omega)=\sigma^2(\tau,\omega)$, (38) simplifies to

$$
Y_j^{\mathrm{ML}}(\tau,\omega) = \frac{X_1(\tau,\omega) + a_j(\tau,\omega)e^{i\omega\delta}X_2(\tau,\omega)}{1 + a_j^2(\tau,\omega)}, \quad (\tau,\omega)\in\Omega_j
\tag{39}
$$

By invoking the W-DO assumption and substituting (12) into (39), the original source is recovered:

$$
\begin{aligned}
Y_j^{\mathrm{ML}}(\tau,\omega) &= \frac{X_1(\tau,\omega) + a_j(\tau,\omega)e^{i\omega\delta}X_2(\tau,\omega)}{1+a_j^2(\tau,\omega)}
 = \frac{Y_j(\tau,\omega) + a_j(\tau,\omega)e^{i\omega\delta}\big[a_j(\tau)-C_j(\tau,\omega)\big]e^{-i\omega\delta}Y_j(\tau,\omega)}{1+a_j^2(\tau,\omega)} \\
&= \frac{Y_j(\tau,\omega) + a_j(\tau,\omega)\big[a_j(\tau)-C_j(\tau,\omega)\big]Y_j(\tau,\omega)}{1+a_j^2(\tau,\omega)}
 = \frac{\big(1+a_j^2(\tau,\omega)\big)Y_j(\tau,\omega)}{1+a_j^2(\tau,\omega)} = Y_j(\tau,\omega)
\end{aligned}
\tag{40}
$$

In the light of (40), i.e. $Y_j^{\mathrm{ML}}(\tau,\omega) = Y_j(\tau,\omega)$, the proposed cost function can finally be formulated using the least absolute deviation (LAD), which can be expressed as

$$
\begin{aligned}
J(\tau,\omega) &= \arg\min_k \left| X_1(\tau,\omega) - \frac{X_1(\tau,\omega) + a_k(\tau,\omega)e^{i\omega\delta}X_2(\tau,\omega)}{1+a_k^2(\tau,\omega)} \right| \\
&= \arg\min_k \left| \frac{X_1(\tau,\omega) + a_k^2(\tau,\omega)X_1(\tau,\omega) - X_1(\tau,\omega) - a_k(\tau,\omega)e^{i\omega\delta}X_2(\tau,\omega)}{1+a_k^2(\tau,\omega)} \right| \\
&= \arg\min_k \left| \frac{a_k^2(\tau,\omega)X_1(\tau,\omega) - a_k(\tau,\omega)e^{i\omega\delta}X_2(\tau,\omega)}{1+a_k^2(\tau,\omega)} \right| \\
&= \arg\min_k \left| \frac{a_k(\tau,\omega)}{1+a_k^2(\tau,\omega)}\big(a_k(\tau,\omega)X_1(\tau,\omega) - e^{i\omega\delta}X_2(\tau,\omega)\big) \right| \\
&= \arg\min_k \left| a_k(\tau,\omega)X_1(\tau,\omega) - e^{i\omega\delta}X_2(\tau,\omega) \right|
\end{aligned}
\tag{41}
$$

where $a_k(\tau,\omega) = a_k(\tau) - C_k(\tau,\omega)$,

$$a_k(\tau) = \frac{-a_{y_k}(\delta,\tau)+\gamma}{1+|\gamma|}, \qquad C_k(\tau,\omega) = \frac{1}{1+|\gamma|}\sum_{m=1,\,m\neq\delta}^{M_k} a_{y_k}(m,\tau)\,e^{-i\omega(m-\delta)}$$

and $a_k(\tau,\omega)/\big(1+a_k^2(\tau,\omega)\big)$ is regarded as a constant with respect to the minimization and can therefore be neglected.

References

[1] B. Gao, W.L. Woo, B.W.-K. Ling, Machine learning source separation using maximum a posteriori nonnegative matrix factorization, IEEE Trans. Cybern. 44 (7) (2014) 1169–1179.

[2] N. Tengtrairat, W.L. Woo, Extension of DUET to single-channel mixing model and separability analysis, Signal Process. 96 (2014) 261–265. [3] P.O. Hoyer, Non-negative matrix factorization with sparseness constraints, J. Mach. Learn. Res. 5 (2004) 1457–1469. [4] K.E. Hild II, H.T. Attias, S.S. Nagarajan, An expectation–maximization method for spatio-temporal blind source separation using an AR-MOG source model, IEEE Trans. Neural Netw. 19 (3) (2008) 508–519. [5] B. Gao, W.L. Woo, S.S. Dlay, Variational regularized two-dimensional nonnegative matrix factorization, IEEE Trans. Neural Netw. Learn. Syst. 23 (5) (2012) 703–716. [6] M.N. Schmidt, M. Morup, Nonnegative matrix factor 2-D deconvolution for blind single channel source separation, in: Proceedings of the ICABSS 2006, vol. 3889, 2006, pp. 700–707. [7] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, Z. He, Minimum-volume-constrained nonnegative matrix factorization: enhanced ability of learning parts, IEEE Trans. Neural Netw. 22 (10) (2011) 1626–1637. [8] B. Gao, W.L. Woo, S.S. Dlay, Single-channel source separation using EMDsubband variable regularized sparse features, IEEE Trans. Audio Speech Lang. Process. 19 (4) (2011) 961–976. [9] Ö. Yilmaz, S. Rickard, Blind separation of speech mixtures via time–frequency masking, IEEE Trans. Signal Process. 52 (7) (2004) 1830–1847. [10] T. May, S.V.D. Par, A. Kohlrausch, A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation, IEEE Trans. Audio Speech Lang. Process. 20 (7) (2012) 2016–2030. [11] J. Woodruff, D.L. Wang, Binaural localization of multiple sources in reverberant and noisy environments, IEEE Trans. Audio Speech Lang. Process. 20 (5) (2012) 1503–1512. [12] W.-K. Ma, T.-H. Hsieh, C.-Y. Chi, DOA estimation of quasi-stationary signals with less sensors than sources and unknown spatial noise covariance: a Khatri–Rao subspace approach, IEEE Trans. Signal Process. 58 (4) (2010) 2168–2180. 
[13] R.G. McKilliam, B.G. Quinn, I.V.L. Clarkson, B. Moran, Frequency estimation by phase unwrapping, IEEE Trans. Signal Process. 58 (6) (2010) 2953–2963. [14] R. de Frein, S. Rickard, The synchronized short-time-Fourier-transform: properties and definitions for multichannel source separation, IEEE Trans. Signal Process. 59 (1) (2011) 91–103. [15] R. Balan, J. Rosca, S. Rickard, J. O’Ruanaidh, The influence of windowing on time delay estimates, in: Proceedings of the Conference on Information Sciences and Systems, Princeton, vol. 1, 2000, pp. WP1-15–WP1-17. [16] Y. Xiang, S.K. Ng, V.K. Nguyen, Blind separation of mutually correlated sources using precoders, IEEE Trans. Neural Netw. 21 (1) (2010) 82–90. [17] Y. Song, X. Peng, Spectra analysis of sampling and reconstructing continuous signal using hamming window function, in: Proceedings of the 4th IEEE International Conference on Natural Computation, 2008, pp. 48–52. [18] N. Tengtrairat, Bin Gao, W.L. Woo, S.S. Dlay, Single-channel blind separation using pseudo-stereo mixture and complex 2-D histogram, IEEE Trans. Neural Netw. Learn. Syst. 24 (11) (2013) 1722–1735. [19] E. Vincent, R. Gribonval, C. Févotte, Performance measurement in blind audio source separation, IEEE Trans. Audio Speech Lang. Process. 14 (4) (2006) 1462–1469. [20] B. Mijovic, M.D. Vos, I. Gligorijevic, J. Taelman, S.V. Haffel, Source separation from single-channel recordings by combining empirical-mode decomposition and independent component analysis, IEEE Trans. Biomed. Eng. 57 (9) (2010) 2188–2196. [21] Y. Li, D. Wang, On the optimality of ideal binary time–frequency masks, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2008, pp. 3501–3504. [22] B. Gao, W.L. Woo, S.S. Dlay, Unsupervised single-channel separation of nonstationary signals using Gammatone filterbank and Itakura–Saito nonnegative matrix two-dimensional factorizations, IEEE Trans. Circuits and Syst. I 60 (3) (2013) 662–675. [23] B. 
Gao, W.L. Woo, S.S. Dlay, Adaptive sparsity nonnegative matrix factorization for single channel source separation, IEEE J. Sel. Top. Signal Process. 5 (5) (2011) 989–1001. [24] M. Goto, H. Hashiguchi, T. Nishimura, R. Oka, RWC music database: music genre database and musical instrument sound database, in: Proceedings of the ISMIR 2003, 2003, pp. 229–230.

N. Tengtrairat received the B.Eng. degree in Computer Engineering from Chiang Mai University, Chiang Mai, Thailand, the M.Sc. degree in Management Information Systems from Chulalongkorn University, and the Ph.D. degree from Newcastle University, UK, in 2013. She is currently a Lecturer with the Department of Computer Science at Payap University, Thailand. Her research interest lies in the area of machine learning and computational signal processing.

W.L. Woo was born in Malaysia. He received the B.Eng. degree (1st Class Hons.) in Electrical and Electronics Engineering and the Ph.D. degree from Newcastle University, UK. He was awarded the IEE Prize and the British Scholarship in 1998 to continue his research work. He is currently a Senior Lecturer with the School of Electrical and Electronics Engineering at the same university. His major research is in the mathematical theory and algorithms for nonlinear signal and image processing, including machine learning, information analytics, and sensing signal and image processing. He has published over 250 papers on these topics in various journals and international conference proceedings. He currently serves on the editorial boards of several international signal processing journals, and actively participates in international conferences and workshops, serving on their organizing and technical committees. Dr. Woo is a Senior Member of the IEEE and a Member of the Institution of Engineering and Technology (IET).
