Blind source separation based on independent vector analysis using feed-forward network


Neurocomputing 74 (2011) 3713–3715

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Letters

Myungwoo Oh, Hyung-Min Park

Department of Electronic Engineering, Sogang University, Seoul 121-742, Republic of Korea

Article info

Article history: Received 1 February 2011; received in revised form 9 May 2011; accepted 7 June 2011; communicated by M.S. Bartlett; available online 28 June 2011.

Keywords: Blind source separation; Independent component analysis; Feed-forward network

Abstract

This paper presents an algorithm that employs a feed-forward (FF) network on each frequency bin as an unmixing system in the framework of independent vector analysis (IVA), in order to separate highly reverberated mixtures effectively while exploiting the inter-frequency dependencies of each source signal. Furthermore, to avoid the whitening of unmixed source signals caused by the FF unmixing network, we derive a learning algorithm for the network based on the extended non-holonomic constraint and the minimal distortion principle. Experiments show that the proposed method delivers better separation performance than the conventional IVA and the FF independent component analysis methods. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

Blind source separation (BSS), which recovers source signals from their mixtures without knowledge of the mixing process, has attracted considerable interest. Independent component analysis (ICA), which finds statistically independent sources by resorting to higher-order statistics, has been successfully used for BSS [2]. As real-world acoustic mixing involves convolution, ICA has been extended to the deconvolution of mixtures in both the time and frequency domains. The frequency-domain (FD) approach is generally preferred because of the intensive computations and slow convergence of the time-domain (TD) approach, but it requires the permutation problem to be resolved [2]. Independent vector analysis (IVA) can effectively avoid this problem and improve the separation performance by introducing a plausible source prior that models the inherent dependencies across frequency instead of using an independent prior at each frequency bin [4].

Like the conventional FD ICA, IVA separates source signals by estimating an instantaneous unmixing matrix on each frequency bin, because convolution in the time domain can be replaced with bin-wise multiplication in the frequency domain. Although the FD approaches are attractive because of this simple multiplication, the replacement is valid only when the frame length is long enough to cover the entire reverberation of the mixing process [3]. However, real-world acoustic reverberation is often too long (compared to a typical frame length used to analyze audio signals) to effectively unmix source signals with an instantaneous weight matrix [1].

Recently, Kim et al. extended the conventional FD ICA by employing a feed-forward (FF) unmixing filter structure to separate source signals in highly reverberant environments [3]. In addition, to guide the separated source signals into their designated channel outputs without arbitrary permutation and scaling, the method utilized spatial information of the sources and then updated the unmixing filters with a learning rule incorporating an additional regularization term that keeps the output signals as close as possible to the first-stage outputs based on the spatial information.

Although the FF ICA method described above could deal with highly reverberated mixtures, it still assumed an implausible prior of independence across the frequency bins of each source signal. In addition, the applied entropy-maximization learning rule suffered from a side effect: the removal of inter-frame correlations of the unmixed source signals. Moreover, the permutation problem inherent in the FD ICA was avoided using a somewhat heuristic technique that employed the first-stage beamforming outputs and the additional regularization term in the learning rule.

To overcome these limitations of the FF ICA method and the inability of the IVA method to separate highly reverberated mixtures, we introduce an FF unmixing filter network on each frequency bin in the framework of IVA. For this method, we derive a learning algorithm for the network based on the extended non-holonomic constraint and the minimal distortion principle (MDP) [5] to avoid the inter-frame whitening effect and the scaling indeterminacy of the unmixed source signals. Furthermore, this method adopts the minimum power distortionless response (MPDR) beamformer with extra null-forming constraints [3] as an optional pre-processing step to improve separation performance.¹

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. 2010-0027797).

Corresponding author: H.-M. Park. Tel.: +82 2 705 8916; fax: +82 2 706 4216. E-mail address: [email protected] (H.-M. Park).

0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2011.06.008

[Fig. 1 label: room size 5 m × 4 m × 3 m.]

2. Proposed FF IVA method

First, the observed TD mixtures are converted to FD signals by the short-time (ST) Fourier transform. Before the FF unmixing process, the target signals are enhanced and the interfering components in the mixtures are suppressed by the MPDR beamformer with extra null-forming constraints based on spatial information of the sources, formulated as [3]

$$\hat{\mathbf{x}}(k,n) = \left(\mathbf{D}^H(k)\,[\mathbf{R}(k)+\lambda\mathbf{I}]^{-1}\,\mathbf{D}(k)\right)^{-1}\mathbf{D}^H(k)\,[\mathbf{R}(k)+\lambda\mathbf{I}]^{-1}\,\mathbf{x}(k,n), \tag{1}$$

where $\hat{\mathbf{x}}(k,n) = [\hat{x}_1(k,n) \cdots \hat{x}_L(k,n)]^T$ and $\mathbf{x}(k,n) = [x_1(k,n) \cdots x_L(k,n)]^T$ denote the ST FD representations of the beamforming output signal and mixture signal vectors, respectively, at frequency bin $k$ and frame $n$. $L$ is the number of sources and mixtures. $\mathbf{D}(k) = [\mathbf{d}_1(k) \cdots \mathbf{d}_L(k)]$ is the matrix composed of steering vectors towards the sources, and $\mathbf{R}(k)$ is the input spectral covariance matrix. $\lambda$ is a small positive constant that guards against the singularity of $\mathbf{R}(k)$.

After this pre-processing of the mixtures, we unmix $\hat{\mathbf{x}}(k,n)$ by employing an FF unmixing filter network expressed as

$$\hat{\mathbf{s}}(k,n) = \sum_{m=0}^{U} \mathbf{W}(k,m)\,\hat{\mathbf{x}}(k,n-m), \tag{2}$$

where $\hat{\mathbf{s}}(k,n)$ denotes the ST FD representation of the unmixed source signals and $\mathbf{W}(k,m)$ represents an unmixing filter coefficient matrix. As a measure of independence, we use the Kullback–Leibler divergence between the exact joint probability density function (pdf) $p(\hat{\mathbf{t}}_1(n) \cdots \hat{\mathbf{t}}_L(n))$ and the product of hypothesized pdf models of the sources $\prod_{i=1}^{L} q(\hat{\mathbf{t}}_i(n))$, where $\hat{\mathbf{t}}_i(n) = [\hat{s}_i(1,n) \cdots \hat{s}_i(K,n)]$ and $K$ is the number of frequency bins. After removing the term independent of the FF unmixing filter network, we obtain the cost function

$$J = -\sum_{k=1}^{K} \log\left|\det \mathbf{W}(k,0)\right| - \sum_{i=1}^{L} E\{\log q(\hat{\mathbf{t}}_i(n))\}. \tag{3}$$
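As a rough illustration of Eqs. (1)–(3), the following NumPy sketch applies the beamformer on one bin, runs the FF unmixing filter over all bins, and evaluates the cost. It is not the authors' code. The function names, the array layouts, the default value of `lam`, the zero-padding of frames before $n = 0$, and the replacement of the expectation by a frame average are all assumptions of this sketch. The prior $q(\hat{\mathbf{t}}_i) \propto \exp(-\|\hat{\mathbf{t}}_i\|_2)$ is also an assumption, chosen because it is consistent with the score function used later in this section.

```python
import numpy as np

def mpdr_null_beamform(x, D, R, lam=1e-3):
    """Eq. (1) on one frequency bin.

    x: (L, N) mixture frames, D: (L, L) steering matrix,
    R: (L, L) input spectral covariance, lam: assumed loading constant.
    """
    L = R.shape[0]
    R_inv = np.linalg.inv(R + lam * np.eye(L))
    DH_R_inv = D.conj().T @ R_inv
    # Solve (D^H R_inv D) y = D^H R_inv x instead of forming the outer inverse.
    return np.linalg.solve(DH_R_inv @ D, DH_R_inv @ x)

def ff_unmix(W, x_hat):
    """Eq. (2): s_hat(k, n) = sum_{m=0}^{U} W(k, m) x_hat(k, n - m).

    W: (K, U + 1, L, L) filter coefficients, x_hat: (K, N, L) frames.
    Frames before n = 0 are treated as zero (assumption of this sketch).
    """
    K, taps, L, _ = W.shape
    N = x_hat.shape[1]
    s_hat = np.zeros((K, N, L), dtype=complex)
    for m in range(taps):
        # Delay x_hat by m frames, then apply W(k, m) on every bin.
        s_hat[:, m:, :] += np.einsum("kij,knj->kni", W[:, m], x_hat[:, : N - m, :])
    return s_hat

def iva_cost(W, s_hat):
    """Eq. (3) with the assumed prior q(t) proportional to exp(-||t||_2)."""
    # First term: -sum_k log|det W(k, 0)|.
    _, logabsdet = np.linalg.slogdet(W[:, 0])
    # Second term: -sum_i E{log q(t_i)} equals sum_i E{||t_i||_2} up to a constant;
    # the norm runs over all K bins, the expectation is a frame average.
    norms = np.sqrt(np.sum(np.abs(s_hat) ** 2, axis=0))  # shape (N, L)
    return float(-np.sum(logabsdet) + np.sum(np.mean(norms, axis=0)))
```

Because the norm in `iva_cost` is taken across all frequency bins of one source, the cost couples the bins, which is what distinguishes the IVA prior from an independent prior at each bin. Note also that for a square, invertible steering matrix the output of `mpdr_null_beamform` reduces to $\mathbf{D}^{-1}(k)\,\mathbf{x}(k,n)$.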

The on-line natural gradient algorithm to minimize this function is derived as

$$\Delta\mathbf{W}(k,m) \propto \mathbf{W}(k,m) - \boldsymbol{\varphi}^{(k)}(\hat{\mathbf{t}}(n-U)) \sum_{r=0}^{U} \hat{\mathbf{s}}^H(k,n-U-m+r)\,\mathbf{W}(k,r), \tag{4}$$

where the second term on the right side introduces a $U$-frame delay to avoid non-causality. In the multivariate score function $\boldsymbol{\varphi}^{(k)}(\hat{\mathbf{t}}(n)) = [\varphi^{(k)}(\hat{\mathbf{t}}_1(n)) \cdots \varphi^{(k)}(\hat{\mathbf{t}}_L(n))]^T$, we use

$$\varphi^{(k)}(\hat{\mathbf{t}}_i(n)) = -\frac{\partial \log q(\hat{\mathbf{t}}_i(n))}{\partial \hat{s}_i(k,n)} = \frac{\hat{s}_i(k,n)}{\sqrt{\sum_{k'=1}^{K} |\hat{s}_i(k',n)|^2}}.$$

As this algorithm also tries to remove the inter-frame correlations of the unmixed source signals because of entropy maximization, we introduce the extended non-holonomic constraint and the MDP. Following a derivation similar to that in [5], we obtain an algorithm that introduces neither a whitening effect nor a scaling indeterminacy:

$$\Delta\mathbf{W}(k,m) \propto -\sum_{r=0}^{U} \left\{ \text{off-diag}\!\left(\boldsymbol{\varphi}^{(k)}(\hat{\mathbf{t}}(n-U))\,\hat{\mathbf{s}}^H(k,n-U-m+r)\right) + \beta\left(\hat{\mathbf{s}}(k,n-U) - \hat{\mathbf{x}}(k,n-3U/2)\right)\hat{\mathbf{s}}^H(k,n-U-m+r) \right\} \mathbf{W}(k,r), \tag{5}$$

where 'off-diag(·)' sets the diagonal elements of its matrix argument to zero and $\beta$ is a small positive weighting constant. As the mixing system may be non-minimum phase and $E\{\|\hat{\mathbf{s}}(k,n-U) - \hat{\mathbf{x}}(k,n-3U/2)\|^2\}$ is minimized by the MDP, the FF unmixing filter coefficients are initialized to zero, except for the diagonal elements of $\mathbf{W}(k,U/2)$ at all frequency bins, which are initialized to one.

3. Experimental results

The proposed method was compared with the conventional FD ICA [6], the conventional IVA [4], and the FF ICA [3] methods in terms of the signal-to-interference ratio (SIR) defined as

$$\mathrm{SIR\,(dB)} = \frac{1}{L}\sum_{i=1}^{L} 10\log_{10}\left(\frac{\text{target signal energy in the } i\text{-th output}}{\text{interference signal energy in the } i\text{-th output}}\right). \tag{6}$$

Two mixtures were simulated using the image method with two source signals arranged according to three different configurations, as shown in Fig. 1. The source signals were concatenated sentences uttered by two male and two female speakers from the TIMIT database. Each signal was 8 s long at a 16-kHz sampling rate. The SIRs of the mixtures ranged from 2.5 to 2.7 dB. The optimal step size for each method was determined by extensive experiments. As analysis windows for the STFT, we employed Hanning windows of 2048 and 512 samples with shifts of 512 and 128 samples for the methods with instantaneous and feed-forward unmixing systems, respectively, which provided the best performance. For fair comparison, we initialized the parameters of all the methods with the same values computed from the MPDR beamformer with extra null-forming constraints based on spatial information of the sources, given as Eq. (1).

Fig. 1. Source and microphone positions during experiments. Two microphones were fixed at positions marked by gray circles. Two sources were placed at positions marked by blank circles. Numbers in circles represent configuration indices. Height of all sources and microphones was 1.5 m. [Figure not reproduced; its surviving labels are the angles 20°, 40°, and 60° and the distances 1 m, 2 m, 2.5 m, and 20 cm.]

Tables 1–3 show the output SIRs, each of which is averaged over eight different cases of source signals for a given configuration.² For all the experimental configurations and reverberation times (RT60s), the proposed method delivered higher SIRs than the other methods. In particular, the SIR difference between the proposed and the FF ICA methods in configuration 3 for 0.6-s RT60 was large because the MPDR beamforming step and the FF ICA method were not successful for close sources at high reverberation.

¹ Note that the beamforming procedure is essential to guide the separated source signals into their designated channel outputs without arbitrary permutation and scaling in the FF ICA method.

² For the ICA methods, the score function was $\mathrm{sgn}(|\hat{s}_i(k,n)|)\exp(j \cdot \angle\hat{s}_i(k,n))$. For the FF ICA and the proposed methods, the length of each FF unmixing filter was 11 taps ($U = 10$).

Table 1. Output SIR (dB) averaged over eight different cases of source signals for configuration 1.

RT60 (s)   Conv. FD ICA   Conv. IVA   FF ICA   Prop. FF IVA
0.2        17.97          17.44       21.28    22.10
0.4        10.56          11.27       11.87    12.79
0.6         7.87           8.68        6.59     9.16

Table 2. Output SIR (dB) averaged over eight different cases of source signals for configuration 2.

RT60 (s)   Conv. FD ICA   Conv. IVA   FF ICA   Prop. FF IVA
0.2        16.70          16.04       21.59    22.26
0.4         5.87           6.66       10.77    12.68
0.6         3.48           4.02        6.81     8.44

Table 3. Output SIR (dB) averaged over eight different cases of source signals for configuration 3.

RT60 (s)   Conv. FD ICA   Conv. IVA   FF ICA   Prop. FF IVA
0.2        10.08          10.49       13.75    14.08
0.4         3.87           4.80        7.10     9.23
0.6         2.57           3.52        4.41     7.58
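The SIR of Eq. (6) could be computed along the following lines. This is a minimal sketch, assuming that the target and interference components of each output are available as separate arrays; the text does not say how they were obtained, and the function name and array layout are hypothetical.

```python
import numpy as np

def sir_db(target, interference):
    """Average output SIR in dB, following Eq. (6).

    target[i] and interference[i] hold the target and interference
    components of the i-th output; both arrays have shape (L, num_samples).
    """
    target_energy = np.sum(np.abs(target) ** 2, axis=1)
    interference_energy = np.sum(np.abs(interference) ** 2, axis=1)
    # Per-output ratio in dB, then the mean over the L outputs.
    return float(np.mean(10.0 * np.log10(target_energy / interference_energy)))
```

The average is taken over the per-output dB values rather than over the energy ratios themselves, as Eq. (6) places the logarithm inside the sum.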

4. Conclusion

In this paper, an IVA method employing an FF unmixing filter network was described. Because this method can separate highly reverberated mixtures using a plausible source prior that models inter-frequency dependencies, without introducing a whitening effect, it can provide high separation performance.


References

[1] S. Araki, S. Makino, T. Nishikawa, H. Saruwatari, Fundamental limitation of frequency domain blind source separation for convolutive mixture of speech, in: Proceedings of the ICASSP, 2001, pp. 2737–2740.
[2] A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
[3] L.-H. Kim, I. Tashev, A. Acero, Reverberated speech signal separation based on regularized subband feedforward ICA and instantaneous direction of arrival, in: Proceedings of the ICASSP, 2010, pp. 2678–2681.
[4] T. Kim, H.T. Attias, S.-Y. Lee, T.-W. Lee, Blind source separation exploiting higher-order frequency dependencies, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 70–79.
[5] K. Matsuoka, S. Nakashima, Minimal distortion principle for blind source separation, in: Proceedings of the International Conference on ICA and BSS, 2001, pp. 722–727.
[6] H. Sawada, R. Mukai, S. Araki, S. Makino, A polar-coordinate based activation function for frequency domain blind source separation, in: Proceedings of the International Conference on ICA and BSS, 2001, pp. 663–668.

Myungwoo Oh received the B.S. degree in Electronic Engineering from Sogang University, Seoul, Korea, in 2010. He is currently pursuing an M.S. degree at the Department of Electronic Engineering, Sogang University. His current research interests include the theory and applications of binaural or multi-microphone processing for blind source separation and noise-robust speech recognition.

Hyung-Min Park received the B.S., M.S., and Ph.D. degrees in Electrical Engineering and Computer Science from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea, in 1997, 1999, and 2003, respectively. From 2003 to early 2005, he was a Post-Doc. at the Department of Biosystems, KAIST. From 2005 to early 2007, he was with the Language Technologies Institute, Carnegie Mellon University. In 2007, he joined the Department of Electronic Engineering, Sogang University, Seoul, Korea, as an Assistant Professor and now is an Associate Professor. His current research interests include the theory and applications of binaural or multi-microphone processing for source localization and noise-robust speech recognition.