Fast fixed-point independent vector analysis algorithms for convolutive blind source separation


Signal Processing 87 (2007) 1859–1871 www.elsevier.com/locate/sigpro

Intae Lee, Taesu Kim, Te-Won Lee

Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Institute for Neural Computation, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA

Received 1 August 2006; received in revised form 3 January 2007; accepted 9 January 2007. Available online 25 January 2007.

Abstract

A new type of independent component analysis (ICA) model has shown excellence in tackling the blind source separation problem in the frequency domain. The new model, called independent vector analysis, is an extension of ICA to (independent) multivariate sources that are mixed component-wise. In this work we examine available contrasts for the new formulation that can solve the frequency-domain blind source separation problem. We also introduce a quadratic Taylor polynomial in the notation of complex variables that is very useful for directly applying Newton's method to a contrast function of complex-valued variables; this form makes the derivation of a Newton update rule simple and clear. Fast fixed-point blind source separation algorithms are derived, and their performance is demonstrated by experimental results.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Blind source separation; Cocktail party problem; Convolutive mixture; Permutation problem; Statistical signal processing; Statistical learning; Independent component analysis; Independent vector analysis; Fast algorithm; Complex optimization

1. Introduction

Independent component analysis (ICA) is a well-known algorithmic method that has been widely used, with great success, in the field of blind source separation [1–5]. It assumes statistical independence among the source signals and separates the sources from their mixtures by maximizing the independence among the output signals. In the simplest form of the analysis, an observation vector x[t] (= [x_1[t], x_2[t], …]^T) is modelled as an instantaneous mixture,

x[t] = A s[t],   (1)

Corresponding author.

E-mail addresses: [email protected] (I. Lee), [email protected] (T. Kim), [email protected] (T.-W. Lee).

where s[t] (= [s_1[t], s_2[t], …]^T) denotes the source vector, whose components are statistically independent, and A is a square (mixing) matrix that is fixed over time and invertible. From the relation

y[t] = W x[t] = W A s[t],   (2)

ICA learns the square (unmixing) matrix W by optimizing an objective function, a.k.a. contrast, such that the components of the output y[t] (= [y_1[t], y_2[t], …]^T) become statistically as independent as possible, which means y[t] = ŝ[t] and W = Â^{-1}, where ^ denotes an estimate. In most practical situations, where there is propagation delay and reverberation, the process of source mixing is not instantaneous but convolutive. Hence, research has been extended to ICA algorithms that model the spatio-temporal

0165-1684/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2007.01.010


structure of the convolutive mixing process [6–10], or to complex ICA algorithms that separate the sources in the frequency domain. Working in the frequency domain has the advantage of increased performance: longer filter lengths can be handled, and the convolutive mixture problem reduces to an instantaneous mixing problem in each frequency bin, as in (1) and (2):

x^f[n] = A^f s^f[n],  f = 1, 2, …, F,
y^f[n] = W^f x^f[n],   (3)

where x^f[n] = [x_1^f[n], x_2^f[n], …]^T, s^f[n] = [s_1^f[n], s_2^f[n], …]^T, y^f[n] = [y_1^f[n], y_2^f[n], …]^T, the superscript f denotes the frequency bin, and F is the number of frequency bins. Note that there is another discrete time index, the dummy variable n, different from real time t: each value of n corresponds to a frame of the short-time Fourier transform. For convenience, the time notations t and n will mostly be omitted, since most ICA algorithms regard each signal as i.i.d. samples of a random variable.

Although the separation of such instantaneous mixtures in each frequency bin is easily performed by complex ICA algorithms, there remains the problem of grouping all frequency components of each source signal, the well-known permutation problem. It results from the permutation indeterminacy of ICA and is not straightforwardly solved. Extensive work has proposed techniques to solve it: smoothing the frequency-domain filter is one approach [11–13]; other solutions use direction-of-arrival (DoA) estimation [14–16]; and, for colored signals, inter-frequency correlations of the signal envelopes have been used [17,18]. Although these methods provide good intuitive solutions, additional algorithmic steps and computation are needed and, more importantly, many of them have not been successful in challenging acoustic settings, such as ill-posed geometric arrangements of sources and sensors and scenarios involving larger numbers of sources.

A fundamentally new approach was taken to the convolutive BSS problem in the frequency domain. All (short-time) frequency components of each source signal are considered together as a multivariate source; hence, instead of a contrast function that measures component-wise independence in each frequency bin, a contrast function

that measures the whole independence among the multivariate sources was applied [19–21]. Our work here is an extended version of [20], a previously published conference paper. The new ICA formulation for multivariate sources is called independent vector analysis (IVA), and we apply it to the frequency-domain BSS problem while examining the entropic contrasts and the likelihood contrast with a proper multivariate probability density function (PDF). We also introduce a useful second-order Taylor polynomial of a real-valued function of complex variables, in simple complex notations, in order to derive Newton's-method-type complex IVA algorithms for convolutive BSS. Several fast algorithms are derived and experimentally tested.
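To make the per-bin model in (3) concrete, here is a minimal numpy sketch (with arbitrary, hypothetical dimensions, not taken from the paper) in which two multivariate sources are mixed component-wise, one square mixing matrix per frequency bin, and recovered exactly by the per-bin inverses:

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 4, 1000                      # frequency bins, STFT frames (arbitrary)

# Two independent multivariate sources: s[f] holds a 2-vector per frame.
s = rng.standard_normal((F, 2, N)) + 1j * rng.standard_normal((F, 2, N))

# One square complex mixing matrix per bin: the mixing is component-wise.
A = rng.standard_normal((F, 2, 2)) + 1j * rng.standard_normal((F, 2, 2))

# x^f[n] = A^f s^f[n]: an instantaneous mixture in every bin, as in (3).
x = np.einsum('fij,fjn->fin', A, s)

# With W^f = (A^f)^{-1}, y^f[n] = W^f x^f[n] recovers the sources exactly.
y = np.einsum('fij,fjn->fin', np.linalg.inv(A), x)
assert np.allclose(y, s)
```

In practice W^f is unknown and must be learned per bin, which is exactly what creates the permutation problem discussed above.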

2. Independent vector analysis

The model of IVA consists of a set of standard ICA models as in (3), where the univariate sources across different layers are dependent such that they can be aligned and grouped as a multivariate variable, or vector. Fig. 1 depicts the 2 × 2 IVA mixture model, where s_1 (= [s_1^1, s_1^2, …, s_1^F]^T) and s_2 (= [s_2^1, s_2^2, …, s_2^F]^T) denote the multivariate sources and x_1 (= [x_1^1, x_1^2, …, x_1^F]^T) and x_2 (= [x_2^1, x_2^2, …, x_2^F]^T) denote the observed mixtures. As can be seen, the mixing of the multivariate sources is constrained component-wise, forming an ICA mixture model in each layer. The difference between applying IVA and applying multiple ICAs to such a scenario is that IVA groups the dependent sources as a multivariate variable and learns each group as a whole.

The mixture model in which independent groups of dependent sources are mixed was first proposed by Cardoso [22] under the name multidimensional ICA (MICA). Also, Hyvärinen [23] proposed a maximum likelihood (ML) algorithm of MICA, independent subspace analysis (ISA). While IVA closely resembles those models, and while MICA (including ISA) and IVA share the same contrasts, they differ in their mixture models. For comparison, the mixture model of MICA is depicted in Fig. 2. While IVA consists of multiple standard ICA layers in which the component-wise mixing keeps the dependent source components unmixed, MICA consists of a single mixture layer which cannot be decomposed into multiple ICA mixtures, since the dependent sources are assumed to be mixed as well.


Fig. 1. The mixture model of independent vector analysis (IVA). Independent component analysis (ICA) is extended to a formulation with multivariate variables, or vectors, where the components of each vector are dependent. The difference of IVA from multidimensional ICA or subspace ICA is that the mixing process is restricted to the source components on the same horizontal layer to form a standard ICA mixture.


Fig. 2. The mixture model of MICA and ISA. The model consists of a single mixture layer where all source components are mixed together.

The IVA model fits the (noiseless) convolutive BSS problem in the frequency domain: the individual mixture layers in IVA correspond to the instantaneous mixtures in each frequency bin, and the dependent sources grouped together as a multivariate variable correspond to the (short-time) frequency components of a time signal.

2.1. Contrasts of IVA for spatially white data

Many ICA algorithms keep the output data zero-mean and spatially white, since no correlation is a necessary condition for independence and since whitening increases the learning speed. While this can also be done in MICA algorithms, including ISA, in IVA it is tractable only for the output data in each frequency bin, y^f (= [y_1^f, y_2^f, …]^T), since mixing and unmixing in IVA are restricted to each frequency bin. Here, we keep y^f zero-mean and white by preprocessing the observation data x^f (= [x_1^f, x_2^f, …]^T) to be zero-mean and white,

E[x^f (x^f)^H] = I,   (4)

and by constraining the unmixing matrices W^f to be orthogonal,

W^f (W^f)^H = I,  f = 1, 2, …, F,   (5)

where W^f is the estimate of the inverse of A^f. Let us assume that x^f is already preprocessed to be zero-mean and white.

Since ICA basically separates the sources by maximizing the independence of the output signals y_i, its contrast functions are represented by the


mutual information of the y_i's,

I(y) = D( f_y ‖ Π_i f_{y_i} )   (6)
     = ∫_{R^{dim(z)}} f_y(z) log( f_y(z) / Π_i f_{y_i}(z_i) ) dz,   (7)

where D(·‖·) denotes the Kullback–Leibler (KL) divergence and f_y(z) denotes the PDF of a random variable y with dummy variable z (= [z_1, z_2, …]^T). Mutual information can be regarded as a measure of the distance from a given distribution to the product of its marginal distributions. Hence, mutual information as a contrast function measures how independent the components of the output data are, and it attains its minimum, zero, if and only if the y_i's are mutually independent. Similarly, the contrast functions of IVA can be represented by the mutual information among the multidimensional variables y_i,

D( f_y ‖ Π_i f_{y_i} ) = Σ_i H(y_i) − H(y),   (8)

where H(·) denotes the entropy,

H(y) = −∫_{R^{dim(z)}} f_y(z) log f_y(z) dz.   (9)
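For intuition, the mutual information (6)–(7) has a closed form for a zero-mean Gaussian vector with covariance C, namely I(y) = (1/2)(Σ_i log C_ii − log det C). The following sketch (our own illustration, not from the paper) checks that it vanishes exactly when the components are uncorrelated, which for Gaussians means independent:

```python
import numpy as np

def gaussian_mi(C):
    """Mutual information (6) for a zero-mean Gaussian with covariance C."""
    return 0.5 * (np.sum(np.log(np.diag(C))) - np.log(np.linalg.det(C)))

C_dep = np.array([[1.0, 0.8], [0.8, 1.0]])   # correlated components
C_ind = np.eye(2)                            # uncorrelated components

assert abs(gaussian_mi(C_ind)) < 1e-12       # zero for independent components
assert gaussian_mi(C_dep) > 0.5              # -0.5*log(0.36), about 0.511
```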

Note that the output signal y_i (= [y_i^1, y_i^2, …, y_i^F]^T) is now a vector. This contrast attains its minimum, zero, if and only if the y_i's are mutually independent. Note that the term H(y) in (8) is constant, since log |det(W^f)| = 0 for any value of f, which follows from (5); hence, minimizing the KL divergence term in (8) is equivalent to minimizing the sum of the entropies of the multidimensional variables, i.e.

argmin_{W^1,…,W^F} D( f_y ‖ Π_i f_{y_i} ) = argmin_{W^1,…,W^F} Σ_i H(y_i),   (10)

with the constraint of orthogonal W^f's. In order to simplify the discussion that follows, we assume, in addition to the mutual independence of the s_i (= [s_i^1, …, s_i^F]),

E[s_i (s_i)^H] = I,   (11)

i.e. the components in each multivariate source are, although dependent on each other, uncorrelated, and the variance of each component is one, which can be assumed owing to the scaling indeterminacy in ICA. Then, by the constraint of zero-mean and white y^f, the whole output y (= [y^1; y^2; …; y^F]) is also kept zero-mean and white (the notation [a; b; …] denotes [a^T, b^T, …]^T for column vectors a and b):

E[y y^H] = I.   (12)
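The zero-mean, spatially white preprocessing assumed in (4) and (12) can be sketched as follows (a standard covariance-square-root whitening; the helper name is ours):

```python
import numpy as np

def whiten(xf):
    """Return a zero-mean, spatially white version of xf (n channels x N frames)."""
    xf = xf - xf.mean(axis=1, keepdims=True)
    C = (xf @ xf.conj().T) / xf.shape[1]        # sample covariance E[x^f (x^f)^H]
    d, U = np.linalg.eigh(C)                    # C = U diag(d) U^H
    return (U @ np.diag(d ** -0.5) @ U.conj().T) @ xf   # apply C^{-1/2}

rng = np.random.default_rng(1)
xf = rng.standard_normal((2, 5000)) + 1j * rng.standard_normal((2, 5000))
xf = np.array([[1.0, 0.5], [0.2, 1.0]]) @ xf    # correlate the channels
z = whiten(xf)
assert np.allclose((z @ z.conj().T) / z.shape[1], np.eye(2), atol=1e-8)
```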

As in ICA, negentropy can be employed to derive an entropic contrast of IVA that is equivalent to the others. Negentropy is defined as

N(z) = D( f_z ‖ f_{z_G} ),   (13)

where z_G denotes the (multidimensional) Gaussian random variable with the same mean vector and covariance matrix as z. In other words, f_{z_G} is the information projection of f_z onto the Gaussian manifold in information geometry. From the following Pythagorean relation in information geometry,

N(y_i) = H(y_i^G) − H(y_i),   (14)

note that H(y_i^G) in (14) is a constant term since, from (12), y_i is zero-mean and white, so that y_i^G is always the unique zero-mean and white multivariate Gaussian random variable. Hence, by plugging the relation (14) into the right-hand side of (10), we can easily see that

argmin_{W^1,…,W^F} Σ_i H(y_i) = argmax_{W^1,…,W^F} Σ_i N(y_i),   (15)

with the constraint of orthogonal W^f's. It is clear that the entropic contrasts of IVA in (10) and (15) are extensions of the classic entropic contrasts of ICA from univariate to multivariate variables; by replacing the vectors y_i with scalar components y_i, we recover the entropic contrasts of ICA. As was done for ICA (J.-F. Cardoso [24, Chapter 4]), the relation between the mutual information contrast (10) and the sum of negentropies (15) can be seen visually in information geometry. Information geometry treats a space of probability distributions in which each PDF corresponds to a point. By defining the Gaussian manifold and the independent vector manifold as the set of multivariate Gaussian PDFs and the set of PDFs in which the multivariate y_i's are mutually independent, respectively, the mutual information (10) and the sum of negentropies (15) can be regarded as the distances from a point f_y to


the Gaussian manifold and to the independent vector manifold, respectively. In Fig. 3, the points, manifolds, and distance measures are depicted, where the PDFs are those of zero-mean and white random variables. Note that the Gaussian manifold is drawn as a single point, since the zero-mean and white multidimensional Gaussian PDF is unique. When the unmixing matrices W^f are updated, the point f_y moves around correspondingly while keeping the distance

N(y) = D( f_y ‖ Π_i f_{y_i} ) + Σ_i N(y_i)   (16)

constant. Hence our goal is either to minimize the distance between f_y and the independent vector manifold, D( f_y ‖ Π_i f_{y_i} ), or to maximize the distance between Π_i f_{y_i} and the Gaussian manifold, Σ_i N(y_i). Note that the assumption (11) on the multivariate source components is not needed to arrive at the negentropy contrast (15) or the Pythagorean relation (16); they remain true without it. The assumption is motivated by the fact that the source distributions we later assume are spherically symmetric multivariate distributions, which imply uncorrelated multivariate source components, and by the fact that the explanation becomes much simpler with it.

Other than the entropic contrasts, we can exploit prior information about the source distribution and use


Fig. 3. Entropic contrasts of multidimensional ICA in the information geometry of zero-mean white probability distributions.


the (normalized) log-likelihood contrast function,

Ẽ[ log f̂_{(W^1)^{-1}s^1,…,(W^F)^{-1}s^F}(x^1, …, x^F) ] = Ẽ[ log f̂_s(y) ]   (17)
 = Σ_i Ẽ[ log f̂_{s_i}(y_i) ],   (18)

where Ẽ[·] = (1/N) Σ_n [·] and (17) holds because of the orthogonal W^f's. The estimate of the source PDF, denoted by f̂_s, is called the source prior or source target. Note that the contrast functions of IVA can also be employed as contrast functions of MICA [22], including ISA [23], whereas the mixture models of the analyses differ.

3. Fast fixed-point IVA algorithms for frequency-domain BSS

In the derivation of IVA algorithms for frequency-domain BSS problems, we come to deal with complex-valued variables.

3.1. Complex variables: assumptions

Because real-valued contrast functions of complex-valued variables are not analytic, in a number of signal processing applications the question arises of how to deal with complex variables. There are standard ways to handle the problem. The first is to write every complex variable in terms of two real variables, e.g. replace each complex variable z with u and v using z = u + jv (j = √−1), and then use the tools for real-valued variables. An alternative is to follow the lines of [25,26], which is equivalent to the first method but clearer in notation. These methods basically carry out the optimization in the real domain with respect to the real and imaginary parts of the complex variables. In most scenarios, it is known that complex variables show circular symmetry around the origin. Hence, as the relation between the separated real variables, we assume circularity in the source variables such that

E[s s^T] = O,   (19)

where s = [s^1; s^2; …; s^F]. Hence also

E[x x^T] = O,   (20)

where x = [x^1; x^2; …; x^F].
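The circularity assumption (19) says the pseudo-covariance E[s s^T] vanishes, while the ordinary covariance E[s s^H] does not. A quick numerical sanity check for a circular complex variable (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
# Circularly symmetric complex source: independent real and imaginary parts
# of equal variance (total variance one).
s = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)

pseudo_cov = np.mean(s * s)            # E[s s^T] entry: vanishes under (19)
cov = np.mean(s * np.conj(s))          # E[s s^H] entry: the usual variance

assert abs(pseudo_cov) < 0.01
assert abs(cov - 1.0) < 0.01
```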


3.2. Contrast functions

For the likelihood contrast in (18), a proper source prior f̂_{s_i} that models the frequency components of an acoustic signal should be chosen. Previous works observed that natural signals such as speech have the property of spherical invariance and introduced spherically invariant random processes (SIRPs); for details and relevant topics, see [27,28] and the references therein. Taking this into account, we introduce a spherically symmetric exponential norm distribution (SEND),

f̂_{s_i}(z) ∝ e^{−√(4/d) ‖z‖_2} / ‖z‖_2^{d−1},   (21)

where d is the total dimension in the real domain. It should be noted that, like the contrast functions, the probability functions of complex variables are in fact functions of real variables, namely the real and imaginary parts of the complex variables, and thus the PDFs are functions of 2F real variables (d = 2F). This distribution was derived such that it has spherical contours and the l_2-norm of the random vector s_i, ‖s_i‖_2 = √(Σ_f |s_i^f|^2), follows an exponential distribution. The variance was adjusted such that s_i is white, which corresponds to one-dimensional real variables having variance one half.

In [19], the following multidimensional super-Gaussian distribution, which for convenience we will call the spherically symmetric Laplace (SSL) distribution and which had been proposed by Hyvärinen [23] for ISA, was used in order to model the variance dependency of the frequency components:

f̂_{s_i}(z) ∝ e^{−√(2(d+1)) ‖z‖_2}.   (22)

Each distribution can be regarded as an extension of the double exponential, a.k.a. Laplace, distribution; both shrink to the Laplace distribution with variance one half for d = 1. After replacing f̂_{s_i}(·) in the likelihood contrast (18) with either SEND or SSL, the likelihood contrasts have the form

Ẽ[ Σ_i G( Σ_f |y_i^f|^2 ) ]   (23)

with the relation

G( Σ_f |y_i^f|^2 ) = −log f̂_{s_i}(y_i),   (24)

where y_i^f = (w_i^f)^H x^f and (w_i^f)^H denotes the i-th row of the unmixing matrix W^f, with the normalization constraint (w_i^f)^H w_i^f = 1. Note that the contrast has changed its sign to the negative likelihood. Using Lagrange multipliers λ_i^f, we include the normalization constraint in the contrast function, which results in

Ẽ[ Σ_i G( Σ_f |y_i^f|^2 ) ] − Σ_i Σ_f λ_i^f ( (w_i^f)^H w_i^f − 1 ).   (25)

The entropic contrast on the right-hand side of (10) is also available if a proper entropy estimator is given. In [29], an entropy estimator for spherically symmetric random variables, which estimates the entropy directly from the data, was derived:

Ĥ(y_i) = c_a Σ_{n=m_N+1}^{N} log( r̄_i[n]^d − r̄_i[n−m_N]^d ) + c_b,   (26)

where c_a and c_b are constants, N is the total number of data, m_N is an integer which is a function of N, and r̄_i[n] is the reordered sequence of the ‖y_i[n]‖_2's such that

r̄_i[n+1] ≥ r̄_i[n],  n = 1, 2, …, N − 1.   (27)

By plugging this into the entropy term on the right-hand side of (10) and applying some approximations, a semi-nonparametric IVA contrast for spherically symmetric sources was derived as follows:

argmin_{W^1,…,W^F} Σ_i Σ_{n=m_N+1}^{N} log( r̄_i[n]^{2F} − r̄_i[n−m_N]^{2F} )   (28)
= argmin_{W^1,…,W^F} Σ_i Σ_{n=m_N+1}^{N} ( log r̄_i[n]^{2F} + log( 1 − (r̄_i[n−m_N]/r̄_i[n])^{2F} ) )   (29)
≈ argmin_{W^1,…,W^F} Σ_i Σ_{n=m_N+1}^{N} ( log r̄_i[n]^{2F} + log(1 − 0) )   (30)
= argmin_{W^1,…,W^F} Σ_i Σ_{n=m_N+1}^{N} log r̄_i[n]^2   (31)
≈ argmin_{W^1,…,W^F} Σ_i Σ_{n=1}^{N} log r̄_i[n]^2   (32)
= argmin_{W^1,…,W^F} Σ_i Ẽ[ log( Σ_f |y_i^f|^2 ) ]   (33)

with the constraint of orthogonal W^f's. The approximation in (30) holds because of (27) and because 2F, which is approximately the number of FFT points chosen in the algorithm, is a large integer owing to the need to handle practical reverberation times. The approximation in (32) comes from extending the data indices of the sum, which enables the algorithm to avoid reordering the output data.

Spherically symmetric multidimensional sources can be represented by scaled mixtures of independent Gaussian distributions, since independent Gaussian distributions show spherical symmetry. The contrast was designed to have some flexibility in the distribution of the l_2-norm of spherically symmetric sources. An algorithm equivalent to the one using (33) was proposed in [30]. For the model of each frequency component, the authors assumed a Gaussian distribution with a scale parameter that stays approximately constant over a short time period and varies slowly over time. All Gaussian models along the frequency bins are coupled through the same scale parameter, and the scale parameter is estimated by the sample standard deviation. This resembles the ML-type IVA algorithm using a multivariate source prior, where the source prior can be derived from scaled mixtures of Gaussian distributions. Note that the contrast (33) also falls into the form of (25) with a corresponding G(·). The nonlinearity functions G(·) corresponding to each contrast function are summarized in Table 1. ML_PDF denotes the likelihood contrast with the specified PDF as the source target, and SNP denotes the approximated entropic contrast function in (33).

3.3. Contrast optimization 1: quadratic Taylor expansion in complex notations
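The approximated semi-nonparametric contrast (33) is straightforward to evaluate from the separated data; a minimal sketch (hypothetical helper name, constants dropped):

```python
import numpy as np

def snp_contrast(Y):
    """Evaluate (33), sum_i E~[log sum_f |y_i^f|^2], for Y of shape (F, n, N)."""
    r2 = np.sum(np.abs(Y) ** 2, axis=0)          # ||y_i[n]||_2^2 across bins
    return float(np.sum(np.mean(np.log(r2), axis=1)))
```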

ICA algorithms mainly consist of two parts: the choice of the contrast function and the choice of the optimization method. For optimization, most ICA algorithms use a gradient descent update rule, including the natural gradient [31] or the relative gradient [32]. Fixed-point iteration [33], Newton's method update rules [34], and exhaustive search by rotation [35] have also been applied. Since the contrast functions have been chosen, we are left to derive BSS algorithms by choosing an optimization method. Here we employ a Newton's method update rule which, compared to gradient descent methods, converges fast and does not require selecting an efficient learning rate.

In order to use complex notations while applying Newton's method to our contrasts, we employ the definitions of the complex gradient and complex Hessian in [25,26]. It was shown that a complex variable and its complex conjugate must be considered as separate variables, and that the complex gradient and Hessian correspond to their real counterparts one-to-one by simple linear transformations [26]. That is, for

w (= [w_1, w_2, …]^T) = u + jv (= [u_1, u_2, …]^T + j[v_1, v_2, …]^T),   (34)

z = [u_1, v_1, u_2, v_2, …]^T,   (35)

w_D = [w_1, w_1^*, w_2, w_2^*, …]^T,   (36)

g(w) = g_D(w, w^*)   (37)
     = h(u, v),   (38)

where u and v are real-valued vectors, it was shown that the quadratic Taylor polynomial of h(u, v) with respect to z around the point z = O is equivalent to a Taylor polynomial form in complex

Table 1
Nonlinearity functions of the contrasts and their derivatives

Contrast  | G(z)                       | G'(z)                     | G''(z)
ML_SSL    | √z                         | 1/(2√z)                   | −1/(4√z³)
ML_SEND   | √(2/F)·√z + (F − 1/2)log z | 1/(√(2F)·√z) + (F − 1/2)/z | −1/(√(8F)·√z³) − (F − 1/2)/z²
SNP       | log z                      | 1/z                       | −1/z²
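The entries of Table 1 can be coded directly; the following sketch (hypothetical helper name, our own illustration) returns G, G' and G'' for each contrast and checks G' against a finite difference:

```python
import numpy as np

def nonlinearities(z, contrast, F):
    """Return (G(z), G'(z), G''(z)) from Table 1; z = sum_f |y_i^f|^2 > 0."""
    z = np.asarray(z, dtype=float)
    if contrast == 'MLSSL':
        return np.sqrt(z), 0.5 / np.sqrt(z), -0.25 * z ** -1.5
    if contrast == 'MLSEND':
        c = F - 0.5
        return (np.sqrt(2.0 / F) * np.sqrt(z) + c * np.log(z),
                1.0 / np.sqrt(2.0 * F * z) + c / z,
                -1.0 / np.sqrt(8.0 * F * z ** 3) - c / z ** 2)
    if contrast == 'SNP':
        return np.log(z), 1.0 / z, -1.0 / z ** 2
    raise ValueError(contrast)

# Finite-difference check of G' against G for each contrast.
z0, h = 2.0, 1e-6
for name in ('MLSSL', 'MLSEND', 'SNP'):
    fd = (nonlinearities(z0 + h, name, 8)[0]
          - nonlinearities(z0 - h, name, 8)[0]) / (2 * h)
    assert abs(fd - nonlinearities(z0, name, 8)[1]) < 1e-5
```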

notations as follows:

h(O, O) + (∂h(O, O)/∂z^T) z + (1/2) z^T (∂²h(O, O)/∂z ∂z^T) z
  = g_D(O, O) + (∂g_D(O, O)/∂w_D^T) w_D + (1/2) w_D^T (∂²g_D(O, O)/∂w_D ∂w_D^T) w_D   (39)

if (and only if)

∂/∂w_i = (1/2)( ∂/∂u_i − j ∂/∂v_i ),   (40)

∂/∂w_i^* = (1/2)( ∂/∂u_i + j ∂/∂v_i ).   (41)

It is still cumbersome to apply Newton's method to a contrast using the quadratic Taylor polynomial form on the right-hand side of (39), although it is in complex notations. Here, we introduce a quadratic Taylor polynomial in notations of the original complex variables w and w^* instead of w_D. By changing the order of the components in w_D such that w_D = [w; w^*] and plugging it into the right-hand side of (39), it can be seen that the expression decomposes into

g(w) ≈ g(w_o) + (∂g(w_o)/∂w^T)(w − w_o) + (∂g(w_o)/∂w^H)(w − w_o)^*
  + (1/2)(w − w_o)^T (∂²g(w_o)/∂w ∂w^T)(w − w_o)
  + (1/2)(w − w_o)^H (∂²g(w_o)/∂w^* ∂w^H)(w − w_o)^*
  + (w − w_o)^H (∂²g(w_o)/∂w^* ∂w^T)(w − w_o),   (42)

where the point of Taylor expansion has changed from w = O to w = w_o. The notations used above for gradients and Hessians with vectors are defined as

∂/∂w = [∂/∂w_1, ∂/∂w_2, …]^T,  ∂/∂w^T = (∂/∂w)^T,   (43)

∂/∂w^* = [∂/∂w_1^*, ∂/∂w_2^*, …]^T,  ∂/∂w^H = (∂/∂w^*)^T,   (44)

∂²/∂a ∂b^T = (∂/∂a)(∂/∂b^T) = (∂/∂b^T)(∂/∂a),  a, b ∈ {w, w^*}.   (45)

3.4. Contrast optimization 2: Newton's method

Let us replace the w terms in (42) with w_i^f and set g(w_i^f) to be the summation term of the contrast in (25),

g(w_i^f) = Ẽ[ G( Σ_k |(w_i^k)^H x^k|^2 ) ] − Σ_k λ_i^k ( (w_i^k)^H w_i^k − 1 ).   (46)

The w_i^f that optimizes the function g(w_i^f) sets the gradient ∂g(w_i^f)/∂(w_i^f)^* to zero, and hence, from (42),

∂g(w_i^f)/∂(w_i^f)^* ≈ ∂g(w_{i,o}^f)/∂(w_i^f)^*
  + ( ∂²g(w_{i,o}^f)/∂(w_i^f)^* ∂(w_i^f)^T )( w_i^f − w_{i,o}^f )
  + ( ∂²g(w_{i,o}^f)/∂(w_i^f)^* ∂(w_i^f)^H )( w_i^f − w_{i,o}^f )^* ≈ O.   (47)

The derivative terms in (47) become

∂g(w_{i,o}^f)/∂(w_i^f)^* = Ẽ[ (y_{i,o}^f)^* G'( Σ_k |y_{i,o}^k|^2 ) x^f ] − λ_i^f w_{i,o}^f,   (48)

∂²g(w_{i,o}^f)/∂(w_i^f)^* ∂(w_i^f)^T
  = Ẽ[ ( G'( Σ_k |y_{i,o}^k|^2 ) + |y_{i,o}^f|^2 G''( Σ_k |y_{i,o}^k|^2 ) ) x^f (x^f)^H ] − λ_i^f I   (49)
  ≈ Ẽ[ G'( Σ_k |y_{i,o}^k|^2 ) + |y_{i,o}^f|^2 G''( Σ_k |y_{i,o}^k|^2 ) ] Ẽ[ x^f (x^f)^H ] − λ_i^f I   (50)
  = ( Ẽ[ G'( Σ_k |y_{i,o}^k|^2 ) + |y_{i,o}^f|^2 G''( Σ_k |y_{i,o}^k|^2 ) ] − λ_i^f ) I,   (51)

∂²g(w_{i,o}^f)/∂(w_i^f)^* ∂(w_i^f)^H
  = Ẽ[ ((y_{i,o}^f)^*)^2 G''( Σ_k |y_{i,o}^k|^2 ) x^f (x^f)^T ]   (52)
  ≈ Ẽ[ ((y_{i,o}^f)^*)^2 G''( Σ_k |y_{i,o}^k|^2 ) ] Ẽ[ x^f (x^f)^T ]   (53)
  = O,   (54)

where y_{i,o}^f = (w_{i,o}^f)^H x^f, and approximations by separation of expectations were applied in (50) and (53) as in [36]; (51) and (54) follow from (4) and (20), respectively. Note that, by (51) and (54), the Newton step equation in (47) reduces to

w_i^f − w_{i,o}^f = −( 1/c(w_{i,o}^f) ) ∂g(w_{i,o}^f)/∂(w_i^f)^*,   (55)

where c(w_{i,o}^f) is the constant term multiplying I in (51). This shows that the Newton-step ICA and IVA algorithms for complex variables become a variant of gradient descent in which a semi-optimized learning rate is computed at each iteration. By substitution, our corresponding iterative algorithm becomes

w_i^f ← w_{i,o}^f − ( Ẽ[ (y_{i,o}^f)^* G'( Σ_f |y_{i,o}^f|^2 ) x^f ] − λ_i^f w_{i,o}^f ) / ( Ẽ[ G'( Σ_f |y_{i,o}^f|^2 ) + |y_{i,o}^f|^2 G''( Σ_f |y_{i,o}^f|^2 ) ] − λ_i^f ),   (56)

where the Lagrange multiplier λ_i^f should be

λ_i^f = Ẽ[ |y_{i,o}^f|^2 G'( Σ_f |y_{i,o}^f|^2 ) ].   (57)

Also, instead of evaluating λ_i^f, we can remove it by multiplying both sides of (56) by the denominator. Hence, up to a normalization, the learning rule becomes

w_i^f ← Ẽ[ G'( Σ_f |y_{i,o}^f|^2 ) + |y_{i,o}^f|^2 G''( Σ_f |y_{i,o}^f|^2 ) ] w_{i,o}^f − Ẽ[ (y_{i,o}^f)^* G'( Σ_f |y_{i,o}^f|^2 ) x^f ].   (58)

In addition to normalization, the rows of the unmixing matrices W^f need to be decorrelated. Since the contrasts are not derived as one-unit contrast functions, the symmetric decorrelation scheme

W^f ← ( W^f (W^f)^H )^{−1/2} W^f   (59)

is employed rather than a deflationary decorrelation scheme. It is known that many local optima, resulting from wrong permutations, exist in such cost-function frameworks [37,30]. In our algorithms, we employ the identity matrix as the initial unmixing matrix in all frequency bins, which helps avoid those local optima.
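One sweep of the resulting fast fixed-point algorithm, i.e. the update (58) for every row and bin followed by the symmetric decorrelation (59), can be sketched as follows. This is our own illustration, not the authors' code: the rows of each W^f are stored as (w_i^f)^H, so the stored rows receive the conjugate of (58), and X is assumed already zero-mean and white per bin:

```python
import numpy as np

def fastiva_step(W, X, Gp, Gpp):
    """One sweep of the update (58) plus symmetric decorrelation (59).

    W : (F, n, n) complex; row i of W[f] stores (w_i^f)^H.
    X : (F, n, N) complex observations, assumed zero-mean and white per bin.
    Gp, Gpp : first and second derivatives of the nonlinearity G (Table 1).
    """
    F, n, N = X.shape
    Y = np.einsum('fij,fjn->fin', W, X)      # y_i^f[n] = (w_i^f)^H x^f[n]
    r2 = np.sum(np.abs(Y) ** 2, axis=0)      # sum_f |y_i^f[n]|^2, shape (n, N)
    for f in range(F):
        for i in range(n):
            y = Y[f, i]
            # (58): w <- E[G' + |y|^2 G''] w - E[y^* G' x]; since the stored
            # row is (w_i^f)^H, the conjugate of the update is applied to it.
            a = np.mean(Gp(r2[i]) + np.abs(y) ** 2 * Gpp(r2[i]))
            b = np.mean(y * Gp(r2[i]) * np.conj(X[f]), axis=1)
            W[f, i] = a * W[f, i] - b
        # (59): W^f <- (W^f (W^f)^H)^{-1/2} W^f, via the eigendecomposition.
        d, U = np.linalg.eigh(W[f] @ W[f].conj().T)
        W[f] = U @ np.diag(d ** -0.5) @ U.conj().T @ W[f]
    return W
```

By construction, every W^f satisfies the orthogonality constraint (5) after each sweep; the sweep is repeated until the unmixing matrices stop changing.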

4. Experiments

Our fast fixed-point BSS algorithms were applied to n × n speech separation problems. The mixed speech signals were either synthetic signals generated in a simulated room environment or real signals recorded in an ordinary office environment. In generating the synthetic data, 8-s-long clean speech signals were used, with a 2048-point FFT and a 2048-tap Hanning window with a shift of 512 samples.

4.1. Source separation in a simulated room environment

The geometric configuration of the simulated room environment is depicted in Fig. 4(a). We set the room size to 7 m × 5 m × 2.75 m and the heights of all microphone and source locations to 1.5 m. A reverberation time of 100 ms was chosen, and the corresponding reflection coefficients were set to 0.57 for every wall, the floor, and the ceiling. Clean speech signals were convolved with room impulse responses obtained by an image method [38–41]. Various 2 × 2 and 4 × 4 case simulations (Fig. 4(b)) were carried out. Separation performance was measured by the signal-to-interference ratio (SIR) in dB, defined as

SIR_out = 10 log( Σ_{n,f} Σ_i |r_{iq(i)}^f s_{q(i)}^f[n]|^2 / Σ_{n,f} Σ_{i≠j} |r_{iq(j)}^f s_{q(j)}^f[n]|^2 ),   (60)

where q(i) indicates the separated-source index at which the i-th source appears and r_{iq(j)}^f is the overall impulse response, defined as Σ_m w_{im}^f a_{mq(j)}^f.
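Given the overall responses r_{iq(j)}^f = Σ_m w_{im}^f a_{mq(j)}^f, the SIR in (60) can be evaluated as in this sketch (hypothetical helper name, our own illustration):

```python
import numpy as np

def sir_out_db(R, S, q):
    """SIR_out of (60). R : (F, n, n) overall responses R[f] = W^f A^f,
    S : (F, n, N) source spectra, q[i] : output index where source i appears."""
    F, n, N = S.shape
    signal = 0.0
    interference = 0.0
    for f in range(F):
        for i in range(n):
            # target contribution r_{i q(i)}^f s_{q(i)}^f[n]
            signal += np.sum(np.abs(R[f, i, q[i]] * S[f, q[i]]) ** 2)
            for j in range(n):
                if j != i:
                    # leakage r_{i q(j)}^f s_{q(j)}^f[n]
                    interference += np.sum(np.abs(R[f, i, q[j]] * S[f, q[j]]) ** 2)
    return 10 * np.log10(signal / interference)
```

For instance, an overall response that is identity plus 0.1 off-diagonal leakage in every bin yields an SIR of 20 dB.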


Fig. 4(b) data (source and microphone locations per experiment):

Exp. #   Source locations   Mics used
1        A, I               2, 3
2        B, G               2, 3
3        E, G               2, 3
4        I, K               2, 3
5        C, D               2, 3
6        E, F               2, 3
7        I, J               2, 3
8        A, B, G, I         all
9        B, C, G, I         all
10       C, E, G, I         all
11       E, G, H, I         all
12       B, C, D, E         all

Fig. 4(c) data (SIR_out in dB per experiment):

Exp. #        1     2     3     4     5     6     7     8     9     10    11    12
Sawada [16]   n/a   16.9  8.1   n/a   14.0  11.5  15.7  16.5  9.9   10.1  8.5   6.6
GD MLSSL      16.4  15.8  16.5  11.7  15.3  14.9  14.9  11.5  10.4  11.2  8.2   7.7
GD MLSEND     21.2  19.6  18.6  12.3  15.8  17.7  16.2  12.1  10.5  11.7  9.2   7.6
GD SNP        21.2  19.6  18.6  12.3  15.8  17.7  16.2  12.1  10.5  11.7  9.2   7.5
Fast MLSSL    15.1  15.0  12.3  10.7  14.1  14.3  13.2  11.3  8.8   10.1  7.2   7.2
Fast MLSEND   19.8  19.5  17.0  11.8  16.1  16.0  15.6  11.8  10.0  11.5  9.0   7.2
Fast SNP      19.8  19.5  17.0  11.8  16.1  16.0  15.6  11.8  10.0  11.5  9.0   7.2

Fig. 4. Simulated room environment experiments. Twelve experiments were carried out using 7 different algorithms. (a) The geometric configuration of the simulated room environment. (b) The source and microphone locations of each experiment. For instance, in experiment 2, the simulated data were obtained using an image method such that two source signals located in B and G were recorded by microphones 2 and 3. (c) Separation results of the algorithms in each experiment. The performances were measured by SIRout (60) in dB.

The performances of the fast algorithms were compared with those of the gradient descent (natural gradient) algorithms using the same contrasts, and with the well-known BSS algorithm by Sawada et al. [16], which applies (complex) ICA in each frequency bin and corrects the permutation using both amplitude correlation and DoA. For the gradient descent algorithms, the data were preprocessed to be zero-mean and white, and the unmixing matrices W^f were constrained to be orthogonal by the symmetric decorrelation scheme in (59). The separation results are shown in Fig. 4(c). The fast IVA algorithms and the gradient descent IVA algorithms consistently separated the source signals well, whereas Sawada's algorithm did not consistently achieve good separation. Also, the MLSEND and SNP contrasts showed almost equivalent separation results, which mostly outperformed those obtained with the MLSSL contrast.
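Eq. (59) is not reproduced in this excerpt; the standard symmetric decorrelation step in this setting is W ← (W W^H)^(-1/2) W, which replaces W by the nearest unitary matrix without privileging any single row. A minimal numpy sketch under that assumption:

```python
import numpy as np

# Symmetric decorrelation sketch, assuming the standard form
# W <- (W W^H)^{-1/2} W (Eq. (59) itself is not shown in this excerpt).
# Applied to each frequency bin's unmixing matrix W^f, it enforces
# W W^H = I after every update.
def symmetric_decorrelation(W):
    d, E = np.linalg.eigh(W @ W.conj().T)   # W W^H is Hermitian positive definite
    inv_sqrt = E @ np.diag(1.0 / np.sqrt(d)) @ E.conj().T
    return inv_sqrt @ W
```

After the step, W W^H equals the identity up to round-off, for any full-rank complex W.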

In addition to separation performance, the number of iterations until convergence was compared for the fast and gradient descent IVA algorithms. While no learning rate needs to be chosen for the fast algorithms, the learning rate of the natural gradient algorithms was scaled so that the speed was semi-optimized without degrading the separation performance. With the same contrast for both optimization algorithms, the fast algorithm needed far fewer iterations to converge. To show the convergence speed, the result of a sample case (experiment 2) is shown in Fig. 5: the evolution of SIRout for the gradient descent and fast algorithms with the MLSEND contrast is plotted against the number of iterations. While the separation performances were similar, the reduction in the number of iterations achieved by the Newton-update IVA algorithm is significant.


4.2. Source separation in a real acoustic environment

Real data were recorded in an ordinary office environment. Three human voices and a piece of hip-hop music played from a stationary speaker


were captured by four equidistantly aligned microphones. The sources were located approximately 1–2 m away from the microphones. Figs. 6(a) and (b) show the sources separated by Sawada's algorithm and by the fast IVA algorithm with the MLSEND contrast, respectively. Note that the first plot in Fig. 6(b) corresponds to the separated piece of music. In this experiment, where real reverberation and noise were present, our IVA algorithm significantly outperformed Sawada's algorithm and showed robustness.

5. Conclusions


Fig. 5. The SIRout convergence plots of the fast Newton update IVA algorithm and the natural gradient IVA algorithm with the same MLSEND contrast for experiment 2 in Fig. 4.

IVA, a new type of MICA formulation, tackles the problem of separating acoustic mixtures by adding a source constraint to ICA learning such that each source is learned together with its inter-frequency dependency structure; hence the permutation problem that arises in frequency-domain BSS is avoided. Interestingly, as in ICA, there is a certain flexibility in defining or modelling the source prior. The assumption of spherical symmetry and sparseness in the frequency domain proved effective for good performance on speech separation.
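To illustrate the inter-frequency coupling mentioned above: a common spherically symmetric, sparse prior in the IVA literature is p(s) ~ exp(-sqrt(sum_f |s^f|^2)), whose score function divides each bin by a radial term shared across all bins. The sketch below is our illustration under that assumed prior, not the paper's exact model:

```python
import numpy as np

# Hedged sketch: score (negative log-prior gradient) of the spherically
# symmetric, sparse prior p(s) ~ exp(-sqrt(sum_f |s^f|^2)). The score of
# bin f is s^f / sqrt(sum_f |s^f|^2); the shared radial term ties all
# frequency bins of one source together, which is what removes the
# bin-wise permutation ambiguity.
def multivariate_score(S, eps=1e-12):
    # S: complex STFT coefficients of one source, shape (F, N) = (bins, frames)
    radius = np.sqrt(np.sum(np.abs(S) ** 2, axis=0, keepdims=True))
    return S / (radius + eps)
```

Because the radius depends on every bin, a large coefficient in any one bin shrinks the score in all the others, enforcing the dependency across frequencies.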

Fig. 6. Experiments with real acoustic recordings. (a) The separation results from Sawada’s algorithm. (b) The separation results from our fast IVA algorithm with MLSEND contrast.


Contrast functions of IVA, which are also contrast functions of MICA, were examined. The available entropic contrasts of IVA closely resemble the classical entropic contrasts of ICA. For simpler derivation and further use, uncorrelatedness of the multidimensional variables was assumed; it should be noted, however, that this is not a necessary condition for deriving the entropic contrasts. In order to derive fast algorithms with complex variables by directly applying Newton's method, a quadratic Taylor polynomial was introduced in complex notation. Using this form of the Taylor expansion, the derivation becomes straightforward. We later found that an equivalent form was proposed independently in [41]. Although the authors of [41] cite [26], they did not observe the relation between the Newton-method derivation in [26] and the proposed form, and thus derived it from scratch. It should be noted that the form is equivalent to the Newton method in [26] and can easily be derived from it.

Acknowledgments

The authors would like to thank the reviewers and Jiucang Hao for valuable comments and recommendations.

References

[1] J. Herault, C. Jutten, Space or time processing by neural network models, in: Proceedings of the AIP Conference: Neural Networks for Computing, 151, 1986.
[2] P. Comon, Independent component analysis—a new concept?, Signal Processing 36 (1994) 287–314.
[3] T.-W. Lee, Independent Component Analysis: Theory and Applications, Kluwer Academic Publishers, Boston, 1998.
[4] S. Roberts, R. Everson, Independent Component Analysis: Principles and Practice, Cambridge University Press, Cambridge, UK, 2001.
[5] A. Hyvärinen, E. Oja, Independent Component Analysis, Wiley, New York, 2002.
[6] D. Yellin, E. Weinstein, Multichannel signal separation: methods and analysis, IEEE Trans. Signal Process. 44 (1) (1996) 106–118.
[7] R. Lambert, Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures, Ph.D. Dissertation, University of Southern California, 1996.
[8] K. Torkkola, Blind separation of convolved sources based on information maximization, in: Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing, 1996, pp. 423–432.
[9] T.-W. Lee, A.J. Bell, R. Lambert, Blind separation of convolved and delayed sources, in: Advances in Neural Information Processing Systems, 1997, pp. 758–764.

[10] S. Weiss, On adaptive filtering on oversampled subbands, Ph.D. Dissertation, Signal Processing Division, University of Strathclyde, 1997.
[11] P. Smaragdis, Blind separation of convolved mixtures in the frequency domain, Neurocomputing 22 (1998) 21–34.
[12] L. Parra, C. Spence, Convolutive blind separation of non-stationary sources, IEEE Trans. Speech Audio Process. 8 (3) (2000) 320–327.
[13] F. Asano, S. Ikeda, M. Ogawa, H. Asoh, N. Kitawaki, A combined approach of array processing and independent component analysis for blind separation of acoustic signals, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, pp. 2729–2732.
[14] S. Kurita, H. Saruwatari, S. Kajita, K. Takeda, F. Itakura, Evaluation of blind signal separation method using directivity pattern under reverberant conditions, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000, pp. 3140–3143.
[15] M.Z. Ikram, D.R. Morgan, A beamforming approach to permutation alignment for multichannel frequency-domain blind speech separation, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, pp. 881–884.
[16] H. Sawada, R. Mukai, S. Araki, S. Makino, A robust and precise method for solving the permutation problem of frequency-domain blind source separation, in: Proceedings of the International Conference on Independent Component Analysis and Blind Source Separation, 2003, pp. 505–510.
[17] J. Anemueller, B. Kollmeier, Amplitude modulation decorrelation for convolutive blind source separation, in: Proceedings of the International Conference on Independent Component Analysis and Blind Source Separation, 2000, pp. 215–220.
[18] N. Murata, S. Ikeda, A. Ziehe, An approach to blind source separation based on temporal structure of speech signals, Neurocomputing 41 (2001) 1–24.
[19] T. Kim, H.T. Attias, S.-Y. Lee, T.-W. Lee, Blind source separation exploiting higher-order frequency dependencies, IEEE Trans. Speech Audio Process. 15 (1) (2007) 70–79.
[20] I. Lee, T. Kim, T.-W. Lee, Complex FastIVA: a robust maximum likelihood approach of MICA for convolutive BSS, in: Lecture Notes in Computer Science, 2006, pp. 625–632.
[21] A. Hiroe, Solution of permutation problem in frequency domain ICA, using multivariate probability density functions, in: Lecture Notes in Computer Science, 2006, pp. 601–608.
[22] J.-F. Cardoso, Multidimensional independent component analysis, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1998, pp. 1941–1944.
[23] A. Hyvärinen, P.O. Hoyer, Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces, Neural Computation 12 (7) (2000) 1705–1720.
[24] S. Haykin, Unsupervised Adaptive Filtering, vol. 1, Wiley, New York, 2000.
[25] D.H. Brandwood, A complex gradient operator and its application in adaptive array theory, IEE Proc. F, H 130 (1) (1983) 11–16.
[26] A. van den Bos, Complex gradient and Hessian, IEE Proc. Vision Image Signal Process. 141 (1994) 380–382.

[27] H. Brehm, W. Stammler, Description and generation of spherically invariant speech-model signals, Signal Processing 12 (1987) 119–141.
[28] H. Buchner, R. Aichner, W. Kellermann, Blind source separation for convolutive mixtures: a unified treatment, in: Y. Huang, J. Benesty (Eds.), Audio Signal Processing for Next-Generation Multimedia Communication Systems, Kluwer Academic Publishers, Boston, 2004, pp. 255–293.
[29] I. Lee, T.-W. Lee, On the assumption of spherical symmetry and sparseness for the frequency-domain speech model, IEEE Trans. Speech Audio Lang. Process., 2007, to appear.
[30] M. Davies, Audio source separation, in: Mathematics in Signal Processing, vol. 5, Oxford University Press, Oxford, 2002, pp. 57–68.
[31] S.-I. Amari, A. Cichocki, H.H. Yang, A new learning algorithm for blind signal separation, in: Advances in Neural Information Processing Systems, vol. 8, 1996, pp. 757–763.
[32] J.-F. Cardoso, The invariant approach to source separation, in: Proceedings of the International Symposium on Nonlinear Theory and Applications (NOLTA), vol. 1, 1995, pp. 55–60.
[33] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (7) (1997) 1483–1492.


[34] A. Hyvärinen, Fast and robust fixed-point algorithms for independent component analysis, IEEE Trans. Neural Networks 10 (3) (1999) 626–634.
[35] E.G. Learned-Miller, J.W. Fisher III, ICA using spacings estimates of entropy, J. Machine Learning Res. 4 (2003) 1271–1295.
[36] E. Bingham, A. Hyvärinen, A fast fixed-point algorithm for independent component analysis of complex-valued signals, Int. J. Neural Systems 10 (1) (2000) 1–8.
[37] M.Z. Ikram, D.R. Morgan, Exploring permutation inconsistency in blind separation of signals in a reverberant environment, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000, pp. 1041–1044.
[38] R.B. Stephens, A.E. Bate, Acoustics and Vibrational Physics, Edward Arnold Publishers, 1966.
[39] J.B. Allen, D.A. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Amer. 65 (1979) 943–950.
[40] W.G. Gardner, The virtual acoustic room, Master's Thesis, MIT, 1992.
[41] G. Yan, H. Fan, A Newton-like algorithm for complex variables with applications in blind equalization, IEEE Trans. Signal Process. 48 (2) (2000) 553–556.