An EM algorithm for convolutive independent component analysis


Neurocomputing 49 (2002) 187 – 211

www.elsevier.com/locate/neucom

Sabine Deligne*, Ramesh Gopinath

IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA

Received 14 April 2001; accepted 22 October 2001

* Corresponding author. Tel.: +1-914-945-2253; fax: +1-914-945-4490. E-mail address: [email protected] (S. Deligne).

Abstract

In this paper, we address the problem of blindly separating convolutive mixtures of spatially and temporally independent sources. Source densities are modelled as mixtures of Gaussians. We present an EM algorithm to compute Maximum Likelihood estimates of both the separating filters and the source density parameters, whereas in the state of the art the separating filters are usually estimated with gradient descent techniques. The use of the EM algorithm, as an alternative to the usual gradient descent techniques, is advantageous as it provides a faster and more stable convergence and as it does not require the empirical tuning of a learning rate. Besides, we show how multichannel autoregressive spectral estimation techniques can be used in order to properly initialize the EM algorithm. We demonstrate the efficiency of our EM algorithm together with the proposed initialization scheme by reporting on simulations with artificial mixtures. Finally, we discuss the theoretical and practical limitations of our EM approach and point out possible ways of addressing these issues in future works. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Blind signal separation; Convolutive mixtures; Gaussian mixture source densities; EM algorithm; Row-by-row estimation; Autoregressive spectral estimation

1. Introduction

Blind Source Separation (BSS) addresses the issue of recovering source signals from the observation of mixtures of these sources. Independent component analysis (ICA) aims at achieving BSS in the case where the source signals are mutually statistically independent and mixed with an unknown matrix, resulting in distinct instantaneous mixtures observed at the outputs of sensors. However, in most real world



situations, the assumption of instantaneous mixing between the sources does not hold, due to propagation delays and multipath propagation in reflective environments. It is more realistic to expect multiple attenuated echoes of the source signals to arrive at the inputs of the sensors at different times. This situation can be modelled by assuming that the source signals are mixed with an unknown matrix of filters (instead of scalars), resulting in distinct convolutive mixtures observed at the outputs of sensors. We refer to the BSS of convolutive mixtures as convolutive ICA (CICA). In this introduction, we show how the contribution presented in this paper relates to existing works in the ICA and CICA fields.

Usually, the algorithms devised to solve the ICA problem consist of estimating an unmixing matrix transforming the observed mixtures into signals that are as independent as possible. This approach is grounded in theoretical results about blind identifiability according to which, assuming that no more than one source is Gaussian, BSS can be achieved by restoring statistical independence [7,4,5]. As a consequence, the outputs of the unmixing matrix equal the actual sources up to scaling and permutation. Various measures of statistical independence have been proposed, especially the Mutual Information measure resulting in the popular infomax approach [3]. The Mutual Information measure can be understood as a maximum likelihood (ML) scheme where the densities of the sources are fixed [5]. A drawback of this approach is that too big a mismatch between the assumed and the actual source densities can lead to imperfect separation, since it amounts to shifting the global maximum of the likelihood function. One way to circumvent a possible mismatch of the source distributions is to model the source densities with flexible parametric density functions, and to estimate their parameters together with the unmixing matrix. In [1], each source is modelled with a mixture of Gaussians, so that the usual EM equations for fitting mixtures of Gaussians can be used to provide ML estimates of the density parameters. An ML estimate of the unmixing matrix is computed with gradient update rules after each EM reestimation of the source densities. Recently, it was shown, as a by-product of Gaussianization [6], that EM equations could be derived not only to compute the estimates of the density parameters, but also to compute an ML estimate of the unmixing matrix. The framework used in [6] is inspired by the row-by-row estimation scheme introduced in [9] for the estimation of semi-tied covariance matrices. The use of an EM procedure is advantageous as it provides a faster and more stable convergence than gradient descent techniques, while not requiring the empirical tuning of a learning rate. Our contribution is to show that the EM solution provided for ICA in [6] can be extended to the case of CICA. In this paper, we derive EM estimation equations for the unmixing filters, as opposed to the usual gradient update rules [2].

This paper is organized as follows. In Section 2, we present the CICA hypothesis and the notations used throughout the paper. In Section 3, we derive EM equations for all the parameters in the CICA model. We propose two variants of an initialization scheme in Section 4, and we address implementation issues such as computational and memory costs in Section 5. The algorithm is evaluated in Section 6 with simulations on artificial data. The theoretical and practical limitations of the EM framework presented


in this paper for a BSS task are discussed in Section 7. Conclusions for this work are given in Section 8.

2. Convolutive ICA (CICA)

Consider D mutually independent random variables $Y_1,\dots,Y_d,\dots,Y_D$ called sources, generating real-valued data $Y_1(t),\dots,Y_d(t),\dots,Y_D(t)$ for each time index t between 1 and T. The T realizations $\{Y_d(1),\dots,Y_d(T)\}$ of each source $Y_d$ are assumed to be independent. The D streams of data generated by the sources are not observable: we assume that they are mixed with a $D \times D$ matrix B of filters with finite impulse responses $(B_{i,j})_{i,j=1}^{D}$. The mixing and filtering process results in D streams of data $\{X_1(t),\dots,X_d(t),\dots,X_D(t)\}$ observed at the output of D sensors:

$$X_d(t) = \sum_{j=1}^{D} (B_{d,j} * Y_j)(t) = \sum_{j=1}^{D} \sum_{p=0}^{P_B-1} B_{d,j}(p)\,Y_j(t-p),$$

where the symbol $*$ denotes the convolution product and where $P_B$ is the common length of all the filters $B_{d,j}$. The basic problem addressed in this paper is to identify a $D \times D$ matrix A of unmixing filters with responses $(A_{d,j})_{d,j=1}^{D}$ allowing us to recover D mutually independent sources $\hat Y_1,\dots,\hat Y_D$ from the observed convolutive mixtures $\{X(1),\dots,X(T)\}$:

$$\hat Y_d(t) = \sum_{j=1}^{D} (A_{d,j} * X_j)(t) = \sum_{j=1}^{D} \sum_{p=0}^{P_A-1} A_{d,j}(p)\,X_j(t-p), \tag{1}$$

where $P_A$ is the common length of all the filters $A_{d,j}$. Note that we assume that each source is white, i.e. that there are no time correlations within the stream generated by each source. This assumption does not narrow the applicability of our solution to blind CICA since, as emphasized in [2], it is not possible in the fully blind separation problem to distinguish between the spectra of the filters and the spectra of the sources. Therefore, one can as well assume that the time structure within each observed mixture is introduced exclusively by the convolutive aspect of the mixing. In that case, learning the separating transform means learning the spectra of the filters and recovering white sources. It remains to clarify the relation between the recovered sources $[\hat Y_1 \dots \hat Y_D]^T$ and the actual sources $[Y_1 \dots Y_D]^T$. For this purpose, we rewrite Eq. (1) in matrix form, as a

190

S. Deligne, R. Gopinath / Neurocomputing 49 (2002) 187 – 211

product of the convolutive mixtures by a $TD \times TD$ separating matrix:

$$\begin{bmatrix}\hat Y(1)\\ \hat Y(2)\\ \vdots\\ \hat Y(T)\end{bmatrix} = \begin{bmatrix} A(0) & 0_D & \cdots & \cdots & 0_D\\ A(1) & A(0) & 0_D & & \vdots\\ \vdots & \ddots & \ddots & \ddots & \\ A(P_A-1) & \cdots & A(1) & A(0) & 0_D\\ 0_D & A(P_A-1) & \cdots & A(1) & A(0) \end{bmatrix} \begin{bmatrix}X(1)\\ X(2)\\ \vdots\\ X(T)\end{bmatrix}, \tag{2}$$

where $A(p) = (A_{i,j}(p))_{i,j=1}^{D}$ is the $D \times D$ matrix of scalars formed by the pth coefficients of the separating filters. We note from Eq. (2) that the CICA problem can to a certain extent be formally expressed as an ICA problem with $P_A D$ observed streams, since

$$\hat Y(t) = [\hat Y_1(t) \dots \hat Y_D(t)]^T = [A(0) \dots A(P_A-1)]\,\chi(t), \tag{3}$$

where $\chi(t) = [X_1(t) \dots X_D(t) \dots X_1(t-P_A+1) \dots X_D(t-P_A+1)]^T$, i.e. the $P_A D$ observed streams at time t consist of the D sensor outputs at time t together with $P_A - 1$ delayed versions of the D sensor outputs. As is well known in ICA, the recovered sources $[\hat Y_1 \dots \hat Y_D]^T$ equal the actual sources up to scaling and permutation. In the case of CICA, Eq. (3) shows that the ambiguity on the scale of the sources can be resolved by requiring a solution $[A(0) \dots A(P_A-1)]$ where each $1 \times P_A D$ row has a norm equal to 1. There then remains only the ambiguity on the order of the sources, i.e. the ambiguity on the order of the rows in $[A(0) \dots A(P_A-1)]$. Note that by writing Eq. (2), we have implicitly assumed that the $TD \times TD$ mixing matrix is invertible, which implies in particular that the matrices B(0) and A(0) are full rank. This may not be the case in real world applications, since distinct convolutive mixtures can be obtained with mixing filters such that the matrix formed by the zero-delay coefficients is not invertible. We discuss this issue and give an example of such filters in Section 7.1.

In this paper, following the approach in [2], the density of each source $Y_d$ is modelled with a mixture of Gaussians defined by the unknown prior, mean and variance parameters $(\pi_{d,i}, \mu_{d,i}, \sigma^2_{d,i})_{i=1}^{I_d}$, where $I_d$ is the prespecified number of Gaussian components in the density of $Y_d$. The CICA model is thus fully specified by the set of parameters $\Theta = \{(A(p))_{p=0}^{P_A-1};\ ((\pi_{d,i}, \mu_{d,i}, \sigma^2_{d,i})_{i=1}^{I_d})_{d=1}^{D}\}$. In Section 3, we derive EM equations to compute ML estimates for all the parameters in the set $\Theta$.
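To make the data model concrete, here is a minimal sketch (in NumPy; the function names are ours, not from the paper) that generates convolutive mixtures according to the model above and builds the stacked observation vector $\chi(t)$ of Eq. (3):

```python
import numpy as np

def mix_convolutive(Y, B):
    """Convolve D sources with a D x D matrix of FIR filters.
    Y: (D, T) sources; B: (P_B, D, D) taps with B[p, d, j] = B_{d,j}(p).
    Returns X: (D, T) with X_d(t) = sum_{j,p} B_{d,j}(p) Y_j(t-p)."""
    P_B, D, _ = B.shape
    T = Y.shape[1]
    X = np.zeros((D, T))
    for p in range(P_B):
        # delayed sources, with the convention Y(t) = 0 for t < 1
        X[:, p:] += B[p] @ Y[:, :T - p]
    return X

def stack_observations(X, P_A):
    """Build chi(t) = [X(t); X(t-1); ...; X(t-P_A+1)] for every t."""
    D, T = X.shape
    chi = np.zeros((P_A * D, T))
    for r in range(P_A):
        chi[r * D:(r + 1) * D, r:] = X[:, :T - r]
    return chi
```

With these conventions, the matrix $[A(0) \dots A(P_A-1)]$ of Eq. (3) acts on `stack_observations(X, P_A)` exactly as an ICA unmixing matrix acts on instantaneous mixtures.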


3. Derivation of an EM algorithm for CICA

The problem of computing ML estimates of $\Theta$ can be formulated as an ML estimation problem from incomplete data, where the missing data are the Gaussian components drawn for each $Y_d$, and as such it can be solved with an EM procedure [8]. The complete data are given by $(X, Z)$, where the random variable $Z = [Z_1 \dots Z_D]$ is such that $Z_d(t)$ specifies the index i of the Gaussian $(\pi_{d,i}, \mu_{d,i}, \sigma^2_{d,i})$ drawn for $Y_d(t)$. The likelihood of the complete data computed with the parameter set $\Theta$ is given by the joint density $p_\Theta(X, Z)$. Following the EM framework, we define the auxiliary function $Q(\Theta; \hat\Theta)$ as the conditional expectation of the log-likelihood of the complete data $(X, Z)$:

$$Q(\Theta; \hat\Theta) = E[\log p_\Theta(X, Z) \mid \hat\Theta, X]. \tag{4}$$

According to the Baum–Welch inequality, if $Q(\Theta; \hat\Theta) \geq Q(\hat\Theta; \hat\Theta)$ then $p_\Theta(X) \geq p_{\hat\Theta}(X)$. The E-step of the EM procedure consists in computing $Q(\Theta; \hat\Theta)$. For this purpose, we express the complete data likelihood as a function of the source densities: $p_\Theta(X, Z) = |\det A(0)|^T p_\Theta(\hat Y, Z)$, where $|\det A(0)|$ is the absolute value of the determinant of the Jacobian of the separating transform, as shown by Eq. (2). With the assumption that the recovered data $\hat Y(t)$ are independent and consist of D mutually independent sources, the log-likelihood can be written as

$$\log p_\Theta(X, Z) = T \log|\det A(0)| + \sum_{t=1}^{T} \sum_{d=1}^{D} [\log p(Z_d(t)) + \log p(\hat Y_d(t) \mid Z_d(t))]$$
$$= T \log|\det A(0)| + \sum_{t=1}^{T} \sum_{d=1}^{D} \sum_{i=1}^{I_d} \delta_{Z_d(t),i} \left[ \log \pi_{d,i} - \frac{1}{2}\log 2\pi\sigma^2_{d,i} - \frac{(\hat Y_d(t) - \mu_{d,i})^2}{2\sigma^2_{d,i}} \right], \tag{5}$$

where the Kronecker variable $\delta_{Z_d(t),i} = 1$ if $Z_d(t) = i$, and 0 otherwise. From Eqs. (4) and (5), it follows that

$$Q(\Theta; \hat\Theta) = T \log|\det A(0)| + \sum_{t=1}^{T} \sum_{d=1}^{D} \sum_{i=1}^{I_d} \omega_{d,i}(t) \left[ \log \pi_{d,i} - \frac{1}{2}\log 2\pi\sigma^2_{d,i} - \frac{(\hat Y_d(t) - \mu_{d,i})^2}{2\sigma^2_{d,i}} \right],$$

where $\omega_{d,i}(t)$ represents the occupancy probability of the Gaussian component (d, i) at time t:

$$\omega_{d,i}(t) = E[\delta_{Z_d(t),i} \mid \hat\Theta, X] = \frac{\hat\pi_{d,i}\,(1/\sqrt{2\pi\hat\sigma^2_{d,i}})\,\exp(-(\hat Y_d(t) - \hat\mu_{d,i})^2/2\hat\sigma^2_{d,i})}{\sum_{i'=1}^{I_d} \hat\pi_{d,i'}\,(1/\sqrt{2\pi\hat\sigma^2_{d,i'}})\,\exp(-(\hat Y_d(t) - \hat\mu_{d,i'})^2/2\hat\sigma^2_{d,i'})}.$$
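The occupancy probabilities are the usual GMM posterior responsibilities; here is a minimal sketch of this E-step for one recovered source, computed in the log domain for numerical stability (NumPy assumed, names ours):

```python
import numpy as np

def occupancy_probs(Y_hat_d, pi_d, mu_d, var_d):
    """omega[t, i] = posterior probability of Gaussian (d, i) at time t,
    as in the formula above. Y_hat_d: (T,); pi_d, mu_d, var_d: (I_d,)."""
    log_gauss = (-0.5 * np.log(2 * np.pi * var_d)[None, :]
                 - (Y_hat_d[:, None] - mu_d[None, :]) ** 2 / (2 * var_d[None, :]))
    log_w = np.log(pi_d)[None, :] + log_gauss
    log_w -= log_w.max(axis=1, keepdims=True)   # guard against exp underflow
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)
```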


In the M-step of the EM procedure, we derive reestimation equations of the parameters by maximizing the auxiliary function $Q(\Theta; \hat\Theta)$ with respect to $\Theta$. Zeroing the derivative of $Q(\Theta; \hat\Theta)$ with respect to $\pi_{d,i}$, $\mu_{d,i}$ and $\sigma^2_{d,i}$ yields the reestimation equations:

$$\pi_{d,i} = \frac{\sum_{t=1}^{T}\omega_{d,i}(t)}{\sum_{i'=1}^{I_d}\sum_{t=1}^{T}\omega_{d,i'}(t)} = \frac{\sum_{t=1}^{T}\omega_{d,i}(t)}{T}, \tag{6}$$

$$\mu_{d,i} = \frac{\sum_{t=1}^{T}\omega_{d,i}(t)\,\hat Y_d(t)}{\sum_{t=1}^{T}\omega_{d,i}(t)} = \sum_{r=0}^{P_A-1} a_d(r)\,\mu^{X}_{d,i}(r), \tag{7}$$

where $a_d(r)$ is the dth row vector of the matrix $A(r)$, and where

$$\mu^{X}_{d,i}(r) = \frac{\sum_{t=1}^{T}\omega_{d,i}(t)\,X(t-r)}{\sum_{t=1}^{T}\omega_{d,i}(t)}$$

with the convention $X(t) = 0$ for $t < 0$;

$$\sigma^2_{d,i} = \frac{\sum_{t=1}^{T}\omega_{d,i}(t)(\hat Y_d(t) - \mu_{d,i})^2}{\sum_{t=1}^{T}\omega_{d,i}(t)} = \sum_{r,r'=0}^{P_A-1} a_d(r)\,\Sigma^{X}_{d,i}(r,r')\,a_d(r')^T, \tag{8}$$

where

$$\Sigma^{X}_{d,i}(r,r') = \frac{\sum_{t=1}^{T}\omega_{d,i}(t)\,(X(t-r) - \mu^{X}_{d,i}(r))(X(t-r') - \mu^{X}_{d,i}(r'))^T}{\sum_{t=1}^{T}\omega_{d,i}(t)}.$$
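For the density parameters, this M-step is the standard weighted GMM update; the sketch below implements the direct (left-hand) forms of Eqs. (6)-(8) on the recovered source, rather than the sufficient-statistic forms used later in Section 5 (names ours):

```python
import numpy as np

def update_gmm_params(omega, Y_hat_d):
    """M-step for one source's mixture: Eq. (6) priors, Eq. (7) means,
    Eq. (8) variances. omega: (T, I_d) responsibilities; Y_hat_d: (T,)."""
    counts = omega.sum(axis=0)                     # sum_t omega_{d,i}(t)
    pi = counts / omega.shape[0]                   # Eq. (6)
    mu = (omega * Y_hat_d[:, None]).sum(axis=0) / counts                # Eq. (7)
    var = (omega * (Y_hat_d[:, None] - mu[None, :]) ** 2).sum(axis=0) / counts  # Eq. (8)
    return pi, mu, var
```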

In [2], the maximization of $Q(\Theta; \hat\Theta)$ with respect to the filter coefficients $A_{d,j}(p)$ is performed with a gradient descent algorithm. In this paper, we show how to derive EM reestimation equations for $A_{d,j}(p)$ by following the scheme introduced in [9] for the derivation of semi-tied covariances, and used in [6] for Gaussianization. We rewrite $Q(\Theta; \hat\Theta)$ as a function of the rows of the filtering matrices, using the fact that $\det A(0) = a_d(0)\,c_d(0)^T$, with $a_d(0)$ any particular row of $A(0)$ and $c_d(0)$ the row vector of the cofactors for the row $a_d(0)$:

$$Q(\Theta; \hat\Theta) = T \log|a_d(0)c_d(0)^T| - \frac{1}{2}\sum_{d,t=1}^{D,T}\sum_{i=1}^{I_d} \omega_{d,i}(t)\,\frac{(\hat Y_d(t) - \mu_{d,i})^2}{\sigma^2_{d,i}} + K$$
$$= T \log|a_d(0)c_d(0)^T| - \frac{1}{2}\sum_{d=1}^{D}\sum_{r,r'=0}^{P_A-1} a_d(r)\,G_d(r,r')\,a_d(r')^T + K,$$

where K is a constant that does not depend on the filter coefficients and where

$$G_d(r,r') = \sum_{i=1}^{I_d}\frac{1}{\sigma^2_{d,i}}\sum_{t=1}^{T}\omega_{d,i}(t)\,(X(t-r) - \mu^{X}_{d,i}(r))(X(t-r') - \mu^{X}_{d,i}(r'))^T. \tag{9}$$


The derivative of $Q(\Theta; \hat\Theta)$ with respect to $a_d(r)$, for $r \neq 0$, is

$$\frac{dQ}{da_d(r)} = -a_d(r)\,G_d(r,r) - \sum_{r' \neq r} a_d(r')\,G_d(r',r),$$

where we used the equality $G_d(r,r') = G_d(r',r)^T$. Zeroing this derivative yields the reestimation equation:

$$a_d(r) = -\left(\sum_{r' \neq r} a_d(r')\,G_d(r',r)\right) G_d^{-1}(r,r) \quad \text{for } r \neq 0. \tag{10}$$

Similarly, the derivative of $Q(\Theta; \hat\Theta)$ with respect to $a_d(0)$ is

$$\frac{dQ}{da_d(0)} = T\,\frac{c_d(0)}{a_d(0)c_d(0)^T} - a_d(0)\,G_d(0,0) - \sum_{r \neq 0} a_d(r)\,G_d(r,0).$$

Zeroing this derivative of $Q(\Theta; \hat\Theta)$ can be written as a quadratic equation of the scalar variable $\alpha = T/(a_d(0)c_d(0)^T)$:

$$\beta_1\alpha^2 - \beta_2\alpha - T = 0, \tag{11}$$

where

$$\beta_1 = c_d(0)\,G_d^{-1}(0,0)\,c_d(0)^T, \qquad \beta_2 = \left(\sum_{r \neq 0} a_d(r)\,G_d(r,0)\right) G_d^{-1}(0,0)\,c_d(0)^T.$$

The two roots of this quadratic equation,

$$\alpha = \frac{\beta_2 \pm \sqrt{\beta_2^2 + 4\beta_1 T}}{2\beta_1},$$

both correspond to maxima. The root to select is thus the one which yields the highest value of $Q(\Theta; \hat\Theta)$. Substituting the reestimation equation of $a_d(0)$,

$$a_d(0) = \left(\alpha\, c_d(0) - \sum_{r \neq 0} a_d(r)\,G_d(r,0)\right) G_d^{-1}(0,0), \tag{12}$$

into $Q(\Theta; \hat\Theta)$ and keeping only the terms depending on $\alpha$:

$$Q(\Theta; \hat\Theta) = T\log|\alpha\beta_1 - \beta_2| - \frac{1}{2}\beta_1\alpha^2 = T\log\left|\frac{-\beta_2 \pm \sqrt{\beta_2^2 + 4\beta_1 T}}{2}\right| - \frac{1}{2}\beta_1\left(\frac{\beta_2 \pm \sqrt{\beta_2^2 + 4\beta_1 T}}{2\beta_1}\right)^2.$$

Note that from the definition of $\beta_1$, the quantity $\beta_2^2 + 4\beta_1 T$ is always positive.


Therefore, if $\beta_2 > 0$, we will select $\alpha = (\beta_2 - \sqrt{\beta_2^2 + 4\beta_1 T})/2\beta_1$, whereas if $\beta_2 \leq 0$, we will select $\alpha = (\beta_2 + \sqrt{\beta_2^2 + 4\beta_1 T})/2\beta_1$. After proper selection of $\alpha$, the row $a_d(0)$ is reestimated with Eq. (12).
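A sketch of the resulting row update for one source d, assuming the statistics $G_d(r,r')$ of Eq. (9) and the cofactor row $c_d(0)$ have already been computed (the function name and data layout are ours):

```python
import numpy as np

def update_filter_rows(A_rows, G_d, c_d0, T):
    """Row-by-row M-step, Eqs. (10)-(12). A_rows: (P_A, D), row d of each A(r);
    G_d: (P_A, P_A, D, D) with G_d[r, rp] = G_d(r, r'); c_d0: (D,) cofactors."""
    P_A = A_rows.shape[0]
    for r in range(1, P_A):                      # Eq. (10), r != 0
        acc = sum(A_rows[rp] @ G_d[rp, r] for rp in range(P_A) if rp != r)
        A_rows[r] = -acc @ np.linalg.inv(G_d[r, r])
    G00_inv = np.linalg.inv(G_d[0, 0])
    acc0 = sum((A_rows[r] @ G_d[r, 0] for r in range(1, P_A)),
               np.zeros_like(A_rows[0]))
    beta1 = c_d0 @ G00_inv @ c_d0                # Eq. (11) coefficients
    beta2 = acc0 @ G00_inv @ c_d0
    disc = np.sqrt(beta2 ** 2 + 4.0 * beta1 * T)
    # root selection as described in the text above
    alpha = (beta2 - disc) / (2 * beta1) if beta2 > 0 else (beta2 + disc) / (2 * beta1)
    A_rows[0] = (alpha * c_d0 - acc0) @ G00_inv  # Eq. (12)
    return A_rows
```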

4. Initialization

4.1. Initialization scheme in ICA

One limitation of the EM algorithm is that it may converge towards a local maximum instead of the global maximum of the likelihood function. A judicious choice of the initial estimates can limit the risk of converging towards a stationary point of the likelihood function that would be very far away from the global optimum. Besides, it is known that proper initialization can also speed up the convergence process. The existing experimental works in ICA show that whitening the observations usually allows a faster convergence towards independent sources. The idea is that decorrelating the data is an initial move in a sensible direction before seeking mutual independence. Therefore, in ICA, any decorrelating transform can be expected to provide a good initial guess for the unmixing matrix. For example, the inverse square root of the covariance matrix of the observations, $E(XX^T)^{-1/2}$, is a transform that is often used as the initial value of the unmixing matrix in ICA.

In this paper, we propose two variants of an initialization scheme for the case of CICA, so that all the unmixing matrices $A(0),\dots,A(P_A-1)$ are assigned adequate initial values. Eq. (2) shows that, assuming unmixing filters with an impulse response of length $P_A$, the BSS of D convolutive mixtures can to some extent be formulated as the BSS problem of $P_A D$ instantaneous mixtures. In this pseudo-ICA problem, the observation vector containing the $P_A D$ instantaneous mixtures at time t, $\chi(t)$, consists of the D convolutive mixtures observed at times $t,\dots,t-P_A+1$:

$$\chi(t) = [X(t) \dots X(t-P_A+1)]^T.$$

Besides, in this pseudo-ICA problem, the $D \times P_A D$ matrix unmixing the observation vector $\chi$ is $[A(0) \dots A(P_A-1)]$. A natural extension of the ICA initialization scheme is thus to initialize the matrix $[A(0) \dots A(P_A-1)]$ with a transform decorrelating the observation vector $\chi$, i.e. decorrelating the observation vector X both spatially (over its D components) and temporally (over a window of $P_A$ samples). In other words, the issue is to find a linear transform $\Phi$ such that the covariance matrix of $\Phi\chi$ is diagonal. A straightforward way to proceed would be to compute $\Phi$ as the inverse of the Cholesky factorization of the covariance matrix $E(\chi\chi^T)$. However, in Sections 4.2 and 4.3 of this paper, we propose two variants of a computationally more efficient procedure where the observation vector X is first decorrelated temporally, and then spatially. In both variants, the procedure starts by forming the observation vector $\chi$ and estimating the covariance matrix $E(\chi\chi^T)$. To ensure the block-Toeplitz structure of $E(\chi\chi^T)$ in spite of the finite length of the observed data, we smooth our estimate of $E(\chi\chi^T)$ by averaging all the $D \times D$ blocks spanning along the same diagonal.
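As a sketch of this smoothing step (ours, NumPy assumed): building the estimate of $E(\chi\chi^T)$ directly from lagged covariance blocks enforces the block-Toeplitz structure and amounts to averaging the $D \times D$ blocks along each block diagonal:

```python
import numpy as np

def block_toeplitz_cov(X, P_A):
    """Block-Toeplitz estimate of E(chi chi^T). X: (D, T) mixtures.
    Block (i, j) equals C(j - i), with C(r) = E[X(t) X(t-r)^T]."""
    D, T = X.shape
    C = [(X[:, r:] @ X[:, :T - r].T) / (T - r) for r in range(P_A)]
    R = np.zeros((P_A * D, P_A * D))
    for i in range(P_A):
        for j in range(P_A):
            blk = C[j - i] if j >= i else C[i - j].T
            R[i * D:(i + 1) * D, j * D:(j + 1) * D] = blk
    return R
```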


4.2. First initialization variant

The principle of the initialization procedure proposed in this section is to compute $\Phi$ by composing two linear transforms, $\Phi = \Phi_2 \circ \Phi_1$, where:

• $\Phi_1$ is a transform (a $D \times P_A D$ matrix) such that the covariance matrix of $\Phi_1\chi$ is block-diagonal of block dimension $D \times D$, i.e. $\Phi_1$ decorrelates the observation vector X temporally (over a window of $P_A$ samples),
• $\Phi_2$ is a transform (a $D \times D$ matrix) such that the covariance matrix of $\Phi_2(\Phi_1\chi)$ is diagonal, i.e. $\Phi_2$ decorrelates the observation vector X spatially (over its D dimensions).

As a time-decorrelating transform, the matrix $\Phi_1$ can be computed as the multichannel linear prediction coefficients $[\Phi_1(0) \dots \Phi_1(P_A-1)]$ at order $P_A$ of the observation vector X. It can be estimated with a recursion solving the multichannel Yule–Walker equations up to order $P_A$:

$$[\Phi_1(0) \dots \Phi_1(P_A-1)]\;E(\chi\chi^T) = [\Sigma_{P_A}\; 0_D \dots 0_D],$$

where the covariance matrix $E(\chi\chi^T)$ is a $P_A D \times P_A D$ block-Toeplitz matrix of block dimension $D \times D$, and where $\Sigma_{P_A}$ is the $D \times D$ covariance matrix of the forward prediction error output by the linear prediction filter $\Phi_1(0) \dots \Phi_1(P_A-1)$ with input X. Several algorithms have been proposed in the field of multichannel autoregressive spectral estimation to solve the multichannel Yule–Walker equations (Chapter 15 in [12]). In this paper, we experiment with the multichannel extension of the well-known Levinson algorithm, but other (possibly more efficient) algorithms could be used.

After the matrix $\Phi_1$ is computed, the transform $\Phi_2$ can be computed as a $D \times D$ matrix diagonalizing the covariance matrix $\Sigma_{P_A}$, i.e. as a transform whitening the data obtained by filtering the observation vector X with the linear prediction filters $\Phi_1(0) \dots \Phi_1(P_A-1)$. In our experiment, $\Phi_2$ is computed as the inverse of the Cholesky factorization of $\Sigma_{P_A}$ (the Cholesky factorization of $\Sigma_{P_A}$ produces a lower triangular matrix C such that $CC^T = \Sigma_{P_A}$, so that by choosing $\Phi_2 = C^{-1}$, the covariance matrix $\Phi_2\Sigma_{P_A}\Phi_2^T$ equals the $D \times D$ identity matrix). The initial unmixing filters are applied to the observed data $X(1),\dots,X(T)$ to recover initial sources. The density parameters are initialized by applying an EM procedure to the initially recovered sources, following the usual ML estimation scheme in Gaussian mixture modelling.
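The spatial whitening step $\Phi_2$ is a one-liner once $\Sigma_{P_A}$ is available; a minimal sketch (ours):

```python
import numpy as np

def spatial_whitener(Sigma):
    """Phi_2 = C^{-1}, with C the lower-triangular Cholesky factor of the
    prediction error covariance (C C^T = Sigma), so Phi_2 Sigma Phi_2^T = I."""
    C = np.linalg.cholesky(Sigma)
    return np.linalg.inv(C)

# usage, with E the (D, T) prediction error output by the filters Phi_1:
#   Phi2 = spatial_whitener(np.cov(E))
```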


4.3. Second initialization variant

The second initialization variant described in this section differs from the initialization procedure described in Section 4.2 in the following way. It does not require explicitly computing the transform $\Phi_1$ consisting of the prediction filters. Instead, it requires computing the prediction error data output by the prediction filters. The prediction error can be efficiently computed using the multichannel extension of the reflection coefficient estimation method instead of the Levinson algorithm. The reflection coefficient estimation method has a lower computational cost than the Levinson algorithm (Chapter 15 in [12]). The transform $\Phi_2$ is estimated as a whitening transform of the prediction error data. The CICA algorithm is applied to the prediction error data, instead of the data actually observed at the output of the sensors. Besides, the CICA algorithm is applied by initializing the unmixing matrix A(0) with $\Phi_2$ and by initializing the unmixing matrices $A(1),\dots,A(P_A-1)$ with $D \times D$ zero matrices. As in the initialization procedure described in Section 4.2, the density parameters are initialized by applying an EM procedure to the initially recovered sources, following the conventional ML estimation scheme in Gaussian mixture modelling. This initialization variant enhances the overall computational efficiency of the algorithm since:

• the prediction error data can be computed at a lower computational cost than the linear prediction coefficients,
• the prediction error data to which the CICA algorithm is applied contain fewer time correlations than the actually observed data, and thus the algorithm can be expected to converge faster,
• the search space is enlarged to a subspace of the space of filters of length up to $2P_A$ (this subspace including the space of filters of length up to $P_A$) with no increase of the computational cost. This results from the fact that the filters eventually output by the procedure are a cascade of the initial linear prediction filters of length $P_A$ and of the filters estimated on the prediction error data by the CICA algorithm, which are of length $P_A$ also (note that the initial filters and the filters estimated by the CICA algorithm could be of lengths, respectively, $P_{A_1}$ and $P_{A_2}$ with $P_{A_1} \neq P_{A_2}$, thus allowing the exploration of a subspace of the space of filters of length up to $P_{A_1} + P_{A_2}$).

4.4. Estimation of the length of the unmixing filters

In the initialization schemes explained in Sections 4.2 and 4.3, it is assumed that the length $P_A$ of the unmixing filters is known. In practice, $P_A$ only needs to be an upper bound on the length of the filters, since the unnecessary filter coefficients should converge towards zero when applying the algorithm. However, the upper bound should not be too high, in order to keep the algorithmic complexity at a manageable level. The value of $P_A$ can be selected in the same way as the order of an autoregressive model is selected in autoregressive spectral estimation. Several model orders are postulated; based on these, one computes some error criterion that indicates which model order to choose (Chapter 8 in [12]). Various criteria have been proposed as objective functions for model order selection. In this paper, we experiment with the multichannel version of the Akaike information criterion (AIC) (Chapter 15 in [12]). At each recursion of the estimation algorithm of either the multichannel linear prediction coefficients (initialization variant of Section 4.2) or of the prediction error (initialization variant of Section 4.3), we compute the value of the AIC function. The model order which is selected is the value for which the AIC function is minimal.
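As an illustration of this selection rule, the sketch below scores each postulated order with one common multichannel form of the AIC, $T\ln\det\Sigma_p + 2pD^2$, where $\Sigma_p$ is the prediction error covariance at order p; both this exact AIC variant and the `multichannel_levinson` helper are our assumptions here, not prescriptions from [12]:

```python
import numpy as np

def select_filter_length(X, max_order):
    """Pick P_A as the order minimizing the multichannel AIC (Section 4.4)."""
    D, T = X.shape
    best_p, best_aic = None, np.inf
    for p in range(1, max_order + 1):
        Sigma_p = multichannel_levinson(X, p)   # assumed helper: error covariance
        aic = T * np.log(np.linalg.det(Sigma_p)) + 2 * p * D * D
        if aic < best_aic:
            best_p, best_aic = p, aic
    return best_p
```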


5. Implementation issues

5.1. Storage of the sufficient statistics

The E-step of each iteration of the algorithm requires computing the occupancy counts for each Gaussian and collecting the following statistics:

$$\mathrm{sum}(i,d) = \sum_{t=1}^{T}\omega_{d,i}(t),$$

$$\mathrm{sumX}(r,i,d) = \sum_{t=1}^{T}\omega_{d,i}(t)\,X(t-r),$$

$$\mathrm{sumXX}(r,r',i,d) = \sum_{t=1}^{T}\omega_{d,i}(t)\,X(t-r)X(t-r')^T - \frac{\mathrm{sumX}(r,i,d)\,\mathrm{sumX}(r',i,d)^T}{\mathrm{sum}(i,d)},$$

hence requiring storage for $O(P_A^2 D^2 I_d)$ scalars for each of the D sources modelled with $I_d$ Gaussians.

5.2. Computational cost

Collecting the sufficient statistics requires $O(P_A^2 D^3)$ operations per observation vector, after which the reestimation of the density parameters according to

$$\pi_{d,i} = \frac{\mathrm{sum}(i,d)}{T}, \qquad \mu_{d,i} = \frac{\sum_{r=0}^{P_A-1} a_d(r)\,\mathrm{sumX}(r,i,d)}{\mathrm{sum}(i,d)}, \qquad \sigma^2_{d,i} = \frac{\sum_{r,r'=0}^{P_A-1} a_d(r)\,\mathrm{sumXX}(r,r',i,d)\,a_d(r')^T}{\mathrm{sum}(i,d)}$$

requires again $O(P_A^2 D^3)$ operations. Once the density parameters are reestimated, updating each row vector $[a_d(0) \dots a_d(P_A-1)]$ requires to

• compute the row of cofactors $c_d(0)$: $O(D^3)$ operations,
• compute the inverse of all the matrices $G_d(r,r')$: $O(P_A^2 D^3)$ operations,
• update $a_d(0)$ with Eq. (12), and for each $r \geq 1$, update $a_d(r)$ with Eq. (10): an additional $O(P_A^2 D^2)$ operations,
• normalize $[a_d(0) \dots a_d(P_A-1)]$ so that its norm equals 1: $O(P_A D)$ operations.

Overall, the complexity of the algorithm is thus $O(P_A^2 D^4)$. Note that one could loop over the EM reestimation of the density parameters and/or the EM reestimation of the unmixing matrices. However, in practice, most of the improvement of


the likelihood comes from the first M-step, so that the algorithm can proceed with just one iteration of each M-step. The Gaussian occupancy counts need be updated only after both the density parameters and the unmixing matrices have been reestimated. The normalization of the rows of the unmixing matrices requires that the estimated means and variances be scaled accordingly with the same normalization factor. In practice, we found that updating the means and variances with the newly updated and normalized rows (Eqs. (7) and (8)), rather than simply scaling them, could significantly speed up the convergence.
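To fix ideas, here is how the pieces sketched earlier might be wired into one iteration of the algorithm, with a single M-step for the densities and for the filters, followed by row normalization. `collect_stats`, returning the $G_d(r,r')$ statistics of Eq. (9) and the cofactor row $c_d(0)$, is an assumed helper, and the whole layout is ours rather than the paper's:

```python
import numpy as np

def em_iteration(X, A, gmm):
    """One EM iteration. X: (D, T) mixtures; A: (P_A, D, D) filter taps;
    gmm[d] = (pi, mu, var) for source d."""
    P_A, D, _ = A.shape
    T = X.shape[1]
    # recovered sources, Eq. (1), with X(t) = 0 for t < 1
    Y_hat = sum(A[p] @ np.pad(X, ((0, 0), (p, 0)))[:, :T] for p in range(P_A))
    for d in range(D):
        omega = occupancy_probs(Y_hat[d], *gmm[d])            # E-step
        G_d, c_d0 = collect_stats(X, omega, A, d)             # assumed helper
        gmm[d] = update_gmm_params(omega, Y_hat[d])           # M-step: densities
        rows = update_filter_rows(A[:, d, :], G_d, c_d0, T)   # M-step: filter row d
        A[:, d, :] = rows / np.linalg.norm(rows)              # unit-norm row
    return A, gmm
```

Updating the means and variances once more with the normalized rows, as recommended above, would slot in after the normalization line.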

6. Blind deconvolution experiments

6.1. Experimental protocol

The aim of these experiments is to assess the ability of our EM algorithm, together with our initialization schemes, to recover unknown source densities and unknown unmixing filters from the observation of convolutive mixtures. Therefore, all experiments are conducted on artificial data which are generated so as to strictly comply with the CICA modelling assumptions. We generated 10,000 data points simulating two independent sources with mixture-of-Gaussians densities having three components each:

Source 1: $(\pi_{1,1}, \pi_{1,2}, \pi_{1,3}) = (0.2, 0.3, 0.5)$, $(\mu_{1,1}, \mu_{1,2}, \mu_{1,3}) = (-18, -5, 5)$, $(\sigma^2_{1,1}, \sigma^2_{1,2}, \sigma^2_{1,3}) = (1, 1, 1)$.
Source 2: $(\pi_{2,1}, \pi_{2,2}, \pi_{2,3}) = (0.4, 0.1, 0.5)$, $(\mu_{2,1}, \mu_{2,2}, \mu_{2,3}) = (-3, 1, 5)$, $(\sigma^2_{2,1}, \sigma^2_{2,2}, \sigma^2_{2,3}) = (1, 1, 1)$.

The probability density functions of these two sources are plotted in Fig. 1. The source data are mixed with filters of impulse responses $B_{11}$, $B_{12}$, $B_{21}$ and $B_{22}$ plotted, respectively, in Figs. 3-6. The probability density function of the two resulting convolutive mixtures is plotted in Fig. 2. By following the procedure described in the appendix, we computed off-line estimates of the unmixing filters that we wish to estimate blindly with the CICA algorithm. After normalization of the rows, the unmixing filters have the impulse responses

$$A(0) = \begin{pmatrix} 0.600 & -0.432 \\ 0.351 & 0.757 \end{pmatrix}, \qquad A(1) = \begin{pmatrix} -0.300 & 0.173 \\ 0.350 & 0.253 \end{pmatrix},$$

$$A(2) = \begin{pmatrix} 0.180 & -0.518 \\ -0.246 & 0.202 \end{pmatrix}, \qquad A(3) = \begin{pmatrix} -0.060 & 0.173 \\ -0.070 & 0.101 \end{pmatrix},$$

$$A(p) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \quad \text{for } p \geq 4.$$
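A sketch of the source generation (ours; parameter values copied from the table above, and `mix_convolutive` from the earlier sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm_source(pi, mu, var, T=10_000):
    """Draw T i.i.d. samples from a one-dimensional mixture of Gaussians."""
    comp = rng.choice(len(pi), size=T, p=pi)
    return rng.normal(np.asarray(mu, float)[comp], np.sqrt(np.asarray(var, float)[comp]))

Y = np.vstack([
    sample_gmm_source([0.2, 0.3, 0.5], [-18, -5, 5], [1, 1, 1]),   # source 1
    sample_gmm_source([0.4, 0.1, 0.5], [-3, 1, 5], [1, 1, 1]),     # source 2
])
# X = mix_convolutive(Y, B)  # with B the mixing filters of Figs. 3-6
```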

Fig. 1. Actual probability density functions of the two sources: (-) and (-.-).

These values will serve as the reference values against which to evaluate the accuracy of the filters blindly estimated by our CICA algorithm.

6.2. Experiments and results

The experiments are conducted on the convolutive mixtures generated as described in Section 6.1, assuming that the number of sources and the number of Gaussians in each source density are known. We evaluate an upper bound on the length of the unmixing filters by applying the scheme outlined in Section 4.4 to the observed mixtures. Fig. 7 shows the values of the AIC function against the postulated length of the filters. The AIC function is minimal for a filter length of 4, which corresponds to the actual length of the unmixing filters in our experiments. However, we proceed with the experiment with $P_A$ set to the upper bound value 10. We estimate the unmixing filters and the source densities with the algorithm described in Section 3, using the first initialization variant explained in Section 4.2. Fig. 8 shows the log-likelihood of the observations computed with the parameters obtained at each iteration. It converges after about 15 iterations towards the log-likelihood value computed with the actual parameters, which is plotted in the figure with the dotted line. After convergence, the estimated unmixing matrices are

$$\hat A(0) = \begin{pmatrix} 0.602 & -0.429 \\ 0.353 & 0.756 \end{pmatrix}, \qquad \hat A(1) = \begin{pmatrix} -0.299 & 0.174 \\ 0.352 & 0.253 \end{pmatrix},$$

$$\hat A(2) = \begin{pmatrix} 0.179 & -0.519 \\ -0.247 & 0.199 \end{pmatrix}, \qquad \hat A(3) = \begin{pmatrix} -0.062 & 0.169 \\ -0.072 & 0.097 \end{pmatrix},$$

$$\hat A(p) = \begin{pmatrix} 0.00 & 0.00 \\ 0.00 & 0.00 \end{pmatrix} \quad \text{for } p \geq 4.$$

Fig. 2. Probability density functions of the two observed convolutive mixtures: (-) and (-.).

These estimates are very close to the normalized values of the reference unmixing filters that we had estimated off-line by inverting the mixing matrix. Note also that even though the number of taps in the filters was set to 10, the unnecessary taps were properly estimated to equal zero.

Fig. 9 evaluates the amount of distortion between the original and recovered sources and the quality of the separation as a function of the frequency, as suggested in [11]. Fig. 9 was produced with the Matlab code made available at "http://ele.tue.nl/ica99". Plotted in Fig. 9 are:

(i) source 1 and source 2: the spectra of the original sources $Y_d$ ($d = 1, 2$);
(ii) mixture 1 due to source 1 and mixture 2 due to source 2: the spectra of the signal $M_{d,Y_d}$ observed at the dth output of the mixing system $\{B_{d,j}\}_{d,j=1}^{D}$ when only the source $Y_d$ is active, i.e. $M_{d,Y_d}(t) = \sum_{p=0}^{P_B-1} B_{d,d}(p)\,Y_d(t-p)$;
(iii) BSS output 1 due to source 1 and BSS output 2 due to source 2: the spectra of the signal $R_{d,Y_d}$ recovered at the dth output of the estimated unmixing system $\{\hat A_{d,j}\}_{d,j=1}^{D}$ when only the source $Y_d$ is active, i.e. $R_{d,Y_d}(t) = \sum_{j=1}^{D}\sum_{p=0}^{P_A-1} \hat A_{d,j}(p)\,M_{j,Y_d}(t-p)$; these plots show the amount of distortion subsisting between the original sources and the recovered sources (we compensated for any permutation to ensure that the dth output corresponds to the dth source);
(iv) BSS output 1 due to source 2 and BSS output 2 due to source 1: the spectra of the signal $R_{d,Y_j}$ with $j \neq d$, recovered at the dth output of the estimated unmixing system $\{\hat A_{d,j}\}_{d,j=1}^{D}$ when only the source $Y_j$ is active; these plots show, for each recovered source, the amount of leakage coming from the other source that subsists after unmixing.

Fig. 3. Impulse response of the mixing filter $B_{11}$.

As can be seen in Fig. 9, our separation algorithm recovers from the distortion introduced by the mixing system in our simulation: the spectra of $R_{1,Y_1}$ and $R_{2,Y_2}$ match, respectively, the spectra of $Y_1$ and $Y_2$. Besides, the residual cross-source leakage subsisting after separation is negligible compared to the level of the properly recovered signals: the energy in the spectra of $R_{1,Y_2}$ and $R_{2,Y_1}$ is about 50 and 30 dB less, respectively, than the energy in the spectra of $R_{1,Y_1}$ and $R_{2,Y_2}$, for all frequencies.

Fig. 9. Separation and distortion as a function of the frequency. The plots were obtained following the evaluation protocol proposed in [11], and by using the corresponding Matlab code made available at "http://ele.tue.nl/ica99".

Figs. 10 and 11 show the probability density functions corresponding to the Gaussian mixture models estimated by the algorithm for the two recovered sources. As can be seen, the estimated densities perfectly match the actual densities of the original sources.

Fig. 4. Impulse response of the mixing filter $B_{12}$.

Finally, Fig. 12 compares the log-likelihood of the observations computed with the parameters obtained at each iteration by using either the first (Section 4.2) or the second (Section 4.3) initialization variant. Note that whereas it takes about 15 iterations to reach convergence with the first variant, it takes only about 8 iterations with the second variant to reach the same optimum value.

7. Discussion and perspectives

7.1. Theoretical limitation

As pointed out already in Section 2, there are some well-posed BSS problems that the EM framework presented in this paper cannot address. Indeed, when writing Eq. (2) in order to establish the relation between the density of the sources and the density of the observations, we have assumed that the zero-lag mixing and unmixing matrices B(0) and A(0) were full rank. This is not necessarily the case in practice. For example, two sources mixed with the filters of impulse responses $\tilde B_{i,j}(p)$,

$$\tilde B(0) = \begin{pmatrix} 0.3536 & 0.8536 \\ 0.1464 & 0.3536 \end{pmatrix}, \qquad \tilde B(1) = \begin{pmatrix} 0.3536 & -0.1464 \\ -0.8536 & 0.3536 \end{pmatrix},$$

can be recovered with the unmixing filters of impulse responses $\tilde A_{i,j}(p)$,

$$\tilde A(0) = \begin{pmatrix} 0.3536 & -0.8536 \\ -0.1464 & 0.3536 \end{pmatrix}, \qquad \tilde A(1) = \begin{pmatrix} 0.3536 & 0.1464 \\ 0.8536 & 0.3536 \end{pmatrix},$$

even though neither $\tilde B(0)$ nor $\tilde A(0)$ is a full rank matrix. More examples of this kind as well as generic properties of these filters can be found in [10]. In ICA, in order to ensure that the BSS problem is not ill-posed, the mixing matrix is required to be full rank: this guarantees that the observed mixtures are linearly independent and not just scaled copies of each other.² The issue raised here with CICA is somewhat different. The BSS can remain a well-posed problem in cases where the mixing matrix B(0) is not full rank, as in the above example (there exist unmixing filters that allow the recovery of the original sources). The technique presented in this paper does not cover such cases (nor does, for example, the mixed EM/gradient technique presented in [2]).

² This is known as the constraint of spatial diversity in the ICA literature.

Fig. 5. Impulse response of the mixing filter $B_{21}$.

Fig. 6. Impulse response of the mixing filter $B_{22}$.

7.2. Practical limitations

In Section 6, we report on simulations for a low-dimensional artificial example. In some problems of practical interest, as many as a few hundred taps may be required to obtain a proper separation of the convolved sources. We discuss here some of the issues raised by high-dimensional problems: memory requirements, the risk of converging to a local optimum, and computational cost.

Memory-wise, even high-dimensional problems remain tractable with our algorithm: the memory requirement ($O(P_A^2 D^2 I_d)$ scalars for each of the D sources, the dth source being modelled with $I_d$ Gaussians) is moderate, and the algorithm can be implemented on a personal computer with D a few sources, each $I_d$ a few Gaussians and $P_A$ a few hundred taps. Besides, should the memory requirements exceed the capabilities of the computer, the algorithm could still be implemented with a memory-efficient procedure. In that procedure, the filters and the Gaussian parameters are estimated alternately at each iteration. Assuming fixed Gaussian parameters, the second-order sufficient statistics required to reestimate the filter rows consist of the matrices $G_d(r,r')$, so that the statistics $\mathrm{sumXX}(r,r',i,d)$ do not need to be stored. The memory requirements then reduce to $O(P_A^2 D^2)$ scalars for each of the D sources. Once new estimates of the filter rows have been computed, the Gaussian parameters are reestimated by applying a standard EM algorithm for Gaussian mixture estimation on the recovered sources. The inconvenience of the memory-efficient procedure is the increased computational cost, since an EM algorithm with computation of posteriors needs to be conducted twice instead of just once at each iteration.

Fig. 7. Value of the Akaike information criterion against the postulated number of taps in the unmixing filters.

Another issue raised by high-dimensional problems is that the high complexity of the objective function may increase the risk that the EM algorithm converges towards a local optimum. We suggest using an incremental procedure where the complexity of the objective function is increased progressively by repeatedly incrementing the hypothesized number of taps in the unmixing filters. The algorithm is first applied by assuming filters having just a few taps, say $P_{A_1}$, until it gets close to converging. The algorithm is then applied again by assuming filters having $P_{A_2}$ taps, $P_{A_2} > P_{A_1}$ (possibly $P_{A_2} = P_{A_1} + 1$), where the initial filters are initialized with the previously estimated filters for the lags between 0 and $P_{A_1} - 1$ and with zeros for the lags between $P_{A_1}$ and $P_{A_2} - 1$. This procedure can be repeated until the hypothesized number of taps allows a good separation of the sources. The assumption made here is that the global optimum of the objective function for filters having $P_{A_2}$ taps is somehow close to the global optimum of the objective function for filters having $P_{A_1}$ taps, when $P_{A_2}$ and $P_{A_1}$ have close enough values. A possible drawback of this incremental procedure is the increased number of operations, especially when using small increments, as the algorithm is applied for each increment of the number of taps. However, in our preliminary experiments, we observed that by using small increments (say 1 or 2), very few iterations were required to converge towards a stable solution for each newly hypothesized number of taps.

Fig. 8. Log-likelihood function against the number of iterations with the first initialization scheme. The log-likelihood computed with the actual parameters is plotted with the dotted line.

The main bottleneck of our algorithm in solving a high-dimensional BSS problem is its computational cost. With as many as $O(P_A^2 D^4)$ operations to collect all the statistics and to update all the parameters at each iteration, and considering that at least a few tens of iterations may be needed to get close to convergence, our EM technique does not provide a real-time solution when the number of taps exceeds a few tens. Further algorithmic investigations need to be conducted to circumvent this limitation. A possibility would be to reestimate the matrices A(p) in turn: only a few matrices, for certain "well chosen" values of p, could be simultaneously reestimated while all other matrices would be kept fixed. The set of taps {p} for which reestimation would be performed could be selected at each iteration based, for example, on the relative


change of the matrix A(p) during the previous iteration. As it is now, the algorithm presented in this paper is better suited for BSS problems that do not require a real-time solution.
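A sketch of the incremental tap-growing strategy described above (ours; `cica_em` stands for one iteration of the EM algorithm of Section 3 and is assumed here):

```python
import numpy as np

def incremental_taps(X, schedule, gmm, n_iter=10):
    """Grow the filter length along `schedule`, e.g. [4, 6, 8, 10], keeping the
    previously estimated lags and zero-initializing the new ones."""
    D = X.shape[0]
    A = np.zeros((schedule[0], D, D))
    A[0] = np.eye(D)                      # start from the identity at lag 0
    for P_A in schedule:
        pad = np.zeros((P_A - A.shape[0], D, D))
        A = np.concatenate([A, pad])      # new lags initialized with zeros
        for _ in range(n_iter):
            A, gmm = cica_em(X, A, gmm)   # assumed helper
    return A, gmm
```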

8. Conclusion

In this paper, we address the problem of the BSS of convolutive mixtures where the density of the sources is modelled with Gaussian mixtures. By extending the work in [6], we derive an EM algorithm to compute ML estimates of both the source densities and the separating filters, whereas in the state of the art so far, the separating filters are usually estimated with gradient update rules. We show how to estimate a good starting point for the EM algorithm with a scheme based on multichannel autoregressive spectral estimation techniques. The ability of the algorithm to blindly separate convolutive mixtures and to estimate the source densities is demonstrated by experiments on artificial mixtures.

Fig. 10. Estimated probability density function of the first recovered source.

Compared to gradient descent approaches, the scheme presented in this paper is advantageous as it provides a more stable convergence and as it does not require any empirical tuning. However, further investigations are required to circumvent certain of its theoretical and practical limitations. On the theoretical side, the solution proposed in this paper cannot address the cases where the instantaneous mixing matrix (i.e. the zero-lag matrix of the mixing filters) is not full rank. On the practical side, algorithmic strategies need to be further investigated in order to handle the computational cost involved in high-dimensional BSS problems.

Appendix. Selection of filters

In this section, we explain the scheme devised to select the mixing filters in our experiments, so that they comply with the assumptions of the algorithm:

• the mixing filters are invertible, with finite length ($P_A < \infty$) unmixing filters,
• the unmixing matrix A(0) is full rank.

Fig. 11. Estimated probability density function of the second recovered source.

This scheme is inspired by the matricial equation (2), together with its counterpart:

$$\begin{bmatrix}X(1)\\ X(2)\\ \vdots\\ X(T)\end{bmatrix} = \begin{bmatrix} B(0) & 0_D & \cdots & \cdots & 0_D\\ B(1) & B(0) & 0_D & & \vdots\\ \vdots & \ddots & \ddots & \ddots & \\ B(P_B-1) & \cdots & B(1) & B(0) & 0_D\\ 0_D & B(P_B-1) & \cdots & B(1) & B(0) \end{bmatrix} \begin{bmatrix}\hat Y(1)\\ \hat Y(2)\\ \vdots\\ \hat Y(T)\end{bmatrix}. \tag{A.1}$$


Fig. 12. Log-likelihood function against the number of iterations using either the first (-.-) or the second (-+-) initialization scheme.

These equations suggest that the unmixing matrices $\{A(p)\}_{p=0}^{P_A-1}$ can be identified by inverting the $TD \times TD$ mixing matrix in Eq. (A.1), resulting in the $TD \times TD$ unmixing matrix of Eq. (2). Assuming that T is chosen such that $T > P_A$, estimates of the matrices $\{A(p)\}_{p=0}^{P_A-1}$ can be computed by averaging the $D \times D$ blocks along the diagonals of the inverted matrix.
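A direct numerical rendering of this appendix procedure (ours, NumPy assumed), given known mixing taps B:

```python
import numpy as np

def reference_unmixing_filters(B, T, P_A):
    """Build the TD x TD block-Toeplitz mixing matrix of Eq. (A.1), invert it,
    and average the D x D blocks along each block subdiagonal to read off
    estimates of A(0), ..., A(P_A - 1)."""
    P_B, D, _ = B.shape
    M = np.zeros((T * D, T * D))
    for i in range(T):
        for p in range(min(P_B, i + 1)):
            M[i * D:(i + 1) * D, (i - p) * D:(i - p + 1) * D] = B[p]
    Minv = np.linalg.inv(M)
    A = np.empty((P_A, D, D))
    for p in range(P_A):
        blocks = [Minv[i * D:(i + 1) * D, (i - p) * D:(i - p + 1) * D]
                  for i in range(p, T)]
        A[p] = np.mean(blocks, axis=0)
    return A
```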

References

[1] H. Attias, Independent factor analysis, Neural Comput. 11 (1999) 803-851.
[2] H. Attias, C.E. Schreiner, Blind source separation and deconvolution: the dynamic component analysis algorithm, Neural Comput. 10 (1998) 1373-1424.
[3] A.J. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Comput. 7 (6) (1995) 1004-1034.
[4] X.-R. Cao, R.-W. Liu, General approach to blind source separation, IEEE Trans. Signal Process. 44 (3) (1996) 562-571.
[5] J.-F. Cardoso, Blind signal separation: statistical principles, Proc. IEEE 9 (1998) 2009-2025.
[6] S.S. Chen, R.A. Gopinath, Gaussianization, Proceedings of the NIPS Conference, 2000.
[7] P. Comon, Independent component analysis, a new concept? Signal Process. (Special Issue on Higher-order Statistics) 36 (3) (1994) 287-314.


[8] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum-likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. 39 (1) (1977) 1-38.
[9] M.J.F. Gales, Semi-tied covariance matrices for hidden Markov models, IEEE Trans. Speech Audio Process. 7 (1999) 272-281.
[10] R.A. Gopinath, C.S. Burrus, Factorization approach to unitary time-varying filter banks, IEEE Trans. Signal Process. 43 (3) (1995).
[11] D. Schobben, K. Torkkola, P. Smaragdis, Evaluation of blind signal separation methods, Proceedings of the International Workshop on Independent Component Analysis and Blind Signal Separation, 1999.
[12] S. Lawrence Marple, Jr., Digital Spectral Analysis with Applications, Prentice-Hall Signal Processing Series, A.V. Oppenheim (Series Editor), 1987.

Sabine Deligne obtained a master's degree in Telecommunication Engineering in 1993 and a Ph.D. in Signal and Image Processing in 1996 from the Ecole Nationale Superieure des Telecommunications, Paris, France. From 1997 to 1998, she worked in the Interpreting Telecommunication Labs of ATR, Kyoto, Japan. Since October 1998, she has been with the Human Language Technologies Group at IBM T.J. Watson Research Center. Her research interests are in speech and language processing, robust speech recognition techniques and statistical learning.

Ramesh A. Gopinath obtained a B.S. degree from IIT, Madras in 1987 and a Ph.D. in Signal Processing from Rice University, Houston, TX in 1992. Since March 1994, he has been with the Human Language Technologies Group at IBM T.J. Watson Research Center, where he currently manages the Robust Speech Recognition Group. His research interests are in signal processing, speech and speaker recognition, statistical learning and applied mathematics.