Generalized mutual information approach to multichannel blind deconvolution

Generalized mutual information approach to multichannel blind deconvolution

ARTICLE IN PRESS Signal Processing 87 (2007) 2045–2060 www.elsevier.com/locate/sigpro Generalized mutual information approach to multichannel blind ...

278KB Sizes 0 Downloads 46 Views

ARTICLE IN PRESS

Signal Processing 87 (2007) 2045–2060 www.elsevier.com/locate/sigpro

Generalized mutual information approach to multichannel blind deconvolution Dinh-Tuan Pham Laboratoire de Mode´lisation et Calcul, BP 53, 38041 Grenoble, Cedex, France Received 13 July 2005; received in revised form 27 October 2006; accepted 5 February 2007 Available online 1 March 2007

Abstract In this paper, we propose an approach to the multichannel blind deconvolution problem, based on the mutual information criterion and more generally on an appropriate system of estimating equations. Formulas for the quasiNewton algorithm and the asymptotic covariance matrix of the estimator are provided. More interesting results have been obtained in the pure deconvolution case. By a clever parameterization of the deconvolution filter, the estimated parameters are asymptotically independent and explicit and simple formula for their variance are obtained. The quasi-Newton algorithm also becomes particularly simple. Simulation results showing the good performance of the methods are provided. r 2007 Elsevier B.V. All rights reserved. Keywords: Blind source separation; Convolutive mixture; Mutual information; Entropy; Estimating equation; Deconvolution; Contrast

1. Introduction Blind source separation aims at separating sources from their mixtures, without relying on any specific knowledge other than their independence [1]. In the case of convolutive mixtures, however, this assumption alone can only permit to separate the sources up to a convolution, so that further assumptions may be introduced to eliminate this ambiguity. A common assumption is the temporal independence of the source signals, in this case the problem may be called multichannel blind deconvolution, since it reduces to the well-known Tel.: +33 4 76 51 44 23; fax: +33 4 76 63 12 63.

E-mail address: [email protected]. URL: http://www-lmc.imag.fr/lmc-sms/Dinh-Tuan.Pham/ (D.-T. Pham).

blind deconvolution problem when there are only one source and one sensor. Specifically, K observed sequences fX 1 ðtÞg; . . . ; fX K ðtÞg are related to K source sequences fS 1 ðtÞg; . . . ; fS K ðtÞg (we assume here that there are the same number of sources as sensors) via linear convolutions XðtÞ ¼

1 X

AðuÞSðt  uÞ,

(1)

u¼1

where XðtÞ ¼ ½X 1 ðtÞ    X K ðtÞT and SðtÞ ¼ ½S 1 ðtÞ    S K ðtÞT , T denoting the transposition, and fAðuÞg is a sequence of K  K of matrices. Sources and observations are assumed to be real. A model with noise may be considered, however, this work is specifically designed for the noiseless case and may not perform well in the presence of noise. The goal is to recover the sources from the observations,

0165-1684/$ - see front matter r 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2007.02.002

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2046

using only the available informations: the independence of the source sequences (blind separation) and the temporal independence of these sequences (blind deconvolution). Naturally, the recovered source sequences fY k ðtÞg are obtained through a deconvolution matrix filter YðtÞ ¼

1 X

BðuÞXðt  uÞ,

(2)

u¼1

where YðtÞ ¼ ½Y 1 ðtÞ    Y K ðtÞT and fBðuÞg is a sequence of matrices to be determined. This is usually done by minimizing a criterion or solving a system of estimating equations. There has been several proposals for criteria for blind separation of convolutive mixtures and multichannel blind deconvolution. Earlier criteria are mostly based on cross cumulants between the outputs Y k ðtÞ [2–6]. More recently, frequency domain approach has attracted interest since the convolution becomes multiplication in the frequency domain via the Fourier transform. This gives rise to criteria based on (cross-) polyspectra [7–10]. However, criterion based on the mutual information has not received much attention. Such a criterion has been introduced in [11] but no algorithm for minimizing it has been investigated and no study on the asymptotic distribution of the estimator has been made. The paper [12] minimizes a cost function which looks similar to the mutual information criterion but is not quite the same, as the determinantal term is different. It should be noted that in the unichannel ðK ¼ 1Þ case, an algorithm based on the mutual information criterion has been introduced in [13]. The same criterion also has implicitly appeared earlier in [14] but in a simplified form as the entropy of the output, since the author prewhitens the signal so that the deconvolution filter transfer function would have unit amplitude. This form is similar to the infomax criterion, which was also proposed for the blind deconvolution [15]. This paper considers the mutual information approach to multichannel deconvolution. The multichannel case is much harder than the unichannel case since it involves both separation and deconvolution. We actually consider a more general class of estimators, which minimize some criterion generalizing the mutual information or are simply solutions of some system of estimating equations. Such generalization could reduce the computational cost, since the mutual information involves the

unknown densities of the recovered sources, which must be estimated. However, the use of the mutual information (actually an estimated empirical version) should lead to an efficient (optimal) estimator as this criterion is closely related to the log likelihood [1]. This will be proved in the unichannel case. The same result might be proved in the multichannel case but due to the complexity of the calculations, we have not done so. This work is focused on the theoretical aspects of the problem. In particular, we derive formulas for the asymptotic covariance matrix of the estimator and for implementing the quasi-Newton algorithm for its computation. In the unichannel case, one can obtain a very simple quasi-Newton algorithm by working with the logarithm of the deconvolution filter and separating its real and imaginary parts. No such simplification is possible in the multichannel case and the quasi-Newton algorithm, as it stands, may be too complex to be useful. Section 2 introduces the theoretical mutual information criterion and derives its gradient and Hessian. The next section considers a general estimation method, which includes the mutual information approach. Sections 5 and 4 provide formulas for the quasi-Newton algorithm and the asymptotic covariance matrix of the estimator. Section 6 specializes the results on the unichannel case and Section 7 provides some simulation examples. For ease of reading proofs of results are relegated to an Appendix. The following notations will be used throughout:

 

I denotes the identity matrix, 0 denotes the null matrix (or vector). For a matrix sequence fMðuÞg, fMy ðuÞg denotes its inverse (in the convolutive sense), defined by the condition def

ðMy %MÞðuÞ ¼

1 X

My ðvÞMðu  vÞ

v¼1

(

I;

u ¼ 0;

0

otherwise:

¼





For a matrix sequence fMðuÞg, its (discrete time) Fourier transform is denoted as MðoÞ ¼ P 1 iou . The same symbol M has been u¼1 MðuÞe reused for simplicity but the variable o (Greek letter) in contrast with u (Roman letter) helps to avoid the confusion. trðMÞ denotes the trace of the matrix M.

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

 

diag v denotes the diagonal matrix with diagonal elements the components of the vector v. for a function g (of a real variable), g0 denotes its derivative.

Both sequences fAðuÞg in (1) and fBðuÞg in (2) will be assumed to admit an inverse. Further, since one cannot estimate the whole sequence fBðuÞg without any restriction from a finite length observation record, we shall assume that this sequence is parameterized by some finite dimensional vector parameter y. The dimension of y can be very large so that this is not a great restriction. For example, one may restrict the sequence fBðuÞg to have some given finite support and take y to be composed of the elements of the matrices BðuÞ for those u for which BðuÞ is not zero. 2. The mutual information criterion, its gradient and Hessian This section derives the (theoretical) mutual information criterion and computes its gradient and Hessian. This will be used as introduction to the empirical criterion and the estimation equations which actually define the estimator. The Hessian also plays a role in the development of the quasiNewton algorithm and the asymptotic distribution of the estimator. The mutual information is a well-known popular measure of dependence which has been used as criterion for the blind separation of instantaneous mixtures of sources [16] and is closely related to the maximum likelihood principle [1]. Here, the objective is to make the recovered sources Y k ðtÞ, for different k and t, as independent as it is possible. Therefore, we shall adopt as criterion the average mutual information lim

L!1

1 I½Y k ð1Þ; . . . ; Y k ðLÞ; k ¼ 1; . . . ; K, L

where Ið  Þ denotes the mutual information between the indicated random variables. It is well known that the mutual information can be expressed in terms of entropy [17]: IðZ 1 ; . . . ; Z n Þ ¼ Pn j¼1 HðZ j Þ  HðZ 1 ; . . . ; Z n Þ where Z 1 ; . . . ; Z n are random variables (or vectors) and HðÞ denotes the entropy (or joint entropy when appropriate).1 Note that by stationarity, H½Y k ðtÞ does not depend on t 1

The entropy ofRa random vector (or variable) Z with density pZ is defined as  pZ ðzÞ log pZ ðzÞ dz, the joint entropy of several

2047

and hence will be denoted by HðY k Þ. Further, the limit of H½Yð1Þ; . . . ; YðLÞ=L as L ! 1 is no other than the entropy rate of the process fYðtÞg [17], which we denote by H½YðÞ, and from (2) one has [11] Z 2p do , H½YðÞ ¼ H½XðÞ þ log det BðoÞ 2p 0 noting that the last integral is real since Bð2p  oÞ is the complex conjugate of BðoÞ, and the real part of log det BðoÞ is log j det BðoÞj. From the above results, the average mutual information criterion equals, up to an additive constant Z 2p K X do CðBÞ ¼ . (3) HðY k Þ  log det BðoÞ 2p 0 k¼1 We now compute the gradient and Hessian of CðBÞ with respect to the parameter y specifying fBðuÞg. The Hessian is, however, quite complex, so for its computation we shall limit ourselves to the case where y is the ‘‘true’’ parameter, that is assumption (S) is satisfied. S. The fY k ðtÞg defined in (2) are independent sequences of independent random variables. Let dy be a small increment of y and dB the associated increment of B, we shall compute the expansion of CðB þ dBÞ up to second order in dy. As dB will induce an increment dY k ðtÞ ¼

K 1 X X

dBkj ðuÞX j ðt  uÞ

(4)

j¼1 u¼1

of Y k ðtÞ, (dBkj ðuÞ denoting the general element of the matrix dBðuÞ), we will need to consider the expansion of HðY k þ dY k Þ up to second order in dy. To this end, we shall make use of a result in [18]: ‘‘Let Y be a random variable and dY is a small random increment, then under appropriate assumptions HðY þ dY Þ  HðY Þ þ E½dY cY ðY Þ þ 12 Ef½dY  EðdY jY Þ2 c0Y ðY Þ  ½EðdY jY Þ0 2 g

ð5Þ 0

up to second order in dY , where cY ¼ ð log pY Þ is the score function of Y, pY denoting the density of Y, and EðÞ and EðjY Þ denote the expectation and conditional expectation given Y.’’ (footnote continued) random vectors (or variables) is the entropy of the vector obtained by stacking their components.

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2048

Lemma 1. Let c½YðtÞ ¼ ½c1 ðY 1 ðtÞÞ . . . cK ðY K ðtÞÞT where ck is the score function of Y k ðtÞ (which does not depend on t by stationarity) and denote f YcðYÞ ðoÞ ¼

1 X

EfYðuÞcT ½Yð0Þgeiuo

(6)

u¼1

the cross spectral density between the processes fYðtÞg and fc½YðtÞg. Then for dY k ðtÞ given P by (4), the first order term in the expansion of K k¼1 HðY k þ dY k Þ with respect to dy equals Z 2p  do 1 tr dBðoÞBðoÞ f YcðYÞ ðoÞ , 2p 0 and under assumption (S), the second order term equals ( K K X 1X E½c0k ðY k Þ EðY 2j Þ 2 k¼1 j¼1 Z 2p do  j½dBðoÞBðoÞ1 kj j2 2p 0 ) 1 0 2  fE½ck ðY k ÞEðY 2k Þ þ 1gðdBB1 Þkk , 2 R 2p where dBB1 ¼ 0 dBðoÞBðoÞ1 do=ð2pÞ and we have dropped the index t in Efc0k ½Y k ðtÞg and E½Y 2j ðtÞ as these expectations do not depend on it, by stationarity. Since the variation of log det BðoÞ induced by dB equals trfdBðoÞBðoÞ1  ½dBðoÞBðoÞ1 2 =2g up to second order in dy, the above lemma shows that the first order term in the expansion of CðB þ dBÞ with respect to dy equals Z 2p do trfdBðoÞBðoÞ1 ½f YcðYÞ ðoÞ  Ig 2p 0 and the second order term, under assumption (S), equals K Z 2p 1X fE½c0k ðY k ÞEðY 2j Þj½dBðoÞBðoÞ1 kj j2 2 j;k¼1 0 þ ½dBðoÞBðoÞ1 kj ½dBðoÞBðoÞ1 jk g 

do 2p

K 1X fE½c0k ðY k ÞEðY 2k Þ þ 1gðdBB1 Þ2kk . 2 k¼1

The above results are nonparametric in some sense as they do not involve explicitly the parameterization of B; this parameterization only serves to

provide a rigorous meaning for the terms ‘‘first and second order’’. In a parametric setup, the above results yield immediately that the gradient of CðBÞ is  Z 2p  qC qBðoÞ do ¼ tr BðoÞ1 ½f YcðYÞ ðoÞ  I qym qy 2p m 0 (7) and its Hessian, under assumption (S), is ( K Z 2p X q2 C ¼ EðY 2j ÞE½c0k ðY k Þ qym qyn j;k¼1 0     qBðoÞ qBðoÞ 1 1  BðoÞ BðoÞ qym qyn kj kj     ) qBðoÞ qBðoÞ þ BðoÞ1 BðoÞ1 qym qyn jk kj K do X  fEðY 2k ÞE½c0k ðY k Þ þ 1g 2p k¼1     qB 1 qB 1 B B ,  qym kk kk qyn



ð8Þ

where qB 1 B ¼ qym

Z

2p 0

qBðoÞ do . BðoÞ1 qym 2p

(9)

3. The estimation methods We shall consider more general methods than simply minimizing an estimation of the mutual information criterion (3). 3.1. Theoretical criterion and estimating equations We shall generalize the criterion (3) by replacing the entropy HðY k Þ by log QðY k Þ, where Q is a class II strictly superadditive functional, as it has been shown in [19] that such generalized criterion remains a contrast, in the sense that it can be minimized if and only if fBðuÞg is such that assumption (S) is satisfied. Recall that [20] a functional QðY Þ (of the distribution of Y) is said to be of class II if it is scale equivariant2: QðaY Þ ¼ jajQðY Þ for any real number a, and is said to be strictly superadditive if Q2 ðU þ V ÞXQ2 ðUÞ þ Q2 ðV Þ

(10)

2 The definition in [20] requires also that Q be translation invariant, but since we work with zero mean random variables, we drop this requirement.

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

for any pair of independent random variables U; V , with equality attained only when U and V are Gaussian. The superadditivity property is difficult to verify as (10) must be satisfied for any pair of independent random variables, but actually only random variables which are linear combinations of sources (at different times) need to be considered. Therefore, we shall drop this requirement. Of course, minimizing the criterion then might not yield the sequence fBðuÞg for which assumption (S) is satisfied. However, in practice one often cannot find the global minimum but only a local minimum, even only a simple stationary point, of a criterion, and it will be seen below that the above sequence is a stationary point of (3) whenever Q ¼ eH is a class II functional. Assume that log Q admits a ‘‘coordinate-free derivative’’ cY at Y, defined by 1 log½QðY þ ZÞ=QðY Þ ¼ E½ZcY ðY Þ,  8 random variables Z.

lim

!0

ð11Þ

Then the gradient calculation in Section 2 and the proof of Lemma 1 extends straightforwardly to the generalized criterion, by redefining ck to be the function cY k defined above (which may no longer be the score function of Y k ). Setting this gradient to zero, one gets the theoretical estimating equations  Z 2p  qBðoÞ do 1 ¼ 0. (12) tr BðoÞ ½f YcðYÞ ðoÞ  I qy 2p m 0 If Q is of class II, these equations are satisfied as soon as the sequence fBðuÞg is such that assumption (S) is satisfied. Indeed, this assumption entails that the spectral density f YcðYÞ is a constant diagonal matrix with diagonal elements equal to E½Y k ck ðY k Þ; k ¼ 1; . . . ; K, and since QðY þ Y Þ ¼ j1 þ jQðY Þ, one gets, by taking the logarithm and letting  ! 0, that E½Y cY ðY Þ ¼ 1 (which implies E½YcTY ðYÞ ¼ I since the components of Y are assumed to be independent). The above arguments rely only on the fact that E½Y k ck ðY k Þ ¼ 1; 1pkpK. Thus, we may generalize the above estimating equations by taking ck to be arbitrary given functions satisfying only the above equality.3 Such equations are valid estimating equations, as a solution of them is the sequence fBðuÞg for which assumption (S) is satisfied. The use 3

ck should depend (mildly) on the distribution of Y k , such that E½ck ðY k ÞY k  ¼ 1.

2049

of such estimating equations is thus more flexible than that of a criterion as these equations need not arise from equating to zero the gradient of a criterion. The downsize is that solving for such equations can lead to a spurious solution. By using a contrast, one can monitor its decrease in the iterative algorithm to ensure that a local minimum is attained, hence reduce greatly the risk of getting a spurious solution. Further, as it will be seen later (Lemma 5), optimal performance is achieved for ck being the score function of Y k , that is when the information criterion is used. 3.2. Empirical criterion and estimating equations In practice, one would minimize an empirical version of (3) with H ¼ log Q: Z 2p K X do ^ ^ . (13) CðBÞ ¼ HðY k Þ  log det BðoÞ 2p 0 k¼1 ^ k Þ is an estimator of HðY k Þ ¼ log QðY k Þ. where HðY For computational purposes, the last integral may be replaced by a Riemann sum. ^ k Þ should depend Naturally, the estimator HðY only on the sequence fY k ðtÞg defined in (2). But this definition involves in general the entire sequence fXðtÞg while only a finite length record, Xð0Þ; . . . ; XðT  1Þ say, is available. To solve this problem, we extend the observed sequence Xð0Þ; . . . ; XðT  1Þ periodically, and thus redefine YðtÞ ¼

1 X

BðuÞXðt  u mod TÞ.

(14)

u¼1

Other approaches are possible but the above is most convenient as it is well adapted to the use of the discrete Fourier transform (DFT). However, in the case where the deconvolution filters have finite impulse response: BðuÞ ¼ 0 for uo  m1 or u4m2 , say, then it is preferable to work only with those YðtÞ which can be computed from the observations according to (2). In this case, which we will refer to as the FIR case, we assume for simplicity that Xðm2 Þ; . . . ; XðT þ m1  1Þ are observed so that YðtÞ can be computed according to (2) for t ¼ 0; . . . ; T  1. Thus in all cases, the estimator ^ k Þ will be based on Y k ð0Þ; . . . ; Y k ðT  1Þ. HðY ^ Let us compute the gradient of CðBÞ. Define the ^ function ck by ^ ^ ½Y k ðtÞ ¼ T qHðY k Þ . c k qY k ðtÞ

(15)

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2050

Strictly speaking, (15) defines c^ k only at the data points, but it is often the case that this definition can be naturally extended to all other points, and in any case the values of c^ k at the data points are all that will be needed. With this definition, one has the following lemma. Lemma 2. In the case where YðtÞ is computed as (14): ( 1 X qBðuÞ 1 TX ^ qCðBÞ ¼ tr qym qym T t¼0 u ) qB T Xðt  u mod TÞc^ ½YðtÞ  B1 , qym ð16Þ where

^ ½Y 1 ðtÞ    c ^ ½Y K ðtÞ ^ T ½YðtÞ ¼ ½c c 1 K

and

1

qB=qym B is defined in (9). If the integral in (13) has been replaced by a Riemann sum, the integral in (9) should be replaced by an analogous Riemann sum. If further this Riemann sum is based on the points 2pn=T; n ¼ 0; . . . ; T  1, one also has (   T 1 ^ qCðBÞ 1X qBð2pn=TÞ 2pn 1 ¼ tr B qym T n¼0 qym T    ) 2pn  f^ YcðYÞ I , ð17Þ ^ T where       2pn 1 2pn T T n ^f ^ 2p ¼ dY dcðYÞ ^ YcðYÞ T T T T

(18)

^ is the cross periodogram between fYðtÞg and fc½YðtÞg, PT1 i2pnt=T being the DFT of dY ð2pn=TÞ ¼ t¼0 YðtÞe the sequence YðtÞ; . . . ; YðT  1Þ and d cðYÞ being ^ defined similarly. In the FIR case where YðtÞ is computed as (2), formula (16) still holds but without ‘‘mod T ’’. Formula (16) is well adapted to the FIR case since the summation is finite and there is no modulo T operation. Formula (17) is similar to (7) and avoids infinite summation in the general case. Setting the right-hand side of (16) or (17) to zero yields the estimating equations, to be solved for the estimated parameter. These equations involve ^ defined in (15), therefore, we only the functions c k ^ by may again generalize them by replacing c k

arbitrary given functions, subjected only to the condition 1 1 TX ^ ½Y k ðtÞ ¼ 1; Y k ðtÞc k T t¼0

1pkpK,

(19)

and converging as T ! 1 to some limits ck (which would then satisfy E½Y k ck ðY k Þ ¼ 1). Such equations are still valid estimating equations since they converge to the (generalized) estimating equations (12). (To see that the right-hand side of (16) converges to the left-hand side R 2p of (12), note that EfXðuÞcT ½Yð0Þg ¼ 0 BðoÞ1 f YcðYÞ ðoÞ eiuo do, as is shown in the proof of Lemma 1.) ^ defined by (15) Condition (19) is satisfied for c k ^ ^ ^ with H ¼ log Q and Q equivariant with respect to scale change. Indeed, it results from taking the logarithmic derivative of both sides of the ^ ^ equality Q½aY k ð0Þ; . . . ; aY k ðT  1Þ ¼ jajQ½Y k ð0Þ; . . . ; Y k ðT  1Þ with respect to a, then setting a ¼ 1. 3.3. Some particular cases In the mutual information approach, log QðY Þ is the entropy of Y. Then its ‘‘coordinate-free’’ derivative cY defined in (11) is no other than the score function of Y [11,18]. The entropy can be estimated by the method in [21]. This method is based on replacing the integral in the definition of entropy by a discrete Riemann sum and replacing the unknown density by a kernel estimator. Such estimator can be computed quickly over a regular grid by the ‘‘binning’’ technique. This technique is somewhat similar to the histogram but much more sophisticated and does not suffer from the ‘‘discontinuity’’ problem (see [21] for details). The function c^ k defined in (15) is then the score estimator of Y k introduced in [21]. This paper also provides a means to compute it quickly. A simple example of type II functional is QðY Þ ¼ ½EðjY ja Þ1=a ; a40. For a ¼ 4, it is superadditive if one restricts oneself to the set of zero means subGaussian random variables [19]. The ‘‘coordinatefree derivative’’ cY of log Q at Y is then given by cY ðyÞ ¼ signðyÞjyja1 =EðjY ja Þ. The natural estimate ^ Þ of log QðY Þ is obtained by replacing EðjY ja Þ HðY in its formula by the corresponding sample average. The function c^ k defined in (15) is then simply ^ ðyÞ ¼ signðyÞjyja1 =½T 1 PT1 jY k ðtÞja . c k t¼0 For the generalized estimating equations, a simple choice for c^ k could be a fixed function

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2051

normalized so that (19) holds, more precisely: ^ ðyÞ ¼ cðyÞ=fT 1 PT1 c½Y k ðtÞY k ðtÞg, where c is c k t¼0 some nonlinear function.

Lemma 3. Under the assumption (S) and the condition E½Y k ck ðY k Þ ¼ 1, the trace of (20) equals the right-hand side of (8).

4. The quasi-Newton algorithm

The above lemma shows that one can use the right-hand side of (8) for the elements of the matrix KðyÞ in the quasi-Newton iteration. In practice, one still needs to replace the unknown quantities EðY 2j Þ and E½c0k ðY k Þ in this formula by some estimates, ^ 0 ½Y k ðtÞ. such as the sample averages of Y 2j ðtÞ and c k

One may solve the estimating equations by the quasi-Newton algorithm. Let lðyÞ ¼ 0 be the (vector) equation to be solved, this algorithm computes the solution through the iteration yðnÞ 7!yðnþ1Þ , the solution of the equation KðyðnÞ Þðyðnþ1Þ  yðnÞ Þ ¼ lðyðnÞ Þ,

5. Asymptotic performance

where KðyÞ is some approximation of the Jacobian ql=qy of the map y7!lðyÞ. In the mutual information approach, lðyÞ ¼ ^ ^ qCðBÞ=qy where CðBÞ is the empirical criterion ^ (13) with H being the entropy estimator, one may expect that ql=qy can be approximated by the formula (8) for q2 C=qy2 . We shall show that this result still holds in the ‘‘estimating equations’’ approach in which lðyÞ has components given by the right-hand side of (16) or (17) with arbitrary c^ k satisfying (19) and converging to some limit ck . We will, however, use (16) since it is valid also in the FIR case, unlike (17). To construct an approximation to ql=qy, we replace the sample average by the ^ by c , which is justified for large expectation and c k k T. Then the derivative of (16) with respect to yn is approximated, in the FIR case where YðtÞ is computed by (2),by the trace of  X q qBðuÞ T EfXðuÞc ½Yð0Þg qyn qym u   Z 2p q qBðoÞ 1 do . ð20Þ  BðoÞ qyn qym 2p 0

In this section, y denotes the true value of the parameter and y^ denotes the estimator obtained by ^ ¼ 0, where lðÞ solving the estimating equation lðyÞ is defined in previous section. The asymptotic distribution of y^  y may be obtained in a standard way, through the first order expansion

In the general case where YðtÞ is computed by (14), one gets the same result except that the term EfXðuÞcT ½Yð0Þg is replaced by T 1X EfXðt  u mod TÞcT ½YðtÞg. T t¼0

However, since BðuÞ and its first and second derivatives should converge to zero quickly as u ! 1, one may restrict oneself to indexes u in fa; . . . ; ag, for some (large) a, and approximate YðtÞ; t 2 ½a; T  a by (2). Then for u 2 fa; . . . ; ag and Tba, the above expression can be approximated by EfXðuÞcT ½Yð0Þg with Yð0Þ defined by (2). Thus, the derivative of (16) with respect to yn can still be approximated for large T by the trace of (20).

^  lðyÞ þ 0 ¼ lðyÞ

ql ^ ðy  yÞ. qy

In the previous section, we have shown that the Jacobian ql=qy can be approximated for large T and under assumption (S) by the matrix with general element the right-hand side of (8), which we now denote by K. Since assumption (S) is satisfied when y is the true parameter, one has y^  y  K1 lðyÞ, where means having the same asymptotic distribution. Therefore, we need only to derive the asymptotic distribution of lðyÞ. Lemma 4. Let lðyÞ be the vector defined by the right^ satisfying (19) and hand side of (16) with arbitrary c k depending on the data only through a finite dimen^ ðyÞ ¼ Ck ðy; l^ k Þ where Ck ð; Þ are sional vector lk : c k fixed continuously differentiable functions and l^ k depends on Y k ð0Þ; . . . ; Y k ðT  1Þ and tends to a limit lk as T ! 1. Then lðyÞ is asymptotically Gaussian with zero means and covariance matrix G=T, with G having general element ( K Z 2p X G mn ¼ EðY 2j ÞE½c2k ðY k Þ j;k¼1

0

    qBðoÞ qBðoÞ 1 1  BðoÞ BðoÞ qym qyn kj kj     ) qBðoÞ qBðoÞ do þ BðoÞ1 BðoÞ1 qym qyn 2p jk kj

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2052



k¼1

 

  q log BðoÞ q log B   R qyn qyn  q log BðoÞ q log BðoÞ do . þðk  1ÞI I qym qyn 2p

K X fEðY 2k ÞE½c2k ðY k Þ þ 1g

qB 1 B qym

  kk

qB 1 B qyn

 , kk

where ck ðyÞ ¼ Ck ðy; lk Þ and qB=qym B1 is defined in (9). The above lemma shows that the estimator y^ is asymptotically Gaussian with mean y and covariance matrix K1 GK1 =T. Note that G mn has the same form as K mn (given by the right-hand side of (8)) but with c0 replaced by c2 . 6. The pure deconvolution case The unichannel case ðK ¼ 1Þ is of special interest as it corresponds to the pure deconvolution problem and moreover it permits considerable simplifications. In this case, XðtÞ, YðtÞ and BðuÞ are all scalars and will be denoted as X ðtÞ, Y ðtÞ and BðuÞ and ^ . we will drop the index k in ck and c k The expression (17) for the gradient in this case reduces to T 1 ^ qCðBÞ 1X q log Bð2pn=TÞ ¼ qym T n¼0 qym     2pn ^  f Y cðY 1 , ð21Þ ^ Þ T where f^Y cðY ^ Þ ð2pn=TÞ is defined similarly to (18), except that it is now a scalar. The general element of the approximate Jacobian for the quasi-Newton algorithm can also be written more compactly, putting k ¼ EðY 2 ÞE½c0 ðY Þ,  Z 2p  q log BðoÞ q log BðoÞ K mn ¼ k þ qym qym 0 q log BðoÞ do q log B q log B  ðk þ 1Þ  . qyn 2p qym qyn The above formulas show that it is of interest to parameterize log BðoÞ instead of BðoÞ. It is even more interesting to parameterize separately its real and imaginary parts. Indeed, noting that jzj2 ¼ ðRzÞ2 þ ðIzÞ2 and z2 ¼ ðRzÞ2  ðIzÞ2 where R and I denote the real and imaginary parts, and that Bð2p  oÞ is the complex conjugated of BðoÞ so that q log B=qym , q log B=qym and the right-hand side of the last equality are real, one gets   Z 2p  q log BðoÞ q log B K mn ¼ ðk þ 1Þ R  qym qym 0

Thus, the matrix K becomes block diagonal when the real and imaginary parts of log BðoÞ are parameterized by two different sets of parameters. By a similar calculation, the elements of the matrix G can be rewritten as   Z 2p  q log BðoÞ q log B 2 G mn ¼ ðr þ 1Þ R  qym qym 0   q log BðoÞ q log B  R  qyn qyn  q log BðoÞ q log BðoÞ do , þðr2  1ÞI I qym qyn 2p qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi where r ¼ 1= EðY 2 ÞE½c2 ðY Þ is the correlation between Y and cðY Þ (since E½Y cðY Þ ¼ 1). Thus the estimators of the vector parameters yR and yI specifying the real and imaginary parts of log BðoÞ, are asymptotically independent with covariance matrix, respectively, 1 r2 þ 1 1 J T ðk þ 1Þ2 R

and

1 r2  1 1 J , T ðk  1Þ2 I

where # q log BðoÞ q log B ¼ R  qyR qyR 0 m m " # q log BðoÞ q log B do  R  2p qyR qyR n n Z

J R;mn

2p

"

and Z

2p

J I;mn ¼

I 0

q log BðoÞ q log BðoÞ do . I 2p qyIm qyIn

The following lemma shows that optimal estimator is obtained if and only if c^ is a score estimator. Lemma 5. Let cY be the score function of Y and qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi r ¼ 1= EðY 2 ÞE½c2Y ðY Þ be the correlation between Y and cY ðY Þ. Then r2 þ 1 1 X 2 2 r þ 1 ðk þ 1Þ

and

r2  1 1 , X 2 2 r  1 ðk  1Þ

with equality if and only if c ¼ cY , in which case k ¼ r2

(hence K ¼ G).

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

The above result shows that it is much harder to estimate the phase of the deconvolution filter (the imaginary part of log B), than its gain (the exponential of the real part of log B). In particular, if the signal is Gaussian, r ¼ 1 so that the asymptotic covariance matrix of the estimator specifying the imaginary part of log B is infinite, which is consistent with the well-known fact that the phase of the deconvolution filter is ambiguous in this case. A simple interesting parameterization for log BðoÞ is log BðoÞ ¼ y0 þ

m X ½ym cosðmoÞ þ iymþm sinðmoÞ. m¼1

(22) The parameters y0 ; . . . ; ym and ymþ1 ; . . . ; y2m specify the real and imaginary parts of log BðoÞ, respectively. Note that log BðoÞ is the complex conjugate of log BðoÞ and thus its real part is an even function and its imaginary part is an odd function, hence the use of the cosine and sine functions to represent them. The parameter y0 corresponds to the scale of the sources and cannot be estimated. We may (and will) put it to 0. As for the y1 ; . . . ; y2m , their asymptotic distribution is quite simple. The matrices J R and J I corresponding to y1 ; . . . ; ym and ymþ1 ; . . . ; y2m can be seen to be simply one half of the identity matrix. Thus the estimators ½y^ m y^ mþm T , m ¼ 1; . . . ; m are asymptotic independent Gaussian with covariance matrix: " # 2 0 2 ðr2 þ 1Þ=ðk þ 1Þ . (23) T 0 ðr2  1Þ=ðk  1Þ2 Further, expression (21) for the gradient reduces to ðrm þ rm Þ=2, for m ¼ 1; . . . ; m and rm  rm for m ¼ m þ 1; . . . ; 2m, where   1 2p TX 2pn rm ¼ ei2pnm=T f^Y cðY ^ Þ T n¼0 T ¼

T 1 1X

T

Y ðt þ mÞc½Y ðtÞ

ð24Þ

t¼0

are the circular cross covariance between fY ðtÞg and fc½Y ðtÞg. Thus, the quasi-Newton algorithm reduces to the fixed point iteration ( ym 7!ym 

ðrm þ rm Þ=ðk^ þ 1Þ; ðrmm  rmm Þ=ðk^  1Þ;

m ¼ 1; . . . ; m; m ¼ m þ 1; . . . ; 2m;

2053

k^ denoting the current estimate for k. Alternatively, one can work with the parameters Wm ¼ ðym þ ymþm Þ=2 and Wm ¼ ðym  ymþm Þ=2, P m ¼ 1; . . . ; m, so that one has log BðoÞ ¼ 0ojmjpm Wm eimo . The quasi-Newton algorithm for these parameters is then Wm 7!Wm 

^ m  rm kr ; k^ 2  1

0ojmjpm.

(25)

Since d Y ð2pn=TÞ ¼ Bð2pn=TÞd X ð2pn=TÞ, this yields the updating equation     2pn 2pn dY 7!d Y T T " # X kr ^ m  rm i2pnm=T  exp  e . k^ 2  1 0ojmjpm ð26Þ ^ Note that R the criterion (13) reduces to simply HðY Þ, since log BðoÞ do=ð2pÞ ¼ y0 ¼ 0. This criterion can be computed to ensure that it is decreased at each iteration, otherwise the Newton step size could be reduced so that it is so. The estimation algorithm (for the Wm ) is summarized in Table 1. An interesting consequence of the separation of log B into its real and imaginary parts is that this real part is directly linked to the spectral density f X of the observed sequence fX ðtÞg, since jBðoÞj2 must be inversely proportional to f X ðoÞ. Therefore, if R log BðoÞ is parameterized as hðo; yR Þ, one may apply any spectral estimation method under the parametric model f X ðoÞ / exp½2h ðo; yR Þ to obtain an estimate of yR . In the case where log BðoÞ is parameterized as (22), a simple method for estimating y1 ; . . . ; ym is to minimize T 1 X n¼0

"

    #2 2pn 2pn , y0 þ ym cos þ log d X T T m¼1 m X

since jd X ð2pn=TÞj2 =T is a raw estimate of f X ð2pn=TÞ. The resulting estimators are suboptimal but can be used to initialize the algorithm, and in our limited experience, it can vastly improve the convergence. There is, however, no simple method to initialize ymþ1 ; . . . ; y2m . One may simply initialize them by zero, which amounts to forcing the initial deconvolution filter to have zero phase.

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2054

Table 1 P Summary of the algorithm for the parameterization BðoÞ ¼ exp½ 0ojmjpm Wm eimo ] Initialization: Compute the initial source estimate Y ð0Þ; . . . ; Y ðT  1Þ and its DFT d Y ð2pn=TÞ, n ¼ 0; . . . ; T  1, using the initial ^ Þ. deconvolution filter. Optionally, compute the criterion HðY Iteration: ^ ðtÞ, t ¼ 0; . . . ; T  1. The function c^ is part of the specification of the algorithm, and can be constructed as in (1) Compute c½Y Sections 3.2 and 3.3. ^ ðtÞg according to either one of the right-hand sides of (2) Compute the circular correlations rm , 0ompm, between fY ðtÞg and fc½Y (24). (3) Compute the estimate k^ of k ¼ EðY 2 ÞE½c0 ðY Þ by replacing the expectation by sample average. Then update Wm ; 0ojmjpm and d Y ð2pn=TÞ, n ¼ 0; . . . ; T  1, by (25) and (26). (4) Compute Y ð0Þ; . . . ; Y ðT  1Þ from d Y ð2pn=TÞ, n ¼ 0; . . . ; T  1, by the inverse DFT. ^ Þ. If it does not decrease then repeat (3), with decreased step size in (25) and (26), and (5) Optionally compute the new criterion HðY (4)–(5) again. Repeat the iteration until convergence.

7. Some simulation examples 7.1. The pure deconvolution case We consider the observation model X ðtÞ ¼ r Sðt  1Þ þ SðtÞ þ r00 Sðt þ 1Þ where fSðtÞg is the source sequence. Thus, BðoÞ should be proportional to 1=ð1 þ r0 eio þ r00 eio Þ. One can factorize 1 þ r0 eio þ r00 eio as cð1 þ z0 eio Þð1 þ z00 eio Þ by solving the equations cð1 þ z0 z00 Þ ¼ 1, r0 ¼ cz0 , r00 ¼ cz00 . This yields z0 ¼ r0 =c, z00 ¼ r00 =c with c being the 2 solution of cð1 þ r0 r00p =cffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Þ ¼ 1. The last equation has two solutions 1=2  1=4  r0 r00 . We require that both jz0 j and jz00 j are strictly less than 1 and it may be shown that a necessary and sufficient condition for this to hold is jr0 þ r00 jo1, in which case c is real and should be taken as the larger (in absolute value) of the above two solutions. Thus z0 ; z00 are real and one has the expansion 0

log BðoÞ ¼

1 X ð1Þj z0 j j¼1

þ

j

eijo

1 X ð1Þj z00j j¼1

j

eijo þ constant.

We take r0 ¼ 0:4, r00 ¼ 0:45, which yields the values of ð1Þj ðz00j  z0 j Þ=j reported in Table 2. One can see from this table that one may truncate the expansion for log BðoÞ without incurring much error at index m ¼ 10, and thus consider the values reported in Table 2 as the ‘‘true value’’ of yj (first row) and ymþj (second row). We generate 1000 sequences of observations of length T ¼ 1024 using a source pffiffiffi p ffiffiffi distribution uniform over ½ 3; 3. For each

Table 2 Value of ðz00j  z0 j Þ=j for j ¼ 1; . . . ; 10 (r0 ¼ 0:4; r00 ¼ 0:45) j

1

ð1Þj ðz00j þ z0 j Þ=j 1.1117 ð1Þj ðz00j  z0 j Þ=j 0.0654

2

3 0.3101 0.1157 0.0364 0.0202

4

5 0.0487 0.0220 0.0113 0.0063

j 6 7 8 9 10 ð1Þj ðz00j þ z0 j Þ=j 0.0103 0.0050 0.0025 0.0013 0.0007 ð1Þj ðz00j  z0 j Þ=j 0.0035 0.0020 0.0011 0.0006 0.0003

sequence of observations, we apply our algorithm with m ¼ 10 and HðY Þ being logðEY 4 Þ=4 and the entropy of Y, respectively. In the first case, cðY Þ ¼ Y 3 =EðY 4 Þ and this function and logðEY 4 Þ=4 can be estimated easily from sample moments. In the second case the entropy HðY Þ and score function cY are estimated by the method of [21]. The mean and standard deviation of the resulting estimators, denoted by y^ 4th and y^ MI , are reported in Table 3. One can see from Table 3 that the mean values of the estimators are very close to those reported from Table 2. The standard deviation of y^ 4th are, howj ever, much larger than those predicted by formula (23) (which yields 0.0245 for 1pjp10 and 0.0289 ^ 4th for 10ojp20) especially for y^ 4th 11 ; . . . ; y20 . One reason is that the deconvolution filter, as explained above, cannot exactly recover the sources, so that k ¼ 3EðY 2 Þ2 =EðY 4 Þ would be less than the theoretical value 53, as the recovered source is still a mixture and hence has a distribution closer to Gaussian than the uniform distribution. Another reason is that there are so many parameters to be estimated

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2055

Table 3 Simulation result for the estimator y^ 4th based on the 4th moment contrast logðEY 4 Þ=4 and the estimator y^ MI based on the mutual information criterion j 1 2 3 4 5 6 7 8 9 10

y^ MI j

y^ 4th j 1.0979 0.3031 0.1097 0.0472 0.0204 0.0104 0.0038 0.0013 0.0007 0.0004

0.0354 0.0305 0.0270 0.0271 0.0260 0.0256 0.0273 0.0265 0.0273 0.0261

1.1078 0.3080 0.1122 0.0484 0.0207 0.0106 0.0037 0.0021 0.0007 0.0010

y^ 4th 10þj 0.0188 0.0192 0.0187 0.0191 0.0186 0.0183 0.0197 0.0188 0.0192 0.0187

0.0654 0.0364 0.0202 0.0113 0.0063 0.0035 0.0020 0.0011 0.0006 0.0003

y^ MI 10þj 0.0326 0.0374 0.0398 0.0418 0.0434 0.0414 0.0423 0.0447 0.0383 0.0366

0.0649 0.0363 0.0204 0.0129 0.0069 0.0053 0.0027 0.0021 0.0014 0.0017

0.0216 0.0220 0.0215 0.0225 0.0218 0.0224 0.0217 0.0220 0.0220 0.0217

1st column refers to the mean, 2nd column to the standard deviation.

including k, so that the asymptotic results do not attain yet. In fact we have observed, in the case of sample size 512 and lower, that in some simulations the estimator k^ can be less than 1, mostly in the early stage of the algorithm where the extracted source can be highly mixed. (For such extracted source, the theoretical k would be close to 1, although still greater than 1 since a mixture of sub-Gaussian variables is still sub-Gaussian.) In the case of sample size 1024, we have not got a k^ less than 1 (even once in 1000 simulations) but nevertheless it could be quite close to 1. We suspect that all this phenomena contribute to the increases of variance of the estimator. Concerning the estimator y^ MI j , it has a theoretical asymptotic variance 0 (superefficiency), since the pffiffiffi score function is 1 at  3. However, again because the source cannot be exactly extracted, superefficiency cannot be achieved (for fixed m). Further, due to the smoothing effect of the kernel method, the estimated score function is pulled toward zero and made more linear (than the true one). In fact, we have observed that our score estimate in the earlier stage of the algorithm can be quite close to that of a Gaussian, yielding a k^ close to 1. It appears to us that using a smoothing parameter much smaller than what would be used in density estimation can be quite beneficial here, as it reduces the smoothing effect (at the expense of an increase in variance). Also it is important to ^ c^ 0 ðY ÞEðY ^ 2 Þ, (E ^ denoting the estimate k^ not by E½ Y ^ 2 ðY ÞEðY ^ 2 Þ. ^ c sample average operator) but by E½ Y This is because there is no guarantee that the first product is always greater than 1, while the second is guaranteed to be so, by (19) and the Schwartz

inequality. Using a small smoothing parameter produces a wiggly curve for c^ Y , but this does not ^ 2 ðY Þ, and since the smoothing effect is ^ c decrease E½ Y ^ c^ 2 ðY Þ. However, the reduced it tends to increase E½ Y smoothing parameter cannot be reduced too much without the adverse effect of high variance. Thus, there is a practical limit in performance of the y^ MI j estimator. Nevertheless, at sample size 1024, it performs better than the theoretical performance (far from achieved in practice) of the estimator y^ 4th j . Finally, we note that initializing the algorithm by the method described in the previous section is quite useful. It accelerates the algorithm considerably. This is due to the fact that the starting value is closer to the solution, but also that the extracted sources in the early stage of the algorithm is less mixed and hence has a higher k, which stabilizes the algorithm. The mean number of iterations is 2.179 for the estimator y^ 4th and 6.950 for y^ MI j j . (The stopping ^ Þ and is the criterion is based on the decrease of HðY same in both algorithms.) The slower convergence of the algorithm for the MI estimator may be due to the fact that the score estimator is rather unstable due to the low value of the smoothing parameter (indeed the lower this parameter, the slower the algorithm). But as pointed out earlier, a low value of this parameter (here equal to half the standard choice) can yield a better estimate of the deconvolution filter. 7.2. The multichannel case In this case, the theoretical formula for the approximate Hessian is too complex to be useful.

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2056

P XðtÞ ¼ 17 u¼14 AðuÞSðt  uÞ where fAðuÞg is the inverse sequence of fBðuÞg, truncated to the range ½14; 17. This truncation is justified by the fact that this sequence decays quickly to 0 at infinity, as can be seen in Fig. 1. The sources are taken to be independent random with a uniform pffiffiffi pffiffivariables ffi distribution in ½ 3; 3. One hundred observation sequences of length 1028 are generated and for each sequence the BFGS is applied to minimize the estimated mutual information criterion. Since the deconvolution filter is FIR, the data is not periodically extended but instead the output of the deconvolution is made shorter (length 1024 exactly). Accordingly, the gradient is computed by formula (16) (without the ‘‘mod T’’). Mean and standard deviation of the

Numerical calculation using this formula is not cheap and requires complex coding. Further this matrix doesn’t have any interesting structure and thus its inversion can be costly since it is a large matrix. For these reasons, we shall adopt the BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm [22] to minimize the criterion. This algorithm constructs numerically an approximate inverse Hessian matrix by analyzing successive gradient vectors. Thus, we need only to compute such gradient (and also the criterion) using to our theoretical formula. P2 We consider the observation model u¼2 BðuÞXðt  uÞ ¼ SðtÞ where BðuÞ; jujp2 are 2  2 matrices which are listed in Table 4. The process fXðtÞg is actually computed as Table 4 True and estimate deconvolution filter (std., standard deviation) True value

Mean of estimate

Std. of estimate

Bð2ÞT

0.0112 0.8057

0.6451 0.2316

0.0146 0.8005

0.6449 0.2383

0.0381 0.0273

0.0426 0.0274

Bð1ÞT

0.9898 0.2895

1.3396 1.4789

0.9832 0.2817

1.3369 1.4798

0.0369 0.0403

0.0426 0.0354

Bð0ÞT

3.1380 1.2919

0.6841 1.9271

3.1406 1.2913

0.6759 1.9265

0.0151 0.0328

0.0447 0.0148

Bð1ÞT

0.3306 0.4978

0.8436 1.4885

0.3418 0.4985

0.8484 1.4891

0.0492 0.0391

0.0395 0.0252

Bð2ÞT

0.5465 0.2463

0.8468 0.6630

0.5349 0.2492

0.8435 0.6620

0.0453 0.0348

0.0349 0.0271

0.4 0.3 0.2 0.1 0 −0.1 −0.2 −10

−5

0

5

10

Fig. 1. Elements of the matrix AðuÞ as function of u.

15

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

estimated deconvolution filter are listed in Table 4. As in the pure deconvolution case, we find that better results are obtained by using a lower value of the smoothing parameter than the standard choice. Thus, this parameter is taken to be half of the standard choice. One can see from Table 4 that the deconvolution filter is well estimated. Note that the above results do not come directly from the output of the BFGS algorithm, since there is a permutation and scale ambiguity. In order to eliminate the permutation ambiguity, we have ^ ¼ ½Bð2Þ ^ ^ permuted the rows of B    Bð2Þ ^ (BðuÞ being the estimate of BðuÞ) whenever the ^T product of the diagonal elements of BB (B ¼ ½Bð2Þ    Bð2Þ) is less in absolute value than that of the off diagonal elements. To eliminate the scale ambiguity, we have normalized the rows of ^ (eventually permuted) so that they have the same B norm as that of B. The BFGS algorithm, however, converges somewhat slowly. The average number of iterations is 16.32. This may be due to the fact that it needs time to build up correctly the inverse of the Hessian, this inverse being set initially as the identity matrix which can be quite far from the truth. Another point is that we do not have a method to initialize the deconvolution filter so we simply initialize it as the identity filter, which again can be quite far from the desired solution.

2057

finite. This is consistent with the well-known fact that for Gaussian source, there is ambiguity in the deconvolution filter, but this ambiguity concerns only the phase of its transfer function, not the amplitude. In the pure deconvolution case, we propose a parameterization which leads to the asymptotic decorrelation of the estimated parameters and very simple formulas for its asymptotic variance, and also to a very simple quasi-Newton algorithm. The algorithm has been validated by simulation. For the multichannel case, the theoretical formula for the Hessian may be too complex to be useful. But using only the formula for the gradient, one can set up a BFGS algorithm to minimize the criterion. Simulation shows that the algorithm has good performance and the model parameters are well estimated. Appendix A A.1. Proofs of lemmas Proof of Lemma 1. By (4) and (5), the first order term in the expansion of H½Y k ð0Þ þ dY k ð0Þ is 1 K X X

Efck ½Y k ð0ÞdBkj ðuÞX j ðuÞg.

u¼1 j¼1

8. Conclusion and discussion We have provided a solution of the blind separation–deconvolution problem, based on the mutual information criterion or more generally on an appropriate system of estimating equations. We have developed formulas for the quasi-Newton algorithm and the asymptotic covariance matrix of the estimator. More interesting results have been obtained in the pure deconvolution (single input single output) case. In particular, the asymptotic covariance matrix simplifies if one parameterizes not the deconvolution filter transfer function, but its logarithm, in this case its real and imaginary parts are asymptotically independent. These parts correspond to the amplitude and the phase of the filter transfer function. It can be seen that when the correlation coefficient r2 between the score function of the source and the source itself tend to 1 (indicating that the source distribution approaches Gaussianity), the variance of the phase estimator goes to infinity, but that of the amplitude remains

Summing up along the index k yields that the first P order term in the expansion of K k¼1 HðY k þ dY k Þ P1 equals the trace of u¼1 dBðuÞEfXðuÞc½Yð0ÞT g. R 2p Since EfXðuÞc½Yð0ÞT g ¼ 0 f XcðYÞ ðoÞeiuo do=ð2pÞ where f XcðYÞ is the cross spectral density between the processes fXðtÞg and fc½YðtÞg (defined similarly to (6)), the last sum can be written as R 2p 0 dBðoÞf XcðYÞ ðoÞ do=ð2pÞ. Noting that f YcðYÞ ðoÞ ¼ BðoÞf XcðYÞ ðoÞ as fYðtÞg is related to fXðtÞg through (2), one gets the first result of the lemma. To prove the last result P of the lemma, one y expresses Xðt  uÞ in (4) as v B ðvÞYðt  u  vÞ. Then, by (4) dY k ð0Þ ¼

1 X K X ðdB%By Þkj ðuÞY j ðuÞ u¼1 j¼1

¼ ðdB%By Þkk ð0ÞY k ð0Þ þ Z k , say. Note that ðdB%By Þkk ð0Þ ¼ ðdBB1 Þkk and that under assumption (S) Y k ð0Þ and Z k are independent, hence the variance of Z k equals that of dY k ð0Þ

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2058

Z

minus that of ðdBB1 Þkk Y k ð0Þ, which is Z 2p K X do 2 EðY j Þ j½dBðoÞBðoÞ1 kj j2 2p 0 j¼1

0

( S2 ¼ E

 EðY 2k ÞðdBB1 Þ2kk . Therefore,

2p



  q qBðoÞ do , BðoÞ1 qym qyn 2p

X qBðuÞ qym

u

applying

(5)

and

noting

that

qYT ð0Þ XðuÞ diag c0 ½Yð0Þ qyn

)

and

1

E½dY k ð0ÞjY k ð0Þ ¼ ðdBB Þkk Y k ð0Þ, hence E½dY k ð0Þ jY k ð0Þ0 ¼ ðdBB1 Þkk , one obtains the second order PK term in the expansion of k¼1 HðY k þ dY k Þ as given in the lemma. & ^ Proof of Lemma 2. From the definition (11) of c k ^ T ½YðtÞ in the lemma: and that of c ( ) K T 1 X q X 1 qYðtÞ T ^ k Þ ¼ tr c^ ½YðtÞ . (A.1) HðY qym k¼1 T t¼0 qym In the case where YðtÞ is computed as (14) 1 X qYðtÞ qBðuÞ ¼ Xðt  u mod TÞ, qym qym u¼1

(A.2)

which yields the first result of the lemma, noting that q log det BðoÞ=qym ¼ tr½qBðoÞ=qym BðoÞ1 . In the FIR case where YðtÞ is computed as (2), one has the same formula (A.2) but without ‘‘mod T ’’. This yields the last statement of the lemma. By the (discrete) Parseval equality, the right-hand side of (A.1) equals "    # T 1 1 X 2pn T 2pn tr 2 dqY=qym dcðYÞ , ^ T T T n¼0 ð2pn=TÞ are the DFT where dqY=qm ð2pn=TÞ and dcðYÞ ^ of the sequences qYðtÞ=qym ; . . . ; qYðT  1Þ=qym and ^ T ½YðtÞ; . . . ; c ^ T ½YðT  1Þ. But from (A.2) one c has dqY=qym ð2pn=TÞ ¼ qBð2pn=TÞ=qym dX ð2pn=TÞ and similarly from (14), dX ð2pn=TÞ ¼ Bð2pn=TÞ1 dY ð2pn=TÞ. Hence from (18), K X

q ^ kÞ HðY qym k¼1 "    # 1 1 TX qBð2pn=TÞ 2pn 1 ^ 2pn ¼ tr B . f YcðYÞ ^ T n¼0 qym T T This completes the proof of the lemma.

&

Proof of Lemma 3. One may split (20) into S 1 þ S2 þ S3 where S1 ¼

X q2 BðuÞ u

qym qyn

T

EfXðuÞc ½Yð0Þg

S3 ¼

X qBðuÞ u

qym

XðuÞ

qcT ½Yð0Þ. qyn

Consider first S 1 . By a similar argument as in the proof of Lemma R 2p 1, the first term in the definition of S1 equals 0 ½q2 BðoÞ=qym qyn BðoÞ1 f YcðYÞ ðoÞ do= ð2pÞ. Under assumption (S) and the condition that E½ck ðY k ÞY k  ¼ 1, f YcðYÞ reduces to I. Hence Z 2p qBðoÞ qBðoÞ do . S1 ¼ BðoÞ1 BðoÞ1 qy qy 2p m n 0 P y Consider now S 2 . Expressing XðuÞ as v B ðvÞ Yðu  vÞ, one gets from (A.2) that qYð0Þ=qy m ¼ P y w Cm ðwÞYðwÞ where Cm ðwÞ ¼ ðqB=qym %B ÞðwÞ, hence XX EfCm ðuÞYðuÞYT ðvÞ S2 ¼ u

v T ^ 0 ½Yð0Þg. Cn ðvÞ diag c

By assumption (S), EfY j ðuÞY l ðvÞc0k ½Y k ð0Þg vanishes unless ðu; jÞ ¼ ðv; lÞ, in which case it equals EðY 2j ÞE½c0k ðY k Þ if ðu; jÞað0; kÞ or EðY 2k ÞE½c0k ðY k Þ þ cov½Y 2k ; c0k ðY k Þ if ðu; jÞ ¼ ð0; kÞ. Therefore, denoting by C m;jk ðuÞ the general element of Cm ðuÞ, a somewhat tedious algebra yields trðS 2 Þ ¼

K X X j;k¼1

þ

C m;kj ðuÞEðY 2j ÞC n;kj ðuÞE½c0k ðY k Þ

u

K X

C m;kk ð0ÞC n;kk ð0Þ cov½Y 2k ; c0 ðY k Þ.

k¼1

By the Parseval Theorem, one may rewrite the first term in the above right-hand side as   Z 2p X K qBðoÞ 0 1 2 EðY j ÞE½ck ðY k Þ BðoÞ qym 0 kj j;k¼1   qBðoÞ do .  BðoÞ1 qyn kj 2p Consider finally S3 . Expressing again XðuÞ as P y v B ðvÞYðu  vÞ and noting that by assumption (S), Y j ðsÞ is uncorrelated with ck ½Y k ð0Þ

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2059

unless ðj; sÞ ¼ ðk; 0Þ, one gets   X qck trðS 3 Þ ¼ C m;kk ð0ÞE Y k ð0Þ ½Y k ð0Þ . qyn k

cov½cjk ðuÞ; cjk ðvÞ 8 < EðY 2j ÞE½c2k ðY k Þ=T þ oðT 1 Þ; ¼ : oðT 1 Þ;

The function ck satisfies E½Y k ck ðY k Þ ¼ 1, therefore taking the derivative with respect to yn ,     qck qY k 0 E Yk ðY k Þ ¼ E ½cðY k Þ þ Y k ck ðY k Þ . qyn qyn

cov½cjk ðuÞ; ckj ðvÞ ( 1=T þ oðT 1 Þ; ¼ oðT 1 Þ

By a similar calculation as before, the last expecta^ 0 ðY k Þg. Thus, one tion equals C n;kk ð0Þf1 þ E½Y 2 c k

k

gets finally trðS 3 Þ ¼ 

K X

C m;kk ð0ÞC n;kk ð0Þf1 þ E½Y 2k c0k ðY k Þg.

k¼1

v ¼ u; otherwise;

v ¼ u; otherwise:

Therefore, the vector lðyÞ with component l m ðyÞ, is asymptotically Gaussian with zero means and covariance matrix G=T, where G has general element: G mn ¼

K XX u

fEðY 2j ÞE½c2k ðY k Þ

j;k¼1

Combining the above results and noting that

C m;kj ðuÞC n;kj ðuÞ þ C m;kj ðuÞC n;kj ðuÞg

Cm ð0Þ ¼ qB=qym B1 , one gets that the trace of (20) is indeed equal to the right-hand side of (8). &



K X fEðY 2k ÞE½c2k ðY k Þ þ 1gC m;kk ð0ÞC n;kk ð0Þ. k¼1

Proof of Lemma 4. Let l m ðyÞ be the right-hand side of (16). Expressing P Xðt  u mod TÞ (or Xðt  uÞ in the FIR case) as v By ðvÞYðt  u  vÞ and denoting Cm ðwÞ ¼ ðqBðuÞ=qym %By ÞðwÞ, one gets ( l m ðyÞ ¼ tr

X w

) T 1 1X qB 1 T ^ , Cm ðwÞ Yðt  wÞc ½YðtÞ  B T t¼0 qym

Note that qB=qym B1 ¼ Cm ð0Þ. Then, using (19), the above formula reduces to l m ðyÞ ¼

K XX ua0 j;k¼1

C m;kj ðuÞ^cjk ðuÞ þ

K X

C m;jk ð0Þ^cjk ð0Þ,

jak¼1

where C m;kj ðuÞ denotes the general element of Cm ðuÞ and c^jk ðuÞ ¼

1 1 TX ^ ½Y k ðtÞ. Y j ðt  uÞc k T t¼0

We shall show below that the c^jk ðuÞ ðu; jÞað0; kÞ have the same asymptotic distribution as the cjk ðuÞ, ^ replaced by which are defined as c^jk ðuÞ but with c k ck . One may apply the Central Limit Theorem to the statistics cjk ðuÞ, ðu; jÞað0; kÞ to see that under assumption (S) they are asymptotically Gaussian with zero mean and covariance structure cov½cjk ðuÞ; clm ðvÞ ¼ oðT 1 Þ; ð j; kÞefðl; mÞ; ðm; lÞg,

By

the

Parseval

Theorem

and

noting

that

1

Cm ð0Þ ¼ qB=qym B , one can rewrite G mn as in the lemma. To complete the proof, we need to show that the c^jk ðuÞ; ðu; jÞað0; kÞ have the same asymptotic distribution as cjk ðuÞ. By assumption ^ ½Y k ðtÞ  c ½Y k ðtÞ  qCk ½Y k ðtÞ; lk ðl^ k  lk Þ c k k qlk and since T 1 1 X qCk pffiffiffiffi Y j ðt  uÞ ½lk ; Y k ðtÞ; qlk T t¼0

ðu; jÞað0; kÞ

are bounded in probability and l^ k ! lk , c^jk ðuÞ cjk ðuÞ ¼ oðT 1=2 Þ, one gets the result. & Proof of Lemma 5. By integration by part E½c0 ðY Þ ¼ E½cðY ÞcY ðY Þ, hence by the Schwartz inequality k2 pr2 r2

, with equality if and only if c is proportional to cY . But since E½Y cðY Þ ¼ 1 ¼ E½Y cY ðY Þ, the last condition amounts to c ¼ cY . Note that this implies k ¼ r2

. It follows that 1 1 ðk þ 1Þ2 pr2 r2

þ 2r r þ 1

pðr2 þ 1Þðr2

þ 1Þ 2 þ r2 since 2r1 r1

pr

. This yields the first result of the lemma.

ARTICLE IN PRESS D.-T. Pham / Signal Processing 87 (2007) 2045–2060

2060

On the other hand, since E½Y cðY Þ ¼ E½Y cY ðY Þ ¼ 1, one can decompose c and cY into ~ Þ, cðY Þ ¼ Y =EðY 2 Þ þ cðY c ðY Þ ¼ Y =EðY 2 Þ þ c~ ðY Þ, Y

[9]

Y

~ being noncorrelated with Y. Then with c~ and c Y again by integration by part, ~ 0 ðY Þ ¼ E½cðY ~ Þc ðY Þ ¼ E½cðY ~ Þc ~ ðY Þ E½c Y Y ~ Þc ~ ðY Þ. Therefore, by hence k ¼ 1 þ EðY 2 ÞE½cðY Y Schwartz inequality again 2

[10]

[11]

2

~ ðY ÞE½c ~ ðY Þ ðk  1Þ2 p½EðY 2 Þ2 E½c Y

[12]

¼ ðr2  1Þðr2

 1Þ, ~ 2 ðY Þ ¼ r2  1, EðY 2 ÞE½c ~ 2 ðY Þ ¼ since EðY 2 ÞE½c Y r2  1. Equality can be achieved if and only if

~ ¼ c~ , which amounts to c ¼ c . This yields the c Y Y second result of the lemma. &

[13]

[14]

References [15] [1] J.-F. Cardoso, Blind signal separation: statistical principles, Proc. IEEE 10 (86) (1998) 2009–2025. [2] P. Comon, Contrasts for multichannel blind deconvolution, IEEE Signal Process. Lett. 3 (7) (1996) 209–211. [3] E. Moreau, J.-C. Pesqquet, Generalized contrasts for multichannel blind deconvolution of linear system, IEEE Signal Process. Lett. 4 (6) (1997) 182–183. [4] J.K. Tugnait, Identification and deconvolution of multichannel linear non-Gaussian processes using higher order statistics and inverse filter criteria, IEEE Trans. Signal Process. 45 (3) (1997) 658–672. [5] Y. Inouye, K. Hirano, Cumulant based identification of linear multi-input multi-output systems driven by colored inputs, IEEE Trans. Signal Process. 45 (6) (1997) 1543–1552. [6] Y. Inouye, K. Tanabe, Super-exponential algorithm for multichannel blind deconvolution, IEEE Trans. Signal Process. 48 (3) (2000) 881–888. [7] W. Yellin, E. Weinstein, Criteria for multichannel signal separation, IEEE Trans. Signal Process. 42 (8) (1994) 2158–2165. [8] B. Chen, A.P. Petropulu, Multiple-input–multiple-output blind system identification based on cross-polyspectra, in:

[16]

[17] [18]

[19]

[20] [21]

[22]

Proceedings of the IEEE International Conference on Acoustic, Speech, Signal Processing, vol. 1, Istanbul, Turkey, 2000, pp. 584–587. B. Chen, A.P. Petropulu, Frequency domain blind MIMO system identification based on second and higher order statistics, IEEE Trans. Signal Process. 49 (8) (2001) 1677–1688. M. Castella, J.-C. Pesquet, A.P. Petropulu, A family of frequency- and time domain contrasts for blind separation of convolutive mixtures of temporally dependent signals, IEEE Trans. Signal Process. 53 (1) (2005) 107–120. D.T. Pham, Mutual information approach to blind separation of stationary sources, IEEE Trans. Inform. Theory 48 (2002) 1935–1946. L.Q. Zhang, A. Cichocki, S. Amari, Multichannel blind deconvolution of non-minimum phase systems using information backpropagation, in: Proceedings of the 6th International Conference on Neural Information Processing, vol. 1, Perth, Australia, 1999, pp. 210–216. A. Taleb, J.S. i Casal, C. Jutten, Quasi-nonparametric blind inversion of Wiener systems, IEEE Trans. Signal Process. 49 (5) (2001) 917–924. D.L. Donoho, On minimum entropy deconvolution, in: D.F. Findley (Ed.), Applied Time Series Analysis II, Academic Press, New York, 1981, pp. 565–609. A. Bell, T. Sejnowski, An information maximization approach to blind separation and deconvolution, Neural Computation 7 (1995) 1129–1159. D.T. Pham, Blind separation of instantaneous mixture of sources via an independent component analysis, IEEE Trans. Signal Process. 44 (11) (1996) 2768–2779. T. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991. D.T. Pham, Entropy of a random variable slightly contaminated with another, IEEE Signal Process. Lett. 12 (7) (2005) 536–539. D.T. Pham, Contrast functions for blind seperation and deconvolution of sources, in: Proceedings of ICA 2001 Conference, San Diego, USA, 2001, pp. 37–42. P.J. Huber, Projection pursuit, Ann. Statist. 13 (2) (1985) 435–475. D.T. Pham, Fast algorithms for mutual information based independent component analysis, IEEE Trans. Signal Process. 52 (10) (2004) 2690–2700. W.H. Press, B.P. Flannery, S.A. Teukolsky, W.T. Vetterling, Numerical Recipes in C. The Art of Scientific Computing, second ed., Cambridge University Press, Cambridge, MA, 1993.