Neurocomputing 56 (2004) 381 – 398
www.elsevier.com/locate/neucom
A geometric algorithm for overcomplete linear ICA

Fabian J. Theis a,*, Elmar W. Lang a, Carlos G. Puntonet b

a Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany
b Department of Architecture and Computer Technology, University of Granada, Spain
Received 12 November 2001; received in revised form 16 September 2003; accepted 26 September 2003
Abstract

Geometric algorithms for linear square independent component analysis (ICA) have recently received some attention due to their pictorial description and their relative ease of implementation. The geometric approach to ICA was proposed first by Puntonet and Prieto (Neural Process. Lett. 2 (1995), Signal Processing 46 (1995) 267) in order to separate linear mixtures. We generalize these algorithms to overcomplete cases with more sources than sensors. With geometric ICA we get an efficient method for the matrix-recovery step in the framework of a two-step approach to the source separation problem. The second step, source-recovery, uses a maximum-likelihood approach. There we prove that the shortest-path algorithm as proposed by Bofill and Zibulevsky (in: P. Pajunen, J. Karhunen (Eds.), Independent Component Analysis and Blind Signal Separation (Proceedings of ICA'2000), 2000, pp. 87–92) indeed solves the maximum-likelihood conditions.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Overcomplete blind source separation; Overcomplete independent component analysis; Overcomplete representation; Geometric independent component analysis
1. Introduction

In independent component analysis (ICA), given a random vector, the goal is to find its statistically independent components. This can be used to solve the blind source separation (BSS) problem, which is, given only the mixtures of some underlying independent sources, to separate the mixed signals and thus recover the original sources.
In contrast to correlation-based transformations such as principal component analysis, ICA renders the output signals as statistically independent as possible by evaluating higher-order statistics. The idea of ICA was first expressed by Jutten and Herault [15,16], while the term ICA was coined later by Comon [9]. However, the field became popular only with the seminal paper by Bell and Sejnowski [5], who elaborated upon the Infomax principle first advocated by Linsker [21,22]. Amari [1] as well as Cardoso and Laheld [7] later simplified the Infomax learning rule by introducing the concept of a natural gradient, which accounts for the non-Euclidean Riemannian structure of the space of weight matrices.

Recently, geometric ICA algorithms have received further attention due to their relative ease of implementation [27,28]. They have been applied successfully to the analysis of real-world biomedical data [3,26] and have also been extended to non-linear ICA problems [25]. These algorithms generally assume that at least as many sensor signals are provided as there are underlying source signals. In overcomplete ICA, however, more sources are mixed to fewer observed signals. The ideas used in overcomplete ICA originally stem from coding theory, where the task is to find a representation of some signals in a given set of generators which often are more numerous than the signals, hence the term overcomplete basis. Sometimes such a representation is advantageous because it uses as few 'basis' elements as possible; it is then called sparse coding. Olshausen and Field [24] first put these ideas into an information-theoretic context, decomposing natural images into an overcomplete basis. Later, Harpur and Prager [11] and, independently, Olshausen [23] presented a connection between sparse coding and ICA in the square case. Lewicki and Sejnowski [19] then were the first to apply these terms to overcomplete ICA, which was further studied and applied by Lee et al. [17]. De Lathauwer et al. [10] provided an interesting algebraic approach to overcomplete ICA of three sources and two mixtures by solving a system of linear equations in the third- and fourth-order cumulants, and Bofill and Zibulevsky [6] treated a special case ('delta-like' source distributions) of source signals after Fourier transformation.

2. Basics

For m, n ∈ N let Mat(m × n) be the R-vector space of real m × n matrices, and let Gl(n) := {W ∈ Mat(n × n) | det(W) ≠ 0} be the general linear group of R^n. In the general case of linear BSS, a random vector X: Ω → R^m, called the mixed vector, is given; it originates from an independent random vector S: Ω → R^n, which will be denoted as the source vector, by mixing with a mixing matrix A = (a_1, ..., a_n) ∈ Mat(m × n), i.e. X = AS. Here Ω denotes a fixed probability space. Only the mixed vector is known, and the task is to recover both the mixing matrix A and the source signals S.

Let a_i := Ae_i denote the columns of A, where the e_i are the unit vectors. We will assume that the mixing matrix A has full rank and that any two different columns a_i, a_j (i ≠ j) are linearly independent. Note that for the square case (m = n) this just means that A is invertible, i.e. A ∈ Gl(n). In this case, we can of course recover S from A by S = A^{-1}X. For fewer sources than
mixtures (m > n) the BSS problem is said to be undercomplete, and it is easily reduced to a square BSS problem by selecting n of the mixtures or by applying some more sophisticated preprocessing like PCA. The separation problem we are mainly interested in here is the overcomplete case, where fewer mixtures than sources are given (m < n). Stated like this, the problem is ill-posed, hence further restrictions will have to be made. For the square case, many different algorithms have been proposed, with the popular Bell–Sejnowski maximum entropy algorithm [5] and the FastICA algorithm by Hyvärinen and Oja [12,13] being among the most efficient ones. Some have also been extended to overcomplete situations [20,18]. In this paper we consider a geometric approach to the source separation problem as advocated by Puntonet et al. [27,29].
3. A two-step approach to the separation

In the square case, it is sufficient to recover the mixing matrix A in order to solve the separation problem, because the sources can be reconstructed from A and X by inverting A. For the overcomplete case as presented here, however, after finding A in a similar fashion as in square ICA (matrix-recovery step), the sources will be chosen from the (n − m)-dimensional affine vector space of solutions of AS = X using a suitable boundary condition (source-recovery step). Hence, with this algorithm we follow a two-step approach to the separation of more sources than mixtures; this two-step approach has been proposed recently by Bofill and Zibulevsky [6] for delta distributions. It can be contrasted with the single-step separation algorithm by Lewicki and Sejnowski [19], where both steps have been fused into the minimization of a single complex energy function. We show that our approach resolves the convergence problem induced by the complicated energy function and, moreover, that it reflects the square case as a special case in a very obvious way. Hence, we borrow the idea of splitting up the separation into matrix- and source-recovery steps and present a new algorithm for the overcomplete matrix-recovery step that is largely based on geometric algorithms developed to solve the square case.
4. Matrix-recovery step

In the first step, given only the mixtures X, the goal is to find a matrix A' ∈ Mat(m × n) with full rank and pairwise linearly independent columns such that there exists an independent random vector S' with X = A'S'. If we assume that at most one component of S and S' is Gaussian, then for m = n it is known [9] that A' is equivalent to A; here we consider two matrices B, C ∈ Mat(m × n) equivalent if C can be written as C = BPL with an invertible diagonal matrix (scaling matrix) L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n). In the overcomplete case, however, no such uniqueness theorem is known, and we believe that without further restrictions it would not be true anyway.
For geometric matrix-recovery, we use a generalization of the geometric ICA algorithm [27,28]. Later on, we will restrict ourselves to the case of two-dimensional mixture spaces, mainly for illustrative purposes. With high-dimensional problems, however, geometric algorithms need many samples [14] and hence seem less practical; but for now let m > 1 be arbitrary.

Let S: Ω → R^n be an independent n-dimensional Lebesgue-continuous random vector describing the source pattern distribution; its density function is denoted by ρ: R^n → R. As S is independent, ρ factorizes as ρ(x_1, ..., x_n) = ρ_1(x_1) ··· ρ_n(x_n) with the marginal source density functions ρ_i: R → R. We assume that the source variables S_i are distributed symmetrically, i.e. ρ_i(x) = ρ_i(−x) for x ∈ R and i = 1, ..., n. In particular, E(S) = 0. For stability of the geometric algorithm, it is furthermore required that the signals are super-Gaussian (positive kurtosis) and unimodal; in practice, these restrictions are often met at least approximately.

As above, let X denote the mixed vector and A the mixing matrix such that X = AS. A is assumed to be of full rank and to have pairwise linearly independent columns. Since we are not interested in dealing with scaling factors, we can assume that the columns of A have Euclidean norm 1. If a_1, ..., a_n ∈ R^m denote the columns of A, we often write A = (a_1 | ... | a_n).

The geometric learning algorithm for symmetric distributions in its simplest form then goes as follows: Pick 2n starting elements w_1, w_1', ..., w_n, w_n' on the unit sphere S^{m−1} ⊂ R^m such that w_i and w_i' are opposite each other, i.e. w_i' = −w_i for i = 1, ..., n, and such that the w_i are pairwise linearly independent vectors in R^m. Often, these w_i are called neurons because they resemble the neurons used in clustering algorithms and in Kohonen's self-organizing maps. If m = 2, one usually takes equidistant starting vectors on the unit circle, e.g. w_k = exp(iπ(k − 1)/n), k = 1, ..., n, identifying R^2 with C. Furthermore, fix a learning rate η: N → R such that η(t) > 0, Σ_{t∈N} η(t) = ∞ and Σ_{t∈N} η(t)^2 < ∞. Then iterate the following step until an appropriate abort condition has been met:

Choose a sample x(t) ∈ R^m according to the distribution of X. If x(t) = 0, pick a new one; note that this case happens with probability zero, since the probability density function ρ_X of X is assumed to be continuous. Project x(t) onto the unit sphere to get y(t) := x(t)/|x(t)|. Let i ∈ {1, ..., n} be such that w_i or w_i' is the neuron closest to y(t) with respect to the Euclidean metric; replacing y(t) by −y(t) if necessary, we may assume that w_i itself is the closest. Then set

w_i(t + 1) := π(w_i(t) + η(t)(y(t) − w_i(t))),

where π: R^m \ {0} → S^{m−1} denotes the projection onto the (m − 1)-dimensional unit sphere S^{m−1} in R^m, and

w_i'(t + 1) := −w_i(t + 1).

All other neurons are not moved in this iteration. In Fig. 1, the learning algorithm is visualized on the sphere.
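To make the update rule concrete, the following sketch (not part of the original paper) implements the above absolute winner-takes-all learning in Python. The random initialization, the simple decaying learning rate and the name geometric_ica_neurons are illustrative assumptions; the experiments in the paper use the cooling schedule of [4] instead.

```python
import numpy as np

def geometric_ica_neurons(X, n, iterations=100000, eta0=0.02, seed=0):
    """Sketch of the winner-takes-all learning on the sphere.

    X : array of shape (m, T), observed mixtures (columns are samples).
    n : assumed number of sources (i.e. of neuron pairs w_i, -w_i).
    Returns an (m, n) matrix whose columns estimate the normalized
    mixing-matrix columns a_i (up to sign and permutation).
    """
    rng = np.random.default_rng(seed)
    m, T = X.shape
    # initial neurons: random pairwise linearly independent unit vectors
    W = rng.standard_normal((m, n))
    W /= np.linalg.norm(W, axis=0)

    for t in range(iterations):
        x = X[:, rng.integers(T)]
        if np.allclose(x, 0):
            continue                      # samples at the origin are skipped
        y = x / np.linalg.norm(x)         # project the sample onto S^{m-1}
        scores = np.abs(W.T @ y)          # |cos angle| handles the pair w_i, -w_i
        i = int(np.argmax(scores))        # index of the winning pair
        sign = np.sign(W[:, i] @ y)       # move w_i towards +y or -y
        eta = eta0 / (1.0 + 1e-4 * t)     # illustrative decaying learning rate
        w = W[:, i] + eta * (sign * y - W[:, i])
        W[:, i] = w / np.linalg.norm(w)   # project back onto the sphere
    return W
```

The columns of the returned matrix play the role of the converged neurons w_1(∞), ..., w_n(∞) in Fig. 1.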
Fig. 1. Visualization of the geometric algorithm with starting points w1(0), w2(0) and w3(0) and end points w1(∞), w2(∞) and w3(∞). Dash-dotted lines show borders of the receptive fields.
Similar to the square case, this algorithm may be called absolute winner-takes-all learning. It resembles Kohonen's competitive learning algorithm for self-organizing maps with a trivial neighbourhood function (0-neighbour algorithm), but with the modification that the step size along the direction of a sample does not depend on the distance, and that the learning process takes place on S^{m−1}, not in R^{m−1}. The authors present a formal framework for this geometric method in the square case in [30]. One goal of future research is to generalize it to the overcomplete case.

5. Uniqueness of geometric matrix-recovery

The basic idea of the geometric separation method lies in the fact that in the source space {s_1, ..., s_N} ⊂ R^n, where the s_i denote a fixed number N of samples of the source vector S with zero mean, the clusters along the axes of the coordinate system are transformed by A into clusters along different lines through the origin. The detection of those n new axes allows us to determine a matrix B which is equivalent to A, see Fig. 2.

We now consider the learning process to have terminated already and describe precisely how to recover the mixing matrix A, i.e. after the axes that span the observation space have been extracted from the data successfully. Let

L := {(x_1, ..., x_n) ∈ R^n | ∃i: x_i > 0 and x_j = 0 for all j ≠ i}

be the set of positive coordinate axes, and denote by L' := AL the image of this set under A.
Fig. 2. Example of a two-dimensional scatter plot of a mixture of three high-kurtosis Gamma signals with identical variance. The signals have been mixed by a matrix A mapping the unit vectors onto vectors inclined at angles α_i to the x_1-axis.
We claim that L' intersects the unit (m − 1)-sphere S^{m−1} := {x ∈ R^m | |x| = 1} in exactly n distinct points {p_1, ..., p_n}. For this, note that L' ∩ S^{m−1} is the image of the unit vectors {e_1, ..., e_n} under the map

f: R^n \ ker A → R^m \ {0} → S^{m−1},  x ↦ Ax ↦ Ax/|Ax|,

where ker A = A^{-1}{0} denotes the kernel of A, so we have (after a possible reordering of the p_i) f(e_i) = p_i. Since the a_i = Ae_i are pairwise linearly independent, p_i = p_j implies i = j. Furthermore, we note that the p_i span the whole R^m, because A has full rank. Define the matrix M_{p_1,...,p_n} ∈ Mat(m × n) to be the linear mapping of e_i onto p_i for i = 1, ..., n, i.e.

M_{p_1,...,p_n} = (p_1 | ... | p_n).

Then the following lemma holds.
Lemma 1. For a permutation σ ∈ S_n, the two matrices M_{p_1,...,p_n} and M_{p_σ(1),...,p_σ(n)} are equivalent.

Proof. M_{p_1,...,p_n} = M_{p_σ(1),...,p_σ(n)} P for a permutation matrix P.

With this definition of the matrix M_{p_1,...,p_n} the following theorem can be stated.

Theorem 2 (Uniqueness of the geometric method). The matrix M_{p_1,...,p_n} is equivalent to A.

Proof. By construction of M_{p_1,...,p_n}, we have

M_{p_1,...,p_n}(e_i) = p_i = f(e_i) = Ae_i / |Ae_i|,

so there exists a λ_i ∈ R \ {0} such that M_{p_1,...,p_n}(e_i) = A(λ_i e_i). Setting

L := diag(λ_1, ..., λ_n)

yields an invertible diagonal matrix L ∈ Gl(n) such that M_{p_1,...,p_n} = AL. This shows the claim.

In fact, since we assumed that the columns a_i of A have unit norm, it immediately follows that {p_1, ..., p_n} = {a_1, ..., a_n}.

6. Source-recovery step

Using the results given above, we can assume that an estimate of the original mixing matrix A has been found. We are, therefore, left with the problem of reconstructing the sources using the mixtures X and the estimated matrix A. Since A has full rank, the equation x = As yields the (n − m)-dimensional affine vector space A^{-1}{x} as the solution space for s. Hence, if n > m the source-recovery problem is ill-posed without further assumptions. An often-used assumption [19,6] can be derived using a maximum-likelihood approach, as will be shown next.
The problem of the source-recovery step can be formulated as follows: Given a random vector X: Ω → R^m and a matrix A as above, find an independent vector S: Ω → R^n, satisfying an assumption yet to be specified, such that X = AS.

Considering X = AS, i.e. neglecting any additional noise, X can be imagined to be determined by A and S. Hence, the probability of observing X given A and S can be written as P(X | S, A). Using Bayes' theorem, the posterior probability of S is then

P(S | X, A) = P(X | S, A) P(S) / P(X),

the probability of an event of S conditional on X and A. Given some samples of X, a standard approach for reconstructing S is the maximum-likelihood algorithm, which means maximizing this posterior probability conditioned on the prior probability P(S) of S. Using the samples of X, one can then find the most probable S such that X = AS. In terms of representing the observed sensor signals X in a basis {a_i}, this is called the most probable decomposition of X in terms of the overcomplete basis of R^m given by the columns of A. Using the posterior of the sources P(S | X, A), we can obtain an estimate of the unknown sources by solving the following relation:

S = arg max_{X=AS} P(S | X, A) = arg max_{X=AS} P(X | S, A) P(S).

Since X is fully determined by S and A, P(X | S, A) is trivial, which leads to

S = arg max_{X=AS} P(S).

Note that of course the maximum under the constraint X = AS is not necessarily unique. If we assume P(S) to be a Gaussian distribution, this leads to

S = arg max_{X=AS} exp(−|S_1|^2 − ... − |S_n|^2) = arg min_{X=AS} (|S_1|^2 + ... + |S_n|^2) = arg min_{X=AS} |S|_2,

where |v|_2 := (Σ_i |v_i|^2)^{1/2} denotes the Euclidean norm. Indeed, this source estimation is a linear operation and can be achieved using the Moore–Penrose inverse A^+ of A. Hence, in the Gaussian case, the solution of the source-recovery step is unique; in fact, for a given sample x' it is the Euclidean perpendicular from 0 onto the affine subspace A^{-1}{x'} ⊂ R^n. This exists and is unique, because R^n is a Hilbert space and the affine subspace is convex and closed. If P(S) is assumed to be Laplacian, that is P(S_i)(t) = a exp(−|t|), then we get

S = arg max_{X=AS} exp(−|S_1| − ... − |S_n|) = arg min_{X=AS} (|S_1| + ... + |S_n|) = arg min_{X=AS} |S|_1,
Fig. 3. Shortest-path algorithm. The shortest-path decomposition of the point x' in the above picture uses only the vectors a_1 and a_2.
where |v|_1 := Σ_i |v_i| denotes the 1-norm. Uniqueness of S holds in this case as well, as will be shown in Lemma 3 for the case m = 2. Note that S may not be unique for other norms; for example, for the supremum norm |x|_∞ the perpendicular from 0 onto an affine vector space is not unique.

The general algorithm for the source-recovery step is therefore the maximization of P(S) under the constraint X = AS. This is a linear optimization problem which can be tackled using various optimization algorithms [8]. In the following, we will assume a Laplacian prior distribution of S, which is characteristic of a sparse coding of the observed sensor signals. In this case, the minimization has a nice visual interpretation, which suggests an easy-to-perform algorithm: the source-recovery step consists of minimizing the 1-norm |s'|_1 under the constraint As' = x' for all samples x'. Since the 1-norm of a vector can be pictured as the length of a path with steps parallel to the axes, Bofill and Zibulevsky call this search shortest-path decomposition; indeed, s' represents the shortest path to x' in R^m along the lines given by the matrix columns a_i = Ae_i of A, as will be shown in the subsequent section, see Fig. 3.
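As an illustration of the two priors discussed above, the following sketch (ours, not from the paper) recovers the sources for a single sample x: with a Gaussian prior via the Moore–Penrose pseudoinverse, and with a Laplacian prior by recasting the 1-norm minimization under As = x as a linear program (basis pursuit). The function names are hypothetical, and scipy.optimize.linprog is assumed to be available.

```python
import numpy as np
from scipy.optimize import linprog

def recover_gaussian(A, x):
    """Gaussian prior: minimum Euclidean-norm solution of A s = x,
    given by the Moore-Penrose pseudoinverse A^+ x."""
    return np.linalg.pinv(A) @ x

def recover_laplacian(A, x):
    """Laplacian prior: minimize |s|_1 subject to A s = x.
    The standard split s = u - v with u, v >= 0 turns this into
    a linear program (basis pursuit)."""
    m, n = A.shape
    c = np.ones(2 * n)                      # objective: sum(u) + sum(v) = |s|_1
    A_eq = np.hstack([A, -A])               # constraint: A u - A v = x
    res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v
```

For m = 2, the Laplacian case is solved more directly by the shortest-path algorithm of the next section.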
7. Shortest-path algorithm

Bofill and Zibulevsky first proposed this algorithm in [6]. We present a proof of the fact that finding the shortest path indeed means minimizing the 1-norm. For simplicity, we only deal with two-dimensional mixtures, i.e. m = 2. The goal is to find arg min_{x'=As} |s|_1. As above, let a_1, ..., a_n denote the normalized columns of A = (a_1 | ... | a_n). The shortest-path algorithm is based on the following lemma.

Lemma 3. Let a_1, ..., a_n ∈ R^2, n > 1, be normalized and pairwise linearly independent. Let x' ∈ R^2 and ŝ ∈ R^n be such that x' = Σ_{i=1}^n ŝ_i a_i = Aŝ. Let j, k ∈ {1, ..., n} be such that a_j or −a_j lies closest to x' from below and a_k or −a_k lies closest to x' from above, in terms of their angle to a fixed axis (arbitrary if x' = 0).
Fig. 4. Shortest-path algorithm, proof of the ν(ŝ) = 3 case. Given a decomposition of x' into three vectors a_i, the above figure shows that the path can be shortened so that it only uses the two vectors closest to x'.
Fig. 5. Shortest-path algorithm, proof that the shortest-path decomposition among all paths with ν(s) = 2 uses the two vectors a_j and a_k closest to x'. The triangle inequality shows this claim.
Then ŝ = arg min_{x'=As} |s|_1 holds if and only if ŝ_i = 0 for i ≠ j, k. Furthermore, this ŝ is unique.

Proof. Without loss of generality we assume that x' ≠ 0. Let ŝ = arg min_{x'=As} |s|_1, and let ν(s) := |{i | s_i ≠ 0}| be the number of non-zero entries of s. We claim that ν(ŝ) = 2.

Assume the claim not to be true. A simple geometric consideration presented in Fig. 4 indicates that then ν(ŝ) ≥ 3. So let ŝ_i ≠ 0 and define y := x' − ŝ_i a_i and t := ŝ − ŝ_i e_i. Then At = y, and by induction (ν(t) < ν(ŝ)) we know that ν(t) = 2. Hence ν(ŝ) = 3, which is a contradiction. So we conclude that ν(ŝ) = 2. As a simple consequence it follows from Fig. 5 that indeed ŝ_i = 0 for i ≠ j, k.

Now let ŝ be such that ŝ_i = 0 for i ≠ j, k. Since a_j and a_k are linearly independent, they form a basis of R^2; therefore ŝ is uniquely determined by the conditions ŝ_i = 0 for i ≠ j, k and Aŝ = x'. As shown above, any s with s = arg min_{x'=As} |s|_1 fulfils these conditions; uniqueness then shows that ŝ = arg min_{x'=As} |s|_1.
So the algorithm can be formulated as follows. For a given sample x', pick the columns a_j and a_k of A that lie closest to x' as in Lemma 3. Then s' ∈ R^n is defined by

(s')_i = ((a_j | a_k)^{-1} x')_1 if i = j,
(s')_i = ((a_j | a_k)^{-1} x')_2 if i = k,
(s')_i = 0 otherwise.

It is easy to check that As' = x', so by Lemma 3 this means that s' minimizes the 1-norm among all s with As = x'.
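A direct implementation of this recipe for m = 2 could look as follows (our sketch, not the authors' code). It picks the two columns enclosing the sample by angle, working modulo π because a_i and −a_i span the same line, and then solves the resulting 2 × 2 system.

```python
import numpy as np

def shortest_path_decomposition(A, x):
    """Sketch of the shortest-path source recovery for m = 2 (Lemma 3).

    A : (2, n) matrix with normalized, pairwise linearly independent columns.
    x : observed sample in R^2.
    Returns s in R^n with A s = x and (by Lemma 3) minimal 1-norm.
    """
    n = A.shape[1]
    s = np.zeros(n)
    if np.allclose(x, 0):
        return s
    # directions modulo pi, since a_i and -a_i define the same line
    phi_x = np.arctan2(x[1], x[0]) % np.pi
    phi_a = np.arctan2(A[1], A[0]) % np.pi
    diffs = (phi_a - phi_x) % np.pi
    j = int(np.argmax(diffs))   # column lying cyclically just below x
    k = int(np.argmin(diffs))   # column lying cyclically just above x
    coeffs = np.linalg.solve(A[:, [j, k]], x)   # solve (a_j | a_k) c = x
    s[j], s[k] = coeffs
    return s
```

Applying this function to every sample x' of the mixtures yields the recovered source signals used in the experiments below.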
8. Experimental results

In this section, we give a demonstration of the algorithm. The calculations were performed on an AMD Athlon 1 GHz computer using Matlab and took no more than 1 min each. In order to compare the mixing matrix A with the matrix B recovered in the matrix-recovery step, we calculate the generalized crosstalking error E(A, B) of A and B, defined by

E(A, B) := min_{M ∈ Π} ||A − BM||,

where Π is the group of all invertible real n × n matrices in which exactly one entry in each column differs from 0, and ||·|| is a fixed matrix norm.

Lemma 4. E(A, B) = 0 if and only if A is equivalent to B.

Proof. Note that Π consists of all n × n matrices of the type PL, where L is an invertible diagonal matrix (scaling matrix) and P a permutation matrix. If A is equivalent to B, then by definition there exists an M ∈ Π such that A = BM, therefore E(A, B) = 0. Vice versa, if E(A, B) = 0, then there exists M ∈ Π with ||A − BM|| = 0, i.e. A = BM.

In our examples, we used the norm ||A|| = Σ_{i,j} |a_ij|, i.e. the norm (1/n) Σ_{i,j} |a_ij| without the factor 1/n. In practice, E(A, B) can be calculated more easily by first normalizing the columns of A and B and then taking the minimum over the finite set (2^n n! elements) Π' ⊂ Π of all matrices with exactly one entry from {−1, 1} in each column.

In order to analyze the source-recovery step, we look at the crosstalking error E_1(Cor(S, S')) of the correlation matrix of the original sources S and the recovered sources S', where

E_1(C) = Σ_{i=1}^n ( Σ_{j=1}^n |c_ij| / max_k |c_ik| − 1 ) + Σ_{j=1}^n ( Σ_{i=1}^n |c_ij| / max_k |c_kj| − 1 )

is the performance index, often called crosstalking error, of an n × n matrix C, as introduced by Amari et al. [2].
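For completeness, here is a small sketch (ours, not the paper's Matlab code) of both error measures: the generalized crosstalking error E(A, B) computed over signed column permutations as described above, using the entrywise 1-norm without the 1/n factor, and the Amari index E_1(C). For E(A, B), the optimal sign can be chosen independently per column, so only the n! permutations need to be enumerated explicitly.

```python
import numpy as np
from itertools import permutations

def generalized_crosstalking_error(A, B):
    """E(A, B) = min_M ||A - B M|| over signed column permutations M,
    after normalizing the columns of A and B (entrywise 1-norm)."""
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    n = A.shape[1]
    best = np.inf
    for perm in permutations(range(n)):
        Bp = B[:, perm]
        # the best sign of each matched column can be chosen independently
        err = sum(min(np.abs(A[:, i] - Bp[:, i]).sum(),
                      np.abs(A[:, i] + Bp[:, i]).sum()) for i in range(n))
        best = min(best, err)
    return best

def amari_index(C):
    """Crosstalking error E_1(C) of Amari et al. [2] as given above."""
    C = np.abs(np.asarray(C, dtype=float))
    row_term = (C.sum(axis=1) / C.max(axis=1) - 1).sum()
    col_term = (C.sum(axis=0) / C.max(axis=0) - 1).sum()
    return row_term + col_term
```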
Lemma 5. Let C ∈ Gl(n). Then E_1(C) = 0 if and only if C ∈ Π, i.e. if C is the product of a scaling and a permutation matrix.

Proof. Π consists of the matrices with exactly one non-zero element per column and per row. As C is invertible, C has at least one non-zero element per column and row, and obviously E_1(C) = 0 if and only if C is of that type, i.e. if C ∈ Π.

Corollary 6. E_1(Cor(S, S')) = 0 if and only if S and S' are decorrelated.

Example 1 is a toy example in which we mixed three independent Gamma-distributed signals (density proportional to exp(−|x|^γ)) with zero mean, unit variance and an approximate kurtosis of 9.3 to two signals, using the mixing matrix

A = ( 0.7071  −0.4472  −0.9487
      0.7071   0.8944   0.3162 ).

Geometric matrix-recovery was performed using the above geometric algorithm, where the learning rate was decreased in a manner following [4]. Some tweaking of the parameters was necessary to establish sufficiently stable convergence. The initial learning rate was 0.02, with a cooling step every 50 epochs using a cooling parameter of 60 as in [4]. After 10^5 iterations, the following mixing matrix was found:

B = ( 0.9608   0.7118  −0.4079
     −0.2773   0.7024   0.9130 ),

which is already very close to A and can be improved by more sophisticated training procedures. The generalized crosstalking error of A and B is E(A, B) = 0.1183. A scatter plot of the mixture space together with the trained neurons on the circle is shown in Fig. 6. After the source-recovery step, we compared the demixed sources with the original ones and got the following correlation matrix

(  0.1656   0.9522   0.1048
  −0.2551   0.0719   0.9219
  −0.8827  −0.0974   0.1949 )

with E_1(Cor(S, S')) = 1.9493.

In the next example, we used a very simple 'delta-like' distribution in the source space, which was now chosen to be five-dimensional. We present this example to demonstrate that our algorithm is a generalization of Bofill's and Zibulevsky's, who only deal with these delta distributions [6]. As mixing matrix A, we chose

A = ( 0.9950   0.7071   0.9487   0        −0.7071
      0.0995   0.7071   0.3162   1.0000    0.7071 )
Fig. 6. Example 1: Three mixed Gamma distributions with the source directions indicated by the lines and the trained neurons marked with asterisks.
and, using the same parameters as above, found after 2 × 10^4 iterations the following matrix:

B = ( 0.9950   0.7075   0.7074  −0.9484  −0.0003
      0.0996   0.7067  −0.7068  −0.3170   1.0000 )

with E(A, B) = 0.0028. Again, a scatter plot is shown in Fig. 7. Given the shape of the source distributions it is not astonishing that the accuracy of the algorithm is that high; the source-recovery part also performed extraordinarily well, as can be seen from the correlation matrix of the demixed and the original sources

(  1.0000   0.0000   0.0001   0.0000  −0.0000
   0.0000   1.0000  −0.0000  −0.0001   0.0008
   0.0033   0.0001   0.0000  −1.0000  −0.0000
  −0.0000   0.0004   0.0000   0.0000   1.0000
  −0.0000   0.0000  −1.0000   0.0000   0.0006 )

with E_1(Cor(S, S')) = 0.0108.

In the last example, we mixed three speech signals using the mixing matrix

A = ( 0.9923   0.4472   0.2425
      0.1240   0.8944  −0.9701 )
Fig. 7. Example 2: Five mixed delta distributions.
to two mixed signals, as shown in Fig. 8 (left-hand side and middle), with the scatter plot given in Fig. 9. After 10^5 iterations, we found the following mixing matrix:

B = ( 0.9670   0.4522  −0.2677
      0.2546   0.8919   0.9635 ),

which is satisfactorily similar to A (E(A, B) = 0.1952), and after source-recovery we got the following correlation of the demixed signals and the sources:

(  0.8897   0.2700  −0.3910
   0.1585   0.7957   0.2376
   0.2075  −0.3067  −0.8287 )

with E_1(Cor(S, S')) = 3.7559. On the right of Fig. 8, the demixed signals are shown. One can see a good resemblance to the original sources, but the crosstalking error is still rather high. We claim that this is a fundamental problem of the source-recovery step which, to our knowledge, cannot be improved any further using the above probabilistic approach. To test this, we performed an experiment using the source-recovery algorithm to recover three Laplacian signals mixed with

A_α = ( 1   cos(α)   cos(2α)
        0   sin(α)   sin(2α) ),

where we gave the algorithm the correct mixing matrix already (Fig. 9). We then compared the crosstalking error E_1(Cor(S, S')) of the correlation matrix of the recovered signals S' with the original ones S for different angles α ∈ [0, π/2]. Fig. 10 shows
Fig. 8. Example 3: The three sources to the left, the two mixtures in the middle, and the recovered signals to the right. The speech texts were ‘californication’, ‘peace and love’ and ‘to be or not to be that’. The signal kurtoses were 8.9, 7.9 and 7.4.
that the result is nearly independent of the angle, which makes sense because one can show that the shortest-path algorithm is invariant under coordinate transformations like A_α. This experiment indicates that there might be a general limit on how well sources can be recovered in overcomplete settings.

9. Conclusion

We have presented a two-step approach to overcomplete BSS. First, the original mixing matrix is approximated using the geometry of the mixture space, in a similar fashion as geometric algorithms do in the square case. Then the sources are recovered by the usual maximum-likelihood approach with a Laplacian prior. We have also given a proof for the shortest-path recovery algorithm proposed by Bofill and Zibulevsky. Finally, we have discussed two toy examples and one real-world example for overcomplete
Fig. 9. Example 3: Scatterplot of the three mixed speech signals.
Fig. 10. Source-recovery test. Plot of the mean crosstalking error E_1(Cor(S, S')) of the correlation matrix for 100 recovery runs with 1000 samples each, for every angle α.
separation, and have taken a glimpse at an analysis of the source-recovery step.

For further research, two issues will have to be dealt with. On the one hand, the geometric algorithm for matrix-recovery will have to be improved and tested, also for higher mixing dimensions m. We are currently experimenting with an overcomplete generalization of the square 'FastGeo' algorithm [14], a histogram-based geometric
algorithm, which appears to be more stable and also faster. On the other hand, the source-recovery step has to be analyzed in more detail, especially addressing the question whether there is a natural information-theoretic barrier on how well data can be recovered in overcomplete settings.

References

[1] S. Amari, Natural gradient works efficiently in learning, Neural Comput. 10 (2) (1998) 251–276.
[2] S. Amari, A. Cichocki, H.H. Yang, A new learning algorithm for blind signal separation, Adv. Neural Inform. Process. Systems 8 (1996) 757–763.
[3] Ch. Bauer, M. Habl, E.W. Lang, C.G. Puntonet, M.R. Alvarez, Probabilistic and geometric ICA applied to the separation of EEG signals, in: M.H. Hamza (Ed.), Signal Processing and Communication (Proceedings of SPC'2000), IASTED/ACTA Press, Anaheim, USA, 2000, pp. 339–346.
[4] C.T. Bauer, C.G. Puntonet, M.R. Alvarez, E.W. Lang, Separation of related EEG signals with geometric procedures, Second International ICSC Symposium on Engineering of Intelligent Systems (EIS'2000), Paisley, Scotland, 27–30 June, 2000, pp. 104–108.
[5] A.J. Bell, T.J. Sejnowski, An information-maximisation approach to blind separation and blind deconvolution, Neural Comput. 7 (1995) 1129–1159.
[6] P. Bofill, M. Zibulevsky, Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform, in: P. Pajunen, J. Karhunen (Eds.), Independent Component Analysis and Blind Signal Separation (Proceedings of ICA'2000), Helsinki, Finland, 2000, pp. 87–92.
[7] J.F. Cardoso, B.H. Laheld, Equivariant adaptive source separation, IEEE Trans. Signal Process. 44 (12) (1996) 3017–3030.
[8] S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, Technical Report, Department of Statistics, Stanford University, Stanford, CA, 1996.
[9] P. Comon, Independent component analysis, a new concept? Signal Process. 36 (1994) 287–314.
[10] L. De Lathauwer, P. Comon, B. De Moor, J. Vandewalle, ICA algorithms for 3 sources and 2 sensors, IEEE Signal Processing Workshop on Higher-Order Statistics, Caesarea, Israel, June 14–16, 1999, pp. 116–120.
[11] G.F. Harpur, R.W. Prager, Development of low-entropy coding in a recurrent network, Network 7 (1996) 277–284.
[12] A. Hyvärinen, Fast and robust fixed-point algorithms for independent component analysis, IEEE Trans. Neural Networks 10 (3) (1999) 626–634.
[13] A. Hyvärinen, E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Comput. 9 (1997) 1483–1492.
[14] A. Jung, F.J. Theis, C.G. Puntonet, E.W. Lang, FastGeo: a histogram-based approach to linear geometric ICA, Proceedings of ICA 2001, San Diego, CA, 2001, pp. 418–423.
[15] C. Jutten, J. Herault, Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture, Signal Process. 24 (1991) 1–10.
[16] C. Jutten, J. Herault, P. Comon, E. Sorouchyari, Blind separation of sources, parts I, II and III, Signal Process. 24 (1991) 1–29.
[17] T. Lee, M.S. Lewicki, M. Girolami, T.J. Sejnowski, Blind source separation of more sources than mixtures using overcomplete representations, IEEE Signal Process. Lett. 6 (4) (1999) 87–90.
[18] M.S. Lewicki, B.A. Olshausen, Probabilistic framework for the adaptation and comparison of image codes, J. Optical Soc. America 16 (7) (1999) 1587–1601.
[19] M.S. Lewicki, T.J. Sejnowski, Learning overcomplete representations, Neural Comput. 12 (2000) 337–365.
[20] M.S. Lewicki, T.J. Sejnowski, Learning nonlinear overcomplete representations for efficient coding, in: M.I. Jordan, M.J. Kearns, S.A. Solla (Eds.), Advances in Neural Information Processing Systems, Vol. 10, The MIT Press, Cambridge, 1998.
[21] R. Linsker, An application of the principle of maximum information preservation to linear systems, in: D.S. Touretzky (Ed.), Advances in Neural Information Processing Systems, Vol. 1, Morgan Kaufmann, San Mateo, CA, 1989, pp. 186–194.
[22] R. Linsker, Local synaptic learning rules suffice to maximize mutual information in a linear network, Neural Comput. 4 (1992) 691–702.
[23] B.A. Olshausen, Learning linear, sparse, factorial codes, Technical Report AIM-1580, 1996.
[24] B.A. Olshausen, D.J. Field, Sparse coding of natural images produces localized, oriented, bandpass receptive fields, Technical Report CCN-110-95, Department of Psychology, Cornell University, Ithaca, NY 14853, 1995.
[25] C.G. Puntonet, M.R. Alvarez, A. Prieto, B. Prieto, Separation of speech signals for nonlinear mixtures, Lecture Notes in Computer Science, Vol. 1607, Springer, Berlin, 1999, pp. 665–673.
[26] C.G. Puntonet, Ch. Bauer, E.W. Lang, M.R. Alvarez, B. Prieto, Adaptive-geometric methods: application to the separation of EEG signals, in: P. Pajunen, J. Karhunen (Eds.), Independent Component Analysis and Blind Signal Separation (Proceedings of ICA'2000), Helsinki, Finland, 2000, pp. 273–278.
[27] C.G. Puntonet, A. Prieto, An adaptive geometrical procedure for blind separation of sources, Neural Process. Lett. 2 (1995).
[28] C.G. Puntonet, A. Prieto, Neural net approach for blind separation of sources based on geometric properties, Neurocomputing 18 (1998) 141–164.
[29] C.G. Puntonet, A. Prieto, C. Jutten, M.R. Alvarez, J. Ortega, Separation of sources: a geometry-based procedure for reconstruction of n-valued signals, Signal Process. 46 (1995) 267–284.
[30] F.J. Theis, A. Jung, E.W. Lang, C.G. Puntonet, A theoretic model for geometric linear ICA, Proceedings of ICA 2001, San Diego, CA, 2001, pp. 349–354.

Fabian J. Theis was born in Ansbach, Germany, on May 19, 1976. He obtained an M.Sc. degree in Mathematics in 2000, an M.Sc. degree in Physics in 2001 and his Ph.D. in Physics in 2003. He worked as a visiting researcher at the Department of Architecture and Computer Technology (University of Granada, Spain) and at the RIKEN Brain Science Institute (Wako, Japan). His research interests include statistical signal processing, linear and nonlinear independent component analysis, overcomplete blind source separation and low-dimensional manifold theory.
Elmar W. Lang was born in Regensburg, Germany, on July 19, 1951. He obtained a degree in Physics in 1977 and received his Ph.D. in 1980. In 1988 he finished his Habilitation in Biophysics and became apl. Professor at the University of Regensburg in 1994. His research interests include statistical signal processing, independent component analysis, blind source separation and medical image analysis.
Carlos G. Puntonet was born in Barcelona, Spain, on August 11, 1960. He received the B.Sc. degree in 1982, the M.Sc. degree in 1986 and his Ph.D. degree in 1994, all from the University of Granada, Spain. These degrees are in Electronics and Physics. He has been a visiting researcher at the Laboratoire de Traitement d'Images et Reconnaissance de Formes (INPG, Grenoble, France), at the Institute of Biophysics (Regensburg, Germany) and at the Institute of Physical and Chemical Research (RIKEN, Nagoya, Japan). Currently, he is an Associate Professor at the Department of Architecture and Computer Technology at the University of Granada, Spain. His research interests lie in the fields of signal processing, independent component analysis, blind separation of sources, artificial neural networks and optimization methods.