Separation of stationary and non-stationary sources with a generalized eigenvalue problem

Satoshi Hara (a,*), Yoshinobu Kawahara (a), Takashi Washio (a), Paul von Bünau (b), Terumasa Tokunaga (c), Kiyohumi Yumoto (d)

(a) Institute of Scientific and Industrial Research (ISIR), Osaka University, Osaka 567-0047, Japan
(b) Machine Learning Group, Berlin Institute of Technology, Franklinstr. 28/29, 10587 Berlin, Germany
(c) Meiji Institute for Advanced Study of Mathematical Sciences, Meiji University, Kanagawa 214-8571, Japan
(d) Space Environment Research Center, Kyushu University, Fukuoka 812-8581, Japan

Neural Networks 33 (2012) 7–20

Article history: Received 30 April 2011; Received in revised form 27 February 2012; Accepted 2 April 2012

Keywords: Blind source separation; Stationarity; Non-stationarity; Generalized eigenvalue problem; Stationary subspace analysis

Abstract

Non-stationary effects are ubiquitous in real world data. In many settings, the observed signals are a mixture of underlying stationary and non-stationary sources that cannot be measured directly. For example, in EEG analysis, electrodes on the scalp record the activity from several sources located inside the brain, which one could only measure invasively. Discerning stationary and non-stationary contributions is an important step towards uncovering the mechanisms of the data generating system. To that end, in Stationary Subspace Analysis (SSA), the observed signal is modeled as a linear superposition of stationary and non-stationary sources, where the aim is to separate the two groups in the mixture. In this paper, we propose the first SSA algorithm that has a closed form solution. The novel method, Analytic SSA (ASSA), is more than 100 times faster than the state-of-the-art, numerically stable, and guaranteed to be optimal when the covariance between stationary and non-stationary sources is time-constant. In numerical simulations on a wide range of settings, we show that our method yields superior results, even for signals with time-varying group-wise covariance. In an application to geophysical data analysis, ASSA extracts meaningful components that shed new light on the Pi 2 pulsations of the geomagnetic field.

1. Introduction Stationarity is a cornerstone assumption in several fields, including statistical learning (Quiñonero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2008), signal processing and control theory. However, real data is often non-stationary, and there are various approaches to test (Dickey & Fuller, 1979; Priestley & Rao, 1969) and correct for (Heckman, 1979; Murata, Kawanabe, Ziehe, Müller, & Amari, 2002; Quiñonero-Candela, Sugiyama, Schwaighofer, & Lawrence, 2008; Shimodaira, 2000) non-stationarities in statistical models. Examples include biomedical measurements (Blankertz et al., 2008; Shenoy, Krauledat, Blankertz, Rao, & Müller, 2006), geophysical data (Kaufmann & Stern, 2002; Mann, 2004) and econometric time series (Engle & Granger, 1987). Non-stationary effects are not just a nuisance for methodology. In fact, understanding temporal changes in data is often the main point of interest, so that discovering and describing nonstationarities in high-dimensional datasets are a key challenge




in explorative data analysis. However, in many settings, the observable data contains contributions from both stationary and non-stationary components that are not directly accessible. In electroencephalography (EEG) (Dornhege, Millán, Hinterberger, McFarland, & Müller, 2007) or seismographic data analysis (Yilmaz, 2001), for instance, measurements on the scalp or the earth's surface capture the activity of a multitude of sources located inside the brain or the earth respectively that we cannot measure directly, for technical, medical or ethical reasons. In other cases, the structure of the data generating system is even less understood, but it is more plausible to assume that the observed signal is a mixture of several underlying factors, e.g. as in stock market analysis. Some of these latent factors may be stationary while others are non-stationary. The existence of stationary sources is not discernible from the mixed signals: a single non-stationary source can render all variables of a multivariate time series non-stationary and thus mask the presence of time-invariant behavior. Conversely, non-stationary components with low signal power can remain hidden among strong stationary sources. It is therefore important to discern the stationary and the non-stationary group of components in the mixed signals. However, standard Blind Source Separation (BSS) methods (Hyvärinen, Karhunen, & Oja, 2001; Lee & Seung,


2001; Ziehe, Laskov, Nolte, & Müller, 2004) are not helpful in this respect: BSS algorithms such as Independent Component Analysis (ICA) (Hyvärinen et al., 2001) separate sources by independence, not by stationarity or non-stationarity. In particular, the stationary and non-stationary sources need not be independent. To that end, the Stationary Subspace Analysis (SSA) paradigm (von Bünau, Meinecke, Király, & Müller, 2009a) has been proposed. In the SSA model, the observed time series x(t) is generated as a linear mixture of stationary sources s^s(t) and non-stationary sources s^n(t) with a time-constant mixing matrix A,

\[ x(t) = A \begin{bmatrix} s^s(t) \\ s^n(t) \end{bmatrix} \]

and the aim is to recover these two groups of underlying sources given only samples from x(t ). The separation of stationary and non-stationary sources is useful in many circumstances. First of all, SSA can uncover stationary components in seemingly nonstationary time series. Moreover, SSA allows to study the stationary and the non-stationary part independently: e.g. for changepoint detection (Kohlmorgen, Lemm, Müller, Liehr, & Pawelzik, 1999; Siegmund & Venkatraman, 1995), contributions from the stationary sources are not informative and can be removed to reduce the number of dimensions (Blythe, von Bünau, Meinecke, & Müller, 2012). Conversely, one may be interested in the estimated stationary signals that reflect constant relationships between variables (Engle & Granger, 1987; Meinecke, von Bünau, Kawanabe & Müller, 2009). Moreover, if the channels of the time series x(t ) are spatially distributed, the estimated mixing matrix Aˆ can be visualized to reveal the characteristic patterns of stationary and non-stationary contributions, e.g. as in EEG analysis (Dornhege et al., 2007; von Bünau, Meinecke, Scholler & Müller, 2010). In this paper, we propose a novel SSA algorithm, Analytic SSA (ASSA), where the solution is obtained by solving a generalized eigenvalue problem. The solution to ASSA is guaranteed to be optimal under the assumption that the covariance between stationary and non-stationary sources is time-constant. Thanks to the analytic form, the algorithm requires a much lower computational cost than the state-of-the-art method KL-SSA (von Bünau et al., 2009a), does not require the selection of algorithmic parameters such as step size or convergence criterion, and is numerically stable. Moreover, ASSA finds a sequence of projections, ordered by their degree of stationarity, and therefore does not need to repeat the procedure for different numbers of stationary sources. In our simulations on synthetic data, we demonstrate that ASSA outperforms KL-SSA and ICA over a wide range of settings, even when the covariance between stationary and non-stationary sources changes over time. Moreover, we apply ASSA to geomagnetic data, namely Pi 2 pulsation time series (Tokunaga, Kohta, Yoshikawa, Uozumi, & Yumoto, 2007; Yumoto et al., 2001), which are highly non-stationary and known to involve several sources corresponding to the geophysical mechanisms. In this case, the independence assumption of ICA is not suitable to recover the sources of interest. ASSA successfully decomposes the signals into meaningful global and local modes, which is in agreement with geophysical theory, and more plausible than the decomposition obtained by ICA (Tokunaga et al., 2007). The remainder of this paper is organized as follows. First of all, we introduce the SSA model and the state-of-the-art algorithm KL-SSA in Section 2. In Section 3, we derive our novel methods ASSA and study its theoretical properties. The relationship to similar methods is discussed in Section 4. Section 5 contains extensive numerical simulations to show its validity and a comparison to KL-SSA and ICA. The application to geophysical data analysis is presented in Section 6. Our conclusion and outlook are summarized in the last Section 7.
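To make the mixing picture concrete, here is a toy sketch with made-up numbers (not taken from the paper's experiments): a single non-stationary source mixed with a stationary one renders both observed channels non-stationary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# one stationary source (fixed distribution) and one non-stationary source
# (its variance changes halfway through the recording)
s_stat = rng.standard_normal(n)
s_nonstat = np.concatenate([rng.normal(0.0, 0.5, n // 2),
                            rng.normal(0.0, 3.0, n // 2)])

# a time-constant mixing matrix A makes *both* observed channels non-stationary
A = np.array([[1.0, 0.8],
              [0.6, 1.0]])
X = np.stack([s_stat, s_nonstat], axis=0)
observed = A @ X        # shape (2, n); SSA aims to undo this mixing
```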

2. Stationary subspace analysis

Stationary Subspace Analysis (SSA) models the observed signal x(t) ∈ R^d as a linear superposition of stationary sources s^s(t) ∈ R^m and non-stationary sources s^n(t) ∈ R^{d−m} (von Bünau et al., 2009a),

\[ x(t) = A s(t) = \begin{bmatrix} A^s & A^n \end{bmatrix} \begin{bmatrix} s^s(t) \\ s^n(t) \end{bmatrix} \tag{1} \]

where A is a time-constant invertible mixing matrix. We refer to the span of A^s ∈ R^{d×m} and A^n ∈ R^{d×(d−m)} as the stationary and non-stationary subspace respectively. The aim of SSA is to factorize the observed time series x(t) into stationary and non-stationary sources. That is, SSA estimates the inverse mixing matrix A^{-1} as B = [B^{s⊤} B^{n⊤}]^⊤ such that ŝ^s(t) = B^s x(t) and ŝ^n(t) = B^n x(t) are the estimated stationary and non-stationary sources respectively. We refer to the matrices B^s ∈ R^{m×d} and B^n ∈ R^{(d−m)×d} as the stationary and non-stationary projection respectively.

The demixing matrix B is not unique, because the factorization into a group of stationary and a group of non-stationary sources is not unique (von Bünau et al., 2009a; von Bünau, Meinecke, Király, & Müller, 2009b). First of all, any linear transformation within the two groups of sources yields another valid demixing. Second, adding stationary components to the estimated non-stationary sources leaves their non-stationary nature intact, whereas the converse is not true. This means that we cannot identify the true non-stationary sources s^n(t) from the mixing. Formally, if we apply the demixing to the mixed sources,





\[ \begin{bmatrix} \hat{s}^s(t) \\ \hat{s}^n(t) \end{bmatrix} = B A s(t) = \begin{bmatrix} B^s A^s & B^s A^n \\ B^n A^s & B^n A^n \end{bmatrix} \begin{bmatrix} s^s(t) \\ s^n(t) \end{bmatrix} \tag{2} \]

we see that by the preceding argument, a solution to the SSA problem is fully characterized by the condition Bs An = 0m×(d−m) , i.e. a stationary projection Bs must eliminate all nonstationary contributions in the estimated stationary sources. This is equivalent to the condition that the rows of the stationary projection are orthogonal on the non-stationary subspace, span(Bs⊤ )⊥span(An ), where span(∗) denotes the column span of a matrix. In terms of subspaces, this means that the orthogonal complement of the estimated stationary projection is equal to the true nonstationary subspace. Thus we conclude that we can identify the true stationary sources ss (t ) (up to the linear transformation Bs As ) and, equivalently, the true non-stationary subspace. On the other hand, the recovered sources sˆ n (t ) are kept non-stationary for several different values of Bn As and therefore the true nonstationary sources and the true stationary subspace are not identifiable (von Bünau et al., 2009a, 2009b). Note that the SSA model (1) itself does not specify a notion of stationarity. Both the KL-SSA algorithm (von Bünau et al., 2009a) and our novel ASSA algorithm are based on the so-called weak stationarity (Hamilton, 1994). A possible extension would be to take time structure into account, i.e. the delayed covariance or the autocorrelation (Hamilton, 1994). The notion of stationarity is usually determined by the application domain and numerical considerations. 2.1. The KL-SSA algorithm The first SSA algorithm (von Bünau et al., 2009a), that we will refer to as KL-SSA,1 is based on the notion of weak stationarity (Hamilton, 1994) without time structure. That is, a time series u(t ) is considered stationary, if its mean and covariance remains

1 KL stands for the Kullback–Leibler divergence (Kullback & Leibler, 1951), which is used to measure the stationarity of the estimated sources by comparing epoch distributions.


Fig. 1. Illustrative example of spurious stationarity in d = 2. (a) Given two Gaussians with equal means (two ellipsoids), there may exist more than one projection direction on which projected distributions are equal. (b) For three Gaussians (three ellipsoids), this is no longer the case.

constant over time, i.e.

\[ \mathbb{E}[u(t)] = \mathbb{E}[u(t+\tau)], \qquad \mathbb{E}\!\left[u(t)u(t)^\top\right] = \mathbb{E}\!\left[u(t+\tau)u(t+\tau)^\top\right] \]

for all t, τ ∈ R. In order to apply this criterion in practice, where we cannot estimate the mean and covariance at a single time point, we divide the n samples {u(t); t = 1, ..., n} into N epochs with consecutive index sets T_1, ..., T_N where ∪_{i=1}^N T_i = {1, ..., n}. We then consider the time series u(t) to be stationary if the corresponding epoch means µ_1, ..., µ_N and covariance matrices Σ_1, ..., Σ_N are identical, i.e.

\[ \mu_i = \mu_j \quad \text{and} \quad \Sigma_i = \Sigma_j \]

for all pairs of epochs 1 ≤ i, j ≤ N. Now this formulation involves O(N²) equality conditions between epochs, which we can reduce to O(N) by using the equivalent condition that each epoch's mean and covariance matrix is equal to the average,

\[ \mu_i = \bar{\mu} \quad \text{and} \quad \Sigma_i = \bar{\Sigma}, \quad 1 \leq i \leq N \tag{3} \]

where µ̄ = (1/N) Σ_{i=1}^N µ_i and Σ̄ = (1/N) Σ_{i=1}^N Σ_i are the average epoch mean and covariance matrix respectively.

Let us now turn to the algorithm which finds the stationary projection according to this definition. We have observed samples from the time series x(t) = As(t) which we have divided into N epochs along the time index. The choice of epochs T_1, ..., T_N (e.g. non-overlapping consecutive blocks or sliding window) is a model parameter that is selected by the user according to the specific application. For example, the epoch length determines the time-scale on which non-stationarities can be detected or it may be desirable to align the epochs to an experimental paradigm. The number of epochs N needs to be large enough in order to avoid spurious solutions; see Section 2.2 for a lower bound that guarantees this.

The aim of KL-SSA is to find the stationary projection B^s such that the estimated stationary sources ŝ^s(t) = B^s x(t) are weakly stationary according to the condition (3). Since the first two moments of the estimated stationary sources ŝ^s(t) can be written as the projected moments of the input x(t), this means that we aim to find B^s such that

\[ B^s \mu_i = B^s \bar{\mu} \quad \text{and} \quad B^s \Sigma_i B^{s\top} = B^s \bar{\Sigma} B^{s\top} \tag{4} \]

for all epochs 1 ≤ i ≤ N. In order to find this projection B^s, KL-SSA aims to minimize the distance between each epoch mean and covariance matrix and their respective averages. This distance is measured using the Kullback–Leibler divergence D_KL (Kullback & Leibler, 1951) between Gaussians N(µ, Σ). According to the Maximum Entropy Principle (Jaynes, 1957), this is the least restrictive distributional assumption. Since stationary sources can only be determined up to a linear transformation, we can require that B^s Σ̄ B^{s⊤} = I_m without loss of generality, where I_m is the identity matrix of size m. This constraint determines the scaling, avoids degenerate solutions and reduces the number of parameters in the optimization problem. The KL-SSA optimization problem (von Bünau et al., 2009a) is

\[
\begin{aligned}
\hat{B}^s_{\mathrm{KL}} &= \operatorname*{argmin}_{B^s \in \mathbb{R}^{m \times d}} \frac{1}{N} \sum_{i=1}^{N} D_{\mathrm{KL}}\!\left[ \mathcal{N}\!\left(B^s \mu_i, B^s \Sigma_i B^{s\top}\right) \,\middle\|\, \mathcal{N}\!\left(B^s \bar{\mu}, B^s \bar{\Sigma} B^{s\top}\right) \right] \\
&= \operatorname*{argmin}_{B^s \in \mathbb{R}^{m \times d}} \frac{1}{N} \sum_{i=1}^{N} \left[ \left\| B^s(\mu_i - \bar{\mu}) \right\|^2 - \log\det\!\left( B^s \Sigma_i B^{s\top} \right) \right]
\quad \text{subject to } B^s \bar{\Sigma} B^{s\top} = I_m. 
\end{aligned} \tag{5}
\]

This optimization problem is non-convex and a local solution is found using a gradient-based method (Amari, 1998; Avriel, 2003; Plumbley, 2005); see Müller, von Bünau, Meinecke, Király and Müller (2011) for an implementation. Note that the population statistics in (5) are replaced with sample estimators in practice. In this context, advanced techniques such as the exponentially weighted moving average (Montgomery, 2007; Roberts, 1959) would be helpful to obtain more accurate estimates; we use naive estimators² for the simulation in Section 5, because we are primarily interested in comparing the performance of SSA algorithms.
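As a minimal sketch of such naive epoch-wise estimators (Python/NumPy; function and variable names are illustrative and not the authors' Matlab implementation):

```python
import numpy as np

def epoch_statistics(X, epochs):
    """Naive epoch-wise estimators of the mean and covariance.

    X      : array of shape (n, d), one sample x(t) per row
    epochs : list of index arrays T_1, ..., T_N (e.g. consecutive blocks)
    """
    means = [X[idx].mean(axis=0) for idx in epochs]
    covs = [np.cov(X[idx], rowvar=False) for idx in epochs]  # unbiased, 1/(|T_i|-1)
    return np.array(means), np.array(covs)

# Example: split 5000 samples into N = 100 non-overlapping consecutive epochs
# epochs = np.array_split(np.arange(5000), 100)
```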

2.2. Spurious stationarity in the KL-SSA algorithm

The feasibility of SSA depends on the number of non-stationary sources d − m and the number of epochs N. If the number of epochs with a distinct distribution of the non-stationary sources is too small, there exist directions in the non-stationary subspace on which the projected moments match—these are called spurious stationary projections; see Fig. 1 for an example. The existence of spurious stationary projections renders the solution to SSA unidentifiable. The following theorem (von Bünau et al., 2009b) tells us how many distinct epochs are necessary in order to guarantee that there are no spurious stationary projections in the generic case.

Theorem 1 (Spurious Stationarity in KL-SSA). For the KL-SSA algorithm, given a d-dimensional signal with m stationary sources, the number of distinct epochs N required to avoid the existence of spurious stationary projections is

\[ N > \frac{d-m}{2} + 2. \tag{6} \]

In the special case when the mean is known to be constant for all epochs, this becomes

\[ N > d - m + 1. \tag{7} \]

² The estimators for the epoch mean and covariance are given by

\[ \hat{\mu}_i = \frac{1}{|T_i|} \sum_{t \in T_i} x(t) \quad \text{and} \quad \hat{\Sigma}_i = \frac{1}{|T_i| - 1} \sum_{t \in T_i} \left( x(t) - \hat{\mu}_i \right)\left( x(t) - \hat{\mu}_i \right)^\top. \]
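Given these epoch estimates, the KL-SSA objective (5) for a candidate projection B^s can be evaluated along the following lines (a sketch; it assumes B^s already satisfies the scaling constraint B^s Σ̄ B^{s⊤} = I_m, and all names are illustrative):

```python
import numpy as np

def kl_ssa_objective(Bs, epoch_means, epoch_covs):
    """Value of the KL-SSA objective (5) with population statistics replaced
    by the epoch-wise sample estimates (epoch_means: N x d, epoch_covs: N x d x d)."""
    mu_bar = np.mean(epoch_means, axis=0)           # average epoch mean
    value = 0.0
    for mu_i, Sigma_i in zip(epoch_means, epoch_covs):
        diff = Bs @ (mu_i - mu_bar)                  # squared-mean term
        value += diff @ diff - np.linalg.slogdet(Bs @ Sigma_i @ Bs.T)[1]
    return value / len(epoch_means)
```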


Note that in practice, having more epochs of sufficient length is always desirable, as we will see in the results of the simulations in Section 5. Unless the number of samples in each epoch becomes too small, additional epochs provide more information about the variation in the non-stationary subspace which makes it easier to identify.

3. Analytic SSA

The KL-SSA optimization problem (5) is not convex; a local minimum is found by a gradient-based search procedure. Thus the solution depends on the choice of initial values and algorithmic parameters. Moreover, our stability analysis in Appendix A.3 revealed that the objective function is very flat in the neighborhood of the global solution. This leads to slow convergence and adds to the computational cost, which is magnified by the need to repeat the optimization to avoid local minima. In a fixed-point formulation (see Appendix A.2), KL-SSA requires O(rNm(m² + md + d²)) operations to solve (5), where r is the number of iterations. This computational complexity limits the algorithm's practical utility on large and high-dimensional datasets. In particular, KL-SSA requires the number of stationary sources m to be prespecified, so that it needs to be run repeatedly in order to explore the results for a range of values.

In order to overcome these limitations, we propose a novel SSA algorithm, called Analytic SSA³ (ASSA). Based on an approximate upper bound of the KL-SSA objective function (5), it is formulated as a generalized eigenvalue problem, which can be solved efficiently. As such, ASSA does not require any initialization nor algorithmic parameters. In particular, we can show that the solution is optimal when stationary and non-stationary sources have time-constant group-wise covariance. Even when this is not the case, our numerical simulations show that ASSA yields very good results (see Section 5).

³ Preliminary work has been presented in Hara, Kawahara, Washio, and von Bünau (2010), where the ASSA objective function (9) is derived heuristically.

3.1. Analytic SSA objective function

The ASSA objective function is based on the following approximate upper bound of the log-term in the KL-SSA objective function (5).

Theorem 2 (Approximate Upper Bound of KL-SSA).⁴ Let us denote the unconstrained log-term in the KL-SSA objective function (5) by f(B^s),

\[ f(B^s) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\det\!\left( B^s \Sigma_i B^{s\top} \right)}{\det\!\left( B^s \bar{\Sigma} B^{s\top} \right)} \]

and let B^s_* be one of the true stationary projections that satisfies (4) and the additional constraint B^s_* Σ̄ B^{s⊤}_* = I_m. Then, the second order Taylor approximation of f(B^s) in the neighborhood of the solution B^s_* is upper bounded by the function g(B^s) defined as

\[ g(B^s) = \frac{2}{N} \sum_{i=1}^{N} \operatorname{Tr}\!\left[ B^s \left( \Sigma_i - \bar{\Sigma} \right) \bar{\Sigma}^{-1} \left( \Sigma_i - \bar{\Sigma} \right) B^{s\top} \right] \tag{8} \]

under the constraint that B^s Σ̄ B^{s⊤} = I_m, where Tr[·] denotes the trace of a matrix.

⁴ Note that Theorems 2 and 3 are valid only when the true epoch population means and covariances are available. In practice, we replace those statistics with sample estimators, which might lead to a biased result. Nonetheless, as we see in Section 5, ASSA shows a significant improvement of the resulting errors over KL-SSA.

The proof of this theorem is given in Appendix E.1. Using this bound, we formulate the following ASSA objective function by replacing the log term in (5),

\[
\begin{aligned}
\hat{B}^s_{\mathrm{A}} &= \operatorname*{argmin}_{B^s \in \mathbb{R}^{m \times d}} \frac{1}{N} \sum_{i=1}^{N} \left[ \left\| B^s (\mu_i - \bar{\mu}) \right\|^2 + 2 \operatorname{Tr}\!\left( B^s \left( \Sigma_i - \bar{\Sigma} \right) \bar{\Sigma}^{-1} \left( \Sigma_i - \bar{\Sigma} \right) B^{s\top} \right) \right] \\
&= \operatorname*{argmin}_{B^s \in \mathbb{R}^{m \times d}} \operatorname{Tr}\!\left[ B^s S B^{s\top} \right] \quad \text{subject to } B^s \bar{\Sigma} B^{s\top} = I_m
\end{aligned} \tag{9}
\]

where the matrix S is given by

\[ S = \frac{1}{N} \sum_{i=1}^{N} \left( \mu_i \mu_i^\top + 2\, \Sigma_i \bar{\Sigma}^{-1} \Sigma_i \right) - \bar{\mu}\bar{\mu}^\top - 2\bar{\Sigma}. \tag{10} \]

This objective function can be interpreted as the variance of the mean and covariance across all epochs. The next result ensures the optimality of our approach.

Theorem 3 (Optimality of ASSA). The ASSA solution B̂^s_A is optimal, i.e. span(B̂^{s⊤}_A) = span(B^{s⊤}_*), when the covariance between stationary and non-stationary sources is time-constant.

The proof is given in Appendix E.2. Moreover, it can be shown that the case of time-constant covariance between stationary and non-stationary sources can be reduced to the equivalent SSA model with group-wise uncorrelated sources; see Lemma 2 in Appendix E. This result suggests a canonical choice for the estimated non-stationary projection, which is not identifiable in general (see Section 2): the non-stationary projection B^n is chosen such that the estimated sources are group-wise uncorrelated, i.e. B^n Σ̄ B^{s⊤} = 0_{(d−m)×m}. From (2), we see that this is equivalent to the condition

\[ \operatorname{span}\!\left(B^{n\top}\right) \perp \operatorname{span}\!\left(A^s\right). \tag{11} \]

Thus we conclude that if the stationary and non-stationary sources have time-constant covariance, we can identify the non-stationary sources s^n(t) in the equivalent group-wise uncorrelated model (up to the linear transformation B^n A^n) and, from (11), it follows that under this condition the stationary subspace span(A^s) can also be identified. In Appendix D, we provide further discussion of the case when the time-constant covariance assumption is not fulfilled and how the optimality of ASSA is skewed.
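A small sketch of how the matrix S of (10) can be assembled from the epoch statistics (illustrative names; it assumes the average covariance Σ̄ is invertible):

```python
import numpy as np

def assa_S_matrix(epoch_means, epoch_covs):
    """Build the matrix S of Eq. (10) from epoch means mu_i and covariances Sigma_i."""
    mu = np.asarray(epoch_means)          # shape (N, d)
    Sig = np.asarray(epoch_covs)          # shape (N, d, d)
    mu_bar = mu.mean(axis=0)
    Sig_bar = Sig.mean(axis=0)
    Sig_bar_inv = np.linalg.inv(Sig_bar)

    S = np.zeros_like(Sig_bar)
    for mu_i, Sig_i in zip(mu, Sig):
        S += np.outer(mu_i, mu_i) + 2.0 * Sig_i @ Sig_bar_inv @ Sig_i
    S = S / len(mu) - np.outer(mu_bar, mu_bar) - 2.0 * Sig_bar
    return S, Sig_bar
```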

3.2. ASSA as a generalized eigenvalue problem

The optimization problem (9) is known to be equivalent to the minimization of the generalized Rayleigh quotient Tr[(B^s S B^{s⊤})(B^s Σ̄ B^{s⊤})^{-1}], which appears in several other BSS problems (Jolliffe, 1986; Mardia, Kent, & Bibby, 1979). The solution is found efficiently by solving the corresponding generalized eigenvalue problem. The Lagrangian of the ASSA problem (9) is given by

\[ L(B^s, \Lambda) = \operatorname{Tr}\!\left[ B^s S B^{s\top} \right] - \operatorname{Tr}\!\left[ \Lambda \left( B^s \bar{\Sigma} B^{s\top} - I_m \right) \right] \]

where Λ ∈ R^{m×m} is the matrix of Lagrange multipliers. By setting its derivative to zero, we obtain the following generalized eigenvalue problem,

\[ S \varphi = \lambda \bar{\Sigma} \varphi. \]

The solution to this problem is a set of generalized eigenvalues λ_j and generalized eigenvectors φ_j, {λ_j, φ_j}_{j=1}^d. The generalized eigenvectors are Σ̄-orthonormal to each other, i.e. φ_j^⊤ Σ̄ φ_{j′} = 1 if j = j′ and 0 otherwise. This Σ̄-orthogonality is equivalent

to the uncorrelatedness among the recovered sources φ_j^⊤ x(t). Hence, each generalized eigenvalue corresponds to the value of the ASSA objective function, λ_j = φ_j^⊤ S φ_j / φ_j^⊤ Σ̄ φ_j. Let λ_1 ≤ λ_2 ≤ · · · ≤ λ_d be the generalized eigenvalues in ascending order. Then the estimated stationary projection B̂^s_A is given by the m eigenvectors with the m smallest eigenvalues,

\[ \hat{B}^s_{\mathrm{A}} = \begin{bmatrix} \varphi_1 & \cdots & \varphi_m \end{bmatrix}^\top \]

and the non-stationary projection B̂^n_A consists of the remaining ones,

\[ \hat{B}^n_{\mathrm{A}} = \begin{bmatrix} \varphi_{m+1} & \cdots & \varphi_d \end{bmatrix}^\top. \]

This solution can be interpreted from a deflation point of view (Hyvärinen et al., 2001), where the ASSA objective function values (eigenvalues) λ_i are interpreted as a non-stationarity score. In the deflation approach, the stationary projections are determined incrementally. In the first step, we select the direction with minimum non-stationarity score λ_1. The (k+1)-th stationary projection is then found in the Σ̄-orthogonal complement of the previously determined stationary projections φ_1, ..., φ_k. Thus, in each step, the dimensionality of the input space is deflated by projecting out the newly found stationary projection φ_{k+1}. In particular, note that the ASSA solution is uniquely determined if all eigenvalues λ_j are different, whereas the KL-SSA solution is unique only up to linear transformations.
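Putting (9)–(10) together, the whole ASSA estimate reduces to one call to a symmetric-definite generalized eigensolver. The following sketch assumes the helper assa_S_matrix from the previous sketch and uses SciPy's eigh; all names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def assa(epoch_means, epoch_covs, m):
    """Analytic SSA sketch: solve S phi = lambda * Sigma_bar phi and split the
    eigenvectors into stationary (m smallest eigenvalues) and non-stationary
    projections."""
    S, Sig_bar = assa_S_matrix(epoch_means, epoch_covs)
    S = 0.5 * (S + S.T)                      # symmetrize against round-off
    # eigh solves the symmetric-definite generalized problem; eigenvalues ascend
    eigvals, eigvecs = eigh(S, Sig_bar)
    Bs = eigvecs[:, :m].T                    # stationary projection, shape (m, d)
    Bn = eigvecs[:, m:].T                    # non-stationary projection, shape (d-m, d)
    return Bs, Bn, eigvals
```

Each entry of eigvals is the non-stationarity score λ_j of the corresponding direction, so a single run provides the solutions for every candidate number of stationary sources m.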

3.3. Spurious stationarity in ASSA

As in KL-SSA, we need a certain number of distinct epochs in order to avoid the existence of spurious stationary projections, which render the solution unidentifiable (see Section 2.2). We show that this minimum number of epochs is smaller for ASSA, which is useful in practice where data tends to be scarce.

Theorem 4 (Spurious Stationarity in ASSA). If the covariance between stationary and non-stationary sources is time-constant, the number of epochs N required to guarantee that there exist no spurious stationary projections in d dimensions with m stationary sources is

\[ N \geq \frac{2(d-m+1)}{\ell+1} \tag{12} \]

where ℓ = (1/N) Σ_{i=1}^N rank(Σ_i − Σ̄). In the special case where the mean is constant over all epochs, this bound becomes

\[ N \geq \frac{2(d-m)+1}{\ell}. \tag{13} \]

The proof of this theorem is given in Appendix E.3. The requirement of ASSA (12) is looser than that of KL-SSA (6) when ℓ ≥ 3(1 − 4/(d−m+4)). Since ℓ is the average number of non-stationary sources with different variances among epochs, we can assume ℓ ≈ d − m in practice, and the inequality holds. Again, note that this theorem merely indicates the minimum number of distinct epochs that are necessary to guarantee determinacy; having more epochs is always desirable to improve the accuracy of the solution.

3.4. Computational complexity

The ASSA algorithm consists of three steps: (1) the estimation of the N epoch mean vectors and covariance matrices; (2) the computation of the matrix S; (3) the solution of the generalized eigenvalue problem. Let n be the total number of samples, n = Σ_{i=1}^N |T_i|. Then the first step is in O(nd²), the second step is in O(Nd³) and the solution of the generalized eigenvalue problem requires O(d³) operations, so that the overall complexity is O(nd² + Nd³). The overall computational complexity of KL-SSA, when formulated as a fixed point algorithm, is of the order O(nd² + Nd³ + rNm(m² + md + d²)) (see Appendix A.2), where r is the number of optimization steps, which is expected to be large (e.g. r > 100) since KL-SSA converges slowly due to the flatness of its objective function around the true solution. ASSA is clearly computationally advantageous, which is an important property for an algorithm that is used in the context of explorative data analysis, where results need to be obtained quickly and for different settings.

3.5. Choosing the number of stationary sources

In practice, the number of stationary sources m may be unknown and needs to be chosen from the available data. Whereas the KL-SSA algorithm requires the number m to be specified, so that testing every value 1 ≤ m ≤ d would require d − 1 independent runs of the algorithm, ASSA finds all possible stationary projections in a single step, ordered by their stationarity score. We can then treat the evaluation of each projection in a post-processing stage, independently from the source separation. Since the ASSA objective function (9) takes the value zero for truly stationary sources, one would expect to see a significant jump of the eigenvalue at some level. However, in our empirical studies, we have found that small errors in the estimation of the stationary projections accumulate in the eigenvalues, which makes this jump less pronounced. Apart from the visual inspection of eigenvalues, there exists a wide range of different procedures for testing stationarity (Dickey & Fuller, 1979; Priestley & Rao, 1969) for various types of signals and applications, which is the more suitable approach in practice.
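As one possible way to operationalize the visual inspection mentioned above, the largest relative jump in the sorted scores can serve as a first guess for m; this heuristic is not part of the original algorithm and should be confirmed by a stationarity test:

```python
import numpy as np

def suggest_num_stationary(eigvals):
    """Heuristic: place the cut where consecutive ASSA scores jump the most.

    eigvals are the generalized eigenvalues in ascending order; the return
    value is only a candidate for m."""
    lam = np.asarray(eigvals)
    ratios = lam[1:] / np.maximum(lam[:-1], 1e-12)   # relative jumps
    return int(np.argmax(ratios)) + 1
```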

4. Relation to previous work

4.1. Independent component analysis

Independent Component Analysis (ICA) (Hyvärinen et al., 2001) finds independent sources from a linear mixture, whereas SSA separates sources by stationarity or non-stationarity. That is, in the ICA mixing model x(t) = As(t), the sources s(t) are assumed to be independent, whereas the general SSA model (1) merely presupposes that there exists a group of stationary and a group of non-stationary sources, which may have arbitrary dependence structure among and between themselves.

In order to solve the ICA problem, three major properties of sources are used (Hyvärinen et al., 2001): non-Gaussianity (Cardoso & Souloumiac, 1993; Comon, 1994; Hyvärinen, 1999), autocorrelation (Congedo, Gouy-Pailler, & Jutten, 2008; Molgedey & Schuster, 1994; Tong, Liu, Soon, & Huang, 1991), and non-stationarity of the variance (Hyvärinen, 2002; Kawamoto, Matsuoka, & Ohnishi, 1998; Matsuoka, Ohoya, & Kawamoto, 1995; Parra & Sajda, 2003; Pham & Cardoso, 2001). The third criterion, non-stationarity, has some resemblance to the ASSA and KL-SSA approach. However, these algorithms impose independence on the sources and do not consider changes of the mean. Moreover, for ASSA, we have shown that it is optimal in the case of time-constant group-wise covariance, which is a less restrictive assumption than the pair-wise independence of the ICA model. We further show the practical distinction of non-stationarity based ICA from the SSA problem in the simulated experiment in Section 5.

Apart from the differences of the underlying models, there are some prior works close to ASSA in the ICA context. For example, Parra and Sajda (2003) have formulated non-stationarity based ICA as a generalized eigenvalue problem. They divide samples into two epochs and diagonalize the sample covariance matrices from each epoch simultaneously by solving a generalized eigenvalue problem. The major difference of their approach to ASSA is that the generalized version, joint diagonalization of N covariance


Fig. 2. Examples for candidate processes: (a) Gaussian; (b) ARMA(3, 3); (c) Gaussian (6–20 change points); (d) Lorenz95 + white noise; (e) sound, music and voice (9 kinds); (f) constant (6–20 change points). (a) and (b) are candidates for stationary sources and (c), (d) and (e) are candidates for non-stationary sources. When (e) is chosen, one of nine recordings is assigned randomly. (b), (d) and (f) are candidates for the time-varying covariance structure (see Appendix B for further details).

matrices from N epochs (Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Cardoso & Souloumiac, 1993; Choi & Cichocki, 2000; Moreau, 2001; Pham & Cardoso, 2001), cannot be solved by a generalized eigenvalue problem and requires solving much more computationally expensive non-convex optimization problems. Here, we point out that ASSA can be interpreted as a modified version of the above algorithm for the SSA model. In the ASSA context, the non-stationary independent sources of ICA are replaced with stationary and non-stationary sources with time-constant group-wise covariance. This difference changes the problem from joint diagonalization to joint block-diagonalization (Abed-Meraim & Belouchrani, 2004; Belouchrani, Amin, & Abed-Meraim, 1997; Flury & Neuenschwander, 1994; Theis & Inouye, 2006), which results in the ASSA algorithm under the condition that all epoch means are constant. We present further details in Appendix C.

4.2. Supervised dimensionality reduction

The aim of supervised dimensionality reduction, or feature selection, is to find components that are informative for solving a classification or regression task. A common approach is to maximize the difference between the distributions of each class (Blankertz et al., 2008; Blankertz, Tomioka, Lemm, Kawanabe, & Müller, 2007; Fisher, 1936; Fukunaga, 1990), where the most prominent method is Linear Discriminant Analysis (LDA) (Fisher, 1936; Fukunaga, 1990). LDA finds the direction on which the distance between the class means is maximal, under the metric induced by the common covariance matrix. Common Spatial Patterns (CSP) (Blankertz et al., 2008, 2007; Grosse-Wentrup & Buss, 2008; Koles, 1991), a popular method in EEG analysis (Dornhege et al., 2007), finds the projections such that the difference in variance between two classes is maximized. If we interpret each epoch T_1, ..., T_N of the data x(t) in ASSA as samples from a different class, finding the most non-stationary components is similar to maximizing the difference among class distributions as in CSP and LDA. ASSA can therefore be understood as a generalization of LDA and CSP, because it takes both the mean and the variance into account. In particular, LDA is included as a special case of ASSA where all epoch covariances are equal.

5. Numerical simulations

5.1. Dataset description

In this section, we investigate the performance of the proposed ASSA algorithm and some existing methods using artificial data generated according to the SSA mixing model (1). In order to evaluate the behavior of the algorithms in a realistic setting, we use several types of different sources; see Fig. 2 for an overview. For the stationary sources, these are (a) the i.i.d. Gaussian N(µ^s, Σ^s) and (b) the ARMA (Autoregressive Moving Average) (Hamilton, 1994) model of order (3, 3). The parameters of these two models are chosen as follows. Each element of the mean µ^s and of the factor L^s ∈ R^{d×d} of the covariance matrix Σ^s = L^{s⊤}L^s is sampled from the standard normal distribution N(0, 1). The ARMA coefficients are also randomly drawn from Gaussian distributions; if the resulting set of parameters produces a non-stationary process, we discard it and regenerate from the Gaussian until the resulting process is stationary. For the non-stationary sources, we evaluate (c) an i.i.d. Gaussian model with 6–20 change points N(µ_i^n, Σ_i^n) with parameters determined as before, (d) the chaotic Lorenz95 (Lorenz & Emanuel, 1998) process plus white noise, and (e) nine different kinds of real recordings of environmental sounds, music and voices.⁵ The initial values of the Lorenz95 process and the nine real recordings are also selected at random.

We also investigate the effect of time-varying covariance between the stationary and non-stationary sources, i.e. the case where the optimality of ASSA is not guaranteed. Therefore, we introduce the following model on the non-stationary sources,

\[ s^n(t) = s^{n\prime}(t) + C(t)\, s^s(t), \tag{14} \]

where s^{n′}(t) are non-stationary sources that are uncorrelated with the stationary sources s^s(t). A time-varying covariance structure between the stationary and the non-stationary sources is induced by the matrix C(t) ∈ R^{(d−m)×m}, which is parametrized by a correlation parameter c. It bounds the amplitude of the canonical correlations (Mardia et al., 1979) between the two groups of sources and ranges from zero (correlation is constant) to one (correlation varies from −1 to 1). The details of the data generation can be found in Appendix B. We set the dimensionality of the observed signal to d = 10, the number of stationary sources to m = 5 and the total number of available samples to 5000, which are divided into non-overlapping consecutive epochs T_1, ..., T_N, where we vary their number N in the simulations.
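The following toy sketch illustrates the mixing recipe of (14) in a simplified form: Gaussian stationary sources, change-point Gaussian non-stationary sources, and a crude stand-in for the coupling C(t) (the full construction is given in Appendix B; all names and numbers here are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n, c = 10, 5, 5000, 0.3          # dimensions, samples, correlation parameter

# stationary sources: i.i.d. Gaussian with random mean and covariance factor
Ls = rng.standard_normal((m, m))
ss = rng.standard_normal((n, m)) @ Ls + rng.standard_normal(m)

# non-stationary sources before coupling: Gaussian with 6-20 change points
sn_prime = np.empty((n, d - m))
for block in np.array_split(np.arange(n), rng.integers(6, 21)):
    Ln = rng.standard_normal((d - m, d - m))
    sn_prime[block] = rng.standard_normal((len(block), d - m)) @ Ln + rng.standard_normal(d - m)

# time-varying coupling C(t) as in Eq. (14); here a single random matrix scaled
# by a slowly varying gain bounded by c (a rough stand-in for Appendix B)
C0 = rng.standard_normal((d - m, m))
C0 *= c / np.linalg.norm(C0)
gain = np.sin(np.linspace(0, 8 * np.pi, n))        # values in [-1, 1]
sn = sn_prime + (ss @ C0.T) * gain[:, None]

# observed mixture x(t) = A [s^s(t); s^n(t)]
A = rng.standard_normal((d, d))
X = np.concatenate([ss, sn], axis=1) @ A.T          # shape (n, d)
```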

5.2. Baseline methods and error measurement

In this simulation, we introduce two baseline methods to contrast with ASSA. The first one is the KL-SSA algorithm, which is implemented as a fixed point algorithm (Appendix A.1). Since KL-SSA finds only local solutions, we choose the solution with the smallest objective function value among five restarts with random initialization.⁶ The second baseline is the non-stationarity based ICA algorithm discussed in Section 4.1. Here, we adopt the method proposed by Pham and Cardoso (2001), since it measures the non-stationarity of sources using epoch covariances, which is similar to the approaches of ASSA and KL-SSA. It also suffers from local

⁵ Available here: http://research.ics.tkk.fi/ica/demos.shtml
⁶ The initial values are generated as [B^{s⊤} B^{n⊤}] = e^{0.5(M−M^⊤)} Σ̄^{-1/2}, where each element of M ∈ R^{d×d} is uniformly random in [−10, 10] and e^{(·)} denotes the matrix exponential.

Fig. 3. Median errors of ASSA, KL-SSA and non-stationarity based ICA (Pham & Cardoso, 2001) over 1000 random realizations of the data for different correlation parameters c: (a) c = 0, (b) c = 0.3, (c) c = 0.6, (d) c = 0.9. The dimensionality of the observed signals, the number of stationary sources and the signal length are set to 10, 5 and 5000 respectively. The observations are divided into non-overlapping consecutive epochs. The horizontal axis denotes the number of epochs N and is in log scale. The vertical axis is the subspace error and the error bars extend from the 25% to the 75% quantile.

optima; we therefore choose the best solution among five random restarts, as for KL-SSA. We then construct the stationary projection B̂^s_ICA in the following manner. Let W = [w_1 · · · w_d]^⊤ be the d × d demixing matrix derived by ICA, where each row vector corresponds to a source-recovering projection ŝ_j(t) = w_j^⊤ x(t). For each vector w_j, we heuristically measure the non-stationarity of the recovered source ŝ_j(t) using the score

\[ \text{n-score}(w_j) = \sum_{i=1}^{N} \left( \sigma_{ji} - \bar{\sigma}_j \right)^2 \]

where σ_{ji} denotes the standard deviation of ŝ_j(t) in the i-th epoch and σ̄_j is their average across epochs, σ̄_j = (1/N) Σ_{i=1}^N σ_{ji}. This criterion achieves its minimum of zero for perfectly stationary sources, i.e. σ_{ji} = σ_{ji′} for all i ≠ i′, and we choose the resulting

projection B̂^s_ICA as the span of the m row vectors with the m smallest scores. In order to evaluate the performance of the algorithms, we adopt the smallest canonical angle (Chatelin, 1993) between subspaces, θ (in degrees), to measure the difference between the estimated stationary projection B̂^s and the true non-stationary subspace A^n. We report the number 90 − θ(B̂^{s⊤}, A^n), which is zero for a perfect demixing, where the stationary projection is orthogonal to the non-stationary subspace.

5.3. Results

The results are shown in Fig. 3. For a small number of epochs, we observe the effect of spurious stationarity: the true solution cannot be found reliably because it is masked by the presence of spurious stationary projections. In this setting, the minimum required numbers of epochs N given by the bounds (6) and (12) are 5 and 3 for KL-SSA and ASSA respectively. Though we have not analyzed the spurious stationarity condition for the ICA method by Pham and Cardoso (2001), it seems to lie between those of ASSA and KL-SSA. For all methods, when the number of epochs N is too small, there exist spurious solutions, which result in the observed median errors above 45°. Moreover,

larger numbers of epochs are clearly preferable to obtain more accurate solutions; when the number of samples per epoch gets too small (around N ≥ 250 in this case), the effect of estimation errors in the epoch mean and covariance matrix leads to deteriorating performance. Fig. 3(a) shows the result for the case c = 0 (time-constant covariance), in which ASSA is guaranteed to be optimal. We can see that ASSA outperforms both baseline methods. While the median ICA result is achieving the competitive performance with ASSA around N = 50 to 100, we can also see its instability from the 75% error quantiles, nearly 90 degree errors meaning totally collapsed solutions. Even for time-varying covariances (c > 0), where ASSA is not guaranteed to be optimal, Fig. 3(b)–(d) show that ASSA is consistently outperforming on average (median performance) for all numbers of epochs N. In these cases, the independence assumption of the ICA model is also not fulfilled since the underlying stationary and non-stationary sources are correlated. The figures clearly show that ICA cannot reliably recover the two groups of sources, because its assumptions are violated. The ICA results for correlated sources seem to be quite distorted and appropriate stationary projections are found only by chance. We conjecture that the relatively poor performance of KLSSA is due to its numerical instability (see Appendix A.3). We further conducted two extensive comparisons of ASSA and KL-SSA. In Table 1, we have summarized the computational advantage of ASSA over KL-SSA. Here, we find that ASSA has achieved more than 100 times faster speed than KL-SSA by avoiding an iterative optimization, which is a practical bottleneck of KL-SSA due to its flatness of the objective function (see Appendix A.3). The results of an exhaustive comparison over different degrees of correlation parameter are also shown in Fig. 4. Here, the median error of ASSA is significantly lower than that of KL-SSA even though the error of KL-SSA slightly improves as a correlation parameter gets larger. However, despite its good median performance of ASSA, we also observe the gradual growth of its 75% error quantile. We conjecture that this is due to the violated assumption. It seems that even though ASSA is performing


Table 1
The median runtime in seconds for ASSA and KL-SSA in the simulation depicted in Fig. 3(a). We used a Matlab implementation under 64-bit Windows 7 with an Intel Xeon W3565 CPU. "Pre1" denotes the computation of means and covariances from the data. "Pre2" is an individual pre-processing step: the computation of the matrix S in ASSA and the whitening in KL-SSA. "Main" is the optimization process: solving the generalized eigenvalue problem in ASSA and one updating step in KL-SSA. "Step" denotes the median number of updating steps in KL-SSA with five random initializations. "Total" is the overall runtime.

Method   N     Pre1     Pre2     Main     Step   Total
ASSA     10    0.0014   0.0002   0.0002   –      0.0020
KL-SSA   10    0.0014   0.0002   0.0005   340    0.1930
ASSA     50    0.0049   0.0004   0.0002   –      0.0059
KL-SSA   50    0.0049   0.0004   0.0023   372    0.9086
ASSA     100   0.0092   0.0007   0.0002   –      0.0107
KL-SSA   100   0.0091   0.0007   0.0046   560    2.6569
ASSA     200   0.0178   0.0013   0.0002   –      0.0202
KL-SSA   200   0.0178   0.0012   0.0091   556    5.2513

Fig. 4. Comparison of ASSA and KL-SSA for varying correlation parameter c. In this simulation, the number of epochs N is set to 100. The vertical axis shows the error measured as the subspace angle to the true solution; the horizontal axis shows the correlation parameter. The median error of ASSA and KL-SSA over 1000 random realizations of the data is plotted along with error bars that extend from the 25% to the 75% quantile.

well on average, its result is distorted for certain cases, and the probability of facing such cases increases as a degree of assumption violation grows.7 Even so, the result shows ASSA is not that sensitive to the assumption violation as ICA and is outperforming KL-SSA in terms of the 75% error quantile for correlation parameters smaller than 0.6. 6. Application to the geomagnetic data analysis We now apply ASSA to the investigation of the dynamics of the earth’s magnetic field using ground magnetometer data. The geomagnetic phenomenon called Pi 2 pulsation (Jacobs, Kato, Matsushita, & Troitskaya, 1964; Saito, 1969) has been studied to reveal the connection to the substorm (Saito, Yumoto, & Koyama, 1976) or the propagation mechanism of magnetohydrodynamic waves in the magnetosphere (Uozumi et al., 2004). However, the observations of Pi 2 pulsations on the ground involve several components reflecting: (1) propagations of fast and shear Alfvén wave; (2) resonances of plasmaspheric or magnetospheric cavity and magnetic field lines; (3) transformations to ionospheric current systems (Kuwashima & Saito, 1981; Olson & Rostoker, 1977; Sutcliffe & Yumoto, 1991; Yeoman & Orr, 1989; Yumoto et al., 2001). It is unclear how they couple with each other and how their signals are distributed at different latitudes. Thus, in order to extract the global system of Pi 2 pulsations from the superpositions of several effects, the use of ICA had been proposed (Tokunaga et al., 2007). The result of ICA suggests the existence of two major

7 See Appendix D for the further discussion about this point.
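For reference, the error measure of Section 5.2 (90° minus the smallest canonical angle between the estimated stationary projection and the true non-stationary subspace) can be computed as follows (a sketch using SciPy; names are illustrative):

```python
import numpy as np
from scipy.linalg import subspace_angles

def ssa_error_degrees(Bs_hat, A_n):
    """90 - theta, where theta is the smallest canonical angle (in degrees)
    between span(Bs_hat^T) and span(A_n); zero means perfect demixing."""
    theta = np.rad2deg(np.min(subspace_angles(Bs_hat.T, A_n)))
    return 90.0 - theta
```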

components in an isolated Pi 2 event: one is the global oscillation that is common for all latitudes whereas the second is the local pulsation that is observed only in some specific latitudes. However, the source-wise independency assumption underlying ICA is too restrictive for this specific problem. From a geophysical point of view, one expects that there are several factors behind each source that interact with each other and thus lead to dependent sources. We have also observed in Section 5 that ICA results for such sources can be highly distorted. On the other hand, the components that are closely related to the Pi 2 event are those that exhibit strong nonstationary behavior over the selected time window. Thus, in order to obtain meaningful results, extracting the non-stationary sources seems more plausible than factorizing into independent sources. The ground magnetometer data was obtained from CPMN stations at the 210° magnetic meridian (MM) chain and South America chain (Yumoto et al., 2001). Fig. 5(a) shows the horizontal direction component of each station8 : bandpass-filtered (25–250 s) amplitude–time recordings of Pi 2 pulsation observed during the time window 13:35–13:55 UT on 17 February 1995 at 400 points in time. Note that the top four signals have larger powers: KTN (115 nT), TIK (71 nT), CHD (36 nT) and ZYK (11 nT), the other signals have power around 3 nT. The periodic wave in channel ZYK is environmental noise that is not related to the Pi 2 event. As shown in Fig. 5(a), most signals, especially the ones in low latitude, have similar highly non-stationary waveforms. We therefore expect that a common non-stationary source can be recovered from the signals. Moreover, the most stationary sources would correspond to observation noise and the sources with medium non-stationarity score are probably related to local phenomena. In order to suppress the effect of noise, we first extract the seven Principal Components (Jolliffe, 1986) from the data to which we then apply ASSA. Fig. 6 shows the waveforms of the non-stationary sources (Ns) estimated by ASSA (using a consecutive partitioning into N = 20 epochs) in descending order by their non-stationarity score λj . We categorize the seven sources according to their relative non-stationarity into group A (highly non-stationary), group B (medium non-stationarity) and a noise group (virtually stationary). We conjecture that the sources in groups A and B are related to the Pi 2 event: Fig. 5(b) and (c) show the Pi 2 components A and B plotted as a linear combination of the sources in groups A and B respectively. We can see that the Pi 2 component A, the global mode, is distributed globally to all latitudes whereas the Pi 2 component B, the local mode, occurs only in some specific stations (KTN, CHD), mainly at nightside high latitudes. In past studies, the plasmaspheric cavity mode is deemed to be one of the dominant mechanism of Pi 2 pulsations at low and middle latitudes (Sutcliffe & Yumoto, 1991; Takahashi, Lee, Nosé, Anderson, & Hughes, 2003; Yeoman & Orr, 1989). This finding coincides with our Pi 2 component A: the KTN station, whose signal does not show contributions from component A, is located in very high latitude so that one would not expect that it is affected by the plasmapause. The plasmapause is also known to cause a polarization reversal of substorm associated to Pi 2 pulsations (Fukunishi, 1975; Takahashi et al., 2003). In this particular Pi 2 event, its location is estimated between the stations CHD and ZYK. 
Hence the phase reversal of the Pi 2 component A between CHD and ZYK is probably related to the polarization reversal. However, the interpretation of the Pi 2 component B is unclear. Potential causes are substorm current systems such as the westward auroral electrojets or oscillations of the current wedge. In this Pi 2 event, the estimated location of the aurora break up spot

8 The names of the stations are abbreviated by three letter codes; see Yumoto et al. (2001).

S. Hara et al. / Neural Networks 33 (2012) 7–20

(a) Original signals.

(b) Pi 2 Comp. A.

15

(c) Pi 2 Comp. B.

Fig. 5. (a) Original signals: horizontal direction component of Pi 2 pulsations observed on 17 February 1995 at CPMN stations. The bandpass filter range is 25–250 s. The plots are aligned in the descending order of station’s latitude from the top. Stations above and below dashed line are the 210° MM chain and the South America chain, respectively. The scaling of vertical axis is around 3 nT except for top 4 stations. (b) Separated Pi 2 component A as a linear combination of N1, N2 and N3. (c) Separated Pi 2 component B as a linear combination of N4 and N5.

is in between the stations KTN and TIK. This would imply that the signals from the KTN and TIK stations both show large local modes, which is not the case for the component B. The effect of current systems is highly complex and only partly understood. Further analysis will require satellite observations and the investigation of other aspects of the magnetometer data that is beyond the scope of this paper. The components extracted by ASSA suggest that there are two major sources behind the Pi 2 pulsations. Our Pi 2 component A corresponds directly to the geophysical theory and the findings of other empirical studies. The component B suggests that there are other mechanisms whose understanding requires further investigation of current systems and the auroral breakup. In comparison to the ICA result (Tokunaga et al., 2007), the global mode found by ASSA is more plausible because it has smaller power at the KTN station, which locates north of the auroral breakup and less effects from the plasmapause is expected. 7. Conclusion and future work In this paper, we have proposed the first SSA algorithm, ASSA, whose solution can be obtained in closed form, and we have shown that it is optimal in the case of time-constant group-wise covariance. Thanks to its formulation as a generalized eigenvalue problem, it is more than 100 times faster than the state-of-the-art KL-SSA, and it does not require tuning any algorithmic parameters. Moreover, the condition for avoiding spurious solutions is looser and ASSA returns the solution for all possible numbers of stationary sources in one step, whereas this would require several runs of KLSSA. We have demonstrated the performance of ASSA in a realistic set of experiments and applied it to geomagnetic measurements Pi 2 pulsations, where it successfully factorizes the observed time series into meaningful components.

Fig. 6. Estimated non-stationary sources (Ns) by means of ASSA. The observed signals are divided to N = 20 non-overlapping consecutive epochs. The estimated sources were classified into three groups based on their ASSA scores λ. N1, N2 and N3 were classified into Group A. N4 and N5 were classified into Group B. N6 and N7 were noise sources.

A number of open questions remain. First of all, to date there exists no systematic approach for selecting the number of stationary sources m from data. Even though ASSA’s eigenvalue spectrum, and subsequent hypothesis testing can offer some guidance, a principled model selection technique, e.g. Information Criterion (Akaike, 1974; Schwarz, 1978), still needs to be developed. Similarly, apart from the lower bound on the number of epochs N, their choice T1 , . . . , TN is so far determined heuristically, based e.g. on the number of samples in each epoch. Most importantly, both KL-SSA and ASSA hinge on the limited notion

16

S. Hara et al. / Neural Networks 33 (2012) 7–20

Algorithm 1 : KL-SSA with a fixed point algorithm Input: samples {x(t )}t , index sets {Ti }Ni=1 , number of stationary sources m Output: stationary projection Bˆ sKL

A.2. Computational complexity

Ti Ni=1 ;

8:

divide samples to epochs by { } w N center and whiten the means, covariances {µw i , Σi }i=1 ; m×d ⊤ initialize W ∈ R so that WW = Im ; repeat compute the gradient dW ; update W ← dW ; normalize W so that WW ⊤ = Im ; until W converges

9:

set Bˆ sKL ← W Σ

1: 2: 3: 4: 5: 6: 7:

− 12

;

of weak stationarity, which is a good pragmatic choice for many scenarios. However, an extension towards separating sources by non-stationarities with respect to the time structure would open up a wide field of new applications, where temporal changes in the frequency domain are the main point of interest. Acknowledgments This work was partially supported by JSPS Grant-in-Aid for Scientific Research(B) #21013032, #22300054 and #22700147, JST PREST Program #3726 and AOARD AWARD #FA2386-10-1-4007. We thank Dr. S. Shimizu for helpful comments. Appendix A. Computational issues of KL-SSA A.1. KL-SSA with a fixed point algorithm

− 21

Σiw = Σ

− 21

  µi − µ Σi Σ

− 12

for 1 ≤ i ≤ N. We also factorize the stationary projection Bs as

KL W

− 21

, W ∈ Rm×d and derive the alternative problem of (5): N   1  2 w ⊤ = argmin ∥W µw ∥ − log det W Σ W i i

Bs = W Σ

W ∈Rm×d

N i=1

The fixed point algorithm is based on the fact that solutions to the optimization problem minW f (W ) with a constraint WW ⊤ = Im has the following property (Hyvärinen et al., 2001): span(W ⊤ ) = span(dW ⊤ ) ∂ f (W )

where dW denotes the gradient dW = ∂ W . It indicates that the optimal W is proportional to the gradient dW . Therefore, in the fixed point algorithm, we update W by substituting dW and rescale it so that WW ⊤ = Im is kept. The overall procedure is summarized in Algorithm 1. Note, in KL-SSA, the gradient dW is given as N 2 

N i=1

w⊤ W µw − W Σiw W ⊤ i µi



−1

N

|Ti |. The computation of the whitening matrix Σ 2 is in O (d ) and the whitening of all epochs is of the order O (Nd3 ). In the optimization stage, there appears an inverse of W Σiw W ⊤ which requires O (md(m + d)) for the matrix multiplication and O (m3 ) for the inverse. The cost for all epochs is thus O (Nm(m2 + md + d2 )) and the overall complexity is O (nd2 + Nd3 + rNm(m2 + md + d2 )) i=1 3

where r is the number of updating steps till convergence. A.3. Stability When solving the optimization problem minu f (u) by an iterative method, its numerical stability is governed by the condition number of the Hessian matrix ∇ 2 f (u), where the condition number κ(C ) for a matrix C is defined as a ratio of its largest singular value to the smallest singular value. If κ(∇ 2 f (u)) is large, the contour of f forms a long ellipsoid with large eccentricity. In that case, the optimization procedure tends to require larger number of steps and the solution gets numerically instable (Boyd & Vandenberghe, 2004). Here, we see the Hessian matrix of KL-SSA (5) for the case of m = 1, i.e. Bs = b⊤ ∈ R1×d for simplicity. The unconstrained KLSSA objective function f (b) is

−1



W Σiw .

N 1 



  ∥b⊤ µi − µ ∥2 b⊤ Σ b

N i=1

− log

b⊤ Σi b



b⊤ Σ b

.

Let b∗ ∈ Rd is a true SSA solution satisfying (4), then we derive 1 4

∇ 2 f (b∗ ) =

N ⊤ 1  Σi b∗ b⊤ ∗ Σi − Σ b∗ b∗ Σ . N i =1 b⊤ ∗ Σ b∗

Moreover, we can see from (E.2) that this Hessian matrix gets zero when the covariance between stationary and non-stationary sources is time-constant. It implies that f is very flat in the neighborhood of the true solution b∗ and the gradient based method may stop far before. When the covariance between the two groups of sources is time-varying, we can express Σi b∗ as

 Σi b∗ = As

An

   Σ s s⊤ ns A b∗ Σi

where Σ s ∈ Rm×m is a covariance matrix of stationary sources that is constant across epochs, and Σins ∈ R(d−m)×m is a covariance matrix between non-stationary and stationary sources in the i-th epoch. Then we derive N 1 

subject to WW ⊤ = Im .

dW =

In this section, we derive the computational complexity of KL-SSA in the fixed point formulation (Algorithm 1). In the preprocessing stage, we compute the sample means and covariance matrices which is O (nd2 ) where n is the total size of epochs n =

f (b) =

To solve the optimization problem (5), the combination of natural gradient (Amari, 1998; Plumbley, 2005) and conjugate gradient (Avriel, 2003) had been proposed by von Bünau et al. (2009a). Here, we introduce the use of the fixed point algorithm (Hyvärinen et al., 2001). The fixed point algorithm is simpler since it does not require the tuning of step size as the gradient method. It therefore allows us to compare ASSA and KL-SSA more objectively. In the pre-processing stage, each epoch mean and covariance is centered and whitened as

µw i = Σ

In the synthetic experiment stopped the updating  in Section  5, we 1 ⊤ Tr Wnew Wold < 10−5 . iteration when 1 − m

N i=1

ns

⊤ s ns Σ s As⊤ b∗ b⊤ ∗ A (Σi − Σ ) = 0m×(d−m)

where the equality holds from the definition of Σ . Then the Hessian matrix is   1 2 0m×m 0m×(d−m) ⊤ ∇ f (b∗ ) = A A n 0(d−m)×m Z 4 where Z n is defined as Zn =

ns ns ⊤ N ns s 1  (Σins − Σ )As ⊤ b∗ b⊤ ∗ A ( Σi − Σ )

.

N i=1 b⊤ ∗ Σ b∗ It is obvious that the Hessian matrix is rank deficient and thus the condition number is infinite, which again implies that KLSSA is instable around b∗ . Note that this result is irrelevant to

S. Hara et al. / Neural Networks 33 (2012) 7–20

the parametrization of b. Even if we parametrize b as b = b(θ) with some other parameter θ ∈ Rd , the Hessian matrix of f over θ, ∇θ2 f (b(θ)), is expressed as

∇θ2 f (b(θ)) = Jθ ∇ 2 f (b)Jθ⊤ with a Jacobian matrix Jθ ∈ Rd×d , and again it is rank deficient.
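The rank deficiency is easy to observe numerically. The following toy check is our own illustration (it is not from the paper): it uses an identity mixing matrix and arbitrary constants, builds epoch covariances whose group-wise covariance varies over time, evaluates the closed-form Hessian above at the true projection, and prints its singular values; at least $m$ of them vanish, so the condition number is infinite.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, N = 4, 1, 200

# Toy setup with identity mixing (A = I_d): source 0 is stationary with unit variance,
# sources 1..d-1 are non-stationary, and the covariance *between* the two groups varies
# over the epochs, so the block Z^n above is non-zero.
sigmas = []
for _ in range(N):
    S = np.eye(d)
    C = rng.standard_normal((d - m, d - m))
    S[m:, m:] = C @ C.T + np.eye(d - m)            # time-varying non-stationary block
    cross = 0.3 * rng.standard_normal(d - m)       # time-varying group-wise covariance
    S[:m, m:] = cross
    S[m:, :m] = cross[:, None]
    sigmas.append(S)
sigma_bar = np.mean(sigmas, axis=0)

b_star = np.eye(d)[0]                              # true stationary projection (m = 1)

# Closed-form Hessian at b_star, up to the factor 1/4 (formula in the text above).
H = sum(np.outer(S @ b_star, S @ b_star) for S in sigmas) / N
H -= np.outer(sigma_bar @ b_star, sigma_bar @ b_star)
H /= b_star @ sigma_bar @ b_star

print(np.linalg.svd(H, compute_uv=False))          # at least m singular values are ~0
```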

Appendix B. Data generation

After a proper scaling of $s^s(t)$ and $s^n(t)$, their correlation is $C(t)$. In this simulation, we set $C(t) = R_1\,\mathrm{diag}(c(t))\,R_2^\top$, where $R_1$ and $R_2$ are $m\times m$ and $(d-m)\times(d-m)$ orthogonal matrices and $\mathrm{diag}(c(t)) \in \mathbb{R}^{m\times(d-m)}$ is a matrix with $c(t) \in \mathbb{R}^{\min(m,d-m)}$ on its diagonal. Each component of $c(t)$ is limited to $[-1, 1]$ by the definition of correlation. The process $c(t)$ is also chosen from several different sources in Fig. 2: (b) ARMA(3, 3), (d) Lorenz95, and (f) constant with 6 to 20 change points. The chosen process is scaled so that each component of $c(t)$ lies in $[-c, c]$ for a given correlation parameter $c \in [0, 1]$. When $c = 0$, the covariance between stationary and non-stationary sources is zero and thus time-constant, so the optimality of ASSA is guaranteed, while the ASSA assumption is violated for $c > 0$.

The overall data generating procedure is as follows (a sketch of the correlation construction is given after the list):
(1) randomly generate $A$, $R_1$ and $R_2$;
(2) randomly assign $m$ processes to the stationary sources from two candidates and $d - m$ processes to the non-stationary sources from three candidates;
(3) generate $s^s(t)$ and $s^{n\prime}(t)$ from the assigned processes;
(4) randomly assign one of three candidates to $c(t)$ and generate the non-stationary sources $s^n(t)$ according to the model (14);
(5) generate the observed signal $x(t)$ from the SSA mixture (1).
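The following NumPy sketch shows only the construction of the time-varying correlation profile $C(t) = R_1\,\mathrm{diag}(c(t))\,R_2^\top$ and the rescaling of $c(t)$ into $[-c, c]$. It does not reproduce the source processes of Fig. 2 or the coupling model (14); the function name and the QR-based sampling of the orthogonal matrices are our own choices.

```python
import numpy as np

def correlation_profile(c_process, m, d, c_max, seed=0):
    """Time-varying group-wise correlation C(t) = R1 diag(c(t)) R2^T (sketch).

    c_process : (T, min(m, d-m)) array, the driving process for c(t)
    c_max     : correlation parameter c in [0, 1]; each component of c(t)
                is rescaled to lie in [-c_max, c_max]
    """
    rng = np.random.default_rng(seed)
    # Random orthogonal matrices R1 (m x m) and R2 ((d-m) x (d-m)).
    R1 = np.linalg.qr(rng.standard_normal((m, m)))[0]
    R2 = np.linalg.qr(rng.standard_normal((d - m, d - m)))[0]

    # Rescale the driving process so that every component stays in [-c_max, c_max].
    scale = np.abs(c_process).max(axis=0)
    scale[scale == 0] = 1.0
    c = c_process / scale * c_max

    T, k = c.shape
    D = np.zeros((T, m, d - m))
    D[:, np.arange(k), np.arange(k)] = c                   # diag(c(t)), embedded
    return np.einsum('ij,tjk,lk->til', R1, D, R2)          # C(t) = R1 diag(c(t)) R2^T

# Example: a piecewise-constant driving process (a stand-in for panel (f) of Fig. 2).
T, m, d = 500, 2, 5
steps = np.repeat(np.random.default_rng(2).uniform(-1, 1, size=(10, min(m, d - m))),
                  T // 10, axis=0)
C_t = correlation_profile(steps, m, d, c_max=0.6)
print(C_t.shape)                                           # (T, m, d - m)
```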

Appendix C. ASSA and joint block-diagonalization

First, we briefly introduce how the joint block-diagonalization approach (Abed-Meraim & Belouchrani, 2004; Belouchrani, Amin et al., 1997; Flury & Neuenschwander, 1994; Theis & Inouye, 2006) can be applied to the SSA problem. As shown in (E.1), the essential covariance structure of stationary and non-stationary sources is block-diagonal when the two groups of sources are group-wise uncorrelated. Therefore, the source recovery can be interpreted as the problem of finding a matrix $B \in \mathbb{R}^{d\times d}$ that makes $B\Sigma_i B^\top$ block-diagonal for all $N$ matrices at once, which is achieved by solving
$$\hat{B} = \operatorname*{argmin}_{B \in \mathbb{R}^{d\times d}} \sum_{i=1}^{N}\left\|\,\mathrm{boffdiag}\!\left(B\Sigma_i B^\top\right)\right\|^2 \quad \text{subject to } B\Sigma B^\top = I_d, \tag{C.1}$$
where $\mathrm{boffdiag}(\cdot)$ denotes the block-off-diagonal elements of a matrix. The equivalence of ASSA to (C.1) is summarized in the next theorem. For the proof, see Appendix E.4.

Theorem 5 (ASSA and Joint Block-Diagonalization). The problem (C.1) coincides with the ASSA problem (9) under the condition $\mu_1 = \mu_2 = \cdots = \mu_N$.
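For reference, here is a small NumPy helper of ours that evaluates the criterion in (C.1) for given block sizes; it only scores a candidate $B$, it does not solve (C.1) (which, by Theorem 5, reduces to the ASSA problem when all epoch means coincide).

```python
import numpy as np

def block_offdiag_cost(B, sigmas, m):
    """Value of the joint block-diagonalization objective in (C.1).

    Sums the squared block-off-diagonal entries of B Sigma_i B^T over all
    epochs, for the block partition of sizes m and d - m.
    """
    cost = 0.0
    for S in sigmas:
        M = B @ S @ B.T
        cost += 2.0 * np.sum(M[:m, m:] ** 2)    # M is symmetric: both off blocks counted
    return cost

def whitening_violation(B, sigma_bar):
    """Deviation of B from the constraint B Sigma B^T = I_d in (C.1)."""
    return np.linalg.norm(B @ sigma_bar @ B.T - np.eye(B.shape[0]))
```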

Appendix D. Assumption violation and ASSA solution

In Section 3, we constructed the ASSA algorithm based on the assumption that stationary and non-stationary sources have a time-constant covariance. Moreover, even when this assumption is not fulfilled, we observed through the simulations in Section 5 that ASSA is quite robust against the violation. Here, we provide one theoretical result that gives an insight into how the assumption violation affects the ASSA solution.

In the analysis, we consider the simplest case in which stationary and non-stationary sources are group-wise uncorrelated. This is because, from Lemma 2 (Appendix E.2), we can always construct such a model in the time-constant situation without loss of generality. Under such a model, we study how the following small perturbation of the covariance between the two groups of sources affects the resulting ASSA solution:
$$\Sigma_i = A\begin{pmatrix} \Sigma^s & \epsilon\Sigma_i^{sn} \\ \epsilon\Sigma_i^{sn\top} & \Sigma_i^n \end{pmatrix} A^\top,$$
where $\Sigma^s \in \mathbb{R}^{m\times m}$ is the covariance matrix of the stationary sources, which is common across epochs, $\Sigma_i^n \in \mathbb{R}^{(d-m)\times(d-m)}$ is the covariance of the non-stationary sources in the $i$-th epoch, and $\Sigma_i^{sn} \in \mathbb{R}^{m\times(d-m)}$ is the covariance between the two groups. Here, we let $O(\Sigma_i^{sn}) = O(1)$ and explicitly impose a parameter $0 \le \epsilon \ll 1$ to express that the assumption violation is sufficiently small. Our main result is based on the fact that, under this small perturbation, we can also express the matrix $S$ defined in (10) as
$$S = A\begin{pmatrix} 0_{m\times m} & \epsilon Q \\ \epsilon Q^\top & P \end{pmatrix} A^\top + O(\epsilon^2) \tag{D.1}$$
with some $Q \in \mathbb{R}^{m\times(d-m)}$ and $P \in \mathbb{R}^{(d-m)\times(d-m)}$. We then have the following theorem.

Theorem 6 (Effect of the Assumption Violation). For a sufficiently small perturbation $0 < \epsilon \ll 1$, the ASSA solution $\hat{B}^s_A$ is not orthogonal to the n-space $\mathrm{span}(A^n)$, and its error is of the following order:
$$O\!\left(\hat{B}^s_A A^n\right) = O\!\left(\epsilon\, QP^{-1}\right).$$
See Appendix E.5 for the proof.

From the definition of the matrix $S$, we can interpret the matrices $Q$ and $P$ as the degree of non-stationarity of the covariance between the two groups and of the covariance within the non-stationary sources, respectively. The above result shows that, even for small $\epsilon$, the assumption violation might cause a larger error if $QP^{-1}$ is large. This occurs when $Q$ has large values in directions where $P$ has small values. Small values of $P$ mean that the sources in those directions are only slightly non-stationary, while a large $Q$ means that the two groups of sources are strongly correlated. We can therefore interpret the result as follows: sources with only slight non-stationarity tend to be mixed up with the truly stationary sources under the assumption violation. On the other hand, the effect of small correlations with highly non-stationary sources can be negligible in practice. This is in line with our intuition that significant non-stationarity should be easier to distinguish even when there are some correlations between the two groups.

The above theorem can partly explain the simulation result in Section 5. In Fig. 4, the median error grows gradually along the correlation parameter $c$, while the 75% quantile increases rapidly. Note that the parameter $c$ corresponds to the perturbation $\epsilon$ in the theorem. The theorem states that the assumption violation is not always fatal; there are cases with small errors even under a large $c$ if $QP^{-1}$ is small. We conjecture that this is the reason why the ASSA solution does not collapse entirely even for larger $c$, but only in some specific cases, as observed in the 75% error quantile.


Appendix E. Proofs of theorems

E.1. Proof of Theorem 2

Since $f(B^s)$ takes its minimum, zero, at $B^s = B^s_*$, its first order derivative vanishes. Therefore, the second order Taylor approximation $\tilde{f}(B^s; B^s_*)$ depends only on the second order derivative of $f(B^s)$:
$$\tilde{f}(B^s; B^s_*) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j,j'=1}^{m}\sum_{k,k'=1}^{d}\frac{1}{2}\left[\frac{\partial^2 \ell(B^s;\Sigma_i)}{\partial B^s_{jk}\,\partial B^s_{j'k'}} - \frac{\partial^2 \ell(B^s;\Sigma)}{\partial B^s_{jk}\,\partial B^s_{j'k'}}\right]_{B^s = B^s_*}\left(B^s_{jk} - B^s_{*,jk}\right)\left(B^s_{j'k'} - B^s_{*,j'k'}\right),$$
where $\ell(B^s;\Sigma) = -\log\det\!\left(B^s\Sigma B^{s\top}\right)$ and
$$\frac{1}{2}\left.\frac{\partial^2 \ell(B^s;\Sigma)}{\partial B^s_{jk}\,\partial B^s_{j'k'}}\right|_{B^s=B^s_*}
= -\left(B^s\Sigma B^{s\top}\right)^{-1}_{jj'}\Sigma_{kk'}
+ \left(B^s\Sigma B^{s\top}\right)^{-1}_{jj'}\Sigma_{k\cdot}B^{s\top}\!\left(B^s\Sigma B^{s\top}\right)^{-1}B^s\Sigma_{\cdot k'}
+ \left(B^s\Sigma B^{s\top}\right)^{-1}_{j\cdot}B^s\Sigma_{\cdot k'}\left(B^s\Sigma B^{s\top}\right)^{-1}_{j'\cdot}B^s\Sigma_{\cdot k}.$$
Here, $C_{jk}$ denotes the $(j,k)$-th element of a matrix $C$, and $C_{j\cdot}$, $C_{\cdot k}$ are its $j$-th row and $k$-th column vectors, respectively. From the assumption $B^s_*\Sigma_i B^{s\top}_* = B^s_*\Sigma B^{s\top}_* = I_m$, we derive the following simpler expression:
$$\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}\left[\frac{\partial^2 \ell(B^s;\Sigma_i)}{\partial B^s_{jk}\,\partial B^s_{j'k'}} - \frac{\partial^2 \ell(B^s;\Sigma)}{\partial B^s_{jk}\,\partial B^s_{j'k'}}\right]_{B^s=B^s_*}
= \frac{1}{N}\sum_{i=1}^{N}\Bigl[-I_{m,jj'}\Sigma_{i,kk'} + I_{m,jj'}\Sigma_{kk'}
+ I_{m,jj'}\Sigma_{i,k\cdot}B^{s\top}_*B^s_*\Sigma_{i,\cdot k'} - I_{m,jj'}\Sigma_{k\cdot}B^{s\top}_*B^s_*\Sigma_{\cdot k'}
+ I_{m,j\cdot}B^s_*\Sigma_{i,\cdot k'}\,I_{m,j'\cdot}B^s_*\Sigma_{i,\cdot k} - I_{m,j\cdot}B^s_*\Sigma_{\cdot k'}\,I_{m,j'\cdot}B^s_*\Sigma_{\cdot k}\Bigr]$$
$$= \frac{1}{N}\sum_{i=1}^{N}\Bigl[I_{m,jj'}\left(\Sigma_i-\Sigma\right)_{k\cdot}B^{s\top}_*B^s_*\left(\Sigma_i-\Sigma\right)_{\cdot k'}
+ I_{m,j\cdot}B^s_*\left(\Sigma_i-\Sigma\right)_{\cdot k'}\,I_{m,j'\cdot}B^s_*\left(\Sigma_i-\Sigma\right)_{\cdot k}\Bigr],$$
where the last equality holds from $\Sigma = \frac{1}{N}\sum_{i=1}^{N}\Sigma_i$. By using $\sum_{j,j',k,k'} C_{jj'}D_{kk'}X_{jk}X_{j'k'} = \mathrm{Tr}\!\left(CXDX^\top\right)$ and $B^s_*\Sigma_i B^{s\top}_* = B^s_*\Sigma B^{s\top}_*$, we get the resulting second order Taylor approximation $\tilde{f}(B^s; B^s_*)$ as
$$\tilde{f}(B^s; B^s_*) = \frac{1}{N}\sum_{i=1}^{N}\Bigl[\mathrm{Tr}\!\left(B^s\!\left(\Sigma_i-\Sigma\right)B^{s\top}_*B^s_*\!\left(\Sigma_i-\Sigma\right)B^{s\top}\right)
+ \mathrm{Tr}\!\left(B^s\!\left(\Sigma_i-\Sigma\right)B^{s\top}_*B^s\!\left(\Sigma_i-\Sigma\right)B^{s\top}_*\right)\Bigr].$$
Then, we first derive the following upper bound:
$$\tilde{f}(B^s; B^s_*) \le \frac{2}{N}\sum_{i=1}^{N}\mathrm{Tr}\!\left(B^s\!\left(\Sigma_i-\Sigma\right)B^{s\top}_*B^s_*\!\left(\Sigma_i-\Sigma\right)B^{s\top}\right)$$
from the Cauchy–Schwarz inequality
$$\left|\left\langle C_i, C_i^\top\right\rangle\right| \le \sqrt{\left\langle C_i, C_i\right\rangle\left\langle C_i^\top, C_i^\top\right\rangle} = \left\langle C_i, C_i\right\rangle,$$
where $\langle C, D\rangle = \mathrm{Tr}\!\left(CD^\top\right)$ is the inner product of matrices $C$ and $D$, and $C_i = B^s\!\left(\Sigma_i-\Sigma\right)B^{s\top}_*$. Here, let $B^n_* \in \mathbb{R}^{(d-m)\times d}$ denote a $\Sigma$-orthonormal complement of $B^s_*$, i.e. $B^n_*\Sigma B^{s\top}_* = 0_{(d-m)\times m}$ and $B^n_*\Sigma B^{n\top}_* = I_{d-m}$. We then derive the following further upper bound from the fact that $\mathrm{Tr}\!\left(B^s\!\left(\Sigma_i-\Sigma\right)B^{n\top}_*B^n_*\!\left(\Sigma_i-\Sigma\right)B^{s\top}\right) \ge 0$ holds for any $B^s$:
$$\tilde{f}(B^s; B^s_*) \le \frac{2}{N}\sum_{i=1}^{N}\Bigl[\mathrm{Tr}\!\left(B^s\!\left(\Sigma_i-\Sigma\right)B^{s\top}_*B^s_*\!\left(\Sigma_i-\Sigma\right)B^{s\top}\right)
+ \mathrm{Tr}\!\left(B^s\!\left(\Sigma_i-\Sigma\right)B^{n\top}_*B^n_*\!\left(\Sigma_i-\Sigma\right)B^{s\top}\right)\Bigr]
= \frac{2}{N}\sum_{i=1}^{N}\mathrm{Tr}\!\left(B^s\!\left(\Sigma_i-\Sigma\right)B^{\top}_*B_*\!\left(\Sigma_i-\Sigma\right)B^{s\top}\right),$$
where $B_* = \left(B^{s\top}_*\ \ B^{n\top}_*\right)^{\!\top}$. Moreover, $B^\top_* B_* = \Sigma^{-1}$ follows from $B_*\Sigma B^\top_* = I_d$, and we derive (8). □

E.2. Proof of Theorem 3

The theorem is obvious from the following lemmas.

Lemma 1 (ASSA and Uncorrelated SSA). The ASSA solution $\hat{B}^s_A$ is optimal when stationary and non-stationary sources are group-wise uncorrelated, i.e. the covariance between the two groups is zero.

Proof. If stationary and non-stationary sources are group-wise uncorrelated, $\Sigma_i$ has the following block-diagonal form:
$$\Sigma_i = \begin{pmatrix} A^s & A^n \end{pmatrix}\begin{pmatrix} \Sigma^s & 0_{m\times(d-m)} \\ 0_{(d-m)\times m} & \Sigma_i^n \end{pmatrix}\begin{pmatrix} A^s & A^n \end{pmatrix}^{\!\top}, \tag{E.1}$$
where $\Sigma^s$ is the covariance matrix of the stationary sources, which is constant across epochs, and $\Sigma_i^n$ is the covariance of the non-stationary sources in the $i$-th epoch. From this block-diagonal structure and the orthogonality between $B^s_*$ and $A^n$, $B^s_*\Sigma_i = B^s_*A^s\Sigma^s A^{s\top}$ holds. Since this is independent of the epoch index $i$, we derive the following relations on $B^s_*$:
$$B^s_*\mu_i = B^s_*\mu \quad\text{and}\quad B^s_*\Sigma_i = B^s_*\Sigma \tag{E.2}$$
for $1 \le i \le N$. It is obvious that the ASSA objective function (9) is zero at $B^s = B^s_*$. Since the ASSA objective function is non-negative, $B^s_*$ is a minimizer. □

Lemma 2 (Equivalent Class of Uncorrelated SSA). Any SSA model (1) with a time-constant covariance between stationary and non-stationary sources can be reduced to an equivalent model with group-wise uncorrelated sources.

Proof. Let $\Sigma^{sn} \in \mathbb{R}^{m\times(d-m)}$ be the time-constant covariance matrix between the stationary and non-stationary sources. Then, an equivalent uncorrelated SSA model is given as
$$x(t) = \begin{pmatrix} A^s + A^n\Sigma^{sn\top}(\Sigma^s)^{-1} & A^n \end{pmatrix}\begin{pmatrix} s^s(t) \\ s^n(t) - \Sigma^{sn\top}(\Sigma^s)^{-1} s^s(t) \end{pmatrix},$$
where $\Sigma^s \in \mathbb{R}^{m\times m}$ is the covariance matrix of the stationary sources and is thus time-constant. Indeed, the covariance between $s^s(t)$ and $s^n(t) - \Sigma^{sn\top}(\Sigma^s)^{-1}s^s(t)$ equals $\Sigma^{sn} - \Sigma^s(\Sigma^s)^{-1}\Sigma^{sn} = 0$, so the new source groups are uncorrelated while the observed mixture $x(t)$ is unchanged. □
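Lemma 2 is easy to sanity-check numerically. The following snippet is our own illustration (arbitrary dimensions, random matrices): it draws sources with a constant cross-covariance and verifies that the transformed non-stationary sources are uncorrelated with the stationary ones and that the observed mixture is unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
m, dn = 2, 3
d = m + dn

# A population covariance of (s^s, s^n) with a non-zero, time-constant cross block.
L = rng.standard_normal((d, d))
cov = L @ L.T + np.eye(d)
sig_s, sig_sn = cov[:m, :m], cov[:m, m:]

A_s = rng.standard_normal((d, m))              # mixing matrices (arbitrary toy values)
A_n = rng.standard_normal((d, dn))

# Decorrelating transform of Lemma 2: s~^n = s^n - Sigma^{sn T} (Sigma^s)^{-1} s^s.
T = np.linalg.solve(sig_s, sig_sn).T           # equals Sigma^{sn T} (Sigma^s)^{-1}

s = rng.multivariate_normal(np.zeros(d), cov, size=200000)
ss, sn = s[:, :m], s[:, m:]
sn_tilde = sn - ss @ T.T

# 1) The transformed groups are (empirically) uncorrelated.
print(np.abs(np.cov(ss.T, sn_tilde.T)[:m, m:]).max())           # ~0 up to sampling noise

# 2) The observed mixture is unchanged: A^s s^s + A^n s^n == (A^s + A^n T) s^s + A^n s~^n.
x_old = ss @ A_s.T + sn @ A_n.T
x_new = ss @ (A_s + A_n @ T).T + sn_tilde @ A_n.T
print(np.allclose(x_old, x_new))                                 # True
```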

E.3. Proof of Theorem 4

From Lemma 2, it is sufficient to prove the claim only for the case of group-wise uncorrelated sources. Under the uncorrelated model, the conditions (4) are replaced by (E.2). Let us denote by $\mathcal{B}$ the set of ASSA solutions that satisfy (E.2) and the constraint $B^s\Sigma B^{s\top} = I_m$. The uniqueness of the ASSA solution is guaranteed (up to linear transformation) if $\mathrm{span}(B^{s\top}) = \mathrm{span}(B^{s'\top})$ holds for any solutions $B^s, B^{s'} \in \mathcal{B}$. This holds when $b \in \bigcap_{B^s\in\mathcal{B}}\mathrm{span}(B^{s\top})$ has degrees of freedom equal to or less than $m - 1$, where the $-1$ stems from the constraint. The conditions (E.2) impose the following $N(d+1)$ constraints:
$$\left(\mu_i - \mu\right)^{\!\top} b = 0, \tag{E.3}$$
$$\left(\Sigma_i - \Sigma\right) b = 0_{d\times 1}, \tag{E.4}$$
for $1 \le i \le N$. However, not all of them are independent. Since at most $\mathrm{rank}(\Sigma_i - \Sigma)$ equations are independent in (E.4), the expected number of independent constraints is $N(\ell+1)$. Conditions (E.3) and (E.4) also include $d - m + 1$ dependent equations, since their sums are obviously zero from the definitions of $\mu$ and $\Sigma$, i.e. $\left(\sum_{i=1}^{N}\mu_i - N\mu\right)^{\!\top} b = 0$ and $\left(\sum_{i=1}^{N}\Sigma_i - N\Sigma\right) b = 0_{d\times 1}$. Therefore, the total number of independent constraints is $N(\ell+1) - (d-m+1)$, and $b$ has $d - N(\ell+1) + (d-m+1)$ degrees of freedom. Since this has to be equal to or less than $m - 1$, (12) follows. In the special case when the mean is constant, the condition (E.3) vanishes and we have $N\ell - (d-m)$ independent constraints. The degrees of freedom of $b$ is then $d - N\ell + (d-m)$, and (13) holds. □

E.4. Proof of Theorem 5

Let $B = \left(B^{s\top}\ \ B^{n\top}\right)^{\!\top}$; then the block-off-diagonal element of $B\Sigma_i B^\top$ is $B^s\Sigma_i B^{n\top}$. Moreover, the following holds for the objective function of (C.1):
$$\sum_{i=1}^{N}\mathrm{Tr}\!\left(B^s\Sigma_i B^{n\top}B^n\Sigma_i B^{s\top}\right)
= \sum_{i=1}^{N}\Bigl[\mathrm{Tr}\!\left(B^s\Sigma_i B^{n\top}B^n\Sigma_i B^{s\top}\right) + \mathrm{Tr}\!\left(B^s\Sigma_i B^{s\top}B^s\Sigma_i B^{s\top}\right)\Bigr] - \sum_{i=1}^{N}\mathrm{Tr}\!\left(B^s\Sigma_i B^{s\top}B^s\Sigma_i B^{s\top}\right)$$
$$= \sum_{i=1}^{N}\Bigl[\mathrm{Tr}\!\left(B^s\Sigma_i\Sigma^{-1}\Sigma_i B^{s\top}\right) - \mathrm{Tr}\!\left(B^s\Sigma_i B^{s\top}B^s\Sigma_i B^{s\top}\right)\Bigr], \tag{E.5}$$
where the last equality follows from $B^{s\top}B^s + B^{n\top}B^n = B^\top B$ and $B\Sigma B^\top = I_d$. The equivalence of (E.5) to the ASSA objective function (9) can be checked with some algebra under the condition $\mu_1 = \mu_2 = \cdots = \mu_N$. Hence, $B^s\Sigma_i B^{s\top} = B^s\Sigma B^{s\top} = I_m$ is a necessary condition for a minimizer of (E.5) to coincide with a minimizer of (C.1), which is guaranteed by Lemma 1. □
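The identity (E.5) is easy to verify numerically. The snippet below is our own check (arbitrary dimensions): it draws random epoch covariances, builds a $B$ satisfying $B\Sigma B^\top = I_d$ from the whitening matrix and a random rotation, and compares both sides.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, N = 5, 2, 7

sigmas = []
for _ in range(N):
    L = rng.standard_normal((d, d))
    sigmas.append(L @ L.T + np.eye(d))
sigma_bar = np.mean(sigmas, axis=0)

# Any B with B Sigma B^T = I_d works; take B = R Sigma^{-1/2} for a random rotation R.
evals, evecs = np.linalg.eigh(sigma_bar)
whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
R = np.linalg.qr(rng.standard_normal((d, d)))[0]
B = R @ whitener
Bs, Bn = B[:m], B[m:]

lhs = sum(np.trace(Bs @ S @ Bn.T @ Bn @ S @ Bs.T) for S in sigmas)
rhs = sum(np.trace(Bs @ S @ np.linalg.solve(sigma_bar, S) @ Bs.T)
          - np.trace(Bs @ S @ Bs.T @ Bs @ S @ Bs.T) for S in sigmas)
print(np.isclose(lhs, rhs))   # True
```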

E.5. Proof of Theorem 6

We first show that the matrix $S$ takes the form (D.1). From the definition, we have the average covariance $\Sigma$ as
$$\Sigma = A\begin{pmatrix} \Sigma^s & \epsilon\Sigma^{sn} \\ \epsilon\Sigma^{sn\top} & \Sigma^n \end{pmatrix} A^\top,$$
where $\Sigma^{sn} = \frac{1}{N}\sum_{i=1}^{N}\Sigma_i^{sn}$ and $\Sigma^n = \frac{1}{N}\sum_{i=1}^{N}\Sigma_i^n$. Hence, its inverse is given as
$$\Sigma^{-1} = A^{-\top}\begin{pmatrix} (\Sigma^s)^{-1} + \epsilon^2(\Sigma^s)^{-1}\Sigma^{sn}J^{-1}\Sigma^{sn\top}(\Sigma^s)^{-1} & -\epsilon(\Sigma^s)^{-1}\Sigma^{sn}J^{-1} \\ -\epsilon J^{-1}\Sigma^{sn\top}(\Sigma^s)^{-1} & J^{-1} \end{pmatrix} A^{-1}
= A^{-\top}\begin{pmatrix} (\Sigma^s)^{-1} & -\epsilon(\Sigma^s)^{-1}\Sigma^{sn}(\Sigma^n)^{-1} \\ -\epsilon(\Sigma^n)^{-1}\Sigma^{sn\top}(\Sigma^s)^{-1} & (\Sigma^n)^{-1} \end{pmatrix} A^{-1} + O(\epsilon^2),$$
where $J = \Sigma^n - \epsilon^2\Sigma^{sn\top}(\Sigma^s)^{-1}\Sigma^{sn} = \Sigma^n + O(\epsilon^2)$. Using this expression, we can write down the product $\Sigma_i\Sigma^{-1}\Sigma_i$ as
$$\Sigma_i\Sigma^{-1}\Sigma_i = A\begin{pmatrix} \Sigma^s & \epsilon Q_i \\ \epsilon Q_i^\top & \Sigma_i^n(\Sigma^n)^{-1}\Sigma_i^n \end{pmatrix} A^\top + O(\epsilon^2),$$
where $Q_i = \Sigma_i^{sn} + \left(\Sigma_i^{sn} - \Sigma^{sn}\right)(\Sigma^n)^{-1}\Sigma_i^n$. The matrix $S$ is then of the form (D.1) with $Q$ and $P$ defined as
$$Q = \frac{2}{N}\sum_{i=1}^{N}\left(\Sigma_i^{sn} - \Sigma^{sn}\right)\left(I_{d-m} + (\Sigma^n)^{-1}\Sigma_i^n\right),$$
$$P = \frac{1}{N}\sum_{i=1}^{N}\left[\mu_i^n\mu_i^{n\top} + 2\Sigma_i^n(\Sigma^n)^{-1}\Sigma_i^n\right] - \mu^n\mu^{n\top} - 2\Sigma^n.$$
Here, we used the notations $\mu_i = A\left(\mu^{s\top}\ \ \mu_i^{n\top}\right)^{\!\top}$ and $\mu = A\left(\mu^{s\top}\ \ \mu^{n\top}\right)^{\!\top}$.

Now, we turn to proving the main claim. Let $\hat{B}^s_{A,\epsilon}$ denote the ASSA solution under some fixed $\epsilon \ge 0$. Note that $\hat{B}^s_{A,0} = B^s_*$ holds from Theorem 3. Here, we assume that $B^s_*$ satisfies $B^s_*\Sigma B^{s\top}_* = I_m$. Hence, from the continuity of $\hat{B}^s_{A,\epsilon}$ over $\epsilon$, there exist some $0 \le \eta \ll 1$ and $C \in \mathbb{R}^{m\times d}$ such that $O(C) = O(1)$ and $\hat{B}^s_{A,\epsilon} = B^s_* + \eta C$ for $0 \le \epsilon \ll 1$. With this expression, we can write down the products $\hat{B}^s_{A,\epsilon}\Sigma\hat{B}^{s\top}_{A,\epsilon}$ and $\hat{B}^s_{A,\epsilon}S\hat{B}^{s\top}_{A,\epsilon}$ as follows:
$$\hat{B}^s_{A,\epsilon}\Sigma\hat{B}^{s\top}_{A,\epsilon} = I_m + \eta\left(CA^s\Sigma^s A^{s\top}B^{s\top}_* + B^s_*A^s\Sigma^s A^{s\top}C^\top\right) + \epsilon\eta\left(CA^n\Sigma^{sn\top}A^{s\top}B^{s\top}_* + B^s_*A^s\Sigma^{sn}A^{n\top}C^\top\right) + \eta^2\left(CA^s\Sigma^s A^{s\top}C^\top + CA^n\Sigma^n A^{n\top}C^\top\right) + O(\epsilon\eta^2),$$
$$\hat{B}^s_{A,\epsilon}S\hat{B}^{s\top}_{A,\epsilon} = \epsilon\eta\left(CA^nQ^\top A^{s\top}B^{s\top}_* + B^s_*A^sQA^{n\top}C^\top\right) + \eta^2\,CA^nPA^{n\top}C^\top + O(\epsilon\eta^2).$$
Hence, we have
$$\mathrm{Tr}\!\left[\left(\hat{B}^s_{A,\epsilon}\Sigma\hat{B}^{s\top}_{A,\epsilon}\right)^{-1}\hat{B}^s_{A,\epsilon}S\hat{B}^{s\top}_{A,\epsilon}\right] = 2\epsilon\eta\,\mathrm{Tr}\!\left(CA^nQ^\top A^{s\top}B^{s\top}_*\right) + \eta^2\,\mathrm{Tr}\!\left(CA^nPA^{n\top}C^\top\right) + O(\epsilon\eta^2)$$
from $\left(\hat{B}^s_{A,\epsilon}\Sigma\hat{B}^{s\top}_{A,\epsilon}\right)^{-1} = I_m + O(\eta)$. As we have discussed in Section 3, the ASSA solution $\hat{B}^s_{A,\epsilon}$ is a minimizer of the above. This time, $B^s_*$ is a constant and thus we search for the optimal $\eta C$. By setting the derivative with respect to $\eta CA^n$ to zero, we get
$$\eta CA^n = -\epsilon\,B^s_*A^sQP^{-1} + O(\epsilon\eta).$$
Note that this equals $\hat{B}^s_{A,\epsilon}A^n = \left(B^s_* + \eta C\right)A^n$, since $B^s_*A^n = 0_{m\times(d-m)}$, and we get the claim. □

References Abed-Meraim, K., & Belouchrani, A. (2004). Algorithms for joint block diagonalization. In Proc. 12th Euro. sig. proc. conf. (pp. 209–212). Akaike, H. (1974). A new look at the statistical identification model. IEEE Transactions on Automatic Control, 19(6), 716–723. Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276. Avriel, M. (2003). Nonlinear programming: analysis and methods. Dover Pubns. Belouchrani, A., Abed-Meraim, K., Cardoso, J., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing, 45(2), 434–444. Belouchrani, A., Amin, M., & Abed-Meraim, K. (1997). Direction finding in correlated noise fields based on joint block-diagonalization of spatio-temporal correlation matrices. IEEE Signal Processing Letters, 4(9), 266–268. Blankertz, B., Kawanabe, M., Tomioka, R., Hohlefeld, F., Nikulin, V., & Müller, K.-R. (2008). Invariant common spatial patterns: alleviating nonstationarities in brain–computer interfacing. Advances in Neural Information Processing Systems, 20, 113–120. Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., & Muller, K.-R. (2007). Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Processing Magazine, 25(1), 41–56.


Blythe, D. A. J., von Bünau, P., Meinecke, F. C., & Müller, K.-R. (2012). Feature extraction for change-point detection using stationary subspace analysis. IEEE Transactions on Neural Networks, 23(4), 631–643. Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge Univ. Pr. Cardoso, J. F., & Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. In IEE proc. f. IET. Vol. 140 (pp. 362–370). Chatelin, F. (1993). Eigenvalues of matrices. John Wiley and Sons. Choi, S., & Cichocki, A. (2000). Blind separation of nonstationary and temporally correlated sources from noisy mixtures. In Proc. IEEE stats. sig. proc.: Vol. 1 (pp. 405–414). IEEE. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314. Congedo, M., Gouy-Pailler, C., & Jutten, C. (2008). On the blind source separation of human electroencephalogram by approximate joint diagonalization of second order statistics. Clinical Neuropsychology, 119(12), 2677–2686. Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, (74), 427–431. Dornhege, G., Millán, J. D. R., Hinterberger, T., McFarland, D. J., & Müller, K.-R. (2007). Toward brain–computer interfacing. MIT Pr. Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: representation, estimation, and testing. Econometrica, 55(2), 251–276. Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188. Flury, B., & Neuenschwander, B. (1994). Simultaneous diagonalization algorithms with applications in multivariate statistics. In Proc. approx. comput. (pp. 179–205). Birkhäuser Boston Inc. Fukunaga, K. (1990). Introduction to statistical pattern recognition. Academic Pr. Fukunishi, H. (1975). Polarization changes of geomagnetic Pi 2 pulsations associated with the plasmapause. Journal of Geophysical Research, 80(1), 98–110. Grosse-Wentrup, M., & Buss, M. (2008). Multiclass common spatial patterns and information theoretic feature extraction. IEEE Transactions on Biomedical Engineering, 55(8), 1991–2000. Hamilton, J. D. (1994). Time series analysis. Princeton Univ. Pr. Hara, S., Kawahara, Y., Washio, T., & von Bünau, P. (2010). Stationary subspace analysis as a generalized eigenvalue problem. In Proc. 17th int. conf. neural info. proc. (pp. 422–429). Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–162. Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634. Hyvärinen, A. (2002). Blind source separation by nonstationarity of variance: a cumulant-based approach. IEEE Transactions on Neural Networks, 12(6), 1471–1474. Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. John Wiley and Sons. Jacobs, J. A., Kato, Y., Matsushita, S., & Troitskaya, V. A. (1964). Classification of geomagnetic micropulsations. Geophysical Journal of the Royal Astronomical Society, 8(3), 341–342. Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630. Jolliffe, I. T. (1986). Principal component analysis. Springer. Kaufmann, R. K., & Stern, D. I. (2002). Cointegration analysis of hemispheric temperature relations. Journal of Geophysical Research, 107(D2), 4012. Kawamoto, M., Matsuoka, K., & Ohnishi, N. (1998). 
A method of blind separation for convolved non-stationary signals. Neurocomputing, 22(1–3), 157–171. Kohlmorgen, J., Lemm, S., Müller, K.-R., Liehr, S., & Pawelzik, K. (1999). Fast change point detection in switching dynamics using a hidden Markov model of prediction experts. In Proc. 9th int. conf. art. neur. netw. (pp. 204–209). Koles, Z. J. (1991). The quantitative extraction and topographic mapping of the abnormal components in the clinical EEG. Electroencephalography and Clinical Neurophysiology, 79, 440–447. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 79–86. Kuwashima, M., & Saito, T. (1981). Spectral characteristics of magnetic Pi 2 pulsations in the auroral region and lower latitudes. Journal of Geophysical Research, 86(A6), 4686–4696. Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 556–562. Lorenz, E. N., & Emanuel, K. A. (1998). Optimal sites for supplementary weather observations: simulation with a small model. Journal of the Atmospheric Sciences, 55(3), 399–414. Mann, M. E. (2004). On smoothing potentially non-stationary climate time series. Geophysical Research Letters, 31(7), L07214. Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. Academic Pr. Matsuoka, K., Ohoya, M., & Kawamoto, M. (1995). A neural net for blind separation of nonstationary signals. Neural Networks, 8(3), 411–419.

Meinecke, F.C., von Bünau, P., Kawanabe, M., & Müller, K.-R. (2009). Learning invariances with stationary subspace analysis. In IEEE 12th International Conference on Computer Vision Workshops. (pp. 87–92). http://dx.doi.org/10.1109/ICCVW.2009.5457715. Molgedey, L., & Schuster, H. (1994). Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72(23), 3634–3637. Montgomery, D. (2007). Introduction to statistical quality control. John Wiley and Sons. Moreau, E. (2001). A generalization of joint-diagonalization criteria for source separation. IEEE Transactions on Signal Processing, 49(3), 530–541. Müller, J. S., von Bünau, P., Meinecke, F. C., Király, F. J., & Müller, K.-R. (2011). The stationary subspace analysis toolbox. Journal of Machine Learning Research, 12, 3065–3069. Murata, N., Kawanabe, M., Ziehe, A., Müller, K.-R., & Amari, S. (2002). Online learning in changing environments with applications in supervised and unsupervised learning. Neural Networks, 15(4–6), 743–760. Olson, J. V., & Rostoker, G. (1977). Latitude variation of the spectral components of auroral zone Pi 2. Planetary and Space Science, 25(7), 663–671. Parra, L., & Sajda, P. (2003). Blind source separation via generalized eigenvalue decomposition. The Journal of Machine Learning Research, 4, 1261–1269. Pham, D., & Cardoso, J. (2001). Blind separation of instantaneous mixtures of nonstationary sources. IEEE Transactions on Signal Processing, 49(9), 1837–1848. Plumbley, M. D. (2005). Geometrical methods for non-negative ICA: manifolds, Lie groups and toral subalgebras. Neurocomputing, 67, 161–197. Priestley, M. B., & Rao, T. S. (1969). A test for non-stationarity of time-series. Journal of the Royal Statistical Society: Series B, (31), 140–149. Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. (Eds.). (2008). Dataset shift in machine learning. MIT Pr. Roberts, S. W. (1959). Control chart tests based on geometric moving averages. Technometrics, 239–250. Saito, T. (1969). Geomagnetic pulsations. Space Science Reviews, 10, 319–412. Saito, T., Yumoto, K., & Koyama, Y. (1976). Magnetic pulsation Pi 2 as a sensitive indicator of magnetospheric substorm. Planetary and Space Science, 24(11), 1025–1029. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464. Shenoy, P., Krauledat, M., Blankertz, B., Rao, R. P. N., & Müller, K.-R. (2006). Towards adaptive classification for BCI. Journal of Neural Engineering, 3, R13–R23. Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 227–244. Siegmund, D., & Venkatraman, E. S. (1995). Using the generalized likelihood ratio statistic for sequential detection of a change-point. Annals of Statistics, (23), 255–271. Sutcliffe, P. R., & Yumoto, K. (1991). On the cavity mode nature of low-latitude Pi 2 pulsations. Journal of Geophysical Research, 96(A2), 1543–1551. Takahashi, K., Lee, D. H., Nosé, M., Anderson, R. R., & Hughes, W. J. (2003). Crres electric field study of the radial mode structure of Pi 2 pulsations. Journal of Geophysical Research, 108(A5), 1210. Theis, F., & Inouye, Y. (2006). On the use of joint diagonalization in blind signal processing. In Proc. 2006 IEEE int. symp. circ. syst. (pp. 3589–3592). Tokunaga, T., Kohta, H., Yoshikawa, A., Uozumi, T., & Yumoto, K. (2007). Global features of Pi 2 pulsations obtained by independent component analysis. 
Geophysical Research Letters, 34, L14106. Tong, L., Liu, R., Soon, V., & Huang, Y. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38(5), 499–509. Uozumi, T., Yumoto, K., Kawano, H., Yoshikawa, A., Ohtani, S., Olson, J. V., et al. (2004). Propagation characteristics of Pi 2 magnetic pulsations observed at ground high latitudes. Journal of Geophysical Research, 109. von Bünau, P., Meinecke, F. C., Király, F. J., & Müller, K.-R. (2009a). Finding stationary subspaces in multivariate time series. Physical Review Letters, 103(21), 214101. von Bünau, P., Meinecke, F. C., Király, F.J., & Müller, K.-R. (2009b). Finding stationary subspaces in multivariate time series, supplemental materials. EPAPS document (E-PRLTA-103-014948). von Bünau, P., Meinecke, F. C., Scholler, S., & Müller, K.-R. (2010). Finding stationary brain sources in EEG data. In Proceedings of the 32nd Annual Conference of the IEEE EMBS (pp. 2810–2813). Argentina: Buenos Aires, http://dx.doi.org/10.1109/IEMBS.2010.5626537. Yeoman, T. K., & Orr, D. (1989). Phase and spectral power of mid-latitude Pi 2 pulsations: evidence for a plasmaspheric cavity resonance. Planetary and Space Science, 37(11), 1367–1383. Yilmaz, O. (2001). Seismic data analysis. Society of Exploration Geophysicists,. Yumoto, K., et al. (2001). Characteristics of Pi 2 magnetic pulsations observed at the CPMN stations: a review of the step results. Earth Planets and Space, 53, 981–992. Ziehe, A., Laskov, P., Nolte, G., & Müller, K.-R. (2004). A fast algorithm for joint diagonalization with non-orthogonal transformations and its application to blind source separation. Journal of Machine Learning Research, 5, 777–800.