Statistical information approaches for the modelling of the epileptic brain


Computational Statistics & Data Analysis 43 (2003) 79 – 108 www.elsevier.com/locate/csda

Panos M. Pardalos^{a,b,*}, J. Chris Sackellares^{b,c,d,e,g}, Leonidas D. Iasemidis^{h}, Vitaliy Yatsenko^{c}, Mark C.K. Yang^{e,f}, Deng-Shan Shiau^{c,e}, Wanpracha Chaovalitwongse^{a,c,e}

a Department of Industrial and Systems Engineering, Center for Applied Optimization, University of Florida, 303 Weil Hall, Gainesville, FL 32611-6595, USA
b Biomedical Engineering Program, University of Florida, Gainesville, FL 32610, USA
c Department of Neuroscience, University of Florida, Gainesville, FL 32610, USA
d Department of Neurology, University of Florida, Gainesville, FL 32610, USA
e McKnight Brain Institute, University of Florida, Gainesville, FL 32610, USA
f Department of Statistics, University of Florida, Gainesville, FL 32611, USA
g Malcolm Randall V.A. Medical Center, Gainesville, FL 32611, USA
h Departments of Biomedical Engineering, Arizona State University, Tempe, AZ 85287, USA

Received 1 February 2002; received in revised form 1 May 2002

Abstract

First, the theory of random processes is linked with the statistical description of epileptic human brain processes. A statistical information approach to the adaptive analysis of the electroencephalogram (EEG) is proposed. Then, the problem of time window recognition of the global stochastic model, based upon Bayesian estimation and the use of global optimization for restricted experimental data, is addressed. A robust algorithm for estimating unknown parameters of stochastic models is considered. The ability of nonlinear time-series analysis to extract features from the brain EEG signal for detecting epileptic seizures is evaluated. © 2003 Elsevier Science B.V. All rights reserved.

Keywords: Epilepsy; Statistical information; Optimization; Reconstruction; Detection; Similarity; Robust estimation

Research was partially supported by the NSF grants DBI-980821, EIA-9872509, NIH grant R01-NS-39687-01A1 and the Department of Veterans Affairs.
* Corresponding author. Department of Industrial and Systems Engineering, Center for Applied Optimization, University of Florida, 303 Weil Hall, Gainesville, FL 32611-6595, USA. Tel.: +1-352-392-9011; fax: +1-352-392-3537.
E-mail address: [email protected] (P.M. Pardalos).

0167-9473/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved. PII: S0167-9473(02)00152-4


1. Introduction

A major challenge in computational neurobiology is to define measures that accurately describe the dynamical features characterizing brain functions. The potential scientific and clinical applications for such measures are enormous. Recently, measures of information characteristics have been used in physics (Grassberger and Procaccia, 1983; Nicolis, 1995; Eubank and Farmer, 1997; Klimontovich, 1999; Yatsenko et al., 1994), mathematics (Chentsov, 1982; Amari, 1982), estimation theory (Golodnikov et al., 2000), signal processing (Fante, 1988; Vapnik, 1998; Principe et al., 2000; Fitzgerald et al., 2001; Percival and Walden, 2001), and neuroscience (Freeman and Skarda, 1985; Babloyantz and Destexhe, 1986; Diambra et al., 1999; Du et al., 2000; Sackellares et al., 2000; Casdagli et al., 1996). In this paper we apply statistical information approaches to electroencephalogram (EEG) signals (Niedermeyer and De Silva, 1983) in order to detect the development of epileptic seizures (ES).

Traditionally, there are two mathematical approaches for identifying ES. In the first, we assume knowledge of the dynamical model and initial conditions. The states of the epileptic human brain can then be regarded as points whose future evolution occurs along deterministic trajectories. We will refer to this type of epileptic human brain model as a dynamical system (Abarbanel, 1995; Casdagli et al., 1996; Du et al., 2000; Kaneko, 2001; Sackellares et al., 2000; Iasemidis et al., 2001). Alternatively, we can assume that both the dynamics and the initial conditions are uncertain, and treat this uncertainty as noise. We will call these types of models stochastic systems.

The dynamical flow of brain states is uncertain. Any EEG measurement has only finite precision and does not determine initial conditions precisely; experimental measurements inevitably contain unknown factors that cannot be taken into account. Thus any dynamical law is only an approximation. Deterministic models are good approximations when there are only a few degrees of freedom (Loskutov and Mikhailov, 1990; Nicolis and Prigogine, 1989; Nicolis, 1995; Eubank and Farmer, 1997). When we consider the human brain as a dynamical system with many degrees of freedom, deterministic models become intractable. This can happen because it is impossible to measure all dynamical variables, or because the interactions are so complicated and so poorly understood that it is impossible to model them (Arnold, 1974; Eubank and Farmer, 1997; Loskutov and Mikhailov, 1990; Nicolis, 1995). Even when there are only a few degrees of freedom, if the system is chaotic, a deterministic description is useful only over a short period of time. Over longer periods of time, uncertainties are amplified exponentially, and deterministic models break down (Loskutov and Mikhailov, 1990; Nicolis, 1995; Eubank and Farmer, 1997).

When we consider the dynamics of the human brain over a period of time which is longer than the prediction limit, deterministic dynamical models break down. When this occurs, we treat the dynamics as a stochastic process (Gardiner, 1985; Strogatz, 1994). Instead of modelling the deterministic evolution of the state through time, we must now model the evolution of an ensemble of possible states through time (Klimontovich, 1999; Eubank and Farmer, 1997; Nicolis, 1995). This is a more appropriate approach because a single point in a deterministic model becomes a function in a nondeterministic model, describing an entire set of possible states.
All the factors that are impossible to model

in detail are included in this function. Random process modelling attempts to minimize the effects of our ignorance by utilizing the limited information that is available.

At present, different principles and methods are used to estimate statistical characteristics of brain physiology. Most methods currently employed in biology are based on rather simplistic assumptions about the linearity and stationarity of the underlying processes. Hence, they are suboptimal in many situations. In many cases, these methods require a large amount of experimental data, which is not always available. Therefore, when the information about the dynamics is limited, it becomes necessary to formulate and to solve the problems concerned with analysis of the brain states. This paper introduces modern statistical information methods that have been developed in the field of nonlinear estimation (Wets, 1999; Brockett, 1981) and optimization. We apply these methods to nonlinear analysis of EEG data. We propose a robust nonlinear estimation algorithm for the unknown parameters of a stochastic model. Some geometric aspects of robust Bayesian estimation are examined. We compute the Kullback-Leibler distance (KLD) and use it for the detection of epileptic activity. The methodology presented here uses short time intervals to identify epileptic activity.

2. Problem statement

We will consider the epileptic human brain as a complex nonequilibrium dynamical system that exhibits a wide range of behaviors (Freeman and Skarda, 1985; Loskutov and Mikhailov, 1990; Nicolis, 1995). Over time, the normal brain proceeds through a variety of state transitions, such as wakefulness, drowsiness, and various stages of sleep. The epileptic brain makes additional pathological transitions into and out of seizure states. As these transitions occur, the complexity of the dynamical state changes. These different physiological states can be viewed as different attractors in state space. Our research has been based on dynamical analysis of long-term continuous intracranial recordings obtained for clinical diagnostic purposes (Fig. 1). These nonlinear dynamics with different attractors can be observed in phase space using the EEG.

Next, we consider the epileptic human brain as a stochastic system in which EEG data are observed at discrete moments in time t = 1, 2, ..., n. Let the time window extraction of the local dynamical variables y_t^i of the ith time window be described by a suitable parametric family of probability densities

p^i(y_t^i \mid u_t^i, x_{t-1}^i, \theta^i) = m_t^i(\theta^i),   (1)

conditioned on the latest input u_t^i, the data x_{t-1}^i = \{(y_r^i, u_r^i),\ r = 1, \ldots, t-1\} observed up to the time t - 1, and constant parameters \theta^i \in \Theta^i.

In this paper we propose that a more basic approach to ES detection problems is to exploit optimal reconstruction of a stochastic model and recursive robust Bayesian estimation of unknown parameters for restricted experimental EEG data within a local time window. We are bound to the analysis of a rationally chosen description of the true posterior density, which produces a geometrical structure defined on the family of possible posteriors. We shall use a Pythagorean-like relation of probability distributions for reconstruction under the condition of limited data.


Fig. 1. Diagram of adaptive analysis of EEG data (flowchart: measurement of EEG signals; characterization of the EEG signal; estimation of Lyapunov exponents; dynamical analysis; chaotic or stochastic dynamics; reconstruction of dynamical systems or of stochastic models; global optimization; information statistical analysis; data base; prediction of epileptic activities; decision making).

A robust algorithm for estimating unknown parameters of a local dynamical model will be proposed using examples of EEG segments represented in phase space. We adopt three main assumptions: (a) the structure of the local dynamical system extracted from the time window does not change; (b) a fixed amount of data is sufficient to evaluate m_t^i(\theta); (c) the functions m_t^i(\theta), t = 1, 2, \ldots are nowhere vanishing. Assumption (a) is based on the estimation of the embedding dimension (Takens, 1981). Assumption (b) excludes from our investigation problems where filtering of the data is necessary (e.g. estimation of ARMA model parameters). Assumption (c) restricts the class of possible models. Specifically, it eliminates regression models with bounded (e.g. uniformly distributed) noise. Under these assumptions, we examine statistical information approaches using system characteristics of the human brain and a reduced description of the EEG posterior density.

The main findings include (Chentsov, 1982):
(1) The evolution of human brain states may be considered from a stochastic point of view (Amari, 1985; Freeman, 2000).
(2) We consider human brain dynamics over a period of time for which the deterministic description breaks down.
(3) The family of possible posterior densities of the local time window, called the posterior family, can be imbedded into an exponential family.
(4) There exists a reduced statistical description of the true posterior density of the local window, which is Bayes-closed in the sense of a group homomorphism.
(5) The Bayes-closed description of the local window can be reduced to a set of equivalence classes of densities corresponding to some values of the parameters utilized.
(6) Any of the equivalence classes represents an exponential subfamily of the posterior exponential family. All these exponential subfamilies have the same "generating" functions.
(7) There exist orthogonal parameters of the local window, one of which identifies the posterior density and accounts for the transition within a specific equivalence class, while the other reflects a movement across equivalence classes which is orthogonal at any point to the corresponding equivalence class.
(8) The existence of the orthogonal parameterization of the local window implies that it is possible to construct a curved exponential subfamily of the posterior exponential family which is orthogonal to the equivalence classes that it intersects.
(9) The orthogonal setting of the curved exponential subfamily of the local window relative to the equivalence class enriches the intersection point with important extremal properties: this point minimizes the KLD from any point of the curved exponential subfamily and, at the same time, the dual KLD from the true posterior density.

3. Basic definitions

In this section we review some basic results and definitions which are used in this paper. We emphasize qualitative understanding rather than rigorous mathematical proof. For a detailed description the reader is referred to standard textbooks (Boothby, 1975; Kobayashi and Nomizu, 1963). The statistical connections are elaborated in Chentsov (1982) and Pugachev and Sinitsin (1987).

By the T-index (Iasemidis et al., 2001) at time t between electrode sites i and j, we mean the variable

T_{ij} = \sqrt{N} \times \frac{E\{\mathrm{STL}_{\max,i} - \mathrm{STL}_{\max,j}\}}{\sigma_{ij}(t)},   (2)

where E(\cdot) is the sample average of the difference (\mathrm{STL}_{\max,i} - \mathrm{STL}_{\max,j}) estimated over the moving window w_t(\lambda), defined as

w_t(\lambda) = \begin{cases} 1 & \text{if } \lambda \in [t - N - 1, t], \\ 0 & \text{if } \lambda \notin [t - N - 1, t], \end{cases}   (3)

N is the length of the moving window, \sigma_{ij}(t) is the sample standard deviation of the \mathrm{STL}_{\max} differences between electrode sites i and j within the moving window w_t(\lambda), and \mathrm{STL}_{\max} (short-time maximum Lyapunov exponent) is the operator of numerical estimation of L_{\max}.
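As a concrete illustration of Eq. (2), the T-index between two electrode sites can be computed from their STLmax profiles with a short script. The sketch below is a minimal numpy implementation under the definitions above; the synthetic STLmax sequences, the window length N and the use of the absolute windowed mean difference are illustrative assumptions, not values or conventions taken from the paper.

```python
import numpy as np

def t_index(stl_i, stl_j, N):
    """T-index of Eq. (2) between two STLmax profiles over a moving window of length N.

    stl_i, stl_j : 1-D arrays of STLmax estimates for electrode sites i and j.
    Returns an array whose element at time t uses the window ending at t.
    """
    d = np.asarray(stl_i, dtype=float) - np.asarray(stl_j, dtype=float)
    T = np.full(d.shape, np.nan)
    for t in range(N - 1, len(d)):
        w = d[t - N + 1 : t + 1]          # STLmax differences inside the moving window
        sigma = w.std(ddof=1)             # sample standard deviation sigma_ij(t)
        if sigma > 0:
            T[t] = np.sqrt(N) * np.abs(w.mean()) / sigma   # Eq. (2)
    return T

# hypothetical usage with synthetic STLmax profiles
rng = np.random.default_rng(0)
stl_a = 3.0 + 0.1 * rng.standard_normal(600)
stl_b = 3.2 + 0.1 * rng.standard_normal(600)
print(t_index(stl_a, stl_b, N=60)[-5:])
```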

The differentiable manifold P is a space (with some suitable topological properties) whose points can be mapped, at least locally, to a finite-dimensional Euclidean space (Kulhavý, 1996). Any suitable mapping (homeomorphism) φ : P → R^{dim P} can stand for a coordinate function. The transformations from one coordinate function to another are assumed to be differentiable and regular (i.e. diffeomorphisms).

The Lie group M is a group and a differentiable manifold at the same time. The group operations (composition and inversion of group elements) are assumed to be differentiable. Being a manifold, M can be provided with a coordinate function ρ : M → R^{dim M}. A mapping Φ : M → X from a Lie group M to another Lie group X is called a homomorphism of Lie groups if it is an algebraic homomorphism that retains group operations and, consequently, neutral elements: (a) Φ(m m') = Φ(m)Φ(m') for any m, m' ∈ M; (b) Φ(m^{-1}) = Φ(m)^{-1} for any m ∈ M; (c) Φ(1_M) = 1_X; and, at the same time, it is a differentiable mapping. The rank of a homomorphism is defined as the rank of the corresponding Jacobian matrix.

We say that a Lie group M acts on a differentiable manifold P (on the right) if there exists a mapping P × M → P, (p, m) ↦ pm, satisfying the conditions: (a) if 1_M ∈ M is the neutral element of M, then p 1_M = p for all p ∈ P; (b) if m, m' ∈ M, then (pm)m' = p(mm'). The action is called free if pm = p implies m = 1_M. Let P be a differentiable manifold and M a Lie group acting on P (on the right). M induces an equivalence relation on P: p and p' are equivalent if for some m ∈ M we have p' = pm. The quotient space of P relative to this equivalence is denoted by P/M. If the action of M on P is free, the canonical projection π : P → P/M is differentiable and P is locally trivial (diffeomorphic to the product manifold P/M × M). Then we say that P together with the action of M on P forms a principal fiber bundle over P/M, denoted by P(P/M, M).

The tangent space of a differentiable manifold P at the point p ∈ P is, roughly speaking, a vector space obtained by local linearization around p. It is composed of tangent vectors of the smooth curves passing through p. The natural basis of the tangent space T_ω, associated with specific coordinates ω = φ(p), is formed by the partial derivatives with respect to ω^i, namely ∂/∂ω^i, i = 1, 2, ..., dim P, which when applied to any differentiable function on P generate directional derivatives along particular coordinate curves. We define the inner product of two basis vectors ∂/∂ω^i and ∂/∂ω^j at a generic point p(θ; ω) ∈ P with the coordinate ω as

g_{ij}(\omega) = \left\langle \frac{\partial}{\partial\omega^i}, \frac{\partial}{\partial\omega^j} \right\rangle = E_p\left[ \frac{\partial \log p(\theta; \omega)}{\partial\omega^i} \, \frac{\partial \log p(\theta; \omega)}{\partial\omega^j} \right].   (4)
Expectation is always taken relative to the point at which the tangent space is constructed. The quantities g_{ij}(ω), i, j = 1, 2, ..., dim P together form the so-called metric tensor. The corresponding matrix (g_{ij}) is a Bayesian variant of the well-known Fisher information matrix. This definition makes the metric tensor invariant with respect to coordinate transformations both in θ and in ω (Savage, 1954; Sorenson, 1988). Using Eq. (4), the inner product of two vectors A = \sum_{i=1}^{\dim P} A^i \partial/\partial\omega^i and B = \sum_{i=1}^{\dim P} B^i \partial/\partial\omega^i can be expressed as

\langle A, B \rangle = \sum_{i=1}^{\dim P} \sum_{j=1}^{\dim P} A^i g_{ij} B^j.
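To make the metric tensor (4) concrete, it can be approximated for a specific parametric family by a Monte Carlo average of products of score components. The sketch below does this for a two-parameter Gaussian family p(x; ω) with ω = (mean, log-standard-deviation); the choice of family, parameterization and sample size are illustrative assumptions rather than anything used in the paper.

```python
import numpy as np

def score(x, mu, log_sigma):
    """Components of d/d_omega log p(x; omega) for a Gaussian with omega = (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    d_mu = (x - mu) / sigma**2
    d_log_sigma = ((x - mu) ** 2) / sigma**2 - 1.0
    return np.stack([d_mu, d_log_sigma], axis=-1)

def metric_tensor(mu, log_sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of g_ij(omega) = E_p[(d_i log p)(d_j log p)], Eq. (4)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, np.exp(log_sigma), size=n_samples)
    s = score(x, mu, log_sigma)
    return s.T @ s / n_samples

g = metric_tensor(0.0, np.log(2.0))
print(np.round(g, 3))   # close to diag(1/sigma^2, 2) for this parameterization
```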

We call the tangent vectors A, B orthogonal if \langle A, B \rangle = 0.

We adopt two main assumptions: (A1) a fixed amount of data is sufficient to evaluate m_t(\theta); (A2) the functions m_t(\theta), t = 1, 2, \ldots are nowhere vanishing. Assumption (A1) excludes from our investigation problems where filtering of the data is necessary (e.g. estimation of ARMA model parameters). Assumption (A2) restricts the class of possible models. Typically, it eliminates regression models with bounded (e.g. uniformly distributed) noise. Under these assumptions, we examine recursive Bayesian estimation using only a reduced description of the posterior density.

To test the independence of the preictal state and the ES state, we employ the Kullback-Leibler distance, which is a measure of the distance between two densities in probability space. The formal mathematical formula of the Kullback-Leibler distance is given in Section 11, where we also include experimental results validating the theoretical results, using the Kullback-Leibler distance to see whether two densities p and q of the EEG data are the same.

4. Characterization of brain dynamics

The problem of system characterization is that of trying to deduce properties of the dynamics from the EEG data. The traditional approach to this question in time-series analysis is to fit a linear model to the data by determining its optimal order and estimating the parameters of the model. A simple generalization is to fit the best nonlinear model. Without any prior knowledge, any model that we fit is likely to be ad hoc. We are more likely to be interested in questions such as the following: How nonlinear is the model? How many degrees of freedom does it have?

Forecasting provides one possible test for self-consistency. When the forecasts of a nonlinear model are significantly better than those of a linear model, there is good reason to believe the nonlinear model. The great advantage of this is that it is very unlikely that a long series of good forecasts happens at random; we may make one or two lucky guesses, but we are unlikely to make a long series of them. This can be made precise by applying standard tests for statistical significance (Abraham and Ledolter, 2000). Furthermore, when the order of approximation is known, the scaling properties of the error can be used to compute the dimension and the largest Lyapunov exponent.

Varying the number of data points and looking at the scaling of the prediction error gives an estimate of the dimension, whereas varying the extrapolation time t gives the exponents. Dynamical systems theory provides a language for properly posing these questions, which become: What is the dimension of the dynamics? How chaotic are the dynamics, or in other words, what is the metric entropy, or the largest Lyapunov exponent? (Abarbanel, 1995; Sackellares et al., 2000). It thus gives us a model-independent language for posing the problems of system characterization. These characteristics have been developed to characterize nonlinear dynamical systems, and their blind application to experimental time series may easily produce spurious results (Theiler, 1986; Ruelle, 1994). The problem is that algorithms for computing dimension, Lyapunov exponents, and metric entropy are notoriously unreliable. There are many subtleties involved in computing dimension, and many incorrect conclusions have been reached through naive application of these methods (Diambra et al., 1999).

5. Stochastic mathematical models of the EEG signal

Our approach to reconstructing a local model is based on an understanding of the neurophysiological and dynamical mechanisms affecting epileptic brain function (Freeman and Skarda, 1985; Diambra et al., 1999; Gröbler et al., 1998; Sackellares et al., 2000). We hypothesize that the reconstructed mathematical model characterizes the electrophysiological activity of the hippocampal region.

Let us have an EEG measurement x_t at a discrete time t, and for simplicity let t = 1, t = 2, and so forth. The set of measurements \{x_t\} is called a time series. If x is a scalar variable it is a univariate time series; if x is of higher dimension, it is a multivariate time series.

ARMA model. We may also describe the EEG signal of a time window by an ARMA model. We will call a system for which we may write

x_t = -\sum_{i=1}^{m} a_i x_{t-i} + \sum_{i=1}^{q} b_i n_{t-i+1},   (5)

i.e., one in which, letting a_0 = 1, the transfer function may be written as

H(z) = \frac{\sum_{i=1}^{q} b_i z^{-i+1}}{\sum_{i=1}^{m} a_i z^{-i}},   (6)

an ARMA model of order (m, q) of a single time window of the EEG time series. Here n is the input vector time series, x_t is the state vector, y is the output vector, and A, B and C are constant matrices. The set of all such systems may be analyzed as a system space with parametric representation {A, B, C}.
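A minimal sketch of the ARMA recursion (5) for a single time window is given below; it simulates the model for given coefficient vectors a and b and Gaussian input noise n_t. The coefficients and the window length are hypothetical, and fitting them to a real EEG window would require a separate estimation step (e.g. least squares or maximum likelihood), which is not shown.

```python
import numpy as np

def simulate_arma(a, b, length, seed=0):
    """Simulate x_t = -sum_{i=1..m} a_i x_{t-i} + sum_{i=1..q} b_i n_{t-i+1}, Eq. (5)."""
    a = np.asarray(a, dtype=float)   # AR coefficients a_1 ... a_m (a_0 = 1 is implicit)
    b = np.asarray(b, dtype=float)   # MA coefficients b_1 ... b_q
    m, q = len(a), len(b)
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(length)  # input noise series n_t
    x = np.zeros(length)
    for t in range(length):
        ar = -sum(a[i] * x[t - 1 - i] for i in range(m) if t - 1 - i >= 0)
        ma = sum(b[i] * n[t - i] for i in range(q) if t - i >= 0)
        x[t] = ar + ma
    return x

# hypothetical coefficients standing in for one EEG window
window = simulate_arma(a=[0.5, -0.3], b=[1.0, 0.4], length=500)
print(window[:5])
```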

Nonlinear stochastic model. A process is deterministic if there is some value of m such that the probability density function approaches a delta function in the limit of perfect measurements of \{x_t\}. When the dynamics are formulated this way, the dynamical laws, if any, are not apparent. An alternative way to define a random process is as a stochastic dynamical system, in either discrete time,

x_{t+1} = F(x_t, n_t),   (7)

or continuous time,

\dot{x} = F(x(t), n(t)),   (8)

where n(t) is a random variable representing the uncertain aspects of the time evolution, often called noise, and x_t is the state. The state no longer gives us precise information about the behavior of the brain, but it summarizes as much as we can know, assuming we cannot know the noise. The degree to which the process is deterministic or random depends on the dependence of F on the state x relative to the noise n. It also depends on whether the noise is completely random or whether it is correlated (either with x or with the noise at other times). When the noise influences the future dynamical behavior, as it does in Eq. (8), it is often called dynamical noise:

x_{t+1} = F(x_t) + n_t.   (9)

Another source of noise is measurement. It is impossible to measure a quantity with infinite precision. The unknown remainder can be considered noise. In the classical limit, a good measuring instrument does not affect the dynamics. In this case the "true" state x remains deterministic,

x_{t+1} = F(x_t),   (10)

but the observed state y does not:

y_t = x_t + \varepsilon_t.   (11)

We will assume that the measuring instruments are decoupled from the dynamics, and call any random process in this form observation noise. Observation noise is obviously much simpler than dynamical noise, but it is by no means trivial. Iterated once, an observational noise process looks like

y_{t+1} = F(y_t - \varepsilon_t) + \varepsilon_{t+1}.   (12)

The dynamics of y are much more complicated than those of x, although additive dynamical noise does not change the underlying deterministic behavior.

In order to be more specific, some notation will be introduced at this point. Given a time series x_1, x_2, \ldots, x_N, the general formulation of a nonlinear stochastic model is

x_t = f(x_{t-1}, x_{t-2}, \ldots, x_{t-m}, n_t, \theta),   (13)

where f : R^m \to R is a smooth map, which has been designated as the "skeleton", n_t represents dynamical noise, and m is a positive integer, the so-called embedding dimension. The mathematical model (13) may be written as

p(y_t \mid u_t, x_{t-1}, \theta) = m_t(\theta).   (14)
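The distinction between dynamical noise (9) and observation noise (11) can be made concrete with a short simulation. The sketch below uses the Hénon map (the same map that appears later in Fig. 5) as the deterministic skeleton F; the map parameters, the noise level and the initial condition are illustrative assumptions.

```python
import numpy as np

def henon(state, a=1.4, b=0.3):
    """Deterministic skeleton F of Eq. (10) for the Henon map."""
    x, y = state
    return np.array([1.0 - a * x**2 + y, b * x])

def simulate(n_steps=1000, delta=0.005, seed=0):
    rng = np.random.default_rng(seed)
    x_dyn = np.zeros((n_steps, 2))          # trajectory with dynamical noise, Eq. (9)
    x_det = np.zeros((n_steps, 2))          # purely deterministic trajectory, Eq. (10)
    x_dyn[0] = x_det[0] = (0.1, 0.1)
    for t in range(n_steps - 1):
        x_dyn[t + 1] = henon(x_dyn[t]) + delta * rng.standard_normal(2)
        x_det[t + 1] = henon(x_det[t])
    y_obs = x_det + delta * rng.standard_normal(x_det.shape)   # observation noise, Eq. (11)
    return x_det, x_dyn, y_obs

x_det, x_dyn, y_obs = simulate()
# dynamical noise is amplified by the chaotic dynamics, observation noise is not
print(np.abs(x_dyn - x_det).max(), np.abs(y_obs - x_det).max())
```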

Markov model and Fokker-Planck equation. One way to determine a random process is in terms of the joint probability of the values of the time series at successive times:

p(x_1, \ldots, x_d).   (15)

If the joint probabilities of order higher than d factorize in the following way,

p(x_1, \ldots, x_n) = p(x_1, \ldots, x_d) \prod_{t=d+1}^{n} p(x_t \mid x_{t-d}, \ldots, x_{t-1}), \qquad n > d,   (16)

then this is a dth order Markov process. The process is deterministic if there is some value of d such that the probability density function approaches a delta function in the limit of perfect measurements of x_t. We shall assume that the process is Markovian; therefore, for continuous processes, the following Fokker-Planck equation (Klimontovich, 1999) can be used for the statistical description of the time evolution:

\partial_t f(x, a, t) = \partial_x [D(x, a)\, \partial_x f] + \partial_x [A(x, a) f].   (17)

The kinetic form of the Fokker-Planck equation has been used by Klimontovich (1999), where a comparative analysis of the Ito, Stratonovich and kinetic forms of stochastic equations was carried out.

Controlled dynamical model. The controllable model describes the actual EEG signal behavior only locally. The parameters in this model thus depend on the current state of the process and vary in time. The amount of data relevant to the current working point is typically limited. To prevent mixing relevant and obsolete data in parameter estimation, we must introduce an explicit model of parameter variations, or apply some heuristics to "forget" information.

Consider a signal on which two sequences of random variables are measured,

y^t = (y_1, \ldots, y_{t-1}), \qquad u^t = (u_1, \ldots, u_{t-1}),   (18)

which take values in subsets V and N of R^{\dim y_t} and R^{\dim u_t}, respectively. The variable u_k is defined as a directly manipulated "input" to the patient at time k, chosen according to the past history of data represented by the sequences y^{k-1} and u^k. Both sequences together form a complete sample of data. Let the dependence of the variable y_t of the local stochastic model of a given EEG time window be described by a suitable parametric family of probability densities

p(y_t \mid u_t, x_{t-1}, \theta) = m_t(\theta),   (19)

conditioned on the latest control u_t, the data x_{t-1} = \{(y_r, u_r),\ r = 1, \ldots, t-1\} observed up to the time t - 1, and constant parameters \theta \in \Theta. The short notation m_t(\theta) will be preferred when (19) is taken as a function of the unknown parameter \theta for the EEG data. Let the prior uncertainty of \theta be expressed through a probability density p_0(\theta) built on the basis of the available initial information from the local time window. Then the update of the density p_{t-1}(\theta) by new data is defined by the Bayes theorem.

If the input generator employs no other information about \theta than the observed data (Peterka, 1981), i.e.

p(u_t \mid x_{t-1}, \theta) = p(u_t \mid x_{t-1}),   (20)

then the Bayes rule simplifies to the multiplicative "composition law"

p_t(\theta) = \alpha\, p_{t-1}(\theta)\, m_t(\theta),   (21)

where \alpha is a normalization factor. Note that instead of trying to estimate the unknown parameters \theta directly, we prefer to gather complete information about them, expressed through the posterior density p_t(\theta). This standpoint has strong conceptual advantages (Peterka, 1981), but the construction of the posterior density according to the functional relation (21) is rarely feasible. The main difficulty is that the dimension of the statistic sufficient to determine the true posterior p_t(\theta) may be too large or may even grow permanently with the increasing number of observed data points. Hence, with real computing facilities, we are forced to use a reduced statistic instead of a sufficient one.

6. Reconstruction of a local stochastic model

Let us consider a measurement system (MS) registering T signals. To each T signal (TS) corresponds a real sensor (electrode) signal of a certain form, which is transmitted to the sensors and appears at their input, possibly distorted and corrupted by random noise. The input T(t) is a stochastic process, described in terms of probability density functions. Let the TS be denoted by T_1, T_2, \ldots, T_N. A nonlinear stochastic model is given as

T_t = f(T_{t-1}, T_{t-2}, \ldots, T_{t-m}) + \varepsilon_t,   (22)

where f : R^m \to R is a smooth map, which has been designated as the "skeleton", \varepsilon_t represents noise, and m is a positive integer, the so-called embedding dimension. The data used to solve the reconstruction problem will be represented as \{T, y\} = \{T_i, y_i\}_1^n, where n = N - m, T_i \in R^m are the rows of T, and y_i \in R. Here we present a nonlinear method for the reconstruction of the function f given by (22) using a global optimization approach. Let

\hat{f} = \arg\min_{g \in H} R[g] = \arg\min_{g \in H} \left[ C \sum_{i=1}^{n} |y_i - g(T_i)|_{\varepsilon} + \frac{1}{2}\, \|g\|_H^2 \right],   (23)

where

|T|_{\varepsilon} = \begin{cases} 0 & \text{if } |T| \le \varepsilon, \\ |T| - \varepsilon & \text{otherwise} \end{cases}   (24)

is the so-called \varepsilon-insensitive loss function, C is a positive number, and H is a reproducing kernel Hilbert space with norm \|\cdot\|_H (Cristianini and Shawe-Taylor, 2001; Vapnik, 1998).

It can be proven that the function \hat{f} can be expressed by

\hat{f}(T) = \sum_{i=1}^{\infty} \omega_i \phi_i(T) + b,   (25)

i=1

where the set of functions {2i (T )}∞ i=1 is a basis of H . The optimization problem is deJned by n

mins x; x

n

n

n

i=1

i=1

1 s (xi − xi )(xjs − xj )L(Ti ; Tj ) − xis (yi − ,) + xi (yi + ,) 2 i=1 j=1

(26)

subject to 0 6 xi ; xis 6 C; i = 1; : : : ; n; n

(xis − xi ) = 0;

i=1

where

  (Ti − Tj ) L(Ti ; Tj ) = exp − v

(27)

is a Gaussian radial basis function kernel of the Hilbert space H (Vapnik, 1998), and v is the width of the kernel. The function f(T) can be expressed as

f(T) = \sum_{i=1}^{n} (x_i^s - x_i)\, K(T, T_i) + B.   (28)

It can be shown that the solution to this problem leads to several coefficients \gamma_i = (x_i^s - x_i) equal to zero. Therefore, the data points associated with them are not involved in the last expression. The remaining data points, called support vectors, contain all the information needed for the approximation of the function f. Note that if the number of support vectors is small, the calculation time of the method will be reduced.
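The ε-insensitive reconstruction of Eqs. (22)-(28) is standard support vector regression with a Gaussian kernel, for which off-the-shelf solvers exist. The sketch below embeds a scalar series into delay vectors T_i of dimension m and fits the map f with scikit-learn's SVR; the synthetic signal, the embedding dimension and the values of C, ε and the kernel width are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np
from sklearn.svm import SVR

def delay_embed(series, m):
    """Build the data {T_i, y_i}: rows T_i = (x_i, ..., x_{i+m-1}), targets y_i = x_{i+m}."""
    series = np.asarray(series, dtype=float)
    n = len(series) - m
    T = np.stack([series[i : i + m] for i in range(n)])
    y = series[m:]
    return T, y

# hypothetical scalar signal standing in for one EEG window (here: a noisy logistic map)
rng = np.random.default_rng(0)
x = np.empty(600); x[0] = 0.3
for t in range(599):
    x[t + 1] = 3.8 * x[t] * (1 - x[t]) + 0.002 * rng.standard_normal()

T, y = delay_embed(x, m=3)
model = SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma=1.0)   # gamma plays the role of 1/v in Eq. (27)
model.fit(T, y)
print("support vectors:", len(model.support_), "of", len(y))
print("one-step prediction error:", np.mean((model.predict(T) - y) ** 2))
```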

7. Reconstruction of a purely deterministic model

The shadowing problem is that of finding a purely deterministic orbit x_{t+1} = f(x_t) that stays close to y_t, or in other words, a deterministic orbit that shadows the noisy orbit. The problems of shadowing and noise reduction are closely related. On one hand, shadowing is not a noise reduction problem, because the shadowing orbit x is an artificial construction. On the other hand, if we pretend that the shadowing orbit x is a signal, then the "effective noise" \tilde{n} can be defined trivially so that

y_t = x_t + \tilde{n}_t.   (29)

An alternative method is based on an optimization approach and the original Anosov-Bowen construction. We introduce a distance between points on the noisy orbit y and points on the shadowing orbit x. This amounts to using the Euclidean norm to measure the distance between the two trajectories,

O(x, y) = \frac{1}{N} \sum_{t} \|y_t - x_t\|^2,   (30)

where \|y_t - x_t\| is the Euclidean distance between the vectors x_t and y_t; the Euclidean norm is thus used both to measure the distance between points and to measure the distance between the two trajectories. Minimizing O is equivalent to minimizing O^2. For the given noisy trajectory segment y_t, we wish to find a trajectory segment x that minimizes O^2(x, y), subject to the constraint that x is deterministic with respect to f, that is,

x_{t+1} = f(x_t)   (31)

for t = 1, 2, \ldots, N - 1. This problem is straightforward to solve with the method of Lagrange multipliers. It is equivalent to minimizing

Q = \sum_{t=1}^{N} \|y_t - x_t\|^2 + 2 \sum_{t=1}^{N-1} (f(x_t) - x_{t+1})^{\dagger} \lambda_t,   (32)

where \dagger denotes the transpose and \{\lambda_t\} are the Lagrange multipliers. The factor of two in front of the sum is a convenience that simplifies subsequent expressions. Differentiating with respect to the unknowns and searching for an extreme point gives the following system of equations:

\frac{\partial Q}{\partial x_t} = 0 = -(x_t - y_t) + f'^{\dagger}(x_t)\, \lambda_t - \lambda_{t-1}, \qquad \frac{\partial Q}{\partial \lambda_t} = 0 = f(x_t) - x_{t+1},   (33)

where \lambda_t = 0 if t < 1 or t > N - 1, and f'(x_t) = df/dx\,(x_t) is the Jacobian matrix of f at x_t. In the first equation t runs from 1 to N, whereas in the second t runs from 1 to N - 1. Typically, f is nonlinear and this system of equations has no closed-form solution. Because Eq. (33) is nonlinear, numerical approximation must be used to solve it in general. Any of a large variety of search algorithms can be used, for example, Newton's method. A careful analysis shows that such a search requires inverting a matrix M. The structure of the matrix M complicates the search. To get the largest possible noise reduction we want to make N as large as possible. However, when the underlying dynamics are chaotic, M becomes ill conditioned. Furthermore, approximate homoclinic tangencies cause it to be nearly rank deficient. Both of these factors make M difficult to invert when N is large.
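One practical way to approximate the shadowing orbit without inverting the ill-conditioned matrix M is to relax the constraint (31) into a quadratic penalty and solve the resulting least-squares problem. The sketch below does this with scipy for the Hénon map; the penalty weight, noise level and trajectory length are illustrative assumptions, and the penalized formulation is a simplification of, not identical to, the Lagrange-multiplier system (33).

```python
import numpy as np
from scipy.optimize import least_squares

def henon(x, a=1.4, b=0.3):
    return np.array([1.0 - a * x[0]**2 + x[1], b * x[0]])

def residuals(flat_x, y, mu):
    """Stack (x_t - y_t) and sqrt(mu) * (f(x_t) - x_{t+1}) as residuals of a penalized problem."""
    x = flat_x.reshape(y.shape)
    fit = (x - y).ravel()
    dyn = np.concatenate([np.sqrt(mu) * (henon(x[t]) - x[t + 1]) for t in range(len(x) - 1)])
    return np.concatenate([fit, dyn])

# build a noisy orbit y_t around a true deterministic orbit
rng = np.random.default_rng(1)
N = 60
x_true = np.zeros((N, 2)); x_true[0] = (0.1, 0.1)
for t in range(N - 1):
    x_true[t + 1] = henon(x_true[t])
y = x_true + 0.01 * rng.standard_normal(x_true.shape)

sol = least_squares(residuals, y.ravel(), args=(y, 1e3))
x_shadow = sol.x.reshape(y.shape)
print("rms noise before shadowing:", np.sqrt(np.mean((y - x_true) ** 2)))
print("rms error of shadowing orbit:", np.sqrt(np.mean((x_shadow - x_true) ** 2)))
```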

8. Stochastic and deterministic dynamics

Noise reduction is a vague term for a class of problems in which we wish to decompose a signal into a part that we want and a part that we do not want. To be a little more precise, suppose we observe a noisy EEG time series \{y_t\}, where t = 1, \ldots, N. Assume that it can be decomposed as

y_t = x_t + n_t,   (34)

where x_t is the "desired" part, or the signal, and n_t the noise. We can only observe the noisy signal y_t, but we wish to recover the pure signal x_t. What we call "noise" is usually subjective, and to reduce it one must have a criterion for distinguishing n_t from x_t. For example, if the noise is predominantly at high frequencies, and the signal predominantly at low frequencies, we can reduce the noise by simply applying a low-pass filter. However, if the noise and the signal have similar frequencies, spectral methods are ineffective and more sophisticated criteria are required.

Recent methods for noise reduction assume that the distinguishing feature of the signal \{x_t\} is that it is a trajectory of an invertible dynamical system f, that is, x_{t+1} = f(x_t). The noise \{n_t\} might also be a trajectory of a dynamical system; if so, we will assume that it is substantially different from f. Often this means that the dimension of the noise is much greater than that of the signal; we state this precisely by saying that it is high dimensional. The noise reduction problem can be broken into two parts: learning the dynamical system f, and reducing the noise once it is known. There are three cases: (A1) we are given f; (A2) we must learn an approximation to f, but we have available some "clean" data with known values of x_t; (A3) we must learn f directly from the noisy data y_t. In all three of these cases we are ultimately faced with problem (A1), that of reducing the noise once we know f or have an approximation to it. The best approach to noise reduction depends on the goal. For example, we may not care about recovering each value of the signal x_t exactly; we may be interested only in recovering a statistical property, such as the power spectrum. Statistical noise reduction is intrinsically simpler than detailed noise reduction.

9. Estimating parameters of the local controlled dynamical model

We consider the problem of recursive discrete-time Bayesian parameter estimation for model (19) with the following assumptions.

(h1) The EEG data information is sufficient to determine the correlation dimension D_2.
(h2) The functions m_t(\theta), t = 1, 2, \ldots have the same support \Theta, i.e. m_t(\theta) > 0 almost surely for any \theta \in \Theta.

(h3) The data statistics are sufficient to determine m_t(\theta) as a function of the unknown parameters and have a limited finite dimension at any t.

Under these assumptions, we can embed the Bayes estimation problem into a standard differential-geometric method (Kulhavý, 1996). Because of assumption (h3), Bayes rule (21) can be converted into the form

\log p_t(\theta) = \mathrm{const} + \log p_0(\theta) + \sum_{r=1}^{t} \log m_r(\theta).

The additive composition law inspires us to embed the term \sum_{r=1}^{t} \log m_r(\theta) into a vector space F. In such a case, some points of F coincide with the log-likelihood \log p(y \mid u, x, \theta) for different data; other points result from their additive composition and are added just to complete a vector space. Clearly, this method of composition is not unique, but it is the simplest method in terms of both its definition and the dimensionality of the enveloping vector space.

The "extended" space of likelihoods M = \exp F has the useful property of being a group, with the group operation defined by pointwise multiplication of functions. There is an obvious bijection m(\theta) \leftrightarrow f(\theta) = \log m(\theta): if f = \log m and f' = \log m', then m m' = \exp(f + f') and m^{-1} = \exp(-f). Note also that \exp 0 = 1. As the posterior density is normalized in order to give a unit integral over \Theta, there is no need to care about the norm of the likelihood functions m(\theta) during computations. This is why we consider equivalence classes \tilde{m} of functions m(\theta) differing only by their norm. The same equivalence relation will also be applied to functions q(\theta) representing unnormalized posterior densities. These considerations motivate the following definition.

Let \Lambda be the set composed of the log-likelihood functions \log p(y \mid u, x, \theta) corresponding to all possible values of EEG data (y, u, x) of the local time window. Denote by F the linear hull of \Lambda, the smallest vector subspace of the enveloping vector space L that contains \Lambda. Introduce the space of unnormalized densities Q = \exp L and likelihoods M = \exp F as the spaces of exponential functions of the elements of L and F, respectively. Define an equivalence relation \sim on the vector space L so that two functions from L are equivalent if they differ at most by an additive finite constant. This relation induces on M and Q the equivalence of proportionality, i.e. equality up to a positive finite multiplicative factor. By the geometric model of Bayesian estimation we will call the object (\tilde{Q}, \tilde{M}, \phi) with the following meaning: \tilde{Q} and \tilde{M} are the quotient spaces Q/\!\sim and M/\!\sim, respectively, and the mapping \phi : \tilde{Q} \times \tilde{M} \to \tilde{Q} is defined by

\phi(\tilde{q}, \tilde{m}) = \{ c\, q(\theta)\, m(\theta) \mid c > 0,\ q \in \tilde{q},\ m \in \tilde{m} \}.   (35)

The mapping \phi is the unnormalized version of the Bayes rule (21). We adopt the following assumption:

(h4) The linear hull F of the set \Lambda of log-likelihood functions \log p(y \mid u, x, \theta) over all possible data (y, u, x) has a finite dimension N.

Finally, in the sequel we restrict our attention to the case:

(h5) \tilde{Q} = \tilde{q}_0 \tilde{M} = \{\tilde{q}_0 \tilde{m} \mid \tilde{m} \in \tilde{M}\} for an arbitrary but fixed prior density q_0 \in \tilde{q}_0 of the local time window.

It can be verified that the above geometric model is well defined in the sense of differential geometry. To be able to introduce the Riemannian geometry for the Bayesian estimation problem, we introduce, in addition to \tilde{Q}, the corresponding space of normalized densities. Let \tilde{P} \subset \tilde{Q} contain all of the equivalence classes \tilde{q} = \{c\, q(\theta) \mid c > 0\} from \tilde{Q} which are composed of integrable functions for which \int q(\theta)\, d\nu(\theta) < \infty. By the posterior family P we mean the space of normalized densities p(\theta) corresponding in a one-to-one way to particular equivalence classes \tilde{p} \in \tilde{P}:

p(\theta) = \frac{q(\theta)}{\int q(\theta)\, d\nu(\theta)}.

Assume (h1)-(h5) are fulfilled. Then \tilde{Q} is a differentiable manifold, \tilde{M} is a Lie group with the group operation induced by pointwise multiplication of functions, and \phi defined by (35) specifies the action of \tilde{M} on \tilde{Q}. The space of normalized densities is also well defined from a differential-geometric viewpoint (Chentsov, 1982): if (h1)-(h5) are fulfilled, then P is a differentiable manifold.

Now we are ready to formulate the problem. According to (h4), a complete description of \tilde{m} \in \tilde{M} can be given through a coordinate function \tilde{M} \to R^N. Note that this description can be given other interpretations. From the statistical viewpoint, it is a statistic, i.e. a function of the observed data. From the hierarchical Bayesian viewpoint, it is a hyper-parameter that specifies uniquely the posterior density of the unknown parameters \theta (provided that the prior density is fixed). In a concrete implementation of recursive Bayesian estimation we are often forced to store only a reduced description \chi of \tilde{m} whose dimension is smaller than N. Choosing an inappropriate description \chi, we can completely lose control over the estimation. Thus, we are searching for a "good" description which would retain the basic operations over \tilde{M}: find a homomorphism (of Lie groups) of rank n < N,

\chi : \tilde{M} \to R^n.   (36)

If such a mapping exists, perform the optimal recursive estimation on the equivalence classes [\tilde{q}] = \tilde{q}_0 [\tilde{m}] with

[\tilde{m}] = \{ \tilde{m}' \in \tilde{M} \mid \chi(\tilde{m}') = \chi(\tilde{m}) \},   (37)

in the sense that

[\tilde{q}_{t-1} \tilde{m}_t] = [\tilde{q}_{t-1}][\tilde{m}_t] = \{ \tilde{q}'_{t-1} \tilde{m}'_t \mid \tilde{q}'_{t-1} \in [\tilde{q}_{t-1}],\ \tilde{m}'_t \in [\tilde{m}_t] \}.   (38)

It is the relation (38) that has motivated us to search for a description \chi which would have the basic property of homomorphisms. If a homomorphic description \chi is applied, there is no difference between one-shot and recursive estimation. This is why we call such a description Bayes-closed.

Clearly, using a homomorphism \chi : \tilde{M} \to R^n of rank n < N we are not yet able to specify the true density p_t \in P exactly, but we can determine the equivalence class

[p_t] = \left\{ p \in P \;\middle|\; p(\theta) = \frac{p_0(\theta)\, m(\theta)}{\int p_0(\theta)\, m(\theta)\, d\nu(\theta)},\ m \in \tilde{m},\ \tilde{m} \in [\tilde{m}_t] \right\}   (39)

in which p_t lies. Thus, optimal recursive estimation of the local stochastic model parameters implies a closely related problem of "approximation": to find a rationally justified representative of the equivalence class [p_t] which could be used in subsequent decision making. Our aim is to suggest a procedure that would give the same result both for the case when p_t is known and for the case when just \chi(\prod_{r=1}^{t} \tilde{m}_r) is given (in addition to knowledge of p_0).

9.1. Approximation of the recursive estimation

Given a homomorphism \chi : \tilde{M} \to R^n of rank n, find a subfamily Z \subset P orthogonal to all equivalence classes [p] \subset P defined by (39). If such a family exists, construct the approximating density \hat{p}_t by the orthogonal projection of the true posterior p_t onto Z.

We note that the manifold P specified by definition 9 represents the exponential family with a generic point written in local coordinates \omega = (\omega^1, \ldots, \omega^N) as

p(\theta; \omega) = \alpha\, p_0(\theta) \exp\left( \sum_{k=1}^{N} \omega^k f_k(\theta) \right).   (40)

This observation follows directly from definition 9, since the set \{f_1, f_2, \ldots, f_N\} forms a basis of the vector space F. Given a basis f_1, \ldots, f_N, \tilde{m} \in \tilde{M} can be described completely by local coordinates \omega^1, \ldots, \omega^N. The construction of a reduced but homomorphic description of \tilde{m} is given by the following propositions.

Proposition 9.1. Define the mapping \chi : \tilde{M} \to R^n of rank n through

\chi_i(\tilde{m}) = E_{p_0}[h_i(\theta) \log m(\theta)], \qquad i = 1, \ldots, n,   (41)

where h_i(\theta) \in F are n linearly independent functions such that E_{p_0}[h_i] = 0, i = 1, \ldots, n, and m(\theta) stands for an arbitrary representative of the equivalence class \tilde{m}. The mapping in Eq. (41) forms a homomorphism (of Lie groups). Any other homomorphism is related to Eq. (41) by an isomorphism.

Proof. Any (closed regular) Lie subgroup \check{M} \subset \tilde{M} is a kernel of some homomorphism of \tilde{M} (see Sorenson, 1988, Theorem 19; Chentsov, 1982, Theorem 6.14). One obvious homomorphism with the kernel \check{M} is the canonical homomorphism \chi^* : \tilde{M} \to \tilde{M}/\check{M}, which assigns to any \tilde{m} the corresponding equivalence class \tilde{m}\check{M}. Any other homomorphism is related to the canonical one through an isomorphism to \tilde{M}/\check{M}. Therefore, it is sufficient to find a definition of the mapping \chi that ensures that \chi(\tilde{m}\check{m}) = \chi(\tilde{m}) for any \check{m} \in \check{M}, \tilde{m} \in \tilde{M} and, at the same time, \chi(\tilde{m}) = \chi(\tilde{m}') if \tilde{m}\check{M} = \tilde{m}'\check{M}. Taking into account the analytic expression for fibers, it is easy to verify that

the definition (41) fulfills the above condition when the terms E_{p_0}[h_i f_i] and E_{p_0}[h_i] are equal to zero.

The kernel of the homomorphism \chi defined by (41) is a Lie subgroup \check{M} of \tilde{M} (Boothby, 1975). The use of a reduced description \chi induces an equivalence relation on \tilde{Q}; namely, \tilde{q} and \tilde{q}' are equivalent if \tilde{q}' = \tilde{q}\check{m} for some \check{m} \in \check{M}. Thus, we are able to distinguish only among the sets

\tilde{q}\check{M} = \{ \tilde{q}\check{m} \mid \check{m} \in \check{M} \} = \{ \tilde{q}\tilde{m} \mid \chi(\tilde{m}) = 0 \},   (42)

which represent the equivalence classes [\tilde{q}] = \tilde{q}[\tilde{m}]. In terms of differential geometry, the action of \check{M} on \tilde{Q} produces a principal fiber bundle \tilde{Q}(\tilde{Q}/\check{M}, \check{M}). Thus, the equivalence classes [\tilde{q}] represent submanifolds of \tilde{Q} which are mapped, in a straightforward way, to the equivalence classes of normalized densities [p] \subset P.

The Lie subgroup \check{M} (the kernel of \chi) corresponds to a vector subspace \check{F}, through the formula \check{M} = (\exp \check{F})/\!\sim. Using (41) implies that \check{F} is spanned by the functions v_1(\theta), \ldots, v_{N-n}(\theta) which satisfy E_{p_0}[h_i v_j] = 0, i = 1, \ldots, n, j = 1, \ldots, N - n. A generic density of the equivalence class (fiber) [p] through p can be expressed in local coordinates r = (r^1, \ldots, r^{N-n}) as

p(\theta; s, r) = \alpha\, p(\theta) \exp\left( \sum_{j=1}^{N-n} r^j v_j(\theta) \right).   (43)

Hence, fibers form exponential subfamilies of P, and the posterior exponential family may be written in the form

p(\theta; s, r) = \alpha\, p_0(\theta) \exp\left( \sum_{i=1}^{n} s^i h_i(\theta) \right) \exp\left( \sum_{j=1}^{N-n} r^j v_j(\theta) \right).   (44)

Re-parametrize Eq. (44) as follows:

p(\theta; \bar{s}, r) = \alpha\, p_0(\theta) \exp\left( \sum_{k=1}^{N} \omega^k(\bar{s})\, f_k(\theta) \right) \exp\left( \sum_{j=1}^{N-n} r^j v_j(\theta) \right),   (45)

where the first exponent models a movement of the point p over an n-dimensional submanifold (subsurface) of P (see (40)) through a smooth vector-valued function \omega = \omega(\bar{s}) of \bar{s} = (\bar{s}^1, \ldots, \bar{s}^n), while the second one explains changes within the corresponding fiber. The following proposition solves the problem of finding a parametrization (\bar{s}, r) such that, at any point p of

Z = \left\{ p \in P \;\middle|\; p(\theta; \bar{s}) = \alpha\, p^*(\theta) \exp\left( \sum_{k=1}^{N} \omega^k(\bar{s})\, f_k(\theta) \right) \right\},   (46)

the tangent vectors of the (N - n)-dimensional fiber [p] in p and of the n-dimensional family Z (46) in a reference point p^* are orthogonal:

\left\langle \frac{\partial}{\partial \bar{s}^i}, \frac{\partial}{\partial r^j} \right\rangle = 0.   (47)

Combining (44) and (45), this condition reads

E_p\left[ \frac{\partial}{\partial \bar{s}^i} \log p(\theta; \bar{s}, r)\; \frac{\partial}{\partial r^j} \log p(\theta; \bar{s}, r) \right] = 0.   (48)

Proposition 9.2. Given a set of equivalence classes [p] \subset P produced by the Lie subgroup \check{M}, there exists a submanifold Z \subset P through a given point p^* \in P that is orthogonal to all equivalence classes which it intersects. An intersection point \hat{p} \in [p] \cap Z fulfills the identities

E_{p_0}[h_i \log \hat{p}] = E_{p_0}[h_i \log p],   (49)

E_{\hat{p}}[v_j] = E_{p^*}[v_j].   (50)

Proof. After taking the partial derivative in the first term of (48), we get

\int \frac{\partial}{\partial \bar{s}^i}\, p(\theta; \bar{s}, r)\; \frac{\partial}{\partial r^j} \log p(\theta; \bar{s}, r)\, d\nu(\theta) = 0.

By substituting (45) for p(\theta; \bar{s}, r) in the second term and taking into account that the densities are normalized, we derive after some manipulations

\int \frac{\partial}{\partial \bar{s}^i}\, p(\theta; \bar{s}, r)\; v_j(\theta)\, d\nu(\theta) = 0.

Finally, changing the order of integration and the partial derivative \partial/\partial \bar{s}^i, we derive (\partial/\partial \bar{s}^i) E_p[v_j] = 0. Thus, E_p[v_j] = \mathrm{const} for any p \in Z, which implies directly (50). Identity (49) results from the fact that \hat{p} \in [p] and from the definition (41) of \chi.

Proposition 9.2 implies that there exists an orthogonal parameterization (\bar{s}, r) of the manifold P making the changes within Z and within the equivalence classes [p] locally independent, in the sense of a zero inner product of the tangent vectors. Z can be interpreted as follows. Define at each point (\bar{s}, r) \in P a vector space T(Z) consisting of the vectors that are tangential to P and, at the same time, orthogonal to the tangent space of [p]. Then the submanifold Z such that its tangent space coincides with T(Z) at any point is called the integral submanifold of the vector fields determined by the basis vectors of T(Z). In a statistical context, the family Z is said to be a curved exponential family (Savage, 1954; Sorenson, 1988). Geometrically speaking, Z is an n-dimensional differentiable submanifold imbedded in the N-dimensional manifold P of the posterior exponential family.

Relation (50) of Proposition 9.2 implies that E_p[v_j] = E_{p'}[v_j] = E_{p^*}[v_j] for any p, p' \in Z and, hence, E_{\lambda p + (1-\lambda)p'}[v_j] = E_p[v_j] for 0 \le \lambda \le 1. Thus, Z can also be interpreted as a differentiable submanifold embedded in a mixture family defined as the smallest space of densities which contains Z and is closed with respect to the convex composition \lambda p + (1-\lambda)p', 0 \le \lambda \le 1. It can be proved (see the proof of Proposition 9.2) that Z is orthogonal to all fibers [p] that it intersects. The optimal local properties of the orthogonal parameterization are accompanied by very attractive global properties.

Corollary 9.3. The intersection point \hat{p} \in [p] \cap Z, Z \ni p^*, represents the unique solution of the following dual optimization problems:

\min_{\hat{p} \in Z} \int \hat{p}(\theta) \log \frac{\hat{p}(\theta)}{p(\theta)}\, d\nu(\theta),   (51)

\min_{\hat{p} \in [p]} \int p^*(\theta) \log \frac{p^*(\theta)}{\hat{p}(\theta)}\, d\nu(\theta).   (52)

Proof. The corollary follows by application of Theorems 3.8 and 3.9 in Insua and Ruggeri (2000). For more details on "projection geometry", see Kullback and Leibler (1951).

The above formulas have a great intuitive appeal. First, \hat{p} represents the point of [p] that minimizes the KLD from a reference density p^* (typically the prior density p_0) of the local window. At the same time, \hat{p} represents the point of Z that minimizes the dual KLD from the true density p (although we do not know p completely); see Fig. 2. In the next section, we summarize the previously discussed results in algorithmic form.

9.2. The algorithm

1. Initialization. Choose a prior p_0(\theta).
2. Specify the posterior exponential family P (see (40)).
3. To determine the form (44) of the posterior exponential family, choose a base \{h_1, \ldots, h_n, v_1, \ldots, v_{N-n}\} of the vector space F.
4. Ensure that the base fulfills E_{p_0}[h_i] = 0, E_{p_0}[v_j] = 0 and E_{p_0}[h_i v_j] = 0 for all i = 1, \ldots, n, j = 1, \ldots, N - n.
5. Evaluate the metric tensor at p_0,

   g_{ij}(p_0) = \left\langle \frac{\partial}{\partial s^i}, \frac{\partial}{\partial s^j} \right\rangle_{p_0} = E_{p_0}[h_i h_j],

   for i, j = 1, \ldots, n, and its inverse g^{-1}(p_0).
6. Set \chi_0 = 0 and t := 1.
7. (Recursive steps.) Observe new data u_t, y_t.

Fig. 2. A common structure of the adaptive detection of epileptic seizures (flowchart: EEG measurement; reconstruction of the stochastic model; estimation of Lyapunov exponents; choose a prior probability density p_0; specify the posterior exponential family P; choose a base of the vector space; evaluate the metric tensor; observe new data; evaluate and update Δχ_t; determine s_t and r_t; increment t := t + 1; estimation of the Kullback-Leibler distance; prediction and optimal decision making).

8. Evaluate \Delta\chi_t = E_{p_0}[h_i(\theta) \log m_t(\theta)].
9. Update \chi_t = \chi_{t-1} + \Delta\chi_t.
10. Determine s_t such that s_t = g^{-1}(p_0)\,\chi_t = s_{t-1} + g^{-1}(p_0)\,\Delta\chi_t; for a prespecified (possibly time-variable) reference density p_t^*, determine r_t such that E_{\hat{p}_t}[v_j(\theta)] = E_{p_t^*}[v_j(\theta)].
11. Increment t := t + 1 and go to step 7.
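A grid-based sketch of steps 1-9 (and the coordinate part of step 10) is given below for a one-dimensional parameter θ. It uses a Gaussian observation model as the likelihood m_t(θ) and two centered functions h_1, h_2 as the reduced description; the matching of E[v_j] against a reference density in step 10 is omitted, and all model choices are illustrative assumptions rather than the EEG model of the paper.

```python
import numpy as np

# step 1: prior p0(theta) on a grid
theta = np.linspace(-5, 5, 2001)
dtheta = theta[1] - theta[0]
p0 = np.exp(-0.5 * theta**2); p0 /= p0.sum() * dtheta

def E_p0(f):                                   # expectation with respect to the prior
    return np.sum(f * p0) * dtheta

# steps 3-4: centered base functions h_i with E_p0[h_i] = 0
h = np.stack([theta - E_p0(theta), theta**2 - E_p0(theta**2)])

# step 5: metric tensor g_ij(p0) = E_p0[h_i h_j] and its inverse
g = np.array([[E_p0(h[i] * h[j]) for j in range(2)] for i in range(2)])
g_inv = np.linalg.inv(g)

# step 6: initialise the reduced statistic
chi = np.zeros(2)
s = np.zeros(2)

# steps 7-9: observe data and update recursively (hypothetical observations y_t ~ N(theta_true, 1))
rng = np.random.default_rng(0)
for y_t in rng.normal(1.5, 1.0, size=50):
    log_m_t = -0.5 * (y_t - theta) ** 2                           # log m_t(theta), up to an
    d_chi = np.array([E_p0(h[i] * log_m_t) for i in range(2)])    # additive constant that
    chi = chi + d_chi                                             # drops out since E_p0[h_i] = 0
    s = g_inv @ chi                                               # coordinate part of step 10

print("reduced description chi_t:", chi)
print("coordinates s_t          :", s)
```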

10. Information characteristics of EEG data

In this section, basic definitions of entropy and information are introduced.

10.1. Entropy and information

The entropy H(p) of a probability density p is

H(p) = -\int p(x) \log p(x)\, dx.   (53)

The entropy of a distribution over a discrete domain is

H(p) = -\sum_i p_i \log p_i.   (54)

The entropy of EEG data (obtained from multiple channels as in Fig. 3) describes the extent to which the distribution is concentrated on small sets. If the entropy is low, the distribution is concentrated at a few values of x. It may, however, be concentrated on several sets far from each other. Thus, the variance can be large while the entropy is small. The entropy can be written as

H(p) = -\langle \log p(x) \rangle.   (55)

We see that the average depends explicitly on p. The entropy can thus be viewed as a self-moment of the probability, in contrast to the ordinary moments, for example \langle x^2 \rangle, which are averages over quantities that do not depend explicitly on p; entropy therefore carries a different kind of information than the ordinary moments. Entropy measures the degree of surprise one should feel upon learning the result of a measurement. It counts the number of possible states, weighting each by its likelihood. Joint and conditional entropies are defined by the equations

H(x, y) = -\int p(x, y) \log p(x, y)\, dx\, dy,   (56)

H(x \mid y) = -\int p(x, y) \log p(x \mid y)\, dx\, dy.   (57)

The negative of entropy is sometimes called information:

I(p) = -H(p).   (58)

For our purposes the distinction between the two is semantic. Entropy is most often used to refer to measures, whereas information typically refers to probabilities. For two random variables x and y, the mutual information states how much information y gives about x that is not present in x alone:

I(x; y) = H(x) - H(x \mid y).   (59)

H(x) is the uncertainty in x; H(x \mid y) is the uncertainty in x given y. Because knowing y cannot make x more uncertain, H(x) \ge H(x \mid y). The mutual information is therefore nonnegative. The mutual information tells how much the uncertainty of x is decreased by knowing y.
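Discrete plug-in versions of Eqs. (54) and (59) can be estimated from signal windows by histogramming the amplitudes. The sketch below estimates I(x; y) for two synthetic channels with a simple histogram estimator; the bin count and the signals are illustrative assumptions, and estimates of this kind are biased for short windows.

```python
import numpy as np

def entropy(p):
    """Discrete entropy H(p) = -sum p_i log p_i, Eq. (54), ignoring empty bins."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(x, y, bins=32):
    """Plug-in estimate of I(x; y) = H(x) + H(y) - H(x, y), equivalent to Eq. (59)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy.ravel())

# two hypothetical channels: y is a noisy, delayed copy of x
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.3 * rng.standard_normal(4000)
y = np.roll(x, 5) + 0.3 * rng.standard_normal(4000)
print("I(x; y)          =", mutual_information(x, y))
print("I(x; shuffled y) =", mutual_information(x, rng.permutation(y)))
```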

Fig. 3. Diagram of implanted electrode locations.

Fig. 4. Plots of the Lmax and entropy over time derived from an EEG signal recorded at BL1 , an electrode site overlying the seizure focus.

In Fig. 4, examples of the STL_max and entropy profiles from EEG recordings over a 2-hour interval including one seizure are illustrated.

10.2. Kullback-Leibler distance

The KLD (the ±1-divergence) was introduced by Kullback and Leibler (1951). It is called the Kullback divergence, the Kullback-Leibler information, the relative entropy, divergence, or information for discrimination. It has been studied in detail by many authors, including Amari (1985). This section lists some of the remarkable properties of the KLD. The KLD is particularly important and has many applications in fields related to probability and information. Following the convention, we refer to D^{(-1)} as the Kullback divergence and to D^{(1)} as its dual.

Fig. 5. KLD analysis of the Hénon map. The left column illustrates two data samples (N = 100 and 500) generated by the Hénon map without noise (δ = 0). On the right, two samples (N = 100 and 500) generated by the Hénon map with noise (δ = 0.003). For each case, histograms of the data and the KLD over the number of iterations are shown.

Unlike other f-divergences, the Kullback divergence satisfies the following chain rule:

D^{(-1)}(p \,\|\, q) = D^{(-1)}(p_k \,\|\, q_k) + \int D^{(-1)}\big(p_n(\cdot \mid y) \,\|\, q_n(\cdot \mid y)\big)\, p_k(y)\, dy,   (60)

which gives another proof of the monotonicity. In particular, it satisfies the additivity

D^{(\pm 1)}(p_{12} \,\|\, q_{12}) = D^{(\pm 1)}(p_1 \,\|\, q_1) + D^{(\pm 1)}(p_2 \,\|\, q_2)   (61)

for product distributions p_{12}(x_1, x_2) = p_1(x_1) p_2(x_2) and q_{12}(x_1, x_2) = q_1(x_1) q_2(x_2). Fig. 5 illustrates the comparison of the KLD analysis between simulated data generated by the Hénon map with and without white noise.
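The additivity property (61) is easy to verify numerically for product densities. The sketch below compares the Kullback-Leibler distance between two bivariate product Gaussians with the sum of the marginal distances, using the closed-form divergence between one-dimensional Gaussians and a Monte Carlo estimate for the joint; the particular means and variances are illustrative assumptions.

```python
import numpy as np

def kl_gauss_1d(m1, s1, m2, s2):
    """Closed-form D(p || q) for one-dimensional Gaussians p = N(m1, s1^2), q = N(m2, s2^2)."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

# product densities p12 = p1 * p2 and q12 = q1 * q2 (independent coordinates)
p = [(0.0, 1.0), (1.0, 2.0)]
q = [(0.5, 1.5), (0.0, 1.0)]

marginal_sum = sum(kl_gauss_1d(*pi, *qi) for pi, qi in zip(p, q))

def log_prod_pdf(x, params):
    """Log-density of a product of independent one-dimensional Gaussians."""
    return sum(-np.log(s) - 0.5 * np.log(2 * np.pi) - (x[:, i] - m) ** 2 / (2 * s**2)
               for i, (m, s) in enumerate(params))

# Monte Carlo estimate of D(p12 || q12) directly in two dimensions
rng = np.random.default_rng(0)
x = np.column_stack([rng.normal(m, s, 200_000) for m, s in p])
joint = np.mean(log_prod_pdf(x, p) - log_prod_pdf(x, q))

print("sum of marginal KLDs   :", marginal_sum)
print("joint KLD (Monte Carlo):", joint)
```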

11. Computational results

In this section we discuss the possibility of reconstructing a dynamical model of the epileptic brain by analyzing the dynamical changes of the EEG. Our research was based on spatio-temporal dynamical analysis (Iasemidis et al., 2001). We used EEG recordings obtained from an experimental rodent model of human epilepsy in which seizures, as in humans, are spontaneous, intermittent, and sometimes lethal. In order to demonstrate the use of the KLD in a forecasting problem, we present here an example of EEG data analysis. Other tests of this methodology have been reported by Amari (1982) and Chentsov (1982). A straightforward way to test for independence of the preictal region and the ES is to see whether the densities p(x) and q(x) are the same. One natural way to do this is to introduce the idea of a distance on probability space and measure the distance between p and q. One such measure of the distance between the densities p and q is the χ² statistic

χ² = ∫ (p(x) − q(x))² / q(x) dx.   (62)

The χ² statistic is usually defined on EEG data sets using a histogram estimate for p and q. This introduces an extra factor of N, the number of cells, into the equations. Kernel density estimation is less time efficient but usually more data efficient for smooth probability densities (Yaglom, 1952). The basic idea is to let each point "influence" surrounding points through a kernel function. Kernel density estimates are of the form

p(x) = (1/N) Σ_{i=1}^{n} k( ‖x − x_i‖ / ω ),   (63)

where the sum is over the entire data set, k is the kernel (a normalized probability density), ‖·‖ represents an approximation norm, and ω is a window width. To make a good estimate, the kernel function and window width must be properly matched to the density function and the number of data points. Another way to estimate the probability density of a given local window is to use Parzen windowing (Fisher, 1997; Principe and Fisher, 2000). In Parzen windowing, the probability density is approximated by a sum of even, symmetric kernels whose centers are translated to the sample points. A suitable kernel function for this purpose is the Gaussian; a minimal numerical sketch of such an estimate is given after Eq. (64). Another possible distance measure is the Kullback–Leibler information distance

K(p, q) = ∫ q(x) log( q(x) / p(x) ) dx.   (64)
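The following Python sketch is included only as an illustration: it builds Gaussian-kernel (Parzen window) density estimates of the form (63) for two samples and then approximates the Kullback–Leibler distance (64) by a simple Riemann sum on a grid. The window width ω, the grid, and the test samples are assumptions, not values from the paper.

```python
import numpy as np

def gaussian_kde(sample, grid, omega):
    """Parzen-window estimate p(x) = (1/N) sum_i k((x - x_i)/omega) with a Gaussian kernel."""
    diffs = (grid[:, None] - sample[None, :]) / omega
    kernels = np.exp(-0.5 * diffs**2) / (np.sqrt(2.0 * np.pi) * omega)
    return kernels.mean(axis=1)

def kld(p, q, grid, eps=1e-12):
    """K(p, q) = integral of q log(q/p), approximated by a Riemann sum on the grid."""
    dx = grid[1] - grid[0]
    return np.sum(q * np.log((q + eps) / (p + eps))) * dx

rng = np.random.default_rng(0)
sample_p = rng.normal(0.0, 1.0, 500)     # reference sample
sample_q = rng.normal(0.3, 1.0, 500)     # slightly shifted sample
grid = np.linspace(-5, 5, 400)
p_hat = gaussian_kde(sample_p, grid, omega=0.3)
q_hat = gaussian_kde(sample_q, grid, omega=0.3)
print("K(p, q) =", kld(p_hat, q_hat, grid))
```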


Fig. 6. KLD analysis on the EEG signals recorded from a subdural electrode overlying orbitofrontal cortex contralateral to the seizure onset zone. The upper plot shows an EEG signal over a 13-min time period. The trace is divided into five preictal windows (A1–A5), a window which includes a seizure (A0), and five postictal windows (B1–B5). Each window is 75 s in duration. The histogram in the second row is taken from the window containing the seizure (A0). Histograms for example preictal windows A2 and A5 are shown on the left. Histograms from postictal windows B2 and B5 are shown on the right. In the center, the KLDs between window A5 and the other windows (plot C) and between window A0 and each of the other windows (plot D) are depicted.

Assuming that the two distributions p and q are nearly the same, the KLD can be expanded in a Taylor series:

K(p, q) = −∫ q [ (p/q − 1) − (1/2)(p/q − 1)² + ··· ] dx ≈ (1/2) ∫ q (p/q − 1)² dx = ∫ (p − q)² / (2q) dx = χ²/2.   (65)

In Figs. 6 and 7, the values of the KLD between each of the two reference windows (the preictal window A5 and the seizure window A0) and the other windows are computed using Eq. (65).
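As a hedged sketch of the kind of windowed analysis behind Figs. 6 and 7 (not the authors' actual pipeline), the Python code below splits a signal into fixed-length windows, forms amplitude histograms, and evaluates the χ²/2 approximation (65) of the KLD between a reference window and every other window. The synthetic signal, sampling rate, window length, and bin count are all placeholders.

```python
import numpy as np

def window_histograms(signal, win_len, bins=50):
    """Split a 1-D signal into consecutive windows and return a normalized amplitude histogram per window."""
    edges = np.linspace(signal.min(), signal.max(), bins + 1)
    n_win = len(signal) // win_len
    hists = []
    for w in range(n_win):
        seg = signal[w * win_len:(w + 1) * win_len]
        h, _ = np.histogram(seg, bins=edges)
        hists.append(h / h.sum())
    return np.array(hists)

def kld_chi2(p, q, eps=1e-12):
    """Approximation (65): K(p, q) ~ chi^2 / 2 = sum (p - q)^2 / (2 q)."""
    return np.sum((p - q) ** 2 / (2.0 * (q + eps)))

# Placeholder signal standing in for a 13-minute trace split into 75-s windows (assumed 200 Hz sampling).
rng = np.random.default_rng(0)
signal = rng.standard_normal(13 * 60 * 200)
signal[6 * 60 * 200:7 * 60 * 200] *= 5.0               # crude stand-in for a seizure segment
hists = window_histograms(signal, win_len=75 * 200)    # 75-s windows
reference = hists[4]                                    # e.g. a preictal reference window
profile = [kld_chi2(h, reference) for h in hists]
print(np.round(profile, 4))
```

In an actual analysis the placeholder signal would be replaced by a recorded EEG channel and the reference window by a preictal or seizure window, as in Fig. 6.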

Fig. 7. KLD analysis on the EEG signals recorded from a subdural electrode overlying the epileptic focus (seizure onset zone).

12. Concluding remarks

In this paper we have investigated a statistical information approach to the characterization of the spatiotemporal dynamics of the epileptic human brain. We also provide an analytical proof showing that robust Bayesian estimation and order approximation yield the optimal selection of a local stochastic model, and we illustrate an application of this approach to experimental EEG data using the KLD. We have shown that the time-windowed extraction of the local nonlinear dynamics of the system, based upon statistical information analysis (Lyapunov exponents, effective correlation dimension) of intracranial EEG signals recorded from epileptic patients, usually allows us to detect and identify an unequivocal "preictal" phase preceding the ES. These findings indicate that the observed changes in nonlinear parameters reflect the physiological mechanisms underlying epileptic processes as well as the spatial distribution of the local neuronal circuits involved in the development of epileptic seizures. It is likely that this type of analysis may be used to understand how relatively fixed damage to neuronal circuits can lead to the intermittent occurrence of brief (seizures typically last from a few seconds to less than 3 min) transient seizure states. Attempts to explain this


phenomenon at the level of the neuron or its components have not explained the intermittency of this dynamical disorder. The methods described herein may be useful in the investigation of other poorly understood epileptic processes, such as the anatomical spread of seizure discharges originating from isolated epileptogenic zones within the hippocampus or neocortex.

In the study of a system as complex as the human brain, we postulate that the important distinction is not between chaos and randomness, but between systems with low-dimensional attractors and those with high-dimensional attractors. If an EEG time series is produced by motion on a very high-dimensional attractor, then from a practical point of view it is impossible to gather enough information to exploit the underlying determinism. If the dynamics is simulated in a state space whose dimension is lower than that of the attractor, only a projection of the dynamics remains, and the determinism is invisible. The result is that the dynamics appear to be random. In some instances, such as the interictal alert state with many degrees of freedom, the statistical information approach is as good as any other; in fact, linear models may even be optimal. However, if random behavior results from low-dimensional chaos, it is possible to make forecasts that are much better than those of linear models. Furthermore, the resulting models can provide useful diagnostic information pertaining to the nature of the underlying dynamics, aiding the search for a description in terms of first principles.

An analysis of the problem of reconstructing a local controlled stochastic model based upon restricted experimental data within a given time window shows that it can be solved using the strategy of recursive nonlinear Bayesian estimation. Analysis of the differential-geometric structure of the estimation scheme allows one to generalize the approach to a broader class of complex processes exhibited by the human brain. Assuming that the diagnostic system for estimating complex dynamical processes admits a model realization, the possibility of implementing the minimax estimation strategy is demonstrated. This implementation may be performed in the form of a software system. The geometry of nonlinear estimation provides new insight into the choice of a suitable "approximation family" (Kullback and Leibler, 1951). The requirement that the reduced statistic be a homomorphism implies that the equivalence classes (fibers) are exponential families. The integral manifolds orthogonal to all fibers are curved exponential families, which can be viewed as intersections of the posterior exponential family with a mixture family. Hence, the mixture family, which is orthogonal to all fibers, is the best candidate for the approximation family.

References

Abarbanel, H., 1995. Analysis of Observed Chaotic Data. Springer, New York.
Abraham, B., Ledolter, J., 2000. Statistical Methods for Forecasting. Wiley, New York.
Amari, S.-I., 1982. Differential geometry of curved exponential families – curvatures and information loss. Ann. Statist. 10 (2), 357–385.
Amari, S.-I., 1985. Differential-Geometrical Methods in Statistics. Springer, New York.


Arnold, V., 1974. Dynamical Systems, Mathematical Methods of Classical Mechanics. Springer, New York.
Babloyantz, A., Destexhe, A., 1986. Low dimensional chaos in an instance of epilepsy. Proc. Nat. Acad. Sci. USA 83, 3513–3517.
Boothby, W.M., 1975. An Introduction to Differential Manifolds and Riemannian Geometry. Academic Press, London.
Brockett, R., 1981. Nonlinear systems and nonlinear estimation theory. In: Hazewinkel, M., Willems, J.C. (Eds.), Stochastic Systems: The Mathematics of Filtering and Identification and Applications. Reidel, Dordrecht.
Casdagli, M.C., Iasemidis, L.D., Gilmore, R.L., Roper, S.N., Savit, R.S., Sackellares, J.C., 1996. Characterizing nonlinearity in invasive EEG recordings from temporal lobe epilepsy. Physica D 99, 381–399.
Chentsov, N.N., 1982. Statistical Decision Rules and Optimal Inference. Nauka, Moscow, 1972 (in Russian). English translation: Translations of Mathematical Monographs, Vol. 53, Amer. Math. Soc., Providence, RI.
Cristianini, N., Shawe-Taylor, J., 2001. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods for Time Series Analysis. Cambridge University Press, New York.
Diambra, L., Bastos de, F., Malta, C., 1999. Epileptic activity recognition in EEG recording. Physica A 273, 495–499.
Du, D.-Z., Pardalos, P.M., Jie, W. (Eds.), 2000. Discrete mathematical problems with medical applications. In: DIMACS Series, Vol. 55, Amer. Math. Soc., Providence, RI.
Eubank, S., Farmer, J.D., 1997. Modelling chaotic systems. In: Lui, L. (Ed.), Introduction to Nonlinear Physics. Springer, New York, pp. 152–174.
Fante, R., 1988. Signal Analysis and Estimation. Wiley, New York.
Fisher, J., 1997. Nonlinear extension to the minimum average correlation energy filter. Ph.D. Dissertation, University of Florida.
Fitzgerald, W., Smith, R., Walden, A., Young, P., 2001. Nonlinear and Nonstationary Signal Processing. Cambridge University Press, New York.
Freeman, W., 2000. Neurodynamics: An Exploration of Mesoscopic Brain Dynamics. Springer, London.
Freeman, W., Skarda, C., 1985. Spatial EEG patterns, non-linear dynamics and perception: the neo-Sherringtonian view. Brain Res. Rev. 10, 147–175.
Gardiner, C., 1985. Handbook of Stochastic Methods. Springer, Berlin.
Golodnikov, A., Knopov, P., Pardalos, P.M., Uryasev, S., 2000. Optimization in the space of distribution functions and applications in the Bayes analysis. In: Uryasev, S. (Ed.), Probabilistic Constrained Optimization. Kluwer Academic Publishers, Dordrecht, pp. 102–131.
Grassbeger, P., Procacciia, I., 1983. Measuring the strangeness of strange attractors. Physica D 9, 189–208.
Gröbler, T., Barna, G., Érdi, P., 1998. Statistical model of the hippocampal CA3 region. I. The single-cell module; bursting model of the pyramidal cell. Biol. Cybern. 79, 301–308.
Iasemidis, L.D., Pardalos, P.M., Shiau, D.S., Sackellares, J.C., 2001. Quadratic binary programming and dynamic system approach to determine the predictability of epileptic seizures. J. Combin. Optim. 5 (1), 9–26.
Insua, D.R., Ruggeri, F., 2000. Robust Bayesian Analysis. Lecture Notes in Statistics, Springer, New York.
Kaneko, K., 2001. Complex System: Chaos and Beyond: A Constructive Approach with Applications in Life Sciences. Springer, Berlin.
Klimontovich, Y., 1999. Entropy and information of open systems. Phys. Usp. 42, 375–384.
Kobayashi, S., Nomizu, K., 1963. Foundations of Differential Geometry, Vol. I. Interscience Publishers, New York.
Kulhavý, R., 1996. Recursive Nonlinear Estimation: A Geometric Approach. Springer, London.
Kullback, S., Leibler, R.A., 1951. On information and sufficiency. Ann. Math. Statist. 22, 79–86.
Loskutov, A., Mikhailov, A., 1990. Introduction to Synergetics. Nauka, Moscow (in Russian).
Nicolis, G., 1995. Introduction to Nonlinear Science. Cambridge University Press, New York.
Nidermeyer, F., De Silva, L., 1983. Electroencephalography: Basic Principles, Clinical Application and Related Fields. Urban and Schwarzerberg, Baltimore.
Nikolis, G., Prigijine, I., 1989. Exploring Complexity. W.H. Freeman and Company, New York.


Percival, D., Walden, A., 2001. Wavelet Methods for Time Series Analysis. Cambridge University Press, New York.
Peterka, V., 1981. Bayesian approach to system identification. In: Eykhoff, P. (Ed.), Trends and Progress in System Identification. Pergamon Press, Oxford.
Principe, J., Fisher, W., 2000. In: Symon, H. (Ed.), Information Theoretical Learning. Wiley, New York, pp. 265–319.
Principe, J., Dongxin, X., Zhao, Q., Fisher, W., 2000. Learning from examples with information theoretic criteria. J. VLSI Signal Process. 26, 61–77.
Pugachev, V., Sinitsin, I., 1987. Stochastic differential systems. In: Analysis and Filtering. Wiley, New York.
Ruelle, D., 1994. Where can one hope to profitably apply the ideas of chaos? Phys. Today 47 (7), 24–30.
Sackellares, J.C., Iasemidis, L.D., Gilmore, R.L., Roper, S.N., 2000. Epilepsy—when chaos fails. In: Lehnertz, K., Arnhold, J., Grassberger, P., Elger, C.E. (Eds.), Chaos in the Brain?. World Scientific, Singapore.
Savage, L.J., 1954. The Foundations of Statistics. Wiley, New York.
Sorenson, H.W., 1988. Recursive estimation for nonlinear dynamic systems. In: Spall, J.C. (Ed.), Bayesian Analysis of Time Series and Dynamic Models. Marcel Dekker, New York.
Stogatz, S., 1994. Nonlinear Dynamics and Chaos. Addison-Wesley, Reading, MA.
Takens, F., 1981. Dynamical systems and turbulence. In: Rang, D., Young, L. (Eds.), Lecture Notes in Mathematics. Springer, Berlin, pp. 366–381.
Theiler, J., 1986. Spurious dimension from correlation algorithms applied to limited time-series data. Phys. Rev. A 34, 2427–2432.
Vapnik, V., 1998. Statistical Learning Theory. Wiley, New York.
Wets, R., 1999. Statistical estimation from an optimization viewpoint. Ann. Oper. Res. 85, 79–101.
Yaglom, A., 1952. Stationary Random Functions. Prentice-Hall, Englewood Cliffs, NJ.
Yatsenko, V., Titarenko, T., Kolesnik, Y., 1994. Identification of the non-Gaussian chaotic dynamics of the radioemission back scattering processes. In: Proceedings of the 10th IFAC Symposium on System Identification SYSID'94, Kobenhavn, pp. 313–317.