Journal of Neuroscience Methods 116 (2002) 35–53

Hidden Markov based autoregressive analysis of stationary and non-stationary electrophysiological signals for functional coupling studies

M.J. Cassidy *, P. Brown

Sobell Department of Neurophysiology, Institute of Neurology, University College London, Queen Square, London WC1N 3BG, UK

Received 28 November 2001; received in revised form 1 February 2002; accepted 2 February 2002

Abstract

In this paper, we apply multivariate autoregressive (MAR) models to problems of spectral estimation for stationary and non-stationary electrophysiological data. We describe how to estimate spectral matrices and approximate confidence limits from MAR coefficients, and for stationary data the spectral results obtained from the MAR approach are compared with fast Fourier transform (FFT) estimates. The hidden Markov MAR (HMMAR) model is derived for spectral estimation of non-stationary data, and traditional model order selection problems, such as the number of states to include in the hidden Markov model or the choice of MAR model order, are addressed through the use of a Bayesian formalism. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Autoregressive models; Spectral estimation; Non-stationary analysis; Electrophysiological data; Functional coupling

* Corresponding author. Tel.: +44-207-837-3611 x3069; fax: +44-207-278-9836. E-mail address: [email protected] (M.J. Cassidy).

1. Introduction

In recent years, it has become more evident to neuroscientists that an understanding of the spectral content of electrophysiological signals can provide new insights into the workings of the nervous system. Possible functional roles for neuronal synchrony and oscillations have been proposed in a wide range of phenomena, such as consciousness, perception, motor control, cortical–subcortical interactions and attention (Singer, 1999). In addition, spectral information has provided insight into neurological diseases such as Parkinson's disease (Brown et al., 2001) and myoclonus (Brown et al., 1999). The quantities that provide the most practical information are the spectral power in each signal, and the coherence and phase spectra, which indicate the amount of linear coupling and the phase difference between two signals. These quantities are traditionally calculated from the Fourier transform of the data, and a series of papers by Halliday et al. (1995) and Rosenberg et al. (1989) describes how to implement this type of analysis for data relevant to neuroscientists.

The typical tradeoff encountered in this analysis is between spectral resolution and variance. It arises because of the need to average over a number of disjoint (or overlapping) data segments. One can improve the resolution of the spectrum by increasing the number of data points in each disjoint segment, but then the variance of the spectrum (which is inversely related to the total number of disjoint sections) increases for a fixed amount of data. This tradeoff implies that the fast Fourier transform (FFT) approach to spectral estimation gives the best results when one has a large amount of data. This in turn introduces another potential problem, because EEG is generally thought to be non-stationary over times greater than about 2 s (Nunez, 1995), while EMG and tremor signals may be non-stationary over longer time intervals. FFT based methods assume signal stationarity, so for long data sets in which there could be considerable drift in signal mean and variance, one has to look back at the original data and be careful how one interprets the final estimated spectra. Even in experimental paradigms where external parameters are kept approximately constant (e.g. when measuring EEG/EMG during a tonic muscle contraction), one should be aware of possible non-stationarities brought on by fatigue or shifts in attention.

The limitations of the FFT approach become more evident when one considers other experimental paradigms of interest. One may, for example, be interested in coherence and phase changes over a number of short-lived conditions (e.g. before, during and after a phasic movement). One might be interested in the spectral changes in a signal as a patient falls asleep, or as medication takes effect. It is clear that methods of analysis that rely on signal stationarity will not be appropriate in these cases. One must look to new techniques that can objectively segment signals into suitable stationary regions (states) and perform spectral analysis in each state.

The purpose of this paper is to describe the autoregressive (AR) approach to non-stationary spectral estimation in a neuroscience context. AR methods are extremely common in a huge range of other scientific disciplines due to their desirable statistical properties, but as yet their use by neuroscientists has been limited (however, see Florian and Pfurtscheller, 1995; Peters et al., 2001). Here, for the first time, we derive a fully Bayesian hidden Markov MAR (HMMAR) model for non-stationary spectral estimation. We begin our methods section with a brief introduction to stationary AR theory and a description of the Bayesian implementation of the stationary MAR model; both are included as necessary prerequisites to understanding the non-stationary case. We also derive a Bayesian MAR mixture model, which is required to initialise the HMMAR model.

In the stationary case, even though one can still use FFT based approaches, there are advantages to be gained by employing an AR approach. The total number of AR parameters needed to model a time series is typically much smaller than the total number of data points in the signal, and therefore gives a statistically desirable compact representation of the signal. FFT methods, by comparison, need to determine as many coefficients as there are points in a particular data segment. In itself this is statistically undesirable, and is the reason why one needs to average over a large number of data segments to get a sensible spectrum. As a result, it is commonly claimed that AR spectral estimates tend to be more robust than FFT estimates when one has a small data set. We will test this claim with respect to long and short sections of EEG, EMG and tremor data in Section 3.

Another advantage of the AR approach to static data analysis comes when one is interested in measuring more than two channels at once. In the FFT approach, it is necessary to take the channels 'two at a time' to calculate spectra; in other words, one does not have a true multivariate representation of multivariate data. By contrast, we will show that AR methods can be straightforwardly extended to multivariate AR (or MAR) models, which do provide a full multivariate representation of the signal, from which appropriate spectra can be calculated. We will also demonstrate how to calculate confidence limits for power, coherence and phase.

The biggest problem that anyone working with static AR and MAR models faces is the choice of model order. If one thinks in terms of a univariate model, the model order can be interpreted as twice the number of peaks that one wants in the power spectrum. The question is, therefore, how does one know how many peaks to include? One may have some prior reason to expect peaks in particular frequency bands, but really one would like an objective measure of what to choose for the model order. Previously, a number of model order selection criteria have been employed for this purpose. We will mention some of these criteria, but our ultimate aim is to motivate a unifying approach to parameter estimation and model selection through Bayesian methods. We will see the model order problem arise later in the paper in a different guise; by formulating our model in a Bayesian way, all model selection problems can be handled in a rigorous, objective manner.

The fully Bayesian implementation of the stationary MAR model is still very recent (Penny and Roberts, 2001) and has not yet been applied to much physiological data. In this paper, we critically examine the performance of the stationary Bayesian MAR model applied to different types of biomedical signals. However, the major focus of this paper is to introduce non-stationary MAR modelling. We describe the MAR mixture and HMMAR models and show how their parameters can be learnt within the Bayesian framework introduced previously. A MAR mixture model assigns each data point to each of a number of distinct states with a corresponding probability, where each state is described by a particular set of MAR coefficients and all assignment probabilities sum to one. In this paper, we use the mixture model to initialise the HMMAR model, which is a true non-stationary model, as it infers dynamic information about the data: unlike the basic mixture model, the HMMAR model calculates the probability of being in a state i at time t given that the system was in some other state j at time t-1. As mentioned earlier, the model order selection problem reappears when one asks how many states are to be included in each model; we will demonstrate how the Bayesian formalism allows this question to be answered in a straightforward fashion. We then apply the HMMAR algorithm to various data sets, and evaluate its performance and possible future applications in the final discussion.


2. Methods

2.1. Non-Bayesian stationary AR models

2.1.1. The basic univariate model

Suppose that one has recorded a stationary, univariate, stochastic time series $y = \{y_1, \ldots, y_T\}$ of length $T$. One can ask how to predict the value of $y_t$ given the previous $p$ values $y_{t-p}, \ldots, y_{t-1}$ of the series. If one describes $y_t$ as a linear combination of the $p$ previous values plus a noise contribution, then this is a $p$th order AR model that can be used for prediction. Using the weighted data points to predict the next point in the series, the model can be written as

$$y_t = \sum_{i=1}^{p} y_{t-i}\, a_i + e_t, \qquad (1)$$

where the $a_i$ are the AR coefficients and $e_t$ is the residual error, which is usually assumed to be Gaussian. In this equation it has been implicitly assumed that the mean of the data has been subtracted from the time series first. One can rephrase this model in a more probabilistic form as

$$p(y_t \mid x_t, a, \sigma^2) = N(y_t \mid x_t a, \sigma^2), \qquad (2)$$

where $p(y \mid x)$ denotes the probability of $y$ conditioned on $x$, $a = (a_1, \ldots, a_p)^T$ is a column vector containing all the AR coefficients, and $x_t = (y_{t-1}, \ldots, y_{t-p})$ is a row vector containing the regressors (the superscript $T$ denotes the transpose). $N(\mu, \sigma^2)$ is shorthand notation for the univariate Normal or Gaussian density with mean $\mu$ and variance $\sigma^2$, and is given by

$$N(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y - \mu)^2}{2\sigma^2}\right). \qquad (3)$$

The quantity $p(y_t \mid x_t, a, \sigma^2)$ is known as the likelihood of the data point $y_t$, and the likelihood of the whole data set is given by the product $\prod_t p(y_t \mid x_t, a, \sigma^2)$. The likelihood is a product of probabilities because the data are assumed to be IID (independently and identically distributed). The question now becomes one of how to determine the AR coefficients that best represent the entire time series. If one constructs three matrices $Y$, $X$ and $E$, each with $T$ rows given by the quantities $y_t$, $x_t$ and $e_t$, Eq. (1) can be rewritten as

$$Y = Xa + E. \qquad (4)$$

This is now in the form of a linear regression model, which has been extensively studied in the statistical literature (see, for example, Weisberg, 1980). A good way to determine the coefficients is by maximising the likelihood defined earlier. One can see that maximising this product of probabilities is equivalent to minimising the total squared residual error $\sum_t (y_t - x_t a)^2$. The maximum likelihood (ML) solution is given by the standard results (Weisberg, 1980)

$$a_{ml} = (X^T X)^{-1} X^T Y \qquad (5)$$

for the mean value of the AR coefficients, and

$$\sigma^2_{ml} = \frac{1}{T - p}\, (Y - X a_{ml})^T (Y - X a_{ml}) \qquad (6)$$

for the data variance. The covariance matrix for the AR coefficients is given by

$$\Sigma_{ml} = \sigma^2_{ml}\, (X^T X)^{-1}. \qquad (7)$$

Another way to think of the AR model is as a white noise sequence that has been filtered to give the observed time series. The filter that yields the output series is an all-pole (or infinite impulse response, IIR) filter with coefficients $a_1, \ldots, a_p$. This implies that after one has fitted the AR model to the data and calculated the AR coefficients, the residual errors in the matrix $E$ should be Gaussian distributed. This can easily be checked by plotting a histogram, and gives one indication of how well the model has fitted the data. If the residuals are not Gaussian, or are correlated in any way, then the model has not fitted the data well. For more discussion of this, and for information on other traditional methods for estimating the AR coefficients, the reader is referred to the review paper by Pardey et al. (1996).
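As a concrete illustration of Eqs. (4)–(7), the following is a minimal NumPy sketch of the ML fit; the function name and the synthetic test signal are ours, not from the paper.

```python
import numpy as np

def fit_ar_ml(y, p):
    """ML (least squares) fit of a univariate AR(p) model, following Eqs. (4)-(7)."""
    y = np.asarray(y, float)
    y = y - y.mean()                        # the model assumes zero-mean data
    T = len(y)
    # Row t of X holds the regressors (y[t-1], ..., y[t-p]); Y holds the targets y[t]
    X = np.column_stack([y[p - i:T - i] for i in range(1, p + 1)])
    Y = y[p:]
    a = np.linalg.lstsq(X, Y, rcond=None)[0]          # Eq. (5)
    E = Y - X @ a                                     # residuals: check these are Gaussian
    s2 = (E @ E) / (len(Y) - p)                       # Eq. (6)
    S = s2 * np.linalg.inv(X.T @ X)                   # Eq. (7)
    return a, s2, E, S

# Quick check on a synthetic, stationary AR(2) process with known coefficients
rng = np.random.default_rng(0)
y = np.zeros(2000)
for t in range(2, len(y)):
    y[t] = 1.3 * y[t - 1] - 0.75 * y[t - 2] + rng.standard_normal()
a, s2, E, S = fit_ar_ml(y, p=2)
print(a)  # should be close to (1.3, -0.75)
```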

2.1.2. Multivariate models

The generalisation of the univariate AR model to the multivariate case is very simple, and many of the results of the previous section hold. One can still write the model as

$$y_t = \sum_{i=1}^{p} y_{t-i}\, a_i + e_t, \qquad (8)$$

but now $y_t = (y_1(t), \ldots, y_d(t))$ is a row vector containing the values of the $d$-dimensional time series $y$ at time $t$, $e_t$ is a row vector of residuals, and the $a_i$ are $d \times d$ coefficient matrices. The multivariate model can be written as

$$Y = XA + E, \qquad (9)$$

where the rows of $Y$ are given by $y_t$, the rows of $E$ by $e_t$, and $A$ is formed by stacking the coefficient matrices to give $A = (a_1, \ldots, a_p)^T$. The rows of the matrix $X$ are given by $(y_{t-1}, \ldots, y_{t-p})$. This form again implies that linear regression methods will give the ML solution. The likelihood of a single measurement is now given by a multivariate Gaussian,

$$p(y_t \mid x_t, A, \Lambda) = N(y_t \mid x_t A, \Lambda^{-1}), \qquad (10)$$

where

$$N(y \mid \mu, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(y - \mu)\, \Sigma^{-1} (y - \mu)^T\right) \qquad (11)$$

and $|\Sigma|$ denotes the determinant of the matrix $\Sigma$. $\Lambda$ is the inverse covariance (or precision) matrix. The total likelihood can be written as a product or, more compactly, as

$$p(D \mid A, \Lambda) = (2\pi)^{-dT/2}\, |\Lambda|^{T/2} \exp\!\left(-\tfrac{1}{2}\, \mathrm{tr}\big(\Lambda (Y - XA)^T (Y - XA)\big)\right), \qquad (12)$$

where $D = \{Y, X\}$ is the entire data set and $\mathrm{tr}(\cdot)$ denotes the trace operation. The ML solution for $A$ is exactly the same as Eq. (5), and the other expressions are only slightly modified to

$$\Sigma_{ml} = \frac{1}{T - k}\, (Y - X A_{ml})^T (Y - X A_{ml}), \qquad S_{ml} = \Sigma_{ml} \otimes (X^T X)^{-1} \qquad (13)$$

for the noise and MAR coefficient covariances, respectively, where $\otimes$ is the Kronecker product (see Appendix A) and $k = p d^2$ is the total number of MAR coefficients. Having introduced the MAR model, all further results in this paper will be presented for the multivariate case so as to retain full generality. However, examples will often be given for bivariate processes, as these are the simplest of the multivariate models.

2.2. Bayesian stationary MAR models

2.2.1. Basic theory, priors and MAR update equations

In this section, we embed the MAR model into a Bayesian framework. The reason for this will become clear as we address the problem of model order selection. Recall that the model order is related to the number of peaks in the spectrum, so that a higher model order leads to a more complex spectrum. The question is: what is the correct number of peaks to best model the true spectrum?

Consider first how one might choose the appropriate model order for a particular data set. One suggestion might be to examine the likelihoods of a number of different models and choose the model with the highest value. The problem with this, though, is that a more complex model will always fit the data with a higher likelihood; i.e. it fits closer to the particular data set measured, rather than being representative of the underlying process from which that data set was generated. In other words, one desires a model that can not only fit the data, but can also be used to generate more data that looks sensible. If data from a 4th order system are fitted with a 16th order model, the model may fit very well to that particular data set, but if it were used to generate further data, the model output would most likely be very unrepresentative of the original 4th order process.

This overfitting can be avoided in the Bayesian approach by choosing suitable prior probability densities for the model parameters, which automatically penalise (or regularise) overcomplex models. To illustrate this, consider the probabilistic form of the MAR model, $p(D \mid A, \Lambda)$, given in Eq. (12). If we introduce prior probability densities on the coefficients, $p(A)$, and on the inverse noise (or precision), $p(\Lambda)$, we can construct a quantity called the evidence by integrating over the model parameters, leaving a function that depends only on the observed data and the model order. The evidence is

$$p(D) = \int p(D \mid A, \Lambda)\, p(A)\, p(\Lambda)\, dA\, d\Lambda. \qquad (14)$$

This operation is also called marginalisation, and is a consequence of the most basic result of probability theory, $p(x) = \sum_y p(x, y) = \sum_y p(x \mid y)\, p(y)$ for discrete $y$. If the regularising prior for the MAR coefficients is chosen suitably, then overcomplex models will be penalised automatically. One can then choose the model with the highest evidence as the most suitable representation of the process from which the data were generated (assuming equal a priori probabilities for each model order). If we define $a = \mathrm{vec}(A)$, a suitable prior density for the MAR coefficients is

$$p(a) = \left(\frac{\alpha}{2\pi}\right)^{k/2} \exp\!\left(-\frac{\alpha}{2}\, a^T a\right), \qquad (15)$$

which says that the coefficients are initially drawn from a zero-mean Gaussian prior with precision $\alpha$. Looking at this density, one can see that as $a^2 = a^T a$ gets large, $p(a)$ tends to zero, so large-$a$ contributions to the integral become suppressed. A more complex spectrum will generally imply more regions of high curvature in the spectrum, which will require relatively larger values for the MAR coefficients; this shows heuristically how this choice of prior can suppress higher order models.

This prior should strictly be written as $p(a \mid \alpha)$, because it depends on the precision $\alpha$, which is an example of a hyper-parameter: a parameter that controls the distribution of other parameters. For the model to be strictly correct, we should introduce a (hyper-)prior over this hyper-parameter also. The precision must be positive, so a suitable prior is the Gamma density

$$p(\alpha) = G(\alpha \mid b, c) = \frac{1}{\Gamma(c)\, b^{c}}\, \alpha^{c-1} \exp\!\left(-\frac{\alpha}{b}\right). \qquad (16)$$

One could carry on defining priors over hyper-hyper-parameters indefinitely. However, by setting $b$ and $c$ so that the Gamma density is sufficiently uninformative (e.g. $b = 1000$ and $c = 0.001$), we can stop the hierarchy here. One can also choose an uninformative prior for the noise precision $\Lambda$, given by

$$p(\Lambda) \propto |\Lambda|^{-(d+1)/2}, \qquad (17)$$

which gives the final equation for the evidence as

$$p(D) = \int p(D, \theta)\, d\theta = \int p(D \mid a, \Lambda)\, p(a \mid \alpha)\, p(\alpha)\, p(\Lambda)\, da\, d\alpha\, d\Lambda, \qquad (18)$$

where $\theta = \{A, \Lambda, \alpha\}$ are the model parameters. The only problem now is that the integral in Eq. (18) is analytically intractable, and the parameters that maximise the evidence cannot be obtained straightforwardly, so an approximation technique must be employed. In this paper, we describe and employ the variational approximation; for further information on this approach, the reader is referred to Penny and Roberts (2001), Ghahramani and Beal (2001), Attias (2000), Jordan et al. (1999) and the papers cited therein. Other possible approaches, such as the Laplace approximation (Minka, 2000a) and Markov chain Monte Carlo methods (Neal, 1993), will not be considered here.

The logarithm of the evidence can be written as

$$\ln p(D) = \ln \int q(\theta \mid D)\, \frac{p(D, \theta)}{q(\theta \mid D)}\, d\theta, \qquad (19)$$

where $q(\theta \mid D)$ is some arbitrary probability density. Jensen's inequality (which follows from the concavity of the logarithm) can be used to write

$$\ln p(D) \geq F = \int q(\theta \mid D)\, \ln \frac{p(D, \theta)}{q(\theta \mid D)}\, d\theta, \qquad (20)$$

where $F$ is known as the negative variational free energy (a quantity often used by physicists in other contexts). $F$ therefore provides a lower bound on the log evidence, which becomes an equality if the density $q(\theta \mid D)$ equals $p(\theta \mid D)$, the true posterior distribution of the parameters. To see this, consider Bayes' rule, which gives the posterior density as

$$p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}. \qquad (21)$$

If one sets $q = p$ in Eq. (20) and substitutes for $p(D, \theta)$ from Bayes' rule, then one obtains

$$F = \int p(\theta \mid D)\, \ln p(D)\, d\theta = \ln p(D). \qquad (22)$$

One can say something more about the form of the posterior distribution $q$. If the prior probability density for a parameter belongs to the same family of distributions as the likelihood of the data (viewed as a function of the parameter of interest), it is said to be a conjugate prior. This means that the resulting posterior (which is proportional to the product of the prior and the likelihood, according to Bayes' rule) will have the same functional form as the prior. In the MAR case, therefore, one can write

$$q(\theta \mid D) = q(a \mid D)\, q(\alpha \mid D)\, q(\Lambda \mid D). \qquad (23)$$

This factorisation is very useful because it allows one to separate Eq. (20) into a sum of terms for each parameter, which can then be maximised in turn to determine the relevant posterior distributions. A detailed derivation of the update equations is given in Appendix E; here we just present the results. The posterior distribution for the MAR coefficients is given as $q(a \mid D) = N(a \mid \hat{a}, \hat{\Sigma})$, where

$$\hat{\Sigma} = \big(\hat{\Lambda} \otimes X^T X + \hat{\alpha} I_k\big)^{-1}, \qquad \hat{a} = \hat{\Sigma}\, \big(\hat{\Lambda} \otimes X^T X\big)\, a_{ml} \qquad (24)$$

and $a_{ml}$ is the vec form of the ML solution for $A$ given earlier. The posterior distribution for the precision of the MAR coefficients is $q(\alpha \mid D) = G(\alpha \mid b', c')$, with

$$\frac{1}{b'} = \frac{1}{2}\, \hat{a}^T \hat{a} + \frac{1}{2}\, \mathrm{tr}(\hat{\Sigma}) + \frac{1}{b}, \qquad c' = c + \frac{k}{2}, \qquad \hat{\alpha} = b' c'. \qquad (25)$$

Finally, the distribution for the noise precision is a Wishart distribution (see Appendix D), $q(\Lambda \mid D) = W_d(\Lambda \mid r, R)$, with parameters

$$R = (Y - X\hat{A})^T (Y - X\hat{A}) + V, \qquad r = T, \qquad \hat{\Lambda} = r R^{-1}, \qquad V = \big(I_k^{(pd)}\big)^T \Big(\big(I_d \otimes X^T X\big)\, \hat{\Sigma}\Big)^{(pd)}, \qquad (26)$$

where the vec-transpose of a matrix, $X^{(p)}$, is defined in Appendix B. Eqs. (24)–(26) constitute the parameter update equations, which should be applied iteratively until a consistent solution is reached. The parameters can be initialised with the corresponding ML estimates. This variational MAR algorithm was first derived by Penny and Roberts (2001).
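To make the update cycle concrete, here is a minimal sketch of Eqs. (24)–(26) for the univariate case ($d = 1$), where $\hat{\Lambda} \otimes X^T X$ reduces to $\hat{\lambda} X^T X$, the Wishart collapses to a Gamma, and the $V$ term reduces to $\mathrm{tr}(X^T X \hat{\Sigma})$. All names are ours, and the simplification is our reading of the equations rather than code from the paper.

```python
import numpy as np

def vb_ar(y, p, b=1000.0, c=0.001, n_iter=50):
    """Variational updates for a univariate AR(p) model: Eqs. (24)-(26) with d = 1."""
    y = np.asarray(y, float) - np.mean(y)
    X = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    Y = y[p:]
    T = len(Y)                                   # number of regression rows
    XtX, XtY = X.T @ X, X.T @ Y
    a_ml = np.linalg.solve(XtX, XtY)
    lam = 1.0 / np.var(Y - X @ a_ml)             # initialise at the ML estimates
    alpha = 1.0
    for _ in range(n_iter):
        Sigma = np.linalg.inv(lam * XtX + alpha * np.eye(p))        # Eq. (24)
        a_hat = Sigma @ (lam * XtY)              # Eq. (24), since XtX @ a_ml = XtY
        inv_b = 0.5 * (a_hat @ a_hat + np.trace(Sigma)) + 1.0 / b   # Eq. (25)
        alpha = (c + 0.5 * p) / inv_b            # alpha-hat = b'c'
        R = (Y - X @ a_hat) @ (Y - X @ a_hat) + np.trace(XtX @ Sigma)  # Eq. (26)
        lam = T / R                              # Lambda-hat = r R^{-1}, with r = T
    return a_hat, Sigma, lam, alpha
```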

2.2.2. Model order selection

We described earlier how the evidence for a particular model could be used for model order selection, because the evidence includes terms from the priors that automatically penalise more complex models. The negative free energy has been introduced as a lower bound on the log of the evidence, and the update equations of the previous section are applied iteratively to maximise this lower bound and to obtain estimates of the posterior distributions of the MAR model parameters. The negative free energy functional, calculated with the final estimates of the posteriors, can therefore be used for model order selection. Specifically, suppose that one wanted to determine the most accurate model order for a data set. By fitting MAR models with orders ranging from $p = 1$ to $n$ and then calculating the negative free energy $F(p)$ for each model, one can create a probability distribution over model orders, $q(p)$, as

$$q(p) = \frac{\exp(F(p))}{\sum_{k=1}^{n} \exp(F(k))}, \qquad (27)$$

if one assumes that each model order is equally likely a priori. By substituting the estimated densities into the expression for the free energy (Eq. (20)), one can show that the free energy expression reduces to

$$F(p) = -\frac{T}{2} \ln|R| - KL\big(q(A \mid D);\, p(A \mid \alpha)\big) - KL\big(q(\alpha \mid D);\, p(\alpha)\big) + \ln \Gamma_d(T/2) + \frac{Td}{2} \ln 2, \qquad (28)$$

where the generalised Gamma function $\Gamma_d$ and analytic expressions for each of the KL divergences are given in the appendix (note that the last two terms in this expression do not depend on the model order $p$, so they do not contribute to model order selection). The connection to more familiar model order selection criteria can be made by taking the limit $T \to \infty$. In this limit, the model parameters become more and more sharply peaked about the ML values (i.e. the priors become less and less important), so in the above expression the $V$ term in $R$ (a contribution to the noise precision from the covariance of the MAR coefficients) is suppressed, becoming vanishingly small, and the asymptotic limit of the KL terms is $(N/2) \ln T$, where $N$ is the number of parameters in the model (Attias, 2000). The free energy then reduces to

$$F(p) = -\frac{T}{2} \ln\big|(Y - X\hat{A})^T (Y - X\hat{A})\big| - \frac{k}{2} \ln T \qquad (29)$$

as a function of the model order $p$, which is equal to the Bayesian information criterion (BIC) and also to the negative of the minimum description length criterion (MDL; Penny and Roberts, 2001). So these popular AR model order selection criteria can be seen as limiting cases of the variational approach. One might therefore expect superior performance of the variational criterion in cases where the data is limited; this has been confirmed in Penny and Roberts (2001).
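In code, Eq. (27) is simply a softmax over the candidate models' free energies. A small sketch, assuming some routine has already returned $F(p)$ for each candidate order (the helper name in the usage comment is ours):

```python
import numpy as np

def model_order_posterior(F):
    """Eq. (27): posterior over model orders from the negative free energies F[p].

    Subtracting max(F) before exponentiating avoids overflow without changing q(p).
    """
    F = np.asarray(F, float)
    w = np.exp(F - F.max())
    return w / w.sum()

# e.g. q = model_order_posterior([free_energy(y, p) for p in range(1, 31)])
# where free_energy is a stand-in for a routine returning F(p) for each fit.
```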

2.3. Non-stationary MAR models

In this section, we begin our discussion of the problem of estimating spectra for non-stationary time series. One may, for instance, be interested in an experimental situation where the statistical moments (i.e. the mean and variance) of the biomedical signals being recorded change over time. This is the case, for example, prior to and during movement, or during the execution of a cognitive task. Alternatively, it could be due to a change in the experimental conditions, or to unforeseen intrinsic changes in the subject or patient's physiological state. If one were presented with a set of data comprising a number of physiological states, it would be very difficult to distinguish objectively between the different states using an FFT based approach. The models presented in this section are designed to cope with data made up of many different states, and are all based on the MAR representation. Indeed, one will see that it is in the non-stationary regime that MAR models really come into their own. Ultimately, the preferred model will be the fully Bayesian HMMAR model, but we first present a simpler MAR mixture model and its Bayesian extension, so that the relevant concepts can be fully understood before advancing to the HMMAR model. To our knowledge, this is the first time that the variational Bayesian approach has been applied to the HMMAR and MAR mixture models.

2.3.1. The MAR mixture model

Suppose that a data set is made up of data generated from a number of different states, each of which can be regarded as a static MAR state in its own right. The mixture model represents the likelihood of each data point as a weighted combination of the likelihoods of the point under each static MAR state, where the weights give the probability that a data point belongs to the jth state. The mixture model is not a good generative model for the data and is not a genuinely non-stationary model, because it does not include any information about the state at one time point given the state at a previous time point. If the mixture model were used to generate a new data set, there would be no reconstruction of the original dynamics. The HMMAR model is superior to the mixture model because it does include dynamical information, but one can still study the mixture model as a 'poor man's' hidden Markov model to help understand many of the relevant concepts. Given a particular data set, it can give a useful first idea of how the data are partitioned state-wise, and can be used to initialise the HMMAR model.


In a mixture model with $K$ states and parameters $\theta$, the likelihood of a data point $y_t$ is given by

$$p(y_t \mid \pi, \theta) = \sum_{j=1}^{K} \pi_j\, p(y_t \mid \theta_j). \qquad (30)$$

One can express this sum using a hidden indicator variable $s_t$ as

$$\sum_{s_t} p(y_t, s_t \mid \pi, \theta) = \sum_{j=1}^{K} p(s_t = j \mid \pi)\, p(y_t \mid s_t = j, \theta), \qquad \pi_j \equiv p(s_t = j \mid \pi), \quad p(y_t \mid \theta_j) \equiv p(y_t \mid s_t = j, \theta). \qquad (31)$$

The purpose of introducing the hidden variables is so that we can use the EM algorithm to do inference over them. The EM algorithm is an iterative process for learning the ML values of the model parameters when hidden variables are present in a model. One first constructs a suitable functional (in this case the negative free energy) which approximates the total log likelihood of the data. The E step of the algorithm calculates the probability distribution over the hidden variables by maximising this functional while keeping the model parameters fixed. Then, in the M step, one maximises the functional with respect to the model parameters while holding the distribution over the hidden variables fixed. By iterating the E and M steps, one converges to the ML solution in such a way that the estimated likelihood is guaranteed to increase at each iteration. For further information on the EM algorithm, the reader is referred to Roweis and Ghahramani (1999).

In our particular model, we already know that $p(y_t \mid \theta_j)$ will be given by the Gaussian likelihood of a data point under the MAR model described earlier; for the moment, though, we will avoid making this explicit. The total log-likelihood is

$$\ln p(D) = \sum_{t=1}^{T} \ln \sum_{s_t} p(y_t, s_t \mid \pi, \theta) = \sum_{t=1}^{T} \ln \sum_{s_t} q(s_t)\, \frac{p(y_t, s_t \mid \pi, \theta)}{q(s_t)}, \qquad (32)$$

where $q(s_t)$ is some arbitrary distribution over the hidden variables $s_t$, which will be determined iteratively by the EM algorithm. Using Jensen's inequality, this expression becomes

$$\ln p(D) \geq \sum_{t=1}^{T} \sum_{s_t} q(s_t) \ln p(y_t, s_t \mid \pi, \theta) - \sum_{t=1}^{T} \sum_{s_t} q(s_t) \ln q(s_t) = \sum_{t=1}^{T} \sum_{j=1}^{K} \gamma_{tj} \ln \pi_j\, p(y_t \mid \theta_j) - \sum_{t=1}^{T} \sum_{j=1}^{K} \gamma_{tj} \ln \gamma_{tj} \equiv F, \qquad (33)$$

the negative free energy. As discussed above, the EM algorithm alternates between maximising this free energy with respect to the hidden distribution $\gamma_{tj}$ while keeping the parameters $\theta_j$ fixed (the E step), and maximising it with respect to the parameters $\theta_j$ while holding $\gamma_{tj}$ fixed (the M step). One can see that the negative free energy is maximised when $q(s_t) \propto p(y_t, s_t \mid \pi, \theta)$, i.e. when

$$\gamma_{tj} = \frac{\pi_j\, p(y_t \mid \theta_j)}{\sum_k \pi_k\, p(y_t \mid \theta_k)}, \qquad (34)$$

which determines the E step. For the M step, one only needs to maximise the first term of the negative free energy, as this is the only term dependent on the parameters. If one introduces a Lagrange multiplier to enforce the constraint $\sum_j \pi_j = 1$, one obtains

$$\pi_j = \frac{1}{T - 1} \sum_{t=1}^{T} \gamma_{tj} \qquad (35)$$

for the total probability of being in state $j$. One now needs to maximise with respect to the other model parameters, which is easily done once one sees that the term to be maximised is just a slight modification of the MAR likelihood term of Eq. (12). As a function of one set of MAR parameters $\theta_j = \{\Lambda_j, A_j\}$, we have

$$F = \sum_{t=1}^{T} \gamma_{tj} \ln p(y_t \mid \theta_j) = -\frac{d r_j}{2} \ln(2\pi) + \frac{r_j}{2} \ln|\Lambda_j| - \frac{1}{2}\, \mathrm{tr}\big(\Lambda_j (Y_j - X_j A_j)^T (Y_j - X_j A_j)\big), \qquad (36)$$

where the rows of $Y_j$ and $X_j$ are given by $\sqrt{\gamma_{tj}}\, y_t$ and $\sqrt{\gamma_{tj}}\, x_t$, respectively, and $r_j = \sum_t \gamma_{tj}$. This is just as one would intuitively expect: for each state, every data point is weighted by its corresponding probability. The standard ML formulae can be applied to obtain the new ML solution at each M step.
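A compact sketch of the resulting EM iteration for a univariate ($d = 1$) AR mixture, alternating Eq. (34) with the weighted ML fits of Eqs. (35) and (36); names are ours, and the log-domain E step is a standard numerical safeguard rather than anything prescribed by the paper:

```python
import numpy as np

def ar_mixture_em(y, p, K, n_iter=100, seed=0):
    """EM for a K-state univariate AR(p) mixture (Eqs. (30)-(36) with d = 1)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float) - np.mean(y)
    X = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    Y = y[p:]
    T = len(Y)
    g = rng.dirichlet(np.ones(K), size=T)      # random responsibilities break symmetry
    pi = np.full(K, 1.0 / K)
    A = np.zeros((K, p))
    s2 = np.full(K, np.var(Y))
    for _ in range(n_iter):
        # M step: weighted least squares per state, rows scaled by sqrt(gamma) (Eq. (36))
        for j in range(K):
            w = np.sqrt(g[:, j])
            A[j] = np.linalg.lstsq(X * w[:, None], Y * w, rcond=None)[0]
            r = w * (Y - X @ A[j])
            s2[j] = (r @ r) / max(g[:, j].sum(), 1e-12)
        pi = g.sum(axis=0) / T                 # Eq. (35), up to the paper's 1/(T-1) factor
        # E step: responsibilities via log densities for numerical stability (Eq. (34))
        resid = Y[:, None] - X @ A.T           # (T, K) residual under each state
        loglik = -0.5 * (np.log(2 * np.pi * s2) + resid**2 / s2)
        logw = np.log(pi) + loglik
        logw -= logw.max(axis=1, keepdims=True)
        g = np.exp(logw)
        g /= g.sum(axis=1, keepdims=True)
    return pi, A, s2, g
```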

2.4. The Bayesian MAR mixture model

The mixture model of the previous section could be used to determine the parameters of a number of MAR states in a data set if the total number of states is known. If the number of states is not known in advance, we are once again faced with a model order selection problem. Previously, we showed how introducing priors on the model parameters leads to a free energy expression whose parameters can be integrated over, leaving an approximation to the log evidence that depends only on the model order. We now show how this procedure can be implemented for the MAR mixture model introduced above.


The total log evidence (including the prior densities) can be written as

$$\ln p(D) = \ln \int \sum_{S} p(Y, S \mid \pi, \theta)\, p(\pi)\, p(\theta)\, d\theta\, d\pi = \ln \int \sum_{S} q(S)\, q(\theta)\, q(\pi)\, \frac{p(Y, S \mid \pi, \theta)\, p(\pi)\, p(\theta)}{q(S)\, q(\theta)\, q(\pi)}\, d\theta\, d\pi \geq F = \int \sum_{S} q(S)\, q(\theta)\, q(\pi)\, \ln \frac{p(Y, S \mid \pi, \theta)\, p(\pi)\, p(\theta)}{q(S)\, q(\theta)\, q(\pi)}\, d\theta\, d\pi, \qquad (37)$$

by Jensen's inequality. Here $S = \{s_1, \ldots, s_T\}$ is the set of hidden variables. The form of the prior density $p(\theta)$ is the same as in the stationary Bayesian MAR model, and the appropriate prior density for the variable $\pi$ is a Dirichlet prior,

$$D(\pi \mid u) = \frac{1}{Z(u)} \prod_{i=1}^{K} \pi_i^{u_i - 1}, \qquad (38)$$

where $Z(u) = \prod_i \Gamma(u_i) / \Gamma\big(\sum_j u_j\big)$ and $\sum_i \pi_i = 1$. The variational Bayesian equivalent of the E step of the EM algorithm determines the distribution $q(S)$ over the hidden variables. As a functional of $S$, the negative free energy is

$$F_S = \int \sum_{S} q(S)\, q(\theta)\, q(\pi)\, \ln p(Y, S \mid \pi, \theta)\, d\theta\, d\pi - \sum_{S} q(S) \ln q(S). \qquad (39)$$

If one expresses this in terms of the variables $\gamma_{tj}$, one has

$$F_{\gamma_{tj}} = \int q(\theta_j)\, q(\pi)\, \big[\gamma_{tj} \ln \pi_j\, p(y_t \mid \theta_j) - \gamma_{tj} \ln \gamma_{tj}\big]\, d\theta_j\, d\pi = \gamma_{tj}\, I_{tj} - \gamma_{tj} \ln \gamma_{tj},$$

where

$$I_{tj} = \int q(A_j)\, q(\Lambda_j)\, q(\pi) \left[-\frac{1}{2}(y_t - x_t A_j)\, \Lambda_j\, (y_t - x_t A_j)^T + \ln \pi_j + \frac{1}{2} \ln|\Lambda_j|\right] dA_j\, d\Lambda_j\, d\pi. \qquad (40)$$

Evaluating $I_{tj}$ requires the integrals

$$\int q(\pi) \ln \pi_j\, d\pi = \psi(u_j) - \psi\Big(\sum_i u_i\Big), \qquad (41)$$

$$\int q(\Lambda_j) \ln|\Lambda_j|\, d\Lambda_j = \sum_{i=1}^{d} \psi\big((r_j + 1 - i)/2\big) - \ln|R_j| \qquad (42)$$

(where $\psi$ denotes the digamma function, defined in Appendix C) and

$$\int\!\!\int q(A_j)\, q(\Lambda_j)\, (y_t - x_t A_j)\, \Lambda_j\, (y_t - x_t A_j)^T\, dA_j\, d\Lambda_j = (y_t - x_t \hat{A}_j)\, \hat{\Lambda}_j\, (y_t - x_t \hat{A}_j)^T + \mathrm{tr}\big(\hat{\Lambda}_j (I_d \otimes x_t)\, \hat{\Sigma}\, (I_d \otimes x_t)^T\big). \qquad (43)$$

As a functional of $\gamma_{tj}$, the negative free energy can therefore be written

$$F_{\gamma_{tj}} = \gamma_{tj} \ln \frac{\exp I_{tj}}{\gamma_{tj}}. \qquad (44)$$

These identities are sufficient to determine $\gamma_{tj} \propto \exp I_{tj}$, which must then be normalised for each value of $t$; this is done by resetting $\gamma_{tj}$ to $\gamma_{tj} / \sum_i \gamma_{ti}$, which completes the variational E step.

The variational M step involves updating the distributions for $\pi$ and $\theta_j = \{A_j, \Lambda_j, \alpha_j\}$. As a functional of $\pi_j$, the negative free energy is $F_{\pi_j} = -KL\big(q(\pi_j);\, \exp I(\pi_j)\big)$, where

$$I(\pi_j) = \left[\sum_{t=1}^{T} \gamma_{tj} + u_j - 1\right] \ln \pi_j. \qquad (45)$$

The negative free energy is maximised when $q(\pi_j) \propto \exp I(\pi_j)$, so the final distribution for $\pi$ is just $\prod_j q(\pi_j)$, which is Dirichlet (after the appropriate normalisation). The update rules for the MAR parameters are extremely straightforward. Just as in the standard MAR mixture model, the updates for each set of parameters $\{A_j, \Lambda_j, \alpha_j\}$ proceed as in the stationary case, but with each data point weighted by the probabilities $\gamma_{tj}$. The updates are therefore as in Section 2.2.1, with $Y$ and $X$ replaced by $Y_j$ and $X_j$, whose rows are $\sqrt{\gamma_{tj}}\, y_t$ and $\sqrt{\gamma_{tj}}\, x_t$, respectively (one also has to replace the appropriate occurrences of $T$ with $g_j = \sum_t \gamma_{tj}$, as illustrated in Section 2.3.1).

The negative free energy for this model (to be used in model order selection) is also intuitive in form. In terms of a particular MAR model order $p$ and number of states $K$, one has

$$F(p, K) = \sum_{j=1}^{K} \{\mathrm{MAR}(j, p)\} - KL\big(q(\pi \mid D);\, p(\pi)\big) - \sum_{S} q(S) \ln q(S), \qquad (46)$$

where

$$\mathrm{MAR}(j, p) = -\frac{g_j}{2} \ln|R_j| - KL\big(q(A_j \mid D);\, p(A_j \mid \alpha_j)\big) - KL\big(q(\alpha_j \mid D);\, p(\alpha_j)\big) + \ln \Gamma_d(g_j/2) + \frac{d g_j}{2} \ln 2 \qquad (47)$$

is the contribution to the negative free energy from the $j$th MAR($p$) model. The remaining KL-divergence and entropy terms sum to

$$\sum_{j=1}^{K} \ln \Gamma(\bar{u}_j) - \ln \Gamma\Big(\sum_j \bar{u}_j\Big) - \sum_{j=1}^{K} \ln \Gamma(u_j) + \ln \Gamma\Big(\sum_j u_j\Big), \qquad (48)$$

where $\bar{u}_j$ denote the parameters of the posterior Dirichlet $q(\pi)$.

2.5. HMMAR models

We are now ready to look at HMMAR models. As remarked above, the HMMAR model is the simplest model that can represent dynamic information about which MAR state the data is in at a particular time. Once the parameters of an HMMAR model have been learnt, it is possible to generate new time series from the model that preserve the dynamic structure of the original data set. Mackay applied the variational Bayesian approach to a certain class of HMMs in Mackay (1997); here, we apply the variational approach to the HMMAR model.

Recall that for the MAR mixture model, the likelihood of a single data point was written as $p(y_t \mid \theta) = \sum_{s_t} p(y_t, s_t \mid \theta)$, so the total likelihood is just the product of $T$ such terms. This is the same as saying that there is no dynamic structure in the data. In cases where the MAR state changes within a particular data set, this assumption is not valid. One needs to include something that takes into account all possible state changes at different times. This is done by writing the likelihood of a data point as

$$p(y_t \mid \theta) = \sum_{s_t} \sum_{s_{t-1}} p(y_t, s_t, s_{t-1} \mid \theta) = \sum_{s_t} \sum_{s_{t-1}} p(y_t \mid s_t, \theta)\, p(s_t \mid s_{t-1}). \qquad (49)$$

Here, $p(s_t = i \mid s_{t-1} = j)$ denotes the probability of being in state $i$ at time $t$, given that one was in state $j$ at time $t-1$. By marginalising over the extra variable $s_{t-1}$, dynamic structure can now be represented. The set of model parameters is $\theta = \{A, \Lambda, \alpha, \pi, P\}$, where $\pi$ is now the initial state distribution and $P$ is a state transition probability matrix. By analogy with the analysis of the mixture model, we can write down the log likelihood as

$$\ln p(D) = \sum_{t=1}^{T} \ln p(y_t \mid \theta) = \ln \sum_{S} p(y_1 \mid s_1, \theta)\, p(s_1 \mid \theta) + \sum_{t=2}^{T} \ln \sum_{S} p(y_t \mid s_t, \theta)\, p(s_t \mid s_{t-1}, \theta) = \ln \sum_{S} b_{s_1} \pi_{s_1} + \sum_{t=2}^{T} \ln \sum_{S} b_{s_t} P_{s_t s_{t-1}} \geq \sum_{S} q(S) \left[\ln b_{s_1} \pi_{s_1} + \sum_{t=2}^{T} \ln b_{s_t} P_{s_t s_{t-1}} - \ln q(S)\right], \qquad (50)$$

where

$$b_{s_t} = p(y_t \mid s_t, \theta), \quad b_j = p(y_t \mid s_t = j, A, \Lambda, \alpha); \qquad \pi_{s_1} = p(s_1 \mid \theta), \quad \pi_j = p(s_1 = j \mid \pi); \qquad P_{s_t s_{t-1}} = p(s_t \mid s_{t-1}, \theta), \quad P_{ij} = p(s_t = i \mid s_{t-1} = j, P). \qquad (51)$$

It should be clear now that for the fully Bayesian model, after inclusion of the necessary prior densities (the prior over $P$ will be discussed shortly), the final negative free energy expression is modified to

$$F = \int \sum_{S} q(S)\, q(\theta) \left[\ln b_{s_1} \pi_{s_1} + \sum_{t=2}^{T} \ln b_{s_t} P_{s_t s_{t-1}} - \ln \frac{q(\theta)}{p(\theta)} - \ln q(S)\right] d\theta. \qquad (52)$$

If one defines the quantity $b^{*}_{s_t} = \int q(\theta)\, b_{s_t}\, d\theta$, and similarly for $\pi^{*}$ and $P^{*}$, then as a functional of $S$ the negative free energy becomes

$$F = \sum_{S} q(S) \left[\ln b^{*}_{s_1} \pi^{*}_{s_1} + \sum_{t=2}^{T} \ln b^{*}_{s_t} P^{*}_{s_t s_{t-1}} - \ln q(S)\right], \qquad (53)$$

which is in exactly the same form as Eq. (50). This means that one can use the same algorithm to determine the distribution over hidden states (i.e. the E step) for both the non-Bayesian and the fully Bayesian models; the only difference is that one replaces $b$, $\pi$ and $P$ with $b^{*}$, $\pi^{*}$ and $P^{*}$. The recursive 'forward-backward' algorithm (so called because it requires a forward and a backward pass through the data to determine the probabilities) that performs this inference for standard non-Bayesian hidden Markov models is well known and has been described in great detail in the literature (see, for example, Rabiner and Juang, 1986). For the purposes of this paper, all that matters is that this algorithm outputs a set of probabilities $\gamma_{tj} = \langle s_t = j \rangle$, just as in the mixture model. The algorithm also determines $v_{tij} = \langle s_t = i \mid s_{t-1} = j \rangle$, where the angled brackets denote expected probabilities given the data and the model parameter distributions.

The variational M step is now straightforward. The probabilities $\gamma_{tj}$ can be used to weight the data matrices $Y$ and $X$, as in the mixture model, to determine the updated distributions for the MAR parameters $A_j$, $\Lambda_j$ and $\alpha_j$. The initial state distribution $\pi$ is updated in the same way as in the mixture model (see Eqs. (44) and (45)). The update for the state transition matrix is only slightly more complicated. The relevant prior density for this matrix is given by a product of Dirichlet densities: essentially one has a Dirichlet density for each row of the matrix, each of which must be normalised to reflect the fact that, whatever the state is now, it must have come from one of the $K$ states at the previous time step. For $P_{ij}$, the prior is

$$p(P) = \prod_{i} D\big(\{P_{i1}, \ldots, P_{iK}\} \mid v\big). \qquad (54)$$

As a functional of $P_{ij}$, the negative free energy is

$$F_{P_{ij}} = -KL\big(q(P_{ij});\, \exp I(P_{ij})\big), \qquad (55)$$

where

$$I(P_{ij}) = \left[\sum_{t=2}^{T} v_{tij} + v_j - 1\right] \ln P_{ij}. \qquad (56)$$

The final distribution for $P$ is therefore also a product of Dirichlets. As one might expect, the expression for the free energy is a straightforward extension of the mixture model result. In terms of a particular MAR model order $p$ and number of states $K$, one has

$$F(p, K) = \sum_{j=1}^{K} \{\mathrm{MAR}(j, p)\} - KL\big(q(\pi \mid D);\, p(\pi)\big) - KL\big(q(P \mid D);\, p(P)\big) - \sum_{S} q(S) \ln q(S), \qquad (57)$$

where the expression $\mathrm{MAR}(j, p)$ was defined earlier and represents the contribution from the $j$th MAR($p$) model. The extra $P$-dependent terms sum to

$$\sum_{i,j=1}^{K} \ln \Gamma(\bar{P}_{ij}) - \sum_{i=1}^{K} \ln \Gamma\Big(\sum_{j=1}^{K} \bar{P}_{ij}\Big) - K \sum_{j=1}^{K} \ln \Gamma(v_j) + K \ln \Gamma\Big(\sum_{j=1}^{K} v_j\Big), \qquad (58)$$

where $\bar{P}_{ij}$ denote the parameters of the posterior Dirichlets for the rows of $P$.

Once one has inferred all the parameters of the hidden Markov model, one often wants to determine the most likely state sequence for the data. For the mixture model this is easy: one just chooses the state with the highest probability at each time point. For the HMM it is not quite so straightforward, and one has to use the Viterbi decoding algorithm to determine the most likely state sequence. This is very similar to the forward-backward algorithm used to determine the distribution over hidden states in the E step, and has been used extensively in the literature; the main difference is that some of the sums are replaced by maximisations in the Viterbi algorithm. The reader is referred to Rabiner and Juang (1986) and Ghahramani (2001) for more information on the forward-backward and Viterbi algorithms and other types of hidden Markov models.
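For reference, the Viterbi decoder referred to above is compact in log space. The following is a generic sketch of the standard algorithm (not the authors' implementation; names are ours), taking expected log observation likelihoods and log transition probabilities from the fitted model:

```python
import numpy as np

def viterbi(log_b, log_pi, log_P):
    """Most likely state sequence for an HMM (used here after HMMAR fitting).

    log_b:  (T, K) log observation likelihoods ln b*_tj
    log_pi: (K,)   log initial state probabilities
    log_P:  (K, K) log transition matrix, log_P[i, j] = ln p(s_t = i | s_t-1 = j)
    """
    T, K = log_b.shape
    delta = log_pi + log_b[0]          # best log score ending in each state
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[None, :] + log_P  # cand[i, j]: come from state j, land in state i
        back[t] = cand.argmax(axis=1)
        delta = cand.max(axis=1) + log_b[t]
    states = np.empty(T, dtype=int)
    states[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):     # trace the argmax pointers backwards
        states[t] = back[t + 1, states[t + 1]]
    return states
```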

3. Results

In this section, we apply the algorithms developed in Section 2 to a number of different data sets. In Section 3.1, we compare spectral estimation results from the Bayesian MAR model and a conventional FFT based approach, applied to some EEG data. Section 3.2 looks at the power, coherence and phase between postural tremor and EMG. The necessary calculations of the spectral matrix and approximate confidence limits from the parameters of a MAR model are detailed in Appendix F. Sections 3.3 and 3.4 present results of the HMMAR model applied to non-stationary EMG and EEG activity, respectively.

3.1. Data example 1

The theory of parameter estimation in MAR models has now been sufficiently developed to consider some simple examples and to compare the results with those obtained from a conventional FFT based approach. The first example looks at 60 s of EEG data sampled at 80 Hz while the subject was at rest. Fig. 1 shows the spectra obtained from AR and FFT methods when applied to (a) the full 60 s and (b) the first 4 s of the data set. Ninety-five percent confidence limits are also plotted. For case (a), the FFT block length was 256 data points and the optimum AR model order was 57. In case (b), the FFT block was reduced to 128 points and the AR model order also fell, to 28. One can see that the two spectra agree very well when all the data are used. When only a small amount of data is available, however, both spectral estimates degrade, but the AR spectrum is more accurate at higher frequencies and continues to represent the alpha peak as the strongest peak in the spectrum. The AR spectrum is also smoother and is defined at all frequencies, unlike the pointwise FFT spectrum, which is forced to sacrifice some spectral resolution to maintain a reasonable variance. However, one should be aware that the AR approach is much more computationally intensive than the FFT approach: in this particular example, the AR method took on the order of 100 times as long as the FFT method, a fact that should be kept in mind when choosing a method of analysis.
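The spectral quantities themselves follow from the standard MAR result used in Appendix F (not reproduced here): the transfer function $H(f) = [I - \sum_k a_k e^{-2\pi i f k / f_s}]^{-1}$ gives the spectral matrix $S(f) = H(f)\, \Sigma\, H(f)^{H}$, from which power, coherence and phase are read off. A hedged NumPy sketch follows; the names are ours, transpose conventions depend on how Eq. (8) is arranged, and the confidence-limit calculation is omitted:

```python
import numpy as np

def mar_spectra(A, Sigma, fs, freqs):
    """Spectral matrix of a d-channel MAR(p) model at the given frequencies.

    A:     (p, d, d) coefficient matrices a_1..a_p (arranged per Eq. (8))
    Sigma: (d, d)    residual noise covariance
    Uses the standard MAR result S(f) = H(f) Sigma H(f)^H with
    H(f) = [I - sum_k a_k exp(-2j*pi*f*k/fs)]^{-1}.
    """
    p, d, _ = A.shape
    S = np.empty((len(freqs), d, d), dtype=complex)
    for n, f in enumerate(freqs):
        Af = np.eye(d, dtype=complex)
        for k in range(1, p + 1):
            Af -= A[k - 1] * np.exp(-2j * np.pi * f * k / fs)
        H = np.linalg.inv(Af)
        S[n] = H @ Sigma @ H.conj().T
    return S

def coherence_phase(S, i, j):
    """Coherence and phase spectra between channels i and j from the spectral matrix."""
    coh = np.abs(S[:, i, j]) ** 2 / (S[:, i, i].real * S[:, j, j].real)
    phase = np.angle(S[:, i, j])
    return coh, phase
```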

Fig. 1. Power spectra for FFT and AR estimators, applied to a univariate EEG data set. The left-hand spectra derive from the 60 s long data set; the right-hand spectra come from the first 4 s of this set.


3.2. Data example 2


The second example looks at postural tremor. Acceleration from the 5th distal phalanx of the right hand was recorded simultaneously with EMG activity from the associated abductor digiti minimi (ADM) muscle while the hand was held out with the fingers spread. The data were sampled at 202 Hz and approximately 192 s of data were recorded. The static MAR model developed in the previous sections was applied to the data for model orders ranging from 2 to 70. The model with the highest evidence was chosen as the most representative of the data, and a standard FFT based analysis was also run so as to compare the two approaches. Fig. 2 shows the power, coherence and phase spectra obtained using the two methods. One can see a very good agreement between the MAR and FFT spectra, as one would expect for such a large data set. In addition, in physiological terms we would anticipate the coupling between ADM EMG and finger tremor to be strong, which enhances the spectra further. The confidence limits for the FFT spectra are tighter than the corresponding MAR limits, which is a result of the approximation made when calculating the MAR confidence limits, as discussed earlier. One advantage of the MAR approach (which is only small in this particular case) comes when one wants to estimate the slope of the phase spectrum, which gives the time lag between the two time series over a particular frequency band (Priestley, 1981). One should remember that the FFT spectra are pointwise spectra, while the MAR spectra are continuous, so in cases where the number of points in a particular frequency band is limited in the FFT approach (i.e. when the data set is small), an AR approach may be more appropriate.

Fig. 2. Power, coherence and phase spectra for FFT and MAR estimators, applied to a bivariate data set, where one channel is finger acceleration and the other channel is EMG. The data set is 192 s long.

In the present example, though, this is not the case. There is a broad band of coherence between ADM EMG and finger acceleration, with a large peak at around 40 Hz. The dependency of phase on frequency is clearly linear over 25–65 Hz. The MAR and FFT phase spectra look similar and suggest that EMG drives finger tremor with a delay appropriate for electromechanical coupling. In this particular example, where ADM contraction strength was about 40% of the maximum possible, this suggests that motor unit activity might have been synchronised in the Piper frequency band in order to drive finger tremor at such high frequencies (Brown et al., 1998).

The above procedure was then repeated on the first 3 s of the data set, so that the effects of reducing the amount of data on the spectral estimates could be assessed. In order to obtain a reasonable FFT estimate, the number of points in each block had to be reduced from 256 to 32. The optimum MAR model order also dropped, from 59 to 21. When there is plenty of data the optimum model order can be quite high, but as the length of the data set decreases, the optimum model order falls to reflect this. One can understand this in another way: the choice of model order depends on the data and on the prior, and as the total amount of data decreases, the prior term becomes more important, so lower model orders will be preferred. Fig. 3 shows the spectral estimates in this case. One can see that, because of the reduced frequency resolution in the FFT estimate, the spectra are much choppier. The MAR spectra, by contrast, are smoother, but once again the confidence limits are larger, especially on the coherence estimate. Both estimates show coherences that are biased high, but only the MAR estimate shows the lowest frequency peak in the coherence spectrum.

Fig. 3. Power, coherence and phase spectra for FFT and MAR estimators, applied to a bivariate data set, where one channel is finger acceleration and the other channel is EMG. The data set is 3 s long.

Fig. 4 gives the spectra for the same ADM EMG signal as above, together with the simultaneously recorded ipsilateral first dorsal interosseous (1DI) muscle. In this case, the relationship between the two signals is not as strong, and will be due, physiologically, to a common presynaptic drive to the motor neurones of these two hand muscles. One can see that the coherence is much smaller than in the previous example, and the major peak has shifted down to 15–35 Hz. Taken in conjunction with the spectra in Fig. 2, this suggests that descending drives to muscle at Piper frequencies are more focal in their effects on spinal alpha motor neurones than those in the beta (15–30 Hz) band. Both drives are believed to originate in the contralateral motor cortex (Conway, 1995; Salenius et al., 1997; Brown et al., 1998; Halliday et al., 1998; Mima et al., 2000). In Fig. 4, the dependency of phase on frequency is linear over the beta range, and the MAR and FFT phase spectra suggest a small lag of 1DI behind ADM. Such a phase difference is useful in excluding volume conduction as a likely cause of intermuscular coherence. In this case also, the FFT spectral estimates for coherence and phase are degraded when compared with the MAR estimates.

Fig. 4. Power, coherence and phase spectra for FFT and MAR estimators, applied to a bivariate data set, where both channels are EMG. The data set is 192 s long.

As the size of the data set is reduced to 3 s, one can see from Fig. 5 that both coherence estimates suffer. This is not surprising, since the original values were so small. Apart from this, once again the MAR estimates maintain their smoothness as the FFT estimates lose frequency resolution.

Fig. 5. Power, coherence and phase spectra for FFT and MAR estimators, applied to a bivariate data set, where both channels are EMG. The data set is 3 s long.

The conclusion from this example is that for static data there is not a huge difference between the MAR and FFT spectral estimates. The main advantage of the MAR approach was that the spectra were smoother overall, which would be useful when trying to calculate slopes of phase spectra. The MAR spectra also outperformed the FFT estimates when there was plenty of data but the coherence between the two signals was low (Fig. 4). Due to the large confidence limits on the MAR spectra, however, this advantage was not maintained as the size of the data set decreased.

3.3. Data example 3

As a first test of the VHMMAR model, we look at some non-stationary EMG activity, recorded from the forearm extensors while the wrist was extended from 10 to 40° periodically every 5–10 s, with a brief rest in between. Fig. 6 shows the recorded data and the most probable states inferred by the HMM. A three-state HMM gave the highest free energy of all the models, with a MAR model order of 26. It is clear that the HMMAR model has successfully divided the data into three states, consistent with two levels of tonic EMG activity (states 2 and 3) and a brief movement period. The states are easy to distinguish by eye as well, but in this example the HMMAR model proves useful in providing an objective measure of when each state begins and ends.

Fig. 6. The bottom graph shows the raw forearm extensor EMG, while the top graph shows the most likely occupation of each of the three inferred states.

3.4. Data example 4

The results of the previous example are promising, although the three states were very easy to distinguish by eye. As a second example, we present some EEG data that is not so visually clear in its state separation, let alone its state timing. Fig. 7 shows the data recorded from the right parieto-occipital region (referred to linked ears), the inferred state occupation, and the true state occupation times. The data were generated in an experiment in which a subject sat in darkness while a strobe light was flashed at three different frequencies, each separated by a rest period. The flash frequencies were 9, 10 and 15 Hz. A total of four states was chosen as the most likely number, with a MAR model order of 12. Fig. 8 shows the inferred spectra for each of these states. One can see clearly from these spectra that the effect of the flashing light is to create oscillatory activity that is synchronous with the strobe source. Spectral peaks appear at the base and harmonic frequencies. In addition, the alpha activity at about 11 Hz seen at rest is lost when the strobe begins.

Fig. 7. The top trace shows the recorded parieto-occipital EEG activity, the middle trace shows the states inferred by the HMMAR model, and the lower trace shows the timings of the blocks of strobe activity.

Fig. 8. This figure shows the inferred spectra of each of the four states, and an expanded representative section of the raw EEG.

Lastly, we compared these results with those obtained from a sonogram, derived by conventional Fourier methods. The sonogram is shown in Fig. 9, with the start and end times of the strobe activity marked by the solid white vertical lines. In order to distinguish following responses at 9 and 10 Hz, a sliding window of 2 s blocks was used. As noted before, frequency resolution is achieved at the expense of temporal resolution, so one has difficulty identifying the exact start and finish times of the different spectral regimes. In addition, the alpha peak at 11 Hz during the rest periods is not clearly shown in the sonogram. In these respects, the HMMAR model is superior. A fairer comparison might be achieved by comparing the output of the HMMAR model with a 'mixture of FFTs' model rather than the sonogram; however, for states with a small number of data points, the FFT derived spectral estimates could be quite poor.

In this particular experimental situation, EEG activity was recorded from eight sites on the scalp. One could therefore easily look at multivariate state changes, or one could run the HMMAR model on cleaned EEG data obtained from an ICA decomposition of the signal. An example of the ICA decomposition applied to the cleaning of EEG data is given in Cassidy and Brown (2001); this would improve the state separation of the HMMAR model even further. Alternatively, one could apply the HMMAR model directly to single or multiple ICA activations.


Fig. 9. This figure shows the time-varying spectra for the EEG data, as determined by conventional Fourier methods. The solid vertical lines mark the start and end points of the strobe activity.

4. Discussion

In this paper, we have applied MAR models to problems of spectral estimation pertinent to the neurophysiologist, and have introduced a Bayesian HMMAR model as a way of analysing non-stationary data. Model parameters were estimated using the variational Bayesian approach in order to tackle potential problems of model order selection. This approach has hitherto only been presented within the specialised signal processing community. By applying it to the MAR and HMMAR models we have demonstrated its potentially powerful applications to spectral estimation.

The advantages of the approach outlined here are most obvious when dynamic data sets are considered. The wider confidence limits of MAR compared with FFT spectral estimates from limited data sets (see Sections 3.1 and 3.2) suggest that the latter algorithm may be more suitable where comparisons between static states are desired, for example in determining the short-term effects of repetitive transcranial magnetic stimulation or short-acting drugs upon oscillations in the EEG. However, our results also suggest that in situations where coherences between signals are low, or when spectral smoothness and frequency resolution are important, the MAR approach may give improved estimates (see also Spyers-Ashby et al., 1998). We should also re-emphasise that the most likely MAR model order in our examples is much higher than the model orders preferred in many other studies. Our high model orders are clearly justified, having been derived from a well-defined model order selection procedure. They are also intuitively satisfactory, since if large quantities of data are available there is no reason to penalise moderately high model orders if they improve the accuracy of the spectra.

There are several techniques that may give power changes with good temporal resolution during changing conditions, but many neurophysiological studies of movement or dynamic cognitive tasks require the assessment of both the power and the coherence between signals such as different EEG channels. Usually this is achieved using the FFT algorithm. Data are divided into serial, often overlapping, windows, then averaged across tasks for each window (see, for example, Kilner et al., 2000). Windows are necessarily kept narrow so that the signal may be considered stationary, although the corollary of this is poor frequency resolution. Neither is local stationarity of the signal assured, as the data are segmented according to block size and overlap rather than according to any state change in the signal. We have already seen how the AR model gives superior frequency resolution over the FFT method where windows have to be kept short. But in combination with the HMM, where data are assigned to states on probabilistic grounds (rather than subjected to arbitrary temporal segmentation), there are additional advantages. State changes are detected with improved temporal resolution, and spectral estimates are potentially enhanced, as within a state they need not be limited to data derived from a single window. As a potential application, the HMMAR model might be useful in determining the precise timing of EEG states in long-latency visual perceptual tasks for correlation with behavioural measures of perceptual timing (see, for example, Rodriguez et al., 1999).

There is yet another advantage of the HMMAR model, which may open up new experimental possibilities. The standard FFT approach assumes that tasks involve stereotyped processing, as data are divided into sequential windows and averaged across tasks or events. However, many neurophysiological signals vary unpredictably. The HMMAR model is able to choose common states for averaging, regardless of their timing in the time series. Two examples serve to illustrate the potential usefulness of this approach. In Section 3.2 we saw how the coherence between EMG signals due to common presynaptic drives from the motor cortex has a distinctive pass band. This 'cortical signature' in the coupling between muscles may be used to infer a cortical origin for muscle activity, such as in certain forms of myoclonus (Brown et al., 1999). The usefulness of this surrogate marker of cortical drive would be extended if EMG-EMG coherence could be determined in conditions with chorea and tics, or in paroxysmal movement disorders such as paroxysmal kinesogenic choreoathetosis, where the involuntary motor responses are very variable. Under these circumstances the HMMAR technique may identify common states and produce the necessary bivariate spectra.


Alternatively, the HMMAR model could be used to follow the sequential cortical spread of epileptic activity, or the high frequency activity that may precede it (Allen et al., 1992), with fine temporal resolution in the EEG, thereby helping to pinpoint epileptic foci that may be amenable to surgery.

Another experimental approach that is opened up by the HMMAR model is the potential for on-line intervention during motor or cognitive tasks. Consider the self-paced grasping of a target that requires navigation around an obstacle, the nature of which may change between trials. This requires several integrated activities, particularly the spatio-temporal organisation of the motor plan to avoid the viewed obstacle. We could derive information about the contribution of different cortical areas, and the temporal sequence of these contributions, by analysing the behavioural consequences of disrupting their function with short bursts of focal repetitive transcranial magnetic stimulation (Hallett, 2000). With the HMMAR model we can organise such an intervention during the preparation as well as the execution of the movement by identifying state changes in the EEG, for example over the parietal association cortex, which forms a critical staging post in the dorsal stream of visual processing related to visuomotor transformation (Milner and Goodale, 1995).

The HMMAR model has clear advantages over the FFT in bivariate analysis when dynamic signals are being considered. However, there is an additional potential advantage of the approach outlined here. In many cases, most obviously in the analysis of EEG changes, many channels must be analysed. Hitherto this has usually been accomplished by deriving multiple bivariate spectra (see, for example, Gerloff et al., 1998; Manganotti et al., 1998; Andres et al., 1999). Comparison then requires Bonferroni correction, which may be over-conservative as it assumes that different pair-wise comparisons are independent of one another. This is a scenario that is at odds with current views that cortical areas operate as a dynamic integrated network (Bressler et al., 1993; Singer, 1993). As we have shown, the AR model easily extends to multivariate spectral analysis, but until now the selection of model order has been problematic for limited data sets. The Bayesian approach described here addresses this problem, giving the prospect of a truly multivariate analysis of EEG-EEG coherence.

Of course, once one has introduced priors into the model, one can also look at the effects of changing these priors to incorporate other knowledge relevant to a particular experiment. In this paper, all the MAR coefficients are controlled by the same prior. This is a useful first approach, but one can easily consider different groupings of coefficients.


MAR coefficients at different lags can be related to different spectral peaks, and if many channels are recorded then the MAR coefficients will also relate to different cortical sites (in the EEG case, for example). One may have hypotheses or a priori reasons to expect coupling between different cortical areas or within different frequency bands. One can test these hypotheses by giving each putative group of MAR coefficients a different prior, and then decide which model is more representative of the data by looking at the final free energy. For more discussion of these issues, the reader is referred to Penny and Roberts (2001).

A final point worth mentioning with respect to MAR models as a whole is that once the parameters have been learnt, they can be used not only to calculate the basic spectral matrix but also to calculate other quantities of potential interest, such as the directed transfer function (Kaminski and Blinowska, 1991). Thus, using the methods of this paper it is a straightforward matter to examine non-stationary changes in the direction of flow of information between brain structures. We will be investigating these possibilities in future work.

To conclude, we hope that the reader has been convinced of the potential power of MAR models combined with Bayesian probabilistic methods for the spectral analysis of both stationary and non-stationary biomedical signals. The MAR framework provides a compact representation of a signal, and the Bayesian theory allows one to infer probability distributions over all of the MAR model parameters, providing the best representation of a data set given particular choices of model priors. Such probabilistic models are intrinsically elegant, but they are also extremely flexible and can readily be integrated into larger hierarchical probabilistic schemes, such as would be required in the synthesis of EEG and fMRI, for example. With such statistically desirable qualities, it seems likely that Bayesian MAR models will play an important part in such exciting future developments in brain imaging.

Appendix A: The Kronecker product If A is a m /m matrix and B is a n /n matrix, then the Kronecker product of A and B is the (mn ) /(mn ) matrix 2

a11 B A B  4. . . am1 B

. . . a1m B

3

5: amm B

(A:1)

For properties of the Kronecker product see, for example, page 477 of Box and Tiao (1992).
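For readers implementing these expressions numerically, the Kronecker product is built into most numerical libraries; the following minimal NumPy sketch (our illustration, not part of the original text) checks the block structure of Eq. (A.1).

```python
import numpy as np

# Minimal illustration of Eq. (A.1): A (x) B is built from blocks a_ij * B.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.eye(3)

K = np.kron(A, B)          # (mn) x (mn) Kronecker product
assert K.shape == (6, 6)
# The top-left 3x3 block equals a_11 * B, as in Eq. (A.1).
assert np.allclose(K[:3, :3], A[0, 0] * B)
```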


Appendix B: The vec-transpose operation

The vec-transpose operation generalises the vec and transpose operations. Suppose X is an m x n matrix. $X^{(p)}$ operates column-wise on X, converts each column into a p x (m/p) matrix (filling each column of the new matrix in turn with the elements taken sequentially from the original column) and then stacks all these matrices to form a final pn x (m/p) matrix. As an example,

$$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \\ a_{41} & a_{42} \\ a_{51} & a_{52} \\ a_{61} & a_{62} \end{pmatrix}^{(2)} = \begin{pmatrix} a_{11} & a_{31} & a_{51} \\ a_{21} & a_{41} & a_{61} \\ a_{12} & a_{32} & a_{52} \\ a_{22} & a_{42} & a_{62} \end{pmatrix}. \qquad (B.1)$$

For more information on the vec-transpose operation and various identities involving it, the reader is referred to the paper by Minka (2000b).
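Because the vec-transpose is less familiar than vec or transpose, a small reference implementation may help. The sketch below is our own illustration (the function name vec_transpose is hypothetical); it reproduces the example of Eq. (B.1).

```python
import numpy as np

def vec_transpose(X, p):
    """Vec-transpose X^(p): reshape each column of X (column-wise fill)
    into a p x (m/p) matrix and stack the results vertically."""
    m, n = X.shape
    assert m % p == 0, "p must divide the number of rows"
    blocks = [X[:, j].reshape((p, m // p), order='F') for j in range(n)]
    return np.vstack(blocks)

# Reproduce the 6x2 example of Eq. (B.1) with p = 2.
X = np.arange(1, 13).reshape((6, 2), order='F')  # columns play the roles of a_11..a_61 and a_12..a_62
print(vec_transpose(X, 2))   # a 4 x 3 matrix with the layout of Eq. (B.1)
```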

Appendix C: The Di-Gamma and generalised Gamma functions

The generalised Gamma function is defined by

$$\Gamma_d(a) = \pi^{d(d-1)/4} \prod_{i=1}^{d} \Gamma\!\left(a - \frac{i-1}{2}\right), \qquad (C.1)$$

subject to the condition $\mathrm{Re}(a) > (d-1)/2$. The Di-Gamma function is defined as

$$\psi(x) = \frac{d}{dx}\ln\Gamma(x). \qquad (C.2)$$

For more information on this function, see Press et al. (1992). For more information on the generalised Gamma function, see Muirhead (1982).

Appendix D: KL divergences

Normal densities

The KL divergence for Normal densities $q(x) = N(x|\mu_q, \Sigma_q)$ and $p(x) = N(x|\mu_p, \Sigma_p)$ is

$$KL(q, p) = 0.5\ln\frac{|\Sigma_p|}{|\Sigma_q|} + 0.5\,\mathrm{tr}(\Sigma_p^{-1}\Sigma_q) + 0.5(\mu_q - \mu_p)^T \Sigma_p^{-1}(\mu_q - \mu_p) - \frac{d}{2}. \qquad (D.1)$$

Gamma densities

For Gamma densities $q(x) = G(x|b_q, c_q)$ and $p(x) = G(x|b_p, c_p)$ the KL divergence is

$$KL(q, p) = (c_q - 1)\psi(c_q) - \ln b_q - c_q - \ln\Gamma(c_q) + \ln\Gamma(c_p) + c_p \ln b_p - (c_p - 1)\left(\psi(c_q) + \ln b_q\right) + \frac{b_q c_q}{b_p}. \qquad (D.2)$$

Wishart densities

The Wishart distribution is given by (Muirhead, 1982, page 85)

$$W_d(X|a, B) = \frac{1}{Z_d(a, B)} |X|^{(a-d-1)/2} \exp\!\left(-\frac{1}{2}\mathrm{tr}(BX)\right), \qquad (D.3)$$

where

$$Z_d(a, B) = 2^{ad/2} |B|^{-a/2}\, \Gamma_d(a/2). \qquad (D.4)$$

The KL divergence of a Wishart can be defined in terms of the integral

$$L_d(a, B) = \int W_d(X|a, B)\ln|X|\, dX. \qquad (D.5)$$

The KL divergence between densities $q(X) = W_d(X|q, Q)$ and $p(X) = W_d(X|p, P)$ is given by

$$KL(q, p) = \frac{q - d - 1}{2} L_d(q, Q) - \frac{qd}{2} + \frac{q}{2}\mathrm{tr}(PQ^{-1}) + \ln\frac{Z_d(p, P)}{Z_d(q, Q)} - \frac{p - d - 1}{2} L_d(p, P). \qquad (D.6)$$

Dirichlet densities

The KL divergence between $q(x) = D(x|\mathbf{p})$ and $p(x) = D(x|\mathbf{q})$ is

$$KL(q, p) = \ln Z(\mathbf{p}) - \ln Z(\mathbf{q}) + \sum_{i=1}^{K}(q_i - p_i)\left(\psi(q_i) - \psi(p_i)\right), \qquad (D.7)$$

where $Z(\mathbf{p}) = \prod_i \Gamma(p_i)/\Gamma(\sum_j p_j)$ and K is the total number of states.
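Where these divergences are needed numerically, for example to monitor the free energy during the variational updates, they translate directly into code. The fragment below is a minimal sketch of Eqs. (D.1) and (D.2) (our illustration), assuming the scale/shape Gamma parameterisation used above; it relies only on SciPy's gammaln and digamma.

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_normal(mu_q, S_q, mu_p, S_p):
    """Eq. (D.1): KL divergence between multivariate Normals q and p."""
    d = len(mu_q)
    dm = mu_q - mu_p
    iSp = np.linalg.inv(S_p)
    return 0.5 * (np.log(np.linalg.det(S_p) / np.linalg.det(S_q))
                  + np.trace(iSp @ S_q) + dm @ iSp @ dm - d)

def kl_gamma(b_q, c_q, b_p, c_p):
    """Eq. (D.2): KL divergence between Gamma densities G(x|b, c)."""
    return ((c_q - 1) * digamma(c_q) - np.log(b_q) - c_q
            - gammaln(c_q) + gammaln(c_p) + c_p * np.log(b_p)
            - (c_p - 1) * (digamma(c_q) + np.log(b_q))
            + b_q * c_q / b_p)

# Sanity check: the divergence of a density from itself is zero.
assert abs(kl_gamma(1.0, 2.0, 1.0, 2.0)) < 1e-12
```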

Appendix E: Derivation of the Bayesian MAR model parameters

We first construct the negative free energy functional for the MAR model using the factorised form for the posterior density q.

$$F = \int q(u|D)\,\{\ln p(D|u) + \ln p(u) - \ln q(u|D)\}\, du$$

$$= \int q(u|D)\left[-\frac{dT}{2}\ln 2\pi + \frac{T}{2}\ln|\Lambda| - \frac{1}{2}\mathrm{tr}\!\left(\Lambda(Y - XA)^T(Y - XA)\right) + \frac{k}{2}\ln\frac{\alpha}{2\pi} - \frac{\alpha}{2}\mathrm{vec}(A)^T\mathrm{vec}(A) - \ln\!\left(b^c\,\Gamma(c)\right) + (c - 1)\ln\alpha - \frac{\alpha}{b} - \ln q(A|D) - \ln q(\alpha) - \ln q(\Lambda)\right] du. \qquad (E.1)$$

If we first look at the terms that are a function of A, we have

$$F_A = \int q(A|D)\, q(\Lambda|D)\, q(\alpha|D)\left[-\frac{1}{2}\mathrm{tr}\!\left(\Lambda(Y - XA)^T(Y - XA)\right) - \frac{\alpha}{2}\mathrm{vec}(A)^T\mathrm{vec}(A) - \ln q(A|D)\right] dA\, d\Lambda\, d\alpha$$

$$= \int q(A|D)\,\ln\frac{\exp[I(A)]}{q(A|D)}\, dA = -KL\!\left(q(A|D),\ \exp[I(A)]\right), \qquad (E.2)$$

where

$$I(A) = \int q(\Lambda|D)\, q(\alpha)\left[-\frac{1}{2}\mathrm{tr}\!\left(\Lambda(Y - XA)^T(Y - XA)\right) - \frac{\alpha}{2} a^T a\right] d\Lambda\, d\alpha \qquad (E.3)$$

(recall that a = vec(A)). The functional $KL(q, p) = \int q(x)\ln(q(x)/p(x))\, dx$ is a quantity known as the Kullback-Leibler divergence, which is a measure of the 'distance' between two probability distributions q and p. Clearly this distance will be minimised and, therefore, the free energy (as a functional of A) will be maximised when this quantity disappears, i.e. when

$$q(A|D) = \exp[I(A)] \propto \exp\left[-\frac{1}{2}\mathrm{tr}\!\left(\hat{\Lambda}(Y - XA)^T(Y - XA)\right) - \frac{\hat{\alpha}}{2} a^T a\right]. \qquad (E.4)$$

$\hat{\Lambda}$ and $\hat{\alpha}$ are the mean noise and weight precisions from the approximating densities $q(\Lambda|D)$ and $q(\alpha|D)$ (note that the densities have to be updated iteratively until they converge to a solution; as a result it does not matter which parameter is updated first). After some algebraic manipulation, one can show that $q(a|D) = N(a|\hat{a}, \hat{\Sigma})$, where

$$\hat{\Sigma} = \left(\hat{\Lambda} \otimes X^T X + \hat{\alpha} I_k\right)^{-1}, \qquad \hat{a} = \hat{\Sigma}\left(\hat{\Lambda} \otimes X^T X\right) a_{ml} \qquad (E.5)$$

and $a_{ml}$ is the vec form of the ML solution for A given earlier.

As a functional of the weight precision parameter $\alpha$, one can write

$$F_\alpha = -KL\!\left(q(\alpha|D),\ \exp[I(\alpha)]\right), \qquad (E.6)$$

where

$$I(\alpha) = \int q(a|D)\left[\frac{k}{2}\ln\frac{\alpha}{2\pi} - \frac{\alpha}{2} a^T a + (c - 1)\ln\alpha - \frac{\alpha}{b}\right] da = \frac{k}{2}\ln\frac{\alpha}{2\pi} - \frac{\alpha}{2}\left(\hat{a}^T\hat{a} + \mathrm{tr}(\hat{\Sigma})\right) + (c - 1)\ln\alpha - \frac{\alpha}{b}. \qquad (E.7)$$

This implies that $q(\alpha|D)$ is a Gamma density $G(\alpha|b', c')$ with

$$\frac{1}{b'} = \frac{1}{2}\hat{a}^T\hat{a} + \frac{1}{2}\mathrm{tr}(\hat{\Sigma}) + \frac{1}{b}, \qquad c' = c + \frac{k}{2}, \qquad \hat{\alpha} = b'c'. \qquad (E.8)$$

Finally, as a functional of the noise precision matrix $\Lambda$ one has

$$F_\Lambda = -KL\!\left(q(\Lambda|D),\ \exp[I(\Lambda)]\right), \qquad (E.9)$$

where

$$I(\Lambda) = \int q(A|D)\left[\frac{T}{2}\ln|\Lambda| - \frac{1}{2}\mathrm{tr}\!\left(\Lambda(Y - XA)^T(Y - XA)\right)\right] dA = \frac{T}{2}\ln|\Lambda| - \frac{1}{2}\mathrm{tr}\!\left[\Lambda\left((Y - X\hat{A})^T(Y - X\hat{A}) + V\right)\right],$$

with

$$V = (I_k^{(pd)})^T\left(I_k \otimes X^T X\right)\hat{\Sigma}^{(pd)} = (I_k^{(pd)})^T\left((I_d \otimes X^T X)\,\hat{\Sigma}\right)^{(pd)}, \qquad (E.10)$$

where the vec-transpose $X^{(p)}$ of a matrix X is defined in Appendix B. The distribution of $\Lambda$ is, therefore, $q(\Lambda|D) = W_d(\Lambda|r, R)$, where W is a Wishart density (see Appendix D) with parameters

$$R = \left((Y - X\hat{A})^T(Y - X\hat{A}) + V\right)^{-1}, \qquad r = T, \qquad \hat{\Lambda} = rR. \qquad (E.11)$$
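To make the update cycle concrete, the following schematic NumPy sketch (our own illustration, not the authors' implementation) iterates Eqs. (E.5), (E.8) and (E.11) until the posterior parameters stabilise. The prior parameters b0 and c0, the initialisations, and the explicit block-wise computation of V (in place of the vec-transpose identity of Eq. (E.10)) are our own simplifying assumptions.

```python
import numpy as np

def vb_mar(Y, X, b0=1000.0, c0=0.001, n_iter=50):
    """Schematic variational updates for the Bayesian MAR model.

    Y: T x d outputs; X: T x (p*d) matrix of lagged values. Iterates
    Eqs. (E.5), (E.8) and (E.11); V is computed block by block rather
    than via vec-transpose. A sketch under assumed priors, not the
    authors' exact procedure.
    """
    T, d = Y.shape
    pd = X.shape[1]
    k = pd * d
    XtX = X.T @ X
    a_ml = np.linalg.solve(XtX, X.T @ Y).reshape(-1, order='F')  # vec of the ML estimate
    Lam_hat = np.eye(d)        # initial mean noise precision
    alpha_hat = 1.0            # initial mean weight precision
    for _ in range(n_iter):
        # Eq. (E.5): posterior over the MAR coefficients.
        K = np.kron(Lam_hat, XtX)
        Sig = np.linalg.inv(K + alpha_hat * np.eye(k))
        a_hat = Sig @ (K @ a_ml)
        # Eq. (E.8): posterior over the weight precision.
        b_inv = 0.5 * a_hat @ a_hat + 0.5 * np.trace(Sig) + 1.0 / b0
        alpha_hat = (c0 + 0.5 * k) / b_inv
        # Eq. (E.11): posterior over the noise precision.
        A_hat = a_hat.reshape((pd, d), order='F')
        E = Y - X @ A_hat
        V = np.empty((d, d))
        for i in range(d):
            for j in range(d):
                V[i, j] = np.trace(XtX @ Sig[j*pd:(j+1)*pd, i*pd:(i+1)*pd])
        Lam_hat = T * np.linalg.inv(E.T @ E + V)
    return A_hat, Sig, alpha_hat, Lam_hat
```

In practice one would also evaluate the free energy F after each cycle, both to monitor convergence and to compare candidate model orders, as described in the main text.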

Appendix F: Spectral estimation and confidence limits

Spectral estimation

In this section we describe how the MAR coefficients can be used for spectral estimation. The derivation of the spectral matrix (which will contain all the information necessary to calculate power, coherence and phase spectra) follows simply from the theory of linear systems. For a more detailed discussion, the reader is referred to Priestley (1981). A general linear system will have a number of inputs and outputs, and one can write down a relation between the inputs and outputs in the time or frequency domain. In the frequency domain, the relation between input X and output Y is encapsulated by the system's transfer function G, which describes the system's frequency response to an impulse. The output spectrum $s_Y$ is related to the input spectrum $s_X$ through the relation

$$s_Y(\omega) = G(\omega)\, s_X(\omega)\, G^*(\omega), \qquad (F.1)$$

where * denotes the complex conjugate operation. The connection to the time domain is made through the relations

$$G(\omega) = \sum_{k} g_k\, e^{-i\omega k}, \qquad Y_t = \sum_{k} g_k\, X_{t-k}, \qquad (F.2)$$

which show how the impulse response matrices $g_k$ relate to the transfer function, and also how they relate inputs and outputs in the time domain. We have seen in the previous sections that a MAR process can be cast in the form of a linear system, which we can make even more explicit by rewriting Eq. (8) as

$$-\sum_{k=0}^{p} y_{t-k}\, a_k = e_t \qquad (F.3)$$

(compare with Eq. (F.2)), where $a_0 = -I_d$, the negative d-dimensional identity matrix. If one defines the quantity

$$a(\omega) = \sum_{k=0}^{p} a_k\, e^{-i\omega k}, \qquad (F.4)$$

then by analogy with Eq. (F.1), one can write

$$a(\omega)\, s_Y(\omega)\, a^*(\omega) = \frac{1}{2\pi}\Lambda^{-1}. \qquad (F.5)$$

The right hand side is a constant spectrum given by the noise covariance matrix, as one would expect for a Gaussian process, so by calculating the inverse matrices one finally obtains

$$s_Y(\omega) = \frac{1}{2\pi}\left(a^*(\omega)\,\Lambda\, a(\omega)\right)^{-1} \qquad (F.6)$$

for the spectral matrix. The elements $s_Y^{ii}(\omega)$ of this matrix give the auto-spectra, and the elements $s_Y^{ij}(\omega)$, with $i \neq j$, give the appropriate cross-spectra. One can calculate the coherence with the standard formula

$$\mathrm{coh}(\omega) = \frac{|s_Y^{ij}(\omega)|^2}{s_Y^{ii}(\omega)\, s_Y^{jj}(\omega)}, \qquad (F.7)$$

and the phase spectrum is given by the complex argument (exponent) of the cross-spectrum.
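As an illustration of Eqs. (F.4), (F.6) and (F.7), the sketch below (ours, not the authors' code) evaluates the spectral matrix on a grid of frequencies and derives power, coherence and phase; the lag-block layout of A_hat, and the pairing with Lam_hat from the vb_mar sketch in Appendix E, are our own assumptions.

```python
import numpy as np

def mar_spectra(A_hat, Lam_hat, freqs, fs):
    """Power, coherence and phase from MAR coefficients, Eqs. (F.4)-(F.7).

    A_hat: (p*d) x d coefficient matrix with lag blocks stacked;
    Lam_hat: d x d noise precision; freqs: frequencies in Hz; fs: sampling
    rate in Hz.
    """
    pd, d = A_hat.shape
    p = pd // d
    # a_0 = -I_d, and a_1..a_p taken from the estimated lag blocks (Eq. (F.3)).
    a_lags = [-np.eye(d)] + [A_hat[l*d:(l+1)*d, :].T for l in range(p)]
    S = []
    for f in freqs:
        w = 2 * np.pi * f / fs   # radians per sample
        a_w = sum(a_k * np.exp(-1j * w * k) for k, a_k in enumerate(a_lags))  # Eq. (F.4)
        S.append(np.linalg.inv(a_w.conj().T @ Lam_hat @ a_w) / (2 * np.pi))  # Eq. (F.6)
    S = np.array(S)                                    # n_freqs x d x d spectral matrices
    power = np.real(np.einsum('fii->fi', S))           # auto-spectra
    coh = np.abs(S) ** 2 / (power[:, :, None] * power[:, None, :])  # Eq. (F.7)
    phase = np.angle(S)                                # phase spectra
    return power, coh, phase
```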

Confidence limits

Approximate confidence limits for log-spectra, coherence and phase can be obtained. For further details about confidence limits in AR models, the reader is referred to Neumaier and Schneider (2001). First we consider how to construct the confidence limits for one of the MAR coefficients x. Here, the quantity of interest is the t-ratio

$$t = \frac{\tilde{x}}{s_x}, \qquad (F.8)$$

where $\tilde{x} = x - \hat{x}$ is the error of the ML estimate, and $s_x$ is the ML estimated standard deviation, calculated from the appropriate element of the coefficient covariance matrix. The standard assumption is that the t-ratio follows a t-distribution with $T - k$ degrees of freedom, which implies 100a% confidence limits of

$$\hat{x} \pm t(T - k,\ (1 + a)/2)\, s_x, \qquad (F.9)$$

where $t(n, a)$ is the a-quantile of a t-distribution with n degrees of freedom.

Now, consider some function f that depends on the MAR coefficients, and let $\hat{f} = f(\mathrm{vec}(\hat{A}))$. The notation vec(A) means to sequentially stack all of the columns of the MAR coefficient matrix A to create one long column vector. Linearizing about $\hat{f}$ gives $\hat{f} - f \approx (\nabla\hat{f})^T\,\mathrm{vec}(\hat{A} - A)$. From this, one can estimate the variance $\sigma_f^2 = \langle(\hat{f} - f)^2\rangle$ as

$$\sigma_f^2 \approx \left\langle(\nabla\hat{f})^T\,\mathrm{vec}(\hat{A} - A)\,\mathrm{vec}(\hat{A} - A)^T\,\nabla\hat{f}\right\rangle = (\nabla\hat{f})^T\,\Sigma\,(\nabla\hat{f}), \qquad (F.10)$$
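Eq. (F.10) is the familiar delta method, and can be paired with a numerical gradient when an analytic one is inconvenient; the fragment below is an illustrative sketch (the function names are hypothetical).

```python
import numpy as np

def delta_variance(f, a_hat, Sigma, eps=1e-6):
    """Eq. (F.10): var(f) ~ grad(f)^T Sigma grad(f).

    f: scalar function of the stacked coefficient vector; a_hat: vec of the
    estimated coefficients; Sigma: their covariance. The gradient is taken
    by central differences.
    """
    k = a_hat.size
    grad = np.empty(k)
    for i in range(k):
        e = np.zeros(k)
        e[i] = eps
        grad[i] = (f(a_hat + e) - f(a_hat - e)) / (2 * eps)
    return grad @ Sigma @ grad
```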

where $\Sigma$ is the covariance matrix for the MAR coefficients. If the function f is a non-linear function of the coefficients, then the t-ratio $t = \tilde{f}/s_f$ will not generally follow a t-distribution. However, one can still obtain approximate confidence limits on such functions by assuming that this t-ratio is t-distributed (see Neumaier and Schneider, 2001). To calculate the limits, first differentiate Eq. (F.5) with respect to one MAR coefficient x to get

$$\frac{\partial a}{\partial x}\, s_Y\, a^* + a\, \frac{\partial s_Y}{\partial x}\, a^* + a\, s_Y\, \frac{\partial a^*}{\partial x} = 0. \qquad (F.11)$$

A small amount of manipulation leads to the equation

$$\frac{\partial s_Y}{\partial x} = -\left(a^{-1}\frac{\partial a}{\partial x}\, s_Y + s_Y\, \frac{\partial a^*}{\partial x}\,(a^*)^{-1}\right), \qquad (F.12)$$

which can be used to determine the components of $\nabla s_Y^{ij}$ and hence $\sigma^2_{s_Y^{ij}}$ through Eq. (F.10). The final value for the confidence limits is obtained from Eq. (F.9). We should note that the confidence limits for the coherence are calculated slightly differently. Confidence limits are calculated for $\tanh^{-1}\sqrt{|\mathrm{coh}(\omega)|}$ and then transformed back to the domain of $\mathrm{coh}(\omega)$ afterwards. This prevents the confidence limits for the coherence from rising above one. The same variance-stabilising transform is used when calculating confidence limits for coherence in the FFT approach, because coherence is not normally distributed.
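The variance-stabilising step for coherence can be made explicit in code; the sketch below (our illustration) maps a coherence estimate and a standard error on the transformed scale through $\tanh^{-1}$ and back, which keeps the limits within [0, 1].

```python
import numpy as np
from scipy.stats import t as t_dist

def coherence_limits(coh, s_z, dof, level=0.95):
    """Confidence limits for coherence via z = arctanh(sqrt(coh)).

    s_z: standard error on the transformed scale (e.g. from Eq. (F.10));
    dof: degrees of freedom T - k. Transforming back keeps the limits
    inside [0, 1].
    """
    z = np.arctanh(np.sqrt(coh))
    tq = t_dist.ppf(0.5 * (1 + level), dof)
    lo, hi = z - tq * s_z, z + tq * s_z
    return np.tanh(max(lo, 0.0)) ** 2, np.tanh(hi) ** 2
```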

References

Allen PJ, Fish DR, Smith SJ. Very high-frequency rhythmic activity during SEEG suppression in frontal lobe epilepsy. Electroencephalogr Clin Neurophysiol 1992;82:155-9.
Andres FG, Mima T, Schulman AE, Dichgans J, Hallett M, Gerloff C. Functional coupling of human cortical sensorimotor areas during bimanual skill acquisition. Brain 1999;122:855-70.
Attias H. A variational Bayesian framework for graphical models. In: Leen T, et al., editors. NIPS 12. Cambridge, MA, 2000.
Box GEP, Tiao GC. Bayesian Inference in Statistical Analysis. Wiley, 1992.
Bressler L, Coppola R, Nakamura R. Episodic multiregional cortical coherence at multiple frequencies during a visual task performance. Nature 1993;366:153-6.
Brown P, Salenius S, Rothwell JC, Hari R. Cortical correlate of the Piper rhythm in humans. J Neurophysiol 1998;80:2911-7.
Brown P, Farmer SF, Halliday DM, Marsden J, Rosenberg JR. Coherent cortical and muscle discharge in cortical myoclonus. Brain 1999;122:461-72.
Brown P, Oliviero A, Mazzone P, Insola A, Tonali P, Di Lazzaro V. Dopamine dependency of oscillations between subthalamic nucleus and pallidum in Parkinson's disease. J Neurosci 2001;21:1033-8.
Cassidy MJ, Brown P. Task-related EEG-EEG coherence depends on dopaminergic activity in Parkinson's disease. Neuroreport 2001;12:703-8.
Conway BA, et al. Synchronization between motor cortex and spinal motoneuron pool during the performance of a maintained motor task in man. J Physiol 1995;489:917-24.
Florian G, Pfurtscheller G. Dynamic spectral analysis of event-related EEG data. Electroenceph Clin Neurophys 1995;95:393-6.
Gerloff C, Richard J, Hadley J, Schulman A, Honda M, Hallett M. Functional coupling and regional activation of human cortical motor areas during simple, internally paced and externally paced finger movements. Brain 1998;121:1513-31.
Ghahramani Z. An introduction to hidden Markov models and Bayesian networks. Int J Pattern Recognit Artif Intell 2001;15:9-42.
Ghahramani Z, Beal MJ. Propagation algorithms for variational Bayesian learning. In: Leen T, et al., editors. NIPS 13. Cambridge, MA, 2001.
Hallett M. Transcranial magnetic stimulation and the human brain. Nature 2000;406:147-50.
Halliday DM, Rosenberg JR, Amjad AM, Breeze P, Conway BA, Farmer SF. A framework for the analysis of mixed time series/point process data - theory and application to the study of physiological tremor, single motor unit discharges and electromyograms. Prog Biophys Mol Biol 1995;64:237-78.
Halliday DM, Conway BA, Farmer SF, Rosenberg JR. Using electroencephalography to study functional coupling between cortical activity and electromyograms during voluntary contractions in humans. Neurosci Lett 1998;241:5-8.
Jordan MI, Ghahramani Z, Jaakola TS, Saul LK. An introduction to variational methods in graphical models. Machine Learn 1999;37:183-233.
Kaminski M, Blinowska KJ. A new method of the description of the information flow in the brain structures. Biol Cybern 1991;65:203-10.
Kilner JM, Baker SN, Salenius S, Hari R, Lemon RN. Human cortical muscle coherence is directly related to specific motor parameters. J Neurosci 2000;20:8838.
Mackay DJC. Ensemble learning for hidden Markov models. Technical report, University of Cambridge, 1997.
Manganotti P, Gerloff C, Toro C, Katsuta H, Sadato N, Zhuang P, Leocani L, Hallett M. Task-related coherence and task-related spectral power changes during sequential finger movements. Electroenceph Clin Neurophysiol 1998;109:50-62.
Milner AD, Goodale MA. The Visual Brain in Action. Oxford University Press, 1995.
Mima T, Steger J, Schulman AE, Gerloff C, Hallett M. Electroencephalographic measurement of motor cortex control of muscle activity in humans. Clin Neurophysiol 2000;111:326-37.
Minka TP. Automatic choice of dimensionality for PCA. http://vismod.www.media.mit.edu/tpminka, 2000a.
Minka TP. Old and New Matrix Algebra Useful for Statistics. http://vismod.www.media.mit.edu/tpminka, 2000b.
Muirhead RJ. Aspects of Multivariate Statistical Theory. Wiley, 1982.
Neal R. Probabilistic inference using Markov chain Monte Carlo methods. University of Toronto Technical Report CRG-TR-93-1, Department of Computer Science, 1993.
Neumaier A, Schneider T. Estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Trans Math Soft 2001;27:27-57.
Nunez P. Neocortical Dynamics and Human EEG Rhythms. Oxford: Oxford University Press, 1995.
Pardey J, Roberts SJ, Tarassenko L. A review of parametric modelling techniques for EEG analysis. Med Eng Phys 1996;18:2-11.
Penny WD, Roberts SJ. Bayesian multivariate autoregressive models with structured priors. Accepted for publication in IEE Proceedings: Vision, Image and Signal Processing (2001).
Peters BO, Pfurtscheller G, Flyvbjerg H. Automatic differentiation of multichannel EEG signals. IEEE Trans Biomed Eng 2001;48:111-6.
Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes in C. Cambridge, 1992.
Priestley MB. Spectral Analysis and Time Series, vol. 2. Academic Press, 1981.
Rabiner LR, Juang BH. An introduction to hidden Markov models. IEEE Acoustics Speech Signal Processing Magazine 1986;3:4-16.
Rodriguez EN, George N, Lachaux JP, Martinerie J, Renault B, Varela FJ. Perception's shadow: long-distance synchronization of human brain activity. Nature 1999;397:430-3.
Rosenberg JR, Amjad AM, Breeze P, Brillinger DR, Halliday DM. The Fourier approach to the identification of functional coupling between neuronal spike trains. Prog Biophys Mol Biol 1989;53:1-31.
Roweis S, Ghahramani Z. A unifying review of linear Gaussian models. Neural Comput 1999;11:305-45.
Salenius S, Portin K, Kajola M, Salmelin R, Hari R. Cortical control of human motoneuron firing during isometric contraction. J Neurophys 1997;77:3401-5.
Singer W. Neuronal representations, assemblies and temporal coherence. Prog Brain Res 1993;95:461-74.
Singer W. Neuronal synchrony: a versatile code for the definition of relations. Neuron 1999;24:49-65, 111-25.
Spyers-Ashby JM, Bain PG, Roberts SJ. A comparison of fast Fourier transform (FFT) and autoregressive (AR) spectral estimation techniques for the analysis of tremor data. J Neurosci Methods 1998;83:35-43.
Weisberg S. Applied Linear Regression. Wiley, 1980.