Classifying functional time series


Signal Processing 87 (2007) 79–100 www.elsevier.com/locate/sigpro

R.H. Glendinning, S.L. Fleet

QinetiQ Ltd, Great Malvern, Worcestershire WR14 3PS, UK

Received 9 September 2005; received in revised form 17 March 2006; accepted 13 April 2006. Available online 5 June 2006.

(This work was carried out as part of the program of the Data and Information Fusion Defence Technology Centre. © QinetiQ Ltd 2005. Corresponding author: tel. +44 1684 894850, fax +44 1684 894384, e-mail [email protected] (R.H. Glendinning).)

Abstract

We consider the problem of classifying a high-dimensional time series into a number of disjoint classes defined by training data. Techniques of this type are an important component of a number of emerging technologies. These include the use of dense sensor arrays for condition monitoring, brain–computer interfaces for communications and control, the detection of moving pedestrians from sequences of images and the study of cognitive function using high-resolution electroencephalography (EEG). We propose a novel approach to problems of this type using the parameters of an underlying functional auto-regression model. We compare the performance of this approach using two contrasting data sets: the first is based on simulated series with different characteristics and sampling schemes, and the second on high-dimensional time series generated by multi-channel EEG. Both experiments show that our approach outperforms conventional time series methods by exploiting low intrinsic dimensionality (smoothness). In addition, our simulation experiments show that good performance can be maintained for data generated by non-stationary sampling schemes, the latter causing large reductions in the performance of conventional procedures. These experiments suggest that meaningful information can be extracted from high-resolution EEG.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Functional auto-regression; Multi-dimensional signals; Regularization; Biomedical pattern recognition

1. Introduction

We consider the problem of assigning a time series $X_T = (X_t, t = 1,\dots,T)$ into one of $K$ disjoint classes. While problems of this type are relatively well understood for low-dimensional vectors $X_t$, advances in sensor technology are increasingly generating high-dimensional data. Typical examples are magnetic resonance imaging [1], which generates sequences of two-dimensional images for in vitro examination of tissue, high-resolution electroencephalography (EEG) for real-time measurements of brain function [2] and the study of climate change using spatially dispersed sensors [3].

The development of effective classifiers for high-dimensional time series must address the particular characteristics of data of this type. To identify the key issues, we consider a generic classification problem using EEG data. Here, our aim is to assign a subject into one of two classes ($k = 1, 2$) using a time series $X_T$, with $X_t$ taking values in (at least) a 61-dimensional Euclidean space (see [4] for a specific example). The conventional approach [5]

0165-1684/© 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2006.04.006


suggests a quadratic discriminant $d_{quad}(X_T)$, where $X_T$ is assigned to class 1 if

$d_{quad}(X_T) = [X_T]'[C^{(1)}]^{-1}[X_T] - [X_T]'[C^{(2)}]^{-1}[X_T] < 0$,  (1)

when $P(X_T \mid k)$, the conditional distribution of $X_T$ in class $k$, follows a multivariate normal distribution with zero mean, non-degenerate covariance matrix $C^{(k)}$ and equal prior class membership probabilities. The properties of $d_{quad}$ are described in [5], with [6–8] providing a Bayesian perspective. However, satisfactory performance may be difficult to achieve. The first obstacle lies in the dimension of $C^{(k)}$, which is $61 \times 61$ in the EEG example. Procedures which avoid the inversion of $C^{(k)}$ by estimating $[C^{(k)}]^{-1}$ in the frequency domain are described in [5] and provide a computationally feasible means of dealing with high-dimensional matrices. However, their validity rests on the assumption that the dimensionality of $X_t$ is small relative to $T$, which may not hold in our context. Related techniques using a robust estimate of the distance between the class densities $P(X_T \mid k)$ have the same limitation (see [9] for an overview of this approach). Procedures which use the form of $[C^{(k)}]^{-1}$ given by a specific model of the temporal dynamics of $X_t$ (such as a vector-valued auto-regression) are an effective means of mitigating the computational costs associated with the inversion of $C^{(k)}$, but may not give satisfactory performance [10] when $C^{(k)}$ is ill-conditioned. The latter is often associated with spatially smooth signals sampled at high frequencies [11].

A number of alternative procedures have been applied to similar problems. These include the direct estimation of the discriminant function by a neural network [1,12–14], which uses a time delay structure to determine the relationship between class membership and $X_T$; these are contrasted with a principal components analysis of $X_T$ in the latter study. Good performance is achieved in detecting the presence of moving pedestrians from sequences of images in [12,14] using these procedures. However, they do not exploit the functional characteristics of the data and are currently limited to stationary sampling schemes. Non-stationary sampling schemes present particular problems for conventional procedures [15,16]. Here the number and location of the observations $s_{ti}$ may vary with time $t$. An example of a non-stationary sampling scheme is given in Fig. 1.
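As a concrete illustration, the following is a minimal sketch of the discriminant (1), assuming the observed series has been stacked into a single vector and the class covariance matrices have been estimated from training data; all names and the toy covariances are illustrative rather than the authors' implementation.

```python
import numpy as np

def d_quad(x, C1, C2):
    """Evaluate the quadratic discriminant of Eq. (1); assign the stacked
    series x to class 1 when the returned value is negative.

    Solves linear systems instead of explicitly inverting the (potentially
    large and ill-conditioned) covariance matrices C1 and C2.
    """
    return x @ np.linalg.solve(C1, x) - x @ np.linalg.solve(C2, x)

# toy usage with simulated class covariances (illustrative only)
rng = np.random.default_rng(0)
d = 50
B1, B2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
C1, C2 = B1 @ B1.T + d * np.eye(d), B2 @ B2.T + d * np.eye(d)
x = rng.standard_normal(d)
label = 1 if d_quad(x, C1, C2) < 0 else 2
```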

[Fig. 1. A non-stationary sampling scheme in $S = [0,1] \times [0,1]$: the observations in $S$ change location between times $t$, $t+1$ and $t+2$.]

Mechanisms generating non-stationary sampling schemes are sensor failure, varying maintenance procedures, data drop-outs and adaptive sampling rates. These are common characteristics of distributed sensor systems [17], environmental modelling [18] and astronomical time series, which suffer from gaps due to telescope scheduling [19]. The latter can be regarded as a high-dimensional time series by stacking observations over successive time periods. We assume that the vector of observations $(X_t(s_{ti}), i = 1,\dots,m_t)$ is generated by an underlying functional auto-regression [20]. This describes $(X_t(s), t = 1,\dots,T)$ as a sequence of elements in a function space $F$. This formalism mitigates the effects of non-stationary sampling schemes (with varying values of $s_{ti}$ and $m_t$) and provides a principled means of reducing the effects of ill-conditioning by exploiting low complexity (essentially smoothness of $X_t(s)$ or of the parameters of the functional auto-regression generating the data). Models of this type [20,21] are used in a wide range of applications and include reduced state space models [22–24] for spatio-temporal modelling, although other procedures can be used to exploit spatial smoothness [18].

We develop a feature based classifier for functional time series. This approach provides a principled means of producing low-dimensional classifiers for independently distributed functions [25]. Our approach appears to be new and unifies classifier construction and the choice of the value of regularization parameters. These characteristics


differentiate our approach from techniques developed for dynamic shape analysis [26] and target classification from sequences of radar range profiles [27]. The direct use of a regularized estimate of the covariance kernel associated with $(X_t(s), t = 1,\dots,T)$ can also be used to deal with high-dimensional time series, although this lies outside the scope of this study. We demonstrate the performance of our approach using simulated high-dimensional time series and real data generated by EEG [28]. Consistent classification of data of this type is challenging, as there are large numbers of non-uniformly distributed sensors (electrodes). Other potential applications include fault detection [29] and environmental control using spatially dense sensor arrays [30], medical diagnosis from heart motion [31] and video retrieval [32] from sequences of damaged images.


2. A functional auto-regression

We address the problem of classifying high-dimensional time series generated by $k = 1,\dots,K$ classes using the characteristics of the model describing their temporal evolution. While there is a wide range of potential models, we focus on the functional analogue of an auto-regressive process, as these have proved to be a computationally tractable and widely applicable family of models [20]. Time series from the kth class are generated by a functional auto-regression FAR(p) of arbitrary order p. This is given by

$X_t(s) - \mu^{(k)}(s) = \sum_{l=1}^{p} \rho_l^{(k)}(X_{t-l}(s) - \mu^{(k)}(s)) + \epsilon_t^{(k)}(s), \quad s \in S$,  (2)

where $\mu^{(k)}(s)$ is the expected value of $X_t(s)$, which takes values in the space $F$. The innovations $(\epsilon_t^{(k)}(s), -\infty < t < \infty)$ are a sequence of independent and identically distributed $F$-valued random variables with zero mean and covariance operator $\Sigma_\epsilon^{(k)}$. We assume that $F$ is a sub-space of the usual $L^2$ space on $S$, as this provides a convenient context for our calculations [20] using the usual inner product associated with $L^2$,

$\langle x, y \rangle_{L^2} = \int_S x(s)\, y(s)\, ds, \quad x, y \in L^2$,  (3)

which exists and is bounded for all $x, y \in L^2$. The choice of $F$ is problem specific [33] and often reflects assumptions about the smoothness of $X_t(s)$. The coefficients $(\rho_l^{(k)}, l = 1,\dots,p)$ are bounded linear operators with respect to the norm $\|x\|_F^2$, $x \in F$.

3. A feature based classifier for functional time series

We address the problem of classifying high-dimensional time series using a feature based approach. This is an effective means of generating classifiers for multivariate time series [34], spatio-temporal processes [32] and independently distributed functional data [25,35]. The aim is to replace $X_T$ by a vector of features with minimal or no loss of information. Suitable features are often generated by the sufficient statistics associated with the joint distribution of $X_T$, and generate classifiers which are closely related to their pseudo-Bayesian analogues [36]. Here, we make the following assumptions:

A1. The vector of parameters $(\rho_1^{(k)},\dots,\rho_p^{(k)}; \mu^{(k)}, \Sigma_\epsilon^{(k)})$ uniquely determines the data generating mechanism for the $k = 1,\dots,K$ classes.

A2. The functional auto-regression (2) can be written in the form (neglecting dependence on k)

$X_t(s) - \mu(s) = \sum_{l=1}^{p} \int_S \rho_l(s, v)(X_{t-l}(v) - \mu(v))\, dv + \epsilon_t(s), \quad s, v \in S$,  (4)

with sufficient conditions in [20].

First, we consider scenarios where the auto-regressive coefficients uniquely discriminate between classes. Here, our focus is on the p-dimensional functional feature

$f(u, v) = (\hat\rho_1(u, v),\dots,\hat\rho_p(u, v)), \quad u, v \in S$,  (5)

constructed from the estimates of the parameters $(\hat\rho_l(u, v), l = 1,\dots,p)$ of a common auto-regressive model (4) for all time series. These are described by their kernels over $S \times S$ under assumption A2. It is easy to see that the effects of non-stationary sampling schemes and time series of different lengths are mitigated by the use of (5).

We construct a lower-dimensional vector of scalar-valued features using $f(u, v)$. These are given by the coefficients of an expansion of $f(u, v)$ in terms of the eigen-functions $(G_l(u, v), l = 1,\dots,L_{max})$ generated by functional principal components. An algorithm generating the latter is presented in the Appendix. A key step is a multivariate principal components analysis of the vectors


$f_p = ([\operatorname{vec} \hat A_1]' : \dots : [\operatorname{vec} \hat A_p]')$, where $(\hat A_l, l = 1,\dots,p)$ are parameter estimates associated with the matrical form of (4); see Section 6. This gives the approximation

$f_L(u, v) = \sum_{l=1}^{L} \langle f(u, v), G_l(u, v)\rangle_{L^2}\, G_l(u, v), \quad u, v \in S$,  (6)

for $L = 1,\dots,L_{max}$, although different subsets of scores may be more appropriate in certain scenarios. We use a modification of Oja's algorithm [37] to extract the first few principal components associated with $f_p$, as $L_{max}$ is generally relatively small. This approach generates low-dimensional features using two contrasting steps. The first is the generation of a satisfactory approximation to $f(u, v)$ with small $L$; the second is the ability of functional principal components to identify differences between classes in a parsimonious manner. Both effects are related to the assumption that $X_t(s)$ and $(\rho_l(u, v), l = 1,\dots,p)$ are smooth functions over $S \times S$; see Fig. 2, where the axes generated by functional principal components are denoted by the dashed lines.

[Fig. 2. A feature based classifier for a two-class problem: classes one and two in a lower dimensional space, with the optimal axis for classification shown dashed.]

Let $f_{(i)L}(u, v)$ be the vector of functional features associated with the ith functional time series in the kth class. The corresponding scores are

$g_{ki} = (\langle f_{(i)L}(u, v), G_l(u, v)\rangle_{L^2}, l = 1,\dots,L)$,  (7)

which are used by an appropriate classifier [25] to give

$\hat k = \operatorname{argmin}_k (q_k, k = 1,\dots,K)$  (8)

with a uniform prior for class membership. We use the classifier

$q_k = [g - \bar g^{(k)}]'[\hat C^{(k)}]^{-1}[g - \bar g^{(k)}] + \log\det[\hat C^{(k)}]$  (9)

in our experiments, with sample means and covariance matrices

$\bar g^{(k)} = \frac{1}{b_k}\sum_{i=1}^{b_k} g_{ki}, \quad \hat C^{(k)} = \frac{1}{b_k}\sum_{i=1}^{b_k} [g_{ki} - \bar g^{(k)}][g_{ki} - \bar g^{(k)}]'$,  (10)

where $b_k$ is the number of time series in the kth class in the training data. The use of (9) rests on the assumption that $g_{ki}$ follows a multivariate normal distribution, although Gaussian classifiers are robust to departures from this assumption. We select the value of $L$ and the complexity constraints used to identify the underlying FAR(p) to maximize an estimate of classification performance using cross-validation (see Section 7). The feature vector $f(u, v)$ can be based on the mean $\hat\mu^{(k)}(u)$ alone, or include all the parameters defining the common functional auto-regression. It may be beneficial to fit a FAR(p) in the former case, to mitigate the effect of temporal dependence on cross-validation estimates of classification performance [38].
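A minimal sketch of the rule (7)-(10), assuming each training series has already been reduced to an L-dimensional score vector by the projection (7); the array layout and names are illustrative.

```python
import numpy as np

def fit_class_params(scores_by_class):
    """Sample means and covariance matrices of Eq. (10); scores_by_class[k]
    is a (b_k, L) array holding the score vectors g_ki of class k."""
    params = []
    for G in scores_by_class:
        g_bar = G.mean(axis=0)
        R = G - g_bar
        params.append((g_bar, R.T @ R / G.shape[0]))
    return params

def classify(g, params):
    """Eqs. (8)-(9): minimize the Gaussian discriminant q_k over the
    K classes, with a uniform prior for class membership."""
    q = [(g - g_bar) @ np.linalg.solve(C, g - g_bar)
         + np.linalg.slogdet(C)[1] for g_bar, C in params]
    return int(np.argmin(q))
```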

4. A finite approximation for $X_t(s)$

We approximate the function $X_t(s)$ by its projection onto a sub-space $D_n$ of $L^2$, spanned by $n$ pre-specified basis functions $(v_j(s), j \in \tilde J_n)$. We restrict attention to tensor products for their computational simplicity:

$\tilde X_t(s) = \sum_{j \in \tilde J_n} \tilde d_{tj} v_j(s), \quad v_j(s) = B_{j_1}(s_1) B_{j_2}(s_2), \quad s = (s_1, s_2)$,  (11)

for the spatial case with $S = [0,1]^2$ and $j = (j_1, j_2) \in \tilde J_n$. We select basis functions $\{B_{j_r}(s)\}$ to ensure that the temporal dynamics of $X_t(s)$ are described in a parsimonious manner. Tensor products can be generated by univariate wavelet bases [39,40] and splines [33,41] for processes with spatially localized temporal dynamics, and by classical orthogonal polynomials or Fourier bases to describe global smoothness.


We identify (11) by regularized least squares

$\tilde d_t = \operatorname{argmin}_{d_t} \sum_{i=1}^{m_t} (X_t(s_{ti}) - \tilde X_t(s_{ti}))^2 + C(\lambda, \tilde X_t(s))$,  (12)

where $\tilde d_t = (\tilde d_{tj}, j \in \tilde J_n)'$ and $C(\lambda, \tilde X_t(s))$ is a penalty controlled by the scalar parameter $\lambda$. This can be used to constrain the smoothness of $\tilde X_t(s)$ and to mitigate the effects of instabilities generated by non-uniform sampling or measurement noise. We restrict attention to quadratic complexity penalties and

$\tilde d_t = \operatorname{argmin}_{d_t} \sum_{i=1}^{m_t} (X_t(s_{ti}) - d_t' V(s_{ti}))^2 + \lambda\, d_t' D\, d_t$  (13)

for the ordering of the set $\tilde J_n$ given by $h(l): \tilde J_n \to (l = 1,\dots,n)$, $V(s) = (v_{h(l)}(s), l = 1,\dots,n)'$ and $d_t = (d_{th(l)}, l = 1,\dots,n)'$. Then

$\tilde d_t = (V'V + \lambda D)^{-1} V'Y$  (14)

with $V = (V(s_{ti}), i = 1,\dots,m_t)$ and $Y = (X_t(s_{ti}), i = 1,\dots,m_t)'$. Here, we focus on the ridge penalty given by $D = I$, the $n \times n$ identity matrix. This shrinks $\tilde d_t$ towards zero [42]. An appropriate choice of $n$ ensures that $\|X_t(s) - \tilde X_t(s)\|^2_{L^2}$ is small, although the norm $\|x\|^2_F$ can be used [33] to exploit the smoothness of $X_t(s)$ and facilitate asymptotic analysis [20]. We use the smallest value of $n$ which ensures that $\tilde X_t(s)$ replicates $X_t(s)$ at the sample points for noise-free data, with smaller values of $n$ in other scenarios [22].
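The estimate (14) is a standard penalized least squares solve. A sketch under the assumption that the basis functions have been evaluated at the current design points (names are illustrative):

```python
import numpy as np

def ridge_basis_fit(y, V, lam):
    """Eq. (14): d~_t = (V'V + lam*D)^{-1} V'Y with the ridge penalty D = I.

    y : (m_t,) observations X_t(s_ti) at the current design points;
    V : (m_t, n) basis functions evaluated at the same points. The design
    may change with t under a non-stationary sampling scheme.
    """
    n = V.shape[1]
    return np.linalg.solve(V.T @ V + lam * np.eye(n), V.T @ y)
```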

5. Constructing a common ortho-normal basis

We propose an algorithm to empirically generate a common ortho-normal basis $(\phi_l(s), l = 1,\dots,n)$. These are the eigen-functions generated by a functional principal components decomposition of the empirical covariance operator constructed from the pooled (over time, series and class) training data. This is a popular pre-processing step in the analysis of functional time series [3,43], although our approach may be advantageous for spatio-temporal data [24]. The ith function in the pooled training data is given by

$\tilde X_{(i)}(s) = \sum_{j \in \tilde J_n} \tilde d_{(i)j} v_j(s), \quad s \in S$,  (15)

with the pooled sample mean

$\bar X(s) = \sum_{j \in \tilde J_n} \bar d_j v_j(s), \quad \bar d_j = \frac{1}{N}\sum_{i=1}^{N} \tilde d_{(i)j}, \quad N = \sum_{k=1}^{K}\sum_{l=1}^{b_k} T_{kl}$,  (16)

where $b_k$ is the number of time series in the kth class, of differing lengths $T_{kl}$. Smooth estimates of $\mu^{(k)}$ can be constructed for each class and combined (see (28)). Then

$\tilde Y_{(i)}(s) = \sum_{j \in \tilde J_n} \bar d_{(i)j} v_j(s), \quad \bar d_{(i)j} = \tilde d_{(i)j} - \bar d_j, \quad s \in S$.  (17)

The first functional principal component $\phi_1(s)$ is given by

$\phi(s) = w' V(s), \quad \tilde Y_{(i)}(s) = \hat d_{(i)}' V(s)$,  (18)

where the elements of $\tilde J_n$ are ordered by the index $h: (i = 1,\dots,n) \to \tilde J_n$ and

$\hat d_{(i)} = (\hat d_{(i)h(1)},\dots,\hat d_{(i)h(n)})', \quad V(s) = (v_{h(1)}(s),\dots,v_{h(n)}(s))'$.  (19)

Then $\phi_1(s)$ is given by the solution of

$C_{\tilde Y}\, w_1 = \lambda_1 B\, w_1$,  (20)

where

$B = \left(\int_S v_{h(k)}(s)\, v_{h(l)}(s)\, ds;\ k, l = 1,\dots,n\right)$.  (21)
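Equation (20) is a symmetric generalized eigenproblem in the coefficient representation. A sketch using scipy.linalg.eigh, assuming the centred coefficient vectors of the pooled training functions are stacked row-wise in a matrix D (an assumed layout):

```python
import numpy as np
from scipy.linalg import eigh

def common_basis(D, B, n_keep):
    """Solve C_Y w = lambda * B w of Eq. (20).

    D : (N, n) centred coefficients of the pooled training functions;
    B : (n, n) Gram matrix of the initial basis, Eq. (21).
    Returns the leading eigenvalues and weight vectors w_l; the common
    basis is phi_l(s) = w_l' V(s), B-orthonormal by construction of eigh.
    """
    C = D.T @ D / D.shape[0]   # empirical covariance operator (matrix form)
    evals, W = eigh(C, B)      # ascending generalized eigenvalues
    order = np.argsort(evals)[::-1][:n_keep]
    return evals[order], W[:, order]
```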

Further details of our approach to principal components for functions taking values in d-dimensional Euclidean space are described in the Appendix. Each functional time series $(\tilde Y_t(s), t = 1,\dots,T)$ is given by the approximation

$\tilde Y_t(s) = \sum_{j \in J_n} \tilde c_{tj} \phi_j(s), \quad s \in S$,  (22)

where $J_n = \{1,\dots,n\}$ and $\phi_l(s)$ is the eigen-function associated with the lth ordered eigen-value $\lambda_{(l)}$, where $\lambda_{(1)} \ge \dots \ge \lambda_{(n)}$ and $\tilde c_{tl} = \langle \tilde Y_t(s), \phi_l(s)\rangle_{L^2}$.

6. Estimating the parameters of a FAR(p)

Parameter estimation for functional auto-regressions has been intensively studied over the last decade, with most work focusing on the use of the functional Yule–Walker equations [20,40,43,44]. We adopt a least squares approach [45], as it provides a simple means of incorporating complexity penalties and


provides a framework for recursive-in-time calculations [39]. Our approach differs from [45], as our calculations are carried out using the approximation

$\tilde Y_t(s) = \sum_{j \in J_n} \tilde c_{tj} \phi_j(s), \quad s \in S$,  (23)

of the centred data (using the pooled mean). Here, $J_n$ is given by $\tilde J_n$ when the initial basis functions are used to describe $\tilde Y_t(s)$, and $J_n = \{1,\dots,n\}$ for empirically generated orthogonal basis functions. The basis function approach (23) is well suited to scenarios with irregular sampling schemes and can be used to identify functional regressions [46] and auto-regressions using reproducing kernels [41] or sieves [47–49]. Our approach appears to be new and allows for different degrees of complexity for each coefficient (and a non-zero mean function). The functional nature of the data and its emphasis on regularization differentiate it from related techniques in spatio-temporal modelling [22–24].

6.1. Estimating the mean function

As the mean function of the underlying FAR(p) may be much smoother than $\tilde Y_t(s)$ itself, we introduce the regularized estimate

$\tilde\mu_{\lambda_0 R_0}(s) = \operatorname{argmin}_{\gamma_{R_0}(s)} \sum_{t=1}^{T} \|P_{J_n \setminus R_0} \tilde Y_t(s) - \gamma_{R_0}(s)\|^2_{L^2} + C(\lambda_0, \gamma_{R_0}(s))$,  (24)

where $P_{J_n \setminus R_0} \tilde Y_t(s)$ is the projection of $\tilde Y_t(s)$ onto the space generated by the basis functions with indices in $J_n$ with $R_0$ removed. Here $C(\lambda_0, \gamma_{R_0}(s))$ is a soft complexity penalty controlled by the parameter $\lambda_0$ and

$\gamma_{R_0}(s) = \sum_{j \in J_n \setminus R_0} \gamma_j \phi_j(s), \quad P_{J_n \setminus R_0} \tilde Y_t(s) = \sum_{j \in J_n \setminus R_0} \tilde c_{tj} \phi_j(s)$,  (25)

where hard constraints are imposed using the basis functions in $R_0$, which are removed from $J_n$. Using the ordering of the set $J_n \setminus R_0$ by the index $h_0: (i = 1,\dots,n') \to J_n \setminus R_0$, we write (24) in the form [33]

$\tilde\gamma_{\lambda_0 R_0} = \operatorname{argmin}_{\gamma_{R_0}} \sum_{t=1}^{T} \|\tilde c_{R_0 t} - \gamma_{R_0}\|^2_J + \lambda_0 \gamma_{R_0}' F_0 \gamma_{R_0}$,  (26)

where $\|x\|^2_J = x'Jx$ is the Euclidean norm associated with $J$, an $n' \times n'$ dimensional matrix with entries $\langle \phi_{h(r)}(s), \phi_{h(k)}(s)\rangle_{L^2}$, and

$P_{J_n \setminus R_0} \tilde Y_t(s) = \tilde c_{R_0 t}' \phi(s) \quad\text{and}\quad \gamma_{R_0}(s) = \gamma_{R_0}' \phi(s)$  (27)

with $\phi(s) = (\phi_{h_0(i)}(s), i = 1,\dots,n')'$, $\tilde c_{R_0 t} = (\tilde c_{t h_0(i)}, i = 1,\dots,n')'$ and $\gamma_{R_0} = (\gamma_{R_0 h_0(i)}, i = 1,\dots,n')'$, where the cardinality of $J_n \setminus R_0$ is $n'$. We restrict attention to complexity penalties described by a quadratic form with a non-negative definite matrix $F_0$. This gives substantial computational benefits and includes popular smoothness and ridge penalties [46]. Then

$\tilde\mu_{\lambda_0 R_0}(s) = \tilde\gamma_{\lambda_0 R_0}' \phi(s), \quad \tilde\gamma_{\lambda_0 R_0} = (J + \lambda_0 F_0)^{+} J \bar c_{R_0}$,  (28)

where $A^{+}$ is the Moore–Penrose inverse of the matrix $A$ and

$\bar c_{R_0} = (\bar c_{h_0(i)}, i = 1,\dots,n')', \quad \bar c_{h_0(i)} = \frac{1}{T}\sum_{t=1}^{T} \tilde c_{t h_0(i)}$.  (29)

Estimates of the same form are developed for functions on $[0,1]$ in [50] using a B-spline basis, although the sample paths and mean function there have the same level of smoothness. Taking $F_0 = \operatorname{diag}(f_{h_0(i)}, i = 1,\dots,n')$ and $J = I$ gives

$\tilde\gamma_{\lambda_0 R_0 h_0(i)} = (I + \lambda_0 f_{h_0(i)})^{-1} \bar c_{h_0(i)}, \quad i = 1,\dots,n'$.  (30)

This includes the usual projection estimate in [51] for $\lambda_0 = 0$ and $R_0 = (n_b,\dots,n)$; $\lambda_0 = 0$ and $R_0 = \emptyset$ give the sample mean; and $(f_{h_0(i)} = 1, i = 1,\dots,n')$ gives the usual ridge regression estimate. Similar estimates are derived for Gaussian processes using a reproducing kernel basis [52,53]. Suitable complexity penalties are given later. Our algorithm provides an automatic means of choosing $\lambda_0$ and $R_0$ and the parameters of the underlying functional auto-regressions.
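A sketch of the coordinate-wise estimate (30) for a single series, assuming $J = I$ and a diagonal penalty as in the text; the argument names are illustrative:

```python
import numpy as np

def mean_estimate(C_tilde, lam0, f_diag, R0):
    """Eq. (30) with J = I: coordinate-wise shrinkage of the time-averaged
    coefficients, with the basis functions indexed by R0 removed.

    C_tilde : (T, n) coefficients c~_tj of one series; f_diag : (n,)
    diagonal of F_0; R0 : iterable of hard-constrained indices.
    """
    gamma = C_tilde.mean(axis=0) / (1.0 + lam0 * f_diag)   # Eqs. (29)-(30)
    gamma[np.asarray(list(R0), dtype=int)] = 0.0           # hard constraints
    return gamma
```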

6.2. Regularized estimates for FAR(p)

We estimate the auto-regressive parameters of the FAR(p) using the mean corrected data for each functional time series. This is given by

$\bar Y_t(s) = \sum_{j \in J_n} \bar c_{tj} \phi_j(s), \quad \bar c_{tj} = \begin{cases} \tilde c_{tj} - \tilde\gamma_{\lambda_0 R_0 j}, & j \in J_n \setminus R_0, \\ \tilde c_{tj}, & \text{otherwise}, \end{cases}$  (31)


where $\tilde\gamma_{\lambda_0 R_0}$ is given by (28). Then, we assume that

$\bar Y_t(s) = \sum_{l=1}^{p} \int_S \rho_l(s, v)\, \bar Y_{t-l}(v)\, dv + \epsilon_t(s), \quad s, v \in S$,  (32)

where $\bar Y_t(s)$ is smooth and the operators $(\rho_1,\dots,\rho_p)$ lie in a low-dimensional sub-space of $L^2$. Earlier approaches use an embedding generated by functional principal components ([40], or [54] for a functional regression). We extend this approach by allowing for a wider range of hard and soft constraints, which may differ between the operators $\rho_l$. Similar ideas [55,56] are used to identify functional auto-regressions with exogenous variables. We introduce the estimator

$\tilde\rho_{lR} = \operatorname{argmin}_{\rho_R}\left(\sum_{t=p+1}^{T} \|\bar Y_t(s) - \hat Y_{t|t-1}(s; \rho_R)\|^2_{L^2} + C(\lambda, \rho_R)\right)$,  (33)

where $R_l$ describes $q_l$ hard constraints on $\rho_l$,

$\hat Y_{t|t-1}(s; \rho_R) = \sum_{l=1}^{p} \rho_{R_l} \bar Y_{t-l}(s), \quad \rho_{R_l} \in (L^2)^{n-q_l}$,  (34)

and soft constraints

$C(\lambda, \rho_R) = \sum_{l=1}^{p} \lambda_l F_l(\rho_{R_l})$  (35)

with $\lambda = (\lambda_1,\dots,\lambda_p)$ and $R = (R_l, l = 1,\dots,p)$. First, we write (34) in matrical form using the ortho-normal basis functions $(\phi_j(s), j \in J_n)$. These are generated by functional principal components or chosen by a priori reasoning. Then

$\bar c_t = \sum_{l=1}^{p} A_{R_l} \bar c_{t-l} + \tilde\delta_t, \quad -\infty < t < \infty$,  (36)

where $\bar c_t = (\bar c_{th(i)}, i = 1,\dots,n)'$ for the ordering of the set $J_n$ given by the index $h: (l = 1,\dots,n) \to J_n$. The $(r, m)$th element of $A_{R_l}$ is given by $\{a_{R_l, h(r), h(m)}\}$, where

$\rho_{R_l}(u, v) = \sum_{j_1 \in J_n}\sum_{j_2 \in J_n} a_{R_l, j_1, j_2}\, \phi_{j_1}(u)\, \phi_{j_2}(v), \quad u, v \in S$,  (37)

and

$\hat c_{t|t-1} = \sum_{l=1}^{p} A_{R_l} \bar c_{t-l}, \quad \tilde\epsilon_t(s) = \sum_{j \in J_n} \tilde\delta_{tj} \phi_j(s)$,  (38)

is the matrical analogue of (34). Next, we derive the matrical version of (33) in Section 6.3. This is constructed in the usual way [46] using quadratic penalties. These soft constraints can be used to describe smoothness using higher derivatives (giving the familiar thin plate spline penalty or Laplacian). Difference based versions of the latter are used to generate regularized estimates for uniformly sampled functional time series in [45].

6.3. Estimation using separable constraints

We assume that the hard and soft constraints can be applied independently to the n models describing the temporal dynamics of the orthogonal basis functions describing $\bar Y_t(s)$ in (32). While this limits the structure of admissible constraints, it facilitates the study of high-dimensional problems using parallel processing architectures. Let

$\bar c_{tm} = \sum_{l=1}^{p} [a_{R_l}^{(m)}]' \bar c_{t-l} + \tilde\delta_{tm}, \quad m = 1,\dots,n$,  (39)

where $a_{R_l}^{(m)} = (A_{R_l}(r, m), r = 1,\dots,n)'$. We introduce hard constraints using the unconstrained analogue of (39). Let

$c^{(m)} = Z' a^{(m)} + \tilde\delta^{(m)}, \quad m = 1,\dots,n$,  (40)

where $\tilde\delta^{(m)} = (\tilde\delta_{p+1, h(m)},\dots,\tilde\delta_{T, h(m)})'$, $c^{(m)} = (\bar c_{p+1, h(m)},\dots,\bar c_{T, h(m)})'$ and $a^{(m)} = ([a_1^{(m)}]',\dots,[a_p^{(m)}]')'$. We write the hard constraint $R_l$ in the form

$a_l^{(m)} = \bar R_l^{(m)} a_{R_l}^{(m)}, \quad \operatorname{rank} \bar R_l^{(m)} = n - q_l$,  (41)

where the matrix $\bar R_l^{(m)}$ describes the constraints on the mth row of $A_l$ using $R_l$. Our focus is on constraints which set pairs of rows and columns of the matrical representation of $\rho_l$ to zero. This reflects prior knowledge about the smoothness of this operator. Fig. 3 describes the structure of these estimates for a FAR(2).

[Fig. 3. The structure of the parameter estimates for a FAR(2): $A_1$ and $A_2$ each have a non-zero leading block, with the remaining row and column pairs set to zero.]


In other scenarios, we can use $R_l$ to describe the location of isolated zero elements or subsets of columns. Substituting for $a^{(m)}$ in (40) gives

$c^{(m)} = Z' a_R^{(m)} + \tilde\delta^{(m)}, \quad a_R^{(m)} = ([a_{R_1}^{(m)}]',\dots,[a_{R_p}^{(m)}]')'$,  (42)

where the design matrix $Z = (Z_t, t = p+1,\dots,T)$, with $Z_t = (\bar c_{t-1}',\dots,\bar c_{t-p}')'$. Writing the soft constraints in matrical form gives

$C(\lambda, \rho_R) = \sum_{l=1}^{p} \lambda_l [a_l^{(m)}]'[D_{R_l}^m][a_l^{(m)}]$,  (43)

where $[D_{R_l}^m]$ is a non-negative definite matrix. Then

$\hat a_R^{(m)} = ([\bar R^{(m)}]' Z Z' [\bar R^{(m)}] + F^m)^{-1}[\bar R^{(m)}]' Z c^{(m)}$,  (44)

where

$F_l^m = \operatorname{diag}(\lambda_l [D_{R_l}^m], l = 1,\dots,p)$.  (45)

Our focus is on the use of ridge penalties of the form $D_{R_l}^m = I_{n-q_l}$, an $(n-q_l) \times (n-q_l)$ dimensional identity matrix, to mitigate the effects of numerical instabilities by shrinking $\hat A_{lR_l}$ towards zero [42]. Ridge penalties are used in various vector-valued input–output models in regression format [57,58], although the model has fixed dimension in these applications. Further reductions in computational costs can be achieved by searching over an equivalence class of ortho-normal bases. For a two-dimensional Fourier series, we define the lth equivalence class by

$(v_j(s), j = (j_1, j_2) \mid j_1 + j_2 = l), \quad j = (j_1, j_2) \in J_n$.  (46)

The corresponding equivalence class is equal to $\phi_l(s)$ for the ordered empirical basis generated by functional principal components. Our formalism can be used to define subset functional auto-regressions, where whole coefficient matrices are set to zero. These may be of value in modelling near-periodic functional time series. Setting individual elements of the coefficient matrices associated with vector-valued auto-regressions to zero is used to generate low-complexity models in [59]. Other penalties are introduced in [8] for vector-valued auto-regressions using the Bayesian paradigm.
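A sketch of the separable estimator, specialized to the row/column-pair constraints of Fig. 3 and a ridge soft penalty $F^m = \lambda I$; the unconstrained index set `keep` and the array layout are assumptions made for illustration:

```python
import numpy as np

def fit_separable_far(cbar, p, lam, keep):
    """Sketch of Eqs. (39)-(45): one ridge regression per coordinate m,
    with the row/column pairs outside `keep` set to zero (the hard
    constraints R_l of Fig. 3) and F^m = lam * I (Eq. (45)).

    cbar : (T, n) mean-corrected coefficients of Eq. (31).
    Returns A with shape (p, n, n), the matrical form of (rho_1,...,rho_p).
    """
    T, n = cbar.shape
    keep = np.asarray(keep)
    k = keep.size
    # lagged design: row t holds (cbar[t-1, keep], ..., cbar[t-p, keep])
    Z = np.hstack([cbar[p - l:T - l][:, keep] for l in range(1, p + 1)])
    G = Z.T @ Z + lam * np.eye(p * k)        # ridge-regularized Gram matrix
    A = np.zeros((p, n, n))
    for m in keep:                           # Eq. (44) for coordinate m
        a = np.linalg.solve(G, Z.T @ cbar[p:, m])
        for l in range(p):                   # scatter back; zeros elsewhere
            A[l, m, keep] = a[l * k:(l + 1) * k]
    return A
```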

The corresponding equivalence class is equal to fl ðsÞ for the ordered empirical basis generated by functional principal components. Our formalism can be used to define subset functional auto-regressions, where whole matrices are set to zero. These may be of value in modelling near periodic functional time series. Setting individual elements of the coefficient matrices associated with vector-valued auto-regressions to zero are used to generate low-complexity models in [59]. Other penalties are introduced in [8] for vector-valued auto-regressions using the Bayesian paradigm. 7. Designing the classifier A key component of our approach is to exploit prior knowledge about the coefficients of the

The model order Pfamily ¼ f1; . . . ; pmax g. Hard constraints Rfamily ¼ fRfamily ; . . . ; Rfamily pmax g, 1 family describes an ordered set of ql where Rl constraints on the lth coefficient of the FAR model. We restrict attention to Rfamily ¼ l fR0 ; . . . ; Rn g, R0 ¼ ; and Rv ¼ fn  v þ 1; . . . ; ng sets the corresponding row and column pairs in the matrical form of rl to zero (see Fig. 3). This imposes smoothness constraints on rl using Fourier or functional principal component basis functions. Soft constraints Lfamily ¼ fLfamily ; . . . ; Lfamily pmax g, 1 family where Ll ¼ ðev ; v ¼ 1; . . . ; rl Þ is an ordered set of rl soft constraints applied to the lth coefficient of a FAR model. We restrict attention to constraints that shrink the auto-regressive parameters towards zero.

We simultaneously select a model and classifier with the best classification performance in the training set. This is a successful means of constructing classifiers for independent functional data [35] and mitigates difficulties in fitting complex to highdimensional time series. We define the family of classifier structures by



The classifier structure Sfamily ¼ ðS family ;v ¼ v 1; . . . ; W Þ describes an ordered set of classifier structures. Here S family ¼ f1; . . . ; vg is the classiv fier structure generated by the first v scores ordered by the value of the associated eigenvector. There may be value in searching over all subsets of scores in some scenarios, as scores associated with low-order eigenvalues may contain useful information.

We describe the construction of the classifier for time series with zero mean, as the general case follows immediately by adding hard and soft constraints R0 and Rl0 . 7.1. The search strategy A wide range of search procedures can be used to determine an optimal model and classifier structure Pfamily  Rfamily  Lfamily  S family .

(47)


These include stochastic search strategies (genetic algorithms) or stepwise procedures. Our classifier is described by the recursion SEARCH(k), which operates on five nested loops. We exploit a pre-defined ordering of each family, although an ordering can be determined from estimated classification performance.

The recursion SEARCH(k):

Step 1: Choose the next family member to be searched in loop k.
Step 2: If (k < 5), begin SEARCH(k+1). Otherwise, choose the initial classifier structure and evaluate performance using cross-validation.
Step 3: If the stopping rule in loop k is satisfied, return control to SEARCH(k-1). Otherwise, update the optimal model or classifier structure and go to step 1.

The nested loops:

Loop 1 (the FAR order loop). Initialization: we take p = 1 as the current model order. Update: increase the order by one.
Loop 2 (the FAR coefficient loop). Initialization: the coefficient associated with the model order defined in loop 1. Update: decrease the index by one.
Loop 3 (the soft constraint loop). Initialization: we take $e_1$ to be the current constraint on the FAR coefficient in loop 2. Update: increase the index by one.
Loop 4 (the hard constraint loop). Initialization: no constraint applied. Update: let $R_l = \{n-v+1,\dots,n\}$ be the current hard constraints applied to the FAR coefficient in loop 2; we remove a further L row and column pairs at the next loop to give $R_l = \{n-v-L+1,\dots,n\}$.
Loop 5 (the classifier scores loop). Initialization: the initial classifier uses $S_1^{family}$. Update: increase the index by one.
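A schematic rendering of SEARCH(k) as five nested loops with the greedy stopping rule; the coupling between the loops (for example, the dependence of loop 2 on the order chosen in loop 1) is deliberately omitted, and cv_score stands in for the leave-v-out cross-validation estimate:

```python
def search(families, cv_score):
    """Schematic form of the recursion SEARCH(k) over five ordered option
    lists (order, coefficient, soft constraint, hard constraint, scores).

    cv_score : stand-in for the leave-v-out cross-validation estimate of
    the percentage of correctly classified series for a configuration.
    Returns the best (score, configuration) found by the stepwise search.
    """
    def loop(k, config):
        best = (-1.0, None)
        for option in families[k]:             # step 1: next family member
            cfg = config + (option,)
            cand = loop(k + 1, cfg) if k < 4 else (cv_score(cfg), cfg)
            if cand[0] < best[0]:              # step 3: stopping rule
                break                          # performance fell: stop loop
            best = cand
        return best
    return loop(0, ())
```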

7.2. The algorithm

The first step in our approach is to project the training data onto a set of deterministic basis functions (tensor products of Fourier bases are used in our experiments). After selecting an appropriate model and classifier structure, we remove v functional time series at random from the training set to give a test set. We generate an empirical basis using functional principal components (if required) and identify the underlying FAR model with the order and (soft/hard) constraints specified by our search procedure. The classifier is constructed using the selected structure and evaluated on the test set. This is replicated a number of times and the average performance is used to construct a stopping rule. This approach is known as leave-v-out cross-validation $CV_v$, with the percentage of correctly classified functional time series $X_T$ as the performance measure, although this can be replaced by other problem specific metrics. The stopping rule is satisfied when the current model and classifier structure produce a decrease in classification performance determined by $CV_v$ compared with the best achieved so far.

A useful modification to our algorithm limits cross-validation to the scores, with the eigenvectors calculated from the whole training set (rather than for each test and validation set). Empirical evidence [35] suggests that this modification gives similar performance to our algorithm in problems with relatively large training sets (at greatly reduced computational cost).

8. Extensions to more complex scenarios

The development of our classifier is based on the assumption that the underlying functional time series is stationary. However, there are a number of scenarios where functional data exhibit non-stationarity. We consider two important cases: the first is based on the analysis of periodic data and the second on temporal trends of known form.

First, we consider periodic functional data. A typical example is the periodic motion of the two- or three-dimensional boundary of the heart. This can be incorporated into our approach using a seasonal functional auto-regression. Here the underlying functional time series has a period of N intervals and the vth seasonal component


$(X_{(t)N+v}, -\infty < t < \infty)$ is generated by the FAR(p(v)) process

$X_{(t)N+v}(s) - \mu_v(s) = \sum_{j=1}^{p(v)} \rho_{vj}(X_{(t-j)N+v}(s) - \mu_v(s)) + \eta_{tv}(s)$  (48)

with the usual notation. Here each periodic component evolves with different temporal dynamics. The covariance operator associated with $\eta_{tv}(s)$ is denoted by $\Sigma_v$. Assuming that each seasonal component evolves independently over time gives a FAR(1) for the $M = N(\sum_{j=1}^{N} p(j))$-dimensional function

$X_t(s) = (X_{(t-1)N+1}(s),\dots,X_{(t-p(1))N+1}(s),\dots,X_{(t-p(N))N+N}(s))$  (49)

with analogous quantities $\bar\mu(s)$ and $\bar\eta_t(s)$ associated with $(\mu_v(s), v = 1,\dots,N)$ and $(\eta_{tv}(s), v = 1,\dots,N)$; then

$X_t(s) - \bar\mu(s) = \bar\rho(X_{t-1}(s) - \bar\mu(s)) + \bar\eta_t(s), \quad s \in S$,  (50)

with the M-dimensional operator $\bar\rho$ given by

$\bar\rho(X_{(t-j)N+v}(s) - \mu_v(s)) = \rho_{vj(v)}(X_{(t-j)N+v}(s) - \mu_v(s))$  (51)

for $v = 1,\dots,N$ and $j(v) = 1,\dots,p(v)$. The parameters of this model can be identified in the same way as before, with the mean functions constructed from each periodic component. The estimates are then used as features in our classifier. This model appears to be new, although a number of authors have proposed varying coefficient models with parameters depending on seasonal variables for functional data.

Next we consider data with temporal trends of known form. Here the expected value of $X_t(s)$ is given by

$\mu_t(s) = \mu(s) + \sum_{u=1}^{q} \theta_u w_u(t)$,  (52)

where $q$ and $(w_u(t), u = 1,\dots,q)$ are known (a linear trend is given by $q = 1$ and $w_1(t) = t$). We suggest a two stage estimate of $\mu_t(s)$. First, we calculate

$X_t^{dif}(s) = X_t(s) - X_{t-1}(s), \quad t = 2,\dots,T$,  (53)

and use least squares to estimate $(\theta_u, u = 1,\dots,q)$ based on the observed values of the trend. We subtract the resulting estimate of the parametric component $\sum_{u=1}^{q} \hat\theta_u w_u(t)$ from $X_t(s)$ and estimate $\mu(s)$ and the parameters of the functional auto-regression under study. Appropriate features are then constructed using the algorithm described earlier.
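A sketch of the two-stage trend estimate (52)-(53), applied to basis coefficients rather than the functions themselves (an assumed representation), and ignoring the autocorrelation of the differenced series:

```python
import numpy as np

def detrend(C, W):
    """Two-stage trend removal of Eqs. (52)-(53).

    C : (T, n) basis coefficients of X_t(s); W : (T, q) known trend
    functions w_u(t), e.g. W[:, 0] = t for a linear trend.
    """
    dC = np.diff(C, axis=0)                             # Eq. (53)
    dW = np.diff(W, axis=0)
    theta, *_ = np.linalg.lstsq(dW, dC, rcond=None)     # least squares
    return C - W @ theta            # subtract the parametric component
```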

9. A comparative study using simulated data

We examine the performance of our approach using data generated by functional auto-regressions with known characteristics in a two class experiment. The kth class is generated by

$X_t(s) = \rho_1^{(k)} X_{t-1}(s) + \epsilon_t(s), \quad s \in [0,1] \times [0,1]$,  (54)

with the usual notation. We follow [33] and generate functional time series by projecting $X_t(s)$ and $\epsilon_t(s)$ onto a set of ortho-normal basis functions. The latter reflect problem specific characteristics and include tensor products of a Fourier basis, classical orthogonal polynomials, wavelets or eigen-functions associated with the covariance kernel of $X_t(s)$. We focus on the use of a Fourier basis, as it is well suited to phenomena with temporal characteristics depending on spatial scale [24]. Fourier series can be computed efficiently for non-uniform designs, although they suffer from the usual limitations of global approximations. They give the projections

$X_t(s) = \sum_{j \in J_n} c_{tj} v_j(s), \quad \epsilon_t(s) = \sum_{j \in J_n} \delta_{tj} v_j(s)$,  (55)

where $v_j(s) = \phi_{j_1}(s_1)\phi_{j_2}(s_2)$ with $j = (j_1, j_2)$ and $J_n = (1 \le j_1 \le J_1, 1 \le j_2 \le J_2)$. Here $J_1 = J_2 = 5$ and

$\phi_l(x) = \begin{cases} 2^{1/2}|S|^{-1/2}\sin(2\pi x\, \mathrm{int}[l/2]/|S|), & l \text{ even}, \ l \ge 2, \\ 2^{1/2}|S|^{-1/2}\cos(2\pi x\, \mathrm{int}[l/2]/|S|), & l \text{ odd}, \ l \ge 3, \end{cases}$  (56)

where $\mathrm{int}[y]$ is the integer part of $y$, $\phi_1(s) = |S|^{-1/2}$ and $S = [0,1] \times [0,1]$. We assume that $c_t$ is generated by a vector-valued auto-regression of the form

$c_t = A_1 c_{t-1} + \delta_t, \quad -\infty < t < \infty$,  (57)

in each class, with $c_t = (c_{th(i)}, i = 1,\dots,n)'$ and $\delta_t = (\delta_{th(i)}, i = 1,\dots,n)'$, where $h(i) = (h_1(i), h_2(i))$ is an ordering of the set $J_n$ and $n = J_1 J_2$. Non-orthogonal basis functions can be used to generate $X_t(s)$, although (57) must be modified (see [46] for its functional regression analogue). We construct models with relatively few free parameters by simplifying the structure of the parameters [23]. Then

$A_1 = \operatorname{diag}\{r_l, l = 1,\dots,n\}$  (58)

and $(\delta_t, t = 1,\dots,T)'$ is a sequence of independent random vectors drawn from an n-dimensional Gaussian distribution with zero mean and covariance matrix $\Sigma$. The correlation between the lth and mth elements of $\delta_t$ is denoted by $\Sigma_{lm}$, which decays exponentially as a function of $|l - m|$. Models of this type provide satisfactory approximations to a wide range of natural phenomena [23]. The functional time series in class 2 are generated by $A_1 = \operatorname{diag}\{r_l, l = 1,\dots,n\}$ with covariance matrices $\Sigma_1$, $\Sigma_2$ or $\Sigma_3$ described in Table 1. Class 1 is described by $A_1 = 0$ and matching covariance matrices. This gives three experiments:

- A: very low-complexity dynamics and innovations,
- B: low-complexity dynamics with high-complexity innovations,
- C: low-complexity dynamics with low-complexity innovations,

where complexity broadly describes the smoothness of the auto-regressive parameter $\rho_1^{(k)}$ and the innovations $\epsilon_t(s)$. This is controlled in experiment A by the rate of decay of $r_l$, and by a step function with $l = 7$ in experiments B and C. The numerals (1)-(3) in Table 1 give an approximate ranking of the degree of separation between the classes.

Table 1. The data generating mechanisms.

Experiment A: $A_1 = \operatorname{diag}\{r_l\}$, $r_l = r\exp(-al)$, with (1) $r = 0.07^{1/2}$, $a = 2.5$; (2) $r = 0.08^{1/2}$, $a = 0.5$; (3) $r = 0.10^{1/2}$, $a = 0.01$. Innovations $\Sigma_1$: $\omega_{lm} = \delta\exp(-l(\beta + \gamma/2) - m(\beta - \gamma/2) + \gamma)$ with $\beta = 25.0/(n-1)$, $\gamma = 0.5/(n-1)$, $\delta = 0.08$.

Experiment B: $A_1 = \operatorname{diag}\{r_l\}$, $r_l = r_1\chi(1, k_1) + r_2\chi(k_1+1, n)$, with (1) $r_1 = 0.02$, $r_2 = 0.005$; (2) $r_1 = 0.04$, $r_2 = 0.01$; (3) $r_1 = 0.08$, $r_2 = 0.02$; $k_1 = 7$ throughout. Innovations $\Sigma_2$: $\omega_{lm} = \delta\exp(-l(\beta + \gamma/2) - m(\beta - \gamma/2) + \gamma)$ with $\beta = 25.0/(n-1)$, $\gamma = 0.0$, $\delta = 0.08$.

Experiment C: $A_1 = \operatorname{diag}\{r_l\}$, $r_l = r_1\chi(1, k_1) + r_2\chi(k_1+1, n)$, with (1) $r_1 = 0.02$, $r_2 = 0.005$; (2) $r_1 = 0.04$, $r_2 = 0.01$; (3) $r_1 = 0.08$, $r_2 = 0.02$; $k_1 = 7$ throughout. Innovations $\Sigma_3$: $\omega_{lm} = \sigma_l\sigma_m\exp(-\beta|l - m|)$, $\sigma_l^2 = \sigma_1^2\chi(1, k_1) + \sigma_2^2\chi(k_1+1, n)$, with $\beta = 25.0/(n-1)$, $\sigma_1^2 = 0.08$, $\sigma_2^2 = 0.02$.

Here $\chi(a, b) = 1$ for $l \in [a, b]$ and zero otherwise.
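A rough sketch of the class-2 generating mechanism (56)-(58) for experiment A(1); the innovation covariance below is a simplified AR(1)-type correlation standing in for $\Sigma_1$, whose exact kernel in Table 1 is not reproduced:

```python
import numpy as np

def fourier_1d(l, x):
    """Eq. (56) on [0, 1] (so |S| = 1): an orthonormal Fourier basis."""
    if l == 1:
        return np.ones_like(x)
    f = np.sin if l % 2 == 0 else np.cos
    return np.sqrt(2.0) * f(2.0 * np.pi * (l // 2) * x)

def simulate_coefficients(T, r, Sigma, rng):
    """Eqs. (57)-(58): c_t = A_1 c_{t-1} + delta_t with A_1 = diag(r)."""
    n = r.size
    L = np.linalg.cholesky(Sigma)
    c = np.zeros((T, n))
    for t in range(1, T):
        c[t] = r * c[t - 1] + L @ rng.standard_normal(n)
    return c

# class 2 of experiment A(1): n = J_1 * J_2 = 25 basis functions
rng = np.random.default_rng(1)
n = 25
l_idx = np.arange(1, n + 1)
r = np.sqrt(0.07) * np.exp(-2.5 * l_idx)
Sigma = 0.08 * np.exp(-(25.0 / (n - 1))
                      * np.abs(l_idx[:, None] - l_idx[None, :]))
c = simulate_coefficients(200, r, Sigma, rng)   # one series, T = 200
# X_t(s) is recovered as sum_j c[t, j] * v_j(s), with v_j the tensor
# products of fourier_1d over the two spatial coordinates
```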

Two hundred series of length T = 200 are generated in classes 1 and 2, and randomly partitioned into test and training sets of the same size (100 series).

10. The experiments

We evaluate the performance of our functional classifier using Fourier (FAR) and empirical basis functions (FPC) generated by the approach described in Section 5. We compare their performance with results generated by a vector-valued auto-regression $VAR_m(p)$ for the m-dimensional time series $X_T = (X_1,\dots,X_T)$, where $X_t = (X_t(s_{ti}), i = 1,\dots,m)'$ and $s_{ti}$ is the ith design point at time t. Then

$X_t = \sum_{l=1}^{p} A_l X_{t-l} + Z_t, \quad -\infty < t < \infty$,  (59)

with the usual notation and feature vector $f_p = ([\operatorname{vec} \hat A_1]' : \dots : [\operatorname{vec} \hat A_p]')$. A reduced set of features ($\le 100$ in our experiments) is constructed by principal components [37]. The corresponding scores are used in a simple Gaussian classifier and selected using the same approach as our functional classifier. This provides an automatic means of choosing features; the group of classifiers described in [34] could be used to select a subset of variables, the model order p and sparse representations of $(\hat A_l, l = 1,\dots,p)$, although this lies outside the scope of this study. We observe the underlying functional time series at m points in $S = [0,1] \times [0,1]$. We use the smallest value of m which recovers the coefficients of the underlying tensor product representation for $X_t(s)$, as there is no additive noise. Three sampling designs are used in our experiments:

- A uniform design: uniformly spaced spatial locations over $[0,1] \times [0,1]$.
- A non-stationary design: generated by $(s_{ti}, i = 1,\dots,m_1 m_2) = (Z_{i1}, i = 1,\dots,m_1) \times (Z_{i2}, i = 1,\dots,m_2)$, where $Z_{i1}$ and $Z_{i2}$ are independent random variables drawn from a uniform distribution over $[0,1]$.
- A damaged design: generated by randomly removing a fixed percentage of observations from a uniform design at each time interval.

We fix the maximum FAR order to be $p_{max} = 1$ and restrict attention to $\Lambda^{family}$ with one soft constraint given by $e_1 = 10^{-8}$. The structure of $R^{family}$ and $S^{family}$ are the same as before (Section 7), with the latter using a maximum of W = 100 scores. Four row and column pairs (L = 4) are removed at each iteration to generate an appropriate structure for $A_1$. We present results for the modified algorithm using eigenvectors calculated from the whole training set. We use leave-v-out cross-validation $CV_v$ with v = 20 and calculate the average percentage of correctly classified time series over 25 realizations to construct the stopping rule.

11. The results

We present results for experiments A-C with three different sampling schemes (uniform, non-stationary and damaged) in Tables 2–4. The former gives a fair comparison with conventional techniques, as the resulting process can be described by a stationary vector-valued auto-regression [15]. We focus on the performance measures:

- the percentage of correctly classified time series in the test set,
- the number of functional scores in the classifier,
- the number of row-column pairs removed from the models.

Table 2. The percentage of correctly classified series for experiment A using uniform, non-stationary and damaged designs with 10% missing values. The average number of scores in the classifier and the number of row-column pairs removed from the model are given in round and square brackets, respectively.

      Uniform                                    Non-stationary                             Damaged
      FAR            FPC            VAR          FAR            FPC            VAR          FAR            FPC            VAR
1     94.5 (8) [4]   98.0 (9) [8]   90.5 (27)    93.5 (8) [0]   94.0 (8) [4]   52.0 (42)    95.0 (1) [12]  97.0 (1) [16]  97.5 (16)
2     84.0 (9) [20]  70.5 (2) [20]  54.5 (47)    83.5 (10) [20] 70.5 (2) [20]  51.5 (42)    81.5 (8) [20]  69.0 (10) [20] 59.0 (6)
3     63.5 (1) [24]  48.5 (14) [4]  46.5 (7)     63.5 (1) [24]  48.0 (14) [4]  50.5 (22)    67.0 (1) [24]  52.0 (4) [12]  49.0 (7)

Table 3. The percentage of correctly classified series for experiment B using uniform, non-stationary and damaged designs with 10% missing values. The average number of scores in the classifier and the number of row-column pairs removed from the model are given in round and square brackets, respectively.

      Uniform                                    Non-stationary                             Damaged
      FAR            FPC            VAR          FAR            FPC            VAR          FAR            FPC            VAR
1     86.5 (6) [20]  68.5 (15) [20] 62.0 (46)    86.0 (6) [20]  69.0 (15) [20] 51.0 (35)    86.5 (6) [20]  68.5 (15) [20] 57.5 (7)
2     64.0 (13) [20] 54.5 (43) [16] 57.5 (7)     64.0 (13) [20] 55.5 (43) [16] 51.5 (46)    50.5 (12) [16] 57.0 (20) [20] 55.0 (8)
3     50.5 (5) [16]  51.5 (45) [16] 51.0 (35)    49.6 (5) [16]  52.5 (45) [16] 51.0 (35)    56.0 (14) [16] 51.0 (4) [12]  49.5 (2)

Table 4. The percentage of correctly classified series for experiment C using uniform, non-stationary and damaged designs with 10% missing values. The average number of scores in the classifier and the number of row-column pairs removed from the model are given in round and square brackets, respectively.

      Uniform                                    Non-stationary                             Damaged
      FAR            FPC            VAR          FAR            FPC            VAR          FAR            FPC            VAR
1     86.5 (6) [20]  86.5 (9) [20]  60.5 (34)    84.0 (4) [20]  86.5 (9) [20]  50.5 (28)    81.0 (4) [20]  85.0 (6) [20]  54.5 (18)
2     63.5 (15) [20] 63.5 (10) [20] 55.5 (34)    63.0 (13) [20] 62.5 (10) [20] 50.5 (29)    81.0 (4) [20]  85.0 (6) [20]  60.5 (34)
3     54.5 (1) [12]  50.5 (6) [4]   58.5 (36)    49.5 (21) [0]  45.5 (7) [16]  49.0 (24)    50.0 (11) [16] 46.5 (4) [16]  43.0 (45)

We see from Tables 2–4 that the best performance for all sample designs is given by the functional

techniques FAR and FPC in all experiments with significant differences between classes. All techniques have broadly similar performance when there are small differences between classes (see row 3 in Table 4). Differences between the performance of VAR and the functional procedures are small when $X_t(s)$ is very smooth (see row 1 in Table 3 for typical results).

Techniques based on Fourier (FAR) and empirically generated basis functions (FPC) have broadly similar performance for all sampling designs in experiments A and C, although the performance of FAR is generally superior to FPC, as our choice of Fourier basis is advantageous. However, substantial differences between the performance of FAR and FPC are apparent in experiment B with $\|A_1\|_{L^2} \gg 0$ in class 2 (see row 1 of Table 3). This effect appears to be related to the dependence of FPC on the covariance structure of $X_t(s)$, which is dominated by $\Sigma_2$ rather than by the characteristics of $A_1$ in this experiment.

All functional procedures select classifiers using a plausible approximate rank for the matrix $\hat A_1$. This is described by the number of row-column pairs in the final model and is expected to correspond with approximately seven or 18 row-columns removed from the model in experiments A-C, respectively. This decreases with the complexity of $A_1$ in experiment B, with stable estimates in experiments A and C, as we would expect, although atypical behaviour is associated with $\|A_1\|_{L^2} \approx 0$ in class 2. The numbers of scores used by FPC and FAR are generally similar and much smaller than for the multivariate procedure VAR.

We see from Tables 2–4 that the classifier based on the coefficients of a vector-valued auto-regression for the observations (VAR) suffers large reductions in performance for non-stationary designs when there are significant differences between classes. This differs from the functional techniques, which have similar performance, as expected. The random removal of 10% of the observations at each time interval has little effect on the performance of all procedures, with the largest changes associated with VAR, although this effect differs in each experiment (see Tables 2–4). The robustness shown by VAR, FAR and FPC may be related to the high degree of smoothness of $X_t(s)$ in our experiments.

12. Comparisons using multi-channel EEG data

The classification of time series generated by EEG is a key component of a number of emerging technologies. These include the identification of individuals [60] using biometrics, detecting sleep states [61] and providing a non-invasive means of controlling machines in near real time [62]. However, a reliable means of exploiting the temporal dynamics of multi-channel and high-resolution EEG for classification remains elusive, with time series methods generally limited to small numbers of electrodes [34,61,63]. While other ad hoc procedures are relatively successful [4], they are unable to exploit the rich temporal structures extracted from large training sets by time series methods [64]. This appears to be related to the high dimensionality and ill-conditioned nature of the vector of observations at each time interval: 64 for the standard electrode positions in Fig. 4.

[Fig. 4. Extended standard electrode positions.]

13. A functional auto-regression for scalp potentials

Here, we describe a functional model for EEG data. We assume that the skull can be approximated by a sphere and describe the temporal dynamics of scalp potentials by their projection onto the set $S = [0,1] \times [0,1]$ in the plane [65]. Realistic skull models [2] can be used by our approach, although projection onto a common skull model is required. We describe $S$ by the polar co-ordinates $(r, \theta)$, where $R = [0,1]$ and $A = [0,1]$, with the front of the skull pointing North and the y-axis in a N-S direction. While theoretical and empirical studies have shown that EEG data exhibit non-linear dynamics, we expect that linear models will provide good approximations for moderate sampling rates. We assume that the temporal dynamics [45] of the scalp potentials $(X_t(r,\theta), t = 1,\dots,T)$ are described by a functional auto-regression (in integral form) of order p:

$X_t(r,\theta) = \sum_{l=1}^{p} \int_R \int_A \rho_l(r,\theta; v_1, v_2)\, X_{t-l}(v_1, v_2)\, dv_1\, dv_2 + \epsilon_t(r,\theta)$  (60)

on $R \times A$. Empirical evidence suggests that the Laplacian associated with $X_t(r,\theta)$ is bounded, so that (60) lies in a sub-space of $L^2$ and $(\rho_l, l = 1,\dots,p)$ are bounded linear operators on this space. We can regard the kernels $\rho_l(r,\theta; v_1, v_2)$ as a sequence of weights for the potential fields $X_{t-l}(v_1, v_2)$ over $R \times A$. We illustrate this interpretation by the diagram in Fig. 5, where the weights around $(r,\theta)$ can be increased to reflect spatially local potentials as predictors of $X_t(r,\theta)$.

[Fig. 5. The model $X_t(r,\theta) = \rho_1 X_{t-1}(r,\theta) + \epsilon_t(r,\theta)$: a linear operator (weighting) maps the scalp potential field at time t-1 on R x A to the scalp potential at a specific point in R x A at time t.]

In contrast to earlier work on functional models for EEG described in [2,45], we adopt a basis function approach to describe $(X_t(r,\theta), t = 1,\dots,T)$, which has a number of advantages. While our focus is on functional methods, we note that other procedures can exploit spatial smoothness. These include factor models [66] or spatially varying collections of scalar models [67], multiwavelets [68] and spatial smoothing [69].

14. The data

We compare the performance of our functional classifier and the baseline (VAR) using EEG data from a study of the genetic basis for alcoholism. The underlying hypothesis is that the activation of certain groups of neurons in the brain is indicative of alcoholism. The spatial smoothness and fixed electrode positions make this data well suited to comparisons between functional and conventional techniques.

Here, a group of alcoholics and controls view line drawings from the Snodgrass and Vanderwart picture set [70]. These are a collection of visual stimuli used to study cognition (such as name and image agreement, familiarity and visual complexity). Selected drawings are presented to subjects in three different ways or paradigms: a single line drawing (S1), a pair of different line drawings (S1S2) and a pair of matching line drawings (S2). Scalp potentials are measured at a sampling rate of 256 Hz at the extended standard electrode positions (Fig. 4) for 1 s after each paradigm is revealed. The resulting data (a 64 x 256 matrix) is called a trial. This procedure is repeated for selected drawings, with a rest interval between paradigms (see [28] for a detailed description of the data collection methodology).

We exclude measurements from the reference sites A1, NZ and A2 in our experiments, with no pre-processing, although the removal of high- and low-frequency (in time) components is often used [60] to limit attention to the $\gamma$ band centred at 40 Hz, where activity is strongly correlated with perception and memory. We exclude trials with missing observations (equal voltage readings from an electrode over a long time interval) or eye blink artefacts characterized by high-voltage peaks in the frontal and pre-frontal cortex [60]. A sequence of skull potentials projected onto the unit disk at an interval of 0.25 s is given in Fig. 6 for an alcoholic (top) and control (bottom). Light greys indicate high voltages (the average potential at each electrode is approximately zero). Selected time series for the electrodes FPz, AFz, Fz, FCz, Cz, CPz, Pz, POz and Oz are presented in Figs. 7 (control) and 8 (alcoholic). Trials associated with 10 alcoholics and 10 controls are independently partitioned into test and training sets, with a small number of trials removed to give a similar number of trials for each paradigm (see Table 5).

[Fig. 6. Potentials developing from left to right at 0.25 s intervals. Top: alcoholic; bottom: control, with light greys associated with high scalp potentials.]

Table 5. The number of test and training trials.

Class      Trials   S1   S1S2   S2
Alcoholic  246      83   82     81
Control    246      83   82     81

15. Constructing a classifier

We focus on the tensor product representation

$\tilde X_t(r,\theta) = \sum_{j \in J_n} \tilde d_{tj}\, \phi_{j_1}(r)\, \phi_{j_2}(\theta), \quad (r,\theta) \in R \times A$,  (61)

using a Fourier basis (56) to reflect the smoothness of scalp potentials and $J_n = \{j = (j_1, j_2); 1 \le j_1 \le J_1, 1 \le j_2 \le J_2\}$, with $n = J_1 J_2$. We take $J_1 = J_2 = 7$, $n = 7 \times 7$, to closely approximate the data, as there is little evidence of additive noise. We fit (61) by least squares, although a robust loss function can be used to mitigate the effect of spatially localized anomalies (often generated by electrode popping). We use a ridge penalty with $\lambda = 10^{-8}$ to identify (61) in our experiments.

We fix the maximum FAR order to be $p_{max} = 1$, as this model order gives good performance and facilitates comparisons with our $VAR_{61}(1)$ baseline. The latter is computationally impractical for higher-order models. We restrict attention to $\Lambda^{family}$ with one soft constraint given by $e_1 = 10^{-8}$. The structure of $R^{family}$ and $S^{family}$ are the same as before (Section 7), with the latter using a maximum of W = 100 scores. Four row and column pairs (L = 4) are removed at each iteration to generate an appropriate structure for $A_1$. We present results for the modified algorithm using eigenvectors calculated from the whole training set. We use leave-v-out cross-validation $CV_v$ with v = 20 and calculate the average percentage of correctly classified time series using 25 realizations to give stable estimates of classification performance.
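A sketch of the ridge fit of (61) for a single time slice, assuming the 61 electrode positions have been projected onto $[0,1]^2$; the inputs are hypothetical:

```python
import numpy as np

def fourier_1d(l, x):
    """The orthonormal Fourier basis of Eq. (56) on [0, 1]."""
    if l == 1:
        return np.ones_like(x)
    f = np.sin if l % 2 == 0 else np.cos
    return np.sqrt(2.0) * f(2.0 * np.pi * (l // 2) * x)

def fit_scalp_slice(x_t, pts, J1=7, J2=7, lam=1e-8):
    """Ridge fit of the tensor product representation (61) for one time
    slice of scalp potentials.

    x_t : (m,) potentials at the electrodes; pts : (m, 2) electrode
    positions (r, theta) projected onto [0, 1]^2 (hypothetical inputs).
    Returns the n = J1 * J2 coefficients d~_t.
    """
    V = np.column_stack([fourier_1d(j1, pts[:, 0]) * fourier_1d(j2, pts[:, 1])
                         for j1 in range(1, J1 + 1)
                         for j2 in range(1, J2 + 1)])
    return np.linalg.solve(V.T @ V + lam * np.eye(V.shape[1]), V.T @ x_t)
```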

[Fig. 7. Potentials developing from left to right for a control subject and the electrodes FPz, AFz, Fz, FCz, Cz, CPz, Pz, POz, Oz on the centre line of the scalp.]

[Fig. 8. Potentials developing from left to right for an alcoholic subject and the electrodes FPz, AFz, Fz, FCz, Cz, CPz, Pz, POz, Oz on the centre line of the scalp.]

16. Results

We see from Figs. 9 and 10 that the innovation covariance matrix $\hat\Omega$ associated with a FAR(1) or $VAR_{61}(1)$ is similar in both classes, and there are noticeable differences in the average value of $\hat A_1$ in each class. This supports the use of a classifier based on $\hat A_1$ or $\rho_1$. We note that there are prominent diagonal features in the average value of $\hat A_1$ in both classes (see Figs. 9 and 10), which suggest strong positive correlation between the potential fields at successive time intervals at 256 Hz. This suggests that the use of a higher-order model may be advantageous, although this is not feasible for our $VAR_{61}(p)$ model.

The percentage of correctly classified trials is given in Table 6 for FAR, FPC and $VAR_{61}(1)$. We see that the best performance is given by the functional auto-regression FAR(1) with a Fourier basis. The use of Fourier and functional principal component bases outperforms the vector-valued auto-regression $VAR_{61}(1)$ for the raw data, as we

The good performance of the functional techniques is related to the spatial smoothness of the scalp potential field generated by the conductivity of the head [2]. This ensures that the temporal dynamics can be described by relatively few basis functions (see Table 6). The functional approach does not outperform techniques [4] based on global measures of brain activity in specific frequency bands (the γ band centred at 40 Hz), which give results in the range 91.8–96.1% using a non-linear classifier. Recent evidence [64] on the temporal characteristics of EEG suggests that a non-stationary functional model may be required. Such models can be generated by an appropriate latent model with time-varying coefficients.
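As an illustration of what such a latent model might look like (our notation; no such model is fitted in this paper), the FAR(1) score recursion could be given slowly drifting coefficients in state-space form,

$$d_t = A_t\, d_{t-1} + \varepsilon_t, \qquad \operatorname{vec}(A_t) = \operatorname{vec}(A_{t-1}) + \eta_t,$$

with $\varepsilon_t$ and $\eta_t$ mutually independent innovation sequences; setting $\eta_t = 0$ recovers the stationary FAR(1) used here.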

Next, we explore the value of a reduced-complexity vector-valued auto-regression. While such models can be constructed by extending our baseline classifier, we restrict attention to the seven electrodes identified in [4] as containing the bulk of the information describing differences between alcoholics and controls. The resulting classifier VAR$_7(1)$ gives very similar performance to VAR$_{61}(1)$ in Table 6, which confirms the importance of these electrodes and the associated cognitive processes in discriminating between alcoholics and controls. However, its performance remains rather less than that of the functional approach. We conclude that the functional approach gives the best performance among the time series methods in this experiment.
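For comparison, the vector-valued baseline needs no basis at all: a VAR(1) is fitted directly to the selected channels and its coefficient matrix vectorised as a feature vector. A minimal sketch, assuming `X` is a (T × channels) array for one trial; the channel indices shown are placeholders, not the electrode set of [4].

```python
import numpy as np

def var1_features(X):
    """Least-squares VAR(1) fit; returns vec(A_1) as a feature vector."""
    Y, Z = X[1:], X[:-1]                          # regress x_t on x_{t-1}
    A1 = np.linalg.lstsq(Z, Y, rcond=None)[0].T
    return A1.ravel()

# e.g. restrict to seven informative channels before fitting, as in the
# VAR_7(1) classifier (indices here are illustrative placeholders):
# features = var1_features(X[:, [0, 3, 10, 21, 30, 44, 52]])
```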

Fig. 8. Potentials developing from left to right for an alcoholic subject and the electrodes FPz, AFz, Fz, FCz, Cz, CPz, Pz, POz, Oz on the centre line of the scalp.

Table 5
The number of test and training trials by paradigm

Class        Trials    S1    S1S2    S2
Alcoholic    246       83    82      81
Control      246       83    82      81

We see that the number of scores in the classifier (the figures in round brackets in Table 6) is substantially smaller for VAR$_{61}(1)$ than for the functional techniques FAR and FPC. This may be related to the noticeable differences in the magnitude of certain off-diagonal elements of $\hat{A}_1$ for alcoholics and controls in Fig. 10, and suggests value in the use of a localized orthogonal basis.

17. Conclusions

In this study, we consider the problem of classifying sequences of densely sampled spatial data into a number of disjoint classes defined by training data. This is an important component of a number of emerging technologies, including the use of dense sensor arrays for condition monitoring, brain–computer interfaces for communications and control, the automatic detection of moving pedestrians from sequences of images and the study of cognitive function using high-resolution EEG. Our approach, which appears to be new, is based on the assumption that the data are generated by an underlying functional auto-regression. This describes the temporal development of a sequence of elements in an appropriate function space observed at a number of locations; a typical example is the sequence of electric potentials on the scalp of an individual developing over time.

Fig. 9. The average value of $\hat{A}_1$ for a FAR(1) fit using the full Fourier basis. Alcoholics: top ($\hat{A}_1$ right, $\hat{S}$ left); control: bottom ($\hat{A}_1$ right, $\hat{S}$ left).


Fig. 10. The average value of $\hat{A}_1$ for a VAR$_{61}(1)$ fit to the raw data. Alcoholics: top ($\hat{A}_1$ right, $\hat{S}$ left); control: bottom ($\hat{A}_1$ right, $\hat{S}$ left).

This formalism mitigates the effect of non-stationary sampling schemes generated by sensor failures or changes in sensor location, and provides a principled means of mitigating the ill-conditioning generated by high-resolution sensors [11] by exploiting the smoothness of the coefficients of the underlying functional auto-regression.

Table 6
The percentage of correctly classified trials, with the average number of scores in the classifier in round brackets and the number of row–column pairs removed from the model in square brackets

Fourier           FPC              VAR_61(1)        VAR_7(1)
89.8 (74) [40]    89.2 (91) [29]   83.1 (46) N/A    82.5 (62) N/A

Our focus on spatial data and on automatic means of exploiting low intrinsic dimensionality (smoothness) differentiates our approach from the techniques used to construct classifiers for dynamic shape analysis [26] and automatic target detection [27]. We assess the performance of our approach using simulated and real multi-channel EEG series. Our experiments show that the functional approach can outperform conventional procedures by exploiting the low intrinsic dimensionality of uniformly sampled data. In addition, it maintains a high level of performance for non-stationary sampling schemes generated by sensors with time-varying positions, perhaps caused by sensor failure or differing maintenance practices; conventional procedures show large reductions in performance [15] for designs of this type. We note that the choice of basis functions used to represent the underlying functional auto-regression may be significant, with functional principal components giving inferior performance in certain scenarios. Next, we considered an experiment using multi-channel EEG; our experiments appear to be the first to study the performance of techniques of this type for EEG. Here, the functional approach gave the best performance among time series methods, but did not outperform activation-based methods [4]. In the light of much recent work on non-stationary models for EEG [64], we suggest that similar non-stationary functional models may be required; these can be generated by an appropriate latent model with time-varying coefficients. Our approach also provides a principled means of assimilating temporal dynamics from sensors of different modalities and resolutions [71], and suggests that meaningful information can be extracted from high-resolution EEG.


Appendix

Here, we describe an algorithm to construct functional principal components for vector-valued functions defined on the domain $S = [0,1]^d$. This is based on the approach described in [46] and imposes smoothness constraints by limiting the number of basis functions [72]. Let $\tilde{Y}(s) = (\tilde{Y}_1(s), \ldots, \tilde{Y}_d(s))'$ be the projection of an $\mathbb{R}^d$-valued function onto a sub-space of $H^d = H_1 \times H_2 \times \cdots \times H_d$ defined by the basis functions

$$\{v_{1j}(s),\ j \in J_{n_1}\} \times \cdots \times \{v_{dj}(s),\ j \in J_{n_d}\} \tag{62}$$

with $n_l \le n$, $l = 1, \ldots, d$ and the usual notation. Let

$$\tilde{Y}_l(s) = V_l'(s)\, w_l \tag{63}$$

with $V_l'(s) = (v_{l h(1)}(s), \ldots, v_{l h(n_l)}(s))$ and $w_l' = (w_{l h(1)}, \ldots, w_{l h(n_l)})$, where $h_l : J_{n_l} \to (1, \ldots, n_l)$ orders the basis functions in terms of increasing oscillation. Following [46], we see that the first functional principal component is the eigen-function associated with the integral equation

$$C_{\tilde{Y}}\, \phi(s) = \lambda\, \phi(s), \qquad \|\phi(s)\|_{H^d} = 1, \tag{64}$$

where $\phi'(s) = (\phi_1'(s), \ldots, \phi_d'(s))$, $\phi_l(s) = V_l'(s)\, w_l$ and $C_{\tilde{Y}}$ is the covariance operator associated with $\tilde{Y}(s)$. Then

$$C_{\tilde{Y}}\, \phi(s) = \left( \int_S \sum_{m=1}^d C_{\tilde{Y}}^{(1,m)}(s,t)\, \phi_m(t)\, dt,\ \ldots,\ \int_S \sum_{m=1}^d C_{\tilde{Y}}^{(d,m)}(s,t)\, \phi_m(t)\, dt \right)' \tag{65}$$

and

$$C_{\tilde{Y}}^{(l,m)}(s,t) = V_l'(s)\, c_{\tilde{Y}}^{(l,m)}\, V_m(t) \tag{66}$$

give

$$C_{\tilde{Y}}\, \phi(s) = V'(s)\, c_{\tilde{Y}}\, B\, W, \tag{67}$$

where $W = (w_1', \ldots, w_d')'$, $V'(s) = (V_1'(s), \ldots, V_d'(s))$, $B = \operatorname{diag}\{B_1, \ldots, B_d\}$, $c_{\tilde{Y}} = \{c_{\tilde{Y}}^{(l,m)}\}$ and

$$B_l = \int_S v_{l h_l(k_1)}(s)\, v_{l h_l(k_2)}(s)\, ds, \qquad k_1, k_2 = 1, \ldots, n_l. \tag{68}$$

$B_l$ is an $n_l \times n_l$ matrix (the identity matrix for an ortho-normal basis). Then $\phi(s)$ and $\lambda$ satisfy

$$V'(s)\, c_{\tilde{Y}}\, B\, W = \lambda\, V'(s)\, W, \qquad W = (w_1', \ldots, w_d')', \tag{69}$$

with $W$ satisfying

$$\sum_{l=1}^d w_l'\, B_l\, w_l = 1 \tag{70}$$

and the remaining eigen-functions generated from (64) using the orthogonality condition $\langle \phi_k(s), \phi_m(s) \rangle_{H^d} = 0$ for $k < m$, in the same way as [46]. These equations can be solved directly, or after transforming the initial basis (62) to orthogonality using $B_l^{-1/2}$. Smoothness constraints are incorporated by restricting $(n_l,\ l = 1, \ldots, d)$. This construction of smooth functional principal components for vector-valued functions on a spatial domain appears to be new, although [73] describes an approach based on a composite penalty [46]. Our approach avoids the complex rules used to generate observations on a regular spatial lattice [74] and is related to the Karhunen–Loève expansion generated by a known spatial covariance function [23]. We can generate functional principal components which are insensitive to anomalous functions ($\tilde{Y}(s)$ generated by anomalous environmental conditions) by replacing $C_{\tilde{Y}}$ with a robust analogue [72].
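Numerically, since the basis is fixed, (67)–(70) reduce to the matrix eigen-problem $c_{\tilde{Y}}\, B\, w = \lambda\, w$, which can be symmetrised with $B^{1/2}$. The sketch below is our illustration of this reduction; the function names are ours, and `C` and `B` are assumed to have been assembled from the fitted coefficients and the basis Gram blocks.

```python
import numpy as np
from scipy.linalg import eigh, sqrtm

def vector_fpca(C, B, n_components=3):
    """Functional PCA for vector-valued functions, following (67)-(70).

    C : (n, n) covariance matrix of the stacked basis coefficients
        (c_Y~ in the text), with n = n_1 + ... + n_d.
    B : (n, n) block-diagonal Gram matrix diag{B_1, ..., B_d}.
    Returns the leading eigenvalues and coefficient vectors, each column
    normalised so that w' B w = 1, i.e. satisfying (70).
    """
    Bh = np.real(sqrtm(B))            # B^{1/2}; B is symmetric PSD
    M = Bh @ C @ Bh                   # symmetrised eigen-problem
    lam, U = eigh(M)                  # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]    # largest first
    W = np.linalg.solve(Bh, U)        # w = B^{-1/2} u, so w' B w = 1
    return lam[:n_components], W[:, :n_components]
```

Each returned column, partitioned into the $d$ coordinate blocks, gives the coefficients $w_l$ of an eigen-function via (63).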

References

[1] R. Assadollahi, F. Pulvermüller, Neural network classification of word evoked neuromagnetic brain activity, in: S. Wermter, J. Austin, D. Willshaw (Eds.), Emergent Neural Computational Architectures Based on Neuroscience: Towards Neuroscience-Inspired Computing, Lecture Notes in Computer Science, vol. 2036, 2001, pp. 311–319.
[2] J.C. Jimenez, R. Biscay, O. Montoto, Modelling the electroencephalogram by means of spatial spline smoothing and temporal auto-regression, Biol. Cybernet. 72 (1995) 249–259.
[3] P.C. Besse, H. Cardot, D.B. Stephenson, Autoregressive forecasting of some functional climatic variations, Scand. J. Statist. 4 (2000) 673–687.
[4] R. Palaniappan, P. Ravendran, S. Omatu, VEP optimal channel selection using genetic algorithm for neural network classification of alcoholics, IEEE Trans. Neural Networks 13 (2002) 486–491.
[5] R.H. Shumway, Discriminant analysis for time series, in: P.R. Krishnaiah, L.N. Kanal (Eds.), Handbook of Statistics, vol. 2, Elsevier Science, Amsterdam, 1982, pp. 1–62.
[6] S.M. Shaarawy, M.A. Almahmeed, Bayesian classification with multivariate auto-regressive sources that might have different orders, Comm. Statist. Sim. Comput. 24 (1995) 567–581.
[7] P.J. Brown, T. Fearn, M.S. Haque, Discrimination with many variables, J. Amer. Statist. Assoc. 94 (1999) 1320–1329.
[8] W.D. Penny, S.J. Roberts, Bayesian multivariate autoregressive models with structural priors, IEE Proc. Vision Image Signal Process. 149 (2002) 33–41.

[9] Y. Kakizawa, R.H. Shumway, M. Taniguchi, Discrimination and clustering for multivariate time series, J. Amer. Statist. Assoc. 93 (1998) 328–340.
[10] W.J. Krzanowski, P. Jonathan, W.V. McCarthy, M.R. Thomas, Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data, J. Roy. Statist. Soc. C 44 (1995) 101–115.
[11] B.K. Alsberg, Representation of spectra by continuous functions, J. Chemometrics 7 (1993) 177–193.
[12] C. Wöhler, J.K. Anlauf, An adaptable time-delay neural network algorithm for image sequence analysis, IEEE Trans. Neural Networks 10 (1999) 1531–1536.
[13] B.W. Stiles, J. Ghosh, Habituation based neural networks for spatio-temporal classification, Neurocomputing 15 (1997) 273–307.
[14] C. Wöhler, U. Kressel, J.K. Anlauf, Pedestrian recognition by classification of image sequences: global approaches vs. local spatio-temporal processing, in: International Conference on Pattern Recognition, September 3–7, Barcelona, 2000, pp. 540–544.
[15] F. Khellah, P. Fieguth, M.J. Murray, M. Allen, Statistical processing of large image sequences, IEEE Trans. Image Process. 14 (2005) 80–93.
[16] A. Harvey, S.J. Koopman, L. Penzer, Messy time series: a unified approach, Adv. Econometrics 13 (1998) 103–143.
[17] J. Little, M. Goldstein, P. Jonathan, Efficient Bayesian sampling inspection for industrial processes based on transformed spatio-temporal data, Statist. Modell. 4 (2004) 299–313.
[18] J.R. Stroud, P. Muller, B. Sanso, Dynamic models for spatiotemporal data, J. Roy. Statist. Soc. 63 (2001) 673–689.
[19] J.D. Scargle, Phase-sensitive deconvolution to model random processes with special reference to astronomical data, in: D.F. Findley (Ed.), Handbook of Statistics, vol. 2, North-Holland, Amsterdam, 1981, pp. 549–564.
[20] D. Bosq, Linear Processes in Function Spaces. Theory and Applications, Lecture Notes in Statistics, vol. 149, Springer, New York, 2000.
[21] P.C. Kyriakidis, A.G. Journel, Geostatistical space–time models: a review, Math. Geol. 31 (1999) 651–684.
[22] M.D. Ruiz-Medina, F.J. Alonso, J.M. Angulo, Functional stochastic modeling and prediction of spatiotemporal processes, J. Geophys. Res. 108 (D24) (2003) STS 9.1–9.8.
[23] K.V. Mardia, C. Goodall, E.J. Redfern, F.J. Alonso, The kriged Kalman filter, Test 7 (1998) 217–285.
[24] C.K. Wikle, N. Cressie, A dimension reduced approach to space time Kalman filtering, Biometrika 86 (1999) 815–829.
[25] P. Hall, D.S. Poskitt, B. Presnell, A functional data-analytic approach to signal discrimination, Technometrics 43 (2001) 1–9.
[26] B. North, A. Blake, M. Isard, J. Rittscher, Learning and classification of complex dynamics, IEEE Trans. Pattern Anal. Machine Intell. 22 (2000) 1016–1034.
[27] D.Q. Zhou, G.S. Liu, J.X. Wang, Spatio-temporal target identification method of high-range resolution radar, Pattern Recognition 33 (2000) 1–7.
[28] X.L. Zhang, H. Begleiter, B. Porjesz, W. Wang, A. Litke, Event related potentials during object recognition tasks, Brain Res. Bull. 38 (1995) 531–538.
[29] M.I. Friswell, D.J. Inman, Sensor validation for smart structures, J. Intell. Material Systems Struct. 10 (1999) 973–982.

[30] S.A. Beaulah, Z.S. Chalabi, D.G. Randle, A real-time knowledge-based system for intelligent monitoring in complex, sensor-rich environments, Comput. Electronics Agriculture 21 (1998) 53–68.
[31] P. Clarysse, D. Friboulet, I.E. Magnin, Tracking geometrical descriptors on 3-D deformable surfaces: application to the left ventricular surface of the heart, IEEE Trans. Med. Imaging 16 (1997) 392–404.
[32] E. Ardizzone, A. Capa, M. La Cascia, Using temporal texture for content based video retrieval, J. Visual Languages Comput. 11 (2000) 241–252.
[33] P.C. Besse, H. Cardot, Spline approximation of the forecast of a first-order autoregressive functional process, Canad. J. Statist. 24 (1996) 467–487.
[34] C.W. Anderson, E.A. Stolz, S. Shamsunder, Multivariate autoregressive models for classification of spontaneous electroencephalographic signals during mental tasks, IEEE Trans. Biomed. Eng. 45 (1998) 277–286.
[35] R.H. Glendinning, S.L. Fleet, Classifying non-uniformly sampled curves, Pattern Recognition 37 (2004) 1999–2008.
[36] R.L. Kashyap, Optimal feature selection and decision rules in classification problems with time series, IEEE Trans. Inform. Theory IT-24 (1978) 281–288.
[37] Z. Wang, Y. Lee, S. Fiori, C.-S. Leung, Y.-S. Zhu, An improved sequential method for principal component analysis, Pattern Recognition Lett. 24 (2003) 1409–1415.
[38] P.M. Robinson, C. Velasco, Autocorrelation robust inference, in: G.S. Maddala, C.R. Rao (Eds.), Handbook of Statistics, vol. 15, Elsevier, Amsterdam, 1997, pp. 268–298.
[39] M.D. Ruiz-Medina, J.M. Angulo, Spatio-temporal filtering using wavelets, Stochastic Environ. Res. Risk Assess. 16 (2002) 241–266.
[40] A. Antoniadis, T. Sapatinas, Wavelet methods for continuous-time prediction using Hilbert-valued autoregressive processes, J. Mult. Anal. 87 (2003) 133–158.
[41] F. Mokhtari, T. Mourid, Prédiction des processus à temps continu autorégressifs via les espaces à noyau reproduisant, C. R. Acad. Sci. Paris, Ser. I 334 (2002) 65–70.
[42] J. Gross, Restricted ridge estimation, Statist. Probab. Lett. 65 (2003) 57–64.
[43] D. Bosq, Modelization, non-parametric estimation and prediction for continuous time processes, in: G. Roussas (Ed.), Non-parametric Functional Estimation and Related Topics, NATO ASI, Avon Books, New York, 1991, pp. 509–529.
[44] T. Mourid, Estimation and prediction of functional autoregressive processes, Statistics 36 (2002) 125–138.
[45] P.A. Valdes-Sosa, Spatio-temporal autoregressive models defined over brain manifolds, Neuroinformatics 2 (2004) 239–250.
[46] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, Springer, New York, 1997.
[47] A. Labbas, T. Mourid, Estimation et prévision d'un processus autorégressif Banach, C. R. Acad. Sci. Paris, Ser. I 335 (2002) 767–772.
[48] F. Rachedi, T. Mourid, Estimateur crible de l'opérateur d'un processus ARB(1), C. R. Acad. Sci. Paris, Ser. I 336 (2003) 605–610.
[49] N. Bensmain, T. Mourid, Estimateur sieve de l'opérateur d'un processus ARH(1), C. R. Acad. Sci. Paris, Ser. I 332 (2002) 1015–1018.


[50] H. Cardot, Nonparametric estimation of smoothed principal components analysis of sampled noisy functions, J. Nonparametric Statist. 12 (2000) 503–538.
[51] T. Mourid, Statistiques d'une saisonnalité perturbée par un processus à représentation autorégressive, C. R. Acad. Sci. Paris, Ser. I 334 (2002) 909–912.
[52] P. Anilkumar, On estimating the mean function of a Gaussian process, Statist. Probab. Lett. 19 (1994) 77–84.
[53] A. Subramanyam, U.V. Naik-Nimbalkar, Optimal unbiased statistical estimating functions for Hilbert space-valued parameters, J. Statist. Plann. Inf. 24 (1990) 95–105.
[54] H. Cardot, F. Ferraty, P. Sarda, Spline estimators for the functional linear model, Statist. Sinica 13 (1999) 571–591.
[55] J. Damon, S. Guillas, The inclusion of exogenous variables in functional auto-regressive ozone forecasting, Environmetrics 13 (2002) 759–774.
[56] S. Guillas, Non-causalité et discrétisation fonctionnelle: théorèmes limites pour un processus ARHX(1), C. R. Acad. Sci. Paris, Ser. I 331 (2000) 91–94.
[57] S. Chen, Multi-output regression using a locally regularised orthogonal least-squares algorithm, IEE Proc. Vision Image Signal Process. 149 (2002) 185–195.
[58] S. Chen, X. Hong, C.J. Harris, Sparse kernel regression modelling using combined locally regularised orthogonal least squares and D-optimality experimental design, IEEE Trans. Automat. Control 48 (2003) 1029–1036.
[59] S.G. Koreisha, T. Pukkila, The selection of the order and identification of non-zero elements in the polynomial matrices of vector auto-regressive processes, J. Statist. Comput. Sim. 62 (1999) 207–235.
[60] R. Palaniappan, Method of identifying individuals using VEP signals and neural network, IEE Proc. Sci. Meas. Technol. 151 (2004) 16–20.
[61] P. Sykacek, S.J. Roberts, Bayesian time series classification, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, vol. 14, MIT Press, Cambridge, MA, 2002, pp. 937–944.
[62] J.R. Wolpaw, N. Birbaumer, D.J. McFarland, G. Pfurtscheller, T.M. Vaughan, Brain–computer interfaces for communications and control, Clinical Neurophysiol. 113 (2002) 767–791.
[63] B. Obermaier, C. Guger, C. Neuper, G. Pfurtscheller, Hidden Markov models for online classification of single trial EEG data, Pattern Recognition Lett. 22 (2001) 1299–1309.
[64] E. Möller, B. Schack, M. Arnold, H. Witte, Instantaneous multivariate EEG coherence analysis by means of adaptive high-dimensional auto-regressive models, J. Neurosci. Meth. 105 (2001) 143–158.
[65] F. Perrin, J. Pernier, O. Bertrand, J.F. Echallier, Spherical splines for scalp potential and current density mapping, EEG Clinical Neurophysiol. 72 (1989) 184–187.
[66] R. Prado, M. West, A.D. Krystal, Multichannel electroencephalographic analyses via dynamic regression models with time-varying lag-lead structure, J. Roy. Statist. Soc. C 50 (2001) 95–109.
[67] P.L. Purdon, V. Solo, R.M. Weisskoff, E.N. Brown, Locally regularized spatiotemporal modeling and model comparison for functional MRI, NeuroImage 14 (2001) 912–923.


[68] T. Koenig, P. Marti-Lopez, P. Valdes-Sosa, Topographic time-frequency decomposition of the EEG, NeuroImage 14 (2001) 383–390.
[69] K.J. Friston, A.P. Holmes, K.J. Worsley, J.P. Poline, C.D. Frith, R.S.J. Frackowiak, Statistical parametric maps in functional imaging: a general linear approach, Human Brain Mapping 2 (1995) 173–181.
[70] J.G. Snodgrass, M. Vanderwart, A standardised set of 260 pictures: norms for name agreement, familiarity, and visual complexity, J. Exp. Psychol.: Human Learning Memory 6 (1980) 174–215.
[71] C. Lamm, C. Windischberger, U. Ledolter, E. Moser, H. Bauer, Co-registration of EEG and MRI data using matching of spline interpolated and MRI-segmented reconstructions of the scalp surface, Brain Topography 14 (2001) 93–100.
[72] N. Locantore, J.S. Marron, D.G. Simpson, N. Tripoli, J.T. Zhang, K.L. Cohen, Robust principal component analysis for functional data, Test 8 (1999) 1–28.
[73] F.A. Ocaña, A.M. Aguilera, M.J. Valderrama, Functional principal components analysis by choice of norm, J. Mult. Anal. 71 (1999) 262–278.
[74] M.A. Cane, A. Kaplan, R.M. Miller, B. Tang, E.C. Hackert, A.J. Busalacchi, Mapping tropical Pacific sea level: data assimilation via a reduced state space Kalman filter, J. Geophys. Res. 101 (C10) (1996) 22599–22617.