
Physica D 58 (1992) 299-303 North-Holland

Some comments on a bridge between nonlinear dynamicists and statisticians Howell Tong University of Kent at Canterbury, Kent, UK Received 24 September 1991 Revised manuscript received 11 December 1991 Accepted 24 December 1991

We address the question: "Why is a bridge between the two groups desirable?"

1. Introduction

It is a well-known fact that deterministic chaos can generate data which are apparently random, and it is equally well known that from time to time statisticians have to resort to simulation to solve analytically intractable problems. One would therefore expect that there should be much common ground between the nonlinear dynamicists and the statisticians. It is interesting to note that it was only relatively recently that the two groups began to interact with each other in a constructive way. Like other historical developments, there probably does not exist a unique explanation as to why the interaction has been so slow in coming. The basic tenet, written or unwritten, in chaos is that randomness is associated with or even wholly generated by deterministic chaos. On the other hand, statisticians, whether of the frequentist or the Bayesian persuasion, accept randomness as given and try to live with it. This fundamental difference in philosophy may account for much of the differences in methodology of the two groups. Personally, I believe that it is important to probe deeper and analyse the sources of randomness, whilst accepting that there will always remain a proportion of the randomness which cannot be adequately explained. In this sense, I believe that a nonlinear dynamicist's approach deals with the deeper layer of randomness.

2. Why is a bridge between the two groups desirable?

I think a bridge is most desirable for both groups for various reasons. First, results in one area might accelerate or clarify development in the other. Second, joint efforts might help solve some hard open problems common to both groups. I shall illustrate these points with some examples, the choice and presentation of which cannot perhaps avoid subjectivity completely. (i) In chaos one typically (though not invariably nowadays) deals with very large data sets; at the recent Workshop at Warwick University, data sets of size well beyond 10^6 were frequently mentioned. On the other hand, most statisticians typically deal with a much more

0167-2789/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved


modest size, say 10^2. (This difference in emphasis is not so much a question of equipment because, with modern computing equipment available to the statisticians, the storage and manipulation of large sample data is not a serious problem.) Therefore, given the different data environment, it is perhaps not surprising that there are different methodological emphases. For example, most statisticians subscribe quite vehemently to the Principle of Parsimony: the complexity of the model must be penalised so as to produce the simplest model that one can get away with. There are, however, sound scientific reasons for always penalising over-fitting, whatever the sample size. (See, e.g., [15].) As a simple example, let a linear autoregressive (AR) model be fitted to the observations (X_1, X_2, ..., X_N), assumed Gaussian and stationary, either by the least squares method or the maximum likelihood method. It is well known that the sum of squares of the residuals (RSS) tends to decrease with increasing 'complexity', in this case the number of past values upon which each current observation is regressed, i.e. the order of the linear autoregression. A naive suggestion would then be to fit as high an order as possible. What is the price? The price is that such an over-fitted model has very low predictive power. To quantify this statement, let E(X_{N+1} − X̂_{N+1})^2 denote the mean squared error of (one-step-ahead) prediction, where X̂_{N+1} is the prediction obtained from the model fitted to the data (X_1, ..., X_N). It turns out that under general conditions for a pth order AR model, AR(p),

E(X_{N+1} − X̂_{N+1})^2 = σ_p^2 (1 + p/N) + o(1/N),   (1)

where σ_p^2 is the minimum mean squared error of prediction when the model is known. Thus the penalty is measured by the multiplying factor (1 + p/N), which increases with p. In practical terms, (1) suggests that there will come a time when the reduction in RSS (note that RSS/(N − p) is an unbiased estimate of σ_p^2) will not be sufficient to compensate for the increase in the penalty due to overfitting, thus leading to the generally poorer performance in prediction with an over-fitted model. The above discussion is quite heuristic, and the interested reader is referred to [15] and the references therein for a more detailed discussion. It seems to me that the dynamical systems community could benefit greatly from a systematic adoption of the above principle. As an example, recently [3] has considered the estimation of the embedding dimension for a "noisy" dynamical system:

X_t = F(X_{t−1}, ..., X_{t−d}) + e_t,   (2)

where F is an unknown function, {e_t} is a sequence of martingale differences with unknown variance σ^2, and d is an unknown positive integer representing the embedding dimension. The objective is to estimate d from the given observations (X_1, ..., X_N) of an assumed stationary time series with finite variance and absolutely continuous distribution. As an estimate of F(z_1, ..., z_d), we calculate

F̂_{N,\t}(z_1, z_2, ..., z_d) = [ Σ_{s≠t} X_s Π_{i=1}^d k((z_i − X_{s−i})/h_N) ] / [ Σ_{s≠t} Π_{i=1}^d k((z_i − X_{s−i})/h_N) ],   (3)

where, for the present note, k may be taken as any smooth probability density function with finite absolute mean, and h_N is called the bandwidth, which controls the amount of smoothing over the neighbourhoods of the z_i's. Note that there is some similarity between (3) and the localized receptive fields in neural networks [12]. Let F̂_N denote an analogous estimate but without any deletion. The delete-one procedure is actually a rather subtle way of imposing a penalty on overfitting. Specifically, it may be shown [3] that under quite standard conditions the following scaling law holds:


Σ_t {X_t − F̂_{N,\t}(X_{t−1}, ..., X_{t−d})}^2 = Σ_t {X_t − F̂_N(X_{t−1}, ..., X_{t−d})}^2 [1 + 2a/(N h_N^d) + o_p(1/(N h_N^d))],   (4)

where a = k(0) and o_p denotes the "little o in probability". Note that eq. (4) may be compared with eq. (1). Here the penalty term is principally due to h_N^{−d} (note that h_N^d << 1 for sufficiently large N), which increases with d, the embedding dimension. (ii) An important tool in dynamical systems is the singular value decomposition (SVD) introduced to the chaos literature by Broomhead and King [1]. The SVD is founded on the Karhunen-Loève (KL) expansion. It is perhaps pertinent to repeat briefly the KL expansion here as I believe that further exploitation in the chaos study may still be possible. (For more detail, see e.g. [16].) Let {X(t): α ≤ t ≤ β}, (α, β ∈ ℝ), be a mean-square continuous time series with Var X(t) < ∞ for all t. Let ρ(s, t) = corr(X(s), X(t)) denote the autocorrelation function. Routine consideration of the linear integral equation

∫_α^β ρ(s, t) ψ(s) ds = λ ψ(t),   (5)

where ρ(s, t) acts as the standard Hermitian positive definite kernel, yields upon Mercer's theorem the uniformly convergent series (in both variables)

ρ(s, t) = Σ_{j=1}^∞ λ_j ψ_j(s) ψ_j(t),   (6)

where the λ_j's are called the eigenvalues (> 0) and the functions ψ_j(s) are called the eigenfunctions, such that

∫_α^β ψ_j(t) ψ_k(t) dt = δ_{jk}.   (7)

A more fundamental result in this approach is that (6) implies and is implied by the following generalised spectral representation of the time series:

X(t) = Σ_{j=1}^∞ λ_j^{1/2} ψ_j(t) Z_j,   (8)

where

Z_j = λ_j^{−1/2} ∫_α^β X(t) ψ_j(t) dt,   (9)

so that

corr(Z_j, Z_k) = δ_{jk}.   (10)
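In the discrete case, where t runs over finitely many points, (5)-(10) reduce to the eigendecomposition of a correlation matrix. The following sketch is my own numerical illustration, not a computation from the text: it uses NumPy and hypothetical nonstationary data (replicated partial sums) to check the orthonormality (7), the expansion (6), and the uncorrelatedness (10).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonstationary data: m "time points", many independent
# replications; only finite variance is needed, not stationarity.
m, n_rep = 8, 5000
X = rng.standard_normal((n_rep, m)).cumsum(axis=1)   # variance grows with t

# Discrete analogue of the kernel rho(s, t) in (5): the sample correlation matrix
rho = np.corrcoef(X, rowvar=False)

# Eigenvalues lambda_j and eigenfunctions psi_j of (5), in descending order
lam, psi = np.linalg.eigh(rho)           # eigh returns ascending order
lam, psi = lam[::-1], psi[:, ::-1]

# (7): the eigenvectors are orthonormal
assert np.allclose(psi.T @ psi, np.eye(m))

# (6): rho(s, t) = sum_j lambda_j psi_j(s) psi_j(t)
assert np.allclose(rho, (psi * lam) @ psi.T)

# (9)-(10): the principal-component scores Z_j are uncorrelated
S = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise, since rho is a correlation
Z = S @ psi / np.sqrt(lam)
assert np.allclose(np.corrcoef(Z, rowvar=False), np.eye(m), atol=1e-6)
```

Note that nothing in the sketch assumes stationarity of the series: only the finite-variance correlation matrix enters.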

In statistics (or rather probability), the representation (8) is commonly called the Karhunen-Loève (KL) expansion after K. Karhunen [9] and M. Loève [10], although the same expansion was introduced independently by many others. (See, e.g., [16].) In particular, when t ∈ {1, 2, ..., m}, we have a finite collection of random variables (X(1), ..., X(m)) and the KL expansion reduces to the well-known principal component analysis introduced by H. Hotelling in 1933 in his study of educational psychology [8]. It is relevant to point out two facts, which do not seem sufficiently widely appreciated in the chaos literature. First, stationarity is not necessary in the above discussion; only finite variance is needed. This is significant because almost all the applications of the KL expansion in the chaos literature are (in my view unnecessarily) restricted to stationary time series. ([6] is a notable exception, although the authors still refer to the Toeplitz property of the "covariance" matrix.) Second, although it might be quite reasonable to arrange the sample eigenvalues (obtained from data) in descending order λ̂_1 ≥ λ̂_2 ≥ ... and use the ratio of the sum of the first few λ̂'s to the "trace" as a measure of the amount of information explained by the first few principal components, the statistical sampling properties of this ratio statistic are by no means trivial. See, e.g., [14]. It might also be worthwhile exploring the use of the Principle of Parsimony in the determination of the number of principal components in the context of chaos.

(iii) Modelling based on local function approximation is witnessing rapid development in the chaos literature. As recognised by Farmer and Sidorowich [4], Casdagli [2], Grassberger et al. [7] and Sugihara and May [13], the threshold models introduced by me in the late 1970s and early 1980s were a precursor to this development. Of course, much has developed since, as demonstrated in the above references. It is perhaps worth remembering that the basic point of the threshold models may be summarised in the form of the threshold principle, which advocates local approximation over states. I have elaborated this principle elsewhere and most recently in [15]. The fact that only one, two or three thresholds were used in the early development should be seen in perspective because (a) the choice was made merely for computational convenience (SUN workstations were not available in 1980); (b) the principle of parsimony dictated that the data sets analysed by me and my associates did not warrant too many thresholds. Of course, once the threshold principle of "divide and rule" is accepted as a useful concept, there is then no limit to the computational varieties. Indeed, the statisticians Lewis and Stevens have recently adapted the powerful numerical algorithm of multivariate adaptive regression splines (MARS) due to Friedman [5] to provide a versatile and efficient implementation of the threshold principle. (Further details may be found in ref. [11].) In fact, the area of nonparametric time series modelling provides an excellent common ground for joint exploration.
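To make the threshold principle concrete, here is a minimal sketch, entirely my own illustration rather than anything from [11] or [15]: a hypothetical two-regime threshold autoregression (SETAR) with threshold r = 0 is simulated, a separate AR(1) is fitted by least squares in each regime, and the threshold is estimated by profiling the residual sum of squares over a grid of candidate values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical SETAR(2; 1, 1) model (illustrative parameters): a different
# local AR(1) applies according to whether X_{t-1} lies below or above the
# threshold r = 0 -- local approximation over states.
n = 2000
x = np.zeros(n)
for t in range(1, n):
    if x[t - 1] <= 0.0:
        x[t] = 1.0 + 0.6 * x[t - 1] + rng.standard_normal()
    else:
        x[t] = -1.0 - 0.4 * x[t - 1] + rng.standard_normal()

def rss_given_threshold(x, r):
    """Total RSS after a least-squares intercept-plus-AR(1) fit in each regime."""
    lagged, current = x[:-1], x[1:]
    rss = 0.0
    for regime in (lagged <= r, lagged > r):
        A = np.column_stack([np.ones(regime.sum()), lagged[regime]])
        beta, *_ = np.linalg.lstsq(A, current[regime], rcond=None)
        rss += np.sum((current[regime] - A @ beta) ** 2)
    return rss

# Profile the threshold over interior sample quantiles ("divide and rule")
grid = np.quantile(x, np.linspace(0.1, 0.9, 81))
r_hat = min(grid, key=lambda r: rss_given_threshold(x, r))
print(f"estimated threshold: {r_hat:.2f}")
```

The same profiling idea extends to more thresholds or higher regime orders, at correspondingly greater computational cost.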
(iv) To fully comprehend the intimate relationship between order and disorder, low-dimensional attractors and high-dimensional attractors, and so on, is a challenge facing both the nonlinear dynamicists and the statisticians. There is much to be gained if there is closer collaboration between the two groups. Quite often, knowledge from both areas is essential in order to attack a problem of common concern. For example, consider a deterministic model

X(t) = F(X(t−1)),  t = 1, 2, 3, ...,   (11)

where X(t) ∈ ℝ. Suppose X(t) is not observable and instead we observe

Y(t) = X(t) + e(t),   (12)

where e(t) is the measurement/observation noise. The above set-up is very common in the chaos literature. Now, from (11) and (12) we may deduce that approximately

Y(t) = F(Y(t−1)) + e(t) − F′(Y(t−1)) e(t−1),   (13)

where F′ denotes the derivative. If the Lyapunov exponent of (11) is positive, so that the deterministic model is sensitive to initial conditions, then the stochastic model (13) will in general tend not to be invertible, in the sense that the noise term e(t) will not be measurable with respect to the sigma algebra generated by Y(t), Y(t−1), .... Without invertibility, statistical inference/estimation based on maximum likelihood of any parametrised form of F would be extremely difficult. The latter problem is well known in statistical nonlinear time series analysis. (See, e.g., [15], p. 309.) Another example of particular current concern, some subtleties of which might have been overlooked in the literature, has to do with a model with dynamic (i.e. system) noise:

Z_t^(ε) = F^(ε)(Z_{t−1}^(ε)) + ε_t,  t = 1, 2, ...,   (14)

where

Z_t^(0) = F^(0)(Z_{t−1}^(0))   (15)


corresponds to the underlying deterministic model of interest. For simplicity we assume that both Z and ε are real scalars. Let λ^(0) denote the Lyapunov exponent of the system (15), which is assumed to be ergodic with invariant measure μ^(0) induced by F^(0). Similarly, let λ^(ε) denote the Lyapunov exponent of the stochastic system (14), again assumed to be ergodic with invariant measure μ^(ε) induced by F^(ε). Specifically, λ^(ε) = ∫ ln |dF^(ε)(x)/dx| μ^(ε)(dx), ε ≥ 0. Now, given observations from (14), the obvious sample version λ̂^(ε), say, of λ^(ε) is a natural estimate of λ^(ε), but not of λ^(0). Therefore, we plainly need to correct λ̂^(ε) for bias if it is used to estimate λ^(0). (A similar remark applies to the Grassberger correlation dimension.) A more fundamental question, which does not seem to have been addressed, is this: whilst it is well known that λ^(0) measures the exponential divergence of two initial points in state space upon iteration under F^(0), it is not clear to me that λ^(ε) measures the "exponential divergence" of two initial distributions, which seems to me the more relevant concept to develop.
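The bias issue can be seen in a small simulation, which is my own illustration under stated assumptions: the logistic map stands in for a hypothetical F^(0), the dynamic noise is uniform, and all parameter values are arbitrary choices rather than anything from the text. The sample exponent is the time average of ln |dF/dx| along a noise-free orbit (estimating λ^(0)) and along a noisy orbit (estimating λ^(ε)), i.e. averages with respect to two different invariant measures.

```python
import numpy as np

rng = np.random.default_rng(2)
a = 3.8   # logistic-map parameter in a chaotic regime (illustrative choice)

def F(x):
    return a * x * (1.0 - x)

def dF(x):
    return a * (1.0 - 2.0 * x)

def sample_lyapunov(noise_scale, n=100_000, burn=1_000):
    """Time average of ln|dF(Z_t)| along an orbit of Z_t = F(Z_{t-1}) + eps_t."""
    x, total = 0.3, 0.0
    for t in range(n + burn):
        x = F(x) + noise_scale * rng.uniform(-1.0, 1.0)
        x = min(max(x, 1e-9), 1.0 - 1e-9)   # keep the noisy orbit inside (0, 1)
        if t >= burn:
            total += np.log(abs(dF(x)))
    return total / n

lam_0 = sample_lyapunov(0.0)     # estimates lambda^(0) of the deterministic map
lam_eps = sample_lyapunov(0.05)  # naive sample exponent under dynamic noise
print(f"lambda^(0)   estimate: {lam_0:.3f}")
print(f"lambda^(eps) estimate: {lam_eps:.3f}")
```

Because the two time averages are taken with respect to μ^(0) and μ^(ε) respectively, the second is in general a biased estimate of λ^(0); the gap between the two printed values is the bias that would need correcting.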

Acknowledgments

I would like to thank Sir David Cox and the referee for constructive comments, Professor P.G. Drazin and Dr. G.P. King for inviting me to participate in the IUTAM Symposium and NATO Advanced Research Workshop on Interpretation of Time Series from Nonlinear Mechanical Systems, and the participants for being so tolerant towards the odd statistician in their company. The fundamental difficulties described in section 2 (iv) are the result of on-going joint research with K.S. Chan, and I thank him for his comments. Partial support under the CSS initiative of the SERC is gratefully acknowledged.

References

[1] D.S. Broomhead and G.P. King, Physica D 20 (1986) 217.
[2] M. Casdagli, Physica D 35 (1989) 335.
[3] B. Cheng and H. Tong, J. R. Stat. Soc. B 54, No. 2 (1992) 427.
[4] J.D. Farmer and J.J. Sidorowich, in: Evolution, Learning and Cognition, ed. Y.C. Lee (World Scientific, Singapore, 1988).
[5] J.H. Friedman, Ann. Stat. 19 (1991) 1.
[6] M. Ghil and R. Vautard, Nature 350 (1991) 324.
[7] P. Grassberger, T. Schreiber and C. Schaffrath, Nonlinear time sequence analysis, Tech. Rep., Dept. of Physics, Univ. of Wuppertal, Germany (1991).
[8] H. Hotelling, J. Educ. Psych. 24 (1933) 417; 498.
[9] K. Karhunen, Ann. Acad. Sci. Fennicae, Ser. A I 34 (1946) 3.
[10] M. Loève, Rev. Sci. 84 (1945) 297; 84 (1946) 195.
[11] J.G. Stevens, An investigation of multivariate adaptive regression splines for modeling and analysis of univariate and semi-multivariate time series systems, Naval Postgraduate School, Monterey, California, USA (1991), unpublished.
[12] K. Stokbro, D.K. Umberger and J.A. Hertz, Complex Systems 4 (1990) 603.
[13] G. Sugihara and R.M. May, Nature 344 (1990) 734.
[14] T. Sugiyama and H. Tong, Commun. Statist. Theor. Meth. A 5(8) (1976) 711.
[15] H. Tong, Nonlinear Time Series: A Dynamical System Approach (Oxford Univ. Press, Oxford, 1990).
[16] A.M. Yaglom, Correlation Theory of Stationary and Related Random Functions I & II (Springer, Heidelberg, 1986).
[17] G.U. Yule, Philos. Trans. R. Soc. A 226 (1927) 267.