Physica D 79 (1994) 174-192
Noise reduction schemes for chaotic time series

Mike Davies

Centre for Nonlinear Dynamics, University College London, Gower Street, London WC1E 6BT, UK
Received 29 August 1992; revised 29 November 1993; accepted 6 May 1994 Communicated by A.V. Holden
Abstract
There has recently been a great deal of interest in the development of noise reduction techniques for noisy trajectories from nonlinear dynamical systems. Particular attention has been paid to their use in the analysis of chaotic time series. In this paper we consider the problem in an optimisation context. This allows us to categorise some of the commonly used techniques and to compare the performance of different methods from a theoretical standpoint, in order to identify some of the pitfalls that can occur with these methods. Finally this formulation enables us to propose a natural composite method that provides flexibility between speed and stability.
1. Introduction
Improvements in time series analysis, gained from dynamical systems theory, have led to an increased interest in data from nonlinear systems with complex dynamics. With the aid of mapping approximations and an understanding of the problems of embedding, many of the analysis techniques that have so far been restricted to numerical models are now accessible to the experimentalist. However not all experiments can be controlled enough to stop the noise in the system from becoming a problem. This has led to a great deal of research being focused upon the problem of reducing noise in such data. As in most of the previous work on noise reduction, the noise is assumed to be i.i.d. (independent and identically distributed) and the noise level is assumed to be small with respect to the level of the deterministic signal. Furthermore the noise
reduction procedure is considered separately from the approximation of the dynamics. Arguments for and against this are considered in Section 2. In this paper we treat this problem as a simple minimisation task, adjusting the trajectory to minimise the dynamic error. By definition of the term noise reduction, this approach seems quite natural. Indeed one of the original noise reduction schemes proposed by Kostelich and Yorke [1] used the idea of minimisation. However the aim there was to adjust the orbit to make it more deterministic while remaining close to the original data. Unfortunately this does not provide a unique minimisation process since it involves simultaneously minimising the distance of the new trajectory from the old one and the size of the dynamic error. Thus an arbitrary weight has to be incorporated to balance the two tasks. This is in essence an a priori estimate of the relative
amounts of dynamic and measurement noise present in the data. Here we will consider the simpler problem of merely minimising the dynamic noise as much as possible. This is not a serious simplification since, if the data comes from a hyperbolic attractor and the noise level is sufficiently small, then we know from Bowen's shadowing lemma [2] that we can reduce the dynamic error to zero. Furthermore if the data set is infinitely long then the deterministic orbit is unique. Although we will never encounter an infinitely long trajectory, we will argue that the deterministic orbit for a finite length noisy trajectory is almost unique and is only badly defined at the ends (i.e. little is gained by explicitly minimising the shadowing distance as was proposed in Farmer and Sidorowich [3]). If the data does not come from a hyperbolic attractor, or the noise level is too high to enable shadowing, then, as long as the algorithm is forced to be stable, it will cease to adjust the data when no further noise reduction is possible (i.e. the cost function will become flat). In the absence of any prior knowledge of the noise this provides a natural replacement for the arbitrary weight used by Kostelich and Yorke. A further advantage in treating the problem in this way is that many of the previously proposed schemes for noise reduction can be viewed as simple minimisation or root-finding procedures. In general these can be broadly classified as either Newton based [3,4] or gradient descent based [5-8] methods for minimising, or zeroing, the dynamic noise. Hammel [4] and Farmer and Sidorowich [3] explained some of the problems that can occur in using the Newton based approach, particularly those problems associated with homoclinic tangencies and, in Section 3.2, we provide an estimate for the ill-conditioning that these produce. Farmer and Sidorowich demonstrated that one solution is to use singular value decomposition to invert the badly conditioned matrix. However this matrix is banded and will generally have a very low bandwidth, so
it would be preferable to be able to exploit this structure and find a stable order N inversion process, as opposed to using singular value decomposition, which is order N³. It would also be preferable if manifold decomposition [4] (which can be used in the absence of homoclinic tangencies) was not needed. Manifold decomposition is not a problem in terms of speed since this is an order N process, however it requires the tangent spaces to be split into stable and unstable directions, which is difficult in the presence of tangencies. Furthermore, if a time series has been sampled from a flow, unless it can be reduced to a Poincaré map, the sampled time series will contain a zero Lyapunov exponent and manifold decomposition will not be applicable. One possible solution to this problem would be to use a gradient descent method. These are more stable than the Newton methods and work at noise levels that are significantly higher. This is because they do not assume so much about the closeness of the pseudo-orbit and the deterministic shadowing orbit. However once the gradient descent method has been chosen the experimentalist is opting for an algorithm that only converges linearly onto a deterministic orbit. Also some of these algorithms will, in general, not even converge linearly in regions close to homoclinic tangencies [8,9]. Ideally we would like an algorithm that works for large amounts of noise, whose cost is of order N and whose asymptotic convergence is quadratic rather than linear. The composite method presented below appears to achieve all these requirements. The approach used here is a quasi-Newton method (the Levenberg-Marquardt method [10]). This is shown to provide a simple and effective solution to the instabilities and convergence problems that are associated with the Newton method and the gradient descent method when they are applied individually. We conclude the paper with two examples of noise reduction. In the first example we compare the gradient descent, manifold decomposition
and Levenberg-Marquardt algorithms when applied to the Hénon map with knowledge of the dynamics. Finally we apply the gradient descent and Levenberg-Marquardt algorithms to data taken from the Lorenz equations. Here the dynamics has to be approximated and it is clear that this provides the main limitation to the effectiveness of the noise reduction algorithm.
2. Function approximation and noise reduction

When the dynamics are not known a priori and we are merely presented with a scalar time series it is necessary to initially embed the data in a reconstructed state space. From Takens' theorem [11] we know that we can achieve this using delay vectors. Furthermore these can be generalised to combinations of linearly filtered delay coordinates (see [12-14]). For a detailed discussion of optimising the embedding we refer the reader to Casdagli et al. [15]. Once the data has been embedded we can model the dynamics by fitting a function to map points forward in time (although this is the general approach, Schreiber and Grassberger use a non-predictive function). There are a variety of methods for approximating the mapping function, either locally [16] or globally [17], however the basic idea is the same in each case. Having defined a space of functions, we wish to choose the function that best approximates the dynamics. Generally this means minimising the errors, in a least squares sense, where the error terms are defined as:
$$\varepsilon_i = x_i - f(x_{i-1})\,, \qquad (1)$$
where ε_i is the approximation error associated with x_i, the ith point in the series, and f is the approximating function. Clearly the number of parameters classifying the space of functions considered must be a balance between not being too restrictive (i.e. not modelling the dynamics well enough) and not overfitting the dynamics
and modelling the noise (if the number of parameters is equal to the number of data points then the function will merely interpolate the data). The aim then is to remove the approximation errors from the orbit. Unfortunately it is clear from Eq. (1) that the approximation errors are not directly decomposable from the trajectory; however, if the mapping function in the linear neighbourhood of the trajectory is considered it is possible to estimate the position of a deterministic orbit that 'shadows' the noisy one [4]. This approach to noise reduction is equivalent to solving an n-dimensional root finding problem using a Newton-Raphson algorithm. Another method is to estimate the position of an orbit within this linear neighbourhood that merely has less noise than the original orbit. This method can be shown to be equivalent to an n-dimensional minimisation problem using a gradient descent method. Although we are considering the noise reduction procedure separately from the function approximation it is wise to alternate between the two processes so that after each noise reduction iterate the dynamics are refitted. This is because both procedures aim to minimise the approximation errors, only with respect to different parameters. Indeed, if the function approximation has a parametric formulation, it is possible to reduce the problem to a single extended minimisation procedure that adjusts the function parameters and the trajectory simultaneously. However this will not be dealt with here and we refer the reader to Davies and Stark [18]. Thus noise reduction can either be viewed as a minimisation problem (one of minimising the noise in the time series), or a root finding problem (finding the trajectory with zero noise). The advantage of using a Newton based method is obviously speed since such algorithms converge at a quadratic rate onto the solution. However the cost of such a method is stability. A Newton method is not very stable once the assumption of linearisation becomes questionable. Below, in the next two sections, we briefly
review these two approaches to noise reduction, before proposing a composite method at the end of Section 4.
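The paper itself contains no code; as an illustration only, the following Python sketch embeds a scalar series in delay coordinates, fits a crude global affine model as a stand-in for the local [16] or radial basis function [17] approximators discussed above, and evaluates the errors of Eq. (1). The function and variable names are our own and the affine model is merely a placeholder.

```python
import numpy as np

def delay_embed(series, dim, delay=1):
    """Embed a scalar time series in dim-dimensional delay coordinates."""
    n = len(series) - (dim - 1) * delay
    return np.array([series[i:i + (dim - 1) * delay + 1:delay] for i in range(n)])

def fit_affine_map(traj):
    """Least-squares fit of a global affine model x_i ~ A x_{i-1} + c.
    Only a placeholder for the local or radial-basis-function models cited in the text."""
    X = np.hstack([traj[:-1], np.ones((len(traj) - 1, 1))])
    C, *_ = np.linalg.lstsq(X, traj[1:], rcond=None)
    A, c = C[:-1].T, C[-1]
    return lambda x: A @ x + c

def dynamic_errors(traj, f):
    """Approximation errors of Eq. (1): eps_i = x_i - f(x_{i-1})."""
    return traj[1:] - np.array([f(x) for x in traj[:-1]])
```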
3. The root-finding problem

Before considering the noise reduction problem in terms of minimisation we consider some of the methods based on zeroing the dynamic error. Whereas there are several simple algorithms for finding zeros of functions mapping ℝ → ℝ [19], the problem is not as easy for higher dimensional functions. Essentially the only practical way of root finding in high dimensions is to use the Newton-Raphson approach. This method is well known and commonly found in numerical procedures. It is based on linearising the function to obtain an estimate for the position of the root. If the root is assumed to lie within the local linear neighbourhood of the initial guess then its position can be determined directly:

$$x_{\mathrm{new}} = x_{\mathrm{cur}} - g'(x_{\mathrm{cur}})^{-1}\,g(x_{\mathrm{cur}})\,, \qquad (2)$$
where g(x) is the error function in this case and g'(x) is the derivative with respect to x. As long as the initial guess is close enough to the zero the algorithm will converge onto it at a quadratic rate, see for example [19]. To apply it to the noise reduction problem the error function, given in Eq. (1), must first be linearised. Then Eq. (2) can be written in matrix form as:

$$D\,(x_{\mathrm{new}} - x_{\mathrm{cur}}) = -\varepsilon\,, \qquad
D = \begin{pmatrix}
-J_1 & I    &        &          &   \\
     & -J_2 & I      &          &   \\
     &      & \ddots & \ddots   &   \\
     &      &        & -J_{n-1} & I
\end{pmatrix}, \qquad (3)$$

where J_i is the Jacobian of the mapping function f evaluated at the ith data point x_i. It is immediately apparent from Eq. (3) that to solve this for x_new requires inverting the matrix D, which is not
possible since D is not a square matrix but instead is (n-1) × n, with d-dimensional elements, where d is the dimension of the reconstructed state space. This is because the number of data points is greater than the number of error terms. It can only be solved by adding extra constraints. One method that applies a reasonable set of additional constraints was proposed by Hammel [4] and is briefly reviewed below.
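For concreteness, a small Python sketch (ours, not part of the paper) of how the block matrix D of Eq. (3) could be assembled from the Jacobians evaluated along the noisy orbit. A dense array is used purely for readability; in practice one would store only the two non-zero block diagonals.

```python
import numpy as np

def build_D(jacobians):
    """Assemble the (n-1)d x nd linearisation matrix of Eq. (3).
    jacobians: list of the d x d Jacobians J_1, ..., J_{n-1} along the orbit."""
    d = jacobians[0].shape[0]
    m = len(jacobians)                                   # number of error terms, n - 1
    D = np.zeros((m * d, (m + 1) * d))
    for i, J in enumerate(jacobians):
        D[i*d:(i+1)*d, i*d:(i+1)*d] = -J                 # block -J_i
        D[i*d:(i+1)*d, (i+1)*d:(i+2)*d] = np.eye(d)      # identity block
    return D
```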
3.1. Manifold decomposition

Hammel's method for noise reduction originated from earlier work on the shadowing problem, Hammel et al. [20], in which they proved that finite lengths of numerical approximations to the Hénon map actually shadow real trajectories to within a given distance. The problem of solving Eq. (3) is to choose an extra d-dimensional constraint that produces a solution close to the original data. If, for example, the correction at the first point is set to zero (δx_1 = 0) as the additional constraint then Eq. (3) can be solved. However if the time series is chaotic this constraint forces the deterministic orbit to start at x_1, thus the deterministic orbit will generically diverge exponentially from the original and not produce a close orbit. The method proposed by Hammel was to identify the stable and unstable directions in the local neighbourhood of the noisy orbit. Then if the errors in the stable subspace at the beginning of the trajectory are set to zero, their effect will decay as they are propagated along the data as i → n. Similarly if the errors tangential to the unstable subspace are set to zero at the end of the trajectory, these constraints will decay when propagated towards the beginning of the time series. This allows Eq. (3) to be solved in two parts:
$$(x_{\mathrm{new}} - x_{\mathrm{cur}})^s_{i+1} = J^s_i\,(x_{\mathrm{new}} - x_{\mathrm{cur}})^s_i - \varepsilon^s_{i+1}\,, \qquad
(x_{\mathrm{new}} - x_{\mathrm{cur}})^u_i = (J^u_i)^{-1}\bigl((x_{\mathrm{new}} - x_{\mathrm{cur}})^u_{i+1} + \varepsilon^u_{i+1}\bigr)\,, \qquad (4)$$

where the superscripts s and u refer to the stable
and unstable decompositions respectively. The stable decomposition can then be stably iterated forwards and the unstable decomposition can be stably iterated backwards. For more details see Hammel [4].
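A full implementation of Hammel's method needs a reliable splitting into stable and unstable subspaces. The sketch below (ours, heavily simplified and not taken from [4]) carries out the two passes of Eq. (4) for a two-dimensional map, under the assumption that unit vectors s_i, u_i spanning those subspaces and rates lam_s[i], lam_u[i] (with J_i s_i ≈ lam_s[i] s_{i+1} and J_i u_i ≈ lam_u[i] u_{i+1}) are supplied by the caller.

```python
import numpy as np

def manifold_decomposition_pass(traj, eps, s, u, lam_s, lam_u):
    """One two-part solve of Eq. (4) for a 2-D map.
    traj: (n, 2) noisy orbit; eps: (n, 2) with eps[i] = x_i - f(x_{i-1}) (eps[0] unused);
    s, u: (n, 2) stable/unstable unit vectors; lam_s, lam_u: (n-1,) contraction/expansion rates.
    Returns the corrected orbit."""
    n = len(traj)
    a = np.zeros(n)
    b = np.zeros(n)
    for i in range(1, n):
        # coefficients of eps[i] in the (generally non-orthogonal) basis (s_i, u_i)
        a[i], b[i] = np.linalg.solve(np.column_stack([s[i], u[i]]), eps[i])
    p = np.zeros(n)                      # stable coefficients of the correction, p[0] = 0
    q = np.zeros(n)                      # unstable coefficients, q[n-1] = 0
    for i in range(n - 1):               # stable part iterated stably forwards
        p[i + 1] = lam_s[i] * p[i] - a[i + 1]
    for i in range(n - 2, -1, -1):       # unstable part iterated stably backwards
        q[i] = (q[i + 1] + b[i + 1]) / lam_u[i]
    return traj + p[:, None] * s + q[:, None] * u
```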
3.2. Tangencies imply ill-conditioning of D

Although manifold decomposition provides a simple method for solving the problem of constraints it relies on the ability to decompose the tangent spaces into stable and unstable subspaces. Unfortunately most chaotic attractors do not possess the necessary hyperbolic structure everywhere since they contain homoclinic tangencies. In practice this means there will be points in the time series where the stable and unstable subspaces become almost tangential and therefore do not span the tangent space accurately. This manifests itself in the form of the matrix D, with the additional constraints imposed by Hammel's method, becoming singular, as we will show below. In this context our definition of a near-tangency in the linearised problem is as follows:
Definition. ε-near-tangency. There exists an ε-near-tangency at the mth point in the pseudo-orbit if we can find unit vectors u_m, in the unstable subspace, and s_m, in the stable subspace, such that s_m · u_m = cos θ and 0 < sin 2θ/(1 + cos 2θ) ≤ ε.

The unstable and stable subspaces here are defined by the directions associated with the singular values of J^n that are greater than or less than one respectively. We can now find a lower bound for the condition number of D, with the added constraints δx_1^s = δx_n^u = 0, as a function of the angle between the directions s_m and u_m at some position m in the trajectory.

First we can show that D has at least one singular value greater than or equal to one: i.e. there exists a δx such that ||δx|| ≤ ||e||, where e = D δx. Choose δx such that δx_i = 0 if i ≠ m and δx_m ≠ 0. Then we calculate e from Eq. (3): e_i = 0 if i ∉ {m, m+1}, e_m = δx_m and e_{m+1} = -J_m δx_m. Since ||e_m|| = ||δx_m|| we know that ||δx|| ≤ ||e|| and thus D has a singular value greater than or equal to one.

Now, given an ε-near-tangency at the mth point we can construct an upper bound for the smallest singular value. To do this it is necessary to find an e for which the ratio ||δx||/||e|| is large, since the smallest singular value is bounded above by ||e||/||δx||. Let us choose e such that e_i = 0 if i ≠ m and e_m = α u_m + β s_m. Then we can solve for δx using the method outlined in the last section. Thus we know that:

$$\delta x^s_i = 0\,,\quad i < m\,, \qquad \delta x^u_i = 0\,,\quad i \ge m\,, \qquad \delta x^s_m = -e^s_m\,, \qquad (5)$$

and ||δx|| ≥ ||δx^s_m|| = |β|. If we now choose e_m · s_m = 0 we can write:

$$\alpha = e_m \cdot u_m - \beta\cos\theta\,, \qquad \beta = -\alpha\cos\theta\,, \qquad (6)$$

and since e_m · s_m = 0 and u_m · s_m = cos θ, then e_m · u_m = ||e_m|| sin θ and we can write β in terms of ||e_m||:

$$|\beta| = \frac{\|e_m\|\,(1 + \cos 2\theta)}{\sin 2\theta}\,, \qquad (7)$$

and by construction ||e|| = ||e_m||. Therefore we have:

$$\frac{\|e\|}{\|\delta x\|} \le \frac{\sin 2\theta}{1 + \cos 2\theta} \le \varepsilon\,. \qquad (8)$$

Thus, under the assumptions set out above, we have shown that the condition number of D with the additional constraints is greater than 1/ε.
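This bound is easy to probe numerically. The fragment below is an illustration of ours, not the paper's procedure: for a two-dimensional map it appends crude end constraints to a dense copy of D (assembled as in Eq. (3)) along a supplied stable unit vector s1 at the first point and unstable unit vector un at the last point, and returns the condition number from an SVD. For orbits passing close to a tangency the ratio grows roughly like 1/ε.

```python
import numpy as np

def condition_with_end_constraints(D, s1, un):
    """Condition number of D after adding the constraints that the correction at the
    first point has no component along s1 and the correction at the last point has no
    component along un.  Illustrative only; it uses an orthogonal projection where the
    text uses the oblique stable/unstable split."""
    n_cols = D.shape[1]
    d = len(s1)
    r1 = np.zeros(n_cols); r1[:d] = s1
    r2 = np.zeros(n_cols); r2[-d:] = un
    Dc = np.vstack([D, r1, r2])
    svals = np.linalg.svd(Dc, compute_uv=False)
    return svals.max() / svals.min()
```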
Farmer and Sidorowich [3] suggested overcoming this problem by using singular value decomposition to invert the matrix stably. However this decomposition could not be used for the whole data set since it is an order N³ process, therefore for large time series this method would be impractical. A composite solution was proposed by Farmer and Sidorowich in which the stable and unstable directions are identified and then singular value decomposition is applied in the regions where the subspaces are almost tangential. Elsewhere manifold decomposition is used. Apart from being a little messy there is a further problem of ill-conditioning that cannot be solved in this way. If the time series is a sampled data set taken from a flow then generally the time series will have a zero Lyapunov exponent. This means there will be another zero singular value associated with the matrix D. However this singular value does not manifest itself locally in the time series. Thus the above approach is only really applicable to maps, or to flows sampled in such a way that a Poincaré map can be obtained. To solve the problem in the case of a flow a different method is necessary.

3.3. Implicit and explicit shadowing

In Section 3 we noted that there is no unique solution to the zero finding problem. Instead there is a d-dimensional family of solutions, where d is the dimension of the state space. In terms of Hammel's method this set can be realised by perturbing the end constraints that were set to zero in the stable and unstable directions. Thus, if we chose non-zero end constraints, we would obtain a different solution. This poses the question: can we do any better? The answer is, theoretically, yes we can. For example we could explicitly search for the closest solution to the original pseudo-orbit. This was proposed by Farmer and Sidorowich [3] and was thus called 'optimal shadowing' since, in finding the deterministic orbit that is closest to the noisy orbit, it can be viewed as a maximum likelihood estimate. They applied this by formulating the following Lagrangian function:
$$L = \sum_{k=1}^{N} \|y_k - x_k\|^2 + 2\sum_{k=1}^{N-1} \bigl[f(x_k) - x_{k+1}\bigr]^T \lambda_k\,, \qquad (9)$$

where f(x) is the mapping function as before, x_k is the adjusted trajectory, y_k is the noisy trajectory and λ_k are the Lagrange multipliers.
Farmer and Sidorowich linearised this function and solved the approximation iteratively using the Newton-Raphson method. Here the additional objective of optimising the shadowing removes the rank deficiency in the problem. Unfortunately the problem can still be ill-conditioned. It should be noted that this is not due to the dynamics being chaotic, as was suggested by Farmer and Sidorowich [3], but is due to the presence of homoclinic tangencies. At first it might appear that the minimisation process could remove these singularities in the same way that it does for the rank deficiency described above, but there is a fundamental difference. The rank deficiency in the matrix D was due to the fact that it was a non-square matrix and therefore had singular values that were zero, whereas in general a finite length orbit from a dynamical system will not have any exact tangencies. Instead near-tangencies will result in some singular values becoming very small, but while the singular values remain strictly positive the linearised equations still have an exact solution (however absurd this is). This is a common predicament in least squares estimation since a problem can be ill-conditioned even when a set of equations is over-determined. Since formulating the problem in this form has not made the algorithm any more stable in the presence of tangencies it is necessary to consider what benefits will be gained by this additional work. Obviously it is likely to find an orbit that is closer to the pseudo-orbit than the method of manifold decomposition, but how much closer is "closer"? This can be determined by decomposing the problem into the method of Hammel and the optimisation of the initial constraints δx_1^s and δx_n^u (this is equivalent to the solution by Newton in Farmer and Sidorowich). If the system has well-defined Lyapunov exponents then perturbations applied to the initial conditions will introduce changes at the ends of the solution and their effect will decay exponentially away from the end points. The only other points that are
badly defined are the tangencies due to the problem being badly conditioned. Also since the orbit can be considered to be "almost" unique (due to the shadowing lemma) away from the ends and the tangencies the maximum likelihood estimate is only using information local to the end points and is therefore not going to produce a statistically significant improvement. This emphasises the importance of the uniqueness of the shadowing orbit (and the convergence of finite length orbits onto it as the length tends to infinity). It is this fact and not a statistical argument that makes chaotic noise reduction so powerful.
4. Minimising the dynamic noise and weak shadowing

In this section the problem is reformulated as a minimisation problem. This is a less ambitious aim and is not confined to the assumption that a deterministic orbit and the original noisy orbit lie approximately within the same linear neighbourhood, because a minimisation strategy can search for a less noisy orbit close to the original noisy data. One consequence of this is that a minimisation algorithm will not necessarily be forced to be unstable in situations where there are no close true orbits, as is the case when homoclinic tangencies occur in the dynamics. In the minimisation problem the aim becomes one of minimising the following cost function:

$$H = \sum_{i=2}^{n} \varepsilon_i^2\,, \qquad (10)$$

where the error term is the same as that defined in Eq. (1). Two methods are discussed below to solve the n-dimensional minimisation problem. The first method is a solution by gradient descent and the second approach is a compromise between the gradient descent method and the minimisation analogue of the Newton-Raphson algorithm.

4.1. Solution by gradient descent

The simplest and most robust method for minimising a function is to take small steps 'down hill'. This is best achieved using a steepest descent method. One major advantage of this over the Newton method is that it makes no assumption about the form of the cost function and merely requires a smoothness criterion. The basic algorithm is:

$$x_{\mathrm{new}} = x_{\mathrm{cur}} - \mathrm{const.}\times\nabla H(x)\,, \qquad (11)$$

where the constant defines the size of the step. ∇H can be determined by differentiating Eq. (10):

$$\frac{\partial H}{\partial x_k} = 2\sum_i \left(\frac{\partial\varepsilon_i}{\partial x_k}\right)^{\!T}\varepsilon_i\,. \qquad (12)$$

This evaluates the partial gradient for the kth point in the state space. Although Eq. (12) has to be evaluated for each point in the state space this is not difficult, since the definition of the error function in Eq. (1) means that virtually all the terms in the summation are zero:

$$\frac{\partial\varepsilon_i}{\partial x_k} = 0\,, \qquad i \ne k,\ k+1\,. \qquad (13)$$

Therefore Eq. (12) can be rewritten as:

$$\frac{\partial H}{\partial x_k} = 2\bigl(\varepsilon_k - J_k^T\varepsilon_{k+1}\bigr)\,, \qquad (14)$$
where J_k is the Jacobian of the approximating function f at the kth point x_k. ∇H can therefore be calculated in order n operations and the routine can be iterated until the level of noise is low enough. It is interesting that the gradient descent approach does not suffer from the fact that the problem is ill-posed, or from the ill-conditioning that will be introduced by homoclinic tangencies, thus no extra constraints need to be added. This is essentially because these singularities in D are equivalent to directions on the cost function that are flat, therefore the gradient descent method of Eq. (11) will always step perpendicular to the singular directions (it is no coincidence that
∇H = 2D^T ε and hence the gradient descent direction projects out the singular directions in D). However homoclinic tangencies do still cause some problems when using a gradient descent algorithm since, when the gradient of the cost function becomes almost flat, repeated iterations of the algorithm cease to reduce the error at a linear rate. This was also observed numerically by Grassberger et al. [9]. However it is infinitely more preferable for a noise reduction algorithm to grind to a halt at tangencies than for it to become unstable.
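Putting Eqs. (11) and (14) together, one sweep of the gradient descent scheme takes only a few lines. The sketch below is ours, with hypothetical names: it assumes a model map f and its Jacobian jac are available as callables and that traj is the current (n, d) estimate of the orbit.

```python
import numpy as np

def gradient_descent_step(traj, f, jac, step):
    """One gradient descent iteration of Eqs. (11) and (14)."""
    n = len(traj)
    eps = np.zeros_like(traj)
    eps[1:] = traj[1:] - np.array([f(x) for x in traj[:-1]])   # eps_i = x_i - f(x_{i-1})
    grad = np.zeros_like(traj)
    for k in range(n):
        grad[k] = 2.0 * eps[k]                                  # the eps_k term (zero for the first point)
        if k + 1 < n:
            grad[k] -= 2.0 * jac(traj[k]).T @ eps[k + 1]        # the -J_k^T eps_{k+1} term of Eq. (14)
    return traj - step * grad                                   # Eq. (11)
```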
4.2. A comparison with other methods

It is convenient, at this point, to compare the ideas presented in the last section to some other schemes that have been proposed for noise reduction algorithms. The methods of Hammel [4] and Farmer and Sidorowich [3] have already been reviewed in some detail above, but recently some new ideas for stable noise reduction algorithms in the presence of large noise levels have been proposed by Sauer [7], Cawley and Hsu [6] and Schreiber and Grassberger [5]. The aim of these new methods is to be able to filter data when the noise levels are too high to be able to successfully apply the methods of Hammel or Farmer and Sidorowich. We have already shown that, in the context of the minimisation problem presented above, stable noise reduction for high noise levels can be achieved by taking small enough steps down the cost function. In this section we will demonstrate that the above methods can be considered in terms of minimisation and that they can be classed as generalised gradient descent algorithms. The algorithms presented by Cawley and Hsu and Sauer are very similar although the specific embedding methods differ slightly. These involve advanced embedding techniques such as global and local singular systems analysis. However, such methods take no account of what adjustments are going to be made and hence the noise reduction procedure can be treated separately.
Initially Cawley and Hsu use local singular systems analysis to approximate some neighbourhood of the manifold, on which the dynamics acts, as a local linear submanifold. This over-embeds the local neighbourhood and can be thought of as a local linear predictor (singular systems analysis gives the total least squares estimate for this problem: see Golub and Van Loan [21]). We can therefore formulate their method in terms of the prediction errors that result. Here we consider the mapping f: ℝ^d → ℝ^d over-embedded in a 2d-dimensional space, such that the dynamics acts on a co-dimension d submanifold that is merely the graph from ℝ^d to ℝ^d: see Fig. 1. If we now assume that the local singular systems analysis accurately determines this submanifold and that the noise level is low, then we can linearise the problem as before. The next step in Cawley and Hsu's algorithm is to project the data orthogonally onto (or at least towards) the submanifold. In Fig. 1 this gives us a new point for each pair {x_i, x_{i+1}}, which in turn gives us two new values for each x_i. In terms of the prediction error and the Jacobian, J_i, of the mapping we can write these two values as follows:
$$x_{i,\mathrm{new}} = x_{i,\mathrm{old}} - \delta\,\varepsilon_i\,, \qquad (15)$$
Fig. 1. A graphical representation of the projection noise reduction method of Cawley and Hsu [6] and Sauer [7]. This method can be shown to be equivalent to noise reduction by gradient descent.
and

$$x_{i,\mathrm{new}} = x_{i,\mathrm{old}} + \delta\,J_i^T\varepsilon_{i+1}\,, \qquad (16)$$
where δ is a small constant defining the extent to which the points are adjusted towards the corrected values. To achieve consistency a compromise then has to be sought. This is arrived at by averaging the two corrections (in general, over-embedding the data further will result in multiple corrections but the concept is the same):

$$x_{i,\mathrm{new}} = x_{i,\mathrm{old}} - \delta\,\bigl(\varepsilon_i - J_i^T\varepsilon_{i+1}\bigr)\,. \qquad (17)$$
A comparison between this equation and Eq. (11) makes it immediately obvious that the method of Cawley and Hsu (and similarly Sauer's method) is a reformulation of the steepest descent method for the cost function given in Eq. (10).

Finally the steepest descent approach can be compared to an algorithm proposed by Schreiber and Grassberger [5]. This method makes similar small adjustments to the trajectory to reduce the overall dynamic error. However the adjustment is just δ times the approximation error:

$$x_{i,\mathrm{new}} = x_{i,\mathrm{old}} + \delta\,\varepsilon_i\,. \qquad (18)$$

If this error were the same as that defined in Eq. (1) then the resulting noise reduction would only take place in the stable direction; therefore they used a function approximation that contained information from both the past and the future, for example:

$$x_i = F(x_{i-d_1}, x_{i-d_1+1}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+d_2})\,, \qquad (19)$$

where d_1 and d_2 are the dimensions of the past and future parts of the delay vector. In such a case the steepest descent adjustment for the error function would be:

$$\frac{\partial H}{\partial x_i} = \varepsilon_i - \frac{\partial F_{i-d_1}}{\partial x_i}\,\varepsilon_{i-d_1} - \cdots - \frac{\partial F_{i-1}}{\partial x_i}\,\varepsilon_{i-1} - \frac{\partial F_{i+1}}{\partial x_i}\,\varepsilon_{i+1} - \cdots - \frac{\partial F_{i+d_2}}{\partial x_i}\,\varepsilon_{i+d_2}\,, \qquad (20)$$

where everything is defined as above. The argument for just using the error term is that, since the function approximation is non-predictive, the additional terms will on average be small and can therefore be ignored. This was proved for a simple case using the Baker map by Schreiber and Grassberger [5]. Thus the method is similar, although not identical, to the steepest descent algorithm, although it is not at all obvious that it will always reduce noise in the general case.

4.3. Solution by Levenberg-Marquardt

The minimisation analogue to solving the root-finding problem by Newton-Raphson is trivial and provides no real advantages over the method described in Section 3.2. However it is presented briefly since it will be used below. The Newton method for solving Eq. (10) is to assume that the trajectory for the minimum error lies within the linear neighbourhood of the noisy trajectory. The cost function can then be modelled with a quadratic form and a closed form solution can be written for the minimum:
$$x_{\min} = x_{\mathrm{cur}} - \left(\frac{\partial^2 H}{\partial x^2}\right)^{-1}\nabla H\,. \qquad (21)$$
Using the linearisation of the dynamical system given in Eq. (3) this can be written as:

$$x_{\min} = x_{\mathrm{cur}} - \bigl(2D^T D\bigr)^{-1}\nabla H\,, \qquad (22)$$
where D is the matrix that was defined before. In fact further inspection will show that this equation can be arrived at by merely pre-multiplying Eq. (3) by D^T and rearranging (remember ∇H = 2D^T ε). However Eq. (22) cannot easily be solved since the problem is still rank deficient. What we would like is to achieve the stability of the gradient descent method described in the last section without sacrificing the speed of the Newton method (Newton methods converge quadratically when stable, whereas gradient descent methods only converge at a linear rate). This is in fact possible by constraining the solution to have a restricted step. Such an approach falls
under the category of Levenberg-Marquardt methods (see for example Fletcher [10]). This method provides a simple way of stabilising the Newton method while retaining the order N speed and quadratic convergence. Heuristically the idea is to produce an algorithm that can smoothly be changed from the gradient descent algorithm of Eq. (11) to the Newton method of Eq. (22). A closer look at these two equations shows that they are identical apart from the term in front of ∇H. Thus one obvious compromise is to use:

$$x_{\mathrm{new}} = x_{\mathrm{cur}} - \tfrac{1}{2}\bigl(D^T D + \delta I\bigr)^{-1}\nabla H(x)\,, \qquad (23)$$
where I is the identity matrix and δ is an arbitrary weight that defines whether the algorithm behaves more as a gradient descent or more as a Newton approach. If δ is zero then Eq. (23) is identical to Eq. (22). However when δ becomes large the inverse becomes approximately (1/δ)I, which is equivalent to Eq. (11). Fig. 2 demonstrates the effect of this algorithm for a simple two dimensional cost function.
Fig. 2. Levenberg-Marquardt applied to a simple two dimensional quadratic potential well. Starting at A the estimate for the minimum lies on the dotted line between A and B. If δ is small the solution is near B and is equivalent to the Newton method. If δ is large the solution is near A and is equivalent to the steepest descent solution.
The curve shows the set of solutions for different values of δ that map the point A towards the minimum B. For large values of δ the curve descends down the line of steepest gradient, however as δ → 0 the solution moves more towards the minimum. In fact the theoretical arguments for the Levenberg-Marquardt procedure are more rigorous than this and it can be shown to find the minimum value of H under the constraint that the step size is restricted. Also, if an appropriate method for selecting δ is used, the resulting algorithm can be proved to be globally convergent and it can be shown that the convergence rate is quadratic (see Fletcher [10] for a detailed exposition of the subject). The most important benefit of this method is its effect on the singularities of D. Consider the singular value decomposition D = USV^T. Then the matrix M = (D^T D + δI) can be written as:
$$M = V\,(S^2 + \delta I)\,V^T\,. \qquad (24)$$
Hence even though D may be singular, the singular values of M are bounded from below by δ. Indeed a further impressive feature of this algorithm is that if the non-zero singular values are bounded away from zero then, as δ → 0, the solution from Eq. (23) will tend to the solution of the rank deficient minimisation problem given by the Moore-Penrose pseudo-inverse, which is in some sense optimal. In practice however δ cannot be allowed to become too small, otherwise implementation of the algorithm will lead to numerical instabilities. Thus, in the directions where the singular values are identically zero this algorithm reverts completely to the steepest descent algorithm (i.e. it does not step in these directions). Also in the case of near homoclinic tangencies, where the singular values will be small but finite, although the gradient descent will dominate, this algorithm will step a small amount in the associated directions. This appears to be enough to solve the convergence problem that can occur when gradient descent is used on its own. Finally implementation of this method is simple since M is forced to be well-conditioned (and
diagonally dominant). This allows the use of sparse matrix techniques to exploit the matrix's banded structure. Therefore Eq. (23) can be solved by applying Cholesky decomposition to M such that M = GG^T, where G is a lower triangular matrix, and then solving by applying forward and backward substitution to G and G^T respectively. All these procedures are order N and are readily found in various numerical texts, for example Golub and Van Loan [21]. In the case of high noise levels, when the linearity assumption needs to be relaxed, a larger δ can be used to restrict the step size allowable at each iteration, thereby increasing the stability of the algorithm. Although it is possible to implement an adaptive algorithm for δ (see Press et al. [19]), here we will only consider using a fixed δ to impose a certain amount of regularisation on the problem. The flexibility of the Levenberg-Marquardt algorithm makes it our preferred choice for a noise reduction procedure since the extra overheads of using this algorithm as opposed to the gradient descent method are minimal, and indeed trivial if the dynamics are also being estimated. However the speed gains can be significant, as we will show in the next section.
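As a concrete illustration of Eq. (23) (again ours, not reproduced from the paper), the step below assembles D, forms M = D^T D + δI and solves the system with a Cholesky factorisation as described above. Dense arrays are used for brevity; exploiting the banded structure of M is what keeps the cost at order N in practice.

```python
import numpy as np

def levenberg_marquardt_step(traj, f, jac, delta):
    """One Levenberg-Marquardt iteration, Eq. (23): solve (D^T D + delta I) dx = -D^T eps.
    traj: (n, d) current orbit estimate; f, jac: model map and its Jacobian; delta > 0."""
    n, d = traj.shape
    eps = traj[1:] - np.array([f(x) for x in traj[:-1]])        # eps_i = x_i - f(x_{i-1})
    # assemble D as in Eq. (3)
    D = np.zeros(((n - 1) * d, n * d))
    for i in range(n - 1):
        D[i*d:(i+1)*d, i*d:(i+1)*d] = -jac(traj[i])
        D[i*d:(i+1)*d, (i+1)*d:(i+2)*d] = np.eye(d)
    M = D.T @ D + delta * np.eye(n * d)
    rhs = -D.T @ eps.reshape(-1)
    L = np.linalg.cholesky(M)                 # M = G G^T as in the text
    dx = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
    return traj + dx.reshape(n, d)
```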
5.1. Example 1: the Hénon map
In this section we take a deterministic time series from a simple two-dimensional dynamical system (the Hénon map) and add measurement noise onto it. Then it is possible to compare the effects of filtering the data with the three different methods described above: manifold decomposition, Levenberg-Marquardt and solution by gradient descent. The details of the comparison are given below. The time series came from the Hénon map:

$$x_n = 1 + y_{n-1} - a x_{n-1}^2\,, \qquad y_n = b x_{n-1}\,, \qquad (25)$$

where the parameter settings are a = 1.4, b = 0.3. The time series taken was the x coordinate, starting at x_1 = x_2 = 0.5. Onto this 10% measurement noise was added, composed of a random i.i.d. signal with uniform distribution and zero mean. The data was then embedded in a delay space (delay = 1).
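The data for this example are easily regenerated. The following sketch is ours; since the paper does not state how the 10% noise level is normalised, it assumes 10% of the signal's standard deviation, and the series length and random seed are arbitrary choices.

```python
import numpy as np

def henon_x(n, a=1.4, b=0.3):
    """x coordinate of the Henon map, Eq. (25), started from x_1 = x_2 = 0.5."""
    x = np.empty(n)
    x[0] = x[1] = 0.5
    for i in range(2, n):
        x[i] = 1.0 + b * x[i - 2] - a * x[i - 1] ** 2   # x_n = 1 + y_{n-1} - a x_{n-1}^2, y_n = b x_{n-1}
    return x

rng = np.random.default_rng(0)
clean = henon_x(1000)
level = 0.1 * clean.std()                                # assumed normalisation of "10% noise"
noisy = clean + rng.uniform(-np.sqrt(3) * level, np.sqrt(3) * level, size=clean.shape)
embedded = np.column_stack([noisy[:-1], noisy[1:]])      # delay space, delay = 1
```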
Fig. 3 shows plots of the embedded data both with and without the noise added. Throughout this comparison the equations of the mapping function are assumed to be known (an alternative case is considered below) and are used to calculate the values of the derivatives for the algorithms. All three algorithms are iterated 10 times and the resulting filtered time series are compared. Two measures of performance are taken. The first, the dynamic error, defined as z_i - f(z_{i-1}), where z_i is the filtered time series, measures how deterministic the resulting trajectory is. The second, the measurement error, defined as the distance ||x_i - z_i|| between the original deterministic time series x_i and the filtered time series z_i, measures how well the noise reduction algorithm has located the original data.

Initially we applied manifold decomposition to the data. Fig. 4 shows plots of both measurement error and dynamic error, comparing the filtered time series to the unfiltered (noisy) data. It is clear from the figure that the filtered orbit is deterministic to within machine precision almost everywhere, although the final orbit located is not precisely the initial deterministic orbit. If the system had been uniformly hyperbolic we would have only expected discrepancies to exist between the two "shadowing" deterministic orbits at the ends of the time series. Here we see that the presence of homoclinic tangencies means that when a shadowing orbit exists it is not unique (the new orbit can be considered to be a shadowing orbit since the measurement error does not go above the maximum measurement error for the noisy orbit). Although the resulting orbit is not everywhere close to the original time series, when the data is in regions of hyperbolicity the measurement error reduces down to machine precision.

Fig. 5 shows the error plots for the Levenberg-Marquardt algorithm.
Fig. 3. Plots of the delay embeddings of the deterministic signal (top) from the Hénon map and the signal after 10% noise was added (bottom).
Although it is possible to implement an adaptive process to select δ at each iterate, this was not used here. If only a few iterates of the algorithm are being undertaken the additional overheads of an adaptive process become costly (details of methods for selecting δ are described in Press et al. [19] and Fletcher [10]). Here we set δ to be a constant, δ = 0.00001, which has the effect of providing a guaranteed upper bound for the condition number of M in Eq. (24) (this ap-
proach has certain similarities with regularisation methods, see Press et al. [19]):

$$\mathrm{cond}(M) \le 10^5\,, \qquad (26)$$
and therefore this algorithm is likely to be more stable than the manifold decomposition. If we compare the results of this algorithm to those produced using manifold decomposition we can draw two basic conclusions.
Fig. 4. A comparison between the unfiltered signal (dotted) and the same signal after 10 iterates of the manifold decomposition noise reduction algorithm (continuous). (a) A plot of the dynamic error, whereas (b) is a plot of the distance from the original clean time series.
First, the regions where the dynamic error has not been reduced to machine precision are larger than before (although the difference between 10^-7 and 10^-16 is small). Thus the algorithm is slightly (but trivially) slower than manifold decomposition. Secondly, although the performance on the dynamic error is slightly worse, the measurement error is virtually indistinguishable.
Finally we compare the gradient descent method to the other two algorithms. The step size chosen for the descent was 0.1. This was not optimised but it was selected as a reasonable value that was not unstable. The plots are shown in Fig. 6.
Fig. 5. A comparison between the unfiltered signal (dotted) and the same signal after 10 iterates of the Levenberg-Marquardt noise reduction algorithm (continuous). (a) A plot of the dynamic error, whereas (b) is a plot of the distance from the original clean time series.
It is clear that the improvements in both deterministic error and in measurement error are very much less than were achieved with either the manifold decomposition or the Levenberg-Marquardt approaches. To a certain extent it could be argued that this is due to the speed of the gradient descent algorithm, since all three algorithms were only iterated 10 times even though the gradient descent method converges at a far slower rate.
Fig. 6. A comparison between the unfiltered signal (dotted) and the same signal after 10 iterates of the gradient descent noise reduction algorithm (continuous). (a) A plot of the dynamic error, whereas (b) is a plot of the distance from the original clean time series.
However further iterations do not make significant gains over this one. This is because the algorithm tends to halt when the trajectory becomes virtually non-hyperbolic. Grassberger et al. [9] have observed that a non-predictive version of the gradient descent type method did not suffer from this halting problem.
However in their test the algorithm had to be iterated 500 times before it could achieve a similar level of reduction in dynamic error to that shown in Figs. 4 and 5. Before we write off this method, we should realise that the example given here is very idealised since we have a good embedding and know the dynamics perfectly. In the next exam-
ple we consider a more realistic scenario where the dynamics is unknown and where a less "natural" delay space has to be chosen.

5.2. Example 2: the Lorenz attractor
For our second example we compared the performance of both the gradient descent algorithm and the Levenberg-Marquardt algorithm for a variety of fixed step sizes when applied to data from the Lorenz equations:

$$\frac{dX}{dt} = \sigma(Y - X)\,, \qquad \frac{dY}{dt} = rX - Y - XZ\,, \qquad \frac{dZ}{dt} = -bZ + XY\,, \qquad (27)$$
where σ = 10, b = 8/3 and r = 28. The equations were integrated using a Runge-Kutta fixed step integration routine with a step size of 0.001. After the transients had died down a time series of 5000 points was obtained from the X coordinate. We then added 10% i.i.d. noise with a uniform probability distribution. Since the sampling rate is quite high a simple delay space would not be appropriate. Here we used a singular systems state space (see Broomhead and King [12]) using an initial delay window of ten points. We then chose the first three singular directions to construct our space. It should be noted that it is trivial to adapt the noise reduction algorithms to work in such a space. Finally the dynamics were modelled in both cases with a global function approximation, using 100 Gaussian radial basis functions (see Broomhead and Lowe [17]) with centres chosen from the data set. Having approximated the dynamics we applied our noise reduction procedure. This consisted of applying an iteration of the relevant scheme followed by refitting the parameters in the function approximation. This latter step is important since both steps minimise the noise, only with respect to different parameters; thus we can view the whole process as zigzagging down an extended cost function. Finally both steps were repeated ten times in each case.
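For readers who wish to reproduce the setup, the sketch below is ours: the initial condition, transient length and random seed are guesses not stated in the paper, the noise is again assumed to be 10% of the standard deviation, and mean-centring of the trajectory matrix is a convenience. It integrates Eq. (27) with a fixed-step fourth-order Runge-Kutta scheme and forms a Broomhead-King singular-systems state space from a ten-point window.

```python
import numpy as np

def lorenz_rhs(state, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Right-hand side of the Lorenz equations, Eq. (27)."""
    X, Y, Z = state
    return np.array([sigma * (Y - X), r * X - Y - X * Z, -b * Z + X * Y])

def rk4(state, h):
    """One fixed-step fourth-order Runge-Kutta step."""
    k1 = lorenz_rhs(state)
    k2 = lorenz_rhs(state + 0.5 * h * k1)
    k3 = lorenz_rhs(state + 0.5 * h * k2)
    k4 = lorenz_rhs(state + h * k3)
    return state + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# integrate, discard an assumed transient, keep 5000 samples of the X coordinate
h, n_transient, n_keep = 0.001, 5000, 5000
state = np.array([1.0, 1.0, 1.0])
xs = []
for i in range(n_transient + n_keep):
    state = rk4(state, h)
    if i >= n_transient:
        xs.append(state[0])
x = np.array(xs)

# 10% uniform i.i.d. noise, then a singular-systems (Broomhead-King) state space:
# SVD of the ten-point delay-window matrix, keeping the first three singular directions
rng = np.random.default_rng(1)
level = 0.1 * x.std()
y = x + rng.uniform(-np.sqrt(3) * level, np.sqrt(3) * level, size=x.shape)
window = np.array([y[i:i + 10] for i in range(len(y) - 9)])
centred = window - window.mean(axis=0)
_, _, Vt = np.linalg.svd(centred, full_matrices=False)
embedded = centred @ Vt[:3].T            # (N, 3) reconstructed state space
```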
Fig. 7 shows a plot of the first two singular directions for the noisy data (top). The bottom picture in Fig. 7 shows the same data after applying the Levenberg-Marquardt algorithm (δ = 0.1) 10 times. It is clear from these two plots that the noise reduction process has retrieved much of the usual structure of the Lorenz attractor that is not apparent in the noisy data. A similar picture is obtained when the gradient descent algorithm is used. To compare the two methods quantitatively we therefore calculated two measures of error. The first is the total dynamic error, which is defined as:
$$E_{\mathrm{dyn}} = \sum_{i=d+1}^{N} \varepsilon_i^2\,, \qquad (28)$$
where ε_i is the error term defined in Eq. (1). The second measure of performance that we used was the total measurement error, which is defined as:

$$E_{\mathrm{meas}} = \sum_{i=1}^{N} (x_i - z_i)^2\,. \qquad (29)$$
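The two summary statistics are straightforward to evaluate; a short sketch of ours, assuming the filtered orbit z, the clean orbit x_clean and the fitted map f are available:

```python
import numpy as np

def total_errors(z, x_clean, f, d=3):
    """Total dynamic error of Eq. (28) and total measurement error of Eq. (29).
    The dynamic-error sum starts at i = d + 1 in the paper's 1-based indexing."""
    e_dyn = sum(np.sum((z[i] - f(z[i - 1])) ** 2) for i in range(d, len(z)))
    e_meas = float(np.sum((x_clean - z) ** 2))
    return e_dyn, e_meas
```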
Initially the total dynamic error for the noisy data was 4.88. Table 1, below, lists the total dynamic error after applying the two algorithms using a variety of fixed step sizes. Here δ is the Levenberg-Marquardt parameter and hence we use 1/δ for the gradient descent step size. It is not surprising that for large δ the two methods have a similar performance, since as δ → ∞ the Levenberg-Marquardt algorithm reduces to the method of gradient descent. Furthermore it is not surprising that for δ < 1 the gradient descent algorithm performs poorly. This is because the method is only stable for small step sizes. However the Levenberg-Marquardt approach does not suffer from this problem and, as can be seen from the table, as δ is reduced (at least to 0.01) the performance of the Levenberg-Marquardt algorithm, in terms of dynamic error, keeps improving.
Fig. 7. The top picture shows data from the Lorenz equations with 10% noise added, plotted using the first two singular directions. The bottom picture shows the same data after ten iterates of the Levenberg-Marquardt noise reduction algorithm.
However this performance is not seen in terms of the measurement error, as is illustrated in Table 2 below. Initially the measurement error for the noisy data was 5.48 x 10^3.
Table 1. The total dynamic error for a variety of step sizes.

Delta                  0.01          0.1           0.5           1             2             5
Gradient descent       -             -             1.1 x 10^4    7.2 x 10^-1   5.1 x 10^-3   6.4 x 10^-2
Levenberg-Marquardt    3.1 x 10^-5   1.6 x 10^-4   4.2 x 10^-4   7.6 x 10^-4   4.5 x 10^-3   7.2 x 10^-2
Table 2. The total measurement error for a variety of step sizes.

Delta                  0.01          0.1           0.5           1             2             5
Gradient descent       -             -             1.8 x 10^7    8.8 x 10^2    3.9 x 10^2    4.6 x 10^2
Levenberg-Marquardt    2.3 x 10^3    4.4 x 10^2    3.1 x 10^2    3.2 x 10^2    3.3 x 10^2    4.9 x 10^2
Then for large δ, as above, the two algorithms produce similar results and as δ is reduced the gradient descent method becomes unstable. However, unlike before, the error for the Levenberg-Marquardt algorithm does not continue to reduce much below the best performance of the gradient descent. Instead it levels out at about 3.1 x 10^2 and then rises again. This is likely to be due to the fact that the measurement error is affected by the limitations of the function approximation process (the same is not true for the dynamic error). Thus we can expect that the function approximation will be the limiting factor in the performance of such algorithms. Note that, in this case, the algorithms will perform better with larger amounts of data. Furthermore improvements can be made by combining the function approximation and noise reduction minimisation procedures into one. We refer the reader to Davies and Stark [18] for further details.
6. Discussion
This paper has considered some of the popular methods for noise reduction in the context of minimising (and zero-finding) the dynamic noise. This enabled us to classify previously proposed algorithms into two broad categories: Newton methods and gradient descent methods. We then showed that the presence of homoclinic tangencies in the data is extremely important to the performance of a noise reduction algorithm since they make the root-finding problem ill-conditioned. One way to circumvent this problem is to break the time series up into smaller lengths of data such that the root-finding problem is well conditioned for each length individually, however this is not ideal and we were keen to
develop a more systematic method of dealing with the presence of tangencies. One method that we have previously proposed is to use gradient descent. This requires minimal computational effort and can be made to be as stable as necessary, while guaranteeing to reduce noise. It also turns out that this method is strongly related to various approaches developed by other researchers, based on projecting the embedded data towards some approximation of the deterministic manifold. Furthermore these methods are a good way to tackle the presence of tangencies and we show that solving the minimisation problem using a gradient descent algorithm effectively 'avoids' the tangencies by simply not stepping in the relevant directions. However the one drawback with the gradient descent algorithm is its speed. The gradient descent algorithm can be shown to converge onto the minimum at a linear rate, whereas a Newton based algorithm will converge at a quadratic rate. Thus neither Newton based methods nor gradient descent methods are ideal, since Newton methods become unstable with tangencies and gradient descent methods are slow. To produce an acceptable compromise we applied the Levenberg-Marquardt algorithm to the minimisation problem. This provides a systematic method by which it is possible to interpolate between Newton and gradient descent based methods and, as we have shown, we can use the Levenberg-Marquardt parameter to impose an upper bound on the condition of the problem. The incentive for using the Levenberg-Marquardt algorithm is thus to find an algorithm with approximately quadratic convergence that remains stable. Finally we have applied the method to two examples. The first example demonstrated how
well the Levenberg-Marquardt method could work when the mapping function is known. Here there was very little difference between this method and using manifold decomposition. The second example compared the performance of the Levenberg-Marquardt algorithm to the gradient descent method. Here the data was taken from a flow and the dynamics had to be approximated. Although the Levenberg-Marquardt method was not substantially better in terms of an improvement in measurement error, it remained stable for a broader range of step sizes and also performed better in terms of reducing the dynamic error.
References

[1] E.J. Kostelich and J.A. Yorke, Noise reduction: finding the simplest dynamical system consistent with the data, Physica D 41 (1990) 183-196.
[2] R. Bowen, Markov partitions for Axiom A diffeomorphisms, Amer. J. Math. 92 (1970) 725-747.
[3] J.D. Farmer and J.J. Sidorowich, Optimal shadowing and noise reduction, Physica D 47 (1991) 373-392.
[4] S.M. Hammel, A noise reduction method for chaotic systems, Phys. Lett. A 148 (1990) 421-428.
[5] T. Schreiber and P. Grassberger, A simple noise reduction method for real data, Phys. Lett. A 160 (1991) 411-418.
[6] R. Cawley and G.-H. Hsu, SNR performance of a noise reduction algorithm applied to coarsely sampled chaotic data, Phys. Lett. A 166 (1992) 188-196.
[7] T. Sauer, A noise reduction method for signals from nonlinear systems, Physica D 58 (1992) 193-202.
[8] M. Davies, Noise reduction by gradient descent, Int. J. Bifurc. Chaos 3 (1992) 113-118.
[9] P. Grassberger, R. Hegger, H. Kantz, C. Schaffrath and T. Schreiber, On noise reduction methods for chaotic data (1992), preprint.
[10] R. Fletcher, Practical Methods of Optimisation (Wiley, New York, 1980).
[11] F. Takens, Detecting strange attractors in fluid turbulence, in: Dynamical Systems and Turbulence, Lecture Notes in Mathematics 898, eds. D. Rand and L.-S. Young (Springer, Berlin, 1981).
[12] D.S. Broomhead and G.P. King, Extracting qualitative dynamics from experimental data, Physica D 20 (1987) 217-236.
[13] T. Sauer, J.A. Yorke and M. Casdagli, Embedology, J. Stat. Phys. 65 (1991) 579-616.
[14] D.S. Broomhead, J.P. Huke and M.R. Muldoon, Linear filters and nonlinear systems, J. Stat. Phys. 65 (1992).
[15] M. Casdagli, S. Eubank, J.D. Farmer and J. Gibson, State space reconstruction in the presence of noise, Physica D 51 (1991) 52-98.
[16] J.D. Farmer and J.J. Sidorowich, Predicting chaotic time series, Phys. Rev. Lett. 59 (1987) 845-848.
[17] D.S. Broomhead and D. Lowe, Multivariate functional interpolation and adaptive networks, Complex Systems 2 (1988) 321-355.
[18] M.E. Davies and J. Stark, A new technique for estimating the dynamics in the noise reduction problem, in: Nonlinearity and Chaos in Engineering Dynamics, eds. J.M.T. Thompson and S.R. Bishop (Wiley, New York, 1993).
[19] W.H. Press, B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, Numerical Recipes in C (Cambridge Univ. Press, Cambridge, 1988).
[20] S.M. Hammel, J.A. Yorke and C. Grebogi, Numerical orbits of chaotic processes represent true orbits, Bull. Amer. Math. Soc. 19 (1988) 465-469.
[21] G.H. Golub and C.F. Van Loan, Matrix Computations (Johns Hopkins Univ. Press, Baltimore, 1983).