Bayesian optimization of empirical model with state-dependent stochastic forcing

Andrey Gavrilov∗, Evgeny Loskutov, Dmitry Mukhin
Institute of Applied Physics of the Russian Academy of Sciences, 46 Ul'yanov Street, Nizhny Novgorod 603950, Russia
Chaos, Solitons and Fractals 104 (2017) 327–337


Article history: Received 31 May 2017 Revised 1 August 2017 Accepted 29 August 2017

Keywords: Time series analysis Bayesian methods Stochastic processes Artificial neural networks

Abstract
A method for optimal data simulation using a random evolution operator is proposed. We consider a discrete data-driven model of the evolution operator that is a superposition of a deterministic function and stochastic forcing, both parameterized with artificial neural networks (specifically, three-layer perceptrons). An important property of the model is its data-adaptive, state-dependent (i.e. inhomogeneous over phase space) stochastic part. The Bayesian framework is applied to model construction and explained in detail. In particular, the Bayesian criterion of model optimality is adopted to determine both the model dimension and the number of parameters (neurons) in the deterministic as well as in the stochastic parts, on the basis of statistical analysis of the data sample under consideration. Using data generated by the stochastic Lorenz-63 system, we investigate this criterion and show that it allows finding a stochastic model which adequately reproduces the invariant measure and other statistical properties of the system. We also demonstrate that the state-dependent stochastic part is optimal only for a large enough duration of the time series.

1. Introduction

A traditional method of modeling observed processes generated by a nonlinear dynamical system is the reconstruction of the underlying evolution operator [1,2]. First-principle models (FPMs), i.e. models based on axiomatic physical or other laws and rules of system operation, are hard to apply to modeling various natural physical phenomena, biological or socio-economic processes, and so on, mainly because of the extreme generality of these models: the complexity and spatial distribution of the system under consideration frequently lead to cumbersome models and, hence, to a large scatter of qualitative prognostic estimates because of their strong sensitivity to subgrid parameterizations. As a result, FPMs are, in the majority of cases, redundant for describing the dynamics of particular phenomena. Reduction of such models to simpler forms based on selecting variables with definite time scales enables a qualitative description of the mechanisms underlying the studied phenomena (for instance, there exist climate models of intermediate complexity [3–5], basic dynamic models of atmospheric photochemistry [6,7], conceptual models of natural phenomena [8,9], etc.). However, in such procedures the models often lose their connection to the experimentally measured time series

∗ Corresponding author. E-mail addresses: [email protected] (A. Gavrilov), [email protected] (E. Loskutov), [email protected] (D. Mukhin). doi:10.1016/j.chaos.2017.08.032

and, hence, their ability to provide quantitative forecasts. As an alternative to FPMs, empirical models aim to extract information about the dynamics of the system directly from the time series: the most statistically justified model of the evolution operator (in terms of a certain empirical criterion of validity, see below) is chosen, while the FPM equations are not brought into play. In fact, the empirical approach implies a model sufficient for reconstructing the system's dynamical properties manifested in the observed dynamics at the timescales of interest. In the past 30 years (see [2] and the literature therein), empirical models, by virtue of their universality and independence of particular physical assumptions, have become very popular in forecasting climate dynamics [10–12], financial time series [13], investigation of living systems [14], and so on. A typical situation in the construction of an empirical model is that there is no one-to-one coupling between consecutive states in the space of the phase variables reconstructed from the time series, i.e. the problem of reconstructing the evolution operator as a single-valued function is ill-posed. In the context of real complex systems, this circumstance is a consequence of the high dimension of the system that has generated the time series: the number of dynamic variables of the model with which it is reasonable to work is limited by the finite size of the available data sample. There are a lot of special studies devoted to the problem of appropriate empirical reduction of observed high-dimensional data: from the standard principal component analysis [15–18] and more


advanced linear techniques [19–27] to nonlinear reduction methods [28–36]. In any case, when working with real data we have to construct a model in a certain low-dimensional subspace of the full phase space of the system, where, generally speaking, there is no single-valued evolution operator. In model examples, such a situation always occurs if interactive noise is added to the data-generating equations. Such an ambiguity is traditionally regarded as a defect of the model and is described as a random (Gaussian) error additive to the evolution operator function. For example, in the works [37,38] an artificial neural network (ANN) is used to approximate the evolution operator of a low-dimensional deterministic system under the assumption that the defect of the model is Gaussian noise with constant variance. On the other hand, the described ambiguity of the evolution operator is obviously not homogeneous over the phase space of the model, due to the nonlinearity of the studied system. In the present work we use an ANN-based model modified according to the following idea: it is possible to take this inhomogeneity into consideration; namely, in the regions of phase space with small scatter in the dependence of the current state on the previous state, the evolution operator may be reconstructed more accurately, thus decreasing the systematic approximation bias. A more general structure of the model taking this inhomogeneity into account was proposed in [12,39], where the random component of the model is supposed to depend on the phase variables, i.e. to be state-dependent, with both the stochastic and the deterministic components being approximated by ANN-based functions in the form of a three-layer perceptron [40]. We have chosen ANN functions because they are known to be effective approximators of unknown dependencies (see [40]), and their efficiency in constructing prediction models from time series was shown in many works (e.g., [13,39,41–44]). The convenience of this approximation resides in the following: firstly, the obtained functions are bounded, which prevents the approximation accuracy from deteriorating at the edges of the definition domain; secondly, it is possible to efficiently increase the dimension and complexity of the model by adding only a small number of parameters. The drawback is the degeneracy of the ANN parameter space, which may be overcome by introducing regularizing restrictions (see Section 2.2.3). The main question arising with such a modification of the model is the following: is it justified to increase the number of parameters by introducing an additional function into the model or, in other words, is the resulting model more economical in terms of description of the time series? To answer this question we need a criterion for comparing models of different complexity or, which is the same (without loss of generality), with different numbers of parameters. In this work we use a very general optimality criterion based on Bayesian evidence [45]. This criterion is transparently interpreted in the framework of the Bayesian paradigm: if all the considered models are equiprobable a priori, then the best model is the one reproducing the observed data with the highest probability. From the information theory point of view, this criterion is equivalent to the minimum description length principle [37], i.e. the best model is required to provide the most informationally compressed representation of the observed data.
We also developed a numerical method for estimating the Bayesian evidence of the described ANN-based stochastic model, including the procedure of Bayesian regularization of the ANN. The paper is organized in the following way. In Section 2 the structure of the evolution operator model is described (Section 2.1), the measure of model optimality is introduced (Section 2.2), and the method of its numerical assessment is presented (Section 2.3). In Section 3 the proposed approach is used for reconstructing the stochastic Lorenz system from its scalar time series of different durations and noise levels. Also, the properties of the Bayesian evidence with

respect to the proposed stochastic model are investigated in detail in this section. It is shown that the stochastic model admitting a state-dependent random term is generally preferable to the model with constant noise and gives a better description of the dynamical properties determining the observed behavior. Finally, the concluding remarks are formulated in Section 4.

2. Method of optimal stochastic model construction

2.1. Stochastic model of evolution operator

Let us consider a discrete time series X = (x_1, ..., x_N), x_n \in R^D, of characteristics x of an unknown dynamical system at N equidistant time moments, for which we want to construct an evolution operator model. The evolution operator, by definition, is a function acting in the phase space of the system. However, in the general case the analysis of the observed data does not allow us to conclude that the dimension of the phase space of the dynamical system that has generated the time series X is finite. Therefore, we have to construct a model in a certain subspace of dimension d, whose elements must be derived from the observables X (see the particular realizations below) and will be denoted by u_n \in R^d. Obviously, in the general case there is no single-valued evolution operator Q : u_n \to u_{n+1} in such a subspace. The arising ambiguities of this map may be described stochastically by defining the evolution operator Q as a random dynamical system [46]. In practical applications, it can be simplified (see [39] for details) to the following form:

u_{n+1} = f(u_n) + \hat{g}(u_n) \, \xi_n.    (1)

Here the mapping f : R^d \to R^d represents the "deterministic" part of the model; \xi_n \in R^d is a normal delta-correlated random process whose vector components are mutually independent and have zero mean and unit variance; the matrix function \hat{g} : R^d \to R^{d \times d} maps the model state u_n to a d \times d matrix and represents, together with \xi_n, the state-dependent stochastic perturbation term. The model (1) proved to be efficient in reconstructing systems of different complexity [11,39] because the strong Gaussianity assumption about \xi_n is complemented here by the state dependence (i.e. inhomogeneity over the phase space) of the covariance matrix. One of the most widely used ways to specify the phase variables u via the data X in the case of small dimension D is the Takens delay method and its modifications [47], in which the phase variable is obtained by successively shifting the time series X by an arbitrary time lag. In practice, different values of the time lag lead to different (more or less adequate) structures of the phase space projections and, eventually, to different results of modeling. Therefore, the lag is usually taken close to the time scale of interest. For example, for the lag equal to one time step the Takens method gives:





u_k = (x_k, x_{k-1}, ..., x_{k-(l-1)}).    (2)

Here l is the number of delays. Otherwise, if the dimension D is high, the empirical dimensionality reduction methods (see Section 1) can be applied as a preprocessing step before the delaying. In this work, for simplicity of notation and demonstration, we restrict our consideration to the case of a scalar time series X = (x_1, ..., x_N), x_n \in R, and to the reconstruction of the phase space by the Takens delay method with the lag equal to one time step (with such a restriction, any lag can be set by a proper resampling of the time series). Obviously, in this case the dimension d is equal to l. The form of the model (1) in this case becomes much simpler due to the presence of d - 1 trivial couplings


between the variables:

(u_1)_{n+1} = f(u_n; \mu) + g(u_n; \nu) \, \xi_n,
(u_2)_{n+1} = (u_1)_n,
...
(u_d)_{n+1} = (u_{d-1})_n.    (3)

Here f : R^d \to R and g : R^d \to R are scalar functions of d variables; we also assign the parameters \mu and \nu to them, indicating that they will be explicitly parameterized by a certain class of approximator; \xi_n \in R is a normal delta-correlated random process with zero mean and unit variance. Finally, Eq. (3) can be written in terms of one scalar phase variable, which we denote by v_k = (u_1)_k:

v_{n+1} = f(v_n, v_{n-1}, ..., v_{n-d+1}; \mu) + g(v_n, v_{n-1}, ..., v_{n-d+1}; \nu) \, \xi_n,   n = d, ..., N-1.    (4)

The corresponding back-transformation from the phase space to the observed variables in our case is given simply by:

x_n = v_n,   n = 1, ..., N.    (5)
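For concreteness, the two ingredients above — the delay reconstruction (2) of a scalar series and the forward iteration of the scalar stochastic model (4) — can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the placeholder functions f and g are arbitrary and merely stand in for the ANN parameterization introduced below.

```python
import numpy as np

def delay_embed(x, d):
    """Takens delay embedding (2) with unit lag: rows are u_k = (x_k, x_{k-1}, ..., x_{k-d+1})."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x[d - 1 - j: len(x) - j] for j in range(d)])

def simulate(f, g, v_init, n_steps, seed=0):
    """Iterate the scalar model (4): v_{n+1} = f(v_n, ..., v_{n-d+1}) + g(v_n, ..., v_{n-d+1}) * xi_n."""
    rng = np.random.default_rng(seed)
    d, v = len(v_init), list(v_init)
    for _ in range(n_steps):
        u = np.array(v[-1:-d - 1:-1])          # current state (v_n, v_{n-1}, ..., v_{n-d+1})
        v.append(f(u) + g(u) * rng.standard_normal())
    return np.array(v)

# arbitrary placeholder functions, for illustration only
f = lambda u: 0.9 * u[0] - 0.2 * u[-1]
g = lambda u: 0.1 * np.sqrt(1.0 + u[0] ** 2)   # a state-dependent noise amplitude
series = simulate(f, g, v_init=[0.0, 0.0, 0.0], n_steps=500)
states = delay_embed(series, d=3)
```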

In this work, for the parameterization of the functions f and g we will use an ANN in the form of a perceptron with one hidden layer, as it is effective in approximating unknown nonlinear dependencies (see Section 1):

A^m_{d_{in}, d_{out}}(z) = \sum_{i=1}^{m} \alpha_i \tanh(\omega_i^T z + \gamma_i).    (6)

Here d_{in} and d_{out} are the dimensions of the ANN input and output, respectively, i.e. \alpha_i \in R^{d_{out}}, \omega_i \in R^{d_{in}}, z \in R^{d_{in}}; m is the number of neurons in the hidden layer, which determines the accuracy of the data approximation; \omega_i^T z denotes the inner product of the column vectors \omega_i and z; \gamma_i \in R. We parameterize the functions f and g in (4) via the ANN (6) with d_{out} = 1, d_{in} = d (in the case of a scalar time series X):

f(u; \mu) = A^{m_f}_{d,1}(u),
g(u; \nu) = A^{m_g}_{d,1}(u)   for m_g \ge 1,
g(u; \nu) = const(\nu)         for m_g = 0.    (7)

Here the sets of ANN coefficients {\alpha_i, \omega_i, \gamma_i}_{i=1}^{m_f} and {\alpha_i, \omega_i, \gamma_i}_{i=1}^{m_g} corresponding to the functions f and g are understood as the vectors of parameters \mu and \nu, respectively. Note that in (7) we also take into consideration the standard case of a homogeneous stochastic part, g = const, which is denoted by m_g = 0. In fact, the choice of the model architecture (particularly, the type of ANN) is left up to the prior knowledge of the user. Here we use one of the simplest nonlinear ANNs (6) (feedforward, with only one hidden layer) to exclude the potential problems with model training which arise with more sophisticated types of ANN, e.g. the well-known problem of vanishing gradients in deep networks [48], and to focus on the analysis of the nontrivial stochastic component g.
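The perceptron (6) and the parameterization (7) are simple to implement directly. The sketch below assumes a flat parameter vector whose layout (alpha, then omega, then gamma) is our own convention, not something fixed by the paper.

```python
import numpy as np

def perceptron(z, alpha, omega, gamma):
    """One-hidden-layer perceptron (6) with scalar output (d_out = 1):
    A(z) = sum_i alpha_i * tanh(omega_i . z + gamma_i)."""
    return float(np.dot(alpha, np.tanh(omega @ z + gamma)))

def unpack(theta, m, d):
    """Split a flat parameter vector into alpha (m,), omega (m, d), gamma (m,)."""
    alpha = theta[:m]
    omega = theta[m:m + m * d].reshape(m, d)
    gamma = theta[m + m * d:]
    return alpha, omega, gamma

def make_f(mu, d, m_f):
    """Deterministic part f(u; mu) of Eq. (7): a perceptron with m_f neurons."""
    return lambda u: perceptron(u, *unpack(np.asarray(mu, dtype=float), m_f, d))

def make_g(nu, d, m_g):
    """Stochastic amplitude g(u; nu) of Eq. (7); for m_g = 0 it is the constant nu.
    Only g**2 enters the likelihood (11), so its sign does not matter."""
    if m_g == 0:
        return lambda u: float(np.ravel(nu)[0])
    return lambda u: perceptron(u, *unpack(np.asarray(nu, dtype=float), m_g, d))
```

With this layout, μ has m_f(d + 2) components and ν has m_g(d + 2) components (or a single component when m_g = 0).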

2.2. Determining model optimality

2.2.1. Bayesian evidence definition

It is clear from the form of the model (4)–(7) that the number of the model parameters specifying its complexity is determined by three hyperparameters m_f, m_g and d: the numbers of neurons in the deterministic and stochastic parts of the model, respectively, and the dimension of the phase space of the model. Thus, the problem of choosing a model of optimal complexity is formalized: it is necessary to find a point in the space of hyperparameters corresponding to the "best" model in terms of a certain optimality criterion. In this work, we suggest an optimality criterion based on the Bayesian approach.

Let the set of models {H_i} of the evolution operator be specified. In our case, each model H_i of this set is specified by the hyperparameters (m_f, m_g, d)_i. According to the Bayesian approach, the measure of the optimality of the model H_i is the probability that this model is a source of the observed time series X. According to the Bayes theorem it is equal to:

P(H_i | X) = P(X | H_i) P(H_i) / \sum_i P(X | H_i) P(H_i).    (8)

Here, P(X | H_i) is the Bayesian evidence of the model H_i, i.e. the probability density function (PDF) of the time series X in an ensemble of all possible time series generated by the model H_i. The multiplier P(H_i) is the prior probability of the model H_i reflecting prior preferences in the choice of this model. The denominator in (8) plays the role of a normalization term which is independent of H_i. Further, assuming all the models of the set to be equiprobable a priori, we can reduce the expression (8) to P(H_i | X) = \alpha P(X | H_i), where \alpha is an H_i-independent constant multiplier. Thus, the optimal model corresponds to the maximum of the Bayesian evidence P(X | H_i) over H_i. Let us derive an analytical expression for the Bayesian evidence of the stochastic model (4)–(7).

2.2.2. Analytical expression for Bayesian evidence

An individual model H_i (Eqs. (4)–(7)) is specified by the parameters \mu and \nu and the vector v_{init} = (v_1, ..., v_d) of the initial state. Let us collect them into the vector of parameters \theta = (\mu, \nu, v_{init}). The Bayesian evidence P(X | H_i) can be expressed via the corresponding likelihood function P(X | \theta, H_i) and the prior probability measure P_{\Omega_i} on the set \Omega_i of the parameters of the model H_i:

P(X | H_i) = \int_{\Omega_i} P(X | \theta, H_i) \, dP_{\Omega_i},   \theta \in \Omega_i.    (9)

Taking into consideration Eqs. (4) and (5), the likelihood P(X | \theta, H_i) is factorized as:

P(X | \theta, H_i) = \int \prod_{n=1}^{N} \delta(x_n - v_n) \prod_{n=d+1}^{N} P(v_n | v_{n-1}, ..., v_{n-d}; \mu, \nu) \, dv_{d+1} \cdots dv_N.    (10)

Further, under the assumption of the Gaussianity and delta-correlation of the noise \xi_n in (4) we have:

P(v_n | v_{n-1}, ..., v_{n-d}; \mu, \nu) = [2\pi g(v_{n-1}, ..., v_{n-d}; \nu)^2]^{-1/2} \exp\left( - \frac{(v_n - f(v_{n-1}, ..., v_{n-d}; \mu))^2}{2 g(v_{n-1}, ..., v_{n-d}; \nu)^2} \right).    (11)
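Since the delta functions in (10) simply identify v_n with x_n, the negative log-likelihood reduces to a sum of Gaussian terms (11) over the series. A minimal sketch, assuming f and g are given as callables on the d-dimensional state:

```python
import numpy as np

def minus_log_likelihood(x, f, g, d):
    """-ln P(X | theta, H_i) for the scalar model (4), following Eqs. (10)-(11)."""
    x = np.asarray(x, dtype=float)
    nll = 0.0
    for n in range(d, len(x)):                 # x[n] is conditioned on the d previous values
        u = x[n - 1::-1][:d]                   # (x_{n-1}, x_{n-2}, ..., x_{n-d})
        mean, var = f(u), g(u) ** 2
        nll += 0.5 * (np.log(2.0 * np.pi * var) + (x[n] - mean) ** 2 / var)
    return nll
```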

Next, we specify the probability measure P_{\Omega_i} by introducing the prior PDF of the parameters \theta:

dP_{\Omega_i} = P_{pr}(\mu, \nu | H_i, q) \, d\mu \, d\nu \cdot P_{pr}(v_{init} | H_i) \, dv_{init}.    (12)

In (12), P_{pr}(\mu, \nu | H_i, q) is the prior PDF of the coefficients \mu, \nu of the functions f, g. Here we use the additional hyperparameters q for this PDF, which are explained below. In fact, these hyperparameters can be regarded as identifiers of the model H_i, together with (m_f, m_g, d)_i, but for us it is convenient to keep them separate from H_i. With this note, we will write P(X | H_i, q) instead of P(X | H_i). For the parameters v_{init} we will use a standard normal distribution with unit variance, implying that the time series X must be pre-normalized:



P_{pr}(v_{init} | H_i) = (2\pi)^{-d/2} \exp\left( - \frac{1}{2} \sum_{n=1}^{d} v_n^2 \right).    (13)


The substitution of (10)–(13) into (9) yields the analytical expression for the optimality:



P(X | H_i, q) = (2\pi)^{-d/2} \exp\left( - \sum_{n=1}^{d} \frac{x_n^2}{2} \right) \int \Phi(\mu, \nu, X; H_i) \, P_{pr}(\mu, \nu | H_i, q) \, d\mu \, d\nu,    (14)

\Phi(\mu, \nu, X; H_i) = \prod_{n=d+1}^{N} [2\pi g(x_{n-1}, ..., x_{n-d}; \nu)^2]^{-1/2} \exp\left( - \sum_{n=d+1}^{N} \frac{(x_n - f(x_{n-1}, ..., x_{n-d}; \mu))^2}{2 g(x_{n-1}, ..., x_{n-d}; \nu)^2} \right).    (15)

Via the function P(X | H_i, q) it is natural to express the data description length [49] for the model (H_i, q), i.e. the amount of information needed to encode the data X using the model (H_i, q):

I(X) = - \ln P(X | H_i, q).    (16)

This function will be used as the cost function to minimize in the numerical implementation.

2.2.3. Prior PDF of the ANN coefficients

Actually, the function P_{pr}(\mu, \nu | H_i, q) restricts the domain of model learning in the space of parameters, and the hyperparameters q determining its size control the power of the model H_i. Introduction of this PDF helps to compensate for the degeneracy of the parameter space, which is the drawback of ANNs. Besides, the domain of the parameter search is narrowed, thus simplifying the numerical analysis. For convenience of calculation, following [50], we take this function in Gaussian form. For each group of parameters of the ANN (6) we set the corresponding variance:

P_{ANN}({\alpha_i, \omega_i, \gamma_i}_{i=1}^{m}; \sigma_\alpha, \sigma_\omega, \sigma_\gamma) = \prod_{i=1}^{m} (2\pi\sigma_\alpha^2)^{-d_{out}/2} \exp\left( - \frac{|\alpha_i|^2}{2\sigma_\alpha^2} \right) \times \prod_{i=1}^{m} (2\pi\sigma_\omega^2)^{-d_{in}/2} \exp\left( - \frac{|\omega_i|^2}{2\sigma_\omega^2} \right) \times \prod_{i=1}^{m} (2\pi\sigma_\gamma^2)^{-1/2} \exp\left( - \frac{\gamma_i^2}{2\sigma_\gamma^2} \right).    (17)

Hence, for the parameters μ, ν in the case of mg ≥ 1:

P_{pr}(\mu, \nu | H_i, q) = P_{ANN}(\mu; \sigma_\alpha^f, \sigma_\omega^f, \sigma_\gamma^f) \times P_{ANN}(\nu; \sigma_\alpha^g, \sigma_\omega^g, \sigma_\gamma^g),   q = (\sigma_\alpha^f, \sigma_\omega^f, \sigma_\gamma^f, \sigma_\alpha^g, \sigma_\omega^g, \sigma_\gamma^g),    (18)

and in the case of homogeneous stochastic part (mg = 0):

P_{pr}(\mu, \nu | H_i, q) = P_{ANN}(\mu; \sigma_\alpha^f, \sigma_\omega^f, \sigma_\gamma^f) \times (2\pi\sigma_\nu^2)^{-1/2} \exp\left( - \frac{\nu^2}{2\sigma_\nu^2} \right),   q = (\sigma_\alpha^f, \sigma_\omega^f, \sigma_\gamma^f, \sigma_\nu).    (19)
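The Gaussian priors (17)–(19) translate into a sum of simple quadratic penalties. The following sketch assumes the ANN coefficients are passed as separate groups; the grouping itself follows (17), while the argument layout is our own choice.

```python
import numpy as np

def minus_log_gaussian(v, sigma):
    """-ln of an isotropic zero-mean Gaussian density for the vector v with standard deviation sigma."""
    v = np.ravel(np.asarray(v, dtype=float))
    return 0.5 * (v.size * np.log(2.0 * np.pi * sigma ** 2) + np.sum(v ** 2) / sigma ** 2)

def minus_log_prior_ann(alpha, omega, gamma, s_alpha, s_omega, s_gamma):
    """-ln P_ANN of Eq. (17): independent Gaussian groups for {alpha_i}, {omega_i}, {gamma_i}."""
    return (minus_log_gaussian(alpha, s_alpha)
            + minus_log_gaussian(omega, s_omega)
            + minus_log_gaussian(gamma, s_gamma))

def minus_log_prior(mu_groups, nu_groups, q, m_g):
    """-ln P_pr(mu, nu | H_i, q) of Eqs. (18)-(19); q holds the corresponding sigmas."""
    val = minus_log_prior_ann(*mu_groups, *q[:3])
    if m_g == 0:                               # homogeneous stochastic part, Eq. (19): nu is a scalar
        return val + minus_log_gaussian(nu_groups, q[3])
    return val + minus_log_prior_ann(*nu_groups, *q[3:6])
```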

2.2.4. Cost function for optimality

Finally, the minimum of the function (16) over q is used as the optimality criterion of the model H_i, i.e. of the hyperparameters (m_f, m_g, d)_i:

E(H_i) = \min_q [ - \ln P(X | H_i, q) ].    (20)

The model corresponding to the smallest E, i.e. providing the minimum data description length (the MDL principle) or, which is the same, the maximum probability of reproducing the observed time series X (Bayesian evidence), is regarded as optimal. Importantly, here we self-consistently determine the characteristics q of the prior distribution P_{pr}(\mu, \nu | H_i, q) for the parameters of the model H_i, thus specifying its optimal regularization.

2.3. Model learning and numerical implementation

2.3.1. Numerical assessment of optimality

Exact computation of the integral in the expression (14) is infeasible in the majority of cases, as it requires significant computational resources: the integrand is a rather complicated function of many variables. Let us consider the negative logarithm of the integrand in (14) as a function of \mu, \nu:

\Psi_{X, H_i, q}(\mu, \nu) = - \ln [ \Phi(\mu, \nu, X; H_i) \cdot P_{pr}(\mu, \nu | H_i, q) ],    (21)

and then calculate the integral of \exp(-\Psi_{X, H_i, q}(\mu, \nu)) by the Laplace method (in the same way as in, e.g., [51]), assuming that the main contribution to the integral is made by the neighborhood of the minimum of the function \Psi_{X, H_i, q} over (\mu, \nu). According to this method, the Taylor expansion of this function up to the second order in the neighborhood of the parameters (\mu_0, \nu_0) corresponding to the minimum of (21) allows the optimality cost function (16) to be represented approximately as:

- \ln P(X | H_i, q) \cong \Psi_{X, H_i, q}(\mu_0, \nu_0) + \frac{1}{2} \ln \left| \frac{1}{2\pi} \nabla \nabla^T \Psi_{X, H_i, q} \right| + \frac{1}{2} \sum_{n=1}^{d} (x_n^2 + \ln 2\pi).    (22)

Here \nabla \nabla^T \Psi_{X, H_i, q} is the matrix of second derivatives (the Hessian matrix) of the function \Psi_{X, H_i, q} with respect to the parameters (\mu, \nu) at the point of its minimum (\mu_0, \nu_0). The expression (22) allows us to estimate the model optimality once the minimum (\mu_0, \nu_0) is found. The optimality is based on the competition between the first and the second terms in (22). The first term \Psi_{X, H_i, q}(\mu_0, \nu_0) basically characterizes the model likelihood and decreases with the increase of the fitting accuracy, i.e. of the number of neurons in the ANN and of the dimension d. In contrast, the penalty term (1/2) \ln |(1/2\pi) \nabla \nabla^T \Psi_{X, H_i, q}| in (22) prevents the model from overfitting, as it grows with the number of parameters \dim(\mu, \nu). In other words, the first term penalizes too simple models which are not able to capture the dynamical properties of the system, while the second term penalizes too complex (overfitted) models which tend to reproduce the particular noise realization instead of the dynamical properties of the system. It is worth noting that the second term behaves asymptotically (for N \gg \dim(\mu, \nu)) as (1/2) \bar{M} \ln N, where \bar{M} is the number of nondegenerate parameters of the model (3), since the eigenvalues of the Hessian matrix \nabla \nabla^T \Psi_{X, H_i, q} corresponding to the singular directions in the space of parameters do not depend on N and are determined by the prior PDF only. Also, in the case N \gg \dim(\mu, \nu) the last term in (22) can be neglected. Consequently, in the case of long time series the optimality criterion used here is analogous to the Schwarz criterion (also called the Bayesian information criterion, BIC [52]) in the absence of degeneracy in the model's parameter space, which is certainly not the case when using ANNs.
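A minimal numerical sketch of the Laplace estimate (22). The use of SciPy's BFGS optimizer and a finite-difference Hessian are our own simplifying assumptions (the paper does not prescribe a particular optimizer); psi is expected to evaluate Ψ_{X,Hi,q} for a flat parameter vector.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_cost(psi, theta_start, x, d, eps=1e-5):
    """Approximate -ln P(X | H_i, q) by Eq. (22): Psi at its minimum, plus
    0.5 * ln det(Hessian / (2*pi)), plus the term from the first d data points."""
    res = minimize(psi, theta_start, method="BFGS")        # search for (mu_0, nu_0)
    theta0, n_par = res.x, res.x.size

    hess = np.zeros((n_par, n_par))                        # finite-difference Hessian (crude estimate)
    for i in range(n_par):
        for j in range(n_par):
            t = [theta0.copy() for _ in range(4)]
            t[0][i] += eps; t[0][j] += eps
            t[1][i] += eps; t[1][j] -= eps
            t[2][i] -= eps; t[2][j] += eps
            t[3][i] -= eps; t[3][j] -= eps
            hess[i, j] = (psi(t[0]) - psi(t[1]) - psi(t[2]) + psi(t[3])) / (4 * eps ** 2)

    _, logdet = np.linalg.slogdet(hess / (2.0 * np.pi))
    data_term = 0.5 * np.sum(np.asarray(x[:d], dtype=float) ** 2 + np.log(2.0 * np.pi))
    return res.fun + 0.5 * logdet + data_term, theta0
```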

2.3.2. Algorithm of optimal model construction

In the framework of the Bayesian approach, the model (H_i, q) can be learned by maximization of the posterior PDF of the ANN parameters, which is defined by the following expression:

P(\mu, \nu | X, H_i, q) = \frac{\int P(X | \theta, H_i) \, P_{pr}(\mu, \nu | H_i, q) \, P_{pr}(v_{init} | H_i) \, dv_{init}}{P(X | H_i, q)}.    (23)

It is easy to show that this PDF, as a function of (\mu, \nu), is proportional to the integrand of (14), i.e.:

P(\mu, \nu | X, H_i, q) \propto \exp( - \Psi_{X, H_i, q}(\mu, \nu) ).    (24)

Hence, by finding (\mu_0, \nu_0) for the estimation of the optimality (22), we simultaneously learn the model (H_i, q). Thus, the algorithm of optimal model construction which we suggest is quite straightforward: we repeatedly perform an iteration including the following steps (a sketch of this loop is given below):

1. Generating values of m_f, m_g, d;
2. Generating a random vector q;
3. Learning the model (H_i, q) and obtaining (\mu_0, \nu_0) by gradient-based minimization of (21) from a random initial point;
4. Estimating the optimality cost function (22);
5. Repeatedly adjusting the value of q, with updating of (\mu_0, \nu_0) by gradient-based minimization and recalculation of the cost function (22);
6. Estimating the q-independent optimality cost function (20) by choosing the adjustment trial with the best optimality (minimal description length) from step 5.

Separation of the q-minimization from the minimization over the other hyperparameters allows us to save a lot of computational resources, because repeated updating of (\mu_0, \nu_0) is obviously faster than minimization started from a random initial point at each q, especially for close values of q. Thus, as a result of each iteration we find the optimal regularization parameters q, the parameters (\mu_0, \nu_0) of the model H_i at the optimal q, and its optimality cost function (20). Finally, the optimal hyperparameters m_f, m_g, d are found by minimization of the optimality cost function E(H_i) over all iterations.
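A schematic of this iteration is given below. All helper callables (sample_hyperparams, learn, cost, adjust_q) are hypothetical stand-ins for the steps described in the text, not functions defined by the paper.

```python
import numpy as np

def optimal_model_search(x, n_iterations, sample_hyperparams, learn, cost, adjust_q):
    """Random search over (m_f, m_g, d) with a nested adjustment of the prior widths q."""
    rng = np.random.default_rng(0)
    best = {"E": np.inf}
    for _ in range(n_iterations):
        m_f, m_g, d = sample_hyperparams()                     # step 1
        q = np.exp(rng.standard_normal(6 if m_g > 0 else 4))   # step 2: random q, cf. (18)-(19)
        theta0 = learn(x, m_f, m_g, d, q)                      # step 3: minimize (21) from a random start
        E_i = cost(x, theta0, m_f, m_g, d, q)                  # step 4: evaluate (22)
        for q_new in adjust_q(q):                              # step 5: warm-started adjustment of q
            theta0 = learn(x, m_f, m_g, d, q_new, start=theta0)
            E_i = min(E_i, cost(x, theta0, m_f, m_g, d, q_new))
        if E_i < best["E"]:                                    # step 6 and final minimization over iterations
            best = {"E": E_i, "hyperparameters": (m_f, m_g, d), "theta": theta0}
    return best
```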

3. Stochastic model for the Lorenz-63 system with interactive noise

3.1. The Lorenz-63 system

Let us now demonstrate the proposed method on a simple example: reconstructing the Lorenz-63 system [53] at the "classical" values of the parameters (corresponding to the presence of a chaotic attractor), to which we added a Langevin noise:

d\tilde{x}/dt = 10(\tilde{y} - \tilde{x}),
d\tilde{y}/dt = 25\tilde{x} - \tilde{y} - \tilde{x}\tilde{z},
d\tilde{z}/dt = -\frac{8}{3}\tilde{z} + \tilde{x}\tilde{y} + \sigma \eta(t).    (25)

Here \eta(t) is Gaussian white noise and the parameter \sigma determines the intensity of the noise. The variables of the classical Lorenz-63 system are marked by a tilde to avoid notation confusion. We consider the scalar time series formed by the values of the variable \tilde{y} sampled with interval \tau as the observed data X:

x_n = \tilde{y}(t_0 + n\tau).    (26)

Here we use the time step \tau = 0.17, corresponding approximately to the lag providing the first well expressed minimum of the mutual information between y(t) and y(t + \tau). This selection criterion for \tau is widely used [1] in the case of the Takens coordinates (2); it is aimed at obtaining the best resolution of the attractor topology in the reconstructed phase space. Note, however, that this criterion is crude, and the value of \tau that determines a "good" attractor projection of the system may, generally speaking, vary in a wide range and may be an object of optimization of the model (3) along with the other hyperparameters. The optimization of the time delays was done, for example, in the work [54] in application to an evolution operator of a simpler form. Here we will not vary \tau.

Fragments of the time series X obtained for the system (25) without noise (\sigma = 0) as well as in the presence of noise (\sigma = 10) are presented in Fig. 1, together with the attractor projections of the system onto the (x_n, x_{n+1}) plane for different lengths of the time series and the corresponding invariant measures (estimated from a long time series generated by the system). In the absence of noise the system aperiodically switches between the two regions ("wings") of the Lorenz attractor. The corresponding structure of the attractor can be observed in the 2-D projections and the 2-D invariant measure in Fig. 1a. Clearly, this low-dimensional structure of the evolution operator is destroyed with the increase of the noise intensity \sigma, indicating that the dimension of the phase space needed for a deterministic description grows.
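The data used below can be reproduced with a straightforward Euler–Maruyama integration of (25) and sampling of the observable (26). The integration step, transient length and initial condition in this sketch are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def lorenz_langevin(sigma_noise, n_samples, tau=0.17, dt=1e-3, t_skip=50.0, seed=0):
    """Euler-Maruyama integration of the stochastically forced Lorenz-63 system (25),
    sampling the observable (26): x_n = y(t_0 + n * tau)."""
    rng = np.random.default_rng(seed)
    state = np.array([1.0, 1.0, 1.0])
    stride = int(round(tau / dt))
    skip = int(round(t_skip / dt))
    obs = []
    for k in range(skip + n_samples * stride + 1):
        if k >= skip and (k - skip) % stride == 0:
            obs.append(state[1])                           # record the y variable
        x, y, z = state
        drift = np.array([10.0 * (y - x),
                          25.0 * x - y - x * z,
                          -8.0 / 3.0 * z + x * y])
        noise = np.array([0.0, 0.0, sigma_noise * np.sqrt(dt) * rng.standard_normal()])
        state = state + drift * dt + noise
    return np.array(obs[:n_samples])

series = lorenz_langevin(sigma_noise=10.0, n_samples=2000)
series = (series - series.mean()) / series.std()           # pre-normalization assumed by the prior (13)
```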

3.2. Optimal embedding dimension of a model

The success of a deterministic reconstruction depends on how well the observed attractor (or regime) of the system can be embedded into the phase space. In particular, the minimal embedding dimension of the unperturbed (\sigma = 0) Lorenz system (25) for the case of Takens delays of the time series (26) is d = 4. It can be assessed empirically, e.g. via the method of false nearest neighbors [1,55]. The results of this method are shown in Fig. 2: the dimension d = 4 corresponds to the case when the fraction of false neighbors stops decreasing significantly with the growth of d. However, methods like false nearest neighbors, i.e. methods based on the analysis of the phase space geometry, are usually known to fail in the case of random or high-dimensional dynamical systems (for the latter the main reason is the insufficient duration of the available time series): low-dimensional deterministic modeling becomes less appropriate there. That is why finding the optimal embedding dimension d of a stochastic model, which actually determines the best phase space projection in terms of stochastic modeling, is an important practical problem. The Bayesian evidence of the stochastic model solves this problem. To illustrate the selection of the optimal embedding dimension by the optimality criterion, we will investigate the dependence of the optimality cost function (20) on the dimension d, with the other hyperparameters (the number of neurons m_f of the deterministic part and m_g of the stochastic part) minimizing the function (20) for a given d:

E(d) = \min_{m_f, m_g, q} [ - \ln P(X | H_i, q) ].    (27)

Note that such a dependence on d, generally speaking, is expected to be invariant to different ways of representing the functions f and g by means of universal nonlinear approximators.

3.3. Optimality: near-deterministic system case

Consider as the first example a "near-deterministic" case, namely a time series of length N = 2000 generated by the system (25) for \sigma = 1. It is natural to expect that, with a low noise level, the optimal model will have a dimension close to 4, a deterministic part f giving a good description of the system behavior, and a nearly zero, weakly pronounced stochastic part g. After application of the Bayesian optimization algorithm (Section 2.3.2) we estimated the dependence of the stochastic model optimality on the embedding dimension (27) for this case, which is shown in Fig. 3. Indeed, it is clear that in the case under consideration the optimal dimension of the model is in the 4–5 range. Let us consider the behavior of the obtained optimal model (4), i.e. the model with the parameters (\mu_0, \nu_0) corresponding to the maximum of the posterior PDF (23) at the optimal hyperparameters d, m_f, m_g, q.


Fig. 1. The time series generated by system (25) at σ = 0 (a) and σ = 10 (b). Upper panel – fragment of the time series of xn; lower panel (from left to right) – discrete phase trajectory projection onto the plane of variables (xn, xn+1) (for the durations N = 500 and N = 2000) and the invariant measure of the system in the same variables.

Fig. 2. The fraction of false neighbors versus model dimension d assessed using the method of false nearest neighbors by the time series of the system (25) at σ = 0.

Fig. 3. Optimality cost function (20) as a function of model dimension d with the other parameters corresponding to the minimum of E at fixed d – Eq. (27).

The invariant measure of the optimal model in the plane of the variables (xn, xn+1) (estimated from a long time series generated by the stochastic model) is shown in Fig. 4 (left panel) in comparison with the invariant measure of the system (right panel).

The middle panel of Fig. 4 depicts the invariant measure generated by the model (4) in which the stochastic part was zeroed (let us call it the f-based model); this panel characterizes the intrinsic dynamics of the deterministic part of the model. It is seen that all the measures look the same. Hence, the f-based model almost entirely captures the behavior of the full model (4) and reproduces well the invariant measure of the system (25) at low noise, while g is insignificant in this case.


Fig. 4. Comparison of the invariant measures of model (4) and system (25) in the case of low noise level (σ = 1): (a) behavior of the model (4); (b) behavior of the deterministic part of the model (4); (c) behavior of the system (25).

This example demonstrates that the proposed algorithm performs adequately in the standard case of the low-dimensional near-deterministic system under consideration and selects a near-deterministic model.

3.4. Optimality: significantly random system case

Further, let us investigate the behavior of the optimality in the case when the system cannot be recognized as close to a low-dimensional deterministic one.

3.4.1. Dependence on the duration and noise level

We analyzed the time series generated by the system (25) at three values of the noise level: σ = 5, σ = 10 and σ = 20. For each of these σ values we considered the cases of "short" (N = 500) and "long" (N = 2000) time series. The obtained optimization results are presented in Fig. 5 as the optimality cost function E versus the dimension d (Eq. (27)). One can see the tendency of the optimal embedding dimension to decrease with the increase of the noise level and/or the decrease of the time series length.

Fig. 6. Optimality (20) of the model having dimension d = 3 as a function of the number of neurons mf in the deterministic part of the model, for the length of the observed time series N = 500 (a) and N = 2000 (b), in the system with noise σ = 10. The curves of different colors correspond to different numbers of neurons mg in the stochastic part.

This circumstance is a consequence of the fact that, with such variations of the learning sample, the ability of the stochastic model (4) to filter the noise is reduced, i.e. the reconstructed deterministic part f becomes less adequate than the deterministic part of the studied system (25).

Fig. 5. Behavior of the optimality (20) at noise levels σ = 5, 10 and 20 in system (25) and time series lengths N = 500 and 2000. Each panel shows the optimality versus the model dimension d (with the other parameters corresponding to the minimum of E at fixed d – Eq. (27)). The optimal dimension in each case corresponds to the minimum of E(d).


Table 1
Optimal values of the hyperparameters of the model for different parameters of the system.

N \ σ     σ = 5                            σ = 10                           σ = 20
2000      mf = 7–13, mg = 2–6, d = 4–5     mf = 7–11, mg = 2–6, d = 3–4     mf = 5–11, mg = 2–6, d = 3
500       mf = 5–10, mg = 0–6, d = 3       mf = 4–7, mg = 0–6, d = 3        mf = 3–7, mg = 0–6, d = 3

In other words, with the increase of the number of parameters the model starts to overfit earlier. From the practical point of view, an interesting question is how the optimal model changes when the duration of the observation changes for the same system. As an example, we considered the case σ = 10. To investigate this question, the optimality (20) as a function of the number of neurons mf in the deterministic part of the model (4), for different numbers of neurons mg in the stochastic part, is plotted in Fig. 6 for two time series of different lengths generated by the system (25) with noise σ = 10. The model dimension was d = 3 in this case, i.e. the optimal one according to Fig. 5. From these dependences the following conclusions can be made. First, the minimum over the number of neurons in the deterministic part, as well as the minimum over the dimension (see Fig. 5), is less pronounced for larger data length N. This effect is explained by the practical overfitting criterion: with a larger amount of data we can afford a larger number of neurons in the network. Consequently, the addition of a new neuron (beyond the optimal number of neurons) to the deterministic part leads to a faster growth of the penalty term (1/2) ln |(1/2π) ∇∇^T Ψ_{X,Hi,q}| in (22) with respect to the number of parameters. Second, it is clear that for N = 2000 the model with

a homogeneous stochastic part (mg = 0), as well as the model with one neuron in the stochastic part (mg = 1), is much less justified (optimal) than the models with mg = 2, 3, 4 (see Fig. 6b). In other words, with a rather long data sample, a model with a significantly inhomogeneous (state-dependent) stochastic part tends to be optimal. At the same time, the length N = 500 is insufficient for the model to capture this inhomogeneity: one can see in Fig. 6a that the models with homogeneous and inhomogeneous stochastic parts are equally valid. Thus, in the general case the use of the state-dependent approximation of the stochastic term in the model (4) is preferable, as it enables a more effective description of the data compared to the model with constant variance of the random component. More precisely, the significance of the state-dependent stochastic forcing grows with the increase of the time series duration. To summarize, the values of the hyperparameters corresponding to the neighborhood of the minimum of the optimality criterion (20) for the system with the other values of the noise σ and the length N are collected in Table 1. It is clearly seen that the optimal dimension of the model and the number of neurons in the deterministic part decrease with the growth of the noise level.

Fig. 7. Comparison of the invariant measures of the full model (4), the f-based model (deterministic part of the stochastic model) and the system (25) at different noise levels (by rows). Model (4) was learned from time series of system (25) with lengths N = 500 and N = 2000 (by columns). By rows: row 1 – σ = 5 (a1–a5), row 2 – σ = 10 (b1–b5), row 3 – σ = 20 (c1–c5). By columns: column 1 – invariant measure of system (25), columns 2 and 4 – invariant measures of model (4) for the lengths N = 500 and N = 2000, respectively, columns 3 and 5 – invariant measures of the f-based model (4) for the lengths N = 500 and N = 2000, respectively.


Fig. 8. Examples of behavior of "too simple" (d = 1, mf = 3, mg = 2), optimal (d = 3, mf = 6, mg = 4) and overfitted (d = 8, mf = 17, mg = 3) models learned from the time series of system (25) with length N = 500 and noise σ = 10: (a–c) – invariant measures of the models; (d) – distribution of the time between transitions of the phase trajectory between the wings of the Lorenz attractor (for the system and the models, see key). The time between the transitions is measured in the number of discrete points of the time series.

Also, with large enough noise, the optimal dimension of the model for both "short" and "long" time series becomes less than the minimal embedding dimension corresponding to the deterministic system, which is equal to 4 (Fig. 5e and f). Hence, at high noise intensity, the observed behavior is indistinguishable from the dynamics of a stochastic system with a dimension smaller than the dimension of the original system that generated the time series. This is clearly illustrated in Fig. 7, which shows the invariant measures of both the full model (4) and the f-based model (deterministic part) constructed from time series of different lengths and different noise levels, in comparison with the invariant measure of the system. It is seen in the figures that with the increase of noise, as well as with the decrease of the length of the time series, the invariant measure of the f-based model captures the initial deterministic system dynamics increasingly less (see also the invariant measure of the unperturbed system (25) shown in Fig. 1).

3.4.2. Verification of the optimality

To verify the adequacy of optimal model selection by the proposed method, let us compare the reconstructed stochastic models with different hyperparameters for the system with noise σ = 10. We chose for this illustration a short (N = 500) time series, as the dependence of model optimality on the hyperparameters is much stronger than in the case N = 2000 (following Figs. 5 and 6). Qualitative characteristics of the obtained models are shown in Fig. 8 for three cases: a too simple model with d = 1, the optimal model with d = 3, and a too complex (overfitted) model with d = 8. The upper panels of Fig. 8 show the 2-D invariant measures. It is clear that the invariant measure of the overfitted model has a stronger asymmetry of the wings than the invariant measures of the too simple and optimal models. At the same time, the system does not demonstrate such an asymmetry (see the invariant measure of the system in Fig. 7). The lower panel of Fig. 8 displays another relevant characteristic of the system: the statistics of the residence time of the phase trajectory in one wing of the attractor. This time is measured in the number of discrete time intervals. One can see that the simple model does not capture the two peaks of the discrete residence time PDF which are observed in the system, i.e. it does not capture the dynamical properties of the system because of insufficient dimension d. The overfitted model also underestimates the main peak of the discrete residence time PDF. Thus, according to the considered characteristics, the optimal stochastic model provides the best reproduction of the qualitative dynamical properties of the system.
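The residence-time statistics used here can be computed directly from a scalar trajectory. In the sketch below the wings are distinguished simply by the sign of the (pre-normalized) observable, which is an assumption on our part rather than the paper's exact recipe.

```python
import numpy as np

def residence_times(x, threshold=0.0):
    """Numbers of discrete steps spent in one wing between consecutive wing switches."""
    wing = np.sign(np.asarray(x, dtype=float) - threshold)
    switches = np.flatnonzero(np.diff(wing) != 0)   # indices where the trajectory changes wing
    return np.diff(switches)

# usage: `traj` is any scalar trajectory (observed or generated by a fitted model)
traj = np.cumsum(np.random.default_rng(1).standard_normal(5000))  # dummy data, for illustration only
histogram = np.bincount(residence_times(traj))
```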


Fig. 9. Comparison of the behavior of models with homogeneous and inhomogeneous stochastic parts learned from a time series of length N = 2000 of the system (25) with noise level σ = 10: (a) invariant measure of the system; (b) invariant measure of the optimal model – d = 3, mf = 7, mg = 2; (c) invariant measure of the model with homogeneous stochastic part – d = 3, mf = 7, mg = 0; (d) distribution of the time between transitions of the phase trajectory between the wings of the Lorenz attractor (for the system and the models, see key). The time between the transitions is measured in the number of discrete points in the time series.

Finally, in the same way, Fig. 9 provides a visual demonstration of the second important statement: that, in the general case, the model with an inhomogeneous stochastic part is optimal. Here we took the case with N = 2000 and σ = 10 for the illustration, because the optimal number of neurons in the stochastic part is essentially nonzero in this case. The upper panels of Fig. 9 show the invariant measure of the system compared to the invariant measures of both the optimal model with mg = 2 and the model corresponding to the homogeneous stochastic part with mg = 0 (d and mf are optimal in both cases). Definitely, the invariant measure of the model with the state-dependent stochastic part (Fig. 9b) matches the invariant measure of the system better than that of the model with constant g. In the lower panel of Fig. 9, again, the statistics of the residence time defined above is shown for the system and the two described models. Here both models capture the main peaks of the discrete residence time PDF, though this bimodality is better expressed by the model with the inhomogeneous stochastic part. Also, this optimal model better captures the secondary peak at the residence time equal to 9. It is clear that in terms of the two-dimensional invariant measure and the residence time distribution, the model with the optimal inhomogeneous stochastic part reproduces the system dynamics much better than the model with constant variance of the random perturbations.

4. Conclusions

We have proposed a method of reconstructing an optimal stochastic model of the evolution operator underlying an observed random process. The practical importance of this problem is explained by the fact that the observed time series of natural origin are almost always empirically indistinguishable from stochastic processes, and statistically correct algorithms are needed for retrieving information about the laws of their evolution. A discrete model of the evolution operator has been considered. It is a superposition of deterministic and stochastic parts that are represented through functions in the form of three-layer artificial neural networks.


An important property of the proposed model is that the data-adaptive stochastic perturbations of the evolution operator are inhomogeneous over phase space (state-dependent), with the complexity (nonlinearity) of both the deterministic part of the model and the inhomogeneity of the noise action being optimized within the Bayesian framework. An algorithm of model construction, including the estimation of the Bayesian optimality criterion of model complexity, has been presented in detail. The efficiency of the model has been demonstrated on the example of analyzing data generated by the Lorenz-63 system perturbed by Langevin noise. Evidently, as a result of the nonlinearity of the considered system, the response to random perturbations is not homogeneous over the phase space of the system; hence, the degree of ambiguity of the evolution operator depends significantly on the values of the dynamic variables. Correct treatment of such inhomogeneity during the construction of the cost function for the parameters of the empirical model undoubtedly improves the accuracy of the system reconstruction from observed data and, hence, the ability of the model to reflect the key dynamical properties of the system and to reproduce the qualitative behavior (invariant measure), as well as the predictive capability of the model. However, the main question arising in this connection is the degree of informativeness of the available data sample for such a reconstruction. We have shown that in the case of a sufficiently large time series length, the model with an inhomogeneous stochastic part is more effective in terms of the Bayesian optimality criterion than the model with state-independent random perturbations. We have compared the dynamical characteristics (2-D invariant measure and residence time PDF) of the models with the dynamical properties of the studied system and shown that, in terms of the proposed method, the optimal model (in which the possible state dependence of the perturbations is also taken into consideration and optimized) is the best at reproducing the qualitative behavior of the system underlying the observed dynamics. The particular form of the model which we suggest and investigate (Eq. (4)) is natural for using a feedforward ANN to approximate the deterministic and stochastic parts. At the same time, it is interesting to implement the idea of Bayesian optimization of a model with state-dependent stochastic perturbation by means of the more sophisticated types of ANN used in deep machine learning. In this case both the model equations and the cost functions will be slightly modified in accordance with the developed framework. Work in this direction is left for the future.

Acknowledgments

We would like to thank Yury Malkov for useful discussions concerning our work. The study was supported by the Russian Science Foundation (Grant no. 16-12-10198).

References

[1] Abarbanel HDI. Analysis of observed chaotic data. New York, NY: Institute for Nonlinear Science; Springer; 1996. ISBN 978-0-387-98372-1. doi:10.1007/978-1-4612-0763-4.
[2] Bezruchko BP, Smirnov DA. Extracting knowledge from time series. Springer series in synergetics. Berlin, Heidelberg: Springer; 2010. ISBN 978-3-642-12600-0. doi:10.1007/978-3-642-12601-7.
[3] Jin F-F, Neelin JD, Ghil M. El Niño/Southern Oscillation and the annual cycle: subharmonic frequency-locking and aperiodicity. Physica D 1996;98:442–65.
[4] Zebiak SE, Cane MA. A model El Niño–Southern Oscillation. Mon Weather Rev 1987;115(10):2262–78. doi:10.1175/1520-0493(1987)115<2262:AMENO>2.0.CO;2.
[5] Kwasniok F. Optimal Galerkin approximations of partial differential equations using principal interaction patterns. Phys Rev E 1997;55(5):5365–75. doi:10.1103/PhysRevE.55.5365.
[6] Feigin AM, Konovalov IB. On the possibility of complicated dynamic behavior of atmospheric photochemical systems: instability of the antarctic photochemistry during the ozone hole formation. J Geophys Res: Atmos 1996;101(D20):26023–38. doi:10.1029/96JD02011.
[7] Feigin AM, Konovalov LB, Molkov YI. Toward an understanding of the nonlinear nature of atmospheric photochemistry: essential dynamic model of the mesospheric photochemical system. J Geophys Res: Atmos 1998;103(D19):25447–60. doi:10.1029/98JD01569.

[8] Zaliapin I, Ghil M. Another look at climate sensitivity. Nonlinear Process Geophys 2010;17(2):113–22. doi:10.5194/npg-17-113-2010.
[9] Ghil M, Zaliapin I, Thompson S. A delay differential model of ENSO variability: parametric instability and the distribution of extremes. Nonlinear Process Geophys 2008;15(3):417–33. doi:10.5194/npg-15-417-2008.
[10] Kravtsov S, Kondrashov D, Ghil M. Empirical model reduction and the modeling hierarchy in climate dynamics. In: Palmer T, Williams P, editors. Stochastic physics and climate modelling. Cambridge University Press; 2009. p. 35–72.
[11] Mukhin D, Loskutov E, Mukhina A, Feigin A, Zaliapin I, Ghil M. Predicting critical transitions in ENSO models. Part I: methodology and simple models with memory. J Clim 2015;28(5):1940–61. doi:10.1175/JCLI-D-14-00239.1.
[12] Mukhin D, Kondrashov D, Loskutov E, Gavrilov A, Feigin A, Ghil M. Predicting critical transitions in ENSO models. Part II: spatially dependent models. J Clim 2015;28(5):1962–76. doi:10.1175/JCLI-D-14-00240.1.
[13] Zhang G, Kline D. Quarterly time-series forecasting with neural networks. IEEE Trans Neural Netw 2007;18(6):1800–14. doi:10.1109/TNN.2007.896859.
[14] Gribkov D, Gribkova V. Learning dynamics from nonstationary time series: analysis of electroencephalograms. Phys Rev E 2000;61(6):6538–45. doi:10.1103/PhysRevE.61.6538.
[15] Jolliffe IT. Principal component analysis. Springer series in statistics. 2nd ed. New York, NY: Springer; 1986. ISBN 978-1-4757-1906-2. doi:10.1007/978-1-4757-1904-8.
[16] Preisendorfer R. Principal component analysis in meteorology and oceanography. Elsevier; 1988.
[17] Navarra A, Simoncini V. A guide to empirical orthogonal functions for climate data analysis. Springer-Verlag; 2010.
[18] Hannachi A, Jolliffe IT, Stephenson DB. Empirical orthogonal functions and related techniques in atmospheric science: a review. Int J Climatol 2007;27(9):1119–52. doi:10.1002/joc.1499.
[19] Hannachi A. A new set of orthogonal patterns in weather and climate: optimally interpolated patterns. J Clim 2008;21(24):6724–38. doi:10.1175/2008JCLI2328.1.
[20] de la Iglesia MD, Tabak EG. Principal dynamical components. Commun Pure Appl Math 2013;66(1):48–82. doi:10.1002/cpa.21411.
[21] Hasselmann K. PIPs and POPs: the reduction of complex dynamical systems using principal interaction and oscillation patterns. J Geophys Res 1988;93(D9):11015. doi:10.1029/JD093iD09p11015.
[22] Pires CAL, Ribeiro AFS. Separation of the atmospheric variability into non-Gaussian multidimensional sources by projection pursuit techniques. Clim Dyn 2017;48(3):821–50. doi:10.1007/s00382-016-3112-9.
[23] Vejmelka M, Pokorná L, Hlinka J, Hartman D, Jajcay N, Paluš M. Non-random correlation structures and dimensionality reduction in multivariate climate data. Clim Dyn 2015;44(9–10):2663–82. doi:10.1007/s00382-014-2244-z.
[24] Ghil M. Advanced spectral methods for climatic time series. Rev Geophys 2002;40(1):1003. doi:10.1029/2000RG000092.
[25] DelSole T, Tippett MK, Shukla J. A significant component of unforced multidecadal variability in the recent acceleration of global warming. J Clim 2010;24(3):909–26. doi:10.1175/2010jcli3659.1.
[26] van den Dool HM, Saha S, Johansson Å. Empirical orthogonal teleconnections. J Clim 2000;13(8):1421–35. doi:10.1175/1520-0442(2000)013<1421:EOT>2.0.CO;2.
[27] Vautard R, Yiou P, Ghil M. Singular-spectrum analysis: a toolkit for short, noisy chaotic signals. Physica D 1992;58:95–126.
[28] Tan S, Mayrovouniotis ML. Reducing data dimensionality through optimizing neural network inputs. AIChE J 1995;41(6):1471–80. doi:10.1002/aic.690410612.
[29] Kramer MA. Nonlinear principal component analysis using autoassociative neural networks. AIChE J 1991;37(2):233–43. doi:10.1002/aic.690370209.
[30] Dong D, McAvoy T. Nonlinear principal component analysis—based on principal curves and neural networks. Comput Chem Eng 1996;20(1):65–78. doi:10.1016/0098-1354(95)00003-K.
[31] Lee JA, Verleysen M, editors. Nonlinear dimensionality reduction. Information science and statistics. New York, NY: Springer; 2007. ISBN 978-0-387-39350-6. doi:10.1007/978-0-387-39351-3.
[32] Gavrilov A, Mukhin D, Loskutov E, Volodin E, Feigin A, Kurths J. Method for reconstructing nonlinear modes with adaptive structure from multidimensional data. Chaos: Interdiscip J Nonlinear Sci 2016;26(12):123101. doi:10.1063/1.4968852.
[33] Hastie T. Principal curves and surfaces. Ph.D. thesis. Stanford Linear Accelerator Center, Stanford University; 1984. http://pca.narod.ru/HastieThesis.htm.
[34] Gámez AJ, Zhou CS, Timmermann A, Kurths J. Nonlinear dimensionality reduction in climate data. Nonlinear Process Geophys 2004;11(3):393–8. doi:10.5194/npg-11-393-2004.
[35] Scholkopf B, Smola A, Muller KR. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 1998;10(5):1299–319. http://neco.mitpress.org/cgi/content/abstract/10/5/1299.
[36] Mukhin D, Gavrilov A, Feigin A, Loskutov E, Kurths J. Principal nonlinear dynamical modes of climate variability. Sci Rep 2015;5:15510. doi:10.1038/srep15510.
[37] Molkov YI, Mukhin DN, Loskutov EM, Feigin AM, Fidelin GA. Using the minimum description length principle for global reconstruction of dynamic systems from noisy time series. Phys Rev E 2009;80(4):046207. doi:10.1103/PhysRevE.80.046207.
Reducing data dimensionality through optimizing neural network inputs. AlChE J 1995;41(6):1471–80. doi:10.1002/aic.690410612. Kramer MA. Nonlinear principal component analysis using autoassociative neural networks. AlChE J 1991;37(2):233–43. doi:10.1002/aic.690370209. Dong D, McAvoy T. Nonlinear principal component analysis—based on principal curves and neural networks. Comput Chem Eng 1996;20(1):65–78. doi:10. 1016/0 098-1354(95)0 0 0 03-K. Lee JA, Verleysen M, editors. Nonlinear dimensionality reduction. Information science and statistics. New York, NY: Springer; 2007. ISBN 978-0-387-39350-6. doi:10.1007/978- 0- 387- 39351- 3. Gavrilov A, Mukhin D, Loskutov E, Volodin E, Feigin A, Kurths J. Method for reconstructing nonlinear modes with adaptive structure from multidimensional data. Chaos: Interdiscip J Nonlinear Sci 2016;26(12):123101. doi:10.1063/ 1.4968852. Hastie T. Principal curves and surfaces: Ph.D dissertation (Ph.D. thesis). Stanford Linear Accelerator Center, Stanford University; 1984. http://pca.narod.ru/ HastieThesis.htm. Gámez AJ, Zhou CS, Timmermann A, Kurths J. Nonlinear dimensionality reduction in climate data. Nonlinear Process Geophys 2004;11(3):393–8. doi:10. 5194/npg- 11- 393- 2004. Scholkopf B, Smola A, Muller KR. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comp 1998;10(5):1299–319. http://neco.mitpress. org/cgi/content/abstract/10/5/1299. Mukhin D, Gavrilov A, Feigin A, Loskutov E, Kurths J. Principal nonlinear dynamical modes of climate variability. Sci Rep 2015;5:15510. doi:10.1038/ srep15510. Molkov YI, Mukhin DN, Loskutov EM, Feigin AM, Fidelin GA. Using the minimum description length principle for global reconstruction of dynamic systems from noisy time series. Phys Rev E 2009;80(4):046207. doi:10.1103/ PhysRevE.80.046207.

A. Gavrilov et al. / Chaos, Solitons and Fractals 104 (2017) 327–337 [38] Rossi V, Vila J-P. Bayesian multioutput feedforward neural networks comparison: a conjugate prior approach. IEEE Trans Neural Netw 2006;17(1):35–47. doi:10.1109/TNN.2005.860883. [39] Molkov YI, Loskutov EM, Mukhin DN, Feigin AM. Random dynamical models from time series. Phys Rev E 2012;85(3):036216. doi:10.1103/PhysRevE.85. 036216. [40] Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Netw 1989;2(5):359–66. [41] Feigin AM, Molkov YI, Mukhin DN, Loskutov EM. Investigation of nonlinear dynamical properties by the observed complex behaviour as a basis for construction of dynamical models of atmospheric photochemical systems. Faraday Discuss 2002;120:105–23. doi:10.1039/b102985c. [42] Molkov YI, Mukhin DN, Loskutov EM, Timushev RI, Feigin AM. Prognosis of qualitative system behavior by noisy, nonstationary, chaotic time series. Phys Rev E 2011;84(3):036215. doi:10.1103/PhysRevE.84.036215. [43] Grieger B, Latif M. Reconstruction of the El Niño attractor with neural networks. Clim Dyn 1994;10:267–76. [44] Wu A, Hsieh WW, Tang B. Neural network forecasts of the Tropical Pacific Sea surface temperatures. Neural Netw 2006;19(2):145–54. doi:10.1016/j.neunet. 20 06.01.0 04. [45] Jeffreys H. Theory of probability. Clarendon Press; 1998. https://global.oup. com/academic/product/the-theory-of-probability-9780198503682?cc=ru{&} lang=en{&}. ISBN 9780198503682. [46] Arnold L. Random dynamical systems. Springer monographs in mathematics. Berlin: Springer-Verlag; 1998. ISBN 3540637583.

337

[47] Takens F. Detecting strange attractors in turbulence. In: Dynamical systems and turbulence. Berlin; Heidelberg: Springer; 1980. p. 366–81. doi:10.1007/ BFb0091924. [48] Kolen John F, Kremer Stefan C. A Field Guide to Dynamical Recurrent Networks. Gradient Flow in Recurrent Nets: The Difficulty of Learning LongTerm Dependencies. edition 1. Wiley-IEEE Press; 2001. p. 237–43. doi:10.1109/ 9780470544037.ch14. [49] Rissanen J. Modeling by shortest data description. Automatica 1978;14(5):465– 71. doi:10.1016/0 0 05-1098(78)90 0 05-5. [50] Neal RM. Probabilistic inference using Markov chain Monte Carlo methods. Technical report. Toronto, Ontario, Canada: Department of Computer Science, University of Toronto; 1993. [51] Loskutov EM, Molkov YI, Mukhin DN, Feigin AM. Markov chain monte carlo method in Bayesian reconstruction of dynamical systems from noisy chaotic time series. Phys Rev E 2008;77(6):066214. doi:10.1103/PhysRevE.77.066214. [52] Schwarz G. Estimating the dimension of a model. Ann Stat 1978;6(2):461–4. doi:10.1214/aos/1176344136. [53] Lorenz EN. Deterministic nonperiodic flow. J Atmos Sci 1963;20(2):130–41. doi:10.1175/1520-0469(1963)0200130:DNF 2.0.CO;2. [54] Judd K, Mees A. Embedding as a modeling problem. Phys D: Nonlinear Phenom 1998;120(3–4):273–86. doi:10.1016/S0167-2789(98)0 0 089-X. [55] Kennel MB, Brown R, Abarbanel HDI. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys Rev A 1992;45(6):3403–11. doi:10.1103/PhysRevA.45.3403.