Information-theoretic approach to dynamics of stochastic systems


Probabilistic Engineering Mechanics 27 (2012) 47–56


K. Sobczyk a,b,*, P. Hołobut a

a Institute of Fundamental Technological Research of the Polish Academy of Sciences, Pawińskiego 5b, 02-106 Warsaw, Poland
b Department of Mathematics, Informatics and Mechanics, University of Warsaw, Poland

Article history: Available online 8 May 2011

Keywords: Stochastic dynamics; Information; Entropy; Relative entropy; Information gain; Non-Gaussian prediction; Maximum entropy approximation

Abstract: In the paper we show how some basic informational quality measures (e.g. the Shannon entropy and the relative entropy/Kullback–Leibler divergence) defined for stochastic dynamical systems change in time and how they depend on the system properties and the intensity of random disturbances. First, Liouvillian systems (when randomness is present in the initial states only) are discussed, and then various (linear and nonlinear) systems with random external excitations are treated in detail. Both general and specific systems are considered, including numerical and graphical illustrations.

1. Introduction

Stochastic dynamics is now a greatly advanced field comprising methods of investigation of various physical/engineering systems subjected to external and/or parametric random excitations (cf. [1,2]). It seems, however, that the existing analyses have mostly concentrated on the characterization of such response/solution properties as stability, extremal properties of the response, first-passage time, stochastic resonance, etc. In many situations such "reliability-type" measures of system dynamics turn out to be insufficient; especially in the analysis of various complex biological and social systems, an adequate approach should take advantage of the potential inherent in information theory.

Information-theoretic reasoning is well known in physics, especially in statistical thermodynamics, where the relationships between the statistical (or informational) entropy of a system and its thermodynamical entropy have been studied for a long time (cf. [3,4]). A good example is the Boltzmann equation for the time-dependent probability density of a dilute gas in the space of position and velocity of particles, and the famous result for the Boltzmann (statistical) entropy known as the H-theorem. However, in contrast with standard thermodynamical intuition (and the second law of thermodynamics, valid for isolated systems), the information entropy in real open systems (interacting with the environment) need not be a monotonic function of time; it may oscillate, increase or decrease in time depending on the system dynamics.


Information theory is especially relevant to data processing and statistical inference. But analogous problems arise in the context of any real dynamical system; for example, how much information can we infer from a particular set of observations about the dynamical system in question? How does the Shannon entropy flow through a dynamical system, and how does it depend on the system characteristics and the intensities of random external/internal disturbances? Other problems occurring in the modelling and analysis of stochastic dynamical systems (which require an information-theoretic approach) are concerned with the comparison of the "goodness" of various models; for example, what is the information gain that a "realistic" model provides (about the process) in excess of that obtained on the basis of a corresponding "simplified" model?

For some time, problems of information and dynamics, like those indicated above, have attracted research interest in various fields. Existing attempts in this direction can be found in [5] and the references therein, as well as, e.g., in the recent papers [6,7]. There is also an inspiring literature on information measures of predictability for turbulent flow and for other complex processes in environmental science.

The objective of this paper is to show how some basic informational quality measures defined for stochastic systems change in time and how they depend on the system properties and the probabilistic nature of random excitations. Special attention is paid to quantifying the entropy rate in selected classes of stochastic systems and to characterizing the information divergence (or information gain) between some "real" and "idealized" stochastic dynamical models. In addition to the general approach, the analysis also treats specific dynamical systems analytically and numerically, with the results presented graphically.


2. Informational quality measures of stochastic systems

Let f(y, t) be the probability density of the stochastic process Y(t) = [Y₁(t), ..., Yₙ(t)] at time t ∈ T, where T is an interval (finite or infinite) on the non-negative part of the real line. (Convention: for matrix operations, all vector quantities are assumed to be column vectors, even if they are written as rows.) The Shannon entropy of Y(t) at time t is commonly defined as (cf. [8])

H_t(f) = H_Y(t) = −⟨log f(y, t)⟩_f = −∫ f(y, t) log f(y, t) dy,   (1)

where the bracket ⟨·⟩_f denotes the average value (with respect to the density f) of the quantity indicated, and the integration extends over the support of f(y, t) with respect to y ∈ Rⁿ. (Although the base of the logarithm in definition (1), and in those to follow, is in general arbitrary, two bases are especially useful: 2 and e. The transition from base a to base b is simply done by multiplying by the constant log_b a. In what follows we frequently use base e for convenience, without stating it explicitly; base e is indicated by the use of "ln" in place of "log" in the corresponding formula. However, we present numerical results in the common bits (base 2).)

The entropy (1) characterizes, for each t, the overall randomness of Y(t). More exactly, the entropy of a process with a continuous state space quantifies its randomness in a relative sense: the difference H(f₂) − H(f₁), computed in the same coordinate system, is a measure of the entropy change after the system has passed from the "state" f₁ to f₂. It is clear that each measurement (observation) of Y(t) at time t decreases its entropy H_Y(t); so, a complete recognition of Y(t), which means gaining complete information about Y(t), must be interpreted as the reduction of H_Y(t) to zero. Therefore, H_Y(t) also quantifies the information content I_t of Y(t) at time t.

It can easily be shown that for an n-dimensional Gaussian random vector Y with distribution N(m, K) the Shannon entropy is

H_G = H[N(m, K)] = (1/2) log{(2πe)ⁿ |K|},   (2)

where m and K are the mean and the covariance matrix of Y, respectively. For n = 1, we have N(m, σ²) and H_G = (1/2) log(2πe σ²), where σ² denotes the variance of Y. It should be noticed that H_G = 0 for σ = (2πe)^(−1/2) ≈ 0.24198 = σ*; for σ < σ*, H_G < 0, and for σ ⩾ σ*, H_G ⩾ 0.

Another measure which can be used for quantifying the information/entropy transfer in stochastic systems is the relative entropy (or the Kullback–Leibler divergence measure). Let f(y, t) and q(y, t) be two probability densities with the same support, such that f is the probability density of our main interest (often unknown explicitly), whereas q is an appropriate prior or reference density. Often q is assumed to be the density of an invariant or stationary distribution of the process of interest (and it does not depend on time). The relative entropy between f and q is defined as



H_t(f, q) = ⟨log [f(y, t)/q(y, t)]⟩_f = ∫ f(y, t) log [f(y, t)/q(y, t)] dy,   (3)

with integration over the common support of f and q. We can interpret H_t(f, q) as the amount of information that the distribution f(y, t) provides about the state of the system in excess of that given by the distribution q(y, t). For this reason H_t(f, q) can be regarded as a measure of the utility of the particular prediction strategy of system evolution provided by the distribution q as compared to an unknown prediction characterized by the distribution f. The relative entropy defined by Eq. (3) has the following basic properties:

(a) H_t(f, q) ⩾ 0, with equality if and only if the densities f and q are identical;
(b) H_t(f, q) is invariant under smooth and invertible transformations of the state variables;
(c) if q is constant on the state space of the process considered, i.e. q = 1/m_L, where m_L is a finite Lebesgue measure of the given state space, then

H_t(f, q) = −H_t(f) + log m_L;   (4)

if m_L = 1, then Eq. (4) gives a simple relation between the relative entropy and the Shannon entropy.

It should be underlined that (as is well known) H_t(f, q) defined by Eq. (3) is not a distance measure in the sense of mathematical analysis (it is not a metric, since it is not symmetric and the triangle inequality does not hold). However, due to its property (a), it has been widely accepted as an informational indicator of similarity (or divergence) of two probability distributions.

It is useful to remember the expression for H(f, q) when both densities are Gaussian. Let N₁(m₁, K₁) and N₂(m₂, K₂) be two n-dimensional Gaussian random variables with the densities f(y) and q(y), respectively. Then the relative entropy is

H_G(f, q) = H(N₁, N₂) = (1/2){ ln(|K₂|/|K₁|) + Tr[K₁(K₂⁻¹ − K₁⁻¹)] + (m₁ − m₂)ᵀ K₂⁻¹ (m₁ − m₂) }.   (5)

In the one-dimensional case we have N₁(m₁, σ₁²), N₂(m₂, σ₂²), and

H_G(f, q) = H(N₁, N₂) = (1/2)[ ln(σ₂²/σ₁²) + σ₁²/σ₂² − 1 + (m₁ − m₂)²/σ₂² ].   (6)

For example, when m₁ = m₂ = 0 and σ₂²/σ₁² = 2.0, then H(N₁, N₂) ≈ 0.13932 bits, whereas for σ₂²/σ₁² = 1/2, H(N₁, N₂) ≈ 0.22134 bits.

The third important information-theoretic measure of stochastic systems is the mutual information. Let X and Y be two dependent vectorial random variables (e.g. the instantaneous values of stochastic processes). The Shannon mutual information, or the Shannon amount of information about X provided by observation of Y, is defined as

I(X, Y) = ∫∫ f_XY(x, y) log [ f_XY(x, y) / (f_X(x) f_Y(y)) ] dx dy.   (7)

Let X = [X₁, ..., X_k], Y = [Y₁, ..., Y_l] and [X, Y] be, respectively, k-, l- and (k + l)-dimensional Gaussian random vectors with covariance matrices A, B, C. Then

I(X, Y) = (1/2) log ( |A||B| / |C| ).   (8)

If k = l = 1, then I(X, Y) = −(1/2) log(1 − ρ_XY²), where ρ_XY is the correlation coefficient. It is clear that the mutual information I(X, Y) can be regarded as the relative entropy between the joint density f_XY(x, y) and the product of its marginals; so it quantifies the information gain provided by accounting for the dependence between the variables considered.
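The closed-form Gaussian expressions (2), (5), (6) and (8) are easy to check numerically. The following short script is only an illustrative aid (it is not part of the original paper and assumes NumPy is available); it evaluates them in bits and reproduces the two values quoted after Eq. (6).

```python
import numpy as np

LOG2 = np.log(2.0)  # conversion factor nats -> bits

def gaussian_entropy_bits(K):
    """Shannon entropy of N(m, K) in bits, Eq. (2); K is the covariance matrix."""
    K = np.atleast_2d(K)
    n = K.shape[0]
    return 0.5 * np.log((2.0 * np.pi * np.e) ** n * np.linalg.det(K)) / LOG2

def gaussian_kl_bits(m1, K1, m2, K2):
    """Relative entropy H(N1, N2) in bits, Eqs. (5)-(6)."""
    m1, m2 = np.atleast_1d(m1), np.atleast_1d(m2)
    K1, K2 = np.atleast_2d(K1), np.atleast_2d(K2)
    K2inv = np.linalg.inv(K2)
    d = m1 - m2
    nats = 0.5 * (np.log(np.linalg.det(K2) / np.linalg.det(K1))
                  + np.trace(K1 @ K2inv) - K1.shape[0]
                  + d @ K2inv @ d)
    return nats / LOG2

def gaussian_mi_bits(A, B, C):
    """Mutual information of jointly Gaussian X, Y in bits, Eq. (8)."""
    detA = np.linalg.det(np.atleast_2d(A))
    detB = np.linalg.det(np.atleast_2d(B))
    detC = np.linalg.det(np.atleast_2d(C))
    return 0.5 * np.log(detA * detB / detC) / LOG2

# One-dimensional checks against the values quoted after Eq. (6):
print(gaussian_kl_bits(0.0, 1.0, 0.0, 2.0))   # ~0.13932 bits (sigma2^2/sigma1^2 = 2)
print(gaussian_kl_bits(0.0, 1.0, 0.0, 0.5))   # ~0.22134 bits (sigma2^2/sigma1^2 = 1/2)
print(gaussian_entropy_bits(0.24198 ** 2))    # ~0 bits: sigma = sigma* = (2*pi*e)**-0.5
print(gaussian_mi_bits(1.0, 1.0, [[1.0, 0.5], [0.5, 1.0]]))  # k = l = 1, rho = 0.5
```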

3. Entropy of systems with random initial states

Let us consider first the simplest form of random dynamics (rooted in classical statistical physics), when the dynamical system itself is deterministic and randomness is generated by the random initial conditions. In general, the system is governed by the system of equations


dY(t)/dt = F[Y(t), t],   Y(t₀) = Y₀(γ),   (9)

where t denotes time, Y(t) = [Y₁(t), ..., Yₙ(t)] is the unknown process, and Y₀(γ) = [Y₁₀(γ), ..., Yₙ₀(γ)] is an n-dimensional random variable (γ ∈ Γ, where (Γ, F, P) is the probability space on which Y₀(γ) is defined) characterized by a probability density f₀(y). If the assumptions of the theorem on the existence and uniqueness of the solution are satisfied, then the solution of Eq. (9) is given by

Y(t, γ) = g[t₀, t, Y₀(γ)].   (10)

During the motion the initial probability density f₀(y) evolves to a new probability density f(y, t), where t > t₀. Also the initial entropy H(t₀) = H₀ transfers to the entropy H(t). In general, the probability density f(y, t) satisfies the Liouville equation (cf. [9])

∂f(y, t)/∂t + Σ_{i=1}^n ∂[f(y, t) F_i(y, t)]/∂y_i = 0,   (11)

or in vectorial form

∂f(y, t)/∂t + div[f(y, t) F(y, t)] = 0.   (12)

This equation says that the probability density f(y, t) is the integral invariant of motion of the system (9) or, in other words, that the probability is conserved during the motion.

Making use of Eq. (11) we can derive the evolution equation for the Shannon entropy functional (1) defined on f. To do this we assume that: the functions F_i and the density f have partial derivatives with respect to time and the variables y_i, i = 1, ..., n; the products f F_i tend to zero sufficiently fast as y_i → ±∞; and, for the functions considered, ∫ (·) dy and d/dt commute. Indeed,

dH_t(f)/dt = −(d/dt) ⟨ln f⟩_f = −⟨(1/f) df/dt⟩_f = −⟨(1/f)[ Σ_{i=1}^n (∂f/∂y_i)(dy_i/dt) + ∂f/∂t ]⟩_f,   (13)

where it was used that (d/dt) ∫ f ln f dy = ∫ [(d/dt) ln f] f dy — easily shown by assuming that every point of the volume element dy moves according to the equations of motion (9); then conservation of probability requires that f dy is constant in time, i.e. (d/dt)[f dy] = 0 (dP = f dy is here a time-invariant measure). Alternatively, one can write (d/dt) ∫ f ln f dy = ∫ (∂/∂t)[f ln f] dy, then use Eq. (11) to replace ∂f/∂t, and finally integrate by parts, to obtain the same result (13). Because, by the normalization condition, ∫ (∂f/∂t) dy = (d/dt) ∫ f dy = (d/dt) 1 = 0, and by Eq. (9) dy/dt = F, Eq. (13) yields

dH_t(f)/dt = − Σ_{i=1}^n ∫_{−∞}^{+∞} ··· ∫_{−∞}^{+∞} F_i (∂f/∂y_i) dy₁ ... dyₙ.   (14)

Integration by parts in Eq. (14) and use of the assumptions made above (concerning the behaviour of f F_i at infinity) gives

∫_{−∞}^{+∞} F_i (∂f/∂y_i) dy_i = − ∫_{−∞}^{+∞} f (∂F_i/∂y_i) dy_i.   (15)

Therefore, we finally have the following evolution equation for the information entropy (in three equivalent forms):

dH_t(f)/dt = Σ_{i=1}^n ⟨∂F_i/∂y_i⟩_f   (16)
           = ⟨div F⟩_f   (17)
           = ∫ f div F dy,   (18)


where the integral in the last equation is n-fold with respect to y₁, y₂, ..., yₙ. It is clear (from Eqs. (16)–(18)) that the sign of the entropy rate of the system considered depends on the sign of the divergence of the vector field F(y, t) in the governing Eq. (9). We can draw the following conclusions.

1. The information entropy can grow with time only if the dynamical system (9) has positive mean flow divergence. But, in general, this condition is not satisfied.
2. For Hamiltonian systems div F = 0 (the phase-space flow is divergenceless), so dH/dt = 0, which means that the entropy H(t) is conserved during the motion, i.e. H(t) = H₀, t ⩾ t₀.
3. For non-Hamiltonian systems one may expect that the entropy may decrease with time as well as increase at some stages of the time evolution. This is especially the case for dissipative systems.

Examples. (a) A very simple illustration is as follows. Take the system Ẏ = −Y, which is non-Hamiltonian (div F = −1). Therefore, Ḣ(t) = −1, and H(t) = H₀ − (t − t₀), t ⩾ t₀. So, the entropy decreases with time, which means that the system becomes more and more predictable.

(b) Let us take a general linear system Ẏ = AY. In this case Eqs. (16)–(18) yield

dH/dt = Tr A = Σ_{i=1}^n a_ii,   H(t₀) = H₀.   (19)

So, the system is entropy-conserving if Tr A = 0, and entropy stable if Σ a_ii < 0 (a short numerical check of this relation is sketched after Example (d) below). The quantity s = Σ a_ii is equal to the sum of the eigenvalues λ_i of the matrix A. So, we have the assertion: if, for a linear system with a constant matrix A, the sum of its eigenvalues is negative, the entropy of this system (induced by random initial conditions) is monotonically decreasing (entropy stable). Each eigenvalue gives the rate of decay, or growth, along one direction in the state space. It is worth noticing that if the condition s < 0 is "globally" satisfied, it is possible that some λ's are negative and some others positive; this indicates that the entropy of some state variables Y_i may decay whereas the remaining Y_k, k ≠ i, display entropy growth.

(c) Let us consider the well known (nonlinear) Lorenz system (convective motion of air)

Ẏ₁ = −σY₁ + σY₂,
Ẏ₂ = −Y₁Y₃ + rY₁ − Y₂,   Y(t) = [Y₁(t), Y₂(t), Y₃(t)],   Y(t₀) = Y₀(γ),   (20)
Ẏ₃ = Y₁Y₂ − bY₃,

where Y₁ is the vertical convective velocity, Y₂ the temperature, and Y₃ the temperature gradient. Usually, σ = 10, b = 8/3 (when r ≈ 24 a chaotic motion starts). For this system

div F = Σ_{i=1}^3 ∂F_i/∂y_i = −(σ + 1 + b) = α = −13.66   (21)

(the phase space contracts at a uniform rate). Therefore, Ḣ(t) = α and, when t₀ = 0, H_Y(t) = H₀ − 13.66 t. It means that the information content of the initial data is gradually lost (we understand that chaotic motion generates its "own" kind of disorder (when 24 < r ≲ 250)).

(d) Let us take a model:

t Ÿ(t) + (b − t) Ẏ(t) − a Y(t) = 0,   (22)

where b > 0. In this case, with Y₁ = Y, Y₂ = Ẏ,

Ẏ₁ = Y₂,   Ẏ₂ = (1 − b/t) Y₂ + (a/t) Y₁,   (23)


and

Ḣ_Y(t) = div F = 1 − b/t.   (24)

So, H_Y(t) decreases when t < b, and increases when t > b.
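Relation (19) is easy to verify numerically. The fragment below is only a minimal sketch (not part of the original paper; it assumes NumPy and SciPy and an arbitrarily chosen example matrix A): it propagates the covariance of a Gaussian initial state through Ẏ = AY and compares the Gaussian entropy (2) with the linear prediction H(0) + t·Tr A.

```python
import numpy as np
from scipy.linalg import expm

def gaussian_entropy(K):
    """Shannon entropy (nats) of a Gaussian with covariance K, Eq. (2)."""
    n = K.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))

# Linear Liouvillian system dY/dt = A Y with a Gaussian initial state N(m0, K0):
# the state stays Gaussian with covariance K(t) = Phi(t) K0 Phi(t)^T, Phi(t) = exp(A t),
# so Eq. (19) predicts H(t) = H(0) + t * trace(A).
A = np.array([[-0.5, 1.0],
              [0.0, -0.2]])      # hypothetical example; trace(A) = -0.7 < 0 (entropy stable)
K0 = np.eye(2)

for t in (0.0, 1.0, 2.0):
    Phi = expm(A * t)
    K_t = Phi @ K0 @ Phi.T
    print(t, gaussian_entropy(K_t), gaussian_entropy(K0) + t * np.trace(A))
```

The two printed values coincide because det K(t) = exp(2t·Tr A) det K₀ for any constant matrix A.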

4. Entropy production by random noise

It is of interest to know how a random external noise influences the entropy flow in the system in question. Let us consider a system governed by the following Langevin-type vectorial equation

dY(t)/dt = F(Y, t) + σ ξ(t),   t ∈ [t₀, ∞),   (25)

where Y(t) is an unknown stochastic n-dimensional vector process, ξ(t) is a given vector of standard white noise excitation, and σ = {σ_ij} is the intensity matrix. It is clear that

⟨ξ(t)⟩ = 0,   ⟨ξ_i(t) ξ_j(t′)⟩ = δ_ij δ(t − t′).   (26)

Under known assumptions (when Eq. (25) is interpreted as the Itô stochastic differential equation), the solution process Y(t) is a diffusion Markov process and its probability density f(y, t | y₀, t₀) = f(y, t) satisfies the following Fokker–Planck–Kolmogorov equation (cf. [10])

∂f(y, t)/∂t + Σ_{i=1}^n ∂[f(y, t) F_i(y, t)]/∂y_i − (1/2) Σ_{i,j=1}^n b_ij ∂²f(y, t)/∂y_i ∂y_j = 0,   (27)

where

b_ij = Σ_{r=1}^n σ_ir σ_jr,   (28)

that is, the matrix B = {b_ij} = σσᵀ. What is the rate of the Shannon entropy flow in system (25) with its probability density governed by Eq. (27)? According to the definition of H(t), Eq. (1):

Ḣ = −(d/dt) ∫ f ln f dy = −∫ (∂f/∂t)(ln f + 1) dy.   (29)

The second term on the right hand side is (due to the normalization condition) equal to zero, and ∂f/∂t (due to the F–P–K equation (27)) consists of two components, i.e. A_L = −div(f F), A_D = (1/2) Σ_{i,j} b_ij ∂²f/∂y_i ∂y_j. Therefore

Ḣ = Ḣ_L + Ḣ_D,   (30)

where Ḣ_L – the Liouvillian rate – is represented by Eqs. (16)–(18), i.e.

Ḣ_L = ∫ f div F dy.   (31)

The second component Ḣ_D – the diffusion entropy rate – is as follows

Ḣ_D = −(1/2) Σ_{i,j} b_ij ∫ (∂²f/∂y_i ∂y_j) ln f dy.   (32)

Integration by parts (with account of the boundary restrictions upon f: f(y, t) → 0 as |y| → ∞) yields

Ḣ_D = (1/2) Σ_{i,j} b_ij ∫ (1/f)(∂f/∂y_i)(∂f/∂y_j) dy   (33)
    = (1/2) Σ_{i,j} b_ij ∫ f (∂ ln f/∂y_i)(∂ ln f/∂y_j) dy   (34)
    = (1/2) Σ_{i,j} b_ij ⟨(∂ ln f/∂y_i)(∂ ln f/∂y_j)⟩_f.   (35)

The above entropy rate is positive since the diffusion matrix {b_ij} is positive definite. This indicates that the total entropy rate Ḣ = Ḣ_L + Ḣ_D may have positive or negative signs (with the value of zero when, e.g., random noise is absent and the system is divergenceless). The quantity Ḣ_D expressed by Eqs. (33)–(35) can be interpreted as the information entropy production (by a random excitation). The quantity Ḣ_L characterizes the system entropy due to the process inside the system.

Example 1. As an illustration of the entropy change in a stochastic dynamical system let us consider a simple process governed by the equation

Ẏ(t) + a Y(t) = ξ(t),   a > 0,   (36)

where ξ(t) is a Gaussian white noise, such that

⟨ξ(t)⟩ = 0,   ⟨ξ(t₁) ξ(t₂)⟩ = 2D δ(t₂ − t₁).   (37)

Therefore the drift and diffusion coefficients are

F(y, t) = −a y,   σ = √(2D),   (38)

and Eq. (27) is

∂f(y, t)/∂t = −a ∂[y f(y, t)]/∂y + D ∂²f(y, t)/∂y²,   (39)

with f(y, 0) = f₀(y) being a Gaussian density with mean m₀ and variance σ₀². Since the system is linear, the normality of this initial (Gaussian) density is preserved, and we have f(y, t) = N(m_Y(t), σ_Y²(t)) with

m_Y(t) = ⟨Y(t)⟩ = m₀ e^{−at},   σ_Y²(t) = σ₀² e^{−2at} + (D/a)(1 − e^{−2at}).   (40)

Therefore the density f(y, t) and entropy H_Y(t) are, respectively,

f(y, t) = [2π σ_Y²(t)]^{−1/2} exp[ −(y − m₀ e^{−at})² / (2σ_Y²(t)) ],   (41)

H_Y(t) = (1/2) ln[2π e σ_Y²(t)],   (42)

Ḣ_Y(t) = a(D − aσ₀²) e^{−2at} / [D − (D − aσ₀²) e^{−2at}].   (43)

It is seen that increase of entropy (Ḣ_Y(t) > 0) takes place if σ₀² < D/a, whereas the entropy decreases when σ₀² > D/a. Keeping in mind that in the case without noise (D = 0) we have a Liouvillian system for which Ḣ_Y(t) = −a, this decrease of entropy is "inverted" to its growth by noise if σ₀² < D/a. Fig. 1 illustrates the entropy change given by Eqs. (42), (43). For such numerical values of σ₀², a, and D that σ₀² < D/a (σ₀ = 2, a = 0.1, D = 5) we observe a monotonic growth of H_Y(t), whereas for σ₀ = 2, a = 0.3, D = 0.25, we have σ₀² > D/a and the entropy decreases in time.


Fig. 1. Entropy of the first order system (36). Curve (1) corresponds to m₀ = 5, σ₀² = 2, a = 0.1 and D = 5; curve (2) to m₀ = 5, σ₀² = 2, a = 0.3 and D = 0.25.
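The curves of Fig. 1 follow directly from Eqs. (40) and (42). A short sketch (not from the paper; it assumes NumPy and takes σ₀ = 2 as in the discussion above) evaluates them for the two parameter sets of the figure:

```python
import numpy as np

def ou_entropy_bits(t, sigma0_sq, a, D):
    """H_Y(t) in bits for system (36), from Eqs. (40) and (42); the mean m0 does not enter."""
    var_t = sigma0_sq * np.exp(-2 * a * t) + (D / a) * (1 - np.exp(-2 * a * t))
    return 0.5 * np.log(2 * np.pi * np.e * var_t) / np.log(2)

t = np.linspace(0.0, 20.0, 5)
print(ou_entropy_bits(t, sigma0_sq=4.0, a=0.1, D=5.0))    # sigma0^2 < D/a: entropy grows
print(ou_entropy_bits(t, sigma0_sq=4.0, a=0.3, D=0.25))   # sigma0^2 > D/a: entropy decays
```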

Fig. 2. Entropy of the linear oscillator (44), for ω0 = 2, g0 = 10, and three values of ζ .

Fig. 3. Entropy of the linear oscillator (44), for ζ = 0.1, g₀ = 10, and three values of ω₀.

Example 2. Let us take now a linear harmonic oscillator

Ÿ(t) + 2ω₀ζ Ẏ(t) + ω₀² Y(t) = ξ(t),   Y(0) = 0,   Ẏ(0) = 0,   (44)

where ω₀ is the natural frequency of the system, ζ is the damping ratio, and ξ(t) is a Gaussian white noise with a constant spectral density g₀. As is well known (cf. [2]), the nonstationary probability density of the response is Gaussian N(0, σ_Y²(t)), with

σ_Y²(t) = (πg₀ / (2ζω₀³)) {1 − λ₀⁻² exp(−2ω₀ζt)[λ₀² + 2ω₀²ζ² sin²λ₀t + ω₀λ₀ζ sin 2λ₀t]},   (45)

where λ₀ = ω₀(1 − ζ²)^{1/2}. The entropy H_Y(t) is again given by Eq. (42) and its variability in time is shown in Figs. 2 and 3.

5. Relative entropy for true and idealized models

5.1. Gaussian prediction

Let us consider a general form of a dynamical model

Ẏ(t) = F(Y, t) + X(t),   (46)

where F is, in general, a nonlinear vector-valued function with values in Rⁿ, and X(t) is a random, time-varying excitation. The initial condition Y(t₀) = Y₀ can be deterministic or random.

Let f(y, t) represent the true probability density of Y(t) at time t. However, this density is not easily available (due to the complexity of the system). Let q(y, t) be another probability density associated with the system response which can be obtained under some simplifying hypotheses; it can be viewed as an "idealized approximation". The question which arises is: what is the information loss if we make the system predictions on the basis of q instead of f?

A measure which seems to be particularly well suited to answering the above question is the relative entropy defined by Eq. (3). According to its meaning, it quantifies the amount of information which the distribution f(y, t) provides about the system behaviour in excess of that given by the distribution q(y, t). We may think of the probability density f as being nonstationary – compared to a stationary one, non-Gaussian – compared to a Gaussian approximation, or – generally – as "realistic" in contrast to an "idealized" one.

In the existing studies, the relative entropy has mostly been used in the analysis of the convergence of nonstationary densities of diffusion Markov processes to the stationary ones (cf. [10,11]); for example, it has been shown that if f and q are two solutions of the F–P–K equation, then H_t(f, q) decays monotonically with time (the functional H_t(f, q) plays the role of a Lyapunov functional). Here we wish to concentrate primarily on the quantitative characterization of H_t(f, q), with special attention to the effect of the system parameters.

5.1.1. Example 1: Ornstein–Uhlenbeck process

Let us consider again the first order linear system (36),

Ẏ(t) + aY(t) = ξ(t),   (47)

when ξ(t) is a Gaussian white noise, and the densities f and q are as follows:
(a) f(y, t) — nonstationary, with f(y, t₀) = N(m₀, σ₀²); q(y, t) — nonstationary, with q(y, t₀) = δ(y − y₀);
(b) f(y, t) — as in case (a); q(y, t) = q_s(y) — the stationary density.

In case (a):

f(y, t) : N(m_f, σ_f²) = N( m₀ e^{−at}, σ₀² e^{−2at} + (D/a)(1 − e^{−2at}) ),
q(y, t) : N(m_q, σ_q²) = N( y₀ e^{−at}, (D/a)(1 − e^{−2at}) ).   (48)


Fig. 4. Relative entropy of the system (47), case (a), as given by Eq. (49), for a = 0.2, D = 0.5, m₀ = 1, and three values of σ₀.

Fig. 5. Relative entropy of the system (47), case (a), as given by Eq. (49), for a = 0.2, m₀ = 1, σ₀ = 2, and three values of D.

Fig. 6. Relative entropy of the system (47), case (b), as given by Eq. (51), for a = 0.2, D = 0.5, m₀ = 1, and three values of σ₀.

Fig. 7. Relative entropy of the system (47), case (b), as given by Eq. (51), for a = 0.2, m₀ = 1, σ₀ = 2, and three values of D.

Therefore, when m₀ = y₀, the relative entropy is

H_t(f, q) = (1/2)[ ln(σ_q²/σ_f²) + (σ_f² − σ_q²)/σ_q² ]
          = (1/2)[ ln( β(1 − e^{−2at}) / (σ₀² e^{−2at} + β(1 − e^{−2at})) ) + σ₀² e^{−2at} / (β(1 − e^{−2at})) ],   (49)

where β = D/a. Indeed, when t → ∞ the relative entropy of these two solutions tends to zero, but for each finite instant of time H_t(f, q) > 0 and depends on the noise intensity D and the system parameter a. However, if m₀ ≠ y₀ the relative entropy H_t(f, q) will have (according to Eq. (6)) the additional term resulting from the difference m₀ − y₀. Figs. 4 and 5 display the convergence, measured by the relative entropy (49), between the distributions f(y, t) and q(y, t), as t increases. There exists a finite time, when the system "remembers" the initial state, for which the informational distance between f and q is significant and depends essentially on the numerical values of σ₀ and D.

In case (b):

f(y, t) : N(m_f, σ_f²) = N( m₀ e^{−at}, σ₀² e^{−2at} + (D/a)(1 − e^{−2at}) ),
q(y, t) : N( 0, D/a ).   (50)

The relative entropy is

H_t(f, q_s) = (1/2)[ ln( β / (σ₀² e^{−2at} + β(1 − e^{−2at})) ) + ((σ₀² + m₀² − β)/β) e^{−2at} ].   (51)

This formula, when illustrated graphically (for fixed numerical values of σ₀ and D), shows the rate of convergence (measured by the relative entropy) of the nonstationary response to the stationary one (Figs. 6, 7). It is worth noticing that the number of bits by which the response in the transient period differs from the stationary response (for different values of D) is quite essential.
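Formulas (49) and (51) are straightforward to evaluate. A minimal script (not from the paper; it assumes NumPy and uses the parameter values of Figs. 4 and 6 with σ₀ = 2) is sketched below.

```python
import numpy as np

def kl_case_a_bits(t, sigma0_sq, a, D):
    """Eq. (49): relative entropy (bits) between f and q in case (a), m0 = y0."""
    beta = D / a
    var_f = sigma0_sq * np.exp(-2 * a * t) + beta * (1 - np.exp(-2 * a * t))
    var_q = beta * (1 - np.exp(-2 * a * t))
    return 0.5 * (np.log(var_q / var_f) + (var_f - var_q) / var_q) / np.log(2)

def kl_case_b_bits(t, m0, sigma0_sq, a, D):
    """Eq. (51): relative entropy (bits) between the nonstationary f and the stationary q."""
    beta = D / a
    var_f = sigma0_sq * np.exp(-2 * a * t) + beta * (1 - np.exp(-2 * a * t))
    return 0.5 * (np.log(beta / var_f)
                  + (sigma0_sq + m0 ** 2 - beta) / beta * np.exp(-2 * a * t)) / np.log(2)

t = np.linspace(0.5, 30.0, 4)      # t > 0, since the variance of q in case (a) vanishes at t = 0
print(kl_case_a_bits(t, sigma0_sq=4.0, a=0.2, D=0.5))          # cf. Fig. 4
print(kl_case_b_bits(t, m0=1.0, sigma0_sq=4.0, a=0.2, D=0.5))  # cf. Fig. 6
```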

5.1.2. Example 2: vibratory system with time-varying parameters

To illustrate further the utility of the relative entropy measure in the prediction of nonstationary processes, let us consider a system analysed by Sissingh [12] and Soong & Grigoriu [13]:

Ÿ(t) + c(t)Ẏ(t) + [p² + k(t)]Y(t) = α(t)ξ(t),   Y(0) = Ẏ(0) = 0,   (52)

where ξ(t) is a Gaussian white noise with zero mean and one-sided spectral density g₀, and

c(t) = (γ/8)[1 + (4µ/3) sin t],
k(t) = (γµ/6) cos t − (µ²γ/8) sin 2t,   (53)
α(t) = (γ/6)[1 + (3µ/2) sin t],

where γ, µ are constants identified in rotor dynamics. Since the time-varying coefficients (53) are deterministic, the response process Y(t) is Gaussian. It has been shown (cf. [13]) that for p ≫ 1 the approximate solution for the variance of the response has the form

σ_Y²(t) = (πg₀/p²) e^{−η(t)} ∫₀ᵗ e^{η(τ)} α²(τ) dτ,   (54)

where

η(t) = ∫₀ᵗ c(τ) dτ = (γ/8)[ t + (8µ/6)(1 − cos t) ].   (55)

Therefore, the Gaussian probability density f(y, t) of Y(t) is N(0, σ_Y²(t)), where σ_Y²(t) is given by Eqs. (54)–(55). Let us take as a reference (or idealized) system associated with oscillator (52) the system (52) in which the coefficients are constant, namely:

c(t) = 2ζω₀ = γ/8,   k(t) = 0,   p² = ω₀²,   α(t) = γ/6 = α₀.   (56)

In this case the response is governed by the Gaussian process with density q(y, t) = N(0, σ̃_Y²(t)), where σ̃_Y²(t), obtained from the approximate solution (54)–(55) as a particular case, is

σ̃_Y²(t) = (πg₀α₀² / (2ζω₀³)) (1 − e^{−2ζω₀t}).   (57)

Of course, we should keep in mind that the exact variance of the oscillator characterized by Eqs. (52), (56) is given by (45), with the multiplier α₀². The relative entropy H_t(f, q) can be calculated on the basis of Eq. (6) for Gaussian densities. Its graphical illustration is shown in Fig. 8, whereas Fig. 9 shows the plots of the standard deviations σ_Y(t) determined by Eqs. (54)–(55) and σ̃_Y(t) from Eq. (57). The numerical values of the parameters are taken as follows: γ = 0.8, µ = 2.0, p = 5.0, g₀ = 1.0. It is seen that the periodicity present in the variance of the response results in a periodic-type variability of the relative entropy in time.

Fig. 8. Relative entropy between the solution to Eq. (52) with time-varying parameters — f, and with constant parameters as in Eq. (56) — q; g₀ = 1, p = 5, γ = 0.8, and µ = 2.

Fig. 9. Standard deviations: σ_Y — given by Eqs. (54)–(55), and σ̃_Y — defined by Eqs. (56)–(57); g₀ = 1, p = 5, γ = 0.8, and µ = 2.
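A rough numerical sketch of Eqs. (54)–(57) and of the Gaussian relative entropy (6) follows (it is not part of the paper; it assumes NumPy, the parameter values listed above, and evaluates the time integral in (54) by the trapezoidal rule):

```python
import numpy as np

gamma, mu, p, g0 = 0.8, 2.0, 5.0, 1.0
omega0 = p
zeta = gamma / (16.0 * omega0)        # from c(t) = 2*zeta*omega0 = gamma/8, Eq. (56)
alpha0 = gamma / 6.0

def alpha(t): return gamma / 6.0 * (1.0 + 1.5 * mu * np.sin(t))          # Eq. (53)
def eta(t):   return gamma / 8.0 * (t + 8.0 * mu / 6.0 * (1.0 - np.cos(t)))  # Eq. (55)

def var_timevarying(t, n=2000):
    """Approximate response variance of (52), Eqs. (54)-(55), by numerical quadrature."""
    tau = np.linspace(0.0, t, n)
    integral = np.trapz(np.exp(eta(tau)) * alpha(tau) ** 2, tau)
    return np.pi * g0 / p ** 2 * np.exp(-eta(t)) * integral

def var_reference(t):
    """Variance of the constant-coefficient reference oscillator, Eq. (57)."""
    return (np.pi * g0 * alpha0 ** 2 / (2.0 * zeta * omega0 ** 3)
            * (1.0 - np.exp(-2.0 * zeta * omega0 * t)))

def relative_entropy_bits(t):
    """Eq. (6) for the two zero-mean Gaussian densities f and q (cf. Fig. 8)."""
    s_f, s_q = var_timevarying(t), var_reference(t)
    return 0.5 * (np.log(s_q / s_f) + s_f / s_q - 1.0) / np.log(2.0)

for t in (5.0, 20.0, 50.0):
    print(t, relative_entropy_bits(t))
```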

5.2. Non-Gaussian prediction: systems with random parameters — conditioning

In many situations the system parameters may display a significant level of uncertainty due to errors in model identification or because of some intrinsic mechanisms of the process under consideration. It is of interest, therefore, to quantify the effects of such an uncertainty on the information-type measures of the system performance. Let such a more realistic model be, instead of (46), represented as

Ẏ(t) = F(Y(t), V(γ), t) + X(t),   Y(t₀) = Y₀,   (58)

where V = V(γ) is a vectorial random variable with a given probability density f_V(v). If the probability density of the response process Y(t), when the value of V = v is fixed, is f_c(y, v, t) = f_c(y, t|v), then the "true" probability density of the system response is

f(y, t) = ∫ f_c(y, v, t) f_V(v) dv,   (59)

where the integration is over the support of the density f_V(v). Therefore, the amount of information that the response density f(y, t) provides beyond (in excess of) that given by q(y, t) = f_c(y, t|v) when the system parameters are fixed is quantified by the relative entropy

H_t(f, f_c) = ∫ f(y, t) log [ f(y, t) / f_c(y, t|v) ] dy,   (60)

where f(y, t) is given by Eq. (59).

Let us restrict ourselves here to linear, time-invariant systems governed by the equation

Ẏ(t) = [A + V(γ)] Y(t) + X(t),   (61)

where X(t) is a given Gaussian excitation process with a given mean m_X(t) and covariance K_X(t₁, t₂), A is a given deterministic matrix {a_ij}, whereas V(γ) is a random matrix, and Y₀ = Y(t₀) is a vector of initial conditions (assumed here to be deterministic). The solution to Eq. (61) when V(γ) = v is fixed has the known form

Y(t) = Φ(t, t₀) Y₀ + ∫_{t₀}^{t} Φ(t, s) X(s) ds,   t ⩾ t₀,   (62)

where Φ is the fundamental matrix of the system (61). This formula is a base for the calculation of the mean and covariance function of the Gaussian process Y_c(t). Since the excitation process X(t) is assumed to be Gaussian, the response process, when V(γ) = v is fixed, is characterized


Fig. 10. Relative entropy for the oscillator (63). f corresponds to V with a Gaussian distribution N(10, σ₀²), truncated to the interval [5, 15]; f_c corresponds to the fixed v = 10; ζ = 0.1, g₀ = 1, σ₀ varies.

Fig. 11. Relative entropy for the oscillator (63). f corresponds to V with a Gaussian distribution N(10, σ₀²), truncated to the interval [5, 15]; f_c corresponds to the fixed v = 10; σ₀ = 2, g₀ = 1, ζ varies.

by the conditional Gaussian probability density f_c(y, t|v), expressed by the conditional mean m_Y(t|v) and covariance K_Y(t₁, t₂|v). The unconditional, non-Gaussian, probability density takes into account the randomness of the system parameters included in V(γ), and is given by Eq. (59). Therefore, H_t(f, f_c) according to Eq. (60) quantifies the information gain that the "realistic" model (61) provides in excess of that obtained from the corresponding "idealized" model. Various forms of the system governing equations (in general, the matrix A may depend on time t) and various hypotheses concerning the probability distribution of V(γ) make it possible to treat effectively a wide class of real situations.

5.2.1. Example: Relative entropy for a simple vibratory system

Let us consider again the linear harmonic oscillator governed by the equation

Ÿ(t) + 2V(γ)ζ Ẏ(t) + V²(γ) Y(t) = ξ(t),   Y(0) = Ẏ(0) = 0,   (63)

where V (γ ) = Ω0 (γ ) – the natural frequency (stiffness coefficient) – is a random variable with mean ⟨Ω0 (γ )⟩ = ω0 and with probability density fV (v), ζ is a deterministic damping ratio, and ξ (t ) is a Gaussian white noise with zero mean and intensity (spectral density) g0 = const. Of course, Eq. (63) can be equivalently represented in the form of a system of two first order equations to fit formulation (61). The nonstationary conditional probability density of the response for Eq. (63), when V (γ ) = v is fixed, is the Gaussian density fc (y, t |v) = N (0, σY2 (t ; g0 , ζ , v)), with

σ_Y²(t; g₀, ζ, v) = (πg₀ / (2ζv³)) {1 − λ₀⁻² exp(−2vζt)[λ₀² + 2v²ζ² sin²λ₀t + vλ₀ζ sin 2λ₀t]},   (64)

where λ₀ = v(1 − ζ²)^{1/2}. In the stationary state, i.e. if t → ∞,

σ_Y²(g₀, ζ, v) = πg₀ / (2ζv³).   (65)

Fig. 12. Relative entropy for the oscillator (63). f corresponds to V with a uniform distribution on the interval [5, 15]; fc corresponds to the fixed v = 10; g0 = 1, ζ varies.

case, calculations were performed for the uniform distribution of V on the same interval [a, b]. The mean of V was therefore also ω0 = (b − a)/2 = 10. In both cases, the same numerical value of the noise intensity g0 = 1.0 was assumed, and fc corresponded to the mean v = ω0 = 10. Fig. 10 shows the relative entropy (versus time) between the unconditional density – integrated over the truncated Gaussian distribution of the stiffness coefficient v – and the conditional one (when v = ω0 is constant). It is seen that although the effect of the variance of the random stiffness is essential, the information distance between f (y, t ) and fc (y, t |v), expressed in bits, is rather small. Fig. 11 shows how Ht (f , fc ) depends on the damping coefficient of the oscillator (here σ0 = 2.0). Fig. 12 exhibits Ht (f , fc ) when v is randomized by the uniform distribution (on the same interval [5, 15]); the shapes of the curves look very similar to those in Fig. 11, but the information distance in bits is, for the uniform distribution, essentially greater — what could be expected, since in this case the randomness of the stiffness coefficient V is greater. 6. Non-Gaussian prediction via maximum entropy approximation An important issue in quantifying the relative entropy H (f , q) of stochastic dynamical systems is that often we are not able to have the density f (y ) explicitly and exactly. However, in

K. Sobczyk, P. Hołobut / Probabilistic Engineering Mechanics 27 (2012) 47–56

55

many situations we possess some partial information about f (y ) expressed by its statistical moments



k

k

y11 y22 . . . yknn f (y ) dy = mk1 k2 ...kn ,

k i = 0 , 1 , . . . , Ki ,

(66)

or more generally



˜ k, ϕk (y )f (y ) dy = m

k = 0, 1, . . . , K ,

(67)

where, if k = 0, we have the normalization condition (in this case it is assumed that ϕ0 (y ) = 1). Let us approximate the unknown f (y ) by f ∗ (y ) according to the spirit of the maximum entropy method, or its counterpart – the minimum relative entropy (cf. [14,5]). The density f ∗ (y ) satisfies the constraints (67) and minimizes the relative entropy Ht (f ∗ , q). ∗ As it is known  f (y∗) is the unique density in the family of densities ˜ k , for k = 1, 2, . . . , K , and it has the for which ϕk (y )f (y )dy = m exponential form

 −

f ∗ (y ) = q(y ) exp −

 λk ϕk (y ) ,

(68)

where λk , the unknown Lagrange multipliers, are determined from constraints (67). The relative entropy H (f , q) can be represented as a sum of two terms H (f , q) = H (f , f ∗ ) + H (f ∗ , q).

(69)

The first term can be interpreted as the approximation error (in the representation of f (y ) by the minimum cross-entropy density f ∗ (y )) whereas the second term as the estimation error (a ‘‘distance’’ between q and f ∗ ). Representation (69) can be viewed as the ‘‘triangle equality’’ for the relative entropy (valid only when f ∗ is the minimum relative entropy approximation of f !). Indeed, since the density q is positive the following holds f

log

q

 = log

f f∗ f∗ q

 = log

f f∗

+ log

f∗ q

.

(70)

ln

q

=−



λk ϕk (y ),

(71)

k

and H (f ∗ , q) = −

K ∫ −

f ∗ (y )λk ϕk (y ) dy ,

(72)

k=0

which is the general and explicit formula for the relative entropy between the approximated density f ∗ (y ) and the reference density q(y ). If the number of moment constraints in Eq. (67) is taken to be K1 or K2 , and K1 < K2 , then H (fK∗1 , q) ⩽ H (fK∗2 , q) ⩽ H (f , q).

Y˙ = a1 Y + a2 Y 2 + a3 Y 3 + ξ (t ) = F (Y ) + ξ (t ),

(74)

with ⟨ξ (t )⟩ = 0, ⟨ξ (t )ξ (t )⟩ = 2Dδ(t − t ), and the Gaussian initial condition f0 (y) = N (m0 , σ02 ). The moment equations corresponding to Eq. (74) have the form ′





[(k − 1)Dyk−2 + yk−1 F (y)]f (y, t ) dy

= k[(k − 1)DMk−2 + a1 Mk + a2 Mk+1 + a3 Mk+2 ],

(75)

˙k = (where 0, 1, . . .), which results from M  k Mk = ⟨Y ⟩, k k = d ∂ y f ( y , t ) dy = y f ( y , t ) dy, replacing ∂ f /∂ t through the dt ∂t k

Fokker–Planck–Kolmogorov equation, and integrating by parts. ˙ k involves moments up to Mk+2 . This hierarchy is The equation for M closed by using the maximum entropy assumption. The maximum entropy solution to Eq. (74) is sought in the form f (y, t ) = exp(−λ0 (t ) − λ1 (t )y − λ2 (t )y2 − λ3 (t )y3

When log = ln, Eq. (68) yields f∗

(cf. [15,16]). In this extended version of the method the partial information about the unknown density f (y ) – and more generally about f (y , t ) – is taken in the form of equations for moments (derivable from the given stochastic equations via the Itô formula (cf. [2])). The example below shows how the relative entropy between the non-Gaussian maximum entropy density of the system and the Gaussian density (of a linearized system) changes in time. The considered system is as follows

˙k = k M

Taking the average of both sides with respect to f and accounting for the fact that the average of log(f ∗ /q) with respect to f is the same as the average with respect to f ∗ (since f ∗ satisfies moment conditions (67)), the equality (69) is obtained (cf. also [14]). It is clear from Eq. (69) that H (f , q) ⩾ H (f ∗ , q).

Fig. 13. Lagrange multipliers for the maximum entropy solution (76) to Eq. (74).

(73)

Therefore, as one could expect, an increase of the number of given moments of the unknown density f (y ) results in an increase of the information gain (compared to the information contained in the reference distribution q(y )). The Lagrange multipliers λk occurring in Eq. (68) can be calculated in the same way as it is made in the maximum entropy method, which has been extended to stochastic dynamical systems

− λ4 (t )y4 ).

(76)

To compute λ0 , . . . , λ4 , the first six nontrivial moment equations ˙ 1 through M ˙ 6 . For each t, M0 (t ) = (75) are used: for M 1, M1 (t ), . . . , M6 (t ) determine the Lagrange multipliers λ0 (t ), . . . , λ4 (t ), which in turn serve to compute M7 (t ) and M8 (t ) – present in the moment equations. Starting from the chosen ˙ 1, . . . , M ˙ 6 are Gaussian initial condition, the equations for M integrated forward. The reference system is assumed to be linear: Y˙ = a˜ Y + ξ (t ),

(77)

with the coefficient a˜ found by a least-squares approximation to Eq. (74) on the interval [−|m0 | − 3σ02 , |m0 | + 3σ02 ]. The exact solution to Eq. (77), with the same initial condition as for Eq. (74), has a Gaussian density q(y, t ) = N (mY (t ), σY2 (t )), where mY (t ), σY2 (t ) are given by Eq. (40) (with a = −˜a). Computations were performed until the two densities – for Eqs. (74) and (77) – stabilized. The values of the parameters were: a1 = 1, a2 = 1, a3 = −1, m0 = 0, σ02 = 1 (from which a˜ = −4.58), and D = 1. Fig. 13 presents the values of the Lagrange multipliers of Eq. (76). Fig. 14 shows the time evolution of the bimodal
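The central building block of the closure above is the fit of a quartic maximum entropy density of the form (76) to prescribed moments. The fragment below is an illustrative sketch only, not the authors' implementation; it assumes NumPy/SciPy and hypothetical target moments, and it also evaluates the relative entropy of the fitted density against a Gaussian reference.

```python
import numpy as np
from scipy.optimize import fsolve

Y = np.linspace(-6.0, 6.0, 4001)          # integration grid for the state variable

def maxent_density(lams):
    """Density of the form (76): exp(-lam0 - lam1*y - ... - lam4*y^4) on the grid Y."""
    return np.exp(-sum(l * Y ** k for k, l in enumerate(lams)))

def moment_residuals(lams, target):
    """Residuals of the constraints (67): normalization and the first four moments."""
    f = maxent_density(lams)
    moments = [np.trapz(Y ** k * f, Y) for k in range(5)]
    return [moments[0] - 1.0] + [moments[k] - target[k - 1] for k in range(1, 5)]

# Hypothetical target moments M1..M4 (e.g. values produced by the integrated moment
# equations (75) at some instant t); chosen here only to exercise the fit:
target = [0.0, 1.4, 0.0, 3.5]
lam0 = [1.0, 0.0, 0.5, 0.0, 0.1]          # initial guess; convergence depends on it
lams = fsolve(moment_residuals, lam0, args=(target,))
f_star = maxent_density(lams)

# Relative entropy (bits) against a Gaussian reference with the same mean and variance:
q = np.exp(-0.5 * (Y - target[0]) ** 2 / target[1]) / np.sqrt(2 * np.pi * target[1])
H_fq = np.trapz(f_star * (np.log(f_star) - np.log(q)), Y) / np.log(2.0)
print(lams, H_fq)
```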


Fig. 14. f — the maximum entropy density (76) for Eq. (74), and q — the Gaussian density for the linear approximation (77), at a few instants of time.

Fig. 15. Relative entropy between f — the maximum entropy density (76) for Eq. (74), and q — the Gaussian density for the linearized system (77).

Fig. 14 shows the time evolution of the bimodal probability density f against the Gaussian probability density q. The relative entropy between f and q, as a function of time, is given in Fig. 15. The information distance is significant, since the maxima of the two densities diverge.

7. Conclusions

In many fields of contemporary research, information and dynamics are becoming the key terms. The information-theoretic approach to stochastic dynamics can essentially enrich the existing "traditional" methods and results. The analysis presented in this paper allows one to quantify the information content of a system during its motion and to make statements on the informational quality of different approximations in stochastic dynamics. Although the specific systems treated in the paper are rather simple, the basic reasoning is general; so, it can be used in a variety of other systems, especially economic, social and biological, where information flow seems to be a key issue in modelling and prediction.

References

[1] Lin YK, Cai GQ. Probabilistic structural dynamics: advanced theory and applications. New York: McGraw-Hill; 1995. [2] Sobczyk K. Stochastic differential equations with applications to physics and engineering. Dordrecht: Kluwer; 1991. [3] Zubariev DN. Nonequilibrium statistical thermodynamics. Moscow: Nauka; 1971 [in Russian]. [4] Ebeling W. Entropy and information in processes of self-organization: uncertainty and predictability. Physica A 1993;194:563–75. [5] Sobczyk K. Information dynamics: premises, challenges and results. Mech Syst Signal Process 2001;15(3):475–98. [6] Garbaczewski P. Differential entropy and dynamics of uncertainty. J Stat Phys 2006;123(2):315–55. [7] Barlas N, Josic K, Lapin S, Timofeyev I. Non-uniform decay of predictability and return of skill in stochastic oscillatory models. Physica D 2007;232: 116–127. [8] Cover TM, Thomas JA. Elements of information theory. New York: John Wiley; 1991. [9] Nemytskii VV, Stepanov VV. Qualitative theory of differential equations. Princeton: Princeton University Press; 1960. [10] Risken H. The Fokker–Planck equation. Berlin: Springer; 1989. [11] Lasota A, Mackey MC. Chaos, fractals, and noise. New York: Springer; 1994. [12] Sissingh GJ. Dynamics of rotors operating on high advance ratios. J Helicopter Soc 1968;13:56–63. [13] Soong TT, Grigoriu M. Random vibration of mechanical and structural systems. New Jersey: Prentice Hall; 1993. [14] Barron AR, Shen C-H. Approximation of the density functions by sequences of exponential families. Ann Statist 1991;19(3):1347–69. [15] Sobczyk K, Trębicki J. Maximum entropy principle and nonlinear stochastic oscillators. Physica A 1993;193:448–68. [16] Sobczyk K, Trębicki J. Approximate probability distributions for stochastic systems: maximum entropy method. Comput Methods Appl Mech Engrg 1999;168:91–111.