Physica A 305 (2002) 196 – 199
www.elsevier.com/locate/physa
Information of sequences and applications

Claudio Bonanno (a), Stefano Galatolo (a), Giulia Menconi (b,*)

(a) Dipartimento di Matematica "L. Tonelli", Via Buonarroti 2/a, 56127 Pisa, Italy
(b) Centro Interdisciplinare per lo Studio dei Sistemi Complessi, Via Bonanno Pisano 25/b, 56100 Pisa, Italy

* Corresponding author. Tel.: +39-050-844282; fax: +39-050-844224. E-mail address: [email protected] (G. Menconi).
Abstract

In this short note, we outline some results about complexity of orbits of a dynamical system, entropy and initial condition sensitivity in weakly chaotic dynamical systems. We present a technique to estimate orbit complexity by the use of data compression algorithms. We also outline how this technique has been applied by our research group to dynamical systems and to DNA sequences. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Information; Complexity of a single orbit; Entropy; DNA sequence
1. Information, complexity and entropy

In our approach, the basic notion is the notion of information. Given a finite string s (namely a finite sequence of symbols taken from a given alphabet), the intuitive meaning of the quantity of information I(s) contained in s is the length of the smallest binary message from which one can reconstruct s. This concept is expressed by the notion of algorithmic information content (AIC). We limit ourselves to an intuitive idea which is very close to the formal definition (for further details, see Refs. [1,2] and the references therein). We can consider a partial recursive function as a computer C which takes a program p (namely a binary string) as an input, performs some computations and gives a string s = C(p), written in the given alphabet, as an output. The AIC of a string s is defined as the length of the shortest binary program p which gives s as its output, namely

I_AIC(s, C) = min{ |p| : C(p) = s } ,

where |p| denotes the length of the string p. From this point of view, the shortest program p which outputs the string s is a sort of optimal encoding of s. The information that is necessary to reconstruct the string is contained in the program.
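The AIC itself cannot be computed, but the idea can be made concrete with a toy example: a long but very regular string is reproduced by a program far shorter than the string itself, so its AIC is small compared with its length. The following Python fragment is only an illustration of this point; the string and the "program" are an invented example, not taken from the references.

```python
# Illustration only: the AIC is not computable, but a very regular string
# clearly admits a generating program much shorter than the string itself.

# A very regular string of one million symbols...
s = "01" * 500_000            # length 1_000_000

# ...is fully described by the short expression '"01" * 500_000',
# so its algorithmic information content is tiny compared with len(s).
generating_program = '"01" * 500_000'
print(len(s), len(generating_program))   # 1000000 14
```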
Unfortunately, this coding procedure cannot be performed on a generic string by any algorithm: the AIC is not computable.

Another measure of the information content of a finite string can be defined by means of a lossless data compression algorithm Z which satisfies some suitable properties that we will not specify here; details are discussed in Refs. [1,2]. Since the compressed string contains all the information that is necessary to reconstruct the original string, we can define the information content of the string s with respect to Z as the length in bits of the compressed string Z(s), namely

I_Z(s) = |Z(s)| .

Clearly, this measure of information will be as accurate as the compression algorithm is efficient. Notice that the information content I_Z(s) is a computable function; for this reason we call it computable information content (CIC). In any case, given any string s, we assume to have defined the quantity I(s) via AIC or via CIC.

If ω is an infinite string, its information is in general infinite; however, it is possible to define another notion: the complexity. The complexity K(ω) of an infinite string ω is the average information I contained in a single digit of ω, namely

K(ω) = lim sup_{n→∞} I(ω^n)/n ,    (1)
where ω^n is the string obtained by taking the first n elements of ω. If we equip the set Ω of all infinite strings with a probability measure μ, then the couple (Ω, μ) can be viewed as an information source, provided that μ is invariant under the natural shift map σ, which acts on a string ω = (ω_i)_{i∈N} as (σ(ω))_i = ω_{i+1} for all i ∈ N. It can be proved that the entropy h_μ of (Ω, μ) is the expectation value of the complexity:

h_μ = ∫ K(ω) dμ .    (2)
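Since the CIC only requires a lossless compressor, it can be approximated in practice with any off-the-shelf algorithm. The sketch below is a minimal illustration in Python, using the standard zlib module as a stand-in for the compressor; zlib, the test strings and the sample size are assumptions made only for the example (they are not the algorithm CASToRe used in Refs. [1,2,9]), and it simply shows how I_Z(s) = |Z(s)| and the per-symbol estimate I(ω^n)/n can be computed.

```python
import random
import zlib


def cic_bits(s: str) -> int:
    """Computable information content I_Z(s): length in bits of the
    compressed string, using zlib as a stand-in lossless compressor."""
    return 8 * len(zlib.compress(s.encode("ascii"), 9))


def complexity_per_symbol(omega_n: str) -> float:
    """Finite-n estimate of the complexity K(omega) = lim sup I(omega^n)/n."""
    return cic_bits(omega_n) / len(omega_n)


if __name__ == "__main__":
    random.seed(0)
    n = 100_000
    periodic = "ACGT" * (n // 4)                                   # highly regular string
    random_s = "".join(random.choice("ACGT") for _ in range(n))    # i.i.d. string
    # The regular string compresses to a negligible number of bits per symbol,
    # while the random one stays close to 2 bits per symbol (4-letter alphabet).
    print(complexity_per_symbol(periodic), complexity_per_symbol(random_s))
```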
The entropy of a system can then be viewed as the average (over the space X) quantity of information that is necessary to describe a step of the evolution of a point of the system. The orbit complexity, however, is defined for each point, independently of the choice of an invariant measure. If some (not necessarily invariant) measure m is defined on the space X, then the expectation value of the orbit complexity with respect to m can be seen as a possible generalization of the entropy to the nonstationary case.

If, more generally, we consider a topological dynamical system (X, T) over a metric space, then a construction can be carried out (see Refs. [1–6]) to associate to the orbits of the system a family of strings (which are in some sense a symbolic version of the real orbits) and then define the complexity of an orbit as the complexity of the associated strings. Here we do not give a formal definition of this construction but only recall its main ideas. The space X is divided into a finite partition β = {B_1, …, B_n} and to each point x the infinite string ω_{x,β} is associated by listing the elements of the partition that are visited during the orbit of x (a technical point that will not be discussed here is that sometimes it is necessary to consider open covers instead of partitions, and this gives
a slightly more complicated definition of the complexity or information content of an orbit). Then the information content I(x, β, n) of n steps of the orbit of x with respect to the partition β can be defined as the information content of the associated string: I_AIC(x, β, n) = I_AIC(ω^n_{x,β}).

In a weakly chaotic dynamical system the entropy of the system is 0; however, there are many different weakly chaotic dynamics which the usual entropy indicator is not able to distinguish. For these reasons, among others, some definitions of generalized entropy have been given (see, for example, Refs. [7,8]). In our approach, we look at the asymptotic behavior of the quantity of information that is necessary to describe an orbit as a generalized indicator of orbit complexity. In Ref. [4] it is proved that this asymptotic behavior is related to the kind of initial condition sensitivity of the system. In particular, if we have power law sensitivity (Δx(n) ∼ Δx(0) n^p, that is, two points starting at time 0 at a very small distance Δx(0) are at distance Δx(0) n^p at time n), then the information content of the orbit is

I_AIC(x, β, n) ∼ p log(n) .    (3)
If we have stretched exponential sensitivity (Δx(n) ∼ Δx(0) 2^(n^p), p ≤ 1), then I_AIC(x, β, n) ∼ n^p. This is the behavior of the Manneville map defined below.

2. Numerical results

As we have briefly outlined above, the analysis of I(x, β, n) gives useful information on the underlying dynamics. Since I(x, β, n) can be defined through the CIC, it can be used to analyze experimental data by means of a compression algorithm which is both efficient enough and fast enough to handle long strings of data. We have implemented a particular compression algorithm that we called Compression Algorithm Sensitive To Regularity (CASToRe) [9]; its internal working is described in the appendix of Refs. [1,2]. We have used CASToRe on the Manneville map f(x) = x + x^z (mod 1), x ∈ [0, 1], z > 1, and we have checked that the experimental results agree with the theoretical ones: indeed, it has been proved [4,10,11] that for almost every x (with respect to the Lebesgue measure) the information content with respect to a suitable partition satisfies, for z > 2, I_AIC(x, β, n) ∼ n^(1/(z−1)).
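As a rough illustration of this kind of numerical experiment, the following Python sketch iterates the Manneville map, codes the orbit symbolically with a two-cell partition, and records how the compressed length of the first n symbols grows with n. It uses zlib instead of CASToRe, and the partition, the parameter z and the sample sizes are illustrative choices made here, not the ones adopted in Refs. [1,2,9].

```python
import random
import zlib


def manneville(x: float, z: float) -> float:
    """One step of the Manneville map f(x) = x + x**z (mod 1)."""
    return (x + x ** z) % 1.0


def symbolic_orbit(x0: float, z: float, n: int, threshold: float = 0.5) -> bytes:
    """Code n steps of the orbit of x0 with the two-cell partition
    {[0, threshold), [threshold, 1]} (an illustrative choice of partition)."""
    out = bytearray()
    x = x0
    for _ in range(n):
        out.append(ord("1") if x >= threshold else ord("0"))
        x = manneville(x, z)
    return bytes(out)


if __name__ == "__main__":
    random.seed(1)
    z = 3.0                      # z > 2: expected growth of I_AIC roughly n**(1/(z-1))
    orbit = symbolic_orbit(random.random(), z, 200_000)
    for n in (10_000, 50_000, 100_000, 200_000):
        bits = 8 * len(zlib.compress(orbit[:n], 9))
        print(n, bits)           # expected sub-linear growth of the information content
```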
Then we have analyzed the behavior of I(x, β, n) for the logistic map at the chaos threshold. Numerical results in the physics literature suggest that, for the logistic map at the edge of chaos, the speed of separation of nearby starting orbits is (at most) a power law. In Ref. [12] we formally proved this expected behavior, using the relation between information content and initial condition sensitivity given by Eq. (3). Hence the information function increases more slowly than any power law, as has been confirmed by numerical experiments on the map using the algorithm CASToRe (see Refs. [1,2,12]).

Finally, we have applied CASToRe and the CIC analysis to DNA sequences [1,2], with the aim of studying the randomness of symbolic strings produced by a biological source. We look at genomes as finite symbolic sequences over the nucleotide alphabet {A, C, G, T}. The analysis exploited a window segmentation with windows of length L, so that we obtain an L-dependent CIC complexity K_Z(L) referred to a given genome. As the scale L increases, the complexity of coding regions (with respect to protein coding) shows no sensitivity to the scale variation, while the complexity of noncoding regions decreases with increasing L (revealing long-range correlations) until a critical length L* is reached, after which the function K_Z(L) remains constant. This strong difference between coding and noncoding regions is also underlined by the analysis of the information I_Z(L) over a genome at a fixed scale L. In particular, in Archaea genomes (Archaeoglobus fulgidus, Methanococcus jannaschii and Methanobacterium autotrophicum), one can identify precise areas of the complete genome where the information is definitely lower than in the remaining genome: these are the noncoding regions of significant extent. We are now checking the biological implications of these algorithmic properties.

References

[1] V. Benci, C. Bonanno, S. Galatolo, G. Menconi, F. Ponchio, Information complexity and entropy: a new approach to theory and measurement methods, http://arXiv.org/abs/math.DS/0107067, 2001.
[2] A.I. Khinchin, Mathematical Foundations of Information Theory, Dover Publications, New York, 1975.
[3] A.A. Brudno, Entropy and the complexity of the trajectories of a dynamical system, Trans. Moscow Math. Soc. 2 (1983) 127–151.
[4] S. Galatolo, Orbit complexity, initial data sensitivity and weakly chaotic dynamical systems, http://arXiv.org/abs/math.DS/0102187, 2001.
[5] S. Galatolo, Orbit complexity by computable structures, Nonlinearity 13 (2000) 1531–1546.
[6] S. Galatolo, Orbit complexity and data compression, Discrete Continuous Dyn. Systems 7 (2001) 477–486.
[7] C. Tsallis, A.R. Plastino, W.-M. Zheng, Power-law sensitivity to initial conditions—new entropic representation, Chaos Solitons Fractals 8 (6) (1997) 885–891.
[8] F. Takens, E. Verbitski, Generalized entropies: Renyi and correlation integral approach, Nonlinearity 11 (4) (1998) 771–782.
[9] F. Argenti, V. Benci, P. Cerrai, A. Cordelli, S. Galatolo, G. Menconi, Information and dynamical systems: a concrete measurement on sporadic dynamics, Chaos Solitons Fractals, 2001, to appear.
[10] P. Gaspard, X.J. Wang, Sporadicity: between periodic and chaotic dynamical behavior, Proc. Natl. Acad. Sci. USA 85 (1988) 4591–4595.
[11] C. Bonanno, The Manneville map: topological, metric and algorithmic entropy, in preparation.
[12] C. Bonanno, G. Menconi, Computational information for the logistic map at the chaos threshold, http://arXiv.org/abs/nlin.CD/0102034, 2001.