Journal of Theoretical Biology 321 (2013) 54–62
Contents lists available at SciVerse ScienceDirect
Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi
Investigation on series of length of coding and non-coding DNA sequences of bacteria using multifractal detrended cross-correlation analysis Cristina Stan a,n, Monica Teodora Cristescu a, Luiza Buimaga Iarincab, C.P. Cristescu a a b
Department of Physics, Faculty of Applied Sciences, Politehnica University of Bucharest, RO-060042, Romania National Institute of Research and Development of Isotope Molecular Technology, Cluj-Napoca, RO-400293, Romania
H I G H L I G H T S c c c
Investigation on DNA sequences using multifractal detrended cross-correlation analysis. A multifractal cross-correlation series is associated to each pair of coding and noncoding series. Criteria of class affiliation of bacteria and archea are proposed.
a r t i c l e i n f o
a b s t r a c t
Article history: Received 14 September 2012 Received in revised form 27 December 2012 Accepted 31 December 2012 Available online 10 January 2013
In the framework of multifractal detrended cross-correlation analysis, we investigate characteristics of series of length of coding and non-coding DNA sequences of some bacteria and archaea. We propose the use of a multifractal cross-correlation series that can be defined for any pair of equal lengths data sequences (or time series) and that can be characterized by the full set of parameters that are attributed to any time series. Comparison between characteristics of series of length of coding and non-coding DNA sequences and of their associated multifractal cross-correlation series for selected groups is used for the identification of class affiliation of certain bacteria and archaea. The analysis is carried out using the dependence of the generalized Hurst exponent on the size of fluctuations, the shape of the singularity spectra, the shape and relative disposition of the curves of the singular measures scaling exponent and the values of the associated parameters. Empirically, we demonstrate that the series of lengths of coding and non-coding sequences as well as the associated multifractal cross-correlation series can be approximated as universal multifractals. & 2013 Elsevier Ltd. All rights reserved.
Keywords: Genomic DNA series Hurt exponent Singularity spectra
1. Introduction Searching for information on structure, dynamics and function of DNA sequences using concepts and methods from statistical physics, nonlinear physics and signal processing plays a fundamental role in the interdisciplinary field of bioinformatics. The main involved concepts are from the dynamics of processes from disordered systems such as scale invariance, statistical similarity, long-range and multifractal correlations of fluctuations, phase transitions, etc. Due to the complexity of the analysis, timeeffective computational methods are required. A wide range of natural phenomena expressed as time-series or data sequences from physics, biology, medicine, earth sciences, econophysics, etc. can be well modeled as multifractal processes (Stanley and Meakin 1988; Schertzer and Lovejoy, 1992;
*
Corresponding author. Tel.: þ40 21 402 91 02; fax: þ 40 21 402 91 20. E-mail addresses:
[email protected],
[email protected] (C. Stan).
0022-5193/$ - see front matter & 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.jtbi.2012.12.027
Lopes and Betrouni, 2009; Seuront et al., 1999; Scarlat et al., 2007; Oczeretko et al., 2001; Cristescu et al., 2012). The multifractal analysis was initially developed mainly by Kolmogorov in the context of turbulence and, during the last 20 years it has found extensive applications in a very wide range of fields. There are different methods by which a multifractal analysis can be carried out; the main ones are: detrended fluctuation analysis (Peng et al., 1994; Ossadnik et al., 1994; Kantelhardt et al., 2002), wavelet detection of fluctuations (Muzy et al., 1993; Jaffard and Siam, 1997) and fluctuation measurement by structure functions or singular measures (Davis et al., 1994; Yu et al., 2003; Stan et al., 2012; Tessier et al., 1996). Searching for long range correlations in time series or data sequences and for the cross-correlations between different simultaneously recorded data sets by multifractal analysis enables us to provide new insight into the dynamics of natural systems and to find new classification criteria. Many methods have been proposed in order to investigate cross-correlations between simultaneously recorded time series, beginning with the Pearson analysis
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
(Pearson, 1901; Ascombe, 1973; Rodgers and Nicewonder, 1988). For nonstationary signals, a method based on detrended crosscorrelation analysis was proposed in Podobnik and Stanley (2008). Zhou (2008) considered the same type of investigation using the multifractal formalism, later applied to specific systems (He and Chen, 2011; Jiang and Zhou, 2011). Recently, the statistical methods (Zhou et al., 2005; Stan et al., 2010, 2012; Wang et al., 2010) and the multifractal analysis (Su et al., 2009; Yu et al., 2001; Kulkarni et al., 2005) are seriously considered as means for finding similarity/dissimilarity characteristics, new classification criteria, and for the identification of the belonging of an organism to accepted categories based on limited information such as using fragments of its genome. In the framework of multifractal detrended cross-correlation analysis (MFDCCA), we propose a new type of sequence ‘‘the multifractal cross-correlation series (CCS)’’. This can be defined for any pair of equal lengths sequences and described by the full set of parameters that can be attributed to a time series. So far, only the mutual Hurst exponent (Podobnik and Stanley, 2008) or the Hurst exponent and the singularity spectrum (Zhou, 2008) of a pair, were considered. In this study, we consider the CCS for a pair of coding and non-coding length sequences of a DNA complete genome and the identification of class affiliation of certain bacteria and archaea is obtained by comparison between characteristics of coding and non-coding DNA sequences and of their associated CCS. The paper is organized as follows: Section 2 presents a short description of the MFDCCA and introduces the CCS. Section 3 is dedicated to the application of the CCS to the identification of class affiliation of some bacteria and archaea. In Section 4, it is empirically demonstrated that the analyzed genomic sequences and their associated CCS can be approximated as universal multifractals. The final section is dedicated to conclusions.
and ( F x ð0,sÞ ¼ exp
) 2Ns 1 X ln½F 2x ðs, nÞ 4N s n ¼ 1
The technique used in this work to evaluate the Hurst exponents is multifractal detrended fluctuation analysis (MFDFA). For a data set (time series) of N equally spaced values, the ‘‘profile’’ is computed as follows: XðjÞ ¼
j X
ðxi /xSÞ;
j ¼ 1,2,:::,N
ð1Þ
i¼1
where /xS is the mean of the original series. The new series XðjÞ is divided into 2Ns non-overlapping segments (Ns ¼Int(N/s) of length s, obtained from the beginning to the end and in reverse. The detrending starts with the best polynomial fit of the signal X n ðkÞ on each segment n obtaining a certain trend function (usually a polynomial type function) X^ n ðkÞ,k ¼ 1, ,s. The next step implies the computation of the variances (Kantelhardt et al., 2002): F 2x ðs, nÞ ¼
s 1X ½X n ðkÞX^ n ðkÞ2 , s k¼1
n ¼ 1,2,:::Ns
ð2Þ
ð5Þ
for q ¼ 0. For a statistically self-similar signal, the dependence of the fluctuation function on the window length s is expected to be power law type: F x ðq,sÞ shx ðqÞ
ð6Þ
The exponent hx ðqÞ is called generalized Hurst exponent, and is computed as the slope of the line representing the log–log plot of the fluctuation function versus s in the scaling region. The main Hurst exponent is computed for q¼2. For monofractal time series, hx ðqÞ is independent of q. Indeed, since the scaling behavior of the variances F 2x ðs, nÞ is the same for all segments, the averaging procedure in Eqs. (4) and (5) will give identical scaling behavior for all values of q. In the case of multifractal time series, for positive values of q, the segments with large variance F 2x ðs, nÞ (i.e. large deviations from the corresponding fit) will dominate the average F x ðq,sÞ. Thus, for q4 0, hx ðqÞ describes the scaling behavior of the segments with large fluctuations. For negative values of q, the segments with small variance F 2x ðs, nÞ will dominate the average, hence, for qo 0, hx ðqÞ describes the scaling behavior of the segments with small fluctuations (Davis et al., 1994). As demonstrated by Kantelhardt et al. (2002), under some general conditions, considered satisfied in this study, the MFDFA is equivalent to the partition function procedure (e.g. obtained from the wavelet formalism (Jaffard and Siam, 1997; Gangodagamage et al., 2007)), which introduces the exponent tðqÞ such that the relationship
tðqÞ ¼ qhðqÞ1 2. MFDCCA and the cross-correlation series
55
ð7Þ
is satisfied (the subscript x is omitted when no confusion is expected). Another very popular way of analyzing the fractality of time series or sequences is the structure function algorithm (Seuront et al., 1999). For a (statistically) self-similar series, the structure q function of order q 40 defined as Sq ðrÞ ¼ / 9xi þ r xi 9 S is related to the delay r by a power type relationship: Sq ðrÞ C q r BðqÞ ,
BðqÞ ¼ qhðqÞ
ð8Þ
As discussed by Muzy et al. (1993), the exponent tðqÞ of the partition function procedure obtained using the wavelet formalism (Jaffard and Siam, 1997), under some restrictions on the allowed values of q is related to the structure function exponent BðqÞ with the same hðqÞ as in (6) and (7). As detailed in the book by (Seuront (2010) and the paper of Schmitt et al., 1995, for multifractal processes, the exponent BðqÞ is nonlinear and concave, and can serve to the definition of a function KðqÞ as: KðqÞ ¼ qHzðqÞ
or
KðqÞ ¼ qHqhðqÞ
ð9Þ
where F 2x ðs,
s 1X nÞ ¼ ½X n ðkÞX^ n ðkÞ2 , s k¼1
n ¼ Ns þ 1,:::,2N
ð3Þ
The fluctuation function of order 2 is defined as the square root of the relation (2) and (3). The whole series can be characterized by a q order fluctuation function (q positive or negative, nonzero value) defined as: 9 8 q=2 1=q 2N s = < 1 X 2 F x ðq,sÞ ¼ ½F x ðs, nÞ ð4Þ ; :2Ns n ¼ 1
H ¼ zð1Þ ¼ hð1Þ
ð10Þ
Eq. (9) shows that KðqÞ represents a correction, expressing the departure of the function BðqÞ from linearity (characteristic of monofractal series) due to intermittency. Besides hðqÞ or tðqÞ, the fractal properties can be described by the multifractal (singularity) spectrum f ðaÞ computed by the application of the Legendre transform (Hilborn, 1994; Zia et al., 2009) to the partition function exponent tðqÞ: dtðqÞ=dq ¼ a
ð11Þ
56
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
and
and f ðaÞ ¼ qatðqÞ
ð12Þ
Using KðqÞ, a nondecreasing function C ðqÞ ¼
KðqÞ q1
ð13Þ
is defined (Schmitt et al., 1995). By similarity to the relationship defining the generalized dimensions DðqÞ ¼
tðqÞ , q1
ð14Þ
CðqÞ is a codimension characterizing the sparseness of the process. It is usual to define a measure of inhomogeneity by dK C 1 ¼ C ð1Þ ¼ : ð15Þ dq q ¼ 1 This is bound between C 1 ¼ 0 for a homogeneous space-filling process and C 1 ¼ d, where d is the dimension of the space supporting the process (d ¼ 1 for time series). Very low C 1 , close to zero, characterize fields with values in the neighborhood of the mean almost everywhere, while large C 1 corresponds to fields with low values almost everywhere except in isolated locations where, the values are much higher than the mean (Schmitt et al., 1995). Eq. (9) can be rewritten KðqÞ ¼ qHqhðqÞ ¼ q½hð1ÞhðqÞ
ð16Þ
which shows that KðqÞ can be computed using the MFDFA exponent only. Eq. (16) gives the well known properties of KðqÞ, i.e.: Kð0Þ ¼ Kð1Þ ¼ 0,
KðqÞ o0,
q A ð0,1Þ,
KðqÞ 40,
q4 1
With (16), Eq. (15) takes the form dhðqÞ C1 ¼ dq q ¼ 1
ð17Þ
ð18Þ
which is used for computation in the present analysis. The main purpose of this work is the study of correlations between two time series/data sequences using the MFDCCA method which follows directly from DFA (Podobnik and Stanley, 2008) and MFDFA (Zhou, 2008; Jiang and Zhou, 2011). The general MFDCCA method can be implemented with various types of detrending; it was proposed that in the case of polynomial detrending it should be called MF-X-DFA (Zhou, 2008) and when using moving average detrending the proper name should be MF-X-DMA (Jiang and Zhou, 2011). By these renaming any conceivable type of detrending can be identified by an appropriate name. However, in a study where only one type of MFDCCA is applied, it is sufficient to give a statement of the type of detrending used and continue to call the method by its general name MFDCCA. We consider two sequences of equal lengths denoted xi , yi i ¼ 1,2, ,N: For both, the MFDFA is performed according to the previous discussion. Additionally, a correlation fluctuation function is constructed replacing (2) to (5) with F 2xy ðs, nÞ ¼
s 1X 9X n ðkÞX^ n ðkÞ99Y n ðkÞY^ n ðkÞ9, s k¼1
n ¼ 1,2,:::Ns
F 2xy ðs, nÞ ¼
s 1X 9X n ðkÞX^ n ðkÞ99Y n ðkÞY^ n ðkÞ9, s k¼1
n ¼ Ns þ 1,:::,2N
ð19Þ
ð20Þ ( F xy ðq,sÞ ¼
2N s 1 X ½F 2 ðs, nÞq=2 2Ns u ¼ 1 xy
)1=q shxy ðqÞ
ð21Þ
( F xy ð0,sÞ ¼ exp
) 2Ns 1 X ln½F 2xy ðs, nÞ shxy ð0Þ 4N s u ¼ 1
ð22Þ
It is to be expected (as demonstrated for many computed examples (Jiang and Zhou, 2011)) that the cross-correlation of two power-law auto-correlated sequences is also of power law type. Eqs. (21) and (22) correspond to the situation of long-range cross-correlated series when the cross-fluctuation function follows a power-law dependence on the size of the ‘‘box’’ s. The scaling exponent hxy ðqÞ is computed as the slope of the line representing the log–log plot of the fluctuation function versus s in the scaling region and describes the cross-correlations between the kinds of fluctuations of the two series. The definition of the scaling exponent hxy ðqÞ can be understood in the sense that it is characterizing a time series or, in the present context a data series with fluctuation described by Eqs. (21) and (22). Following this line of reasoning we consider a hypothetical series which we call ‘‘the multifractal cross-correlation series’’ that can be defined for any pair of data sequences of equal lengths. In this study, we associate the CCS for pairs of series of lengths of coding and non-coding sequences of a complete DNA genome. Similar to the series of lengths of coding and non-coding sequences, in addition to the mutual Hurst exponent, the hypothetical CCS can be characterized by a function K xy ðqÞ defined according to Eq. (9), by a singularity spectrum and by all the parameters associated to each of the considered pair. Comparison between characteristics of series of length of coding and noncoding DNA sequences and of the CCS associated to the pair is used for the identification of class affiliation of certain bacteria and archaea.
3. Identification of class affiliation of some bacteria and archaea by characteristics of the series of lengths of coding and non-coding sequences and their associated multifractal cross-correlation series The DNA molecule may be circular or linear, and can be composed of 100,000 to 10,000,000,000 nucleotides in a long chain (Paux et al., 2008). The prokaryotes—bacteria and archaea typically have a single circular chromosome, but many variations exist (Thanbichler and Shapiro, 2006). Bacterial chromosome has a relatively simple structure consisting of a succession of coding sequences (CS) and non-coding sequences (NCS). Each sequence consists of a specific series of types and number of bases. The coding sequences represent genes and they dominate the total content of bases in the bacterial chromosome. A CS, composed of hundreds of base pairs, represents a particular gene which is eventually translated into a particular protein. The NCS contains regulatory information. The CS series lengths can be regarded as a space fluctuating series and therefore are suitable for the statistical analysis (Morariu and Buimaga-Iarinca, 2010; Yu and Anh, 2001; Yu et al., 2001; Zainea and Morariu, 2007). It should be mentioned that while the correlation properties of DNA at the level of bases have been intensively investigated, the series lengths of CS and NCS received little attention. Although correlation characterization of each CS series and NCS series has been previously attempted on bacterial genomes, to our knowledge a full statistical characterization of the cross-correlation between the CS and NCS of a complete genome has not been reported to date. We investigate the genomes of bacteria presented in Table 1. The species were selected from four representative groups. The extraction of the data, from the EMBL-EBI data base (http:// www.ncbi.nlm.nih.gov/), was done with a code created in Matlab.
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
57
Table 1 . Domain
Phylum
Archaea Bacteria
Proteobacteria Hyperthermophilic Tenericutes (Chlamydia)
Species
Strain
Archaeoglobus fulgidus Methanocaldococcus jannaschii
Archaeoglobus fulgidus Methanocaldococcus jannaschii
Afulgi MJan
Helicobacter pylori Rickettsia prowazekii Aquifex aeolicus Thermotoga maritima Chlamydia pneumonia
Helicobacter pylori 26695 Rickettsia prowazekii Aquifex aeolicus Thermotoga maritima Chlamydophila pneumoniae TW-183 Chlamydophila pneumoniae J138
HPylor_266 Rprowaz Aquifex Tmarit ChlamPneu_TW ChlamPneu_J
Each species of bacteria or archaea is characterized by a number of N coding sequences each having a specific length li where i¼1,y, N. We considered in our analysis the series consisting of CS and NCS lengths as they follow in the genome. The length of CS, expressed as number of base pairs, was calculated as the difference between the start and the end position of CS in the genome. The mean value of the CS length is also available in the proteome section of EMBL-EBI data base (http://www.ncbi.nlm.nih.gov/). In the following analysis the length of the CS and NCS are denoted by xi and yi respectively. Our study encompasses four distinct groups: archaea (Afulgi and MJan), hyperthermophilic bacteria (Aquifex and Tmarit), proteobacteria (Rprowaz and HPylor_266) and chlamydia (ChlamPneu_ TW and ChlamPneu_J). Taking into consideration the relatively large spread in the length of the analyzed series (between 834 and 2410), in order to satisfy the accepted condition of MFDCCA for the length of the non-overlapping segments, 10 r s rN=4 (Kantelhardt et al., 2002; Zhou, 2008), we decided to use as maximum length s ¼ 200. For all analyzed cases the log-log plot of the dependence for the fluctuation function F ðq,sÞ on s is linear for all considered values of q between 10 and 10, showing that long range correlations for individual series and for the multifractal cross-correlation series exists up to the maximum value of s. The hðqÞ dependence for all these genomes is shown in Fig. 1. Besides the CS and NCS the dependence for the newly defined CCS is plotted (with solid circles). From these graphs we can formulate the following observations: – for negative values of q, the values of hðqÞ for NCS (hy ðqÞ) are higher than for the CS (hx ðqÞ) and the values of hðqÞ for CCS (hxy ðqÞ) fall in between for all cases. For the range of small fluctuations (negative q between 10 and 0) all values of hðqÞ are in excess of 0.5, showing positive correlation and positive cross-correlation. – for medium size fluctuations (0 oq r2), hx ðqÞ, hy ðqÞ, hxy ðqÞ are also in excess of 0.5 with the same meaning, positive autocorrelation and cross-correlation. At the same time, for the main Hurst exponent hð2Þ for all series, the relative position of the three curves is changing in most of the situations. – for large fluctuations (q 42), the values of hxy ðqÞ gradually diminish and become practically zero for q 4 10 showing negative or complete absence of cross-correlation between coding and noncoding sequences for all studied genomes. The behavior of hx ðqÞ and hxy ðqÞ is different for different categories. The column Dh in Table 2 presents the range of variation of hðqÞ for the CS, NCS and CCS of each considered genome The larger is the interval of variation of h, the greater is the degree of multifractality. From the graphs in Fig. 1 we see that the NCS are the most and the CS the least multifractal in all cases. In the majority of situations, the CCS is characterized by a degree of multifractality in between the two with the exception of the
Short name
Hyperthermophilic group where the multifractality of the CCS is smaller than that of both NCS and CS. The singularity spectra for a representative case in each of the four groups are shown in Fig. 2. The values of a that correspond to the maximum of the singularity spectra are given in Table 2 in the column a_max. The values qða_maxÞ are identified from the curves aðqÞ (not shown) and in all cases these are slightly negative. We observe that the symmetry features of the singularity spectrum curve are characteristic for each studied group. We define a measure of symmetry as: Symm ¼
RwLw Rwþ Lw
ð23Þ
where Rw and Lw represent the width of the right hand and left hand halves of the curve computed at half maximum (f ðaÞ ¼ 0:5). The values corresponding to the spectra of NCS and CCS are presented as percentage in the Symmcolumn of Table 2. We observe the following: for archaea group both curves (Fig. 2a) are considered symmetrical because our computation accuracy is evaluated at 5%. For the hyperthermophilic bacteria the curves are both left hand asymmetrical (Fig. 2c). The chlamydia group (Fig. 2d) is characterized by opposite asymmetry of the two curves (NCS positive and CCS negative) and for the proteobacteria group (Fig. 2b) the curves are both strongly right hand side asymmetrical. Fig. 3 shows the curves of KðqÞ computed using (16). We observe that the analysis based on the KðqÞ curves is only partially conclusive in distinguishing between the groups. For the proteobacteria the curve for the CS is in between the NCS and CCS curves; for the hyperthermophilic bacteria, characteristic is the fact that the CS and CCS curves are very close to each other and considerably different from the NCS curve. Concerning the chlamydia group and the archaea, the CCS curve is situated almost symmetrically between the CS and NCS curves and this analysis is not applicable because the three curves are situated in similar relative positions. For each of the KðqÞ curves the estimation of the measure of inhomogeneity is performed using the definition (15) and the formula (18). The values are presented in Table 2 in the columns under the name C_int. We find that, unlike the curves themselves, the ratio between the intermittency parameter for NCS and the CCS shown in the Table 2 in the column Ratio is characteristics of each of the four groups (within the estimated error of 5%) and can be considered as an identifier for class affiliation of bacteria. In support of this statement we computed this ratio for other bacteria in the considered groups: for Pyrococcus abyssi we found Ratio¼4.43 in good agreement with the values for archaea (the first group in Table 2); for Chlamydia pneumoniae CWL the Ratio¼5.21 is in good agreement with the values in the Chlamydia group (the third in Table 2); for Haemophilus influenzae we found Ratio¼3.57 in agreement with the values for Proteobacteria (the last group in Table 2); all within the same 5% estimated error.
58
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
Fig. 1. Graphs of hðqÞ dependence for the studied prokaryotes: Afulgi (a); MJan (b); Rprowaz(c), HPylor_266 (d); Aquifex (e); Tmarit (f); ChlamPneu_TW (g); ChlamPneu_J (h). The curves for CS are drawn with open circles, for NCS with white squares and for CCS with solid circles.
4. Genomic sequences and the associated multifractal cross-correlation series as universal multifractals According to the universal multifractal hypothesis (Tessier et al., 1993; Schertzer and Lovejoy, 1997; Vasilescu et al., 2010), these multifractals possess the remarkable property of allowing the description of the whole statistics of a given field with only three basic parameters. Each data sequence that can be approximated as a universal multifractal is completely characterized by the set of parameters H, C 1 , which in this context is usually called intermittency parameter, and a known as Le´vy index (it should not be confused with the scaling exponent of the
singularity spectrum). Under universal multifractal hypothesis, the scaling moment function is written: ( K ðqÞ ¼
C1 a a1 ðq qÞ
C 1 qln q
aa1 a¼1
ð24Þ
H is computed from (10), C 1 is given by (15) and a is obtained by differentiation of (9) after replacement KðqÞ with (16) for a a 1. Under fairly general conditions, the probability distribution of a random variable can be specified by its statistical moments. These allow the introduction of a scaling moment function KðqÞ describing the multi-scaling of the statistical moments of order q
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
59
Table 2 . Name
a_max
Dh
Afulgi MJan Rprowaz HPylor_266 Aquifex Tmarit ChlamPneu_TW ChlamPneu_J
qða_maxÞ
C_int
Ratio
Symm ð%Þ
CS
NCS
CCS
CS
NCS
CCS
CS
NCS
CCS
CS
NCS
CCS
NCS/CCS
NCS
CCS
0.31 0.27 0.38 0.41 0.57 0.53 0.25 0.28
1.35 1.85 1.18 1.38 1.17 1.07 1.09 1.12
0.42 0.45 0.65 0.62 0.55 0.52 0.59 0.56
0.64 0.57 0.56 0.58 0.58 0.61 0.68 0.68
0.89 0.72 0.69 0.81 0.76 0.88 0.68 0.68
0.76 0.65 0.63 0.70 0.65 0.74 0.68 0.68
1.51 0.79 0.57 0.45 0.25 0.17 0.35 0.22
0.18 0.17 0.31 0.18 0.35 0.08 0.35 0.22
0.45 0.39 0.13 0.45 0.04 0.04 0.35 0.22
0.001 0.021 0.017 0.040 0.037 0.029 0.011 0.010
0.170 0.110 0.039 0.135 0.126 0.175 0.102 0.098
0.037 0.024 0.011 0.038 0.031 0.043 0.020 0.019
4.59 4.58 3.54 3.55 4.06 4.07 5.10 5.15
2.8 3.9 47 35 5.8 9.3 7.9 7.7
1.8 3.2 47 42 7.7 9.4 7.0 7.1
Fig. 2. The singularity spectra for a representative case in each of the four groups: Afulgi (a); Rprowaz (b); Aquifex (c); ChlamPneu-TW (d). The curves for CS are drawn with open circles, for NCS with white squares and for CCS with solid circles.
of the sequence (Chu, 2004). For statistically self-similar series, a power law dependence of the statistical moments on the averaging interval is expected: /½eðrÞq S r KðqÞ
ð25Þ
where ‘‘/:S’’ indicates statistical or spatial averaging. The relationship computed for the condition q ¼ 0 introduced back into (9) gives hð0ÞhðqÞ ¼
C1
a1
qa1
ð26Þ
The slope of the line representing this dependence in a log-log plot gives the value of a. The value already computed for C 1 by (17) is used for testing the consistency of computation. We also considered another way of computing a, i.e. by differentiation of (16) for a a 1 in respect of q. The relationship computed for the condition q ¼ 0 gives
a ¼ 1
K 0 ð1Þ : K 0 ð0Þ
ð27Þ
In the case of the universal multifractals, the function KðqÞ defined by (9) is identical to KðqÞ defined by (25) and the values of
the parameters C 1 and a, computed using the scaling moment function in the respective formulas should give similar results. The dynamics of a large class of biological, geological and financial time series are demonstrated to be well approximated as universal multifractals (Tessier et al., 1996; Vasilescu et al., 2010; Muzy et al., 1991; Lovejoy Schertzer, 2010; Seuront and Lagadeuc, 2001). The debate on the existence of universal multifractals has lasted for over a decade (Schertzer and Lovejoy, 1997). We show that the series of CS and NCS as well as their associated CCS satisfy the universal multifractal conditions reasonably well. The adopted procedure is the following: for each analyzed sequence, the KðqÞ dependence is obtained from Eq. (16). The curves KðqÞ are convex for all series and satisfy the condition (17). Some of the curves computed using this procedure are shown in Fig. 3. Then, the intermittency parameter is obtained as the derivative of hðqÞ in the point q ¼ 1 with opposite sign and the parameter a is computed as average of the values given by (26) and (27). Then, using the singular measures algorithm (Yu et al., 2003; Lovejoy Schertzer, 2010; Seuront and Lagadeuc, 2001) we carried out a computation of the scaling moment function KðqÞ as slope of the line of the graph of (25) in log–log plot. The parameters C 1 and a estimated using this KðqÞ in Eqs. (15) and (26) are in good agreement with the values previously obtained using the function
60
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
Fig. 3. The curves of KðqÞ computed using (9) for the studied genomes: Afulgi (a); MJan (b); Rprowaz (c), HPylor_266 (d); Aquifex (e); Tmarit (f); ChlamPneu_TW (g); ChlamPneu_J (h). The curves for CS are drawn with broken lines, for NCS with thin continuous lines and for CCS with thick lines.
KðqÞ defined by (9), demonstrating that the universal multifractal hypothesis is well satisfied by the series of CS and the NCS as well as by their associated CCS. Next, these values are introduced in (24) for a a 1 and the theoretical function KðqÞ is computed. The results represented in Fig. 4 with dotted lines show that the universal multifractal approximation is satisfactorily valid for values of qr 4, for the series of CS and NCS as well as for the newly defined CCS. A representative of each of the four groups is presented. Finally, we decided to try to obtain an image of the multifractal cross-correlation series associated to a certain pair of coding and non-coding sequences using Lovejoy’s Mathematica code for the simulation of continuous in scale universal multifractals (Lovejoy Schertzer, 2010). Clearly, at each simulation, a completely different-looking sequence is obtained although the values of the two parameters C 1 and a are the same, and we had to find a way of choosing between these. Fig. 5c presents the hypothetical CCS series associated to the pair of CS series (Fig. 5a) and of NCS series (Fig. 5b) of Tmarit. We chose this bacteria because of the
closeness between the KðqÞ curves for the CS series and the associated CCS series (Fig. 3f). It was hypothesized that the histogram of the two series should also look similar. Accordingly, from the multitude of simulated sequences we chose the one with a histogram closely similar to the histogram of the coding sequence. This is shown in Fig. 5c. By simple eye inspection the multifractal cross-correlation series has characteristics from each of the CS and NCS series to which it is associated. We cannot so far identify a general criteria of choosing the appropriate multifractal cross-correlation series from a series of simulations using the values of parameters axy ; C 1,xy computed as discussed in Section 4.
5. Conclusions This study consists in the investigation of the characteristics of series of length of coding and non-coding DNA sequences of some bacteria and archaea using an extension of the MFDCCA. While the analysis is carried out on a pair of time series (data sequences)
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
61
Fig. 4. Universal multifractal approximation for: MJan (a); Aquifex (b); ChlamPneu_TW (c); Rprowaz (d). The theoretically computed function KðqÞ is shown for each situation by dots. On each graph the thin lines show the NCS, the broken line is for the CS and thick line corresponds to the CCS.
Fig. 5. Coding sequence (a); Non-coding sequence (b); Simulated hypothetical multifractal cross-correlation series for Tmarit using the following estimated values of the parameters axy ¼ 1:97; C 1,xy ¼ 0:043 (c).
of equal length as usual, we consider a hypothetical multifractal cross-correlation series that can be characterized by the full set of parameters that are attributed to any time series. The analysis performed on the series of CS, NCS of the complete genome and on their associated CCS is based on definite characteristics of the relationship between these series. Specifically, these features are observed in the dependence of the generalized Hurst exponent on the size of fluctuations (order q), in the shape of the singularity spectra, in the shape and relative disposition of the curves of the singular measures scaling exponent and the intermittency parameters. The results of the analysis can be considered as criteria for the identification of class affiliation of bacteria and archaea. Empirically, we demonstrate that the CS, NCS, and the associated CCS can be approximated as universal multifractals
and the computed parameters are used as test of validity for the conclusion of the previous analysis. Finally, an attempt at simulating the multifractal crosscorrelation series for a particular bacteria is performed using Lovejoy’s code for the simulation of continuous in scale universal multifractals.
References Ascombe, F.J., 1973. Am. State 27, 17. Cristescu, C.P., Stan, C., Scarlat, E.I., Minea, T., Cristescu, C.M., 2012. Physica A 391, 2623. Chu, P.C., 2004. Chaos Solitons Fractals 19, 275. Davis, A., Marshak, A., Wiscombe, W., Cahalan, R., 1994. J. Geogr. Res. 99, 8055.
62
C. Stan et al. / Journal of Theoretical Biology 321 (2013) 54–62
Gangodagamage, C., Barnes, E., Foufoula-Georgiou, E., 2007. Geomorphology 91, 198. He, L.-Y., Chen, S.-P., 2011. Physica A 390, 3806. Hilborn, R.C., 1994. Chaos and Nonlinear Dynamics. Oxford University Press p. 452. Jaffard, S., Siam, S., 1997. J. Math. Anal. 28, 944. Jiang, Z.-Q., Zhou, W.-X., 2011. Phys. Rev. E 84, 016106. Kantelhardt, J.W., Zschiegner, S., Koscielny-Bunde, E., Havlin, S., Bunde, A., Stanley, H.E., 2002. Physica A 316, 87. Kulkarni, O.C., Vigneshwar, R., Jayaraman, V.K., Kulkarni, B.D., 2005. Bioinformatics 21, 3818. Lopes, R., Betrouni, N., 2009. Med. Image Anal. 13, 634. Lovejoy, S., Schertzer, D., 2010. Comput. Geosci. 36, 1393. Muzy, J.F., Bacry, E., Arneodo, A., 1993. Phys. Rev. E 47, 875. Morariu, V.V., Buimaga-Iarinca, L., 2010. Fluctuation Noise Lett. 9, 47. Muzy, J.F., Bacry, E., Arneodo, A., 1991. Phys. Rev. Lett. 67, 3315. Oczeretko, E., Juczewska, M., Kasacka, I., 2001. Folia Histochem. Cytobiol. 39, 75. Ossadnik, S.M., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Mantegna, R.N., Peng, C.-K., Simons, M., Stanley, H.E., 1994. Biophys. J. 67, 64. Peng, C.K., Buldyrev, S.V., Havlin, S., Simons, M., Stanley, H.E., Goldberger, A.L., 1994. Phys. Rev. E 49, 1685. Pearson, K., 1901. Philos. Mag. Ser. 6, 559. Podobnik, B., Stanley, H.E., 2008. Phys. Rev. Lett. 100, 084102. Paux, E., Sourdille, P., Salse, J., et al., 2008. Science 322, 101. Rodgers, J.L., Nicewonder, W.A., 1988. Am. State 42, 59. Stanley, H.E., Meakin, P., 1988. Nature 335, 405. Schertzer, D., Lovejoy, S., 1992. Physica A 185, 187. Seuront, L., Schmitt, F., Lagadeuc, Y., Schertzer, D., Lovejoy, S., 1999. J. Plankton Res. 21, 877.
Scarlat, E.I., Stan, C., Cristescu, C.P., 2007. Physica A 379, 188. Stan, C., Minea, T., Cristescu, T.M., Buimaga-Iarinca, L., Cristescu, C.P., 2012. UPB. Sci. Bull., Ser. A 74, 109. Stan, C., Cristescu, C.P., Scarlat, E.I., 2010. J. Theor. Biol. 264, 513. Stan, C., Cristescu, C.P., Neacsu, D.A., 2012. Rev. Roum. Chim. 57, 45. Su, Z.-Y., Wu, T., Chaos, S.-Y.Wang, 2009. Solid Fractals 40, 1750. Seuront, L., 2010. Fractals and Multifractals in Ecology and Aquatic Science. CRC Press/Taylor & Francis, Boca Raton. Schmitt, F., Lovejoy, S., Schertzer, D., 1995. Geophys. Res. Lett. 22, 1689. Schertzer, D., Lovejoy, S., 1997. J. Appl. Meteorol. 36, 1296. Seuront, L., Lagadeuc, Y., 2001. J. Plankton Res. 23, 1137. Tessier, Y., Lovejoy, S., Hubert, P., Schertzer, D., Pecknold, S., 1996. J. Geophys. Res 31D, 427. Thanbichler, M., Shapiro, L., 2006. J. Struct. Biol. 156, 292. Tessier, Y., Lovejoy, S., Schertzer, D., 1993. J. Appl. Meteorol. 32, 223. Vasilescu, J., Cristescu, C.P., Belegante, L., 2010. J. Opt. Adv. Mat. 12, 1414. Wang, S., Tian, F., Qiu, Y., Liu, X., 2010. J. Theor. Biol. 265, 194. Yu, C.X., Gilmore, M., Peebles, W.A., Rhodes, T.L., 2003. Phys. Plasmas 10, 2772. Yu, Z.-G., Anh, V., Lau, K.-S., 2001. Physica A 301, 351. Yu, Z.-G., Anh, V., 2001. Chaos Solitons Fractals 12, 1827. Yu, Z.-G., Anh, V., Wang, B., 2001. Phys. Rev. E 63, 011903. Zhou, W.-X., 2008. Phys. Rev. E 77, 066211. Zhou, Li-Qian, Yu, Zu-Guo, Deng, Ji-Qing, Anh, Vo, Long, Shun-Chao, 2005. J. Theor. Biol. 232, 559. Zia, R.K.P., Redish, F., McKay, R., 2009. Am. J. Phys. 77, 614. Zainea, O., Morariu, V.V., 2007. Fluctuation Noise Lett. 7, L501. /http://www.ncbi.nlm.nih.gov/S.