Available online at www.sciencedirect.com
BioSystems 91 (2008) 183–194
Discriminating coding, non-coding and regulatory regions using rescaled range and detrended fluctuation analysis Rene te Boekhorst a,∗ , Irina Abnizova b , Chrystopher Nehaniv a a
School of Computer Science, University of Hertfordshire, College Lane, Hatfield, AL10 9AB Hertfordshire, UK b MRC-BSU Cambridge, UK
Received 12 June 2006; received in revised form 27 March 2007; accepted 24 May 2007
Abstract In this paper we analyse the efficiency of two methods, rescaled range analysis and detrended fluctuation analysis, in distinguishing between coding DNA, regulatory DNA and non-coding non-regulatory DNA of Drosophila melanogaster. Both methods were used to estimate the degree of sequential dependence (or persistence) among nucleotides. We found that these three types of DNA can be discriminated by both methods, although rescaled range analysis performs slightly better than detrended fluctuation analysis. On average, non-coding, non-regulatory DNA has the highest degree of sequential persistence. Coding DNA could be characterised as being anti-persistent, which is in line with earlier findings of latent periodicity. Regulatory regions are shown to possess intermediate sequential dependency. Together with other available methods, rescaled range and detrended fluctuation analysis on the basis of a combined purine/pyrimidine and weak/strong classification of the nucleotides are useful tools for refined structural and functional segmentation of DNA. © 2007 Elsevier Ireland Ltd. All rights reserved. Keywords: Hurst exponent; Rescaled range; Detrended fluctuation analysis; Coding DNA; Regulatory regions
1. Introduction It is increasingly acknowledged that variation in the complexity of organisms is due to the regulation of gene activity rather than to the genetic specifications for protein coding per se. Gene activity is dynamic and affected by, among other things, metabolism and cell signalling (such as communication via hormones) (Rees et al., 2000; Jump and Clarke, 1999; Yamada and Noguchi, 1999). According to Markstein et al. (2002) as much ∗
Corresponding author. E-mail addresses:
[email protected] (R. te Boekhorst),
[email protected] (I. Abnizova),
[email protected] (C. Nehaniv).
as 50% of the metazoan genome is regulatory. However, most of this is not yet deciphered as it is extremely difficult to identify the components of regulatory regions. Regulatory regions contain transcription factor binding sites (TFBS), short sequences of DNA which are often located upstream or downstream the start position of gene transcription begins (although they may also occur within a gene). In turn, these binding sites are “recognized” by transcription factors, proteins that – upon binding to them – act as repressors or activators, thus controlling the rate of transcription of DNA into mRNA and ultimately that of translation into proteins. The identification of regulatory regions and TFBSs is an obviously essential, but unfortunately far from easy, step to obtain a deeper understanding of the regulation
0303-2647/$ – see front matter © 2007 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2007.05.019
184
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
of individual genes. The difficulties are especially outspoken for higher eukaryotes, where some regulatory regions – called enhancers – are located far upstream or downstream the target gene. The desire for large-scale comprehension has driven the development of high throughput methods. In turn, this has favoured computational approaches to the prediction of genomic components such as exons and regulatory regions because these sidestep the ultimately more reliable but slow and expensive route of experimental verification. The work reported in this paper aims to contribute to the computational detection of regulatory DNA by contrasting their statistical characteristics with those of (non-)coding but non-regulatory regions. Various methods have been used to characterize the statistical properties of genomic components. Nucleotide composition is commonly investigated with tools from information theory, i.e. by estimating the entropy of parts of the genome (Abnizova et al., 2006; Orlov and Potapov, 2004; Orlov et al., 2006) and statistical linguistics, such as those based on Zipf’s law (Mantegna et al., 1994). Statistical dependencies between nucleotides have been analysed using mutual information functions (Li, 1997), spectra (Voss, 1992; Vaidyanathan and Yoon, 2004; Bernola-Galv´an et al., 1999), hidden Markov models (Yoon and Vaidyanathan, 2006) and methods derived from random walk dynamics, such as detrended fluctuation- and rescaled range analysis (Ossadnik et al., 1994), to assess long-range correlations among nucleotides. The latter have attracted much attention (Abnizova et al., 2006; Voss, 1992; Herzel and Große, 1997; Azbel, 1995) and correlations up to 1000 bp have been found in particular for non-coding DNA. Coding regions appear to lack such long-range correlations (Abnizova et al., 2006; Buldyrev et al., 1992; but see Voss, 1992) but seem instead to be characterised by three cycle periodicities, as has been established by spectral analysis (Vaidyanathan and Yoon, 2004). A conventional way to study DNA is by functional segmentation, a top-down approach in which a genomic sequence is partitioned into segments and these are identified as a particular functional types of DNA (such as coding- or regulatory regions) if their statistical properties match with those of experimentally verified cases of that functional type. The opposite strategy could be called serial prospecting. This is a bottom-up procedure that maps the DNA landscape by assessing local statistical properties while moving along the sequence. By using sliding windows, changes in these local statistical properties can be detected. Regions with striking statis-
tical features or change points therein are candidates for further analysis. In both approaches the choice of segment- and moving window size is subjective. The problem is that too small segments and windows may not encompass the complete region of interest or reveal too much detail. However, when segments and windows are too large they may overlook critically important local differences or, in case the sequence is non-stationary, contain regions of different statistical structure. Non-stationarity violates assumptions of most algorithms currently in use and makes the results of Markov models and information theoretic measures worse than meaningless. In our previous work (Abnizova et al., 2006; Orlov et al., 2006), we advocated a combined approach that also addresses the problem of non-stationarity. This procedure involves: (i) Using statistical descriptors that distinguish between different types of DNA on the basis of the compositional heterogeneity and non-stationarity of a nucleotide sequence. We have used informational entropy for measuring compositional heterogeneity and rescaled range analysis to estimate the Hurst exponent (H), which we use as a measure for the degree of non-stationarity; a value of H < 0.5 points to short-term correlations, H = 0.5 represents a series of independent and identically distributed measurements (as in white noise) and H > 0.5 indicates the presence of long-range correlations (Schroeder, 1991). (ii) Adapting the window size so that compositional homogeneity (maximal entropy) and stationarity (a minimal value of |H − 0.5|) are locally optimised (for further explanation, see Section 2.2.1). We found that this procedure detected start positions of exons quite well (see Table A.1 in Appendix A), implying that these units can be typified as being relatively homogenous and stationary. The results were more pronounced if windows adapted their size to contain local minima of the Hurst exponent than to optimize around local maxima of entropy. Furthermore, the Hurst exponent appeared to be remarkably good in characterizing sequences of experimentally verified exons, which were found to have a significantly lower average H than non-coding regions (Abnizova et al., 2006; Orlov et al., 2006). However, questions remain. First of all, why does the Hurst exponent perform so well? Hurst exponents are estimated by means of rescaled range analysis, a procedure that is not known as being particular pow-
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
185
erful (Clegg, 2005). What would the results have been if we had used detrended fluctuation analysis instead, a method of estimating sequential persistence that is currently favoured by many (for a review also see Buldyrev et al., 1995; the website http://reylab.bidmc.harvard.edu/ tutorial/DFA/node5.html and Bernola-Galv´an et al., 1999)? Another issue concerns the way in which rescaled range analysis and detrended fluctuation analysis work. Both methods require that prior to further analysis the investigated sequence is binary coded and for this a pyrimidine/purine (P/P) classification is often chosen (but see Buldyrev et al. (1995) for an application of all possible coding schemes). However, there is no reason not to use other dichotomies, such a weak/strong bonding (W/S) categorisation. It is important to study the effect of alternative classification conventions to validate the established view that coding regions have generally low sequential persistence due to a lack of long-range correlations (Buldyrev et al., 1995). Indeed, most studies report a Hurst exponent of coding DNA of around 0.5, which corresponds to a series of independently positioned purine or pyrimidines nucleotides. But how does this relate to the three cycle periodicities found by spectral analysis of coding regions, which imply short-range correlations (and hence a value of H < 0.5)? We will address these questions by comparing how well rescaled range analysis and detrended fluctuation analysis, based on both a P/P and a W/S binary classification convention, discriminate between coding, regulatory and non-coding, non-regulatory sequences. The analysis presented in this paper differs from other similar comparative studies in that we explicitly focus on exons and regulatory regions rather than just coding and non-coding regions. Furthermore, the exons and regulatory regions are exclusively experimentally verified sequences and are analysed at the level of individual sequences of a single species (i.e. estimates are not pooled or averaged over species) using rigorous statistics (balanced repeated measurement ANOVA and non-parametric procedures).
of high quality rather than quantity, the limiting factor was the number of experimentally verified regulatory regions. Because investigational confirmation of functional genomic units is a highly laborious and expensive procedure, at present only a small number of regulatory enhancer regions have been experimentally validated. As far as we know, at the time of writing the largest set for a species is the one compiled by Nazina and Papatsenko (2003) and was therefore chosen as our primary subset of data. It contains 60 experimentally verified regulatory regions from Drosophila melanogaster targeting developmental genes and that range from 600 bp up to 2400 bp in size. The number and the lengths of these sequences defined those of the other two subsets. Our second subset contains 60 exons chosen with the Ensemble Genome Browser (http://www.ensembl.org/) randomly across the D. melanogaster genome by selecting them from each chromosome proportionally to the length of that chromosome. Each exon was checked to be internal, i.e. the first and last exon of a gene were excluded because these are known to be partly non-coding. The third subset comprises of 60 non-coding, nonregulatory (NCNR) DNA regions from D. melanogaster and was sampled using the Ensemble Genome Browser. The selection procedure was as follows:
2. Materials and methods
2.2. Measurements
2.1. Data
Each of the sequences used in this work was subjected to rescaled range- and detrended fluctuation analysis. The starting point of both rescaled range analysis and detrended fluctuation analysis is to imagine a “walk” along the DNA string in which the direction is changed from up to down when a particular, predefined state or event is encountered. For example, the walker moves up each time a pyrimidine (a thymine or a cytosine) occurs and moves down whenever a purine (an adenosine or a guanine) is encountered (Buldyrev et al., 1994). Various other
The size of the data set is constrained by our requirement to work exclusively with experimentally verified coding- and regulatory enhancer (and not promoter) regions. The sample sizes should be as large as possible but balanced, i.e. the regulatory regions should not be outnumbered by the in principle very large numbers of coding and non-coding, non-regulatory sequences that could be obtained. By thus opting for a data set
1. As for the exons, the sequences were selected from each chromosome proportionally to their length. 2. Within each chromosome, first a start location was chosen at random from a uniform distribution. As the size of the regulatory regions varies from 600 to 2400 bp, we sampled the length of the prospective sequence uniformly within this range. 3. We finally checked if the stretch chosen contained any known coding regions, or was located 1000 bp upstream/ downstream regions of any gene as in that case it may have a regulatory function. If so, we discarded it, and sampled a new one according to the procedure, until we obtained 60 sequences. All data sets are available from our website (visit http:// www.mrcbsu.cam.ac.uk/BSUsite/AboutUs/People/irina.xml). Rescaled range analysis and detrended fluctuation analysis were applied to the whole length of each sequence (i.e. without the use of windows).
186
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
occurrences in the DNA sequence may be considered as states or events for a walk and we have therefore also performed the analysis using a classification into weak (A, T) and strong (G, C) binding nucleotides. By this procedure, the actual occurrences of events are cumulated and this gives a much clearer picture of the sequential structure than the original binary scores. The outcome of this integration is a landscape as shown in Fig. 1b, in which the regulatory region of the CyIIIa gene of the sea urchin Strongleocentrotus purpuratus is mapped by applying P/P coding. Because in this gene long stretches of mainly purine alternate with stretches that contain mostly pyrimidine, the resulting landscape is characterised by prominent hills and valleys. As will be explained below, the Hurst exponent quantifies the hilliness of the landscape and hence is a measure for the degree of non-stationarity. If the probability of the occurrence of events is identical, constant and independent of the position in the string, then such an integrated DNA walk would actually be a random walk. Note that the randomness of this process is in its increments (the original binary scores) because it is these that make up a series of independent and identically distributed events (as, for instance, generated by Bernoulli trials or a white noise process). Such a series of increments is therefore stationary, but the random walk, resulting from the integration of the increments, is not and the stronger the sequential dependency (persistency) in the original data, the more pronounced will be the contours of the landscape and hence the degree of non-stationarity.
Non-stationarity implies that the statistical characteristics (such as mean, variance and co-variance) of subsequent data change over the series. In turn, this is manifest as significant and largely constant (=long-range) autocorrelations (Wei, 1994). But although autocorrelation coefficients may inform whether or not a series is non-stationary, in themselves they do not quantify the degree of non-stationarity. 2.2.1. Rescaled range analysis (RRA): calculation of the Hurst exponent One way to quantify the degree of non-stationarity is by calculating the Hurst exponent (H). This measure is related to the frequency of the spectrum, f−β , as H = (β − 1)/2. When the original series (i.e. the non-integrated values) is from a white noise process, which has a flat spectrum (β = 0, hence f0 ) and no persistence, Horig = −0.5 for that series. After integration, the resulting random walk of that same series has a Hurst exponent of Hintegr = 0.5. However, when the original series is itself a random walk (Horig = 0.5), its cumulative sum is characterised by Hintegr > 0.5 demonstrating the link between stronger non-stationarity and a larger Hurst exponent. Conversely, a value of Hintegr < 0.5 in the integrated data implies Horig < −0.5 for the original series and indicates anti-persistence (negative autocorrelations) (Schroeder, 1991). For our data, we calculated the Hurst exponent for purine/pyrimidine sequences by setting xk = +1 for k = T, C and xk = −1 for k = A, G. In the WEAK/STRONG classification scheme xk = +1 for k = A, T and xk = −1 for k = G, C. The degree
Fig. 1. Rescaled range analysis. (a) Binary classified sequence of regulatory region of the CyIIIa gene of the sea urchin Strongleocentrotus purpuratus. Purines are symbolised as +1, pyrimidines as −1. (b) Integrated data plotted as the cumulative sum of (a). (c) Double logarithmic plot of rescaled range against position of the sequence (data from Kirchammer and Davidson, 1996).
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
of non-stationarity displayed by {xk } is then captured by the following computational steps: (i) Define an initial segment of n subsequent nucleotides. (ii) Calculate the local mean, m(n), and standard deviation, s(n), of the classified values for the initial segment. (iii) For each subsequent position i = 1, 2, . . ., n of the segment subtract the classified value from the local mean to obtain the deviation di = xk,i − m(n). (iv) Determine the cumulative deviation D by adding up di to the previous (summed) deviations: D(i, n) =
i m=1
dm =
i
(xm − m(n)).
m=1
Do this until the last position of the segment, xn , is incorporated in the calculation. (v) From the thus integrated deviations, assess the range of D for the initial segment: R(n) = maxD(i, n) − maxD(i, n). i≤n
i≤n
(vi) Divide the range by the standard deviation, giving the rescaled range: RR(n) = R(n)/s(n). (vii) Increase the size of the initial segment to n: = n + n, where n is some fixed step size, and repeat the procedure. The idea behind these calculations is to estimate the degree by which local deviations from the mean accumulate for ever larger subsets of the sequence. Dividing by the standard deviation to obtain the rescaled range makes the degree of persistence relative to the fluctuations of the whole segment. For scale free data, the rescaled range follows the power law R(n)/s(n) ∼ (n/2)H and the parameter H, the Hurst exponent, can be interpreted as a measure of self-similarity (Schroeder, 1991). From this relationship, H can be estimated from the least squares fit of the regression of log[R(n)/s(n)] on log[n]. In Fig. 1c the Hurst exponent is the slope of the curve of R(n)/s(n) after double logarithmic transformation. 2.2.2. Detrended fluctuation analysis (DFA) The rationale behind DFA is to clean the integrated data from external trends and sources of non-stationarity that are not intrinsic to the self-similarity of the sequence (Peng et al., 1994; http://reylab.bidmc.harvard.edu/tutorial/DFA/node5.html). As in rescaled range analysis, this procedure starts with binary coding the data k and cumulating the deviations from the mean as y(k) = i (xi − m(n)). Next, the thus integrated series is segmented into boxes of equal length n in each of which a trend line, yn (k), is fitted by least square minimization. The data are then detrended and the root mean square fluctuation is calculated as
F (n)
N [y(k) k=1
N
− yn (k)]2
.
187
The calculations are repeated for boxes of increasing n to provide a relationship between F(n) and the box size n (Buldyrev et al., 1995 used an increasing segment as in the computation for H). A linear relationship between log F(n) and log(n) indicates self-similarity and the slope of the regression line relating log F(n) to log(n) is the scaling exponent α. Like the Hurst exponent, α is a parameter for self-similarity that takes the value 0.5 if the integrated series corresponds to a random walk (i.e. the original series before integration is white noise). Also like H, the α exponent can be seen as an indicator of the roughness of the time series. A low α (0 < α < 0.5) characterises a “spiky” time series (antipersistence); the larger α, the smoother (more persistent) the time series.
3. Results 3.1. ANOVA analysis of DNA type discrimination The degree of persistence was estimated for three types of sequences (CODING = coding DNA; REGREG = regulatory regions; NCNREG = noncoding, non-regulatory regions), by two methods (RRA, DFA) using two classification conventions (P/P, W/S). Correspondingly, we have three possible factors affecting the degree of persistence: DNATYPE, METHOD and BINCODE. We intended to estimate the effects of the three factors on the degree of persistence by means of a multi-variate ANOVA of a between-groups, repeated measures design (for details about ANOVA, see Damon and Harvey, 1987). Because both coding conventions and both methods were applied to the same sequence, they are the levels of the two “within-subject” factors BINCODE (with levels W/S and P/P) and METHOD (with the levels DFA and RRA) whereas DNATYPE is the “between-groups factor” (with the levels CODING, REGREG and NCNREG). The assumptions underlying ANOVA (normality of the within-cell residuals and the homogeneity of variance) were not met for the BINCODE condition W/S (also after logarithmic transformation). We therefore settled for an ANOVA to test for differences within the condition BINCODE = P/P only and investigated the condition BINCODE = W/S by means of a nonparametric procedure. The results (Fig. 2) show that DNATYPE has a significant impact in case of P/P coding (F = 72.44, d.f. = 2, p < 10−6 ). For both methods, the degree of persistence is lowest for coding DNA and highest for non-coding, non-regulatory regions. Furthermore, Hurst exponents are higher than α in all comparisons (i.e. a significant effect of METHOD, F = 99.51, d.f. = 1,
188
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
Fig. 2. Least square means of degree of persistence for the two methods and the three DNA types for P/P binary coding.
p < 10−6 ). Post-hoc comparison tests (Newman–Keuls test and Duncan’s test, see Damon and Harvey, 1987) showed all contrasts to be significant, apart from the one between non-coding, non-regulatory regions as measured by DFA and regulatory regions as measured by RRA. To analyse the effect of METHOD and DNATYPE for the sequences after W/S coding, we resorted to non-parametric tests. Kruskal–Wallis ANOVA by ranks (Siegel and Castellan, 1988) showed that the degree of persistence differed significantly between types of DNA for both DFA (KW test statistic = 100.238; d.f. = 2, N = 177; p < 0.001) and RRA (KW test statistic = 109.676; d.f. = 2, N = 177; p < 0.001). From the box plots in Fig. 3 it can be seen that for both methods the results are in line with those of the sequences after P/P coding. Multi-comparison tests indicated significant differences between all contrasts apart from the
one between regulatory and non-coding, non-regulatory regions. Descriptive statistics for all levels of comparison are brought together in Table 1. By and large, they confirm the results of other research, i.e. that coding regions have on average lower sequential persistence than noncoding stretches (Peng et al., 1994; Buldyrev et al., 1993). Values of H are larger and more variable than those of α in almost all cases, but exons after W/S coding are a notable exception. Another interesting outcome is that Hurst exponents appear to differentiate somewhat stronger between the three types of DNA than α and this holds in particular for W/S coded sequences. In Fig. 4 this is visualised in plots of H and α measured after P/P coding versus W/S coding. The slanting lines are the decision lines giving optimal separations between coding and non-coding, regulatory and non-regulatory DNA and were found by linear discriminant analysis (Damon and Harvey, 1987). The figure shows the almost perfect linear discrimination between coding- and non-coding DNA that results when both W/S and P/P coding are used; for RRA only one exon is within the cloud of noncoding, non-regulatory regions, whereas DFA resulted in just two non-coding, non-regulatory regions ending up in the cluster of exons. From the plots it can be seen that RRA slightly outperforms DFA with respect to the differentiation between coding and regulatory regions (RRA: 9 misclassifications out of 60 regulatory regions versus 15 out of 60 for DFA). The discrimination between regulatory and non-coding, non-regulatory regions, however, is far less clear. The lower panel in Fig. 4 displays the 95% confidence limits around the means of the clouds and shows that the distances between the DNA types are somewhat larger for RRA than for DFA.
Fig. 3. W/S coding. Median values for degree of persistence measured by RRA (left) and DFA (right) for different types of DNA. Multi-comparison tests indicate significant differences between all contrasts apart from the one between regulatory and non-coding, non-regulatory regions.
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
189
Table 1 Descriptive statistics for the degree of persistence measured for 180 genomic sequences DNATYPE
BINCODE
METHOD
Mean
S.E.
−95%
+95%
CODING
WS*
DFA RRA DFA RRA
0.442 0.388 0.443 0.516
0.012 0.013 0.012 0.001
0.419 0.363 0.420 0.497
0.465 0.414 0.466 0.535
DFA RRA DFA RRA
0.671 0.675 0.625 0.659
0.012 0.013 0.012 0.010
0.648 0.650 0.602 0.639
0.695 0.701 0.648 0.678
DFA RRA DFA RRA
0.633 0.632 0.562 0.612
0.012 0.013 0.012 0.010
0.609 0.606 0.539 0.593
0.656 0.657 0.585 0.631
PP* NCNREG
WS PP*
REGREG
WS PP*
*
Significant differences between RRA and DFA after Wilcoxon matched pair signed rank test (p < .0005).
The numbers of correct and incorrect classifications are tabled in Appendix A. From these data, the performance of DFA and RRA using W/S versus P/P coding was computed in terms of the statistics sensitivity, specificity and correlation coefficient (for a description of these statistics, see Appendix A). From Tables 2 and 3 it can be seen that the performance of RRA is slightly better than that of DFA.
4. Discussion 4.1. Similarities and discrepancies with other studies In accordance with the observations of Peng et al. (1994) and Buldyrev et al. (1993, 1995), we found a significant difference between different functional parts
Fig. 4. (Above) Scatter plots of persistence of all 180 genomic sequences measured by RRA (right) and DFA (left) using P/P and W/S binary classification. The sloping lines separate coding (solid, diamonds), regulatory (broken, circles) and non-coding, non-regulatory (stippled, triangles) sequences with smallest number of misclassifications and were obtained by linear discriminant analysis). (Below) Mean persistence ±95% confidence limits for the three types of genomic sequences measured by the two binary classification conventions.
190
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
Table 2 Performance statistics for RRA DNA type
Sensitivity
Specificity
Correlation coefficient
CODING REGREG NCNREG All non-coding
0.967 0.483 0.633 0.942
0.942 0.800 0.800 0.967
0.891 0.293 0.430 0.891
Table 3 Performance statistics for DFA DNA type
Sensitivity
Specificity
Correlation coefficient
CODING REGREG NCNREG All non-coding
0.900 0.467 0.610 0.925
0.925 0.758 0.800 0.900
0.815 0.228 0.425 0.815
of DNA, with sequential persistence for coding regions being lower than for non-coding DNA. However, there are also differences, the most striking one being our low values for exons (which actually suggest an antipersistent sequential structure). Whereas the average Hurst exponent and α for exons and non-conserved, noncoding regions are respectively H = 0.452, α = 0.443 and H = 0.671, α = 0.675,1 Buldyrev et al. (1993) report a coefficient of 0.51 for a coding-rich sequence (of E. coli) and of 0.61 for an intron-containing, coding-poor fragment (Human). The large scale study of Buldyrev et al. (1995), which considered all 33,301 coding and all 29,453 non-coding eukaryotic sequences available in the release of the GenBank in 1994, found an overall average of αcoding close to 0.5 and an overall average of αnon-coding of about 0.58. As in our study, Buldyrev et al. (1995) found an effect of binary classification. However, the direction of the differences does not quite agree with our results. Whereas in Buldyrev et al. (1995) the average α is larger for P/P than for W/S scoring irrespective the functionality of the sequence, in our case this holds only for coding DNA (see end note 1). The issue of what binary classification scheme to use can be resolved by using them both as a classifier as is illustrated in Fig. 4. Moreover, the combined use of the two classification conventions may shed some light on the structure of the investigated sequences. For instance, a stronger sequential persistence for W/S than for P/P clas1
One hundred and twenty additional coding and non-coding, nonregulatory sequences from D. melanogaster kindly provided by Marc Halfon and Long Li (Department of Biochemistry and Center of Excellence in Bioinformatics, SUNY) showed the same trend.
sification, as found by us for regulatory regions, occurs when series of relatively alternating A’s and T’s are followed by strings of C’s and G’s. As an example, consider the sequence AATATAGCGCCATAA. This is a more or less random series of pyrimidines (P) and purines (p), PPpPpPPpPppPpPP, but a strongly non-stationary and persistent sequence in terms of weak (W) and strong (S) binding nucleotides (WWWWWWSSSSSWWWW). A clustering of certain known transcription factor binding sites (such as the TATA box) and CpG islands could lead to this effect. The discrepancies between our results and those of Buldyrev et al. (1995) are probably due to the fact that our data sets comprise exclusively experimentally verified coding and regulatory regions. While the data of Buldyrev et al. (1995) may have included experimentally verified coding regions, although this is not stated by the authors, no mention is made about explicit regulatory regions (they were simply lumped in the non-coding class). Given the state of the art at that time, it is unlikely that they have actually been experimentally confirmed. Furthermore, the data of Buldyrev et al. (1995) were analysed as means of α averaged over a number of species which might have swamped out crucial differences. Also, whereas the coding regions analysed by Buldyrev et al. (1995) might have contained introns, the sequences we investigated are explicitly coding exons (i.e. cleaned from first- and last exons which are known to be non-coding). 4.2. Anti-persistence and negative short-range correlations among nucleotides in coding DNA The estimates we obtained for exons in this study are quite low (apart from Hp/p = 0.516), which corroborate our previous findings from another data set (Abnizova et al., 2006). The conventional view that coding regions have practically no sequential correlations (Buldyrev et al., 1995) therefore has to be adjusted: they actually possess short-range correlations (H, α < 0.50), which indicates anti-persistent behaviour and negative auto-correlations. This resolves the contradiction between the reported randomness of coding sequences suggested by accounts of H, α ∼ 0.50 on the one hand and the frequently reported latent periodicity of coding DNA on the other hand (Avery, 2002; Chechetkin and Lobzin, 1988; Korotkov and Kudryaschov, 2001; Boeva et al., 2004). The existence of anti-persistent, short-range correlations demonstrated by our study would actually be predicted from the significant 3 bp patterns and repeats described in these other studies.
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
191
4.3. Intermediate sequential persistence of regulatory regions
4.5. Other measures for inferring the sequential structure of DNA
Another important outcome of our study is the intermediate persistence of regulatory regions, which supports the results of Orlov and Potapov (2004) and Orlov et al. (2006) who demonstrated an intermediate complexity for regulatory regions compared to that of exons and introns. This in-between serial persistence of explicit regulatory regions has not been reported before and although it is significantly different from that of the other two types of DNA sequences, it adds to the difficulty of identifying them (as is evidenced by the low sensitivity and correlation coefficients in Tables 2 and 3). We are currently working on extensions of our algorithms to boost the contrast of comparisons involving regulatory regions.
We like to stress that estimating the fractal dimension of genomic sequences is not our prime interest. We therefore have not used more advanced methods like multi-fractal characterisation (Yu et al., 2001, 2003). Likewise, our concern is not the separation of power law long-range correlations from other sources of nonstationarity as a research objective in itself. Instead, our rationale is practical rather than theoretical: we use H purely to distinguish regions of different sequential persistence, whatever its cause. For that reason we also have not been particularly concerned about the constancy of the slope in the double log relationships. Our estimates of sequential persistence are therefore indeed rather crude. Actually, measuring the Hurst parameter, by whatever method (such as Fourier analysis, Whittle’s maximum likelihood estimator, the local Whittle estimator and wavelet analysis, see Clegg, 2005), is rather “hit and miss” Clegg (2005). Fourier analysis has been used to estimate H (Vaidyanathan and Yoon, 2004; Li, 1997) by exploiting the fact that the spectral density of a self-similar process obeys the power law f−(2H+1) , but a direct application of the periodogram leads to incorrect estimates of H and Whittle’s MLE is a more sophisticated estimator (Whittle, 1953; Rose, 1996). The local Whittle is a semiparametric version of this estimator and is based on fewer assumptions (Clegg, 2005). The wavelet method is used as an alternative to the spectral method in that it uses waveforms other than sine waves. It is favoured because it is not bounded by the linearity constraints of spectrum analysis and naturally captures self-similar scaling. For this reason wavelet analysis enjoys a growing popularity in bioinformatics research (for applications to the study of long-range correlations in genomic DNA, see Audit et al., 2001, 2004). A number of studies have been undertaken to compare the merits of these methods in estimating the Hurst exponent (Herzel and Große, 1997; Clegg, 2005; Rose, 1996; Stoev et al., 2005; Taqqu et al., 1995). An analysis by Clegg (2005) has shown that simulated time series with H = 0.7 are better estimated by local Whittle and wavelets than by RSS plots. Nevertheless, all estimators had problems with added AR(1) noise. Wavelet spectra are quite sensitive to severe bursts and non-stationarity in the form of mean shifts and lower frequency oscillations may result in overestimating H by this method (Stoev et al., 2005). When considering real data, Clegg (2005) advises to use extreme caution and a researcher relying on the results of any single estimator is likely to arrive at
4.4. Differences in discriminatory power between RRA and DFA Detrended fluctuation analysis has been developed as an alternative to rescaled range analysis to distinguish between long-range correlations due to self-affinity (i.e. fractality associated with power law scaling) on the one hand and other external sources on the other hand (Buldyrev et al., 1994; Peng et al., 1994). This is an important notion, because fractality is by no means the only cause for sequential persistence. As pointed out by Gneiting and Schlather (2003), fractal dimension and Hurst exponent are in principle independent of each other and can be explained without recourse to self-affinity. It is this difference between RRA and DFA that makes their comparison particular interesting, because it actually implies a contrast between estimators that more (H) or less (α) include other sources of serial persistence than self-similarity. The somewhat stronger discrimination by RRA thus may indicate that these other sources of non-stationarity could be important in distinguishing exons and regulatory regions from each other and other types of DNA. It should also be noted that the DFA algorithm is not designed to handle all possible non-stationarities and works better for certain types of non-stationary sequences (especially slowly varying trends) than others (http://reylab.bidmc.harvard.edu/tutorial/DFA/ node5.html). It has indeed been suggested that it is better to study the signal as it is since detrending might destroy important information as well as bring about spurious correlations (Scafetta et al., 2002).
192
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
incorrect conclusions. However, because our interest is in the distinguishing power of the Hurst exponent rather than in the assessment of its true value, we were happy to confine ourselves to RRA, which after all performed well and lends itself to a direct comparison with DFA. Nevertheless, similar analyses based on other estimators are certainly worthwhile and are planned as future work. Appendix A A.1. Adaptive window search for coding DNA based on RRA In previous work (Abnizova et al., 2006), we measured the Hurst exponent in a window of variable size sliding along DNA stretches in order to predict the starting positions of exons. The optimal size of the sliding window giving a minimal Hurst expoTable A.1 Comparison of actual and estimated start position of exons on the basis of low Hurst exponents measured in a shifting, adaptive window Species
Actual
Predicted
Fugu Fugu Fugu Fugu Fugu Mouse Mouse Mouse Mouse Mouse Mouse Mouse Mouse Mouse Mouse Mouse Yeast Yeast Yeast Yeast Yeast Human Human Human Human Human Human Human Human Drosophila Drosophila Drosophila Drosophila
470 500 80 120 1 300 660 1130 225 550 15 410 250 950 1 720 470 690 400 500 400 70 225 100 90 480 1005 85 250 210 230 370 200
465 495 75 120 1 315 690 1110 240 550 30 395 255 930 15 735 450 705 400 525 390 45 210 120 75 450 985 75 240 210 240 360 195
nent was considered as optimal. Those windows with H < 0.5 were assumed to be the most likely locations for exons. As an illustration, we apply this method here to predict the start positions of a small set of known coding regions from five species, each containing internal exons flanked by intronic sequences (the data were obtained from the Ensemble Genome Browser http://www.ensembl.org/, and the DoTs database, http://www.allgeness.org/). As one can see from Table A.1, the start positions of the exons are predicted fairly accurate for this data set. For a more detailed description of the adaptive window technique, see Abnizova et al. (2006). A.2. Statistics to assess the performance of DFA and RRA based on linear discriminative analysis We applied a linear discriminant analysis to crossvalidate the classification of the sequences into the three types of DNA. The tables (Tables A.2 and A.3) below give the number of correct and incorrect predictions, from which the numbers of true or false positives and negatives were derived. These quantities are defined as follows: • True positives (TP) are those sequences that were both verified and predicted as a particular DNA type. • False negatives (FN) are sequences that were experimentally verified to belong to a particular type of DNA, but not predicted as such. • False positives (FP) are predicted, but not experimentally verified sequences. • True negative (TN) are neither verified nor predicted sequences. As in Tompa et al. (2005), we used the following metrics to measure the performance of RRA and DFA: • Sensitivity: SN = TP/(TP + FN); • Specificity: SP = TN/(TN + FP); • Correlation. The correlation coefficient (CC) is CC = √
TP · TN − FN · FP . (TP + FN)(TN + FP)(TP + FP)(TN + FN)
For example, for DFA, the number of predicted exon sequences (TP) is 54. The number of sequences which are not predicted but are actual exons, FN, is 6. The number of predicted but not verified exons, FP, is 9.
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
193
Table A.2 Reclassification scheme resulting from the linear discriminant analysis of the RRA
CODING REGREG NCNREG Sum
From CODING to
From REGREG to
From NCNREG to
Sum
58 (0.32) 2 (0.01) 0 (0.00) 60 (0.33)
7 (0.04) 29 (0.16) 24 (0.13) 60 (0.33)
0 (0.00) 22 (0.12) 38 (0.21) 60 (0.33)
65 (0.36) 53 (0.29) 62 (0.34) 180 (1.00)
Table A.3 Reclassification scheme resulting from the linear discriminant analysis of the DFA
CODING REGREG NCNREG Sum
From CODING to
From REGREG to
From NCNREG to
Sum
54 (0.30) 6 (0.03) 0 (0.00) 60 (0.33)
9 (0.05) 28 (0.16) 23 (0.13) 60 (0.33)
0 (0.00) 23 (0.13) 37 (0.21) 60 (0.33)
63 (0.36) 57 (0.32) 60 (0.33) 180 (1.00)
Thus, the values of the performance metrics defined above for this example are: • SN here is 54/(54 + 6) = 0.90; • SP = 111/(111 + 9) = 0.925; • CC = √(54 × 111 − 6 × 90)/ (54 + 6)(111 + 9)(54 + 9)(111 + 6) = 0.815. References Abnizova, I., te Boekhorst, R., Walter, K., Gilks, W.R., 2006. New methods to infer DNA function from sequence information. In: Kolchanov, N., Hofestaedt, R., Milanesi, L. (Eds.), Bioinformatics of Genome Regulation and Structure II. Springer, New York, pp. 165–177. Audit, B., Thermes, C., Vaillant, C., d’Aubenton-Carafa, Muzy, J.F., Arneodo, A., 2001. Longrange correlations in genomic DNA: a signature of the nucleosomal structure. Phys. Rev. Lett. 86, 2471–2474. Audit, B., Vaillant, C., Arneodo, A., 2004. Wavelet analysis of DNA bending profiles reveals structural constraints on the evolution of genomic sequences. J. Biol. Phys. 30 (1), 33–81. Avery, P., 2002. Fitting interconnected Markov chain models-DNA sequences and test cricket matches. Statistician 52, 267–278. Azbel, Y.M., 1995. Universality in a DNA statistical structure. Phys. Rev. Lett. 75, 68–171. Bernola-Galv´an, P., Oliver, J.L., Rom´an-Rold´an, R., 1999. Decomposition of DNA Sequence Complexity. Phys. Rev. Lett. 83, 3336–3339. Boeva, V, Makeev, V., R´egnier, M., 2004. SWAN: searching for highly divergent tandem repeats in DNA sequences and statistical significance. JOBIM’04, IEEE Computer Society, Montr´eal. Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.K., Simons, M., Sciortino, F., Stanley, H.E., 1992. Long range fractal correlations in DNA (comment on the letter by R.F. Voss in PRL 68, 3805). Phys. Rev. 701. Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.K., Simons, M., Sciortino, F., Stanley, H., 1993. Long range fractal correlations in DNA. Phys. Rev. Lett. 71, 1776.
Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.K., Stanley, H.E., 1994. Fractals in biology and medicine: from DNA to the heartbeat. In: Bunde, A., Havlin, S. (Eds.), Fractals in Science. Springer Verlag, Berlin, pp. 49–87. Buldyrev, S.V., Goldberger, A.L., Havlin, S., Mantegna, R.N., Matsa, M.E., Peng, C.K., Simons, M., Stanley, H.E., 1995. Longrange correlational properties of coding and noncoding DNA sequences: GenBank analysis in DNA. Phys. Rev. 51, 5084– 5091. Chechetkin, V., Lobzin, V., 1988. Study of correlations in segmented DNA sequences: application to structure coupling between exons and introns. J. Theor. Biol. 190, 69–83. Clegg, R., 2005. A Practical Guide to Measuring the Hurst Parameter, http://www.richardclegg.org/pubs/rgcpew05.pdf. Damon, R., Harvey, W., 1987. Experimental Design ANOVA and Regression. Harper and Row, NY. Herzel, H., Große, I., 1997. Correlations in DNA sequences: the role of protein coding segments. Phys. Rev. E 55, 800–810. Gneiting, T., Schlather, M., 2003. Stochastic Tools That Separate Fractal Dimension and Hurst Effect. Technical Report 425. University of Washington, http://www.stat.washington.edu/www/ research/reports/2003/tr425.pdf. Jump, D.B., Clarke, S.D., 1999. Regulation of gene expression by dietary fat. Annu. Rev. Nutr. 19, 63–90. Kirchammer, C., Davidson, E., 1996. Spatial and temporal information processing in the sea urchin embryo: modular and intramodular organization of the CyIIIa gene cis-regulatory system. Development 122, 333–348. Korotkov, E., Kudryaschov, N., 2001. Latent periodicity of many genes. Genome Inf. 12, 437–439. Li, W., 1997. The complexity of DNA. Complexity 3, 33–37. Mantegna, R.N., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.K., Simons, M., Stanley, H., 1994. Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 73, 3169–3172. Markstein, M., Markstein, P., Markstein, V., Levine, M.S., 2002. Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes in the Drosophila embryo. PNAS 99, 763–768. Nazina, A., Papatsenko, D., 2003. Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency. BMC Bioinf. 22, 4–65.
194
R. te Boekhorst et al. / BioSystems 91 (2008) 183–194
Orlov, Y.L., Potapov, V.N., 2004. Complexity: an Internet resource for analysis of DNA sequence complexity. Nucl. Acids Res. 32 (Web Server issue), W628–W633. Orlov, Y.L., te Boekhorst, R., Abnizova, I., 2006. Statistical measures of the structure of genomic sequences: entropy, complexity, and position information. J. Bioinf. Comput. Biol. 4 (2), 523–526. Ossadnik, S., Buldyrev, S., Goldberger, A., Havlin, S., Mantegna, R., Peng, C., Simmons, M., Stanley, H., 1994. Correlation approach to identify coding regions in DNA sequences. Biophys. J. 1 (1), 64–70. Peng, C.K., Buldyrev, S.V., Havlin, S., Simons, M., Stanley, H.E., Goldberger, A., 1994. Mosaic Organization of Nucleotides. Phys. Rev. E, 1685–1689. Rees, W.D., Hay, S.M., Brown, D.S., Antipatis, C., Palmer, R.P., 2000. Maternal protein deficiency causes hypermethylation of DNA in the livers of rat fetuses. J. Nutr. 130, 1821–1826. Rose, O., 1996. Estimation of the Hurst Parameter of Long-Range Dependent Time Series. Research Report No. 137 from The Institute of Computer Science, University of W¨urzburg. Scafetta, N., Latora, V., Grigolini, P., 2002. L´evy scaling: the diffusion entropy analysis applied to DNA sequences. Phys. Rev. E 66, 1– 15. Schroeder, M., 1991. Fractals, Chaos Power Laws. Minutes From an Infinite Paradise. Freeman, New York. Siegel, J., Castellan, J., 1988. Non Parametric Statistics for the Behavioral Sciences. McGraw-Hill, New York. Stoev, S., Taqqu, M.S., Park, C., Marron, J.S., 2005. On the wavelet spectrum diagnostic for Hurst parameter estimation in the analysis of Internet traffic. Comput. Networks 48 (3), 423–445.
Taqqu, M.S., Teverovsky, V., Willinger, W., 1995. Estimators for longrange dependence: an empirical study. Fractals 3, 785–788. Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., R´egnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z., 2005. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144. Vaidyanathan, P., Yoon, B., 2004. The role of signal-processing concepts in genomics and proteomics. J. Franklin Inst., 1–27. Voss, R., 1992. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 2, 3805–3808. Wei, W., 1994. Time Series Analysis. Addison-Wesley, Redwood City, CA. Whittle, P., 1953. Estimation and information in stationary time series. Arkiv f¨or Matematik 2 (23), 423–434. Yamada, K., Noguchi, T., 1999. Nutrient and hormonal regulation of pyruvate kinase gene expression. Biochem. J. 337, 1–11. Yoon, B., Vaidyanathan, P., 2006. Context-Sensitive Hidden Markov Models for Modeling Longrange Dependencies in Symbol Sequences (http://reylab.bidmc.harvard.edu/tutorial/DFA/node5. html). Yu, Z.G., Anh, V., Lau, K.S., 2001. Multifractal characterisation of length sequences of coding and noncoding segments in a complete genome. Physica A 301 (1–4), 351–361. Yu, Z.G., Anh, V., Lau, K.S., 2003. Iterated function system and multifractal analysis of biological sequences. Int. J. Mod. Phys. B 17 (22–24), 4367–4375.