Molecular Phylogenetics and Evolution 60 (2011) 228–235
Evolutionary trends of GC/AT distribution patterns in promoters
Elisa Calistri a,b,⇑, Roberto Livi b, Marcello Buiatti a
a Dipartimento di Biologia Evoluzionistica, Università degli Studi di Firenze, via Romana 19, 50125 Firenze, Italy
b CSDC – Dipartimento di Fisica, Università degli Studi di Firenze, INFN – Firenze, via G. Sansone 1, 50019 Sesto Fiorentino, Italy
Article info
Article history: Received 19 November 2010; Revised 25 March 2011; Accepted 17 April 2011; Available online 7 May 2011
Keywords: GC/AT ratio; Evolution; Promoter sequence; DNA conformation/function
Abstract

Nucleotide distributions in genomes are known not to be random: they show specific motifs, long- and short-range correlations, periodicities, etc. In particular, motifs are critical for recognition by specific proteins affecting chromosome organization, transcription and DNA replication, but little is known about the possible functional effects of nucleotide distributions on the conformational landscape of DNA, which may lead to differential selective pressures throughout evolution. Promoter sequences have a fundamental role in the regulation of gene activity, and a vast literature suggests that their conformational landscapes may be a critical factor in gene expression dynamics. On these grounds, with the aim of investigating the putative existence of phylogenetic patterns of promoter base distributions, we analyzed GC/AT ratios along the 1000 nucleotides upstream of the TSS in wide sets of promoters belonging to organisms ranging from bacteria to pluricellular eukaryotes. The data obtained showed very clear phylogenetic trends of promoter sequence base distributions throughout evolution. In particular, in all cases either GC-rich or AT-rich monotone gradients were observed: the former in eukaryotes, the latter in bacteria along with strand biases. Moreover, within eukaryotes, GC-rich gradients increased in length from unicellular organisms to plants, to vertebrates and, within them, from ancestral to more recent species. Finally, the results are discussed with particular attention to the possible correlation between nucleotide distribution patterns, evolution, and the putative existence of differential selection pressures, deriving from structural and/or functional constraints, between and within prokaryotes and eukaryotes.

© 2011 Elsevier Inc. All rights reserved.
⇑ Corresponding author at: Dipartimento di Biologia Evoluzionistica, Università degli Studi di Firenze, via Romana 19, 50125 Firenze, Italy. E-mail address: [email protected] (E. Calistri).
doi:10.1016/j.ympev.2011.04.015

1. Introduction

Coding and non-coding DNA sequences perform two very different although complementary functions in the transmission of genetic information to proteins. Coding sequences are passive templates containing the information to be transcribed and translated, but need to be activated and regulated by complexes docking on the non-coding portion upstream of the Transcription Starting Site (TSS). Thus, the composition and distribution of nucleotides in coding DNA sequences has to be coherent with the distribution and composition of amino acids in the protein to be translated, both critical for the determination of its four-dimensional conformational landscape dynamics and function. Therefore, selection pressure acts on the amino-acid sequence of the proteins and only indirectly on the nucleotide distribution in DNA coding sequences. On the other hand, as happens for amino acids in proteins, the composition and distribution of nucleotides in non-coding DNA play a crucial role in determining its conformational landscape, which is very relevant for the recognition of regulatory proteins and, particularly in eukaryotes, for the correct organization and packaging of the structural components of chromosomes, see Calladine et al. (1988) and Frauenfelder et al. (2001). For instance, it has been shown that nucleotide pair stacking in DNA chains influences their structure which, in turn, affects promoter function, nucleosome wrapping and, more generally, all the processes where DNA–protein interactions are involved (Gabrielian et al., 2000; Kanhere and Bansal, 2005). Therefore, while in coding sequences the physico-chemical properties of nucleotides are not the critical factor for the transmission of the message, in non-coding ones they are fundamental in determining the double helix dynamics. Consequently, deviations from randomness in nucleotide distribution patterns, putatively deriving from natural selection, are not necessarily to be expected in coding sequences, but may be found in non-coding ones. Whole-genome studies on deviations from randomness of nucleotide distributions have been carried out since the 1990s using a number of computational tools and have always confirmed the aforementioned concepts (see for example Hildebrand et al. (2010) for recent evidence of selection upon bacterial genomic GC content). As reported, for instance, in Audit et al. (2001) and Acquisti et al. (2004), coding sequences were found to be quasi-random, or, in other words, they showed what may be called
"quenched randomness", while long- and short-range correlations, periodicities, patchiness etc. were found in non-coding DNA. In some cases, moreover, correlations were found to correspond to long- and short-range chromosomal structural constraints, see for instance Kendal and Suomela (2005), Trifonov (1989), and Li (1997). Eukaryotic genomes have been shown to be organized in a hierarchical order of domains at different length scales, generally displaying a fractal geometry (Acquisti et al., 2004), both at the DNA and the chromatin level (Lieberman-Aiden et al., 2009). Low complexity sequences (generally series of short repeats) have proved to be more abundant in eukaryote than in prokaryote genomes, and in non-coding than in coding sequences (Menconi et al., 2008). At a higher hierarchical level, patchiness of genome geography has been shown to be a general rule in eukaryotes, where compositionally homogeneous regions, called isochores, have been found, their size and average GC/AT ratio being conserved throughout evolution (Bernardi et al., 1985). The origin and fixation of isochores has been the object of a long-lasting debate between the supporters of neutral and selectionist models. Selectionists propose a functional explanation of the presence of isochores and, in general, of long homogeneous sequences, a hypothesis supported, for example, by the fact that the average size of an isochore is equal to the average size of a site of replication as defined by Ma et al. (1998). Moreover, it has been shown, particularly in the human genome, that GC-richer isochore families, the so-called "genome core", have a high gene density and are preferentially populated by transposons, while GC-poorer ones, called the "genome desert", contain a low number of genes, and differ from the former in chromatin structure and recombination maps, UTR length, breadth and intensity of gene expression, etc. (Bernardi, 2007).
These data suggest that the molecular context and the nucleotide distribution patterns may influence gene activity within a sort of division of labor positively selected during evolution in different genome domains. The hypothesis of positive selection, particularly of GC-rich isochores, has been challenged by several authors proposing an alternative explanation: the GC-biased gene conversion (BGC), see for instance Duret and Galtier (2009). It is rather surprising, in relation to this ongoing relevant discussion, that most of the extant computational and molecular evidence on constraints to randomness has been obtained on whole genomes while only few papers are available on the evolutionary dynamics of specific non-coding regions with known functions. For this reason, in our work we decided to investigate deviations from randomness of nucleotide distributions at the level of the promoters. These structures, being the main controllers of gene expression, are the most probable target of changes in nucleotide sequences affecting qualitative and quantitative regulation and may be therefore liable to be shaped by means of natural selection. Transcription of different genes begins with the formation of complexes between transcription factors and specific short DNA sequences (‘‘boxes’’ or ‘‘motifs’’) present in the promoter of the gene to be activated. On the other hand, the quantitative regulation of expression of every gene depends on promoter nucleotide distribution pattern, particularly in terms of the GC/AT ratio, and its consequent specific conformational landscape. The hypothesis of the relevance of global distribution patterns and not only of the presence of single motifs seems to be coherent with previous studies (Aerts et al., 2004). 
They showed that in some Metazoan species, although the average AT content is greater than GC, the GC/AT ratio increases following a monotone function in a large region around the TSS, and the parameters of that function are different among distinct classes of vertebrates, and between vertebrates and arthropods. On these grounds, in the present work, at variance with the majority of previous studies centered on the identification of motifs, the analysis of base composition patterns, with a specific attention to the dynamics of the GC/AT ratio, has been carried
out on sequences from −1000 bp to the TSS of all promoters present in data banks in a large set of organisms from bacteria to multicellular eukaryotes. The data obtained have then been ordered according to the phylogenetic positions of the analyzed species or groups.

2. Materials and methods

2.1. Promoter retrieval

The sequences of 1000 bp (and of 300 bp in bacteria) upstream of the TSS (or of the ATG, the translational start site, for two species: Saccharomyces cerevisiae and Oryza sativa) of all annotated genes in each organism have been downloaded from different freely available promoter databases. Because the accuracy of algorithmic promoter prediction is still limited, for our analysis we have chosen databases constructed by collecting experimentally identified promoters. In our work, we have used the 5′ → 3′ single-strand promoter sequence, conventionally stored in genomic repositories, as it corresponds to the sense strand. Promoter sequences for Plasmodium falciparum, Cyanidioschyzon merolae, Danio rerio and Mus musculus from −1000 to +200 relative to the TSS (where the first transcribed base is position +1) of all the annotated and non-annotated genes were retrieved from DBTSS (version February 2008) (Yamashita et al., 2006). Xenopus tropicalis, Monodelphis domestica, Gallus gallus, Canis familiaris, Rattus norvegicus, Bos taurus and Pan troglodytes promoters have been downloaded from ECRbase, a database of evolutionary conserved regions, promoters, and transcription factor binding sites in vertebrate genomes (Loots and Ovcharenko, 2007). S. cerevisiae and Caenorhabditis elegans promoter sequences have been retrieved from the Saccharomyces Cerevisiae Promoter Database (SCPD) and the Caenorhabditis Elegans Promoter Database (CEPD) respectively, provided by Cold Spring Harbor Laboratory (CSHL).
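As an illustration of the strand convention described above, the following minimal sketch extracts a fixed-length window upstream of a TSS and returns it 5′ → 3′ on the sense strand. It is not the authors' retrieval procedure: the function names, the 0-based coordinates and the "+"/"-" strand labels are assumptions made for the example, and the sequence is treated as linear (circular bacterial chromosomes would need wrap-around handling).

```python
# Illustrative sketch only: function names, the 0-based coordinate convention
# and the strand labels are assumptions, not taken from the databases above.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_sense(genome, tss, strand, length):
    """Return `length` bases upstream of the TSS, read 5'->3' on the sense strand.

    genome : sequence of the reported (forward) strand
    tss    : 0-based index, on that strand, of the first transcribed base (+1)
    strand : "+" if the gene is transcribed left to right on `genome`, "-" otherwise
    The last character of the result is position -1. Returns None when the
    window runs past either end of a linear sequence.
    """
    if strand == "+":
        start = tss - length
        return genome[start:tss] if start >= 0 else None
    if strand == "-":
        end = tss + 1 + length
        if end > len(genome):
            return None
        # Upstream of a minus-strand gene lies at higher forward coordinates.
        return reverse_complement(genome[tss + 1:end])
    raise ValueError("strand must be '+' or '-'")
```

With this convention every returned window ends at position −1, so windows from genes on either strand can be stacked column by column.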
Arabidopsis promoter sequences have been downloaded from the latest version of The Arabidopsis Information Resource (TAIR8) web site (released in March 2008), a database which includes genetic and molecular data for the model higher plant Arabidopsis thaliana. We have collected promoter sequences for the species O. sativa from The Institute for Genomic Research Rice Genome Annotation project, which undertakes the updating of the rice genome sequence and annotation. Bacterial complete genomes have been downloaded from the NCBI repository, where the sequence conventionally reported is the so-called Watson strand. We have processed them so as to retrieve the 300 bp of the sense strand upstream of each transcript. For all the organisms taken into account, we have carried out our analyses on the whole ensembles of promoters, that is, the repertoire of the promoters of all the known, annotated, protein-coding genes, as provided in the databases on the basis of experimental evidence. In the majority of the studied organisms, the ensemble analyzed is a complete or nearly complete set of their promoters, while in a few others, for instance C. elegans, M. domestica or C. familiaris, the set of annotated genes is not complete. The analysis has been carried out on a wide range of organisms from unicellular prokaryotes and eukaryotes to plants and animals, with particular attention to vertebrates.

2.2. Base composition analysis (BCA)

For every genome G we have analyzed a set of N_G promoter sequences. Base composition analysis (BCA) is a straightforward measure of the percentage of A, C, G and T nucleotides in a set of DNA sequences. Here we are interested in studying the spatial distribution of nucleotides along the promoters. Accordingly, we have analyzed all N_G sequences starting from the TSS and we have
measured the occurrence of A, C, G and T at each position along the aligned set. Then we have calculated the mean density ρ_x(l) of each nucleotide x as a function of its position l along the promoter, with l = 1, 2, . . . , 1000:

$$\rho_x(l) = \frac{1}{N_G} \sum_{i=1}^{N_G} s_i^{(x)}(l) \qquad (1)$$
where x = A, C, G, T and s_i^{(x)}(l) = 1 if nucleotide x occupies position l in promoter i, and 0 otherwise.

3. Results

We shall now discuss the results obtained in the different groups of organisms analyzed (prokaryotes, unicellular eukaryotes, plants and animals). Note that in all the figures presented in this manuscript the horizontal axis corresponds to the position l (in bp) relative to the TSS, while the vertical axis reports ρ_x(l), with the colour code x = A (black), T (blue), C (red) and G (green). The use of negative values for the space variable l is conventional for bp upstream of the transcription start site. For the sake of space and clarity of notation, in all the following figures the symbols ρ_x(l) and l are not explicitly indicated.

3.1. Prokaryotes

We have analyzed ρ_x(l) in a large sample of prokaryotes, including both mesophiles and thermophiles according to their preferred environment, to obtain a wide and representative spectrum of the base composition behavior in prokaryote genomes. In Fig. 1a and b, curves show the dynamics of AT and GC from −150 bp to the TSS for all the species analyzed, divided into taxonomic groups according to Wu and Eisen (2008). As shown in Fig. 1, nucleotide density functions are either constant or slowly increasing in AT or GC, with an abrupt acceleration near the TSS, and very often a strand bias can be observed near it. The average GC/AT ratio in bacterial promoters is correlated with the base composition of the whole genome, the GC content of the promoter region increasing linearly with the GC content of the genome (data not shown). Our data are therefore coherent with previous results showing that the GC content of other parts of the bacterial genome (coding sequences, stable RNA genes, and spacers) is also positively and linearly correlated with the GC content of the total genomic DNA (Muto and Osawa, 1987).
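The mean density of Eq. (1), together with the AT/GC curves and the strand bias discussed above, can be sketched in a few lines. This is a minimal illustration rather than the code used for the paper: the function names are hypothetical, promoters are assumed to be equal-length strings aligned on the TSS, and the strand bias is taken here simply as the unnormalized differences ρ_G − ρ_C and ρ_A − ρ_T, which is an assumption of the example.

```python
from collections import Counter


def base_density(promoters):
    """Eq. (1): per-position mean density of A, C, G and T.

    promoters : list of equal-length strings, aligned so that the same index
                in every string is the same position relative to the TSS.
    Returns a dict mapping each base to a list of densities, one per position.
    Characters other than A, C, G, T (e.g. N) count toward no base.
    """
    n = len(promoters)
    length = len(promoters[0])
    if any(len(p) != length for p in promoters):
        raise ValueError("all promoters must have the same length")
    density = {x: [0.0] * length for x in "ACGT"}
    for col in range(length):
        counts = Counter(p[col].upper() for p in promoters)
        for x in "ACGT":
            density[x][col] = counts[x] / n
    return density


def at_gc_and_bias(density):
    """Derive AT and GC densities plus simple strand-bias curves.

    The bias is reported as plain differences (G minus C, A minus T);
    this normalization-free choice is an assumption of the sketch.
    """
    a, c, g, t = (density[x] for x in "ACGT")
    return {
        "AT": [ai + ti for ai, ti in zip(a, t)],
        "GC": [gi + ci for gi, ci in zip(g, c)],
        "G-C": [gi - ci for gi, ci in zip(g, c)],
        "A-T": [ai - ti for ai, ti in zip(a, t)],
    }
```

For example, `at_gc_and_bias(base_density(["ACGT", "ACGA", "TCGA", "GCGT"]))["GC"]` evaluates to `[0.25, 1.0, 1.0, 0.0]`.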
The large range in GC/AT content among bacterial species is well established, varying between at least 17% and 75% GC, with an even larger variation in the third codon position. In our sample, 28 out of 40 bacterial species have more AT than GC in promoters and, as we can see from Fig. 1, nucleotide densities from −150 bp to the TSS follow functions whose parameters differ according to the subdivision into the major phylogenetic groups of the present taxonomic classification. In our results, more closely related organisms, belonging to the same genus, generally exhibit a similar base composition and similar nucleotide densities along the aligned sequences. This behavior is coherent with previous results showing that the genomic guanine and cytosine content of eubacteria is related to their phylogeny (Bentley and Parkhill, 2004). In more detail, as shown in Fig. 1, in the whole group of Firmicutes (red boxes), both mesophilic and thermophilic, AT base pairs always have a higher density than their GC counterpart, and the difference between AT and GC is higher on average in the first group than in the second. As a general rule, the slope of the curve increases approaching the TSS. However, within this group, some differences in the GC/AT ratio and in the
length and amount of strand biases can be observed between genera. For instance, in species belonging to Listeria the AT/GC ratio is very high and in three of them the strand bias extends throughout the whole promoter, while in the other Firmicutes the strand bias is limited to the sequence near the TSS. The situation is rather different in Gammaproteobacteria (orange boxes), where GC is prevalent although it tends to decrease towards the TSS in five out of the eight species analyzed, three of which do not show significant quantitative differences between AT and GC contents and only one shows higher AT density. As far as Thermotogae are concerned (green boxes), AT density is higher in Thermotoga and Thermosipho but not in Thermobifida and Acidothermus. In Thermosynechococcus no significant difference between AT and GC densities is observed. Finally, all Streptococcus species showed a prevalence of AT over GC and a terminal strand bias near the TSS, while GC density is higher than AT in Staphylococcus. It is worth stressing here that, as also discussed by Mustoa et al. (2004), the promoter GC/AT content does not seem to be correlated with the environmental average temperature.

3.2. Unicellular eukaryotes

In Fig. 2 we show the behavior in terms of BCA of three unicellular species of eukaryotes: the protozoan P. falciparum (Fig. 2A), the red alga C. merolae (Fig. 2B) and the yeast S. cerevisiae (Fig. 2C). Promoters of the eukaryotic unicellular species analyzed are in this case very different: two of the three organisms show a prevalence of AT-rich sequences followed by a decrease in GC toward the TSS, covering a very short stretch in P. falciparum (50 bp) and a longer region, starting at −300 bp, in S. cerevisiae. A strong strand bias is also observed, constant throughout the sequences in Plasmodium and limited to the last 300 bp in S. cerevisiae, thus somewhat mimicking the behavior of prokaryotes. It is worth noting that the third unicellular organism analyzed, C.
merolae, is the only GC-rich eukaryotic unicellular species observed. In this case, at variance with the other two species, the promoters show a constant prevalence of GC base pairs until −150 bp from the TSS, followed by a sudden increase in GC starting at around −150 bp and reaching a peak around −50 bp from the TSS. C. merolae is a small unicellular red alga living in acid, sulfate-rich hot springs and it is considered to be one of the photosynthetic eukaryotes endowed with the simplest cell architecture and the smallest genome. Because of these peculiar features and its morphological characteristics, C. merolae is thought to be one of the most primitive eukaryotes (Matsuzaki et al., 2004), putatively showing ancestral features of eukaryotic phototrophs. The fact that this alga lives in high-temperature environments and is GC-rich may have some relevance to the selectionist hypothesis proposed by Bernardi (Bernardi, 2007), as we will discuss later.

3.3. Plants

The two plants considered in our work, O. sativa and A. thaliana, whose promoter patterns are shown in Fig. 3A and B respectively, are distantly related species both belonging to the phylum Magnoliophyta, with O. sativa, a monocot, representing an early lineage of angiosperms. In both cases AT density is largely prevalent over GC, as in most other species so far discussed but, unlike most prokaryotes and the two unicellular eukaryotes, both plants show an increase in GC from −300 bp to the TSS (as previously described in C. merolae). As we will discuss later, a similar pattern is also present in three animal species, namely C. elegans (not shown), D. rerio and X. tropicalis, showing a prevalence of AT, but almost no strand bias (see Fig. 4A and B). On the other hand, in the plants analyzed a strand bias is present both in AT and GC distributions. In particular, an increased frequency of C residues and a slight reduction in G's that causes the
Fig. 1. BCA in bacterial promoter sequences from −150 bp to −1 bp relative to the Transcriptional Start Site (TSS). In the graph boxes monophyletic groups are highlighted by different colors. (a) Mesophiles. Red boxes: Firmicutes; Orange boxes: Gammaproteobacteria. Here we report the genus each species belongs to: (A–D) Listeria; (E and F) Clostridium; (G–I) Lactobacillus; (J and K) Staphylococcus; (L and M) Escherichia; (N–R) Pseudomonas; (S) Proteus; (T) Klebsiella. (b) Thermophiles. Green boxes: Thermotogae and Deinococcus; Light green boxes: Cyanobacteria and Chloroflexi; Red boxes: Firmicutes. Here we report the genus each species belongs to: (a, d, f, g) Thermotoga; (b) Thermobifida; (c) Acidothermus; (e) Thermosipho; (h) Thermosynechococcus; (i, j, l) Streptococcus; (m, r, s) Thermoanaerobacter; (n) Moorella; (o) Clostridium; (p) Geobacillus; (q) Pelotomaculum; (t) Thermodesulfolobus. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
GC-strand bias can be detected in the region upstream of the translational start site. Our results are in agreement with the extensive analysis of GC skews in a wide range of species carried out by Washio and Tomita (2005), who found that a significant GC skew (C > G) is strictly conserved among monocots, eudicot plants and fungi, and attributed it to transcription-correlated constraints.

3.4. Vertebrates

As we have seen so far, in eukaryotes the average nucleotide density ρ_x(l) remains constant in the upstream part of the analyzed regions (rich in the weak bases A and T in almost all species) and changes downstream, where, generally, the GC content increases approaching the TSS, independently of its average level. This aspect becomes considerably more evident in vertebrate genomes, and particularly in amniotes, on which we focused our attention. We collected promoters of representative species of fish, amphibians, birds and several mammals, as they are the most
studied and characterized ensembles of annotated genomic regions. In this sample, base composition trends are in good agreement with a macro-evolutionary scenario following the geological time-scale of the representative species taken into account in our study. Moreover, the distribution of variation seems to be roughly coherent with accepted phylogenetic data obtained with paleontological and molecular methods, as shown for instance by the fact that P. troglodytes (and Homo sapiens, not shown) exhibits shapes similar to those of rodents, which are their closest relatives, see Mustoa et al. (2004). The most ancestral species (the fish D. rerio, the amphibian X. tropicalis and the marsupial M. domestica, Fig. 4A–C) show promoters rich in A and T along the whole sequence, with A and T densities remaining constantly at higher levels in our global analysis. The fish D. rerio exhibits the lowest level of GC, coherently with the fact that the chromosomes of zebrafish contain almost only GC-poor isochores (Costantini et al., 2007), in a genome which, like those of all cold-blooded vertebrates, is much more homogeneous in composition than those of the warm-blooded ones. Nonetheless,
in D. rerio, as well as in X. tropicalis and in M. domestica, we can detect an increase in GC content, starting at −180 bp from the TSS in D. rerio and at −300 bp in Xenopus (the shape of the curves is quite fuzzy in the opossum because of the low number of promoters present in the data bank for this species). A similar increase in GC concerns a wider region in the more recently diversified amniotes (Fig. 4D–H), where G and C are prevalent in a region from 400 bp to 700 bp before the TSS, reaching a maximum length in G. gallus (Fig. 4D), the only bird analyzed.
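The positions at which the GC content starts to rise are quoted above from inspection of the curves. One way such an onset could be estimated automatically is sketched below, under stated assumptions: the upstream part of the curve is taken as a baseline, the curve is smoothed with a moving average, and the onset is the most upstream position from which the smoothed GC density stays above the baseline by a margin `delta` all the way to the TSS. The function names, the baseline length, the margin and the window are all hypothetical choices for illustration, not parameters used in the paper.

```python
def moving_average(values, window):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def gc_rise_onset(gc, baseline_len, delta, window=1):
    """Estimate where the GC density starts rising toward the TSS.

    gc           : GC density per position, most upstream first; the last
                   element is position -1
    baseline_len : number of most-upstream positions used as the baseline
    delta        : margin above the baseline that counts as a rise
    Returns the distance d (in bp) such that the rise spans -d .. -1,
    or None when the position next to the TSS is not above the threshold.
    """
    smooth = moving_average(gc, window)
    baseline = sum(smooth[:baseline_len]) / baseline_len
    onset = None
    # Walk upstream from the TSS while the curve stays above the threshold.
    for i in range(len(smooth) - 1, baseline_len - 1, -1):
        if smooth[i] > baseline + delta:
            onset = i
        else:
            break
    return None if onset is None else len(gc) - onset
```

Because the result depends on `delta` and `window`, the value is best read as a rough comparative measure between species rather than an exact boundary.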
4. Discussion

The data obtained with our analysis show very clear trends of global changes in the dynamics of base composition patterns in promoters throughout evolution, in a series of organisms ranging from bacteria to mammals. As emerges from all the above figures, the nucleotide density ρ_x(l) is characterized by a specific interpolating function in every phylogenetic group. This function is stationary, as fluctuations decrease when the number of analyzed sequences increases. In fact, by averaging over an increasing number of sequences (N up to N_G), ρ_x(l) becomes a less and less fluctuating function, and the standard deviation from the average scales down like 1/√N. Indeed, by adding species-specific sequences to the analyzed promoters (i.e. increasing cases), the noise of the function ρ_x(l) is reduced and the global trend is maintained, proving the stationary features of ρ_x(l). Stationarity is a statistical property confirming the reliability of our findings, as it shows that a significant amount of sequences displays patterns similar within a species, but significantly different between them. Two kinds of patterns have been found to be indicative of evolutionary trends in our data, namely the presence of a strand bias near the TSS and the overall GC/AT distributions from −1000 bp to the TSS. We will therefore discuss them separately, as they probably have different causes and functions.

4.1. Strand bias
The first distinctive character differentiating organisms belonging to different phylogenetic groups is the presence/absence and the pattern of strand biases. Such biases in bacteria often cover the whole sequence analyzed or a significant part of it and are still relevant in unicellular eukaryotes and plants, being present only to a lesser extent in animals. The prominence of strand biases in prokaryotes, but not in eukaryotes, may be due to the fact that in bacteria only one strand is transcribed (Frank and Lobry, 1999). In vertebrates the strand bias observed shows G and T residues more frequently than C and A in the leading strand. That confirms and extends previous studies, which report the presence of skews in
Fig. 2. BCA in unicellular organisms. Promoter sequences from −1000 bp to −1 bp relative to the TSS in Plasmodium falciparum (A) and Cyanidioschyzon merolae (B). Promoter sequences from −1000 bp to −1 bp relative to the ATG in Saccharomyces cerevisiae (C).
Fig. 3. BCA in plants. Promoter sequences from −1000 bp to −1 bp relative to the TSS in Arabidopsis thaliana (A) and promoter sequences from −1000 bp to −1 bp relative to the ATG in Oryza sativa (B).
mammals (Louie et al., 2003; Green et al., 2003) with an excess of G and T over A and C in the sense strand of genes. This feature is attributed to the action of the transcription-coupled DNA repair system by Green et al. (2003) and to differences in replication, deamination, mutation and repair biases in the leading but not in the lagging strand by Frank and Lobry (1999). In their work, it is still detectable 1000 bp after the end of transcription, but is less evident in the promoter region.

4.2. Overall nucleotide distribution patterns

GC/AT contents vary along the analyzed sequences following a monotone function towards the TSS in all the species analyzed.
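A simple way to check whether a density curve is monotone towards the TSS is sketched below. It is only an illustrative test, not the criterion used in this work: the curve is averaged in consecutive bins to damp position-to-position noise, and the sequence of bin means is then required to be non-decreasing (or non-increasing) within a tolerance. The function names, the bin size and the tolerance `tol` are assumptions of the example.

```python
def bin_means(values, bin_size):
    """Average `values` in consecutive bins (the last bin may be shorter)."""
    means = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means


def monotone_direction(values, bin_size, tol=0.0):
    """Classify a density curve, ordered from upstream to the TSS.

    Returns "increasing" or "decreasing" when the binned curve never moves
    against that direction by more than `tol` and moves with it at least
    once by more than `tol`; returns "none" otherwise (including flat curves).
    """
    means = bin_means(values, bin_size)
    diffs = [b - a for a, b in zip(means, means[1:])]
    if all(d >= -tol for d in diffs) and any(d > tol for d in diffs):
        return "increasing"
    if all(d <= tol for d in diffs) and any(d < -tol for d in diffs):
        return "decreasing"
    return "none"
```

Applied to a GC density curve, "increasing" would correspond to the eukaryote-like gradient; applied to AT, it would correspond to the AT-rich gradient seen in many bacteria. Larger bins make the test more tolerant of noise at the cost of spatial resolution.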
As discussed in Results, in prokaryotes the increase in density may be either in AT or in GC, according to phylogenetic differentiation. On the contrary, in eukaryotes we generally observe the increase, in the last part of the sequence, of GC base pairs, irrespective of the AT or GC prevalence in the upstream parts of the string. The straight gradient of base density, particularly clear in vertebrates, is a function of position and it represents a global trend of most sequences, or of a statistically relevant amount of them. Moreover, the deviation from random distribution of the observed base density dynamics concerns more extended regions when we move from unicellular organisms to vertebrates and, within this group, covers broader regions in modern species than in ancient ones. Recent data obtained in our laboratory have shown that the monotone nature of the curves obtained with the whole set of species-specific promoters analyzed is the result of increasing numbers of either GC or AT-rich low complexity short stretches towards the TSS (manuscript in preparation). This feature is coherent with previous results obtained in our laboratory (Acquisti et al., 2004; Buiatti and Buiatti, 2004; Menconi et al., 2008), where low complexity regions have been shown to be more frequent in eukaryotes than in prokaryotes and in non-coding rather than in coding sequences. It should be noted here that evolution is faster in regulatory, non-coding sequences than in coding DNA and, indeed, low complexity sequences, which are hyper-variable in length and number, are distributed in a nonrandom fashion, in non-coding DNA. Both the presence of low complexity sequences and a fast evolution would probably be counter selected in coding DNA where even a single change in a non-synonymous codon could drastically modify the function of coded proteins. As already discussed in the introduction, the situation is different in the case
Fig. 4. BCA in vertebrates. Promoter sequences from −1000 bp to −1 bp relative to the TSS in (A) Danio rerio, (B) Xenopus tropicalis, (C) Monodelphis domestica, (D) Gallus gallus, (E) Canis familiaris, (F) Mus musculus, (G) Bos taurus, (H) Pan troglodytes. Species are arranged according to divergence-time estimates for mammalian orders and major lineages of vertebrates, as given in Kumar and Hedges (1998), based on molecular time estimations. On the left: a molecular timescale for vertebrate evolution, adapted from Kumar and Hedges (1998). All times indicate Mya separating humans and the Order, Family or Genus shown. Letters (A–H) relate every represented species to the phylogenetic group it belongs to.
of non-coding sequences, where selection acts directly on nucleotide distributions, thus favoring compositions which lead to a positive conformational landscape. It is worth noting here that the correlation between promoter function and the dynamic conformational landscape of promoter DNA is supported by a number of studies showing a relationship between transcription-related functions and the composition of sequences upstream of the TSS, and by theoretical studies on sequence-dependent conformational energy, see for instance Araúzo-Bravo et al. (2005). If this is true, then the monotone increase in length and numbers of the low complexity sequences we have observed may be relevant for fast, putatively adaptive changes in the levels
of gene expression (see for instance Usdin (2008) for a review of positive and negative effects of repeats and diseases). Probably it is not by chance that longer repeats (Alu, SINES and LINES) have been shown to be a signal of the presence of housekeeping genes (Daniel-Eller et al., 2007). All this supports the relevance of the whole promoter sequence base composition and not only of the presence of motifs, which, in their turn, undergo rapid evolutionary turnover (Dermitzakis et al., 2003). This is confirmed by theoretical and experimental results (Choi et al., 2004) showing that promoter DNA exhibits a sequence-dependent (but not sequence-specific) preference to open at, or near, the TSS, for both a TATA box-binding protein (TBP)-dependent and a TBP-independent transcriptional system. Moreover, the same authors show that while DNA opening dynamics derive from signal reception, temporal melting is independent from localized DNA-protein interactions, thus suggesting that the information leading to DNA activation is given by the global distribution of nucleotides itself. Particularly, the GC-rich regions in pluricellular eukaryotes are rich in genes and especially in the housekeeping, broadly expressed ones. This supports the idea that selective pressures may have favored the concentration of such genes in regions where GC-richness may have provided structural properties facilitating the access to transcription machinery in actively transcribed chromatin. These hypotheses do not imply that transcription activation is the sole cause of selection of GC-rich sequences in eukaryotes because, as suggested by Bernardi, also the environment could be involved, as a selective force, favoring the high thermodynamic stability of GC, particularly in warm blooded animals (as well as in the thermophilic alga C. merolae we have analyzed). 
As far as eukaryote genomes are concerned, Vinogradov (2003) stressed the fact that GC-rich isochores (where both genes and retrotransposons are preferentially located) are more thermostable and have a higher bendability and ability to undergo B–Z transitions, but show a lower level of curvature (Nickerson and Achberger, 1995). B–Z transitions, favored by GC-rich alternating pyrimidine/purine sequences, are indeed correlated with transcription initiation, which in eukaryotes only occurs through the unwrapping and remodeling of nucleosomes, as also discussed by Khuu et al. (2007). Moreover, as reported by Gupta et al. (2008), low-complexity sequences appear to be the hallmark of nucleosome-inhibiting sequences, and nucleosomes are destabilized by even relatively low numbers of AG, AT, GC, GA, GC repeats (even lower than 15), thus favoring transcription. The situation is different in bacteria, where chromosome organization is far less complex than in eukaryotes and where constraints to randomness are much weaker. Instead, DNA flexibility is needed for the complex conformational changes necessary for simultaneous transcription and translation. The positive role of curved DNA in promoter regions has been experimentally verified as regards Escherichia coli RNA polymerase activity (Nickerson and Achberger, 1995) and transcription initiation (Carmona and Magasanik, 1996; Mitchison, 2005). Moreover, it has been shown experimentally that DNA bending and curvature, both favored by AT-rich sequences, may have a role in transcription mechanisms (Pérez-Martín et al., 1994; Suzuki and Yagi, 1995; Jáuregui et al., 2003). This is supported, for instance, by the finding that T7 promoter strength is increased by the presence of upstream AT sequences (Tang et al., 2005).
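The role attributed above to short dinucleotide repeats can be made concrete with a minimal Python sketch that scans a sequence for tandem dinucleotide runs. It is an illustration only, not the procedure used in this work or by Gupta et al. (2008); the function name `dinucleotide_runs` and the `min_units` threshold are hypothetical choices introduced here.

```python
import re
from typing import List, Tuple


def dinucleotide_runs(seq: str, min_units: int = 3) -> List[Tuple[int, str, int]]:
    """Find tandem dinucleotide repeats such as ATATAT or GAGAGA.

    Returns a list of (start_index, repeat_unit, number_of_units).
    Units made of one repeated base (AA, CC, ...) are excluded, so
    homopolymer tracts are not reported. `min_units` is an illustrative
    threshold, not a value taken from the paper.
    """
    if min_units < 1:
        raise ValueError("min_units must be at least 1")
    seq = seq.upper()
    # Group 2 is the first base; (?!\2) forces the second base to differ.
    # Group 1 is the full dinucleotide, which must then recur in tandem.
    pattern = re.compile(r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_units - 1))
    runs = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        n_units = len(match.group(0)) // 2
        runs.append((match.start(), unit, n_units))
    return runs
```

Called on a promoter sequence, the function returns one tuple per repeat run, so the number and length of runs can be compared between sets of sequences. Matches are non-overlapping and scanned left to right, which is a simplification.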
5. Conclusions

The monotone behavior of the phylogenetically different nucleotide gradients observed in the regions upstream of the TSS, and the nature of their composition, either AT- or GC-rich, suggest that nucleotide content is actively maintained and is not only a by-product of neutral effects like mutational bias or biased gene
conversion. Unlike what emerges from our results, nucleotide distributions in coding sequences resemble random behavior (Acquisti et al., 2004; Menconi et al., 2008); indeed, the selective pressures that may act on coding regions are surely different from those concerning non-coding, albeit functional, elements such as promoters. The evidence collected in our work suggests that the differentiation throughout evolution of different compositional promoter structures may derive from a dynamical interaction between even contrasting causes of selection pressures, with varying functional meanings in different parts of the core promoter. Moreover, the clear differences between prokaryote and eukaryote promoters seem to support the existence of positive selection pressures acting differently in the two groups of organisms, coherently with the different energetic and conformational conditions known to be critical for transcription. Finally, besides the environment-led selection of phenotypes, a further selective pressure on mutations could derive from the need for their coherence with the molecular contexts in which they would be inserted. This would confirm the existence of more “regional” modes of context-derived selection (Bernardi, 2007), which may concern regulatory, non-coding regions.
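The monotone GC or AT gradients summarized above can be illustrated with a minimal Python sketch that averages GC content in consecutive windows over a set of equal-length upstream sequences and then checks whether the resulting profile is monotone. This is a sketch under stated assumptions, not the analysis pipeline of this work: the function names `gc_profile` and `is_monotone`, the window size and the tolerance parameter are all hypothetical.

```python
from typing import List, Sequence


def gc_profile(promoters: Sequence[str], window: int = 50) -> List[float]:
    """Mean GC fraction in consecutive, non-overlapping windows.

    All sequences must have the same length and be oriented the same way
    with respect to the TSS. The window size is an illustrative choice.
    """
    if not promoters:
        raise ValueError("need at least one sequence")
    length = len(promoters[0])
    if any(len(s) != length for s in promoters):
        raise ValueError("sequences must have equal length")
    profile = []
    for start in range(0, length - window + 1, window):
        gc = 0
        total = 0
        for seq in promoters:
            chunk = seq[start:start + window].upper()
            gc += chunk.count("G") + chunk.count("C")
            # Count only unambiguous bases, so N and gaps are ignored.
            total += sum(chunk.count(base) for base in "ACGT")
        profile.append(gc / total if total else float("nan"))
    return profile


def is_monotone(values: Sequence[float], tol: float = 0.0) -> bool:
    """True if values never decrease, or never increase, within `tol`."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return all(d >= -tol for d in diffs) or all(d <= tol for d in diffs)
```

For example, `is_monotone(gc_profile(sequences, window=50), tol=0.01)` would flag a set of promoters whose windowed GC content rises or falls steadily toward the TSS, allowing small fluctuations. A window with no unambiguous bases yields NaN and makes the check return False.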
References

Acquisti, C., Allegrini, P., Bogani, P., Buiatti, M., Catanese, E., Fronzoni, L., Grigolini, P., Mersi, G., Palatella, L., 2004. In the search for the low-complexity sequences in prokaryotic and eukaryotic genomes: how to derive a coherent picture from global and local entropy measures. Chaos, Solitons and Fractals 20, 127–137.
Aerts, S., Thijs, G., Dabrowski, M., Moreau, Y., DeMoor, B., 2004. Comprehensive analysis of the base composition around the transcription start site in Metazoa. BMC Genomics 5, 5–34.
Araúzo-Bravo, M., Fujii, S., Kono, H., Ahmad, S., Sarai, A., 2005. Sequence-dependent conformational energy of DNA derived from molecular dynamics simulations: toward understanding the indirect readout mechanism in protein-DNA recognition. Journal of the American Chemical Society 127 (46), 16074–16089.
Audit, B., Thermes, C., Vaillant, C., d’Aubenton Carafa, Y., Muzy, J.F., Arneodo, A., 2001. Long-range correlations in genomic DNA: a signature of the nucleosomal structure. Physical Review Letters 86 (11), 2471–2474.
Bentley, S.D., Parkhill, J., 2004. Comparative genomic structure of prokaryotes. Annual Review of Genetics 38 (1), 771–791.
Bernardi, G., 2007. The neoselectionist theory of genome evolution. Proceedings of the National Academy of Sciences USA 104 (20), 8385–8390.
Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M., Rodier, F., 1985. The mosaic genome of warm-blooded vertebrates. Science 228, 953–958.
Buiatti, M., Buiatti, M., 2004. Towards a statistical characterization of the living state of matter. Chaos, Solitons and Fractals 20, 55–61.
Calladine, C., Drew, H., McCall, M., 1988. Structural basis for DNA bending. Journal of Molecular Biology 201, 127–137.
Carmona, M., Magasanik, B., 1996. Activation of transcription at sigma 54-dependent promoters on linear templates requires intrinsic or induced bending of the DNA. Journal of Molecular Biology 261 (3), 348–356.
Choi, C., Kalosakas, G., Rasmussen, K., Hiromura, M., Bishop, A., Usheva, A., 2004. DNA dynamically directs its own transcription initiation. Nucleic Acids Research 32 (4), 1584–1590.
Costantini, M., Auletta, F., Bernardi, G., 2007. Isochore patterns and gene distributions in fish genomes. Genomics 90, 364–371.
Daniel-Eller, C., Regelson, M., Merriman, B., Nelson, S., Horvath, S., Marahrens, Y., 2007. Repetitive sequences environment distinguishes housekeeping genes. Gene 390, 153–165.
Dermitzakis, E., Bergman, C., Clark, A., 2003. Tracing the evolutionary history of Drosophila regulatory regions with models that identify transcription factor binding sites. Molecular Biology and Evolution 20, 703–714.
Duret, L., Galtier, N., 2009. Biased gene conversion and the evolution of mammalian genomic landscapes. Annual Review of Genomics and Human Genetics 10 (1), 285–311.
Frank, A., Lobry, J., 1999. Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene 238, 65–77.
Frauenfelder, H., McMahon, B., Austin, R., Chu, K., Groves, J., 2001. The role of structure, energy landscape, dynamics and allostery in the enzymatic function of myoglobin. Proceedings of the National Academy of Sciences USA 98, 2370–2374.
Gabrielian, A., Landsman, D., Bolshoy, A., 2000. Curved DNA in promoter sequences. In Silico Biology 1, 183–196.
Green, P., Ewing, B., Miller, W., Thomas, P., Green, E., 2003. Transcription-associated mutational asymmetry in mammalian evolution. Nature Genetics 33, 514–517.
Gupta, S., Dennis, J., Thurman, R., Kingston, R., Stammatopoulos, J., Noble, W.S., 2008. Predicting human nucleosome occupancy from primary sequence. PLoS Computational Biology 4, 1–11.
Hildebrand, F., Meyer, A., Eyre-Walker, A., 2010. Evidence of selection upon genomic GC-content in bacteria. PLoS Genetics 6 (9).
Jáuregui, R., Abreu-Goodger, C., Moreno-Hagelsieb, G., Collado-Vides, J., Merino, E., 2003. Conservation of DNA curvature signals in regulatory regions of prokaryotic genes. Nucleic Acids Research 31 (23), 6770–6777.
Kanhere, A., Bansal, M., 2005. Structural properties of promoters: similarities and differences between prokaryotes and eukaryotes. Nucleic Acids Research 33 (10), 3165–3175.
Kendal, W., Suomela, B., 2005. Large-scale genomic correlations in Arabidopsis thaliana relate to chromosomal structure. BMC Genomics 6 (1), 82.
Khuu, P., Sandor, M., Young, J.D., Ho, P.S., 2007. Phylogenomic analysis of the emergence of GC-rich transcription elements. Proceedings of the National Academy of Sciences 104, 16528–16533.
Kumar, S., Hedges, B., 1998. A molecular timescale for vertebrate evolution. Nature 392, 917–920.
Li, W., 1997. The study of correlation structures of DNA sequences: a critical review. Computers & Chemistry 21 (4), 257–271 (Open Problems of Computational Molecular Biology).
Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B., Sabo, P., Dorschner, M., Sandstrom, R., Bernstein, B., Bender, M., Groudine, M., Gnirke, A., Stamatoyannopoulos, J., Mirny, L., Lander, E., Dekker, J., 2009. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326 (5950), 289–293.
Loots, G., Ovcharenko, I., 2007. ECRbase: database of evolutionary conserved regions and promoters and transcription factor binding sites in vertebrate genomes. Bioinformatics 23 (1), 122–124.
Louie, E., Ott, J., Majewski, J., 2003. Nucleotide frequency variation across human genes. Genome Research 13, 2594–2601.
Ma, H., Samarabandu, J., Devdhar, R., Acharya, R., Cheng, P., Meng, C., Berezney, R., 1998. Spatial and temporal dynamics of DNA replication sites in mammalian cells. The Journal of Cell Biology 143 (6), 1415–1425.
Matsuzaki, M. et al., 2004. Genome sequence of the ultrasmall unicellular red alga Cyanidioschyzon merolae 10D. Nature 428, 653–657.
Menconi, G., Benci, V., Buiatti, M., 2008. Data compression and genomes: a two-dimensional life domain map. Journal of Theoretical Biology 253, 281–288.
Mitchison, G., 2005. The regional rule for bacterial base composition. Trends in Genetics 21 (8), 440–443.
Musto, H., Naya, H., Zavala, A., Romero, H., Alvarez-Valin, F., Bernardi, G., 2004. Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Letters 573, 73–77.
Muto, A., Osawa, S., 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proceedings of the National Academy of Sciences USA 84, 166–169.
Nickerson, C., Achberger, E., 1995. Role of curved DNA in binding of Escherichia coli RNA polymerase to promoters. The Journal of Bacteriology 177 (20), 5756–5761.
Pérez-Martín, J., Rojo, F., de Lorenzo, V., 1994. Promoters responsive to DNA bending: a common theme in prokaryotic gene expression. Microbiology and Molecular Biology Reviews 58 (2), 268–290.
Suzuki, M., Yagi, N., 1995. Stereochemical basis of DNA bending by transcription factors. Nucleic Acids Research 23 (12), 2083–2091.
Tang, G.-Q., Bandwar, R.P., Patel, S.S., 2005. Extended upstream A–T sequence increases T7 promoter strength. The Journal of Biological Chemistry 280 (49), 40707–40713.
Trifonov, E., 1989. The multiple codes of nucleotide sequences. Bulletin of Mathematical Biology 51, 417–432.
Usdin, K., 2008. The biological effect of simple tandem repeat: lessons from expansion diseases. Genome Research 18, 1011–1019.
Vinogradov, A., 2003. DNA helix: the importance of being GC-rich. Nucleic Acids Research 31, 1838–1844.
Washio, S.F.Y., Tomita, M., 2005. GC-compositional strand bias around transcription start sites in plants and fungi. BMC Genomics 6 (1), 26.
Wu, M., Eisen, J., 2008. A simple, fast, and accurate method of phylogenomic inference. Genome Biology 9, R151.
Yamashita, R., Suzuki, Y., Wakaguri, H., Tsuritani, K., Nakai, K., Sugano, S., 2006. DBTSS: database of human transcription start sites and progress report 2006. Nucleic Acids Research 34, 86–89.