“Anticipated” nucleosome positioning pattern in prokaryotes


Gene 488 (2011) 41–45

Contents lists available at SciVerse ScienceDirect

Gene journal homepage: www.elsevier.com/locate/gene

Alexandra E. Rapoport a, Edward N. Trifonov a,b,⁎

a Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel
b Department of Functional Genomics and Proteomics, Faculty of Science, Masaryk University, Kotlarska 2, CZ-61137 Brno, Czech Republic

Article info

Article history:
Accepted 3 August 2011
Available online 23 August 2011
Received by Takashi Gojobori

Keywords:
Amphipathic alpha helices
Evolution of nucleosome positioning pattern
Hidden sequence patterns
Sequence periodicity
Shannon N-gram extension

Abstract

Linguistic (word count) analysis of prokaryotic genome sequences by Shannon N-gram extension reveals that the dominant hidden motifs in A + T rich genomes are T(A)(T)A and G(A)(T)C, with an uncertain number of repeating A and T. Since prokaryotic sequences are largely protein-coding, the motifs would correspond to amphipathic alpha-helices with alternating lysine and phenylalanine as the preferential polar and non-polar residues. The motifs are also known in eukaryotes, as nucleosome positioning patterns. Their existence in prokaryotes as well may serve for binding of histone-like proteins to DNA. In this case the above patterns in prokaryotes may be considered as “anticipated” nucleosome positioning patterns which, quite likely, existed in prokaryotic genomes before the evolutionary separation between eukaryotes and prokaryotes.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Although the translation (triplet) code is commonly considered the major informational load carried by genomic sequences, decades of sequence research have made it obvious that the sequences contain other messages (codes) as well (Trifonov, 1980, 1989). This is especially clear in the case of eukaryotes, where the protein-coding sequences comprise only a few percent of the genome. One of the first additional codes that drew the attention of the sequence research community was, indeed, the eukaryotic chromatin code (Trifonov, 1980; Trifonov and Sussman, 1980), the most recent version of which is described in Gabdank et al. (2009, 2010) and Trifonov (2010, 2011). The latest work boils down to the conclusion that all other nucleosome positioning patterns suggested over three decades by various authors match the universal pattern YRRRRRYYYYYR, of which GAAAATTTTC and AAAAATTTTT are the most representative of generally A + T rich eukaryotic sequences.

How can one evaluate the importance of one or another code among over a dozen known codes operating at the DNA, RNA and protein levels (Trifonov, 1989)? Since every code is associated with its characteristic sequence vocabulary (Brendel et al., 1986), one simple measure of the relative contribution of any given code to the overall sequence vocabulary is its weight share in the vocabulary. One thus comes to the following question: Which of the codes, known or unknown, are represented by the most frequent oligonucleotide words of a given genome sequence?

In this study we intend to provide answers by exploring the oligonucleotide frequencies of prokaryotic genomes. The analysis leads to rather unexpected conclusions: the major linguistically dominant code in prokaryotes is the code for amphipathic alpha-helices of proteins. Surprisingly, it is very similar to the nucleosome positioning code of eukaryotes which, in its turn, is the major code of eukaryotic genomes (Rapoport et al., 2011).

Abbreviation: aa, amino acid.
⁎ Corresponding author at: Genome Diversity Center, Institute of Evolution, University of Haifa, Mount Carmel, 31905 Haifa, Israel. Tel.: +972 4 828 8096. E-mail address: [email protected] (E.N. Trifonov).
0378-1119/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2011.08.002

2. Methods

2.1. Genomes

For the analysis, complete genomes of 14 prokaryotes were downloaded from the ftp site of NCBI: A. aeolicus, B. subtilis, B. japonicum, C. pneumoniae, C. violaceum, D. ethenogenes, F. nucleatum, G. violaceus, M. acetivorans, N. equitans, S. solfataricus, T. denticola, T. maritima, and T. thermophilus. To minimize possible biases in the analysis, the genomes have to be as different as possible. The indicated genomes are available representatives of 14 different phyla, from both Bacteria and Archaea.

2.2. N-gram extension algorithm

The N-gram extension algorithm is based on Shannon's N-gram extension (Shannon, 1948). The pattern generation was done on the basis of oligonucleotide frequencies for the respective genomes (see Supplement), as described in the text and illustrated by Fig. 1. New trinucleotides are added iteratively until some substring starts to repeat.


A
     triplet   score
1.   TTT       322022
2.   AAA       319190
3.   AAT       203238
4.   ATT       202148
5.   TTC       200524
6.   GAA       197938

B
[G(A)(T)]G(A)(T)C[(A)(T)C]

C
...(GAAATTT)GAAATTTC(AAATTTC)...
...(GAAAATTTT)GAAAATTTTC(AAAATTTTC)...

Fig. 1. Triplet extension for the genome sequence of B. subtilis. A: The most frequent trinucleotides (see Supplement for the whole table of trinucleotides). B: The general description of the extension motif. Here the parentheses and brackets mean repetition of the enclosed sequences or bases an uncertain number of times (N ≥ 2 for A and T in our case). C: Two particular examples of the extension.

In the case of the homo-trinucleotides AAA and TTT, the extension generates A_n and T_n, i.e., (A) and (T) of indefinite length. Such an extension can be interrupted by connecting to the highest-ranking BAA and AAB triplets (B for non-A) and to the highest-ranking VTT and TTV triplets (V for non-T), respectively. In Fig. 1 those are GAA, AAT, ATT and TTC. The typical output of the extension procedure is a string of various repeats, like [G(A)(T)]G(A)(T)C[(A)(T)C], where parentheses enclose substrings that repeat an uncertain number of times. That is, [G(A)(T)] means ...GAA...AATT...TTGAA...AATT...TTGAA... The extension patterns can also be generated by using dinucleotides or tetranucleotides, with essentially the same results (Rapoport et al., 2011).
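The extension procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the greedy rule (always take the most frequent triplet that overlaps the current end by two bases), the skipping of the homo-trinucleotide self-loop, and the stop-on-reuse criterion are assumptions made for illustration, and the function names `count_kmers`, `extend` and `generalize` are hypothetical. The counts used in the example are the six B. subtilis values listed in Fig. 1A.

```python
import re
from collections import Counter


def count_kmers(seq, k=3):
    """Count overlapping k-mers in a DNA string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def extend(start, counts):
    """Greedy triplet extension to the right, then to the left (assumed rule).

    At each step the most frequent triplet sharing two bases with the
    current end is appended. A homo-trinucleotide continuing into itself
    (AAA -> AAA) is skipped, since it would only lengthen the run.
    Extension on a side stops when no candidate exists or when the best
    candidate has already been used (the pattern starts to repeat).
    """
    motif = start
    used = {start}
    for side in ("right", "left"):
        current = start
        while True:
            overlap = current[1:] if side == "right" else current[:2]
            candidates = []
            for base in "ACGT":
                cand = overlap + base if side == "right" else base + overlap
                if cand != current and counts.get(cand, 0) > 0:
                    candidates.append((counts[cand], cand))
            if not candidates:
                break
            best = max(candidates)[1]
            if best in used:
                break
            used.add(best)
            motif = motif + best[-1] if side == "right" else best[0] + motif
            current = best
    return motif


def generalize(motif):
    """Collapse runs of two or more A or T into (A) and (T)."""
    motif = re.sub(r"A{2,}", "(A)", motif)
    return re.sub(r"T{2,}", "(T)", motif)


# The six most frequent B. subtilis trinucleotides from Fig. 1A.
fig1a = {"TTT": 322022, "AAA": 319190, "AAT": 203238,
         "ATT": 202148, "TTC": 200524, "GAA": 197938}
fusion = extend("AAA", fig1a)
print(fusion, generalize(fusion))
```

With only these six triplets the sketch cannot reproduce the flanking repeats in brackets; those would require the full trinucleotide table from the Supplement, which `count_kmers` could produce from a genome sequence.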

2.3. Positional auto(cross)correlation analysis

The distance distribution analysis was performed as in Trifonov and Sussman (1980). That is, all distances between the respective elements in a given sequence were tabulated and plotted in the form of a histogram. The presence of periodic peaks in the histogram corresponds to sequence periodicity.

3. Results and discussion

3.1. Dominant prokaryotic oligonucleotides and hidden common motif

In this work we analyzed the oligonucleotide compositions of 14 taxonomically diverse eubacterial and archaeal genomes (see Methods). Essentially, the topmost trinucleotides were selected in each case, and we checked the hypothesis that they could all belong to one common longer sequence motif. Many codes contained in the sequences may each be described by an ideal (consensus) sequence motif. Known examples are the chromatin code (Gabdank et al., 2009), the gene splicing code (Shapiro and Senapathy, 1987) and binding sites for various transcription factors (e.g., Heinemeyer et al., 1998). Actual recognition (binding) sites corresponding to the codes only rarely appear in the form of the full-length consensus sequences (with the exception of restriction nucleases). Rather, only short oligonucleotides of the ideal sequences are present in the recognition sites. As the “word” count analysis of the complete genomes shows, there are some especially frequent oligonucleotides in the complete repertoires of the “words”. These may correspond to those codes (consensuses) that are in predominant use. The ideal sequence motifs corresponding to the dominant code(s) may be reconstructed by overlapping (extension) of the most frequent oligonucleotides, as described below (Rapoport et al., 2011). This technique is known as N-gram extension (Shannon, 1948). In most cases such extension indeed results in some dominating motif, with a remarkable prevalence of one distinct group of words.

As a typical example of such reconstruction, Fig. 1A shows the most frequent trinucleotides of the genomic sequence of B. subtilis, and Fig. 1B the respective extended motif in general form. The particular fusion sequences GAAATTTC and GAAAATTTTC (Fig. 1C) incorporate all the respective highest-frequency oligonucleotides. While the component trinucleotides of these sequences are very frequent, the longer oligonucleotides are not as frequent (data not shown), so that, for example, the complete motif GAAAATTTTC appears only 17 times in the genome (compared with hundreds of thousands of occurrences of the component trinucleotides in Fig. 1A). In that very sense the fusion motif can be considered a hidden dominant motif that is revealed only when the most frequent component words of the motif are put together. Not only the trinucleotides but also their appropriate combinations represent the hidden dominant motif, like the sequences GAAxxTTTxx or xAAAxxxTTC. All such sequences taken together suggest the consensus GAAAATTTTC, which can be generated by simple Shannon N-gram extension.

Table 1 shows the fusion motifs calculated for all 14 representative prokaryotic genomes. In most cases, in the prevailing A + T rich species, motifs of the structure T(A)(T)A and G(A)(T)C appear within the central parts of the extensions, either entirely (as in F. nucleatum, T. denticola, M. acetivorans, and B. subtilis) or partially. For example, in S. solfataricus only the T(A)T part is generated by the N-gram extension. Similarly, in A. aeolicus G(A) and (T)C are generated separately by extensions from AAA and TTT. In this case the two separate motifs do not fuse into one continuous pattern. These motifs, however, are parts of the consensuses T(A)(T)A or G(A)(T)C. As above, the abbreviations (A) and (T) stand for an uncertain number of repeating A and T bases. We conclude, thus, that T(A)(T)A and G(A)(T)C are the dominant extension motifs of A + T rich prokaryotes. The four G + C rich genomes all generate different, GC-centered motifs.

Table 1
Triplet extension patterns for 14 prokaryotic genomes. The sequences which conform to the dominant motif are in capital bold.

Species                   G+C content, %   Extension motif               Starting triplet
F. nucleatum              27.2             [t(a)]T(A)(T)A[t(a)]          TTT
N. equitans (Arch)        31.6             (ta)T(A)T(at)                 AAA
-“-                                        (ta)A(T)A(ta)                 TTT
S. solfataricus (Arch)    35.8             [a(t)]a(t)T(A)T[(a)t]         TTT
T. denticola              37.9             [t(a)]T(A)(T)A[(t)a]          TTT
C. pneumoniae             40.0             [g(a)]G(A)[g(a)]              AAA
-“-                                        [(t)c](T)C[(t)c]              TTT
M. acetivorans (Arch)     42.7             [g(a)]G(A)(T)C[(t)c]          TTT
A. aeolicus               43.3             [gg(a)]gG(A)[gg(a)]           AAA
-“-                                        [(t)cc](T)Cc[(t)cc]           TTT
B. subtilis               43.5             [g(a)(t)]G(A)(T)C[(a)(t)c]    TTT
T. maritima               46.2             [g(a)]G(A)[g(a)]              GAA
-“-                                        [(t)c](T)Ctt[ctt]             TTC
D. ethenogenes            48.9             (cggc)cggc(T)Cagccg(gccg)     TTT
-“-                                        [ta]T(A)gccg[gccg]            AAA
Dominant motifs                            T(A)(T)A, G(A)(T)C
G. violaceus              62.0             (ggc)ggcc(gcc)                GCC
B. japonicum              64.1             (gc)gc(gc)                    GCG
C. violaceum              64.8             (gc)gc(gc)                    GCG
T. thermophilus (Arch)    69.5             [a(g)]a(g)(c)t[(c)t]          CCC

3.2. The motifs T(A)(T)A and G(A)(T)C are periodic

Since the major part of a prokaryotic genome is occupied by protein-coding sequences, the dominant nucleotide sequence pattern may


represent a consensus of a certain protein sequence motif or, rather, several motifs, for different reading starts. Among the amino acids present in the motifs would obviously be lysine (K, encoded by AAA) and phenylalanine (F, encoded by TTT and TTC), but also glutamic acid (E, GAA), asparagine (N, AAT), leucine (L, TTA) and isoleucine (I, ATT), that is, both polar and non-polar residues, with K and F dominating. The positional autocorrelations of amino acids encoded by NRN codons (R = A, G), which are almost all polar, as well as of amino acids encoded by NYN codons (Y = C, T), primarily non-polar, are known to display a distinct 3.5 aa periodicity (Trifonov et al., 2001). This suggests that the patterns T(A)(T)A and G(A)(T)C may also be of a periodical nature. The repeat of 3.5 amino acid residues corresponds to 3.5 × 3 = 10.5 bases. The repeated dominant motifs T(A)(T)A and G(A)(T)C should have the base A, as well as T, repeated 4–5 times to fit the nearest integers, 10 and 11, of the 10.5 base period. Taking for definiteness the repeats AAAAATTTTT and GAAAATTTTC of length 10, one ends up with the (consensus) periodical patterns (AAAAATTTTT)n and (GAAAATTTTC)n. In reality, since the complete patterns are rather infrequent, only the periodicity of their component trinucleotides could actually be observed.

In Fig. 2 three repeats of the nucleotide sequence patterns are shown together with their translated amino acid sequence versions. The full expected spectrum of the amino acids encoded by the repetitions, all reading starts considered, would be F (six times), K (five times), and E, N, R, S, L and I (once or twice each). K, E, N, S and R make up a polar group (S is weakly polar; underlined in Fig. 2), while I, L and F are non-polar residues. Each of these amino acids, especially F and K, recurs at distances of ~3.5 × n residues, that is, displays the 3.5 residue periodicity.

The non-polar residues L, I, V, F and M, as a group, have been shown to be periodical, in counter-phase to the polar group E, K, D, R, Q (Weiss and Herzel, 1998). It was later also observed that the similar groups F, L, I, V and D, R, E, K are in counter-phase to each other (Cohanim, 2007). We singled out the dominating residues F and K (encoded by the TTT and AAA components of the AAAAATTTTT and GAAAATTTTC patterns) and checked whether they would individually display the same counter-phase behavior. As Figs. 3 and 4 demonstrate, this is indeed the case. In Fig. 3 the preferential distances between K residues are 4, 7, 11, 14, the nearest integers to the periodical series 3.5, 7, 10.5, 14,… The same is observed for F. Similar calculations for the E, N, R, S, L and I residues (data not shown) demonstrate the periodicity as well. Fig. 4 displays the counter-phase behavior of the oscillating K and F residues. The higher-occurrence distances 2, 5 and 9 are close to the values of the 3.5 × (0.5 + n) series (1.75, 5.25, 8.75,…). Thus, the behavior of the polar and non-polar residues is in full accordance with the known property of amphipathic alpha-helices: preferential positioning of polar residues on one surface of the helix, while non-polar residues are placed on the opposite surface (Epand, 1993). The repetitions (AAAAATTTTT)n and (GAAAATTTTC)n would represent ideal amphipathic alpha-helices.

One has to bear in mind that formation of the above motifs in the Shannon N-gram extension procedure does not mean that the motifs as such, in their entirety, are frequent (see above). What is frequent is their components, the shorter oligonucleotides. Therefore, the observed 3.5 aa periodicity is actually a hidden periodicity, expressed in higher occurrences of distances between the residues that are multiples of 3.5.


Fig. 3. Positional autocorrelation analysis of F and K. The curves represent the respective total distance distributions in 9 prokaryotic proteomes (F. nucleatum, N. equitans, S. solfataricus, T. denticola, C. pneumoniae, M. acetivorans, A. aeolicus, B. subtilis and T. maritima).
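The kind of distance tabulation behind Figs. 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `distance_distribution` and the `max_dist` cutoff are hypothetical, and the toy sequence is the idealized peptide of Fig. 2 (ENFRKFSKIF) repeated three times rather than a real proteome, so its peaks reflect an exact 10-residue repeat rather than the statistical 3.5 aa periodicity of natural proteins.

```python
def distance_distribution(seq, first, second, max_dist=30):
    """Tabulate distances between residues in one sequence.

    hist[d] is the number of pairs (i, j) with j - i == d,
    seq[i] == first and seq[j] == second, for d = 1..max_dist.
    With first == second this is a positional autocorrelation;
    otherwise it is a (one-directional) cross-correlation.
    """
    hist = [0] * (max_dist + 1)
    for i, residue in enumerate(seq):
        if residue != first:
            continue
        for d in range(1, min(max_dist, len(seq) - 1 - i) + 1):
            if seq[i + d] == second:
                hist[d] += 1
    return hist


# Toy input: three repeats of the Fig. 2 translation of (GAAAATTTTC)n.
toy = "ENFRKFSKIF" * 3

# K-K autocorrelation (the analogue of one curve in Fig. 3).
kk = distance_distribution(toy, "K", "K", max_dist=14)

# F/K cross-correlation, summing both orders (the analogue of Fig. 4).
fk = [a + b for a, b in zip(distance_distribution(toy, "F", "K", 14),
                            distance_distribution(toy, "K", "F", 14))]

for d in range(1, 15):
    print(d, kk[d], fk[d])
```

For real data the histograms of all proteins in a proteome would be summed, as the figure legends describe; the symmetric treatment of F-to-K and K-to-F distances here is an assumption, since the text does not state which convention was used.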

Thus, the periodicity of polar and non-polar residues in proteins, demonstrated in earlier work, as well as in this study, is in accordance with periodical repetition of the hidden AAAAATTTTT and GAAAATTTTC nucleotide sequence patterns, discovered in this work by Shannon N-gram extension.

...GAAAATTTTCGAAAATTTTCGAAAATTTTC...
... E  N  F  R  K  F  S  K  I  F ...

...AAAAATTTTTAAAAATTTTTAAAAATTTTT...
... K  N  F  -  K  F  L  K  I  F ...

Fig. 2. Translation of motifs (AAAAATTTTT)n and (GAAAATTTTC)n. Polar residues are underlined.
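The translations in Fig. 2 can be checked with a short Python sketch. It uses the standard genetic code written in the conventional compact form (bases ordered T, C, A, G); the names `CODON_TABLE` and `translate` are chosen here for illustration, and stop codons are written as `*`, where Fig. 2 uses `-`.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon, starting at offset `frame`."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Three repeats of each decamer, as in Fig. 2, in all three reading starts.
for motif in ("GAAAATTTTC", "AAAAATTTTT"):
    for frame in range(3):
        print(motif, frame, translate(motif * 3, frame))
```

Because three decamers span exactly ten codons, the other two reading starts of the repeated motif give cyclic shifts of the same peptide, which is why the expected amino acid spectrum in the text is the same for all reading starts.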

Fig. 4. F/K distance distribution. The curve represents the sum of the distributions of distances between F and K residues in 9 prokaryotic proteomes (see legend to Fig. 3).


3.3. The nucleosome positioning pattern, apparently, developed before eukaryotes and prokaryotes separated

The periodical patterns (AAAAATTTTT)n and (GAAAATTTTC)n are indistinguishable from the nucleosome positioning pattern in eukaryotes determined earlier by three different methods (Gabdank et al., 2009; Rapoport et al., 2011; Trifonov, 2010). The pattern is not as strong in prokaryotes, but the identity of the patterns is a very puzzling observation, since nucleosomes are not present in the representatives of the prokaryotic phyla analyzed. Why should the pattern that has developed as dominant in prokaryotes be similar, even identical, to the dominant nucleosome pattern of eukaryotes? The periodical amphipathic pattern is characteristic of prokaryotic genomes (Cohanim, 2007; Weiss and Herzel, 1998), but it does not require the domination of phenylalanine and lysine that leads to the motifs. There are eleven other polar residues and seven other non-polar ones with which to build a variety of amphipathic helices.

One possible answer is a coevolution of two patterns: the amphipathic pattern and some other pattern that opportunistically utilized the 10–11 base periodicity imposed on the prokaryotic DNA by the amphipathic helices. There are two phenomena rooted in such DNA sequence periodicity: intrinsic DNA curvature (Bolshoy et al., 1991) and nucleosome positioning (Trifonov, 2011). They are principally different, since in the former case the DNA molecule has a curvilinear shape without application of any external force, while in the latter case an active deformation is involved. Accordingly, they are described by different sequence rules, though both possess nearly the same periodicity (Bolshoy et al., 1991; Gabdank et al., 2009; Rapoport et al., 2011; Trifonov, 2010, 2011).

In prokaryotes the 10–11 base periodicity of the DNA sequence is required for DNA curvature, which rather takes the form of superhelicity with different sequence periods: 11 bases in Eubacteria and 10 bases in Archaea (Herzel et al., 1998, 1999). Do the sequences AAAAATTTTT and GAAAATTTTC represent the curvature (writhe) of DNA? Incidentally, a significant electrophoretic anomaly of DNA with the concatenated GAAAATTTTC decamer has been observed (Hagerman, 1986), indicating rather strong curvature of this DNA. If the motifs did indeed emerge in evolution under two pressures (amphipathic helices and curvature), then, since the two corresponding codes (messages) have different, a priori unrelated vocabularies of short words, the resulting fusion motif should be a compromise, serving both well but not necessarily being the best for either one of the messages. Indeed, the best for amphipathicity would probably be the use of all polar and non-polar amino acids, rather than preferentially phenylalanine and lysine. On the other hand, the motifs AAAAATTTTT and GAAAATTTTC have only poor similarity to the motif AAAAAACGCG, associated with the highest DNA curvature observed so far (Bolshoy et al., 1991; Koo and Crothers, 1988).

Another known 10–11 base periodical motif is the nucleosome DNA positioning pattern. Since Eubacteria lack nucleosomes, the nucleosome positioning pattern would be of no need, and its emergence could only be coincidental. Some of the Archaea are documented to have small nucleosome-like structures containing histone-like proteins (Pereira et al., 1997), but not those analyzed in this work. Positively charged histone-like proteins are contained in eubacterial cells as well. These have been the subject of intensive studies during the last decades (Drlica and Rouviere-Yaniv, 1987; Rouviere-Yaniv and Gros, 1975).
Among these the best known is the HU protein, one of the most abundant proteins associated with the bacterial nucleoid and involved in DNA recombination and repair (Kamashev and Rouviere-Yaniv, 2000). Another example is the H–NS protein. This regulatory protein binds to DNA and cooperatively forms a nucleoprotein structure responsible for silencing of specific genes (Bouffartigues et al., 2007). The logo of its binding site contains the sequence AAATT (Lang et al., 2007), a match to the nucleosome-specific motifs. The nucleoid protein Fis binds to the AAAWTTT sequence (Hengen et al., 1997). Thus, although the prokaryotic species analyzed do not have their DNA folded in nucleosomes, the sequence motif responsible for binding of the histone-like proteins to DNA could have developed and, apparently, did. It now coexists with the amphipathicity and curvature patterns. As a result, the hidden nucleosome-specific eukaryotic motifs AAAAATTTTT and GAAAATTTTC appear as dominant in prokaryotes as well.

Development of the nucleosome positioning pattern in evolution apparently started before the separation of prokaryotes and eukaryotes, thus anticipating the formation of eukaryotic chromatin structure. This is, perhaps, not as surprising as it may appear, considering that the respective structural periods are numerically almost identical: the amphipathic periodicity of 10.5 bases (3.5 × 3), the DNA superhelicity periods of 10 bases in Archaea and 11 bases in Eubacteria, and the 10.4 base nucleosome DNA periodicity. The possible role of alpha helices in the observed sequence periodicity in eukaryotes was first pointed out by Zhurkin (1981). The above evolutionary connection could not be made at that time, since the periodicity of “non-coding” intervening and intergenic sequences (Rapoport et al., 2011) had not yet been documented.

4. Conclusions

The N-gram extension analysis applied to prokaryotic genomes reveals the hidden dominant patterns AAAAATTTTT and GAAAATTTTC, characteristic of A + T rich genomes. These patterns correspond well to amphipathic alpha-helices encoded in the sequences. The pattern is similar, almost identical, to the nucleosome positioning pattern in eukaryotes and may also correspond to binding sites of histone-like prokaryotic proteins. It could thus be considered an “anticipation” of the eukaryotic nucleosome pattern, before the separation of eukaryotes and prokaryotes in evolution.
Acknowledgments

The work has been supported by grant 222/09 of the Israel Science Foundation, and by a Fellowship of SoMoPro (South Moravian Program, Czech Republic) with financial contribution of the European Union within the 7th framework program (FP/2007–2013, grant agreement no. 229603). The authors are grateful to Z. M. Frenkel for numerous discussions, and to anonymous reviewers for thoughtful comments.

Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.gene.2011.08.002.

References

Bolshoy, A., McNamara, P., Harrington, R.E., Trifonov, E.N., 1991. Curved DNA without AA: experimental estimation of all 16 wedge angles. Proc. Natl. Acad. Sci. U. S. A. 88, 2312–2316.
Bouffartigues, E., Buckle, M., Badaut, C., Travers, A., Rimsky, S., 2007. H–NS cooperative binding to high-affinity sites in a regulatory element results in transcriptional silencing. Nat. Struct. Mol. Biol. 14, 441–448.
Brendel, V., Beckmann, J.S., Trifonov, E.N., 1986. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J. Biomol. Struct. Dyn. 4, 11–21.
Cohanim, A.B., 2007. Ph.D. Thesis. Haifa Technical University.
Drlica, K., Rouviere-Yaniv, J., 1987. Histone-like proteins of bacteria. Microbiol. Rev. 51, 301–319.
Epand, R., 1993. The Amphipathic Helix. CRC Press, Boca Raton, FL.
Gabdank, I., Barash, D., Trifonov, E.N., 2009. Nucleosome DNA bendability matrix (C. elegans). J. Biomol. Struct. Dyn. 26, 403–412.
Gabdank, I., Barash, D., Trifonov, E.N., 2010. Single-base resolution nucleosome mapping on DNA sequences. J. Biomol. Struct. Dyn. 28, 107–121.
Hagerman, P.J., 1986. Sequence-directed curvature of DNA. Nature 321, 449–450.
Heinemeyer, T., et al., 1998. Databases on transcriptional regulation: TRANSFAC, TRRD, and COMPEL. Nucleic Acids Res. 26, 364–370.
Hengen, P.N., Bartram, S.L., Stewart, L.E., Schneider, T.D., 1997. Information analysis of Fis binding sites. Nucleic Acids Res. 25, 4994–5002.
Herzel, H., Trifonov, E.N., Weiss, O., Grosse, I., 1998. Interpreting correlations in biosequences. Physica A 249, 449–459.
Herzel, H., Weiss, O., Trifonov, E.N., 1999. 10–11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics 15, 187–193.
Kamashev, D., Rouviere-Yaniv, J., 2000. The histone-like protein HU binds specifically to DNA recombination and repair intermediates. EMBO J. 19, 6527–6535.
Koo, H.S., Crothers, D.M., 1988. Calibration of DNA curvature and a unified description of sequence-directed bending. Proc. Natl. Acad. Sci. U. S. A. 85, 1763–1767.
Lang, B., et al., 2007. High-affinity DNA binding sites for H–NS provide a molecular basis for selective silencing within proteobacterial genomes. Nucleic Acids Res. 35, 6330–6337.
Pereira, S.L., Grayling, R.A., Lurz, R., Reeve, J.N., 1997. Archaeal nucleosomes. Proc. Natl. Acad. Sci. U. S. A. 94, 12633–12637.
Rapoport, A.E., Frenkel, Z.M., Trifonov, E.N., 2011. Nucleosome positioning pattern derived from oligonucleotide compositions of genomic sequences. J. Biomol. Struct. Dyn. 28, 567–574.
Rouviere-Yaniv, J., Gros, F., 1975. Characterization of a novel, low-molecular-weight DNA-binding protein from Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 72, 3428–3432.
Shannon, C., 1948. A mathematical theory of communication. AT&T Tech. J. 27, 379–423.
Shapiro, M.B., Senapathy, P., 1987. RNA splice junctions of different classes of eukaryotes: sequence statistics and functional implications in gene expression. Nucleic Acids Res. 15, 7155–7174.
Trifonov, E.N., 1980. Sequence-dependent deformational anisotropy of chromatin DNA. Nucleic Acids Res. 8, 4041–4053.
Trifonov, E.N., 1989. The multiple codes of nucleotide sequences. Bull. Math. Biol. 51, 417–432.
Trifonov, E.N., 2010. Base pair stacking in nucleosome DNA and bendability sequence pattern. J. Theor. Biol. 263, 337–339.
Trifonov, E.N., 2011. Cracking the chromatin code: precise rule of nucleosome positioning. Phys. Life Rev. 8, 39–50.
Trifonov, E.N., Sussman, J.L., 1980. The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc. Natl. Acad. Sci. U. S. A. 77, 3816–3820.
Trifonov, E.N., Kirzhner, A., Kirzhner, V.M., Berezovsky, I.N., 2001. Distinct stages of protein evolution as suggested by protein sequence analysis. J. Mol. Evol. 53, 394–401.
Weiss, O., Herzel, H., 1998. Correlations in protein sequences and property codes. J. Theor. Biol. 190, 341–353.
Zhurkin, V.B., 1981. Periodicity in DNA primary structure is defined by secondary structure of the coded protein. Nucleic Acids Res. 9, 1963–1971.