Codes of Nucleotide Sequences
E. N. TRIFONOV
Department of Polymer Research, The Weizmann Institute of Science, Rehovot 76100, Israel
Received 1 June 1987; revised 21 November 1987
ABSTRACT
Nucleotide sequences have many properties of a language. This analogy, when developed to its fullest, results in an interesting linguistic description of the nucleotide sequences. Several a priori features of this language (called “Gnomic”) are discussed, based on its molecular nature. Gnomic appears to be a multicode language, with overlapping degenerate messages, each one encoding physically different specific interactions (protein-DNA, protein-RNA, protein-protein, and RNA-RNA). Several codes of the Gnomic language are discussed: the RNA-to-protein translation (triplet) code; the chromatin code responsible for DNA folding in chromatin; the framing code which secures correct triplet counting during translation; and, tentatively, the RNA loop code, presumably responsible for the formation of RNA loops with specific recognition sequences. The last code is aperiodic and involves mirror-symmetrical sequence elements, while the other codes are based on the sequence periodicities. A general technique of detection of the internally correlated texts is discussed, based on the context contrast of words in continuous (no blanks) strings.
MATHEMATICAL BIOSCIENCES 90:507-517 (1988)
©Elsevier Science Publishing Co., Inc., 1988, 52 Vanderbilt Ave., New York, NY 10017. 0025-5564/88/$03.50

INTRODUCTION
Nucleotide sequences are frequently compared to a language [10, 20, 35, 12, 16], a metaphor which can be traced as far back as 1892, when J. F. Miescher, the discoverer of DNA, observed that large molecules built of a few similar, but not identical, small chemical units could carry the hereditary information “just as words and concepts of all languages can find expression in twenty-four to thirty letters of the alphabet” [17]. Today one frequently finds whole pages and entire books filled with the four-letter-alphabet texts known as nucleotide sequences [22]. During the last two decades, a vast amount of research has been devoted to the analysis of nucleotide sequences and to the peculiarities of nucleotide distributions. However, V. Brendel and H. G. Busse [5] presented the first paper which treated the nucleotide sequences as a (formal) language. In our recent attempt to develop the language analogy to its fullest [30, 6, 3], we found very interesting and widely diversified linguistic characteristics within the nucleotide sequences. The analogy became so striking that we believe that nucleotide sequences should be considered and treated as a language, and we have suggested that the name for this language be Gnomic (from the Greek gnome: maxim, aphorism). Thus, the term “Gnomic” is suggested for use instead of “nucleotide sequences” whenever their linguistic properties are considered [30].

This paper is an attempt at a systematic introduction to Gnomic language. Some general considerations on the structure (morphology) of Gnomic are discussed, as well as its distinct features and similarities to languages proper. Gnomic language appears to be a multicode language, and several established, as well as a few emerging, codes of Gnomic are discussed as representatives of a large spectrum of codes hidden in the Gnomic texts.

1. GNOMIC: A BIOLOGICAL LANGUAGE FOR COMMUNICATION BETWEEN MACROMOLECULES
There are several properties of Gnomic language which can be formulated a priori, even before any linguistic analysis of the Gnomic texts. First, this is a language which various macromolecules in the cell utilize to communicate. All major processes of gene expression (DNA replication, DNA-to-RNA transcription, RNA-to-protein translation, RNA-to-DNA reverse transcription) involve multiple specific interactions between DNA and proteins, between RNA and proteins, and between proteins and proteins, as well as RNA-DNA and RNA-RNA interactions at both inter- and intramolecular levels. “Specific” is used here to include “sequence-specific.” In other words, certain parts of the interacting molecules come into contact, and thus only certain sequences of bases, of amino acids, or possibly of sugar residues in polysaccharides are involved in any given specific interaction. Such contacts can, of course, be considered as elementary acts of the sequence-mediated communication between biological macromolecules. The sequences involved could then be considered as the “words” or “phrases” that the interacting molecules convey to one another. Simple as it is, this view lends itself to outlining some general properties of Gnomic language, very much in accord with what is discussed in the following a posteriori sections.

(1) Gnomic is expected to contain contiguous strings of bases (or amino acids, in the translated version of Gnomic) involved in the sequence-specific contacts. The size of these single contact words could range from one to about fifty bases for protein-nucleic acid interactions, being limited by the physical size of the protein bound [see Figure 1(a)].

(2) The single contact words which are involved in the same macromolecular complex would make a multicontact message consisting of several contiguous words separated by sequence spacers, largely irrelevant in the context of a given specific multicontact interaction. That is, Gnomic is expected to contain interspersed messages built of several contiguous words physically separated by other messages [Figure 1(b)].

(3) Since the primary nucleotide sequence of DNA is transcribed into RNA, which (in the form of messenger RNAs) is in turn translated into proteins, the original DNA sequence simultaneously carries several fundamentally different classes of codes. This multiplicity of codes reflects the variety of biomacromolecules and the multiplicity of their interactions. For example, specific protein-protein contacts involve amino acid sequences uniquely designed to ensure their specificity. The corresponding primary DNA sequences have characteristics different from the sequence features dictated by the demands of the RNA-protein or RNA-RNA interactions.

(4) The same sequence can be involved in several interactions of different nature; hence overlapping messages (codes) are expected to occur in Gnomic. Thus, in principle, the same sequence can be responsible for (a) DNA-protein interactions, (b) RNA-protein and/or RNA-RNA interactions following DNA transcription, and (c) the interactions of the translated protein with its substrates or cofactors [Figure 1(c)].

(5) The characteristics listed above imply that each of the overlapping messages should be degenerate to a certain degree to allow for coexistence. Alternatively, the same macromolecule could have similar affinities to several different contact words not necessarily related by their sequence. That is, Gnomic is expected to contain frequent synonyms. Use of synonymous words (messages) would be one way to realize the degeneracy of the overlapping messages.

FIG. 1. Biomacromolecular interactions and structure of Gnomic language. (a) Single contact words. (b) Multicontact interspersed messages. (c) Overlapping messages. DNA, RNA, and protein molecules are shown schematically, as indicated in the insert. Contact sites (sequences) are represented by black bars.

2. GNOMIC LANGUAGE VERSUS HUMAN ALPHABETIC LANGUAGE(S)
A. SIMILARITIES
Both Gnomic and human languages can be expressed in the form of one-dimensional arrays (texts) of symbols, elements of limited alphabets. Both languages reflect, however, the four-dimensional (space and time) reality, though of different kinds. In both Gnomic and human languages, the order of letters, words, and sentences is important. In both cases, translocations could severely damage the message. However, it is possible for large blocks of texts to become translocated without any obvious consequences (transposable elements, some chapters of a textbook). Both languages are very sensitive to mutations (copying errors), and both are subject to evolution, though, of course, by different mechanisms. One can easily trace “viable mutations” in spoken language: point mutations (“dependence” = “dependance” = “dependancy” = “dependency,” or “accessory” and “accessary”), deletions and insertions (cf. “cancelled” and “canceled,” “acoustic” and “acoustical,” “thru” and “through”), and more complex changes (like “center” and “centre,” “connection” and “connexion”). Both spoken language and nucleotide sequences can be considered as gradually changing living organisms. Thus, the analogy might actually reach much farther (or “further”) than superficial comparison would reveal.

B. DISTINCTIVE FEATURES OF GNOMIC LANGUAGE
Gnomic texts do not have any obvious spaces or equivalent symbols between putative words. Some ancient human texts have also been composed
in a continuous manner. Even modern languages, if artificially expressed in such writing without blanks, would not cause severe misunderstanding. It is noteworthy that in spoken languages the “blanks” between the pronounced words are practically indiscernible. In the case of an unknown language, like Gnomic, identification of the words within their contexts becomes a problem. One solution to this problem is discussed in Section 3.E below.

Overlapping messages (codes), typical of Gnomic, are not allowed in human languages. Frequent types of overlap found in Gnomic texts are overlapping protein coding sequences [24, 11] and some regulatory sequences overlapping with promoters [23]. One interesting example of overlapping is the transcription of two genes from opposite strands of the same DNA region [1]. Other examples of overlapping codes are discussed below.

Gnomic “expressions” carrying the same message are typically rather diffuse, being very different in their sequence, though some common elements are frequently present. Prokaryotic [14] and eukaryotic [7] promoters provide good examples of diffuse, fuzzy messages. Such fuzziness, apparently, is due to the fact that each molecular contact involves many different sequence-dependent physical and geometrical features. A one-to-one relationship between the sequence and the features does not exist. On the other hand, efficient recognition does not necessarily require that all relevant features be simultaneously present. Such dispersed, yet very specific, interaction has been described in terms of distributional recognition [29]. The fundamentally diffuse character of the recognition signals in the sequences obviously provides the necessary flexibility (degeneracy) when several overlapping messages are present.

Messages in Gnomic are often discontinuous, interspersed by passages of completely different texts.
The most striking example of such sequence arrangement is gene splicing, which was discovered a decade ago simultaneously in several laboratories [4, 34, 8, 18, 2, 25]. Another interesting case is the interruption of an rRNA gene by an insert which was identified as a tRNA gene [15].

Gnomic language is exceedingly repetitive. The repeats, both tandem and interspersed, are very frequent, from primitive mono- and dinucleotide long repeats, through satellite sequences with repeat lengths ranging from a few bases to several hundred bases, to repeats of entire genes.

Gnomic texts apparently carry dispensable elements. First, the number of repeats is often unstable, varying from cell to cell (e.g. [13]). Second, so-called pseudogenes are frequently found, which are very close to normal genes in their sequence, but cannot be expressed due to frameshift and nonsense mutations or lack of introns, etc.

Thus, both a priori considerations on the molecular nature of Gnomic and our growing knowledge about some of the properties of this language indicate that Gnomic can be characterized by the following distinctive features:

(1) single contact words,
(2) interspersed messages,
(3) multiple codes,
(4) overlapping messages,
(5) degeneracy,
(6) repetitiveness,
(7) dispensable elements.

In the following section, several particular codes of Gnomic are briefly discussed with reference to the features described above.
3. CODES OF GNOMIC
A. RNA-TO-PROTEIN TRANSLATION CODE
This is the classical triplet code, which is responsible for the translation of an mRNA sequence into the amino acid sequence of a protein [21, 26]. Each of 20 different amino acids is encoded in RNA by a three-base word (codon, triplet) or by several triplets (degeneracy). The protein coding sequences in prokaryotes are contiguous, while in eukaryotes the sequences are interrupted by intervening sequences of unknown function and destiny [25]. The triplet code was the first one deciphered and has since been publicized to such an extent that there is a common belief that the triplet code expresses most, if not all, of the information contained in the sequences. However, this is hardly the case, since the protein-coding sequences represent only a few percent of the eukaryotic genome, the rest being of largely unknown function. Several less-known Gnomic codes in various stages of their deciphering are discussed below.

B. CHROMATIN CODE
Surprisingly, some Gnomic codes are as simple as purely periodical patterns. The chromatin code, which determines which pieces of DNA are folded into the nucleosomes (elementary structural units of chromosomes), is of this kind. The protein core of the nucleosome wraps DNA around itself and thus contacts only one side of the molecule. Any sequence element which has special affinity or preferential orientation relative to the protein core can impart sidedness to the DNA molecule as soon as the element is repeated along the sequence with a period equal to that of the DNA helix. It was found that some dinucleotides, AA and TT in particular, indeed have a tendency to be repeated along chromatin DNA sequences with the DNA helical repeat period of about 10.5 bases [27, 28, 19]. In addition, the AA·TT stack of base pairs has been found to be not parallel as in classical B-DNA structure, but rather locally deflecting the DNA axis by approximately 9° [32, 33]. Periodic repetition of the AA·TT elements results in unidirectional curving of the DNA, which facilitates its smooth bending in the nucleosome. The AA (TT) periodicity is always a weak component of the sequence and thus allows other messages to be contained within these sequences as well.

C. TRANSLATION-FRAMING CODE
Direct autocorrelation analysis of natural nucleotide sequences indicates that in addition to the 10.5-base periodicity there is a very strong 3-base periodicity [27], which actually arises from the protein-coding regions. The repeating pattern (G-nonG-N)_n [31] definitely interferes with the protein-coding capacity of the sequence and therefore should be of special importance. Ribosomes, the cellular machines responsible for protein synthesis, contain in their exposed rRNA sequences the structure (NNC)_n, which is complementary to the (G-nonG-N)_n motif in mRNA [31]. Strikingly, these (NNC)_n sequences have been previously identified as actual or potential sites for rRNA-mRNA interactions. Apparently, predesigned complementarity of mRNA to these sites in the ribosome is of importance for maintaining the correct translation reading frame. The (G-nonG-N)_n pattern can thus be considered as a translation-framing or frame-keeping code. Strong G-periodicity is shown to trigger ribosome slippage, i.e. counting occasionally two or four bases instead of three [31]. The most striking example of such ribosome slippage occurs in the RF-2 protein gene of E. coli, which carries a wild-type frameshift that is compensated by ribosome slippage at the point of the frameshift, so that normal protein is synthesized [9]. The new (correct) frame immediately downstream from the slippage site is characterized by a very high frequency of G in the first positions of the triplets.

D. RNA LOOP CODE?
Mirror-symmetrical sequences (palindromes) are occasionally mentioned in the literature, capturing the imagination by their symmetry. Since the cellular biochemical systems can read (replicate or transcribe) the sequences only in one specific chemical direction (5’-3’), the mirror-symmetrical sequences would be nonsense sequences along at least 50% of their length. We have found that intervening sequences (introns) are full of palindromic “words” [3], which presumably are not just amusing rare whims of chance, but rather occur as functionally meaningful sequence elements.

FIG. 2. Dinucleotide symmetry rosette. Orthogonal axes indicated correspond to mirror symmetry (vertical axis) and to complementary symmetry (horizontal axis) of the dinucleotides.

One possible significance of the palindromes is their inability to fold onto themselves or to form hairpins, typical elements of RNA structure. To illustrate this point, let us consider 16 different dinucleotides which can be subdivided into halves in two ways, depending on the choice of their symmetry properties: complementary symmetry or mirror symmetry (Figure 2). If complementary symmetry is not allowed (no complementary contacts more than one base long), half of the dinucleotides must be discarded. The remaining half will still possess the mirror symmetry. Hence, an excess of the mirror-symmetrical elements in the sequence reflects its inability to fold complementarily upon itself. Functionally, the palindromes in intervening sequences might represent signal sequences in RNA exposed for interactions with other biomolecules involved in gene-splicing processes and gene regulation [3].
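The distinction drawn above between mirror symmetry and complementary symmetry can be illustrated with a minimal Python sketch. It is not a reconstruction of the rosette in Figure 2 and is not taken from [3]; the function names, the DNA alphabet, the two test strings, and the crude "stem" test (which ignores loop length and non-canonical pairs such as G-U) are all assumptions made for illustration only.

```python
from itertools import product

# Watson-Crick partners for the four-letter DNA alphabet (assumed here).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mirror(seq):
    """Return the sequence read backwards (its mirror image)."""
    return seq[::-1]


def reverse_complement(seq):
    """Return the strand that would pair with seq, read 5'-3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_mirror_symmetric(seq):
    """True if seq reads the same in both directions (mirror palindrome)."""
    return seq == mirror(seq)


def is_self_complementary(seq):
    """True if seq equals its own reverse complement."""
    return seq == reverse_complement(seq)


def has_complementary_stem(seq, k=2):
    """Crude test: does any k-mer have its reverse complement further
    downstream, without overlap? Such a pair could in principle form a
    stem of k base pairs (loop-length constraints are ignored)."""
    for i in range(len(seq) - k + 1):
        partner = reverse_complement(seq[i:i + k])
        if seq.find(partner, i + k) != -1:
            return True
    return False


# Classify all 16 dinucleotides by the two symmetry operations.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]
MIRROR_DINUCLEOTIDES = sorted(d for d in DINUCLEOTIDES if is_mirror_symmetric(d))
SELF_COMPLEMENTARY_DINUCLEOTIDES = sorted(
    d for d in DINUCLEOTIDES if is_self_complementary(d)
)

if __name__ == "__main__":
    print("mirror-symmetric:", MIRROR_DINUCLEOTIDES)
    print("self-complementary:", SELF_COMPLEMENTARY_DINUCLEOTIDES)
    # A mirror palindrome (hypothetical example) versus a
    # self-complementary word (hypothetical example).
    for word in ("AACCAA", "GGATCC"):
        print(word, is_mirror_symmetric(word), has_complementary_stem(word))
```

In this sketch the mirror-symmetrical word AACCAA contains no pair of dinucleotides that could base-pair with each other, whereas the self-complementary word GGATCC does, which is the qualitative point made in the text.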
E. CONTRAST WORDS OF GNOMIC
There are many known examples of short, specific sequences responsible for certain functions which, on the one hand, allow for only small (if any) changes within the “word” and, on the other, can be flanked by virtually any sequence, being found in many different “contexts” [30]. Such words with strong internal correlation can be easily detected by estimating the positional correlation contrast for various combinations of nucleotides within ensembles of all contexts where these combinations have been found [6]. The contrast is measured as the standard deviate SD of the observed frequency f(s) of a given short sequence s = {N_1 N_2 ... N_n} from its expected occurrence E(s). The E(s)-value for the string s of length n (n-mer) is calculated from observed frequencies of (n - 2)- and (n - 1)-mers:

$$E(N_1 N_2 \ldots N_n) = \frac{f(N_1 \ldots N_{n-1})\, f(N_2 \ldots N_n)}{f(N_2 \ldots N_{n-1})}$$

SD = [(f ...

The strings which have significant contrast (say, SD > 3) can thus be considered as contrast words, preferred or avoided, i.e. f(s) > E(s) and f(s) < E(s), respectively [6]. One must bear in mind that the contrast-word approach does not imply possible meanings of the detected words. This approach only indicates that the words might well have some important meaning, being singled out by their contrast. The contrast-word technique actually detects specific codes based on contrast, leaving their deciphering, i.e. the determination of their meaning, to further sequence analysis and experimental manipulations with the words detected. This technique also allows for the construction of vocabularies (word usage), which have turned out to be very different for different species or sequence functions [6]. In particular, by this criterion the intervening sequences were found to be far from random, having a very special vocabulary ([3]; see also above).

The few Gnomic codes described are of different nature, corresponding to DNA-protein interactions (chromatin code), protein-RNA or RNA-RNA interactions (possible RNA loop code), and more complex RNA-RNA-protein interactions (triplet code and framing code). These messages can, and frequently do, share the same nucleotide sequences. For example, a DNA fragment which corresponds to a protein coding exon can belong at the same time to the nucleosome, in which case at least three different codes are sandwiched together in one sequence: triplet code, framing code, and chromatin code. How many different messages can coexist in the same piece of nucleotide sequence? The upper, purely combinatorial limit, of order $2^n$,
where n is the length of the sequence, would probably be too generous, though it does not include codes that may be locally identical on a given sequence. The diffuse, degenerate character of the Gnomic codes, as well as involvement of the same nucleotides simultaneously in several messages (like G in first positions of several triplets), makes the above question difficult to answer. One has to be aware, however, of Gnomic texts carrying many vitally important wisdoms, silent until yet another code is cracked.

This work has been partially supported by the Minerva Foundation. The help of Dr. R. Thresher is especially appreciated.

REFERENCES
1. J. P. Adelman, C. T. Bond, J. Douglass, and E. Herbert, Two mammalian genes transcribed from opposite strands of the same DNA locus, Science 235:1514-1517 (1987).
2. Y. Aloni, R. Dhar, O. Laub, M. Horowitz, and G. Khoury, Novel mechanism for RNA maturation: The leader sequences of simian virus 40 mRNA are not transcribed adjacent to the coding sequences, Proc. Nat. Acad. Sci. U.S.A. 74:3686-3690 (1977).
3. J. S. Beckmann, V. Brendel, and E. N. Trifonov, Intervening sequences exhibit distinct vocabulary, J. Biomolec. Str. Dyn. 4:391-400 (1986).
4. S. M. Berget, C. Moore, and P. A. Sharp, Spliced sequences at the 5’-terminus of adenovirus 2 late mRNA, Proc. Nat. Acad. Sci. U.S.A. 74:3171-3175 (1977).
5. V. Brendel and H. G. Busse, Genome structure described by formal languages, Nucl. Acids Res. 12:2561-2568 (1984).
6. V. Brendel, J. S. Beckmann, and E. N. Trifonov, Linguistics of nucleotide sequences: Morphology and comparison of vocabularies, J. Biomolec. Str. Dyn. 4:11-21 (1986).
7. P. Bucher and E. N. Trifonov, Compilation and analysis of eukaryotic pol II promoter sequences, Nucl. Acids Res. 14:10009-10026 (1986).
8. L. T. Chow, R. E. Gelinas, T. R. Broker, and R. J. Roberts, An amazing sequence arrangement at the 5’-ends of adenovirus 2 messenger RNA, Cell 12:1-8 (1977).
9. W. J. Craigen, R. G. Cook, W. P. Tate, and C. T. Caskey, Bacterial peptide chain release factors: Conserved primary structure and possible frameshift regulation of release factor 2, Proc. Nat. Acad. Sci. U.S.A. 82:3616-3620 (1985).
10. F. H. C. Crick, The genetic code III, Sci. Amer. 215(4):55-62 (1966).
11. J. J. Dunn and F. W. Studier, Complete nucleotide sequence of bacteriophage T7 DNA and the location of T7 genetic elements, J. Molec. Biol. 166:477-535 (1983).
12. A. K. Ebralidze, A. K. Naumova, A. L. Kenzior, N. A. Churikov, and G. P. Georgiev, The primary structure of suffix, the repetitive sequence located at 3’-ends of different genes of Drosophila melanogaster, Dokl. Akad. Nauk SSSR 275:1508-1510 (1984).
13. N. V. Fedoroff and D. B. Brown, The nucleotide sequence of oocyte 5S DNA in Xenopus laevis. I. The AT-rich spacer, Cell 13:701-716 (1978).
14. C. B. Harley and R. P. Reynolds, Analysis of E. coli promoter sequences, Nucl. Acids Res. 15:2343-2361 (1987).
15. T. Y. K. Heinonen, M. N. Schnare, P. G. Young, and M. W. Gray, Rearranged coding segments, separated by a transfer RNA gene, specify the two parts of a discontinuous large subunit ribosomal RNA in Tetrahymena pyriformis mitochondria, J. Biol. Chem. 262:2879-2887 (1987).
16. N. K. Jerne, The generative grammar of the immune system, EMBO J. 4:847-852 (1985).
17. H. F. Judson, The Eighth Day of Creation, Simon and Schuster, New York, 1979.
18. D. F. Klessig, Two adenovirus mRNAs have a common 5’-terminal leader sequence encoded at least 10 kb upstream from their main coding regions, Cell 12:9-21 (1977).
19. G. Mengeritsky and E. N. Trifonov, Nucleotide sequence-directed mapping of the nucleosomes, Nucl. Acids Res. 11:3833-3851 (1983).
20. V. V. Nalimov, In the Labyrinths of Language: A Mathematical Journey, ISI Press, Philadelphia, 1981.
21. M. W. Nirenberg, O. W. Jones, P. Leder, B. F. C. Clark, W. S. Sly, and S. Pestka, On the coding of genetic information, Cold Spring Harb. Symp. Quant. Biol. 28:549-557 (1963).
22. Nucleotide Sequences 1985. A Compilation from the GenBank and EMBL Data Libraries, IRL Press, Washington, 1985.
23. M. Ptashne, K. Backman, M. Z. Humayun, A. Jeffrey, R. Maurer, B. Meyer, and R. T. Sauer, Autoregulation and function of a repressor in phage lambda, Science 194:156-161 (1976).
24. F. Sanger, A. R. Coulson, T. Friedmann, G. M. Air, B. G. Barrell, N. L. Brown, J. C. Fiddes, C. A. Hutchison III, P. M. Slocombe, and M. Smith, The nucleotide sequence of bacteriophage phiX174, J. Molec. Biol. 125:225-246 (1978).
25. P. A. Sharp, Splicing of messenger RNA precursors, Science 235:766-771 (1987).
26. J. F. Speyer, P. Lengyel, C. Basilio, A. J. Wahba, R. S. Gardner, and S. Ochoa, Synthetic polynucleotides and the amino acid code, Cold Spring Harb. Symp. Quant. Biol. 28:559-567 (1963).
27. E. N. Trifonov and J. L. Sussman, The pitch of chromatin DNA is reflected in its nucleotide sequence, Proc. Nat. Acad. Sci. U.S.A. 77:3816-3820 (1980).
28. E. N. Trifonov, Sequence-dependent deformational anisotropy of chromatin DNA, Nucl. Acids Res. 8:4041-4053 (1980).
29. E. N. Trifonov, Sequence-dependent variations of B-DNA structure and protein-DNA recognition, Cold Spring Harb. Symp. Quant. Biol. 47:271-278 (1983).
30. E. N. Trifonov and V. Brendel, Gnomic: A Dictionary of Genetic Codes, Balaban, Philadelphia, 1986.
31. E. N. Trifonov, Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences, J. Molec. Biol. 194:643-652 (1987).
32. L. E. Ulanovsky, M. Bodner, E. N. Trifonov, and M. Choder, Curved DNA: Design, synthesis and circularization, Proc. Nat. Acad. Sci. U.S.A. 83:862-866 (1986).
33. L. E. Ulanovsky and E. N. Trifonov, Estimation of wedge components in curved DNA, Nature 326:720-722 (1987).
34. H. Westphal and S.-P. Lai, Quantitative electron microscopy of early adenovirus RNA, J. Molec. Biol. 116:525-548 (1977).
35. T. T. Wu, M. Reid-Miller, H. M. Perry, and E. A. Kabat, Long identical repeats in the mouse gamma2b switch region and their implications for the mechanism of class switching, EMBO J. 3:2033-2040 (1984).