Codes of nucleotide sequences


E. N. TRIFONOV
Department of Polymer Research, The Weizmann Institute of Science, Rehovot 76100, Israel

Received 1 June 1987; revised 21 November 1987

ABSTRACT

Nucleotide sequences have many properties of a language. This analogy, when developed to its fullest, results in an interesting linguistic description of the nucleotide sequences. Several a priori features of this language (called “Gnomic”) are discussed, based on its molecular nature. Gnomic appears to be a multicode language, with overlapping degenerate messages, each one encoding physically different specific interactions (protein-DNA, protein-RNA, protein-protein, and RNA-RNA). Several codes of the Gnomic language are discussed: the RNA-to-protein translation (triplet) code; the chromatin code responsible for DNA folding in chromatin; the framing code which secures correct triplet counting during translation; and, tentatively, the RNA loop code, presumably responsible for the formation of RNA loops with specific recognition sequences. The last code is aperiodic and involves mirror-symmetrical sequence elements, while the other codes are based on the sequence periodicities. A general technique of detection of words in continuous (no blanks) texts is discussed, based on the context contrast of the internally correlated strings.

MATHEMATICAL BIOSCIENCES 90:507-517 (1988). © Elsevier Science Publishing Co., Inc., 1988, 52 Vanderbilt Ave., New York, NY 10017. 0025-5564/88/$03.50

INTRODUCTION

Nucleotide sequences are frequently compared to a language [10, 20, 35, 12, 16], a metaphor which can be traced as far back as 1892, when J. F. Miescher, the discoverer of DNA, observed that large molecules built of a few similar, but not identical, small chemical units could carry the hereditary information “just as words and concepts of all languages can find expression in twenty-four to thirty letters of the alphabet” [17]. Today one frequently finds whole pages and entire books filled with the four-letter-alphabet texts known as nucleotide sequences [22]. During the last two decades, a vast amount of research has been devoted to the analysis of nucleotide sequences and to the peculiarities of nucleotide distributions. However, V. Brendel and H. G. Busse [5] presented the first paper which treated the nucleotide sequences as a (formal) language. In our recent attempt to develop the language analogy to its fullest [30, 6, 3], we found very interesting and widely diversified linguistic characteristics within the nucleotide sequences. The analogy became so striking that we believe that nucleotide sequences should be considered and treated as a language and have suggested that the name for this language be Gnomic (from Greek gnome: maxim, aphorism). Thus, the term “Gnomic” is suggested for use instead of “nucleotide sequences” whenever their linguistic properties are considered [30].

This paper is an attempt at a systematic introduction to Gnomic language. Some general considerations on the structure (morphology) of Gnomic are discussed, as well as its distinct features and similarities to languages proper. Gnomic language appears to be a multicode language, and several established, as well as a few emerging, codes of Gnomic are discussed as representatives of a large spectrum of codes hidden in the Gnomic texts.

1. GNOMIC: A BIOLOGICAL LANGUAGE FOR COMMUNICATION BETWEEN MACROMOLECULES

There are several properties of Gnomic language which can be formulated a priori, even before any linguistic analysis of the Gnomic texts. First, this is a language which various macromolecules in the cell utilize to communicate. All major processes of gene expression (DNA replication, DNA-to-RNA transcription, RNA-to-protein translation, RNA-to-DNA reverse transcription) involve multiple specific interactions between DNA and proteins, between RNA and proteins, and between proteins and proteins, as well as RNA-DNA and RNA-RNA interactions at both inter- and intramolecular levels. “Specific” is used here to include “sequence-specific.” In other words, certain parts of the interacting molecules come into contact, and thus only certain sequences of bases, of amino acids, or possibly of sugar residues in polysaccharides are involved in any given specific interaction. Such contacts can, of course, be considered as elementary acts of the sequence-mediated communication between biological macromolecules. The sequences involved could then be considered as the “words” or “phrases” that the interacting molecules convey to one another. Simple as it is, this view lends itself to outlining some general properties of Gnomic language, very much in accord with what is discussed in the following a posteriori chapters.

FIG. 1. Biomacromolecular interactions and structure of Gnomic language. (a) Single contact words. (b) Multicontact interspersed messages. (c) Overlapping messages. DNA, RNA, and protein molecules are shown schematically, as indicated in the insert. Contact sites (sequences) are represented by black bars.

(1) Gnomic is expected to contain contiguous strings of bases (or amino acids, in the translated version of Gnomic) involved in the sequence-specific contacts. The size of these single contact words could range from one to about fifty bases for protein-nucleic acid interactions, being limited by the physical size of the protein bound [see Figure 1(a)].

(2) The single contact words which are involved in the same macromolecular complex would make a multicontact message consisting of several contiguous words separated by sequence spacers, largely irrelevant in the context of the given specific multicontact interaction. That is, Gnomic is expected to contain interspersed messages built of several contiguous words physically separated by other messages [Figure 1(b)].

(3) Since the primary nucleotide sequence of DNA is transcribed into RNA, which (in the form of messenger RNAs) is in turn translated into proteins, the original DNA sequence simultaneously carries several fundamentally different classes of codes. This multiplicity of codes reflects the variety of biomacromolecules and the multiplicity of their interactions. For example, specific protein-protein contacts involve amino acid sequences uniquely designed to ensure their specificity. The corresponding primary DNA sequences have characteristics different from the sequence features dictated by demands of the RNA-protein or RNA-RNA interactions.


(4) The same sequence can be involved in several interactions of different nature; hence overlapping messages (codes) are expected to occur in Gnomic. Thus, in principle, the same sequence can be responsible for (a) DNA-protein interactions, (b) RNA-protein and/or RNA-RNA interactions following DNA transcription, and (c) the interactions of the translated protein with its substrates or cofactors [Figure 1(c)].

(5) The characteristics listed above imply that each of the overlapping messages should be degenerate to a certain degree to allow for coexistence. Alternatively, the same macromolecule could have similar affinities to several different contact words not necessarily related by their sequence. That is, Gnomic is expected to contain frequent synonyms. Use of synonymous words (messages) would be one way to realize the degeneracy of the overlapping messages.

2. GNOMIC LANGUAGE VERSUS HUMAN ALPHABETIC LANGUAGE(S)

A. SIMILARITIES

Both Gnomic and human languages can be expressed in the form of one-dimensional arrays (texts) of symbols, elements of limited alphabets. Both languages reflect, however, the four-dimensional (space and time) reality, though of different kinds.

In both Gnomic and human languages, the order of letters, words, and sentences is important. In both cases, translocations could severely damage the message. However, it is possible for large blocks of texts to become translocated without any obvious consequences (transposable elements, some chapters of a textbook).

Both languages are very sensitive to mutations (copying errors), and both are subject to evolution, though, of course, by different mechanisms. One can easily trace “viable mutations” in spoken language: point mutations (“dependence” = “dependance” = “dependancy” = “dependency,” or “accessory” and “accessary”), deletions and insertions (cf. “cancelled” and “canceled,” “acoustic” and “acoustical,” “thru” and “through”), and more complex changes (like “center” and “centre,” “connection” and “connexion”). Both spoken language and nucleotide sequences can be considered as gradually changing living organisms. Thus, the analogy might actually reach much farther (or “further”) than superficial comparison would reveal.

B. DISTINCTIVE FEATURES OF GNOMIC LANGUAGE

Gnomic texts do not have any obvious spaces or equivalent symbols between putative words. Some ancient human texts have also been composed in a continuous manner. Even modern languages, if artificially expressed in such writing without blanks, would not cause severe misunderstanding. It is noteworthy that in spoken languages the “blanks” between the pronounced words are practically indiscernible. In the case of an unknown language, like Gnomic, identification of the words within their contexts becomes a problem. One solution to this problem is discussed in Section 3.E below.

Overlapping messages (codes), typical of Gnomic, are not allowed in human languages. Frequent types of overlap found in Gnomic texts are overlapping protein coding sequences [24, 11] and some regulatory sequences overlapping with promoters [23]. One interesting example of overlapping is the transcription of two genes from opposite strands of the same DNA region [1]. Other examples of overlapping codes are discussed below.

Gnomic “expressions” carrying the same message are typically rather diffuse, being very different in their sequence, though some common elements are frequently present. Prokaryotic [14] and eukaryotic [7] promoters provide good examples of diffuse, fuzzy messages. Such fuzziness, apparently, is due to the fact that each molecular contact involves many different sequence-dependent physical and geometrical features. A one-to-one relationship between the sequence and the features does not exist. On the other hand, efficient recognition does not necessarily require that all relevant features be simultaneously present. Such dispersed, yet very specific, interaction has been described in terms of distributional recognition [29]. The fundamentally diffuse character of the recognition signals in the sequences obviously provides necessary flexibility (degeneracy) when several overlapping messages are present.

Messages in Gnomic are often discontinuous, interspersed by passages of completely different texts. The most striking example of such sequence arrangement is gene splicing, which was discovered a decade ago simultaneously in several laboratories [4, 34, 8, 18, 2, 25]. Another interesting case is interruption of an rRNA gene by an insert which was identified as a tRNA gene [15].

Gnomic language is exceedingly repetitive. The repeats, both tandem and interspersed, are very frequent, from primitive mono- and dinucleotide long repeats, through satellite sequences with repeat lengths ranging from a few bases to several hundred bases, to repeats of entire genes.

Gnomic texts apparently carry dispensable elements. First, the number of repeats is often unstable, varying from cell to cell (e.g. [13]). Second, so-called pseudogenes are frequently found, which are very close to normal genes in their sequence, but cannot be expressed due to frameshift and nonsense mutations or lack of introns, etc.

Thus, both a priori considerations on the molecular nature of Gnomic and our growing knowledge about some of the properties of this language indicate that Gnomic can be characterized by the following distinctive features:

(1) single contact words,
(2) interspersed messages,
(3) multiple codes,
(4) overlapping messages,
(5) degeneracy,
(6) repetitiveness,
(7) dispensable elements.

In the following section, several particular codes of Gnomic are briefly discussed with reference to the features described above.

3. CODES OF GNOMIC

A. RNA-TO-PROTEIN TRANSLATION CODE

This is the classical triplet code which is responsible for the translation of an mRNA sequence into the amino acid sequence of a protein [21, 26]. Each of the 20 different amino acids is encoded in RNA by a three-base word (codon, triplet) or by several triplets (degeneracy). The protein coding sequences in prokaryotes are contiguous, while in eukaryotes the sequences are interrupted by intervening sequences of unknown function and destiny [25]. The triplet code was the first one deciphered and has since been publicized to such an extent that there is a common belief that the triplet code expresses most, if not all, of the information contained in the sequences. However, this is hardly the case, since the protein-coding sequences represent only a few percent of the eukaryotic genome, the rest being of largely unknown function. Several less-known Gnomic codes in various stages of their deciphering are discussed below.

B. CHROMATIN CODE

Surprisingly, some Gnomic codes are as simple as purely periodical patterns. The chromatin code, which determines which pieces of DNA are folded into the nucleosomes (elementary structural units of chromosomes), is of this kind. The protein core of the nucleosome wraps DNA around itself and thus contacts only one side of the molecule. Any sequence element which has special affinity or preferential orientation relative to the protein core can impart the sidedness to the DNA molecule as soon as the element is repeated along the sequence with the period equal to that of the DNA helix. It was found that some dinucleotides, AA and TT in particular, indeed have a tendency to be repeated along chromatin DNA sequences with the DNA helical repeat period of about 10.5 bases [27, 28, 19]. In addition, the AA·TT stack of base pairs has been found to be not parallel as in classical B-DNA structure, but rather locally deflecting the DNA axis by approximately 9° [32, 33]. Periodic repetition of the AA·TT elements results in unidirectional curving of the DNA, which facilitates its smooth bending in the nucleosome. The AA (TT) periodicity is always a weak component of the sequence and thus allows other messages to be contained within these sequences as well.

C. TRANSLATION-FRAMING CODE

Direct autocorrelation analysis of natural nucleotide sequences indicates that in addition to the 10.5-base periodicity there is a very strong 3-base periodicity [27], which actually arises from the protein-coding regions. The repeating pattern (G-nonG-N)_n [31] definitely interferes with the protein-coding capacity of the sequence and therefore should be of special importance. Ribosomes, the cellular machines responsible for protein synthesis, contain in their exposed rRNA sequences the structure (NNC)_n, which is complementary to the (G-nonG-N)_n motif in mRNA [31]. Strikingly, these (NNC)_n sequences have been previously identified as actual or potential sites for rRNA-mRNA interactions. Apparently, predesigned complementarity of mRNA to these sites in the ribosome is of importance for maintaining the correct translation reading frame. The (G-nonG-N)_n pattern can thus be considered as a translation-framing or frame-keeping code. Strong G-periodicity is shown to trigger ribosome slippage, i.e. counting occasionally two or four bases instead of three [31]. The most striking example of such ribosome slippage occurs in the RF-2 protein gene of E. coli, which carries a wild-type frameshift that is compensated by ribosome slippage at the point of the frameshift, so that normal protein is synthesized [9]. The new (correct) frame immediately downstream from the slippage site is characterized by a very high frequency of G in the first positions of the triplets.

D. RNA LOOP CODE?

Mirror-symmetrical sequences (palindromes) are occasionally mentioned in the literature, capturing the imagination by their symmetry. Since the cellular biochemical systems can read (replicate or transcribe) the sequences only in one specific chemical direction (5’-3’), the mirror-symmetrical sequences would be nonsense sequences along at least 50% of their length. We have found that intervening sequences (introns) are full of palindromic “words” [3], which presumably are not just amusing rare whims of chance, but rather occur as functionally meaningful sequence elements. One possible significance of the palindromes is their inability to fold onto themselves or to form hairpins, typical elements of RNA structure.

To illustrate this point, let us consider 16 different dinucleotides which can be subdivided into halves in two ways, depending on the choice of their symmetry properties: complementary symmetry or mirror symmetry (Figure 2). If complementary symmetry is not allowed (no complementary contacts more than one base long), half of the dinucleotides must be discarded. The remaining half will still possess the mirror symmetry. Hence, an excess of the mirror-symmetrical elements in the sequence reflects its inability to fold complementarily upon itself. Functionally, the palindromes in intervening sequences might represent signal sequences in RNA exposed for interactions with other biomolecules involved in gene-splicing processes and gene regulation [3].

FIG. 2. Dinucleotide symmetry rosette. Orthogonal axes indicated correspond to mirror symmetry (vertical axis) and to complementary symmetry (horizontal axis) of the dinucleotides.
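The two symmetry operations contrasted in this section can be sketched in a few lines of Python. This is only an illustration, not the construction behind Figure 2: the function names are hypothetical, and "complementary symmetry" is taken here to mean that a sequence equals its own reverse complement (so that it can pair with a copy of itself), which is an assumption about the intended definition.

```python
from itertools import product

# Watson-Crick partners (DNA alphabet; substitute U for T when working with RNA).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mirror(seq):
    """Mirror image: the same letters read in the opposite direction."""
    return seq[::-1]


def reverse_complement(seq):
    """Sequence of the complementary strand, read in its own 5'-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_mirror_symmetrical(seq):
    """True for palindromes in the sense used in the text."""
    return seq == mirror(seq)


def is_self_complementary(seq):
    """True if the sequence can pair with a copy of itself (assumed definition)."""
    return seq == reverse_complement(seq)


# Enumerate all 16 dinucleotides and find the fixed points of each operation.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]
mirror_fixed = [d for d in DINUCLEOTIDES if is_mirror_symmetrical(d)]
self_comp = [d for d in DINUCLEOTIDES if is_self_complementary(d)]

print(mirror_fixed)  # dinucleotides unchanged by mirror reflection
print(self_comp)     # dinucleotides unchanged by reverse complementation
```

The same two predicates apply to longer words: a mirror-symmetrical word such as GAATAAG is not self-complementary, whereas a self-complementary word such as GAATTC is not mirror-symmetrical, which is the distinction the argument above relies on.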

E. CONTRAST WORDS OF GNOMIC

There are many known examples of short, specific sequences responsible for certain functions which, on the one hand, allow for only small (if any) changes within the “word” and, on the other, can be flanked virtually by any sequence, being found in many different “contexts” [30]. Such words with strong internal correlation can be easily detected by estimating the positional correlation contrast for various combinations of nucleotides within ensembles of all contexts where these combinations have been found [6]. The contrast is measured as the standard deviate SD of the observed frequency $f(s)$ of a given short sequence $s = \{N_1 N_2 \ldots N_n\}$ from its expected occurrence $E(s)$. The $E(s)$-value for the string $s$ of length $n$ ($n$-mer) is calculated from observed frequencies of $(n-2)$- and $(n-1)$-mers:

$$E(N_1 N_2 \ldots N_n) = \frac{f(N_1 \ldots N_{n-1})\, f(N_2 \ldots N_n)}{f(N_2 \ldots N_{n-1})}$$

SD = [(f …

The strings which have significant contrast (say, SD > 3) can thus be considered as contrast words, preferred or avoided, i.e. $f(s) > E(s)$ and $f(s) < E(s)$, respectively [6]. One must bear in mind that the contrast-word approach does not imply possible meanings of the detected words. This approach only indicates that the words might well have some important meaning, being singled out by their contrast. The contrast-word technique actually detects specific codes based on contrast, leaving their deciphering, i.e. the determination of their meaning, to further sequence analysis and experimental manipulations with the words detected. This technique also allows for the construction of vocabularies (word usage), which have turned out to be very different for different species or sequence functions [6]. In particular, by this criterion the intervening sequences were found to be far from random, having a very special vocabulary ([3]; see also above).

The few Gnomic codes described are of different nature, corresponding to DNA-protein interactions (chromatin code), protein-RNA or RNA-RNA interactions (possible RNA loop code), and more complex RNA-RNA-protein interactions (triplet code and framing code). These messages can, and frequently do, share the same nucleotide sequences. For example, a DNA fragment which corresponds to a protein coding exon can belong at the same time to the nucleosome, in which case at least three different codes are sandwiched together in one sequence: triplet code, framing code, and chromatin code. How many different messages can coexist in the same piece of nucleotide sequence? The upper, purely combinatorial limit, of order $2^n$,

where n is the length of the sequence, would probably be too generous, though it does not include codes that may be locally identical on a given sequence. The diffuse, degenerate character of the Gnomic codes, as well as involvement of the same nucleotides simultaneously in several messages (like G in first positions of several triplets), makes the above question difficult to answer. One has to be aware, however, of Gnomic texts carrying many vitally important wisdoms, silent until yet another code is cracked.

This work has been partially supported by the Minerva Foundation. The help of Dr. R. Thresher is especially appreciated.

REFERENCES

1. J. P. Adelman, C. T. Bond, J. Douglass, and E. Herbert, Two mammalian genes transcribed from opposite strands of the same DNA locus, Science 235:1514-1517 (1987).
2. Y. Aloni, R. Dhar, O. Laub, M. Horowitz, and G. Khoury, Novel mechanism for RNA maturation: The leader sequences of simian virus 40 mRNA are not transcribed adjacent to the coding sequences, Proc. Nat. Acad. Sci. U.S.A. 74:3686-3690 (1977).
3. J. S. Beckmann, V. Brendel, and E. N. Trifonov, Intervening sequences exhibit distinct vocabulary, J. Biomolec. Str. Dyn. 4:391-400 (1986).
4. S. M. Berget, C. Moore, and P. A. Sharp, Spliced sequences at the 5’-terminus of adenovirus 2 late mRNA, Proc. Nat. Acad. Sci. U.S.A. 74:3171-3175 (1977).
5. V. Brendel and H. G. Busse, Genome structure described by formal languages, Nucl. Acids Res. 12:2561-2568 (1984).
6. V. Brendel, J. S. Beckmann, and E. N. Trifonov, Linguistics of nucleotide sequences: Morphology and comparison of vocabularies, J. Biomolec. Str. Dyn. 4:11-21 (1986).
7. P. Bucher and E. N. Trifonov, Compilation and analysis of eukaryotic pol II promoter sequences, Nucl. Acids Res. 14:10009-10026 (1986).
8. L. T. Chow, R. E. Gelinas, T. R. Broker, and R. J. Roberts, An amazing sequence arrangement at the 5’-ends of adenovirus 2 messenger RNA, Cell 12:1-8 (1977).
9. W. J. Craigen, R. G. Cook, W. P. Tate, and C. T. Caskey, Bacterial peptide chain release factors: Conserved primary structure and possible frameshift regulation of release factor 2, Proc. Nat. Acad. Sci. U.S.A. 82:3616-3620 (1985).
10. F. H. C. Crick, The genetic code III, Sci. Amer. 215(4):55-62 (1966).
11. J. J. Dunn and F. W. Studier, Complete nucleotide sequence of bacteriophage T7 DNA and the location of T7 genetic elements, J. Molec. Biol. 166:477-535 (1983).
12. A. K. Ebralidze, A. K. Naumova, A. L. Kenzior, N. A. Churikov, and G. P. Georgiev, The primary structure of suffix, the repetitive sequence located at 3’-ends of different genes of Drosophila melanogaster, Dokl. Akad. Nauk SSSR 275:1508-1510 (1984).
13. N. V. Fedoroff and D. B. Brown, The nucleotide sequence of oocyte 5S DNA in Xenopus laevis. I. The AT-rich spacer, Cell 13:701-716 (1978).
14. C. B. Harley and R. P. Reynolds, Analysis of E. coli promoter sequences, Nucl. Acids Res. 15:2343-2361 (1987).
15. T. Y. K. Heinonen, M. N. Schnare, P. G. Young, and M. W. Gray, Rearranged coding segments, separated by a transfer RNA gene, specify the two parts of a discontinuous large subunit ribosomal RNA in Tetrahymena pyriformis mitochondria, J. Biol. Chem. 262:2879-2887 (1987).
16. N. K. Jerne, The generative grammar of the immune system, EMBO J. 4:847-852 (1985).
17. H. F. Judson, The Eighth Day of Creation, Simon and Schuster, New York, 1979.
18. D. F. Klessig, Two adenovirus mRNAs have a common 5’-terminal leader sequence encoded at least 10 kb upstream from their main coding regions, Cell 12:9-21 (1977).
19. G. Mengeritsky and E. N. Trifonov, Nucleotide sequence-directed mapping of the nucleosomes, Nucl. Acids Res. 11:3833-3851 (1983).
20. V. V. Nalimov, In the Labyrinths of Language: A Mathematical Journey, ISI Press, Philadelphia, 1981.
21. M. W. Nirenberg, O. W. Jones, P. Leder, B. F. C. Clark, W. S. Sly, and S. Pestka, On the coding of genetic information, Cold Spring Harb. Symp. Quant. Biol. 28:549-557 (1963).
22. Nucleotide Sequences 1985. A Compilation from the GenBank and EMBL Data Libraries, IRL Press, Washington, 1985.
23. M. Ptashne, K. Backman, M. Z. Humayun, A. Jeffrey, R. Maurer, B. Meyer, and R. T. Sauer, Autoregulation and function of a repressor in phage lambda, Science 194:156-161 (1976).
24. F. Sanger, A. R. Coulson, T. Friedmann, G. M. Air, B. G. Barrell, N. L. Brown, J. C. Fiddes, C. A. Hutchison III, P. M. Slocombe, and M. Smith, The nucleotide sequence of bacteriophage phiX174, J. Molec. Biol. 125:225-246 (1978).
25. P. A. Sharp, Splicing of messenger RNA precursors, Science 235:766-771 (1987).
26. J. F. Speyer, P. Lengyel, C. Basilio, A. J. Wahba, R. S. Gardner, and S. Ochoa, Synthetic polynucleotides and the amino acid code, Cold Spring Harb. Symp. Quant. Biol. 28:559-567 (1963).
27. E. N. Trifonov and J. L. Sussman, The pitch of chromatin DNA is reflected in its nucleotide sequence, Proc. Nat. Acad. Sci. U.S.A. 77:3816-3820 (1980).
28. E. N. Trifonov, Sequence-dependent deformational anisotropy of chromatin DNA, Nucl. Acids Res. 8:4041-4053 (1980).
29. E. N. Trifonov, Sequence-dependent variations of B-DNA structure and protein-DNA recognition, Cold Spring Harb. Symp. Quant. Biol. 47:271-278 (1983).
30. E. N. Trifonov and V. Brendel, Gnomic: A Dictionary of Genetic Codes, Balaban, Philadelphia, 1986.
31. E. N. Trifonov, Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences, J. Molec. Biol. 194:643-652 (1987).
32. L. E. Ulanovsky, M. Bodner, E. N. Trifonov, and M. Choder, Curved DNA: Design, synthesis and circularization, Proc. Nat. Acad. Sci. U.S.A. 83:862-866 (1986).
33. L. E. Ulanovsky and E. N. Trifonov, Estimation of wedge components in curved DNA, Nature 326:720-722 (1987).
34. H. Westphal and S.-P. Lai, Quantitative electron microscopy of early adenovirus RNA, J. Molec. Biol. 116:525-548 (1977).
35. T. T. Wu, M. Reid-Miller, H. M. Perry, and E. A. Kabat, Long identical repeats in the mouse gamma2b switch region and their implications for the mechanism of class switching, EMBO J. 3:2033-2040 (1984).
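As a supplementary illustration of the contrast-word estimate of Section 3.E, the following Python sketch computes the expected occurrence of an n-mer from the observed counts of its two overlapping (n-1)-mers and its inner (n-2)-mer, as described there. Several things are assumptions rather than the method of [6]: the function names are hypothetical, frequencies are taken as raw counts of overlapping occurrences in a single text, and the standard deviate is approximated by a Poisson-style expression (observed minus expected, divided by the square root of the expected value), which may differ from the expression actually used in [6].

```python
from collections import Counter
from math import sqrt


def count_kmers(text, k):
    """Count all overlapping k-letter strings in a text without blanks."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def contrast_words(text, n):
    """Return {word: (observed, expected, deviate)} for every observed n-mer.

    expected = f(N1..N(n-1)) * f(N2..Nn) / f(N2..N(n-1)), as in Section 3.E.
    deviate  = (observed - expected) / sqrt(expected)   # ASSUMED form of SD
    """
    if n < 3:
        raise ValueError("the estimate needs (n-2)-mers, so n must be >= 3")
    counts = {k: count_kmers(text, k) for k in (n - 2, n - 1, n)}
    result = {}
    for word, observed in counts[n].items():
        inner = counts[n - 2][word[1:-1]]
        # inner > 0 always holds for an observed word; guard kept for safety.
        if inner == 0:
            continue
        expected = counts[n - 1][word[:-1]] * counts[n - 1][word[1:]] / inner
        result[word] = (observed, expected, (observed - expected) / sqrt(expected))
    return result


if __name__ == "__main__":
    # Toy text; a real analysis would use long natural sequences.
    sample = "ACGTTGACGTAAACGTCCACGT"
    for word, (obs, exp, sd) in sorted(contrast_words(sample, 3).items()):
        print(f"{word}  observed={obs}  expected={exp:.2f}  deviate={sd:+.2f}")
```

Only words that actually occur are scored, so strongly avoided words with zero occurrences would have to be enumerated separately. With a threshold such as the SD > 3 mentioned in the text, the returned deviates can be filtered into preferred (positive) and avoided (negative) candidates.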