J. theor. Biol. (1997) 188, 343–353
The Preferential Mode Analysis of DNA Sequence L L* F J† * Department of Physics, Inner Mongolia University, Hohhot; † Department of Basic Sciences, Inner Mongolia University of Technology; Department of Physics, Inner Mongolia University, Hohhot, China (Received on 26 September 1996, Accepted in revised form on 23 May 1997)
After reviewing approaches to the nucleotide correlation of DNA sequences, the preferential mode analysis method is emphasized and discussed in detail. The preferred modes and poor modes in coding regions, as well as in introns, 5'-caps and 3'-tails, are found through the statistical analysis of sequence data of all kinds of species in GenBank. The relation between the preferential mode analysis and the informational parameter method is deduced. It is discovered that in higher species the coding sequences preferentially use the strong–weak bond (strong bond = C,G; weak bond = A,T) language and many noncoding regions (introns, 5'-caps, 3'-tails) use purine–pyrimidine language. The application of different languages in coding and noncoding sequences is a result of evolution, and it may be related to the functional differences in these two regions. Furthermore, we find that many preferential triplets in coding sequences can be expressed in the form (*WS) (W = A,T; S = C,G), which may be explained by its relation to tRNA abundance. The systematic change of some mode contents with evolution has also been found. © 1997 Academic Press Limited
1. Introduction

DNA contains all the information on the structure and function of biological macromolecules, the control and regulation of gene expression, the development of individuals and the evolution of species. As a genetic language expressed by a one-dimensional sequence of four letters, the nucleotide sequence is an ingenious product of molecular evolution through the interaction of random mutation and natural selection. Because of strong random drift, the base composition of a coding sequence is near that of a stochastic sequence. For example, the first-order informational redundancy D1 of a typical coding sequence is of the order of 0.05 (Luo et al., 1988). However, a DNA sequence could not be a random sequence. The difference between nucleotide sequences and a random sequence is found in the base correlation which exists in the former. This correlation forms the basis of the grammatical construction of the genetic language. So the investigation into nucleotide correlation is of special importance.

In recent years many authors have discussed the correlation properties of nucleotides in DNA sequences (Rowe & Trainor, 1983; Li & Kaneko, 1992; Peng et al., 1992; Nee, 1992; Maddox, 1992; Prabhu & Claverie, 1992; Munson et al., 1992; Lio et al., 1994; Buldyrev et al., 1995; Arneodo et al., 1995). The sequence has been mapped onto a one-dimensional walk (Peng et al., 1992; Buldyrev et al., 1995) or, more completely, onto a three-dimensional walk (Li & Kaneko, 1992; Luo & Tsai, 1996) in a similar way to the earlier two-dimensional walk (Luo & Tsai, 1988), which gave a fractal analysis. Some authors characterized the symbolic sequence by decomposing it into binary sequences and then quantified the base correlation (Voss, 1992; Chechetkin & Turygin, 1994, 1996; Luo & Ji, 1995). Another direction of investigation is based on the information theoretic method (Gatlin, 1972; Luo et al., 1988; Luo & Li, 1991; Herzel & Grosse, 1995). These authors defined Markovian entropy with lag or mutual information to describe the nucleotide correlation between adjacent or nonadjacent sites in the sequences.
The preferential mode analysis (PMA) is a method to examine the preferred modes and poor modes in DNA sequences of a variety of species. The preferred modes may be related to specific codes of nucleotide sequences (Trifonov & Brendel, 1986; Nussinov, 1987; Schmitt et al., 1996). The method is of great importance in the linguistic analysis of hereditary information. In this article we shall investigate the main features of preferential modes in coding sequences and in several kinds of noncoding regions (introns, 5'-caps, 3'-tails, etc.). The results obtained on the preferred and poor modes are summarized in sections 3 and 4. We shall point out the linguistic difference between coding and noncoding sequences, and investigate the possible mechanism responsible for the preferential modes. The problem of reading-frame dependence, and the relationships between PMA and the other methods for studying base correlation (information parameter method, etc.) will also be discussed.

2. The Statistical Analysis of Preferential Modes: Method

The preferential mode analysis can be performed at the binary rather than tertiary level (Chechetkin & Turygin, 1994). The most important subdivisions of the four nucleotides into two classes are the purine–pyrimidine and the strong–weak bond classifications. So a particular fragment $\{X_j = A, C, G, T\}$ in a DNA sequence can be reduced to $\{Y_j = R, Y\}$, with R = A,G and Y = C,T, or to $\{Y_j = S, W\}$, with S = C,G and W = A,T. The notation $(Y_1, Y_2, \ldots, Y_k)_n$ is introduced to denote the consecutively occurring nucleotide fragment $(Y_1, Y_2, \ldots, Y_k)$ (which is called a mode) repeated n times. For example, for k = 3, there are eight modes for each base classification, R–Y or S–W, and for a given reading frame (Tavare & Song, 1989; Trifonov, 1987). The three independent reading frames in the coding sequence will be denoted as E1, E2 and E3, respectively. E1 represents the translation frame, which is actually used, and E2 and E3 represent the frames shifted from the former to the right by one or two bases, respectively. For noncoding sequences the reading frame can be introduced formally, in the same way as for the coding sequences. The frames will be denoted as I1, I2 and I3 for introns, C1, C2 and C3 for 5'-caps (to the left of the initiator codon) and T1, T2 and T3 for 3'-tails (to the right of the terminators). The analysis can be generalized to k = 2 and 4, 5, 6, etc., as well. We find that k = 2 and 3 are the most important cases, in which many preferential modes are found. The result is consistent with the short-range dominance of base correlation in coding sequences (Luo & Li, 1991). We shall confine ourselves to the cases k = 2 and 3 (with n = 1, 2, 3, 4, ...) in the following discussion.

The sequence data are taken from GenBank Release 67.0, where the nucleotide sequences have been classified into ten categories, namely, primate, rodent, other mammalian, other vertebrate, invertebrate, plant, bacterial, viral, phage, and organelle (chloroplast and mitochondrial). The exons, introns, 5'-caps and 3'-tails from genes varying in length from 150 to 5000 bases have been chosen in our statistical analyses. The number of nucleotides considered in our statistics is 6.8 million in all: 3.9 million for exons, 0.7 million for introns, 1.1 million for 5'-caps and 1.2 million for 3'-tails.

The main steps of our statistical analysis are summarized as follows:

(1) Calculate the occurrence frequency of each mode (mode m, say) in one sequence first, then sum the frequencies over the different sequences of a given category of species, divide by the base number and multiply by 1000. The result, denoted by $N_m$, represents the mode content (MC) of mode m per 1000 bases for a particular category of species (primate, rodent, ...).

(2) Find the normalized frequency $p(Y_j)$ of base $Y_j$ (with $p(Y_1) + p(Y_2) = 1$ for the two letters of the reduced alphabet) which occurs in the sequences of each category of species. Since the base number in each category of sequences is large enough, the normalized frequency $p(Y_j)$ can be regarded as a probability.

(3) Estimate the statistical significance of the calculated mode content. Consider an independent sequence with a given purine–pyrimidine (or strong–weak bond) content $p(Y_j)$ but no correlation between any pair of bases in the sequence. The average frequency of mode $m = (Y_1, Y_2, \ldots, Y_k)$ per 1000 bases in the assumed independent sequence is

$$\bar{N}_m = \frac{1000}{k}\, p_1, \qquad p_1 = p(Y_1)\, p(Y_2) \cdots p(Y_k) \tag{1}$$

Under the assumption of a polynomial (multinomial) distribution of independent trials the corresponding standard deviation $\sigma_m$ is given by

$$\sigma_m^2 = \frac{1000}{k}\, p_1 (1 - p_1) \tag{2}$$

Note that the average frequency $\bar{N}_m$ and deviation $\sigma_m$ of mode m in the independent sequence depend on the category of species, since the base composition of the independent sequence has been assumed to be category dependent.

(4) Define the relative content of mode (RMC)

$$W_m = \frac{N_m - \bar{N}_m}{\sigma_m} \tag{3}$$

in which $N_m$ denotes the MC of mode $(Y_1 Y_2 \ldots Y_k)$, and $\bar{N}_m$ and $\sigma_m$ are the quantities of the independent sequence given by eqns (1) and (2), respectively. If the obtained mode content $N_m$ exceeds $\bar{N}_m + \sigma_m$ then the mode m is called a preferred mode; conversely, if it is smaller than $\bar{N}_m - \sigma_m$ then m is a poor mode. So $W_m > 1$ represents a preferred mode and $W_m < -1$ represents a poor mode.

(5) Investigate the deviation of a nucleotide sequence from the independent sequence. One may sum the squared RMCs of all modes with length k and reading frame i (i = 1, ..., k) in the given sequence and define

$$U_{ki} = \left\langle (W_m)^2 \right\rangle_{ki} = 2^{-k} \sum_m (W_m)^2 \tag{4}$$
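Steps (1) to (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the procedure actually used for the GenBank statistics: the function names (`reduce_sequence`, `count_modes`, `mode_statistics`, `deviation`) are hypothetical, sequences are pooled by summing counts and base numbers, and a mode whose $\sigma_m$ vanishes is simply assigned $W_m = 0$.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}  # purine / pyrimidine
SW = {"C": "S", "G": "S", "A": "W", "T": "W"}  # strong / weak bond


def reduce_sequence(seq, table):
    """Reduce a four-letter sequence {A,C,G,T} to a two-letter one."""
    return "".join(table[base] for base in seq.upper())


def count_modes(binary_seq, k, frame):
    """Count the k-letter modes read consecutively in reading frame 1..k."""
    start = frame - 1
    stop = len(binary_seq) - k + 1
    return Counter(binary_seq[j:j + k] for j in range(start, stop, k))


def mode_statistics(binary_seqs, k, frame):
    """Return {mode: (N_m, Nbar_m, sigma_m, W_m)} pooled over sequences."""
    total_bases = sum(len(s) for s in binary_seqs)
    counts, letters = Counter(), Counter()
    for s in binary_seqs:
        counts.update(count_modes(s, k, frame))
        letters.update(s)
    p = {y: n / total_bases for y, n in letters.items()}  # step (2)
    stats = {}
    for letters_k in product(sorted(p), repeat=k):
        mode = "".join(letters_k)
        n_m = 1000.0 * counts[mode] / total_bases          # step (1)
        p1 = prod(p[y] for y in mode)
        n_bar = (1000.0 / k) * p1                          # eqn (1)
        sigma = sqrt((1000.0 / k) * p1 * (1.0 - p1))       # eqn (2)
        # Assumption: W_m is set to 0 when sigma_m vanishes.
        w_m = (n_m - n_bar) / sigma if sigma > 0 else 0.0  # eqn (3)
        stats[mode] = (n_m, n_bar, sigma, w_m)
    return stats


def deviation(stats, k):
    """U_ki of eqn (4): 2**(-k) times the sum of squared RMCs."""
    return sum(w * w for (_, _, _, w) in stats.values()) / 2 ** k


if __name__ == "__main__":
    # Toy input only; real statistics need many long sequences per category.
    seqs = [reduce_sequence("ATGGCCAAGTTCGAGCTGATG", SW)]
    for mode, (n, nbar, sigma, w) in sorted(mode_statistics(seqs, 3, 1).items()):
        label = "preferred" if w > 1 else "poor" if w < -1 else "neutral"
        print(mode, round(n, 1), round(nbar, 1), round(w, 2), label)
```

On short toy inputs the values of $W_m$ are dominated by sampling noise; the thresholds of step (4) are only meaningful for pooled data of the size described above.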
Evidently, for an ideal independent sequence the deviation $U_{ki}$ is zero because any mode content $N_m$ in the sequence is equal to $\bar{N}_m$. $U_{ki}$ may therefore be used as a measure of deviation from the independent sequence. In fact, it can be proved that $U_{ki}$ is related to the second-order informational redundancy with lag (Luo & Li, 1991). For example, $U_{2i}$ is related to the base correlation in the nearest neighboring sites, $U_{3i}$ is related to the base correlation in next-to-nearest neighboring sites, etc. From eqns (1)–(4) we have

$$U_{2i} = \frac{1000}{4 \times 2} \sum_{ab} \frac{\left(p_{ab}^{(i)} - p_a p_b\right)^2}{p_a p_b \left(1 - p_a p_b\right)} \tag{5}$$

where $p_{ab}^{(i)}$ is the joint probability (normalized frequency) of a pair of bases a and b occurring in the i-th reading frame, namely, occurring in odd–even positions for i = 1 and even–odd positions for i = 2. $p_a$ and $p_b$ are the corresponding normalized frequencies of a single base in the sequence (which are equal to the base probabilities in the independent sequence). On the other hand, the second-order informational redundancy is

$$D_2 = -2 \sum_a p_a \log_2 p_a + \sum_{ab} p_{ab} \log_2 p_{ab} \approx \frac{1}{\ln 2} \sum_{ab} \frac{\left(p_{ab} - p_a p_b\right)^2}{p_a p_b} \tag{6}$$

So, if the reading frame is taken into account in the definition of informational redundancy, then $D_2$, which describes the base correlation in adjacent sites, equals $U_{2i}$ as defined in this paper, apart from an unimportant factor. Similarly, $U_{3i}$ is related to $D_2$ with lag [or $D_3$ in the notation of Luo & Li (1990)], which describes the correlation of next-to-nearest neighboring bases, since

$$U_{3i} = \frac{1000}{8 \times 3} \sum_{acb} \frac{\left(p_{acb}^{(i)} - p_a p_c p_b\right)^2}{p_a p_c p_b \left(1 - p_a p_c p_b\right)} \tag{7}$$

and

$$D_3 = -2 \sum_a p_a \log_2 p_a + \sum_{ab} p_{a \cdot b} \log_2 p_{a \cdot b} \approx \frac{1}{\ln 2} \sum_{ab} \frac{\left(p_{a \cdot b} - p_a p_b\right)^2}{p_a p_b} \tag{8}$$

where $p_{a \cdot b}$ is the joint probability of a pair of bases a and b occurring in next-to-nearest neighboring sites in the sequence, $p_{a \cdot b} = \sum_c p_{acb}$. Equation (8) is nearly the same as eqn (7).

3. Results and Discussions on Preferential Modes in Coding Sequences

Tables 1 and 2 summarize the relative mode content of preferred and poor modes in coding sequences. To save space, the tables include only those modes whose RMC is, for most species, clearly larger than 1 or smaller than −1. We have found many preferred and poor modes in coding regions. For example, in R–Y language, the triplet modes E1-RRR, E1-RYY, E2-YYR, E3-YRR and the doublet modes Ei-RR, Ei-YY (i = 1, 2) are important preferred modes; the triplet modes E1-YRR, E2-RRY, E3-RYR and E2-YRY, and the doublet modes Ei-RY, Ei-YR (i = 1, 2) are important poor modes. Taking the repetition of modes into account, we have found in our statistics that E1-RYY repeated twice (n = 2) is still a preferred mode, so E1-RYY and E2-YYR essentially represent the same mode; E1-YRR repeated two or three times (n = 2, 3) is still a poor mode, so E1-YRR, E2-RRY and E3-RYR are the same mode, too. Similar analyses can be done for the W–S classification of bases. In W–S language, for many species the triplet mode E1-SWS and its cyclic forms E2-WSS and E3-SSW, the triplet mode E1-WWS and its cyclic forms E2-WSW and E3-SWW, and the doublet modes Ei-SW, Ei-WS (i = 1, 2) are important preferred modes; the triplet mode E2-SSS, the triplet mode E2-SWW and its cyclic form E3-WWS, and the doublet modes Ei-SS, Ei-WW (i = 1, 2) are important poor modes.
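The statement that a repeated mode in frame E1 reappears as its cyclic forms in frames E2 and E3 can be checked mechanically. The following is a small sketch with hypothetical helper names (`cyclic_frames`, `modes_in_frame`); it only illustrates the bookkeeping of frames and says nothing about which modes are preferred.

```python
def cyclic_frames(mode):
    """Modes seen in frames 1..k when `mode` repeats head to tail in frame 1."""
    k = len(mode)
    return {frame: mode[frame - 1:] + mode[:frame - 1]
            for frame in range(1, k + 1)}


def modes_in_frame(seq, k, frame):
    """List the complete k-letter modes read in the given frame (1-based)."""
    return [seq[j:j + k] for j in range(frame - 1, len(seq) - k + 1, k)]


if __name__ == "__main__":
    for mode in ("RYY", "YRR", "SWS", "WWS"):
        print(mode, cyclic_frames(mode))
    # Reading the repeated fragment (YRR)_2 directly in frame 2:
    print(modes_in_frame("YRR" * 2, 3, 2))
```

Reading (YRR) repeated twice in frame 2 gives RRY, and a third frame shift gives RYR, which is why the three entries E1-YRR, E2-RRY and E3-RYR are treated as one mode in the text.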
Table 1. Relative mode content of preferential modes in coding sequences (triplets). Each row is a category of species (PRI primate, ROD rodent, MAM other mammalian, VRT other vertebrate, INV invertebrate, PLN plant, BCT bacterial, VRL viral, PHG phage, ORG organelle); each column is a mode.
Category E1-YYY E1-YRY E1-YRR E1-RYY E1-RYR E1-RRY E1-RRR E2-YYR E2-YRY E2-RRY E2-RRR E3-YYY E3-YRR E3-RYR E3-RRR E1-SSS E1-SSW E1-SWS E1-SWW E1-WSS E1-WSW E1-WWS E2-SSS E2-SSW E2-SWS E2-SWW E2-WSS E2-WSW E3-SSS E3-SSW E3-SWS E3-SWW E3-WWS
PRI 1.69 −1.42 −2.38 1.24 −1.33 0.97 1.91 1.49 −2.15 −1.73 1.30 1.82 1.15 −2.87 1.68 −1.50 −1.27 4.41 −1.12 −0.57 −1.09 1.85 −1.96 −0.06 0.01 −2.47 2.80 3.56 −1.68 2.50 1.04 2.37 −2.76
ROD 1.61 −1.51 −2.60 1.59 −1.04 0.92 1.83 1.89 −2.30 −1.84 1.12 1.71 1.18 −3.03 1.52 −1.73 −1.12 4.69 −1.29 −0.88 −0.92 2.20 −2.41 −0.15 0.40 −2.50 2.75 4.19 −1.90 2.22 1.02 2.96 −2.83
MAM 1.51 −1.48 −2.81 1.83 −1.40 1.33 1.90 2.24 −2.36 −1.99 1.15 1.62 1.82 −3.15 1.39 −1.43 −2.15 5.57 −1.63 −0.74 −2.00 3.02 −1.89 −0.21 −1.08 −3.20 4.10 4.67 −1.64 3.81 0.64 3.70 −3.49
VRT 1.57 −1.26 −3.32 2.07 −1.34 1.23 2.06 2.67 −2.33 −2.21 1.05 1.70 2.09 −3.19 1.21 −1.51 −0.59 4.20 −0.54 −1.30 −2.09 2.73 −2.16 −0.61 −0.05 −2.65 3.26 3.70 −1.36 2.42 −0.19 3.26 −3.19
INV 0.90 −0.85 −3.43 1.94 −1.26 1.63 1.61 2.40 −1.92 −1.71 0.04 0.95 2.65 −2.84 0.62 −0.89 1.05 1.96 0.33 −1.47 −2.30 1.52 −1.18 −1.17 0.47 −1.74 2.56 0.93 −0.19 1.59 −1.49 1.23 −2.26
PLN 0.62 −2.22 −3.19 3.19 −1.83 2.10 1.58 3.58 −2.10 −1.72 0.18 1.09 3.00 −2.95 0.69 −1.27 1.69 0.82 1.65 −1.76 −1.64 1.85 −1.38 −1.65 0.80 −0.82 1.93 0.82 −0.28 0.91 −1.42 0.61 −1.92
BCT −0.79 −0.92 −3.46 2.68 0.14 2.51 −0.08 2.83 −1.16 −2.55 −0.92 −0.09 2.87 −2.83 −0.43 0.59 −0.52 2.28 1.36 −1.96 −3.56 0.81 0.39 −1.75 −1.48 −2.58 2.75 0.32 0.66 2.46 −2.48 1.06 −2.99
VRL 1.13 −1.45 −3.03 1.39 0.03 0.71 1.46 1.92 −1.34 −2.00 0.51 1.35 1.20 −2.76 0.98 −0.15 −0.19 1.67 0.47 −1.11 −0.97 0.18 −0.30 −0.95 0.14 −1.29 1.05 0.78 −0.05 0.80 −0.60 0.45 −1.42
PHG 0.55 −0.69 −2.09 1.66 −0.78 0.99 0.97 2.37 −1.51 −0.75 −0.26 0.19 2.17 −1.80 −0.18 0.10 0.95 0.47 1.28 −1.02 −2.09 −0.12 0.05 −0.97 −0.26 −1.03 1.13 −0.70 0.50 0.72 −1.49 −0.24 −1.63
ORG 0.92 −2.07 −3.19 1.58 0.47 1.37 1.08 1.64 −1.57 −2.36 0.24 1.56 1.10 −2.99 1.35 −0.11 4.91 −1.86 1.40 −1.95 −1.48 −1.42 −0.02 −2.01 2.36 0.51 −0.32 −2.63 0.78 −0.95 −2.54 −2.22 −1.00
Evidently the above-mentioned rules on doublet preference (preferred RR, YY, SW, WS) are closely related to those of triplet preference (preferred RRR, RYY, YYR, SWS, etc.). Furthermore, by comparison of the R–Y and W–S classifications of bases we find that for higher species the preference of preferred modes and poor modes in W–S language is higher than that in R–Y language. If the variation of RMC for a given mode in different species is observed, we find that the values for the same mode sometimes vary widely across the categories of species. For example, the contents of E1-SWS and E2-WSW increase from −2 for lower organisms to 4 for higher species. This indicates an evolutionary correlation of the RMC values of these modes.

Table 2. Relative mode content of preferential modes in coding sequences (doublets). Each row is a category of species; each column is a mode.
Category E1-YY E1-YR E1-RY E1-RR E2-YY E2-YR E2-RY E2-RR E1-SS E1-SW E1-WS E1-WW E2-SS E2-SW E2-WS E2-WW
PRI 1.43 −1.50 −1.30 1.26 1.37 −1.24 −1.44 1.14 −1.36 1.49 1.36 −1.62 −1.31 1.29 1.46 −1.64
ROD 1.31 −1.40 −1.15 1.13 1.36 −1.20 −1.45 1.09 −1.62 1.70 1.63 −1.84 −1.68 1.67 1.77 −1.98
MAM 1.29 −1.49 −1.03 1.14 1.30 −1.03 −1.50 1.08 −1.18 1.33 1.24 −1.53 −1.32 1.38 1.50 −1.77
VRT 1.26 −1.31 −1.09 1.03 1.28 −1.10 −1.33 0.98 −1.32 1.64 1.08 −1.53 −1.16 0.91 1.49 −1.43
INV 0.73 −0.70 −0.67 0.55 0.73 −0.67 −0.70 0.50 −0.17 0.13 0.20 −0.26 −0.31 0.34 0.28 −0.46
PLN 0.82 −0.85 −0.76 0.72 0.82 −0.76 −0.85 0.67 −0.55 0.52 0.51 −0.55 −0.57 0.52 0.56 −0.62
BCT −0.23 0.18 0.26 −0.28 −0.20 0.23 0.16 −0.30 0.60 −0.60 −0.61 0.55 0.53 −0.55 −0.52 0.43
VRL 0.85 −0.78 −0.85 0.73 0.87 −0.88 −0.80 0.72 −0.29 0.31 0.26 −0.34 −0.23 0.18 0.26 −0.31
PHG 0.35 −0.49 −0.18 0.23 0.36 −0.19 −0.50 0.19 0.46 −0.34 −0.49 0.29 0.35 −0.40 −0.23 0.15
ORG 0.83 −0.90 −0.82 0.81 0.74 −0.73 −0.80 0.67 0.81 −0.43 −0.87 0.49 0.80 −0.86 −0.42 0.44

It is worth noting that in W–S language many preferred triplet modes for higher species and for a part of the lower organisms can be put in a unified form, namely (*WS): the second letter in the codon triplet takes W and the third letter takes S (see Table 1). Lio et al. (1994) pointed to the third-codon G + C periodicity as a possible signal for an internal selective constraint. Now, through the statistics on more abundant data, we find not only the S-preference in the third codon position, but also the W-preference in the second codon position. To understand the meaning of the obtained result we rearrange the code table in the following form (Table 3).

Table 3. Genetic code showing (*WS) preference. Rows are grouped by the first codon position; within each group the third position runs U, A, G, C. Columns give the second codon position. Codons of the form (*WS) are labeled with *.
First  Third  Second U  Second A  Second G  Second C
U      U      PHE       TYR       CYS       SER
U      A      LEU       stop      stop      SER
U      G      LEU*      stop      TRP       SER
U      C      PHE*      TYR*      CYS       SER
A      U      ILE       ASN       SER       THR
A      A      ILE       LYS       ARG       THR
A      G      MET*      LYS*      ARG       THR
A      C      ILE*      ASN*      SER       THR
G      U      VAL       ASP       GLY       ALA
G      A      VAL       GLU       GLY       ALA
G      G      VAL*      GLU*      GLY       ALA
G      C      VAL*      ASP*      GLY       ALA
C      U      LEU       HIS       ARG       PRO
C      A      LEU       GLN       ARG       PRO
C      G      LEU*      GLN*      ARG       PRO
C      C      LEU*      HIS*      ARG       PRO

The (*WS) preference means that amino acids in the two left-hand columns of the table (second codon position U or A) prefer using the synonymous codons labeled with *; namely, the codons (*WS) are used in preference to (*WW) for the same amino acid. The result is consistent with the codon usage data (Wada et al., 1990). For example, the ratio of synonymous codon usages (*WS):(*WW) is 2.1 for primates, but the ratio (*SS):(*SW) is only 1.3 for the same category. So the G + C preference in the third codon position is statistically obvious only when the second position takes A or T. Another implication of the (*WS) preference is that, comparing codons of different amino acids, W occurs in the second position more frequently than S. For example, one has the frequency ratio (SW*):(WW*):(SS*):(WS*) = 1.66:1.23:1.09:1.0 for amino acids in primates.

Ikemura (1981) assumed that the codon usage is constrained mainly by the availability of the corresponding tRNA species. He found a strong correlation between tRNA abundance and the choice of codons among synonymous codons. For example, the relative frequencies of usage of the codons for leucine are correlated with the abundance of the corresponding cognate tRNA species in Escherichia coli. The proposal can be generalized to codons corresponding to different amino acids. So, from the generalized Ikemura assumption one can predict the distribution of copy numbers of the various species of tRNA from codon frequencies. Namely, we predict that tRNA whose anticodon triplet has S (G or C) as its first base and W (A or T) as its second base is relatively more abundant than in other cases. The prediction is confirmed by new data on tRNA copy numbers of E. coli. The genomic organization and physical mapping of the tRNA genes have been exhaustively surveyed by Komine et al. (1990). The total complement is 78 copies for 45 tRNA (or 41 anticodon) species (in addition to a gene for selenocysteine tRNA). The distribution of nucleotides in anticodons is listed in Table 4. In the table the copy numbers are given for each case (the number in brackets refers to the species of tRNA). We find that the frequency of the 2nd position of a codon being W and the 1st and 3rd positions being S is much larger than in other cases. The result is just what is predicted from the (SWS) preference for E. coli (as seen from Table 1). On the other hand, from the copy number data we see that the abundance of tRNA with the third base of its anticodon being C or T is clearly larger than with G or A (46:32). This means that the frequency of the 1st position of a codon being R is larger than that of its being Y. The point can also be predicted from the (RYY) and (RRY) preference in E. coli (see Table 1).

Table 5 summarizes the deviation of coding regions from independent sequences. The deviations in the S–W classification of bases are generally larger than those in the R–Y classification (for higher species). This shows again that exons preferentially use S–W language.
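The tallies quoted from the anticodon copy numbers can be reproduced with simple arithmetic. The sketch below transcribes the copy numbers of Table 4 (without the bracketed species counts); the dictionary layout and function names are illustrative assumptions.

```python
# tRNA gene copy numbers by anticodon position, transcribed from Table 4.
copies = {
    "1st": {"G": 31, "C": 20, "T": 23, "A": 4},
    "2nd": {"G": 16, "C": 16, "T": 21, "A": 25},
    "3rd": {"G": 19, "C": 24, "T": 22, "A": 13},
}


def strong_weak(row):
    """Return (S, W) totals, with S = G + C and W = T + A."""
    return row["G"] + row["C"], row["T"] + row["A"]


def purine_pyrimidine(row):
    """Return (R, Y) totals, with R = A + G and Y = C + T."""
    return row["A"] + row["G"], row["C"] + row["T"]


if __name__ == "__main__":
    for position, row in copies.items():
        s, w = strong_weak(row)
        r, y = purine_pyrimidine(row)
        print(position, "S:W =", (s, w), "R:Y =", (r, y), "total =", sum(row.values()))
```

Each position sums to the 78 gene copies stated in the text, the S and W columns of Table 4 follow from the single-base columns, and the third anticodon position gives Y:R = 46:32.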
Table 4. Distribution of bases in anticodons (E. coli). Entries are gene copy numbers; the number in brackets is the number of tRNA species.
Position       G         C         T         A         S    W
1st position   31 (18)   20 (14)   23 (12)   4 (1)     51   27
2nd position   16 (12)   16 (10)   21 (9)    25 (14)   32   46
3rd position   19 (11)   24 (10)   22 (14)   13 (10)   43   35

Table 5. Deviation of exons from independent sequence. Each row is a category of species; the first five columns are for the R–Y classification and the last five for the S–W classification.
Category R–Y:U31 R–Y:U32 R–Y:U33 R–Y:U21 R–Y:U22 S–W:U31 S–W:U32 S–W:U33 S–W:U21 S–W:U22
PRI 2.41 1.71 2.22 1.91 1.71 3.81 4.27 3.49 2.15 2.06
ROD 2.56 1.95 2.31 1.57 1.65 4.48 5.41 4.14 2.91 3.19
MAM 3.03 2.18 2.61 1.57 1.55 6.86 7.30 6.31 1.77 2.28
VRT 3.56 2.51 2.93 1.40 1.40 4.43 5.26 4.23 2.00 1.62
INV 3.06 2.00 2.64 0.45 0.43 1.97 1.78 1.51 0.04 0.13
PLN 4.53 3.06 3.73 0.63 0.61 2.42 1.83 1.27 0.28 0.33
BCT 3.39 3.04 3.05 0.06 0.05 3.20 3.01 3.11 0.36 0.26
VRL 2.17 1.49 1.99 0.65 0.68 0.66 0.68 0.64 0.09 0.07
PHG 1.39 1.19 1.23 0.12 0.11 1.04 0.89 0.91 0.17 0.09
ORG 2.66 2.16 2.77 0.72 0.54 4.72 2.70 4.17 0.47 0.44

Moreover, the deviations in triplets are larger than in the doublet case. Evidently, the large deviation of S–W triplets is related to the encoding capacity of exons. Besides, the evolutionary dependence of $U_{ki}$ is also indicated by the large variation of these values across the categories of species. Another point that can be seen in Table 5 is the frame dependence of $U_{3i}$. The inequality between $U_{31}$, $U_{32}$ and $U_{33}$ reflects the existence of a reading frame in exons.

4. Results and Discussions on Preferential Modes in Noncoding Sequences

Tables 6, 7 and 8 summarize the relative mode content of preferred and poor modes in noncoding sequences. To save space, only those modes are listed whose RMCs for most species are larger than 1 or smaller than −1. The RMCs for a large portion of S–W modes are lower than the fluctuation bound and have been omitted from these tables. First, we find that no preferential reading frame exists in introns or in 5'-caps and
3'-tails, since the contents of the same mode in different frames are nearly the same, for example, I1-SWS ≈ I2-SWS ≈ I3-SWS. All these noncoding regions (introns, 5'-caps and 3'-tails) preferentially use R–Y language. For higher species, RRR, YYY, RR and YY are important preferred modes and RYR, YRY, RY and YR are important poor modes, both for introns and for caps and tails. As for W–S language, only SWS is an important preferential mode with a relatively large mode content. Since the content of the ordered fragment RRR or YYY (or RR, YY) repeated 2–4 times also exceeds the fluctuation bound, it seems that the noncoding regions preferentially store information in consecutive R or Y. These frequently occurring modes are possibly correlated with DNA helical parameters (Lennon & Nussinov, 1985).

Table 6. Relative mode content of preferential modes in introns. Each row is a category of species; each column is a mode.
Category I1-YYY I1-YRY I1-RYR I1-RRR I2-YYY I2-YRY I2-RYR I2-RRR I3-YYY I3-YRY I3-RYR I3-RRR I1-YY I1-YR I1-RY I1-RR I2-YY I2-YR I2-RY I2-RR I1-SWS I1-WWS I2-SWS I2-WSW I3-SWS I3-WSW
PRI 3.26 −1.96 −1.84 2.98 3.24 −2.02 −1.73 2.97 3.16 −1.84 −1.90 3.04 2.36 −2.42 −2.31 2.30 2.33 −2.27 −2.38 2.22 1.67 −1.14 1.74 0.78 1.71 0.89
ROD 2.97 −1.60 −1.46 2.84 3.02 −1.65 −1.43 2.68 3.04 −1.61 −1.57 2.88 2.10 −2.32 −1.91 2.06 2.07 −1.87 −2.29 1.97 2.13 −0.74 2.15 1.41 1.92 1.46
MAM 3.19 −1.76 −1.96 3.07 3.09 −1.81 −1.97 2.96 3.34 −1.85 −1.64 3.11 2.32 −2.61 −2.00 2.21 2.38 −2.05 −2.66 2.22 1.58 −1.09 1.56 0.75 1.38 1.02
VRT 2.29 −1.07 −0.65 2.02 2.18 −1.12 −0.73 1.88 2.16 −0.79 −0.87 1.90 1.40 −1.54 −1.31 1.37 1.34 −1.24 −1.47 1.25 1.09 −0.88 1.28 0.47 1.26 0.44
INV 2.69 −1.61 −0.20 1.76 2.85 −1.46 0.05 1.83 2.77 −1.35 −0.15 1.90 1.36 −1.53 −1.37 1.48 1.45 −1.47 −1.63 1.54 0.29 −1.22 0.15 −1.25 0.10 −1.28
PLN 0.99 −0.47 0.21 0.59 0.80 −0.14 −0.06 0.65 0.84 −0.24 −0.55 0.85 0.41 −0.45 −0.41 0.35 0.52 −0.53 −0.56 0.40 0.19 −0.60 0.68 −0.35 0.33 −0.00
BCT 0.61 −0.21 −0.25 0.26 0.66 −0.45 −0.43 0.42 0.52 −0.75 −0.61 0.22 0.44 −0.44 −0.44 0.26 0.40 −0.41 −0.38 0.14 0.05 −0.54 −1.06 −1.03 −0.74 −1.01
VRL 1.33 −0.59 −1.04 1.06 1.46 −0.72 −0.68 1.29 0.97 −0.19 −0.43 0.96 1.03 −1.18 −0.87 0.92 0.83 −0.67 −0.97 0.66 −0.40 −0.60 0.14 −0.25 0.16 0.07
PHG 0.34 −0.19 −0.54 0.62 0.42 0.07 −1.02 0.53 0.70 −0.33 −0.64 1.15 0.32 −0.23 −0.38 0.20 0.82 −0.85 −0.70 0.59 −0.18 −1.17 −0.21 −1.12 −0.19 −1.18
ORG 1.59 −0.97 −0.72 1.24 1.42 −0.89 −0.68 1.05 1.59 −0.64 −1.13 1.38 1.07 −1.12 −0.95 0.91 1.13 −1.02 −1.16 0.93 −0.49 −1.27 0.04 −1.76 −0.20 −1.25

Table 7. Relative mode content of preferential modes in 5'-caps. Each row is a category of species; each column is a mode.
Category C1-YYY C1-YRY C1-RYR C1-RRR C2-YYY C2-YRY C2-RYR C2-RRR C3-YYY C3-YRY C3-RYR C3-RRR C1-SWS C2-SWS C3-SWS C1-YY C1-YR C1-RY C1-RR C2-YY C2-YR C2-RY C2-RR
PRI 2.77 −1.60 −1.55 2.55 2.78 −1.53 −1.63 2.46 2.83 −1.49 −1.57 2.41 1.10 0.98 0.95 2.07 −2.03 −2.06 1.89 2.00 −2.00 −1.96 1.75
ROD 2.56 −1.58 −1.29 2.36 2.62 −1.25 −1.50 2.48 2.65 −1.36 −1.52 2.34 1.71 1.72 1.66 1.91 −1.98 −1.86 1.80 1.84 −1.78 −1.91 1.64
MAM 2.42 −1.32 −1.30 2.28 2.78 −1.29 −1.43 2.19 2.57 −1.19 −1.53 2.22 1.28 1.31 1.65 1.82 −2.31 −1.26 1.60 1.89 −1.31 −2.40 1.55
VRT 1.78 −0.64 −0.69 1.78 1.96 −0.70 −0.61 1.73 1.94 −0.74 −0.81 1.40 0.70 0.61 0.44 1.20 −1.34 −1.08 1.09 1.19 −1.07 −1.32 0.98
INV 1.10 −0.08 −0.49 0.97 1.07 −0.48 −0.25 0.67 1.19 −0.50 −0.31 0.79 0.15 −0.07 0.30 0.61 −0.65 −0.56 0.47 0.73 −0.69 −0.76 0.50
PLN 1.57 −0.41 −0.68 1.51 1.38 −0.68 −0.47 1.21 1.63 −0.50 −0.57 1.35 −0.01 0.01 −0.06 0.95 −1.00 −0.93 0.84 0.99 −0.97 −1.03 0.79
BCT 0.28 0.18 −0.58 0.11 0.43 −0.57 0.01 0.20 0.51 −0.29 −0.36 0.01 0.46 −0.32 −0.20 0.27 −0.33 −0.21 0.09 0.34 −0.25 −0.41 0.04
VRL 1.21 −0.78 −1.00 1.31 1.35 −1.04 −0.87 1.30 1.43 −0.73 −1.24 1.24 −0.01 −0.79 −0.67 1.11 −1.03 −1.20 1.00 1.06 −1.15 −0.99 0.88
PHG 0.67 −0.67 −1.13 0.57 0.58 −0.87 −0.35 0.25 0.92 −0.72 −1.38 0.70 0.09 0.29 −0.30 0.68 −0.87 −0.46 0.48 0.85 −0.61 −1.04 0.50
ORG 1.74 −1.31 −0.85 1.43 1.84 −0.93 −1.00 1.39 1.87 −0.87 −1.04 1.27 −0.39 −0.65 −0.63 1.36 −1.44 −1.25 1.17 1.20 −1.08 −1.29 0.94

Table 8. Relative mode content of preferential modes in 3'-tails. Each row is a category of species; each column is a mode.
Category T1-YYY T1-YRY T1-RYR T1-RRR T2-YYY T2-YRY T2-RYR T2-RRR T3-YYY T3-YRY T3-RYR T3-RRR T1-SWS T2-SWS T3-SWS T1-YY T1-YR T1-RY T1-RR T2-YY T2-YR T2-RY T2-RR
PRI 2.57 −1.53 −1.44 2.47 2.62 −1.32 −1.45 2.50 2.63 −1.48 −1.40 2.40 1.51 1.51 1.46 1.78 −1.94 −1.70 1.82 1.92 −1.87 −2.05 1.77
ROD 2.45 −1.36 −1.20 2.33 2.45 −1.37 −1.82 2.24 2.44 −1.29 −1.31 2.19 1.89 1.93 1.96 1.60 −1.72 −1.56 1.64 1.74 −1.74 −1.83 1.59
MAM 2.44 −1.42 −1.17 2.43 2.55 −1.43 −1.35 2.40 2.53 −1.35 −1.55 2.27 1.03 1.50 1.21 1.76 −1.90 −1.73 1.82 1.76 −1.75 −1.86 1.61
VRT 1.75 −0.50 −0.70 1.65 1.88 −0.67 −0.52 1.60 1.95 −0.69 −0.69 1.66 1.15 1.30 1.10 1.09 −1.31 −0.90 1.06 1.19 −1.04 −1.36 0.96
INV 1.25 −0.05 0.00 0.98 1.32 −0.13 −0.41 0.88 1.40 −0.32 −0.20 0.92 0.29 0.01 0.37 0.72 −0.68 −0.77 0.68 0.74 −0.83 −0.67 0.53
PLN 1.19 −0.25 −0.26 1.06 1.15 −0.20 −0.50 0.92 1.39 −0.39 −0.32 0.92 −0.09 0.01 0.20 0.68 −0.65 −0.72 0.63 0.75 −0.83 −0.67 0.44
BCT 0.40 −1.09 −0.35 0.24 0.39 −0.70 −1.28 0.09 0.56 −0.92 −0.90 0.08 −0.88 −0.72 −0.46 0.32 −0.22 −0.44 0.28 0.36 −0.47 −0.25 0.11
VRL 1.51 −0.42 −0.91 1.37 1.45 −0.32 −0.33 1.34 1.57 −0.31 −0.42 1.33 −0.08 0.08 0.39 1.16 −1.24 −1.07 1.12 1.19 −1.11 −1.27 0.98
PHG 1.36 −0.70 −1.47 1.02 0.85 −1.19 −0.25 0.70 1.06 −0.72 −0.75 0.55 −0.78 −0.31 −0.73 0.82 −0.88 −0.77 0.78 0.94 −0.92 −0.98 0.71
ORG 1.25 −0.80 −0.53 0.88 1.53 −1.09 −0.16 0.78 1.48 0.11 −0.16 0.89 −0.68 −0.24 0.08 0.58 −0.42 −0.73 0.54 0.76 −0.93 −0.58 0.56

Tables 9, 10 and 11 summarize the deviations of noncoding regions from independent sequences. We find that for the three kinds of noncoding regions, the deviations in R–Y language are generally larger than those in S–W language, apart from some exceptions in lower organisms. In addition, the deviations of R–Y language in noncoding regions are larger for eukaryotes than for prokaryotes. This shows the evolutionary correlation of these deviations. Furthermore, for higher species using R–Y language, the deviations in doublets are larger than those in triplets for introns and for caps and tails. Especially for introns, the R–Y doublets show a very high deviation from independent sequences. This seems to indicate that for higher organisms the storage of information in noncoding sequences through R–Y language preferentially uses doublet encoding.

Table 9. Deviation of introns from independent sequence. Each row is a category of species; the first five columns are for the R–Y classification and the last five for the S–W classification.
Category R–Y:U31 R–Y:U32 R–Y:U33 R–Y:U21 R–Y:U22 S–W:U31 S–W:U32 S–W:U33 S–W:U21 S–W:U22
PRI 3.56 3.52 3.51 5.52 5.31 0.75 0.80 0.79 0.37 0.47
ROD 2.96 2.89 3.10 4.44 4.24 1.20 1.21 1.05 1.54 1.45
MAM 3.54 3.37 3.68 5.30 5.48 0.79 0.69 0.71 0.13 0.39
VRT 1.61 1.44 1.44 1.99 1.77 0.39 0.44 0.43 0.11 0.14
INV 2.08 2.32 2.19 2.07 2.33 1.35 1.29 1.37 1.31 1.46
PLN 0.29 0.21 0.27 0.16 0.26 0.07 0.08 0.09 0.00 0.00
BCT 0.09 0.14 0.10 0.16 0.12 1.01 1.31 1.20 1.40 1.55
VRL 0.61 0.72 0.41 1.02 0.63 0.13 0.18 0.16 0.11 0.05
PHG 0.24 0.24 0.53 0.08 0.56 0.31 0.32 0.59 0.11 0.07
ORG 0.77 0.59 0.84 1.04 1.13 1.39 1.39 1.25 1.84 1.72

Why do exons preferentially use S–W language, while noncoding regions preferentially use R–Y language, for higher species? For the S–W representation, the sentences of the genetic language in the two strands of DNA are precisely the same; but for the R–Y representation, the two strands express information differently in general unless some symmetry exists. Therefore, W–S language seems more convenient for the purpose of translation and proofreading. Perhaps this may explain the preference for W–S language in coding regions. As for noncoding sequences, their functions are mainly regulation and control of gene expression. The secondary structure of introns (which is determined mainly by the hydrogen bonds between purine and pyrimidine) and the recognition of some sites in 5'-caps by enzymes and repressors, etc. are important factors in the splicing reaction and the regulating process. They all require the R–Y rather than the S–W representation of the language. The recognized sites in a DNA sequence are related to the local deviation of the double-helix structure, and the latter is caused by the steric repulsion between purine bases in consecutive base pairs but on opposite strands (Calladine, 1982; Dickerson, 1983). So the R–Y sequence can describe the local deviations of helix structure and determine the appropriate sites for recognition by proteins, but the S–W sequence cannot. On the other hand, using different languages in coding and noncoding sequences may help to avoid confusion between these two regions. Of course, the differentiation of two languages in DNA sequences is an outcome of evolution. So the preference for S–W language or R–Y language is marked only for eukaryotes and higher species.

The linguistic difference between coding and noncoding sequences can be manifested by informational redundancies. We calculate the informational parameters D1 (which describes the deviation of the base composition from random usage) and D2 (which describes the nucleotide correlation, namely the grammatical construction of the genetic language) of each sequence in three kinds of reduced languages, namely, in R–Y, S–W and AC–GT language, respectively, and then take the average over a given category of species. To avoid large fluctuations due to the shortness of some sequences, only those with length longer than 600 are taken into account. The averaged D1 and D2 in categories Pri and Bct (relative values) are shown in Table 12. From Table 12 we find that for primates the informational redundancy D2 of exons (E) in S–W language is larger than that in the other two languages, but D2 of introns (I), caps (C) and tails (T) in R–Y language is larger than in the other two cases.
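The averaging of D2 over sequences of one category can be sketched as follows, using the exact (left-hand) form of eqn (6) for a sequence already reduced to a two-letter language. Several choices here are assumptions rather than details given in the text: adjacent pairs are counted with overlap and without regard to reading frame, single-base frequencies are taken from the same sequence, and the names `d2` and `mean_d2` are hypothetical. D1 is not computed because its definition is not restated in this section.

```python
from collections import Counter
from math import log2


def d2(binary_seq):
    """D2 = -2 * sum_a p_a log2 p_a + sum_ab p_ab log2 p_ab, as in eqn (6)."""
    n = len(binary_seq)
    p1 = [c / n for c in Counter(binary_seq).values()]
    # Assumption: overlapping adjacent pairs, ignoring the reading frame.
    pairs = Counter(binary_seq[j:j + 2] for j in range(n - 1))
    p2 = [c / (n - 1) for c in pairs.values()]
    return -2.0 * sum(p * log2(p) for p in p1) + sum(p * log2(p) for p in p2)


def mean_d2(binary_seqs, min_length=600):
    """Average D2 over the sequences longer than `min_length`."""
    values = [d2(s) for s in binary_seqs if len(s) > min_length]
    if not values:
        raise ValueError("no sequence exceeds the minimum length")
    return sum(values) / len(values)


if __name__ == "__main__":
    strictly_alternating = "RY" * 400   # neighbours fully correlated
    no_pair_preference = "RRYY" * 200   # all four pairs nearly equally frequent
    print(d2(strictly_alternating), d2(no_pair_preference))
```

The length cutoff mirrors the restriction to sequences longer than 600 stated above; whether the original averages weighted sequences equally is not specified, so the unweighted mean is also an assumption.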
Table 10. Deviation of 5'-caps from independent sequence. Each row is a category of species; the first five columns are for the R–Y classification and the last five for the S–W classification.
Category R–Y:U31 R–Y:U32 R–Y:U33 R–Y:U21 R–Y:U22 S–W:U31 S–W:U32 S–W:U33 S–W:U21 S–W:U22
PRI 2.57 2.52 2.52 4.07 3.73 0.64 0.66 0.72 0.01 0.01
ROD 2.20 2.33 2.28 3.58 3.24 0.82 0.76 0.77 0.78 0.78
MAM 2.00 2.25 2.13 3.21 3.38 0.83 0.77 0.95 0.05 0.11
VRT 1.10 1.19 1.03 1.40 1.32 0.73 0.94 0.83 0.19 0.13
INV 0.39 0.29 0.38 0.33 0.46 0.82 0.80 0.78 0.70 0.81
PLN 0.83 0.61 0.81 0.87 0.90 0.10 0.12 0.13 0.11 0.14
BCT 0.08 0.12 0.13 0.06 0.08 0.81 0.86 0.82 1.00 1.02
VRL 0.65 0.73 0.75 1.19 1.06 0.52 0.58 0.65 0.31 0.43
PHG 0.43 0.20 0.53 0.42 0.61 0.39 0.32 0.38 0.09 0.08
ORG 0.99 1.00 0.97 1.72 1.29 1.54 1.97 1.76 2.70 2.29
Table 11. Deviation of 3'-tails from independent sequence

3'-tail  R–Y                                S–W
         U31    U32    U33    U21    U22    U31    U32    U33    U21    U22
PRI      2.30   2.34   2.31   3.30   3.65   0.59   0.55   0.57   0.20   0.15
ROD      2.03   1.98   1.96   2.67   3.00   0.89   0.89   0.93   0.90   0.95
MAM      2.11   2.21   2.15   3.26   3.07   0.55   0.73   0.67   0.01   0.04
VRT      1.00   1.08   1.16   1.22   1.32   0.32   0.34   0.37   0.21   0.14
INV      0.51   0.48   0.59   0.51   0.49   0.78   0.68   0.90   1.04   0.93
PLN      0.46   0.41   0.53   0.45   0.48   0.09   0.11   0.09   0.07   0.12
BCT      0.06   0.05   0.08   0.10   0.11   1.02   1.08   1.01   1.37   1.53
VRL      0.82   0.79   0.82   1.33   1.31   0.27   0.30   0.24   0.18   0.09
PHG      0.80   0.48   0.53   0.66   0.80   0.61   0.52   0.42   0.89   0.74
ORG      0.50   0.62   0.63   0.34   0.52   4.03   3.75   3.39   4.39   4.22
Table 12. D1 and D2 in three languages (E = exons, I = introns, C = 5'-caps, T = 3'-tails)

          R–Y                         S–W                         AC–GT
          E      I      C      T      E      I      C      T      E      I      C      T
Pri D1    0.40   0.47   0.60   0.64   2.06   2.93   4.45   3.23   0.39   0.42   0.43   0.35
Pri D2    0.89   2.65   1.76   1.78   1.47   1.14   0.68   0.55   0.55   0.56   0.62   0.95
Bct D1    0.43   0.20   0.34   0.37   1.95   3.59   3.29   2.89   0.24   0.34   0.23   0.35
Bct D2    0.23   0.28   0.21   0.27   0.25   0.34   0.19   0.27   0.22   0.28   0.26   0.30

The preference is not evident in bacteria. This explains why coding sequences preferentially use the S–W language but noncoding regions preferentially use the R–Y language in higher species. On the other hand, the redundancy D1 of the S–W language is the largest among the three kinds of binary languages. This is related to the larger fractal dimension of a DNA walk with the S–W rule compared with walks with other rules (Luo & Tsai, 1996).

5. Concluding Remarks

Through statistical analysis we have found all preferential modes of di- and tri-nucleotides in the genetic language. The (*WS) preference in higher species and in some of the lower organisms is of special interest. We have proposed a mechanism based on tRNA abundance to explain it. The mechanism seems correct for E. coli, but it requires further experimental testing in other species. In previous work (Luo et al., 1988) we pointed out that the first-order informational redundancy of noncoding sequences takes a larger value (than coding regions) because of the many control signals which decrease the informational entropy of these
regions. The linguistic peculiarities of the noncoding regions indicated by Mantegna et al. (1994) are actually another statement of the above result. We now find that in higher species the S–W language of exons and the R–Y language of introns, 5'-caps and 3'-tails have larger second-order informational redundancies. So, in the course of evolution, exons have come to use the S–W language preferentially and noncoding regions the R–Y language. This linguistic difference between coding and noncoding regions is very interesting. We suggest that the different preferences in these two regions may be related to their functional disparity in gene expression. By comparing the preferential modes in different reading frames we have demonstrated that there is a triplet frame in exons but that no such frame exists in noncoding regions in general. However, the differences of RMC in some genes, for example, I1-RY(YR) and I2-RY(YR) of some species in Table 6, C1-RY(YR) and C2-RY(YR) of some species in Table 7 and T1-RY(YR) and T2-RY(YR) of some species in Table 8, seem too large to be neglected. If these differences cannot be attributed to fluctuation, then there must exist some hidden doublet frame in introns, caps and tails for some particular genes, in spite of the general frame independence of noncoding sequences. To summarize, the linguistic difference between coding and noncoding regions can be expressed as follows: the genetic language of exons is S–W dominant in higher species and triplet-frame dependent; the genetic languages of introns, 5'-caps and 3'-tails are R–Y dominant in higher species and are generally frame independent. Our analyses show that many preferred modes can repeat themselves consecutively n times, with n = 2, 3, 4, ... We call such repeats ordered fragments. Sometimes very high repetitions (large n) have also been found (Ji & Luo, 1997). However, to
save space, these data have not been included in this manuscript. An ordered fragment with large n means that long-range correlation exists in the sequence. On the other hand, the triplet repetition in exons explains the one-third peak in the correlation spectrum analysis of coding sequences (Voss, 1992; Gutierrez et al., 1994; Luo & Ji, 1995). Many authors have discussed the molecular evolution of DNA sequences using informational parameter analysis and fractal analysis. It has been found that some evolutionary parameters do exist in the sense of a coarse-grained average; that is, after category averaging, these parameters correlate roughly with evolution. For example, the C + G content of coding sequences increases with evolution (Bernardi et al., 1986; Aota et al., 1986), and the second-order informational redundancy D2 (which describes the correlation of neighboring bases) increases with evolution (Luo et al., 1988; Luo & Bai, 1995). We now find that the contents of many ordered fragments also show evolutionary relations. For example, as seen from Tables 1, 2 and 6–8, the preferred mode (*WS) in coding regions and the preferred modes RR and YY in introns and other noncoding sequences increase with evolution. Likewise, from Tables 5 and 9–11 we find many evolutionary relations for the deviation Uki. All these results show the base correlation increasing with evolution. The preferential mode analysis presented in this article is consistent with earlier informational analyses. We have demonstrated that the deviations from independent sequences introduced in this paper are exactly the informational redundancies with lag (or mutual information). However, as seen from eqns (5–8), the deviations Uki are frame dependent. So the deviation analysis is equivalent to a frame-dependent informational analysis. That is, in some respects it is equivalent to the subsequence analysis using the informational parameter method (Luo & Shengli, 1990).
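The link between deviation from independence and mutual information can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the deviation defined in eqns (5–8): it computes the standard mutual information, in bits, between symbols separated by a given lag, and optionally restricts the first symbol of each pair to one reading-frame position. The function name lagged_mutual_information and its parameters lag, frame and period are hypothetical.

```python
import math
from collections import Counter


def lagged_mutual_information(seq, lag=1, frame=None, period=3):
    """Mutual information (bits) between seq[i] and seq[i + lag].

    If frame is given, only start positions i with i % period == frame
    are counted, which makes the estimate frame dependent.
    """
    starts = range(len(seq) - lag)
    if frame is not None:
        starts = [i for i in starts if i % period == frame]

    # Joint counts of (first symbol, second symbol) over the chosen positions.
    pairs = Counter((seq[i], seq[i + lag]) for i in starts)
    total = sum(pairs.values())
    if total == 0:
        return 0.0

    # Marginal counts taken from the same set of pairs.
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c

    # Sum of p(a,b) * log2( p(a,b) / (p(a) p(b)) ); zero for independent symbols.
    return sum(c / total * math.log2(c * total / (left[a] * right[b]))
               for (a, b), c in pairs.items())
```

Calling the function with frame = 0, 1 and 2 on a coding sequence and comparing the three values is one way to look for frame dependence; a sequence of independent symbols gives values near zero in every frame, up to finite-sample bias.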
We have split the DNA sequence into three subsequences according to the three codon positions and demonstrated the frame dependence of the informational parameters. For example, D1 of the third subsequence is much larger than D1 of the other two subsequences, and D2 of the neighboring base correlation between two codons depends more obviously on evolution. The inhomogeneity of exon sequences can also be explored using the coincident index method (Xu et al., 1993) and Pearson's statistics (Lee, 1996). On the other hand, the fractal dimension introduced in a DNA walk is related to the informational parameter
D1 through $R_n^2 \propto n^2 D_1(n)$, where $R_n^2$ is the mean square separation of the end points of a zigzag line (a DNA walk in three-dimensional base space) containing n bases (Luo & Tsai, 1996). So the different methods are consistent with each other. Each method describes the nucleotide correlation from its own aspect, and considering them together gives a unified understanding of the statistical properties of DNA sequences.

The authors are grateful to Professor Dalai for his helpful discussions and to Dr Verina Waights for her help with the English language.

REFERENCES

Aota, S. & Ikemura, T. (1986). Diversity in G + C content at 3rd position of codons in vertebrate genes and its cause. Nucl. Acids Res. 14, 6345–6355.
Arneodo, A., Bacry, E., Graves, P. V. & Muzy, J. F. (1995). Characterizing long-range correlations in DNA sequences from wavelet analysis. Phys. Rev. Lett. 74, 3293–3296.
Bernardi, G. & Bernardi, G. (1986). Compositional constraints and genome evolution. J. Mol. Evol. 24, 1–11.
Buldyrev, S. V., Goldberger, A. L., Havlin, S., Mantegna, R. N., Matsa, M. E., Peng, C. K., Simons, M. & Stanley, H. E. (1995). Long range correlation properties of coding and noncoding DNA sequences. Phys. Rev. E 51, 5084–5091.
Calladine, C. R. (1982). Mechanics of sequence-dependent stacking of bases in B-DNA. J. Mol. Biol. 161, 343–352.
Chechetkin, V. R. & Turygin, A. Y. (1994). The spectral criteria of disorder in nonperiodic sequences. J. Phys. A: Math and Gen. 27, 4875–4898; (1996) Study of correlation in DNA sequence. J. theor. Biol. 178, 205–217.
Dickerson, R. E. (1983). Base sequence and helical structure variation in B and A DNA. J. Mol. Biol. 166, 419–441.
Gatlin, L. (1972). Information Theory and Living System. Columbia Univ. Press.
Gutierrez, G., Oliver, J. L. & Marin, A. (1994). On the origin of the periodicity of three in protein coding DNA sequences. J. theor. Biol. 167, 413–414.
Herzel, H. & Grosse, I. (1995). Measuring correlations in symbol sequences. Physica A 216, 518.
Ikemura, T. (1981). Correlation between the abundance of E. coli tRNA and the occurrence of the respective codons in its protein genes. J. Mol. Biol. 151, 389–409.
Ji, F. M. & Luo, L. F. (1997). The ordered fragments in nucleotide sequence and molecular evolution. Acta Scientiarum Naturalium Universitatis Intramongolicae 28, 493–504.
Komine, Y., Adachi, T., Inokuchi, H. & Ozeki, H. (1990). Genomic organization and physical mapping of the tRNA genes in E. coli K12. J. Mol. Biol. 212, 579–598.
Lee, W. J. (1996). Heterogeneity analysis of nucleotide sequences. Acta Scientiarum Naturalium Universitatis Intramongolicae 27, 729–731.
Lee, W. J. & Luo, L. F. (1997). The periodicity of base correlation in nucleotide sequence. Phys. Rev. E 56, 848–851.
Lennon, G. & Nussinov, R. (1985). Eukaryotic oligomer frequencies are correlated with DNA helical parameters. J. theor. Biol. 22, 427–433.
Li, W. & Kaneko, K. (1992). Long range correlation and partial 1/f^α spectrum in a noncoding DNA sequence. Europhys. Lett. 17, 655–660.
Lio, P., Ruffo, S. & Buiatti, M. (1994). Third codon G + C periodicity as a possible signal for an internal selective constraint. J. theor. Biol. 171, 215–223.
Luo, L. F. & Bai, G. Y. (1995). The maximum information principle and the evolution of nucleotide sequences. J. theor. Biol. 174, 131–136.
Luo, L. F. & Ji, F. M. (1995). The correlation spectrum of nucleotide sequences: how to extract signals from background noise? Acta Scientiarum Naturalium Universitatis Intramongolicae 26, 419–426.
Luo, L. F. & Li, H. (1991). The statistical correlation of nucleotides in protein-coding DNA sequences. Bull. Math. Biol. 53, 345–353.
Luo, L. F. & Shengli (1990). The information-theoretic investigation of heterogeneous DNA sequences. Acta Scientiarum Naturalium Universitatis Intramongolicae 21, 229–234.
Luo, L. F. & Tsai, L. (1988). Fractal dimension of nucleic acid sequences and its relation to evolutionary level. Chinese Physics Letters 5, 421–424.
Luo, L. F. & Tsai, L. (1996). DNA walk and fractal analysis of nucleotide sequences. Acta Scientiarum Naturalium Universitatis Intramongolicae 27, 781–789.
Luo, L. F., Tsai, L. & Zhou, Y. M. (1988). Informational parameters of nucleic acid and molecular evolution. J. theor. Biol. 130, 351–361.
Maddox, J. (1992). Long range correlations within DNA. Nature 358, 103.
Mantegna, R. N., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Peng, C. K., Simons, M. & Stanley, H. E. (1994). Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 73, 3169–3172.
Munson, P. J., Taylor, R. C. & Michaels, G. S. (1992). DNA correlations. Nature 360, 636.
Nee, S. (1992). Uncorrelated DNA walks. Nature 357, 450.
Nussinov, R. (1987). Theoretical molecular biology: Prospectives and perspectives. J. theor. Biol. 125, 219–235.
Peng, C. K., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Sciortino, F., Simons, M. & Stanley, H. E. (1992). Long range correlations in nucleotide sequences. Nature 356, 168–170.
Prabhu, V. V. & Claverie, J. M. (1992). Correlations in intronless DNA. Nature 359, 782.
Rowe, G. W. & Trainor, L. E. H. (1983). On the informational content of viral DNA. J. theor. Biol. 101, 151–170.
Schmitt, A. O., Ebeling, W. & Herzel, H. (1996). The modular structure of informational sequences. BioSystems 37, 199–210.
Tavaré, S. & Song, B. (1989). Codon preference and primary sequence structure in protein-coding regions. Bull. Math. Biol. 51, 95–115.
Trifonov, E. N. (1987). Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J. Mol. Biol. 194, 643–652.
Trifonov, E. N. & Brendel, V. (1986). Gnomic: A Dictionary of Genetic Codes. Balaban, Philadelphia.
Voss, R. F. (1992). Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68, 3805–3808.
Xu, J., Chen, R. S., Ling, L. J., Shen, R. Q. & Sun, J. (1993). Coincident indices of exons and introns. Comput. Biol. Med. 23, 333–343.
Wada, K., Aota, S., Tsuchiya, R., Ishibashi, F., Gojobori, T. & Ikemura, T. (1990). Codon usage tabulated from GenBank data. Nucl. Acids Res. 18, 2367–2411.