J. Mol. Biol. (1983) 163, 363-376

Contextual Constraints on Synonymous Codon Choice

DAVID J. LIPMAN AND W. JOHN WILBUR

Mathematical Research Branch, National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Md 20205, U.S.A.

(Received 3 June 1982)
We have studied the statistical constraints on synonymous codon choice to evaluate various proposals regarding the origin of the bias in synonymous codon usage observed by Fiers et al. (1975), Air et al. (1976), Grantham et al. (1980) and others. We have determined the statistical dependence of the degenerate third base on either of its nearest neighbors in mitochondrial, prokaryotic, and eukaryotic coding sequences. We noted an increasing dependence of the third base on its nearest neighbors in moving from mitochondria to prokaryotes to eukaryotes. A statistical model assuming random equiprobable selection of synonymous codons was found grossly adequate for the mitochondria, but totally inadequate for prokaryotes and eukaryotes. A model assuming selection of synonymous codons reflecting a genomic strategy, i.e. the genome hypothesis of Grantham et al. (1980), gave a good approximation of the mitochondrial sequences. A statistical model which exactly maintains codon frequency, but allows the position of corresponding synonymous codons to vary, was only grossly adequate for prokaryotes and totally inadequate for eukaryotes. The results of these simulations are consistent with the measures on experimental sequences and suggest that a “frequency constraint” model such as that of Grantham et al. (1980) may be an adequate explanation of the codon usage in mitochondria. However, in addition to this frequency constraint, there may be constraints on synonymous codon choice in prokaryotes due to codon context. Furthermore, any proposal to explain codon usage in eukaryotes must involve a constraint on the context of a codon in the sequence.
1. Introduction
The sequence of amino acids in a protein is generally considered to constitute the primary constraint on the evolution of its corresponding nucleic acid sequence. This constraint is mediated through the codon table. Because of the pattern of degeneracy in the codon table, the intensity of this constraint varies between the different positions within each codon, with the second base being constrained by the choice of amino acid, and the third base being far less constrained than the other two bases.
Various workers have presented data demonstrating a non-uniform distribution in synonymous codon usage (Fiers et al., 1975; Air et al., 1976; Efstratiadis et al., 1977). Miyata & Hayashida (1981) have presented data establishing the extraordinarily high evolutionary rate of pseudogenes; a rate generally higher than that between synonymous codons in functional genes. Both observations imply additional constraints are operating on the protein-coding regions of nucleic acids, perhaps involved with control of translation (Ikemura, 1981) or due to nucleic acid secondary structure requirements (Hasegawa et al., 1979). We have examined the general constraints acting on the different positions of the codon in eukaryotic, prokaryotic, and mitochondrial coding sequences. This is an extension of previous work measuring the constraints on the ordering of bases and the constraints acting toward non-uniform base composition in coding sequences (Lipman & Maizel, 1982). The previous study revealed that mitochondrial sequences show stronger constraints acting toward non-uniform base composition and weaker constraints on the ordering of the bases than all other coding sequences examined. The present study involves measuring the divergence from independence of the bases in three positions defined with respect to the codon:
doublet position 1 = the first two bases of a codon;
doublet position 2 = the second and third bases of a codon;
doublet position 3 = the third base of a codon and the first base of the next codon.
Our results suggest that, on the average, mitochondrial sequences have the weakest constraint on the degenerate third base of a codon, while eukaryotes have the strongest constraint on the third base of each codon. We have also tested a number of statistical models which maintain the constraints of the amino acid sequence but vary the choice of synonymous codons in different ways. A model assuming no selection between synonymous codons was grossly adequate for mitochondria, though a model which maintained the synonymous codon frequencies of each coding sequence (but allowed the position of corresponding synonymous codons to vary) yielded an excellent approximation of the measures determined on experimental sequences. The latter model was only grossly adequate for prokaryotes and completely inadequate for the eukaryotes. Only a model which adjusted the probability of the third base depending on both the second base of the codon and the first base of the next codon generated sequences with measures similar to the eukaryotic and prokaryotic sequences. These results suggest that the constraints on mitochondrial sequences acting below the level of amino acid sequence are far less stringent than those acting on the eukaryotic or prokaryotic sequences. Also, any explanation for the eukaryotic selection of synonymous codons must involve the context of the codon within the sequence.
2. Methods
The statistical measure used in this study is derived from Information Theory, and its application to nucleic acid sequences is described in greater detail by Gatlin (1972) and by Lipman & Maizel (1982). D2, the divergence from independence of the bases, is defined as:

$$D2 = \sum_{i,j=1}^{4} P_{ij} \log P_{j|i} - \sum_{j=1}^{4} P_j \log P_j,$$
where P_ij is the probability of doublet ij, P_j|i is the probability of base j following base i, P_j is the probability of base j, logarithms are to base 2, and D2 is in bits. Independence of the bases is defined as P_ij = P_i P_j. The value of D2 ranges from 0 to 2 bits. When the bases of a sequence are independent, D2 = 0. As the D2 measure of a sequence increases, the probability of base i at a particular sequence position increasingly depends on, at least, its nearest neighbors. To calculate D2 on an intact sequence, singlet and overlapping doublet counts are tabulated. To calculate D2 for a particular position, for example position 1, only the singlets and doublets occupying position 1 of each codon are counted. In general, the singlet and doublet counts are used to calculate D2 as follows:
$$D2 = \sum_{i,j=1}^{4} (n_{ij}/N_d) \log (n_{ij}/n_i) - \sum_{j=1}^{4} (n_j/N_s) \log (n_j/N_s),$$
where n_ij is the number of doublets ij, n_i is the number of singlets i, n_j is the number of singlets j, N_d is the sum total of doublets counted and N_s is the sum total of singlets counted (note, for the position-specific measures, N_s = N_d). This, as applied to find D2 for position 1, obtains n_ij from doublet position 1, and n_i and n_j from the first and second bases of each codon, respectively. The D2 measure for the intact sequence, not distinguishing codon positions, is referred to here as the overall D2. Subscripts 1, 2 and 3 (D2_1, D2_2 and D2_3) will denote the D2 measures for the corresponding codon-defined doublet positions of a sequence. The computer simulations of various statistical models will be described in the Results. All computing was done on a DEC KL-10, programs were written in PASCAL, and Figures were generated using the MLAB facility at the National Institutes of Health. All nucleic acid sequences were taken from the Nucleic Acid Sequence Data Base at the National Biomedical Foundation, Georgetown University, except those of the human mitochondrial genome, which were taken from the Data Bank of Los Alamos Scientific Laboratory.
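The count-based form of D2 described in the Methods can be sketched in Python as follows. This is only an illustrative sketch, not the authors' PASCAL program: the function names `d2_from_counts`, `d2_overall` and `d2_position` are hypothetical, and the input is assumed to be an in-frame string of bases that starts at the first base of a codon.

```python
from collections import Counter
from math import log2


def d2_from_counts(doublets, leading, following):
    """D2 in bits from doublets (i, j) and the singlets they are compared with."""
    n_ij = Counter(doublets)
    n_i = Counter(leading)      # singlet counts for the first base of a doublet
    n_j = Counter(following)    # singlet counts for the second base of a doublet
    N_d, N_s = len(doublets), len(following)
    conditional = sum(c / N_d * log2(c / n_i[i]) for (i, _), c in n_ij.items())
    marginal = sum(c / N_s * log2(c / N_s) for c in n_j.values())
    return conditional - marginal


def d2_overall(seq):
    """Intact sequence: overlapping doublets, every base counted as a singlet."""
    return d2_from_counts(list(zip(seq, seq[1:])), seq, seq)


def d2_position(seq, position):
    """Doublet position 1, 2 or 3 of each codon (3 spans into the next codon)."""
    pairs = [(seq[k], seq[k + 1]) for k in range(position - 1, len(seq) - 1, 3)]
    # For position-specific measures the singlets come from the same doublets,
    # so N_s = N_d as noted in the text.
    return d2_from_counts(pairs, [i for i, _ in pairs], [j for _, j in pairs])


if __name__ == "__main__":
    seq = "ATGGCTGCAGCCTTATTGAAATAA"  # hypothetical toy sequence
    print(d2_overall(seq), [d2_position(seq, p) for p in (1, 2, 3)])
```

With all 16 doublets equally frequent the sketch returns 0 bits, and when each base fully determines its successor over a uniform composition it returns the 2-bit maximum, matching the range stated above.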
3. Results
In Tables 1 to 3 are the D2 values for the eukaryotic, prokaryotic and mitochondrial sequences. There is a fairly wide range of values for each category within a group. The mitochondria, on the average, have the lowest overall D2 value, and relatively low D2_2 and D2_3 values. The prokaryotes have a higher D2_2 value while the eukaryotes, with the highest overall D2, have relatively high D2_2 and D2_3 values. It appears that the rise in the overall D2 from the mitochondria to the eukaryotes is due to increasing constraints on the third base of the codon. In mitochondrial sequences the third base shows weak dependence on either the second base of the codon or the first base of the neighboring codon. In prokaryotes the third base shows increased dependence on the second base of the codon, while in eukaryotes, the third base shows a dependence on both the second base of the codon and the first base of the neighboring codon. We conducted the following simulations to investigate the relationship between the above measures of statistical constraints on the third base of the codon and the bias in choice of synonymous codons. The first simulation tests the assumption that synonymous codons are chosen randomly and with equal probability (model I). For each real sequence 100 simulated sequences were generated in the following way.
TABLE 1
D2 values for mitochondrial sequences

Sequence                                     D2 (overall)   D2_1     D2_2     D2_3
Human cytochrome oxidase peptide 1           0013           0084     0039     0.031
Human cytochrome oxidase peptide 2           0904           0.084    0.120    0.042
Human cytochrome oxidase peptide 3           0020           0. loo   ON7      (1040
Yeast cytochrome oxidase peptide 2           0.034          0165     0466     0.027
Yeast ATPase subunit 3 or 6                  0.032          0147     om6      WI08
Yeast cytochrome b                           0.027          w145     0.110    0967
Human ATPase                                 @016           0.079    0999     0.025

TABLE 2
D2 values of prokaryotic sequences

Sequence                                     D2 (overall)   D2_1     D2_2     D2_3
E. coli chloramphenicol acetyl transferase   0.024          0.061    0362     0033
E. coli dihydrofolate reductase              0037           o-074    Cl.50    0.032
E. coli trpFs                                OMO            0.082    0.171    0050
E. coli t?yx                                 oG41           0.053    0.085    0043
E. coli !lkpA                                0.045          0.106    0.152    0.029
E. coli RNA polymerase β-chain               0034           0.087    0.220    0.018
E. coli lactose permease                     0.028          0.076    0082     0.041
E. coli RPLA                                 0056           0996     0.301    OG21
E. coli RPLK                                 0030           OQ96     0247     0.110
E. coli OMPA                                 WO23           0936     0129     0.023
E. coli heat stable toxin ST-I               0.127          (b219    0.235    0.155
Salmonella typhimurium trpD                  0.023          OG69     O-083    o%u4
Salmonella typhimurium trpR                  0.043          WO82     0.130    0.072
Salmonella typhimurium trpA                  0.034          0101     0108     0.025
Shigella dysenteriae trpD                    0.039          WO61     @161     0.048
Serratia marcescens trpG                     0041           O-084    ti230    0.044
The nucleic acid sequence is translated into the corresponding amino acid sequence. The set of synonymous codons for each amino acid position is identified. A new coding sequence is generated by randomly choosing with equal probability one codon out of the set of corresponding synonymous codons for each amino acid position. Thus, the amino acid sequence is reproduced exactly but the synonymous codons are randomly assigned. This model therefore invokes no selective constraint on the use of synonymous codons. Sequences thus generated will maintain the D2_1 value close to that of the original experimental sequence but the overall D2, D2_2 and D2_3 values are free to vary. The second simulation assumes that all positions of corresponding synonymous codons within a sequence are equivalent. The sequences are generated as follows. The positions of all corresponding synonymous codons within each sequence are identified. A new sequence is generated by randomly shuffling the corresponding synonymous codons between appropriate sequence positions. The resulting
TABLE 3
D2 values for eukaryotic sequences

Human corticotropin-lipotropin precursor
Human choriogonadotropin α-chain precursor
Human somatotropin precursor
Human choriomammotropin
Human interferon α1 precursor
Human interferon &precursor
Human interferon ,%precursor
Rat pancreatic α-amylase
Rat somatotropin precursor
Bovine parathyroid hormone precursor
Sea urchin histone H2A (Strongylocentrotus purpuratus)
Sea urchin histone H2A (Psammechinus miliaris)
Sea urchin histone H3
Sea urchin histone H2B
Fruit fly major heat shock protein
Mouse Hb β major chain gene
Mouse Hb β minor chain gene
Human insulin
Human Hb α-chain
Human Hb β-chain
Human Hb γ-G-chain
Rat prolactin
Rat prolactin 2
Mouse Ig κ-chain V region germline gene K2
Mouse Ig κ-chain V region germline gene MOPC 41
Mouse T1 Ig κ-chain V region differentiated gene
Mouse Ig λ-1 chain V region differentiated gene
Mouse Ig μ-chain C region germline gene
Mouse Ig γ-1-chain C region germline gene
Rabbit Hb I-cxhain
Chicken lysozyme precursor
Yeast actin
0.094 PO92 0,133 w154 0.148 (F140 0948 0.1 13
0.1 14
0.039
O-029
0%6 0.105 0.054 @I46 0191 0999 O-095 0,137
0.127 0073 0.116 0.280 0.305 0.1 17 0.127 WI.59
0.044 tiO5X 00,X2 0166 0.180 (b102 0,069 OG98
0.149
0092
0.17H 0. I 53
0.1 I.5
0-0x0
oai%
0152 0.076 0481
0130 0.270 fi21.5
oQ90 0.108 0.133
0.056 0~050 0461
PO47 0.118 0. I 20 0.177 0.108 O-165 0,139 00.59 0964
@103 0196 0159 0.170 0,140 0265 0239 0~093 0~225
0%3 e259 0.279 0.2 13 I kO59 0.187 0.253 (b179 0.231
0.024 0.137 0.142 0.145 om 1 0.148 0161 0.076 OG985
0.097
0.184
OT?Ol
0 I 13
O-166
0,114
WI43
0.084
0~088
0.111
0,153
0.074
0.079
0.148
ow3
0.120
OMi
0,124
(PI27
omo
O-094 oan9 0.109 0.094
0.117 0.272 0959 0.33 1
0.170 0073 0038 002H
0091 OG64 0.034 0.040
sequences exactly reproduce the amino acid sequence and codon frequencies of the original sequence but the position of corresponding synonymous codons is randomly assigned (model II). Sequences thus generated will maintain the D2_1 and D2_2 values exactly as the original but the overall D2 and D2_3 values are free to vary. The last simulation assumes that the degenerate third base of a codon is chosen with respect to its nearest neighbors. The sequences are generated as follows. The
counts of the triplet defined by doublet positions 2 and 3 are tabulated. From these counts the conditional probabilities of the third base, given the second base of a codon and the first base of the neighboring codon, are calculated. The third base of each codon of the simulated sequence is chosen by first identifying its nearest neighbors and thus the associated conditional probabilities. These conditional probabilities are used to weight a random choice of third base. A sequence is thus generated which exactly reproduces the amino acid sequence as well as maintaining the conditional probability of the third base given its nearest neighbors on either side. This model invokes a selective constraint on the choice of third base (and thus synonymous codon) which involves the context of the codon (model III). Sequences thus generated maintain D2_1 values exactly as the original measures; D2_2 and D2_3 values will be very close to the original measures. Given that all position-specific D2 measures are maintained as the original, the overall D2 value should be very close to the original experimental overall D2 value. Therefore, model III serves as a control. For each of the three models, 100 simulated sequences were generated from each experimental sequence. The four D2 measures were determined for each generated sequence and a mean and standard deviation were calculated for 100 simulations of each experimental sequence. Thus, for each experimental sequence we have four sets of D2 values. The first set of values are those for the experimental sequence, the second set is the mean of simulations based on model I (no selective constraint), the third set is the mean of simulations based on model II (frequency constraint), and the fourth set is the mean of simulations based on model III (context constraint). In Table 4 we compare the averaged D2 measures of the experimental sequences with the averaged measures for the three models.
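The three simulation models described in the Results can be sketched in Python as below. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the standard genetic code is assumed throughout (mitochondrial genomes use variant codes), model III is assumed to vary only the third base of each codon, and the final codon, which has no following neighbor, is simply kept.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code (an assumption; mitochondria use variant codes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def codons(seq):
    return [seq[k:k + 3] for k in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    return "".join(CODON_TO_AA[c] for c in codons(seq))


def model_1(seq, rng):
    """Model I: each codon replaced by an equiprobable synonymous codon."""
    return "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in codons(seq))


def model_2(seq, rng):
    """Model II: synonymous codons shuffled among positions of the same amino acid."""
    pools = defaultdict(list)
    for c in codons(seq):
        pools[CODON_TO_AA[c]].append(c)
    for pool in pools.values():
        rng.shuffle(pool)
    return "".join(pools[CODON_TO_AA[c]].pop() for c in codons(seq))


def model_3(seq, rng):
    """Model III: third base drawn given the second base and the next first base."""
    cs = codons(seq)
    counts = defaultdict(lambda: defaultdict(int))
    for c, nxt in zip(cs, cs[1:]):
        counts[(c[1], nxt[0])][c[2]] += 1
    out = []
    for c, nxt in zip(cs, cs[1:]):
        # Assumption: only the third base varies, so candidates share c[:2].
        options = [s for s in SYNONYMS[CODON_TO_AA[c]] if s[:2] == c[:2]]
        weights = [counts[(c[1], nxt[0])][s[2]] for s in options]
        out.append(rng.choices(options, weights=weights)[0])
    out.append(cs[-1])  # assumption: the last codon has no 3' neighbor and is kept
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(1)
    seq = "ATGGCTGCAGCCTTATTGAAATAA"  # hypothetical toy sequence
    simulated = [model_2(seq, rng) for _ in range(100)]
    print(simulated[0], translate(simulated[0]))
```

All three functions leave the translated amino acid sequence unchanged; model II additionally preserves the codon counts, which is why the two within-codon doublet measures are reproduced exactly under that model.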
An asterisk denotes those D2 values of a simulation which were ≥ 3 S.D.† from the corresponding experimental D2 value. For the mitochondria, it can be seen that model I (no selection) is grossly adequate (1.0 S.D. too low), though model II (frequency constraint) is a better approximation. For the prokaryotes, model II is grossly adequate (1.2 S.D. too low) but model III (context constraint) is a better approximation. Finally, for the eukaryotes, only model III gives a sufficient approximation of experimental values. Simulation by model III is useful because it demonstrates that one can conduct a legitimate “Monte Carlo” experiment to choose synonymous codons (the range in simulated D2 values was as great as for the other simulations) and still approximate the experimental overall D2 value. A close approximation of the experimental values does not constitute a proof of the model. For the purposes of this study it is of more importance that we have shown model II to be totally inadequate for eukaryotes than that model III is adequate. As mentioned above, model III serves only as a control. The results of these simulations are consistent with the pattern of the measures on the experimental sequences alone: in mitochondria, the third base is only weakly dependent on either of its nearest neighbors; in prokaryotes, the third base is dependent on the second base of the codon and only weakly dependent on the first base of the neighboring codon; and in eukaryotes, the third base is dependent
† Abbreviation used: S.D., standard deviation.
0.020
(TO23
Model II (frequency constraint)
Model III (context constraint) 0.111
0.111
0.082
@Ill
D2,
0.083
0.06 1
0.034
0961
Mitochondria 02,
of the experimental
2 3 s.1) from the corresponding
0014
Model I (no selection)
t I)2 values of a simulation
0.019
D2,
D2 measures
Experimental
Averaged
02,
0086
0.086
0~081
@Of46
L)4 value
0.044
0~035
0016t
with
0176
0.160
w039t
0160
D22
0469
@039
0036
0.051
D23
the averaged
Prokaryotes
02,
compared
0.042
experimental
0465
oa3 1
0.029
0a4u
LB,
sequences
T.ABLE 4
PO89
0043t
0419t
0.087
112,
0.1 11
0.1 11
w103
0.111
I)2 l
0170
0.154
04)4&f
0.154
112,
Eukaryotes
meawreS for the three models
0~1%
0~055t
oa52t
0.152
1123
3in
on both its nearest neighbors. Thus, there is an increase in the constraints on the third base of the codon between mitochondria, prokaryotes and eukaryotes. There is also a corresponding increase in the overall D2, the divergence from independence for the overall sequence, between the three groups. What is the relationship between the overall D2 measure for a sequence and the position-specific D2 measures? While there is clearly a mathematical relationship it is not simple nor does it appear subject to useful analytical treatment. To examine this relationship we plotted the overall D2 versus D2_1, D2_2 and D2_3 for each of the three groups of experimental sequences. With three interesting exceptions, no correlations were found between the overall D2 and the position-specific D2 measures. In Figure 1 we have plotted the overall D2 versus D2_1 for the mitochondria. With the exception of one point, representing the cytochrome oxidase peptide II gene of human, the data are remarkably linear (the correlation coefficient with all points included was 0.91). The data are limited, seven points, yet it appears that the overall divergence from independence in mitochondrial coding sequences shows a strong positive correlation with D2_1, the measure for the position most constrained by amino acid sequence. For the prokaryotes, both examination of the experimental D2_2 measures and the results of the simulations suggest that the choice of the degenerate third base is dependent on the second base of the codon; however, there was no clear relationship between the overall D2 and D2_2 values. Figure 2 is a plot of the overall D2 versus D2_2 for the eukaryotes. With the exception of several points in the lower right corner of the graph, there is an apparent positive correlation between the overall D2 and D2_2. Figure 3 is a plot of the overall D2 versus D2_3 for the eukaryotes. This clearly demonstrates a strong, positive correlation between the overall D2 and D2_3 values.
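The correlation coefficient quoted for these plots is presumably the ordinary product-moment (Pearson) coefficient; that is an assumption, since the text does not name the statistic. A minimal sketch of that calculation follows, with a hypothetical function name `pearson_r` and made-up illustrative numbers rather than values from the tables.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between paired measurements."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


if __name__ == "__main__":
    # Hypothetical position-specific and overall D2 values, for illustration only.
    position_specific = [0.08, 0.10, 0.15]
    overall = [0.010, 0.020, 0.045]
    print(pearson_r(position_specific, overall))
```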
To summarize the above results, the mitochondrial coding sequences have the lowest overall D2 values as well as low D2_2 and D2_3 values. Consistent with this
FIG. 1. Overall D2 versus D2_1 (in bits) for mitochondrial coding sequences.
FIG. 2. Overall D2 versus D2_2 (in bits) for eukaryotic coding sequences.
FIG. 3. Overall D2 versus D2_3 (in bits) for eukaryotic coding sequences.
observation of weak constraints on the degenerate third base, a statistical model which assumed no selective constraint on synonymous codons (model I) was grossly adequate in reproducing the experimental D2 measures, though a model which assumes selection of a particular biased frequency of codons (model II) was a good approximation. Furthermore, a positive, nearly linear, correlation was found between the overall D2 and D2_1 measures of mitochondrial coding sequences. The prokaryotic coding sequences, which had intermediate overall D2 values with respect to the other two groups, had a higher D2_2 value than the mitochondrial sequences but a low D2_3 value. The results of the simulations were consistent with this observation of increased constraints on the degenerate third base because of a
dependence on the second base of the codon. That is, a statistical model involving no selective constraint on choice of synonymous codons was totally inadequate (model I), but one which maintains a particular synonymous codon frequency for each sequence (the frequency found in each experimental sequence) was, at least, grossly adequate in reproducing the experimental D2 measures (model II). The eukaryotic coding sequences, which have the highest overall D2 values, have relatively high D2_2 and D2_3 values. The results of the simulations are consistent with this observation of selective constraints from both nearest neighbors operating on the choice of degenerate third base. That is, a model which maintains exactly the codon frequency of each gene, but allows the position of corresponding synonymous codons to vary (model II), was totally inadequate for the eukaryotes. In addition, there appeared to be a strong, positive correlation between the overall D2 and D2_3, and a lesser correlation between the overall D2 and D2_2.
4. Discussion
With the rapid accumulation of a wide variety of nucleic acid sequences, the non-uniform distribution of synonymous codons has become apparent (for a review see Grantham et al., 1980). This suggested that there was a non-random, and thus selective, constraint on coding sequences in addition to the constraints of the amino acid sequence. Further evidence of this additional selective constraint was the work by Miyata & Hayashida (1981) which demonstrated that the evolutionary rate of pseudogenes exceeded the evolutionary rate of synonymous codons. Work by Sheppard & Gutman (1981), comparing two allotypic forms of the rat κ light-chain constant region, demonstrated that in at least some instances, the evolutionary rate of synonymous codons was slower than that of replacement codons. What is the basis of these additional selective constraints? One can examine some proposed functional constraints acting on the choice of synonymous codons in light of our findings on the statistical constraints of synonymous codon usage. Grantham et al. (1980) have proposed that the non-uniform distribution of synonymous codons represents a genome strategy, i.e. the distribution of synonymous codons in each gene conforms to the species overall codon usage. This proposal was later modified (Grantham et al., 1981) when they observed a correlation between codon usage and protein expressivity in bacteria. The genes of abundantly expressed proteins tended to use similar synonymous codons. Ikemura (1981) extended this hypothesis by demonstrating a correlation between the abundance of Escherichia coli transfer RNAs and the corresponding codon usage in protein genes. Ikemura (1981) rated the synonymous codons as to their optimality with respect to the content of isoaccepting tRNAs and the nature of the codon-anticodon interaction.
The genes for abundantly expressed proteins tended to use the optimal codons selectively, while the genes for proteins less frequently expressed did not show a selective usage of optimal codons. The above hypotheses state that in any gene, the selection of synonymous codons is toward a particular frequency. This frequency, in the most general case, reflects that of the genome; in the more specific case, just the genes of abundantly expressed proteins would share synonymous codon usage.
Model I, which assumed no selective usage of synonymous codons, was even grossly adequate only for mitochondria, and even for mitochondria, the average D2 values were 1 S.D. away from the experimental values. A statistical model which assumed selection toward the overall codon usage of yeast mitochondria (genome hypothesis) resulted in D2 measures that closely approximated the experimental yeast mitochondria D2 measures (data not shown). A statistical model for the prokaryotes was tested which selected towards the pooled codon usage of the E. coli rplL, ompA and rpoB genes, three genes that Ikemura found had high concentrations of optimal codons. Sequences were generated with this model from the above three sequences and the resulting overall D2 values were 2 S.D. lower than the overall D2 values of the experimental sequences. A model which invokes selection toward a particular, generalized synonymous codon frequency appears sufficient only for the mitochondria (the genome hypothesis). Indeed, a model which assumes that each gene is selectively constrained to its own particular synonymous codon frequency (model II) produced sequences with overall D2 values 1.2 S.D. too low for prokaryotes. Our results do not contradict those of Grantham et al. (1980, 1981) or Ikemura (1981) but do suggest that additional constraints may be involved for the prokaryotes. Model II (frequency constraint) was found totally inadequate for the eukaryotes, with the simulated overall D2 values over 5 S.D. too low. Therefore, any model for the eukaryotes must also involve the position of the codon with respect to the rest of the coding sequence. Shepherd (1981), and T. Smith, M. Waterman & J. Sadler (unpublished results) have found a preference for pyrimidine-purine doublets in doublet position 3. Smith, Waterman & Sadler suggested that this was due to functional constraints on ribosome translocation, while Shepherd suggested that this was a persisting pattern of an archaic code.
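Statements such as "1.2 S.D. too low" compare the mean of the simulated D2 values with the experimental value in units of the simulations' standard deviation. A minimal sketch of that comparison is given below; the function name `deviation_in_sd` is hypothetical, the numbers are invented for illustration, and the use of the sample (n - 1) standard deviation is an assumption because the text does not say which estimator was used.

```python
from statistics import mean, stdev


def deviation_in_sd(experimental, simulated):
    """Signed distance of the simulated mean from the experimental value, in S.D."""
    # Negative results mean the simulations fall below the experimental value.
    return (mean(simulated) - experimental) / stdev(simulated)


if __name__ == "__main__":
    # Hypothetical D2 values from a handful of simulated sequences.
    simulated = [0.04, 0.05, 0.06]
    print(deviation_in_sd(0.08, simulated))
```

In the paper each mean and standard deviation is taken over 100 simulated sequences per experimental sequence; the short list here is only to keep the example readable.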
We tested a model which maintains the doublet frequencies (and would thus maintain any pyrimidine-purine ordering) in the third doublet position. Sequences thus generated maintain D2_3 exactly, D2_1 is maintained very close to the original sequence, but D2_2 and the overall D2 are free to vary. For the eukaryotes, the overall D2 values of the resulting sequences were 2 S.D. greater than the overall D2 values generated for model II, which exactly maintained the codon frequencies of each sequence (data not shown). This “third doublet position” model was > 1 S.D. worse than model II for prokaryotes, as would be predicted from the experimental D2 values. This pyrimidine-purine model proposed by Shepherd (1981), and T. Smith, M. Waterman & J. Sadler (unpublished results) gains further support from our finding of a strong, positive correlation between the overall D2 and D2_3. Nevertheless, the overall D2 values of sequences generated by this model were still > 3 S.D. lower than those of the experimental sequences, and thus the model cannot alone account for the selective constraints on eukaryotic synonymous codons. Hasegawa et al. (1979) have found a correlation between synonymous codon usage and secondary structure in MS2. Those regions presumed to be involved in base-pairing tended toward G or C in the degenerate third base position, while those regions not involved in base-pairing tended toward C or A. Thus, secondary structure requirements of the viral RNA may be involved in the choice of synonymous codons. It is of interest to note that the prokaryotic and eukaryotic
viruses are more similar to the mitochondrial sequences than to the prokaryotes or eukaryotes in that they exhibit low values of both D2_2 and D2_3 (data not shown). Another context constraint which may influence codon usage is selection for particular nearest-neighbor patterns. Nussinov (1980) has analyzed a broad range of DNA sequences and found some “universal” nearest-neighbor rules. Though the patterns may differ between prokaryotes and eukaryotes, certain heterodinucleotides consistently appear at greater (or lesser) frequencies than their mirror image dinucleotides. Nussinov (1981) extended this analysis by determining the doublet frequencies in the three codon-defined doublet positions of protein-coding sequences. The doublet frequencies in the coding frame (doublet position 1) often diverged from the “universal” pattern. In the non-coding frames (doublet positions 2 and 3) the doublet frequencies were not very different from the doublet frequencies in regions of DNA which do not code for protein. Nussinov also did simulations which demonstrated that the overall doublet frequencies in coding regions were not primarily determined by choice of amino acid. Nussinov concluded that the constraints on synonymous codon choice are the same as the structural constraints which lead to particular “universal” doublet frequencies in non-protein coding regions. Although some doublets appear far less frequently than expected (i.e. C-G), and others appear more frequently (i.e. T-G or C-A), there is significant individual variation between the non-coding frames of protein-coding regions. In the mouse nhemoglobin for example, CG appears to be selectively maintained in the non-coding frame (T. Smith, M. Waterman & J. Sadler, unpublished results). Also, if the primary constraint on synonymous codon choice was to maintain a particular doublet frequency, one would not expect to find a difference between doublet positions 2 and 3.
However, these positions appear different in prokaryotes, with the divergence from independence being greater on the average in position 2 than in position 3. Despite these objections, the structural constraints on doublet frequencies which operate throughout the genome may play a role in the choice of synonymous codons. Bossi & Roth (1980) have isolated mutants in Salmonella which have increased efficiency in translating an amber codon with a suppressor tRNA. This increased efficiency was due to a mutation in the nucleotide adjacent to the 3' side of the amber codon. Thus, there is experimental evidence that changes in codon context can affect translation. A functional requirement which would impose a positional constraint on synonymous codons could involve an interaction of the tRNAs at the P and A sites in the ribosome. If particular tRNA pairs had a differential effect on the accuracy or rate of translation, then the probability of the degenerate third base would depend on at least its nearest neighbor. It can be seen that the D2 measures of the statistical constraints on synonymous codons are useful in evaluating various explanations for the patterns of synonymous codon usage. Can the D2 measures be related to measures of evolutionary rate in synonymous codon positions? Brown et al. (1982), in studying mitochondrial sequences of primates, found that the evolutionary rate in synonymous codon positions was nine to ten times that of synonymous codon positions in chromosomal genes. They concluded that this difference was due to a
higher mutation rate in mitochondria. With this in mind, they concluded that similar higher evolutionary rates in other functional regions of the mitochondrial genome, tRNA genes for example, were due to a combination of a higher mutation rate and relaxed functional constraints. Our results suggest that there are also relaxed constraints on the synonymous codon positions in mitochondria, and it may not be necessary to invoke a higher rate of mutation to explain the evolutionary rate of the mitochondrial genome; the relaxed functional constraints may be sufficient. We are currently studying the relationship between the measures of evolutionary rate and our measures of the statistical constraints on nucleic acid sequences.
5. Conclusion
We have examined the constraints on choice of synonymous codons in mitochondrial, prokaryotic and eukaryotic coding sequences. The position-specific D2 measures indicated that the constraints on the degenerate third base were weakest in mitochondria. These constraints were stronger in prokaryotes because of an increased dependence of the third base on the second base of the codon. Eukaryotes displayed the strongest constraint on the choice of synonymous codon because the degenerate third base was dependent on both its nearest neighbors. The results of simulations were consistent with the above. The simplest model which best approximated the mitochondrial D2 values assumed selection of a particular overall frequency of synonymous codons, the genome hypothesis of Grantham et al. (1980). Such a model was grossly inadequate for prokaryotes even when modified by assumptions of correlations between protein abundance and choice of synonymous codons. The most stringent model assuming selection of a particular synonymous codon frequency (model II) resulted in overall D2 values 1.2 S.D. lower than the experimental prokaryotic overall D2 values. Therefore, selection of synonymous codons in prokaryotes may involve positional constraints. Explanations for synonymous codon usage in eukaryotes must involve positional constraints. Selection of pyrimidine-purine third position doublets may be involved but alone is not sufficient to account for the observed constraints. Other positional constraints, such as nucleic acid secondary structure requirements, or interactions between the tRNAs in the ribosome, may also be involved in the choice of synonymous codons in eukaryotes.

We acknowledge Dr C. Guthrie, Dr A. Wilson and Dr T. Smith for helpful discussions and advice.
REFERENCES
Air, G. M., Blackburn, E. H., Coulson, A. R., Galibert, F., Sanger, F., Sedat, J. W. & Ziff, E. B. (1976). J. Mol. Biol. 107, 445-458.
Bossi, L. & Roth, J. (1980). Nature (London), 286, 123-127.
Brown, W., Prager, E., Wang, A. & Wilson, A. (1982). J. Mol. Evol. 18, 225-239.
Efstratiadis, A., Kafatos, F. & Maniatis, T. (1977). Cell, 10, 571-585.
Fiers, W., Contreras, R., Duerinck, F., Haegeman, G., Merregaert, J., Min Jou, W., Raeymaekers, A., Volckaert, G., Ysebaert, M., Van de Kerckhove, J., Nolf, F. & Van Montagu, M. (1975). Nature (London), 256, 273-278.
Gatlin, L. (1972). Information Theory and the Living System, Columbia University Press, New York and London.
Grantham, R., Gautier, C., Gouy, M., Mercier, R. & Pavé, A. (1980). Nucl. Acids Res. 8, r49-r62.
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M. & Mercier, R. (1981). Nucl. Acids Res. 9, r43-r74.
Hasegawa, M., Yasunaga, T. & Miyata, T. (1979). Nucl. Acids Res. 7, 2073-2079.
Ikemura, T. (1981). J. Mol. Biol. 151, 389-400.
Lipman, D. & Maizel, J. (1982). Nucl. Acids Res. 10, 2723-2739.
Miyata, T. & Hayashida, H. (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 5739-5743.
Nussinov, R. (1980). Nucl. Acids Res. 8, 4545-4562.
Nussinov, R. (1981). J. Mol. Evol. 17, 237-244.
Shepherd, J. (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 1596-1600.
Sheppard, H. W. & Gutman, G. A. (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 7064-7068.

Edited by S. Brenner