Contextual constraints on synonymous codon choice


J. Mol. Biol. (1983) 163, 363-376

DAVID J. LIPMAN AND W. JOHN WILBUR

Mathematical Research Branch, National Institute of Arthritis, Diabetes, and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Md 20205, U.S.A.

(Received 3 June 1982)

We have studied the statistical constraints on synonymous codon choice to evaluate various proposals regarding the origin of the bias in synonymous codon usage observed by Fiers et al. (1975), Air et al. (1976), Grantham et al. (1980) and others. We have determined the statistical dependence of the degenerate third base on either of its nearest neighbors in mitochondrial, prokaryotic, and eukaryotic coding sequences. We noted an increasing dependence of the third base on its nearest neighbors in moving from mitochondria to prokaryotes to eukaryotes. A statistical model assuming random equiprobable selection of synonymous codons was found grossly adequate for the mitochondria, but totally inadequate for prokaryotes and eukaryotes. A model assuming selection of synonymous codons reflecting a genomic strategy, i.e. the genome hypothesis of Grantham et al. (1980), gave a good approximation of the mitochondrial sequences. A statistical model which exactly maintains codon frequency, but allows the position of corresponding synonymous codons to vary, was only grossly adequate for prokaryotes and totally inadequate for eukaryotes. The results of these simulations are consistent with the measures on experimental sequences and suggest that a "frequency constraint" model such as that of Grantham et al. (1980) may be an adequate explanation of the codon usage in mitochondria. However, in addition to this frequency constraint, there may be constraints on synonymous codon choice in prokaryotes due to codon context. Furthermore, any proposal to explain codon usage in eukaryotes must involve a constraint on the context of a codon in the sequence.

1. Introduction

The sequence of amino acids in a protein is generally considered to constitute the primary constraint on the evolution of its corresponding nucleic acid sequence. This constraint is mediated through the codon table. Because of the pattern of degeneracy in the codon table, the intensity of this constraint varies between the different positions within each codon, with the second base being most constrained by the choice of amino acid, and the third base being far less constrained than the other two bases.


Various workers have presented data demonstrating a non-uniform distribution in synonymous codon usage (Fiers et al., 1975; Air et al., 1976; Efstratiadis et al., 1977). Miyata & Hayashida (1981) have presented data establishing the extraordinarily high evolutionary rate of pseudogenes, a rate generally higher than that between synonymous codons in functional genes. Both observations imply that additional constraints are operating on the protein-coding regions of nucleic acids, perhaps involved with control of translation (Ikemura, 1981) or due to nucleic acid secondary structure requirements (Hasegawa et al., 1979).

We have examined the general constraints acting on the different positions of the codon in eukaryotic, prokaryotic, and mitochondrial coding sequences. This is an extension of previous work measuring the constraints on the ordering of bases and the constraints acting toward non-uniform base composition in coding sequences (Lipman & Maizel, 1982). The previous study revealed that mitochondrial sequences show stronger constraints acting toward non-uniform base composition and weaker constraints on the ordering of the bases than all other coding sequences examined. The present study involves measuring the divergence from independence of the bases in three positions defined with respect to the codon:

doublet position 1 = the first two bases of a codon;
doublet position 2 = the second and third bases of a codon;
doublet position 3 = the third base of a codon and the first base of the next codon.

Our results suggest that, on the average, mitochondrial sequences have the weakest constraint on the degenerate third base of a codon, while eukaryotes have the strongest constraint on the third base of each codon.

We have also tested a number of statistical models which maintain the constraints of the amino acid sequence but vary the choice of synonymous codons in different ways. A model assuming no selection between synonymous codons was grossly adequate for mitochondria, though a model which maintained the synonymous codon frequencies of each coding sequence (but allowed the position of corresponding synonymous codons to vary) yielded an excellent approximation of the measures determined on experimental sequences. The latter model was only grossly adequate for prokaryotes and completely inadequate for the eukaryotes. Only a model which adjusted the probability of the third base depending on both the second base of the codon and the first base of the next codon generated sequences with measures similar to the eukaryotic and prokaryotic sequences.

These results suggest that the constraints on mitochondrial sequences acting below the level of amino acid sequence are far less stringent than those acting on the eukaryotic or prokaryotic sequences. Also, any explanation for the eukaryotic selection of synonymous codons must involve the context of the codon within the sequence.

2. Methods

The statistical measure used in this study is derived from Information Theory, and its application to nucleic acid sequences is described in greater detail by Gatlin (1972) and by Lipman & Maizel (1982).

D2, the divergence from independence of the bases, is defined as:

$$
D2 = \sum_{i,j=1}^{4} P_{ij} \log P_{j|i} - \sum_{j=1}^{4} P_j \log P_j
$$

where $P_{ij}$ is the probability of doublet $ij$, $P_{j|i}$ is the probability of base $j$ following $i$, $P_j$ is the probability of base $j$, logarithms are to base 2, and D2 is in bits. Independence of the bases is defined as $P_{ij} = P_i P_j$. The value of D2 ranges from 0 to 2 bits. When the bases of a sequence are independent, D2 = 0. As the D2 measure of a sequence increases, the probability of base i at a particular sequence position increasingly depends on, at least, its nearest neighbors.

To calculate D2 on an intact sequence, singlet and overlapping doublet counts are tabulated. To calculate D2 for a particular position, for example position 1, only the singlets and doublets occupying position 1 of each codon are counted. In general, the singlet and doublet counts are used to calculate D2 as follows:

$$
D2 = \sum_{i,j=1}^{4} \frac{n_{ij}}{N_d} \log \frac{n_{ij}}{n_i} - \sum_{j=1}^{4} \frac{n_j}{N_s} \log \frac{n_j}{N_s}
$$

where $n_{ij}$ is the number of doublet $ij$, $n_i$ is the number of singlet $i$, $n_j$ is the number of singlet $j$, $N_d$ is the sum total of doublets counted and $N_s$ is the sum total of singlets counted (note, for the position-specific measures, $N_s = N_d$). This, as applied to find D2 for position 1, obtains $n_{ij}$ from doublet position 1 and $n_i$ and $n_j$ from the first and second bases of each codon, respectively. The overall D2 will denote the D2 measure for the intact sequence, not distinguishing codon positions. Subscripts 1, 2 and 3 (D2_1, D2_2 and D2_3) will denote the D2 measures for the corresponding codon-defined doublet positions of a sequence.

The computer simulations of various statistical models will be described in the Results. All computing was done on a DEC KL-10, programs were written in PASCAL, and Figures were generated using the MLAB facility at the National Institutes of Health. All nucleic acid sequences were taken from the Nucleic Acid Sequence Data Base at the National Biomedical Foundation, Georgetown University, except those of the human mitochondrial genome, which were taken from the Data Bank of Los Alamos Scientific Laboratory.
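The count-based formula above can be sketched in a few lines of Python. This is only an illustration of the calculation as described in the Methods, not the original PASCAL program; the function names (`d2_from_counts`, `d2_position`, `d2_overall`) are hypothetical, and the sequence is assumed to be an in-frame string of bases.

```python
from collections import Counter
from math import log2


def d2_from_counts(doublets, first, second):
    """D2 in bits from doublet counts n_ij and singlet counts n_i (first) and n_j (second)."""
    n_d = sum(doublets.values())   # N_d: total doublets counted
    n_s = sum(second.values())     # N_s: total singlets counted
    conditional = sum(n / n_d * log2(n / first[i]) for (i, _), n in doublets.items())
    marginal = sum(n / n_s * log2(n / n_s) for n in second.values())
    return conditional - marginal


def d2_position(seq, position):
    """D2 for codon-defined doublet position 1, 2 or 3 of an in-frame sequence."""
    offset = position - 1
    pairs = [(seq[k], seq[k + 1]) for k in range(offset, len(seq) - 1, 3)]
    doublets = Counter(pairs)
    first = Counter(i for i, _ in pairs)    # singlets occupying the first base of the doublet
    second = Counter(j for _, j in pairs)   # singlets occupying the second base of the doublet
    return d2_from_counts(doublets, first, second)


def d2_overall(seq):
    """D2 for the intact sequence: overlapping doublets and all singlets."""
    doublets = Counter(zip(seq, seq[1:]))
    singlets = Counter(seq)
    return d2_from_counts(doublets, singlets, singlets)
```

For example, `d2_position(seq, 3)` pairs the third base of each codon with the first base of the next codon, so the last codon contributes no doublet at that position.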

3. Results

Tables 1 to 3 give the D2 values for the mitochondrial, prokaryotic and eukaryotic sequences, respectively. There is a fairly wide range of values for each category within a group. The mitochondria, on the average, have the lowest overall D2 value, and relatively low D2_2 and D2_3 values. The prokaryotes have a higher D2_2 value, while the eukaryotes, with the highest overall D2, have relatively high D2_2 and D2_3 values. It appears that the rise in overall D2 from the mitochondria to the eukaryotes is due to increasing constraints on the third base of the codon. In mitochondrial sequences the third base shows weak dependence on either the second base of the codon or the first base of the neighboring codon. In prokaryotes the third base shows increased dependence on the second base of the codon, while in eukaryotes the third base shows a dependence on both the second base of the codon and the first base of the neighboring codon.

We conducted the following simulations to investigate the relationship between the above measures of statistical constraints on the third base of the codon and the bias in choice of synonymous codons. The first simulation tests the assumption that synonymous codons are chosen randomly and with equal probability (model I). For each real sequence 100 simulated sequences were generated in the following way.

TABLE 1

D2 values for mitochondrial sequences

Sequence                                     Overall D2   D2_1    D2_2    D2_3
Human cytochrome oxidase peptide 1           0.013        0.084   0.039   0.031
Human cytochrome oxidase peptide 2           0.004        0.084   0.120   0.042
Human cytochrome oxidase peptide 3           0.020        0.100   ON7     0.040
Yeast cytochrome oxidase peptide 2           0.034        0.165   0.066   0.027
Yeast ATPase subunit 3 or 6                  0.032        0.147   om6     WI08
Yeast cytochrome b                           0.027        0.145   0.110   0.067
Human ATPase                                 0.016        0.079   0.099   0.025

TABLE 2

D2 values of prokaryotic sequences

Sequence                                     Overall D2   D2_1    D2_2    D2_3
E. coli chloramphenicol acetyl transferase   0.024        0.061   0362    0.033
E. coli dihydrofolate reductase              0.037        0.074   0.150   0.032
E. coli trpFs                                OMO          0.082   0.171   0.050
E. coli t?yx                                 0.041        0.053   0.085   0.043
E. coli !lkpA                                0.045        0.106   0.152   0.029
E. coli RNA polymerase β-chain               0.034        0.087   0.220   0.018
E. coli lactose permease                     0.028        0.076   0.082   0.041
E. coli RPLA                                 0.056        0.096   0.301   0.021
E. coli RPLK                                 0.030        0.096   0.247   0.110
E. coli OMPA                                 0.023        0.036   0.129   0.023
E. coli heat stable toxin ST-I               0.127        0.219   0.235   0.155
Salmonella typhimurium trpD                  0.023        0.069   0.083   o%u4
Salmonella typhimurium trpR                  0.043        0.082   0.130   0.072
Salmonella typhimurium trpA                  0.034        0.101   0.108   0.025
Shigella dysenteriae trpD                    0.039        0.061   0.161   0.048
Serratia marcescens trpG                     0.041        0.084   0.230   0.044

The nucleic acid sequence is translated into the corresponding amino acid sequence. The set of synonymous codons for each amino acid position is identified. A new coding sequence is generated by randomly choosing with equal probability one codon out of the set of corresponding synonymous codons for each amino acid position. Thus, the amino acid sequence is reproduced exactly but the synonymous codons are randomly assigned. This model therefore invokes no selective constraint on the use of synonymous codons. Sequences thus generated will maintain the D2_1 value close to that of the original experimental sequence, but the overall D2, D2_2 and D2_3 values are free to vary.

The second simulation assumes that all positions of corresponding synonymous codons within a sequence are equivalent. The sequences are generated as follows. The positions of all corresponding synonymous codons within each sequence are identified. A new sequence is generated by randomly shuffling the corresponding synonymous codons between appropriate sequence positions. The resulting

TABLE 3

D2 values for eukaryotic sequences

Human corticotropin-lipotropin pIWLU30r Human choriogonadotropin a-chain plYX:WSO, Human somatotropin precursor Human choriomammotropin Human interferon al -precursor Human interferon &precursor Human interferon ,%precursor Rat pancreatic a-amylase Rat somatotropin precursor Bovine parathyroid hormone precursor Sea urchin histone HBA (Strong?Jlocmtrotus

pwpwntws)

Sea urchin histone H2-4 (t’srrmmechin,rs nr ilirrris) Sea urchin histone H3 Sea urchin histone H2B Fruit fly major heat shock protrm Mouw Hb /3 major chain gene Mouse Hh fl minor chain gene Human insulin Human Hb x-chain Human HI) p-chain Human Hb y-(:-chain Rat, prolactin Rat pro&&in 2 blouse Ig K-chain V region germline gene K2 Mouse Ig K-chain T’ region germline gene MOPC’ 41 Mouse Tl lg x-chain 1’ region differentiated gene Mouse Ig h-l chain V region differentiated gene Mouse Ig p-chain C’ region germline gene Mouse Ig y-I -chain (’ wgion germline gene Rabbit HI) I-cxhain (‘hicken lysozymr precursor Yeast wtin

0.094 PO92 0,133 w154 0.148 (F140 0948 0.1 13

0.1 14

0.039

O-029

0%6 0.105 0.054 @I46 0191 0999 O-095 0,137

0.127 0073 0.116 0.280 0.305 0.1 17 0.127 WI.59

0.044 tiO5X 00,X2 0166 0.180 (b102 0,069 OG98

0.149

0092

0.17H 0. I 53

0.1 I.5

0-0x0

oai%

0152 0.076 0481

0130 0.270 fi21.5

oQ90 0.108 0.133

0.056 0~050 0461

PO47 0.118 0. I 20 0.177 0.108 O-165 0,139 00.59 0964

@103 0196 0159 0.170 0,140 0265 0239 0~093 0~225

0%3 e259 0.279 0.2 13 I kO59 0.187 0.253 (b179 0.231

0.024 0.137 0.142 0.145 om 1 0.148 0161 0.076 OG985

0.097

0.184

OT?Ol

0 I 13

O-166

0,114

WI43

0.084

0~088

0.111

0,153

0.074

0.079

0.148

ow3

0.120

OMi

0,124

(PI27

omo

O-094 oan9 0.109 0.094

0.117 0.272 0959 0.33 1

0.170 0073 0038 002H

0091 OG64 0.034 0.040

sequences exactly reproduce the amino acid sequence and codon frequencies of the original sequence but the position of corresponding synonymous codons is randomly assigned (model II). Sequences thus generated will maintain the D2_1 and D2_2 values exactly as the original, but the overall D2 and D2_3 values are free to vary.

The last simulation assumes that the degenerate third base of a codon is chosen with respect to its nearest neighbors. The sequences are generated as follows. The counts of the triplet defined by doublet positions 2 and 3 are tabulated. From these counts the conditional probabilities of the third base, given the second base of a codon and the first base of the neighboring codon, are calculated. The third base of each codon of the simulated sequence is chosen by first identifying its nearest neighbors and thus the associated conditional probabilities. These conditional probabilities are used to weight a random choice of third base. A sequence is thus generated which exactly reproduces the amino acid sequence as well as maintaining the conditional probability of the third base given its nearest neighbors on either side. This model invokes a selective constraint on the choice of third base (and thus synonymous codon) which involves the context of the codon (model III). Sequences thus generated maintain D2_1 values exactly as the original measures; D2_2 and D2_3 values will be very close to the original measures. Given that all position-specific D2 measures are maintained as the original, the overall D2 value should be very close to the original experimental overall D2 value. Therefore, model III serves as a control.

For each of the three models, 100 simulated sequences were generated from each experimental sequence. The four D2 measures were determined for each generated sequence and a mean and standard deviation were calculated for 100 simulations of each experimental sequence. Thus, for each experimental sequence we have four sets of D2 values. The first set of values are those for the experimental sequence, the second set is the mean of simulations based on model I (no selective constraint), the third set is the mean of simulations based on model II (frequency constraint), and the fourth set is the mean of simulations based on model III (context constraint).

In Table 4 we compare the averaged D2 measures of the experimental sequences with the averaged measures for the three models.
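The three simulation models described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' PASCAL code: it uses the standard genetic code (mitochondrial codes differ), all function names are hypothetical, and for model III the third base is restricted to choices that keep the amino acid, with the last codon left unchanged because it has no following neighbor.

```python
import random
import statistics
from collections import defaultdict

# Standard genetic code (an assumption; mitochondrial codes differ slightly).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO))
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def split_codons(seq):
    """Split an in-frame sequence into codons, dropping any trailing partial codon."""
    return [seq[k:k + 3] for k in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    return "".join(CODON_TO_AA[c] for c in split_codons(seq))


def model_one(seq, rng):
    """Model I: equiprobable random choice among synonymous codons."""
    return "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in split_codons(seq))


def model_two(seq, rng):
    """Model II: shuffle the observed codons among positions coding the same amino acid."""
    codons = split_codons(seq)
    pools = defaultdict(list)
    for c in codons:
        pools[CODON_TO_AA[c]].append(c)
    for pool in pools.values():
        rng.shuffle(pool)
    return "".join(pools[CODON_TO_AA[c]].pop() for c in codons)


def model_three(seq, rng):
    """Model III: third base weighted by counts given second base and next codon's first base."""
    codons = split_codons(seq)
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(codons, codons[1:]):
        counts[(cur[1], nxt[0])][cur[2]] += 1
    out = []
    for k, cur in enumerate(codons):
        if k + 1 == len(codons):
            out.append(cur)  # assumption: the last codon has no 3' neighbor, so keep it
            continue
        context = counts[(cur[1], codons[k + 1][0])]
        # Assumption: only third bases that preserve the amino acid are eligible.
        options = [c for c in SYNONYMS[CODON_TO_AA[cur]] if c[:2] == cur[:2]]
        weights = [context[c[2]] for c in options]
        out.append(rng.choices(options, weights=weights)[0])
    return "".join(out)


def sd_units(experimental, simulated):
    """Signed distance of the simulation mean from the experimental value, in S.D."""
    return (statistics.mean(simulated) - experimental) / statistics.stdev(simulated)
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable; `sd_units` expresses how far the mean of the simulated D2 values lies from the experimental value, in the same S.D. units used in the comparison with Table 4.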
An asterisk denotes those D2 values of a simulation which were ≥3 S.D.† from the corresponding experimental D2 value. For the mitochondria, it can be seen that model I (no selection) is grossly adequate (1.0 S.D. too low), though model II (frequency constraint) is a better approximation. For the prokaryotes, model II is grossly adequate (1.2 S.D. too low) but model III (context constraint) is a better approximation. Finally, for the eukaryotes, only model III gives a sufficient approximation of experimental values.

Simulation by model III is useful because it demonstrates that one can conduct a legitimate "Monte Carlo" experiment to choose synonymous codons (the range in simulated D2 values was as great as for the other simulations) and still approximate the experimental overall D2 value. A close approximation of the experimental values does not constitute a proof of the model. For the purposes of this study it is of more importance that we have shown model II to be totally inadequate for eukaryotes than that model III is adequate. As mentioned above, model III serves only as a control.

The results of these simulations are consistent with the pattern of the measures on the experimental sequences alone: in mitochondria, the third base is only weakly dependent on either of its nearest neighbors; in prokaryotes, the third base is dependent on the second base of the codon and only weakly dependent on the first base of the neighboring codon; and in eukaryotes, the third base is dependent

† Abbreviation used: S.D., standard deviation.

TABLE 4

Averaged D2 measures of the experimental sequences compared with the averaged measures for the three models

Mitochondria                        Overall D2   D2_1    D2_2     D2_3
Experimental                        0.019        0.111   0.061    0a4u
Model I (no selection)              0.014        0.082   0.034    0.029
Model II (frequency constraint)     0.023        0.111   0.061    0.031
Model III (context constraint)      0.020        0.111   0.083    0.065

Prokaryotes                         Overall D2   D2_1    D2_2     D2_3
Experimental                        0.042        0.086   0.160    0.051
Model I (no selection)              0.016*       0.081   0.039*   0.036
Model II (frequency constraint)     0.035        0.086   0.160    0.039
Model III (context constraint)      0.044        0.086   0.176    0.069

Eukaryotes                          Overall D2   D2_1    D2_2     D2_3
Experimental                        0.087        0.111   0.154    0.152
Model I (no selection)              0.019*       0.103   04)4&f   0.052*
Model II (frequency constraint)     0.043*       0.111   0.154    0.055*
Model III (context constraint)      0.089        0.111   0.170    0~1%

* D2 values of a simulation ≥3 S.D. from the corresponding experimental D2 value.

on both its nearest neighbors. Thus, there is an increase in the constraints on the third base of the codon between mitochondria, prokaryotes and eukaryotes. There is also a corresponding increase in the overall D2, the divergence from independence for the overall sequence, between the three groups.

What is the relationship between the overall D2 measure for a sequence and the position-specific D2 measures? While there is clearly a mathematical relationship, it is not simple, nor does it appear subject to useful analytical treatment. To examine this relationship we plotted the overall D2 versus D2_1, D2_2 and D2_3 for each of the three groups of experimental sequences. With three interesting exceptions, no correlations were found between the overall D2 and the position-specific D2 measures. In Figure 1 we have plotted the overall D2 versus D2_1 for the mitochondria. With the exception of one point, representing the cytochrome oxidase peptide II gene of human, the data are remarkably linear (the correlation coefficient with all points included was 0.91). The data are limited (seven points), yet it appears that the overall divergence from independence in mitochondrial coding sequences shows a strong positive correlation with D2_1, the measure for the position most constrained by amino acid sequence. For the prokaryotes, both examination of the experimental D2_2 measures and the results of the simulations suggest that the choice of the degenerate third base is dependent on the second base of the codon; however, there was no clear relationship between the overall D2 and D2_2 values. Figure 2 is a plot of the overall D2 versus D2_2 for the eukaryotes. With the exception of several points in the lower right corner of the graph, there is an apparent positive correlation between the overall D2 and D2_2. Figure 3 is a plot of the overall D2 versus D2_3 for the eukaryotes. This clearly demonstrates a strong, positive correlation between the overall D2 and D2_3 values.
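The correlation coefficient quoted above is presumably the usual product-moment (Pearson) coefficient; the text does not say which estimator was used, so the sketch below is an assumption. The function name `pearson_r` is hypothetical, and any values passed to it here are illustrative, not data from the tables.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists of values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

In use, `xs` would hold the position-specific measure (for example D2_1) and `ys` the overall D2 of the same sequences, one pair per sequence.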
FIG. 1. Overall D2 versus D2_1 (in bits) for mitochondrial coding sequences.

FIG. 2. Overall D2 versus D2_2 (in bits) for eukaryotic coding sequences.

FIG. 3. Overall D2 versus D2_3 (in bits) for eukaryotic coding sequences.

To summarize the above results, the mitochondrial coding sequences have the lowest overall D2 values as well as low D2_2 and D2_3 values. Consistent with this observation of weak constraints on the degenerate third base, a statistical model which assumed no selective constraint on synonymous codons (model I) was grossly adequate in reproducing the experimental D2 measures, though a model which assumes selection of a particular biased frequency of codons (model II) was a good approximation. Furthermore, a positive, nearly linear, correlation was found between the overall D2 and D2_1 measures of mitochondrial coding sequences.

The prokaryotic coding sequences, which had intermediate overall D2 values with respect to the other two groups, had a higher D2_2 value than the mitochondrial sequences but a low D2_3 value. The results of the simulations were consistent with this observation of increased constraints on the degenerate third base because of a dependence on the second base of the codon. That is, a statistical model involving no selective constraint on choice of synonymous codons was totally inadequate (model I), but one which maintains a particular synonymous codon frequency for each sequence (the frequency found in each experimental sequence) was, at least, grossly adequate in reproducing the experimental D2 measures (model II).

The eukaryotic coding sequences, which have the highest overall D2 values, have relatively high D2_2 and D2_3 values. The results of the simulations are consistent with this observation of selective constraints from both nearest neighbors operating on the choice of degenerate third base. That is, a model which maintains exactly the codon frequency of each gene, but allows the position of corresponding synonymous codons to vary (model II), was totally inadequate for the eukaryotes. In addition, there appeared to be a strong, positive correlation between the overall D2 and D2_3, and a lesser correlation between the overall D2 and D2_2.

4. Discussion

With the rapid accumulation of a wide variety of nucleic acid sequences, the nonuniform distribution of synonymous codons has become apparent (for a review see Grantham et al., 1980). This suggested that there was a non-random, and thus selective, constraint on coding sequences in addition to the constraints of the amino acid sequence. Further evidence of this additional selective constraint was the work by Miyata & Hayashida (1981), which demonstrated that the evolutionary rate of pseudogenes exceeded the evolutionary rate of synonymous codons. Work by Sheppard & Gutman (1981), comparing two allotypic forms of the rat κ light-chain constant region, demonstrated that in at least some instances the evolutionary rate of synonymous codons was slower than that of replacement codons.

What is the basis of these additional selective constraints? One can examine some proposed functional constraints acting on the choice of synonymous codons in light of our findings on the statistical constraints of synonymous codon usage. Grantham et al. (1980) have proposed that the nonuniform distribution of synonymous codons represents a genome strategy, i.e. the distribution of synonymous codons in each gene conforms to the species' overall codon usage. This proposal was later modified (Grantham et al., 1981) when they observed a correlation between codon usage and protein expressivity in bacteria. The genes of abundantly expressed proteins tended to use similar synonymous codons. Ikemura (1981) extended this hypothesis by demonstrating a correlation between the abundance of Escherichia coli transfer RNAs and the corresponding codon usage in protein genes. Ikemura (1981) rated the synonymous codons as to their optimality with respect to the content of isoaccepting tRNAs and the nature of the codon-anticodon interaction. The genes for abundantly expressed proteins tended to use the optimal codons selectively, while the genes for proteins less frequently expressed did not show a selective usage of optimal codons.

The above hypotheses state that in any gene, the selection of synonymous codons is toward a particular frequency. This frequency, in the most general case, reflects that of the genome; in the more specific case, just the genes of abundantly expressed proteins would share synonymous codon usage.


Model I, which assumed no selective usage of synonymous codons, was grossly adequate only for mitochondria, and even for mitochondria the average D2 values were 1 S.D. away from the experimental values. A statistical model which assumed selection toward the overall codon usage of yeast mitochondria (genome hypothesis) resulted in D2 measures that closely approximated the experimental yeast mitochondria D2 measures (data not shown). A statistical model for the prokaryotes was tested which selected toward the pooled codon usage of the E. coli rplL, ompA and rpoB genes, three genes that Ikemura found had high concentrations of optimal codons. Sequences were generated with this model from the above three sequences and the resulting overall D2 values were 2 S.D. lower than the overall D2 values of the experimental sequences. A model which invokes selection toward a particular, generalized synonymous codon frequency appears sufficient only for the mitochondria (the genome hypothesis). Indeed, a model which assumes that each gene is selectively constrained to its own particular synonymous codon frequency (model II) produced sequences with overall D2 values 1.2 S.D. too low for prokaryotes. Our results do not contradict those of Grantham et al. (1980, 1981) or Ikemura (1981) but do suggest that additional constraints may be involved for the prokaryotes.

Model II (frequency constraint) was found totally inadequate for the eukaryotes, with the simulated overall D2 values over 5 S.D. too low. Therefore, any model for the eukaryotes must also involve the position of the codon with respect to the rest of the coding sequence. Shepherd (1981), and T. Smith, M. Waterman & J. Sadler (unpublished results) have found a preference for pyrimidine-purine doublets in doublet position 3. Smith, Waterman & Sadler suggested that this was due to functional constraints on ribosome translocation, while Shepherd suggested that this was a persisting pattern of an archaic code.
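The pyrimidine-purine preference in doublet position 3 mentioned above can be checked on any in-frame coding sequence with a short count. The sketch below is only an illustration of that tabulation; the function name `pyr_pur_fraction` is hypothetical and the sequence is assumed to be written in the DNA alphabet (T rather than U).

```python
PYRIMIDINES = set("CT")
PURINES = set("AG")


def pyr_pur_fraction(seq):
    """Fraction of doublet position 3 pairs (third base, next first base) that are pyrimidine-purine."""
    pairs = [(seq[k], seq[k + 1]) for k in range(2, len(seq) - 1, 3)]
    if not pairs:
        raise ValueError("sequence must contain at least two codons")
    hits = sum(1 for i, j in pairs if i in PYRIMIDINES and j in PURINES)
    return hits / len(pairs)
```

A fraction well above the value expected from the base composition of the two positions would indicate the kind of preference described, although the count alone says nothing about its cause.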
We tested a model which maintains the doublet frequencies (and would thus maintain any pyrimidine-purine ordering) in the third doublet position. Sequences thus generated maintain D2, exactly, D2, is maintained very close to the original sequence, but 02, and 02, are free to vary. For the eukaryotes, the 02, values of the resulting sequences were 2 s.1). greater than the 02, values generated for model II, which exactly maintained the codon frequencies of each sequence (data not shown). This “third doublet posit,ion” model was > 1 s.11.worse than model II for prokaryotes, as would be predicted from the experimental D2 values. This pyrimidine-purine model proposed by Shepherd (1981). and T. Smith, M. Waterman & ?J. Sadler (unpublished results) gains further support from our finding of a strong. positive correlat,ion between D2, and IA?,. Nevertheless, the D2, values of sequences generated by this model were still >3 s.I). lower than those of the experimental sequences, and thus cannot, alone account for the selective constraints on eukaryotic. synonymous codons. Hasegawa et al. (1979), have found a correlation between synonymous codon usage and secondary structure in MS2. Those regions presumed to be involved in base-pairing tended toward G or C in the degenerate third base position, while those regions not involved in base-pairing tended toward C or A. Thus, secondary structure requirements of the viral RKA may be involved in the choice of synonymous codons. It is of interest to note that the prokaryotic and eukaryotic

374

I). .J. LIPMAN

ANI)

W. .J. WILHITR

viruses are more similar to the mitochondrial sequences than to the prokaryotes 01 eukaryotes in that they exhibit low values of both D22 and D23 (data not shown). Another context constraint which may influence codon usage is selection for particular nearest-neighbor patterns. Nussinov (1980) has analyzed a broad range of DNA sequences and found some “universal” nearest-neighbor rules. Though the patterns may differ between prokaryotes and euka,ryotes, certain heterodinucleotides consistently appear at great’er (or lesser) frequencies than t’heil mirror image dinucleotides. Nussinov (1981) extended this analysis by determining the doublet frequencies in the three codon-defined doublet positions of proteincoding sequences. The doublet frequencies in the coding frame (doublet position 1) often diverged from the “universal” pattern. In the non-coding frames (doublet, positions 2 and 3) the doublet frequencies were not very different from the doublet frequencies in regions of DNA4 which do not code for protein. Nussinov also did simulations which demonstrated that the overall doublet frequencies in coding regions were not primarily determined by choice of amino acid. Nussinov concluded that the constraints on synonymous codon choice are the same as the structural “universal” doublet frequencies in non-protein constraints which lead to particular coding regions. Although some doublets appear far less frequently than expected (i.e. C’-(2). and others appear more frequently (i.e. T-G or LA). there is significant individual variation between the non-coding frames of prot,ein-coding regions. In the mouse nhemoglobin for example. CG appears to be selectively maintained in the noncoding frame (T. Smith. M. Waterman & J. Sadler, unpublished results). Also. if the primary constraint on synonymous codon choice was to maintain a particular doublet frequency. one would not expect to find a difference between doublet positions 2 and 3. 
However, these positions appear different in prokaryotes, with the divergence from independence being greater on the average in position 2 than in position 3. Despite these objections, the structural constraints on doublet frequencies which operate throughout the genome may play a role in the choice of synonymous codons.

Bossi & Roth (1980) have isolated mutants in Salmonella which have increased efficiency in translating an amber codon with a suppressor tRNA. This increased efficiency was due to a mutation in the nucleotide adjacent to the 3' side of the amber codon. Thus, there is experimental evidence that changes in codon context can affect translation. A functional requirement which would impose a positional constraint on synonymous codons involves an interaction of the tRNAs at the P and A sites in the ribosome. If particular tRNA pairs had a differential effect on the accuracy or rate of translation, then the probability of the degenerate third base would depend on at least its nearest neighbor.

It can be seen that the D2 measures of the statistical constraints on synonymous codons are useful in evaluating various explanations for the patterns of synonymous codon usage. Can the D2 measures be related to measures of evolutionary rate in synonymous codon positions? Brown et al. (1982), in studying mitochondrial sequences of primates, found that the evolutionary rate in synonymous codon positions was nine to ten times that of synonymous codon positions in chromosomal genes. They concluded that this difference was due to a higher mutation rate in mitochondria. With this in mind, they concluded that similar higher evolutionary rates in other functional regions of the mitochondrial genome, tRNA genes for example, were due to a combination of a higher mutation rate and relaxed functional constraints. Our results suggest that there are also relaxed constraints on the synonymous codon positions in mitochondria, and it may not be necessary to invoke a higher rate of mutation to explain the evolutionary rate of the mitochondrial genome; the relaxed functional constraints may be sufficient. We are currently studying the relationship between the measures of evolutionary rate and our measures of the statistical constraints on nucleic acid sequences.
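The dependence of the degenerate third base on its neighbors, as discussed above for the three codon-defined doublet positions, can be illustrated with a minimal sketch. It assumes a plain mutual-information estimate in the spirit of Gatlin's (1972) divergence from independence; it is not necessarily the exact D2 statistic defined earlier in this paper. The function name `doublet_dependence` and the toy sequence are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import log2


def doublet_dependence(seq, position):
    """Mutual information (bits) between the two bases of a codon-defined doublet.

    position 1: bases 1 and 2 of each codon (coding frame)
    position 2: bases 2 and 3 of each codon
    position 3: base 3 of a codon and base 1 of the following codon

    Assumption: seq is an in-frame coding sequence; trailing bases that do
    not complete a codon are ignored.
    """
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    seq = seq.upper()
    coding_length = 3 * (len(seq) // 3)
    offset = position - 1

    # Collect one doublet per codon at the requested position.
    pairs = []
    for start in range(offset, coding_length - 1, 3):
        pairs.append((seq[start], seq[start + 1]))

    total = len(pairs)
    if total == 0:
        return 0.0

    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)

    # Sum p(a,b) * log2( p(a,b) / (p(a) * p(b)) ) over observed doublets.
    info = 0.0
    for (a, b), n in joint.items():
        info += (n / total) * log2(n * total / (left[a] * right[b]))
    return info


# Toy sequence in which the third base always copies the second base.
toy = "AAAACCAGGATT"
for pos in (1, 2, 3):
    print(pos, round(doublet_dependence(toy, pos), 3))
```

In this toy sequence the third base is fully determined by the second, so only doublet position 2 shows a non-zero value. Estimates of this kind are biased upward for short sequences, so real comparisons would need sequences of adequate length or a correction for sample size.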

5. Conclusion

We have examined the constraints on choice of synonymous codons in mitochondrial, prokaryotic and eukaryotic coding sequences. The position-specific D2 measures indicated that the constraints on the degenerate third base were weakest in mitochondria. These constraints were stronger in prokaryotes because of an increased dependence of the third base on the second base of the codon. Eukaryotes displayed the strongest constraint on the choice of synonymous codon because the degenerate third base was dependent on both its nearest neighbors.

The results of simulations were consistent with the above. The simplest model which best approximated the mitochondrial D2 values assumed selection of a particular overall frequency of synonymous codons, the genome hypothesis of Grantham et al. (1980). Such a model was grossly inadequate for prokaryotes even when modified by assumptions of correlations between protein abundance and choice of synonymous codons. The most stringent model assuming selection of a particular synonymous codon frequency (model II) resulted in D2 values 12 S.D. lower than the experimental prokaryotic D2 values. Therefore, selection of synonymous codons in prokaryotes may involve positional constraints.

Explanations for synonymous codon usage in eukaryotes must involve positional constraints. Selection of pyrimidine-purine third position doublets may be involved but alone is not sufficient to account for the observed constraints. Other positional constraints, such as nucleic acid secondary structure requirements, or interactions between the tRNAs in the ribosome, may also be involved in the choice of synonymous codons in eukaryotes.

We acknowledge Dr C. Guthrie, Dr A. Wilson and Dr T. Smith for helpful discussions and advice.

REFERENCES

Air, G. M., Blackburn, E. H., Coulson, A. R., Galibert, F., Sanger, F., Sedat, J. W. & Ziff, E. B. (1976). J. Mol. Biol. 107, 445-458.
Bossi, L. & Roth, J. (1980). Nature (London), 286, 123-127.
Brown, W., Prager, E., Wang, A. & Wilson, A. (1982). J. Mol. Evol. 18, 225-239.
Efstratiadis, A., Kafatos, F. & Maniatis, T. (1977). Cell, 10, 571-585.
Fiers, W., Contreras, R., Duerinck, F., Haegeman, G., Merregaert, J., Min Jou, W., Raeymaekers, A., Volckaert, G., Ysebaert, M., Van de Kerckhove, J., Nolf, F. & Van Montagu, M. (1975). Nature (London), 256, 273-278.
Gatlin, L. (1972). Information Theory and the Living System, Columbia University Press, New York and London.
Grantham, R., Gautier, C., Gouy, M., Mercier, R. & Pavé, A. (1980). Nucl. Acids Res. 8, r49-r62.
Grantham, R., Gautier, C., Gouy, M., Jacobzone, M. & Mercier, R. (1981). Nucl. Acids Res. 9, r43-r79.
Hasegawa, M., Yasunaga, T. & Miyata, T. (1979). Nucl. Acids Res. 7, 2073-2079.
Ikemura, T. (1981). J. Mol. Biol. 151, 389-400.
Lipman, D. & Maizel, J. (1982). Nucl. Acids Res. 10, 2723-2739.
Miyata, T. & Hayashida, H. (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 5739-5743.
Nussinov, R. (1980). Nucl. Acids Res. 8, 4545-4562.
Nussinov, R. (1981). J. Mol. Evol. 17, 237-244.
Shepherd, J. (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 1596-1600.
Sheppard, H. W. & Gutman, G. A. (1981). Proc. Nat. Acad. Sci., U.S.A. 78, 7064-7068.

Edited by S. Brenner