Gene, 120 (1992) 93-98
© 1992 Elsevier Science Publishers B.V. All rights reserved. 0378-1119/92/$05.00

GENE 06679

Comparison of the complete sequence of the str operon in Salmonella typhimurium and Escherichia coli

(S7 protein; rpsG; elongation factor G; fusA; evolution rate; divergence; codon adaptation index; ribosomal protein)

Urban Johanson and Diarmaid Hughes

Department of Molecular Biology, Uppsala University, Biomedical Center, S-751 24 Uppsala, Sweden

Received by G. Bernardi: 9 April 1992; Revised/Accepted: 25 May/26 May 1992; Received at publishers: 26 June 1992
SUMMARY
The nucleotide (nt) sequences of the str operon in Escherichia coli K-12 and Salmonella typhimurium LT2 were completed and compared at the nt and amino acid (aa) level. The order of conservation at the nt and aa level is rpsL > tufA > rpsG > fusA. A striking difference is that the rpsG-encoded ribosomal protein, S7, in E. coli K-12 is 23 aa longer than in S. typhimurium. The very low (0.18) codon adaptation index of this part of the E. coli K-12-encoding gene and the unusual stop codon (UGA) suggest that this is a relatively recent extension. A trend towards a higher G+C content in fusA (gene encoding elongation factor (EF)-G) and tufA (gene encoding EF-Tu) in S. typhimurium is noted. In fusA, nt substitutions at all three positions in a codon occur at a much higher frequency than expected from the number of nt substitutions in the gene, assuming they are random and independent events. An analysis of substitutions in this and other genes suggests that the triple substitutions in fusA, and some other genes, are the result of the sequential accumulation of individual mutations, probably driven by selection pressure for particular codons or aa.
INTRODUCTION
The str operon in prokaryotes consists of rpsL, rpsG, fusA and tufA (Douglas, 1991 and references therein) coding for the r-proteins S12 and S7, and the translation factors EF-G and EF-Tu, respectively. The operon is expressed from a promoter upstream from rpsL, via a polycistronic mRNA (Jaskunas et al., 1975). S7 acts as an autoregulator of rpsG and fusA translation, by binding to the mRNA sequence between rpsL and rpsG (Dean et al., 1981). The regulation of EF-Tu production is more complex, with a second chromosomal copy of the gene and two additional promoters for tufA inside fusA (Zengel and Lindahl, 1990). To facilitate the study of the regulation of tufA and the genotype of novel mutants selected in fusA in S. typhimurium, the unknown part of the str operon, rpsG and fusA, was sequenced in a wt strain and the whole operon was compared with the completed E. coli sequence.

Correspondence to: Dr. D. Hughes, Department of Molecular Biology, Uppsala University, Biomedical Center, Box 590, S-751 24 Uppsala, Sweden. Tel. (46-18)174203; Fax (46-18)557723.

Abbreviations: aa, amino acid(s); bp, base pair(s); CAI, codon adaptation index; EF, elongation factor; fusA, gene encoding EF-G; kb, kilobase(s) or 1000 bp; nt, nucleotide(s); PCR, polymerase chain reaction; r, ribosomal; rpsG, gene encoding S7; rpsL, gene encoding S12; S7, S12, r-proteins; SD, Shine-Dalgarno (sequence); str, operon encoding S12, S7, EF-G and EF-Tu; tufA, gene encoding EF-Tu; wt, wild type.

EXPERIMENTAL AND DISCUSSION
(a) Sequencing of rpsG and fusA
The 2.8-kb region between rpsL and tufA was amplified from chromosomal DNA of the strain S. typhimurium LT2, and sequenced. A small part of rpsG in E. coli was also sequenced using the K-12 strain MG1655 in order to make
[Fig. 1: aligned nt and deduced aa sequences; the sequence listing is not recoverable from this copy.]
Fig. 1. Nucleotide sequence of rpsG, fusA and the fusA-tufA spacer in S. typhimurium. The first 11 nt do overlap with the rpsG region previously sequenced in S. typhimurium (Hughes and Buckingham, 1991); the same is true for the last [...] the intergenic spacer (Tuohy et al., 1990). The nt sequence of the operon in E. coli K-12 (Post and Nomura, 1980; Zengel et al., 1984; Yokota et al., 1980), nt 244-441 from this work, is shown above if it differs. The sequence is underlined where insertions or deletions occur. The deduced aa sequences are shown one row below the nt sequence for S. typhimurium, and two rows below for E. coli, but only where they differ. The nt sequence data reported in this paper are in the EMBL, GenBank and DDBJ nt sequence databases under the accession Nos. X64591 (S. typhimurium) and X64592 (E. coli). Methods. The 2.8-kb region between rpsL and tufA was amplified from chromosomal DNA of the S. typhimurium strain LT2, using two pairs of primers in a symmetric PCR (30 cycles for 1 min at 94°C; 1 min at 66°C; 2 min at 72°C). In a second asymmetric PCR using the same conditions as above, but seeded with 5% of the first reaction, template was generated for the subsequent nt sequencing. The primers for sequencing and PCR were based on the E. coli sequence where the S. typhimurium was unknown. Details of the primers used are available on request. E. coli K-12 strain MG1655 was also sequenced, from nt position 241 to 521. The asymmetric PCR products were purified through Centricon 100 filters prior to sequencing with the T7 sequencing kit from Pharmacia (Uppsala).
[Fig. 2: scale diagram of the str operon (rpsL, rpsG, fusA, tufA and the intergenic spacers) with, for each segment, the rows bp S (a), bp E (b), % DNA (c), % aa (d), CAI S (e), CAI E (f), KS (g) and KA (h); the per-segment values are not recoverable from this copy.]
Fig. 2. The data for the first part of the operon, including the rpsL-rpsG spacer, are from Hughes and Buckingham (1991) and for tufA essentially from Sharp (1991). The part of the rpsG-fusA spacer, which is a coding sequence in E. coli K-12 but not in S. typhimurium, is hatched. The figure is drawn to scale, except for fusA and tufA which are much longer, indicated by the broken bar. (a) bp in S. typhimurium; stop codons are included in coding region. (b) bp in E. coli; stop codons are included in coding region. (c) % identity at nt level; insertions and deletions are treated equal to substitutions. (d) % identity at aa level. (e) CAI (Sharp and Li, 1987a) in S. typhimurium. (f) CAI in E. coli. (g) KS (Li et al., 1985), the number of synonymous substitutions per synonymous site. (h) KA (Li et al., 1985), the number of nonsynonymous substitutions per nonsynonymous site.
the E. coli sequence complete. The deduced aa sequence of S7 from K-12 has one extra Arg9* (Fig. 1) compared to the sequence reported from protein sequencing (Reinbolt et al., 1978); otherwise it is in perfect agreement. The results of a comparative analysis of the str operons are summarized in Fig. 2.
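The identity figures used in this comparison count insertions and deletions as if they were substitutions (legend to Fig. 2). A minimal sketch of that convention follows, assuming two sequences that have already been aligned to equal length with `-` as the gap character; the function name and the toy sequences are illustrative and are not taken from the paper.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over an alignment, counting gap columns as mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column is a match only if both characters agree and neither is a gap.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Toy alignment (hypothetical): one substitution and one single-nt gap in 10 columns.
toy_identity = percent_identity("ATGGCC-TAA", "ATGGCTTTAA")
```

In this toy case 8 of the 10 columns match, so the gap lowers the identity exactly as a substitution would.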
The most striking difference between the str operons of these two organisms is that the encoded S7 protein in E. coli K-12 is 23 aa longer than S7 in S. typhimurium. However, K-12 seems to be the exception because even E. coli B and all other species examined so far encode the short version of S7 (Reinbolt et al., 1978; Buttareli et al., 1989; Douglas, 1991; Wagar and Pang, 1992, and references therein). This extension of S7 in K-12 is probably quite recent, judging from the low CAI of 0.18 (Fig. 2) for this part of the gene, which is close to the value, 0.17, of a sequence of equiprobable sense codons (Sharp and Li, 1987a). The CAI for the whole rpsG in K-12 is only 0.53, which is lower than the average of 0.61 for the 32 r-proteins examined but still higher than or the same as the six lowest in the set, which ranges from 0.42 to 0.81. The maintenance of the low CAI in these other six genes might in part be due to regulatory constraints but shows that values at 0.53 and below are tolerated in highly expressed genes. The stop codon UGA of rpsG in K-12 is also rather unusual for an r-protein (Post and Nomura, 1980) or for any gene with a high CAI (Sharp and Bulmer, 1988) and might reflect how unadapted this end of the gene is for optimal translation. An examination of the sequences for 32 r-proteins from E. coli K-12 shows that only four do not use UAA as the stop codon, rpsG included. In the homologous part of S7 in S. typhimurium there is only one aa change, a Glu for an Asp, which is considered a conservative change (Grantham, 1974; Li et al., 1985).
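The CAI values discussed above are, following Sharp and Li (1987a), the geometric mean of the relative adaptiveness w of each codon in a gene. The sketch below illustrates only that averaging step. The weights in `TOY_WEIGHTS` are hypothetical placeholders, not the published E. coli values, and skipping codons that have no weight (for example the single-codon amino acids Met and Trp) is an assumption of this sketch.

```python
import math

# Hypothetical relative-adaptiveness values (w); NOT the published E. coli table.
TOY_WEIGHTS = {"AAA": 1.0, "AAG": 0.25, "GAA": 1.0, "GAG": 0.25}


def split_codons(seq: str) -> list[str]:
    """Split a coding sequence into complete codons, dropping any trailing nt."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def cai(seq: str, weights: dict[str, float]) -> float:
    """Geometric mean of w over the codons that have a weight in the table."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


cai_optimal = cai("AAAGAA", TOY_WEIGHTS)  # only best codons: CAI of 1
cai_mixed = cai("AAAAAG", TOY_WEIGHTS)    # one best, one poor codon: sqrt(0.25)
```

Because the mean is geometric, a short stretch of poorly adapted codons, such as the 23-codon extension of rpsG in K-12, yields a low CAI for that stretch even when the rest of the gene scores well.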
(c) Intergenic regions
The total length of the coding and intergenic regions of the operon is identical in the two species, insertions in one spacer being compensated by deletions in another. The only intergenic spacer in the operon suggested to be functionally important so far (excluding the short SD regions) is the rpsL-rpsG spacer where protein S7 is supposed to bind (Nomura et al., 1980). As can be seen in Fig. 2 this spacer is indeed highly conserved compared with the others in the same operon, implying that the same autoregulation occurs in S. typhimurium as has been shown in E. coli (Dean et al., 1981).

(d) EF-G
EF-G is the least conserved gene in the operon. However, there is only one aa change in a conserved region of EF-G (Kohno et al., 1986; Grinblat et al., 1989) and that is a moderately conservative replacement (Grantham, 1974; Li et al., 1985) of Thr493 by Ala in S. typhimurium, at the very end of a conserved region. This region may be involved in the interaction with the ribosome (Kohno et al., 1986). It is somewhat unexpected that fusA is the most divergent gene in the operon (Fig. 2), but it might be that the important domains of EF-G are small relative to its size. One of the striking features of EF-G is its large size, which may be necessary to make essential contacts with the two r-subunits at the same time. Two functional promoters for tufA, within the fusA gene, have been described (Zengel and Lindahl, 1990; Zengel et al., 1984). The sequence of the second of these promoters is changed in S. typhimurium. A T2466→C transition weakens the homology with the -35 consensus sequence and may indicate that the first promoter is the dominant one in S. typhimurium.
(e) The str operon
At the nt level, rpsL and tufA show the highest interspecies similarity followed by rpsG (only comparing the homologous, coding part of the genes) and then fusA. This is partly a reflection of the similarity at the protein level which follows the same order. The KS values (the number of synonymous substitutions per synonymous site), which perhaps better illustrate the divergence at the nt level independent of the aa sequence, also show the same order of divergence. A high KS indicates freedom at the nt level unperturbed by constraints on the aa sequence. The number of nonsynonymous substitutions per nonsynonymous site (KA) is clearly higher in fusA than in the other genes. The previously reported order of conservation in the operon, comparing the distantly related organisms Spirulina platensis, E. coli and Micrococcus luteus (Buttareli et al., 1989), puts EF-G before S7 as the more conserved protein. The order is not reversed by including the missing Arg in the E. coli sequence compared. There is in general a negative correlation between KS and CAI (Sharp and Li, 1987b; Sharp, 1991), but in the str operon fusA with the highest KS value also has the second highest CAI. This suggests that the exceptionally low KS of rpsL and rpsG are not wholly a result of the high CAI but could in part be explained in terms of a selective pressure on the nt sequence for a regulatory purpose. The stricter conservation of the genes at both ends of the operon could reflect constraints imposed by regulation of the messenger level.

Of the 196 nt substitutions in the operon, 86 replace an AT bp in E. coli with a GC bp in S. typhimurium. The reverse is observed in 66 cases. The changes responsible for this drift mainly occur in fusA and tufA, changing from 50.8% to 51.7% and from 53.2% to 54.0% G+C, respectively. However, this trend does not greatly change the overall nt composition of the operon. In E. coli the G+C content of the operon is 51.0% compared with 51.5% in S. typhimurium. A trend towards a higher G+C content in
TABLE I
The distribution of nt substitutions a
Columns: (1) Gene; (2) % identity b; (3) KA c; (4) CAI(E) d; (5) Obs./Exp. e for codons with 0, 1, 2 and 3 nt substitutions; (6) Number of codons f; (7) P(≥n) g
w
98.2
0.00 1
0.78
1.0
0.9
2.8
0
1.0
rpsL tgf;l
98.1
0
0.66
1.0
1.0
0
0
1.0
98.0
0.82
1.0
0
1.0
96.6
0.63
1.0
1.0 1.0
2.5
rp.G
0.00 1 0.003
0
0
,firsA
94.3
0.74
1.0
93.8
0.63
1.0
0.8 1.1
1.6
rpoB ompA
0.015 0.008 0.039
0.76
0.020
0.41
1.0 0.9
0.8
trpB
90.1 84.4
trpE
80.2
0.069
0.36
1.0
I.0
31.8 8.7
1.2x lo-= 4.3 X lo-
14.9
3.0 x lo-’
1.2
1.0 0.4
I.0
0.9
1.7
0.3
0.6
1
7.8 x lo-
trpA
75.3
0.083
0.34
1.0
1.1
0.8
1.5
6
1.2 x 10 2.1 x 10-i
tar
75.2
0.124
0.32
1.0
1.0
0.8
1.9
17
8.8 x 10
p&B
72.8
0.166
0.33
1.1
1.0
0.7
2.7
25
1.1 x IF5
SUIA
68.5
0.136
0.23
1.1
1.0
0.7
2.3
12
6.1 x 10-l
a A subset of the genes that were sequenced in E. coli and S. typhimurium are listed according to their identity at nt level (column 2). The nt sequences were retrieved from the EMBL databank. KA and CAI are essentially from Sharp (1991). The nt substitutions are assumed to occur randomly and independently, in the calculations of the expected number of codons with 0, 1, 2 or 3 nt substitutions. The expected frequency for each type of codon (0, 1, 2 or 3 nt substitutions) is calculated as the product of the probability of occurrence of the nt substitutions and/or the nonsubstitutions making up that codon type, times the number of permutations of each type. The expected frequency is multiplied by the number of codons in the gene to get the expected number of codons of 0, 1, 2 or 3 nt substitutions. Because triple substitutions saturate the sites in a codon the number of substitution events may be greater.
b % identity at nt level, allowing gaps in the alignment only if they are a multiple of codons.
c KA (Li et al., 1985), the number of nonsynonymous substitutions per nonsynonymous site.
d Codon adaptation index in E. coli.
e The observed number of codons in the reading frame with 0, 1, 2, or 3 nt substitutions relative to the expected number (column 5).
f The observed number of codons in the reading frame with three substitutions.
g The probability to find n or more codons with three nt substitutions in a given gene. The distribution is assumed to be binomial.
S. typhimurium has been observed before (Riley and Krawiec, 1987, and references therein).
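The G+C drift described in this section rests on two simple counts: the G+C percentage of a sequence and the number of aligned positions that change between an AT and a GC bp. A minimal sketch of both follows, assuming ungapped, equal-length aligned sequences; the function names and toy inputs are illustrative only.

```python
def gc_content(seq: str) -> float:
    """G+C percentage over the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def at_gc_changes(reference: str, other: str) -> tuple[int, int]:
    """Count AT->GC and GC->AT differences going from `reference` to `other`."""
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to equal length")
    at_to_gc = gc_to_at = 0
    for ref_base, other_base in zip(reference.upper(), other.upper()):
        if ref_base in "AT" and other_base in "GC":
            at_to_gc += 1
        elif ref_base in "GC" and other_base in "AT":
            gc_to_at += 1
    return at_to_gc, gc_to_at


# Toy example (hypothetical sequences): two AT->GC changes, one GC->AT change.
toy_changes = at_gc_changes("AATG", "GCTA")
```

With the counts reported above, 86 AT→GC against 66 GC→AT changes leave a net gain of 20 GC bp in S. typhimurium, and the remaining 44 of the 196 substitutions do not alter the G+C content.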
(f) The distribution of nt substitutions
There are more codons with three nt substitutions in fusA than would be expected if each nt substitution was a random and independent event. In order to examine this in more detail 13 genes with different KA and CAI values (Sharp, 1991) were compared. The results are listed in Table I. There are two possible ways in which the triple substitutions could have arisen (Sharp, 1991). One possibility is that there are mutational mechanisms which cause multiple simultaneous substitutions. Alternatively, they could be the result of substitutions occurring sequentially with conservation of aa similarity and/or the CAI as the driving force in the selection.

The genes in Table I are listed in order of nt divergence and seem to fall into at least three groups according to how many triple substitutions they have relative to the number expected. The first group, consisting of the first four genes in Table I, lacks any codons with three substitutions. The next three genes, fusA, rpoB and ompA, form a second group in which the number of codons with triple substitutions exceeds the expected value by about an order of magnitude. The high number of triple substitutions in group two is statistically significant (Table I). The last six genes in Table I, with approximately twice as many triple substitutions as expected, make up the third group.

The lack of triple substitutions in the first group is consistent with their very low nt substitution frequencies and their high conservation at the aa level. The observed double substitutions in this group are unique events and thus not statistically significant.

If nt substitutions have a tendency to appear in clusters, because of the nature of the mutational mechanism, this should be most evident in the third group where the divergence is greatest and the selection pressure against multiple changes should be less than in the two other groups. The underrepresentation of double substitutions in this group probably indicates that these genes are subject to some selection pressure, and this will account for some of the excess of triple substitutions in most of the genes in this group. However, their almost even spread in all frames indicates that their occurrence may indeed be at least partly due to a mutational mechanism causing simultaneous multiple nt substitutions.

In the second group, as in the first, there is a high CAI but the aa sequences, although highly conserved, are more divergent than in the first group as indicated by the higher KA values. The erratic frequency representation of both single and double substitutions in this group may partly be a consequence of strong selection pressure against any changes and partly be due to variation because of the small sample size. In this group the number of triple substitutions per codon relative to the expected is even higher than in group three. In addition, triple substitutions are more rarely out of frame than in frame. The triple substitutions in group two are also more conservative in terms of both the aa and the codon (Grantham, 1974; Sharp and Li, 1987a) than such substitutions in group three. These observations argue that the model of sequential substitutions has some relevance in fusA, rpoB and ompA. We propose that in this class of genes most of the nt substitutions will reduce the fitness of the gene and thus be subject to selection pressure for better codons, resulting in the selection of additional nt substitutions within a codon.
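The expectation behind Table I can be sketched as follows, assuming, as the table footnote states, that nt substitutions are random and independent and that the number of triple-substitution codons is binomially distributed. Here `p` is the per-nt substitution frequency (for example 1 minus the fractional nt identity); the function names and the toy sequences are illustrative and are not the authors' own code.

```python
from math import comb


def expected_codon_counts(n_codons: int, p: float) -> list[float]:
    """Expected number of codons with 0, 1, 2 and 3 substituted positions."""
    # Probability of k substitutions in a codon, times the number of codons.
    return [n_codons * comb(3, k) * p**k * (1 - p) ** (3 - k) for k in range(4)]


def observed_codon_counts(seq_a: str, seq_b: str) -> list[int]:
    """Observed number of in-frame codons differing at 0, 1, 2 or 3 positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    counts = [0, 0, 0, 0]
    for i in range(0, len(seq_a) - len(seq_a) % 3, 3):
        diffs = sum(a != b for a, b in zip(seq_a[i:i + 3], seq_b[i:i + 3]))
        counts[diffs] += 1
    return counts


def prob_at_least(n: int, n_codons: int, p: float) -> float:
    """Binomial probability of n or more triple-substitution codons."""
    q = p**3  # chance that all three positions of one codon are substituted
    below = sum(
        comb(n_codons, i) * q**i * (1 - q) ** (n_codons - i) for i in range(n)
    )
    return 1.0 - below


# Toy comparison (hypothetical sequences): codons with 0, 1 and 3 differences.
toy_observed = observed_codon_counts("ATGAAACCC", "ATGAAGTTT")
```

Dividing each observed count by the matching expected count gives an observed/expected ratio of the kind listed in column 5 of Table I, and `prob_at_least` corresponds to the probability in column 7 under the stated binomial assumption.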
ACKNOWLEDGEMENTS
This work was supported by grants from the Swedish Natural Science Research Council to D.H. and to C.G. Kurland, and from the Swedish Cancer Society to C.G. Kurland. We thank Farhad Abdulkarim, Otto Berg and Charles G. Kurland for helpful suggestions.
REFERENCES
Buttareli, F.R., Calogero, R.A., Tiboni, O., Gualerzi, C.O. and Pon, C.L.: Characterization of the str operon genes from Spirulina platensis and their evolutionary relationship to those of other prokaryotes. Mol. Gen. Genet. 217 (1989) 97-104.
Dean, D., Yates, J.L. and Nomura, M.: Identification of ribosomal protein S7 as a repressor of translation within the str operon of E. coli. Cell 24 (1981) 413-419.
Douglas, S.E.: Unusual organization of a ribosomal protein operon in the plastid genome of Cryptomonas Φ: evolutionary considerations. Curr. Genet. 19 (1991) 289-294.
Grantham, R.: Amino acid difference formula to help explain protein evolution. Science 185 (1974) 862-864.
Grinblat, Y., Brown, N.H. and Kafatos, F.C.: Isolation and characterization of the Drosophila translational elongation factor 2 gene. Nucleic Acids Res. 17 (1989) 7303-7314.
Hughes, D. and Buckingham, R.H.: The nucleotide sequence of rpsL and its flanking regions in Salmonella typhimurium. Gene 104 (1991) 123-124.
Jaskunas, S.R., Lindahl, L., Nomura, M. and Burgess, R.R.: Identification of two copies of the gene for the elongation factor EF-Tu in E. coli. Nature 257 (1975) 458-462.
Li, W.-H., Wu, C.-I. and Luo, C.-C.: A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol. 2 (1985) 150-174.
Kohno, K., Uchida, T., Ohkubo, H., Nakanishi, S., Nakanishi, T., Fukui, T., Ohtsuka, E., Ikehara, M. and Okada, Y.: Amino acid sequence of mammalian elongation factor 2 deduced from the cDNA sequence: homology with GTP-binding proteins. Proc. Natl. Acad. Sci. USA 83 (1986) 4978-4982.
Nomura, M., Yates, J.L., Dean, D. and Post, L.E.: Feedback regulation of ribosomal protein gene expression in Escherichia coli: structural homology of ribosomal RNA and ribosomal protein mRNA. Proc. Natl. Acad. Sci. USA 77 (1980) 7084-7088.
Post, L.E. and Nomura, M.: DNA sequences from the str operon of Escherichia coli. J. Biol. Chem. 255 (1980) 4660-4666.
Reinbolt, J., Tritsch, D. and Wittmann-Liebold, B.: The primary structure of ribosomal protein S7 from E. coli strains K and B. FEBS Lett. 91 (1978) 297-301.
Riley, M. and Krawiec, S.: Genome organization. In: Neidhardt, F.C., Ingraham, J.L., Low, K.B., Magasanik, B., Schaechter, M. and Umbarger, H.E. (Eds.), Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology, Vol. 2. ASM Press, Washington, DC, 1987, pp. 967-981.
Sharp, P.M.: Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position, and concerted evolution. J. Mol. Evol. 33 (1991) 23-33.
Sharp, P.M. and Bulmer, M.: Selective differences among translation termination codons. Gene 63 (1988) 141-145.
Sharp, P.M. and Li, W.-H.: The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 15 (1987a) 1281-1295.
Sharp, P.M. and Li, W.-H.: The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol. Biol. Evol. 4 (1987b) 222-230.
Tuohy, T.M.F., Thompson, S., Gesteland, R.F., Hughes, D. and Atkins, J.F.: The role of EF-Tu and other translation components in determining translocation step size. Biochim. Biophys. Acta 1050 (1990) 274-278.
Wagar, E.A. and Pang, M.: The gene for the S7 ribosomal protein of Chlamydia trachomatis: characterization within the chlamydial str operon. Mol. Microbiol. 6 (1992) 327-335.
Yokota, T., Sugisaki, H., Takanami, M. and Kaziro, Y.: The nucleotide sequence of the cloned tufA gene of Escherichia coli. Gene 12 (1980) 25-31.
Zengel, J.M. and Lindahl, L.: Mapping of two promoters for elongation factor Tu within the structural gene for elongation factor G. Biochim. Biophys. Acta 1050 (1990) 317-322.
Zengel, J.M., Archer, R.H. and Lindahl, L.: The nucleotide sequence of the Escherichia coli fus gene, coding for the elongation factor G. Nucleic Acids Res. 12 (1984) 2181-2192.