J. Mol. Biol. (1978) 126, 847-863

Genetic Studies of the lac Repressor
VII†. On the Molecular Nature of Spontaneous Hotspots in the lacI Gene of Escherichia coli

PHILIP J. FARABAUGH‡
Harvard University, Cambridge, Mass., U.S.A.

URSULA SCHMEISSNER§, MURIELLE HOFER AND JEFFREY H. MILLER
Département de Biologie Moléculaire, Université de Genève, Genève, Suisse

(Received 7 July 1978)
140 independently occurring spontaneous mutations in the lacI gene of Escherichia coli have been examined genetically and physically. DNA sequence analysis of a genetic “hotspot” shows that the tandemly repeating sequence 5′-C-T-G-G-C-T-G-G-C-T-G-G-3′ generates mutations at a high rate, either deleting or adding one unit of four nucleotides (C-T-G-G). Twelve larger deletion mutations have also been sequenced; seven of these were formed by eliminating segments between repeated sequences of five or eight nucleotides, one copy of the repeated sequence remaining after the deletion. Possible mechanisms accounting for the involvement of repeated sequences in the creation of spontaneous mutations are considered.
† Paper VI in this series is Sommer et al. (1978).
‡ Present address: Cornell University, Ithaca, N.Y., U.S.A.
§ Present address: Laboratory of Molecular Biology, National Cancer Institute, Bethesda, Md., U.S.A.
0022-2836/78/360847-17 $02.00/0    © 1978 Academic Press Inc. (London) Ltd.

1. Introduction

In 1961 Seymour Benzer published a classic study of mutational sites in the rII cistrons of bacteriophage T4, in which he demonstrated that not all sites are equally mutable. Instead, mutability varies over a range of several orders of magnitude. He termed highly mutable sites “hotspots” and suggested that some aspect of the nucleotide sequence surrounding each site might play a role in determining the rate of mutation.

Because of recent advances in DNA sequencing techniques, the molecular basis of mutational hotspots is subject to direct analysis. In this and a related study (Coulondre et al., 1978) we examine the role of base sequence on mutation rate. In particular, we ask whether there are special sequences which are involved in the generation of small and large additions and deletions, and of base substitutions.

The lacI gene represents an ideal system for examining mutational hotspots, since the results of detailed genetic studies (Miller et al., 1977; Coulondre & Miller, 1977) can be combined with the knowledge of the full nucleotide sequence (Farabaugh, 1978). Also, the recent cloning of the I gene onto a small plasmid (Calos, 1978) makes possible the rapid analysis of altered sequences in the lacI DNA, since mutations can be crossed directly onto the plasmid by genetic techniques (see Materials and Methods).

In this study we characterized a collection of 140 mutations in the I gene which arose spontaneously on an F′lacproB episome in an Escherichia coli K12 strain deleted for the lacproB region (GM1). The mutations were mapped and separated into distinct sites. Several different classes of mutations can be recognized, including small frameshifts, large deletions and insertions†, and base substitutions. Two very large hotspots appear in this collection. We have determined the sequence change for six representative mutations from these sites, and also for 12 deletion mutations. It is clear from the results presented in the following sections that repeated nucleotide sequences lead to a high rate of deletion and frameshift formation. In a parallel study, we show that the rates of base substitution mutations are profoundly affected by specific aspects of the DNA sequence (Coulondre et al., 1978). The implications of these findings are considered in the Discussion.
2. Materials and Methods

(a) Bacterial strains

All mutations were isolated in strain GM1 (Miller et al., 1977), which carries an F′lacproB episome with the mutations Iq and L8 (Müller-Hill et al., 1968; Scaife & Beckwith, 1966) and harbors a deletion of the lac and proB regions on the chromosome. X7733 is Δ(lacproB) galE strA thi. The recA⁻ derivative of this strain (Ganem, 1972) was converted to Nalʳ (Pfahl, 1972) and termed MP30.

(b) Isolation of mutants

i⁻ mutants were isolated as described by Miller et al. (1977).

(c) Mapping

Deletion mapping was carried out by techniques described by Schmeissner et al. (1977a), and by Coulondre & Miller (1977).

(d) Crossing mutations onto plasmids

Heterodiploids were constructed carrying part of the lac region on either the pMC1 or pMC4 plasmids (Calos, 1978) and various I mutations on the F′lacproB episome. Samples of overnight broth cultures were plated on Xgal indicator plates, and deep blue colonies were picked, purified and verified for the i⁻ character. These strains were used to prepare plasmid DNA carrying the I⁻ mutations.

(e) DNA sequence analysis of deletion endpoints
The positions of the endpoints of the deletions analyzed here were determined by comparing the nucleotide sequence of lacI from the deletion strains with the wild-type sequence (Farabaugh, 1978). The position of each mutation with respect to the restriction map of lacI, which was approximately known from genetic mapping, was confirmed by restriction mapping. A restriction endonuclease which cleaved no more than 120 nucleotide pairs from the site of the mutation was chosen and the fragment containing the mutation isolated. The fragment was end-labeled on its two 5′-termini with [32P]ATP and T4 polynucleotide kinase (Maxam & Gilbert, 1977). The labeled fragments were cleaved with a second restriction endonuclease and the singly labeled fragment containing the deletion subjected to the Maxam-Gilbert direct DNA sequencing procedure (Maxam & Gilbert, 1977).

† Two mutations result from the insertion of IS1, as determined by a physical analysis of the DNA carrying these alterations (Calos et al., 1978a).
FIG. 1. A schematic diagram of the region sequenced for each mutation. The numbering system is that of Farabaugh (1978). For details see Materials and Methods.
Figure 1 is a schematic diagram of the region sequenced for each mutation. The deletions S74, S112 and U193 were sequenced from the HpaII site at position −1 (the numbering system is that of Farabaugh, 1978). S74 was sequenced from the fragment covering the region from this HpaII site to the HaeIII site at position 107. S112 was sequenced from the fragment containing the region between the HpaII site and the MboII site at position 359. U193 was sequenced from the fragment produced by HpaII cleavage at positions −1 and 56 (the singly labeled fragment was produced by separation of alkali-denatured strands (Maxam & Gilbert, 1977)). The S23 mutation is contained on a fragment extending from the HaeIII site at position 108 to the MboII site at 359; we sequenced a fragment labeled at the HaeIII site. Deletions S10, S136, S65 and S120 are all contained on the fragment extending from the HaeIII site at position 243 to the HpaII site at position 457; each was sequenced from a fragment labeled at the HaeIII site. S32, a repeated occurrence of the S65 deletion, was sequenced from a fragment labeled at the MboII site at position 358 which extended to the HaeIII site at 243. The major hotspot deletion and insertion mutations FS2, FS84, FS5, FS25, FS45 and FS65 are all contained on the fragment produced by cleavage at the HaeIII site at position 589 and the HpaII site at 699; each was sequenced from a fragment with the label at the HaeIII site. S86 is contained on the same fragment but was sequenced from the HpaII site. S24 and S56 are contained on a fragment which extends from the HpaII site at 809 to the HaeIII site at 589; they were sequenced from a fragment labeled at the HpaII site. S42 was sequenced from a fragment labeled at the HindII site at position 986 which extended to the AluI site at position 979. A complete description of the sequence of each deletion endpoint is given in Farabaugh (1977).
3. Results

(a) Genetic characterization

Figure 2 shows the distribution of 140 spontaneous mutations in the lacI gene, as determined by recombination tests with a large set of deletions (Schmeissner et al., 1977a,b). Each mutation is of independent origin and results in an inactive repressor protein and the i⁻ phenotype (see Materials and Methods). Two hotspots† dominate the spectrum. These comprise 94 of the 140 mutations and occur at or near the same position in the middle of the gene. Although we cannot distinguish among these 94 mutations by recombination tests, we can define two classes based on the reversion rates: 78 of the mutations revert at an uncharacteristically high frequency (revertants appearing at about 10⁻⁵ in the population), whereas the remaining 18 mutations do not generate revertants at a detectable rate (< 10⁻⁸). Therefore, we assume that at least two different mutations are appearing at or extremely near the same point in the gene.

† Since these mutations occur at a very high frequency relative to other mutations in the I gene, we define them as hotspots.

FIG. 2. The distribution of 140 spontaneous mutations in the lacI gene. Each mutation is of independent origin and resulted in the i⁻ phenotype. A single occurrence of an apparent point mutation is represented by a square. Filled-in squares indicate mutations with very low or undetectable reversion rates, whereas open squares indicate mutations which are unstable, reverting at frequencies of 10⁻⁶ or greater. Deletions are shown below the line, which represents the length of the I gene given in terms of the position of the corresponding residue in the lac repressor. All mutations were mapped against the deletions used to divide the gene into 108 marked sections (Schmeissner et al., 1977a,b). The allele numbers are given in cases where the mutational change has been sequenced. S114 and S58 are insertions of the transposable element IS1 (Calos et al., 1978b). S28 is an unstable duplication of 88 base-pairs (Calos et al., 1978a). The other sequenced mutations are described in this paper.

A number of large deletions can be identified in this collection. Most of these have both endpoints within the I gene and its control region. We mapped the deletions against a set of point mutations at known positions (see below). From the markers used to map the deletion endpoints it appears that in at least three cases the same or a very similar deletion recurs.

Together, the deletions and the two hotspots in the middle of the gene constitute close to 80% of the spontaneous mutations detected in the I gene. In the following section we describe the DNA sequence analysis of these mutations. We report elsewhere the molecular nature of spontaneous base substitutions (Coulondre et al., 1978), which comprise part of the remaining mutations.

(b) DNA sequence at a hotspot
Four mutations from the most prominent hotspot and two mutations from what appears to be a second hotspot were crossed onto the plasmid carrying the I gene, and DNA from each of these was isolated, cut with restriction enzymes, and sequenced by procedures outlined in Materials and Methods. The results are shown in Figure 3. The wild-type sequence in this region of the gene (between bases 620 and 631) shows a striking tandem repeat of the sequence 5′-C-T-G-G-3′, which occurs three times in succession. Four unstable mutations (FS5, FS25, FS45 and FS65) from the major hotspot were found to contain an additional C-T-G-G sequence, thus extending the repeat to four sets of the same sequence. The high reversion at this site can then be envisioned as the subsequent loss of one C-T-G-G sequence. Two stable mutations from the second cluster depicted in Figure 2, FS2 and FS84, showed the loss of one set of the four-base sequence C-T-G-G from this same wild-type sequence. Apparently, reversion by re-addition of a C-T-G-G sequence to the two remaining sets occurs at too low a frequency (less than 10⁻⁸) to be detected in this system.

    FS5, FS25, FS45, FS65   5′-G-T-C-T-G-G-C-T-G-G-C-T-G-G-C-T-G-G-C-3′
    wild-type               5′-G-T-C-T-G-G-C-T-G-G-C-T-G-G-C-3′
    FS2, FS84               5′-G-T-C-T-G-G-C-T-G-G-C-3′

FIG. 3. The sequence change resulting from the two mutational hotspots. (See text for details.)

These results show that the same sequence is responsible for generating two different frameshift mutations, both of which arise at a high rate relative to all other mutations in the I gene. Figure 4 depicts the molecular consequences of each of these mutations. It can be seen that in both cases a nonsense codon is encountered soon after the shift in the reading frame. The resulting repressor fragments are 203 and 311 amino acids long, respectively, and have no activity.
    Wild type     SerAlaArgLeuArgLeuAlaGlyTrpHisLys
                  TCGGCGCGTCTGCGTCTGGCTGGCTGGCATAAA
                  AGCCGCGCAGACGCAGACCGACCGACCGTATTT

    +4 bases      SerAlaArgLeuArgLeuAlaGlyTrpLeuAlaSTOP
                  TCGGCGCGTCTGCGTCTGGCTGGCTGGCTGGCATAAA
                  AGCCGCGCAGACGCAGACCGACCGACCGACCGTATTT

    −4 bases      SerAlaArgLeuArgLeuAlaGlyIleLys ... (9 aa) ... STOP
                  TCGGCGCGTCTGCGTCTGGCTGGCATAAA ... TAG
                  AGCCGCGCAGACGCAGACCGACCGTATTT ... ATC

FIG. 4. The molecular consequences of the addition or deletion of 4 base-pairs resulting from each of the 2 hotspots. Hyphens omitted for clarity.
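The reading-frame consequences described in the text can be checked with a short script. The following is a minimal sketch, not part of the original analysis: it uses the 33-base wild-type stretch shown in Figure 4 and the standard genetic code, and the helper names `change_repeat_units` and `translate` are hypothetical, introduced only for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Wild-type coding strand around the hotspot, as drawn in Figure 4.
WILD_TYPE = "TCGGCGCGTCTGCGTCTGGCTGGCTGGCATAAA"
UNIT = "CTGG"  # the tandemly repeated four-base unit


def change_repeat_units(seq, unit, copies, delta):
    """Replace the first run of `copies` tandem units by `copies + delta` units."""
    run = unit * copies
    if run not in seq:
        raise ValueError("tandem repeat not found")
    return seq.replace(run, unit * (copies + delta), 1)


def translate(seq):
    """Translate complete codons from position 0, stopping after a stop codon (*)."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


plus_four = change_repeat_units(WILD_TYPE, UNIT, 3, +1)   # FS5-type insertion
minus_four = change_repeat_units(WILD_TYPE, UNIT, 3, -1)  # FS2-type deletion

print(translate(WILD_TYPE))   # SARLRLAGWHK
print(translate(plus_four))   # SARLRLAGWLA*  (stop reached two codons after Trp)
print(translate(minus_four))  # SARLRLAGI     (frame shifted after Gly)
```

In the one-letter code, the +4 product reads Trp-Leu-Ala-STOP after the shift, and the −4 product continues Gly-Ile in the new frame, which agrees with the amino acid sequences drawn in Figure 4.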
(c) DNA sequence of deletion endpoints

We crossed 12 deletions from the collection shown in Figure 2 onto the pMC1 plasmid and sequenced the relevant portion of the I DNA (see Materials and Methods). Figures 5 and 6 show the results. It is clear that repeated sequences are involved frequently in deletion formation, since in seven out of 12 cases (see Table 1) repeats of five or eight bases are found at each end in the wild-type sequence. The deletions remove one of the repeated sequences and all of the intervening DNA. Moreover, in three cases (S74 and S112, S10 and S136, S32 and S65) the identical deletion has recurred independently at the same sequence. Of the deletions which did not show significant repeats, none recurred in the sample, suggesting that non-repeat sites may be much less specific.
    S74, S112 (75 bases)    TCAGG[GTGGTGAA] ... [GTGGTGAA]CCAGGCCA
    S23 (123 bases)         GTGGAA[GCGGCGAT] ... [GCGGCGAT]TAAATC
    S10, S136 (20 bases)    AGAACG[AAGCGGCG] ... [AAGCGGCG]GTGCACA
    S32, S65 (22 bases)     [GTCGA] ... [GTCGA]AGCCTGTAAA

FIG. 5. The region deleted by the mutations S74, S112, S23, S10, S136, S32 and S65. The repeated sequence present in the wild type before the deletion is boxed (shown here in brackets), and the deletion indicated both above and below the line. Marker rescue experiments were performed (see Table 3) using mutations at the position indicated by asterisk. The base-pair and codon numbers are also indicated to facilitate placement on the gene-protein maps. Hyphens omitted for clarity.
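The geometry of these deletions can be illustrated with a toy sequence. The sketch below is an illustration under stated assumptions rather than the authors' procedure: the repeat and the flanking bases are those drawn for S74/S112, but the 67-base spacer is invented filler, and `delete_between_repeats` is a hypothetical helper.

```python
REPEAT = "GTGGTGAA"            # eight-base direct repeat at the S74/S112 endpoints
LEFT_FLANK = "TCAGG"           # bases drawn to the left of the first copy
RIGHT_FLANK = "CCAGGCCA"       # bases drawn to the right of the second copy
SPACER = "ACT" * 22 + "A"      # ASSUMPTION: 67 bases of invented filler, not lacI sequence

toy = LEFT_FLANK + REPEAT + SPACER + REPEAT + RIGHT_FLANK


def delete_between_repeats(seq, repeat):
    """Remove one copy of `repeat` plus everything up to the next copy.

    Returns the shortened sequence and the number of bases removed.
    """
    first = seq.find(repeat)
    second = seq.find(repeat, first + 1)
    if first < 0 or second < 0:
        raise ValueError("sequence does not contain two copies of the repeat")
    return seq[:first] + seq[second:], second - first


deleted_seq, bases_removed = delete_between_repeats(toy, REPEAT)

print(bases_removed)   # 75: one repeat copy (8) plus the intervening DNA (67)
print(deleted_seq)     # TCAGGGTGGTGAACCAGGCCA: a single copy of the repeat remains
```

Because the two copies are identical, removing the first copy plus the spacer gives the same product as removing the spacer plus the second copy, so the exact breakpoint cannot be assigned within the repeat.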
TABLE 1
Sequences at the endpoints of deletions

    Site (no. of base-pairs)   Sequence repeat†    Bases deleted   Occurrences
    20 to 95                   G-T-G-G-T-G-A-A     75              S74, S112
    146 to 269                 G-C-G-G-C-G-A-T     123             S23
    331 to 351                 A-A-G-C-G-G-C-G     20              S10, S136
    316 to 338                 G-T-C-G-A           22              S32, S65
    694 to 707                 CA                  13              S24
    694 to 719                 CA                  25              S56
    943 to 956                 G                   13              S42
    322 to 393                 None                71              S120
    658 to 685                 None                27              S86

† This refers only to the nucleotides seen in the wild-type sequence at either end of the segment which is subsequently deleted, and does not imply that repeats of 1 or 2 bases are significant. Only the repeated sequences of 5 and 8 nucleotides are considered meaningful here.
The data shown in Figures 5 and 6 are tabulated here. The numbering system is that of Farabaugh (1978).
(d) Additional deletions We also screened a collection of over 800 mutagen-induced i- mutants for the presence of deletions (see Table 2). This represents an effective probe for monitoring the percentage of deletions among I- mutations, and also the absolute frequency of induction. There is no selection involved, provided the deletions are longer than the distance between the point mutations used, or else span one of these markers (Fig. 7).
TABLE 2
Deletions among I⁻ mutations

    Mutagen                     Non-suppressible     % of total   Deletions†   % of total
                                mutations tested                  found
    2-aminopurine               334                  80           9            2
    Ultraviolet light           174                  80           12           5.5
    4-nitroquinoline-1-oxide    87                   73           1            0.8
    IWATT                       86                   92           0            0
    None (spontaneous)          139                  99           19           14

† Excluding the frameshift mutation involving a deletion of 4 bases.
The deletions extending into I but not into Z and which are not lethal are scored here. Very small deletions, such as the frameshift caused by the removal of 4 bases depicted in Fig. 3, would not be detected by the mapping probes used (see text). The deletions appearing in the 2-aminopurine collection probably arose spontaneously in the treated cultures and were not induced by 2-aminopurine.
FIG. 6. The region deleted by the mutations S24, S56, S42, S120 and S86. For further details see the legend to Fig. 5. Hyphens omitted for clarity.
Ultraviolet light provides a five- to tenfold stimulation of deletions of this type†, at a maximum. (U.v.-stimulated deletions have been detected in other bacterial systems (Demerec, 1960; Schwartz & Beckwith, 1969).) On the other hand, 4-nitroquinoline-1-oxide and 2-aminopurine do not appear to stimulate deletions in this system. (The 2% deletions found after 2-aminopurine treatment probably reflect the spontaneous background, since independent tests indicate the presence of other spontaneous mutations in this collection; the spontaneous background was much lower in the u.v.-treated cells.)

† Larger deletions which would prevent episome replication are not scored here; neither are those extending into or past lacZ.
FIG. 7. The deletions found among I- mutations. S, spontaneous; A, 2-aminopurine-induced; U, ultraviolet light-induced; Q, 4-nitroquinoline-l-oxide-induced. Some of these mutations represent the spontaneous background in the mutagenized cells (see Table 2 and text for further description).
Several deletion endpoints cluster in small regions (i.e. A108, A114 and U54). These may represent other examples of repeated sequences. The most suggestive clustering results from the addition of A31 and U94 to the group already including S10 and S136, since both endpoints of these deletions are mapped within short intervals, and S10 and S136 have already been shown to be identical (both involving a repeated sequence of eight nucleotides). Sequence analysis of these deletions should prove interesting.
(e) Recombination tests

Since the sequence results from the preceding sections provide us with deletions at known positions, we can test the resolution of our mapping system by utilizing point mutations at known positions. As Figures 5 and 6 indicate, the repeated sequences provide an ambiguity regarding the exact point of deletion formation. From the standpoint of “marker rescue” the bases between the markers and the deletion always include the remaining copy of the repeated sequence. Thus, in recombination tests between S32 or S65 and the mutation affecting nucleotide 318, the deletion is envisioned as removing bases 321 to 343, whereas in tests against a mutation affecting base 341, S32 and S65 are positioned as deleting bases 316 to 338 (see Fig. 5).

With these considerations in mind we can examine Table 3, which gives the results of several recombination experiments. It is interesting to note that recombination can be detected in some cases even when the mutation is three nucleotides away from the deletion endpoint.

TABLE 3
Recombinants from crosses between deletions and point mutations

Frequency of recombinants (× 10⁻⁷), tabulated by deletion (S74, S112, S10, S136, S32, S65, S24, S56, S42, S120, S86) against separation from point mutations (in nucleotide pairs: 1, 3, 4, 5, 8, 10, 11, 13, 14).

Entries in reading order (the assignment of values to rows and columns is not recoverable here): 50, 50, 20, 20, 1, 1, < 0.2, < 0.2; 3, 4, < 0.2, 40, 10, 0.5; 60, 100, 60; 10, 10, 10; 60, 2.

Each approximate recombinant frequency is the average of several determinations. Crosses were carried out as described by Coulondre & Miller (1977).
4. Discussion

The molecular nature of hotspots has intrigued geneticists ever since Benzer (1961) first discovered that mutations within a gene are not distributed randomly among the available sites. The work presented here demonstrates that special nucleotide sequences play an important role in determining spontaneous rates of mutation. Small frameshifts and deletions constitute two major classes of mutations occurring spontaneously, together comprising approximately 80% of the lacI⁻ mutations. Tandemly repeated sequences, such as the C-T-G-G-C-T-G-G-C-T-G-G sequence found in the I gene, are a major source of hotspots, generating additions and deletions of the repeated unit at a high rate with respect to other mutational sites in the gene. Mutants having an additional C-T-G-G sequence at this point in the gene appear at a frequency of about 2 × 10⁻⁶, and those having a deletion of a C-T-G-G sequence are found at approximately 0.5 × 10⁻⁶. Larger deletions also predominate at repeated sequences, as the data in Table 1 and Figure 4 demonstrate. Recurrence of identical deletions at these points serves to strengthen this conclusion.

What of the remaining 20% of the mutations? As Figure 2 shows, these are scattered over numerous sites. We have crossed each of these mutations onto a plasmid and examined the size of HindII restriction fragments. From this analysis two mutations (S58 and S114) have now been shown to be due to the insertion of the transposable element IS1 (Calos et al., 1978b). This investigation has also revealed a tandem duplication of 88 base pairs in one case (S28; see the accompanying paper: Calos et al., 1978a). Both base substitutions and other small frameshift mutations should comprise many of the remaining lesions. In a separate paper we analyze base substitutions and show that these occur preferentially at 5-methylcytosines (Coulondre et al., 1978).
Taken in its larger perspective, spontaneous mutations predominate at special sequences, even though many different types of changes are involved. Based on studies of altered T4 phage lysozyme sequences, Streisinger et al. (1966) proposed a model to account for the generation of frameshift mutations, suggesting that, after breakage and digestion of a single strand, “slipped mispairing” occurs at tandemly repeated sequences. This can lead to additions and deletions adjacent to repeated sequences (Okada et al., 1972). Such a model accounts for a number of characterized frameshifts in other systems (see review by Roth, 1974). The two frameshift hotspots which have been sequenced conform nicely to the predictions of the Streisinger model. Alternatively, it is possible that such events are generated by unequal crossing over in general recombination. It remains to be determined whether mutations similar to FS2 and FS5 (−4 bases and +4 bases, respectively) arise at the same frequencies in a recA⁻ strain. However, the reversion of FS5 to i⁺ does occur in a recA⁻ background (see Table 4). If sequencing experiments verify that these have indeed undergone a loss of four bases back to wild type, it would argue strongly against the involvement of recA-mediated recombination in the formation of this type of frameshift mutation.

TABLE 4
Frequency of i⁺ revertants for the frameshift mutations FS2 and FS5

                Frequency of i⁺ revertants (× 10⁻⁵)
    Mutation    recA⁺        recA⁻
    FS2         < 0.001      -

Reversion tests were carried out in strain X7733 (recA⁺) and MP30 (recA⁻), as described by Coulondre & Miller (1977) (see Materials and Methods). The results of 4 different experiments are shown for FS5.

It would be of considerable interest to determine which enzymes in the cell are responsible for the generation of such frameshifts. Strains carrying fast reverting insertions such as FS5 could conceivably be used to detect mutants which lack these enzymes by screening for the elimination of the high reversion rate. It is not unlikely that the two major hotspots reported by Benzer (1961) in the rII cistrons also involve repeating sets of bases, as has been suggested from reversion studies with different mutagens (S. Brenner, personal communication).

The finding that spontaneous deletions are favored at repeated sequences raises a number of interesting questions. Are deletions of all lengths also favored at repeating stretches of nucleotides? This could be answered by sequencing both termini of deletions extending over several gene lengths. Are all repeated sequences equally susceptible to deletion formation? A computer analysis of sequence repeats in lacI indicates that certain sequences may indeed be favored (see Appendix). The involvement of the Rec system in deletion formation should also be tested. Although general tonB-trp deletions have been shown to arise at approximately the same rate in recA⁺ and recA⁻ strains (Franklin, 1967), specific deletions have not been monitored. From Table 1 it is evident that some deletions do not occur at repeated sequences; these would not be expected to show recA dependence in any case. On the other hand, the deletions examined here may be formed by the same kind of “slippage” mechanism proposed for frameshifts (Streisinger et al., 1966). (Previous studies with the rII system of phage T4 have led to predictions of the involvement of repeated sequences in the origin of spontaneous deletions; S. Brenner, personal communication.)
The recurring sequences which generate deletions should also result in the reciprocal event, namely the production of a duplication of the region between the repeats. These have not been detected in the remainder of the collection of the I⁻ mutations depicted in Figure 2 in the restriction enzyme analysis of plasmids carrying these mutations (see above). The only duplication found by this method did not arise via a repeated sequence (Calos et al., 1978a). (One example of this type of duplication has been detected in phage lambda; R. Maurer, unpublished results.)

In at least two systems deletions have been found in conjunction with point mutations (J. Roth, unpublished results; Barnett et al., 1967). The 12 deletions analyzed here do not arise together with an additional lesion nearby, however, since the DNA sequence extending for approximately 50 bases beyond each endpoint has been shown to be identical to wild type (Farabaugh, 1977).

Experiments aimed at examining the other unstable mutations (Fig. 2) at the sequence level are currently in progress, as are attempts to determine the sequence change resulting from additional deletions found in this system.

We thank Drs S. Brenner, F. Crick, F. Stahl, W. Gilbert and J. Roth for helpful discussions. This work was supported by a grant (GM09641) from the National Institutes of Health (to W. Gilbert), and by a grant from the Swiss National Fund (F. N. 3.179.77) (to J. H. M.).

REFERENCES

Barnett, L., Brenner, S., Crick, F. H. C., Shulman, R. G. & Watts-Tobin, R. J. (1967). Phil. Trans. Roy. Soc. ser. B, 252, 487-560.
Benzer, S. (1961). Proc. Nat. Acad. Sci., U.S.A. 46, 1585-1594.
Calos, M. (1978). Nature (London), 274, 762-765.
Calos, M., Galas, D. & Miller, J. H. (1978a). J. Mol. Biol. 126, 865-869.
Calos, M., Johnsrud, L. & Miller, J. H. (1978b). Cell, 13, 411-418.
Coulondre, C. & Miller, J. H. (1977). J. Mol. Biol. 117, 525-567.
Coulondre, C., Miller, J. H., Farabaugh, P. J. & Gilbert, W. (1978). Nature (London), 274, 775-780.
Demerec, M. (1960). Proc. Nat. Acad. Sci., U.S.A. 46, 1075-1079.
Farabaugh, P. J. (1977). PhD thesis, Harvard University.
Farabaugh, P. J. (1978). Nature (London), 274, 765-769.
Franklin, N. C. (1967). Genetics, 55, 699-707.
Ganem, D. (1972). Honors thesis, Harvard University.
Maxam, A. & Gilbert, W. (1977). Proc. Nat. Acad. Sci., U.S.A. 74, 560-564.
Miller, J. H., Ganem, D., Lu, P. & Schmitz, A. (1977). J. Mol. Biol. 109, 275-302.
Müller-Hill, B., Crapo, L. & Gilbert, W. (1968). Proc. Nat. Acad. Sci., U.S.A. 59, 1259-1263.
Okada, Y., Streisinger, G., Emrich, J., Newton, J., Tsugita, A. & Inouye, M. (1972). Nature (London), 236, 338-341.
Pfahl, M. (1972). Genetics, 72, 393-410.
Roth, J. R. (1974). Annu. Rev. Genet. 8, 319-346.
Scaife, J. G. & Beckwith, J. R. (1966). Cold Spring Harbor Symp. Quant. Biol. 31, 403-408.
Schmeissner, U., Ganem, D. & Miller, J. H. (1977a). J. Mol. Biol. 109, 303-326.
Schmeissner, U., Ganem, D. & Miller, J. H. (1977b). J. Mol. Biol. 117, 572-575.
Schwartz, D. & Beckwith, J. R. (1969). Genetics, 61, 371-379.
Sommer, H., Schmitz, A., Schmeissner, U., Miller, J. H. & Wittmann, H. G. (1978). J. Mol. Biol. 123, 467-469.
Streisinger, G., Okada, Y., Emrich, J., Newton, J., Tsugita, A., Terzaghi, E. & Inouye, M. (1966). Cold Spring Harbor Symp. Quant. Biol. 31, 77-84.
858
D.
J.
GALAS
APPENDIX
An Analysis of Sequence Repeats in the lad Gene of Escherichia coli DAVID J. GALAS
Ddpartement de Biologic Molthdaire, Universitd de Genkve,Genbe, Suisse Since it is now clear that a significant fraction of non-lethal spontaneous deletions terminate in repeated sequences (see the main text), the role played by these repeats in the formation of deletions has become an important question. As a first step in the investigation of the issue,the sequenceof the I gene must be analysed to establish the background of repeated sequencesagainst which the observed deletions occurred. The possibility that a careful comparison of these observed deletions with potential sites for repeat-catalysed deletion formation may suggest or restrict hypotheses as to the processesinvolved is the primary motivation for such an analysis. In this paper I present an analysis of the sequenceof the I gene (Farabaugh, 1978) for direct repeats, inverted repeats, and the possibility of slipped mispairing (contiguous or overlapping repeats). This spectrum of repeats is compared with the data reported in the accompanying paper, the sequencedendpoints of deletions internal to the I gene and the frameshift mutation hotspot. It can be argued from this comparison that it is likely that the distance between repeats is an important factor, and that some sequence specificity is also involved in at least one pathway for deletion formation. (a) Direct repeats The sequenceof the I gene was read into the computer and scannedfor repeats by a program which used the following simple algorithmt. To begin, a short sequence(the first N bases)is taken for comparison with every N-base sequencein the gene. 
Each sequence for which the agreement is equal to, or greater than, a certain threshold, L, is printed out with its position, degree of match, and the distance between the first bases in the two N-base sequences. The next N-base sequence (shifted up one base in the gene) is then taken and the same comparisons made with the remaining sequence of the gene (one less base in the gene is used each time the N-base sequence is shifted, to avoid duplicating comparisons). The parameters L and N are set as required when the program is executed.

The magnitude of this process is indicated by the problem of determining the number of exact eight-base repeats. L and N are thus both set to eight. The sequence scanned here is 1150 bases long, including the leader region of the transcribed DNA and 41 bases following the final sense codon. This scan then requires that (1150 − 8)²/2 pairs of eight-base sequences be compared, about 5 × 10⁶ single-base comparisons. The results of a scan for eight- and nine-base exact repeats are shown in Table A1.

† These calculations were performed on a Nova 840 computer. The programming was done in FORTRAN IV.
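The scan described above can be sketched in a few lines of Python. This is only an illustration of the stated procedure, not the original FORTRAN IV program: the function names, the parameter names `n` (for N) and `threshold` (for L), and the short demonstration sequence are assumptions made here.

```python
def scan_direct_repeats(seq, n, threshold):
    """Compare every pair of n-base windows once and report close matches.

    Returns tuples (base_1, base_2, matches, d) with 1-based positions,
    where d is the distance between the first bases of the two windows.
    """
    hits = []
    last = len(seq) - n  # index of the last window start
    for i in range(last + 1):
        window = seq[i:i + n]
        # Only windows further along the gene are examined, so that
        # no pair of windows is compared twice.
        for j in range(i + 1, last + 1):
            matches = sum(a == b for a, b in zip(window, seq[j:j + n]))
            if matches >= threshold:
                hits.append((i + 1, j + 1, matches, j - i))
    return hits


def single_base_comparisons(length, n):
    """Rough cost estimate quoted in the text: (length - n)^2 / 2 pairs of n bases."""
    return (length - n) ** 2 / 2 * n


# Hypothetical demonstration sequence (not the I gene itself).
demo = "CTGGCTGGCTGGC"
exact_hits = scan_direct_repeats(demo, n=9, threshold=9)
print(exact_hits)
print(single_base_comparisons(1150, 8))
```

With N = 8 and a length of 1150 the estimate evaluates to roughly 5.2 × 10⁶, in line with the figure of about 5 × 10⁶ given in the text. Setting `threshold` below `n` admits repeats interrupted by mismatches.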
APPENDIX
TABLE A1
Direct repeats of eight or more bases in the I gene

Repeat                 Base 1   Base 2   d      Extent of repeat   Deletions
C-T-G-G-C-T-G-G-C      620      624      4      9                  Frameshifts
G-A-A-G-C-G-G-C-G‡     143      331      188    9                  -
G-C-G-C-G-T-T-G-G      814      1048     234    9                  -
C-C-A-G-C-G-T-G-G      303      925      622    9                  -
G-C-G-C-A-A-C-G-C      374      1113     739    9                  -
A-A-G-C-G-G-C-G‡       331      351      20     8                  S10, S136
G-T-G-G-T-G-A-A        20       95       75     8†                 S74, S112
C-C-G-C-G-T-G-G        91       178      87     8                  -
T-C-T-C-G-C-G-C        281      370      89     8                  -
G-C-G-G-C-G-A-T        146      269      123    8                  S23
G-C-G-A-C-T-G-G        529      681      152    8                  -
A-A-G-C-G-G-C-G‡       144      351      207    8§                 -
G-C-G-T-G-G-T-G        93       306      213    8§                 -
G-T-G-G-A-A-G-C        140      434      294    8§                 -
C-G-A-C-T-G-G-A        682      1091     403    8                  -
G-G-G-C-A-A-A-C        199      917      718    8                  -
G-T-T-T-C-C-C-G        86       1085     999    8                  -

These repeats were found using the algorithm described in the text. Base 1 is the location, with respect to the first base of the I message, of the first base of the first occurrence of the repeat (with respect to the amino-terminal end of the repressor). Base 2 is the location of the first base of the second occurrence of the repeat. d is the number of bases between the first bases of the 2 occurrences. The deletions, designated by allele number (see the main text), are listed in the last column. The notations in the next to last column indicate that the sequence is part of a larger repeat which is interrupted by one mismatch. † indicates that 9 out of 10 bases match; ‡ indicates that there is an identical sequence; § indicates that it is part of a 10 out of 11 match.
Note that the observed deletions occur only with eight-base repeat endpoints, even though there are four nine-base repeats available. Clearly the size of the repeat is not the only determining factor. As one can tell from a glance at Table A1, the nine-base repeats are spaced rather far apart; in fact, the closest of these spans a distance greater than the size of the largest deletion. It is possible that the distance between repeats reduces their probability of forming deletions sufficiently to account for the absence of nine-base repeats in this sample. That this possibility is consistent with the eight-base repeat spectrum is also clear from Table A1. It is only the closely spaced eight-base repeats that are represented here. In this sample it is entirely possible to account for the distribution by the random occurrence of deletions, at roughly the same rate, among the five most closely spaced eight-base repeats. To illustrate this point, suppose that the distance between repeats were unimportant and all eight- and nine-base repeats were equally likely as deletion sites. Then the probability that the five observed deletions would be confined to the five closest repeats, as they are in this sample, would be about 3 × 10⁻³. However, the distribution of the five observed deletions among these five sites cannot be distinguished from random. If five deletions (sites) were chosen at random from a collection in which all five sites are equally represented, the probability that this sample of five would miss two sites, as in the real sample, is 0.41. Thus it is not unlikely that a random sampling would yield the observed result.
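The first of these probabilities can be checked with a short calculation. The sketch below assumes that the pool of equally likely sites consists of 16 repeats (the four nine-base and twelve eight-base repeats of Table A1, leaving out the overlapping hotspot repeat); this count is an assumption made here, since the text does not state the number used. The function name is likewise hypothetical.

```python
from fractions import Fraction


def prob_confined(n_sites, n_favoured, n_events):
    """Probability that n_events independent events, each equally likely to
    fall on any of n_sites, all land within a chosen subset of n_favoured sites."""
    return Fraction(n_favoured, n_sites) ** n_events


# Assumed pool: 4 nine-base + 12 eight-base repeats (not stated in the text).
N_SITES = 16
p = prob_confined(N_SITES, n_favoured=5, n_events=5)
print(f"{float(p):.2e}")  # prints 2.98e-03
```

Under this assumption the result is (5/16)⁵, about 3 × 10⁻³, which agrees with the value quoted above.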
860
D.
J.
GALAS
An important argument for the significance of the spacing of the repeats is encountered by considering the three repeats marked ‡ in Table A1. Since the sequences are identical, the possible influence of sequence specificity is removed. The fact that the repeat spaced at 20 bases was found twice among the deletions, and that the ones spaced at 207 and 187 bases are not found, argues strongly that the spacing influences the frequency of deletion formation.

An examination of the seven-base repeat spectrum suggests, however, that there is more involved than the spacing between repeats. The 11 most closely spaced of these, shown in Table A2, have six among them spaced more closely than the endpoints of some observed deletion. It is possible that the smaller repeat unit accounts for the absence of the seven-base repeats among the deletions. However (for reasons discussed below), it seems more likely that among the deletion endpoints there is some sequence resemblance for which the process is partially specific. Table A1 shows that there are striking similarities among the eight-base repeats for the five deletions. The S74 and S23 endpoints match in five out of eight bases, and, if the pyrimidines are considered equivalent, in seven out of eight bases. The S10 and S23 endpoints match in six out of eight bases with a two-base shift (G-C-G-G-C-G).

TABLE A2
The eleven closest exact seven-base repeats in the I gene

Repeat (5' to 3')    Base 1   Base 2   d
C-G-C-G-C-C-G        250      284      34
A-T-T-A-A-T-G        1063     1122     59
T-G-A-C-C-A-G        415      481      66
C-A-A-C-T-G-G        191      293      102
G-C-A-A-A-C-C        1030     1030     111
A-A-C-C-A-C-C        886      1009     123
A-T-A-T-C-T-C        637      828      191
C-C-T-G-C-A-C        244      444      200
C-A-A-A-C-C-A        710      920      210
C-T-G-G-G-C-G        533      779      246
G-G-T-G-G-T-G        21       311      290

Of the 34 exact 7-base repeats these are those with the lowest value of d, excepting the single overlapping repeat (d = 6) which is entered in Table A3. The designations of the columns are defined in the legend to Table A1.

Some sort of specificity is also suggested by the deletion ending in the five-base repeat. This deletion, found twice in a sample of independently isolated deletions, is quite small (22 bases), but there are many repeats of five and six bases as close or closer together. In Figure A1 all the repeats in the I gene closer than 50 bases are represented. Figure A1(a), in which the repeats are displayed as a function of the length of the repeats and of their spacing, suggests that there is something notable about the particular repeats for which deletions were found, surrounded as they are by other available sites. That the deletions, S65 and S32, were isolated independently and found to terminate at the same five-base repeat suggests that this site is preferred. The probability of the same deletion being found twice if all five- and six-base repeats closer than 50 bases were equally likely is about 0.03, and if all 902 five-base repeats were equally likely, it is negligibly small. In Figure A1(b) the distribution of the close repeats near the beginning of the gene is shown. The only notable feature of this distribution is the clustering of the close repeats near the beginning and the end of the gene. Nothing can be said here of the fact that the two deletions, of roughly the same size, are located in the same region of the gene. The sample is simply too small. Only characterization of additional deletions can determine the actual specificity.

FIG. A1. The distribution of close, direct repeats in the I gene. Both (a) and (b) include only those repeats closer than 50 bases (d ≤ 50). (a) This shows the distribution with respect to the spacing between repeats, and the vertical axis indicates the length of repeat involved. Overlapping or contiguous repeats are indicated by a solid circle (●). Occurrence of more than one repeat with the same d is shown by multiple symbols. The deletion mutations are indicated over the repeats that are their endpoints. (b) This shows the close repeats distributed along the I gene. The scale is the same as in Table A1. Deletions are indicated by heavy bars and their numbers.

In this connection it may be useful to note that subsequences of repeated deletion endpoints exist elsewhere in the gene. If the specificity is even partially carried in such a subsequence, its efficacy as a deletion endpoint may be enhanced over background. An example of such a subsequence is the last entry in Table A2. This seven-base sequence matches in six out of seven bases the endpoints of deletion S74/S112.

There are three repeats displayed in Table A1 which have identical sequences, resulting from the triple repetition of an eight-base sequence. Such a multiple repetition is suggestive of duplication events in the evolution of the I gene. Other evidence supporting such an assertion will be considered in another paper.

(b) Slipped mispairing
The striking mutational hotspot discovered in the I gene has been explained as the consequence of slipped mispairings leading to the insertion or deletion of four base-pairs (see the accompanying paper). There is only one prominent hotspot in the gene; thus it may be revealing to examine the opportunities for slipped mispairing afforded by the sequence elsewhere in the I gene. This was done in much the same way as the repeats were analysed. We have ignored here the possibility that sequences which include mismatches can participate in slipped mispairing and have only looked for contiguous or overlapping exact repeats of five bases or more. These repeats are indicated on the left side of Figure A1(a) by the solid dots. They are also tabulated in Table A3. It can be seen immediately that the hotspot is within the stretch of DNA that can be slipped and mispaired with the maximum homology in the fashion proposed by Streisinger et al. (1966). Note that the same region can be slipped eight bases as well as four, but at the cost of reducing the homology from nine to five bases. Only the four-base deletion and insertion events have been observed. Convincing evidence that five bases is insufficient homology for this process is provided by the observation that, having lost four bases at this point, mutants rarely revert: the hotspot is no longer hot when reduced to a five-base homology (see the accompanying paper).

TABLE A3
Opportunities for slipped mispairing in the I gene

  Base 1   Base 2   Repeat (5' to 3')     Length
⎧ 620      624      C-T-G-G-C-T-G-G-C     9
⎩ 620      628      C-T-G-G-C             5
  610      616      G-C-G-T-C-T-G         7
  309      311      G-T-G-G-T-G           6
  947      949      C-T-C-T-C             5
  18       190      A-C-A-A-C             5
  1000     1005     G-A-A-A-A             5

This list represents all those exact repeats of 5 or more bases which are contiguous or overlapping. The bracket marking the first 2 entries indicates that the second is a subsequence of the first. Column designations are as for Tables A1 and A2.

There are only two other candidates in the gene for slipped mispairing, having homologies of six and seven. The situation for these sites is rather ambiguous, however, because it is not clear what effect these strand-slip-produced mutations have on the protein. The seven-base site could add or delete six bases, while the six-base site could add or delete three bases, neither resulting in a frameshift.
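The search for contiguous or overlapping exact repeats can be illustrated with the Python sketch below. It is a simplified reading of the criterion, not the program actually used: a repeat of length r recurring d bases downstream is reported when d ≤ r, so a case such as the eight-base slip within the hotspot (a five-base subsequence of the longer tandem run) is not listed separately. The function names and the demonstration sequence are assumptions made for illustration.

```python
def slipped_mispairing_sites(seq, min_len=5):
    """Find exact repeats that are contiguous or overlapping.

    For each shift d, maximal runs where seq[i] == seq[i + d] are located;
    a run of length r is a repeat of r bases occurring at two places d apart.
    It is reported when r >= min_len and d <= r.
    Returns tuples (base_1, base_2, d, length, repeat) with 1-based positions.
    """
    sites = []
    n = len(seq)
    for d in range(1, n):
        i = 0
        while i + d < n:
            if seq[i] != seq[i + d]:
                i += 1
                continue
            start = i
            while i + d < n and seq[i] == seq[i + d]:
                i += 1
            run = i - start
            if run >= min_len and d <= run:
                sites.append((start + 1, start + d + 1, d, run, seq[start:start + run]))
    return sites


def causes_frameshift(d):
    """Adding or deleting d bases shifts the reading frame unless d is a multiple of 3."""
    return d % 3 != 0


# Hypothetical sequence containing a hotspot-like tandem run (not the I gene).
demo = "TTCTGGCTGGCTGGCAA"
sites = slipped_mispairing_sites(demo)
print(sites)
print([causes_frameshift(d) for d in (4, 6, 3)])
```

In this toy sequence the only site reported is the nine-base homology with a four-base slip, which produces a frameshift, whereas slips of six or three bases do not.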
TABLE A4
Inverted repeats of eight or more bases in the I gene

Base 1   Base 2   d      Sequence (5' to 3')      Length
87       128      41     A-C-G-C-G-G-G-A-A-A      10
238      953      715    C-A-G-G-G-C-C-A-G        9
16       1010     994    A-C-C-A-C-C-C-T-G        9
284      605      321    T-C-G-G-C-G-C-G          8
249      606      357    C-G-G-C-G-C-G-T          8
232      683      451    G-A-C-T-G-G-A-G          8
160      655      495    A-A-T-T-C-A-G-C          8
107      621      514    T-G-G-C-T-G-G-C          8
107      625      518    T-G-G-C-T-G-G-C          8
239      954      715    A-G-G-G-C-C-A-G          8
214      973      759    C-A-A-T-C-A-G-C          8
300      1072     772    G-C-T-G-G-C-A-C          8

Column designations are as in previous Tables. However, since these are inverted repeats and the 2 sequences are different, the listed sequence is that one beginning at base 2 in each case.
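Inverted repeats of the kind listed in Table A4 can be located with the same pairwise scan as the direct repeats, once the window of interest has been inverted and complemented. The Python sketch below handles exact matches only; the function names and the demonstration sequence are assumptions introduced for illustration, not the original program.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Invert the sequence and take its base-pairing complement."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_inverted_repeats(seq, n):
    """Report exact inverted repeats of n bases.

    Returns tuples (base_1, base_2, d, sequence) with 1-based positions;
    the sequence listed is the one beginning at base 2.
    """
    hits = []
    last = len(seq) - n
    for i in range(last + 1):
        probe = reverse_complement(seq[i:i + n])
        for j in range(i + 1, last + 1):
            if seq[j:j + n] == probe:
                hits.append((i + 1, j + 1, j - i, seq[j:j + n]))
    return hits


# Hypothetical demonstration sequence (not the I gene).
demo = "ACGGATTTTTATCCGT"
inverted_hits = scan_inverted_repeats(demo, n=6)
print(inverted_hits)
```

Here the window A-C-G-G-A-T at base 1 pairs with A-T-C-C-G-T at base 11, and the latter is the sequence reported.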
(c) Inverted repeats
To complete the picture of the structure of repeated sequences in the gene, the inverted repeats have been analysed, even though no genetic events have yet been associated with them. The algorithm used here is identical to the direct repeat algorithm after a step which inverts and takes the (base-pairing) complement of the sequence of interest. The results of this scan for exact inverted repeats are shown in Table A4 for sequences of eight bases or more. The number of occurrences is in accord with the expected numbers from a random sequence.

The analysis presented here appears to support the notion that some sequence specificity is involved in deletion formation. The evidence seems rather convincing that the distance between repeat sequences is an important factor in determining the frequency of deletions with these endpoints, and that the frameshift hotspot is simply due to slipped mispairing at that site. However, the analysis is merely supportive, particularly as regards sequence specificity, and will lead to more definite conclusions as more spontaneous mutations in the I gene are analysed.

I thank Dr J. H. Miller for support and for reading the manuscript. This work was supported by a grant from the Swiss National Fund (F.N. 3.179.77.).
REFERENCES
Farabaugh, P. J. (1978). Nature (London), 274, 765-769.
Streisinger, G., Okada, Y., Emrich, J., Newton, J., Tsugita, A., Terzaghi, E. & Inouye, M. (1966). Cold Spring Harbor Symp. Quant. Biol. 31, 77-84.