An analysis of sequence repeats in the lacI gene of Escherichia coli


DAVID J. GALAS

Département de Biologie Moléculaire, Université de Genève, Genève, Suisse

Since it is now clear that a significant fraction of non-lethal spontaneous deletions terminate in repeated sequences (see the main text), the role played by these repeats in the formation of deletions has become an important question. As a first step in investigating this issue, the sequence of the I gene must be analysed to establish the background of repeated sequences against which the observed deletions occurred. The primary motivation for such an analysis is the possibility that a careful comparison of the observed deletions with potential sites for repeat-catalysed deletion formation may suggest or restrict hypotheses as to the processes involved.

In this paper I present an analysis of the sequence of the I gene (Farabaugh, 1978) for direct repeats, inverted repeats, and the possibility of slipped mispairing (contiguous or overlapping repeats). This spectrum of repeats is compared with the data reported in the accompanying paper: the sequenced endpoints of deletions internal to the I gene and the frameshift mutation hotspot. It can be argued from this comparison that the distance between repeats is likely to be an important factor, and that some sequence specificity is also involved in at least one pathway for deletion formation.

(a) Direct repeats

The sequence of the I gene was read into the computer and scanned for repeats by a program which used the following simple algorithm†. To begin, a short sequence (the first N bases) is taken for comparison with every N-base sequence in the gene. Each sequence for which the agreement is equal to, or greater than, a certain threshold, L, is printed out with its position, degree of match, and the distance between the first bases in the two N-base sequences. The next N-base sequence (shifted up one base in the gene) is then taken and the same comparisons made with the remaining sequence of the gene (one less base in the gene is used each time the N-base sequence is shifted, to avoid duplicating comparisons). The parameters L and N are set as required when the program is executed.

The magnitude of this process is indicated by the problem of determining the number of exact eight-base repeats. L and N are thus set to eight. The sequence scanned here is 1150 bases long, including the leader region of the transcribed DNA and 41 bases following the final sense codon. This scan then requires that (1150 - 8)²/2 pairs of eight-base sequences be compared, about 5 × 10⁶ single base comparisons. The results of a scan for eight- and nine-base exact repeats are shown in Table A1.

† These calculations were performed on a Nova 840 computer. The programming was done in FORTRAN IV.
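The scanning procedure described above can be sketched in a few lines of Python. This is only an illustration of the stated algorithm, not the original FORTRAN IV program; the function names and the toy sequence are assumptions introduced here, and positions are reported 1-based to follow the convention of the tables.

```python
def scan_direct_repeats(seq, n, threshold):
    """Compare every N-base window with every later N-base window.

    Returns (base1, base2, matches, delta) tuples for each pair of windows
    that agree in at least `threshold` of their `n` bases.
    Positions are 1-based; delta is the distance between the first bases.
    """
    hits = []
    last_start = len(seq) - n
    for i in range(last_start + 1):
        # Start at i + 1 so that each pair of windows is compared only once.
        for j in range(i + 1, last_start + 1):
            matches = sum(a == b for a, b in zip(seq[i:i + n], seq[j:j + n]))
            if matches >= threshold:
                hits.append((i + 1, j + 1, matches, j - i))
    return hits


def single_base_comparisons(length, n):
    """Rough size of the scan, following the estimate in the text:
    (length - n)^2 / 2 window pairs, each needing n base comparisons."""
    return (length - n) ** 2 * n // 2


# Toy sequence (hypothetical, not the I gene): one exact eight-base repeat.
toy = "AAGCGGCG" + "TTTT" + "AAGCGGCG"
print(scan_direct_repeats(toy, n=8, threshold=8))
print(single_base_comparisons(1150, 8))
```

Setting `threshold` below `n` admits repeats with mismatches, which corresponds to running the scan with L smaller than N.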

TABLE A1

Direct repeats of eight or more bases in the I gene

Repeat                Base 1   Base 2     Δ    Extent of repeat   Deletions
C-T-G-G-C-T-G-G-C       620      624      4    9                  Frameshifts
G-A-A-G-C-G-G-C-G‡      143      331    188    9                  -
G-C-G-C-G-T-T-G-G       814     1048    234    9                  -
C-C-A-G-C-G-T-G-G       303      925    622    9                  -
G-C-G-C-A-A-C-G-C       374     1113    739    9                  -
A-A-G-C-G-G-C-G‡        331      351     20    8                  S10, S136
G-T-G-G-T-G-A-A          20       95     75    8†                 S74, S112
C-C-G-C-G-T-G-G          91      178     87    8
T-C-T-C-G-C-G-C         281      370     89    8
G-C-G-G-C-G-A-T         146      269    123    8                  S23
G-C-G-A-C-T-G-G         529      681    152    8
A-A-G-C-G-G-C-G‡        144      351    207    8§
G-C-G-T-G-G-T-G          93      306    213    8§
G-T-G-G-A-A-G-C         140      434    294    8§
C-G-A-C-T-G-G-A         682     1091    409    8
G-G-G-C-A-A-A-C         199      917    718    8
G-T-T-T-C-C-C-G          86     1085    999    8

These repeats were found using the algorithm described in the text. Base 1 is the location, with respect to the first base of the I message, of the first base of the first occurrence of the repeat (with respect to the amino-terminal end of the repressor). Base 2 is the location of the first base of the second occurrence of the repeat. Δ is the number of bases between the first bases of the 2 occurrences. The deletions, designated by allele number (see the main text), are listed in the last column. The notations in the next to last column indicate that the sequence is part of a larger repeat which is interrupted by one mismatch. † indicates that 9 out of 10 bases match; ‡ indicates that there is an identical sequence; § indicates that it is part of a 10 out of 11 match.

Note that the observed deletions occur only with eight-base repeat endpoints, even though there are four nine-base repeats available. Clearly the size of the repeat is not the only determining factor. As one can tell from a glance at Table A1, the nine-base repeats are spaced rather far apart; in fact the closest of these spans a distance greater than the size of the largest deletion. It is possible that the distance between repeats reduces their probability of forming deletions sufficiently to account for the absence of nine-base repeats in this sample. That this possibility is consistent with the eight-base repeat spectrum is also clear from Table A1. It is only the closely spaced eight-base repeats that are represented here. In this sample it is entirely possible to account for the observed deletions as occurring, at roughly the same rate, among the five most closely spaced eight-base repeats.

To illustrate this point, suppose that the distance between repeats were unimportant and all eight- and nine-base repeats were equally likely as deletion sites. Then the probability that the five observed deletions would be confined to the five closest repeats, as they are in this sample, would be about 3 × 10⁻³. However, the distribution of the five observed deletions among these five sites cannot be distinguished from random. If five deletions (sites) were chosen at random from a collection in which all five sites are equally represented, the probability that this sample of five would miss two sites, as in the real sample, is 0.41. Thus it is not unlikely that a random sampling would yield the observed result.
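The first of these probabilities can be checked with a short calculation. The sketch below assumes that the pool of equally likely sites consists of the twelve eight-base and four nine-base repeats of Table A1 (16 sites, leaving aside the overlapping repeat at the frameshift hotspot); that count, and the function name, are assumptions made here for illustration.

```python
from fractions import Fraction


def prob_all_in_closest(n_observed, n_closest, n_sites):
    """Probability that n_observed independent deletions, each equally likely
    to fall at any of n_sites repeats, all fall within a chosen subset of
    n_closest repeats."""
    return Fraction(n_closest, n_sites) ** n_observed


# Assumed pool: 12 eight-base + 4 nine-base repeats = 16 candidate sites.
p = prob_all_in_closest(n_observed=5, n_closest=5, n_sites=16)
print(float(p))  # about 3e-3
```

Under these assumptions the result is (5/16)⁵, which is of the order quoted in the text.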


An important argument for the significance of the spacing of the repeats emerges from considering the three repeats marked ‡ in Table A1. Since the sequences are identical, the possible influence of sequence specificity is removed. The fact that the repeat spaced at 20 bases was found twice among the deletions, and that those spaced at 207 and 187 bases were not found, argues strongly that the spacing influences the frequency of deletion formation.

An examination of the seven-base repeat spectrum suggests, however, that there is more involved than the spacing between repeats. The 11 most closely spaced of these, shown in Table A2, have six among them closer than

TABLE A2

The eleven closest exact seven-base repeats in the I gene

Repeat (5' to 3')   Base 1   Base 2     Δ
C-G-C-G-C-C-G         250      284     34
A-T-T-A-A-T-G        1063     1122     59
T-G-A-C-C-A-G         415      481     66
C-A-A-C-T-G-G         191      293    102
G-C-A-A-A-C-C         919     1030    111
A-A-C-C-A-C-C         886     1009    123
A-T-A-T-C-T-C         637      828    191
C-C-T-G-C-A-C         244      444    200
C-A-A-A-C-C-A         710      920    210
C-T-G-G-G-C-G         533      779    246
G-G-T-G-G-T-G          21      311    290

Of the 34 exact 7-base repeats these are those with the lowest value of Δ, excepting the single overlapping repeat (Δ = 6) which is entered in Table A3. The designations of the columns are defined in the legend to Table A1.

some observed deletion. It is possible that the smaller repeat unit accounts for the absence of the seven-base repeats among the deletions. However (for reasons discussed below), it seems more likely that among the deletion endpoints there is some sequence resemblance for which the process is partially specific. Table A1 shows that there are striking similarities among the eight-base repeats for the five deletions. The S74 and S23 endpoints match in five out of eight bases, and, if the pyrimidines are considered equivalent, in seven out of eight bases. The S10 and S23 endpoints match in six out of eight bases with a two-base shift (G-C-G-G-C-G).

Some sort of specificity is also suggested by the deletion ending in the five-base repeat. This deletion, found twice in a sample of independently isolated deletions, is quite small (22 bases), but there are many repeats of five and six bases as close or closer together. In Figure A1 all the repeats in the I gene closer than 50 bases are represented. Figure A1(a), in which the repeats are displayed as a function of the length of the repeats and of their spacing, suggests that there is something notable about the particular repeats for which deletions were found, surrounded as they are by other available sites. That the deletions S65 and S32 were isolated independently and found to terminate at the same five-base repeat suggests that this site is preferred. The probability of the same deletion being found twice if all five- and six-base repeats closer than 50 bases were equally likely is about 0.03, and if all 902 five-base repeats

[Figure A1: panel (a) plots the length of repeat against the spacing between repeats (Δ); panel (b) plots the close repeats against position in the I gene, on a scale from 0 to 1100.]

FIG. A1. The distribution of close, direct repeats in the I gene. Both (a) and (b) include only those repeats closer than 50 bases (Δ < 50). (a) This shows the distribution with respect to the spacing between repeats, and the vertical axis indicates the length of repeat involved. Overlapping or contiguous repeats are indicated by a solid circle (●). Occurrence of more than one repeat with the same Δ is shown by multiple symbols. The deletion mutations are indicated over the repeats that are their endpoints. (b) This shows the close repeats distributed along the I gene. The scale is the same as in Table A1. Deletions are indicated by heavy bars and their numbers.

were equally likely, it is negligibly small. In Figure A1(b) the distribution of the close repeats is shown. The only notable feature of this distribution is the clustering of the close repeats near the beginning and the end of the gene. Nothing can be said here of the fact that the two deletions, of roughly the same size, are located in the same region of the gene. The sample is simply too small. Only characterization of additional deletions can determine the actual specificity.

In this connection it may be useful to note that subsequences of repeated deletion endpoints exist elsewhere in the gene. If the specificity is even partially carried in such a subsequence, its efficacy as a deletion endpoint may be enhanced over background. An example of such a subsequence is the last entry in Table A2. This seven-base sequence matches in six out of seven bases the endpoints of deletions S74/S112.

There are three repeats displayed in Table A1 which have identical sequences, resulting from the triple repetition of an eight-base sequence. Such a multiple repetition is suggestive of duplication events in the evolution of the I gene. Other evidence supporting such an assertion will be considered in another paper.

(b) Slipped mispairing

The striking mutational hotspot discovered in the I gene has been explained as the consequence of slipped mispairings leading to the insertion or deletion of four base-pairs (see the accompanying paper). There is only one prominent hotspot in the gene; thus it may be revealing to examine the opportunities for slipped mispairing afforded by the sequence elsewhere in the I gene. This was done in much the same way as the repeats were analysed. We have ignored here the possibility that sequences which include mismatches can participate in slipped mispairing, and have only looked for contiguous or overlapping exact repeats of five bases or more. These repeats are indicated on the left side of Figure A1(a) by the solid dots. They are also tabulated in Table A3. It can be seen immediately that the hotspot is within the stretch of DNA


TABLE A3

Opportunities for slipped mispairing in the I gene

    Base 1   Base 2    Δ    Repeat (5' to 3')     Length
⎧     620      624     4    C-T-G-G-C-T-G-G-C       9
⎩     620      628     8    C-T-G-G-C               5
      610      616     6    G-C-G-T-C-T-G           7
      309      311     2    G-T-G-G-T-G             6
      947      949     2    C-T-C-T-C               5
      187      190     3    A-C-A-A-C               5
     1000     1005     5    G-A-A-A-A               5

This list represents all those exact repeats of 5 or more bases which are contiguous or overlapping. The bracket marking the first 2 entries indicates that the second is a subsequence of the first. Column designations are as for Tables A1 and A2.

that can be slipped and mispaired with the maximum homology in the fashion proposed by Streisinger et al. (1966). Note that the same region can be slipped eight bases as well as four, but at the cost of reducing the homology from nine to five bases. Only the four-base deletion and insertion events have been observed. Convincing evidence that five bases is insufficient homology for this process is provided by the observation that, having lost four bases at this point, mutants rarely revert: the hotspot is no longer hot when reduced to a five-base homology (see the accompanying paper).

There are only two other candidates in the gene for slipped mispairing, having homologies of six and seven. The situation for these sites is rather ambiguous, however, because it is not clear what effect these strand-slip-produced mutations have on the protein. The seven-base site could add or delete six bases, while the six-base site could add or delete three bases, neither resulting in a frameshift.
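A search for such sites can be sketched as follows. The criterion used here, that an exact repeat of length at least 5 counts as contiguous or overlapping when its spacing Δ does not exceed its length, is one reading of the text and is an assumption, as are the function name and the toy sequence (three tandem copies of C-T-G-G followed by C, flanked by arbitrary bases).

```python
def slipped_mispairing_sites(seq, min_len=5):
    """Find maximal exact repeats that are contiguous or overlapping.

    Returns (base1, base2, delta, repeat, length) tuples with 1-based
    positions. A repeat qualifies when length >= min_len and delta <= length
    (assumed definition of "contiguous or overlapping").
    """
    sites = []
    n = len(seq)
    # delta <= length and delta + length <= n together imply delta <= n // 2.
    for delta in range(1, n // 2 + 1):
        i = 0
        while i + delta < n:
            # Length of the run where the sequence matches itself shifted by delta.
            length = 0
            while i + delta + length < n and seq[i + length] == seq[i + delta + length]:
                length += 1
            if length >= min_len and delta <= length:
                sites.append((i + 1, i + delta + 1, delta, seq[i:i + length], length))
            # Skip past the run (or past a single mismatching base).
            i += max(length, 1)
    return sites


toy = "TA" + "CTGGCTGGCTGGC" + "AT"  # hypothetical flanks around the repeat unit
print(slipped_mispairing_sites(toy))
```

On the toy sequence this reports the nine-base repeat at a spacing of four bases. The five-base homology at a spacing of eight bases, which Table A3 lists as a subsequence of the first entry, has Δ greater than its length and so falls outside the strict criterion assumed here.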

TABLE A4

Inverted repeats of eight or more bases in the I gene

Base 1   Base 2     Δ    Sequence (5' to 3')     Length
   87      128     41    A-C-G-C-G-G-G-A-A-A       10
  238      953    715    C-A-G-G-G-C-C-A-G          9
   16     1010    994    A-C-C-A-C-C-C-T-G          9
  284      605    321    T-C-G-G-C-G-C-G            8
  249      606    357    C-G-G-C-G-C-G-T            8
  232      683    451    G-A-C-T-G-G-A-G            8
  160      655    495    A-A-T-T-C-A-G-C            8
  107      621    514    T-G-G-C-T-G-G-C            8
  107      625    518    T-G-G-C-T-G-G-C            8
  239      954    715    A-G-G-G-C-C-A-G            8
  214      973    759    C-A-A-T-C-A-G-C            8
  300     1072    772    G-C-T-G-G-C-A-C            8

Column designations are as in previous Tables. However, since these are inverted repeats the 2 sequences are different, and the listed sequence is that one beginning at base 2 in each case.

(c) Inverted repeats

To complete the picture of the structure of repeated sequences in the gene, the inverted repeats have been analysed, even though no genetic events have yet been associated with them. The algorithm used here is identical to the direct repeat algorithm after a step which inverts and takes the (base-pairing) complement of the sequence of interest. The results of this scan for exact inverted repeats are shown in Table A4 for sequences of eight bases or more. The number of occurrences is in accord with the expected numbers from a random sequence.

The analysis presented here appears to support the notion that some sequence specificity is involved in deletion formation. The evidence seems rather convincing that the distance between repeat sequences is an important factor in determining the frequency of deletions with these endpoints, and that the frameshift hotspot is simply due to slipped mispairing at that site. However, the analysis is merely supportive, particularly as regards sequence specificity, and will lead to more definite conclusions as more spontaneous mutations in the I gene are analysed.
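The inverted-repeat scan described in this section, in which each window is inverted and complemented before the direct comparison, can be sketched as below. The function names and the toy sequence are illustrative assumptions; only exact matches are reported, and the sequence returned is the one beginning at base 2, following the convention of Table A4.

```python
# Watson-Crick complement for an upper-case DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Invert the sequence and take its base-pairing complement."""
    return s.translate(COMPLEMENT)[::-1]


def scan_inverted_repeats(seq, n):
    """Report exact inverted repeats of length n.

    Each window is reverse-complemented and then compared with every later
    window, exactly as in a direct-repeat scan. Returns
    (base1, base2, delta, sequence_at_base2) tuples with 1-based positions.
    """
    hits = []
    last_start = len(seq) - n
    for i in range(last_start + 1):
        probe = reverse_complement(seq[i:i + n])
        for j in range(i + 1, last_start + 1):
            if seq[j:j + n] == probe:
                hits.append((i + 1, j + 1, j - i, seq[j:j + n]))
    return hits


# Toy sequence (hypothetical): an eight-base inverted repeat with a 4-base spacer.
toy = reverse_complement("GACTGGAG") + "AAAA" + "GACTGGAG"
print(scan_inverted_repeats(toy, 8))
```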

I thank Dr J. H. Miller for support, and for reading the manuscript. This work was supported by a grant from the Swiss National Fund (F.N. 3.179.77.).

Farabaugh, P. J. (1978). Nature (London), 274, 765-769.
Streisinger, G., Okada, Y., Emrich, J., Newton, J., Tsugita, A., Terzaghi, E. & Inouye, M. (1966). Cold Spring Harbor Symp. Quant. Biol. 31, 77-84.