Ring theory


J. Mol. Biol. (1973) 77, 85-99

Ring Theory

C. A. THOMAS JR, B. H. ZIMM† AND B. M. DANCIS

Department of Biological Chemistry, Harvard Medical School, Boston, Mass. 02115, U.S.A.

(Received 18 September 1972, and in revised form 8 February 1973)

In what follows we demonstrate that the minimum requirement for the formation of a DNA ring is a pair of ordinary (ABC . . . ABC) or inverted (ABC . . . C'B'A') repetitions. DNA fragments that are partly degraded from their ends by a 3' (or 5') specific exonuclease such as exonuclease III (or λ exonuclease) produce resected fragments that can only form rings by virtue of ordinary repetitions. Next we analyze how random fragments cut from DNA molecules containing ordinary repetitions would be expected to form rings. Since longer fragments (> 5 to 10 µm) cyclize less efficiently than do shorter ones (2 µm), we are led to the view that the chromatid is composed of thousands of distinctive regions, called g-regions, within which characteristic repetitious sequences are clustered in an intermittent or tandem fashion. Mathematical expressions are derived that allow one to measure the length and number of these g-regions from the ring frequency, $R$, and its dependence on the length of the fragment.

The interior organization of the g-regions is considered in terms of two models and their variants: intermittent repetition and tandem repetition. These are depicted in Figure 2. The objective of this effort is to calculate the frequency of rings that can be generated from these two models, and to explain the “short-side fall-off”, that is, the decrease in ring frequency as the fragment length becomes shorter. This could not be due to the stiffness of the DNA double helix and must reflect a distribution of spacing of the repetitious sequences within the g-regions. Mathematical expressions are obtained that allow one to estimate the average values of the repetitive or partly repetitive unit. These estimates may be obtained from the dependence of ring frequency on the extent of resection, and from the dependence of ring frequency on the length of shorter fragments.

The mathematical expressions derived here are employed in the previous papers of this group, and lead to the conclusion that the g-regions are composed of tandemly repeating sequences.

1. Table of Definitions

$R$, Ring frequency. The number fraction of all structures seen that are rings. Includes rings, lariats and double (poly) rings.

$F$, The number fraction of cyclizable fragments, counting only those fragments having both exposed ends within a g-region.

$C$, The total number fraction of cyclizable fragments, counting those that have both exposed ends within g-regions and those that do not. Clearly $R = \epsilon C$.

$\epsilon$, The efficiency of ring formation. This factor lumps all inefficiencies of resection, annealing, etc. It is the number of rings formed divided by the number of fragments that could have formed rings if resection or annealing conditions were appropriate.

$r$, The number of nucleotides resected from a given terminal. Usually taken as the average value determined by the fraction of acid-soluble nucleotides.

$l$, Duplex fragment length measured in nucleotide pairs.

$b_0$, The minimum number of nucleotides required to form a ring under the conditions of annealing (often taken as 33).

$l'$, The “resected length”, $l' = l - 2r + 2b_0 - 1$.

$g$, The number of nucleotide pairs in a given g-region. A g-region is a length of double helix containing clustered (intermittent or tandem) repetitions.

$\gamma$, The fraction of all nucleotide pairs residing in the g-regions.

$\bar{g}$, The number average of $g$.

$g^*$, The number average of $g$ when the average includes only g-regions having $l'$ or more units.

$\gamma^*$, The fraction of nucleotides residing in g-regions having $l'$ or more nucleotide pairs.

$n$, The number of g-regions per haploid genome. (Sometimes used as the number of tandemly-repeating units in a given DNA fragment, and for other purposes.)

$n^*$, The number of g-regions having $l'$ or more nucleotides.

$A$, The number of nucleotide pairs in the entire haploid genome.

$g'$, The number of nucleotide pairs in a hypothetical intermittent repetition. In this model, many blocks $g'$ units long are located in an irregular way within a g-region.

$\alpha$, The fraction of nucleotides in a g-region that reside in $g'$ blocks within that g-region.

$\alpha'$, Has a special meaning defined by equation (15).

$s$, The number of nucleotides in a tandemly-repeating unit within a g-region. When used in connection with a fractional tandem model it is the spacing of the intermittently repeating sequences.

$f$, The number of tandemly-repeating blocks of length $s$ in a g-region: $g = fs$.

† Present address: Department of Chemistry, Revelle College, University of California, La Jolla, Calif. 92037, U.S.A.

2. Results and Discussion

The following analysis deals with the formation of rings from fragments of DNA by the association of single polynucleotide chains of complementary sequence. The DNA fragments are considered to be portions of a double helix cut randomly from a much longer double helix representing the mononemic chromatid. On this occasion, only folded rings will be considered; rings arising from the association of completely denatured polynucleotide chains (slipped rings) involve factors which are not understood or controlled and are not as well characterized experimentally.


The minimum requirement for the formation of a folded ring is a repetition in the nucleotide sequence. Assuming that single polynucleotide chains associate in an antiparallel manner, only two kinds of repetitions are relevant: ordinary repetitions and inverted repetitions. These are illustrated in Figure 1 by an arbitrarily chosen block of five nucleotides. In order to expose single chains with complementary sequence, one or both chains of the double helix must be broken in order to provide a chain terminal at which an exonuclease can act. In the case of an ordinary repeat, this nick, or chop, must occur to the left and right of the two repetitive sequences. A conventional exonuclease (i.e. one acting at the 3' end, or the 5' end, but not both) will then expose the complementary sequences, provided that the initial chain breakage did not occur too far away, and that a sufficient number of nucleotides was removed.

[Figure 1. Panels: ordinary repeat and inverted repeat, each shown with resection at a nick and resection at a terminal, using the example block AGGAT.]

FIG. 1. Rings from ordinary and inverted repetitions. The double helix is depicted as two horizontal parallel lines. The 5 nucleotide pairs representing the repetitious sequence are depicted by “rungs” and labeled. The “X” locates the position of a single- or double-chain breakage.

In the case of an inverted repetition, conventional exonucleases cannot expose complementary sequences from fragments because such nucleases destroy one of the two possible complementary sequences. However, nicks located to the left of each repetition will allow a 3'-acting exonuclease, such as exonuclease III, to expose complementary sequences. Likewise, 5'-acting exonucleases would expose complementary sequences if the nicks were located to the right of the repetitious sequences. λ exonuclease might serve as an example, but this enzyme will not operate at nicks. (The “lefts” and “rights” refer to the drawing in Fig. 1; obviously if the alternate chain were nicked, the rights and lefts would be reversed.) With inverted repetitions, the complementary sequences reside on the same chain; therefore, the newly-associated double helix will not lie in the ring, but as a side arm (Fig. 1). Whether such a side arm would be visible in the electron microscope or not depends on its length. In any event such “lariat” structures are not diagnostic of inverted repetitions because ordinary repeats can generate structures of similar appearance.


While inverted repetitions may well exist in eukaryotic DNA, we can be sure that they are not responsible for the majority of the folded rings because λ exonuclease (a conventional exonuclease that cannot operate at nicks) will produce just as many rings as exonuclease III, and both nucleases produce rings whose contour length is equal to (predictably slightly shorter than) the length of the linear fragments. Therefore, we shall be considering models of the chromatid in which ordinary repetitions are responsible for the observed rings. Double and polyrings will not be considered at this time.

[Figure 2. Schematic of the chromatid; recoverable labels are “Intermittent”, $g$, $g'$ and panel (c).]

FIG. 2. The regionally-repetitious chromatid. The single line bearing folded regions reminiscent of chromomeres represents the mononemic chromatid that is organized into distinctive regions, the contour length of which is $g$ nucleotide pairs. The fraction of all the nucleotides organized into such regions is $\gamma$. We can picture two extreme models for the organization of sequences within these regions. At one extreme we have tandem repetition, where each sequence $s$ units long is repeated $f$ times to make up the total region $g$ ($g = fs$). This model may be degraded by: (a) supposing that only a portion of $s$ is repeating, the remainder being non-repeating. This might be called the “fractional tandem” model. Alternatively: (b) one might picture the sequences as being only partially repeating, that is, containing occasional substitutions that differ from repetition to repetition. This might be called the “variegated tandem” model. The second model, called intermittent repetition, assumes that each region containing a total of $g$ nucleotides is composed of non-repetitious DNA containing irregularly arranged sequences $g'$ nucleotides long. The number of them, $n'$, and the fraction, $\alpha$, of the nucleotides of the region represented in intermittently-repetitious sequences is negotiable ($n'g' = \alpha g$). For mathematical convenience we assume all blocks of $g'$ are copolymers, that is, tandemly-repetitious with a very short $s'$. This clearly overestimates the fraction of cyclizable fragments. The intervening non-repetitious DNA is pictured to be similar to Escherichia coli DNA. If the repetitious sequences are regularly spaced as shown in (c), we return to the fractional tandem-repetition model (a).

We start with the assumption that fragmentation is proceeding in a random fashion (at least with regard to ring-forming ability). However, fragments of different length produce rings at different frequency. It appears that fragments 1 to 2 µm long produce rings most efficiently, while longer fragments rarely form rings. Thus it appears that the ordinary repetitions are clustered into relatively short regions of variable extent, few of which extend over 5 to 10 µm in length, although Necturus DNA may have much longer regions. The folded ring experiments mean that eukaryotic DNA is regionally repetitious. Remote regions may have similar or even identical sequences, but these experiments tell us nothing about this possibility.

(a) The regionally-repetitious chromatid

The regionally-repetitious model for the chromosome is depicted in Figure 2. Each of the various regions, called generically “g-regions”, is shown as a compacted wavy line giving the appearance of a chromomere. The interior organization of these regions need not concern us for the moment, provided that we imagine them to be densely populated with repetitious sequences that are distinctive of the given region. Each region is pictured as containing its own characteristic repetitious sequences. Not all of the DNA need be in repetitious regions; this DNA, about which the previous experiments tell us nothing, is depicted as the straight line segment between the regionally-repetitious regions. The number of nucleotide pairs in any one g-region is called $g$, the number of such g-regions per genome is $n$, and the number of nucleotide pairs in the entire genome is $A$. For analytical simplicity we picture the physical genome as a ring formed by the end-to-end enchainment of mononemic chromatids, thereby avoiding end effects. The fraction of nucleotides found in regionally-repetitious regions will be called $\gamma$. Let us suppose that the lengths of the various regions have some characteristic number distribution $G(g)$ so normalized that

$$\gamma = \frac{1}{A}\int_0^{\infty} g\,G(g)\,dg \quad\text{and}\quad n = \int_0^{\infty} G(g)\,dg. \tag{1}$$

At the outset let us suppose that all g-regions have precisely the same length $g$. Suppose that a large number of these genomes is randomly broken into double-chain fragments containing $l$ nucleotide pairs. There are $g$ ways to locate the first nucleotide pair of the fragment in the $g$ positions of the region. We now suppose that an exonuclease “resects” each terminal, thereby exposing $r$ nucleotides. We assume a minimum of $b_0$ complementary nucleotides must unite to form a stable ring. Under the most favorable conditions, there are only $g - (l - 2r + 2b_0) + 1$ ways that a cyclizable fragment will result. This can be seen from the following drawing, wherein $b_0 = 4$, $r = 11$, $g = 23$, and $l = 33$. With these values the count is $23 - (33 - 22 + 8) + 1 = 5$.

The fraction of cyclizable fragments is

$$C = \gamma\,\frac{g - l'}{g} = \gamma\left(1 - \frac{l'}{g}\right), \tag{2}$$

where $l' = l - 2r + 2b_0 - 1$, and $l'$ is always smaller than $g$; otherwise $C = 0$. A plot of $C$ versus $l'$ gives an intercept equal to $g$ on the $l'$ axis, and equal to $\gamma$ on the $C$ axis. If we now suppose that the regions are not all of the same length, but that $G(g)\,dg$ describes the number of g-regions having $g$ to $g + dg$ nucleotide pairs, then

$$C = \frac{\displaystyle\int_{l'}^{\infty} G(g)\,\frac{g - l'}{g}\,\frac{g}{l}\,dg}{A/l}, \tag{3}$$

where $g/l$ is the number of fragments produced from a region $g$ units long and the total number of fragments is $A/l$. Thus

$$C = \frac{1}{A}\int_{l'}^{\infty} G(g)\,(g - l')\,dg. \tag{4}$$

When $l'$ is smaller than any value of $g$ for which $G(g)$ is significant, we can extend the integration to 0:

$$C = \frac{1}{A}\int_0^{\infty} g\,G(g)\,dg - \frac{l'}{A}\int_0^{\infty} G(g)\,dg = \gamma - \frac{l' n}{A}. \tag{5}$$

Defining the number average value of $g$ as

$$\bar{g} = \frac{1}{n}\int_0^{\infty} g\,G(g)\,dg = \frac{\gamma A}{n} \tag{6}$$

and combining (5) and (6) we have

$$C = \gamma\,(1 - l'/\bar{g}), \tag{7}$$

which is similar to equation (2). From equation (7) we can see that the projected intercept of $C$ on the $l'$ axis is $\bar{g}$. The projected or actual intercept on the $C$ axis is $\gamma$, the fraction of nucleotides in regionally-repetitious regions. This is depicted in Figure 3(a).

When $l'$ is comparable to the length of some of the g-regions, equation (7) no longer applies. In Figure 3(b), a sketch of $C$ versus $l'$ for an arbitrary $G(g)$ is shown. Note that the curve has a negative slope, but positive curvature over the entire range. It approaches the $C$ axis in a linear fashion giving an intercept, $\gamma$, and a slope that projects to $\bar{g}$ on the $l'$ axis in a manner consistent with equation (7) and Figure 3(a). A line drawn tangent to the curve at other values of $l'$ projects to lower intercepts on the $C$ axis and higher values on the $l'$ axis. These intercepts can be understood in the following way: if we define $n^*$ as the number of g-regions having values of $g > l'$,

$$n^* = \int_{l'}^{\infty} G(g)\,dg, \tag{8}$$

and $g^*$ as the number average value of $g$, counting only those regions having $l'$ or more nucleotide pairs,

$$g^* = \frac{1}{n^*}\int_{l'}^{\infty} g\,G(g)\,dg, \tag{9}$$

and $\gamma^*$ as the fraction of the DNA residing in g-regions containing $l'$ or more nucleotide pairs,

$$\gamma^* = \frac{1}{A}\int_{l'}^{\infty} g\,G(g)\,dg; \tag{10}$$

then we can resolve equation (4) in the following way:

$$\begin{aligned}
C &= \frac{1}{A}\int_{l'}^{\infty} g\,G(g)\,dg - \frac{l'}{A}\int_{l'}^{\infty} G(g)\,dg \\
  &= \gamma^* - \frac{l' n^*}{A} \\
  &= \gamma^*\left[1 - \frac{l' n^*}{A\gamma^*}\right] \\
  &= \gamma^*\left(1 - \frac{l'}{g^*}\right).
\end{aligned} \tag{11}$$
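The tangent construction implied by equation (11) can be checked numerically. The following is a minimal Python sketch, not part of the original analysis: it evaluates equation (4) for a discrete inventory of g-regions and compares the intercepts of the tangent at a chosen $l'$ with $\gamma^*$ and $g^*$ from equations (9) and (10). The region lengths, their counts, the genome size and all function names are invented for illustration.

```python
# Illustrative sketch: the g-region inventory and genome size are assumed values,
# not data from the experiments discussed in the text.

def cyclizable_fraction(regions, l_prime, genome_size):
    """Equation (4) for a discrete G(g): C = (1/A) * sum of G(g) * (g - l')."""
    return sum(count * (g - l_prime)
               for g, count in regions.items() if g > l_prime) / genome_size


def starred_quantities(regions, l_prime, genome_size):
    """Equations (8)-(10): n*, g* and gamma* for regions longer than l'."""
    n_star = sum(count for g, count in regions.items() if g > l_prime)
    pairs = sum(g * count for g, count in regions.items() if g > l_prime)
    return n_star, pairs / n_star, pairs / genome_size


def tangent_intercepts(regions, l_prime, genome_size, step=1.0):
    """Intercepts of the tangent to C(l') on the C axis and on the l' axis."""
    c_here = cyclizable_fraction(regions, l_prime, genome_size)
    slope = (cyclizable_fraction(regions, l_prime + step, genome_size)
             - cyclizable_fraction(regions, l_prime - step, genome_size)) / (2 * step)
    return c_here - slope * l_prime, l_prime - c_here / slope


# Hypothetical genome: region length g (nucleotide pairs) -> number of regions.
REGIONS = {2000: 300, 5000: 200, 12000: 100}
GENOME_SIZE = 10_000_000

n_star, g_star, gamma_star = starred_quantities(REGIONS, 3000, GENOME_SIZE)
c_axis, l_axis = tangent_intercepts(REGIONS, 3000, GENOME_SIZE)
print(f"gamma* = {gamma_star:.4f}, tangent intercept on C axis  = {c_axis:.4f}")
print(f"g*     = {g_star:.1f}, tangent intercept on l' axis = {l_axis:.1f}")
```

The central difference is exact here only because no assumed region length lies within one step of the chosen $l'$; with measured ring frequencies the tangent would have to be drawn through scattered points.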

[Figure 3. Panels (a) to (c): frequency of cyclizable fragments (or rings) plotted against $l'$.]

FIG. 3. The expected frequency of cyclizable fragments of various contour lengths. (a) $G(g)$ is narrowly distributed about a mean value of $g$ and $l'$ is less than any value of $g$ for which $G(g)$ is significant. (b) $l'$ and $g$ are largely overlapping. (c) Assuming the observed ring frequency, $R$, is related to the fraction of cyclizable fragments, $C$, by a factor (less than 1) $\epsilon$ ($R = \epsilon C$), the intercept on the $R$ axis becomes $\epsilon\gamma$ (or $\epsilon\gamma^*$) but the intercept on the $l'$ axis remains $\bar{g}$ (or $g^*$).

From this expression we may see that the tangent drawn at any value of $l'$ will extrapolate to $\gamma^*$ on the $C$ axis and $g^*$ on the $l'$ axis. Thus by making measurements of the number fraction of rings, $R(l')$, of various contour lengths one may probe the distribution $G(g)$. Assuming that the ring frequency, $R(l')$, is smaller than $C$ by a constant factor, $\epsilon$, which is the unknown efficiency of resection, folding and scoring, we have

$$R(l') = \epsilon\gamma^*[1 - l'/g^*]. \tag{12}$$

As shown in Figure 3(c), in the $R$ versus $l'$ graph, the intercept on the ordinate is affected by $\epsilon$ (it becomes $\epsilon\gamma^*$) but the intercept on the $l'$ axis is not (it remains $g^*$). From equation (12) one may estimate $\epsilon\gamma^*$ and $g^*$ for the various DNAs studied. These are shown in the following Table (Table 1).

TABLE 1
Estimation of $\epsilon\gamma^*$ and $g^*$ from ring frequency as a function of length

DNA                    $l' = l - 2r + 2b_0$ (µm)    $\epsilon\gamma^*$ (%)    $g^*$ (µm)
Drosophila virilis†    2.2                          25                        6
Mouse (liver)‡         2.5                          25                        8
Necturus (blood)‡      6                            20                        17

† Taken from Fig. 7 of Lee & Thomas (1973) by drawing a tangent to the curve at 2.5 µm.
‡ Taken from Fig. 1 of Pyeritz & Thomas (1973) by drawing tangents to the curve at 3 and 6 µm.

The first line of this Table can be read “more than 25% of the D. virilis DNA is organized into g-regions having an average length of 6 microns if those regions smaller than 2.2 microns are not counted”. It must be stated that the precision of these statements is very low, since many points are not available at greater fragment lengths. The reason for this is that, with the exception of Necturus DNA, large rings are very rare. While we are confident that $R(l')$ falls as $l'$ increases, the actual curve is difficult to determine by counting rings in the electron microscope.

The development to this point predicts that the highest frequency of rings is obtained from the shortest fragments. This is clearly contrary to the facts. In order to account for the decrease in ring frequency with decreasing fragment length, we must consider the interior organization of the regionally-repetitious regions. To this subject we now turn.

(b) Two models

We can advance two models for the interior organization of sequences within a g-region. The first model, and its variants, are called “intermittent repetition”. In this model the g-region is thought to be composed of non-repetitious DNA containing a large number of blocks of repeated simple sequences. These blocks, all of which are pictured to consist of $g'$ nucleotide pairs, are thought to be highly internally repetitious, or copolymer-like. Each g-region is thought to contain its own characteristic $g'$-blocks. Initially, we shall suppose that these blocks are distributed randomly; later we introduce irregular and then regular spacing. Finally, we introduce the requirement that the $g'$-blocks are non-internally repetitious in an effort to explain the observation that the frequency of rings decreases when the fragment length becomes small. The other extreme model is derived from Callan's (1960) original suggestion, namely, that the chromomeres contain tandemly-repeating sequences.
This idea has been supported by the studies of the regions that specify ribosomal RNA and 5 S RNA. This model, called the “tandem repetition model”, asserts that each g-region is composed of $f$ identical sequences arranged contiguously in tandem; each sequence is $s$ nucleotide pairs in length ($g = fs$). In contrast to the intermittent repetition model, the repeating sequences are thought to be long (1000 nucleotide pairs) and internally non-repetitious. As before, each g-region is thought to contain one type of repeating sequence only; the exact sequence is characteristic of that particular g-region. To summarize then, the one model assumes that each g-region contains a number of short identical, or nearly identical, sequences interspersed with much non-repeating DNA; the other assumes that each g-region is a tandem series of identical repeats of a rather long characteristic sequence. These models and variations of them are depicted in Figure 2.

(i) Intermittent repetitions

(1) Random spacing of repeating sequences. We imagine a given region of the chromatid containing $g$ nucleotide pairs to be divided up into a large number of blocks, all of which are $g'$ units long ($g' = 5$ in the drawing below).

Next we imagine that a certain fraction, $\alpha$, of these blocks has the characteristic repetitious sequence (marked “×” above). The probability that a given block contains this characteristic sequence is, therefore, $\alpha$. This model results in contiguous repeating blocks, or repeating blocks separated by an integral number of $g'$ nucleotides. Next we must calculate how frequently a shear breakage followed by a resection will expose a minimum sequence of $b_0$ nucleotides that is complementary to another sequence of $b_0$ nucleotides exposed by the same process at the opposite end of the fragment. Here we must make an assumption regarding the nature of the repetitive sequence. We are interested in finding the case in which the largest number of rings will form with the least amount of repetition in the DNA (i.e. the smallest value of $\alpha$). Rings will form most easily when the repetitive sequence is itself internally repetitious. Therefore, the intermittent repetitions are pictured to be homopolymers, for example, dG:dC, or simple regular copolymers such as dA-dT:dT-dA. This clearly is an extreme case and predicts the highest frequency of rings. With this assumption, all we need calculate is how frequently any two specified resections of $r$ nucleotides will expose one or more sequences that are $b_0$ nucleotides or longer. The number of ways in which a fragment can be cut so that a resection of $r$ nucleotides will expose $b_0$ or more nucleotides of a repeating region $g'$ nucleotides long is $r + g' - 2b_0 + 1$, as can be seen from the following diagram, wherein the two extreme positions of a $g'$ block are shown.

The number of blocks of $g'$ nucleotides is:

$$(r + g' - 2b_0 + 1)/g'. \tag{13}$$

If $\alpha$ is the chance that a given $g'$-block contains a repeating sequence, then $1 - \alpha$ is the chance that it does not. The chance that all exposed $g'$-blocks do not have repeating sequences is:

$$(1 - \alpha)^{(r + g' - 2b_0 + 1)/g'}. \tag{14}$$

Therefore, the chance, call it $\alpha'$, that a given resection exposes one or more repeating sequences is:

$$\alpha' = 1 - (1 - \alpha)^{(r + g' - 2b_0 + 1)/g'}. \tag{15}$$

The chance that a given pair of resections (such as those at the terminals of a given fragment) both expose one or more repeating sequences is:

$$F = (\alpha')^2, \tag{16}$$

where $F$ is equal to the number fraction of cyclizable fragments derived from the region in question. If all regions have approximately the same values of $g'$ and $b_0$ and enjoy the same extent of resection at each random break, then the total fraction of cyclizable fragments should be $C$, where

$$C = F\gamma^*[1 - l'/g^*]. \tag{17}$$
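Equations (13) to (17) chain together directly, and a short Python sketch may make the dependence on $r$ and $\alpha$ easier to explore. This is an illustration only: the function names and the numerical values are assumptions, not taken from the experiments, and the guard that sets the number of exposed blocks to zero for very short resections is a convenience that does not appear in the text.

```python
def blocks_exposed(r, g_block, b0):
    """Equation (13): number of g'-blocks reachable by a resection of r nucleotides."""
    return max(r + g_block - 2 * b0 + 1, 0) / g_block


def exposure_chance(alpha, r, g_block, b0):
    """Equation (15): chance that one resection exposes at least one repeating block."""
    return 1.0 - (1.0 - alpha) ** blocks_exposed(r, g_block, b0)


def cyclizable_fraction(alpha, r, g_block, b0, l, gamma_star, g_star):
    """Equations (16) and (17): C = (alpha')**2 * gamma* * (1 - l'/g*)."""
    l_prime = l - 2 * r + 2 * b0 - 1          # resected length, as in equation (2)
    if l_prime >= g_star:
        return 0.0                            # fragment exceeds the regions counted
    f = exposure_chance(alpha, r, g_block, b0) ** 2
    return f * gamma_star * (1.0 - l_prime / g_star)


# Assumed example values (in nucleotides); none are measurements from the text.
print(exposure_chance(alpha=0.5, r=115, g_block=50, b0=33))   # exponent is exactly 2
print(cyclizable_fraction(alpha=0.1, r=500, g_block=50, b0=33,
                          l=6000, gamma_star=0.25, g_star=18000))
```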

The second factor is equation (11) again. Clearly (17) assumes that all regions have values of $g$ that are greater than $g'$. Equation (17) is also an over-estimate of $C$ because equation (11) assumes that there are actually $g - (l - 2r + 2b_0 - 1)$ ways of making a cyclizable fragment from a given region. This assumes a high density of intermittent repetitions. The validity of equation (16) might be questioned because the total number of $g'$-blocks represented in a stretch of $r + g' - 2b_0 + 1$ nucleotides may in actuality not be an integer, whereas equations such as (15) generally deal with integer values of the exponent. This led us to re-analysis of the problem, taking account of all possibilities by computer simulation, but a maximum difference of less than 5% in $F$ was found.

(2) Irregular and regular disposition of repeating sequences. Because of the importance of equation (17) in the arguments presented in the accompanying papers, we have given much thought to the following problem: what kind of intermittent repetition model would produce the highest frequency of cyclizable fragments, $C$, with the smallest density of intermittent repetitions? As the most extreme case, we could suppose that the $g'$-blocks are so spaced that no more than one of them is ever exposed at any resected terminal. In this way one does not “waste” a second or third $g'$-block when a single one would do to produce a cohesive terminal. If the repetitious $g'$-blocks were randomly distributed, juxtaposition of two such blocks would be a frequent occurrence when $\alpha$ assumed appreciable values. As a result of this, a higher value of $\alpha$ would be required to account for a given frequency of rings than would be the case if the $g'$-blocks were irregularly disposed throughout the g-region. We think that the frequency of rings given by such an “irregular disposition” model is merely the first term of the expansion of equation (16):

$$F = [\alpha(r + g' - 2b_0 + 1)/g']^2. \tag{18}$$
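The relation between the random-spacing result (16) and its first-term form (18) can be compared numerically. The sketch below is illustrative only; the function names and parameter values are assumptions. Since $(1 - \alpha)^k \ge 1 - k\alpha$ whenever the exponent $k$ is at least 1, equation (18) can never fall below equation (16) in that range, and it is meaningful as a probability only while $\alpha k \le 1$.

```python
def exposed_blocks(r, g_block, b0):
    """Equation (13): the exponent k = (r + g' - 2*b0 + 1) / g'."""
    return max(r + g_block - 2 * b0 + 1, 0) / g_block


def f_random(alpha, r, g_block, b0):
    """Equations (15) and (16): randomly placed g'-blocks."""
    return (1.0 - (1.0 - alpha) ** exposed_blocks(r, g_block, b0)) ** 2


def f_irregular(alpha, r, g_block, b0):
    """Equation (18): first term of the expansion (at most one block per terminal)."""
    return (alpha * exposed_blocks(r, g_block, b0)) ** 2


# Assumed values: r = 500, g' = 50, b0 = 33 nucleotides, so k = 9.7.
for alpha in (0.001, 0.01, 0.05):
    random_f = f_random(alpha, 500, 50, 33)
    irregular_f = f_irregular(alpha, 500, 50, 33)
    print(f"alpha={alpha}: random {random_f:.3e}, irregular {irregular_f:.3e}, "
          f"ratio {irregular_f / random_f:.3f}")
```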

In order to satisfy ourselves that equations (16) and (18) did in fact represent a “worst case” (that is to say, predict the highest ring frequency with the smallest $\alpha$), we approached the calculation from a different point of view. This approach specifically involved the fragment length, $l$, which is absent from equations (16) and (18), because length is not important if the fragmentation event is unrelated to the location of a $g'$-block. In this formulation we supposed that the fragments formed a random distribution in length, and that the $g'$-blocks were spaced precisely at intervals of $s$ nucleotides. Under these circumstances, fragments having lengths that match the interval (or any multiple of it) have a high probability of being cyclizable. Those whose lengths do not match the interval have no chance of forming rings. When the appropriate integrations are done, we find:

@ g! (tp(s/L) exp{-b

+ 4/L)

1 - exp(--s/l)

I _ exp(--s/L)

[

- exp(-((f

+ l)slL)

fU - exd--s/h)1

where $L$ is the average value of $l$ and $f + 1$ is the number of $g'$-blocks per g-region. This expression (19) must be compared with (17) and (18) without $\gamma^*$, that is

$$C = [\alpha(r + g' - 2b_0 + 1)/g']^2[1 - l'/g^*].$$

We know that $\alpha'$ is always less than $\alpha$ and the entire expression to the right of $(\alpha')^2$ can be shown to be less than 1 by substituting various values of $L$, $s$ and $f$. For purposes of comparing ring frequency, we take $1 - l'/g^*$ to be 1. Thus equations (16) and (18) remain the “worst case”. Thus, there is no reason to consider (19) further. We think that it will not be possible to contrive an “intermittent repetition” model that will produce more rings than given by equation (18), if the fragmentation event is considered to occur randomly, or at least without reference to the position of $g'$-blocks.

(ii) Tandem repetition

We turn now to the second extreme model which assumes that the entire g-region is composed of an integral number of identical tandemly-repeating sequences. The analysis of this problem is facilitated by winding the double helix into a larger helix so that the identical sequences lie over one another as depicted in Figure 4. Each different region would be wound into a larger helix that would contain $f$ turns, each of $s$ nucleotide pairs. While such helices might conceivably have physical reality in the chromomere, here they are merely an analytical convenience. The chromatid is now broken as shown by the arrows (Fig. 4) and each terminal resected by $r$ nucleotides. For cyclization to occur, the breaks must take place $s + b_0$ or more nucleotides apart, and the resection must proceed until a complementary sequence of $b_0$ or more units in length is exposed. The rings so formed will contain an integral number, $n$, of repeating units. For example, two such fragments are shown in Figure 4. Fragment A contains one repeat, fragment B contains 3. The base pairs forming the closure of these rings are depicted as the thin vertical lines. From this Figure we can see that for cyclization to occur we must have:

$$s + b_0 < l < ns - b_0 + 2r < (f - 1)s - b_0 + 2r. \tag{20}$$

Clearly $l > s + b_0$, otherwise the resulting fragment would never expose (a sufficient number of) complementary nucleotides no matter how far resection proceeded. If resection proceeded to just that point that cyclization became possible, then $l = ns + 2r - b_0$; any resection in excess of this may produce a partly single-chain ring, but we assume such rings would be scored. Finally, it is clear that the maximum fragment length capable of producing a ring is defined by the number of nucleotide pairs, $g$, in the region under consideration ($g = fs$).

[Figure 4. Helix drawing with Fragment A ($n = 1$) and Fragment B ($n = 3$) marked.]

FIG. 4. A helical model to visualize a tandemly-repeating g-region. A region containing seven repeating subunits, each represented by one turn of the helix, is shown. Two fragments, A and B, are illustrated by the pieces between the vertical lines. The resection of the single chains from the fragment ends is indicated by the thin arrows. The complementary base pairs that can be formed between the exposed single chains are represented by the thin vertical lines.

Equation (20) can be inverted into a condition on $s$:

$$(l - 2r + b_0)/n < s < (l - b_0)/n. \tag{21}$$
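The condition on $s$ can be tabulated for a given fragment length and extent of resection. The following Python sketch lists, for each $n$, the window of repeat lengths $s$ allowed by the inverted condition, and tests whether a particular $s$ falls in any window. It is an illustration under stated assumptions: the function names, the cap `n_max` and the example values are hypothetical, and the upper limit that equation (20) places on $n$ through $f$ is ignored.

```python
def s_windows(l, r, b0, n_max):
    """Windows (n, low, high) of repeat length s allowed for n = 1..n_max."""
    windows = []
    for n in range(1, n_max + 1):
        low = max((l - 2 * r + b0) / n, 0.0)   # lower bound on s, floored at zero
        high = (l - b0) / n                    # upper bound on s
        if high > low:
            windows.append((n, low, high))
    return windows


def can_cyclize(l, r, b0, s, n_max=1000):
    """True if a repeat of length s lies strictly inside one of the windows."""
    return any(low < s < high for _, low, high in s_windows(l, r, b0, n_max))


# Assumed example: l = 3000 nucleotide pairs, r = 300, b0 = 33.
for n, low, high in s_windows(3000, 300, 33, 4):
    print(f"n = {n}: {low:.1f} < s < {high:.1f}")
print(can_cyclize(3000, 300, 33, s=900), can_cyclize(3000, 300, 33, s=1000))
```

Each window has width $2(r - b_0)/n$, so the windows narrow as $n$ grows while their spacing shrinks faster; the gaps between them close as $r$ increases.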

In Figure 5 a graph in s-space is drawn showing the various $s$ values that would result in ring formation from fragments of a particular length $l$ and degree of resection $r$. If all parameters in this problem are considered to be distributed, a very complicated problem results. Therefore, we assume that the distributions of $l$ and $r$ are narrow compared to those of $s$ and $g$. In this case $l$ and $r$ may be replaced by their mean values.

[Figure 5. The $s$ axis, with segments bounded by $(l - 2r + b_0)/n$ and $(l - b_0)/n$ for successive $n$, and the point $fs$ marked.]

FIG. 5. A graph in s-space. Heavy lines show the regions in which cyclization is possible.

Next we must separate the effects of $s$ and $g$. It may be that all values of $s$ are less than all values of $g$. In this case the effects of $s$ and $g$ on $C$ are almost independent. Let us define $S(s)$ as the weight distribution function of $s$, such that $S(s)\,ds$ is the weight fraction of the tandemly repeating regions with lengths between $s$ and $s + ds$. Note that $\int_{\text{all } s} S(s)\,ds = 1$. Making use of equation (4) and Figure 5, we can write:†

$$C = \frac{1}{A}\int_{l}^{\infty} G(g)\,[g - l]\,dg \times \sum_{n}\int_{(l - 2r + b_0)/n}^{(l - b_0)/n} S(s)\,ds. \tag{22}$$

Equation (22) takes on a simpler form under the following cases: case A ($l$ large compared to most values of $s$ and $r$ small compared to $l$); case B$_1$ ($l$ small compared to most values of $s$, and $r$ large compared to most values of $s$); case B$_2$ ($l$ large and $r$ large compared to most values of $s$). These cases are considered in order.

Case A ($l$ large compared to all values of $s$ for which $S(s)$ is significant, and $2r$ much smaller than $l$). In this case the sum in equation (22) may be approximated by an integral, since the terms that are important are those of large $n$, and are closely spaced (in Fig. 5). Thus we have:

where $y$ and $z$ are dummy variables, and $x = y/l$. Since $l$ is large, the last integral may be extended to infinity and becomes independent of $l$; and in fact defines the number average of $s$, $\bar{s}$, since $S(s)/s$ is proportional to the number of regions of size $s$:

$$\bar{s} = \frac{\int_0^{\infty} S(s)\,ds}{\int_0^{\infty} [S(s)/s]\,ds} = \frac{1}{\int_0^{\infty} [S(s)/s]\,ds}. \tag{25}$$

Hence the fraction of cyclizable fragments becomes

or, if equation (7) is valid:

$$C = \gamma[1 - l'/\bar{g}][2(r - b_0)/\bar{s}]; \tag{26a}$$

or generally

$$R = \epsilon\gamma^*[1 - l'/g^*][2(r - b_0)/\bar{s}]. \tag{26b}$$
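The factor $2(r - b_0)/\bar{s}$ can be compared with the sum over windows that it approximates. The Python sketch below does this for one assumed case: a uniform weight distribution $S(s)$ between two limits, for which $\bar{s}$ follows from equation (25) in closed form. The uniform distribution, the parameter values and the function names are illustrative assumptions rather than anything taken from the data.

```python
import math


def uniform_weight_cdf(s, s_min, s_max):
    """Cumulative weight fraction of repeats up to length s for a uniform S(s)."""
    if s <= s_min:
        return 0.0
    if s >= s_max:
        return 1.0
    return (s - s_min) / (s_max - s_min)


def exact_window_sum(l, r, b0, s_min, s_max):
    """Sum over n of the integral of S(s) between (l-2r+b0)/n and (l-b0)/n."""
    total = 0.0
    n = 1
    while (l - b0) / n > s_min:        # later windows lie wholly below s_min
        low = (l - 2 * r + b0) / n
        high = (l - b0) / n
        total += (uniform_weight_cdf(high, s_min, s_max)
                  - uniform_weight_cdf(low, s_min, s_max))
        n += 1
    return total


def case_a_estimate(r, b0, s_min, s_max):
    """2(r - b0)/s_bar, with s_bar from equation (25) for the uniform S(s)."""
    s_bar = (s_max - s_min) / math.log(s_max / s_min)
    return 2 * (r - b0) / s_bar


# Assumed case: l = 100000, r = 100, b0 = 33, repeats spread evenly over 800-1200.
exact = exact_window_sum(100_000, 100, 33, 800, 1200)
estimate = case_a_estimate(100, 33, 800, 1200)
print(f"window sum = {exact:.4f}, 2(r - b0)/s_bar = {estimate:.4f}")
```

The two numbers differ by a small amount that comes from the sharp edges of the assumed distribution; a smoother $S(s)$ or a longer fragment would be expected to narrow the gap.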

Case B, (1 small and T large compared to most values of s). In this case Y is large enough that all the heavy lines in Figure 5 overlap. This occurs when 2(r - b,) > (I - b,)/2.

cw

Now all fragments whose s values are less than $l - b_0$ are cyclizable if they lie completely within a single tandemly-repeating region. In this case the separate integrals of equation (22) coalesce into one, so that we get in place of equation (Xl), the following:

$$C = \gamma\,(1 - l'/\bar{g}) \int_0^{\,l - b_0} S(s)\,ds. \tag{28}$$

To obtain an expression for S(s), we differentiate and rearrange:

$$S(l - b_0) = \frac{dC/dl}{\gamma\,(1 - l'/\bar{g})} + \frac{C}{\gamma\,\bar{g}\,(1 - l'/\bar{g})^{2}}. \tag{29}$$

† Notice that in the tandemly-repeating case when the repeating unit, s, is non-repeating, the integration must be extended to l, not l'. This causes a slight change in the meaning of $\gamma^{*}$, $g^{*}$, etc., which we ignore.


Presumably the last term is small, since g is much larger than l in this case. Also, since $l \gg b_0$, equation (29) simplifies to

$$S(l) = \frac{1}{\gamma}\,\frac{dC}{dl}. \tag{30}$$
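As a rough illustration of how equation (30) might be applied to measured data, the sketch below estimates S(l) by taking the finite-difference slope of the ring-frequency curve and dividing by γ. The function name, the sample points, and the value of γ are hypothetical; real data would also need smoothing, which is not attempted here.

```python
def repeat_size_distribution(lengths, ring_fraction, gamma):
    """Estimate S(l) = (1/gamma) * dC/dl at the interior data points.

    Uses a central difference, so the first and last points are skipped.
    Returns a list of (l, S) pairs.
    """
    estimates = []
    for i in range(1, len(lengths) - 1):
        slope = (ring_fraction[i + 1] - ring_fraction[i - 1]) / (
            lengths[i + 1] - lengths[i - 1]
        )
        estimates.append((lengths[i], slope / gamma))
    return estimates


# Hypothetical rising limb of a ring-frequency curve: C climbs linearly
# from 0 to gamma as l goes from 1 to 3 (arbitrary length units), which
# corresponds to a uniform S(s) of 0.5 per unit length on that interval.
gamma = 0.2
lengths = [1.0, 1.5, 2.0, 2.5, 3.0]
ring_fraction = [gamma * (l - 1.0) / 2.0 for l in lengths]
s_dist = repeat_size_distribution(lengths, ring_fraction, gamma)
```

For this synthetic curve each interior estimate comes out close to 0.5, the uniform density assumed in building the data.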

The rise of C with l at short lengths (short-side fall-off) thus reflects the distribution of sizes of repeating units S(s). From equation (30) the weight average of s, denoted by $s_w$, is easily obtained. The definition of $s_w$ is

$$s_w = \int_{\text{all } s} s\,S(s)\,ds. \tag{31}$$

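If equation (30) is substituted in equation (31) and the point of maximum C ($l = l_m$ and $C = C_m$) is taken as the limit of integration, we find

$$s_w = \frac{1}{\gamma}\int_0^{\,l_m} l\,\frac{dC}{dl}\,dl \tag{32a}$$

$$\phantom{s_w} = \frac{1}{\gamma}\int_0^{\,C_m} l\,dC. \tag{32b}$$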
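The integral of l dC up to the maximum of the ring-frequency curve can be evaluated numerically from tabulated points. The sketch below is a minimal illustration, assuming that γ is already known and that the weight average follows from equations (30) and (31) as (1/γ) times that integral; the function name and the synthetic data are hypothetical.

```python
def weight_average_repeat(lengths, ring_fraction, gamma):
    """Estimate s_w = (1/gamma) * integral of l dC, taken from the start
    of the curve to the first point at which C reaches its maximum."""
    peak = max(range(len(ring_fraction)), key=lambda i: ring_fraction[i])
    area = 0.0
    for i in range(1, peak + 1):
        d_c = ring_fraction[i] - ring_fraction[i - 1]
        l_mid = 0.5 * (lengths[i] + lengths[i - 1])  # trapezoid rule in l
        area += l_mid * d_c
    return area / gamma


# Synthetic curve: S(s) uniform between 1 and 3 (arbitrary length units),
# so C rises linearly from 0 to gamma over that range and then stays flat.
gamma = 0.2
lengths = [0.25 * i for i in range(17)]
ring_fraction = [gamma * min(1.0, max(0.0, (l - 1.0) / 2.0)) for l in lengths]
s_w_estimate = weight_average_repeat(lengths, ring_fraction, gamma)
```

For this synthetic curve the estimate is close to 2, the midpoint of the assumed uniform distribution. The long-side decrease of a real curve is ignored because the sum stops at the maximum.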

This last equation (32b), which is useful for quick estimation of the weight average repeat distance, may be most conveniently applied in the form

Here the quantity $l_m C_m$ is the area of the rectangle bounded by the axes and whose upper right corner is at the maximum of the ring-frequency curve, and the integral is the area between the vertical axis and this curve from its beginning to its maximum.

We are now ready to turn to the last simplifying case, case $B_2$ (l large compared to all values of s for which S(s) is significant, and r large, obeying equation (27)). This case is similar to case A, except that now the upper limit of the integral leading to equation (26a) is larger than all values of s for which S(s) is non-zero, so that

$$\int_0^{\,l - b_0} S(s)\,ds = \int_0^{\infty} S(s)\,ds = 1. \tag{33}$$

Thus, equation (28) becomes

C = Wj;

G(d[g - 0%

(4)

Thus we have equation (4) again; the situation leads to the simplest case of all. If we can make the same approximation as in equation (5) we can now determine $\gamma$ and, by the use of equation (26a) or (32a), $\bar{s}$ or $s_w$.

The above results apply almost unchanged even if the exact tandem repeat model is modified as shown in Figure 2. If, for example, occasional local variations are introduced to modify the exact repeat, the melting point of the reannealed helices will be lowered slightly, but the expected value of the ring frequency does not change. In the fractional tandem model, where only a fraction, a, of s is repeating and the remainder is a unique sequence, the effect is to lower the ring frequency when r is small (case A) by a multiplication by a. This result is not exact, but the error is usually small. If r > s the introduction of unique DNA into s has no effect at all.


3. Conclusion

This long and rather abstract development has been guided by a critically important observation: the experimental curves of ring frequency against fragment length show a maximum with decrease in ring frequency on both the long and the short sides. The long-side decrease implies the clustering of the repeating segments of any given type into regions which we have called g-regions. The treatment of this long-side decrease in terms of the number of ways in which a fragment of given length can be broken from the parent chromosome and still have both its ends in the same g-region was the subject of the first part of this paper. From the shape of the fall-off of the ring-frequency curve some information about the average size and the distribution of sizes of the g-regions can be obtained.

The decrease on the short side of the maximum, on the other hand, is an indication that the repeating segments in each g-region cannot be randomly spaced. We have examined two models with regular spacing; in one of these the repeating units are assumed to have a very simple copolymeric structure to maximize the probability of ring formation; in the other the structure within the repeating unit was presumed to be unique so that complementary sequences could be matched in only one way. Means of estimating the average spacing of the repeating units were derived for both models.

This by no means exhausts the variety of possible models or the calculations that might be based on them. We feel that more definitive experimental information should be available before more extensive discussion is warranted.

We thank Dr Lynn Klotz for his interest in this analysis and for his helpful suggestions at the early stages of its development. We also thank Dr L. M. Okun for his contributions to the tandem-repetition case. Our work has been supported by the National Institutes of Health (grants nos GM 11916 and AI08186) and the National Science Foundation (grant no. GB31118X). One of us (B. M. D.) was supported by a National Institutes of Health Postdoctoral Fellowship (no. 1-F02-GM49,155).

REFERENCES

Callan, H. G. & Lloyd, L. (1960). Phil. Trans. Roy. Soc. B243, 135.
Lee, C. S. & Thomas, C. A., Jr (1973). J. Mol. Biol. 77, 25.
Pyeritz, R. E. & Thomas, C. A., Jr (1973). J. Mol. Biol. 77, 57.