Enzymatic break-up of polypeptides as a stochastic process

J. Theoret. Biol. (1967) 17,282-303

JOHN M. BLATT†

Weizmann Institute, Rehovoth, Israel

(Received 5 January 1967, and in revised form 5 June 1967)

We present a mathematical study of the enzymatic break-up of artificial polypeptide chains, containing only one amino acid, but that one in both L and D stereochemical forms. The enzyme attacks only LL bonds, not LD, DL, or DD bonds. Furthermore, the enzyme appears to be sensitive to a whole region of the molecule in the neighbourhood of the bond which is to be broken; e.g. the LL bond in the sequence DLLD is not broken by the enzyme. If the enzymatic reaction is allowed to go to completion, the weight distribution of the final fragments obtained gives information about (i) the breaking rules, including the "recognition length", and (ii) the building up of long chains containing both L and D residues. These points are illustrated in the present paper by a full discussion of a particular set of breaking rules and build-up rules. Sections 1 to 7 contain a discussion of the case of infinitely long chains. This restriction is removed in the work of Sections 8 to 11. The mathematical methods used are standard methods of stochastic theory, but the results are new.

† This work was done while the author was on Study Leave from the University of New South Wales, Kensington, N.S.W., Australia.

1. Introduction

Enzymes which break up long protein molecules do so by catalyzing chemical reactions which break bonds between adjacent amino acids in the protein chain. In general, a given enzyme breaks certain bonds, and not certain others. These "specificity rules" are often studied by means of artificial peptide chains of low molecular weight, for example, "chains" consisting of just two amino acids. Evidence is accumulating, however, that the enzyme "recognizes", and reacts differently to, much more than just the two amino acids immediately adjacent to the bond to be broken. The recognition length seems to be five amino acids in one case, even more in some others.

Since natural proteins are exceedingly complicated systems, a series of recent studies (Berger, private communication) has been concerned with enzymatic break-up of artificial macro-molecules, made up from just one amino acid, but that one in both the L and D stereochemical varieties. Since the enzyme, a natural substance, is purely L, it does not "like" D amino acids, and does not break DD, LD, or DL bonds. It does break LL bonds. But if a D amino acid occurs within the "recognition length" the enzyme recognizes the interloper and retires sulkily, refusing to break the LL bond if there is a D lurking in the neighbourhood. By this means, it is possible in principle to determine the "recognition length" of the enzyme in question.

The weight distribution of the fragments resulting from the enzyme acting on a statistical distribution of chain molecules with a known average L : D ratio depends on a number of factors:

(i) the breaking rules for the particular enzyme;
(ii) the distribution of L and D along the chain molecules;
(iii) the average L : D ratio;
(iv) the distribution of chain lengths in the sample, before action of the enzyme.

By studying the weight distribution of the fragments, one may hope to obtain information about these various factors. In practice, the breaking rules are determined best from study of short chains of known composition; the details of the breaking rules tend to get "washed out" when the enzyme is let loose on a statistical distribution of long chains. But there are certain questions which can be answered only by studying long chains. The most obvious of these is whether the breaking rules, determined from studies of short chains, are still applicable when the enzyme acts on long chains. A less obvious, but equally important, question concerns the structure of the long chains. During the build-up of the long chains, successive amino acids, each of which may be L or D, are attached to the free end of the chain (the two ends are not equivalent, and build-up occurs only at one end, not at the other). It is at least possible that there is a preference for LL and DD bonds, as against LD or DL bonds, during this build-up process, so that strings of all L and strings of all D are more probable than they would be on a purely random drawing of L's and D's from an urn. Since the enzyme dislikes all D's, the enzymatic action becomes much more efficient if the D's are concentrated into tight strings, rather than being distributed randomly all through the chain. In this way, the weight distribution of the final fragments reflects the way in which the chain was built up originally, i.e. it reflects the stereo-chemical specificity of the bond between amino acids.

The full discussion of the problem in probability theory posed by this situation is quite lengthy and intricate. It is the subject of two papers, of which this paper is the first. In this first paper, we shall define some basic concepts, and we shall give a solution for a particular, simple set of breaking rules and assumptions about the build-up process. This illustrates the sort of thing one encounters. One assumption, that of essentially infinite chain length, is relaxed in Sections 8 to 11 of this paper. The second paper, more mathematical in nature, will deal with the problem in its full generality, to the extent that we have been able to solve it so far.

The simplifying assumptions made in Sections 1 to 7 are as follows. First, we ignore end effects, i.e. we assume initial chains of substantially infinite length. Second, we allow for no more than nearest-neighbour correlation in the build-up pattern of the chains. That is, let $\lambda$ and $\delta$ be the proportions of L and D, respectively,

$$\lambda + \delta = 1, \qquad \lambda > 0, \qquad \delta > 0. \tag{1.1}$$

If there were no correlation during build-up, the probability of finding a pair of adjacent amino acids to be both L would be $\lambda^2$, to be both D would be $\delta^2$, to be LD would be $\lambda\delta$, with the same value for DL. We define the "inhibition factor" $h$ to be the ratio

$$h = \frac{\text{Probability of finding a pair to be DL}}{\delta\lambda}. \tag{1.2}$$

As we shall see, this one parameter fixes the entire nearest-neighbour correlation pattern, and we assume no more than nearest-neighbour memory in what follows.

Further simplifying assumptions are related to the "breaking rules" of the enzymes. We assume that breaks occur at places in the original chain where there were at least three successive L's, at all those places, and at no other places. This implies a "recognition length" of five for the enzyme. The fragments which emerge are of two types, D-containing fragments and pure-L fragments. We shall assume that the enzyme can break off L's from the outside more easily than L's enclosed between D's. In particular, we assume that outer L's are broken off in succession until there is exactly one L remaining at the left end of the "D-string" and exactly two L's at the right end (the two sides of a molecule are distinguishable). Fragments consisting entirely of L's are broken down to very small pieces, namely to pieces of lengths 1, 2, and 3 only.† With the rules as stated, the smallest D-containing fragment is LDLL, which has length 4. A simple chemical weight determination therefore suffices to separate L-fragments from D-fragments, unambiguously.

† In Sections 8 to 11 we shall assume that pure-L fragments are always broken down completely, to single L's.


The L-fragments arise from places in the original chain with 4 or more consecutive L's. In this section, we shall be content with deriving the weight fraction of the original chain which eventually emerges in the form of L-fragments (see the footnote above). When it comes to D-fragments, it is possible to obtain, without further assumptions, the full length distribution, and it is the purpose of this paper to do so. We define the "D-length" of a D-containing fragment to be its length exclusive of exterior L's (in practice, the chemical weight of the fragment minus three). Let $P_n$ be the probability of finding a fragment of D-length $n$, among all the D-containing fragments. Let $W_k$ be the weight fraction of the original chain which eventually emerges in the form of fragments of total length $k$. For $k \ge 4$, these are D-containing fragments, and $W_k$ is related to $P_{k-3}$. For $1 \le k \le 3$, these are L-fragments, and we shall determine the sum $W_1 + W_2 + W_3$, but not the separate weight fractions.

2. Conditional Probabilities

A long chain may be thought of as consisting of a string of consecutive L's, followed by a string of consecutive D's, followed in turn by a string of consecutive L's, and so on (any one of these "strings" may, of course, be of length 1). As we walk along the chain, we encounter exactly as many changes L-to-D as changes D-to-L. Thus, if we pick a pair of nearest neighbours in the chain at random, the probability that this is a pair LD equals the probability that it is a pair DL. Let this probability be called $\omega$; then

$$\omega = \delta\lambda \quad \text{for no correlation} \tag{2.1}$$

and, in general,

$$\omega = h\delta\lambda, \tag{2.2}$$

where $h$ is the "inhibition factor" introduced in Section 1.

We now introduce conditional probabilities $\pi_{ij}$ as follows: we let $i = 1$ denote an L, $i = 2$ a D, in the left-hand position of the pair; $j = 1$ means an L, $j = 2$ means a D, in the right-hand position. The constituent on the left is assumed to be given, and $\pi_{ij}$ is the conditional probability of finding the constituent on the right of the pair to be as specified by the index $j$. For example, the probability of finding a randomly chosen pair of neighbours to be a pair LD, in that order, is the absolute probability of finding an L, i.e. $\lambda$, multiplied by the conditional probability $\pi_{12}$ that the right-hand neighbour of that L is a D. Thus we have

$$\omega = \lambda\pi_{12}. \tag{2.3}$$

Similarly, the probability of finding a randomly chosen pair of neighbours to be a pair DL, in that order, is the absolute probability of finding a D, i.e. $\delta$, multiplied by the conditional probability $\pi_{21}$ that the right-hand neighbour of the D is an L. Thus we have

$$\omega = \delta\pi_{21}. \tag{2.4}$$

Substitution of (2.2) into (2.3) and (2.4) yields

$$\pi_{12} = h\delta, \tag{2.5}$$

$$\pi_{21} = h\lambda. \tag{2.6}$$

Let us now consider the conditional probability $\pi_{11}$ of finding a given L neighboured by an L to its right. Given that the left-hand constituent of the pair is an L, the right-hand constituent must be something, either an L or a D, and thus $\pi_{11} + \pi_{12} = 1$. Thus we obtain

$$\pi_{11} = 1 - \pi_{12}, \tag{2.7}$$

and by an exactly analogous argument

$$\pi_{22} = 1 - \pi_{21}. \tag{2.8}$$
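Equations (2.5) to (2.8) translate directly into a small helper. The following is a minimal sketch and not part of the original paper: the function name `transition_probs`, the tuple-keyed dictionary, and the numerical values are illustrative assumptions, with index 1 standing for L and index 2 for D as in the text.

```python
def transition_probs(lam: float, h: float) -> dict:
    """Return pi[(i, j)] from equations (2.5)-(2.8); 1 = L, 2 = D.

    lam is the L proportion (lambda in the text), h the inhibition factor.
    The function name and return layout are illustrative assumptions.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    delta = 1.0 - lam                 # equation (1.1)
    pi12 = h * delta                  # equation (2.5)
    pi21 = h * lam                    # equation (2.6)
    if not (0.0 <= pi12 <= 1.0 and 0.0 <= pi21 <= 1.0):
        raise ValueError("h is too large for these proportions")
    return {(1, 1): 1.0 - pi12, (1, 2): pi12,     # (2.7), (2.5)
            (2, 1): pi21, (2, 2): 1.0 - pi21}     # (2.6), (2.8)


pi = transition_probs(lam=0.6, h=0.8)
# The two expressions for omega, (2.3) and (2.4), must agree.
omega_from_L = 0.6 * pi[(1, 2)]
omega_from_D = 0.4 * pi[(2, 1)]
```

With h = 1 the helper reproduces the uncorrelated case (2.1); values of h below 1 make LD and DL pairs rarer than in a random drawing.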

Equations (2.5) to (2.8) determine the entire set of conditional probabilities $\pi_{ij}$ in terms of the weight fractions $\delta$ and $\lambda$ and the one free parameter $h$.

3. D-strings

We introduce the concept of a "D-string of length $n$", to wit, a section contained in the original chain, of length $n$, with D's as exterior members at both ends, and containing no "triple-L" in its interior. If there is a D-string of length $n$, the fragment containing it after the enzyme process has run to completion will have a D-length greater than or equal to $n$, the actual D-length depending upon what is on either side of the $n$ constituents under consideration.

Let $p_n$ be the conditional probability that a segment of length $n + 3$, starting with the 4-sequence LLLD, is actually a triple-L followed by a D-string of length $n$. Such a segment can be obtained by going to the right along the chain until we find a sequence of three L's in a row followed by a D; we then go an additional $n - 1$ steps and demand that the last $n$ constituents, including the initial D, form a D-string. For the special case $n = 1$, we have no choice at all, i.e. a segment of length 4 starting with the 4-sequence LLLD is always a triple-L followed by a D-string of length 1. Thus

$$p_1 = 1. \tag{3.1}$$


The conditional probability $p_2$ is equal to the conditional probability of finding a D to the right of the given D, i.e.

$$p_2 = \pi_{22}. \tag{3.2}$$

To find $p_3$, we note that there are only two possible D-strings of length 3, namely DDD and DLD. Given that the first element is a D, the conditional probability of DDD is $(\pi_{22})^2$ and the conditional probability of DLD is $\pi_{21}\pi_{12}$. The two events in question are mutually exclusive, so that their probabilities add, hence

$$p_3 = (\pi_{22})^2 + \pi_{21}\pi_{12}. \tag{3.3}$$

The probability $P_n$ of finding a D-fragment of D-length $n$ (ignoring all L-fragments for the moment) is simply related to the conditional probability $p_n$. The desired event is a segment of length $n + 6$, with 3 consecutive L's on either side and a D-string of length $n$ in its interior. The 3 L's on the left and the D-string of length $n$ are included in the definition of $p_n$. We now have the additional requirement that the rightmost D of the D-string be followed by the sequence LLL. The probability of this is $\pi_{21}(\pi_{11})^2$, so that

$$P_n = \pi_{21}(\pi_{11})^2\, p_n. \tag{3.4}$$

Combining (3.1) and (3.4) we obtain

$$P_1 = \pi_{21}(\pi_{11})^2, \tag{3.5}$$

so that (3.4) can also be written in the form

$$P_n = P_1\, p_n. \tag{3.6}$$

Thus $P_n$, the desired probability, differs from $p_n$ only by a multiplicative constant equal to $P_1$, equation (3.5).

4. Recursion Relation

We now derive a recursion relation for $p_n$, assuming $n \ge 3$. A D-string of length $n + 1$ can be formed in one of three, mutually exclusive, ways and only in those ways:

(a) D-string of length $n$ followed by a single D, probability = $p_n\pi_{22}$;
(b) D-string of length $n - 1$ followed by LD, probability = $p_{n-1}\pi_{21}\pi_{12}$;
(c) D-string of length $n - 2$ followed by LLD, probability = $p_{n-2}\pi_{21}\pi_{11}\pi_{12}$.

Adding the probabilities of these mutually exclusive events, we obtain

$$p_{n+1} = \pi_{22}\,p_n + \pi_{21}\pi_{12}\,p_{n-1} + \pi_{21}\pi_{11}\pi_{12}\,p_{n-2}. \tag{4.1}$$

Equation (4.1) becomes meaningless for $n = 2$ and $n = 1$. However, if we make the purely formal definitions for $n = 0$ and $n = -1$,

$$p_0 = 0 \quad \text{and} \quad p_{-1} = 0, \tag{4.2}$$

then equation (4.1) can be used also for $n = 1$ and $n = 2$, and yields the correct values (3.2) for $p_2$ and (3.3) for $p_3$.

Equation (4.1) is a linear difference equation of order 3 for the unknown quantities $p_n$. Given three starting values, e.g. $p_{-1}$, $p_0$, and $p_1$ from equations (4.2) and (3.1), all subsequent $p_n$ can be determined by successive use of (4.1). According to (3.6), $P_n$ differs from $p_n$ only by a multiplicative constant, and therefore satisfies the same difference equation:

$$P_{n+1} = \pi_{22}\,P_n + \pi_{21}\pi_{12}\,P_{n-1} + \pi_{21}\pi_{11}\pi_{12}\,P_{n-2}, \tag{4.3}$$

with the initial conditions

$$P_{-1} = P_0 = 0, \qquad P_1 = \pi_{21}(\pi_{11})^2. \tag{4.4}$$

From the practical point of view of computation, especially on an electronic computer, equations (4.3) and (4.4) constitute a completely adequate solution. Since the right side of (4.3) is a sum of intrinsically positive terms, there is no loss of significant figures through cancellation, and the iteration can be continued until the probabilities become sufficiently small (say, less than one-millionth of $P_1$).
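The iteration described above can be sketched in a few lines. This is a minimal illustration rather than the author's program: the function name `d_length_distribution`, the default tolerance (taken from the "one-millionth of $P_1$" remark), and the parameter values are assumptions.

```python
def d_length_distribution(lam: float, h: float, tol: float = 1e-6) -> list:
    """Iterate (4.3) from the initial conditions (4.4).

    Returns [P_1, P_2, ...], stopping once a term is at or below tol * P_1.
    """
    delta = 1.0 - lam
    pi12, pi21 = h * delta, h * lam          # (2.5), (2.6)
    pi11, pi22 = 1.0 - pi12, 1.0 - pi21      # (2.7), (2.8)
    p_first = pi21 * pi11 ** 2               # P_1, equation (3.5)
    values = [0.0, 0.0, p_first]             # P_{-1}, P_0, P_1
    while values[-1] > tol * p_first:
        values.append(pi22 * values[-1]
                      + pi21 * pi12 * values[-2]
                      + pi21 * pi11 * pi12 * values[-3])
    return values[2:]


# Uncorrelated chain with equal proportions of L and D.
P = d_length_distribution(lam=0.5, h=1.0)
total = sum(P)   # close to 1 if the P_n are normalised
```

Every term added in the loop is non-negative, so the sum `total` approaches 1 from below as the tolerance is tightened.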

5. Analytic Solution of the Recursion Relation

The linear difference equation (4.3) has constant (independent of $n$) coefficients, and is thus solved in standard fashion by

$$P_n = A\alpha^n + B\beta^n + C\gamma^n, \tag{5.1}$$

where $A$, $B$, $C$ are constants to be determined, and $\alpha$, $\beta$, $\gamma$ are the three roots of the cubic equation

$$Q(x) = x^3 - \pi_{22}x^2 - \pi_{21}\pi_{12}x - \pi_{21}\pi_{11}\pi_{12} = 0. \tag{5.2}$$

The coefficients in this polynomial are the same coefficients which appear in (4.3). The three constants $A$, $B$, $C$ are determined by substituting (5.1) into the initial conditions (4.4), leading to the three linear equations

$$P_{-1} = \frac{A}{\alpha} + \frac{B}{\beta} + \frac{C}{\gamma} = 0, \tag{5.3}$$

$$P_0 = A + B + C = 0, \tag{5.4}$$

$$P_1 = A\alpha + B\beta + C\gamma = \pi_{21}(\pi_{11})^2. \tag{5.5}$$

These three equations can be solved immediately for $A$, $B$, $C$, yielding

$$A = \pi_{21}(\pi_{11})^2\,\frac{\alpha}{(\alpha - \beta)(\alpha - \gamma)}, \tag{5.6}$$

with corresponding expressions for $B$ and $C$, obtained by cyclic interchange of the three quantities $\alpha$, $\beta$, $\gamma$.


Thus the complete analytic solution for the probability of finding a D-containing fragment to have a D-length equal to $n$ is

$$P_n = \pi_{21}(\pi_{11})^2 \left[ \frac{\alpha^{n+1}}{(\alpha - \beta)(\alpha - \gamma)} + \frac{\beta^{n+1}}{(\beta - \gamma)(\beta - \alpha)} + \frac{\gamma^{n+1}}{(\gamma - \alpha)(\gamma - \beta)} \right], \tag{5.7}$$

where $\alpha$, $\beta$, $\gamma$ are the three roots of the cubic equation (5.2), and the $\pi_{ij}$ are given by (2.5) to (2.8).

Putting $x = 1$ in the polynomial $Q(x)$, and using (2.7) and (2.8) to simplify the result, we obtain

$$Q(1) = \pi_{21}(\pi_{11})^2. \tag{5.8}$$

Putting $x = \pi_{22}$, we obtain immediately

$$Q(\pi_{22}) = -\pi_{12}\pi_{21}(\pi_{11} + \pi_{22}). \tag{5.9}$$

Thus $Q(1)$ is positive, $Q(\pi_{22})$ is negative, and there must be a real root of the cubic between them. We call this real root $\alpha$, and thus have $\pi_{22} < \alpha < 1$.

[...]

$$P_n \cong \pi_{21}(\pi_{11})^2\,\frac{\alpha^{n+1}}{(\alpha - \beta)(\alpha - \gamma)} = \text{constant} \times \alpha^{n+1} \quad \text{for large } n. \tag{5.15}$$

Thus, for sufficiently large $n$, the $P_n$ form a geometric progression with ratio equal to $\alpha$, the positive root of the cubic equation (5.2).

† A more detailed discussion proves that, in the case of three real roots, the most negative root has absolute value less than $\alpha$, and in the case of complex conjugate $\beta$ and $\gamma$, their common absolute value is less than $\alpha$. It is in any case clear that negative or complex roots must not appear in the asymptotic result (5.15), since $P_n$ is an intrinsically positive quantity for all $n$, including very large $n$.
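The sign change of $Q(x)$ between $\pi_{22}$ and 1 suggests a simple numerical check of the asymptotic statement. The sketch below locates $\alpha$ by bisection on that interval and compares it with the ratio of successive $P_n$ from the recursion (4.3). Bisection is a choice made here for illustration, not a method taken from the paper, and the function names and parameter values are assumptions.

```python
def _pis(lam: float, h: float):
    """Return (pi11, pi12, pi21, pi22) from equations (2.5)-(2.8)."""
    delta = 1.0 - lam
    pi12, pi21 = h * delta, h * lam
    return 1.0 - pi12, pi12, pi21, 1.0 - pi21


def dominant_root(lam: float, h: float, steps: int = 200) -> float:
    """Bisect Q(x) of (5.2) on [pi22, 1], where Q changes sign."""
    pi11, pi12, pi21, pi22 = _pis(lam, h)

    def q(x: float) -> float:
        return x ** 3 - pi22 * x ** 2 - pi21 * pi12 * x - pi21 * pi11 * pi12

    low, high = pi22, 1.0                 # Q(low) < 0 < Q(high), see (5.8), (5.9)
    for _ in range(steps):
        mid = 0.5 * (low + high)
        if q(mid) < 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


def successive_ratio(lam: float, h: float, n: int) -> float:
    """Return P_{n+1} / P_n computed from the recursion (4.3)."""
    pi11, pi12, pi21, pi22 = _pis(lam, h)
    a, b, c = 0.0, 0.0, pi21 * pi11 ** 2  # P_{-1}, P_0, P_1
    for _ in range(n - 1):                # advance until c holds P_n
        a, b, c = b, c, pi22 * c + pi21 * pi12 * b + pi21 * pi11 * pi12 * a
    following = pi22 * c + pi21 * pi12 * b + pi21 * pi11 * pi12 * a
    return following / c


alpha = dominant_root(lam=0.5, h=1.0)
ratio = successive_ratio(lam=0.5, h=1.0, n=60)
```

For these parameter values the two numbers agree to many digits, which is what the geometric-progression statement leads one to expect.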


6. The Mean D-length

For further work, it is convenient to define and evaluate the generating function

$$V(z) = \sum_{n=1}^{\infty} P_n z^n. \tag{6.1}$$

Using standard methods, based on (5.6), (5.11), (5.12) and (5.13), this can be reduced to

$$V(z) = \pi_{21}(\pi_{11})^2\, z / S(z), \tag{6.2}$$

where

$$S(z) = 1 - \pi_{22}z - \pi_{21}\pi_{12}z^2 - \pi_{21}\pi_{11}\pi_{12}z^3. \tag{6.3}$$

This polynomial is related to the polynomial $Q(x)$ appearing in (5.2) by $S(z) = z^3 Q(1/z)$. By setting $z = 1$ in $V(z)$, we obtain the normalization constant

$$V(1) = \frac{\pi_{21}(\pi_{11})^2}{S(1)}. \tag{6.4}$$

It is an easy matter to verify, with the help of (2.7) and (2.8), that the denominator $S(1)$ in this expression equals the numerator, so that $V(1) = 1$, as it must be. We then obtain the average D-length $\bar{n}$ by differentiation:

$$\bar{n} = \sum_{n=1}^{\infty} n P_n = \left[\frac{dV}{dz}\right]_{z=1}. \tag{6.5}$$

Explicit differentiation together with use of (6.4) leads to

$$\bar{n} = 1 + \frac{\pi_{22} + 2\pi_{21}\pi_{12} + 3\pi_{21}\pi_{11}\pi_{12}}{\pi_{21}(\pi_{11})^2}. \tag{6.6}$$

This is the explicit expression for the average D-length of all the D-containing fragments.
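The mean D-length can be cross-checked numerically. The sketch below sums $n P_n$ directly from the recursion (4.3) and compares the result with $V'(1) = 1 - S'(1)/S(1)$, which follows from differentiating (6.2) and using $V(1) = 1$. That closed form is derived here from (6.2) and (6.3) and is not quoted from the paper; the function names and parameter values are assumptions.

```python
def _pis(lam: float, h: float):
    """Return (pi11, pi12, pi21, pi22) from equations (2.5)-(2.8)."""
    delta = 1.0 - lam
    pi12, pi21 = h * delta, h * lam
    return 1.0 - pi12, pi12, pi21, 1.0 - pi21


def mean_d_length_series(lam: float, h: float, tol: float = 1e-14) -> float:
    """Sum n * P_n directly, with P_n generated by the recursion (4.3)."""
    pi11, pi12, pi21, pi22 = _pis(lam, h)
    p_first = pi21 * pi11 ** 2
    a, b, c = 0.0, 0.0, p_first           # P_{-1}, P_0, P_1
    n, mean = 1, 0.0
    while c > tol * p_first:
        mean += n * c                     # c holds P_n at this point
        a, b, c = b, c, pi22 * c + pi21 * pi12 * b + pi21 * pi11 * pi12 * a
        n += 1
    return mean


def mean_d_length_closed(lam: float, h: float) -> float:
    """Evaluate V'(1) = 1 - S'(1)/S(1) for S(z) of equation (6.3)."""
    pi11, pi12, pi21, pi22 = _pis(lam, h)
    s_at_1 = 1.0 - pi22 - pi21 * pi12 - pi21 * pi11 * pi12
    s_prime_at_1 = -pi22 - 2.0 * pi21 * pi12 - 3.0 * pi21 * pi11 * pi12
    return 1.0 - s_prime_at_1 / s_at_1


series_value = mean_d_length_series(lam=0.5, h=1.0)
closed_value = mean_d_length_closed(lam=0.5, h=1.0)
```

For the uncorrelated chain with equal proportions both routes give a mean D-length of 12.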

7. The Average Weight of the L-fragments. Weight Fractions

Before we can convert the results obtained so far to weight fractions, we need to say something about the fractional weight of the original chain tied up in pure-L fragments. This is not completely trivial, since our method of treating the D-containing fragments has given us no information about how many L's are tied up, on the average, inside the D-strings of the D-fragments. We must therefore treat the L-fragments ab initio.

The sections of the original chain which are going to form D-containing fragments consist of D-strings separated by segments of at least three L's in a row. If the separating segment, at which the enzyme breaks the chain, consists of exactly three L's, these three L's are all tied up in D-containing fragments: one L adheres to the left end of the right D-fragment, the other two L's adhere to the right end of the left D-fragment. Thus, in order to have any L's available for formation of L-fragments, we need L-gaps of length greater than or equal to four. An L-gap of length $l$ gives rise to a number of "available L's" equal to $l - 3$.

Let $q_l$ be the conditional probability of finding an L-gap of length $l$, given that there is an L-gap at all (i.e. given that there are at least three L's in a row). The first of these probabilities, $q_3$, is equal to the conditional probability of having a D immediately to the right of our three L's, i.e.

$$q_3 = \pi_{12}. \tag{7.1}$$

The next probability, $q_4$, is the conditional probability that our sequence of three L's is followed by LD, in that order, hence

$$q_4 = \pi_{11}\pi_{12}. \tag{7.2}$$

Similarly, $q_5$ is the probability that our three L's are followed by the sequence LLD, so that

$$q_5 = (\pi_{11})^2\pi_{12}. \tag{7.3}$$

Continuing this argument, we obtain the general formula

$$q_l = (\pi_{11})^{l-3}\pi_{12}. \tag{7.4}$$

The average number of available L's per gap is the expectation value of $l - 3$, i.e.

$$\bar{t} = \sum_{l=3}^{\infty} (l - 3)\,q_l = \pi_{12}\sum_{l=3}^{\infty} (l - 3)(\pi_{11})^{l-3}. \tag{7.5}$$

If we now use (2.7), we obtain the very simple result

$$\bar{t} = \frac{\pi_{11}}{\pi_{12}} = \frac{1 - h\delta}{h\delta}. \tag{7.6}$$

This is the average number of available L's in the L-gaps, and is therefore directly proportional to the weight fraction of the original chain tied up in pure-L fragments. The weight fraction tied up in D-containing fragments is proportional to $\bar{n} + 3$, where $\bar{n}$ is the average D-length (6.6); the additional 3 represents the three exterior L's attached to every D-containing fragment. We note that there is one L-gap for every D-fragment, and vice versa. Thus the weight fraction of the original chain tied up in pure-L fragments is

$$W_1 + W_2 + W_3 = \frac{\bar{t}}{\bar{t} + \bar{n} + 3} = \lambda(1 - h\delta)^3. \tag{7.7}$$

Fragments of weight $k \ge 4$ are D-containing fragments. The probability of finding a D-containing fragment of weight $k$ among all D-containing fragments is $P_{k-3}$, where $n = k - 3$ is the D-length of the fragment. The weight fraction contributed by D-containing fragments of length $k$ is

$$W_k = \frac{k\,P_{k-3}}{\bar{t} + \bar{n} + 3} \qquad (k \ge 4).$$
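The weight fractions of this section can be assembled numerically from quantities already defined. The following is a minimal sketch under stated assumptions: the mean D-length is obtained by direct summation of $n P_n$ from the recursion (4.3), the symbol names `t_bar` and `n_bar` mirror the averages in the text, and the function name `weight_fractions` and the parameter values are illustrative.

```python
def weight_fractions(lam: float, h: float, tol: float = 1e-14):
    """Return (W_1 + W_2 + W_3, {k: W_k for k >= 4}) for an infinite chain."""
    delta = 1.0 - lam
    pi12, pi21 = h * delta, h * lam
    pi11, pi22 = 1.0 - pi12, 1.0 - pi21
    p_first = pi21 * pi11 ** 2

    # Collect P_1, P_2, ... from the recursion (4.3).
    probs = []
    a, b, c = 0.0, 0.0, p_first
    while c > tol * p_first:
        probs.append(c)
        a, b, c = b, c, pi22 * c + pi21 * pi12 * b + pi21 * pi11 * pi12 * a

    n_bar = sum(n * p for n, p in enumerate(probs, start=1))   # mean D-length
    t_bar = pi11 / pi12                                        # equation (7.6)
    period = t_bar + n_bar + 3.0       # mean weight per (L-gap, D-fragment) pair
    w_pure_l = t_bar / period                                  # equation (7.7)
    w_d = {n + 3: (n + 3) * p / period
           for n, p in enumerate(probs, start=1)}              # W_k, k = n + 3
    return w_pure_l, w_d


w_pure_l, w_d = weight_fractions(lam=0.6, h=0.8)
closed_form = 0.6 * (1.0 - 0.8 * 0.4) ** 3     # lambda * (1 - h*delta)**3
grand_total = w_pure_l + sum(w_d.values())     # all weight fractions together
```

The pure-L weight fraction computed this way agrees with the closed form on the right of (7.7), and all weight fractions together add up to unity within the truncation error.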

8. Introduction and Notation

The remainder of this paper is concerned with the end-effects which arise because the actual initial chain has some finite length, $N$. In practice, $N$ ranges from not much more than 10 to at most several hundred, depending upon how the chains are prepared.

Given an initial chain of length $N$, containing D's and L's in proportion $\delta$ and $\lambda$ respectively, $\delta > 0$, $\lambda > 0$, $\delta + \lambda = 1$, let us define $R_{N,n}$ to be the expected number of fragments of length $n$. Clearly this number vanishes for $n > N$. Since each fragment of length $n$ contributes a chemical weight proportional to $n$, the weight fraction $W_{N,n}$ of fragments of weight $n$ arising from a chain of weight $N$ is

$$W_{N,n} = \frac{n R_{N,n}}{N}. \tag{8.1}$$

Since the weight fractions must add up to unity, we have the normalization condition

$$\sum_{n=1}^{N} n R_{N,n} = N. \tag{8.2}$$

To illustrate the concepts involved, let us consider the break-up of chain molecules consisting of $N = 3$ constituents. Chains starting with a D are not broken at all, since even the molecule DLL is not broken up by the enzyme (it requires at least three L's to the right of a D to make a break, i.e. DLLL breaks into DLL + L, but DLL is stable). Chains starting with LD, i.e. the two molecules LDD and LDL, are also unbroken. There are only two more possibilities, namely LLD and LLL. LLD is broken into L + LD, and LLL is broken into L + L + L.† Thus the pattern, with associated probabilities, is:

Chains starting with D: probability = $\delta$; not broken
Chains starting with LD: probability = $\lambda\pi_{12}$; not broken
LLD: probability = $\lambda\pi_{11}\pi_{12}$; broken into L + LD
LLL: probability = $\lambda(\pi_{11})^2$; broken into L + L + L
(8.3)

† The original enzyme fails to do this breaking, but we shall now assume that a second enzyme is employed after the first one, the second enzyme having the property of peeling off all the left-most L's with the exception of the last L before a D. In this section, we shall assume that the enzymatic break-up is produced by a combination of these two enzymes, and we shall not attempt to calculate what happens if only the first enzyme is employed. The more advanced methods of paper 2 are needed for a discussion of the latter case.
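The two-enzyme break-up of a finite chain can be written as a short routine. The sketch below rests on an assumption that is inferred from the examples in the text rather than stated there in this form: an LL bond is broken unless its left-hand L directly follows a D. The names `break_chain` and `chain_probability` and the parameter values are illustrative.

```python
from itertools import product


def break_chain(chain: str) -> list:
    """Split a chain of 'L'/'D' characters into its final fragments.

    Assumed rule (inferred from the examples in the text): an LL bond is
    broken unless its left-hand L directly follows a D.
    """
    if not chain:
        return []
    fragments, start = [], 0
    for p in range(len(chain) - 1):
        is_ll_bond = chain[p] == "L" and chain[p + 1] == "L"
        if is_ll_bond and (p == 0 or chain[p - 1] == "L"):
            fragments.append(chain[start:p + 1])
            start = p + 1
    fragments.append(chain[start:])
    return fragments


def chain_probability(chain: str, lam: float, h: float) -> float:
    """Probability of a given chain under nearest-neighbour correlation."""
    delta = 1.0 - lam
    step = {"LL": 1.0 - h * delta, "LD": h * delta,
            "DL": h * lam, "DD": 1.0 - h * lam}
    prob = lam if chain[0] == "L" else delta      # first constituent
    for left, right in zip(chain, chain[1:]):
        prob *= step[left + right]
    return prob


# Reproduce the N = 3 pattern: every chain, its fragments, its probability.
all_chains = ["".join(c) for c in product("LD", repeat=3)]
pattern = {c: break_chain(c) for c in all_chains}
total_probability = sum(chain_probability(c, 0.6, 0.8) for c in all_chains)
```

For $N = 3$ the routine leaves all chains starting with D or LD intact and splits only LLD and LLL, in line with the pattern (8.3).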


The expected number of fragments of length 1 is then given by

$$R_{3,1} = \lambda\pi_{11}\pi_{12} + 3\lambda(\pi_{11})^2, \tag{8.4}$$

where the first term represents the L from LLD, the second term represents the three L's from LLL. The expected numbers of fragments of lengths 2 and 3 are, similarly,

$$R_{3,2} = \lambda\pi_{11}\pi_{12}, \tag{8.5}$$

$$R_{3,3} = \delta + \lambda\pi_{12}. \tag{8.6}$$

It should be noted that the $R_{N,n}$ are not probabilities, i.e. they do not add up to 1. Rather, the normalization is equation (8.2). It is easily verified that (8.4), (8.5), and (8.6) lead to the identity:†

$$R_{3,1} + 2R_{3,2} + 3R_{3,3} = 3. \tag{8.7}$$

† To verify (8.7), use equations (1.1), (2.7) and (2.8).

We also note that the fragments of length 2, i.e. LD, and the various fragments of length 3 which arise in this particular break-up are all "non-standard", i.e. none of them can arise from an infinite chain. The break-up of an infinite chain by our two enzymes leads to pure-L fragments, all of which are of length 1, and various D-containing fragments, the smallest of which is LDLL, of length 4. Thus, under our present assumptions, any observed fragments of lengths 2 and 3 arise from the ends of finite chains.

For later use, we define the more specific quantities $R^{(i)}_{N,n}$ to be the expected numbers of fragments of length $n$, arising from the break-up of a chain of length $N$ which starts with a given constituent $i$, where $i = 1$ means L, $i = 2$ means D. Referring to the pattern (8.3) we obtain

$$R^{(1)}_{3,1} = \pi_{11}\pi_{12} + 3(\pi_{11})^2, \qquad R^{(2)}_{3,1} = 0, \tag{8.8}$$

$$R^{(1)}_{3,2} = \pi_{11}\pi_{12}, \qquad R^{(2)}_{3,2} = 0, \tag{8.9}$$

$$R^{(1)}_{3,3} = \pi_{12}, \qquad R^{(2)}_{3,3} = 1. \tag{8.10}$$

These are connected with the desired $R_{N,n}$ through

$$R_{N,n} = \lambda R^{(1)}_{N,n} + \delta R^{(2)}_{N,n}. \tag{8.11}$$

Each $R^{(i)}_{N,n}$ is separately normalized by

$$\sum_{n=1}^{N} n R^{(i)}_{N,n} = N. \tag{8.12}$$

As a further preparatory step, let us calculate $R^{(i)}_{N,N}$ for $i = 1$, $i = 2$, and various values of $N$. We start with $i = 2$, i.e. chains of length $N$ starting with a D. For $N = 1$, 2, and 3, such chains are not broken at all, and we obtain

$$R^{(2)}_{N,N} = 1 \quad \text{for } N = 1, 2, 3. \tag{8.13}$$
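The identity (8.7) is easy to confirm numerically from (8.4) to (8.6). The sketch below does so for a small grid of parameters; the function name `expected_counts_n3` and the chosen grid are illustrative assumptions.

```python
def expected_counts_n3(lam: float, h: float):
    """Return (R_31, R_32, R_33) from equations (8.4), (8.5), (8.6)."""
    delta = 1.0 - lam
    pi12 = h * delta                      # equation (2.5)
    pi11 = 1.0 - pi12                     # equation (2.7)
    r1 = lam * pi11 * pi12 + 3.0 * lam * pi11 ** 2
    r2 = lam * pi11 * pi12
    r3 = delta + lam * pi12
    return r1, r2, r3


r1, r2, r3 = expected_counts_n3(lam=0.6, h=0.8)
weighted_total = r1 + 2.0 * r2 + 3.0 * r3     # left side of identity (8.7)
plain_total = r1 + r2 + r3                    # not 1: these are not probabilities

# Largest deviation from 3 over a small grid of (lam, h) values.
worst = max(abs(sum(k * r for k, r in enumerate(expected_counts_n3(l, hh), 1)) - 3.0)
            for l in (0.2, 0.5, 0.8) for hh in (0.3, 0.7, 1.0))
```

The weighted sum equals 3 for every parameter pair in the grid, while the unweighted sum of the three counts differs from 1, as the text points out.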


We get one fragment of size $N$ if the chain is unbroken, no fragment of size $N$ if there is any break at all. Thus $R^{(2)}_{N,N}$ is equal to the probability of no break occurring. For a chain of general length $N$, starting with a D, the configurations which cannot be broken are of three types:

(i) a D-string of length $N$, probability = $p_N$;
(ii) a D-string of length $N - 1$, followed by an L, probability = $p_{N-1}\pi_{21}$;
(iii) a D-string of length $N - 2$, followed by LL, probability = $p_{N-2}\pi_{21}\pi_{11}$.

Adding the probabilities of these three mutually exclusive events, we obtain

$$R^{(2)}_{N,N} = p_N + p_{N-1}\pi_{21} + p_{N-2}\pi_{21}\pi_{11}. \tag{8.14}$$

There is an alternative way of calculating $R^{(2)}_{N,N}$ which must lead to the same result. A chain starting with a D is either broken somewhere, or it is not. If the position of the left-most break is after exactly $k$ constituents, the corresponding fragment has D-length $k - 2$; the probability of such a fragment is given by (3.4):

$$P_{k-2} = \pi_{21}(\pi_{11})^2\, p_{k-2}. \tag{8.15}$$

The left-most break may be after 3, 4, 5, ..., $N - 1$ constituents, and the remaining probability is the probability of no break at all; thus we obtain

$$R^{(2)}_{N,N} = 1 - \sum_{k=3}^{N-1} P_{k-2} = 1 - \sum_{n=1}^{N-3} P_n \quad \text{for } N \ge 4. \tag{8.16}$$

The fact that (8.14) and (8.16) are identical follows, after some calculation, from (8.15) and the recursion relation (4.3). We shall not go through the proof, since the probability argument shows that the identity must hold.

The difference in (8.16) will recur frequently, and we therefore introduce the notation

$$Q_n = 1 - \sum_{k=1}^{n-1} P_k = \sum_{k=n}^{\infty} P_k. \tag{8.17}$$

$Q_n$ is the probability of finding, in an infinite chain, a segment of D-length equal to or greater than $n$, starting from a given D. In terms of the analytic solution (5.1), we obtain the closed formula

$$Q_n = \frac{A\alpha^n}{1 - \alpha} + \frac{B\beta^n}{1 - \beta} + \frac{C\gamma^n}{1 - \gamma} \quad \text{for } n = 2, 3, \ldots. \tag{8.18}$$

For convenience of notation, we also define

$$Q_n = 1 \quad \text{for } n = 1, 0, -1, -2, \ldots. \tag{8.19}$$

With this notation, we can rewrite (8.13) and (8.16) as

$$R^{(2)}_{N,N} = Q_{N-2}, \tag{8.20}$$

which is now valid for all $N = 1, 2, 3, 4, \ldots$.
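The agreement between the two routes to the no-break probability can be checked numerically without going through the algebra. The sketch below evaluates (8.14) and (8.20) side by side for a range of chain lengths; the helper names `model`, `q_tail`, `unbroken_direct` and `unbroken_via_q`, and the parameter values, are illustrative assumptions.

```python
def model(lam: float, h: float, n_max: int):
    """Return (pi11, pi21, P) with P[n] = P_n for 0 <= n <= n_max (n_max >= 1)."""
    delta = 1.0 - lam
    pi12, pi21 = h * delta, h * lam
    pi11, pi22 = 1.0 - pi12, 1.0 - pi21
    P = [0.0, pi21 * pi11 ** 2]                      # P_0 and P_1, see (4.4)
    for n in range(1, n_max):
        older = P[n - 2] if n >= 2 else 0.0          # P_{-1} = 0
        P.append(pi22 * P[n] + pi21 * pi12 * P[n - 1]
                 + pi21 * pi11 * pi12 * older)       # recursion (4.3)
    return pi11, pi21, P


def q_tail(P: list, n: int) -> float:
    """Q_n of (8.17), with Q_n = 1 for n <= 1 as in (8.19)."""
    return 1.0 if n <= 1 else 1.0 - sum(P[1:n])


def unbroken_direct(lam: float, h: float, N: int) -> float:
    """Equation (8.14): p_N + p_{N-1} pi21 + p_{N-2} pi21 pi11."""
    pi11, pi21, P = model(lam, h, N)

    def small_p(n: int) -> float:
        # p_n = P_n / P_1 by (3.6); p_0 = p_{-1} = 0 by (4.2)
        return P[n] / P[1] if n >= 1 else 0.0

    return small_p(N) + small_p(N - 1) * pi21 + small_p(N - 2) * pi21 * pi11


def unbroken_via_q(lam: float, h: float, N: int) -> float:
    """Equation (8.20): the no-break probability equals Q_{N-2}."""
    _, _, P = model(lam, h, N)
    return q_tail(P, N - 2)


largest_gap = max(abs(unbroken_direct(0.6, 0.8, N) - unbroken_via_q(0.6, 0.8, N))
                  for N in range(1, 13))
```

For the parameter values used here the two expressions agree to rounding error for every $N$ from 1 to 12.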


Next, we determine $R^{(1)}_{N,N}$, i.e. we consider chains starting with an L. If the second constituent of the chain is an L, a break occurs right there, and we certainly do not get an unbroken chain of length $N$. Thus, we must have a D as the second constituent, and the desired quantity is

$$R^{(1)}_{N,N} = \pi_{12} R^{(2)}_{N-1,N-1} = \pi_{12} Q_{N-3} \quad \text{for } N = 2, 3, 4, \ldots. \tag{8.21}$$

A chain of length 1 cannot be broken, so that

$$R^{(1)}_{1,1} = 1. \tag{8.22}$$

Using the general relation (8.11), we get

$$R_{1,1} = 1, \qquad R_{N,N} = \delta Q_{N-2} + \lambda\pi_{12} Q_{N-3}, \quad N = 2, 3, 4, \ldots. \tag{8.23}$$

This general formula for the expected number of completely unbroken chains of length $N$ checks for $N = 3$ with (8.6), as it must.

It is apparent that a detailed calculation of the sort we have been doing so far would be very lengthy for the general $R_{N,n}$. In Section 9, we shall derive a recursion formula for the $R^{(i)}_{N,n}$. We shall solve this recursion formula by means of generating functions, in Section 10; the solution will be discussed in Section 11.

9. Regeneration Point Method and Recursion Relations

In order to derive a recursion relation for the quantities $R^{(i)}_{N,n}$ we shall use the well-known "regeneration point method". The idea is quite simple. Given a chain of length $N$, we can label all the break-points, i.e. all bonds which will eventually be broken by the enzymatic action. Imagine this has been done. We now start from the left and walk along the chain until we come to the left-most break-point. Let this occur after the first $k$ constituents. We then have an unbreakable fragment of length $k$, followed by another chain of length $N - k$. If we know all the $R^{(1)}_{M,n}$ for $M < N$, we can thus determine $R^{(i)}_{N,n}$ for the given $N$. The only difficult case is the one where no break occurs at all in the original chain; but we have just determined $R^{(i)}_{N,N}$ explicitly for $i = 1$ and $i = 2$, see (8.20), (8.21) and (8.22), so this difficulty is already solved.

In order to carry through this approach, we require the conditional probability $\sigma^{(i)}_{N,k}$ that a chain of length $N$, starting with a constituent of type $i$, has its left-most break-point after the first $k$ constituents.† Let us determine these conditional probabilities. For $i = 2$, the chain starts with a D. The earliest break-point is then in position $k = 3$, arising from the possibility that the first four constituents are DLLL, in that order; the probability of this is just $P_1$, equation (3.5). Thus we have

$$\sigma^{(2)}_{N,3} = P_1 \quad \text{for } N = 4, 5, 6, \ldots. \tag{9.1}$$

† We emphasize that it is the left-most eventual break-point in which we are interested, not the first break-point in time during the enzymatic action.


The left-most break-point occurs in position $k = 4$ if and only if the chain starts with a D-string of length 2, followed by three L's in a row. The probability of this happening is just $P_2$. In general, for all $k$ between $k = 3$ and $k = N - 1$, we have

$$\sigma^{(2)}_{N,k} = P_{k-2} \quad \text{for } 3 \le k \le N - 1. \tag{9.2}$$

The case $k = N$ means no break-point at all, and this conditional probability is given by equation (8.20), i.e.

$$\sigma^{(2)}_{N,N} = Q_{N-2}, \quad N = 1, 2, 3, 4, \ldots. \tag{9.3}$$

The remaining $\sigma^{(2)}$ vanish, i.e.

$$\sigma^{(2)}_{N,1} = 0 \quad \text{for } N = 2, 3, 4, \ldots, \tag{9.4}$$

$$\sigma^{(2)}_{N,2} = 0 \quad \text{for } N = 3, 4, 5, \ldots. \tag{9.5}$$

It is intriguing that the probabilities calculated for an infinite chain, i.e. the $P_n$ and $Q_n$, come into the finite chain calculation in the guise of conditional probabilities for finding the left-most break-point after a certain number of constituents.

Turning now to chains starting with an L, i.e. $i = 1$, we get two separate situations depending upon whether the second constituent is L or D. If it is L, the first break-point is right there. Hence we have

$$\sigma^{(1)}_{N,1} = \pi_{11} \quad \text{for } N = 2, 3, 4, \ldots. \tag{9.6}$$

If the second constituent is a D, we require $k - 1$ unbroken constituents including this D in order to have the left-most break after $k$; thus

$$\sigma^{(1)}_{N,k} = \pi_{12}\,\sigma^{(2)}_{N-1,k-1} \quad \text{for } k = 2, 3, 4, \ldots, N; \quad N = 2, 3, 4, \ldots. \tag{9.7}$$

Equations (9.6) and (9.7) determine all the $\sigma^{(1)}_{N,k}$ except for the obvious

$$\sigma^{(1)}_{1,1} = 1. \tag{9.8}$$

It is clear from the definition of the $\sigma^{(i)}_{N,k}$, and can also be verified directly, that they obey the sum rule:

$$\sum_{k=1}^{N} \sigma^{(i)}_{N,k} = 1. \tag{9.9}$$

With this preliminary work out of the way, we are now ready to use the regeneration point method. We start from a chain of length $N$ and initial constituent $i$, and ask for the expected number of fragments of length $n$, $R^{(i)}_{N,n}$. The expected number of size-$n$ fragments from the chain to the right of break-point $k$ is $R^{(1)}_{N-k,n}$, where the upper index 1 indicates that this second chain always starts with an L (it is for this reason that we had to keep track of the starting constituent of the chain all the way through). The one fragment to the left contributes 1 to $R^{(i)}_{N,n}$ if and only if $k = n$; otherwise it contributes nothing.


Events for which the left-most break-point occurs in different positions are mutually exclusive, and hence their probabilities can be added. We therefore obtain

$$R^{(i)}_{N,n} = \sum_{k=1}^{N-1} \alpha^{(i)}_{N,k}\, R^{(1)}_{N-k,n} + \alpha^{(i)}_{N,n}, \tag{9.10}$$

where the first term represents the n-fragments arising from the chain to the right, the last term represents the contribution of the left-most fragment in the one situation, k = n, where this fragment has the desired length n. Equation (9.10) was derived on the assumption that n is less than N; however, it also holds for n = N, provided we make the obvious definition

$$R^{(i)}_{N,n} = 0 \quad \text{for } n > N. \tag{9.11}$$

With this definition, only the second term on the right of (9.10) contributes for n = N, and its contribution is the probability of having no break-point at all in the original chain, which is precisely what we want. In principle, equation (9.10) with the one initial condition

$$R^{(i)}_{1,1} = 1 \quad \text{for both } i = 1 \text{ and } i = 2 \tag{9.12}$$

suffices to determine all the $R^{(i)}_{N,n}$ and hence all the weight fractions. Furthermore, since all terms in (9.10) are intrinsically positive, there is no danger of cancellation destroying numerical accuracy, so that (9.10) is suitable for computer evaluation.

Nonetheless, this is a case where further analytic work is desirable. The computer calculation based on (9.10) tends to get awkward because all earlier $R^{(i)}_{N,n}$ must be kept in storage to continue the calculation. But quite apart from this purely technical difficulty, it is neither easy nor very instructive to look at long and cumbersome tables of numbers. A completely analytic solution of (9.10) is possible, and leads to insight into the nature of the end correction.

10. Generating Functions and Solution of the Recursion Equation

In order to solve equation (9.10) analytically, we introduce generating functions. Let z be a complex number, and define the power series

$$G_{i,n}(z) = \sum_{N=1}^{\infty} z^N R^{(i)}_{N,n}. \tag{10.1}$$

We note that the index summed over is N, the original length of the chain, not n, the size of the fragment. Since a fragment of size n can arise from chains of any size N ≥ n, the series in (10.1) does not terminate. It is, however, a power series of non-zero radius of convergence, and hence defines a function of z.
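Before the generating functions are put to work, the recursion (9.10), with (9.11) and (9.12), can be evaluated directly and compared with a brute-force enumeration of all chains of a given length. The following is a minimal Python sketch under stated assumptions, not a reconstruction of the author's programme. It assumes the parametrisation $\pi_{12} = h\delta$, $\pi_{21} = h\lambda$ with $\delta = 1 - \lambda$, and it assumes the break rule that can be read off from the configurations discussed in Section 9: the bond after position j is broken if and only if positions j and j+1 are both L and position j-1 is either L or absent. The helper names (`build`, `R_total`, `brute_force`) are hypothetical, and $P_n$, $Q_n$ are computed from the assumed definition of a D-string (begins and ends with D, contains no run LLL).

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product


def build(lam, h, nmax=12):
    """Return (R_total, brute_force); valid for chain lengths N <= nmax."""
    dl = 1 - lam
    p12, p21 = h * dl, h * lam            # assumed parametrisation
    p11, p22 = 1 - p12, 1 - p21
    step = {"LL": p11, "LD": p12, "DL": p21, "DD": p22}

    # u[m]: no run LLL in positions 1..m and position m is D, given D at position 1.
    u = {-1: Fraction(0), 0: Fraction(0), 1: Fraction(1)}
    for m in range(2, nmax + 1):
        u[m] = u[m - 1] * p22 + u[m - 2] * p21 * p12 + u[m - 3] * p21 * p11 * p12

    def P(n):  # D-string of length n, followed by LLL
        return u[n] * p21 * p11**2 if n >= 1 else Fraction(0)

    def Q(n):  # D-string of length at least n
        return 1 - sum(P(m) for m in range(1, n))

    def alpha(i, N, k):  # equations (9.2) to (9.8)
        if N == 1:
            return Fraction(int(k == 1))
        if i == 2:
            if k == N:
                return Q(N - 2)
            return P(k - 2) if 3 <= k <= N - 1 else Fraction(0)
        return p11 if k == 1 else p12 * alpha(2, N - 1, k - 1)

    @lru_cache(maxsize=None)
    def R(i, N, n):
        if n > N:
            return Fraction(0)                      # (9.11)
        if N == 1:
            return Fraction(1)                      # (9.12)
        right = sum(alpha(i, N, k) * R(1, N - k, n) for k in range(1, N))
        return right + alpha(i, N, n)               # (9.10)

    def R_total(N, n):
        # Average over the first constituent: L with probability lam, D with 1 - lam.
        return lam * R(1, N, n) + dl * R(2, N, n)

    def brute_force(N, n):
        total = Fraction(0)
        for chain in product("LD", repeat=N):
            prob = lam if chain[0] == "L" else dl
            for a, b in zip(chain, chain[1:]):
                prob *= step[a + b]
            # Assumed rule: break after position j iff x_j = x_{j+1} = L
            # and x_{j-1} is L or absent.
            cuts = [j for j in range(1, N)
                    if chain[j - 1] == chain[j] == "L" and (j == 1 or chain[j - 2] == "L")]
            edges = [0] + cuts + [N]
            total += prob * sum(e - s == n for s, e in zip(edges, edges[1:]))
        return total

    return R_total, brute_force


R_total, brute_force = build(Fraction(1, 3), Fraction(3, 4))
agree = all(R_total(N, n) == brute_force(N, n)
            for N in range(1, 9) for n in range(1, N + 1))
print(agree)
```

The memoised `R` keeps every earlier value in storage, which is exactly the bookkeeping burden mentioned in the text; for short chains it is harmless, and the enumeration gives an independent check on the regeneration argument.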


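Actually, our main interest is not in the separate $R^{(i)}_{N,n}$, but rather in $R_{N,n}$ as given by (8.11). The corresponding generating function is

$$G_n(z) = \sum_{N=1}^{\infty} z^N R_{N,n} = \lambda\, G_{1,n}(z) + \delta\, G_{2,n}(z). \tag{10.2}$$

We shall require two kinds of generating functions associated with the conditional probabilities $\alpha^{(i)}_{N,k}$ of Section 9. The first of these is defined as follows:

$$\psi_{i,M}(z) = \sum_{k=1}^{\infty} z^k \alpha^{(i)}_{M+k,k}, \quad i = 1, 2, \quad M = 1, 2, 3, 4, \ldots \tag{10.3}$$

Its actual value is, for i = 2,

$$\psi_{2,M}(z) = \psi_2(z) = \sum_{k=3}^{\infty} z^k P_{k-2}, \tag{10.4}$$

and for i = 1,

$$\psi_{1,M}(z) = \psi_1(z) = \pi_{11} z + \pi_{12}\, z\, \psi_2(z). \tag{10.5}$$

We note that both these functions are independent of M. The second type of generating function associated with the conditional probabilities $\alpha^{(i)}_{N,k}$ which we shall require is defined as follows:

$$\omega_{i,n}(z) = \sum_{N=1}^{\infty} z^N \alpha^{(i)}_{N,n} = z^n \alpha^{(i)}_{n,n} + \sum_{N=n+1}^{\infty} z^N \alpha^{(i)}_{N,n}, \tag{10.6}$$

where we have made use of the fact that $\alpha^{(i)}_{N,n} = 0$ if n exceeds N. By standard methods of probability theory (multiply (9.10) by $z^N$, sum over N, and put N = M + k in the double sum, which then factorises because $\psi_{i,M}$ does not depend on M), (9.10) can now be shown to lead to the result

$$G_{i,n}(z) = \psi_i(z)\, G_{1,n}(z) + \omega_{i,n}(z). \tag{10.7}$$

We can solve this equation immediately for the case i = 1, since then $G_{1,n}(z)$ is the only unknown function. The result is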

$$G_{1,n}(z) = \frac{\omega_{1,n}(z)}{1 - \psi_1(z)}. \tag{10.8}$$

With this information available, we can now solve (10.7) also for i = 2, with the result

$$G_{2,n}(z) = \psi_2(z)\, G_{1,n}(z) + \omega_{2,n}(z). \tag{10.9}$$

The quantity of main interest to us is $G_n(z)$, which equals

$$G_n(z) = \frac{[\lambda + \delta\,\psi_2(z)]\,\omega_{1,n}(z)}{1 - \psi_1(z)} + \delta\,\omega_{2,n}(z). \tag{10.10}$$

The reduction of $G_n(z)$ to directly usable form involves a number of operations which are omitted here for the sake of brevity. The reduced result is surprisingly simple, to wit

$$G_n(z) = \frac{\lambda z K_n(z)}{(1 - z)^2} + \delta\,\omega_{2,n}(z), \tag{10.11}$$


where $K_n(z)$ is a simple polynomial:

$$K_1(z) = (1 - \pi_{12}z - \pi_{11}\pi_{12}z^2)(1 - \pi_{12}z), \quad n = 1, \tag{10.12}$$

$$K_n(z) = (1 - \pi_{12}z - \pi_{11}\pi_{12}z^2)\,\pi_{12}\,z^{n-1}\,(Q_{n-3} - zQ_{n-2}), \quad n = 2, 3, 4, \ldots \tag{10.13}$$

When we return from the generating function $G_n(z)$ to the desired quantities $R_{N,n}$, it turns out that there are in general five different cases to consider, depending upon the value of n†:

n = 1 is special, because of the special expression (10.12);
n = 2, 3, 4, ..., N-3 is the "normal" case;
n = N-2, n = N-1, and n = N are again special.

It is convenient to define a factor F by

$$F = \lambda\pi_{12}(\pi_{11})^2 = h\lambda\delta(1 - h\delta)^2. \tag{10.14}$$

We note that 1/F is equal to $\bar{t} + \bar{n} + 3$, the denominator in (7.7). With this definition, we obtain for the normal case, after simplification

$$R_{N,n} = F\left[(N-n)P_{n-3} + Q_{n-3} + \frac{\pi_{12}(1 + 2\pi_{11})}{(\pi_{11})^2}P_{n-3} + \frac{P_{n-2}}{\pi_{21}(\pi_{11})^2}\right] \quad \text{for } N \ge 4,\ n = 2, 3, 4, \ldots, N-3. \tag{10.15}$$

The result for the special case n = 1 is

$$R_{N,1} = \lambda(\pi_{11})^3 N + \lambda\pi_{11}\pi_{12}(1 + 3\pi_{11}), \quad N \ge 4. \tag{10.16}$$

The final three special cases are n = N-2, n = N-1, and n = N. For n = N-2, we obtain

$$R_{N,N-2} = \lambda\pi_{12}\left[(1 + \pi_{11})P_{N-5} + (\pi_{11})^2 Q_{N-5}\right] + \delta P_{N-4}, \quad N \ge 4. \tag{10.17}$$

For n = N-1, the result is

$$R_{N,N-1} = \lambda\pi_{12}(P_{N-4} + \pi_{11}Q_{N-4}) + \delta P_{N-3}, \quad N \ge 4. \tag{10.18}$$

Finally, for n = N, we get

$$R_{N,N} = \lambda\pi_{12}Q_{N-3} + \delta Q_{N-2}, \quad N \ge 4. \tag{10.19}$$

We note, as a first check, that (10.19) agrees completely with (8.23), which we found by a direct probability argument. Equations (10.15) to (10.19) constitute a complete solution of the problem posed in Section 8.

11. Discussion

We start the discussion by giving simple interpretations of formulas (10.18) and (10.17). Formula (10.18) gives the expected number of fragments of length N-1, from a chain of length N. Since such a chain can give at most one such fragment, $R_{N,N-1}$ is equal to the probability of the chain splitting into (N-1)+1. For a chain starting with L, this can happen in two ways:

(i) LD...DLLL with the sequence D...D being a D-string, probability = $\pi_{12}P_{N-4}$;
(ii) LLD... with the D initiating a D-string of length at least N-4, probability = $\pi_{11}\pi_{12}Q_{N-4}$.

If the chain starts with a D, on the other hand, there is only one configuration which does the trick, namely

(iii) D...DLLL, the initial part being a D-string of length N-3, probability = $P_{N-3}$.

If we multiply the sum of (i) and (ii) by $\lambda$, the probability of having the chain start with L, and multiply (iii) by $\delta$, and add, we obtain precisely formula (10.18).

† Henceforth, we take the chain length N ≥ 4. This is no restriction since the results for N = 1, 2 and 3 can be written down directly, without all this formalism.

A similar simple interpretation holds for (10.17). The configurations giving rise to a segment of length N-2 are:

(i) D...DLLLA (where A stands for "anything"), probability = $\delta P_{N-4}$;
(ii) LD...DLLLA, probability = $\lambda\pi_{12}P_{N-5}$;
(iii) LLD...DLLL, probability = $\lambda\pi_{11}\pi_{12}P_{N-5}$;
(iv) LLLD... (D-string of length at least N-5), probability = $\lambda(\pi_{11})^2\pi_{12}Q_{N-5}$.

The sum of these probabilities is equal to (10.17).

Next, let us look at the cases n = 2 and n = 3, which are included in (10.15). The index of $P_{n-3}$ is n-3 = -1 and 0, respectively, and in both cases the probability involved is zero, see (4.4). Thus, for n = 2 and n = 3, formula (10.15) gives no term proportional to N, the length of the original chain. This is what we must expect, since fragments of length 2 or 3 cannot arise from the interior of the original chain, but only from the two ends. Putting in the known values of the $P_k$ and $Q_k$, we get from (10.15) for n = 2

$$R_{N,2} = F = \lambda(\pi_{11})^2\pi_{12}. \tag{11.1}$$

An interpretation of this formula is as follows. A fragment of length 2 can arise in only one way, namely from the right-hand end of a chain of which the last four constituents are LLLD. The probability of this precise sequence occurring in a given position is given by (11.1). Putting the known values of $P_k$ and $Q_k$ into (10.15) for n = 3 leads to

$$R_{N,3} = 2F. \tag{11.2}$$


An interpretation of this expression is slightly more involved, since fragments of size 3 can arise at both ends of the original chain:

(i) Chain starts with DLLL..., probability = $\delta\pi_{21}(\pi_{11})^2$;
(ii) Chain ends with ...LLLDA, probability = $\lambda(\pi_{11})^2\pi_{12}$.

We get (11.2) by adding these contributions, and making use of the identity

$$\lambda\pi_{12} = \delta\pi_{21}. \tag{11.3}$$

We note that the events (i) and (ii) above are not mutually exclusive, so that we are not allowed to add their probabilities as such. But $R_{N,3}$ is not a probability; rather, it is the expected number of 3-fragments, and we can add such expectation values, provided the initial chain is of sufficient length N so that the two ends contribute independently.

Next, let us convert the $R_{N,n}$ to weight fractions, using (8.1), and examine the structure of the resulting expression. For n = 1, 2, 3, ..., N-3 (thus excluding the last three values of n), $W_{N,n}$ has an exceedingly simple form:

$$W_{N,n} = W_n^{(\infty)} + \frac{\Delta_n}{N} \quad \text{for } n = 1, 2, 3, \ldots, N-3. \tag{11.4}$$

Here $W_n^{(\infty)}$ is the weight fraction for an infinite chain, as calculated in Sections 1 to 7, and the second term is the end correction to the weight fraction. This end correction is given by a quantity $\Delta_n$, which is independent of N, divided by the chain length N. Collecting terms, we find

$$\Delta_1 = \lambda\pi_{11}\pi_{12}(1 + 3\pi_{11}), \tag{11.5a}$$

$$\Delta_n = -Fn^2P_{n-3} + FnQ_{n-3} + n\delta\pi_{21}\pi_{12}(1 + 2\pi_{11})P_{n-3} + n\delta P_{n-2}. \tag{11.5b}$$

If the original chain is of sufficient length N so that the contribution of the very long fragments (n = N-2, n = N-1, and n = N) can be ignored, the end correction can be made even for a mixture of chains of various lengths N, without difficulty: we just get

(11.6)

where the averages are taken over the distribution-in-N. This may turn out to be a useful method for estimating the sizes N of the macromolecules which occur. We warn, however, that N must be very large before (11.6) is an adequate approximation (see below).

There is another consequence of (11.4), which provides a useful check on the calculation. Let us imagine that N is so large that the weight fractions, as well as corrections to the weight fractions, can be ignored for n close to N. Since the weight fractions eventually become small, we can always find such an N, at least mathematically.


Since the weight fractions must add up to unity, and the weight fractions computed for the infinite chain do so, it follows from (11.4) that the $\Delta_n$ must add up to zero:

$$\sum_{n=1}^{\infty} \Delta_n = 0. \tag{11.7}$$
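The identity (11.7) is easy to probe numerically. The following Python sketch is an illustration under stated assumptions and is not the programme BUSTER. It evaluates (11.5a) and (11.5b) in exact rational arithmetic and sums the $\Delta_n$ up to a large cut-off. It assumes the parametrisation $\pi_{12} = h\delta$, $\pi_{21} = h\lambda$, and it assumes that $P_n$ is the probability that a chain starting with D has a D-string of length n (begins and ends with D, contains no run LLL) followed by LLL, with $Q_n$ the probability of a D-string of length at least n. The function name `end_corrections` is hypothetical.

```python
from fractions import Fraction


def end_corrections(lam, h, nmax):
    """Delta_n for n = 1 .. nmax from (11.5a) and (11.5b), in exact arithmetic."""
    dl = 1 - lam
    p12, p21 = h * dl, h * lam          # assumed parametrisation
    p11, p22 = 1 - p12, 1 - p21
    F = lam * p12 * p11**2              # (10.14)

    # u[m]: no run LLL in positions 1..m and position m is D, given D at position 1.
    u = {-1: Fraction(0), 0: Fraction(0), 1: Fraction(1)}
    for m in range(2, nmax + 1):
        u[m] = u[m - 1] * p22 + u[m - 2] * p21 * p12 + u[m - 3] * p21 * p11 * p12
    P = {n: (u[n] * p21 * p11**2 if n >= 1 else Fraction(0)) for n in range(-1, nmax + 1)}
    Q, tail = {}, Fraction(1)
    for n in range(-1, nmax + 1):
        Q[n] = tail                     # Q[n] = 1 - sum of P[m] over m < n
        tail -= P[n]

    delta = {1: lam * p11 * p12 * (1 + 3 * p11)}                        # (11.5a)
    for n in range(2, nmax + 1):                                        # (11.5b)
        delta[n] = (-F * n**2 * P[n - 3] + F * n * Q[n - 3]
                    + n * dl * p21 * p12 * (1 + 2 * p11) * P[n - 3]
                    + n * dl * P[n - 2])
    return delta


delta = end_corrections(Fraction(1, 2), Fraction(1), 600)
partial_sum = float(sum(delta.values()))
print(float(delta[1]), partial_sum)
```

Because the $P_n$ decay geometrically, the truncated sum differs from the infinite one only by a tail that is negligible at the chosen cut-off; the cut-off of 600 is an arbitrary choice for the case $\lambda = \delta = 1/2$, h = 1.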

In view of the rather complicated form of the $\Delta_n$, equations (11.5), this identity is by no means trivial to establish. We have not derived it analytically from these formulas, but we have checked it numerically by using the computer programme BUSTER, which calculates all these quantities.

Turning now to the numerical results from BUSTER, let us look at the weight fractions for a typical case, λ = δ = ½, h = 1, i.e. a chain with equal amounts of L and D, and no nearest-neighbour correlation. For N = 10, the actual weight fractions are, in percent,

9.38, 1.25, 3.75, 6.88, 6.05, 6.80, 7.11, 7.11, 10.37, 42.31. (11.8)

The surprising aspect of these numbers is the huge weight fraction contributed by the unbroken chains, n = N = 10, i.e. 42% of the total weight. This is far and away the largest weight fraction of the lot. As we increase the original chain length N, we expect this final contribution to become less important, and so it does. However, the decrease is surprisingly slow. In Table 1 below we give some representative values:

TABLE 1
Some weight fractions (in percent) for chains λ = δ = ½, h = 1

Fragment                 Original chain length N
length n     N = 10    N = 20    N = 30    N = 40    N = ∞
1             9.38      7.81      7.29      7.03     6.250
2             1.25      0.63      0.42      0.31     0
3             3.75      1.88      1.25      0.94     0
4             6.88      5.00      4.38      4.06     3.125
N-3           7.11      3.70      1.70      0.75     0
N-2           7.11      3.47      1.56      0.69     0
N-1          10.37      4.74      2.08      0.91     0
N            42.31     17.87      7.73      3.35     0
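The entries of (11.8) and Table 1 can be recomputed from the closed forms (10.15) to (10.19) together with $W_{N,n} = nR_{N,n}/N$. The following Python sketch is an independent illustration, not BUSTER itself. It assumes the parametrisation $\pi_{12} = h\delta$, $\pi_{21} = h\lambda$ and the same working definition of $P_n$ and $Q_n$ as before (D-string: begins and ends with D, contains no run LLL, and is followed by LLL); the function name `weight_fractions` is hypothetical. Under these assumptions the sketch agrees with the first nine entries of (11.8) to the printed precision. For n = N = 10 it gives 423/1024, about 41.31, against the printed 42.31; since the ten printed values add up to 101.01 rather than 100, the last printed entry may carry a misprint.

```python
from fractions import Fraction


def weight_fractions(lam, h, N):
    """W_{N,n} for n = 1 .. N from the closed forms (10.15) to (10.19); needs N >= 4."""
    dl = 1 - lam
    p12, p21 = h * dl, h * lam          # assumed parametrisation
    p11, p22 = 1 - p12, 1 - p21
    F = lam * p12 * p11**2              # (10.14)

    # u[m]: no run LLL in positions 1..m and position m is D, given D at position 1.
    u = {-1: Fraction(0), 0: Fraction(0), 1: Fraction(1)}
    for m in range(2, N + 1):
        u[m] = u[m - 1] * p22 + u[m - 2] * p21 * p12 + u[m - 3] * p21 * p11 * p12
    P = {n: (u[n] * p21 * p11**2 if n >= 1 else Fraction(0)) for n in range(-1, N + 1)}
    Q, tail = {}, Fraction(1)
    for n in range(-1, N + 1):
        Q[n] = tail
        tail -= P[n]

    R = {1: lam * p11**3 * N + lam * p11 * p12 * (1 + 3 * p11)}             # (10.16)
    for n in range(2, N - 2):                                               # (10.15)
        R[n] = F * ((N - n) * P[n - 3] + Q[n - 3]
                    + p12 * (1 + 2 * p11) / p11**2 * P[n - 3]
                    + P[n - 2] / (p21 * p11**2))
    R[N - 2] = lam * p12 * ((1 + p11) * P[N - 5] + p11**2 * Q[N - 5]) + dl * P[N - 4]  # (10.17)
    R[N - 1] = lam * p12 * (P[N - 4] + p11 * Q[N - 4]) + dl * P[N - 3]                 # (10.18)
    R[N] = lam * p12 * Q[N - 3] + dl * Q[N - 2]                                        # (10.19)
    return {n: n * R[n] / N for n in range(1, N + 1)}


W = weight_fractions(Fraction(1, 2), Fraction(1), 10)
percent = [float(round(100 * W[n], 2)) for n in range(1, 11)]
print(percent, sum(W.values()))
```

Working with exact fractions makes it possible to check that the weight fractions add up to exactly one, which is a sensitive test of the special cases at n = N-2, N-1 and N.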

This table shows a number of interesting features: for n = 1, 2, 3, 4, we see the slow approach to the asymptotic values for an infinite chain length N, namely 6.25, 0.00, 0.00 and 3.125, respectively. The weight fraction n = 4, which is the first "standard" D-containing fragment LDLL, has an enormous end correction even for N = 40 (4.06% instead of the asymptotic 3.125%, a factor of 1.3). The weight fractions then decrease smoothly with increasing fragment weight n, except for the final three values. These final weight fractions are associated with the special formulas (10.17), (10.18), and (10.19), and are not represented correctly by (11.4). Even for N as high as 30, the final weight fraction $W_{N,N} = R_{N,N}$ is the largest single weight fraction of all.

The following crude rule emerges from a study of the numerical values obtained from BUSTER, for various values of the parameters λ, δ, and h: If the chain length N equals the mean D-length $\bar{n}$, as given by formula (6.6), then the weight fraction $W_{N,N}$ of the unbroken chains is roughly 1/3, say between 30 and 35%. If the original chain length is twice this (N = 2$\bar{n}$), then the weight fraction of the unbroken chains is about 11 to 13%. No great accuracy is claimed for these rough estimates, but they do give a general idea of what one may expect.

12. Conclusion

In this paper we have examined in detail only one particular example of probabilistic calculations for enzymes breaking up amino-acid chains. The methods which we have employed are well-known to specialists in the theory of stochastic processes, and we claim no originality for these methods. The results are, we believe, new.

The main remaining assumption limiting the generality of the treatment is the assumption that it is possible, by looking at the original unbroken chain, to label all the eventual break-points. This is true for the combination of two enzymes we have discussed in Sections 8 to 11, but it is not true in general. Quite frequently, the eventual break-up pattern is influenced by the time-sequence of the breaks. The eventual break-up pattern is one thing if the very first break occurs at one point, and it is different if the first break occurs somewhere else. In that case, the eventual break-up pattern preserves a "memory" of the very first break (and of other early breaks), and the method of Section 11 fails, vide the footnote to page 295. The more general method required for break-up with memory is of rather more complexity mathematically, and will be reported elsewhere.

We are grateful to Professor A. Berger of the Weizmann Institute, Rehovoth, Israel, for a number of valuable and helpful discussions, and to the John F. Kennedy Foundation of the Weizmann Institute for the Senior Fellowship during the tenure of which this work was done.