A Markov analysis of DNA sequences

J. theor. Biol. (1983) 104, 633-645

HAGAI ALMAGOR

Department of Physical Chemistry, The Hebrew University of Jerusalem, Jerusalem 91904, Israel

(Received 14 December 1982, and in revised form 6 May 1983)

We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the “correlation question”) is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16 000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.

0022-5193/83/200633+13 $03.00/0 © 1983 Academic Press Inc. (London) Ltd.

Procedure

We adopt a model in which each nucleic acid sequence is regarded as a chain generated by a properly built “machine”, called the sequence generator. The information we put into the sequence generator is a set of rules, or constraints, under which the sequence is to be generated. The machine works according to these a priori rules, and may not change them during its work. We call the set of rules according to which the sequences are generated “the statistics of the generator”, and use the term “chain” for the output of the generator.

If the generator’s statistics consists only of the probabilities of the four letters (the nucleotides A, C, G and T), then the appearance of a certain nucleotide at any step of the chain generation does not depend on the identity of the nucleotide in any previous step. Such a generator is known as a zero-order Markov source, and the chain it generates as a zero-order Markov chain (Feller, 1950). We shall refer to them as a “zero-order generator” and a “zero-order chain”.

A more composite generator is one whose statistics is the set of the 16 doublet probabilities. This set uniquely determines the four nucleotide probabilities. (The converse is of course not true: many doublet distributions are compatible with a given nucleotide distribution, apart from the case where only one of the nucleotides may appear.) If we write $p_{ij}$ for the probability that the doublet ij (i, j = A, C, G, T) will be generated, and $p_i$ for the probability of the nucleotide i, then the two are related by

$$p_i = \sum_{j=1}^{4} p_{ij}.$$

We also have

$$p_{j|i} = p_{ij}/p_i$$

for the conditional probability of the nucleotide j being generated right next to the nucleotide i along the chain. The set of 16 values of the $p_{ij}$ is equivalent to the set of 16 values of the $p_{j|i}$ in the sense that either set uniquely determines the other, and thus either one of these sets properly describes the statistics of the generator. This generator will generate chains in which the appearance of a certain nucleotide in any step depends only on the nucleotide in the previous step. The nucleotides will be nearest-neighbor correlated by what is known as first-order correlations. Such a generator is known as a first-order Markov source, and the chain it generates is a first-order Markov chain.

It should be emphasized that two generators, one zero-order and the other first-order, can have their $p_i$ values equal but still have different doublet distributions. However, if all their doublet probabilities are the same they definitely have the same nucleotide probabilities. Table 1 summarizes the expressions for the nucleotide, doublet and triplet probabilities for two such generators. We use the symbol $p_{ijk}$ for the probability of the triplet (ijk).
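As a small illustration of the two relations above, the following Python sketch derives the nucleotide probabilities and the conditional probabilities from a table of doublet probabilities. The numerical values and the helper names `marginals` and `conditionals` are invented for this example and are not taken from the paper.

```python
NUCLEOTIDES = "ACGT"

# Hypothetical doublet probabilities p_ij (first letter i, second letter j).
# The 16 values sum to 1; they are illustrative only.
doublet_p = {
    "AA": 0.10, "AC": 0.05, "AG": 0.07, "AT": 0.08,
    "CA": 0.08, "CC": 0.05, "CG": 0.01, "CT": 0.06,
    "GA": 0.05, "GC": 0.05, "GG": 0.05, "GT": 0.05,
    "TA": 0.07, "TC": 0.05, "TG": 0.07, "TT": 0.11,
}


def marginals(p_ij):
    """Nucleotide probabilities: p_i is the sum of p_ij over j."""
    return {i: sum(p_ij[i + j] for j in NUCLEOTIDES) for i in NUCLEOTIDES}


def conditionals(p_ij):
    """Conditional probabilities p_{j|i} = p_ij / p_i, keyed by the doublet ij."""
    p_i = marginals(p_ij)
    return {ij: p / p_i[ij[0]] for ij, p in p_ij.items()}


p_i = marginals(doublet_p)
p_cond = conditionals(doublet_p)
print(p_i)
print(round(p_cond["CG"], 3))  # probability that G follows C
```

Each row of the conditional table sums to one, which is what allows it to serve as a transition matrix for a first-order generator.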

Now, suppose we have a generator whose statistics is not known a priori. The only way to determine the statistics of such a generator is by examining the chain it generates. In practice, we can never look at an infinitely long chain, and we are confronted with the problem that for any finite length there is a distribution of output sequences, each having frequencies which are not necessarily equal to the actual probabilities of the generator’s statistics.

To take an extreme case, consider a zero-order generator with the most random statistics ($p_i = 0.25$ for all i) and examine sequences of precisely five nucleotides of its output. It is obvious that no sequence of length five has even one of the nucleotide frequencies equal to its statistical probability 0.25, since that would require a count of 1.25. In the same way there is no sequence of length 16 which has all its doublets equally frequent, since it contains only 15 doublets. In fact, most sequences of length 200 will have triplet frequencies different from the statistical probabilities.

Let us use the term “typical” for a sequence which has all its triplet frequencies equal to the statistical probabilities. By the law of large numbers, we know that when we examine longer and longer sequences it is more and more probable for a given sequence to be typical and less and less so for it to be atypical. However, atypical sequences are always possible, and what we actually see is that for any length of sequence and for every triplet we have a distribution of possible frequencies. Of course, when the probabilities are known, we can generate chains of any length, and as the length goes to infinity we are guaranteed by the law of large numbers that all frequencies will converge to their corresponding probabilities, thus revealing the generator’s statistics.
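The spread of frequencies at finite length can be made concrete with a short simulation. The sketch below is not the author's program: it uses Python's standard random generator, an arbitrary seed, and 100 chains of 16 000 nucleotides from the most random zero-order generator, and it reports the mean and standard deviation of the count of one triplet. The expected mean is about (16 000 - 2)/64, roughly 250, and the standard deviation should come out near the square root of that figure.

```python
import random
import statistics


def triplet_count_spread(triplet="ACG", length=16000, n_chains=100, seed=1):
    """Mean and standard deviation of the count of `triplet` over
    `n_chains` zero-order chains with p_i = 0.25 for every nucleotide."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_chains):
        chain = "".join(rng.choices("ACGT", k=length))
        # str.count skips overlapping matches; that is exact for a triplet
        # such as ACG, which cannot overlap with itself.
        counts.append(chain.count(triplet))
    return statistics.mean(counts), statistics.stdev(counts)


mean_count, sd_count = triplet_count_spread()
print(f"mean = {mean_count:.1f}, sd = {sd_count:.1f}")
```

A single natural sequence corresponds to one draw from such a distribution, which is why a measured count has to be compared with the whole distribution rather than with its mean alone.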
However, when we have only one single output sequence, for example the 5226 nucleotides long genome of SV40, we have only one member of a whole distribution of sequences, and any frequency we measure in this sequence is one member of a distribution of possible frequencies whose mean value equals the statistical probability. It will be shown that the width of the frequency distributions cannot be neglected even for sequences as long as 16 000 nucleotides, which is of the order of magnitude of some of the longest natural sequences presently known. There is therefore a length-dependent bias between frequencies and probabilities.

Let us now reformulate the correlation question as the need to determine which of the following three possibilities holds in a given sequence.

(a) The sequence is an output of a zero-order generator; triplet frequencies and nucleotide frequencies are compatible with the assumption of no nearest-neighbor correlations; any bias is due to sequence length effects only.

(b) Triplet frequencies and nucleotide frequencies are not compatible with zero-order assumptions, but adding the information of doublet frequencies is enough to explain the bias. This means that triplet frequencies and doublet frequencies are compatible with the assumption of first-order correlations, nearest-neighbor correlations being the longest range ones along the sequence.

(c) Nearest-neighbor correlations are not enough to explain the triplet frequencies. The bias is too strong for doublet and triplet frequencies to be compatible with the first-order assumption, there are some longer range correlations, and the sequence is of higher than first order.

We note that case (c) covers all eventualities which are not included in (a) or in (b).

Our aim is to construct the zero-order and the first-order generators for a given sequence. For the zero-order generator we need to know the four nucleotide probabilities, and for the first-order generator we need the 16 doublet probabilities. Since we have only a single sequence, the one analyzed, we can measure only frequencies, and there is no way by which we can actually know the real probabilities. Thus, having a single sequence as our data, the best zero-order generator we can build is the one which uses the nucleotide frequencies of the analyzed sequence as its nucleotide probabilities. The best first-order generator we can build is the one which uses the doublet frequencies of the analyzed sequence as its doublet probabilities.

If each of the generators produces a large number of sequences of the length of the given sequence, we shall obtain two distributions of sequences, a zero-order one and a first-order one, and in particular two distributions of triplet frequencies. For each of the 64 triplets the zero-order and the first-order frequency distributions can be compared with the measured (experimental) frequency in order to judge whether either of the possible generators is strictly compatible with the experimental triplet frequencies.
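Building the two generators from a single sequence amounts to counting. The following Python sketch estimates the zero-order probabilities and the first-order transition matrix from the observed frequencies; the function name `estimate_generators` and the short test string are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter


def estimate_generators(sequence):
    """Return (zero-order probabilities, first-order transition matrix)
    estimated from the frequencies observed in `sequence`."""
    n = len(sequence)
    nucleotide_counts = Counter(sequence)
    doublet_counts = Counter(sequence[k:k + 2] for k in range(n - 1))

    # Zero-order generator: nucleotide frequencies used as probabilities.
    zero_order = {i: nucleotide_counts[i] / n for i in "ACGT"}

    # First-order generator: doublet counts normalized row by row,
    # giving the conditional probability that j follows i.
    first_order = {}
    for i in "ACGT":
        row_total = sum(doublet_counts[i + j] for j in "ACGT")
        first_order[i] = {
            j: (doublet_counts[i + j] / row_total if row_total else 0.0)
            for j in "ACGT"
        }
    return zero_order, first_order


zero, first = estimate_generators("ACGTACGGTACCATGA")
print(zero["A"], first["A"]["C"])
```

Normalizing each row by the sum of its doublet counts, rather than by the nucleotide count, avoids the off-by-one that arises for the nucleotide that terminates the sequence.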
Taking nucleotide frequencies and doublet frequencies as the generator’s probabilities, apart from being the best one can do, is also justifiable since the nucleotide and doublet frequencies converge to their stationary values faster than triplet frequencies as the chain becomes longer.

Materials and Methods

The L-strand of the human mitochondrial genome (HMG) and the strand of SV40 DNA complementary to early mRNA were taken from the Nucleic Acid Sequence Library available at the Weizmann Institute of Science, Rehovot. Both sequences were analyzed while being read in the direction 5’→3’. (However, the general conclusions presented do not depend on the direction and, if the two strands match exactly, not even on which strand we use.) Analysis was performed on a CYBER CDC machine.

Generating the sequences was based on a transition probability matrix $[p_{j|i}]$. For the first-order generator, the matrix elements were the conditional probabilities $p_{j|i}$. For the zero-order generator the matrix had all its rows identical, each containing the four nucleotide probabilities, and therefore the elements did not depend on the row’s index. Examples of the matrices of the two generators are given in Table 3.

The generation of a sequence was performed by choosing the first nucleotide arbitrarily, getting a random number from the machine’s randomizer and choosing the second nucleotide from the transition probability matrix according to the random number and the first nucleotide. The second nucleotide was then used as the condition for the third, and so on. In the case of a zero-order generator, conditioning on the previous nucleotide, that is choosing a row in the matrix, does not have any effect since all rows are equal. The last nucleotide of every sequence was used as the first for the next sequence for continuity. The random number generator used was that of FORTRAN on the CYBER CDC machine, which has a cycle of $2^{48}$ numbers.

Part of the analysis could be carried out using algebraic enumeration methods (see, for example, Spears, Jeffcott & Jackson, 1980). However, already for the first-order generator the enumeration of sequences is rather cumbersome and the suggested approach seems more direct and clear.

Results

We present results of analyses carried out separately on two naturally occurring sequences, the 5226 nucleotides long genome of SV40 (Reddy et al., 1978) and the 16 569 nucleotides long HMG (Anderson et al., 1981). The frequencies of all nucleotides, doublets and triplets in a sequence were measured (Table 2). Nucleotide and doublet frequencies were taken as probabilities. A zero-order generator was constructed using the nucleotide frequencies as its statistics. A first-order generator was constructed using the doublet frequencies as its statistics.
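The generation procedure described under Materials and Methods can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original FORTRAN program: Python's `random.Random` stands in for the machine's randomizer, the uniform zero-order probabilities are placeholders, and the names `generate_chain` and `zero_order_matrix` are hypothetical.

```python
import random


def generate_chain(transition, length, first, rng):
    """Generate `length` nucleotides; `transition[i][j]` is the probability
    that j follows i, and `first` is the arbitrarily chosen first nucleotide."""
    chain = [first]
    for _ in range(length - 1):
        row = transition[chain[-1]]  # condition on the previous nucleotide
        chain.append(rng.choices("ACGT", weights=[row[j] for j in "ACGT"])[0])
    return "".join(chain)


def zero_order_matrix(p):
    """A zero-order generator written as a matrix with four identical rows."""
    return {i: dict(p) for i in "ACGT"}


rng = random.Random(0)  # stand-in for the machine's randomizer
uniform = zero_order_matrix({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})

chains = []
start = "A"
for _ in range(3):
    chain = generate_chain(uniform, 50, start, rng)
    chains.append(chain)
    start = chain[-1]  # the last nucleotide seeds the next chain, as in the text
```

Writing the zero-order generator as a matrix with identical rows lets the same routine serve both generators, which mirrors the description in the text.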
Each of the generators was then used to produce 1000 sequences of the same length as the original sequence,

TABLE 1
Probability expressions for the generators

              Zero-order generator        First-order generator
Nucleotides   $p_i$                       $p_i = \sum_j p_{ij}$
Doublets      $p_{ij} = p_i p_j$          $p_{ij} = p_i p_{j|i}$
Triplets      $p_{ijk} = p_i p_j p_k$     $p_{ijk} = p_{ij} p_{k|j}$

TABLE 2
Nucleotide and doublet counts in SV40 DNA and HMG

SV40 DNA (5226 nucleotides)

        A      C      G      T      Σ
A     532    288    345    352   1517
C     423    247     27    398   1095
G     237    266    272    257   1033
T     325    293    389    574   1581

HMG (16 569 nucleotides)

        A      C      G      T      Σ
A    1602   1492    797   1232   5123
C    1529   1770    436   1440   5175
G     618    710    429    419   2177
T    1374   1203    514   1003   4094

Doublet counts are represented as elements of a matrix, with the first nucleotide of the doublet indexing the row and the second the column. The counts of the nucleotides are shown in the fifth column of each matrix, every element of this column being the sum of its corresponding row’s elements (except for G which, terminating both sequences, has one extra count in both).
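Combining the expressions of Table 1 with the SV40 counts of Table 2 gives a feeling for how far apart the two models can be. The Python sketch below computes the expected count of a triplet under each model; the function name is hypothetical, and using the number of triplet positions (sequence length minus two) as the multiplier is an assumption of this sketch. For CGA the zero-order expectation comes out at roughly 63 and the first-order expectation at roughly 6.

```python
# Doublet counts for SV40 DNA from Table 2 (outer key: first nucleotide).
SV40_DOUBLETS = {
    "A": {"A": 532, "C": 288, "G": 345, "T": 352},
    "C": {"A": 423, "C": 247, "G": 27, "T": 398},
    "G": {"A": 237, "C": 266, "G": 272, "T": 257},
    "T": {"A": 325, "C": 293, "G": 389, "T": 574},
}
SV40_NUCLEOTIDES = {"A": 1517, "C": 1095, "G": 1033, "T": 1581}


def expected_triplet_counts(triplet, doublets, nucleotides):
    """Expected counts of `triplet` under the zero-order and first-order
    expressions of Table 1, with frequencies used as probabilities."""
    i, j, k = triplet
    n = sum(nucleotides.values())  # sequence length
    n_doublets = n - 1
    n_triplets = n - 2

    # Zero-order: p_ijk = p_i * p_j * p_k
    p = {x: nucleotides[x] / n for x in nucleotides}
    zero = p[i] * p[j] * p[k] * n_triplets

    # First-order: p_ijk = p_ij * p_{k|j}
    p_ij = doublets[i][j] / n_doublets
    p_k_given_j = doublets[j][k] / sum(doublets[j].values())
    first = p_ij * p_k_given_j * n_triplets
    return zero, first


zero_cga, first_cga = expected_triplet_counts("CGA", SV40_DOUBLETS, SV40_NUCLEOTIDES)
print(f"CGA: zero-order {zero_cga:.1f}, first-order {first_cga:.1f}")
```

These are only the means; the analysis in the text compares the measured count with the full simulated distribution around each mean.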

TABLE 3
Generator matrices

Zero-order matrix for HMG

A C G T

AAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCGGGGGGGGV AAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCGGOG AAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCGGGGGGGGAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCCCCCCGGGGGGGG-

A C G T

AAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCGGGGGGGGGGGGGGB AAAAAAAAAAAAAAAAAAAAAAACCCCCCCCCCCCCCGI? AAAAAAAAAAAAAACCCCCCCCCCCCCCCCGGGGGGGGGGGGGGGS AAAAAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCCGGTT

First-order matrix for SV40 DNA

Examples of transition probability matrices for zero-order and first-order generators. Every row has 60 characters, the resolution is therefore 1/60. For example, in the first row of the first-order matrix, G appears 14 times. This means that the probability for G to follow A along the chain is 14/60. The matrices actually used contained 96 characters in every row, the resolution being 1/96.
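The character matrices of Table 3 can be read as lookup tables: a row holds each letter in proportion to its probability, and drawing a random position in the row yields the next nucleotide. The Python sketch below builds such a row from the Table 2 counts. The largest-remainder rounding is an assumption of this sketch, since the paper does not say how the counts were rounded to 60 characters, and the function names are hypothetical.

```python
import random


def build_row(weights, resolution=60):
    """Turn four weights (for A, C, G, T) into a string of `resolution`
    letters. Largest-remainder rounding is an assumption of this sketch."""
    total = sum(weights.values())
    exact = {x: weights[x] * resolution / total for x in "ACGT"}
    slots = {x: int(exact[x]) for x in "ACGT"}
    leftover = resolution - sum(slots.values())
    # Hand the remaining slots to the letters with the largest remainders.
    by_remainder = sorted("ACGT", key=lambda x: exact[x] - slots[x], reverse=True)
    for x in by_remainder[:leftover]:
        slots[x] += 1
    return "".join(x * slots[x] for x in "ACGT")


def next_nucleotide(rows, previous, rng):
    """Pick a uniformly random character from the row of `previous`."""
    return rng.choice(rows[previous])


# Row A of the first-order matrix for SV40 DNA, from the Table 2 counts.
row_a = build_row({"A": 532, "C": 288, "G": 345, "T": 352})
print(row_a.count("G"))  # 14, matching the example in the caption
```

With this rounding the row for A contains 14 copies of G, in agreement with the value quoted in the caption of Table 3.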

and the frequency of every triplet was measured in each of the output sequences. For every triplet, the two frequency-distributions were then plotted and compared with the experimental frequency. Examples of these plots for four triplets of HMG are shown in Fig. 1. There are 64 such plots for each analyzed sequence. The examples shown were selected as representing different possible relationships between the zero-order distribution, the first-order distribution and the experimental frequency. The width of the curves in Fig. 1 is due only to the “shortness” of the sequence, which is 16 569 in this case. It is obvious from these curves that the effect of finite length of the sequence cannot be neglected. In Fig. 1 we see that for the triplet ACA the zero-order and the first-order distribution curves overlap very little, the experimental value fits the first-order curve well

FIG. 1. The triplet frequencies analysis of the human mitochondrial genome is demonstrated by the triplets AAG, AAC, CGA and ACA (frequencies are in counts per chain; horizontal axes: counts). The zero-order frequency-distribution curves (- - -) were produced from 1000 sequences generated by the zero-order generator, each as long as the original sequence. The first-order curves (solid lines) were produced correspondingly by the first-order generator. The curves describe, for each case, the number of chains (out of the sample of 1000) vs the counts of the given triplet in a chain. For example, the maximum of the first-order curve for AAC shows that in 190 out of the 1000 sequences generated by the first-order generator, the triplet AAC appeared 445 times. The counts of the triplets in the original sequence are shown by dotted lines. (a) Triplet AAG, (b) triplet AAC, (c) triplet CGA, (d) triplet ACA.

and therefore fits the first-order assumption. In the case of AAG the two curves are also separated but the experimental value fits the zero-order curve. The plot for CGA, which is typical of all triplets which contain CG, shows a clear distinction between the two curves, and the experimental value fits the first-order distribution perfectly. However, there is a need to carry the analysis one step further as is expressed by the plot of AAC in Fig. 1. Here we see that the two curves (the zero-order and the first-order) overlap to a considerable extent and the experimental value is within this overlapping region. Which of these curves does it fit better? In the plot of AAG it seems that the experimental value fits the zero-order curve much better than the first-order curve, but still it “cuts” the first-order curve. Does this mean that the experimental frequency of AAG is explained by zero-order assumptions, or is it just that the experimental frequency of AAG happens to be a very rare but statistically possible fluctuation in the first-order distribution? For the next step we use the following argument: if the original sequence is zero-order, then doublet probabilities do not give us more information than we know from nucleotide probabilities (see Table 1). In this case first-order curves should overlap zero-order curves and experimental values must be shown strictly compatible with the set of zero-order curves. On

FIG. 2. SV40 DNA. Histograms of the four standardizations. p is the probability, D-value is the standardized variable. If the counts of triplets in the original sequence are characteristic points, each of its corresponding first-order curve, then standardization of these values, each by its corresponding first-order curve’s mean and standard deviation, should form the normal distribution. This standardization, D(E/1) (black columns), is shown in (a). Standardization of the first-order curves themselves, each by its own mean and standard deviation, D(1/1) (striped columns), is also shown in (a). The normal distribution is drawn as a dotted line. If, on the other hand, the experimental triplet frequencies are characteristic points of the zero-order curves, they should form the normal distribution if standardized each by its corresponding zero-order curve’s mean and standard deviation. This standardization, D(E/0) (striped columns), is shown in (b). Also shown in (b) is D(1/0) (black columns), the standardization of the points forming each first-order curve by its corresponding zero-order curve’s mean and standard deviation.

the other hand, if the experimental triplet frequencies are determined by nearest-neighbor correlations alone (first-order sequence) we must be able to show that they are strictly compatible with the set of first-order curves.

The values which are most compatible with a given curve are certainly the points on the curve itself. We therefore have to examine whether the experimental frequency values behave like typical points of the first-order curves and not like typical points of the zero-order curves. The 64 first-order curves were therefore standardized into units of standard deviations from the mean. In this way all 64 curves could be united into a single histogram of a standard variable. This variable is named D(1/1) and the histogram is shown by striped columns in Fig. 2(a) (for SV40 DNA) and Fig. 3(a) (for HMG). As is expected from the central limit theorem (Cramer, 1955) this histogram fits the normal distribution exactly, drawn independently as a dotted curve.

FIG. 3. HMG. Histograms of the four standardizations. Details as in Fig. 2.

The 64 values of the experimental triplet frequencies were then standardized as well, each by the mean and standard deviation of its corresponding first-order curve, as if they were points on these curves. The unified variable thus formed is termed D(E/1) and its histogram is shown by black columns in Fig. 2(a) (for SV40 DNA) and Fig. 3(a) (for HMG). Similarly, the compatibility of experimental values and the zero-order curves is examined by standardizing the experimental triplet frequencies, each by its corresponding zero-order curve, thus forming the variable D(E/0), histogrammed by striped columns in Figs 2(b) and 3(b) for SV40 and HMG, respectively. Standardization of the zero-order curves themselves should again give the normal distribution, and therefore is not shown.

To complete the picture, the first-order curves were standardized to the corresponding zero-order mean and standard deviation values. The rationale for doing so is that if the experimental values are typical points of the first-order curves, then any feature of the points on the first-order curves should be reproduced by the experimental values. This latter standardization forms the variable D(1/0), which is histogrammed by black columns in Figs 2(b) and 3(b) for SV40 and HMG, respectively.

The fact that (in both Figs 2(b) and 3(b)) the variable D(E/0) shows broad non-normal patterns means that for both sequences the triplet frequencies are not compatible with zero-order curves, and therefore triplet frequencies cannot be explained by nucleotide frequencies alone. Case (a) is thus ruled out. There is also good agreement between the D(E/0) and the D(1/0) histograms for both sequences, which means that both in SV40 DNA and in HMG the experimental triplet frequencies are as incompatible with zero-order assumptions as first-order assumptions would suggest them to be. The fit of the D(E/1) histogram to the normal curve is much better in Fig. 2(a) (SV40) than in Fig. 3(a) (HMG). It seems reasonable to claim that triplet frequencies in the SV40 DNA sequence are dominated by first-order correlations, while in the HMG sequence some longer range correlations are evident.
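The standardization behind the D variables is an ordinary z-score taken against a simulated distribution. The Python sketch below shows the calculation for one triplet; the five simulated counts are invented for the example, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the paper does not say which estimator was used.

```python
import statistics


def d_value(observed, simulated_counts):
    """Standardize `observed` by the mean and standard deviation of the
    counts of the same triplet in the simulated chains."""
    mean = statistics.mean(simulated_counts)
    sd = statistics.stdev(simulated_counts)
    return (observed - mean) / sd


# Hypothetical counts of one triplet in five simulated chains.
simulated = [48, 52, 50, 47, 53]
print(d_value(56, simulated))
```

In the notation of the text, D(E/1) would pass the experimental count together with the first-order simulated counts, D(E/0) the experimental count with the zero-order simulated counts, and D(1/0) each first-order simulated count with the zero-order simulated counts; pooling the values over all 64 triplets gives the histograms of Figs 2 and 3.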

Discussion

The raison d’être of a Markov process model for DNA sequences is embedded in the very mechanism by which this information-carrying sequence is synthesized: nucleotides are chained linearly one by one. It is as in a written language, where the accumulation of information along the story is based on correlations between neighboring letters, interpreted as meaning during a later, sequential process termed “reading”. Such constraints, which are not necessarily frame-dependent as in coding sequences, can be traced using models such as the one suggested here.

When we use terms like “triplet frequencies expected from nucleotide frequencies”, do we really mean a single-valued frequency for every triplet? The first conclusion of this work is that even for sequences of considerable length (such as 16 000 nucleotides) we should rather have a distribution of frequencies to which each measured frequency should be compared. Standard deviation values of the triplet frequency distributions are roughly given by the square root of their corresponding mean values. The exceptions are the homogeneous triplets AAA, CCC, GGG and TTT, which can overlap very compactly in a sequence and can therefore have a wider distribution and hence larger standard deviations. (For example, the sequence CCCCC has three CCC-triplets in it.) Noise in frequency measurements is therefore relatively stronger for the rare triplets. Thus, when rare triplets are examined, it is especially important to consider this noise.

In several cases the zero-order mean frequency is equal, or very close, to the corresponding first-order mean frequency (e.g., ACC in SV40 DNA, or TGC in HMG). In general, for any given triplet the difference between the zero-order and first-order means is comparable to the width of the curves. Therefore we cannot neglect the width of the distribution or the effect of the finite length of the sequence.

We have shown how Markovian analysis of DNA sequences can resolve the difficulties caused by effects of finite length. In one of the two sequences analyzed, SV40 DNA, it can be concluded that first-order correlations determine the triplet frequencies. It is very likely that if correlations of longer range than nearest-neighbor determine triplet frequencies, as seems to be the case in the HMG, they are related to some features of the coding properties in parts of the sequence. This is especially so for a sequence which is as highly coding as HMG. However, SV40 DNA is also highly coding and yet doublet frequencies seem to provide enough information to determine triplet frequencies in this sequence.

It is the intention of this discussion to stress the possibly important role doublet frequencies may play in nucleic acid sequences. It is suggested by this work that nearest-neighbor correlations are steadily conserved along the DNA chain, and in some cases they are the longest range correlations that do so; longer range correlations seem to average out. It is obvious that in some parts of a DNA sequence there may be some long-range correlations between nucleotides. For example, in a sequence that codes for a globular protein, say an enzyme, we expect some long-range correlations constrained by the three-dimensional structure of the protein. However, these long-range correlations, along with other features which are specific to a particular part of the sequence, may average out when the whole sequence is analyzed. Only global characteristics of the chain remain and show up in the analysis.
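The remark on homogeneous triplets made earlier in this Discussion depends on counting overlapping occurrences. A minimal Python sketch of such a count follows; the function name `count_overlapping` is hypothetical.

```python
def count_overlapping(sequence, word):
    """Count occurrences of `word` in `sequence`, overlaps included."""
    return sum(
        1
        for k in range(len(sequence) - len(word) + 1)
        if sequence[k:k + len(word)] == word
    )


print(count_overlapping("CCCCC", "CCC"))  # 3, as in the example in the text
print("CCCCC".count("CCC"))               # 1: str.count skips overlaps
```

Because successive occurrences of a homogeneous triplet share letters, their counts are positively correlated along the chain, which is consistent with the wider distributions noted for AAA, CCC, GGG and TTT.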
In analyzing frequencies of words in Western languages, Abramson (1963) shows that when one generates first-order sequences using as the generator statistics the frequencies of letter-doublets (with space as a 27th letter), one can recognize the language from which the doublet frequencies have been taken by trying to read these sequences. This does not hold for the corresponding zero-order sequences. However, in zero-order sequences the length of words recalls that of the real language. Now, one of the letters, namely the space, never appears within a word and has one role only-that of arresting (or starting) a word. The mean length of words is determined by space frequency and therefore zero-order sequences should have the same mean word length as the real language. However, first-order correlations seem to be strong enough to discriminate amongst several other languages of the same family. Some of the constraints are symbolic (as, for example, in the doublets SH and CH in English or GN in French), and are due to the fact that the language has to be spoken. Such constraints are due to the mechanism rather than to the meaning of the information. DNA is indeed a language.

It may even be a manifold of languages. It seems that the DNA language or languages have strong constraints on doublet frequencies. They seem to be stronger than those on triplets. As an example of a global constraint on doublets, let us look at the doublet CG, which was found to be strongly constrained (C is rarely followed by G) in both SV40 DNA and HMG. For all triplets which contain CG (eight altogether) in SV40 DNA, and for all but one of these triplets (the exception is TCG) in HMG, the triplet frequency analysis plot is clearly compatible with first-order assumptions, as is shown in the example of CGA in Fig. 1. Triplets containing CG are rarer than would be expected from nucleotide frequencies alone. It was shown (Gruenbaum et al., 1981, 1982) that eukaryotic DNA methylase is exclusively CG specific, so that >95% of all animal DNA methylation occurs at the sequence C-G. The frequency of C-G in eukaryotic DNA is therefore constrained by the amount of methylation needed. This is an example of a dinucleotide constraint which recalls the di-letter constraints in languages mentioned above.

There may be other dinucleotides which impose constraints in a similar way. Nussinov (1981) has concluded that some aspects of degenerate codon choice are due to doublet regularities. Grantham (1978) used CG and GC frequencies in formulating indices to contrast RNA sequences of different groups of organisms. Russel & Subak-Sharpe (1977) used doublet frequencies for taxonomic analyses. We interpret our results to mean that there are some features of the DNA chain embedded in the doublet frequencies. Apart from modifications (as in the eukaryotic C-G), these features may be correlated, for example, to the three-dimensional structure of the DNA via helical twist angles (Kabsch, Sander & Trifonov, 1982), or to helical periodicities (Rhodes & Klug, 1981). Similar conclusions have been reached by C. Sander (personal communication) using information theoretic considerations.

In this work we did not refer to coding properties of the sequence (and indeed both sequences analyzed are highly coding). Neither did we separate the analysis for different reading frames. The Markov chain model is being developed further to deal with these kinds of questions. We are also preparing to carry out this analysis on groups of different sequences (e.g., several viral sequences) treated as the output of a common generator, to look for common features and taxonomic or evolutionary patterns.

I am indebted to R. D. Levine for many fruitful discussions. I am grateful to J. Sussman for the sequence library, A. Ofir for advice in computing and D. Darom for important remarks on the graphics. I thank M. Almagor for constant support and encouragement.

REFERENCES

ABRAMSON, N. (1963). Information Theory and Coding. New York: McGraw-Hill.
ANDERSON, S., BANKIER, A. T., BARRELL, B. G., DE BRUIJN, M. H. L., COULSON, A. R., DROUIN, J., EPERON, I. C., NIERLICH, D. P., ROE, B. A., SANGER, F., SCHREIER, P. H., SMITH, A. J. H., STADEN, R. & YOUNG, J. G. (1981). Nature 290, 457.
CRAMER, H. (1955). The Elements of Probability Theory and Some of Its Applications. New York: John Wiley.
FELLER, W. (1950). An Introduction to Probability Theory and Its Applications, Vol. 1. New York: John Wiley.
GRANTHAM, R. (1978). FEBS Lett. 95, 1.
GRUENBAUM, Y., CEDAR, H. & RAZIN, A. (1982). Nature 295, 620.
GRUENBAUM, Y., STEIN, R., CEDAR, H. & RAZIN, A. (1981). FEBS Lett. 124, 67.
KABSCH, W., SANDER, C. & TRIFONOV, E. N. (1982). Nucleic Acids Res. (In press).
NUSSINOV, R. (1981). J. mol. Biol. 149, 125.
REDDY, V. B., THIMMAPPAYA, B., DHAR, R., SUBRAMANIAN, B., ZAIN, S., PAN, J., GHOSH, P. K., CELMA, M. L. & WEISSMAN, S. M. (1978). Science 200, 494.
RHODES, D. & KLUG, A. (1981). Nature 292, 378.
RUSSEL, G. J. & SUBAK-SHARPE, J. H. (1977). Nature 266, 533.
SPEARS, W. T., JEFFCOTT, B. & JACKSON, D. M. (1980). J. Comb. Theor. A 28, 191.