Threshold consensus methods for molecular sequences

J. theor. Biol. (1992) 159, 481-489

Threshold Consensus Methods for Molecular Sequences

WILLIAM H. E. DAY† AND F. R. McMORRIS‡

† Department of Computer Science, Memorial University of Newfoundland, St John's, NF A1C 5S7, Canada and ‡ Department of Mathematics, University of Louisville, Louisville, KY 40292, U.S.A.

(Received on 8 February 1992, Accepted in revised form on 30 March 1992)

We introduce a parameterized threshold consensus method ($th_x$) for molecular sequences which is based on a majority-rule voting principle. In contrast to other frequency-based methods, the $th_x$ method uses a single criterion to return ambiguity codes of different lengths. We derive basic features of the method and establish that it returns at most two ambiguity codes at any position of the consensus sequence. We bound from below the size of the frequency gap that exists when the $th_x$ method returns an ambiguity code. Using such properties, we compare the $th_x$ method to other consensus methods for molecular sequences which are defined in terms of threshold or gap criteria§.

1. Introduction

Consensus methods play a vital role in molecular biology where consensus sequences are used to identify mRNA initiation and termination sequences (Cavener, 1987; Cavener & Ray, 1991; Yamauchi, 1991), to analyze the secondary structure of RNA (Piller et al., 1990; Jurka & Milosavljevic, 1991), to align multiple DNA sequences (Bains, 1986, 1989; Waterman, 1986, 1989a), and to find molecular patterns occurring imperfectly above a preset frequency (Waterman et al., 1984; Waterman, 1989b). Typically, methods for consensus sequences are frequency-based (Day & McMorris, 1992b) because their results can be defined solely in terms of the relative frequencies $f = (f_1, f_2, \ldots)$ with which nucleic acid bases or amino acids occur at each position in a set of aligned molecular sequences. A method's decision to return an ambiguity code at such a position may employ threshold criteria of the forms $f_i > c$ or $f_i \ge c$, where $c$ is a constant (Kolodrubetz, 1990; Daniels & Deininger, 1991; Grasser & Feix, 1991; Sayers & Eckstein, 1991; Yamauchi, 1991), gap criteria of the form $f_i/f_{i+1} > c$ (Choo et al., 1991), or combinations of the two (Cavener, 1987; Shapiro & Senapathy, 1987). But when these methods return ambiguity codes of different lengths (i.e. codes representing different numbers of bases), they may apply different criteria which depend on the lengths of the codes being returned. Although such

Author to whom correspondence should be addressed.
§ This work was supported in part by the Natural Sciences and Engineering Research Council of Canada under grant A-4142, and by the US Office of Naval Research under grant N00014-89-J-1643. The first author is an Associate in the Program in Evolutionary Biology of the Canadian Institute for Advanced Research.

0022-5193/92/240481 + 09 $08.00/0

© 1992 Academic Press Limited

criteria may incorporate application-dependent constraints into the consensus method, they confound the goals of summarizing and interpreting the relative frequencies $f = (f_1, f_2, \ldots)$, and furthermore they restrict usage of the method to applications where the constraints are expected to be satisfied. An alternative approach separates the summarization of relative frequencies from their interpretation with the expectation that, once an informative consensus result is available, researchers will apply application-dependent criteria during its interpretation. We submit that a consensus method, to be informative in this sense, should apply criteria consistently to return ambiguity codes of different lengths. How to accomplish this in frequency-based consensus methods is the problem we investigate herein. Our main result is to develop a new parameterized threshold consensus method ($th_x$) for molecular sequences which is based on a majority-rule voting principle.

Our paper is organized as follows. Section 2 motivates essential features of the $th_x$ method in a voting context, while section 3 develops the assumptions and terminology to put it in a molecular context. Section 4 defines the method and derives its basic properties. We establish, for example, that at any position in a consensus sequence the method never returns more than two ambiguity codes, and we obtain lower bounds on the size of the frequency gaps which occur when the method returns ambiguity codes. In section 5 we use properties such as these to compare the $th_x$ method with other frequency-based consensus methods for molecular sequences.

2. Voting Context

Consensus methods have long been recognized as effective tools for analyzing data (Day, 1986, 1988). Many of the basic consensus concepts can be described in a simple voting context as follows. Suppose there is an electorate of $k$ voters in an election involving a slate $S = \{s_1, \ldots, s_n\}$ of alternatives. The votes cast in the election are specified by a vector $P = (p_1, \ldots, p_k)$, called a profile, in which $p_i$ denotes the alternative selected by voter $i$. The proportion $f_m$ of votes cast in $P$ for each alternative $s_m$ is specified by a vector $f = (f_1, \ldots, f_n)$ of non-negative frequencies such that $f_1 + \cdots + f_n = 1$. The set of all profiles is the $k$-fold Cartesian product $S^k$, and we stipulate that the set of all electoral outcomes is the set $\Pi(S)$ of all subsets of $S$. The criterion of obtaining consensus by majority rule (May, 1952; Campbell, 1982) can be viewed as a function $mr: S^k \to \Pi(S)$ such that for each profile $P \in S^k$,

$mr(P) = \{s_m \in S : f_m > 1/2\}$.

One of majority rule's appealing features is that, when it exists, the winner is unique and has the support of more than half the voters. A drawback, of course, is that with slates of more than one alternative, the rule $mr$ need not select a winner. In order to compare consensus methods, we find it useful to equip the alternatives in $S$ with an arbitrary, but fixed, standard order $s_1 > \cdots > s_n$. After an election, but before determining its outcome, we will for convenience relabel the alternatives in the election's profile so that the most frequent alternative is labeled $s_1$, the next most frequent $s_2$, and so on. Thus, after relabeling, we have $f_1 \ge \cdots \ge f_n$ where each frequency $f_m$ is associated with the corresponding alternative $s_m$ in standard order.
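
The majority-rule criterion $mr$ defined above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name `majority_rule` and the representation of a profile as a list of labels are hypothetical and do not come from the paper. Exact fractions are used so that a frequency of exactly 1/2 is not mistaken for a majority.

```python
from collections import Counter
from fractions import Fraction

def majority_rule(profile):
    """Return the set of alternatives whose frequency exceeds 1/2 (the rule mr)."""
    k = len(profile)
    counts = Counter(profile)
    return {s for s, c in counts.items() if Fraction(c, k) > Fraction(1, 2)}

# Hypothetical profiles: a clear majority, then a tie with no winner.
print(majority_rule(["a", "b", "a", "c", "a"]))
print(majority_rule(["a", "b", "a", "b"]))
```

The second call returns the empty set, which illustrates the drawback noted above: majority rule need not select a winner.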

Because the consensus methods that we consider are frequency-based, this relabeling has no effect on the results we present other than to simplify their presentation. Notice, for example, that since only $s_1$ can then have its frequency $f_1$ in the real interval $(1/2, 1]$, only $s_1$ can be the majority-rule winner for $P$. Put another way, after relabeling the rule $mr: S^k \to \Pi(S)$ declares

$\emptyset$ as the set of winners when $0 \le f_1 \le 1/2$, and

$\{s_1\}$ as the set of winners when $1/2 < f_1 \le 1$.

Majority rule can be modified to strengthen the criterion of winning so that, for example, any winner must have more than 75% of the votes. Let $x \in [0, 1)$ be the proportion of the interval $(1/2, 1]$ that must be added to 1/2 to determine the winning threshold. The modified majority rule is a function $mr_x: S^k \to \Pi(S)$ such that for each profile $P \in S^k$,

$mr_x(P) = \{s_m \in S : f_m > (1 + x)/2\}$.

With $mr_{1/2}$, $\{s_1\}$ is the set of winners whenever $f_1 > 0.75$. Observe that $mr_0 = mr$ while, when $x$ is sufficiently close to 1, $mr_x$ returns a winner only when an alternative is elected unanimously. Informally, $mr_x$ is a parameterized family of consensus methods which has the criteria of majority rule and unanimity as its extremes, and is required to declare at most one alternative as the winner.

Perhaps (after relabeling to put frequencies in standard order) a voting method such as $mr_x$ should be flexible enough to declare the two most popular alternatives as winners. Then it seems reasonable to require that $f_1 + f_2 > (1 + x)/2$ but, at the same time, $f_2$ should not be too small. Just as among the frequencies $f_1, \ldots, f_n$ only $f_1$ can lie in the interval $(1/2, 1]$, so among $f_2, \ldots, f_n$ only $f_2$ can lie in the interval $(1/3, 1/2]$. In general, if $m \in \{1, \ldots, n\}$ then $f_m$ lies in the interval $[0, 1/m]$, and only $f_m$ among the frequencies $(f_m, \ldots, f_n)$ can lie in the interval $(1/(m+1), 1/m]$. For us, it thus seems reasonable that a voting method declare $\{s_1\}$ as a set of winners when

$f_1 > (1 + x)/2$ and $f_1 \in (1/2, 1]$,

$\{s_1, s_2\}$ as a set of winners when $f_1 + f_2 > (1 + x)/2$ and $f_2 \in (1/3, 1/2]$, (1)

$\{s_1, s_2, s_3\}$ as a set of winners when $f_1 + f_2 + f_3 > (1 + x)/2$ and $f_3 \in (1/4, 1/3]$,

and so forth. This concept can be the basis of a general treatment of consensus methods for sequences, as we now explain.

3. Molecular Context

We consider consensus methods to be tools to summarize the distributions of alternatives (e.g. bases, amino acids) at the positions of an aligned set of molecular sequences. Typically the methods make three simplifying assumptions: analysis of molecular sequences is a multistage process in which sequence alignment precedes the identification of consensus sequences; an alignment of the molecular sequences has already been obtained; and aligned positions within molecular sequences can be

treated independently (Day & McMorris, 1992a). Thus the problem of finding a consensus of $k$ aligned molecular sequences, in which $n$ aligned positions have been identified, can be viewed as a set of $n$ simpler problems, each to find a consensus of $k$ symbols at an aligned position.

To model this simpler problem, we define a consensus method to be a function $cm: U \to V$ which maps each element of its domain $U$ to a suitable element of its codomain $V$. To specify $U$ and $V$, let $S$ be a finite set of symbols of interest. Although $S$ might be the set of amino acids, or any other finite set, we will often take $S$ to be the set $\{A, C, G, T\}$ of nucleic acid bases so as to place the consensus problem in a DNA context. Each element of $U$ represents a possible $k$-tuple of bases appearing at a given aligned position in each of $k$ molecules. As before, for any positive integer $k$, let $S^k$ denote the set of profiles of $S$ of length $k$. We will use $S^k$ as the domain of the consensus methods we investigate.

The codomain $V$ typically is a set of ambiguity codes such as those proposed by the Nomenclature Committee of the International Union of Biochemistry (1985). However, in order to emphasize the constituent bases, and their number, we will represent each such code by the subset of its bases. Thus (with a minor abuse of set-theoretic notation) we define the set of non-empty subsets of $S$ to be $\Pi'(S) = \{A, C, G, T, AC, AG, AT, CG, CT, GT, ACG, ACT, AGT, CGT, ACGT\}$, where AG denotes a purine base, CT denotes a pyrimidine base, and so on. Letting $\Pi(Z)$ denote the set of all subsets of $Z$, we will use $\Pi(\Pi'(S))$ as the codomain of the consensus methods we investigate. When $cm(P) = \emptyset$, the consensus method maps the profile $P$ to a symbol denoting the empty set: no ambiguity code is associated with $P$. Having given pre-eminence to $S = \{A, C, G, T\}$, we emphasize that nothing in the sequel depends on the particular symbols A, C, G, T, or on the fact that four symbols may appear in examples. What usually is of interest is not the set of available symbols, but rather the subset of symbols actually appearing in a profile.

So now we have that each consensus method is a function $cm: S^k \to \Pi(\Pi'(S))$ which maps a profile $P$ of length $k$ to a (possibly empty) subset of ambiguity codes from $\Pi'(S)$. When defining that function's value for $P$, it is convenient to refer to the frequency of occurrence of each base in $P$, and furthermore to relabel the bases of $P$ in such a way that frequencies have a standard order. As noted above, these assumptions cause no loss of generality since the consensus methods to be discussed depend on these frequencies, but not on physical or chemical properties of particular bases. Throughout, when referring to $S = \{A, C, G, T\}$, we adopt the alphabetical order A > C > G > T as a standard order of the bases, and we specify the frequencies with which bases occur in $P$ by a vector $f = f(P) = (f_1, f_2, f_3, f_4)$, with $1 = f_1 + f_2 + f_3 + f_4$, where: (i) the bases in the profile are relabeled so that $f_1 \ge f_2 \ge f_3 \ge f_4 \ge 0$, and (ii) $f_1$ is the frequency of A, $f_2$ of C, $f_3$ of G, and $f_4$ of T. Consider profile (G, G, G, C, T, G, C, A, G). To satisfy requirements (i) and (ii), interchange labels A and G to obtain $P$ = (A, A, A, C, T, A, C, G, A), so that $f(P) = (0.556, 0.222, 0.111, 0.111)$.
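
The relabeling step amounts to sorting the observed frequencies into non-increasing order, which the following minimal Python sketch illustrates for the example profile above. The function name `standard_order_frequencies` and the `alphabet` parameter are our own hypothetical choices, not notation from the paper.

```python
from collections import Counter
from fractions import Fraction

def standard_order_frequencies(profile, alphabet="ACGT"):
    """Return (f_1, ..., f_n) sorted in non-increasing order.

    Sorting the frequencies has the same effect as relabeling the bases
    so that the most frequent one plays the role of s_1, and so on.
    """
    k = len(profile)
    counts = Counter(profile)
    freqs = sorted((Fraction(counts[s], k) for s in alphabet), reverse=True)
    return tuple(freqs)

f = standard_order_frequencies("GGGCTGCAG")
print([round(float(v), 3) for v in f])  # [0.556, 0.222, 0.111, 0.111]
```

Keeping the frequencies as exact fractions (5/9, 2/9, 1/9, 1/9 here) avoids rounding problems when they are later compared with thresholds.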

4. Threshold Consensus Methods $th_x$

As usual, let $S$ have standard order $s_1 > \cdots > s_n$. For any profile $P \in S^k$, let its frequencies $(f_1, \ldots, f_n)$ also be in standard order so that $f_1 \ge \cdots \ge f_n$, where each $f_m$ is associated with the corresponding $s_m$. For any constant $x \in [0, 1)$, define the threshold consensus method to be a function $th_x: S^k \to \Pi(\Pi'(S))$ such that for each profile $P \in S^k$,

$th_x(P) = \{s_1 \cdots s_m \in \Pi'(S) : f_m > (m + x)/(m(m + 1))\}$. (2)

Table 1 illustrates the use of $th_x$ when $x = 0$ so that $th_0(P) = \{s_1 \cdots s_m \in \Pi'(S) : f_m > 1/(m + 1)\}$. At this extreme in the value of $x$, $th_0$ exhibits characteristics of majority rule since for each profile $P$, $mr(P) = mr_0(P) \subseteq th_0(P)$. At the other extreme, $th_x$ exhibits characteristics of unanimity since for each $k$ a small positive $\varepsilon$ exists such that, with any profile $P$, $mr_{1-\varepsilon}(P) = th_{1-\varepsilon}(P) = \{A\}$ if and only if $f(P) = (1, 0, 0, 0)$. Thus, as $x$ increases from 0 to 1, the consensus criterion used by $th_x$ becomes increasingly restrictive.

TABLE 1

Ambiguity codes returned by the consensus method $th_0$ when $S = \{A, C, G, T\}$

f(P_1)                      f(P_2)                      th_0(P_1) = th_0(P_2)
(0.28, 0.27, 0.25, 0.20)    (0.50, 0.33, 0.17, 0.00)    ∅
(0.51, 0.17, 0.16, 0.16)    (1.00, 0.00, 0.00, 0.00)    {A}
(0.34, 0.34, 0.16, 0.16)    (0.50, 0.34, 0.16, 0.00)    {AC}
(0.27, 0.27, 0.26, 0.20)    (0.48, 0.26, 0.26, 0.00)    {ACG}
(0.25, 0.25, 0.25, 0.25)    (0.37, 0.21, 0.21, 0.21)    {ACGT}
(0.51, 0.34, 0.08, 0.07)    (0.66, 0.34, 0.00, 0.00)    {A, AC}
(0.34, 0.34, 0.26, 0.06)    (0.40, 0.34, 0.26, 0.00)    {AC, ACG}
(0.26, 0.26, 0.26, 0.22)    (0.27, 0.26, 0.26, 0.21)    {ACG, ACGT}

The frequencies in the first two columns represent profiles of 100 bases. Profile $P_1$ has the minimum possible frequencies $f_i$ (as $i$ increases) that still yield the consensus result in column 3; $P_2$ has the maximum possible frequencies.
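
Definition (2) translates directly into code. The sketch below is our own illustration rather than the authors' implementation: the names `th` and `from_percent` are hypothetical, frequencies are assumed to be supplied already in standard order, and the result is returned as a list of code strings instead of a set. It reproduces a few rows of Table 1 and shows one profile at a larger value of $x$.

```python
from fractions import Fraction

def th(freqs, x=0, labels="ACGT"):
    """Threshold consensus th_x for frequencies already in standard order.

    Returns every ambiguity code s_1...s_m with f_m > (m + x) / (m (m + 1)).
    """
    x = Fraction(x)
    codes = []
    for m, f_m in enumerate(freqs, start=1):
        if f_m > (m + x) / (m * (m + 1)):
            codes.append(labels[:m])
    return codes

def from_percent(*counts):
    """Frequencies of a profile of 100 bases, given as counts per base."""
    return [Fraction(c, 100) for c in counts]

print(th(from_percent(50, 33, 17, 0)))                   # []
print(th(from_percent(51, 34, 8, 7)))                    # ['A', 'AC']
print(th(from_percent(26, 26, 26, 22)))                  # ['ACG', 'ACGT']
print(th(from_percent(66, 34, 0, 0), x=Fraction(1, 4)))  # ['A']
```

The last call suggests how raising $x$ tightens the criterion: the profile that yields {A, AC} under $th_0$ yields only {A} under $th_{1/4}$.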

Interesting features of $th_x$ follow from simple results about a profile's frequency distribution.

LEMMA 1

Suppose $x \in [0, 1)$ and $P \in S^k$. If $th_x(P)$ returns an ambiguity code of length $m > 0$, then

$f_1 + \cdots + f_m > (m + x)/(m + 1)$ (3)

and

$f_{m+1} + \cdots + f_n < (1 - x)/(m + 1)$. (4)

Proof. Using (2) we have $f_1 + \cdots + f_m \ge m f_m > (m + x)/(m + 1)$. Then $f_{m+1} + \cdots + f_n = 1 - (f_1 + \cdots + f_m) < 1 - (m + x)/(m + 1) = (1 - x)/(m + 1)$.
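
As a numerical illustration of Lemma 1, the following Python sketch evaluates both partial sums for each code length that criterion (2) accepts and checks them against (3) and (4). The helper name `lemma1_sums` is hypothetical, and the frequency vector is simply one row of Table 1 chosen for illustration.

```python
from fractions import Fraction

def lemma1_sums(freqs, x):
    """For each code length m accepted by criterion (2), return
    (m, f_1 + ... + f_m, f_{m+1} + ... + f_n) after checking (3) and (4)."""
    rows = []
    for m in range(1, len(freqs) + 1):
        if freqs[m - 1] > (m + x) / (m * (m + 1)):
            head, tail = sum(freqs[:m]), sum(freqs[m:])
            assert head > (m + x) / (m + 1)   # inequality (3)
            assert tail < (1 - x) / (m + 1)   # inequality (4)
            rows.append((m, head, tail))
    return rows

# Illustrative frequencies, already in standard order, with x = 0.
f = [Fraction(c, 100) for c in (51, 34, 8, 7)]
for m, head, tail in lemma1_sums(f, Fraction(0)):
    print(m, head, tail)
```

For this vector the codes of lengths 1 and 2 are returned; their leading sums are 51/100 and 17/20, and the trailing sums are 49/100 and 3/20, consistent with the bounds 1/2 and 2/3 from (3) and 1/2 and 1/3 from (4).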

At (1) we stressed that an ambiguity code of length $m$ should satisfy a requirement that $f_1 + \cdots + f_m > (1 + x)/2$. Every code returned by $th_x$ necessarily satisfies that requirement.

PROPOSITION 2

If $x \in [0, 1)$, $P \in S^k$, and $s_1 \cdots s_m \in th_x(P)$, then $f_1 + \cdots + f_m > (1 + x)/2$.

Proof. Since $x < 1$ and $m \ge 1$, we have $(m + x)/(m + 1) - (1 + x)/2 = (m - 1)(1 - x)/(2(m + 1)) \ge 0$, so the result follows from (3). ∎

Table 1 shows that $th_x$ can return zero, one, or two ambiguity codes. We now show that $th_x$ cannot return results such as {A, ACG}, {A, ACGT}, or {AC, ACGT}, in which code lengths are not consecutive integers.

PROPOSITION 3

If $x \in [0, 1)$, $P \in S^k$, and $th_x(P) \ne \emptyset$, then the set of lengths of the ambiguity codes in $th_x(P)$ forms a set of consecutive integers.

Proof. If $|th_x(P)| \le 1$ the result is trivial, so suppose two codes in $th_x(P)$ have lengths $m$ and $m + j$ where $m > 0$ and $j > 0$. We assert that $j = 1$. Observe that $f_1 + \cdots + f_{m+j} \ge m f_m + j f_{m+j} > m/(m + 1) + j/(m + j + 1)$. Thus $f_{m+j+1} + \cdots + f_n = 1 - (f_1 + \cdots + f_{m+j}) < 1 - m/(m + 1) - j/(m + j + 1)$ so that $0 \le f_{m+j+1} + \cdots + f_n < (m(1 - j) + 1)/((m + 1)(m + j + 1))$. Thus $j = 1$, for if $j > 1$ then $m(1 - j) + 1 \le 0$ so that $0 \le f_{m+j+1} + \cdots + f_n < 0$, a contradiction. ∎

COROLLARY 4

If $x \in [0, 1)$ and $P \in S^k$, then $|th_x(P)| \le 2$.

Actually $th_x$ can return two ambiguity codes only when $x$ is sufficiently small.
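
Proposition 3 and Corollary 4 can also be checked by brute force on small cases. The Python sketch below enumerates every frequency vector in standard order for profiles of a fixed length and confirms that criterion (2) never yields more than two code lengths, nor two non-consecutive ones. The function names, the profile length $k = 30$, and the sampled values of $x$ are all our own arbitrary choices for illustration.

```python
from fractions import Fraction

def count_vectors(k, n, cap=None):
    """Yield all non-increasing tuples of n non-negative counts summing to k."""
    cap = k if cap is None else cap
    if n == 1:
        if k <= cap:
            yield (k,)
        return
    for first in range(min(k, cap), -1, -1):
        for rest in count_vectors(k - first, n - 1, first):
            yield (first,) + rest

def code_lengths(counts, k, x):
    """Lengths m with f_m > (m + x) / (m (m + 1)), i.e. criterion (2)."""
    return [m for m, c in enumerate(counts, start=1)
            if Fraction(c, k) > (m + x) / (m * (m + 1))]

def check_consecutive(k=30, n=4,
                      xs=(Fraction(0), Fraction(1, 10), Fraction(1, 4), Fraction(1, 2))):
    """True if no vector yields more than two codes or non-consecutive lengths."""
    for x in xs:
        for counts in count_vectors(k, n):
            lengths = code_lengths(counts, k, x)
            if len(lengths) > 2:
                return False
            if len(lengths) == 2 and lengths[1] - lengths[0] != 1:
                return False
    return True

print(check_consecutive())
```

Such a check is no substitute for the proof, but it is a quick way to catch a transcription error in the threshold formula.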

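PROPOSITION 5

Suppose $x \in [0, 1)$ and $P \in S^k$. If $th_x(P)$ returns ambiguity codes of positive lengths $m$ and $m + 1$, then $x < 1/(m + 3)$.

Proof. Using (2) and (4), $(m + 1 + x)/((m + 1)(m + 2)) < f_{m+1} \le f_{m+1} + \cdots + f_n < (1 - x)/(m + 1)$, so that $m + 1 + x < (1 - x)(m + 2)$ and hence $x < 1/(m + 3)$. ∎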
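
The bound of Proposition 5 can be seen at work for $m = 1$, where it reads $x < 1/4$. In the Python sketch below, the frequency vector (5/8, 3/8, 0, 0) is a hypothetical example of our own choosing, and `code_lengths` is an illustrative helper rather than anything defined in the paper.

```python
from fractions import Fraction

def code_lengths(freqs, x):
    """Lengths m of the codes returned by th_x, per criterion (2)."""
    return [m for m, f_m in enumerate(freqs, start=1)
            if f_m > (m + x) / (m * (m + 1))]

f = (Fraction(5, 8), Fraction(3, 8), Fraction(0), Fraction(0))
print(code_lengths(f, Fraction(24, 100)))  # [1, 2]: x is just below 1/4
print(code_lengths(f, Fraction(1, 4)))     # []: at x = 1/4 both thresholds are met only with equality
```

At $x = 1/4$ the two thresholds are 5/8 and 3/8, which sum to 1, so no frequency vector can exceed both strictly.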

COROLLARY 6

If $x \ge 0.25$, $th_x$ returns at most one ambiguity code.

A desirable property of any consensus method is that its ambiguity codes for a profile $P$ provide insights about the pattern of frequencies in $f(P)$. For example, the method of Choo et al. (1991) requires that a large gap exists between consecutive frequencies in $f(P)$: the method returns an ambiguity code of length one if and only if $f_1/f_2 \ge 3$. Other methods (Cavener, 1987; Shapiro & Senapathy, 1987) include requirements of the form $f_m/f_{m+1} > 2$. When $th_x$ returns an ambiguity code of length $m$, the gap between $f_m$ and $f_{m+1}$ can be described in terms of either differences or ratios.

PROPOSITION 7

Suppose $x \in [0, 1)$ and $P \in S^k$. If $th_x(P)$ returns an ambiguity code of length $m > 0$, then

$f_m - f_{m+1} > x/m$ (5)

and

$f_m/f_{m+1} > (m + x)/(m(1 - x))$. (6)

Proof. Using (2) and (4), $f_m - f_{m+1} > (m + x)/(m(m + 1)) - (1 - x)/(m + 1) = x/m$. Using (2) and (4) again, $f_m/f_{m+1} > (m + x)/(m(m + 1)f_{m+1}) > (m + x)/(m(1 - x))$. ∎

When $x = 0.5$ and $m = 1$, (5) and (6) show that $f_1 - f_2 > 0.5$ and $f_1/f_2 > 3$; these bounds are tight when $f_1$ is slightly greater than 0.75 and $f_2$ is slightly less than 0.25.

5. Discussion

When analyzing $k$ molecular sequences, the decision to return as a consensus result an ambiguity code of length one may be based on a threshold criterion of the form $f_1 > c$ for a constant $c \ge 0.5$ (e.g. Yamauchi, 1991). Such a consensus method returns the same ambiguity code of length one as does $th_{2c-1}$. If the criterion has the form $f_1 \ge c$ for a constant $c > 0.5$ (e.g. McGeoch, 1990; Grasser & Feix, 1991; Sayers & Eckstein, 1991), the consensus method returns the same ambiguity code of length one as does $th_{2c-1-\varepsilon}$ where $\varepsilon$ is a small positive constant depending on $k$. However, $th_{2c-1}$ and $th_{2c-1-\varepsilon}$ have the desirable feature that they also generalize the threshold criteria so as to return ambiguity codes of all lengths.

Yamauchi's consensus method (1991), $ya$, returns code: A if $f_1 > 0.5$; AC if $f_1 \le 0.5$ and $f_1 + f_2 > 0.75$; and $\emptyset$ otherwise. Although $ya$ and $th_0$ return identical ambiguity codes of length one, they may differ at the other lengths: thus $ya(P) = \{AC\} \ne \emptyset = th_0(P)$ when $f(P) = (0.50, 0.26, 0.12, 0.12)$, while $ya(P) = \emptyset \ne \{AC\} = th_0(P)$ when $f(P) = (0.34, 0.34, 0.16, 0.16)$. This incompatibility arises because $ya$'s criterion at length one ($f_1 > 0.5$) requires one parameter value ($x = 0$) for $th_x$, while its criterion at length two ($f_1 + f_2 > 0.75$) requires a different value ($x = 0.25$).

When compared with the consensus methods of Choo et al. (1991) and Cavener (1987), members of the family $th_x$ exhibit conformity in the way they return ambiguity

codes of given length. We will use the terminology of Day & McMorris (1992b). Let $m$ be the ambiguity-code length of interest. For consensus methods $cm$ and $cm'$, define the relation $<_m$ [...]. The method of Choo et al. (1991), $ch$, returns code A if and only if $f_1/f_2 \ge 3$. When $A \in th_{1/2}(P)$, (6) establishes that $f_1/f_2 > 3$ so that $A \in ch(P)$. Since $th_{1/2}(P) = \emptyset \ne \{A\} = ch(P)$ when $f(P) = (0.6, 0.2, 0.2, 0)$, we have that $th_{1/2} <_1 ch$.

The method of Cavener (1987), $ca$, returns code: A if $f_1 > 0.5$ and $f_1/f_2 > 2$; AC if ($f_1 < 0.5$ or $f_1/f_2 < 2$) and $f_1 + f_2 > 0.75$; and ACGT, otherwise. When $A \in th_{1/3}(P)$, (3) and (6) establish that $f_1 > 0.5$ and $f_1/f_2 > 2$ so that $A \in ca(P)$. Since $th_{1/3}(P) = \emptyset \ne \{A\} = ca(P)$ when $f(P) = (0.6, 0.2, 0.2, 0)$, we have that $th_{1/3} <_1 ca$. When $AC \in th_{1/3}(P)$, (2) establishes that $f_2 > 1/3$ so that $f_1/f_2 < 2$, and since $f_1 + f_2 > 0.75$ by (3), we then have $AC \in ca(P)$. Since $th_{1/3}(P) = \emptyset \ne \{AC\} = ca(P)$ when $f(P) = (0.50, 0.26, 0.12, 0.12)$, we have that $th_{1/3} <_2 ca$. Since $th_{1/2}$ and $th_{1/3}$ conform to $ch$ and $ca$, they exhibit features of $ch$ and $ca$ which then can be extended within the $th_x$ paradigm so as to return ambiguity codes of all lengths.

For $th_x$, as for the other frequency-based methods we mentioned, the problem of calculating the consensus results for profiles of length $k$ can be solved easily by algorithms requiring time and space of order at most $k$. The threshold consensus methods $th_x$ are not restricted to analyzing particular sets ($S$) of symbols, and they do not incorporate rules that are peculiar to particular biological contexts. Consequently the $th_x$ methods can be used to analyze DNA, RNA, or amino acid sequences. In a more general setting they might be used to analyze matrices of nominal data where, for example, a matrix column gives responses of experimental subjects to a particular multiple-choice question.

We conclude by asking several promising open questions. What axioms do the $th_x$ consensus methods satisfy? Are there interesting axiomatic characterizations of $th_x$?
As profile length increases, and when profiles are randomly selected, what are the probabilities that $th_x$ returns consensus results containing zero, one, or two ambiguity codes? Are there other sensible ways (but still based on threshold or gap criteria) to define consensus methods for molecular sequences?

REFERENCES

BAINS, W. (1986). MULTAN: a program to align multiple DNA sequences. Nucl. Acids Res. 14, 159-177.
BAINS, W. (1989). MULTAN (2), a multiple string alignment program for nucleic acids and proteins. Comput. Appl. Biosci. 5, 51-52.
CAMPBELL, D. E. (1982). On the derivation of majority rule. Theor. Decision 14, 133-140.
CAVENER, D. R. (1987). Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucl. Acids Res. 15, 1353-1361.
CAVENER, D. R. & RAY, S. C. (1991). Eukaryotic start and stop translation sites. Nucl. Acids Res. 19, 3185-3192.
CHOO, K. H., VISSEL, B., NAGY, A., EARLE, E. & KALITSIS, P. (1991). A survey of the genomic distribution of alpha satellite DNA on all the human chromosomes, and derivation of a new consensus sequence. Nucl. Acids Res. 19, 1179-1182.

DANIELS, G. R. & DEININGER, P. L. (1991). Characterization of a third major SINE family of repetitive sequences in the galago genome. Nucl. Acids Res. 19, 1649-1656.
DAY, W. H. E. (1986). Foreword: comparison and consensus of classifications. J. Classification 3, 183-185.
DAY, W. H. E. (1988). Consensus methods as tools for data analysis. In: Classification and Related Methods of Data Analysis (Bock, H. H., ed.) pp. 317-324. Amsterdam: Elsevier.
DAY, W. H. E. & McMORRIS, F. R. (1992a). Consensus sequences based on plurality rule. Bull. math. Biol. 54, 1057-1068.
DAY, W. H. E. & McMORRIS, F. R. (1992b). Critical comparison of consensus methods for molecular sequences. Nucl. Acids Res. 20, 1093-1099.
GRASSER, K. D. & FEIX, G. (1991). Isolation and characterization of maize cDNAs encoding a high mobility group protein displaying a HMG-box. Nucl. Acids Res. 19, 2573-2577.
JURKA, J. & MILOSAVLJEVIC, A. (1991). Reconstruction and analysis of human Alu genes. J. molec. Evol. 32, 105-121.
KOLODRUBETZ, D. (1990). Consensus sequence for HMG1-like DNA binding domains. Nucl. Acids Res. 18, 5565.
MAY, K. O. (1952). A set of independent, necessary and sufficient conditions for simple majority decision. Econometrica 20, 680-684.
McGEOCH, D. J. (1990). Protein sequence comparisons show that the "pseudoproteases" encoded by poxviruses and certain retroviruses belong to the deoxyuridine triphosphatase family. Nucl. Acids Res. 18, 4105-4110.
NOMENCLATURE COMMITTEE OF THE INTERNATIONAL UNION OF BIOCHEMISTRY (NC-IUB). (1985). Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Eur. J. Biochem. 150, 1-5. Also (1986). J. Biol. Chem. 261, 13-17.
PILLER, K. J., BAERSON, S. R., POLANS, N. O. & KAUFMAN, L. S. (1990). Structural analysis of the short length ribosomal DNA variant from Pisum sativum L. cv. Alaska. Nucl. Acids Res. 18, 3135-3145.
SAYERS, J. R. & ECKSTEIN, F. (1991). A single-strand specific endonuclease activity copurifies with overexpressed T5 D15 exonuclease. Nucl. Acids Res. 19, 4127-4132.
SHAPIRO, M. B. & SENAPATHY, P. (1987). RNA splice junctions of different classes of eukaryotes: Sequence statistics and functional implications in gene expression. Nucl. Acids Res. 15, 7155-7174.
WATERMAN, M. S. (1986). Multiple sequence alignment by consensus. Nucl. Acids Res. 14, 9095-9102.
WATERMAN, M. S. (1989a). Sequence alignments. In: Mathematical Methods for DNA Sequences (Waterman, M. S., ed.) pp. 53-92. Boca Raton: CRC Press.
WATERMAN, M. S. (1989b). Consensus patterns in sequences. In: Mathematical Methods for DNA Sequences (Waterman, M. S., ed.) pp. 93-115. Boca Raton: CRC Press.
WATERMAN, M. S., ARRATIA, R. & GALAS, D. J. (1984). Pattern recognition in several sequences: consensus and alignment. Bull. math. Biol. 46, 515-527.
YAMAUCHI, K. (1991). The sequence flanking translational initiation site in protozoa. Nucl. Acids Res. 19, 2715-2720.