Interpreting Consensus Sequences Based on Plurality Rule WILLIAM H. E. DAY
Department of Computer Science, Memorial University of Newfoundland, St. John's, Newfoundland AIC 5S7, Canada AND F. R. McMORRIS
Department of Mathematics, Universityof Louisville, Louisville, Kentucky 40292 Received 13 September 1991; revised 18 March 1992
ABSTRACT Our goal is to help researchers interpret the results of a function, based on the concept of plurality rule, that calculates a consensus of a profile of molecular bases. By expressing the plurality rule function as a composition of simpler functions, we obtain both an algorithm to calculate the consensus result and an upper bound on the number of nonequivalent results. Consequently, when used to analyze molecular sequences such as DNA or RNA, the plurality rule function yields at most 48 nonequivalent consensus results. For problems of reasonable size, we describe an algorithm to calculate the probability that each consensus result would occur if the bases were equally likely to appear at every position of the plurality rule function's input profile.
1.
INTRODUCTION
Consensus functions are recognized today as valuable tools for data analysis [4], especially w h e n some sort o f data aggregation is desired. T h e basic constituents o f a consensus function are easily described in the following voting context. A n electorate of k voters participates in an election involving a slate S of alternatives. T h e votes cast in the election are specified by a profile P = (Pl ..... Pk) in which Pi is the alternative selected by voter i. F o r each alternative a ~ S, let na(P) be the n u m b e r of occurrences of a in P, while N ( P ) = maxa~sna(P). The set of all profiles is the k-fold Cartesian p r o d u c t S~; the set of all possible electoral o u t c o m e s is the set I I ( S ) o f all n o n e m p t y subsets of S. Thus the electoral m e t h o d o f obtaining consensus by plurality rule [11-14] is the f u n c t i o n / ~ : S k ---, I I ( S ) such t h a t / ~ ( P ) = {a ~ S: ha(P) = N ( P ) } for every profile P ~ S k.
MATHEMATICAL BIOSCIENCES 111:231-247 (1992) ©Elsevier Science Publishing Co., Inc., 1992 655 Avenue of the Americas, New York, NY 10010
231
0025-5564/92/$5.00
232
WILLIAM H. E. DAY AND F. R. McMORRIS
Using this voting theory paradigm, consensus functions have been proposed for, and applied in, contexts relevant to biologists [1, 2, 8, 9]. Recently we [5] continued in this tradition by developing plurality rule consensus functions for analyzing molecular sequences. In order to reveal the underlying features of these functions, we made simplifying assumptions that we intend to remove in subsequent investigations. We assumed that such an analysis is a multistage process in which sequence alignment precedes the identification of consensus sequences, that an alignment of the molecular sequences has already been calculated, and that aligned positions within molecular sequences can be treated independently. Thus the problem to find a consensus of k aligned molecular sequences in which n aligned positions have been identified becomes a set of n simpler problems, each to find a consensus of k alternatives at one aligned position. When researchers employ such consensus functions, how might they assess the significance of the results? This is the problem addressed in this paper. The consensus model must be adjusted in order to analyze aligned positions within molecular sequences. Although generally S is a nonempty set of s alternatives, a particularly interesting case is when S = {A,C, G,T} and the alternatives are the nucleic acid bases adenine, cytosine, guanine, and thymine. When describing a consensus of such aligned bases, biologists may use ambiguity codes such as R (for purine base) or N (for any base). To accommodate ambiguity codes as consensus elements, we extend the basic consensus model to allow each code to be designated by its corresponding subset of bases. Thus a profile's consensus may contain any of the subsets (shown with corresponding ambiguity codes defined by the International Union of Biochemistry [10l): {A) = A, (C} = C, (G} = G, {T) = T, (A,G} = R, (C,T} = Y, (A,C} = M, {G,T} = K, {A,T} = W, {C,G} = S, {A,C,G} = V, {A,C,T} = H, {A,G,T}= D, {C,G,T}= B, {A,C,G,T}= N. When II(S) denotes the set of all nonempty subsets of S, II2(S) = II(II(S)) becomes the codomain of consensus functions for sequences. In this setting, a consensus function is a function f: S k ~ II2(S) whose domain contains s k profiles and whose codomain contains 22,- i _ 1 subsets of ambiguity codes. We will use the term consensus result to mean an element of f ( P ) for some profile P. But how can researchers interpret a consensus of sequences? When s = 4, will they comprehend the significance of any one among 32,767 possible consensus results? Our goals are to identify which results are possible (Sections 2 and 3) and which are likely (Section 4). A plurality rule consensus function can be reduced to a composition of simpler ones in which nonessential detail has been removed. To accomplish this, we exploit the invariance of plurality rule results to
INTERPRETING CONSENSUS SEQUENCES
233
certain manipulations of the input [7, 11, 13]. A consensus function f is anonymous if, for all permutations tr of {1,...,k}, f ( ( P l ..... P~))= f((P~(1) ..... p,~(~))). For ~- a permutation of S and for any X ~ II(S), let ~-(X) be the subset of S obtained by applying ~" to each alternative of X. Similarly, for any Y ~ lie(s), let r ( Y ) be the subset of II(S) obtained by applying r to each element (i.e., ambiguity code) of Y. Yet again, for any profile P ~ S k, let ~'(P) be the profile of S ~ obtained by applying ~ to each element of P. A consensus function f is neutral if f(~'(P)) = r ( f ( P ) ) for every profile P ~ S k and every permutation ~" of S. A consensus function is symmetric if it is anonymous and neutral. In Section 2 we describe how a consensus function's domain and codomain can be simplified by partitioning them into classes such that elements of the same class differ only by nonessential labelings of bases. Consequently, when s = 4 , the consensus codomain has at most 5257 nonequivalent results. Will researchers comprehend the significance of any one among so many possibilities? The bound on the number of nonequivalent results can be improved by exploiting features of a definition of plurality rule consensus in terms of balanced profiles [5]. The concept of balanced profile is based on an idea that the frequencies with which bases appear in such profiles should be as equal as possible. For any profile P = (Pl ..... Pk) ~ S ~, let the projection of P into II(S) be the set F ( P ) = {x ~ S: x = Pi for some 1 ~
d(P,Q)=
~_, ~(Pi,qi). l<~i<~k
Thus the distance between P and Q counts the positions at which they disagree. Of particular interest for P ~ S k is the distance to its nearest balanced neighbors: D(P)-
min Q ~ B(S,k)
d(e,o).
234
WILLIAM H. E. DAY AND F. R. McMORRIS
Finally, define the plurality rule for sequences to be the consensus function/~ : S ~ ~ II2(S) such that for every profile P ~ S ~, /e(P)={X~11(S):X=F(Q)
forQ~B(S,k),d(P,Q)=D(P)}.
When S ={A,C,G,T} and P = (A,G,A,A), D(P)= 1 and {(A,A,A,A), ( G , G , A , A ) , ( A , G , G , A ) , (A,G,A,G)} is the set of balanced profiles closest to P; thus p ( P ) = {{A},{A, G}}. To simplify notation hereinafter, denote each ambiguity code by the sequence of its bases so that / e ( P ) = {A, AG}. In Section 3, we improve the bound on the number of nonequivalent plurality rule consensus results. Two characteristics of a profile, distance pattern and type pattern, determine the profile's plurality rule consensus result. We will show that profiles of s bases exhibit at most 22s- 1 _2s+ 1 + 2s + 1 distinct combinations of patterns, and so when s = 4, at most 105 nonequivalent results appear in the range of the plurality rule function. This bound overestimates the actual number of results: Some combinations are not exhibited by any profile, and several combinations may map to a single result. Because when s = 4, at most 48 nonequivalent results appear in the range of the plurality rule function, researchers can expect to comprehend the significance of any particular result. Indeed, for problems of reasonable size, when s = 1 ..... 4 and k = 1.... ,100, only 70 combinations of patterns actually occur, and only 26 nonequivalent results actually occur in the range of the plurality rule function (see Table 4). In Section 4, we investigate how probabilities are distributed on the range of the plurality rule function when each of the s bases occurs with probability s-1 at each of the k positions of a profile. Two hypotheses based on this model seemed reasonable to us: The fewer the bases of an ambiguity code, the less probable that code's occurrence in a consensus result [e.g., prob(s, k, {ACGT}) > prob(s, k, {ACG})]; and the fewer the ambiguity codes in a consensus result, the more probable that result's occurrence [e.g., prob(s, k, {ACG}) > prob(s, k, {ACG, ACGT})]. To test them, we designed an algorithm to calculate the probabilities of occurrence of all plurality rule consensus results for given s and k. The algorithm, being based on an enumeration of partitions of the integer k, can solve problems of reasonable size, and so we calculated the exact probabilities of plurality rule consensus results for all combinations of s = 1,...,5 and k = 1..... 100 (see Tables 4 and 5). Since the results support only one of the hypotheses, we discuss trends suggested by the data of this study. Impatient readers might inspect Tables 4 and 5 and then turn to the concluding remarks in Section 5.
INTERPRETING CONSENSUS SEQUENCES 2.
235
EFFECTS OF SYMMETRY
When a consensus function f: S k ---, II2(S) is symmetric, its domain and codomain can be partitioned into classes such that elements in each class differ only in nonessential labelings of bases. Call any two profiles P and Q of S k congruent (and write P ~ Q) if there is a permutation ~of S such that n a ( P ) = na(~(Q)) for all a ~ S. Clearly ~ is an equivalence relation on S k, and we let [P] be the equivalence class to which P belongs, while [S k] is the set of all such equivalence classes of S k. A type of congruence applies as well to II2(S). Call any two consensus results Y and Z of II2(S) congruent (and write Y = Z ) if there exists a permutation ~- of S such that Y = r ( Z ) . Since --- is an equivalence relation on H2(S), we will let [Z] be the equivalence class to which any consensus result Z belongs and let [II2(S)] be the set of all such equivalence classes of H2(S). We will use [S k] and [II2(S)] to evaluate the original consensus function f: s k ~ II2(S) by a composition in which nonessential structure of the domain and codomain is suppressed. At this stage we can illustrate the composition by the diagram
s -, [ s ] -, [
--,
It will be convenient to denote each equivalence class of [S g] or [II2(S)] by a canonically labeled element, or leader, of that class. In order to do this, we will impose an arbitrary but fixed ordering on S = { a 1..... a~} so that a~ < a 2 < --. < a s. From now on, we will refer to this as the canonical ordering of S. When S = {A, C, G, T}, our canonical order will be A < C < G < T. A profile P = (p~ .... , p k ) E S k is a (profile) leader if Pi ~ Pi+ 1 for 0 < i < k, and n a ( P ) >1 n a + ( P ) for 0 < i < s. Thus ( A , A , C , C , T ) is a leader, but ( C , C , A , A , T ) and ( A , A , C , T , T ) are not, when S = { A , C , G , T } and the ordering is as above. Let A(S k) be the set of leaders of S k. Since each equivalence class of S k has a unique leader of A(Sk), and since each leader is associated with a unique class, a natural bijection between A(S ~) and [S k ] exists and permits replacement of [S k ] by A ( s k ) . Consequently, define the function 3: S ~ ~ A ( S k) such that 6 ( P ) is the leader of [P] for every profile P ~ S k. Thus our decomposition of f is illustrated by the diagram
s
L a(s
--,
--, ri
(s).
Next we consider equivalence classes of I-[2(S). Since the elements of consensus results are ambiguity codes, the labeling conventions are somewhat more complex. Relative to the canonical order on S, write each ambiguity code X of length n in canonical form as a word
236
WILLIAM H. E. DAY AND F. R. McMORRIS
XlX 2 "''X n
s u c h t h a t x 1 < x 2 < --. <
x n.
For
example, A C T
is t h e c a n o n i -
form of the ambiguity code {C,A,T}. Assume that all ambiguity codes o f the same length are themselves lexicographically ordered relative to the canonical order on S, then concatenate these linear orders into a standard order in which all ambiguity codes of length 1 come first, then all of length 2, and so on. (See Table 1 for three standard orders of particular interest.) For any ambiguity code X , let r ( X ) be its rank f r o m the right in the standard order; thus, when S = {A, C,G,T}, r(ACGT) = 0 and r ( A ) = 14. Using ranks, characterize any consensus result Z = {z I..... Zn} by a unique integer cal
i n t ( Z ) = 2 r(z,) + 2 r(z2) + ... + 2 r(z.).
A result Z ~ IJ2(S) is a (consensus) leader if i n t ( Z ) = max{int(Y): Y ~ [Z]}. Let qb(II2(S)) be the set of leaders of II2(S). As before, a natural bijection exists between ~(II2(S)) and [H2(S)] and permits replacement of [II2(S)] by ~(II2(S)). Our decomposition of f is now
s* L To relate the labeling of the consensus result to that of the original profile, we impose on any (canonically labeled) consensus leader the particular labeling of bases exhibited by the profile. For any profile P ~ S k, let ~r be a permutation of {1,..., k} and z be a permutation of S such that 8 ( P ) = r ( o ' ( P ) ) . Define the function ~b: qb(II2(S))~ [ I 2 ( S ) such that 4ffZ) = z-1(Z) for every consensus leader Z ~ ~(II2(S)).
L n (s)
s L
As we indicated previously, profile leaders are disguised partitions of integers. Recall that a partition o f integer k into m parts is an m-tuple ot = ( a 1..... a m) satisfying k = a~ + --. + ffm
and
a~ ~ "" ~ a m >1 1.
TABLE l Standard Orders of Ambiguity Codes S
Standard order
{A,C} {A,C, G} {A,C,G,T}
(A,C, AC), or A < C < AC (A,C,G, AC,AG, CG, ACG), or A < C < G < AC < '-. < A C G (A, C, G, T, AC, AG, AT, CG, CT, GT, ACG, ACT, AGT, CGT, ACGT)
237
INTERPRETING CONSENSUS SEQUENCES
A leader P e A ( S k) induces a partition a =(n,~(P) ..... n~m(P)) of k into m parts if k -~ n a l ( P ) -t- " . + na~,(e )
nam(P ) >11.
and
Let I~ be the set of all partitions of k into at most s parts, and define the function ~-: A(S k) ~ 14 such that ~-(P) is the partition of k induced by P for every leader P ~ A(Sk). Finally, we can explicitly determine our decomposition of f:
s L a(s
z,
L
L
Since 7r is a natural bijection between A(S g) and I~, the sets [ski, A(sk), and I~ all have the same cardinality. Indeed, if P~ denotes the number of partitions of k into exactly m parts, well-known recurrence relations for partitions of an integer [3, p. 48] establish that I[S~]l = [A(sk)l = [I~1 = P~+s" Although we have no analogous general result for the cardinality of [I-I2(S)] and ~(II2(S)), bounds exist for several interesting values of s. PROPOSITION 1
When S contains one, two, three, or four bases, the cardinality of [II2(S)] and ~(II2(S)) is at most 1, 5, 55, or 8521, respectively. Proof. Here are details for S={A,C,G,T}; the other proofs are similar but simpler. Specify candidates for leaders of II2(S) in such a way as to bound their number. Let G = {G1,G2,G3,G4} be the partition of II(S) in which G 1 --- {A,C,G,T}, G 2 = {AC, AG, AT, CG, CT, GT}, G 3 = {ACG, ACT, AGT, CGT}, and G 4 = {ACGT}. Consider the set L = L 1 u L 2 U L 3 U L4, where t l ~-- {X E r I 2 ( S ) : X n G 1 ~ {{A} , { A , C } , { A , C , G } , {A,C,G,T}} },
L 2 = {X ~ IIz(S): X n G I = f ~ ,
X n G 2 ~ { {AC}, {AC,AG}, {AC,AT}, {AC,AG,AT}, {AC, AG, CG}, {AC, AG, CT}, {AC, AG, CT, GT}, {AC, AG, CT, GT}, {AC, AG, AT, CG, CT}, {AC, AG, AT, CG, CT, GT} } }, L 3 = {X E I ] 2 ( S ) : X N ( G 1 U Gz) = 0 , X N G 3 E { {ACG}, { A C G , A C T } , { A C G , A C T , A G T } ,
{ACG, ACT, AGT, CGT} } },
238
WILLIAM H. E. DAY A N D F. R. M c M O R R I S
and t 4 = {{ACGT}}.
By construction the L i a r e pairwise disjoint. The reader can verify that to every element Z of II2(S) there corresponds a permutation ~- of S such that r ( Z ) ~ L. For example, when Z - { C G , GT, A G T , CGT, ACGT}, the permutation r = ( A G ) ( C T ) is such that 7 ( Z ) = {AC, AT, A C G , ACT, ACGT} ~ L 2. Thus L contains a set of leaders of II2(S), and ILl = I L I I + ILel+ IL3/+ IL 4] = 4(211)+ 10(25)+4(21)+ 1 = 8521. • The p r o o f used a particular standard order of ambiguity codes. Using a different standard order based on G2GIG3G 4 instead of G1G2G3G 4 yields a tighter bound. COROLLARY
When S contains one, two, three, or four bases, the cardinality of [II2(S)] and ~ ( H 2 ( S ) ) is at most 1, 5, 55, 5257, respecticely. 3.
EFFECTS OF PLURALITY RULE
We have seen that a symmetric consensus function f: S k ~ II2(S) can be viewed as a composition f = ~b o g o ~- o 6 of simpler functions, where g: I~ --* ~ ( I I 2 ( S ) ) is a version of f in which nonessential labeling detail has been removed. When f is plurality rule, the function ,~ such that /~ = ~b o ~ o ~- o 3 has additional structure that enables us to improve the bound on the cardinality of [II2(S)] and qb(II2(S)). To understand how, consider two characteristics of any partition a = ( a 1..... a m) of k into m parts. If c~ contains n distinct integers, consider it a sequence in which the first (i.e., c~1) appears t 1 times, the second appears t 2 times, and so on. Then c~ is said to have type pattern t = t ( a ) = (t 1.... , tn). Thus partitions (15,3,3,3) and (14,4,4,2) have type patterns (1,3) and (1,2,1), respectively. There are 2 m- i distinct type patterns of partitions into m parts, and when m = 4, the type patterns are (1,1,1, 1), (1, 1,2), (1,2,1), (1,3), (2,1, 1), (2,2), (3,1), and (4). A second characteristic concerns distances between partitions. For partitions c~=(c~ 1. . . . . am) and Y = ( Y l ..... Yn) of k, the distance d between a and y is min(rn,n)
d(a,y)=k-
Y'~
min(oti,Yi).
i=l
Informally, d counts unit discrepancies between corresponding parts: When a = % no discrepancies exist, but when a = ( k ) and y =
INTERPRETING CONSENSUS SEQUENCES
239
(1, 1..... 1) there are k - 1. Sometimes the parts of a partition /3 = (/31,-..,/3m) of k will be of essentially one size, and we say that /3 is balanced if/3i ~ {[k / m], [k / m]} for i -- 1,..., m. Sometimes we write/3 j to show that a partition has j parts. Thus balanced partitions of 10 are /31 =(10), /32 =(5,5), /33 =(4,3,3), /34 =(3,3,2,2). To measure the extent to which a partition a of k into m parts is balanced, let the distance vector v = v ( a ) be an m-tuple (vl ..... vm) such that ~ = d ( a , / 3 J ) for j = l ..... m, where /3J is balanced. Thus a = ( 5 , 2 , 2 , 1 ) has v ( o t ) = (5,3,2,2). To identify the positions of the minima in v ( a ) , let the distance pattern d = d ( a ) b e the increasing sequence ( d l , . . . , d n) of indices in v ( a ) where the minima occur. Thus a = (5, 2, 2,1) has d(t~) = (3,4). There are 2 m - 1 distinct distance patterns of partitions into m parts. Table 2 shows the eight type patterns and 12 distance patterns that occur among partitions of k into four parts when k = 1,..., 100. When /~ = ~b o 9~ o ~-o 3, the function 9~ maps each partition a = ( a l ..... otm) of I~ to a consensus leader Z ~ *(II2(S)). Since ol represents a profile leader, each part a i is associated with the base a i in the canonical order a 1 < ... < a m of S. In the following examples, let S have four bases with order A < C < G < T. Consider the partition a = (6, 3, 2,1) for which 6 is associated with A, 3 with C, 2 with G, and 1 with T. From a, 9~ calculates v ( a ) = (6, 3, 3, 3), d(t~) = (2, 3, 4), and t( a ) =(1,1,1,1). In d ( a ) each element d i describes an ambiguity code containing the first d i bases in the order of S; thus Z contains {AC, ACG, ACGT}. Pattern t ( a ) partitions S (in canonical order) into classes of bases playing identical roles in the ambiguity codes of Z:
TABLE 2 Distance and Type Patterns k
Partition
Distance vector
Distance pattern
Type pattern
22 20 24 24 24 24 24 24 24 24 24 24
(14,3,3,2) (13,3,3,1) (15,3,3,3) (16,4,3,1) (21,1,1,1) (14,4,4,2) (15,4,4,1) (12,6,3,3) (10,10,3,1) (7,7,7,3) (9, 9, 3, 3) (6, 6, 6, 6)
(8,8,8,8) (7,7,7,8) (9,9,10,9) (8,8,9,10) (3,11,14,15) (10,8,8,8) (9,8,8,9) (12,6,7,6) (14,4,5,8) (17,10,3,3) (15, 6, 5, 6) (18,12, 6, 0)
(1,2,3,4) (1,2,3) (1,2,4) (1,2) (1) (2,3,4) (2,3) (2,4) (2) (3,4) (3) (4)
(1,2,1) (1,2,1) (1,3) (1,1,1,1) (1,3) (1,2,1) (1,2,1) (1,1,2) (2,1,1) (3,1) (2, 2) (4)
240
WILLIAM H. E. DAY AND F. R. McMORRIS
Since t ( a ) p a r t i t i o n s S into {{A},{C}, {G}, {T}}, each base plays a distinct role in Z. Together, the patterns describe the consensus leader Z = {AC, ACG, ACGT}. Next consider a = (9,2,2,1) with v ( a ) = (5,5,5,5), d ( a ) = (1, 2, 3, 4), and t ( a ) = (1, 2,1). From d ( a ) , Z contains {A, AC, ACG, ACGT}. According to t ( a ) , C and G play identical roles, so Z must include A G as well as AC. Together, the patterns describe Z = {A, A C , A G , A C G , ACGT}. When s = 4, a = (5,1,1,1) is a fascinating example with v ( a ) = ( 3 , 3 , 3 , 3 ) , d ( a ) = ( 1 , 2 , 3 , 4 ) , and t ( a ) = ( 1 , 3 ) . From d ( a ) , Z contains {A, AC, ACG, ACGT}. According to t(ot), C, G, and T play identical roles; so Z must include A G and A T as well as AC, and it must include A C T and A G T as well as ACG. Together, the patterns describe Z = {A, A C , A G , A T , A C G , A C T , A G T , ACGT}. To illustrate how to evaluate the plurality rule function a = ~b o ~ o 7r o 6, suppose S = {A,C,G,T} and consider the profile P = (T,T,T, G , A , T , A , T , G , T , T , A ) ~ S 12. When o- = (4,11)(5,8)(7,10)(9,12) and ~- = (A,C,T), the profile leader is 6 ( P ) = r ( o - ( P ) ) = (A, A, A, A, A, A, A, C, C, C, G, G). The corresponding partition, a = ~ r ( 6 ( P ) ) = (7,3,2), has v ( a ) = (5,3,3), d ( a ) = (2,3), and t ( a ) = (1,1,1), of which d and t together describe the consensus leader ~ ( T r ( ~ ( P ) ) ) = {AC, ACG}. Since ~b = ~ . - 1 = ( A , T , C ) , the consensus result is / e ( P ) = ~b(~t(~r(~(e)))) = {AT, AGT}. Since consensus leaders can be derived from pairs of type and distance patterns, bounds on the number of such pairs also bound the number of consensus leaders. Let u ( m ) be the actual number of distinct pairs of type and distance patterns realized by partitions of k into m parts, and let N ( s ) be the cardinality of qb(H2(S)). Clearly, N ( s ) <~ u(1) + ... + u(s). Although u(m)<~(2m-1)(2 m - 1), not all such pairs are realized. PROPOSITION 2 I f a partition a o f k into m parts has a distance pattern d ( a ) with d I = 1, then it has a typepattern t ( a ) with t 1 = 1. Proof If d ( a ) has d 1 = 1, the closest balanced partition to a = (~1 ..... a m) is /3 = (k) so that a 1 > [ k / 2 ] . The result follows because at most one part of a has that property. • PROPOSITION 3 I f a partition a o f k into m parts has t ( a ) = ( m ) , then d ( a ) = ( m ) . Proof
If t ( a ) = (m), then a is balanced and the result follows.
•
These results establish an upper bound on the number of consensus leaders in the codomain of the plurality rule function.
INTERPRETING CONSENSUS SEQUENCES
241
PROPOSITION4 For s >>.1, N(s) ~ U*(s) = 22s- 1 _ 2s+I + 2s + 1. Proof. O f (2 m- 1)(2m--1) possible pairs of distance and type patterns for partitions into m parts, Proposition 2 excludes 2 2m-3 pairs and Proposition 3 another 2 m- 1 - 2 , so v(m) <. 3 x 2 2 m - 3 - 2 m + 2 when m > 1. Clearly v(1) = 1. The result follows by arithmetic. [] Let A(m) be the actual number of distinct pairs of type and distance patterns realized by partitions of k into m parts, where k ~< 100, and define U,(s) = ,~(1)+ ... + h(s). Although in general U,(s) need not be an upper bound of (I)(II:(S)), it is for the values of s in Table 3. Let N*(s) be the number of plurality rule consensus leaders (or nonequivalent consensus results) identified by the pairs of patterns that U*(s) counted, and let N,(s) be the actual number of plurality rule consensus leaders realized by all partitions of k ~< 100. Clearly, N,(s)<~ N(s)<~ N*(s). Table 3 gives values of U*(s), U,(s), N*(s), and N,(s) for 2 ~< s ~<5. In addition, Table 4 shows the consensus leaders counted by N,(s) for 2 ~< s ~<4. 4.
PROBABILITY CALCULATIONS
If {AT, AGT} is the plurality rule consensus of a profile P = ( T , T , T , G , A , T , A , T , G , T , T , A ) of S = { A , C , G , T } , one might wish to know the probability of its occurrence under suitable models of profile selection. We investigated this problem for a model in which each of the s bases of S occurs with probability s-~ at each position of a profile. Since every profile of length k occurs with probability s -k, the probability that a consensus result Z ~ I I : ( S ) occurs is prob(s, k, Z ) = I{P ~ sk: /z(P) = Z}I/s k.
TABLE 3 Bounds Concerning N(s) = 1(I)(II2(8))1 s
1
2
3
4
5
U*(s) U,(s) N*(s) N,(s)
1 1 1 1
5 5 3 3
23 19 11 8
105 70 48 26
459 236 168 72
(1)
242
WILLIAM H. E. DAY AND F. R. McMORRIS
A naive algorithm enumerates all 4 lz profiles in order to calculate prob(4,12, {AT, A G T } ) = 722,304/16,777,216= 0.04305; but since the number of profiles increases exponentially with profile length, the approach is feasible only for problems of small size. Instead, we calculate plurality rule consensus probabilities by enumerating partitions of k into at most s parts. Although the asymptotic behavior of our algorithm still increases exponentially with k, the reduction of computational effort is considerable: Instead of 16,777,216 profiles of length 12, the algorithm enumerates 27 partitions of 12 into three or four parts to determine that (7,3,2), (6,4,2), and (5,5,2) generate the consensus leader of {AT, AGT}. For each such partition a, it counts the number of profiles Q ~ S 12 for which o~ = ~'(6(Q)). By analyzing partitions of an integer rather than profiles, the algorithm evaluates the numerator of (1). It remains to count, for any partition c~ of k into m parts, the number of profiles Q ~ S k for which c~ = 7r(6(Q)). The solution is based on natural correspondences among profiles, partitions of sets, and partitions of integers. Recall that a partition o f set X is a collection of nonempty subsets (or classes) that are disjoint and whose union is X. If X = {1..... 12}, for example, then W = {{1,2,3, 6,8,10,11},{4, 9},{5,7,12}} is a partition of X into three classes. If a partition of a set has A1 classes of cardinality 1, A 2 classes of cardinality 2 ..... and An classes of cardinality n, it is of type lX~.2 a2 . . . . . n xo. Thus W has type 1°21314°5°6°71 = 21-31.71, while {{1,3,6,8, 12},{2,4,7,9,11},{5, 10}} has type 21.52 . PROPOSITON 5 (BERGE [3, p. 39]) I f X is a set of k objects, the number N ( A 1..... An) o f its partitions o f type 1A' . . . . . n ~. is [(k!)]/[(l!) a. . . . . . (n!)a,(hl!) . . . . . (A,!)].
If S is a set of s colors, let a partition of a set X be called colored if distinct colors of S are associated with each of the partition's classes. For example, if S ={A,C,G,T} and W = {Wl,Wz,W 3} is a partition of X, then W = { w ~ , wGz,wA} is a colored partition in which T is associated with wl, and so forth. Counting colored partitions is an easy consequence of Proposition 5. COROLLARY I f X is a set o f k objects, and S a set of s colors, the number Ns(h 1..... An) o f colored partitions o f type 1~.... n A" is
[(s!)(k!)]/[(s
- m ) ! ( 1 !) h' - ' - ( n !) ~°( h 1!) -'. ( h, !)],
where m = E Ai is the number o f classes in partitions o f that type.
INTERPRETING CONSENSUS SEQUENCES
o,243
R
o~ ill
li
%
~ll
~ll ~il Ill
I,I
ttl
~'"~ i =~
II
H
<
<
@
~
r.,.l "~
~ & .,...
<< <
0
._~ N
<
@-.
I1
~N b
N~
244
WILLIAM H. E. DAY AND F. R. MCMORRIS
If X = { 1 .... ,12} and S={A,C,G,T}, the number of colored partitions of type 2152 is (4!)(12!) N4(0,1,0,0,2)
=
(1!)(1!)°(2!)'(3!)°(4!)°(5!)2(0!)(1!)(0!)(0!)(2!) (4!)(12!) =199,584, (2!)(5!)z(2!)
while N4(0,1,1,0,0,0,1 ) =190,080
and
N4(0,1,0, 1,0,1 ) =332,640.
Obvious bijections map the colored partition {{1, 2, 3, 6, 8, 10, 11}T, {4, 9}~, {5, 7,12} A} to the profile ( T , T , T , G , A , T , A , T , G , T , T , A ) and the type 213171 to the partition (7,3,2). Thus three applications of the corollary suffice to evaluate the numerator of (1) as 722,304 = 190,080 + 332,640 + 199,584. We used this algorithm to calculate the probabilities of the rences of the plurality rule consensus leaders Z in Table 4 combinations of s = 1 , . . . , 5 and k = 1. . . . . 100. Table 4 prob(4, k, Z) for k = 1 0 , 2 0 , . . . , 5 0 ; thus the probability that a randomly selected from {A,C,G,T} 1° maps to {ACG, ACGT} is
occurfor all shows profile
prob(4, 10, {ACG,ACGT} ) = 0.4390. Since we expected prob(s, k, {ACGT}) > prob(s, k, {ACG}) > prob(s,k,{AC})> prob(s, k, {A}), we had hypothesized that the fewer the bases of an ambiguity code, the less probable that code's occurrence in a consensus result. If prob(s,k,i) denotes the probability that a randomly selected profile of S k maps to a consensus leader having an ambiguity code with i bases, the hypothesis is that prob(s,k,i)> prob(s,k,j) for s>~i>j>~l. Table 5 shows prob(4, k, i) for k = 10,20 .... ,100 and i = 1 ..... 4. These data support the hypothesis and demonstrate that, as k increases, A C G T is ever more likely to occur in a consensus result. We also had hypothesized that the fewer the ambiguity codes in a consensus result, the more probable that result's occurrence. Intuitively, the greater the number of ambiguity codes in a consensus result, the more unusual the structure of profiles mapping to that result. The data of Table 4 don't support this hypothesis. For example, that prob(4, k, {ACG}) is greater than prob(4, k, {ACG, ACGT}) is supported at k = 20,30,40 but contradicted at k = 10 (spectacularly)
INTERPRETING CONSENSUS SEQUENCES
245
TABLE 5 Prob(4, k, i), the Probability That a Profile Randomly Selected from {A, C, G, T}k Maps to a Consensus Result Having Ambiguity Codes with i Bases k
prob(4, k, 1)
prob(4, k, 2)
prob(4, k, 3)
prob(4, k, 4)
10 20 30 40 50 60 70 80 90 100
0.0126 0.0004 --- 0 ~- 0 ----0 -- 0 --- 0 ----0 -= 0 --- 0
0.1841 0.0397 0.0070 0.0011 0.0002 0.0001 =0 ~0 --- 0 ---0
0.7452 0.4793 0.1490 0.1932 0.0707 0.0537 0.0247 0.0238 0.0090 0.0084
0.7779 0.7071 0.9152 0.8919 0.9653 0.9650 0.9831 0.9871 0.9955 0.9942
and k = 50 (mildly). U n u s u a l structure in profiles need not be associated with consensus results having m a n y ambiguity codes.
5.
CONCLUDING
REMARKS
T h e plurality rule consensus function provides a m e a n s by which a user can s u m m a r i z e a profile of nucleic (or amino) acids at a position in a set of aligned m o l e c u l a r sequences. It is based on a well-known m e d i a n principle [1, 2]. A plurality rule consensus has a natural interp r e t a t i o n as the result of an optimization p r o b l e m : each ambiguity code in the consensus describes a b a l a n c e d profile that is closest ( a m o n g all b a l a n c e d profiles) to the given input profile. If a plurality rule consensus contains m o r e than one ambiguity code, the user m a y have contextd e p e n d e n t ( p e r h a p s biological) reasons for preferring one code to another, but such considerations are external to the use of plurality rule as a descriptive consensus tool. O u r probabilistic interpretation of the plurality rule consensus function (Section 4) assumes that bases are equally likely to a p p e a r at every position of the given input profile. This a s s u m p t i o n s e e m s reasonable in cases (e.g., E. coli) w h e r e each nucleic acid base a p p e a r s with approximately equal frequency. W h e n significant d e p a r t u r e s f r o m this assumption occur, o n e might still expect plurality rule consensus results with m o r e than o n e ambiguity code to b e c o m e rare as profile length increases, and a m o n g consensus results having exactly o n e ambiguity code, one might still expect codes of a particular length to prevail (e.g., T a b l e 5) as profile length increases.
246
WILLIAM H. E. DAY AND F. R. McMORRIS
We were unable to establish that any of 35 --- U*(4) - U.(4) combinations of type pattern and distance pattern must occur when s = 4. As we found none of them in a complete analysis of profiles of length k = 1..... 100, we conjectured that they never occur and that our list of 26 plurality rule consensus leaders was complete. Recently, B. G. Mirkin and one of us used combinatorial arguments to prove these conjectures [6]. We acknowledge with pleasure the assistance of H. Todd Wareham (Memorial University of Newfoundland), who wrote computer programs to investigate combinatorial aspects of the plurality rule consensus function. Wareham also wrote a PASCAL computer program to calculate the plurality rule consensus of a profile. The program's input is restricted to descriptions of profiles for which s ~< 5 and k ~< 100. It outputs the plurality rule consensus result along with the probability of occurrence described in Section 4. For problems thus constrained, the program runs like the wind. The program can be used to analyze D N A or R N A sequences with gaps (e.g., S = {A, C, G,T,-}, S = {R,Y,-}) or without (e.g., S = {A, C, G, T}, S = {R, Y}). This paper bears witness to the usefulness of partitions of integers and sets as mathematical models for investigating biological problems. Representing profiles as partitions led to simplifications of the plurality rule consensus function for sequences, to bounds on numbers of nonequivalent plurality rule consensus results, and to efficient algorithms for obtaining the plurality rule consensus of profiles of molecular bases.
William H. E. Day is an Associate in the Program in Evolutionary Biology of the Canadian Institute for Advanced Research. This project was supported in part by the Canadian Institute for Advanced Research and by the Natural Sciences and Engineering Research Council of Canada under grant A-4142 (W. H. E. D.) and in part by the U.S. Office of Naval Research under grant NOOO14-89-J-1643 (F. R. McM.). REFERENCES J.-P. Barth61emy and F. R. McMorris, The median procedure for n-trees, Z Classif. 3:329-334 (1986). 2 J.-P. Barth61emy and B. Monjardet, The median procedure in cluster analysis and social choice theory, Math. Soc. Sci. 1:235-267 (1981). 3 C. Berge, Principlesof Combinatorics, Academic, New York, 1971. 4 W . H . E . Day, Consensus methods as tools for data analysis, in Classification and Related Methods of Data Analysis, H. H. Bock, Ed., Elsevier Science, Amsterdam, 1988, pp. 317-324. 1
INTERPRETING CONSENSUS SEQUENCES
247
5 W . H . E . Day and F. R. McMorris, Consensus sequences based on plurality rule, Bull. Math. Biol. (1992), to appear. 6 W . H . E . Day and B. G. Mirkin, On the existence of constrained partitions of integers, J. Comput. Info. (1992), to appear. 7 A. Levenglick, Fair and reasonable election systems, Behav. Sci. 20:34-46 (1975). 8 F . R . McMorris and D. Neumann, Consensus functions defined on trees, Math. Soc. Sci. 4:131-136 (1983). 9 T. Margush and F. R. McMorris, Consensus n-trees, Bull. Math. Biol. 43:239-244 (1981). 10 Nomenclature Committee of the International Union of Biochemistry (NC-IUB), Nomenclature for incompletely specified bases in nucleic acid sequences--recommendations 1984, Eur. J. Biochem. 150:1-5 (1985), 11 J. Richelson, A comparative analysis of social choice functions, Behav. Sci. 20:331-337 (1975). 12 J. Richelson, A characterization theorem for the plurality rule, J. Econ. Theory 19:548-550 (1978). 13 F. S. Roberts, Characterizations of the plurality function, Math. Soc. Sci. 21:101-127 (1991). 14 F.S. Roberts, On the indicator function of the plurality function, Math. Soc. Sci. 22:163-174 (1991).