Consensus sequences based on plurality rule



Bulletin of Mathematical Biology Vol. 54, No. 6, pp. 1057-1068, 1992. Printed in Great Britain.

0092-8240/92 $5.00+0.00 Pergamon Press Ltd. Society for Mathematical Biology

CONSENSUS SEQUENCES BASED ON PLURALITY RULE

WILLIAM H. E. DAY
Department of Computer Science, Memorial University of Newfoundland, St. John's, NF, Canada A1C 5S7

F. R. McMORRIS
Department of Mathematics, University of Louisville, Louisville, KY 40292, U.S.A.
(E.mail: [email protected], frmcmo01@ulkyvx.bitnet)

We apply concepts of social choice theory, in particular those concerning median and plurality rules, to investigate the problem of finding a consensus of aligned molecular sequences. Our model of consensus permits consensus elements at each aligned position to denote ambiguity codes if several alternatives are equally-preferred candidates for consensus. Our results concern plurality rules which are median rules, are characterized by the Condorcet properties, and are efficient to calculate. Our approach is axiomatic.

1. Consensus Sequences in Molecular Biology. Identifying consensus sequences is one of a complex of biological problems involving the interaction, orientation, and evolution of molecular sequences. In particular, the problems of aligning sequences (Waterman, 1989b) and identifying consensus patterns in sequences (Waterman, 1989a) are so closely related that they are sometimes difficult to separate. For example, the problem of identifying a longest common subsequence in a set of molecular sequences can be considered a special case of either problem; but since the longest common subsequence problem is NP-complete (Maier, 1978; Garey and Johnson, 1979), it is unlikely that efficient algorithms exist to obtain optimal solutions to realistic consensus or alignment problems. To analyse such problems, we make three simplifying assumptions:

• analysis of molecular sequences is a multi-stage process in which sequence alignment precedes the identification of consensus sequences;
• an alignment of the molecular sequences has already been calculated;
• aligned positions within molecular sequences can be treated independently.

Thus the problem to find a consensus of $k$ aligned molecular sequences, in which $n$ aligned positions have been identified, can be viewed as a set of $n$ simpler problems, each to find a consensus of $k$ alternatives at an aligned position. Of the three assumptions, the first two are typically made as prerequisites to


the calculation of consensus sequences. The third may seem problematic because it invites us to disregard evidence of molecular structure that may introduce dependencies among aligned positions. Making this assumption enables us to investigate theoretical and computational properties of consensus methods within a tractable mathematical model. In future work, we hope to relax the assumption and study how a consensus in one aligned position might be influenced by a context of other aligned positions.

Since the era of ancient Greek city states, consensus methods have been powerful tools for political or social change, and today they are recognized as valuable tools for data analysis (Day, 1988). Basic criteria of consensus are easily described in simple voting contexts. Let there be an electorate of $k$ voters in an election involving a slate $S$ of alternatives. The votes cast in the election are specified by a profile $P = (p_1, \ldots, p_k)$ in which $p_i$ denotes the alternative selected by voter $i$. For each alternative $a \in S$, let $n_a(P)$ denote the number of occurrences of $a$ in $P$, and let $N(P) = \max_{a \in S} n_a(P)$. The set of all profiles is the $k$-fold Cartesian product $S^k$; the set of all electoral outcomes is the set $\Pi'(S)$ of all subsets of $S$. The criterion of obtaining consensus by majority rule (May, 1952; Campbell, 1982) is a function $\mathrm{Maj}: S^k \to \Pi'(S)$ such that for each profile $P \in S^k$, $\mathrm{Maj}(P) = \{a \in S : n_a(P) > k/2\}$. Majority rule's appealing features are that, when it exists, the winner is unique and has the support of more than half the voters; but with slates of more than one alternative, majority rule need not select a winner. One generalization, the criterion of obtaining consensus by plurality rule (Richelson, 1975, 1978; Roberts, 1991), is a function $\varphi: S^k \to \Pi'(S)$ such that for each profile $P \in S^k$:

$$\varphi(P) = \{a \in S : n_a(P) = N(P)\}.$$
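The two voting criteria just defined can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a profile is any non-empty sequence of hashable alternatives (a string of bases here), and the function names `majority` and `plurality` are hypothetical, chosen for this example.

```python
from collections import Counter


def majority(profile):
    """Maj(P): alternatives chosen by more than half of the k voters."""
    k = len(profile)
    counts = Counter(profile)  # counts[a] plays the role of n_a(P)
    return {a for a, n in counts.items() if n > k / 2}


def plurality(profile):
    """phi(P): alternatives whose count n_a(P) equals the maximum N(P)."""
    counts = Counter(profile)
    top = max(counts.values())  # N(P); assumes a non-empty profile
    return {a for a, n in counts.items() if n == top}


if __name__ == "__main__":
    print(majority("AGAA"))                  # {'A'}
    print(majority("AACC"))                  # set(): no majority winner
    print(plurality("AACC") == {"A", "C"})   # True: a two-way tie
```

The second call shows the point made in the text: majority rule may return no winner, while plurality rule always returns at least one.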

Although plurality rule always returns at least one winner, a winner may be supported by only a minority of the voters. In our development of consensus models for sequences, we shall strive to ensure an impartial treatment of equally-preferred candidates for consensus.

A third criterion for obtaining consensus identifies the set of median (or middlemost) alternatives among those in $S$. To compare alternatives of $S$, let $d$ be a real-valued metric with domain $S^2$. The criterion of obtaining consensus by median rule (Kemeny, 1959; Kemeny and Snell, 1962) is a function $\mathrm{Med}: S^k \to \Pi'(S)$ such that for each $P = (p_1, \ldots, p_k) \in S^k$,

$$\mathrm{Med}(P) = \Big\{a \in S : \sum_{i=1}^{k} d(a, p_i) \le \sum_{i=1}^{k} d(b, p_i) \text{ for all } b \in S\Big\}.$$
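As a small illustration of the median criterion, the sketch below scores each alternative by its total distance to the votes in a profile and returns the alternatives with the smallest total. It assumes the usual Kemeny-style reading of a median (minimum summed distance). The names `median_alternatives` and `identity_metric`, and the numeric slate used in the second call, are hypothetical choices made for this example.

```python
def median_alternatives(profile, alternatives, d):
    """Return the alternatives minimising the summed distance d(a, p_i) to the profile."""
    totals = {a: sum(d(a, p) for p in profile) for a in alternatives}
    best = min(totals.values())
    return {a for a, total in totals.items() if total == best}


def identity_metric(x, y):
    """0 if the two alternatives coincide, 1 otherwise."""
    return 0 if x == y else 1


if __name__ == "__main__":
    # Under the identity metric the medians coincide with the plurality winners.
    print(median_alternatives("AGAA", "ACGT", identity_metric))  # {'A'}
    # Hypothetical numeric slate with |x - y| as the metric: the median is 2.
    print(median_alternatives([1, 2, 9], range(1, 10), lambda x, y: abs(x - y)))  # {2}
```

With the identity metric, minimising total disagreement is the same as maximising the count $n_a(P)$, which is why the first call returns the plurality winner.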

In contexts relevant to biologists, majority and median rules have been applied with considerable success (e.g. Barthélemy and Monjardet, 1981; Barthélemy and McMorris, 1986; Margush and McMorris, 1981; McMorris and Neumann, 1983). We propose to continue in this tradition by using plurality-rule paradigms to obtain a consensus of molecular sequences. In Section 2 we develop a model of consensus in which we define median rules and plurality rules for sequences. Although authors often assume alternatives are linearly ordered by voters (e.g. Campbell, 1982; Fishburn, 1977; Levenglick, 1975; Richelson, 1978; Young, 1974; Young and Levenglick, 1978), our voters (e.g. molecules) select single alternatives (e.g. nucleotide bases) and are indifferent to the rest. A noteworthy feature of the model, and where we differ from Roberts (1991), is that it allows consensus elements to denote ambiguity codes when several alternatives are equally-preferred candidates for consensus. In Section 3 we use the model to investigate features of plurality rules for sequences. Plurality rules are median rules, they are characterized by the Condorcet properties (Young and Levenglick, 1978), and we characterize the conditions under which they exhibit a desirable feature of majority rule: that consensus elements are approved by at least half the voters. In Section 4 we consider in our model the special case corresponding to the traditional plurality rule. In Section 5 we discuss the implications of these results.

2. Models of Consensus for Sequences. The problem of determining a consensus of a set of molecular sequences can be modeled by a function $C: U \to V$ which maps each element of its domain $U$ to a suitable element of its codomain $V$. Consider how to specify the domain of $C$, and let $S$ be a finite nonempty set of alternatives. Of particular interest is the case where $S = \{A, C, G, T\}$ and the alternatives represent the nucleic acid bases adenine, cytosine, guanine and thymine. Each element of $C$'s domain represents a possible $k$-tuple of alternatives (bases) appearing at an aligned position in each of $k$ molecules. As before, for any positive integer $k$, let a profile of $S$ be a $k$-tuple of alternatives of $S$, and let $S^k$ denote the set of profiles of $S$ of length $k$. Since a researcher usually investigates a fixed, known set of $k$ sequences, $S^k$ is a possible domain of $C$. However, in Section 3 we will explore whether consensus results are consistent at different values of $k$. For this reason, we will adopt $\bigcup_{k>0} S^k$ as the domain of the consensus functions we investigate.

Consider how to specify the codomain of a consensus function. When determining a consensus of aligned bases in several molecules, biologists sometimes specify special codes such as R (for purine base), Y (for pyrimidine base), or N (for any base). Informally, using R as a consensus element indicates uncertainty as to whether the consensus should be adenine or guanine. We will permit such ambiguity codes to be used as consensus elements; we simply require that each code be designated by its corresponding subset of alternatives. For


example, when $S = \{A, C, G, T\}$, a consensus of a profile may contain any of the subsets (shown with corresponding ambiguity codes defined by the International Union of Biochemistry): $\{A\} = A$, $\{C\} = C$, $\{G\} = G$, $\{T\} = T$, $\{A, G\} = R$, $\{C, T\} = Y$, $\{A, C\} = M$, $\{G, T\} = K$, $\{A, T\} = W$, $\{C, G\} = S$, $\{A, C, G\} = V$, $\{A, C, T\} = H$, $\{A, G, T\} = D$, $\{C, G, T\} = B$, $\{A, C, G, T\} = N$. Thus, when $\Pi(S)$ denotes the set of all nonempty subsets of $S$, $\Pi(\Pi(S))$ is a possible codomain of consensus functions. However, it seems useful to parameterize the cardinalities of consensus elements so that biologists can restrict their use of nontrivial ambiguity codes. For any positive integer $j$, define $\Pi_j(S) = \{X \in \Pi(S) : |X| \le j\}$. Thus $\Pi(\Pi_j(S))$ is the set of all nonempty sets of ambiguity codes having at most $j$ alternatives, and we will adopt $\Pi(\Pi_j(S))$ as the codomain of the consensus functions we investigate.

For any positive integer $j$, a consensus function is a function $C_j$ with domain $\bigcup_{k>0} S^k$ and codomain $\Pi(\Pi_j(S))$. For any profile $P \in S^k$, $C_j(P)$ is the set of most-preferred elements of $\Pi_j(S)$; each $X \in C_j(P)$ is a consensus element of $P$. When more than one consensus element exists, all are preferred equally. If $X \in C_j(P)$ and $X = \{a\}$, $X$ corresponds to the preferred alternative $a \in S$; if $|X| > 1$, $X$ is a set of alternatives of $S$, all of which are preferred equally. When $j = 1$, $S = \{A, C, G, T\}$ and $P = (A, G, A, A)$, a reasonable consensus might be $C_1(P) = \{\{A\}\}$, $A$ being the preferred alternative: not only does a majority of the voters favor alternative $A$, but a change of one vote would cause $A$ to be elected unanimously. When $j = 2$, however, a reasonable consensus might be $C_2(P) = \{\{A\}, \{A, G\}\}$, the consensus elements being those denoting $A$ and $R$: a change of one vote either would cause $A$ to be elected unanimously, or it would cause a tie vote between $A$ and $G$.
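To make the parameterized codomain concrete, the following sketch enumerates $\Pi_j(S)$, the nonempty subsets of $S$ with at most $j$ alternatives. The function name `pi_j` is a hypothetical label for this example; for the four bases it yields 4, 10, 14 and 15 candidate consensus elements as $j$ runs from 1 to 4, the last count matching the fifteen subsets listed above.

```python
from itertools import combinations


def pi_j(alternatives, j):
    """Return Pi_j(S): all nonempty subsets of S with at most j elements."""
    items = sorted(set(alternatives))
    return [frozenset(c)
            for size in range(1, min(j, len(items)) + 1)
            for c in combinations(items, size)]


if __name__ == "__main__":
    S = "ACGT"
    for j in range(1, 5):
        print(j, len(pi_j(S, j)))  # counts 4, 10, 14, 15
```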
Of the parameterized family $C_j: \bigcup_{k>0} S^k \to \Pi(\Pi_j(S))$, the extremes are noteworthy: $C_1$ prohibits nontrivial ambiguity codes from appearing as consensus elements, while $C_{|S|}$ imposes no such restrictions.

Our concept of median profiles is based on an idea that the frequencies with which alternatives appear in such profiles should be balanced as much as possible. For any profile $P = (p_1, \ldots, p_k) \in S^k$, let the projection of $P$ into $\Pi(S)$ be the set $\Gamma(P) = \{x \in S : x = p_i \text{ for some } 1 \le i \le k\}$. Recall that for any profile $P \in S^k$ and alternative $a \in S$, $n_a(P)$ is the number of occurrences of $a$ in $P$. A profile $P \in S^k$, with $X = \Gamma(P)$, is called balanced if $n_a(P) \in \{\lceil k/|X| \rceil, \lfloor k/|X| \rfloor\}$ for every $a \in X$, where $\lceil\ \rceil$ and $\lfloor\ \rfloor$ denote the ceiling and floor functions. For example, $(A, A, A, A, A, A, A)$, $(A, A, A, A, C, C, C)$ and $(A, A, A, C, C, G, G)$ are balanced profiles; $(A, A, A, A, A, C, C)$, $(A, A, A, A, C, C, G)$ and $(A, A, A, C, C, C, G)$ are not. Biologists may wish to limit the number of alternatives in the projection of a balanced profile: for any positive integer $j$, a profile $P$ is called $j$-balanced if it is balanced and $\Gamma(P) \in \Pi_j(S)$. Thus $(A, A, A, C, C, G, G)$ is three-balanced but not two-balanced. The set of all $j$-balanced profiles is denoted by $B_j(S, k) = \{P \in S^k : P \text{ is } j\text{-balanced}\}$.

The concept of median profiles also requires measuring disagreement


between profiles. A natural specification uses the identity metric $\iota: S^2 \to \{0, 1\}$ for which $\iota(x, y) = 0$ if and only if $x = y$. Let the disagreement between $P = (p_1, \ldots, p_k)$ and $Q = (q_1, \ldots, q_k)$ be measured by a function $d: S^k \times S^k \to \mathbb{R}$ such that $d(P, Q) = (1/k) \sum_{1 \le i \le k} \iota(p_i, q_i)$; $d$ is a metric and is normalized such that $0 \le d(P, Q) \le 1$ for all $P$ and $Q$. Informally, the distance between $P$ and $Q$ is the proportion of positions at which they disagree. Of particular interest for any profile $P \in S^k$ is the distance to its nearest $j$-balanced neighbour: $D_j(P) = \min_{Q \in B_j(S, k)} d(P, Q)$. For any $P \in S^k$, a profile $Q \in B_j(S, k)$ is called a $j$-median fit of $P$ if $d(P, Q) = D_j(P)$. Thus, define the $j$-median rule for sequences to be a consensus function $M_j: \bigcup_{k>0} S^k \to \Pi(\Pi_j(S))$ such that for each profile $P \in S^k$:

$$M_j(P) = \{X \in \Pi_j(S) : X = \Gamma(Q) \text{ for } Q \text{ a } j\text{-median fit of } P\} = \{X \in \Pi_j(S) : X = \Gamma(Q) \text{ for } Q \in B_j(S, k),\ d(P, Q) = D_j(P)\}.$$

As an example, when $S = \{A, C, G, T\}$ and $P = (A, G, A, A)$, the reader can verify that $D_2(P) = 1/4$ and that $(A, A, A, A)$, $(G, G, A, A)$, $(A, G, G, A)$, and $(A, G, A, G)$ are all two-median fits of $P$. Thus $M_2(P) = \{\{A\}, \{A, G\}\}$.

The concept of $j$-balance is central to plurality rules as well as median rules. However, for medians the emphasis was on disagreement, while for plurality rules it is on measuring agreement between a profile and its closest $j$-balanced profiles. Notice that $n_a(P)$ can be interpreted as counting the agreements between $P \in S^k$ and a balanced profile $Q$ with $\{a\} = \Gamma(Q)$. By extension, for $X \in \Pi(S)$ let $n_X(P)$ be the proportion of agreements between $P \in S^k$ and a closest balanced profile $Q$ having $X = \Gamma(Q)$, so that $n_X(P) = (1/k) \sum_{a \in X} \min(n_a(P), \lceil k/|X| \rceil)$. Since $Q$ is balanced, each alternative appears in $Q$ either $\lceil k/|X| \rceil$ or $\lfloor k/|X| \rfloor$ times. When $n_a(P) > k/|X|$, $n_X(P)$ accounts for $\lceil k/|X| \rceil$ occurrences of $a$ in $P$; when $n_a(P) \le k/|X|$, so that $n_a(P) \le \lfloor k/|X| \rfloor$, $n_X(P)$ accounts for $n_a(P)$ occurrences of $a$ in $P$. To obtain a normalized measure of agreement between any profile $P \in S^k$ and its nearest $j$-balanced neighbours, define $N_j(P) = \max_{X \in \Pi_j(S)} n_X(P)$. For any $P \in S^k$, a profile $Q \in B_j(S, k)$ having $n_{\Gamma(Q)}(P) = N_j(P)$ maximizes the agreement possible between a $j$-balanced profile and $P$; subject to the constraint imposed by $j$, $\Gamma(Q)$ is a most-preferred set of alternatives for $P$. Thus, define the $j$-plurality rule for sequences to be a consensus function $\varphi_j: \bigcup_{k>0} S^k \to \Pi(\Pi_j(S))$ such that for each profile $P \in S^k$:

$$\varphi_j(P) = \{X \in \Pi_j(S) : X = \Gamma(Q) \text{ for } Q \in B_j(S, k),\ n_X(P) = N_j(P)\}.$$

As an example, when $S = \{A, C, G, T\}$ and $P = (A, G, A, A)$, the reader can verify that $N_2(P) = 3/4$ and that $(A, A, A, A)$, $(G, G, A, A)$, $(A, G, G, A)$, and $(A, G, A, G)$ all exhibit that maximum agreement between two-balanced profiles and $P$. Thus $\varphi_2(P) = \{\{A\}, \{A, G\}\}$.
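The two worked examples of this section can be checked mechanically. The sketch below is an illustration, not the authors' algorithm: `median_rule` searches every profile of length $k$ exhaustively (feasible only for tiny inputs), while `plurality_rule` uses the closed form for $n_X(P)$ as read from the text, with a ceiling in the cap. All function names are hypothetical, and the agreement between the two rules is checked here only on the example $P = (A, G, A, A)$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product


def is_balanced(profile):
    """True if each alternative present occurs floor(k/|X|) or ceil(k/|X|) times."""
    counts = Counter(profile)
    k, m = len(profile), len(counts)
    return all(n in (k // m, -(-k // m)) for n in counts.values())


def distance(p, q):
    """d(P, Q): proportion of positions at which P and Q disagree."""
    return Fraction(sum(x != y for x, y in zip(p, q)), len(p))


def median_rule(profile, alternatives, j):
    """Return (M_j(P), D_j(P)) by brute force over all j-balanced profiles."""
    fits = {}
    for q in product(sorted(set(alternatives)), repeat=len(profile)):
        if len(set(q)) <= j and is_balanced(q):
            x = frozenset(q)
            fits[x] = min(fits.get(x, 1), distance(profile, q))
    best = min(fits.values())  # D_j(P)
    return {x for x, dist in fits.items() if dist == best}, best


def n_x(profile, x):
    """n_X(P) = (1/k) * sum over a in X of min(n_a(P), ceil(k/|X|))."""
    counts = Counter(profile)
    k = len(profile)
    cap = -(-k // len(x))  # ceil(k / |X|) in integer arithmetic
    return Fraction(sum(min(counts[a], cap) for a in x), k)


def plurality_rule(profile, alternatives, j):
    """Return (phi_j(P), N_j(P)): the sets X in Pi_j(S) that maximise n_X(P)."""
    items = sorted(set(alternatives))
    max_size = min(j, len(items), len(profile))
    scores = {frozenset(c): n_x(profile, c)
              for size in range(1, max_size + 1)
              for c in combinations(items, size)}
    best = max(scores.values())  # N_j(P)
    return {x for x, s in scores.items() if s == best}, best


if __name__ == "__main__":
    m, d = median_rule("AGAA", "ACGT", 2)
    p, n = plurality_rule("AGAA", "ACGT", 2)
    print(sorted("".join(sorted(x)) for x in m), d, n)  # ['A', 'AG'] 1/4 3/4
    print(m == p, d + n == 1)                           # True True
```

Exact rational arithmetic (`Fraction`) is used so that ties between candidate sets are detected without floating-point error.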


3. Plurality Rules for Sequences. In this section (and the next) we use our model of consensus to investigate features of plurality rules for sequences. Propositions 1-5, which are the main results, exhibit properties of plurality rules which should help researchers to decide whether the plurality rules are suitable for their sequence applications. For any positive integer $j$, $D_j$ and $N_j$ are complementary measures of proximity because $D_j(P) + N_j(P) = 1$ for every profile $P$. A pleasing (and straightforward) consequence is that $j$-plurality rules for sequences are also median rules.

PROPOSITION 1. For any positive integer $j$, $\varphi_j = M_j$.

Plurality rules may exhibit properties that are considered fair and reasonable in other voting contexts (e.g. Levenglick, 1975; Richelson, 1975). We state several of them here in terms of an arbitrary consensus function $C_j: \bigcup_{k>0} S^k \to \Pi(\Pi_j(S))$, for any positive integer $j$. We have tried to use definitions that correspond to usage in the social choice literature. However, we note that terminology is not standardized. Throughout, $P = (p_1, \ldots, p_k)$ and $Q = (q_1, \ldots, q_m)$ are profiles. $C_j$ is Condorcet if, for all $X, Y \in \Pi_j(S)$,

$$n_X(P) > n_Y(P) \text{ implies } Y \notin C_j(P), \qquad (1)$$

and

$$n_X(P) = n_Y(P) \text{ implies } [X \in C_j(P) \text{ if and only if } Y \in C_j(P)]. \qquad (2)$$

$C_j$ is anonymous if, for all permutations $\sigma$ of $\{1, \ldots, k\}$, $C_j(P) = C_j((p_{\sigma(1)}, \ldots, p_{\sigma(k)}))$. For $\tau$ a permutation of $S$ and $X$ a subset of $S$, let $\tau(X)$ denote the subset of $S$ obtained by applying $\tau$ to each alternative in $X$. $C_j$ is neutral if, for all permutations $\tau$ of $S$, $X \in C_j(P)$ if and only if $\tau(X) \in C_j((\tau(p_1), \ldots, \tau(p_k)))$. $C_j$ is symmetric if it is both anonymous and neutral. $C_j$ is limited (Roberts, 1991) if $\Gamma(P) \supseteq \bigcup C_j(P)$. Let $P + Q$ denote the concatenation of $P$ and $Q$. $C_j$ is consistent if $C_j(P + Q) = C_j(P) \cap C_j(Q)$ whenever $C_j(P) \cap C_j(Q) \ne \emptyset$. Finally, $C_j$ is majoritary if $n_X(P) \ge 1/2$ for all $X \in C_j(P)$.

It is straightforward to verify that $j$-plurality rules for sequences are Condorcet, symmetric and limited. In addition, they are characterized by the Condorcet properties.

PROPOSITION 2. For any positive integer $j$, $C_j = \varphi_j$ if and only if $C_j$ is Condorcet.

Proof. When $C_j = \varphi_j$, it is immediate that $C_j$ is Condorcet. Next, suppose $C_j$ is Condorcet. First consider $X \in C_j(P)$, so that $X \in \Pi_j(S)$. By (1), $n_X(P) \ge n_Y(P)$ for all $Y \in \Pi_j(S)$, so that $X \in \varphi_j(P)$; hence $\varphi_j(P) \supseteq C_j(P)$. Next consider $Y \in \varphi_j(P)$. Since $\varphi_j(P) \supseteq C_j(P)$, let $X$ be any set in both. Since $X$ and $Y$ are in $\varphi_j(P)$, $n_X(P) = n_Y(P)$; hence $Y \in C_j(P)$ by (2). Thus $C_j(P) \supseteq \varphi_j(P)$, so that $C_j = \varphi_j$. ∎

Each property is necessary in this characterization. Let $\Pi_j(S)$ be linearly ordered, and let $C_j^1$ be defined such that:


$$C_j^1(P) = \begin{cases} \varphi_j(P) & \text{if } |\varphi_j(P)| = 1, \\ \{X\} & \text{otherwise, where } X \text{ is the unique least element in } \varphi_j(P) \text{ according to the linear ordering of } \Pi_j(S). \end{cases}$$

Then $C_j^1$ satisfies (1) but not (2). Let $C_j^2$ be defined such that:

$$C_j^2(P) = \begin{cases} \varphi_j(P) & \text{if } n_X(P) = n_Y(P) \text{ for any distinct } X, Y \in \Pi_j(S), \\ \Pi_j(S) & \text{otherwise.} \end{cases}$$

Then $C_j^2$ satisfies (2) but not (1).

In voting theory, consistency seems basic: the well-known Borda, Condorcet and plurality consensus functions are all consistent (Young, 1974); Young (1974) characterized the Borda consensus function in terms of consistency; and Young and Levenglick (1978) showed Kemeny's (1959) rule to be the unique social preference function that is neutral, consistent and Condorcet. But in our model, $j$-plurality rules for sequences are consistent only in simple or trivial cases.

PROPOSITION 3. $\varphi_j$ is consistent if and only if $j = 1$ or $|S| = 1$.

Proof. The reader can verify that $\varphi_1$ is consistent; $\varphi_j$ is trivially consistent when $|S| = 1$. Suppose $j > 1$, $S = \{A, C\}$ and $P = (A, A, A, A, C)$. Since one change of $P$ yields the balanced profiles $(A, A, A, A, A)$ or $(A, A, A, C, C)$, we have $\varphi_j(P) = \{\{A\}, \{A, C\}\}$. Since two changes of $P + P$ yield $(A, A, A, A, A, A, A, A, A, A)$, while three changes yield the balanced profile $(A, A, A, C, C, A, A, C, C, C)$, we have $\varphi_j(P + P) = \{\{A\}\}$. Since $\{\{A\}\} = \varphi_j(P + P) \ne \varphi_j(P) \cap \varphi_j(P) = \{\{A\}, \{A, C\}\}$, it follows that $\varphi_j$ is inconsistent.

For an election with two alternatives, simple majority-rule consensus is attractive because any winning alternative is approved by at least half the voters. In the straightforward extension of simple majority rule to the case where $g > 2$ alternatives are on the ballot, that desirable (majoritary) property need not hold since a winning candidate may be approved by as little as $100/g$ per cent of the voters. A similar situation exists with $j$-plurality rules for sequences. If $S = \{A, C, G, T\}$ and $P = (A, C, G)$, then $\varphi_1(P) = \{\{A\}, \{C\}, \{G\}\}$; since $n_X(P) = 1/3$ for all $X \in \varphi_1(P)$, $\varphi_1$ violates the majoritary property. Intuitively, one expects $\varphi_j$ not to be majoritary when $j < |S|/2$. On the other hand, $\varphi_j$ is majoritary when $j \ge |S|/2$, and so in these cases every consensus element obtained by $\varphi_j$ has the approval of at least half the voters. To see why, recall that $N_j(P)$ measures the proportion of agreements between $P \in S^k$ and its nearest $j$-balanced neighbours.
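Both counterexamples in this passage can be replayed numerically. The sketch below is a hedged illustration: it scores each candidate set with the agreement measure $n_X(P) = (1/k) \sum_{a \in X} \min(n_a(P), \lceil k/|X| \rceil)$ as read from Section 2, and the helper names (`n_x`, `plurality_rule`, `is_consistent_on`) are hypothetical. It tests consistency for one pair of profiles only, which is enough to exhibit a violation but not to prove consistency in general.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def n_x(profile, x):
    """Agreement n_X(P) = (1/k) * sum over a in X of min(n_a(P), ceil(k/|X|))."""
    counts = Counter(profile)
    k = len(profile)
    cap = -(-k // len(x))  # ceil(k / |X|) without floating point
    return Fraction(sum(min(counts[a], cap) for a in x), k)


def plurality_rule(profile, alternatives, j):
    """phi_j(P): the sets of at most j alternatives that maximise n_X(P)."""
    items = sorted(set(alternatives))
    max_size = min(j, len(items), len(profile))
    scores = {frozenset(c): n_x(profile, c)
              for size in range(1, max_size + 1)
              for c in combinations(items, size)}
    best = max(scores.values())
    return {x for x, score in scores.items() if score == best}


def is_consistent_on(p, q, alternatives, j):
    """Check C(P+Q) = C(P) & C(Q) for one pair, when the intersection is nonempty."""
    common = plurality_rule(p, alternatives, j) & plurality_rule(q, alternatives, j)
    return not common or plurality_rule(p + q, alternatives, j) == common


if __name__ == "__main__":
    P = "AAAAC"
    print(is_consistent_on(P, P, "AC", 1))  # True
    print(is_consistent_on(P, P, "AC", 2))  # False: the proof's counterexample
    winners = plurality_rule("ACG", "ACGT", 1)
    print(all(n_x("ACG", x) < Fraction(1, 2) for x in winners))  # True: not majoritary
```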


PROPOSITION 4. $\varphi_j$ is majoritary if and only if $j \ge |S|/2$.

Proof. Consider any profile $P \in S^k$. Define $f = N(P)/k$ and $g = |\Gamma(P)|$. We will impose on $\Gamma(P) = \{a_1, \ldots, a_g\}$ an arbitrary but fixed ordering such that $n_{a_1}(P) \ge \cdots \ge n_{a_g}(P)$. Relative to this ordering, let $B_i$ be a balanced profile in which $\Gamma(B_i) = \{a_1, \ldots, a_i\}$. Finally, let $B_\# \in S^k$ be a balanced profile for which $\Gamma(B_\#) = S$. Suppose $j \ge |S|/2$, and consider four cases.

(i) $1 \ge f \ge (g+1)/(2g)$. Then $N_j(P) \ge 1 - d(P, B_1) = f \ge (g+1)/(2g) > 1/2$.
(ii) $(g+1)/(2g) \ge f \ge 1/2$. Then $N_j(P) \ge 1 - d(P, B_2) \ge 1/2 + 1/(2g) = (g+1)/(2g) > 1/2$.
(iii) $1/2 \ge f \ge 1/j$. Then $N_j(P) \ge 1 - d(P, B_j) \ge 1/2 + 1/j > (j+1)/(2j) > 1/2$.
(iv) $1/j \ge f \ge 1/g$. Then $N_j(P) \ge N_j(B_\#) = j/|S| \ge 1/2$.

Thus $\varphi_j$ is majoritary. Next suppose $j < |S|/2$. Consider any $k$ and $B_\# \in S^k$ such that both $j$ and $|S|$ divide $k$. Then $N_j(B_\#) = j(k/|S|)/k = j/|S| < 1/2$, so that $\varphi_j$ is not majoritary. ∎

Let the minimum value of $N_j(P)$ for any profile in $S^k$ be defined by $\beta_j(S, k) = \min_{P \in S^k} N_j(P)$. Proposition 4 can then be sharpened to obtain the following result.

COROLLARY. $\beta_j(S, k) = j/|S|$ when $j$
limited and consistent. Each property is necessary in this characterization. Let $C^1$ be defined such that $C^1(P) = \{\{p_1\}\}$ for every $P = (p_1, \ldots, p_k) \in S^k$. Then $C^1$ is neutral, consistent and limited, but not anonymous. Let $\Pi_1(S)$ be linearly ordered, and let $C^2$ be defined such that $C^2(P) = \{X\}$ for every $P \in S^k$, where $X$ is the least element in $\varphi_1(P)$. Then $C^2$ is anonymous, consistent and limited, but not neutral. Let $C^3$ be defined such that $C^3(P) = \Pi_1(S)$ for every $P \in S^k$. Then $C^3$ is symmetric and consistent, but not limited. Let $C^4$ be defined such that $C^4(P) = \varphi_1(P)$ when the alternatives in $P \in S^k$ occur with equal frequency, and $C^4(P) = \{\{x\} : x \text{ occurs in } P \text{ with the second-highest frequency}\}$ otherwise. Then $C^4$ is symmetric and limited, but not consistent.

Proposition 5 was suggested by Young's (1975) characterization of scoring functions. Richelson (1978) gave a characterization of plurality rules in a model


which is more complex than ours since its voters specify linear orders of alternatives. Roberts (1991) gave characterizations of plurality rules in Richelson's model, and in our model for the case when $j = 1$.

5. Discussion. An alignment of molecular sequences may introduce gaps in the sequences: for example, when $S = \{A, C, G, T\}$ the profile $P = (A, G, -, -, G)$ represents an aligned position at which gaps occur in the third and fourth sequences. We can include gaps in a consensus analysis by adding the gap symbol to the set of alternatives. For example, if $S = \{A, C, G, T, -\}$ and $P = (A, G, -, -, G)$, then $\varphi_1(P) = \{\{G\}, \{-\}\}$ and $\varphi_2(P) = \{\{G, -\}\}$.

When $j > 1$, the inconsistency of $\varphi_j$ (Proposition 3) has a natural explanation. If a profile $P$ is short, several balanced profiles may be equally distant from it, and so $\varphi_j(P)$ may contain more than one consensus element. When long profiles are constructed from short profiles by concatenation, ties are less likely in the median computations for the long profiles, and so violations of consistency occur. Intuitively, these violations result from $\varphi_j$'s increased ability to discriminate among median candidates as profiles become longer. This property is an advantage, not a disadvantage, of $j$-plurality rule: as more molecular sequences are analysed, one should expect a consensus to become more precise. [When evaluating Condorcet consensus functions, Fishburn (1977) cited one based on Kemeny's (1959) median rule for its superior ability to produce choice sets that contain a single consensus element.]

One can use the corollary of Proposition 4 to construct a normalized consensus index which places a profile between the extremes of being (in perfect agreement with) a balanced profile and having the minimum possible agreement with a nearest balanced profile.
For example, to each consensus function $\varphi_j: \bigcup_{k>0} S^k \to \Pi(\Pi_j(S))$ define a consensus index $I_j: \bigcup_{k>0} S^k \to [0, 1]$, where $[0, 1]$ is the closed real interval, such that for each profile $P \in S^k$, $I_j(P) = (N_j(P) - \beta_j(S, k))/(1 - \beta_j(S, k))$. This approach is like ones Day and McMorris (1985) advocated. The interpretation of particular values of $I_j$ is problematic since one lacks information about the distribution of values of $I_j$ when profiles are selected randomly. The consensus index $I_j$ can be used to specify a parameterized family of consensus rules which vary between the extremes of $j$-plurality rule and unanimity of consensus. For any positive integer $j$ and real number $l \in [0, 1]$, define the $(j, l)$-plurality rule for sequences to be a consensus function $\varphi_{j,l}: \bigcup_{k>0} S^k \to \Pi'(\Pi_j(S))$ such that for each profile $P \in S^k$:

$$\varphi_{j,l}(P) = \begin{cases} \varphi_j(P) & \text{if } I_j(P) \ge l, \\ \emptyset & \text{otherwise.} \end{cases}$$

Thus $l$ serves to filter consensus elements by the degree to which a profile


agrees with its closest balanced profiles. At one extreme, $\varphi_{j,0} = \varphi_j$; at the other, $\varphi_{j,1}(P) = \varphi_j(P)$ if and only if $P$ itself is a balanced profile. This proposal is in the spirit of McMorris and Neumann's (1983) specification of the $M_l$ family of consensus functions for n-trees, that family's extremes corresponding to majority-rule consensus (Margush and McMorris, 1981) and strict consensus (Nelson, 1979; Sokal and Rohlf, 1981). It might be interesting to investigate whether the $\varphi_{j,l}$ family has useful axiomatic characterizations.

The computational complexities of algorithms to evaluate the $j$-plurality rules $\varphi_j: \bigcup_{k>0} S^k \to \Pi(\Pi_j(S))$ depend on $k$, $s = |S|$, and $j \le s$. Straightforward algorithms require $O(k + 2^s)$ time and $O(s)$ work space, where the exponential term results from a worst case in which many subsets of $S$ satisfy the plurality-rule criterion. Instead of listing subsets of $S$, the algorithms could list alternatives from which a specified number must be selected. With this modification, straightforward algorithms require $O(k + js)$ time and $O(s)$ work space. When analysing sequences of nucleotides or proteins, it is usual to consider $s$ as being fixed by the problem's specification. In such cases, algorithms to evaluate the $j$-plurality rules require $O(k)$ time and constant work space; and since any algorithm must spend time of order at least $k$ to inspect the input, these algorithms are considered optimal with respect to their asymptotic behaviour.

Our plurality-rule model for consensus sequences has limitations. It tacitly assumes that analysis of molecules is a multi-stage process in which sequence alignment is performed before the identification of a consensus sequence. Several authors have combined the analyses of these problems. Bains (1986) described a heuristic algorithm which alternates between alignment and consensus phases until convergence yields both an alignment of sequences and a consensus sequence.
The j-plurality rules might be employed in the consensus phase of such an algorithm. Waterman (1986) developed algorithms for sequence alignment which used an algorithm to find consensus words of given length within a larger window (Waterman et al., 1984; Waterman, 1989a). In such algorithms the j-plurality rules might be used to identify consensus words of given length. Our basic model for consensus sequences also assumes that aligned positions are independent of one another, and so plurality rule can be used at each aligned position to calculate a consensus. Our next objective is to relax this assumption so that plurality-rule consensus in one aligned position might depend on a context of other aligned positions. Although Waterman (1984) and Bains (1986) used novel approaches to address this issue, the formal properties of their consensus methods were obscured by algorithmic or biological considerations. We hope to develop a plurality-rule method for consensus sequences which is sensitive to important biological constraints, but of which the formal properties are easier to comprehend.
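Under the independence assumption discussed above, the rules of this paper can be applied one aligned position at a time. The sketch below is an illustration only: it counts each column in a single pass, scores every candidate set with $n_X(P) = (1/k) \sum_{a \in X} \min(n_a(P), \lceil k/|X| \rceil)$, and filters columns with the consensus index and threshold $l$ of the $(j, l)$-plurality rule. The function names, the made-up alignment, and the gap symbol `-` are hypothetical; `beta` must be supplied by the caller as the value of $\beta_j(S, k)$, and the value $2/5 = j/|S|$ used in the demonstration is an assumption for this small case.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

GAP = "-"  # gap symbol, treated as one more alternative


def plurality_rule(column, alternatives, j):
    """Return (phi_j(P), N_j(P)) for one aligned position P."""
    k = len(column)
    counts = Counter(column)  # single O(k) pass; the rest depends only on |S| and j
    items = sorted(set(alternatives))
    scores = {}
    for m in range(1, min(j, len(items), k) + 1):
        cap = -(-k // m)  # ceil(k / m)
        for x in combinations(items, m):
            scores[frozenset(x)] = Fraction(sum(min(counts[a], cap) for a in x), k)
    best = max(scores.values())
    return {x for x, score in scores.items() if score == best}, best


def consensus_index(n_j, beta):
    """I_j(P) = (N_j(P) - beta) / (1 - beta), where beta stands for beta_j(S, k) < 1."""
    return (n_j - beta) / (1 - beta)


def column_consensus(sequences, alternatives, j, beta, threshold=0):
    """Apply the (j, l)-plurality rule, with l = threshold, at each aligned position."""
    result = []
    for column in zip(*sequences):
        winners, n_j = plurality_rule(column, alternatives, j)
        keep = consensus_index(n_j, beta) >= threshold
        result.append(winners if keep else set())
    return result


if __name__ == "__main__":
    alignment = ["AAT", "GAT", "-AC", "-AC", "GAG"]  # made-up aligned sequences
    for winners in column_consensus(alignment, "ACGT" + GAP, j=2, beta=Fraction(2, 5)):
        print(sorted("".join(sorted(x)) for x in winners))
```

The first column of the made-up alignment is the profile $(A, G, -, -, G)$ from the gap example, for which the sketch returns the single consensus element $\{G, -\}$. Raising `threshold` to 1 keeps only columns that are themselves balanced.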


This work was supported in part by the Canadian Institute for Advanced Research, the Natural Sciences and Engineering Research Council of Canada under Grant A-4142, and the U.S. Office of Naval Research under Grant N00014-89-J-1643. The first author is an Associate in the Program in Evolutionary Biology of the Canadian Institute for Advanced Research.

LITERATURE

Bains, W. 1986. MULTAN: a program to align multiple DNA sequences. Nucl. Acids Res. 14, 159-177.
Barthélemy, J.-P. and F. R. McMorris. 1986. The median procedure for n-trees. J. Classification 3, 329-334.
Barthélemy, J.-P. and B. Monjardet. 1981. The median procedure in cluster analysis and social choice theory. Math. Soc. Sci. 1, 235-267.
Campbell, D. E. 1982. On the derivation of majority rule. Theor. Decision 14, 133-140.
Day, W. H. E. 1988. Consensus methods as tools for data analysis. In Classification and Related Methods of Data Analysis, H. H. Bock (Ed.), pp. 317-324. Amsterdam: Elsevier.
Day, W. H. E. and F. R. McMorris. 1985. A formalization of consensus index methods. Bull. math. Biol. 47, 215-229.
Fishburn, P. C. 1977. Condorcet social choice functions. SIAM J. appl. Math. 33, 469-489.
Garey, M. R. and D. S. Johnson. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco: W. H. Freeman.
Kemeny, J. G. 1959. Mathematics without numbers. Daedalus 88, 577-591.
Kemeny, J. G. and J. L. Snell. 1962. Preference rankings: an axiomatic approach. In Mathematical Models in the Social Sciences, Ch. 2, pp. 9-23. New York: Ginn.
Levenglick, A. 1975. Fair and reasonable election systems. Behav. Sci. 20, 34-46.
Maier, D. 1978. The complexity of some problems on subsequences and supersequences. J. Assoc. Comput. Mach. 25, 322-336.
Margush, T. and F. R. McMorris. 1981. Consensus n-trees. Bull. math. Biol. 43, 239-244.
May, K. O. 1952. A set of independent, necessary and sufficient conditions for simple majority decision. Econometrica 20, 680-684.
McMorris, F. R. and D. Neumann. 1983. Consensus functions defined on trees. Math. Soc. Sci. 4, 131-136.
Nelson, G. 1979. Cladistic analysis and synthesis: principles and definitions, with a historical note on Adanson's Familles des Plantes (1763-1764). Syst. Zool. 28, 1-21.
Richelson, J. 1975. A comparative analysis of social choice functions. Behav. Sci. 20, 331-337.
Richelson, J. 1978. A characterization theorem for the plurality rule. J. Econ. Theory 19, 548-550.
Roberts, F. S. 1991. Characterizations of the plurality function. Math. Soc. Sci. 21, 101-127.
Sokal, R. R. and F. J. Rohlf. 1981. Taxonomic congruence in the Leptopodomorpha reexamined. Syst. Zool. 30, 309-325.
Waterman, M. S. 1986. Multiple sequence alignment by consensus. Nucl. Acids Res. 14, 9095-9102.
Waterman, M. S. 1989a. Consensus patterns in sequences. In Mathematical Methods for DNA Sequences, M. S. Waterman (Ed.), pp. 93-115. Boca Raton: CRC.
Waterman, M. S. 1989b. Sequence alignments. In Mathematical Methods for DNA Sequences, M. S. Waterman (Ed.), pp. 53-92. Boca Raton: CRC.
Waterman, M. S., R. Arratia and D. J. Galas. 1984. Pattern recognition in several sequences: consensus and alignment. Bull. math. Biol. 46, 515-527.
Young, H. P. 1974. An axiomatization of Borda's rule. J. Econ. Theory 9, 43-52.


Young, H. P. 1975. Social choice scoring functions. SIAM J. appl. Math. 28, 824-838.
Young, H. P. and A. Levenglick. 1978. A consistent extension of Condorcet's election principle. SIAM J. appl. Math. 35, 285-300.

Received 4 March 1991
Revised 11 September 1991
