Bulletin of Mathematical Biology Vol. 55, No. 4, pp. 695-713, 1993. Printed in Great Britain.
0092-8240/93 $6.00 + 0.00 Pergamon Press Ltd. © 1993 Society for Mathematical Biology
CONSENSUS FUNCTIONS AND PATTERNS IN MOLECULAR SEQUENCES
BORIS MIRKIN
Department of Informatics and Applied Statistics, Central Economics-Mathematics Institute, Krasikova 32, Moscow 117418, Russia

FRED S. ROBERTS
Department of Mathematics and Center for Operations Research, Rutgers University, New Brunswick, NJ 08903, U.S.A.
(E-mail: froberts@dimacs.rutgers.edu)

In recent years, methods of consensus, developed for the solution of problems in the social sciences, have become widely used in molecular biology. We study a method of consensus originally due to Waterman et al. (Waterman, Galas and Arratia. 1984. Pattern recognition in several sequences: consensus and alignment. Bull. math. Biol. 46, 515-527) which is used to identify patterns or features in a molecular sequence where a pattern can vary in position within a given window. We show that some well-known consensus methods of the social sciences, the median and the mean, are special cases of this method for certain choices of the parameters used in it and give a precise account of the parameters for which these special cases arise. We also show that the specific parameters used in the method of Waterman et al. make their method equivalent to the median procedure which is widely used in the social sciences.
1. Introduction. Methods of consensus have been studied in the social sciences for centuries. Such methods, stimulated by the fundamental work of Arrow (1951), have found widespread application in problems of voting and social choice, data analysis and classification. In recent years, they have become widely used in molecular biology. Day and McMorris (1992a) survey the use of such methods in molecular biology and make a critical comparison of nine consensus methods used in the field; Day and McMorris (1992b) has many references. There are literally hundreds of papers in the field of "alignment and consensus". The purpose of this paper is to explore a particular consensus method originally due to Waterman et al. (1984) and to show that some widely-used methods of consensus are, surprisingly, special cases of the method, and that the special case of the method adopted by Waterman et al. and others is in fact the well-known median procedure of the social sciences. Moreover, we give conditions under which different widely-used consensus methods are special cases of the Waterman et al. method.
The method of Waterman et al. is also explained and clarified in Galas et al. (1985) and Waterman (1989) and applied in Waterman (1986) and Waterman and Jones (1990). We follow the description given in Waterman (1989) and, for simplicity, refer to this as Waterman's method. Waterman's method is concerned with the widely-studied problem of pattern recognition in molecular sequence analysis. The problem is to identify patterns or features which are conserved, although not conserved precisely in location or in entries, in different sequences. Let us suppose that we have a number of sequences where each entry is chosen from an alphabet. A pattern is thought of as a subsequence of contiguous letters of a rather short, fixed length $k$. We allow some shifts in locations of patterns; therefore, we study windows or words of length longer than the length of the patterns we are seeking. The windows all start at the same position, say the $j$th, in each sequence and all have the same length $L$. For each possible pattern we ask how closely it can be matched in each of the sequences in a window of length $L$ starting at the $j$th position. We wish to find a pattern or set of patterns which are a consensus pattern, i.e. which are in some sense a good fit in as many of the windows as possible.

To state this idea precisely, let us take $\Sigma$ to be a finite alphabet and consider a function $F$ which assigns to each finite collection $\mathcal{A}$ of words on $\Sigma$ of length $L$ a set of words of length $k$, $F(\mathcal{A})$. The set $F(\mathcal{A})$ is thought of as the set of consensus patterns. One way to define $F$, the method used by Waterman et al. (1984), Galas et al. (1985) and Waterman (1989), starts with a measure of the distance $d(w, a)$ between a potential pattern word $w$ and a word $a$ from $\mathcal{A}$. Although, as these authors observe, the method makes sense for various ways to measure $d(w, a)$, they emphasize the case where $d(w, a)$ is the smallest number of mismatches in all possible alignments of $w$ within $a$, and this is the way we shall define $d$. They then use parameters $\lambda_d$ which are (implicitly assumed to be) monotone nonincreasing in $d$ and let $F(\mathcal{A})$ be the set of all words $w$ of length $k$ which maximize:

$$s_{\mathcal{A}}(w) = s(w) = \sum_{a \in \mathcal{A}} \lambda_{d(w, a)}.$$
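The distance and the score just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, candidate patterns are enumerated exhaustively (this is not the computational method of Waterman et al.), and the parameters $\lambda_d$ are passed as exact fractions so that ties between scores are detected reliably.

```python
from fractions import Fraction
from itertools import product


def mismatch_distance(w, a):
    """d(w, a): smallest number of mismatches of w over all placements inside a."""
    k, L = len(w), len(a)
    if k > L:
        raise ValueError("the pattern must not be longer than the window")
    return min(
        sum(x != y for x, y in zip(w, a[start:start + k]))
        for start in range(L - k + 1)
    )


def waterman_score(w, words, lam):
    """s(w): sum over the words a of lam[d(w, a)]; lam is indexed by 0..k."""
    return sum(lam[mismatch_distance(w, a)] for a in words)


def consensus_patterns(words, k, lam, alphabet="01"):
    """F(A): all words of length k over the alphabet that maximize s(w)."""
    candidates = ["".join(p) for p in product(alphabet, repeat=k)]
    scores = {w: waterman_score(w, words, lam) for w in candidates}
    best = max(scores.values())
    return sorted(w for w, value in scores.items() if value == best)


# Illustrative parameters lambda_d = (k - d)/k for k = 2.
k = 2
lam = [Fraction(k - d, k) for d in range(k + 1)]
print(mismatch_distance("000", "101010"))
print(consensus_patterns(["111010", "111111"], k, lam))
```

Exhaustive enumeration costs $|\Sigma|^k$ score evaluations, so the sketch is only practical for short patterns and small alphabets.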
The specific parameters $\lambda_d$ used by Waterman et al. (1984), Galas et al. (1985) and Waterman (1989) are $\lambda_d = (k - d)/k$.

To give an example, an alphabet frequently used is the purine/pyrimidine alphabet $\{R, Y\}$, where $R = A$ or $G$ and $Y = C$ or $T$. For notational simplicity it is easier to use the digits 0, 1 rather than the letters $R$, $Y$. Thus, suppose that $\Sigma = \{0, 1\}$, let $w = 000$ be a potential pattern word and let $a = 101010$ be a window in a molecular sequence. Then, $d(w, a) = 1$, since there is one mismatch if we start $w$ in the second position of $a$, and there is no way to place $w$ in $a$ with no mismatches. Next, suppose $\mathcal{A}$ consists of the words $a = 111010$ and $b = 111111$ and suppose $k = 2$. Then, the possible pattern words are $w_1 = 00$, $w_2 = 01$, $w_3 = 10$ and $w_4 = 11$. We have:

$$d(w_1, a) = 1,\ d(w_1, b) = 2,\ d(w_2, a) = 0,\ d(w_2, b) = 1,\ d(w_3, a) = 0,\ d(w_3, b) = 1,\ d(w_4, a) = 0,\ d(w_4, b) = 0.$$

Hence:

$$s_{\mathcal{A}}(w_1) = \lambda_1 + \lambda_2, \quad s_{\mathcal{A}}(w_2) = \lambda_0 + \lambda_1, \quad s_{\mathcal{A}}(w_3) = \lambda_0 + \lambda_1, \quad s_{\mathcal{A}}(w_4) = \lambda_0 + \lambda_0.$$
As long as $\lambda_0 > \lambda_1 > \lambda_2$, it follows that $w_4$ is the consensus pattern, according to Waterman's method.

To give a slightly more difficult example, suppose that $\Sigma = \{0, 1\}$, $a_1 = 000000$, $a_2 = 100000$, $a_3 = 111110$, $\mathcal{A} = \{a_1, a_2, a_3\}$ and $k = 3$. The possible pattern words are: $w_1 = 000$, $w_2 = 001$, $w_3 = 010$, $w_4 = 011$, $w_5 = 100$, $w_6 = 101$, $w_7 = 110$, $w_8 = 111$. Then, $d(w_i, a_j)$ is given by the $i, j$ entry of the following matrix:

$$\begin{pmatrix}
0 & 0 & 2 \\
1 & 1 & 2 \\
1 & 1 & 1 \\
2 & 2 & 1 \\
1 & 0 & 1 \\
2 & 1 & 1 \\
2 & 1 & 0 \\
3 & 2 & 0
\end{pmatrix}$$

If $\lambda_0 > \lambda_1 > \lambda_2 > \lambda_3$, then:

$$s_{\mathcal{A}}(w_2) = \lambda_2 + 2\lambda_1 < \lambda_2 + 2\lambda_0 = s_{\mathcal{A}}(w_1)$$

and

$$s_{\mathcal{A}}(w_8) = \lambda_3 + \lambda_2 + \lambda_0 < \lambda_2 + \lambda_1 + \lambda_0 = s_{\mathcal{A}}(w_7).$$

By similar analysis one sees that the score is maximized by either:

$$s_{\mathcal{A}}(w_1) = \lambda_2 + 2\lambda_0$$

or

$$s_{\mathcal{A}}(w_5) = 2\lambda_1 + \lambda_0,$$

so which is the consensus depends on which is bigger, $\lambda_2 + \lambda_0$ or $2\lambda_1$.

We can define other consensus functions $F$. One is the median, the collection of all words $w$ of length $k$ which minimize:
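$$\sigma_{\mathcal{A}}(w) = \sigma(w) = \sum_{a \in \mathcal{A}} d(w, a);$$

and another is the mean, the collection of all words $w$ of length $k$ which minimize:

$$\tau_{\mathcal{A}}(w) = \tau(w) = \sum_{a \in \mathcal{A}} d(w, a)^2.$$

Another measure which might be of interest to minimize is a convex combination of these two:

$$\gamma_{\mathcal{A}}(w) = \gamma(w) = \alpha\sigma(w) + (1 - \alpha)\tau(w), \qquad \alpha \in [0, 1].$$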
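As a hedged illustration of the three criteria $\sigma$, $\tau$ and $\gamma$, the following Python sketch minimizes each of them by brute force. The helper names are hypothetical, the sample words are the four binary words $1111$, $0000$, $1000$, $0001$ with $k = 2$, and the value $\alpha = 1/2$ is an arbitrary choice made only for the demonstration.

```python
from fractions import Fraction
from itertools import product


def mismatch_distance(w, a):
    """Smallest number of mismatches of w over all placements inside a."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[p:p + k]))
               for p in range(len(a) - k + 1))


def sigma(w, words):
    """Median criterion: sum of the distances d(w, a)."""
    return sum(mismatch_distance(w, a) for a in words)


def tau(w, words):
    """Mean criterion: sum of the squared distances d(w, a)^2."""
    return sum(mismatch_distance(w, a) ** 2 for a in words)


def gamma(w, words, alpha):
    """Mixed criterion: alpha * sigma + (1 - alpha) * tau."""
    return alpha * sigma(w, words) + (1 - alpha) * tau(w, words)


def minimizers(criterion, k, alphabet="01"):
    """All words of length k over the alphabet that minimize the criterion."""
    candidates = ["".join(p) for p in product(alphabet, repeat=k)]
    values = {w: criterion(w) for w in candidates}
    best = min(values.values())
    return sorted(w for w, value in values.items() if value == best)


words = ["1111", "0000", "1000", "0001"]
median = minimizers(lambda w: sigma(w, words), k=2)
mean = minimizers(lambda w: tau(w, words), k=2)
mixed = minimizers(lambda w: gamma(w, words, Fraction(1, 2)), k=2)
print(median, mean, mixed)
```

On this data the median and the mean are disjoint, while the mixed criterion with $\alpha = 1/2$ ties all three of those words, which shows how the weight $\alpha$ trades one criterion against the other.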
The words minimizing $\gamma$ will be called the mixed median-mean. Mixed median-means might be of interest if we are not ready to choose either medians or means, and so want some combination of the two. [Other consensus functions which we shall not study but which might be of interest involve minimizing $\sum d(w, a)^m$ for fixed $m$ and convex combinations of these sums for different $m$s. We might also consider minimizing $\sum \log d(w, a)^m$ for fixed $m$.] Minimizing $\sigma_{\mathcal{A}}$, $\tau_{\mathcal{A}}$ or $\gamma_{\mathcal{A}}$ leads to words which are in some sense best fitting to $\mathcal{A}$.

There is a long literature on means and medians in the group consensus literature. Some surveys are the papers and books by Barthélemy (1989), Barthélemy and Monjardet (1981, 1988), Barthélemy et al. (1982), Marcotorchino and Michaud (1978), Mirkin (1979) and Roberts (1976).

To illustrate these consensus concepts, we again take $\Sigma = \{0, 1\}$. Let: $a_1 = 1111$, $a_2 = 0000$, $a_3 = 1000$, $a_4 = 0001$, $\mathcal{A} = \{a_1, a_2, a_3, a_4\}$ and $k = 2$. Then, the possible pattern words are $w_1 = 00$, $w_2 = 01$, $w_3 = 10$ and $w_4 = 11$ and $d(w_i, a_j)$ is given by the $i, j$ entry of the following matrix:

$$\begin{pmatrix}
2 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
0 & 2 & 1 & 1
\end{pmatrix}$$

Hence:

$$\sum_{j=1}^{4} d(w_1, a_j) = 2 + 0 + 0 + 0 = 2,$$

and similarly

$$\sum_{j=1}^{4} d(w_2, a_j) = 3, \quad \sum_{j=1}^{4} d(w_3, a_j) = 3, \quad \sum_{j=1}^{4} d(w_4, a_j) = 4,$$

so $w_1$ is the median. However:

$$\sum_{j=1}^{4} d(w_1, a_j)^2 = 4, \quad \sum_{j=1}^{4} d(w_2, a_j)^2 = 3, \quad \sum_{j=1}^{4} d(w_3, a_j)^2 = 3, \quad \sum_{j=1}^{4} d(w_4, a_j)^2 = 6;$$

therefore, there are two elements making up the mean, $w_2$ and $w_3$, neither of which is in the solution provided by the median.

This paper was motivated by the observation that for fixed $k \ge 2$ and $\Sigma$ of size at least 2, there is a choice of $\lambda_d$ such that for all $L \ge k$ and for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite sets $\mathcal{A}$ (of any size) of words of length $L$:
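$$\sigma_{\mathcal{A}}(w) \ge \sigma_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w'). \tag{1}$$

Thus, for these $\lambda_d$ we maximize $s_{\mathcal{A}}$ (as in Waterman's method) if and only if we minimize $\sigma_{\mathcal{A}}$ (as in the median). In particular, this is the case with the parameters $\lambda_d = (k - d)/k$ used by Waterman et al. (1984), Galas et al. (1985) and Waterman (1989). (We shall note that the same conclusions follow trivially if $k = 1$.) Similarly, given $k \ge 2$ and $\Sigma$ of size at least 2, there is a choice of $\lambda_d$ so that:

$$\tau_{\mathcal{A}}(w) \ge \tau_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w'). \tag{2}$$

Thus, for these $\lambda_d$ we maximize $s_{\mathcal{A}}$ (as in Waterman's method) if and only if we minimize $\tau_{\mathcal{A}}$ (as in the mean). More generally, for all rational $\alpha \in [0, 1]$ there is a choice of $\lambda_d$ so that:

$$\gamma_{\mathcal{A}}(w) \ge \gamma_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w'). \tag{3}$$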
It follows that median, mean and mixed median-mean are special cases of the Waterman scoring method. Our main purpose in this paper will be to give a precise account of the values of $\lambda_d$ for which each of these measures is the Waterman method.

2. The Main Results. To compare the functions $s(w)$, $\sigma(w)$, $\tau(w)$ and $\gamma(w)$ it is useful to use the following notation. Given $w$, let $t_i$ be the number of words $a$ in $\mathcal{A}$ such that $d(w, a) = i$. Note that $t_i = 0$ for $i > k$ and that:

$$s(w) = \sum_{i=0}^{k} t_i \lambda_i, \qquad \sigma(w) = \sum_{i=0}^{k} t_i i, \qquad \tau(w) = \sum_{i=0}^{k} t_i i^2, \qquad \gamma(w) = \sum_{i=0}^{k} \alpha_i t_i;$$

where $\alpha_i = \alpha i + (1 - \alpha) i^2$.

From the point of view of the application we have in mind, it seems natural to allow $\mathcal{A}$ to have repeated words. (This could correspond literally to repeats or to giving more significance to some words than others.) Allowing repeated words will make the analysis easier. However, for mathematical completeness we also consider the possibility that $\mathcal{A}$ does not have repeated words. If it does not, we call $\mathcal{A}$ nonrepetitive. In the following it will be convenient to let the alphabet $\Sigma$ consist of the elements $0, 1, 2, \ldots, n$. All of our alphabets will have size at least 2 and hence consist of at least 0 and 1.

THEOREM 1. Suppose $k$ is fixed, $k \ge 2$, $\Sigma$ is an alphabet of at least two letters and $\alpha$ in the definition of $\gamma$ is in $[0, 1]$. Suppose that $\lambda_0, \lambda_1, \ldots$ is a sequence of real numbers. Suppose that:

$$\gamma_{\mathcal{A}}(w) \ge \gamma_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w') \tag{3}$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite nonrepetitive sets $\mathcal{A}$ (of any size) of words of length $L \ge k$ from $\Sigma$. Then there are constants $A$, $B$ and $C$, so that for all $j$ with $0 \le j \le k$:

$$\lambda_j = Aj^2 + Bj + C. \tag{4}$$
Proof. Since $\Sigma$ has at least two elements it has at least the elements 0 and 1. Fix $i \ge 0$, suppose $k > i + 1$ and consider the following words of length $L$ from $\Sigma$:

$a = 0 \ldots 0 1 \ldots 1$, with $k$ 0s;
$b = 1 1 0 \ldots 0 1 \ldots 1$, with $L - k$ 1s at the end;
$c = 0 \ldots 0 1 \ldots 1$, with $k - (i + 1)$ 0s at the beginning;
$d = 1 0 \ldots 0 1 \ldots 1$, with $k - (i + 1)$ 0s.

Note that $\mathcal{A} = \{a, b, c, d\}$ is nonrepetitive. Let: $w = 0 \ldots 0$, $w' = 1 0 \ldots 0$, be two words of length $k$. Then note that:

$$d(w, a) = 0,\ d(w, b) = 2,\ d(w, c) = i + 1,\ d(w, d) = i + 1,$$
$$d(w', a) = 1,\ d(w', b) = 1,\ d(w', c) = i + 2 \ (\text{since } k > i + 1),\ d(w', d) = i.$$

(All these are obtained with the pattern beginning from the first position in the window.) Hence, it follows that:

$$\sigma_{\mathcal{A}}(w) = 2i + 4 = \sigma_{\mathcal{A}}(w'), \qquad \tau_{\mathcal{A}}(w) = 2i^2 + 4i + 6 = \tau_{\mathcal{A}}(w'),$$

and, therefore, that:

$$\gamma_{\mathcal{A}}(w) = \gamma_{\mathcal{A}}(w').$$

It follows from (3) that $s_{\mathcal{A}}(w) = s_{\mathcal{A}}(w')$ and thus for $2 \le i + 2 \le k$:

$$\lambda_0 + \lambda_2 + 2\lambda_{i+1} = 2\lambda_1 + \lambda_{i+2} + \lambda_i. \tag{5}$$

Letting $\Delta_i = \lambda_i - \lambda_{i-1}$ we deduce from (5) that for $2 \le i + 2 \le k$:

$$\Delta_{i+2} = \Delta_{i+1} + \Delta_2 - \Delta_1.$$

It follows that for $k \ge 3$: $\Delta_3 = 2\Delta_2 - \Delta_1$, and in general, for all $2 \le j \le k$: $\Delta_j = (j - 1)\Delta_2 - (j - 2)\Delta_1$. Thus, for $2 \le j \le k$:
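$$\begin{aligned}
\lambda_j &= \Delta_j + \lambda_{j-1} \\
&= \Delta_j + \Delta_{j-1} + \lambda_{j-2} \\
&= \cdots \\
&= \sum_{i=2}^{j} \Delta_i + \lambda_1 \\
&= \sum_{i=2}^{j} \left[(i - 1)\Delta_2 - (i - 2)\Delta_1\right] + \lambda_1 \\
&= \Delta_2 \sum_{i=2}^{j} (i - 1) - \Delta_1 \sum_{i=2}^{j} (i - 2) + \lambda_1 \\
&= Aj^2 + Bj + C,
\end{aligned} \tag{6}$$

where

$$A = \frac{\Delta_2 - \Delta_1}{2} = \tfrac{1}{2}(\lambda_2 - 2\lambda_1 + \lambda_0), \tag{7}$$

$$B = -\frac{\Delta_2}{2} + \frac{3\Delta_1}{2} = \tfrac{1}{2}(-\lambda_2 + 4\lambda_1 - 3\lambda_0), \tag{8}$$

$$C = -\Delta_1 + \lambda_1 = \lambda_0. \tag{9}$$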
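The passage from the recurrence $\Delta_{i+2} = \Delta_{i+1} + \Delta_2 - \Delta_1$ to the quadratic form (4) can be checked numerically. The Python sketch below is only a sanity check under assumed inputs: the function names are hypothetical and the starting values of $\lambda_0$, $\lambda_1$, $\lambda_2$ are arbitrary. It extends the sequence by the recurrence and compares each term with $Aj^2 + Bj + C$, using the coefficients of (7), (8) and (9).

```python
from fractions import Fraction


def coefficients(lam0, lam1, lam2):
    """A, B, C as given by equations (7), (8) and (9)."""
    lam0, lam1, lam2 = Fraction(lam0), Fraction(lam1), Fraction(lam2)
    A = (lam2 - 2 * lam1 + lam0) / 2
    B = (-lam2 + 4 * lam1 - 3 * lam0) / 2
    C = lam0
    return A, B, C


def extend_by_recurrence(lam0, lam1, lam2, k):
    """lambda_0..lambda_k (k >= 2) from Delta_{i+2} = Delta_{i+1} + Delta_2 - Delta_1."""
    lam = [Fraction(lam0), Fraction(lam1), Fraction(lam2)]
    delta_1 = lam[1] - lam[0]
    delta_2 = lam[2] - lam[1]
    delta = delta_2
    for _ in range(3, k + 1):
        delta = delta + delta_2 - delta_1   # next difference Delta_j
        lam.append(lam[-1] + delta)         # lambda_j = lambda_{j-1} + Delta_j
    return lam


# Arbitrary starting values, chosen only for the demonstration.
lam = extend_by_recurrence(5, 3, -2, k=6)
A, B, C = coefficients(5, 3, -2)
print(all(lam[j] == A * j * j + B * j + C for j in range(7)))
```

Exact rational arithmetic is used so that the comparison is an equality test rather than a floating-point approximation.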
We have shown that equation (4) holds for $2 \le j \le k$. It also holds for $j = 0$ since $C = \lambda_0$ by (9), and for $j = 1$ since:

$$A + B + C = \Delta_1 + \lambda_0 = \lambda_1.$$

Remarks. (a) It is interesting to note that Theorem 1 does not need the hypothesis that the $\lambda_d$ are nonincreasing or the hypothesis $\lambda_0 > \lambda_1$, which is used in the rest of this paper. (b) It is also interesting to note that the result still holds (with the same proof) if we only require (3) for sets $\mathcal{A}$ of words of fixed length $L$, $L \ge k$. The same thing will be true in all of the following theorems. We have chosen to state the results with varying $L$. However, there will be one point where we will need to use the result of Theorem 1 with fixed $L$. (c) If $k = 1$, equation (4) for $0 \le j \le k$ [...]

THEOREM 2. Suppose $k$ is fixed, $k \ge 2$ and $\Sigma$ is an alphabet of at least two letters. Suppose that $\lambda_0, \lambda_1, \ldots$ is a sequence of real numbers with $\lambda_1 < \lambda_0$. Then the following are equivalent:

(a) The equivalence:

$$\sigma_{\mathcal{A}}(w) \ge \sigma_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w') \tag{1}$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite nonrepetitive sets $\mathcal{A}$ (of any size) of words of length $L \ge k$ from $\Sigma$.

(b) There are constants $B$, $C$, $B < 0$, so that for all $j$ with $0 \le j \le k$:

$$\lambda_j = Bj + C. \tag{10}$$
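Proof. Taking $\alpha = 1$, we know from Theorem 1 and its proof that (a) implies equation (4) with $A$, $B$ and $C$ given by equations (7), (8) and (9). Let $\mathcal{A}$ consist of the two words:

$a = 0 \ldots 0$,
$b = 0 \ldots 0 1 \ldots 1$, with $k - 2$ 0s,

and let: $w = 0 \ldots 0$, $w' = 0 \ldots 0 1$ be two words of length $k$. Then $a \ne b$ and we have:

$$d(w, a) = 0,\ d(w, b) = 2,\ d(w', a) = 1,\ d(w', b) = 1.$$

Since $\sigma_{\mathcal{A}}(w) = 2 = \sigma_{\mathcal{A}}(w')$, equivalence (1) tells us that $s_{\mathcal{A}}(w) = s_{\mathcal{A}}(w')$, and so:

$$\lambda_0 + \lambda_2 = \lambda_1 + \lambda_1.$$

By equation (7) this implies that $A = 0$. Hence, from (4) we have equation (10). Also, note that from (10), $\lambda_1 = B + C$, so that from (9), $B = \lambda_1 - \lambda_0 < 0$, the latter inequality by hypothesis.

Next, suppose that (b) holds. Write $t'_i$ for the number of words $a$ in $\mathcal{A}$ with $d(w', a) = i$. Then, since $\sum_{i=0}^{k} t_i = \sum_{i=0}^{k} t'_i = |\mathcal{A}|$ and $B < 0$:

$$\begin{aligned}
\sigma_{\mathcal{A}}(w) \ge \sigma_{\mathcal{A}}(w')
&\iff \sum_{i=0}^{k} t_i i \ge \sum_{i=0}^{k} t'_i i \\
&\iff B \sum_{i=0}^{k} t_i i \le B \sum_{i=0}^{k} t'_i i \\
&\iff B \sum_{i=0}^{k} t_i i + C \sum_{i=0}^{k} t_i \le B \sum_{i=0}^{k} t'_i i + C \sum_{i=0}^{k} t'_i \\
&\iff \sum_{i=0}^{k} t_i \lambda_i \le \sum_{i=0}^{k} t'_i \lambda_i \\
&\iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w').
\end{aligned}$$

∎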
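The implication from (b) to (a) can be checked by brute force on a tiny instance. The following Python sketch makes several assumptions that are not in the text: binary alphabet, $k = 2$, $L = 3$, hypothetical function names, and exhaustive enumeration of every nonempty nonrepetitive set of windows. It tests equivalence (1) once for the linear parameters $\lambda_j = (k - j)/k$ and once for an arbitrary sequence with $\lambda_0 + \lambda_2 \ne 2\lambda_1$.

```python
from fractions import Fraction
from itertools import combinations, product


def mismatch_distance(w, a):
    """Smallest number of mismatches of w over all placements inside a."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[p:p + k]))
               for p in range(len(a) - k + 1))


def equivalence_one_holds(lam, k, L, alphabet="01"):
    """True if sigma(w) >= sigma(w') <=> s(w) <= s(w') on every nonrepetitive set."""
    patterns = ["".join(p) for p in product(alphabet, repeat=k)]
    windows = ["".join(p) for p in product(alphabet, repeat=L)]
    dist = {(w, a): mismatch_distance(w, a) for w in patterns for a in windows}
    for size in range(1, len(windows) + 1):
        for subset in combinations(windows, size):
            sigma = {w: sum(dist[w, a] for a in subset) for w in patterns}
            score = {w: sum(lam[dist[w, a]] for a in subset) for w in patterns}
            for w, v in product(patterns, repeat=2):
                if (sigma[w] >= sigma[v]) != (score[w] <= score[v]):
                    return False
    return True


k = 2
linear = [Fraction(k - j, k) for j in range(k + 1)]   # lambda_j = Bj + C with B = -1/k, C = 1
bent = [Fraction(3), Fraction(2), Fraction(0)]        # lambda_0 + lambda_2 != 2 * lambda_1
print(equivalence_one_holds(linear, k, L=3))
print(equivalence_one_holds(bent, k, L=3))
```

The second sequence fails on the pair of windows $000$ and $111$ with $w = 00$, $w' = 01$, which is the two-word construction used in the proof for $k = 2$.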
This theorem says that for $k \ge 2$, $\Sigma$ of size at least 2 and $\lambda_1 < \lambda_0$, the median procedure corresponds exactly to the choice of parameters $\lambda_j = Bj + C$, $B < 0$.

COROLLARY 1. If $k \ge 1$, $\Sigma$ has at least one element, and:

$$\lambda_j = \frac{k - j}{k}$$

for $0 \le j \le k$, then equivalence (1) holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite sets $\mathcal{A}$ (of any size) of words of length $L \ge k$ from $\Sigma$.

Proof. The proof that (b) implies (a) does not use the hypotheses that $k \ge 2$ or $\Sigma$ has at least two elements, and it does not require nonrepetitiveness in (a). Note that (b) holds by taking $B = -1/k$ and $C = 1$. ∎

This corollary says that Waterman's method with the parameters used in Waterman et al. (1984), Galas et al. (1985) and Waterman (1989) is in fact the median procedure. The corollary is easy to prove without the machinery we have developed here by simply verifying equation (1) directly with $\lambda_j = (k - j)/k$.

Remark. If $k = 1$, then (b) follows without any hypotheses. Just pick $C = \lambda_0$ and $B = \lambda_1 - C$. Thus, (a) holds, again by the same proof as in the theorem.

In the next theorem we allow the set $\mathcal{A}$ to have repetitions.

THEOREM 3. Suppose $k$ is fixed, $k \ge 2$, $\Sigma$ is an alphabet of at least two letters and $\alpha$ in the definition of $\gamma$ is a rational number in $[0, 1]$. Suppose that $\lambda_0, \lambda_1, \ldots$ is a sequence of real numbers with $\lambda_1 < \lambda_0$. Then the following are equivalent:

(a) The equivalence:

$$\gamma_{\mathcal{A}}(w) \ge \gamma_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w') \tag{3}$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite sets $\mathcal{A}$ (of any size) of words of length $L \ge k$ from $\Sigma$.

(b) There are constants $D$ and $E$, $D < 0$, so that for all $j$ with $0 \le j \le k$:

$$\lambda_j = D(1 - \alpha)j^2 + D\alpha j + E. \tag{11}$$
Proof. Suppose that (a) holds. From Theorem 1 and its proof equation (4) holds for $0 \le j \le k$. We shall show that:

$$\lambda_2 = 2(2 - \alpha)\lambda_1 - [2(2 - \alpha) - 1]\lambda_0. \tag{12}$$

Now take $E = \lambda_0 = C$ and $D = A + B$. We shall show from (4) and (12) that (11) holds with this $D$ and $E$. It holds trivially for $j = 0$ and it holds for $j = 1$ because by (4):

$$\lambda_1 = A + B + C = D + E.$$

It holds for $j = 2$ because by (12):

$$\lambda_2 = 2(2 - \alpha)(D + E) - [2(2 - \alpha) - 1]E = 2(2 - \alpha)D + E = D[2\alpha + 2^2(1 - \alpha)] + E.$$

Hence, we have shown that for $j = 0, 1, 2$:

$$\lambda_j = A'j^2 + B'j + C', \tag{13}$$

where $A'$, $B'$ and $C'$ are given by the coefficients in (11). Now, since (4) and (13) both hold for $j = 0, 1, 2$, it is easy to show that $A = A'$, $B = B'$ and $C = C'$. Hence, (13) and, therefore, (11) holds for $0 \le j \le k$ since equation (4) does. Finally, $D < 0$ because by (11):

$$\lambda_1 = D + E = D + \lambda_0,$$

and $\lambda_1 < \lambda_0$ by hypothesis.

To prove (b) it remains to prove (12). Since $\alpha$ is a rational in $[0, 1]$, we can write $2(2 - \alpha)$ as $s/t$ for positive integers $s > t$. Let $\mathcal{A}$ consist of $s - t$ words of the form:

$a = 0 \ldots 0$,

and $t$ words of the form:

$b = 0 \ldots 0 1 \ldots 1$, with $k - 2$ 0s.

(Note that repetitions are allowed in this theorem.) Let: $w = 0 \ldots 0$, $w' = 0 \ldots 0 1$ be two words of length $k$. Then, $d(w, a) = 0$, $d(w, b) = 2$, $d(w', a) = 1$ and $d(w', b) = 1$. Recall that:

$$\gamma_{\mathcal{A}}(w) = \sum_{i=0}^{k} \alpha_i t_i,$$

where $\alpha_i = \alpha i + (1 - \alpha)i^2$. Thus:

$$\gamma_{\mathcal{A}}(w) = (s - t)\alpha_0 + t\alpha_2 = (s - t)0 + 2(2 - \alpha)t = \frac{s}{t}\,t = s, \tag{14}$$

$$\gamma_{\mathcal{A}}(w') = s\alpha_1 = s. \tag{15}$$
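Since $\gamma_{\mathcal{A}}(w) = \gamma_{\mathcal{A}}(w')$, equation (3) implies that $s_{\mathcal{A}}(w) = s_{\mathcal{A}}(w')$, and so:

$$(s - t)\lambda_0 + t\lambda_2 = s\lambda_1,$$

from which equation (12) follows.

Next, assume (b). Write $t'_i$ for the number of words $a$ in $\mathcal{A}$ with $d(w', a) = i$. Then, since $D < 0$ and $\sum_{i=0}^{k} t_i = \sum_{i=0}^{k} t'_i = |\mathcal{A}|$,

$$\begin{aligned}
s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w')
&\iff \sum_{i=0}^{k} t_i \lambda_i \le \sum_{i=0}^{k} t'_i \lambda_i \\
&\iff D(1 - \alpha) \sum_{i=0}^{k} t_i i^2 + D\alpha \sum_{i=0}^{k} t_i i + E \sum_{i=0}^{k} t_i \le D(1 - \alpha) \sum_{i=0}^{k} t'_i i^2 + D\alpha \sum_{i=0}^{k} t'_i i + E \sum_{i=0}^{k} t'_i \\
&\iff D \sum_{i=0}^{k} \alpha_i t_i + E \sum_{i=0}^{k} t_i \le D \sum_{i=0}^{k} \alpha_i t'_i + E \sum_{i=0}^{k} t'_i \\
&\iff D \sum_{i=0}^{k} \alpha_i t_i \le D \sum_{i=0}^{k} \alpha_i t'_i \\
&\iff \sum_{i=0}^{k} \alpha_i t_i \ge \sum_{i=0}^{k} \alpha_i t'_i \\
&\iff \gamma_{\mathcal{A}}(w) \ge \gamma_{\mathcal{A}}(w').
\end{aligned}$$

∎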
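The chain of equivalences above rests on the identity $s(w) = D\gamma(w) + E|\mathcal{A}|$, which depends on $w$ only through the counts $t_0, \ldots, t_k$. A small Python sketch can confirm this on count vectors directly. It is an illustration under assumptions: the function names are hypothetical, the values of $k$, $|\mathcal{A}|$, $\alpha$, $D$ and $E$ are arbitrary, and every count vector with the given total is enumerated whether or not it arises from actual words.

```python
from fractions import Fraction
from itertools import product


def count_vectors(k, n):
    """All (t_0, ..., t_k) of nonnegative integers with t_0 + ... + t_k = n."""
    return [t for t in product(range(n + 1), repeat=k + 1) if sum(t) == n]


def mixed_equivalence_holds(k, n, alpha, D, E):
    """Check s = D*gamma + E*n and gamma(t) >= gamma(u) <=> s(t) <= s(u)."""
    weights = [alpha * i + (1 - alpha) * i * i for i in range(k + 1)]
    lam = [D * (1 - alpha) * j * j + D * alpha * j + E for j in range(k + 1)]
    vectors = count_vectors(k, n)
    gamma = {t: sum(a * c for a, c in zip(weights, t)) for t in vectors}
    score = {t: sum(l * c for l, c in zip(lam, t)) for t in vectors}
    if any(score[t] != D * gamma[t] + E * n for t in vectors):
        return False
    return all((gamma[t] >= gamma[u]) == (score[t] <= score[u])
               for t in vectors for u in vectors)


alpha = Fraction(1, 3)
print(mixed_equivalence_holds(3, 4, alpha, D=Fraction(-2), E=Fraction(5)))
print(mixed_equivalence_holds(3, 4, alpha, D=Fraction(2), E=Fraction(5)))
```

With $D < 0$ the orderings are reversed as required; with $D > 0$ the identity still holds but the equivalence fails, which is why the sign condition appears in (b).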
Remark. The result that (b) implies (a) in both Theorems 2 and 3 is in each instance a special case of a more general result. Under the hypotheses of Theorem 2, let $f$ be a real-valued function on $\{0, 1, \ldots, k\}$ and let:

$$\delta_{\mathcal{A}}(w) = \sum_{a \in \mathcal{A}} f(d(w, a)).$$

Suppose there are constants $B$ and $C$, $B < 0$, so that for all $0 \le j \le k$:

$$\lambda_j = Bf(j) + C.$$

Then the equivalence:

$$\delta_{\mathcal{A}}(w) \ge \delta_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w')$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite sets $\mathcal{A}$ (of any size) of words of length $L \ge k$ from $\Sigma$. The proof of this more general result is exactly analogous to the proof that (b) implies (a) in Theorems 2 and 3.

COROLLARY 2. Suppose $k$ is fixed, $k \ge 2$, and $\Sigma$ is an alphabet of at least two letters. Suppose that $\lambda_0, \lambda_1, \ldots$ is a sequence of real numbers with $\lambda_1 < \lambda_0$. Then, the following are equivalent:

(a) The equivalence:

$$\tau_{\mathcal{A}}(w) \ge \tau_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w') \tag{2}$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite sets $\mathcal{A}$ (of any size) of words of length $L \ge k$ from $\Sigma$.

(b) There are constants $A$ and $C$, $A < 0$, so that for all $j$ with $0 \le j \le k$:

$$\lambda_j = Aj^2 + C. \tag{16}$$

Proof. Take $\alpha = 0$ in equation (11). ∎

This corollary says that for $k \ge 2$, $\Sigma$ of size at least 2 and $\lambda_1 < \lambda_0$, the mean procedure corresponds exactly to the choice of the parameters $\lambda_j = Aj^2 + C$, $A < 0$.

Our next goal is to show analogues of Theorem 3 and Corollary 2 if we require $\mathcal{A}$ to be nonrepetitive. To obtain such results, we find it necessary to allow a larger alphabet or to take $k$ sufficiently large, and also to consider only $L$ strictly larger than $k$. If $\alpha$ is a rational number in $[0, 1]$, write $2(2 - \alpha) = s/t$, for $s$ and $t$ positive integers, and let:

$$r = r(\alpha) = \max\{s - t,\ t + 1\}. \tag{17}$$
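A short Python sketch of definition (17) follows. The function name is hypothetical, and taking $s/t$ in lowest terms is an assumption of the sketch, since the text only asks for some pair of positive integers with $2(2 - \alpha) = s/t$.

```python
from fractions import Fraction


def s_t_r(alpha):
    """Return (s, t, r) with 2(2 - alpha) = s/t in lowest terms and r = max(s - t, t + 1)."""
    ratio = 2 * (2 - Fraction(alpha))
    s, t = ratio.numerator, ratio.denominator
    return s, t, max(s - t, t + 1)


for alpha in (Fraction(0), Fraction(1, 3), Fraction(1, 2), Fraction(1)):
    s, t, r = s_t_r(alpha)
    print(alpha, s, t, r, "alphabet size needed:", r + 1)
```

For $\alpha = 0$ this gives $s = 4$, $t = 1$ and $r = 3$, that is, an alphabet of four letters.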
THEOREM 4. Suppose $k$ is fixed, $k \ge 2$, $\alpha$ in the definition of $\gamma$ is a rational number in $[0, 1]$ and $\Sigma$ is an alphabet of at least $r + 1 = r(\alpha) + 1$ elements, where $r(\alpha)$ is defined by equation (17). Suppose that $\lambda_0, \lambda_1, \ldots$ is a sequence of real numbers with $\lambda_1 < \lambda_0$. Then, the following are equivalent:

(a) The equivalence:

$$\gamma_{\mathcal{A}}(w) \ge \gamma_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w') \tag{3}$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite nonrepetitive sets $\mathcal{A}$ (of any size) of words of length $L > k$ from $\Sigma$.

(b) There are constants $D$ and $E$, $D < 0$, so that for all $j$ with $0 \le j \le k$:

$$\lambda_j = D(1 - \alpha)j^2 + D\alpha j + E. \tag{11}$$
Proof. The proof of Theorem 3 applies if we can prove (12) without using sets $\mathcal{A}$ with repetitions. Let $\mathcal{A}$ consist of the $s - t$ words:

$0 \ldots 0 j$, $j \in \{0, \ldots, s - t - 1\}$,

and the $t$ words:

$0 \ldots 0 j r \ldots r$, $j \in \{1, \ldots, t\}$, with $k - 2$ 0s.

These words are all different. Let: $w = 0 \ldots 0$, $w' = 0 \ldots 0 r$ be two words of length $k$. Then, since $L > k$:

$d(w, 0 \ldots 0 j) = 0$, $j \in \{0, \ldots, s - t - 1\}$,
$d(w, 0 \ldots 0 j r \ldots r) = 2$, $j \in \{1, \ldots, t\}$,
$d(w', 0 \ldots 0 j) = 1$, $j \in \{0, \ldots, s - t - 1\}$, since $r > s - t - 1$,
$d(w', 0 \ldots 0 j r \ldots r) = 1$, $j \in \{1, \ldots, t\}$, since $j > 0$.
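The nonrepetitive construction can be built and checked mechanically for particular values. The Python sketch below is an illustration under assumptions: words are represented as tuples of integers, the function names are hypothetical, $s/t$ is taken in lowest terms, and $\alpha = 1/3$, $k = 3$, $L = 5$ are arbitrary choices satisfying $k \ge 2$ and $L > k$.

```python
from fractions import Fraction


def mismatch_distance(w, a):
    """Smallest number of mismatches of w over all placements inside a."""
    k = len(w)
    return min(sum(x != y for x, y in zip(w, a[p:p + k]))
               for p in range(len(a) - k + 1))


def theorem4_words(alpha, k, L):
    """Build the two families of words and the patterns w, w' (requires k >= 2, L > k)."""
    ratio = 2 * (2 - Fraction(alpha))
    s, t = ratio.numerator, ratio.denominator
    r = max(s - t, t + 1)
    # s - t words 0...0j of length L, j = 0, ..., s - t - 1.
    first = [(0,) * (L - 1) + (j,) for j in range(s - t)]
    # t words 0...0jr...r with k - 2 leading 0s, j = 1, ..., t.
    second = [(0,) * (k - 2) + (j,) + (r,) * (L - k + 1) for j in range(1, t + 1)]
    w = (0,) * k
    w_prime = (0,) * (k - 1) + (r,)
    return first, second, w, w_prime


first, second, w, w_prime = theorem4_words(Fraction(1, 3), k=3, L=5)
print(len(first), len(second), len(set(first + second)))
print({mismatch_distance(w, a) for a in first}, {mismatch_distance(w, a) for a in second})
print({mismatch_distance(w_prime, a) for a in first + second})
```

The set uses the letters $0, \ldots, r$, which is where the requirement of at least $r + 1$ letters in the alphabet comes from.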
Note that (14) and (15) hold and so the rest of the proof of (12) is as in the proof of Theorem 3. ∎

COROLLARY 3. Suppose $k$ is fixed, $k \ge 2$, and $\Sigma$ is an alphabet of at least four letters. Suppose that $\lambda_0, \lambda_1, \ldots$ is a sequence of real numbers with $\lambda_1 < \lambda_0$. Then, the following are equivalent:

(a) The equivalence:

$$\tau_{\mathcal{A}}(w) \ge \tau_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w') \tag{2}$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite nonrepetitive sets $\mathcal{A}$ (of any size) of words of length $L > k$ from $\Sigma$.

(b) There are constants $A$ and $C$, $A < 0$, so that for all $j$ with $0 \le j \le k$:

$$\lambda_j = Aj^2 + C. \tag{16}$$

Proof. Take $\alpha = 0$ in Theorem 4 and $s = 4$, $t = 1$. ∎

The next theorem shows that if we want to reduce the size of the alphabet, we may need to increase the size of $k$.
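THEOREM 5. Suppose $k$ is fixed, $k \ge 3$, and $\Sigma$ is an alphabet of at least two elements. Suppose that $\lambda_0, \lambda_1, \ldots$ is a sequence of real numbers with $\lambda_1 < \lambda_0$. Then, the following are equivalent:

(a) The equivalence:

$$\tau_{\mathcal{A}}(w) \ge \tau_{\mathcal{A}}(w') \iff s_{\mathcal{A}}(w) \le s_{\mathcal{A}}(w') \tag{2}$$

holds for all words $w$, $w'$ of length $k$ from $\Sigma$ and all finite nonrepetitive sets $\mathcal{A}$ (of any size) of words of length $L > k$ from $\Sigma$.

(b) There are constants $A$ and $C$, $A < 0$, so that for all $j$ with $0 \le j \le k$:

$$\lambda_j = Aj^2 + C. \tag{16}$$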
Proof. The proof that (b) implies (a) is the same as it was for Theorem 3 (with $\alpha = 0$). Suppose that (a) holds. Then we already know by Theorem 1 that (4) holds for all $j \in [0, k]$. [Actually, we need to use the observation in Remark (b) after the proof of Theorem 1. In Remark (b) we note that the result still holds if the hypothesis is changed to fixed $L$. We do not have the Theorem 1 hypothesis for all $L \ge k$ here, only for all $L > k$. Hence, we need to apply the modified Theorem 1 to a particular $L$.] Let $i \ge 1$, $k \ge i + 1$ and let $\mathcal{A}$ consist of the following four words of length $L$:

$a = 0 \ldots 0$,
$b = 1 0 \ldots 0$,
$c = 0 \ldots 0 1 \ldots 1$, with $k - (i + 1)$ 0s,
$d = 0 \ldots 0 1 \ldots 1$, with $k - (i - 1)$ 0s.

Note that since $2 \le k$ [...] $k > i + 1 \ge 2$, since $k < L$:

$$d(w, a) = 0,\ d(w, b) = 0,\ d(w, c) = i + 1,\ d(w, d) = i - 1,$$
$$d(w', a) = 1,\ d(w', b) = 1,\ d(w', c) = i,\ d(w', d) = i.$$

(If $i + 1 = k$, then $w' = 1 0 \ldots 0$ and $d(w', b) = 0$; if $k = L$, then $d(w, b) \ne 0$.) It follows that for $k > i + 1 \ge 2$:

$$\tau_{\mathcal{A}}(w) = 0^2 + 0^2 + (i + 1)^2 + (i - 1)^2 = 2i^2 + 2,$$
$$\tau_{\mathcal{A}}(w') = 1^2 + 1^2 + i^2 + i^2 = 2i^2 + 2.$$
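Thus, by (2), $s_{\mathcal{A}}(w) = s_{\mathcal{A}}(w')$ and so, for $2 \le i + 1 < k$:

$$\lambda_0 + \lambda_0 + \lambda_{i+1} + \lambda_{i-1} = \lambda_1 + \lambda_1 + \lambda_i + \lambda_i. \tag{18}$$

Again using $\Delta_i = \lambda_i - \lambda_{i-1}$ we find from equation (18) that for $k > i + 1 \ge 2$:

$$\Delta_{i+1} = \Delta_i + 2\Delta_1.$$

Hence: $\Delta_2 = 3\Delta_1$, and in general, for $2 \le j < k$: $\Delta_j = (2j - 1)\Delta_1$. Thus, for $2 \le j < k$:

$$\begin{aligned}
\lambda_j &= \Delta_j + \lambda_{j-1} \\
&= \Delta_j + \Delta_{j-1} + \lambda_{j-2} \\
&= \Delta_j + \Delta_{j-1} + \Delta_{j-2} + \cdots + \Delta_2 + \lambda_1 \\
&= \sum_{i=2}^{j} \left[(2i - 1)\Delta_1\right] + \lambda_1 \\
&= (j^2 - 1)\Delta_1 + \lambda_1 \\
&= j^2\Delta_1 + (\lambda_1 - \Delta_1);
\end{aligned}$$

and so for $2 \le j < k$:

$$\lambda_j = A'j^2 + B'j + C', \tag{19}$$

where

$$A' = \Delta_1, \quad B' = 0, \quad C' = \lambda_1 - \Delta_1. \tag{20}$$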
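The step from $\Delta_{i+1} = \Delta_i + 2\Delta_1$ to the closed form (19) with the coefficients (20) uses $\sum_{i=2}^{j}(2i - 1) = j^2 - 1$. A brief Python check, with a hypothetical function name and arbitrary starting values for $\lambda_0$ and $\lambda_1$, is sketched here.

```python
from fractions import Fraction


def lambdas_from_recurrence(lam0, lam1, top):
    """lambda_0..lambda_top from Delta_{i+1} = Delta_i + 2 * Delta_1."""
    lam = [Fraction(lam0), Fraction(lam1)]
    delta_1 = lam[1] - lam[0]
    delta = delta_1
    for _ in range(2, top + 1):
        delta += 2 * delta_1          # Delta_j = Delta_{j-1} + 2 * Delta_1
        lam.append(lam[-1] + delta)   # lambda_j = lambda_{j-1} + Delta_j
    return lam


lam = lambdas_from_recurrence(4, 1, top=5)   # illustrative lambda_0 = 4, lambda_1 = 1
delta_1 = lam[1] - lam[0]
closed = [j * j * delta_1 + (lam[1] - delta_1) for j in range(6)]
print(lam == closed)
```

Since $\lambda_1 - \Delta_1 = \lambda_0$, the constant term is $\lambda_0$ and the linear term vanishes, which is the shape required by (16).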
By equation (20), equation (19) holds for $j = 0, 1$. Now since $k \ge 3$, we conclude that since equation (19) holds for $0 \le$ [...]
3. Open Questions. A variety of open questions remain at this time. We have not yet determined conditions on $\lambda_i$ necessary and sufficient for equations (2) and (3) to hold in the cases where no repetitions are allowed and $k < L$, $k = 2$ or $k = L$ for all $k \ge 2$. We have not determined conditions necessary and sufficient for equation (3) to hold when $\alpha$ is irrational, although this is perhaps just of theoretical interest. It would also be interesting to study conditions under which other consensus methods are special cases of Waterman's. Of particular interest might be such consensus methods as minimizing $\sum_{a \in \mathcal{A}} d(w, a)^m$ for fixed $m$ or minimizing $\sum_{a \in \mathcal{A}} \log d(w, a)^m$.

In the group consensus literature in the social sciences, there has been considerable emphasis on finding axioms characterizing different consensus procedures. This emphasis goes back to the early work of Arrow (1951). However, consensus methods used in molecular biology tend to be chosen because they seem interesting or useful, rather than on the basis of some theory. Such a theory could be based on physical parameters. Alternatively, it could be based on an axiomatic approach as in the group consensus literature in the social sciences. Axioms have been given for the median procedure by Young and Levenglick (1978), at least in the situation where we are minimizing $\sum_{a \in \mathcal{A}} d(w, a)$ and $\mathcal{A}$ is a set of rankings, a situation different from the one considered here. It would be interesting and potentially of practical significance to try to axiomatize the Waterman procedure. It would also be interesting and potentially important to try to axiomatize the median procedure in the context of the Waterman problem, i.e. to give axioms for $F(\mathcal{A})$ to be obtained by minimizing $\sigma_{\mathcal{A}}$.

In practical applications in molecular biology good algorithms for obtaining a consensus pattern are essential. Waterman et al. (1984), Galas et al. (1985) and Waterman (1989) provide a method for computing their consensus patterns.
A considerable amount of work has been devoted in the literature to finding algorithms for obtaining the median, although this is often a difficult computational problem and is even NP-complete in some contexts. One of the reasons for our interest in the median procedure is that some of the algorithms for computing medians might be usable to improve upon the computational methods given for the Waterman procedure; such methods are described in the above papers for the specific version of the parameters $\lambda_d = (k - d)/k$, which, as we have observed, leads to the median procedure. We close by briefly summarizing some of the literature on computation of the median. The problem is proved NP-complete by Wakabayashi (1986) and Bartholdi et al. (1989) in the context of rankings under the symmetric difference distance. However, for certain partially ordered sets Barthélemy and Janowitz (1991) show that a polynomial algorithm computes the median. A paper that addresses one aspect of the computational complexity of the consensus pattern
problem is that by Day and McMorris (1993). Many references on algorithms for computation of the median are in the paper by Barthélemy (1989).

Boris Mirkin thanks DIMACS for supporting a visit during which research leading to this paper was performed. Fred Roberts thanks the National Science Foundation for its support under grant number IRI-89-02125 to Rutgers University. Both authors thank William H. E. Day, Buck McMorris, Aleksandar Pekec and Denise Sakai for their helpful comments.
LITERATURE

Arrow, K. 1951. Social Choice and Individual Values, Cowles Commission Monograph 12. New York: Wiley.
Barthélemy, J.-P. 1989. Social welfare and aggregation procedures: combinatorial and algorithmic aspects. In Applications of Combinatorics and Graph Theory in the Biological and Social Sciences. F. S. Roberts (Ed.). IMA Volumes in Mathematics and its Applications, Vol. 17, pp. 39-73. New York: Springer-Verlag.
Barthélemy, J.-P. and M. F. Janowitz. 1991. A formal theory of consensus. SIAM J. Disc. Math. 4, 305-322.
Barthélemy, J.-P. and B. Monjardet. 1981. The median procedure in cluster analysis and social choice theory. Math. Soc. Sci. 1, 235-268.
Barthélemy, J.-P. and B. Monjardet. 1988. The median procedure in data analysis: new results and open problems. In Classification and Related Methods of Data Analysis. H. H. Bock (Ed.). Amsterdam: North-Holland.
Barthélemy, J.-P., C. Flament and B. Monjardet. 1982. Ordered sets and social sciences. In Ordered Sets. I. Rival (Ed.), pp. 721-758. Dordrecht: D. Reidel.
Bartholdi, J. J. III, C. A. Tovey and M. A. Trick. 1989. Voting schemes for which it can be difficult to tell who won the election. Soc. Choice Welf. 6, 157-165.
Day, W. H. E. and F. R. McMorris. 1992a. Critical comparison of consensus methods for molecular sequences. Nucl. Acids Res. 20, 1093-1099.
Day, W. H. E. and F. R. McMorris. 1992b. Discovering consensus molecular sequences. In Proceedings of 16. Jahrestagung, Gesellschaft fuer Klassifikation e.V., Dortmund, Germany, 1-3 April 1992. A Volume in the Series Studies in Classification, Data Analysis, and Knowledge Organization. Berlin: Springer-Verlag.
Day, W. H. E. and F. R. McMorris. 1993. On the computation of consensus patterns in DNA sequences. Math. comput. Modelling, to appear.
Galas, D. J., M. Eggert and M. S. Waterman. 1985. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia Coli. J. molec. Biol. 186, 117-128.
Marcotorchino, J. F. and P. Michaud. 1978. Optimization in ordinal data analysis. Technical Report. Paris: IBM.
Mirkin, B. G. 1979. Group Choice. P. C. Fishburn (Ed.). New York: Wiley.
Roberts, F. S. 1976. Discrete Mathematical Models, with Applications to Social, Biological, and Environmental Problems. Englewood Cliffs, NJ: Prentice-Hall.
Wakabayashi, Y. 1986. Aggregation of Binary Relations: Algorithmic and Polyhedral Investigations, PhD Thesis, Augsburg.
Waterman, M. S. 1986. Multiple alignment by consensus. Nucl. Acids Res. 14, 9095.
Waterman, M. S. 1989. Consensus patterns in sequences. In Mathematical Methods for DNA Sequences. M. S. Waterman (Ed.), pp. 93-115. Boca Raton, FL: CRC Press.
Waterman, M. S. and R. Jones. 1990. Consensus methods for DNA and protein sequence alignment. Meth. Enzymol. 183, 221-236.
Waterman, M. S., D. Galas and R. Arratia. 1984. Pattern recognition in several sequences: consensus and alignment. Bull. math. Biol. 46, 515-527.
Young, H. P. and A. Levenglick. 1978. A consistent extension of the Condorcet election principle. SIAM J. appl. Math. 35, 285-300.
Received 23 August 1992
Revised 24 October 1992