Taxonomy with Confidence JAMES A. CAVENDER Moniana State Uniwrsity,
Bozeman, Moniana 59717
Received 3 January 1978; reuised 23 March 1978
ABSTRACT There are essentially three ways in which four species may be related in a phylogenetic tree graph. It is usual to compute for each of these three possibilities the smallest number of mutations that could have brought about the observed distribution of characteristics among the four species. The graph that minimizes this number is then preferred. In fact, the hypothesis that the graph chosen in this way is correct may be accepted with confidence if the minimum is strong in a sense described here. In principle, the theory could be extended to treat sets of more than four species.
THE MODEL A modem setting for mathematical taxonomy imagines species as elements of a space E of all possible forms. As a species evolves, it marches gradually through the space. Occasionally, one species splits into two whose paths part. The trails of all the descendants of an ancient species form a tree graph in E. The taxonomist’s problem is to deduce (some aspect of) the graph from (some attributes of) the contemporary species which constitute some of its terminal vertices. This article shows how an estimated phylogenetic tree, reconstructed from descriptions of contemporary species, may be regarded as a confidence set. If a tree with confidence level 1 -(r = .95 is correctly constructed by the method given here, its maker can rest assured that the probability is greater than .95 that his tree is actually the true one, supposing the model is correct. The method is illustrated here in the case where exactly four contemporary species are being considered. The generalization to more than four species brings in no new conceptual difficulties, only an enormous increase in computations. Name the four species A, 6, C, and D. They are related according to one of the three trees in Fig. 1. The common ancestor of A, 6, C, and D may be MA THEMA TfCAL
BIOSCIENCES
0 Elsevier North-Holland,
Inc., 1978
40, 271-280
(1978)
271 0025-55~/78/0040-0271$01.75
272
JAMES A. CAVENDER
Fig.
1 C
A
.
B
C
4
D
A
l
B
I
I
I,%~
Db
IJP Fig. 2
anywhere on the graph-on an edge or at one of the six vertices. (Some degenerate cases, a few of which are shown in Fig. 2, occur when one or more edges of the graphs in Fig. 1 are of length zero. The three distinct types of Fig. 1 intersect in some of these degenerate cases.) A description of a species will be a binary sequence of length N. (To illustrate, a 1 at the fourth place might mean something like “red pigmentation present at margins of wings” while 0 would stand for the absence of such pigmentation. Each of the N digits of the string is to encode an independent character.) Again, generalization to the quatemary sequences of DNA molecules and to other discrete structures challenges patience more than ingenuity. To each species, A, B, C, D, S,, or S,, there corresponds a sequence of O’s and l’s of length N. The event that the kth digit of the sequence corresponding to A is different from the kth digit of the sequence corresponding to S, has a probability p. The basic assumption of my model is that this
TAXONOMY
WITH CONFIDENCE
273
probability equals the edge length a but is independent of k and everything else. This represents a real simplification, which entails some loss of generality and validity. On the other hand, an assumption of this sort is lurking at the bottom of almost every probabilistic model in science. Assumption. Let X and Y be adjacent vertices in a phylogenetic graph. Let Fx,y.k be the event that the kth digit of X is not equal to the kth digit of Y. Let z be the length of the edge XY. Then 0 Qz < r, the probability of Fx,y,k is z, and all the events Fv,,J, for all i E { 1,2,3,. . . , N} and all pairs {V, W} of adjacent vertices, are independent. It is assumed that a change of a 0 to a 1 over the course of time may be followed by a change back to 0, and that a change one way is just as probable as a change the other way. This is why z < i. It is not assumed that at a given time separate populations are changing equally rapidly. Thus, even if the true graph has the first topology and S2 is the common ancestor, a need not equal b. This I regard as biologically more realistic than the popular contrary assumption. Our taxonomic problem is to decide which of the three tree graphs in Fig. 1 is the correct one. One might desire more information about the tree of descent than its topology, but the classical concern of taxonomy has been the topology only, without consideration of the edge lengths. Moreover, the three topologies comprise a natural, ready-made family of confidence sets. Let @i be the family of all graphs having the first topology in Fig. 1. Let C& and @&represent the second and third topologies in a similar way. These families intersect among the degenerate graphs. Represent a graph G by a 6-tuple G = (a, 6, c,d, e, t), where t = j if G E@$ and a, b,c, d, and e are the edge lengths shown in Fig. 1. Degenerate graphs may have two or three representations. THE TEST If N is the number of characters sampled, each species will be represented by a sequence of N zeroes and ones. Assume without loss of generality that species A is all zeros. There are eight ways for the states of any character to be represented among the other three species (Table 1). Let x0 be the number of characters distributed in pattern 0 (same for all four species), xi the number showing pattern 1 (only species B different),and so TABLE 1 Species D C B A
Pattern
0
1
2
3
4
5
6
7
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 0
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 0
214
JAMES A. CAVENDER
forth. Then x,, + x, + x2 + x3 + x4 + x5 + x6 + , = N. The ordered octuple x = x7) summarizes all the data. Given a graph G =(a, b, c, d,e,t), (xo,x,,xz,..., the probability of a given x is easily, though laboriously, computed. It is the multinomial probability p(G,x)=(
where f; is the probability This f; generally depends e. When t= 1,
xo,x,,E ,...,,,)f$tiW.fi: of any particular characteristic showing pattern i. upon the topology t as well as upon a, 6, c, d, and
f. = abEde + abcdt? + ci6cde + ci&&?, f, = a&ae + a&cd; + Cbcde + fib&, f,=abcde+abEdC+fidEde+Z6cdE, f,=a~c~e+a6Ede~+bbEde+~bcdP,
(1)
f,=abEde+abcdP+Lidcde+~6~d~, f5=a6Ede+a6cdd+tibcde+GbEdC, f,=abcde+abEdE+~6Ede+cl~cd~, f7=a6cde+a6Edd+dbEde+ribcdt?,
where 2=1-a 6=1-b:
E=l-c, d=l-d, P=l-e. Choose a confidence level 1 -(Y. To make confidence sets ?A,, Ci&, and Cl&of octuples for which sup T{$lLk}= GEBk
let
sup GE%
x XE%_
p(G,x)<~u,
intervals,
identify
TAXONOMY
WITH CONFIDENCE
275
and let % be the complement of S, u S, u S,. Now if x E S,, one may say with confidence that G E@,, that is, that the actual phylogenetic tree is of the first type of Fig. 1. The sets 5, and S, have analogous significance. If x E %, one must be noncommital; no one of the three graphs can be preferred with confidence on the basis of the data. In fact, the probability that a conclusion reached in this way is correct is at least 1- cr, as the following computations show. From the definitions,
and similarly
9{xxS2~‘?LIG~~2}> l-a, 1-a.
9{xES,u’%(G~8,}>
The choice of CR,,,‘&,, and ‘$ must in some degree be arbitrary. The natural desire to make 9, as small and hence improbable as possible is frustrated by the lack of an objective measure of the size of 9L. At least, 4,) $, and Cl&should be disjoint. One possible choice of ‘?Q has ~,={x,>x~+r}u{x~>x~+r}, ~={x~>x~+~}u{x~>x~+~),
(2)
~={~,>xj+~}u{x~>x~+~}, where r is a positive constant depending on N and CLThe rationale GE& and x=(xO,xl,xz ,..., x7), then at least
changes minimal
in the states of characters number of changes is
must
have occurred.
m2 =x,+x2+2x3+x4+x5+2x~+x,.
is this. If
If G Et&
the
276
JAMES
A. CAVENDER
If G E@,, it is m,=x,+x,+x,+x,+2x,+2x,+x,. If one wishes to assume G E8, if
the smallest mi
which is equivalent
number
and
of changes,
one will assume
m,
to xs < X6 and
x,< x6.
Thus S, ought to be the event that xg is bigger than x3 and x5, say
Choosing S, and S, analogously, one has (2). Minimizing the count of changes is the “parsimony criterion” of Camin and Sokal [ 11, also variously known as “the method of minimum evolution” [9], “the principle of minimum improbability” [lo], and “the minimum-steps method” [6]. Farris [3,4], using a model similar to this one but more general, has shown that the parsimony criterion yields a maximum-likelihood estimate of the tree (not just of its topology). Felsenstein [6] found that if the model is restricted so that different populations evolve at the same rate, this is no longer so. The problem of finding the most parsimonious topology has received much attention [2,9], but no method that is known to work and is efficient enough to be practical for a moderate number of species has been found. The practical taxonomist with four species to sort out needs a sample of characteristics, a confidence level 1 - (11,and the r that corresponds to his N and (Y. Examination of the polynomials (1) shows that it is possible to maximize both fs and f3 +fs while simultaneously minimizing f6 by choosing G=G,=(f,O,f,O,O,l). Thenf,=f,=f,=f,=f and_f,=f,=f,=f,=O. (This is clear once it is seen that f3+fs < 4 always. To maximize f3+f5, first observe that
so e might as well be set equal to zero. With e = 0, one has ~(f3+fS)=(l_2h)(~d+cj)>O, since b < +. In a similar way,
TAXONOMY
WITH
277
CONFIDENCE
are all seen to be non-negative. Hence f3 +fs is maximal when a = b = c = d = i and e=O. And then f3 +fs = f.) Now ?? {a,} is evidently maximized. [There are three other graphs in @, that maximize ‘3’{ 9, }. They are: G,=(O,f,O,~,O,l) with f,=f,=f4=f5=i and jZ=j3=_&=j,=O;G,= (0,~,~,0,0,l)withf,=f,=f,=f,=fandf~=fs=f~=f,=0;G4=(~,0,O,f,O,1) with f0 =f3 = f4 = f, = a and f, = fi = fs = f6= 0.1 Since xg = x3 = 0 a.s. in this case, ?F { 9, } = 9 {x5 >r}. Thus r, the smallest integer that makes 9 { 3, } Q OL,can be read from a table of cumulative binomial probabilities. Table 2 gives r for several N and three traditional values of (Y.
TABLE
\ a\ .l .05 .Ol
N
2
2
3
4
10
15
20
30
40
1
2 2
2 3 3
4 5 6
6 7 8
8 9 11
12 13 15
14 15 17
-
Here it should help to repeat the usual cautions on confidence intervals. If a taxonomist goes through life always using this procedure with a = .05 and saying, “I cannot decide” when x E %, then he will be right at least 95% of the time. But if he finds x E Gu quite often and only announces his conclusion when x B G%,then he may be announcing the wrong 5% and little more. Indeed, if a, b,c, and d are ordinarily very close to i, then the data in x can be of little use and x E G2Lwill be common. The theory is based on a model that contains an assumption of independence. This must be remembered. The set of characteristics used is the sample; N is the sample size. If the sample is biased through prejudice or accident, the results will not be trustworthy. A biased sample could be contrived to show that whales should be classed with fish, or anything else. No statistical test can detect sample bias, much less correct it. Protein sequences appear to me to be unbiased samples. Avoiding bias in choosing characteristics of the traditional sort will be difficult. The sample size N must be fixed at the outset. Adding one characteristic at a time while watching for a chance fluctuation that throws x into S, is a false use of the method given here. In principle, however, the theory could be altered to accommodate such a procedure. STATISTICAL
PROPERTIES
The set ?%i is the rejection region for a size cr test of the hypothesis G E@,. The test is neither uniformly most powerful nor unbiased [7, pp. 206 -207, 2241.
278
JAMES
To see that the test is not uniformly
most powerful,
A. CAVENDER
define a region C&t;
by C!?L?1;=%,n{x3+xg>21.}. Since C&’is a strict subset of ai,
their probabilities
V%lG]
satisfy
QV%lG]
for any G. In particular, for G,,G*, G,, and G,, the graphs 9 { 3, IG} in a,, the inequality (3) is strong. Hence lx’=
sup 9{%{}< GE@,
(3) that maximize
sup ??{%i}=cr. G EC!&
The size (Y’of the test based on 3,’ can be raised back toward (Yby reducing r to a new. lower value r’ for which sup 9{%;,} G EO),
B (Y,
(4)
where
Now let G5 =(O,O,O,O, $,3), the extreme case of graph of the third topology. It has fo=j3= i, whence x,+x3= N. The powers of the tests at G5 have
This proof assumes that N is large enough so that r’ strictly smaller than r and satisfying (4) exists and so that 2r’ r} <(Y. DOING
WITHOUT
X,,
In practice, x0 may be impossible to determine; characters that by chance have not changed or have changed but changed back again are hidden among those that cannot change. A solution to this problem is to start over again with x0 an unknown parameter along with a, b,c,d,e, and t. Let X=(X,,X, ,..., x7). Let 3, be defined by (2). For a given r, the
TAXONOMY
confidence
279
WITH CONFIDENCE
level 1 - 1y is given by (Y= sup G EO,
c
p(G,xo,X).
fEQ,
X,>O
For any fixed x,,, the graphs at which this supremum is achieved are merely the G,, G,, G3, and G, already computed, for the reasons already given. Thus the problem reduces to finding x0 that maximizes, for example,
or, in other words, finding x0 that maximizes the probability of X,+X, + . . . + x7 successes in x,, + x, + x2 + . . . x, independent Bernoulli trials with probability p = 1 -Jo = % of success. But this is an elementary exercise. Let N’ = x, + x1 + xj + . . . + x7. We seek the x0 that maximizes
and that therefore
also maximizes
(N’+ $(x0)=( N;xo)(a)xo=
l)(N’+2)*. 1X2X*.*
* (N’+x,) XX0
1 4”o’
But N’+x,
$(x0)= 4x,woThus #(x0) increases
1).
with x0, so long as
The x0 we seek must therefore That is,
be the greatest integer for which (5) is true.
x0= HN’/311 where [[ ]] represents the greatest-integer function. (Cf. Feller [5, pp. 45461.) For example, when N’= 15, the column for N = 15 + [[ y]] = 20 in Table 2 may be used to find r.
280
JAMES A. CAVENDER
The examples in the preceding test is neither
uniformly
most
section can be amended powerful nor unbiased.
I thank the referees for raising the questions
to show that this
treated in the last two sections.
REFERENCES 1 J. H. Camin and R. R. Sokal, A method for deducing branching sequences in phylogeny, Evolution 19: 31 l-326 (1965). 2 G. F. Estabrook, Cladistic methodology: a discussion of the theoretical basis for the induction of evolutionary history, Annual Reuiew of Ecology and Systematics 3: 427-456 (1972). 3 J. S. Farris, A probability model for inferring evolutionary trees, Syst. Zoo/. 22: 250-256 (1973). 4 J. S. Farris, Estimation of number of amino acid substitutions when back mutations can occur, Amer. Naturalist 107, 53 l- 534 (1973). 5 W. Feller, An Introduction to Probability Theory and its Applications, Vol. I, 3rd ed., Wiley, New York, 1968. 6 J. Felsenstein, Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters, Syst. Zool., 23: 240-249 (1973). 7 T. S. Ferguson, Mathematical Statistics, A Decision Theoretic Approach, Academic, New York, 1967. 8 E. L. Lehman, Testing Statistical Hypotheses, Wiley, New York, 1959. 9 E. A. Thompson, The method of minimum evolution, Ann. Human Genetics 36: 333-340 (1973). 10 S. M. Ulam, Some ideas and prospects in biomathematics, Annual Review of Biophysics and Bioengineering 1: 277-292 (1972).