The agreement metric for labeled binary trees


WAYNE GODDARD
Department of Mathematics, University of Pennsylvania, Philadelphia, PA

AND

EWA KUBICKA, GRZEGORZ KUBICKI, AND F. R. McMORRIS
Department of Mathematics, University of Louisville, Louisville, KY

Received 31 August 1993; revised 10 January 1994

ABSTRACT

Let S be a set of n objects. A binary tree of S is a binary tree whose leaves are labeled without repetition from S. The operation of pruning a tree T is that of removing some leaves from T and suppressing all inner vertices of degree 2 which are formed by this deletion. Given two trees T and U, an agreement tree is a tree that can be obtained from T as well as from U by pruning the fewest possible leaves from the two trees. A quadratic-time algorithm for finding an agreement tree is presented, and two metrics based on agreement trees are defined.

1. INTRODUCTION AND DEFINITIONS

The development and analysis of different ways to compare various types of trees has been a concern of much research in the biological sciences over the past 20 years (cf. [1, 2, 4, 7, 9, 11]). For example, trees representing hierarchical classifications of a collection of objects of interest are produced as output of clustering algorithms, and when several algorithms are utilized, it is of interest to be able to determine the similarities among the outputs. Another multiple-tree situation can occur when multiple runs are made of one algorithm on the same data set. For example, with the popular package Phylogenetic Analysis Using Parsimony (PAUP) [12], it is not uncommon to have thousands of output trees to analyze via measures of similarity. The main goal of this article is to present a new metric based on the notion of the agreement subtree of two trees and to give an efficient algorithm to compute this metric.

MATHEMATICAL BIOSCIENCES 123:215-226 (1994)
© Elsevier Science Inc., 1994, 655 Avenue of the Americas, New York, NY 10010. 0025-5564/94/$7.00


Suppose S is a set of n objects for which two hierarchical classifications have been constructed. A common way to represent such classifications is by means of a tree whose leaves (vertices of degree one) are labeled with the elements of S. We assume T and U are two trees whose leaves are labeled without repetition by the elements of S and all of whose internal vertices are unlabeled and have degree 3, so that T and U are leaf-labeled binary trees. The operation of pruning a tree T is the removal of a subset of leaves from T followed by the suppression of all inner vertices of degree 2 which are formed by this deletion. For example, a tree $T_1$ and the tree $T_2$ resulting after pruning the leaf 2 from $T_1$ are depicted in Figure 1.

The idea of common pruned subtrees was introduced independently by Rosen [8] and by Gordon [4] in order to ascertain the similarity of the original trees. A greatest common pruned tree of two trees T and U, which we simply call an agreement subtree, is a tree that can be obtained from T as well as from U by pruning the fewest possible leaves from the two trees. The motivation of this topic and its relation to consensus are developed nicely in Finden and Gordon [3] and in Swofford [11].

Work on an algorithm for finding an agreement subtree has attracted some interest. The polynomial algorithms given by Gordon [4] and by Finden and Gordon [3] to find a largest common pruned tree were heuristic and did not guarantee the production of a common pruned tree of largest size, where the size of a binary tree is defined as the number of leaves it has. Moreover, the difference between what their algorithms obtain and the theoretical optimal solution is not known. An exact algorithm for finding an agreement subtree of two trees was presented in Kubicka, Kubicki, and McMorris [6]. Its complexity is of the order $n^{\log_2 n}$, where n is the number of leaves in T (and in U). In the next section we present an exact polynomial-time algorithm for finding an agreement subtree. Moreover, this algorithm generalizes to nonbinary trees and remains polynomial-time as long as the highest degree in T and in U is bounded by a constant independent of n. Similar work has been reported recently by Steel and Warnow [10].

FIG. 1. The pruning of $T_1$ to produce $T_2$.
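The pruning operation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is taken to be rooted and encoded as nested 2-tuples whose leaves are non-tuple labels, and the function name `prune` and the sample tree are hypothetical (they are not the trees of Figure 1).

```python
def prune(tree, removed):
    """Delete the leaves in `removed` and suppress the degree-2 vertices
    that the deletion creates. Returns None if no leaf survives.

    Assumption: `tree` is a leaf label or a pair (left, right)."""
    if not isinstance(tree, tuple):
        # A leaf either survives unchanged or disappears.
        return None if tree in removed else tree
    left = prune(tree[0], removed)
    right = prune(tree[1], removed)
    if left is None:
        # This vertex would have a single child, so it is suppressed.
        return right
    if right is None:
        return left
    return (left, right)


# Hypothetical example: pruning the leaf 2.
example = ((1, 2), (3, 4))
pruned = prune(example, {2})
```

Suppression of degree-2 vertices is what makes the recursive call return the surviving child directly instead of a one-child pair.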

2. A QUADRATIC-TIME ALGORITHM FOR FINDING AN AGREEMENT SUBTREE

Let A(T, U) denote the set of all agreement subtrees for two binary trees T and U, and let #(T, U) denote the size of an agreement subtree, that is, the common number of leaves of the trees in A(T, U). Although the focus of our attention is the case when T and U have the same number of leaves, we introduce slightly more generality and assume that T and U have m and n leaves, respectively. The reason for this is that we need to compare trees of different sizes in our proofs and algorithm. Our first goal is to give a quadratic-time algorithm based on dynamic programming for finding #(T, U) and an element of A(T, U), which by abuse of notation we also denote by A(T, U). For a rooted binary tree we assume that the degree of the root is 2. All logarithms are of base 2, and for simplicity they are denoted by log.

We first consider the rooted version of the problem. For the purpose of describing the algorithm, we artificially label all internal vertices of T and U. However, these labels will not be counted in agreement subtrees. Let $T_a$ be the subtree rooted at a vertex a. If a has children b and c, then $T_b$ and $T_c$ are the rooted subtrees obtained from $T_a$ by deleting the vertex a. If a has no children, then a is a leaf and $T_a = a$. In this case, finding $A(T_a, U_w)$ for any other rooted tree $U_w$ is trivial; namely, $A(T_a, U_w) = a$ if a is a leaf of $U_w$, and $A(T_a, U_w) = \emptyset$ otherwise. The following lemma is basic to our algorithm.

LEMMA 1

Let $T_a$ be a tree rooted at a vertex a with children b and c, and let $U_w$ have a root w with children x and y. Then $\#(T_a, U_w)$ is the maximum of the following six numbers: $\#(T_b, U_x) + \#(T_c, U_y)$, $\#(T_b, U_y) + \#(T_c, U_x)$, $\#(T_a, U_x)$, $\#(T_a, U_y)$, $\#(T_b, U_w)$, and $\#(T_c, U_w)$.

Proof. Consider an agreement subtree $A_r$ for $T_a$ and $U_w$. If $A_r$ contains leaves from only one of $T_b$ and $T_c$, or contains leaves from only one of $U_x$ and $U_y$, then clearly the number of leaves in $A_r$ is equal to $\max\{\#(T_a, U_x), \#(T_a, U_y), \#(T_b, U_w), \#(T_c, U_w)\}$. Otherwise, suppose that the vertex r has children s and t. We claim that $A_s$ cannot contain both a leaf b' from $T_b$ and a leaf c' from $T_c$. Suppose that this is the case and f is any leaf in $A_t$, say f is from $T_b$. Then the subtree induced by the vertices f, b', and c' in $A_r$ is different from the subtree induced by the same vertices in $T_a$, as in Figure 2. The same argument applies to $A_t$ and, with U in place of T, to $U_x$ and $U_y$, so each of $A_s$ and $A_t$ takes its leaves from a single child subtree of each tree. Therefore the number of leaves in $A_r$ is $\max\{\#(T_b, U_x) + \#(T_c, U_y), \#(T_b, U_y) + \#(T_c, U_x)\}$. ∎
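The recurrence of Lemma 1 translates almost verbatim into a memoized recursion. The sketch below is an illustration, not the authors' implementation: it assumes rooted trees encoded as nested 2-tuples with non-tuple leaf labels, and the name `agreement_size` is hypothetical. Memoizing on tuples adds hashing overhead, so this version shows the recurrence rather than a tuned O(mn) table.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def agreement_size(t, u):
    """Number of leaves in an agreement subtree of rooted trees t and u.

    Assumption: a tree is a leaf label or a pair (left, right)."""
    t_is_leaf = not isinstance(t, tuple)
    u_is_leaf = not isinstance(u, tuple)
    if t_is_leaf and u_is_leaf:
        return 1 if t == u else 0
    if t_is_leaf:
        # A single leaf can only be matched inside one child of u.
        return max(agreement_size(t, u[0]), agreement_size(t, u[1]))
    if u_is_leaf:
        return max(agreement_size(t[0], u), agreement_size(t[1], u))
    (b, c), (x, y) = t, u
    # The six candidates of Lemma 1.
    return max(
        agreement_size(b, x) + agreement_size(c, y),
        agreement_size(b, y) + agreement_size(c, x),
        agreement_size(t, x),
        agreement_size(t, y),
        agreement_size(b, u),
        agreement_size(c, u),
    )


# Hypothetical example: the two trees agree once leaf 3 or leaf 4 is pruned.
t_example = (((1, 2), 3), 4)
u_example = (((1, 2), 4), 3)
size = agreement_size(t_example, u_example)
```

Storing, next to each value, which of the six candidates attained the maximum would allow an agreement subtree itself to be recovered by backtracking.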

FIG. 2. Subtree of $A_r$ induced by b', c', and f, and the subtree induced by the same leaves in $T_a$.

Now, suppose we have two rooted trees $T_a$ and $U_w$ with m leaves and n leaves, respectively. First notice that $T_a$ has 2m - 1 vertices (both internal vertices and leaves) and $U_w$ has 2n - 1 vertices. The algorithm to determine $\#(T_a, U_w)$ iteratively calculates $\#(T_{a'}, U_{w'})$ for every pair of vertices a' in $T_a$ and w' in $U_w$. For example, it does a postorder traversal of $T_a$ and, for each vertex a' in $T_a$, does a postorder traversal of $U_w$. To organize the calculations, one can keep a (2m - 1) × (2n - 1) matrix of the calculated values $\#(T_{a'}, U_{w'})$. Call this matrix $M(T_a, U_w)$. In this matrix, with every value $\#(T_{a'}, U_{w'})$ one can also keep a pointer to a corresponding agreement subtree of $T_{a'}$ and $U_{w'}$. In this way, in addition to the size of an agreement subtree, an instance of such an agreement subtree can be found. The complexity of this procedure is clearly O(mn), since by Lemma 1 each entry is obtained in constant time from entries already computed.

Now we consider the unrooted version of the above algorithm. Again suppose T and U are unrooted trees with m and n leaves, respectively. We can adapt the rooted version to solve the unrooted case by introducing roots of degree 2 into T and U by subdividing an edge (i.e., inserting a vertex onto an edge). To determine an agreement tree for T and U one could try all possible locations for the roots and take the largest agreement tree that is obtained. By doing this, an $O(m^2 n^2)$ algorithm could be produced. However, we can make this process more efficient by noticing that we may fix the root for the first tree and only consider sites for the root on the second tree. To see this, root one of the trees, say T, by inserting a vertex (root) w on an arbitrary edge, forming a rooted tree $T_w$. Now find an agreement tree for T and U using the previous $O(m^2 n^2)$ method. This will give a set V of leaves to be pruned to obtain this agreement tree. Prune all the leaves in V from the rooted tree $T_w$ to produce the subtree $A_w$. Notice that the tree $A_w$ is an agreement subtree of $T_w$ and U for a proper placement of the root in the tree U.

So, suppose T is rooted at w. We now show how to locate the root of U, and a corresponding agreement subtree, in O(mn) time. For all possible placements of the root z on U we need to calculate the size of the agreement tree of $T_w$ and $U_z$. For adjacent vertices a and b of U, let $U_a^b$ denote the component of U - ab which contains the vertex a and is rooted there. Further, let $Z_a^b = \{\#(U_a^b, T_{w'}) : w' \in T_w\}$.

Start by placing the root z of U on an edge xy where y is a leaf. This is depicted in Figure 3a. The orientation of all edges denotes the direction from a child to its parent. We do the calculations of $\#(T_w, U_z)$ as one would in the rooted case. This means that we find the values in $Z_a^b$ for every parent b and child a in $U_z$. Let us notice that $Z_y^z$ is the same as $Z_y^x$ and $Z_x^z$ is equal to $Z_x^y$. This process takes O(mn) time according to Lemma 1.

The next step is to move the root z to a neighboring edge as in Figure 3b. Say the vertex x was adjacent to vertices g and h. Move the root z to the edge xg, for example. Now, we need to calculate the values of $Z_a^b$ for every parent b and child a in $U_z$. But we already know all these values except for $Z_x^g$. We do know $Z_y^x$ and $Z_h^x$, and from them we calculate all elements of $Z_x^g$ in O(m) time by Lemma 1. We also calculate $\#(T_w, U_z)$.

As illustrated in Figures 3b and 3c, the algorithm continues by moving the root z around the tree U. It is organized so that at each new position there is precisely one set which must be calculated. Suppose z is on edge ab; then either $Z_a^b$ or $Z_b^a$ must be calculated. Moreover, we have sufficient information to do so in O(m) time. For if we want to calculate $Z_a^b$, we have $Z_c^a$ and $Z_d^a$ for the other two neighbors c and d of the vertex a. One movement strategy that will work is a preorder traversal of the edges of U, but what we really need is that each new edge for the root z is incident with one already tried. Thus we see that the unrooted algorithm also takes O(mn) time. As in the rooted case, in addition to the sizes of agreement subtrees, in all sets $Z_a^b$ we can also keep pointers to the agreement subtrees themselves.
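The reuse of the sets $Z_a^b$ across root positions can be imitated by memoizing over directed edges of U. The following sketch is a variant in the same spirit, not the authors' traversal: it assumes T is already rooted and given as nested 2-tuples, and U is given as an adjacency dictionary whose leaf vertices are keyed by their labels. The function name, the internal vertex names `"u"` and `"v"`, and the `("z", a, b)` marker for a root placed on edge ab are all hypothetical.

```python
from functools import lru_cache


def unrooted_agreement_size(t_root, adj):
    """Largest agreement size of the rooted tree t_root with the unrooted
    binary tree `adj`, over all placements of a root z on an edge of adj."""

    def kids(node):
        if len(node) == 3:
            # Virtual root z on edge ab: its children are U_a^b and U_b^a.
            _, a, b = node
            return [(a, b), (b, a)]
        a, b = node
        # Children of a in U_a^b are the neighbours of a other than b.
        return [(c, a) for c in adj[a] if c != b]

    @lru_cache(maxsize=None)
    def size(t, node):
        ks = kids(node)
        t_is_leaf = not isinstance(t, tuple)
        if not ks:  # node is a leaf of U
            if t_is_leaf:
                return 1 if t == node[0] else 0
            return max(size(t[0], node), size(t[1], node))
        x, y = ks
        if t_is_leaf:
            return max(size(t, x), size(t, y))
        tb, tc = t
        # The six candidates of Lemma 1.
        return max(size(tb, x) + size(tc, y), size(tb, y) + size(tc, x),
                   size(t, x), size(t, y), size(tb, node), size(tc, node))

    # Try the root z on every edge; each edge is visited in both directions.
    return max(size(t_root, ("z", a, b)) for a in adj for b in adj[a])


# Hypothetical example: T is the quartet 12|34 rooted on its middle edge,
# U is the quartet 13|24; any three leaves agree in the unrooted sense.
t_example = ((1, 2), (3, 4))
u_example = {1: ["u"], 3: ["u"], 2: ["v"], 4: ["v"],
             "u": [1, 3, "v"], "v": [2, 4, "u"]}
best = unrooted_agreement_size(t_example, u_example)
```

Each pair (subtree of T, directed edge of U) is evaluated once, which mirrors the observation that only one new set has to be computed per root move; the tuple hashing used by the cache adds overhead that a table indexed by vertex numbers would avoid.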

FIG. 3. (a) The first position of the root z. (b) The second position of the root z. (c) Possible movements for the root z.


This makes it possible to find an agreement subtree for T and U with the same time complexity. The following theorem is now apparent.

THEOREM 1

Let T and U be two labeled binary trees of size n (rooted or unrooted). Then there exists an $O(n^2)$ algorithm for finding an agreement subtree of T and U.

The algorithm for finding #(T, U) described above can be generalized to rooted trees which are nonbinary. As long as the maximum degree of T and U is bounded by a constant independent of m and n, the complexity of the algorithm remains quadratic. The only differences in the modified algorithm are that a larger number of possibilities need to be considered in Lemma 1 and that we have to generalize the postorder traversals. For example, if both T and U are ternary trees, then instead of six possibilities in Lemma 1, we have 48 numbers from which to choose the maximum.

3. METRICS ON THE SET OF BINARY TREES WITH n LABELED LEAVES

We now let $\mathcal{T}_n$ be the set of all binary trees with n labeled leaves. The function $d_1 : \mathcal{T}_n \times \mathcal{T}_n \to \mathbb{R}$ defined by $d_1(T, U) = n - \#(T, U)$ for all $T, U \in \mathcal{T}_n$ is easily seen to be a metric on $\mathcal{T}_n$. This metric counts the number of leaves which have to be pruned from both trees to obtain a common substructure. Because of the polynomial algorithm described in the previous section, $d_1$ is also easy to compute. However, this metric does not take into account the location of pruned leaves and therefore can possibly obscure some vital information about the intuitive "closeness" of two trees. As an illustration of this, let $T_1$, $T_2$, and $T_3$ be the binary labeled trees of size 10 with leaves labeled as in Figure 4. Notice that $d_1(T_1, T_2) = d_1(T_1, T_3) = d_1(T_2, T_3) = 1$, since it is enough to prune the leaf x to obtain identical subtrees. Intuitively, $T_1$ and $T_2$ look to be closer than $T_1$ and $T_3$ or than $T_2$ and $T_3$. This type of problem is not unique to $d_1$. For example, similar problems have been found with the often used symmetric difference metric [2, 7].

FIG. 4. Binary labeled trees of size 10 with leaves labeled.

In order to address the situation above, we modify the metric $d_1$ by adding a fractional part, which reflects how far apart, with respect to an "optimal" agreement subtree, the pruned leaves are. We now need a little new notation. Let $T_1$ and $T_2$ belong to $\mathcal{T}_n$ and let A be an agreement subtree of $T_1$ and $T_2$. Let $V_A$ denote the set of all leaves pruned from $T_1$ and $T_2$ to obtain A. For every x from $V_A$ and for i = 1, 2, let $A_i(x)$ denote the tree obtained from $T_i$ by pruning all leaves in $V_A$ except for x. Now form the tree A(x) by adding leaves $x_1$ and $x_2$ to the agreement tree A in such a way that A(x) is isomorphic to $A_2(x)$ after pruning $x_1$ and is isomorphic to $A_1(x)$ after pruning $x_2$. See Figure 5 for an illustration of how this is done. Let s(x, A) denote the length of the path between the leaves $x_1$ and $x_2$ in the tree A(x). In the above example we have s(x, A) = 4. Define l(A) to be the sum of s(x, A) over all pruned leaves x.

$$l(A) = \sum_{x \in V_A} s(x, A).$$

Next we define $L(T_1, T_2)$ to be the minimum value of l(A) taken over all agreement subtrees A of $T_1$ and $T_2$.

$$L(T_1, T_2) = \min_{A \in A(T_1, T_2)} l(A).$$

Let $T_1$ and $T_2$ be two distinct binary labeled trees of size n. Then we define

$$d(T_1, T_2) = d_1(T_1, T_2) + \frac{L(T_1, T_2)}{n \, d_1(T_1, T_2)}.$$

FIG. 5. The tree A(x) for the trees $T_1$ and $T_2$.

If $T_1 = T_2$, set $d(T_1, T_2) = 0$. We mention that the choice of the normalization factor is somewhat arbitrary. We chose $n d_1$ because it is a very simple strict upper bound for L, and thus we get a straightforward fraction less than 1 added to $d_1$.

THEOREM 2

The function $d : \mathcal{T}_n \times \mathcal{T}_n \to \mathbb{R}$ is a metric on $\mathcal{T}_n$.

Proof. It is obvious that the function d is nonnegative and symmetric, so we need only show the triangle inequality. First note that for any $T_1$ and $T_2$ we have $d(T_1, T_2) < d_1(T_1, T_2) + 1$. Therefore, if $d_1(T_1, T_2) + d_1(T_2, T_3) > d_1(T_1, T_3)$, then, since $d_1$ is integer valued, $d_1(T_1, T_2) + d_1(T_2, T_3) > d(T_1, T_3)$, and also $d(T_1, T_2) + d(T_2, T_3) > d(T_1, T_3)$.

Now assume that $d_1(T_1, T_2) + d_1(T_2, T_3) = d_1(T_1, T_3)$. Let $A_{12}$ be an agreement subtree for the trees $T_1$ and $T_2$ realizing the minimum in the definition of $d(T_1, T_2)$, and let $A_{23}$ be defined similarly for $T_2$ and $T_3$. Let $u_1, u_2, \ldots, u_k$ be the leaves to be pruned from $T_1$ (and $T_2$) to obtain $A_{12}$, and let $v_1, v_2, \ldots, v_t$ be the leaves to be pruned from $T_3$ (and $T_2$) to obtain $A_{23}$. Since $d_1(T_1, T_2) + d_1(T_2, T_3) = d_1(T_1, T_3)$, the sets $\{u_1, u_2, \ldots, u_k\}$ and $\{v_1, v_2, \ldots, v_t\}$ are disjoint. If A denotes the tree obtained by pruning the leaves $u_1, u_2, \ldots, u_k, v_1, v_2, \ldots, v_t$ from both $T_1$ and $T_3$, then notice that A can also be obtained from either $A_{12}$ or $A_{23}$ by pruning. Therefore $s(u_i, A) \le s(u_i, A_{12})$ for $i = 1, 2, \ldots, k$ and $s(v_j, A) \le s(v_j, A_{23})$ for $j = 1, 2, \ldots, t$. Finally, combining these estimates gives the triangle inequality $d(T_1, T_3) \le d(T_1, T_2) + d(T_2, T_3)$. ∎
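A chain of inequalities consistent with the estimates in the proof of Theorem 2 can be written as follows. It is offered as a reconstruction under stated assumptions rather than as the authors' original display: it assumes that $T_1$, $T_2$, $T_3$ are pairwise distinct (so all three values of $d_1$ are positive), that A is an agreement subtree of $T_1$ and $T_3$, and it uses $d_1(T_1, T_3) \ge d_1(T_1, T_2)$ and $d_1(T_1, T_3) \ge d_1(T_2, T_3)$ in the third step.

```latex
\begin{align*}
d(T_1,T_3)
  &= d_1(T_1,T_3) + \frac{L(T_1,T_3)}{n\,d_1(T_1,T_3)}
   \le d_1(T_1,T_3) + \frac{l(A)}{n\,d_1(T_1,T_3)} \\
  &= d_1(T_1,T_2) + d_1(T_2,T_3)
     + \frac{\sum_{i=1}^{k} s(u_i,A) + \sum_{j=1}^{t} s(v_j,A)}{n\,d_1(T_1,T_3)} \\
  &\le d_1(T_1,T_2) + \frac{\sum_{i=1}^{k} s(u_i,A_{12})}{n\,d_1(T_1,T_2)}
     + d_1(T_2,T_3) + \frac{\sum_{j=1}^{t} s(v_j,A_{23})}{n\,d_1(T_2,T_3)} \\
  &= d_1(T_1,T_2) + \frac{L(T_1,T_2)}{n\,d_1(T_1,T_2)}
     + d_1(T_2,T_3) + \frac{L(T_2,T_3)}{n\,d_1(T_2,T_3)} \\
  &= d(T_1,T_2) + d(T_2,T_3).
\end{align*}
```

The fourth line uses that $A_{12}$ and $A_{23}$ realize the minima $L(T_1, T_2)$ and $L(T_2, T_3)$, so the two sums equal $l(A_{12})$ and $l(A_{23})$.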

Now note that for the trees in Figure 4, we have $d(T_1, T_2) = 1 + \frac{4}{10}$ and $d(T_1, T_3) = d(T_2, T_3) = 1 + {}$[fraction illegible in the source]. Call an agreement subtree for the two trees $T_1$ and $T_2$ which realizes the minimum $L(T_1, T_2)$ an optimal agreement subtree. Clearly the distance $d(T_1, T_2)$ is relatively easy to compute once we know an optimal agreement subtree. However, this has the appearance of a very difficult computational problem, especially since the number of agreement subtrees might be exponential in the size of the trees [5]. (Also, the same example can be used to show that the number of optimal agreement subtrees could be exponential.) Surprisingly, it turns out that our algorithm can be modified to produce an optimal agreement subtree in essentially the same quadratic time. If we apply the algorithm for finding an agreement subtree described in Section 2, but on each level of the recursion an optimal subtree is selected, then, by Lemma 2 in the Appendix, the final agreement subtree produced will also be optimal. Moreover, the generalization for the unrooted case is identical. Therefore, we have the following crucial theorem.

THEOREM 3

There exists a polynomial (quadratic) time algorithm for finding an optimal agreement subtree of two labeled binary trees.

COROLLARY 1

The metric d can be computed in quadratic time.

Finally, we note that a comprehensive study of the behavior of this new metric needs to be undertaken, perhaps along the lines found in Steel and Penny [9].
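Once $d_1$ and L are known, evaluating d is plain arithmetic. The sketch below only illustrates the definition; the function name `agreement_metric` is hypothetical, and the sample values assume n = 10, $d_1$ = 1, and L = 4 (the value s(x, A) = 4 quoted for the Figure 5 example, taken here to be optimal).

```python
from fractions import Fraction


def agreement_metric(n, d1, L):
    """Evaluate d = d1 + L / (n * d1), with d = 0 for identical trees.

    n  : number of leaves of each tree
    d1 : number of leaves pruned to reach an agreement subtree
    L  : minimum of l(A) over all agreement subtrees A"""
    if d1 == 0:
        # Identical trees: the definition sets d to 0.
        return Fraction(0)
    return d1 + Fraction(L, n * d1)


# Assumed values for the Figure 4 / Figure 5 illustration.
value = agreement_metric(10, 1, 4)
```

Exact rational arithmetic keeps the fractional part visible, which matters because that part is the only thing distinguishing pairs of trees with equal $d_1$.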

W.G.'s research was supported by ONR Grant N00014-91-J-1022 and F.R.M.'s research was supported by ONR Grant N00014-89-J-1643.

APPENDIX

To see how to modify the algorithm to guarantee an optimal agreement subtree, we first need to generalize the function s. Let $T_a$ and $U_w$ be two binary rooted trees with labeled leaves and with roots a and w, respectively. We do not assume that the labels for the trees $T_a$ and $U_w$ are necessarily the same. Let $A_r$ be an agreement subtree of $T_a$ and $U_w$ with the root r, and let x be a leaf pruned in the process of obtaining the agreement subtree $A_r$. If the leaf x belongs to both $T_a$ and $U_w$, then the value $s(x, A_r)$ is defined as before. Otherwise, if x is a leaf of $T_a$ but not of $U_w$, then $A_r(x)$ will have only one leaf added with respect to $A_r$. Call this leaf $x_1$, and define $s(x, A_r)$ to be the length of the path in $A_r(x)$ from $x_1$ to the root r. Similarly we define $s(x, A_r)$ when x is a leaf of $U_w$ but not of $T_a$.

LEMMA 2

Let $T_a$ be a tree rooted at a vertex a whose children are b and c, and let $U_w$ have a root w with children x and y. Let $k_1$ denote the number of leaves which belong to both $T_b$ and $U_y$ or to both $T_c$ and $U_x$. Let $k_2$ denote the number of leaves which belong to both $T_b$ and $U_x$ or to both $T_c$ and $U_y$. Finally, let $p_b$, $p_c$, $p_x$, $p_y$ denote the number of leaves which belong to exactly one of $T_b$, $T_c$, $U_x$, or $U_y$, respectively, and let p be their sum. Then the value $L(T_a, U_w)$ is the minimum of the following six numbers (taken over the cases that attain the maximum in Lemma 1): $L(T_b, U_x) + L(T_c, U_y) + 2k_1 + p$, $L(T_b, U_y) + L(T_c, U_x) + 2k_2 + p$, $L(T_a, U_x) + p_y$, $L(T_a, U_y) + p_x$, $L(T_b, U_w) + p_c$, and $L(T_c, U_w) + p_b$.

Proof. The proof is based on the following observations. First note that every agreement subtree is formed in one of the six ways given in Lemma 1. Therefore, optimal agreement subtrees are also formed in this way. We want to show that if at every stage of the recursion we consider only optimal agreement subtrees, selecting among the six possibilities one that gives the largest number of leaves and, among those, the smallest value of L, then we end up with an optimal agreement subtree; i.e., we want to show that it is not possible to build an optimal agreement subtree out of nonoptimal agreement subtrees.

Let $A_r$ be an optimal agreement subtree of $T_a$ and $U_w$. Call its root r and the children of r t and z. According to Lemma 1 there are six possibilities of obtaining any agreement tree out of iteratively found agreement trees for some smaller subtrees. Since all six cases are similar, we will consider only one of them. Let us assume that $A_t$ is an agreement subtree of $T_b$ and $U_x$ and $A_z$ is an agreement subtree of $T_c$ and $U_y$, as in Figure 6. We will prove that both $A_t$ and $A_z$ are optimal agreement subtrees.

FIG. 6. Illustration for the proof of Lemma 2.

Let V denote the set of leaves pruned from both $T_a$ and $U_w$ in order to obtain $A_r$. We now need to define eight disjoint sets whose union gives the set V. Let

• $V_{bx}$ denote the set of all leaves from V which belong to both $T_b$ and $U_x$,
• $V_{cy}$ denote the set of all leaves from V which belong to both $T_c$ and $U_y$,
• $V_{by}$ denote the set of all leaves which belong to both $T_b$ and $U_y$,
• $V_{cx}$ denote the set of all leaves which belong to both $T_c$ and $U_x$,
• $V_b$ denote the set of all leaves which belong only to $T_b$,
• $V_c$ denote the set of all leaves which belong only to $T_c$,
• $V_x$ denote the set of all leaves which belong only to $U_x$, and
• $V_y$ denote the set of all leaves which belong only to $U_y$.

Clearly $k_1 = |V_{by}| + |V_{cx}|$ and $p = |V_b| + |V_c| + |V_x| + |V_y|$. With the above notation we have the following:

$$
\begin{aligned}
L(T_a, U_w) = l(A_r) &= \sum_{x \in V} s(x, A_r) \\
&= \sum_{x \in V_{bx}} s(x, A_t) + \sum_{x \in V_{cy}} s(x, A_z) + \sum_{x \in V_{by} \cup V_{cx}} \bigl(s(x, A_t) + 1 + s(x, A_z) + 1\bigr) \\
&\quad + \sum_{x \in V_b} \bigl(s(x, A_t) + 1\bigr) + \sum_{x \in V_x} \bigl(s(x, A_t) + 1\bigr) + \sum_{x \in V_c} \bigl(s(x, A_z) + 1\bigr) + \sum_{x \in V_y} \bigl(s(x, A_z) + 1\bigr) \\
&= \sum_{x \in V_{bx} \cup V_{by} \cup V_{cx} \cup V_b \cup V_x} s(x, A_t) + \sum_{x \in V_{cx} \cup V_{cy} \cup V_{by} \cup V_c \cup V_y} s(x, A_z) \\
&\quad + 2\bigl(|V_{by}| + |V_{cx}|\bigr) + |V_b| + |V_c| + |V_x| + |V_y| \\
&= l(A_t) + l(A_z) + 2k_1 + p.
\end{aligned}
$$

Now let $B_t$ ($B_z$) be an optimal agreement tree for $T_b$ and $U_x$ (for $T_c$ and $U_y$). Let $B_r$ be an agreement tree whose two branches are $B_t$ and $B_z$. Since $l(B_t) \le l(A_t)$ and $l(B_z) \le l(A_z)$, the following sequence of inequalities shows that $B_r$ is an optimal agreement subtree of $T_a$ and $U_w$, $A_t$ is an optimal agreement subtree of $T_b$ and $U_x$, and $A_z$ is an optimal agreement subtree of $T_c$ and $U_y$.

$$L(T_a, U_w) = l(A_r) \le l(B_r) = l(B_t) + l(B_z) + 2k_1 + p \le l(A_t) + l(A_z) + 2k_1 + p = l(A_r) = L(T_a, U_w).$$

∎

REFERENCES

1 W. H. E. Day, Optimal algorithms for comparing trees with labeled leaves, J. Classif. 2:7-28 (1985).
2 G. F. Estabrook, F. R. McMorris, and C. A. Meacham, Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units, Syst. Zool. 34:193-200 (1985).
3 C. R. Finden and A. D. Gordon, Obtaining common pruned trees, J. Classif. 2:255-276 (1985).
4 A. D. Gordon, On the assessment and comparison of classifications, in Analyse de Données et Informatique, R. Tomassone, ed., INRIA, Le Chesnay, France, 1980, pp. 149-160.
5 E. Kubicka, G. Kubicki, and F. R. McMorris, On agreement subtrees of two binary trees, Congr. Numer. 88:217-224 (1992).
6 E. Kubicka, G. Kubicki, and F. R. McMorris, An algorithm to find agreement subtrees, J. Classif. (in press).
7 D. Penny and M. D. Hendy, The use of tree comparison metrics, Syst. Zool. 34:75-82 (1985).
8 D. E. Rosen, Vicariant patterns and historical explanation in biogeography, Syst. Zool. 27:159-188 (1978).
9 M. A. Steel and D. Penny, Distributions of tree comparison metrics: some new results, Syst. Biol. 42:126-141 (1993).
10 M. Steel and T. Warnow, Kaikoura tree theorems: Computing the maximum agreement subtree, Inf. Proc. Lett. 48:77-82 (1993).
11 D. L. Swofford, When are phylogeny estimates from molecular and morphological data incongruent?, in Phylogenetic Analysis of DNA Sequences, M. M. Miyamoto and J. Cracraft, eds., Oxford Univ. Press, New York, 1991, pp. 295-333.
12 D. L. Swofford, PAUP: Phylogenetic Analysis Using Parsimony, version 3.0, Illinois Natural History Survey, Champaign, IL, 1991.