Numbering binary trees with labeled terminal vertices



0092-8240/83/010033-08$03.00/0

Bulletin of Mathematical Biology, Vol. 45, No. 1, pp. 33-40, 1983.

Pergamon Press Ltd. © 1983 Society for Mathematical Biology

Printed in Great Britain.

NUMBERING BINARY TREES WITH LABELED TERMINAL VERTICES

F. JAMES ROHLF* Mathematical Sciences Department, IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, U.S.A.

For each rooted binary tree with t labeled terminal vertices (leaves) a natural number can be assigned uniquely. Unrooted trees with t labeled terminal vertices and t-2 unlabeled internal vertices of degree 3 can also be numbered uniquely using the same convention. Rooted trees in which the heights of the internal vertices are rank ordered are also considered. Applications to problems in taxonomy are discussed.

Introduction. There has been much interest in recent years in methods for numbering and generating trees. These techniques are of practical importance since their use can greatly facilitate the storage and matching of trees, as well as provide a simple means to generate random trees. Göbel (1980) proposed a numbering scheme for rooted trees with unlabeled (indistinguishable) vertices. Knott (1977), Proskurowski (1980), Rotem and Varol (1978) and Solomon and Finkel (1980) have proposed methods for enumerating and numbering binary trees with unlabeled vertices. Knuth (1973) has dealt with the case of binary trees in which all of the vertices are labeled. Of particular interest in classification problems (see, e.g. Sneath and Sokal, 1973) are binary trees in which only the t terminal vertices (leaves) are labeled. Such trees correspond, for example, to the t - 1 nested subsets of t objects produced by a clustering method. These trees have been called n-trees (Bobisud and Bobisud, 1972) and bare trees (e.g. Day, 1980). Unrooted trees with labeled terminal vertices and internal vertices of degree 3 are also of interest (especially in phylogenetic studies). The present paper is concerned with methods for numbering and generating such trees. Equivalent techniques may have been developed by others (D. Colless and J. Felsenstein, personal communication) but I am not aware of any published accounts.

*Present address: Department of Ecology and Evolution, State University of New York, Stony Brook, NY 11794, U.S.A.


Binary Trees with Labeled Leaves. Consider a given rooted binary tree with t labeled terminal vertices (leaves), t - 1 unlabeled internal vertices of degree 3, and an additional terminal vertex which is the root. The branches of the tree are assumed to have been preordered so that, for example, the lowest ordered descendant of each left branch is "lower" than the lowest ordered descendant of its right branch. The method of determining the order for the terminal labels is arbitrary (numerically, alphabetically, etc.) but it must be fixed for any one application. No meaning is given to the differences in the heights (distance from the root) of the vertices (but see below). Such trees have 2t vertices and hence 2t - 1 edges. The edges can be numbered (using the numbers 0 to 2t - 2) starting, for example, with the edge connected to the first terminal vertex, in the order in which the edges are first encountered in a "downward" (towards the root) direction as one traverses the tree [using an inorder traversal (Knuth, 1973)]. Figure 1 shows an example. The importance of the preordering mentioned above can also be seen from this example. If the left-right ordering of the edges coming from any of the internal vertices were altered, the order in which they would be encountered (and hence numbered) would be altered. Other conventions could also be used as long as they could be consistently applied.

Figure 1. An example of a rooted binary tree with t = 5 labeled terminal vertices and the edges numbered according to the convention described in the text. Lengths of the edges are arbitrary.

Consider adding a new terminal vertex to an existing tree with t labeled vertices. To do this one of the 2t - 1 existing edges must be broken and connected to a new internal vertex which is, in turn, connected to the new labeled terminal vertex. Thus, a number, say $N_{t+1}$, can indicate the edge on a tree with t labeled vertices where the vertex t + 1 is to be added. If edge $N_{t+1}$ connects, say, vertices i and j, then in the list of edges in the updated tree this edge will connect vertex i to the new internal vertex, and two new edges must be inserted into the list immediately following the present edge. The first such edge connects the new internal vertex to the new labeled terminal vertex, t + 1. The next connects the new internal vertex to vertex j. The numbers of all subsequent edges are thus increased by 2.

In a similar manner one could delete the last vertex, t (and its associated internal vertex), from an existing tree and identify by $N_t$ ($0 \le N_t \le 2t - 4$, since the reduced tree has 2t - 3 edges) the number associated with the edge in the reduced tree where the removed internal vertex was located. If one were to successively remove the highest ordered labeled vertices one could obtain the t-tuple of numbers

$$N = (N_1, N_2, \ldots, N_i, \ldots, N_t). \qquad (1)$$
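The list-update rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: each edge is stored as a pair (upper vertex, lower vertex) with the lower end nearer the root, vertex i is taken to be the upper end of the broken edge, and the function names and the internal-vertex names "v1", "v2", ... are hypothetical.

```python
def add_leaf(edges, n, new_leaf):
    """Return a new edge list with `new_leaf` attached on edge number n."""
    if not 0 <= n < len(edges):
        raise ValueError("edge number out of range")
    i, j = edges[n]  # assumed orientation: i is the upper end, j is nearer the root
    internal = f"v{new_leaf - 1}"  # hypothetical name for the new internal vertex
    return (
        edges[:n]
        + [(i, internal),          # the broken edge now ends at the new internal vertex
           (new_leaf, internal),   # first inserted edge: new internal vertex to new leaf
           (internal, j)]          # second inserted edge: new internal vertex to j
        + edges[n + 1:]            # later edges are implicitly renumbered by +2
    )


def build_edge_list(N):
    """Rebuild the numbered edge list from the tuple N = (N_1, ..., N_t)."""
    edges = [(1, "root")]  # tree with one leaf: a single edge, numbered 0
    for leaf, n in enumerate(N[1:], start=2):
        edges = add_leaf(edges, n, leaf)
    return edges


for number, edge in enumerate(build_edge_list((0, 0, 2))):
    print(number, edge)
```

The position of an edge in the list serves as its number, so inserting two entries after position n raises the numbers of all later edges by 2, as stated in the text.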

The ith component indicates the edge, in a tree of size i - 1, where the ith labeled vertex is to be added in order to regenerate the given tree. The numbers $N_1$ and $N_2$ are always equal to zero and could be omitted since there is no choice in how one constructs a tree with fewer than 3 labeled vertices. This t-tuple is analogous to Knuth's (1973:390) canonical representation of a rooted tree. However, in his case, all vertices were labeled and the numbers denoted vertex names rather than edges as in the present case.

It can be shown (Knuth, 1973; Felsenstein, 1978) that since there is only one sequence in which one can reduce a given tree by successively removing the highest ordered labeled terminal vertex, there is also only one sequence in which one can build up to the desired tree by adding the vertices in the order 1, 2, ..., t. These sequences are the inverse of each other. The t-tuple N describes the sequence, thus there is a 1-1 correspondence between a tree and some N. For t ≥ 2 there are

$$B_t = \prod_{i=2}^{t} (2i - 3) = (2t - 3) B_{t-1} \qquad (2)$$

such trees. For t = 1 there is only $B_1 = 1$ tree. This formula has been established by several workers (e.g. Edwards and Cavalli-Sforza, 1964; Phipps, 1976).

The set of all such trees can be represented geometrically as $B_t$ points in a space of t dimensions with the $N_i$ giving the coordinates along the ith axis. These points can be numerically identified (numbered) using a variety of different conventions, but the following method, which treats N as a mixed base number, seems natural and convenient. Let

$$M = B_{t-1} N_t + B_{t-2} N_{t-1} + \cdots + B_3 N_4 + N_3. \qquad (3)$$
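The counts $B_t$ that serve as weights in the mixed base number can be tabulated directly from the product formula. The following is a minimal sketch; the function name is an assumption made for illustration.

```python
def count_rooted_trees(t: int) -> int:
    """Return B_t, the product of (2i - 3) for i = 2..t, with B_1 = 1."""
    if t < 1:
        raise ValueError("t must be a positive integer")
    count = 1
    for i in range(2, t + 1):
        count *= 2 * i - 3  # one factor per added leaf: edges available in the smaller tree
    return count


print([count_rooted_trees(t) for t in range(1, 7)])  # [1, 1, 3, 15, 105, 945]
```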


Note that the terms corresponding to $N_1$ and $N_2$ have been omitted since they are always equal to zero. It is easy to show that $0 \le M \le B_t - 1$ and hence there is a unique correspondence between each value of M and some binary tree with t labeled terminal vertices. Figure 2 shows, for example, all 15 possible rooted binary trees with 4 labeled terminal vertices and their associated values of M. As can be seen from this figure, adjacent trees according to this numbering scheme are not necessarily similar. The expression for M can also be written as

$$M = N_3 + 3(N_4 + 5(N_5 + \cdots + (2t - 5) N_t)). \qquad (4)$$
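The nested form (4) lends itself to evaluation from the innermost term outward. The sketch below assumes that N is supplied as a full t-tuple including the two leading zeros and that each $N_i$ lies between 0 and 2i - 4; the function name is hypothetical.

```python
from itertools import product


def tree_number(N):
    """Map N = (N_1, ..., N_t) to M by evaluating the nested form (4)."""
    M = 0
    for i in range(len(N), 2, -1):      # i = t, t - 1, ..., 3
        n_i = N[i - 1]
        if not 0 <= n_i <= 2 * i - 4:   # N_i indexes one of 2i - 3 edges
            raise ValueError(f"N_{i} out of range")
        M = M * (2 * i - 3) + n_i
    return M


# For t = 4 the 3 * 5 admissible tuples should give each of 0..14 exactly once.
codes = sorted(tree_number((0, 0, n3, n4)) for n3, n4 in product(range(3), range(5)))
print(codes == list(range(15)))  # True
```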

Thus, the t-tuple N can be reconstructed from M by the following algorithm:

    A = M
    For i = 3 to t - 1 begin
        B = ⌊A/(2i - 3)⌋
        N_i = A - (2i - 3)B
        A = B
    end
    N_t = A.

The ⌊ ⌋ operation obtains the largest integer not greater than its argument (it corresponds to truncating integer arithmetic for positive integers).

Figure 2. All possible rooted binary trees with 4 labeled terminal vertices and their numbering according to the method described in the text.
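A direct transcription of the unpacking algorithm might look as follows. It is a sketch rather than a definitive implementation: the function name and the range check on the final quotient are additions, and N is returned with the two leading zeros included.

```python
def tuple_from_number(M, t):
    """Recover N = (N_1, ..., N_t) from M by repeated division."""
    if t < 1 or M < 0:
        raise ValueError("need t >= 1 and M >= 0")
    N = [0] * t                          # N_1 and N_2 are always zero
    A = M
    for i in range(3, t):                # i = 3, ..., t - 1
        B = A // (2 * i - 3)             # floor division
        N[i - 1] = A - (2 * i - 3) * B   # remainder is N_i
        A = B
    if t >= 3:
        N[t - 1] = A                     # what is left is N_t
    limit = 2 * t - 4 if t >= 3 else 0
    if A > limit:                        # added check: M must not exceed B_t - 1
        raise ValueError("M exceeds B_t - 1")
    return tuple(N)


print(tuple_from_number(7, 4))  # (0, 0, 1, 2)
```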

Unrooted (Free) Trees. If the label t is given to the root vertex of a rooted binary tree with t - 1 labeled terminal vertices, then one obtains an unrooted tree with t labeled terminal vertices, t - 2 unlabeled internal vertices (all of degree 3) and 2t - 3 edges. The method described in the previous section can be applied to such unrooted trees by the simple artifice of temporarily considering the last labeled vertex to be the 'root' and then using the methods of the previous section for a rooted tree with t - 1 labeled terminal vertices. For example, if the roots of the trees in Figure 2 were labeled so as to represent a fifth terminal vertex, then the 15 trees would correspond to all possible unrooted binary trees with 5 labeled terminal vertices.

Binary Trees and Heights of Internal Nodes Ranked. Frank and Svensson (1981) defined a dendrogram for t objects as an ordered sequence of t - 1 partitionings of the set of objects into nested pairs of subsets until one obtains t subsets each containing only a single object. This sequence of partitionings can be represented as a binary tree in which the height (distance from the root) of each internal vertex corresponds to the order (ranging from 1 to t - 1) at which a subset split into the two subsets whose members are the descendants of the left and right branches of the given vertex (only one subset is allowed to split at each step). The terminal vertices are all considered to have a height of t. It should be noted that this definition of the term "dendrogram" is not the one usually used in cluster analysis and numerical taxonomy (see, e.g. Sneath and Sokal, 1973). The trees in Figure 2 are plotted as such dendrograms since no two internal vertices are shown at the same height (this fact was ignored in the previous sections). For these trees the internal vertices are distinguishable (by their heights) even though they are not labeled.
Frank and Svensson (1981) give the number of possible such dendrograms as

$$D_t = \frac{t!\,(t - 1)!}{2^{t-1}}. \qquad (5)$$

This may also be expressed as

$$D_{t+1} = \frac{t(t + 1)}{2} D_t. \qquad (6)$$
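The closed form (5) and the recurrence (6) can be checked against each other numerically. The short sketch below uses a hypothetical function name and relies only on Python's standard library.

```python
from math import factorial


def count_dendrograms(t: int) -> int:
    """Return D_t = t! (t - 1)! / 2^(t - 1), equation (5)."""
    if t < 1:
        raise ValueError("t must be a positive integer")
    # The quotient is an integer because D_t is a product of the factors i(i - 1)/2.
    return factorial(t) * factorial(t - 1) // 2 ** (t - 1)


# Recurrence (6): D_{t+1} = t (t + 1) / 2 * D_t
for t in range(1, 10):
    assert count_dendrograms(t + 1) == t * (t + 1) // 2 * count_dendrograms(t)

print([count_dendrograms(t) for t in range(1, 6)])  # [1, 1, 3, 18, 180]
```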


This relationship and a method for numbering such dendrograms can be developed as a simple extension of the method proposed in the previous sections. Consider a dendrogram with t labeled terminal vertices (e.g. the dendrogram in Figure 3 with 5 terminal vertices). The internal vertices occur at t - 1 distinct levels. Since the t terminal vertices are all plotted at the same height, the edges must differ in length. Introduce dummy internal vertices (of degree 2) at heights 1 to t - 1 so that all edges are of the same length (see Figure 3). The number of dummy vertices required will be (t - 1)(t - 2)/2. The tree will then have

$$v = 1 + t + (t - 1) + \frac{(t - 1)(t - 2)}{2} = 1 + \frac{t(t + 1)}{2} \qquad (7)$$

vertices (with the root counted as a vertex). The v - 1 edges each represent a distinct location where a new vertex could be added to the tree (since a new vertex will be placed within an existing edge it will not be at the same height as any existing vertex). Thus, there are t(t + 1)/2 times as many dendrograms with t + 1 objects as there are with only t objects (as stated in equation 6). The edges can be identified by the numbers 0 to t(t + 1)/2 - 1 using the same tree traversal conventions as before. A canonical representation and numbering scheme, M', can thus be established for dendrograms as was done before for binary trees with labeled leaves, where

$$M' = D_{t-1} N_t + D_{t-2} N_{t-1} + \cdots + D_3 N_4 + N_3.$$
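The mixed base number M' can be packed and unpacked in the same way as M. The sketch below assumes that $N_i$ indexes one of the i(i - 1)/2 edges of the dendrogram with i - 1 objects, so that the successive bases 3, 6, 10, ... multiply out to the weights $D_{i-1}$ in the expression for M'. All function names are hypothetical.

```python
def edge_count(i: int) -> int:
    """Edges in a dendrogram with i - 1 objects: the assumed base for N_i."""
    return i * (i - 1) // 2


def dendrogram_number(N):
    """Pack N = (N_1, ..., N_t) into M' by nested multiplication."""
    M = 0
    for i in range(len(N), 2, -1):       # i = t, t - 1, ..., 3
        if not 0 <= N[i - 1] < edge_count(i):
            raise ValueError(f"N_{i} out of range")
        M = M * edge_count(i) + N[i - 1]
    return M


def dendrogram_tuple(M, t):
    """Unpack M' into N = (N_1, ..., N_t) by repeated division."""
    N = [0] * t                          # N_1 and N_2 are always zero
    A = M
    for i in range(3, t):                # i = 3, ..., t - 1
        A, N[i - 1] = divmod(A, edge_count(i))
    if t >= 3:
        N[t - 1] = A
    return tuple(N)


M = dendrogram_number((0, 0, 2, 5))
print(M, dendrogram_tuple(M, 4))  # 17 (0, 0, 2, 5)
```

With t = 4 the largest admissible tuple maps to 17, which equals $D_4 - 1$.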

Such numbers can be unpacked into their canonical sequence using an algorithm analogous to that used before (substitute i(i - 1)/2, the number of edges in a dendrogram with i - 1 objects, for 2i - 3 in two places).

Figure 3. An example of a dendrogram with t = 5 objects represented as a binary tree. Dummy internal vertices (circles) have been added so that all edges are of equal length and all terminal vertices are of the same height.

Applications. The proposed mapping of trees into natural numbers is useful for identification of binary trees and dendrograms with a given number of labeled terminals. For example, in biological taxonomy some methods of analysis require the generation and checking of large numbers of binary trees. A given tree can be checked as to whether it duplicates trees already considered by simply comparing its M with those in a stored list. A numbering scheme is particularly useful for the generation of binary trees and dendrograms randomly from the set of all possible such trees by simply using a random integer as M. The principal limitation of the method seems to be the fact that the range of possible values of M becomes large quite rapidly as t increases so that multiple precision arithmetic may be required to manipulate M in a computer. For example, $B_{50}$ is larger than $10^{76}$. The number of dendrograms, $D_t$, increases even more rapidly. In such cases the t-tuple, N, would have to be used directly.
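Working with the t-tuple directly can be sketched as follows for random generation of rooted trees. Because M is a mixed base number, drawing each $N_i$ independently and uniformly from 0 to 2i - 4 is equivalent to drawing M uniformly from 0 to $B_t - 1$, and it avoids large integers altogether. The function names and the seed are illustrative assumptions; Python integers happen to be arbitrary precision, which is used here only to check the size of $B_{50}$.

```python
import random


def count_rooted_trees(t: int) -> int:
    """Return B_t, the product of (2i - 3) for i = 2..t."""
    count = 1
    for i in range(2, t + 1):
        count *= 2 * i - 3
    return count


def random_tree_tuple(t: int, rng: random.Random):
    """Draw N = (N_1, ..., N_t) uniformly over all rooted trees with t leaves."""
    # N_1 = N_2 = 0; for i >= 3, N_i is one of the 2i - 3 edges of the smaller tree.
    return tuple(0 if i < 3 else rng.randrange(2 * i - 3) for i in range(1, t + 1))


rng = random.Random(1983)  # arbitrary seed, for reproducibility only
N = random_tree_tuple(50, rng)
print(len(N), count_rooted_trees(50) > 10 ** 76)  # 50 True
```

Duplicate checking can be carried out in the same spirit by storing the tuples themselves (for example in a set) rather than the packed number M.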

This paper represents Contribution No. 387 in Ecology and Evolution, State University of New York, Stony Brook. This research was supported in part by Grants (Nos DEB 772461101 and DEB 8003508) from the National Science Foundation.

LITERATURE

Bobisud, H. M. and L. E. Bobisud. 1972. "A Metric for Classifications." Taxon 21, 607-613.

Day, W. H. E. 1980. "A New Approach to Constructing Tree Metrics." Technical Report No. 8001. Department of Computer Science, Memorial University of Newfoundland, St. John's, Newfoundland, Canada, 14 pp.

Edwards, A. W. F. and L. L. Cavalli-Sforza. 1964. "Reconstruction of Evolutionary Trees." In Phenetic and Phylogenetic Classification, W. H. Heywood and J. McNeil, pp. 67-76. Systematics Association Publication No. 6, London.

Felsenstein, J. 1978. "The Number of Evolutionary Trees." Syst. Zool. 27, 27-33.

Frank, O. and K. Svensson. 1981. "On Probability Distributions of Single-Linkage Dendrograms." J. statist. Comput. Simul. 12, 121-131.

Göbel, F. 1980. "On a 1-1-Correspondence between Rooted Trees and Natural Numbers." J. Combinatorial Theory, Series B 29, 141-143.

Knott, G. D. 1977. "A Numbering System for Binary Trees." Communs. Ass. comput. Mach. 20, 113-115.

Knuth, D. E. 1973. The Art of Computer Programming, Vol. 1, 2nd edition. Reading: Addison-Wesley. 634 pp.

Phipps, J. B. 1976. "The Numbers of Classifications." Can. J. Bot. 54, 686-688.

Proskurowski, A. 1980. "On the Generation of Binary Trees." J. Ass. comput. Mach. 27, 1-2.

Rotem, D. and Y. L. Varol. 1978. "Generation of Binary Trees from Ballot Sequences." J. Ass. comput. Mach. 25, 396-404.

Sneath, P. H. A. and R. R. Sokal. 1973. Numerical Taxonomy. San Francisco: W. H. Freeman. 573 pp.

Solomon, M. and R. A. Finkel. 1980. "A Note on Enumerating Binary Trees." J. Ass. comput. Mach. 27, 3-5.

RECEIVED 8-5-81 REVISED 1-26-82