Molecular Phylogenetics and Evolution Vol. 19, No. 1, April, pp. 1– 8, 2001 doi:10.1006/mpev.2000.0891, available online at http://www.idealibrary.com on
An Algorithm for Constructing Local Regions in a Phylogenetic Network K. T. Huber,* ,1 E. E. Watson,† and M. D. Hendy‡ *FMI, Department of Mathematics and Physics, Mid Sweden University, S-851-70 Sundsvall, Sweden; and †Institute of Biomolecular Sciences, and ‡Institute of Fundamental Sciences, Massey University, Private Bag 11 222, Palmerston North, New Zealand Received May 10, 1999, revised October 13, 2000; published online March 8, 2001
1993; Stine et al., 1992; Templeton et al., 1992) as they may make the researcher aware of possibly competing signals (such as recombination or crossover events) in the data, thus, leaving more room for interpretation by highlighting, rather than arbitrarily resolving, conflicting characters. This criticism holds for most tree-building programs based on maximum-parsimony, distance, or maximum-likelihood methods. Postprocessing of an obtained tree, with which an arbitrarily resolved character conflict might be detected, is rarely ever done (Bandelt et al., 1995). A remark by Swofford and Olsen (1990) emphasizes this: “Unfortunately, phylogenetic analysis is frequently treated as a black box into which data is fed and out of which The Tree springs.” In this paper, we present a local construction scheme for a phylogenetic network called the Buneman graph, which is also known as the median network (Bandelt, 1991; Bandelt et al., 1995) and which has proven helpful in phylogenetic analysis for small data sets: For a (not necessarily small) taxa set X together with either 2-state character sequences or a distance matrix, a set of splits can be canonically derived (for the case of 2-state character sequences, see, e.g., Huber et al., 1999). By a split of X we mean a division (i.e., partition) of X into two nonempty subsets. To each of these splits a support value can be estimated. This could be done for distance data using Split decomposition (Bandelt and Dress, 1992) and for nucleotide sequence data by spectral analysis (Hendy et al., 1994), where we can select those splits whose support values are greater than some predetermined threshold. We call the set of all the selected splits a split system, denoted by ¥, and use #¥ for its size. When ¥ contains incompatible splits such as when there might have been cross-over events, recombination, hybridization, or parallel mutation on fast-evolving sites, it is well known that a tree cannot represent ¥. A tool to visually display a selected collection ¥ of splits of some taxa set X is the Buneman graph G(¥) associated with ¥ (see Barthelemy, 1989; Barthelemy and Guenoch, 1991; Dress et al., 1997; Huber et al.,
The groupings of taxa in a phylogenetic tree cannot represent all the conflicting signals that usually occur among site patterns in aligned homologous genetic sequences. Hence a tree-building program must compromise by reporting a subset of the patterns, using some discriminatory criterion. Thus, in the worst case, out of possibly a large number of equally good trees, only an arbitrarily chosen tree might be reported by the tree-building program as “The Tree.” This tree might then be used as a basis for phylogenetic conclusions. One strategy to represent conflicting patterns in the data is to construct a network. The Buneman graph is a theoretically very attractive example of such a network. In particular, a characterization for when this network will be a tree is known. Also the Buneman graph contains each of the most parsimonious trees indicated by the data. In this paper we describe a new method for constructing the Buneman graph that can be used for a generalization of Hadamard conjugation to networks. This new method differs from previous methods by allowing us to focus on local regions of the graph without having to first construct the full graph. The construction is illustrated by an example. © 2001 Academic Press
1. INTRODUCTION In phylogenetic analysis, a tree structure for representing phylogenetic data is usually preferred. Yet data representing networks (graphs) that include a number of alternate trees (where a choice might be made on a secondary criteria) can be advantageous (Bandelt, 1991; Bandelt et al., 1995; Bandelt and Dress, 1992; Crandall and Tempelton, 1993; Dopazo et al., 1993; Eigen, 1987; Eigen et al., 1988; Eigen and Winkler-Oswatitsch, 1990; Excoffier and Smouse, 1994; von Haeseler and Churchill, 1993; Hendy et al., 1978, 1980; Haszprunar, 1992; Lester et al., 1983; Lockhart et al., 1995, 1998; Meacham, 1984; Page and Holmes, 1998; Rozas and Aguade, 1
To whom correspondence should be addressed. 1
1055-7903/01 $35.00 Copyright © 2001 by Academic Press All rights of reproduction in any form reserved.
2
HUBER, WATSON, AND HENDY
structed. When the algorithm is extended to construct the full graph, it is at least as computationally efficient as other methods. 2. AN INFORMAL DESCRIPTION OF THE ALGORITHM AND AN EXAMPLE
FIG. 1. The Buneman graph G(¥) of ¥ of Table 1. This is drawn with the vertices as identified in Table 2 and with each element of d X associated with a vertex using the algorithm with x : ⫽ A and d x ⫽ 5. (The computations are detailed in Table 2.) The split sets of G(¥) are identified by sets of parallel edges, so, for example, the split labeled by CDGHI (index 7) is the set of six edges parallel to (v 1, v 4) near the top right of the figure. When these edges are deleted from G(¥), the vertex labels on the two components are ABEFJK and CDGHI, respectively. If we set d x: ⫽ 2, then we obtain the subgraph G x(¥, d x) above the dashed line.
1999; and see Fig. 1 for an example). Among a number of useful properties, the Buneman graph G(¥) is always an isometric 2 subgraph of the #¥-dimensional hypercube H(¥) associated with ¥. Also it is known that G(¥) is a tree precisely if no two splits in ¥ are incompatible. Moreover, it follows from Bandelt et al. (1995) that all most-parsimonious trees supported by the taxa set can be realized in G(¥). This theoretically highly attractive result may, however, be of restricted practical use in view of a computation done by Excoffier and Smouse (1994) that showed that, on a RFLP data set of 56 haplotypes, the number of equally parsimonious trees exceeds one billion. The construction algorithm that we present below differs significantly from various other strategies (e.g., Barthelemy and Guenoch, 1991; Dress et al., 1997; 3 see also Bandelt, 1991, and Bandelt et al., 1995): It allows us to construct a disk, that is, the local structure, around any labeled vertex of the graph, without having to first construct the full graph. This will make the attractive concept of split representation realized in the Buneman graph applicable to larger data sets. In particular, it enables us to either attempt the construction of the full network by a disk approach or to focus on regions of interest in cases where the full graph may be visually confusing or computationally challenging. Further, if the partially completed graph is too complex for visual representation, we may choose to increase the threshold value, so that a useful approximate graphical representation of the data can be con2
This means that the (graph theoretical) distance (in G(¥)) between every pair of vertices v and w in G(¥) is the same as the (graph theoretical) distance of v and w in H(¥). 3 In this paper, a direct generalization of Buneman’s original tree building approach (Buneman, 1971) is given.
In this section, we first give an informal description of the algorithm and then elucidate the algorithm by means of an example. A formal description, as well as a verification of the algorithm, is given in Appendixes A and B. Suppose we have a set of splits ¥ of a taxa set X. The Buneman graph G(¥) comprises vertices (points or nodes) linked by (unweighted) edges (lines). Each taxon of X labels a vertex of G(¥). A vertex of G(¥) that is unlabeled is called latent and can be thought of representing a hypothetical intermediate sequence. Each edge of G(¥) corresponds to a split in ¥. Some splits may have more than one corresponding edge and deleting each of the edges corresponding to a split ⫽ {A, B} in ¥ cuts G(¥) into two connected subgraphs, with the labeled vertices of these subgraphs corresponding to the taxa sets A and B. Usually, G(¥) is drawn so that all the edges corresponding to the same split in ¥ are parallel. 2.1. An Informal Description of the Algorithm Suppose we wish to construct, for a set ¥ of unweighted splits of a taxa set X, part of G(¥) “close” to the vertex labeled by a particular taxon x in X. Expressed more formally, we wish to construct, for a prechosen value d, all those vertices in G(¥) that are separated from x by at most d edges together with all the edges of G(¥) connecting them. We note that if d is sufficiently large, we will construct all of G(¥). In a preprocessing phase, we identify each split of ¥ by that subset in that does not contain x and write ¥ x as the set of these subsets, over all splits in ¥. The algorithm is now based on the observation that every vertex v in G(¥) can be identified with a unique subset of ¥ x. To be more precise, calling two elements A 1, A 2 in ¥ x incompatible if either A 1 ⫽ A 2 or A 1 艚 A 2 is a proper 4 subset both of A 1 and of A 2 and a subset C of ¥ x a clique if either C is the empty set or every pair of elements in C is incompatible, then v can be identified with a unique clique in ¥ x (see Appendix A for more on this). We begin the construction with vertex v 1, which corresponds to the empty set and is labeled by x. We then identify and label the vertices v 2, v 3, . . . , which are adjacent to v 1. These vertices correspond to the “maximal” elements of ¥ x, that is, those elements that are not contained in any other element of ¥ x. If there are m maximal elements in ¥ x, then we 4
A proper subset of A is a nonempty subset of A not equal to A.
3
LOCAL PHYLOGENETIC NETWORKS
TABLE 1 A Set of Splits Derived from West African Human mt Sequences Site:
162
189
278
304
311
319
362
A 011WM B 343WF C 337WF D 338WF E 331WF F 334WF G 529WH H 038WM I 057WM J 256WH K 051WY Reference string Index
A 䡠 䡠 䡠 䡠 䡠 䡠 䡠 䡠 䡠 G K 1
T 䡠 䡠 䡠 䡠 䡠 䡠 䡠 C C 䡠 IJ 2
C 䡠 䡠 T T 䡠 T T T T T DEGHIJK 3
T 䡠 䡠 䡠 䡠 䡠 䡠 C C 䡠 䡠 HI 4
T 䡠 䡠 䡠 䡠 C C C C C C FGHIJK 5
G A A 䡠 䡠 䡠 䡠 䡠 䡠 䡠 䡠 BC 6
T 䡠 C C 䡠 䡠 C C C 䡠 䡠 CDGHI 7
Note. Variable sites of cluster L3b (Watson et al., 1997) derived from the mtDNA hypervariable I region, together with the induced splits. This cluster is specific to West Africans (represented in this example by the Fulbe (WF), Hausa (WH), Mandenka (WM), and Yoruba (WY) ethnic groups). Sequence A (011WM) varies from the Cambridge Reference Sequence (Anderson et al., 1981) at positions 124 and 223. All positions are given without the prefix “16.” Each sequence has an associated single-letter identifier (A, . . . , K). In each column, the dot indicates the same nucleotide as in sequence A. Each split is indexed by the string of letters of the subset not containing sequence A and each such string of letters itself is indexed by a number between 1 and 7.
connect v 1 to each of the vertices v 2, v 3, . . . , v m ⫹ 1 by an edge directed away from v 1 and label each obtained edge {v 1, v i} (2 ⱕ i ⱕ m ⫹ 1) by the split referenced by the maximal element in ¥ x that corresponds to v i. These will be all the vertices at distance one from v 1. We now construct the vertices at distance two from v 1, by identifying the edges directed away from v 2, v 3, . . . , v m ⫹ 1—at the same time labeling these vertices and edges. Then we successively identify and label the edges directed away from these vertices, which will be distance three from v 1, continuing until we have constructed and labeled all vertices of G(¥) that are at a distance at most d from v 1. The rules for identifying the adjacencies, the associated sets and labels for the vertices, and the labels for the edges are specified in detail in Appendix A. In a postprocessing phase, we redraw the edges so that those edges labeled by the same split are drawn parallel. 2.2. An Example X ⫽ {A, B, C, . . . , K} is a set of sequences derived from the hypervariable I region of the human mitochondrial DNA (Watson et al., 1997), which induces the set ¥ of splits given in Table 1. With x : ⫽ A, each split ⫽ {Y, Z} induced by Table 1 is referenced by the subset Y of that does not contain A. (For brevity, the elements of Y are written as a string without braces or commas.) Thus, for example, the split 兵兵A, B, C, F其, 兵D, E, G, H, I, J, K其其 of X is referenced by the string DEGHIJK, and so ¥ x consists of the elements K, IJ, DEGHIJK, HI,
FGHIJK, BC, and CDGHI. However, in the following discussion and in Table 2, we prefer to refer to an element v in ¥ x by the index of v given in Table 1. Thus, for example, the string CDGHI is indexed by 7 and, hence, ¥ x can be thought of as the set {1, . . . , 7}. Clearly, each such index also uniquely indexes a split induced by Table 1. Suppose we want to construct the full graph G(¥). We start the construction with the initial vertex v 1, which we label by A and to which we assign the empty set as its clique. In order to identify all incident edges of v 1 and thus all adjacent vertices of v 1, we need to identify all maximal elements in ¥ x since each maximal element in ¥ x references a split that corresponds to the edge that joins v 1 and a newly constructed vertex. In the example, the maximal elements in ¥ x are indexed by 3, 5, 6, and 7 and the first constructed vertex—it is incident to the edge corresponding to the split {DEGHIJK, ABCF}—is v 2 to which the clique consisting only of the string indexed by 3 and the label E are associated (cf. also Table 2). We note here that later on in the construction process, finding all incident edges and thus all adjacent vertices of an already constructed vertex is, in general, more complicated than the situation described above. This is due to the fact that the algorithm’s underlying idea for finding incident edges and thus adjacent vertices of a vertex v is to identify v with a unique clique C v of ¥ x and then to add a set A in ¥ x to C v. However, in general, the union C v 艛 {A} of C v and {A} is not necessarily a clique and, thus, does not necessarily correspond to a vertex in G(¥). In order to make sure that A is always chosen in such a way that C v 艛 {A} is a clique, and thus the split represented by A corre-
4
HUBER, WATSON, AND HENDY
sponds to an edge joining v with a further vertex, we have introduced the notion of the forward set of a vertex. It turns out that, in general, every set in the forward set of a vertex v in G(¥) represents a split that corresponds to an incident edge of v and thus yields an adjacent vertex of v. Consequently, reaching a level where the forward set of each newly constructed vertex is empty terminates the algorithm. (For details, see Appendix A, where a formal description of the algorithm is given, Table 2, which was created by applying the proposed construction algorithm to the example, and Appendix B, where a verification of the algorithm is presented.) As can be seen from Fig. 1, more than one already constructed vertex, e.g., vertices v 2 and v 3, may be adjacent to the same vertex v, e.g., vertex v 6. In this situation, every edge leading to v is added to the graph, but the unique clique in ¥ x that corresponds to v is computed only once, namely, the first time that v is encountered. 3. CONCLUSIONS From a theoretical point of view, the representation of a split system of a set of taxa by the Buneman graph is a highly attractive concept. Yet, in general, the number of vertices in the Buneman graph is large (see Dress et al., 1997, for an enumeration formula) and their computation is NP-hard (Barthelemy and Guenoch, 1991, p. 198). Even for taxa sets sufficiently small to allow the construction, the Buneman graph may be too complicated, making it hard to actually visualize and to derive complex evolutionary relationships. A way out of this dilemma is to use the presented algorithm to either attempt the construction of the full Buneman graph by using a disk strategy or only focus on and construct local regions of interest. The freedom in choice of a reference taxon for the construction and the size of the local region surrounding it make these approaches feasible. When the computation of the full Buneman graph may be infeasible, this algorithm allows us still to construct, in reasonable time, local regions of interest that represent the split system in the spirit of the Buneman graph. A possible application is spectral analysis of aligned nucleotide sequence data, which, using Hadamard conjugation, produces a support signal for each possible split among the taxa set. The Hadamard conjugation estimates the number of unobserved substitutions under simple substitution models. If the data truly reflect a phylogenetic signal, then most of these support signals have expected value zero and the remainder correspond to a compatible set of splits. However, if there had been a transgenic effect, or a misalignment, then this would lead to pairs of incompatible splits having positive expected support. Hence when we examine the
conjugate spectrum we use a preselected threshold to filter out all small signals. If the remaining signals do not fit a tree, then we need a mechanism to display this information. A network, such as the Buneman graph, with the edges weighted by the supports for each split is appropriate for this purpose. However, in studies with a large number of splits, we can display local regions of the Buneman graph, to illustrate the nature of the nontree information in the sequences. APPENDIX A: A FORMAL DESCRIPTION OF THE ALGORITHM Here, we formally describe an algorithm that, for a set ¥ of unweighted splits on a taxa set X ⫽ { x 1, x 2, . . . , x n} of n taxa and an arbitrarily chosen reference element x 僆 X, constructs the graph G x(¥, d x) comprising the vertices and edges of the Buneman graph G(¥) within a prechosen distance d x of the vertex of G(¥) labeled by x. Note that for any x 僆 X, the graph G x(¥, #¥) constructed by the algorithm below is the Buneman graph G(¥). The computational complexity of this algorithm is O(e), where e denotes the number of edges in G x(¥, d x). Preliminaries and notation. The algorithm is based on the interconnection of the vertices and edges of G(¥) with subsets of X, which was shown in Dress et al. (1997): From X we select a reference taxon x. We identify each split ⫽ {A, B} of X by the set in that does not contain x. As in the last section, let ¥ x be the collection of all the subsets A 債 X that do not contain ¯ } consisting of A and the x and for which the set {A, A ¯ complement A: ⫽ X ⫺ A is a split in ¥. We recall that two sets A 1 , A 2 僆 ¥ x are called incompatible whenever A 1 艚 A 2 is a proper subset both of A 1 and of A 2 . A collection of k distinct sets in ¥ x , with each pair of sets incompatible, is referred to as a k-clique. Clearly, a clique is a k-clique for some k ⱖ 0. The 0-clique A of no sets, and the 1-cliques {A}, for each A 僆 ¥ x , vacuously meet the incompatibility criterion. It follows directly from Dress et al. (1997) that there is a one-to-one correspondence between the vertices of G(¥) and the cliques of ¥ x. Based on this, we identify the vertices of G(¥) with the cliques in ¥ x, and, for x 僆 X, we denote by Ꮿ x the collection of the cliques in ¥ x that correspond to vertices in G x(¥, d x). We call two cliques C 1, C 2 in ¥ x adjacent precisely if their corresponding vertices in the Buneman graph G(¥) are adjacent; that is, the sets C 1 and C 2 differ in precisely one element, where for any clique C 債 ¥ x, the set C denotes the collection of all sets in ¥ x that contain some set in C as a subset (see also below). During the construction of G x(¥, d x), the elements C in Ꮿ x are successively identified. To each new clique
5
LOCAL PHYLOGENETIC NETWORKS
C 僆 Ꮿ x, we associate the forward set F C. This set plays the key role in the algorithm below as: • it codes for all new cliques adjacent to C in ¥ x; • it identifies the new (i.e., the not already identified) edges in H(¥) that are incident to C in G x(¥, d x); • an adjacent clique C⬘ in Ꮿ x can be expressed as a modification of C by the element in F C that led to C⬘; and • the label of C can be easily computed with F C and C. We call a clique C in ¥ x terminal if the forward set F C is empty. For a subset A of X, we define the collection max( A) to consist of all sets in ¥ x that are maximal subsets of A; that is, they are proper subsets of A in ¥ x not properly contained in any other proper subset of A in ¥ x. Also, we define the collection min( A) to consist of all sets in ¥ x that are minimal supersets of A; that is, they are sets in ¥ x properly containing A and not properly contained in any other set in ¥ x that properly contains A. Outline of algorithm. A proof of the claims in the descriptions below is given in Appendix B. Input. The sets X and ¥, a reference taxon x in X and a distance d x. Output. G x(¥, d x). We construct the set ¥ x and initialize the algorithm with the clique C : ⫽ A, its forward set F C : ⫽ max(¥ x), and its label L C : ⫽ { x}. Iteratively, we compute the cliques C⬘ in ¥ x of distance at most d x from x in G(¥), together with their forward sets F C⬘ and labels L C⬘. The iteration itself is governed by the concept of a level l of G x(¥, d x): We assign the 0-clique, i.e., the empty set A, to be on level l ⫽ 0, and for l ⱖ 1, a clique C⬘ in ¥ x to be on level l if the distance between the 0-clique and C⬘ is l. Note that no two cliques on the same level can be adjacent (this follows since G(¥) is an isometric subgraph of the #¥-dimensional hypercube (Wetzel, 1995, p. 45), and hence all circuits in G(¥) contain an even number of vertices). Thus, for each edge {C 1, C 2} in G x(¥, d x) the level of the cliques C 1 and C 2 must differ by 1; we orient {C 1, C 2} so that the first clique is at the lower level. The algorithm terminates when either l ⫽ d x or it encounters the terminal level, that is, the level where all the cliques are terminal. We now give an algorithmic description of the main part of the algorithm: Input. The set X, the split system ¥, a reference taxon x in X, and d x. Construct ¥ x Initialization. l ⫽ 0, C : ⫽ A, L C ⫽ { x}, F C ⫽ max(¥ x). Let CliquesOnl denote an internal variable for storing the already constructed cliques.
if l is terminal or d x ⫽ 0, then G x(¥, d x) is an isolated vertex. endif while l is not terminal and l ⱕ d x do CliquesOnl :⫽ A, for each nonterminal clique C on level l do for each set c 0 in F C do compute the clique C⬘ : ⫽ (C 艛 {c 0}) ⫺ min(c 0) and join C and C⬘ by an edge. if C⬘ ⰻ CliquesOnl, then CliquesOnl :⫽ CliquesOnl 艛 {C⬘}, F⬘(c 0) ⫽ F⬘C(c 0) : ⫽ { f 僆 (FC ⫺ {c0}) : f 艚 c0 ⫽ A and c0 債 f }, F⬙(c 0) ⫽ F⬙C(c 0), : ⫽ { f 僆 max(c 0) : f 債 f⬘ for all f ⬘ 僆 F ⬘(c 0) and f 艚 c ⫽ A for all c 僆 C⬘}, F C⬘ : ⫽ F⬘(c 0) 艛 F⬙(c 0), (forward set of C⬘) L C⬘ : ⫽ ( c 僆 C⬘ c) 艚 ( f 僆 FC⬘ f ), (label of C⬘) endif, endfor, endfor, l 3 l ⫹ 1, endwhile.
艚
艚
Output. G x(¥, d x). In Table 2, we illustrate the application of this algorithm to the example given in Table 1 by listing, for every level of the construction, all the vertices encountered on this level, together with their associated cliques, forward sets, and label sets. (When we are presented with a set of weighted splits we can apply a simple modification to this algorithm to construct that subgraph with all vertices whose weighted distance from x is less than d x.) APPENDIX B: VERIFICATION OF THE ALGORITHM Let ¥ be a system of splits of X and x in X. For any clique C in ¥ x, let C denote the extension of C, where C is the collection of all sets in ¥ x that contain some set in C as a subset. Note that A ⫽ A and C 債 C for any clique C in ¥ x. Further, if C is a clique in ¥ x and c 僆 C, then there exists no c⬘ 僆 C ⫺ {c} such that c⬘ 債 c. Hence, C is the set of minimal elements of C, so C 1 ⫽ C 2 if and only if C 1 ⫽ C 2. Recall that two cliques in ¥ x are called adjacent if and only if they correspond to adjacent vertices in G(¥). Let C denote a clique ¥ x, and let c 0 be some set in ¥ x. Then it follows from Dress et al. (1997, Theorem 3.3) that there exists an adjacent clique C⬘ 債 ¥ x with C⬘ ⫽ C 艛 {c 0} if and only if c 0 is a maximal set in ¥ x ⫺ C and c 艚 c 0 ⫽ A for all c 僆 C. In what follows, we explicitly identify the clique C⬘ above. To this end, for any clique C in ¥ x, we define C to be the collection of all the sets c⬘ in ¥ x ⫺ C such that for all c 僆 C the intersection c 艚 c⬘ is nonempty.
6
HUBER, WATSON, AND HENDY
TABLE 2 The Computations Used to Derive the Buneman Graph for the Splits of Table 1 Level l
Vertex v
l ⫽ 0
v1
l ⫽ 1
v2 v3 v4 v5
:⫽ :⫽ :⫽ :⫽
l ⫽ 2
v6 v7 v6 v8 v7 v8 v9 v9
: ⫽ v 2(5) : ⫽ v 2(7) ⫽ v 3(3) : ⫽ v 3(7) ⫽ v 4(3) ⫽ v 4(5) : ⫽ v 4(6) ⫽ v 5(7)
v 1(3) v 1(5) v 1(7) v 1(6)
Clique C ⫽ Cv
Forward set F C
Label LC
A
{3, 5, 6, 7}
{A}
{3} {5} {7} {6}
{5, 7} {3, 7} {3, 5, 6} {7}
{E} {F} A {B}
{3, 5} {3, 7}
{1, 2, 7} {5}
A {D}
{5, 7}
{3}
A
{6, 7}
A
{C}
{2} {1} {3, 5, 7}
{7} A {2, 4}
{J} {K} {G}
l ⫽ 3
v 10 v 11 v 12 v 12 v 12
l ⫽ 4
v 13 : ⫽ v 10(7) v 13 ⫽ v 12(2) v 14 : ⫽ v 12(4)
{2, 7}
{4}
A
{4}
{2}
{H}
v 15 : ⫽ v 13(4) v 15 ⫽ v 14(2)
{2, 4}
A
{I}
l ⫽ 5
: ⫽ v 6(2) : ⫽ v 6(1) : ⫽ v 6(7) ⫽ v 7(5) ⫽ v 8(3)
Note. The cliques, forward sets, and labels of the vertices of the Buneman graph G(¥) as computed by the algorithm (using the indexing scheme given for ¥ x in Table 1). For a vertex v in G(¥), the adjacent vertex obtained by crossing an edge represented by c 0 僆 F Cv is denoted by v(i 0), where i 0 is the index of c 0. For vertex v 3 constructed on level l ⫽ 1, the adjacent vertex obtained by crossing an edge labeled by the split indexed by i 0 ⫽ 7 is v 8 : ⫽ v 3(7). Vertex v 8 has the empty set as its label set and, thus, is latent. Vertex v 8 and vertex v 4 are incident, too. The edge that we have crossed in this case corresponds to the split indexed by i 0 ⫽ 5. If, as in Fig. 1, we choose x : ⫽ A and d A :⫽ 2, then the vertices and adjacencies given in level l 0, l 1, and l 2 are exactly those that constitute the local network G x(¥, d x).
Hence, for any c 0 僆 C, it follows that (C 艛 {c 0} ⫺ min(c 0)) is a clique in ¥ x. Now, let C be a clique in ¥ x and c 0 僆 C be some arbitrarily chosen element. We claim that min(c 0) 債 C if and only if c 0 is maximal in C: Suppose to the contrary that min(c 0) 債 C but c 0 is not maximal in C. Then there exists some set c 1 僆 ¥ x ⫺ C that properly contains c 0. Without loss of generality, we can assume c 1 to be minimal with respect to c 0 債 c 1. Hence, c 1 僆 min(c 0) 債 C, by assumption, but this contradicts the choice of c 1. Conversely, suppose that c 0 is maximal in C. Then every c 1 僆 min(c 0) must be contained in C too, since c 1 ⰻ C along with the assumed nonemptiness of the intersection of c 0 with any element c 僆 C would imply that c 1 is contained in C, which contradicts the assumed maximality of c 0 in C. /
Lemma 1. If C is a clique in ¥ x and c 0 is a maximal set in C, then C⬘: ⫽ (C 艛 {c 0}) ⫺ min(c 0) is the only clique in ¥ x with C 艛 {c 0} ⫽ C⬘. In particular, C and C⬘ are adjacent. Proof. We showed above that C⬘ is a clique. Choose c 僆 C⬘, then there exists some c⬘ 僆 C⬘ with c⬘ 債 c. If c⬘ ⫽ c 0, then c⬘ 債 C and so c 僆 C. If c⬘ ⫽ c 0, then we can choose some c⬙ 僆 ¥ x with c 0 債 c⬙ minimal and c⬙ 債 c. Hence c⬙ 僆 min(c 0) 債 C, by assumption. It follows that there exists some c 僆 C such that c 債 c⬙. Thus, c 債 c and so c⬘ 僆 C. Thus, C⬘ 債 C 艛 {c 0}. Conversely, suppose that c 僆 C 艛 {c 0}. If c ⫽ c 0, then c 僆 C⬘. Otherwise, if c ⫽ c 0, then there exists some c⬘ 僆 C such that c⬘ 債 c. In this case, if c⬘ 僆 C ⫺ min(c 0), then c⬘ 僆 C and so c is an element in C. If c⬘ 僆 min(c 0), then c 0 is also a subset of c and so c must also be an element in C⬘. Hence C 艛 {c 0} 債 C⬘ and C 艛 {c 0} ⫽ C⬘. The uniqueness of C⬘ and the adjacency of C and C⬘ are trivial. ■ Given some clique C in ¥ x, we define the forward set of C, denoted by F C, to be the set of all maximal elements in C. Note that for C ⫽ A, we have F A ⫽ max(¥ x) as A ⫽ ¥ x ⫺ A ⫽ ¥ x. Hence, the initialization of the forward set of our reference vertex is well defined. For any clique C in ¥ x and any maximal element c 0 in C recall that we defined F⬘(c 0) ⫽ F⬘C(c 0) to be the collection of all elements in (F C ⫺ {c 0}) that have a nonempty intersection with c 0 but do not contain c 0. Also, we denoted by F⬙(c 0) ⫽ F⬙C(c 0) the set of elements of max(c 0) that are (a) not properly contained in any set in F⬘(c 0) and (b) have a nonempty intersection with each set in C⬘: ⫽ (C 艛 {c 0}) ⫺ min(c 0). Proposition 1. Let C denote a clique in ¥ x with C ⫽ A. Then for the clique C⬘ : ⫽ (C 艛 {c 0}) ⫺ min(c 0) with c 0 a maximal element in C, the forward set F C⬘ equals the set F⬘(c 0) 艛 F⬙(c 0), with F⬘(c 0) and F⬙(c 0) as defined above. Proof. First, suppose that f 僆 F⬘(c 0); that is, c 0 債 f, f is a maximal element in C, and f 艚 c 0 ⫽ A. Then it follows immediately that f 艚 c ⫽ A for all c 僆 C⬘. Also, f ⰻ C⬘ for, otherwise, f 僆 C⬘ and so there exists some c 僆 C⬘ with c 債 f. Yet, c 0 債 f and, so, f ⰻ C follows since C⬘ ⫺ {c 0} 債 C, a contradiction. We claim that f is, indeed, a maximal element in C⬘, which immediately indicates that f 僆 F C⬘ and so F⬘(c 0) 債 F C⬘ must hold: Suppose that there exists some set f ⬘ in C⬘ with f 債 f⬘. Then f⬘ is a set in ¥ x ⫺ C⬘ and for all c 僆 C⬘, we have c 艚 f⬘ ⫽ A. We note that in view of C 債 C⬘ 艛 min(c 0), we have c 艚 f ⫽ A, for all c 僆 C. By Lemma 1, C 債 C 艛 {c 0} ⫽ C⬘ and so f ⬘ 僆 ¥ x ⫺ C. Hence, f ⬘ 僆 C, which contradicts /
7
LOCAL PHYLOGENETIC NETWORKS
the assumed maximality of f in C. Thus, f 僆 F C⬘, as claimed. Next suppose that f 僆 F ⬙(c 0) but that f ⰻ F C⬘. Since f 僆 max(c 0) and C⬘ is a clique, we cannot have f 僆 C⬘. By assumption, f 艚 c ⫽ A for all c 僆 C⬘ and so we have, indeed, f 僆 C⬘. Hence, there must exist a maximal set f⬘ 僆 C⬘ that properly contains f. We claim that f ⬘ ⫽ c 0 and that f⬘ is contained in F C: By Lemma 1, the assumption that f⬘ 僆 C⬘ 債 ¥ x ⫺ C⬘ implies that f ⬘ 僆 ¥ x ⫺ C. Clearly, for any c 僆 C ⫺ min(c 0) 債 C⬘ we have f⬘ 艚 c ⫽ A, and for any c 僆 min(c 0), we have A ⫽ f 艚 c 債 f⬘ 艚 c in view of f 僆 max(c 0). Hence, f⬘ 僆 C. Observe that, the same argument implies f ⬘ ⫽ c 0. We next note that f⬘ 僆 F C for, otherwise, we could find some f⬙ 僆 F C with f⬘ 債 f⬙. We claim that then f⬙ 僆 C⬘. In view of A ⫽ f⬘ 艚 c 債 f ⬙ 艚 c for all c 僆 C⬘ it suffices to establish that f⬙ ⰻ C⬘: Suppose to the contrary that f⬙ 僆 C⬘. Since f ⬙ 僆 C it follows that c 0 債 f. In view of the discussion preceding Lemma 1 and the assumption that c 0 is maximal in C it follows that f⬙ 僆 C, which contradicts the maximality of f⬘ 僆 C as, clearly, f 債 f⬙. Hence, f ⬘ must be contained in F C, as claimed. Since, f⬘ ⫽ c 0 and neither A ⫽ c 0 艚 f⬘ nor c 0 債 f ⬘ holds, f⬘ 僆 F ⬘(c 0) follows, which is a final contradiction in view of f 債 f⬘ and f 僆 F ⬙(c 0). Hence, f⬘ 僆 F C⬘ and, thus, F⬙(c 0) 艛 F⬘(c 0) 債 F C⬘ follows. To see the reverse inclusion, suppose that f is a maximal element in C⬘. Then, in particular, c 0 艚 f ⫽ A and, so, c 艚 f ⫽ A holds for all c 僆 min(c 0). Since C ⫺ min(c 0) 債 C⬘, it follows that c 艚 f ⫽ A for all c 僆 C. By Lemma 1, f 僆 ¥ x ⫺ C⬘ 債 ¥ x ⫺ (C 艛 {c 0}) ⫽ (¥ x ⫺ C) 艚 (¥ x ⫺ {c 0}). Hence, f 僆 C ⫺ {c 0}. We next consider the two cases f 債 c 0 and f 債 c 0. First, suppose that f 債 c 0. Let g denote a maximal set in C that properly contains f. Then g ⫽ c 0 and so g 僆 (¥ x ⫺ C) 艚 (¥ x ⫺ {c 0}) ⫽ ¥ ⫺ C⬘, by Lemma 1. Since A ⫽ c 艚 f 債 c 艚 f⬘ for all c 僆 C⬘, we have g 僆 C⬘, a contradiction. Hence, f 僆 F ⬘(c 0). Next, suppose that f 債 c 0. We claim that f is a maximal subset of c 0: Suppose to the contrary that there exists some f⬘ 僆 max(c 0) that properly contains f. Since f is assumed to be maximal in C⬘, we have min( f) 債 C⬘, by discussion preceding Lemma 1. It follows that there exists some c in C⬘ such that c 債 f⬘ 債 c 0. Yet, C⬘ is a clique and so c ⫽ c 0. Hence, f⬘ ⫽ c 0, which is impossible as f⬘ is assumed to be properly contained in c 0. Thus, f 僆 max(c 0), as claimed. We conclude this case by showing that f 僆 F⬙(c 0). To this end, it suffices to show that f 債 f⬘ for all f⬘ 僆 F⬘(c 0): Suppose that there exists some f⬘ 僆 F⬘(c 0) with f 債 f⬘. Then f ⬘ 僆 (¥ x ⫺ C) 艚 (¥ x ⫺ {c 0}) ⫽ ¥ x ⫺ C⬘, by the definition of C⬘ and Lemma 1. Yet, for all c 僆 C⬘, we have A ⫽ c 艚 f 債 c 艚 f⬘ and, so, f⬘ 僆 C⬘, which contradicts the maximality of f in C⬘ as f 債 f⬘.
Thus, F C⬘ 債 F⬘(c 0) 艛 F⬙(c 0) and so F C⬘ ⫽ F ⬘(c 0) 艛 F⬙(c 0), as claimed. ■ Combining Lemma 1 and Proposition 1, it follows for an arbitrary split system ¥ of X and any reference element x 僆 X that the graph, which is returned by our algorithm, is G x(¥, d x). In particular, for d x ⫽ #¥ it is isomorphic to the Buneman graph G(¥). It remains to specify the labeled and latent vertices of G(¥). To this end, note that it follows immediately from Dress et al. (1997) that a vertex v (with corresponding clique C ⫽ C v and forward set F ⫽ F C) of the Buneman graph is latent precisely if the intersection L ⫽ L C :⫽
/
/
艚
c 僆 C
c艚
艚
f
fⰻC
is empty. If L is nonempty, then the labels of v are given by L. Clearly, c 僆 C c ⫽ c僆C c and for any f ⰻ C and any c⬘ 僆 C the assumption c⬘ 艚 f ⫽ A implies c⬘ 債 f . Moreover, since for any f, f ⬘ 僆 C with f ⬘ 債 f, the inclusion f 債 f ⬘ trivially holds, we obtain
艚
L⫽
艚
c僆C
艚
c艚
艚
f .
f僆F
ACKNOWLEDGMENTS The authors thank Vincent Moulton for many useful discussions and Hans-Ju¨rgen Bandelt, Michael Langton, David Penny, and the two referees for helpful comments. The first author also thanks the Institute of Fundamental Sciences, Massey University, for hosting her and the New Zealand Marsden Fund for its support.
REFERENCES Anderson, S., Bankier, A. T., Barrell, B. G., De Brun, M. H. L., Coulson, A. R., Drouin, J., Eperon, I. C., Nierlich, D. P., Roe, B. A., Sanger, F., Schreier, P. H., Smith, A. J. H., Staden, R., and Young, I. G. (1981). Sequence and organization of the human mitochondrial genome. Nature 290: 457– 465. Bandelt, H.-J. (1991). Phylogenetic networks. Verh. Naturwiss. Ver. Hamburg 34: 51–57. Bandelt, H.-J., and Dress, A. (1992). Split decomposition: A new and useful approach to phylogenetic analysis of distance data. Mol. Phylogenet. Evol. 1: 242–252. Bandelt, H.-J., Forster, P., Sykes, B. C., and Richards, M. B. (1995). Mitochondrial portraits of human populations using median networks. Genetics 141: 743–753. Barthelemy, J. (1989). From copair hypergraphs to median graphs with latent vertices. Discrete Math. 76: 9 –28. Barthelemy, J., and Guenoch, A. (1991). “Trees and Proximity Relations.” Wiley, New York. Buneman, P. (1971). The recovery of trees from measures of dissimilarity. In “Mathematics in the Archaeological and Historical Sciences” (F. Hodson, Eds.), pp. 387–395, Edinburgh Univ. Press, Edinburgh. Crandall, K. A., and Templeton, A. R. (1993). Empirical tests of some predictions from coalescent theory with applications to intraspecific phylogeny reconstruction. Genetics 134: 959 –969.
8
HUBER, WATSON, AND HENDY
Dopazo, J., Dress, A., and von Haeseler, A. (1993). Split decomposition: A technique to analyze viral evolution. Proc. Natl. Acad. Sci. USA 90: 10320 –10324. Dress, A., Hendy, M. D., Huber, K., and Moulton, V. (1997). On the number of vertices and edges in the Buneman graph. Ann. Combin. 1: 329 –337. Eigen, M. (1987). “Stufen zum Leben.” Piper, Munich. Eigen, M., and Winkler-Oswatitsch, R. (1990). Statistical geometry on sequence space. Methods Enzymol. 183: 305–530. Eigen, M., Winkler-Oswatitsch, R., and Dress, A. (1988). Statistical geometry on sequence space—A method of quantitative sequence analysis. Proc. Natl. Acad. Sci. USA 95: 5913–5917. Excoffier, L., and Smouse, P. E. (1994). Using allele frequencies and geographic subdivision to reconstruct gene trees within a species: Molecular variance parsimony. Genetics 136: 343–359. Haszprunar, G. (1992). How reliable are computer parsimony programs for systematics? An addition. Z. Zool. Syst. Evol. 30: 244 –248. Hendy, M. D., Foulds, L. R., and Penny, D. (1980). Proving phylogenetic trees minimal with l-clustering and set partitioning. Math. Biosci. 51: 71– 88. Hendy, M. D., Penny, D., and Foulds, L. R. (1978). Identification of phylogenetic trees of minimal length. J. Theor. Biol. 71: 441– 452. Hendy, M. D., Penny, D., and Steel, M. A. (1994). A discrete Fourier analysis for evolutionary trees. Proc. Natl. Acad. Sci. USA 91: 3339 –3343. Huber, K. T., Moulton, V., Lockhart, P., and Dress, A. (1999). “Lite Median Networks: A Technique for Studying Plant Speciation,” Preprint, Mid Sweden University. Lester, R. N., Roberts, P. A., and Lester, C. (1983). Analysis of immunotaxonomic data obtained from spur identification and absorption techniques. In “Proteins and Nucleic Acids in Plant Systematics” (U. Jensen and D. E. Fairbrothers, Eds.), pp. 275–300, Springer, Berlin/Heidelberg.
Lockhart, P. J., Meyer, A. E., and Penny, D. (1995). Testing the phylogeny of swordtail fishes using split decomposition and spectral analysis. J. Mol. Evol. 41: 666 – 674. Lockhart, P. J., Steel, M. A., Barbrook, A. C., Huson, D., and Howe, C. J. (1998). A covariotide model describes the evolution of oxygenic photosynthesis. Mol. Biol. Evol. 15: 1183–1188. Meacham, C. A. (1984). Evaluating characters by character compatibility analysis. In “Cladistics: Perspectives on the Reconstruction of Evolutionary History” (T. Duncan and T. F. Stuessy, Eds.) pp.135–151. Columbia Univ. Press, New York. Page, R. D. M., and Holmes, E. C. (1998). “Molecular Evolution: A Phylogenetic Approach,” Blackwell Science, Oxford. Rozas, J., and Agauade, M. (1993). Transfer of genetic information in the rp49 region of Drosophila subobscura between different chromosomal gene arrangements. Proc. Natl. Acad. Sci. USA 90: 8083– 8087. Stine, O. C., Dover, G. J., Zu, D., and Smith, K. D. (1992). The evolution of West African populations. J. Mol. Evol. 34: 336 –344. Swofford, D. L., and Olsen, G. J. (1990). Phylogeny reconstruction. In “Molecular Systematics” (D. M. Hillis, C. Mortiz, and B. K. Mable, Eds.), pp. 411–501, Sinauer, Sunderland, MA. Templeton, A. R., Crandall, K. A., and Sing, C. F. (1992). A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. III Cladogram estimation. Genetics 132: 619 – 633. von Haeseler, A., and Churchill, G. A. (1993). Network models for sequence evolution. J. Mol. Evol. 37: 77– 85. Watson, E. E., Forster, P., Richards, M., and Bandelt, H.-J. (1997). Mitochondrial footprints of human expansion in Africa. Am. J. Hum. Genet. 61: 691–704. ¨ hnlichkeitsbezieWetzel, R. (1995). “Zur Visualisierung abstrakter A hungen,” Ph.D. thesis, Bielefeld.