Journal of Theoretical Biology 472 (2019) 36–45
Contents lists available at ScienceDirect
Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/jtb
Testing the rogue taxa hypothesis for clustering instabilityR Amanda M. Saunders a,∗, Daniel Ashlock b, Steffen P. Graether c a
Bioinformatics Program and the University of Guelph, Canada Department of Mathematics and Statistics, University of Guelph, Canada c Department of Molecular and Cellular Biology, University of Guelph, Canada b
a r t i c l e
i n f o
Article history: Received 29 October 2018 Revised 1 April 2019 Accepted 4 April 2019 Available online 4 April 2019 Keywords: Phylogenetics Hierarchical clustering Bioinformatics Bootstraping Clustering stability
a b s t r a c t There have been longstanding concerns about the stability of hierarchical clustering. A suggested explanation for this instability is the presence of “rogue taxa”, i.e. taxa whose removal from a data set can apparently restore stability. In this study, the rogue taxa hypothesis is tested by partitioning a large data set into many smaller ones and checking for rogue behavior. The checking was performed with a standard hierarchical clustering algorithm and with a novel algorithm designed to have greater stability. It was found that rogue taxa cannot reasonably be said to exist because the status of being a rogue taxon depends on the data partition in which the taxon is embedded. In addition to the choice of data used, the choice of algorithm and algorithm parameters can have a large effect on the degree to which a taxon appears rogue. Instability in hierarchical clustering can be increased by problematic data points, but the status of data points being problematic depends not on their biological antecedents, but on their position in the local geometry of the data. The results of this study strongly suggest that instability in traditional hierarchical clustering routines is primarily a problem with the algorithm design. © 2019 Elsevier Ltd. All rights reserved.
1. Introduction
1.1. Rogue Taxa
A fairly large number of different researchers (Ashlock et al., 2009; Aberer and Stamatakis, 2011; Ting et al., 2014; Saunders, 2017; Saunders et al., 2018; Westover et al., 2013) have detected that hierarchical clustering algorithms suffer from a type of instability in which adding or deleting even a single taxon from a data set can cause a substantial rearrangement of the resulting tree produced by the algorithm. An algorithm with significantly higher stability, bubble clustering was derived in Saunders (2017) and is presented in Saunders et al. (2018). The critical difference between the stable and unstable algorithms is that traditional algorithms create intermediate points, representing internal nodes of the taxonomy tree, on-the-fly. This creation of intermediate data objects can go very differently if one of the taxa involved is missing or if a new one is added. Bubble clustering gathers data on the similarity of taxa first and only then derives the tree from the similarity data without creating intermediate data objects.
In this study, we test an explanation for the instability of hierarchical clustering algorithms, the idea of rogue taxa (Aberer and Stamatakis, 2011; Ting et al., 2014). This test is performed by creating multiple data sets with common taxa from a larger data set and then checking the contributions to clustering instability of each taxon in multiple contexts. A rogue taxon is defined to be a species that is responsible for clustering instability. This assignment of responsibility to data points embeds the hypothesis that the problem is bad data, not a bad algorithm. These rogue taxa have three different behavioral types: “friendly” rogues fix an incorrect taxonomy, “evil” ones disrupt a correct taxonomy, and “crazy” rogues change a taxonomy to a different but still incorrect taxonomy (Westover et al., 2013). Our study demonstrates that the problem is not bad data points, and so implicates the algorithm as the source of the instability. In particular, no taxon remains “rogue” across the different partitions of the larger data set that happen to contain it; contributions to instability are strongly dependent on which subset of the larger data set a taxon resides in. Suppose we take a taxonomy that changed substantially after a new species was added. This means that the species added exhibited the kind of behavior that defines a taxon to be rogue. Testing such a data set for the presence of rogue taxa must
R The authors thank the University of Guelph and Natural Science and Engineering Research Council of Canada (NSERC) for supporting this work. ∗ Corresponding author. E-mail addresses:
[email protected] (A.M. Saunders),
[email protected] (D. Ashlock),
[email protected] (S.P. Graether).
https://doi.org/10.1016/j.jtbi.2019.04.002 0022-5193/© 2019 Elsevier Ltd. All rights reserved.
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
succeed, though the taxon identified as problematic may not be the one that was last added to the data set. If the problem really is a bad data point, then rogue taxa are a valuable hypothesis that means we can continue to use our current taxonomic algorithms by leaving out a few misbehaving sequences. There have been some tools designed specifically for this purpose; one of these is outlined in Aberer et al. (2012). As noted in Westover et al. (2013), the majority of rogue identification does not determine which kind of rogue the taxon is before pruning, resulting in friendly rogues being discarded along with the evil and crazy ones. Of course knowing the type of a rogue taxon (friendly, crazy, or evil), requires that the actual taxonomy be available from other sources, making the distinction problematic when using algorithmic methods to generate a taxonomy. In any case, the existence of taxa that can destabilize a tree building algorithm are problematic, even if they are friendly. The rogue taxa hypothesis, if correct, means that there are unclassifiable taxa that must be ignored. We provide evidence in this study that this situation does not hold and that, in fact, all taxa may be classified. Since the creation of bootstrapping (Felsenstein, 1985), there have been indications that neighbor joining based methods can produce poor quality results. When bootstrapping was new it was criticized for producing very low confidence values and was suspected of being biased towards low numbers (Efron et al., 1996). The original bootstrapping algorithm does not in fact bias the confidence values downwards; its only potential flaw is that it assumes that features are independent (Efron et al., 1996). These observations indicate that there have been concerns about the quality of the hierarchical trees being generated, but there seems to be a failure to test the hypothesis that the algorithm being used to build the trees is appropriate. This occurs even when, to attain the best consensus trees, sometimes 25% or more of the taxa needed to be pruned (Aberer and Stamatakis, 2011). Large scale phylogenies suffer from an already sparse data pool, given that so few species have been molecularly characterized relative to the total number of species (Parfrey et al., 2006), and that having to trim even more to try and build a consensus significantly impairs the ability to construct large scale trees. This lack of data of course also interacts with the sub-setting issue previously discussed. Because roguishness is subset-dependent, the order in which taxa are pruned may impact which species are found in the resultant consensus trees, potentially even affecting the topology of the resultant tree. Previously, several different hypotheses for the origin of roguishness of a taxon have been proposed: improper alignment, horizontal gene transfer, or erroneously labelled sequences (Aberer and Stamatakis, 2011). These lines of reasoning facilitate the simple removal of the problematic taxa under the assumption that they are causing problems due to being, in some way, incorrect, and therefore responsible for errors in the deduced relationships between taxa. The results shown in this paper suggest that there are other factors not related to an individual data point: the subset of data and the algorithm that are being used to generate the taxonomy. This emphasizes the importance of examining the methodologies in addition to examining the original data for errors. The work here highlights the potential danger of assuming that any roguishness is due entirely to abnormalities in the data. The use of a data set partitioned into multiple test sets in this study permits a test of the hypothesis that the problem may be intrinsic to the clustering algorithm, rather than a property of individual taxon within the data. If this is correct, the problem is not one or another data point, but rather it is the relative geometric relation of the data points. The results support the view that instability is conditional relative to companion data points, rather than an intrinsic property of a taxon.
37
1.2. Measuring instability A prerequisite to testing an alternative to the rogue taxa hypothesis is to specify a means of detecting rogue taxa. The core of the instability detector is a way to place a distance measure on the set of all trees with the same taxa for leaves. In biology, a subtree of a taxonomy, itself displayed as a tree, is called a clade. For any two taxa used to build a tree, there is the smallest sub-tree that contains them. We call this subtree the minimal containing clade (MCC) of the two taxa. If there are n taxa, there are 2n pairs of taxa and so 2n minimal containing clades. The size of a clade is defined to be the number of leaves in its subtree or, equivalently, the number of taxa in the clade. In Ashlock et al. (2009) it was shown that the map from a tree to its corresponding vector of sizes of minimal containing clades is an injection of the space of trees into Euclidean space. From this, it was demonstrated that the standard (Euclidean) distance between minimal containing clade vectors places a metric structure on the space of trees. Since this distance measure is based on the trees embedded hypothesis about how closely related each pair of taxa are, it is ideal for detecting differences between taxonomies of a set of species. This distance between trees is called the minimal containing clade distance or MCC-distance. The stability measure used in this study is based on the stability of trees after the addition or removal of a single taxon. We modify the instability measure from Ashlock et al. (2009) to obtain this measure, where the instability induced by removing a single taxon is the distance between the two trees based on the different techniques that removed the one taxon. If we have an existing tree, then we may simply remove a taxon and smooth over the point of removal. For the other removal method, we delete that taxon from the data set, and then rerun the tree-building algorithm. Both methods produce a new tree based on the original taxa, with the exception of the removed taxon. These two trees are called the removal tree for taxa A, RA , and the rebuilding tree for taxa A, BA . An example can be seen in Fig. 1. The stability measure from the earlier work used the average distance, over all taxa in a data set, between the removal and rebuilding trees. In this study, we use the individual distances d(Rx , Bx ) for a taxon x, where d( · , · ) is the MCCdistance between trees. This permits us to measure the individual contribution of a taxon to instability when clustering a data set. It is necessary to defend the MCC-distance as a good choice for measuring instability. The MCC-distance is, in fact, an excellent metric for measuring tree stability when that tree is a taxonomy because it is based entirely on how close trees place each pair of taxa. This size of the minimal containing clade of a pair of taxa represents the hypothesis embedded in the tree about how closely related those taxa are. A larger minimal containing clade indicates less closely related taxa. This means that the MCC-distance responds to, and responds only to, disagreements between trees about the relatedness of pairs of taxa. Other distance measures between trees exist, but these are often problematic to compute or do not incorporate relatedness of pairs of species as transparently. An edit metric (Gusfield, 1997) on trees is typically difficult to compute quickly. Two intricate distance measures on phylogenetic trees, based on an association of trees with orthants of Euclidean space, appears in Owen (2008). The derivation provides an efficient algorithm for computing the distance between trees, but the resulting distance measure does not appear to yield the same quality of information about relatedness of pairs of taxa. While it may be that a minimal editing criterion is an excellent way to induce a metric, it does not yield information as directly relevant to relatedness of taxa. In
38
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
Fig. 1. Exemplary rearrangement of an hclust() tree when a taxa is deleted. An example of a rebuilding tree constructed with sample 48 removed (top) and a removal tree constructed with all the data and the leaf for sample 48 removed (bottom) from Saunders (2017). This example also can be used to illustrate the major differences in leaf relationship when a single point is removed. The point removed for these trees was chosen randomly.
any case, the MCC distance can be computed with quadratic time complexity, proportional to the number of pairs of taxa, and so is acceptably fast. 2. Methods The experiments, when performed, have two distinct phases. The first phase chooses the different subsets of the master data set while the second tests each subset for the presence of rogue taxa. Choosing a collection of subsets of the master data set is itself a non-trivial task performed with an evolutionary algorithm (Ashlock, 2006). Once the collection of subsets of the master data set are selected, the rogue status of their members must be computed and compared.
2.1. Selecting the subsets In order to test the hypothesis that there are rogue taxa that disrupt the clustering, a collection of 75-member subsets of the classical 150-member iris morphology data set (Fisher, 1936) were created in a manner that ensured they were substantially different from one another. Making the subsets different from one another was accomplished by forcing them to be evenly distributed in the space of the subsets using the algorithm described below. The iris data set has fifty collections of measurements from each of the three species. The subsets chosen picked 25 of the records for each flower species. A technique called point packing (Ashlock and Graether, 2016) was used to choose the collections of subsets of 75 flowers in
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
a manner that meant any two subsets would disagree with one another on at least 70 flowers. This means that the shared taxa within any two of the subsets are relatively small in comparison to, for example, the results of selecting subsets of the iris data set uniformly at random. The point packing algorithm chooses subsets that are well spread out through the space of possible subsets. The subset selections were encoded as vectors of 150 ones and zeros, where one indicated the presence of a record in the subset and zero its absence. The evolutionary algorithm from (Stoodley et al., 2018) was modified to work in the metric space of binary vectors under the Hamming metric. The Hamming metric counts the number of positions where two binary vectors of the same length disagree. The evolutionary algorithm was tuned to pack (Thompson, 1984) a relatively large number of subset selections into the space of 75 member subsets distributed equally among the species of flowers, given the requirement that the subsets disagree on 70 flowers. This creates a highly diverse collection of subsets. The minimum disagreement of 70 flowers was chosen because it is relatively large and it yields roughly 50 subsets when the subset selection algorithm is run. Setting the required minimum difference between subsets to 60, for example, finds hundreds of subsets. Approximately fifty collections of subsets of the iris data set are sufficient to test the rogue taxa hypothesis, but not too many as to require huge amounts of computing power to perform a complete analysis on all the subsets.
39
largest possible collection of subsets, given the minimum level of inter-subset disagreement specified. The larger the collection is, the more evenly it must be spread out in the space, according to the theory of sphere-packing (Thompson, 1984). Algorithm 2 The Conway Crossover Operator. Input: One or more (n, d)-codes C1 , . . . Ck , a random material rate R ≥ 0, (integer) and a minimum distance d. Output: An (n, d) − code. Details: Set Q to be the union of C1 , . . . Ck . Generate R random words and add them to Q. Shuffle the set Q into a random order. Apply Algorithm 1 to Q, returning the result.
The Conway crossover operator works as follows: The two larger subsets are combined and augmented with randomly selected subsets; the number of additional subsets added is a tunable parameter of the algorithm. The resulting larger collection of subsets is then shuffled and filtered with Conway’s lexicode algorithm to produce a new collection of subsets. The role of filtration is to restore compliance with the minimum distance rule, which is almost always violated when the two collections of subsets are combined and augmented. The number of additional random subsets used for augmentation during Conway crossover is termed the random material rate. 2.3. Tuning the subset selection algorithm
2.2. Algorithm details Point packing is accomplished with an evolutionary algorithm (Ashlock, 2006). The evolutionary algorithm operates on a population of collections of subsets while constructively maintaining a Hamming distance of 70 (at least seventy members different) between all pairs of subsets in a given collection. The size of this population is an important algorithm parameter. The initial population of collections of subsets, used to begin a run of the evolutionary algorithm, are created by using the Conway’s lexicode algorithm (Algorithm 1) on a collection of one-hundred 75 member subsets selected uniformly at random. This algorithm is detailed in Stoodley et al. (2018). Conway’s lexicode algorithm is a greedy algorithm that filters a collection of subsets to create ones obeying the minimum distance 70 restriction; when applied to a random collections of subsets it usually throws out a substantial majority of them. Algorithm 1 Conway’s Lexicode Algorithm. Input: An alphabet A, a string length n, an ordered subset S⊂An and a minimum distance d Output: An (n, d)-code Details: Initialize an empty set of words R Traversing S in order, if s ∈ S is at distance d or more from all current members of R, add it to R Report R, which is constructively an (n, d)-code
Once the initial population of collections of subsets is selected, the evolutionary algorithm proceeds by repeatedly applying the following population updating technique. Three collections of subsets are picked. The Conway crossover operator (Algorithm 2) is applied to the two larger collections of subsets to produce a collections of subsets in compliance with the minimum distance restriction. This new “child” of the two larger collections of subsets replaces the smallest of three chosen. The algorithm is seeking the
A parameter-setting experiment was used to tune the subset selection algorithm. It used a full factorial design on the population sizes 10, 100, and 10 0 0, and the random material rates 10, 20, 30, and 40. The evolutionary algorithm was run for 40,0 0 0 population updatings and the largest collection of subsets saved as the result of the algorithm. Each set of parameters was run 30 times to assess its behavior. Box plots of the distribution of sizes of collections of subsets located for each parameter setting are shown in Fig. 2. The largest population size turned in very poor performance; the small and intermediate ones were similar, with the small population size being slightly better. The best random material rate was 20 or 30, so the algorithm was re-run for 10 0,0 0 0 updating events with a population size of 10 and random material rates of 20, 25, and 30. The best results were with a random material rate of 25, locating a 56-member collection of subsets of the iris data set that was used to evaluate the rogue taxa hypothesis. 2.4. Protocol for testing the rogue taxa hypothesis We performed hierarchical clustering using the hclust (method = “complete”) algorithm in the R statistical package, and computed the individual contributions to instability by each taxa in all collections where it was a member. Complete linkage was selected as the method option for hclust(), as previous studies indicated that it tended to have better stability than either the average or centroid methods (Saunders, 2017). If rogue taxa are present in the full iris data set, then they should make large contributions to the instability in many of subsets in which they appear. The same subsets were tested for rogue taxa using bubble clustering, a novel hierarchical clustering algorithm (Algorithm 3) explained in detail in Saunders (2017). A bubble is a subset of the data in the form of a sphere in the data space, centered on a data item, with a radius selected uniformly at random between a maximum radius proportional to the largest distance between any two data points and a minimum radius equal to the smallest distances between any two data points.
40
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
Fig. 2. Parameter Study on Data Paritions. Shown are the distribution of sizes of collections of subsets of the iris data set with minimum Hamming distance 70 found for 30 runs of an evolutionary algorithm for each set of algorithmic parameters tested.
Algorithm 3 Bubble clustering - construction of associator matrix. Input: Numeric data matrix with n samples degree of bubble coverage(L) Output: Associator matrix Details: Initialize n x n associator matrix to 0.0 0 0 0 0 0 0 0 01 Repeat Select center (c) from n samples Generate random radius (r) Number in bubble (N) set to 0 For i in n IF dist(i, c) < r Increment N by 1 For i in n For j in n IF dist(i, c) < r AND dist(j, c) < r associator matrix (i, j) + = 1/N Until 10L n iterations have been performed Return: associator matrix
In operation, this algorithm selected 10L bubbles, choosing centers by sequentially selecting each data point as a center. For each pair of points, a connection strength that is initialized to zero is increased by the reciprocal of the number of points in each bubble in which the two points are both members. The resulting matrix of connection strengths is used as the basis for a neighbor joining algorithm in which points or groups of points are connected if the connection between them is the strongest among groups not already connected. The difference between this and a standard neighbor joining algorithm is that all data used to perform the neighbor joining is collected before any joins are made. We argue in 4. Results and Discussion that this is the genesis of the much higher stability of the bubble clustering algorithm. As part of the test of the rogue species hypothesis, bubble clustering with L = 4 with both maximum radii equal to maximum inter-point distance and maximum radii restricted to 50% of the maximum distance between points was used. These values were selected based on the testing performed in Saunders (2017). If the property of being a rogue taxon is intrinsic to a species, rather than an effect depending on the particular geometry of points in the data set and among data points created during neighbor joining, then the taxon should remain rogue when using a different clustering algorithm. Since the roguishness of each sample is being tested multiple times, a formal definition of what would be required for a taxon
to be truly rogue was required. We use the definition that the taxon requires to routinely contribute more instability than the other samples, regardless of the subset tested. A taxon is rogue if and only if its instability contribution is 2.5 times greater than the overall average instability contribution of all taxa across all subsets in greater than 90% of its member subsets. 3. Results and discussion The amount of instability generated by any one sample being removed from the subset of points used to construct a hierarchical tree was found to be dependent on the subset from which it was removed. This is true for all observed subsets and points. As can be seen in Fig. 3, no one individual stands out strongly in terms of its instability contributions. It can be noted that there are more large instability values from the samples belonging to the I. versicolor and I. virginica groups than the I. setosa group. This is potentially due to the fact that the two groups exhibiting higher instability are not as cleanly separated from each other than they are from the I. setosa cluster. These results indicate that the “rogue” nature of a sample or group of samples does not exist independent of other factors, and in fact arises from the particular geometry of the data set. The location of the remaining points being used to construct the tree have a fair degree of influence on the individual contribution of a given point to overall instability measures. The presence of individual outliers for taxa in the two species that are not well separated arises from the fact that those are locations where each individual within that cluster have the potential to affect the order of more possible hierarchical joins. This is simply because the hypervolume of the data space occupied by the I. versicolor and I. virginica samples is a denser part of the data space. Each of the I. versicolor and I. virginica samples have the potential to more strongly affect the join order of the remaining 49 points in that larger grouping, whereas each of the I. setosa samples only have this degree of effect on 24 other points. This idea was tested, and is supported by, the results obtained by re-running the instability calculations with all possible pairs of species, with the outlier behavior almost entirely present in the I. versicolor and I. virginica runs. As the instability measure is tree-size dependent, it is not possible to compare the instability contributions generated when only 50 leaves are tested to those that have 75 leaves. In order to further test the hypothesis that the contributions of individual data points arise from the geometry of the data set, rather than properties particular to the data points themselves, the
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
41
Fig. 3. Instability contributions of individual taxa. Instability contributions of each of the 150 member samples of the iris data set when trees are constructed using hclust(). Red, I. setosa samples; blue, I. versicolor; green I. virginica samples.
I. versicolor and I. virginica samples were each tested for instability in combination with the I. setosa samples to see if this eliminated the higher instability values. Each of the 56 sets of 75 points contains 25 of each type of iris. So the type not under test was excluded from stability testing to test that the interaction between the I. versicolor and I. virginica samples was generating the outlier behavior. This secondary test demonstrated that I. versicolor and I. virginica samples behave similarly to I. setosa when paired, but not when paired with each other, as shown in Fig. 4. At the species level, the t-test found that there is a significant difference between the mean instability values. For I. setosa and either of the other two species: p − value < 2.2e − 16, I. versicolor/I. virginica: p − value = 0.01658. The later difference is only a relatively small one. The biggest difference between mean instability by species is that between I. setosa and I. versicolor, with means of 19972.78 and 22170.45 respectively. Therefore, the species that contributes the least to the overall instability measure is contributing 90% of the instability contributed by the species with the most instability. On the individual level, the difference is larger but still not particularly extreme. The sample that contributes the smallest average instability has 74% of the value of that of the sample with the largest average instability contribution. These differences, while significant, are not sufficient for any of the individual samples to be labelled as rogue. Rogue taxa can usually be defined as taxa that change clades depending on which data is used to construct the tree (e.g. nuclear versus mitochondrial, different gene sequences etc.) or when bootstrapping analysis is being performed. We are measuring the similar and connected trait of how much the rest of the tree changes when the sample is removed from the dataset. For a sample to be a “rogue taxon”, it would need to contribute significantly more to the average instability of the dataset than the other samples. In this case, the selected cutoff is a single sample that, on average, contributes more than 2.5 the instability than the average sample. No single individual within these experiments contributed a markedly different amount of instability. This signifies that the degree of
shift in the proposed relationships between individual samples cannot be blamed on a single sample skewing the results. The instability of the resultant trees is generated by the methodology of the neighbor joining algorithm, and arises from the geometry of the points as they are joined. This instability does not arise from particular points in the data that the algorithm was processing. While it is apparent that, at the species level, there is some interaction between the different clusters as indicated in Fig. 4, it is not enough to generate any extreme average differences in instability. All three species have similar but distinct contributions when all are included, and when one of the two sets with outlier behavior is excluded in certain subsets. Excluding either I. versicolor or I. virginica samples removes the outlier behavior, suggesting that the effect of subset is stronger when there are clusters that are not well separated. This also supports the hypothesis that the geometry of the data set, which changes when new species are added, is creating the problem through its interaction with the data set. Rather than rogue taxa, we have rogue data configurations that exhibit an aggregate behavior. 3.1. Bubble clustering results When the same experiment is performed with bubble clustering as the clustering method, the average instability contribution is greatly reduced; the average instability contribution per sample with hclust (method=complete) is 21249.77, while with bubble clustering and L = 4, the average instability contribution value is 960.8594 for the larger maximum radius and 8279.747 for the smaller maximum radius. All three values are significantly different ( p − value < 2e − 16 in all comparisons), with both parameter sets for bubble clustering producing significantly more stable trees than with hclust (method = complete). The first test with bubble clustering was performed with 104 bubbles with the default maximum radius size set to the largest distance between two points within the data set (Fig. 5A). It
42
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
Fig. 4. Instability contributions of taxons when species are examined in pairs. Shown are box plots giving the distribution of instability contributions of each of the 150 member samples of the iris data set. Each of the images represents the instability contributions when one of (A) I. virginica, (B) I. versicolor, or (C) I. setosa samples are excluded from tree construction using the standard hclust() function in R. Samples are colored as in Fig. 3.
was noted that two samples (45 and 99) had slight and severely, respectively, increased Snip-Rebuild distances. Both samples have mean instability contributions that are 2.5 times greater than the overall instability average. Further investigation of the data in Fig. 5A showed that only sample 99 was rogue in > 90% of the samples. Upon looking at the plot of the iris data (Fig. 6), it appears that the two samples are in the boundary regions of the cluster. This likely did not happen in the hclust() tests due to the way that the algorithm joins the points. Because bubble clustering does not join points until after the measure of associations are calculated, and then does so through a maximal spanning tree, the points at the boundary remain at the boundary, and then affect how two clusters are joined. Neighbor joining, on the other hand, generates points to represent cluster centers, which effectively pull the points towards the center of the clusters. Because the roguishness of sample 99 can be explained by its location within the data space, it is quite possible that if a larger sample of iris petal and sepal sizes were recorded, it would no longer be classified as a rogue taxon. To see if limiting the association measurement to more local topologies would have an effect on the roguishness of these individuals, the maximum bubble size was restricted to 50% of the maximum distance between points (Fig. 5(D)). The average instability contributions increased significantly when association measurements were limited to be more local, but the individuals that could be classified as rogue taxa in the previous experiment did not display an increased contribution. On the contrary, sample 99 contributed the least to the average instability when the radius was restricted. This is more evidence that the status as a rogue taxon is a property of the algorithm itself rather than the taxon.
This observation suggests that, while samples may present as rogue taxa even when bubble clustering is used, it is possible to eliminate the rogue behavior, if not the taxa, by choosing to limit association measures to a more local neighborhood. This comes at a cost of increasing the overall instability of the tree. This also suggests that our test for roguishness, when applied with the bubble clustering algorithm, may be a technique for finding individuals at the edges of clusters. Scanning for taxa that show as being rogue can also be used to test for the appropriateness of the parameters selected. This assumes that the increased contribution to the instability is due to major rearrangements to within-clade positions of the remaining points. The peripheral nature of the problematic points may simply reflect that the major clusters are not joined in the same manner when the peripheral points are excluded. The total instability when using bubble clustering remains significantly lower than when using the most conservative method for hclust(), permitting the conclusion that the increased contribution to instability by a few points does not have an effect on the overall reliability of the tree to predict taxa relatedness. One of the other differences between the results of the testing of the clustering algorithms is that the patterns in the standard deviation of the Snip-Rebuild distances (Table 1). The size of the standard deviation of the instability contributions by species for the hclust (method = “complete”) depended on the species. This is consistent with the previously discussed presence of outlier behavior in the I. versicolor and I. virginica iris data sets. For the various bubble clustering results, the differences in the standard deviations by species are not as large as those exhibited by the hclust (method = “complete”). It should also be noted that bubble clustering standard deviation values tended to be smaller than
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
43
Fig. 5. Impact of changing bubble size on instability contributions of taxons. Instability contributions of each of the 150 member samples of the iris data set using a bubble clustering algorithm with L = 4. The maximum bubble radius was set to (A) 100% of the maximum distance between points, (B) 90% of the maximum distance, (C) 75% of the maximum distance, (D) 50% of the maximum distance. Samples are colored as in Fig. 3.
Fig. 6. Pairwise feature scatterplots of the iris data set. Plot of all pairwise factors in the iris data set. Samples 45 and 99 are shown as red circles.
those from the hclust() algorithm. This indicates that, for bubble clustering, the effect of using different subsamples is smaller. The instance of bubble clustering that produced a standard deviation value closest to those of the hclust() results was the one with a
smaller maximum radius. This may in part be because the maximum bubble radius is dependent on the distance between points in the subset, together with the fact that the more the maximum radius is restricted, the more it affects the stability results.
44
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45 Table 1 The standard deviation of the individual instability contributions, to three significant figures, by sample species for the different algorithms tested. The R parameter is the fraction of maximum inter-point distance used as the upper bound on randomly sampled bubble radii. Iris species (standard deviations) Algorithm
I. setosa
I. versicolor
I. virginica
A1 A2: A2: A2: A2:
2370 1160 1200 1380 2780
6440 1410 1410 1720 2900
5890 1090 1160 1660 2810
L L L L
= = = =
4 4 4 4
R R R R
= = = =
1 0.9 0.75 0.5
A1 - hclust (method = “complete”). A2 - bubble clustering.
Fig. 7 shows which taxa, including their contributions in each of the subsets of the data, that are designated as a rogue taxon for hclust() and L = 4 bubble clustering. Interestingly, hclust() does not detect any rogue taxa, while bubble clustering finds one rogue
taxon and many taxa that contribute visibly to instability – but this is relative to the far smaller degree of instability present in bubble clustering. The fact that the variance of the amount of changes in the snip and rebuild trees is much smaller causes bubble clustering to detect more apparently unstable taxa, but these are relatively harmless because they are within a far more stable phylogenetic environment. It is important to note that the stability of the results is an important measure of the quality of results. If the predicted relationships between samples can drastically change depending on which data points are passed to the taxonomy-generating algorithm, then sound conclusions can only be drawn from the resultant tree by accident - the more unstable the tree, the more arbitrary it is. This means that, without testing the stability of the produced tree, it is easy to overstate the significance of results, which may impair further research efforts if the conclusions drawn from an unstable tree are incorrect. A highly unstable tree will only be correct by chance. Bootstrapping can give a sense of the quality of a tree on the set of taxa it was constructed on, but this does not speak
Fig. 7. Comparison of fraction of roguishness using bubble clustering and hclust(). Percent roguishness by sample for hclust() (top) and bubble clustering L = 4 R = 1 (bottom). A rogue cutoff value of > 0.9 is represented by a horizontal line.
A.M. Saunders, D. Ashlock and S.P. Graether / Journal of Theoretical Biology 472 (2019) 36–45
to the issue of the tree’s ability to retain its basic structure under addition and deletion of taxa. The instability measure used in this study is a method to quantify the precision of the constructed trees. The accuracy and quality measures used to evaluate trees are related to a different attribute, the chance a tree is correct for the data set used; they do not examine stability of the tree after the addition or deletion of taxa. This suggests that the stability measures presented in this study may be a critical addition to the tool kit for numerical taxonomy. While this investigation focused on neighbor joining and bubble clustering methods, the issues described here are relevant to a wide variety of algorithms. Likelihood-based methods, such as maximum likelihood, while using a more sophisticated method for joining together points and clusters, still produce intermediate entities that can contribute to instability. How points are joined in subsequent steps depends on the membership of those intermediate objects, and the intermediate objects have different properties depending on which points they contain. For example, the mean location of a cluster can shift when a point is removed. This repeated over multiple joining steps can generate a cascade of shifts in which groups of points are joined by the tree building algorithm. The intuition that creating intermediate objects is the source of instability was validated to some extent by testing, and gives us a simple criterion for judging algorithms. The nature of the intermediate objects changes, possibly radically, after changes cascade. This means that the generation of intermediate objects is to be avoided. Bubble clustering does so by first generating an associator matrix and then performing a neighbor joining algorithm on that fixed object. This example suggests how to proceed in repairing more sophisticated likelihood or model based algorithms: gather the information first, then build the tree. The use of intermediate objects permits the creation of relatively fast, simple algorithms. We have demonstrated that there is a potential price to this speed and algorithmic simplicity. Bubble clustering opens up new techniques for the generation hierarchical trees, including taxonomy trees. This new technique is more precise because of the stability of the generated trees, and therefore has a significantly better chance of guiding hypotheses that are based on the true underlying structure of the data. When previous neighbor joining algorithms were used, the instability caused by the algorithms resulted in some data to be removed in an effort to make cleaner trees. As shown here, which data points cause the instability is very dependent on which subset of the overall dataset are included in the analysis. The overall effect is which data points that are removed is highly dependent on which data points are present. Also, the “cleaner” tree that is being produced is dependent on the chosen subset and not necessarily related to the underlying hierarchical structure within the data. Bubble clustering offers a much better alternative, and our results suggest that, as with any newly developed algorithm, there is still room for improvement. An R package containing the bubble clustering algorithm and the associated functions is under production and nearing completion; please send enquiries to the first author. This will enable other researchers to benefit from the higher stability of the bubble clustering algorithm. 4. Conclusions and next steps The results of this study, on a single data set, suggest that roguishness is strongly algorithmically dependent and can be affected by the algorithmic parameters selected. This emphasizes the need to select an algorithm and parameters that are tailored to the particular dataset in question. These results also indicate that routine tests for the presence of roguishness and the general stability of
45
algorithm output might be a useful technique for parameter selection, which can increase the quality of constructed trees. An obvious next step is to repeat the tests with different data sets, and attempt to more clearly identify the causes of roguishness in different algorithms. This would be useful in testing the hypothesis, for example, that in bubble clustering roguishness is a trait of points on the boundary between clusters. A classification of various data types into clear behavior patterns is another potential line of inquiry to pursue. The results of this study also indicate that altering the distribution from which bubble radii are selected is a previously unidentified algorithm parameter that can further improve the stability of trees produced via bubble clustering. This notion arises from the observed reduction in roguishness when the bubble clustering algorithm is forced to only examine more local connections when building its matrix of association strengths. At present, bubble radii are chosen uniformly at random. A softer approach would be to use a distribution that favors shorter radii without excluding the larger ones entirely. An agent-case embedding (Ashlock and Lee, 2013) is a tool for deriving relatedness measures simultaneously between algorithms and the data they operate on. The application of this technique, with tree stability as the focal statistic, to a variety of data sets with a variety of hierarchical clustering algorithms would yield information that both provide a classification of types of data sets, and give guidance on the relatedness of different clustering algorithms. This would provide guidance about the appropriateness of a particular algorithm for a particular data set. References Aberer, A.J., Krompass, D., Stamatakis, A., 2012. Pruning rogue taxa improves phylogenetic accuracy: an efficient algorithm and webservice. System. Biol. 62 (1), 162–166. Aberer, A.J., Stamatakis, A., 2011. A simple and accurate method for rogue taxon identification. In: Proceedings of the 2011 IEEE International Conference on Bioinformatics and Biomedicine. IEEE Press, Piscataway NJ, pp. 118–122. Ashlock, D., 2006. Evolutionary Computation for Opimization and Modeling. Springer, New York. Ashlock, D., Graether, S., 2016. Conway crossover to create hyperdimensional point packings, with applications. In: Proceedings of the 2016 Congress on Evolutionary Computation. IEEE Press, Piscataway NJ, pp. 1570–1577. Ashlock, D., Lee, C., 2013. Agent-case embeddings for the analysis of evolved systems. IEEE Trans. Evolut. Comput. 17 (2), 227–240. Ashlock, D., vonKonigslow, T., Schonfeld, J., et al., 2009. Breaking a hierarchical clustering algorithm with an evolutionary algorithm. In: C. D., et al. (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks, 19. ASME Press, pp. 197–204. Efron, B., Halloran, E., Holmes, S., 1996. Bootstrap confidence levels for phylogenetic trees. Proc. Natl. Acad. Sci. 93 (23). 13429–13429. Felsenstein, J., 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39 (4), 783–791. Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7 (2), 179–188. Gusfield, D., 1997. Algorithms on Strings, Trees, and Sequences. Cambridge University Press. Owen, M.A., 2008. Distance Computation in the Space of Phylogenetic Trees Ph.D. thesis. Cornell, Ithica NY. Parfrey, L.W., Barbero, E., Lasser, E., Dunthorn, M., Bhattacharya, D., Patterson, D.J., Katz, L.A., 2006. Evaluating support for the current classification of eukaryotic diversity. PLoS Genetics 2 (12), e220. Saunders, A., 2017. In search of more stable hierarchical trees: devising new algorithms to improve upon the stability of neighbor joining. University of Guelph, Guelph, Ontario, Canada. Saunders, A., Ashlock, D., Houghten, S., 2018. Heirarchical clustering and tree stability. In: Proceedings of CIBCB 2018. IEEE Press, Piscataway NJ, pp. 1–8. Stoodley, M., Ashlock, D., Graether, S., 2018. Data driven point packing for fast clustering. In: Proceedings of the 2016 Congress on Evolutionary Computation. IEEE Press, Piscataway NJ, pp. 1–8. Thompson, T.M., 1984. From Error-Correcting Codes through Sphere Packings to Simple Groups. Cambridge University Press, New York, NY. Ting, K.S., Pisani, D., Creevey, C.J., Wilkinson, M., 2014. Concatabominations: Identifying unstable taxa in morphological phylogenetics using a heuristic extension to safe taxonomic reduction. Syst. Biol. 64 (1), 137–145. Westover, K.M., Rusinko, J.P., Hoin, J., Neal, M., 2013. Rogue taxa phenomenon: a biological companion to simulation analysis. Mol. Phylogenet. Evol. 69 (1), 1–3.