Phylogenomics
N Rodríguez-Ezpeleta and H Philippe, Université de Montréal, Montréal, QC, Canada
© 2009 Elsevier Inc. All rights reserved.
Defining Statement
Introduction
Basic Concepts and Principles of Phylogenetic Inference
Phylogenomic Inference Based on Primary Sequences
Supermatrices and Systematic Errors
Phylogenomic Inference Based on Whole Genome Features
Contribution of Phylogenomics to the Microbial Tree
Conclusion
Further Reading

Glossary
bootstrap analysis A statistical analysis used to evaluate the robustness of a tree. A series of bootstrap pseudo-replicates, generated by randomly sampling positions with replacement from the original data set, are used for tree building. The statistical support for each internal branch is calculated as the percentage of trees in which the corresponding group was recovered.
character Heritable attribute that may differ among species and that can be used for phylogenetic inference (e.g., an amino acid position).
character state Each of the different forms that a character may adopt (e.g., tryptophan).
convergence Independent acquisition of identical character states in evolutionarily distinct lineages.
homology Two characters are homologous if they derive from a common ancestor.
homoplasy Similarity due to convergence and not due to common ancestry.
inconsistency Statistical property according to which a given method will converge toward supporting an incorrect solution with increasing confidence as more data are analyzed.
long branch attraction The phenomenon that under certain conditions two rapidly evolving lineages will cluster together regardless of their true evolutionary relationships.
monophyletic group A group defined by one or more synapomorphies containing an ancestor and all of its descendants. Also called a clade in the cladistic context.
orthology Two homologous characters are orthologous if they originated from a speciation event.
paralogy Two homologous characters are paralogous if they originated from a duplication event.
reversal Evolutionary change from a derived to an ancestral character state.
stochastic error Deviation between a parameter value and its estimation due to the finite length of the data set used for the inference.
synapomorphy Derived character shared by two or more species of a group that originated from their last common ancestor.
systematic error Deviation between a parameter value and its estimation due to a failure of the inference method, in general because of violations of the model assumptions by the data properties.
xenology A gene is said to be a xenologue if it has been acquired by horizontal gene transfer.

Abbreviations
EST expressed sequence tag
HGT horizontal gene transfer
MRP matrix representation using parsimony
SSU rRNA small subunit ribosomal RNA

Defining Statement
Phylogenomics is a new discipline that consists of using genomic data to address evolutionary questions. Here, we describe the variety of phylogenomic inference methods available, discuss their strong and weak points, and summarize their contribution to the resolution of the microbial Tree of Life.
634 Genetics, Genomics | Phylogenomics
Introduction

The only figure in Darwin’s book On the Origin of Species is a phylogenetic tree, which demonstrates the emphasis he placed on this important concept. Reinforcing this view, soon after, Haeckel drew the first universal Tree of Life using morphological and embryological characters. Today, the inference of phylogenetic trees, whose leaves and nodes correspond to extant species and their extinct ancestors, respectively, constitutes a major activity of evolutionary biologists. This task requires the definition of a series of homologous characters, that is, those that are comparable across all studied organisms and assumed to be inherited from a common ancestor. Although widely used in the past, morphology and cell ultrastructure have now largely given way to molecular characters, which offer three main advantages. First, some proteins and genes are comparable across all living organisms, including microorganisms and viruses; second, genomes contain thousands of nucleotides or more, providing a much larger number of characters than a handful of comparable morphological structures; and, third, genes and proteins evolve by the accumulation of small individual changes, facilitating the use of mathematical models for phylogenetic inference.

Because of its considerable degree of conservation across all cellular organisms, the small subunit ribosomal RNA (SSU rRNA) became, in the 1980s, the reference molecular marker for phylogenetic inference. Although it revealed the existence of a third domain of life, the Archaea, this molecule did not provide significant statistical support for many other interesting questions. More surprisingly, conflicting topologies were obtained as alternative markers became available during the late 1990s. Events such as horizontal gene transfer (HGT), which make the history of a gene different from that of the species, may account for these discrepancies, and substantiate Darwin’s statement: “a classification founded on any single character, however important that may be, has always failed.” In addition, single genes generally contain insufficient information to provide firm statistical support, and inferred phylogenies are often affected by sampling error.

Consequently, with the recent advent of genomic data, a new discipline, known as phylogenomics, has appeared. It comprises two complementary branches: the use of genome-scale data to infer the Tree of Life and the study of mechanisms of genome evolution based on various phylogenetic approaches. In this article, we focus on the inference of the Tree of Life. The concepts and methods are similar to those of classical phylogenetic inference. Therefore, a brief introduction to these, aimed at beginners, will precede the description of two phylogenomics-specific sequence-based approaches: supertree and supermatrix. Furthermore, genome data offer the possibility of exploiting other characters such as gene content and order, which calls for new genome-specific methods that will be described thereafter. The major concern of phylogenomics is that systematic errors, which induce tree reconstruction artifacts, are exacerbated in large data sets. Here, we will describe some of the methods applied to detect and overcome these problems. Finally, progress in the resolution of the Tree of Life due to phylogenomics will be briefly reviewed.
Basic Concepts and Principles of Phylogenetic Inference

Inferring phylogenies amounts to recovering monophyletic groups by detecting synapomorphies, which are character states acquired on the corresponding basal branches. Two steps are thus required for tree building: the identification of homologous (comparable) characters and the detection of synapomorphies.

Homology and Orthology

Two sequences are assumed to be homologous, that is, derived from a common ancestral sequence, if the similarity between them is so high that it cannot be explained by mere chance. Because, during the course of evolution, nucleotides (or amino acids) are not only substituted but can also be inserted or deleted, the identification of homologous positions within homologous sequences is not trivial. Multiple alignment algorithms exist for this purpose, but they require the definition of parameters such as the penalty for introducing a gap (an insertion or a deletion), which are difficult to estimate, especially because they are not constant along and/or among the sequences. More problematically, the identification of the optimal alignment calls for prior knowledge of the phylogenetic tree connecting the sequences. For instance, the cost of an insertion will depend on whether two sequences are sister groups. Ideally, alignment and phylogeny should be inferred simultaneously, and remarkable progress has been made in that direction recently. However, the computational burden renders this approach intractable for large-scale analyses. In practice, multiple alignments are usually created progressively following the branching order of a guide tree generated from unaligned data. To avoid biases in phylogenetic inference caused by limitations of the alignment algorithm, it is of prime importance to remove positions that are ambiguously aligned using programs such as GBlocks.

Although homology assessment is an essential initial step in phylogenetic reconstruction, it is not sufficient when the objective is to infer species phylogenies. The sequences used for this purpose also need to be orthologues, that is, derived from speciation events. In contrast to paralogues (which are derived from gene duplication events) and xenologues (which are acquired by HGT), orthologous genes reflect the true evolutionary history of the species (Figure 1). To detect orthologous sequences, several methods have been proposed, most of which are based on sequence similarity (e.g., best reciprocal BLAST hits). These approaches, although reasonably effective in practice, do not guarantee the identification of orthology, because they do not detect xenology, and because sequence similarity does not accurately reflect phylogenetic relationships. In theory, the identification of orthologous sequences would require knowing the organismal phylogeny, which involves a certain amount of circularity because orthologous genes are required to infer species phylogenies. As with the alignment problem, orthology assignment and reconstruction of the species phylogeny should be performed simultaneously. Although methods for doing so have been proposed recently, the computational burden remains prohibitive when a reasonable number of species and sequences are considered. Therefore, the common practice involves using genes that are present in single copy in all or most of the species and/or verifying a posteriori that single-gene phylogenies are not incongruent with the inferred species phylogeny.
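The best-reciprocal-hit criterion mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the similarity scores are invented, no BLAST search is run, and the function names `best_hits` and `reciprocal_best_hits` are hypothetical. In practice the scores would come from a similarity search between two genomes.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) pairs to a similarity score
    (for example a bit score); higher means more similar.
    On ties, the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_ab.items()
        if best_ba.get(gene_b) == gene_a
    )


# Invented scores: genome A searched against genome B, and the reverse.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 200.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a3"): 205.0, ("b2", "a2"): 175.0}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here a2 and a3 both hit b2 best, but b2 hits only a3 best, so a2 is left without a putative orthologue. As the text notes, such pairs remain candidates only: a xenologue or a paralogue whose sister copy was lost can satisfy the same criterion.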
[Figure 1 graphic: (a) gene tree in which an ancestral gene duplicates into copies α and β, each of which is found in species sp1, sp2, and sp3; (b) two trees relating species A, B, C, D, and E.]
Figure 1 Homology, orthology, paralogy, and xenology. (a) Hypothetical evolution of an ancestral gene. Following a gene duplication event, the two resulting copies, α and β, evolve through speciation events. The α genes are orthologous to each other, but paralogous to the β genes, and vice versa. All genes (orthologues and paralogues) present in the figure are homologous to each other because they share the same common ancestor. (b) Misleading effects of horizontal gene transfer (HGT) in phylogenetic inference. Let the tree on the left (species tree) represent the true evolutionary relationships among the species A, B, C, D, and E, and suppose that a gene from species B was horizontally transferred to species D. If the transferred copy in D is used for phylogenetic inference, the inferred tree (gene tree, on the right) places D as a sister group of B, whereas D is in fact more closely related to E than to B.

Methods for Phylogenetic Inference

Once homology (and orthology when one is interested in species phylogeny) is assigned, synapomorphies have to be identified. This central step is challenging because multiple substitutions occurring along the tree for the same character (e.g., a position in the alignment) blur the phylogenetic signal. Homoplasy refers to the situation in which the same character state has been independently acquired by convergence or reversion in two unrelated organisms (Figure 2). Because homoplasy does not reflect common ancestry, its detection constitutes the major difficulty in tree reconstruction. Three main families of methods have been devised to separate the wheat from the chaff (synapomorphy from homoplasy): maximum parsimony, distance-based, and probabilistic (maximum likelihood and Bayesian).

The maximum parsimony method selects the tree that requires the minimum number of substitutions to explain the observed alignment, thus minimizing convergences and reversals (homoplasies). When the degree of divergence among the sequences is small, that is, when homoplasy is rare, maximum parsimony is accurate; however, when the sequences have diverged to such an extent that homoplasy is common, an incorrect tree may be inferred because the number of substitutions is underestimated (Figure 2).

[Figure 2 graphic: left, the true tree relating a, b, c, and d, rooted on the ancestral sequence ATG, with five substitutions marked on its branches (A–G, T–G twice, G–C, and C–G); center, the character matrix below; right, the tree inferred under maximum parsimony, with three substitutions marked (A–G, G–C, and T–G).]

      a  b  c  d
  1   A  A  G  G
  2   T  G  G  G
  3   C  G  G  G

Figure 2 Synapomorphies, convergences, and reversals. Let ‘ATG’ be an ancestral sequence that evolves through nucleotide substitutions to generate sequences a, b, c, and d as illustrated in the tree on the left. Character state G in the first position is a synapomorphy of species c and d because it was acquired on the branch that leads to their ancestor. In contrast, the character states G shared by species b, c, and d in positions 2 and 3 are homoplasies because they were acquired either by convergence (character 2) or by reversion (character 3 in species b) (compare left tree and matrix). The tree on the right illustrates the evolutionary relationships among species a, b, c, and d inferred under maximum parsimony, where underestimation of the number of substitutions leads to a wrong tree.

In distance methods, a multiple alignment is converted into a pair-wise evolutionary distance matrix and the tree that best fits this matrix is calculated with algorithms such as neighbor joining. Here, the most important step is also the correct estimation of the number of substitutions, which is obtained from pair-wise dissimilarities using mathematical models. Because only a small part of the information contained in the multiple alignment (i.e., pair-wise comparisons) is taken into account, distance methods do not handle high levels of homoplasy efficiently.

A consensus now exists that phylogenies should be inferred with probabilistic methods. Contrary to parsimony, they make explicit assumptions about the evolutionary process using detailed models whose parameters (e.g., branch lengths, stationary frequencies of character states, or instantaneous rates of exchangeability) can be estimated from a training set or directly from the data during phylogenetic inference. The basis of these approaches is to find the trees that best explain the alignment; in other words, instead of minimizing the number of changes, they infer the most likely substitution history given a model and a topology. Inherent to these methods is the possibility of using statistical tests to evaluate different evolutionary hypotheses. Two probabilistic philosophies have been used in phylogenetic reconstruction: maximum likelihood and Bayesian inference. Despite their theoretical differences (e.g., inference of the most likely values vs. use of prior probabilities and sampling over the posterior distribution), their results are in practice virtually identical when they use the same evolutionary models. The major limitation of probabilistic methods is the computational time, which is several orders of magnitude higher than for distance and parsimony methods. This explains why, although less accurate, the latter methods are still in use.

Evolutionary Models and Sources of Inconsistency

A phylogenetic inference method is said to be inconsistent if it converges toward an incorrect solution as more data (characters) are considered. Felsenstein brought this issue into the limelight when he demonstrated in 1978 that the maximum parsimony method, commonly used at that time, could be inconsistent when the evolutionary rate is heterogeneous among lineages.
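The way parsimony undercounts substitutions can be made concrete with a small sketch. The block below is an illustration, not code from the article. It uses Fitch's counting algorithm, a standard way to obtain the minimum number of changes on a fixed tree that the text does not name, and applies it to the three-position matrix of Figure 2. The function names and the nested-tuple encoding of trees are assumptions made for this example.

```python
def fitch_site(tree, states):
    """Return (state_set, changes) for one site on a rooted binary tree.

    `tree` is a leaf name (str) or a 2-tuple of subtrees;
    `states` maps each leaf name to its character state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The children can agree on a state: no extra change is needed.
        return shared, left_changes + right_changes
    # The children disagree: at least one change occurred below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment positions."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


# Sequences read column by column from the matrix of Figure 2.
alignment = {"a": "ATC", "b": "AGG", "c": "GGG", "d": "GGG"}

# The three possible groupings of four taxa.
topologies = {
    "(a,b),(c,d)": (("a", "b"), ("c", "d")),
    "(a,c),(b,d)": (("a", "c"), ("b", "d")),
    "(a,d),(b,c)": (("a", "d"), ("b", "c")),
}
scores = {name: parsimony_score(tree, alignment)
          for name, tree in topologies.items()}
print(scores)
```

Under these assumptions the best grouping needs three changes and the two alternatives need four. As far as the branch labels of Figure 2 can be read, the history drawn there involves five substitutions (one convergence at position 2 and one reversal at position 3), so the minimum count of three misses two of them.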
Felsenstein showed that, under certain conditions, long branches can be erroneously grouped together, a phenomenon called long branch attraction. Indeed, long branches are more likely to accumulate multiple substitutions at the same positions, leading to homoplasy, which could overwhelm the genuine phylogenetic signal. Distance methods also turned out to be inconsistent in several cases. In contrast, consistency is guaranteed in the probabilistic framework when the model used accounts for all aspects of the evolutionary process that generated the observed sequences. This is indeed the very reason why probabilistic methods should be preferred.

Systematic errors (i.e., errors caused by the inconsistency of the method) are exacerbated in large data sets, and are therefore a major concern in phylogenomics. In contrast, single-gene phylogenies are often affected by stochastic error because of the limited historical signal present in a small number of homologous characters. As the focus shifts from stochastic to systematic errors, the accuracy of the models of sequence evolution, on which the consistency of probabilistic methods depends, becomes more critical. The first model of sequence evolution, proposed in 1969 by Jukes and Cantor, was extremely simple; it assumed that each position evolves independently and that, except for the expected number of substitutions per branch, the evolutionary process is homogeneous. Current models generally take at least three types of heterogeneity into account: the differences in the instantaneous substitution rate among nucleotide or amino acid pairs, the differences in the equilibrium frequencies of the 4 nucleotides or 20 amino acids, and the rate heterogeneity across sites. These standard models are thought to perform correctly in most cases because these three heterogeneities are among the most important aspects shaping the evolution of the sequences.

But there are other features of the evolutionary process that violate these models and may thus induce systematic errors affecting phylogenetic inference. For example, (1) nucleotide or amino acid stationary frequencies are not constant across the tree, (2) evolutionary rates vary across positions throughout time (heterotachy), (3) positions do not evolve independently of each other, and (4) most positions undergo repeated substitutions among a restricted subset of the amino acid alphabet. Even when sophisticated models of evolution including these features are used, systematic errors can arise because we are still far from being able to model the whole complexity of the evolutionary process. Compared to other approaches, the advantage of probabilistic methods is that one can explicitly look for the properties of the model that are violated by the data and then develop models that account for these properties. Similarly, they allow the use of simulations to test whether the model used correctly describes the true evolutionary process. If it does, then for a given predefined summary statistic the true data should be indistinguishable from the data simulated under the model to be evaluated, and a significant deviation will indicate a model-misspecification problem.
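As a concrete instance of the simplest model mentioned above, the Jukes and Cantor model gives a closed-form correction that turns an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below is illustrative only: the two sequences are invented and the function names are assumptions for this example.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, ignoring gaps and ambiguous bases."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(seq1, seq2)
                if x in "ACGT" and y in "ACGT"]
    if not compared:
        raise ValueError("no comparable positions")
    return sum(1 for x, y in compared if x != y) / len(compared)


def jukes_cantor_distance(p):
    """Expected substitutions per site under the Jukes and Cantor model."""
    if p >= 0.75:
        # Saturation: the sequences look random with respect to each other.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Two invented aligned sequences differing at 2 of 10 positions.
observed = p_distance("ACGTACGTAC", "ACGTTCGTAA")
corrected = jukes_cantor_distance(observed)
print(f"observed p = {observed:.3f}, corrected d = {corrected:.3f}")
```

The corrected distance is always at least as large as the observed one, and the gap widens as p approaches 0.75. That gap is the model's estimate of the substitutions hidden by multiple changes at the same site.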
Phylogenomic Inference Based on Primary Sequences

The most obvious use of genomic data to infer phylogenies is to apply the same methods as in standard single-gene-based phylogenetics. Once putatively orthologous genes are detected, two approaches can be used: supertree, which combines the trees obtained from individual alignments, and supermatrix, which combines genes before phylogenetic reconstruction (Figure 3). Because of the solid methodological background in sequence-based phylogenetic inference, these remain the methods of choice.

[Figure 3 graphic: flow chart leading from genomic data to orthologous genes and their individual alignments (Gene 1 to Gene 4), and from there either through single-gene trees to a supertree or through a concatenated supermatrix (Sp1 to Sp9) to the phylogenomic tree.]

Figure 3 Methods for phylogenomic inference based on primary sequences. Starting from genomic data, the alignment of orthologous genes is required. Once this crucial step is achieved, two alternative approaches can be used to infer phylogenetic trees. The supermatrix approach involves analyzing the concatenation of individual genes, and nonoverlapping taxa are coded as missing data. Alternatively, the supertree approach combines the optimal trees obtained from the analysis of individual genes, each of which contains data from only partially overlapping sets of taxa.
Supertree Approach

This approach consists of combining the optimal trees obtained from the analysis of individual genes into a single ‘supertree’. Contrary to the classic consensus techniques (e.g., the majority rule consensus tree used in bootstrap analysis), the source trees only need to have overlapping rather than identical taxon sets, giving much more flexibility and allowing incorporation of more data (e.g., a gene that has been lost in a single species can still be considered). Different methods for combining trees exist, but because of its intrinsic simplicity and its demonstrated accuracy, matrix representation using parsimony (MRP) is the most popular. In brief, starting with a set of trees, the presence of all the clades observed is coded as a binary character (missing species being represented by a question mark), and the resulting matrix is analyzed by parsimony to construct a supertree.

This approach presents several major limitations, including the fact that the resulting supertree is often biased toward large and/or unbalanced source trees. But perhaps the major limitation is that single-gene phylogenies generally do not have enough discriminating power (stochastic error), and thus, by combining trees without considering their uncertainties, too much weight is given to potentially weak signals. Several variants have been developed to overcome this problem, for example, by weighting each column of the matrix according to the bootstrap proportion of the clade it represents.

The supertree approach has several advantages. First, it can be used to combine trees that have been obtained from disparate sources such as molecular and morphological data; second, by calculating individual gene trees, one can set apart trees that differ markedly from the others, for example, because of hidden paralogy or HGT; and, third, it can be easily parallelized and requires less memory than the supermatrix approach.
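The MRP coding step described above can be sketched as follows. This is a minimal illustration under stated assumptions: source trees are rooted and written as nested tuples of taxon names, the function names are hypothetical, and the all-zero outgroup row that is often added in practice is omitted. The parsimony analysis of the resulting matrix is not shown.

```python
def clades(tree):
    """Return (leaf_set, clade_list) for a rooted tree of nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the clade defined by this internal node
    return leaves, found


def mrp_matrix(source_trees):
    """Code every clade of every source tree as one binary character.

    '1' = taxon inside the clade, '0' = taxon in the tree but outside
    the clade, '?' = taxon absent from that source tree.
    """
    all_taxa = sorted(set().union(*(clades(t)[0] for t in source_trees)))
    matrix = {taxon: [] for taxon in all_taxa}
    for tree in source_trees:
        tree_taxa, tree_clades = clades(tree)
        for clade in tree_clades:
            if clade == tree_taxa:
                continue  # the root clade carries no grouping information
            for taxon in all_taxa:
                if taxon not in tree_taxa:
                    matrix[taxon].append("?")
                elif taxon in clade:
                    matrix[taxon].append("1")
                else:
                    matrix[taxon].append("0")
    return {taxon: "".join(column) for taxon, column in matrix.items()}


# Two invented source trees with overlapping but different taxon sets.
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("B", "C"), "E")
matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```

Each column has the same weight here, whatever the support for its clade in the source tree. This is the limitation that the bootstrap-weighted variants mentioned above are meant to address.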
Supermatrix Approach

The supermatrix approach follows the principle of total evidence, that is, combining all relevant available data, which in this case are the alignments of each individual gene. This ‘supermatrix’ is then analyzed by the standard sequence-based phylogenetic inference methods described earlier (or by variants thereof). In this approach, the sequences of genes that cannot be used for some species, because they have been lost, horizontally transferred, or have not yet been sequenced, are coded as question marks. Using simulated and real data, several studies have shown that a certain degree of missing data (10–30%) does not seriously affect phylogenetic inference, provided that each taxon is sequenced for a sufficiently large number of positions (at least several thousand). This robustness makes the supermatrix approach powerful for phylogenetic reconstruction, as data sets can be assembled at low cost by mining existing databases or by sequencing multiple PCR-targeted loci or, preferably, randomly selected cDNA clones. This allows the incorporation of a large number of species, instead of being restricted to model organisms for which complete genome sequences are available.

The supermatrix method has been criticized because combining genes with different histories will not produce a single rational phylogenetic reconstruction. Yet, in most studies, the possibility of incongruence is minimized by selecting genes that a priori have the same evolutionary history (e.g., single-copy genes) and by checking a posteriori that single-gene trees are not incongruent with the supermatrix-based tree. But even if the genes combined support the same evolutionary relationships among species, they may have evolved in different ways, for example, faster or slower in each species. This additional heterogeneity is handled by partitioned models in which parameters, such as branch lengths, can be different for each gene. Because the supermatrix method has been intensively explored, tested, and validated, many of its weaknesses and strengths are known, and it is therefore widely used.

The major limitations of the supermatrix approach are (1) the memory requirement and the computational load, (2) the reliability of heuristic searches, which often become trapped in local maxima separated by high barriers in the tree space, and (3) the increased effect of systematic errors. Several innovations have been used to address the first two points, including genetic algorithms, disk-covering methods, and parallelized computing. Systematic error is by far the major concern of phylogenomics but has been better characterized for supermatrix methods than for any other phylogenomic approach.
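The assembly of a supermatrix with question marks for unavailable genes can be sketched as below. The alignments are invented, and `build_supermatrix` and `missing_fraction` are hypothetical names used only for this illustration. Each gene alignment is assumed to be a mapping from taxon name to an already aligned sequence.

```python
def build_supermatrix(gene_alignments):
    """Concatenate per-gene alignments, padding absent taxa with '?'."""
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # A taxon lacking this gene contributes a run of question marks.
            supermatrix[taxon] += aln.get(taxon, "?" * length)
    return supermatrix


def missing_fraction(supermatrix):
    """Fraction of cells in the supermatrix coded as missing ('?')."""
    total = sum(len(seq) for seq in supermatrix.values())
    missing = sum(seq.count("?") for seq in supermatrix.values())
    return missing / total


# Invented protein alignments; Sp2 lacks the second gene.
gene1 = {"Sp1": "MKV", "Sp2": "MKI", "Sp3": "MRV"}
gene2 = {"Sp1": "LLGA", "Sp3": "LIGA"}

supermatrix = build_supermatrix([gene1, gene2])
for taxon, row in supermatrix.items():
    print(taxon, row)
print(f"missing: {missing_fraction(supermatrix):.1%}")
```

A per-taxon count of non-missing positions could be added in the same way to check the condition stated above, namely that every taxon is represented by a sufficiently large number of positions.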
Supermatrices and Systematic Errors

The use of large data sets generally results in a global increase in the resolution of phylogenetic trees (increased statistical support), because the small amounts of phylogenetic signal contained in each gene should, in principle, add up and overwhelm stochastic errors. However, high statistical support does not necessarily mean that the obtained tree is correct, because of systematic error. Moreover, when the strength of the systematic error is of similar magnitude to the genuine phylogenetic signal, this can lead to weak statistical support. Therefore, the study of potential systematic errors always deserves particular attention.

Approaches to Detect Systematic Errors

When using single genes for phylogenetic inference, the most straightforward way to detect systematic errors is to observe incongruences between different markers. This is obviously no longer possible when all the data are combined into a single supermatrix, and therefore alternative approaches should be used to reveal incongruences and potential systematic errors:

1. Using different tree reconstruction methods. Because different methods are not sensitive in the same way and to the same extent to systematic errors, they will potentially produce slightly different topologies, identifying the most problematic parts of the tree. The congruence of all methods, although encouraging, is not a definitive proof of the absence of systematic errors, because all methods will have difficulty in correctly locating, for instance, a very fast-evolving, or rogue, lineage.
2. Partitioning the data set vertically or horizontally, that is, in subsets of genes or of species. For instance, one can compare trees based on supermatrices of informational and operational genes, or based on random subsamples of genes or positions (i.e., a jackknife test). Importantly, experience has demonstrated that varying taxon sampling is a very efficient way to reveal systematic errors because of the introduction or elimination of fast-evolving lineages, which are more likely to have accumulated homoplasies. Therefore, the targeted removal of divergent taxa is highly recommended. If sufficient computational resources are available, analyses of numerous random taxon subsamples could be illuminating.

These approaches fundamentally test the coherence of a given approach (e.g., the supermatrix method). Even more conclusive evidence for the presence of systematic errors is the recovery of different topologies using phylogenomic approaches based on different character types (e.g., gene order data).

Developing Better Models to Reduce Systematic Errors

The development of improved models of sequence evolution is a continuous quest in phylogenetics. Recently, this field has experienced important advances, mainly because of (1) the availability of improved computational resources, (2) the introduction of fast algorithms such as Markov chain Monte Carlo, and (3) the use of large data sets allowing a more accurate estimation of a larger number of parameters (hence allowing more complex models). Among the most important recent improvements are models handling rate variation throughout time (heterotachy), nonstationarity of nucleotide or amino acid composition, site-specific amino acid propensity, and site-interdependent evolution due to protein tertiary structures.

A major step in every phylogenetic analysis is to identify the model that fits the data best, because it should in theory result in the most reliable tree. However, an improved fit can be obtained by a better explanation of data characteristics that do not disturb phylogenetic inference; as a result, a model with a poorer fit may sometimes lead to a more accurate tree. For this reason, the comparison of different suboptimal models is interesting, at least to detect parts of the phylogeny potentially affected by systematic errors. Ultimately, all of the known parameters shaping sequence evolution should be combined into the same model, but this would require a major increase in the computing resources needed, which explains why this integration is only beginning. Unfortunately, no model will ever capture all of the complexity of evolutionary patterns, and therefore alternative approaches are required to detect and overcome systematic errors.

Alternative Approaches to Reduce Systematic Errors

The theoretical basis of the alternative approaches to reduce systematic errors is that fast-evolving data (species or sites) are more prone to induce artifacts because they have accumulated a larger number of multiple substitutions. Accordingly, the most obvious approach is to simply remove the fast-evolving or aberrant taxa from the analysis. Unfortunately, this is only possible when all taxonomic groups under study are represented by at least one slow-evolving organism. The removal of fast-evolving genes is another possibility, but it is probably not a very efficient approach because genes are always a mixture of slow- and fast-evolving positions. In particular, a gene can appear slow because it contains many constant positions while its variable positions evolve extremely rapidly. However, the specific removal of genes in which the problematic taxa are the fastest evolving has proved efficient in several cases. The removal of the fast-evolving positions is also promising. Several ways of estimating the evolutionary rate (e.g., compatibility-based, within predefined groups, or without the problematic taxa) have been proposed and successfully applied, although their relative performance has not been studied. The identification of fast-evolving species, genes, or sites requires, however, a priori knowledge of the phylogeny to accurately estimate evolutionary rates. This circularity therefore constitutes the main limitation of this approach.

An alternative method consists of discarding the fastest-evolving substitutions. For example, RY coding of nucleotide data replaces the nucleotides by purines (R) and pyrimidines (Y) in all the sequences, which eliminates transitions, the substitutions that occur more often than transversions. A certain degree of compositional heterogeneity is also reduced by this approach because the frequency of purines and pyrimidines appears more stationary over time than that of individual nucleotides. A similar coding method has been proposed for amino acids, in which they are grouped according to their biophysical properties.

Finally, one should not forget that systematic errors are caused by the inability of algorithms to correctly detect multiple substitutions. To tackle this, the most obvious way would therefore be to increase taxon sampling to provide more information on the series of substitutions that occurred over time. In other words, long branches are broken into smaller pieces on which multiple substitutions are unlikely, rendering the effect of homoplasy negligible. Although this increase in taxon sampling is not always feasible (e.g., for species-poor lineages such as Amborella or the coelacanth) and is expensive, it should remain a priority.
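RY coding as described above is simple enough to show directly. The sketch below is illustrative: the sequences are invented, the name `ry_recode` is an assumption for this example, and characters other than A, C, G, T, and U (gaps, missing data, ambiguity codes) are passed through unchanged.

```python
# Purines (A, G) become R; pyrimidines (C, T, U) become Y.
RY_CODE = {"A": "R", "G": "R", "C": "Y", "T": "Y", "U": "Y"}


def ry_recode(sequence):
    """Recode a nucleotide sequence into purines (R) and pyrimidines (Y)."""
    return "".join(RY_CODE.get(base, base) for base in sequence.upper())


# Invented sequences: the second differs from the first by two
# transitions (A<->G), the third by one transversion (A->C).
reference = "ATGCA"
transitions_only = "GTACA"
one_transversion = "CTGCA"

print(ry_recode(reference))
print(ry_recode(transitions_only))
print(ry_recode(one_transversion))
```

After recoding, the two sequences that differ only by transitions become identical, whereas the transversion is still visible. The price is a loss of signal: every transition that was phylogenetically informative is discarded along with the saturated ones.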
Phylogenomic Inference Based on Whole Genome Features

Methods for phylogenomic inference based on whole genome features have been introduced only recently, which limits how thoroughly they can be evaluated at present. In principle, these methods are promising because whole genome features such as gene content and order are more complex than primary sequence data and therefore less prone to homoplasy by convergence or reversal: changes in gene content and order have billions of possible character states, compared to only 4 for nucleotides and 20 for amino acids. Other approaches, such as the DNA string approach, do not require the difficult step of homology assessment. The three methods explained below are illustrated in Figure 4.

[Figure 4: flow diagram leading from genomic data, through DNA string frequencies per genome or orthologous genes, to character matrices (presence/absence of each gene, or of pairs of genes, for each genome) and distance matrices (break-point/rearrangement distance, or DNA string frequency differences, per genome pair), and finally to a phylogenomic tree.]

Figure 4 Methods for phylogenomic inference based on whole genome features. Approaches based on gene content and gene order require previous orthology assignment. A character matrix can be constructed by scoring the absence/presence of genes or of pairs of genes per species; it is then analyzed by maximum parsimony or converted into a distance matrix. A distance matrix can also be constructed by calculating break-point or rearrangement distances between pairs of genomes. Approaches based on DNA string frequencies do not require identification of orthologous genes: evolutionary distances among species are calculated from the differences in their oligonucleotide word usage, and phylogenetic trees are reconstructed using standard distance-based algorithms.

DNA String Approach

This method relies on the fact that the frequency of short oligonucleotides (DNA strings) is relatively constant throughout a particular genome but variable across genomes. The comparison of the frequency of DNA strings between different organisms provides a measure of similarity that can be used for phylogenetic inference. In brief, for each genome, a DNA string vector is calculated as the ratio between observed and expected (from the nucleotide frequencies) oligonucleotide frequencies. The DNA string vector differences are calculated for each genome pair for all the species studied and transformed into a distance matrix. A standard distance-based method is then applied to generate a phylogenetic tree. The advantage of this approach is that, unlike all other methods, it does not rely on homology or orthology and does not require any alignment. Although it seems possible to extract phylogenetic signal from oligonucleotide frequencies, this feature evolves fast and the phylogenetic signal in DNA strings saturates rapidly, which may limit the use of this method. Currently, DNA strings are transformed into evolutionary distances without any model of evolution, but improvements are easy to envision.
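The calculation outlined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not specify how vector differences are turned into a distance, so the mean absolute difference used here is an arbitrary choice, and the function names, the word length `k = 2`, and the toy genomes are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def string_vector(genome, k=2):
    """Observed/expected frequency ratio for every DNA string of length k."""
    genome = genome.upper()
    n_bases = sum(genome.count(b) for b in ALPHABET)
    base_freq = {b: genome.count(b) / n_bases for b in ALPHABET}
    # Overlapping words of length k; words with other symbols (e.g., N) are skipped.
    words = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    words = [w for w in words if set(w) <= set(ALPHABET)]
    counts = Counter(words)
    vector = {}
    for word in map("".join, product(ALPHABET, repeat=k)):
        # Expected frequency if nucleotides were drawn independently.
        expected = 1.0
        for base in word:
            expected *= base_freq[base]
        observed = counts[word] / len(words)
        vector[word] = observed / expected if expected > 0 else 0.0
    return vector


def string_distance(v1, v2):
    """Assumed distance: mean absolute difference between two string vectors."""
    return sum(abs(v1[w] - v2[w]) for w in v1) / len(v1)


def distance_matrix(genomes, k=2):
    """Pairwise distances, returned as a dict keyed by (name, name)."""
    vectors = {name: string_vector(seq, k) for name, seq in genomes.items()}
    return {(a, b): string_distance(vectors[a], vectors[b])
            for a in vectors for b in vectors}


# Toy "genomes", far too short for real use.
genomes = {
    "Sp1": "ACGTACGTACGTACGTACGT",
    "Sp2": "AACCGGTTAACCGGTTAACC",
    "Sp3": "ACGTACGAACGTACGTACCT",
}
dist = distance_matrix(genomes)
```

The resulting `dist` table could then be passed to any standard distance-based tree-building method. No alignment or orthology assignment is needed at any step, which is the property the approach is valued for.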
Gene Repertoire

The analysis of gene repertoire is a straightforward way of comparing genomes and was the first published method used with complete genomes. This approach is based on the principle that closely related species share a large proportion of their genes, whereas distantly related species have differentially lost or gained a large proportion of genes with respect to their last common ancestor. The method consists of constructing a data matrix that scores the absence/presence of each gene in each genome. This matrix is then analyzed by parsimony or maximum likelihood. A distance matrix that represents the proportion of shared orthologues between each pair of genomes can also be derived from the data matrix and used by distance-based methods to construct a tree.

The most important limitation of this approach is the definition of the gene repertoire itself. In most cases, orthologous genes are considered, but their identification is based on simple similarity searches, which are prone to error. In theory, a phylogenetic analysis of each gene family would be required to define the gene repertoire accurately, but this would not only require a huge amount of computing time; the results would also be plagued by stochastic error. Alternatively, it has been proposed that only homologous genes be considered, but this drastically reduces the number of gene families. In both cases, it is often difficult to establish gene absence with certainty, because an accelerated rate of evolution and/or a short sequence length can make a gene undetectable even with sophisticated similarity search tools such as PSI-BLAST.

Another major problem of gene repertoire-based methods is the so-called small/big genome attraction artifact, which causes the grouping of unrelated species with small genomes: certain organisms, for example, intracellular parasites, tend to lose a similar
set of genes. This was in fact the first systematic error identified for non-sequence-based phylogenomic methods. Although probabilistic models of gene gains and losses are currently being developed, they do not seem to perform better than parsimony methods, and numerous improvements are still required.
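The construction of a gene presence/absence matrix and of a shared-gene distance can be sketched as follows. The repertoires are invented toy data, the function names are hypothetical, and normalizing the number of shared genes by the size of the smaller repertoire is an assumption made for this illustration; the text only states that the distance reflects the proportion of shared orthologues.

```python
def presence_absence_matrix(repertoires):
    """Score each gene as present (1) or absent (0) in each genome.

    repertoires maps a species name to the set of gene families it contains.
    Returns the sorted gene list and one 0/1 row per species.
    """
    genes = sorted(set().union(*repertoires.values()))
    matrix = {species: [1 if gene in genes_present else 0 for gene in genes]
              for species, genes_present in repertoires.items()}
    return genes, matrix


def shared_gene_distance(a, b):
    """1 minus the fraction of shared genes, relative to the smaller repertoire."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        raise ValueError("empty gene repertoire")
    return 1.0 - len(a & b) / smaller


# Toy repertoires (assumed to be correctly assigned orthologues).
repertoires = {
    "Sp1": {"g1", "g2", "g3", "g4"},
    "Sp2": {"g1", "g2", "g3"},
    "Sp3": {"g3", "g4", "g5"},
}

genes, matrix = presence_absence_matrix(repertoires)
distances = {(a, b): shared_gene_distance(repertoires[a], repertoires[b])
             for a in repertoires for b in repertoires}
```

The character matrix in `matrix` is the kind of input analyzed by parsimony, whereas `distances` would feed a distance-based method. Note that a distance of this kind depends on genome size, which is one way to see how genomes that have independently lost many genes can be drawn together.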
Gene Order

The gene order approach follows the same logic as gene repertoire-based methods, but requires the accurate recognition of orthologous relationships. The method starts with the construction of a character matrix or a distance matrix. The character matrix is generated by scoring the presence/absence of gene pairs in each genome. The distance matrix is generally based either on break-point distances, that is, the number of adjacencies present in one genome but not in the other, or on rearrangement distances, that is, the number of rearrangements (inversions, transpositions, insertions, and deletions) between genome pairs. Among these three approaches (presence/absence of gene pairs, break-point distances, and rearrangement distances), only the last uses true evolutionary distances; the other two may severely underestimate the evolutionary distances between genomes. Because an almost infinite number of gene orders is possible and the probability of convergence is therefore small, this approach is promising. However, although more trustworthy, rearrangement distances are extremely difficult to compute, even under the unrealistic assumption that only inversions have occurred. Nevertheless, probabilistic methods for gene order evolution are currently under development. At present, the computational burden makes the use of gene order for phylogenetic inference difficult, except for small genomes (mitochondria and plastids).
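The break-point distance and the gene-pair character matrix defined above can be sketched as follows. The sketch makes simplifying assumptions that are not in the text: genomes are linear, contain the same single-copy genes, and gene orientation is ignored. The function names and toy gene orders are hypothetical.

```python
def adjacencies(order):
    """Set of unordered pairs of neighbouring genes in a linear gene order."""
    return {tuple(sorted(pair)) for pair in zip(order, order[1:])}


def breakpoint_distance(order_a, order_b):
    """Number of adjacencies present in genome A but not in genome B."""
    return len(adjacencies(order_a) - adjacencies(order_b))


def gene_pair_matrix(orders):
    """Score presence (1) or absence (0) of every observed gene pair per genome."""
    per_genome = {name: adjacencies(order) for name, order in orders.items()}
    all_pairs = sorted(set().union(*per_genome.values()))
    rows = {name: [1 if pair in adj else 0 for pair in all_pairs]
            for name, adj in per_genome.items()}
    return all_pairs, rows


# Toy gene orders: B differs from A by a single inversion of g3-g4.
orders = {
    "A": ["g1", "g2", "g3", "g4", "g5"],
    "B": ["g1", "g2", "g4", "g3", "g5"],
}

d_ab = breakpoint_distance(orders["A"], orders["B"])
pairs, rows = gene_pair_matrix(orders)
```

In this toy case a single inversion produces two break-points. Overlapping rearrangements can reuse break-points, which is one reason why break-point counts tend to underestimate the number of events, as noted in the text.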
Rare Genomic Changes

Finally, genomic data can also be analyzed using an approach similar to the one that has been used for morphology over the centuries: the identification, in complete genomes, of a few complex characters that are putatively homoplasy free because they evolve slowly (possibly undergoing only a single change). Various so-called rare genomic changes, useful at different evolutionary depths, have been proposed: insertion of retroposons (e.g., SINEs and LINEs), genetic code variation, gene fission and fusion, or presence/absence of introns. Because of their scarcity, these characters are well suited for 'manual' analyses, which are currently the rule. However, there is no reason not to apply standard statistical methods to these types of characters, as is currently done for intron position evolution. More generally, any genomic characteristic
that can be rigorously compared among organisms could be considered, and its usefulness should be evaluated.

Limitations and Perspectives

Like nucleotide and amino acid sequences, whole genome features are affected by homoplasy, and inferences can be misled by systematic errors. However, except for the small/big genome attraction mentioned earlier, systematic errors have not yet been characterized in detail, which makes their impact difficult to evaluate. To deal with them, the same approaches as those described for the supermatrix method should be applied: modifying species sampling or removing fast-evolving data (e.g., regions of the genome with high recombination rates). Similarly, the use of improved models that better describe the evolution of these characters is the ultimate goal. Although a large amount of work is needed, obtaining reliable methods complementary to the supermatrix approach is of prime importance: because inferring ancient evolutionary events is extremely difficult, results corroborated by independent methods are the most trustworthy. This corroboration is the key to validating the inference of the Tree of Life.
Contribution of Phylogenomics to the Microbial Tree

Because of large differences in genome size, data used in phylogenomics come mainly from complete genome sequences in the case of prokaryotes and from expressed sequence tag (EST) sequencing in the case of eukaryotes (particularly animals). New sequencing technology will probably make numerous complete eukaryotic genomes available soon. Surprisingly, although hundreds of complete bacterial genomes have been available for several years, very few bacterial phylogenies have been inferred with more than 100 species and none with more than 200. Technical limitations, in particular the lack of user-friendly tools to handle this large amount of data, and the computational burden explain this underuse of promising genomic data. Accordingly, phylogenomics has not yet produced a large number of new results, although the situation is currently changing.

Eukaryotes

Because of the availability of large amounts of sequence data from animals, fungi, and green plants, advances concerning the evolutionary relationships within these three groups represent the first and most spectacular achievements of phylogenomics. For instance, the sister group of vertebrates was shown to be tunicates, instead of cephalochordates as long assumed. Inferring the eukaryotic tree
requires more than this, because much of the evolutionary diversity of the domain Eukaryota lies outside these three kingdoms. Recently, large-scale genomic data have been generated for a range of poorly studied but important microbial eukaryotes commonly referred to as protists, and phylogenomic studies have been carried out to determine their position in the eukaryotic tree. A number of these unicellular lineages have turned out to be related to animals (Choanoflagellata, Capsaspora, and Ichthyosporea) or to fungi (Nucleariidae), but most of them have been grouped into four proposed, exclusively unicellular superensembles: Amoebozoa, Excavata, Rhizaria, and Chromalveolata.

Amoebozoa are a group of morphologically diverse amoebae, which includes slime molds (e.g., Dictyostelium), lobose amoebae (e.g., Amoeba), and anaerobic Archamoebae (e.g., Entamoeba). These organisms were thought to have evolved independently, but recent phylogenomic analyses based on hundreds of protein-coding nuclear genes have shown that they share a common origin.

Excavata are an ensemble of organisms that do not possess a single common feature but are united by a series of overlapping ultrastructural and molecular characters. So far, there is no molecular phylogenetic evidence for the monophyly of this group, but recent phylogenomic analyses have found monophyly for a few subensembles, such as the grouping of Jakobida, Euglenozoa, and Heterolobosea.

Chromalveolates unite alveolates (ciliates, dinoflagellates, and apicomplexans), stramenopiles, haptophytes, and cryptophytes. The hypothesis of their monophyly is based on the assumption that a single secondary endosymbiosis with a red alga is at the origin of their vertically inherited chlorophyll-c-containing plastids. According to recent phylogenomic studies, haptophytes and cryptophytes could be sister groups, but there is no convincing evidence to cluster them with alveolates and stramenopiles.
Indeed, recent multigene analyses associate the last superensemble, the Rhizaria, with alveolates and stramenopiles.

In general, progress on the eukaryotic phylogeny has not been as impressive as expected, because taxon sampling remains sparse (often only one species represents a large and diverse phylum, and some phyla are completely absent) and because the relationships to be inferred are ancient (hence homoplasy is not negligible). With the increase in available sequence data for many interesting unicellular eukaryotes and the development of better models of evolution, the situation is likely to improve.

One of the most difficult questions is the position of the root of the eukaryotic tree. The Archaea, often used as an outgroup, are too distantly related for a reliable inference. Although α-proteobacteria constitute a much closer outgroup for the eukaryotic genes of mitochondrial origin, they have not yet been used at a genome scale to decipher the eukaryotic root. Although a few rare genomic changes (gene fusions and gene duplications) have been proposed
to address this issue, they are not fully congruent. They nevertheless suggest a root between unikonts (Amoebozoa and Opisthokonta) and bikonts (all other eukaryotes). Phylogenomic analyses using rich taxon sampling and improved models of sequence evolution are definitely needed to settle the root of the eukaryotic tree.

Prokaryotes

Despite the large number of complete prokaryotic genomes now available, the bacterial and archaeal phylogenies remain almost identical to the rRNA-based tree. The phylogeny of prokaryotes has been carefully inspected because of the supposed predominance of HGT, which would make the inference of the species phylogeny difficult or even impossible. Some researchers consider HGT to be so widespread that the phylogenetic signal has disappeared. However, analyses of gene order, DNA strings, and other methods have shown that it is possible to robustly recover most relationships among prokaryotes, and several supermatrix and supertree approaches have shown that there is a core of genes that rarely undergo HGT and that are therefore suitable for determining the phylogeny of prokaryotes.

Relationships that were already supported by rRNA phylogenies, such as the monophyly of proteobacteria, cyanobacteria, spirochetes, and chlamydiales, have been confirmed by phylogenomic studies, but the relationships among these groups are mostly unresolved. Based on orthologue distances, concatenated alignments, and supertree analyses, at least three major new clades are strongly supported (although systematic error cannot be excluded, owing to insufficient testing): the sister-group relationship of chlamydiales and spirochetes, the sister-group relationship of aquificales and thermotogales, and the grouping of high-GC Gram-positive bacteria with cyanobacteria and deinococcales. In addition, several other major groupings were found by some approaches but not by others and should be considered tentative for the moment.
The resolution of the bacterial radiation remains one of the major challenges, even in the age of phylogenomics.

Within the Archaea, the separation between Euryarchaeota and Crenarchaeota was clear in rRNA phylogenies, but some uncultured taxa such as Korarchaeota were of uncertain position. Phylogenomics recently questioned the monophyly of Crenarchaeota, but taxon sampling remains too sparse to draw any firm conclusions, since many important lineages, such as Nanoarchaeota and Methanopyrales, are represented by a single species. One of the most interesting debates concerning archaeal phylogeny is whether the methanogens are monophyletic. Ribosomal RNA-based phylogenies did not support their monophyly and suggested that Methanopyrus emerged early on. The nonmonophyly was confirmed by phylogenomic studies, but methanogens
appear to have emerged relatively late, although the position of Methanopyrus remains unsettled, possibly close to Methanobacteriales and Methanococcales. The sparse taxon sampling and the extreme evolutionary properties of archaeal genomes (in particular their amino acid compositions) make phylogenomic inference quite difficult, and further refined studies are needed.
Conclusion

Phylogenomics is without doubt the most promising way to resolve the Tree of Life, as demonstrated in several cases. Improvements in species sampling and inference methods are still needed to avoid systematic errors and enhance phylogenetic signal. However, resolving power does not increase linearly with the number of characters considered, which implies that some closely spaced speciation events will most likely remain unresolved, although it is premature to draw such a conclusion from the current lack of resolution. A serious drawback of phylogenomics is that the resources needed, particularly computational ones, increase dramatically compared with single-gene phylogenetic inference, thus contributing to environmental degradation, that is, to the loss of biodiversity. Although it has long been noticed that scientific observation often leads to the destruction of the object under study, this is particularly problematic in the case of the Tree of Life.

See also: Genome Sequence Databases: Types of Data and Bioinformatic Tools; Horizontal Transfer of Genes between Microorganisms; Metagenomics