Journal of Biomedical Informatics 39 (2006) 3–5 www.elsevier.com/locate/yjbin
Guest Editorial
Phylogenetics in the modern era

Since the beginning of recorded history, human beings have been concerned with understanding the origins of life. Among the philosopher–scientists in the Western world, Aristotle was the first to record a scheme that categorically organized life on Earth [1]. It was not until 1735, however, that Carl Linnaeus's Systema Naturae [2] proposed a formal organization scheme for categorizing groups of common organisms (Linnaeus focused on plants). Jean-Baptiste Lamarck applied this approach, which assigned binomen names (consisting of both a genus name and a species name) to animals, in his Philosophie Zoologique (1809) [3]. This Linnaean binomen nomenclature is still used in modern taxonomy to identify organisms across the tree of life.

Aristotle, Linnaeus, and Lamarck sought only to organize organisms into categorically defined groups; it was not until Darwin published Origin of Species in 1859 [4] that an evolutionary basis was formally considered as a criterion for taxonomic organization. Darwin's theory of evolution, which posits that observable features are selected through a natural process, laid the foundations for modern evolutionary biology. Indeed, Origin of Species contained only a single figure, one that depicted the evolution of organisms into distinct species based on a branching pattern (Fig. 1).

With the discovery of fossil evidence of previous life on Earth, we became acutely aware that life on Earth has come and gone in many forms. Systematic techniques were pioneered to organize extinct organisms relative both to themselves and to extant organisms (e.g., dinosaurs to birds) using comparable morphological features (e.g., relative femur size and orientation) [6]. The basic tenet of these methods was that organisms that are more closely related to each other would share more features in common.
The study and development of these methods in the context of evolutionary history, represented as hierarchical relationships akin to Darwin's figure, became known as phylogenetics, which attempted to define how organisms (as species) emerged in relation to each other with respect to evolutionary time. Because phylogenetic techniques describe evolution in a systematic manner, phylogenetics is often referred to as systematics.

Fast forward to the current era of genomics. Spurred on by the Human Genome Project [7] and the sequencing of the Haemophilus influenzae genome [8], the acquisition of the
1532-0464/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2005.11.009
molecular blueprints of life, as deoxyribonucleic acid (DNA), has produced exponentially growing amounts of data in public repositories such as GenBank [9]. Phylogenetic techniques have been adapted to accommodate these molecular data (as previous phylogenetic techniques were designed to use only morphological data). These techniques are referred to as molecular phylogenetics or molecular systematics.

Phylogenetics can be divided into two general types of methods: phenetic and cladistic. Phenetic approaches, often called numerical taxonomy, are those that propose hypothetical evolutionary organization based on overall relationships. The basic phenetic approach is called Minimum Evolution, which attempts to minimize the overall observed change between the entities (e.g., organisms) being considered, called taxa (singular taxon). In practice, heuristic methods (e.g., Neighbor-Joining) are employed to posit phenetic relationships. Phenetic approaches are seductive from a computational perspective, mostly because of their algorithmic nature: pair-wise comparisons are first computed between all taxa and organized into a similarity matrix, and the taxa are then hierarchically clustered according to this matrix.

Cladistic methods describe evolutionary history in terms of individual units of evolution (e.g., nucleic acids from DNA). The basis and theory of cladistic approaches were first described in Willi Hennig's Phylogenetic Systematics (1966) [10]. Popularly used cladistic methods can be broken into two major classes: non-statistical and statistical. While non-statistical (e.g., maximum parsimony) methods assume equivocal models of evolution, statistical (e.g., maximum likelihood and Bayesian) approaches use statistical models of evolution to infer phylogenetic relationships. Compared to phenetic methods, cladistic methods are computationally intensive.
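As a rough illustration of the phenetic procedure described above (pair-wise comparisons, a matrix, then hierarchical clustering), the following minimal Python sketch may help. It rests on several assumptions that are not part of this editorial: the taxa and sequences are invented toy data, the comparison is a simple proportion of differing sites, and the clustering step is plain average linkage (UPGMA-style) rather than Neighbor-Joining or a true Minimum Evolution search. All function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """All pair-wise distances between taxa, keyed by an unordered pair of names."""
    return {
        frozenset((s, t)): p_distance(seqs[s], seqs[t])
        for s, t in combinations(seqs, 2)
    }


def leaves(cluster):
    """Flatten a nested-tuple cluster into the list of taxon names it contains."""
    if isinstance(cluster, str):
        return [cluster]
    return [name for child in cluster for name in leaves(child)]


def average_linkage(seqs):
    """Repeatedly merge the two clusters with the smallest mean pair-wise distance."""
    dist = distance_matrix(seqs)
    clusters = list(seqs)  # every taxon starts as its own cluster

    def mean_dist(c1, c2):
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: mean_dist(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append((c1, c2))  # the merged pair becomes one nested cluster
    return clusters[0]


# Hypothetical toy alignment: taxon names and sequences are made up for illustration.
toy = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTA",
    "D": "TCGAACTA",
}
tree = average_linkage(toy)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

The nested tuple groups A with B and C with D because those pairs differ at only one of eight sites. The sketch is meant only to show why such methods are computationally attractive: the matrix is computed once and each merge is a simple lookup, with no search over alternative tree topologies.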
The general cladistic approach is first to consider all possible tree topologies, and then to evaluate them according to a specified criterion that satisfies a specific model of evolution (e.g., maximum parsimony, maximum likelihood, or Bayesian) to determine the best tree(s). This special issue consists almost entirely of methodological review papers that provide in-depth discussions about the various informatics methods that phylogeneticists use in contemporary studies. The issue was designed
Fig. 1. Reprinted with permission from Darwin's Origin of Species [5]. Thanks to Dr. van Wyhe, Director of the Complete Work of Charles Darwin (http://darwin-online.org.uk), for permission to use this reproduction.
Fig. 2. Overview of this special issue. The articles in this special issue are organized into core questions relative to the three basic tasks in phylogenetics.
with two purposes in mind: (1) to provide a comprehensive introduction for those just beginning to use phylogenetic techniques in their research; and (2) to provide practicing phylogeneticists with an overview of recent advances in theories and techniques. While it is impossible to cover every nuance one might consider in phylogenetics, the articles presented here offer balanced coverage of many pertinent issues and considerations when using phylogenetic techniques in contemporary studies.

I have divided the overall phylogenetic approach, as summarized in this issue, into three distinct parts: (1) gathering and organizing data for comparative studies; (2) inferring phylogenetic trees from those data; and (3) assessing the reliability of phylogenies. Within each part, a set
of core questions is addressed (as shown in Fig. 2).

The first three reviews present an in-depth discussion about three essential tasks associated with the gathering and organizing of data for subsequent phylogenetic analyses. First, DeSalle describes how the individual units of study, characters, are ascertained and used in comparative studies [11]. Second, Phillips considers how to relate and organize character data such that they can be used in subsequent phylogenetic analyses [12]. Finally, Wiens provides a review of the considerations and effects of missing character data in light of plausible evolutionary hypotheses [13].

After character data are gathered and organized, phylogenetic analyses are performed to infer evolutionary hypotheses. The second set of articles begins with original research presented by Weber and colleagues that describes a new method for inferring phylogenetic trees within a maximum parsimony framework [14]. Next, Kosiol et al. provide an insightful review of maximum likelihood methods and their applicability to studying diseases in light of evolutionary hypotheses [15]. Charleston and Perkins then provide a review of cophylogenetic methods and principles, which involve the concurrent study of multiple phylogenies that are biologically related (e.g., via host–parasite relationships) [16].

The assessment of the 'correctness' of a given evolutionary organization is entrenched in controversy. While there are no real 'gold' standards in phylogenetic studies, there are established metrics to assess the confidence of inferred trees (based on the methods described in the second set of articles). These metrics are the focus of the last two reviews. First, Egan reviews metrics and philosophical considerations commonly used for assessing the acceptability of a proposed phylogenetic tree [17].
Planet then presents statistical tests that are used to assess the confidence of phylogenetic analyses in light of potentially conflicting phylogenetic signals [18].

The ever-increasing generation and availability of various types of biological data in various forms have presented the scientific community with unprecedented challenges. Phylogenetic techniques provide us with the tools to organize and examine these data in the context of evolutionary history. The phylogenetic principles and techniques covered in this special issue are actively used in various domains (e.g., see: [19–22]), and I encourage the biomedical community at large to consider the applicability and advancement of phylogenetic methods and principles showcased in this special issue.

Acknowledgments

I convey my utmost gratitude to each of the contributing authors. I am especially thankful to Rob DeSalle, who assisted with the identification of pertinent authors to include in this issue. Finally, I thank the reviewers for their insightful comments that helped improve the quality of the articles individually and the issue as a whole.
References

[1] Aristotle. In: Barnes J, editor. Complete works of Aristotle: the revised Oxford translation, Vols. 1–2. Princeton University Press; 1997.
[2] Linnaeus C. Species plantarum: a facsimile of the first edition, 1753. Sabbot: Rudolph William Natural History Books; 1959.
[3] Lamarck J-B. Philosophie zoologique (1809). Paris: Flammarion; 1994.
[4] Darwin C. The origin of species (1859). Signet Classics; 2003.
[5] http://pages.britishlibrary.net/charles.darwin2/diagram.jpg.
[6] Donoghue MJ, Doyle JA, Gauthier J, Kluge AG, Rowe T. The importance of fossils in phylogeny reconstruction. Ann Rev Ecol Syst 1989;20:431–460.
[7] Collins FS, Patrinos A, Jordan E, Chakravarti A, Gesteland R, Walters L. New goals for the U.S. Human Genome Project: 1998–2003. Science 1998;282:682.
[8] Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995;269:496.
[9] Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, et al. Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res 2004;32:D35–D40.
[10] Hennig W. Phylogenetic systematics. Champaign, IL: University of Illinois Press; 1966.
[11] DeSalle R. What's in a character? J Biomed Inform 2006;39:6–17.
[12] Phillips AJ. Homology assessment and molecular sequence alignment. J Biomed Inform 2006;39:18–33.
[13] Wiens JJ. Missing data and the design of phylogenetic analyses. J Biomed Inform 2006;39:34–42.
[14] Weber GM, Ohno-Machado L, Schieber S. Representation in stochastic search for phylogenetic tree reconstruction. J Biomed Inform 2006;39:43–50.
[15] Kosiol C, Bofkin L, Whelan S. Phylogenetics by likelihood: evolutionary modeling as a tool for understanding the genome. J Biomed Inform 2006;39:51–61.
[16] Charleston MA, Perkins SL. Traversing the tangle: algorithms and applications for cophylogenetic studies. J Biomed Inform 2006;39:62–71.
[17] Egan MG. Support vs. corroboration. J Biomed Inform 2006;39:72–85.
[18] Planet PJ. Tree disagreement: measuring and testing incongruence in phylogenies. J Biomed Inform 2006;39:86–102.
[19] Avise JC. Molecular markers, natural history, and evolution. London: Chapman and Hall; 1994.
[20] Harvey PH, Pagel MD. The comparative method in evolutionary biology. Oxford: Oxford University Press; 1991.
[21] Harvey PH, Leigh-Brown AJ, Maynard-Smith J, Nee S. New uses for new phylogenies. Oxford: Oxford University Press; 1996.
[22] Albert VA. Parsimony, phylogeny, and genomics. Oxford: Oxford University Press; 2005.
Indra Neil Sarkar
Divisions of Invertebrate Zoology and Library Services, American Museum of Natural History, Central Park West at 79th Street, New York, NY 10024, USA
E-mail address: [email protected]
Received 10 November 2005
Available online 9 December 2005