Biochimie 119 (2015) 205–208
Biochimie journal homepage: www.elsevier.com/locate/biochi
Editorial
Structural biology and genome evolution: An introduction
For fifty years, reconstruction of a relatively small number of “significant” gene trees has driven the preconceptions and perceptions of molecular evolution. More recently, the sequencing of thousands of genomes, along with 3D structure determination of one hundred thousand proteins, has refocused and clarified the outlines of molecular evolution. In particular, phylogenetic analyses of genome-scale data based on distributions of modular protein domains challenge conventional beliefs about the evolution of organisms and their proteomes.
In 1992, a note entitled “One thousand families for the molecular biologist” [1] by Cyrus Chothia sparked a radical advance in the study of molecular evolution. His calculations suggested, “The large majority of proteins come from no more than one thousand families”. Further, he recognized the proteins in such structural families as homologs because, as he reasoned, they share “sequences, functional and/or genetic similarities that strongly imply that they are descended from a common ancestor” [1]. Then, in 1998, Chothia and colleagues discovered that pairwise sequence comparisons, such as the BLAST searches commonly used by gene tree enthusiasts, seriously underestimate the number of significantly diverged homologs [2,3]. As the title of Park et al. puts it, “Sequence Comparisons Using Multiple Sequences Detect Three Times as Many Remote Homologues as Pairwise Methods” [3]. Thus, without a comprehensive grasp of the distributions of homologous protein characters encoded by genomes, divergent evolution of sequences was often mistaken for convergent evolution of structures [4,5]. Similarly, the realization that missense mutations significantly affect protein folding and function raised fundamental concerns about the generality of the neutral theory of molecular evolution [6,7].
http://dx.doi.org/10.1016/j.biochi.2015.10.023 0300-9084/© 2015 Published by Elsevier B.V.
It has taken a few decades of large-scale genome sequencing as well as many tens of thousands of 3D structural determinations of proteins to realize the full reach of Chothia's structural insights. It is now recognized that most proteins are constructed from one or more compact domains that are in turn associated with shorter terminal regions that we refer to as linkers [8]. Linkers are essential to the associations of proteins with other macromolecules and subcellular structures, while most enzymatic activities reside in the tertiary structures of domains that constitute the major fraction of protein mass [8]. In isolation, linkers and some fibrous proteins are identified with fluid secondary structures, but they present no stable tertiary folds. Stable domains are gathered in large groups with a common 3D structure, for example as superfamilies, of which there are a mere two thousand or so in nature [9]. Such 3D structures or tertiary folds are identifiable in the hundreds of thousands of proteins whose structures have been
determined at atomic resolution [9]. The essential point in the present context is that the tertiary folds of domains at the hierarchical level of superfamily (SF) are stable and self-folding [10]. That is to say, SFs are modular units of protein structure. As such, they can be employed as homologous characters to support phylogenetic reconstructions.
The papers included in this section of the journal were written by participants in a meeting held in June 2014, organized by The Royal Swedish Academy of Sciences and supported by the Knut and Alice Wallenberg Foundation. The principal theme of the meeting, as well as the orientation of a majority of the papers, follows Chothia's approach to molecular evolution [9–17]. These papers report advances and retreats made in our understanding of the evolution of the proteins and proteomes of the three superkingdoms (archaea, bacteria, and eukaryotes) as well as their viruses. Six of the nine papers [9–13,16] explore molecular evolution from Chothia's phenotypic perspective. In contrast, two of the papers [14,17] view aspects of molecular evolution as it has been studied from the vantage of the gene tree paradigm that was pioneered in the work of Zuckerkandl and Pauling [18] as well as Woese [19]. A final paper [15] presents a critical comparison of gene trees contra genome-scale trees. Its emphasis is on both the technical and conceptual limitations of conventional gene trees [18,19] in comparison with genome-scale trees that more faithfully represent species trees [15,20].
Here, we present elements of the emergent phenotypic paradigm for the study of genome evolution that Chothia and his colleagues jump-started [1,3,4]. The pipeline for such phylogenomic reconstructions begins with complete sequence determinations for genomes. Hidden Markov models (HMMs) guided by 3D protein structures obtained by crystallographic and/or NMR methods annotate the genome sequences.
Then families of homologous protein domains at the hierarchical level of superfamily (SF) from the SCOP [21] or CATH [22] databases are identified and collected for each sampled genome; each batch of SFs annotated from a genome sequence defines a taxon [15,20]. Most proteins feature one or more SFs from a basic set of circa 2000 unique SFs that have been identified in one hundred thousand protein structures encoded by thousands of genome sequences [8,9,20]. Of these, up to 1300 SFs have been identified at the root of the so-called tree of life (ToL), which implies that the common ancestor of the ToL is astonishingly complex [15,20]. That fact is currently a source of lively speculation.
Since Chothia's 1992 note [1] appeared, the major advances and retreats made with conventional sequence-based gene tree
methods [18,19] feature the identification of artifacts inherent to those methods [23–25]. Thus, deep phylogeny reconstructed from raw sequence data yields gene trees that are highly distorted by long-branch attraction arising from mutation saturation and mutational rate variation (heterotachy) [23–25]. Evidently, the reconstruction of deep phylogeny based on gene sequences is fraught with prohibitive technical challenges [23–26]. So far, such technical limitations do not challenge the implementation of genome-scale phylogeny with SFs [3,4,15,20,21,27]. Accordingly, we have some confidence that the ToL obtained by implementing taxa defined as genomic SF content is reliable [15,20]. That confidence is reinforced by observations showing that selection operates to conserve the 3D folds of SFs [10], rather than their coding sequences, which are variable [1,3,4,6,8–10].
The distinction between microevolution and macroevolution may be useful here, particularly since the latter entails speciation and higher-order evolutionary events [28,29]. Accordingly, we use the terms microevolution and macroevolution here to distinguish the hierarchical levels of phylogeny represented by sequence-based gene trees and genome content trees, respectively. The validity of this convention is well supported by the principal components analysis (PCA) of genomic SF distributions in Fig. 1B. This analysis reveals a systematic covariation of protein domain SFs that defines the unique SF complement of each sequenced genomic species [15,20]. Thus, the PCA suggests that speciation is always associated with unique genomic events involving acquisition and/or loss of protein domains [15,20]. Accordingly, we are comfortable with the interpretation that genome content trees based on the gains and losses of SFs report on the speciation events of deep phylogeny that may be obscured through allelic drift in gene trees [28].
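To make the idea of a taxon defined by its genomic SF content concrete, the following is a minimal sketch, not the procedure used in Refs. [15,20]. It assumes that each genome has already been annotated with a set of SF identifiers, encodes those sets as a presence/absence matrix, and projects the genomes onto principal components using a plain singular value decomposition. The genome names, SF labels, and the helper functions `presence_absence_matrix` and `pca_project` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: each genome (taxon) is defined by the set of SF
# identifiers annotated in it. All names below are invented for illustration.
genome_sfs = {
    "archaeon_1": {"sf_a", "sf_b", "sf_c"},
    "archaeon_2": {"sf_a", "sf_b", "sf_d"},
    "bacterium_1": {"sf_a", "sf_c", "sf_d", "sf_e"},
    "eukaryote_1": {"sf_a", "sf_e", "sf_f", "sf_g", "sf_h"},
    "eukaryote_2": {"sf_a", "sf_e", "sf_f", "sf_g"},
}


def presence_absence_matrix(genome_sfs):
    """Return (genome names, SF names, 0/1 matrix with one row per genome)."""
    genomes = sorted(genome_sfs)
    sfs = sorted(set().union(*genome_sfs.values()))
    matrix = np.array(
        [[1.0 if sf in genome_sfs[g] else 0.0 for sf in sfs] for g in genomes]
    )
    return genomes, sfs, matrix


def pca_project(matrix, n_components=2):
    """Project rows onto the leading principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)  # center each SF column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)  # fraction of variance per component
    return scores, explained[:n_components]


genomes, sfs, matrix = presence_absence_matrix(genome_sfs)
scores, explained = pca_project(matrix)
for name, (pc1, pc2) in zip(genomes, scores):
    print(f"{name:12s} PC1={pc1:+.2f} PC2={pc2:+.2f}")
```

With real annotations the rows would number in the hundreds and the columns in the range of the roughly 2000 known SFs; the toy matrix here only shows the shape of the data that such an analysis consumes.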
A popular belief among molecular evolutionists is that horizontal gene transfer (HGT) is rife [30,31]. This belief has survived all the
published reports of global HGT frequencies that are described as negligible in gene trees as well as in genome content trees [15,20]. The original assertion that HGT was rampant was intended to account for the anecdotal observations of extensive sharing of coding sequences between very distantly related genomes [30]. Further, the purported dominance of HGT in the ancestry of contemporary species undermined confidence in the monophyly of extant organisms [32]. But those extravagant interpretations discounted a viable alternative explanation: namely, that the shared sequences are in fact characters that descended from the last common ancestors of contemporary species, that is, synapomorphies. Synapomorphies or “special similarities” are the shared traits that report on the divergent descent of phylogenetic characters originating from the common ancestor at the root of a clade to its descendants [33]. There is now convincing evidence supporting the monophyly of life based on similarities of proteins [34] as well as similarities of proteomes (e.g., the Venn diagram in Fig. 1A) [15,20]. In addition, independent evidence from phylogenetic reconstructions based on genome content shows that proteins shared by different groups of organisms are, in the overwhelming majority of instances, identifiable as synapomorphies and not as HGT [15,20,35] (Fig. 2). Indeed, synapomorphies link archaea and bacteria (the akaryotes) as sister clades that diverge from a last akaryote common ancestor (LACA) (Fig. 2). In parallel, several eukaryote sister clades diverge independently from a last eukaryote common ancestor (LECA). Here, LACA and LECA diverge independently from a more complex LUCA (Fig. 2). Reconstructions of the proteomes of the three ancestors, LACA, LECA and LUCA, confirm the independent divergence of akaryotes and eukaryotes from the common ancestor [15,20]. These data confirm the suggestion that systematic sequence artifacts distort Woese's universal tree [15,20,24,26].
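The kind of bookkeeping behind a Venn diagram of SF repertoires can be sketched as follows. This is an illustrative example only: the repertoires and SF labels are invented, and the helper `venn_regions` is hypothetical. The sketch merely partitions SFs by the exact combination of groups in which they occur; by itself it cannot distinguish a synapomorphy from a transfer, which requires the phylogenetic reconstructions discussed above.

```python
from itertools import combinations

# Hypothetical SF repertoires pooled per superkingdom (labels invented).
repertoires = {
    "Archaea": {"sf_a", "sf_b", "sf_c", "sf_d"},
    "Bacteria": {"sf_a", "sf_b", "sf_e", "sf_f"},
    "Eukaryotes": {"sf_a", "sf_c", "sf_e", "sf_g", "sf_h"},
}


def venn_regions(repertoires):
    """Map each combination of groups to the SFs found in exactly those groups."""
    names = sorted(repertoires)
    regions = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            # SFs present in every group of the combination ...
            inside = set.intersection(*(repertoires[n] for n in combo))
            # ... minus SFs that also occur in any group outside it.
            others = [repertoires[n] for n in names if n not in combo]
            outside = set().union(*others)
            regions[combo] = inside - outside
    return regions


regions = venn_regions(repertoires)
for combo, shared in regions.items():
    print(" & ".join(combo), sorted(shared))
```

Because every SF falls into exactly one region, the region sizes sum to the size of the pooled repertoire, which is a convenient sanity check on real data.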
Fig. 1. Patterns of SF sharing between and within major organismal groups. A. This Venn diagram displays the SF repertoires common to and unique to proteomes of Archaea, Bacteria and Eukaryotes. SFs in ~50 genomes from each group representing the broadest possible taxonomic range of sequenced genomes were compared. B. PCA projection of the covariation of genomic SF content. Each circle represents an individual proteome defined by its genomic SF cohort. The projection shows the distinct SF covariation of Archaea, Bacteria and Eukaryotes. Further, it reveals a greater similarity between the SF distributions of Akaryotes (Archaea and Bacteria) and less similarity between those and the SF distributions of Eukaryotes (redrawn from Ref. [20]).
Fig. 2. Global phylogeny of organisms. Genealogical relationships inferred from genomic SF composition show that Archaea and Bacteria (Akaryotes) are sister groups, which are more closely related to each other than they are to Eukaryotes in terms of recency of common ancestry. Akaryotes descended from LACA. Eukaryotes are more closely related to each other than they are to Akaryotes. Eukaryotes descended from LECA. LACA and LECA are sister groups that descended from LUCA (redrawn from Ref. [20]).
Phylogeny based on genome content of protein domains reconstructs a robust simulation of the macroevolution of genomes. Wolynes' review [10] is an important clarification of the relationship between sequences and protein structure, which accounts in turn for the reliability of phylogeny based on protein structures. Thus, “quantitative analysis of co-evolution patterns allows us to infer the statistical characteristics of the folding landscape. These [characteristics] turn out to be consistent with what has been obtained from laboratory physicochemical folding experiments signaling a beautiful confluence of genomics and chemical physics.” Wolynes' summary statement begins [10]: “The selection constraint of having funneled folding landscapes has left its imprint on the sequences of existing protein structural families.” This is the point around which all the papers in this section of the journal revolve. The macroevolutionary processes of selection and counter selection (loss) of unique protein structures are demonstrably what Darwinian molecular evolution of proteomes is about, and at least in part, what speciation is about. The latter reservation is a nod to the fact that a significant fraction of genome sequence does not encode protein domain structures, particularly among eukaryote genomes. The point here is that selection for domains is acting on the hierarchical level of protein fold and function, not on the level of sequences per se. There may be hundreds of different sequences that encode a given SF with a given function, but a single generalized function can be assigned to each member of that superfamily, even though its chemical activity may vary in recognizable detail. So, massive sequence degeneracy is the name of the game in which genes encode protein structures. The selective conservation of structure is reflected in the contrasting multiplicity of sequences that can encode a superfamily in the same or in different genomes [1,3,4,8–10].
That sequence degeneracy contributes to the mutationally robust character of proteins as well as of their domains, as inferred by Fares [13]. Indeed, the “architecture plasticity potential”, that is to say the capacity to form “distinct domain architectures”, can be determined by a new measure introduced by Linkeviciute et al. [16]. To this new metric may be added a novel approach by Bitard-Feildel et al., who have introduced “hydrophobic cluster analysis” as a means to identify novel or orphan protein domains [12]. Finally, the special features and complexities of viral proteomes are summarized in Abroi's study of the coevolution of viral and host proteins [11].
To round off the presentations, Gabaldón and Pittis take up the perennial problem of cellular compartment evolution in the conventional context of gene trees [14]. Then Penny and Zhong take up two venerable issues from the gene sequence universe: how does the inferred antiquity of sequences influence their reliability in deep phylogeny, and how are we to view the origins of proteins themselves [17]. Finally, Kurland and Harish offer a critical assessment and comparison of genome content phylogeny contra gene trees [15].
In summary, five principal preconceptions of molecular evolution associated with gene trees are: i. that protein-coding sequences are molecular fossils; ii. that gene trees are equivalent to species trees; iii. that the tree of life is rooted in a very simple akaryote cell, implying that akaryotes are primitive and that eukaryotes are advanced; iv. that all or most incongruities between alignment-based gene trees from the same genomes are due to horizontal gene transfer (HGT); and v. that evolution tends to proceed from the simple to the complex. The factual challenges to these mainstream preconceptions are numerous [15]. In fact, genome trees tell a very different story: i. 3D protein domain structures are the molecular fossils of evolution, while their coding sequences are eternal transients. ii.
Species trees are very different from gene trees because genes evolve more or less independently while the organism as a whole (or its genome) represents a higher level of complexity than individual genes. iii. The ToL is rooted in a surprisingly complex universal common ancestor
(UCA), from which akaryotes evolved primarily by gene loss while eukaryotes evolved primarily by duplications. iv. HGT including endosymbiosis is a negligible player in genome evolution from UCA to the present. v. Finally, relatively few protein domains-SFs are novel additions to evolving proteomes. That is to say, molecular evolution is not in general a story about small, simple genomes becoming large, and complex. Life has been more interesting than that.
References
[1] C. Chothia, One thousand families for the molecular biologist, Nature 357 (1992) 543–544.
[2] S.E. Brenner, C. Chothia, T.J.P. Hubbard, Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships, Proc. Natl. Acad. Sci. 95 (1998) 6073–6078.
[3] J. Park, K. Karplus, C. Barrett, R. Hughey, D. Haussler, T. Hubbard, C. Chothia, Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods, J. Mol. Biol. 284 (1998) 1201–1210.
[4] A.G. Murzin, How far divergent evolution goes in proteins, Curr. Opin. Struct. Biol. 8 (1998) 380–387.
[5] K.A. Mackin, R.A. Roy, D.L. Theobald, An empirical test of convergent evolution in rhodopsins, Mol. Biol. Evol. 31 (2014) 85–95.
[6] T.L. Blundell, S.P. Wood, Is the evolution of insulin Darwinian or due to selectively neutral mutation? Nature 257 (1975) 197–203.
[7] M.A. DePristo, D.M. Weinreich, D.L. Hartl, Missense meanderings in sequence space: a biophysical view of protein evolution, Nat. Rev. Genet. 6 (2005) 678–687.
[8] M. Wang, C.G. Kurland, G. Caetano-Anollés, Reductive evolution of proteomes and protein structures, Proc. Natl. Acad. Sci. U. S. A. 108 (2011) 11954–11958.
[9] I. Sillitoe, N. Dawson, J. Thornton, C. Orengo, The history of the CATH structural classification of protein domains, Biochimie 119 (2015) 209–217.
[10] P.G. Wolynes, Evolution, energy landscapes and the paradoxes of protein folding, Biochimie 119 (2015) 218–230.
[11] A. Abroi, A protein domain-based view of the virosphere–host relationship, Biochimie (2015).
[12] T. Bitard-Feildel, M. Heberlein, E. Bornberg-Bauer, I. Callebaut, Detection of orphan domains in Drosophila using “hydrophobic cluster analysis”, Biochimie 119 (2015) 244–253.
[13] M.A. Fares, Survival and innovation: the role of mutational robustness in evolution, Biochimie 119 (2015) 254–261.
[14] T. Gabaldón, A.A. Pittis, Origin and evolution of metabolic sub-cellular compartmentalization in eukaryotes, Biochimie 119 (2015) 262–268.
[15] C.G. Kurland, A. Harish, The phylogenomics of protein structures: the backstory, Biochimie 119 (2015) 284–302.
[16] V. Linkeviciute, O.J.L. Rackham, J. Gough, M.E. Oates, H. Fang, Function-selective domain architecture plasticity potentials in eukaryotic genome evolution, Biochimie 119 (2015) 269–277.
[17] D. Penny, B. Zhong, Two fundamental questions about protein evolution, Biochimie 119 (2015) 278–283.
[18] E. Zuckerkandl, L. Pauling, Molecules as documents of evolutionary history, J. Theor. Biol. 8 (1965) 357–366.
[19] C.R. Woese, O. Kandler, M.L. Wheelis, Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya, Proc. Natl. Acad. Sci. U. S. A. 87 (1990) 4576–4579.
[20] A. Harish, A. Tunlid, C.G. Kurland, Rooted phylogeny of the three superkingdoms, Biochimie 95 (2013) 1593–1604.
[21] A.G. Murzin, S.E. Brenner, T. Hubbard, C. Chothia, SCOP: a structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol. 247 (1995) 536–540.
[22] C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, J.M. Thornton, CATH – a hierarchic classification of protein domain structures, Structure 5 (1997) 1093–1108.
[23] D. Penny, L. Collins, Evolutionary genomics leads the way, in: G. Caetano-Anollés (Ed.), Evolutionary Genomics and Systems Biology, John Wiley & Sons, Inc., Hoboken, New Jersey, 2010.
[24] H. Philippe, H. Brinkmann, D.V. Lavrov, D.T.J. Littlewood, M. Manuel, G. Wörheide, D. Baurain, Resolving difficult phylogenetic questions: why more sequences are not enough, PLoS Biol. 9 (2011).
[25] N. Lartillot, H. Brinkmann, H. Philippe, Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model, BMC Evol. Biol. 7 (2007) S4.
[26] L. Salichos, A. Rokas, Inferring ancient divergences requires genes with strong phylogenetic signals, Nature 497 (2013) 327–331.
[27] J. Gough, K. Karplus, R. Hughey, C. Chothia, Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure, J. Mol. Biol. 313 (2001) 903–919.
[28] D.H. Erwin, Macroevolution is more than repeated rounds of microevolution, Evol. Dev. 2 (2000) 78–84.
[29] G.G. Simpson, Tempo and Mode in Evolution, Columbia University Press, New York, 1944.
[30] W.F. Doolittle, Phylogenetic classification and the universal tree, Science 284 (1999) 2124–2128.
[31] S. Nelson-Sathi, F.L. Sousa, M. Roettger, N. Lozada-Chavez, T. Thiergart, A. Janssen, D. Bryant, G. Landan, P. Schonheit, B. Siebers, J.O. McInerney, W.F. Martin, Origins of major archaeal clades correspond to gene acquisitions from bacteria, Nature 517 (2015) 77–80.
[32] W.F. Doolittle, The nature of the universal ancestor and the evolution of the proteome, Curr. Opin. Struct. Biol. 10 (2000) 355–358.
[33] W. Hennig, Phylogenetic Systematics, University of Illinois Press, 1966.
[34] D.L. Theobald, A formal test of the theory of universal common ancestry, Nature 465 (2010) 219–222.
[35] J. Gough, Convergent evolution of domain architectures (is rare), Bioinformatics 21 (2005) 1464–1471.
Charles G. Kurland*
Microbial Ecology, Department of Biology, Lund University, Ecology Building (Sölvegatan 37), SE-223 62 Lund, Sweden
Ajith Harish**
Structure and Molecular Biology, Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, 751 24 Uppsala, Sweden
* Corresponding author. E-mail address: [email protected] (C.G. Kurland).
** Corresponding author. E-mail address: [email protected] (A. Harish).
Available online 4 November 2015