FEMS Microbiology Letters 201 (2001) 187–191
www.fems-microbiology.org
On surrogate methods for detecting lateral gene transfer
Mark A. Ragan *
Institute for Molecular Bioscience, The University of Queensland, Brisbane, Qld 4072, Australia
Received 17 May 2001; accepted 29 May 2001
First published online 25 June 2001

Abstract
Surrogate methods for detecting lateral gene transfer are those that do not require inference of phylogenetic trees. Herein I apply four such methods to identify open reading frames (ORFs) in the genome of Escherichia coli K12 that may have arisen by lateral gene transfer. Only two of these methods detect the same ORFs more frequently than expected by chance, whereas several intersections contain many fewer ORFs than expected. Each of the four methods detects a different non-random set of ORFs. The methods may detect lateral ORFs of different relative ages; testing this hypothesis will require rigorous inference of trees. © 2001 Federation of European Microbiological Societies. Published by Elsevier Science B.V. All rights reserved.

Keywords: Microbial evolution; Lateral gene transfer; Horizontal gene transfer; Nucleotide composition; Markov model; GC content
* Tel.: +61 (7) 3365-1160; Fax: +61 (7) 3365-4388; E-mail: [email protected]

1. Introduction

Darwin's explanation of organismal diversity – as the consequence of genealogical descent, with genetic modification, as on a bifurcating tree – stands as a landmark of biology, indeed of modern scientific enquiry. This paradigm unifies diverse observations, and connects them to underlying genetic processes now understood at the molecular level.

However, Darwin's paradigm has been, and continues to be, violated in certain circumstances. Plastids and mitochondria originated not by treelike 'vertical' descent within the eukaryotic lineage, but rather by lateral transfer of genetic information from bacteria; and subsequent plastid diversification has involved further lateral events [1]. Plasmids spread resistance to antibiotics laterally among bacterial populations [2].

Organelles and plasmids might be considered special cases, of limited extent in time and space – complications, not threats, to Darwin's paradigm. The same cannot be said of new analyses, arising in part from the wealth of newly available prokaryotic genome sequences, that appear to reveal rampant lateral gene transfer (LGT; also known as horizontal gene transfer) among microbes [3–13]. The most convincing among these are based on phylogenetic inference. Well-supported topological disagreement between the tree inferred for one gene family and that inferred for another can often be parsimoniously explained only by invoking LGT [14–17].

However, it is difficult to extend these studies to all known genes and genomes. Many gene families are intrinsically of restricted phyletic distribution; and some genes accept changes so rapidly that orthologs cannot confidently be identified or aligned, unavoidably yielding sparse trees with weakly supported topological features. Large data sets, on the other hand, pose computational challenges both for inferring trees and for assessing confidence intervals of subtrees. Unknown support for subtrees of unknown optimality is not a promising foundation from which to assess the prevalence of LGT.

Thus there has been considerable interest in developing methods by which laterally transferred genes can be identified without the need to infer gene or protein trees. At least seven such surrogate methods have been put forward [3–13], some applicable only to regions (e.g. 50-kb windows) of genomic sequence, others to individual open reading frames (ORFs). Each genomic region or ORF is assessed to determine how typical it is of other regions or ORFs in the genome under investigation. Depending on how the assessment criteria are structured, atypicality may indicate that the genomic region or ORF has an origin different from that of the rest of the genome, i.e. has arisen by LGT. Here I focus on surrogate methods that can be applied to individual ORFs.

Lawrence and Ochman [3–5] identified ORFs in the
0378-1097/01/$20.00 © 2001 Federation of European Microbiological Societies. Published by Elsevier Science B.V. All rights reserved. PII: S0378-1097(01)00262-2
FEMSLE 10016 16-7-01
genome of Escherichia coli K12 that have nucleotide contents [3] or codon usage patterns [4] atypical of other E. coli ORFs. They demonstrated that nucleotide composition at each codon position is, at equilibrium, linearly related to nucleotide composition of the entire genome, and described a process of 'amelioration' [3] during which introgressed ORFs become progressively more like those of the host. They identified a set of E. coli ORFs, some 17.5% of the total, that by these criteria are candidates for having arisen by LGT. This approach was recently criticized for high rates of both false positives and false negatives [18].

Coding regions differ from non-coding regions in statistically definable ways [19–21] that can be formally expressed as Markov models. These models can be used to identify genes in microbial [6,22,23] and other genomes. The models are parameterized on training sets from specific genomes, and thus tend to be genome-specific. Nonetheless, even when optimized in this way and run to convergence, they usually fail to detect all genes known to be present in that genome. Hayes and Borodovsky [6] showed that a model trained on ORFs identified (by various criteria) as atypical for a given genome could efficiently recognize many ORFs not found by a model trained on 'typical' ORFs. In the case of E. coli, these additional ORFs show a functional profile distinct, in part, from that of typical ORFs. These authors proposed that many ORFs detected by the atypical model had been laterally transferred into the E. coli genome [6].

A third and conceptually different approach has been introduced by Clarke et al. (submitted). These authors sorted GenBank by species to create a target database for BLASTP analysis, and identified ORFs that show patterns of BLAST matches significantly different from the median pattern shown by ORFs in that same genome. These ORFs were termed 'phylogenetically discordant' because if a distance matrix were constituted from the numerical values of the BLASTP analysis and a tree generated, its topology would presumably be anomalous for ORFs in that genome. Indeed, removal of discordant sequences strengthened bootstrap support for key topological features in whole-genome distance trees calculated from mean pairwise BLASTP expectation scores. LGT is the most obvious means by which an ORF could have a genealogical history different from that of its host genome.

Finally, LGT may generate unusual patterns of gene distribution among organisms. Catastrophic gene loss or lineage-specific rates of sequence change can also generate unusual distributional patterns, especially if they occur in multiple lineages (i.e. on multiple occasions). Occasionally, independent evidence (e.g. from gene co-localization, or remnants of transposons) may implicate LGT as the cause of a particular distribution [12]; but in the absence of such evidence, how are patterns caused by LGT to be distinguished from those explicable within the framework of vertical transmission? Ragan and Charlebois (submitted) sorted GenBank into hierarchical taxa, and implemented a dual-threshold criterion to distinguish LGT from non-LGT patterns. A BLASTP match with expectation value better than a rigorous inclusion threshold was taken to indicate the presence of a putative homolog in the target taxon, while lack of a BLASTP match with expectation value better than a more permissive exclusion threshold implies that homologs are absent. As the inclusion threshold is made more stringent and the exclusion threshold less stringent, lineage-specific rate effects are progressively filtered out, with only the clearest patterns remaining. Furthermore, the more sparse the phyletic distribution, the less parsimoniously can the pattern be attributed to multiple catastrophic losses. By this approach, these authors identified, for each of 23 bacterial genomes, a set of ORFs with homologs in only one non-self bacterial phylum. Most of these homologs occur among Firmicutes and Proteobacteria, which in a resolved bifurcating tree cannot be sister lineages to all 23 of the bacteria in question. Thus overall, these sets must be enriched in laterally transferred ORFs.

Herein I examine whether these four surrogate methods detect the same ORFs in the genome of E. coli K12 [24]. Any ORF with an atypical base composition, detected by an atypical Markov model, presenting an atypical pattern of BLAST matches, and with its only sure homolog in a non-adjacent lineage, would bear the weight of circumstantial evidence against sharing a common phylogenetic history with its more typical counterparts in the same genome. Alternatively, should these approaches fail to identify a common set of ORFs, we might call into question their validity or generality as surrogate methods for detecting LGT.

2. Materials and methods

Datasets were generated by analysis of the genome sequence of E. coli K12 MG1655 (GenBank accession U00096).
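The dual-threshold criterion of Ragan and Charlebois described in the Introduction can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary of best expectation values per phylum, and the default thresholds are hypothetical, and the rule that every other phylum must fall clearly on the 'absent' side is an assumed reading of the criterion.

```python
import math
from typing import Dict


def classify_phyla(best_evalue: Dict[str, float],
                   inclusion: float = 1e-10,
                   exclusion: float = 1e-5) -> Dict[str, str]:
    """Label each target phylum 'present', 'absent' or 'ambiguous'.

    best_evalue maps a phylum name to the best (lowest) BLASTP expectation
    value obtained for one query ORF; math.inf stands for "no hit at all".
    The thresholds are illustrative defaults, not values fixed by the source.
    """
    labels = {}
    for phylum, e in best_evalue.items():
        if e <= inclusion:
            labels[phylum] = "present"    # match better than inclusion threshold
        elif e >= exclusion:
            labels[phylum] = "absent"     # no match better than exclusion threshold
        else:
            labels[phylum] = "ambiguous"  # falls between the two thresholds
    return labels


def has_single_phylum_profile(best_evalue: Dict[str, float],
                              self_phylum: str,
                              inclusion: float = 1e-10,
                              exclusion: float = 1e-5) -> bool:
    """True if exactly one non-self phylum is 'present' and none is 'ambiguous'.

    Assumption: an ORF qualifies only when all remaining phyla are clearly absent.
    """
    others = {p: e for p, e in best_evalue.items() if p != self_phylum}
    labels = classify_phyla(others, inclusion, exclusion)
    present = [p for p, label in labels.items() if label == "present"]
    ambiguous = [p for p, label in labels.items() if label == "ambiguous"]
    return len(present) == 1 and not ambiguous


# Hypothetical example values, invented purely for illustration.
example_hits = {
    "Proteobacteria": 1e-80,   # the query genome's own phylum
    "Firmicutes": 1e-25,
    "Cyanobacteria": math.inf,
    "Spirochaetes": 0.3,
}
flagged = has_single_phylum_profile(example_hits, self_phylum="Proteobacteria")
```

Tightening `inclusion` and relaxing `exclusion` widens the ambiguous band, which corresponds to the filtering of lineage-specific rate effects described above.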
An updated list of ORFs having anomalous base compositions [3] was generously provided by Jeffrey Lawrence. An updated list of ORFs found by an atypical Markov model [6] was kindly supplied by Mark Borodovsky. The list of phylogenetically discordant ORFs combines results from six separate PDS analyses, i.e. two BLASTP expectation thresholds ($e \le 10^{-10}$ and $e \le 10^{-5}$) and three levels of GenBank target sets (delineation at NCBI level 0, all sequences; level 1, Bacteria; and level 2, Proteobacteria), and was graciously furnished by Robert Charlebois. ORFs with anomalous phyletic distributions are here defined as those finding a BLASTP match in exactly one non-self bacterial phylum (second-level NCBI category) at inclusion thresholds $e \le 10^{-10}$ or $e \le 10^{-20}$, and exclusion threshold $e \ge 10^{-5}$. The four ORF lists were combined in a single spreadsheet and indexed by NCBI gi number; where necessary, ORFs were
re-indexed to gi number by BLAST comparison against GenBank U00096. ORF spacings were based on the order in U00096, ignoring tRNA and other non-protein-coding genes. Intersection sizes and ORF spacings expected under a stochastic model (see below) were determined by Monte Carlo simulations (n = 10 000 replicates) assuming the location of each ORF to be independent of every other ORF.

3. Results

Numbers of ORFs in the genome of E. coli K12 MG1655 identified by the four surrogate methods, individually and in combination, are shown in Table 1. It is not surprising that different numbers of ORFs are identified, as each method utilizes (explicitly or implicitly) one or more thresholds which were not here selected in any standardized or coordinated way (indeed, no conceptual framework exists for doing so). The extent to which the four methods identify (insofar as possible) the same atypical ORFs is the more relevant and interesting question.

The null model posits that each method and ORF is independent. Consider genome G containing $N_{GEN}$ ORFs. If by application of method A to genome G we identify $N_A$ atypical ORFs, and by method B identify $N_B$ atypical ORFs, then we expect $N_{AB} = (N_A/N_{GEN}) \times (N_B/N_{GEN}) \times N_{GEN}$ in the intersection (identified by both methods), and $N_{Ab} = (N_A/N_{GEN}) \times [1 - (N_B/N_{GEN})] \times N_{GEN}$ to be found by A but not by B. This model
is easily extended to all intersections among multiple independent methods. If we observe significantly more than the expected number of ORFs in an intersection, those methods preferentially target the same ORFs.

Table 1 shows observed and expected numbers of E. coli ORFs in all intersections among these four surrogate methods. Because little is known of the statistical distribution of ORF subclasses, confidence was assessed relative to full and 95% ranges of counts found in 10 000 simulations. Several results stand out as very different from expectations under the null model. In particular, more than twice as many ORFs as expected are compositionally atypical and identified by an atypical Markov model; some of these are also phylogenetically discordant, or have an atypical distributional profile. Hayes and Borodovsky [6] observed that lengthy ORFs found by their atypical model tend to have atypical oligonucleotide compositions.

Except where both base composition and Markov model methods are involved, however, an E. coli ORF judged atypical by any one of these methods is less likely to be found atypical by another method than expected by chance. For example, ORFs that are compositionally atypical and phylogenetically discordant are 69% fewer than expected. ORFs that have both atypical distributional profiles and atypical base compositions, or atypical profiles and are phylogenetically discordant, are 40% fewer than expected (Table 1).

Two lines of evidence confirm that the four surrogate methods in fact detect different subsets of the E. coli genome. First, the four atypical sets exhibit very different ORF-to-ORF spacings (Table 2), with 70% of compositionally atypical ORFs, but only 11% of phylogenetically discordant ORFs and 13% of those with unexpected distributions, adjacent to another such ORF. Conversely, 23% of the latter ORFs, but only 6% of those found by an atypical Markov model and 8% of those atypical by base composition, are located 21 ORFs or more from the next such ORF downstream. Second, the base composition method identifies 47 insertion sequences and transposases, whereas an atypical Markov model finds 17, the anomalous distribution method three, and phylogenetic discordance only one (results not shown).

Table 1
Numbers of E. coli ORFs identified by four surrogate methods, individually and in combinations

| Criterion | # ORFs exp | Sim range | Sim 95% | # ORFs obs | % +/− |
|---|---|---|---|---|---|
| All GC | – | – | – | 752 | – |
| All MM | – | – | – | 650 | – |
| All PD | – | – | – | 416 | – |
| All DP | – | – | – | 297 | – |
| GC only | 535.1 | 492–575 | 514–559 | 462** | −14 |
| MM only | 449.6 | 407–490 | 430–472 | 332** | −26 |
| PD only | 270.4 | 230–304 | 253–289 | 322** | +19 |
| DP only | 187.3 | 157–216 | 171–203 | 200 | +7 |
| GC∩MM | 95.5 | 67–129 | 79–112 | 201** | +110 |
| GC∩PD | 57.4 | 33–85 | 45–71 | 18** | −69 |
| GC∩DP | 39.8 | 20–60 | 29–52 | 24* | −40 |
| MM∩PD | 48.3 | 25–73 | 36–61 | 39 | −19 |
| MM∩DP | 33.4 | 15–54 | 23–44 | 32 | −4 |
| PD∩DP | 20.1 | 7–37 | 12–29 | 12 | −40 |
| GC∩MM∩PD | 10.2 | 1–23 | 5–17 | 20* | +96 |
| GC∩MM∩DP | 7.1 | 0–20 | 3–13 | 24** | +238 |
| GC∩PD∩DP | 4.3 | 0–13 | 1–9 | 3 | – |
| MM∩PD∩DP | 3.6 | 0–14 | 1–8 | 2 | – |
| GC∩MM∩PD∩DP | 0.8 | 0–5 | 0–3 | 0 | – |
| GC∪MM∪PD∪DP | 1763 | 1716–1823 | 1740–1794 | 1691** | −4.1 |

Expectations are calculated as described in the text. Simulations range is total range over 10 000 simulations; simulations 95% interval was calculated by removing 2.5% of simulations from each end of the distribution. Asterisks mark observed values outside the full range (**) or 95% interval (*) of simulations. GC, base composition method [3]; MM, Markov model method [6]; PD, phylogenetic discordance method (Clarke et al., unpublished); DP, distributional profile method (Ragan and Charlebois, submitted). Boolean symbols: ∩, intersection of the data sets; ∪, union of the data sets.

Table 2
Spatial distribution of ORFs identified by each of the four surrogate methods

ORF-to-ORF spacing, in numbers of ORFs:

| Method | | 1 | 2 | 3–5 | 6–10 | 11–20 | 21–30 | 31–50 | 51+ |
|---|---|---|---|---|---|---|---|---|---|
| GC | observed | 69.9** | 1.9** | 6.8** | 6.0** | 7.9** | 3.1** | 3.5** | 1.5** |
| GC | sim mean | 17.5 | 14.5 | 11.9 | 24.6 | 19.5 | 10.3 | 1.5 | 0.2 |
| GC | sim range | 12.8–21.9 | 15.0–20.1 | 7.6–16.8 | 18.8–30.1 | 14.6–24.6 | 6.9–13.8 | 0.3–3.1 | 0.0–0.9 |
| GC | sim 95% | 15.0–20.1 | 12.1–16.9 | 9.7–14.2 | 21.5–27.7 | 16.9–22.3 | 8.6–12.0 | 0.8–2.3 | 0.0–0.7 |
| MM | observed | 28.3** | 13.7 | 21.5** | 16.2** | 14.5** | 3.2** | 2.5 | 0.2 |
| MM | sim mean | 15.1 | 12.8 | 10.9 | 23.8 | 20.9 | 13.2 | 2.6 | 0.6 |
| MM | sim range | 10.2–20.3 | 8.6–18.0 | 6.6–15.4 | 18.2–30.0 | 15.4–25.5 | 8.9–17.7 | 0.6–5.2 | 0.0–1.8 |
| MM | sim 95% | 10.8–17.7 | 10.5–15.2 | 8.6–13.2 | 20.6–27.1 | 18.0–24.0 | 11.1–15.4 | 1.5–3.7 | 0.2–1.2 |
| PD | observed | 11.3 | 10.1 | 22.1** | 19.5 | 23.3 | 9.4** | 3.4* | 1.0* |
| PD | sim mean | 9.7 | 8.7 | 7.9 | 19.4 | 21.7 | 20.9 | 7.5 | 3.7 |
| PD | sim range | 4.6–15.1 | 4.3–13.9 | 3.6–13.2 | 12.7–27.4 | 15.1–28.1 | 13.9–28.4 | 2.9–12.5 | 0.7–7.2 |
| PD | sim 95% | 7.0–12.5 | 6.3–11.5 | 5.5–10.3 | 15.9–23.1 | 17.8–25.7 | 17.1–24.8 | 5.3–9.9 | 2.2–5.3 |
| DP | observed | 12.8** | 9.4* | 15.5** | 14.1 | 24.6* | 12.5** | 7.4* | 3.4** |
| DP | sim mean | 6.9 | 6.4 | 6.0 | 15.6 | 19.6 | 23.3 | 11.4 | 8.3 |
| DP | sim range | 2.7–12.5 | 1.7–12.5 | 1.7–11.4 | 8.8–23.6 | 11.1–29.0 | 14.8–33.0 | 5.4–18.9 | 3.7–13.8 |
| DP | sim 95% | 4.4–9.8 | 3.7–9.1 | 3.4–8.8 | 11.8–19.5 | 15.2–23.9 | 18.5–28.3 | 8.1–15.2 | 5.7–10.8 |

The observed value of the distance between the ORFs, mean and full range of simulations, and interval containing 95% of simulated values, expressed as percent of ORFs in each category. Asterisks mark observed values outside the full range (**) or 95% interval (*) of simulations. Rows may not add to 100% due to rounding. Abbreviations for the methods are as in Table 1.

4. Discussion

These four surrogate methods fail almost completely to identify a common set of E. coli ORFs. The pair of methods overlapping most in their predictions – the base composition approach of Lawrence and Ochman [3] and the atypical Markov model of Hayes and Borodovsky [6] – identify the same ORFs only about twice as frequently as expected by chance. Several pairs of methods find common ORFs much less frequently than by chance. Indeed, surprisingly few intersections fall within even the full range of simulated values. The four predicted sets of ORFs are spaced differently across the E. coli genome, and include very different numbers of insertion sequences and transposons. Thus each surrogate method does in fact find a non-random set of ORFs, whatever its nature. If these are laterally transferred ORFs, there must be several distinct subsets.
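The 'expected by chance' figures that this comparison rests on follow from the null model stated in the Results, which can be sketched in Python as below. This is a hedged illustration rather than the code used for the study: the function names are hypothetical, the genome ORF count `n_gen` is left as a parameter because the text does not state it, and two readings are assumed, namely that each ORF is counted in exactly one combination of methods and that each simulated set is a fixed-size random sample of ORFs drawn without replacement.

```python
import itertools
import random
from typing import Dict, List, Tuple

Combo = Tuple[str, ...]


def expected_exclusive_counts(set_sizes: Dict[str, int],
                              n_gen: int) -> Dict[Combo, float]:
    """Expected number of ORFs flagged by exactly each combination of methods.

    For two methods this reduces to N_AB = (N_A/N_GEN)(N_B/N_GEN)N_GEN and
    N_Ab = (N_A/N_GEN)[1 - (N_B/N_GEN)]N_GEN from the text.
    """
    names = list(set_sizes)
    expected = {}
    for r in range(1, len(names) + 1):
        for combo in itertools.combinations(names, r):
            value = float(n_gen)
            for name in names:
                p = set_sizes[name] / n_gen
                # Multiply by p if the method flags the ORF, else by (1 - p).
                value *= p if name in combo else (1.0 - p)
            expected[combo] = value
    return expected


def simulate_exclusive_counts(set_sizes: Dict[str, int],
                              n_gen: int,
                              replicates: int = 1000,
                              seed: int = 0) -> List[Dict[Combo, int]]:
    """Monte Carlo counterpart: draw each method's ORF set independently."""
    rng = random.Random(seed)
    names = list(set_sizes)
    results = []
    for _ in range(replicates):
        # Assumed sampling scheme: fixed-size sets, without replacement.
        drawn = {n: set(rng.sample(range(n_gen), set_sizes[n])) for n in names}
        counts: Dict[Combo, int] = {}
        for orf in set().union(*drawn.values()):
            key = tuple(n for n in names if orf in drawn[n])
            counts[key] = counts.get(key, 0) + 1
        results.append(counts)
    return results


# Toy numbers chosen for illustration only; they are not data from Table 1.
toy_expected = expected_exclusive_counts({"A": 50, "B": 20}, n_gen=100)
toy_sims = simulate_exclusive_counts({"A": 50, "B": 20}, n_gen=100, replicates=100)
```

Sorting the simulated counts for one combination and trimming 2.5% from each end would give an interval comparable in spirit to the 'Sim 95%' column of Table 1.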
The observed frequencies at which each method detects insertion sequences and transposases suggest an explanation. Base composition differences are in many cases 'ameliorated' over some tens to a few hundred million years [3,5]. Compositional difference may thus preferentially detect recent lateral transfers, including transposable elements. The atypical Markov models detect some of these, plus ORFs whose base compositions (but not oligonucleotide frequencies or codon usage) have equilibrated with their new genomic background. The other two methods explicitly focus on cross-phylum and cross-domain patterns, hence might be expected to detect more ancient events.

This study demonstrates the need for a systematic, comprehensive approach to the study of LGT based on first principles, i.e. rigorous inference and statistically based comparison of molecular phylogenetic trees. As more genomic sequences appear, a tree-based approach will become both more challenging and more rewarding. Perhaps only by such an approach will we ultimately learn what these surrogate methods are actually detecting.

Acknowledgements

I thank Robert Charlebois, Jeff Lawrence and Mark Borodovsky for access to data; Robert Charlebois and W. Ford Doolittle for discussions; Ian Bailey-Mortimer and Emily McGhie for expert programming; and the Canadian Institute for Advanced Research for fellowship support.
References

[1] Gray, M.W. (1998) Evolution of organellar genomes. Curr. Opin. Genet. Dev. 9, 678–687.
[2] Spratt, B.G. and Maiden, M.C. (1999) Bacterial population genetics, evolution and epidemiology. Phil. Trans. R. Soc. Lond. Biol. 354, 701–710.
[3] Lawrence, J.G. and Ochman, H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol. 44, 383–397.
[4] Lawrence, J.G. and Ochman, H. (1998) Molecular archaeology of the Escherichia coli genome. Proc. Natl. Acad. Sci. USA 95, 9413–9417.
[5] Ochman, H. and Lawrence, J.G. (1996) Phylogenetics and the amelioration of bacterial genomes. In: Escherichia coli and Salmonella. Cellular and Molecular Biology, 2nd edn., Vol. 2 (Neidhardt, F.C., Curtis, R., III, Ingraham, J.L., Lin, E.C.C., Low, K.B., Magasanik, B., Reznikoff, W.S., Riley, M., Schaechter, M. and Umbarger, H.E., Eds.), pp. 2627–2637. American Society for Microbiology, Washington, DC.
[6] Hayes, W.S. and Borodovsky, M. (1998) How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res. 8, 1154–1171.
[7] Koonin, E.V., Mushegian, A.R., Galperin, M.Y. and Walker, D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol. 25, 619–637.
[8] Koonin, E.V. and Galperin, M.Y. (1997) Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr. Opin. Genet. Dev. 7, 757–763.
[9] Karlin, S., Mrázek, J. and Campbell, A.M. (1998) Codon usages in different gene classes of the Escherichia coli genome. Mol. Microbiol. 29, 1341–1355.
[10] Karlin, S., Brocchieri, L., Mrázek, J., Campbell, A.M. and Spormann, A.M. (1999) A chimeric prokaryotic ancestry of mitochondria and primitive eukaryotes. Proc. Natl. Acad. Sci. USA 96, 9190–9195.
[11] Makarova, K.S., Aravind, L., Galperin, M.Y., Grishin, N.V., Tatusov, R.L., Wolf, Y.I. and Koonin, E.V. (1999) Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell. Genome Res. 9, 608–628.
[12] Nelson, K.E., Clayton, R.A., Gill, S.R., Gwinn, M.L., Dodson, R.J., Haft, D.H., Hickey, E.K., Peterson, J.D., Nelson, W.C., Ketchum, K.A., McDonald, L., Utterback, T.R., Malek, J.A., Linher, K.D., Garrett, M.M., Stewart, A.M., Cotton, M.D., Pratt, M.S., Phillips, C.A., Richardson, D., Heidelberg, J., Sutton, G.G., Fleischmann, R.D., Eisen, J.A., White, O., Salzberg, S.L., Smith, H.O., Venter, J.C. and Fraser, C.M. (1999) Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399, 323–329.
[13] Worning, P., Jensen, L.J., Nelson, K.E., Brunak, S. and Ussery, D.W. (2000) Structural analysis of DNA sequence: evidence for lateral gene transfer in Thermotoga maritima. Nucleic Acids Res. 28, 706–709.
[14] Smith, M.W., Feng, D.-F. and Doolittle, R.F. (1992) Evolution by acquisition: the case for horizontal gene transfers. Trends Biochem. Sci. 17, 489–493.
[15] Brown, J.R. and Doolittle, W.F. (1997) Archaea and the prokaryote-to-eukaryote transition. Microbiol. Mol. Biol. Rev. 61, 456–502.
[16] Jain, R., Rivera, M.C. and Lake, J.A. (1998) Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. USA 96, 3801–3806.
[17] Nesbø, C.L., Boucher, Y. and Doolittle, W.F. (2001) Comparative genomics of four archaea: is there a core of euryarchaeal non-transferable proteins? J. Mol. Evol., in press.
[18] Koski, L.B., Morton, R.A. and Golding, G.B. (2001) Codon bias and base composition are poor indicators of horizontally transferred genes. Mol. Biol. Evol. 18, 404–412.
[19] Fickett, J. (1982) Recognition of protein-coding regions in DNA sequences. Nucleic Acids Res. 10, 5303–5318.
[20] Gribskov, M., Devereux, J. and Burgess, R.R. (1984) The codon preference plot: graphic analysis of protein coding sequences and prediction of gene expression. Nucleic Acids Res. 12, 539–549.
[21] Staden, R. (1984) Measurements of the effect that coding for a protein has on DNA sequence and their use for finding genes. Nucleic Acids Res. 12, 551–567.
[22] Lukashin, A.V. and Borodovsky, M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26, 1107–1115.
[23] Salzberg, S.L., Delcher, A.L., Kasif, S. and White, O. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26, 544–548.
[24] Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., Gregor, J., Davis, N.W., Kirkpatrick, H.A., Goeden, M.A., Rose, D.J., Mau, B. and Shao, Y. (1997) The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474.