Molecular Phylogenetics and Evolution 48 (2008) 313–325
Reticulate or tree-like chloroplast DNA evolution in Sileneae (Caryophyllaceae)?
Per Erixon *, Bengt Oxelman
Department of Systematic Botany, Evolutionary Biology Centre, Uppsala University, Norbyv. 18D, SE-75236 Uppsala, Sweden
Department of Plant and Environmental Sciences, Göteborg University, Box 461, SE-40530 Göteborg, Sweden
Article info
Article history: Received 7 December 2007; Revised 4 April 2008; Accepted 7 April 2008; Available online 14 April 2008
Keywords: Sileneae; cpDNA; Chloroplast recombination; Incongruence; Indel characters
Abstract
Despite sampling of up to 25 kb of chloroplast DNA sequence from 24 species in Sileneae, a number of nodes in the phylogeny remain poorly supported, and additional sequence sampling is not expected to converge on a reliable phylogenetic hypothesis in these parts of the tree. The main reason for this is probably a combination of rapid radiation and substitution rate heterogeneity. Poor resolution among closely related species is often explained by low levels of variation in chloroplast data, but the problem with our data appears to be high levels of homoplasy. Tree-like cpDNA evolution cannot be rejected, but apparently incongruent patterns between different regions are evaluated with the possibility of ancient interspecific chloroplast recombination as an explanatory model. However, several major phylogenetic relationships, previously not recognized, are confidently resolved. For example, the grouping of the two SW Anatolian taxa S. cryptoneura and S. sordida strongly disagrees with previous studies on nuclear DNA sequence data and indicates a possible case of homoploid hybrid origin. The closely related S. atocioides and S. aegyptiaca form a sister group to Lychnis and the rest of Silene, thus suggesting that Silene may be paraphyletic, despite recent revisions based on molecular data. © 2008 Elsevier Inc. All rights reserved.
1. Introduction
The concept of the chloroplast genome as a single evolving unit is fundamental to plant systematics. This assumption rests on the facts that there are very few documented cases of recombination among chloroplast genomes of flowering plants (Medgyesy et al., 1985; Houliston and Olson, 2006), that there is no published evidence for lateral gene transfer (LGT) of chloroplast genes in land plants (see Archibald et al., 2003 for an example in chlorarachniophytes, and Rice and Palmer, 2006 for a cryptophyte/haptophyte example), and that chloroplast DNA (cpDNA) is predominantly uniparentally inherited (Wolfe and Randle, 2004). Therefore, DNA sequences from different parts of the chloroplast genome are assumed to have evolved on a single tree topology, even if the species containing those sequences have a history of hybridization. Sequence data from the nuclear genome are very different in this respect, because hybridization can cause nuclear genomes to merge and potentially recombine.

Early studies on plant phylogeny utilizing chloroplast DNA sequences were typically based on a single gene (e.g. rbcL, Chase et al., 1993). Biotechnological development has facilitated inclusion of more genes to increase resolution and support. For example, the early evolutionary diversification of flowering plants has been addressed by concatenation of a large number of chloroplast genes. Graham and Olmstead (2000) based their phylogenetic study of basal angiosperm relationships on 17 genes yielding 13,806 DNA characters, and Goremykin et al. (2003) studied fewer taxa from the same group, using sequences from 13 completely sequenced chloroplast genomes and 30,017 DNA characters from protein-coding genes. However, if taxon sampling is sparse, and the branches long, phylogenetic methods can be inconsistent (Felsenstein, 1978), leading to erroneous conclusions (Soltis et al., 2004; Qiu et al., 2005; Rydin and Källersjö, 2002; Stefanovic et al., 2004). Closely related taxa should be less problematic, because the branches of the phylogenetic tree are shorter and therefore less sensitive to inconsistencies of the phylogenetic model (Felsenstein, 1978; Philippe and Laurent, 1998; Bergsten, 2005). Thus, it might be expected that inferring phylogenies of closely related taxa should be easy, given that enough information is at hand. Ideally we would like to have branches with so few substitutions per site that the probability of multiple substitutions is close to zero, and at the same time sequences that are long enough to contain information for every branch. In chloroplast phylogenetic studies, protein-coding genes have often been dismissed for low-level relationships, presumably because of conservative negative selection that is expected to be associated with these. Instead, systematists have often used non-coding regions such as intergenic spacers and introns for shallow phylogenetic problems (e.g. Shaw et al., 2005, 2007).

* Corresponding author. Fax: +46 18 4716457. E-mail address: [email protected] (P. Erixon). 1055-7903/$ - see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.ympev.2008.04.015
P. Erixon, B. Oxelman / Molecular Phylogenetics and Evolution 48 (2008) 313–325
With decreased sequencing costs and increased amounts of available cpDNA sequence data, lack of information because of low sequence variability becomes less of a problem. Low-level systematics and relatively shallow phylogenetic problems could potentially benefit from the simplicity of cpDNA data, because biparentally inherited sequences and recombination are more problematic for closely related taxa (Álvarez and Wendel, 2003; Poke et al., 2006). Analyses based on cpDNA sequences do indeed often exhibit little homoplasy (e.g. Archibald et al., 2005; Popp and Oxelman, 2001, 2004; Popp et al., 2005; Frajman and Oxelman, 2007) relative to nrDNA. Because of the above-mentioned properties of the chloroplast genome, different DNA sequence regions are often concatenated without evaluation of possible incongruence (e.g. Bremer et al., 2002; Nishiyama et al., 2004). If incongruences among different regions from the chloroplast are found, they will be interpreted as artifacts, rather than the result of conflicting evolutionary histories (e.g. Graham and Olmstead, 2000). Hard incongruences (mutually well supported conflicting phylogenies) between chloroplast regions from closely related taxa are not known to us. Given the assumption that the chloroplast genome has evolved according to a single tree topology, poorly resolved phylogenetic trees should be the result of either poor fit of the model used for phylogenetic inference, or lack of information due to a limited number of characters. If, on the other hand, the tree model is invalid, for example due to recombination, this could also result in poor phylogenetic resolution.

The tribe Sileneae (Caryophyllaceae), a phylogenetically relatively shallow group in the angiosperm tree, includes several taxa with properties of interest for research on, e.g., breeding systems (e.g. Desfeux et al., 1996; Andersson-Ceplitis, 2002; McCauley et al., 2005), evolution of sex chromosomes (e.g. Vyskot and Hobza, 2004; Filatov, 2005), pollination (e.g. Kephart et al., 2006), host–parasite interactions (e.g. Hood and Antonovics, 2000), aberrant evolution of mitochondrial genes (e.g. Städler and Delph, 2002; Mower et al., 2007; Sloan et al., 2008), and strong positive selection in a chloroplast gene (Erixon and Oxelman, 2008). The group has been studied phylogenetically based on sequences from the nuclear ribosomal internal transcribed spacers (ITS) region (Oxelman and Lidén, 1995; Desfeux et al., 1996; Oxelman et al., 2001; Burleigh and Holtsford, 2003; Eggens et al., 2007; Frajman and Oxelman, 2007) and parts of four nuclear low-copy number RNA polymerase genes (Popp and Oxelman, 2004, 2007; Popp et al., 2005; Eggens et al., 2007). Hitherto, the only chloroplast regions used for inferring relationships in Sileneae have been the rps16 intron (Oxelman et al., 1997; Popp and Oxelman, 2004), the psbE-petL spacer (Popp et al., 2005) and the trnL/F intron/spacer region (Burleigh and Holtsford, 2003). The data from the previous chloroplast DNA studies cannot be combined because of different taxon sampling, but Popp and Oxelman (2004) presented a combined analysis of data from the rps16 intron, ITS and RNA polymerase genes, resulting in a highly resolved and supported tree. However, a few nodes were not well supported, and some recent data suggest strong disagreement between nuclear and plastid regions for particular groups (e.g. Frajman and Oxelman, 2007), and also between different nuclear regions (Frajman et al., unpublished) for some taxa. The sampling of Popp and Oxelman (2004) included only 29 of the c. 700 species in the tribe, and it is possible that there actually are more hard incongruences among the individual genes that remain undetected because of lack of information (i.e. the null hypothesis of a single tree topology could not be rejected). 
Many groups in Sileneae have a complex evolutionary history with several cases of allopolyploidization and hybridization (Oxelman, 1996; Popp and Oxelman, 2001; Popp et al., 2005; Frajman and Oxelman, 2007).
Given that recombination is more prevalent in the nuclear genome, we anticipate that a “backbone” of well-supported chloroplast DNA phylogenetic relationships could potentially provide a useful framework onto which nuclear phylogenies could be anchored. We use roughly 25 kb of DNA sequence from the large single copy region (LSC) in the chloroplast of 24 species in Sileneae. As previous cpDNA phylogenies based on short regions (e.g. rps16, Oxelman et al., 1997) indicate that the lack of resolution in some parts of the tree is due to lack of information rather than homoplasy, we wanted to collect large amounts of data for relatively few taxa to test this. Our approach is to sequence long contiguous regions, including both coding and non-coding regions, rather than to try to find many “super-informative” short regions. Shaw et al. (2005) evaluated 21 non-coding cpDNA regions and concluded that regions differ in variability. They did not find any extremely variable regions and their results also showed large differences between different taxonomic groups. In this paper we will (1) explore whether the major phylogenetic relationships in Sileneae become well resolved in a tree-like fashion by 25 kb of cpDNA sequence, (2) investigate the effect of incrementally adding data, and (3) explore potential causes of weak tree resolution.
2. Materials and methods
2.1. Plant material
An attempt was made to represent as much of the phylogenetic range of variation as possible within the tribe Sileneae. All genera (as recognized by Oxelman et al., 2001) in Sileneae (Agrostemma L., Atocion Adans., Eudianthe (Rchb.) Rchb., Heliosperma (=Ixoca) (Rchb.) Rchb., Lychnis L., Petrocoptis A. Br. ex Endl., Silene L., and Viscaria Bernh.) were sampled, including several representatives of the two approximately equally sized subgenera in Silene (Behen (Moench) Bunge and Silene). 
The three species in Silene subgenus Silene were chosen to represent the three major clades in this group based on previous studies (Oxelman and Lidén, 1995; Oxelman et al., 1997; Eggens et al., 2007). The resolution of subgenus Behen on the basis of previous molecular studies is poor (Oxelman et al., 1997; Popp and Oxelman, 2004). Consequently, we sampled more taxa from this group (eight taxa), mainly based on morphological variation and previous taxonomic work (Chowdhuri, 1957). Silene sordida and S. cryptoneura were included because they do not fall in either of the two subgenera and there are incongruent patterns between nuclear ribosomal ITS data (Oxelman and Lidén, 1995) and chloroplast data (Oxelman et al., 1997). Silene aegyptiaca and S. atocioides are morphologically very similar, but preliminary unpublished results showed surprisingly high sequence divergence and the two taxa did not group with S. cryptoneura, which could have been expected from morphological and phytogeographical patterns, as well as previous classifications (Coode and Cullen, 1967; Chowdhuri, 1957). Plant materials used in this study are presented with voucher data and GenBank/EMBL accession numbers in Table 1.
2.2. PCR and sequencing
The choice of the three largest sequence regions used in this study was guided by the identification of large contiguous regions with high proportions of non-coding sequences in the spinach genome (GenBank Accession No. NC_002202). The largest region (>18 kb; size measures in this section refer to the spinach genome) between rbcL and petB (Fig. 1, region 4) was initially divided into four subregions (4.3–4.9 kb), by constructing four primer pairs (rbcL-F2/cemA-R, cemA-F/petG-R, petL-F2/clpP-R, clpP-F/
Table 1 Voucher list and GenBank accession numbers Voucherab
Taxon
Agrostemma githago L.
Lychnis chalcedonica L.
Lychnis flos-cuculi L.
Lychnis flos-jovis Desr.
Petrocoptis pyrenaica (Bergeret) A. Br.
Silene aegyptiaca (L.) L.f.
Silene atocioides Boiss.
Silene conica L.
Silene cryptoneura Stapf
Silene fruticosa L.
Silene integripetala Bory and Chaub.
Silene latifolia Poir.
Silene littorea Brot.
Silene pseudoatocion Desf.
Silene samia Melzh. and D. Christodoulakis
Silene schafta S. G. Gmel. ex Hohen.
Silene sordida Hub.-Mor. and Reese
Silene sorensenis (B. Boivin) Bocquet
Silene uniflora Roth
Silene zawadskii Hort. ex Fenzl
Viscaria vulgaris Bernh.
1
2bc
Z831542
EU2216391
Z83160 Z83155 EF1181162
EU221636 EU221636 EU2216341
Z83160
1
EU308502
3
2
4
1
EU308519
2
EU3085032
EU314632 EU308520
4a
4b
4c
4d
EU3146671
EU3146471
EU3146431
EU3146571
EU314663 EU314664
EU314651 EU314648
EU314642 XX
EU314658 XX
EU314661 EU314662 EU314665
EU314649 EU314650 EU314653
EU314641 EU314640 EU314639
EU308523 EU308524 EU314660
EU314666
EU314652
EU314638
EU314659
1
Z83163 Z83166 Z83167 EU314654 EU314655 Z831702
EU221628 EU221629 EU221638 EU162232 EU221633 EU2216241
EU314630 EU314631 EU3146221
EU308517 EU308518 EU3085101
XX Z83188 AJ2949732
EU221626 EU308501 EU2216231
EU314624 EU314627 EU3146211
EU308512 EU308514 EU3085091
Z831712
EU2216251
EU3146231
EU3085111
Z83185 EU314656 Z83168 Z831942
EU221619 EU221630 EU221618 EU2216311
EU314617 EU314628 EU314616 EU3146291
EU308505 EU308515 EU308504 EU3085161
Z83186 AJ831773 Z831732
EU221627 EU221622 EU2216201
EU314625 EU314620 EU3146181
EU308513 EU308508 EU3085061
Z83177 Z83157
EU221621 EU221637
EU314619
EU308507
Atocion rupestre (L.) B. Oxelman Eudianthe laeta Willk. Heliosperma alpestre (Jacq.) Griseb.
Popp 1049 (UPS)1 Oxelman ITS-AGR 30616 (GB)2 Oxelman 2198 (GB) Oxelman 1876 (GB) Frajman 29.VI.2002 (LJU)1 Surina 136562 (LJU)2 Oxelman 2277 (GB)1 Erixon 68 (UPS)2 Oxelman 2200 (GB) Oxelman ITS-FLO 30610 (GB) Oxelman 2276 (GB) Edmondson and McClintock 2933 (E) Oxelman 1690 (GB) Erixon 70 (UPS)1 Oxelman 1944 (GB)2 Oxelman 1628 (GB) Oxelman and Tollsten 934 (GB) Oxelman 1902 (GB)1 Oxelman and Tollsten 930 (GB)2 Erixon 72 (UPS)1 Oxelman 2310 (GB)2 Oxelman 11.IV.1986 (GB) Erixon 71 (UPS) Oxelman 2208 (UPS) Popp 1053 (UPS)1 Oxelman 2264 (GB)2 Oxelman 2206 (GB) Eggens 48 (UPS) Erixon 73 (UPS)1 Oxelman 2197 (GB)2 Oxelman 2241 GB Oxelman 2199 (GB)
Sequence region
For sequence region reference see Fig. 1.
a Herbarium abbreviations according to Holmgren et al. (1990).
b Superscripts (1 and 2) indicate that different vouchers were used for the sequenced regions.
c Lychnis chalcedonica has sequence for regions 2a and 2b, i.e. psaA-ndhJ.
Fig. 1. Map of sequenced regions with the approximate positions of PCR primers indicated. Fragment lengths and positions correspond to the complete chloroplast genome of Spinacia oleracea (GenBank Accession No. NC_002202). Genes are marked with gray boxes; those above line are transcribed from left to right and below line from right to left. Sequence partitions are labeled as: 1, rps16 intron; 2a, psaA-trnS; 2b, trnS-ndhJ; 3, ndhC-trnM; 4, rbcL-petB; 4a, ycf4-cemA; 4b, psbE-petG; 4c, psaJ-rps18; 4d, rps18-clpP. Regions marked with white bars (1, 2b, 4a, 4b, 4c, 4d) have been sequenced for the complete taxon set (24 taxa), whereas regions marked with black (1, 2b, 3, 4) have been sequenced for the more restricted taxon sampling (17 taxa, see also Table 1). The spinach genome map is reproduced from Schmitz-Linneweber et al. (2001).
petB-R, Fig. 1) in conserved regions of coding sequences, using some of the available complete chloroplast genomes on GenBank. Special emphasis was given to the sequence of spinach (Caryophyllales). Initial sequencing was performed using the PCR primers. Nested sequencing primers were subsequently constructed using the obtained sequences and the spinach sequence. The same initial procedure was used for the region between trnS and trnM (6.8 kb, Fig. 1), but it was subsequently divided into two subregions, trnS-ndhJ (Fig. 1, region 2b) and ndhC-trnM (Fig. 1, region 3), excluding a part of predominantly coding sequence of 1.4 kb (part of ndhJ, ndhK and ndhC). The rps16 intron (Fig. 1, region 1) was chosen because it had already been sequenced for a large number of Sileneae taxa (e.g. Oxelman et al., 1997), and was amplified with primers rpsF/rpsR2 (Oxelman et al., 1997). The four sequence regions (Fig. 1) together consist of more than 25 kb and were completely sampled for 17 taxa (Table 1). For seven of the 24 taxa in the study (Agrostemma githago, Atocion rupestre, Eudianthe laeta, Lychnis flos-cuculi, L. flos-jovis, Petrocoptis pyrenaica, and Viscaria vulgaris) roughly half of the data were sequenced (see Fig. 1 for details). The choice of sequencing regions for these seven taxa was based on preliminary estimates of high sequence variation from the other 17 taxa. An additional 3.7 kb upstream of trnS-ndhJ was amplified with primer pair psaA-F/rps4-R2 for L. chalcedonica (Fig. 1, region 2a). This was done in order to verify a genome rearrangement in L. chalcedonica. In total, 103 primers were used for PCR and sequencing (Supplementary data, Appendix 1).
2.3. Alignment and indel coding
Sequences were assembled and edited using Sequencher™ v.3.1.1 (Gene Codes Corporation). Gene identification was done by aligning the coding regions from spinach with the sequence matrix. Sequences were aligned using Se-Al v.2.0a11 (Rambaut, 1996), with the criteria of Oxelman et al. (1997) as guidelines. When more than one sequence could not be unambiguously aligned to the others for a particular region, the region was excluded from further analyses. The exons of the clpP1 gene were also excluded from the analyses, because some taxa have extremely divergent sequences and are under positive selection (Erixon and Oxelman, 2008). The number of excluded characters for each data partition is given in Table 2. Indel characters were scored using SeqState v.1.25 (Müller, 2006) implementing the “simple gap coding” method of Simmons and Ochoterena (2000).
2.4. Phylogenetic analysis
Bayesian phylogenetic analyses on separate and combined matrices were performed with MrBayes v.3.1.2 (Huelsenbeck and Ronquist, 2001). Substitutions were modeled according to the model that received the highest Akaike weight (AIC), as calculated by MrAIC 1.4.2 (Nylander, 2005a) together with PHYML 2.4.4 (Guindon and Gascuel, 2003), for each data partition (Table 2). The
Table 2
Statistics from matrices used in the Bayesian phylogenetic analyses

| Region/matrixᵃ | Number of taxa | Number of charactersᵇ | Parsimony informative charactersᵇ | Excluded characters | Substitution model | AICᵈ | SplitSDᵉ |
|---|---|---|---|---|---|---|---|
| rps16 (1) | 24 | 962 + 101 | 78 + 35 | 0 | GTR + I + Γ | 0.904 | 0.0029 |
| trnS-ndhJ (2b) | 24 | 5779 + 599 | 421 + 198 | 335 | GTR + Γ | 0.765 | 0.0009 |
| ndhC-trnM (3) | 17 | 2620 + 180 | 138 + 56 | 78 | GTR + Γ | 0.745 | 0.0029 |
| rbcL-petB (4) | 17 | 21280 + 1093 | 976 + 319 | 1728ᶜ | GTR + Γ | 0.812 | 0.0018 |
| ycf4-cemA (4a) | 24 | 1345 + 133 | 121 + 42 | 47 | GTR + Γ | 0.877 | 0.0045 |
| psbE-petG (4b) | 24 | 1911 + 182 | 171 + 59 | 15 | GTR + Γ | 0.751 | 0.0023 |
| psaJ-rps18 (4c) | 24 | 1139 + 115 | 90 + 36 | 164 | HKY + Γ | 0.527 | 0.0047 |
| rps18-clpP (4d) | 24 | 3532 + 352 | 298 + 131 | 842ᶜ | GTR + Γ | 0.530 | 0.0031 |
| Total, no indels | 24 | 30,952 | 1869 | 2141ᶜ | GTR + Γ | 0.859 | 0.0046 |
| Total with indels | 24 | 33,149 | 2564 | 2141ᶜ | GTR + Γ | 0.859 | 0.0020 |
| Total, only indels | 24 | 2197 | 695 | 0 | n.a. | n.a. | 0.0027 |
| Subgenus Behen | 8 | 25,225 + 643 | 213 + 124 | n.a. | GTR + Γ | 0.607 | 0.0022 |

ᵃ The numbers in parentheses refer to Fig. 1.
ᵇ The two numbers represent nucleotide and indel characters, respectively.
ᶜ The large number of excluded characters is due to the exclusion of the clpP1 exons.
ᵈ Akaike weights.
ᵉ Average standard deviation of split frequencies of sampled trees between independent Bayesian analyses.
binary model of Lewis (2001) was applied to the indel data. Analyses were run for five million generations with four MCMC chains and two independent runs, with trees sampled every 100th generation. The first half of the sampled trees was discarded as burn-in. Default prior distributions were applied, but the sensitivity to the branch length prior (see, e.g. Yang and Rannala, 2005) was investigated by applying exponential prior distributions with three different means (0.05, 10, and 100) on the complete data set. Maximum parsimony bootstrap (Pboot) analyses were performed with PAUP* v.4.0b10 for Unix (Swofford, 2002). They were carried out with full heuristics, 10,000 replicates, TBR branch swapping, the MULTREES option off, and random addition of sequences with five replicates. To find the most parsimonious tree(s) for the complete data set, a heuristic search with TBR branch swapping, the MULTREES option on, and random addition of sequences with 10,000 replicates was performed. In order to estimate how much data are needed to resolve the different nodes in the phylogeny, maximum parsimony jackknifing with varying deletion frequencies (99%, 97%, 95%, 90%, ..., 5%) was performed on the complete data set. The jackknife analyses were carried out with full heuristics, 10,000 replicates, TBR branch swapping, the MULTREES option off, and random addition of sequences with five replicates. The 50%-analysis and the 36.7%-analysis are equivalent to the “delete-half” jackknife support and the “e⁻¹” jackknife support, respectively, for the complete data set.
2.5. Exploration of homoplasy distribution
Given a tree model and independently and identically distributed (i.i.d.) substitutions, conflicting data (homoplasy) is expected to be randomly distributed along a DNA sequence. By contrast, a recombined sequence contains a mosaic of different signals. To explore if short branches and weak support are the result of rapid radiation under a tree i.i.d. 
model or of recombination, branch support can be compared between partitions consisting of randomly sampled sites and blocks of adjacent sites. Under a tree i.i.d. model, no difference between the two partition types is expected. If the signal is unevenly (i.e. not independently) but identically (i.e. supporting a single topology) distributed, random partitions should on average receive higher support than adjacent, contiguous sites. For example, assume a sequence alignment with 100 sites and three informative, adjacent sites with the same pattern. Random samples of 10 sites will contain at least one informative site with the probability $1 - 0.97^{10} = 0.263$, whereas
the corresponding probability for samples of 10 adjacent sites will be only 0.120. Conversely, in a moderately mosaic sequence containing conflicting phylogenetic signals, blocks of adjacent sites are on average expected to have equal or more branch support than partitions consisting of random sites, because the conflicting signals will receive individual support that will be unaffected or decrease when combined. We applied this analysis to the eight taxa belonging to Silene subg. Behen, where phylogenetic resolution remained poor despite the large amount of sequence data sampled. The character matrix for the 8-taxon Behen data set was reduced compared to the original 24-taxon matrix, as large autapomorphic insertions (>10 bp) as well as empty columns resulting from the limited taxon sampling were excised. The reduced 8-taxon matrix consisted of 25,225 characters (indel information was ignored). To test the null hypothesis that homoplasy is randomly distributed, we used a sliding window with a size of 4000 sites, moved 200 sites per replicate. This resulted in 107 block partitions that were compared to 107 matrices of randomly sampled sites. All data sets were analyzed with maximum parsimony bootstrap with full heuristics, 1000 replicates, TBR branch swapping, the MULTREES option off, and random addition of sequences with five replicates. Only branches with bootstrap frequency above 50% were considered, and for each of these, 50 was subtracted from the bootstrap percentage; the resulting values were then summed (b50 sum). The null hypothesis that b50 sums from randomly sampled sites do not differ from those from adjacent sites was tested by applying Student’s t-test as implemented in VassarStats (Lowry, 2006). The impact of individual sequences on the analysis was evaluated by removing each sequence one at a time and performing the analysis again on the remaining sequences. 
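The worked example in this section (100 sites, three adjacent informative sites, samples of 10) can be checked numerically. The sketch below is illustrative only and is not part of the original analysis: the function names are hypothetical, and it assumes that random sites are drawn independently with replacement and that block start positions are uniform over all sites with wrap-around. Under those assumptions it reproduces the two values quoted in the text.

```python
def p_random(n_sites, n_informative, sample_size):
    """P(at least one informative site) for independent draws with replacement (assumption)."""
    return 1 - (1 - n_informative / n_sites) ** sample_size


def p_adjacent(n_sites, n_informative, block_size):
    """P(a block of adjacent sites overlaps a run of adjacent informative sites).

    Assumes block start positions are uniform over all sites, with wrap-around.
    """
    overlapping_starts = block_size + n_informative - 1
    return overlapping_starts / n_sites


def p_adjacent_enumerated(n_sites, informative, block_size):
    """Same probability, obtained by enumerating every possible block start."""
    informative = set(informative)
    hits = 0
    for start in range(n_sites):
        block = {(start + i) % n_sites for i in range(block_size)}
        if block & informative:
            hits += 1
    return hits / n_sites


random_p = p_random(100, 3, 10)
adjacent_p = p_adjacent(100, 3, 10)
# Positions 40-42 are an arbitrary choice for the three adjacent informative sites.
enumerated_p = p_adjacent_enumerated(100, [40, 41, 42], 10)
print(round(random_p, 3), round(adjacent_p, 3), round(enumerated_p, 3))
```

The closed-form and enumerated block probabilities agree because a block of 10 sites overlaps a run of three sites for exactly 10 + 3 - 1 = 12 of the 100 start positions.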
The estimated P-values should not be interpreted in a strict probabilistic sense, because the overlapping categories violate the assumption of independent samples. In addition to the Bayesian analysis and the parsimony bootstrapping analyses, the complete Silene subgenus Behen matrix was analyzed with maximum likelihood performed with PHYML 2.4.4 (Guindon and Gascuel, 2003) with full estimation under the GTR + Γ model and with 1000 bootstrap replicates using the script BootPHYML 3.4 (Nylander, 2005b). Phylogenetic uncertainty in Silene subgenus Behen was explored with Neighbor-Net (Bryant and Moulton, 2004) as implemented in SplitsTree v.4.3 (Huson and Bryant, 2006). The complete matrix for the eight taxa was translated into a distance matrix corrected under the GTR + Γ model (α = 0.27) using PAUP* v.4.0b10 (Swofford, 2002).
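The partitioning scheme and the b50 sum described in Section 2.5 can be sketched as follows. This is an illustration under stated assumptions rather than the scripts used in the study: the function names are hypothetical, the random partitions are assumed to be drawn without replacement, and the bootstrap analysis itself is not reproduced, so `b50_sum` simply takes a list of bootstrap percentages as input.

```python
import random


def block_windows(n_sites, window=4000, step=200):
    """Half-open (start, end) index pairs for contiguous sliding-window partitions."""
    return [(s, s + window) for s in range(0, n_sites - window + 1, step)]


def random_partition(n_sites, size, rng):
    """Sorted site indices sampled without replacement (an assumption of this sketch)."""
    return sorted(rng.sample(range(n_sites), size))


def b50_sum(bootstrap_percentages):
    """Sum of (support - 50) over branches with bootstrap frequency above 50%."""
    return sum(b - 50 for b in bootstrap_percentages if b > 50)


n_sites = 25225  # size of the reduced 8-taxon matrix reported in the text
windows = block_windows(n_sites)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_parts = [random_partition(n_sites, 4000, rng) for _ in windows]

print(len(windows), windows[0], windows[-1])
print(b50_sum([100, 80, 50, 40]))  # branches at or below 50% contribute nothing
```

With a window of 4000 sites moved 200 sites at a time over 25,225 characters, the window count matches the 107 block partitions reported in the text.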
3. Results
3.1. Sequence information
The three exons of the clpP1 gene show extremely high substitution rates for S. conica, S. fruticosa, L. chalcedonica and L. flos-cuculi. Multiple paralogous copies of the region were also found in S. fruticosa and L. chalcedonica, and there is evidence of positive selection (Erixon and Oxelman, unpublished data). Lychnis chalcedonica, L. flos-cuculi, and S. conica lack introns in the clpP1 gene. The same gene order is found in all investigated species, compared to Spinacia oleracea (NC_002202), except for the position of the accD and psaI genes in L. chalcedonica. This region, corresponding to 2.6 kb in Spinacia, has been interchanged with another cpDNA region (2.9 kb in Spinacia) containing the three exons of the ycf3 gene (region 2a in Fig. 1). The accD region shows signs of a complicated evolutionary history with multiple duplications and high sequence variability in L. chalcedonica (unpublished data). In addition to the large insertions in the accD and clpP (Erixon and Oxelman, unpublished data) genes for some taxa, we found 19 autapomorphic insertions larger than 30 bp. The largest were found in Heliosperma alpestre (315 bp in the trnL intron), S. fruticosa (244 bp, in the psbN/psbH spacer), and S. conica (193 bp in the clpP-psbB spacer). Several large deletions were also found, e.g. in all eight taxa of subgenus Behen (158 bp in the trnP-psaJ spacer and 103 bp in the trnF-ndhJ spacer), S. fruticosa (172 bp in the trnF-ndhJ spacer), and L. chalcedonica (in the ndhC-trnV spacer).
3.2. Phylogenetic analyses
Table 2 summarizes the number of taxa, included and excluded nucleotide and indel characters, parsimony informative characters, the substitution model used in the Bayesian analyses, Akaike weights, and the average standard deviation of split frequencies between the two independent analyses (runs), for the different DNA regions. The effect of different branch length priors (exponential distributions with means 0.05, 10, and 100) on branch lengths and posterior clade probabilities, estimated on the complete data set, was small. Posterior clade probabilities differed only in the third decimal and the internal and terminal branch lengths were 0.25% and 0.53% longer, respectively, in the analysis with the largest branch length prior relative to the shortest.

A single most parsimonious tree was found for the complete data set. It differs from the Bayesian consensus phylogram (Fig. 2) only in the rooting of Silene subgenus Behen. The Bayesian tree roots Behen on the terminal branch of S. integripetala, whereas the parsimony tree roots Behen on the terminal branch of S. samia. This difference results in three conflicting bipartitions. None of these has parsimony bootstrap support >50% (Fig. 2). Because the three nodes only found in the single most parsimonious tree eventually will get a frequency of 100% (delete 0%) in the jackknife graph (Fig. 3), both these and the three conflicting Bayesian nodes are included in Fig. 3.

Fig. 2. Total evidence phylogeny, with indel characters (A), only nucleotide characters (B), and only indel characters (C). Numbers on nodes are branch support (Bpp/Pboot). Only values above 0.50/50% are shown. Topology conflicts between Bpp and Pboot are marked with arrows and the alternative parsimony topologies are shown as inserted partial trees. Gray dots indicate nodes failing to converge to high support values and discussed in the text.

Twelve nodes with maximum bootstrap support (100%) from the parsimony analysis of the complete data set show a saturation-type curve in Fig. 3, that is, random subsets of a size that is less than half of the entire dataset always resolve the node. The curves can be classified in three groups on the basis of how steep they are, i.e. how much data on average are needed to resolve the node. The five well-supported nodes found in the rps16 analysis (Supplementary data Fig. 1) are also those that need the least data to be retrieved (group A). Roughly 1000 characters are needed to retrieve them 95% of the time. Three more nodes (group B: Silene subgenus Behen, S. pseudoatocion/S. fruticosa, and S. aegyptiaca/S. atocioides outside the rest of Silene and Lychnis) need roughly 4000 characters. Four nodes (group C: S. sorensenis/S. zawadskii, Agrostemma/Petrocoptis, Atocion/Viscaria/Heliosperma/Eudianthe, and S. aegyptiaca/S. atocioides together with the rest of Silene and Lychnis) need roughly 6000 characters. The clade consisting of L. chalcedonica and L. flos-cuculi and the clade consisting of S. latifolia and S. conica have weak saturation curves, being 95% saturated at c. 16,000 and 23,000 characters, respectively. The concave shape of the curves for some nodes in Behen indicates that addition of data will not improve the resolution. The sister-group relationship between the two large subgenera in Silene (Behen and Silene) shows an almost straight line.

In the separate analyses of only Silene subgenus Behen, the average support sum for observed contiguous partitions was 48.9, compared to 44.0 for the random partitions (P = 0.06). When S. littorea, S. zawadskii or S. sorensenis are individually excluded, the contiguous partitions have substantially higher support sums (Table 3). Several of the 7-taxon data sets have higher support sums, for the complete matrix, than the 8-taxon data set, despite the fact that they sum over four rather than five nodes (Table 3). The Neighbor-Net (Fig. 4) gives a general impression of a star-like phylogeny, but shows conflict with respect to the interrelationships of S. integripetala and S. uniflora, and also with respect to the S. sorensenis and S. zawadskii clade, which is strongly supported in the tree analyses (Fig. 4C). Two nodes strongly supported by Bayesian inference, but poorly supported by parsimony bootstrap, behave differently in the maximum likelihood bootstrap analysis. The position of S. integripetala as sister to S. sorensenis and S. zawadskii receives good support
[Fig. 3 plot. y-axis: frequency (%) with which a node is resolved, 0 to 100; x-axis: % of characters resampled, 0 to 100 (up to 33149 characters), with vertical lines marking "delete half" and "delete 1/e". Curve groups A, B and C are marked on the plot. Nodes listed in the legend (asterisks as in the original): subgenus Behen; S. sorensenis/S. zawadskii; S. conica/S. latifolia; *S. conica/S. latifolia/S. samia; S. uniflora/S. littorea; *Behen without S. sorensenis/S. zawadskii/S. integripetala; *Behen without S. integripetala; Behen without S. samia; S. sorensenis/S. zawadskii/S. integripetala; Behen without S. conica/S. latifolia/S. samia; subgenus Silene; S. pseudoatocion/S. fruticosa; subgenera Behen and Silene; S. cryptoneura/S. sordida; Lychnis; L. chalcedonica/L. flos-cuculi; S. cryptoneura/S. sordida and Lychnis; aegyptiaca-group; aegyptiaca-group outside Silene and Lychnis; Atocion/Viscaria; Atocion/Viscaria/Heliosperma/Eudianthe; Heliosperma/Eudianthe; Agrostemma/Petrocoptis.]
Fig. 3. The frequency (y-axis) with which nodes are parsimoniously resolved as a function of jackknife resample fraction, i.e. number of characters resampled out of the total 33149 characters (x-axis). The standard "delete-half" and "1/e" jackknife for the complete matrix are indicated with vertical lines. Groups marked with asterisk are supported nodes from the Bayesian analysis, but not present in the single most parsimonious tree. Circled lines (also denoted in the legend) are discussed in the text.
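The resampling behind Fig. 3 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names and the `node_is_resolved` callback (a stand-in for a parsimony search on the resampled columns) are hypothetical, and only the total of 33149 characters and the two standard deletion fractions are taken from the caption.

```python
import math
import random


def jackknife_sample(n_chars, keep_fraction, rng):
    """Return the sorted column indices retained in one jackknife replicate.

    Columns are drawn without replacement, so none is duplicated
    (unlike the bootstrap).
    """
    n_keep = int(n_chars * keep_fraction)
    return sorted(rng.sample(range(n_chars), n_keep))


def resolution_frequency(n_chars, keep_fraction, node_is_resolved,
                         replicates, seed=0):
    """Fraction of replicates in which node_is_resolved(columns) is true.

    node_is_resolved is a hypothetical callback; in a real analysis it
    would run a tree search on the resampled columns and report whether
    the node of interest is recovered.
    """
    rng = random.Random(seed)
    hits = sum(
        bool(node_is_resolved(jackknife_sample(n_chars, keep_fraction, rng)))
        for _ in range(replicates)
    )
    return hits / replicates


N_CHARS = 33149  # total number of characters, from the Fig. 3 caption

# Number of characters retained by the two standard jackknife schemes.
half_size = int(N_CHARS * 0.5)                      # "delete half"
inv_e_size = int(N_CHARS * (1.0 - 1.0 / math.e))    # "delete 1/e"
```

Under these assumptions the "delete 1/e" scheme retains roughly 63% of the matrix, which is where the corresponding vertical line sits on the x-axis.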
Table 3
Mean parsimony bootstrap support sums (b50 sums, as defined in Materials and methods) from 4000 bp contiguous (C) and random (R) partitions, P-values from the t-tests comparing these, parsimony bootstrap support sum (Pboot), maximum likelihood bootstrap support sum (MLboot), and Bayesian posterior probability support sum (Bpp) for the entire (E) nucleotide character set

Data set                 Mean b50 sum, C   Mean b50 sum, R   P-value   Pboot b50 sum, E   MLboot b50 sum, E   Bpp b50 sum, E
All 8 taxa               48.9              44.0              0.063     96.3               124.9               156.8
Excl. S. samia           53.4              54.4              0.398     129.2              135.5               182.4
Excl. S. littorea        65.0              53.7              0.0027    109.6              149.0               157.4
Excl. S. uniflora        46.4              47.6              0.379     161.7              149.7               171.6
Excl. S. integripetala   51.0              46.9              0.112     104.3              88.7                103.2
Excl. S. latifolia       35.7              35.8              0.483     69.7               99.7                147.4
Excl. S. conica          38.8              39.7              0.388     101.2              105.6               135.3
Excl. S. zawadskii       33.1              20.1              0.00024   64.3               73.4                100.0
Excl. S. sorensenis      27.6              20.7              0.0075    45.1               43.0                100.8
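As a reading aid for Table 3, the following sketch screens the rows for data sets in which contiguous partitions give a clearly higher mean support sum than random partitions. The values are transcribed from the table; the 0.05 threshold and the function name are assumptions made for illustration, not something stated in the table.

```python
# (data set, mean b50 sum contiguous, mean b50 sum random, P-value),
# transcribed from Table 3.
TABLE3 = [
    ("All 8 taxa", 48.9, 44.0, 0.063),
    ("Excl. S. samia", 53.4, 54.4, 0.398),
    ("Excl. S. littorea", 65.0, 53.7, 0.0027),
    ("Excl. S. uniflora", 46.4, 47.6, 0.379),
    ("Excl. S. integripetala", 51.0, 46.9, 0.112),
    ("Excl. S. latifolia", 35.7, 35.8, 0.483),
    ("Excl. S. conica", 38.8, 39.7, 0.388),
    ("Excl. S. zawadskii", 33.1, 20.1, 0.00024),
    ("Excl. S. sorensenis", 27.6, 20.7, 0.0075),
]


def structured_signal(rows, alpha=0.05):
    """Return (data set, C minus R) for rows with P < alpha and C > R.

    alpha is an assumed significance threshold chosen for this sketch.
    """
    return [
        (name, round(contiguous - rand, 1))
        for name, contiguous, rand, p_value in rows
        if p_value < alpha and contiguous > rand
    ]


flagged = structured_signal(TABLE3)
for name, difference in flagged:
    print(f"{name}: contiguous exceeds random by {difference}")
```

With this threshold the screen picks out the three exclusions (S. littorea, S. zawadskii and S. sorensenis) that the text singles out as having substantially higher support sums for contiguous partitions.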
[Fig. 4 graphic. Panels A and B: Neighbor-Net (scale bar 0.0010) with the taxa S. conica, S. littorea, S. latifolia, S. samia, S. uniflora, S. zawadskii, S. sorensenis and S. integripetala. Panel C: Bayesian phylogram (scale bar 0.01) of the same taxa; node support values legible in the graphic (Bpp/PbootI/Pboot/MLboot): 1.00/93.3/93.8/89.9; 1.00/53.0/-/-; 0.78/-/-/-; 1.00/61.6/52.6/85.0; 1.00/100/99.9/100.]
Fig. 4. Neighbor-Net phylogenetic network for subgenus Behen (A) with magnification of centre (B), and Bayesian phylogram (C) with support values: Bpp/PbootI/Pboot/ MLboot (Bpp = Bayesian posterior probabilities, PbootI = parsimony bootstrapping with indel characters, Pboot = parsimony bootstrapping without indel characters, MLboot = maximum likelihood bootstrapping). Values below 50% are not shown. Arrow indicates the lack of phylogenetic signal connecting S. uniflora and S. littorea discussed in the text.
(85%), whereas the sister relationship of S. uniflora and S. littorea receives no support above 50% (Fig. 4). The Neighbor-Net does not display any phylogenetic signal connecting S. uniflora and S. littorea (see arrow, Fig. 4B). As an arbitrary example of conflicting signal along the sliding 4000 bp windows, we show the results from the analysis with S. littorea excluded. Two larger partitions (7 and 6.2 kb, respectively) with conflicting supported phylogenies were identified (Fig. 5).

4. Discussion

As can be seen in Fig. 2, 14 of the 21 nodes in the tree based on the entire data set have converged to maximal Bayesian posterior probabilities and parsimony bootstrap frequencies higher than 85% (12 have reached 100%). Most previous cpDNA phylogenetic studies of Sileneae have focused on the rps16 intron, which normally is c. 800 bp in the group (Oxelman et al., 1997). Using rps16 data only (Supplementary data Fig. 1), five nodes receive comparably high support values (all have posterior probabilities of 1.00, the Pboot frequencies range from 87.9% to 99.9%). Thus, for nine of the sixteen poorly supported nodes in the rps16 tree, addition of c. 25 kb of sequence has greatly improved support for the inferred phylogenetic relationships. However, for seven nodes, support levels are still weak. In the following text, we discuss the significance of these findings for each taxonomic group.

Silene aegyptiaca and S. atocioides are morphologically very similar but geographically vicariant, S. atocioides being a species of virgin gravelly habitats in the mountainous zone of SW Anatolia, and S. aegyptiaca being a weed of agricultural land in the Easternmost Mediterranean (Personal observations; Boissier, 1888). Chowdhuri (1957) considered them to be different varieties of the same species. The magnitude of sequence difference
between the two taxa is therefore highly surprising. As a comparison, Atocion rupestre and Viscaria vulgaris are less divergent (Fig. 2), but they have almost invariably been classified in different genera in the taxonomic history of Sileneae (see Oxelman and Lidén, 1995 for a review). Of course, taxonomic rank has no absolute meaning in terms of divergence, but gives a rough indication of the magnitude of difference that taxonomists have attributed to the taxa in the past. We believe that it is likely that the rate of cpDNA substitutions has for some reason been elevated in S. aegyptiaca and S. atocioides, which makes a detailed cpDNA sequence analysis of this taxonomically quite difficult group warranted and promising. The solid support for their position as sister to all other Silene and Lychnis species in the cpDNA phylogeny is also unexpected. They have not previously been included in any molecular study, but they have on the basis of morphology been put together with S. cryptoneura and S. sordida in sect. Atocion (Coode and Cullen, 1967; see also Carlström, 1986). This is not consistent with cpDNA data, and unpublished nrDNA data (ITS) also speak against this classification. The ITS data do not, however, resolve the position of S. aegyptiaca/S. atocioides in relation to Lychnis. Sequencing of several unlinked nuclear sequence regions is necessary to investigate the origin of this group in more detail.

Silene cryptoneura and S. sordida are strongly supported as sister taxa, but their terminal branches differ substantially in estimated length. Nuclear ribosomal DNA data from the ITS region do not put S. sordida and S. cryptoneura together, and they are both intermingled with species from subgenus Behen (Oxelman and Lidén, 1995). Even if the relationships are only weakly supported, there is no support in the ITS data for the exclusion of S. cryptoneura and S. sordida from a monophyletic Behen. The cpDNA sequences, on the contrary, show strong evidence for this.
[Fig. 5 graphic. Top: two Bayesian phylograms (scale bar 0.01) for the trnS-trnM (7 kb) and petA-rps18 (6.2 kb) data blocks, with node support given as Bpp/Pboot/MLboot. Bottom: parsimony bootstrap support (%, y-axis, 0 to 100) plotted against block partition data set number (x-axis, 1 to 100) for the bipartitions S. integripetala/S. sorensenis/S. zawadskii and S. integripetala/S. uniflora; dots mark partitions with Bayesian posterior probability above 0.95.]
Fig. 5. Graph showing parsimony bootstrap support for two conflicting bipartitions of 4000 bp sliding window contiguous partitions for seven taxa in Behen (S. littorea excluded). Dots in gray and black represent partitions with posterior probabilities >0.95, for S. integripetala/S. sorensenis/S. zawadskii and S. integripetala/S. uniflora, respectively. Bayesian phylograms (top) show conflicting topologies between the trnS–trnM and petA-rps18 data blocks. Numbers on nodes are branch support: Bayesian posterior probabilities (Bpp)/parsimony bootstrap (Pboot)/maximum likelihood bootstrap (MLboot).
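The sliding-window scan summarised in Fig. 5 can be sketched as follows. Only the 4000 bp window size comes from the text; the alignment length, the step between windows, the 70% threshold and both function names are hypothetical choices made for illustration, and the support values would in practice come from a bootstrap analysis of each window.

```python
def sliding_windows(n_sites, window=4000, step=500):
    """Return (start, end) half-open column ranges of contiguous windows.

    step is an assumed offset between consecutive windows.
    """
    if window > n_sites:
        raise ValueError("window is longer than the alignment")
    return [(s, s + window) for s in range(0, n_sites - window + 1, step)]


def supported_runs(supports, threshold=70.0):
    """Return (first, last) index pairs of consecutive windows whose
    support for a bipartition is at or above threshold."""
    runs, start = [], None
    for i, value in enumerate(supports):
        if value >= threshold and start is None:
            start = i                      # a supported run begins
        elif value < threshold and start is not None:
            runs.append((start, i - 1))    # the run ended at the previous window
            start = None
    if start is not None:
        runs.append((start, len(supports) - 1))
    return runs


# Hypothetical alignment of 25000 columns.
windows = sliding_windows(25000)
print(len(windows), windows[0], windows[-1])
```

Runs of adjacent windows supporting the same bipartition are one simple way to delimit larger blocks of congruent signal, such as the two regions shown at the top of Fig. 5.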
In most cpDNA data partitions, there is weak to moderate support for S. cryptoneura and S. sordida as sister to Lychnis (Supplementary data Fig. 1). This is also the result in the total evidence analysis based on only nucleotides (1.00/87.5%), whereas the indel characters alone give weak support (0.78/53.6%) for a position not found in any other analysis, i.e. sister to a clade consisting of the subgenera Silene and Behen (Fig. 2). One of the data sets (rps18-clpP, Supplementary data Fig. 1(4d)) strongly supports (1.00/93.2%) another position, i.e. as sister to Behen. If the eight taxa of Behen are excluded from the analysis, the relationship of S. cryptoneura and S. sordida as sister to Lychnis receives maximum support (1.00/100%, data not shown). The apparent hard incongruence between cpDNA and nuclear data, and possibly also the conflicting signals within the cpDNA sequences, suggest a complicated past of this group. Silene cryptoneura is an annual plant of virgin habitats in the mountainous zone in SW Anatolia, whereas S. sordida occupies similar habitats in roughly the same geographical area, but on serpentine soil. Silene salamandra Pamp. (Rhodes) and S. insularis Barbey (Karpathos) are vicariant to S. cryptoneura (Oxelman and Långström, unpublished data; Carlström, 1986), and morphologically very similar. Silene sordida is morphologically different from these in seed morphology and by having nocturnal flowers. The possible reticulate past of the group needs to be investigated in detail using several nuclear low-copy genes.

The deep resolution among the major clades in Silene/Lychnis is problematic. The cpDNA data support four distinct groups (subgenus Behen, subgenus Silene, S. cryptoneura/S. sordida, and Lychnis), but their interrelationships are hard to resolve in a tree-like fashion, because analyses of different parts of the chloroplast genome result in conflicting topologies. The rps18-clpP data set (Supplementary data Fig. 1(4d)) not only associates S. cryptoneura and S. sordida with Behen (in line with nrDNA data), but also keeps Silene monophyletic (except for S. aegyptiaca/S. atocioides). In the 6-locus study by Popp and Oxelman (2004) only the nuclear DNA intron region RPD2a gave substantial support for Lychnis and Silene as sister taxa. Unfortunately, their study does not include some taxa crucial for the understanding of this relationship (S. cryptoneura, S. sordida and S. aegyptiaca, S. atocioides). Excluding these taxa from our analysis results in exactly that position of Lychnis (data not shown).

Within Silene subgenus Behen, S. sorensenis and S. zawadskii are strongly supported as a monophyletic group in most data sets and the results are also corroborated in other studies (Popp and Oxelman, 2004; Popp et al., 2005). The two taxa define the informal group Physolychnis s.l., where S. zawadskii is the sister taxon to a species-rich group of Arctic and alpine species from Central Asia and America, as well as several other predominantly American and Asian taxa (Oxelman and Lidén, 1995; Oxelman et al., 1997; Popp and Oxelman, 2004; Popp et al., 2005; Popp and Oxelman, 2007). We also consider the sister relationship of S. latifolia and S. conica to be strongly supported, despite the poor support for this relationship in some of the data sets and the relatively low Pboot support in the total evidence analysis (85.5%, Fig. 2A). This is most likely caused by the short terminal branch in S. latifolia in combination with the much longer terminal branch in S. conica (long branch attraction to other long branches in subgenus Behen), and the relatively short branch leading to the two taxa. The rooting of Silene subgenus Behen relative to other species in Sileneae is problematic, because of the long branch separating it from the rest and the poor within-group resolution.
The rooting of the group in the single most parsimonious tree differs from that of the Bayesian consensus tree, although the unrooted topology is
the same. Despite the low overall variability in the cpDNA data within Behen (0.84% parsimony informative sites), the amount of homoplasy is high (Consistency Index/Retention Index = 0.56/ 0.34). The proportion of parsimony informative characters in ITS nrDNA is eight times higher for the group, but the homoplasy is lower (7.2% parsimony informative sites, CI/RI = 0.63/0.46, unpublished data). Usually, ITS data have considerably higher levels of homoplasy for the same taxon sampling as a consequence of alignment and sequencing problems, but also because of incomplete concerted evolution, paralogy, pseudogenes, and compensatory base changes (Álvarez and Wendel, 2003). The analysis of homoplasy distribution performed on all eight taxa in Behen does only weakly (P = 0.06) reject the null hypothesis that phylogenetic information is randomly distributed along the sequenced regions. When S. littorea (having the second longest branch) is excluded, structured homoplasy with respect to S. integripetala emerges (Fig. 5). Our interpretation of the substantial decrease in the P-value when S. littorea is excluded (Table 3), is that the conflicting phylogenetic signal seen in different contiguous partitions when the analyses are performed without S. littorea (Fig. 5) is possibly canceled out by long branch attraction between S. uniflora and S. littorea, when the latter species is included. The mosaic pattern (Fig. 5) could be explained by recent recombination between certain lineages, but we were unable to detect any such using several methods implemented in the software RDP2 (results not shown, version 2.0, Martin et al., 2005). Still, the observed pattern may have been caused by ancient recombination, which has been obscured by substitutions occurring after the recombination event(s), but may also be due to ancient rapid radiation followed by unequal substitution rates in the daughter lineages. 
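The homoplasy measures quoted above for Behen (CI/RI) follow the standard ensemble definitions: the consistency index is the minimum possible number of steps divided by the observed number of steps on the tree, and the retention index is (maximum minus observed) divided by (maximum minus minimum), each summed over characters. The sketch below illustrates the arithmetic only; the function name and the step counts in the toy example are hypothetical and are not taken from the study's data.

```python
def ensemble_ci_ri(characters):
    """Compute ensemble consistency and retention indices.

    characters: list of (min_steps, observed_steps, max_steps) tuples,
    one per character, for a given tree.
    """
    m = sum(c[0] for c in characters)  # minimum conceivable steps
    s = sum(c[1] for c in characters)  # steps observed on the tree
    g = sum(c[2] for c in characters)  # maximum conceivable steps
    ci = m / s
    # RI is undefined when no character can show homoplasy (g == m).
    ri = (g - s) / (g - m) if g != m else float("nan")
    return ci, ri


# Toy example with three binary characters (hypothetical step counts):
# one without homoplasy, one with one extra step, one maximally homoplastic.
toy = [(1, 1, 3), (1, 2, 3), (1, 3, 3)]
ci, ri = ensemble_ci_ri(toy)
print(f"CI = {ci:.2f}, RI = {ri:.2f}")
```

Both indices equal 1 when there is no homoplasy and decrease as extra steps accumulate, so lower values indicate more homoplasy when two data sets are compared for the same taxa.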
The sliding window analysis of the 4000-bp block partitions identifies at least one region (c. 6200 bp) stretching from the petA gene to the rps18 gene as rather strongly supporting the grouping of S. uniflora and S. integripetala. For the complete matrix, the resolution and support are greatly improved by the exclusion of S. uniflora (Table 3), indicating that it is this species that causes the conflicting signal. For example, the bootstrap support for S. integripetala as sister to S. sorensenis and S. zawadskii increases from 52.3% to 87.1% (Pboot) and 85.0% to 99.9% (MLboot) when S. uniflora is excluded (data not shown). The small P-values obtained when either S. zawadskii or S. sorensenis is excluded are probably due to S. integripetala being attracted to one of these taxa or S. uniflora. The resolution in these data sets is, however, very poor. We conclude that the chloroplast data do not lend themselves to an easy interpretation, with respect to some of the relationships in Behen, but that it seems worthwhile to sample more extensively from subgenus Behen, and particularly from the relatives of S. uniflora in order to examine the hypothesis of a reticulate past of its chloroplast DNA genome. Silene vulgaris (Moench) Garcke is obviously a closely related taxon, which may not be expected to break up the terminal branch very much, but Oxelman et al. (1997) identified the morphologically rather dissimilar S. pendula L. as a strongly supported relative. This taxon was previously classified in the same section as S. littorea (Talavera, 1979). At present, however, the most sensible interpretation is that the sequences indicate a rapid ancient diversification followed by unequal substitution rates. The null hypothesis of tree-like evolution of the chloroplast genome cannot be confidently rejected.
The data do not give any strong indications of recent reticulations in the tree, and even if there had been such events going on in the past, it would have been drowned in noise caused by the long terminal branches. The apparent substitution rate heterogeneity among the terminal branches complicates the situation. Our data do not indicate that longer sequences will help (Fig. 3). Denser taxon sampling, when possible, could potentially resolve the relationships by breaking up the long terminal branches (Rydin and Källersjö, 2002; Hillis et al., 2003;
Soltis et al., 2004). Because of the very short internal branches in Silene subgenus Behen adding more taxa will probably not resolve the group, unless this could help identify problematic taxa in our sampling. The three species that represent Silene subgenus Silene (S. fruticosa, S. pseudoatocion, and S. schafta) form a strongly supported monophyletic group. These three species represent each of the major clades that were discovered by Oxelman and Lidén (1995) based on ITS sequence data. Silene fruticosa represents a clade of many mainly perennial species from the Mediterranean and Eurasia. Silene schafta represents a smaller group with mainly Asian distribution (Eggens et al., 2007), and S. pseudoatocion belongs to a large group of annual species from the Mediterranean area and Africa (Oxelman and Lidén, 1995; unpublished data). The sistergroup relationship between the latter group and the S. fruticosa group is weakly supported by the rps16 data, but receives maximal support from the entire cpDNA data. Nuclear data appear ambiguous concerning this relationship (Popp and Oxelman, 2004). All data partitions strongly support Atocion and Viscaria as sister genera and in most partitions these two form a strongly supported clade together with Heliosperma and Eudianthe. In contrast to nuclear data (Popp and Oxelman, 2004), Petrocoptis is not part of this group. This incongruence between cpDNA and nuclear DNA data seems very strongly supported and warrants further investigation. Petrocoptis is a small genus, endemic to the Pyrenees. Some nuclear data indicate that the Petrocoptis lineage may have been involved in the formation of Heliosperma (Frajman et al., unpublished; see also below). The cpDNA interrelationship between Heliosperma, Eudianthe and the Atocion/Viscaria-clade is difficult to resolve. 
There are apparently problems with long terminal branches and unequal substitution rates, as indicated by several cases of incongruence between the Bayesian analyses and the parsimony analyses. The phylogenetic history of Heliosperma is very complex, probably involving several hybridizations between distantly related lineages (Frajman and Oxelman, 2007).

4.1. The contribution of indel characters

By performing the "simple gap coding" described by Simmons and Ochoterena (2000) the number of parsimony informative characters is increased by 37% in our study. This is more than in any of the 38 studies reviewed in Simmons et al. (2001). Indel characters are often less homoplastic than nucleotide characters (Simmons et al., 2001), but at least in our study, the amount of homoplasy is larger for the indel characters compared to the nucleotide characters (CI/RI = 0.461/0.546 vs. 0.587/0.640). All of the nodes with consistently strong support in the individual data sets also receive support in the analysis of only indel characters, but none of the "hard-to-resolve" relationships get increased support for either alternative. The only clade in subgenus Behen that gets substantial support from indels is S. sorensenis/S. zawadskii, despite the fact that Behen has proportionally more potentially parsimony informative indels than the taxon set as a whole (the addition of indel characters increases the number of parsimony informative characters by 58%).

4.2. Why are weird taxa always weird?

Three taxa in this study (L. chalcedonica, S. fruticosa, and S. conica) stand out as being less "well behaved". They were generally more difficult to amplify, required construction of several species-specific primers, and they have more long insertions and deletions. All three species exhibit extreme sequence evolution in two regions, physically separated by >10 kb of normal sequence, the clpP1 gene region (see Erixon and Oxelman, 2008) and the accD
gene region (unpublished data). Several paralogous partial copies of clpP1 and accD have been amplified for L. chalcedonica and S. fruticosa. Both these genes are atypical chloroplast genes, but considered essential for plant development (Kuroda and Maliga, 2003; Kode et al., 2005). This study and others (e.g. Oxelman and Lidén, 1995; Oxelman et al., 1997; Popp and Oxelman, 2004) clearly show that the three species represent three separate lineages. It is puzzling that they show accelerated sequence evolution in the same regions, despite the fact that this property hardly could have been inherited from a common ancestor.

4.3. What is a hard incongruence and what can cause it?

Under a correct model of sequence evolution, and uninformative priors, Bayesian posterior probabilities should represent the probability that the corresponding clade is true (Huelsenbeck et al., 2001). Erixon et al. (2003) found that Bayesian inference under these circumstances is conservative with a more traditional frequentist statistical interpretation (but see Huelsenbeck and Rannala, 2004). The error rate is much elevated when the model is under-parameterized (Erixon et al., 2003; Huelsenbeck and Rannala, 2004). Bootstrapping is often considered to be conservative, and the 5%-significance level is often approximated to 70% bootstrap support (Hillis and Bull, 1993). However, this is probably a poor approximation even with maximum likelihood under a correct model (Erixon et al., 2003), and the uncertainty associated with empirical data and/or parsimony as optimality criterion is considerable. The situation in Silene subgenus Behen is problematic, because five mutually contradictory nodes, from different data partitions, receive posterior probabilities in the range 0.95–1.00 without correspondingly high parsimony bootstrap values (Fig. 2 and Supplementary data). Judged from the estimated branch lengths, the risk of long branch attraction artifacts is large.
If the internal branches are short and the substitution rates on terminal branches are unequal, long branch attraction among distantly related taxa can weaken the phylogenetic signal even further. In such situations, it is advisable to explore clade support using maximum likelihood bootstrapping, because it is putatively less sensitive than parsimony to LBA and more conservative (smaller type I error) than Bayesian inference. It is documented that hard (or near-hard) polytomies will cause unpredictable behavior in Bayesian analyses, with arbitrary resolutions of the polytomy occasionally receiving very high posterior probabilities (Lewis et al., 2005). We consider two cases, in this study, to be candidates for hard incongruences, i.e. the position of S. cryptoneura/S. sordida, where nuclear and plastid data are in conflict, and the interrelationship between S. uniflora and S. integripetala, where different compartments of the chloroplast genome are in conflict. The relative positions of Heliosperma and Eudianthe are also problematic, but less so. The situation within Lychnis is an example of relationships that are unresolved or poorly supported in several of the data sets and rather strongly incongruent between two of the data sets (Supplementary data Fig. 1(2b) and (4a)), but one of these alternatives receives strong support in the combined analysis. In order to give a biological explanation of hard incongruences between different chloroplast regions, processes such as hybridization, heteroplasmy, and recombination have to be inferred. If these processes were to occur, they would most likely have to involve closely related species. To our knowledge, there are no studies published where extensive phylogenetic investigations have been made on closely related taxa and large amounts (e.g. 10–20% of the genome) of chloroplast sequence data. Thus, it is not surprising that there are almost no documented cases of chloroplast recombination in angiosperms, other than experimentally induced (Medgyesy et al., 1985), despite the fact that nearly a third of the angiosperms surveyed exhibit occasional heteroplasmy (Wolfe and Randle, 2004). It has been shown that mtDNA inheritance in Silene vulgaris (closely related to S. uniflora) is not strictly maternal (only 96%). The paternal transmission and subsequent heteroplasmy can result in novel mitochondrial recombinants (McCauley et al., 2005). Situations like these will most likely also give heteroplasmic cpDNA. Houliston and Olson (2006) found evidence of non-neutral organelle evolution also in S. vulgaris. They found variation patterns in two chloroplast genes suggestive of chloroplast recombination. Each chloroplast contains several to hundreds of genomes and their replication is complex and poorly understood, but it is largely recombination-dependent (Bendich, 2004). Recombination does not only occur between genomes within the same chloroplast, but also between different chloroplasts in a cell (Wolfe and Randle, 2004).

5. Conclusions

This study represents the most extensive cpDNA character sample for a group of closely related species of flowering plants. We show that many, previously poorly resolved, phylogenetic relationships can be confidently resolved with addition of large amounts of sequence data, in some cases 16,000 bp or more. Some of these relationships imply major differences between the chloroplast genome evolution and the nuclear genome evolution, suggesting that hybridization is a significant process to take into account when inferring the phylogenetic relationships among the taxa in Sileneae. Several cpDNA tree nodes could not be confidently resolved despite more than 33,000 molecular characters. Addition of more sequence data, with the current taxon sampling, is probably not the remedy.
The reason for this is probably a combination of star-like phylogenetic patterns, substitution rate heterogeneity, and possibly also ancient chloroplast recombination.

Acknowledgments

We thank Katarina Andreasen, Joey Shaw, Mats Thulin, Mikael Thollesson, Niklas Wikström, and an anonymous reviewer for valuable comments on an earlier draft of this manuscript, and Nahid Heidari for laboratory support. This work was supported by grants from the Linnaeus Centre for Bioinformatics, Uppsala University, and by grants from the Helge Ax:son Johnson foundation, the Nilsson-Ehle foundation, and the Royal Swedish Academy of Sciences to P.E. and by Grant 2003-2696 from the Swedish Research Council to B.O.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ympev.2008.04.015.

References

Álvarez, I., Wendel, J.F., 2003. Ribosomal ITS sequences and plant phylogenetic inference. Mol. Phylogenet. Evol. 29, 417–434. Andersson-Ceplitis, H., 2002. Evolutionary dynamics of mitochondrial plasmids in natural populations of Silene vulgaris. Evolution 56, 1592–1598. Archibald, J.M., Rogers, M.B., Toop, M., Ishida, K., Keeling, P.J., 2003. Lateral gene transfer and the evolution of plastid-targeted proteins in the secondary plastid-containing alga Bigelowiella natans. Proc. Natl. Acad. Sci. USA 100, 7678–7683. Archibald, J.K., Mort, M.E., Wolfe, A.D., 2005. Phylogenetic relationships within Zaluzianskya (Scrophulariaceae s.s., tribe Manuleeae): classification based on DNA sequences from multiple genomes and implications for character evolution and biogeography. Syst. Bot. 30, 196–215.
Bendich, A.J., 2004. Circular chloroplast chromosomes: the grand illusion. Plant Cell 16, 1661–1666. Bergsten, J., 2005. A review of long-branch attraction. Cladistics 21, 163–193. Boissier, E., 1888. Flora Orientalis. Bremer, B., Bremer, K., Heidari, N., Erixon, P., Olmstead, R.G., Anderberg, A.A., Källersjö, M., Barkhordarian, E., 2002. Phylogenetics of asterids based on 3 coding and 3 non-coding chloroplast DNA markers and the utility of non-coding DNA at higher taxonomic levels. Mol. Phylogenet. Evol. 24, 274–301. Bryant, D., Moulton, V., 2004. Neighbor-Net: an agglomerative method for the construction of phylogenetic networks. Mol. Biol. Evol. 21, 255–265. Burleigh, J.G., Holtsford, T.P., 2003. Molecular systematics of the eastern North American Silene: evidence from nuclear ITS and chloroplast trnL intron sequences. Rhodora 105, 76–90. Carlström, A., 1986. The phytogeographical position of rodhos. Proc. R. Soc. Edinburgh B 89, 79–88. Chase, M.W., Soltis, D.E., Olmstead, R.G., Morgan, D., Les, D.H., Mishler, B.D., Duvall, M.R., Price, R.A., Hills, H.G., Qiu, Y.L., Kron, K.A., Rettig, J.H., Conti, E., Palmer, J.D., Manhart, J.R., Sytsma, K.J., Michaels, H.J., Kress, W.J., Karol, K.G., Clark, W.D., Hedren, M., Gaut, B.S., Jansen, R.K., Kim, K.J., Wimpee, C.F., Smith, J.F., Furnier, G.R., Strauss, S.H., Xiang, Q.Y., Plunkett, G.M., Soltis, P.S., Swensen, S.M., Williams, S.E., Gadek, P.A., Quinn, C.J., Eguiarte, L.E., Golenberg, E., Learn, G.H., Graham, S.W., Barrett, S.C.H., Dayanandan, S., Albert, V.A., 1993. Phylogenetics of seed plants—an analysis of nucleotide-sequences from the plastid gene rbcL. Ann. MO Bot. Gard. 80, 528–580. Chowdhuri, P.K., 1957. Studies in the genus Silene. Notes from the Royal Botanic Garden Edinburgh, vol. 22, pp. 221–278. Coode, M.J., Cullen, J., 1967. In: Davis, P.H. (Ed.), Silene L.. Flora of Turkey 2, Edinburgh. Desfeux, C., Maurice, S., Henry, J.P., Lejeune, B., Gouyon, P.H., 1996. 
Evolution of reproductive systems in the genus Silene. Proc. R. Soc. Lond. B, Biol. Sci. 263, 409–414. Eggens, F., Popp, M., Nepokroeff, M., Wagner, W.L., Oxelman, B., 2007. The origin and number of introductions of the Hawaiian endemic Silene species (Caryophyllaceae). Am. J. Bot. 94, 210–218. Erixon, P., Svennblad, B., Britton, T., Oxelman, B., 2003. Reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Syst. Biol. 52, 665–673. Erixon, P., Oxelman, B., 2008. Whole-Gene Positive Selection, Elevated Synonymous Substitution Rates, Duplication, and Indel Evolution of the Chloroplast clpP1 Gene. PLoS ONE 3(1) e1386. doi:10.1371/journal.pone.0001386. Felsenstein, J., 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool. 27, 401–410. Filatov, D.A., 2005. Evolutionary history of Silene latifolia sex chromosomes revealed by genetic mapping of four genes. Genetics 170, 975–979. Frajman, B., Oxelman, B., 2007. Reticulate phylogenetics and phytogeographical structure of Heliosperma (Sileneae, Caryophyllaceae) inferred from chloroplast and nuclear DNA sequences. Mol. Phylogenet. Evol. 43, 140–155. Goremykin, V.V., Hirsch-Ernst, K.I., Wölfl, S., Hellwig, F.H., 2003. Analysis of the Amborella trichopoda chloroplast genome sequence suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20, 1499–1505. Graham, S.W., Olmstead, R.G., 2000. Utility of 17 chloroplast genes for inferring the phylogeny of the basal angiosperms. Am. J. Bot. 87, 1712–1730. Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52, 696–704. Hillis, D.M., Bull, J.J., 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42, 182–192. Hillis, D.M., Pollock, D.D., McGuire, J.A., Zwickl, D.J., 2003. Is sparse taxon sampling a problem for phylogenetic inference? Syst. Biol. 
52, 124–126. Holmgren, K.P., Holmgren, H.N., Barnett, C.L. (Eds.), 1990. Index Herbariorum 1: The Herbaria of the World. International Association for Plant Taxonomy, New York Botanical Garden, New York. Hood, M.E., Antonovics, J., 2000. Intratetrad mating, heterozygosity, and the maintenance of deleterious alleles in Microbotryum violaceum (=Ustilago violacea). Heredity 85, 231–241. Houliston, G.J., Olson, M.S., 2006. Nonneutral evolution of organelle genes in Silene vulgaris. Genetics 174, 1983–1994. Huelsenbeck, J.P., Ronquist, F., Nielsen, R., Bollback, J.P., 2001. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 2310–2314. Huelsenbeck, J.P., Ronquist, F., 2001. MrBayes: Bayesian inference of phylogeny. Bioinformatics 17, 754–755. Huelsenbeck, J.P., Rannala, B., 2004. Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst. Biol. 53, 904–913. Huson, D.H., Bryant, D., 2006. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254–267. Kephart, S., Reynolds, R.J., Rutter, M.T., Fenster, C.B., Dudash, M.R., 2006. Pollination and seed predation by moths on Silene and allied Caryophyllaceae: evaluating a model system to study the evolution of mutualism. New Phytol. 169, 667–680. Kode, V., Mudd, E.A., Iamtham, S., Day, A., 2005. The tobacco plastid accD gene is essential and is required for leaf development. Plant J. 44, 237–244. Kuroda, H., Maliga, P., 2003. The plastid clpP1 protease gene is essential for plant development. Nature 425, 86–89. Lewis, P.O., 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Syst. Biol. 50, 913–925. Lewis, P.O., Holder, M.T., Holsinger, K.E., 2005. Polytomies and Bayesian phylogenetic inference. Syst. Biol. 54, 241–253.
Lowry, R., 2006. VassarStats, Web site for statistical computation. Available from:
. Martin, D.P., Williamson, C., Posada, D., 2005. RDP2: recombination detection and analysis from sequence alignments. Bioinformatics 21, 260–262. McCauley, D.E., Bailey, M.F., Sherman, N.A., Darnell, M.Z., 2005. Evidence for paternal transmission and heteroplasmy in the mitochondrial genome of Silene vulgaris, a gynodioecious plant. Heredity 95, 50–58. Medgyesy, P., Fejes, E., Maliga, P., 1985. Interspecific chloroplast recombination in a Nicotiana somatic hybrid. Proc. Natl. Acad. Sci. USA 82, 6960–6964. Mower, J.P., Touzet, P., Gummow, J.S., Delph, L.F., Palmer, J.D., 2007. Extensive variation in synonymous substitution rates in mitochondrial genes of seed plants. BMC Evol. Biol. 7, 135. Müller, K., 2006. Incorporating information from length-mutational events into phylogenetic analysis. Mol. Phylogenet. Evol. 38, 667–676. Nishiyama, T., Wolf, P.G., Kugita, M., Sinclair, R.B., Sugita, M., Sugiura, C., Wakasugi, T., Yamada, K., Yoshinaga, K., Yamaguchi, K., Ueda, K., Hasebe, M., 2004. Chloroplast phylogeny indicates that bryophytes are monophyletic. Mol. Biol. Evol. 21, 1813–1819. Nylander, J.A.A., 2005a. MrAIC.pl v. 1.4.2: software distributed by the author. School of Computational Science, Florida State University. Nylander, J.A.A., 2005b. BootPHYML v. 3.4: software distributed by the author. School of Computational Science, Florida State University. Oxelman, B., Lidén, M., 1995. Generic boundaries in the tribe Sileneae (Caryophyllaceae) as inferred from nuclear rDNA sequences. Taxon 44, 525–542. Oxelman, B., 1996. RAPD patterns, nrDNA ITS sequences and morphological patterns in Silene section Sedoideae (Caryophyllaceae). Plant Syst. Evol. 201, 93–116. Oxelman, B., Lidén, M., Berglund, D., 1997. Chloroplast rps16 intron phylogeny of the tribe Sileneae (Caryophyllaceae). Plant Syst. Evol. 206, 393–410. Oxelman, B., Lidén, M., Rabeler, R.K., Popp, M., 2001. A revised generic classification of the tribe Sileneae (Caryophyllaceae). Nord. J. Bot. 20, 743–748. 
Philippe, H., Laurent, J., 1998. How good are deep phylogenetic trees? Curr. Opin. Genet. Dev. 8, 616–623. Poke, F., Martin, D.P., Steane, D.A., Vaillancourt, R.E., Reid, J.B., 2006. The impact of intragenic recombination on phylogenetic reconstruction at the sectional level in Eucalyptus when using a single copy nuclear gene (cinnamoyl CoA reductase). Mol. Phylogenet. Evol. 39, 160–170. Popp, M., Oxelman, B., 2001. Inferring the history of the polyploid Silene aegaea (Caryophyllaceae) using plastid and homoeologous nuclear DNA sequences. Mol. Phylogenet. Evol. 20, 474–481. Popp, M., Oxelman, B., 2004. Evolution of a RNA polymerase gene family in Silene (Caryophyllaceae): incomplete concerted evolution and topological congruence among paralogues. Syst. Biol. 53, 914–932. Popp, M., Oxelman, B., 2007. Origin and evolution of North American polyploid Silene (Caryophyllaceae). Am. J. Bot. 94, 330–349. Popp, M., Erixon, P., Eggens, F., Oxelman, B., 2005. Origin and evolution of a circumpolar polyploid species complex in Silene (Caryophyllaceae) inferred from low copy nuclear RNA polymerase introns, rDNA, and chloroplast DNA. Syst. Bot. 30, 302–313. Qiu, Y.-L., Dombrovska, O., Lee, J., Li, L., Whitlock, B.A., Bernasconi-Quadroni, F., Rest, J.S., Davis, C.C., Borsch, T., Hilu, K.W., Renner, S.S., Soltis, D.E., Soltis, P.S., Zanis, M.J., Cannone, J.J., Gutell, R.R., Powell, M., Savolainen, V., Chatrou, L.W., Chase, M.W., 2005. Phylogenetic analyses of basal angiosperms based on nine plastid, mitochondrial, and nuclear genes. Int. J. Plant Sci. 166, 815–842. Rambaut, A., 1996. Se–Al: Sequence Alignment Editor. Available from:
. Rice, D.W., Palmer, J.D., 2006. An exceptional horizontal gene transfer in plastids: gene replacement by a distant bacterial paralog and evidence that haptophyte and cryptophyte plastids are sisters. BMC Biol. 4, 31. Rydin, C., Källersjö, M., 2002. Taxon sampling and seed plant phylogeny. Cladistics 18, 485–513. Schmitz-Linneweber, C., Maier, R.M., Alcaraz, J.-P., Cottet, A., Herrmann, R.G., Mache, R., 2001. The plastid chromosome of spinach (Spinacia oleracea): complete nucleotide sequence and gene organization. Plant Mol. Biol. 45, 307–315. Shaw, J., Lickey, E.B., Beck, J.T., Farmer, S.B., Liu, W., Miller, J., Siripun, K.C., Winder, C.T., Schilling, E.E., Small, R.L., 2005. The tortoise and the hare II: relative utility of 21 noncoding chloroplast DNA sequences for phylogenetic analysis. Am. J. Bot. 92, 142–166. Shaw, J., Lickey, E.B., Schilling, E.E., Small, R.L., 2007. Comparison of whole chloroplast genome sequences to choose noncoding regions for phylogenetic studies in angiosperms: the tortoise and the hare III. Am. J. Bot. 94, 275–288. Simmons, M.P., Ochoterena, H., 2000. Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. 49, 369–381. Simmons, M.P., Ochoterena, H., Carr, T.G., 2001. Incorporation, relative homoplasy, and effect of gap characters in sequence-based phylogenetic analyses. Syst. Biol. 50, 454–462. Sloan, D.B., Barr, C.M., Olson, M.S., Keller, S.R., Taylor, D.R., 2008. Evolutionary rate variation at multiple levels of biological organization in plant mitochondrial DNA. Mol. Biol. Evol. 25, 243–246. Soltis, D.E., Albert, V.A., Savolainen, V., Hilu, K., Qiu, Y.-L., Chase, M.W., Farris, J.S., Stefanovic, S., Rice, D.W., Palmer, J.D., Soltis, P.S., 2004. Genome-scale data,
angiosperm relationships, and ‘ending incongruence’: a cautionary tale in phylogenetics. Trends Plant Sci. 9, 477–483. Städler, T., Delph, L.F., 2002. Ancient mitochondrial haplotypes and evidence for intragenic recombination in a gynodioecious plant. Proc. Natl. Acad. Sci. USA 99, 11730–11735. Stefanovic, S., Rice, D.W., Palmer, J.D., 2004. Long branch attraction, taxon sampling, and the earliest angiosperms: Amborella or monocots? BMC Evol. Biol. 4, 35. Swofford, D.L., 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods), version 4.0b10. Sinauer, Sunderland, MA.
Talavera, S., 1979. Revision de la sect. Erectorefractae del genero Silene L. Lagascalia 8, 135–164. Vyskot, B., Hobza, R., 2004. Gender in plants: sex chromosomes are emerging from the fog. Trends Genet. 20, 432–438. Wolfe, A.D., Randle, C.P., 2004. Recombination, heteroplasmy, haplotype polymorphism, and paralogy in plastid genes: implications for plant molecular systematics. Syst. Bot. 29, 1011–1020. Yang, Z., Rannala, B., 2005. Branch-length prior influences Bayesian posterior probability of phylogeny. Syst. Biol. 54, 455–470.