Infection, Genetics and Evolution 11 (2011) 83–91
Contents lists available at ScienceDirect
Infection, Genetics and Evolution journal homepage: www.elsevier.com/locate/meegid
Phylogenetic evaluation of the ‘Typhimurium’ complex of Salmonella strains using a seven-gene multi-locus sequence analysis§ Rebecca L. Bell a,*, Narjol Gonza´lez-Escalona a, Robert Stones b, Eric W. Brown a a b
Division of Microbiology, Office of Regulatory Science, Center for Food Safety and Applied Nutrition, U.S. Food and Drug Administration, College Park, MD 20740, USA Food and Environment Research Agency, Sand Hutton, York YO41 1LZ, UK
A R T I C L E I N F O
A B S T R A C T
Article history: Received 30 April 2010 Received in revised form 30 September 2010 Accepted 6 October 2010 Available online 21 October 2010
Salmonella enterica comprises over 2500 serovars, many of which are significant foodborne pathogens in humans. The ability to subtype these microbes is difficult due to the highly clonal nature of many Salmonella strains and a lack of congruence among traditional typing approaches. This work examines the phylogenetic utility of a multi-locus sequence typing (MLST) approach to discriminate between members of a closely related collection of salmonellae, the Salmonella reference collection A (SARA). This 72 strain collection, referred to as the ‘Typhimurium’ complex, consists of S. Typhimurium and its four closest serological relatives. In this analysis, nucleotide sequences from seven housekeeping genes (aroC, dnaN, hemD, hisD, purE, sucA, and thrA) were PCR amplified, sequenced, and combined into a single concatenated character matrix providing 3360 bp for cladistic analysis. The resultant most parsimonious tree yielded seven clades of Salmonella strains that partitioned largely along serovar divisions within the collection except for five ‘Paratyphi B’ strains, two ‘Saintpaul’ strains, and two ‘Typhimurium’ strains. Convergence in the SARA tree was approximately 20% indicating that the vast majority of sequence changes were phylogenetically informative. Despite a high consistency among nucleotide substitutions, analysis of congruence identified several SARA strains with recombinant alleles in the concatenated matrix. These findings point to important differences among phylogenetic contributions made by the individual genes comprising this MLST dataset. Published by Elsevier B.V.
Keywords: Salmonella enterica Phylogenetics Multi-locus sequence analysis Reference collection Genetic diversity
1. Introduction Salmonella enterica is responsible for an estimated 1.4 million infections in the United States annually (USDA, 2009). Infection can occur after eating undercooked meat, poultry and eggs or from eating raw produce that has been contaminated with the bacterium. In recent years several outbreaks have occurred in the United States that were associated with Salmonella contamination of produce, one of the most devastating being a S. enterica ˜ o and serrano Saintpaul outbreak associated with tomatoes, jalapen peppers that sickened over 1400 individuals (http://www.cdc.gov/ salmonella/saintpaul/jalapeno/map.html) (CDC, 2008). The Salmonella reference collection A (SARA) is a well characterized strain collection of S. enterica serovar ‘Typhimurium’ and its closest phylogenetic relatives, ‘Saintpaul’, ‘Heidel-
§ Note: Nucleotide sequence data reported in this paper are available in the GenBank database under the accession numbers HM797762–HM798264. * Corresponding author at: Division of Microbiology, FDA-CFSAN, 5100 Paint Branch Parkway, Mailstop HFS-712, College Park, MD, USA. Tel.: +1 301 436 1442; fax: +1 301 436 2644. E-mail address:
[email protected] (R.L. Bell).
1567-1348/$ – see front matter . Published by Elsevier B.V. doi:10.1016/j.meegid.2010.10.005
berg’, ‘Paratyphi B’ and ‘Muenchen’ (Beltran et al., 1991). According to the CDC, four of these five serovars are thought to be responsible for the majority of Salmonella infections occurring in the U.S. from 1997 to 2006 (CDC, 2007). The collection was first described using multi-locus enzyme electrophoresis (MLEE) and DNA-DNA hybridization and consists of 72 strains, encompassing at least one isolate for each unique electrophoretic type and several isolates from the most common electrophoretic type(s) (Beltran et al., 1991). Since SARA represents a well-defined and closely related collection of strains, it can be used to readily evaluate newer typing methods for discriminatory and phylogenetic potential. The most commonly used method of S. enterica typing is the Kaufmann–White scheme which relies on the serotyping of the Oantigen, present in the lipopolysaccharide, and the H-antigens, the phase 1 and 2 flagellar proteins. To date researchers have identified over 2500 serovars of S. enterica using this scheme. While it is useful to study the epidemiology of an outbreak, frequent recombination of O and H antigens makes it inadequate for determining genetic and phylogenetic relationships of Salmonella strains and serovars (Wang et al., 2002). To address this issue many molecular subtyping methods have been developed such as pulsed-field gel electrophoresis (PFGE),
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
84
amplified fragment length polymorphism (AFLP) and multi-locus sequence typing (MLST). PFGE is currently used by PulseNet for the tracking of outbreaks (http://www.cdc.gov/pulsenet). This technique is highly discriminatory however, in some instances, it lacks the ability to recapitulate the evolutionary paths of certain closely related foodborne species including some clusters of S. enterica (e.g., S. enterica serovar Enteritidis) and Escherichia coli O157:H7 (Davis et al., 2003; Zheng et al., 2007). MLST is a DNA sequence-based method that clusters strains by combining several conserved housekeeping genes into a single analysis thus reiterating the genetic relatedness of the strains being studied. This subtyping scheme was first proposed by Maiden et al. (1998) for Neisseria meningitidis and has now been applied to numerous bacterial species (Godoy et al., 2003; Gonzalez-Escalona et al., 2008; Kotetishvili et al., 2002; Lacher et al., 2007). MLST has several advantages over non-sequence based methods. In particular the DNA sequence is non-ambiguous, unlike PFGE and MLEE both of which rely on gel banding patterns, thus data from several labs can easily be compared. MLST also allows for the reconstruction of phylogenetic relationships within the population being studied using a single combined character matrix with no loss of valuable sequence data. Several different MLST strategies have been examined in various environmental and clinical isolates of Salmonella (Fakhr et al., 2005; Harbottle et al., 2006; Sukhnanand et al., 2005; Tankouo-Sandjong et al., 2007; Torpdahl et al., 2005). One scheme using seven housekeeping genes, (i.e., aroC, dnaN, hemD, hisD, purE, sucA, thrA) has become popular due to the maintenance of sequence information in a publicly accessible database (http:// mlst.ucc.ie). Other schemes have explored the use of different housekeeping genes or a combination of housekeeping and virulence associated genes (Alcaine et al., 2006; Foley et al., 2006; Kotetishvili et al., 2002). Each approach has had varying success at distinguishing different populations of salmonellae. However, the ability of these strategies to discriminate closely related strains has not been thoroughly studied in a wellcharacterized Salmonella reference collection. For this work, the SARA collection was used to investigate the ability of the conventional seven housekeeping gene scheme noted above to differentiate closely related S. enterica strains into phylogenetically relevant clusters. Each individual gene was examined for its discriminatory power and evolutionary congruence in the construction of most parsimonious trees. Additionally, the contributions of mutation and recombination were investigated for each of these genes among the five serovars comprising the SARA strain complex.
2. Materials and methods 2.1. Bacterial strains The SARA collection was obtained from the Salmonella Genetic Stock Center, University of Calgary. This collection consists of a total of 72 strains of S. enterica Typhimurium (21 strains), ‘Saintpaul’ (8 strains), ‘Heidelberg’ (11strains), ‘Paratyphi B’ (22 strains) and ‘Muenchen’ (10 strains) (Supplemental Table 1) (Beltran et al., 1991). 2.2. Genomic DNA isolation, PCR amplification and sequencing Genomic DNA was isolated using the InstaGene Matrix (BioRad) following the manufacturer’s protocol. In brief, colonies from overnight cultures were washed in water. InstaGene Matrix was added to the cells and incubated at 56 8C for 30 min. After vortexing vigorously for 10 s the cells/matrix were incubated at 100 8C for 8 min and vortexed again. Before every use, the matrix was centrifuged for a minimum of 3 min. PCR reactions were prepared by combining 10 ml of isolated DNA with PCR buffer containing a final concentration of 1.5 mM MgCl2 (Applied BioSystems), 0.2 mM each dNTPs (Promega), 0.2 mM of the appropriate forward and reverse primer, and 1.25 U of GoTaq DNA polymerase (Promega). All primer sequences, for amplification and sequencing, can be found at the MLST Databases at the ERI, University College Cork (http:// mlst.ucc.ie/mlst/dbs/Senterica/documents/primersEnterica_html). PCR cycling conditions were as follows: 94 8C for 10 min, then 34 cycles of 94 8C for 1 min, 55 8C for 1 min, 72 8C for 1 min, then a final incubation at 72 8C for 5 min. PCR products were cleaned using ExoSAP-IT (USB Corp) by incubating 20 ml of the PCR reaction with 8 ml of Exo-SAP-IT for 15 min at 37 8C and then 15 min at 80 8C. Nucleotide cycle-sequencing was performed in both directions directly on purified PCR templates via automated Sanger dideoxychain termination methods using the primers described above (Amplicon Express). 2.3. Descriptive statistics Nucleotide sequences of each individual gene and the seven-gene concatenated dataset were aligned using CLUSTAL X (Thompson et al., 1997). PAUP * v.4.03b (Swofford, 2002) was used to determine the number of informative sites and parsimony informative sites, the number of alleles for each gene, the consistency index, and the G1 statistic which evaluates the skewedness of phylogenetic signal from random noise alone (Hillis and Huelsenbeck, 1992). Genetic
Table 1 Genetic characteristics and evolutionary variation among the seven loci included in the Salmonella enterica SARA MLST analysis.
aroC dnaN hemD hisD purE sucA thrA a
No. characters analyzed (No. codons)
Total no. alleles (No. synapomorphic alleles)a
No. informative characters (No. parsimonious informative characters)
% variable sites (all pos.)b
% variable sites (3rd pos. only)c
dS SEd
dN SEd
dN/dSd
CIe
G1
516 552 468 555 345 465 459
13 13 15 12 19 10 19
98 (64) 96 (81) 113 (105) 128 (104) 71 (62) 91 (80) 109 (87)
2.5 3.3 4.1 5.8 5.5 2.8 6.3
6.4 8.7 6.4 15.7 9.6 5.8 11.8
2.33 0.74 2.56 0.85 1.86 0.77 2.86 0.67 2.23 0.84 3.01 0.95 5.33 1.25
0.014 0.01 0.12 0.12 0.68 0.26 0.14 0.12 0.18 0.09 0.32 0.02 0.10 0.03
0.01 0.05 0.36 0.05 0.81 0.01 0.02
0.88 0.89 0.92 0.90 0.87 0.96 0.88
4.36 2.51 3.51 4.45 5.42 2.77 1.80
(172) (184) (156) (185) (115) (155) (153)
(7) (6) (8) (4) (12) (5) (6)
Any allele shared by more than one SARA strain. The percentage of positions containing at least two different nucleotides by the total number of characters. c The percentage of third positions containing at least two different nucleotides divided by the total number of third position characters. d Mean numbers of synonymous substitutions per 100 synonymous sites (dS) and nonsynonymous substitutions per 100 nonsynonymous sites (dN). dN/dS is a relative measure of the selective constraint on each locus. Reported values are Jukes–Cantor corrected. All values were calculated using MEGA v.3.0 (Kumar et al., 2001). e Calculated using PAUP* v. 4.03b. b
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
distances and rates of synonymous (dS) and nonsynonymous substitutions (dN) were calculated using the MEGA v.3.0 (Molecular Evolutionary Genetic Analysis) program (Kumar et al., 2001). All values were Jukes–Cantor corrected. 2.4. Phylogenetic analysis The phylogenetic method employed uses maximum parsimony (Farris, 1983) and is available in PAUP* v.4.03b (Swofford, 2002). Most parsimonious trees were sought using heuristic search methods, tree bisection/resection, and random sampling of 100 replicates. E. coli K-12 (strain MG1655, acc no. U00096.2) and E. coli
[(Fig._1)TD$IG]
85
O157:H7 (strain EDL-933, acc no. AE005174.2) were acquired from GenBank and used as outgroups to root the phylogeny. Once the minimum number of trees was found, semi-strict (combinablecomponent) consensus methods were applied in order to condense the number of equally most parsimonious cladograms into a single tree, thereby presenting only those relationships in the consensus tree that were not in topological conflict among the original trees. In cases where individual strains radiated from the same node, a conservative approach was adopted whereby strain polytomies originating from the same node were grouped together as a single clade. Character support for internal tree nodes was determined by 1000 iterations of bootstrapping and jackknifing (Felsenstein,
Fig. 1. Most parsimonious (MP) seven-gene tree for Salmonella SARA strains. The cladogram shown is derived from the combined analysis of 3360 nucleotide characters from aroC, dnaN, hemD, hisD, purE, sucA, and thrA housekeeping gene sequences and is taken here to represent a whole-chromosome phylogeny for 72 SARA strains. The resultant tree is a semi-strict consensus of seven equally MP trees that had a CI of 0.79, indicating a 21% level of homoplasy, and RI of 0.89, and an overall length of 973 steps (i.e., nucleotide substitutions). Bootstrap and jackknife percentages (out of 1000 iterations) are reported above and below each node, respectively (Felsenstein, 1985). E. coli K-12 (strain MG1655, acc no. U00096.2) and E. coli O157:H7 (strain EDL-933, acc no. AE005174.2) were acquired from GenBank and used as outgroups to root the phylogeny. The tree yielded a G1-statistic of 4.47 indicating a large measure of skewedness in the data when compared to 10,000 random trees. The brackets to the right of the tree denote the seven distinct clades (denoted A–G) of S. enterica subsp. enterica strains to emerge from the tree. The tree largely sorted SARA strains along serovar boundaries, and the predominant serovar comprised by each clade is noted in parentheses. The hemD sequence from SARA29 was coded as missing data due to the inability to obtain a complete sequence. Salmonella serovars are abbreviated as follows: Tm, ‘Typhimurium’; He, ‘Heidelberg’; Sp, ‘Saintpaul’; Pb, ‘Paratyphi B’; and Mu, ‘Muenchen’.
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
86
1985; Swofford, 2002). The overall compatibility of sites was measured for the combined dataset using COMPATDNA, a windows based program written by one of the authors (RS) and based on the compatibility algorithm of Jakobsen and Easteal (1996) whereby two sites are deemed compatible if the substitutions at these sites can be accounted for only once in a phylogeny. Here, binary sites only (informative sites containing exactly two distinct nucleotides) were included in the analysis. 2.5. Population evolution analysis Clonal structure was examined using eBurst v 3.0 (http:// eburst.mlst.net) (Feil et al., 2004). Clonal complexes were defined as strains sharing six or more of the seven alleles in the dataset (Feil et al., 2004). Support for the ancestral types was determined using 1000 bootstrap resamplings. Two different sequence types (STs) are considered single locus variants (SLVs) when they differ from each other at a single locus. Two STs are considered double locus variants (DLVs) when they differ at two loci and so forth. Estimated recombination rates were derived as described previously (Feil et al., 2000), where the per-allele and per-site recombination/mutation (r/ m) parameter was calculated empirically. Briefly, any SLV allele differing by one nucleotide and not observed elsewhere in the database as part of another sequence type was considered to have arisen by mutation. A SLV allele differing by multiple nucleotides or containing a single nucleotide change observed as part of another ST in the database was considered to have originated by recombination. Congruence between strains was assessed using the incongruence length difference (ILD) test (Farris et al., 1995). ILD tests were done using the Partition Homogeneity Test available in PAUP* v.4.03b utilizing a heuristic model. Split decomposition of the concatenated seven-gene dataset was performed to confirm the ILD tests for each serovar using the SplitsTree v.4.8 program (Huson and Bryant, 2006). 2.6. Nucleotide sequence and accession numbers The nucleotide sequence data reported in this paper have been deposited in the GenBank sequence database with the following accession numbers: aroC, HM798193–HM798264; dnaN, HM797762–HM797833; hemD, HM798122–HM798192; hisD, HM798050–HM798121; purE, HM797978–HM798049; sucA, HM797906–HM797977; thrA, HM797834–HM797905. 3. Results 3.1. Genetic diversity of the seven housekeeping genes used in the multi-locus sequence analysis of Salmonella Typhimurium complex strains DNA sequence analysis of the seven housekeeping genes resulted in 3360 base-pairs for phylogenetic analysis (Table 1). A total of 101
different alleles were identified; 48 of which were synapomorphic (i.e., shared and derived). The genes yielding the greatest number of alleles were purE and thrA with 19 different alleles each. However, purE retained nearly twice as many synapomorphic alleles as thrA. Conversely, hisD and sucA yielded the fewest number of synapomorphic alleles (4 and 5, respectively). Total site variation within each gene ranged from 2.5% (aroC) to 6.3% (thrA) and third position variability exceeded the total variability in each gene. Total genetic diversity (P) was about 0.86%. Synonymous substitutions occurred more frequently than nonsynonymous substitutions in every gene (dN vs. dS values) (Table 1). Analysis of dN/dS ratios also revealed all seven genes to be under strong stabilizing/negative selective pressure. 3.2. Cladistic analysis of the concatenated seven-gene sequence alignment The seven-gene concatenated DNA sequence alignments were used to construct the most parsimonious phylogenetic tree (Fig. 1). This tree partitioned the 72 SARA strains into seven distinct clades (A–G). Low bootstrap and jackknife values for most of the clades were observed and likely resulted from the limited number of conserved nucleotide substitutions that partitioned major SARA lineages in the tree. Five of the seven clades (A and C–F) largely captured the serological classifications of these strains. Four of the strains, however, were unable to be grouped. Also, each major lineage radiated from a common node in the tree making evolutionary descent difficult to assign. Examination of the seven individual gene’s contributions to the tree topology revealed slight differences between the numbers of informative and parsimony informative (PI) substitutions (i.e., present in at least two strains) (Table 1). hemD yielded the most useful sites for analysis while purE retained the fewest PI sites. Overall, in the concatenated sequence, 21% of the sites were informative and 17% of the sites were PI. Consistency indices (CI) were comparable among the seven genes where about 20% of the concatenated matrix contained homoplasious substitutions (CI = 0.79). An analysis of tree skewedness (G1) affirmed clear separation from randomness and substantial structure in the phylogenetic signal of the genes (Table 1). The genetic distances observed between the seven distinct SARA clades supported the phylogenetic partitions revealed in the tree and are shown in Table 2. The mean divergence among clades was 1.14% while mean intra-clade variation was only 0.13%. Maximum diversity was observed between Clades C and F (1.38%), which represent serovars ‘Heidelberg’ and ‘Muenchen’, respectively. Clade F (i.e., the ‘Muenchen’ clade) was the most distant clade, showing the greatest divergence from the remaining SARA lineages. Clade B, which represents two strains of ‘Typhimurium’ that were removed from the larger Clade A of ‘Typhimurium’ strains, was 0.65% diverged from Clade A.
Table 2 Combined 7-gene diversity within and between S. enterica SARA clades. Clade (n)
Mean nucleotide diversity SE (%)a A
A (18) B (2) C (15) D (6) E (17) F (8) G (2) a
B
C
D
E
F
G
0.04 0.04b 0.74 0.16 0.74 0.15 0.76 0.13 1.19 0.19 0.75 0.15
0.14 0.03b 1.02 0.18 1.02 0.16 1.38 0.20 1.13 0.19
0.12 0.04b 1.04 0.17 1.00 0.16 0.65 0.13
0.34 0.07b 1.32 0.19 0.88 0.16
0.42 0.07b 1.06 0.19
0.00 0.00b
b
0.03 0.01 0.65 0.15 0.51 0.12 0.65 0.14 0.96 0.16 0.96 0.16 0.83 0.16
Mean pairwise sequence variation for seven loci among the seven SARA clades described in Fig. 1. All distances are Jukes–Cantor corrected in MEGA v.3.0 (Kumar et al., 2001). b Intra-clade distance.
D
Individual genes are designated as follows: A, aroC; D, dnaN; He, hemD; Hi, hisD; and S, sucA. Individual alignment position numbers are given following the gene abbreviation. Clade A was used as the reference clade, subsequently signature substitutions uniquely defining all seven SARA clades are presented. Signature substitutions in these clades are definitive meaning the presence of these particular sequence change in a strain defines it as being part of this clade. These substitutions are considered primary substitutions. d Clades B and D were not defined by any primary signature substitution. Successful differentiation of these two clades depends on the elimination of primary signature substitutions of Clades C and E–G and the presence of the substitutions described here. Thus these are considered secondary signature substitutions.
He103 G!T A285 G!A A276 T!C A186 C! d
A90 C!T B
d
Gc
c
S36 C!T S12 C!T He303 G ! T He270 C!T D81 C!T D78 C!T Fc
Ec
b
S177 C!T Hi309 A!C Hi252 T!C
He103 G He91 A He91A ! C He87 T He87 T!C D81 C D78 C A285 G A276 T A186 C
Nucleotidesubstitutiona
A90 C A Cc
b
Clade
Table 3 Signature nucleotide substitutions unique to individual clades of SARA strains.
a
S397 C!T S286 G!A
S36 C S12 C Hi309 A Hi252 T He303 G He270 C
3.4. Phylogenetic compatibility of the seven MLST genes
Varying levels of congruence were observed in our seven-gene alignment. In order to better understand the genetic stability of the seven-gene dataset for the SARA collection, several population genetic parameters were investigated. Allelic variation for the seven genes analyzed here is reported in Table 4. The concatenated sequence matrix displayed a total of 42 distinct sequence types (STs) among the five SARA serovars. The number of unique STs varied
S273 A! G
S177 C
S273 A
S286 G
Evaluation of the concatenated seven-gene nucleotide sequence alignment revealed the existence of several primary synapomorphic nucleotides capable of differentiating specific clades. The signature synapomorphic substitutions, their positions in the Clustal X alignment, and the SARA strain groupings that they distinguish are listed in Table 3. Using Clade A alleles as the nucleotide substitution baseline, four of the seven clades reported here (i.e., C and E–G) retained definitive substitutions that were diagnostic for all of the strains residing in the particular clade. As an example, SARA Clade C was defined uniquely by two SNPs within the hemD gene at positions 87 and 91. Additionally, Clade E was distinguished from the others by two changes in the sucA gene at positions 273 and 286. Clades F and G were denoted by changes in two separate genes, dnaN/hisD and hemD/sucA, respectively. Clades B and D did not yield any primary substitutions uniquely capable of defining these groups. However, several combinations of secondary SNPs were identified that were capable of distinguishing these two clades. For example, the assignment of SARA strain 7 to Clade B, depends on the absence of any signature SNPs defining Clades C, E–G, and on the presence of a four SNP combination in the aroC gene at positions 90, 186, 276, and 285. Thus, Clades B and D were capable of being identified using these secondary SNP configurations. Taken together, these data provided SNP-based diagnostic schemes for every clade described from the phylogenetic analysis reported in Fig. 1.
3.5. Population structure of the Salmonella Typhimurium complex
S435 C!T
S397 C
3.3. Identification of several ‘‘signature’’ nucleotide substitutions capable of distinguishing SARA clades
In order to further examine the phylogenetic and evolutionary accuracy of our seven-gene concatenated sequence data set, we examined the taxonomic congruence of each of the individual genes employed here. The overall compatibility of sites was measured for the combined dataset using COMPATDNA, a windows-based program based on the compatibility algorithm of Jakobsen and Easteal (1996) whereby two sites are deemed compatible if the substitutions at these sites can be accounted for only once in a phylogeny. Overall compatibility of the seven housekeeping genes was 53% (Fig. 2). When the seven genes were analyzed separately, varying levels of inter-gene compatibility were observed with hemD possessing the highest compatibility score (61.7%) with the six other loci and dnaN retaining the lowest compatibility (44.1%) with the remaining loci. Overall, five of the seven genes included here yielded compatibility scores above 50%. In contrast, a majority of the sites in aroC and dnaN were found to be incompatible with the remainder of the seven-gene data set. An examination of congruence within each of the seven genes revealed remarkable compatibility within five of the seven genes. sucA showed the greatest intra-gene compatibility at 96.4% followed by hemD (84.8%), thrA (76.9%), hisD (75.1%), and purE (64.1%). It is notable that aroC and dnaN, along with retaining the lowest inter-gene compatibility in the seven-gene set, also yielded the lowest intra-gene compatibility scores of 49.1% and 45.5%, respectively (Fig. 2).
87
S435 C
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
[(Fig._2)TD$IG]
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
88
Fig. 2. Compatibility matrix of the concatenated seven-gene sequence matrix showing pairwise comparisons of informative binary sites within (colored triangles) and between the genes indicated. Corresponding base pair coordinates from the 7-gene alignment are listed along the x and y axes. Labels on the diagonal denote the intragene compatibility scores for each gene indicated. Intragene compatibility scores <50% are shaded blue, between 50% and 75% are shaded green, and >75% are shaded yellow. Compatibilities for intergene comparisons are provided in parentheses below the gene names. The compatibility matrix shown was constructed in a program written in Java by R.S. for determination and visualization of incompatible sites. The algorithm is similar to that described in the program RETICULATE originally crafted by Jakobsen and Easteal (1996).
between serovars; ‘Heidelberg’ and ‘Muenchen’ had the fewest, with 6 STs each, and ‘Paratyphi B’ had the most, with 15 STs. Using the eBURST program (Feil et al., 2004), the clonal structure of SARA strains revealed four major clonal complexes
(CC), three doublets and seven singletons (Fig. 3). These multistrain complexes retained six or more common alleles with the suspected ancestral strain (shown in blue, Fig. 3). Most of the clonal complexes appeared to be serovar specific in their
Table 4 Allelic diversity and epidemiologic congruence of the five SARA serovars. Serovar
Typhimurium Saintpaul Heidelberg Paratyphi B Muenchen a b c d e
Number of strains
21 8 11 22 10
Number of alleles aroC
dnaN
hemDa
hisD
purE
sucA
thrA
3 2 2 5 3
3 2 4 7 3
3 1 4 6 5
1 3 2 6 5
3 6 4 12 3
2 3 3 5 3
4 5 4 7 2
Out of 71 SARA strains (SARA 29 not available). Major clade listed in parentheses. Calculated using PAUP* v.4.0B (Swofford, 2002) with 1000 data partitions. ILD score obtained after the removal of strains 42, 43, 56, 61. ILD score obtained after the removal of strains 70, 71.
Total number of STs
% Strains in major cladeb
ILD scorec
7 8 6 15 6
86 (A) 62 (D) 100 (C) 77 (E) 80 (F)
1.00 0.824 1.00 0.715d 1.0e
[(Fig._3)TD$IG]
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
89
Fig. 3. eBurst plot depicting the clonal expansion of the 72 SARA strains of S. enterica based on the seven housekeeping gene concatenated sequence matrix. The four clonal complexes identified by this analysis are denoted I–IV and the SARA strains are indicated by their respective numbers (1–72). The ancestral sequence type (ST) is denoted in blue while derived and ungrouped STs are black. SARA 49 denoted the sole SLV found to spawn a second group of additional SLVs in clonal complex II. The green line represents the triple locus variant connecting CCI and CCII. Individual genes responsible for single locus variants (SLVs) among these clonal complexes are listed along the line connecting them to their ancestral sequence type. Adjacent superscripts of M (mutation) or R (recombination) designate the mechanism most likely responsible for the emergence of the SLV in question and were based on a previously reported empirical method for estimating recombination rates (Feil et al., 2000).
organization. The exception, however, was CCII which is comprised primarily of most of the SARA ‘Heidelberg’ strains but also two ‘Paratyphi B’ strains, 42 and 43, and one ‘Typhimurium’, SARA13. It is also interesting that CCI and CCII are linked via S. Typhimurium strain SARA13, a triple locus variant of the ancestral strains comprising CCI and a notable outlier to the larger ‘Typhimurium’ clade in the parsimony tree presented in Fig. 1. Interestingly, most of the single-locus allelic variants (11 of 19) identified in Fig. 3 seem to have arisen from recombination rather than point mutation (8 of 19). This resulted in a per-allele recombination/mutation (r/m) ration of 1.4:1. In the per-site analysis, the r/m ratio was calculated from the total number of substitutions within the 11 alleles assumed to have arisen via recombination (n = 11) compared to the number introduced by mutation (n = 8). This resulted in a per-site r/m ratio of 5:1. As noted previously, the correlation among serovar status and phylogenetic grouping varied between 75% and 100% (Table 4). That is, 75% of S. Saintpaul strains resided in Clade D (see Fig. 1), while 100% of S. Heidelberg strains coalesced within Clade C. Of the five serovars, only ‘Heidelberg’ strains assorted monophyletically. The remaining four serovars failed to form monophyletic relationships using these seven housekeeping genes. Incongruence length difference (ILD) analysis was used to identify individually recombinant strains. Six strains from serovars ‘Paratyphi B’ (strains A42, A43, A56, and A61) and ‘Muenchen’ (strains A70 and A71) were singled out as being incongruent with their serovar (Table 4). Upon inspection, these strains were also found to be phylogenetically discordant between these two serovars in the Fig. 1 tree. Split decomposition analysis of the concatenated seven-gene alignment for each serovar confirmed the ILD findings and identified this same set of strains as being potential targets of recombination (Fig. 4). Several reticulated pathways identified from the split-tree networks underscored the incongruence identified previously from either tree inspection or ILD analysis for SARA strains A7, A8, A13, A26, A29, A42, A43, A56, A61, A62, A70, and A71.
4. Discussion The purpose of this study was to investigate the evolutionary genetic aspects of a highly homogeneous reference population of S. enterica – the SARA collection, also called the ‘Typhimurium’ complex – using a concatenated seven housekeeping gene DNA sequence matrix. The ‘Typhimurium’ complex (e.g., SARA) presented a homogeneous and well-characterized population of clinically relevant salmonellae for which to challenge an MLST approach. Numerous previous MLST studies have cited the success of MLST in delineating closely related bacterial species (Godoy et al., 2003; Priest et al., 2004; Richter et al., 2006). Here, phylogenetic analysis revealed seven distinct evolutionary clades of SARA strains, several of which are well-supported statistically. The majority of these clades assorted SARA strains largely along serological lines indicating that, for the most part, the serological classification of a strain was consistent with its phylogenetic grouping. It is noteworthy that the MLST clades derived here deviated from MLEE clusters reported previously by Beltran et al. (1991). Numerous strains from a common serovar and a single multi-locus sequence clade were found displaced into disparate MLEE clusters. It is also notable that our tree resolved SARA strains into seven distinct clades, each containing multiple strains. MLEE also provided about the same number (n = 6) of well-defined strain clusters as our combined housekeeping gene tree indicating that the discriminatory power of these methods is comparable despite the loss of phylogenetic accuracy in methods like MLEE. Taken together, these data point to the combined seven-gene scheme used here as a reasonable compromise between subtyping schemes that provide extensive differentiation but belie epidemiological relevance and those methods that offer evolutionarily accurate clustering but little discriminatory potential. Several strains analyzed in our study presented seemingly aberrant phylogenetic positions in our tree. Due to conflicting signals among the seven different genes, SARA strains A29, A56,
[(Fig._4)TD$IG]
90
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
Fig. 4. Split decomposition analysis of concatenated housekeeping gene sequences for the five S. enterica serovars comprised by SARA strains. Individual panels in the figure reveal split graph networks for the SARA strains representative of the serovar indicated. Split graph solutions are derived from for each serovar from the combined seven-gene sequence matrix. SARA strains are listed by number adjacent to their respective nodes in the networks. Diamonds in the graphs represent parallel evolutionary pathways in the data and signal a reticulated phylogenetic history for strains associated with these portions of the graph. All split graphs were drawn with equal edges to illuminate every parallel path in the networks. Strains denoted with an asterisk were also identified by the ILD test as potential recipients of horizontally exchanged DNA. Split graphs were generated by split decomposition of concatenated sequences using the SplitsTree program v.2.4 (Huson and Bryant, 2006).
A61, and A62 sorted at the base of the tree as a polytomy. This lack of resolution in strain placement may simply be the result of misidentification from the outset for several members of the SARA collection. More than likely, however, this observation may signal a complex genetic etiology for these strains and points to a substantial role for recombination in the mosaic structure of the chromosomes of these phylogenetically confounding SARA members. Indeed, previous reports have documented extensive levels of recombination between HK gene alleles among the subspecies I salmonellae (Brown et al., 2002, 2003; Octavia and Lan, 2006). Previous phylogenetic analyses of the SARA collection, such as MLEE, and mdh DNA sequence analysis, also exhibited atypical clustering patterns for three of these strains (Selander et al., 1990a,b). In line with this hypothesis is the observation that molecular serological typing of these four strains yielded O-group assignments consistent with their designated serovars (unpublished observation). It is equally remarkable that within Clade C of the SARA tree, significant congruence was noted among strains representing distinct serotypes. Specifically, when ILD scores were examined along clade boundaries, Clade C comprised a strain group with 100% congruence (ILD = 1.00). This was surprising given that two S. Paratyphi B (A42 and A43), one S. Typhimurium (A13), and one S. Saintpaul (A26) strain populated this predominantly S. Heidelberg clade. This observation signals a more widespread genomic compatibility between these strains despite their disparate serovar status. Otherwise, ILD analysis likely would have detected any strain or group of strains in phylogenetic discordance with the remainder of the clade. One explanation for this unexpected strain clustering is a reticulated evolutionary pathway for serotype. In support of this notion, a preliminary analysis of whole genome sequences of SARA strains A13 and A42 suggested a ‘Heidelberg’ origin for the chromosome of these strains and not ‘Typhimurium’ or ‘Paratyphi B’ (unpublished observation).
Clade F, as revealed by the pairwise distance matrix, yielded the greatest amount of intra- and interclade diversity in the MLST tree. That is, Clade F was, on average, more diverged from every other SARA clade than the remaining clades were from each other. Additionally, SARA strain sequences composing Clade F varied more from each other than sequences derived from strains residing in other MLST clades. The elevated sequence diversity associated with these eight S. Muenchen strains may be explained in at least two ways. First, homologous recombination or increased mutation rates among Clade F coding sequences could account for elevated variation observed among these particular SARA strains. The diversifying effects of recombination and basal mutation rate increases have long been noted (Dykhuizen and Green, 1991; LeClerc et al., 1996; Matic et al., 1997). A more parsimonious explanation simply may reflect a more extensive divergence time for Clade F strains since it is difficult to postulate an evolutionary scenario that would assign an advantage to elevated substitution or recombination rates solely among Clade F strains and not the remaining closely related strains comprising the SARA complex. The discriminatory capability of our concatenated dataset to resolve individual SARA strains varied according to serovar. For example, S. Typhimurium strains proved more challenging to differentiate while S. Paratyphi B could be distinguished more readily. Moreover, the issue arises as to the suitability of the sevengene set included here for optimal phylogenetic partitioning of the SARA strain complex. First, the most parsimonious tree to result from the single concatenated sequence analysis lacked deep nodal structure thus preventing a detailed assessment of ancestry among the subgroups comprising the SARA complex. This dearth of resolution deep in the tree may have resulted from incongruence among several of these genes (i.e., aroC and dnaN were less than 50% compatible [see Fig. 2]). Second, some of the genes included here provided very few substitutions for analysis. This was true
R.L. Bell et al. / Infection, Genetics and Evolution 11 (2011) 83–91
particularly for aroC, where substitution rates were very low (i.e., only 2.5% of sites varied between SARA alleles for this gene). This lack of variation coupled with the observed incongruence between aroC and other genes places its utility into question when assembling an MLST scheme for closely related Salmonella strains. Similarly, thrA yielded substantial sequence variability along with a relatively large number of distinct alleles (n = 19, see Table 1). However, only six of these alleles (31%) were found to be parsimony informative (shared by at least two strains) indicating that the majority of diversity observed for this gene could be categorized as single-strain attribution rather than strain clustering. Despite this caveat, the MLST scheme employed here was capable of partitioning this highly homogeneous group of Salmonella subspecies I serovars. The exploration of additional gene sets may provide further detailed evolutionary hypotheses of this and other S. enterica strain populations along with the identification of new signature substitutions capable of differentiating this significant group of foodborne pathogenic strains. Ethical statement No human subjects or animals were used in this study. Acknowledgments We are very grateful to E. Strain, M. Naum, J. Zheng, and M. Allard, for insightful comments and to E.F. Boyd for kindly contributing strains. R. Bell was supported in part by an appointment with the Research Participation Program at the Center for Food Safety and Applied Nutrition administered by Oak Ridge Institute for Science and Education through and interagency agreement between the U.S. Department of Energy and the U.S. Food and Drug Administration Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.meegid.2010.10.005 References Alcaine, S.D., Soyer, Y., Warnick, L.D., Su, W.L., Sukhnanand, S., Richards, J., Fortes, E.D., McDonough, P., Root, T.P., Dumas, N.B., Grohn, Y., Wiedmann, M., 2006. Multilocus sequence typing supports the hypothesis that cow- and humanassociated Salmonella isolates represent distinct and overlapping populations. Appl. Environ. Microbiol. 72, 7575–7585. Beltran, P., Plock, S.A., Smith, N.H., Whittam, T.S., Old, D.C., Selander, R.K., 1991. Reference collection of strains of the Salmonella typhimurium complex from natural populations. J. Gen. Microbiol. 137, 601–606. Brown, E.W., Kotewicz, M.L., Cebula, T.A., 2002. Detection of recombination among Salmonella enterica strains using the incongruence length difference test. Mol. Phylogenet. Evol. 24, 102–120. Brown, E.W., Mammel, M.K., LeClerc, J.E., Cebula, T.A., 2003. Limited boundaries for extensive horizontal gene transfer among Salmonella pathogens. Proc. Natl. Acad. Sci. U.S.A. 100, 15676–15681. CDC, 2007. Preliminary FoodNet data on the incidence of infection with pathogens transmitted commonly through food—10 states, 2006. MMWR Morb. Mortal. Wkly. Rep. 56, 336–339. CDC, 2008. Cases infected with the outbreak strain of Salmonella Saintpaul, United States, by state, as of August 25 9pm EDT. http://www.cdc.gov/salmonella/ saintpaul/jalapeno/map.html (06.23.2009). Davis, M.A., Hancock, D.D., Besser, T.E., Call, D.R., 2003. Evaluation of pulsed-field gel electrophoresis as a tool for determining the degree of genetic relatedness between strains of Escherichia coli O157:H7. J. Clin. Microbiol. 41, 1843–1849. Dykhuizen, D.E., Green, L., 1991. Recombination in Escherichia coli and the definition of biological species. J. Bacteriol. 173, 7257–7268. Fakhr, M.K., Nolan, L.K., Logue, C.M., 2005. Multilocus sequence typing lacks the discriminatory ability of pulsed-field gel electrophoresis for typing Salmonella enterica serovar Typhimurium. J. Clin. Microbiol. 43, 2215–2219. Farris, J.S., 1983. In: Proceedings of the 2nd meeting of the Willi Hennig Society: Advances in Cladistics 2. Farris, J.S., Kallersjo, M., Kluge, A.G., Bult, C., 1995. Testing significance of incongruence. Cladistics 10, 315–319. Feil, E.J., Li, B.C., Aanensen, D.M., Hanage, W.P., Spratt, B.G., 2004. eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data. J. Bacteriol. 186, 1518–1530.
91
Feil, E.J., Smith, J.M., Enright, M.C., Spratt, B.G., 2000. Estimating recombinational parameters in Streptococcus pneumoniae from multilocus sequence typing data. Genetics 154, 1439–1450. Felsenstein, J., 1985. Confidence-limits on phylogenies—an approach using the bootstrap. Evolution 39, 783–791. Foley, S.L., White, D.G., McDermott, P.F., Walker, R.D., Rhodes, B., Fedorka-Cray, P.J., Simjee, S., Zhao, S., 2006. Comparison of subtyping methods for differentiating Salmonella enterica serovar Typhimurium isolates obtained from food animal sources. J. Clin. Microbiol. 44, 3569–3577. Godoy, D., Randle, G., Simpson, A.J., Aanensen, D.M., Pitt, T.L., Kinoshita, R., Spratt, B.G., 2003. Multilocus sequence typing and evolutionary relationships among the causative agents of melioidosis and glanders, Burkholderia pseudomallei and Burkholderia mallei. J. Clin. Microbiol. 41, 2068–2079. Gonzalez-Escalona, N., Martinez-Urtaza, J., Romero, J., Espejo, R.T., Jaykus, L.A., DePaola, A., 2008. Determination of molecular phylogenetics of Vibrio parahaemolyticus strains by multilocus sequence typing. J. Bacteriol. 190, 2831–2840. Harbottle, H., White, D.G., McDermott, P.F., Walker, R.D., Zhao, S., 2006. Comparison of multilocus sequence typing, pulsed-field gel electrophoresis, and antimicrobial susceptibility typing for characterization of Salmonella enterica serotype Newport isolates. J. Clin. Microbiol. 44, 2449–2457. Hillis, D.M., Huelsenbeck, J.P., 1992. Signal, noise, and reliability in molecular phylogenetic analyses. J. Hered. 83, 189–195. Huson, D.H., Bryant, D., 2006. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol. 23, 254–267. Jakobsen, I.B., Easteal, S., 1996. A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. Comput. Appl. Biosci. 12, 291–295. Kotetishvili, M., Stine, O.C., Kreger, A., Morris Jr., J.G., Sulakvelidze, A., 2002. Multilocus sequence typing for characterization of clinical and environmental Salmonella strains. J. Clin. Microbiol. 40, 1626–1635. Kumar, S., Tamura, K., Jakobsen, I.B., Nei, M., 2001. MEGA2: Molecular Evolutionary Genetics Analysis Software. Penn State University, State College, PA. Lacher, D.W., Steinsland, H., Blank, T.E., Donnenberg, M.S., Whittam, T.S., 2007. Molecular evolution of typical enteropathogenic Escherichia coli: clonal analysis by multilocus sequence typing and virulence gene allelic profiling. J. Bacteriol. 189, 342–350. LeClerc, J.E., Li, B., Payne, W.L., Cebula, T.A., 1996. High mutation frequencies among Escherichia coli and Salmonella pathogens. Science 274, 1208–1211. Maiden, M.C., Bygraves, J.A., Feil, E., Morelli, G., Russell, J.E., Urwin, R., Zhang, Q., Zhou, J., Zurth, K., Caugant, D.A., Feavers, I.M., Achtman, M., Spratt, B.G., 1998. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc. Natl. Acad. Sci. U.S.A. 95, 3140–3145. Matic, I., Radman, M., Taddei, F., Picard, B., Doit, C., Bingen, E., Denamur, E., Elion, J., 1997. Highly variable mutation rates in commensal and pathogenic Escherichia coli. Science 277, 1833–1834. Octavia, S., Lan, R., 2006. Frequent recombination and low level of clonality within Salmonella enterica subspecies I. Microbiology 152, 1099–1108. Priest, F.G., Barker, M., Baillie, L.W., Holmes, E.C., Maiden, M.C., 2004. Population structure and evolution of the Bacillus cereus group. J. Bacteriol. 186, 7959–7970. Richter, D., Postic, D., Sertour, N., Livey, I., Matuschka, F.R., Baranton, G., 2006. Delineation of Borrelia burgdorferi sensu lato species by multilocus sequence analysis and confirmation of the delineation of Borrelia spielmanii sp. nov. Int. J. Syst. Evol. Microbiol. 56, 873–881. Selander, R.K., Beltran, P., Smith, N.H., Barker, R.M., Crichton, P.B., Old, D.C., Musser, J.M., Whittam, T.S., 1990. Genetic population structure, clonal phylogeny, and pathogenicity of Salmonella paratyphi B. Infect. Immun. 58, 1891–1901. Selander, R.K., Beltran, P., Smith, N.H., Helmuth, R., Rubin, F.A., Kopecko, D.J., Ferris, K., Tall, B.D., Cravioto, A., Musser, J.M., 1990. Evolutionary genetic relationships of clones of Salmonella serovars that cause human typhoid and other enteric fevers. Infect. Immun. 58, 2262–2275. Sukhnanand, S., Alcaine, S., Warnick, L.D., Su, W.L., Hof, J., Craver, M.P., McDonough, P., Boor, K.J., Wiedmann, M., 2005. DNA sequence-based subtyping and evolutionary analysis of selected Salmonella enterica serotypes. J. Clin. Microbiol. 43, 3688–3698. Swofford, D.L., 2002. Phylogenetic Analysis using Parsimony (PAUP* 4.0b) Program and Documentation. The Smithsonian Institution, Washington, DC. Tankouo-Sandjong, B., Sessitsch, A., Liebana, E., Kornschober, C., Allerberger, F., Hachler, H., Bodrossy, L., 2007. MLST-v, multilocus sequence typing based on virulence genes, for molecular typing of Salmonella enterica subsp. enterica serovars. J. Microbiol. Methods 69, 23–36. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., Higgins, D.G., 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25, 4876–4882. Torpdahl, M., Skov, M.N., Sandvang, D., Baggensen, D.L., 2005. Genotypic characterization of Salmonella by multilocus sequence typing, pulse-filed gel electrophoresis and amplified fragment length polymorphism. J. Microbiol. Methods 63, 173–184. USDA, 2009. Foodborne Illness Cost Calculator: Salmonella. http://www.ers.usda.gov/data/foodborneillness/salm_intro.asp (09.28.2010). Wang, L., Andrianopoulos, K., Liu, D., Popoff, M.Y., Reeves, P.R., 2002. Extensive variation in the O-antigen gene cluster within one Salmonella enterica serogroup reveals an unexpected complex history. J. Bacteriol. 184, 1669–1677. Zheng, J., Keys, C.E., Zhao, S., Meng, J., Brown, E.W., 2007. Enhanced subtyping scheme for Salmonella enteritidis. Emerg. Infect. Dis. 13, 1932–1935.