A pangenomic study of Bacillus thuringiensis

A pangenomic study of Bacillus thuringiensis

Available online at www.sciencedirect.com Journal of Genetics and Genomics 38 (2011) 567e576 www.jgenetgenomics.org A pangenomic study of Bacillus t...

1MB Sizes 5 Downloads 82 Views

Available online at www.sciencedirect.com

Journal of Genetics and Genomics 38 (2011) 567e576 www.jgenetgenomics.org

A pangenomic study of Bacillus thuringiensis Yongjun Fang a,e,1, Zhaolong Li b,c,1, Jiucheng Liu b,e,1, Changlong Shu d, Xumin Wang b, Xiaowei Zhang b,e, Xiaoguang Yu b,e, Duojun Zhao b,e, Guiming Liu b,e, Songnian Hu b,e, Jie Zhang d,*, Ibrahim Al-Mssallem e,*, Jun Yu a,b,c,e,* a

James D. Watson Institute of Genome Sciences, College of Life Science, Zhejiang University, Hangzhou 310058, China CAS Key Laboratory of Genome Science and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100029, China c Graduate University of Chinese Academy of Sciences, Beijing 100049, China d State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing 100193, China e The Date Palm Genome Project (DPGP), King Abdulaziz City for Science and Technology (KACST), P.O. Box: 6086, Riyadh 11442, Kingdom of Saudi Arabia b

Received 24 July 2011; revised 25 October 2011; accepted 9 November 2011

Abstract Bacillus thuringiensis (B. thuringiensis) is a soil-dwelling Gram-positive bacterium and its plasmid-encoded toxins (Cry) are commonly used as biological alternatives to pesticides. In a pangenomic study, we sequenced seven B. thuringiensis isolates in both high coverage and basequality using the next-generation sequencing platform. The B. thuringiensis pangenome was extrapolated to have 4196 core genes and an asymptotic value of 558 unique genes when a new genome is added. Compared to the pangenomes of its closely related species of the same genus, B. thuringiensis pangenome shows an open characteristic, similar to B. cereus but not to B. anthracis; the latter has a closed pangenome. We also found extensive divergence among the seven B. thuringiensis genome assemblies, which harbor ample repeats and single nucleotide polymorphisms (SNPs). The identities among orthologous genes are greater than 84.5% and the hotspots for the genome variations were discovered in genomic regions of 2.3e2.8 Mb and 5.0e5.6 Mb. We concluded that high-coverage sequence assemblies from multiple strains, before all the gaps are closed, are very useful for pangenomic studies. Keywords: Bacillus thuringiensis (B. thuringiensis); Pseudo-chromosome; Pangenome

1. Introduction Bacillus thuringiensis, or B. thuringiensis (Schnepf et al., 1998), is a Gram-positive bacterium and forms the taxonomic B. cereus clade with B. cereus, B. anthracis, B. mycoides, B. pseudomycoides, B. weihenstephanensis, and B. cytotoxis (or B. cytotoxicus) (Bavykin et al., 2004; Lapidus et al., 2008; Vos et al., 2009). B. thuringiensis is widely used as a biological pesticide because of its ability to kill insects of several diverse taxonomic orders (Widner and Whiteley, * Corresponding authors. Tel: þ86 10 6289 6634, fax: þ86 10 6289 4642 (J. Zhang); Tel: þ966 1481 4394, fax: þ966 1481 3458 (I. Al-Mssallem); Tel: þ86 10 8299 5357, fax: þ86 10 8299 5373 (J. Yu). E-mail addresses: [email protected] (J. Zhang); [email protected] (I. Al-Mssallem); [email protected] (J. Yu). 1 These authors contributed equally to this work.

1989), which are composed of harmful pests to crops and human health (Knowles et al., 1992; Koni and Ellar, 1994; Mizuki et al., 1999). The insect-killing ability of B. thuringiensis lies in the toxins that are mainly encoded by its large plasmids, such as pBtoxis (Berry et al., 2002). The concept of microbial pangenome, or an extensive genomic comparison for a single species (Medini et al., 2005; Read and Ussery, 2006), is very encouraging, which often leads to a global view and overall comprehension in genetic content and variability for the species. The pangenome can be either “open” or “closed,” depending on its capability of acquiring new genes. Theoretically and mathematically, an open pangenome can freely acquire genes into its sequence repertory along with the addition of new isolates, whereas a closed pangenome stops accumulating new genes at a limited pangenomic content. Essentially, this reflects the

1673-8527/$ - see front matter Copyright Ó 2011, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, and Genetics Society of China. Published by Elsevier Limited and Science Press. All rights reserved. doi:10.1016/j.jgg.2011.11.001

568

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

ability of a bacterium to obtain genes from other bacteria or any genetic materials from other organisms inhabiting in the same community with the premise of global dispersibility of itself. For B. thuringiensis genomics, there are five complete and twenty in-progress genomes publicly available on NCBI (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, as of November 6, 2011) but its pangenomic nature is not yet known although those of its close relatives, B. cereus and B. anthracis, are known to be open and closed, respectively (Tettelin et al., 2005, 2008). Nonetheless, many studies discovered that members of the B. cereus clade are very closely related in terms of phylogenic evolution as B. thuringiensis may resemble B. cereus when losing its characteristic plasmids (Carlson et al., 1994; Helgason et al., 1998, 2000), and vice versa, a B. cereus strain may display characteristic functional properties of B. thuringiensis or B. anthracis when it acquires plasmids of B. thuringiensis or B. anthracis, such as pBtoxis (Hu et al., 2005) or pXO1/pXO2 (Hoffmaster et al., 2004; Klee et al., 2010). According to the distributed genome hypothesis (DGH) (Hogg et al., 2007), a bacterial species has a supragenome or pangenome holding a gene pool for any (not particular) naturally transformable strains to exchange some of their genes through mechanisms such as horizontal gene transfer (Nelson et al., 1999) to adapt to their dwelling environment. Taking the advantage of the next-generation sequencing technology (Metzker, 2009; Zhou et al., 2010), we sequenced seven B. thuringiensis strains using Roche GS FLX and carried out a pangenomic study for both understanding their pangenomic characteristics and discovering their insect-killing genes. 2. Materials and methods 2.1. Bacterial strains and DNA extraction The strain SU4 (CGMCC No. 2071) was stored in China General Microbiological Culture Collection Center (CGMCC), and other strains were obtained from the Bacillus Genetic Stock Center (BGSC) (BGSC No. of each strain used in this studydHD212: 4K1, HD516: 4S3, SLM: 4AG1, T22001: 4AN1, T03B001: 4AO1, and Toguchini: 4AD1). We obtained the seven B. thuringiensis strains from the curator (State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences). According to the information of the selected B. thuringiensis isolates, they are all soil-dwelling, and the original places where they were isolated are: HD212 and HD516 from United States of America, SLM from France, SU4 and T22001 from China, T03B001 from Japan, and Toguchini from Russia. Such a distribution is to ensure the validity of this pangenomic study. Genomic DNA samples of seven B. thuringiensis strains: HD212, HD516, SLM, SU4, T03B001, T22001, and Toguchini, were prepared based on a standard protocol (Chomczynski and Sacchi, 1987).

2.2. Genome sequencing and sequence assembling Shotgun libraries were constructed and sequenced on Roche GS FLX according to the manufacturer’s protocol (Margulies et al., 2005). High-coverage sequences were acquired to the redundancy of 19e43 folds for all genomes. Using the default parameters, we assembled each genomes with Newbler, a de novo genomic sequence assembler customdesigned and developed by Roche for handling its sequencing data. 2.3. Building pseudo-chromosomes and sequence annotation We followed a published procedure (Tettelin et al., 2005) to build pseudo-chromosomes based on the assembled contigs for the seven B. thuringiensis genomes through alignments against B. thuringiensis var. Konkukian 97-27 (NC_005957) (Han et al., 2006) using NUCmer of MUMmer 3.2 (Kurtz et al., 2004). We connected all contigs with an oligonucleotide linker sequence (NNNNNCATTCCATTCATTAATTAATTAA TGAATGAATGNNNNN) and placed the unordered contigs at the end of pseudo-chromosomes randomly. The assemblies were annotated based on GeneMark (Lukashin and Borodovsky, 1998) and Glimmer ( Delcher et al., 1999; Besemer and Borodovsky, 2005), along with RNAmmer (Lagesen et al., 2007) for rRNA and tRNA Scan (Lowe and Eddy, 1997) for tRNA. We also compared our assemblies in a pair-wise fashion among themselves, as well as to two reference B. thuringiensis strains, Al Hakam (NC_008600) (Challacombe et al., 2007) and Konkukian 97-27. We surveyed plasmid sequence coverage for the B. thuringiensis assemblies by comparing the sequences to a collection of 22 known B. thuringiensis plasmids from the NCBI databases, which includes mini plasmid (NC_007202), pBMB67 (NC_ 009841), pDAN-involved (NC_007203), pFR12 (NC_010281), pFR12.5 (NC_010282), pFR55 (NC_010283), pGI3 (NC_005567), pK1S1 (NC_009034), pTX14-1 (NC_002091), pTX14-2 (NC_004334), pTX14-3 (NC_0014446), pBMBt-1 (NC_00682), pUIBI-1 (NC_004059), pBtoxis (NC_010076), pBT9727 (NC_006578), pAW63 (NC_010599), pBMB2062 (NC_002108), pBMB7635 (NC_011796), pBMB9741 (NC_001272), pBMB175 (NC_010895), pGI1 (NC_004335), and pALH1 (NC_008598). If the annotated contigs are equal to or greater than 5% of the relevant known plasmid size, we marked the corresponding pseudo-chromosome (or the strain) as harboring the particular plasmid. We also collected important B. thuringiensis-related sequences, including cry, cyt, vip, and endotoxin genes for plasmid annotation and for surveying insecticidal genes. 2.4. Phylogenic analysis We built phylogenic trees based on 16S rRNA and DNA polymerase III a subunit (Zhao et al., 2007) sequences from: three B. thuringiensis genomes (Al Hakam and Konkukian 9727 as well as BMB171 whose accession number is NC_014171)

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

(He et al., 2010), two B. cereus genomes (AH820, NC_011773; and E33L, NC_006274), and the seven B. thuringiensis pseudochromosomes, using MEGA 4 (Tamura et al., 2007). 2.5. Pangenome calculations We combined all genes from the seven B. thuringiensis pseudo-chromosomes into one single query and searched against the query itself with blastN to obtain genes shared by (1) all seven strains, (2) less than seven but greater than one strain, and (3) only one single strain. The data were compiled with a Perl script and extrapolated for core genes, unique genes, and B. thuringiensis pangenome as previously described (Tettelin et al., 2005), but we used median rather than mean value for each group of genes.

569

Table 2 Reads uniquely mapped to B. thuringiensis Al Hakama. Strain

Mapped reads

Percentageb

Ref coverage (%)

HD212 HD516 SLM SU4 T03B001 T22001 Toguchini

241,006 343,593 272,466 204,390 277,996 300,233 319,303

43.20 51.80 63.00 68.30 43.40 69.90 61.40

86.03 86.22 86.81 86.17 85.86 85.34 89.89

a B. thuringiensis Al Hakam is served as the reference (Ref) and its complete genome was downloaded from NCBI; b The percentage represents uniquely mapped reads divided by total reads.

We used Roche GS FLX as sequencing platform that yielded satisfactory data with a median read length of w400 bp. The genome coverage ranges from 19 to 43 (Table 1), leading to an excellent set of genome assemblies with N50 of 50e122 kb. Although the number of contigs for each genome had a broader range, they fell into an interval that is not very challenging but time-consuming for gap closure. To estimate the quality of the genome assembling, we mapped the unique reads to a reference sequence, B. thuringiensis Al Hakam by using GsMapper (a contig assembler developed by Roche) at default setting values. We used the sequence of B. thuringiensis var. Konkukian 97-27 to build pseudo-chromosomes and B. thuringiensis Al Hakam to assess the quality of genome assembling, just to double check for the validity of genome assembling procedures because their sequences (of the same species) are similar and yet different. About 43%e70% of the total reads were uniquely mapped to the reference, which nevertheless cover 85%e90% of the reference genome sequence (Table 2). This observation suggested that significant amount of reads failed to map to the reference genome due to sequence diversities among the different strains. It is also evident that the unique reads mapped to the reference only covered at

most w83% of the new assemblies (pseudo-chromosomes). This phenomenon is consistent with the fact that they acquired significant number of genes from the environment (Supplementary Table 1). We carried out an initial annotation of our assemblies or pseudo-chromosomes since we believe that our sequence coverage is adequate (Tables 1e3). First, the estimated genome size for all B. thuringiensis strains ranges from 5.7 Mb to 6.4 Mb, falling into the size range of known B. thuringiensis strains. Second, Newbler is an assembling tool specifically developed by the sequencer provider for its sequence data. Being quite reliable for sequence assembly, Newbler also tends to merge similar reads into the same contig so that gene clusters may not be separated correctly, especially when quality of the data are not high enough or the gene clusters are numerous. We did observe that there were reducing number of tRNA and rRNA genes in the assembled genomes, and the annotated tRNA and rRNA genes (5S, 16S, and 23S) of pseudo-chromosomes appeared less than those of the complete B. thuringiensis genomes (14 copies in any strain of Al Hakam, var. Konkukian 97-27, and BMB171). Fortunately, the total number of genes per genome is very similar among all strains proportionally (Table 3 and Supplementary Table 2). In fact, when searching the contigs produced with GsMapper, we found that there were 11e14 genes of 5S rRNA but none of 16S and 23S rRNAs since the latter two rRNAs are too large and diverse for GsMapper to map to the reference genome precisely.

Table 1 Data summary of B. thuringiensis genome assemblies.

Table 3 Pseudo-chromosomes and gene prediction.

3. Results 3.1. The sequence assembly of B. thuringiensis strains

Strain

Raw dataa

No. of Contig contigs size

HD212 HD516 SLM SU4 T03B001 T22001 Toguchini

191,349,984 260,515,445 168,640,159 117,930,393 272,091,142 181,211,287 207,638,755

352 128 473 314 513 190 160

N50 size

N90 size

Coverageb

6,159,437 49,230 12,453 31.07 6,135,890 122,399 37,901 42.46 6,371,219 85,408 14,454 26.47 6,200,384 90,407 21,309 19.02 6,289,626 80,216 17,075 43.26 6,053,017 80,129 21,369 29.94 5,655,124 88,976 29,064 36.72

a Size is measured in bp; b Genome coverage equals a value of total read lengths divided by the estimated genome size.

Strain

Pseudochromosomea

GC%

Genes

tRNA

5S rRNA

16S rRNA

23S rRNA

HD212 HD516 SLM SU4 T03B001 T22001 Toguchini

6,179,482 6,147,880 6,396,131 6,214,221 6,353,475 6,062,256 5,664,851

34.93 34.79 35.11 34.67 34.78 34.97 35.14

6,682 6,435 7,109 6,487 6,778 6,381 6,149

98 98 113 102 101 102 106

5 5 6 8 9 7 6

1 1 2 1 1 1 1

2 3 1 1 1 1 2

a The length of B. thuringiensis pseudo-chromosomes is measured in base pairs.

570

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

3.2. Genomic content characterization and diversity evaluation

Table 4 Comparative analysis of the eight B. thuringiensis genomes to B. thuringiensis Al Hakam and repeat statistics.

We carried out a thorough comparative analysis on published sequences and our new assemblies to evaluate their qualities. First, the overall conservation among the B. thuringiensis strains remains very significant; at least 81% genes among the strains are orthologous by a pair-wise comparison (Supplementary Table 2) and all orthologous pairs have identity above 84.5% (Fig. 1). Second, we compared each genome from our seven assemblies as well as the genomic sequence of B. thuringiensis var. Konkukian 97-27 to that of B. thuringiensis Al Hakam (Table 4) by measuring two essential parameters: %Ref and %Query, which are the coverage of matched sequence fragments of a B. thuringiensis strain against the reference and the query sequence itself, respectively. The reason why the coverage of over 85.34% of the unique reads from the seven strains against the reference genome (Table 2) declined to 43.42%e57.11% is largely due to variable sequence identities between strain-specific reads and the reference, which were probably not algorithmically mapped well by MUMmer as compared to GsMapper. We also attempted to evaluate repetitive sequences in our assemblies. There is an enormous amount of simple repeats in the assemblies, such as those 20 bp in length. However, when the repeat unit reaches 50 bp, the number of repeats found among the strains is around 1000. We were able to come up with SNP statistics for every strain (Supplementary Table 3) and some of the changes in coding sequences may be useful for further functional scrutiny.

Strains

Matched bases

Ref (%)c

Query (%)d

Repeatse

Tandem repeatsf

Al Hakama Konkukian 97-27b Toguchini T22001 HD516 HD212 SU4 T03B001 SLM

5,257,091 4,318,127 3,002,439 2,282,406 2,458,924 2,454,627 2,527,517 2,394,285 2,448,247

100.00 82.14 57.11 43.42 46.77 46.69 48.08 45.54 46.57

100.00 82.44 53.00 37.65 40.00 39.72 40.67 37.68 38.28

7,768 7,592 27,352 40,598 20,523 107,859 83,409 208,426 175,677

302 293 522 546 490 692 689 896 817

3.3. Plasmid survey Our data were acquired with dual goal in mind to assemble both genome and plasmid sequences so that the fragmentation

Fig. 1. Orthologous gene alignment based on sequence assemblies of the seven B. thuringiensis isolates. The diagram was drawn by using Mapview after running NUCmer of MUMmer 3.2. The aligned gene is permease, a member of drug/metabolite transporter superfamilies. Due to massive rearrangement, the fragments that matched the annotation were not labeled individually. However, all corresponding fragments were discovered among the genome assemblies. The three tandem horizontal red solid bars (the top row) represent the matched portions of the permease gene (14,387 bp in size), and other horizontal red solid bars (below the top row) indicate NUCmer-detected matches of the seven B. thuringiensis isolates to the permease gene (identities between 84.5% and 100%).

47,855 47,881 53,224 57,832 57,551 58,422 59,701 61,416 59,020

a

The genome of B. thuringiensis Al Hakam, downloaded from NCBI, is the reference in this table; b The genome of B. thuringiensis var. Konkukian 9727 was downloaded from NCBI; c Ref (%): percentage of matched regions against the reference genome; d Query (%): percentage of matched regions against the query genome; e Survey for repeats with length equal to or greater than 20 bp; f For the two columns under tandem repeats, the left one shows the statistics on repetitive units with length equal to or greater than 10 bp and the right one displays those with length equal to or greater than 5 bp.

experiment was done with caution that plasmid DNA needed to be sheared completely because they are usually folded more tightly than larger chromosomal DNA. However, without gap closure and de novo annotation, it is almost impossible to map all the genes from the plasmids. Therefore, we applied a simple rule to estimate the distribution of 22 commonly occurring plasmids among B. thuringiensis strains, i.e., if 5% of the plasmid length is covered by our sequence contigs, we say that this plasmid is identified (Supplementary Table 4). The cutoff value is empirically set according to the statistical standard of P  0.05 although this threshold would probably bring about artificially high values because of the original integration of partial or entire plasmids into bacterial chromosomes. Yet with the consideration of common occurrence of these selected plasmids in B. thuringiensis strains, this supposition has its rationality and should be able to avoid failing to count the actual number of plasmids harbored by a B. thuringiensis strain. Our result showed that all strains seem to harbor 4e6 plasmids (Table 5), and the sequence identities of the selected plasmids, when aligned, are all above 80% as shown for pBtoxis as an example albeit different sequence coverage (Fig. 2). Most of the pBtoxis fragments were found to have their counterparts in the seven strains except in Toguchini, and the fact allowed us to get a clue why this strain has minimal contigs and/or pseudochromosome sizes among the seven B. thuringiensis strains (Tables 1 and 3). Table 5 Distribution of inferred plasmids among the seven B. thuringiensis isolates. Strain

Plasmids

HD212 HD516 SLM SU4 T03B001 T22001 Toguchini

pAW63, pBMB67, pBT9727, pBtoxis pALH1, pBtoxis, pFR55, pTX14-1, pTX14-2 pBtoxis, pTX14-1, pTX14-2, pUIBI-1 pAW63, pBT9727, pBtoxis, pDAN-involved, pFR12, pFR12.5 pALH1, pBMB175, pBtoxis, pGI3, pTX14-2 pAW63, pBT9727, pBtoxis, pFR12, pFR55, pTX14-3 pALH1, pBMB7635, pBMB9741, pDAN-involved, pGI1

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

571

Fig. 2. Plasmid pBtoxis-like fragments among the seven B. thuringiensis isolates. The diagram was drawn by the same method used as in Fig. 1. pBtoxis was used as the reference. The strain Toguchini has very fewer matches than the rest six strains. Red solid bars and numbers are the same annotations as those in Fig. 1.

3.4. Genome sequence alignment and mutation hotspots We analyzed the B. thuringiensis genome assemblies in a pair-wise fashion to reveal similarities and differences among the B. thuringiensis strains (Table 4 and Supplementary Table 5). By running the MUMmer module according to the instructions of MUMmer 3.2, we found that the two reference genomes exhibit a better synteny (82.14% identity) although there are transposons dispersing over their genomes (Fig. 3A). We also showed two other examples: one is a comparison between HD516 and B. thuringiensis Al Hakam (Fig. 3B) and the other is an alignment between two B. thuringiensis pseudochromosomes, SU4 and HD212 (Fig. 3C). We found that the regions of 2.3e2.8 Mb and 5.0e5.6 Mb among the seven isolates seem to be the hotspots for mutation events, including insertion, deletion, and transposition. For the region of 5.0e5.6 Mb, we suspected that the hotspots may be caused by our random arrangement of unordered contigs at the end of pseudo-chromosomes but we were not able to conclude if this is the exclusive reason. For the 2.3e2.8 Mb region, followingup studies should give birth to valuable discoveries in relation to important loci such as terminus of replication. 3.5. Phylogenic relationship of the B. thuringiensis strains Due to horizontal gene transfer, a major mechanism for genetic material exchange among prokaryotes, and the very close relations between members of B. cereus clade, we realized that phylogenic analysis may not yield stable relationship among the B. thuringiensis strains other than accurate and adequate selections of target sequences. Therefore, we used three sequences for this exercise: 16S rRNA, PolC, and dnaE3, all of which are excellent for phylogeny. The latter two are DNA polymerase III a subunits of Firmicutes, whose sequences are highly conserved (Zhao et al., 2006, 2007; Hu et al., 2007a, 2007b) (Fig. 4). Although low bootstrap values indicated poor separations among the individual strains, the B. thuringiensis strains showed great intra-specific diversity when including B. cereus as out-group in the 16S rRNA-based phylogeny, B. thuringiensis Al Hakam was partitioned with the B. cereus strains (Fig. 4A), as previously observed (Tourasse and Kolstø, 2008). Furthermore, phylogeny based on PolC and dnaE3 gave rise to a rather stable but fine differential relationship among all B. thuringiensis strains (Fig. 4B). The phylogenic relationship allowed us to do pairwise comparison for mapping out detailed SNPs and repeat

contents (Tables 4 and Supplementary Table 3) (Radnedge et al., 2003; Bavykin et al., 2004). 3.6. Pangenome calculations We calculated B. thuringiensis pangenome parameters based on our seven genome assemblies (Fig. 5) and showed that the B. thuringiensis pangenome has a set of 4196 core genes (Fig. 5A) and an asymptotic value of 558 unique genes (Fig. 5B), and is “open” in nature (Fig. 5C). The reasons why we exclusively enclosed the seven pseudo-chromosomes but did not include all complete B. thuringiensis genomes presently available on NCBI into the pangenome calculation are twofold. First, the curve fitting was not as smooth as expected upon the addition of complete genomes. The shape of the curves remained the same with or without adding the complete genomes. Second, our unfinished genome assemblies tend to overestimate the number of genes (Table 3) so that the resulting pangenome may have more genes than what it should have. Therefore, we did the B. thuringiensis pangenome calculation by using just our seven B. thuringiensis assemblies as long as we understand the outcome of our analyses albeit the probability slightly biased. Since the number of genes is quite large, we listed their functional classifications (Supplementary Tables 1 and 6) in terms of functional categories and metabolic pathways involving all relevant unique genes of the seven B. thuringiensis isolates. There are tremendous individual variations among the strains, and these genes are acquired from different origins and undergo diverse genetic exchanges with bacteria living in the same environment. These genes primarily function in the following categories: (1) biological process: biological regulation, cellular process, developmental process, establishment of localization, localization, metabolic process, and response to stimulus; (2) cellular component; (3) molecular function: binding, catalytic activity, transcription regulator activity, and transporter activity; (4) genetic information processing: replication and repair; and (5) metabolism: amino acid metabolism, carbohydrate metabolism, and metabolism of terpenoids and polyketides. 3.7. Evaluation of insecticidal genes To estimate the number of insecticidal genes in the B. thuringiensis genome assemblies, we used all known sequences available on NCBI as queries. The numbers of the insecticidal genes among the strains vary and the candidate

572

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

reference strains, B. thuringiensis Al Hakam, B. thuringiensis BMB171, and B. thuringiensis var. Konkukian 97-27 have only 2, 1, and 3 relevant genes, respectively, and not all of them have been validated as real insecticidal genes by functional assays. We also aligned the queries to the B. thuringiensis plasmid sequence assemblies, revealing 18 and 2 candidate insecticidal genes harbored by pBtoxis (gij161598544, NC_010076.1) and pBMB67 (gij157502088, NC_009841.1j), respectively. Most of the candidate insecticidal genes are unique to a particular strain. Only a few exceptional cases were discovered. For example, two candidates were found in both pBtoxis and the genome assemblies. One appears as a cryptic Cry protein (gij51090226jdbjj AB116649.1j, B. thuringiensis genes for hypothetical protein, cancer cell-killing Cry protein parasporin-3, and C-terminal half of Cry protein), and this gene was also found in HD212, SLM, T22001, and Toguchini. Another gene (gij3426159jdbjjD88381.1j, a B. thuringiensis gene for insecticidal protein in complete CDS) was identified in both SLM and pBtoxis. Since we have not assembled the sequences in their complete forms, we have not yet experimentally verified whether any of these candidate insecticidal genes are truly harbored by the genomes or plasmids. Based on the vast literature, we believe that they are most likely carried by plasmids and move around the species, with some leaving relics or cryptic forms in their genomes. 4. Discussion

Fig. 3. Examples of genome-wide alignment of the selected B. thuringiensis isolates in an all-versus-all manner: (A) alignment of B. thuringiensis var. Konkukian 97-27 to B. thuringiensis Al Hakam, (B) HD516 to B. thuringiensis Al Hakam, and (C) SU4 to HD212. Matches in the forward strand are in red and those in the reverse strand are in blue.

genes are distributed as 40 in HD212, 7 in HD516 and SLM, 8 in SU4, 24 in T03B001, 4 in T22001, and only 2 in Toguchini. Obviously, more stringent analysis coupled with PCR-based validation is of essence for the final verdict since the three

The immediate goal of our current study is not to assemble all genomes to completion but to interrogate the pangenomics of B. thuringiensis, and the task can be achieved with significant coverage for each genome and enough independent B. thuringiensis isolates. Our data provide further support for DGH. We showed that different B. thuringiensis isolates harvest diverse genes from the environment through horizontal gene transfer, as well as its various mechanisms, and many of these genes are not easily annotated unless plasmids harboring transient genes are truly identified (Gonzalez et al., 1982). Our results suggest that individual B. thuringiensis isolates capture genes essential or harmless to their survival from other bacteria dwelling in the same microbial community so that they manifest enormous diversity for adaptation to their unique environments. In addition, other mobile genetic elements (MGEs) (Frost et al., 2005), such as transposons, phages, integrons, and genomic islands, also inevitably contribute to a pool of supragenome or pangenome although this study was not focused on the other MGEs. Our study demonstrated that B. thuringiensis pangenome is open, similar to B. cereus but different from B. anthracis, as the latter is a closed pangenome. Why do these closely related species within the same genus behave so differently? If we assume that an open pangenome is the norm for the genus of Bacillus, it must be that B. anthracis, having a closed pangenome, is unusual. We deduced several reasons for this disparity. First, B. anthracis is a recently emerged clone of B. cereus, with an estimated origin in the Holocene period

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

573

Fig. 4. Phylogenetic relationship of the B. thuringiensis isolates among the B. cereus clade. Neighbor-Joining trees were drawn with MEGA4 following a sequential handling of selected CLUSTAL alignments coupled with bootstrapping. A: phylogenic survey of twelve strains of the B. cereus clade based on 16S rRNA. B: phylogenic analysis of ten isolates of B. thuringiensis based on their PolC and dnaE genes.

(12,000e26,000 years ago) (Vassileva et al., 2006), and hence the short evolutionary time is most likely to result in a limited number of genetic variations. Second, B. anthracis is highly monomorphic, non-motile, and mostly present as a dormant spore in the soil (Kolstø et al., 2009), so that its involvements in homologous recombination and horizontal gene transfer

events, which are the major forces driving bacterial evolution, are scarce. Third, unlike B. cereus and B. thuringiensis, which have the global regulator PlcR to control the normal transcription of secreted virulence factors, B. anthracis is defective of PlcR regulon because of a nonsense mutation at nucleotide position 640 of the plcR gene (Agaisse et al., 1999),

574

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

leading to a difficulty in survival unless in the presence of its two plasmids pXO1 and pXO2. Furthermore, the coevolution of the B. anthracis chromosome with its plasmids exhibits an effective approach for B. anthracis to stay alive in its nichemammals, yet narrowing its host spectrum at the same time. In contrast, B. thuringiensis and B. cereus can easily exchange or acquire genetic materials with/from other bacteria through genetic mechanisms, such as conjugation (Gonzalez et al., 1982; Haack et al., 1996; Vilas-Boˆas et al., 1998). With open pangenomes, B. thuringiensis and B. cereus are also free to acquire genetic materials from the environment and thus to best fit themselves into their evolutionary landscapes. Acknowledgements The work was supported by a grant from King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia (No. KACST 428-29) (http://www.kacst.edu.sa) and aided with an institutional grant from CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences. And this study was also supported by the grants from the National Basic Research Program (973 Program) (No. 2010CB126604), the Special Foundation Work Program (No. 2009FY120100), the Ministry of Science and Technology of the People’s Republic of China and from the National Science Foundation of China (No. 31071163). We thank our colleagues at Beijing Institute of Genomics, Chinese Academy of Sciences: Meng Yang, Guangyu Zhang, Quanzheng Yun, Sun Zhang, Tala, Fusen Li, and Jixiang Wang for their technical supports in sequencing and data analyses. Supplementary data Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jgg.2011.11.001. References

Fig. 5. Core genes and unique genes of the B. thuringiensis pangenome. A: core genes calculation. The horizontal dotted line (red) shows the asymptotic value of y ¼ 9760*exp(x/0.5) þ 1106*exp(x/5.3) þ 4196, where 4196 is the number of core genes regardless of how many genomes are added into the B. thuringiensis pangenome. B: new (unique) genes of a pangenome. The horizontal dashed line (black) indicates the asymptotic value of the function of y ¼ 558.2 þ 1945.7*exp(x/2.0163) as new genomes are being added, the unique genes in an extrapolated genome reaches to 558.2. C: the B. thuringiensis pangenome. The curve (red) isP fitted to function of P(n) ¼ 6487 þ 558.2*(n  1) þ 1945.7*exp[2/2.0163] n2 x¼0 exp[x/2.0163], where P(n) represents a pangenome having a function varying along with n (the quantity of genomes in the pangenome). The median gene number in the seven B. thuringiensis isolates is 6487. It indicates that the B. thuringiensis pangenome increases along with the addition of new genomes. Therefore, the nature of the B. thuringiensis pangenome is open.

Agaisse, H., Gominet, M., Økstad, O.A., Kolstø, A.B., Lereclus, D., 1999. PlcR is a pleiotropic regulator of extracellular virulence factor gene expression in Bacillus thuringiensis. Mol. Microbiol. 32, 1043e1053. Bavykin, S., Lysov, Y., Zakhariev, V., Kelly, J., Jackman, J., Stahl, D., Cherni, A., 2004. Use of 16S rRNA, 23S rRNA, and gyrB gene sequence analysis to determine phylogenetic relationships of Bacillus cereus group microorganisms. J. Clin. Microbiol. 42, 3711e3730. Berry, C., O’Neil, S., Ben-Dov, E., Jones, A., Murphy, L., Quail, M., Holden, M., Harris, D., Zaritsky, A., Parkhill, J., 2002. Complete sequence and organization of pBtoxis, the toxin-coding plasmid of Bacillus thuringiensis subsp. israelensis. Appl. Environ. Microbiol. 68, 5082e5095. Besemer, J., Borodovsky, M., 2005. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res. 33, W451e454. Carlson, C., Caugant, D., Kolsto, A., 1994. Genotypic diversity among Bacillus cereus and Bacillus thuringiensis strains. Appl. Environ. Microbiol. 60, 1719e1725. Challacombe, J.F., Altherr, M.R., Xie, G., Bhotika, S.S., Brown, N., Bruce, D., Campbell, C.S., Campbell, M.L., Chen, J., Chertkov, O., Cleland, C., Dimitrijevic, M., Doggett, N.A., Fawcett, J.J., Glavina, T., Goodwin, L.A.,

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576 Green, L.D., Han, C.S., Hill, K.K., Hitchcock, P., Jackson, P.J., Keim, P., Kewalramani, A.R., Longmire, J., Lucas, S., Malfatti, S., Martinez, D., McMurry, K., Meincke, L.J., Misra, M., Moseman, B.L., Mundt, M., Munk, A.C., Okinaka, R.T., Parson-Quintana, B., Reilly, L.P., Richardson, P., Robinson, D.L., Saunders, E., Tapia, R., Tesmer, J.G., Thayer, N., Thompson, L.S., Tice, H., Ticknor, L.O., Wills, P.L., Gilna, P., Brettin, T.S., 2007. The complete genome sequence of Bacillus thuringiensis Al Hakam. J. Bacteriol. 189, 3680e3681. Chomczynski, P., Sacchi, N., 1987. Single-step method of RNA isolation by acid guanidinium thiocyanate-phenol-chloroform extraction. Anal. Biochem. 162, 156e159. Delcher, A.L., Harmon, D., Kasif, S., White, O., Salzberg, S.L., 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27, 4636e4641. Frost, L., Leplae, R., Summers, A., Toussaint, A., 2005. Mobile genetic elements: the agents of open source evolution. Nat. Rev. Microbiol. 3, 722e732. Gonzalez, J., Brown, B., Carlton, B., 1982. Transfer of Bacillus thuringiensis plasmids coding for delta-endotoxin among strains of B. thuringiensis and B. cereus. Proc. Natl. Acad. Sci. USA 79, 6951e6955. Haack, B.J., Andrews, R.E., Loynachan, T.E., 1996. Tn916-mediated genetic exchange in soil. Soil Biol. Biochem. 28, 765e771. Han, C.S., Xie, G., Challacombe, J.F., Altherr, M.R., Bhotika, S.S., Bruce, D., Campbell, C.S., Campbell, M.L., Chen, J., Chertkov, O., Cleland, C., Dimitrijevic, M., Doggett, N.A., Fawcett, J.J., Glavina, T., Goodwin, L.A., Hill, K.K., Hitchcock, P., Jackson, P.J., Keim, P., Kewalramani, A.R., Longmire, J., Lucas, S., Malfatti, S., McMurry, K., Meincke, L.J., Misra, M., Moseman, B.L., Mundt, M., Munk, A.C., Okinaka, R.T., Parson-Quintana, B., Reilly, L.P., Richardson, P., Robinson, D.L., Rubin, E., Saunders, E., Tapia, R., Tesmer, J.G., Thayer, N., Thompson, L.S., Tice, H., Ticknor, L.O., Wills, P.L., Brettin, T.S., Gilna, P., 2006. Pathogenomic sequence analysis of Bacillus cereus and Bacillus thuringiensis isolates closely related to Bacillus anthracis. J. Bacteriol. 188, 3382e3390. He, J., Shao, X., Zheng, H., Li, M., Wang, J., Zhang, Q., Li, L., Liu, Z., Sun, M., Wang, S., Yu, Z., 2010. Complete genome sequence of Bacillus thuringiensis mutant strain BMB171. J. Bacteriol. 192, 4074e4075. Helgason, E., Okstad, O.A., Caugant, D.A., Johansen, H.A., Fouet, A., Mock, M., Hegna, I., Kolsto, A.-B., 2000. Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensisdone species on the basis of genetic evidence. Appl. Environ. Microbiol. 66, 2627e2630. Helgason, E., Caugant, D., Lecadet, M., Chen, Y., Mahillon, J., Lo¨vgren, A., Hegna, I., Kvaløy, K., Kolstø, A., 1998. Genetic diversity of Bacillus cereus/B. thuringiensis isolates from natural sources. Curr. Microbiol. 37, 80e87. Hoffmaster, A.R., Ravel, J., Rasko, D.A., Chapman, G.D., Chute, M.D., Marston, C.K., De, B.K., Sacchi, C.T., Fitzgerald, C., Mayer, L.W., Maiden, M.C.J., Priest, F.G., Barker, M., Jiang, L., Cer, R.Z., Rilstone, J., Peterson, S.N., Weyant, R.S., Galloway, D.R., Read, T.D., Popovic, T., Fraser, C.M., 2004. Identification of anthrax toxin genes in a Bacillus cereus associated with an illness resembling inhalation anthrax. Proc. Natl. Acad. Sci. USA 101, 8449e8454. Hogg, J., Hu, F., Janto, B., Boissy, R., Hayes, J., Keefe, R., Post, J.C., Ehrlich, G., 2007. Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome. Biol. 8, R103. Hu, J., Zhao, X., Yu, J., 2007a. Replication-associated purine asymmetry may contribute to strand-biased gene distribution. Genomics 90, 186e194. Hu, J., Zhao, X., Zhang, Z., Yu, J., 2007b. Compositional dynamics of guanine and cytosine content in prokaryotic genomes. Res. Microbiol. 158, 363e370. Hu, X., Hansen, B.M., Yuan, Z., Johansen, J.E., Eilenberg, J., Hendriksen, N.B., Smidt, L., Jensen, G.B., 2005. Transfer and expression of the mosquitocidal plasmid pBtoxis in Bacillus cereus group strains. FEMS Microbiol. Lett. 245, 239e247. Klee, S.R., Brzuszkiewicz, E.B., Nattermann, H., Bru¨ggemann, H., Dupke, S., Wollherr, A., Franz, T., Pauli, G., Appel, B., Liebl, W., CouacyHymann, E., Boesch, C., Meyer, F.-D., Leendertz, F.H., Ellerbrok, H.,

575

Gottschalk, G., Grunow, R., Liesegang, H., 2010. The genome of a Bacillus isolate causing anthrax in chimpanzees combines chromosomal properties of B. cereus with B. anthracis virulence plasmids. PLoS ONE 5, e10986. Knowles, B.H., White, P.J., Nicholls, C.N., Ellar, D.J., 1992. A broadspectrum cytolytic toxin from Bacillus thuringiensis var. kyushuensis. Proc. R. Soc. Lond. B: Biol. Sci. 248, 1e7. Kolstø, A.B., Tourasse, N.J., Økstad, O.A., 2009. What sets Bacillus anthracis apart from other Bacillus species? Ann. Rev. Microbiol. 63, 451e476. Koni, P.A., Ellar, D.J., 1994. Biochemical characterization of Bacillus thuringiensis cytolytic delta-endotoxins. Microbiology 140, 1869e1880. Kurtz, S., Phillippy, A., Delcher, A., Smoot, M., Shumway, M., Antonescu, C., Salzberg, S., 2004. Versatile and open software for comparing large genomes. Genome Biol. 5, R12. Lagesen, K., Hallin, P., Rødland, E.A., Stærfeldt, H.-H., Rognes, T., Ussery, D.W., 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 35, 3100e3108. Lapidus, A., Goltsman, E., Auger, S., Galleron, N., Se´gurens, B., Dossat, C., Land, M.L., Broussolle, V., Brillard, J., Guinebretiere, M.H., 2008. Extending the Bacillus cereus group genomics to putative food-borne pathogens of different toxicity. Chem. Biol. Inter. 171, 236e249. Lowe, T., Eddy, S., 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955e964. Lukashin, A., Borodovsky, M., 1998. GeneMark. hmm: new solutions for gene finding. Nucleic Acids Res. 26, 1107e1115. Margulies, M., Egholm, M., Altman, W., Attiya, S., Bader, J., Bemben, L., Berka, J., Braverman, M., Chen, Y., Chen, Z., Dewell, S., Du, L., Fierro, J., Gomes, X., Godwin, B., He, W., Helgesen, S., Ho, C., Irzyk, G., Jando, S., Alenquer, M., Jarvie, T., Jirage, K., Kim, J., Knight, J., Lanza, J., Leamon, J., Lefkowitz, S., Lei, M., Li, J., Lohman, K., Lu, H., Makhijani, V., McDade, K., McKenna, M., Myers, E., Nickerson, E., Nobile, J., Plant, R., Puc, B., Ronan, M., Roth, G., Sarkis, G., Simons, J., Simpson, J., Srinivasan, M., Tartaro, K., Tomasz, A., Vogt, K., Volkmer, G., Wang, S., Wang, Y., Weiner, M., Yu, P., Begley, R., Rothberg, J., 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376e380. Medini, D., Donati, C., Tettelin, H., Masignani, V., Rappuoli, R., 2005. The microbial pan-genome. Curr. Opin. Gene Dev. 15, 589e594. Metzker, M., 2009. Sequencing technologiesdthe next generation. Nat. Rev. Genet. 11, 31e46. Mizuki, E., Ohba, M., Akao, T., Yamashita, S., Saitoh, H., Park, Y., 1999. Unique activity associated with non-insecticidal Bacillus thuringiensis parasporal inclusions: in vitro cell-killing action on human cancer cells. J. Appl. Microbiol. 86, 477e486. Nelson, K.E., Clayton, R.A., Gill, S.R., Gwinn, M.L., Dodson, R.J., Haft, D.H., Hickey, E.K., Peterson, J.D., Nelson, W.C., Ketchum, K.A., McDonald, L., Utterback, T.R., Malek, J.A., Linher, K.D., Garrett, M.M., Stewart, A.M., Cotton, M.D., Pratt, M.S., Phillips, C.A., Richardson, D., Heidelberg, J., Sutton, G.G., Fleischmann, R.D., Eisen, J.A., White, O., Salzberg, S.L., Smith, H.O., Craig Venter, J., Fraser, C.M., 1999. Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature 399, 323e329. Radnedge, L., Agron, P.G., Hill, K.K., Jackson, P.J., Ticknor, L.O., Keim, P., Andersen, G.L., 2003. Genome differences that distinguish Bacillus anthracis from Bacillus cereus and Bacillus thuringiensis. Appl. Environ. Microbiol. 69, 2755e2764. Read, T.D., Ussery, D.W., 2006. Opening the pan-genomics box. Curr. Opin. Microbiol. 9, 496e498. Schnepf, E., Crickmore, N., Van Rie, J., Lereclus, D., Baum, J., Feitelson, J., Zeigler, D., Dean, D., 1998. Bacillus thuringiensis and its pesticidal crystal proteins. Microbiol. Mol. Biol. Rev. 62, 775e806. Tamura, K., Dudley, J., Nei, M., Kumar, S., 2007. MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24, 1596e1599. Tettelin, H., Riley, D., Cattuto, C., Medini, D., 2008. Comparative genomics: the bacterial pan-genome. Curr. Opin. Microbiol. 12, 472e477. Tettelin, H., Masignani, V., Cieslewicz, M.J., Donati, C., Medini, D., Ward, N.L., Angiuoli, S.V., Crabtree, J., Jones, A.L., Durkin, A.S.,

576

Y. Fang et al. / Journal of Genetics and Genomics 38 (2011) 567e576

DeBoy, R.T., Davidsen, T.M., Mora, M., Scarselli, M., Margarit y Ros, I., Peterson, J.D., Hauser, C.R., Sundaram, J.P., Nelson, W.C., Madupu, R., Brinkac, L.M., Dodson, R.J., Rosovitz, M.J., Sullivan, S.A., Daugherty, S.C., Haft, D.H., Selengut, J., Gwinn, M.L., Zhou, L., Zafar, N., Khouri, H., Radune, D., Dimitrov, G., Watkins, K., O’Connor, K.J.B., Smith, S., Utterback, T.R., White, O., Rubens, C.E., Grandi, G., Madoff, L.C., Kasper, D.L., Telford, J.L., Wessels, M.R., Rappuoli, R., Fraser, C.M., 2005. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pan-genome”. Proc. Natl. Acad. Sci. USA 102, 13950e13955. Tourasse, N.J., Kolstø, A.B., 2008. SuperCAT: a supertree database for combined and integrative multilocus sequence typing analysis of the Bacillus cereus group of bacteria (including B. cereus, B. anthracis and B. thuringiensis). Nucleic Acids Res. 36, D461eD468. Vassileva, M., Torii, K., Oshimoto, M., Okamoto, A., Agata, N., Yamada, K., Hasegawa, T., Ohta, M., 2006. Phylogenetic analysis of Bacillus cereus isolates from severe systemic infections using multilocus sequence typing scheme. Microbiol. Immunol. 50, 743e749.

Vilas-Boˆas, G.F.L.T., Vilas-Boˆas, L.A., Lereclus, D., Arantes, O.M.N., 1998. Bacillus thuringiensis conjugation under environmental conditions. FEMS Microbiol. Ecol. 25, 369e374. Vos, P., Garrity, G., Jones, D., Krieg, N.R., Ludwig, W., Rainey, F.A., Schleifer, K.-H., Whitman, W.B., 2009. Bergey’s Manual of Systematic Bacteriology. In: The Firmicutes, Second ed., vol. 3. Springer Publishing Company: New York, pp. 21e127. Widner, W., Whiteley, H., 1989. Two highly related insecticidal crystal proteins of Bacillus thuringiensis subsp. kurstaki possess different host range specificities. J. Bacteriol. 171, 965e974. Zhao, X., Zhang, Z., Yan, J., Yu, J., 2007. GC content variability of eubacteria is governed by the pol III alpha subunit. Biochem. Biophys. Res. Commun. 356, 20e25. Zhao, X.Q., Hu, J.F., Yu, J., 2006. Comparative analysis of eubacterial DNA polymerase III alpha subunits. Genomics Proteomics Bioinfomatics 4, 203e211. Zhou, X., Ren, L., Li, Y., Zhang, M., Yu, Y., Yu, J., 2010. The next-generation sequencing technology: a technology review and future perspective. Sci. China Life Sci. 53, 44e57.