Comparative analysis of CRISPR loci in lactic acid bacteria genomes


International Journal of Food Microbiology 131 (2009) 62–70


Philippe Horvath a,⁎, Anne-Claire Coûté-Monvoisin a, Dennis A. Romero b, Patrick Boyaval a, Christophe Fremaux a, Rodolphe Barrangou b

a Danisco France SAS, BP10, F-86220 Dangé-Saint-Romain, France
b Danisco USA Inc., 3329 Agriculture Drive, Madison, WI 53716, USA

Keywords: Clustered regularly interspaced short palindromic repeats (CRISPR); Lactic acid bacteria; Cas gene; Phage; Horizontal gene transfer

ABSTRACT

Clustered regularly interspaced short palindromic repeats (CRISPR) are hypervariable loci, widely distributed in bacteria and archaea, that provide acquired immunity against foreign genetic elements. Here, we investigate the occurrence of CRISPR loci in the genomes of lactic acid bacteria (LAB), including members of the Firmicutes and Actinobacteria phyla. A total of 102 complete and draft genomes across 11 genera were studied and 66 CRISPR loci were identified in 26 species. We provide a comparative analysis of the CRISPR/cas content and diversity across LAB genera and species for 37 sets of CRISPR loci. We analyzed CRISPR repeats, CRISPR spacers, leader sequences, and cas gene content, sequences and architecture. Interestingly, multiple CRISPR families were identified within Bifidobacterium, Lactobacillus and Streptococcus, and similar CRISPR loci were found in distant organisms. Overall, eight distinct CRISPR families were identified consistently across CRISPR repeats, cas gene content and architecture, and sequences of the universal cas1 gene. Since the clustering of the CRISPR families does not correlate with the classical phylogenetic tree, we hypothesize that CRISPR loci have been subjected to horizontal gene transfer and further evolved independently in select lineages, in part due to selective pressure resulting from phage predation. Globally, we provide additional insights into the origin and evolution of CRISPR loci and discuss their contribution to microbial adaptation. © 2008 Elsevier B.V. All rights reserved.

⁎ Corresponding author. Tel.: +33 549191209; fax: +33 549864839. E-mail address: [email protected] (P. Horvath).
0168-1605/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.ijfoodmicro.2008.05.030

1. Introduction

Lactic acid bacteria (LAB) are historically defined as a ubiquitous and heterogeneous family of microbes that can ferment various nutrients primarily into lactic acid. LAB are Gram-positive, micro-aerophilic, acid-tolerant, non-sporulating rods and cocci that reside in diverse habitats, including human cavities such as the gastrointestinal tract, the oral cavity, the respiratory tract and the vaginal cavity, as well as a number of environmental niches including dairy, meat, vegetables and plants (Klaenhammer et al., 2002, 2005; Kleerebezem and Hugenholtz, 2003). LAB are widely used in numerous industrial applications, ranging from starter cultures in the dairy industry to probiotics in dietary supplements and bioconversion agents. Among domesticated bacteria widely studied and exploited, the LAB are found in two distinct phyla, namely Firmicutes and Actinobacteria. Within the Firmicutes phylum, LAB belong to the Lactobacillales order and include the following genera: Aerococcus, Alloiococcus, Carnobacterium, Enterococcus, Lactobacillus, Lactococcus, Leuconostoc, Oenococcus, Pediococcus, Streptococcus, Symbiobacterium, Tetragenococcus, Vagococcus, and Weissella, all of which are low-GC content organisms (31–49%). Within the Actinobacteria phylum, LAB belong to the Atopobium and Bifidobacterium genera, with a GC content of 36–46% and 58–61%, respectively (Klaenhammer et al., 2005; Pfeiler and Klaenhammer, 2007).

At this time, the genome sequences of 50 LAB have been determined and published, with 31 additional draft genomes available in GenBank, and another 167 genome sequencing projects initiated (Liu et al., 2005). LAB have relatively small genomes, approximately 1.7–3.3 Mbp in size, containing 1700–3000 genes (Makarova and Koonin, 2007). Comparative and functional genomics analyses have provided novel and critical insights into the numerous LAB functionalities, phylogenetic diversity and evolutionary processes, notably with regard to their ability to catabolise nutrients, enhance human health, and adapt to their respective habitats. Specifically, differential genomic content and hypervariable regions unravel genetic diversity and reveal critical content involved in unique physiological functionalities and phenotypical specificities.

Clustered regularly interspaced short palindromic repeats (CRISPR) are hypervariable genetic loci widely distributed in bacteria and archaea (Jansen et al., 2002). CRISPRs represent a family of DNA repeats which typically consist of short and highly conserved repeats, interspaced by variable sequences called spacers, often adjacent to cas (CRISPR-associated) genes (Haft et al., 2005; Makarova et al., 2006a; Sorek et al., 2008). Recent studies have established that CRISPR provides acquired resistance against viruses (Barrangou et al., 2007; Deveau et al., 2008; Horvath et al., 2008) and allows microbial populations to survive phage predation (Kunin et al., 2008; Tyson and


Banfield, 2008), possibly via an RNA-interference-like mechanism (Makarova et al., 2006a; Sorek et al., 2008). Since CRISPR loci play a critical role in the adaptation and persistence of a microbial host in a particular ecosystem where viruses are present, they provide a historical perspective of phage exposure, and insight into the co-directed evolution of the phage and the host genomes.

In this manuscript, we investigate the occurrence of CRISPR loci in the genomes of lactic acid bacteria and provide a comparative analysis of the CRISPR/cas content and diversity across LAB genera and species. Specifically, we analyzed CRISPR repeats, CRISPR spacers, leader sequences and cas content and architecture. Additionally, we discuss the origin and evolution of CRISPR loci in light of the phylogenetic relationships of LAB species and genera and address the impact of CRISPR loci on genome evolution and microbial fitness.

2. Materials and methods

Complete genome sequences were retrieved from GenBank at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/, Benson et al., 2008). Draft genome sequences were obtained from specific web sites that are compiled in the Entrez Genome project list (http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi) or in the Genomes OnLine Database (http://www.genomesonline.org/, Bernal et al., 2001). The genome sequence of Atopobium vaginae ATCC BAA55 (release March 14, 2007) was obtained from http://med.stanford.edu/sgtc/research/atopobium_vaginae.html (Atopobium vaginae Genome Project, Stanford Genome Technology Center, funded by the Ellison Medical Foundation and NIH grant HG02052). Sequence data for Streptococcus equi, S. equi subsp. zooepidemicus, Streptococcus pneumoniae 23F, S. pneumoniae INV104B, S. pneumoniae INV200, S. pneumoniae OXC141, Streptococcus suis P1/7, and Streptococcus uberis were produced by the Bacterial Genomes Sequencing Group at the Sanger Institute and can be obtained from ftp://ftp.sanger.ac.uk/pub/pathogens/. For Enterococcus faecalis HH22, E. faecalis OG1RF, E. faecalis TX0104, Enterococcus faecium DO, and Streptococcus iniae 9117, preliminary sequence data were obtained from the Baylor College of Medicine Human Genome Sequencing Center website at http://www.hgsc.bcm.tmc.edu/. Draft genome sequences of Bifidobacterium bifidum JCM1255, Bifidobacterium breve JCM1192, Bifidobacterium catenulatum JCM1194, Lactobacillus fermentum IFO3956, and Lactobacillus rhamnosus ATCC 53103 were retrieved from the Human Metagenome Consortium Japan website (http://metagenome.jp/).

For published genome sequences, CRISPR loci were retrieved from the CRISPRdb database (Grissa et al., 2007b). Alternatively, the detection of CRISPR loci in draft genome sequences was achieved using CRISPRFinder (Grissa et al., 2007a) or Dotter (Sonnhammer and Durbin, 1995). Non-coding sequences located at the 5′ end of the first identified CRISPR repeat for each locus were selected as putative leader sequences and compared using Dotter. Identification of cas genes was performed using BLAST (Altschul et al., 1990) and TIGRFAMs (Haft et al., 2003; Selengut et al., 2007). BLAST was also used for similarity searches between CRISPR spacer sequences and existing sequences in the GenBank database limited to Bacteria (taxid:2) or Viruses (taxid:10239) entries. Only matches showing 100% identity over the complete CRISPR spacer sequences were retained; matches to sequences found within CRISPR loci were ignored. Draft genome sequences were annotated using Artemis 9 (Rutherford et al., 2000). Multiple sequence alignments and phylogenetic analyses were carried out using Clustal X (Larkin et al., 2007), and dendrograms were visualized with the accompanying application NJ Plot.
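The spacer-matching criterion described above (100% identity over the complete spacer) can be illustrated with a minimal Python sketch. It is not the BLAST-based procedure used in this study: it substitutes a plain exact string search on both strands, and the function names and toy sequences are hypothetical. It also does not implement the exclusion of matches that fall within CRISPR loci.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    table = str.maketrans("ACGT", "TGCA")
    return seq.upper().translate(table)[::-1]


def exact_spacer_hits(spacer: str, targets: dict) -> list:
    """List (target_name, start, strand) for every full-length, 100% identical match.

    Illustrative stand-in for a BLAST search filtered to perfect matches;
    `targets` maps a hypothetical sequence name to its DNA sequence.
    """
    forward = spacer.upper()
    reverse = reverse_complement(forward)
    hits = []
    for name, seq in targets.items():
        seq = seq.upper()
        for strand, query in (("+", forward), ("-", reverse)):
            start = seq.find(query)
            while start != -1:
                hits.append((name, start, strand))
                # Continue searching one base further to catch overlapping matches.
                start = seq.find(query, start + 1)
    return hits


# Toy data (hypothetical sequences, not taken from the study).
toy_targets = {
    "phage_A": "GGGAAGGCTCCC",    # contains the spacer on the forward strand
    "plasmid_B": "TTAGCCTTAA",    # contains the spacer on the reverse strand
    "chrom_C": "GGGAAGGCACCC",    # one mismatch, so it is rejected
}
toy_hits = exact_spacer_hits("AAGGCT", toy_targets)
```

A spacer that equals its own reverse complement would be reported on both strands by this sketch; a real pipeline would also need to handle ambiguous bases.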

3. Results

3.1. Occurrence of CRISPR loci in LAB genomes

At the time of analysis, 248 LAB genome projects could be documented, including 212 strains belonging to the Lactobacillales order (genera Aerococcus, Carnobacterium, Enterococcus, Lactobacillus, Lactococcus, Leuconostoc, Oenococcus, Pediococcus, Streptococcus, Symbiobacterium, Tetragenococcus, and Weissella) and 36 strains of the Atopobium and Bifidobacterium genera (Supplementary Table 1). Among them, 50 projects are completed and published in GenBank, while 31 genomes are available in GenBank as multi-contig draft sequences, and an additional 22 complete or draft genome sequences could be retrieved from specific genome project web sites (see Materials and methods). Altogether, 102 genome sequences were subjected to CRISPR analysis. A few pathogenic species of medical interest were over-represented (26 S. pneumoniae, 13 Streptococcus pyogenes, and 8 Streptococcus agalactiae genomes, for instance), while other genera and species, notably Aerococcus, Alloiococcus, Tetragenococcus, Vagococcus, and Weissella, were missing from the analysis. Among the 102 genomic sequences investigated, CRISPR loci were identified in 47 genomes (Supplementary Table 2). This ratio (46.1%) is slightly higher than the documented ratio of CRISPR-containing genomes in the Bacteria superkingdom, which is 40.6% (225 out of 554 genomes), according to the CRISPRdb database (Grissa et al., 2007b). Specifically, CRISPR loci could be identified in Enterococcus, Lactobacillus, Streptococcus, Symbiobacterium, Atopobium, and Bifidobacterium, but not in Carnobacterium (1 genome), Lactococcus (3 genomes), Leuconostoc (2 genomes), Oenococcus (2 genomes), or Pediococcus (1 genome); for these latter species the absence of any identifiable CRISPR locus may be strain-dependent and does not preclude the existence of a CRISPR-cas system within the genome of

Fig. 1. Comparative analysis of CRISPR repeats. Thirty-seven CRISPR repeat sequences were aligned using Clustal X (Larkin et al., 2007). CRISPR repeat families are shown on the right.


other strains belonging to the same species. While most of these CRISPR loci were located on the chromosome, one CRISPR locus was identified on the E. faecium pHT beta plasmid (Tomita and Ike, 2005). For genera with multiple species represented, CRISPRs were identified in 8/13, 10/13, and 4/6 Lactobacillus, Streptococcus, and Bifidobacterium species, respectively. In several cases more than one CRISPR locus could be detected within a genome; three distinct CRISPR loci were found on the Streptococcus thermophilus LMD-9 chromosome (Makarova et al., 2006b; Horvath et al., 2008). For selected species with multiple genome sequences available, such as Bifidobacterium longum, E. faecalis, S. pyogenes, S. suis and S. thermophilus, the presence and number of CRISPR loci varied between strains. The global features of the 66 CRISPR loci identified in LAB sequences are summarized in Supplementary Table 2. The CRISPR locus nomenclature we defined is based on the genus (one upper case letter) and species/subspecies (three lower case letters), complemented with a CRISPR locus number, and an additional letter when multiple strains within the same species carried one particular CRISPR locus. For each CRISPR locus, we provide the number of repeats, the repeat and spacer sizes, the presence of flanking cas genes, and the sequence of the typical repeat (see below). While the average number of repeats per locus is 19.5, only 4 loci had more than 50 repeats, three of which in Actinobacteria. The largest number of repeats (n = 116) was identified in Bifidobacterium adolescentis L2-32, where the last three repeats are separated from the first 113 repeats by an insertion sequence (IS) element. Globally, repeats were 28 to 37 bp long, and spacers are of similar size (26–40 bp), which is consistent with previously identified CRISPR loci (Sorek et al., 2008).
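The "typical repeat" reported for each locus is defined in Section 3.2 as the most frequent repeat sequence within that locus, with a position-wise consensus used for the Ssan1 locus, where nearly all repeats differ. A minimal Python sketch of that idea follows. The function name, the rule of falling back to a consensus only when no repeat occurs more than once, and the toy repeats are all assumptions made for illustration, not the procedure used to build Supplementary Table 2.

```python
from collections import Counter


def typical_repeat(repeats: list) -> str:
    """Pick a representative repeat for one CRISPR locus.

    Returns the most frequent repeat sequence. If every repeat is unique
    (assumed fallback rule), returns the position-wise consensus instead,
    which requires all repeats to have the same length.
    """
    if not repeats:
        raise ValueError("at least one repeat is required")
    sequence, count = Counter(repeats).most_common(1)[0]
    if count > 1:
        return sequence
    length = len(repeats[0])
    if any(len(r) != length for r in repeats):
        raise ValueError("consensus requires equal-length repeats")
    # zip(*repeats) yields one column of bases per position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*repeats))


# Toy loci (hypothetical sequences).
mostly_identical = ["GTTTTAG", "GTTTTAG", "GTTCTAG"]
all_different = ["GATTC", "GTATC", "GTTAC"]
```

With the toy data, `typical_repeat(all_different)` yields a consensus that matches none of the input repeats, which mirrors the observation in Section 3.2 that a consensus sequence can be rare or absent in the locus itself. Ties are resolved by first occurrence in this sketch.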

3.2. Identification of CRISPR repeat families

CRISPRs are typically defined by the sequence of the repeat. Within certain CRISPR arrays, and with the exception of the terminal repeat which is usually degenerate at its 3′ end (Horvath et al., 2008), some repeats may be slightly different from the others due to single nucleotide polymorphisms. Nevertheless, the typical repeat sequence is defined as the most frequent sequence within a particular CRISPR locus (Supplementary Table 2). Remarkably, nearly all 43 repeats of the Ssan1 CRISPR locus of Streptococcus sanguinis SK36 (Xu et al., 2007) are different from each other, rendering the definition of a typical repeat sequence difficult. In this case the typical repeat sequence presented in Supplementary Table 2 is a consensus sequence where each base corresponds to the most frequent nucleotide at each position; interestingly the resulting repeat sequence is found only once within the Ssan1 CRISPR locus.

CRISPR repeats were compared using multiple sequence alignment, where only one repeat sequence per CRISPR locus was included. The dendrogram presented in Fig. 1 illustrates how the various typical repeat sequences of LAB split into CRISPR repeat families. Eight CRISPR repeat families were identified and named according to a representative member, namely Sthe1, Sthe3, Efam1, Lsal1, Blon1, Ldbu1, Sthe2 and Lhel1. Five of these families, Sthe1, Sthe3, Efam1, Lsal1 and Blon1, appear to belong to the same supra-family as they cluster closer to each other, which might indicate a common origin. Interestingly, all five CRISPR repeats across these families are exactly 36 bp long. The other three families are more divergent, namely Lhel1 (32–37 bp repeats), Sthe2 (36–37 bp repeats), and Ldbu1 (28–30 bp repeats).

The Sthe1 family is of particular interest for the Streptococcus genus, as it encompasses the CRISPR loci of four distinct streptococcal species, including the well-characterized and active S. thermophilus CRISPR1 (Barrangou et al., 2007; Horvath et al., 2008). The CRISPR locus of Streptococcus vestibularis (accession numbers DQ072993 and DQ072994; Bolotin et al., 2005) also belongs to this family, with a repeat sequence identical to that of S. thermophilus CRISPR1. Similarly, the Sthe3 family is also interesting for streptococci, since six different streptococcal CRISPRs are grouped, along with two CRISPRs identified in E. faecalis. In contrast, the CRISPR locus found on the E. faecium plasmid pHT beta (Tomita and Ike, 2005) is the sole member of the Efam1 family. The other two families on this part of the dendrogram (Fig. 1), Lsal1 and Blon1, contain three Lactobacillus and two Bifidobacterium CRISPRs, respectively. Thus four genera, namely Enterococcus, Lactobacillus, Streptococcus, and Bifidobacterium, belong to the CRISPR repeat supra-family mentioned above.

For the other supra-family present at the bottom of the dendrogram (Fig. 1), three distinct families were identified, namely Ldbu1, Sthe2 and Lhel1. Four relatively distant genera are represented in the Ldbu1 family: Lactobacillus, Symbiobacterium, Atopobium and Bifidobacterium. Similarly, Lhel1 also includes members of the Lactobacillus, Streptococcus, Symbiobacterium and Bifidobacterium genera. Finally, family Sthe2 is constituted by two streptococcal CRISPRs, different from the other S. thermophilus loci present in the other supra-family. Overall, 4 different CRISPR repeat families were found in Streptococcus genomes, and 3 distinct families were identified in Lactobacillus and Bifidobacterium genomes. These results corroborate previous observations indicating that the distribution of CRISPR loci in prokaryotes is not always consistent with the classification of species based on various chromosomal genes including 16S rDNA (Haft et al., 2005; Godde and Bickerton, 2006; Makarova et al., 2006a; Kunin et al., 2007; Makarova and Koonin, 2007). Previous studies have reported that many CRISPR repeats have a GAAA(C/G) 3′ terminus (Kunin et al., 2007; Sorek et al., 2008); however, this was only detected in the Sthe2 family. In contrast, repeats in the Sthe1 and Sthe3 families identified in the upper supra-family had an A(A/C)AAC 3′ terminus sequence, which might indicate a specific signature for this particular set of CRISPR repeat families.

3.3. Analysis of CRISPR spacer sequences

Spacers, defined as the sequences flanked by two consecutive CRISPR repeats, constitute the most diverse part of CRISPR loci between different bacterial species and strains. It was recently shown that new spacers, as part of new repeat-spacer units, can be acquired by bacteria in response to phage predation (Barrangou et al., 2007; Horvath et al., 2008; Deveau et al., 2008). These short sequences are derived from the infecting phage genome, and their presence in the CRISPR context confers on the bacterium acquired “immunity” against phages which contain an identical proto-spacer. The observed similarity between spacers and plasmid sequences led to the hypothesis that CRISPRs may also provide resistance against plasmid determinants (Mojica et al., 2005; Bolotin et al., 2005; Makarova et al., 2006a; Horvath et al., 2008).

A total of 104 spacers with 100% identity matches over the whole length were identified in the LAB CRISPRs studied, spanning 3 genera and 11 species (Supplementary Table 3). We categorized the CRISPR spacer sequence matches in four categories, namely phage, prophage, plasmid and chromosomal sequences. Overall, 27 showed identity to phages, 49 showed identity to prophage sequences, while 5 showed identity to plasmid sequences and 23 showed identity to chromosomal sequences. Spacers identical to known sequences are particularly insightful, since it was previously shown that 100% identity between spacer and proto-spacer sequences is required to provide immunity (Barrangou et al., 2007; Deveau et al., 2008). These proportions are consistent with previous studies investigating sequence similarity between CRISPR spacers and extrachromosomal elements such as phages and plasmids (Bolotin et al., 2005; Mojica et al., 2005; Barrangou et al., 2007; Horvath et al., 2008; Sorek et al., 2008), although the relative proportions are somewhat different, which reflects the bias in the data available in public databases.

3.4. CRISPR leader sequence analysis

The CRISPR leader typically consists of up to 550 bp of A/T rich noncoding sequence located at the 5′ end of the first repeat, and likely acts as the promoter for transcription of the CRISPR repeat-spacer units (Jansen et al.,

Fig. 2. Graphic representation of CRISPR loci in LAB genomes. The thirty-seven CRISPR loci identified in LAB genomes are represented graphically: repeat/spacer arrays are shown as black boxes; cas genes are represented by narrow arrows while other genes are represented by box arrows. Homologous genes are represented using an identical color scheme; IS elements appear in gray. Parentheses indicate the end of contig sequences. Rectangles represent regions that are deleted in another strain of the same species. cas1 genes are highlighted in bold.


2002; Lillestøl et al., 2006). While the leader is typically not conserved between species, several CRISPR loci found within the same chromosome can share sequence similarities (Sorek et al., 2008). We analyzed leader sequence similarities both within and across the 8 CRISPR repeat families, and overall, no particular sequence conservation was observed. The size of the leaders varied widely between the 8 families identified (from 20 to 534 bp), with an average size of 138 ± 94 bp. Further, we investigated CRISPR leader sequence similarities within families and no significant conservation was observed between different species, especially for families with three or more leader sequences available, with the exception of the Sthe3 family. In this family the 72-bp section of the leader immediately adjacent to the first repeat appeared highly conserved across the five Streptococcus species investigated, namely S. agalactiae, S. iniae, S. mutans, S. pyogenes, and S. thermophilus.

3.5. Most of the CRISPR loci are associated with cas genes

LAB CRISPR loci flanked by cas genes are indicated in Supplementary Table 2, and locus architecture and gene organization for the 37 representative loci we identified are shown in Fig. 2. Thirty loci contain cas gene sets, which can be categorized into 7 distinct cas families according to sequence and architectural conservation. These cas families are identical to the families identified in the CRISPR repeat analysis previously listed, with the exception of Efam1, which is not associated with any cas gene. For all CRISPR loci containing cas genes, the repeats were colinear with the cas genes, with the orientation based on the location of the leader sequence and of the terminal repeat, which corresponds to the probable direction of transcription of the repeat-spacer array. In the vast majority of cases, cas genes are located on one side of the repeat-spacer array, preferably on the 5′ end (except for CRISPR1 of Symbiobacterium thermophilum IAM 14863 where the cas genes are on the 3′ end). In S. sanguinis SK36 and CRISPR2 of S. thermophilus LMD-9 the repeat-spacer array is flanked on both sides by cas genes. Relatively few CRISPR loci did not contain cas genes, notably CRISPR2 of E. faecalis (four strains), the E. faecium plasmid pHT beta, Lactobacillus acidophilus NCFM, Lactobacillus brevis ATCC 367 (containing at least 2 CRISPR loci; a third one is composed of a single repeat), and CRISPR2 of S. thermophilum IAM 14863. In contrast, cas genes were identified in the vicinity of CRISPR2 in Streptococcus mutans UA159, where the repeat-spacer array was reduced to a single repeat, which might indicate degeneracy. This is further supported by the presence of two frameshifts in the cas1 coding sequence, resulting in three open reading frames.

In several cases, the repeat-spacer array or the cas gene set is disrupted by an IS element, notably for S. suis 89/1591, L. fermentum IFO3956 CRISPR1 and CRISPR2, S. thermophilum IAM 14863 CRISPR1, and B. adolescentis (2 strains). In S. thermophilus the presence of an IS element within the repeat-spacer array was observed in 7 of 268 distinct CRISPR1 or CRISPR3 loci analyzed, and always at the opposite end of the leader (Horvath et al., 2008, and unpublished results). Such an insertion event could provide cis regulatory signals that enhance the transcription of distal CRISPR spacers, especially for large repeat-spacer arrays. In other cases IS elements were also identified on one side of the CRISPR-cas region, notably in Lactobacillus casei ATCC 334 and BL23, Lactobacillus helveticus DPC 4571, and S. thermophilum IAM 14863 CRISPR2. The presence of IS elements on both sides of the CRISPR-cas region might indicate a higher propensity for horizontal gene transfer of the CRISPR-Cas system, as previously noted (Godde and Bickerton, 2006; Tyson and Banfield, 2008). Interestingly, the genes upstream of the cas set in Streptococcus infantarius subsp. infantarius ATCC BAA-102 (Sthe1 family) are homologous to those found in four streptococcal members of the Sthe3 family, which might indicate a common ancestry.

In several cases where multiple genome sequences are available for a particular species, we observed partial or complete deletions in the CRISPR-cas region (Fig. 2). Specifically, the complete CRISPR1-cas region identified in E. faecalis OG1RF was absent in the genome of E. faecalis V583. Similarly, the CRISPR1-cas region of B. longum DJO10A is not found in B. longum NCC2705. In L. casei the Lcas1 locus was identified in the genome of strain BL23 but not in that of ATCC 334, whereas the Lcas2 locus present in ATCC 334 was absent in BL23 (Hartke et al., 2008). Finally, in S. thermophilus, CRISPR2 and CRISPR3 are not always present (Horvath et al., 2008).

3.6. Eight CRISPR families are present in LAB

CRISPR-Cas systems have previously been divided into 7–8 subtypes, each defined by a 2–6 cas gene set (Haft et al., 2005; Makarova et al., 2006a). Overall, six core cas genes (cas1-6) are associated with multiple subtypes and the cas1 gene (COG1518) serves as the universal marker of CRISPR-associated genes (Sorek et al., 2008). We analyzed the cas1 gene sequences across all LAB CRISPR loci and generated a phylogenetic tree based on the Cas1 protein sequences (Fig. 3). The clustering generated 7 distinct groups, which correlated with the 8 families identified in the CRISPR repeat analysis, and with the 7 families identified in the cas gene content and architecture comparative analysis. This is consistent with the similarities previously observed between clustering of the typical CRISPR repeats and that of Cas protein sequences (Kunin et al., 2007; Horvath et al., 2008), and the proposed hypothesis that CRISPR repeats and cas genes for a particular CASS system are likely functionally

Fig. 3. Comparative analysis of Cas1 protein sequences. The thirty Cas1 protein sequences identified in LAB loci were aligned using Clustal X (Larkin et al., 2007). CRISPR repeat families are annotated on the right.


coupled (Makarova et al., 2006a; Horvath et al., 2008). This is further supported by the congruence between the CRISPR repeat tree and the Cas1 tree, especially since cas1 is considered as the universal marker for cas genes. Although we have previously observed this phenomenon in a more limited study of S. thermophilus CRISPR loci (Horvath et al., 2008), here comparable results were also obtained across species, genera and two relatively distant phyla. Notably, three different subsets were identified for Bifidobacterium, namely the Blon1 family and two other CRISPR loci in the Lhel1 and Ldbu1 families. Similarly, Lactobacillus CRISPR loci were categorized in three distinct CRISPR families, namely Lsal1, Lhel1 and Ldbu1. Overall, the categorization of the CRISPR repeats in 8 families was identical to the categorization obtained for the CRISPR locus architecture and cas gene content, and also identical to that based on Cas1 sequences. The co-evolution of CRISPR repeats and cas genes likely reflects their functional coupling.

4. Discussion

We identified eight distinct families of CRISPR loci in LAB genomes, based on CRISPR repeat sequences and cas gene content, organization and sequences. These eight families do not cluster according to phylogenetic relationships for LAB. Even though Firmicutes and Actinobacteria are relatively distant phyla, two of the 8 families we identified include at least four different genera and members of both phyla. While CRISPR repeats and cas genes vary widely among microbial species, we identified highly similar loci in relatively distant genera and species. Phylogenetic analyses of the LAB genomes, comparing gene content across genera and species, and reconstructing ancestral gene sets, have revealed a combination of extensive gene loss accompanied by select gene acquisitions via duplication and horizontal gene transfer during the evolutionary adaptation of the various groups to their respective habitats (Makarova et al., 2006b; Makarova and Koonin, 2007). The similarities in CRISPR repeat sequences and in cas gene content, architecture and sequence observed between distant microbial species might be explained by horizontal gene transfer, which is consistent with the presence of CRISPRs on megaplasmids (Godde and Bickerton, 2006; Sorek et al., 2008). Also, the proposed hypothesis that horizontal gene transfer has played a critical role in the distribution and evolution of CRISPR loci is further supported by the observed differences between the GC content of the CRISPR loci, especially the cas genes, and that of the chromosome on which they are present. Notably, for LAB CRISPR loci, %GC differences were most significant for B. adolescentis ATCC 15703 CRISPR1 (Bado1a), with a

cas GC content of 47.04%, whereas the chromosomal GC content is 59.18% (Fig. 4). This might indicate that a CRISPR of the Lhel1 family was transferred laterally from a low-GC Firmicutes genus, perhaps Lactobacillus or Streptococcus, to the high-GC Actinobacteria genus Bifidobacterium. Since these genera are concurrently present in the human GI tract (Glaser et al., 2002; Klaenhammer et al., 2002; Klaenhammer et al., 2005; Schell et al., 2002), and because the primary function of CRISPR is to provide defense against foreign genetic elements, we hypothesize that this CRISPR locus horizontal exchange between two distant phyla may have resulted from Bifidobacterium phage predation in the human intestinal environment.

The extensive gene loss is representative of the metabolic simplification due to the relatively nutrient-rich habitats LAB occupy. Phylogenetic trees have clearly divided the LAB into various groups, first separating the Actinobacteria and the Firmicutes phyla; then, within the Lactobacillales order, four distinct families have been defined, namely the Leuconostoc group (including Leuconostoc and Oenococcus), the L. casei–Pediococcus group (including some lactobacilli and the Pediococcus genus), the Lactobacillus delbrueckii group (including other species of the Lactobacillus genus), and the Streptococcus group (encompassing both streptococci and lactococci) (Makarova et al., 2006b). Other genera and species included in our study, notably the Enterococcus genus and the various Streptococcus species, would most closely relate to the Streptococcus group (Makarova and Koonin, 2007). Among lactobacilli, whole genome comparative analyses have clearly established a dichotomy between two sub-groups (Makarova et al., 2006b; Makarova and Koonin, 2007). This is further supported by the relatively high synteny observed within species of the L. acidophilus group, notably L. acidophilus, L. gasseri, L. johnsonii, L. delbrueckii and L. helveticus (Klaenhammer et al., 2005; Berger et al., 2007; Callanan et al., 2008). Although L. acidophilus and L. helveticus are nearly identical in genome content and organization, their CRISPR content is highly polymorphic, since their CRISPR loci belong to two distinct families (Ldbu1 and Lhel1, respectively), and only Lhel1 contains a cas gene set. This is particularly interesting as both CRISPR loci are located in the same genomic environment on their chromosomes (Callanan et al., 2008). A similar phenomenon was also observed between B. catenulatum CRISPR1 (Ldbu1 family) and B. adolescentis CRISPR1 (Lhel1 family), which are surrounded by homologous genetic content on both sides of the loci. Globally, the eight CRISPR families we identified do not correlate with phylogeny, which suggests that they have evolved independently

Fig. 4. CRISPR/cas variability in GC content. The GC content for CRISPR/cas and the surrounding genomic region in B. adolescentis ATCC 15703 is shown as a graphical display from Artemis (Rutherford et al., 2000). cas genes appear in green, repeats are shown in dark blue, and neighbouring open reading frames are colored in light blue. The GC content plot (300 bp window) is shown at the top. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

69

P. Horvath et al. / International Journal of Food Microbiology 131 (2009) 62–70

from other elements on the chromosome. The CRISPR-Cas system has previously been identified within LAB as a candidate for dissemination via horizontal gene transfer, in L. delbrueckii and L. casei (Makarova and Koonin, 2007). Overall, the diversity of CRISPR loci in LAB likely reflects their lateral origin and their rapid evolution due to environmental pressure, notably linked to phage predation (Tyson and Banfield, 2008; Kunin et al., 2008). This polymorphism is potentially valuable for typing and comparative analyses of strains and microbial populations, within or between particular ecosystems and environmental niches. Since comparative analyses of CRISPR repeats, locus content and architecture, and Cas1 sequences are likely to generate equivalent results, CRISPR repeats are arguably the most convenient and desirable template for comparative CRISPR studies. This is particularly valid in cases whereby cas genes are absent or unknown. Also, the identification and annotation of cas genes can be difficult, especially for novel and perhaps distant organisms, whereas CRISPR arrays can be identified regardless of sequence content. This is also consistent with the core definition and existence of CRISPR loci, which are based upon the presence of repeats. While the environmental phage population clearly influences the bacterial ecosystem (Kunin et al., 2008), the CRISPR system likely applies a relatively high selective pressure on phage in return, which is reflected by the ability of phages to alter their genomes accordingly via mutation and deletion (Barrangou et al., 2007; Deveau et al., 2008). This provides an explanation for the high evolutionary rates observed in phage genomes (Sorek et al., 2008), and phage metagenomic studies would provide insights into their ability to overcome CRISPR systems. Overall, CRISPR loci appear to be a key distinguishing feature in LAB, from a phylogenetic and genome evolutionary standpoint. 
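
As a rough illustration of the repeat-based comparison argued for in the discussion above, the following Python sketch groups loci by the similarity of their repeat sequences. It is not the procedure used in this study: the edit-distance similarity, the single-linkage grouping, the 0.8 threshold, the function names and the toy sequences are all assumptions chosen for illustration only.

```python
from itertools import combinations
from typing import Dict, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion from a
                            curr[j - 1] + 1,            # insertion into a
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1 minus the length-normalized edit distance; 1.0 means identical repeats."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def cluster_repeats(repeats: Dict[str, str], threshold: float = 0.8) -> List[Set[str]]:
    """Single-linkage grouping of loci whose repeats reach the similarity threshold.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    groups = {name: {name} for name in repeats}
    for x, y in combinations(repeats, 2):
        if similarity(repeats[x], repeats[y]) >= threshold and groups[x] is not groups[y]:
            merged = groups[x] | groups[y]
            for name in merged:
                groups[name] = merged
    unique: List[Set[str]] = []
    for group in groups.values():
        if group not in unique:
            unique.append(group)
    return unique


# Toy, artificial repeat sequences (NOT real CRISPR repeats).
toy_repeats = {
    "locus_A": "ACGTACGTACGTACGTACGT",
    "locus_B": "ACGTACGTACCTACGTACGT",  # one substitution relative to locus_A
    "locus_C": "TTTTTTTTTTGGGGGGGGGG",  # unrelated sequence
}
families = cluster_repeats(toy_repeats)
```

With these toy inputs, locus_A and locus_B fall into one group and locus_C into another. For real data, an alignment-based identity measure and a justified threshold would be preferable, and repeat orientation (reverse complement) would also need to be taken into account.
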
The ability of the CRISPR system to provide immunity against foreign genetic elements is essential and actively involved in the microbe's propensity to survive phage predation and adapt to its environment.

Acknowledgements

This work was supported by funding from Danisco A/S. The authors acknowledge the assistance of Ursula Bengård Hansen for genome annotation. We are grateful to Dr. R.W. Hyman (Stanford Genome Technology Center, Palo Alto) for allowing the use of the draft genome sequence of A. vaginae ATCC BAA-55, and to Prof. Axel Hartke (Université de Caen), Dr. Manuel Zuniga (CSIC-IATA, Valencia) and Dr. Josef Deutscher (INRA, Thiverval Grignon) for sharing unpublished results on the L. casei BL23 genome. We also thank Dr. J. Parkhill (The Sanger Institute, Cambridge) for kindly providing access to sequence data generated at The Sanger Institute. The DNA sequence of E. faecalis HH22, E. faecalis OG1RF, and E. faecalis TX0104 was supported by grant number R21 AI64470 from NIAID at the BCM-HGSC. The DNA sequence of E. faecium DO was supported by grant number R01 AI 042399-04 from DOE and NIAID/NIH at the BCM-HGSC. The DNA sequence of S. iniae 9117 was supported by funding from USDA at the BCM-HGSC. The genome sequence data for B. bifidum JCM1255, B. breve JCM1192, B. catenulatum JCM1194, L. fermentum IFO3956, and L. rhamnosus ATCC 53103 were produced by the Human Metagenome Consortium Japan (HMGJ) and can be obtained from http://metagenome.jp/.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ijfoodmicro.2008.05.030.

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410.
Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., Romero, D.A., Horvath, P., 2007. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315, 1709–1712.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L., 2008. GenBank. Nucleic Acids Research 36, D25–30.
Berger, B., Pridmore, R.D., Barretto, C., Delmas-Julien, F., Schreiber, K., Arigoni, F., Brüssow, H., 2007. Similarity and differences in the Lactobacillus acidophilus group identified by polyphasic analysis and comparative genomics. Journal of Bacteriology 189, 1311–1321.
Bernal, A., Ear, U., Kyrpides, N., 2001. Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Research 29, 126–127.
Bolotin, A., Quinquis, B., Sorokin, A., Ehrlich, S.D., 2005. Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151, 2551–2561.
Callanan, M., Kaleta, P., O'Callaghan, J., O'Sullivan, O., Jordan, K., McAuliffe, O., Sangrador-Vegas, A., Slattery, L., Fitzgerald, G.F., Beresford, T., Ross, R.P., 2008. Genome sequence of Lactobacillus helveticus, an organism distinguished by selective gene loss and insertion sequence element expansion. Journal of Bacteriology 190, 727–735.
Deveau, H., Barrangou, R., Garneau, J.E., Labonté, J., Fremaux, C., Boyaval, P., Romero, D.A., Horvath, P., Moineau, S., 2008. Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. Journal of Bacteriology 190, 1390–1400.
Glaser, P., Rusniok, C., Buchrieser, C., Chevalier, F., Frangeul, L., Msadek, T., Zouine, M., Couvé, E., Lalioui, L., Poyart, C., Trieu-Cuot, P., Kunst, F., 2002. Genome sequence of Streptococcus agalactiae, a pathogen causing invasive neonatal disease. Molecular Microbiology 45, 1499–1513.
Godde, J.S., Bickerton, A., 2006. The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. Journal of Molecular Evolution 62, 718–729.
Grissa, I., Vergnaud, G., Pourcel, C., 2007a. CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Research 35, W52–57.
Grissa, I., Vergnaud, G., Pourcel, C., 2007b. The CRISPRdb database and tools to display CRISPRs and to generate dictionaries of spacers and repeats. BMC Bioinformatics 23, 172.
Haft, D.H., Selengut, J.D., White, O., 2003. The TIGRFAMs database of protein families. Nucleic Acids Research 31, 371–373.
Haft, D.H., Selengut, J., Mongodin, E.F., Nelson, K.E., 2005. A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Computational Biology 1, e60.
Hartke, A., Zuniga, M., Deutscher, J., 2008. Unpublished results.
Horvath, P., Romero, D.A., Coûté-Monvoisin, A.C., Richards, M., Deveau, H., Moineau, S., Boyaval, P., Fremaux, C., Barrangou, R., 2008. Diversity, activity, and evolution of CRISPR loci in Streptococcus thermophilus. Journal of Bacteriology 190, 1401–1412.
Jansen, R., Van Embden, J.D., Gaastra, W., Schouls, L.M., 2002. Identification of genes that are associated with DNA repeats in prokaryotes. Molecular Microbiology 43, 1565–1575.
Klaenhammer, T.R., Altermann, E., Arigoni, F., Bolotin, A., Breidt, F., Broadbent, J., Cano, R., Chaillou, S., Deutscher, J., Gasson, M., van de Guchte, M., Guzzo, J., Hartke, A., Hawkins, T., Hols, P., Hutkins, R., Kleerebezem, M., Kok, J., Kuipers, O., Lubbers, M., Maguin, E., McKay, L., Mills, D., Nauta, A., Overbeek, R., Pel, H., Pridmore, D., Saier, M., van Sinderen, D., Sorokin, A., Steele, J., O'Sullivan, D., de Vos, W., Weimer, B., Zagorec, M., Siezen, R., 2002. Discovering lactic acid bacteria by genomics. Antonie Van Leeuwenhoek 82, 29–58.
Klaenhammer, T.R., Barrangou, R., Buck, B.L., Azcarate-Peril, M.A., Altermann, E., 2005. Genomic features of lactic acid bacteria effecting bioprocessing and health. FEMS Microbiology Reviews 29, 393–409.
Kleerebezem, M., Hugenholtz, J., 2003. Metabolic pathway engineering in lactic acid bacteria. Current Opinion in Biotechnology 14, 232–237.
Kunin, V., Sorek, R., Hugenholtz, P., 2007. Evolutionary conservation of sequence and secondary structures in CRISPR repeats. Genome Biology 8, R61.
Kunin, V., He, S., Warnecke, F., Peterson, S.B., Garcia Martin, H., Haynes, M., Ivanova, N., Blackall, L.L., Breitbart, M., Rohwer, F., McMahon, K.D., Hugenholtz, P., 2008. A bacterial metapopulation adapts locally to phage predation despite global dispersal. Genome Research 18, 293–297.
Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., Higgins, D.G., 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947–2948.
Lillestøl, R., Redder, P., Garrett, R., Brügger, K., 2006. A putative viral defence mechanism in archaeal cells. Archaea 2, 59–72.
Liu, M., van Enckevort, F.H., Siezen, R.J., 2005. Genome update: lactic acid bacteria genome sequencing is booming. Microbiology 151, 3811–3814.
Makarova, K.S., Koonin, E.V., 2007. Evolutionary genomics of lactic acid bacteria. Journal of Bacteriology 189, 1199–1208.
Makarova, K.S., Grishin, N.V., Shabalina, S.A., Wolf, Y.I., Koonin, E.V., 2006a. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biology Direct 1, 1–26.
Makarova, K.S., Slesarev, A., Wolf, Y., Sorokin, A., Mirkin, B., Koonin, E., Pavlov, A., Pavlova, N., Karamychev, V., Polouchine, N., Shakhova, V., Grigoriev, I., Lou, Y., Rohksar, D., Lucas, S., Huang, K., Goodstein, D.M., Hawkins, T., Plengvidhya, V., Welker, D., Hughes, J., Goh, Y., Benson, A., Baldwin, K., Lee, J.H., Díaz-Muñiz, I., Dosti, B., Smeianov, V., Wechter, W., Barabote, R., Lorca, G., Altermann, E., Barrangou, R., Ganesan, B., Xie, Y., Rawsthorne, H., Tamir, D., Parker, C., Breidt, F., Broadbent, J., Hutkins, R., O'Sullivan, D., Steele, J., Unlu, G., Saier, M., Klaenhammer, T., Richardson, P., Kozyavkin, S., Weimer, B., Mills, D., 2006b. Comparative genomics of the lactic acid bacteria. Proceedings of the National Academy of Sciences of the United States of America 103, 15611–15616.
Mojica, F.J., Díez-Villaseñor, C., García-Martínez, J., Soria, E., 2005. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. Journal of Molecular Evolution 60, 174–182.
Pfeiler, E.A., Klaenhammer, T.R., 2007. The genomics of lactic acid bacteria. Trends in Microbiology 15, 546–553.
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.A., Barrell, B., 2000. Artemis: sequence visualization and annotation. Bioinformatics 16, 944–945.
Schell, M.A., Karmirantzou, M., Snel, B., Vilanova, D., Berger, B., Pessi, G., Zwahlen, M.C., Desiere, F., Bork, P., Delley, M., Pridmore, R.D., Arigoni, F., 2002. The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract. Proceedings of the National Academy of Sciences of the United States of America 99, 14422–14427.
Selengut, J.D., Haft, D.H., Davidsen, T., Ganapathy, A., Gwinn-Giglio, M., Nelson, W.C., Richter, A.R., White, O., 2007. TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Research 35, D260–264.
Sonnhammer, E.L., Durbin, R., 1995. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167, GC1–GC10.
Sorek, R., Kunin, V., Hugenholtz, P., 2008. CRISPR—a widespread system that provides acquired resistance against phages in bacteria and archaea. Nature Reviews Microbiology 6, 181–186.
Tomita, H., Ike, Y., 2005. Genetic analysis of transfer-related regions of the vancomycin resistance Enterococcus conjugative plasmid pHTbeta: identification of oriT and a putative relaxase gene. Journal of Bacteriology 187, 7727–7737.
Tyson, G.W., Banfield, J.F., 2008. Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses. Environmental Microbiology 10, 200–207.
Xu, P., Alves, J.M., Kitten, T., Brown, A., Chen, Z., Ozaki, L.S., Manque, P., Ge, X., Serrano, M.G., Puiu, D., Hendricks, S., Wang, Y., Chaplin, M.D., Akan, D., Paik, S., Peterson, D.L., Macrina, F.L., Buck, G.A., 2007. Genome of the opportunistic pathogen Streptococcus sanguinis. Journal of Bacteriology 189, 3166–3175.