PanGeT: Pan-genomics tool


GENE-41677; No. of pages: 8; 4C: Gene xxx (2016) xxx–xxx

Contents lists available at ScienceDirect

Gene journal homepage: www.elsevier.com/locate/gene

Research paper

Iyyappan Yuvaraj a,1, Jayavel Sridhar b,c,1, Daliah Michael a, Kanagaraj Sekar a,⁎

a Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560012, India
b Centre of Excellence in Bioinformatics, School of Biotechnology, Madurai Kamaraj University, Madurai 625021, India
c Department of Biotechnology (DDE), Madurai Kamaraj University, Madurai 625021, India

Article info

Article history: Received 27 May 2016; received in revised form 5 November 2016; accepted 10 November 2016; available online xxxx

Keywords: Pan-genome; Core genes; Dispensable genes; Unique genes; Species; Genome; Proteome; CDS

Abstract

A decade after the concept of the pan-genome was first introduced, research in this field has spread to areas such as pathogenesis of diseases, bacterial evolutionary studies and drug resistance. Gene content-based differentiation of virulent and avirulent strains of bacteria and identification of pathogen-specific genes are imperative to understand their physiology and gain insights into the mechanism of genome evolution. Subsequently, this will aid in identifying diagnostic targets and in developing and selecting vaccines. The root of pan-genomic studies, however, is to identify the core genes, dispensable genes and strain-specific genes across the genomes belonging to a clade. To this end, we have developed a tool, “PanGeT – Pan-genomics Tool”, to compute the ‘pan-genome’ based on comparisons at the genome as well as the proteome levels. This automated tool is implemented using LaTeX libraries for effective visualization of the overall pan-genome through graphical plots. Links to retrieve sequence information and functional annotations have also been provided. PanGeT can be downloaded from http://pranag.physics.iisc.ernet.in/PanGeT/ or https://github.com/PanGeTv1/PanGeT.

© 2016 Published by Elsevier B.V.

1. Introduction

The term “pan-genome” was first introduced by Tettelin et al. (2005) a decade ago. It defines the complement of genes from genomes belonging to a clade, which includes core, dispensable and strain-specific genes. In recent decades, the availability of huge genome data has facilitated and accelerated the pace of pan-genomic studies (Medini et al., 2005; Puigbò et al., 2014). Pan-genomics has gained popularity owing to its potential in understanding genome evolution and in identifying potential molecular targets (Ho et al., 2012; Delany et al., 2013). It is known that bacteria undergo minor evolutionary changes by acquiring DNA sequences from plasmids, phages and other genomes of the same species, which enhances their survival, adaptation and virulence (Boto, 2015). Many changes were noticed in the physiology of pathogens after they acquired genetic material from external sources. The development of drug resistance through novel genetic elements in many microbes poses the greatest threat to therapeutic solutions, thereby highlighting the importance of the acquisition and reduction of specific DNA sequences in a pathogen. Identification and analysis of such genome fluidity will contribute to the clinical management, epidemiology and development of new diagnostic methods (Ahmed et al., 2008). Studies have revealed that such strain-specific genes are applicable for the identification of vaccine candidates and diagnostic and antimicrobial targets (Muzzi et al., 2007). Thus, close monitoring of the distribution of virulence gene content in pathogenic strains of a species is mandatory for their effective containment (Sugawara et al., 2013).

Over the last decade, a few computational tools such as Pan-genometree (Snipen and Ussery, 2010), PanSeq (Laing et al., 2010), GET_HOMOLOGUES (Contreras-Moreira and Vinuesa, 2013) and PGAP (Zhao et al., 2012) were developed for pan-genomic studies. However, most of these tools use BLASTP (Altschul et al., 1990) for multiple proteome comparisons, which does not take the non-transcribed or untranslated regions into account for comparative analysis. Moreover, the existing pan-genome tools require numerous dependencies for installation and are restricted to comparing only a limited number of genomes. Therefore, to address these issues, we propose a new robust, user-friendly and time-efficient tool to identify core genes, dispensable genes and strain-specific genes.

2. Materials and methods

Abbreviations: CDS, Coding DNA Sequences; RBH, Reciprocal Best Hit; H-value, Homology value; E-value, expected value.
⁎ Corresponding author. E-mail addresses: [email protected], [email protected] (K. Sekar).
1 Contributed equally.

http://dx.doi.org/10.1016/j.gene.2016.11.025
0378-1119/© 2016 Published by Elsevier B.V.

Please cite this article as: Yuvaraj, I., et al., PanGeT: Pan-genomics tool, Gene (2016), http://dx.doi.org/10.1016/j.gene.2016.11.025

2.1. Implementation

PanGeT has been conceived and developed to enable researchers to compare entire genomes or proteomes together in order to compute the pan-genome of a species. The suite offers the flexibility to choose either BLASTN or BLASTP mode, depending on the user's research interests. In the BLASTN searches, the CDS (Coding DNA Sequences) are extracted from the complete genomes of the selected organisms, based on their annotations. The resulting CDS nucleotide file obtained from Genome ‘1’ (query sequences) is compared first with Genome ‘2’ using BLASTN and subsequently with all genomes up to the last Genome ‘N’. Similarly, Genome ‘2’, Genome ‘3’, …, Genome ‘N’ are then each compared with Genome ‘1’ and all the remaining genomes, implementing ALL-against-ALL genome alignments (using BLAST) with an E-value of ‘1e−5’. In the case of BLASTP, the entire set of protein sequences of all the strains is used for whole-proteome comparisons.

Further, PanGeT computes the H-value (Homology value) for each CDS from BLASTN outputs, as proposed by Fukiya et al. (2004), to identify the homologous sequences among the species (in genomes as well as proteomes). The H-value denotes the degree of similarity, in terms of length and identity, between a query sequence and the set of genomes or proteomes taken for analysis. We have adopted the H-value cut-off (0.42) used earlier (Shao et al., 2010) to filter out the non-homologous entries. The H-value is a key parameter that evaluates the matching length and identity of a particular region on a scale of 0–1 (frequency of the highest Bit score). It is calculated as follows:

H = i × lm/lq

where ‘i’ represents the level of identity of the region, ‘lm’ represents the length of the highest scoring matching sequence (including gaps) and ‘lq’ represents the query length. It is to be noted that the number of core genes decreases as the H-value cut-off increases. In the event that no matching sequence is found, the H-value for the query sequence is assigned as ‘zero’.

Two levels of filters have been applied, [1] H-value (cut-off >0.42) and [2] Reciprocal Best Hits (RBH), to identify the real core genes and exclude the paralogous and partial homologues. From the homologous sequences, core orthologous sequences were identified through RBH, where a gene's best hit in the other genome must, in turn, have that gene as its own best hit. A query with no BLAST hit, or with a hit of low H-value, is reported as a strain-specific gene.

2.2. Features

A key feature of PanGeT is that it allows users to run the program in a convenient and easy manner. The program provides a comprehensive graphical output in the form of a flower plot that gives an overall view of the pan-genome. The generated flower plot contains hyperlinks in the core genes section, through which users can download each core sequence and retrieve annotations for every gene sequence. Each petal of the flower plot represents one genome taken for analysis and is labelled with its accession number. Each petal shows the number of strain-specific genes and carries hyperlinks to retrieve the strain-specific gene sequences, gene co-ordinates, strand orientation and functional annotations. The corresponding locus ids of the genes are also provided with a direct link to the KEGG database (Kanehisa et al., 2012) for further information. In addition, PanGeT provides information about the dispensable (accessory) genes (Sugawara et al., 2013), which are found in only a subset of the query genomes. Data pertaining to the accessory genes of a species will help researchers to identify the distribution pattern of particular virulence genes when comparing a group of pathogenic and non-pathogenic strains. In keeping with its user-friendly nature, PanGeT offers the option to use either BLASTN or BLASTP, based on the user's search requirements. The facility to change parameters such as the E-value and H-value cut-offs enables effective, user-defined searches. Furthermore, since the BLASTP module uses only protein sequences for comparison, it can be used to compare eukaryotic proteomes as well.

Table 1
List of twenty Salmonella serovars analyzed for their pan-genome content, with a comparison of pan-genome prediction tools. $ Salmonella enterica strains taken from GenBank; ⁎ accession number and gene annotations reported in NCBI-RefSeq; @ core (Reciprocal Best Hits exclusive of paralogous and partial homologues) and % unique genes identified using PanGeT and using the Gene Family algorithm (PGAP-GF) and Multiparanoid algorithm (PGAP-MP) of PGAP; ! core and unique genome fragments computed using PanSeq; ~ core amino acid gene clusters identified by the GET_HOMOLOGUES tool.

| S. no. | Serovar$ | RefSeq Id⁎ | Total CDS⁎ | PanGeT BLASTN unique% | PanGeT BLASTP unique% | PGAP-GF core@ | PGAP-GF unique% | PGAP-MP core@ | PGAP-MP unique% | PanSeq unique fragments! |
| 1 | Salmonella enterica Typhimurium LT2 | NC_003197 | 4451 | 31 | 17 | 2712 | 38 | 2756 | 32 | 117 |
| 2 | Salmonella enterica Typhi CT18 | NC_003198 | 4111 | 31 | 15 | 2686 | 30 | 2721 | 29 | 81 |
| 3 | Salmonella enterica Typhi Ty2 | NC_004631 | 4352 | 12 | 30 | 2685 | 66 | 2726 | 46 | 3 |
| 4 | Salmonella enterica Paratyphi A ATCC 9150 | NC_006511 | 4096 | 1 | 3 | 2689 | 29 | 2722 | 7 | 23 |
| 5 | Salmonella enterica Choleraesuis SC B67 | NC_006905 | 4385 | 18 | 28 | 2691 | 84 | 2740 | 57 | 54 |
| 6 | Salmonella enterica arizonae serovar 62_z4_z23 | NC_010067 | 4467 | 829 | 689 | 2688 | 880 | 2746 | 747 | 176 |
| 7 | Salmonella enterica Newport SL254 | NC_011080 | 4605 | 111 | 77 | 2707 | 139 | 2748 | 79 | 27 |
| 8 | Salmonella enterica Heidelberg SL476 | NC_011083 | 4666 | 150 | 128 | 2708 | 177 | 2749 | 131 | 34 |
| 9 | Salmonella enterica Schwarzengrund CVM19633 | NC_011094 | 4538 | 177 | 157 | 2705 | 222 | 2752 | 171 | 56 |
| 10 | Salmonella enterica Paratyphi A AKU 12601 | NC_011147 | 4079 | 1 | 3 | 2688 | 18 | 2724 | 4 | 0 |
| 11 | Salmonella enterica Agona SL483 | NC_011149 | 4580 | 100 | 111 | 2697 | 171 | 2745 | 120 | 50 |
| 12 | Salmonella enterica Dublin CT 02021853 | NC_011205 | 4530 | 46 | 67 | 2700 | 118 | 2738 | 74 | 36 |
| 13 | Salmonella enterica Paratyphi C RKS4594 | NC_012125 | 4577 | 15 | 36 | 2698 | 87 | 2743 | 50 | 7 |
| 14 | Salmonella enterica Typhimurium SL1344 | NC_016810 | 4446 | 0 | 8 | 2712 | 25 | 2756 | 11 | 12 |
| 15 | Salmonella enterica Gallinarum/pullorum RKS5078 | NC_016831 | 4325 | 32 | 39 | 2690 | 100 | 2734 | 63 | 3 |
| 16 | Salmonella enterica Typhi_P_stx_12 | NC_016832 | 4690 | 0 | 75 | 2691 | 185 | 2731 | 147 | 0 |
| 17 | Salmonella enterica Typhimurium 14028S | NC_016856 | 5315 | 12 | 360 | 2711 | 495 | 2756 | 449 | 5 |
| 18 | Salmonella enterica Typhimurium ST4_74 | NC_016857 | 4625 | 1 | 20 | 2710 | 51 | 2758 | 23 | 0 |
| 19 | Salmonella enterica Typhimurium T000240 | NC_016860 | 4713 | 80 | 79 | 2716 | 142 | 2764 | 141 | 12 |
| 20 | Salmonella enterica Typhimurium UK_1 | NC_016863 | 4451 | 5 | 9 | 2709 | 34 | 2757 | 9 | 0 |

Values common to all twenty serovars: PanGeT core (RBH-hits)@, 2559 (BLASTN mode) and 2586 (BLASTP mode); PanSeq core fragments!, 11,334; GET_HOMOLOGUES core~, 2515 (protein)/2557 (DNA). GET_HOMOLOGUES does not provide unique gene data.

Table 2
List of seven H. pylori strains analyzed for their pan-genome content, with a comparison of pan-genome prediction tools. $ Helicobacter pylori strains taken from GenBank; ⁎ accession number and gene annotations reported in NCBI-RefSeq; @ core (Reciprocal Best Hits exclusive of paralogous and partial homologues) and % unique genes identified using PanGeT and using the Gene Family algorithm (PGAP-GF) and Multiparanoid algorithm (PGAP-MP) of PGAP; ! core and unique genome fragments computed using PanSeq; ~ core amino acid gene clusters identified by the GET_HOMOLOGUES tool.

| S. no. | Strain$ | RefSeq Id⁎ | Total CDS⁎ | PanGeT BLASTN unique% | PanGeT BLASTP unique% | PGAP-GF core@ | PGAP-GF unique% | PGAP-MP core@ | PGAP-MP unique% | PanSeq unique fragments! |
| 1 | Helicobacter pylori 26695_uid57787 | NC_000915 | 1468 | 11 | 29 | 1204 | 60 | 1210 | 45 | 60 |
| 2 | Helicobacter pylori J99_uid57789 | NC_000921 | 1488 | 26 | 16 | 1202 | 55 | 1206 | 38 | 59 |
| 3 | Helicobacter pylori HPAG1_uid58517 | NC_008086 | 1531 | 24 | 31 | 1205 | 82 | 1207 | 73 | 28 |
| 4 | Helicobacter pylori Shi470_uid59165 | NC_010698 | 1566 | 75 | 118 | 1190 | 163 | 1200 | 142 | 28 |
| 5 | Helicobacter pylori G27_uid59305 | NC_011333 | 1493 | 11 | 37 | 1200 | 64 | 1207 | 54 | 13 |
| 6 | Helicobacter pylori P12_uid59327 | NC_011498 | 1568 | 34 | 32 | 1205 | 96 | 1209 | 84 | 17 |
| 7 | Helicobacter pylori B38_uid59415 | NC_012973 | 1382 | 14 | 18 | 1193 | 37 | 1201 | 23 | 9 |

Values common to all seven strains: PanGeT core (RBH-hits)@, 1126 (BlastN mode) and 1171 (BlastP mode); PanSeq core fragments!, 3375; GET_HOMOLOGUES core~, 1145 (protein)/1154 (DNA). GET_HOMOLOGUES does not provide unique gene data.
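The two filters described in Section 2.1 (an H-value cut-off followed by Reciprocal Best Hits) can be illustrated with a minimal Python sketch. This is not PanGeT's own code: the `Hit` record, its field names and the toy gene ids are hypothetical, and the sketch assumes that only the single best hit per query has already been extracted from the BLAST output and that identity is expressed as a fraction between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional

H_CUTOFF = 0.42  # cut-off quoted in Section 2.1


@dataclass(frozen=True)
class Hit:
    """Best BLAST hit of one query gene (field names are hypothetical)."""
    query: str
    subject: str
    identity: float    # i: identity of the matched region as a fraction (0-1)
    match_length: int  # lm: length of the highest scoring match, gaps included
    query_length: int  # lq: full length of the query sequence


def h_value(hit: Optional[Hit]) -> float:
    """H = i * lm / lq; a query without any hit is assigned H = 0."""
    if hit is None:
        return 0.0
    return hit.identity * hit.match_length / hit.query_length


def homologous_best_hits(hits, cutoff=H_CUTOFF):
    """Filter 1: keep query -> subject pairs whose H-value exceeds the cut-off."""
    return {h.query: h.subject for h in hits if h_value(h) > cutoff}


def reciprocal_best_hits(a_vs_b, b_vs_a, cutoff=H_CUTOFF):
    """Filter 2: keep pairs that are each other's best hit in both directions."""
    forward = homologous_best_hits(a_vs_b, cutoff)
    backward = homologous_best_hits(b_vs_a, cutoff)
    return {(q, s) for q, s in forward.items() if backward.get(s) == q}


# Toy best hits between two hypothetical genomes A and B.
a_vs_b = [
    Hit("a1", "b1", 0.95, 900, 1000),  # H = 0.855, passes the cut-off
    Hit("a2", "b2", 0.60, 300, 1000),  # H = 0.18, rejected as a partial match
    Hit("a3", "b1", 0.90, 800, 1000),  # H = 0.72, but b1's best hit is a1
]
b_vs_a = [
    Hit("b1", "a1", 0.95, 900, 950),
    Hit("b2", "a2", 0.90, 900, 1000),
]
orthologs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy input only the pair (a1, b1) survives both filters. Gene a2 fails the H-value filter and a3 fails the reciprocity check, which is how paralogous and partial homologues are kept out of the core set. BLAST tabular output reports identity as a percentage, so real values would need to be divided by 100 before being used this way.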


3. Results and discussion

3.1. Test data

The performance of PanGeT was evaluated by computing the pan-genome for the same genome dataset used by PGAP (Zhao et al., 2012): a set of 11 Streptococcus pyogenes strains obtained from NCBI (NC_002737, NC_003485, NC_004070, NC_004606, NC_006086, NC_007296, NC_007297, NC_008021, NC_008022, NC_008023, NC_008024). PanGeT detected 1319 and 1347 core genes using the BLASTN and BLASTP modes, respectively, with an E-value of ‘1e−5’ and an H-value of ‘0.42’. These values are consistent with the 1366 core genes detected by PGAP and the 1376 core genes reported in an earlier study (Lefebure and Stanhope, 2007).

3.2. Case study 1

To demonstrate the use and functioning of PanGeT, two separate groups of genomes were selected to compute the pan-genome of each species: the complete genome sequences of twenty Salmonella enterica serovars (listed in Table 1), obtained from GenBank (Benson et al., 2012), and the complete genome sequences of seven Helicobacter pylori strains (listed in Table 2).

[i] For the Salmonella enterica serovar genomes, 2559 and 2586 core genes were detected by the BLASTN and BLASTP methods, respectively. In an earlier study (Chan et al., 2003) of 26 Salmonella enterica serovar genomes, 2244 core genes were identified using DNA microarrays. These results are in line with the fact that the number of core genes decreases as the number of genomes increases. It is noteworthy that the results obtained from the in silico analysis (PanGeT) are consistent with those obtained from the in vitro studies. The flower plot depicts the core genome of Salmonella, and each petal contains the number of strain-specific genes (Fig. 1a). In addition, hyperlinks are provided for the lists of core, dispensable and strain-specific genes to access more information on them (Fig. 1b to d).

[ii] In the case of the Helicobacter genomes, 1126 and 1171 core genes were detected by BLASTN and BLASTP, respectively (flower plot representation in Fig. 2). These numbers are consistent with the ~1200 core genes identified by the pan-genomic tool PGAP (Zhao et al., 2012). The Shi470 strain (NC_010698) has the maximum number (118) of unique genes, while the other six strains yielded comparatively fewer unique genes (Fig. 2).
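The remark in [i] that the core shrinks as more genomes are compared follows directly from the core being an intersection of gene sets. The short Python sketch below illustrates this with made-up gene-family labels; the function name and the toy data are hypothetical and are not taken from PanGeT or from the Salmonella dataset.

```python
def core_sizes(genomes):
    """Return the core size after each genome is added, in input order.

    `genomes` is a list of sets of gene-family ids (hypothetical labels).
    """
    sizes = []
    core = None
    for families in genomes:
        # The core is the running intersection of all genomes seen so far.
        core = set(families) if core is None else core & set(families)
        sizes.append(len(core))
    return sizes


# Three toy genomes sharing only the families f1 and f2.
toy_genomes = [
    {"f1", "f2", "f3", "f4"},
    {"f1", "f2", "f3", "f5"},
    {"f1", "f2", "f6"},
]
sizes = core_sizes(toy_genomes)
```

Because an intersection can only lose members when another set is added, the returned sizes never increase, whatever order the genomes are supplied in.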

Fig. 1. a: Pan-genome of the twenty Salmonella enterica genomes computed using PanGeT-BLASTN. b: List of core genes in Salmonella enterica genomes computed using PanGeT. c: Expanded view of the dispensable genes of 20 Salmonella enterica serovars computed using PanGeT. d: Expanded view of the strain specific genes identified in LT2 strain (NC_003197) with links to their sequence and KEGG Genome maps.


Fig. 2. Pan-genome of the seven Helicobacter pylori genomes computed using PanGeT-BLASTP.

Thus, it is seen that apart from quickly and efficiently differentiating the entire set of coding genes into core, dispensable and unique, PanGeT exhibits the results graphically in a comprehensive flower plot, with hyperlinks to retrieve their functional annotations and sequences. It is noteworthy that the unique genes identified by PanGeT in Salmonella and Helicobacter were found to retain the strain specific diagnostic markers reported in earlier studies (Arrach et al., 2008; Fischer et al., 2010).
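The three-way split into core, dispensable and unique genes summarized above can be sketched in Python as a count of how many genomes carry each gene family. This is an illustrative sketch only: the `presence` mapping, the function name and the genome and family labels are hypothetical, and it assumes that orthologous genes have already been grouped into families.

```python
def classify_families(presence, n_genomes):
    """Split gene families into core, dispensable and unique sets.

    `presence` maps a gene-family id to the set of genome ids carrying it.
    """
    core, dispensable, unique = set(), set(), set()
    for family, genomes in presence.items():
        count = len(set(genomes))
        if count == 0:
            continue  # a family seen in no genome is ignored
        if count == n_genomes:
            core.add(family)         # present in every genome
        elif count == 1:
            unique.add(family)       # strain specific
        else:
            dispensable.add(family)  # present in a subset of genomes
    return core, dispensable, unique


# Toy presence/absence data for three hypothetical genomes g1..g3.
presence = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g3"},
    "famC": {"g2"},
}
core, dispensable, unique = classify_families(presence, n_genomes=3)
```

In flower-plot terms, the `core` set would correspond to the centre of the flower and the `unique` families, grouped by genome, to the individual petals.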

3.3. Case study 2

In this case study, we illustrate the usefulness of PanGeT by performing part of a comparative genomics study on a set of ten pathogenic and non-pathogenic Mycobacterium species (Table 3), a detailed study of which has been reported earlier (Prasanna and Mehra, 2013). It is well known that the pathogenic species of the genus Mycobacterium are responsible for a host of diseases, and subsequently deaths, in many living beings, including humans. The most commonly known infections are tuberculosis and leprosy, which are caused by Mycobacterium tuberculosis and Mycobacterium leprae, respectively (Brosch et al., 2000). To keep the case study concise, only the use of BLASTP is shown. Thus, the six pathogenic (M. tuberculosis H37Rv, M. avium 104, M. ulcerans, M. marinum, M. abscessus and M. leprae) and four non-pathogenic (M. smegmatis mc2155, M. sp. KMS, M. gilvum and M. vanbaalenii) proteomes were selected as input. A total of eight conserved genes were identified only in the six pathogens (Table 4). Of these, three genes (succinate-semialdehyde dehydrogenase, orotate phosphoribosyl transferase and glycosyl transferase) match three of the four genes identified in the study by Prasanna and co-workers (Prasanna and Mehra, 2013). The five additional genes spotted by PanGeT in the six pathogens are important virulence factors directly related to their pathogenesis. The RNA polymerase sigma factor SigC identified in all six pathogens is different from that present in the non-pathogenic Mycobacterium. Such alternative sigma factors are essential for regulating the expression of virulence genes and virulence-associated genes in bacterial pathogens (Kazmierczak et al., 2005). Likewise, the MarR family transcriptional regulator (multiple antibiotic resistance regulator) was found only in the six pathogenic strains of Mycobacterium (Otani et al., 2016). Of the rest, one is a transmembrane carbonic anhydrase and two are hypothetical proteins. BLASTP identified a total of 998 core genes in all ten strains, and the corresponding flower plot was generated (Fig. 3). Similar comparative genomics studies can be carried out in various species to provide valuable insights into virulent and avirulent species and evolutionary relationships and, most importantly, to identify disease-related genes and probable vaccine candidates.

Table 3
List of ten Mycobacterium strains selected for pan-genome analysis (1–6 are pathogenic strains and 7–10 are non-pathogenic strains) and the respective core and unique genes predicted by PanGeT (BLASTP results).

| S. no. | Strain | RefSeq Id | Total CDS | Unique genes |
| 1 | M. tuberculosis H37Rv | NC_018143 | 4036 | 663 |
| 2 | M. avium 104 | NC_008595 | 5120 | 900 |
| 3 | M. ulcerans | NC_008611 | 4160 | 296 |
| 4 | M. marinum | NC_010612 | 5423 | 580 |
| 5 | M. abscessus | NC_021282 | 4958 | 1690 |
| 6 | M. leprae | NC_002677 | 1605 | 127 |
| 7 | M. smegmatis mc2155 | NC_018289 | 6689 | 1153 |
| 8 | M. sp. KMS | NC_008705 | 5460 | 693 |
| 9 | M. gilvum | NC_009338 | 5241 | 381 |
| 10 | M. vanbaalenii | NC_008726 | 5979 | 738 |

Core genes common to all ten strains: 998.

Table 4
List of the eight genes present only in the six pathogens out of ten Mycobacterium species identified by PanGeT.

| S. no. | RefSeq Id of the genes | Product |
| 1 | NP_302648, YP_884057, YP_905203, YP_001848809, YP_006513557, YP_008023967 | Succinate-semialdehyde dehydrogenase |
| 2 | NP_302606, YP_883917, YP_904353, YP_001848965, YP_006513708, YP_008024763 | Orotate phosphoribosyl transferase (pyrE) |
| 3 | NP_301274, YP_879805, YP_907713, YP_001853390, YP_006517120, YP_008020939 | Glycosyl transferase |
| 4 | NP_301275, YP_879804, YP_907714, YP_001853391, YP_006517121, YP_008020938 | Hypothetical protein |
| 5 | NP_301465, YP_882556, YP_905747, YP_001850517, YP_006514784, YP_008021480 | MarR family transcriptional regulator |
| 6 | NP_302023, YP_881623, YP_906143, YP_001851343, YP_006515482, YP_008023839 | RNA polymerase sigma factor SigC |
| 7 | NP_302463, YP_879952, YP_907566, YP_001853225, YP_006516973, YP_008020995 | Hypothetical protein |
| 8 | NP_302485, YP_883707, YP_904801, YP_001849225, YP_006516750, YP_008023503 | Transmembrane carbonic anhydrase |

Fig. 3. Pan-genome of the ten Mycobacterium genomes computed using PanGeT-BLASTP.

4. Tool performance

Although other pan-genomic tools are available, each with its own merits and demerits, it is beneficial for users to have a choice of tools, depending on the nature of their research requirements. To evaluate the performance of the proposed tool, the core and strain-specific genes predicted by PanGeT in case study 1 were compared with the predictions of PGAP, PanSeq and GET_HOMOLOGUES (results are displayed in Tables 1 and 2). PanSeq does not consider the functional annotations of the gene fragments, which might restrict the use of this tool. GET_HOMOLOGUES identifies the core gene clusters found in the query genomes and their DNA/protein sequences, which are almost similar to the outputs from BLASTP under the H-value of 0.42. However, GET_HOMOLOGUES is not equipped to identify the strain-specific or unique genes, which are critical for pan-genomic studies. As mentioned earlier, the pan-genome data predicted by PanGeT clearly lists the strain-specific, core and dispensable genes in the flower plot, with complete information on genome location, Gene ID, COG class, functional annotation and sequence. PanGeT has been developed to run on Windows, Linux and Mac operating systems and requires only the BLAST and LaTeX/MiKTeX utilities. Furthermore, PanGeT is time-efficient and is faster than the other existing tools in most cases (refer to Table S1). This, coupled with features such as interactive graphical outputs, fewer installation dependencies, platform flexibility and scalability, should make PanGeT a suitable choice for researchers involved in pan-genomic studies. PanGeT is freely available at http://pranag.physics.iisc.ernet.in/PanGeT/ or https://github.com/PanGeTv1/PanGeT. General comments and suggestions for improvements are welcome and should be addressed to Dr. K. Sekar at [email protected].

5. Conclusion

The genesis of pan-genomic studies, a decade ago, opened up avenues to analyze the genomes of multiple strains together in order to identify the genome evolution that has occurred with respect to their virulence, immune evasion and environmental adaptations. To this end, the proposed tool has been successfully constructed and tested to compute the pan-genome efficiently on various platforms. Users with minimal computational expertise will also be able to use PanGeT with ease.
Taking into account its user-friendly features, time efficiency and comprehensive graphical and interactive output, it is envisaged that PanGeT will serve a significant section of researchers interested in pan-genomic studies. The output obtained through PanGeT can also be taken forward in further studies; for instance, the strain-specific genes could be applied as molecular targets for diagnostic and therapeutic purposes. Similarly, comparative genomics studies offering deeper insights into the distribution of dispensable genes among the individual strains may indicate their significance in genome evolution and adaptation, paving the way to designing new and effective vaccines.

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.gene.2016.11.025.

Competing interests

None.

Authors' contribution

JS conceived and carried out this study and evaluated the tool. IY and KS were involved in scripting, testing and evaluation of the tool on different platforms. DM was involved in evaluation of the tool and wrote the manuscript.

Authors' information

JS is an Assistant Professor in the Department of Biotechnology, Madurai Kamaraj University. KS is an Associate Professor at the Department of Computational and Data Sciences, Indian Institute of Science, Bangalore. IY is a Ph.D. student and DM is a Project Assistant at the Department of Computational and Data Sciences, Indian Institute of Science, Bangalore.

Acknowledgements

The authors acknowledge the facilities offered by the Interactive Graphics Facility, Indian Institute of Science, Bangalore. JS thanks the University Grants Commission, Govt. of India, for funding (F. NO. 1139/2012) and the Department of Biotechnology, Govt. of India, for funding the research projects.

References

Ahmed, N., Dobrindt, U., Hacker, J., Hasnain, S.E., 2008. Genomic fluidity and pathogenic bacteria: applications in diagnostics, epidemiology and intervention. Nat. Rev. Microbiol. 6, 387–394.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Arrach, N., Porwollik, S., Cheng, P., Cho, A., Long, F., Choi, S.-H., McClelland, M., 2008. Salmonella serovar identification using PCR-based detection of gene presence and absence. J. Clin. Microbiol. 46, 2581–2589.
Benson, D.A., Karsch-Mizrachi, I., Clark, K., Lipman, D.J., Ostell, J., Sayers, E.W., 2012. GenBank. Nucleic Acids Res. 40, D48–D53.
Boto, L., 2015. Evolutionary change and phylogenetic relationships in light of horizontal gene transfer. J. Biosci. 40, 465–472.
Brosch, R., Gordon, S.V., Eiglmeier, K., Garnier, T., Cole, S.T., 2000. Comparative genomics of the leprosy and tubercle bacilli. Res. Microbiol. 151, 135–142.
Chan, K., Baker, S., Kim, C.C., Detweiler, C.S., Dougan, G., Falkow, S., 2003. Genomic comparison of Salmonella enterica serovars and Salmonella bongori by use of an S. enterica serovar Typhimurium DNA microarray. J. Bacteriol. 185 (2), 553–563.
Contreras-Moreira, B., Vinuesa, P., 2013. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl. Environ. Microbiol. 79, 7696–7701.
Delany, I., Rappuoli, R., Seib, K.L., 2013. Vaccines, reverse vaccinology, and bacterial pathogenesis. Cold Spring Harb. Perspect. Med. 3, a012476.
Fischer, W., Windhager, L., Rohrer, S., Zeiller, M., Karnholz, A., Hoffmann, R., Zimmer, R., Haas, R., 2010. Strain-specific genes of Helicobacter pylori: genome evolution driven by a novel type IV secretion system and genomic island transfer. Nucleic Acids Res. 38, 6089–6101.
Fukiya, S., Mizoguchi, H., Tobe, T., Mori, H., 2004. Extensive genomic diversity in pathogenic Escherichia coli and Shigella strains revealed by comparative genomic hybridization microarray. J. Bacteriol. 186, 3911–3921.
Ho, C.C., Wu, A.K., Tse, C.W., Yuen, K.Y., Lau, S.K., Woo, P.C., 2012. Automated pangenomic analysis in target selection for PCR detection and identification of bacteria by use of ssGeneFinder Webserver and its application to Salmonella enterica serovar Typhi. J. Clin. Microbiol. 50, 1905–1911.
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., Tanabe, M., 2012. KEGG for integration and interpretation of large scale data. Nucleic Acids Res. 40, D109–D114.
Kazmierczak, M.J., Wiedmann, M., Boor, K.J., 2005. Alternative sigma factors and their roles in bacterial virulence. Microbiol. Mol. Biol. Rev. 69 (4), 527–543.
Laing, C., Buchanan, C., Taboada, E.N., Zhang, Y., Kropinski, A., Villegas, A., Thomas, J.E., Gannon, V.P.J., 2010. Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. BMC Bioinf. 11, 461.
Lefebure, T., Stanhope, M.J., 2007. Evolution of the core and pan-genome of Streptococcus: positive selection, recombination, and genome composition. Genome Biol. 8, R71.
Medini, D., Donati, C., Tettelin, H., Masignani, V., Rappuoli, R., 2005. The microbial pan-genome. Curr. Opin. Genet. Dev. 15, 589–594.
Muzzi, A., Masignani, V., Rappuoli, R., 2007. The pan-genome: towards a knowledge based discovery of novel targets for vaccines and antibacterials. Drug Discov. Today 12, 11–12.
Otani, H., Stogios, P.J., Xu, X., Nocek, B., Li, S.-N., Savchenko, A., Eltis, L.D., 2016. The activity of CouR, a MarR family transcriptional regulator, is modulated through a novel molecular mechanism. Nucleic Acids Res. 44 (2), 595–607.
Prasanna, A.N., Mehra, S., 2013. Comparative phylogenomics of pathogenic and non-pathogenic Mycobacterium. PLoS One 8 (8), e71248.
Puigbò, P., Lobkovsky, A.E., Kristensen, D.M., Wolf, Y.I., Koonin, E.V., 2014. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol. 12, 66.
Shao, Y., He, X., Harrison, E.M., Ou, H.-Y., Kumar, R., Deng, Z., 2010. mGenomeSubstractor: a web based tool for parallel in silico subtractive hybridization analysis of multiple bacterial genomes. Nucleic Acids Res. 38, W194–W200.
Snipen, L., Ussery, D.W., 2010. Pan genome trees. Stand. Genomic Sci. 2, 135–141.
Sugawara, M., Epstein, B., Badgley, B.D., Unno, T., Xu, L., Reese, J., Gyaneshwar, P., Denny, R., Mudge, J., Bharti, A.K., Farmer, A.D., May, G.D., Woodward, J.E., Médigue, C., Vallenet, D., Lajus, A., Rouy, Z., Martinez-Vaz, B., Tiffin, P., Young, N.D., Sadowsky, M.J., 2013. Comparative genomics of the core and accessory genomes of 48 Sinorhizobium strains comprising five genospecies. Genome Biol. 14, R17.
Tettelin, H., et al., 2005. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial “pangenome”. Proc. Natl. Acad. Sci. U. S. A. 102 (39), 13950–13955.
Zhao, Y., Wu, J., Yang, J., Sun, S., Xiao, J., Yu, J., 2012. PGAP: pan-genomes analysis pipeline. Bioinformatics 28, 416–418.
