Journal of Microbiological Methods 54 (2003) 423–426 www.elsevier.com/locate/jmicmeth
Note
GenomeComp: a visualization tool for microbial genome comparison
Jian Yang a, Jinhua Wang a, Zhi-Jian Yao b, Qi Jin c, Yan Shen b, Runsheng Chen a,*
a Laboratory of Bioinformatics, Institute of Biophysics, CAS, Datun Road 15, Chaoyang District, Beijing 100101, PR China
b Chinese National Human Genome Center, Beijing 100176, PR China
c State Key Laboratory for Molecular Virology and Genetic Engineering, Beijing 100052, PR China
Received 17 August 2002; received in revised form 5 February 2003; accepted 4 March 2003
Abstract
We have developed a software tool, GenomeComp, for parsing, summarizing and visualizing genome sequence comparison results derived from voluminous BLAST textual output. With GenomeComp, variations between genomes, such as repeat regions, insertions, deletions and rearrangements of genomic segments, can be easily highlighted. This software provides a new visualization tool for microbial comparative genomics.
© 2003 Elsevier Science B.V. All rights reserved.
Keywords: Perl/Tk; Visualization; Whole-genome comparison
Advances in automated DNA sequencing techniques and the whole-genome shotgun strategy have resulted in a tremendous increase in the amount of available genome data. To date, over 115 whole genome sequences and their annotations have been published, and more than 580 genome sequencing projects are “in progress” (data from http://igweb.integratedgenomics.com/GOLD/) (Bernal et al., 2001). Nearly 90% of the publicly available genome sequences are from microbes. These valuable data provide good subjects for experimental studies and functional analysis. Comparative genomics has become increasingly attractive, especially between two closely related species.
* Corresponding author. Tel.: +86-10-6488-8543; fax: +86-10-6487-7837. E-mail address: [email protected] (R. Chen).
Through pairwise genome comparisons, information about genome structure, evolutionary patterns and the processes that influence genome design can be revealed (Andersson, 2000). For example, comparison of the genome organizations of Mycoplasma genitalium and Mycoplasma pneumoniae revealed a conserved gene order (Himmelreich et al., 1997). The comparison between Chlamydia trachomatis and Chlamydia pneumoniae indicated the importance of the unique DNA regions (Kalman et al., 1999). From comparisons of more closely related bacterial genomes, such as pairs of H. pylori strains (H. pylori 26695 and H. pylori J99) and Mycobacterium species (Mycobacterium leprae and Mycobacterium tuberculosis), it was discovered that replication plays a major role in directing genome evolution (Tillier and Collins, 2000). The similarities and differences of the five publicly available Salmonella genome sequences
0167-7012/03/$ - see front matter © 2003 Elsevier Science B.V. All rights reserved. doi:10.1016/S0167-7012(03)00094-0
J. Yang et al. / Journal of Microbiological Methods 54 (2003) 423–426
reveal that horizontal gene transfer has provided each genome with 10–12% of unique DNA (Edwards et al., 2002). Several software tools are available for comparison at the genomic level. ACT (http://www.sanger.ac.uk/Software/ACT/) is a useful DNA sequence comparison viewer based on the annotation tool Artemis (Rutherford et al., 2000), but it cannot carry out the comparison by itself. Alfresco (Jareborg and Durbin, 2000) can visualize the results of many external analysis programs as comparison figures. VISTA (Mayor et al., 2000), a set of tools for visualizing global sequence alignments, uses a continuous curve to represent the level of identity between sequences. The Unix-based program MUMmer (Delcher et al., 1999, 2002) applies a new algorithm for fast alignment of large sequences, and its results can be presented graphically by DisplayMUMs (http://www.tigr.org/software/displaymums/). However, all of the currently available visualization tools were written in Java and require a Java Virtual Machine to run, so they might be difficult for biologists to use. There are also some web-based tools (Florea et al., 2000; Schwartz et al., 2000) that are applicable to comparative sequence analysis; unfortunately, they are inconvenient for genome-scale comparison or for use with unpublished data. To remedy this situation, we developed GenomeComp, a new visualization tool with a user-friendly interface. GenomeComp is written in Perl/Tk and is implemented as a stand-alone program that can be run on Linux, Unix, Mac OS X and Microsoft Windows operating systems. It can easily be used to compare, parse and visualize large genomic sequences, especially closely related genomes such as those of different species or strains. In comparisons of microbial genomes (several million base pairs in length), we have not encountered technical difficulties as long as sufficient system memory is available.
For easier public access, we have put the Perl source code (or executable binary programs for some common systems) and the complete manual, including examples (step-by-step tutorials), on our Web site (http://www.chgb.org.cn/GenomeComp/). It is freely available to all researchers. In general, GenomeComp takes two whole genome sequences as input, in FASTA, GenBank or EMBL format. It then automatically invokes the external programs from the BLAST2 suite (Altschul
et al., 1997) through system calls to perform the comparison between the paired sequences in the background. All comparison parameters have been given defaults optimized for common use but are also configurable by expert users. The external nucleotide sequence alignment program can be either MEGABLAST or BLASTN (both of which are available from the NCBI anonymous FTP server). We set the former as the default choice for the comparison of large genomic sequences because it uses a “greedy algorithm” to save time and is much faster than the latter (Zhang et al., 2000). Summarizing the voluminous and tedious BLAST textual output is an unavoidable problem when processing large genomic sequences (Sonnhammer and Durbin, 1994). In our program, we adopt a strict “positional coverage limitation” to evaluate the matches reported by BLAST and concisely present only the most relevant information. This method suppresses spurious matches by restricting the number of matched sequences in a given region to only one: if a reported maximal segment pair (MSP) is already completely covered (for both the query and the subject segments) by other MSPs with higher scores, this MSP is rejected. After the redundant and “junk” matches are filtered out, the accepted matches are presented graphically and dynamically in a scrollable and zoomable canvas, with different colors representing different matching lengths (see Fig. 1). The annotation information of each genome, such as open reading frames (ORFs) from the GenBank or EMBL feature records, is also displayed. When the mouse is moved over an item in the canvas, a summary report is displayed automatically. In addition, clicking on a significant part provides more detailed information. The whole or regional graphical comparison result in the canvas can be saved in PostScript format for further analysis or for presentation (an example is shown on the bottom right of Fig.
1), which can be viewed by using the free Ghostscript program from Aladdin (http://www.cs.wisc.edu/~ghost/). GenomeComp has some additional features for special situations: (i) when only one genome sequence is provided, the program performs a self-comparison and automatically filters out matches at identical coordinates, so that structural features such as repeat sequences can be discovered; (ii) if
Fig. 1. The top of the figure is a screen shot of GenomeComp showing a section of the comparison result for the E. coli K-12 MG1655 (upper) and S. flexneri 2a 301 (lower) genomes. The left frame lists some custom configuration options. The top right canvas shows the dynamic comparison result, and the bottom right label displays brief information on the selected item. The navigation window at the bottom left lists the specific regions of the two genomic sequences separately. The bottom right is an overview of the PostScript output file for the comparison of three genomes: S. flexneri 2a 301 (abbreviated Sf301), E. coli K-12 (MG1655) and E. coli O157 (EDL933), from top to bottom (here each sequence is anchored at the center to give a better view).
given three genome sequences (one as reference and the other two as related sequences), the program automatically compares the two related sequences against the reference and presents both results in the canvas (see the bottom right of Fig. 1 for an example); (iii) for pairs of input sequences of widely different lengths, GenomeComp can give a better local view of the comparison result by anchoring one sequence at a chosen position in the canvas; (iv) in comparisons of closely related organisms, the regions specific to each genome are of particular interest, since they may reveal significant information about lateral gene transfer, or even
be related to pathogenicity islands when pathogen and non-pathogen genomes are compared (Hacker et al., 1997). GenomeComp therefore provides a navigation window for accessing specific regions (shown on the bottom left of Fig. 1). The output of the comparison contains the start, end and length in base pairs of the specific regions of both compared sequences. Double-clicking on a listed line scrolls the canvas to the exact location of the specific region, which is highlighted. GenomeComp has been applied efficiently to the comparison between Shigella flexneri 2a strain 301 (genome size 4.6 Mbp) and its close relatives, the
non-pathogenic K-12 strain MG1655 (genome size 4.6 Mbp) and enterohemorrhagic O157:H7 strain EDL933 (genome size 5.5 Mbp) of Escherichia coli (Jin et al., 2002). The comparison results (shown on the bottom right of Fig. 1) indicate that the three strains share a common “backbone” sequence of nearly 90%. The colinearity is broken only once between MG1655 and EDL933, by a 422 kb inversion at the replication terminus (Perna et al., 2001); surprisingly, however, it is broken 13 times between MG1655 and Sf301 by inversions and translocations of DNA segments greater than 5 kb. In a detailed analysis of the Shigella purE region (see top of Fig. 1), we found that numerous counterpart E. coli genes are absent as a result of four deletions immediately flanking the Shigella purE gene. Since “black holes” have previously been found to contribute a great deal to virulence in Shigella (Maurelli et al., 1998), is it possible that one or more of these deleted genes are additional “black holes”? This question merits further study.
Acknowledgements
The development of GenomeComp was funded by the Chinese Academy of Sciences Grant No. KSCX22-07.
References
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
Andersson, S.G.E., 2000. The genomics gamble. Nat. Genet. 26, 134–135.
Bernal, A., Ear, U., Kyrpides, N., 2001. Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. 29, 126–127.
Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., Salzberg, S.L., 1999. Alignments of whole genomes. Nucleic Acids Res. 27, 2369–2376.
Delcher, A.L., Phillippy, A., Carlton, J., Salzberg, S.L., 2002. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30, 2478–2483.
Edwards, R.A., Olsen, G.J., Maloy, S.R., 2002. Comparative genomics of closely related salmonellae. Trends Microbiol. 10, 94–99.
Florea, L., Riemer, C., Schwartz, S., Zhang, Z., Stojanovic, N., Miller, W., McClelland, M., 2000. Web-based visualization tools for bacterial genome alignments. Nucleic Acids Res. 28, 3486–3496.
Hacker, J., Blum-Oehler, G., Muhldorfer, I., Tschape, H., 1997. Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution. Mol. Microbiol. 23, 1089–1097.
Himmelreich, R., Plagens, H., Hilbert, H., Reiner, B., Herrmann, R., 1997. Comparative analysis of the genomes of the bacteria Mycoplasma pneumoniae and Mycoplasma genitalium. Nucleic Acids Res. 25, 701–712.
Jareborg, N., Durbin, R., 2000. Alfresco—a workbench for comparative genomic sequence analysis. Genome Res. 10, 1148–1157.
Jin, Q., Yuan, Z., Xu, J., Wang, Y., Shen, Y., Lu, W., Wang, J., Liu, H., Yang, J., Yang, F., Zhang, X., Zhang, J., Yang, G., Wu, H., Qu, D., Dong, J., Sun, L., Xue, Y., Zhao, A., Gao, Y., Zhu, J., Kan, B., Ding, K., Chen, S., Cheng, H., Yao, Z., He, B., Chen, R., Ma, D., Qiang, B., Wen, Y., Hou, Y., Yu, J., 2002. Genome sequence of Shigella flexneri 2a: insights into pathogenicity through comparison with genomes of Escherichia coli K12 and O157. Nucleic Acids Res. 30, 4432–4441.
Kalman, S., Mitchell, W., Marathe, R., Lammel, C., Fan, J., Hyman, R.W., Olinger, L., Grimwood, J., Davis, R.W., Stephens, R.S., 1999. Comparative genomes of Chlamydia pneumoniae and C. trachomatis. Nat. Genet. 21, 385–389.
Maurelli, A.T., Fernández, R.E., Bloch, C.A., Rode, C.K., Fasano, A., 1998. “Black holes” and bacterial pathogenicity: a large genomic deletion that enhances the virulence of Shigella spp. and enteroinvasive Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 95, 3943–3948.
Mayor, C., Brudno, M., Schwartz, J.R., Poliakov, A., Rubin, E.M., Frazer, K.A., Pachter, L.S., Dubchak, I., 2000. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16, 1046–1047.
Perna, N.T., Plunkett, G., Burland, V., Mau, B., Glasner, J.D., Rose, D.J., Mayhew, G.F., Evans, P.S., Gregor, J., Kirkpatrick, H.A., Pósfai, G., Hackett, J., Klink, S., Boutin, A., Shao, Y., Miller, L., Grotbeck, E.J., Davis, N.W., Lim, A., Dimalanta, E.T., Potamousis, K.D., Apodaca, J., Anantharaman, T.S., Lin, J., Yen, G., Schwartz, D.C., Welch, R.A., Blattner, F.R., 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409, 529–533.
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.A., Barrell, B., 2000. Artemis: sequence visualisation and annotation. Bioinformatics 16, 944–945.
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., Miller, W., 2000. PipMaker—a web server for aligning two genomic DNA sequences. Genome Res. 10, 577–586.
Sonnhammer, E.L., Durbin, R., 1994. A workbench for large-scale sequence homology analysis. Comput. Appl. Biosci. 10, 301–307.
Tillier, E.R.M., Collins, R.A., 2000. Genome rearrangement by replication-directed translocation. Nat. Genet. 26, 195–197.
Zhang, Z., Schwartz, S., Wagner, L., Miller, W., 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214.