Gene 208 (1998) 31–35
Tools for visualization and integration of intermediate sequencing results in large disease gene discovery projects
Andrey Rzhetsky *, Sergey Kalachikov, Xiaolu Ye, Peisen Zhang, James J. Russo
Columbia Genome Center, Columbia University, 630 West 168th Street-BB 16–1611, New York, NY 10032, USA
Received 16 June 1997; received in revised form 30 October 1997; accepted 31 October 1997; Received by T. Gojobori
Abstract
We describe two Java applets which are useful for insightful presentation of intermediate experimental data in gene discovery projects involving large scale sequencing. One of these applets provides a physical map of a genomic region and provides easy access to the second applet, which furnishes a detailed map of sequence contigs associated with clones on the physical map. In particular, the second applet displays all the known information about each contig, including the presence of exons, database homology ‘hits’, repetitive elements and other features; the graphics are linked to other World Wide Web pages, providing detailed information on each feature. These applets should be useful to other research groups working on large sequencing projects. © 1998 Elsevier Science B.V.
Keywords: Java applet; World Wide Web; HTML page; Exon trapping; cDNA; BLAST hits; DNA clones
1. Introduction
The invention of automated sequencers set off an avalanche of genome sequencing projects that generate large amounts of data on a daily basis. Handling, reviewing and making sense of these data turned out to be a separate problem, one that is often compounded by the diversity of computers and operating systems used by the different laboratories participating in a common research project. In this communication, we describe the overall design and a user-friendly Java interface developed for the Columbia Genome Center computer system that greatly facilitates thoughtful daily browsing through new sequence data. In particular, we provide here a detailed description of two Java applets. The first of these applets provides a constantly updated pictorial representation of the relationships among the clones making up the map of a particular genomic region. The clone depictions serve as links to the second, more inclusive applet, which consists of representations of sequence contigs within each clone, upon which is mapped all the known information about each contig, including the presence of exons, database homology ‘hits’, repetitive elements and other features. The graphics representing these features are in turn linked to alignment information, database descriptions and the like.
* Corresponding author. Tel: +1 212 3047552; Fax: +1 212 3045515; [email protected]
Abbreviations: HTML, hypertext markup language; BAC, bacterial artificial chromosome; BLAST, basic local alignment search tool; dbEST, database of expressed sequence tags; dbSTS, database of sequence tagged sites; DNA, deoxyribonucleic acid; PAC, P1 artificial chromosome; URL, uniform resource locator.
0378-1119/98/$19.00 © 1998 Elsevier Science B.V. All rights reserved. PII S0378-1119(97)00635-5
2. Experimental and discussion
2.1. Complex of programs for browsing through large arrays of data
In our system, all newly obtained sequence clones are processed daily with the programs PHRED and PHRAP (Green, 1996), which assemble individual clones into longer sequence contigs. These sequence contigs are then analysed with programs of the BLAST family to identify significant similarities with sequences stored in public databases (Altschul et al., 1990). Putative exons are identified with the Gene Finder program (Solovyev et al., 1994), and all sequence contigs are compared with cDNA and ‘exon trapped’ sequences generated in the same region. Finally, diverse data on each sequence contig are integrated and made available to all the
project investigators and visualized through the World Wide Web using Java technology. Numerous HTML files embedding the two Java applets described in this paper are generated automatically on a daily basis by a set of supporting programs written in Perl 5.0 and C++. These programs depend on our local computer environment and file directory structure; they are not easily customizable and, for that reason, are not described here in greater detail.
2.2. Cosmid map
The first applet, CosmidMap, supports a ‘reprogrammable’ map of a set of genomic clones [we used mostly cosmids, but any vector can be used, see Fig. 1A and Zhang et al. (1994) for a description of one of the methods of physical mapping] covering a genomic region of interest. All clones are represented by labeled color-coded boxes; by pressing a mouse button while pointing at a particular clone, the user can see all connections of that clone with the rest of the clones in the map. Furthermore, by clicking a color-coded box, the user can open a URL (uniform resource locator) that leads to further detailed information related to this clone. The ‘reprogramming’ of the clone map for personal datasets consists simply of specifying applet parameters on the corresponding HTML page; the resulting HTML page should include the standard syntax required for embedding Java applets [see, for example, Cornell and Horstmann (1996)] and a set of parameters used by our applet, which is described below and shown in Fig. 2A.
The parameters are defined as follows. The parameter named ‘edges’ contains a string encoding the edges of the graph whose vertices are clones; the presence of an edge indicates an overlap between the two clones involved. Each edge is uniquely specified by the names of two clones. The two names of an edge should be separated by a hyphen (no blank spaces are allowed), and clone pairs are separated by commas. Thus, in the example in Fig.
2A, we have defined only two edges, one between clone1 and clone2, and the other between clone2 and clone3. The rest of the applet parameters define the position and the appearance of the clone labels. Parameters whose names are formed by concatenating a clone name with ‘x’ or with ‘y’ (e.g. ‘clone1x’ and ‘clone1y’) define the horizontal and the vertical positions (in pixels), respectively, of the lower left vertex of the clone's rectangle. Parameters whose names are formed by concatenating a clone name with ‘s’ define the width of the clone tag on the map (i.e. the length of the rectangle) and the color of this tag; the absolute value 1 corresponds to the narrowest tag, and values 2, 3, and 4 can be used to represent longer cloning vectors, such as BACs and PACs; the color of each tag
is defined by the sign of this parameter; a minus sign corresponds to white tags, and a positive sign yields yellow tags. Under our setup, there is therefore a rough, but not exact, correspondence between the clone length and the graphical representation of its length. Finally, the unique parameter with the name ‘where’ indicates an address (a server and a directory) in the World Wide Web where the browser should search for an HTML file after the user clicks a clone tag (‘http://genome2.cpmc.columbia.edu/~andrey’ in Fig. 2A). In our example, after clicking on the clone tag ‘clone1’, the browser would attempt to download the file ‘http://genome2.cpmc.columbia.edu/~andrey/clone1.html’.
2.3. Contig profile
The second applet, ContigProfile, provides a clickable image map displaying a contig of genomic sequence (see Fig. 1B) highlighted with several sublength features: sequence similarities identified with BLAST searches against four different sequence databases (Altschul et al., 1990); repetitive sequences; exon trapping data; cDNA data obtained in a local project [see, for example, Bonaldo et al. (1996)]; and the putative exons predicted with one of the currently available programs for analysis of genomic sequences [see, for example, Solovyev et al. (1994)]. The applet can be used in four modes defined by the states of two toggle keys, ‘Condense/Decondense’ and ‘Flip/Flip back’. The former toggle displays multiple cDNA and exon-trapped sequences either ‘condensed’ to a single line or separately; the latter toggle allows the user to orient the sequence properly, i.e. to arrange it in such a way that the chromosome centromere and telomere are situated to the left and to the right of the sequence contig, respectively. The parameter settings for this applet are summarized in Fig. 2B.
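The effect of the ‘Flip/Flip back’ toggle on feature coordinates can be illustrated with a minimal Java sketch. It is not the ContigProfile source code: the class FlipSketch, its Box type and the flip method are hypothetical names, and the sketch assumes that box positions are 1-based and inclusive and that mirroring a contig also swaps the ‘+’ and ‘−’ strands (written here with an ASCII ‘-’). The contig length is that of the contig in Fig. 1B (23 719 base pairs); the box coordinates are invented for illustration.

```java
// Hypothetical sketch of coordinate flipping; not taken from the ContigProfile applet.
final class FlipSketch {

    /** One feature box; positions are assumed to be 1-based and inclusive. */
    static final class Box {
        final int start;
        final int end;
        final char direction; // '+', '-', or 'o' (strand unknown or irrelevant)

        Box(int start, int end, char direction) {
            if (start < 1 || start > end) {
                throw new IllegalArgumentException("need 1 <= start <= end");
            }
            if (direction != '+' && direction != '-' && direction != 'o') {
                throw new IllegalArgumentException("direction must be '+', '-' or 'o'");
            }
            this.start = start;
            this.end = end;
            this.direction = direction;
        }
    }

    /**
     * Mirrors a box within a contig of the given length and swaps its strand.
     * Assumption: a box may extend up to and including the last site of the contig.
     */
    static Box flip(Box box, int seqLength) {
        if (box.end > seqLength) {
            throw new IllegalArgumentException("box extends past the end of the contig");
        }
        int newStart = seqLength - box.end + 1;   // old right edge becomes the left edge
        int newEnd = seqLength - box.start + 1;   // old left edge becomes the right edge
        char newDirection = box.direction == '+' ? '-'
                : (box.direction == '-' ? '+' : 'o');
        return new Box(newStart, newEnd, newDirection);
    }

    public static void main(String[] args) {
        Box flipped = flip(new Box(101, 250, '+'), 23719);
        // prints 23470..23619 -
        System.out.println(flipped.start + ".." + flipped.end + " " + flipped.direction);
    }
}
```

Under these assumptions, applying flip twice returns the original box, as one would expect of a ‘Flip back’ toggle.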
The applet parameter names and the functions of the corresponding parameters are: ‘SeqLength’, the length of the contig; ‘Name’, a unique name associated with this sequence contig (in our projects, this is the name of the oldest sequence in the contig); ‘Freshest’, a string recording the time when the newest sequence in the contig was appended; and ‘BoxNum’, the total number of color-coded boxes on the contig map (123 in our example). For each ‘feature box’, the user must provide a number of parameters determining the position of the box, its type, and the comment and URL associated with this box. These can be entered manually or, preferably, via ancillary software. Each box should have a unique number and several associated parameters, including a label; the established coding (see Fig. 1B) assigns a different color to each label type. As an example, for box number 123, the user should define the following parameters (see Fig. 2B):
Fig. 1. Screenshots of (A) the CosmidMap applet displaying a physical map covering a region of human chromosome 13 associated with B-cell chronic lymphocytic leukemia, and (B) the ContigProfile applet showing a sequence contig from the 6q region of the human genome.
Fig. 2. HTML (hypertext markup language) text required for embedding (A) CosmidMap and (B) ContigProfile applets into an HTML page. The ellipsis in (A) indicates the place where additional parameters determining the placement and appearance of the ‘clone2’, ‘clone3’, ‘clonen’ boxes should be inserted. In (B), parameters are given only for elements 1 and 123, the ellipsis representing all the rest.
‘start123’ and ‘end123’: the first and the last nucleotide sites corresponding to the box position within the contig (both numbers should be positive and less than the length of the contig itself, and the first value should be less than or equal to the second);
‘label123’: the label of this feature box, which can be one of the following:
  ‘repeat’ (a repetitive sequence);
  ‘cDNA’ (a homology to a cDNA sequenced in this project);
  ‘XT’ (a homology to exon trapping sequence data obtained in this project);
  ‘blastn’ (similarity to an entry from the non-redundant database of nucleotide sequences);
  ‘blastp’ (similarity to an entry from the non-redundant database of protein sequences);
  ‘blastd’ (similarity to an entry from dbEST);
  ‘blasts’ (similarity to an entry from dbSTS);
  ‘GF’ (a predicted exon);
‘direction123’: the strand in which the feature resides; this can take values of ‘+’, ‘−’, or ‘o’, the last option corresponding to the case where the actual strand is unknown or irrelevant (when the ‘+’ or ‘−’ options are chosen, boxes have pointers on their right or left sides, respectively; the ‘o’ option produces a standard rectangle);
‘Comment123’: any string of text, which is displayed at the top of the screen each time the mouse pointer passes over the corresponding feature box;
‘URL123’: a URL with additional information on this box; this URL is accessed and displayed after the user clicks the corresponding feature box.
The vertical position of each box relative to the sequence contig representation is built into the applet: BLAST hits are shown above the sequence line using narrow boxes; repetitive element boxes are displayed directly on the line; and cDNA and trapped exon homologies, as well as GeneFinder results (Solovyev et al., 1994), are positioned below the full sequence.
2.4. Biological example
Fig. 1 shows examples of real data from two projects that are currently being actively pursued by the Columbia Genome Center. Fig. 1A shows a physical map covering a large region (about 600 000 nucleotides) of human chromosome 13 hypothetically associated with B-cell chronic lymphocytic leukemia; each colored bar stands for a cloning vector that can be either a cosmid (shorter rectangles) or a PAC (the longer rectangle). The ‘white’ cloning vectors are those that were selected for sequencing in the early stages of the project, and the yellow ones are vectors that were added more recently. Since the images of the cloning vectors are linked to the lists of assembled sequence contigs associated with each cosmid or PAC, the CosmidMap applet displaying this map is used
extensively by researchers to monitor the progress of ongoing DNA sequencing associated with this region. The lists of current sequence contigs are in turn linked to the ‘cartoons’ (created with the ContigProfile applet) picturing the integrated information available on each sequence contig. For instance, Fig. 1B shows a cartoon representing a large (23 719 base pairs) contig in an area of the human genome commonly deleted in breast, ovarian and other cancers. Thus, this portion of human chromosome 6 presumably harbors a gene that might act as a tumor suppressor. The ContigProfile applet allows one to quickly grasp which regions are most likely to contain a transcription unit. In this example, the right-hand part of the contig shows cDNA hits (red arrows), predicted exons (blue arrows), and homologies with protein database sequences (black boxes) and with nucleotide database sequences (white boxes); exon trapped sequences are depicted in the middle of the contig (green boxes). The ContigProfile applet allows for efficient visual screening of hundreds of such contigs on a daily basis.
2.5. Availability of the applets
Information on the availability of the applets can be obtained by sending requests to the first author.
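Independently of the applets themselves, the CosmidMap parameter conventions of Section 2.2 can be illustrated with a minimal Java sketch. The class and method names (CosmidMapParams, parseEdges, tagWidth, tagColor, cloneUrl) are hypothetical and are not taken from the applet source. The sketch relies only on the string formats described in the text, and it additionally assumes that clone names themselves contain no hyphens or commas.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of decoding CosmidMap parameter values; not the applet's own code.
final class CosmidMapParams {

    /** Splits an 'edges' value such as "clone1-clone2,clone2-clone3" into clone-name pairs. */
    static List<String[]> parseEdges(String edges) {
        List<String[]> pairs = new ArrayList<>();
        // A limit of -1 keeps trailing empty strings so that "a-b," is rejected below.
        for (String pair : edges.split(",", -1)) {
            String[] names = pair.split("-", -1);
            boolean malformed = pair.contains(" ")   // blank spaces are not allowed
                    || names.length != 2
                    || names[0].isEmpty()
                    || names[1].isEmpty();
            if (malformed) {
                throw new IllegalArgumentException("malformed edge: '" + pair + "'");
            }
            pairs.add(names);
        }
        return pairs;
    }

    /** Width class of a clone tag: the absolute value of the '<clone>s' parameter. */
    static int tagWidth(int s) {
        return Math.abs(s);
    }

    /**
     * Tag color from the sign of the '<clone>s' parameter: minus is white, plus is yellow.
     * The text does not cover a value of 0; treating it as yellow is an assumption.
     */
    static String tagColor(int s) {
        return s < 0 ? "white" : "yellow";
    }

    /** Address requested after a click: the 'where' value, a slash, the clone name, ".html". */
    static String cloneUrl(String where, String cloneName) {
        String base = where.endsWith("/") ? where.substring(0, where.length() - 1) : where;
        return base + "/" + cloneName + ".html";
    }

    public static void main(String[] args) {
        for (String[] edge : parseEdges("clone1-clone2,clone2-clone3")) {
            System.out.println(edge[0] + " overlaps " + edge[1]);
        }
        System.out.println(tagColor(-2) + " tag of width " + tagWidth(-2));
        System.out.println(cloneUrl("http://genome2.cpmc.columbia.edu/~andrey", "clone1"));
    }
}
```

Rejecting malformed edge strings early is a design choice of this sketch; the text does not say how the applet itself reacts to such input.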
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
Bonaldo, M.F., Lennon, G., Soares, M.B., 1996. Genome Res. 6, 791–806.
Cornell, G., Horstmann, C.S., 1996. Core Java. The SunSoft, Sun Microsystems Inc., 901 San Antonio Road, Palo Alto, CA 94303, USA.
Green, P., 1996. PHRED and PHRAP. University of Washington, Seattle, WA, see URL http://www.mbt.washington.edu/phrap.docs/.
Solovyev, V.V., Salamov, A.A., Lawrence, C.B., 1994. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 22, 5156–5163.
Zhang, P., Schon, E.A., Fischer, S.G., Cayanis, E., Weiss, J., Kistler, S., Bourne, P.E., 1994. An algorithm based on graph theory for the assembly of contigs in physical mapping of DNA. Comp. Appl. Biosci. 10, 309–317.