SHORT COMMUNICATION
GenomeInspector: Basic Software Tools for Analysis of Spatial Correlations between Genomic Structures within Megabase Sequences
KERSTIN QUANDT, KORBINIAN GROTE, AND THOMAS WERNER1
Institut für Säugetiergenetik, GSF-Forschungszentrum für Umwelt und Gesundheit GmbH, Ingolstädter Landstrasse 1, D-85758 Oberschleissheim
Received September 18, 1995; accepted February 3, 1996
The acquisition of genomic sequence data outpaces the functional evaluation of those sequences by a vast margin. Most software available for sequence analysis predicts individual features and does not assess correlations between different motifs (level 1 methods). Here, we present a second-level software package called GenomeInspector (GI) for further analysis of results obtained with level 1 methods. Our approach does not require any a priori knowledge about motif organization and was designed as a modular package with a graphical user interface. Three examples of GI application are presented. © 1996 Academic Press, Inc.
1 To whom correspondence should be addressed. Telephone: (49)89-3187-4050. Fax: (49)-89-3187-4400. E-mail: [email protected].
Current genome sequencing projects add megabases of partially or completely unannotated sequences to the databases. The software available for DNA sequence analysis is usually limited to the prediction of individual features and does not assess the correlation of different motifs (level 1 methods, e.g., GCG-FindPatterns) (1, 4, 6, 10, 11). However, the spatial organization of multiple individual elements is a hallmark of biologically functional units such as promoters. One of the main features of our program GenomeInspector (GI) is its ability to assess distance correlations between large sets of sequence elements, thus enabling the detection of significant patterns of elements without any a priori knowledge. A unique feature of GI is its ability to define classes by correlations (e.g., promoters) directly from the sequence data without predefined training sets. However, organizational features without distance correlations (e.g., highly variable enhancer-promoter distances) will be missed. GI was designed as a modular package, thus facilitating the inclusion of new methods. The program has the following features:
A graphical user interface: This interface allows interactive analysis and visualization of megabase sequences and comprehensive display of results.
Fast performance and the ability to handle long sequences (>10^6 nucleotides), such as complete chromosomes, by using symbolic representations: Regions are defined as extended sequence elements (e.g., open reading frames, long/short interspersed elements, replication origins, scaffold/matrix attachment regions), whereas points represent sequence elements reduced to an anchor position (transcription factor binding sites, restriction sites, or recognition sites). A set of points is used, for example, to describe all binding sites of a specific transcription factor on the selected sequence(s).
Integration of input data from various level 1 sources: Open reading frames (ORFs) can be extracted directly from IG-, GCG-, or GenBank-formatted files or data bank entries. These ORFs can be restricted by certain criteria, for example, a minimum or maximum length, a defined gene family, or a given term in their annotations. Positions of protein binding sites can be read directly from GCG-FindPatterns, ConsInspector, and MatInspector output files (4, 11). Addition of input modules to include further level 1 software results is facilitated by a core routine provided with GI.
Visualization of the sequence and its elements: The overall distribution of GI element sets on the sequence can be visualized in a sliding window representation, in a corresponding histogram, or as a full-scale map over the entire sequence length. Basic statistics such as minimum, maximum, mean, and standard deviation are also supplied.
The detection and subsequent selection of distance-correlated sets of points or regions as well as the extraction of corresponding primary sequence data: A user-defined range around a selected set of reference elements is analyzed for the occurrence of correlated elements. Overrepresented distance correlations are immediately evident as peaks in the graphical output (Fig. 1). The significance of peaks found in a distance correlation analysis can be further explored by calculating the ratio of points found versus points expected from random distribution. Extraction of correlation-selected elements in the form of GI input allows the assessment of multiply correlated sequence elements by repeated analysis steps. The extraction of DNA sequence data containing the selected elements is also possible.
r-scan statistics (7): We included this method to evaluate and visualize any extremes in the spacing of specified sequence markers, such as clustering, overdispersion, or excessive regularity.
FIG. 1. Example of GenomeInspector output that shows the absolute numbers of ABF1 sites found in the ±1000-bp range of 1223 open reading frames from four complete chromosomes of Saccharomyces cerevisiae. Some explanations of graphic features were added to the program output.
GENOMICS 33, 301–304 (1996), ARTICLE NO. 0197. 0888-7543/96 $18.00. Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
GI is well suited to solving a number of common problems of large-scale sequence analysis, including promoter definition. The following three examples illustrate the principles and the power of our software package.
(i) Do specific transcription factor binding sites cluster in promoter regions? The upstream regions (−1000 bp) of 1223 ORFs, extracted from the Saccharomyces cerevisiae (yeast) chromosomes II, III, VIII, and XI, totaling 2.3 million bp, were analyzed for binding sites of the transcription factor ABF1, represented by the DNA sequence RTCRYNNNNNACG (2) using the IUPAC ambiguity code. GI revealed a clear peak of ABF1 sites 100–200 bp upstream of the ORF starts (see Fig. 1), yielding an observed/expected (from random distribution) ratio of 5.80.
(ii) Can promoters of a selected gene family be identified by common occurrence and/or organization of transcription factor binding sites? The promoter regions of glycolytic enzymes in yeast were analyzed with GI for the presence of binding sites of seven different transcription factors available at the time of analysis (ABF1, CDEI, GAL4, GCN4, GCR1, RAP1, REB1; Cons/MatInspector descriptions). Only ABF1, GCR1, and RAP1 were found to yield significantly overrepresented peaks in the putative promoter regions (up to −1000 bp).
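The distance correlation analysis and the observed/expected ratio used in examples (i) and (ii) can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not GenomeInspector's actual implementation (GI is written in C, and its internal algorithm is not specified here). The function names, the single-strand scan, the fixed-width binning, and the uniform-distribution model for the expected count are all assumptions made for this example.

```python
import re
from collections import Counter

# IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(sequence, pattern):
    """Return 0-based start positions of all matches of an IUPAC pattern.

    Only the given strand is scanned (an assumption of this sketch); a
    lookahead is used so that overlapping matches are also reported.
    """
    body = "".join(IUPAC[base] for base in pattern.upper())
    regex = re.compile("(?=" + body + ")")
    return [m.start() for m in regex.finditer(sequence.upper())]


def distance_histogram(references, sites, max_distance, bin_width):
    """Count sites by signed distance (site - reference) in fixed-width bins.

    Keys are the lower edge of each bin. Negative distances lie upstream
    of a reference that is assumed to be on the forward strand.
    """
    counts = Counter()
    for ref in references:
        for site in sites:
            distance = site - ref
            if -max_distance <= distance < max_distance:
                counts[(distance // bin_width) * bin_width] += 1
    return counts


def observed_expected_ratio(observed, n_refs, n_sites, bin_width, seq_length):
    """Ratio of observed pairs in one bin to the count expected by chance.

    Assumes sites are uniformly distributed over seq_length and ignores
    edge effects at the sequence ends (hypothetical null model).
    """
    expected = n_refs * n_sites * bin_width / seq_length
    return observed / expected


if __name__ == "__main__":
    # Toy data (hypothetical): one ABF1-like site placed upstream of a
    # reference position at 200.
    toy = "C" * 50 + "ATCACAAAAAACG" + "C" * 300
    sites = find_sites(toy, "RTCRYNNNNNACG")
    hist = distance_histogram([200], sites, max_distance=1000, bin_width=100)
    for bin_start in sorted(hist):
        print(bin_start, hist[bin_start])
```

In this sketch a peak corresponds to a bin whose count is well above the chance expectation. A real analysis would also have to account for ORF orientation, so that "upstream" is measured relative to the coding strand.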
Thus, the program correctly selected those three transcription factors already known to be involved in the regulation of glycolytic gene expression (12, 13) solely on the basis of distance correlation analysis. This example shows that the comparative analysis of promoter regions of gene families will allow the a priori generation of a library of promoter classes defined by the occurrence and organization of sets of transcription factor binding sites. Since GI does not rely on predefined patterns, it is well suited to detecting completely new correlations.
(iii) Is the spatial organization of regulatory units involving more than one ORF unique? Gininger et al. (5) described a cluster of four binding sites for the transcription factor GAL4 on yeast chromosome II at distances of ca. 400 bp from the GAL1 ORF (sense strand) and 250 bp from the GAL10 ORF (antisense). Analysis of the upstream regions of all ORFs on the four yeast chromosomes for putative GAL4 sites led to the identification of another GAL4 site located in an almost identical arrangement between two unknown reading frames on chromosome VIII (406 bp from YHR081w and 295 bp from YHR080c). This second, potentially GAL4-regulated pair of ORFs was extracted from a total of 1223 ORFs and 130 GAL4 sites in just two correlation/extraction steps in less than 5 min. Thus, GI allows preselection of small sets of candidate elements for subsequent experimental design and verification.
Many more applications of GenomeInspector are possible. Further details, including discussion of other level 2 approaches (e.g., for the detection of promoters (8, 9) or eukaryotic genes (3)), will be addressed elsewhere.
The software package GenomeInspector was developed in C and is available from the ftp site ariane.gsf.de. The software requires a UNIX computer with the X Window System and the Simple X Library by D. Giampaolo (libsx). GI is free for noncommercial research only. Commercial users should contact the authors.
ACKNOWLEDGMENTS
We thank Kornelie Frech and Ruth Brack-Werner for critically reading the manuscript. We also appreciate D. Giampaolo’s work in developing the Simple X Library, which helped us to supply a graphical user interface for GenomeInspector. This work was supported by the BMBF Verbundprojekt GENUS 413-4001-01 IB 306 D (Förderschwerpunkt Bioinformatik).
REFERENCES
1. Cardon, L. R., and Stormo, G. D. (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 223: 159–170.
2. Dhawale, S. S., and Lane, A. C. (1993). Compilation of sequence-specific DNA-binding proteins implicated in transcriptional control in fungi. Nucleic Acids Res. 21: 5537–5546.
3. Dong, S., and Searls, D. B. (1994). Gene structure prediction by linguistic methods. Genomics 23: 540–551.
4. Frech, K., Herrmann, G., and Werner, T. (1993). Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res. 21: 1655–1664.
5. Gininger, E., Varnum, S. M., and Ptashne, M. (1985). Specific DNA binding of GAL4, a positive regulatory protein of yeast. Cell 40: 767–774.
6. Hertz, G. H., Hartzell, G. W., and Stormo, G. D. (1990). Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6: 81–92.
7. Karlin, S., Blaisdell, B. E., Sapolsky, R., Cardon, L., and Burge, C. (1993). Assessments of DNA inhomogeneities in yeast chromosome III. Nucleic Acids Res. 21: 703–711.
8. Kondrakhin, Y. V., Kel, A. E., Kolchanov, N. A., Romashchenko, A. G., and Milanesi, L. (1995). Eukaryotic promoter recognition by binding sites for transcription factors. Comput. Appl. Biosci. 11: 477–488.
9. Prestridge, D. S. (1995). Predicting pol II promoter sequences using transcription factor binding sites. J. Mol. Biol. 249: 923–932.
10. Prestridge, D. S., and Stormo, G. (1993). SIGNAL SCAN 3.0: New database and program features. Comput. Appl. Biosci. 9: 113–115.
11. Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. (1995). MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23: 4878–4884.
12. Scott, E. W., and Baker, H. V. (1993). Concerted action of the transcriptional activators REB1, RAP1, and GCR1 in the high-level expression of the glycolytic gene TPI. Mol. Cell. Biol. 13: 543–550.
13. Willet, C. E., Gelfman, C. M., and Holland, M. J. (1993). A complex regulatory element from the yeast gene ENO2 modulates GCR1-dependent transcriptional activation. Mol. Cell. Biol. 13: 2623–2633.