COMPUTER CORNER
TIBS 25 – AUGUST 2000
DANTE: a workbench for sequence analysis
With the advent of modern sequencing techniques, new DNA and protein sequences are released at an astonishing rate. In 1996 there were 1 × 10^6 DNA sequences in the GenBank database, in 1999 there were about 5 × 10^6 (Ref. 1), and the number is expected to double in only about 14 months. This represents a serious problem for the analysis and functional assignment of sequences, in terms of both computer speed and the difficulty of analysing the results (Ref. 2). For example, there are >100 entries in the GenBank database for the Hepatitis C polymerase protein, the majority of which correspond to different genotypes of the protein and are consequently not useful for functional assignment. Systems are continuously being developed to annotate sequences automatically and to provide the user with compact summaries of the results (Ref. 3). However, it is clear that annotation would benefit enormously from expert human verification, as shown by the better performance of curated databases such as SwissProt (Ref. 4). Our aim was to develop an automatic system that enables the user to perform analysis and annotation of genomes or new DNA sequences. It should also serve as a common, user-friendly platform where scientists can collaborate via the Internet to improve the quality of sequence annotations. These features have been implemented in a
Web-based server called DANTE (DNA analysis to extract information). DANTE automatically searches protein databases with a given DNA sequence, locates the proteins encoded by the sequence, attempts their functional assignment and presents the results in an interactive, user-friendly way. Finally, it also ‘cleans up’ the results to simplify their manual inspection and annotation.
Description
DANTE is organized in two main modules: an analysis module, which performs the homology searches, retrieves the results and processes them, and a display module, which presents the results graphically. Both are written in the Perl programming language. The analysis module submits the DNA query sequence to a BLASTX server (nucleotide queries against protein databases) (Ref. 5) to establish the position and identity of the encoded proteins. Currently, the NCBI server (http://www.ncbi.nlm.nih.gov) is used, but the option to use different servers will be implemented soon. If the query sequence is long (>8000 nucleotides), DANTE splits it into pieces to avoid overloading the server. The results are received by the DANTE server via email, and a procedure takes care of merging the pieces, if needed. The protein hits sharing an identity higher than a user-defined threshold
Table 1. Performance of the DANTE system in analysing selected viral genomes

                 HepC(h)   MS2(i)   TMV(j)   SV40(k)
BLAST(a)         >500      60       385      >500
Identity(b)      190       19       38       47
Joint(c)         42        10       11       34
Fragm(d)         9         7        9        14
Annots(e)        5         6        8        14
Real ORFs(f)     1*        4        5**      7
Retrieved(g)     All       All      All      All

(a) Number of hits found in a standard BLASTX run with this genome.
(b) Remaining entries after filtering with an identity threshold (90% in all cases).
(c) Number of groups after joining hits with the same position and frame in the query.
(d) Number of remaining groups after removing fragments, according to database information.
(e) Final number of groups, obtained after removing overlapping groups with equivalent annotations.
(f) Number of ORFs that the genome contains (according to literature). *, in form of polyprotein; **, one ORF presents a read-through stop codon.
(g) Number of real ORFs that the system is able to retrieve.
(h) Hepatitis C Virus (GenBank accession D13558).
(i) Bacteriophage MS2 (GenBank accession J02467).
(j) Tobacco Mosaic Virus (GenBank accession X68110).
(k) Simian Virus 40 (GenBank accession J02400).
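The first two reduction steps reported in Table 1 (rows "Identity" and "Joint") can be illustrated with a minimal Python sketch. It is not DANTE's code, which is written in Perl and is not reproduced in this article. The `Hit` record, the function names and the `tolerance` parameter are hypothetical; in particular, the article does not say how closely two hits must coincide in position to be joined, so the tolerance of 30 nucleotides is purely an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLASTX hit on the query (hypothetical record layout)."""
    entry: str       # database entry name
    start: int       # start position on the query, in nucleotides
    end: int         # end position on the query, in nucleotides
    frame: int       # reading frame of the hit
    identity: float  # percent identity to the query translation


def filter_by_identity(hits, threshold=90.0):
    """Keep hits whose identity is higher than the user-defined threshold."""
    return [h for h in hits if h.identity > threshold]


def join_hits(hits, tolerance=30):
    """Join hits with the same frame and (nearly) the same position.

    Assumption: two hits coincide when both their start and end lie
    within `tolerance` nucleotides of the first hit of a group.
    """
    groups = []
    for hit in sorted(hits, key=lambda h: (h.frame, h.start, h.end)):
        for group in groups:
            ref = group[0]
            if (hit.frame == ref.frame
                    and abs(hit.start - ref.start) <= tolerance
                    and abs(hit.end - ref.end) <= tolerance):
                group.append(hit)
                break
        else:
            # No existing group matches: this hit starts a new one.
            groups.append([hit])
    return groups
```

Calling `join_hits(filter_by_identity(hits))` on a list of `Hit` records returns a list of groups, each a list of hits. Comparing every hit against the first member of a group keeps the sketch short; a real implementation might instead compare against a consensus position for the group.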
and coincident in location are merged in a ‘group’. This is a key step in filtering redundant information (e.g. in this step most of the different genotypes are merged into one). The analysis module then uses database functional annotations to identify the groups and to detect those composed of incomplete proteins (fragments), which can optionally be removed. Unfortunately, the annotation that an entry represents a fragment is often missing in the databases or is worded in many different ways (the GenBank database offers a great number of examples). DANTE compares and merges the annotations of various protein databases (GenBank, SwissProt and PIR) and is able to detect most entries that are fragments (Table 1). Therefore, an additional application of the system is the maintenance and correction of databases. The next analysis step detects overlapping groups (i.e. those sharing the same position and common functional annotations in the database) and removes redundant ones. However, removing a group does not cause loss of information. During the latter two steps, the system notifies the user whenever a decision requires human expertise. At this stage the system assigns keywords to groups. The keywords are those of the SwissProt database, because they form a complete, finite and accurate set, used in other applications (Ref. 6). However, because not all protein hits necessarily correspond to an entry in SwissProt, we use the following strategy: (1) direct extraction of keywords from SwissProt entries, if available; (2) if not, extraction of keywords from SwissProt entries linked to other database entries, if any; (3) if not, matching of keywords against annotations in databases (e.g. annotation ‘cell to cell transport protein’ directly matches keyword ‘transport’), if possible; (4) if not, extraction of the SwissProt descriptions that best match the annotations for the group (e.g.
GenBank annotation ‘nuclear pore complex glycoprotein p62’ matches SwissProt description for entry P17955 ‘nuclear pore complex glycoprotein p62 nucleoporin’ best). At the end of this process, the analysis module provides a list of groups, composed of protein entries. Each has annotations and keywords attached, and some might be labeled as ‘removed’ (we refer to the valid, non-removed groups as ‘main’). This list is passed to
0968 – 0004/00/$ – See front matter © 2000, Elsevier Science Ltd. All rights reserved.
PII: S0698-0004(00)01616-9
the display module. An example of the analysis module’s performance is shown in Table 1.
The display module works dynamically using the information provided by the analysis module and allows the user to modify it (Fig. 1). The user can choose to display ‘main’, ‘removed’ or ‘all’ groups, and can remove a main group, ‘undelete’ a removed one, or even create a new group. Options to obtain a summary of the keywords assigned, inspect the BLAST results or change the threshold of identity are provided. Each extracted group can be displayed together with the respective information: position, frame, entries [with direct links to the sequence retrieval system SRS (Ref. 7)], annotations, keywords and warnings (which can be switched off). The interface allows a very fast and detailed inspection of the hits. A complete analysis of a medium-size viral genome takes a matter of a few minutes and does not necessarily require inspection of the raw BLAST files. DANTE places no limit on sequence length, and performs optimally for lengths up to 100 000 nucleotides.
The system will soon be improved by the addition of several filters, allowing the user to inspect the hits according to their phylogeny (e.g. display only bacterial hits), the function of the proteins (e.g. display only kinases), the databases they come from, etc. This will be complemented by the integration of other tools related to genome analysis [e.g. open reading frame (ORF) predictions].
In conclusion, DANTE is a new, public system for the analysis of nucleotide sequences. Its power derives from the ability to gather dispersed information present in the databases, use this information to make assumptions regarding the properties of the protein hits found, and display them in an easily interpretable way. DANTE also represents a tool to share knowledge and experience, as it allows different users to inspect the same sequence(s), store their results and make them available to the scientific community. The system is also very useful in quickly and reliably identifying cases in which different databases have inconsistent annotations. DANTE can be accessed at http://www.cnb.uam.es/~tamames/DNATE.html.

Figure 1. An example of a graphical display of DANTE. The screen is divided into two frames. The larger one (right) is used to display the nucleotide sequence (wide, red bar) and the groups found (thin, colored bars), as well as the control buttons. The information about a group (position, frame, number of entries, keywords, annotations and protein entries present in the databases) is displayed in the left frame after clicking on the group bar. Any database entry can be viewed by clicking on its name (lower part of the frame). The annotations and keywords can be edited, added and removed by the user. Warning messages are also displayed in the left frame. The figure shows the analysis of the genome of Tobacco Mosaic Virus (GenBank accession number X68110).

Acknowledgements
We are grateful to Armin Lahm for useful discussions. This work was supported by an EU TMR fellowship to J.T.

References
1 Benson, D.A. et al. (1999) GenBank. Nucleic Acids Res. 27, 12–17
2 Pennisi, E. (1999) Keeping genome databases clean and up to date. Science 286, 447–450
3 Andrade, M. et al. (1999) Automated genome sequence analysis and annotation. Bioinformatics 15, 391–412
4 Bairoch, A. et al. (1997) The SWISS-PROT protein sequence data bank and its supplement, TrEMBL. Nucleic Acids Res. 25, 31–36
5 Altschul, S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402
6 Tamames, J. et al. (1998) EUCLID: automatic classification of proteins in functional classes by their database annotations. Bioinformatics 14, 542–543
7 Etzold, T. et al. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol. 266, 114–128

JAVIER TAMAMES* AND ANNA TRAMONTANO
Istituto di Ricerche di Biologia Molecolare P. Angeletti, 00040 Pomezia, Italy.
Emails: [email protected]; [email protected]
*Present address: Centro de Astrobiología, INTA/CSIC, Carretera de Ajalvir Km.4, 28850 Torrejón de Ardoz, Madrid, Spain.