[2] EMBL Data Library
By PATRICIA KAHN and GRAHAM CAMERON
Background

The EMBL Data Library was established in 1980 to collect, organize, and distribute a database of nucleotide sequences and related descriptive information extracted from publications in scientific journals. Since 1982 this work has been done in collaboration with GenBank (Los Alamos, NM, and Mountain View, CA), and recently the DNA Database of Japan (Mishima) joined the collaboration. Each of the three groups collects a portion of the total reported sequence data and exchanges it with the others on a regular basis. Since 1987 the Data Library has also provided additional data sets useful to molecular biologists.

Databases

EMBL Nucleotide Sequence Database

The Nucleotide Sequence Database was the original motivation for the establishment of the group, and it continues to be the main endeavor of the Data Library. The data are presently distributed as "flat" text files where each entry comprises a single contiguous sequence and accompanying descriptive information (annotation). An entry is made up of different line types, each with its own two-letter code. A sample entry is shown in Fig. 1. Each entry is uniquely identified within a release by its name (DMSHAKE3 in Fig. 1) and across releases by its accession number list (X07133 and Y00847 in Fig. 1). References to database entries should always cite the primary (first) accession number.

Release 18 of the database (February 1989) contained 27.2 × 10^6 bases, approximately 40 times more than release 1 (June 1982). As illustrated in Fig. 2, this increase has not been linear; rather, it reflects a rapidly accelerating rate of growth, a trend which will undoubtedly continue. Clearly, the resources of the databases cannot increase at the same rate, and, therefore, coping with the data flow demands that we revise and streamline our data processing procedures.
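The line-type layout of a flat-file entry, described above, lends itself to simple parsing. The following Python sketch groups the lines of one entry by their two-letter codes and extracts the primary accession number. It is only an illustration under stated assumptions: that the code occupies the first two columns, that XX lines are spacers, that a // line ends the entry, and that indented lines after SQ carry the sequence. The names `parse_entry` and `accession_numbers` and the shortened sample entry are hypothetical and are not part of the Data Library's own software.

```python
from collections import defaultdict


def parse_entry(text):
    """Group the lines of one flat-file entry by two-letter line code.

    Assumptions (not taken from the chapter): the code occupies the first
    two columns, "XX" lines are spacers, "//" terminates the entry, and
    lines following "SQ" that start with a blank carry the sequence.
    """
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if in_sequence and line.startswith(" "):
            # Sequence line: drop the blanks between blocks of bases.
            sequence.append("".join(line.split()))
            continue
        code, rest = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():
            continue
        in_sequence = code == "SQ"
        fields[code].append(rest)
    return dict(fields), "".join(sequence)


def accession_numbers(fields):
    """Return accession numbers in order; the first is the primary one."""
    joined = " ".join(fields.get("AC", []))
    return [acc.strip() for acc in joined.split(";") if acc.strip()]


# Hypothetical miniature entry in the spirit of Fig. 1 (sequence shortened,
# so it no longer matches the 1176 BP stated on the ID and SQ lines).
SAMPLE = """\
ID   DMSHAKE3 standard; RNA; 1176 BP.
XX
AC   X07133; Y00847;
XX
SQ   Sequence 1176 BP;
     gagagagaaa aaguggagua
//
"""

fields, seq = parse_entry(SAMPLE)
primary = accession_numbers(fields)[0]
```

Keeping the primary accession number as the first element of the list mirrors the recommendation that citations use the first accession number of an entry.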
As part of an effort to achieve this we have just finished a restructuring of our database, including installation into the ORACLE relational database management system. This should result in a more efficient service to users in the short term and will enable us to offer
METHODS IN ENZYMOLOGY, VOL. 183
Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.
FIG. 1. Sample entry from the EMBL Nucleotide Sequence Database (entry DMSHAKE3; accession numbers X07133 and Y00847; Drosophila melanogaster Shaker mRNA, 1176 BP).
FIG. 2. Growth of the EMBL Nucleotide Sequence Database.
an on-line query system in the longer term. In the future the relational tables will also be available to interested users.

Although the EMBL and GenBank databases regularly exchange data, the contents of the two databases do not completely overlap. Unfortunately for the user, this has meant that a complete collection of nucleotide sequences was obtainable only by using both the EMBL and GenBank databases. The problem of incomplete overlap arises because the two groups initially set up their own (different) procedures for processing data and because the development of procedures to exchange all data (including corrections and updates) has proceeded slowly. Difficulties have arisen, for instance, in situations where one group enters a new sequence and merges it with an existing sequence which they believe is contiguous with the new data. When these merged data arrive at the other database, the procedures for determining whether they are new or are already in the database are confounded, since the data are in fact a mixture of both.

Until recently, in an attempt to avoid duplication of data, we have taken a conservative approach: if any of the data in an entry were already present, the entire entry was not added to the database. However, the majority of researchers now seem more concerned about missing data than about duplicates, and we have therefore decided to change our approach. As of Release 19 (May 1989) any entry whose status with respect to the database is unclear will be included in a separate, "ancillary" database; together, the main and ancillary databases should constitute a complete collection of data from the latest EMBL and GenBank releases. This represents a temporary solution; a major effort is underway by both groups to resolve this problem in a more elegant manner.
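The difference between the earlier conservative policy and the policy adopted as of Release 19 can be illustrated with a short Python sketch. It is a deliberate simplification and not the Data Library's actual procedure: entries are modeled as sets of segment identifiers, and the names `classify` and `route`, the three status labels, and the rule that a fully duplicated entry is skipped are all assumptions made for the example.

```python
def classify(entry_parts, known_parts):
    """Classify an incoming entry as 'new', 'present', or 'unclear'.

    entry_parts: set of identifiers for the sequence segments in the entry
    known_parts: set of identifiers already held in the main database
    Both representations are hypothetical simplifications.
    """
    overlap = entry_parts & known_parts
    if not overlap:
        return "new"
    if overlap == entry_parts:
        return "present"
    return "unclear"  # a mixture of new and already-held data


def route(entry_parts, known_parts, conservative=False):
    """Return 'main', 'ancillary', or 'skip' for an incoming entry.

    conservative=True mimics the earlier policy (any overlap blocks the
    whole entry); the default mimics the policy adopted as of Release 19.
    """
    status = classify(entry_parts, known_parts)
    if status == "new":
        return "main"
    if status == "present" or conservative:
        return "skip"
    return "ancillary"


known = {"segA", "segB"}
merged = {"segB", "segC"}  # a merged entry: segB is held, segC is new
old_policy = route(merged, known, conservative=True)
new_policy = route(merged, known)
```

Under the earlier policy the merged entry is dropped and the new segment is lost; under the later policy it is kept in the ancillary collection at the cost of some duplication.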
SWISS-PROT Protein Sequence Database

The SWISS-PROT database, maintained collaboratively by the EMBL Data Library and Amos Bairoch (University of Geneva), is a collection of amino acid sequences from the Protein Identification Resource collection (PIR, Washington, D.C.) along with translations of coding sequences in our Nucleotide Sequence Database. Its releases are coordinated with those of the Nucleotide Sequence Database such that they include translations of the newest data. SWISS-PROT is essentially identical in format to the Nucleotide Sequence Database and therefore the two collections can easily be used together. ORACLE-based data management systems presently under development aim to integrate the two collections more fully. SWISS-PROT entries contain pointers to related data in the EMBL Nucleotide Sequence Database, the Protein Data Bank of three-dimensional structures (maintained at the Brookhaven National Laboratory, Upton, NY), and the PIR. These cross-references will soon be listed in the corresponding nucleotide sequence entries.
Eukaryotic Promoter Database (EPD)

In 1988 we began to distribute a database of eukaryotic promoters, prepared by Philipp Bucher (presently at Stanford University, Stanford, CA). This database contains detailed annotation of eukaryotic transcription start sites present in the Nucleotide Sequence Database and documented in the research literature. The database itself contains no sequences, but rather references to the sequences. Its releases are coordinated with, and designed to be used with, the Nucleotide Sequence Database. The EPD is important not only for the scientific information it contains but also as a prototype for the future. Rapid growth in both the volume and complexity of sequence data makes it increasingly impractical for the data banks to maintain a pool of expertise capable of annotating all sequences; indeed, it has been argued that such annotation is interpretive work of a kind inappropriate to the major sequence databases. The EPD, as a database maintained remote from, but coordinated with, a centralized sequence collection, provides a model whereby the detailed biological annotation can be carried out at remote sites where the appropriate expertise is present.
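The idea of a database that holds references to sequences rather than the sequences themselves can be sketched in Python as follows. Everything here is hypothetical: the chapter does not describe the EPD record layout, so the class `PromoterRef`, its fields, the helper `upstream_region`, and the made-up accession number and bases serve only to show how a reference is resolved against a separate sequence collection. The sketch also assumes the promoter lies on the forward strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromoterRef:
    """A promoter record that points into a sequence entry.

    All field names are hypothetical; the chapter states only that the
    EPD holds references to sequences rather than the sequences themselves.
    """
    name: str
    accession: str   # primary accession number of the sequence entry
    tss: int         # 1-based position of the transcription start site


def upstream_region(ref, sequences, length):
    """Return up to `length` bases immediately upstream of the start site.

    `sequences` maps accession numbers to sequence strings and stands in
    for the Nucleotide Sequence Database (forward strand assumed).
    """
    seq = sequences[ref.accession]
    end = ref.tss - 1                # 0-based index of the start site
    start = max(0, end - length)
    return seq[start:end]


sequences = {"X00000": "aaaccctttgggatgc"}  # made-up accession and bases
ref = PromoterRef(name="example promoter", accession="X00000", tss=11)
region = upstream_region(ref, sequences, 5)
```

Because the record stores only an accession number and a position, the two collections must be kept in step, which is one reason coordinated releases matter.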
Restriction Enzyme Data

The database of restriction enzymes provided by Dr. R. Roberts (Cold Spring Harbor Laboratory, Cold Spring Harbor, NY) is distributed with all releases of the nucleotide sequence data.
Data Acquisition

Since 1987 the staff of the nucleotide sequence databases have devoted a great amount of effort to developing systems which encourage researchers to submit their newly determined sequences and related data directly to the databases, preferably in computer-readable form. In doing so, we are attempting to replace our traditional method of data acquisition, in which database staff identify sequence-containing articles in the scientific literature, enter the sequence data into our computer, and then annotate these sequences by reading through the corresponding articles to locate the relevant information. Direct submission is important for many reasons: (1) Abstracting the information from the research literature is labor intensive. (2) Entries can appear in the database much sooner if we get the information from the authors early in the publication process. (3) Machine-readable submissions reduce the chance of introducing errors. (4) Authors can bring far more expertise to bear on annotating their own data than the database staff can. (5) Interpretive annotation is more a research than a database activity. (6) The scale of present-day sequencing is reaching the point where journals are finding it inappropriate to print the actual sequences. If no mechanism to ensure their deposition in the databases is in place, research papers will be published whose underlying data are unavailable to the research community. (7) With an increasing number of journals publishing sequence-related papers without printing the sequences, scanning the literature to locate sequence data will no longer be feasible.

Progress with direct submissions has been good, with almost 60% of the new data coming as submissions in the latter half of 1988. Since 1989 two journals (Nucleic Acids Research and Plant Molecular Biology) have begun not only to request that authors submit sequence data to the database but to require it as a condition for publication.
While these developments are encouraging, the accelerating rate at which sequence data are generated ensures that the work of entering and annotating the remaining (nonsubmitted) data will present an enormous, and eventually unmanageable, workload. The goal of ensuring more direct submissions therefore remains a high priority.

How to Submit Data to EMBL Data Library

Researchers who intend to submit data to any of the sequence databases should obtain a copy of a Sequence Data Submission Form, which solicits all the information needed for a nucleotide or protein sequence entry and provides instructions on how to submit the data. The form exists both in a paper version and as a computer-readable text file which can be
completed using a text editor. Many molecular biology journals distribute the paper version to authors reporting sequence data, and a few journals publish it periodically.1,2 The computer-readable version of the form is distributed with all releases of the EMBL and GenBank databases and can also be obtained via computer network (using the EMBL File Server, see below). Alternatively, either of these versions can be obtained by contacting the Data Library in any of the ways listed in the Appendix.

A data submission should include the sequence data in computer-readable form (computer network mail, magnetic tape, or MS-DOS or Macintosh floppy diskette) and a completed data submission form for each submitted sequence. Data can be sent to the Data Library via computer network, telefax, or normal post (see Appendix).

Upon receiving the data, Data Library staff check whether the submission is complete. If so, it is processed within a few days, and the submitter is then informed which accession number(s) have been assigned to the data. This accession number serves as a reference that permanently identifies these data in the database, and we recommend that it be cited when referring to the data in publications. If the submission is incomplete, the submitter is notified what additional information is needed. The submission form also asks submitters whether the data should be made available to the public as soon as we have finished the corresponding entry or be withheld until publication.

Data Distribution

Magnetic Tape Subscriptions

The main way of distributing copies of the entire database continues to be by mailing magnetic tapes. Much of this distribution is done by the Data Library, and some is done by secondary distributors, such as groups which supply the data along with sequence analysis software. The four releases of nucleotide sequence data in 1988 were supplied to an estimated total user community of approximately 10,000 scientists.
1 P. Kahn and D. Hazledine, Nucleic Acids Res. 16, i (1988).
2 P. Kahn, D. Hazledine, and G. Cameron, Plant Mol. Biol. 11, 541 (1988).

The sequence databases produced by the Data Library are available either singly or as a yearly subscription. Users who subscribe to the EMBL Nucleotide Sequence Database releases for 1 year receive one release every 3 months along with the corresponding release of the EPD and the Restriction Enzyme database (see above). The four yearly SWISS-PROT releases can also be obtained by subscription. Subscribers are charged an annual fee
which covers our distribution costs (see Table I). Users can also order the latest release of the nucleotide or SWISS-PROT collections at any time. Further information about subscriptions can be obtained by contacting us at the address given in the Appendix.

TABLE I
DISTRIBUTION CHARGES

                                Fee per release(b)      Fee per year(b)
Subscriber category             Tape       CD-ROM       Tape       CD-ROM
Academic users
  EMBL member states(a)         DM 50      DM 100       DM 200     DM 400
  Nonmember states              DM 100     DM 200       DM 400     DM 800
Industrial/commercial users     DM 250     DM 500       DM 1000    DM 2000

(a) The member states of EMBL (as of May 1989) are Austria, Denmark, Federal Republic of Germany, Finland, France, Greece, Israel, Italy, The Netherlands, Norway, Spain, Sweden, Switzerland, and the United Kingdom.
(b) Charges given in West German marks.

CD-ROM

CD-ROM is attractive as a medium on which to distribute the sequence databases because it represents an inexpensive way to store large quantities of data and because the devices required to read it are within financial reach of the typical personal computer user. In early 1989 the Data Library produced a prototype disk which includes prototype CD-ROM retrieval software developed in collaboration with Philips Du Pont Optical and Circle Information Systems. The design of a common GenBank/EMBL/PIR CD-ROM format should be finalized soon, and thereafter disks will also contain the data in this format.

Network Access

The rapid pace of research in molecular biology has generated a requirement for better and more rapid access to the databases than that provided by quarterly releases. As part of an attempt to meet this need, EMBL set up in early 1988 a file server which enables researchers worldwide to retrieve entries from the major databases available at EMBL via computer network. Data are available over the file server as soon as Data Library staff have completed the preliminary entry, while the indices are updated daily. This is particularly attractive in combination with the data submission arrangements we have with Nucleic Acids Research and Plant Molecular Biology, since it enables readers of these journals to access the
sequence data in computer-readable form as soon as the issue containing them appears. The file server can be used by anyone with access to the BITNET/EARN network or to any other network which has a gateway into BITNET/EARN (e.g., JANET in the United Kingdom or ARPANET in the United States). It is provided free of charge, though users may have to meet some or all of the communication costs, depending on the accounting system of their local computer service. The following brief instructions should enable interested users to get started. Use of the facility is very simple and involves sending file server commands as standard electronic mail to the address
[email protected]. Each line of the mail message should consist of a single file server command and nothing else. The most important file server command, to get users started, is HELP. If the file server receives this command, it will return a help file to the sender, explaining in some detail how to use the facility. In order to send electronic mail to a BITNET/EARN address, users must find out which command they have to use on their own machine and how they should format the address
[email protected]. Users who do not already know how to do this should contact their local computer service or, if all else fails, contact the Data Library and we will do our best to help.

Below are some examples which illustrate how to send commands to the file server using a VAX/VMS system that is a BITNET/EARN node running JNET software. To send a HELP command to the file server, use the operating system command MAIL as follows:

$ MAIL (filename) "JNET%""NETSERV@EMBL"""

where (filename) is the name of a file containing file server commands. To request help information the file should contain the following command:

HELP

To request a copy of the data submission form, the file should contain the following GET command:

GET DATALIB:DATASUB.TXT

To request a copy of a nucleotide sequence entry with the accession number X00596, the file should contain the following GET command:

GET NUC:X00596

EMBNet

Another way in which EMBL is attempting to increase the availability and usefulness of the various databases is by establishing a European molecular biology network (EMBNet) consisting of EMBL connected via DECNet to a series of national centers in western European countries. [DECNet was chosen because DEC (Digital Equipment Corporation) has made a firm commitment to migrate to the ISO-OSI standard for networking as soon as the standard is finalized.] The long-term plan is for EMBL to provide the national centers with copies of the latest sequence databases and other relevant data collections and with retrieval software. The national centers will then make these tools, along with analytical software, available to researchers within their countries and will offer training and support in their use.

In 1988 a trial phase of the EMBNet project was initiated with the following four centers: SERC Laboratory (Daresbury, UK), CITI2 (Paris, France), CAOS/CAMM Center (Nijmegen, The Netherlands), and Hoffmann-La Roche (Basel, Switzerland). Progress so far includes establishment of network connections with the centers and implementation of software which updates their copy of the database on a daily basis. It is planned to add several additional nodes to the system in 1989, after which a gradual expansion will lead to the inclusion of at least one node in each of the EMBL member countries.

Appendix: How to Contact EMBL Data Library
Computer network: [email protected] (for data submissions); [email protected] (for questions requiring a personal response)
Postal address: Data Submissions, EMBL Data Library, Postfach 10.2209, 6900 Heidelberg, Federal Republic of Germany
Telephone: +49-6221-387-258
Telefax: +49-6221-387-306
Telex: 461613 (embl d)
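As an illustration of the file server usage described under Network Access, the following Python sketch assembles the body of a request message with one command per line. It is not EMBL software: the helper names `get_command` and `build_request` are hypothetical, and the sketch assumes that a request for a sequence entry follows the same GET LIBRARY:NAME pattern as the request for the data submission form. Sending the message is deliberately left out, since the mail command and address format depend on the user's local system.

```python
def get_command(library, name):
    """Format a GET command as LIBRARY:NAME (pattern assumed from the
    chapter's data submission form example)."""
    return f"GET {library}:{name}"


def build_request(commands):
    """Return a mail body with exactly one file server command per line."""
    cleaned = []
    for command in commands:
        command = command.strip()
        if not command or "\n" in command:
            raise ValueError("each command must be a single non-empty line")
        cleaned.append(command)
    return "\n".join(cleaned) + "\n"


body = build_request([
    "HELP",
    get_command("DATALIB", "DATASUB.TXT"),  # data submission form
    get_command("NUC", "X00596"),           # nucleotide entry by accession
])
```

The resulting text could be saved to a file and mailed to the file server address with whatever mail command the local system provides.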
[3] Protein Sequence Database
By WINONA C. BARKER, DAVID G. GEORGE, and LOIS T. HUNT
Introduction

The Protein Sequence Database has been maintained by researchers at the National Biomedical Research Foundation (NBRF) since the early 1960s. The database was originally compiled by the late Margaret O. Dayhoff as a collection of sequences for the study of evolutionary relationships between proteins, and it continues to be maintained by a scientific