Bioinformatics in Europe — The federation strikes back


The recent decision of the European Molecular Biology Laboratory (EMBL) to found an outstation in Cambridge, UK, brought the application of computers to molecular biology back into the news. The establishment of this outstation, to be known as the European Bioinformatics Institute (EBI), has ended a year of speculation and politicking about how this critical aspect of research was to be supported in Europe.

But what is this centre to do? The practical researcher, and particularly the biotechnologist, rarely wants data for its own sake. Data (or, more accurately, information) is used to expand understanding by linking it to other, different types of information, and that knowledge is used to direct strategy in R&D. The EBI will fill some of the gaps in the chain from raw data to knowledge, leaving it up to individual researchers to fill in the rest.

Where will the gaps be? There is no lack of raw data. The combined EMBL/GenBank DNA sequence database now contains over 80 Mb of DNA sequence, and the Protein Databank (PDB) of structural sequences contains over 1000 3-D structures. The validity of, and access to, this data has proven controversial over the past year, with faster search methods showing that 'junk' sequences have been entered into the sequence database (Ref. 1), and that impossible structures exist in the PDB (with, for example, atoms that are too close together, or chemically implausible bond structures). The underlying problem is that, while compiling a unified database from diverse data inputs is difficult with limited staff, checking the data for consistency is almost impossible. Ideally, the compilation of a database could be linked to its analysis and interpretation; however, the effort involved in this is far greater than the database centres could provide, and implies the imposition of a degree of editorial control over the data that would not be tolerated by the researchers.
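
One of the consistency checks mentioned above, flagging atoms that sit too close together, is simple to state even if applying it across a whole database is laborious. The following Python sketch is illustrative only: the function name, the single distance threshold and the coordinates are assumptions made for this example, and a realistic check would use element-specific radii rather than one cut-off.

```python
from itertools import combinations
from math import dist

# Hypothetical cut-off in angstroms: atom centres closer than this are flagged.
# A single threshold is an assumption made to keep the sketch short.
MIN_SEPARATION = 0.8


def find_clashes(atoms, min_separation=MIN_SEPARATION):
    """Return pairs of atom names whose centres are closer than min_separation.

    atoms is a list of (name, (x, y, z)) tuples.
    """
    clashes = []
    # Compare every unordered pair of atoms once.
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(atoms, 2):
        if dist(xyz_a, xyz_b) < min_separation:
            clashes.append((name_a, name_b))
    return clashes


# Made-up coordinates for illustration; they do not come from any real entry.
example = [
    ("N", (0.0, 0.0, 0.0)),
    ("CA", (1.46, 0.0, 0.0)),
    ("C", (1.50, 0.1, 0.0)),  # deliberately placed almost on top of CA
]

print(find_clashes(example))
```

The all-pairs comparison grows with the square of the number of atoms, which gives some sense of why checking every entry by brute force is costly for a centre with limited staff.
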

© 1993, Elsevier Science Publishers Ltd (UK)

Three potential solutions to this problem may be proposed:

(1) To supply the database without a guarantee that the information is biologically meaningful, only that it is a correctly annotated version of what was supplied to the database centre. A 'database centre', the European Commission's (EC) European Nucleotide Sequence Centre (ENSC), could perform this fairly small task, which would be politically uncontroversial and easy to fund. However, this would not take the field of European bioinformatics any further forward.

(2) The USA's answer to the same problem was to aim for something grander, the National Center for Biotechnology Information (NCBI). Under David Lipman, this has expanded to perform software development and some informatics research into ways of linking different data sets together. The NCBI's Entrez software links DNA and protein sequences with their cognate bibliographic entries (piped directly to NCBI from the National Library of Medicine in the same building). Lipman is planning to extend this idea to a larger suite encompassing more areas of molecular genetics and sequence analysis. This will enable users to check the significance of the data that they have just extracted from the database.

(3) EMBL's answer to the NCBI was to have been an even bigger European Bioinformatics Institute, a centre for up to 40 bioinformatics researchers as well as software engineers, which would perform database management and other functions. Political pressure led to a reduction in the scale of this concept, culminating in the more manageable EBI that has been agreed for Cambridge. The EBI will include the EMBL datalibrary/ENSC activities.

The Japanese have also recently entered the field with their own informatics centre.

One of the arguments against the formation of the EBI was the question of what it was to do. This is the central problem in the field of bioinformatics, and illustrates that bioinformatics is a misnomer.
Unlike other informatics fields, such as medical informatics, which is concerned with the efficient transfer, storage, linking and retrieval of information on patients, bioinformatics is still essentially a research activity. We do not know what questions to ask of the data, and hence what computer tools to develop to ask them.

One generic tool that is widely accepted is the linking of entries in different databases, so that a search of one type of data can direct the searcher to relevant entries in other databases. DNA sequence information becomes more valuable to the researcher if it can be linked to protein sequence, and hence to protein 3-D structure, receptor or enzymic function, or to sub-databases of promoter sequences or genetic-map data, and hence to possible sites of expression. Such linkage is, in theory, a simple matter that could be automated, but because of the diverse ways in which research results are generated and recorded, it is hard to achieve.

The Integrated Genome Project (DKFZ, Heidelberg, Germany) aims to produce such links between diverse data sets, as do some of the NCBI's plans. The Imperial Cancer Research Fund's [(ICRF), London, UK] clone database links genetic-mapping and clone-construction data with hybridization and sequence data for several libraries of human DNA. The 'A Caenorhabditis elegans database' (ACEDB) project, originally built as an informatics resource for the Caenorhabditis elegans genome project but now used in other contexts too, also links mapping and sequencing data into a unified format. At the most basic level, a large number of specialist databases contain 'pointers' to relevant entries in other databases, which at least allow the user to make the link. Such linked systems of information are probably going to be one of the goals of the EBI.

Similar systems are marketed commercially. The protein structural-analysis programs from Oxford Molecular can access the protein sequence of a structure and then search for similar sequences, as well as similar structures. This illustrates that there is commercial value in such systems.
This is a sore point in the USA, where molecular-biology software companies are crying foul at the NCBI for destroying their market by providing such valuable software, essentially for free.

TIBTECH JUNE 1993 (VOL 11)

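The pointer-based linkage between databases described earlier can be sketched in a few lines. The Python fragment below is an illustration only and does not represent the software of any centre named in this article: the database names, the accession numbers and the `xrefs` field are all invented for the example. It follows pointers from one entry to every entry reachable from it, and simply stops at a pointer whose target is missing.

```python
from collections import deque

# Toy stand-ins for separately maintained databases.
# Every database name, accession number and field here is invented.
DATABASES = {
    "dna": {
        "D0001": {"description": "example gene",
                  "xrefs": [("protein", "P0001")]},
    },
    "protein": {
        "P0001": {"description": "example enzyme",
                  "xrefs": [("structure", "S0001"), ("literature", "L0001")]},
    },
    "structure": {
        "S0001": {"description": "example 3-D structure", "xrefs": []},
    },
    "literature": {
        "L0001": {"description": "example citation", "xrefs": []},
    },
}


def linked_entries(database, accession, databases=DATABASES):
    """Return every (database, accession) pair reachable by following pointers.

    Entries are listed in the order they are discovered (breadth-first),
    starting with the entry that was asked for.
    """
    start = (database, accession)
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        db, acc = queue.popleft()
        entry = databases.get(db, {}).get(acc)
        if entry is None:
            # Dangling pointer: the target entry does not exist, so stop here.
            continue
        for target in entry["xrefs"]:
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


for db, acc in linked_entries("dna", "D0001"):
    print(db, acc)
```

The traversal itself is trivial; the difficulty described in the text lies in getting consistent pointers recorded in the first place, since each database is compiled by different people in different formats.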

The more sophisticated levels of analysis do not just link items of data that are known, a priori, to be related, such as protein and DNA sequence; rather, they analyse the information and deduce new knowledge from it, linking that to other facts in the process. This is the true goal of bioinformatics, but it remains elusive because we do not know what questions to ask. While techniques such as receptor-binding assays and combinatorial-library screening are standard discovery tools in the laboratory, applicable to many problems, few similarly generic tools exist in the computer.

The problem breaks into three parts, all of which the biotechnologist can attack. The first is the algorithms: how do you analyse molecular-biological data? This is an unanswered question, of course, but there is a lot of expertise out there, and accessing it efficiently is crucial to the success of a bioinformatics-based research project. The EBI may be able to help, acting as a focus for research knowledge and for the informal network of contacts that spreads gut-feeling information about methods (as opposed to the formal published results). The pharmaceutical industry in Europe has felt this lack so acutely that it has set up a group to share bioinformatics information between drug companies, as a working party of the Association of Information Officers in the Pharmaceutical Industry.

This brings us to the second, and by far the more difficult, problem: how do you share bioinformatics results in a way that lets people benefit from them rapidly and efficiently? The Caenorhabditis community is the test-bed for an electronic 'community system', which aims to make such informal electronic communication easy.
It is a prototype study only, because, unlike formal communication systems such as electronic journals or database releases, a community system must be able to encompass as diverse a range of information as could be swopped over a beer at the bar: recipes, gossip, autoradiographs, sequence alignments and, above all, the community's judgement of their value. Again, the EBI may be able to fulfil such a role, building on the work of EMBnet, a loose network of centres concerned with distributing the EMBL datalibrary, but with other activities that could develop into such a system. Large companies could also set up such systems internally (indeed, a company's internal-mail and bulletin-board systems can act in this role already), but as much of the point of such systems is to link community members at distant sites (where chatting over a beer is not practical), such internal systems only go part of the way.

The last part is the most problematical. Even if you can link all the molecular biologists to each other and provide them with the tools they need to analyse their genetic data, there is a good chance that they will be unable to solve your practical problem unless it can be formulated in the language of clones, genetic maps and genes. The focus on DNA as a central molecule in biology has been enormously powerful, but its limits are starting to show in the complex regulation of the immune system, metabolite-flux analysis, and the control networks of those very genes that are piling up in the EMBL datalibrary. This is because there is a huge amount of information and knowledge that lies outside molecular biology which is, nevertheless, essential to its application in many fields.

For example, until recently, the sole database of enzyme activities was available only as a printed book [Springer-Verlag is publishing the Gesellschaft für Biotechnologische Forschung (GBF) enzyme database next year]. Victor McKusick's Online database of Mendelian Inheritance in Man (OMIM) is a text description, available through genome-project computers, and not the object-orientated database that would link to other systems easily. Consequently, the links have to be made by the human expert, largely negating the point of having an electronic database at all. Taxonomy and plant genetics are largely ignored by the molecular biology community, despite their vast importance to biotechnology. The databases of chemical structures and known drug structure-activity relationships (QSARs) are maintained by different people, use different access systems and are usually consulted by completely different people from those who use the molecular biology systems. Yet, notionally, they have been constructed to solve many of the same practical problems.

How are these wider topics to be brought into the fold of bioinformatics? The computer problems involved are substantial (although some simple starts, such as having a database that links enzyme-substrate activities to QSAR databases, on one hand, and protein 3-D structure data on the other, are quite feasible now). The organizational problems, however, are more difficult. The EBI, NCBI and genome-project bioinformatics centres such as the Genome Database (GDB) are focussed on molecular genetics. That focus has served the community well for the 20 years since the 'Berg letter' (Ref. 2) at the start of the recombinant DNA decades, but it must broaden now. Whether this is done by the biotechnology and pharmaceutical industries, the software companies active in molecular biology, or institutions such as the EBI, will depend on their vision.

References

1 Anderson, C. (1993) Science 259, 1684-1687
2 Berg, P., Baltimore, D., Boyer, H. W., Cohen, S. N., Davis, R. W., Hogness, D. S., Nathans, D., Roblin, R., Watson, J. D., Weissman, S. and Zinder, N. D. (1974) Science 185, 303

William Bains
PA Consulting Group, Melbourn, Royston, Herts., UK SG8 6DP.
