GENE-COMBIS
ELSEVIER
Gene 163 (1995) GCiii-GCiv
Editorial
Computing for Molecular Biology
There is novel science in the intersection between computing and molecular biology. This new science is concerned with the development of techniques for the collection and manipulation of biological data, and the use of such techniques to make biological discoveries (or at least, predictions). The scope of this new science encompasses all computational methods and theory applicable to molecular biology, including software tools, packages and systems; algorithms; mathematical analysis of algorithms, and analysis that can be expected to lead to new algorithms; software associated with instrumentation; and computer-based techniques for solving biological problems.
This science is often called bio-informatics or computational biology. We find neither name wholly satisfactory since the former tends to emphasize the techniques-development aspect of the field, while the latter emphasizes the biological-discovery aspect. We hold that both aspects of this new science form a unitary whole. The phrase "Computing for Molecular Biology" in our title is a diplomatic (or perhaps, cowardly) attempt to sidestep the choice of a name.
This new science is being driven by the rapid accumulation of large molecular biology datasets. The examples are well known. The public sequence databases now hold more than 300 megabases and, at the present rate of growth, will reach 10 gigabases by 2005. Detailed genetic and physical maps containing thousands or even tens of thousands of mapped features now exist for several higher organisms, including mouse and human, and mapping efforts are underway for others, including mosquito, rat, and many species of agricultural importance. The number of characterized genes, though difficult to quantify, is surely in the tens of thousands and is also growing dramatically (having gone from 4,000 to 8,000 in Drosophila alone in about four years). The number of solved protein structures, now 3,500, is doubling every 2 years.
Databases with more detailed biological content are now being assembled on subjects such as gene expression and metabolic pathways.
This science will gain further impetus now that the era of complete genome sequence is upon us. There is now one complete bacterial genome in the public database, and several more are nearing completion. The yeast genome (Saccharomyces cerevisiae) will be finished by the end of 1996, and the worm and fly will surely be done in the next 3-5 years. We are confident that the human genome will be sequenced by 2005, because the only remaining impediments are financial; the mouse will likely follow in its footsteps, because of its immense value as a 'model'.
0378-1119/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved. SSDI 0378-1119(95)00509-9
The future of biology is dependent upon the emergence of this new science and its effective integration with the rest of biology. The Human Genome Project has proven that 'big biology' works. Big biology will not disappear when the human genome is sequenced, but rather will aim its powerful machinery at the next layer of rate-limiting biological problems. The systematic collection of large datasets is now inexorably a part of biology. These datasets cannot be collected without extensive use of computing. Even more telling, this accumulation of data would be useless without computerized methods for its storage, dissemination, and analysis.
This new science is advancing from youth to maturity. The founding of specialist journals, the convoking of scientific meetings, and the formation of scientific societies, are strong indicators of the impending maturity of a field of science. It is natural at such a juncture to reflect on the scope, style, and standards of this new field. We assert without qualm that this new field is a science, because the problems at hand require the innovation, creativity and commitment that one associates with scientific undertakings. As in all scholarly fields, research must begin with a clearly stated problem, result in a clearly stated solution, and provide data or other means to demonstrate the validity of the results. Novelty is a central principle: new work must be placed in the context of previous work, and the new must compete with the old on the basis of merit. Disagreements must be resolved through reasoned debate on the basis of the evidence.
Complicating the discussion of style and standards is the interdisciplinary nature of the field.
It is inevitable and proper that new techniques be judged on the basis of their utility to the practice of molecular biology. We hasten to admonish, though, that this 'utility doctrine' not be applied too narrowly. It is unreasonable to insist that all work have immediate impact on biology; foundation-building work on mathematics, algorithms, and prototype software is essential for the steady progress of the field. Work in this field is often derivative: a common style is to look for relevant techniques in other disciplines (generally computer science
or mathematics), and to adapt these for the biological problem at hand. Work in this field sometimes involves the creation of complex software systems, or components intended for use in such systems. In these cases, a premium is placed on engineering considerations: we tend to disparage work that is too complex, preferring solutions in which judicious compromises are made to simplify the design. As the field matures, the tensions between 'pure' science and application will surface repeatedly until a comfortable accommodation is reached. We are pleased to be launching GENE-COMBIS at this exciting point in the life of this new science. We hope to foster the development of the field by providing a forum for the publication and discussion of scientific results.
Fundamental to the concept of GENE-COMBIS is 'good science'. We have tried to convey a sense of what the term 'good science' means to us in the context of this new field, and to also convey our sense of how the style and standards of the field may evolve through the next phase of its life. Ultimately, GENE-COMBIS will be as good as its authors and readers make it. We invite you to join us in this exciting endeavor.
Nat Goodman and Michael Ashburner
Cambridge (MA) and Cambridge (UK), July 1995