
Laboratory Information Management 26 (1994) 69-77

Tutorial

Can informatics keep pace with molecular biology?

Harold R. (Skip) Garner

McDermott Center for Human Growth and Development, University of Texas, Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, TX 75235, USA

Received 6 April 1994; accepted 28 June 1994

Abstract

A host of new enabling technologies allows the molecular biologist to process many more samples and generate more data than was previously possible. These technologies include automated sequencing, biological robots, rapid polymerase chain reaction (PCR) based screening, fluorescence in situ hybridization, etc. Informatics research is underway to develop algorithms, software and hardware to attack the mounting data produced by pharmaceutical firms, animal paternity testing, the Human Genome Project (HGP) and others. The Human Genome Project is developing advanced technology, methods and computational tools for the collection, archival, analysis and visualization of a vast amount of data associated with genome maps and sequences. This program will be a litmus test to determine if technology can keep pace with molecular biology. The trend is towards linking all the basic diagnostic tools to computer networks or laboratory instrument management systems to attain high throughput and high data quality.

Contents

1. Introduction
2. Discussion
   2.1. Human Genome Project impact
   2.2. Informatics issues
   2.3. Informatics solutions
   2.4. Future trends and needs
3. The answer
Acknowledgements
References

1. Introduction

Molecular biology is changing; the at-the-bench experimentalist needs to pipette and program to remain competitive. There is no replacement for creativity and ingenuity, but the first experiment the new biologist should do is at the computer: check to see if his sequence fragment is in the data base before spending two years cloning and assembling his favorite gene, check the bibliographic data bases, search for recent publications or information snippets on the competition, and check a bulletin board for hints. The not too distant future may go something like this: "Now, send an e-mail message to your trusted post-doc to see if we happen to have the valuable XYZ gene clone in our 40 000 member library in the freezer. The post-doc submits a polymerase chain reaction (PCR) based screening experiment request to the queue of the Molecular Biology Lookup Engine (MOBILE) automation system. The computer calculates the best primer set and PCR conditions, retrieves the maximally pooled library from the freezer, assembles the reagents, fires up the thermal cycler, finds thirteen candidate positives, and makes a mental (silicon) note. Later that day the computer informs the post-doc that aliquots of the positives and mapping information are available, questions whether or not to queue any of the clones for complete sequencing, and then submits them for structure and function determination." This scenario is not completely fiction, and there is no doubt that the blue-sky parts will be available within 10-15 years. Some are in development now (see Fig. 1). Progress is being made on the enabling technologies necessary to make all this possible. These enabling technologies include automated sequencers [1], the PCR process [2,3], faster computers [4], new search algorithms [5], new molecular modeling codes [6], new visualization techniques [7] and a new generation of biological robots [8-10]. The drivers for this revolution include ubiquitous individual efforts, the cross-disciplinary infusion of high technology, national programs such as the Human Genome Project and, of course, privately funded high payoff ventures.

The object of this paper is to survey the boundaries among biology, computational biology and robots, and to illustrate some of the new automation and computational tools being applied to biology and molecular medicine in the new genome era. These new tools are designed to make it possible for us to identify, map and determine the sequence for the 50 000 to 100 000 genes that define humans, and to complete that task on a time scale that will impact our lives, by 2006, the end of the 15 year international Human Genome Project. Informatics issues related to biology sample handling done on the mega-scale will be discussed. Some examples extracted from recent Human Genome Project work will show where we are now and, finally, a wish list will be generated from enabling technologies which have yet to emerge from 'out-of-the-hat'.

Fig. 1. Funded by the National Institutes of Health, General Atomics is constructing a robotic station that will be capable of processing 10 000 samples per day. This robotic system performs pipetting, centrifugation, detection, incubation, freezer retrieval and thermal cycling in parallel to achieve the throughput rate. This system will screen YAC and cosmid libraries by PCR, assemble and thermal cycle sequencing reactions and perform minipreps. Samples are held in 864-well microwell plates (20 μl) and 96-well microwell plates (1 ml).

2. Discussion

2.1. Human Genome Project impact

What is the Genome Project [11] and how is it setting new trends in molecular biology? The objective of the Genome Project is to develop, in a systematic way, detailed information (maps and sequence) and a contiguous set of reagents (clones, etc.) that represent the genetic makeup of the human and other important species. Together, the data and clones enable biologists to elucidate the basic functions of life and seek medical solutions to disease. To accomplish this, areas of research uncommon to molecular biology are integral parts of the project - informatics, automation, advanced instrumentation development, even ethics. Now in its third year, the status of the project [12] is as follows.
- About 2% of the human genome has been put into data bases. Similar amounts of data for other relevant species exist.
- Maps of several human chromosomes are approaching completion.
- Several types of reagent libraries (yeast artificial chromosome (YAC), cosmid, plasmid, P1, bacterial artificial chromosome (BAC), etc.) have been constructed, each with its strengths and weaknesses.
- The biological, physical and computer technology necessary to acquire the remaining data within the 15 year life and three billion dollar budget of the program has been demonstrated and is ready to be deployed en masse.
- The data emerging from the project are already contributing to medicine.
- The project is moving much faster than originally planned.

The Genome Project is changing the way biology will be done in the future. Because of the sheer size of the total task and the time and budget constraints, it has been necessary to shift from the traditional focused wet laboratory approach to a more industrial production line style laboratory that can take advantage of the economies of scale. This is being done by introducing and depending heavily on robotics and computers to do the bulk of the repetitious work and data sifting. In addition to producing more data, the introduction of automation is expected to do the work faster, cheaper and with increased accuracy and reproducibility. Indeed, even within the Genome Project, there is a restructuring taking place towards larger, more efficient research centers and away from the traditional small grant structure. This further demonstrates that this is an endeavor that produces data of high value, but the actual data acquisition is not rocket science. The Genome Project has spawned a number of venture capital based research laboratories that are taking the 'genomic approach' to gene discovery [13,14]: that is, going after the genes for significant diseases or conditions whose diagnostics and therapeutics will lead to high monetary gain. The race (national and international, public and private) to complete the project is on, but will all the data make it into the public data banks?

2.2. Informatics issues

This new biology - the genome approach and strong biology experiment/computer interaction - has at least five major categories into which the informatics can be grouped: (i) utilization of the global on-line data bases; (ii) data collection on a massive scale; (iii) data analysis, publication, integration, correlation and visualization; (iv) access and analysis tools for the end users of the data - drug developers, geneticists and other biological/medical researchers; (v) theoretical modeling, especially as it can be used to influence experimental research and add to basic knowledge.

There is a tremendous amount of on-line data available to molecular biologists now [15]. The business community has fully embraced the use of on-line data to conduct their daily work. Researchers, research directors and CEOs can all benefit from using the available data. Some example on-line data and services include directed searches of large bibliographic and abstract libraries (Dialog, Medline, etc.), clipping services, and bulletin boards for just about any topic (the ability to submit a question to a large audience).

New instrumentation and robotics have greatly increased the amount, types and sophistication of raw data. In the standard biology laboratory, this data handling scaleup usually involves a commercially available laboratory information management system (LIMS), such as those from Beckman Instruments, Inc. (Fullerton, CA), that can work with some of the standard instrumentation. However, this type of system was never designed to handle the amount of data emerging from genome or large drug screening laboratories. These new laboratories are scaling up to analyze 10 000 samples a day by a host of techniques: PCR screening, sequencing, and fluorescence in situ hybridization. Although the biological methods are being adapted to simplify the process and data analysis, the data still can take many forms. Each value has to be stored, easily recovered, related and analyzed. These include sequence chromatograms (raw data), sequence, scanned or video data (electrophoresis gels, high density hybridization grids, confocal microscopy image slices) and enzyme-linked immunosorbent assays (ELISA).
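To make the "stored, easily recovered, related" requirement concrete, the following is a minimal, hypothetical sketch of how heterogeneous raw-data records might all be keyed to a common sample identifier. It is not the schema of any commercial LIMS, nor of the Genome Notebook data base described later; the table and field names are invented for illustration.

```python
# A toy illustration (hypothetical schema, not any production LIMS):
# every raw-data record, whatever its type, is keyed to a sample so that
# chromatograms, gel/grid images and ELISA values can be related and retrieved.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sample (
    sample_id   TEXT PRIMARY KEY,
    library     TEXT,          -- e.g. YAC, cosmid, plasmid
    well        TEXT           -- plate/well position
);
CREATE TABLE raw_data (
    data_id     INTEGER PRIMARY KEY,
    sample_id   TEXT REFERENCES sample(sample_id),
    data_type   TEXT,          -- 'chromatogram', 'gel_image', 'hyb_grid', 'elisa'
    location    TEXT,          -- file path or archive reference for bulky raw data
    value       REAL           -- scalar result, when one exists (e.g. ELISA reading)
);
""")

con.execute("INSERT INTO sample VALUES ('Y123', 'YAC', 'P07-A11')")
con.execute("INSERT INTO raw_data VALUES (1, 'Y123', 'chromatogram', '/archive/Y123.scf', NULL)")
con.execute("INSERT INTO raw_data VALUES (2, 'Y123', 'elisa', NULL, 0.42)")

# Retrieve everything known about one sample in a single query.
for row in con.execute("""
    SELECT s.sample_id, s.library, r.data_type, r.location, r.value
    FROM sample s JOIN raw_data r USING (sample_id)
    WHERE s.sample_id = 'Y123'"""):
    print(row)
```

The point is only that every value, whatever its form, is relatable back to a sample record; the production systems described below add scheduling, instrument linkage and consistency checking on top of such a core.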

Fig. 2. Sample and data flow for the mapping effort planned for the San Diego Genome Center. The functions are given in the boxes; the systems that actually conduct the work are given below the boxes. (Labels legible in the figure include mapped/oriented cosmids, GIST (transputers), Sun SPARCstations, and Cray Y-MP and Intel Paragon supercomputers.)

As an example of the size of the problem, human genome project laboratories are gearing up for sequencing on a grand scale and mapping at ever higher resolutions. Using today's technology, this new generation of genome laboratories will run ten automated sequencers, two hybridization gridding robots, and high throughput PCR assembly, cycling and detection systems. From this sequence, clone maps are generated by screening libraries (with 20 000 to 100 000 members) using sequence tagged sites (STSs) [16] determined from the sequence fragments. This amounts to 720 sequences totaling 0.25 megabases/day, 50 000 to 100 000 hybridizations/day, and 10 000 PCRs/day. This amount of data will be taken using many computer-driven interacting systems and analyzed each day, 250 days a year (see Figs. 2 and 3).

Once the raw data are in the data base, they need to be analyzed and presented to the researcher in an orderly and sifted fashion. Much of the massive data emerging from the new high-throughput laboratories will never be seen by humans, who will instead have programmed the algorithms for the data assembly and reduction. Only a small fraction of the data will be viewed by researchers - the data that the computers provide them with because they have a high probability of containing a new, valuable gene. Data analysis currently takes many forms: homology (DNA or protein) searches via servers which support the BLAST, BLAZE and FASTA programs [17,18]; neural network analysis for coding region prediction; cross correlation of sequence homology search results with the Genome Data Base (GDB) [19]; cross correlating sequence and mapping data to develop tomorrow's experimental plan and to check for inconsistencies and errors using custom software or high power relational data bases; automatic submission of data to public data bases using facilities such as AUTHORIN, a recent improvement to ease data submission to GenBank, the US data base for sequence information [20]; and finally data summaries presented to the researchers in a condensed and orderly form. The genome laboratory data will be processed in at least the following ways: homology searches, gene/coding region prediction, sequence assembly and map assembly.
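Annualizing the rates quoted above makes the scale clearer. The short sketch below simply multiplies out those figures over 250 operating days; the raw-trace size multiplier is an illustrative assumption (chromatogram traces occupy far more space than the finished base calls), not a number taken from the text.

```python
# Back-of-the-envelope annualization of the daily rates quoted in the text.
# The trace-size multiplier is an illustrative assumption, not a measured figure.
DAYS_PER_YEAR = 250

sequences_per_day      = 720
megabases_per_day      = 0.25
pcr_per_day            = 10_000
hybridizations_per_day = 100_000           # upper end of the quoted range

finished_mb_per_year = megabases_per_day * DAYS_PER_YEAR
raw_trace_factor     = 300                 # assumed bytes of raw trace per finished base
raw_gb_per_year      = finished_mb_per_year * 1e6 * raw_trace_factor / 1e9

print(f"sequencing runs/year:   {sequences_per_day * DAYS_PER_YEAR:,}")
print(f"finished sequence/year: {finished_mb_per_year:.1f} Mb")
print(f"raw trace data/year:    ~{raw_gb_per_year:.0f} GB (assumed {raw_trace_factor} bytes/base)")
print(f"PCR results/year:       {pcr_per_day * DAYS_PER_YEAR:,}")
print(f"hybridizations/year:    {hybridizations_per_day * DAYS_PER_YEAR:,}")
```

Under these assumptions the raw data behind a modest amount of finished sequence already reaches the tens-of-gigabytes-per-year range, which is consistent with the emphasis in the text on storage, retrieval and automatic sifting.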

Fig. 3. Sample and data flow for the sequencing effort planned for the San Diego Genome Center. The functions are given in the boxes; the systems that actually conduct the work are given below the boxes. (Labels legible in the figure include DNA input, a YAC preparation system, a grid prepper, a scanner/reader and a Labimap autogel system.)

This will amount to 1000 homology searches/day using the programs BLAST, BLAZE or FASTA, 720 coding prediction (gene sequence probability determination) runs/day using the program GRAIL [21], merging (assembly) of 720 new sequences/day with all previous data, and map assembly using 100 000 hybridization results and 10 000 PCR results per day.

There is a class of researchers emerging that will conduct real experiments at their computers by analyzing data measured by other groups. One of these research areas is molecular modeling, but because of the extensive work in that area it will be discussed separately. This general class of data users (drug designers, geneticists, algorithm developers) will access the data generated and only partially analyzed by the data collectors (genome centers and other large scale screening efforts). The tools available for this type of work are just beginning to appear, mainly because enough data have only recently been collected to make this type of research fertile.

A large class of informatics that is somewhat mature when compared to the others mentioned here is molecular modeling [22,23]. This area is advanced in that a large number of software tools are available, but new methods, algorithms and visualization techniques are being introduced regularly. This informatics area is uniquely different from the other areas for a number of technical reasons. The analysis is rooted in physical, as well as biological, principles. Modeling data are real, floating point data similar to other physics-type data that are processed efficiently using vector supercomputers (Crays or parallel processors) [24]; most other biological data are integer, Boolean or symbolic. The ever increasing computational power is allowing the molecular modelers to handle larger, more complex systems. Visualization of the molecular model data is an important part that is fairly advanced, including the recent introduction of virtual reality to aid in areas such as custom drug design. For example, movement of drug binding regions by manual model operators within the space of a protein or nucleic acid to find an optimal alignment can now be done using visualization stations at supercomputer centers [7]. Molecular modeling is also being used as a researcher's tool to enhance and direct wet laboratory experiments. For example, following molecular modeling of the heme region of flavocytochrome b2 to select mutations that lead to new protein sequences which alter the structure and function in a predictable way, the clones were actually developed and the behavior of the electron transport was measured and compared to computer predictions [25]. With input data now being amassed and refinement of structure and function modeling, this area of informatics will play a larger role in directing experiments and evaluating candidates for rational drug design.

2.3. Informatics solutions

To build solutions for these identified areas, an approach must be chosen based on what is known today and what can be accurately predicted about the future. Informatics bottlenecks must be identified and a variety of informatics tools must then be constructed. Solving an informatics need can span a spectrum of approaches, from off-the-shelf to custom hardware and software, with varying degrees of risk. Two examples of different approaches to similar problems follow.

Genome centers use a variety of data bases. Often many different data bases are used within a center [26], such as Flybase and HGCdb, which are variants of ACeDB, originally developed by R. Durbin and J. Thierry-Mieg (information presented at the US Department of Energy Human Genome Program Contractor-Grantee Workshop III, Santa Fe, NM, February 7-10, 1993). It is important that these data bases are user friendly, have adequate size and speed, can communicate converted data to and from foreign data bases, and are easy to establish and maintain. Balancing these considerations has led some to use off-the-shelf data bases such as Oracle or 4th Dimension and to program only the logic of the data base using the development tools supplied with the shell. Some use only spreadsheets, and some develop their own data base shells from scratch. We (the Genome Science and Technology Center at the University of Texas, Southwestern Medical Center) decided to use 4th Dimension because it was easy to get up and running, the entire chromosome mapping data base can be carried on an inexpensive Macintosh PowerBook, data are easily exported to and imported from off-site data bases, the logic and capabilities are easily upgraded, and laboratory technicians actually use it. Our data base is called Genome Notebook.

Fig. 4. The computation and external linkage module diagram for the Genome Informatics System - Transputers (GIST). The transputers are used as a high speed local parallel processing search/computation engine in which the data base (GenBank) is kept resident in memory to optimize the analysis speed. The external connections - user interface, direct linkage to the automated sequencers and Internet on-line data bases - are made via the Macintosh host.

Others have chosen Sybase, but we avoided it because it runs on a more expensive platform (Sun), runs under UNIX, which requires its own maintenance, does not appear to offer a performance advantage, and is itself considerably more expensive.

The second example shows how solutions are time dependent: things that work today will not necessarily meet future needs. Sequence data emerging from our two Applied Biosystems, Inc. 373A sequencers at a rate of about 50-70 per day are processed in a number of ways, including a homology search against GenBank. Our needs are currently met by e-mailing a sequence file to the BLAST server at the National Center for Biotechnology Information [17]; the results of the search return quickly. However, in a separate effort we are developing a high speed search engine based on a parallel processing chip called a transputer that attaches directly to our Apple computer network [27] (see Fig. 4). This system, the Genome Informatics System - Transputers (GIST), is currently operational; search time is independent of data base size because the data base is memory resident (additional processors and memory are simple to add), and it will operate autonomously with the laboratory hardware and laboratory data base, using an expert system to control the analysis process. This is being done to meet our future needs for the following reasons.
- We hope to soon increase our sequencing capability to 700+ runs/day. There are not enough server resources to meet the combined needs of all similar groups competing for search time at a reasonable cost.
- A local processing environment is preferred for issues of reliability, ease of automatic operation and absolute control over the algorithm.
- GenBank is growing at a rate of approximately 25% per year (product literature for the Inherit system, Applied Biosystems, Inc., Foster City, CA). Current servers are input/output (I/O) bound, so as the data base grows the time required for a search will increase dramatically, further taxing this approach.
- New search and analysis techniques that are computationally intensive can be written for the transputer based system because the transputer is a general purpose microprocessor.
We feel only parallel processing will be cost effective and computationally capable enough to meet our future needs.
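The argument for GIST's memory-resident, parallel design can be illustrated with a small sketch: partition the data base once across workers, keep each partition in memory, and score every query against all partitions simultaneously, so that adding workers shrinks the slice each must scan. This is a toy illustration in Python with invented data and a naive shared-k-mer score; it is not the transputer implementation, and not BLAST or FASTA.

```python
# Minimal sketch of memory-resident, data-parallel sequence search:
# the data base is partitioned once across worker processes, each worker keeps
# its partition in memory, and every query is scored against all partitions in
# parallel.  Doubling the number of workers halves the partition each must scan,
# which is the scaling argument made for GIST in the text.
from multiprocessing import Pool

def seed_hits(args):
    """Count shared 8-mers between the query and each entry of one partition
    (a crude stand-in for a real homology score such as BLAST's)."""
    query, partition = args
    k = 8
    qmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    scores = []
    for name, seq in partition:
        hits = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in qmers)
        scores.append((name, hits))
    return scores

def parallel_search(query, partitions, workers):
    with Pool(workers) as pool:
        results = pool.map(seed_hits, [(query, p) for p in partitions])
    return sorted((hit for part in results for hit in part),
                  key=lambda x: x[1], reverse=True)

if __name__ == "__main__":
    # Hypothetical toy data base, split into as many partitions as workers.
    database = [("entry%d" % i, "ACGT" * 250) for i in range(8)]
    workers = 4
    partitions = [database[i::workers] for i in range(workers)]
    print(parallel_search("ACGTACGTACGTTTTT", partitions, workers)[:3])
```

The same reasoning explains the concern about I/O-bound servers: at roughly 25% growth per year the data base doubles in about three years, so a disk-bound search slows on that schedule, whereas a memory-resident, data-parallel engine can absorb the growth by adding processors and memory.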

Since each laboratory has its own goals and approach, the identifiable bottlenecks that limit productivity will vary. The bottlenecks that currently plague our efforts, which will get worse as we try to increase our throughput and for which we are developing solutions, are as follows.
- Getting data into the computer from conventional sources (laboratory notebooks) is difficult because researchers and technicians view data entry time as nonproductive; but the only way to assess our progress is by querying the data base, so it is the only measure of our productivity. We are developing incentives for humans and more integrated automation that will do most of the data entry automatically.
- Data integration with quick and easy retrieval from various sources is hard and requires computer-savvy biologists. We will be developing linking software between our data base and ultimate storage, as well as taking advantage of the much larger data storage systems that are becoming cost effective.
- Related to the above points is the problem of automated analysis. It will be impossible to direct and inspect each individual analysis when sequencing and mapping throughput increase dramatically. First pass automatic analysis is being built into the transputer search system; it will operate under the direction of Level 5, an expert system built by Information Builders, Inc. (New York, NY), with rules developed by biologists that prioritize intermediate results, coordinate advanced analysis, deliver data to the data base and inform scientists of significant finds (a schematic sketch of such rules is given at the end of this subsection).
- One invisible bottleneck is data checking. Although it is possible to release data at a fast rate, if the quality is in question the data may never be used. We currently let the logic of the data base handle most of the data verification. Data consistency checks are built into the logic that is programmed into the data base, so it is much more than a storage and retrieval tool.

The researchers responsible for informatics support for a large effort must depend on a variety of solutions to provide for data taking, storage and analysis. The resources for producing these solutions come from many sources: commercially available data bases and analysis software; commercially available expert system development shells; on-line global data bases accessed on a cost basis; a large variety of analysis codes available as servers (for example the NCBI BLAST server) or at supercomputer centers (for example the National Science Foundation funded San Diego Supercomputer Center); bulletin boards, servers and data that can be accessed or downloaded using the Internet; implementing and applying algorithms developed in other areas of science to biology; and finally writing custom software using standard languages such as C, along with the indispensable code generators and function libraries that are commercially available or free from bulletin boards.
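As a purely schematic example of the kind of first-pass triage rules mentioned in the automated-analysis bottleneck above (not the Level 5 knowledge base itself, whose rules are written by the project's biologists), a few prioritization rules might look like the following; the thresholds, categories and action names are invented for illustration.

```python
# Schematic first-pass triage of analysis results, in the spirit of the
# expert-system rules described above.  All thresholds and categories are
# invented for illustration; a production knowledge base would encode the
# biologists' own criteria.
def triage(result):
    """Classify one analysis result dict into an action for the pipeline."""
    rules = [
        # (condition, action) pairs, checked in priority order.
        (lambda r: r["type"] == "homology" and r["score"] >= 200 and r["novel"],
         "notify_scientist"),              # strong, previously unseen match
        (lambda r: r["type"] == "homology" and r["score"] >= 80,
         "queue_advanced_analysis"),       # worth a closer look automatically
        (lambda r: r["type"] == "coding_prediction" and r["probability"] >= 0.9,
         "queue_advanced_analysis"),
        (lambda r: r["type"] == "vector_contamination",
         "flag_for_recheck"),              # data-quality problem, not a find
    ]
    for condition, action in rules:
        if condition(result):
            return action
    return "archive_only"                  # the vast majority of results

if __name__ == "__main__":
    results = [
        {"type": "homology", "score": 250, "novel": True},
        {"type": "homology", "score": 95,  "novel": False},
        {"type": "coding_prediction", "probability": 0.97},
        {"type": "homology", "score": 12,  "novel": False},
    ]
    for r in results:
        print(triage(r), r)
```

Rules of this kind only decide what a human should see or what the pipeline should do next; the heavy numerical work stays in the search and assembly programs.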

2.4. Future trends and needs

Biology has been going through changes at a modest pace, driven by the infusion of enabling technologies, but with the formalization of the genome project, one of whose goals is advanced instrumentation, we can expect the evolution of laboratory procedures to be even more rapid. It is difficult to predict new technologies and determine which of those will have the most impact on general biology, but a few probable areas include: massively parallel microchip analysis, expanded use and new types of atomic resolution microscopy, higher-throughput turnkey sequencing systems, total conversion to non-radioactive detection technologies, and, from the computer arena, a shift to faster, parallel computers emerging from the 'teraflop challenge'. Each of these is strongly linked to computers and informatics in general, again with the objective of doing more, faster.

Microchip based analyses are being pursued by several groups [28,29], some to perform sequencing and some as a method for performing many hybridization tests simultaneously. This technology involves synthesizing nucleic acid probes using a microchip style mask approach and integrating the results optically. The new atomic resolution microscopes (scanning tunneling microscope and atomic force microscope) were used to image biological molecules several years ago, with some controversy [30]. However, new approaches to secure nucleic acids to the scanning plane to expose the bases may finally make them work for sequencing. Others are attempting to develop diagnostics for clinical genetic testing on a grand scale. Several efforts are being pursued to develop advanced sequencers besides the scanning tunneling microscope/atomic force microscope approach: laser blow-off mass spectroscopy, sequencing by hybridization, parallel capillary systems, and single molecule detection (information presented at the US Department of Energy Human Genome Program Contractor-Grantee Workshop III, Santa Fe, NM, February 7-10, 1993) [31,32]. We are working on an engineering upgrade (low risk) to current gel-based laser sequencers that will increase the throughput by a factor of 10 and drop the total cost of sequencing. This will enable completion of the genome project using only improved state-of-the-art technology. These systems would be coupled to all the necessary upstream hardware and downstream software to truly achieve high throughput. Although radioactive detection is the most sensitive system used in normal practice, the long exposure times and safety problems provide a strong force for change. With the advent of phosphor imagers and ultraviolet (UV) scanning systems (flat bed and video), it is now possible to detect hybridization and PCR results up to 10 000 at a time in a format convenient for direct computer analysis and results archiving (Molecular Dynamics, Inc., Sunnyvale, CA).

A final example of an evolving technology that will push biology is the international 'teraflop computer' project. This project has several competing groups that are attempting to make computers that are at least an order of magnitude faster than today's fastest computers. Most approaches involve highly parallel computer systems and the necessary seamless programming environment. The stated applications of this technology include multi-day weather prediction, complex biological structure calculations and support for the genome project (Parsytec, Inc., Chicago, IL).

There are many other factors that will initiate change in how biology is done, generally in the direction of more informatics and computer aided experimentation. First, there is the trend towards more cross-disciplinary work (biology, physics, engineering and computer science) and the transfer of hardware and analysis techniques from these other areas of science. Second, the progress in virtual reality and its demonstrated utility in computational biology also make it a promising candidate for large impact. Artificial intelligence, specifically expert systems, showed great promise about 10 years ago for all areas, but soon suffered from inappropriate problems, inflated expectations, poor implementations and poor development tools [33]. However, the area expert systems seem to excel in is data sifting, reduction and coordinated presentation, all of which are becoming more important as the amount of data being taken and the amount of data resident in data banks increase. Current expert system development environments allow knowledge base writers (programmers) to easily include 'expert' knowledge from biologists in the analysis stream, which can also be changed or adjusted as experiments progress. Another trend is toward computer directed experiments and integrated systems. Since good instrumentation now comes with serial communication ports and device drivers can easily be written, instruments are being integrated into a new generation of high-throughput systems dedicated to attaining high sample processing rates for a few biological methods (freezer to data base). These systems will replace the general purpose pipetting stations and other devices that require significant human intervention at each step of a long process. Finally, an additional benefit of integrating many hardware systems is that other areas of science are already benefiting from data fusion. For biology, this means that data emerging from many sources (laboratory equipment, publications, public data bases, private data bases, etc.) are united by software and hardware systems that can generate a more complete picture or presentation for the experimenter or for other programs, such as modeling or advanced analysis systems.

3. The answer

As in any area of science, business or entertainment that depends on number crunching, demands (for speed, memory size and capabilities) will always expand to fill whatever computer is generally available. Likewise, newer and faster computers will emerge in response to the anticipated demands of the users. Molecular biology is in a metamorphosis in which the scientist is becoming increasingly dependent on computers and technology to do an ever increasing amount of work. In the final analysis, some molecular biology informatics areas will always be ahead of, and some behind, the capabilities of the hardware, software and algorithms.

Acknowledgements

This work was funded by the National Institutes of Health National Center for Human Genome Research, Southwestern Medical Center and General Atomics. The facts and opinions in this paper were assembled from numerous sources; those contributions are greatly appreciated. Special thanks to Glen Evans (The Salk Institute for Biological Research) for Figs. 2 and 3. This work reflects the effort of a number of scientists and technicians working in the genome center.

References

[1] T. Hunkapiller, R.J. Kaiser, B.K. Koop and L. Hood, Large-scale and automated DNA sequence determination, Science, 254 (1991) 59-67.
[2] K.B. Mullis and F.A. Faloona, Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction, Methods in Enzymology, 155 (1987) 335-350.
[3] H.R. Garner and B. Armstrong, High-throughput PCR, BioTechniques, 14 (1993) 112-115; H.R. Garner, Automating the PCR process, in K.B. Mullis, F. Ferre and R.A. Gibbs (Editors), The Polymerase Chain Reaction, Birkhauser, Boston, MA, 1994.


[4] Teraflops galore, a special report in IEEE Spectrum, 29(9) (1992) 26-33.
[5] M. Gribskov and J. Devereux, Sequence Analysis Primer, Stockton Press, New York and Macmillan, Basingstoke, 1991, pp. 90-157.
[6] Computational Science at the San Diego Supercomputer Center: a Bibliography, General Atomics, San Diego, CA, 1991, pp. 21-40.
[7] Computational Science Advances at the San Diego Supercomputer Center, General Atomics, San Diego, CA, 1991, pp. 21-22.
[8] H.R. Garner, B. Armstrong and D. Kramarsky, Dr. Prepper, an automated DNA extraction and purification system, Scientific Computing and Automation, 9(4) (1993) 61-68.
[9] B. Armstrong and H.R. Garner, Analysis of protocol variations on DNA yield, Genomic Analysis Techniques and Applications, 9(5,6) (1992) 134-139.
[10] H.R. Garner, B. Armstrong and D. Kramarsky, High-throughput DNA prep system, Genomic Analysis Techniques and Applications, 9(5,6) (1992) 127-133.
[11] R. Dulbecco, A turning point in cancer research: Sequencing the human genome, Science, 231 (1986) 1055-1056.
[12] Understanding our Genetic Inheritance, The U.S. Human Genome Project, The First Five Years, 1991-1995, US Department of Health and Human Services and the US Department of Energy, National Technical Information Service, US Department of Commerce, Springfield, VA, 1990.
[13] Venter leaves NIH to take wholesale gene patenting private, Genetic Technology News, August (1992) 1.
[14] DNA sequencing poised to go private, New Scientist, February 8 (1992) 17.
[15] Genome data bases, Science, 245 (1991) 202-207.
[16] M. Olson, L. Hood, C. Cantor and D. Botstein, A common language for physical mapping of the human genome, Science, 245 (1989) 1434-1435.
[17] S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman, Basic local alignment search tool, Journal of Molecular Biology, 215 (1990) 403-410.
[18] W.R. Pearson and D.J. Lipman, Improved tools for biological sequence comparison, Proceedings of the National Academy of Sciences of the USA, 85 (1988) 2444-2448.
[19] A.L. Pearson, The Genome Data Base (GDB) - a human gene mapping repository, Nucleic Acids Research, 19(Suppl.) (1991) 2237-2239.


[20] M. Cinkosky, J. Fickett, P. Gilna and C. Burks, Electronic data publishing and GenBank, Science, 252 (1991) 1273-1277.
[21] E.C. Uberbacher and R.J. Mural, Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach, Proceedings of the National Academy of Sciences of the USA, 88 (1991) 11261-11265.
[22] M. Witten, Computational medicine, IEEE Potentials, 10(3) (1991) 9-13.
[23] J.C. Wooley, Computational biology for biotechnology: Part II, TIBTECH, 7 (1989) 126-132.
[24] D.W. Smith, J. Jorgensen, J.P. Greenberg, J. Keller, J. Rogers, H.R. Garner and L.T. Eyck, Supercomputers, parallel processing, and genome projects, in D. Smith (Editor), Biocomputing: Informatics and Genome Projects, Academic Press, San Diego, CA, 1993, Chap. 3.
[25] C. Kay and E.W. Lippay, Mutation of the heme-binding crevice of flavocytochrome b2 from Saccharomyces cerevisiae: altered heme potential and absence of redox cooperativity between heme and FMN centers, Biochemistry, 31 (1992) 11376-11382.
[26] S. Clark, G. Evans and H.R. Garner, Informatics and automation used in the physical mapping of the genome, in D. Smith (Editor), Biocomputing: Informatics and Genome Projects, Academic Press, San Diego, CA, 1993.
[27] T. Williams, Transputer performance boosted to 10X that of its predecessor, Computer Design, 30(8) (1991) 36-38.
[28] Annual Report, Affymax, Inc., Palo Alto, CA, 1992.
[29] M. Eggers, K. Beattie, I. Shumaker, M. Hogan, M. Hollis, A. Murphy, D. Rathman and D. Ehrlich, Genosensors: microfabricated devices for automated high throughput DNA sequence analysis, in R. Meyers, D. Porteous and R. Roberts (Editors), Genome Sequencing and Mapping, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 1992, p. 111 (abstract).
[30] D.D. Dunlap and C. Bustamante, Images of single-stranded nucleic acids using scanning tunneling microscopy, Nature, 342 (1989) 204-206.
[31] R. Drmanac, in R. Meyers, D. Porteous and R. Roberts (Editors), Genome Sequencing and Mapping, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 1992, p. 318 (abstract).
[32] J.D. Harding and R.A. Keller, Single-molecule detection as an approach to rapid DNA sequencing, Trends in Biotechnology, 10 (1992) 55-57.
[33] D.A. Waterman, A Guide to Expert Systems, Addison-Wesley, Reading, MA, 1985, pp. 1-171.