NEUROIMAGE 4, S53–S54 (1996)
ARTICLE NO. 0052
Information Management for Complex Systems: A Case Study in Genomics

OTTO RITTER

SUMMARY
Genome mapping and sequencing projects push the frontiers of molecular biology and genetics toward information science. I will try to demonstrate that the information management characteristics of genome research may apply to other large-scale systematic studies of complex systems such as the brain.

INTRODUCTION
Research in molecular biology and genetics is increasingly data intensive and information dependent. Timely access to relevant information in its context, and to software tools for processing, combining, comparing, analyzing, and rendering the information and knowledge, plays a critical role in the ongoing genome projects. This could equally apply to the neurosciences.

Genomic information is scattered across hundreds of heterogeneous and autonomous information systems. These systems have their own (often idiosyncratic) interfaces and control languages, and represent information using conflicting data models and formats. This heterogeneity and autonomy have roots in sociological, scientific, and technological diversity, and are in many instances appropriate and well justified. However beneficial, they pose a real problem for end users and limit the potential synergistic use of managed information.

Users, including application programs, prefer logically integrated and consistent access to all the information in their domain and to all associated operations. In addition, many users need to manage retrieved public information together with their local (unpublished) data. By managing we mean efficient local storage, data access through a query language and browsing tools, and data manipulation operations, so that local copies of public data objects can be edited and/or linked with local objects.

THE INFORMATION INTERCONNECTION PROBLEM
Put simply, the interconnection problem is a problem of efficient and consistent access to domain-specific databases and their operations, and to the tools for analysis, visualization, and data transformations. Most of the information applications in molecular biology are now available over the Internet. Characteristic of molecular biology and other complex system domains are
• large volumes of complex data,
• large numbers of complex operations,
• heterogeneity in representation,
• autonomy in implementation,
• dynamicity, and
• incompleteness and fuzziness of information.
The ease or difficulty with which one information system can be interconnected with other systems depends critically on at least four factors:
• the type of query/update operations the system supports,
• the quality and parseability of exported metadata and data,
• the standards and conventions used for naming metadata and data elements, and
• the documented semantics of types, operations, and constraints.

EXISTING SOLUTIONS IN BIOINFORMATICS
In recent years several systems have been developed to overcome the heterogeneity and incompatibility. Biocomputing centers worldwide try to provide more or less uniform access interfaces to some replicated or remote instances of databases, analytical tools, and combined packages. Indexing and retrieval systems provide uniform query and browsing interfaces over different data formats. Loosely coupled networks of databases with hypertext (WWW) interdatabase cross-references provide some navigational connectivity, but without a global schema, query optimization, or detection and resolution of conflicts on the data and metadata levels. The only solutions which provide declarative query access against one global schema of the underlying data space are so far based on the warehouse approach.
1053-8119/96 $18.00 Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.
Data warehouses materialize the information from heterogeneous databases in a common repository. Two examples in genomics are the European IGD project and the Genome Topographer system developed at the Cold Spring Harbor Laboratory in the United States.
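
To make the warehouse idea concrete, the following minimal Python sketch materializes records from two hypothetical sources in one common table and then queries that table declaratively. The source layouts, the table name gene_location, and all helper functions are illustrative assumptions and do not describe IGD or Genome Topographer; an in-memory SQLite database merely stands in for the common repository.

```python
import sqlite3

# Hypothetical exports from two autonomous sources. Both describe the same
# kind of fact (gene symbol and chromosome) with different names and formats.
SOURCE_A = [
    {"gene_symbol": "CFTR", "chrom": "chr7"},
    {"gene_symbol": "HBB", "chrom": "chr11"},
]
SOURCE_B = "TP53|17\nCFTR|7\n"


def from_source_a(records):
    """Map source A dictionaries onto the common (symbol, chromosome, source) form."""
    for rec in records:
        chrom = rec["chrom"]
        if chrom.startswith("chr"):
            chrom = chrom[3:]  # normalize "chr7" to "7"
        yield rec["gene_symbol"], chrom, "A"


def from_source_b(text):
    """Map source B flat-file lines ("symbol|chromosome") onto the common form."""
    for line in text.splitlines():
        if line.strip():
            symbol, chrom = line.split("|")
            yield symbol, chrom, "B"


def build_warehouse():
    """Materialize both sources in one repository with a single global schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE gene_location (symbol TEXT, chromosome TEXT, source TEXT)"
    )
    rows = list(from_source_a(SOURCE_A)) + list(from_source_b(SOURCE_B))
    db.executemany("INSERT INTO gene_location VALUES (?, ?, ?)", rows)
    db.commit()
    return db


def genes_on_chromosome(db, chromosome):
    """Declarative query against the global schema; no source is named."""
    cur = db.execute(
        "SELECT DISTINCT symbol FROM gene_location "
        "WHERE chromosome = ? ORDER BY symbol",
        (chromosome,),
    )
    return [row[0] for row in cur]


def sources_for(db, symbol):
    """List which sources contributed a record for the given gene symbol."""
    cur = db.execute(
        "SELECT DISTINCT source FROM gene_location "
        "WHERE symbol = ? ORDER BY source",
        (symbol,),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    warehouse = build_warehouse()
    print(genes_on_chromosome(warehouse, "7"))
    print(sources_for(warehouse, "CFTR"))
```

In this sketch the format differences are resolved once, at load time, so the query itself refers only to the global schema. A likely cost of such a design is that the repository has to be reloaded or updated whenever a source changes.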

THE IGD PROJECT
IGD, the Integrated Genomic Database, is an international collaborative project aiming to develop an open information management system for (primarily human) genome-related data and analytical tools. IGD integrates information from public data collections into a single logical database accessible over the Internet, and provides a graphical front end for managing and analyzing retrieved subsets of public data and/or sets of local experimental data. The information space of IGD covers biomolecular structures and sequences, genomic maps, phenotypes, biological material, individual experiments, bibliography, community data, etc., together with methods for analysis, visualization, and communication. In January 1996, IGD contained about 4 million objects (5.5 Gbytes of data) from 17 major genomic databases and data libraries.

IGD is a distributed system. We recognize three types, or layers, of subsystems: (1) individual data sources, (2) the integrated data and method servers, and (3) front-end local environments. Users interact with the IGD system through a set of locally installed tools. The most important parts of the IGD front end are the local database manager and the interfaces to communication and analysis. Users can query the IGD server and download the resulting data into their local database. They can also store private data and analysis results in the local database. As an analysis tool, IGD provides a uniform interface to 150+ preexisting applications for structure and sequence analysis, genetic analysis, and genome mapping.

The system architecture of IGD and its data management tools could be reused in another application domain with similar information management characteristics.

CONCLUSION
Information management applications in genome research and brain research share several common characteristics, such as the complexity and dynamicity of metadata (information types), proper data, and operations. Strengthening the dialog and information exchange between the genome informatics and neuroinformatics communities may increasingly benefit both of them.

ACKNOWLEDGMENT
Work on IGD is supported by the European Commission under Grant GENE-CT93-0003/DG 12 SSMA.