Microbial genome data resources
Victor M Markowitz
Studies of the genomes of individual microbial organisms as well as aggregate genomes (metagenomes) of microbial communities are expected to lead to advances in various areas, such as healthcare, environmental cleanup, and alternative energy production. A variety of specialized data resources manage the results of different microbial genome data processing and interpretation stages, and represent different degrees of microbial genome characterization. Scientists studying microbial genomes and metagenomes often need one or several of these resources. Given their diversity, these resources cannot be used effectively without determining the scope and type of individual resources as well as the relationship between their data.
Addresses
Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Mail Stop 50A-1148, Berkeley CA 94720, USA
Corresponding author: Markowitz, Victor M ([email protected])
Current Opinion in Biotechnology 2007, 18:267–272
This review comes from a themed issue on Environmental biotechnology
Edited by Eliora Z Ron and Philip Hugenholtz
Available online 30th April 2007
0958-1669/$ – see front matter © 2007 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.copbio.2007.04.005
www.sciencedirect.com
Introduction
According to the Genomes OnLine Database, over 400 microbial genomes have been published to date, with over 1000 other projects ongoing and more in the process of being launched [1]. Genomes of isolate microbial organisms and aggregate genomes (also known as metagenomes) of microbial communities are sequenced by organizations worldwide, such as The Institute for Genomic Research, the US Department of Energy’s Joint Genome Institute, the Wellcome Trust Sanger Institute, the Broad Institute of MIT and Harvard, and Washington University in St. Louis. For individual organisms, these organizations employ data processing strategies that are roughly similar. Individual ‘read’ sequences of a microbial genome are assembled into longer ‘contigs’ (contiguous sequences), thus producing data files with ‘draft’ genome sequences. Draft genomes are subsequently ‘finished’ via an iterative procedure that closes the gaps between contigs [2]. Both draft and finished genomes undergo a preliminary annotation employing automated pipelines [3,4] for predicting potential genes (also called coding sequences or CDSs) and for determining the functional role of predicted genes using a variety of functional resources [5–8].
Metagenome sequence data processing is in the early stages of development. Because the aggregate genome of a microbial community is derived from a pool of cells, with some cells being genetically related and potentially corresponding to different strains of the same species and other cells being genetically distinct, metagenome sequence data has higher complexity, inherent incompleteness, and lower quality than isolate genome sequence data. As a consequence, traditional assembly, gene prediction, and annotation methods do not perform as well on metagenome data as they do on isolate microbial genome sequence data [9]. Metagenome data processing methods can be expected to evolve over time, with existing methods gradually replaced with new or improved methods.
Draft and finished microbial genome data together with an increasing number of metagenome datasets are incorporated in various microbial genome data resources. These resources share the goal of improving preliminary genome annotations, which are often inaccurate and sparse, with numerous genes left without associated functional roles. Individual resources often use different techniques and strategies for reviewing and expanding the results of preliminary annotations. For example, the functional characterization of genes may be carried out in an individual genome-specific context or in an integrated context that supports comparative multigenome analysis [10].
Given the diversity of microbial genome resources and the annotation strategies they employ, it is important to determine the scope and type of each individual resource as well as the relationship between different resources, in terms of potential data correlations and transformations. Such information relies on data management details regarding the collection, organization, and manipulation of microbial genome data, which are crucial for assessing the scientific value of these data.
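As a rough illustration of the read-to-contig step mentioned in this introduction, the following Python sketch merges reads greedily by their longest suffix-prefix overlap. It is a toy under strong assumptions (error-free reads, a single strand, no repeats) and is not the method used by any of the sequencing centers named here; the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            # No remaining overlaps: what is left is a set of separate contigs.
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical reads: three overlap into one contig, one stays separate.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGTTTT"]
contigs = greedy_assemble(reads)
print(sorted(contigs))
```

The two sequences that remain share no overlap, which loosely mirrors a draft genome whose contigs are still separated by gaps that finishing is meant to close.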
Microbial genome data lifecycle
Microbial genome data are generated and processed by organizations worldwide and are eventually included in diverse data resources, usually after undergoing resource-specific transformations. The lifecycle of microbial genome data as they pass through various resources is illustrated in Figure 1 and briefly reviewed below (see also Box 1).
Figure 1
Microbial genome data lifecycle. Typically, microbial genome and metagenome sequence data are first generated by various data processing pipelines, deposited in public archival sequence data resources, and then pass through various resources that gradually integrate the microbial genomes and metagenomes while improving the coherence and completeness of their functional characterization.
Microbial genome datasets are initially submitted to/collected by archival public sequence data repositories [11–13]. Genome datasets in these resources include information on gene coordinates, locus identifiers, gene names, and additional functional annotations, such as associated protein clusters [5], pathways [6], protein families and domains [7–8].
Box 1 Microbial genome data resources.
Database                      URL
ASAP                          https://asap.ahabs.wisc.edu/asap/home.php
CMR                           http://cmr.tigr.org/
Entrez Gene                   http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
ERGO                          http://www.ergo-light.com/ERGO/
Genome Reviews                http://www.ebi.ac.uk/GenomeReviews/
GOLD                          http://www.genomesonline.org/
IMG                           http://img.jgi.doe.gov/
IMG/M                         http://img.jgi.doe.gov/m
Integr8                       http://www.ebi.ac.uk/integr8/
MBGD                          http://mbgd.genome.ad.jp/
Megx                          http://www.megx.net/
MicrobesOnline                http://www.microbesonline.org/
PUMA2                         http://compbio.mcs.anl.gov/puma2/
RefSeq                        http://www.ncbi.nlm.nih.gov/RefSeq/
SEED                          http://theseed.uchicago.edu/
UCSC Archaeal Genome Browser  http://archaea.ucsc.edu/
Archival sequence data resources keep different versions of datasets for the same genome. Furthermore, datasets in these resources have different degrees of precision and resolution owing to diverse processing methods, curation strategies, and functional characterization techniques employed by individual data providers.
A variety of resources post-process microbial genome data from primary resources with the dual goals of providing the most current view on microbial genome sequences and of gradually increasing the coherence and completeness of their associated functional annotations. Curated public resources such as RefSeq at the National Center for Biotechnology Information (NCBI) [14] and Genome Reviews at the European Bioinformatics Institute (EBI) [15] pursue these goals by synthesizing, reviewing and curating genome sequence data from primary data resources. Annotations associated with genome sequences are also extended via manual curation, computation, or incorporation from other resources.
Microbial genomes can be explored using a variety of analysis tools provided by resources that often further enrich the data in archival or curated public resources. For example, NCBI’s Entrez Gene [16] supports gene-centric exploration of individual microbial genomes in RefSeq integrated with additional functional annotations. EBI’s Integr8 [15] provides similar capabilities for exploring microbial genomes in Genome Reviews integrated with additional functional annotations, as well as genome-specific summaries (statistics). The UCSC Genome Browser [17] provides support for genome-centric exploration of archaeal genomes enriched with data from computational (e.g. operon) analysis and experimental (e.g. microarray) studies.
Multigenome comparative analysis capabilities are provided by a variety of resources such as CMR [18], ERGO [19], Microbes Online [20], PUMA2 [21], IMG [22], and MBGD [23]. Most resources [18–22] further revise the annotation of microbial genomes. For example, the functional characterization of microbial genomes is enhanced in CMR with Genome Properties [24], which places gene functional annotations in the context of metabolic pathways, cellular activities or cellular structure.
Some resources also provide support for community annotation [20–22]. Such annotations are usually cumulative, that is, they are added to (rather than replace) existing annotations. A cumulative model of annotation enables a diverse community of scientists to contribute to the functional characterization of genomes, but can lead to conflicting or inconsistent annotations. Systems such as PeerGAD [25], PseudoCAP [26], and ASAP [27] address this problem by providing mechanisms with different degrees of complexity for peer review of annotations. Ontologies and controlled vocabularies, such as the Gene Ontology (GO) [28], are used to support uniform descriptions of functional roles. Peer review annotation mechanisms have been employed primarily for specific organisms, such as Pseudomonas syringae pv. tomato [25] and Pseudomonas aeruginosa [26].
Unlike organism-specific annotation systems, SEED [29] supports a subsystem approach for annotating genomes of multiple organisms simultaneously. A subsystem consists of a set of functional roles that together implement a specific biological process.
Subsystem-specific groups of experts maintain sets of gene annotations for each subsystem, using an expanding SEED-specific controlled vocabulary of functional roles. (The use of a local vocabulary instead of GO may stem from limitations of GO such as those discussed in [30]. Furthermore, because GO was originally developed for the annotation of eukaryotic genomes, the functional terms in GO are at this point limited for annotating microbial genomes.) Subsystems applied to (projected on) new genomes result in genome-specific subsystem variants. Annotation inconsistencies or conflicts are resolved using collaborative, rather than peer review, mechanisms.
Curated (reviewed, corrected, expanded) annotations for microbial genomes are usually submitted to the archival public sequence data repositories as revisions of previous versions of genome datasets, and subsequently are incorporated into other resources, such as those mentioned above (see Figure 1).
Similar to genome data generated from individual microbial organisms, microbial community metagenome data are submitted to primary archival public sequence data repositories such as GenBank, where such datasets are provided in separate databases for BLAST searches only (see http://www.ncbi.nlm.nih.gov/BLAST/Genome/EnvirSamplesBlast.html). In addition, for highly abundant organisms in a community for which the quality of the assembly and annotation is close to that of draft isolate genomes, individual datasets are recorded in a similar way to isolate genome data.
In spite of current limitations with their generation and processing [9], metagenome data are amenable to valuable analysis, as discussed in [31] and as illustrated by recent studies such as [32,33] (see also Update). Metagenome data are analyzed in the context of reference isolate genome data and sample metadata that characterize the biological material collected for generating metagenome sequences. For example, emerging metagenome data resources such as IMG/M [34] and Megx.net [35] provide comparative analysis capabilities across both metagenome and isolate microbial genome data.
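The cumulative annotation model described earlier in this section, in which community annotations are added to rather than replace existing ones, can be sketched in Python as follows. This is only an illustration of the idea and of why conflicts arise; it does not reproduce the design of PeerGAD, PseudoCAP, ASAP, or any other system cited here. The class names, gene identifiers, sources, and functional roles are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_id: str          # hypothetical locus identifier
    functional_role: str  # free-text or controlled-vocabulary term
    source: str           # who or what produced the annotation


class CumulativeAnnotationStore:
    """Keeps every annotation that is added; nothing is overwritten."""

    def __init__(self):
        self._by_gene = defaultdict(list)

    def add(self, annotation):
        self._by_gene[annotation.gene_id].append(annotation)

    def annotations(self, gene_id):
        return list(self._by_gene.get(gene_id, []))

    def conflicts(self):
        """Return genes whose annotations name more than one functional role."""
        result = {}
        for gene_id, annotations in self._by_gene.items():
            roles = {a.functional_role for a in annotations}
            if len(roles) > 1:
                result[gene_id] = sorted(roles)
        return result


store = CumulativeAnnotationStore()
store.add(Annotation("gene_0001", "hypothetical protein", "automated pipeline"))
store.add(Annotation("gene_0001", "DNA polymerase III subunit", "curator A"))
store.add(Annotation("gene_0002", "ribosomal protein L2", "automated pipeline"))
store.add(Annotation("gene_0002", "ribosomal protein L2", "curator B"))

# Only gene_0001 carries two different roles and would need review.
print(store.conflicts())
```

A peer review or collaborative mechanism would then act on the genes reported by `conflicts()`; a controlled vocabulary reduces spurious conflicts caused by different wordings of the same role.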
Microbial genome data management
Given the diversity of microbial genome data resources, it is important to determine the scope and type of each individual resource as well as the relationships between different resources. A data resource is characterized by the way it organizes, collects and manipulates microbial genome data. These data management characteristics are briefly discussed below.
Microbial genome data resources employ established technologies, such as structured files, relational database management systems (e.g. Oracle, MySQL), data warehousing tools, and data integration techniques in handling the organization, collection and manipulation of microbial genome data.
A microbial genome data resource involves organizing data according to a data model (schema in database terminology) that consists of data type specifications. For example, a typical microbial genome data model involves a central data type that represents genes as ordered sequences of nucleotides that encode specific products (e.g. a protein or RNA molecule). Additional data types are used for characterizing genes, including their location on chromosomes within (species-specific) genomes, and their associated functional roles in cellular pathways. A data model that is anchored on a central data type is characteristic of data warehouses [36–38]. Similarities between the data models of biological data resources in general, and microbial genome data resources in particular, have motivated the development of generalized biological data warehouse systems [36] and data warehouse toolkits [37,38].
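The gene-centered data model described above can be sketched as a small relational schema. The example below uses Python's built-in sqlite3 module purely for convenience; the table and column names, the example organism, and the example values are hypothetical and do not correspond to the schema of any resource discussed in this review. The nucleotide sequence itself is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genome (
    genome_id INTEGER PRIMARY KEY,
    organism  TEXT NOT NULL
);
-- Central data type: one row per gene, with its location on a chromosome.
CREATE TABLE gene (
    gene_id    INTEGER PRIMARY KEY,
    genome_id  INTEGER NOT NULL REFERENCES genome(genome_id),
    locus_tag  TEXT NOT NULL,
    chromosome TEXT NOT NULL,
    start_pos  INTEGER NOT NULL,
    end_pos    INTEGER NOT NULL,
    strand     TEXT NOT NULL CHECK (strand IN ('+', '-')),
    product    TEXT
);
-- Additional data type: functional roles of genes in pathways.
CREATE TABLE gene_function (
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    pathway TEXT NOT NULL,
    role    TEXT NOT NULL
);
""")

conn.execute("INSERT INTO genome VALUES (1, 'Example organism')")
conn.execute(
    "INSERT INTO gene VALUES (1, 1, 'EX_0001', 'chromosome', 100, 1300, '+', 'example kinase')"
)
conn.execute(
    "INSERT INTO gene VALUES (2, 1, 'EX_0002', 'chromosome', 1400, 2100, '-', NULL)"
)
conn.execute("INSERT INTO gene_function VALUES (1, 'example pathway', 'example role')")

# List every gene with its pathway, keeping genes that have no functional role yet.
rows = conn.execute("""
    SELECT g.locus_tag, g.product, f.pathway
    FROM gene AS g
    LEFT JOIN gene_function AS f ON f.gene_id = g.gene_id
    ORDER BY g.start_pos
""").fetchall()
print(rows)
```

The left join keeps the second gene even though it has no associated function, which reflects the sparse preliminary annotations that the resources reviewed here aim to improve.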
Similar to other biological and biomedical domains, microbial genome analysis often requires integrating data from multiple resources. Data integration is needed to provide support for the functional characterization of microbial genes, which involves data that reside in diverse functional resources, and for comparing genomes that are recorded in different resources. Biological data integration has been discussed extensively in both articles [39–42] and books [43], and shares the problems of other scientific and traditional data domains [44]. It is important to distinguish between ‘shallow’ and ‘deep’ integration of biological data [39]: the former amounts to juxtaposition of data items from different sources, whereas the latter involves coalescence of data items that might represent the same underlying biological objects, such as genes.
A data warehouse approach to data integration presents several important advantages [40], including local data availability, query performance, and ability to review and curate data. The latter is especially important given the inherent complexity of resolving gene model and annotation differences or inconsistencies in the context of continuously changing annotation methods and evolving biological knowledge. Addressing these problems is essential not only for data curation, but also for data analysis, which relies on clearly defined semantics of individual data elements, their relationships, and the operations that can be applied on them [45]. Operations associated with a data type range from basic operations, such as comparing data elements of the same type (e.g. two sequences), to complex operations, such as searches for certain patterns across sets of data elements (e.g. functional profiles across genomes). Defining the semantics of biological data and associated operations is a daunting task: data semantics cannot be fully characterized without data provenance (also known as data lineage) information, such as annotation characteristics and data transformation parameters [46,47]. While techniques for supporting data provenance are being developed [48], the derivation/transformation/computation history of biological data items can be tracked by recording metadata on (i.e. documenting) data sources and individual data processing steps. Metadata are essential for data access, interchange and integration, and enable both software tools and people to understand and handle data [49].
Conclusions
Microbial genome data are recorded in a variety of resources that are listed as part of the Molecular Biology Database Collection (http://www.oxfordjournals.org/nar/database/a) maintained by Nucleic Acids Research. New resources are published annually in a special database issue of Nucleic Acids Research [50].
Each microbial genome data resource aims at gradually improving the coherence and completeness of the functional characterization of microbial genomes. Such improvements are iterative, with individual resources often recording evolving (i.e. new versions of) microbial genome datasets. For example, datasets that are revised in terms of gene models or functional annotations are usually resubmitted to archival resources such as GenBank, and are then included into curated resources such as RefSeq and other microbial genome data resources.
Microbial genome data resources can contain different collections of genomes and annotations with different degrees of resolution regarding the same genomes. These differences are the result of resource-specific maintenance, curation, and functional characterization strategies. Scientists using these resources need to assess the role of each resource and the relationship between different resources. Unfortunately, such an assessment is often hindered by the lack of comprehensive documentation, with semantics of data structures, individual data elements and operations often represented implicitly and informally via shared assumptions, rather than specified explicitly and formally [51]. This problem can be addressed using the recently proposed dataspace approach [52]. A dataspace helps organize related data resources, with a catalog containing information about various resources and their relationships. Applying the dataspace approach to the microbial genome data domain could improve the organization of existing resources, whereby different resources are seen as related, rather than isolated, components in an evolving framework. Improving the organization and documentation of microbial genome data resources would benefit the maintenance of existing resources as well as the development of new resources, in particular emerging metagenome data resources. Such improvements would also help facilitate leveraging past experience as suggested by Halevy [45], via mechanisms supporting the correlation and reuse of various system components, such as data management and analytical tools. Leveraging past experience, as exemplified by the development of systems such as ERGO [19] and PUMA2 [21], which emerged from common ancestors [53,54], would help accelerate the pace of progress in microbial genome data management and analysis. 
Improved documentation of individual data resources is also a crucial prerequisite for developing more advanced mechanisms for exploring correlated data resources, such as path-based systems aimed at assisting scientists in exploring data across multiple biological data resources [55].
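A catalog of resources and their relationships, as envisioned by the dataspace approach, can be pictured as a small directed graph. The Python sketch below is a hypothetical illustration, not an implementation of the systems described in [52] or [55]: the `CATALOG` structure and the `derivation_path` function are invented for this example, and the edges only paraphrase relationships mentioned in this review (curated resources draw on archival repositories, and Entrez Gene and Integr8 build on RefSeq and Genome Reviews, respectively).

```python
from collections import deque

# Hypothetical catalog: each resource is mapped to the resources it draws data from.
CATALOG = {
    "archival sequence repositories": [],
    "RefSeq": ["archival sequence repositories"],
    "Genome Reviews": ["archival sequence repositories"],
    "Entrez Gene": ["RefSeq"],
    "Integr8": ["Genome Reviews"],
}


def derivation_path(catalog, origin, target):
    """Breadth-first search for a chain of resources leading from origin to target."""
    # Invert the catalog so that each source points to the resources that use it.
    downstream = {}
    for name, sources in catalog.items():
        downstream.setdefault(name, [])
        for source in sources:
            downstream.setdefault(source, []).append(name)

    queue = deque([[origin]])
    seen = {origin}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in downstream.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the target is not derived from the origin


path = derivation_path(CATALOG, "archival sequence repositories", "Integr8")
print(path)
```

Even a catalog this simple makes explicit that two resources are unrelated in the derivation sense (for example, `derivation_path(CATALOG, "RefSeq", "Integr8")` returns `None`), which is the kind of information that is currently left to shared assumptions.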
The variety of microbial genome data resources reflects the diversity of needs for and approaches to the complex task of comprehensively characterizing the functional and metabolic capabilities of microbial genomes and metagenomes. Although the organization and documentation of these resources need to improve, such improvement ultimately depends on long-term, steady funding. Mechanisms for providing such support, together with better oversight of resources, are generally lacking at this time [56,57].
Update
The National Academy of Sciences has published a report on the emerging field of metagenomics prepared by the National Research Council’s Committee on Metagenomics [58]. The report includes a chapter on metagenome data management and addresses the problem of funding for the collection, storage and analysis of the massive amounts of metagenome data expected to be generated by a growing number of projects.
Acknowledgements
The work presented in this paper was supported by the Director, Office of Science of the US Department of Energy under Contract No. DE-AC02-05CH11231.
References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as: of special interest; of outstanding interest.
1. Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC: The genomes on line database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res 2006, 34:D332-D334.
2. Bloom T, Sharpe T: Managing data from high-throughput genomic processing: a case study. In Proceedings of the 30th International Conference on Very Large Databases, Toronto, Canada, August 31–September 3 2004. Editors: Nascimento MA, Ozsu MT, Kossman D, Miller RJ, Blakeley JA, Schiefer KB; Morgan Kaufmann. pp. 1198-1201.
3. Meyer F, Goesmann A, McHardy AC, Bartels D, Bekel T, Clausen J, Kalinowski J, Linke B, Rupp O, Giegerich R, Pühler A: GenDB: an open source genome annotation system for prokaryote genomes. Nucleic Acids Res 2003, 31:2187-2195.
4. Hauser L, Larimer F, Land M, Shah M, Uberbacher E: Analysis and annotation of microbial genome sequences. Genet Eng (NY) 2004, 26:225-238.
5. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278:631-637.
6. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32:D277-D280.
7. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL et al.: The Pfam protein families database. Nucleic Acids Res 2004, 32:D138-D141.
8. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R et al.: New developments in the InterPro database. Nucleic Acids Res 2007, 35:D224-D228.
9. Chen K, Pachter L: Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol 2005, 1:106-112.
    The authors review the key challenges in the processing and analysis of microbial community sequence data, including assembly, gene prediction, and classification (binning) methods for such data.
10. Bowers PM, Pellegrini M, Thompson MJ, Fierro J, Yeates TO, Eisenberg D: Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol 2004, 5:R35.
11. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic Acids Res 2007, 35:D21-D25.
12. Kulikova T, Akhtar R, Aldebert P, Althorpe N, Andersson M, Baldwin A, Bates K, Bhattacharyya, Bower L, Browne P et al.: EMBL nucleotide sequence database in 2006. Nucleic Acids Res 2007, 35:D16-D20.
13. Sugawara H, Abe T, Gojobori T, Tateno Y: DDBJ working on evaluation and classification of bacterial genes in INSDC. Nucleic Acids Res 2007, 35:D13-D15.
14. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts, and proteins. Nucleic Acids Res 2007, 35:D61-D65.
15. Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I et al.: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33:D297-D302.
16. Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 2007, 35:D26-D31.
17. Schneider KL, Pollard KS, Baertsch R, Pohl A, Lowe TM: The UCSC archaeal genome browser. Nucleic Acids Res 2006, 34:D407-D410.
18. Peterson JD, Umayam LA, Dickinson T, Hickey EK, White O: The comprehensive microbial resource. Nucleic Acids Res 2001, 29:123-125.
19. Overbeek R, Larsen N, Walunas T, D’Souza M, Pusch G, Selkov E, Liolios K, Joukov V, Kaznadzey D, Anderson I et al.: The ERGO genome analysis and discovery system. Nucleic Acids Res 2003, 31:164-171.
20. Alm EJ, Huang KH, Price MN, Koche RP, Keller K, Dubchak IL, Arkin AP: The Microbes Online web site for comparative genomics. Genome Res 2005, 15:1015-1022.
    One of the existing integrated microbial genome resources described in more detail than other similar resources.
21. Maltsev N, Glass E, Sulakhe D, Rodriguez A, Syed MH, Bompada T, Zhang Y, D’Souza MD: PUMA2: grid-based high-throughput analysis of genomes and metabolic pathways. Nucleic Acids Res 2006, 34:D369-D372.
22. Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I et al.: The Integrated Microbial Genomes (IMG) system. Nucleic Acids Res 2006, 34:D344-D348.
23. Uchiyama I: MBGD: a platform for microbial comparative genomics based on the automated construction of orthologous groups. Nucleic Acids Res 2007, 35:D343-D346.
24. Haft DH, Selengut JD, Brinkac LM, Zafar N, White O: Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinformatics 2005, 21:293-306.
    Representative of efforts to enhance functional characterization of microbial genomes in terms of various metrics, such as gene content, presence of metabolic pathways, and various phenotypic properties.
25. D’Ascenzo MD, Collmer A, Martin GB: PeerGAD: a peer-review-based and community-centric web application for viewing and annotating prokaryotic genome sequences. Nucleic Acids Res 2004, 32:3124-3135.
26. Winsor GL, Lo R, Ho Sui SJ, Ung KSE, Huang S, Cheng D, Ho Ching WK, Hancock REW, Brinkman FSL: Pseudomonas aeruginosa genome database and PseudoCAP: facilitating community-based, continually updated, genome annotation. Nucleic Acids Res 2005, 33:D338-D343.
27. Glasner JD, Rusch M, Liss P, Plunkett G, Cabot EL, Darling A, Bradley DA, Infield-Harm P, Gilson MC, Perna NT: ASAP: a resource for annotating, curating, comparing, and disseminating genomic data. Nucleic Acids Res 2006, 34:D41-D45.
28. Gene Ontology Consortium: The Gene Ontology database and informatics resource. Nucleic Acids Res 2004, 32:258-261.
29. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R et al.: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33:5691-5702.
    A decentralized model of annotation, whereby biological process specific subsystems are annotated jointly across collections of available genomes, is discussed and illustrated using several subsystems.
30. Myhre S, Tveit H, Mollestad T, Laegreid A: Additional gene ontology structure for improved biological reasoning. Bioinformatics 2006, 22:2020-2027.
31. Deutschbauer AM, Chivian D, Arkin AP: Genomics for environmental microbiology. Curr Opin Biotechnol 2006, 17:229-235.
    Overview of the emerging field of metagenomics.
32. Tringe SG, von Mering C, Kobayashi A, Salamov A, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC et al.: Comparative metagenomics of microbial communities. Science 2005, 308:554-557.
    Introduces the concept of environmental gene tag (EGT) analysis to cope with the fragmentation and incompleteness of microbial community sequences. Variants of EGT analysis have been used extensively in metagenome studies.
33. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard NU, Martinez A, Sullivan MB, Edwards R, Brito BR et al.: Community genomics among stratified microbial assemblages in the ocean’s interior. Science 2006, 311:1355-1359.
34. Markowitz VM, Ivanova NN, Palaniappan K, Szeto E, Korzeniewski F, Lykidis A, Anderson I, Mavrommatis K, Kunin V, Garcia Martin H et al.: An experimental metagenome data management and analysis system. Bioinformatics 2006, 22:e359-e367.
35. Lombardot T, Kottmann R, Pfeffer H, Richter M, Teeling H, Quast C, Glockner FO: Megx.net: database resources for marine ecological genomics. Nucleic Acids Res 2006, 34:D390-D393.
36. Shah SP, Huang Y, Xu T, Yuen MMS, Ling J, Ouellette BFF: Atlas: a data warehouse for integrative bioinformatics. BMC Bioinformatics 2005, 6:10.1186/1471-2105-6-34. http://www.biomedcentral.com/1471-2105/6/34/.
37. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14:160-169.
38. Lee TJ, Pouliot Y, Wagner V, Gupta P, Stringer-Calvert DWJ, Tenenbaum JD, Karp PD: BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics 2006, 7:10.1186/1471-2105-7-170. http://www.biomedcentral.com/1471-2105/7/170.
39. Searls DB: Data integration: challenges for drug discovery. Nat Rev Drug Discov 2005, 4:45-58.
    A comprehensive review of biological data integration problems from a domain application (rather than information technology) perspective. Covers diverse aspects of integration for data generated directly by technology platforms through data derived through various transformations.
40. Wong L: Technologies for integrating biological data. Brief Bioinform 2002, 3:389-404.
41. Stein LD: Integrating biological databases. Nat Rev Genet 2003, 4:337-345.
42. Hernandez T, Kambhampati S: Integration of biological sources: current systems and challenges ahead. SIGMOD Record 2004, 33:51-60.
43. Lacroix Z, Critchlow T (Eds): Bioinformatics: Managing Scientific Data. Morgan Kaufmann Publishers; 2003.
44. Halevy AY, Ashish N, Bitton D, Carey MJ, Draper D, Pollock J, Rosenthal A, Sikka V: Enterprise information integration: successes, challenges and controversies. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Maryland, USA, June 14–16, 2005. ACM 2005. Editor: Ozcan F. pp. 778-787.
    A collection of short articles that illustrate the complexity of data integration in general, and in particular the overwhelming importance of application-specific semantics of data in integration.
45. Halevy AY: Why your data won’t mix: semantic heterogeneity. ACM Queue 2005, 3:50-58.
    This paper discusses the crucial role of semantic heterogeneity in data integration and shows how leveraging past experience could facilitate addressing this problem, thus making a strong case for data and tool sharing.
46. Cohen S, Cohen-Boulakia S, Davidson S: Towards a model of provenance and user views in scientific workflows. In Data Integration in the Life Sciences, Third International Workshop, DILS 2006, Hinxton, UK, July 20–2, 2006, Proceedings. Lecture Notes in Computer Science 4075. Editors: Leser U, Naumann F, Eckman BA; Springer. pp. 264-279.
47. Buneman P, Chapman A, Cheney J: Provenance Management in Curated Databases. In Proceedings of ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27–9, 2006. ACM 2006. Editors: Chaudhuri S, Hristidis V, Polyzotis N. pp. 539-550.
    The authors discuss data provenance for curated biological databases and propose a practical approach for tracking provenance.
48. Simmhan YL, Plale B, Gannon D: A survey of data provenance in e-science. SIGMOD Record 2005, 34:31-36.
49. Gray J, Liu DT, Nieto-Santisteban M, Szalay A, DeWitt DJ, Heber G: Scientific data management in the coming decade. SIGMOD Record 2005, 34:34-41.
    Discusses the challenges of large-scale scientific data management and the technical areas that are critical to address them.
50. Galperin MY: The molecular biology database collection: 2007 update. Nucleic Acids Res 2007, 35:D3-D4.
51. Jagadish HV, Olken F: Database management for life science research. SIGMOD Record 2004, 33:15-20.
52. Franklin M, Halevy AY, Maier D: From databases to dataspaces: a new abstraction for information management. SIGMOD Record 2005, 34:27-33.
53. Overbeek R, Larsen N, Smith W, Maltsev N, Selkov E: Representation of function: the next step. Gene 1997, 191:GC1-GC9.
54. Overbeek R, Larsen N, Pusch G, D’Souza MD, Selkov E Jr, Kyrpides NC, Fonstein M, Maltsev N, Selkov E: WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000, 28:123-125.
55. Cohen-Boulakia S, Davidson S, Froidevaux C, Lacroix Z, Vidal ME: Path-based systems to guide scientists in the maze of biological data sources. J Bioinform Comput Biol 2006, 4:1069-1095.
56. No authors listed: Sustainable databases. Nat Cell Biol 2006, 8:1311.
    A compelling discussion of the problems facing biological data resources and their uncertain future.
57. No authors listed: The database revolution. Nature 2007, 445:229-230.
58. Committee on Metagenomics: The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. National Academies Press; 2007. http://www.nap.edu.