Developing integrative bioinformatics systems

BIOSILICO Vol. 1, No. 5 November 2003

REVIEWS RESEARCH FOCUS

Matthias Fellenberg

Manifold examples from the current biological literature describe the application of integrative analysis methods. These examples demonstrate the need for bioinformatics methods that integrate data from diverse sources and data that have heterogeneous formats. The requirements of such methods can be demonstrated through examples of applications used to analyze high-throughput data (particularly gene expression data) in combination with biological knowledge (such as functional classification and pathways). On the basis of these examples, we suggest general requirements for integrative analyses and a scenario for the development of integrative bioinformatics systems.

Matthias Fellenberg, Biomax Informatics AG, Lochhamer Str. 11, 82152 Martinsried, Germany; e-mail: matthias.fellenberg@biomax.de

Currently, the most important technologies in molecular biology are high-throughput (HT) technologies that generate large amounts of biological data to describe gene expression, protein expression, protein–protein interactions and metabolite abundance. Effective, efficient and intelligent methods are needed to manage, store, retrieve and analyze these heterogeneous data [1]. Many analysis methods and statistical analysis tools have been developed and are routinely applied [2–4], including normalization [5], clustering [6–9] and classification [10–12] tools. However, analysis methods that focus only on numerical experimental data are not sufficient [13–16]. The integration of other sources of information is needed for more comprehensive analyses, especially for judging the biological significance of analysis results [17]. The integration of these sources must go beyond hyperlinks [18]; the datasets must be interlinked analytically.

The terms ‘integration’ and ‘integrative analysis’ are used in various ways. Integration can involve the amalgamation of databases from various sources and with different formats into a single retrieval engine, such as SRS (Lion Bioscience, http://www.lionbioscience.com/) or BioRS™ Integration and Retrieval


System (Biomax Informatics, http://www.biomax.com). Alternatively, integration could involve integrating the laboratory workflow, i.e. managing and controlling laboratory experiments, samples and collected data in an electronic system, which often includes a laboratory information management system (LIMS) [19]. This review focuses on the analytical combination of high-throughput data with biological knowledge; in this sense, we use the term ‘integrative analysis’ to mean the analysis of high-throughput data in the context of available biological knowledge. The current biological, biotechnological and bioinformatics literature provides many examples of the integrative analysis of specific high-throughput datasets. Although nearly all of the interpretation of a dataset involves integrative approaches to some extent, these analyses are often performed manually (i.e. in a non-automatic, non-standardized fashion). Manual analysis requires extensive database searches and data preprocessing to overcome technical, syntactic and semantic challenges [18]. Results are obtained by a time-consuming cycle of trial and error that involves hypothesis generation and verification. Recent bioinformatics publications indicate that integrative analysis approaches will dominate the future of bioinformatics [20–22]. Some methods and tools for automated integrative analysis are described in the literature [8,23,24], but these are still rare. An extensible bioinformatics system that enables flexible and automated integrative analysis and that is capable of integrating all relevant tools is needed [1]. In the following sections, examples of integrative analysis methods from the current literature are described. Using these examples, a systematic scheme for integrative analysis is developed. From this scheme, the requirements

www.drugdiscoverytoday.com

of a bioinformatics system that could provide flexible integrative analysis are extracted. The last section, a kind of ‘wish list’, describes the envisioned bioinformatics system. Because measurement of gene expression rates is still the dominant high-throughput technique, nearly all of the example applications mentioned here apply to that area of research. These analysis methods can be generalized, however, to cover other areas such as proteomics and metabolomics. We describe how an integrative bioinformatics system can support the analysis of diverse kinds of high-throughput data.

[Figure 1 schematic: biological high-throughput data (gene expression data, protein expression data, protein–protein interactions) are linked to content such as functional annotations, metabolic pathways and other systematic annotations, enabling analysis of data within a specific context.]

Methods for integrative data analyses

Integrative analysis methods integrate data from diverse sources. They combine two or more datasets and restrict or interpret individual datasets with respect to the other datasets to increase the understanding of the data. In this way, the data and the features derived from the data by internal analyses are structured and analyzed further (Fig. 1). High-throughput datasets may be combined with other high-throughput datasets; for example, gene expression data may be integrated with data relating to protein–protein interactions. The most interesting qualitative results are obtained, however, by combining high-throughput datasets with datasets that systematically classify genes and proteins according to their function or structure (e.g. by ontologies). This technique allows more information to be extrapolated from the data than is the case when any single kind of data is analyzed in isolation. Integrative analysis focuses on the extraction of qualitative, not quantitative, results from these large-scale datasets.
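As an illustration of this restriction of one dataset by another, the following Python sketch (not from the article; all gene names and interaction pairs are invented) scores how well expression clusters agree with a protein–protein interaction set:

```python
# Toy illustration of interpreting one dataset against another: gene
# expression clusters (first dataset) are checked against protein-protein
# interactions (second dataset). All names are invented.
clusters = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 1}
interactions = {("g1", "g2"), ("g3", "g4"), ("g2", "g5")}

def within_cluster_fraction(clusters, interactions):
    """Fraction of interacting pairs whose partners fall in the same
    expression cluster -- a simple measure of agreement between datasets."""
    pairs = {tuple(sorted(p)) for p in interactions}
    within = sum(1 for a, b in pairs if clusters.get(a) == clusters.get(b))
    return within / len(pairs)

print(within_cluster_fraction(clusters, interactions))  # 2 of 3 pairs
```

In a real analysis the fraction would be compared against a randomized background to judge significance, rather than read off directly.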

Integrating high-throughput data with biological knowledge

The need for high-throughput data analysis in a specific biological context can be seen in the questions that researchers pose: Do co-expression and protein–protein interactions relate to each other [8,25,26]? How do gene expression and protein abundance compare [27]? Does protein function relate to gene expression changes [14]? Which genes are differentially expressed in specific cancer subtypes [10,12,28]? Can the function of uncharacterized genes be determined from their co-expression relationships [16,29,30]? What metabolic or regulatory pathways are affected by a change in gene expression [30,31]? How are interacting genes and proteins functionally related [20,32–34]? These questions cannot be answered using high-throughput data alone: additional information about the measured subjects (e.g. genes, proteins and metabolites) must be taken into account. Only the integration of all of the available


Figure 1. Integrative analysis links high-throughput data (left) with classification data, annotations and biological entities such as pathways (right), defining a specific context of analysis.

data enables the systematic investigation of biological systems aimed at prediction and simulation [35]. The term ‘Systems Biology’ was coined to describe investigations using this approach [22]. These investigations require complex, integrative analysis methods that can be flexibly combined, as well as integrative analysis cycles that include hypothesis generation and thus allow exploratory approaches.

Analysis of gene regulation and gene structure

The biologically relevant signals that result from gene expression analysis are caused by the regulation of the corresponding genes. Gene expression data can be mined for features that elucidate gene-regulatory processes. The null hypothesis of this approach is that co-expressed genes, which can be identified by clustering the genes according to expression data, are co-regulated by the same transcription factors. This approach allows one to search for cis-regulatory elements in the genetic sequences of the co-expressed genes and to assess the significance of the gene clusters and the identified sequence motifs [36]. An alternative approach circumvents gene clustering and directly fits cis-regulatory elements to expression data [37]. A recent paper describes the analysis of time-series data (from an analysis of the yeast cell cycle) to find direct regulatory relationships between pairs of genes [38].
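The cluster-then-search strategy can be caricatured with a plain k-mer count standing in for a real motif model such as that of [36]; all sequences below are invented toy data:

```python
def motif_counts(promoters, k=4):
    """Count k-mer occurrences across a set of promoter sequences."""
    counts = {}
    for seq in promoters:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts

# Invented promoter sequences of one co-expressed cluster and a background set.
cluster_promoters = ["ACGTACGT", "TTACGTAA"]
background_promoters = ["GGGGCCCC", "AAAATTTT"]

cluster_counts = motif_counts(cluster_promoters)
background_counts = motif_counts(background_promoters)

# A k-mer is a crude candidate cis-regulatory element if it recurs in the
# cluster promoters but is absent from the background promoters.
candidates = [m for m, n in cluster_counts.items()
              if n >= 2 and background_counts.get(m, 0) == 0]
print(sorted(candidates))
```

A production method would use position weight matrices and a statistical enrichment score rather than raw counts; the sketch only shows the shape of the computation.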

Functional analysis of genes

Pioneering large-scale gene expression experiments have demonstrated that co-expressed genes often share a common functionality [39–41]. The relation of gene expression to gene function has also been examined systematically [14] and is exploited in many experiments that involve gene expression analysis. In such experiments, the functional


properties of genes that are shown to have a significant expression profile are examined [12,15,28]. Functional descriptions are used to find conceptual similarities among genes, which can be used to predict various classifications (e.g. cancer classes) [42]. Other approaches predict the cellular function of previously uncharacterized genes using co-expression relationships [8,16,32,43,44]. A proof of principle has been achieved with the development of a compendium of expression patterns for deletion mutants in yeast [45].

The functional analyses described above depend on several prerequisites. The gene identifiers (IDs) used for the subjects of HT measurements have to be matched to IDs used in external databases (such as EMBL [http://www.ebi.ac.uk/embl/] or GenBank [http://www.ncbi.nlm.nih.gov/GenBank/]) so that these resources can be accessed. Additionally, computational analysis methods require that functional information is systematically structured. This can be achieved by using ontologies (such as Gene Ontology [46,47] and the FunCat™ Functional Catalog [48] developed by Biomax Informatics [http://www.biomax.com]) or by extracting information from unstructured sources using computational linguistics methods [15,42,49].
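A minimal 'guilt by association' sketch of such function prediction, assuming a simple nearest-neighbour rule by Pearson correlation (gene names, profiles and categories are all invented):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented expression profiles and a partial functional annotation.
profiles = {
    "gene_known_a": [1.0, 2.0, 3.0, 4.0],
    "gene_known_b": [4.0, 3.0, 2.0, 1.0],
    "gene_unknown": [1.1, 2.1, 2.9, 4.2],
}
annotation = {"gene_known_a": "ribosome biogenesis",
              "gene_known_b": "lipid metabolism"}

def predict_function(query, profiles, annotation):
    """Assign the annotation of the most strongly co-expressed annotated gene."""
    best = max((g for g in profiles if g in annotation),
               key=lambda g: pearson(profiles[query], profiles[g]))
    return annotation[best]

print(predict_function("gene_unknown", profiles, annotation))
```

Published methods use k-nearest neighbours, support vector machines or cluster-wide votes instead of a single best match, but the principle of transferring annotation along co-expression is the same.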

Pathway analysis

DeRisi et al. [39] were among the first to map gene expression data to metabolic pathways. Since this early publication, several approaches that are more systematic have been described. Most of these approaches use established pathway schemes from Internet sources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) ([50]; http://www.genome.ad.jp/kegg/). Expression data are mapped to the pathways and visualized in pathway diagrams [29–31,51]. Other approaches generate pathways de novo and score them on the basis of gene expression data [23,24]. A mathematical framework, called Biochemical Systems Theory, has been developed to generate models of gene regulation from expression data [52].

Pathway analysis focuses mainly on metabolic pathways. Methods for the metabolic interpretation of high-throughput data have been developed but are, for the most part, still not integrated into analysis frameworks. Generalization of the developed analysis methods to regulatory and signal transduction pathways should be possible once sufficient systematic data are available for analysis. Mapping to standard metabolic pathways and de novo generation of metabolic pathways both need a sound basis in data. This basis often includes a set of enzymatic reactions, in which each reaction is described by an Enzyme Commission (EC) number, a possible in vivo direction, as

[Figure 2 schematic: HT data and ontological data, drawn from proprietary and public sources, are preprocessed (yielding modified HT data, preprocessed data and non-numerical results) and fed into the integrative analysis, which produces results on the HT data and results on the classification data for interpretation.]

Figure 2. Workflow of a typical integrative analysis. Both the high-throughput (HT) data and the classification data are processed before they are used as input for integrative analysis. Results on either side can be used to refine either the input data or the selection of input data (using feedback loops).

well as main and side metabolites and their corresponding coefficients. A catalog of concisely defined standard pathways should also be available. For efficient analysis, the pathway catalog has to provide different levels of detail to allow flexible metabolic analysis. The need for such flexibility is best reflected in a hierarchical structure. For some approaches, the individual pathways should be small enough so that the member reactions can be co-regulated and strong signals can be expected from the mapping of gene expression data. Other approaches require a comprehensive view of larger parts of the metabolic network.
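A toy version of pathway scoring against expression data, assuming pathways are given as plain gene lists and using the mean absolute log-ratio as the score (pathway and gene names are invented; a real system would map genes to reactions via EC numbers):

```python
# Invented pathway catalog (pathway -> member genes) and per-gene
# log-ratio expression changes from a hypothetical experiment.
pathways = {
    "glycolysis": ["g1", "g2", "g3"],
    "tca_cycle": ["g4", "g5"],
}
log_ratios = {"g1": 2.0, "g2": 1.5, "g3": 1.0, "g4": -0.1, "g5": 0.2}

def pathway_scores(pathways, log_ratios):
    """Score each pathway by the mean absolute log-ratio of its member
    genes, so that coherently regulated pathways stand out."""
    return {name: sum(abs(log_ratios[g]) for g in genes) / len(genes)
            for name, genes in pathways.items()}

scores = pathway_scores(pathways, log_ratios)
print(max(scores, key=scores.get))  # glycolysis
```

The hierarchical pathway catalog described in the text would let the same loop run at several levels of detail, from small co-regulated pathways up to large network views.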

General scheme for integrative analyses

To design a bioinformatics framework that supports the described analyses, common principles that can be shared by several applications must be formulated. The main analysis of the described approaches is generalized in the following scheme (Fig. 2). High-throughput datasets are analyzed by statistical methods, including normalization, filtering, classification and clustering. The statistical methods involved take into account only numeric properties, that is, the abundance profiles of the measured subjects


(genes, proteins and metabolites) or the patterns of the individual measurements. The derived numeric results (modified datasets) and non-numeric results (e.g. gene lists) of the analyses are further processed by integrative methods. The results are evaluated in the context of other datasets; for example, genetic sequences that correspond to the measured subjects, functional annotations or data about metabolic pathways. Each dataset must be preprocessed for the integrative analyses. Integrative analyses can also involve feedback loops, which use results to modify the experimental strategy, to suggest further experiments, or to recalculate quality measures and normalization. Feedback loops lead to finer granularity of results in each step.

The final results are used to interpret the data and annotate the subjects with new findings. Clusters of co-expressed genes may have common features. Pathways, regulatory elements or functional categories are validated or scored in terms of the experimental conditions. These results are biologically relevant information and can be compiled in unstructured form in a report or paper, or fed into structured and semi-structured public or proprietary databases. Flexible and intuitive reporting systems should allow the user to export the results in diverse formats, ranging from technical representations (such as extensible markup language [XML]) to user-readable formats (such as portable document format [PDF]). Additionally, results should be stored in a systematic annotation scheme using a controlled vocabulary or ontology. In this way, the results become easily accessible for subsequent experiments and can be used directly in integrative analysis.

Integrative analysis methods need two kinds of data: the high-throughput data and the additional information about or characterization of the analyzed genes.
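The scoring of functional categories mentioned above is commonly implemented as a hypergeometric over-representation test; the following sketch (pure standard library, Python 3.8+ for math.comb; all counts are invented) computes the one-sided p-value for a category within a gene cluster:

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N of which K carry the
    annotation: the usual over-representation score for scoring a
    functional category against a gene cluster."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Invented counts: 100 genes on the array, 10 annotated to the category,
# a cluster of 8 genes of which 4 carry the annotation.
p = hypergeom_pvalue(N=100, K=10, n=8, k=4)
print(f"p = {p:.2e}")
```

Such a p-value is one way to turn an integrative comparison into the kind of scored, reportable result the scheme calls for; multiple-testing correction would be applied when many categories are scored at once.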
During the analysis, the data types must be simultaneously accessible to the tools or transferred from one tool to another. Mapping between subject identifiers in the datasets must be established and managed. An integrated framework that allows all of the involved data types to be handled should allow information to be conveniently and reliably transferred. Complex integrative analyses carried out within such a framework gain efficiency and reliability. They comply with standardized procedures by following established workflow schemes and are thus more easily reproducible. The applied methods and method parameters can be automatically documented.

High-throughput datasets are generally large, and classification data can be very complex and detailed. It is therefore necessary that integrative analyses can be performed in batch mode, enabling exploratory high-throughput analyses, e.g.


the iterative analysis of a dataset for all categories of a functional annotation scheme. Both the framework and the bioinformatics methods should support such high-throughput analysis.
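Such a batch run over all categories of an annotation scheme might look like the following sketch, using a plain overlap fraction as a stand-in for a proper significance score (scheme, categories and genes are invented):

```python
# Invented annotation scheme: category -> annotated genes.
scheme = {
    "transport": {"g1", "g2", "g6"},
    "metabolism": {"g3", "g4"},
    "signalling": {"g5", "g7"},
}
cluster = {"g1", "g2", "g3"}

def batch_overlap(scheme, cluster):
    """Exploratory batch run: score every category of the scheme against
    one gene cluster (here by overlap fraction) and rank the results."""
    scores = {cat: len(genes & cluster) / len(genes)
              for cat, genes in scheme.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for category, score in batch_overlap(scheme, cluster):
    print(category, round(score, 2))
```

The point of the batch mode is that the loop body never needs user interaction, so the same template can be rerun over hundreds of categories or datasets.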

Requirements for an integrative bioinformatics system

A framework that supports the integrative analysis of high-throughput biological data should enable analysis workflows that follow the described scheme. Several building blocks can be identified, including an information technology (IT) infrastructure for data management, an analysis interface for bioinformatics methods, and an annotation module.

Infrastructure and data management

Integrative analyses are data-driven, and data management is therefore important to integrative bioinformatics systems. High-throughput data are usually stored in a central database. Classification data can also be stored in this central database or in a distributed database system that is governed by an integration and retrieval system. A modularized multi-tier architecture built around such an ensemble of databases ensures data integrity and security. Users can access the system via distributed graphical user interfaces from their institution’s intranet or via the Internet. User and work-group management and secure connection channels would allow restricted access to the system and make parts of the information visible to individual users. In addition, administrative tasks such as making backups, maintenance, configuration and updating can be performed efficiently at a central location, ideally via a convenient graphical administration interface.

Although establishing this kind of infrastructure is a standard IT task, data management for integrative analysis needs to take into account the biological meaning of the data and to support the systematic storage and linkage of diverse data types. Raw high-throughput data come in different formats, depending on the type and vendor of facilities used for their generation. The upload mechanism of the integrative bioinformatics system must be flexible enough to handle diverse formats. The ability to upload tab-delimited files, which enables interactive file format definition (including column selection and definition of data format, and internal aliases per column), is a minimum requirement. Ideally, the proprietary formats of popular vendors should also be recognized. Data in all formats should be internally mapped to the same generic structure. As long as the defined column aliases match, this technique enables data from different sources to be used in a single analysis.
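The tab-delimited upload with user-defined column aliases could be sketched as follows (header names and aliases are invented; a real system would also validate types and units):

```python
import csv
import io

# Hypothetical column-alias definition supplied interactively by the user,
# mapping vendor-specific headers to internal generic field names.
aliases = {"ProbeID": "subject_id", "Log2Ratio": "value"}

raw = "ProbeID\tLog2Ratio\tFlag\nP001\t1.75\tok\nP002\t-0.40\tok\n"

def parse_upload(text, aliases):
    """Read a tab-delimited upload, keep only the aliased columns, and map
    them onto one generic record structure shared by all input formats."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{aliases[col]: row[col] for col in aliases} for row in reader]

records = parse_upload(raw, aliases)
print(records[0])
```

Because every format is reduced to the same generic records, data from different vendors can be combined in one analysis as long as the aliases match.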
It is commonly required that the origin of datasets be stored


within the analysis system. Such a mechanism links the workflow of data analysis with the laboratory workflow, and should ideally be connected to a laboratory information management system. Both the kinds of subjects (e.g. genes, proteins or metabolites) and the individual subjects that are measured in a dataset should be identifiable. Relationships between the subjects should be established to allow complex queries across subject types; for example, ‘gene A codes for protein B’, ‘proteins P and Q are part of the same complex’ or ‘protein S is an enzyme that catalyzes the biosynthesis of metabolite T’.

On the basis of these requirements, the data-management module should organize the measurements, the subjects measured and the relationships between subjects. The module should allow high-throughput data to be uploaded, managed and processed, and subject relations to be uploaded or interactively established. It should be possible to store the mapping of subject identifiers to various synonyms. A retrieval engine with an easy-to-use graphical user interface should allow datasets to be extracted from the database flexibly. Manual selection and complex querying features should allow the user to choose subsets of available measurements and subsets of subjects that have been measured within the selected measurements. In this way, queries can be formulated according to data properties (e.g. ‘all genes that have an expression level greater than 2.3 in at least 5 of 8 measurements’), parameters (e.g. ‘all measurements that have a name beginning with AML*’), annotations (e.g. ‘all human genes that belong to the functional category ABC transporters’) or relations (e.g. ‘all genes that code for a protein of a specific family’). The retrieved datasets are often two-dimensional matrices with one column per measurement and one row per measured subject.
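The first example query above ('all genes that have an expression level greater than 2.3 in at least 5 of 8 measurements') can be expressed directly against such a matrix; a sketch with invented data:

```python
# Invented expression matrix: gene -> list of 8 measurement values.
matrix = {
    "geneA": [2.5, 2.6, 2.4, 3.0, 2.9, 1.0, 0.5, 2.8],
    "geneB": [0.1, 0.2, 2.5, 0.3, 0.2, 0.1, 0.4, 0.2],
}

def query_by_level(matrix, threshold=2.3, min_hits=5):
    """Data-property query from the text: all genes with an expression
    level greater than the threshold in at least min_hits measurements."""
    return [g for g, values in matrix.items()
            if sum(v > threshold for v in values) >= min_hits]

print(query_by_level(matrix))  # ['geneA']
```

In a deployed system this filter would be pushed down into the database query rather than applied in memory, but the logic is the same.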

Bioinformatics methods

The second main part of the integrative analysis system involves the analysis of the stored data. Analysis adapters allow the retrieved datasets to be analyzed by statistical or bioinformatics methods. A method receives input data of defined types and a set of parameters (some or all parameters can be optional). All input data are related to a set of measurements and a set of subjects. In the most trivial case, the input is numerical data scattered in the two-dimensional matrix that results from data retrieval. The input may also be systematic classification data that relate to the set of subjects or the set of measurements. A method can provide results in several formats: new measurements, that is, modified raw data (e.g. data generated by normalization or principal component analysis);

groupings of input subjects or measurements (e.g. clusters or classifications); or other results that depend on the individual methods (e.g. hierarchical trees from some clustering methods or p-values calculated for subject groups). The results of one method can be the input for a subsequent method.

Workflow support can be achieved by the ability to define and store chains of methods or analysis templates. Within a chain of methods, the output of one method can be used as the input data or parameters for a subsequent method. These analysis templates can be applied to a dataset in the same way as individual methods are applied. Analyses can be defined by selecting the analysis method or method chain, defining the input dataset through the retrieval functionality and defining the parameters for each of the methods. A defined analysis should be named and stored within the data management framework. Upon execution of the stored analysis, the methods are calculated and the results are stored with the analysis.

To enable exploratory high-throughput analyses, algorithms and methods must be designed to score the significance of results. Statistical measures, such as p-values, are a common concept used to score significance. In addition, it is important that the results of each method, numeric or integrative, can be easily visualized. For integrative analysis methods, the ability to visualize the results is often crucial to interpreting them. For some methods, standard diagrams (e.g. scatter plots, bar plots or three-dimensional plots) can be used, but other methods need special visualization diagrams that allow the results to be exploited to their full extent. Classification data or annotations may have to be mapped to diagrams of numerical methods (see Figure 3 for an example). Other classification data are better visualized directly; for example, in pathway diagrams.
These diagrams should include a representation of the subjects of the high-throughput data and their features.
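The chains of methods described above can be sketched as a list of functions applied in sequence, each consuming the previous method's output (both the toy methods and the data are invented):

```python
def normalize(data):
    """Toy normalization: shift each profile to zero mean."""
    return {g: [v - sum(vs) / len(vs) for v in vs] for g, vs in data.items()}

def filter_variable(data, min_range=1.0):
    """Toy filter: keep subjects whose value range exceeds a threshold."""
    return {g: vs for g, vs in data.items() if max(vs) - min(vs) >= min_range}

def run_template(data, methods):
    """Apply a stored chain of methods: each method's output becomes the
    next method's input, as in the analysis templates described above."""
    for method in methods:
        data = method(data)
    return data

template = [normalize, filter_variable]
data = {"g1": [1.0, 3.0], "g2": [2.0, 2.1]}
print(list(run_template(data, template)))  # ['g1']
```

Storing `template` under a name, together with its parameters, is what lets a defined analysis be re-executed and automatically documented.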

Annotation support

Annotation support is the third crucial part of the analysis system. Knowledge about the measured subjects needs to be available to the integrative analysis. Annotations should be stored in the same environment as other datasets so that they are available for later reference, to colleagues who work in the same area, and for use in successive integrative analyses of new data. The most valuable annotation data for automatic integrative analyses are systematic annotations. Lists of controlled vocabulary (catalogs, ontologies or thesauri) can be used to avoid problems of alternative spellings and spelling errors, and to unify the annotation of different co-workers.
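A minimal sketch of such a controlled vocabulary with synonym resolution (terms and synonyms are invented):

```python
# Invented controlled vocabulary: canonical terms with synonym lists.
vocabulary = {
    "ribosome biogenesis": ["ribosome assembly", "ribosomal biogenesis"],
    "lipid metabolism": ["fat metabolism"],
}

# Invert to a lookup table so any spelling variant maps to one entry.
lookup = {term.lower(): term for term in vocabulary}
for canonical, synonyms in vocabulary.items():
    for s in synonyms:
        lookup[s.lower()] = canonical

def annotate(free_text_term):
    """Resolve a co-worker's free-text annotation against the vocabulary;
    unmatched terms return None and can be flagged for curation rather
    than being stored silently."""
    return lookup.get(free_text_term.strip().lower())

print(annotate("Ribosome Assembly"))  # 'ribosome biogenesis'
```

Because every stored annotation is a canonical term, annotations entered by different co-workers become directly comparable and queryable.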


These lists allow descriptions that explain exactly how a term should be used in the annotation. A list of synonyms that are internally mapped to the same annotation entry can solve the problem of unmatched synonyms. Annotation attributes should be entered (e.g. as text, numbers or World Wide Web links) and ordered in annotation forms according to different thematic areas of annotation. Ideally, the information that accompanies the high-throughput data (e.g. shipped annotations or measurement parameters produced by the measuring unit) should be seamlessly integrated into the annotation forms. When uploading data into the internal database, this information should also be added to the corresponding forms. Such a mechanism allows highly reliable and efficient annotation, as well as convenient quality control by querying the annotations. The retrieval of measurements according to these automatically uploaded parameters should be possible. Using a thesaurus-based annotation framework allows the annotations of different entities (measurements or subjects) to be systematically compared, that is, they can be used for automatic analysis methods. In this way, information that is generated by an institution through the analysis of high-throughput data can be made available to users of the system automatically.

A framework for an integrative bioinformatics system should support existing annotation schemes (e.g. Gene Ontology, the FunCat Functional Catalog and pathway schemes such as KEGG) and standards such as the minimum information about a microarray experiment (MIAME) standard ([53,54]; http://www.mged.org/Workgroups/MIAME/). In addition, the framework should allow additional catalogs for proprietary and user-specific annotation to be defined.

[Figure 3: three-dimensional bar chart; the horizontal axes give the cluster number (rows) and cluster number (columns) of the SOM grid, and the vertical axis gives the number of genes.]

Figure 3. Example of annotation data mapped onto a SOM gene clustering. The grid represents the clustering, and each grid point corresponds to a gene cluster. The third dimension of the diagram is used to plot the number of genes per cluster that belong to a specific functional category. The expression data used come from diauxic shift in yeast [39]. Genes that belong to the FunCat ribosome biogenesis category are projected. Red peaks represent groups of clusters that have closely correlated gene-expression profiles and that contain a large number of ribosomal genes.


Summary

To achieve the efficient analysis of high-throughput data such as gene expression, proteomics and metabolomics data, the available information about the measured subjects (e.g. annotations and classification data) needs to be taken into account. Bioinformatics methods that make systematic use of such information are called ‘integrative analysis methods’. To handle extensive high-throughput datasets, integrative analysis methods require a high degree of automation. This automation can be realized through an ‘integrative bioinformatics system’ that allows the management of all of the required types of data within a consistent framework and supports systems biology techniques. The integrative analyses described in the scientific literature can be used to derive a general scheme for such analyses, and an integrative bioinformatics system can be built according to that scheme. In addition to the standard features of a modern IT system, the integrative bioinformatics system needs flexibility for integrating data types, analysis methods, visualization modules and annotation schemes. Finally, a powerful reporting system is needed to ensure that the system supports the individual analysis steps: from the systematic storage of raw data, statistical processing of high-throughput data, and numerical and integrative analyses, to the extraction of publishable results.

Acknowledgements

This paper is based on the author’s experiences in developing integrative analysis systems at Biomax Informatics. The author wishes to acknowledge the project teams that have been involved in developing the described scenario, and especially Kaj Albermann, Klaus Heumann, Wenzel Kalus and Sascha Losko for their valuable comments on the manuscript. Biomax and BioRS are registered trademarks, and FunCat is a trademark, of Biomax Informatics AG in Germany and other countries. Registered names, trademarks and so on used in this document, even when not specifically marked as such, are not to be considered unprotected by law.

References 1 Jain, E. and Jain, K. (2001) Integrated bioinformatics – high throughput interpretation of pathways and biology. Trends Biotechnol. 19, 157–158 2 Quackenbush, J. (2001) Computational analysis of microarray data. Nat. Rev. Genet. 2, 418–427

BIOSILICO Vol. 1, No. 5 November 2003

REVIEWS RESEARCH FOCUS

3 Tefferi, A. et al. (2002) Primer on metabolic genomics. Part III: microarray experiments and data analysis. Mayo Clin. Proc. 77, 927–940 4 Draghici, S. (2003) Data Analysis Tools For DNA Microarrays (Mathematical Biology and Medicine Series), Chapman & Hall/CRC 5 Schuchhardt, J. et al. (2000) Normalization strategies for cDNA microarrays. Nucleic Acids Res. 28, E47 6 Eisen, M.B. et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. U. S. A. 95, 14863–14868 7 Ben-Dor, A. and Yakhini, Z. Clustering gene expression patterns. In Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB), 1999 8 Qiang, J. et al. (2001) Beyond synexpression relationships: local clustering of timeshifted and inverted gene expression profiles identifies new, biologically relevant interactions. J. Mol. Biol. 314, 1053–1066 9 Datta, S. and Datta, S. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19, 459–466 10 Golub, T.R. et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531–537 11 Xiong, M. et al. (2001) Feature (gene) selection in gene expressionbased tumor classification. Mol. Genet. Metab. 73, 239–247 12 Yeoh, E-J. et al. (2002) Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1, 133–143 13 Sherlock, G. (2000) Analysis of large-scale gene expression data. Curr. Opin. Immunol. 12, 201–205 14 Gerstein, M. and Jansen, R. (2000) The current excitement in bioinformatics — analysis of whole-genome expression data: how does it relate to protein structure and function? Curr. Opin. Struct. Biol. 10, 574–584 15 Oliveros, J.C. et al. (2000) Expression profiles and biological function. Genome Inform. Ser. Workshop Genome Inform. 11, 106–117 16 Wu, L.F. et al. 
(2002) Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet. 31, 255–265 17 Noordewier, M.O. and Warren, P.V. (2001) Gene expression microarrays and the integration of biological knowledge. Trends Biotechnol. 19, 412–415 18 Frishman, D. et al. (2002) Online genomics facilities in the new millennium. Pharmacogenomics 3, 265–271 19 Kokocinski, F. et al. (2003) Quick-LIMS: facilitating the data management for DNA-microarray fabrication. Bioinformatics 19, 283–284 20 Marcotte, E.M. et al. (1999) A combined algorithm for genome-wide prediction of protein function. Nature 402, 83–86 21 Pellegrini, M. et al. (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U. S. A. 96, 4285–4288 22 Ideker, T. et al. (2001) A new approach to decoding life: systems biology. Annu. Rev. Genomics Hum. Genet. 2, 343–372 23 Zien, A. et al. (2000) Analysis of gene expression data with pathway scores. Int. Syst. Mol. Biol. 8, 407–417 24 Hanisch, D. et al. (2002) Coclustering of biological networks and gene expression data. Bioinformatics 18 (Suppl. 1), S145–S154 25 Grigoriev, A. (2001) A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 29, 3513–3519 26 Jansen, R. et al. (2002) Relating whole-genome expression data with protein–protein interactions. Genome Res. 12, 37–46 27 Greenbaum, D. et al. (2002) Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 18, 585–596

28 Gieseg, M.A. et al. (2002) Expression profiling of human renal carcinomas with functional taxonomic analysis. BMC Bioinformatics 3, 1–13
29 Grosu, P. et al. (2002) Pathway Processor: a tool for integrating whole-genome expression results into metabolic networks. Genome Res. 12, 1121–1126
30 Rhodes, D.R. et al. (2002) Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res. 62, 4427–4433
31 Bono, H. and Okazaki, Y. (2002) Functional transcriptomes: comparative analysis of biological pathways and processes in eukaryotes to infer genetic networks among transcripts. Curr. Opin. Struct. Biol. 12, 355–361
32 Pellegrini, M. (2001) Computational methods for protein function analysis. Curr. Opin. Chem. Biol. 5, 46–50
33 Schwikowski, B. et al. (2000) A network of protein–protein interactions in yeast. Nat. Biotechnol. 18, 1257–1261
34 Fellenberg, M. et al. (2000) Integrative analysis of protein interaction data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 152–161
35 Ideker, T. et al. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, 929–934
36 Jakt, L.M. et al. (2001) Assessing clusters and motifs from gene expression data. Genome Res. 11, 112–123
37 Bussemaker, H.J. et al. (2001) Regulatory element detection using correlation with expression. Nat. Genet. 27, 167–171
38 Kwon, A.T. et al. (2003) Inference of transcriptional regulation relationships from gene expression data. Bioinformatics 19, 905–912
39 DeRisi, J.L. et al. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686
40 Chu, S. et al. (1998) The transcriptional program of sporulation in budding yeast. Science 282, 699–705
41 Iyer, V.R. et al. (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283, 83–87
42 Masys, D.R. et al. (2001) Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics 17, 319–326
43 Brown, M.P.S. et al. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. U. S. A. 97, 262–267
44 Pavlidis, P. and Grundy, W.N. (2000) Combining microarray expression data and phylogenetic profiles to learn gene functional categories using support vector machines. Technical report, Columbia University, Dept of Computer Science
45 Hughes, T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell 102, 109–126
46 The Gene Ontology Consortium (2001) Creating the Gene Ontology resource: design and implementation. Genome Res. 11, 1425–1433
47 Hill, D.P. et al. (2002) Extension and integration of the Gene Ontology (GO): combining GO vocabularies with external vocabularies. Genome Res. 12, 1982–1991
48 Frishman, D. et al. (2003) The PEDANT genome database. Nucleic Acids Res. 31, 207–211
49 Chaussabel, D. and Sher, A. (2002) Mining microarray expression data by literature profiling. Genome Biol. 3, research0055.1–0055.16
50 Kanehisa, M. et al. (2002) The KEGG databases at GenomeNet. Nucleic Acids Res. 30, 42–46
51 Kurhekar, M.P. et al. (2002) Genome-wide pathway analysis and visualization using gene expression data. In Proceedings of the Pacific Symposium on Biocomputing, 7
52 Voit, E.O. and Radivoyevitch, T. (2000) Biochemical systems analysis of genome-wide expression data. Bioinformatics 16, 1023–1037
53 Brazma, A. et al. (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat. Genet. 29, 365–371
54 Spellman, P.T. et al. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3, research0046.1–0046.9
