European Neuropsychopharmacology 11 (2001) 399–411 www.elsevier.com / locate / euroneuro
Regulatory sequence analysis: application to the interpretation of gene expression
Jaak Vilo*, Katja Kivinen
European Bioinformatics Institute EBI, EMBL Outstation - Hinxton, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Abstract
Microarray technologies for measuring mRNA abundances in cells allow monitoring of gene expression levels for tens of thousands of genes in parallel. By measuring expression responses across hundreds of different conditions or timepoints, a relatively detailed gene expression map starts to emerge. Using cluster analysis techniques, it is possible to identify genes that are consistently coexpressed under several different conditions or treatments. These sets of coexpressed genes can then be compared to existing knowledge about biochemical or signalling pathways; the function of unknown genes can be hypothesised by comparing them to other genes with characterised function, or from trends in expression profiles in general, that is, from why the cell needs to transcribe or silence the genes during a particular treatment. The regulation of genes on the DNA level is largely guided by particular sequence features, the transcription factor binding sites, and other signals encoded in DNA. By analyzing the regulatory regions of the DNA of consistently coexpressed genes, we can discover potential signals hidden in DNA by computational analysis methods. The prerequisites for this kind of analysis are the existence of genomic DNA sequence, knowledge about gene locations, and experimental gene expression measurements for a variety of conditions. This article surveys some of the analysis methods and studies for such a computational discovery approach for the yeast Saccharomyces cerevisiae. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Gene expression analysis; Pattern discovery; Promoter analysis
*Corresponding author. E-mail address: [email protected] (J. Vilo).
0924-977X/01/$ – see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0924-977X(01)00117-1

1. Introduction
A collection of gene expression level measurements taken under various experimental conditions, by microarray or any other technology, defines the expression profiles of the respective genes. There are many surveys of the technology and analysis in general (e.g. The Chipping Forecast, 1999; Brazma and Vilo, 2000; Celis et al., 2000; Hegde et al., 2000; Dopazo et al., 2001). The simple query ‘review microarray analysis’ in PubMed returns 72 articles from Medline (April 2001). In this survey we will discuss how gene expression data can be used for analysing gene regulatory sequences, leading to in silico prediction of novel putative transcription factor binding sites and other signals in DNA for gene regulation.
A major role in gene regulation in eukaryotic organisms is played by specific proteins, called transcription factors. By binding to sequence-specific sites in the DNA, called transcription factor binding sites, they influence the transcription of a particular gene. The transcription factor binding sites are located in promoter regions. In yeast, these regions are predominantly (but not exclusively) in the immediate vicinity of the gene (typically less than 1000 bp upstream of the translation start site; Chiang et al., 2001). It seems reasonable to hypothesize that genes with similar expression profiles, i.e. genes that are coexpressed, may share something in common in their regulatory mechanisms, i.e. may be coregulated. Therefore, by clustering together genes with similar expression profiles we find groups of potentially coregulated genes, allowing one to search for putative regulatory signals.
The first whole-genome microarray gene expression data set published was a diauxic shift experiment performed on the yeast Saccharomyces cerevisiae, where the authors measured expression levels for all genes by sampling yeast cells at 2-h intervals during a metabolic shift from fermentation to respiration due to glucose starvation (De Risi et al., 1997). The authors identified several distinct clusters in the gene expression profiles, and were able to show the presence of several previously characterised transcription factor binding sites (for example the stress responsive element CCCCT) located upstream of many of the genes in those
clusters. Fascinated by these results, many researchers started to tackle the following quite natural question: “can we identify novel putative binding sites automatically by combining gene expression data clustering and sequence pattern discovery methods?” The same diauxic shift data set was soon analysed by several other groups, including van Helden et al. (1998) and Brazma et al. (1998b). In both, a systematic search for overrepresented patterns was carried out. van Helden et al. (1998) searched for oligonucleotides overrepresented upstream of potentially coregulated genes (clusters from the paper of De Risi et al., 1997) and showed that potential new transcription factor binding sites can be found in this way. Brazma et al. showed additionally, in a larger-scale analysis, that clustering of the genes (even by a very simple method of explicit thresholding and discretisation of expression values, and consequently ‘binning’ all genes according to these discrete values) makes it possible to identify potential binding sites that cannot be explained by statistical ‘chance’. Moreover, it was shown that many of these binding sites could not be discovered from the global comparison of all upstream sequences against random genomic regions. Both of these studies showed that many of the statistically most significant automatically discovered patterns have matches in known yeast transcription factor binding site descriptions.
Later, more expression studies have been carried out under various conditions, e.g. sporulation (Chu et al., 1998) and the cell cycle (Cho et al., 1998; Spellman et al., 1998), and the amount of expression data is increasing rapidly. Cell cycle data were studied by Spellman et al. (1998), Zhang (1999a), Jakt et al. (2001) and Ohler and Niemann (2001) in order to identify common regulatory sites from DNA upstream sequences. Tavazoie et al.
(1999) clustered 3000 expression profiles of the most variable yeast genes during the cell cycle (15 time points, data from Cho et al., 1998) into 30 clusters using a K-means algorithm. They found that for half of these clusters, strong sequence patterns are present in upstream sequences. Moreover, it was noted that upstream sequences from genes from the tightest expression profile clusters (i.e. the clusters with smaller average Euclidean distance to their centers) contain more significant patterns. In a different study, Wolfsberg et al. (1999) studied the discovery of potential regulatory sequences from groups of genes whose expression peaked at distinctive phases of the cell cycle as identified by Cho et al. (1998). It has been noted by most authors that better cluster and pattern goodness evaluation criteria, better systematic clustering and pattern discovery methods, as well as tools integrating the clustering and pattern discovery should be developed to facilitate the expression data analysis. The transcription factor binding sites are not acting alone. It is assumed that genome-wide control is achieved by a combinatorial use of multiple sequence elements (e.g.
Werner, 1999). Combinatorial aspects of the regulation will become feasible to analyse when better understanding about individual features emerges. Microarray gene expression measurements are indirect in the sense that the first real measurement is a laserscanned image. From these images spots are identified, for each spot numerical values are extracted and expression levels of each gene estimated. This multistep analysis is prone to errors at different stages, both because of the technology itself as well as our possibly limited understanding about the underlying processes. More detailed knowledge about the reliability of technology is, however, being produced and statistical models are starting to emerge and data is systematically stored in the databases like TRANSFAC (Wingender et al., 2000) and SCPD (Zhu and Zhang, 1999). The general recognition for the necessity of standards in the microarray gene expression field and life sciences in general (Brazma, 2001), international standardisation efforts (e.g. see MGED: www.mged.org) and several public microarray database initiatives (Brazma et al., 2000; Wheeler et al., 2001) allow us to believe that maturity in the technology itself as well as standardised data representation for microarray expression measurements will become more realistic and vast amounts of microarray data will be easily accessible to all researchers for further analysis and mining. At the high abstraction level, we consider microarray expression data to be a matrix of numbers, the gene expression values. In this gene expression matrix, rows correspond to genes, columns represent the different biological samples analysed, and values correspond to expression levels of the particular gene in a particular sample. 
When combining large number of experiments into one expression data matrix one can argue that the numerical measurement errors are not systematic and hence they ‘compensate’ themselves out in different time-points while maintaining the general trends. In general, it is important to have replicate measurements to assess the reliability of the measurements. This approach was developed further by Hughes et al. (2000b) where they established from 63 yeast wildtype versus wildtype studies, what is the average expected variability in expression for each of the genes. In that way they proposed a gene-specific error model that takes into account the normal variability for each gene individually. For genes that tend to give stable measurements even small but consistent changes can be reliably detected. This paper surveys the research about establishing links between gene expression data and regulatory sequence analysis performed by several groups on yeast Saccharomyces cerevisiae. Not all of this is necessarily straightforward to apply to higher organisms. It may even be argued that these methods are not applicable to higher organisms. The purpose of this paper is, however, not to provide immediate answers how to unravel the human
regulatory circuitry, but rather to give an overview of which methods have been used and are available for this kind of study, and of what could be achieved by computational approaches in general. It has been argued by some authors that full automation of regulatory site prediction may be hard to achieve at this early stage of the research and, secondly, that clustering and automatic binding site prediction should be integrated in the sense that one can be used to improve the other (Zhang, 1999a). Integration of the two processes is desirable and indeed has already been attempted (Holmes and Bruno, 2000). More recently, Bussemaker et al. (2001) proposed a method for regulatory element detection that does not rely on clustering of the gene expression profiles and is based on correlating gene expression values and motif occurrences. However, when combining two approaches one needs to be careful and think about how to verify the results by some independent means. The real challenge for regulatory sequence analysis is not the question of which clustering or pattern discovery method, or which combination of them, to use, but rather how to predict the biologically most relevant models for gene regulation.
Recent surveys are available about the history of representation and discovery methods for DNA binding sites (Stormo, 2000), about mathematical and algorithmic models for promoter sequence analysis with the emphasis on prokaryotic promoter sequences (Vanet et al., 1999), and about gene expression data and promoter analysis (Zhang, 1998). As noted by many authors working in this field, the task of identifying promoter sequences can be very difficult (e.g. Vanet et al., 1999; Sinha and Tompa, 2000; Vilo et al., 2000).
The reasons include the uncertainty in promoter region prediction, the noise level in microarray expression measurements, the question about what is the appropriate motif description language, and the algorithmic problems of identifying subtle signals from sets of sequences that do not even need to share the same motifs. It should be noted that clustering is not always needed at all for studying the expression response. Alternative approaches to clustering of expression profiles could be for example to use functional annotations and retrieve groups of genes with similar function based on these annotations (Hertz et al., 1990; Jensen and Knudsen, 2000; Hughes et al., 2000a; Zhu and Zhang, 2000), sorting of genes based on the magnitude of their expression response under certain conditions (Jensen and Knudsen, 2000) or all genes affected by a single-gene knock-out study (Hughes et al., 2000b), or distance from a single ‘interesting’ gene. Inconsistent use of words in annotation may limit the usefulness of the functional annotation-based approach (Jensen and Knudsen, 2000). Different pattern discovery methods can be applied to promoter analysis. We will discuss pattern discovery methods in general in Section 3, and applications of some
of the methods to promoter analysis in particular, in Section 4. We will start however by describing briefly the ideas behind most common clustering methods.
2. Identification of groups of coexpressed genes by cluster analysis techniques
The task of cluster analysis in general is to identify groups of objects similar to each other. There are two main components to cluster analysis: how to define which objects are similar to each other, and how to identify groups that are similar in some respects, for example sharing some common features. The task of this article is not to discuss every detail of the distance measure calculations and clustering methods. There are several textbooks, book chapters or survey articles about cluster analysis in general (Hartigan, 1975; Legendre and Legendre, 1998; Jain et al., 1999) as well as articles which discuss some of the analysis methods for the expression data analysis (Brazma and Vilo, 2000; Quackenbush, 2001). In the following, we will summarise some of the core ideas behind the most common algorithms.
2.1. Distance (similarity) measures
First and perhaps the most crucial step in analyzing the microarray data is to choose the appropriate distance measure that would allow to capture biologically relevant similarity between genes. Choice of distance measures can furthermore be extended by choice of normalization methods. Note that we are trying to cluster genes (rows in the gene expression matrix) based on the gene expression levels in different biological samples. Each gene (row in the expression matrix) can be thought of as a vector in the high-dimensional space where dimensionality is the number of biological samples analysed (number of columns in the expression matrix). A common example of distance measures is Euclidean distance that treats all coordinates independently and tries to minimize overall differences between all vector coordinates. For two genes $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, the Euclidean distance $d(X, Y)$ is defined as

$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
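The Euclidean distance between two expression profiles can be computed directly from its definition. The following is a minimal sketch using only the Python standard library; the function name and the toy expression matrix are hypothetical and serve only as an illustration.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same samples")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical toy expression matrix: rows are genes, columns are samples.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.0, 2.0, 5.0],
    "geneC": [4.0, 6.0, 3.0],
}

# Pairwise distances between all genes (rows of the matrix).
genes = sorted(expression)
for i, g in enumerate(genes):
    for h in genes[i + 1:]:
        print(g, h, euclidean_distance(expression[g], expression[h]))
```

Every coordinate (sample) contributes independently to the sum, so a large difference in a single sample can dominate the distance.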
When the magnitude of the expression response is not important, but rather the general shape of the profile, one can normalize all vectors to have equal length prior to distance calculations. In this way, the magnitude of change is reduced and different genes become more comparable. The same can be directly achieved using correlation measure-based distances that capture the coordinated
changes, not the absolute magnitude of values. The prior normalization of the data may be useful when using clustering algorithms that rely on Euclidean properties. Heyer et al. (1999) proposed a jackknife correlation measure based on the observation that a single outlier can sometimes dominate the distance between genes that are otherwise uncorrelated. They therefore define the jackknife correlation as the minimum correlation between two genes obtained when each of the coordinates (e.g. timepoints) is left out in turn. Jackknife correlation is robust to single outliers that cause undesired correlations between genes.
Conceptually quite different ways of calculating distances can also be used. For example, it is possible to replace the values of the original vectors by their ranks, i.e. the smallest value will get rank 1 and is replaced by value 1, the next smallest by 2, etc. Rank correlation methods define distances on these rank values instead of the original ones. Another intuitively appealing distance measure proposed for microarray gene expression analysis is based on mutual information content, i.e. how much information one gene captures about the other (D’haeseleer et al., 1998).
Although many distance measures have been tried out and most seem to give reasonably good results during clustering, at present there are no objective criteria for deciding which distance measure should be used for one or another type of analysis. The final answer should eventually come from the biology itself, but it is not immediately obvious what an experiment that would establish this kind of result with independent verification should look like.
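The correlation-based measures discussed above can be illustrated with a short sketch. It assumes Pearson correlation as the underlying correlation measure, takes the jackknife correlation to be the minimum over all leave-one-sample-out correlations as described in the text, and gives tied values their average rank. These choices, all function names, and the toy profiles are assumptions for illustration and not necessarily the exact definitions used in the cited studies.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / math.sqrt(vx * vy)

def jackknife_correlation(x, y):
    """Minimum correlation obtained when each sample is left out in turn."""
    x, y = list(x), list(y)
    return min(
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    )

def ranks(x):
    """Replace values by ranks (1 = smallest); ties get the average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def rank_correlation(x, y):
    """Correlation computed on ranks instead of the original values."""
    return pearson(ranks(x), ranks(y))

# Two hypothetical profiles that agree only in one outlying sample.
gene1 = [0.1, -0.2, 0.0, 0.2, 5.0]
gene2 = [0.2, 0.1, -0.1, -0.2, 6.0]
print(pearson(gene1, gene2))                # dominated by the shared outlier
print(jackknife_correlation(gene1, gene2))  # outlier left out in the worst case
print(rank_correlation(gene1, gene2))
```

A correlation value r can be turned into a distance, for example as 1 - r, before it is passed to a clustering algorithm.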
2.2. Clustering algorithms
Different clustering methods may reveal different properties of the data. We will summarize a few key ideas behind these methods.
Hierarchical clustering combined with global visualization of the whole expression data matrix, sorted according to hierarchical clustering output and color-coded in pseudocolors, has been developed and widely popularised in the microarray community by Eisen et al. (1998). These methods have proved very useful for understanding general expression responses. Hierarchical clustering in its most standard form (agglomerative hierarchical clustering) is a simple deterministic approach where first all pairwise distances between single genes (logically they are treated as clusters containing exactly one gene) are calculated, and then iteratively the two nearest clusters are merged. Next, distances from all other clusters to the newly merged cluster are calculated (other distances do not change), and the procedure is repeated until all clusters have been merged into one. The hierarchical tree visualises these mergers so that the singleton clusters are ‘leaves’ and mergers produce larger subtrees. Distances between two clusters can be calculated in different ways, and that is what distinguishes single
linkage (or minimum linkage, distance is the minimum distance between objects from two different clusters), complete linkage (or maximum linkage, the distance is the largest distance between any two members from the compared clusters), and average linkage (average distance, the distance is an average between all members in two clusters) clustering methods. Furthermore, average distance can also be calculated in different ways, either by averaging over all pairwise distances between objects in two clusters (by taking into account cluster sizes, or not), or by representing each cluster by a representative ‘centroid’ and then calculating distances between these centroids. The need for either interactive exploration (point and click on subtrees, cut the tree at certain height, cut the tree to prespecified number of clusters, etc.) or other cluster validation criteria may make hierarchical clustering unattractive for automatic pipelining of clustering results to other analysis methods. In this paper, we are interested in this pipelining as we want to analyse each cluster further for common putative binding sites. Partitioning-based methods are alternative to hierarchical clustering. These methods attempt to group data into (usually non-intersecting) groups, where the number of groups can be given by users or determined by the clustering procedure itself. For K-means clustering, a typical representative of partitioning-based methods, the user has to predefine the number of clusters, K, and the task of clustering procedure is to partition data into exactly K groups. The K-means procedure starts from choosing K centers either randomly or by some deterministic procedure (users could improve the clustering outcome by submitting the most sensible starting centerpoints themselves). 
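The K-means procedure introduced above, whose assignment and update steps are described in the text that follows, can be sketched in a few lines. This is a minimal illustration assuming squared Euclidean distance; the function names, the optional `initial_centers` argument (standing in for user-supplied starting centers), the handling of empty clusters, and the toy profiles are all assumptions made for this example.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(profiles, k, initial_centers=None, max_cycles=100, seed=0):
    """Partition profiles into k clusters.

    Returns (assignment, centers); assignment[i] is the index of the
    cluster that profiles[i] belongs to.
    """
    if initial_centers is None:
        # Random start: pick k distinct profiles as initial centers.
        rng = random.Random(seed)
        centers = [list(p) for p in rng.sample(profiles, k)]
    else:
        centers = [list(c) for c in initial_centers]

    assignment = None
    for _ in range(max_cycles):
        # Step 1: assign each profile to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # cluster contents no longer change
        assignment = new_assignment
        # Step 2: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical toy data: two well-separated groups of profiles.
profiles = [[0.0, 0.2], [0.2, 0.0], [5.0, 5.2], [5.2, 5.0]]
assignment, centers = k_means(
    profiles, 2, initial_centers=[profiles[0], profiles[2]]
)
print(assignment, centers)
```

The outcome depends on the starting centers: a poor start can leave the procedure in a locally optimal partition, which is why supplying sensible starting points or repeating the run with different seeds is common practice.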
The calculation procedure proceeds in steps: each object is assigned to the cluster whose center is closest to it, and the cluster centers are then recalculated by moving them to the ‘centre of gravity’ of each cluster. This procedure is repeated until cluster contents do not change any more, or until the program has performed enough cycles.
Self Organising Maps (SOM) (Kohonen, 1997; Törönen et al., 1999; Tamayo et al., 1999) are somewhat similar to K-means. For SOM, the user usually has to specify a grid or lay-out for clusters, e.g. a two-dimensional grid of 5×10 clusters. First each cluster is initialised with a random object from the data. Then, in a random fashion, objects are selected from the data, and the cluster most similar to the object is identified. The representative vector for that cluster and also some of the neighbouring clusters are adjusted to resemble the chosen object more closely. Finally, each gene is assigned to one of the clusters.
Microarray gene expression data has fascinated many researchers from different fields and also launched research into novel clustering algorithms. For example, graph theory-based algorithms (for finding highly connected subgraphs from the data; Sharan and Shamir, 2000), physical mechanics-based algorithms (super paramagnetic
clustering; Getz et al., 2000), a statistical approach for finding clusters with high variance across samples (Hastie et al., 2000), or cross-hybrids of several methods (SOTA, combining speed of SOMs with intuition of trees; Herrero et al., 2001), have been developed to help to solve certain aspects of the data analysis. There is a need for faster but also for more ‘accurate’ clustering methods; these are two different and sometimes contradictory goals, which the new algorithms each try to meet as well as possible.
Fancy clustering algorithms are not always necessary, however, as is reflected by the following two examples. Heyer et al. (1999) proposed a simple clustering procedure, the ‘quality clustering algorithm’ (QT_Clust), where a candidate cluster is grown for each gene by adding other genes to it one after another in order of their distance from the original gene. By establishing a threshold for maximal cluster radius (a quality guarantee for a cluster), the process stops when that threshold is reached. From all clusters produced during one run of the algorithm, the largest is selected and the respective genes are removed from the data. The clustering continues by iteratively identifying the next largest clusters from the remaining data. A very similar procedure (or exactly the same, depending on whether cluster radius is calculated from the single starting gene or from the cluster’s ‘real center’) was used by Zhu and Zhang (2000). They called it a ‘largest first clustering algorithm’.
Even the most widely used methods usually have problems with different aspects of the clustering quality, like the convergence of the algorithm (does the procedure find the solution and is the solution the same from run to run), potential entrapment into locally optimum clustering (i.e.
there may be ‘better’ global clustering available, but the algorithm is not able to find it), and stability of the created clusters from one run to another. The question of finding ‘the best clustering’ is generally not solvable in practice. To tell what is ‘the best clustering’ one has to have some formal way of defining the clustering quality, and the task of the clustering procedure would be to find the globally optimal solution. Due to combinatorial aspects of the search space and our limited computing power, this global optimum can only be approximated by different clustering algorithms representing different heuristic search strategies. It means that different approaches can and should be used to reveal different aspects of the data or in analysis of the data in different contexts. For interactive exploratory analysis it is important to have almost immediately the results presented to the users. For squeezing the most out of the data, users may define other criteria for global cluster quality scores, incorporate background knowledge, and use expensive search strategies. We are interested in clustering of the gene expression profiles because these clusters can tell us something about regulatory mechanisms for many members of the cluster. One could study very carefully clusters from only ‘the
best’ clustering method, and consequently use more demanding pattern discovery methods for discovering novel patterns for that particular cluster. Another approach would be to combine different clustering methods and vary the parameters in each of these, in order to cluster data differently, possibly revealing different aspects of the same data. From these different clustering methods, it is possible to collect and combine (overlapping) clusters that represent the best aspects of different clustering methods. Not always is clustering of genes necessary or even meaningful. For example, one could just order genes based on the magnitude of response to specific conditions and analyze only the genes most affected. Genes can be sorted based on single timepoint values and the most highly upregulated or downregulated genes studied (Jensen and Knudsen, 2000). The main question in this approach is how to set the threshold — how many genes are considered and which sequence motif correlates best with the first genes? A Kolmogorov–Smirnov statistic (a rank test) has been used to calculate the significances for patterns without the need to cluster the data (Jensen and Knudsen, 2000). Another way to sort genes would be to start from a single gene of interest and sort all the other genes relative to that gene based on increasing distance. The stopping criteria may be for example the number of genes desired or the maximum distance from the first gene. Sorting is also useful when one would like to find genes that are most anticorrelated in respect to the gene of interest.
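Sorting genes relative to a single gene of interest, as described above, needs no clustering at all. The sketch below assumes a correlation-based distance (1 minus the Pearson correlation), so that strongly anticorrelated genes end up at the far end of the sorted list; the function names, the two optional stopping criteria, and the toy data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / math.sqrt(vx * vy)

def sort_by_distance(expression, query, max_genes=None, max_distance=None):
    """Sort all other genes by increasing distance (1 - r) from `query`.

    Stops at `max_genes` genes and/or at `max_distance`, if given.
    """
    reference = expression[query]
    ranked = sorted(
        (1.0 - pearson(reference, profile), gene)
        for gene, profile in expression.items()
        if gene != query
    )
    if max_distance is not None:
        ranked = [(d, g) for d, g in ranked if d <= max_distance]
    if max_genes is not None:
        ranked = ranked[:max_genes]
    return ranked

# Hypothetical expression profiles over four samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],   # rises together with g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # mirror image of g1
    "g4": [1.0, 3.0, 2.0, 4.0],   # loosely follows g1
}
for distance, gene in sort_by_distance(expression, "g1"):
    print(gene, round(distance, 3))
```

Reading the same list from the end gives the genes most anticorrelated with the gene of interest.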
3. Pattern discovery methods
By sequence pattern discovery, we mean finding a priori unknown patterns (of some given class, such as substrings, regular expressions, or probabilistic weight matrices) that are statistically overrepresented in a given set of sequences (possibly in respect to some background distribution). Algorithms for sequence pattern discovery have been widely used for characterizing protein families (e.g. Jonassen, 1997); for surveys, see for example Brazma et al. (1998a) and Wang et al. (1999). A unifying framework for pattern discovery methods divides the pattern discovery problem into three logical steps (Brazma et al., 1998a):
1. What is the appropriate language to describe patterns (what is the pattern class and which features of the sequences the patterns should capture).
2. What is the scoring function for comparing patterns (how to tell which pattern is ‘better’ than the other).
3. What is the most efficient algorithm to identify best-scoring patterns from the selected pattern class according to the chosen scoring function.
Depending on the pattern language, we divide patterns into probabilistic patterns like for example weight matrices (Hertz and Stormo, 1999), and into discrete patterns (for
example regular expressions). Although the probabilistic motif representation is probably more appropriate for describing binding efficiency to DNA, these motifs tend to be much more complex to discover by computational methods. Thus we would argue that a combination of the two approaches, in which methods that can rapidly discover discrete patterns are used to guide probabilistic motif discovery methods, would be desirable. This seems to be especially necessary when analysing the larger and longer sets of sequences anticipated from studies in higher organisms.
Based on the algorithmic component, we classify pattern discovery methods into: (1) sequence-driven, mostly alignment-based approaches, and (2) pattern-driven approaches, where the search algorithm evaluates each pattern in the pattern language against the set of sequences, counting its number of occurrences. The pattern-driven approaches can be carried out intelligently so that patterns that cannot be observed in the data are not generated. For example, if pattern w is not present frequently in the data, then no refinement of it (one that makes w more specific, for example by adding new characters) can be frequent in the data either.
One obvious problem with many of the common pattern discovery algorithms that try to identify consensus motifs, i.e. motifs common to (almost) all input sequences, is that there is no guarantee that all (or almost all) of the sequences in one cluster are regulated by the same mechanism and motifs. Several different mechanisms can have the same final regulatory effect, errors in microarray measurement may distort cluster contents, etc. The global alignment of upstream sequences of coexpressed genes usually fails.
The main reasons for failure of sequence alignments and motif discovery methods based on sequence alignments are:
• there usually is no global homology between upstream regulatory sequences;
• if patterns A and B are in different order on different sequences, e.g. –A–B– and –B–A–, then global alignment of these sequences is not possible.
Some of the most efficient algorithms capable of discovering discrete patterns like for example substrings (oligonucleotides) of any length, are based on suffix tree (McCreight, 1976; Ukkonen, 1995) data structure. Suffix trees are used to index texts (sequences) in the way that queries would not depend on the size of the indexed text. In the suffix tree, all possible subwords can be read from the top of the tree-structured index regardless of original text size. The direct link to pattern discovery methods is given by the fact that all possible substrings (patterns) are presented in this tree structure. Suffix tree-based approaches and extensions thereof have been used for promoter analysis by several groups (e.g. Brazma et al., 1998a,b; Marsan and Sagot, 2000; Jensen and Knudsen, 2000; Vilo et al., 2000). A suffix-tree-based sequence pattern discovery algorithm SPEXS, which allows one to produce in exhaustive manner the sequence patterns according to predefined pattern representation language, has been developed in Vilo (1998). SPEXS extends suffix tree-based approaches in the way that patterns with group positions or wildcards and flexible wildcards, can be enumerated. SPEXS can generate patterns in different search orders, one of the most useful being the order of starting from most frequent patterns to less frequent ones. For example, it first identifies patterns common to all sequences in input, then patterns common to all minus one sequences, etc. Users can set the threshold how frequent are the patterns they want to evaluate, i.e. a stopping criteria. For each pattern, its score can be calculated allowing users to post-process the reported patterns and extract the most interesting ones. The SPEXS algorithm has been used in Brazma et al. (1998b) and Vilo et al. (2000) to discover regular expression type patterns in gene upstream sequences. SPEXS has also been shown to be useful for identifying patterns from protein sequences. 
Recently it was applied to predicting the coupling specificity of GPCR proteins to their G-proteins from the Gs, Gi/o, or Gq/11 class (Möller et al., 2001). For more information about suffix trees and algorithms see Gusfield (1997).
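The idea of enumerating patterns together with their frequencies can be illustrated with a minimal Python sketch. It is not SPEXS and it does not build a suffix tree: as a simplifying assumption it keeps, for every pattern, an explicit list of occurrence end positions, extends patterns one character at a time, and discards any pattern found in fewer than `min_support` sequences. The function name `frequent_substrings` and its parameters are hypothetical.

```python
from collections import defaultdict


def frequent_substrings(seqs, min_support, max_len=8):
    """Return (pattern, support) pairs, most frequent first.

    Support is the number of input sequences that contain the pattern.
    Illustrative sketch only: plain substrings, no wildcards or groups.
    """
    # Occurrences of the empty pattern: every position of every sequence.
    start = [(i, j) for i, s in enumerate(seqs) for j in range(len(s) + 1)]
    results = {}
    stack = [("", start)]
    while stack:
        pattern, ends = stack.pop()
        if len(pattern) == max_len:
            continue
        # Group the occurrences by the character that follows them.
        by_char = defaultdict(list)
        for i, j in ends:
            if j < len(seqs[i]):
                by_char[seqs[i][j]].append((i, j + 1))
        for ch, new_ends in by_char.items():
            support = len({i for i, _ in new_ends})
            # A pattern that is too rare cannot have a frequent extension.
            if support >= min_support:
                results[pattern + ch] = support
                stack.append((pattern + ch, new_ends))
    # Report from the most frequent to the less frequent patterns.
    return sorted(results.items(), key=lambda kv: (-kv[1], -len(kv[0]), kv[0]))


if __name__ == "__main__":
    example = ["ACGTACG", "TTACGA", "GGACGT"]
    for pattern, support in frequent_substrings(example, min_support=2):
        print(pattern, support)
```

Lowering `min_support` plays the role of the stopping criterion mentioned above: the lower the threshold, the more patterns are enumerated.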
Because global alignment of upstream sequences is usually not possible, one needs to use motif discovery algorithms that do not rely on global alignments. Such methods include, for example, Gibbs Motif Sampling (Lawrence et al., 1993; Neuwald et al., 1995), CoreSearch (Wolfertstetter et al., 1996), MEME (Bailey and Elkan, 1995a,b), and AlignACE (Roth et al., 1998; Hughes et al., 2000a). Of these, AlignACE has been optimised for alignment of DNA sequences by automatic consideration of both strands and for finding multiple motifs via an iterative masking procedure. McGuire et al. (2000) used AlignACE for discovery of regulatory motifs in 17 completed microbial genomes. Many of the different sequence pattern discovery algorithms have been applied to regulatory element analysis (Frech et al., 1997; Vanet et al., 1999).

3.1. Pattern rating functions
The task of the different pattern discovery methods is to find motifs that are overrepresented in the data set analyzed, or unexpected according to some other criteria. For each motif we can count how many sequences contain it, or how many occurrences of it there are in total (i.e. also counting multiple occurrences within the same sequence). When counting several occurrences within each sequence, we have to deal with possibly overlapping occurrences of motifs. Therefore, it is simpler to count at first just the number of sequences that contain the motif. A first simple criterion for establishing overrepresentation is to use ratios, i.e. calculate the ratio of the pattern occurrences in a cluster compared to the expected ratio
(e.g. from all upstream sequences, from randomised sequences, from sequences with shuffled characters, etc.). The problem with ratios is that if the background probability of a pattern is extremely low, then infrequent patterns may have very high ‘significance’. These small probabilities may be slightly compensated for by assuming higher background probabilities (Brazma et al., 1998b). Instead of calculating pattern ratios, it is better to assume some statistical model and calculate probabilities or significances for the pattern occurrences using statistical criteria. Given the background model (for example from explicit counting of pattern occurrences in a comparison set), one can obtain an estimate of how many occurrences of each pattern to expect. This estimate can be used to rank patterns by probabilities based on, for example, the binomial or hypergeometric distribution. The binomial distribution assumes that selecting the genes into the cluster corresponds to independent random trials, and it allows one to calculate the probability of observing each pattern at least a given number of times in the data. When using the hypergeometric distribution one assumes that the data are finite and hence, after every selection of one gene into the cluster, the probabilities of selecting the next ones change. For large data sizes and small numbers of trials, the binomial distribution approximates the hypergeometric distribution well. A statistical measure for ranking patterns, the z-score (‘normal deviate’ or ‘deviation in standard units’), which measures by how many standard deviations the number of occurrences of a pattern in the sequences exceeds its expected number of occurrences, was studied by Sinha and Tompa (2000). Their method for ranking patterns according to that measure uses a Markov chain model generated from upstream sequences to establish the expected probabilities for patterns.
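As a worked illustration of these rating functions, the following Python sketch computes the binomial and hypergeometric tail probabilities and a simple z-score for one pattern. The function names and the example numbers are assumptions made for illustration only. In particular, the z-score here uses a binomial expectation and standard deviation, not the Markov chain model of Sinha and Tompa (2000).

```python
from math import comb, sqrt


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    n: number of genes in the cluster, p: background probability that an
    upstream sequence contains the pattern, k: observed count in the cluster.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which contain the pattern."""
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return total / comb(N, n)


def binomial_z_score(k, n, p):
    """Number of standard deviations by which k exceeds its expectation n*p."""
    return (k - n * p) / sqrt(n * p * (1 - p))


if __name__ == "__main__":
    # Hypothetical numbers: 6000 upstream sequences, 300 contain the pattern,
    # and 8 of the 20 genes in a cluster contain it.
    N, K, n, k = 6000, 300, 20, 8
    print("binomial      :", binomial_tail(k, n, K / N))
    print("hypergeometric:", hypergeom_tail(k, n, K, N))
    print("z-score       :", binomial_z_score(k, n, K / N))
```

With a background set this much larger than the cluster, the two tail probabilities are close, which is the approximation referred to above.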
The algorithm itself enumerates all possible patterns and tabulates their numbers of occurrences. Next, all patterns are ranked based on the z-scores and the best patterns are output. However, this measure is relatively time-consuming to calculate for any single motif. When calculating the total number of occurrences for patterns, i.e. possibly several occurrences per sequence, one can in principle use the same statistical criteria. However, one has to be aware of the possibility that pattern occurrences may overlap and thus are not independent and, biologically, potentially not meaningful. Cyclic patterns (those that can overlap with themselves) have a statistically higher expected number of occurrences even under the assumption that all nucleotides have equal and independent probability of occurrence at each position. Apostolico et al. (2000) have developed methods to calculate different pattern rating scores, like the mean and variance of pattern occurrences, and some derived significance measures in optimal fashion using the suffix tree algorithm. In this way, unusually frequent or infrequent words can be detected efficiently. For consensus-based methods (motifs common to all or
almost all sequences), pattern ranking criteria, including the information content-based methods (Jonassen et al., 1995), can be used. It should be noted that pattern discovery methods attempting to find consensus motifs may not be applicable unless the algorithm is able to identify patterns with small numbers of occurrences. The number of pattern occurrences in each cluster may be quite low, and the discovered consensus motifs may instead be representative of all upstream regions in general as opposed to the specific cluster of coexpressed genes. Measures like the sensitivity and specificity of the pattern, as well as correlation coefficients (Brazma et al., 1998a), can also be readily applied to motif discovery. For promoter analysis, however, the knowledge about true positives (genes regulated by the motif) and true negatives (genes not regulated by the given motif) is largely missing. A probabilistic segmentation model for segmenting strings into ‘words’ and concurrently building a ‘dictionary’ of these words was introduced by Bussemaker et al. (2000). In this statistical approach, background probabilities (negative data) are not used, yet the algorithm also reports words which occur infrequently in the data set analyzed. As mentioned in the previous section, the clustering of genes is not always necessary. When one uses instead a list of genes sorted by some specific criterion, like response to the treatment, distance from a specific gene, etc., the pattern discovery methods have to solve a slightly different problem: what are the best patterns, and with how many genes should each pattern be associated? Jensen and Knudsen (2000) used the Kolmogorov–Smirnov rank test to answer this question.
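The ranked-list setting can be sketched in Python as follows. This is not the procedure of Jensen and Knudsen (2000), whose details are not given here; it is a minimal one-sided Kolmogorov–Smirnov-style statistic, under the assumption that genes are sorted with the most interesting ones first. The function name `ks_enrichment` and the returned cut position are hypothetical choices made for illustration.

```python
def ks_enrichment(ranked_genes, genes_with_pattern):
    """One-sided KS-style statistic for a pattern over a ranked gene list.

    Walks down the list and compares the cumulative fraction of
    pattern-containing genes with the cumulative fraction of the other genes.
    Returns (d_max, cut): the largest excess of pattern-containing genes and
    the list position at which it is reached.
    """
    hits = [g in genes_with_pattern for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need genes both with and without the pattern")
    d_max, cut = 0.0, 0
    cum_hit = cum_miss = 0
    for pos, is_hit in enumerate(hits, start=1):
        if is_hit:
            cum_hit += 1
        else:
            cum_miss += 1
        d = cum_hit / n_hit - cum_miss / n_miss
        if d > d_max:
            d_max, cut = d, pos
    return d_max, cut


if __name__ == "__main__":
    ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
    print(ks_enrichment(ranked, {"g1", "g2", "g4"}))
```

The position `cut` gives one possible answer to the question of how many top-ranked genes a pattern should be associated with; turning `d_max` into a significance value would require the corresponding KS distribution, which is omitted here.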
4. Regulatory sequence analysis

As discussed previously, promoter sequence analysis is an algorithmically hard and nontrivial task due to many factors. As a result, usually we have to deal with sets of potential promoter regions of potentially coregulated genes. Therefore, the algorithms used for transcription factor binding site prediction may have to detect only marginally overrepresented patterns in sets of up to hundreds of sequences of lengths of thousands of nucleotides (Vilo et al., 2000). Some of the most common signals in gene upstream regions across the whole genome can be found by comparing motifs found from all upstream regions against some background (randomized sequences, randomly selected genomic regions, coding regions, etc.). The suffix tree-based algorithms have been shown to be efficient enough to facilitate this kind of analysis (Brazma et al., 1998b; Jensen and Knudsen, 2000). To obtain finer knowledge about motifs, genes that are potentially coregulated should be analyzed, as shown in Brazma et al. (1998b). van Helden et al. (1998) analysed oligonucleotide
frequencies in upstream regions of co-regulated gene clusters of yeast. Statistical significance of an element was based on tables of oligonucleotide frequencies observed in all non-coding sequences. Many known regulatory sequences were identified with this method, but it also revealed unknown sites, which shared a common regulatory pattern with known elements. Oligonucleotides cannot, however, characterise all known binding sites. A pattern language where patterns consist of two conserved trinucleotides spaced by a non-conserved region of fixed length (spaced dyads) was used to identify some of the transcription factor binding sites (van Helden et al., 2000). Spellman et al. (1998) aimed at identifying all cell cycle-regulated genes in yeast by analysis of mRNA levels in cell cultures that had been synchronised by three methods (alpha-factor, size-based and Cdc15-based synchronisation). They analysed upstream regions of about 800 putative cell cycle-regulated genes identified by microarray experiments. For more than one-half of these, good matches to known regulatory sites relevant to the phase of the cell cycle (G1, S, G2, M, and M/G1) were found. Almost 70% of these genes could be controlled by G1 cyclin Cln3p or mitotic cyclin Clb2p. The remaining 300 genes did not have good binding sites, and the regulatory mechanism remained unclear. It was proposed that fluctuations in expression of these genes were not strong enough to be detected by this method, and that most of them actually harbour known or novel binding sites for cell cycle regulated gene expression. The set of 800 putative cell cycle-regulated genes found by Spellman et al. (1998) contained the majority (304 out of 421) of cell cycle regulated genes described earlier by Cho et al. (1998) plus an additional 496 genes. These additional genes included, for example, the upstream site for SWI5, the main cell cycle control element. Diversity of experiments helped Spellman et al.
to improve the signal-to-noise ratio in their final data set. This way they could distinguish cell cycle regulation from confounding patterns such as those caused by heat shock response when a culture was shifted from one temperature to another. Relative pentamer information (an oligonucleotide bias measure) of each phase group versus the control group of non-cell-cycle genes was used for cell cycle studies (Zhang, 1999b). Roth et al. (1998) developed a local alignment program, AlignACE, based on Gibbs Motif Sampling and tested it with three regulatory systems in yeast: galactose response, heat shock, and mating type. It is known that the genes of many protein complexes, such as the proteasome, are usually coregulated. Many groups have identified the proteasome regulatory site GGTGGCAA (Mannhaupt et al., 1999) in silico. Another motif, AAAATTTT, has been reported by several groups as one of the most significant ones. Sequences rich in A and T have high intrinsic curvature, i.e. they are bent easily. The more consecutive adenines (A) there are, the more flexible the DNA sequence is. These sequences (for
example: AAAATTTT) have been found upstream from many genes, and have been suggested to bring other regulatory elements close together, and thus enhance the interactions of transcription factors that bind DNA. Note that the reversed sequence, TTTTAAAA, is not nearly as strong (AAAATTTT is its own reverse complement). Alignment of all against all yeast upstream sequences was performed by Hampson et al. (2000), and the alignment scores were correlated with an oxidative stress expression time course. This study suggested a few new putative regulatory binding sites. A method for discovering signals in large sets of genome sequences has been developed by Brazma et al. (1998b) and Vilo et al. (2000), which is based on an exhaustive search for a priori unknown statistically significant sequence patterns of unrestricted length and their subsequent clustering by similarity. Note that we do not attempt to discover only the consensus patterns for sequences in the clusters, as these consensus patterns are not necessarily distinctive for the cluster but can be a feature of the background. The statistical significance of patterns in a cluster is measured by taking into account the background distribution, i.e. the number of occurrences of each pattern in background sequences (in this case a set of all putative promoter regions). The clustering of the discovered patterns by similarity makes it possible to report thousands of patterns in a concise way for a human investigator. From clusters of patterns, we derive alignments, profiles and consensus patterns. Although for large-scale analysis it is most convenient to pipeline these stand-alone tools from scripts, we have implemented a WWW interface for most of these tools while also keeping in mind the interconnection between them. Practical applicability of our methods on a genomic scale is demonstrated by the following computational experiment.
We systematically clustered all yeast genes based on their expression responses to 80 experimental conditions (data from Eisen et al., 1998) by K-means clustering, evaluating simultaneously the ‘goodness’ of each cluster by the average silhouette value. By choosing different K-values and varying the initial partitioning, we obtained over 52 000 different clusters (many of these clusters are highly overlapping). For each of these clusters, we retrieved the 600-bp DNA sequences upstream of the respective genes, and exhaustively searched for all the sequence patterns of unrestricted length that are overrepresented in the sequences of the cluster. Patterns were rated for each cluster according to a binomial distribution, with the expected probability of each pattern based on its occurrence frequency in all upstream sequences. Similar pattern discovery was repeated for randomized clusters to assess the significance threshold for such patterns. From the over 6000 significant patterns, we excluded those discovered only from clusters containing highly homologous upstream sequences. In this way we could list 1498 of the most interesting patterns for further studies. We clustered
these patterns using our pattern clustering method into 62 groups. For all of these groups, an approximate alignment and consensus pattern is generated. To assess the quality of the patterns, we matched all 1498 patterns against the experimentally verified yeast binding sites as given in the SCPD database (Zhu and Zhang, 1999). Of the 62 groups, 48 had patterns matching some sites in the SCPD database. One of the strong aspects of the SPEXS algorithm is the speed of pattern enumeration, which is achieved by extending a lazy suffix tree construction algorithm to handle the group characters as well as wildcards of variable length. Note that any advanced pattern ranking criteria could be incorporated into the algorithm relatively easily. The SPEXS program can be run on several sets of input sequences in one go, while simultaneously calculating the numbers and locations of all occurrences of all patterns in all input sets separately. From these counts, pattern ranking functions can be calculated efficiently. Note that the method used in Vilo et al. (2000) is similar to that used in van Helden et al. (1998), except that the tabulation of word frequencies is not necessary, as these can be calculated on demand. Thus pattern discovery can be performed on several variable length regions of upstream sequences without the need to calculate background probabilities for each region separately. The SPEXS algorithm is also able to analyse patterns with wildcards, like the spaced dyads (van Helden et al., 2000), for example. The aim of our studies is to mine automatically for new, statistically significant patterns in putative regulatory regions of genes. Such data mining experiments are not a substitute for ‘conventional single-gene dissections’ (Zhang, 1999a). Their aim is instead to explore simultaneously thousands of genes in silico (which cannot be done by conventional methods) to generate targets for conventional studies in vitro.
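The per-set counting of pattern occurrences described above for SPEXS can be imitated, for a single given pattern, with the following Python sketch. It uses Python's `re` module rather than a suffix tree and matches one pattern instead of enumerating all of them. It assumes upper-case DNA sequences in which a lower-case `n` in the pattern is a single-position wildcard and a bracketed group such as `[CG]` is a group character. The names `compile_motif` and `count_motif` and the toy sequences are hypothetical.

```python
import re


def compile_motif(pattern):
    """Compile a motif such as 'CGGnnnCCG' or 'TGA[CG]TCA'.

    The zero-width lookahead makes overlapping occurrences visible as well.
    """
    return re.compile("(?=(" + pattern.replace("n", "[ACGT]") + "))")


def count_motif(pattern, sequence_sets):
    """Count a motif in several named sets of sequences at once.

    sequence_sets maps a set name (e.g. 'cluster', 'background') to a
    dict {gene: upstream sequence}. For each set the result records how many
    sequences contain the motif, the total number of occurrences, and the
    start positions per gene.
    """
    rx = compile_motif(pattern)
    summary = {}
    for set_name, seqs in sequence_sets.items():
        positions = {gene: [m.start() for m in rx.finditer(seq)]
                     for gene, seq in seqs.items()}
        summary[set_name] = {
            "sequences": sum(1 for p in positions.values() if p),
            "occurrences": sum(len(p) for p in positions.values()),
            "positions": positions,
        }
    return summary


if __name__ == "__main__":
    sets = {
        "cluster": {"g1": "AATGACTCAGG", "g2": "CCTGAGTCATT"},
        "background": {"g1": "AATGACTCAGG", "g2": "CCTGAGTCATT",
                       "g3": "ACGTACGT", "g4": "TTCGGATACCGAA"},
    }
    for motif in ("TGA[CG]TCA", "CGGnnnCCG"):
        result = count_motif(motif, sets)
        print(motif, {name: (r["sequences"], r["occurrences"])
                      for name, r in result.items()})
```

The counts for the cluster and the background set are exactly the inputs that a rating function such as a binomial or hypergeometric tail probability needs.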
To assist biologists in understanding the in silico discoveries, a way to summarize and visualize the findings is needed. A visualization method was recently proposed showing, for each predicted site, the average profile for all genes that contain the motif in their upstream sequences (Chiang et al., 2001). An alternative visualization technique has been developed by the first author and is shown in Fig. 1.
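The averaging idea behind such a profile can be sketched in a few lines of Python. This is only a simplified reading of the description above, not the published method of Chiang et al. (2001): it selects the genes whose upstream sequence matches a motif and averages their expression values condition by condition. The function name `mean_profile_for_motif` and the toy data are assumptions.

```python
import re


def mean_profile_for_motif(motif, upstream, expression):
    """Average expression profile of all genes whose upstream region
    contains the motif.

    motif: regular expression string, e.g. 'ACGCG'
    upstream: dict {gene: upstream sequence}
    expression: dict {gene: list of expression values, one per condition}
    Returns (genes, profile); profile is None if no gene matches.
    """
    rx = re.compile(motif)
    genes = [g for g, seq in upstream.items()
             if g in expression and rx.search(seq)]
    if not genes:
        return genes, None
    n_conditions = len(expression[genes[0]])
    profile = [sum(expression[g][c] for g in genes) / len(genes)
               for c in range(n_conditions)]
    return genes, profile


if __name__ == "__main__":
    upstream = {"g1": "TTACGCGTT", "g2": "GGACGCGAA", "g3": "TTTTTTTT"}
    expression = {"g1": [1.0, 2.0, 3.0],
                  "g2": [3.0, 2.0, 1.0],
                  "g3": [0.0, 0.0, 0.0]}
    print(mean_profile_for_motif("ACGCG", upstream, expression))
```

Plotting one such profile per predicted motif gives a compact summary of which conditions the motif-containing genes respond to.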
5. Discussion

All the different methods that have been presented in the literature may be useful in detecting certain aspects of the regulation. Systematic evaluation of gene expression clustering methods may become feasible when we have a better understanding of the discovery of most interesting patterns. It is easy to envisage a simple experiment that would try out different distance measures and different clustering methods, and perform systematic regulatory site identification on the resulting clusters. The clustering methods outperforming others
(in the sense that stronger sequence signals can be discovered from their clusters) should be used. The problem, however, is not simple, as we have to have the appropriate pattern language in which to express regulatory signals and meaningful quality criteria for this pattern language. The clustering methods used for this kind of analysis may also need to produce overlapping clusters, or perform clustering by only a subset of all conditions. Individual regulatory sites in isolation do not provide enough information about gene regulation, as most of the motifs representing these sites occur almost randomly over all chromosomes (Werner, 1999; Fickett and Hatzigeorgiou, 1997). Approaches where combinations of individual patterns are analyzed (Brazma et al., 1997; Wagner, 1999), or methods that allow combinations of sites to be discovered during the pattern discovery phase, may become very useful in addressing the need for higher-level organisation of individual sites. The data mining techniques to discover frequent combinations (Mannila et al., 1994; Brazma et al., 1997), association rules (Mannila et al., 1994) and episode rules, which add order to association rules (Mannila et al., 1997), for example, already exist and could be directly applied to bioinformatics, although currently these methods are mainly used in other domains. A combination of binding sites can be evaluated, for example, based on the following parameters:
1. Coverage: the number of occurrences of the combination in upstream regions.
2. Goodness: the ratio of the number of its occurrences in the upstream regions versus the number of occurrences in random regions (of the same length and number).
3. Unexpectedness: the ratio of its occurrences versus the expected number of occurrences based on the individual sites.
Combinations of sites with high values for all of these criteria are looked for (Brazma et al., 1997). One has to be careful in using these criteria, though.
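The three parameters can be made concrete with a small Python sketch. It rests on assumptions that are not stated in the text: an occurrence of a combination is taken to mean that one region contains every site of the combination, sites are plain substrings, and the expected number of occurrences assumes that the individual sites occur independently of each other. The name `rate_combination` and the toy sequences are hypothetical.

```python
from math import prod


def contains_all(seq, sites):
    """True if the region contains every site of the combination."""
    return all(site in seq for site in sites)


def rate_combination(sites, upstream, random_regions):
    """Return (coverage, goodness, unexpectedness) for a site combination.

    upstream and random_regions are lists of sequences; the random regions
    are assumed to match the upstream regions in number and length.
    """
    coverage = sum(contains_all(s, sites) for s in upstream)
    in_random = sum(contains_all(s, sites) for s in random_regions)
    goodness = coverage / in_random if in_random else float("inf")
    n = len(upstream)
    # Expected count if the individual sites occurred independently.
    expected = n * prod(sum(site in s for s in upstream) / n for site in sites)
    unexpectedness = coverage / expected if expected else 0.0
    return coverage, goodness, unexpectedness


if __name__ == "__main__":
    upstream = ["AAACCC", "AAAGGG", "CCCAAA", "TTTTTT"]
    random_regions = ["AAATTT", "CCCTTT", "AAACCC", "GGGGGG"]
    print(rate_combination(("AAA", "CCC"), upstream, random_regions))
```

Because the expected count is built from the frequencies of the individual sites, any overrepresentation of a single site directly raises the expected count of every combination containing it.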
For example, a slight overrepresentation of the individual sites may also make their combinations more probable. Now that expression data can give us many hints about which genes should be expected to be coregulated, one could also start revising these criteria and could develop, for example, criteria that give a higher rating to combinations whose occurrences are concentrated upstream of sets of genes that share similar expression profiles. Expression analysis with DNA microarrays is unable to distinguish direct regulatory effects from indirect effects, and thus our ability to identify genes that are controlled by specific regulatory factors is limited. Genome-wide location analysis, which combines a chromatin immunoprecipitation procedure with DNA microarray analysis, provides information on the binding sites at which proteins reside throughout the genome under various conditions in vivo (Ren et al., 2000). This novel microarray analysis method will make it possible to distinguish binding sites that are active in vivo
Fig. 1. The combined visualisation of the hierarchical expression data clustering and several regulatory motifs on the 600 bp upstream regions for each gene. Yeast expression data represent expression profiles of about 850 yeast genes and 80 timepoints from cell cycle, sporulation and diauxic shift experiments. Binding site visualisation shows whether upstream sequence has a match for the particular site (on the left) and where along the sequence each motif is located (in the middle). Motifs visualised are AAAATTTT (curvature element, red), ACGCG (MluI cell cycle box MCB, yellow), GGTGGCAA (proteasomal element, light blue), ACCAGC (SWI5 site, pink), CGGnnnnnnnnnnnCCG (GAL4 site, blue), and TGA[CG]TCA (Gcn4 site, green). The visualisation is performed with WWW-tools EPCLUST, URLMAP, GENOMES, and PATMATCH, all part of the Expression Profiler, http://ep.ebi.ac.uk/ (Vilo, 2001).
from others, thus enabling more direct dissection of regulatory networks (Ren et al., 2000; Iyer et al., 2001). Identifying regulatory sequences in the human genome presents new challenges compared to yeast. First, the sheer size and complexity of the human genome makes predictions more difficult. Gene regulatory elements in human are frequently found much farther away from the gene than in yeast, and can easily be hidden in the bulk of non-coding sequences. There are some strategies which could be used (Zhang, 1998), and we envisage that intensive research will continue on the subject. A review on identification of mammalian regulatory sequences was published recently (Pennacchio and Rubin, 2001). Annotation systems for large-scale annotation of human promoters have also started to emerge (Scherf et al., 2001). One way to successfully identify regulatory elements in the human genome would be to compare human gene upstream sequences with upstream sequences from other species, e.g. other mammals or birds. So far, most studies have compared the partly sequenced human and mouse genomes, but it has become clear that no single species can be completely informative, mainly due to different mutation rates of genes and genomes. Thus, it has been suggested that several organisms should be used in comparisons to aid prediction of regulatory elements in the human genome. Many researchers have been successful in discovering many known and also novel putative binding sites. The maturation of the field will be proven only by carrying out systematic follow-up studies in wet labs. To assist this goal, promoter databases should be developed further so that verification of in silico predictions will become easier. The information stored in these databases will help us build up the knowledge base of known and hypothesised gene regulatory network models, and these can be invaluable in assisting biologists in experiment design.
References

Apostolico, A., Bock, M.E., Lonardi, S., Xu, X., 2000. Efficient detection of unusual words. J. Comput. Biol. 7 (1-2), 71–94. Bailey, T.L., Elkan, C., 1995a. The value of prior knowledge in discovering motifs with MEME. Proc. Int. Conf. Intell. Syst. Mol. Biol. 3, 21–29. Bailey, T.L., Elkan, C., 1995b. Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning 21 (1-2), 51–80. Brazma, A., 2001. On the importance of standardisation in life sciences. Bioinformatics 17 (2), 113–114. Brazma, A., Jonassen, I., Eidhammer, I., Gilbert, D., 1998a. Approaches to the automatic discovery of patterns in biosequences. J. Comput. Biol. 5 (2), 279–305. Brazma, A., Jonassen, I., Vilo, J., Ukkonen, E., 1998b. Predicting gene regulatory elements in silico on a genomic scale. Genome Res. 8 (11), 1202–1215. Brazma, A., Robinson, A., Cameron, G., Ashburner, M., 2000. One-stop shop for microarray data. Nature 403 (6771), 699–700. Brazma, A., Vilo, J., 2000. Gene expression data analysis. FEBS Lett. 480 (1), 17–24.
Brazma, A., Vilo, J., Ukkonen, E., Valtonen, K., 1997. Data mining for regulatory elements in yeast genome. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5, 65–74. Bussemaker, H.J., Li, H., Siggia, E.D., 2000. Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc. Natl. Acad. Sci. USA 97 (18), 10096–10100. Bussemaker, H.J., Li, H., Siggia, E.D., 2001. Regulatory element detection using correlation with expression. Nat. Genet. 27 (2), 167–171. Celis, J.E., Kruhoffer, M., Gromova, I., Frederiksen, C., Ostergaard, M., Thykjaer, T., Gromov, P., Yu, J., Palsdottir, H., Magnusson, N., Orntoft, T.F., 2000. Gene expression profiling: monitoring transcription and translation products using DNA microarrays and proteomics. FEBS Lett. 480 (1), 2–16. Chiang, D.Y., Brown, P.O., Eisen, M.B., 2001. Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Proceedings of ISMB 2001. Bioinformatics 17, S49–S55. The Chipping Forecast, 1999. Nat. Genet. 21(1). Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., Davis, R.W., 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell. 2 (1), 65–73. Chu, S., De Risi, J., Eisen, M.B., Mulholland, J., Botstein, D., Brown, P.O., Herskowitz, I., 1998. The transcriptional program of sporulation in budding yeast. Science 282 (5389), 699–705. D’haeseleer, P., Wen, X., Fuhrman, S., Somogyi, R., 1998. Mining the gene expression matrix: inferring gene relationships from large scale gene expression data. In: Paton, R.C., Holcombe, M. (Eds.), Information Processing in Cells and Tissues. Plenum, New York, pp. 203–212. De Risi, J.L., Iyer, V.R., Brown, P.O., 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278 (5338), 680–686.
Dopazo, J., Zanders, E., Dragoni, I., Amphlett, G., Falciani, F., 2001. Methods and approaches in the analysis of gene expression data. J. Immunol. Methods 250 (1-2), 93–112. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D., 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95 (25), 14863–14868. Fickett, J.W., Hatzigeorgiou, A.G., 1997. Eukaryotic promoter recognition. Genome Res. 7 (9), 861–878. Frech, K., Quandt, K., Werner, T., 1997. Software for the analysis of DNA sequence elements of transcription. Comput. Appl. Biosci. 13 (1), 89–97. Getz, G., Levine, E., Domany, E., Zhang, M.Q., 2000. Super-paramagnetic clustering of yeast gene expression profile. Physica A 279, 457–464. Gusfield, D., 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York. Hampson, S., Baldi, P., Kibler, D., Sandmeyer, S.B., 2000. Analysis of yeast’s ORF upstream regions by parallel processing, microarrays, and computational methods. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 190–201. Hartigan, J.A., 1975. Clustering Algorithms. Wiley, New York. Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D., Brown, P.O., 2000. ‘Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol. 1(2). Hegde, P., Qi, R., Abernathy, K., Gay, C., Dharap, S., Gaspard, R., Hughes, J.E., Snesrud, E., Lee, N., Quackenbush, J., 2000. A concise guide to cDNA microarray analysis. Biotechniques 29 (3), 548–556. Herrero, J., Valencia, A., Dopazo, J., 2001. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics 17 (2), 126–136. Hertz, G.Z., Hartzell, 3rd G.W., Stormo, G.D., 1990. Identification of
consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci. 6 (2), 81–92. Hertz, G.Z., Stormo, G.D., 1999. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15 (7-8), 563–577. Heyer, L.J., Kruglyak, S., Yooseph, S., 1999. Exploring expression data: identification and analysis of coexpressed genes. Genome Res. 9 (11), 1106–1115. Holmes, I., Bruno, W.J., 2000. Finding regulatory elements using joint likelihoods for sequence and expression profile data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 202–210. Hughes, J.D., Estep, P.W., Tavazoie, S., Church, G.M., 2000a. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296 (5), 1205–1214. Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D., Kidd, M.J., King, A.M., Meyer, M.R., Slade, D., Lum, P.Y., Stepaniants, S.B., Shoemaker, D.D., Gachotte, D., Chakraburtty, K., Simon, J., Bard, M., Friend, S.H., 2000b. Functional discovery via a compendium of expression profiles. Cell 102 (1), 109–126. Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., Brown, P.O., 2001. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409 (6819), 533–538. Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Comput. Surv. 31 (3), 264–323. Jakt, L.M., Cao, L., Cheah, K.S., Smith, D.K., 2001. Assessing clusters and motifs from gene expression data. Genome Res. 11 (1), 112–123. Jensen, L.J., Knudsen, S., 2000. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics 16 (4), 326–333. Jonassen, I., 1997. Efficient discovery of conserved patterns using a pattern graph. Comput. Appl. Biosci.
13, 509–522. Jonassen, I., Collins, J.F., Higgins, D.G., 1995. Finding flexible patterns in unaligned protein sequences. Protein Sci. 4 (8), 1587–1595. Kohonen, T., 1997. Self-Organizing Maps. Springer, Berlin. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C., 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208–214. Legendre, P., Legendre, L., 1998. Numerical Ecology. Developments in Environmental Modelling. Elsevier, Amsterdam. Mannhaupt, G., Schnall, R., Karpov, V., Vetter, I., Feldmann, H., 1999. Rpn4p acts as a transcription factor by binding to PACE, a nonamer box found upstream of 26S proteasomal and other genes in yeast. FEBS Lett. 450 (1-2), 27–34. Mannila, H., Toivonen, H.T., Verkamo, I.A., 1994. Efficient algorithms for discovering association rules. In: Knowledge Discovery in Databases (KDD’94). AAAI Press, Seattle, WA, pp. 181–192. Mannila, H., Toivonen, H.T., Verkamo, I.A., 1997. Discovery of frequent episodes in event sequences. Data Mining Knowledge Discovery 1 (3), 259–289. Marsan, L., Sagot, M.F., 2000. Extracting structured motifs using a suffix-tree: Algorithms and application to promoter consensus identification. In: Proceedings RECOMB’2000, Tokyo. McGuire, A.M., Hughes, J.D., Church, G.M., 2000. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res. 10 (6), 744–757. McCreight, E.M., 1976. A space-economical suffix tree construction algorithm. J. ACM 23 (2), 262–272. MGED. Microarray Gene Expression Database Group, http://www.mged.org/ Möller, S., Vilo, J., Croning, M.D.R., 2001. Prediction of the coupling specificity of GPCRs to their G proteins. Proceedings of ISMB 2001. Bioinformatics, S174–S181. Neuwald, A.F., Liu, J.S., Lawrence, C.E., 1995. Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci. 4 (8), 1618–1632.
Ohler, U., Niemann, H., 2001. Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. 17 (2), 56–60. Pennacchio, L.A., Rubin, E.M., 2001. Genomic strategies to identify mammalian regulatory sequences. Nat. Rev. Genet. 2 (2), 100–109. Quackenbush, J., 2001. Computational analysis of microarray data. Nat. Rev. Genet. 2 (6), 418–427. Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., Volkert, T.L., Wilson, C.J., Bell, S.P., Young, R.A., 2000. Genome-wide location and function of DNA binding proteins. Science 290 (5500), 2306–2309. Roth, F.P., Hughes, J.D., Estep, P.W., Church, G.M., 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16 (10), 939–945. Scherf, M., Klingenhoff, A., Frech, K., Quandt, K., Schneider, R., Grote, K., Frisch, M., Gailus-Durner, V., Seidel, A., Brack-Werner, R., Werner, T., 2001. First pass annotation of promoters on human chromosome 22. Genome Res. 11 (3), 333–340. Sharan, R., Shamir, R., 2000. CLICK: a clustering algorithm with applications to gene expression analysis. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 307–316. Sinha, S., Tompa, M., 2000. A statistical method for finding transcription factor binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8, 344–354. Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B., 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell. 9 (12), 3273–3297. Stormo, G.D., 2000. DNA binding sites: representation and discovery. Bioinformatics 16 (1), 16–23, Review. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., Golub, T.R., 1999.
Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96 (6), 2907–2912. Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., Church, G.M., 1999. Systematic determination of genetic network architecture. Nat. Genet. 22 (3), 281–285. ¨ ¨ Toronen, P., Kolehmainen, M., Wong, G., Castren, E., 1999. Analysis of gene expression data using self-organizing maps. FEBS Lett. 451 (2), 142–146. Ukkonen, E., 1995. Constructing suffix trees on-line in linear time. Algorithmica 14 (3), 249–260. Vanet, A., Marsan, L., Sagot, M.F., 1999. Promoter sequences and algorithmical methods for identifying them. Res. Microbiol. 150, 779–799. van Helden, J., Andre, B., Collado-Vides, J., 1998. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 281 (5), 827– 842. van Helden, J., Rios, A.F., Collado-Vides, J., 2000. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 28 (8), 1808–1818. Vilo, J., 1998. Discovering Frequent Patterns from Strings. Technical Report C-1998-9. Department of Computer Science, University of Helsinki, pp. 20. Vilo, J., 2001. Expression Profiler. http: / / ep.ebi.ac.uk / Vilo, J., Brazma, A., Jonassen, I., Robinson, A., Ukkonen, E., 2000. In: Mining for Putative Regulatory Elements in the Yeast Genome Using Gene Expression Data. ISMB-2000. AAAI Press, Seattle, WA, pp. 384–394. Wang, J., Shapiro, B., Shasha, A. (Eds.), 1999. Pattern Discovery in Biomolecular Data. Oxford University Press, New York. Wagner, A., 1999. Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics 15 (10), 776–784.
Werner, T., 1999. Models for prediction and recognition of eukaryotic promoters. Mamm. Genome 10, 168–175.
Wheeler, D.L., Church, D.M., Lash, A.E., Leipe, D.D., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Tatusova, T.A., Wagner, L., Rapp, B.A., 2001. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 29 (1), 11–16.
Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Prüß, M., Reuter, I., Schacherer, F., 2000. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28, 316–319.
Wolfertstetter, F., Frech, K., Herrmann, G., Werner, T., 1996. Identification of functional elements in unaligned nucleic acid sequences by a novel triple search algorithm. Comput. Appl. Biosci. 12, 71–80.
Wolfsberg, T.G., Gabrielian, A.E., Campbell, M.J., Cho, R.J., Spouge, J.L., Landsman, D., 1999. Candidate regulatory sequence elements for cell cycle-dependent transcription in Saccharomyces cerevisiae. Genome Res. 9 (8), 775–792.
Zhang, M.Q., 1998. Identification of human gene core promoters in silico. Genome Res. 8 (3), 319–326.
Zhang, M.Q., 1999a. Large-scale gene expression data analysis: a new challenge to computational biologists. Genome Res. 9 (8), 681–688.
Zhang, M.Q., 1999b. Promoter analysis of co-regulated genes in the yeast genome. Comput. Chem. 23 (3-4), 233–250.
Zhu, J., Zhang, M.Q., 2000. Cluster, function and promoter: analysis of yeast expression array. In: Pacific Symposium on Biocomputing, pp. 479–490.
Zhu, J., Zhang, M.Q., 1999. SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics 15 (7–8), 607–611.