Gene expression microarray data analysis demystified

Peter C. Roberts
VizX Labs, 200 West Mercer Street, Suite 500, Seattle, WA 98119, USA

Abstract. The increasing use of gene expression microarrays, and the depositing of the resulting data into public repositories, means that more investigators are interested in using the technology either directly or through meta-analysis of the publicly available data. The tools available for data analysis have generally been developed for use by experts in the field, making them difficult for the general research community to use. For those interested in entering the field, especially those without a background in statistics, it is difficult to understand why experimental results can be so variable. The purpose of this review is to go through the workflow of a typical microarray experiment, to show that decisions made at each step, from choice of platform through statistical analysis methods to biological interpretation, are all sources of this variability.

Keywords: microarray, microarray data analysis, gene expression, normalization, preprocessing, statistical analysis, clustering, cross-platform comparison, correction for multiple testing, pathway, gene ontology.
Tel.: +1-206-283-4363. Fax: +1-206-283-1606. E-mail: [email protected] (P.C. Roberts).
Biotechnology Annual Review, Volume 14. ISSN 1387-2656. DOI: 10.1016/S1387-2656(08)00002-1. © 2008 Elsevier B.V. All rights reserved.

Introduction

Since the original description of the use of cDNA microarrays in gene expression analysis in 1995 [1], followed a year later by oligonucleotide arrays [2], the technology has rapidly moved from the domain of specialists to being available to the whole research community, through core facilities at most academic research institutions. The arrays themselves have evolved from representing fewer than 50 genes to addressing over 50,000 transcripts on whole genome arrays for complex mammalian genomes, in some cases represented by over 1 million individual features. These arrays have traditionally measured the differential expression of known and putative protein-coding genes.

The gene expression microarray data analysis process can be broken down into three main parts: preprocessing, the conversion of the signal from an array scanner to a normalized value appropriate for comparison of expression across the arrays in a study; comparative statistical analysis, to identify significantly differentially expressed genes or co-expressed genes; and biological interpretation, preferably with a statistical measure of significance. Early microarray experiments only considered the difference in expression of a gene between two sets of samples, producing highly variable results; thus, it was soon realized that statistical analysis was required to obtain meaningful results. Initially standard statistical methods were used to identify significantly differentially expressed genes and these are still widely employed. However, the low number of replicates used in a typical microarray study, the variability introduced by the many technical steps in preparing samples, and the large number of individual tests being performed on a single array make standard statistical methods inappropriate, so new statistical analysis methods have been developed. Many of these have been implemented in the open-source statistical language R [3], usually as part of the Bioconductor project [4]. Unfortunately these different approaches can produce quite different results; consequently, no ‘best method’ can be identified. Recently it has been realized that there are common themes in these approaches [5], helping to identify the most appropriate methods.

As microarrays have become more widely accessible to the broader research community, they are increasingly being utilized by investigators with limited knowledge of statistics. These researchers tend to be more interested in biological insights, whereas the emphasis in terms of analysis tools has been on generating statistically robust gene lists, often at the expense of biological interpretation [6]. Also, as gene expression data are now required to be deposited in publicly accessible repositories, researchers are performing meta-analysis of these data. Again, they often lack the knowledge to assess the quality of the data sets of interest to them and to choose the appropriate analysis approach. Little assistance is readily available to these ‘unsophisticated’ users, and most of the analysis tools available to them are designed for statisticians and bioinformaticians and assume a level of prior knowledge to be able to use them effectively. In this review I will endeavor to provide a general overview of the whole gene expression microarray data generation and analysis process.
The focus will be on commercial gene expression microarray platforms, as custom in-house spotted microarrays are increasingly being supplanted by commercial microarrays. In particular, the emphasis will be on the whole genome microarrays that have been designed to address the transcriptome of a given organism. Other applications for microarrays have been developed for genotyping, microRNA measurement, chromosomal region copy number and other areas, which will not be addressed in this review.

Gene expression microarray platforms

Microarray designs have utilized spotted cDNAs [1] or oligonucleotide probes [2]. Depending on the platform, oligonucleotides are synthesized in situ on glass slides or synthesized, purified and attached to substrates. In the case of the Illumina platform, the purified oligonucleotides are attached to beads that are randomly dispensed into wells on a slide. In addition there is the choice of one-color or two-color experimental designs. For one-color systems a single sample is hybridized to the array. For two-color experiments, two RNA samples, labeled with different dyes, are mixed and hybridized on the same microarray. Two-color systems can provide experimental design flexibility, since the second sample can either be an experimental sample or a standard RNA sample applied to all the arrays [7,8]. Although the original idea behind the development of two-color systems was that the competitive hybridization would reduce errors due to slide variation, it has been found, from well-controlled experiments, that analyzing the two samples independently, rather than generating ratios, can increase experimental accuracy [9].

The commercial microarray platforms are predominantly one-color oligonucleotide microarrays. The Agilent platform is one exception: initially designed as a two-color system [10], it has been modified for the analysis of one-color or two-color experiments. Most companies offer focused arrays for specific research areas and well-characterized genes and transcripts. They also offer design and manufacturing of custom arrays. Over time the array manufacturers have increased the number of features that can be placed on one slide. Now several manufacturers offer multiplexed whole genome arrays.

Table 1 compares the technologies of the major commercial whole genome microarray platforms. The key difference between platforms lies in the number and length of oligonucleotide probes on the microarrays: either short-oligonucleotide (25–30 bases) or long-oligonucleotide (50–70 bases). Long oligonucleotides have greater sensitivity and are better for analyzing low copy number mRNAs [11–13]; short oligonucleotides have better specificity, being less likely to cross-hybridize with other RNAs [14]. Most platforms have a single probe for each target, each probe occurring once at a fixed location on the array.
Two platforms have multiple probes per target: Illumina BeadChips have over 20 randomly located technical replicates for each probe; Affymetrix 3′ expression GeneChips have 11 probe pairs per probe set and the recently introduced GeneChip Gene ST arrays have 26 probes per gene. In addition to the differences in technology, the commercial companies differ in the number of species-specific arrays they offer. Affymetrix, through their GeneChip Consortia Program, offers a large number of microarrays to support different eukaryotic genomics projects. Nimblegen has focused on arrays for prokaryotic genomics. All manufacturers have whole genome arrays for human, mouse and rat.

Table 1. Comparison of human whole genome microarray platforms. Data obtained from manufacturers' websites.

| Manufacturer: array | Oligo length | Oligonucleotide deposition | Targets | Probes per target | Probe location | Total features | Arrays per slide |
|---|---|---|---|---|---|---|---|
| Affymetrix: U133 plus 2 | 25 | In-situ photolithography | 54,000 | 11 pairs | 3′ end | 1,300,000 | 1 |
| Affymetrix: Human gene ST 1.0 | 25 | In-situ photolithography | 28,869 | 26 | Exons | 764,885 | 1 |
| Agilent: Human 4×44k | 60 | In-situ ink-jet | 42,000 | 1 | 3′ end | 44,000 | 4 |
| Applied Biosystems: Human genome survey v2 | 60 | Spotted | 34,000 | 1 | 3′ end | 35,000 | 1 |
| Applied Microarrays: CodeLink human whole genome | 30 | Spotted | 57,347 | 1 | 3′ end | 54,841 | 1 |
| Illumina: Human-6 v2 | 50 | Beads | 48,000 | 1 (>20 replicates) | 3′ end | 10,000,000 | 6 |
| Illumina: HumanRef-8 v2 | 50 | Beads | 24,000 | 1 (>20 replicates) | 3′ end | 7,000,000 | 8 |
| Nimblegen: HG18 | 60 | In-situ photolithography | 47,633 | 8 | 3′ end | 385,000 | 1 |
| Nimblegen: HG18 4plex | 60 | In-situ photolithography | 22,000 | 3 | 3′ end | 72,000 | 4 |
| Phalanx Biotech: Human OneArray | 60 | Spotted | 30,968 | 1 | 3′ end | 32,050 | 1 |

Gene content

Most commercial microarrays for human, mouse and rat whole genomes share common targets, primarily transcripts from the National Center for Biotechnology Information (NCBI) Reference Sequence collection (RefSeq) [15]. Each manufacturer has tried to expand beyond this basic set of well-defined targets, in an endeavor to address all the protein-coding genes on the genome. In most cases this predated the completion of the sequencing of the human genome [16,17], when it was assumed that there would be many more protein-coding genes than were ultimately found in complex mammalian genomes. UniGene clusters [18] with a limited number of expressed sequence tags (EST), proprietary sequences not available publicly, and predicted genes were all sources of additional target sequences. In most cases, probes were designed to target the 3′ end of individual transcripts.

The ability to map transcripts to the genome, an important modification to UniGene cluster generation, has led to consolidation into fewer transcriptional loci. The public sequence databases are continually being updated and UniGene clusters are revised on a regular basis, resulting in reassignment of some transcripts. One result of this is that probes previously thought to map to different genes have been shown to map to the same gene, in some cases to the same transcript. This redundancy needs to be taken into consideration during biological interpretation. Also, probes previously mapped to a gene can be disassociated from that gene, which can cause confusion when gene lists are reanalyzed. Consequently, care should be taken, when drawing conclusions from microarray experiments, to confirm that the probe target is an authentic transcript of the gene it is mapped to. The microarray manufacturers do update their probe mapping, but usually not on a consistent basis and not in sync with the public databases. This can lead to ambiguities between public information and the probe annotation supplied by the array manufacturers. Another source of ambiguity, documented for Affymetrix 3′ expression arrays in particular, is probe sequence inaccuracies [19–25], where the probes do not match their stated target sequence. There are also probes that map to more than one gene and are therefore not specific.
Several groups have created modified CDF files to eliminate these ambiguous probes from analysis, leading to more reliable results [21,22,25].

Gene expression microarray experiment process

The standard workflow for a microarray experiment is shown in Fig. 1. The key to successful gene expression experiments is good experimental design and attention to detail. There are many technical steps between sample preparation and microarray scanning (Fig. 2), each of which can introduce new sources of error and bias, which can have a profound impact on data analysis and interpretation. To minimize the impact of technical error, it is advised that a single technician process all the samples at the same time and run them on microarrays from the same manufacturing batch [5]. If this is not possible, random samples for each condition should be distributed between either technicians or days to avoid bias [26]. The principle of randomization should also be used with multiplex arrays; in most cases more than one slide will be used and samples should be randomly assigned to the slides.
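The randomization principle described above can be illustrated with a short sketch. This is one possible scheme under stated assumptions, not a prescribed protocol: the function name `assign_to_slides`, the sample labels and the choice to balance conditions across slides before shuffling positions are all illustrative and do not come from the cited references.

```python
import random


def assign_to_slides(samples_by_condition, arrays_per_slide, seed=None):
    """Randomly assign samples to multiplex slides, keeping conditions mixed.

    samples_by_condition: dict mapping a condition name to a list of sample IDs
    (hypothetical input format chosen for this sketch).
    Returns a list of slides; each slide is a list of (sample, condition) pairs.
    """
    rng = random.Random(seed)

    # Shuffle the samples within each condition independently.
    groups = []
    for condition, samples in samples_by_condition.items():
        group = [(sample, condition) for sample in samples]
        rng.shuffle(group)
        groups.append(group)

    # Interleave the conditions so consecutive samples alternate condition.
    interleaved = []
    for i in range(max(len(g) for g in groups)):
        for g in groups:
            if i < len(g):
                interleaved.append(g[i])

    # Cut the interleaved list into slides, then shuffle array positions
    # within each slide so position on the slide is also randomized.
    slides = [
        interleaved[i:i + arrays_per_slide]
        for i in range(0, len(interleaved), arrays_per_slide)
    ]
    for slide in slides:
        rng.shuffle(slide)
    return slides
```

With four control and four treated samples and a four-array slide format, each slide in this sketch receives two samples from each condition, so a slide effect cannot be confused with the treatment effect. The same function could be reused to distribute samples between technicians or processing days by treating each day as a "slide".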
Fig. 1. Gene expression microarray experiment workflow. The goal of a typical microarray experiment is to identify genes that are statistically significantly differentially expressed and identify the underlying biological processes. The actual methods used at any point will be defined by the experimental design and the microarray platform. Workflow stages shown: Experimental design; Sample preparation (RNA extraction, reverse transcription, in vitro transcription); Microarray processing (hybridization, washing, scanning); Data preprocessing (non-specific signal correction, normalization, filtering); Differential expression (comparative statistics, multiple test correction, clustering); Biological interpretation (gene annotation, gene ontologies, pathway analysis).
Experimental design

Gene expression microarray experiments are designed for one of two purposes: evaluation of differential gene expression between groups, referred to as class comparison; or classification studies, referred to as class discovery and class prediction [27]. These experiments are expensive and time-consuming, and a good experimental design is essential to maximize the return in terms of usable information [26–30]. Experimental design involves a clear scientific hypothesis, an appreciation of the number of factors to be compared and the confidence that can be assigned to the observations: the simplest design is usually the best. Confidence in the results, the power of the analysis, is derived from using the appropriate number of replicates [31–35]. Microarray studies utilize two types of replicates, technical or biological. Biological replicates, samples from individual subjects, are the only choice for good biological inference [26,36]. Technical replicates, a single or pooled sample applied to multiple microarrays, only measure the consistency of the experimental system and provide limited biological information [26]. In some cases, such as two-color systems when controlling for dye-dependent effects, technical replicates are combined with biological replicates; the average intensity of the technical replicates should be used in subsequent statistical tests.

Due to the amount of data generated from microarray experiments, classic power calculation methods for estimating the number of replicates required in an experiment are inadequate, so microarray-specific methods have been proposed [31–35]. In general a minimum of five replicates per group is recommended. This number will depend on whether the samples come from inbred or outbred animals, as outbred animals show greater biological variance [34]. The reality of microarray experiments is that investigators tend to run a limited number of replicates due to sample or cost constraints; even so, no fewer than three replicates should be used [6].

Pooling of samples can be used when the cost of samples is much less than the cost of microarrays, or when insufficient RNA is available to run on an array and amplification methods are not desirable [33,37,38]. The same number of samples should be used for each pool and multiple pools should be used for each group. Pooling may introduce biases and does not provide the same statistical power as analyzing individual samples, but it is better than comparing a limited number of samples [5]. Pooling of samples is not appropriate for classification studies, as these rely on inter-individual variation and co-variation [5].

For investigators with limited statistical knowledge considering running microarray experiments, it cannot be overemphasized that time spent in experimental design with a statistician will ensure that valid conclusions can be drawn from the results, especially for more complex experimental designs.

Fig. 2. RNA sample processing. The traditional method for labeling RNA samples for hybridizing to microarrays has used 3′ in vitro transcription to generate labeled cRNA for hybridizing to the microarray. The microarray manufacturers usually provide kits for this process that contain external RNA controls (ERC). These are added either to the total RNA (tERC), as controls for the reverse transcription and in vitro transcription reactions; or to the cRNA (cERC) prior to or after fragmentation, as controls for the microarray processing steps. Affymetrix whole transcript arrays use a different protocol, using random hexamer priming to generate labeled cDNA that is hybridized to the microarray. This may include an rRNA reduction step, depending on the amount of starting RNA. A control oligo, added at the same time as the cERC, provides an additional control for the microarray processing. Diagram labels: purified total RNA; rRNA reduction; reverse transcription (cDNA); in vitro transcription (cRNA); 2nd cycle reverse transcription (cDNA); fragmentation (fragmented cRNA or fragmented cDNA); terminal labeling (fragmented biotinylated cDNA); hybridization. External RNA control addition points: tERC (Affymetrix; Agilent; Applied Biosystems); cERC (CodeLink); cERC (Affymetrix; Applied Biosystems); cERC and control oligo (Affymetrix).
Sample preparation

The steps involved in sample preparation are shown in Fig. 2. There are several different commercial reagents and kits in the market suitable for RNA isolation. The key factor is to make sure that the input total RNA sample is of high integrity. The standard method for assessing RNA quality is to use an Agilent 2100 Bioanalyzer, which can be used to generate an RNA integrity number (RIN) [39]. The microarray manufacturers generally provide or recommend reagent kits for labeling samples and kits are also available from other commercial sources. These kits usually use a modification of the Eberwine method to produce complementary RNA (cRNA) labeled with a dye or other functional group [40]. The kits for specific platforms often contain external RNA controls (ERC) that are complementary in sequence to control probes on the microarrays and are often present as a concentration series, so they can be used to assess concentration response [41,42]. The ERC are used to monitor both the RNA labeling reactions, by being added to the total RNA sample prior to cDNA synthesis, and the hybridization, washing and scanning process, by being added to the cRNA immediately prior to hybridization. In addition to this standard approach, there are several different methods for amplifying RNA from samples with low RNA concentrations [43]. As with pooling of samples, RNA amplification can introduce biases that should be taken into consideration when designing experiments and analyzing data.

Data preprocessing

The purpose of data preprocessing is to convert the raw signal for labeled RNA hybridized to a probe to a normalized value, an adjustment to account for variance from technical rather than biological sources [36]. Many papers have been published on preprocessing of microarray data, especially for two-color systems and Affymetrix arrays. Usually the process involves quantification of the signal from the microarray, background corrections and normalization within and across arrays. In most cases the microarray manufacturers provide software that adequately performs most of these functions. There are also preprocessing packages available in Bioconductor, developed by researchers not satisfied with the methods provided by the array manufacturers.

A logarithmic transformation followed by quantile normalization has become the preferred method of preprocessing for one-color microarrays [44]. Quantile normalization assumes that the arrays have a similar signal distribution, which is typical for most experiments. However, quantile normalization should not be used when comparing tissues with markedly different expression profiles. The purpose of the logarithmic transformation is to stabilize the variance inherent in the microarray data, changing calculations from multiplicative to additive [36]. However, as the intensity values approach zero, this transformation is less effective. To offset this, some algorithms add a constant to the intensity values, so that low-intensity signals have more stable variance [45].
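The two preprocessing steps described above, a logarithmic transformation (optionally with an added constant) followed by quantile normalization, can be sketched in a few lines. This is a minimal illustration in plain Python rather than the implementation used by any Bioconductor package; the function names, the `offset` parameter and the toy intensities are assumptions made for the example, and ties are broken by position rather than averaged.

```python
import math


def log2_transform(values, offset=0.0):
    """Log2-transform raw intensities; `offset` is an optional added constant
    (a hypothetical parameter standing in for the constant some algorithms add)."""
    return [math.log2(v + offset) for v in values]


def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (equal-length lists of values).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank, so every array ends up with the same distribution.
    """
    n = len(arrays[0])
    # For each array, the probe indices sorted from lowest to highest value.
    orders = [sorted(range(n), key=a.__getitem__) for a in arrays]
    # Mean across arrays of the value found at each rank.
    rank_means = [
        sum(a[o[r]] for a, o in zip(arrays, orders)) / len(arrays)
        for r in range(n)
    ]
    normalized = []
    for order in orders:
        out = [0.0] * n
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy data: two arrays, three probes each (invented raw intensities).
raw = [[2.0, 4.0, 8.0], [16.0, 4.0, 64.0]]
logged = [log2_transform(a) for a in raw]
normalized = quantile_normalize(logged)
```

After normalization both toy arrays contain the same set of values, assigned according to each probe's rank within its own array. This makes the stated assumption concrete: the method forces identical distributions, which is why it is unsuitable when the true distributions differ markedly between tissues.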
A large number of methods have been developed for preprocessing Affymetrix data [45,46]. This was because the standard Affymetrix software, Microarray Suite 5 (MAS5) [47] and earlier versions, uses the difference in signal between paired perfect match (PM) and mismatched (MM) probes to account for nonspecific binding. In some cases the MM probe has a stronger signal than the PM probe, resulting in negative signal values, and on occasion false-negative expression. MAS5 also does not normalize across arrays, instead scaling within arrays to a defined intensity value. The alternate methods generally ignore the MM probes, using global or model-based background correction, and normalizing across arrays. The most significant source of difference between the preprocessing methods is how they perform background corrections [45]. Of the alternative protocols that have been developed, RMA [48] and GCRMA [49] are those most commonly utilized. Affymetrix has also developed a new algorithm, PLIER [50], that uses an improved PM–MM background correction and quantile normalization. These methods are all available in Bioconductor, mostly in the affy package. There is some concern that they have been optimized using a limited data set from a single human array version, so may have different performance with other arrays [5,51]. It is also likely that the preprocessing method of choice will vary with experimental design.

For two-color systems preprocessing involves other factors. It is well known that dyes have individual biases that need to be adjusted for. In particular, the dye cyanine 5 (Cy5), commonly used in two-color microarray experiments, is rapidly degraded by ozone [52], whereas the other commonly used dye, Cy3, is not. Consequently, air quality can become a major factor in two-color analysis. As with the Affymetrix arrays, background correction methods have a marked effect on preprocessing [53]. Local background subtraction can result in negative intensities and should be avoided in favor of model-based methods, where only positive intensity values are returned. There have been arguments for not performing any background correction but this can also result in problems with downstream analysis methods with cDNA microarrays [53]. For Agilent microarrays, the feature extraction software corrects for both dye biases and background. It has been shown that this results in an increase in the variability in low-intensity data [54], which may be due to the background correction used. In general, logarithmic transformation followed by loess smoothing is the normalization method of choice [55]. These methods are available in the limma Bioconductor package.

The large number of technical replicates for probes on Illumina arrays allows for more robust variance stabilization and normalization, utilizing both quantile and loess normalization [56]. This can be implemented using the lumi Bioconductor package.
Differential expression analysis

The main goal of microarray experiments is to identify genes that are significantly differentially expressed between two or more experimental conditions, usually by comparing the average intensity values of the replicate samples. Microarray analysis methods attempt to minimize two types of error in measures of differential expression: type 1 or false-positive errors and type 2 or false-negative errors [5]. Controlling type 1 errors is the major goal of many of the statistical methods developed specifically for microarray analysis. Type 2 errors are more likely to be due to properties of the platform being used; a gene may not be detected due to limited sensitivity.

Filtering data using quality values and fold change cutoffs

Because some genes are not expressed in any sample and not all expressed genes are differentially expressed between samples, it makes sense to remove these uninformative data points from the analysis process using filters. These are usually based on either quality flags or fold change cutoffs, though more sophisticated methods have been described [57].

The simplest measure of differential expression is fold change, the magnitude of the difference in the expression of a gene between the conditions, usually reported as a ratio or log-ratio. Using fold change alone as a measure of significant differential expression is not appropriate, as it does not assess the reproducibility of the measurement or confidence in the observation [5,58]. For low-intensity genes, a relatively small change in signal can have a marked effect on fold change, resulting in type 1 errors. A fold change cutoff value can be set for filtering genes, above which genes are considered differentially expressed; however, the cutoff value is arbitrary, and setting it too high can result in type 2 errors.

During the preprocessing of the array data there is usually an assessment of whether a gene is expressed: does the gene have signal significantly above nonspecific signals? This is usually reported as a quality value, which is different for each platform: the Affymetrix MAS5 algorithm flags genes as being present (P), absent (A) or marginal (M) but RMA and GCRMA do not provide a quality metric; the CodeLink platform provides similar flags as measures of signal quality; Illumina BeadStudio software provides a detection p value based on the signal from the replicate beads, which is usually reported as the inverse of the p value; Agilent feature extraction software provides several values but the IsWellAboveBG flag is generally used; and Applied Biosystems uses a signal-to-noise ratio for expression and a flag value for signal quality. Quality values can be used to set quality filters for the data.
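To make the fold change discussion concrete, the sketch below computes a log2 fold change from group means and applies a cutoff. It is only an illustration: the function names, the two-fold default cutoff and the intensity values are assumptions for the example, not values taken from the cited references.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b):
    """Log2 ratio of the mean intensity in group_b to the mean in group_a."""
    return math.log2(mean(group_b) / mean(group_a))


def passes_fold_filter(group_a, group_b, min_fold=2.0):
    """True when the change, in either direction, is at least `min_fold`.

    `min_fold` is an arbitrary cutoff chosen for this sketch.
    """
    return abs(log2_fold_change(group_a, group_b)) >= math.log2(min_fold)


# Invented intensities: the same absolute change of 30 units in both genes.
low_intensity = ([10.0, 10.0], [40.0, 40.0])          # 4-fold change
high_intensity = ([1000.0, 1000.0], [1030.0, 1030.0])  # 1.03-fold change

low_passes = passes_fold_filter(*low_intensity)
high_passes = passes_fold_filter(*high_intensity)
```

The same absolute change of 30 intensity units passes the filter for the low-intensity gene and fails it for the high-intensity gene, which is the behavior the text warns about. Note also that nothing in the calculation uses the spread of the replicates, so the filter says nothing about reproducibility.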
Some software allows for different filtering options: only analyzing genes with perfect quality scores in all samples; allowing genes to be analyzed where the majority of samples have a perfect quality value; or analyzing all genes regardless of quality flags. In cases where the gene is not expressed in one sample group but is in another sample group, either a nominal positive expression value needs to be used for the unexpressed samples, to avoid division by zero, or the gene should be excluded from analysis, which is usually undesirable.

Comparative statistics

The most common microarray experiment compares expression between just two groups of samples. For such experiments, standard t tests are used to assess the statistical significance of the observed change in expression at an individual gene level [58]. The actual t test to be used is dependent on the experimental design. Usually an unpaired Student t test is used, as the samples are considered to have equal variance, being randomly assigned to each group. In some cases, such as before and after drug treatment, a paired t test provides more power. When the samples are not equally variable a Welch’s t test is appropriate. The standard t tests report a p value for each comparison at an individual probe level. The p value is the probability of observing a difference at least as large as the one measured if there were no true difference in expression, and p values that fall below a nominal level, usually 0.05, are considered significant.

The number of replicates in a typical microarray experiment is not usually sufficient to make standard t tests robust, as they are sensitive to the effects of outlier values [58]. In addition, the large number of individual tests being run in parallel means that there will be a large number of type 1 errors. Consequently, modified t tests have been developed specifically for microarray analysis, utilizing an approach referred to as variance shrinkage [5]. They are nonparametric, using the variance of all the genes on the array to improve the power of the test. In particular Bayesian statistical approaches have been used, as they have been found to improve analysis of microarray experiments with a limited number of replicates [59,60]. The significance analysis of microarrays (SAM) is another popular approach [61]. Though these approaches improve on the classic t tests in terms of controlling the false-positive rate, there is still no ‘best method’ [62].

When samples from more than two conditions are being compared, an analysis of variance (ANOVA) is often used to estimate the relative expression of each gene in each sample [58]. Depending on the experimental design there are different types of ANOVA that can be used. As with the t test, attention needs to be paid to the variance of the samples in a group and between the groups. When only a single factor is present, the one-way ANOVA is used to compare expression at the individual gene level. When two or more factors are being compared, a two-way ANOVA should be used [6].
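For the two-group comparisons discussed in this section, the three classic t statistics (unpaired Student, Welch and paired) differ only in how the variance term is formed. The sketch below computes the statistic and its degrees of freedom for a single gene; it is a textbook illustration with invented function names, and it stops short of the p value, which would additionally need a t distribution routine such as the one in a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Unpaired Student t statistic with pooled variance (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom (unequal variances)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def paired_t(before, after):
    """Paired t statistic computed on the per-subject differences (after - before)."""
    diffs = [y - x for x, y in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / sqrt(variance(diffs) / n), n - 1
```

With only three replicates per group, each variance is estimated from two degrees of freedom, so a single outlier can change the statistic substantially. This is the weakness that the variance shrinkage methods described above try to address by borrowing information from the other genes on the array.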
For more complex designs, for instance, multiple conditions with biological and technical replicates, multi-way ANOVAs should be used [58,63]. Also, modifications of the standard ANOVA have been described that use global gene variance instead of, or combined with, gene-specific variance, to control type 1 errors [58].

Corrections for multiple testing

The standard statistical approach for controlling the false-positive rate when using multiple comparisons is to use corrections for multiple testing. These adjust the p values based on the total number of tests being performed. There are two approaches that are taken: a family-wise error rate (FWER) control, the simplest being the Bonferroni correction [5,58]; or a false discovery rate (FDR) correction, usually that of Benjamini and Hochberg [64]. Figure 3 shows an example of the effect of different correction methods on the number of significant genes identified.

The FWER corrections adjust the p value to reflect the probability that one or more false-positive errors occur in a list. The Bonferroni correction is very stringent, dramatically reducing or even eliminating lists of differentially expressed genes and therefore increasing the chance of type 2 errors. There are step-down modifications of the Bonferroni correction that are less conservative, including those developed by Holm [65] and Westfall and Young [66]. However, the FWER corrections are a poor choice in discovery research, where accepting a limited number of false-positive results is preferable to eliminating true positives.

The FDR correction is less stringent and adjusts the p value to control the frequency of type 1 errors in the list of significantly differentially expressed genes [67]. Positive FDR (pFDR) applies a factor to the FDR, equivalent to the proportion of nondifferentially expressed genes to the total number of genes, which reduces the correction [68]. This increases the power of the analysis, while not eliminating all false-positive errors [58]. Rather than controlling FDR below a threshold, it has been suggested that FDR estimating procedures are preferable for microarray analysis [5]. These assign a false-positive probability value to each differentially expressed gene. The control and estimation of FDR is an active area of investigation [69–71].

Fig. 3. Effect of different multiple testing approaches on significantly differentially expressed gene lists. Probes identified as significant: 5% PCER, 12,143 probes; 5% FDR, 8,053 probes; 5% FWER, 87 probes. The numbers are from analysis of the CodeLink data set comparing control kidney and aristolochic acid treated kidney samples from the MAQC rat toxicology study [116] (GEO accession: GSE5350) without any additional preprocessing or filtering. The CodeLink rat whole genome array has 33,790 probes. The per comparison error rate (PCER) was from an unpaired Student t test with a p value cutoff of 0.05. The Benjamini and Hochberg correction was used for FDR. The Bonferroni correction was used for the FWER. Data were generated using GeneSifter.
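The Bonferroni, Holm and Benjamini and Hochberg corrections named above can each be written as a short function that returns adjusted p values. The sketch below uses the standard textbook definitions; the function names and the ten example p values are invented for illustration and are unrelated to the data behind Fig. 3.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down FWER control: the k-th smallest p value (k from 0) is
    multiplied by (m - k), and adjusted values are forced to be non-decreasing."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """FDR control: the p value of rank r (1 = smallest) is multiplied by m / r,
    working down from the largest so adjusted values stay non-decreasing."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


def count_significant(pvalues, alpha=0.05):
    """Number of p values below the significance threshold."""
    return sum(1 for p in pvalues if p < alpha)


# Invented p values for ten hypothetical probes.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
```

For these example values, five probes fall below 0.05 before correction, two remain after the Benjamini and Hochberg adjustment, and one remains after either the Bonferroni or the Holm adjustment. The ordering matches the pattern in Fig. 3, where the uncorrected list is the longest and the FWER list the shortest.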
Cluster analysis

Another commonly used method for statistically analyzing microarray data from multiple conditions is clustering. It can be applied to normalized data without any other analysis being performed, or to a statistically significant list of differentially expressed genes from an ANOVA. Clustering algorithms recognize patterns in the data [72]. Usually the clustering algorithms are unsupervised, the data being analyzed with no assumption of underlying structure. Two basic approaches are taken: either the visualization of overall expression patterns or the partitioning of genes into discrete groups. In either case, genes with similar expression profiles are grouped together. It is often assumed that genes that cluster together are co-expressed; however, unsupervised clustering algorithms will always produce clusters based on the parameters that are set, regardless of the quality and relevance of those clusters. For classification applications, where the quality and relevance of the groups are very important, additional information about the relationship of the samples is used for supervised clustering algorithms.

Hierarchical clustering is the original method used, and is still widely utilized [73]. This is a simple agglomerative clustering method, where genes are sequentially added to the cluster based on the similarity of their expression profile. There are also hierarchical clustering algorithms that take a divisive approach, starting with a single cluster and finishing with the individual genes, but this is more computationally intensive. Hierarchical clustering is a useful tool for visualizing expression patterns in microarray data, the typical output being a dendrogram of the genes, from which clusters of closely matched genes can be identified.
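The agglomerative procedure can be sketched in a few lines of Python. The sketch assumes average linkage and a distance of one minus the Pearson correlation, which are common choices but are not prescribed by the text; the four gene profiles are hypothetical, and the function names are illustrative only. The list of merges it returns is the information a dendrogram displays.

```python
from itertools import combinations
from statistics import mean


def correlation_distance(x, y):
    """1 - Pearson correlation: near 0 when two profiles rise and fall together."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1.0 - cov / (var_x * var_y) ** 0.5


def agglomerate(profiles):
    """Average-linkage agglomerative clustering of gene expression profiles.

    profiles maps a gene name to its list of expression values.
    Returns the merges in the order performed, as (members_a, members_b, distance).
    """
    clusters = [(name,) for name in profiles]  # every gene starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean distance over all cross-cluster gene pairs.
            d = mean(correlation_distance(profiles[g], profiles[h])
                     for g in a for h in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical log ratios for four genes across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
merges = agglomerate(profiles)
```

Here geneA and geneB are merged first, then geneC and geneD, and the two pairs are joined last at a distance close to 2, reflecting their opposite expression patterns. Note that the algorithm would produce a tree for any input, which is the caveat made above about unsupervised methods.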
Usually a graphic visualization, referred to as a heatmap, is also generated, in which the log ratios of the intensity are represented as a color scale, from intense red for the highest positive values, through black for the mean intensity, to intense green for the most negative values. Figure 4A shows an example of the typical output.

Many different algorithms have been used for partitioning microarray data. These include K-means algorithms [74], self-organizing maps (SOM) [75] and partitioning around medoids (PAM) [76]. There is no reliable way of knowing which algorithm is best for a particular dataset [77,78]. These approaches require that the number of clusters into which the data is partitioned be specified at the outset; however, it is difficult to assess the appropriate number of clusters for a particular dataset. One approach is to use silhouette widths [79]. These are a measure of how closely the genes in a cluster match the mean expression profile for the cluster; the larger the overall mean silhouette width, the better the clustering of the data. This requires repeated clustering using different numbers of initial nodes. Figure 4B shows an example of a PAM output with silhouette values. Other methods have also been developed to
Fig. 4. Examples of hierarchical and partition clustering. The data is from a study of male germ cell tumor samples analyzed using Affymetrix human U133A GeneChips (GEO accession: GSE3218) and preprocessed with GCRMA in GeneSifter. (A) Heatmap and dendrogram from hierarchical clustering of genes of the Wnt signaling pathway between the sample groups. (B) Partitioning around medoids (PAM) silhouettes, four clusters from 4,975 significantly differentially expressed genes, identified using a one-way ANOVA and Benjamini and Hochberg FDR correction, with an adjusted p value <0.0001 and at least a four-fold change in expression compared to a normal testis control. Sample group labels shown in the figure: NT, S, CC, E1, E2, T, YS. (The color version of this figure is hosted on Science Direct.)
address the question [80–82]. There is a similar issue with validation of the quality of the clusters [77,82]. Intuitively, genes in the same cluster are co-expressed and therefore share a biological function; however, this assumption is often not borne out by the functional analysis of the genes in a cluster [72]. The reasons for this are: the complexity of the biology underlying a given gene list; the limited number of samples relative to the number of genes on a typical microarray; and the strict assignment of genes to clusters. Standard clustering techniques assign a gene to a single cluster and, once assigned, it cannot be reassigned to a different cluster. Fuzzy clustering has been used to address this problem by assigning probabilities to genes in clusters; ultimately, the gene is assigned to the cluster where it has the highest probability score [83]. Other approaches, for instance principal component analysis (PCA) [84] and independent component analysis (ICA) [85,86], have used linear models, which allow genes to belong to more than one cluster.

Supervised classification

To use microarray data as a phenotype classification tool, it is necessary to identify a set of discriminative genes that can be used to assign samples to pre-defined categories. The original description of the use of gene expression microarray data for cancer classification was that of Golub [87]. Since then there have been many publications describing different approaches to the problem [5,88], utilizing several standard data sets for cancer classifier testing [88]. There are several inherent problems in the development of classifiers, including the presence of redundant transcripts in gene expression data, and the bias introduced by not using completely separate data sets to create and then validate the predictive algorithm. This leads to models that are susceptible to overfitting, performing well with the data used to develop them but not with new data [5].
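One way of keeping the data used for validation separate from the data used to build a classifier is to repeat the gene selection inside every fold of a cross-validation, so that the held-out sample never influences which genes are chosen. The following Python sketch illustrates the idea with leave-one-out cross-validation and a nearest-centroid classifier for two classes. The classifier, the gene ranking score and the tiny data set are all assumptions made for illustration; they are not the methods of the cited studies.

```python
from statistics import mean


def select_genes(samples, labels, n_genes):
    """Rank genes by absolute difference in class means (two classes assumed)."""
    classes = sorted(set(labels))
    scores = []
    for g in range(len(samples[0])):
        m0 = mean(s[g] for s, lab in zip(samples, labels) if lab == classes[0])
        m1 = mean(s[g] for s, lab in zip(samples, labels) if lab == classes[1])
        scores.append((abs(m0 - m1), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_genes]]


def nearest_centroid(train, train_labels, genes, sample):
    """Assign the sample to the class whose mean profile over `genes` is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [s for s, lab in zip(train, train_labels) if lab == label]
        centroid = [mean(s[g] for s in members) for g in genes]
        dist = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(samples, labels, n_genes):
    """Leave-one-out cross-validation with gene selection repeated in every fold."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Genes are chosen from the training samples only.
        genes = select_genes(train, train_labels, n_genes)
        correct += nearest_centroid(train, train_labels, genes, samples[i]) == labels[i]
    return correct / len(samples)


# Hypothetical expression values: 6 samples x 4 genes; only gene 0 is informative.
samples = [
    [5.0, 2.1, 3.0, 1.0],
    [5.2, 1.9, 3.1, 1.2],
    [4.9, 2.0, 2.9, 0.9],
    [1.0, 2.0, 3.0, 1.1],
    [1.2, 2.2, 3.2, 1.0],
    [0.9, 1.8, 2.8, 1.2],
]
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
accuracy = loocv_accuracy(samples, labels, n_genes=1)
```

Selecting the genes once from all samples and only then cross-validating would let the held-out samples leak into the model, which is one route to the optimistic bias described above. The sketch does not address transcript redundancy.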
It is widely accepted that a smaller set of non-redundant informative genes will provide the most accurate classifiers; however, there is currently no method of choice for identifying such genes.

Biological interpretation

Once a list of significantly differentially expressed genes has been obtained, the next consideration is the identification of the biological processes represented in the list. The information associated with a particular gene, the annotation, is available from many online sources [89–91]. The NCBI has many annotation resources [92], including Entrez Gene, which integrates much of the gene information for a large number of organisms. There are similar resources at the European Bioinformatics Institute (EBI) [93]. Two sources of extensive gene and genome level annotation for multiple species are Ensembl [94] and the UCSC Genome Browser [95], which has tracks
showing the location of Affymetrix probe sets. For organisms other than human, mouse and rat, annotation is generally sparse, although intensively studied organisms, such as yeast and Drosophila, have rich data resources. Many of the organisms that have had their genomes sequenced have very limited annotation, usually in a dedicated database that is difficult to query. Commonly used sources of functional information associated with genes are the Gene Ontology (GO) database [96] and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database [97]. Useful functional information can also be found at PANTHER [98], which has pathway and ontology information. It is relatively easy, if somewhat time consuming, to collect information for a single gene, but it is not easy to identify broad biological themes in this way. Microarray gene lists can have thousands of entries and it can be difficult to query databases to obtain all the related information and mine that information for common themes. For GO, several tools have been developed that can take a list of genes and return lists of ontologies, with statistical measures of their significance [99–101]. Similar reports can be generated for pathways [99]. Statistical significance is assessed using either z scores or p values and FDR corrections. The z score indicates whether the genes associated with an ontological term or pathway are significantly overrepresented (a score >2) or underrepresented (a score <−2) in the list, compared to the normal distribution. The DAVID system integrates many different data sources and provides rich functional reports for microarray gene lists [102].

Gene expression microarray data analysis software

Software is essential to the analysis of microarray data. However, there are very few software packages that cover all the steps in microarray analysis. This means that data tends to go through a series of individual software applications that mirror the steps in the workflow in Fig. 1.
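As a brief illustration of the kind of calculation such functional analysis tools perform, the z score described under biological interpretation can be sketched as follows. The text does not give a formula, so this sketch assumes one common formulation, in which the observed number of list genes annotated to a term is compared with the mean and variance expected under random sampling from the array without replacement (the hypergeometric distribution). The function name and all counts are hypothetical.

```python
from math import sqrt


def ontology_z_score(n_list, n_list_in_term, n_array, n_array_in_term):
    """z score for over- or under-representation of a term in a gene list.

    n_list          genes in the differentially expressed list
    n_list_in_term  list genes annotated to the term
    n_array         genes on the array (the background)
    n_array_in_term array genes annotated to the term
    """
    p = n_array_in_term / n_array
    expected = n_list * p
    # Hypergeometric variance, including the finite population correction.
    variance = n_list * p * (1 - p) * (1 - (n_list - 1) / (n_array - 1))
    return (n_list_in_term - expected) / sqrt(variance)


# Hypothetical counts: 200 of 10,000 array genes belong to the term, so about
# 10 of a 500-gene list would be expected to carry the annotation by chance.
z_over = ontology_z_score(500, 25, 10_000, 200)   # 25 observed: z above 2
z_under = ontology_z_score(500, 2, 10_000, 200)   # 2 observed: z below -2
```

With these counts the first term would be reported as overrepresented and the second as underrepresented, using the thresholds of 2 and −2 given above.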
There is also a limited selection of software resources that provide not only analysis tools but also associated data storage and management capabilities.

Open-source software

Several papers are published each month on some aspect of microarray analysis, and these papers generally have a link to the associated software. Some are set up as websites and some software is available as Microsoft Excel plug-ins [61], but most of the statistical approaches appear as packages in Bioconductor [4]. Bioconductor is a software resource for genomics data analysis aimed at biostatisticians and bioinformatics experts, who appreciate its power and flexibility and do not mind the difficult interface. As mentioned earlier, there are specific packages in Bioconductor for the
different commercial microarray platforms. Other packages can be more general in nature and gather tools for specific analysis approaches, for instance Bayesian methods. For a casual user the learning curve is steep, because it requires learning the R statistical scripting language [3], on which Bioconductor is based. Some graphical user interfaces have been developed to make Bioconductor more accessible. As previously mentioned, the DAVID Knowledgebase and associated tools are a popular application for biological interpretation [102]. Finding other sources of microarray analysis and interpretation applications is a daunting task [103], due to the sheer number available, though attempts to catalog them have been made [104].

Commercial software

The commercial software available for microarray analysis integrates much of the functionality available in separate Bioconductor packages. The applications often support only one or two microarray platforms. Most have preprocessing, differential expression analysis and clustering tools and allow the integration of R or other scripting languages. There is often an emphasis on data visualization. However, biological interpretation tools are usually not well integrated, unless that is the focus of the application. Table 2 summarizes the capabilities of the most popular commercial applications. Though the user interfaces make them easier to use than most open-source software, they are generally designed for sophisticated users and so can be difficult for non-experts to learn. The exception is GeneSifter, the only web-based application, which was designed with the philosophy that non-expert research scientists should be able to perform their own microarray data analysis for common experimental designs. This system has broad microarray platform support, built-in data management, preprocessing, differential expression analysis, clustering tools and integrated biological significance analysis.
A common criticism of commercial software is that it is a ‘black box’, the actual code being run for a statistical analysis being inaccessible to the user. GeneSifter only uses algorithms from R and Bioconductor.

Cross-platform comparisons

In spite of attempts to standardize microarray data reporting, led by the Microarray Gene Expression Database Society (MGED) [105] with the Minimum Information About a Microarray Experiment (MIAME) standard [106], most attempts to compare results from different microarray platforms have tended to show poor concordance [12,23,107–111]. The sources of the inconsistencies were identified as: the difficulty of comparing platforms at a gene level [12]; the use of different protocols for sample preparation [108,110]; and the use of different statistical tests in data analysis [111]. When
Table 2. Comparison of commercial gene expression microarray data analysis software capabilities.
Columns: Company: product; Microarray platforms; Data storage; Preprocessing; Differential expression; Gene annotation; Functional analysis; Computer platform; Statistics.
Products compared: Agilent: Genespring GX; Biodiscovery: GeneDirector; Biotique Systems: X-ray; DNAStar: ArrayStar; Genologics: Geneus; Genomatix: ChipInspector; Insightful: S+ArrayAnalyzer; Integromics: ArrayHub; Molecular Devices: Acuity; Ocimum Biosolutions: Genowiz; Partek: Genomics Suite; Rosetta Biosoftware: Resolver; SAS: JMP Genomics; TIBCO: Spotfire Decision Site; VizX Labs: GeneSifter.
comparing cDNA and short-oligonucleotide platforms, similar trends in the direction of differential expression, but not the magnitude, were observed [109]. It was also found that results were more variable between laboratories than between platforms when common samples were run at several sites [108,110]. When common procedures were implemented, and good consistency between replicate samples was achieved, the consistency between laboratories and platforms improved. Microarrays were identified as key technologies in future submissions to the U.S. Food and Drug Administration (FDA), and the lack of uniformity between studies and platforms was of major concern. This led to two initiatives: the MicroArray Quality Control Consortium (MAQC) [111] from the FDA and the External RNA Control Consortium (ERCC) [112] from the National Institute of Standards and Technology (NIST). Both represented government, commercial and academic interests. The MAQC Consortium published a group of papers in September 2006 addressing many of the issues that had been raised, to try to identify the sources of discordance observed in other studies [9,41,113–116]. All the data generated is publicly available and acts as a superb resource for comparing analysis methods. The main study [115] utilized a pooled titration series using different ratios of two reference RNA samples: a Universal Human Reference RNA and a Human Brain Reference RNA. These were assayed on six different commercial microarray platforms, and the NCI provided in-house spotted oligonucleotide microarrays. Each platform was used at multiple test sites, and each test site ran five replicate assays for each RNA pool. The microarray providers used their own software for intensity signal quantification and to assign a quality measure to each probe on the array. However, this made the resulting analysis more complicated because of the differences in data preprocessing and quality filtering.
To make sure that, as far as possible, each platform was actually measuring the same target, a common set of 12,091 probes, mapping to a non-redundant list of genes and transcripts, was identified. The generation of this list was aided by the actual probe sequences being made available by the companies involved, allowing more rigorous comparisons than had previously been possible. In general, the detection of these genes varied more between platforms than between test sites using the same platform; however, the different platforms appeared to detect similar changes in gene abundance. The optimum method for generating overlapping gene lists between platforms was to use a ranked fold change analysis, which ignored the absolute degree of observed change. Using standard statistical tests reduced the concordance between platforms. Perhaps not surprisingly, the simple statistical methods chosen have been challenged [117]. The two reference RNAs were also used to compare the concordance of results from one-color and two-color microarray studies [9]. This comparison used three different platforms, hybridizing both one-color and two-color
samples to each. This eliminated problems due to different platforms being used for each experimental design. Within each platform there were high correlation coefficients and good concordance between the differentially expressed gene lists for the two approaches. Two-color designs appeared to have slightly better sensitivity, but one-color designs had lower compression of the expression values. Using individual intensity values rather than ratios for the two-color design appeared to have the greatest sensitivity. As well as the artificial RNA sample comparison of the main MAQC study, four platforms were compared using a biologically relevant toxicogenomics data set [116]. This data was derived from four groups of six rats treated with a range of plant-derived toxins [31]. Liver samples were collected for all groups, as well as kidney samples from control animals and those treated with aristolochic acid, a nephrotoxic compound. These six groups of six replicate samples were assayed on five sets of rat whole genome microarrays from four commercial sources. The results supported the conclusion of the main study that ranking by fold change was the best method of maximizing gene list overlap between platforms. The fold change ranking also resulted in much better concordance when comparing platforms based on the biological functions significantly represented in gene lists. The MAQC study also included validation of the microarray data using quantitative measurement of gene expression [113]. Generally, good concordance was seen between the quantitative assays and the microarray results for genes that were detectable on both platforms. Where discordance was observed, it could be explained by differences in the location of probes, meaning that alternate splice variants could be detected.
Genes found to have low expression by the quantitative assays were those that showed the most variable concordance between both the quantitative assays and microarray platforms and between the different microarray platforms in the main MAQC study. This reflected the sensitivity range of the different microarray platforms. Overall, the MAQC study showed that data was consistent at individual test sites, reproducible between test sites and comparable between platforms. However, to ensure the reliability of microarray-based studies, there is still a need for unified metrics and standards, to identify poor quality arrays and monitor performance at microarray facilities.
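The ranked fold change comparison that maximized gene list overlap in the MAQC studies can be sketched as follows. This is a minimal illustration rather than the MAQC analysis itself: it assumes that each platform's results have already been reduced to one log2 ratio per gene, the values are hypothetical, and the function names are illustrative only.

```python
def top_by_fold_change(log2_ratios, n):
    """Return the n genes with the largest absolute log2 fold change."""
    ranked = sorted(log2_ratios, key=lambda gene: abs(log2_ratios[gene]), reverse=True)
    return set(ranked[:n])


def ranked_overlap(platform_a, platform_b, n):
    """Percentage of the top-n genes shared by two platforms.

    Only genes measured on both platforms are ranked, mirroring the use of a
    common probe set. Only the rank of each fold change is used, not its size.
    n should not exceed the number of common genes.
    """
    common = platform_a.keys() & platform_b.keys()
    top_a = top_by_fold_change({g: platform_a[g] for g in common}, n)
    top_b = top_by_fold_change({g: platform_b[g] for g in common}, n)
    return 100.0 * len(top_a & top_b) / n


# Hypothetical log2 ratios; platform B reports compressed fold changes,
# and g7 is measured on platform A only.
platform_a = {"g1": 3.1, "g2": -2.4, "g3": 0.2, "g4": 1.8, "g5": -0.1, "g6": 0.4, "g7": 5.0}
platform_b = {"g1": 1.9, "g2": -1.2, "g3": 0.1, "g4": 0.3, "g5": 0.9, "g6": -0.2}

overlap = ranked_overlap(platform_a, platform_b, n=3)  # g1 and g2 shared: about 66.7%
```

Because only ranks are compared, the smaller fold changes reported by the second platform do not by themselves reduce the overlap.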
Public gene expression data repositories

Many journals make public accessibility of the data a condition of publishing papers in which gene expression microarray data is described. This is also a condition of federal grant funding agencies. Two major data repositories are the Gene Expression Omnibus (GEO) [118] at the NCBI, and ArrayExpress [119] at the EBI. The data submitted to the repositories has to
be MIAME-compliant, so all the experimental details are available. Many submitters also include the original raw data files. The two repositories currently contain nearly 300,000 individual samples in approximately 10,000 studies, most of which are from gene expression microarray experiments. This huge volume of data is available for meta analysis, either to extend or confirm researchers' own results, or for data mining, for instance to identify previously missed gene and disease relationships [120,121]. However, when performing meta analysis it is important to remember that analysis across microarray platforms is not straightforward.
The future

The original design philosophy for gene expression microarrays was to measure the expression of all protein-coding genes. However, with the refinement of whole genome annotation, it has become clear that earlier estimates of the number of these genes were overoptimistic. The latest estimate for humans is approximately 20,500 [122]. This will lead to further consolidation of gene expression microarray platform content, ultimately leading to the same set of genes and transcripts being represented on each platform. As the content goes down and feature densities increase, multiplexing of arrays will increase. At the same time, the number of potential non-coding RNA transcripts has dramatically increased following the publication of the results of the ENCODE pilot project [123]. Tiling microarrays were extensively used in this and the follow-up project. It is likely that new microarrays will appear to measure the diverse RNA species that are being identified and also to study transcriptional control elements. This is already occurring, with microarrays for microRNAs and recent announcements of the release of microarrays for several platforms containing CpG elements. It is likely that microarray platforms will face increasing competition from high-density RT-PCR-based platforms. Another technology that will have a profound effect on microarrays is digital gene expression (DGE) [124], using the new massively parallel sequencing systems. DGE is basically an extension of serial analysis of gene expression (SAGE) [125], a tag counting method of gene expression analysis, first described in the same issue of Science as the original cDNA microarray paper. The problem with SAGE was that it was expensive to sequence enough tags to get true quantitation of gene expression. With the new sequencing technologies, deep sequencing is much cheaper and much faster, and millions, rather than thousands, of tags are sequenced.
This technology has been touted as the end of microarrays, and has resulted in Applied Biosystems Inc. abandoning their microarray platform in favor of their new sequencing platform. It is more likely that
the two technologies will be complementary, with DGE results leading to new microarray designs. As well as new technologies for measuring gene expression, the use of microfluidic devices to study gene expression at a single cell level, and laser microdissection to analyze select sets of cells from tissues, will lead to a greater appreciation of localized gene expression and potentially better classification. Gene expression microarray data analysis has generally reached a state of consensus. However, developments in microarray technology, new competing technologies and sample preparation will require new analysis methods or refining of existing methods. Improved classification methods and more biologically meaningful clustering approaches are required. Increased interest in meta analysis of co-variance of genes across studies and across platforms should see improvements in data analysis methods for this purpose. Also, gene expression microarray data is increasingly being used with other types of data to obtain a larger biological picture, referred to as systems biology. This is a rapidly expanding field and integration of these disparate data types will be a challenge.

Acknowledgements

The author would like to thank Leah Klein for helpful discussions and critical review of the manuscript.

References

1. Schena M, Shalon D, Davis RW and Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995;270:467–470.
2. Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H and Brown EL. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996;14:1675–1680.
3. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2007.
4. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH and Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004;5:R80.
5. Allison DB, Cui X, Page GP and Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006;7:55–65.
6. Olson NE. The microarray data analysis process: from raw data to biological significance. NeuroRx 2006;3:373–383.
7. Kerr KF, Serikawa KA, Wei C, Peters MA and Bumgarner RE. What is the best reference RNA? And other questions regarding the design and analysis of two-color microarray experiments. OMICS 2007;11:152–165.
8. Novoradovskaya N, Whitfield ML, Basehore LS, Novoradovsky A, Pesich R, Usary J, Karaca M, Wong WK, Aprelikova O, Fero M, Perou CM, Botstein D and Braman J. Universal reference RNA as a standard for microarray experiments. BMC Genomics 2004;5:20.
9. Patterson TA, Lobenhofer EK, Fulmer-Smentek SB, Collins PJ, Chu T, Bao W, Fang H, Kawasaki ES, Hager J, Tikhonova IR, Walker SJ, Zhang L, Hurban P, de Longueville F, Fuscoe JC, Tong W, Shi L and Wolfinger RD. Performance comparison of one-color and two-color platforms within the MicroArray Quality Control (MAQC) project. Nat Biotechnol 2006;24:1140–1150.
10. Hughes TR, Mao M, Jones AR, Burchard J, Marton MJ, Shannon KW, Lefkowitz SM, Ziman M, Schelter JM, Meyer MR, Kobayashi S, Davis C, Dai H, He YD, Stephaniants SB, Cavet G, Walker WL, West A, Coffey E, Shoemaker DD, Stoughton R, Blanchard AP, Friend SH and Linsley PS. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat Biotechnol 2001;19:342–347.
11. Chou C, Chen C, Lee T and Peck K. Optimization of probe length and the number of probes per gene for optimal microarray analysis of gene expression. Nucleic Acids Res 2004;32:e99.
12. Shippy R, Sendera TJ, Lockner R, Palaniappan C, Kaysser-Kranich T, Watts G and Alsobrook J. Performance evaluation of commercial short-oligonucleotide microarrays and the impact of noise in making cross-platform correlations. BMC Genomics 2004;5:61.
13. Ramdas L, Cogdell DE, Jia JY, Taylor EE, Dunmire VR, Hu L, Hamilton SR and Zhang W. Improving signal intensities for genes with low-expression on oligonucleotide microarrays. BMC Genomics 2004;5:35.
14. Barrett JC and Kawasaki ES. Microarrays: the use of oligonucleotides and cDNA for the analysis of gene expression. Drug Discov Today 2003;8:134–141.
15. Pruitt KD, Tatusova T and Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2007;35:D61–D65.
16.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S,
Rump RW, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S and Chen YJ. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921.
17.
Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, Yao A, Ye J, Zhan M, Zhang W, Zhang H, Zhao Q, Zheng L, Zhong F, Zhong W, Zhu S, Zhao S, Gilbert D, Baumhueter S, Spier G, Carter C, Cravchik A, Woodage T, Ali F, An H, Awe A, Baldwin D, Baden H, Barnstead M, Barrow I, Beeson K, Busam D, Carver A, Center A, Cheng ML, Curry L, Danaher S, Davenport L, Desilets R, Dietz S, Dodson K, Doup L, Ferriera S, Garg N, Gluecksmann A, Hart B, Haynes J, Haynes C, Heiner C, Hladun S, Hostin D, Houck J, Howland T, Ibegwam C, Johnson J, Kalush F, Kline L, Koduru S, Love A, Mann F, May D, McCawley S, McIntosh T, McMullen I, Moy M, Moy L, Murphy B, Nelson K, Pfannkoch C, Pratts E, Puri V, Qureshi H, Reardon M, Rodriguez R, Rogers YH, Romblad D, Ruhfel B, Scott R, Sitter C, Smallwood M, Stewart E, Strong R, Suh E, Thomas R, Tint NN, Tse S, Vech C, Wang G, Wetter J, Williams S, Williams M, Windsor S, Winn-Deen E, Wolfe K, Zaveri J, Zaveri K, Abril JF, Guigó R, Campbell MJ, Sjolander KV, Karlak B, Kejariwal A, Mi H, Lazareva
B, Hatton T, Narechania A, Diemer K, Muruganujan A, Guo N, Sato S, Bafna V, Istrail S, Lippert R, Schwartz R, Walenz B, Yooseph S, Allen D, Basu A, Baxendale J, Blick L, Caminha M, Carnes-Stine J, Caulk P, Chiang YH, Coyne M, Dahlke C, Mays A, Dombroski M, Donnelly M, Ely D, Esparham S, Fosler C, Gire H, Glanowski S, Glasser K, Glodek A, Gorokhov M, Graham K, Gropman B, Harris M, Heil J, Henderson S, Hoover J, Jennings D, Jordan C, Jordan J, Kasha J, Kagan L, Kraft C, Levitsky A, Lewis M, Liu X, Lopez J, Ma D, Majoros W, McDaniel J,
54
18. 19.
20.
21.
22.
23. 24. 25.
26. 27. 28. 29. 30.
31.
32. 33.
34.
Murphy S, Newman M, Nguyen T, Nguyen N, Nodell M, Pan S, Peck J, Peterson M, Rowe W, Sanders R, Scott J, Simpson M, Smith T, Sprague A, Stockwell T, Turner R, Venter E, Wang M, Wen M, Wu D, Wu M, Xia A, Zandieh A and Zhu X. The sequence of the human genome. Science 2001;291:1304–1351. Schuler GD. Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J Mol Med 1997;75:1432–1440. Mecham BH, Wetmore DZ, Szallasi Z, Sadovsky Y, Kohane I and Mariani TJ. Increased measurement accuracy for sequence-verified microarray probes. Physiol Genomics 2004;18:308–315. Harbig J, Sprinkle R and Enkemann SA. A sequence-based identification of the genes detected by probesets on the Affymetrix U133 plus 2.0 array. Nucleic Acids Res 2005;33:e31. Carter SL, Eklund AC, Mecham BH, Kohane IS and Szallasi Z. Redefinition of Affymetrix probe sets by sequence overlap with cDNA microarray probes reduces crossplatform inconsistencies in cancer-associated gene expression measurements. BMC Bioinformatics 2005;6:107. Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ and Meng F. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005;33:e175. Draghici S, Khatri P, Eklund AC and Szallasi Z. Reliability and reproducibility issues in DNA microarray measurements. Trends Genet 2006;22:101–109. Okoniewski MJ and Miller CJ. Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics 2006;7:276. Alberts R, Terpstra P, Hardonk M, Bystrykh LV, de Haan G, Breitling R, Nap J and Jansen RC. A verification protocol for the probe sequences of Affymetrix genome arrays reveals high probe accuracy for studies in mouse, human and rat. BMC Bioinformatics 2007;8:132. Kerr MK. Design considerations for efficient and effective microarray studies. Biometrics 2003;59:822–828. 
Miller LD, Long PM, Wong L, Mukherjee S, McShane LM and Liu ET. Optimal gene expression analysis by microarrays. Cancer Cell 2002;2:353–361. Yang YH and Speed T. Design issues for cDNA microarray experiments. Nat Rev Genet 2002;3:579–588. Zhang S and Gant TW. A statistical framework for the design of microarray experiments and effective detection of differential gene expression. Bioinformatics 2004;20:2821–2828. Hsu JC, Chang J, Wang T, Steingrı´ msson E, Magnu´sson MK and Bergsteinsdottir K. Statistically designing microarrays and microarray experiments to enhance sensitivity and specificity. Brief Bioinform 2007;8:22–31. Pan W, Lin J and Le CT. How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol 2002;3,research0022. Pavlidis P, Li Q and Noble WS. The effect of replication on gene expression microarray experiments. Bioinformatics 2003;19:1620–1627. Han E, Wu Y, McCarter R, Nelson JF, Richardson A and Hilsenbeck SG. Reproducibility, sources of variability, pooling, and sample size: important considerations for the design of high-density oligonucleotide array experiments. J Gerontol A Biol Sci Med Sci 2004;59:306–315. Wei C, Li J and Bumgarner RE. Sample size for detecting differentially expressed genes in microarray experiments. BMC Genomics 2004;5:87.
55 35. Tsai C, Wang S, Chen D and Chen JJ. Sample size for gene expression microarray experiments. Bioinformatics 2005;21:1502–1508. 36. Kreil DP and Russell RR. There is no silver bullet – a guide to low-level data transforms and normalisation methods for microarray data. Brief Bioinform 2005;6:86–97. 37. Kendziorski C, Irizarry RA, Chen K, Haag JD and Gould MN. On the utility of pooling biological samples in microarray experiments. Proc Natl Acad Sci USA 2005;102:4252–4257. 38. Mary-Huard T, Daudin J, Baccini M, Biggeri A and Bar-Hen A. Biases induced by pooling samples in microarray experiments. Bioinformatics 2007;23:i313–i318. 39. Schroeder A, Mueller O, Stocker S, Salowsky R, Leiber M, Gassmann M, Lightfoot S, Menzel W, Granzow M and Ragg T. The RIN: an RNA integrity number for assigning integrity values to RNA measurements. BMC Mol Biol 2006;7:3. 40. Van Gelder RN, von Zastrow ME, Yool A, Dement WC, Barchas JD and Eberwine JH. Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proc Natl Acad Sci USA 1990;87:1663–1667. 41. Tong W, Lucas AB, Shippy R, Fan X, Fang H, Hong H, Orr MS, Chu T, Guo X, Collins PJ, Sun YA, Wang S, Bao W, Wolfinger RD, Shchegrova S, Guo L, Warrington JA and Shi L. Evaluation of external RNA controls for the assessment of microarray performance. Nat Biotechnol 2006;24:1132–1139. 42. Kerr KF. Extended analysis of benchmark datasets for Agilent two-color microarrays. BMC Bioinformatics 2007;8:371. 43. Nygaard V and Hovig E. Options available for profiling small samples: a review of sample amplification technology when combined with microarray profiling. Nucleic Acids Res 2006;34:996–1014. 44. Bolstad BM, Irizarry RA, Astrand M and Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003;19:185–193. 45. Irizarry RA, Wu Z and Jaffee HA. Comparison of Affymetrix GeneChip expression measures. Bioinformatics 2006;22:789–794. 46. 
Seo J and Hoffman EP. Probe set algorithms: is there a rational best bet? BMC Bioinformatics 2006;7:395. 47. Affymetrix. New statistical algorithms for monitoring gene expression on GeneChip probe arrays Technical Note. 2001. 48. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U and Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003;4:249–264. 49. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F and Spencer F. A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc 2004;99:909–917. 50. Affymetrix. Guide to probe logarithmic intensity error (PLIER) estimation Technical Note. 2005. 51. Shedden K, Chen W, Kuick R, Ghosh D, Macdonald J, Cho KR, Giordano TJ, Gruber SB, Fearon ER, Taylor JMG and Hanash S. Comparison of seven methods for producing Affymetrix expression scores based on false discovery rates in disease profiling data. BMC Bioinformatics 2005;6:26. 52. Fare TL, Coffey EM, Dai H, He YD, Kessler DA, Kilian KA, Koch JE, LeProust E, Marton MJ, Meyer MR, Stoughton RB, Tokiwa GY and Wang Y. Effects of atmospheric ozone on microarray data quality. Anal Chem 2003;75: 4672–4675.
56 53. Ritchie ME, Silver J, Oshlack A, Holmes M, Diyagama D, Holloway A and Smyth GK. A comparison of background correction methods for two-colour microarrays. Bioinformatics 2007;23:2700–2707. 54. Zahurak M, Parmigiani G, Yu W, Scharpf RB, Berman D, Schaeffer E, Shabbeer S and Cope L. Pre-processing Agilent microarray data. BMC Bioinformatics 2007;8:142. 55. Smyth GK and Speed T. Normalization of cDNA microarray data. Methods 2003;31:265–273. 56. Lin SM, Du P and Kibbe WA. Model-based variance-stabilizing transformation for Illumina microarray. Nucleic Acids Res 2008;36:e11. 57. Calza S, Raffelsberger W, Ploner A, Sahel J, Leveillard T and Pawitan Y. Filtering genes to improve sensitivity in oligonucleotide microarray data analysis. Nucleic Acids Res 2007;35:e102. 58. Cui X and Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003;4:210. 59. Baldi P and Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t test and statistical inferences of gene changes. Bioinformatics 2001;17:509–519. 60. Sartor MA, Tomlinson CR, Wesselkamper SC, Sivaganesan S, Leikauf GD and Medvedovic M. Intensity-based hierarchical Bayes method improves testing for differentially expressed genes in microarray experiments. BMC Bioinformatics 2006;7:538. 61. Tusher VG, Tibshirani R and Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001;98:5116–5121. 62. Jeffery IB, Higgins DG and Culhane AC. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics 2006;7:359. 63. Li H, Wood CL, Getchell TV, Getchell ML and Stromberg AJ. Analysis of oligonucleotide array experiments with repeated measures using mixed models. BMC Bioinformatics 2004;5:209. 64. Benjamini Y and Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. 
J R Stat Soc. Ser B (Methodological) 1995;57:289–300. 65. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979;6: 65–70. 66. Westfall P and Young S. Resampling-based multiple testing: examples and methods for p-value adjustment, Wiley, 1993. 67. Reiner A, Yekutieli D and Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003;19:368–375. 68. Storey JD. A direct approach to false discovery rates. J R Stat Soc: Ser B (Statistical Methodology) 2002;64:479–498. 69. Lu X and Perkins DL. Re-sampling strategy to improve the estimation of number of null hypotheses in FDR control under strong correlation structures. BMC Bioinformatics 2007;8:157. 70. Ploner A, Calza S, Gusnanto A and Pawitan Y. Multidimensional local false discovery rate for microarray studies. Bioinformatics 2006;22:556–565. 71. Perelman E, Ploner A, Calza S and Pawitan Y. Detecting differential expression in microarray data: comparison of optimal procedures. BMC Bioinformatics 2007;8:28. 72. Boutros PC and Okey AB. Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Brief Bioinform 2005;6:331–343.
57 73. Eisen MB, Spellman PT, Brown PO and Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998;95:14863–14868. 74. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ and Church GM. Systematic determination of genetic network architecture. Nat Genet 1999;22:281–285. 75. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES and Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 1999;96:2907–2912. 76. Kaufman L and Rousseeuw PJ. Finding groups in data. An introduction to cluster analysis, Wiley, 1990. 77. Garge NR, Page GP, Sprague AP, Gorman BS and Allison DB. Reproducible clusters from microarray research: whither? BMC Bioinformatics 2005;6(Suppl. 2):S10. 78. Datta S and Datta S. Evaluation of clustering algorithms for gene expression data. BMC Bioinformatics 2006;7(Suppl. 4):S17. 79. Rousseeuw P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987;20:53–65. 80. Tibshirani R, Walther G and Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc: Ser B (Statistical Methodology) 2001;63:411–423. 81. Dudoit S and Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 2002;3,RESEARCH0036. 82. Handl J, Knowles J and Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics 2005;21:3201–3212. 83. Dembe´le´ D and Kastner P. Fuzzy C-means method for clustering microarray data. Bioinformatics 2003;19:973–980. 84. Raychaudhuri S, Stuart JM and Altman RB. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput 2000;5:455–466. 85. Liebermeister W. Linear modes of gene expression determined by independent component analysis. Bioinformatics 2002;18:51–60. 
86. Teschendorff AE, Journe´e M, Absil PA, Sepulchre R and Caldas C. Elucidating the altered transcriptional programs in breast cancer using independent component analysis. PLoS Comput Biol 2007;3:e161. 87. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD and Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531–537. 88. Zhang J and Deng H. Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics 2007;8:370. 89. Troyanskaya OG. Putting microarrays in a context: integrated analysis of diverse biological data. Brief Bioinform 2005;6:34–43. 90. Teufel A, Krupp M, Weinmann A and Galle PR. Current bioinformatics tools in genomic biomedical research (Review). Int J Mol Med 2006;17:967–973. 91. Quackenbush J. Extracting biology from high-dimensional biological data. J Exp Biol 2007;210:1507–1517. 92. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L and Yaschenko E. Database resources of the national center for biotechnology information. Nucleic Acids Res 2007;35:D5–D12.
58 93. European Biotechnology Institute: http://www.ebi.ac.uk/. 94. Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Ouverdin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A and Birney E. Ensembl 2007. Nucleic Acids Res 2007;35:D610–D617. 95. Kuhn RM, Karolchik D, Zweig AS, Trumbower H, Thomas DJ, Thakkapallayil A, Sugnet CW, Stanke M, Smith KE, Siepel A, Rosenbloom KR, Rhead B, Raney BJ, Pohl A, Pedersen JS, Hsu F, Hinrichs AS, Harte RA, Diekhans M, Clawson H, Bejerano G, Barber GP, Baertsch R, Haussler D and Kent WJ. The UCSC genome browser database: update 2007. Nucleic Acids Res 2007;35:D668–D673. 96. Gene Ontology Consortium. The Gene Ontology Project in 2008. Nucleic Acids Res 2008;36:D440–D444. 97. Aoki-Kinoshita KF and Kanehisa M. Gene annotation and pathway mapping in KEGG. Methods Mol Biol 2007;396:71–92. 98. Mi H, Guo N, Kejariwal A and Thomas PD. PANTHER version 6: protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Res 2007;35:D247–D252. 99. Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC and Conklin BR. MAPPFinder: using gene ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol 2003;4:R7. 100. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC and Weinstein JN. GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003;4:R28. 101. 
Khatri P and Dra˘ghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005;21:3587–3595. 102. Huang DW, Sherman BT, Tan Q, Kir J, Liu D, Bryant D, Guo Y, Stephens R, Baseler MW, Lane HC and Lempicki RA. DAVID bioinformatics resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res 2007;35:W169–W175. 103. Cannata N, Merelli E and Altman RB. Time to organize the bioinformatics resourceome. PLoS Comput Biol 2005;1:e76. 104. Fox JA, McMillan S and Ouellette BFF. Conducting research on the web: 2007 update for the bioinformatics links directory. Nucleic Acids Res 2007;35:W3–W5. 105. Microarray Gene Expression Database Society: http://www.mged.org/. 106. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J and Vingron M. Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 2001;29: 365–371. 107. Tan PK, Downey TJ, Spitznagel ELJ, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM and Cam MC. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 2003;31:5676–5684.
59 108. Bammler T, Beyer RP, Bhattacharya S, Boorman GA, Boyles A, Bradford BU, Bumgarner RE, Bushel PR, Chaturvedi K, Choi D, Cunningham ML, Deng S, Dressman HK, Fannin RD, Farin FM, Freedman JH, Fry RC, Harper A, Humble MC, Hurban P, Kavanagh TJ, Kaufmann WK, Kerr KF, Jing L, Lapidus JA, Lasarev MR, Li J, Li Y, Lobenhofer EK, Lu X, Malek RL, Milton S, Nagalla SR, O’malley JP, Palmer VS, Pattee P, Paules RS, Perou CM, Phillips K, Qin L, Qiu Y, Quigley SD, Rodland M, Rusyn I, Samson LD, Schwartz DA, Shi Y, Shin J, Sieber SO, Slifer S, Speer MC, Spencer PS, Sproles DI, Swenberg JA, Suk WA, Sullivan RC, Tian R, Tennant RW, Todd SA, Tucker CJ, Van Houten B, Weis BK, Xuan S and Zarbl H. Standardizing global gene expression analysis between laboratories and across platforms. Nat Methods 2005;2:351–356. 109. Petersen D, Chandramouli GVR, Geoghegan J, Hilburn J, Paarlberg J, Kim CH, Munroe D, Gangi L, Han J, Puri R, Staudt L, Weinstein J, Barrett JC, Green J and Kawasaki ES. Three microarray platforms: an analysis of their concordance in profiling gene expression. BMC Genomics 2005;6:63. 110. Wang H, He X, Band M, Wilson C and Liu L. A study of inter-lab and inter-platform agreement of DNA microarray data. BMC Genomics 2005;6:71. 111. Shi L, Tong W, Fang H, Scherf U, Han J, Puri RK, Frueh FW, Goodsaid FM, Guo L, Su Z, Han T, Fuscoe JC, Xu ZA, Patterson TA, Hong H, Xie Q, Perkins RG, Chen JJ and Casciano DA. Cross-platform comparability of microarray technology: intraplatform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics 2005;6(Suppl. 2):S12. 112. 
Baker SC, Bauer SR, Beyer RP, Brenton JD, Bromley B, Burrill J, Causton H, Conley MP, Elespuru R, Fero M, Foy C, Fuscoe J, Gao X, Gerhold DL, Gilles P, Goodsaid F, Guo X, Hackett J, Hockett RD, Ikonomi P, Irizarry RA, Kawasaki ES, Kaysser-Kranich T, Kerr K, Kiser G, Koch WH, Lee KY, Liu C, Liu ZL, Lucas A, Manohar CF, Miyada G, Modrusan Z, Parkes H, Puri RK, Reid L, Ryder TB, Salit M, Samaha RR, Scherf U, Sendera TJ, Setterquist RA, Shi L, Shippy R, Soriano JV, Wagar EA, Warrington JA, Williams M, Wilmer F, Wilson M, Wolber PK, Wu X and Zadro R. The external RNA controls consortium: a progress report. Nat Methods 2005;2:731–734. 113. Canales RD, Luo Y, Willey JC, Austermiller B, Barbacioru CC, Boysen C, Hunkapiller K, Jensen RV, Knight CR, Lee KY, Ma Y, Maqsodi B, Papallo A, Peters EH, Poulter K, Ruppel PL, Samaha RR, Shi L, Yang W, Zhang L and Goodsaid FM. Evaluation of DNA microarray results with quantitative gene expression platforms. Nat Biotechnol 2006;24:1115–1122. 114. Shippy R, Fulmer-Smentek S, Jensen RV, Jones WD, Wolber PK, Johnson CD, Pine PS, Boysen C, Guo X, Chudin E, Sun YA, Willey JC, Thierry-Mieg J, Thierry-Mieg D, Setterquist RA, Wilson M, Lucas AB, Novoradovskaya N, Papallo A, Turpaz Y, Baker SC, Warrington JA, Shi L and Herman D. Using RNA sample titrations to assess microarray platform performance and normalization techniques. Nat Biotechnol 2006;24:1123–1131. 115. MAQC Consortium. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Schrf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, Zhang L, Amur S, Bao W, Barbacioru CC, Lucas AB, Bertholet V, Boysen C, Bromley B, Brown D, Brunner A, Canales R, Cao XM, Cebula TA, Chen JJ, Cheng J, Chu T, Chudin E, Corson J, Corton JC, Croner LJ,
60
116.
117.
118.
119.
120. 121.
122.
123.
Davies S, Davison TS, Delenstarr G, Deng X, Dorris D, Eklund AC, Fan X, Fang H, Fulmer-Smentek S, Fuscoe JC, Gallagher K, Ge W, Guo L, Guo X, Hager J, Haje PK, Han J, Han T, Harbottle HC, Harris SC, Hatchwell E, Hauser CA, Hester S, Hong H, Hurban P, Jackson SA, Ji H, Knight CR, Kuo WP, LeClerc JE, Levy S, Li Q, Liu C, Liu Y, Lombardi MJ, Ma Y, Magnuson SR, Maqsodi B, McDaniel T, Mei N, Myklebost O, Ning B, Novoradovskaya N, Orr MS, Osborn TW, Papallo A, Patterson TA, Perkins RG, Peters EH, Peterson R, Philips KL, Pine PS, Pusztai L, Qian F, Ren H, Rosen M, Rosenzweig BA, Samaha RR, Schena M, Schroth GP, Shchegrova S, Smith DD, Staedtler F, Su Z, Sun H, Szallasi Z, Tezak Z, Thierry-Mieg D, Thompson KL, Tikhonova I, Turpaz Y, Vallanat B, Van C, Walker SJ, Wang SJ, Wang Y, Wolfinger R, Wong A, Wu J, Xiao C, Xie Q, Xu J, Yang W, Zhang L, Zhong S, Zong Y and Slikker WJ. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006;24:1151–1161. Guo L, Lobenhofer EK, Wang C, Shippy R, Harris SC, Zhang L, Mei N, Chen T, Herman D, Goodsaid FM, Hurban P, Phillips KL, Xu J, Deng X, Sun YA, Tong W, Dragan YP and Shi L. Rat toxicogenomic study reveals analytical consistency across microarray platforms. Nat Biotechnol 2006;24:1162–1169. Chen J, Hsueh H, Delongchamp R, Lin C and Tsai C. Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data. BMC Bioinformatics 2007;8:412. Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M and Edgar R. NCBI GEO: mining tens of millions of expression profiles – database and tools update. Nucleic Acids Res 2007;35:D760–D765. Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A, Holloway E, Kolesnykov N, Lilja P, Lukk M, Mani R, Rayner T, Sharma A, William E, Sarkans U and Brazma A. 
ArrayExpress – a public database of microarray experiments and gene expression profiles. Nucleic Acids Res 2007;35:D747–D750. Lu Y, Yi Y, Liu P, Wen W, James M, Wang D and You M. Common human cancer genes discovered by integrated gene-expression analysis. PLoS ONE 2007;2:e1149. English SB and Butte AJ. Evaluation and integration of 49 genome-wide experiments and the prediction of previously unknown obesity-related genes. Bioinformatics 2007;23:2910–2917. Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K and Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc Natl Acad Sci USA 2007;104:19428–19433. ENCODE Project Consortium. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo´ R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SCJ, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermu¨ller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O,
61 Pedersen MC, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung W, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei C, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karao¨z U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Lo¨ytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CWH, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JNS, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PIW, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrı´ msdo´ttir IB, Huppert J, Zody MC, Abecasis GR, Estivill 
X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VVB, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B and de Jong PJ. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007;447:799–816. 124. Velculescu VE and Kinzler KW. Gene expression analysis goes digital. Nat Biotechnol 2007;25:878–880. 125. Velculescu VE, Zhang L, Vogelstein B and Kinzler KW. Serial analysis of gene expression. Science 1995;270:484–487.