Simultaneous confidence intervals for differential gene expressions


Journal of Statistical Planning and Inference 136 (2006) 2182 – 2196 www.elsevier.com/locate/jspi

Jason C. Hsu (a,∗), Jane Y. Chang (b), Tao Wang (c)

a Department of Statistics, Ohio State University, Columbus, OH 43210, USA
b Department of ASOR, Bowling Green State University, Bowling Green, OH 43403, USA
c Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33612, USA

Available online 7 September 2005

Abstract

In this article, we describe an environment for designing and analyzing gene expression data from 2-channel microarrays, from design of the arrays, to exact multiplicity-adjusted inference for differential gene expressions. There are two possible formulations to the comparison of gene expression levels: tests of equality and confidence intervals. The testing formulation has been more popular, but confidence intervals are more informative because they not only infer the existence of differential expressions, but also bound the magnitude of such differentials. We provide simultaneous confidence intervals. Before gene expression levels can be compared, the data need to be pre-processed to adjust for variations in the arrays, and unequal affinity of genes to dyes. Processed data are then normalized with respect to control genes (which may be all the target genes or a set of housekeeping genes). The approach we take is to model the data. Pre-processing to adjust for array and dye differences amounts to fitting array and dye effects and gene × dye interaction in the model. A set of control genes, chosen to be normalized, specifies a set of estimable functions as the normalized expression differentials. The correlation structure of estimates of these functions can then be derived based on the model, allowing accurate multiplicity adjustment to obtain simultaneous confidence intervals. This is illustrated with the analysis of a real data set.
© 2005 Elsevier B.V. All rights reserved.

Keywords: cDNA microarray; Gene expressions; Generalized Latin Square designs; Multiple comparisons; Simultaneous confidence intervals; Bootstrap

∗ Corresponding author.

E-mail address: [email protected] (J.C. Hsu).

0378-3758/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2005.08.029


1. Possible uses of transcription profiling

2-channel microarrays compare the relative expression levels of genes in two samples by comparing the relative abundance of mRNA in the two samples corresponding to each gene. This so-called transcription profiling is done by measuring the amounts of mRNA in the samples hybridized to the cDNA of each gene clone printed on each microarray. One possible use of gene expression analysis is designer medicine: treating each patient according to his/her gene expression profile, much as one might treat a patient according to his/her age, sex, ethnicity, and other covariates. Another possible use of gene expression analysis is patient targeting: excluding patient subpopulations prone to serious Adverse Events (AE), thereby making more molecular entities available to benefit the vast majority of the population. A final possible use of gene expression analysis is drug targeting: finding proteins that co-regulate genes involved in a disease process.

Microarrays for diagnostic and prognostic purposes are considered medical devices, and are subject to regulation by the FDA. The Center for Devices and Radiological Health (CDRH) of the FDA has issued a draft guidance on microarrays, titled Multiplex Tests for Heritable DNA Markers, Mutations and Expression Patterns. Draft Guidance for Industry and FDA Reviewers (http://www.fda.gov/cdrh/oivd/guidance/1210.html). Methodological development for such diagnostic/prognostic microarrays should be cognizant of the statistical requirements for such devices.

Different possible uses of transcription profiling may call for different statistical formulations. In this article, we concentrate on drug targeting, viewing it as a 2-step process. The first step uses microarray studies to select a small subset of the genes confidently inferred to be involved in the disease process. The second step then tries to find common transcription factor(s) co-regulating these genes as drug targets.

2. Two statistical formulations of gene expression level comparisons

One formulation of gene expression analysis is tests of equality of gene expression levels (Dudoit et al., 2002; Yang et al., 2001; Grant et al., 2002). Another formulation is confidence intervals (Kerr and Churchill, 2001a,b; Kerr et al., 2000). A disadvantage of the tests-of-equality approach is that p-values from tests of equality are uninformative regarding the magnitude of differential expression levels. Confidence intervals can be used for testing but provide additional information on the magnitude of differential expression levels. For example, suppose the confidence interval for the ratio of expression levels is (1.01, 1.02) for gene 100. Even though this corresponds to a rejection of the null hypothesis of equality under the testing formulation, the differential in the expression level is not likely to be biologically significant. On the other hand, suppose the confidence interval for the ratio of expression levels is (101, 102) for gene 200. Even though this corresponds to the same p-value as a confidence interval of (1.01, 1.02) for the null hypothesis of equality (if both the estimated difference and the standard error are doubled in the second data set on the log10 scale), the differential in the expression level would be biologically significant in this case.

The confidence intervals in Kerr and Churchill (2001a,b) and Kerr et al. (2000) are individual confidence intervals. We provide simultaneous confidence intervals. An investigator can use these simultaneous confidence intervals to select genes to investigate further, being confident that all the genes selected are involved in the disease process with pre-specified probability. For example, suppose the investigator is primarily concerned with not pursuing any gene unless it is up-regulated threefold or more. Then the set of genes whose simultaneous lower confidence bounds are


greater than or equal to three is such a set. In making such inferences, the familywise error rate (FWER) is controlled strongly. That is, the maximum probability of incorrectly inferring any gene to be significantly differentially expressed is controlled.

In bioinformatics, it is popular to control an alternative error rate called the false discovery rate (FDR). Controlling FDR controls the expected proportion of genes not truly differentially expressed among the genes inferred to be differentially expressed. Control of FDR is often insufficient to guard against making an incorrect decision, because one can inflate the level at which the hypotheses of interest are tested by adding null hypotheses known to be false (see Section 6 of Finner and Roters, 2001). Suppose, for example, a single null hypothesis is of interest. If three additional null hypotheses known to be false are added, then the null hypothesis of interest can in effect be tested at a level of 20% while controlling the FDR at 5%. This phenomenon can be exploited to artificially inflate the probability of finding significant differences while keeping FDR fixed. For example, a host of genes (e.g., p53, p16, Ras, ETS, . . .) are known to be likely involved in many cancer processes. Thus, in studies of cancer, one can inflate the probability of finding differentially expressed genes beyond these genes simply by including these genes, which can be expected to be differentially expressed, in the microarray experiment.

3. Modeling 2-channel microarray gene expression data

We refer to the two conditions giving rise to potential differential gene expressions as the two "treatments," which may be healthy and disease states of tissues, etc. Measurements of gene expression levels are affected not only by which treatment is applied and which gene is involved, but potentially also by array and dye variations. The amount of cDNA deposited may differ from one array to the next.
Genes might have differential affinity for the Cy3 (green) dye or the Cy5 (red) dye. But suppose W_{1l} is the observed expression level of the lth gene under treatment 1, and W_{2l} is the observed expression level of the lth gene under treatment 2, after the data have been pre-processed to remove the effects of such variations. Then of interest are the ratios of the average expression levels E[W_{2l}]/E[W_{1l}], l = 1, . . . , g. If the distribution of log(W_{1l}) differs from the distribution of log(W_{2l}) by a location shift only:

log(W_{2l}) =_D log(W_{1l}) + θ_l,   l = 1, . . . , g,

where =_D denotes equality in distribution, then E[log(W_{2l})] − E[log(W_{1l})] = θ_l. But W_{2l} =_D e^{θ_l} W_{1l}, so E[W_{2l}]/E[W_{1l}] = e^{θ_l}. Therefore, in this case, inference on the differences of averages of logarithms of gene expression levels corresponds to inference on the ratios of average gene expression levels. Working with logarithms of intensity data, we will provide simultaneous confidence intervals for differences of mean log intensities. But we will describe the results of analyses implicitly assuming the confidence intervals have been transformed back to intervals for ratios of mean intensities.

One way to analyze gene expression data is to model the logarithms of the intensities by a comprehensive linear model, as in Kerr et al. (2000). Let y_{ijklm} be the observation on the ith array, jth dye, kth treatment, lth gene, and mth replication. Then a possible model for the mean response E[log(y_{ijklm})], which we use for illustration, is as follows:

log(y_{ijklm}) = μ + α_i + δ_j + τ_k + γ_l + (δγ)_{jl} + (τγ)_{kl} + ε_{ijklm},   (1)


where μ is the overall mean, α_i is the effect of the ith array, δ_j is the effect of the jth dye, τ_k is the effect of the kth treatment, γ_l is the effect of the lth gene, (δγ)_{jl} is the interaction of the jth dye and the lth gene, (τγ)_{kl} is the interaction of the kth treatment and the lth gene, and the ε_{ijklm} are identically distributed random errors with mean zero.

There are actually different types of replications: biological replications (different biological samples), technical replications (between-array replications), and within-array replications. Churchill (2002) and Yang and Speed (2002) gave insightful discussions of the different types of replications. In essence, biological replicates are indispensable in order to generalize the findings from the experiment to a larger population. Within-array replications provide quality control and increase the precision of the estimates. Technical replications are crucial for supplying a measurement of the between-array variability. Observations between arrays are independent if each array consists of a unique biological sample. Observations within each array from the same biological sample may be correlated. Such correlation can be modeled by treating array as a random effect, which will yield the same analysis as discussed in this article. Our approach uses least squares estimates, an advantage of which is that they can be computed without iteration, a big advantage in gene expression analysis. A disadvantage is that they are not fully efficient if the errors are not normally distributed.

3.1. Pre-processing of data to adjust for array and dye variations

Inclusion of array, dye, and gene × dye interaction in essence pre-processes the data to adjust for array and dye variations in a way similar to what other authors have proposed (e.g., Amaratunga and Cabrera, 2001).
This is because least squares estimates of the treatment × gene interactions (τγ)_{kl} in the comprehensive model can be obtained by first regressing y_{ijklm} and (the indicator variables of) the treatment × gene interactions on (the indicator variables of) all other explanatory variables, and then regressing the residuals on the residuals (see Weisberg, 1985, Section 2.4). As will be shown in the next subsection, whether one normalizes using all the genes or a set of housekeeping genes, the parameters of interest are functions of (τγ)_{kl}.

3.2. Normalization of gene expression data

Yang et al. (2001) discussed different ways of normalizing cDNA microarray data. They proposed that all genes on the array may be used for normalization when only a small proportion of genes are expected to be differentially expressed, or when there is symmetry in the expression levels of the up/down-regulated genes. Otherwise a smaller subset of genes, the so-called housekeeping genes, which are expressed in all cells, can be used. Housekeeping genes not involved in a disease under study nevertheless often are not constantly expressed across different tissue types. But if the differentials of a subgroup of housekeeping genes remain relatively constant across cell types, then they become good candidates for normalization. Finding such housekeeping genes has been the focus of some recent activity (see Vandesompele et al., 2002 and http://www.wzw.tum.de/gene-quantification/hkg.html). We are thus interested in the difference of log intensities between treatments for each gene, compared to the mean difference in log intensities between treatments averaged over either all genes or a set of housekeeping genes.


If we normalize based on the treatment difference averaged over all genes, then the parameters representing normalized differential gene expressions become

θ_l = τ_1 + γ_l + (τγ)_{1l} − (τ_2 + γ_l + (τγ)_{2l})
      − [ ( τ_1 + (1/g) Σ_{j=1}^g γ_j + (1/g) Σ_{j=1}^g (τγ)_{1j} ) − ( τ_2 + (1/g) Σ_{j=1}^g γ_j + (1/g) Σ_{j=1}^g (τγ)_{2j} ) ]
    = (τγ)_{1l} − (τγ)_{2l} − [ (1/g) Σ_{j=1}^g (τγ)_{1j} − (1/g) Σ_{j=1}^g (τγ)_{2j} ],   l = 1, . . . , g,

where a "dot" in a subscript and an overbar indicate averaging over the levels of the corresponding factor. Similarly, if genes 1, 2, . . . , h form a set of housekeeping genes, genes h + 1, . . . , g are the target genes, and we normalize based on the treatment difference averaged over the housekeeping genes only, then the parameters representing normalized differential gene expressions become

θ_l = (τγ)_{1l} − (τγ)_{2l} − [ (1/h) Σ_{j=1}^h (τγ)_{1j} − (1/h) Σ_{j=1}^h (τγ)_{2j} ],   l = h + 1, . . . , g.
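As a numerical sanity check (ours, not from the paper), the claim that the normalized differentials are estimable functions of the model can be verified on simulated data: fit the over-parameterized model by least squares, form the per-gene treatment contrast of fitted cell means, and center it at the all-gene average contrast. On noiseless data this recovers the normalized differentials exactly, since array and dye effects cancel in a balanced layout. The tiny 2-array design and all effect values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
a, g = 2, 4                        # arrays, genes (tiny hypothetical layout)
rows = []                          # (array, dye, treatment, gene) indices
for i in range(a):
    for j in range(2):             # generalized Latin Square: treatment = (i + j) mod 2
        for l in range(g):
            rows.append((i, j, (i + j) % 2, l))
rows = np.array(rows)

# hypothetical true effects on the log scale
alpha = np.array([0.5, -0.5])             # array
delta = np.array([0.2, -0.2])             # dye
tau   = np.array([0.3, -0.3])             # treatment
gamma = rng.normal(size=g)                # gene
dg    = 0.1 * rng.normal(size=(2, g))     # dye x gene
tg    = 0.5 * rng.normal(size=(2, g))     # treatment x gene
y = (alpha[rows[:, 0]] + delta[rows[:, 1]] + tau[rows[:, 2]] + gamma[rows[:, 3]]
     + dg[rows[:, 1], rows[:, 3]] + tg[rows[:, 2], rows[:, 3]])   # noiseless

# over-parameterized dummy design; fitted values are unique even so
def dummies(idx, levels):
    return np.eye(levels)[idx]

X = np.hstack([np.ones((len(y), 1)),
               dummies(rows[:, 0], a), dummies(rows[:, 1], 2),
               dummies(rows[:, 2], 2), dummies(rows[:, 3], g),
               dummies(rows[:, 1] * g + rows[:, 3], 2 * g),   # dye x gene cells
               dummies(rows[:, 2] * g + rows[:, 3], 2 * g)])  # treatment x gene cells
fit = X @ np.linalg.lstsq(X, y, rcond=None)[0]

# normalized differential: treatment contrast of fitted cell means per gene,
# centered at the average contrast over all genes
cell = np.array([[fit[(rows[:, 2] == k) & (rows[:, 3] == l)].mean() for l in range(g)]
                 for k in range(2)])
theta_hat = (cell[0] - cell[1]) - (cell[0] - cell[1]).mean()
theta     = (tg[0] - tg[1]) - (tg[0] - tg[1]).mean()
print(np.allclose(theta_hat, theta))   # True: array and dye effects drop out exactly
```

The rank-deficient design matrix is handled by the minimum-norm least squares solution; only fitted values are used, and those are unique.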

These estimates can be computed especially efficiently if microarrays are designed as we recommend in the next section, because then explicit formulas for the estimates become available, avoiding the CPU time and memory penalties of matrix inversions.

4. 2 × a generalized Latin Square designs for gene expression analysis

Since the parameters of interest are functions of interactions between the gene factor and the treatment factor, dye and array can be considered as blocking factors. In order to avoid confounding of dye and array effects with the treatment effect, it is natural to consider a design in which each treatment appears an equal number of times with each dye, and each treatment appears an equal number of times in each array, leading to generalized Latin Square designs. We also recommend that the number of times each gene is spotted remain the same for all the arrays, because it turns out that least squares estimates as well as simultaneous confidence intervals are easier to compute for such a design. Thus, using a arrays (a even) and two dyes to compare two treatments T1 and T2, a design which can be called a generalized Latin Square design for gene expression analysis would look like

Dye\Array   1    2    3    . . .   a−1   a
1           T1   T2   T1   . . .   T1    T2
2           T2   T1   T2   . . .   T2    T1

To avoid systematic bias, the positions at which the genes are spotted ideally would be completely randomized, separately for each array. (This can be done using a facility in the software CloneTracker, for example.) Then, the array labels in the design are randomly assigned to the


actual microarrays, and the two dye labels in the design are randomly assigned to the Cy3 and Cy5 dyes.

To compare three or more treatments, one can use a Generalized Youden Design. For example, to compare three treatments T1, T2, and T3 using 6 arrays, one can use the design

Dye\Array   1    2    3    4    5    6
1           T1   T2   T3   T2   T3   T1
2           T2   T3   T1   T1   T2   T3

Other designs are possible for comparing three or more treatments using 2-channel microarrays, such as the loop designs and reference designs discussed in Kerr and Churchill (2001a). If y_{ijklm} is the observation on the ith array, jth dye, kth treatment, lth gene, and mth replication, then model (1) for generalized Latin Square designs becomes

log(y_{ijklm}) = μ + α_i + δ_j + τ_k + γ_l + (δγ)_{jl} + (τγ)_{kl} + ε_{ijklm},
i = 1, . . . , a;  j = 1, 2;  k = 1, 2;  l = 1, 2, . . . , g;  m = 1, 2, . . . , n_l,   (2)

where n_l, the number of times gene l is spotted, may differ for different genes but for each gene remains the same for all the arrays. If the microarrays are designed as described, then it can be shown that all factors are orthogonal to each other (see Appendix A of Hsu et al., 2002; Chang et al., 2002). Consequently, least squares estimates and the MSE can be expressed explicitly using "dot-bar" notation. For example, if one normalizes with respect to the overall mean, then the least squares estimate of θ_l is

θ̂_l = (ȳ_{..1l.} − ȳ_{..2l.}) − ( (1/g) Σ_{j=1}^g ȳ_{..1j.} − (1/g) Σ_{j=1}^g ȳ_{..2j.} ),   l = 1, 2, . . . , g.

Note that these are the estimates for the interaction effect if one imposes the sum-to-zero constraints, as in Kerr et al. (2000), Kerr and Churchill (2001a,b), and Kerr et al. (2002). If one normalizes with respect to the housekeeping genes instead, then the least squares estimate of θ_l is

θ̂_l = (ȳ_{..1l.} − ȳ_{..2l.}) − ( (1/h) Σ_{j=1}^h ȳ_{..1j.} − (1/h) Σ_{j=1}^h ȳ_{..2j.} ),   l = h + 1, h + 2, . . . , g.
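These closed-form "dot-bar" computations reduce to averaging, with no matrix inversion; a minimal numpy sketch on hypothetical simulated data (the layout and the choice of housekeeping genes below are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
a, g, n = 4, 6, 2        # arrays, genes, spots of each gene per array (hypothetical)
# y[i, j, l, m]: log intensity on array i, dye j, gene l, spot m;
# in the generalized Latin Square, treatment is a function of (array, dye)
y = rng.normal(size=(a, 2, g, n))
trt = (np.arange(a)[:, None] + np.arange(2)[None, :]) % 2   # 0 -> T1, 1 -> T2

# dot-bar means ybar_{..kl.}: average over arrays, dyes, and spots
ybar = np.array([[y[trt == k][:, l, :].mean() for l in range(g)] for k in range(2)])

d = ybar[0] - ybar[1]            # per-gene treatment differences
theta_all = d - d.mean()         # normalized with respect to all genes
h = 2                            # pretend the first h genes are housekeeping genes
theta_hk = d[h:] - d[:h].mean()  # normalized with respect to housekeeping genes
print(theta_all.shape, theta_hk.shape)   # (6,) (4,)
```

The all-gene estimates sum to zero by construction, consistent with the sum-to-zero constraints mentioned above.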

5. Simultaneous confidence intervals for models with normal errors

Suppose the errors ε_{ijklm} are independent Normal(0, σ²). Then the least squares estimates of the θ_l are the best linear unbiased estimates (BLUE) for them. It turns out that multiplicity adjustment is somewhat easier when normalizing with respect to housekeeping genes than when normalizing with respect to all the genes, so we discuss the case of normalizing with respect to housekeeping genes first.


5.1. Normalizing with respect to housekeeping genes

A familiar form of multiple comparison inference is multiple comparison with a control (MCC), exemplified by Dunnett's (1955) method. When normalizing with respect to a set of housekeeping genes, each expression differential is compared to the average differential of the housekeeping genes. Thus, differential expression analysis when normalizing with respect to housekeeping genes is essentially MCC analysis, with the "control" being the average differential of the housekeeping genes. If genes 1, 2, . . . , h are the housekeeping genes, genes h + 1, . . . , g are the target genes, and we normalize based on the treatment difference averaged over the housekeeping genes only, then exact simultaneous confidence intervals for the target-gene parameters θ_l, l = h + 1, . . . , g, are of the form

(θ̂_l − q_C SE_l, θ̂_l + q_C SE_l),   l = h + 1, . . . , g,

where SE_l is the estimated standard error of θ̂_l and the critical value q_C is the upper 100(1 − α) percentile of the distribution of max_{l=h+1,...,g} |θ̂_l − θ_l| / SE_l. As shown in Hsu (1992), for the critical value q_C to be exactly computable, the correlation matrix of θ̂_l, l = h + 1, . . . , g, needs to have the so-called 1-factor structure, (7.31) in Hsu (1996, Chapter 7). This structure is defined as follows. Let R denote the correlation matrix of θ̂_l, l = h + 1, . . . , g. Then R is said to have the 1-factor structure if there exists a vector of constants λ = (λ_{h+1}, . . . , λ_g) such that R = D + λλ′, where D is a diagonal matrix. Now

Var(θ̂_l) = Var(ȳ_{..1l.} − ȳ_{..2l.}) + Var( (1/h) Σ_{j=1}^h ȳ_{..1j.} − (1/h) Σ_{j=1}^h ȳ_{..2j.} )
          = 2σ² / (a n_l) + (2σ² / a) (1/h²) Σ_{j=1}^h 1/n_j
          = (2σ² / a) ( 1/n_l + 1/(h N_H) ),   l = h + 1, . . . , g,

and

Cov(θ̂_l, θ̂_{l′}) = Var( (1/h) Σ_{j=1}^h ȳ_{..1j.} − (1/h) Σ_{j=1}^h ȳ_{..2j.} )
                  = (2σ² / a) · 1/(h N_H),   l ≠ l′,

where

1/N_H = (1/h) Σ_{j=1}^h 1/n_j.

So

SE_l = sqrt( (2σ̂² / a) ( 1/n_l + 1/(h N_H) ) ),

and the correlation matrix of θ̂_l, l = h + 1, . . . , g, has the 1-factor structure with λ_l = (1 + h N_H / n_l)^{−1/2}. Therefore, exact simultaneous confidence intervals for differential expressions can be computed, using for example the ProbMC function in SAS to compute q_C (with the distribution being "dunnett2", the λ parameters being (1 + h N_H / n_l)^{1/2}, l = h + 1, . . . , g, and df being the error degrees of freedom).

5.2. Normalizing with respect to all the genes

Though not as well known as Tukey's all-pairwise comparisons method or Dunnett's multiple comparisons with a control method, one form of multiple comparison inference is multiple comparison with the mean (MCM), which compares each parameter with the average of all the parameters. First proposed by Halperin et al. (1955), MCM is also known as the analysis of means (ANOM). When normalizing with respect to all the genes, each differential expression is compared with the differentials averaged over all the genes, so this is essentially a form of MCM analysis. Exact simultaneous confidence intervals for θ_l, l = 1, . . . , g, are of the form

(θ̂_l − q_M SE_l, θ̂_l + q_M SE_l),   l = 1, . . . , g,

where SE_l is the estimated standard error of θ̂_l, and the critical value q_M is the upper 100(1 − α) percentile of the distribution of max_{l=1,...,g} |θ̂_l − θ_l| / SE_l. Computing q_M is more difficult than computing q_C. Let R denote the correlation matrix of θ̂_l, l = 1, . . . , g. Soong (2001) gave a Gaussian quadrature algorithm with complex-variable integrands to compute q_M, which is applicable if R has the +/− factor structure. R has this structure if there exist constant vectors λ = (λ_1, . . . , λ_g) and ω = (ω_1, . . . , ω_g) such that R = D + λλ′ − ωω′, where D is a diagonal matrix. Now

Var(θ̂_l) = Var(ȳ_{..1l.} − ȳ_{..2l.}) + Var( (1/g) Σ_{j=1}^g (ȳ_{..1j.} − ȳ_{..2j.}) )
            − 2 Cov( ȳ_{..1l.} − ȳ_{..2l.}, (1/g) Σ_{j=1}^g (ȳ_{..1j.} − ȳ_{..2j.}) )
          = 2σ² / (a n_l) − (2/g) · 2σ² / (a n_l) + (2σ² / a) (1/g²) Σ_{j=1}^g 1/n_j
          = (2σ² / (a n_l)) (1 − 2/g) + (2σ² / a) · 1/(g N_G)
          = (2σ² / a) [ (1/n_l)(1 − 2/g) + 1/(g N_G) ],   l = 1, 2, . . . , g,

and

Cov(θ̂_l, θ̂_{l′}) = Cov( ȳ_{..1l.} − ȳ_{..2l.} − (1/g) Σ_{j=1}^g (ȳ_{..1j.} − ȳ_{..2j.}),
                         ȳ_{..1l′.} − ȳ_{..2l′.} − (1/g) Σ_{j=1}^g (ȳ_{..1j.} − ȳ_{..2j.}) )
                  = (2σ² / a) [ 1/(g N_G) − 1/(g n_l) − 1/(g n_{l′}) ],   l ≠ l′,

where

1/N_G = (1/g) Σ_{j=1}^g 1/n_j.

So

SE_l = sqrt( (2σ̂² / a) [ (1/n_l)(1 − 2/g) + 1/(g N_G) ] ),

and the correlation matrix of θ̂_l, l = 1, . . . , g, has the +/− factor structure with λ_l = (1 − N_G/n_l)/h_l and ω_l = (N_G/n_l)/h_l, where h_l = [1 + (N_G/n_l)/(g − 2)]^{1/2}. Thus the algorithm of Soong (2001) is in theory applicable. If n_1 = n_2 = · · · = n_g and the number of genes g is large, then, as shown in Fritsch and Hsu (1997), the 100(1 − α) percentile |m|_α of the Studentised maximum modulus distribution with g treatments and the MSE error degrees of freedom can be used to approximate the exact critical value q_M. The ProbMC function in SAS can be used to compute |m|_α (with the distribution being "maxmod", nparms being g, and df being the error degrees of freedom).

6. Simultaneous confidence intervals for models with non-normal errors

Suppose the model for the response log(y_{ijklm}) can be written as the linear model (2) but the i.i.d. errors ε_{ijklm} are not normally distributed. One can still use least squares estimates to estimate the parameters of interest:

θ̂_l = (ȳ_{..1l.} − ȳ_{..2l.}) − ( (1/g) Σ_{j=1}^g ȳ_{..1j.} − (1/g) Σ_{j=1}^g ȳ_{..2j.} ),   l = 1, . . . , g,

when normalizing with respect to all the genes, or

θ̂_l = (ȳ_{..1l.} − ȳ_{..2l.}) − ( (1/h) Σ_{j=1}^h ȳ_{..1j.} − (1/h) Σ_{j=1}^h ȳ_{..2j.} ),   l = h + 1, . . . , g,

when normalizing with respect to housekeeping genes. If we let g∗ = 1 or h + 1 depending on whether normalization is with respect to all the genes or the housekeeping genes, then the least


squares estimators θ̂_l, l = g∗, . . . , g, are unbiased and computationally feasible even for a large number of genes, though theoretically not fully efficient.

To obtain confidence intervals for θ_l, l = g∗, . . . , g, one can use the bootstrap technique. To obtain individual (not multiplicity-adjusted) confidence intervals, Kerr et al. (2000) bootstrapped the distribution of the difference θ̂_l − θ_l for the single gene l = 1, and used the difference between the 100(1 − α/2) percentile and the 100(α/2) percentile as the width for confidence intervals placed symmetrically around θ̂_l, l = 1, . . . , g. In effect, they assumed that the distribution of (θ̂_l − θ_l), l = 1, . . . , g, is symmetrical about zero.

To motivate bootstrap simultaneous confidence intervals, we note that one can state the goal as obtaining percentiles q_L and q_U so that

P( θ̂_l − q_U √MSE < θ_l < θ̂_l − q_L √MSE for all l = g∗, . . . , g )
  = P( q_L < (θ̂_l − θ_l)/√MSE < q_U for all l = g∗, . . . , g )
  = P( q_L < min_{l=g∗,...,g} (θ̂_l − θ_l)/√MSE and max_{l=g∗,...,g} (θ̂_l − θ_l)/√MSE < q_U )
  = 100(1 − α)%.

To approximate the percentiles q_L and q_U, we recommend applying the bootstrap-t technique. Specifically, in analogy to Beran (1988), we bootstrap the distribution of the maximum and the minimum (over l = g∗, . . . , g) of the Studentised "roots"

(θ̂_l − θ_l)/√MSE,   l = g∗, . . . , g,   (3)

from the sample. We then use the 100(1 − α/2) percentile q∗_U of the bootstrap distribution of the maximum to set the lower confidence limits, and we use the 100(α/2) percentile q∗_L of the bootstrap distribution of the minimum to set the upper confidence limits. Thus the bootstrap-t simultaneous confidence intervals are

( θ̂_l − q∗_U √MSE, θ̂_l − q∗_L √MSE ),   l = g∗, . . . , g.

We do not recommend directly bootstrapping θ̂_l, l = g∗, . . . , g, using the percentile technique, because generally pre-pivoting improves the performance of the bootstrap. The reason we bootstrap the Studentised roots (3), in accordance with the bootstrap-t technique (Efron and Tibshirani, 1993, Section 12.5), instead of the un-Studentised roots (θ̂_l − θ_l), l = g∗, . . . , g, is that the bootstrap distribution of the Studentised roots seems more stable than the bootstrap distribution of the un-Studentised roots. Finally, by bootstrapping the distributions of the maximum and minimum separately (as opposed to bootstrapping the distribution of the absolute values), we allow for the possibility that the distribution of the roots is not symmetrical.

7. An example

To illustrate the feasibility of our proposal, we applied our simultaneous confidence interval methods to the Synteni data of Kerr et al. (2000), which was obtained using a 2 × 2 Latin Square design. The experiment involves a total of 1286 genes observed over 2 treatments. Each gene is replicated four times by using two arrays and two dyes. Hence, the data has a total of 1286 × 4 = 5144 observations.
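As an aside, the bootstrap-t recipe of Section 6 can be sketched on simulated data (hypothetical, not the Synteni set). The residual resampling and the replicate-based MSE below are simplifications of ours for illustration; note that the within-cell residuals are deflated relative to the error scale, but that deflation roughly cancels in the Studentised roots, which is the pre-pivoting advantage cited above.

```python
import numpy as np

rng = np.random.default_rng(3)
a, g, n = 2, 50, 2                        # arrays, genes, spots per array (hypothetical)
trt = (np.arange(a)[:, None] + np.arange(2)[None, :]) % 2   # Latin Square layout

def estimate(y):
    """Dot-bar estimates of the normalized differentials (all-gene normalization)."""
    ybar = np.array([[y[trt == k][:, l, :].mean() for l in range(g)] for k in range(2)])
    d = ybar[0] - ybar[1]
    return d - d.mean()

def mse(y):
    """Replicate-based error mean square (a simplified stand-in for the model MSE)."""
    r = y - y.mean(axis=3, keepdims=True)
    return (r ** 2).sum() / (a * 2 * g * (n - 1))

theta = np.where(np.arange(g) < 5, 1.2, 0.0)     # 5 truly up-regulated genes
y = 0.2 * rng.standard_normal((a, 2, g, n))
y += np.where(trt == 0, 0.5, -0.5)[:, :, None, None] * theta[None, None, :, None]

theta_hat, s = estimate(y), np.sqrt(mse(y))
resid = (y - y.mean(axis=3, keepdims=True)).ravel()

B = 1000                                          # bootstrap replications
mx, mn = np.empty(B), np.empty(B)
for b in range(B):
    # resample residuals onto the fitted cell means, then re-Studentise
    ystar = y.mean(axis=3, keepdims=True) + rng.choice(resid, size=y.shape)
    roots = (estimate(ystar) - theta_hat) / np.sqrt(mse(ystar))
    mx[b], mn[b] = roots.max(), roots.min()

qU, qL = np.quantile(mx, 0.975), np.quantile(mn, 0.025)
lower, upper = theta_hat - qU * s, theta_hat - qL * s
print(int((lower > 0).sum()))    # genes whose simultaneous lower bound clears zero
```

The maximum and minimum are bootstrapped separately, so the resulting intervals need not be symmetric, in line with the discussion above.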

2192

J.C. Hsu et al. / Journal of Statistical Planning and Inference 136 (2006) 2182 – 2196


Fig. 1. Model (4) residual versus gene plot labeled by dye (marker “+” for dye 1 and “·” for dye 2), showing dye × gene interaction.

7.1. Modeling

In gene expression analysis, it is not a good idea to base the choice of model on p-values. This is because terms involving genes, including the dye × gene interaction (δγ)_{jl}, as well as the error term, have degrees of freedom of at least g − 1. In general, F-tests of such overall hypotheses with large degrees of freedom make no sense, because a small effect leads to almost sure rejection. This phenomenon can be attributed to the fact that, for large degrees of freedom, a Chi-square random variable divided by its degrees of freedom has a distribution highly concentrated around one. Thus, with the number of genes g typically large, the null distribution of the F statistic has practically a point mass at one. Therefore, any sample F statistic greater than one leads to a p-value close to zero, and any sample F statistic less than one leads to a p-value close to one. For example, the F value for (δγ)_{jl} for the Synteni data is 0.53, which leads to a p-value of 1.000. However, plotting residuals versus genes labeled by dye showed that dye 1 tends to have more positive residuals than dye 2, as indicated in Fig. 1. Thus, the model

log(y_{ijklm}) = μ + α_i + δ_j + τ_k + γ_l + (τγ)_{kl} + ε_{ijklm}   (4)

is inadequate. Analysis of variance p-values are non-informative regarding the magnitude of the effects. A boxplot of the residuals, Fig. 2, indicates that four observations may be outliers; they are identified as coming from gene 2379. We thus delete the four observations from gene 2379 and fit model (2) again. Residual plots show no sign of special patterns, and the normal probability plot follows


Fig. 2. Boxplot showing four outliers.

a straight line, with a Shapiro-Wilk test p-value of 0.0796. We compute both normal error and bootstrap simultaneous confidence intervals.

7.2. Normalizing with respect to all genes

For α = 0.05, the Studentised maximum modulus critical value |m|_α is 4.1215. Note that this critical value has 1284 degrees of freedom based on model (2). For the normal error model, the estimated standard error SE_l of θ̂_l is 0.3476. Thus, the half-width of the 95% normal error simultaneous confidence intervals for θ_l is 4.1215 × 0.3476 = 1.4326, and the confidence intervals are

(θ̂_l − 1.4326, θ̂_l + 1.4326),   l = 1, . . . , 1285.

Based on 20,000 bootstrap replications, the 97.5 percentile q∗_U of the maximum of the Studentised roots (θ̂_l − θ_l)/√MSE is 4.2957, so q∗_U × √MSE = 1.4932, while the 2.5 percentile q∗_L of the minimum of the Studentised roots is −4.2788, so q∗_L × √MSE = −1.4873. Therefore, 95% bootstrap simultaneous confidence intervals for θ_l are

(θ̂_l − 1.4932, θ̂_l + 1.4873),   l = 1, . . . , 1285.

They are essentially symmetrical, and somewhat wider than the normal error confidence intervals. Normal error confidence intervals infer 39 genes to have expression ratio greater than 3, and 13 genes to have expression ratio less than 1/3. Bootstrap confidence intervals infer 33 of the 39 genes to have expression ratio greater than 3, and 12 of the 13 genes to have expression ratio less than 1/3.

7.3. Normalizing with respect to housekeeping genes

For illustration purposes, we took as "housekeeping" genes those genes whose 70% simultaneous confidence intervals for the expression ratio lie in the interval (1/3.6, 3.6) when normalized with respect to all genes. There are 21 such genes. Using the differentials of these 21 genes to normalize the data, we computed simultaneous confidence intervals for the differential expressions of the remaining 1285 − 21 = 1264 genes. The 97.5 percentile qC turns out to be 4.1156, and the estimated standard error SE_l of τ̂_l is 0.3559. Thus, the half-width of the 95% normal error simultaneous confidence intervals for τ_l is 4.1156 × 0.3559 = 1.4647, and the confidence intervals are

(τ̂_l − 1.4647, τ̂_l + 1.4647), l = 22, . . . , 1285.
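The housekeeping-gene screen described above can be sketched as a simple filter. This sketch assumes the differentials are on a natural-log scale, so that an expression-ratio interval inside (1/3.6, 3.6) corresponds to a log-scale interval inside (−log 3.6, log 3.6); the estimates and the 70% half-width used here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log-scale differential estimates and a hypothetical
# 70% simultaneous half-width (the paper's values are not reproduced).
tau_hat = rng.normal(0.0, 0.8, 200)
half70 = 0.9

# Keep genes whose whole 70% interval lies inside (-log 3.6, log 3.6),
# i.e. whose expression-ratio interval lies inside (1/3.6, 3.6).
lo, hi = tau_hat - half70, tau_hat + half70
keep = (lo > -np.log(3.6)) & (hi < np.log(3.6))
housekeeping = np.flatnonzero(keep)   # indices used for normalization
```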

Based on 20,000 bootstrap replications, the 97.5 percentile qU* of the maximum of the Studentised roots (τ̂_l − τ_l)/√MSE is 4.3823, so qU* × √MSE = 1.5236, while the 2.5 percentile qL* of the minimum of the Studentised roots is −4.3946, so qL* × √MSE = −1.5279. Therefore, the 95% bootstrap simultaneous confidence intervals for τ_l are

(τ̂_l − 1.5236, τ̂_l + 1.5279), l = 22, . . . , 1285.

They are essentially symmetrical, and somewhat wider than the normal error confidence intervals. As it turns out, normal error confidence intervals infer 37 target genes to have expression ratio greater than 3, and 12 target genes to have expression ratio less than 1/3. Bootstrap confidence intervals infer 31 genes to have expression ratio greater than 3, and 11 genes to have expression ratio less than 1/3.

7.4. Assessing the validity of normal theory and bootstrap methods

We took advantage of the efficient computation available with models such as (2) to assess the validity of the normal theory and bootstrap methods. We simulated data from 1500 genes, each gene having two replications under each treatment, 7500 times, both with normal errors and with contaminated normal errors, and observed the coverage probabilities of the normal theory method and the bootstrap-t method. For normal errors, the estimated simultaneous coverage probabilities of the confidence bounds and intervals are:

Method          97.5% lower        97.5% upper        95% equal-tailed
                confidence bound   confidence bound   confidence interval
Bootstrap-t     0.9748             0.9731             0.9481
Normal theory   0.9753             0.9744             0.9499
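A coverage simulation of this kind can be sketched as follows, scaled down for speed (the paper uses 1500 genes and 7500 simulations). The Studentised maximum modulus critical value is itself approximated by Monte Carlo here rather than taken from published tables, and only the normal-theory two-sided interval is checked; all sizes are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

n_genes, n_rep, n_sim = 100, 2, 500
df = n_genes * (n_rep - 1)

# Approximate the 95% Studentised maximum modulus critical value:
# distribution of max_l |Z_l| / s with s an independent chi estimate.
z = rng.standard_normal((20000, n_genes))
s = np.sqrt(rng.chisquare(df, 20000) / df)
crit = np.quantile(np.abs(z).max(axis=1) / s, 0.95)

# Simulate data with all true differentials zero and count how often
# the simultaneous intervals cover every parameter at once.
covered = 0
for _ in range(n_sim):
    y = rng.normal(0.0, 1.0, (n_genes, n_rep))
    m = y.mean(axis=1)
    mse = ((y - m[:, None]) ** 2).sum() / df
    half = crit * np.sqrt(mse / n_rep)   # SE of a mean of n_rep values
    covered += np.all(np.abs(m) <= half)

print(covered / n_sim)   # should be near the nominal 0.95
```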

To assess the robustness of the methods, we modeled the error distribution as a contaminated normal. In particular, we used the mixture distribution 0.7204739 N(0, 0.07803918) + (1 − 0.7204739) N(0, 0.7952598), which appears to model the error distribution of another microarray data set we analyzed reasonably well. The estimated simultaneous coverage probabilities of the confidence bounds and intervals are:

Method          97.5% lower        97.5% upper        95% equal-tailed
                confidence bound   confidence bound   confidence interval
Bootstrap-t     0.9057             0.8945             0.811
Normal theory   0.3797             0.3913             0.1443
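Sampling from the contaminated normal mixture above is straightforward. In this sketch the second parameter of each N(0, ·) component is read as a variance, which is an assumption; the paper does not state the parameterization in this passage.

```python
import numpy as np

rng = np.random.default_rng(3)

# Mixture weights and component parameters from the text,
# read here as variances (an assumption).
p, v1, v2 = 0.7204739, 0.07803918, 0.7952598

def rcontam(size, rng):
    """Draw from the mixture p*N(0, v1) + (1 - p)*N(0, v2)."""
    comp = rng.random(size) < p
    sd = np.where(comp, np.sqrt(v1), np.sqrt(v2))
    return rng.standard_normal(size) * sd

e = rcontam(100000, rng)
# Sample variance should be close to p*v1 + (1-p)*v2, about 0.2785.
print(e.var())
```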

So there is an indication that the bootstrap-t method has some robustness of validity with respect to contaminated normal errors, while the normal theory method does not.

8. Future research

In the analysis of gene expressions, it is not yet clear whether it is better to control the FWER strongly (which our simultaneous confidence intervals approach does) or the FDR. A potential way to resolve the issue is to couple the analysis of microarrays with its intended use. In drug discovery, for example, genes are first screened for differential expressions under normal and diseased


conditions. Subsequently, nucleotide sequences in the promoter regions of the genes selected in the first step are mined for unusually common sequences (motifs). Proteins (transcription factors) that bind to these motifs then become candidates for drug compounds to intervene in the disease process. While fewer genes will be selected by controlling the FWER, one has confidence that each gene selected is involved in the disease process. More genes will be selected by controlling the FDR, but one is only confident that a proportion of the selected genes are involved in the disease process. Investigating which error rate control leads to more discoveries of useful transcription factors, using promoter sequence databases of genes with known transcription factors and algorithms such as MEME (Multiple EM for Motif Elicitation), would be a worthwhile future research topic.

Acknowledgements

We thank Dr. Ramana Davuluri, Human Cancer Genetics Program, The Ohio State University, for insightful discussions.

