BOPA: A Bayesian hierarchical model for outlier expression detection

BOPA: A Bayesian hierarchical model for outlier expression detection

Computational Statistics and Data Analysis 56 (2012) 4146–4156 Contents lists available at SciVerse ScienceDirect Computational Statistics and Data ...

625KB Sizes 2 Downloads 87 Views

Computational Statistics and Data Analysis 56 (2012) 4146–4156

Contents lists available at SciVerse ScienceDirect

Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda

BOPA: A Bayesian hierarchical model for outlier expression detection Zhaoping Hong, Heng Lian ∗ Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore

article

info

Article history: Received 23 February 2011 Received in revised form 30 April 2012 Accepted 4 May 2012 Available online 14 May 2012 Keywords: Cancer outlier profile analysis False discovery rate Markov chain Monte Carlo Microarrays

abstract In many cancer studies, a gene may be expressed in some but not all of the disease samples, reflecting the complexity of the underlying disease. The traditional t-test assumes a mean shift for the tumor samples compared to normal samples and is thus not structured to capture partial differential expressions. More powerful tests specially designed for this situation can find genes with heterogeneous expressions associated with possible subtypes of the cancer. This article proposes a Bayesian model for cancer outlier profile analysis (BOPA). We build on the Gamma–Gamma model introduced in Newton et al. (2001), Kendziorski et al. (2003), and Newton et al. (2004), by using a five-component mixture model to represent various differential expression patterns. The hierarchical mixture model explicitly accounts for outlier expressions, and inferences are based on samples from posterior distributions generated from the Markov chain Monte Carlo algorithm we have developed. We present simulation and real-life dataset analyses to demonstrate the proposed methodology. © 2012 Elsevier B.V. All rights reserved.

1. Introduction Microarray technology provides us with a large-scale view of a cellular state by simultaneously interrogating the expressions of thousands of genes. The large number of genes under investigation and the small number of arrays used due to experimental costs creates a typical ‘‘large p, small n’’ problem in statistics. In gene expression studies, arguably the most important problem is to identify those genes that are associated with the underlying genomic mechanism of the disease, whose patterns of expression differ according to phenotype or experimental conditions. Many different procedures have been proposed to tackle this problem (Baldi and Long, 2001; Ibrahim et al., 2002; Smyth, 2004; Efron, 2004; Do et al., 2005; Lewin et al., 2006). Recently, many researchers have come to the realization that, in many cancer studies, some genes exhibit increased or decreased expressions in disease samples, but only for a small number of those samples. The study of Tomlins et al. (2005) shows that the t-statistic has low power in this case, and they introduced the so-called cancer outlier profile analysis (COPA). Their study shows clearly that COPA can perform better than the traditional t-statistic for cancer microarray datasets. In Tibshirani and Hastie (2007), the authors introduced a new statistic, which they called the outlier sum (OS). Later, Wu (2007) proposed an outlier robust t-statistic (ORT), and showed that it usually outperformed the previously proposed statistics in both simulation studies and applications to real data. Following the works mentioned above, here we are especially interested in detecting genes that are expressed in only a subset of the tumor samples. Unlike previous works that use frequentist tests, we choose the Bayesian hierarchical modeling framework as adopted by Kendziorski et al. (2003); Newton et al. (2004). While those works favored inferences based on posterior mode found using the expectation-maximization (EM) algorithm, we prefer a fully Bayesian approach to explore the posterior distribution using a Markov chain Monte Carlo (MCMC) algorithm. Besides, our efforts to develop a new model



Corresponding author. E-mail address: [email protected] (H. Lian).

0167-9473/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2012.05.003

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

4147

for detecting partial differential expression results in a much more complicated computational problem, which possibly excludes the usage of the EM algorithm. One of the most important aspects of Bayesian hierarchical modeling on microarray data is the sharing of information across different experimental units. Using Bayesian hierarchical modeling, we strengthen our inferences by borrowing information from different genes and samples. We will use a five-component mixture model (also referred to as a five-state model) for different patterns of expressions across samples. Besides the usual three states used in some previous models to designate non-differential expression, up-regulated expression and down-regulated expression, we use two extra states for partial up-/down-regulated expression. Previously proposed Bayesian methods for differential expression, such as (Newton et al., 2004; Gottardo et al., 2006; Lo and Gottardo, 2007), do not take into account partial expression profiles, and thus are expected to have low power in detecting outlier expression. This article is organized as follows. In Section 2, we describe our approach for cancer outlier profiling. In Section 3, we discuss the computational challenges posed by the model and outline the MCMC algorithm for posterior computation, with detailed sampling steps put into Appendix B. We discuss Bayesian approaches to control the false discovery rate (FDR). In Section 4, we perform some simulations to study the computational properties of the algorithm and to compare our model with several previous approaches. Interestingly, our analysis shows that the OS statistic proposed in Tibshirani and Hastie (2007) has serious problems in differential expression detection, which has not been noticed before. In Section 5, we evaluate the performance of various methods on two real-life data sets as well as on generated expression data. 2. A five-state model We consider simple two-group microarray data for detecting cancer genes. We assume that there are m normal samples and n cancer samples. The gene expressions for the normal samples are denoted by X = {xij } for genes i = 1, 2, . . . , G and samples j = 1, 2, . . . , m, while Y = {yij } denotes the expressions for cancer samples with i = 1, 2, . . . , G and j = 1, 2, . . . , n. We model these expression values as Gamma random variables with gene-specific means: xij ∼ Gamma(aj , aj λxij ), y

yij ∼ Gamma(bj , bj λij ),

with E(xij ) = 1/λxij , y

with E(yij ) = 1/λij .

We assume that the expressions for normal samples are homogeneous, so λxij ≡ λi1 , j = 1, . . . , m, but the true expressions for different cancer samples could be different. Gamma models for observed expression values have been used in many previous studies on parametric modeling of differential expression, and the implicit assumption of constant coefficient of variation in the Gamma model has been checked for some real-life datasets. For example, both (Newton et al., 2004; Jensen et al., 2009) used a figure similar to Supplementary Figure 7 in the current paper to justify the constant coefficient of variation assumption for their data. In the above, we have used sample-specific shape parameters a = (a1 , . . . , am ) and b = (b1 , . . . , bn ) in the Gamma distributions. A more parsimonious choice could be to constrain the shape parameters to be experimental condition specific, as in Jensen et al. (2009). y On the next level, marginally, the latent expression parameters λxij and λij follow a Gamma distribution

λxij , λyij ∼ Gamma(a0 , a0 x0 ). The above only specifies the marginal distribution of these parameters. Constraints on these parameters depend on another latent gene-specific state variable Zi , taking values in S = {0, 1, 2, 3, 4}, defined as follows. y

Zi = 0 : λij = λi2 = λi1 ,

denoting non-differential expression,

where λi2 is the inverse mean expression for the tumor samples, y

Zi = 1 : λij = λi2 < λi1 , y ij

Zi = 2 : λ = λi2

or λi1

denoting up-regulated expression pattern for tumor samples , with λi2 < λi1 ,

denoting partially up-regulated expression pattern for tumor samples, y

Zi = 3 : λij = λi2 > λi1 , y ij

Zi = 4 : λ = λi2

or λi1

denoting down-regulated expression pattern for tumor samples, with λi2 > λi1 ,

denoting partially down-regulated expression pattern for tumor samples, with additional parameters p = {p0 , . . . , p4 } specifying the state probabilities as P (Zi = k) = pk , k ∈ S, independently for each gene. The different parameters λi1 and λi2 are generated independently. For example, if Zi = 1, then the joint density of (λi1 , λi2 ) is 2π (λi1 )π (λi2 )1{λi1 > λi2 }, where π is the density of Gamma(a0 , a0 x0 ) and 1{·} is the indicator function. Thus the joint distribution of (λi1 , λi2 ) is a mixture with five components. To fully specify the distribution of latent parameters λyij when Zi = 2 or 4, we need to introduce a gene-specific parameter 0 < πi < 1. Conditional on the event Zi = 2 or 4, for y

y

each tumor sample independently, we assume that λij = λi2 with probability πi and that λij = λi1 with probability 1 − πi . This gene-specific parameter πi is the key to modeling heterogeneity in cancer samples. For example, πi = 1/2 means that

4148

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

a priori about half of the cancer samples will exhibit differential expression, while the other half will not. πi being close to 1 means that we believe that almost all of the cancer samples will exhibit differential expressions (thus close to the homogeneous case). We assign a prior distribution Beta(s, t ) for πi independently for each gene. We use s = 2 and t = 2 as a weak prior centered at 1/2. We have also tried other priors such as Beta(1, 1) and Beta(5, 5) in our simulations, and the results are quite robust to these different choices. Finally, we use the following non-informative prior distribution for the unknown parameters Θ = {a, b, a0 , x0 , p}:

• aj , bj , a0 , x0 ∼ Uniform(0, B) with a large constant B (B = 1000 in our current implementation), • p = (p0 , . . . , p4 ) ∼ Dirichlet(d ), where we use a small number dk = 0.1 in our implementation. When Zi is constrained to only take values in {0, 1, 3}, we in effect revert back to the parametric model presented in Newton et al. (2004). They used an EM algorithm in which the latent parameters λij can be integrated out of the model. Fitting of our model is complicated by the additional states associated with partial differential expression. In particular, when Zi = 2 or 4, it is difficult to directly integrate out λij , since it is unknown whether the gene expression for a particular tumor sample has a differential expression. Thus we introduce the allocation vector gi = (gi1 , . . . , gin ). We distinguish the cases Zi = 2 (partial up-regulated expression) and Zi = 4 (partial down-regulated expression).

• When Zi = 2, we set gij = 1 if λyij = λi1 (non-outlier) and gij = 2 if λyij = λi2 . • When Zi = 4, we set gij = 1 if λyij = λi1 (non-outlier) and gij = 0 if λyij = λi2 . y

By the description of the generative model for λij above, we have p(gi |Zi = 2 or 4) =



n

πi i2 (1 − πi )ni1 p(πi )dπi ,

(1)

where ni1 is the number of non-differentially expressed tumor samples, ni2 is the number of differentially expressed tumor samples (note ni1 and ni2 implicitly depend on gi ), with ni1 + ni2 = n, and p(πi ) is the prior Beta(s, t ) density. This integral can be easily evaluated to get a closed-form expression involving Beta functions. For convenience, we also define allocation vector for the other three states:

• Zi = 0 ⇒ gij = 1, j = 1, . . . , n, • Zi = 1 ⇒ gij = 2, j = 1, . . . , n, • Zi = 3 ⇒ gij = 0, j = 1, . . . , n, with p(gi |Zi ) = 1 if and only if gi satisfies the constraints above depending on the value of Zi . With the aid of latent variables (Zi , gi ) defined above, we can reduce the complexity of model fitting by integrating out y the latent parameters λxij , λij in our hierarchical model, resulting in the following likelihood: p(X , Y , Z , g |Θ) =

G 

p(xi , yi |gi , Zi , Θ)p(gi |Zi )p(Zi |Θ),

(2)

i=1

where xi , yi are all observed expression values for gene i. Thanks to the introduction of the allocation vector, the collapsed distribution p(xi , yi |gi , Zi , Θ) can now be evaluated in closed form in the same way as illustrated in the appendix of Newton et al. (2004), and thus detailed derivation is omitted here. Note also that by introducing the allocation vector and integrating out λij , reversible jump MCMC is not required for transitioning between different states. Combining this likelihood with the prior on the parameters, we intend to explore the posterior distribution p(Z , g , Θ|X , Y ) ∝ p(X , Y , Z , g |Θ)p(Θ). We implement our hierarchical model fitting using the MCMC algorithm to be discussed in the next section. 3. Posterior sampling and inferences 3.1. MCMC sampler We alternately sample from the posterior of (Z , g ) given current values of the parameters Θ, and sample Θ given current values of latent variables (Z , g ). A detailed formula for each step is presented in the Appendix B. Sampling of Θ can be achieved either based on simple full conditional distribution (for sampling p) or based on Metropolis steps (for sampling a0 , x0 , aj , bj ). Given Θ, sampling of latent variables can be done one gene at a time, as is obvious from the product form of the likelihood (2). Thus we only discuss the sampling of latent variables for a fixed gene i. Suppose that the current latent variables are (Zi , gi ). We use the Metropolis–Hastings algorithm to propose a new set of latent variables (Zi∗ , gi∗ ). First, we need to consider the proposal distributions. For the proposal of Zi∗ , we use a simple state transition system as shown in Fig. 1. For example, when currently Zi = 0, we propose moving to any of the other four states with equal probability, whereas, if Zi = 1, we propose moving to either Zi∗ = 0 or Zi∗ = 2. These transition probabilities are denoted by ps,s∗ when transitioning from Zi = s to Zi∗ = s∗ . We do not consider transitions between more distinct states

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

4149

Fig. 1. Possible transitions of the MCMC sampler between the five states. State 0 represents the non-differentially expressed state. States 1 and 3 represent up-regulated and down-regulated expression patterns, respectively, while states 2 and 4 represent partially up-regulated and partially down-regulated expression patterns, respectively.

such as from Z = 1 to Z ∗ = 3 (from up-regulation to down-regulation), since these types of transition represent large jumps and will rarely be accepted, and thus this is a waste of computation time. Constraining possible transitions between states also makes implementation an easier task. When moving between states 0, 1, and 3, no proposal of gi∗ is required, since it is entirely determined by the state, as defined in the previous section. Complications occur when a state representing partial differential expression is involved. We illustrate our algorithm with the proposed transition Zi = 0 → Zi∗ = 2. Implicitly, we have gi = (1, 1, . . . , 1). In general, denoting the proposal distribution of gi∗ by Q (gi∗ |Zi∗ ), which might depend on other variables besides the proposed state Zi∗ , the acceptance of the proposal is based on the value of the ratio:

α=

p2,0 · p(Zi∗ = 2, gi∗ |xi , yi , Θ) p0,2 Q (gi |Zi = 2)p(Zi = 0, gi |xi , yi , Θ) ∗



=

p2,0 · p2 · p(gi∗ |Zi∗ = 2) · p(xi , yi |Zi∗ = 2, gi∗ , Θ) p0,2 · Q (gi∗ |Zi∗ = 2) · p0 · p(xi , yi |Zi = 0, gi , Θ)

,

where, as introduced above, p(gi∗ |Zi∗ = 2) is the prior probability of the allocation vector defined by the integral (1). Given the proposed gi∗ , the probability p(xi , yi |Zi∗ = 2, gi∗ , Θ) can be computed in closed form, as discussed previously. Supposing that one uses the prior (1) as the proposal distribution Q (gi∗ |Zi∗ = 2), the prior distribution and the proposal distribution of gi∗ in the numerator and the denominator conveniently cancel each other. However, we do not expect this simple proposal to perform well, because the allocation vector is set randomly without reference to the observed data, and the proposal is unlikely to be accepted. In other words, the proposal gi∗ from the prior distribution is likely to force the term p(xi , yi |Zi∗ = 2, gi∗ , Θ) to be very small, even with a moderate number of tumor samples n. Instead, we adopt the idea of Jain and Neal (2004) proposed in the context of Dirichlet process clustering, and use a proposal distribution taking into account the observed expression values. Starting from an arbitrary fixed allocation vector gi on tumor samples, which might be different from the current allocation vector, we use a simple Gibbs scan over components of gi , and, for each component j, draw a new value for gij from the conditional distribution q(gij = c |Zi = 2) ∝



p(yij |gij , λic , θ )p−j,c (λic )dλic ,

c = 1, 2,

where p−j,c (·) is the posterior distribution of λic based on all observed expressions for which gik = c , k ̸= j, and when c = 1 these also include the observed expressions on normal samples. Thus, using the above probability q as the proposal distribution for each component of gi in turn, the proposal probability for a full sequential sampling scan is a product of these probabilities. In our algorithm, this product is the proposal probability, Q (gi∗ |Zi∗ = 2). However, when a single sequential Gibbs scan is used, the proposed allocation vector may still not be sensible. To improve the efficiency of the algorithm, we perform K iterations of Gibbs sampling (K = 3 in our implementation). When calculating the proposal probability, we only need to use the product of the probabilities from the last scan instead of taking product probabilities from all K Gibbs scans. The reason is the same as in Jain and Neal (2004): the calculation of acceptance is valid starting from any fixed allocation vector g , and thus should be valid even for a random starting allocation vector. We will discuss in more detail the effects of using multiple Gibbs scans on posterior samples in Appendix A. Finally, we note that, since the parameters λij are integrated out, our sampler does not jump between spaces of different dimensions. Alternatively, it is also possible to keep those parameters in the sampling steps and then use the sampler in Gottardo and Raftery (2008) that can efficiently deal with mixtures of mutually singular distribution.

4150

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

3.2. Gene selection After fitting our BOPA model, gene-specific inference is primarily based on the posterior distribution of Zi . Our MCMC algorithm provides easy estimates for different quantities and parameters in our model. For example, one can assign to each gene a state that maximizes the posterior probability (which can be estimated from sampling frequencies for each state): si = arg max p(Zi = s|X , Y ). s∈S

If a short list of differentially expressed (DE) genes is desired, one possible decision rule is to declare a gene to be differentially expressed if P (Zi ̸= 0|X , Y ) ≥ C for some threshold 0 < C < 1, and Newton et al. (2004) suggested estimating the false discovery rate (FDR) with

 FDR(C ) =

p(Zi = 0|X , Y )1{p(Zi ̸= 0|X , Y ) ≥ C }

i



1{p(Zi ̸= 0|X , Y ) ≥ C }

.

(3)

i

Conversely, in order to control the false discovery rate at a certain level, we can choose the smallest C on a fine grid of threshold values with estimated false discovery rate below that level. 4. Simulation We carried out some simulation studies to assess the detection performance of our methodology. For each simulated dataset, we generated G = 1000 genes, with m = n = 4, 10, or 24 replicates, and the expression values have a Gamma distribution with shape parameters generated independently for each sample ai , bj ∼ Uniform(10, 50), i = 1, . . . , m, j = 1, . . . , n. For the distribution of the gene-specific parameters λij , we adopted the same three scenarios as in Newton et al. (2004). Scenario I. Gamma with a0 = 0.5 and a0 x0 = 0.25. Scenario II. Uniform on 5 ≤ − log2 11 for non-DE genes.

√ λi1 λi2 ≤ 11 and −2 ≤ log2 (λi2 /λi1 ) ≤ 2 for DE genes and uniform on 5 ≤ − log2 λi1 ≤ √

Scenario III. Uniform on 5 ≤ − log2 λi1 λi2 ≤ 11 and −1 ≤ log2 (λi2 /λi1 ) ≤ 1 for DE genes and uniform on 5 ≤ − log2 λi1 ≤ 11 for non-DE genes. For each scenario above, we consider both p = (0.9, 0.025, 0.025, 0.025, 0.025) and p = (0.8, 0.05, 0.05, 0.05, 0.05), and, for partially DE genes, πi values are independently generated from distribution Beta(2, 2). In the initial stage of simulation studies on several simulated data sets, we run five chains for 10,000 iterations, and for the ith chains all genes are initialized with state i, i = 1, . . . , 5. For convergence diagnostics, we used the methods developed by Gelman and Rubin (1992) and Geweke (1992), which are available from the boa package within R (Smith, 2007). We examined many different parameter values using the package, and the diagnostic statistics suggest that the chains have converged after 5000 iterations. Thus for all simulation studies we run 10,000 iterations and the first 5000 iterations are discarded as burn-in. We collect all samples after burn-in without performing thinning, since independence of samples is not critical in our studies. We apply seven testing procedures, including our method (MCMC5), and a three-state Bayesian model for which the partial DE states are removed (MCMC3), as well as BRIDGE (Gottardo et al., 2006), GaGa (Rossell, 2009), ORT, OS and the simple t-test (after log-transformation). Fig. 2 and Supplementary Figs. 1–2 show the receiver operating characteristic (ROC) curves for the seven DE detection procedures under the three scenarios (each of the three figures corresponds to one of the three scenarios). These curves are constructed by showing the true positive rate (TPR) on the y-axis versus the false positive rate (FPR) on the x-axis. The TPR is the proportion of truly DE genes among all the genes detected as differentially expressed, and the FPR is the proportion of non-DE genes wrongly included in the list of differentially expressed genes. Note that whether a gene is classified as DE or non-DE depends on the threshold we use on the posterior. We need to first determine a threshold t such that we call a gene DE if the posterior probability of DE is larger than t. For each t we have a (TPR,FPR) pair, and by connecting these points with various t, we get an ROC curve. We make several observations based on these figures. First, MCMC5 performs best for most of the simulations for small FPR, especially when the sample size is small. Second, the worst performer under our simulation is OS. Surprisingly, the ROC curves show that OS is even worse than random guessing under our simulation setup. Appendix C contains a detailed explanation for this. Thus one need to be very cautious when applying OS statistics. Third, the performance of the t-test can be close to that of MCMC5 when m = n = 24, but we see that MCMC5 performs better than the t-test when the false positive rate is small, which is arguably the more interesting case. Even if MCMC5 is not significantly better than other approaches under some scenarios, we show using our real data later that at least it is complementary to other procedures, since it can detect some interesting genes that are missed by other procedures. Furthermore, in Table 1 and Supplementary Tables 1–2 (three tables are associated with three scenarios, respectively), we show some operating characteristics of the different procedures under the same simulation setup. The results reported are based on controlling the FDR to be below 0.05. For the OS, ORT, and t-statistic, the p-values for all genes are calculated based on the rank of the statistic among those on the null genes (note we know the null genes only in simulation). We also

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

4151

Fig. 2. TPR versus FPR for scenario I.

tried permutation tests, and the results are similar. Then the p-values are transformed to q-values to find the threshold for ˆ /G (G = 1000 in our simulations) is the proportion of differentially expressed genes controlling the FDR. In the table, DE ˆ is found by controlling the FDR at 0.05 (for detected by the different methods. The estimated number of DE genes, DE, controlling the FDR in our method, see Section 3.2). FDR is the realized false discovery rate, which is just the proportion of truly non-DE genes among those estimated as DE, and FNDR is the realized false non-discovery rate, the proportion of truly DE genes among those estimated as non-DE. Again, these quantities depend on the choice of target FDR, which is 0.05. Note that target FDR equal to 0.05 means we want the FDR to be (close to) 0.05, and realized FDR indicates the effectiveness of our approach in controlling the FDR, with values closer to 0.05 indicating better performance, whereas, for the FNDR, smaller values indicate better performance. From these tables, we observe that the FDR controlling procedure is not effective for ORT and OS, with the realized FDR frequently larger than 0.05, while for other procedures the FDR is well controlled except for a few cases. Besides, our MCMC5 procedure generally can detect a larger number of differentially expressed genes and at the same time achieve similar or better TPR and FPR compared to other procedures. In Supplementary Table 3, for Scenario I with m = n = 24, we demonstrate that the results are quite robust for some different choices of prior for πi . In practice, we recommend setting the prior to be Beta(2, 2) or Beta(1, 1). These are symmetric priors that are close to being noninformative. Extremely asymmetric priors favoring πi close to 0 or 1 can only be used if there is strong prior belief that only very few cancer samples can exhibit differential expression or if the cancer samples are closer to being homogeneous. Such belief seems to be unreasonable. Finally, we compared the ability of these methods to discover DE genes in different states separately (Fig. 3 and Supplementary Figs. 3–4). In the figures, power is the proportion of detected DE genes among all the truly DE genes. We actually present the power for each of the four DE states separately. The state for each gene is assigned according to the posterior probability of different states, as explained in Section 3.2. So the bars in the figure under ‘‘Z = 1’’ are the proportion of DE genes estimated to be up-regulated among all that are truly up-regulated. We see that MCMC5, MCMC3, BRIDGE, and GaGa have similar power for up-regulated genes (Z = 1) and down-regulated genes (Z = 3). But MCMC5 performs significantly better than its competitors on partially DE genes (Z = 2 and Z = 4) for all scenarios.

4152

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

Fig. 3. Simulation results of scenario I with different sample size: comparisons of power for genes in different states (different expression patterns).

5. Real-life datasets 5.1. Breast cancer data example We now apply the proposed methodology to data from a gene expression study in breast cancer (Hedenfalk et al., 2001). The dataset consists of 3226 genes from 22 samples, 7 of which were breast tumors with BRCA1 mutations, 8 of which were breast tumors with BRCA2 mutations, and the remaining 7 of which were sporadic breast tumors. We treat the BRCA2 samples as the normal class and the BRCA1 samples as the disease class (the other 7 corresponding to sporadic breast tumors are not used for this analysis). We would like to examine whether there are any differences in gene expressions between these two kinds of tumor. Following the preprocessing method as in Storey and Tibshirani (2003), we eliminated those genes with one or more measurements exceeding 20, and this left us with 3170 genes for further analysis. In Supplementary Fig. 5, we plot the ranking of genes based on the coefficient of variation (CV) against the ranking of mean expression. It can be seen that there is no strong systematic relationship between the CV and the mean, which is as expected for the constant-shape Gamma model (Newton et al., 2004; Jensen et al., 2009). 5.1.1. Full data set analysis We implemented our model using the MCMC sampler outlined in Appendix B. Five chains were run for 10,000 iterations, as before, with all genes assigned initially to state i for the ith chain. Based on convergence analysis developed by Gelman and Rubin (1992) and Geweke (1992), the first 5000 iterations are discarded as burn-in.

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

4153

Table 1

ˆ /G is the Operating characteristics of simulation studies with Scenario I. P (DE ) is the true proportion of differentially expressed genes simulated. DE proportion of differentially expressed genes detected by different methods. TPR is the true positive rate (same as sensitivity). FPR is the false positive rate (1 − FPR is the same as specificity). FDR is the realized false discovery rate and FNDR is the realized false non-discovery rate. m=n=4

P (DE ) = 0.1

P (DE ) = 0.2

ˆ /G DE

TPR

1 − FPR

FDR

FNDR

ˆ /G DE

TPR

1 − FPR

FDR

FNDR

MCMC5 MCMC3 BRIDGE GaGa ORT OS t-test

0.088 0.068 0.057 0.059 0.056 0.008 0.043

0.850 0.650 0.560 0.580 0.520 0.070 0.420

0.997 0.997 0.999 0.999 0.996 0.999 0.999

0.034 0.044 0.018 0.017 0.071 0.125 0.023

0.016 0.038 0.047 0.045 0.051 0.094 0.061

0.161 0.138 0.118 0.122 0.097 0.032 0.072

0.770 0.665 0.580 0.600 0.450 0.145 0.350

0.991 0.994 0.998 0.998 0.991 0.996 0.998

0.044 0.036 0.017 0.016 0.072 0.094 0.028

0.055 0.078 0.095 0.091 0.122 0.177 0.140

m = n = 10

P (DE ) = 0.1

ˆ /G DE

TPR

1 − FPR

FDR

FNDR

ˆ /G DE

TPR

1 − FPR

FDR

FNDR

MCMC5 MCMC3 BRIDGE GaGa ORT OS t-test

0.093 0.082 0.075 0.081 0.077 0.010 0.063

0.890 0.820 0.750 0.800 0.720 0.090 0.590

0.996 1 1 0.999 0.994 0.999 0.996

0.043 0 0 0.012 0.065 0.100 0.064

0.012 0.020 0.027 0.022 0.030 0.092 0.044

0.197 0.181 0.142 0.174 0.169 0.029 0.137

0.930 0.880 0.710 0.855 0.790 0.135 0.650

0.986 0.994 1 0.996 0.986 0.998 0.991

0.056 0.028 0 0.017 0.065 0.069 0.051

0.017 0.029 0.068 0.035 0.051 0.178 0.081

m = n = 24

P (DE ) = 0.1

ˆ /G DE

TPR

1 − FPR

FDR

FNDR

ˆ /G DE

TPR

1 − FPR

FDR

FNDR

0.094 0.088 0.079 0.088 0.080 0.019 0.077

0.880 0.860 0.770 0.850 0.750 0.170 0.770

0.993 0.998 0.998 0.997 0.994 0.998 0.998

0.064 0.023 0.025 0.034 0.063 0.105 0.025

0.013 0.015 0.025 0.016 0.027 0.085 0.025

0.201 0.187 0.177 0.196 0.189 0.050 0.169

0.945 0.910 0.875 0.935 0.885 0.230 0.805

0.985 0.994 0.998 0.989 0.985 0.995 0.990

0.060 0.027 0.011 0.046 0.064 0.080 0.047

0.014 0.022 0.030 0.016 0.028 0.162 0.047

MCMC5 MCMC3 BRIDGE GaGa ORT OS t-test

P (DE ) = 0.2

P (DE ) = 0.2

Using MCMC5, we estimated p = (.7488, .0006, .1244, .0300, .0962), which indicated that there were many genes expressed differently between breast tumors with BRCA1 mutations and those with BRCA2 mutations. This result is consistent with the claims made in Hedenfalk et al. (2001). The estimated p indicates that about 25% of the genes are DE. However, identifying those genes is a different matter. If we just pick the top 25% genes based on posterior probability, then typically there are a lot of false positives, while many truly DE genes will not be in the list. What researchers typically want is a shorter list that is more certain to contain only a small number of non-DE genes. For this purpose, we choose the target FDR to be 0.05 and find a list of 193 genes by the procedure briefly explained in Section 3.2. Of these 193 genes, 32 genes have an estimated posterior probability of 1 of being DE. For comparison, BRIDGE and GaGa identified 75 and 132 genes, respectively. 58 and 72 of these are also contained in the list of MCMC5, respectively. However, MCMC3, the t-test, ORT, and OS (combined with a permutation test for calculating the p-values) seem very conservative, with only 26, 32, 0, and 0 genes that are declared as differentially expressed, respectively. By searching through some existing literature, we find that many genes detected are related with DNA repair and apoptosis. For example, the c-Myc that promotes apoptosis (Juin et al., 2002) is partially up-regulated in BRCA1 tumors with 98% posterior probability. The gene ATDC is directly involved in DNA repair (Ronen and Glickman, 2001) and it is up-regulated with 99.8% posterior probability in BRCA1-related tumors. But both BRIDGE and GaGa regard it as a nondifferential gene with approximately 100% posterior probability. The gene VEGF is down-regulated in BRCA1 tumors with 100% posterior probability, and it is associated with inhibition of tumor cell apoptosis (Pidgeon et al., 2001). However, this gene is ranked at 572, 2340, and 2635 by the t-test, OS, and ORT respectively. Thus it is obviously more difficult to detect VEGF by these three methods. 5.1.2. Subset data analysis In order to evaluate the methods under small sample size, we randomly chose four replicates from each group and applied each method to find the DE genes when controlling the FDR at 0.05. This process was repeated 100 times. Since neither OS nor ORT can detect any DE genes for the full data set, we will focus our discussions on the remaining five methods. Table 2 lists the average number of genes declared DE when analyzing a subset of four replicates per group, as well as for the full data set. We also show the reproducibility of each method, which is the proportion of those DE genes discovered using the subsample that can also be found using the full data set. MCMC5 method detects more DE genes than its competitors, and exhibits higher reproducibility. Obviously, both MCMC3 and the t-test appear too conservative, with very low reproducibility under subsampling.

4154

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156 Table 2

ˆ number of genes declared DE gene detection for breast cancer data. DE: differentially expressed; % rep.: proportion of DE genes also declared significant when analyzing the full data.

MCMC5 MCMC3 BRIDGE GaGa t-test

4 replicates per group

Full data

ˆ DE

% rep.

ˆ DE

37.8 11 14.7 26.9 0.4

0.796 0.103 0.384 0.713 0.028

193 26 75 132 32

Table 3 Operating characteristics of the simulated data set based on parameter estimates obtained from breast cancer data.

MCMC5 MCMC3 BRIDGE GaGa ORT OS t-test

ˆ /G DE

TPR

1 − FPR

FDR

FNDR

0.095 0.069 0.057 0.057 0.026 0.008 0.023

0.358 0.274 0.227 0.218 0.099 0.031 0.092

0.998 0.999 0.999 0.996 0.998 0.999 0.999

0.011 0.005 0.007 0.049 0.046 0.063 0.012

0.177 0.196 0.206 0.208 0.232 0.245 0.233

5.1.3. Simulated data analysis We now apply our approach to the analysis of simulated datasets based on the obtained estimates from the breast cancer data. More specifically, for each gene i, we use the estimated gi and Zi , and generate the expression values from Gamma distribution with mean fixed to the sample mean of the real data. Other parameters in the model are fixed to be the estimated values. In total, 100 simulated data sets were generated, and we report the averaged results. For easy visual comparison, we compute the FDR as a function of the proportion of genes declared significant. The averaged FDR curves are plotted in Supplementary Fig. 6(a). As seen, our proposed MCMC5 method exhibits the lowest FDR for detecting the same number of DE genes. Again, we investigate the operating characteristics of different methods using an FDR of 0.05. Results are shown in Table 3, from which we see that MCMC5 can detect more DE genes than other procedures with a well-controlled FDR level. To investigate the performance of these procedures for detecting partially DE genes, we compute the detection rate as a function of the proportion of differentially expressed samples of a gene. The resulting plot given in Supplementary Fig. 6(b) highlights several findings. First, as expected, MCMC5 obtains substantially more power than other procedures when the proportion is less than 80%. Second, the power of BRIDGE and GaGa increases rapidly, and they outperform MCMC5 slightly when the proportion reaches 100%. Finally, the t-test performs the worst for genes differentially expressed only in a small proportion of tumor samples, but shows marked improvement when the proportion increases. 5.2. Prostate cancer data example We also applied our approach to the study of Singh et al. (2002), where Affymetrix arrays were used to compare the gene expressions from prostate cancer tissues with normal prostate tissues. They measured expressions of 12,625 genes from 52 tumor and 50 normal prostate tissue samples. We filtered the genes using the following two criteria: (1) expression measurements to be above 100 in at least 20% of the samples; (2) the interquartile range across the samples on the base-2 logarithmic scale to be at least 0.1. In the end we were left with 2204 genes for further analysis. Using BOPA, we identified 200 genes with 100% posterior probability of being differentially expressed, which were all ranked as 1. We estimate p = (0.8011, 0.0124, 0.0728, 0.0030, 0.1107). Table 4 lists the genes called significant by the BOPA method or t-test when the FDR is controlled at 0.01, and confirmed to be related to prostate cancer. These genes are identified by a search on PubMed. The t-test identified 290 genes as DE, 9 of which are confirmed to be related to prostate cancer, while our method identified 255, with 12 of them confirmed. There are 155 genes that appear on both lists. Almost all the 9 confirmed genes (except CAMKK2) identified by the t-test are also identified by BOPA. The rankings of these genes using other different methods are also shown in the table. For OS, 463 genes has the OS statistic value exactly equal to zero, and all these genes are assigned the ranking 1732, which explained the many genes with equal ranking for OS seen in the table. Similarly, those genes with ORT statistics equal to zero are assigned a ranking of 1742. We observe that several genes with high posterior probability by the Bayesian method (e.g. CDH1, KLK11, HSP90AA1 and IGFBP3) have large ranks by other methods. GaGa found 504 significant genes, whereas BRIDGE found only 78 DE genes, which is too conservative for this prostate cancer dataset, 103 and 52 of which also appeared in the BOPA list, respectively.

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

4155

Table 4 Genes detected by t-test or BOPA under the target FDR level 0.01 and confirmed to be associated with prostate cancer. The last five columns also list the ranking of each gene by other methods. Method

Genename

Rank

BOPA

BRIDGE

GaGa

ORT

OS

t-test

AMACR FASN FOXA1 CTBP2 KRT18 KLK3 CAMKK2 P4HB SFRP1 Genename AMACR KLK3 P4HB FASN FOXA1 KRT18 SFRP1 HSP90AA1 CDH1 KLK11 IGFBP3 CTBP2

2 54 67 72 75 76 94 175 257 Rank 1 1 1 1 1 1 1 1 1 206 219 235

1 1 1 235 1 1 1366 1 1 t-test 2 76 175 54 67 75 257 490 463 2068 1812 72

6 76 1 131 112 9 444 24 108 BRIDGE 6 9 24 76 1 112 108 208 50 1421 1231 131

26 281 1443 174 239 549 379 616 1118 GaGa 26 549 616 281 1443 239 1118 717 1433 2061 2068 174

6 1742 818 1051 1742 8 28 1742 1742 ORT 6 8 1742 1742 818 1742 1742 1742 1742 823 1742 1051

929 1732 1732 1015 1732 1732 88 1732 1732 OS 929 1732 1732 1732 1732 1732 1732 1732 1732 432 1732 1015

Method BOPA

6. Conclusion and discussion We have presented a new Bayesian model for analyzing partial differential expression motivated by cancer outlier profiling problems. Besides many advantages enjoyed by general Bayesian analysis compared to classical frequentist tests, including full posterior exploration and uncertainty propagation, our model also automatically identifies genes that are only partially expressed. An efficient MCMC sampler is proposed to explore the posterior distributions of various variables and parameters. This is a natural framework for cancer outlier profiling, and it can foreseeably be extended to microarray data with more than two experimental conditions. However, these advantages of parametric modeling do not come without expense. The most serious concern is the possible model misspecification. Although we did not attempt to use alternative parametric models here, it can be expected that other common expression distributions such as the log-normal–normal model would perform similarly to the Gamma–Gamma model used here, as discussed for the three-state model presented in Newton et al. (2001). Another concern involves the modeling decision that we treat gene expressions under one condition as homogeneous, while this is not true for many datasets. It is imaginable that a nonparametric type of Bayesian model (such as one that uses Dirichlet processes) could be more reasonable. However, taking into account general heterogeneity is complicated, and it also adds to the computational burden. In this respect we follow the lead of OS and ORT in this paper, and argue that, even with this simplification, our method already shows some advantages over many previous approaches via simulations and real data analysis. Another problem is its computational complexity. In order to improve the computation speed, we actually run the MCMC algorithm on a computer cluster with 8 CPUs with the help of the snowfall package in R to parallelize the implementation. For our simulations for example, it takes about 5 to 20 min for each generated dataset to produce the results presented above. Using an ordinary PC, it would probably be about 4–5 times slower. Thus our Bayesian approach can probably only be applied to data sets with up to 5000 genes. Acknowledgments The authors sincerely thank the Associate Editor and two anonymous reviewers for their detailed comments and suggestions, which led to substantial improvements to the original manuscript. The research of Heng Lian is supported by a Singapore MOE Tier 1 Grant. Appendix. Supplementary data Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.csda.2012.05.003. References Baldi, P., Long, A.D., 2001. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17 (6), 509–519. Do, K.A., Müller, P., Tang, F., 2005. A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society Series C—Applied Statistics 54, 627–644.

4156

Z. Hong, H. Lian / Computational Statistics and Data Analysis 56 (2012) 4146–4156

Efron, B., 2004. Large-scale simultaneous hypothesis testing. Journal of the American Statistical Association 99 (465), 96–104. Gelman, A., Rubin, D.B., 1992. Inference from iterative simulation using multiple sequences. Statistical Science 7, 457–472. Geweke, J., 1992. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Evaluating the Accuracy of Sampling-Based Approaches to the Calculation of Posterior Moments. In: Bayesian Statistics, Vol. 4. Clarendon Press, Oxford. Gottardo, R., Raftery, A.E., 2008. Markov chain Monte Carlo with mixtures of mutually singular distributions. Journal of Computational and Graphical Statistics 17 (4), 949–975. Gottardo, R., Raftery, A., Yeung, K., Bumgarner, R., 2006. Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 62 (1), 10–18. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O., Wilfond, B., Borg, A., Trent, J., 2001. Gene-expression profiles in hereditary breast cancer. New England Journal of Medicine 344 (8), 539–548. Ibrahim, J.G., Chen, M.H., Gray, R.J., 2002. Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association 97 (457), 88–99. Jain, S., Neal, R.M., 2004. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics 13 (1), 158–182. Jensen, S.T., Erkan, I., Arnardottir, E.S., Small, D.S., 2009. Bayesian testing of many hypotheses X many genes: a study of sleep apnea. Annals of Applied Statistics 3 (3), 1080–1101. Juin, P., Hunt, A., Littlewood, T., Griffiths, B., Swigart, L., Korsmeyer, S., Evan, G., 2002. C-Myc functionally cooperates with BAX to induce apoptosis. Molecular and Cellular Biology 22 (17), 6158–6169. Kendziorski, C.M., Newton, M.A., Lan, H., Gould, M.N., 2003. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Statistics in Medicine 22 (24), 3899–3914. Lewin, A., Richardson, S., Marshall, C., Glazier, A., Aitman, T., 2006. Bayesian modeling of differential gene expression. Biometrics 62 (1), 1–9. Lo, K., Gottardo, R., 2007. Flexible empirical Bayes models for differential gene expression. Bioinformatics 23 (3), 328–335. Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., Tsui, K.W., 2001. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8 (1), 37–52. Newton, M.A., Noueiry, A., Sarkar, D., Ahlquist, P., 2004. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5 (2), 155–176. Pidgeon, G., Barr, M., Harmey, J., Foley, D., Bouchier-Hayes, D., 2001. Vascular endothelial growth factor(VEGF) upregulates BCL-2 and inhibits apoptosis in human and murine mammary adenocarcinoma cells. British Journal of Cancer 85 (2), 273–278. Ronen, A., Glickman, B., 2001. Human DNA repair genes. Environmental and Molecular Mutagenesis 37 (3), 241–283. Rossell, D., 2009. GaGa: a parsimonious and flexible model for differential expression analysis. Annals of Applied Statistics 3 (3), 1035–1051. Singh, D., Febbo, P., Ross, K., Jackson, D., Manola, J., Ladd, C., Tamayo, P., Renshaw, A., D’Amico, A., Richie, J., Lander, E., Loda, M., Kantoff, P., Golub, T., Sellers, W., 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 (2), 203–209. Smith, B.J., 2007. Boa: an R package for MCMC output convergence assessment and posterior inference. Journal of Statistical Software 21 (11). Smyth, G.K., 2004. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical applications in genetics and molecular biology 3 (1), 1027. Storey, J.D., Tibshirani, R., 2003. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100 (16), 9440–9445. Tibshirani, R., Hastie, T., 2007. Outlier sums for differential gene expression analysis. Biostatistics 8 (1), 2–8. Tomlins, S.A., Rhodes, D.R., Perner, S., Dhanasekaran, S.M., Mehra, R., Sun, X.W., Varambally, S., Cao, X.H., Tchinda, J., Kuefer, R., Lee, C., Montie, J.E., Shah, R.B., Pienta, K.J., Rubin, M.A., Chinnaiyan, A.M., 2005. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310 (5748), 644–648. Wu, B.L., 2007. Cancer outlier differential gene expression detection. Biostatistics 8 (3), 566–575.