Journal of the Korean Statistical Society (
)
–
Contents lists available at ScienceDirect
Journal of the Korean Statistical Society journal homepage: www.elsevier.com/locate/jkss
Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq Tae Young Yang a, *, Seongmun Jeong b a
Department of Mathematics, Myongji University, Yongin, Kyonggi 449-728, Republic of Korea Personalized Genomic Medicine Research Center, Division of Strategic Research Groups Korea Research Institute of Bioscience and Biotechnology, Daejeon, 34141, Republic of Korea b
article
info
Article history: Received 20 September 2016 Accepted 24 August 2017 Available online xxxx AMS 2000 subject classifications: primary 62P10 secondary 92B05 Keywords: Common weight Individual weight Length bias RNA-Seq Separate procedure Weighted procedure
a b s t r a c t In RNA-Seq experiments, the number of mapped reads for a given gene is proportional to its expression level and length. Because longer genes contribute more sequencible fragments than do shorter ones, it is expected that even if two genes have the same expression level, the longer gene will have a greater number of total reads. This characteristic creates a length bias such that the proportion of significant genes increases with the gene length. However, genes with a long length are not more biologically meaningful than genes with a short length. Therefore, the length bias should be properly corrected to determine the accurate list of significant genes in RNA-Seq. For this purpose, we proposed two multiple-testing procedures based on a weighted-FDR and a separate-FDR approach. These two methods use prior information on differential gene length while keeping the false-discovery rate (FDR) controlled at α . In the weighted-FDR controlling procedure, we incorporated prior weights for the length of each gene. These weights increase the power when the gene’s length is short and decrease the power when its length is long. In the separate-FDR controlling procedure, we sequentially ordered all genes according to their lengths and then split these genes into two subgroups of short and long genes. The adaptive Benjamini–Hochberg procedure was then performed separately for each subgroup. The proposed procedures were compared with existing methods and evaluated in two numerical examples and one simulation study. We concluded that the weighted p-value procedure properly reduced the length bias of RNA-Seq. © 2017 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved.
1. Introduction Whereas standard microarray probes only cover about 20% of a gene on average, capturing only a portion of the biologically relevant data, next-generation sequencing technologies have become widely used, enabling researchers to profile and quantify transcripts across the entire transcriptome in a technique called RNA-Seq. RNA-Seq has various advantages over microarray-based technology, including high resolution to identify novel genes (Zheng, Chung, & Zhao, 2011), and it has become a commonly used alternative to microarray-based gene-expression profiling (Li & Tibshirani, 2011). In an RNA-Seq experiment, purified mRNAs from a tissue are sheared into small fragments, which are then converted into complementary DNA. This complementary DNA library is amplified and sequenced by a high-throughput platform, such as Illumina’s Genome Analyzer (Bullard, Purdom, Hansen, & Dudoit, 2010). This process generates millions of short reads, which are then mapped to the genome or transcriptome (Li, Witten, Johnstone, & Tibshirani, 2012). For a set of regions of interest,
*
Correspondence to: Myongji University, Department of Mathematics, Yongin, Kyonggi 17058, Republic of Korea. E-mail address:
[email protected] (T.Y. Yang).
http://dx.doi.org/10.1016/j.jkss.2017.08.001 1226-3192/© 2017 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved.
Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
2
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
such as gene-level or exon-level units, depending on the purpose of the experiment, the number of reads mapped to each region are counted and then used as an expression level of that region (Li & Tibshirani, 2011). In this article, the region of interest is genes. Gene expressions for microarrays are represented as continuous numbers, and differential expression among different conditions can often be analyzed by a t-test. In contrast, the discrete counts of RNA-Seq arise from either a Poisson density or a negative binomial density. Additionally, the differential expression among different conditions can often be analyzed using Fisher’s exact test or a nonparametric test, such as the Wilcoxon’s rank sum test. The gene-level p-value of RNA-Seq can be obtained by various R programs, including edgeR (Robinson, McCarthy, & Smyth, 2010), DEGseq (Wang, Feng, Wang, Wang, & Zhang, 2010), DESeq (Anders & Huber, 2010), and baySeq (Hardcastle & Kelly, 2010). Kvam, Liu, and Si (2012) compared the performance of these R programs. Because longer genes in RNA-Seq contribute more sequencible fragments than shorter ones do, it is expected that even if two genes have the same expression level, the different length will yield different total read numbers (Bullard et al., 2010; Young, Wakefield, Smyth, & Oshlack, 2010). Oshlack and Wakefield (2009) determined that the characteristic of RNA-Seq that maps reads for a given gene depends on the genes expression level and length. Thus, this characteristic creates a length bias, such that the proportion of significant genes increases with the gene length. Because the analysis of other external sources, such as the microarray datasets and the quantitative real-time polymerase chain reaction (qRT-PCR) dataset in Section 3, also shows no evidence of length bias, gene expression is not related to its transcript length. Thus, genes with long length are not more biologically meaningful than genes with short length (Young et al., 2010). The length bias of RNA-Seq should be properly corrected to determine the accurate list of significant genes, which would be used as a base for subsequent analyses such as pathway analysis or gene-set enrichment analysis. 1.1. Standard procedure for controlling the false-discovery rate A significant gene list was obtained by first calculating each gene’s test statistic based on expression levels under different conditions and ranking genes based on their corresponding p-values. Many gene-level tests are carried out so that the statistical significance of each gene’s p-value can be adjusted for multiple testing by controlling the FDR, i.e., the percentage of true-negative genes that are incorrectly rejected. The FDR often serves as a target for control in testing high-dimensional problems in genomic data analysis. Benjamini and Hochberg (1995) originally provided a method, referred to as the BH procedure, for controlling the FDR at a more stringent level. The FDR of the BH procedure at level α is equal to π0 α , where π0 represents, in general, the unknown proportion of the null hypothesis. Only when π0 is close to 1 does the procedure control the FDR at level α . The adaptive BH method is the standard method, which properly controls the FDR at level α by first estimating π0 with πˆ 0 (Storey, 2002) and then performing the BH procedure at level α/πˆ 0 . Because α/πˆ 0 is larger than the original level α of the BH procedure, the adaptive BH procedure is less conservative and thus more powerful than the original BH procedure, especially when π0 is far from 1. Because πˆ 0 in the numerical examples of Section 3 are far from 1, the adaptive BH procedure is more appropriate for controlling the FDR at α than the original BH procedure. When all genes originate from the same condition, the genes are exchangeable (Cai & Sun, 2009). However, an important feature of RNA-Seq is gene-length bias, which renders the genes are no longer exchangeable. The adaptive BH procedure conducts an analysis on all genes under the assumption that genes are exchangeable. Cai and Sun (2009) and Morris (2008) indicated that when exchangeability is not satisfied, the significant gene list of the procedure would not be appropriate. If we ignore the length bias and conduct the procedure as an analysis on all genes, then the resulting significant gene list favors long genes over short genes (Efron, 2008). This is also shown in Sections 3 and 4. Furthermore, because the procedure treats all null hypotheses equally, it cannot handle the prior information regarding the relationship between a gene’s significance and its length. 1.2. Appropriate FDR controlling procedures For multiple hypothesis testing that maintains control of the false discovery rate while incorporating prior information about the hypotheses, many authors including Genovese, Roeder, and Wasserman (2006) and Roeder and Wasserman (2009) have considered using weighted p-values. Particularly, Roeder and Wasserman (2009) provided two p-value weighting methods of the external weighting and the estimated weighting. We sequentially ordered all genes according to their lengths and then split these genes into two subgroups of short and long genes. Because the proportion of significant genes in the first subgroup was smaller than that in the second subgroup, each gene within the first subgroup was more likely to be non-significant. The genes in the first subgroup are needed to make the rejection more likely. Conversely, the genes in the second subgroup are needed to make the rejection less likely. More appropriate multiple-testing procedures are needed to properly remove the gene-length bias of RNA-Seq while keeping the FDR controlled at α . For this purpose, we provided two procedures: a weighted FDR controlling procedure and a separate FDR controlling procedure. In the weighted-FDR controlling procedure, we incorporated the prior weight based on the length of each gene, which takes two forms: individual weight and common weight. These weights increase the power when the gene’s length is short and decrease the power when its length is long. Thus, the procedure would provide the result that long genes are not expressed significantly more than short genes are. Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
3
In the separate-FDR controlling procedure, we sequentially ordered all genes according to their lengths and then split these genes into two subgroups of genes with short and long genes. In the procedure, we separately performed the adaptive BH procedure for each subgroup. If a gene was only compared with genes in the same subgroup that had similar transcript lengths (Yang & Jeong, 2013), then long genes would not be expressed significantly more than short genes would be. Efron (2008) showed the legitimacy of such a separate procedure. This article is organized as follows. In Section 2, we specify our proposed procedures. Our results for two real datasets are given in Section 3. In Section 4, we describe a simulation study used to assess the proposed procedures. 2. Method In defining the significant gene list of RNA-Seq, appropriate multiple-testing procedures are needed to properly remove the gene-length bias of RNA-Seq while keeping the FDR controlled at α . For this purpose, a weighted-FDR controlling procedure and a separate-FDR controlling procedure are proposed. The two procedures are different but are closely related in that both use prior information on gene length while controlling the FDR at α . 2.1. Weighted FDR controlling procedure Suppose that m gene-level tests are being performed. Gene i (i = 1, . . . , m) has pi as the p-value of its test statistic. Let W = {w1 , . . . , wm } be a vector of prior weights. Let ri = min{wi pi , 1} and r(1) < · · · < r(m) . The weighted-FDR controlling procedure rejects any null hypothesis for which r ≤ t, such that t = max{r(i) :
πˆ 0 mr(i) i
≤ α},
where πˆ 0 ∑ is the estimated proportion of null (non-significant) genes (Storey, 2002). As long as w is assigned to each p-value m such that i=1 wi = m, the procedure controls FDR at α (Genovese et al., 2006; Wasserman & Roeder, 2009). When wi = 1 for i = 1, . . . , m, r(i) = p(i) for the adaptive BH procedure, which is a uniformly weighted method. If w is properly defined, then the procedure significantly increases the power; however, if w is not proper, then the power loss is not substantial (Genovese et al., 2006; Roeder, Devlin, & Wasserman, 2007; Roeder & Wasserman, 2009). This flexibility provides a way to reduce the RNA-Seq length bias. Here, two types of w were considered, as described below. Individual p-value weight ∑m Let xi be the length of gene i (i = 1, . . . , m) and x¯ = i=1 xi /m. Human gene lengths can be obtained from the UCSC genome browser (http://genome.ucsc.edu/cgibin/hgTables). We define
wi = xi /¯x. By this construction, the weighting increases the power (wi < 1) when the length of a gene is shorter than the average length and decreases the power (wi > 1) when its length is greater than the average length. By differentially weighting each p-value, genes having a short length are assigned small weights, resulting in smaller adjusted p-values, and they are thus more likely to be found significant. In contrast, genes with long length are up-weighted, resulting in larger adjusted p-values; thus, these genes are less likely to be significant. Common p-value weight We sequentially ordered all m genes according to their lengths and then split these genes into C1 and C2 : gene i ∈ C1 when xi ≤ x¯ and gene ∑ i ∈ C2 when xi > x¯ . Thus, C1 and C2 represent the subgroups of genes with short and long genes, respectively. We let x¯ j = i∈Cj xi /mj be the local mean of Cj for j = 1, 2, where mj is the size of Cj and m = m1 + m2 . Each gene in Cj received a common weight
wi = x¯ j /¯x if gene i ∈ Cj , which is the relative length weight of each subgroup. P-values within each subgroup were weighted equally, whereas weight varied by subgroup. The weight commonly increases the power when the length of a gene is shorter than the average length and decreases the power when its length is larger than the average length. 2.2. Separate-FDR controlling procedure As defined previously, let C1 and C2 represent the subgroup of genes with short and long genes, respectively. When calculating the FDR of each gene-level p-value, we separately conducted the adaptive BH procedure within each subgroup. Under this circumstance, a gene is only compared with genes in the same subgroup, i.e., ones that have similar transcript lengths. For each Cj (j = 1, 2), we let p(1) < · · · < p(mj ) and rejected any null hypothesis for which p ≤ tj , such that tj = max{p(i) :
πˆ 0j mj p(i) i
≤ α},
Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
4
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
Fig. 1. Percentage of significant genes as a function of gene length (bases) for the RNA-Seq dataset of MicroArray Quality Control 2 (MAQC-2): (a) before normalization and (b) after normalization. The gene-level p-values were obtained from DESeq (Anders & Huber, 2010). Each circle shows 100 genes with similar lengths. P-value testing the null hypothesis that the slope is zero. The proportion of significant genes increased with length in the RNA-seq data; normalization using RPKM did not remove the gene-length bias.
where πˆ 0j is the estimated proportion of null genes of Cj (Oshlack & Wakefield, 2009). When FDR in each subgroup is controlled at the same level α , the overall FDR is controlled at α (Efron, 2008; Sun, Craiu, Paterson, & Bull, 2006). To rank the total m genes, we pooled all of the adjusted p-values together over C1 and C2 . 3. Numerical results In Section 2, we introduced a weighted-FDR controlling procedure and a separate-FDR controlling procedure. With respect to the weighted-FDR controlling procedure, we considered two weights: individual weight and common weight. Hereafter, the three procedures are respectively referred to as individual-weighted FDR, common-weighted FDR, and separate FDR. They are compared and evaluated in two widely used RNA-Seq datasets of MicroArray Quality Control 2 (MAQC-2) (Bullard et al., 2010) and Marioni et al. (2008). 3.1. MAQC-2 The MicroArray Quality Control 2 (MAQC-2) experiment compared Ambion’s Human Brain Reference RNA and Stratagene Human Universal Reference RNA, here referred to as Brain and UHR, respectively (Bullard et al., 2010). The experiment included the RNA-Seq dataset and two additional datasets consisting of qRT-PCR and microarray datasets. Detailed information on pre-processing, the web-site, and the experimental design of RNA-Seq, the microarray, and qRT-PCR, can be found in Bullard et al. (2010). In the RNA-Seq analysis, seven Brain replicates and seven UHR replicates of two flow cells were processed using Illumina’s standard RNA-Seq protocol. The RNA-Seq dataset contained 52,580 genes; however, many genes had less than 10 reads in total. Those genes were deleted. For comparison purposes, we also deleted genes that were not in the Affymetrix microarray. Finally, 9301 genes were left for analysis. Gene-level p-values of RNA-Seq were obtained by DESeq (Anders & Huber, 2010) because it has been widely used in practice. We applied the adaptive BH procedure to 9301 genes at an FDR of 0.01, and discovered 6746 significant genes. The length bias of non-normalized raw read counts is shown in Fig. 1(a). There is clearly a positive association between the proportion of significant genes and the RNA-Seq length. Several normalizations, such as reads per kilobase of exon model per million mapped reads (RPKM) (Mortazavi, Williams, McCue, Schaeffer, & Wold, 2008) or, similarly, fragments per kilobase of the exon model per million mapped reads (FPKM) (Trapnell et al., 2010), are often used. These normalizations adjust the read numbers by both the transcript length and the sequencing depth. However, as shown in Fig. 1(b), even normalization using RPKM did not eliminate the gene-length bias. Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
5
Fig. 2. Percentage of significant genes according to gene length bases for the dataset of MAQC-2: (a) microarray and (b) quantitative real-time polymerase chain reaction qRT-PCR. Each point of (a) and (b) represents a bin of 100 and 25 genes, respectively, with similar transcript length. No increasing pattern was observed between gene length and differential expression.
In the microarray analysis, the two biological samples of Brain and UHR were processed using five Affymetrix Human Genome U133 microarray chips for each. The microarray dataset contained 54,675 genes. For identifying significant genes between Brain and UHR, we used the p-values from simple t-test statistics. The adaptive BH procedure then found 6678 significant genes at an FDR of 0.025. The gene IDs of the Affymetrix’s microarray dataset were converted to the Ensembl gene IDs of Illumina’s RNA-seq dataset using biomaRt (http://www.biomart.org). When several Affymetrix gene IDs had the common RNA-Seq Ensembl gene ID, we let the expression level of the common Ensembl gene ID be the median expression level of the Affymetrix’s genes. The qRT-PCR experiment analyzed 997 genes using four replicates each for the Brain and UHR samples. To compare expression measures among the three technologies, i.e., RNA-Seq, microarray, and qRT-PCR, we only used 440 genes common to both the microarray genes and the RNA-Seq genes in the following analysis. The absolute difference between the mean of the four UHR replicates and that of the four Brain replicates was calculated to find the significant genes by qRT-PCR. The significance of the gene was determined based on whether the absolute difference was larger than 0.5 (Bullard et al., 2010). Of the total 440 genes, 150 genes were identified as significant by the qRT-PCR experiment. In Fig. 2(a) and (b), the percentages of significant genes from the microarray dataset and the qRT-PCR dataset, respectively, are plotted as a function of gene length. Compared with the length bias of the RNA-Seq in Fig. 1, Fig. 2 does not show a positive relationship between differential expression and gene length. Fig. 3 summarizes the results from the individual-weighted FDR, the common-weighted FDR, and the separate FDR based on the 9301 p-values of RNA-Seq obtained from DESeq (Anders & Huber, 2010). Compared with Fig. 1, Fig. 3 shows a clear reduction in the gene length bias of RNA-Seq. Furthermore, overall, the individual-weighted FDR and the common-weighted FDR outperformed the separate FDR and the standard BH procedure. Because qRT-PCR technology is known to yield accurate estimates of expression levels, its results are treated as a goldstandard. Here, we only considered the 440 genes for which qRT-PCR data were available in common with the microarray filtered genes and the RNA-Seq filtered genes. We applied (I) the adaptive BH procedure to 440 microarray genes; (II) the adaptive BH procedure to 440 RNA-Seq genes; (III) the Bullard et al. (2010) method to 440 RNA-Seq genes; (IV) the Gao, Fang, Zhang, Zhi, and Cui (2010) method to 440 RNA-Seq genes; (V) the individual-weighted FDR to 440 RNA-Seq genes; (VI) the common-weighted FDR applied to 440 RNA-Seq genes; and (VII) the separate FDR to 440 RNA-Seq genes. As a result, (I) identified 265 significant genes with an FDR of 0.01, and (II)–(VII) identified 322, 318, 343, 325, 319, and 318 significant genes, respectively, with an FDR of 0.001. Note that the FDR levels were used differently for the microarray of (I) and RNA-Seq of (II)–(VII) because no significant genes could be found in the microarray at an FDR of 0.001. Overall, (I)–(VII) identified 131, 144, 138, 148, 148, 147, and 147 genes, respectively, among the 150 significant genes identified by the qRT-PCR experiment. Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
6
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
Fig. 3. Percentage of significant genes according to gene length (bases) for the dataset of MAQC-2: (a) adaptive Benjamini–Hochberg (BH) procedure, (b) individual-weighted false-discovery rate (FDR), (c) common-weighted FDR, and (d) separate FDR. Each circle shows 100 genes with similar lengths. P-value testing the null hypothesis that the slope is zero. The individual-weighted FDR and the common-weighted FDR overall outperformed the separate FDR and the adaptive BH procedure.
Table 1 True-positive rate (TPR) and false-positive rate (FPR). Method
TPR
FPR
(I) adaptive BH procedure to microarray (II) adaptive BH procedure to RNA-Seq (III) Bullard et al. to RNA-Seq (IV) Gao et al. to RNA-Seq (V) individual-weighted FDR to RNA-Seq (VI) common-weighted FDR to RNA-Seq (VII) separate FDR to RNA-Seq
0.873 0.960 0.920 0.987 0.987 0.980 0.980
0.462 0.614 0.621 0.672 0.610 0.593 0.590
Table 1 shows the false-positive rates (FPR) and true-positive rates (TPR) using qRT-PCR. The 150 significant genes detected by qRT-PCR were considered true positives; the remaining 290 insignificant genes were identified as true negatives. In general, the performance of our methods in (V)–(VII) was satisfactory. In particular, the power (TPR) of our methods was high. In Fig. 4, the proportions of overlapping genes between the 150 top-ranked genes of qRT-PCR and the 150 top-ranked genes of (I)–(VII) are shown. Fig. 4 clearly indicates that our methods (V)–(VII) had more overlap with the qRT-PCR than the existing methods used in (I)–(IV). Furthermore, the RNA-Seq platform outperformed the microarray, with (II)–(VII) having more overlap with the qRT-PCR than (I). 3.2. Marioni et al. The study by Marioni et al. (2008) was designed to compare an RNA-Seq experiment with a microarray experiment. The microarray data were obtained from three human kidney and three liver samples by profiling Affymetrix arrays. The microarray dataset contained 17,708 genes. An RNA-Seq dataset was generated for the same liver and kidney tissues with five replicates per tissue using the Illumina Genome Analyzer. The RNA-Seq dataset contained 32,000 genes; however, genes with fewer than five reads in total were deleted. For comparison, the Affymetrix gene IDs of the microarray dataset were changed to the Ensembl gene IDs of the RNASeq dataset using biomaRt (http://www.biomart.org). We also removed genes that were not in the Affymetrix microarray. Additionally, genes were removed when their lengths were unavailable in the database of the UCSC genome browser. Finally, the microarray and RNA-Seq datasets had 15,097 genes in common, which were used in the following analysis. For identifying significant genes in the microarray between the kidney and liver, p-values were obtained from a simple t-test. We applied (I) the adaptive BH procedure to the 15,097 microarray genes with an FDR of 0.05 and discovered 10,170 significant genes. Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
7
Fig. 4. Proportion of overlapping genes between the 150 top-ranked genes of qRT-PCR and the 150 top-ranked genes of the (I) adaptive BH procedure applied to the microarray; (II) adaptive BH procedure applied to RNA-Seq; (III) Bullard et al. method applied to RNA-Seq; (IV) Gao et al. method applied to RNA-Seq; (V) individual-weighted FDR applied to RNA-Seq; (VI) common-weighted FDR applied to RNA-Seq; and (VII) separate FDR applied to RNA-Seq. Our proposed methods for (V)–(VII) had more overlap with qRT-PCR than methods (I)–(IV).
For identifying significant genes in RNA-Seq between the kidney and liver, 15,097 gene-level p-values were calculated using DESeq (Anders & Huber, 2010). The obtained RNA-Seq p-values were adjusted for multiple testing by applying (II) the adaptive BH procedure, (III) the Bullard et al. (2010) method, (IV) the Gao et al. (2010) method, (V) the individual-weighted FDR, (VI) the common-weighted FDR, and (VII) the separate FDR. Finally, (II)–(VII) identified 11735, 10366, 10408, 9646, 9539, and 9354 significant genes, respectively, with an FDR of 0.05. The adaptive BH procedure was applied to the microarray and RNA-Seq. Fig. 5 shows the proportion of significant genes plotted as a function of gene length. In Fig. 5(a), the RNA-Seq data show a strong increasing pattern, reflecting the gene-level bias, whereas in Fig. 5(b) the microarray data do not show a systematic increasing pattern with length. The microarray data in Fig. 5(b) interestingly show a weak decreasing trend, which is opposite to the gene-level bias pattern observed in the RNA-Seq data. To compare (II)–(VII) for the RNA-Seq, we used (I), the microarray result of the adaptive BH procedure, as a standard. In Fig. 6, we considered the 10,170 significant genes to be true positives, and the remaining 4927 non-significant genes to be true negatives, and then computed the FPR and TPR of (II)–(VII). In general, the performance of our results for (V)–(VII) was satisfactory. The small number of false positives (FPR) indicates that our methods for (V)–(VII) were able to control for the Type I error. The power (TPR) of our methods was high. The proportion of overlapping genes between the top 250 genes of (I) versus the top 250 genes of (II)–(VII) are plotted in Fig. 7. The figure clearly shows that the RNA-Seq results of methods (V)–(VII) had more overlap with the microarray results of (I) than did the existing (II)–(IV) methods. 4. Simulation We executed a simulation study to assess our proposed methods with respect to their suitability for removing the length bias of RNA-Seq. Here, we used 15,097 Ensembl gene IDs from the RNA-Seq dataset of Marioni et al.; their corresponding human gene lengths were obtained from the UCSC genome browser. We sequentially ordered these genes according to Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
8
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
Fig. 5. Percentage of significant genes according to gene length bases in Marioni et al. (2008). Each circle shows 100 genes with similar lengths: (a) RNA-Seq and (b) microarray.
Fig. 6. ROC curve using Marioni et al. (2008). Separate weight method is the best, and individual and common weight methods have more area under curve than the adaptive BH, Bullard et al., and Gao et al.
their lengths and then divided these genes into Q1 through Q4 , which were separated further at the 25th, 50th, and 75th percentiles in the distribution of all genes. As such, Q1 , Q2 , Q3 , and Q4 represent the category of genes with short, moderately short, moderately long, and long length, respectively. Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
9
Fig. 7. Proportion of common genes between the top 250 genes of (I) the adaptive BH procedure applied to the microarray versus the top 250 genes of (II) the adaptive BH procedure applied to RNA-Seq, (III) Bullard et al. method applied to RNA-Seq, (IV) Gao et al. applied to RNA-Seq, (V) individually weighted FDR applied to RNA-Seq, (VI) common-weighted FDR applied to RNA-Seq, and (VII) separate FDR applied to RNA-Seq. The proposed three methods, i.e., (V)–(VII), had more overlap with the standard method than with the other methods, (I)–(III).
We considered a mixture density of p1 ∗ u(0, 0.0001) + p2 ∗ u(0.0001, 0.001) + p3 ∗ u(0.001, 0.1) + p4 ∗ u(0.1, 1), where p1 + p2 + p3 + p4 = 1 and u(a,b) represents the uniform density within the two boundaries of a and b. We randomly sampled gene-level p-values from the mixture density according to the following scheme: p1 = 0.1, p2 = 0.15, p3 = 0.35, p4 = 0.4 in Q1 ; p1 = 0.15, p2 = 0.15, p3 = 0.3, p4 = 0.4 in Q2 ; p1 = 0.2, p2 = 0.15, p3 = 0.25, p4 = 0.4 in Q3 ; and p1 = 0.25, p2 = 0.1, p3 = 0.2, p4 = 0.5 in Q4 . With this scheme, the simulated p-values have a bias pattern such that the percentage of significant genes increases with gene length. These simulated p-values were corrected for multiple testing using the adaptive BH procedure, individual-weighted FDR, common-weighted FDR, and separate FDR at the 0.1 level. We then obtained a list of significant genes for each method. One thousand simulations were executed to obtain the empirical distribution for the percentage of significant genes according to gene length. Suppose that some procedure yields approximately 25% significant genes over the range of Q1 -Q4 ; then, the gene expression is not related to its transcript length, and the procedure properly controls the length bias. Fig. 8 shows the simulated results for the percentages of significant genes according to the gene length. The individual-weighted FDR procedure and the common-weighted FDR procedure were overall comparable in reducing the gene-length bias. They controlled the length bias better than the adaptive BH procedure or separate FDR procedure did. Various simulation studies were also executed (not shown). We obtained similar results to those given by the proposed weighted procedure, with a properly controlled length bias. 5. Discussion For the partition of the p-values, our procedures sequentially ordered all genes according to their lengths and then split these genes into two subgroups of short and long genes. That is, we separated all genes into two groups. Fig. 9 shows that separating all genes into two groups is the best for various number of separate groups. Furthermore, as the number of groups increases, area under ROC curve decreases gradually. Statistical tests to detect differential expression among different conditions (e.g. t-test for case vs control) depend on the sample mean and the sample variance of each condition. Because the expected read count of a gene is proportional to the length as well as expression level of a transcript, the sampling is higher for long genes compared to short genes (Oshlack & Wakefield, 2009). Variances are expected to be small due to large number of reads. These small variances across different conditions may cause more power to detect differential expression at a given statistical significance regardless of the specific test used, even though the amount of length bias is similar among different conditions. 6. Conclusion We introduced a weighted-FDR procedure and a separate-FDR procedure, which were compared with existing models and evaluated in two numerical examples and one simulation study. The proposed methods outperformed the existing Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
10
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
Fig. 8. Percentage of significant genes according to the gene-length bases: (a) adaptive BH procedure, (b) individually weighted FDR, (c) common-weighted FDR, and (d) separate FDR. Q1 , Q2 , Q3 , and Q4 represent the categories of genes with short, moderately short, moderately long, and long length, respectively. The individually weighted FDR has approximately 25% significant genes over the range of Q1 to Q4 . This indicates that the weighted FDR procedures of (b) and (c) properly controlled the length bias of RNA-Seq.
Fig. 9. ROC curve using Marioni et al. (2008). Separating all genes into two groups is the best of various number of groups. As the number of groups increases, area under curve decreases gradually.
methods in discerning true-positive and false-positive rates. Moreover, the proposed methods were comparable overall in true-positive rates and showed low false-positive rates; however, the weighted-FDR procedure was better able to reduce the length bias of RNA-Seq than the separate-FDR procedure. To implement the proposed methods, we developed the R program, which can be downloaded from http://home.mju.ac. kr/home/index.action?siteId=tyang. Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.
T.Y. Yang, S. Jeong / Journal of the Korean Statistical Society (
)
–
11
Acknowledgment Tae Young Yang’s research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A1A01056923). References Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a pratical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 57, 289–300. Bullard, J. H., Purdom, E., Hansen, K. D., & Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. Cai, T., & Sun, W. (2009). Simultaneous testing of grouped hypotheses: finding needles in multiple haystacks. Journal of the American Statistical Association, 104, 1467–1481. Efron, B. (2008). Simultaneous inference: when should hypothesis testing problems be combined? Annals of Statistics, 2, 197–223. Gao, L., Fang, Z., Zhang, K., Zhi, D., & Cui, X. (2010). Length bias correction for RNA-seq data in gene set analyses. Bioinformatics, 27(5), 662–669. Genovese, C. R., Roeder, K., & Wasserman, L. (2006). False discovery control with p-value weighting. Biometrika, 93, 509–524. Hardcastle, T. J., & Kelly, K. A. (2010). bayseq: empirical bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics, 11, 422. Kvam, V. M., Liu, P., & Si, Y. (2012). A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. American Journal of Botany, 99(2), 248–256. Li, J., & Tibshirani, R. (2011). Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data. Statistical Methods in Medical Research. Li, J., Witten, D. M., Johnstone, I. M., & Tibshirani, R. (2012). Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics, 13, 523–538. Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., & Gilad, Y. (2008). RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18, 1509–1517. Morris, B. (2008). Comment on microarrays, empirical bayes, and the two- groups model. Statistical Science, 23, 34–40 by B. Efron. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., & Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 7, 621–628. Oshlack, A., & Wakefield, M. J. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology Direct, 4, 14. Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. Roeder, K., Devlin, B., & Wasserman, L. (2007). Improving power in genome-wide association studies: weights tip the scale. Genetic Epidemiology, 31, 741– 747. Roeder, K., & Wasserman, L. (2009). Genome-Wide Significance Levels and Weighted Hypothesis Testing. Statistical Science, 24, 398–411. Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64, 479–495. Sun, L., Craiu, R. V., Paterson, A. D., & Bull, S. B. (2006). Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genetic Epidemiology, 30, 519–530. Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., van Baren, M. J., et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511–515. Wang, L., Feng, Z., Wang, X., Wang, X., & Zhang, X. (2010). DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics, 26, 136–138. Wasserman, L., & Roeder, K. (2009). High-dimensional variable selection. The Annals of Statistics, 37, 2178–2201. Yang, T., & Jeong, S. (2013). Grouped false-discovery rate for removing the gene-set-level bias of RNA-seq. Evolution Bioinformatics, 9, 467–478. Young, M. D., Wakefield, M. J., Smyth, G. K., & Oshlack, A. (2010). Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, 11, R14. Zheng, W., Chung, L., & Zhao, H. (2011). Bias detection and correction in RNA-sequencing data. BMC Bioinformatics, 12, 290.
Please cite this article in press as: Yang, T.Y., Jeong, S., Controlling the false-discovery rate by procedures adapted to the length bias of RNA-Seq. Journal of the Korean Statistical Society (2017), http://dx.doi.org/10.1016/j.jkss.2017.08.001.