Journal Pre-proof Robust identification of differentially expressed genes from RNAseq data
Md. Shahjaman, Md. Manir Hossain Mollah, Md. Rezanur Rahman, S.M. Shahinul Islam, Md. Nurul Haque Mollah PII:
S0888-7543(19)30575-0
DOI:
https://doi.org/10.1016/j.ygeno.2019.11.012
Reference:
YGENO 9405
To appear in:
Genomics
Received date:
21 August 2019
Revised date:
27 October 2019
Accepted date:
18 November 2019
Please cite this article as: M. Shahjaman, M. Manir Hossain Mollah, M. Rezanur Rahman, et al., Robust identification of differentially expressed genes from RNA-seq data, Genomics (2018), https://doi.org/10.1016/j.ygeno.2019.11.012
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2018 Published by Elsevier.
Journal Pre-proof
Robust Identification of Differentially Expressed Genes from RNA-seq Data Md. Shahjaman1*, Md.Manir Hossain Mollah 2, Md. Rezanur Rahman3, S. M. Shahinul Islam4, and Md. Nurul Haque Mollah5 1
-p
Abstract
ro
[email protected]
of
Department of Statistics, Begum Rokeya University, Rangpur-5400, Bangladesh;
[email protected] 2 Department of Bioinformatics and Public Health, Asian University for Women, Chittagong, Bangladesh 3 Department of Biochemistry and Biotechnology, School of Biomedical Science, Khwaja Yunus Ali University, Sirajgonj, Bangladesh;
[email protected] 4 Institutitute of Biological Science (IBSc), University of Rajshahi, Rajshahi-6205, Bangladesh;
[email protected] 5 Bioinformatics Lab., Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh;
re
Background: Identification of differentially expressed genes (DEGs) under two or more experimental conditions is an important task for elucidating the molecular basis of phenotypic
lP
variation. In the recent years, next generation sequencing (RNA-seq) has become very attractive and competitive alternative to the microarrays because of reducing the cost of sequencing and
na
limitations of microarrays. A number of methods have been developed for detecting the DEGs from RNA-seq data. Most of these methods are based on either Poisson distribution or negative
ur
binomial (NB) distribution. However, identification of DEGs based on read count data using
Jo
skewed distribution is inflexible and complicated of in presence of outliers or extreme values. Results: Most of the existing DEGs selection methods produce lower accuracies and higher false discoveries in presence of outliers. There are some robust approaches such as edgeR_robust and DEseq2 perform well in presence of outliers for large sample case. But they show weak performance for small-sample case, in presence of outliers. To address this issues an alternative approach has emerged by transforming the RNA-seq data into microarray like data. Among various transformation methods voom using limma pipeline is proven better for RNA-seq data. However, limma by voom transformation is sensitive to outliers for small-sample case. Therefore, in this paper, we robustify the voom approach using the minimum β-divergence method. We demonstrate the performance of the proposed method in a comparison of seven popular biomarkers selection methods: DEseq, DEseq2, SAMseq, Bayseq, limma (voom), edgeR
Journal Pre-proof and edgeR_robust using both simulated and real dataset. Both types of experimental results show that the performance of the proposed method improve over the competing methods, in presence of outliers and in absence of outliers it keeps almost equal performance with these methods. Conclusion: We observe the improved performance of the proposed method from simulation and real RNA-seq count data analysis for both small-and large-sample cases, in presence of outliers. Therefore, our proposal is to use the proposed method instead of existing methods to obtain the better performance for selecting the DEGs.
ro
of
Keywords: RNA-sequence data, log-cpm, DEGs, β-weight function and Robustness.
Background
-p
Identification of differentially expressed genes (DEGs) under two or more condition is a
re
fundamental objective of the transcriptomics data analysis. These DEGs play the important roles for the development of agriculture, health and environment. During the past decades microarrays
lP
have been widely used to perform this task. But after the invention of next generation sequencing (NGS) technology RNA-seq has emerged an attractive and alternative choice of microarrays [1-
na
2]. Both microarrays and RNA-seq uses the high-throughput technology to produce a vast amount of data. Thus curse of dimensionality often occurs here. It means "large p, small n"
ur
problem. Hence, dimension reduction of the data matrix is a primary objective for further
approaches.
Jo
downstream analysis. Identification of biomarkers is one of the dimensionality reduction
Array-based technology measures intensities using continuous distributions. In RNA-seq, cDNA (which is derived from RNA) are sequenced to obtain the millions of short reads. After that these short reads are mapped with its reference genome. The number of sequence reads mapped is then calculated to get the RNA-seq count data matrix. There are many advantages of RNA-seq over the microarrays: it (1) produces count data with low background noise that allow researcher to detect the transcript at low expression levels, (2) has ability to detect unknown or novel transcripts, allele-specific expression and alternative splicing isoforms, (3) does not require prior probe selection thus it avoids the related biases which usually occur during the hybridization of microarrays [3] and so on.
Journal Pre-proof
With these advantages of RNA-seq, identification of DEGs from count data has been increasing rapidly because of reducing the cost of sequencing. A number of methods have been developed to identify the DEGs from RNA-seq count data. Most of the methods use either Poisson or negative-binomial (NB) distribution to model the count data with variety of normalization procedures [4]. DEGseq based on Poisson distribution uses the Fisher’s exact test and likelihood ratio test. But it suffers from overdispersion issues and thus it is not suitable for count data analysis [5-8]. To mitigate these issues, methods based on NB distribution have been applied in
of
RNA-seq data. Among them baySeq, DESeq, DESeq2, EBSeq, edgeR, edgeR (robust) and NBPSeq have been extensively used with their respective R packages [9-13]. SAMSeq is a non-
ro
parametric method which does not assume any particular distribution for the data [14]. However,
-p
most of methods suffer from the small sample sizes in RNA-seq experiments and they cannot estimate properly the gene-wise dispersion parameters to share the information of all genes in the
re
dataset [15].
lP
Unfortunately, the aforementioned methods based on discrete distribution are inflexible and complicated in presence of extreme read counts (or outliers). Outliers may arise in RNA-seq
na
count data because there are several data generating stages from biological harvesting of RNA to counting of sequence read map [16]. Consequently, an alternative approach has emerged to
ur
address the above issue by transforming the RNA-seq count data and then apply the normalbased microarray-like statistical methods. A numerous transformation methods have been
Jo
available for RNA-seq data which includes logarithmic transformation [17], variance-stabilizing transformation (vst) [8], TMM transformation [18], regularized logarithm [10] and variance modeling at the observation level (voom) [19]. Among various transformation methods voom using limma pipeline is proven better for RNA-seq data. voom estimates the mean-variance relationship of the log-cpm values to calculate the weight for each observation and input these into the limma’s empirical Bayes pipeline for differential analysis. But, it is sensitive to outliers for small-sample case and produces misleading results in presence of outliers. Therefore, in this paper, we robusify voom approach by minimum β-divergence method. The next section will be discussed about formulation of the proposed robust voom approach. A broad simulation and real RNA-seq data analysis are carried out in the results and discussion section.
Journal Pre-proof
Materials and Methods voom transformation Suppose the RNA-seq read count data matrix is denoted by ygik, where row represents the genes and column represents the samples (g=1,2,…, G; i=1,2,…,nk;k=1,2,…,m;). voom uses log-counts per million (log-cpm) to transform the read count data matrix as follows: 𝑦𝑔𝑖𝑘 +0.5 𝑅𝑖𝑘 +1
× 106 )
(1)
of
𝑥𝑔𝑖𝑘 = log2 (
where, 𝑅𝑖𝑘 denotes the total number of mapped reads for sample i for kth condition, 𝑅𝑖𝑘 =
ro
∑𝐺𝑔=1 𝑥𝑔𝑖𝑘 . This transform data usually follow the normal distribution, which depends on the
-p
degree of skewness before transformation. A more detail about the voom can be found in [19]. However, the log-cpm values use in voom approach is sensitive to outlying observations. Though
re
logarithmic transformation reduces the influence of low level extreme/outlying observations into
lP
the reasonable spaces during parameter estimations, however this transformation fails to reduce the influence of high level extreme/outlying observations. Therefore, Limma produces misleading
na
results in some cases. To overcome this problem, in this paper, we are proposing robust voom transformation for RNA-seq data.
ur
Robust voom transformation by β-divergence method
Jo
In the second step, the following linear model is fitted using the log-cpm values: 𝐸( 𝑦𝑔𝑖 ) = 𝜇𝑔𝑖 = 𝑥𝑖𝑇 𝛽𝑔
(3)
where, 𝑥𝑖 is a vector of covariates and 𝛽𝑔 is a vector of unknown coefficients representing log2fold-changes between experimental conditions. Using ordinary least squares method the above model is fitted by estimating 𝛽𝑔 : 𝜇̂ 𝑔𝑖 = 𝑥𝑖𝑇 𝛽̂𝑔
(4)
At the last step, the fitted log-cpm values 𝜇̂ 𝑔𝑖 are converted to fitted count by 𝜆̂𝑔𝑖 = 𝜇̂ 𝑔𝑖 + log2 (𝑅𝑖 + 1) − log2 (106 )
(5)
The voom precision weights are then calculated as the inverse variances 𝑤𝑔𝑖 = lo(𝜆̂𝑔𝑖 )−4 , where lo is a piecewise linear function defined by the locally weighted regression curve[19]. Then the
Journal Pre-proof log-cpm values 𝑦𝑔𝑖 and associated weights 𝑤𝑔𝑖 are input into limma’s well-established microarray analysis pipeline to the differential expression. However, 𝜆̂𝑔𝑖 using equation (2-4) is not robust against outliers if RNA-seq data matrix 𝑟𝑔𝑖 contain the outlying counts. Therefore, in this paper, an attempt is made to robustify the voom ̂𝑔𝑘,𝛽 = approach by minimum 𝛽-divergence method. The minimum 𝛽-divergence estimators 𝜽 2 2 (𝜇̂ 𝑔𝑘,𝛽 , 𝜎̂𝑔𝑘,𝛽 ) of the parameters 𝜽𝑔𝑘 = (𝜇𝑔𝑘 , 𝜎𝑔𝑘 ) are computed iteratively as follows: 𝑛
𝜇𝑔𝑘,𝑡+1 =
𝑘 𝜑 (𝑦 ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 ) 𝑦𝑔𝑖𝑗
(6)
𝑛
𝑘 𝜑 (𝑦 ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 )
of
and 𝑛
𝑛
𝑘 𝜑 (𝑦 ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 )
(7)
ro
=
𝑘 𝜑 (𝑦 2 (𝛽+1) ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 ) (𝑦𝑔𝑖𝑗 −𝜇𝑔𝑘,𝑡 )
where, 𝛽
𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 ) = exp {− 2𝜎2 (𝑦𝑔𝑖𝑗 − 𝜇𝑔𝑘,𝑡 )2 }
(8)
re
𝑔𝑘
-p
2 𝜎𝑔𝑘,𝑡+1
na
lP
which is known as 𝛽-weight function. The value of this 𝛽-weight function lies between 0 to 1. It produces larger weights corresponding to the usual gene expression and smaller weights corresponding to the unusual/outlying gene expression [20-21]. Therefore, in this paper we employed this weight function as a measure outlier detection with 𝛽 = 0.2, which was also used ∗ in the study [22]. We update the outlying log-cpm values of 𝑦𝑔𝑖𝑗 as 𝑦𝑔𝑖𝑗 using 𝛽-weight function as follows:
ur
̂𝑔𝑘,𝑡 ) ≤ 𝛿 𝜇̂ 𝑔𝑘,𝛽 ; if 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ∗ 𝑦𝑔𝑖𝑗 ={ ̂𝑔𝑘,𝑡 ) > 𝛿 𝑦𝑔𝑖𝑗 ; if 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽
(9)
Jo
̂𝑔𝑘,𝑡 ) ≤ 𝛿 indicates that ith sample of gth gene for kth condition is contaminated Here, 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ̂𝑔𝑘,𝑡 ) > 𝛿 indicates that it is not contaminated by outliers. by outliers and 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 where 𝛿 is determined by the following equation ̂𝑔𝑘,𝑡 )) + 𝛼 [max (𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ̂𝑔𝑘,𝑡 )) − min (𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ̂𝑔𝑘,𝑡 ))] ∀𝑔, 𝑘 𝛿 = min (𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽
(10)
∗ Then we replace 𝑦𝑔𝑖𝑗 in equation (2) and obtain robust 𝜆̂𝑔𝑖,𝛽 for limma’s standard pipeline.
Fold change (FC) play an important and informative indicator to identify DEGs. In this procedure the log ratio between two conditions is calculated and declares genes as DEGs whose ratio exceed an arbitrary cut-off value. It is also used for gene ranking to select the candidate gene for further analysis. Therefore, in this paper for two groups (k=2), we calculate a score to select the top DEGs, which is the average of ranks between p-values and FC values. This kind of gene
Journal Pre-proof selection and ranking was also used in [30]. We define a score (𝑆𝑔,𝛽 ,) using the rank of p-values (𝑃𝑔,𝛽 , increasing order) and rank of fold change values (𝐹𝐶𝑔,𝛽 , decreasing order) for gth as follows: 𝑆𝑔,𝛽 =
𝑃𝑔,𝛽 +𝐹𝐶𝑔,𝛽
(11)
2
RNA-Seq Count Data
ro
of
Data transformation using voom method
re
-p
Check the existence of outliers using β-divergence method
lP
yes
Apply the limma on original log-cpm data
Calculate p-values and FC values
Jo
ur
na
Apply the limma on updated log-cpm data
no
Select the significant DEGs using proposed method
Obtain the functional annotations and pathway enrichment analysis of the detected DE genes
Figure 1: Work flow of the proposed method for identification of DEGs.
Journal Pre-proof
Performance Evaluation In order to assess the performance of different gene selection methods for binary classification test, we use receiving operating characteristics (ROC) curve and area under the ROC curve (AUC). Let D be the binary class label (D = 0, 1), 𝒙 ∈ 𝑹𝑝 be a feature vector, and 𝑓0 (𝒙), 𝑓1 (𝒙) be the probability density functions for each class. We classify a gene with feature vector x into
𝐷={
1; 𝐹(𝒙) ≥ 𝑇 0; Otherwise
of
one of two classes as follows:
ro
where 𝐹(𝒙) is a score function and 𝑇 is a threshold value. Then, the true positive rate (TPR, or sensitivity) and false positive rate (FPR, or 1-specificity) are defined as
-p
TPR(𝑇) = ∫𝐹(𝒙)≥𝑇 𝑓1 (𝒙) 𝑑𝒙 and FPR(𝑇) = ∫𝐹(𝒙)≥𝑇 𝑓0 (𝒙) 𝑑𝒙
(13)
re
The ROC curve can be described by pairing these functions
From (13) the AUC is written as
lP
ROC = {(FPR(𝑇), TPR(𝑇))|𝑇 ∈ 𝑹} −∞
na
AUC = ∫
TPR(𝑇)dFPR(𝑇)
∞
The AUC close to 1 means that there is a large separation between 𝑓0 (𝒙) and 𝑓1 (𝒙). The integral
ur
of the ROC function over the given range (0, t) is called partial area under the ROC curve
Jo
(pAUC). Based on these two measures, we declare a method as good performer if it produces larger value of AUC and pAUC compare to AUCs and pAUCs of the other methods. Results and Discussion To demonstrate the performance of the proposed method in a comparison with seven popular methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq), we used both simulated and real RNA-seq count datasets. We used five R packages of other methods such as edgeR, DESeq, limma, SAMseq, and baySeq. The performance measures AUC and pAUC were computed for each of the methods using R package ROCR. All R packages are available in the comprehensive R archive network (cran) or bioconductor.
Journal Pre-proof Simulation data We generated simulated RNA-seq data with two (k=2) and multiple (k=4) groups using “simulatedReadCounts” function in TCC R package [26]. This function uses the negative binomial (NB) distribution to generate the read count data. If 𝜇 and ∅ denotes the mean and dispersion parameter of the NB distribution, respectively then the variance (V) can be modelled as V = 𝜇 + 𝜇2 ∅. The NB distribution reduces to the Poisson distribution if over-dispersion
of
parameter ∅ = 0.
ro
Simulation Study 1
To investigate the outlier detection performance of the proposed method with the mean-variance
-p
trend of log-cpm values using 𝛽-weight function, we generated a dataset with gene expression profiles of G=5,000 genes and n = 20 (n1=n2=10) samples using the negative binomial
re
distribution. The number of DEGs were set to 100 (50 up-regulated and 50 down-regulated). We
lP
added the outliers of this dataset by randomly selecting 20% genes and for each of these genes; we selected a single sample randomly and multiplied the observed count of this sample with
na
randomly selected factor between 5 and 10. Figures 2a and figure 2b represent the mean-variance trend of log-cpm values in absence of outliers and presence of outliers, respectively. The red
ur
circles in figure 2b indicate the outliers. Figures 2(c-d) represent the outlier detection performance of the proposed method using 𝛽-weight function. In these figures the smallest 𝛽-
Jo
weights for 5,000 genes are plotted against their gene index. We determined the β-weights for each gene separately for group 1 (n1 = 10 samples) and group 2 (n2 = 10 samples) using equation (9). Then we calculate the smallest β-weight for each gene. The threshold line of the cut-off value δ is indicated by gray line which was computed using (11). In the figures 2(c-d), smallest β-weights with red colour indicates that the genes associated with these weights are outlying genes. The scatter plot of ordered smallest β-weights corresponding their gene index for outlying simulated dataset is shown in figure S1 (additional file 1). From this simulation study we may conclude that the proposed method can detect the outliers properly.
na
lP
re
-p
ro
of
Journal Pre-proof
Jo
Simulation Study 2
ur
Figure 2: Mean-variance relationship. Gene-wise means and variances of simulated RNA-seq data. (a) Mean-variance trend in absence of outliers. (b) Mean-variance trend in presence of outliers. (c) Plot of smallest β-weight plot in absence of outliers for (a). (d) Plot of smallest β-weight in presence of outliers for (b).
To investigate the performance of the proposed method in comparison with seven popular methods as mentioned above, for both small-and large-sample cases with k=2 groups/conditions, we considered 100 datasets for both cases with sample sizes of n1=n2= 3 and 100, respectively. Each dataset for each case represented RNA-seq read count profiles for G=10,000 genes, each with n = (n1+n2) samples, where the read counts of each gene was generated using the negativebinomial distribution. The number of DEGs was set to 200, 400 and 600 for each of the 100 datasets. For these numbers the percentage of DEGs (PDEG) are 2%, 4% and 6%. To show the effect of outliers (extreme high counts) on the methods, we randomly selected 10%, 20% and 50% genes and for each of PDEG, we selected a single sample randomly and multiplied the
Journal Pre-proof observed count of this sample with randomly selected factor between 5 and 10 from uniform distribution. This process was applied for each of the 100 datasets. Then we computed average values of different performance measures such as true positive rate (TPR), false positive rate (FPR), area under the ROC curve (AUC) and partial AUC based on 200, 400, 600 estimated DEGs by eight methods (edgeR, edgeR_robust, DESeq, DESeq2, limma (voom), SAMseq, baySeq and proposed) for each of 100 datasets. Table 1 and table 2 summarizes the average AUCs and pAUCs estimated by eight methods based on 100 simulated datasets with PDEG =2%
(SE)
Performance Measures
20% Outliers
0.868
0.851
0.803
0.803
0.865
(0.009)
(0.013)
(0.016)
(0.016)
(0.009)
0.791
0.796
0.779
0.779
0.824
(0.016)
(0.015)
(0.020)
(0.020)
(0.009)
0.753
0.721
0.791
0.763
0.763
0.830
(0.009)
(0.010)
(0.020)
(0.014)
(0.023)
(0.023)
(0.009)
0.868
0.876
(0.008)
(0.008)
(0.008)
0.809
0.803
0.809
(0.006)
(0.006)
(0.006)
0.766
0.765
(0.010)
-p
ro
0.879
0.625
0.634
0.594
0.618
0.728
0.452
0.452
0.819
(0.012)
(0.011)
(0.010)
(0.017)
(0.016)
(0.025)
(0.025)
(0.011)
Outliers
edgeR
DESeq
DESeq2
SAMseq
baySeq
Proposed
0.055
0.055
0.056
0.050
0.040
0.040
0.050
(0.002)
(0.002)
(0.002)
(0.003)
(0.003)
(0.003)
(0.003)
(0.002)
0.026
0.026
0.010
0.052
0.046
0.038
0.038
0.046
(0.002)
(0.002)
(0.001)
(0.003)
(0.003)
(0.003)
(0.003)
(0.002)
0.020
0.019
0.018
0.043
0.038
0.035
0.035
0.045
(0.002)
(0.002)
(0.002)
(0.003)
(0.003)
(0.002)
(0.002)
(0.001)
0.015
0.019
0.016
0.031
0.034
0.007
0.007
0.039
(0.001)
(0.002)
(0.002)
(0.002)
(0.002)
(0.006)
(0.006)
(0.001)
0.055
Jo
(SE )
Proposed
DESeq2
50% Outliers
Without Outliers
pAUC
baySeq
DESeq
na
AUC
10% Outliers
SAMseq
edgeR (robust)
re
Without Outliers
Limma (voom)
edgeR
lP
Outliers
10% Outliers 20% Outliers 50% Outliers
edgeR
(robust)
ur
Performance Measures
methods for small-sample case
of
Table 1. Performance evaluation using AUCs and pAUCs estimated by different (n1=n2=3).
Limma (voom)
The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.
Journal Pre-proof
Table 2. Performance evaluation using AUCs and (n1=n2=100). Performance edgeR Outliers edgeR Measures (robust) 0.987 0.991 Without Outliers (0.001) (0.001)
(SE )
0.989
(0.002)
(0.001)
SAMseq
baySeq
Proposed
0.990
0.990
0.991
(0.001)
(0.001)
(0.001)
(0.001)
0.914
0.869
0.972
0.977
0.974
0.985
(0.002)
(0.002)
(0.011)
(0.014)
(0.002)
(0.002)
(0.002)
(0.002)
0.927
0.967
0.913
0.893
0.954
0.968
0.967
0.983
(0.005)
(0.002)
(0.011)
(0.015)
(0.004)
(0.004)
(0.004)
(0.002)
0.951
0.955
0.983
(0.009)
(0.012)
(0.002)
SAMseq
baySeq
Proposed
0.892
0.945
0.806
0.732
0.946
(0.010)
(0.004)
(0.013)
(0.014)
of
0.976
50% Outliers Outliers
edgeR
DESeq
DESeq2
10% Outliers 20% Outliers 50% Outliers
(robust)
(0.003)
Limma
ro
edgeR
(voom)
0.097
0.098
0.097
0.097
0.099
0.098
0.098
0.099
(0.0004)
(0.0004)
(0.0004)
(0.0004)
(0.0004)
(0.0004)
(0.0004)
(0.0004)
0.092
0.096
0.087
0.086
0.096
0.092
0.092
0.098
(0.001)
(0.0006)
(0.002)
(0.001)
(0.0005)
(0.002)
(0.002)
(0.0004)
0.088
0.092
0.073
0.081
0.094
0.090
0.090
0.098
(0.002)
(0.0008)
(0.004)
(0.002)
(0.001)
(0.002)
(0.002)
(0.0004)
0.080
0.090
0.070
0.068
0.092
0.089
0.088
0.098
(0.004)
(0.004)
(0.001)
(0.004)
(0.004)
(0.0004)
(0.002)
-p
Without Outliers
pAUC
0.991
Limma (voom) 0.990
re
Performance Measures
20% Outliers
DESeq2
lP
(SE)
DESeq
methods for large-sample case
0.965
na
AUC
10% Outliers
pAUCs estimated by different
(0.001)
ur
The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.
Jo
in absence and presence of single outlier in each of 10%, 20% and 50% genes for both smallsample cases (n1=n2=3) and large-sample cases (n1=n2=100), respectively. The genes are considered as DEGs if they satisfy adjusted p-values < 0.05 for each of the methods. The Benjamin-Hochberg (BH) method wad used to adjust the p-values.
In this table the results
within the brackets (·) indicates the estimated standard error (SE) for respective AUCs and pAUCs. From table 1, we observed that in absence of outliers, six methods (edgeR, edgeR_robust, DESeq, DESeq2, limma (voom) and proposed) performed almost similar, for small-sample case. But in presence of outliers in this case the proposed method outperforms other seven methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq). For example, in presence of single outlier with each of 20% genes, the proposed
Journal Pre-proof method produces AUC = 0.830, which is larger than 0.766, 0.765, 0.753, 0.721, 0.791, 0.763 and
Jo
ur
na
lP
re
-p
ro
of
0.763, which are produced by edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq
Figure 3: Performance evaluation using boxplot of AUCs estimated by different methods associated with varying proportions of DEGs. The panels A, B and C represent the boxplot of AUC values estimated by eight methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq, baySeq and proposed)
Journal Pre-proof for small-sample case in absence and presence of one outlier in each of 10%, 20%, and 50% genes. 100 simulations were performed to obtain these results.
and baySeq. Results of table 2 also bring out the similar interpretation like table 1. In this case, for 50% outliers SAMseq and baySeq performed slightly better than edgeR, edgeR_robust, DESeq, DESeq2 and Limma (voom), probably because they are non-parametric and nonparametric methods are robust against outliers for large-sample case. Figure 3 and figure S2 shows the boxplots of AUCs based on 100 simulated datasets by each of the methods in
of
absence and presence of outliers for small-and large-sample cases, respectively. In these figures the panels A, B and C represents the boxplots of AUC values using P DEG = 2%, 4% and 6%,
ro
respectively. Similar results are found from these boxplots for every percentage of DEGs like table 1 and table 2. Therefore, from this simulation study we may conclude that the proposed
-p
method outperforms the other seven methods for both small-sample case, in presence of outliers
re
For the large-sample case, in presence of 10% and 20% outliers edgeR_robust, limma(voom), SAMseq, baySeq and proposed method produces similar kind of results (see Table 2). However,
lP
the proposed method produced slightly better performance than the other methods in this case. But in presence of 50% outliers, the proposed method outperforms other seven methodsand it
ur
Simulation Study 3
na
keeps almost equal performance with the other methods, in absence of outliers.
Jo
To demonstrate the performance of the proposed method in a comparison of other seven popular methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq) for k > 2 conditions with multiple patterns, we generated 100 datasets for both small (n1=n2=n3=n4=3)and large-sample (n1=n2=n3=n4=100) cases from negative binomial distribution using “simulatedReadCounts” function of TCC R package. Each dataset represents the RNA-seq read counts for G=10,000 genes with n = (n1+n2+n3+n4) samples. We set PDEG=2% and two different proportions of DEGs for the individual groups: (i) (P G1, PG2, PG3, PG4) = (1/4, 1/4, 1/4, 1/4) for unbiased proportion and (ii) (PG1, PG2, PG3, PG4) = (0.7, 0.1, 0.1, 0.1) for biased proportion for each of the datasets. The outliers are added in the datasets as before. We first assess the performance of the proposed method in a comparison of the other methods for detection of DEGs using various simulation conditions as mentioned above and estimate average AUCs and
Journal Pre-proof pAUCs by these methods. Table 3 and table 4 lists the average AUCs and pAUCs estimated by different methods based on 100 datasets in absence and presence of outliers for small-and largesample cases, respectively using unbiased proportion of DEGs. In these tables the results within the brackets (·) indicate the estimated standard error (SE) for respective AUCs and pAUCs. We observed similar AUCs and pAUCs from both tables, in absence of outliers. However, in presence of outliers, the proposed method outperforms other methods, for small-sample case by producing higher AUCs and pAUCs (see table 3 bold texts). For instance, in presence of 20% outliers, the proposed method produces AUC = 0.781, which is larger than 0.705, 0.702, 0.625,
of
0.603, 0.723, 0.715 and 0.714, that are produced by edgeR, edgeR_robust, DESeq, DESeq2,
ro
Limma (voom), SAMseq and baySeq. On the other hand, for large-sample case, six methods (edgeR, edgeR_robust, limma (voom), SAMSeq, bayseq and proposed method) performed well
-p
compared to DESeq and DESeq2, in presence of outliers (see table 4). In the case of 20% and 50% outliers, the proposed method performed slightly better compare to the other methods. For
re
both cases the Benjamini-Hochberg (BH) method was used to adjust the p-values for each of the
lP
methods. However, the performance of all the methods worsened for biased proportion of DEGs
na
both for small-and-large sample cases (see table A1 and table A2 in additional file 2). For
AUC (SE)
Performance Measure
pAUC (SE )
10% Outliers 20% Outliers
Jo
ur
Table 3. Performance evaluation using AUCs and pAUCs estimated by different methods for small-sample case (n1=n2= n3=n4=3) with k > 2 groups for unbiased proportion of DEGs in the groups. Performance edgeR Limma Outliers edgeR DESeq DESeq2 SAMseq baySeq Proposed Measure (robust) (voom) 0.882 0.875 0.872 0.842 0.835 0.858 0.853 0.854 Without Outliers (0.009) (0.009) (0.009) (0.012) (0.010) (0.008) (0.008) (0.010) 0.779
0.780
0.671
0.622
0.707
0.778
0.787
0.810
(0.005)
(0.006)
(0.012)
(0.012)
(0.013)
(0.006)
(0.006)
(0.010)
0.705
0.702
0.625
0.603
0.723
0.715
0.714
0.781
(0.008)
(0.008)
(0.010)
(0.010)
(0.012)
(0.009)
(0.010)
(0.012)
50% Outliers
0.611
0.611
0.547
0.569
0.692
0.613
0.641
0.788
(0.011)
(0.011)
(0.019)
(0.018)
(0.013)
(0.011)
(0.011)
(0.014)
Outliers
edgeR
DESeq
DESeq2
SAMseq
baySeq
Proposed
Without Outliers 10% Outliers
edgeR (robust)
Limma (voom)
0.056
0.055
0.051
0.055
0.053
0.054
0.053
0.050
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
0.011
0.012
0.009
0.043
0.039
0.010
0.010
0.043
(0.001)
(0.001)
(0.001)
(0.002)
(0.003)
(0.001)
(0.001)
(0.002)
Journal Pre-proof 20% Outliers 50% Outliers
0.008
0.009
0.006
0.037
0.032
0.007
0.007
0.037
(0.001)
(0.001)
(0.001)
(0.002)
(0.002)
(0.001)
(0.001)
(0.001)
0.006
0.007
0.005
0.030
0.029
0.005
0.006
0.037
(0.001)
(0.001)
(0.001)
(0.002)
(0.002)
(0.001)
(0.001)
(0.001)
-p
ro
of
The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.
AUC (SE)
10% Outliers 20% Outliers 50% Outliers
Performance Measure
Outliers Without Outliers
pAUC (SE )
10% Outliers 20% Outliers 50% Outliers
(0.002) 0.924
0.864
0.984
(0.001)
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
0.956
0.658
0.763
0.963
0.951
0.957
0.978
(0.003)
(0.007)
(0.008)
(0.003)
(0.003)
(0.003)
(0.002)
0.919
0.578
0.697
0.932
0.944
0.838
0.978
lP
0.987
(0.004)
0.985
Limma (voom) 0.987
DESeq
na
Without Outliers
edgeR (robust) 0.989
edgeR
ur
Outliers
DESeq2
SAMseq
baySeq
Proposed
0.988
0.980
0.988
(0.004)
(0.003)
(0.010)
(0.010)
(0.004)
(0.004)
(0.006)
(0.002)
0.688
0.792
0.500
0.624
0.862
0.864
0.867
0.976
(0.008)
(0.017)
(0.013)
(0.008)
(0.010)
(0.010)
(0.003)
DESeq
DESeq2
SAMseq
baySeq
Proposed
Jo
Performance Measure
re
Table 4. Performance evaluation using AUCs and pAUCs estimated by different methods for large-sample case (n1=n2= n3=n4=100) with k > 2 groups for unbiased proportion of DEGs in the groups.
(0.010) edgeR
edgeR (robust)
Limma (voom)
0.098
0.098
0.098
0.097
0.099
0.098
0.098
0.099
(0.0003)
(0.0003)
(0.0003)
(0.0003)
(0.0003)
(0.0003)
(0.0003)
(0.0003)
0.032
0.055
0.051
0.046
0.071
0.043
0.044
0.095
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
(0.002)
(0.001)
0.017
0.029
0.044
0.022
0.052
0.025
0.027
0.094
(0.001)
(0.002)
(0.001)
(0.002)
(0.001)
(0.002)
(0.002)
(0.001)
0.007
0.017
0.008
0.013
0.037
0.006
0.006
0.093
(0.001)
(0.002)
(0.001)
(0.002)
(0.003)
(0.002)
(0.002)
(0.001)
Journal Pre-proof The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.
example, AUC values for edgeR under unbiased (1/4, 1/4, 1/4, 1/4) and biased (0.7, 0.1, 0.1, 0.1) proportions decreased from 0.882 to 0.865, for small-sample case, in absence of outliers. Figure 4 represents the boxplots of AUCs estimated by eight methods based on 200 DEGs in absence and presence of outliers. In this figure the panels A and B represents the boxplots of AUCs for small-and-large sample cases, respectively. This figure also supports the results of table 3 and table 4. Therefore, from this simulation study, we may conclude that the proposed
of
method outperforms other methods in presence of outliers and in absence of outliers it keeps
Jo
ur
na
lP
re
-p
ro
equal performance with other methods.
na
lP
re
-p
ro
of
Journal Pre-proof
Jo
ur
Figure 4: Performance evaluation using boxplot of AUCs estimated by different methods. The panels A and B represent the boxplot of AUCs estimated by eight methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq, baySeq and proposed) for small-and large-sample cases respectively in absence and presence of one outlier in each of 10%, 20%, and 50% genes. 100 simulations were performed to obtain these results.
Real RNA-Seq Data Study To investigate the performance of the proposed method in a comparison of the seven popular methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq) in the real RNA-Seq count data we used two datasets namely mouse dataset and Nigerian dataset. Mouse Dataset We used the RNA-Seq mouse dataset which was used in the study [23]. This dataset consist of 36535 genes with 21 samples. This dataset was downloaded from ReCount website http://bowtie-bio.sourceforge.net/recount. Among 21 samples, RNA-Seq count expression collected from 10 C57BL/6J (B6) and 11 DBA/2J (D2) inbred mouse strains. After filtering we
Journal Pre-proof retain 11,474 genes. To investigate the performance of the proposed method for detection of DEGs between the two mouse strains, we considered two latest RNA-Seq methods (edgeR_robust and DESeq2). Figure 5a represents the Venn diagram of top 100 genes estimated by edgeR_robust, DESeq2 and proposed method. From this Venn diagram, we clearly observe
Jo
ur
na
lP
re
-p
ro
of
that there are 82 mutual DEGs detected by these three methods. We also notice that there are 8, 9
Figure 5: Venn diagram and outlier gene expression profile for mouse dataset. (a) Venn diagram of the DE genes estimated by all three methods (edgeR_robust, DEseq2 and Proposed). (b) Scatter plot of the smallest β-weight for each of the 11,474 genes vs. the gene index. (c) Frequency distributions of βweights for each expression of the 11,474 genes.
Journal Pre-proof and 5 genes independently detected by edgeR_robust, DESeq2 and proposed method, respectively. The DEGs were selected using adjusted p-values<0.05 for each of the methods. The p-values were adjusted by the Benjamin-Hochberg (BH) method. Among the 5 DEGs detected by the proposed method 3 genes are declared as outlying genes by the β-weight function (9) of the proposed method. Figure 5b displays smallest β-weights for each of 11,474 genes. In this figure the red circle indicate the outlying gene. We identified 7% outlying genes from this dataset. The threshold line of the cut-off value δ=0.1 is indicated by gray color which was computed using (11). The frequency distribution of β-weights for each expression of the 11,474
of
genes is shown in figure 5c. The scatter plot of ordered smallest β-weights corresponding their
ro
gene index is shown in figure S3. The MA-plot of this dataset after voom transformation is shown in figure S4. In this figure the green circle, blue triangle and red square indicate the 8, 9
-p
and 5 DEGs identified by edgeR_robust, DESeq2 and proposed method, respectively. Then we explored the biological functions and KEGG pathways associated with these three genes using
re
WebGestalt software [24]. The gene ontology (GO) database identified 2 genes out of 3 genes.
lP
The GO results showed that these two outlying genes (Apoa2, Gpat2) are significantly enriched in different biological processes like metabolic process, neural lipid metabolic process, glycerophospholipid biosynthetic process, negative regulation of cytokine secretion involved in
na
immune respone and so on (see file A1(xls) in additional file 2). Furthermore, we revealed that these genes are involved with different molecular functions such as phosphatidylcholine-sterol
ur
O-acyltransferase activator activity, high density lipoprotein particle binding, apolipoprotein
Jo
receptor binding, lipase inhibitor activity and so on (see file A2(xls)). The KEGG pathway enrichment analysis revealed that these 2 outlying genes significantly enriched glycerolipid metabolism, PPAR signaling pathway, glycerophospholipid metabolism and metabolic pathway (see table 5). The functional interaction network of these 2 genes are shown in figure 6 and analyzed via Cytoscape using GeneMANIA plug-in [29].
Journal Pre-proof
Table 5. KEGG pathway for 2 outlying genes identified by proposed method only for mouse data KEGG ID KEGG pathway name Name of gene P-value 1.47e-02 mmu00561 Glycerolipid metabolism Gpat2 mmu03320
PPAR signaling pathway
Apoa2
mmu00564
Glycerophospholipid metabolism
Gpat2
mmu01100
Metabolic pathways
Gpat2
2.12e-02 2.34e-02
Jo
ur
na
lP
re
-p
ro
of
3.01e-01
Figure 6: Functional interactions of the two (2) genes detected by the proposed method.
Journal Pre-proof
Nigerian Data We used Nigerian individuals RNA-Seq data, which is the part of international HapMap project and also published in [27]. This dataset consists of lymphoblastoid cell lines from 29 males and 40 female unrelated Nigerian individuals. Illumina Genome Analyzer II was used to sequence the samples. The Read counts were summarized by Ensembl gene identifiers and gene annotation was performed by bioconductor tweeDEseqCountData package [28]. The number of gene of this dataset is 38,415. The genes (17,310) with least 1 count-per-million reads (cpm) in at least
of
20 samples were used for further analysis. The goal of this data analysis is to identify the DEGs
ro
between male and female. Figure 7a shows the mean-variance relationship of this dataset. This figure suggests the greater biological variation of the log-cpm values for unrelated human
-p
individuals. We first calculate the 69 β-weights corresponding to the 69 samples for each gene. Then we determine the smallest value (minimum) of 69 β-weights for each gene. Figure 7b
re
represents the scatter plot of these smallest β-weights for each of the 17310 genes against the
lP
gene index. The red circles below gray line represent outliers. Figure 7c represents the Venn diagram of top 50 genes identified by limma (voom), DEseq2, edgeR_robust and proposed
na
method. We can clearly observe from this Venn diagram that our proposed method shares more genes with other three methods. We found 39 mutual genes between these methods. The
ur
proposed method shared 3 genes with the DEseq2 and edgeR methods, which were not detected by the limma (voom). To show the influence of outliers for detection of DEGs, we corrupted 39
Jo
mutual DEGs by outliers. To generate the outlying Read counts for 39 genes, we selected a single sample randomly and multiplied the observed count of this sample with randomly selected factor between 5 and 10. We call here the 39 mutual genes as valid gene-set. Then we again applied three methods (limma (voom), edgeR_robust and proposed) to identify the DEGs. The Venn diagram in figure 7c represents the top 50 genes detected by these methods with 39 common DEGs, in presence of outliers. This figure revealed that the proposed method performs better by sharing more genes with the valid 39 DEGs and the other methods. For example, the proposed method shared maximum number of genes (5 genes) with valid DEGs, which were not shared by the other two methods (limma (voom),, edgeR_robust). We validate these 5 genes using gene ontology (GO) and pathway enrichment analysis using WebGestalt software [24].
Journal Pre-proof GO mapped 4 genes out of 5 genes and these genes are enriched in several biological processes
re
-p
ro
of
such as histone H2A acetylation, neuron cell-cell adhesion, vocalization behavior, histone H4-
(b)
Jo
ur
na
lP
(a)
(c)
(d)
Figure 7. Mean-variance relationship and comparison of the top DE genes selected by different methods for Nigerian Dataset. (a) Mean-variance trend. (b) Scatter plot of the smallest β-weight for each of the 17,310 genes vs. the gene index. (c) Venn diagram of top 50 genes estimated by limma(voom), DEseq2, edgeR_robust and Proposed method (d Venn diagram of top 50 genes estimated by limma(voom), edgeR_robust and Proposed method with valid gene-set
K16 acetylation, social behavior and so on (see file A3 (xls)). Among them gene MSL3 is located in X chromosome and genes (NLGN4Y, TTTY15, USP9Y) are located in the Y
Journal Pre-proof chromosome. In the previous study [19], limma (voom) identified NLGN4Y and TTTY15 in the top 16 DEGs. However, when we contaminated these genes by outlier, it fails to detect these genes in the top 50 genes. The pathway enrichment analysis revealed that NLGN4Y gene significantly enriched in Cell adhesion molecules (CAMs) pathway (see table 6). We also revealed that gene MSL3 and USP9Y significantly enriched in two miroRNA target (see table A3). Table 6. KEGG pathways DE genes identified by the proposed method only Name of gene
p-value
NLGN4Y
2.02e-02
of
hsa04514
KEGG pathway description Cell adhesion molecules (CAMs)
ro
KEGG ID
-p
The hypergeometric test is used to calculate the p-values and adjusted by Benjamini-Hochberg method for multiple testing corrections.
re
Conclusion
lP
Detection of differentially expressed genes (DEGs) under two or more conditions from RNA-Seq data has been become popular day by day with many advantages of next generation sequencing
na
(NGS). Typically, the RNA-seq technologies provide the read count data by direct sequencing of transcript. A number of methods have been developed for detecting the DEGs genes from RNA-
ur
seq data. Most of these methods are based on either Poisson distribution or negative binomial (NB) distribution. However, identification of DEGs based on read count data using skewed
Jo
distribution is very complex, in presence of outliers or extreme values. Transformation of RNASeq data is an alternative approach to tackle the outlier’s issues. There are several transformation methods for RNA-seq data, among them voom using limma pipeline has been widely used. However, limma by voom is sensitive to outliers for small-sample case and produces misleading results in presence of outliers. Therefore, in this paper, we robustify the voom approach using the minimum β-divergence method. We demonstrated the performance of the proposed method in a comparison of some popular DEGs selection methods (DEseq, DEseq2, SAMseq, Bayseq, limma (voom), edgeR and edgeR_robust) using both simulated and real dataset. Both types of data analysis results showed that the performance of the proposed method improved over the
Journal Pre-proof competing methods, in presence of outliers and in absence of outliers it keeps almost equal performance with these methods.
Abbreviations DEGs, differentially expressed genes; FC, fold change; NB, negative bionomial; NGS, next generation sequencing; log-cpm, log-counts per million; LIMMA, linear models for microarray data; AUC, area under curve; pAUC, partial area under curve; VST, variance-stabilizing transformation.
of
Availability of data and material
ro
The R-codes of the proposed algorithm can be found in the additional file 3.
-p
Author’s contributions
lP
re
Conceived and designed the experiments: MSJ MNHM. Performed the experiments: MSJ. Analyzed the data: MSJ. Contributed reagents/materials/analysis tools: MRR,SI. Wrote the paper: MSJ MNHM. Finalized the manuscript: all Authors. Edited and approved the final version of the manuscript: All Authors.
na
Competing interests
The authors declare that they have no competing interests.
ur
Consent for publication
Jo
Not applicable.
Acknowledgements Not applicable.
References [1]. Mortazavi A, Williams B, McCue K, Schaeffer L and Wold B.(2008): Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Meth. 5, 621-628.
Journal Pre-proof [2]. Beyer M, Mallmann MR, Xue J, Staratschek-Jox A, Vorholt D, Krebs W, Sommer D, Sander J, Mertens C, Nino-Castro A, Schmidt SV, Schultze JL.(2012): High-Resolution Transcriptome of Human Macrophages. PLoS ONE, 7, e45466. [3]. Wang Z, Gerstein M, Snyder M.(2009): RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet.,10, 57–63. [4]. Bullard JH, Purdom E, Hansen KD, Dudoit S.(2010): Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. [5]. Wang L, Feng Z, Wang X, Zhang X.(2010): DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics, 26, 136-138.
of
[6]. Nagalakshmi U, Waern K, Snyder M.(2010): RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol, 4,1-13.
ro
[7]. Robinson MD, McCarthy DJ, Smyth GK.(2010): EdgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139-140.
-p
[8]. Anders S, Huber W.(2010): Differential expression analysis for sequence count data. Genome Biol, 11,R106.
re
[9]. Hardcastle TJ, Kelly KA.(2010): baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics, 11,422.
lP
[10]. Love MI, Huber W, Anders S.(2014): Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.
na
[11]. Leng N, Dawson J, Thomson J, Ruotti V, Rissman A, Smits B, Haag J, Gould M, Stewart R, Kendziorski C.(2012): EBSeq: an empirical bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics, 29, 1035-1043..
ur
[12]. Zhou X, Lindsay H, Robinson MD.(2014): Robustly detecting ifferential expression in RNA sequencing data using observation weights. Nucleic Acids Res, 42, e91.
Jo
[13]. Di Y, Schafer DW, Cumbie JS, Chang JH.(2011): The NBP negative binomial model for assessing differential gene expression from RNA-seq. Stat Appl Genet Mol Biol, 10, 1-18. [14]. Li J, Tibshirani R. (2013): Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data. Stat Methods Med Res, 22, 519-536. [15]. Robinson MD, Smyth GK.(2008): Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332. [16]. George NI, Bowyer JF, Crabtree NM, Chang CW.(2015): An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data. PLoS ONE, 10, e0125224. [17]. Zwiener I, Frisch B, Binder H.(2014): Transforming RNA-seq data to improve the performance of prognostic gene signatures. PloS One,.9, e85150. [18]. Robinson MD, Oshlack A.(2010): A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol, 11, R25.
Journal Pre-proof [19]. Law W, Chen Y, Shi W, Smyth GK.(2014): voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29. [20]. Mollah MNH, Minami M, Eguchi S.(2007): Robust prewhitening for ICA by minimizing βdivergence and its application to FastICA, Neural Processing Letters, 25, 91-110. [21]. Mollah MNH, Sultana N, Minami M, Eguchi S.(2010): Robust Extraction of Local Structures by the Minimum β-Divergence method, Neural Network, 23, 226-238. [22]. Shahjaman M, Kumar N, Mollah MMH, Ahmed MS, Begum AA, Islam SMS, Mollah MNH.(2017): Robust Significance Analysis of Microarrays by Minimum β-Divergence Method. Biomed Research International, 2017, 1-18.
of
[23]. Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ,(2011): Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One, 6, e17820.
ro
[24]. Zhang B, Kirov S, Snoddy J.(2005): WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Research, 33, W741–W748.
-p
[25]. Smyth GK. (2004): Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Stat Appl Genet Mol Biol, 3:3.
re
[26]. Mollah MMH, Mollah MNH, Kishino H.(2012): β-empirical Bayes inference and model
lP
diagnosis of microarray data. BMC Bioinformatics, 13,135.
na
[27]. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK.(2010): Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 464, 768-772.
ur
[28]. Esnaola M, Puig P, Gonzalez D, Castelo R, Gonzalez JR.(2013): A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. BMC Bioinformatics. 14: 254-10.
Jo
[29]. Shannon P, Markiel A, Ozier O. (2003): Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13: 2498–504. [30]. Shahjaman, M.; Rahman, M.R.; Islam, S.M.S.; Mollah, M.N.H. A Robust Approach for Identification of Cancer Biomarkers and Candidate Drugs. Medicina 2019, 55, 269.
Journal Pre-proof
Jo
ur
na
lP
re
-p
ro
of
Highlights The proposed robust voom approach showed better performance than competing methods in both simulated and real data analysis, in presence of outliers The proposed method identified two outlying genes (Apoa2, Gpat2) from mouse dataset, are significantly enriched in different biological processes like metabolic process, neural lipid metabolic process, glycerophospholipid biosynthetic process, negative regulation of cytokine secretion involved in immune respone and so on. The proposed method identified outlying genes (MSL3 ,NLGN4Y, TTTY15, USP9Y) from Nigerian dataset, significantly enriched in Cell adhesion molecules (CAMs) pathway miRNAs (mir-212,mir-216,mir-132) were identified.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7