Robust identification of differentially expressed genes from RNA-seq data

Robust identification of differentially expressed genes from RNA-seq data

Journal Pre-proof Robust identification of differentially expressed genes from RNAseq data Md. Shahjaman, Md. Manir Hossain Mollah, Md. Rezanur Rahma...

16MB Sizes 1 Downloads 81 Views

Journal Pre-proof Robust identification of differentially expressed genes from RNAseq data

Md. Shahjaman, Md. Manir Hossain Mollah, Md. Rezanur Rahman, S.M. Shahinul Islam, Md. Nurul Haque Mollah PII:

S0888-7543(19)30575-0

DOI:

https://doi.org/10.1016/j.ygeno.2019.11.012

Reference:

YGENO 9405

To appear in:

Genomics

Received date:

21 August 2019

Revised date:

27 October 2019

Accepted date:

18 November 2019

Please cite this article as: M. Shahjaman, M. Manir Hossain Mollah, M. Rezanur Rahman, et al., Robust identification of differentially expressed genes from RNA-seq data, Genomics (2018), https://doi.org/10.1016/j.ygeno.2019.11.012

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2018 Published by Elsevier.

Journal Pre-proof

Robust Identification of Differentially Expressed Genes from RNA-seq Data Md. Shahjaman1*, Md.Manir Hossain Mollah 2, Md. Rezanur Rahman3, S. M. Shahinul Islam4, and Md. Nurul Haque Mollah5 1

-p

Abstract

ro

[email protected]

of

Department of Statistics, Begum Rokeya University, Rangpur-5400, Bangladesh; [email protected] 2 Department of Bioinformatics and Public Health, Asian University for Women, Chittagong, Bangladesh 3 Department of Biochemistry and Biotechnology, School of Biomedical Science, Khwaja Yunus Ali University, Sirajgonj, Bangladesh; [email protected] 4 Institutitute of Biological Science (IBSc), University of Rajshahi, Rajshahi-6205, Bangladesh; [email protected] 5 Bioinformatics Lab., Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh;

re

Background: Identification of differentially expressed genes (DEGs) under two or more experimental conditions is an important task for elucidating the molecular basis of phenotypic

lP

variation. In the recent years, next generation sequencing (RNA-seq) has become very attractive and competitive alternative to the microarrays because of reducing the cost of sequencing and

na

limitations of microarrays. A number of methods have been developed for detecting the DEGs from RNA-seq data. Most of these methods are based on either Poisson distribution or negative

ur

binomial (NB) distribution. However, identification of DEGs based on read count data using

Jo

skewed distribution is inflexible and complicated of in presence of outliers or extreme values. Results: Most of the existing DEGs selection methods produce lower accuracies and higher false discoveries in presence of outliers. There are some robust approaches such as edgeR_robust and DEseq2 perform well in presence of outliers for large sample case. But they show weak performance for small-sample case, in presence of outliers. To address this issues an alternative approach has emerged by transforming the RNA-seq data into microarray like data. Among various transformation methods voom using limma pipeline is proven better for RNA-seq data. However, limma by voom transformation is sensitive to outliers for small-sample case. Therefore, in this paper, we robustify the voom approach using the minimum β-divergence method. We demonstrate the performance of the proposed method in a comparison of seven popular biomarkers selection methods: DEseq, DEseq2, SAMseq, Bayseq, limma (voom), edgeR

Journal Pre-proof and edgeR_robust using both simulated and real dataset. Both types of experimental results show that the performance of the proposed method improve over the competing methods, in presence of outliers and in absence of outliers it keeps almost equal performance with these methods. Conclusion: We observe the improved performance of the proposed method from simulation and real RNA-seq count data analysis for both small-and large-sample cases, in presence of outliers. Therefore, our proposal is to use the proposed method instead of existing methods to obtain the better performance for selecting the DEGs.

ro

of

Keywords: RNA-sequence data, log-cpm, DEGs, β-weight function and Robustness.

Background

-p

Identification of differentially expressed genes (DEGs) under two or more condition is a

re

fundamental objective of the transcriptomics data analysis. These DEGs play the important roles for the development of agriculture, health and environment. During the past decades microarrays

lP

have been widely used to perform this task. But after the invention of next generation sequencing (NGS) technology RNA-seq has emerged an attractive and alternative choice of microarrays [1-

na

2]. Both microarrays and RNA-seq uses the high-throughput technology to produce a vast amount of data. Thus curse of dimensionality often occurs here. It means "large p, small n"

ur

problem. Hence, dimension reduction of the data matrix is a primary objective for further

approaches.

Jo

downstream analysis. Identification of biomarkers is one of the dimensionality reduction

Array-based technology measures intensities using continuous distributions. In RNA-seq, cDNA (which is derived from RNA) are sequenced to obtain the millions of short reads. After that these short reads are mapped with its reference genome. The number of sequence reads mapped is then calculated to get the RNA-seq count data matrix. There are many advantages of RNA-seq over the microarrays: it (1) produces count data with low background noise that allow researcher to detect the transcript at low expression levels, (2) has ability to detect unknown or novel transcripts, allele-specific expression and alternative splicing isoforms, (3) does not require prior probe selection thus it avoids the related biases which usually occur during the hybridization of microarrays [3] and so on.

Journal Pre-proof

With these advantages of RNA-seq, identification of DEGs from count data has been increasing rapidly because of reducing the cost of sequencing. A number of methods have been developed to identify the DEGs from RNA-seq count data. Most of the methods use either Poisson or negative-binomial (NB) distribution to model the count data with variety of normalization procedures [4]. DEGseq based on Poisson distribution uses the Fisher’s exact test and likelihood ratio test. But it suffers from overdispersion issues and thus it is not suitable for count data analysis [5-8]. To mitigate these issues, methods based on NB distribution have been applied in

of

RNA-seq data. Among them baySeq, DESeq, DESeq2, EBSeq, edgeR, edgeR (robust) and NBPSeq have been extensively used with their respective R packages [9-13]. SAMSeq is a non-

ro

parametric method which does not assume any particular distribution for the data [14]. However,

-p

most of methods suffer from the small sample sizes in RNA-seq experiments and they cannot estimate properly the gene-wise dispersion parameters to share the information of all genes in the

re

dataset [15].

lP

Unfortunately, the aforementioned methods based on discrete distribution are inflexible and complicated in presence of extreme read counts (or outliers). Outliers may arise in RNA-seq

na

count data because there are several data generating stages from biological harvesting of RNA to counting of sequence read map [16]. Consequently, an alternative approach has emerged to

ur

address the above issue by transforming the RNA-seq count data and then apply the normalbased microarray-like statistical methods. A numerous transformation methods have been

Jo

available for RNA-seq data which includes logarithmic transformation [17], variance-stabilizing transformation (vst) [8], TMM transformation [18], regularized logarithm [10] and variance modeling at the observation level (voom) [19]. Among various transformation methods voom using limma pipeline is proven better for RNA-seq data. voom estimates the mean-variance relationship of the log-cpm values to calculate the weight for each observation and input these into the limma’s empirical Bayes pipeline for differential analysis. But, it is sensitive to outliers for small-sample case and produces misleading results in presence of outliers. Therefore, in this paper, we robusify voom approach by minimum β-divergence method. The next section will be discussed about formulation of the proposed robust voom approach. A broad simulation and real RNA-seq data analysis are carried out in the results and discussion section.

Journal Pre-proof

Materials and Methods voom transformation Suppose the RNA-seq read count data matrix is denoted by ygik, where row represents the genes and column represents the samples (g=1,2,…, G; i=1,2,…,nk;k=1,2,…,m;). voom uses log-counts per million (log-cpm) to transform the read count data matrix as follows: 𝑦𝑔𝑖𝑘 +0.5 𝑅𝑖𝑘 +1

× 106 )

(1)

of

𝑥𝑔𝑖𝑘 = log2 (

where, 𝑅𝑖𝑘 denotes the total number of mapped reads for sample i for kth condition, 𝑅𝑖𝑘 =

ro

∑𝐺𝑔=1 𝑥𝑔𝑖𝑘 . This transform data usually follow the normal distribution, which depends on the

-p

degree of skewness before transformation. A more detail about the voom can be found in [19]. However, the log-cpm values use in voom approach is sensitive to outlying observations. Though

re

logarithmic transformation reduces the influence of low level extreme/outlying observations into

lP

the reasonable spaces during parameter estimations, however this transformation fails to reduce the influence of high level extreme/outlying observations. Therefore, Limma produces misleading

na

results in some cases. To overcome this problem, in this paper, we are proposing robust voom transformation for RNA-seq data.

ur

Robust voom transformation by β-divergence method

Jo

In the second step, the following linear model is fitted using the log-cpm values: 𝐸( 𝑦𝑔𝑖 ) = 𝜇𝑔𝑖 = 𝑥𝑖𝑇 𝛽𝑔

(3)

where, 𝑥𝑖 is a vector of covariates and 𝛽𝑔 is a vector of unknown coefficients representing log2fold-changes between experimental conditions. Using ordinary least squares method the above model is fitted by estimating 𝛽𝑔 : 𝜇̂ 𝑔𝑖 = 𝑥𝑖𝑇 𝛽̂𝑔

(4)

At the last step, the fitted log-cpm values 𝜇̂ 𝑔𝑖 are converted to fitted count by 𝜆̂𝑔𝑖 = 𝜇̂ 𝑔𝑖 + log2 (𝑅𝑖 + 1) − log2 (106 )

(5)

The voom precision weights are then calculated as the inverse variances 𝑤𝑔𝑖 = lo(𝜆̂𝑔𝑖 )−4 , where lo is a piecewise linear function defined by the locally weighted regression curve[19]. Then the

Journal Pre-proof log-cpm values 𝑦𝑔𝑖 and associated weights 𝑤𝑔𝑖 are input into limma’s well-established microarray analysis pipeline to the differential expression. However, 𝜆̂𝑔𝑖 using equation (2-4) is not robust against outliers if RNA-seq data matrix 𝑟𝑔𝑖 contain the outlying counts. Therefore, in this paper, an attempt is made to robustify the voom ̂𝑔𝑘,𝛽 = approach by minimum 𝛽-divergence method. The minimum 𝛽-divergence estimators 𝜽 2 2 (𝜇̂ 𝑔𝑘,𝛽 , 𝜎̂𝑔𝑘,𝛽 ) of the parameters 𝜽𝑔𝑘 = (𝜇𝑔𝑘 , 𝜎𝑔𝑘 ) are computed iteratively as follows: 𝑛

𝜇𝑔𝑘,𝑡+1 =

𝑘 𝜑 (𝑦 ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 ) 𝑦𝑔𝑖𝑗

(6)

𝑛

𝑘 𝜑 (𝑦 ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 )

of

and 𝑛

𝑛

𝑘 𝜑 (𝑦 ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 )

(7)

ro

=

𝑘 𝜑 (𝑦 2 (𝛽+1) ∑𝑗=1 𝛽 𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 ) (𝑦𝑔𝑖𝑗 −𝜇𝑔𝑘,𝑡 )

where, 𝛽

𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽𝑔𝑘,𝑡 ) = exp {− 2𝜎2 (𝑦𝑔𝑖𝑗 − 𝜇𝑔𝑘,𝑡 )2 }

(8)

re

𝑔𝑘

-p

2 𝜎𝑔𝑘,𝑡+1

na

lP

which is known as 𝛽-weight function. The value of this 𝛽-weight function lies between 0 to 1. It produces larger weights corresponding to the usual gene expression and smaller weights corresponding to the unusual/outlying gene expression [20-21]. Therefore, in this paper we employed this weight function as a measure outlier detection with 𝛽 = 0.2, which was also used ∗ in the study [22]. We update the outlying log-cpm values of 𝑦𝑔𝑖𝑗 as 𝑦𝑔𝑖𝑗 using 𝛽-weight function as follows:

ur

̂𝑔𝑘,𝑡 ) ≤ 𝛿 𝜇̂ 𝑔𝑘,𝛽 ; if 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ∗ 𝑦𝑔𝑖𝑗 ={ ̂𝑔𝑘,𝑡 ) > 𝛿 𝑦𝑔𝑖𝑗 ; if 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽

(9)

Jo

̂𝑔𝑘,𝑡 ) ≤ 𝛿 indicates that ith sample of gth gene for kth condition is contaminated Here, 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ̂𝑔𝑘,𝑡 ) > 𝛿 indicates that it is not contaminated by outliers. by outliers and 𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 where 𝛿 is determined by the following equation ̂𝑔𝑘,𝑡 )) + 𝛼 [max (𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ̂𝑔𝑘,𝑡 )) − min (𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽 ̂𝑔𝑘,𝑡 ))] ∀𝑔, 𝑘 𝛿 = min (𝜑𝛽 ( 𝑦𝑔𝑖𝑗 |𝜽

(10)

∗ Then we replace 𝑦𝑔𝑖𝑗 in equation (2) and obtain robust 𝜆̂𝑔𝑖,𝛽 for limma’s standard pipeline.

Fold change (FC) play an important and informative indicator to identify DEGs. In this procedure the log ratio between two conditions is calculated and declares genes as DEGs whose ratio exceed an arbitrary cut-off value. It is also used for gene ranking to select the candidate gene for further analysis. Therefore, in this paper for two groups (k=2), we calculate a score to select the top DEGs, which is the average of ranks between p-values and FC values. This kind of gene

Journal Pre-proof selection and ranking was also used in [30]. We define a score (𝑆𝑔,𝛽 ,) using the rank of p-values (𝑃𝑔,𝛽 , increasing order) and rank of fold change values (𝐹𝐶𝑔,𝛽 , decreasing order) for gth as follows: 𝑆𝑔,𝛽 =

𝑃𝑔,𝛽 +𝐹𝐶𝑔,𝛽

(11)

2

RNA-Seq Count Data

ro

of

Data transformation using voom method

re

-p

Check the existence of outliers using β-divergence method

lP

yes

Apply the limma on original log-cpm data

Calculate p-values and FC values

Jo

ur

na

Apply the limma on updated log-cpm data

no

Select the significant DEGs using proposed method

Obtain the functional annotations and pathway enrichment analysis of the detected DE genes

Figure 1: Work flow of the proposed method for identification of DEGs.

Journal Pre-proof

Performance Evaluation In order to assess the performance of different gene selection methods for binary classification test, we use receiving operating characteristics (ROC) curve and area under the ROC curve (AUC). Let D be the binary class label (D = 0, 1), 𝒙 ∈ 𝑹𝑝 be a feature vector, and 𝑓0 (𝒙), 𝑓1 (𝒙) be the probability density functions for each class. We classify a gene with feature vector x into

𝐷={

1; 𝐹(𝒙) ≥ 𝑇 0; Otherwise

of

one of two classes as follows:

ro

where 𝐹(𝒙) is a score function and 𝑇 is a threshold value. Then, the true positive rate (TPR, or sensitivity) and false positive rate (FPR, or 1-specificity) are defined as

-p

TPR(𝑇) = ∫𝐹(𝒙)≥𝑇 𝑓1 (𝒙) 𝑑𝒙 and FPR(𝑇) = ∫𝐹(𝒙)≥𝑇 𝑓0 (𝒙) 𝑑𝒙

(13)

re

The ROC curve can be described by pairing these functions

From (13) the AUC is written as

lP

ROC = {(FPR(𝑇), TPR(𝑇))|𝑇 ∈ 𝑹} −∞

na

AUC = ∫

TPR(𝑇)dFPR(𝑇)



The AUC close to 1 means that there is a large separation between 𝑓0 (𝒙) and 𝑓1 (𝒙). The integral

ur

of the ROC function over the given range (0, t) is called partial area under the ROC curve

Jo

(pAUC). Based on these two measures, we declare a method as good performer if it produces larger value of AUC and pAUC compare to AUCs and pAUCs of the other methods. Results and Discussion To demonstrate the performance of the proposed method in a comparison with seven popular methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq), we used both simulated and real RNA-seq count datasets. We used five R packages of other methods such as edgeR, DESeq, limma, SAMseq, and baySeq. The performance measures AUC and pAUC were computed for each of the methods using R package ROCR. All R packages are available in the comprehensive R archive network (cran) or bioconductor.

Journal Pre-proof Simulation data We generated simulated RNA-seq data with two (k=2) and multiple (k=4) groups using “simulatedReadCounts” function in TCC R package [26]. This function uses the negative binomial (NB) distribution to generate the read count data. If 𝜇 and ∅ denotes the mean and dispersion parameter of the NB distribution, respectively then the variance (V) can be modelled as V = 𝜇 + 𝜇2 ∅. The NB distribution reduces to the Poisson distribution if over-dispersion

of

parameter ∅ = 0.

ro

Simulation Study 1

To investigate the outlier detection performance of the proposed method with the mean-variance

-p

trend of log-cpm values using 𝛽-weight function, we generated a dataset with gene expression profiles of G=5,000 genes and n = 20 (n1=n2=10) samples using the negative binomial

re

distribution. The number of DEGs were set to 100 (50 up-regulated and 50 down-regulated). We

lP

added the outliers of this dataset by randomly selecting 20% genes and for each of these genes; we selected a single sample randomly and multiplied the observed count of this sample with

na

randomly selected factor between 5 and 10. Figures 2a and figure 2b represent the mean-variance trend of log-cpm values in absence of outliers and presence of outliers, respectively. The red

ur

circles in figure 2b indicate the outliers. Figures 2(c-d) represent the outlier detection performance of the proposed method using 𝛽-weight function. In these figures the smallest 𝛽-

Jo

weights for 5,000 genes are plotted against their gene index. We determined the β-weights for each gene separately for group 1 (n1 = 10 samples) and group 2 (n2 = 10 samples) using equation (9). Then we calculate the smallest β-weight for each gene. The threshold line of the cut-off value δ is indicated by gray line which was computed using (11). In the figures 2(c-d), smallest β-weights with red colour indicates that the genes associated with these weights are outlying genes. The scatter plot of ordered smallest β-weights corresponding their gene index for outlying simulated dataset is shown in figure S1 (additional file 1). From this simulation study we may conclude that the proposed method can detect the outliers properly.

na

lP

re

-p

ro

of

Journal Pre-proof

Jo

Simulation Study 2

ur

Figure 2: Mean-variance relationship. Gene-wise means and variances of simulated RNA-seq data. (a) Mean-variance trend in absence of outliers. (b) Mean-variance trend in presence of outliers. (c) Plot of smallest β-weight plot in absence of outliers for (a). (d) Plot of smallest β-weight in presence of outliers for (b).

To investigate the performance of the proposed method in comparison with seven popular methods as mentioned above, for both small-and large-sample cases with k=2 groups/conditions, we considered 100 datasets for both cases with sample sizes of n1=n2= 3 and 100, respectively. Each dataset for each case represented RNA-seq read count profiles for G=10,000 genes, each with n = (n1+n2) samples, where the read counts of each gene was generated using the negativebinomial distribution. The number of DEGs was set to 200, 400 and 600 for each of the 100 datasets. For these numbers the percentage of DEGs (PDEG) are 2%, 4% and 6%. To show the effect of outliers (extreme high counts) on the methods, we randomly selected 10%, 20% and 50% genes and for each of PDEG, we selected a single sample randomly and multiplied the

Journal Pre-proof observed count of this sample with randomly selected factor between 5 and 10 from uniform distribution. This process was applied for each of the 100 datasets. Then we computed average values of different performance measures such as true positive rate (TPR), false positive rate (FPR), area under the ROC curve (AUC) and partial AUC based on 200, 400, 600 estimated DEGs by eight methods (edgeR, edgeR_robust, DESeq, DESeq2, limma (voom), SAMseq, baySeq and proposed) for each of 100 datasets. Table 1 and table 2 summarizes the average AUCs and pAUCs estimated by eight methods based on 100 simulated datasets with PDEG =2%

(SE)

Performance Measures

20% Outliers

0.868

0.851

0.803

0.803

0.865

(0.009)

(0.013)

(0.016)

(0.016)

(0.009)

0.791

0.796

0.779

0.779

0.824

(0.016)

(0.015)

(0.020)

(0.020)

(0.009)

0.753

0.721

0.791

0.763

0.763

0.830

(0.009)

(0.010)

(0.020)

(0.014)

(0.023)

(0.023)

(0.009)

0.868

0.876

(0.008)

(0.008)

(0.008)

0.809

0.803

0.809

(0.006)

(0.006)

(0.006)

0.766

0.765

(0.010)

-p

ro

0.879

0.625

0.634

0.594

0.618

0.728

0.452

0.452

0.819

(0.012)

(0.011)

(0.010)

(0.017)

(0.016)

(0.025)

(0.025)

(0.011)

Outliers

edgeR

DESeq

DESeq2

SAMseq

baySeq

Proposed

0.055

0.055

0.056

0.050

0.040

0.040

0.050

(0.002)

(0.002)

(0.002)

(0.003)

(0.003)

(0.003)

(0.003)

(0.002)

0.026

0.026

0.010

0.052

0.046

0.038

0.038

0.046

(0.002)

(0.002)

(0.001)

(0.003)

(0.003)

(0.003)

(0.003)

(0.002)

0.020

0.019

0.018

0.043

0.038

0.035

0.035

0.045

(0.002)

(0.002)

(0.002)

(0.003)

(0.003)

(0.002)

(0.002)

(0.001)

0.015

0.019

0.016

0.031

0.034

0.007

0.007

0.039

(0.001)

(0.002)

(0.002)

(0.002)

(0.002)

(0.006)

(0.006)

(0.001)

0.055

Jo

(SE )

Proposed

DESeq2

50% Outliers

Without Outliers

pAUC

baySeq

DESeq

na

AUC

10% Outliers

SAMseq

edgeR (robust)

re

Without Outliers

Limma (voom)

edgeR

lP

Outliers

10% Outliers 20% Outliers 50% Outliers

edgeR

(robust)

ur

Performance Measures

methods for small-sample case

of

Table 1. Performance evaluation using AUCs and pAUCs estimated by different (n1=n2=3).

Limma (voom)

The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.

Journal Pre-proof

Table 2. Performance evaluation using AUCs and (n1=n2=100). Performance edgeR Outliers edgeR Measures (robust) 0.987 0.991 Without Outliers (0.001) (0.001)

(SE )

0.989

(0.002)

(0.001)

SAMseq

baySeq

Proposed

0.990

0.990

0.991

(0.001)

(0.001)

(0.001)

(0.001)

0.914

0.869

0.972

0.977

0.974

0.985

(0.002)

(0.002)

(0.011)

(0.014)

(0.002)

(0.002)

(0.002)

(0.002)

0.927

0.967

0.913

0.893

0.954

0.968

0.967

0.983

(0.005)

(0.002)

(0.011)

(0.015)

(0.004)

(0.004)

(0.004)

(0.002)

0.951

0.955

0.983

(0.009)

(0.012)

(0.002)

SAMseq

baySeq

Proposed

0.892

0.945

0.806

0.732

0.946

(0.010)

(0.004)

(0.013)

(0.014)

of

0.976

50% Outliers Outliers

edgeR

DESeq

DESeq2

10% Outliers 20% Outliers 50% Outliers

(robust)

(0.003)

Limma

ro

edgeR

(voom)

0.097

0.098

0.097

0.097

0.099

0.098

0.098

0.099

(0.0004)

(0.0004)

(0.0004)

(0.0004)

(0.0004)

(0.0004)

(0.0004)

(0.0004)

0.092

0.096

0.087

0.086

0.096

0.092

0.092

0.098

(0.001)

(0.0006)

(0.002)

(0.001)

(0.0005)

(0.002)

(0.002)

(0.0004)

0.088

0.092

0.073

0.081

0.094

0.090

0.090

0.098

(0.002)

(0.0008)

(0.004)

(0.002)

(0.001)

(0.002)

(0.002)

(0.0004)

0.080

0.090

0.070

0.068

0.092

0.089

0.088

0.098

(0.004)

(0.004)

(0.001)

(0.004)

(0.004)

(0.0004)

(0.002)

-p

Without Outliers

pAUC

0.991

Limma (voom) 0.990

re

Performance Measures

20% Outliers

DESeq2

lP

(SE)

DESeq

methods for large-sample case

0.965

na

AUC

10% Outliers

pAUCs estimated by different

(0.001)

ur

The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.

Jo

in absence and presence of single outlier in each of 10%, 20% and 50% genes for both smallsample cases (n1=n2=3) and large-sample cases (n1=n2=100), respectively. The genes are considered as DEGs if they satisfy adjusted p-values < 0.05 for each of the methods. The Benjamin-Hochberg (BH) method wad used to adjust the p-values.

In this table the results

within the brackets (·) indicates the estimated standard error (SE) for respective AUCs and pAUCs. From table 1, we observed that in absence of outliers, six methods (edgeR, edgeR_robust, DESeq, DESeq2, limma (voom) and proposed) performed almost similar, for small-sample case. But in presence of outliers in this case the proposed method outperforms other seven methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq). For example, in presence of single outlier with each of 20% genes, the proposed

Journal Pre-proof method produces AUC = 0.830, which is larger than 0.766, 0.765, 0.753, 0.721, 0.791, 0.763 and

Jo

ur

na

lP

re

-p

ro

of

0.763, which are produced by edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq

Figure 3: Performance evaluation using boxplot of AUCs estimated by different methods associated with varying proportions of DEGs. The panels A, B and C represent the boxplot of AUC values estimated by eight methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq, baySeq and proposed)

Journal Pre-proof for small-sample case in absence and presence of one outlier in each of 10%, 20%, and 50% genes. 100 simulations were performed to obtain these results.

and baySeq. Results of table 2 also bring out the similar interpretation like table 1. In this case, for 50% outliers SAMseq and baySeq performed slightly better than edgeR, edgeR_robust, DESeq, DESeq2 and Limma (voom), probably because they are non-parametric and nonparametric methods are robust against outliers for large-sample case. Figure 3 and figure S2 shows the boxplots of AUCs based on 100 simulated datasets by each of the methods in

of

absence and presence of outliers for small-and large-sample cases, respectively. In these figures the panels A, B and C represents the boxplots of AUC values using P DEG = 2%, 4% and 6%,

ro

respectively. Similar results are found from these boxplots for every percentage of DEGs like table 1 and table 2. Therefore, from this simulation study we may conclude that the proposed

-p

method outperforms the other seven methods for both small-sample case, in presence of outliers

re

For the large-sample case, in presence of 10% and 20% outliers edgeR_robust, limma(voom), SAMseq, baySeq and proposed method produces similar kind of results (see Table 2). However,

lP

the proposed method produced slightly better performance than the other methods in this case. But in presence of 50% outliers, the proposed method outperforms other seven methodsand it

ur

Simulation Study 3

na

keeps almost equal performance with the other methods, in absence of outliers.

Jo

To demonstrate the performance of the proposed method in a comparison of other seven popular methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq) for k > 2 conditions with multiple patterns, we generated 100 datasets for both small (n1=n2=n3=n4=3)and large-sample (n1=n2=n3=n4=100) cases from negative binomial distribution using “simulatedReadCounts” function of TCC R package. Each dataset represents the RNA-seq read counts for G=10,000 genes with n = (n1+n2+n3+n4) samples. We set PDEG=2% and two different proportions of DEGs for the individual groups: (i) (P G1, PG2, PG3, PG4) = (1/4, 1/4, 1/4, 1/4) for unbiased proportion and (ii) (PG1, PG2, PG3, PG4) = (0.7, 0.1, 0.1, 0.1) for biased proportion for each of the datasets. The outliers are added in the datasets as before. We first assess the performance of the proposed method in a comparison of the other methods for detection of DEGs using various simulation conditions as mentioned above and estimate average AUCs and

Journal Pre-proof pAUCs by these methods. Table 3 and table 4 lists the average AUCs and pAUCs estimated by different methods based on 100 datasets in absence and presence of outliers for small-and largesample cases, respectively using unbiased proportion of DEGs. In these tables the results within the brackets (·) indicate the estimated standard error (SE) for respective AUCs and pAUCs. We observed similar AUCs and pAUCs from both tables, in absence of outliers. However, in presence of outliers, the proposed method outperforms other methods, for small-sample case by producing higher AUCs and pAUCs (see table 3 bold texts). For instance, in presence of 20% outliers, the proposed method produces AUC = 0.781, which is larger than 0.705, 0.702, 0.625,

of

0.603, 0.723, 0.715 and 0.714, that are produced by edgeR, edgeR_robust, DESeq, DESeq2,

ro

Limma (voom), SAMseq and baySeq. On the other hand, for large-sample case, six methods (edgeR, edgeR_robust, limma (voom), SAMSeq, bayseq and proposed method) performed well

-p

compared to DESeq and DESeq2, in presence of outliers (see table 4). In the case of 20% and 50% outliers, the proposed method performed slightly better compare to the other methods. For

re

both cases the Benjamini-Hochberg (BH) method was used to adjust the p-values for each of the

lP

methods. However, the performance of all the methods worsened for biased proportion of DEGs

na

both for small-and-large sample cases (see table A1 and table A2 in additional file 2). For

AUC (SE)

Performance Measure

pAUC (SE )

10% Outliers 20% Outliers

Jo

ur

Table 3. Performance evaluation using AUCs and pAUCs estimated by different methods for small-sample case (n1=n2= n3=n4=3) with k > 2 groups for unbiased proportion of DEGs in the groups. Performance edgeR Limma Outliers edgeR DESeq DESeq2 SAMseq baySeq Proposed Measure (robust) (voom) 0.882 0.875 0.872 0.842 0.835 0.858 0.853 0.854 Without Outliers (0.009) (0.009) (0.009) (0.012) (0.010) (0.008) (0.008) (0.010) 0.779

0.780

0.671

0.622

0.707

0.778

0.787

0.810

(0.005)

(0.006)

(0.012)

(0.012)

(0.013)

(0.006)

(0.006)

(0.010)

0.705

0.702

0.625

0.603

0.723

0.715

0.714

0.781

(0.008)

(0.008)

(0.010)

(0.010)

(0.012)

(0.009)

(0.010)

(0.012)

50% Outliers

0.611

0.611

0.547

0.569

0.692

0.613

0.641

0.788

(0.011)

(0.011)

(0.019)

(0.018)

(0.013)

(0.011)

(0.011)

(0.014)

Outliers

edgeR

DESeq

DESeq2

SAMseq

baySeq

Proposed

Without Outliers 10% Outliers

edgeR (robust)

Limma (voom)

0.056

0.055

0.051

0.055

0.053

0.054

0.053

0.050

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

0.011

0.012

0.009

0.043

0.039

0.010

0.010

0.043

(0.001)

(0.001)

(0.001)

(0.002)

(0.003)

(0.001)

(0.001)

(0.002)

Journal Pre-proof 20% Outliers 50% Outliers

0.008

0.009

0.006

0.037

0.032

0.007

0.007

0.037

(0.001)

(0.001)

(0.001)

(0.002)

(0.002)

(0.001)

(0.001)

(0.001)

0.006

0.007

0.005

0.030

0.029

0.005

0.006

0.037

(0.001)

(0.001)

(0.001)

(0.002)

(0.002)

(0.001)

(0.001)

(0.001)

-p

ro

of

The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.

AUC (SE)

10% Outliers 20% Outliers 50% Outliers

Performance Measure

Outliers Without Outliers

pAUC (SE )

10% Outliers 20% Outliers 50% Outliers

(0.002) 0.924

0.864

0.984

(0.001)

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

0.956

0.658

0.763

0.963

0.951

0.957

0.978

(0.003)

(0.007)

(0.008)

(0.003)

(0.003)

(0.003)

(0.002)

0.919

0.578

0.697

0.932

0.944

0.838

0.978

lP

0.987

(0.004)

0.985

Limma (voom) 0.987

DESeq

na

Without Outliers

edgeR (robust) 0.989

edgeR

ur

Outliers

DESeq2

SAMseq

baySeq

Proposed

0.988

0.980

0.988

(0.004)

(0.003)

(0.010)

(0.010)

(0.004)

(0.004)

(0.006)

(0.002)

0.688

0.792

0.500

0.624

0.862

0.864

0.867

0.976

(0.008)

(0.017)

(0.013)

(0.008)

(0.010)

(0.010)

(0.003)

DESeq

DESeq2

SAMseq

baySeq

Proposed

Jo

Performance Measure

re

Table 4. Performance evaluation using AUCs and pAUCs estimated by different methods for large-sample case (n1=n2= n3=n4=100) with k > 2 groups for unbiased proportion of DEGs in the groups.

(0.010) edgeR

edgeR (robust)

Limma (voom)

0.098

0.098

0.098

0.097

0.099

0.098

0.098

0.099

(0.0003)

(0.0003)

(0.0003)

(0.0003)

(0.0003)

(0.0003)

(0.0003)

(0.0003)

0.032

0.055

0.051

0.046

0.071

0.043

0.044

0.095

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

(0.002)

(0.001)

0.017

0.029

0.044

0.022

0.052

0.025

0.027

0.094

(0.001)

(0.002)

(0.001)

(0.002)

(0.001)

(0.002)

(0.002)

(0.001)

0.007

0.017

0.008

0.013

0.037

0.006

0.006

0.093

(0.001)

(0.002)

(0.001)

(0.002)

(0.003)

(0.002)

(0.002)

(0.001)

Journal Pre-proof The performance measures pAUCs were calculated at FPR=0.1. In this table the bracket () indicates the estimated standard error of AUCs and pAUCs based on 100 simulated datasets.

example, AUC values for edgeR under unbiased (1/4, 1/4, 1/4, 1/4) and biased (0.7, 0.1, 0.1, 0.1) proportions decreased from 0.882 to 0.865, for small-sample case, in absence of outliers. Figure 4 represents the boxplots of AUCs estimated by eight methods based on 200 DEGs in absence and presence of outliers. In this figure the panels A and B represents the boxplots of AUCs for small-and-large sample cases, respectively. This figure also supports the results of table 3 and table 4. Therefore, from this simulation study, we may conclude that the proposed

of

method outperforms other methods in presence of outliers and in absence of outliers it keeps

Jo

ur

na

lP

re

-p

ro

equal performance with other methods.

na

lP

re

-p

ro

of

Journal Pre-proof

Jo

ur

Figure 4: Performance evaluation using boxplot of AUCs estimated by different methods. The panels A and B represent the boxplot of AUCs estimated by eight methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq, baySeq and proposed) for small-and large-sample cases respectively in absence and presence of one outlier in each of 10%, 20%, and 50% genes. 100 simulations were performed to obtain these results.

Real RNA-Seq Data Study To investigate the performance of the proposed method in a comparison of the seven popular methods (edgeR, edgeR_robust, DESeq, DESeq2, Limma (voom), SAMseq and baySeq) in the real RNA-Seq count data we used two datasets namely mouse dataset and Nigerian dataset. Mouse Dataset We used the RNA-Seq mouse dataset which was used in the study [23]. This dataset consist of 36535 genes with 21 samples. This dataset was downloaded from ReCount website http://bowtie-bio.sourceforge.net/recount. Among 21 samples, RNA-Seq count expression collected from 10 C57BL/6J (B6) and 11 DBA/2J (D2) inbred mouse strains. After filtering we

Journal Pre-proof retain 11,474 genes. To investigate the performance of the proposed method for detection of DEGs between the two mouse strains, we considered two latest RNA-Seq methods (edgeR_robust and DESeq2). Figure 5a represents the Venn diagram of top 100 genes estimated by edgeR_robust, DESeq2 and proposed method. From this Venn diagram, we clearly observe

Jo

ur

na

lP

re

-p

ro

of

that there are 82 mutual DEGs detected by these three methods. We also notice that there are 8, 9

Figure 5: Venn diagram and outlier gene expression profile for mouse dataset. (a) Venn diagram of the DE genes estimated by all three methods (edgeR_robust, DEseq2 and Proposed). (b) Scatter plot of the smallest β-weight for each of the 11,474 genes vs. the gene index. (c) Frequency distributions of βweights for each expression of the 11,474 genes.

Journal Pre-proof and 5 genes independently detected by edgeR_robust, DESeq2 and proposed method, respectively. The DEGs were selected using adjusted p-values<0.05 for each of the methods. The p-values were adjusted by the Benjamin-Hochberg (BH) method. Among the 5 DEGs detected by the proposed method 3 genes are declared as outlying genes by the β-weight function (9) of the proposed method. Figure 5b displays smallest β-weights for each of 11,474 genes. In this figure the red circle indicate the outlying gene. We identified 7% outlying genes from this dataset. The threshold line of the cut-off value δ=0.1 is indicated by gray color which was computed using (11). The frequency distribution of β-weights for each expression of the 11,474

of

genes is shown in figure 5c. The scatter plot of ordered smallest β-weights corresponding their

ro

gene index is shown in figure S3. The MA-plot of this dataset after voom transformation is shown in figure S4. In this figure the green circle, blue triangle and red square indicate the 8, 9

-p

and 5 DEGs identified by edgeR_robust, DESeq2 and proposed method, respectively. Then we explored the biological functions and KEGG pathways associated with these three genes using

re

WebGestalt software [24]. The gene ontology (GO) database identified 2 genes out of 3 genes.

lP

The GO results showed that these two outlying genes (Apoa2, Gpat2) are significantly enriched in different biological processes like metabolic process, neural lipid metabolic process, glycerophospholipid biosynthetic process, negative regulation of cytokine secretion involved in

na

immune respone and so on (see file A1(xls) in additional file 2). Furthermore, we revealed that these genes are involved with different molecular functions such as phosphatidylcholine-sterol

ur

O-acyltransferase activator activity, high density lipoprotein particle binding, apolipoprotein

Jo

receptor binding, lipase inhibitor activity and so on (see file A2(xls)). The KEGG pathway enrichment analysis revealed that these 2 outlying genes significantly enriched glycerolipid metabolism, PPAR signaling pathway, glycerophospholipid metabolism and metabolic pathway (see table 5). The functional interaction network of these 2 genes are shown in figure 6 and analyzed via Cytoscape using GeneMANIA plug-in [29].

Journal Pre-proof

Table 5. KEGG pathway for 2 outlying genes identified by proposed method only for mouse data KEGG ID KEGG pathway name Name of gene P-value 1.47e-02 mmu00561 Glycerolipid metabolism Gpat2 mmu03320

PPAR signaling pathway

Apoa2

mmu00564

Glycerophospholipid metabolism

Gpat2

mmu01100

Metabolic pathways

Gpat2

2.12e-02 2.34e-02

Jo

ur

na

lP

re

-p

ro

of

3.01e-01

Figure 6: Functional interactions of the two (2) genes detected by the proposed method.

Journal Pre-proof

Nigerian Data We used Nigerian individuals RNA-Seq data, which is the part of international HapMap project and also published in [27]. This dataset consists of lymphoblastoid cell lines from 29 males and 40 female unrelated Nigerian individuals. Illumina Genome Analyzer II was used to sequence the samples. The Read counts were summarized by Ensembl gene identifiers and gene annotation was performed by bioconductor tweeDEseqCountData package [28]. The number of gene of this dataset is 38,415. The genes (17,310) with least 1 count-per-million reads (cpm) in at least

of

20 samples were used for further analysis. The goal of this data analysis is to identify the DEGs

ro

between male and female. Figure 7a shows the mean-variance relationship of this dataset. This figure suggests the greater biological variation of the log-cpm values for unrelated human

-p

individuals. We first calculate the 69 β-weights corresponding to the 69 samples for each gene. Then we determine the smallest value (minimum) of 69 β-weights for each gene. Figure 7b

re

represents the scatter plot of these smallest β-weights for each of the 17310 genes against the

lP

gene index. The red circles below gray line represent outliers. Figure 7c represents the Venn diagram of top 50 genes identified by limma (voom), DEseq2, edgeR_robust and proposed

na

method. We can clearly observe from this Venn diagram that our proposed method shares more genes with other three methods. We found 39 mutual genes between these methods. The

ur

proposed method shared 3 genes with the DEseq2 and edgeR methods, which were not detected by the limma (voom). To show the influence of outliers for detection of DEGs, we corrupted 39

Jo

mutual DEGs by outliers. To generate the outlying Read counts for 39 genes, we selected a single sample randomly and multiplied the observed count of this sample with randomly selected factor between 5 and 10. We call here the 39 mutual genes as valid gene-set. Then we again applied three methods (limma (voom), edgeR_robust and proposed) to identify the DEGs. The Venn diagram in figure 7c represents the top 50 genes detected by these methods with 39 common DEGs, in presence of outliers. This figure revealed that the proposed method performs better by sharing more genes with the valid 39 DEGs and the other methods. For example, the proposed method shared maximum number of genes (5 genes) with valid DEGs, which were not shared by the other two methods (limma (voom),, edgeR_robust). We validate these 5 genes using gene ontology (GO) and pathway enrichment analysis using WebGestalt software [24].

Journal Pre-proof GO mapped 4 genes out of 5 genes and these genes are enriched in several biological processes

re

-p

ro

of

such as histone H2A acetylation, neuron cell-cell adhesion, vocalization behavior, histone H4-

(b)

Jo

ur

na

lP

(a)

(c)

(d)

Figure 7. Mean-variance relationship and comparison of the top DE genes selected by different methods for Nigerian Dataset. (a) Mean-variance trend. (b) Scatter plot of the smallest β-weight for each of the 17,310 genes vs. the gene index. (c) Venn diagram of top 50 genes estimated by limma(voom), DEseq2, edgeR_robust and Proposed method (d Venn diagram of top 50 genes estimated by limma(voom), edgeR_robust and Proposed method with valid gene-set

K16 acetylation, social behavior and so on (see file A3 (xls)). Among them gene MSL3 is located in X chromosome and genes (NLGN4Y, TTTY15, USP9Y) are located in the Y

Journal Pre-proof chromosome. In the previous study [19], limma (voom) identified NLGN4Y and TTTY15 in the top 16 DEGs. However, when we contaminated these genes by outlier, it fails to detect these genes in the top 50 genes. The pathway enrichment analysis revealed that NLGN4Y gene significantly enriched in Cell adhesion molecules (CAMs) pathway (see table 6). We also revealed that gene MSL3 and USP9Y significantly enriched in two miroRNA target (see table A3). Table 6. KEGG pathways DE genes identified by the proposed method only Name of gene

p-value

NLGN4Y

2.02e-02

of

hsa04514

KEGG pathway description Cell adhesion molecules (CAMs)

ro

KEGG ID

-p

The hypergeometric test is used to calculate the p-values and adjusted by Benjamini-Hochberg method for multiple testing corrections.

re

Conclusion

lP

Detection of differentially expressed genes (DEGs) under two or more conditions from RNA-Seq data has been become popular day by day with many advantages of next generation sequencing

na

(NGS). Typically, the RNA-seq technologies provide the read count data by direct sequencing of transcript. A number of methods have been developed for detecting the DEGs genes from RNA-

ur

seq data. Most of these methods are based on either Poisson distribution or negative binomial (NB) distribution. However, identification of DEGs based on read count data using skewed

Jo

distribution is very complex, in presence of outliers or extreme values. Transformation of RNASeq data is an alternative approach to tackle the outlier’s issues. There are several transformation methods for RNA-seq data, among them voom using limma pipeline has been widely used. However, limma by voom is sensitive to outliers for small-sample case and produces misleading results in presence of outliers. Therefore, in this paper, we robustify the voom approach using the minimum β-divergence method. We demonstrated the performance of the proposed method in a comparison of some popular DEGs selection methods (DEseq, DEseq2, SAMseq, Bayseq, limma (voom), edgeR and edgeR_robust) using both simulated and real dataset. Both types of data analysis results showed that the performance of the proposed method improved over the

Journal Pre-proof competing methods, in presence of outliers and in absence of outliers it keeps almost equal performance with these methods.

Abbreviations DEGs, differentially expressed genes; FC, fold change; NB, negative bionomial; NGS, next generation sequencing; log-cpm, log-counts per million; LIMMA, linear models for microarray data; AUC, area under curve; pAUC, partial area under curve; VST, variance-stabilizing transformation.

of

Availability of data and material

ro

The R-codes of the proposed algorithm can be found in the additional file 3.

-p

Author’s contributions

lP

re

Conceived and designed the experiments: MSJ MNHM. Performed the experiments: MSJ. Analyzed the data: MSJ. Contributed reagents/materials/analysis tools: MRR,SI. Wrote the paper: MSJ MNHM. Finalized the manuscript: all Authors. Edited and approved the final version of the manuscript: All Authors.

na

Competing interests

The authors declare that they have no competing interests.

ur

Consent for publication

Jo

Not applicable.

Acknowledgements Not applicable.

References [1]. Mortazavi A, Williams B, McCue K, Schaeffer L and Wold B.(2008): Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Meth. 5, 621-628.

Journal Pre-proof [2]. Beyer M, Mallmann MR, Xue J, Staratschek-Jox A, Vorholt D, Krebs W, Sommer D, Sander J, Mertens C, Nino-Castro A, Schmidt SV, Schultze JL.(2012): High-Resolution Transcriptome of Human Macrophages. PLoS ONE, 7, e45466. [3]. Wang Z, Gerstein M, Snyder M.(2009): RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet.,10, 57–63. [4]. Bullard JH, Purdom E, Hansen KD, Dudoit S.(2010): Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94. [5]. Wang L, Feng Z, Wang X, Zhang X.(2010): DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics, 26, 136-138.

of

[6]. Nagalakshmi U, Waern K, Snyder M.(2010): RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol, 4,1-13.

ro

[7]. Robinson MD, McCarthy DJ, Smyth GK.(2010): EdgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26, 139-140.

-p

[8]. Anders S, Huber W.(2010): Differential expression analysis for sequence count data. Genome Biol, 11,R106.

re

[9]. Hardcastle TJ, Kelly KA.(2010): baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics, 11,422.

lP

[10]. Love MI, Huber W, Anders S.(2014): Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 550.

na

[11]. Leng N, Dawson J, Thomson J, Ruotti V, Rissman A, Smits B, Haag J, Gould M, Stewart R, Kendziorski C.(2012): EBSeq: an empirical bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics, 29, 1035-1043..

ur

[12]. Zhou X, Lindsay H, Robinson MD.(2014): Robustly detecting ifferential expression in RNA sequencing data using observation weights. Nucleic Acids Res, 42, e91.

Jo

[13]. Di Y, Schafer DW, Cumbie JS, Chang JH.(2011): The NBP negative binomial model for assessing differential gene expression from RNA-seq. Stat Appl Genet Mol Biol, 10, 1-18. [14]. Li J, Tibshirani R. (2013): Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-seq data. Stat Methods Med Res, 22, 519-536. [15]. Robinson MD, Smyth GK.(2008): Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9, 321-332. [16]. George NI, Bowyer JF, Crabtree NM, Chang CW.(2015): An Iterative Leave-One-Out Approach to Outlier Detection in RNA-Seq Data. PLoS ONE, 10, e0125224. [17]. Zwiener I, Frisch B, Binder H.(2014): Transforming RNA-seq data to improve the performance of prognostic gene signatures. PloS One,.9, e85150. [18]. Robinson MD, Oshlack A.(2010): A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol, 11, R25.

Journal Pre-proof [19]. Law W, Chen Y, Shi W, Smyth GK.(2014): voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15, R29. [20]. Mollah MNH, Minami M, Eguchi S.(2007): Robust prewhitening for ICA by minimizing βdivergence and its application to FastICA, Neural Processing Letters, 25, 91-110. [21]. Mollah MNH, Sultana N, Minami M, Eguchi S.(2010): Robust Extraction of Local Structures by the Minimum β-Divergence method, Neural Network, 23, 226-238. [22]. Shahjaman M, Kumar N, Mollah MMH, Ahmed MS, Begum AA, Islam SMS, Mollah MNH.(2017): Robust Significance Analysis of Microarrays by Minimum β-Divergence Method. Biomed Research International, 2017, 1-18.

of

[23]. Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ,(2011): Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays. PLoS One, 6, e17820.

ro

[24]. Zhang B, Kirov S, Snoddy J.(2005): WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Research, 33, W741–W748.

-p

[25]. Smyth GK. (2004): Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Stat Appl Genet Mol Biol, 3:3.

re

[26]. Mollah MMH, Mollah MNH, Kishino H.(2012): β-empirical Bayes inference and model

lP

diagnosis of microarray data. BMC Bioinformatics, 13,135.

na

[27]. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, Veyrieras JB, Stephens M, Gilad Y, Pritchard JK.(2010): Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 464, 768-772.

ur

[28]. Esnaola M, Puig P, Gonzalez D, Castelo R, Gonzalez JR.(2013): A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments. BMC Bioinformatics. 14: 254-10.

Jo

[29]. Shannon P, Markiel A, Ozier O. (2003): Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13: 2498–504. [30]. Shahjaman, M.; Rahman, M.R.; Islam, S.M.S.; Mollah, M.N.H. A Robust Approach for Identification of Cancer Biomarkers and Candidate Drugs. Medicina 2019, 55, 269.

Journal Pre-proof

Jo

ur

na

lP

re

-p

ro

of

Highlights  The proposed robust voom approach showed better performance than competing methods in both simulated and real data analysis, in presence of outliers  The proposed method identified two outlying genes (Apoa2, Gpat2) from mouse dataset, are significantly enriched in different biological processes like metabolic process, neural lipid metabolic process, glycerophospholipid biosynthetic process, negative regulation of cytokine secretion involved in immune respone and so on.  The proposed method identified outlying genes (MSL3 ,NLGN4Y, TTTY15, USP9Y) from Nigerian dataset, significantly enriched in Cell adhesion molecules (CAMs) pathway  miRNAs (mir-212,mir-216,mir-132) were identified.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7