Gene 518 (2013) 164–170
Contents lists available at SciVerse ScienceDirect
Gene journal homepage: www.elsevier.com/locate/gene
Identifying differentially spliced genes from two groups of RNA-seq samples Weichen Wang a, b, 1, Zhiyi Qin a, 1, Zhixing Feng a, Xi Wang a, c, Xuegong Zhang a, d,⁎ a MOE Key Laboratory of Bioinformatics, Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST, Department of Automation, Tsinghua University, Beijing 100084, China b Department of Operational Research and Financial Engineering, Princeton University, NJ 08544, USA c School of Biomedical Sciences and Pharmacy, The University of Newcastle, Callaghan NSW 2308, Australia d School of Medicine, Tsinghua University, Beijing 100084, China
a r t i c l e
i n f o
Article history: Accepted 27 November 2012 Available online 8 December 2012 Keywords: Alternative splicing RNA-Seq Negative binomial model Differential splicing Exon-centric method
a b s t r a c t Recent study revealed that most human genes have alternative splicing and can produce multiple isoforms of transcripts. Differences in the relative abundance of the isoforms of a gene can have significant biological consequences. Identifying genes that are differentially spliced between two groups of RNA-sequencing samples is an important basic task in the study of transcriptomes with next-generation sequencing technology. We use the negative binomial (NB) distribution to model sequencing reads on exons, and propose a NB-statistic to detect differentially spliced genes between two groups of samples by comparing read counts on all exons. The method opens a new exon-based approach instead of isoform-based approach for the task. It does not require information about isoform composition, nor need the estimation of isoform expression. Experiments on simulated data and real RNA-seq data of human kidney and liver samples illustrated the method's good performance and applicability. It can also detect previously unknown alternative splicing events, and highlight exons that are most likely differentially spliced between the compared samples. We developed an NB-statistic method that can detect differentially spliced genes between two groups of samples without using a prior knowledge on the annotation of alternative splicing. It does not need to infer isoform structure or to estimate isoform expression. It is a useful method designed for comparing two groups of RNA-seq samples. Besides identifying differentially spliced genes, the method can highlight on the exons that contribute the most to the differential splicing. We developed a software tool called DSGseq for the presented method available at http://bioinfo.au.tsinghua.edu.cn/software/DSGseq. © 2012 Elsevier B.V. All rights reserved.
1. Introduction The rapid development of next-generation-sequencing has made it possible to read nucleotide compositions of DNA or RNA molecules at high throughput and high accuracy. It provides the new technique called RNA sequencing or RNA-seq to study transcriptomes by sequencing RNA molecules instead of hybridizing them to predesigned probes as in the microarray technology. RNA-seq generates tens of millions of short reads with lengths of 25 to over 100 nt randomly sampled from RNA molecules in the sample, providing digital measurements on abundances of all transcripts. An important advantage of RNA-seq is that it can reveal previously un-annotated transcripts and provide information
Abbreviations: DSG, differentially spliced gene; NB, negative binomial; IPV, isoform proportion vector; MSDR, mean sequencing depth ratio; RPKM, reads per kilobase per million reads. ⁎ Corresponding author at: FIT 1-107, Tsinghua University, Beijing, 100084, China. Tel.: + 86 10 62794919; fax: + 86 10 62773552. E-mail addresses:
[email protected] (W. Wang),
[email protected] (Z. Qin),
[email protected] (X. Zhang). 1 These authors contributed equally to this work. 0378-1119/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.gene.2012.11.045
on fine structures of transcripts, especially for alternative splicing genes. Recent RNA-seq studies have updated people's understanding on alternative splicing and revealed that most human genes may have alternative splicing (Pan et al., 2008; Wang et al., 2008). Comparing transcriptomes of two groups of samples and identifying differentially expressed (DE) genes are basic and important tasks. These have been widely studied with microarrays. The digital nature of RNAseq brings new characteristics in the gene expression measurement. New methods have been proposed for identifying differentially expressed genes from RNA-seq data, such as DEGseq (Wang et al., 2010), DESeq (Anders and Huber, 2010), baySeq (Hardcastle and Kelly, 2010) and Cufflinks (Trapnell et al., 2010). These methods focused on the overall expression of a gene regardless of its alternative isoforms. As many genes can be alternatively spliced, and different relative abundances of splicing isoforms can have crucial impact on phenotypes (Pan et al., 2008; Wang et al., 2003, 2008), it is in many scenarios more important to study the expression pattern or the relative proportion of multiple isoforms of a gene besides its overall abundance. In contrast to the rapid development of methods for studying DE genes, there are fewer methods for detecting differentially spliced
W. Wang et al. / Gene 518 (2013) 164–170
(DS) genes, i.e., genes that have different proportions of isoform expression between compared samples. Some RNA-seq data analysis tools like Cufflinks (Trapnell et al., 2010) have included features for detecting differential splicing (called Cuffdiff), but most of them are based on estimating the expression of all isoforms first and then comparing proportions of the isoforms. The inference of isoforms and estimation of isoform expression are still current topics under extensive research (Jiang and Wong, 2009; Richard et al., 2010; Roberts et al., 2011; Wu et al., 2011). Such inferences and estimations may introduce extra inaccuracy or uncertainty for studying DS genes. Actually, the accurate estimation of isoform expression is more demanding than detecting differences in their expression, and therefore it may not be a good strategy to detect differences in splicing based on the inference and estimation of isoform expression. In Stegle et al. (2010), the authors discussed the question and proposed to directly use exons not shared by isoforms to detect different abundance of isoforms. They proposed a non-parametric test for detecting differential read coverage based on a measurement defined in the Reproducing Kernel Hilbert Space. It is a promising framework but depends on known annotation of isoform structures, and the selection of kernels is still a topic of future study (Stegle et al., 2010). Singh et al. (2011) developed a method for detecting differential transcription by constructing Aligned Cumulative Transcript Graphs (ACT-Graph) from mapped exon reads and junction reads and comparing graphs between samples with a flow difference metric (FDM). Simulation experiments showed superior performance over other methods. However, the construction of ACT-Graphs requires information of junction reads. When the sequencing depth is not very high or the read length is short, the proportion of junction reads is small, and therefore many splicing sites may not be covered by junction reads. Besides, junction read mapping is still a challenging task in both the mapping accuracy and computation cost (Wang et al., 2011). These factors can be limitations for such graph-based methods. On the other hand, all existing methods were designed for comparing two individual samples. Due to the intrinsic diversity between biological samples, it's more important to compare two groups of samples to find real signals that discriminate the two groups. Several methods have included the option of comparing samples with replicates. However, this was typically done by the combination of multiple pair-wise comparisons of single samples. The underlying tests were not designed for comparing two groups. In this paper, we study the question of identifying genes that are differentially spliced between two groups of samples, either of technical replicates or biological replicates. We use “differential splicing” or DS to refer to that the relative abundances of a gene's multiple isoforms are different between samples of the compared groups. As discussed above, for the purpose of inferring differential splicing, we do not necessarily need to estimate the expression of isoforms as differential splicing can be reflected from read distributions across the exons composing the isoforms. In this way, we also don't need to know isoform structures or even the existence of multiple isoforms. We use the negative binomial (NB) distribution to model read-counts on all exons of a gene. It considers over-dispersion in read-count distribution and borrows information across samples to get better estimation of the signal of each exon. An NB-statistic is proposed to assess differential splicing. It takes into account both the between-group variation and within-group variation, similar to Baggerly et al. (2003). The proposed method can not only identify differentially spliced genes, but also detect exons that show biggest differences, i.e., exons that are most likely to be differentially spliced between the compared samples. Previously unknown alternative splicing events can be detected in this way. We also consider the possible position-associated bias in read distribution in many RNA-seq datasets and develop a bias-correction option. The comparison with the popular software Cuffdiff on both simulated and real datasets illustrated the good performance and applicability of the proposed method. The proposed software DSGseq can be accessed at http://bioinfo.au.tsinghua. edu.cn/software/DSGseq.
165
2. Methods 2.1. The model Let g denote a gene to be studied and K(g) be the set of its possible isoforms, and the number of isoforms is r(g). Note that K(g) or r(g) needs not to be known in our method. Assume that gene g has m(g) nonredundant exons, which are segments that cover all exon regions of the gene and don't overlap. This is a mathematical definition of exon, which can be an exon in the biological definition, and can also be a segment of a biological exon. For example, in the case of alternative 5′- or 3′-end events, we define the alternative part of the exon as a non-redundant exon and the remaining part as another non-redundant exon. Fig. 1 shows an example of the mathematical definition of exons in a gene with two isoforms. Long exons can be split into several non-redundant exons if necessary. When considering junction reads covering parts of two exons, we can take the junction part as an extra non-redundant exon. In the remaining part of the paper, we use the term “exon” as defined in this way. We can define a gene structure matrix A=(afj) of dimension r(g) by m(g) to indicate the composition of all isoforms, where afj =1 means that isoform f contains exon j, and afj =0 means that exon j is not included in isoform f. Again, this annotation is for the purpose of derivation only. The proposed method does not require knowing this matrix. Let kf(g) be the copy number of RNA transcripts in the form of isoform f, and lj(g) be the length of exon j. We consider the possible sequencing preference of each exon and model it as a multiplier vector {βj}jm= 1. In the following description, we consider only one gene, so we omit the superscript (g) in all annotations. The RNA sequencing procedure can be viewed as a random sampling procedure from the multiple copies of RNA transcripts in the cells. The probability that a read falls in the region of exon j can be written as
pj ¼
βj lj ∑f ∈K afj kf m
∑i¼1 ∑f ∈K afi βi li kf
¼
1 βl∑ a k : Ξ j j f ∈K fj f
ð1Þ
The matrix form for all m exons is p ¼ Ξ1 BLAT k where p and k are column vectors of pjs and kfs. B and L are diagonal matrices with diagonal elements {βj}jm= 1 and {lj}jm= 1, respectively, and Ξ is a normalization factor. This relationship implies that there is a one-to-one correspondence between p and k if identifiability is assumed, i.e., the matrix A is of full row rank. This assumption requires that there is no isoform that can be composed by the combination of other isoforms. With this assumption, the information of isoform expression is fully reflected in reads at all exons. Checking the current annotation of the human genome, we found that most genes can satisfy this assumption. Therefore, we can detect differences in isoform proportion from the information at all exons without having to estimate the expression of all isoforms. The method we propose is to detect differential splicing between two groups of samples based on the estimated expression probability vector of all exons of the gene. Let n be the number of samples in the study and Mi is the total number of reads (read-count) received at the studied gene. Let Yij be the read-count of sample i at exon j, i=1,2,⋯,n. Differences in Mi reflect the differential total expression and sequencing coverage of the gene between the compared samples. For studying differential splicing, we are comparing the read count vector Yi ={Yij} conditional on Mi. That is, we focus on the proportion of isoform expression instead of the overall expression of the whole gene. Yij/Mi can be taken as an estimate of the pj in sample i. Similar to some methods for studying differential expression of genes (Anders and Huber, 2010), we model the read count by the negative binomial (NB) model to consider over-dispersion in the data: Y ij e NB μ ij ; ϕ ; μ ij ¼ M i pj ; i ¼ 1; ⋯; n; j ¼ 1; ⋯; m
ð2Þ
166
W. Wang et al. / Gene 518 (2013) 164–170
Fig. 1. An example of the mathematical definition of exons. This gene contains two isoforms with 6 biological exons. The intron retention can be segmented into three mathematical exons; for constitute exons and the exon skipping event, the mathematical exons are the same with the biological exons; the last exon with an alternative boundary is split as two mathematical exons.
where μij is the mean and ϕ is the dispersion parameter. We assume that all exons share the same ϕ so that we can combine information of all exons to estimate the over-dispersion. The probability density of read count is !1 !y ϕ Γ y þ ϕ−1 μ ij ϕ 1 P Y ij ¼ y ¼ −1 1 þ μ ij ϕ Γ ðy þ 1Þ 1 þ μ ij ϕ Γ ϕ
ð3Þ
with mean μij and variance V(Yij)=μij +ϕμij2, where Γ() is the Gamma function. Note that with ϕ>0, the model extends the mean-variance relationship of Poisson model to take into account the within-group variation. If ϕ=0, the model becomes the conventional Poisson model. 2.2. Parameter estimation Here we describe our method for estimating parameters of the above model from multiple samples of the same group. Suppose we have n samples in the group, and we use a weighted combination of single-sample estimation to estimate the pj in the group: p^ j ¼ ∑i wi
Y ij ; Mi
j ¼ 1; ⋯; m
ð4Þ
where wi > 0,i = 1,⋯,n are weights for samples and ∑ in= 1wi = 1. We have E p^ j ¼ ∑i wi pj ¼ pj and 2 n X 2 pj þ ϕM i pj V p^ j ¼ wi : Mi i¼1
ð5Þ
Since the estimation of pj is unbiased, we can choose the weights so that the sum of the associated variances across all genes is minimized. Therefore, the optimal weights can be obtained by solving the following constrained optimization problem using the Lagrange method: 2 m X n n X X 2 pj þ ϕM i pj min∑ V p^ j ¼ wi s:t: wi ¼ 1 M j i j¼1 i¼1 i¼1
ð6Þ
which gives wi ∝
Mi 1þ
2 ϕM i ∑j pj
and
n X
wi ¼ 1:
ð7Þ
i¼1
In addition, we use Quasi-Likelihood (QL) method (Robinson and Smyth, 2007) to estimate the dispersion parameter ϕ. The QL method uses the Deviance statistic D to construct an estimating equation. D is defined as twice the difference between the log-likelihood of the saturated model and the unsaturated one. The asymptotic null distribution
of deviance D can be derived under negative binomial models and the estimation equation for ϕ (Fox, 2008) is D ¼ 2flsat −lunsat g ( " # " #) −1 m X n X Y ij Y ij þ ϕ −1 −1 − Y ij þ ϕ ¼ ðm−1Þðn−1Þ: log Y ij log þϕ ¼2 μ^ ij μ^ ij j¼1 i¼1
ð8Þ Then we come up with an iteration procedure: (i) Set initial values for weights wi = Mi/∑iMi and get the initial estimation of pjs by ^ Eq. (4); (ii) Use QL method or Eq. (8) to estimate the dispersion ϕ; (iii) Replace pjs and ϕ in Eq. (7) by their estimations and update the weights; (iv) Iterate steps (i) to (iii) until convergence. Empirically based on our experiments, the whole procedure often converges in 3 or 4 iterations. 2.3. Detecting differential splicing For the comparison of two groups of samples (denoted as groups A and B) to detect differentially spliced genes, we first estimate the probabilities of all exons' expression and their associated variances in the two groups separately by the method described above. Then an NB-statistic is proposed to test against the null hypothesis that the isoform proportions of the gene in the two groups are identical. This is equivalent to test pjA = pjB for all js (all exons) if isoforms are identifiable. Even when isoforms are not identifiable, it still gives a necessary condition (albeit not sufficient) of pjA = pjB. We define the NB-statistic as 2 m p^ jA −p^ jB 1X NB stat ¼ m j¼1 V^ p þ V^ p : jA jB
ð9Þ
The NB-statistic measures the difference in the expression probability of all exons in the gene between the two groups, considering not only the between-group variation but also the within-group variations. Baggerly et al. (2003) have proposed a similar test statistic in detecting differential expression from SAGE data. They pointed out that although such statistic can reflect differences in gene expression, the null distribution is hard to obtain because it depends on the relative size of the between-group and within-group variations (Baggerly et al., 2003). We observed the same phenomena in our study. Therefore, we do not define a significance p-value for the NB-statistic, but instead, we use the NB-statistic as a measure to rank all genes with regard to differential splicing. Indeed, we have assumed that all exons share the same dispersion and used the same set of weights for all genes when optimizing the weights for the estimation of variances, so the NB-statistic values of different genes are comparable. Biologists can take the top ranking DS genes as candidates for further functional study, or set a cutoff threshold to decide whether the observed difference in splicing is of interest. From our experiment, an
W. Wang et al. / Gene 518 (2013) 164–170
NB-statistic value over 5 is already quite significant. If a p-value is desired for each gene besides the ranking, we can adopt a permutation for each gene on the grouping of samples to obtain a permutation p-value. The rank of DS genes by NB-statistic is based on averages of all exons in the genes. One can use the method to identify exons that cause the differences. For a gene ranked as differentially spliced, one can decompose the NB-statistic into m items corresponding to the m exons in the gene: 2 p^ jA −p^ jB : NB stat j ¼ V^ pjA þ V^ pjB
ð10Þ
Exons with large NB-stat values are those that make big differences in the different isoform proportions. In this way, we can also check whether a gene contains alternative splicing events that have not been previously known by comparing with the gene annotation. By comparing the NB-stat of exons, we can get putatively the most significant differential splicing event between the compared groups, and the results can potentially shed lights on studies of splice mechanisms and corresponding functions. 3. Results 3.1. Simulations S1 and S2: two isoform genes We first conducted simulations S1 and S2 where a gene has two isoforms. This simplest case of alternative splicing can give a good idea on a method's sensitivity and specificity. The generation of data is described in detail in the Supplementary Materials. The simulated genes were generated to have 8 exons and 2 isoforms each. The lengths of exons were generated from a distribution derived from the RefSeq annotation. Flux Simulator (http://flux.sammeth.net/ simulator.html) was used to simulate RNA-seq reads of length 75 nt. Each simulation dataset includes a control group and a case group. We simulate replicate samples in each group by introducing randomness of normal distribution to the proportion vector in the samples (see Supplementary Materials for details). We define the isoform proportion vector (or IPV for short) (c,(1− c)) as the mean proportion of the two isoforms' expression. The c was set to 0.5 in the control group in Simulation S1, and to 0.25 in Simulation S2, forming IPVs of (0.5, 0.5) and (0.25, 0.75), respectively. In the case groups in both Simulations S1 and S2, c is chosen from 0.05, 0.10, 0.15, …, 0.45, 0.50 with equal possibility, generating IPVs of (0.05, 0.95), (0.10, 0.90), …, (0.50, 0.50). We can see that in S1, the smaller the c value in the case group is, the more distinct the case and control groups are. In S2, the farer the c value in the case group is from 0.25, the more distinct the case and control groups are. We may use mathematically better defined measures such as Jensen Shannon Divergence (JSD) (Trapnell et al., 2010) to describe differences between the case and control groups. But since it is an open question what level of difference will be biologically significant, we choose to use this simpler and more explicit way of describing the difference. Two sequencing depths were simulated, measured by the parameter MSDR (mean sequencing depth ratio). We set MSDR to 0.4 or 0.6 in our simulations, to represent experiments at a lower depth or a higher depth, respectively. If the total number of reads on the genome is 10 million, an MSDR of 0.4 reflects a mean RPKM (reads per kilobase per million reads) of 40. We generated 5 replicates for both the case and control groups for each gene in each simulation, and simulated 500 simulated genes for each situation. Then we counted the reads at each exon to get the read-count vector for each gene in the simulated data as we would do on real RNA-seq data. Note that we only used the exon information but did not use the isoform structure information. The NB-statistic was applied on each
167
gene to compare the two groups. This produced a list of genes ranked according to the difference in splicing between the two groups. The histogram of the NB-stat in Simulation S1 with MSDR= 0.4 is given in Fig. 2(a). The shape is similar to the exponential distribution, with most of the NB-stat values close to 0. We can either define a threshold and judge genes above the threshold as differentially spliced, or take a number of genes from the top of the rank as the most possible DS genes. Due to the yet limited understanding on the phenotype influence of different levels of changes in isoform expression proportion, it is not straightforward to define what is “true” differential splicing even on simulation data. For example, if one defines the null hypothesis as two proportion vectors are from the same distribution, many subtle changes will be called significant although many of the changes may be biologically irrelevant. Therefore, we studied a series of different definitions on “true” differential splicing in the simulation. We call a gene as differentially spliced between the two groups if the difference of c values of the two groups in the simulation model is larger than a parameter δ, and we used different values of δ as 0.15, 0.1, 0.05, or 0. For example, in Simulation S1 where c = 0.5 for the control group, δ = 0.05 means that we define a gene as a true DS gene only if the c in the case group is less than 0.45. Once “true DS genes” are defined, we can study the sensitivity and specificity of the proposed method in detecting DS genes at different cutoffs. The result can be shown by ROC curves. Fig. 2(b) shows the ROC curves at different δ settings on S1 with MSDR= 0.4. The curves on S2 are similar. We can see that the AUCs (area under the ROC curve) are all above 0.8, indicating acceptable performances in all scenarios. We can also see that the detection performance is better when δ is smaller when we tend to call genes with even very small changes as true DS genes. In the simulation, differences between genes' splicing proportions are reflected by the difference of c values between case and control samples. In Figs. 2(c) and (d), we checked distribution of the c value difference of genes detected as differentially spliced at several cutoffs in Simulations S1 and S2, with sequencing depth MSDR= 0.4. The horizontal axis is the c value difference between the case and control samples. The vertical axis shows the proportion of genes that are detected as DS genes among all genes having the same c value. We can see that, at cutoff 20, only genes with big c value differences will be called; and with cutoff 1.25, a large proportion of genes with even small c value differences will also be called. Due to the difficulty of defining “true” differential splicing, any selection of a cutoff can be arbitrary. A more reasonable way of applying the method is to take the result as a ranked list and choose the genes from the top of the list for further biological study. More information on the simulation experiments is given in Figure S1 in the Supplementary Materials. Besides detecting DS genes, we also studied the detection of exons with the biggest contribution to the difference. We decomposed the NB-statistic into m parts by Eq. (10) corresponding to the m exons, and checked whether the exon with the largest NB-stat is an alternative splicing exon in the simulated gene structure. Actually as our method does not need isoform annotation information, this step can be viewed as the detection of alternative exons. Table 1 summarizes the detection rate in the simulation. The detection rate is defined as the percentage of genes with the same c value of which the alternative spliced exons are correctly detected. From Table 1, we can see that when the c value differences between the case and control samples are not very small, most of the alternative spliced exons can be correctly detected. This indicates that the NB-statistic can be used not only for detecting DS genes, but also for detecting previously unknown alternative exons. 3.2. Simulation S3: genes with multiple isoforms For genes with more than 2 isoforms, we face the problem of comparing isoform proportion vectors of 3 or higher dimensions. The proposed NB-statistic works equally well for such cases, but it's harder to
168
W. Wang et al. / Gene 518 (2013) 164–170
Fig. 2. Performance of the NB-statistic method on Simulations S1 and S2. (a): Histogram of NB-stat values of the simulated 500 genes in Simulation S1 with sequencing depth MSDR=0.4. (b) ROC curves in Simulation S1 with MSDR=0.4. Different δ value indicates different definitions of “true” differentially spliced genes. A small δ value indicates a loose definition and a large δ value indicates a more stringent definition. The ROC curves were drawn by setting different cutoffs of NB-stat and comparing the identified DS genes with the pre-set true answers. (c) and (d): The c value distributions of the detected DS genes at different cutoff levels in Simulations S1 and S2, with sequencing depth MSDR=0.4.
define the “true” answers in the simulation as the difference between two vectors cannot be well characterized explicitly by a single parameter like the c in the 2-isoform case. Measures like JSD can describe the difference, but intuitive association of the measured values with the style and degree of the difference is not available. So we only considered some extreme cases in our simulation S3, where genes can have either identical isoform proportion vectors, or have very different proportion vectors by changing the dominating major isoform.
Table 1 The accuracy (in %) of the detection of differentially spliced exons, at various c values. c value
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
S1_MSDR = 0.4 S1_MSDR = 0.6 S2_MSDR = 0.4 S2_MSDR = 0.6
96.3 96.3 87.5 85.0
92.5 97.5 84.9 79.2
90.9 93.2 88.7 94.3
91.1 95.6 69.0 71.4
86.0 94.0 53.1 57.1
90.9 90.9 64.6 72.9
85.7 87.8 68.3 87.8
75.6 73.3 89.1 94.5
64.8 64.8 84.8 84.8
68.4 42.1 91.8 95.9
In the biological sense, switching of the major isoform is the most significant type of differential splicing. Simulation S3 was designed in a similar way as Simulations S1 and S2. We first generated gene structures with isoform number r to be 2, 3, 4 or 5 and number of exons m to be 6, 8, 10 or 12. There are 16 combinations of r and m. For each combination, we generated 25 genes, resulting in 400 genes of different structures. The mean IPV of dimension r was set to (1− 0.05(r − 1), 0.05, ⋯, 0.05) for the control group, and was set to (1− 0.05(r − 1), 0.05, ⋯, 0.05) or (0.05, 1 − 0.05(r − 1), 0.05, ⋯, 0.05) for the case group. That is, in both the control and case groups, there is one strong major isoform that takes the majority of the expression. The case group model is either identical to the control group, or is in a similar style but with a different major isoform. This type of major isoform change can cause significant biological consequences, and therefore is a situation that biologists care the most. We simulated two sequencing depths: MSDR = 0.4 or 0.6, and generated 5 replicate samples in each control or case group.
W. Wang et al. / Gene 518 (2013) 164–170
Under this setting, we calculated sensitivities and specificities at NB-stat cutoffs of 1.25, 2.5 and 5, respectively. Table 2 shows the result. In this experiment, since all the genes were set to the extreme cases, i.e., either the two groups share same proportion vector, or their proportion vectors are very different, the detection results were almost perfect. At cutoffs of 1.25 and 2.5, both the sensitivities and specificities are close to 100%, and at the cutoff of 5, still the sensitivity is above 90% and the specificity is 100%. We also used the method to detect the most different exon in this simulation. When the gene has more than two isoforms, typically there will be more exons involved in the differential splicing. In the extreme pattern of differential splicing in this simulation, the major isoform changed from the first isoform to the second one, so the exons that are different between these two isoforms must be the exons defining the differential splicing. Therefore, we checked whether the most different exon found by our method (the exon with the largest NB-stat) is an exon that is different between the two isoforms. Results showed that the method identified the correct exon in 73.1% and 71.6% of the DS genes when MSDR was 0.4 and 0.6, respectively. We can see that this is lower than the detection rate in Simulation S1 or S2, although the sensitivity and specificity in detecting DS genes are much higher. This reflects the increasing complexity of the problem when there are multiple isoforms in a gene.
169
Table 3 Percentages of genes called as differentially spliced between the case groups and the control group by NB-statistic and Cuffdiff on Simulation S4 (with medium coverage depth). Case group IPV
0.0, 1.0
0.1, 0.9
0.2, 0.8
0.3, 0.7
0.4, 0.6
0.5, 0.5
CallNB >=2 CallCuff(p b0.05)
63.1% 45.8%
49.7% 33.5%
41.9% 15.1%
26.8% 8.9%
10.6% 0.0%
1.1% 0.0%
CallNB>= 2: Percentage of genes called DS by NB-statistic at threshold NB >2. CallCuff(pb 0.05): Percentage of genes called DS by Cuffdiff at p-value b 0.5.
the case group is (0.0, 1.0). We also observed that a large percentage of genes cannot pass the screening criteria and are called NOTEST or FAIL. This is also a phenomena being widely discussed in the on-line discussion groups of Cufflinks' users. Results on simulation data at other coverage depths are given in the Supplementary Materials. We can see that when the coverage decreases, the detection powers of both methods degrade. We also experimented on the situation when Cuffdiff does not use the known isoform structures of the genes, but use structures inferred from the reads data by the Cufflinks assembly. The performance of Cuffdiff detection worsens in most cases, but in some cases when the coverage is high, using the inferred structure results in more detected DS genes than using the known structures. The performance of Cuffdiff under different situations varies in a less predictable manner (Supplementary Tables S2–S7).
3.3. Simulation S4: real gene structures and comparison with Cuffdiff We generated Simulation data S4 with real gene structures and compared the proposed NB-statistic method with the popular Cuffdiff method (Trapnell et al., 2010) which is based on isoform inference and expression estimation. The Cuffdiff software was downloaded from http://cufflinks.cbcb.umd.edu (version 1.3.0). We choose all genes on human chromosome 7 that have two alternative transcripts each in the RefSeq annotation of hg18. This gave us structures of 179 template genes. Similar to the above experiments, we generated 5 samples as the control group and 5 samples as the case group. The mean IPV of the two isoforms was set to (0.5, 0.5) for the control group, and that for the case group was set to (0.0, 1.0), (0.1, 0.9), …, (0.5, 0.5). We used the software Flux Simulator to simulate RNA-seq reads of length 75 nt, at a range of different coverage depths. The details of the simulation procedures are given in the Supplementary Materials. We summarized the proportion of genes being called as DSGs by Cuffdiff and by NB-statistic when comparing different case groups with the control group. Table 3 lists the results in the simulation with a high coverage. Default parameters were used for both NB-statistic and Cuffdiff. We can see that when the case group IPV varies from (0.5, 0.5) to (0.0, 1.0), the percentage of genes detected as DSGs by NB-statistic varies from 1.1% to 63.1%. As IPV of (0.5, 0.5) in the case group is exactly the same as that in the control group, the 1.1% called DSGs must be false positive detections. If we take most genes as differentially spliced when IPV of the case group is (0.0, 1.0), then the detection rate of 63% gives some idea about the power of NB-statistic. There is a roughly constant percentage (5%) of genes that cannot pass the filter of NB-statistic for all situations. For Cuffdiff, when isoform annotation information is used, there is no false positive detection for the (0.5, 0.5) case group, but the detection power is only 45.8% when the IPV of
Table 2 Sensitivity (sn) and specificity (sp) of the NB-statistic method on Simulation S3, at different cutoffs and sequencing depths.
S3_MSDR = 0.4 S3_MSDR = 0.6
cutoff = 1.25
cutoff = 2.5
cutoff = 5
sn
sp
sn
sp
sn
sp
0.99 0.98
0.98 1.00
0.95 0.98
1.00 1.00
0.91 0.94
1.00 1.00
3.4. Applications on real rata We applied the proposed method for comparing the kidney and liver transcriptomes from the RNA-seq data published by Marioni et al. (2008). There are 7 technical replicates of each sample sequenced in two runs. We firstly used only one run in the experiment, with 3 kidney samples and 4 liver samples. Reads were mapped to human reference genome hg18 with TopHat (Trapnell et al., 2009). Multi-hit reads and junction reads were ignored in this study for simplicity. The average read count number in each sample is 5.6 million. We used 5 as the cutoff for NB-stat and identified 1349 genes out of the total 20,332 genes in hg18 as differentially spliced between the liver and kidney samples. Among them, 117 genes have NB-stat values larger than 20. Some examples of the genes detected as differentially spliced between these two groups are given in the Supplementary Materials. From the examples, we can see evidences of differential splicing in the reads' distribution of the detected genes. We also noticed that there are quite a few detected DS genes that have not been annotated as alternative splice genes in RefSeq, indicating new discoveries of alternative splicing genes. The examples can give direct feelings on how reads are distributed in real data and how the method performed. In the Supplementary Table S8, we provide the list of the 117 genes identified by the standard NB-statistic with NB-stat cutoff of 20. Further biological study is required for a systematic understanding of these genes' differential expression patterns and functions between kidney and liver tissues. We then compared the NB-statistic method with Cuffdiff on this dataset, with all the 7 liver samples and 7 kidney samples. The NB-statistic method detected 3146 genes with NB-stat≥ 5. For Cuffdiff, when the isoform structure annotation was used, 306 genes were detected as differentially spliced at the cutoff of p-value b 0.05. A large number of the genes (19,909) were reported as NOTEST by Cuffdiff. Only 1 gene was detected as differentially spliced when Cuffdiff did not use the isoform structure annotation but inferred isoform structures by the assembly. This is very different with its performance on the simulation data. One possible reason may be that the read length is shorter in this real dataset, allowing for fewer junction reads which are critical for reliable isoform inference. Some examples of genes that are detected as DSGs by NB-statistic and/or by Cuffdiff are given in the Supplementary Materials.
170
W. Wang et al. / Gene 518 (2013) 164–170
4. Discussion Understanding the alternative splicing of genes is very important for understanding the complex regulation system of gene expression. Next-generation sequencing technologies applied on RNA molecules have provided new opportunities for these investigations. But due to the complexity of the gene expression and splicing system as well as the imperfectness of the current data, there is still a gap between the RNA-seq data we can generate and the biological understanding we want to obtain from the data. From the examples of real RNA-seq data shown in the Supplementary Materials, one can feel that there is still a long way to go for both sequencing technologies and bioinformatics methods before we can build accurate models of the alternative splicing system based on the data. On the other hand, one can also perceive that identifying differences in splicing patterns or isoform proportions is relatively an easier task than inferring splicing isoforms and estimating their expressions. In this paper, we developed an NB-statistic method that can detect differentially spliced genes between two groups of samples without using a prior knowledge on the annotation of alternative splicing. It does not need to infer isoform structure or to estimate isoform expression. To our knowledge, it is the first method designed for comparing two groups of RNA-seq samples. Besides identifying differentially spliced genes, the method can highlight on the exons that contribute the most to the differential splicing. Such information cannot be obtained with isoform-based methods. Studying the exons with large NB-stat values, we can detect previously unknown alternative splicing events and differentially spliced exons that putatively have the most significant biological function. Simulation experiments showed that the proposed method performs well on both detecting differentially spliced genes and identifying the alternative exons. We applied the method on an RNA-seq dataset of multiple samples of human kidney and liver, and identified a list of differentially splicing genes. The results also showed the existence of alternative splicing events in many genes that have not been previously known to be alternatively spliced. The proposed NB-statistic method was designed for detecting differential splicing regardless of possible differences in the overall expression of the gene. That is, we only cared about different relative proportions of underlying isoforms. In our experiments on the real data, we noticed that there are some genes that have dramatic differences in their overall expression levels. We realized that the detection of differential splicing on such genes is less reliable as the read counts in some exons in the lowly expressed sample can become too small. This points to a future direction that we should consider the detection of differential overall expression and differential splicing of a gene in a unified framework. Another important question is what types of differences are biologically significant. This question existed when studying differential expression of genes, but it becomes more complicated when we are talking about alternatively spliced genes. A purely mathematical measurement may not be sufficient to address this question. Before we have a better understanding on the biology of alternative splicing, focusing on the functionally important isoforms or exons should be a promising direction toward answering the question. 4.1. Software availability We developed a software tool called DSGseq of the method presented. It is written in R and can run on all major computer platforms running Windows or Unix/Linux. The software tool is available at http://bioinfo. au.tsinghua.edu.cn/software/DSGseq for free academic use.
Supplementary data related to this article can be found online at http://dx.doi.org/10.1016/j.gene.2012.11.045. Competing interests The authors declare that they have no competing interests. Authors' contributions XZ initiated this work. WW, XW and XZ designed the study. WW developed the basic method with the help of ZF, XW and XZ. WW and ZQ developed the software. ZQ conducted the experiments on the simulation and real data. WW, ZQ, ZF, XW and XZ wrote the paper. Acknowledgments We thank Drs. Xiaowo Wang, Ting Zhang and our other colleagues Xinyun Ma, Huijuan Feng, Likun Wang and Chao Ye for their helpful discussions. This work is partially supported by the National Basic Research Program of China (2012CB316504), the Hi-tech Research and Development Program of China (2012AA020401) and NSFC grants 91010016 and 61021063. References Anders, S., Huber, W., 2010. Differential expression analysis for sequence count data. Genome Biol. 11, R106. Baggerly, K.A., Deng, L., Morris, J.S., Aldaz, C.M., 2003. Differential expression in SAGE: accounting for normal between-library variation. Bioinformatics 19 (12), 1477–1483. Fox, J., 2008. Generalized linear models. In: Fox, J. (Ed.), Applied Regression Analysis and Generalized Linear Models. Sage Publications Inc., pp. 379–424. Hardcastle, T., Kelly, K., 2010. BaySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinforma. 11, 422. Jiang, H., Wong, W.H., 2009. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25, 1026–1032. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., Gilad, Y., 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517. Pan, Q., Shai, O., Lee, L.J., Frey, B.J., Blencowe, B.J., 2008. Deep survey of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40 (12), 1413–1415. Richard, H., et al., 2010. Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res. 38 (10), e112. Roberts, A., Trapnell, C., Donaghey, J., Rinn, J.L., Pachter, L., 2011. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biol. 12, R22. Robinson, M.D., Smyth, G.K., 2007. Small sample estimation of negative binomial dispersion with applications to sage data. Biostatistics 9 (2), 321–332. Singh, D., et al., 2011. FDM: a graph-based statistical method to detect differential transcription using RNA-seq data. Bioinformatics 27 (19), 2633–2640. Stegle, O., Drewe, P., Bohnert, R., Borgwardt, K.M., Ratsch, G., 2010. Statistical tests for detecting differential RNA-transcript expression from read counts. Nat. Proc. http:// dx.doi.org/10.1038/npre.2010.4437.1. Trapnell, C., Pachter, L., Salzberg, S.L., 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25 (9), 1105–1111. Trapnell, C., et al., 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28 (5), 511–519. Wang, Z., et al., 2003. Computational analysis and experimental validation of tumorassociated alternative RNA splicing in human cancer. Cancer Res. 63, 655–657. Wang, E.T., et al., 2008. Alternative isoform regulation in human tissue transcriptomes. Nature 456 (7221), 470–476. Wang, L., Feng, Z., Wang, X., Wang, X., Zhang, X., 2010. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 26, 136–138. Wang, L., Wang, X., Wang, X., Liang, Y., Zhang, X., 2011. Observations on novel splice junctions from RNA sequencing data. Biochem. Biophys. Res. Commun. 409, 299–303. Wu, Z., Wang, X., Zhang, X., 2011. Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics 27 (4), 502–508.