CHAPTER
Computational methods for microRNA and PIWI-interacting RNA gene discovery and functional predictions
2
Chao Wu⁎, Xiaonan Zhao⁎, Peter M. Clark† Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, The Children’s Hospital of Philadelphia, Philadelphia, PA, United States⁎ Gene Therapy Program and Orphan Disease Center, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, United States†
1 Introduction miRNAs and piRNAs both play important roles in cellular posttranscriptional regulation and are involved in a variety of biological processes and human diseases. Despite recent progress in elucidating their roles in biological systems, formidable challenges remain [1]. Computational methods have been devised and deployed to generate testable hypotheses and thus reduce the scope and cost of functional experiments. Overall, in silico approaches have been developed to facilitate investigation in the following categories: motif detection in ncRNA molecules, prediction of target transcripts, and assessment of their functional involvement in biological processes and diseases [2]. In this chapter, we focus on recent developments in computational methods for the identification and functional assessment of two classes of the diverse family of non-coding RNAs (ncRNAs): microRNAs (miRNAs) and PIWI-interacting RNAs (piRNAs). We do not cover miRNA-target prediction since this topic is thoroughly discussed in other chapters of the book.
AGO-Driven Non-Coding RNAs. https://doi.org/10.1016/B978-0-12-815669-8.00002-6 © 2019 Elsevier Inc. All rights reserved.
2 Computational Methods for the Identification of MicroRNA Genes The prediction and annotation of microRNA (miRNA) genes rely on the integrated analysis of multimodal, high-dimensional datasets using a variety of computational approaches and a thorough understanding of miRNA biogenesis and function. We begin this section with a brief introduction on miRNA biogenesis and function, followed by an in-depth description of computational methods. Current methods for miRNA gene
discovery and annotation may be divided into two fundamental approaches based upon the type of input data utilized for novel miRNA gene discovery. Ab initio approaches rely exclusively on an organism’s reference genome and DNA sequence to identify putative miRNA encoding loci. The inherent limitation of ab initio approaches is that although a sequence of DNA may share some characteristic features with annotated miRNAs, this does not guarantee that a functional miRNA is expressed from that locus. Consequently, such approaches have high false-positive detection rates. In order to reduce the number of false positives and identify high-confidence, expressed miRNAs, numerous computational methods that utilize high-throughput RNA sequencing data have been developed. However, since miRNA expression is dynamic and depends upon cell state and type, such methods can identify only the subset of miRNA genes that are expressed in a given sample at the specific time points and under the specific conditions interrogated. Thus, the two approaches are ideally used in tandem to identify the plethora of putative miRNAs encoded by an organism’s genome, with added confidence given to those that are expressed and functionally validated.
2.1 MicroRNA Biogenesis and Function Mature miRNAs are short (~22 nt), single-stranded, non-coding RNA transcripts that are loaded onto the Argonaute-containing RNA-induced silencing complex (RISC) and mediate the posttranscriptional regulation of targeted transcripts by forming a heteroduplex with the targeted RNA transcript. The majority of mature miRNAs are formed through the canonical biogenesis pathway in which a primary miRNA (pri-miRNA) transcript is transcribed by RNA polymerase II, processed by the Drosha enzyme to produce a precursor miRNA (pre-miRNA), which is then exported from the nucleus and processed by the Dicer enzyme to produce two mature miRNA molecules (from the 5′ and 3′ arms) [3, 4]. Following Drosha/DGCR8 processing of the pri-miRNA transcript, the pre-miRNA transcript forms a characteristic hairpin-like secondary structure, containing both the 5′ and 3′ arms of the mature miRNA transcripts (Fig. 2.1). The inherent characteristics of the miRNA transcripts throughout the biogenesis process serve as deterministic features for miRNA gene discovery and characterization and will be explored throughout the chapter. Currently, there are 38,589 pre-miRNAs and 48,885 mature miRNAs annotated across 271 species within the miRNA database, miRBase (Fig. 2.2) [5–10]. The rate of miRNA discovery has remained relatively constant over the past 10 years, with ~3200 entries added to the database each year. With such growth in the number of annotated miRNAs, high-throughput computational approaches for the discovery and characterization of novel miRNA transcripts are essential to their quantitation and functional annotation.
2.2 Ab Initio MicroRNA Discovery Ab initio, or “from the beginning,” approaches for miRNA gene prediction are based solely upon the analysis of an organism’s reference genome sequence and do not make use of RNA expression profiling. This is a particularly complex task
FIG. 2.1 Typical read pileup of mapped RNA sequencing reads following short read alignment to the reference genome, with the depth of coverage shown across the pre-miRNA encoding locus of hsa-miR-21 (top). The predicted secondary structure of the pre-miRNA transcript of hsa-miR-21 (bottom). The characteristic stem-loop, hairpin structure of the pre-miRNA transcript is shown, with the 5′ and 3′ arms of the mature miRNAs shown in green on the top and bottom of the structure, respectively.
FIG. 2.2 The number of entries within miRBase (up to release 22) over time.
when one considers the spectrum of pre-miRNA transcripts across every organism, each with unique sequence and structural properties. Consequently, these computational methodologies rely on sequence conservation across various organisms and/or the secondary structure of the predicted pre-miRNA transcript to predict miRNA encoding loci. Although the numerous published methods of ab initio miRNA gene prediction utilize various filtering and selection criteria, the majority of algorithms are based upon deterministic features of the predicted pre-miRNA stem-loop structure, which may be predicted from the DNA sequence using Mfold, RNAfold, or UNAfold [11–13]. These thermodynamic features, along with base pairing propensity, hairpin length, hairpin loop length, and sequence characteristics, are among the features most commonly used by machine learning algorithms for pre-miRNA classification. Additional filtering may also be performed based upon sequence homology and phylogenetic analysis to identify miRNA sequences conserved across various organisms. Although such phylogenetic filters may remove potential false-positive predictions, they might also discard true-positive miRNA predictions that are species-specific. Numerous computational approaches extract features from the characteristic pre-miRNA stem-loop structure to train a machine learning classifier. Typically, these approaches use miRBase entries as the positive training set and other types of non-coding RNA transcripts (such as snoRNAs) as the negative training set. The methods mainly differ in the types and number of features used for classification and in the computational algorithm employed. Regardless of the downstream method employed, the DNA sequence of a reference genome must first be partitioned to identify putative pre-miRNA encoding loci. This is typically accomplished using either a sliding window or a k-mer composition approach [14, 15].
Following sequence partitioning, the putative pre-miRNA hairpins
that constitute the reduced search space are processed to extract features for machine learning algorithms. The overwhelming majority of computational approaches for pre-miRNA classification are based upon the support vector machine (SVM), although some algorithms utilize naïve Bayes classification or decision trees, each utilizing a unique set of features for classification as well as various positive and negative training sets. The recently published ensemble approach, izMiR, leverages 13 independent ab initio algorithms, unified into 6 different predictors, to identify high-confidence miRNA gene predictions [16]. Its authors' data demonstrate that no individual method is universally superior to any other for miRNA gene prediction; each has distinct advantages and disadvantages depending upon the specific training and validation datasets used. The inherent limitations of ab initio approaches for miRNA gene discovery include the difficulty of annotating the final, mature miRNA transcript and the propensity for high false-positive discovery rates, since a specific predicted pre-miRNA may not be expressed. In order to overcome these limitations, miRNA expression data may be utilized in combination with ab initio approaches to annotate high-confidence pre-miRNA encoding loci.
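The partitioning and feature extraction steps described in this section can be illustrated with a minimal sketch. It is not the implementation of any published tool: the function names, the window and step sizes, and the choice of GC content plus dinucleotide frequencies as features are all assumptions made for illustration. Real pipelines would add structural features (for example, the minimum free energy from a folding program), which are omitted here.

```python
from collections import Counter
from itertools import product


def sliding_windows(sequence, window=100, step=25):
    """Yield (start, subsequence) for overlapping windows across a sequence.

    A sequence shorter than `window` yields a single, truncated window.
    """
    for start in range(0, max(len(sequence) - window, 0) + 1, step):
        yield start, sequence[start:start + window]


def kmer_frequencies(sequence, k=2):
    """Return normalized k-mer frequencies over A/C/G/T in a fixed order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts["".join(kmer)] / total for kmer in product("ACGT", repeat=k)]


def gc_content(sequence):
    """Fraction of G and C bases in the sequence."""
    return (sequence.count("G") + sequence.count("C")) / max(len(sequence), 1)


def candidate_features(genome, window=100, step=25, k=2):
    """Build one feature vector (GC content + k-mer frequencies) per window."""
    rows = []
    for start, sub in sliding_windows(genome.upper(), window, step):
        rows.append((start, [gc_content(sub)] + kmer_frequencies(sub, k)))
    return rows


if __name__ == "__main__":
    toy_genome = "ACGT" * 50  # hypothetical 200-nt sequence
    for start, features in candidate_features(toy_genome):
        print(start, [round(x, 3) for x in features[:3]])
```

Each resulting vector could then be passed to a classifier such as an SVM, with annotated pre-miRNAs as positive examples and other ncRNAs as negative examples, as described above.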
2.3 High-Throughput RNA Sequencing Analysis for MicroRNA Discovery High-throughput sequencing of short RNA transcripts from various cell types and tissues is routinely used to quantify miRNA expression and has also been utilized to identify novel miRNA transcripts in various plant and animal species [17–25]. These methods rely on techniques similar to those used by ab initio approaches, with the addition of expression information for both the 5′ and 3′ mature miRNA transcripts. They have the added benefit of not only identifying the pre-miRNA encoding locus but also resolving the precise sequence of the mature miRNA transcript. These short transcripts are the final, functional products of the miRNA biogenesis pathway and may be readily quantified by performing high-throughput sequencing of isolated short RNA fragments (Fig. 2.1). Following sequencing, raw reads must be aligned to the reference genome of interest using short read alignment software packages. This step is of critical importance since spuriously mapped reads may lead to elevated false-positive rates downstream. However, obtaining properly mapped short RNA-seq reads is not a trivial task and requires significant computational resources. The short length of mature miRNA transcripts, along with the sheer size, complexity, and repetitive sequence structure of reference genomes, makes it challenging to properly align short RNA-seq reads. This challenge is further compounded by the fact that mature miRNA transcripts may be expressed from multiple loci and often share sequence homology with various other mature miRNA transcripts. In order to address these concerns, several short read alignment algorithms have been developed specifically for this purpose [26–28] and have been extensively reviewed elsewhere [29, 30]. Following read mapping, miRNA expression information may be used in concert with additional sequence and pre-miRNA structure information to predict both
pre-miRNA encoding loci and mature miRNA transcripts. These methods rely on the characteristic read peaks of the 5′ and/or 3′ arm of the pre-miRNA hairpin that give rise to the 5′ and 3′ mature miRNAs (Fig. 2.1). Differences in expression between the 5′ and 3′ arms are typical and can prevent resolution of the sequence of the less abundant arm if sequencing depth and read coverage are inadequate. Consequently, deep sequencing is typically performed in order to predict low-abundance pre-miRNA transcripts and resolve the sequences of both the 5′ and 3′ mature miRNAs. The general logic of these methods relies on first identifying expressed clusters of uniquely mapped reads that do not overlap with a known annotated genomic locus. These reads are then collected and used to excise potential pre-miRNA hairpin structures from the underlying DNA sequence of the reference genome. These putative pre-miRNA encoding loci are then filtered using various features of the mature miRNA sequences and of the pre-miRNA sequence and structure. The methods vary most widely in how they classify predicted pre-miRNA hairpins, using various machine learning and/or probabilistic approaches [31–38]. Read peaks may also be filtered to remove lowly expressed loci or loci with no clear Dicer cut site (poorly defined 3′ ends). Recently, the Mirnovo tool was developed to facilitate novel miRNA gene discovery in the absence of a reference genome [39]. Because this tool relies on sequence similarities among raw RNA-seq reads to identify novel miRNA transcripts, it is particularly advantageous for organisms with a poorly annotated reference genome.
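The general logic described above (cluster uniquely mapped reads, discard clusters that overlap known annotations, then excise the flanking genomic sequence as a candidate hairpin) can be sketched as follows. This is an illustrative simplification rather than the algorithm of any cited tool. The read tuple layout, the 0-based half-open coordinates, and the parameter values (`max_gap`, `min_depth`, `flank`) are assumptions; the excised sequences would still need to be folded and classified.

```python
def cluster_reads(reads, max_gap=30):
    """Group uniquely mapped reads into clusters of nearby reads.

    reads: iterable of (chrom, start, end, n_hits), where n_hits is the
    number of genomic locations the read aligned to (hypothetical layout).
    """
    unique = sorted(r for r in reads if r[3] == 1)  # keep unique mappers only
    clusters = []
    for chrom, start, end, _ in unique:
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and start - last["end"] <= max_gap:
            last["end"] = max(last["end"], end)  # extend the current cluster
            last["depth"] += 1
        else:
            clusters.append({"chrom": chrom, "start": start, "end": end, "depth": 1})
    return clusters


def overlaps(cluster, annotations):
    """True if the cluster overlaps any (chrom, start, end) annotation."""
    return any(
        chrom == cluster["chrom"] and start < cluster["end"] and cluster["start"] < end
        for chrom, start, end in annotations
    )


def candidate_loci(reads, annotations, genome, min_depth=5, flank=40, max_gap=30):
    """Return (chrom, start, end, sequence) for expressed, unannotated clusters."""
    loci = []
    for c in cluster_reads(reads, max_gap):
        if c["depth"] < min_depth or overlaps(c, annotations):
            continue  # drop lowly expressed or already annotated clusters
        start = max(c["start"] - flank, 0)
        end = min(c["end"] + flank, len(genome[c["chrom"]]))
        loci.append((c["chrom"], start, end, genome[c["chrom"]][start:end]))
    return loci
```

Discarding multi-mapped reads is the simplest choice for a sketch; because mature miRNAs can derive from multiple loci, tools in practice often treat such reads more carefully.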
2.4 Functional Validation of Predicted MicroRNA Genes Published guidelines for miRNA annotation extend beyond computational prediction to include requirements for experimental validation [40, 41]. Additional experimental validation is critical to assess the expression, biogenesis, and function of a predicted miRNA transcript before it is considered a bona fide miRNA gene. These guidelines include expression criteria (RT-qPCR, northern blot, RNA expression), biogenesis criteria (differential expression following Dicer, Drosha, or DGCR8 knockdown), and functional criteria (Ago loading and Ago silencing). Several recent studies have identified numerous novel miRNA transcripts through the analysis of high-throughput RNA sequencing datasets using a variety of computational methods outlined in previous sections of this chapter [18, 22, 24, 25, 42, 43], each performing experimental validation on a subset of the identified miRNAs to better estimate the false-positive discovery rate. Experimental validation of miRNA expression can be done by RNA sequencing, qPCR, or northern blot. Since most novel miRNA genes are discovered through RNA sequencing, their expression is already supported by the discovery data, but it must nonetheless be validated. Orthogonal validation of miRNA expression may be accomplished by performing a northern blot or, in a quantitative manner, by RT-qPCR. Because mature miRNAs are too short for standard RT-qPCR reactions, stem-loop RT-qPCR has been developed to quantitatively assess the expression of mature miRNA transcripts [44–46]. Dicer silencing can be accomplished through RNA interference in vitro by transducing cells with a lentiviral construct expressing an shRNA antisense to the
Dicer mRNA transcript. The Dicer transcript is subsequently silenced, resulting in an accumulation of pre-miRNA transcripts and an attenuation of mature miRNA transcripts. Differentially expressed mature miRNA transcripts are identified by comparing the abundance of the mature miRNA transcripts of interest before and after transduction. With Dicer rendered inactive, a bona fide miRNA would be expected to be downregulated following transduction, which can be quantitatively assessed by RNA-seq or RT-qPCR. The limitation of this approach is that it must be performed in vitro under conditions that may not be representative of the specific physiological conditions under which the miRNAs were initially characterized. These differences, including the transduction itself, may alter the transcriptome of the cell line. In addition, there is evidence that some miRNAs are not formed through the canonical biogenesis pathway and are instead formed in a Dicer-independent manner [47]. Similarly, Drosha, DGCR8, and Argonaute (Ago) proteins may also be knocked down in vitro by transfecting cells with specific siRNA constructs, resulting in the degradation of the targeted mRNA transcript and disruption of miRNA biogenesis and function [24]. Ago CLIP-seq studies (described in Chapter 10) have also been analyzed to provide functional evidence that an identified miRNA is loaded onto the RISC and exerts translational suppression of its targeted mRNA transcript. The RNA-seq data generated from these experiments provide functional evidence that a miRNA transcript is loaded onto the Ago protein after forming an energetically favorable heteroduplex structure with the targeted transcript [48].
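The before-and-after comparison used in a Dicer knockdown experiment can be illustrated with a small worked calculation. The sketch below normalizes raw read counts to counts per million (CPM) and reports a log2 fold change per transcript. The transcript names, counts, pseudocount, and the threshold of a twofold decrease are invented for illustration. The sketch also assumes that each library contains enough Dicer-independent small RNAs for library-size normalization to be meaningful, which would need to be checked in a real experiment.

```python
import math


def cpm(counts):
    """Scale raw read counts to counts per million mapped reads."""
    total = sum(counts.values())
    return {name: 1e6 * n / total for name, n in counts.items()}


def log2_fold_changes(control, knockdown, pseudocount=1.0):
    """log2(knockdown / control) on CPM values, with a pseudocount to avoid log(0)."""
    ctrl, kd = cpm(control), cpm(knockdown)
    return {
        name: math.log2((kd.get(name, 0.0) + pseudocount) / (ctrl[name] + pseudocount))
        for name in ctrl
    }


def dicer_dependent(control, knockdown, threshold=-1.0):
    """Names of transcripts whose log2 fold change is at or below the threshold."""
    changes = log2_fold_changes(control, knockdown)
    return sorted(name for name, value in changes.items() if value <= threshold)


if __name__ == "__main__":
    # Hypothetical read counts; "ref-small-RNA" stands in for a Dicer-independent species.
    control = {"candidate-1": 400, "candidate-2": 100, "ref-small-RNA": 500}
    knockdown = {"candidate-1": 50, "candidate-2": 95, "ref-small-RNA": 855}
    print(dicer_dependent(control, knockdown))
```

In this toy example only the first candidate falls by more than twofold after knockdown, which would be consistent with Dicer-dependent biogenesis. A real analysis would use replicates and a statistical test rather than a fixed cutoff.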
3 Computational Identification of PIWI-Interacting RNAs P element-induced wimpy testis (PIWI)-interacting RNAs (piRNAs) are the largest class of small non-coding RNA molecules found in animal cells [49, 50]. Being recently discovered, they are the least characterized among the Ago/Piwi protein-interacting small ncRNAs. The 19–33 nt long piRNAs were first described in the germline, where they are involved in transposon silencing at both the transcriptional and posttranscriptional levels [51, 52]. However, recent studies have revealed that piRNAs are expressed in a tissue-specific manner across a variety of human somatic tissue types. They exert silencing effects in epigenetic regulation, gene and protein regulation, chromosome rearrangement, and deleterious mutation, all of which may contribute to pathogenesis and have been linked to a number of cancers [53, 54]. It has been estimated that there could be up to ~2 × 10^5 potential piRNAs based on the number of sequences found in the mouse genome [55]. A number of computational methods have been proposed for the prediction of several types of ncRNAs such as miRNAs. However, the wide variation in piRNA sequences, the lack of secondary structures, and the diverse PIWI functions across different tissue types and species make it challenging to identify piRNAs and establish their functionality. Experimental methods, such as immunoprecipitation and deep sequencing, were developed to recognize piRNAs [56]. These methods are usually time-consuming and expensive; they also yield low sensitivity in many cases due to insufficient quantity of samples and unavailability of multiple tissues for comparison [57]. Thus, computational methods have been proposed to supplement and enhance
piRNA prediction. Specifically, as an efficient kernel-based machine learning model, the SVM has been widely used to address this problem. As an early attempt, Betel et al. developed a position-specific usage method to identify piRNA sequences, utilizing the fact that mouse piRNAs have some position-specific properties (e.g., guanine or adenine at the +1 position). However, the precision of this approach was only 61%–72% when lacking genome information for certain organisms [55]. Wang et al. used piRNA-transposon interaction information to extract features and built an SVM classifier, which predicted human, mouse, and rat piRNAs. The model achieved a specificity of 95% and sensitivity of 96%, with overall accuracy of 90.6%. However, this approach was limited to piRNAs whose sequences were aligned to transposons [58]. Brayet et al. extended the approach by incorporating multiple kernels in an SVM classifier to identify piRNAs in human and Drosophila. Their algorithm combined heterogeneous attributes such as sequence features and piRNA cluster features, as well as the distance to telomere/centromere regions, during the training process. The model achieved better accuracy on human piRNAs than previously published methods, albeit yielding unbalanced results on Drosophila (95% specificity and 83% sensitivity) [59]. In another effort to identify piRNA genes using an SVM model, Seyeddokht et al. incorporated structural attributes such as the minimum free energy (MFE) along with the sequence features during the training of the classifier. Their results outperformed existing methods that were built utilizing motif-based features alone [60]. However, none of the abovementioned methods could identify piRNAs with the function of instructing target mRNA deadenylation. To address this, Liu et al. developed a two-layer ensemble piRNA detection classifier, which integrated the physicochemical properties of nucleotides into the pseudo K-tuple nucleotide composition. Their model, 2L-piRNA, predicts piRNA sequences in the first layer and then, in the second layer, identifies whether each predicted piRNA has the function of instructing target mRNA deadenylation. They demonstrated that their approach outperformed the existing state-of-the-art methods in both accuracy and the Matthews correlation coefficient [61].
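To make the idea of position-specific sequence properties concrete, the following minimal sketch builds a position weight matrix (log-odds of each base at each position relative to a uniform background) from a set of known sequences and uses it to score candidates. It is a generic illustration of position-specific scoring, not a reimplementation of the method of Betel et al. or of any SVM-based classifier above. The training sequences are invented, and the uniform background and pseudocount are assumptions.

```python
import math


def build_pwm(sequences, length, pseudocount=1.0):
    """Log2-odds of each base at each of the first `length` positions.

    The background is assumed uniform (0.25 per base); a pseudocount keeps
    unseen bases from producing log(0).
    """
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGU"}
        for seq in sequences:
            if len(seq) > pos and seq[pos] in counts:
                counts[seq[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((c / total) / 0.25) for base, c in counts.items()})
    return pwm


def score(sequence, pwm):
    """Sum of per-position log-odds; unknown bases contribute zero."""
    return sum(column.get(base, 0.0) for base, column in zip(sequence, pwm))


if __name__ == "__main__":
    # Hypothetical training sequences, truncated to six positions for brevity.
    training = ["UGACGU", "UAACGA", "UGCCGU", "UCACGG"]
    pwm = build_pwm(training, length=6)
    for candidate in ("UGACGU", "CAGUAC"):
        print(candidate, round(score(candidate, pwm), 2))
```

A candidate that matches the positional preferences of the training set receives a positive score, while a dissimilar one scores negatively. The SVM-based methods discussed above combine many such sequence features with cluster, structural, or interaction features.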
4 Computational Methods to Predict MicroRNA’s Functions miRNAs guide Ago proteins to fully or partially complementary messenger RNA (mRNA) targets, which are then silenced posttranscriptionally [62]. A rapidly growing number of studies have shown that miRNAs play pivotal regulatory roles in various biological processes and in the progression of diseases [63–65]. While it is meaningful to identify the transcription factor and mRNA targets of miRNAs, it is critical to uncover their functional involvement in complex disease mechanisms such as those of diabetes, cardiovascular diseases, and cancers [66, 67]. Relying only on knock-out experiments to discover such associations creates bottlenecks owing to the high costs of these procedures [68]. Therefore, increasing efforts have been made to infer miRNAs’ functions using in silico methods as supplements. Computational approaches may help elucidate miRNAs’ regulatory mechanisms in diseases and thus greatly reduce the prospective experimental workload. The fundamental
FIG. 2.3 Orthodox in silico framework of miRNA functional analysis.
hypothesis in such analyses is that if the targets of a specific miRNA are enriched with genes annotated with certain biological processes or pathways, it is then reasonable to predict that miRNA is involved in the same process, as indicated in Fig. 2.3.
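The enrichment hypothesis above is commonly tested with a hypergeometric (one-sided Fisher) test: given a background of N genes, of which K are annotated to a process, how surprising is it that k of a miRNA's n targets carry that annotation? The following is a minimal sketch under stated assumptions. The function names, the Bonferroni correction, and the significance cutoff are illustrative choices, not those of any specific tool discussed in this chapter.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which are annotated to the term of interest."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def enriched_terms(targets, annotations, background, alpha=0.05):
    """Return (term, overlap, adjusted p-value) for terms enriched among targets.

    annotations: dict mapping a term (process or pathway) to its gene set.
    P-values are Bonferroni-adjusted for the number of terms tested.
    """
    background = set(background)
    targets = set(targets) & background
    results = []
    for term, genes in annotations.items():
        genes = set(genes) & background
        overlap = len(targets & genes)
        p = hypergeom_sf(overlap, len(background), len(genes), len(targets))
        p_adjusted = min(p * len(annotations), 1.0)
        if p_adjusted <= alpha:
            results.append((term, overlap, p_adjusted))
    return sorted(results, key=lambda row: row[2])


if __name__ == "__main__":
    # Hypothetical gene universe, annotations, and predicted targets.
    background = [f"g{i}" for i in range(100)]
    annotations = {
        "apoptosis": [f"g{i}" for i in range(10)],
        "metabolism": [f"g{i}" for i in range(50, 90)],
    }
    targets = [f"g{i}" for i in range(8)] + ["g50", "g51"]
    print(enriched_terms(targets, annotations, background))
```

In this toy example, eight of the ten targets fall in a ten-gene process, so that process is reported as enriched, whereas the larger process with only two overlapping targets is not.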
4.1 In Silico Methods and Tools to Infer MicroRNAs’ Function Following this hypothesis, an intuitive approach to infer a miRNA’s functions is to identify biological processes and pathways for which the number of its molecular targets taking part in the process is statistically significant. Specifically, the systematic inverse correlations between expressions of miRNAs and those of their target transcripts offer more insight into the miRNAs’ regulatory functions. For instance, Cheng and Li devised a computational method to analyze miRNAs’ relative activities and their putative mRNA targets simultaneously using microarray expression data [69]. Since the expression levels of miRNAs were not generally available, their method examined the expression patterns of the targets of a miRNA. If the target genes appeared to be downregulated, it indicated that the effective activity of this
miRNA had increased between the two conditions (i.e., healthy and disease). In a similar study, Xiao et al. proposed a multistep method to identify dysfunctional miRNA-mRNA regulatory modules (MRMs) by jointly analyzing differentially expressed miRNAs and their target mRNAs [70]. They then applied the method to glioblastoma and discovered both regulatory modules that were consistent across different subtypes and modules that were subtype-specific. Furthermore, several computational methods and tools have been developed in recent years to facilitate this exploration systematically. Some of the tools are summarized in Table 2.1.
Table 2.1 Web-based tools to analyze miRNA functions through mRNA targets
Resource/tool | URL | Summary
miRGator [71] | http://mirgator.kobic.re.kr/ | miRGator is a web portal encompassing miRNA diversity and expression, with a focus on deep sequencing data. It also enables users to visualize and examine short read information.
MAGIA [72] | http://gencomp.bio.unipd.it/magia2 | MAGIA performs an integrative analysis, supporting different organisms and target prediction algorithms, to unveil regulatory circuits involving either miRNAs or transcription factors.
MMIA [73] | http://epigenomics.snu.ac.kr/biovlab_mmia_ngs/ | MMIA is a web server that integrates miRNA and mRNA expression data with predicted target information to analyze biological functions by gene set analysis. The latest version of MMIA is compatible with high-throughput sequencing platforms.
miTALOS [74] | http://mips.helmholtz-muenchen.de/mitalos | miTALOS is a web tool that incorporates high-quality miRNA targeting data, high-throughput sequencing-based gene expression data, and pathway data. It provides insights into tissue-specific miRNA regulation of biological pathways.
ToppMir [75] | https://toppmir.cchmc.org/ | ToppMir is a web-based workbench that learns about biological contexts based on gene-associated information from user input. It then trains a machine learning-based model to prioritize the most significant miRNAs in the biological system.
FAME [68] | http://acgt.cs.tau.ac.il/fame | FAME infers the biological processes affected by miRNAs. It depends on a compendium of miRNA-pathway and miRNA-process associations constructed by the authors. The tool also predicts novel miRNA-regulated pathways.
Note: This list is not exhaustive and is primarily meant to provide a list of examples. We apologize for any oversights.
4.2 Predict MicroRNA-Disease Associations Since the early discoveries of miRNAs, elucidating a miRNA’s functional association with specific diseases has remained one of the most intriguing research domains [76]. An increasing number of in silico methods, which often employ machine learning algorithms, have been proposed [77]. These methods usually leverage knowledge such as known disease-related genes and predicted miRNA-target interactions, and then predict potential miRNA-disease associations based on the “Guilt by Association” principle. For instance, under the assumption that phenotypically similar diseases tend to be associated with functionally related miRNAs, Jiang et al. proposed a computational model based on the hypergeometric distribution to infer potential miRNA-disease associations by ranking miRNAs for the disease of interest [78]. In another study, Chen and Zhang developed a global network similarity measurement to predict novel miRNA-disease associations. In this approach, they calculated miRNA-based similarity inference, phenotype-based similarity inference, and network consistency-based inference for each known miRNA-disease pair and demonstrated that the network consistency-based inference achieved the best performance on the training data. They then applied this metric to predict novel associations and concluded that their approach was especially useful for miRNAs whose target association information was not available [79]. Using an SVM trained on a set of features including functional similarity scores between miRNAs and disease phenotype similarities, Jiang et al. distinguished positive disease miRNAs from negative ones, attaining an area under the curve (AUC) of 0.89 in a 10-fold cross-validation [80].
In addition to supervised learning-based methods, Chen and Yan developed Regularized Least Squares for miRNA-Disease Association (RLSMDA), which was a semi-supervised learning approach and did not require negative samples, to uncover the relationships between miRNAs and diseases [81]. During the validation of their model in hepatocellular and lung cancer, 80% and 84% of the top 50 predicted miRNAs of these two diseases had been confirmed by previous biological experiments, respectively.
4.3 Network-Based Approaches for MicroRNA’s Functional Prediction A single miRNA may regulate up to hundreds of molecular targets in a biological system [82]. In addition, combinatorial effects of miRNAs targeting members of the same biological pathways have been discovered [83]. For example, Mavrakis et al. experimentally showed that miR-19b, miR-20a, miR-26a, miR-92, and miR-223 promoted the development of T-cell acute lymphoblastic leukemia (T-ALL) by cooperatively regulating a group of tumor suppressor genes including PTEN, NF1, and FBXW7 [84]. Network-based computational approaches have therefore proven effective in assessing miRNAs’ functional involvement. In an early effort to reveal co-target relationships of miRNAs, Yoon and De Micheli modeled this biological mechanism by constructing a bipartite graph, where one side of the graph
comprised miRNAs while the other side comprised mRNAs, and the miRNAs and their mRNA targets were connected by edges whose weights corresponded to the inferred binding strengths [85]. Following this, they employed a graph-mining algorithm to discover bicliques in which all of the edges have similar weights in the given bipartite graph; the miRNAs in a biclique could be viewed as regulating the same group of mRNA targets in that biclique. Peng et al. further improved this method by combining putative miRNA-target predictions with paired expression profiles of miRNAs and mRNA targets. By applying this method to hepatitis C virus infection, they recognized 38 MRMs that were supported by known biological processes [86]. Network-based approaches were also proposed to predict miRNA–disease relationships. Chen et al. compiled a miRNA-miRNA functional similarity network where each pair of miRNAs was evaluated based on their mRNA targets’ associations with diseases. They then implemented a random walk with restart model to predict novel disease miRNAs [87]. They demonstrated the utility of the approach on three types of cancer, where 98% (breast cancer), 74% (colon cancer), and 88% (lung cancer) of the top 50 predicted miRNAs were previously confirmed. Liu et al. extended this method by constructing a heterogeneous network connecting a disease similarity subnetwork and a miRNA similarity subnetwork. The disease similarity subnetwork was composed of disease semantic similarities and functional similarities, while the miRNA similarity subnetwork was compiled using miRNA-mRNA pairs and miRNA-lncRNA (long non-coding RNA) associations. Following this, they utilized the same random walk model to predict miRNA-disease associations and achieved an average AUC of 0.8 across 341 diseases and 476 miRNAs [88]. In another study, Chen et al.
generated a miRNA-disease network integrating known miRNA-disease associations, miRNA functional similarities, disease semantic similarities, and Gaussian interaction profile kernel similarities. They then developed the concepts of within- and between-scores for both miRNAs and diseases and combined these metrics to investigate miRNA-disease relationships, where within- and between-scores corresponded to known disease-associated miRNAs and unknown disease-associated miRNAs, respectively [89]. Their approach achieved an average AUC of 0.803 among 5430 experimentally validated miRNA-disease associations.
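The random walk with restart model mentioned in this section can be sketched in a few lines. Starting from miRNAs already known to be associated with a disease (the seeds), a walker repeatedly moves to neighboring miRNAs in proportion to their similarity weights and, with a fixed probability, jumps back to the seeds. The steady-state visiting probabilities rank the remaining miRNAs. The code below is a generic illustration and not the implementation used in the cited studies. The network, weights, restart probability, and function name are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-8, max_iter=1000):
    """Return steady-state visiting probabilities for every node.

    adjacency: dict mapping node -> {neighbor: weight}; every neighbor must
    also be a key of `adjacency`, and weights are assumed symmetric.
    seeds: set of nodes where the walk starts and restarts.
    """
    nodes = sorted(adjacency)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    strength = {v: sum(adjacency[v].values()) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}  # restart component
        for u in nodes:
            if strength[u] == 0:
                continue  # isolated node: nothing to propagate
            for v, w in adjacency[u].items():
                nxt[v] += (1 - restart) * p[u] * w / strength[u]
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


if __name__ == "__main__":
    # Hypothetical functional similarity network of four miRNAs in a chain.
    network = {
        "miR-a": {"miR-b": 1.0},
        "miR-b": {"miR-a": 1.0, "miR-c": 1.0},
        "miR-c": {"miR-b": 1.0, "miR-d": 1.0},
        "miR-d": {"miR-c": 1.0},
    }
    scores = random_walk_with_restart(network, seeds={"miR-a"})
    ranked = sorted((m for m in scores if m != "miR-a"), key=scores.get, reverse=True)
    print(ranked)
```

Candidates closer to the seed in the similarity network receive higher scores. The restart probability controls how local the ranking is: larger values keep the walker nearer the known disease miRNAs.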
5 Exploration of PIWI-Interacting RNA’s Functions Using In Silico Methods P-element-induced wimpy testis (PIWI)-interacting RNAs (piRNAs) are the largest class of small non-coding RNA molecules found in animal cells [90]. Compared to miRNAs, it has been a challenge to establish the functionality of piRNAs owing to the wide variation in their sequences and functions across different species [91]. Nonetheless, an increasing number of studies have uncovered the functions of piRNAs in the epigenetic and posttranscriptional regulation of transposons and genes [92], although the majority of these studies to date have been performed in an ad hoc manner.
The first function discovered for piRNAs was the defense of genome integrity: they guide PIWI proteins to silence transposable elements (TEs), which pose a high risk of deleterious effects on their hosts in different species [93]. In addition, piRNA-mediated posttranscriptional silencing has been observed to affect mRNA transcripts as well, suggesting that piRNAs may play a key role in the regulation of gene expression [94]. In a recent study, Yuan et al. trained a computational classifier to predict potential piRNA targets among protein-coding genes in a mouse model. They used previously identified piRNA target sites as the positive examples in the training set, while the negative examples were carefully selected as genes showing little or no expression change between the wild-type model and the MIWI (murine homolog of the piwi protein) mutant model. A collection of CLIP-Seq-derived features and position-based features was extracted to train an SVM classifier. The model achieved an AUC of 0.87 on the training samples and was then used to scan the entire genome to identify putative mRNA targets. In total, their method generated a list of 3781 mRNAs from 2587 mouse genes that hosted over 12,000 potential target sites which may be cleaved by piRNAs, and a substantial subset of their predictions was verified by independent microarray datasets [94]. In addition, it has been demonstrated that piRNAs may be implicated in a variety of diseases through posttranscriptional and epigenetic regulation [95]. Specifically, piRNAs are involved in cancer cell proliferation, apoptosis, metastasis, and invasion and may serve as potential diagnostic/prognostic biomarkers during cancer progression [96]. To account for the tissue specificity of piRNAs, several studies have been performed to generate the molecular signature of piRNAs in different types of cancer. For instance, Singh et al. investigated piRNAs and their associated proteins in ovarian cancer using qPCR data and western blotting.
Compared to normal ovarian samples, their study identified 256 and 234 differentially expressed piRNAs in endometrioid and serous ovarian cancer samples, respectively. Furthermore, through the integrated analysis of piRNAs and mRNA expression levels, the authors discovered four gene targets of piRNA in endometrioid ovarian cancer and three gene targets of piRNA in serous ovarian cancer which were functionally enriched of tumor pathogenesis and progression [97]. Roy and Mallick carried out a similar study on neuroblastoma and unveiled the landscape of the differentially expressed piRNAs and their potential mRNA targets [98]. From an integrated study on 77 RNA-seq datasets, Krishnan et al. identified a panel of 30 piRNAs that play important roles in large-scale modulation of head and neck squamous cell carcinoma (HNSCC) in addition to their direct, smaller-scale interactions in the malignancy [99]. Besides cancers, piRNAs have also been suggested to be involved in the pathogenesis of neurodegenerative disorders such as Alzheimer disease [100]. These growing evidences indicate the potential of piRNAs as novel clinical biomarkers in disease detection, classification, and treatment [101]. Emerging databases and resources have been compiled to provide sequence and annotation information of piRNAs across organisms. Some of the databases are summarized in Table 2.2. We believe these resources can facilitate to query and interrogate piRNA information and may lead to novel functional discoveries of the piRNAs.
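As a minimal sketch of what an AUC such as the 0.87 reported for the piRNA target classifier in this section measures, the Python snippet below computes the statistic directly from classifier scores as the fraction of positive/negative pairs in which the positive example receives the higher score. The function name `auc_from_scores` and the toy labels and scores are hypothetical illustrations and are not taken from Yuan et al.; the snippet does not reproduce their SVM or their features.

```python
def auc_from_scores(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    example scores higher than a randomly chosen negative example.
    Ties between a positive and a negative count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = known target site, 0 = non-target site.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
# Hypothetical classifier decision scores for the same eight sites.
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.7, 0.6]

toy_auc = auc_from_scores(labels, scores)
print(f"AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of positives from negatives. An AUC measured on the same samples used for training tends to be optimistic, so held-out or cross-validated estimates are generally preferable when comparing classifiers.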
Table 2.2 Web-based tools for piRNA functional studies

Resource/tool | URL | Summary
piRNABank [102] | http://pirnabank.ibab.ac.in/ | piRNABank is a repository that hosts piRNA information in human, mouse, and rat. It enables users to query and interact with sequences, chromosome coordinates, and cluster information of known piRNAs.
piRNAQuest [103] | http://bicresources.jcbose.ac.in/zhumur/pirnaquest/ | piRNAQuest is a database that provides comprehensive piRNA annotations in human, mouse, and rat. In addition to sequence and cluster information of the piRNAs, the database also hosts expression profiles of piRNAs in different tissues and from distinct developmental stages.
piRBase [104] | http://www.regulatoryrna.org/database/piRNA/ | piRBase is a platform that contains 77 million piRNA sequences from nine different species. Furthermore, piRBase provides known piRNA-mRNA regulation information in animal models as well as tissue-specific piRNA expression profiles to assist functional studies.
piRNAtarget [105] | http://120.108.102.11/~sophia/piRNAtarget | piRNAtarget is an integrated database that curates information such as sequences, parental genes, targets, expression, and methylation profiles of human piRNAs.
Note: This list is not exhaustive and is primarily meant to provide a list of examples. We apologize for any oversights.
6 Conclusion
In this chapter, we have reviewed recent developments and trends in the use of in silico methods to explore a diverse class of small regulatory RNA genes, with a focus on miRNAs and piRNAs. While computational prediction of miRNA-mRNA targets continues to be an interesting research domain, this chapter concentrates on the growing efforts to elucidate the sequence, structure, and physiological function of these molecules through systems biology approaches. The developed and evolving computational methods for miRNA gene identification and annotation have led to the discovery of thousands of novel miRNA transcripts across several species and have facilitated the study of their involvement in a variety of physiological processes. Although the first piRNA was described after miRNAs were first discovered, the number of annotated piRNA transcripts has far outpaced the number of annotated miRNA transcripts. As the cost of high-throughput DNA and RNA sequencing continues to drop, and better computational methodologies are developed, it is likely that the number of annotated ncRNAs will continue to increase. Together with rapidly developing single-cell RNA and DNA sequencing techniques, these computational tools will enable the elucidation of novel physiological processes that are mediated by miRNAs and piRNAs within various cell types, paving the way for the intelligent design of functional validation experiments and an increase in our understanding of the physiological role of this important class of ncRNAs in human health and disease.
References
[1] Lindow M, Gorodkin J. Principles and limitations of computational microRNA gene and target finding. DNA Cell Biol 2007;26(5):339–51.
[2] Li L, et al. Computational approaches for microRNA studies: a review. Mamm Genome 2010;21(1–2):1–12.
[3] Ha M, Kim VN. Regulation of microRNA biogenesis. Nat Rev Mol Cell Biol 2014;15(8):509–24.
[4] Winter J, et al. Many roads to maturity: microRNA biogenesis pathways and their regulation. Nat Cell Biol 2009;11(3):228–34.
[5] Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 2014;42(Database issue):D68–73.
[6] Kozomara A, Griffiths-Jones S. miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Res 2011;39(Database issue):D152–7.
[7] Griffiths-Jones S. miRBase: microRNA sequences and annotation. Curr Protoc Bioinformatics 2010; Chapter 12, Unit 12.9.1–10.
[8] Griffiths-Jones S, et al. miRBase: tools for microRNA genomics. Nucleic Acids Res 2008;36(Database issue):D154–8.
[9] Griffiths-Jones S, et al. miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res 2006;34(Database issue):D140–4.
[10] Griffiths-Jones S. miRBase: the microRNA sequence database. Methods Mol Biol 2006;342:129–38.
[11] Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 2003;31(13):3406–15.
[12] Lorenz R, et al. ViennaRNA Package 2.0. Algorithms Mol Biol 2011;6:26.
[13] Markham NR, Zuker M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 2008;453:3–31.
[14] Liu B, et al. Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. J Theor Biol 2015;385:153–9.
[15] Wei L, et al. Improved and promising identification of human MicroRNAs by incorporating a high-quality negative set. IEEE/ACM Trans Comput Biol Bioinform 2013;11(1):192–201.
[16] Sacar Demirci MD, Baumbach J, Allmer J. On the performance of pre-microRNA detection algorithms. Nat Commun 2017;8(1):330.
[17] Clark PM, et al. Novel and haplotype specific microRNAs encoded by the major histocompatibility complex. Sci Rep 2018;8(1):3832.
[18] Londin E, et al. Analysis of 13 cell types reveals evidence for the expression of numerous novel primate- and tissue-specific microRNAs. Proc Natl Acad Sci U S A 2015;112(10):E1106–15.
[19] Jain M, Chevala VV, Garg R. Genome-wide discovery and differential regulation of conserved and novel microRNAs in chickpea via deep sequencing. J Exp Bot 2014;65(20):5945–58.
[20] Li H, et al. Deep sequencing discovery of novel and conserved microRNAs in wild type and a white-flesh mutant strawberry. Planta 2013;238(4):695–713.
[21] Ladewig E, et al. Discovery of hundreds of mirtrons in mouse and human small RNA data. Genome Res 2012;22(9):1634–45.
[22] Jima DD, et al. Deep sequencing of the small RNA transcriptome of normal and malignant human B cells identifies hundreds of novel microRNAs. Blood 2010;116(23):e118–27.
[23] Minatel BC, et al. Large-scale discovery of previously undetected microRNAs specific to human liver. Hum Genomics 2018;12(1):16.
[24] Friedlander MR, et al. Evidence for the biogenesis of more than 1,000 novel human microRNAs. Genome Biol 2014;15(4):R57.
[25] Meiri E, et al. Discovery of microRNAs and other small RNAs in solid tumors. Nucleic Acids Res 2010;38(18):6234–46.
[26] Liu Y, Popp B, Schmidt B. CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding. PLoS One 2014;9(1):e86869.
[27] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25(14):1754–60.
[28] David M, et al. SHRiMP2: sensitive yet practical short read mapping. Bioinformatics 2011;27(7):1011–2.
[29] Ziemann M, Kaspi A, El-Osta A. Evaluation of microRNA alignment techniques. RNA 2016;22(8):1120–38.
[30] Ye H, et al. Alignment of short reads: a crucial step for application of next-generation sequencing data in precision medicine. Pharmaceutics 2015;7(4):523–41.
[31] An J, et al. miRPlant: an integrated tool for identification of plant miRNA from RNA sequencing data. BMC Bioinformatics 2014;15:275.
[32] An J, et al. miRDeep*: an integrated application tool for miRNA identification from RNA sequencing data. Nucleic Acids Res 2013;41(2):727–37.
[33] Yang X, Li L. miRDeep-P: a computational tool for analyzing the microRNA transcriptome in plants. Bioinformatics 2011;27(18):2614–5.
[34] Friedlander MR, et al. Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol 2008;26(4):407–15.
[35] Friedlander MR, et al. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Res 2012;40(1):37–52.
[36] Higashi S, et al. Mirinho: An efficient and general plant and animal pre-miRNA predictor for genomic and deep sequencing data. BMC Bioinformatics 2015;16:179.
[37] Evers M, et al. miRA: adaptable novel miRNA identification in plants using small RNA sequencing data. BMC Bioinformatics 2015;16:370.
[38] Paicu C, et al. miRCat2: accurate prediction of plant and animal microRNAs from next-generation sequencing datasets. Bioinformatics 2017;33(16):2446–54.
[39] Vitsios DM, et al. Mirnovo: genome-free prediction of microRNAs from small RNA sequencing data and single-cells using decision forests. Nucleic Acids Res 2017;45(21):e177.
[40] Ambros V, et al. A uniform system for microRNA annotation. RNA 2003;9(3):277–9.
[41] Meyers BC, et al. Criteria for annotation of plant MicroRNAs. Plant Cell 2008;20(12):3186–90.
[42] Joyce CE, et al. Deep sequencing of small RNAs from human skin reveals major alterations in the psoriasis miRNAome. Hum Mol Genet 2011;20(20):4025–40.
[43] Ple H, et al. The repertoire and features of human platelet microRNAs. PLoS One 2012;7(12):e50746.
[44] Hurley J, et al. Stem-loop RT-qPCR for microRNA expression profiling. Methods Mol Biol 2012;822:33–52.
[45] Kramer MF. Stem-loop RT-qPCR for miRNAs. Curr Protoc Mol Biol 2011; Chapter 15, Unit 15.10.
[46] Chen C, et al. Real-time quantification of microRNAs by stem-loop RT-PCR. Nucleic Acids Res 2005;33(20):e179.
[47] Herrera-Carrillo E, Berkhout B. Dicer-independent processing of small RNA duplexes: mechanistic insights and applications. Nucleic Acids Res 2017;45(18):10369–79.
[48] Xia Z, et al. Molecular dynamics simulations of ago silencing complexes reveal a large repertoire of admissible ‘seed-less’ targets. Sci Rep 2012;2:569.
[49] Aravin A, et al. A novel class of small RNAs bind to MILI protein in mouse testes. Nature 2006;442(7099):203.
[50] Grivna ST, et al. A novel class of small RNAs in mouse spermatogenic cells. Genes Dev 2006;20(13):1709–14.
[51] Cox DN, et al. A novel class of evolutionarily conserved genes defined by piwi are essential for stem cell self-renewal. Genes Dev 1998;12(23):3715–27.
[52] Gunawardane LS, et al. A slicer-mediated mechanism for repeat-associated siRNA 5' end formation in Drosophila. Science 2007;315(5818):1587–90.
[53] Baylin SB. DNA methylation and gene silencing in cancer. Nat Rev Clin Oncol 2005;2(S1):S4.
[54] Siddiqi S, Matushansky I. Piwis and piwi-interacting RNAs in the epigenetics of cancer. J Cell Biochem 2012;113(2):373–80.
[55] Betel D, et al. Computational analysis of mouse piRNA sequence and biogenesis. PLoS Comput Biol 2007;3(11):e222.
[56] Yin H, Lin H. An epigenetic activation role of Piwi and a Piwi-associated piRNA in Drosophila melanogaster. Nature 2007;450(7167):304.
[57] Carmen L, et al. Existence of snoRNA, microRNA, piRNA characteristics in a novel noncoding RNA: x-ncRNA and its biological implication in Homo sapiens. J Bioinformatics Sequence Anal 2009;1(2):031–40.
[58] Wang K, et al. Prediction of piRNAs using transposon interaction and a support vector machine. BMC Bioinformatics 2014;15(1):419.
[59] Brayet J, et al. Towards a piRNA prediction using multiple kernel fusion and support vector machine. Bioinformatics 2014;30(17):i364–70.
[60] Seyeddokht A, et al. Computational detection of piRNA in human using support vector machine. Avicenna J Med Biotechnol 2016;8(1):36.
[61] Liu B, Yang F, Chou K-C. 2L-piRNA: a two-layer ensemble classifier for identifying piwi-interacting RNAs and their function. Mol Ther Nucleic Acids 2017;7:267–77.
[62] Eulalio A, Huntzinger E, Izaurralde E. Getting to the root of miRNA-mediated gene silencing. Cell 2008;132(1):9–14.
[63] Karp X, Ambros V. Encountering microRNAs in cell fate signaling. Science 2005;310(5752):1288–9.
[64] Alshalalfa M, Alhajj R. Using context-specific effect of miRNAs to identify functional associations between miRNAs and gene signatures. BMC Bioinformatics 2013;14(Suppl. 12):S1.
[65] Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell 2009;136(2):215–33.
[66] Tyagi AC, Sen U, Mishra PK. Synergy of microRNA and stem cell: a novel therapeutic approach for diabetes mellitus and cardiovascular diseases. Curr Diabetes Rev 2011;7(6):367–76.
[67] Iorio MV, et al. MicroRNA gene expression deregulation in human breast cancer. Cancer Res 2005;65(16):7065–70.
[68] Ulitsky I, Laurent LC, Shamir R. Towards computational prediction of microRNA function and activity. Nucleic Acids Res 2010;38(15):e160.
[69] Cheng C, Li LM. Inferring microRNA activities by combining gene expression with microRNA target prediction. PLoS One 2008;3(4):e1989.
[70] Xiao Y, et al. Identifying dysfunctional miRNA-mRNA regulatory modules by inverse activation, cofunction, and high interconnection of target genes: a case study of glioblastoma. Neuro Oncol 2013;15(7):818–28.
[71] Cho S, et al. MiRGator v3.0: a microRNA portal for deep sequencing, expression profiling and mRNA targeting. Nucleic Acids Res 2012;41(D1):D252–7.
[72] Bisognin A, et al. MAGIA2: from miRNA and genes expression data integrative analysis to microRNA–transcription factor mixed regulatory circuits (2012 update). Nucleic Acids Res 2012;40(W1):W13–21.
[73] Chae H, et al. BioVLAB-MMIA-NGS: microRNA–mRNA integrated analysis using high-throughput sequencing data. Bioinformatics 2015;31(2):265–7.
[74] Preusse M, Theis FJ, Mueller NS. miTALOS v2: analyzing tissue specific microRNA function. PLoS One 2016;11(3):e0151771.
[75] Wu C, et al. ToppMiR: ranking microRNAs and their mRNA targets based on biological functions and context. Nucleic Acids Res 2014;42(W1):W107–13.
[76] Jiang Q, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res 2008;37(Suppl. 1):D98–D104.
[77] Lu M, et al. An analysis of human microRNA and disease associations. PLoS One 2008;3(10):e3420.
[78] Jiang Q, et al. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol 2010;4(Suppl. 1):S2.
[79] Chen H, Zhang Z. Similarity-based methods for potential human microRNA-disease association prediction. BMC Med Genomics 2013;6(1):12.
[80] Jiang Q, et al. Predicting human microRNA-disease associations based on support vector machine. Int J Data Min Bioinform 2013;8(3):282–93.
[81] Chen X, Yan G-Y. Semi-supervised learning for potential human microRNA-disease associations inference. Sci Rep 2014;4:5501.
[82] Hendrickson DG, et al. Concordant regulation of translation and mRNA abundance for hundreds of targets of a human microRNA. PLoS Biol 2009;7(11):e1000238.
[83] Liu B, Li J, Tsykin A. Discovery of functional miRNA–mRNA regulatory modules with computational methods. J Biomed Inform 2009;42(4):685–91.
[84] Mavrakis KJ, et al. A cooperative microRNA-tumor suppressor gene network in acute T-cell lymphoblastic leukemia (T-ALL). Nat Genet 2011;43(7):673.
[85] Yoon S, De Micheli G. Prediction of regulatory modules comprising microRNAs and target genes. Bioinformatics 2005;21(Suppl. 2):ii93–ii100.
[86] Peng X, et al. Computational identification of hepatitis C virus associated microRNA-mRNA regulatory modules in human livers. BMC Genomics 2009;10(1):373.
[87] Chen X, Liu M-X, Yan G-Y. RWRMDA: predicting novel human microRNA–disease associations. Mol Biosyst 2012;8(10):2792–8.
[88] Liu Y, et al. Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans Comput Biol Bioinform 2017;14(4):905–15.
[89] Chen X, et al. WBSMDA: within and between score for miRNA-disease association prediction. Sci Rep 2016;6:21106.
[90] Girard A, et al. A germline-specific class of small RNAs binds mammalian Piwi proteins. Nature 2006;442(7099):199.
[91] Wang G, Reinke V. A C. elegans Piwi, PRG-1, regulates 21U-RNAs during spermatogenesis. Curr Biol 2008;18(12):861–7.
[92] Reuter M, et al. Miwi catalysis is required for piRNA amplification-independent LINE1 transposon silencing. Nature 2011;480(7376):264.
[93] Post C, et al. The capacity of target silencing by Drosophila PIWI and piRNAs. RNA 2014;20(12):1977–86.
[94] Yuan J, et al. Computational identification of piRNA targets on mouse mRNAs. Bioinformatics 2016;32(8):1170–7.
[95] Kim VN. Small RNAs just got bigger: Piwi-interacting RNAs (piRNAs) in mammalian testes. Genes Dev 2006;20(15):1993–7.
[96] Han Y-N, et al. PIWI proteins and PIWI-interacting RNA: emerging roles in cancer. Cell Physiol Biochem 2018;44(1):1–20.
[97] Singh G, et al. Genome-wide profiling of the PIWI-interacting RNA-mRNA regulatory networks in epithelial ovarian cancers. PLoS One 2018;13(1):e0190485.
[98] Roy J, Mallick B. Investigating piwi-interacting RNA regulome in human neuroblastoma. Genes Chromosomes Cancer 2018;57(7):339–49.
[99] Krishnan AR, et al. Computational methods reveal novel functionalities of PIWI-interacting RNAs in human papillomavirus-induced head and neck squamous cell carcinoma. Oncotarget 2018;9(4):4614.
[100] Roy J, et al. Small RNA sequencing revealed dysregulated piRNAs in Alzheimer disease and their probable role in pathogenesis. Mol Biosyst 2017;13(3):565–76.
[101] Mei Y, Clark D, Mao L. Novel dimensions of piRNAs in cancer. Cancer Lett 2013;336(1):46–52.
[102] Sai Lakshmi S, Agrawal S. piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Res 2007;36(Suppl. 1):D173–7.
[103] Sarkar A, et al. piRNAQuest: searching the piRNAome for silencers. BMC Genomics 2014;15(1):555.
[104] Zhang P, et al. piRBase: a web resource assisting piRNA functional study. Database 2014;2014:bau110.
[105] Jiang B-R, et al. piRNAtarget: the integrated database for mining functionality of piRNA and its targets. 2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE); 2016. IEEE.