Journal Pre-proof Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression Su Zhou (Conceptualization) (Methodology) (Software) (Validation) (Resources) (Data curation) (Writing - original draft) (Writing - review and editing), Shulin Wang (Writing - original draft) (Supervision), Qi Wu (Conceptualization) (Methodology) (Writing - original draft), Riasat Azim (Data curation) (Software) (Writing - review and editing), Wen Li (Investigation) (Writing - review and editing)
PII:
S1476-9271(19)30809-6
DOI:
https://doi.org/10.1016/j.compbiolchem.2020.107200
Reference:
CBAC 107200
To appear in:
Computational Biology and Chemistry
Received Date:
12 September 2019
Revised Date:
4 January 2020
Accepted Date:
5 January 2020
Please cite this article as: Zhou S, Wang S, Wu Q, Azim R, Li W, Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression, Computational Biology and Chemistry (2020), doi: https://doi.org/10.1016/j.compbiolchem.2020.107200
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier.
Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression
Su Zhou a, Shulin Wang a,*, Qi Wu b, Riasat Azim a, Wen Li a
a College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
b College of Basic Medicine, Changsha Medical University, Changsha 410219, China
*Corresponding author. E-mail address:
[email protected]
Graphical abstract
Highlights
A combinatorial model that combines gradient boosting decision tree with logistic regression (GBDT-LR) was proposed.
The gradient boosting decision tree model has a natural advantage in finding many distinguishing features and feature combinations.
GBDT-LR obtained AUC of 0.9293 and AP of 0.9043 in 5-fold cross-validation. Furthermore, more than 88% of predicted top 50 potential disease-related miRNAs were confirmed by databases.
Abstract
MicroRNAs (miRNAs) have been shown to play an indispensable role in many fundamental biological processes, and the dysregulation of miRNAs is closely correlated with complex human diseases. Many studies have focused on the prediction of potential miRNA-disease associations. Considering the insufficient number of known miRNA-disease associations and the poor performance of many existing prediction methods, a novel model combining gradient boosting decision tree with logistic regression (GBDT-LR) is proposed to prioritize miRNA candidates for diseases. To balance positive and negative samples, GBDT-LR first adopts k-means clustering to screen negative samples from unknown miRNA-disease associations. Then, the gradient boosting decision tree (GBDT) model, which has an intrinsic advantage in finding many distinguishing features and feature combinations, is applied to extract features. Finally, the new features extracted by the GBDT model are input into a logistic regression (LR) model to predict the final miRNA-disease association score. The experimental results show that the average AUC of GBDT-LR in 5-fold cross-validation (CV) reaches 0.9274. In addition, in the case studies, 90%, 94% and 88% of the top 50 miRNAs potentially associated with colon cancer, gastric cancer, and pancreatic cancer, respectively, were confirmed by databases. Compared with three other state-of-the-art methods, GBDT-LR achieves the best prediction performance. The source code and dataset of GBDT-LR are freely available at https://github.com/Pualalala/GBDT-LR.
Keywords: miRNAs; diseases; miRNA-disease association; gradient boosting decision tree; logistic regression
1 Introduction
MicroRNAs (miRNAs) are a class of non-coding single-stranded RNA molecules, about 22 nucleotides in length, encoded by endogenous genes (Ambros, 2001). The first miRNA, lin-4, was discovered in Caenorhabditis elegans in 1993 (Lee et al., 1993), but this discovery was regarded as a special case and did not receive the attention of the scientific community at that time. It was not until 2003, when a second miRNA named let-7 was found to regulate the development of Caenorhabditis elegans by binding to the 3’-UTRs of target mRNAs (Ashrafi et al., 2003), that researchers gradually realized that miRNAs might be a widespread mechanism of gene expression regulation. Since then, hundreds of miRNA molecules have been identified in plants, animals, and viruses (Huang et al., 2011; Li et al., 2013; Liu et al., 2015). Furthermore, existing studies have shown that miRNAs are involved in almost all biological processes of living organisms, such as cell growth, differentiation, apoptosis, immune reaction, neural development, and tumor invasion (Barh et al., 2010; Fineberg et al., 2009; Ma et al., 2007; Taganov et al., 2006).
Studies also show that miRNAs are implicated in the development of various diseases (Jiang et al., 2009) such as cancer (Volinia et al., 2012), Parkinson’s disease (Kim et al., 2007), Alzheimer’s disease (Kumar et al., 2013; Rahman et al., 2019) and immune-related diseases (Huang et al., 2007). For example, researchers observed that a decrease in miR-143 content in colorectal cancer could increase the level of its target gene KRAS and promote the proliferation of cancer cells (Chen et al., 2009). In addition, the feedback loop formed by miR-133b and its target gene PITX3 is essential for maintaining dopaminergic neurons in the brain, and abnormal expression of miR-133b may lead to Parkinson’s disease (De Mena et al., 2010). Furthermore, decreased expression of miR-29a, miR-29b-1 and miR-9 in the brains of Alzheimer’s disease patients leads to an abnormal increase in their target gene BACE1 and raises the incidence of Alzheimer’s disease (Hebert et al., 2008). Identifying potential miRNA-disease associations is therefore of great significance for biomarker detection in the diagnosis (Barh et al., 2013), treatment (Janssen et al., 2013) and prognosis (Brenner et al., 2011) of complex human diseases. Since miRNAs play such a comprehensive role, an effective and accurate computational model is needed to reveal the potential associations between diseases and miRNAs.
For predicting disease-related miRNAs, many viable computational methods have been proposed (Chen et al., 2019a). Among these, similarity measure-based models and machine learning-based models are the two main representatives. The similarity measure-based methods rest mainly on the hypothesis that functionally related miRNAs tend to be associated with phenotypically similar diseases. For example, Nalluri et al. (2015) proposed a maximum weighted matching model called DISMIRA, which was based on the bipartite graph, to predict disease-related miRNA candidates. Chen et al. (2018c) applied a bipartite network projection model to predict potential miRNA-disease associations (BNPMDA). However, neither DISMIRA nor BNPMDA can be applied to diseases without known related miRNAs. In addition, Jiang et al. (2010) constructed a cumulative hypergeometric distribution model to infer potential miRNA-disease associations based on a functionally related miRNA network and a human phenome-microRNAome network; however, its heavy reliance on predicted miRNA-target associations can easily lead to false positives and false negatives. Moreover, Chen et al. (2018d) proposed another computational method called MDHGI, which combined matrix decomposition with heterogeneous graph inference to calculate potential miRNA-disease association scores. The four methods mentioned above can be classified as local network similarity-based computational models. Owing to the inherent shortcomings of local network similarity-based models, global network similarity measure-based computational models have emerged. For example, Chen and Huang et al. (2017) developed a method named LRSSLMDA, whose primary principle was to apply sparse subspace learning with Laplacian regularization to the known miRNA-disease association network and informative feature profiles to identify miRNA-disease associations accurately. Besides, Chen et al. (2018b) proposed a semi-supervised model based on a low-rank inductive matrix completion algorithm for miRNA-disease association prediction (IMCMDA). The advantage of IMCMDA is that it does not need negative samples. Furthermore, Xiao et al. (2018) developed a prediction model called GRNMF, which applied graph regularized non-negative matrix factorization to heterogeneous omics data to identify potential miRNA-disease correlations and could work for new diseases (miRNAs) or those diseases (miRNAs) with sparse known associations.
In the field of machine learning, Wang et al. (2019) proposed a model called LMTRDA to predict the associations between miRNAs and diseases by fusing information from multiple sources, including miRNA sequences, miRNA functional similarity, disease semantic similarity, and known miRNA-disease associations. Chen et al. (2018a) developed a model based on Random Forest for predicting miRNA-disease associations (RFMDA). Furthermore, Chen et al. (2019b) introduced another computational method called EDTMDA, built on ensemble learning and dimensionality reduction, to infer disease-causing miRNAs. However, RFMDA and EDTMDA select negative samples at random from unknown miRNA-disease associations, which can significantly affect their prediction performance. Recently, Zhao et al. (2019) presented a method called ABMDA for miRNA-disease association prediction based on the boosting algorithm. ABMDA improves prediction accuracy by integrating 20 weak classifiers, weighted accordingly, into a robust classifier.
Although previously proposed methods have advanced research on predicting miRNA-disease associations, they have their limitations. In this study, a combinatorial model that combines gradient boosting decision tree with logistic regression (GBDT-LR) is designed to infer the associations between miRNAs and diseases by integrating known miRNA-disease associations, miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity. GBDT is a boosting algorithm that belongs to the category of ensemble learning. It was first proposed by Friedman (Friedman, 2001) and has been successfully applied to click prediction systems (He et al., 2014). One significant advantage of GBDT is that it can integrate multiple weak classifiers to construct a powerful classifier; another is its excellent automatic feature combination ability and efficient operation. In GBDT-LR, we first extract new features with the GBDT model and then use an LR model to complete the final classification prediction. To evaluate the performance of our model, 5-fold cross-validation and three case studies were carried out. The experimental results indicate that the average AUC of GBDT-LR reached 0.9274 and the average AUPR was 0.9014. Moreover, in the case studies, 45, 47 and 44 of the top 50 predicted miRNAs for colon cancer, gastric cancer, and pancreatic cancer, respectively, were confirmed by databases and the literature. Overall, our model performs better than three other state-of-the-art models and can be effectively applied to identify potential associations between miRNAs and diseases.
2 Materials and methods
2.1 Human miRNA-disease associations
Researchers have collated the miRNA-disease associations confirmed by biological experiments and constructed several miRNA-disease association databases, such as HMDD (Huang et al., 2019), dbDEMC (Yang et al., 2010), miRCancer (Xie et al., 2013), and miRegulome (Barh et al., 2015). According to the records of HMDD v2.0 (Li et al., 2014), there are 5430 identified miRNA-disease associations involving 495 miRNAs and 383 diseases. We constructed an adjacency matrix A to describe the associations between disease $d(i)$ and miRNA $m(j)$. Specifically, if there is an identified association between disease $d(i)$ and miRNA $m(j)$, the entry $A(i, j)$ is equal to 1, otherwise 0. The numbers of miRNAs and diseases investigated in our study are denoted by nm and nd, respectively. In the case studies, two other independent databases, dbDEMC and miRCancer, were used to validate the miRNA-disease association prediction lists.
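As a minimal sketch of the adjacency matrix described above, the following Python snippet builds A from a list of (disease, miRNA) index pairs. The function name `build_adjacency` and the toy pairs are hypothetical; only the counts nd = 383, nm = 495 and 5430 known associations come from the text.

```python
def build_adjacency(pairs, n_diseases, n_mirnas):
    """Return an n_diseases x n_mirnas 0/1 matrix from (disease, miRNA) index pairs."""
    A = [[0] * n_mirnas for _ in range(n_diseases)]
    for d, m in pairs:
        A[d][m] = 1
    return A


# Hypothetical toy data: 3 diseases, 4 miRNAs, 4 known associations.
toy_pairs = [(0, 0), (0, 2), (1, 2), (2, 3)]
A = build_adjacency(toy_pairs, n_diseases=3, n_mirnas=4)

# Bookkeeping with the counts reported for HMDD v2.0:
# every disease-miRNA pair that is not a known association is "unknown".
nd, nm, n_known = 383, 495, 5430
n_unknown = nd * nm - n_known
print(n_unknown)  # 184155
```

With the reported counts, the number of unknown pairs is 383 × 495 − 5430 = 184155, which is the pool from which negative samples are later screened.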
2.2 MiRNA functional similarity
Wang et al. (2010) calculated miRNA functional similarity and concluded that functionally related miRNAs tend to be associated with phenotypically similar diseases. We downloaded the miRNA functional similarity scores from http://www.cuilab.cn/files/images/cuilab/misim.zip and constructed the symmetric miRNA functional similarity matrix FS, in which the entry $FS(i, j)$ denotes the functional similarity score between miRNAs $m(i)$ and $m(j)$.
2.3 Disease semantic similarity
The MeSH database (http://www.ncbi.nlm.nih.gov/) provides an authoritative Medical Subject Headings system, which plays a crucial role in bioinformatics research and is widely used to obtain the associations between diseases. Based on the MeSH database, we constructed a Directed Acyclic Graph (DAG) for each disease. A disease D can be depicted as $DAG(D) = (D, T(D), E(D))$, where $T(D)$ is the node set composed of D itself and all its ancestor nodes, and $E(D)$ is the edge set of direct links from parent nodes to child nodes. According to the method of Xuan et al. (2013), the contribution of disease d to the semantic value of disease D can be computed as follows:
$$
\begin{cases}
D1_D(d) = 1, & \text{if } d = D \\
D1_D(d) = \max\{\Delta \cdot D1_D(d') \mid d' \in \text{children of } d\}, & \text{if } d \neq D
\end{cases} \tag{1}
$$
where $\Delta$ is the semantic contribution decay factor. Following the previous literature (Wang et al., 2010), the value of $\Delta$ was set to 0.5. The contribution of D to its own semantic value is 1, whereas the contribution of its ancestor diseases decreases as their distance from D increases. The semantic value of disease D can then be calculated as follows:
$$
DV1(D) = \sum_{d \in T(D)} D1_D(d). \tag{2}
$$
According to the hypothesis that two diseases sharing a larger part of their DAGs are more semantically similar, the semantic similarity between two diseases $d(i)$ and $d(j)$ can be computed as follows:
$$
SS1(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D1_{d(i)}(t) + D1_{d(j)}(t) \right)}{DV1(d(i)) + DV1(d(j))}, \tag{3}
$$
where the entry $SS1(d(i), d(j))$ of matrix SS1 denotes the semantic similarity score between diseases $d(i)$ and $d(j)$.
However, the above model ignores another critical point: if two diseases in the same layer of $DAG(D)$ appear in different numbers of disease DAGs, their contributions to disease D should also differ. Specifically, the contribution of a disease that appears in more DAGs should be smaller than that of a disease that appears in fewer DAGs. We therefore introduced disease semantic similarity model 2 as a complement to model 1, in which the contribution is calculated as follows:
$$
D2_D(d) = -\log \frac{\text{the number of DAGs including } d}{\text{the number of diseases}}. \tag{4}
$$
The semantic value of disease D and the semantic similarity between diseases $d(i)$ and $d(j)$ are calculated in the same way as in model 1:
$$
DV2(D) = \sum_{d \in T(D)} D2_D(d), \tag{5}
$$
$$
SS2(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D2_{d(i)}(t) + D2_{d(j)}(t) \right)}{DV2(d(i)) + DV2(d(j))}. \tag{6}
$$
To compute the semantic similarity more reasonably, we combined the two models. The final semantic similarity SS is calculated as follows:
$$
SS(d(i), d(j)) = \frac{SS1(d(i), d(j)) + SS2(d(i), d(j))}{2}. \tag{7}
$$
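The two semantic similarity models can be illustrated on a small example. The sketch below is not the authors' implementation: the five-node hierarchy `PARENTS` is a hypothetical stand-in for MeSH, and the helper names are chosen for illustration. It follows Eqs. (1) to (7) with the decay factor 0.5 stated in the text.

```python
import math

# Hypothetical toy hierarchy (child -> parents); not taken from MeSH.
PARENTS = {
    "neoplasms": [],
    "digestive_diseases": [],
    "gi_neoplasms": ["neoplasms", "digestive_diseases"],
    "colon_cancer": ["gi_neoplasms"],
    "gastric_cancer": ["gi_neoplasms"],
}
DELTA = 0.5  # semantic contribution decay factor from the text


def node_set(disease):
    """T(D): the disease itself plus all of its ancestors."""
    seen, stack = set(), [disease]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


T = {d: node_set(d) for d in PARENTS}


def d1(disease):
    """Eq. (1): contribution of every node in T(D) under model 1."""
    contrib = {disease: 1.0}
    frontier = [disease]
    while frontier:
        child = frontier.pop()
        for parent in PARENTS[child]:
            value = DELTA * contrib[child]
            if value > contrib.get(parent, 0.0):  # keep the max over children
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def d2(disease):
    """Eq. (4): contribution based on how many DAGs contain each node."""
    n = len(PARENTS)
    return {d: -math.log(sum(d in T[o] for o in PARENTS) / n) for d in T[disease]}


def similarity(a, b, contrib):
    """Eqs. (3) and (6): shared contributions over summed semantic values."""
    ca, cb = contrib(a), contrib(b)
    shared = sum(ca[t] + cb[t] for t in T[a] & T[b])
    return shared / (sum(ca.values()) + sum(cb.values()))


ss1 = similarity("colon_cancer", "gastric_cancer", d1)
ss2 = similarity("colon_cancer", "gastric_cancer", d2)
ss = (ss1 + ss2) / 2  # Eq. (7)
print(ss1, ss2, ss)
```

In this toy hierarchy the two cancers share three ancestors, so model 1 gives SS1 = 0.5; model 2 gives a lower value because the shared ancestors appear in most DAGs and therefore carry little weight.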
2.4 Gaussian interaction profile kernel similarity for diseases
According to the basic hypothesis mentioned earlier, miRNAs with similar functions tend to be associated with diseases with similar phenotypes, so the Gaussian interaction profile (GIP) kernel similarity was introduced for further similarity analysis. We used the binary vector $IP(d(i))$, the i-th row of the adjacency matrix A, to denote the associations between disease $d(i)$ and each miRNA. The GIP kernel similarity between diseases $d(i)$ and $d(j)$ was then calculated as follows:
$$
KD(d(i), d(j)) = \exp\left( -r_d \left\| IP(d(i)) - IP(d(j)) \right\|^2 \right), \tag{8}
$$
where KD is the symmetric matrix of GIP kernel similarities between all investigated diseases, and $r_d$ is the adjustable kernel bandwidth, which is computed from the normalized bandwidth parameter $r'_d$ as follows:
$$
r_d = r'_d \Big/ \left( \frac{1}{nd} \sum_{i=1}^{nd} \left\| IP(d(i)) \right\|^2 \right). \tag{9}
$$
Following the previous literature (van Laarhoven et al., 2011), $r'_d$ was simply set to 1.
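The kernel computation of Eqs. (8) and (9) can be sketched in a few lines of plain Python. The function name `gip_kernel` and the toy matrix are hypothetical; the same function yields the miRNA kernel when it is given the columns of A instead of the rows.

```python
import math


def gip_kernel(profiles, r_prime=1.0):
    """GIP kernel matrix for a list of binary interaction profiles (Eqs. 8-9)."""
    n = len(profiles)
    # Bandwidth: r' divided by the mean squared norm of the profiles.
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    r = r_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-r * sq_dist)
    return kernel


# Hypothetical toy adjacency matrix: rows are diseases, columns are miRNAs.
A = [
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
KD = gip_kernel(A)                            # disease kernel from rows
KM = gip_kernel([list(c) for c in zip(*A)])   # miRNA kernel from columns
```

Because the profiles are binary, the squared norm of a profile is simply its number of known associations, so the bandwidth is normalized by the average number of associations per disease (or per miRNA).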
2.5 Gaussian interaction profile kernel similarity for miRNAs
Similarly, the GIP kernel similarity between miRNAs $m(i)$ and $m(j)$ can be calculated as follows:
$$
KM(m(i), m(j)) = \exp\left( -r_m \left\| IP(m(i)) - IP(m(j)) \right\|^2 \right), \tag{10}
$$
$$
r_m = r'_m \Big/ \left( \frac{1}{nm} \sum_{i=1}^{nm} \left\| IP(m(i)) \right\|^2 \right), \tag{11}
$$
where KM is the symmetric matrix of GIP kernel similarities between all investigated miRNAs, and the binary vector $IP(m(i))$, the i-th column of the adjacency matrix A, denotes the interaction profile of miRNA $m(i)$. For the same reason, $r'_m$ is also set to 1.
2.6 Integrated similarity for miRNAs and diseases
Because miRNA functional similarity and GIP kernel similarity each capture only one aspect of similarity, and not all miRNA-miRNA pairs have a functional similarity score, we integrated the two similarities into a single matrix. When a miRNA-miRNA pair has no functional similarity, its final similarity is defined as its GIP kernel similarity; otherwise, its final similarity is defined as the average of its GIP kernel similarity and functional similarity. More concretely, the integrated similarity for miRNAs is calculated as follows:
$$
SM(m(i), m(j)) =
\begin{cases}
\dfrac{KM(m(i), m(j)) + FS(m(i), m(j))}{2}, & \text{if } m(i) \text{ and } m(j) \text{ have functional similarity} \\
KM(m(i), m(j)), & \text{otherwise}
\end{cases} \tag{12}
$$
Similarly, the integrated similarity for diseases depends on whether a semantic similarity exists between the two diseases:
$$
SD(d(i), d(j)) =
\begin{cases}
\dfrac{KD(d(i), d(j)) + SS(d(i), d(j))}{2}, & \text{if } d(i) \text{ and } d(j) \text{ have semantic similarity} \\
KD(d(i), d(j)), & \text{otherwise}
\end{cases} \tag{13}
$$
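The piecewise rule of Eqs. (12) and (13) can be sketched as follows. This is an illustration under one assumption not stated in the text: a missing functional or semantic similarity is marked with `None` (the authors' data may encode it differently, for example as 0). The function name `integrate` and the toy matrices are hypothetical.

```python
def integrate(kernel, other):
    """Average kernel and other where other is available, else keep the kernel value."""
    n = len(kernel)
    return [
        [
            kernel[i][j] if other[i][j] is None
            else (kernel[i][j] + other[i][j]) / 2
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical toy matrices for three miRNAs; None marks a missing FS score.
KM = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.75], [0.25, 0.75, 1.0]]
FS = [[1.0, 0.75, None], [0.75, 1.0, None], [None, None, None]]
SM = integrate(KM, FS)
print(SM[0][1], SM[0][2])  # 0.625 0.25
```

The same function applies to diseases by passing KD and SS in place of KM and FS.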
2.7 GBDT-LR prediction model
Inspired by the research of He et al. (2014), we proposed a novel model that combines gradient boosting decision tree with logistic regression for predicting potential miRNA-disease associations. The whole prediction process consists of data preparation, model training, and scoring and sorting of the final prediction results (see Fig. 1).
Fig. 1. The flowchart of GBDT-LR. It mainly consists of three steps: data preparation; model training; scoring and sorting final prediction results.
In the data preparation stage, the main tasks were to obtain the two integrated similarity matrices and to balance the positive and negative samples. Here, the 5430 samples with known associations are called positive samples, and the unknown samples are treated as candidate negative samples. Because the number of unknown samples in the dataset is much larger than the number of known association samples, we performed k-means clustering on the unknown samples and randomly extracted a corresponding number of samples from each cluster, so that the negative set is roughly equal in size to the positive set. Following Rayhan et al. (2017) and Zhao et al. (2019), the value of k was set to 23. The unknown samples were first divided into 23 clusters. Then, instead of randomly selecting the same number of negative samples from each cluster, we selected from each cluster a number of negative samples proportional to its share of the total number of unknown samples. In this way, we obtained 5418 negative samples.
The core idea of GBDT-LR is to train an LR model on new features constructed by the GBDT model. GBDT is a commonly used non-linear, iterative model that uses CART trees as weak classifiers. Each iteration of the GBDT model produces a weak classifier, and each weak classifier is fitted to the negative gradient of the loss evaluated at the previous weak classifier. We used a grid search strategy with 5-fold CV on the original data to select the optimal parameters of GBDT, and found that the optimal number of trees (that is, the number of iterations or weak classifiers) was 12. During the grid search, we divided the original data into three disjoint parts: a training set used for model training, a validation set used to adjust parameters, and a testing set used to evaluate model performance. This procedure helps to avoid over-fitting. The inputs of the GBDT model are the training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i$ is a training sample and $y_i$ is its label, the number of iterations $M = 12$, and the log-likelihood loss function $L(y, f(x))$; the outputs are the new features. The calculation process is summarized in three steps. Step 1 initializes the weak classifier as follows:
$$
f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c). \tag{14}
$$
Here, $f_0(x)$ must be a constant; since all samples are initially located in the root node of the decision tree, it is usually set directly to the average of all sample labels. Step 2 updates the weak classifier and itself consists of three sub-steps. First, the negative gradient of each sample $i$ is calculated as follows:
$$
r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{m-1}(x)}, \tag{15}
$$
where $f_{m-1}(x)$ denotes the $(m-1)$-th weak classifier, $y_i$ is the true label of the i-th sample, and $f(x_i)$ is the predicted probability of the label. The variable $r_{im}$ represents the negative gradient of the i-th sample in the m-th weak classifier. The data $(x_i, r_{im})$ are used to fit a CART regression tree and obtain its leaf node regions $R_{jm}$, where $j = 1, 2, \ldots, J$ and $J$ is the number of leaf nodes of this regression tree. Second, for each leaf node region, the best fitting value is calculated as follows:
$$
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma), \tag{16}
$$
where $\gamma_{jm}$ represents the best fitting value of the j-th leaf node in the m-th regression tree. Third, the weak classifier of the m-th iteration is updated as follows:
$$
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}), \tag{17}
$$
where the indicator function $I(\cdot)$ takes the value 1 if its argument is true and 0 otherwise. Step 3 repeats step 2 until the number of iterations reaches 12, which yields the final strong classifier built from 12 weak classifiers:
$$
f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}). \tag{18}
$$
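For a concrete view of Eqs. (14) and (15), the following worked derivation may help. It rests on assumptions that the text does not spell out: labels $y_i \in \{0, 1\}$, $f$ interpreted as a log-odds score with predicted probability $p = 1/(1 + e^{-f})$, and the log-likelihood loss written as the negative Bernoulli log-likelihood.

```latex
\begin{aligned}
L(y, f) &= -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr]
         = -y f + \log\bigl(1 + e^{f}\bigr),
         \qquad p = \frac{1}{1 + e^{-f}}, \\
\frac{\partial L}{\partial f} &= p - y
  \;\Longrightarrow\;
  r_{im} = y_i - p_{m-1}(x_i), \\
\sum_{i=1}^{N} \frac{\partial L(y_i, c)}{\partial c} = 0
  &\;\Longrightarrow\;
  \frac{1}{1 + e^{-c}} = \frac{1}{N} \sum_{i=1}^{N} y_i .
\end{aligned}
```

Under these assumptions the negative gradient is the residual between the label and the current predicted probability, and the constant that minimizes Eq. (14) corresponds to a predicted probability equal to the average of the sample labels.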
In fact, our aim is not to use the final strong classifier for binary classification, but to construct new features from the GBDT model by recording the leaf node on which each sample falls in every tree. Specifically, the elements of the new feature vector are 0 or 1, and each element corresponds to one leaf node of a regression tree. When a sample falls on a leaf node of a regression tree, the element corresponding to this leaf node is set to 1, while the elements corresponding to the other leaf nodes of this tree are set to 0 (see Fig. 2). The length of the new feature vector equals the total number of leaf nodes in all regression trees. After the feature vectors are constructed, LR is used for the final classification task. Finally, we sort the sample pairs by their scores; the higher the score, the more likely the miRNA is to be associated with the relevant disease. The source code and dataset of GBDT-LR are freely available at https://github.com/Pualalala/GBDT-LR.
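The leaf-position encoding can be sketched in plain Python using the two-tree example of Fig. 2 (a three-leaf tree and a two-leaf tree, with the sample falling on the second leaf of each). The function names `encode_leaves` and `lr_score`, as well as the weights and bias, are hypothetical and serve only to show how an LR model scores the encoded vector.

```python
import math


def encode_leaves(leaf_indices, leaves_per_tree):
    """Concatenate one one-hot block per tree, marking the leaf the sample hit."""
    features = []
    for leaf, n_leaves in zip(leaf_indices, leaves_per_tree):
        block = [0] * n_leaves
        block[leaf] = 1
        features.extend(block)
    return features


def lr_score(features, weights, bias):
    """Logistic regression score: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Example of Fig. 2: second leaf (index 1) of a three-leaf tree and
# second leaf (index 1) of a two-leaf tree.
x_new = encode_leaves(leaf_indices=[1, 1], leaves_per_tree=[3, 2])
print(x_new)  # [0, 1, 0, 0, 1]

# Hypothetical LR parameters: one weight per leaf, so only the two
# leaves that the sample reached contribute to the score.
weights = [0.2, 1.5, -0.7, -0.3, 0.9]
score = lr_score(x_new, weights, bias=-1.0)
```

With M = 12 trees the vector has twelve such blocks. One possible route in scikit-learn, not necessarily the one taken in the released code, is to obtain leaf indices with `GradientBoostingClassifier.apply` and encode them with `OneHotEncoder`.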
Fig. 2. The process of extracting new features from the GBDT model. Two weak classifiers are taken as examples, shown in red and yellow, respectively. The red weak classifier has 3 leaf nodes and the yellow one has 2. Sample $x$ falls on the second leaf node of the red weak classifier and also on the second leaf node of the yellow weak classifier. We therefore record the result of the red weak classifier as [0 1 0] and that of the yellow weak classifier as [0 1]. The output of GBDT is the concatenation of these results, [0 1 0 0 1]. Finally, the new features constructed by the GBDT model are input into the LR model for the final classification task.
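Returning to the data preparation stage of this section, the proportional screening of negative samples can be sketched as follows. The sketch assumes that cluster labels are already available (for example from a k-means run with k = 23, not shown here) and that each cluster's quota is the floor of its proportional share; the text does not state the rounding convention, so this is an assumption. The function name `sample_negatives` and the toy input are hypothetical.

```python
import random
from collections import defaultdict


def sample_negatives(cluster_labels, n_target, seed=0):
    """Draw about n_target indices, each cluster contributing in proportion to its size."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        clusters[label].append(idx)
    total = len(cluster_labels)
    chosen = []
    for label in sorted(clusters):
        members = clusters[label]
        quota = n_target * len(members) // total  # floor of the proportional share
        chosen.extend(rng.sample(members, quota))
    return chosen


# Hypothetical toy input: 100 unknown samples in 3 clusters of sizes 50, 30, 20.
labels = [0] * 50 + [1] * 30 + [2] * 20
negatives = sample_negatives(labels, n_target=10)
per_cluster = [sum(labels[i] == c for i in negatives) for c in range(3)]
print(per_cluster)  # [5, 3, 2]
```

Under a flooring convention like this one, the total can fall slightly short of the target when the shares are not whole numbers. The text reports 5418 negatives for 5430 positives, although whether the same convention was used is not stated.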
3 Results
3.1 Performance evaluation
In this paper, we evaluated the performance of GBDT-LR and three other methods (i.e., LMTRDA (Wang et al., 2019), RFMDA (Chen et al., 2018a) and ABMDA (Zhao et al., 2019)) based on 5-fold CV. The evaluation metrics include the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPR), precision, recall and F1-Score. We drew the receiver operating characteristic (ROC) curve with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis. Likewise, we drew the precision-recall (P-R) curve with the precision ratio on the horizontal axis and the recall ratio on the vertical axis. By calculating AUC and AUPR, we can compare the prediction capability of GBDT-LR with that of the other three models. In general, higher AUC and AUPR values indicate better prediction performance.
In the 5-fold CV, all known miRNA-disease associations were randomly divided into five parts, each of which served in turn as the testing samples while the other four parts served as training samples. To reduce the influence of the sample division, we performed 100 randomized divisions of the known miRNA-disease associations. The experimental results show that GBDT-LR achieves the best overall performance among all methods (see Table 1 and Fig. 3). Fig. 3 shows the average ROC curves and average P-R curves of the 5-fold CV results. The average AUC of GBDT-LR is 0.9274, which is higher than that of LMTRDA (0.8479), RFMDA (0.7388) and ABMDA (0.8841). Likewise, the average AUPR of GBDT-LR is 0.9014, while the AUPRs of LMTRDA, RFMDA and ABMDA are 0.8217, 0.7034 and 0.8807, respectively. The values of the other three evaluation metrics are shown in Table 1. Ideally, both Precision and Recall would be high, but the two often trade off against each other: in most models, when one of them reaches a high level, the other tends to be low. This is why the Recall of GBDT-LR is not as high as that of RFMDA. F1-Score is an evaluation metric that takes both Precision and Recall into account. For GBDT-LR, the precision and recall scores both exceed 0.80 when the highest F1-Score is achieved. All the above results show that GBDT-LR makes significant improvements in predicting the associations between miRNAs and diseases compared with some of the most advanced methods.
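The evaluation metrics named above can be sketched in plain Python on a toy example. The labels, scores and the 0.5 decision threshold are hypothetical; AUC is computed here through its rank interpretation (the fraction of positive-negative pairs in which the positive sample scores higher, with ties counted as one half), which is equivalent to the area under the ROC curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 from binary labels and binary predictions.

    Assumes at least one predicted positive and at least one actual positive.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy predictions for six miRNA-disease pairs.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # hypothetical threshold of 0.5

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
area = auc(y_true, scores)
```

In this toy case two of the three predicted positives are correct and two of the three true positives are recovered, so precision, recall and F1 all equal 2/3, while 8 of the 9 positive-negative pairs are ranked correctly.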
Table 1. The AUC, AUPR, Precision, Recall and F1-Score of four methods on miRNA-disease association prediction task. The bolded numbers represent the comparatively better performance.
Method     AUC     AUPR    Precision  Recall  F1-Score
ABMDA      0.8841  0.8807  0.8152     0.7827  0.7908
LMTRDA     0.8479  0.8217  0.8013     0.6190  0.7067
RFMDA      0.7388  0.7034  0.6253     0.9548  0.7453
GBDT-LR    0.9274  0.9014  0.8315     0.8273  0.8302
Fig. 3. The ROC and P-R curves of four methods on miRNA-disease association prediction task.
3.2 Case studies
To further demonstrate the applicability and prediction performance of GBDT-LR, three case studies on colon cancer, gastric cancer, and pancreatic cancer were carried out. As mentioned earlier, 5430 known association samples were obtained from HMDD v2.0, while the remaining 184155 samples were unknown. In this section, we ranked all unknown samples and then picked out the top 50 predicted miRNAs for the three cancers to see whether they were confirmed by the other two independent databases, dbDEMC (Yang et al., 2010) and miRCancer (Xie et al., 2013).
Colon cancer is a common malignant tumor of the digestive tract, with peak incidence at 40-50 years of age. The incidence of colon cancer ranks third among gastrointestinal tumors, and colon cancer has no obvious early symptoms, which easily leads to missed diagnosis or misdiagnosis. Therefore, more sensitive and specific molecular biomarkers are urgently needed to improve the early diagnosis of colon cancer. In recent years, studies have shown that some miRNAs can affect the development of colon cancer. For example, miR-135b upregulation is common in human colon cancer; it correlates with tumor stage and poor clinical outcome, and it was identified as a key downstream effector of oncogenic pathways and a potential target for colon cancer treatment (Valeri et al., 2014). In addition, overexpression of miR-215 could inhibit the proliferation of colon cancer cells (Song et al., 2010). When GBDT-LR was used to prioritize candidate miRNAs for colon cancer, 45 of the top 50 predicted miRNAs were confirmed by dbDEMC and miRCancer (see Table 2). For instance, miR-155 gene knockout decreases the cell growth, motility, and invasion of colon cancer (Onyeagucha et al., 2013). Moreover, miR-34a can effectively inhibit cell proliferation by regulating the E2F signaling pathway, and the abrogation of miR-34a could contribute to abnormal cell proliferation and the development of colon cancer (Tazawa et al., 2007). More importantly, those miRNAs that have not yet been verified by biological experiments, such as miR-16, miR-200a, miR-373, miR-378a and miR-142, are likely to be new biomarkers for colon cancer.
lP
Table 2. The top 50 predicted colon cancer-related miRNAs. 45 out of the top 50 miRNAs were confirmed based on dbDEMC and miRCancer. (Column 1: top 1-25 miRNAs; Column 3: top 26-50 miRNAs)

| miRNA | Evidence | miRNA | Evidence |
| --- | --- | --- | --- |
| hsa-mir-155 | dbDEMC, miRCancer | hsa-mir-24 | dbDEMC |
| hsa-mir-21 | dbDEMC, miRCancer | hsa-mir-106b | dbDEMC |
| hsa-mir-143 | dbDEMC | hsa-mir-125b | dbDEMC, miRCancer |
| hsa-mir-29a | dbDEMC | hsa-mir-148a | dbDEMC |
| hsa-mir-16 | unconfirmed | hsa-mir-663a | dbDEMC |
| hsa-mir-20a | dbDEMC | hsa-let-7a | dbDEMC, miRCancer |
| hsa-mir-221 | dbDEMC | hsa-mir-200a | unconfirmed |
| hsa-mir-146a | dbDEMC, miRCancer | hsa-mir-15b | dbDEMC, miRCancer |
| hsa-mir-222 | dbDEMC | hsa-mir-29c | dbDEMC |
| hsa-mir-1 | dbDEMC | hsa-mir-372 | dbDEMC |
| hsa-mir-15a | dbDEMC, miRCancer | hsa-mir-31 | dbDEMC |
| hsa-mir-133a | dbDEMC | hsa-mir-133b | dbDEMC |
| hsa-mir-34a | dbDEMC | hsa-mir-195 | dbDEMC, miRCancer |
| hsa-mir-29b | dbDEMC | hsa-mir-192 | dbDEMC, miRCancer |
| hsa-mir-19b | dbDEMC | hsa-let-7e | dbDEMC, miRCancer |
| hsa-mir-18a | dbDEMC, miRCancer | hsa-mir-373 | unconfirmed |
| hsa-mir-92a | dbDEMC | hsa-mir-378a | unconfirmed |
| hsa-mir-122 | dbDEMC | hsa-mir-223 | dbDEMC |
| hsa-mir-181a | dbDEMC, miRCancer | hsa-let-7b | dbDEMC, miRCancer |
| hsa-mir-199a | miRCancer | hsa-let-7c | dbDEMC, miRCancer |
| hsa-mir-200b | dbDEMC, miRCancer | hsa-mir-200c | miRCancer |
| hsa-mir-19a | dbDEMC | hsa-mir-203 | dbDEMC |
| hsa-mir-150 | dbDEMC | hsa-mir-210 | dbDEMC |
| hsa-mir-27a | dbDEMC | hsa-mir-142 | unconfirmed |
| hsa-mir-93 | dbDEMC | hsa-mir-214 | dbDEMC, miRCancer |
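The confirmation counts quoted for each case study (for example, 45 of 50 for colon cancer) amount to a tally over the Evidence column. A minimal Python sketch of that bookkeeping follows, using five rows taken from Table 2. The `summarize` helper and the data layout are illustrative assumptions and are not part of the original work.

```python
from collections import Counter

# (miRNA, evidence databases) pairs; a five-row excerpt of Table 2.
# An empty set stands for "unconfirmed".
top_candidates = [
    ("hsa-mir-155", {"dbDEMC", "miRCancer"}),
    ("hsa-mir-21", {"dbDEMC", "miRCancer"}),
    ("hsa-mir-24", {"dbDEMC"}),
    ("hsa-mir-16", set()),
    ("hsa-mir-199a", {"miRCancer"}),
]


def summarize(candidates):
    """Split candidates into confirmed/unconfirmed and count hits per database."""
    confirmed = [mirna for mirna, evidence in candidates if evidence]
    unconfirmed = [mirna for mirna, evidence in candidates if not evidence]
    per_database = Counter(db for _, evidence in candidates for db in evidence)
    return confirmed, unconfirmed, per_database


confirmed, unconfirmed, per_database = summarize(top_candidates)
print(f"{len(confirmed)} of {len(top_candidates)} confirmed")
print("unconfirmed:", unconfirmed)
print("hits per database:", dict(per_database))
```

Applied to all 50 rows of a table, the same tally gives the "confirmed out of 50" figure reported in the caption.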
Gastric cancer is the fourth most common cancer in the world (Pavithra et al., 2018), but in China its incidence ranks first among malignant tumors, and nearly two-thirds of cases and deaths occurred in underdeveloped regions. Furthermore, the peak age of onset was over 50 years, and the male-to-female ratio was 2:1. Recently, with improvements in economic conditions, lifestyle, education and the health care system, the mortality rate of gastric cancer has decreased significantly. Even so, gastric cancer remains a notable cancer burden and one of the critical issues in cancer prevention and control strategy (Yang, 2006). Therefore, a reliable model for predicting potential gastric cancer-related miRNAs is highly needed. So far, experiments have confirmed some related miRNAs; for example, the expression of miR-449 is decreased in human clinical gastric cancer compared with normal tissues (Bou Kheir et al., 2011). Also, miR-218 could block the invasion and metastasis of gastric cancer by targeting the Robo1 receptor (Tie et al., 2010). In the prediction list of gastric cancer-related miRNAs, 47 of the top 50 were confirmed by the two databases (see Table 3). For instance, miR-21 may be an important oncogenic miRNA: it was observed to be overexpressed in human gastric cancer tissues, and it induces the occurrence of gastric cancer by down-regulating the tumor suppressor RECK (Zhang et al., 2008). Moreover, the upregulation of miR-93 and miR-106b has been reported to impair the TGFβ tumor suppressor pathway (Petrocca et al., 2008), which plays a key role in the initiation and development of gastric cancer. More importantly, those miRNAs that have not been verified by biological experiments, such as miR-29a, miR-34b and miR-335, are likely to be new biomarkers for gastric cancer.

Table 3. The top 50 predicted gastric cancer-related miRNAs. 47 out of the top 50 miRNAs were confirmed based on dbDEMC and miRCancer. (Column 1: top 1-25 miRNAs; Column 3: top 26-50 miRNAs)
| miRNA | Evidence | miRNA | Evidence |
| --- | --- | --- | --- |
| hsa-mir-21 | dbDEMC, miRCancer | hsa-mir-18a | dbDEMC, miRCancer |
| hsa-mir-155 | dbDEMC, miRCancer | hsa-mir-92a | dbDEMC |
| hsa-mir-125b | dbDEMC, miRCancer | hsa-mir-126 | dbDEMC, miRCancer |
| hsa-mir-29a | unconfirmed | hsa-mir-122 | dbDEMC, miRCancer |
| hsa-mir-16 | dbDEMC | hsa-mir-192 | dbDEMC, miRCancer |
| hsa-mir-106a | dbDEMC, miRCancer | hsa-mir-195 | dbDEMC |
| hsa-mir-20a | dbDEMC, miRCancer | hsa-let-7e | dbDEMC, miRCancer |
| hsa-mir-17 | dbDEMC, miRCancer | hsa-mir-31 | dbDEMC |
| hsa-mir-221 | dbDEMC | hsa-mir-133b | dbDEMC, miRCancer |
| hsa-mir-146a | miRCancer | hsa-mir-183 | dbDEMC, miRCancer |
| hsa-mir-93 | dbDEMC | hsa-mir-181a | dbDEMC, miRCancer |
| hsa-mir-27a | dbDEMC | hsa-mir-200b | dbDEMC, miRCancer |
| hsa-mir-222 | dbDEMC | hsa-mir-199a | miRCancer |
| hsa-mir-15a | dbDEMC, miRCancer | hsa-mir-150 | miRCancer |
| hsa-mir-1 | dbDEMC, miRCancer | hsa-mir-19a | dbDEMC, miRCancer |
| hsa-mir-133a | dbDEMC, miRCancer | hsa-mir-205 | dbDEMC, miRCancer |
| hsa-mir-34a | dbDEMC | hsa-mir-10b | dbDEMC, miRCancer |
| hsa-mir-29b | dbDEMC | hsa-mir-100 | dbDEMC, miRCancer |
| hsa-mir-145 | dbDEMC, miRCancer | hsa-mir-143 | dbDEMC, miRCancer |
| hsa-mir-34b | unconfirmed | hsa-mir-200a | dbDEMC |
| hsa-mir-335 | unconfirmed | hsa-let-7a | dbDEMC, miRCancer |
| hsa-mir-101 | dbDEMC, miRCancer | hsa-mir-15b | dbDEMC |
| hsa-mir-24 | dbDEMC | hsa-mir-29c | dbDEMC |
| hsa-mir-106b | dbDEMC, miRCancer | hsa-mir-19b | dbDEMC, miRCancer |
| hsa-mir-200c | miRCancer | hsa-mir-451a | dbDEMC, miRCancer |
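The candidate lists in these case studies are described as prioritized by GBDT-LR. One plausible reading, sketched below under stated assumptions, is that miRNAs without a known association to the disease are sorted by predicted score and the top k are kept. The function name, the score values and the choice to exclude known associations are hypothetical and are not taken from the paper.

```python
def top_candidates(scores, known, disease, k=50):
    """Return the k highest-scoring miRNAs not already linked to `disease`."""
    unknown = [
        (mirna, score)
        for (mirna, d), score in scores.items()
        if d == disease and (mirna, d) not in known
    ]
    # Sort by descending score; break ties alphabetically for determinism.
    unknown.sort(key=lambda pair: (-pair[1], pair[0]))
    return [mirna for mirna, _ in unknown[:k]]


# Invented scores, for illustration only.
scores = {
    ("hsa-mir-135b", "colon cancer"): 0.99,
    ("hsa-mir-155", "colon cancer"): 0.97,
    ("hsa-mir-21", "colon cancer"): 0.95,
    ("hsa-mir-16", "colon cancer"): 0.60,
    ("hsa-mir-21", "gastric cancer"): 0.98,
}
# Assumed to be a known (training) association, so it is excluded from the ranking.
known = {("hsa-mir-135b", "colon cancer")}

ranked = top_candidates(scores, known, "colon cancer", k=2)
print(ranked)  # ['hsa-mir-155', 'hsa-mir-21']
```

The resulting list is what would then be checked against external databases such as dbDEMC and miRCancer.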
Pancreatic cancer is a highly malignant gastrointestinal cancer that is difficult to diagnose and treat. It also has one of the worst prognoses among tumors, and its morbidity and mortality have increased significantly in recent years. The clinical features of pancreatic cancer are a short course, rapid development and rapid deterioration. Pancreatic cancer is known as the “king of cancers” because its 5-year survival rate is less than 5%, and many researchers have therefore worked on this disease, including on the prediction of potentially relevant miRNAs. For example, increased serum levels of miR-200a and miR-200b can help to diagnose pancreatic cancer (Li et al., 2010). Furthermore, increased expression of circulating miR-210 can also serve as a useful and novel biomarker for pancreatic cancer diagnosis (Ho et al., 2010). GBDT-LR was applied to reveal potential associations between miRNAs and pancreatic cancer, and 44 of the top 50 potential miRNAs were confirmed by the two databases (see Table 4). For instance, the downregulation of miR-141 in pancreatic cancer tissues was related to tumor size, TNM stage, distant metastasis and overall survival (Zhao et al., 2013). Moreover, miR-181c could directly suppress the expression of Hippo kinase cassette core components in human pancreatic cancer cells, and high miR-181c levels were significantly correlated with Hippo signaling inactivation, which could indirectly promote pancreatic cancer cell survival and chemoresistance in vitro and in vivo (Chen et al., 2015). More importantly, those miRNAs that have not been verified by biological experiments, such as miR-193a, miR-499a, miR-378a, miR-372 and miR-28, are likely to be new biomarkers for pancreatic cancer.

Table 4. The top 50 predicted pancreatic cancer-related miRNAs. 44 out of the top 50 miRNAs were confirmed based on dbDEMC and miRCancer. (Column 1: top 1-25 miRNAs; Column 3: top 26-50 miRNAs)
| miRNA | Evidence | miRNA | Evidence |
| --- | --- | --- | --- |
| hsa-mir-373 | dbDEMC | hsa-mir-499a | unconfirmed |
| hsa-mir-1 | dbDEMC | hsa-mir-205 | dbDEMC, miRCancer |
| hsa-mir-133a | dbDEMC, miRCancer | hsa-mir-19b | dbDEMC |
| hsa-mir-127 | dbDEMC | hsa-mir-378a | unconfirmed |
| hsa-mir-125a | dbDEMC | hsa-mir-370 | dbDEMC |
| hsa-mir-181a | dbDEMC, miRCancer | hsa-mir-424 | dbDEMC |
| hsa-mir-302a | dbDEMC | hsa-mir-30e | dbDEMC |
| hsa-mir-193a | unconfirmed | hsa-mir-494 | dbDEMC |
| hsa-mir-130a | dbDEMC | hsa-mir-98 | dbDEMC |
| hsa-mir-137 | dbDEMC, miRCancer | hsa-mir-181c | dbDEMC, miRCancer |
| hsa-mir-138 | dbDEMC | hsa-mir-363 | dbDEMC |
| hsa-mir-106b | dbDEMC | hsa-mir-372 | unconfirmed |
| hsa-mir-9 | dbDEMC | hsa-mir-28 | unconfirmed |
| hsa-mir-30b | dbDEMC | hsa-mir-20b | dbDEMC |
| hsa-mir-206 | dbDEMC | hsa-mir-23b | dbDEMC |
| hsa-mir-29a | dbDEMC | hsa-mir-18b | dbDEMC |
| hsa-mir-26b | dbDEMC | hsa-mir-134 | dbDEMC |
| hsa-mir-93 | dbDEMC | hsa-mir-302b | dbDEMC |
| hsa-mir-7 | dbDEMC | hsa-mir-22 | dbDEMC |
| hsa-mir-342 | dbDEMC | hsa-mir-335 | dbDEMC |
| hsa-mir-30a | dbDEMC | hsa-mir-574 | dbDEMC |
| hsa-mir-195 | dbDEMC, miRCancer | hsa-mir-184 | dbDEMC |
| hsa-mir-19a | dbDEMC | hsa-mir-181d | dbDEMC |
| hsa-mir-27b | dbDEMC | hsa-mir-149 | dbDEMC |
| hsa-mir-141 | dbDEMC, miRCancer | hsa-mir-885 | unconfirmed |

4 Conclusion
In this paper, we proposed a novel computational model called GBDT-LR to identify potential miRNA-disease associations. GBDT-LR can accurately predict potential associations between miRNAs and diseases, which can be immensely helpful for understanding the pathogenesis of certain diseases at the molecular level and for finding more effective treatments. The performance of GBDT-LR is supported by an average AUC of 0.9274 and an average AUPR of 0.9014 in 5-fold CV.

The excellent prediction performance of GBDT-LR is mainly attributed to the following reasons. First, GBDT-LR can be applied to diseases without known associated miRNAs, which greatly improves its practicability and reliability. Second, GBDT is an iterative method that can improve the prediction accuracy of weak classifiers, and it has a natural advantage in discovering new distinguishing features and feature combinations throughout the iterative process. Third, GBDT-LR can prioritize candidate miRNAs for all investigated diseases simultaneously.

However, GBDT-LR also has some limitations that need to be addressed in the future. For example, we screened negative samples by using k-means clustering to balance positive and negative samples, but it is difficult to obtain reliable negative samples. Besides, multiple data sources (e.g., disease-gene associations, miRNA-gene associations and miRNA sequence information) need to be combined to improve the prediction performance of the model. Lastly, our future research should be based on the database HMDD v3.0, which is already available online and records more relevant information. We hope to develop new computational models that overcome these limitations in the future.
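The combination of boosted trees with logistic regression can be illustrated with a small, dependency-free Python sketch in the style of He et al. (2014): each tree maps a sample to a leaf, the leaf indices are one-hot encoded, and logistic regression is trained on that encoding. This is not the authors' implementation. It assumes depth-1 trees (stumps), hand-picked hyperparameters, synthetic data in place of miRNA-disease features, and a single train/test split in place of 5-fold CV; every function name is hypothetical.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Depth-1 regression tree: pick the split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:  # keeps both sides non-empty
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]


def fit_gbdt(X, y, n_trees=6, shrinkage=0.5):
    """Boost stumps on the negative gradient of the log-loss, y - p."""
    F = [0.0] * len(X)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(fi) for yi, fi in zip(y, F)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t))
        F = [fi + shrinkage * (lm if row[j] <= t else rm) for fi, row in zip(F, X)]
    return stumps


def leaf_one_hot(stumps, row):
    """Encode which leaf of every tree the sample falls into."""
    encoded = []
    for j, t in stumps:
        encoded += [1.0, 0.0] if row[j] <= t else [0.0, 1.0]
    return encoded


def fit_lr(Z, y, epochs=200, step=0.1):
    """Logistic regression trained by plain stochastic gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        for z, yi in zip(Z, y):
            g = sigmoid(b + sum(wi * zi for wi, zi in zip(w, z))) - yi
            w = [wi - step * g * zi for wi, zi in zip(w, z)]
            b -= step * g
    return w, b


def auc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic data: the label is 1 only when both features are large.
rng = random.Random(0)


def sample(n):
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [1 if x1 > 0.5 and x2 > 0.5 else 0 for x1, x2 in X]
    return X, y


X_train, y_train = sample(300)
X_test, y_test = sample(100)

stumps = fit_gbdt(X_train, y_train)
w, b = fit_lr([leaf_one_hot(stumps, row) for row in X_train], y_train)
test_scores = [
    sigmoid(b + sum(wi * zi for wi, zi in zip(w, leaf_one_hot(stumps, row))))
    for row in X_test
]
print("held-out AUC:", round(auc(test_scores, y_test), 3))
```

With deeper trees, each leaf corresponds to a conjunction of several feature thresholds, which is one way to read the statement that GBDT discovers feature combinations; the stumps used here keep the sketch short at the cost of that expressiveness.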
Author Statement
Su Zhou: Conceptualization, Methodology, Software, Validation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing
Shulin Wang: Writing - Original Draft, Supervision
Qi Wu: Conceptualization, Methodology, Writing - Original Draft
Riasat Azim: Data Curation, Software, Writing - Review & Editing
Wen Li: Investigation, Writing - Review & Editing
Conflict of interest
The authors declare that they have no conflict of interest.
Funding
This work was supported by grants from the National Natural Science Foundation of China (Grant Nos. 61672011 and 61472467) and the National Key R&D Program of China (2017YFC1311003).
References
Ambros, V., 2001. microRNAs: Tiny Regulators with Great Potential. Cell 107, 823–826.
Ashrafi, K., Chang, F.Y., Watts, J.L., Fraser, A.G., Kamath, R.S., Ahringer, J., Ruvkun, G., 2003. Genome-wide RNAi analysis of Caenorhabditis elegans fat regulatory genes. Nature 421, 268–272. https://doi.org/10.1038/nature01279
Barh, D., Bhat, D., Viero, C., 2010. miReg: a resource for microRNA regulation. J. Integr. Bioinform. 7. https://doi.org/10.1515/jib-2010-144
Barh, D., Jain, N., Tiwari, S., Field, J.K., Padin-Iruegas, E., Ruibal, A., López, R., Herranz, M., Bhattacharya, A., Juneja, L., Viero, C., Silva, A., Miyoshi, A., Kumar, A., Blum, K., Azevedo, V., Ghosh, P., Liloglou, T., 2013. A novel in silico reverse-transcriptomics-based identification and blood-based validation of a panel of sub-type specific biomarkers in lung cancer. BMC Genomics 14, S5. https://doi.org/10.1186/1471-2164-14-S6-S5
Barh, D., Kamapantula, B., Jain, N., Nalluri, J., Bhattacharya, A., Juneja, L., Barve, N., Tiwari, S., Miyoshi, A., Azevedo, V., Blum, K., Kumar, A., Silva, A., Ghosh, P., 2015. miRegulome: a knowledge-base of miRNA regulomics and analysis. Sci. Rep. 5, 12832. https://doi.org/10.1038/srep12832
Bou Kheir, T., Futoma-Kazmierczak, E., Jacobsen, A., Krogh, A., Bardram, L., Hother, C., Grønbæk, K., Federspiel, B., Lund, A.H., Friis-Hansen, L., 2011. miR-449 inhibits cell proliferation and is down-regulated in gastric cancer. Mol. Cancer 10, 29. https://doi.org/10.1186/1476-4598-10-29
Brenner, B., Hoshen, M.B., Purim, O., David, M. Ben, Ashkenazi, K., Marshak, G., Kundel, Y., Brenner, R., Morgenstern, S., Halpern, M., Rosenfeld, N., Chajut, A., Niv, Y., Kushnir, M., 2011. MicroRNAs as a potential prognostic factor in gastric cancer. World J. Gastroenterol. 17, 3976–3985. https://doi.org/10.3748/wjg.v17.i35.3976
Chen, M., Wang, M., Xu, S., Guo, X., Jiang, J., 2015. Upregulation of miR-181c contributes to chemoresistance in pancreatic cancer by inactivating the Hippo signaling pathway. Oncotarget 6, 44466–44479. https://doi.org/10.18632/oncotarget.6298
Chen, X., Guo, X., Zhang, H., Xiang, Y., Chen, J., Yin, Y., Cai, X., Wang, K., Wang, G., Ba, Y., Zhu, L., Wang, J., Yang, R., Zhang, Y., Ren, Z., Zen, K., Zhang, J., Zhang, C.-Y., 2009. Role of miR-143 targeting KRAS in colorectal tumorigenesis. Oncogene 28, 1385–1392. https://doi.org/10.1038/onc.2008.474
Chen, X., Huang, L., 2017. LRSSLMDA: Laplacian Regularized Sparse Subspace Learning for MiRNA-Disease Association prediction. PLOS Comput. Biol. 13, e1005912. https://doi.org/10.1371/journal.pcbi.1005912
Chen, X., Wang, C.-C., Yin, J., You, Z.-H., 2018a. Novel Human miRNA-Disease Association Inference Based on Random Forest. Mol. Ther. - Nucleic Acids 13, 568–579. https://doi.org/10.1016/j.omtn.2018.10.005
Chen, X., Wang, L., Qu, J., Guan, N.-N., Li, J.-Q., 2018b. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics 34, 4256–4265. https://doi.org/10.1093/bioinformatics/bty503
Chen, X., Xie, D., Wang, L., Zhao, Q., You, Z.-H., Liu, H., 2018c. BNPMDA: Bipartite Network Projection for MiRNA–Disease Association prediction. Bioinformatics 34, 3178–3186. https://doi.org/10.1093/bioinformatics/bty333
Chen, X., Xie, D., Zhao, Q., You, Z.-H., 2019a. MicroRNAs and complex diseases: from experimental results to computational models. Brief. Bioinform. 20, 515–539. https://doi.org/10.1093/bib/bbx130
Chen, X., Yin, J., Qu, J., Huang, L., 2018d. MDHGI: Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction. PLOS Comput. Biol. 14, e1006418. https://doi.org/10.1371/journal.pcbi.1006418
Chen, X., Zhu, C.-C., Yin, J., 2019b. Ensemble of decision tree reveals potential miRNA-disease associations. PLOS Comput. Biol. 15, e1007209. https://doi.org/10.1371/journal.pcbi.1007209
De Mena, L., Coto, E., Cardo, L.F., Díaz, M., Blázquez, M., Ribacoba, R., Salvado, C., Pastor, P., Samaranch, Ll., Moris, G., Menéndez, M., Corao, A.I., Alvarez, V., 2010. Analysis of the microRNA-133 and PITX3 genes in Parkinson’s disease. Am. J. Med. Genet. Part B Neuropsychiatr. Genet. 153, 1234–1239. https://doi.org/10.1002/ajmg.b.31086
Fineberg, S.K., Kosik, K.S., Davidson, B.L., 2009. MicroRNAs Potentiate Neural Development. Neuron 64, 303–309. https://doi.org/10.1016/j.neuron.2009.10.020
Friedman, J.H., 2001. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232. https://doi.org/10.2307/2699986
He, X., Bowers, S., Candela, J.Q., Pan, J., Jin, O., Xu, Tianbing, Liu, B., Xu, Tao, Shi, Y., Atallah, A., Herbrich, R., 2014. Practical Lessons from Predicting Clicks on Ads at Facebook, in: Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining ADKDD’14. ACM Press, New York, New York, USA, pp. 1–9. https://doi.org/10.1145/2648584.2648589
Hebert, S.S., Horre, K., Nicolai, L., Papadopoulou, A.S., Mandemakers, W., Silahtaroglu, A.N., Kauppinen, S., Delacourte, A., De Strooper, B., 2008. Loss of microRNA cluster miR-29a/b-1 in sporadic Alzheimer’s disease correlates with increased BACE1/β-secretase expression. Proc. Natl. Acad. Sci. 105, 6415–6420. https://doi.org/10.1073/pnas.0710263105
Ho, A.S., Huang, X., Cao, H., Christman-Skieller, C., Bennewith, K., Le, Q.-T., Koong, A.C., 2010. Circulating miR-210 as a Novel Hypoxia Marker in Pancreatic Cancer. Transl. Oncol. 3, 109–113. https://doi.org/10.1593/tlo.09256
Huang, J., Wang, F., Argyris, E., Chen, K., Liang, Z., Tian, H., Huang, W., Squires, K., Verlinghieri, G., Zhang, H., 2007. Cellular microRNAs contribute to HIV-1 latency in resting primary CD4+ T lymphocytes. Nat. Med. 13, 1241–1247. https://doi.org/10.1038/nm1639
Huang, Y., Shen, X.J., Zou, Q., Wang, S.P., Tang, S.M., Zhang, G.Z., 2011. Biological functions of microRNAs: a review. J. Physiol. Biochem. 67, 129–139. https://doi.org/10.1007/s13105-010-0050-6
Huang, Z., Shi, J., Gao, Y., Cui, C., Zhang, S., Li, J., Zhou, Y., Cui, Q., 2019. HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 47, D1013–D1017. https://doi.org/10.1093/nar/gky1010
Janssen, H.L.A., Reesink, H.W., Lawitz, E.J., Zeuzem, S., Rodriguez-Torres, M., Patel, K., van der Meer, A.J., Patick, A.K., Chen, A., Zhou, Y., Persson, R., King, B.D., Kauppinen, S., Levin, A.A., Hodges, M.R., 2013. Treatment of HCV Infection by Targeting MicroRNA. N. Engl. J. Med. 368, 1685–1694. https://doi.org/10.1056/NEJMoa1209026
Jiang, Q., Hao, Y., Wang, G., Juan, L., Zhang, T., Teng, M., Liu, Y., Wang, Y., 2010. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst. Biol. 4, S2. https://doi.org/10.1186/1752-0509-4-S1-S2
Jiang, Q., Wang, Y., Hao, Y., Juan, L., Teng, M., Zhang, X., Li, M., Wang, G., Liu, Y., 2009. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 37, D98–D104. https://doi.org/10.1093/nar/gkn714
Kim, J., Inoue, K., Ishii, J., Vanti, W.B., Voronov, S. V., Murchison, E., Hannon, G., Abeliovich, A., 2007. A MicroRNA Feedback Circuit in Midbrain Dopamine Neurons. Science 317, 1220–1224. https://doi.org/10.1126/science.1140481
Kumar, P., Dezso, Z., MacKenzie, C., Oestreicher, J., Agoulnik, S., Byrne, M., Bernier, F., Yanagimachi, M., Aoshima, K., Oda, Y., 2013. Circulating miRNA Biomarkers for Alzheimer’s Disease. PLoS One 8, e69807. https://doi.org/10.1371/journal.pone.0069807
Lee, R.C., Feinbaum, R.L., Ambros, V., 1993. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843–854. https://doi.org/10.1016/0092-8674(93)90529-Y
Li, A., Omura, N., Hong, S.-M., Vincent, A., Walter, K., Griffith, M., Borges, M., Goggins, M., 2010. Pancreatic Cancers Epigenetically Silence SIP1 and Hypomethylate and Overexpress miR-200a/200b in Association with Elevated Circulating miR-200a and miR-200b Levels. Cancer Res. 70, 5226–5237. https://doi.org/10.1158/0008-5472.CAN-09-4227
Li, Y., Qiu, C., Tu, J., Geng, B., Yang, J., Jiang, T., Cui, Q., 2014. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 42, D1070–D1074. https://doi.org/10.1093/nar/gkt1023
Li, Y., Zhuang, L., Wang, Y., Hu, Y., Wu, Y., Wang, D., Xu, J., 2013. Connect the dots. Autophagy 9, 436–439. https://doi.org/10.4161/auto.23096
Liu, B., Fang, L., Liu, F., Wang, X., Chen, J., Chou, K.-C., 2015. Identification of Real MicroRNA Precursors with a Pseudo Structure Status Composition Approach. PLoS One 10, e0121501. https://doi.org/10.1371/journal.pone.0121501
Ma, L., Teruya-Feldstein, J., Weinberg, R.A., 2007. Tumour invasion and metastasis initiated by microRNA-10b in breast cancer. Nature 449, 682–688. https://doi.org/10.1038/nature06174
Nalluri, J.J., Kamapantula, B.K., Barh, D., Jain, N., Bhattacharya, A., de Almeida, S.S., Juca Ramos, R.T., Silva, A., Azevedo, V., Ghosh, P., 2015. DISMIRA: Prioritization of disease candidates in miRNA-disease associations based on maximum weighted matching inference model and motif-based analysis. BMC Genomics 16, S12. https://doi.org/10.1186/1471-2164-16-S5-S12
Onyeagucha, B.C., Mercado-Pimentel, M.E., Hutchison, J., Flemington, E.K., Nelson, M.A., 2013. S100P/RAGE signaling regulates microRNA-155 expression via AP-1 activation in colon cancer. Exp. Cell Res. 319, 2081–2090. https://doi.org/10.1016/j.yexcr.2013.05.009
Pavithra, D., Sabitha, K., Rajkumar, T., 2018. Identification of small molecule inhibitors for differentially expressed miRNAs in gastric cancer. Comput. Biol. Chem. 77, 442–454. https://doi.org/10.1016/j.compbiolchem.2018.07.013
Petrocca, F., Visone, R., Onelli, M.R., Shah, M.H., Nicoloso, M.S., de Martino, I., Iliopoulos, D., Pilozzi, E., Liu, C.-G., Negrini, M., Cavazzini, L., Volinia, S., Alder, H., Ruco, L.P., Baldassarre, G., Croce, C.M., Vecchione, A., 2008. E2F1-Regulated MicroRNAs Impair TGFβ-Dependent Cell-Cycle Arrest and Apoptosis in Gastric Cancer. Cancer Cell 13, 272–286. https://doi.org/10.1016/j.ccr.2008.02.013
Rahman, M.R., Islam, T., Turanli, B., Zaman, T., Faruquee, H.M., Rahman, M.M., Mollah, M.N.H., Nanda, R.K., Arga, K.Y., Gov, E., Moni, M.A., 2019. Network-based approach to identify molecular signatures and therapeutic agents in Alzheimer’s disease. Comput. Biol. Chem. 78, 431–439. https://doi.org/10.1016/j.compbiolchem.2018.12.011
Rayhan, F., Ahmed, S., Shatabda, S., Farid, D.M., Mousavian, Z., Dehzangi, A., Rahman, M.S., 2017. IDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting. Sci. Rep. 7, 1–18. https://doi.org/10.1038/s41598-017-18025-2
Song, B., Wang, Y., Titmus, M.A., Botchkina, G., Formentini, A., Kornmann, M., Ju, J., 2010. Molecular mechanism of chemoresistance by miR-215 in osteosarcoma and colon cancer cells. Mol. Cancer 9, 96. https://doi.org/10.1186/1476-4598-9-96
Taganov, K.D., Boldin, M.P., Chang, K.-J., Baltimore, D., 2006. NF-κB-dependent induction of microRNA miR-146, an inhibitor targeted to signaling proteins of innate immune responses. Proc. Natl. Acad. Sci. 103, 12481–12486. https://doi.org/10.1073/pnas.0605298103
Tazawa, H., Tsuchiya, N., Izumiya, M., Nakagama, H., 2007. Tumor-suppressive miR-34a induces senescence-like growth arrest through modulation of the E2F pathway in human colon cancer cells. Proc. Natl. Acad. Sci. 104, 15472–15477. https://doi.org/10.1073/pnas.0707351104
Tie, J., Pan, Y., Zhao, L., Wu, K., Liu, J., Sun, S., Guo, X., Wang, B., Gang, Y., Zhang, Y., Li, Q., Qiao, T., Zhao, Q., Nie, Y., Fan, D., 2010. MiR-218 Inhibits Invasion and Metastasis of Gastric Cancer by Targeting the Robo1 Receptor. PLoS Genet. 6, e1000879. https://doi.org/10.1371/journal.pgen.1000879
Valeri, N., Braconi, C., Gasparini, P., Murgia, C., Lampis, A., Paulus-Hock, V., Hart, J.R., Ueno, L., Grivennikov, S.I., Lovat, F., Paone, A., Cascione, L., Sumani, K.M., Veronese, A., Fabbri, M., Carasi, S., Alder, H., Lanza, G., Gafa’, R., Moyer, M.P., Ridgway, R.A., Cordero, J., Nuovo, G.J., Frankel, W.L., Rugge, M., Fassan, M., Groden, J., Vogt, P.K., Karin, M., Sansom, O.J., Croce, C.M., 2014. MicroRNA-135b Promotes Cancer Progression by Acting as a Downstream Effector of Oncogenic Pathways in Colon Cancer. Cancer Cell 25, 469–483. https://doi.org/10.1016/j.ccr.2014.03.006
van Laarhoven, T., Nabuurs, S.B., Marchiori, E., 2011. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27, 3036–3043. https://doi.org/10.1093/bioinformatics/btr500
Volinia, S., Galasso, M., Sana, M.E., Wise, T.F., Palatini, J., Huebner, K., Croce, C.M., 2012. Breast cancer signatures for invasiveness and prognosis defined by deep sequencing of microRNA. Proc. Natl. Acad. Sci. 109, 3024–3029. https://doi.org/10.1073/pnas.1200010109
Wang, D., Wang, J., Lu, M., Song, F., Cui, Q., 2010. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 26, 1644–1650. https://doi.org/10.1093/bioinformatics/btq241
Wang, L., You, Z.-H., Chen, X., Li, Y.-M., Dong, Y.-N., Li, L.-P., Zheng, K., 2019. LMTRDA: Using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities. PLOS Comput. Biol. 15, e1006865. https://doi.org/10.1371/journal.pcbi.1006865
Xiao, Q., Luo, J., Liang, C., Cai, J., Ding, P., 2018. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics 34, 239–248. https://doi.org/10.1093/bioinformatics/btx545
Xie, B., Ding, Q., Han, H., Wu, D., 2013. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics 29, 638–644. https://doi.org/10.1093/bioinformatics/btt014
Xuan, P., Han, K., Guo, M., Guo, Y., Li, Jinbao, Ding, J., Liu, Y., Dai, Q., Li, Jin, Teng, Z., Huang, Y., 2013. Prediction of microRNAs Associated with Human Diseases Based on Weighted k Most Similar Neighbors. PLoS One 8, e70204. https://doi.org/10.1371/journal.pone.0070204
Yang, L., 2006. Incidence and mortality of gastric cancer in China. World J. Gastroenterol. 12, 17–20. https://doi.org/10.3748/wjg.v12.i1.17
Yang, Z., Ren, F., Liu, C., He, S., Sun, G., Gao, Q., Yao, L., Zhang, Y., Miao, R., Cao, Y., Zhao, Y., Zhong, Y., Zhao, H., 2010. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics 11, S5. https://doi.org/10.1186/1471-2164-11-S4-S5
Zhang, Z., Li, Z., Gao, C., Chen, P., Chen, J., Liu, W., Xiao, S., Lu, H., 2008. miR-21 plays a pivotal role in gastric cancer pathogenesis and progression. Lab. Investig. 88, 1358–1366. https://doi.org/10.1038/labinvest.2008.94
Zhao, G., Wang, B., Liu, Y., Zhang, J., Deng, S., Qin, Q., Tian, K., Li, X., Zhu, S., Niu, Y., Gong, Q., Wang, C., 2013. miRNA-141, Downregulated in Pancreatic Cancer, Inhibits Cell Proliferation and Invasion by Directly Targeting MAP4K4. Mol. Cancer Ther. 12, 2569–2580. https://doi.org/10.1158/1535-7163.MCT-13-0296
Zhao, Y., Chen, X., Yin, J., 2019. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics 35, 4730–4738. https://doi.org/10.1093/bioinformatics/btz297