Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression


Journal Pre-proof

Su Zhou (Conceptualization) (Methodology) (Software) (Validation) (Resources) (Data curation) (Writing - original draft) (Writing - review and editing), Shulin Wang (Writing - original draft) (Supervision), Qi Wu (Conceptualization) (Methodology) (Writing - original draft), Riasat Azim (Data curation) (Software) (Writing - review and editing), Wen Li (Investigation) (Writing - review and editing)

PII: S1476-9271(19)30809-6
DOI: https://doi.org/10.1016/j.compbiolchem.2020.107200
Reference: CBAC 107200
To appear in: Computational Biology and Chemistry
Received Date: 12 September 2019
Revised Date: 4 January 2020
Accepted Date: 5 January 2020

Please cite this article as: Zhou S, Wang S, Wu Q, Azim R, Li W, Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression, Computational Biology and Chemistry (2020), doi: https://doi.org/10.1016/j.compbiolchem.2020.107200

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier.

Su Zhou (a), Shulin Wang (a,*), Qi Wu (b), Riasat Azim (a), Wen Li (a)

(a) College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
(b) College of Basic Medicine, Changsha Medical University, Changsha 410219, China

*Corresponding author. E-mail address: [email protected]

Highlights

- A combinatorial model that combines gradient boosting decision tree with logistic regression (GBDT-LR) was proposed.
- The gradient boosting decision tree model has a natural advantage in finding many distinguishing features and feature combinations.
- GBDT-LR obtained an AUC of 0.9293 and an AP of 0.9043 in 5-fold cross-validation. Furthermore, at least 88% of the predicted top 50 potential disease-related miRNAs were confirmed by databases.

Abstract

MicroRNAs (miRNAs) have been shown to play an indispensable role in many fundamental biological processes, and the dysregulation of miRNAs is closely correlated with complex human diseases. Many studies have focused on the prediction of potential miRNA-disease associations. Considering the insufficient number of known miRNA-disease associations and the poor performance of many existing prediction methods, a novel model combining gradient boosting decision tree with logistic regression (GBDT-LR) is proposed to prioritize miRNA candidates for diseases. To balance positive and negative samples, GBDT-LR first adopts k-means clustering to screen negative samples from unknown miRNA-disease associations. Then, the gradient boosting decision tree (GBDT) model, which has an intrinsic advantage in finding many distinguishing features and feature combinations, is applied to extract features. Finally, the new features extracted by the GBDT model are input into a logistic regression (LR) model to predict the final miRNA-disease association score. The experimental results show that the average AUC of GBDT-LR in 5-fold cross-validation (CV) reaches 0.9274. In the case studies, 90%, 94% and 88% of the top 50 miRNAs potentially associated with colon cancer, gastric cancer, and pancreatic cancer, respectively, were confirmed by databases. Compared with three other state-of-the-art methods, GBDT-LR achieves the best prediction performance. The source code and dataset of GBDT-LR are freely available at https://github.com/Pualalala/GBDT-LR.

Keywords: miRNAs; diseases; miRNA-disease association; gradient boosting decision tree; logistic regression

1 Introduction

MicroRNAs (miRNAs) are a class of non-coding single-stranded RNA molecules, about 22 nucleotides in length, encoded by endogenous genes (Ambros, 2001). The first miRNA, lin-4, was discovered in Caenorhabditis elegans in 1993 (Lee et al., 1993), but this discovery was regarded as a special case and did not receive the attention of the scientific community at that time. It was not until 2003, when a second miRNA named let-7 was found to regulate the development of Caenorhabditis elegans by binding to the 3’-UTRs of target mRNAs (Ashrafi et al., 2003), that researchers gradually realized that miRNAs might be a widespread mechanism of gene expression regulation. Since then, hundreds of miRNA molecules have been identified in plants, animals, and viruses (Huang et al., 2011; Li et al., 2013; Liu et al., 2015). Furthermore, existing studies have shown that miRNAs are involved in almost all biological processes of living organisms, such as cell growth, differentiation, apoptosis, immune reaction, neural development, and tumor invasion (Barh et al., 2010; Fineberg et al., 2009; Ma et al., 2007; Taganov et al., 2006).

Studies also show that miRNAs are implicated in the development of various diseases (Jiang et al., 2009), such as cancer (Volinia et al., 2012), Parkinson’s disease (Kim et al., 2007), Alzheimer’s disease (Kumar et al., 2013; Rahman et al., 2019) and immune-related diseases (Huang et al., 2007). For example, researchers observed that a decrease in miR-143 content in colorectal cancer could result in an increase in its target gene KRAS and promote the proliferation of cancer cells (Chen et al., 2009). In addition, the feedback loop formed by miR-133b and its target gene PITX3 was essential for maintaining dopaminergic neurons in the brain, and abnormal expression of miR-133b may lead to Parkinson’s disease (De Mena et al., 2010). Furthermore, decreased expression of miR-29a, miR-29b-1 and miR-9 in the brains of Alzheimer’s disease patients leads to an abnormal increase in their target gene BACE1 and raises the incidence of Alzheimer’s disease (Hebert et al., 2008). Identifying potential miRNA-disease associations is therefore of great significance for biomarker detection in the diagnosis (Barh et al., 2013), treatment (Janssen et al., 2013) and prognosis (Brenner et al., 2011) of complex human diseases. Since miRNAs play such a broad role, an effective and accurate computational model is needed to reveal the potential associations between diseases and miRNAs.

Many viable computational methods have been proposed for predicting disease-related miRNAs (Chen et al., 2019a). Among them, similarity measure-based models and machine learning-based models are the two main families. The similarity measure-based methods mainly rely on the hypothesis that functionally related miRNAs tend to be associated with phenotypically similar diseases. For example, Nalluri et al. (2015) proposed a maximum weighted matching model called DISMIRA, which was based on a bipartite graph, to predict disease-related miRNA candidates. Chen et al. (2018c) applied a bipartite network projection model to predict potential miRNA-disease associations (BNPMDA). However, neither DISMIRA nor BNPMDA can be used for diseases without any known related miRNAs. In addition, Jiang et al. (2010) constructed a cumulative hypergeometric distribution model to infer potential miRNA-disease associations based on a functionally related miRNA network and a human phenome-microRNAome network; however, its excessive reliance on predicted miRNA-target associations can easily lead to false positives and false negatives. Moreover, Chen et al. (2018d) proposed another computational method called MDHGI, which combined matrix decomposition with heterogeneous graph inference to calculate potential miRNA-disease association scores. The four methods mentioned above can be classified as local network similarity-based computational models. Because of the inherent shortcomings of local network similarity-based models, global network similarity measure-based models have emerged. For example, Chen, Huang et al. (2017) developed a method named LRSSLMDA, whose primary principle was to apply sparse subspace learning with Laplacian regularization to the known miRNA-disease association network and informative feature profiles in order to identify miRNA-disease associations accurately. Besides, Chen et al. (2018b) proposed a semi-supervised model based on a low-rank inductive matrix completion algorithm for miRNA-disease association prediction (IMCMDA). The advantage of IMCMDA is that it does not need negative samples. Moreover, Xiao et al. (2018) developed a prediction model called GRNMF, which utilized graph regularized non-negative matrix factorization on heterogeneous omics data to identify potential miRNA-disease correlations and could work for new diseases (miRNAs) or for diseases (miRNAs) with sparse known associations.

In the field of machine learning, Wang et al. (2019) proposed a model called LMTRDA to predict the associations between miRNAs and diseases by fusing information from multiple sources, including miRNA sequences, miRNA functional similarity, disease semantic similarity, and known miRNA-disease associations. Chen et al. (2018a) developed a novel model based on Random Forest for predicting miRNA-disease associations (RFMDA). Furthermore, Chen et al. (2019b) introduced another computational method called EDTMDA, built on ensemble learning and dimensionality reduction, to infer disease-causing miRNAs. However, both RFMDA and EDTMDA select negative samples at random from the unknown miRNA-disease associations, which can significantly affect their prediction performance. Recently, Zhao et al. (2019) presented a method called ABMDA for miRNA-disease association prediction based on the boosting algorithm. ABMDA improves prediction accuracy by integrating 20 weak classifiers, combined according to their weights, into a robust classifier.

Although previously proposed methods can effectively promote future research on predicting miRNA-disease associations, they have their limitations. In this study, a combinatorial model that combines gradient boosting decision tree with logistic regression (GBDT-LR) is designed to infer the associations between miRNAs and diseases by integrating known miRNA-disease associations, miRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity. GBDT is a boosting algorithm that belongs to the category of ensemble learning. It was first proposed by Friedman (Friedman, 2001) and has been successfully applied to click prediction systems (He et al., 2014). One significant advantage of GBDT is that it can integrate multiple weak classifiers into a powerful classifier; another is its excellent ability to combine features automatically and its efficient operation. In GBDT-LR, we first extract new features with the GBDT model and then adopt the LR model to complete the final classification. To evaluate the performance of our model, 5-fold cross-validation and three case studies were carried out. The experimental results indicate that the average AUC of GBDT-LR reached 0.9274 and the average AUPR was 0.9014. Moreover, in the case studies, 45, 47 and 44 of the top 50 predicted miRNAs for colon cancer, gastric cancer, and pancreatic cancer, respectively, were confirmed by databases and the literature. In general, our model performs better than three other state-of-the-art models and can be effectively applied to identify potential associations between miRNAs and diseases.

2 Materials and methods

2.1 Human miRNA-disease associations

Researchers have collated the miRNA-disease associations confirmed by biological experiments and built several miRNA-disease association databases, such as HMDD (Huang et al., 2019), dbDEMC (Yang et al., 2010), miRCancer (Xie et al., 2013), and miRegulome (Barh et al., 2015). According to the records of HMDD v2.0 (Li et al., 2014), there are 5430 identified miRNA-disease associations involving 495 miRNAs and 383 diseases. We constructed an adjacency matrix $A$ to describe the associations between disease $d(i)$ and miRNA $m(j)$. Specifically, if there is an identified association between disease $d(i)$ and miRNA $m(j)$, the entry $A(i, j)$ is equal to 1, otherwise 0. The numbers of miRNAs and diseases investigated in our study are denoted by $nm$ and $nd$, respectively. In the case studies, two other independent databases, dbDEMC and miRCancer, were used to validate the predicted miRNA-disease association lists.
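As a minimal sketch of how such an adjacency matrix might be assembled, the snippet below builds $A$ from a list of (disease, miRNA) pairs. The function name `build_adjacency` and the toy pairs are hypothetical and are not taken from HMDD or from the authors' repository; rows index diseases and columns index miRNAs, as in the text.

```python
def build_adjacency(pairs, diseases, mirnas):
    """Return the nd x nm binary matrix A with A[i][j] = 1 for known pairs."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = [[0] * len(mirnas) for _ in diseases]
    for disease, mirna in pairs:
        A[d_index[disease]][m_index[mirna]] = 1
    return A

# Toy data for illustration only (hypothetical associations).
diseases = ["disease_1", "disease_2"]
mirnas = ["mirna_1", "mirna_2", "mirna_3"]
pairs = [("disease_1", "mirna_1"),
         ("disease_2", "mirna_1"),
         ("disease_2", "mirna_2")]

A = build_adjacency(pairs, diseases, mirnas)
for row in A:
    print(row)
```

With the real data, the same construction would yield a 383 x 495 matrix containing 5430 ones.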

2.2 MiRNA functional similarity

Wang et al. (2010) calculated miRNA functional similarity and concluded that functionally related miRNAs tend to be associated with phenotypically similar diseases. We downloaded the similarity data from http://www.cuilab.cn/files/images/cuilab/misim.zip and constructed the symmetric miRNA functional similarity matrix $FS$, in which the entry $FS(i, j)$ denotes the functional similarity score between miRNAs $m(i)$ and $m(j)$.

2.3 Disease semantic similarity

The MeSH database (http://www.ncbi.nlm.nih.gov/) is an authoritative Medical Subject Headings system, which plays a crucial role in bioinformatics research and is widely used to obtain the associations between diseases. Based on the MeSH database, we constructed a Directed Acyclic Graph (DAG) for each disease. A disease $D$ can be depicted as $DAG(D) = (D, T(D), E(D))$, where $T(D)$ is the node set composed of all ancestor nodes of disease $D$ and $D$ itself, and $E(D)$ is the edge set of direct links from a parent node to a child node. Following the method of Xuan et al. (2013), the contribution of disease $d$ to the semantic value of disease $D$ can be computed as follows:

$$
D1_D(d) =
\begin{cases}
1, & \text{if } d = D, \\
\max\{\Delta \cdot D1_D(d') \mid d' \in \text{children of } d\}, & \text{if } d \neq D,
\end{cases}
\tag{1}
$$

where $\Delta$ is the semantic contribution decay factor. Following the previous literature (Wang et al., 2010), the value of $\Delta$ was set to 0.5. The contribution of $D$ to its own semantic value is 1, whereas the contribution of an ancestor disease to the semantic value of $D$ decreases as the distance between them increases. The semantic value of disease $D$ can then be calculated as follows:

$$
DV1(D) = \sum_{d \in T(D)} D1_D(d). \tag{2}
$$

Under the hypothesis that the more of their DAGs two diseases share, the higher their semantic similarity, the semantic similarity between two diseases $d(i)$ and $d(j)$ can be computed as follows:

$$
SS1(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D1_{d(i)}(t) + D1_{d(j)}(t) \right)}{DV1(d(i)) + DV1(d(j))}, \tag{3}
$$

where the entry $SS1(d(i), d(j))$ of matrix $SS1$ denotes the semantic similarity score between diseases $d(i)$ and $d(j)$.

However, the above model ignores another critical point: two diseases in the same layer of $DAG(D)$ may appear in different numbers of disease DAGs, and their contributions to disease $D$ should differ accordingly. Specifically, a disease that occurs in more DAGs should contribute less than one that occurs in fewer. We therefore introduced disease semantic similarity model 2 as a complement to model 1, with the contribution calculated as follows:

$$
D2_D(d) = -\log \frac{\text{the number of DAGs including } d}{\text{the number of diseases}}. \tag{4}
$$

The semantic value of disease $D$ and the semantic similarity between diseases $d(i)$ and $d(j)$ are calculated in the same way as in model 1:

$$
DV2(D) = \sum_{d \in T(D)} D2_D(d), \tag{5}
$$

$$
SS2(d(i), d(j)) = \frac{\sum_{t \in T(d(i)) \cap T(d(j))} \left( D2_{d(i)}(t) + D2_{d(j)}(t) \right)}{DV2(d(i)) + DV2(d(j))}. \tag{6}
$$

To compute the semantic similarity more reasonably, we combined the two models. The final semantic similarity $SS$ is calculated as follows:

$$
SS(d(i), d(j)) = \frac{SS1(d(i), d(j)) + SS2(d(i), d(j))}{2}. \tag{7}
$$
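Eqs. (1)-(7) can be traced on a small example. The sketch below is an illustration under stated assumptions, not the authors' code: the four-node DAG (root `R`, child `A`, grandchildren `B` and `C`), the `parents` mapping, and all function names are hypothetical, and the natural logarithm is assumed in Eq. (4).

```python
import math

DELTA = 0.5  # semantic contribution decay factor from the text

def contributions_model1(disease, parents):
    """D1_D(d) for every d in T(D): 1 for D itself, decayed by DELTA per level (Eq. 1)."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            value = DELTA * contrib[node]
            if value > contrib.get(parent, 0.0):  # keep the max over children
                contrib[parent] = value
                stack.append(parent)
    return contrib

def contributions_model2(disease, parents, all_diseases):
    """D2_D(d) = -log(share of disease DAGs containing d) for every d in T(D) (Eq. 4)."""
    ancestor_sets = {d: set(contributions_model1(d, parents)) for d in all_diseases}
    n = len(all_diseases)
    return {t: -math.log(sum(t in s for s in ancestor_sets.values()) / n)
            for t in ancestor_sets[disease]}

def semantic_similarity(c_i, c_j):
    """Shared contributions over the sum of semantic values (Eqs. 3 and 6)."""
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    denominator = sum(c_i.values()) + sum(c_j.values())
    return numerator / denominator if denominator else 0.0

# Hypothetical toy hierarchy: child -> list of parents.
parents = {"A": ["R"], "B": ["A"], "C": ["A"]}
all_diseases = ["R", "A", "B", "C"]

ss1 = semantic_similarity(contributions_model1("B", parents),
                          contributions_model1("C", parents))
ss2 = semantic_similarity(contributions_model2("B", parents, all_diseases),
                          contributions_model2("C", parents, all_diseases))
ss = (ss1 + ss2) / 2  # Eq. (7)
print(round(ss1, 4), round(ss2, 4), round(ss, 4))
```

In this toy case the shared ancestors `A` and `R` contribute 0.5 and 0.25 to each of `B` and `C` under model 1, so $SS1 = 1.5 / 3.5 = 3/7$. Under model 2 the root appears in every DAG and contributes nothing, which lowers $SS2$ relative to $SS1$.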

2.4 Gaussian interaction profile kernel similarity for diseases

According to the basic hypothesis mentioned earlier, miRNAs with similar functions are closely related to diseases with similar phenotypes, so Gaussian interaction profile (GIP) kernel similarity was introduced for further similarity analysis. We used the binary vector $IP(d(i))$, the $i$-th row of the adjacency matrix $A$, to denote the associations between $d(i)$ and each miRNA. Next, we calculated the GIP kernel similarity between diseases $d(i)$ and $d(j)$ as follows:

$$
KD(d(i), d(j)) = \exp\left( -r_d \left\| IP(d(i)) - IP(d(j)) \right\|^2 \right), \tag{8}
$$

where $KD$ is a symmetric matrix consisting of the GIP kernel similarities of all investigated diseases, and $r_d$ is an adjustable kernel bandwidth parameter that is calculated from a normalized bandwidth parameter $r'_d$ as follows:

$$
r_d = r'_d \Big/ \left( \frac{1}{nd} \sum_{i=1}^{nd} \left\| IP(d(i)) \right\|^2 \right). \tag{9}
$$

In Eq. (9), following the previous literature (van Laarhoven et al., 2011), $r'_d$ is simply set to 1.

2.5 Gaussian interaction profile kernel similarity for miRNAs

Similarly, the GIP kernel similarity between miRNAs $m(i)$ and $m(j)$ can be calculated as follows:

$$
KM(m(i), m(j)) = \exp\left( -r_m \left\| IP(m(i)) - IP(m(j)) \right\|^2 \right), \tag{10}
$$

$$
r_m = r'_m \Big/ \left( \frac{1}{nm} \sum_{i=1}^{nm} \left\| IP(m(i)) \right\|^2 \right), \tag{11}
$$

where $KM$ is a symmetric matrix consisting of the GIP kernel similarities of all investigated miRNAs, and the binary vector $IP(m(i))$, the $i$-th column of the adjacency matrix $A$, denotes the interaction profile of miRNA $m(i)$. For the same reason, $r'_m$ is also set to 1.
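The two kernels in Eqs. (8)-(11) differ only in whether rows or columns of $A$ serve as interaction profiles, so one helper can cover both. The following is a small sketch on a toy matrix; the function name `gip_kernel` and the data are hypothetical, and the code assumes at least one profile is non-zero so that the bandwidth is well defined.

```python
import math

def gip_kernel(profiles, r_prime=1.0):
    """Gaussian interaction profile kernel for a list of binary profiles."""
    n = len(profiles)
    # Mean squared norm of the profiles; assumed to be non-zero.
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    r = r_prime / mean_sq_norm  # Eqs. (9) and (11)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            K[i][j] = math.exp(-r * sq_dist)  # Eqs. (8) and (10)
    return K

# Toy adjacency matrix: rows are diseases, columns are miRNAs.
A = [[1, 0, 0],
     [1, 1, 0]]

KD = gip_kernel(A)                            # disease profiles are rows of A
KM = gip_kernel([list(c) for c in zip(*A)])   # miRNA profiles are columns of A
print(KD[0][1], KM[0][2])
```

Here the two disease profiles have squared norms 1 and 2, so $r_d = 1 / 1.5$ and $KD(d(1), d(2)) = \exp(-2/3)$.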

2.6 Integrated similarity for miRNAs and diseases

miRNA functional similarity and GIP kernel similarity each reflect only one aspect of miRNA similarity, and not all miRNA-miRNA pairs have a functional similarity score, so we integrated the two similarities into a single matrix. When a miRNA-miRNA pair has no functional similarity, its final similarity is defined as its GIP kernel similarity. Otherwise, its final similarity is defined as the average of its GIP kernel similarity and its functional similarity. More concretely, the integrated similarity for miRNAs is calculated as follows:

$$
SM(m(i), m(j)) =
\begin{cases}
\dfrac{KM(m(i), m(j)) + FS(m(i), m(j))}{2}, & \text{if } m(i) \text{ and } m(j) \text{ have functional similarity}, \\
KM(m(i), m(j)), & \text{otherwise}.
\end{cases}
\tag{12}
$$

Similarly, the integrated similarity for diseases depends on whether a semantic similarity exists for the disease-disease pair:

$$
SD(d(i), d(j)) =
\begin{cases}
\dfrac{KD(d(i), d(j)) + SS(d(i), d(j))}{2}, & \text{if } d(i) \text{ and } d(j) \text{ have semantic similarity}, \\
KD(d(i), d(j)), & \text{otherwise}.
\end{cases}
\tag{13}
$$
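The piecewise rule in Eqs. (12) and (13) can be sketched in a few lines. How a missing functional or semantic similarity is encoded is not stated in the text; this illustration assumes missing entries are stored as `None`, and the function name `integrate` and the numbers are hypothetical.

```python
def integrate(kernel, other):
    """Average the GIP kernel with the other similarity where it exists,
    and fall back to the GIP kernel alone elsewhere (Eqs. 12 and 13)."""
    n = len(kernel)
    return [[kernel[i][j] if other[i][j] is None
             else (kernel[i][j] + other[i][j]) / 2
             for j in range(n)]
            for i in range(n)]

# Toy 3 x 3 example; None marks a pair without functional similarity (assumption).
KM = [[1.0, 0.4, 0.2],
      [0.4, 1.0, 0.6],
      [0.2, 0.6, 1.0]]
FS = [[1.0, 0.8, None],
      [0.8, 1.0, None],
      [None, None, 1.0]]

SM = integrate(KM, FS)
for row in SM:
    print(row)
```

The same helper applied to $KD$ and $SS$ would give the integrated disease similarity $SD$.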

2.7 GBDT-LR prediction model

Inspired by the research of He et al. (2014), we proposed a novel model that combines gradient boosting decision tree with logistic regression to predict potential miRNA-disease associations. The whole prediction process consists of data preparation, model training, and scoring and sorting of the final prediction results (see Fig. 1).

Fig. 1. The flowchart of GBDT-LR. It mainly consists of three steps: data preparation; model training; scoring and sorting final prediction results.

In the data preparation stage, besides obtaining the two integrated similarity matrices, the main task was to balance positive and negative samples. Here, we refer to the 5430 samples with known associations as positive samples and to the unknown samples as negative samples. Because the number of unknown samples in the dataset was much larger than the number of known association samples, we performed k-means clustering on the unknown samples and randomly extracted a corresponding number of samples from each cluster as negative samples, so that the negative set was roughly equal in size to the positive set. According to the research of Rayhan et al. (2017) and Zhao et al. (2019), the value of k was set to 23. The unknown samples were first divided into 23 clusters. Rather than randomly selecting the same number of negative samples from every cluster, we selected from each cluster a number of negative samples proportional to its share of all unknown samples. In the end, we obtained 5418 negative samples.

The core idea of GBDT-LR is to train an LR model on new features constructed by the GBDT model. GBDT is a commonly used non-linear, iterative model that uses CART trees as weak classifiers. Each iteration of the GBDT model produces a weak classifier, and each weak classifier is trained on the negative gradient of the loss evaluated at the previous weak classifier. We used a grid search strategy with 5-fold CV on the original data to select the optimal parameters of GBDT, and found that the optimal number of trees (that is, the number of iterations or weak classifiers) is 12. During the grid search, we divided the original data into three parts: a training set, a validation set and a testing set. The three parts do not intersect: the training set is used for model training, the validation set is used to adjust parameters, and the testing set is used to evaluate model performance. This procedure helps to avoid over-fitting. The inputs of the GBDT model are the training sample set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i$ is a training sample and $y_i$ is its label, the number of iterations $M = 12$, and the log-likelihood loss function $L(y, f(x))$; the outputs are the new features. The calculation can be summarized in three steps. Step 1 initializes the weak classifier as follows:

$$
f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c), \tag{14}
$$

where $f_0(x)$ must be a constant. Since all samples are initially located in the root node of the decision tree, it is usually set directly to the average value of all sample labels. Step 2 updates the weak classifier and itself consists of three sub-steps. First, the negative gradient for each sample $i$ is calculated as follows:

$$
r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{m-1}(x)}, \tag{15}
$$

where $f_{m-1}(x)$ denotes the $(m-1)$-th weak classifier, $y_i$ is the true label of the $i$-th sample, and $f(x_i)$ is the predicted probability of the label. The variable $r_{im}$ represents the negative gradient of the $i$-th sample in the $m$-th weak classifier. The data $(x_i, r_{im})$ are used to fit a CART regression tree and obtain its leaf node regions $R_{jm}$, where $j = 1, 2, \ldots, J$ and $J$ is the number of leaf nodes of this regression tree. Second, for each leaf node region, the best fitting value is calculated as follows:

$$
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma), \tag{16}
$$

where $\gamma_{jm}$ represents the best fitting value of the $j$-th leaf node in the $m$-th regression tree. Third, the $m$-th weak classifier is updated as follows:

$$
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}), \tag{17}
$$

where the indicator function $I(\cdot)$ has the value 1 if its argument is true, and 0 otherwise. Step 3 repeats step 2 until the number of iterations reaches 12, which yields the final strong classifier based on 12 weak classifiers:

$$
f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm}). \tag{18}
$$
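To make the three steps concrete, the sketch below runs them by hand on a one-dimensional toy dataset. It is an illustration under several assumptions that go beyond the text: each weak learner is a depth-1 stump rather than a full CART tree, $f$ is kept on the log-odds scale (so Eq. (14) is solved by the log-odds of the mean label), and the leaf values of Eq. (16) are approximated by a single Newton step, as is common for log-likelihood loss. All function names and data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, f):
    """Log-likelihood loss summed over samples, with f on the log-odds scale."""
    return sum(math.log(1 + math.exp(-fi)) if yi else math.log(1 + math.exp(fi))
               for yi, fi in zip(y, f))

def initial_score(y):
    """Step 1 (Eq. 14): constant minimiser; assumes both classes are present."""
    p0 = sum(y) / len(y)
    return math.log(p0 / (1 - p0))

def fit_stump(x, residuals):
    """Threshold on a 1-D feature that minimises squared error on the residuals."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        sse = (sum((r - sum(left) / len(left)) ** 2 for r in left)
               + sum((r - sum(right) / len(right)) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t)
    return best[1]

def boost(x, y, n_trees=12):
    f = [initial_score(y)] * len(x)
    trees = []
    for _ in range(n_trees):                      # Step 3: repeat step 2
        p = [sigmoid(fi) for fi in f]
        r = [yi - pi for yi, pi in zip(y, p)]     # Eq. (15): negative gradient
        t = fit_stump(x, r)                       # two leaf regions
        gammas = []
        for in_leaf in (lambda v: v <= t, lambda v: v > t):
            idx = [i for i, xi in enumerate(x) if in_leaf(xi)]
            num = sum(r[i] for i in idx)
            den = sum(p[i] * (1 - p[i]) for i in idx)
            gammas.append(num / den)              # Newton step for Eq. (16)
        f = [fi + (gammas[0] if xi <= t else gammas[1])
             for fi, xi in zip(f, x)]             # Eq. (17)
        trees.append((t, gammas))
    return f, trees

x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 0, 1, 1]
initial_loss = log_loss(y, [initial_score(y)] * len(y))
f_final, trees = boost(x, y)
final_loss = log_loss(y, f_final)
print(initial_loss, final_loss)
```

Each entry of `trees` stores the split threshold and the two leaf values of one weak classifier; summing them as in Eq. (18) reproduces `f_final`.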

In fact, our aim is not to use the final strong classifier for binary classification, but to construct new features from each iteration of the GBDT model by recording the leaf node on which each sample falls. Specifically, the elements of the new feature vector are 0 or 1, and each element corresponds to one leaf node of a regression tree. When a sample falls on a leaf node of a regression tree, the element corresponding to this leaf node is 1, while the elements corresponding to the other leaf nodes of this regression tree are 0 (see Fig. 2). The length of the new feature vector is equal to the total number of leaf nodes in all regression trees. After the feature vectors are constructed, LR is used for the final classification task. Finally, we sort the sample pairs by their scores; the higher the score, the more likely the miRNA is to be associated with the relevant disease. The source code and dataset of GBDT-LR are freely available at https://github.com/Pualalala/GBDT-LR.
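Returning to the data preparation stage described earlier in this section, the proportional selection of negative samples from clusters can be sketched as follows. The cluster labels are assumed to come from any k-means implementation (for instance the `fit_predict` method of scikit-learn's `KMeans`); the function name `proportional_sample`, the rounding rule, and the toy labels are assumptions made for illustration and are not taken from the authors' code.

```python
import random
from collections import defaultdict

def proportional_sample(cluster_labels, n_samples, seed=0):
    """Draw about n_samples indices, each cluster contributing
    in proportion to its share of all unknown samples."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        clusters[label].append(index)
    total = len(cluster_labels)
    chosen = []
    for label in sorted(clusters):
        members = clusters[label]
        # Per-cluster quota; rounding means the total can differ slightly
        # from n_samples.
        quota = round(n_samples * len(members) / total)
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return chosen

# Toy labels: three clusters holding 60%, 30% and 10% of 100 unknown samples.
labels = [0] * 60 + [1] * 30 + [2] * 10
negatives = proportional_sample(labels, n_samples=20)
print(len(negatives))
```

Per-cluster rounding of this kind is one plausible reason why the number of selected negatives (5418) is close to, but not exactly, the number of positives (5430); the text does not state the exact rule used.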

Fig. 2. The process of extracting new features from the GBDT model. Two weak classifiers are taken as examples, represented by the red and yellow parts respectively. The red weak classifier has 3 leaf nodes and the yellow weak classifier has 2. Sample $x$ falls on the second leaf node of the red weak classifier and also on the second leaf node of the yellow weak classifier, so the prediction result of the red weak classifier is encoded as [0 1 0] and that of the yellow weak classifier as [0 1]. The output of GBDT is the concatenation of these encodings, [0 1 0 0 1]. Finally, the new features constructed by the GBDT model are input into the LR model for the final classification task.
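The encoding in Fig. 2 can be reproduced with two hand-written trees. Everything in this sketch is hypothetical: the split features `sim_m` and `sim_d`, the thresholds, and the LR weights are invented so that the sample falls on the second leaf of both trees, as in the figure. With a trained scikit-learn model, the leaf indices would instead come from `GradientBoostingClassifier.apply`.

```python
import math

def red_tree(x):
    """Hypothetical tree with 3 leaves; returns the index of the leaf x falls on."""
    if x["sim_m"] < 0.3:
        return 0
    return 1 if x["sim_d"] < 0.5 else 2

def yellow_tree(x):
    """Hypothetical tree with 2 leaves."""
    return 0 if x["sim_d"] < 0.2 else 1

TREES = [(red_tree, 3), (yellow_tree, 2)]

def leaf_features(x):
    """Concatenate one one-hot block per tree, marking the leaf x falls on."""
    features = []
    for tree, n_leaves in TREES:
        block = [0] * n_leaves
        block[tree(x)] = 1
        features.extend(block)
    return features

def lr_score(features, weights, bias):
    """Logistic regression score on the binary leaf features."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

sample = {"sim_m": 0.7, "sim_d": 0.4}
encoded = leaf_features(sample)
score = lr_score(encoded, weights=[-1.0, 0.8, 0.2, -0.5, 0.6], bias=0.0)
print(encoded, score)
```

Because exactly one element per tree is 1, the LR model effectively learns one weight per leaf, which is how the tree structure supplies feature combinations to a linear classifier.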

3 Results

3.1 Performance evaluation

In this paper, we evaluated the performance of GBDT-LR and three other methods (i.e., LMTRDA (Wang et al., 2019), RFMDA (Chen et al., 2018a) and ABMDA (Zhao et al., 2019)) based on 5-fold CV. The evaluation metrics include the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPR), precision, recall and F1-Score. We drew the Receiver Operating Characteristic (ROC) curve with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis. Likewise, we drew the Precision-Recall (P-R) curve with the precision ratio on the horizontal axis and the recall ratio on the vertical axis. By calculating AUC and AUPR, we can compare the prediction capability of GBDT-LR with that of the other three models. In general, higher AUC and AUPR values indicate better prediction performance.

In the 5-fold CV, all known miRNA-disease associations were randomly divided into five parts, each of which served in turn as the testing samples while the other four parts served as training samples. To reduce the influence of the sample division, we performed 100 randomized divisions of the known miRNA-disease associations. The experimental results show that GBDT-LR achieves the highest performance among all methods (see Table 1 and Fig. 3). Fig. 3 shows the average ROC curves and average P-R curves of the 5-fold CV results. The average AUC of GBDT-LR is 0.9274, which is significantly better than that of LMTRDA (0.8479), RFMDA (0.7388) and ABMDA (0.8841). Besides, the average AUPR of GBDT-LR reaches 0.9014, while the AUPR values of LMTRDA, RFMDA and ABMDA are 0.8217, 0.7034 and 0.8807, respectively. The values of the other three evaluation metrics are shown in Table 1. Ideally, both Precision and Recall would be highest in model evaluation, but in practice they often conflict: in most models, when one of them reaches a high level, the other tends to be low. This is also why the Recall of GBDT-LR is not as high as that of RFMDA. F1-Score is an evaluation metric that takes both Precision and Recall into account. For GBDT-LR, the precision and recall scores both exceed 0.80 when the highest F1-Score is achieved. All the above results show that GBDT-LR makes significant improvements in predicting associations between miRNAs and diseases compared with some of the most advanced methods.
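For readers who want to check how these metrics relate, the sketch below computes AUC, precision, recall and F1-Score from scratch on six toy predictions. The labels, scores, function names and the 0.5 decision threshold are hypothetical; in practice library routines such as scikit-learn's `roc_auc_score` and `average_precision_score` would typically be used instead.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative
    (ties count as 0.5); assumes both classes are present."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_f1(labels, scores, threshold=0.5):
    """Precision, recall and F1 at a fixed threshold;
    assumes at least one true positive so no division by zero occurs."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(predicted, labels) if p == 0 and l == 1)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy predictions for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc_value = auc(labels, scores)
precision, recall, f1 = precision_recall_f1(labels, scores)
print(auc_value, precision, recall, f1)
```

Lowering the threshold in `precision_recall_f1` raises recall at the cost of precision, which is the trade-off discussed above for RFMDA.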

Table 1. The AUC, AUPR, Precision, Recall and F1-Score of four methods on miRNA-disease association prediction task. The bolded numbers represent the comparatively better performance.

Method     AUC      AUPR     Precision   Recall   F1-Score
GBDT-LR    0.9274   0.9014   0.8315      0.8273   0.8302
ABMDA      0.8841   0.8807   0.8152      0.7827   0.7908
LMTRDA     0.8479   0.8217   0.8013      0.6190   0.7067
RFMDA      0.7388   0.7034   0.6253      0.9548   0.7453

Fig. 3. The ROC and P-R curves of four methods on miRNA-disease association prediction task.

3.2 Case studies

To further demonstrate the applicability and prediction performance of GBDT-LR, three case studies on colon cancer, gastric cancer, and pancreatic cancer were carried out. As mentioned earlier, 5430 known association samples were obtained from HMDD v2.0, while the remaining 184155 samples were unknown. In this section, we ranked all unknown samples and then picked out the top 50 predicted miRNAs for each of the three cancers to see whether they were confirmed by the other two independent databases, dbDEMC (Yang et al., 2010) and miRCancer (Xie et al., 2013).

Colon cancer is a common malignant tumor of the digestive tract, with peak morbidity at 40-50 years of age. Its incidence ranks third among gastrointestinal tumors, and it has no obvious early symptoms, which easily leads to missed diagnosis or misdiagnosis. Therefore, more sensitive and specific molecular biomarkers are urgently needed to help improve the early diagnosis of colon cancer. In recent years, studies have shown that some miRNAs can affect the development of colon cancer. For example, miR-135b upregulation is common in human colon cancer, correlates with tumor stage and poor clinical outcome, and has been identified as a key downstream effector of oncogenic pathways and a potential target for colon cancer treatment (Valeri et al., 2014). In addition, overexpression of miR-215 could inhibit the proliferation of colon cancer cells (Song et al., 2010). When GBDT-LR was used to prioritize candidate miRNAs for colon cancer, 45 of the top 50 predicted miRNAs were confirmed by dbDEMC and miRCancer (see Table 2). For instance, knockout of the miR-155 gene contributes to decreased cell growth, motility, and invasion in colon cancer (Onyeagucha et al., 2013). Moreover, miR-34a can effectively inhibit cell proliferation by regulating the E2F signaling pathway, and the abrogation of miR-34a could contribute to abnormal cell proliferation and the development of colon cancer (Tazawa et al., 2007). More importantly, the miRNAs that have not yet been verified by biological experiments, such as miR-16, miR-200a, miR-373, miR-378a and miR-142, are likely to be new biomarkers for colon cancer.

Table 2. The top 50 predicted colon cancer-related miRNAs. 45 out of the top 50 miRNAs were confirmed based on dbDEMC and miRCancer. (Column 1: top 1-25 miRNAs; Column 3: top 26-50 miRNAs)

| miRNA | Evidence | miRNA | Evidence |
|---|---|---|---|
| hsa-mir-155 | dbDEMC, miRCancer | hsa-mir-24 | dbDEMC |
| hsa-mir-21 | dbDEMC, miRCancer | hsa-mir-106b | dbDEMC |
| hsa-mir-143 | dbDEMC | hsa-mir-125b | dbDEMC, miRCancer |
| hsa-mir-29a | dbDEMC | hsa-mir-148a | dbDEMC |
| hsa-mir-16 | unconfirmed | hsa-mir-663a | dbDEMC |
| hsa-mir-20a | dbDEMC | hsa-let-7a | dbDEMC, miRCancer |
| hsa-mir-221 | dbDEMC | hsa-mir-200a | unconfirmed |
| hsa-mir-146a | dbDEMC, miRCancer | hsa-mir-15b | dbDEMC, miRCancer |
| hsa-mir-222 | dbDEMC | hsa-mir-29c | dbDEMC |
| hsa-mir-1 | dbDEMC | hsa-mir-372 | dbDEMC |
| hsa-mir-15a | dbDEMC, miRCancer | hsa-mir-31 | dbDEMC |
| hsa-mir-133a | dbDEMC | hsa-mir-133b | dbDEMC |
| hsa-mir-34a | dbDEMC | hsa-mir-195 | dbDEMC, miRCancer |
| hsa-mir-29b | dbDEMC | hsa-mir-192 | dbDEMC, miRCancer |
| hsa-mir-19b | dbDEMC | hsa-let-7e | dbDEMC, miRCancer |
| hsa-mir-18a | dbDEMC, miRCancer | hsa-mir-373 | unconfirmed |
| hsa-mir-92a | dbDEMC | hsa-mir-378a | unconfirmed |
| hsa-mir-122 | dbDEMC | hsa-mir-223 | dbDEMC |
| hsa-mir-181a | dbDEMC, miRCancer | hsa-let-7b | dbDEMC, miRCancer |
| hsa-mir-199a | miRCancer | hsa-let-7c | dbDEMC, miRCancer |
| hsa-mir-200b | dbDEMC, miRCancer | hsa-mir-200c | miRCancer |
| hsa-mir-19a | dbDEMC | hsa-mir-203 | dbDEMC |
| hsa-mir-150 | dbDEMC | hsa-mir-210 | dbDEMC |
| hsa-mir-27a | dbDEMC | hsa-mir-142 | unconfirmed |
| hsa-mir-93 | dbDEMC | hsa-mir-214 | dbDEMC, miRCancer |
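The database check summarized in Table 2 amounts to a set-membership test of each predicted miRNA against the two independent databases. The following is a minimal sketch of that bookkeeping, not the authors' code: the function names `confirm_predictions` and `summarize` are hypothetical, and the two sets hold only six memberships copied from Table 2 rather than the full dbDEMC and miRCancer records.

```python
from typing import Dict, List, Set, Tuple

# Illustrative subsets only: these memberships are copied from Table 2.
# Real use would load the full dbDEMC and miRCancer records for the disease.
DATABASES: Dict[str, Set[str]] = {
    "dbDEMC": {"hsa-mir-155", "hsa-mir-21", "hsa-mir-143", "hsa-mir-24"},
    "miRCancer": {"hsa-mir-155", "hsa-mir-21", "hsa-mir-199a"},
}


def confirm_predictions(
    predicted: List[str], databases: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each predicted miRNA to the names of the databases that list it."""
    return {
        mirna: [name for name, members in databases.items() if mirna in members]
        for mirna in predicted
    }


def summarize(report: Dict[str, List[str]]) -> Tuple[int, List[str]]:
    """Return the number of confirmed miRNAs and the list of unconfirmed ones."""
    unconfirmed = [mirna for mirna, hits in report.items() if not hits]
    return len(report) - len(unconfirmed), unconfirmed


# A handful of entries from Table 2 (not in rank order).
predicted = [
    "hsa-mir-155", "hsa-mir-21", "hsa-mir-143",
    "hsa-mir-16", "hsa-mir-199a", "hsa-mir-24",
]

report = confirm_predictions(predicted, DATABASES)
n_confirmed, unconfirmed = summarize(report)
print(f"{n_confirmed} of {len(report)} confirmed; unconfirmed: {unconfirmed}")
```

Applied to the full top-50 list and the complete database records, the same count would give the "45 out of the top 50" figure reported in the caption, with the unconfirmed entries left as candidates for experimental follow-up.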

Gastric cancer is the fourth most common cancer in the world (Pavithra et al., 2018), but in China its incidence ranked first among malignant tumors, and nearly two-thirds of cases and deaths occurred in underdeveloped regions. Furthermore, the peak age of onset was over 50 years, and the male-to-female ratio was 2:1. Recently, with improvements in economic conditions, lifestyle, education and the health care system, the mortality rate of gastric cancer has decreased significantly. Even so, gastric cancer remains a notable cancer burden and one of the critical issues in the strategy of cancer prevention and control (Yang, 2006). Therefore, a reliable model for predicting potential gastric cancer-related miRNAs is highly needed. So far, experiments have confirmed some related miRNAs; for example, the expression of miR-449 is decreased in human clinical gastric cancer compared with normal tissues (Bou Kheir et al., 2011). Also, miR-218 could block the invasion and metastasis of gastric cancer by targeting the Robo1 receptor (Tie et al., 2010). In the prediction list of gastric cancer-related miRNAs, 47 of the top 50 were confirmed by the two databases (see Table 3). For instance, miR-21 may act as an important oncogenic factor: it was observed to be overexpressed in human gastric cancer tissues and induces the occurrence of gastric cancer by down-regulating the tumor suppressor RECK (Zhang et al., 2008). Moreover, the upregulation of miR-93 and miR-106b has been reported to impair the TGFβ tumor suppressor pathway (Petrocca et al., 2008), which plays a key role in the initiation and development of gastric cancer. More importantly, those miRNAs that have not been verified by biological experiments, such as miR-29a, miR-34b and miR-335, are likely to be new biomarkers for gastric cancer.

Table 3. The top 50 predicted gastric cancer-related miRNAs. 47 out of the top 50 miRNAs were confirmed based on dbDEMC and miRCancer. (Column 1: top 1-25 miRNAs; Column 3: top 26-50 miRNAs)

| miRNA | Evidence | miRNA | Evidence |
|---|---|---|---|
| hsa-mir-21 | dbDEMC, miRCancer | hsa-mir-18a | dbDEMC, miRCancer |
| hsa-mir-155 | dbDEMC, miRCancer | hsa-mir-92a | dbDEMC |
| hsa-mir-125b | dbDEMC, miRCancer | hsa-mir-126 | dbDEMC, miRCancer |
| hsa-mir-29a | unconfirmed | hsa-mir-122 | dbDEMC, miRCancer |
| hsa-mir-16 | dbDEMC | hsa-mir-192 | dbDEMC, miRCancer |
| hsa-mir-106a | dbDEMC, miRCancer | hsa-mir-195 | dbDEMC |
| hsa-mir-20a | dbDEMC, miRCancer | hsa-let-7e | dbDEMC, miRCancer |
| hsa-mir-17 | dbDEMC, miRCancer | hsa-mir-31 | dbDEMC |
| hsa-mir-221 | dbDEMC | hsa-mir-133b | dbDEMC, miRCancer |
| hsa-mir-146a | miRCancer | hsa-mir-183 | dbDEMC, miRCancer |
| hsa-mir-93 | dbDEMC | hsa-mir-181a | dbDEMC, miRCancer |
| hsa-mir-27a | dbDEMC | hsa-mir-200b | dbDEMC, miRCancer |
| hsa-mir-222 | dbDEMC | hsa-mir-199a | miRCancer |
| hsa-mir-15a | dbDEMC, miRCancer | hsa-mir-150 | miRCancer |
| hsa-mir-1 | dbDEMC, miRCancer | hsa-mir-19a | dbDEMC, miRCancer |
| hsa-mir-133a | dbDEMC, miRCancer | hsa-mir-205 | dbDEMC, miRCancer |
| hsa-mir-34a | dbDEMC | hsa-mir-10b | dbDEMC, miRCancer |
| hsa-mir-29b | dbDEMC | hsa-mir-100 | dbDEMC, miRCancer |
| hsa-mir-145 | dbDEMC, miRCancer | hsa-mir-143 | dbDEMC, miRCancer |
| hsa-mir-34b | unconfirmed | hsa-mir-200a | dbDEMC |
| hsa-mir-335 | unconfirmed | hsa-let-7a | dbDEMC, miRCancer |
| hsa-mir-101 | dbDEMC, miRCancer | hsa-mir-15b | dbDEMC |
| hsa-mir-24 | dbDEMC | hsa-mir-29c | dbDEMC |
| hsa-mir-106b | dbDEMC, miRCancer | hsa-mir-19b | dbDEMC, miRCancer |
| hsa-mir-200c | miRCancer | hsa-mir-451a | dbDEMC, miRCancer |

Pancreatic cancer is a highly malignant gastrointestinal cancer that is challenging to diagnose and treat. It also has one of the worst prognoses among tumors, and its morbidity and mortality have increased significantly in recent years. The clinical features of pancreatic cancer are a short course, rapid development and rapid deterioration. Pancreatic cancer is known as the "king of cancers" because its 5-year survival rate is less than 5%, so many researchers have worked on this disease, including on the prediction of potentially relevant miRNAs. For example, increased serum levels of miR-200a and miR-200b can help to diagnose pancreatic cancer (Li et al., 2010). Furthermore, increased expression of circulating miR-210 can also function as a useful and novel biomarker for pancreatic cancer diagnosis (Ho et al., 2010). GBDT-LR was applied to reveal potential associations between miRNAs and pancreatic cancer, and the results show that 44 of the top 50 potential miRNAs were confirmed by the two databases (see Table 4). For instance, the downregulation of miR-141 in pancreatic cancer tissues was related to tumor size, TNM stage, distant metastasis and overall survival (Zhao et al., 2013). Moreover, miR-181c could directly suppress the expression of Hippo kinase cassette core components in human pancreatic cancer cells, and high miR-181c levels were significantly correlated with Hippo signaling inactivation, which could indirectly cause pancreatic cancer cell survival and chemoresistance in vitro and in vivo (Chen et al., 2015). More importantly, those miRNAs that have not been verified by biological experiments, such as miR-193a, miR-499a, miR-378a, miR-372 and miR-28, are likely to be new biomarkers for pancreatic cancer.

Table 4. The top 50 predicted pancreatic cancer-related miRNAs. 44 out of the top 50 miRNAs were confirmed based on dbDEMC and miRCancer. (Column 1: top 1-25 miRNAs; Column 3: top 26-50 miRNAs)

| miRNA | Evidence | miRNA | Evidence |
|---|---|---|---|
| hsa-mir-373 | dbDEMC | hsa-mir-499a | unconfirmed |
| hsa-mir-1 | dbDEMC | hsa-mir-205 | dbDEMC, miRCancer |
| hsa-mir-133a | dbDEMC, miRCancer | hsa-mir-19b | dbDEMC |
| hsa-mir-127 | dbDEMC | hsa-mir-378a | unconfirmed |
| hsa-mir-125a | dbDEMC | hsa-mir-370 | dbDEMC |
| hsa-mir-181a | dbDEMC, miRCancer | hsa-mir-424 | dbDEMC |
| hsa-mir-302a | dbDEMC | hsa-mir-30e | dbDEMC |
| hsa-mir-193a | unconfirmed | hsa-mir-494 | dbDEMC |
| hsa-mir-130a | dbDEMC | hsa-mir-98 | dbDEMC |
| hsa-mir-137 | dbDEMC, miRCancer | hsa-mir-181c | dbDEMC, miRCancer |
| hsa-mir-138 | dbDEMC | hsa-mir-363 | dbDEMC |
| hsa-mir-106b | dbDEMC | hsa-mir-372 | unconfirmed |
| hsa-mir-9 | dbDEMC | hsa-mir-28 | unconfirmed |
| hsa-mir-30b | dbDEMC | hsa-mir-20b | dbDEMC |
| hsa-mir-206 | dbDEMC | hsa-mir-23b | dbDEMC |
| hsa-mir-29a | dbDEMC | hsa-mir-18b | dbDEMC |
| hsa-mir-26b | dbDEMC | hsa-mir-134 | dbDEMC |
| hsa-mir-93 | dbDEMC | hsa-mir-302b | dbDEMC |
| hsa-mir-7 | dbDEMC | hsa-mir-22 | dbDEMC |
| hsa-mir-342 | dbDEMC | hsa-mir-335 | dbDEMC |
| hsa-mir-30a | dbDEMC | hsa-mir-574 | dbDEMC |
| hsa-mir-195 | dbDEMC, miRCancer | hsa-mir-184 | dbDEMC |
| hsa-mir-19a | dbDEMC | hsa-mir-181d | dbDEMC |
| hsa-mir-27b | dbDEMC | hsa-mir-149 | dbDEMC |
| hsa-mir-141 | dbDEMC, miRCancer | hsa-mir-885 | unconfirmed |

4 Conclusion

In this paper, we proposed a novel computational model called GBDT-LR to identify potential miRNA-disease associations. GBDT-LR can accurately predict potential associations between miRNAs and diseases, which can be immensely helpful in understanding the pathogenesis of certain diseases at the molecular level and in finding more effective treatments. The performance of GBDT-LR is supported by an average AUC of 0.9274 and an average AUPR of 0.9014 in 5-fold CV.

The prediction performance of GBDT-LR is mainly attributed to the following reasons. First, GBDT-LR can be applied to diseases without known associated miRNAs, which greatly improves its practicability and reliability. Second, GBDT is an iterative method that can improve the prediction accuracy of weak classifiers, and it has a natural advantage in discovering new distinguishing features and feature combinations throughout the iterative process. Third, GBDT-LR can prioritize candidate miRNAs for all investigated diseases simultaneously.

However, GBDT-LR also has some limitations that need to be addressed in the future. For example, we screened negative samples by using k-means clustering to balance positive and negative samples, but it is difficult to obtain reliable negative samples. Besides, multiple sources (e.g., disease-gene associations, miRNA-gene associations and miRNA sequence information) need to be combined to improve the prediction performance of the model. Lastly, our future research should be based on the database HMDD v3.0, which is already available online and records more relevant information. It therefore remains our hope to develop new computational models that overcome these limitations in the future.
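A common way to combine GBDT with logistic regression, described by He et al. (2014), is to treat the leaf that each tree assigns to a sample as a categorical feature, one-hot encode it, and feed the concatenated vector to the linear model. The sketch below illustrates only that encoding step and assumes GBDT-LR follows this scheme. The two "trees" are hand-written stand-ins rather than trees learned by boosting, and the weights, bias and sample values are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A "tree" is any function mapping a feature vector to a leaf index in
# range(n_leaves). Learned GBDT trees would replace these hand-written ones.
Tree = Callable[[Sequence[float]], int]


def stump_a(x: Sequence[float]) -> int:
    """Two leaves: a single split on feature 0."""
    return 0 if x[0] <= 0.5 else 1


def tree_b(x: Sequence[float]) -> int:
    """Three leaves: split on feature 1, then on feature 0."""
    if x[1] <= 0.3:
        return 0
    return 1 if x[0] <= 0.7 else 2


TREES: List[Tuple[Tree, int]] = [(stump_a, 2), (tree_b, 3)]


def leaf_one_hot(x: Sequence[float], trees: List[Tuple[Tree, int]]) -> List[int]:
    """Concatenate one one-hot block per tree, marking the leaf that x reaches."""
    encoded: List[int] = []
    for tree, n_leaves in trees:
        block = [0] * n_leaves
        block[tree(x)] = 1
        encoded.extend(block)
    return encoded


def logistic_score(features: Sequence[int], weights: Sequence[float], bias: float) -> float:
    """Logistic regression output for an already-encoded sample."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical miRNA-disease pair described by two similarity-derived features.
sample = [0.9, 0.8]
features = leaf_one_hot(sample, TREES)

# Hypothetical LR parameters, one weight per leaf across all trees.
WEIGHTS = [-0.4, 0.6, -1.0, 0.2, 1.1]
BIAS = -0.5
score = logistic_score(features, WEIGHTS, BIAS)
print(features, round(score, 4))
```

Because each leaf corresponds to a conjunction of split conditions along its path, every entry of the encoded vector stands for a feature combination found by a tree, and the logistic regression then only has to learn one weight per leaf.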

Author Statement

Su Zhou: Conceptualization, Methodology, Software, Validation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing

Shulin Wang: Writing - Original Draft, Supervision

Qi Wu: Conceptualization, Methodology, Writing - Original Draft

Riasat Azim: Data Curation, Software, Writing - Review & Editing

Wen Li: Investigation, Writing - Review & Editing

Conflict of interest

The authors declare that they have no conflict(s) of interest.

Funding

This work was supported by the grants of the National Natural Science Foundation of China (Grant Nos. 61672011 and 61472467) and the National Key R&D Program of China (2017YFC1311003).

References

Ambros, V., 2001. microRNAs: Tiny Regulators with Great Potential. Cell 107, 823–826.

Ashrafi, K., Chang, F.Y., Watts, J.L., Fraser, A.G., Kamath, R.S., Ahringer, J., Ruvkun, G., 2003. Genome-wide RNAi analysis of Caenorhabditis elegans fat regulatory genes. Nature 421, 268–272. https://doi.org/10.1038/nature01279

Barh, D., Bhat, D., Viero, C., 2010. miReg: a resource for microRNA regulation. J. Integr. Bioinform. 7. https://doi.org/10.1515/jib-2010-144

Barh, D., Jain, N., Tiwari, S., Field, J.K., Padin-Iruegas, E., Ruibal, A., López, R., Herranz, M., Bhattacharya, A., Juneja, L., Viero, C., Silva, A., Miyoshi, A., Kumar, A., Blum, K., Azevedo, V., Ghosh, P., Liloglou, T., 2013. A novel in silico reverse-transcriptomics-based identification and blood-based validation of a panel of sub-type specific biomarkers in lung cancer. BMC Genomics 14, S5. https://doi.org/10.1186/1471-2164-14-S6-S5

Barh, D., Kamapantula, B., Jain, N., Nalluri, J., Bhattacharya, A., Juneja, L., Barve, N., Tiwari, S., Miyoshi, A., Azevedo, V., Blum, K., Kumar, A., Silva, A., Ghosh, P., 2015. miRegulome: a knowledge-base of miRNA regulomics and analysis. Sci. Rep. 5, 12832. https://doi.org/10.1038/srep12832

Bou Kheir, T., Futoma-Kazmierczak, E., Jacobsen, A., Krogh, A., Bardram, L., Hother, C., Grønbæk, K., Federspiel, B., Lund, A.H., Friis-Hansen, L., 2011. miR-449 inhibits cell proliferation and is down-regulated in gastric cancer. Mol. Cancer 10, 29. https://doi.org/10.1186/1476-4598-10-29

Brenner, B., Hoshen, M.B., Purim, O., David, M. Ben, Ashkenazi, K., Marshak, G., Kundel, Y., Brenner, R., Morgenstern, S., Halpern, M., Rosenfeld, N., Chajut, A., Niv, Y., Kushnir, M., 2011. MicroRNAs as a potential prognostic factor in gastric cancer. World J. Gastroenterol. 17, 3976–3985. https://doi.org/10.3748/wjg.v17.i35.3976

Chen, M., Wang, M., Xu, S., Guo, X., Jiang, J., 2015. Upregulation of miR-181c contributes to chemoresistance in pancreatic cancer by inactivating the Hippo signaling pathway. Oncotarget 6, 44466–44479. https://doi.org/10.18632/oncotarget.6298

Chen, X., Guo, X., Zhang, H., Xiang, Y., Chen, J., Yin, Y., Cai, X., Wang, K., Wang, G., Ba, Y., Zhu, L., Wang, J., Yang, R., Zhang, Y., Ren, Z., Zen, K., Zhang, J., Zhang, C.-Y., 2009. Role of miR-143 targeting KRAS in colorectal tumorigenesis. Oncogene 28, 1385–1392. https://doi.org/10.1038/onc.2008.474

Chen, X., Huang, L., 2017. LRSSLMDA: Laplacian Regularized Sparse Subspace Learning for MiRNA-Disease Association prediction. PLOS Comput. Biol. 13, e1005912. https://doi.org/10.1371/journal.pcbi.1005912

Chen, X., Wang, C.-C., Yin, J., You, Z.-H., 2018a. Novel Human miRNA-Disease Association Inference Based on Random Forest. Mol. Ther. - Nucleic Acids 13, 568–579. https://doi.org/10.1016/j.omtn.2018.10.005

Chen, X., Wang, L., Qu, J., Guan, N.-N., Li, J.-Q., 2018b. Predicting miRNA–disease association based on inductive matrix completion. Bioinformatics 34, 4256–4265. https://doi.org/10.1093/bioinformatics/bty503

Chen, X., Xie, D., Wang, L., Zhao, Q., You, Z.-H., Liu, H., 2018c. BNPMDA: Bipartite Network Projection for MiRNA–Disease Association prediction. Bioinformatics 34, 3178–3186. https://doi.org/10.1093/bioinformatics/bty333

Chen, X., Xie, D., Zhao, Q., You, Z.-H., 2019a. MicroRNAs and complex diseases: from experimental results to computational models. Brief. Bioinform. 20, 515–539. https://doi.org/10.1093/bib/bbx130

Chen, X., Yin, J., Qu, J., Huang, L., 2018d. MDHGI: Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction. PLOS Comput. Biol. 14, e1006418. https://doi.org/10.1371/journal.pcbi.1006418

Chen, X., Zhu, C.-C., Yin, J., 2019b. Ensemble of decision tree reveals potential miRNA-disease associations. PLOS Comput. Biol. 15, e1007209. https://doi.org/10.1371/journal.pcbi.1007209

De Mena, L., Coto, E., Cardo, L.F., Díaz, M., Blázquez, M., Ribacoba, R., Salvado, C., Pastor, P., Samaranch, Ll., Moris, G., Menéndez, M., Corao, A.I., Alvarez, V., 2010. Analysis of the microRNA-133 and PITX3 genes in Parkinson’s disease. Am. J. Med. Genet. Part B Neuropsychiatr. Genet. 153, 1234–1239. https://doi.org/10.1002/ajmg.b.31086

Fineberg, S.K., Kosik, K.S., Davidson, B.L., 2009. MicroRNAs Potentiate Neural Development. Neuron 64, 303–309. https://doi.org/10.1016/j.neuron.2009.10.020

Friedman, J.H., 2001. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29, 1189–1232. https://doi.org/10.2307/2699986

He, X., Bowers, S., Candela, J.Q., Pan, J., Jin, O., Xu, Tianbing, Liu, B., Xu, Tao, Shi, Y., Atallah, A., Herbrich, R., 2014. Practical Lessons from Predicting Clicks on Ads at Facebook, in: Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining - ADKDD’14. ACM Press, New York, New York, USA, pp. 1–9. https://doi.org/10.1145/2648584.2648589

Hebert, S.S., Horre, K., Nicolai, L., Papadopoulou, A.S., Mandemakers, W., Silahtaroglu, A.N., Kauppinen, S., Delacourte, A., De Strooper, B., 2008. Loss of microRNA cluster miR-29a/b-1 in sporadic Alzheimer’s disease correlates with increased BACE1/β-secretase expression. Proc. Natl. Acad. Sci. 105, 6415–6420. https://doi.org/10.1073/pnas.0710263105

Ho, A.S., Huang, X., Cao, H., Christman-Skieller, C., Bennewith, K., Le, Q.-T., Koong, A.C., 2010. Circulating miR-210 as a Novel Hypoxia Marker in Pancreatic Cancer. Transl. Oncol. 3, 109–113. https://doi.org/10.1593/tlo.09256

Huang, J., Wang, F., Argyris, E., Chen, K., Liang, Z., Tian, H., Huang, W., Squires, K., Verlinghieri, G., Zhang, H., 2007. Cellular microRNAs contribute to HIV-1 latency in resting primary CD4+ T lymphocytes. Nat. Med. 13, 1241–1247. https://doi.org/10.1038/nm1639

Huang, Y., Shen, X.J., Zou, Q., Wang, S.P., Tang, S.M., Zhang, G.Z., 2011. Biological functions of microRNAs: a review. J. Physiol. Biochem. 67, 129–139. https://doi.org/10.1007/s13105-010-0050-6

Huang, Z., Shi, J., Gao, Y., Cui, C., Zhang, S., Li, J., Zhou, Y., Cui, Q., 2019. HMDD v3.0: a database for experimentally supported human microRNA–disease associations. Nucleic Acids Res. 47, D1013–D1017. https://doi.org/10.1093/nar/gky1010

Janssen, H.L.A., Reesink, H.W., Lawitz, E.J., Zeuzem, S., Rodriguez-Torres, M., Patel, K., van der Meer, A.J., Patick, A.K., Chen, A., Zhou, Y., Persson, R., King, B.D., Kauppinen, S., Levin, A.A., Hodges, M.R., 2013. Treatment of HCV Infection by Targeting MicroRNA. N. Engl. J. Med. 368, 1685–1694. https://doi.org/10.1056/NEJMoa1209026

Jiang, Q., Hao, Y., Wang, G., Juan, L., Zhang, T., Teng, M., Liu, Y., Wang, Y., 2010. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst. Biol. 4, S2. https://doi.org/10.1186/1752-0509-4-S1-S2

Jiang, Q., Wang, Y., Hao, Y., Juan, L., Teng, M., Zhang, X., Li, M., Wang, G., Liu, Y., 2009. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 37, D98–D104. https://doi.org/10.1093/nar/gkn714

Kim, J., Inoue, K., Ishii, J., Vanti, W.B., Voronov, S.V., Murchison, E., Hannon, G., Abeliovich, A., 2007. A MicroRNA Feedback Circuit in Midbrain Dopamine Neurons. Science 317, 1220–1224. https://doi.org/10.1126/science.1140481

Kumar, P., Dezso, Z., MacKenzie, C., Oestreicher, J., Agoulnik, S., Byrne, M., Bernier, F., Yanagimachi, M., Aoshima, K., Oda, Y., 2013. Circulating miRNA Biomarkers for Alzheimer’s Disease. PLoS One 8, e69807. https://doi.org/10.1371/journal.pone.0069807

Lee, R.C., Feinbaum, R.L., Ambros, V., 1993. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843–854. https://doi.org/10.1016/0092-8674(93)90529-Y

Li, A., Omura, N., Hong, S.-M., Vincent, A., Walter, K., Griffith, M., Borges, M., Goggins, M., 2010. Pancreatic Cancers Epigenetically Silence SIP1 and Hypomethylate and Overexpress miR-200a/200b in Association with Elevated Circulating miR-200a and miR-200b Levels. Cancer Res. 70, 5226–5237. https://doi.org/10.1158/0008-5472.CAN-09-4227

Li, Y., Qiu, C., Tu, J., Geng, B., Yang, J., Jiang, T., Cui, Q., 2014. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 42, D1070–D1074. https://doi.org/10.1093/nar/gkt1023

Li, Y., Zhuang, L., Wang, Y., Hu, Y., Wu, Y., Wang, D., Xu, J., 2013. Connect the dots. Autophagy 9, 436–439. https://doi.org/10.4161/auto.23096

Liu, B., Fang, L., Liu, F., Wang, X., Chen, J., Chou, K.-C., 2015. Identification of Real MicroRNA Precursors with a Pseudo Structure Status Composition Approach. PLoS One 10, e0121501. https://doi.org/10.1371/journal.pone.0121501

Ma, L., Teruya-Feldstein, J., Weinberg, R.A., 2007. Tumour invasion and metastasis initiated by microRNA-10b in breast cancer. Nature 449, 682–688. https://doi.org/10.1038/nature06174

Nalluri, J.J., Kamapantula, B.K., Barh, D., Jain, N., Bhattacharya, A., de Almeida, S.S., Juca Ramos, R.T., Silva, A., Azevedo, V., Ghosh, P., 2015. DISMIRA: Prioritization of disease candidates in miRNA-disease associations based on maximum weighted matching inference model and motif-based analysis. BMC Genomics 16, S12. https://doi.org/10.1186/1471-2164-16-S5-S12

Onyeagucha, B.C., Mercado-Pimentel, M.E., Hutchison, J., Flemington, E.K., Nelson, M.A., 2013. S100P/RAGE signaling regulates microRNA-155 expression via AP-1 activation in colon cancer. Exp. Cell Res. 319, 2081–2090. https://doi.org/10.1016/j.yexcr.2013.05.009

Pavithra, D., Sabitha, K., Rajkumar, T., 2018. Identification of small molecule inhibitors for differentially expressed miRNAs in gastric cancer. Comput. Biol. Chem. 77, 442–454. https://doi.org/10.1016/j.compbiolchem.2018.07.013

Petrocca, F., Visone, R., Onelli, M.R., Shah, M.H., Nicoloso, M.S., de Martino, I., Iliopoulos, D., Pilozzi, E., Liu, C.-G., Negrini, M., Cavazzini, L., Volinia, S., Alder, H., Ruco, L.P., Baldassarre, G., Croce, C.M., Vecchione, A., 2008. E2F1-Regulated MicroRNAs Impair TGFβ-Dependent Cell-Cycle Arrest and Apoptosis in Gastric Cancer. Cancer Cell 13, 272–286. https://doi.org/10.1016/j.ccr.2008.02.013

Rahman, M.R., Islam, T., Turanli, B., Zaman, T., Faruquee, H.M., Rahman, M.M., Mollah, M.N.H., Nanda, R.K., Arga, K.Y., Gov, E., Moni, M.A., 2019. Network-based approach to identify molecular signatures and therapeutic agents in Alzheimer’s disease. Comput. Biol. Chem. 78, 431–439. https://doi.org/10.1016/j.compbiolchem.2018.12.011

Rayhan, F., Ahmed, S., Shatabda, S., Farid, D.M., Mousavian, Z., Dehzangi, A., Rahman, M.S., 2017. IDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting. Sci. Rep. 7, 1–18. https://doi.org/10.1038/s41598-017-18025-2

Song, B., Wang, Y., Titmus, M.A., Botchkina, G., Formentini, A., Kornmann, M., Ju, J., 2010. Molecular mechanism of chemoresistance by miR-215 in osteosarcoma and colon cancer cells. Mol. Cancer 9, 96. https://doi.org/10.1186/1476-4598-9-96

Taganov, K.D., Boldin, M.P., Chang, K.-J., Baltimore, D., 2006. NF-κB-dependent induction of microRNA miR-146, an inhibitor targeted to signaling proteins of innate immune responses. Proc. Natl. Acad. Sci. 103, 12481–12486. https://doi.org/10.1073/pnas.0605298103

Tazawa, H., Tsuchiya, N., Izumiya, M., Nakagama, H., 2007. Tumor-suppressive miR-34a induces senescence-like growth arrest through modulation of the E2F pathway in human colon cancer cells. Proc. Natl. Acad. Sci. 104, 15472–15477. https://doi.org/10.1073/pnas.0707351104

Tie, J., Pan, Y., Zhao, L., Wu, K., Liu, J., Sun, S., Guo, X., Wang, B., Gang, Y., Zhang, Y., Li, Q., Qiao, T., Zhao, Q., Nie, Y., Fan, D., 2010. MiR-218 Inhibits Invasion and Metastasis of Gastric Cancer by Targeting the Robo1 Receptor. PLoS Genet. 6, e1000879. https://doi.org/10.1371/journal.pgen.1000879

Valeri, N., Braconi, C., Gasparini, P., Murgia, C., Lampis, A., Paulus-Hock, V., Hart, J.R., Ueno, L., Grivennikov, S.I., Lovat, F., Paone, A., Cascione, L., Sumani, K.M., Veronese, A., Fabbri, M., Carasi, S., Alder, H., Lanza, G., Gafa’, R., Moyer, M.P., Ridgway, R.A., Cordero, J., Nuovo, G.J., Frankel, W.L., Rugge, M., Fassan, M., Groden, J., Vogt, P.K., Karin, M., Sansom, O.J., Croce, C.M., 2014. MicroRNA-135b Promotes Cancer Progression by Acting as a Downstream Effector of Oncogenic Pathways in Colon Cancer. Cancer Cell 25, 469–483. https://doi.org/10.1016/j.ccr.2014.03.006

van Laarhoven, T., Nabuurs, S.B., Marchiori, E., 2011. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics 27, 3036–3043. https://doi.org/10.1093/bioinformatics/btr500

Volinia, S., Galasso, M., Sana, M.E., Wise, T.F., Palatini, J., Huebner, K., Croce, C.M., 2012. Breast cancer signatures for invasiveness and prognosis defined by deep sequencing of microRNA. Proc. Natl. Acad. Sci. 109, 3024–3029. https://doi.org/10.1073/pnas.1200010109

Wang, D., Wang, J., Lu, M., Song, F., Cui, Q., 2010. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics 26, 1644–1650. https://doi.org/10.1093/bioinformatics/btq241

Wang, L., You, Z.-H., Chen, X., Li, Y.-M., Dong, Y.-N., Li, L.-P., Zheng, K., 2019. LMTRDA: Using logistic model tree to predict MiRNA-disease associations by fusing multi-source information of sequences and similarities. PLOS Comput. Biol. 15, e1006865. https://doi.org/10.1371/journal.pcbi.1006865

Xiao, Q., Luo, J., Liang, C., Cai, J., Ding, P., 2018. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics 34, 239–248. https://doi.org/10.1093/bioinformatics/btx545

Xie, B., Ding, Q., Han, H., Wu, D., 2013. miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics 29, 638–644. https://doi.org/10.1093/bioinformatics/btt014

Xuan, P., Han, K., Guo, M., Guo, Y., Li, Jinbao, Ding, J., Liu, Y., Dai, Q., Li, Jin, Teng, Z., Huang, Y., 2013. Prediction of microRNAs Associated with Human Diseases Based on Weighted k Most Similar Neighbors. PLoS One 8, e70204. https://doi.org/10.1371/journal.pone.0070204

Yang, L., 2006. Incidence and mortality of gastric cancer in China. World J. Gastroenterol. 12, 17–20. https://doi.org/10.3748/wjg.v12.i1.17

Yang, Z., Ren, F., Liu, C., He, S., Sun, G., Gao, Q., Yao, L., Zhang, Y., Miao, R., Cao, Y., Zhao, Y., Zhong, Y., Zhao, H., 2010. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics 11, S5. https://doi.org/10.1186/1471-2164-11-S4-S5

Zhang, Z., Li, Z., Gao, C., Chen, P., Chen, J., Liu, W., Xiao, S., Lu, H., 2008. miR-21 plays a pivotal role in gastric cancer pathogenesis and progression. Lab. Investig. 88, 1358–1366. https://doi.org/10.1038/labinvest.2008.94

Zhao, G., Wang, B., Liu, Y., Zhang, J., Deng, S., Qin, Q., Tian, K., Li, X., Zhu, S., Niu, Y., Gong, Q., Wang, C., 2013. miRNA-141, Downregulated in Pancreatic Cancer, Inhibits Cell Proliferation and Invasion by Directly Targeting MAP4K4. Mol. Cancer Ther. 12, 2569–2580. https://doi.org/10.1158/1535-7163.MCT-13-0296

Zhao, Y., Chen, X., Yin, J., 2019. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics 35, 4730–4738. https://doi.org/10.1093/bioinformatics/btz297