Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer


Accepted Manuscript

Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer
Xiaowei Zhao, Ye Zhang, Qiao Ning, Hongrui Zhang, Jinchao Ji, Minghao Yin

PII: S0022-5193(19)30052-9
DOI: https://doi.org/10.1016/j.jtbi.2019.01.035
Reference: YJTBI 9817
To appear in: Journal of Theoretical Biology
Received date: 13 December 2018
Revised date: 4 January 2019
Accepted date: 30 January 2019

Please cite this article as: Xiaowei Zhao, Ye Zhang, Qiao Ning, Hongrui Zhang, Jinchao Ji, Minghao Yin, Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer, Journal of Theoretical Biology (2019), doi: https://doi.org/10.1016/j.jtbi.2019.01.035

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• A novel method is proposed for the prediction of N6-methyladenosine sites.
• We combine deep features with the original features to improve the results.
• M6A-PXGB first uses particle swarm optimization to optimize the parameters of the XGBoost model.
• The overall accuracy of M6A-PXGB is 77.16%, with a sensitivity of 76.38%, a specificity of 75.98%, an AUC of 0.839 and an MCC of 0.535.

Identifying N6-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer

Xiaowei Zhao1,2, Ye Zhang1, Qiao Ning1, Hongrui Zhang1, Jinchao Ji1,*, Minghao Yin2,*

*Corresponding author: Tel.: +86-0431-84536338, Fax: +86-0431-84536338; E-mail address: [email protected]

1 School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China
2 Key Laboratory of Intelligent Information Processing of Jilin Universities, Northeast Normal University, Changchun, 130117, China

Abstract

N6-methyladenosine (m6A) is one of the most important RNA modifications, playing roles in processes ranging from splicing events, mRNA export and mRNA stability to cell differentiation. Because m6A is widely distributed across genes, identifying m6A sites in RNA sequences is of significant importance for basic biomedical research and drug development. High-throughput laboratory methods are time consuming and costly, so effective computational methods are highly desirable for their convenience and speed. In this article, we propose a new method that improves m6A prediction by combining deep features with the original features in an extreme gradient boosting system optimized by particle swarm optimization (PXGB). The proposed PXGB algorithm uses three kinds of features, i.e., position-specific nucleotide propensity (PSNP), position-specific dinucleotide propensity (PSDP), and the traditional nucleotide composition (NC). Under 10-fold cross validation, PXGB achieved an AUC of 0.8390 and an MCC of 0.5346. Additionally, PXGB was compared with existing methods, and its higher MCC and AUC demonstrate that it is effective for predicting m6A sites. The predictor proposed in this study may help to identify more m6A sites and guide related experimental validation.

Key words: m6A sites; N6-methyladenosine; extreme gradient boosting; XGBoost; particle swarm optimization


1. Introduction

In the last few years, more and more post-translational modifications of proteins have been found to play important roles in biological processes [1-27]. Likewise, a growing number of post-transcriptional modifications of RNA have been found in cellular RNAs, especially mRNA [28,29]. N6-methyladenosine (m6A) plays an important role in various biological processes, such as RNA localization and degradation [30], mRNA stability and splicing [31], microRNA biogenesis [32], brain development abnormalities [33] and other diseases. However, high-throughput laboratory methods for identifying N6-methyladenosine sites are time consuming and costly. Thus, finding an accurate and fast computational method has become an important task in recent years.

Until now, several machine learning methods have been used to build predictive models based on RNA sequences. Chen et al. first developed an m6A site predictor called iRNA-Methyl using pseudo nucleotide composition [34]. This predictor combined the positional information of the RNA sequences with three types of physical-chemical properties using pseudo dinucleotide composition (PseDNC). Liu et al. applied a matrix of 23 types of physical-chemical properties of nucleotides to transform RNA sequences into series of auto-covariance (AC) and cross-covariance (CC) values, and then used an SVM as the prediction engine in a method called pRNAm-PC [35]. Zhang et al. used a new heuristic nucleotide physical-chemical property selection algorithm to choose several better physical-chemical properties, with an SVM prediction engine, in a method called M6A-HPCS [36]. Chen et al. proposed an SVM model based on nucleotide chemical properties and the nucleotide frequency information of the RNA sequences, called M6Apred [37]. Li et al. used position-specific nucleotide propensities and nucleotide frequencies with an SVM as the prediction engine [38]. Although these studies have shown the effectiveness of positional and physical-chemical features for predicting m6A sites, the accuracy of these methods is still insufficient.

This paper aims to combine deep features with the original features and to apply a new way of building a novel predictor called PXGB (extreme gradient boosting optimized by particle swarm optimization). First, we extracted fixed-length features from RNA sequences. Then, we found the best combination of features and calculated their deep features with a single-layer autoencoder. After that, XGBoost was selected as the prediction model over the optimized features. Finally, we used PSO to optimize the parameters and construct M6A-PXGB on the dataset. As shown in a series of recent publications [39-47], to develop a really useful sequence-based statistical predictor for a biological or biomedical system, one should observe the guidelines of Chou's 5-step rule [48] and make the following five steps very clear: (i) how to construct or select a valid benchmark dataset to train and test the predictor; (ii) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) how to introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) how to properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) how to establish a user-friendly web-server for the predictor that is accessible to the public. Below, we describe how we deal with these steps one by one.

2. Materials and methods

2.1 Datasets

In this study, the datasets were constructed from 1183 genes in the S. cerevisiae genome [49]. Wet experiments showed that the genome contains 1307 methylated adenine sites and 33280 non-methylated adenine sites [49]. All of these RNA segments share the three nucleotides G (guanine), A (adenine), C (cytosine) at their center, known as the GAC consensus motif, as shown in Fig 1 [49]. In order to obtain balanced class samples, 1307 negative samples were randomly selected from the 33280 non-methylated adenine sites [50]. The 1307 positive and 1307 negative samples make up the final dataset. The major processing steps follow iRNA-Methyl [34] to avoid any unwanted bias and to ensure heterogeneity of the sequences. Note that all sequences in the dataset have 51 nucleotides, and the pairwise sequence similarity is less than 85%, enforced using CD-HIT [50].


Figure 1. The motif of N6-methyladenosine sites

2.2 Feature extraction

Position-specific nucleotide propensity (PSNP). Position-specific nucleotide propensity [38] computes the frequency of nucleotides at each position and extracts statistical information from the sequences. All m6A sites share the same GAC motif at the center. After deleting the central GAC, the remaining RNA segment can be described as follows:

$R_\xi = N_1 N_2 N_3 \cdots N_{2\xi}$  (1)

where $R_\xi$ is a reduced RNA segment, $N_j$ (j = 1, 2, ..., 2ξ) is the nucleotide at the j-th position, and $N_j \in$ {A (adenine), C (cytosine), G (guanine), U (uracil)}.

We first calculated the 4-dimensional position-specific occurrence frequency vectors from the positive reduced RNA segments (2) and the negative reduced RNA segments (3):

$z_j^{+} = [z_{1,j}^{+}\ z_{2,j}^{+}\ z_{3,j}^{+}\ z_{4,j}^{+}]^{T}$  (2)

$z_j^{-} = [z_{1,j}^{-}\ z_{2,j}^{-}\ z_{3,j}^{-}\ z_{4,j}^{-}]^{T}$  (3)

where $z_{i,j}^{+}$ and $z_{i,j}^{-}$ are the occurrence frequencies of the nucleotides in the positive and negative reduced RNA segments, i (i = 1, 2, 3, 4) indexes the 4 basic types of nucleotide at the j-th (j = 1, 2, ..., 2ξ) position, and the numerical codes 1, 2, 3 and 4 represent A (adenine), C (cytosine), G (guanine) and U (uracil), respectively. We defined the position-specific frequency difference vector $z_j$ (4) as follows:

$z_j = z_j^{+} - z_j^{-} = [z_{1,j}\ z_{2,j}\ z_{3,j}\ z_{4,j}]^{T}$  (4)

where $z_{i,j} = z_{i,j}^{+} - z_{i,j}^{-}$, i = 1, 2, 3, 4, and j = 1, 2, ..., 2ξ.

Then we can obtain a 4×2ξ position-specific nucleotide propensity (PSNP) matrix (5), as follows:

$Z_{PSNP} = [z_1\ z_2\ \cdots\ z_{2\xi}] = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,2\xi} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,2\xi} \\ z_{3,1} & z_{3,2} & \cdots & z_{3,2\xi} \\ z_{4,1} & z_{4,2} & \cdots & z_{4,2\xi} \end{bmatrix}$  (5)

Now, a 2ξ-dimensional PSNP feature vector is encoded by referring to $Z_{PSNP}$, denoted as follows:

$f_{PSNP} = [f_1\ f_2\ \cdots\ f_j\ \cdots\ f_{2\xi}]^{T}$  (6)

where $f_j$ is selected from the $Z_{PSNP}$ matrix (5) and defined as follows:

$f_j = \begin{cases} z_{1,j}, & \text{when } N_j = A \\ z_{2,j}, & \text{when } N_j = C \\ z_{3,j}, & \text{when } N_j = G \\ z_{4,j}, & \text{when } N_j = U \end{cases} \quad (j = 1, 2, \ldots, 2\xi)$  (7)

Position-specific dinucleotide propensity (PSDP). In order to calculate the position-specific dinucleotide (double nucleotide) propensity (PSDP) [38] of an RNA segment, the RNA segment described in (1) is rewritten as follows:

$R_\xi = N_1 N_2 \cdots N_{2\xi} = D_1 D_2 \cdots D_{2\xi-1}$  (8)

where $D_j = N_j N_{j+1}$ (j = 1, 2, ..., 2ξ−1) is the dinucleotide at the j-th position, taking one of 16 types, i.e., $D_j \in \{AA, AC, AG, \ldots, UU\}$.

Based on the same principle as the generation of $Z_{PSNP}$, we can obtain a 16×(2ξ−1) matrix named $Z_{PSDP}$ (9):

$Z_{PSDP} = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,2\xi-1} \\ z_{2,1} & z_{2,2} & \cdots & z_{2,2\xi-1} \\ \vdots & \vdots & \ddots & \vdots \\ z_{16,1} & z_{16,2} & \cdots & z_{16,2\xi-1} \end{bmatrix}$  (9)

where $z_{i,j} = z_{i,j}^{+} - z_{i,j}^{-}$, $z_{i,j}^{+}$ and $z_{i,j}^{-}$ are the dinucleotide frequencies at the j-th (j = 1, 2, ..., 2ξ−1) position of the positive and negative reduced RNA segments, i (i = 1, 2, ..., 16) indexes the 16 basic types of dinucleotide, and the numerical codes 1, 2, ..., 16 represent the dinucleotides AA, AC, ..., UU. After that, a (2ξ−1)-dimensional PSDP feature vector is encoded by referring to $Z_{PSDP}$, denoted as follows:

$f_{PSDP} = [f_1\ f_2\ \cdots\ f_j\ \cdots\ f_{2\xi-1}]^{T}$  (10)

where $f_j$ is obtained from the $Z_{PSDP}$ matrix and defined as follows:

$f_j = \begin{cases} z_{1,j}, & \text{when } D_j = AA \\ z_{2,j}, & \text{when } D_j = AC \\ \vdots \\ z_{16,j}, & \text{when } D_j = UU \end{cases} \quad (j = 1, 2, \ldots, 2\xi-1)$  (11)
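As a concrete illustration of the PSNP and PSDP encodings above, the following Python sketch (illustrative code, not the authors' implementation; the segment data are toy values) builds the frequency-difference table of Eq. (4) and encodes a segment according to Eqs. (6)-(7); setting the word length k = 2 gives the PSDP analogue of Eqs. (9)-(11).

```python
from collections import Counter

def propensity_table(pos_segs, neg_segs, k=1):
    """Per-position frequency difference between positive and negative
    segments (Eq. 4 for k=1, i.e. PSNP; k=2 gives the PSDP analogue)."""
    n_positions = len(pos_segs[0]) - k + 1
    table = []
    for j in range(n_positions):
        pos = Counter(s[j:j + k] for s in pos_segs)
        neg = Counter(s[j:j + k] for s in neg_segs)
        keys = set(pos) | set(neg)
        table.append({w: pos[w] / len(pos_segs) - neg[w] / len(neg_segs)
                      for w in keys})
    return table

def encode(seg, table, k=1):
    """Pick, at each position, the table entry for the word actually
    observed in the segment (Eqs. 6-7 / 10-11)."""
    return [table[j].get(seg[j:j + k], 0.0) for j in range(len(table))]

# Toy example with xi = 2 (4-nucleotide reduced segments).
pos = ["ACGU", "ACGA", "AAGU"]
neg = ["UGCA", "UGCU", "CGCA"]
psnp = propensity_table(pos, neg, k=1)
psdp = propensity_table(pos, neg, k=2)
print(encode("ACGU", psnp, k=1))  # 2*xi = 4 PSNP values
print(encode("ACGU", psdp, k=2))  # 2*xi - 1 = 3 PSDP values
```

For the real data, `pos` and `neg` would hold the 48-nucleotide reduced segments (51 minus the central GAC) of the methylated and non-methylated samples.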

Nucleotide composition feature (NC). Nucleotide composition (NC) is a classical method for computing a feature vector from a nucleotide sequence and has been widely used in previous research [38,51,52]. Single-nucleotide frequencies yield a 4-dimensional vector (A, C, G, U), dinucleotide frequencies yield a 16-dimensional vector (AA, AC, ..., UU), and trinucleotide frequencies yield a 64-dimensional vector (AAA, AAC, ..., UUU). Therefore, an 84-dimensional (4+16+64 = 84) feature vector is generated from an RNA sequence.

In this study, the window length of the RNA sequences is 51 (ξ = (51−3)/2 = 24), so the length of PSNP is 2×24 = 48 and the length of PSDP is 2×24−1 = 47. By integrating all three feature vectors, we obtain a 179-dimensional feature vector for each RNA sequence.

2.3 Single-layer autoencoder

A single-layer autoencoder is an unsupervised learning algorithm trained with back-propagation; it is a neural network with one hidden layer whose target values are set equal to the input vector [53-56]. Given an input vector $x \in R^n$, the hidden activation is:

$h(x) = f(W_1 x + b_1)$  (12)

where the non-linear activation function is $f(z) = 1/(1 + \exp(-z))$, the activation $h(x) \in R^m$, $W_1$ is an m×n weight matrix and $b_1 \in R^m$ is a bias vector. The output of this network is:

$\tilde{x} = f(W_2 h(x) + b_2)$  (13)

where $\tilde{x} \in R^n$ is the output vector, $W_2$ is an n×m weight matrix, and $b_2 \in R^n$ is a bias vector.

Given input vectors $x_i$ (i = 1, 2, ..., p), we aim to minimize the reconstruction error $\sum_{i=1}^{p} \|x_i - \tilde{x}_i\|^2$; the weight matrices $W_1$ and $W_2$ and the bias vectors $b_1$ and $b_2$ are updated by back-propagation. The back-propagation algorithm specifies a cost function, obtains the partial derivatives of the weights from the gradient of the cost function, and then modifies the weights and biases; the detailed derivation is given in [57].
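The autoencoder just described can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: it uses a tanh hidden layer, a linear reconstruction layer, and plain per-sample gradient descent on the squared reconstruction error; the layer sizes, learning rate and toy data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 8, 3                      # input dim, hidden dim (deep-feature dim i)
X = rng.normal(size=(200, n))    # toy data in place of the 179-dim vectors

W1 = rng.normal(scale=0.1, size=(m, n)); b1 = np.zeros(m)
W2 = rng.normal(scale=0.1, size=(n, m)); b2 = np.zeros(n)

def forward(x):
    h = np.tanh(W1 @ x + b1)     # Eq. (12), with tanh activation
    return h, W2 @ h + b2        # Eq. (13), linear output here

lr = 0.01
losses = []
for epoch in range(50):
    total = 0.0
    for x in X:
        h, x_hat = forward(x)
        d = x_hat - x            # gradient of ||x_hat - x||^2 (up to a factor 2)
        total += float(d @ d)
        gW2 = np.outer(d, h); gb2 = d
        dh = (W2.T @ d) * (1 - h * h)   # backprop through tanh
        gW1 = np.outer(dh, x); gb1 = dh
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    losses.append(total / len(X))

deep_features = np.array([forward(x)[0] for x in X])  # hidden activations
print(losses[0], losses[-1], deep_features.shape)
```

The hidden activations collected at the end play the role of the "deep features" that are later concatenated with the original 179-dimensional vector.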

In this study, we used the autoencoder to construct the deep features. The key structural parameters of the autoencoder are: (1) i, the number of nodes in the hidden layer, which is also the dimension of the deep features (we optimize this parameter); (2) the activation function, tanh; (3) the loss function, mean squared error (MSE).

2.4 Extreme Gradient Boosting (XGBoost)

Most machine learning and deep learning algorithms can be used effectively to set up a model; however, these methods often ignore the interpretability of features. The XGBoost algorithm [58,59] is an extreme gradient boosting method [60] that has demonstrated state-of-the-art advantages in machine learning research. It not only has the high accuracy of traditional boosting algorithms, but can also handle sparse data efficiently and implement distributed and parallel computing flexibly [59]. To compute the target variable, the XGBoost algorithm builds a series of decision trees and assigns each leaf node a quantized weight [61].

Given an n×m feature matrix of training data, the predictor uses K additive functions to ensemble the results (14):

$\hat{y}_i = F(x_i) = \sum_{k=1}^{K} f_k(X_i), \quad f_k \in \varphi$  (14)

where $X_i$ is one of the samples (i = 1, 2, ..., n), $\varphi = \{f(x) = w_{s(x)}\}$ ($s: R^m \to T_r$, $w \in R^{T_r}$) is the space of trees, each tree f(x) has its structure function s and leaf weights w, $w_i$ is the weight of the i-th leaf, $T_r$ is the number of leaves in the tree, K is the number of trees used to ensemble the results, and $\hat{y}_i$ is the predicted label.

To minimize the loss, Chen et al. used greedy search rules; the second-order approximated objective at iteration t (15) is:

$L^{(t)} = \sum_{i=1}^{n} \left[ l(y_i, \hat{y}^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$  (15)

$g_i = \partial_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$  (16)

$h_i = \partial^2_{\hat{y}^{(t-1)}}\, l(y_i, \hat{y}^{(t-1)})$  (17)

$\Omega(f_t) = \gamma T_r + \tfrac{1}{2} \lambda \|w\|^2$  (18)

where $g_i$ and $h_i$ are the first- and second-order gradient statistics of the loss function and $l(\cdot)$ denotes the loss function. The last term $\Omega(f_t)$ is the penalty, in which γ and λ are parameters that control the complexity of the tree; this regularization term helps to avoid over-fitting by smoothing the final learnt weights. Chen et al. evaluated split candidates by $L_{split}$ (19); the loss reduction after splitting is given by:

$L_{split} = \frac{1}{2} \left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$  (19)

where $I = I_L \cup I_R$, and $I_L$ and $I_R$ are the instance sets of the left and right nodes after splitting.

To obtain the importance of each splitting variable in a tree, we calculated the relative variable importance in the XGBoost model. The idea is to weight the squared improvement contributed by each split, accounting for the number of times a variable is selected for splitting, and then to average over all the trees. The importance of variable j in a tree Tr (20) is defined as follows:

$\hat{I}_j^2(Tr) = \sum_{t=1}^{J-1} \hat{i}_t^2\, \mathbb{1}(v_t = j)$  (20)

where $\mathbb{1}$ is the indicator function associated with the squared influence, $v_t$ is the splitting variable associated with internal node t, and $\hat{i}_t^2$ is the empirical improvement in squared error caused by the split, defined as follows:

$\hat{i}_t^2 = i^2(R_l, R_r) = \frac{w_l w_r}{w_l + w_r} (\bar{y}_l - \bar{y}_r)^2$  (21)

where $\bar{y}_l$ and $\bar{y}_r$ are the weighted means of the left and right child nodes of t, and $w_l$ and $w_r$ are the corresponding sums of weights. For a collection of decision trees $\{Tr_m\}_1^M$ obtained through boosting, Eq. (21) can be generalized by averaging over all the trees in the sequence. Therefore, the overall importance (22) is defined as follows:

$\hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(Tr_m)$  (22)
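Eq. (19) can be illustrated with a short, self-contained function (a sketch of the scoring rule, not the XGBoost library internals); the gradient and Hessian values below are toy numbers made up for the example.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction of a candidate split (Eq. 19): half the sum of the
    children's structure scores G^2/(H+lam) minus the parent's score,
    minus the split penalty gamma."""
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)
    gl, hl = sum(g_left), sum(h_left)
    gr, hr = sum(g_right), sum(h_right)
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Toy gradients/hessians of the instances falling left and right of a split.
gain = split_gain(g_left=[-1.0, -1.0], h_left=[1.0, 1.0],
                  g_right=[2.0], h_right=[1.0], lam=1.0, gamma=0.0)
print(gain)  # 0.5 * (4/3 + 2 - 0) = 5/3
```

A positive gain means the split reduces the regularized objective; candidates with the largest gain are chosen greedily during tree construction.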

2.5 Particle Swarm Optimizer

Particle swarm optimization (PSO) [62] is a stochastic algorithm that simulates flocks of birds or schools of fish searching for food and interacting within the group. Each particle in the swarm has a velocity vector $V_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$ and a position vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ that guide it toward a potentially optimal solution [63], where i = 1, 2, ..., n, n is the number of particles in the swarm, and D is the number of parameters to be optimized. The personal best position of particle i is $pbest_i = (pbest_{i1}, pbest_{i2}, \ldots, pbest_{iD})$, and the global best position is $gbest = (gbest_1, gbest_2, \ldots, gbest_D)$. The velocity vector $V_i$ and the position vector $X_i$ are randomly initialized at the beginning of the algorithm and are updated as follows:

$V_{i,j}(t+1) = w V_{i,j}(t) + c_1 r_1 [pbest_{i,j}(t) - X_{i,j}(t)] + c_2 r_2 [gbest_j(t) - X_{i,j}(t)]$  (23)

$X_{i,j}(t+1) = X_{i,j}(t) + V_{i,j}(t+1)$  (24)

where i indexes the particles, j indexes the dimensions of $V_i$ and $X_i$ (j = 1, 2, ..., D), t is the iteration counter (t = 1, 2, ..., T), $c_1$ and $c_2$ are acceleration constants, $r_1$ and $r_2$ are random numbers uniformly distributed in [0, 1], and w is the inertia weight used to balance global and local search ability.

In this study, after updating the velocity and position vectors, a boundary constraint (25) is applied: the position vector $X_i$ must stay within the upper and lower bounds [64]. If a component of $X_i$ exceeds the upper bound or falls below the lower bound, it is reflected back according to (25). Finally, we evaluate the fitness function; the best fitness $F_i(X_i)$ and the corresponding best position $X_i$ are retained, and the best $X_i$ is carried forward as gbest for the next iteration.

$X_{i,j} = \begin{cases} 2l - X_{i,j}, & \text{if } X_{i,j} < l \\ 2u - X_{i,j}, & \text{if } X_{i,j} > u \\ X_{i,j}, & \text{otherwise} \end{cases}$  (25)

where l is the lower bound and u is the upper bound. In this study, the parameters we optimize range from 0 to 1, so we set the lower bound to 0 and the upper bound to 1.
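The update rules (23)-(25) can be sketched in pure Python as follows. Everything here is illustrative: the stand-in fitness is a simple analytic function playing the role of the cross-validated AUC that the paper maximizes, and the inertia weight is set to a conventional 0.7 rather than the paper's setting.

```python
import random

random.seed(1)

def pso(fitness, dim, n_particles=20, iters=30,
        w=0.7, c1=2.0, c2=2.0, lower=0.0, upper=1.0):
    """Maximize `fitness` over [lower, upper]^dim with a basic PSO."""
    X = [[random.uniform(lower, upper) for _ in range(dim)]
         for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_f = [fitness(x) for x in X]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]

    for _ in range(iters):
        for i in range(n_particles):
            for j in range(dim):
                r1, r2 = random.random(), random.random()
                V[i][j] = (w * V[i][j]                           # Eq. (23)
                           + c1 * r1 * (pbest[i][j] - X[i][j])
                           + c2 * r2 * (gbest[j] - X[i][j]))
                X[i][j] += V[i][j]                               # Eq. (24)
                if X[i][j] < lower:                              # Eq. (25): reflect
                    X[i][j] = 2 * lower - X[i][j]
                elif X[i][j] > upper:
                    X[i][j] = 2 * upper - X[i][j]
            f = fitness(X[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = X[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = X[i][:], f
    return gbest, gbest_f

# Hypothetical stand-in fitness with a known optimum at (0.1, 0.01, 0.5),
# mimicking a tuner for ('eta', 'alpha', 'lambda'); the real fitness in the
# paper is the 10-fold cross-validated AUC of the XGBoost model.
target = (0.1, 0.01, 0.5)
fit = lambda x: -sum((a - b) ** 2 for a, b in zip(x, target))
best, best_f = pso(fit, dim=3)
print(best, best_f)
```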

2.6 XGBoost optimized by PSO (PXGB)

XGBoost optimized by PSO (PXGB) is the predictor that uses XGBoost to build the model and PSO to optimize its parameters. The detailed framework is shown in Fig 2. Each given RNA sequence is first transformed into the PSNP+PSDP+NC feature vector and the deep feature vector. After combining the two types of features, we feed them into the M6A-PXGB model to predict whether the sequence contains an m6A site or not.

The main parameters of the XGBoost algorithm that need to be optimized are 'max_depth', 'eta', 'min_child_weight', 'subsample', 'alpha' and 'lambda'. 'max_depth' is the maximum depth of a tree. 'eta' is the learning rate, which shrinks the weights to improve the robustness of the model. 'min_child_weight' is the minimum sum of instance weights in a child. 'subsample' is the proportion of random sampling. 'alpha' corresponds to γ in (18) and 'lambda' to λ, which control the model to avoid over-fitting. After tuning the parameters by grid search, we found the best values: 'max_depth': 3, 'eta': 0.1, 'min_child_weight': 0.01, 'subsample': 0.8, 'alpha': 0.01, 'lambda': 0.5. We also found that the more precisely 'eta', 'alpha' and 'lambda' are tuned, the better the model predicts. Therefore, we used the particle swarm optimizer (PSO) to optimize these three parameters.

In this study, the dimension D of the PSO position vector is set to 3, the number of particles n to 20, the number of iterations T to 30, the acceleration constants $c_1$ and $c_2$ to 2, and the inertia weight w to 2, and maximizing the AUC is our fitness function.

Figure 2. Framework of the proposed M6A-PXGB

2.7 Performance assessment

In previous studies, three methods have been used to evaluate the effectiveness of a predictor: the independent dataset test, the subsampling (K-fold cross-validation) test, and the jackknife test [64]. In the jackknife test, each sample is singled out in turn as the test set while the others form the training set. The jackknife test is the most credible of the three [65]. However, the large size of our dataset makes the jackknife test too time consuming, so we use 10-fold cross-validation instead to reduce the computing time.

We use Accuracy (Acc), Sensitivity (Sn), Specificity (Sp), the Matthews correlation coefficient (MCC) and the Area Under the Receiver-Operating Characteristic Curve (AUC) to evaluate the performance of our predictor [66,67]. AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate; the higher the AUC score, the better the predictive effect of the model. The other metrics are defined as follows:

$Sn = 1 - \frac{N_-^+}{N^+}$  (26)

$Sp = 1 - \frac{N_+^-}{N^-}$  (27)

$Acc = 1 - \frac{N_-^+ + N_+^-}{N^+ + N^-}$  (28)

$MCC = \dfrac{1 - \left( \frac{N_-^+}{N^+} + \frac{N_+^-}{N^-} \right)}{\sqrt{\left( 1 + \frac{N_+^- - N_-^+}{N^+} \right)\left( 1 + \frac{N_-^+ - N_+^-}{N^-} \right)}}$  (29)

where $N^+$ is the total number of positive samples, $N_-^+$ is the number of positive samples predicted as negative, $N^-$ is the total number of negative samples, and $N_+^-$ is the number of negative samples predicted as positive.
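Eqs. (26)-(29) can be computed directly from the four counts. The sketch below (toy counts, for illustration only) also cross-checks Eq. (29) against the familiar confusion-matrix form of the MCC, which it equals algebraically.

```python
import math

def metrics(n_pos, n_neg, fn, fp):
    """Sn, Sp, Acc, MCC per Eqs. (26)-(29). Here fn is the number of
    positive samples predicted negative and fp the number of negative
    samples predicted positive."""
    sn = 1 - fn / n_pos                       # Eq. (26)
    sp = 1 - fp / n_neg                       # Eq. (27)
    acc = 1 - (fn + fp) / (n_pos + n_neg)     # Eq. (28)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / math.sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))  # Eq. (29)
    return sn, sp, acc, mcc

sn, sp, acc, mcc = metrics(n_pos=100, n_neg=100, fn=20, fp=30)
print(sn, sp, acc, round(mcc, 4))

# Cross-check: MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
tp, tn = 100 - 20, 100 - 30
ref = (tp * tn - 30 * 20) / math.sqrt(
    (tp + 30) * (tp + 20) * (tn + 30) * (tn + 20))
print(abs(mcc - ref) < 1e-12)
```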


3. Results and discussion

3.1 The choice of different features


A growing number of methods for extracting features from RNA sequences have been proposed, and many features have been used to build predictors for m6A sites. We evaluated each of five individual features, i.e., PSNP, PSDP, NC, BE and PC, and three combined features, using five prediction engines, i.e., Support Vector Machine (SVM) [68], Random Forest (RF) [69], K-Nearest Neighbor (KNN) [70], XGBoost and XGBoost optimized by PSO (PXGB), over 10-fold cross-validation repeated 10 times. The BE feature is a binary encoding of the RNA sequence, i.e., A is represented by (1000), U by (0100), G by (0010) and C by (0001). The PC feature is the physical-chemical property matrix [36], which encodes an RNA sequence into a feature vector.

As can be seen from Table 1, first, the combined features achieve a higher AUC value than the single features, and this holds across all five prediction engines (SVM, RF, KNN, XGBoost and PXGB). In SVM, the AUC value for PSNP+PSDP+NC is 3.35% higher than for PSNP, 1.74% higher than for PSDP and 13.61% higher than for NC. This indicates that PSNP and PSDP represent the positional characteristics of RNA sequences very well, while the NC feature alone still misses some information; the features can thus be regarded as complementary. In PXGB, the AUC value for PSNP+PSDP+NC is 3.13% higher than for PSNP, 1.7% higher than for PSDP and 10.15% higher than for NC, which illustrates the effectiveness of our method even for single features. The NC feature performs unsatisfactorily with the other engines, whereas with PXGB it achieves its best performance. Table 1 also shows that, whichever engine is chosen, PSNP+PSDP+NC performs best among the three combined features.

Second, comparing XGBoost and PXGB, PXGB achieves better ACC and MCC values. The NC feature, which has an insignificant effect in XGBoost, obtains an AUC value in PXGB that is 1.12% higher than in XGBoost. With the best combined feature, PSNP+PSDP+NC, the AUC value of PXGB is 1.53% higher and the MCC value 1.61% higher than with the same features in XGBoost. Therefore, our method of optimizing the parameters is effective. Moreover, to reach the same precision, the grid search in XGBoost takes at least 24 hours, whereas PXGB needs only about 40 minutes to optimize the three parameters.

Third, among all the prediction engines, SVM obtains the best Sn value with the PSNP+PSDP+NC feature and the best Sp value with the PSDP feature, but PXGB obtains the best AUC, ACC and MCC values. On our dataset, SVM performs better than KNN or RF: its AUC value with the best combined feature is 2.52% higher than RF and 4.81% higher than KNN, and SVM beats KNN and RF on the same features throughout. Compared with PXGB, however, SVM is less satisfactory: the AUC value of PXGB is 2.89% higher, the ACC value 1.83% higher and the MCC value 3.76% higher than those of SVM. Based on the above, we chose the combined features of PSNP, PSDP and NC as the

input feature, and PXGB as a valid prediction engine.

Table 1. The results of five features and three combined features with 10-fold cross-validation

predictors  features        Auc     Acc     Sn      Sp      Mcc
SVM         PSNP            0.7645  0.6955  0.6949  0.6961  0.3908
            PSDP            0.7806  0.6970  0.5689  0.8284  0.4108
            NC              0.6619  0.6194  0.7180  0.5226  0.2453
            BE              0.7835  0.7234  0.7248  0.7220  0.4466
            PC              0.7827  0.7318  0.7266  0.7373  0.4635
            PC+BE           0.7832  0.7303  0.7235  0.7356  0.4606
            PSNP+PSDP+PC    0.7827  0.7311  0.7252  0.7373  0.4621
            PSNP+PSDP+NC    0.7980  0.7422  0.7572  0.7330  0.4858
RF          PSNP            0.7210  0.6710  0.6053  0.7382  0.3466
            PSDP            0.7737  0.7200  0.6515  0.7901  0.4453
            NC              0.6581  0.6113  0.5189  0.7068  0.2298
            BE              0.6975  0.6465  0.5552  0.7536  0.2993
            PC              0.7122  0.6553  0.5782  0.7334  0.3154
            PC+BE           0.7325  0.6836  0.5991  0.7671  0.3723
            PSNP+PSDP+PC    0.7329  0.6767  0.6076  0.7485  0.3590
            PSNP+PSDP+NC    0.7728  0.7089  0.6337  0.7834  0.4218
KNN         PSNP            0.7592  0.7219  0.7166  0.7299  0.4451
            PSDP            0.7390  0.7073  0.7054  0.7083  0.4140
            NC              0.6253  0.5910  0.6440  0.5386  0.1839
            BE              0.6484  0.6090  0.6935  0.5247  0.2214
            PC              0.6561  0.6213  0.6867  0.5579  0.2465
            PC+BE           0.6564  0.6216  0.6974  0.5476  0.2477
            PSNP+PSDP+PC    0.6610  0.6351  0.7062  0.5623  0.2716
            PSNP+PSDP+NC    0.7499  0.7123  0.7356  0.6887  0.4243
XGBoost     PSNP            0.7884  0.7364  0.7391  0.7337  0.4728
            PSDP            0.8112  0.7482  0.7390  0.7575  0.4966
            NC              0.7142  0.6507  0.6679  0.6335  0.3016
            BE              0.7889  0.7422  0.7337  0.7506  0.4844
            PC              0.7842  0.7337  0.7299  0.7376  0.4675
            PC+BE           0.7910  0.7418  0.7368  0.7467  0.4836
            PSNP+PSDP+PC    0.8001  0.7456  0.7414  0.7498  0.4912
            PSNP+PSDP+NC    0.8116  0.7536  0.7494  0.7582  0.5073
PXGB        PSNP            0.7956  0.7422  0.7425  0.7415  0.4837
            PSDP            0.8199  0.7525  0.7399  0.7640  0.5044
            NC              0.7254  0.6641  0.6832  0.6454  0.3294
            BE              0.7960  0.7467  0.7456  0.7493  0.4939
            PC              0.7902  0.7395  0.7395  0.7402  0.4797
            PC+BE           0.7969  0.7414  0.7487  0.7361  0.4840
            PSNP+PSDP+PC    0.8099  0.7471  0.7434  0.7513  0.4936
            PSNP+PSDP+NC    0.8269  0.7605  0.7556  0.7646  0.5234

3.2 Combine with deep features

After selecting the best features, we used a single-layer autoencoder to build the deep features. The best performance among the pure deep features came from the 120-dimensional deep features, with an AUC value of 0.8200. We can conclude that, in constructing deep features, the higher the dimension, the more of the original information the features retain. To obtain better performance, we combined the original features with the deep features. As can be seen in Table 2, after adding the original features, the results are significantly better than before. The combination of the 60-dimensional deep features and PSNP+PSDP+NC gives the best performance. Comparing the results across dimensions, these combined features raise the AUC, ACC, Sn, Sp and MCC values. In particular, although the Sn value in Table 1 was not the highest, the current features raise the Sn value to 0.7638. As for the other measures, the AUC value increased from 0.8269 to 0.8390, the ACC value reached 0.7713, and the MCC value reached 0.5346.

Therefore, combining deep features with the original features improves performance.

Table 2. The results of deep features and the combined features with 10-fold cross-validation

Dimension      AUC     ACC     Sn      Sp      MCC
PNP            0.8269  0.7605  0.7556  0.7646  0.5234
Deep_50        0.8051  0.7295  0.7360  0.7230  0.4591
Deep_60        0.8070  0.7375  0.7368  0.7383  0.4751
Deep_80        0.8158  0.7506  0.7452  0.7559  0.5012
Deep_100       0.8042  0.7364  0.7376  0.7353  0.4728
Deep_120       0.8200  0.7498  0.7483  0.7513  0.4996
Deep_50+PNP    0.8232  0.7548  0.7491  0.7605  0.5096
Deep_60+PNP    0.8390  0.7713  0.7638  0.7598  0.5346
Deep_80+PNP    0.8234  0.7510  0.7406  0.7613  0.5020
Deep_100+PNP   0.8222  0.7560  0.7567  0.7551  0.5119
Deep_120+PNP   0.8251  0.7556  0.7543  0.7667  0.5111

3.3 Interpretability

As a tree-boosting model, XGBoost is composed of classification trees as its basic learners [58]. To enhance the interpretability of the entire model, weak decision trees can be used to reduce its complexity. The interpretability of the XGBoost model lies mainly in ranking the importance of the features. In the section above, we used the XGBoost model to predict m6A sites in RNA sequences. As shown in Fig 3, we plot the importance scores of the top 25 features, in which features 1 to 179 are the original features and features 180 to 239 are the deep features. Among the top 20 features in the importance ranking, 11 are deep features and 9 are original features. Among the top 6 important features, 5 are deep features, and their importance scores are very high. In particular, the score of f234, the most important deep feature in our model, is 29, while the score of f74, the most important original feature, is 20. We can conclude that deep features of an appropriate dimension can supplement the hidden information of the original features.

Figure 3. The feature importance scores
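The top-20 split described above can be reproduced from an importance-score vector alone. The sketch below uses made-up scores (in practice they would come from the trained booster's per-feature importance output); only the counting logic is the point:

```python
def count_top_deep(scores, k=20, deep_start=180):
    """Split the top-k features by importance into deep vs. original,
    where features f{deep_start}..f239 are the deep features."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    n_deep = sum(1 for name in top if int(name[1:]) >= deep_start)
    return n_deep, k - n_deep

# Hypothetical scores keyed by feature name (f1..f239), for illustration only.
scores = {f"f{i}": (i * 37) % 101 for i in range(1, 240)}
n_deep, n_orig = count_top_deep(scores)
print(n_deep, n_orig)
```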

3.4 Comparison with other predictors

To demonstrate the effectiveness of our proposed method, we compared our predictor with five other predictors: iRNA-Methyl [34], pRNAm-PC [35], M6A-HPCS [36], M6Apred [37], and TargetM6A [38]. The detailed results are shown in Table 3. Our method achieves a higher AUC value than the other predictors. Compared with TargetM6A, M6A-PXGB improves the AUC value by 0.021; compared with iRNA-Methyl, pRNAm-PC, M6A-HPCS and M6Apred, it improves the AUC by 0.138, 0.077, 0.057 and 0.047, respectively. In addition to the AUC comparison, we also calculated the sensitivity, specificity, accuracy and MCC, shown in Table 3. Our predictor is better than the others in AUC, ACC, Sn and MCC. In addition, the ROC curves are plotted in Fig 4, together with the AUC values listed in Table 3. Although TargetM6A is 0.0183 higher than our method on Sp, our method performs significantly better than the state-of-the-art tools overall.

Table 3. The results of comparison with other predictors

Predictor    AUC    ACC     Sn      Sp      MCC
iRNA-Methyl  0.701  0.6559  0.6063  0.7055  0.313
pRNAm-PC     0.762  0.6974  0.6972  0.6975  0.394
M6A-HPCS     0.782  0.7238  0.7735  0.6741  0.451
M6Apred      0.792  0.7418  0.7391  0.7445  0.483
TargetM6A    0.818  0.7632  0.7483  0.7781  0.526
M6A-PXGB     0.839  0.7713  0.7638  0.7598  0.535

Figure 4. The AUC value of different predictors
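The metrics reported in Tables 2 and 3 follow the standard confusion-matrix definitions. A short sketch (the formulas are the conventional ones; the counts below are illustrative, not taken from the paper's experiments):

```python
import math

def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, acc, mcc

# Toy counts for illustration.
sn, sp, acc, mcc = metrics(tp=76, fn=24, tn=76, fp=24)
print(round(sn, 2), round(sp, 2), round(acc, 2), round(mcc, 2))
# → 0.76 0.76 0.76 0.52
```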

4. Conclusion

In this study, we proposed a new prediction engine named M6A-PXGB, which uses the combination of deep features and original features to predict m6A sites. The novelty and contribution of this predictor are that, for the first time, PSO is applied to optimize the parameters of XGBoost, and that the resulting model is the first of its kind applied to the prediction of m6A sites. Although optimizing the parameters of XGBoost is somewhat time-consuming, this process takes far less time than a grid search reaching the same accuracy. We then compared our method with different prediction engines and predictors; the experimental results show that our predictor is very promising and outperforms them. The methods of this study can also be applied to other sequence-based site prediction problems. As pointed out in [71] and demonstrated in a series of recent publications (see, e.g., [42, 44, 72-84]), user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful prediction methods and computational tools. Indeed, many practically useful web-servers have significantly increased the impact of bioinformatics on medical science [85], driving medicinal chemistry into an unprecedented revolution [86]. We shall therefore make efforts in our future work to provide a web-server for the prediction method presented in this paper.
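The PSO step searches XGBoost's hyperparameter space by fitness-guided particle updates. The loop below is a minimal, self-contained sketch of canonical PSO; the quadratic objective is only a stand-in for the real fitness (e.g., 1 minus the 10-fold CV accuracy of an XGBoost model built from the candidate parameter vector), and the inertia/acceleration constants are common defaults, not values taken from the paper:

```python
import random

def pso(objective, bounds, n_particles=20, n_iters=50, w=0.7, c1=1.5, c2=1.5):
    """Minimize `objective` over a box `bounds` with basic particle swarm."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # per-particle best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # swarm-wide best

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clip each coordinate back into its allowed range.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Toy 2-D objective standing in for "1 - CV accuracy" over (eta, max_depth).
random.seed(0)
best, best_val = pso(lambda p: (p[0] - 0.3) ** 2 + (p[1] - 6.0) ** 2,
                     bounds=[(0.01, 1.0), (1.0, 10.0)])
print(best, best_val)
```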

Acknowledgments

This research is partially supported by the National Natural Science Foundation of China (61403077), the China Postdoctoral Science Foundation funded project (2014M550166, 2015T80285), and the Natural Science Foundation of the Education Department of JiLin Province (2016-505).


References


[1] H.L. Xie, L. Fu, X.D. Nie, Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou's PseAAC. Protein Eng Des Sel 26 (2013) 735-742. [2] Y. Xu, X.J. Shao, L.Y. Wu, N.Y. Deng, iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 1 (2013) e171. [3] C. Jia, X. Lin, Z. Wang, Prediction of Protein S-Nitrosylation Sites Based on Adapted Normal Distribution Bi-Profile Bayes and Chou's Pseudo Amino Acid Composition. Int J Mol Sci 15 (2014) 10410-23. [4] W.R. Qiu, X. Xiao, W.Z. Lin, iMethyl-PseAAC: Identification of Protein Methylation Sites via a Pseudo Amino Acid Composition Approach. Biomed Res Int (BMRI) 2014 (2014) 947416. [5] Y. Xu, X. Wen, X.J. Shao, N.Y. Deng, iHyd-PseAAC: Predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. International Journal of Molecular Sciences (IJMS) 15 (2014) 7594-7610. [6] J. Zhang, X. Zhao, P. Sun, Z. Ma, PSNO: Predicting Cysteine S-Nitrosylation Sites by Incorporating Various Sequence-Derived Features into the General Form of Chou's PseAAC. Int J Mol Sci 15 (2014) 11204-19. [7] W. Chen, H. Tang, J. Ye, H. Lin, iRNA-PseU: Identifying RNA pseudouridine sites Molecular Therapy - Nucleic Acids 5 (2016) e332. [8] J. Jia, Z. Liu, X. Xiao, B. Liu, iSuc-PseOpt: Identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Anal Biochem 497 (2016) 48-56. [9] J. Jia, Z. Liu, X. Xiao, B. Liu, pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. Journal of Theoretical Biology 394 (2016) 223-230. [10] J. Jia, Z. Liu, X. Xiao, B. Liu, iCar-PseCp: identify carbonylation sites in proteins by Monto Carlo sampling and incorporating sequence coupled effects into general PseAAC. 
Oncotarget 7 (2016) 34558-34570. [11] J. Jia, L. Zhang, Z. Liu, X. Xiao, pSumo-CD: Predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC. Bioinformatics 32 (2016) 3133-3141. [12] Z. Ju, J.Z. Cao, H. Gu, Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou's general PseAAC. J Theor Biol 397 (2016) 145-150. [13] W.R. Qiu, B.Q. Sun, X. Xiao, Z.C. Xu, iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC. Oncotarget 7 (2016) 44310-44321. [14] W.R. Qiu, B.Q. Sun, X. Xiao, Z.C. Xu, iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics 32 (2016) 3116-3123. [15] W.R. Qiu, X. Xiao, Z.C. Xu, iPhos-PseEn: identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier. Oncotarget 7 (2016)


51270-51283. [16] Z. Ju, J.J. He, Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into Chou's general PseAAC. J Mol Graph Model 77 (2017) 200-204. [17] W.R. Qiu, S.Y. Jiang, B.Q. Sun, X. Xiao, X. Cheng, iRNA-2methyl: identify RNA 2′-O-methylation sites by incorporating sequence-coupled effects into general PseKNC and ensemble classifier. Medicinal Chemistry 13 (2017) 734-743. [18] S. Akbar, M. Hayat, iMethyl-STTNC: Identification of N(6)-methyladenosine sites by extending the Idea of SAAC into Chou's PseAAC to formulate RNA sequences. J Theor Biol 455 (2018) 205-211. [19] W. Chen, H. Ding, X. Zhou, H. Lin, iRNA(m6A)-PseDNC: Identifying N6-methyladenosine sites using pseudo dinucleotide composition. Analytical Biochemistry 561-562 (2018) 59-65. [20] W. Chen, P. Feng, H. Yang, H. Ding, H. Lin, iRNA-3typeA: identifying 3-types of modification at RNA's adenosine sites. Molecular Therapy: Nucleic Acid 11 (2018) 468-474. [21] P. Feng, H. Yang, H. Ding, H. Lin, W. Chen, iDNA6mA-PseKNC: Identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics doi:10.1016/j.ygeno.2018.01.005 (2018). [22] A.W. Ghauri, Y.D. Khan, N. Rasool, S.A. Khan, pNitro-Tyr-PseAAC: Predict nitrotyrosine sites in proteins by incorporating five features into Chou's general PseAAC. Curr Pharm Des doi:10.2174/1381612825666181127101039 (2018). [23] Z. Ju, S.Y. Wang, Prediction of citrullination sites by incorporating k-spaced amino acid pairs into Chou's general pseudo amino acid composition. Gene 664 (2018) 78-83. [24] Y.D. Khan, N. Rasool, W. Hussain, S.A. Khan, iPhosT-PseAAC: Identify phosphothreonine sites by incorporating sequence statistical moments into PseAAC. Analytical Biochemistry 550 (2018) 109-116. [25] Y.D. Khan, N. Rasool, W. Hussain, S.A. Khan, iPhosY-PseAAC: identify phosphotyrosine sites by incorporating sequence statistical moments into PseAAC. 
Mol Biol Rep 10.1007/s11033-018-4417-z (2018). [26] M.F. Sabooh, N. Iqbal, M. Khan, M. Khan, H.F. Maqbool, Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou's PseKNC. J Theor Biol 452 (2018) 1-9. [27] L. Wang, R. Zhang, Y. Mu, Fu-SulfPred: Identification of Protein S-sulfenylation Sites by Fusing Forests via Chou's General PseAAC. J Theor Biol 461 (2019) 51-58. [28] Rozenski, J. , Crain, P. F. , & Mccloskey, J. A. . (1999). The RNA modification database: 1999 update. Nucleic Acids Research, 27(1), 196-197. [29] Dunin-Horkawicz, & S. (2006). Modomics: a database of RNA modification pathways. Nucleic Acids Research, 34(90001), D145-D149. [30] Meyer, Kate D. , and S. R. Jaffrey . "The dynamic epitranscriptome: N6-methyladenosine and gene expression control." Nature Reviews Molecular Cell Biology. [31] Nilsen, Timothy W.. "Internal mRNA Methylation Finally Finds Functions." Science 343.6176(2014):1207-1208. [32] Claudio R. Alarcón, et al. "N6-methyladenosine marks primary microRNAs for processing." Nature. [33] Meyer, Kate D. , et al. "Comprehensive Analysis of mRNA Methylation Reveals Enrichment in 3′ UTRs and near Stop Codons." Cell 149.7(2012).


[34] "iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition." Analytical Biochemistry 490(2015):S0003269715003978. [35] Liu, Zi , et al. "pRNAm-PC: Predicting N6-methyladenosine sites in RNA sequences via physical-chemical properties." Analytical Biochemistry 497(2016):60-67. [36] Zhang, Ming , et al. "Improving N6-methyladenosine site prediction with heuristic selection of nucleotide physical–chemical properties." Analytical Biochemistry 508(2016):104-113.. [37] Chen, Wei , et al. "Identification and analysis of the N6-methyladenosine in the Saccharomyces cerevisiae transcriptome." Scientific Reports 5(2015):13859. [38] Li, Guang Qing , et al. "TargetM6A: Identifying N6-methyladenosine Sites from RNA Sequences via Position-Specific Nucleotide Propensities and a Support Vector Machine." IEEE Transactions on NanoBioscience (2016):1-1. [39] W. Chen, H. Ding, X. Zhou, H. Lin, iRNA(m6A)-PseDNC: Identifying N6-methyladenosine sites using pseudo dinucleotide composition. Analytical Biochemistry 561-562 (2018) 59-65. [40] H. Yang, W.R. Qiu, G. Liu, F.B. Guo, W. Chen, H. Lin, iRSpot-Pse6NC: Identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC International Journal of Biological Sciences 14 (2018) 883-891. [41] Z.D. Su, Y. Huang, Z.Y. Zhang, Y.W. Zhao, D. Wang, W. Chen, H. Lin, iLoc-lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC. Bioinformatics doi:10.1093/bioinformatics/bty508 (2018). [42] W.R. Qiu, B.Q. Sun, X. Xiao, Z.C. Xu, J.H. Jia, iKcr-PseEns: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics 110 (2018) 239-246. [43] B. Liu, F. Yang, D.S. Huang, iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics 34 (2018) 33-40. [44] X. Cheng, X. 
Xiao, pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. Genomics 110 (2018) 50-58. [45] L. Cai, T. Huang, J. Su, X. Zhang, W. Chen, F. Zhang, L. He, Implications of newly identified brain eQTL genes and their interactors in Schizophrenia. Molecular Therapy - Nucleic Acids 12 (2018) 433-442. [46] B. Liu, F. Weng, D.S. Huang, iRO-3wPseKNC: Identify DNA replication origins by three-window-based PseKNC. Bioinformatics 34 (2018) 3086-3093. [47] J. Jia, X. Li, W. Qiu, X. Xiao, iPPI-PseAAC(CGR): Identify protein-protein interactions by incorporating chaos game representation into PseAAC. Journal of Theoretical Biology 460 (2019) 195-203. [48] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). Journal of Theoretical Biology 273 (2011) 236-247. [49] "High-Resolution Mapping Reveals a Conserved, Widespread, Dynamic mRNA Methylation Program in Yeast Meiosis." Cell 155.6(2013):1409-1421. [50] Xing, Pengwei , et al. "Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine." Scientific Reports 7(2017):46757. [51] Mohamed Hashim, Ezzeddin Kamil , and R. Abdullah . "Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter." Journal of Theoretical Biology 387(2015):88-100. [52] Vinje, Hilde , et al. "Comparing K-mer based methods for improved classification of 16S


sequences." BMC Bioinformatics 16.1(2015):205. [53] Bengio, Y. , et al. "Greedy layer-wise training of deep networks." Advances in Neural Information Processing Systems 19(2007):153-160. [54] Goodfellow, Ian J , et al. "Measuring Invariances in Deep Networks. " International Conference on Neural Information Processing Systems Curran Associates Inc. 2009. [55] Xiong, Dapeng , J. Zeng , and H. Gong . "A deep learning framework for improving long-range residue-residue contact prediction using a hierarchical strategy." Bioinformatics (2017). [56] Wei, L. , Su, R. , Wang, B. , Li, X. , & Zou, Q.. Integration of deep feature representations and handcrafted features to improve the prediction of N6-methyladenosine sites. Neurocomputing (2018). [57] Rumelhart, David E. , et al. "Backpropagation: the basic theory." Backpropagation L. Erlbaum Associates Inc. 1995. [58] Chen, Tianqi , and C. Guestrin . "XGBoost: A Scalable Tree Boosting System." (2016). [59] Gumus M, Kiran M S. Crude oil price forecasting using XGBoost. International Conference on Computer Science and Engineering. 2017:1100-1103. [60] Chen T, Tong H, Benesty M, et al. xgboost: Extreme Gradient Boosting. 2016. [61] Wang S, Dong P, Tian Y. A Novel Method of Statistical Line Loss Estimation for Distribution Feeders Based on Feeder Cluster and Modified XGBoost. Energies, 2017, 10(12):2067. [62] Kennedy, J.. Particle Swarm Optimization. Icnn95-international Conference on Neural Networks. IEEE(2002). [63] Li, Xiangtao , and M. Yin . "A particle swarm inspired cuckoo search algorithm for real parameter optimization." Soft Computing 20.4(2016):1389-1413. [64] Chou, Kuo Chen , and C. T. Zhang . "Prediction of Protein Structural Classes." Critical Reviews in Biochemistry and Molecular Biology 30.4(1995):275-349. [65] Chou, Kuo Chen , and H. B. Shen . "Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms." Nature Protocols. [66] K.C. 
Chou, Prediction of signal peptides using scaled window. Peptides 22 (2001) 1973-1979. [67] W. Chen, P.M. Feng, H. Lin, iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition Nucleic Acids Research 41 (2013) e68. [68] Cortes, Corinna , and V. Vapnik . "Support-vector networks." Machine Learning 20.3(1995):273-297. [69] Cutler, Kt , and L. Breiman . "Random forests." Machine Learning 45.1(2004):157-176. [70] Cover, T. M. . "Nearest neighbor pattern classification." IEEE Trans. Inf. Theory 13(1967). [71] K.C. Chou, H.B. Shen, Recent advances in developing web-servers for predicting protein attributes. Natural Science 1 (2009) 63-92 [72] X. Cheng, X. Xiao, pLoc-mPlant: predict subcellular localization of multi-location plant proteins via incorporating the optimal GO information into general PseAAC. Molecular BioSystems 13 (2017) 1722-1727. [73] X. Cheng, X. Xiao, pLoc-mVirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC. Gene (Erratum: ibid., 2018, Vol.644, 156-156) 628 (2017) 315-321. [74] X. Cheng, X. Xiao, pLoc-mGneg: Predict subcellular localization of Gram-negative bacterial


proteins by deep gene ontology learning via general PseAAC. Genomics 110 (2018) 231-239. [75] X. Cheng, S.G. Zhao, W.Z. Lin, X. Xiao, pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics 33 (2017) 3524-3531. [76] X. Xiao, X. Cheng, S. Su, Q. Nao, pLoc-mGpos: Incorporate key gene ontology information into general PseAAC for predicting subcellular localization of Gram-positive bacterial proteins. Natural Science 9 (2017) 331-349. [77] X. Cheng, X. Xiao, pLoc-mHum: predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information. Bioinformatics 34 (2018) 1448-1456. [78] W. Chen, P. Feng, H. Yang, H. Ding, H. Lin, iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences. Oncotarget 8 (2017) 4208-4217. [79] X. Cheng, S.G. Zhao, X. Xiao, iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics (Corrigendum, ibid., 2017, Vol.33, 2610) 33 (2017) 341-346. [80] P. Feng, H. Ding, H. Yang, W. Chen, H. Lin, iRNA-PseColl: Identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC. Molecular Therapy - Nucleic Acids 7 (2017) 155-163. [81] B. Liu, S. Wang, R. Long, iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics 33 (2017) 35-41. [82] B. Liu, F. Yang, 2L-piRNA: A two-layer ensemble classifier for identifying piwi-interacting RNAs and their function. Molecular Therapy - Nucleic Acids 7 (2017) 267-277. [83] W.R. Qiu, S.Y. Jiang, Z.C. Xu, X. Xiao, iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget 8 (2017) 41178-41188. [84] W.R. Qiu, B.Q. Sun, X. Xiao, D. 
Xu, iPhos-PseEvo: Identifying human phosphorylated proteins by incorporating evolutionary information into general PseAAC via grey system theory. Molecular Informatics 36 (2017) UNSP 1600010. [85] K.C. Chou, Impacts of bioinformatics to medicinal chemistry. Medicinal Chemistry 11 (2015) 218-234. [86] K.C. Chou, An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in Medicinal Chemistry 17 (2017) 2337-2358.