“iSS-Hyb-mRMR”: Identification of splicing sites using hybrid space of pseudo trinucleotide and pseudo tetranucleotide composition

Accepted Manuscript

Title: “iSS-Hyb-mRMR”: Identification of Splicing sites using Hybrid space of Pseudo Trinucleotide and Pseudo Tetranucleotide Composition
Author: Muhammad Iqbal, Maqsood Hayat
PII: S0169-2607(15)30135-8
DOI: http://dx.doi.org/doi:10.1016/j.cmpb.2016.02.006
Reference: COMM 4081
To appear in: Computer Methods and Programs in Biomedicine
Received date: 24-8-2015
Accepted date: 16-2-2016

Please cite this article as: M. Iqbal, M. Hayat, “iSS-Hyb-mRMR”: Identification of Splicing sites using Hybrid space of Pseudo Trinucleotide and Pseudo Tetranucleotide Composition, Computer Methods and Programs in Biomedicine (2016), http://dx.doi.org/10.1016/j.cmpb.2016.02.006 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Muhammad Iqbal, Maqsood Hayat*
Department of Computer Science, Abdul Wali Khan University Mardan
Email: [email protected]

Corresponding Author: Dr. Maqsood Hayat
[email protected]

Abstract

Background and Objectives: Gene splicing is a vital source of protein diversity. Precise removal of introns and joining of exons is a key task in eukaryotic gene expression, as exons are usually interrupted by introns. Identification of splicing sites through experimental techniques is a complicated and time-consuming task. With the avalanche of genome sequences generated in the post-genomic age, it remains a complicated and challenging task to develop an automatic, robust and reliable computational method for fast and effective identification of splicing sites.

Methods: In this study, a hybrid model, “iSS-Hyb-mRMR”, is proposed for quick and accurate identification of splicing sites. Two sample representation methods, namely pseudo trinucleotide composition (PseTNC) and pseudo tetranucleotide composition (PseTetraNC), were used to extract numerical descriptors from DNA sequences. The hybrid model was developed by concatenating PseTNC and PseTetraNC. In order to select highly discriminative features, the minimum redundancy maximum relevance (mRMR) algorithm was applied to the hybrid feature space. The performance of these feature representation methods was tested using various classification algorithms, including K-nearest neighbor, Probabilistic Neural Network, General Regression Neural Network, and Fitting Network. The jackknife test was used to evaluate performance on two benchmark datasets, S1 and S2.

Results: The predictor proposed in the current study achieved an accuracy of 93.26%, sensitivity of 88.77%, and specificity of 97.78% for S1, and an accuracy of 94.12%, sensitivity of 87.14%, and specificity of 98.64% for S2.

Conclusion: It is observed that the performance of the proposed model is higher than that of the existing methods in the literature so far; it should be useful for studying the mechanism of RNA splicing and for other academic research.

Keywords: Splicing Sites; PseTNC; PseTetraNC; KNN; mRMR.

1. Introduction

Gene splicing plays a prominent role in protein diversity and thus enables a single gene to increase its coding capability. The precursor messenger RNA (pre-mRNA) transcribed from one gene can lead to different mature mRNA molecules during a typical gene splicing event, which in turn generates multiple functional proteins. In eukaryotic genes, splicing takes place prior to mRNA translation by the differential inclusion or exclusion of regions of pre-mRNA called exons and introns. Exons, which code for proteins, are interrupted by non-coding regions called introns in eukaryotic genomes. The boundary between an intron and an exon is called a splice site (Figure 1). Both ends of an intron have splice sites: the one at the 5´ end is called the 5´ splice site or donor site, and the one at the 3´ end is called the 3´ splice site or acceptor site. The vast majority of donor and acceptor sites form a pattern that is recognized by the presence of GT and AG, respectively. The spliceosome, which comprises 300 proteins and five small nuclear RNAs (snRNAs U1, U2, U4, U5, and U6), is responsible for identifying donor and acceptor sites in the genome sequence [1]. Once the splice sites are identified, the spliceosome binds to both the 5´ and 3´ ends of the intron and causes the intron to form a loop. With the help of two sequential transesterification reactions, the intron is removed from the sequence, as shown in Figure 1, while the two remaining exons are linked together [2, 3]. Eliminating non-coding regions (introns) from pre-mRNA and fusing the consecutive coding regions (exons) to form a mature messenger RNA (mRNA) is a prominent and notable step in gene expression. Therefore, to better understand the splicing mechanism, it is essential to identify the splicing sites in the genome accurately.

Biochemical experimental approaches provide only limited detail for identifying splicing sites, and relying on these techniques alone is not appropriate because they are time-consuming, expensive, and not always applicable. Hence, it is a great challenge and a highly desirable task to develop a precise, consistent, robust and automated computational method for timely identification of splicing sites. A series of methods have been proposed to identify splicing sites and considerable results have been achieved, but there is still substantial room for further improvement in terms of prediction performance.

As shown in a comprehensive review [4] and a series of recent publications [5-11], developing a really effective statistical predictor for a biological system requires the following steps: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that correctly reflects their intrinsic correlation with the target to be predicted; (iii) use a powerful algorithm to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor.

In view of the importance of splicing sites for genome analysis, the present study was initiated to develop a computational method for predicting splice sites. In the present work, a hybrid model, “iSS-Hyb-mRMR”, is proposed, which uses pseudo trinucleotide composition and pseudo tetranucleotide composition to extract numerical descriptors. To remove irrelevant and redundant features from the feature space, minimum redundancy maximum relevance (mRMR) was applied. Classification algorithms including K-nearest neighbor (KNN), probabilistic neural network (PNN), generalized regression neural network (GRNN) and fitting network (FitNet) were utilized in order to select the best one among them. The jackknife test was applied to assess the performance of the classification algorithms using two datasets, S1 and S2, for donor sites and acceptor sites, respectively.

The rest of the paper is organized as follows: Section 2 describes the materials and methods, Section 3 describes the evaluation criteria for performance measurement, Section 4 presents the results and discussion, and finally the conclusion is drawn in Section 5.

2. Materials and Methods

2.1. Dataset

In order to develop a statistical predictor, it is first necessary to establish a reliable and stringent benchmark dataset for training and testing the predictor. If the benchmark dataset is erroneous or redundant, the outcomes of the predictor will be unreliable and inconsistent. CD-HIT is usually applied to remove redundancy and reduce similarity in a dataset. In addition, as pointed out in a comprehensive review [12], there is no need to split a benchmark dataset into a training and a testing dataset when examining the performance of a prediction method. This is because, when the predictor is evaluated by leave-one-out cross-validation or sub-sampling tests, the predicted outcome is actually a combination of many different independent dataset tests. Accordingly, human splice site-containing sequences were obtained from HS3D (http://www.sci.unisannio.it/docenti/rampone/), which holds the sequences of exons, introns, and splice regions [13]. All the sequences in this database obey the GT-AG rule, i.e. the introns begin with the dinucleotide GT (GU in the case of RNA) and end with the dinucleotide AG. We therefore obtained two datasets, one for the splice donor site-containing sequences and the other for the splice acceptor site-containing sequences, which can be formulated as

splice donor dataset:

$$S_1 = S_1^{+} \cup S_1^{-} \tag{1}$$

splice acceptor dataset:

$$S_2 = S_2^{+} \cup S_2^{-} \tag{2}$$

where the positive dataset $S_1^{+}$ contains 2,796 true splice donor site-containing sequences, while the negative dataset $S_1^{-}$ is composed of 2,800 false splice donor site-containing sequences; $S_2^{+}$ contains 2,880 true splice acceptor site-containing sequences, while $S_2^{-}$ contains 2,800 false splice acceptor site-containing sequences; the symbol $\cup$ denotes the union in set theory.

2.2. DNA Sample Formulation

Suppose a DNA sequence D with L nucleotides, i.e.

$$D = R_1 R_2 R_3 R_4 R_5 R_6 R_7 \cdots R_L \tag{3}$$

where

$$R_i \in \{A\ (\text{adenine}),\ C\ (\text{cytosine}),\ G\ (\text{guanine}),\ T\ (\text{thymine})\}, \quad i = 1, 2, \ldots, L \tag{4}$$

where $R_1$ represents the nucleotide at position 1, $R_2$ the nucleotide at position 2, and $R_L$ the last nucleotide at position L. Although the sequential formulation of Eq. (3) is the most informative representation of a DNA sample, it is difficult to treat statistically because of the huge number of possible sequences. DNA sequences consist of four unique nucleotides (A, C, G and T). Thus, even if we consider sequences of only 100 nucleotides, the number of different order combinations would be $4^{100} = 10^{100 \log_{10} 4} > 1.6065 \times 10^{60}$. Actual DNA sequences are much longer than 100 nucleotides, so the number of different combinations will be far greater than $1.6065 \times 10^{60}$ [14]. Therefore, first of all, for such an astronomical number it is not feasible to construct a reasonable training dataset that statistically covers all the possible sequence-order information. Secondly, DNA sequences vary widely in length, which makes it even harder to incorporate the sequence-order information in both dataset construction and algorithm formulation. Thirdly, all the existing efficient operation engines, such as support vector machine (SVM) [15-17], random forest (RF) [18-20], covariance discriminant (CD) [20], neural network [21], conditional random field [10], nearest neighbor (NN) [22], SLLE algorithm [23], K-nearest neighbor (KNN) [23], OET-KNN [24], fuzzy K-nearest neighbor [25-27], and ML-KNN algorithm [5], can only handle vectors rather than sequential samples. BLAST was proposed for sequential samples, but this approach remains insufficient when there is no significant similarity [28] among the samples.
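As a quick check of the combinatorial estimate quoted earlier in this subsection (a worked calculation added for illustration, using base-10 logarithms and rounded values):

$$4^{100} = 10^{100 \log_{10} 4} \approx 10^{60.206} = 10^{0.206} \times 10^{60} \approx 1.607 \times 10^{60} > 1.6065 \times 10^{60}$$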

To avoid the complete loss of sequence-order or pattern information for proteins, the pseudo amino acid composition (PseAAC) was proposed [29]. Owing to the wide and successful use of PseAAC in computational proteomics to deal with protein/peptide sequences, the concept of “pseudo k-tuple nucleotide composition” (PseKNC) has recently been introduced to deal with DNA/RNA sequences in computational genetics and genomics [30, 31]. More recently, two web servers, repDNA and Pse-in-One, were proposed, which generate various modes of pseudo k-tuple nucleotide composition [32, 33]. Consequently, the concept of the discrete model was proposed to incorporate the sequence-order information of a DNA sample effectively [34]. The simplest discrete model represents a DNA sample by its nucleic acid composition (NAC), as given below:

$$D = [f(A)\ \ f(C)\ \ f(G)\ \ f(T)]^{T} \tag{5}$$

where $f(A)$, $f(C)$, $f(G)$, and $f(T)$ are the normalized occurrence frequencies of adenine (A), cytosine (C), guanine (G), and thymine (T), respectively, and the symbol T denotes the transpose operator. Eq. (5) carries no information about the sequence order of the nucleotides. One way to address this problem is to represent the DNA sequence with the k-tuple nucleotide composition vector, i.e. with $4^k$ components, as given below:

$$D = [f_1^{k\text{-tuples}},\ f_2^{k\text{-tuples}},\ \ldots,\ f_i^{k\text{-tuples}},\ \ldots,\ f_{4^k}^{k\text{-tuples}}]^{T} \tag{6}$$

where $f_i^{k\text{-tuples}}$ is the normalized occurrence frequency of the i-th k-tuple nucleotide in the DNA segment. As can be seen from Eq. (6), the dimension of the vector is

$$4^1 = 4,\quad 4^2 = 16,\quad 4^3 = 64,\quad 4^4 = 256,\quad 4^5 = 1024,\quad \ldots \tag{7}$$
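As an illustration of Eq. (6), consider a worked example added here under one assumption that the text does not state explicitly: k-tuples are counted with an overlapping sliding window of step 1. Take the hypothetical sequence D = ACGTAC with k = 2. It contains L − k + 1 = 5 dinucleotides (AC, CG, GT, TA, AC), so

$$f(AC) = \tfrac{2}{5}, \qquad f(CG) = f(GT) = f(TA) = \tfrac{1}{5}, \qquad \text{and the remaining } 12 \text{ of the } 4^2 = 16 \text{ components are } 0.$$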

Eq. (7) indicates that as the value of k increases, the coverage of sequence-order information gradually increases, but the dimension of the vector D also increases rapidly. An excessive increase in the dimension of D causes the high-dimension disaster [35], which brings disadvantages such as overfitting, unnecessary computational time, serious prediction bias, low generalization capacity, and noise and redundancy, all of which lead to poor prediction accuracy. To avoid such a high-dimension disaster, we have chosen pseudo trinucleotide and pseudo tetranucleotide composition.

2.3. Feature Extraction Strategies

Feature extraction is considered the basic and foundational step in machine learning. In this phase, discrete numerical attributes are extracted from biological sequences, because statistical models need numerical descriptors for training. Feature extraction starts from an initial set of measured data and builds derived values, called features, that are intended to be informative and non-redundant and to facilitate the subsequent learning and generalization steps. In this work, two powerful DNA sequence representation approaches are used to extract highly discriminative features.

2.3.1. Pseudo Trinucleotide Composition

The main limitation of simple nucleotide composition is that it does not preserve sequence-order information. In order to combine the occurrence frequency with sequence-order information, pseudo trinucleotide composition (PseTNC) was proposed [9]. In PseTNC, the relative frequency of each nucleotide triplet is computed. As a result, $4 \times 4 \times 4 = 64$ features (a 64-D vector) are extracted. It can be expressed as:

$$D = [f_1^{3\text{-tuples}},\ f_2^{3\text{-tuples}},\ f_3^{3\text{-tuples}},\ f_4^{3\text{-tuples}},\ \ldots,\ f_{64}^{3\text{-tuples}}]^{T} = [f(AAA),\ f(AAC),\ f(AAG),\ f(AAT),\ \ldots,\ f(TTT)]^{T} \tag{8}$$

where $f_1^{3\text{-tuples}} = f(AAA)$ is the normalized occurrence frequency of AAA in the DNA sequence, $f_2^{3\text{-tuples}} = f(AAC)$ that of AAC, and so forth.

2.3.2. Pseudo Tetranucleotide Composition

In trinucleotide composition, only three nucleotides are grouped. This still captures little of the sequence-order information; therefore, to include more information, we move one step further and use pseudo tetranucleotide composition (PseTetraNC) [8, 14]. In PseTetraNC, the occurrence frequency of each nucleotide quadruplet is calculated. It can be formulated as:

$$D = [f_1^{4\text{-tuples}},\ f_2^{4\text{-tuples}},\ f_3^{4\text{-tuples}},\ f_4^{4\text{-tuples}},\ \ldots,\ f_{256}^{4\text{-tuples}}]^{T} = [f(AAAA),\ f(AAAC),\ f(AAAG),\ f(AAAT),\ \ldots,\ f(TTTT)]^{T} \tag{9}$$

where $f_1^{4\text{-tuples}} = f(AAAA)$ is the normalized occurrence frequency of AAAA in the DNA sequence, $f_2^{4\text{-tuples}} = f(AAAC)$ that of AAAC, and so forth; the corresponding feature space therefore contains $4 \times 4 \times 4 \times 4 = 256$ components (a 256-D vector).
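The compositions in Eqs. (8) and (9) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `ktuple_composition` and `hybrid_features` are hypothetical, k-tuples are assumed to be counted with an overlapping window of step 1, and windows containing characters other than A, C, G, T are assumed to be skipped. The second helper concatenates the 64-D and 256-D vectors, which is how the abstract describes the hybrid model.

```python
from itertools import product

ALPHABET = "ACGT"


def ktuple_composition(seq, k):
    """Normalized occurrence frequencies of all 4**k k-tuples in seq.

    Assumptions (not stated in the paper): overlapping windows of step 1;
    windows with characters outside A, C, G, T are ignored.
    """
    # Lexicographic order: AAA, AAC, AAG, AAT, ..., TTT (for k = 3).
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    # A sequence shorter than k yields an all-zero vector.
    return [counts[m] / total if total else 0.0 for m in kmers]


def hybrid_features(seq):
    """Concatenate the 64-D (k = 3) and 256-D (k = 4) compositions."""
    return ktuple_composition(seq, 3) + ktuple_composition(seq, 4)


example = "AAGGTAAGTCCAGGTGAGT"  # hypothetical sequence, not taken from HS3D
print(len(hybrid_features(example)))
```

Because each composition is normalized separately, the two halves of the concatenated vector each sum to 1 for any sequence of length at least 4.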

The above procedure shows that as the number of nucleotides in each tuple increases, the number of tuples increases accordingly. Hence, local or short-range sequence-order information is gradually incorporated into the representation.

2.3.3. Hybrid space

Sometimes a single feature extraction strategy does not achieve reasonable results because it lacks discrimination power. In such circumstances, several feature extraction strategies are fused so that the weakness of one strategy is compensated by another and the discrimination power is enhanced [9]. Accordingly, we have fused two feature extraction strategies, PseTNC and PseTetraNC, to form a hybrid space. This hybrid model has a feature vector of dimension 320-D (64+256). The main reason for using the hybrid feature extraction strategy is to exploit the benefits of both PseTNC and PseTetraNC for the prediction of splicing sites. However, the dimensionality of the resultant feature space should not be so high that it degrades the prediction performance of the classifier.

2.4. Features Reduction

In machine learning and statistics, feature reduction is the process of selecting a subset of useful and relevant features to be used as input in model construction. The central assumption when using a feature reduction technique is that the data contain many redundant or irrelevant features [36]. Feature selection techniques are often used in domains where there are many features and comparatively few samples or data points. Feature selection provides three main advantages: it improves model interpretability, shortens training times, and enhances generalization by reducing overfitting.

2.4.1. Minimum-redundancy-maximum-relevance (mRMR)

Sometimes the acquired attributes are highly correlated, and not all of the attributes contribute to the comprehensive determination or definition of the target phenotypes. In addition, an extraordinarily large feature space significantly slows down the learning process and reduces the efficiency of a classifier. Hence, we need to find feature subsets whose members are mutually dissimilar and weakly correlated. Such attributes are obtained by using minimum-redundancy-maximum-relevance (mRMR) [37], which was first adopted by Peng et al. [38]. The mRMR method attempts to determine whether a feature has minimum redundancy with other features and maximum relevance with the target class. The set of selected features should be maximally different from each other. Let S denote the subset of features that we are looking for. The minimum redundancy condition is

$$\min P_1, \qquad P_1 = \frac{1}{|S|^2} \sum_{X_i, X_j \in S} M(X_i, X_j) \tag{10}$$

where $M(X_i, X_j)$ represents the similarity between features $X_i$ and $X_j$, and $|S|$ is the number of features in S. In general, minimizing redundancy alone is not enough to achieve high performance, so the minimum redundancy criterion should be supplemented by maximizing the relevance between the target variable and the other variables. To measure the discrimination power of features when they are differentially expressed for different target classes, a similarity measure $M(Y, X_i)$ is again used, between the target classes $Y = \{0, 1\}$ and the feature expression $X_i$. This measure quantifies the relevance of $X_i$ for the classification task. Thus the maximum relevance condition is to maximize the total relevance of all features in S:

$$\max P_2, \qquad P_2 = \frac{1}{|S|^2} \sum_{X_i \in S} M(Y, X_i) \tag{11}$$

Combining both criteria, maximal relevance with the target variable and minimum redundancy between features, is called the minimum redundancy-maximum relevance (mRMR) approach. The mRMR feature set is obtained by simultaneously optimizing the problems P1 and P2 in Eq. (10) and Eq. (11), respectively. Optimizing both conditions requires combining them into a single criterion function:

$$\min\ \{P_1 - P_2\} \tag{12}$$
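One common way to approximate the combined criterion is a greedy, incremental search. The sketch below is an illustration under stated assumptions, not the procedure used in the paper: the similarity measure M is taken to be mutual information, the features are assumed to be already discretized (real-valued k-tuple frequencies would need binning first), and the names `mutual_information` and `mrmr_select` are hypothetical. At each step it adds the feature whose relevance to the class minus its mean redundancy with the already selected features is largest.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log2((c / n) / ((pa[x] / n) * (pb[y] / n)))
               for (x, y), c in pab.items())


def mrmr_select(columns, target, n_select):
    """Greedy mRMR selection.

    columns: list of discrete feature columns (one list per feature).
    target: list of class labels, same length as each column.
    Returns the indices of the selected features in selection order.
    """
    relevance = [mutual_information(col, target) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < n_select:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(mutual_information(columns[j], columns[s])
                             for s in selected) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): f1 duplicates f0, f2 is independent of f0.
f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = list(f0)
f2 = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
print(mrmr_select([f0, f1, f2], labels, 2))
```

In the toy data the duplicate column is as relevant as the original, but it is skipped in the second step because it is fully redundant with the feature already chosen.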

The mRMR approach has advantages over other feature selection techniques. It yields a feature set that is more representative of the target variable, which increases the generalization capacity of the chosen feature set. Hence, mRMR gives a smaller feature set that effectively covers the same space as a larger conventional feature set.

2.5. Classification Algorithms

Classification is a form of supervised learning in which the data are categorized into predetermined classes. In this study, several supervised classification algorithms are utilized in order to select the best one for identification of splicing sites.

2.5.1. K-Nearest Neighbor (KNN)

KNN is a widely used algorithm in pattern recognition, machine learning and many related areas. KNN is a non-parametric method used for classification and regression [39]. It is also known as an instance-based (lazy) learning algorithm: it does not build a classifier or model in advance but stores all the training samples and waits until a new observation needs to be classified. This lazy nature can make KNN preferable to eager learning, which constructs a classifier before any new observation needs to be classified, and it is effective for dynamic data that change and are updated rapidly [40]. The KNN algorithm has the following five steps:

Step 1: For training the model, the extracted DNA features are provided to the KNN algorithm.

Step 2: The Euclidean distance formula is used for measuring distance:

$$E_{dis}(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{i1} - x_{i2})^2} \tag{13}$$

Step 3: The Euclidean distance values are sorted as $d_i \le d_{i+1}$, where i = 1, 2, 3, …, k.

Step 4: Voting or averaging is applied, depending on the nature of the data.

Step 5: The number of nearest neighbors (the value of K) depends on the nature and volume of the data provided to KNN. A large value of K is taken for large datasets and a small value for small datasets.

2.5.2. Probabilistic Neural Network (PNN)

The probabilistic neural network (PNN) is based on Bayes theory and was first introduced by Specht in 1990 [41]. It is often used for classification [42]. On the basis of the probability density function, PNN provides an interactive way to interpret the structure of the network. PNN has a structure similar to feed-forward networks, but it works in four layers. The first layer is the input layer, which holds the input vector; its neurons pass the input to the second layer, called the pattern layer. The dimension of the pattern layer equals the number of samples presented to the network, with exactly one pattern neuron for each training sample. The third layer is the summation layer, whose dimension equals the number of classes in the set of data samples. Lastly, the fourth layer, the output (decision) layer, assigns each sample to one of the predefined classes.
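The four PNN layers described above can be illustrated with a short sketch. It is not the implementation used in the paper: a Gaussian kernel is assumed for the pattern units, class priors are assumed equal (each class score is an average), and the function name `pnn_predict` and the default `sigma` are hypothetical.

```python
from math import exp


def pnn_predict(train_X, train_y, x, sigma=0.3):
    """Classify x with a probabilistic neural network (Parzen-window form).

    Pattern layer: one Gaussian unit per training sample.
    Summation layer: mean activation of the units of each class.
    Decision layer: the class with the largest mean activation.
    sigma (kernel spread) is a hypothetical default, not a value
    reported in the paper.
    """
    sums, counts = {}, {}
    for sample, label in zip(train_X, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(sample, x))
        activation = exp(-sq_dist / (2.0 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + activation
        counts[label] = counts.get(label, 0) + 1
    return max(sums, key=lambda c: sums[c] / counts[c])


# Toy two-class data (hypothetical feature vectors).
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
y = [0, 0, 1, 1]
print(pnn_predict(X, y, [0.05, 0.1]))
```

The spread controls how local the decision is: a very small `sigma` behaves like a nearest-neighbor rule, while a large one averages over the whole class.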

2.5.3. General Regression Neural Network (GRNN)

The General Regression Neural Network (GRNN), which belongs to the category of probabilistic neural networks, was proposed by Donald F. Specht. It is a non-parametric kernel regression estimator of the kind used in statistics [43]. Structurally and functionally, GRNN is very similar to PNN. It has four layers: the input layer, the radial basis layer, the special linear layer and the output layer. The numbers of neurons in the input and output layers of GRNN are equal to the dimensions of the input and output vectors, respectively. GRNN is most suitable for small and medium-sized datasets. Its overall process is carried out in three steps. In the first step, a set of training data and target data is created. In the second step, the input data, target data and spread constant are passed to a new GRNN as arguments. In the last step, the response of the network is obtained by simulating it on the data provided. GRNN has advantages over other neural networks: it can accurately approximate functions from sparse data, automatically extract the appropriate regression model (linear or nonlinear) from the data, and train rapidly with a very simple topology [44].

2.5.4. Fitting Network (FitNet)

Fitting Network (FitNet) is an artificial neural network (ANN) that consists of N layers. It is a subtype of the feed-forward back-propagation neural network (FFBPNN). Its first layer is connected to the input vector, each subsequent layer is connected to the preceding layer, and the final layer of the network produces the output. The training of FFBPNN is carried out using the following equations:

$$U_k(t) = \sum_{j=1}^{n} w_{jk}(t)\, x_j(t) + b_{0k}(t) \tag{14}$$

$$Y_k(t) = \varphi(U_k(t)) \tag{15}$$

where $x_j(t)$ is the j-th input value to the neuron at time t, $w_{jk}(t)$ is the weight assigned to that input value by neuron k, and $b_{0k}(t)$ is the bias of neuron k at time t, whereas $Y_k(t)$ is the output of neuron k and $\varphi$ is the activation function [38]. FitNet is used to fit an input-output relationship [45]. The Levenberg-Marquardt algorithm is the default algorithm used for training. The algorithm randomly divides the feature vectors into three sets: (i) the training data, (ii) the validation data and (iii) the test data [46]. A fitting network with one hidden layer and enough neurons can fit any finite input-output relationship [47].

2.6. Proposed Model

In view of the importance of splicing sites, the “iSS-Hyb-mRMR” model is proposed for identification of acceptor and donor splicing sites, as shown in Figure 2. In this model, two feature extraction methods, PseTNC and PseTetraNC, were used to extract features from the datasets S1 and S2. The extracted features were then fused to form a hybrid space in order to enhance their discrimination power. Further, the feature selection technique mRMR was applied to the hybrid space to select the features that have minimum redundancy and maximum relevance for classification. Four different classifiers, i.e. KNN, PNN, GRNN and FitNet, were used for classification, and the best results among the individual classifiers were noted. Consequently, “iSS-Hyb-mRMR” achieved higher performance than the existing methods in the literature so far.

3. Criteria for Performance Evaluation

Performance evaluation is fundamental to assessing the quality of learning methods in classification. Many different measures have been used in the literature, with the aim of making better choices in general or for a specific application area, and the choices made under one metric may differ from those made under another. In order to assess our model, we have used the following performance measures:

3.1. Accuracy

In science, engineering, and statistics, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's actual (true) value.

$$Accuracy = \frac{TP + TN}{TP + FP + FN + TN} \times 100 \tag{16}$$

where TP is True Positive, FN is False Negative, TN is True Negative and FP is False Positive.

3.2. Sensitivity

Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification functions. Sensitivity (true positive rate) measures the proportion of actual positives that are correctly identified.

$$Sensitivity = \frac{TP}{TP + FN} \times 100 \tag{17}$$

3.3. Specificity

Specificity (true negative rate) measures the proportion of negatives that are correctly identified:

$$Specificity = \frac{TN}{FP + TN} \times 100 \tag{18}$$

3.4. Matthews Correlation Coefficient (MCC)

The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives. It can be used even if the classes are of very different sizes, and for this reason it is generally regarded as a balanced measure. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between prediction and observation.

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{[TP + FP][TP + FN][TN + FP][TN + FN]}} \tag{19}$$

3.5. F-Measure

The F-measure is the weighted average (harmonic mean) of precision and recall. It is used for the evaluation of statistical methods and can be calculated as

$$F\text{-}measure = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{20}$$

The F-measure depends on two quantities, precision and recall:

$$Precision = \frac{TP}{TP + FP} \tag{21}$$

$$Recall = \frac{TP}{TP + FN} \tag{22}$$

The best value of the F-measure is 1 and the worst value is 0.
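The measures in Eqs. (16)-(22) can be computed directly from the four confusion-matrix counts. The following sketch uses the standard definitions; the function name `classification_metrics` and the example counts are hypothetical, and both classes are assumed to be present (so that TP + FN and FP + TN are non-zero).

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity (in %), MCC and F-measure.

    Assumes both classes are present, i.e. tp + fn > 0 and fp + tn > 0.
    """
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (fp + tn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "f_measure": f_measure,
    }


# Hypothetical counts, not results from the paper.
print(classification_metrics(tp=90, tn=95, fp=5, fn=10))
```

The guards on the MCC and precision denominators only cover the degenerate case in which no sample is predicted positive.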

Although these four metrics (Eqs. 16-19) are often used in the literature to measure the prediction quality of a model, they are not easy for most biologists to understand because they lack intuitiveness. To avoid this difficulty, in the current study we have adopted the following simple formulation proposed in recent publications [48-50]:

$$Sn = 1 - \frac{N_{-}^{+}}{N^{+}} \tag{23}$$

$$Sp = 1 - \frac{N_{+}^{-}}{N^{-}} \tag{24}$$

$$Acc = 1 - \frac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}} \tag{25}$$

$$Mcc = \frac{1 - \left( \dfrac{N_{-}^{+}}{N^{+}} + \dfrac{N_{+}^{-}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}} \right) \left( 1 + \dfrac{N_{-}^{+} - N_{+}^{-}}{N^{-}} \right)}} \tag{26}$$
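The formulation in Eqs. (23)-(26) is algebraically equivalent to the conventional one if the number of true sites predicted as false is read as FN, the number of false sites predicted as true as FP, the total number of true sites as TP + FN, and the total number of false sites as TN + FP. The sketch below, with the hypothetical name `chou_metrics` and made-up counts, illustrates this; unlike Eqs. (16)-(18), the values are fractions rather than percentages.

```python
from math import sqrt


def chou_metrics(n_pos, n_neg, n_pos_missed, n_neg_wrong):
    """Sn, Sp, Acc and Mcc in the form of Eqs. (23)-(26).

    n_pos / n_neg: numbers of true / false splicing-site samples.
    n_pos_missed: true sites incorrectly predicted as false.
    n_neg_wrong: false sites incorrectly predicted as true.
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_wrong / n_neg
    acc = 1 - (n_pos_missed + n_neg_wrong) / (n_pos + n_neg)
    mcc = (1 - (n_pos_missed / n_pos + n_neg_wrong / n_neg)) / sqrt(
        (1 + (n_neg_wrong - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_wrong) / n_neg))
    return sn, sp, acc, mcc


# Hypothetical counts: 100 true and 100 false sites, 10 and 5 errors.
sn, sp, acc, mcc = chou_metrics(100, 100, 10, 5)

# The same counts in conventional terms: TP=90, FN=10, TN=95, FP=5.
conventional_mcc = (90 * 95 - 5 * 10) / sqrt(95 * 100 * 100 * 105)
print(abs(mcc - conventional_mcc) < 1e-12)
```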

The metrics given in Eqs. (23)-(26) are valid only for single-label systems. In the above equations, $N^{+}$ represents the total number of true splicing site samples investigated and $N^{-}$ the total number of false splicing site samples investigated. Likewise, $N_{-}^{+}$ represents the number of true splicing site samples that are incorrectly predicted as false splicing sites, and $N_{+}^{-}$ represents the number of false splicing site samples that are incorrectly predicted as true splicing sites. For multi-label systems, which have become more frequent in systems biology [51, 52] and systems medicine [53], a completely different set of metrics is needed, as defined in [4].

4. Results and discussions

In statistical prediction, the following three cross-validation methods are often used to examine the effectiveness of a predictor in practical application: the independent dataset test, the subsampling test, and the jackknife test [54, 55]. However, as elucidated and demonstrated in [4, 56], among the three cross-validation methods the jackknife test is deemed the least arbitrary (most objective), since it always yields a unique result for a given benchmark dataset; hence it has been increasingly used and widely recognized by investigators for examining the accuracy of various predictors [29, 57-65]. Accordingly, the jackknife test was also adopted here to examine the quality of the present predictor [9]. Performance comparisons of the individual and hybrid feature spaces on the two datasets are discussed below.

4.1. Prediction Performance of classifiers on benchmark dataset acceptor sites

4.1.1. Prediction Performance of classifiers for acceptor dataset using PseTNC

The success rates of the classification algorithms using the PseTNC feature space are listed in Table 1. This feature space contains 64 features. Among these classifiers, KNN achieved the highest prediction performance, with an accuracy of 92.59%, sensitivity of 87.86%, specificity of 97.64%, MCC of 0.86 and F-measure of 0.92.

4.1.2. Prediction Performance of classifiers for acceptor dataset using PseTetraNC

The experimental results of the classifiers using the PseTetraNC feature space are listed in Table 2. This feature space contains 256 features. Because this vector carries more information, it yields higher accuracy than the PseTNC feature space. Among these classification algorithms, KNN achieved the highest prediction performance, with an accuracy of 93.40%, sensitivity of 87.53%, specificity of 96.71%, MCC of 0.85 and F-measure of 0.92.
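The KNN success rates quoted in this section were obtained with the jackknife test. A minimal sketch of that protocol with a plain KNN classifier is given below. The Euclidean distance, the value of k and the toy feature vectors are assumptions made only for illustration, and the helper names `knn_predict` and `jackknife_accuracy` are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query, k=1):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = [train_y[i] for i in order[:k]]
    return max(set(votes), key=votes.count)


def jackknife_accuracy(x, y, k=1):
    """Hold out each sample once, train on all the others, and score the held-out one."""
    correct = 0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        correct += knn_predict(rest_x, rest_y, x[i], k) == y[i]
    return correct / len(x)


# Toy two-dimensional data standing in for real feature vectors.
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
print(jackknife_accuracy(features, labels, k=3))
```

Since every sample is held out exactly once, the result does not depend on a random split, which is why the jackknife test yields a unique value for a given benchmark dataset.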

4.1.3. Prediction Performance of classifiers for acceptor dataset using hybrid space of PseTNC and PseTetraNC

The success rates of the classification algorithms using the hybrid feature space are listed in Table 3. Owing to the hybrid space of PseTNC and PseTetraNC with 320 features, the performance of the classification algorithms is enhanced compared to the individual feature spaces. Among these classifiers, KNN achieved the highest results, with an accuracy of 93.51%, sensitivity of 86.89%, specificity of 97.67%, MCC of 0.85 and F-measure of 0.92.
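The feature counts quoted above follow from the alphabet size: there are 4^3 = 64 trinucleotides and 4^4 = 256 tetranucleotides, and concatenating the two vectors gives 320 features. The sketch below illustrates this with plain normalized k-mer frequencies, which are used here only as a simplified stand-in; the exact pseudo components used for PseTNC and PseTetraNC may differ. The function names and the example sequence are hypothetical.

```python
from itertools import product


def kmer_composition(seq, k):
    """Normalized frequencies of all 4**k k-mers over A, C, G, T (lexicographic order)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing other symbols such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def hybrid_features(seq):
    """Concatenate the 64 trinucleotide and 256 tetranucleotide frequencies."""
    return kmer_composition(seq, 3) + kmer_composition(seq, 4)


# Hypothetical sequence fragment, used only to show the vector sizes.
vector = hybrid_features("ACGTACGTAGGTAAGTCCAG")
print(len(kmer_composition("ACGT", 3)), len(kmer_composition("ACGT", 4)), len(vector))
```

Concatenation keeps both views of the sequence side by side, at the cost of a larger and partly redundant feature vector.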

Ac ce pt e

4.1.4. Prediction Performance of classifiers for acceptor dataset using reduced hybrid space of PseTNC and PseTetraNC

The success rates of the classification algorithms after applying the mRMR feature reduction technique to the hybrid space are listed in Table 4. The reduced feature space contains 290 features, because the highest results were achieved on that feature space. Owing to the removal of irrelevant and redundant descriptors, the overall results remained excellent. Among these classifiers, KNN achieved the best prediction performance, with an accuracy of 94.12%, sensitivity of 87.14%, specificity of 98.64%, MCC of 0.86 and F-measure of 0.93.
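The idea behind mRMR, keeping features that are relevant to the class label while penalizing those redundant with features already chosen, can be sketched with a greedy selection loop. The version below uses the difference form of the criterion on discrete-valued features. It is a simplified illustration under those assumptions, not the implementation used in this study; real-valued composition features would first need to be discretized, and the names `mutual_information` and `mrmr_select` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())


def mrmr_select(columns, labels, n_select):
    """Greedy mRMR: at each step maximize relevance minus mean redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < n_select:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: feature 1 duplicates feature 0, feature 2 is weaker but not redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 1],
]
print(mrmr_select(columns, labels, n_select=2))
```

In this toy case the duplicate column is skipped in favour of the weaker but non-redundant one, which is the behaviour that lets the reduced space drop descriptors without losing discriminative information.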


4.2. Prediction Performance of classifiers with benchmark dataset donor sites

4.2.1. Prediction Performance of classifiers for donor dataset using PseTNC

The experimental results of the classifiers using the PseTNC feature space are listed in Table 5. This feature space contains 64 features. Among these classification algorithms, KNN achieved the highest prediction performance, with an accuracy of 92.58%, sensitivity of 87.85%, specificity of 97.64%, MCC of 0.85 and F-measure of 0.92.

us

4.2.2. Prediction Performance of classifiers for donor dataset using PseTetraNC

The experimental results of the classifiers using the PseTetraNC feature space are listed in Table 6. This feature space contains 256 features. As with the acceptor sites, the results for the donor sites improved with the PseTetraNC feature space. Among these classification algorithms, KNN achieved the highest success rates, with an accuracy of 93.40%, sensitivity of 87.53%, specificity of 96.71%, MCC of 0.84 and F-measure of 0.92.

4.2.3. Prediction Performance of classifiers for donor dataset using hybrid space of PseTNC and PseTetraNC

The experimental results of the classification algorithms using the hybrid space of PseTNC and PseTetraNC for the donor dataset are given in Table 7. Owing to the hybrid space of PseTNC and PseTetraNC with 320 features, the overall performance of the classification algorithms improved. Among the classifiers, KNN achieved the highest results, with an accuracy of 93.51%, sensitivity of 86.90%, specificity of 97.67%, MCC of 0.85 and F-measure of 0.92.

4.2.4. Prediction Performance of classifiers for donor dataset using reduced hybrid space of PseTNC and PseTetraNC

The success rates of the classification algorithms using the reduced hybrid space of PseTNC and PseTetraNC for the donor dataset are given in Table 8. The feature space is reduced to 205 features, because the best results were obtained on it. Owing to the removal of irrelevant and redundant descriptors, the overall results remain excellent. KNN again obtained the highest results among the classifiers, with an accuracy of 93.26%, sensitivity of 88.77%, specificity of 97.78%, MCC of 0.86 and F-measure of 0.92.

4.3. Comparison of proposed composite model with existing methods

The proposed model is compared with existing methods from the literature in Table 9. The pioneering work was carried out with BLAST [66]. For acceptor sites, the BLAST model yielded an accuracy of 39.62%, sensitivity of 39.09%, specificity of 40.20% and MCC of -0.21; for donor sites, it yielded an accuracy of 40.23%, sensitivity of 42.75%, specificity of 37.63% and MCC of 0.20. Recently, the iSS-PseDNC predictor was developed and produced considerable results [13]. For acceptor sites, iSS-PseDNC achieved an accuracy of 87.73%, sensitivity of 88.78%, specificity of 86.64% and MCC of 0.75; for donor sites, it achieved an accuracy of 85.45%, sensitivity of 86.66%, specificity of 84.25% and MCC of 0.71. In contrast, our proposed composite model “iSS-Hyb-mRMR” achieved quite promising results compared to the existing methods: an accuracy of 94.12%, sensitivity of 87.14%, specificity of 98.64% and MCC of 0.86 for acceptor sites, and an accuracy of 93.26%, sensitivity of 88.77%, specificity of 97.78% and MCC of 0.87 for donor sites. As demonstrated in a series of recent publications [67-71], user-friendly and publicly accessible web-servers significantly enhance the impact of new prediction methods [72-75]. Therefore, in future work we shall make efforts to provide a web-server for the prediction method presented in this paper.

Conclusions

RNA splicing is a complicated biological process that involves interactions among DNA, RNA and proteins; formulating it as a statistical model and analyzing it accurately is the central part of our work. In this study, a high-throughput computational model called “iSS-Hyb-mRMR” has been developed for the identification of splicing sites. Two feature extraction methods, Pseudo Trinucleotide Composition and Pseudo Tetranucleotide Composition, were used to extract features from DNA sequences. The extracted features were then combined to form a hybrid space in order to enhance the discrimination power. Further, mRMR was applied to select highly discriminative features from the hybrid feature space. The performance of the classification algorithms was evaluated on the individual as well as the hybrid feature spaces. Among the classifiers examined, KNN performed best. In addition, its performance is higher than that of the existing methods reported in the literature so far. This achievement is ascribed to the highly discriminative features of the reduced hybrid space of PseTNC and PseTetraNC. It is anticipated that our proposed model might be helpful in drug-related applications as well as in academia.

References

[1] A.A. Hoskins, M.J. Moore, The spliceosome: a flexible, reversible macromolecular machine, Trends in biochemical sciences, 37 (2012) 179-188.
[2] http://www.dartmouth.edu/~cbbc/courses/bio4/bio4-lectures/EukGenes.html.
[3] http://highered.mcgrawhill.com/olcweb/cgi/pluginpop.cgi?it=swf::535::535::/sites/dl/free/0072437316/120077/bio30.swf::How%20Spliceosomes%20Process%20RNA.
[4] K.-C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition, Journal of theoretical biology, 273 (2011) 236-247.
[5] X. Xiao, P. Wang, W.-Z. Lin, J.-H. Jia, K.-C. Chou, iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types, Analytical biochemistry, 436 (2013) 168-177.
[6] W. Chen, P. Feng, H. Lin, K. Chou, iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic acids research, (2013) gks1450.
[7] B. Liu, D. Zhang, R. Xu, J. Xu, X. Wang, Q. Chen, Q. Dong, K.-C. Chou, Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection, Bioinformatics, 30 (2014) 472-479.
[8] S.-H. Guo, E.-Z. Deng, L.-Q. Xu, H. Ding, H. Lin, W. Chen, K.-C. Chou, iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition, Bioinformatics, (2014) btu083.
[9] W.-R. Qiu, X. Xiao, K.-C. Chou, iRSpot-TNCPseAAC: Identify recombination spots with trinucleotide composition and pseudo amino acid components, International journal of molecular sciences, 15 (2014) 1746-1766.
[10] Y. Xu, J. Ding, L.-Y. Wu, K.-C. Chou, iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLoS One, 8 (2013) e55844.
[11] Y. Xu, X.-J. Shao, L.-Y. Wu, N.-Y. Deng, K.-C. Chou, iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins, PeerJ, 1 (2013) e171.
[12] K.-C. Chou, H.-B. Shen, Recent progress in protein subcellular location prediction, Analytical biochemistry, 370 (2007) 1-16.
[13] W. Chen, P. Feng, H. Lin, K. Chou, iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition, BioMed research international, 2014 (2014).
[14] W. Chen, T. Lei, D. Jin, H. Lin, K. Chou, PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition, Analytical biochemistry, 456 (2014) 53-60.
[15] Y. Cai, G. Zhou, K. Chou, Support vector machines for predicting membrane protein types by using functional domain composition, Biophysical journal, 84 (2003) 3257-3263.
[16] P.-M. Feng, W. Chen, H. Lin, K.-C. Chou, iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition, Analytical Biochemistry, 442 (2013) 118-125.
[17] X. Xiao, P. Wang, K.-C. Chou, iNR-PhysChem: a sequence-based predictor for identifying nuclear receptors and their subfamilies via physical-chemical property matrix, PloS one, 7 (2012) e30869.
[18] K.K. Kandaswamy, K.-C. Chou, T. Martinetz, S. Möller, P. Suganthan, S. Sridharan, G. Pugalenthi, AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties, Journal of Theoretical Biology, 270 (2011) 56-62.
[19] W.-Z. Lin, J.-A. Fang, X. Xiao, K.-C. Chou, iDNA-Prot: identification of DNA binding proteins using random forest with grey model, PLoS One, 6 (2011) e24756.
[20] W. Chen, H. Lin, P.-M. Feng, C. Ding, Y.-C. Zuo, K.-C. Chou, iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties, PLoS One, 7 (2012) e47843.
[21] T.B. Thompson, K.-C. Chou, C. Zheng, Neural network prediction of the HIV-1 protease cleavage sites, Journal of theoretical biology, 177 (1995) 369-379.
[22] Y. Cai, K. Chou, Predicting subcellular localization of proteins in a hybridization space, Bioinformatics, 20 (2004) 1151-1156.
[23] M. Wang, J. Yang, Z.-J. Xu, K.-C. Chou, SLLE for predicting membrane protein types, Journal of theoretical biology, 232 (2005) 7-15.
[24] T. Denoeux, A k-nearest neighbor classification rule based on Dempster-Shafer theory, Systems, Man and Cybernetics, IEEE Transactions on, 25 (1995) 804-813.
[25] K.-C. Chou, H.-B. Shen, Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites, Journal of proteome research, 6 (2007) 1728-1734.

[26] M. Hayat, A. Khan, Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou's PseAAC, Protein and peptide letters, 19 (2012) 411-421.
[27] X. Xiao, J.-L. Min, P. Wang, K.-C. Chou, iCDI-PseFpt: identify the channel–drug interaction in cellular networking with PseAAC and molecular fingerprints, Journal of theoretical biology, 337 (2013) 71-79.
[28] K.-C. Chou, Some remarks on predicting multi-label attributes in molecular biosystems, Molecular Biosystems, 9 (2013) 1092-1100.
[29] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins: Structure, Function, and Bioinformatics, 43 (2001) 246-255.
[30] W. Chen, T.-Y. Lei, D.-C. Jin, H. Lin, K.-C. Chou, PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition, Analytical biochemistry, 456 (2014) 53-60.
[31] W. Chen, H. Lin, K.-C. Chou, Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences, Molecular BioSystems, (2015).
[32] B. Liu, F. Liu, X. Wang, J. Chen, L. Fang, K.-C. Chou, Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences, Nucleic acids research, (2015) gkv458.
[33] B. Liu, F. Liu, L. Fang, X. Wang, K.-C. Chou, repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects, Bioinformatics, 31 (2015) 1307-1309.
[34] S.-X. Lin, J. Lapointe, Theoretical and experimental biology in one - A symposium in honour of Professor Kuo-Chen Chou’s 50th anniversary and Professor Richard Giegé’s 40th anniversary of their scientific careers, Journal of Biomedical Science and Engineering, 6 (2013) 435.
[35] T. Wang, J. Yang, H.-B. Shen, K.-C. Chou, Predicting membrane protein types by the LLDA algorithm, Protein and peptide letters, 15 (2008) 915-921.
[36] M. Hayat, A. Khan, Mem-PHybrid: hybrid features-based prediction system for classifying membrane protein types, Analytical biochemistry, 424 (2012) 35-44.
[37] W. Bouaguel, G.B. Mufti, An improvement direction for filter selection techniques using information theory measures and quadratic optimization, arXiv preprint arXiv:1208.3689, (2012).
[38] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27 (2005) 1226-1238.
[39] N.S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician, 46 (1992) 175-185.
[40] J. Han, M. Kamber, J. Pei, Data mining, southeast asia edition: Concepts and techniques, Morgan Kaufmann, 2006.
[41] D.F. Specht, Probabilistic neural networks, Neural networks, 3 (1990) 109-118.
[42] http://www.mathworks.in/help/toolbox/nnet/ug/bss38ji-1.html.
[43] M. Cherian, S.P. Sathiyan, Neural Network based ACC for Optimized safety and comfort, International Journal of Computer Applications, 42 (2012).
[44] http://www.mathworks.com/help/nnet/ug/generalized-regression-neural-networks.html.
[45] O.N.A. AL-Allaf, Cascade-Forward vs. Function Fitting Neural Network for Improving Image Quality and Learning Time in Image Compression System, Proceedings of the World Congress on Engineering, 2012, pp. 4-6.
[46] http://www.mathworks.com/help/nnet/ref/patternnet.html.
[47] O.N.A. AL-Allaf, S.A. AbdAlKader, A.A. Tamimi, Pattern Recognition Neural Network for Improving the Performance of Iris Recognition System, Int’l Journal of Scientific & Engineering Research, 4 (2013) 661-667.

[48] H. Lin, E.-Z. Deng, H. Ding, W. Chen, K.-C. Chou, iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition, Nucleic acids research, 42 (2014) 12961-12972.
[49] J. Jia, Z. Liu, X. Xiao, B. Liu, K.-C. Chou, iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC, Journal of theoretical biology, 377 (2015) 47-56.
[50] M. Kabir, M. Iqbal, S. Ahmad, M. Hayat, iTIS-PseKNC: Identification of Translation Initiation Site in human genes using pseudo k-tuple nucleotides composition, Computers in biology and medicine, 66 (2015) 252-257.
[51] K.-C. Chou, Z.-C. Wu, X. Xiao, iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins, PloS one, 6 (2011) e18258.
[52] K.-C. Chou, Z.-C. Wu, X. Xiao, iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites, Molecular BioSystems, 8 (2012) 629-641.
[53] K.-C. Chou, Impacts of bioinformatics to medicinal chemistry, Medicinal Chemistry, 11 (2015) 218-234.
[54] K.-C. Chou, C.-T. Zhang, Prediction of protein structural classes, Critical reviews in biochemistry and molecular biology, 30 (1995) 275-349.
[55] M. Hayat, M. Tahir, S.A. Khan, Prediction of protein structure classes using hybrid space of multi-profile Bayes and bi-gram probability feature spaces, Journal of theoretical biology, 346 (2014) 8-15.
[56] M. Hayat, A. Khan, MemHyb: predicting membrane protein types by hybridizing SAAC and PSSM, Journal of theoretical biology, 292 (2012) 93-102.
[57] Z. Hajisharifi, M. Piryaiee, M.M. Beigi, M. Behbahani, H. Mohabatkar, Predicting anticancer peptides with Chou's pseudo amino acid composition and investigating their mutagenicity via Ames test, Journal of theoretical biology, 341 (2014) 34-40.
[58] K.C. Chou, Y.D. Cai, Predicting protein quaternary structure by pseudo amino acid composition, Proteins: Structure, Function, and Bioinformatics, 53 (2003) 282-289.
[59] L. Nanni, S. Brahnam, A. Lumini, Prediction of protein structure classes by incorporating different protein descriptors into general Chou’s pseudo amino acid composition, Journal of theoretical biology, 360 (2014) 109-116.
[60] A. Dehzangi, R. Heffernan, A. Sharma, J. Lyons, K. Paliwal, A. Sattar, Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou's general PseAAC, Journal of theoretical biology, 364 (2015) 284-294.
[61] K.-C. Chou, D.W. Elrod, Bioinformatical analysis of G-protein-coupled receptors, Journal of proteome research, 1 (2002) 429-433.
[62] Z.U. Khan, M. Hayat, M.A. Khan, Discrimination of acidic and alkaline enzyme using Chou’s pseudo amino acid composition in conjunction with probabilistic neural network model, Journal of theoretical biology, 365 (2015) 197-203.
[63] R. Kumar, A. Srivastava, B. Kumari, M. Kumar, Prediction of β-lactamase and its class by Chou’s pseudo-amino acid composition and support vector machine, Journal of theoretical biology, 365 (2015) 96-103.
[64] B. Liu, J. Chen, X. Wang, Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis, Molecular Genetics and Genomics, (2015) 1-13.
[65] M. Kabir, M. Hayat, iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou's PseAAC to formulate DNA samples, Molecular Genetics and Genomics (2015).

[66] A.A. Schäffer, L. Aravind, T.L. Madden, S. Shavirin, J.L. Spouge, Y.I. Wolf, E.V. Koonin, S.F. Altschul, Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements, Nucleic acids research, 29 (2001) 2994-3005.
[67] B. Liu, J. Chen, X. Wang, Application of learning to rank to protein remote homology detection, Bioinformatics, (2015) btv413.
[68] B. Liu, L. Fang, F. Liu, X. Wang, J. Chen, K.-C. Chou, Identification of real microRNA precursors with a pseudo structure status composition approach, PloS one, 10 (2015) e0121501.
[69] H. Ding, E.-Z. Deng, L.-F. Yuan, L. Liu, H. Lin, W. Chen, K.-C. Chou, iCTX-Type: A sequence-based predictor for identifying the types of conotoxins in targeting ion channels, BioMed research international, 2014 (2014).
[70] B. Liu, F. Liu, L. Fang, X. Wang, K.-C. Chou, repRNA: a web server for generating various feature vectors of RNA sequences, Molecular Genetics and Genomics, (2015) 1-9.
[71] S. Ahmad, M. Kabir, M. Hayat, Identification of Heat Shock Protein families and J-protein types by incorporating Dipeptide Composition into Chou's general PseAAC, Computer methods and programs in biomedicine, (2015).
[72] B. Liu, J. Xu, X. Lan, R. Xu, J. Zhou, X. Wang, K.-C. Chou, iDNA-Prot|dis: Identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition, (2014).
[73] B. Liu, L. Fang, S. Wang, X. Wang, H. Li, K.-C. Chou, Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy, Journal of theoretical biology, 385 (2015) 153-159.
[74] X. Xiao, J.-L. Min, P. Wang, K.-C. Chou, iGPCR-Drug: A web server for predicting interaction between GPCRs and drugs in cellular networking, PloS one, 8 (2013) e72234.
[75] W. Chen, P. Feng, H. Ding, H. Lin, K. Chou, iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition, Analytical biochemistry, 490 (2015) 26-33.


Tables

Table 1: Success rates of classification algorithms using PseTNC (acceptor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 92.59 | 87.86 | 97.64 | 0.86 | 0.92 |
| PNN | 90.67 | 79.68 | 99.39 | 0.80 | 0.88 |
| GRNN | 83.04 | 67.28 | 99.46 | 0.70 | 0.80 |
| FitNet | 73.03 | 73.71 | 72.38 | 0.46 | 0.73 |

Table 2: Success rates of classification algorithms using PseTetraNC (acceptor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 93.40 | 87.53 | 96.71 | 0.85 | 0.92 |
| PNN | 88.09 | 77.21 | 99.39 | 0.78 | 0.87 |
| GRNN | 86.20 | 67.67 | 99.32 | 0.70 | 0.80 |
| FitNet | 75.14 | 74.50 | 75.70 | 0.50 | 0.75 |

Table 3: Success rates of classification algorithms using hybrid space of PseTNC and PseTetraNC (acceptor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 93.51 | 86.89 | 97.67 | 0.85 | 0.92 |
| PNN | 88.41 | 77.78 | 99.60 | 0.79 | 0.87 |
| GRNN | 82.11 | 65.60 | 99.35 | 0.69 | 0.79 |
| FitNet | 74.87 | 74.60 | 75.20 | 0.49 | 0.75 |

Table 4: Success rates of classification algorithms using reduced hybrid space (acceptor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 94.12 | 87.14 | 98.64 | 0.86 | 0.92 |
| PNN | 87.99 | 76.82 | 99.67 | 0.78 | 0.87 |
| GRNN | 84.34 | 69.85 | 99.49 | 0.72 | 0.82 |
| FitNet | 74.38 | 74.67 | 74.27 | 0.48 | 0.75 |

Table 5: Success rates of classification algorithms using PseTNC (donor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 92.58 | 87.85 | 97.64 | 0.85 | 0.92 |
| PNN | 90.67 | 79.67 | 99.39 | 0.80 | 0.89 |
| GRNN | 83.04 | 67.28 | 99.46 | 0.70 | 0.80 |
| FitNet | 73.02 | 73.71 | 72.38 | 0.46 | 0.73 |

Table 6: Success rates of classification algorithms using PseTetraNC (donor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 93.40 | 87.53 | 96.71 | 0.84 | 0.92 |
| PNN | 88.09 | 77.21 | 99.39 | 0.78 | 0.87 |
| GRNN | 86.20 | 67.67 | 99.32 | 0.70 | 0.80 |
| FitNet | 75.14 | 74.50 | 75.70 | 0.50 | 0.75 |

Table 7: Success rates of classification algorithms using hybrid space of PseTNC and PseTetraNC (donor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 93.51 | 86.90 | 97.67 | 0.85 | 0.92 |
| PNN | 88.41 | 77.78 | 99.60 | 0.79 | 0.87 |
| GRNN | 82.11 | 65.60 | 99.35 | 0.69 | 0.79 |
| FitNet | 74.88 | 74.60 | 75.20 | 0.49 | 0.75 |

Table 8: Success rates of classification algorithms using reduced hybrid space (donor dataset)

| Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC | F-Measure |
|---|---|---|---|---|---|
| KNN | 94.12 | 87.14 | 98.64 | 0.86 | 0.92 |
| PNN | 87.99 | 76.82 | 99.67 | 0.78 | 0.87 |
| GRNN | 84.34 | 69.85 | 99.50 | 0.72 | 0.82 |
| FitNet | 74.38 | 74.67 | 74.27 | 0.48 | 0.75 |

Table 9: Performance comparison of “iSS-Hyb-mRMR” with existing methods

| Splice Sites | Accuracy (%) | Sensitivity (%) | Specificity (%) | MCC |
|---|---|---|---|---|
| Acceptor [Proposed Model] | 94.12 | 87.14 | 98.64 | 0.86 |
| Donor [Proposed Model] | 93.26 | 88.76 | 97.78 | 0.87 |
| Acceptor [13] | 87.73 | 88.78 | 86.64 | 0.75 |
| Donor [13] | 85.45 | 86.66 | 84.25 | 0.71 |
| Acceptor [66] | 39.62 | 39.09 | 40.20 | -0.21 |
| Donor [66] | 40.23 | 42.75 | 37.63 | 0.20 |

Highlights

• “iSS-Hyb-mRMR” model is proposed for identification of splicing sites.
• Trinucleotide and Tetranucleotide Composition are used as feature extraction schemes.
• Hybrid space is formed by using TNC and TetraNC spaces.
• Various classification algorithms are analyzed.
• mRMR is utilized to reduce feature space.

Figures

Figure 1: A schematic drawing showing the pathways of RNA splicing. (a) The 2′-OH of the branch-point nucleotide within the intron (solid line) carries out a nucleophilic attack on the first nucleotide of the intron at the 5′ splice site (GU), forming the lariat intermediate. (b) The 3′-OH of the released 5′ exon then performs a nucleophilic attack on the last nucleotide of the intron at the 3′ splice site (AG). (c) The exons are joined and the intron lariat is released.

Figure 2: Proposed model for identification of splicing sites
