Unb-DPC: Identify mycobacterial membrane protein types by incorporating un-biased dipeptide composition into Chou's general PseAAC

Journal of Theoretical Biology 415 (2017) 13–19



Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi


Muslim Khan (a), Maqsood Hayat (a,⁎), Sher Afzal Khan (a,b), Nadeem Iqbal (a)

(a) Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
(b) Faculty of Computing and Information Technology in Rabigh, King Abdul Aziz University, Saudi Arabia

ARTICLE INFO

Keywords: Mycobacterium; Oversampled features; Un-biasness; Support vector machine

ABSTRACT

This study investigates an efficient and accurate computational method for predicting mycobacterial membrane proteins. Mycobacterium is a pathogenic bacterium that is the causative agent of tuberculosis and leprosy. The existing feature encoding algorithms for protein sequence representation, such as composition and translation and split amino acid composition, cannot suitably express mycobacterium membrane proteins and their types because of the bias among the different types. Therefore, in this study a novel un-biased dipeptide composition (Unb-DPC) method is proposed. The proposed encoding scheme has two advantages. First, it avoids the bias among the different mycobacterium membrane proteins and their types. Second, the method is fast and preserves protein sequence structure information. The experimental results yield an SVM-based classification accuracy of 97.1% for membrane protein types and 95.0% for discriminating mycobacterium membrane and non-membrane proteins using the jackknife cross-validation test. The results show that the proposed model achieves significantly better predictive performance than the existing algorithms and may lead to a powerful tool for developing anti-mycobacterium drugs.

⁎ Corresponding author. E-mail address: [email protected] (M. Hayat).

http://dx.doi.org/10.1016/j.jtbi.2016.12.004
Received 2 August 2016; Received in revised form 24 October 2016; Accepted 7 December 2016; Available online 08 December 2016
0022-5193/© 2016 Published by Elsevier Ltd.

1. Introduction

Membrane proteins are one of the major classes of proteins. It has been reported experimentally that about 20–30% of the genes in a genome encode membrane proteins, and the human body is estimated to contain approximately 8000 membrane proteins. These proteins play significant roles in all cellular processes. In addition, membrane proteins represent 60% of drug targets, which are critical for novel drug discovery as well as for understanding cellular functions (Yang et al., 2012, 2015; Ji et al., 2014, 2013). Because proteins are the structural and functional units of the body, each membrane protein has a specific function that depends on its type. Characterizing membrane proteins with existing experimental methods is time consuming, laborious and costly. Thus there is a need for an efficient computational model to characterize membrane proteins and their classes.

Mycobacterium can cause critical diseases: every year millions of people die due to tuberculosis (TB) and leprosy. Although researchers have made numerous efforts to treat these diseases through bacteria and drugs, the rapid growth in the number of protein sequences makes the characterization of mycobacterium membrane proteins difficult. Several experimental approaches were carried out for such prediction, because the complicated envelope, which consists of a cell wall and a cytoplasmic membrane, plays a major role in multidrug resistance (Niederweis et al., 2010). In addition, membrane proteins perform many essential biological and physiological functions, such as acting as receptors for many hormones and as transporters that carry material into or out of cells.

Computational methods are used in preference to experimental methods for recognizing membrane proteins because some membrane protein types cannot be crystallized and do not dissolve in the majority of solvents. Recent breakthroughs in nuclear magnetic resonance (NMR) indicate that NMR is a very powerful tool for determining the 3D structures of membrane proteins (Oxenoid et al., 2016; Dev et al., 2016; Schnell and Chou, 2008; Berardi et al., 2011; OuYang et al., 2013; Fu et al., 2016); however, it is costly and time-consuming. Similarly, cryo-electron microscopy (cryo-EM), the most recently developed method, is also used to solve membrane protein structures. That is why a computational method is needed to predict membrane protein types.

Earlier, many computational models were developed to discriminate membrane proteins on the basis of protein sequence information (Chou and Elrod, 1999; Cai et al., 2003; Wang et al., 2004; Shen and Chou, 2005; Chou and Shen, 2007; Lin et al., 2008; Chou, 2001; Walzer et al., 2009). Researchers have developed many strategies for protein feature encoding. Some of these strategies are amino acid composition (AAC) (Yang et al., 2007), pseudo amino acid composition (PseAAC) (Lin et al., 2008; Chou, 2011), split amino acid composition (SAAC) (Afridi et al., 2012), discrete wavelet analysis (DWT) (Rezaei et al., 2008), hybrid models, translation & composition (Huang et al., 2010) and tri-peptide composition (Ung and Winkler, 2011). To investigate the success rates of these techniques, many algorithms such as support vector machine (SVM) (Chou, 2011; Ding and Dubchak, 2001; Lin, 2008; Kumar et al., 2011; Chen et al., 2009; Du et al., 2014; Hayat and Iqbal, 2014), k-nearest neighbor (KNN) (Chou, 2011), probabilistic neural network (PNN) (Hayat and Khan, 2012; Khan et al., 2008), random forest (RF) (Breiman, 2001) and Mem-EnsSAAC were used as protein structure and function predictors (Hayat et al., 2012).

These early works suggest that machine learning algorithms play a vital role in discriminating membrane proteins and their classes, but they have been used very little for membrane proteins in mycobacterium. The PROB predictor was developed by Pajon et al. for the identification of beta-barrel proteins of M. tuberculosis (Pajon et al., 2006). In that work, two membrane protein functions were found to be undefined. Although its results were very good, no one had concentrated on the identification of mycobacterial membrane proteins. Therefore, another method was developed for this prediction, which identifies mycobacterial membrane proteins from over-represented tri-peptide compositions selected with the binomial distribution function (Chen et al., 2012).

In this work, an identification algorithm for mycobacterial membrane proteins and their classes is developed, as shown in Fig. 1. The proposed algorithm simultaneously uses the oversampling technique SMOTE to remove the bias among different types of membrane proteins and dipeptide composition to extract features from the unbiased data. Although some redundant samples could be removed in the model-training process, all experiment-confirmed samples must be included when testing the predictor, even those removed by the SMOTE approach. Only by doing so is the prediction method validated by all the experiment-confirmed data rather than part of them. Please see the papers (Liu et al., 2015b; Xiao et al., 2015; Jia et al., 2016a, 2016b) for a detailed analysis of this point. The SVM classification algorithm is applied to deal with multi-class classification. Using the jackknife test, the classifier obtained an overall accuracy of 97.1% for mycobacterium membrane protein classes and 95.0% for discriminating mycobacterium membrane and non-membrane proteins. Furthermore, this study extends previous research as follows.

i) To avoid the bias among membrane protein types, this paper uses the oversampling technique SMOTE.
ii) This study considers numerous feature extraction algorithms on the datasets. The empirical results show that un-biased dipeptide composition performs far better than the other feature extraction algorithms.
iii) For dimensionality reduction, this paper uses mRMR as the feature selection algorithm. Also, data are collected from two different datasets to evaluate the comparative performance against existing algorithms.

The results of this study will further improve the prediction of mycobacterial membrane proteins and their types (Xiao et al., 2016; Qiu and Sun, 2016; Qiu et al., 2016; Chen et al., 2016a) and will be helpful for the development of anti-mycobacterium drugs. The rest of the paper is structured as follows. Section 2 presents materials and methods, Section 3 describes the feature selection algorithm, Sections 4 and 5 present the classification algorithms and the performance evaluation criteria respectively, Section 6 gives the results and discussion, and Section 7 draws the conclusion.

2. Materials and methods

2.1. Dataset

To obtain precise results, an appropriate benchmark dataset is required for training and testing the computational model. For this purpose, a standard dataset of mycobacterium membrane proteins is used. It was constructed from the Universal Protein Resource (UniProt) database (Magrane, 2011). Several rules were followed to obtain an accurate and well-defined dataset. First, only sequences that were archived and manually annotated by researchers were selected. Second, sequences whose type is not defined were excluded. Third, sequences that are fragments of other proteins were removed (Han et al., 2006). The result is a precise and accurate dataset. It consists of two benchmark datasets, dataset-I and dataset-II. Dataset-I contains 274 sequences, of which 32 are single-pass, 192 are multi-pass, 20 are lipid-anchor and 30 are peripheral membrane proteins. Dataset-II contains 564 sequences, of which 274 are membrane proteins and 290 are non-membrane proteins.
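The class sizes quoted for dataset-I imply a strong imbalance, which is what motivates the oversampling step used later. A minimal Python sketch of that arithmetic follows; it uses only the counts given in the text, and the dictionary keys and function names are illustrative, not taken from the authors' software.

```python
# Class sizes of dataset-I as stated in the text (labels are illustrative).
DATASET_I = {
    "single-pass": 32,
    "multi-pass": 192,
    "lipid-anchor": 20,
    "peripheral": 30,
}


def class_shares(counts):
    """Return the fraction of all samples that falls into each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def imbalance_ratio(counts):
    """Return the size of the largest class divided by the smallest class."""
    return max(counts.values()) / min(counts.values())


shares = class_shares(DATASET_I)
ratio = imbalance_ratio(DATASET_I)
```

With these counts, multi-pass proteins make up about 70% of dataset-I and the largest class is 9.6 times the size of the smallest one.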

Fig. 1. Schematic diagram of the proposed algorithm.


2.2. Encoding algorithms for protein sequence

In this section, we briefly explain the feature encoding schemes used in this study: un-biased dipeptide composition, composition and translation, and split amino acid composition.

2.2.1. Un-biased dipeptide composition (Unb-DPC)

Un-biased dipeptide composition is a special mode of PseAAC (Chou, 2009). For a given protein sequence, it can be produced by the web-server PseAAC-General (Du et al., 2014) or Pse-in-One (Liu et al., 2015a). The latter can deal not only with protein/peptide sequences but also with DNA/RNA sequences, as reflected by a series of recent studies (Liu et al., 2015b; Chen et al., 2016a, 2015). The general dipeptide composition represents all possible pairs of adjacent amino acids in the protein sequence. The occurrence frequency of each pair of adjacent amino acids is calculated, which yields a 400-D feature vector for the 20×20 possible amino acid pairs. The general dipeptide composition plays an important role in protein structure prediction because it keeps sequence information. In a classification problem where the protein classes are imbalanced, however, the encoding algorithm fails to express the structure information of the smaller classes, because the learning algorithm tends towards the larger class. The problem can be addressed by undersampling or oversampling, which generates unbiased data. Here the imbalanced protein dataset is handled by oversampling with the Synthetic Minority Oversampling Technique (SMOTE). The un-biased dipeptide composition incorporates both dipeptide composition and SMOTE, and is represented as

$$ v(i) = \frac{d_i}{\sum_{j=1}^{400} d_j} \tag{1} $$

$$ \psi_{ub} = f_{SMOTE}(v) \tag{2} $$

$$ f_{SMOTE}(v) = v + u \cdot (v^{R} - v) \tag{3} $$

where $d_i$ is the occurrence count of the ith of the 400 possible dipeptides, $0 \le u \le 1$, and $v^{R}$ is randomly picked from the k nearest neighbors of $v$ in the minority class. The resultant unbiased dipeptide composition $\psi_{ub}$ is used for classification of mycobacterium membrane proteins. It has been observed that unbiased dipeptide composition efficiently represents both the structure information and the un-biasness of the data.

2.2.2. Composition and translation

Composition and translation is a composition-based approach for characterizing the structure of uncharacterized protein sequences. For this purpose composition and translation attributes are utilized (Huang et al., 2010, 2011). These attributes reflect the amino acid motifs of a specific structural or physicochemical behaviour of protein sequences. Seven distinct physicochemical properties of amino acids are considered to calculate these attributes: polarity, charge, hydrophobicity, solvent accessibility, polarization, Van der Waals volume, and secondary structure. The polypeptide chain of a protein is a combination of the twenty amino acids, and for each physicochemical property these amino acids are classified into three sub-groups (Guo et al., 2014; Li et al., 2006). For example, for hydrophobicity the sub-groups are polar, neutral and hydrophobic; for solvent accessibility they are exposed, buried and intermediate; and for charge they are positive, negative and neutral.

The attributes are calculated in two stages. In the first stage, the amino acid sequence is translated into a sequence of sub-group indices for each physicochemical property, on the basis of the corresponding values of the residues. For example, for the charge property a protein sequence is decomposed into positive, negative and neutral residues (Ahmad et al., 2016). In the second stage, the composition and translation attributes are calculated for each physicochemical property. The composition attributes give the proportion of each of the three sub-groups, so there are 3×7=21 composition values representing the seven different physicochemical properties. The translation strategy is used to count occurrence frequencies on the translated sequence, and the resulting translation descriptor also contains 21 attributes.

2.2.3. Split amino acid compositions (SAAC)

A protein sequence is a combination of amino acids that carries a lot of information at its N-terminus and C-terminus. This terminal information cannot be extracted directly from whole-sequence composition. To extract it, the split amino acid composition (SAAC) technique is used. In this encoding scheme a protein sequence is split into different parts, and the occurrence frequency of the amino acids in each part is counted to extract complementary information (Hayat et al., 2012; Ali et al., 2014). This approach has been used by researchers for the prediction of protein functions with successful results (Huang et al., 2010). In this paper a given protein sequence is divided into three portions: the twenty-five amino acids of the N-terminus, the twenty-five amino acids of the C-terminus, and the region between these two termini. As a result, a 60-D feature space is obtained instead of the 20-D space of AAC. The SAAC feature vector of each protein sequence is represented as (Chou and Cai, 2002)

$$ F = [p_1^{N}, \ldots, p_{20}^{N},\; p_1^{int}, \ldots, p_{20}^{int},\; p_1^{C}, \ldots, p_{20}^{C}] \tag{4} $$

In this formulation the symbol N stands for the N-terminus, C for the C-terminus, and int for the internal segment between them.

3. Features selection algorithm

To decrease the computational complexity of the extracted features and to obtain a salient feature sub-space, a feature selection algorithm must be applied. Feature selection is therefore one of the most useful steps for enhancing the efficiency of the proposed classification model. Principal component analysis, diffusion maps, singular value decomposition (SVD), minimum redundancy and maximum relevance (mRMR) and local linear discriminant analysis (LLDA) are among the feature selection techniques widely used in bioinformatics. In this study, we used mRMR (Bartenhagen et al., 2010; Kabir et al., 2015) as the feature selection algorithm.
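The features passed on to the selection step are the un-biased dipeptide vectors of Section 2.2.1. As a reminder of how such vectors can be built, the following is a minimal Python sketch of normalized dipeptide composition followed by SMOTE-style interpolation towards a random minority-class neighbor. It is an illustration under stated assumptions, not the authors' implementation: the function names, the default `k=5`, the Euclidean neighbor search and the handling of non-standard residues (pairs containing them are skipped) are all choices made here.

```python
import itertools
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs of amino acids, in a fixed order.
DIPEPTIDES = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(sequence):
    """Return the 400-D vector of normalized adjacent-pair frequencies."""
    counts = [0] * len(DIPEPTIDES)
    for first, second in zip(sequence, sequence[1:]):
        idx = INDEX.get(first + second)
        if idx is not None:  # assumption: skip pairs with non-standard residues
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]


def smote_sample(v, minority, k=5, rng=random):
    """Create one synthetic sample v + u * (v_r - v) with 0 <= u < 1.

    v_r is drawn at random from the k nearest minority-class neighbors of v
    (Euclidean distance). `minority` must contain at least one vector
    other than v itself.
    """
    others = [m for m in minority if m is not v]
    neighbours = sorted(others, key=lambda m: math.dist(v, m))[:k]
    v_r = rng.choice(neighbours)
    u = rng.random()
    return [a + u * (b - a) for a, b in zip(v, v_r)]
```

Because the synthetic sample is a convex combination of two composition vectors, its entries still sum to one, so it remains a valid dipeptide composition.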

3.1. Minimum redundancy and maximum relevance (mRMR)

mRMR is a feature selection method that selects the feature sub-space having minimum redundancy among its features and maximum relevance to the defined class (Peng et al., 2005). After the feature extraction stage, some features are highly correlated and contribute little to the definition of the target class. Such a redundant feature space also slows down learning and strongly affects the efficiency of a classifier. We therefore need to find an optimal feature subset whose features are mutually exclusive and least correlated. These features are obtained with the feature selection technique mRMR, which is extensively applied in the field of classification. The relevance is determined using mutual information (MI), given by the following expression

$$ MI(x, y) = \sum_{i,j \in N} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\,p(y_j)} \tag{5} $$

where x and y represent two features, $p(x_i, y_j)$ is the joint probability density function, and $p(x_i)$ and $p(y_j)$ are the individual probability functions. The mutual information between a feature and the target class is found with the following formula.

$$ MI(x, z) = \sum_{i,k \in N} p(x_i, z_k) \log \frac{p(x_i, z_k)}{p(x_i)\,p(z_k)} \tag{6} $$

In this formulation x is a feature and z is the target class. The minimum redundancy over the entire feature space can be calculated as

$$ \min(mR) = \frac{1}{|s|^{2}} \sum_{x, y \in s} MI(x, y) \tag{7} $$

where s represents the feature space and |s| represents the total number of features in s. For maximum relevance we have the condition

$$ \max(MR) = \frac{1}{|s|} \sum_{x \in s} MI(x, z) \tag{8} $$

Finally, we have

$$ \max(\nabla MI) = MR - mR \tag{9} $$

4. Classification algorithms

In the process of classification, the data are assigned to predefined classes. Classification algorithms play an important role in the fields of bioinformatics and data mining. The fundamental idea of all these machine learning algorithms is the same: training data are used to obtain information about test data, and the two sets are called the training and testing set respectively. In this study, three classification algorithms, SVM, KNN and RF, are used to achieve the best result for the identification of mycobacterium membrane proteins.

4.1. Support vector machine (SVM)

SVM is a supervised learning method developed on the basis of statistical learning theory (Khan et al., 2008). It is widely applied to the prediction of protein structures and functions. With SVM, the given data are mapped into a high-dimensional feature space in order to find the optimum separating hyper-plane, that is, the hyper-plane with the largest margin between the decision boundary and the support vectors in the training set. One-versus-one (OVO) and one-versus-rest (OVR) are well-known strategies applied to multi-class problems. In this study, SVM was trained with four kernel functions: linear, polynomial, sigmoid and radial basis function (RBF).

4.2. K-Nearest neighbor (KNN)

K-nearest neighbor is one of the simplest algorithms and is adopted extensively in bioinformatics and classification. KNN is also called an instance-based learner because it does not build a model in advance but memorizes all the training examples and waits until a novel example is to be categorized (Ahmad et al., 2015). It is suitable for real and quickly changing data (Han et al., 2007). It decides on the basis of Euclidean distance and assigns the novel example to the majority class among its nearest neighbors. The Euclidean distance is calculated as follows

$$ d(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{i1} - x_{i2})^{2}} \tag{10} $$

4.3. Random forest (RF)

Breiman (2001) introduced random forest as an ensemble classification algorithm (Breiman, 2001; Prinzie and Van den Poel, 2008). RF is used for classification, clustering and feature selection. In RF, examples are randomly selected from the training data using the bootstrap algorithm to form multiple trees, and each tree is constructed with a randomized subset of attributes. A large number of trees are constructed, like a forest. The attributes used to determine the optimum split at each node are randomly selected from the total number of attributes. For each input, every tree casts a single vote for the most frequent class. Finally, the individual hypotheses are combined by majority voting to produce the prediction.

5. Performance evaluation criteria

To investigate the performance of all classifiers, the following criteria are used to assess the accurate identification of mycobacterium proteins,

$$ Accuracy = \frac{TP + TN}{G_{Total}}, \qquad G_{Total} = TP + FP + FN + TN \tag{11} $$

where TP is the number of positive instances correctly predicted as positive, TN is the number of negative instances correctly predicted as negative, FP is the number of negative instances incorrectly predicted as positive, and FN is the number of positive instances incorrectly predicted as negative.

5.1. Sensitivity

Sensitivity is the proportion of positives that are identified correctly and is defined as

$$ Sensitivity = \frac{TP}{TP + FN} \tag{12} $$

5.2. Specificity

Specificity is the proportion of negatives that are identified correctly and is defined as

$$ Specificity = \frac{TN}{FP + TN} \tag{13} $$

5.3. Mathews correlation coefficient (MCC)

MCC takes values in the range [−1, 1]. A value of 1 means the classification algorithm never makes a wrong prediction and a value of −1 means the classifier always makes a wrong prediction.

$$ MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{14} $$

5.4. F-Measure

F-measure is used as a test of accuracy. It is computed from both precision p and recall r, where p is the number of correct positive predictions over the total number of positive predictions. F-measure takes its best value at 1 and its worst value at 0.

$$ F\text{-}measure = 2 \times \frac{p \times r}{p + r}, \qquad p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN} \tag{15} $$

The metrics in Eqs. (11)–(15) are hard for many biologists to understand because they lack intuitiveness. To solve this problem, an equivalent and more intuitive set of equations is also used in this study.
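Before turning to that formulation, the confusion-matrix metrics of Section 5 can be checked numerically. The sketch below is a minimal Python illustration that assumes the standard definitions (accuracy as (TP + TN) over all instances, and MCC with a square root in the denominator); the function name and the example counts are hypothetical and are not results from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, MCC and F-measure.

    Arguments are the four confusion-matrix counts. The sketch assumes
    every denominator is non-zero (each class is present and at least
    one instance is predicted positive).
    """
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": recall,
        "specificity": tn / (fp + tn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these illustrative counts the accuracy is 0.85, the sensitivity 0.80 and the specificity 0.90.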

Journal of Theoretical Biology 415 (2017) 13–19

M. Khan et al.

equations in our study, which are available in the recent publications (Jia et al., 2016a; Kabir et al., 2015; Lin et al., 2014).

Acc = 1 −

N −+ + N+− , N+ + N−

Mcc =

(17)

0 ≤ Sn ≤ 1

N

⎛ N+ + N−⎞ 1 − ⎜ −+ +− ⎟ ⎝N +N ⎠ ⎛ ⎜1 + ⎝

(16)

0 ≤ Sp ≤ 1

N −+ +,

Sn = 1 −

Methods

0 ≤ Acc ≤ 1

N− Sp = 1 − +− , N

N+− − N−+ ⎞⎛ N

+

⎟⎜1 + ⎠⎝

N−+ − N+− ⎞ ⎟ N− ⎠

Table 2 Prediction of classification algorithms having SAAC features using dataset-I.

(18)

,

1 ≤ Mcc ≤ 1 (19)

The mentioned set of metrics is valid only for the single-label systems. In these mentioned equations N + stands for the total number of true prediction and N − is the total number of false counted prediction. Similarly, N −+ denotes the total number of true prediction which is incorrectly counted as a false and N+− shows the total number of false prediction which is incorrectly predicted true. Multi-label systems are more frequently involved in system Biology and system medicine (Qiu and Sun, 2016; Xiao et al., 2013), therefore a complete diﬀerent set of metrics is needed as deﬁned in (Chou, 2013).

Speciﬁcity

MCC

F-Measure

Imbalance biased data SVM 80.3 KNN 78.5 RF 79.6

73.4 67.4 68.8

78.9 96.8 80.9

0.48 0.34 0.47

0.62 0.53 0.60

Un-biased data SVM 92.4 KNN 88.7 RF 91.3

81.9 77.5 85.4

56.1 50.3 59.3

0.33 0.24 0.35

0.50 0.48 0.48

mRMR on Un-biased data SVM 93.4 KNN 88.9 RF 93.7

82.7 78.5 79.8

55.9 50.9 58.5

0.33 0.26 0.34

0.52 0.48 0.53

space of dataset1, while after oversampling KNN based classiﬁer gives the high performance in comparison with rest of the classiﬁers having overall accuracy of 90.6%, After feature selection technique i.e. mRMR, KNN still gives the maximum result over-all accuracy of 90.6%. It shows that due to oversampling the composition and transition features becomes more overlap feature space and reduced the number of outlier features, In such circumstances the KNN based classiﬁer perform well. In Table 2, we have analysed the result of protein and their types by using split amino acid composition (SAAC) as feature encoding technique. The result shows signiﬁcant improvement as compared of imbalance and un-biased SAAC features with SVM classiﬁer. As we noted, after oversampling SVM improve their identiﬁcation performance by 12.16%. Furthermore, after feature selection technique on oversampling dataset1 SVM yield the maximum result of overall accuracy of 93.4% which improved result by 0.95% over the same dataset1 by using same technique. Furthermore, the proposed algorithm was used to predict the types of mycobacterial membrane protein. In this method, we used un-biased dipeptide feature encoding process for measuring the optimized feature space. In Table 3, we found that simple dipeptide feature before oversampling, SVM gives 83.2% which is better among the rest of classiﬁers. After using un-biased dipeptide, SVM shows the signiﬁcant performance with overall accuracy 97.1%. This improvement is more than 15% as compared to unbalanced dipeptide. Also, the results exhibit high identiﬁcation performance compared with other feature encoding techniques. Apart from that the identiﬁcation of mycobacterial membrane protein and their types compared with exiting algorithms on same dataset. The result shows that proposed algorithm achieves

In order to exhibit the eﬀectiveness of our proposed algorithm both jackknife and independent data tests were used. First, during jackkniﬁng procedure, each protein is excluded out for the purpose of testing and the rest of protein samples were used as training data to train classiﬁer. The advantage of such procedure is, to reduce the bias of the estimated data and all the protein samples are used at training stage, which increased the predication performance of classiﬁer. Also, jackknife test avoid random sampling. Second, during independent dataset test the classiﬁer was trained on one dataset while another dataset was used as test data. 6.1. Identiﬁcation of membrane protein and their types on dataset-I In the jackknife based classiﬁcation process, Table 1 shows the predicted results of mycobacterial membrane protein types by using composition and transition features extraction strategy. The results were tested on both imbalance and un-biased based composition and transition features. The results exhibit that SVM based classiﬁer gives 6% improvement in the result after oversampling on feature sample Table 1 Prediction of classification algorithms having composition and translation features using dataset-I. Accuracy

Sensitivity

Bold value indicates the highest result.

6. Result and discussion

Methods

Accuracy

Table 3 Prediction of classification algorithms having Unb-DPC features using dataset-I.

Sensitivity

Speciﬁcity

MCC

F-Measure

Methods

Imbalance biased data SVM 78.1 KNN 77.7 RF 77.0

67.9 71.8 64.7

72.5 76.9 75.1

0.37 0.45 0.37

0.55 0.60 0.54

Un-biased data SVM 84.6 KNN 90.5 RF 88.7

81.9 89.0 81.4

85.4 48.7 52.5

0.63 0.30 0.26

mRMR on Un-biased data SVM 83.5 KNN 90.6 RF 88.7

81.3 75.5 74.4

84.1 50.0 52.3

0.61 0.22 0.24

Sensitivity

Speciﬁcity

MCC

F-Measure

Imbalance biased data SVM 83.2 KNN 77.7 RF 73.4

74.1 74.4 66.3

79.3 77.7 84.5

0.50 0.48 0.49

0.63 0.72 0.60

0.73 0.44 0.42

Un-biased data SVM 97.1 KNN 86.6 RF 85.9

83.9 100 71.4

56.9 40.7 55.8

0.36 0.34 0.24

0.54 0.45 0.48

0.71 0.47 0.48

mRMR on un-biased data SVM 97.1 KNN 86.6 RF 87.0

83.6 73.6 74.5

56.4 44.9 56.5

0.35 0.17 0.27

0.54 0.43 0.49

The bold values indicate the highest result.

Accuracy

The Bold values indicate the highest results.

17

Journal of Theoretical Biology 415 (2017) 13–19

M. Khan et al.

Table 4 Prediction of classification algorithms having composition and translation features using dataset-II.

Table 7 Comparison performance with already existing methods. Our method

Methods

Accuracy

Sensitivity

Speciﬁcity

MCC

F-Measure

Before mRMR SVM 92.4 KNN 92.0 RF 91.7

91.4 89.8 90.1

93.3 94.1 93.1

84.8 84.1 83.3

92.2 91.6 91.3

After mRMR SVM 92.4 KNN 91.3 RF 91.3

91.4 90.7 90.5

93.3 92.4 92.0

84.8 82.6 82.6

92.2 91.1 91.1

Single Multi Peripheral OA (%) AA (%)

Sn (%)

Sp (%)

90.7 97.9 97.0 96.9 95.2

98.4 96.3 97.0

O-T-M (Chen, D., et al., 2012): Fan & Li (Fan, and Li., 2012.) Sn (%) Sp (%) Sn (%) Sp (%) 72.2 100 76.7 94.6 83.0

100 82.3 100

41.7 95.2 53.3 85.0 63.4

96.8 54.5 97.2

Sn (%) = Percent Sensitivity, OA (%) = Percent Overall Accuracy. Sp (%) = Percent specificity, AA (%) = Percent Average Accuracy. O-T-M = Over-represented tri-peptide method. The bold values indicate maximum result.

The bold values show the maximum result.

The bold values show the maximum result.

that when the same feature extraction strategy is used for other classiﬁer like RF and KNN, the SVM performs better before and after applying mRMR. Table 5 shows the predicted result by using SAAC feature extraction strategy. The results were tested on both before and after mRMR. The results exhibit that SVM classiﬁer gives 0.89% improvement after feature selection. The RF classiﬁer also gives the same result of accuracy 93.1% with 93.8%, 92.4%, 0.86 and 0.93 of sensitivity, speciﬁcity, MCC and F-measure, respectively by using SAAC features extraction technique on dataset-II. Furthermore, the predication between membrane and non-membrane protein using the proposed feature extraction method is shown in Table 6. The results show that SVM classiﬁer gives the high performance with overall accuracy of 95.0%.

Table 6 Prediction of classification algorithm having Unb-DPC features using dataset-II.

6.3. Comparison with existing benchmark datasets

Table 5 Prediction of classification algorithms having SAAC features using dataset-II. Methods

Sensitivity

Speciﬁcity

MCC

F-Measure

Before mRMR SVM 92.2 KNN 90.6 RF 92.9

92.6 87.4 93.5

91.8 93.6 92.3

0.84 0.81 0.86

0.92 0.90 0.93

After mRMR SVM 93.1 KNN 89.5 RF 93.1

92.7 86.6 93.8

93.4 92.4 92.4

0.86 0.79 0.86

0.93 0.89 0.93

Methods

Accuracy

Accuracy

Sensitivity

Speciﬁcity

MCC

F-Measure

Before mRMR SVM 95.0 KNN 91.7 RF 93.3

94.6 91.6 94.4

95.5 91.8 92.2

90.1 83.4 86.6

94.9 91.5 93.2

After mRMR SVM 93.8 KNN 92.2 RF 93.3

93.1 93.1 94.4

94.5 91.5 92.2

87.6 84.5 86.6

93.7 92.1 93.2

Furthermore, the proposed method has compared on developed model of Fan's and Li (Fan, 2012). The Fan model reported the overall average accuracy of 85%. Also, the same dataset was used by Chen, et al. Chen et al., 2012 using over-represented tri-peptide method and achieved the overall accuracy of 94.6%. In comparison, our method is executed on their dataset. 96.9% correctly predicted the membrane proteins and their types. As shown in Table 7, our method is 2.3% and 11.6% higher comparatively to above mentioned methods. This result exhibits that our proposed computational model is superior on benchmark dataset. Furthermore, as demonstrated in a series of recent publications (Xiao et al., 2016; Qiu et al., 2016; Jia et al., 2015; Chen et al., 2016b) in developing new prediction methods, user-friendly and publicly accessible web-servers will signiﬁcantly enhance their impacts (Chou, 2015). We shall make eﬀorts in our future work to provide a web-server for the prediction method presented in this paper.

The bold values show the highest percent result.

97.1%, compared to 93.1% for existing algorithms, which is a 4% improvement. To demonstrate the superiority of our computational model, we compared its prediction performance with that of other feature extraction methods, namely composition & translation and split amino acid composition (SAAC), whose results on dataset-I are recorded in Tables 1 and 2, respectively. To investigate the prediction performance of these models, the classification algorithms SVM, KNN and RF were used. From Tables 1 and 2, we observed 92.3% correct prediction using composition & translation and 93.4% correct prediction using SVM as an individual classifier over SAAC. Finally, the proposed un-biased dipeptide composition (Unb-DPC) method achieved an overall accuracy of 97.1%, as shown in Table 3. Comparing Tables 1–3 shows that the proposed method gives the highest accuracy for mycobacterial membrane proteins and their types.
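For orientation, the conventional dipeptide composition on which Unb-DPC builds can be sketched as below. This is only the plain 400-dimensional encoding: the un-biasing step of the proposed method is not described in this part of the text and is therefore not reproduced. The names `AMINO_ACIDS`, `DIPEPTIDES` and `dipeptide_composition` are illustrative assumptions.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional dipeptide frequency vector of a sequence.

    Overlapping pairs of adjacent residues are counted; pairs containing
    a non-standard symbol (e.g. 'X') are skipped. Frequencies are
    normalised by the number of counted pairs, so they sum to 1
    unless no valid pair exists.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    counted = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            counted += 1
    if counted == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / counted for d in DIPEPTIDES]


# Toy sequence, for illustration only.
vector = dipeptide_composition("MKVLAAGIVGL")
print(len(vector), round(sum(vector), 6))
```

Because the vector keeps the order of adjacent residues, it carries some local sequence-order information that a plain 20-dimensional amino acid composition discards.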

7. Conclusion

In this paper, a reliable and efficient technique has been developed for the prediction of mycobacterial membrane proteins and their types. The proposed computational method uses un-biased dipeptide composition for feature extraction from protein sequences. It avoids bias among different classes while preserving protein sequence structure information. The performance of the different classifiers is evaluated through the jackknife test using a benchmark dataset. The predicted results yield an overall accuracy of 97.1% for mycobacterial membrane proteins and their types. The proposed model also achieved an overall accuracy of 95.0% for discriminating mycobacterial membrane from non-membrane proteins. Therefore, it is anticipated that the proposed method significantly improves on the results of existing methods and will provide information for further studies on membrane proteins.
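The jackknife test mentioned above holds out each sample in turn, trains on all remaining samples, and predicts the held-out one. A minimal sketch follows. Note that it swaps in a simple 1-nearest-neighbour rule as a stand-in classifier, not the SVM used in the study, and that the function names and toy feature vectors are hypothetical.

```python
def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier: return the label of the closest training vector."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    best = min(range(len(train_x)), key=lambda j: sq_dist(train_x[j], query))
    return train_y[best]


def jackknife_accuracy(features, labels, classify):
    """Leave-one-out (jackknife) accuracy of `classify` on a dataset.

    `classify(train_x, train_y, query)` must return a predicted label.
    Each sample is predicted by a model that never saw that sample.
    """
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if classify(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy two-dimensional feature vectors, for illustration only.
toy_x = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
toy_y = ["membrane", "membrane", "non-membrane", "non-membrane"]
print(jackknife_accuracy(toy_x, toy_y, nearest_neighbour))
```

Any classifier with the same call signature can be passed in place of `nearest_neighbour`, so the evaluation loop is independent of the model being tested.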

6.2. Discrimination of membrane and non-membrane proteins

Dataset-II is used to discriminate mycobacterial membrane proteins from non-membrane proteins. The composition & translation encoding features achieved an accuracy of 92.4% on dataset-II using the SVM classifier. In Table 4, we noted


References

Afridi, T.H., Khan, A., Lee, Y.S., 2012. Mito-GSAAC: mitochondria prediction using genetic ensemble classifier and split amino acid composition. Amino Acids 42, 1443–1454.
Ahmad, K., Waris, M., Hayat, M., 2016. Prediction of protein Submitochondrial locations by incorporating Dipeptide composition into Chou's general pseudo amino acid Composition56. J. Membr. Biol. 3, 293–304.
Ahmad, S., Kabir, M., Hayat, M., 2015. Identification of Heat Shock Protein families and J-protein types by incorporating Dipeptide Composition into Chou's general PseAAC. Comput. Methods Prog. Biomed. 122, 165–174.
Ali, S., Majid, A., Khan, A., 2014. IDM-PhyChm-Ens: intelligent decision-making ensemble methodology for classification of human breast cancer using physicochemical properties of amino acids. Amino Acids 46, 977–993.
Bartenhagen, C., Klein, H.-U., Ruckert, C., Jiang, X., Dugas, M., 2010. Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data. BMC Bioinform., 567.
Berardi, M.J., Shih, W.M., Harrison, S.C., 2011. Mitochondrial uncoupling protein 2 structure determined by NMR molecular fragment searching. Nature 476, 109–113.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Cai, Y.D., Zhou, G.P., Chou, K.C., 2003. Support vector machines for predicting membrane protein types by using functional domain composition. Biophys. J. 84, 3257–3263.
Chen, C., Chen, L., Zou, X., Cai, P., 2009. Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Protein Pept. Lett., 27–31.
Chen, D., Yuan, L., Guo, S., Lin, H., Chen, W., 2012. Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions. J. Proteom., 321–328.
Chen, W., Feng, P., Ding, H., 2015. IRNA-Methyl: identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal. Biochem. 490, 26–33.
Chen, W., Tang, H., Ye, J., 2016a. iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy Nucleic Acids 6, e332.
Chen, W., Ding, H., Feng, P., 2016b. IACP: a sequence-based tool for identifying anticancer peptides. Oncotarget 7, 16895–16909.
Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins Struct. Funct. Bioinform. 43, 246–255.
Chou, K.C., 2009. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom. 6, 262–274.
Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247.
Chou, K.C., 2013. Some Remarks on Predicting multi-label attributes in Molecular Biosystems. Mol. Biosyst. 9, 1092–1100.
Chou, K.C., 2015. Impacts of bioinformatics to medicinal chemistry. Med. Chem. 11, 218–234.
Chou, K.C., Elrod, D.W., 1999. Prediction of membrane protein types and subcellular locations. Protein. Struct. Funct. Bioinform. 34, 137–153.
Chou, K.C., Cai, Y.D., 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277, 45765–45769.
Chou, K.C., Shen, H.B., 2007. MemType-2L: a Web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. Biochem. Biophys. Res. Commun. 360, 339–345.
Dev, J., Park, D., Fu, Q., 2016. Structural Basis for Membrane Anchoring of HIV-1 Envelope Spike. Science doi: 10.1126/science.aaf7066.
Ding, C.H., Dubchak, I., 2001. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics 17, 349–358.
Du, P., Gu, S., Jiao, Y., 2014. PseAAC-General: Fast building various modes of general form of Chou's pseudo amino acid composition for large-scale protein datasets. Int. J. Mol. Sci. 15, 3495–3506.
Fu, Q., Fu, T.M., Cruz, A.C., 2016. Structural basis and functional role of intramembrane trimerization of the Fas/CD95 death receptor. Mol. Cell 61, 602–613.
Guo, S., Deng, E., Xu, L., Ding, H., 2014. INuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics 30, 1522–1529.
Han, J., Cheng, H., Xin, D., Yan, X., 2007. Frequent pattern mining: current status and future directions. Data Min. Knowl. Discov. 15, 55–86.
Han, L., Cui, J., Lin, H., Ji, Z., Cao, Z., Li, Y., Chen, Y., 2006. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics 6, 4023–4037.
Hayat, M., Khan, A., 2012. MemHyb: predicting membrane protein types by hybridizing SAAC and PSSM. J. Theor. Biol. 292, 93–102.
Hayat, M., Iqbal, N., 2014. Discriminating protein structure classes by incorporating pseudo average chemical shift to Chou's general PseAAC and support vector machine. Comput. Methods Prog. Biomed. 116, 184–192.
Hayat, M., Khan, A., Yeasin, M., 2012. Prediction of membrane proteins using split amino acid composition and ensemble classification. J. Amino Acids 42, 2447–2460.
Huang, T., Shi, X., Wang, P., He, Z., Feng, K., Hu, L., Kong, X., Li, Y., Cai, Y., Chou, K., 2010. Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks. PLoS One 5, e10972.
Huang, T., Wan, S., Xu, Z., Zheng, Y., Feng, K., Li, H., Kong, X., Cai, Y., 2011. Analysis and prediction of translation rate based on sequence and functional features of the mRNA. PLoS One 6, e16036.
Ji, M., Ruthstein, S., Saxena, S., 2013. Paramagnetic metal ions in pulsed ESR distance distribution measurements. Acc. Chem. Res. 47, 688–695.
Ji, M., Tan, L., Jen-Jacobson, L., Saxena, S., 2014. Insights on Cu2+ inhibition of endonuclease catalysis by ESR spectroscopy. Mol. Phys. 112, 3173–3182.
Jia, J., Liu, Z., Xiao, X., Liu, B., Chou, K.C., 2015. IPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J. Theor. Biol. 377, 47–56.
Jia, J., Liu, Z., Xiao, X., 2016a. iPPBS-Opt: a Sequence-based ensemble classifier for Identifying protein-protein binding sites by Optimizing imbalanced training datasets. Molecules 21, 95.
Jia, J., Liu, Z., Xiao, X., 2016b. ISuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Anal. Biochem. 497, 48–56.
Kabir, M., Iqbal, M., Ahmad, S., Hayat, M., 2015. ITIS-PseKNC: identification of Translation Initiation Site in human genes using pseudo k-tuple nucleotides composition. Comput. Biol. Med. 66, 252–257.
Khan, A., Khan, M., Choi, T., 2008. Proximity based GPCRs prediction in transform domain. Biochem. Biophys. Res. Commun. 371, 411–415.
Kumar, M., Gromiha, M.M., Raghava, G.P., 2011. SVM based prediction of RNA-binding proteins using binding residues and evolutionary information. J. Mol. Recognit. 24, 303–313.
Li, Z., Lin, H., Han, L., Jiang, L., Chen, X., Chen, Y., 2006. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 34, W32–W37.
Lin, H., 2008. The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. J. Theor. Biol. 252, 350–356.
Lin, H., Ding, H., Guo, F., Zhang, A., Huang, J., 2008. Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition. Protein Pept. Lett. 15, 739–744.
Lin, H., Deng, E.Z., Ding, H., Chen, W., Chou, K.C., 2014. IPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 42, 12961–12972.
Liu, B., Liu, F., Wang, X., 2015. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 43, W65–W71.
Liu, Z., Xiao, X., Qiu, W.R., 2015. IDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. 474, 69–77.
Magrane, M., 2011. UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford) http://dx.doi.org/10.1093/database/bar1009.
Niederweis, M., Danilchanka, O., Huff, J., Hoffmann, C., Engelhardt, H., 2010. Mycobacterial outer membranes: in search of proteins. Trends Microbiol. 18, 109–116.
OuYang, B., Xie, S., Berardi, M.J., 2013. Unusual architecture of the p7 channel from hepatitis C virus. Nature 498, 521–525.
Oxenoid, K., Dong, Y.S., Cao, C., 2016. Architecture of the Mitochondrial Calcium Uniporter. Nature http://dx.doi.org/10.1038/nature17656.
Pajon, R., Yero, D., Lage, A., Llanes, A., 2006. B.C. J, Computational identification of beta-barrel outer-membrane proteins in Mycobacterium tuberculosis predicted proteomes as putative vaccine candidates. Tuberculosis 86, 290–302.
Peng, H., Long, F., Ding, C., 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238.
Prinzie, A., Van den Poel, D., 2008. Random forests for multiclass classification: random multinomial logit. Expert Syst. Appl. 34, 1721–1732.
Qiu, W.R., Sun, B.Q., 2016. iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics http://dx.doi.org/10.1093/bioinformatics/btw1380.
Qiu, W.R., Sun, B.Q., Xiao, X., 2016. iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC. Oncotarget 7, 44310–44321.
Rezaei, M., Abdolmaleki, P., Karami, Z., Asadabadi, E., Sherafat, M., Moghaddam, H., Fadaie, M., Forouzanfar, M., 2008. Prediction of membrane protein types by means of wavelet analysis and cascaded neural networks. J. Theor. Biol. 254, 817–820.
Schnell, J.R., Chou, J.J., 2008. Structure and mechanism of the M2 proton channel of influenza A virus. Nature 451, 591–595.
Shen, H., Chou, K.C., 2005. Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types. Biochem. Biophys. Res. Commun. 334, 288–292.
Ung, P., Winkler, D.A., 2011. Tripeptide motifs in biology: targets for peptidomimetic design. J. Med. Chem. 54, 1111–1125.
Walzer, G., Rosenberg, E., Ron, E.Z., 2009. Identification of outer membrane proteins with emulsifying activity by prediction of β-barrel regions. J. Microbiol. Methods 76, 52–57.
Wang, M., Yang, J., Liu, G., Xu, Z., Chou, K., 2004. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng. Des. Sel. 17, 509–516.
Xiao, X., Wang, P., Lin, W.Z., 2013. IAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem. 436, 168–177.
Xiao, X., Min, J.L., Lin, W.Z., Liu, Z., 2015. iDrug-Target: predicting the interactions between drug compounds and target proteins in cellular networking via the benchmark dataset optimization approach. J. Biomol. Struct. Dyn. 33, 2221–2233.
Xiao, X., Ye, H.X., Liu, Z., 2016. iROS-gPseKNC: predicting replication origin sites in DNA by incorporating dinucleotide position-specific propensity into general pseudo nucleotide composition. Oncotarget 7, 34180–34189.
Yang, X.G., Luo, R.Y., Feng, Z.P., 2007. Using amino acid and peptide composition to predict membrane protein types. Biochem. Biophys. Res. Commun. 353, 164–169.
Yang, Z., Kurpiewski, M., Ji, M., Townsend, J.E., Mehta, P., Jen-Jacobson, L., Saxena, S., 2012. ESR spectroscopy identifies inhibitory Cu(II) sites in a DNA modifying enzyme to reveal determinants of catalytic specificity. Proc. Natl. Acad. Sci. USA 109, E993–E1000.
Yang, Z., Ji, M., Cunningham, T.F., Saxena, S., 2015. Cu(II) as an ESR probe of protein structure and function. Method. Enzym. 563, 459–481.
