Prediction of G-protein coupled receptors and their subfamilies by incorporating various sequence features into Chou's general PseAAC

Computer Methods and Programs in Biomedicine 134 (2016) 197–213

Arvind Kumar Tiwari *

GGSCMT, Kharar, SAS Nagar, Punjab, India

* E-mail address: [email protected]

ARTICLE INFO

Article history: Received 19 February 2016; received in revised form 27 May 2016; accepted 1 July 2016.

Keywords: G-protein coupled receptors; weighted k-nearest neighbor; minimum redundancy maximum relevance; sequence derived properties; Matthew's correlation coefficient.

ABSTRACT

Background and objective: The G-protein coupled receptors form the largest superfamily of membrane proteins and are important targets for drug design. G-protein coupled receptors are responsible for many physiological processes such as smell, taste, vision, neurotransmission, metabolism, cellular growth and immune response, so it is necessary to design a robust and efficient approach for the prediction of G-protein coupled receptors and their subfamilies.

Methods: In this paper, the protein samples are represented by amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, sequence order descriptors and pseudo amino acid composition, giving a total of 1497 sequence derived features. To address the issue of efficient classification of G-protein coupled receptors and their subfamilies, we propose to use a weighted k-nearest neighbor classifier with the UNION of the best 50 features selected by Fisher score based feature selection, ReliefF, fast correlation based filter, minimum redundancy maximum relevance, and support vector machine based recursive feature elimination, to exploit the advantages of these feature selection methods.

Results: The proposed method achieved overall accuracies of 99.9%, 98.3% and 95.4%, MCC values of 1.00, 0.98 and 0.95, ROC area values of 1.00, 0.998 and 0.996, and precisions of 99.9%, 98.3% and 95.5% using 10-fold cross-validation to predict G-protein coupled receptors and non-G-protein coupled receptors, subfamilies of G-protein coupled receptors, and subfamilies of class A G-protein coupled receptors, respectively.

Conclusions: The high accuracy, MCC, ROC area and precision values indicate that the proposed method is well suited for the prediction of G-protein coupled receptor families and their subfamilies.

© 2016 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

G-protein coupled receptors (GPCRs) are seven-transmembrane domain receptors that sense molecules outside the cell and activate internal signal transduction pathways for cellular


responses. These are called seven-transmembrane receptors because they pass through the cell membrane seven times. G-protein coupled receptors can be grouped into six classes based on sequence homology and functional similarity; these are Class A (Rhodopsin-like), Class B (Secretin like), Class C (Metabotropic glutamate), Class D (cyclic AMP), Class E (Taste),


and Class F (Vomeronasal) receptors. A large number of G-protein coupled receptors are present in humans. For some of these, functions related to growth factors, light, hormones, amines, neurotransmitters, lipids, etc., have been identified. However, a large number of G-protein coupled receptors found in the human genome have unknown functions, so it is necessary to design an efficient approach to predict the families and subfamilies of G-protein coupled receptors for new drug discovery. The G-protein coupled receptors are transmembrane proteins which, via G-proteins, initiate some of the important signaling pathways in a cell and are involved in various physiological processes.

Initially, Bhasin and Raghava [1] proposed an SVM based method using amino acid composition and dipeptide composition for the prediction of G-protein coupled receptors. Later, Bhasin and Raghava [2] proposed an SVM based method for the classification of amine-type G-protein coupled receptors using amino acid composition and dipeptide composition of proteins. Gao and Wang [3] proposed a nearest neighbor method to discriminate GPCRs from non-GPCRs and subsequently classify GPCRs at four levels on the basis of amino acid composition and dipeptide composition of proteins. Gu and Ding [4] proposed a binary particle swarm optimization algorithm to extract effective features from amino acid pair compositions of GPCR protein sequences and then used an ensemble fuzzy k-nearest neighbor classifier to predict GPCR families. Gu et al. [5] proposed an AdaBoost classifier to predict G-protein coupled receptors using pseudo amino acid composition with approximate entropy and hydrophobicity patterns. Peng et al. [6] proposed a principal component analysis based method for the prediction of GPCR families and their subfamilies using sequence derived features. Lin and Xiao [7] proposed an ensemble of k-NN classifiers with grey incidence analysis and used pseudo amino acid composition to predict GPCR families and their subfamilies. Xiao and Wang [8] proposed a covariant discriminant based approach using the grey level co-occurrence matrix obtained from cellular automata images of the pseudo amino acid composition of a protein sequence to predict GPCRs and their functional classes. Xiao and Wang [9] proposed a fuzzy k-NN based approach using the combination of two different variations of pseudo amino acid composition, namely functional domain pseudo amino acid composition and low-frequency Fourier spectrum pseudo amino acid composition, to predict GPCRs and their types. Xiao et al. [10] proposed a fuzzy k-NN based approach to predict the interactions between GPCRs and drugs in cellular networks using two-dimensional fingerprints of pseudo amino acid composition generated through a grey level model. Elrod [11] studied and observed that a good and accurate data set is necessary for the prediction of GPCRs and their types. Chou [12] also studied the coupling interaction between receptors and G-binding proteins and observed that new therapeutic approaches may be designed by manipulating the interaction of receptors and G-binding proteins. Chou [13] proposed a covariant discriminant based method using amino acid composition to predict GPCRs and their classes. Qiu et al. [14] proposed a support vector machine based method using pseudo amino acid compositions generated with the discrete wavelet transform to extract features from the hydrophobicity scale of the amino acid composition to predict GPCR classes. Zia-ur-Rehman and Khan [15] proposed an ensemble classifier based on majority voting among nearest neighbor, probabilistic neural network, grey incidence degree and support vector machine classifiers, using a combination of pseudo amino acid composition, wavelet based multi-scale energy and position-specific scoring matrix features to predict GPCRs and their subfamilies. Xie et al. [16] proposed an ensemble support vector machine based approach using amino acid hydrophobicity based pseudo amino acid composition to predict GPCRs and their subfamilies. Elrod [17] studied and observed that amino acid compositions are closely related to GPCR families; therefore, good training data are necessary for the identification of GPCR families and subfamilies. Xiao and Lin [18] reviewed and summarized the developments and future challenges in the prediction of GPCRs and their subfamilies using sequence derived properties.

Chou [19] studied and observed that five things are necessary for the identification of an uncharacterized protein using sequence derived properties: construction of a benchmark dataset, formulation of the protein sample by using various sequence derived properties, design and development of a computational intelligence based method, a cross-validation test to measure the performance of the classifier, and a web server that is easily accessible and available to the public. Therefore, considering these points, for the construction of the benchmark dataset the sequences of the G-protein coupled receptors are extracted from GPCRDB [20], and all the non-G-protein coupled receptor proteins are selected from the UniProt database with the keyword NOT G-protein coupled receptors. To avoid homology bias, the CD-HIT [21] server is used to remove homologous sequences. For the formulation of the protein sample, eight feature vectors are extracted from the protein sequence: amino acid composition, dipeptide composition, correlation, composition, transition, and distribution of physicochemical properties, sequence order descriptors, and pseudo amino acid composition.

This paper proposes a weighted k-nearest neighbor classifier in which an inverse kernel function is applied to calculate weighted distances, to improve the prediction performance for G-protein coupled receptor families and their subfamilies using sequence derived properties. For non-redundant, relevant, robust and optimal feature subset selection, a feature selection method based on the fusion of five supervised filter based methods is proposed. These supervised feature selection methods are Fisher score based feature selection [22], ReliefF [23], fast correlation based filter (FCBF) [24], minimum redundancy maximum relevance (MRMR) [25], and support vector machine based recursive feature elimination (SVM-RFE) [26]. If these feature selection methods are applied to the same dataset, each of them results in a different ranked feature subset, and the performance of a classifier for each feature subset selected by a different method might be different. Therefore, we address this problem by proposing a method for optimal feature selection through the fusion of the feature subsets produced by these methods, using the union of the features selected by the different feature selection algorithms. Further, the proposed method uses a three-level strategy to predict G-protein coupled receptors and their subfamilies.
First, it is determined


that a protein sequence is a G-protein coupled receptor or a non-G-protein coupled receptor. Second, if a protein is classified as a G-protein coupled receptor, then the method classifies the subfamily of G-protein coupled receptors. Third, if it is classified as a class A G-protein coupled receptor, then the method classifies the subfamily of class A G-protein coupled receptors. This paper uses hold-one-out cross-validation to find the best value of the number of nearest neighbors (k) between 1 and 30 for the k-nearest neighbor (k-NN) classifier. To measure the performance of the wkNN classifier, ten-fold cross-validation is used. In the literature, various web servers are available for the identification of proteins based on sequence derived properties [27–43]. The performance of the proposed method indicates its usefulness for the prediction of G-protein coupled receptor families and their subfamilies; therefore, we shall make efforts in our future work to provide a web server for the method presented in this paper.

2. Materials and methods

2.1. Construction of benchmark dataset

In this paper, three data sets are used to predict GPCRs and their families. The first data set contains 2964 sequences of GPCRs extracted from GPCRDB [20] (http://www.gpcr.org/7tm/) and 576 non-GPCRs selected from the UniProt database with the keyword NOT G-protein coupled receptors. The second data set contains six families of GPCRs, including 974 Rhodopsin like, 621 Secretin like, 454 Metabotropic glutamate, 8 cAMP receptor, 480 Taste and 425 Vomeronasal receptor sequences. The third data set contains 974 sequences of the 12 subfamilies of Rhodopsin-like GPCRs. To avoid homology bias, the CD-HIT [21] server is used to remove homologous sequences using 70% sequence identity as the cutoff, because when the cutoff is decreased to 0.5 or 0.4, very few sequences remain for the evaluation of the classifier, which leads to lower performance values than the 70% cutoff. The description of the datasets is given in Table 1.

2.2. Formulation of protein sample

In this paper, to fully characterize the protein sequence, eight feature vectors are extracted with the PROFEAT server [44] to represent the protein sample, including amino acid composition, dipeptide composition, correlation, composition, transition and distribution of physicochemical properties, sequence order descriptors, and pseudo amino acid composition, with a total of 1497 features being calculated for the prediction of G-protein coupled receptors and their subfamilies. Brief descriptions of these features are given below.

2.2.1. Amino acid composition (AAC)

The amino acid composition is the fraction of each amino acid in the protein sequence. It is effectively used for the prediction of nuclear receptors [1], subcellular localization [45,46] and enzyme functions [47]. It is defined as

\[ \mathrm{AAC}(n) = \frac{\text{number of amino acids of type } n}{\text{length of the amino acid sequence}}, \qquad n = 1,\dots,20 \tag{1} \]

In total, 20 features are calculated, one corresponding to each amino acid.

Table 1 – Number of sequences belonging to each G-protein coupled receptor class and their subclasses.

Class                               Subclass                                No. of sequences
Rhodopsin like (974 in total)       Amine                                   78
                                    Peptide                                 65
                                    Rhodopsin                               61
                                    Olfactory                               64
                                    Prostanoid                              60
                                    Nucleotide-like                         99
                                    Cannabinoid                             66
                                    Platelet activating factor              50
                                    Gonadotropin-releasing hormone          92
                                    Thyrotropin-releasing hormone           71
                                    Viral                                   47
                                    Lysosphingolipid and LPA                70
                                    Leukotriene B4 receptor                 59
                                    Ecdysis triggering hormone receptor     25
                                    Hydroxycarboxylic acid receptor         43
                                    CAPA                                    24
Secretin like                       –                                       621
Metabotropic glutamate              –                                       454
cAMP receptors                      –                                       8
Taste                               –                                       480
Vomeronasal receptors               –                                       425
Non-G-protein coupled receptors     –                                       576


2.2.2. Dipeptide composition (DC)

The amino acid composition provides only composition information and ignores the sequence order information, so the dipeptide composition of a protein is used to transform a variable length protein sequence into a fixed vector of 400 features. It is effectively used for the prediction of ATP binding sites [47], subcellular localization [48,49] and G-protein coupled receptors [2]. It is calculated as

\[ \mathrm{DC}(n) = \frac{\text{total number of the } n\text{th dipeptide}}{\text{total number of all possible dipeptides}}, \qquad n = 1,\dots,400 \tag{2} \]
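The two compositional descriptors above can be computed directly from a sequence. The following Python sketch (illustrative only; the paper obtained its features from the PROFEAT server) builds the 20 AAC features of Eq. (1) and the 400 dipeptide features of Eq. (2). The `AMINO_ACIDS` ordering and the example sequence are assumptions made for the demonstration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-letter alphabet (assumed ordering)

def amino_acid_composition(seq):
    """Eq. (1): fraction of each amino acid type in the sequence (20 features)."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Eq. (2): fraction of each of the 400 possible dipeptides (400 features)."""
    total = len(seq) - 1                      # number of overlapping dipeptides
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return [counts[dp] / total for dp in sorted(counts)]

if __name__ == "__main__":
    demo = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE"  # hypothetical example sequence
    print(len(amino_acid_composition(demo)), len(dipeptide_composition(demo)))  # 20 400
```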

2.2.3. Correlation features (CF)

For a given protein sequence, autocorrelation features are defined by using the distribution of amino acid properties along the sequence. They are effectively used for the classification of G-protein coupled receptors [50] and DNA binding proteins [51]. In this paper, the normalized values of eight sequence properties (hydrophobicity, average flexibility, free energy of solution in water, polarizability, residue accessible surface area in tripeptide, residue volume, steric parameter and relative mutability) are used to calculate these features. Let P1, P2, …, PN be the physicochemical property values of the 1st, 2nd, …, Nth residues, respectively, so that a protein sequence can be represented as [P1, P2, …, PN]. The three types of autocorrelation features are computed as follows.

Normalized Moreau–Broto autocorrelation features are calculated as

\[ \mathrm{MBACF}(l) = \frac{1}{N-l}\sum_{i=1}^{N-l} P_i\,P_{i+l}, \qquad l = 1,\dots,30 \tag{3} \]

Moran autocorrelation features are calculated as

\[ \mathrm{MACF}(l) = \frac{\dfrac{1}{N-l}\sum_{i=1}^{N-l}\left(P_i-\bar{P}\right)\left(P_{i+l}-\bar{P}\right)}{\dfrac{1}{N}\sum_{i=1}^{N}\left(P_i-\bar{P}\right)^{2}}, \qquad l = 1,\dots,30 \tag{4} \]

where \(\bar{P}\) is the mean of the considered property P along the sequence, i.e.,

\[ \bar{P} = \frac{\sum_{i=1}^{N} P_i}{N} \tag{5} \]

and l is the lag of the autocorrelation, and Pi and Pi+l are the property values of the amino acids at positions i and i + l, respectively.

Geary autocorrelation features are calculated as

\[ \mathrm{GACF}(l) = \frac{\dfrac{1}{2(N-l)}\sum_{i=1}^{N-l}\left(P_i-P_{i+l}\right)^{2}}{\dfrac{1}{N}\sum_{i=1}^{N}\left(P_i-\bar{P}\right)^{2}}, \qquad l = 1,\dots,30 \tag{6} \]

with

\[ \bar{P} = \frac{\sum_{i=1}^{N} P_i}{N} \tag{7} \]

So in this paper, 8 × 30 = 240 features of each correlation type, with a total of 240 × 3 = 720 features, are calculated.
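A minimal sketch of Eqs. (3)–(6) for a single, already normalized property profile; in the paper these are computed for eight properties and lags up to 30, giving 720 features. The random property values below are placeholders, not the published amino acid indices.

```python
import numpy as np

def moreau_broto(p, max_lag=30):
    """Eq. (3): normalized Moreau-Broto autocorrelation for lags 1..max_lag."""
    n = len(p)
    return [np.sum(p[: n - l] * p[l:]) / (n - l) for l in range(1, max_lag + 1)]

def moran(p, max_lag=30):
    """Eq. (4): Moran autocorrelation, centred on the sequence mean of Eq. (5)."""
    n, mean = len(p), p.mean()
    denom = np.sum((p - mean) ** 2) / n
    return [np.sum((p[: n - l] - mean) * (p[l:] - mean)) / (n - l) / denom
            for l in range(1, max_lag + 1)]

def geary(p, max_lag=30):
    """Eq. (6): Geary autocorrelation."""
    n, mean = len(p), p.mean()
    denom = np.sum((p - mean) ** 2) / n
    return [np.sum((p[: n - l] - p[l:]) ** 2) / (2 * (n - l)) / denom
            for l in range(1, max_lag + 1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prop = rng.normal(size=120)          # placeholder property profile of a 120-residue protein
    feats = moreau_broto(prop) + moran(prop) + geary(prop)
    print(len(feats))                    # 90 features for one property (3 types x 30 lags)
```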

2.2.4. Composition, transition and distribution features (CTD)

These features were developed by Dubchak et al. [51,52] and effectively used by Li et al. [53] for the classification of G-protein coupled receptors. For calculating these features, the 20 amino acids of a protein sequence are divided into three groups (polar, neutral and hydrophobic) according to each of seven physicochemical properties: hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility. So, for each property, every amino acid is represented by one of the three indices 1, 2 and 3 according to the group it belongs to. The composition, transition and distribution features are calculated as

\[ \text{Composition}(i) = \frac{\text{number of } i \text{ in the encoded sequence}}{\text{length of the sequence}}, \qquad i = 1, 2, 3 \tag{8} \]

\[ \text{Transition}(ij) = \frac{\text{number of dipeptides encoded as ``}ij\text{'' or ``}ji\text{''}}{\text{length of the sequence} - 1}, \qquad ij = \text{`12', `13', `23'} \tag{9} \]

Distribution: for each group, it records the positions within the amino acid sequence of the 1st residue of that group and of the residues covering 25%, 50%, 75% and 100% of the occurrences of that group. So in this paper, a total of 7 × 3 = 21 composition features, 7 × 3 = 21 transition features and 7 × 3 × 5 = 105 distribution features are calculated.
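A sketch of the composition and transition descriptors of Eqs. (8) and (9), plus the distribution descriptor, for one physicochemical property. The three-group hydrophobicity encoding used below is a commonly quoted CTD grouping; treat the exact group membership as an assumption rather than the paper's own table.

```python
# Hydrophobicity grouping often used for CTD descriptors (assumed, not taken from the paper):
GROUPS = {1: set("RKEDQN"),      # polar
          2: set("GASTPHY"),     # neutral
          3: set("CLVIMFW")}     # hydrophobic

def encode(seq):
    """Map each residue to its group index 1, 2 or 3."""
    return [g for aa in seq for g, members in GROUPS.items() if aa in members]

def ctd(seq):
    enc = encode(seq)
    n = len(enc)
    # Eq. (8): composition of each group label.
    composition = [enc.count(g) / n for g in (1, 2, 3)]
    # Eq. (9): transitions between different group labels ('12', '13', '23'), both directions.
    pairs = list(zip(enc, enc[1:]))
    transition = [sum(p in ((a, b), (b, a)) for p in pairs) / (n - 1)
                  for a, b in ((1, 2), (1, 3), (2, 3))]
    # Distribution: position (as % of length) of the 1st, 25%, 50%, 75% and 100% occurrence.
    distribution = []
    for g in (1, 2, 3):
        pos = [i + 1 for i, x in enumerate(enc) if x == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            idx = max(int(round(frac * len(pos))) - 1, 0)
            distribution.append(100.0 * pos[idx] / n if pos else 0.0)
    return composition + transition + distribution      # 3 + 3 + 15 values per property

if __name__ == "__main__":
    print(len(ctd("MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE")))   # 21
```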

2.2.5. Sequence order descriptors (SD)

Sequence order descriptors are calculated from the physicochemical distance matrix between each pair of the 20 amino acids.

2.2.6. Sequence order coupling numbers

The dth rank sequence order coupling number is defined as

\[ \tau_d = \sum_{i=1}^{N-d}\left(d_{i,i+d}\right)^{2}, \qquad d = 1,\dots,30 \tag{10} \]

where \(d_{i,i+d}\) is the physicochemical distance between the two amino acids at positions i and i + d. Here, two physicochemical distance matrices, the Schneider–Wrede and the Grantham matrices, are used with maxlag (d) = 30, so a total of 30 × 2 = 60 sequence order coupling number features are extracted.

2.2.7. Quasi sequence order descriptors

For each amino acid type, a quasi-sequence order descriptor can be defined as

\[ X_r = \frac{f_r}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{30}\tau_d}, \qquad 1 \le r \le 20, \; w = 0.1 \tag{11} \]

\[ X_d = \frac{w\,\tau_{d-20}}{\sum_{r=1}^{20} f_r + w\sum_{d=1}^{30}\tau_d}, \qquad 21 \le d \le 50, \; w = 0.1 \tag{12} \]

Here, \(f_r\) is the normalized occurrence of amino acid type r and w is the weighting factor. Again, the two physicochemical distance matrices, Schneider–Wrede and Grantham, are used with maxlag (d) = 30, so 50 × 2 = 100 quasi-sequence order features are extracted.
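A sketch of Eqs. (10)–(12). A real implementation needs the Schneider–Wrede and Grantham physicochemical distance matrices; here a random symmetric placeholder matrix stands in for them, so the printed numbers are only structural illustrations.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(1)
_D = rng.uniform(0.1, 1.0, size=(20, 20))
DIST = (_D + _D.T) / 2.0                     # placeholder for a Schneider-Wrede / Grantham matrix
IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def coupling_numbers(seq, max_lag=30):
    """Eq. (10): tau_d = sum_i d(i, i+d)^2 for d = 1..max_lag."""
    return [sum(DIST[IDX[seq[i]], IDX[seq[i + d]]] ** 2 for i in range(len(seq) - d))
            for d in range(1, max_lag + 1)]

def quasi_sequence_order(seq, max_lag=30, w=0.1):
    """Eqs. (11)-(12): 20 composition-like terms plus max_lag coupling terms (50 features)."""
    tau = coupling_numbers(seq, max_lag)
    f = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]   # normalized occurrences f_r
    denom = sum(f) + w * sum(tau)
    return [fr / denom for fr in f] + [w * t / denom for t in tau]

if __name__ == "__main__":
    demo = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGET"  # hypothetical 50-residue sequence
    print(len(quasi_sequence_order(demo)))                     # 50
```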

2.2.8. Pseudo amino acid composition (PAAC)

Various web servers [54–56] are available to generate the pseudo amino acid composition. It is effectively used for the prediction of nuclear receptors [57], DNA binding sites [58,59], subcellular localization [60–62] and enzyme functions [63–65]. In this paper, the pseudo amino acid composition is calculated by using three properties of each of the 20 amino acids, hydrophobicity (H1), hydrophilicity (H2) and side chain mass (M), to represent the sequence order correlation between contiguous residues up to a distance of 30 positions. The correlation between these three properties is calculated as

\[ \theta(R_i, R_j) = \frac{1}{3}\left\{\left[H_1(R_i)-H_1(R_j)\right]^{2} + \left[H_2(R_i)-H_2(R_j)\right]^{2} + \left[M(R_i)-M(R_j)\right]^{2}\right\} \tag{13} \]

By using these correlation values, a set of sequence order correlated features is calculated as

\[ \Theta_\lambda = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda}\theta\left(R_i, R_{i+\lambda}\right), \qquad \lambda = 1,\dots,30 \tag{14} \]

Let \(f_n\) be the normalized occurrence frequency of the 20 amino acids in the protein sequence; then a set of 20 + λ pseudo amino acid composition features can be calculated as

\[ \mathrm{PAAC}_n = \frac{f_n}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{30}\Theta_j}, \qquad 1 \le n \le 20, \; w = 0.1 \tag{15} \]

\[ \mathrm{PAAC}_n = \frac{w\,\Theta_{n-20}}{\sum_{r=1}^{20} f_r + w\sum_{j=1}^{30}\Theta_j}, \qquad 21 \le n \le 50, \; w = 0.1 \tag{16} \]

So a total of 50 pseudo amino acid composition features are calculated.
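A sketch of the pseudo amino acid composition of Eqs. (13)–(16). The hydrophobicity, hydrophilicity and side-chain mass values should be the standardized property tables used by Chou's PseAAC; random placeholder vectors are used here so that the code stays self-contained, and they are an assumption of this sketch.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(2)
# Placeholder standardized property tables (hydrophobicity H1, hydrophilicity H2, mass M).
H1, H2, M = (dict(zip(AMINO_ACIDS, rng.normal(size=20))) for _ in range(3))

def theta(a, b):
    """Eq. (13): averaged squared property difference between residues a and b."""
    return ((H1[a] - H1[b]) ** 2 + (H2[a] - H2[b]) ** 2 + (M[a] - M[b]) ** 2) / 3.0

def pseaac(seq, lam=30, w=0.1):
    """Eqs. (14)-(16): 20 composition terms plus lam sequence-order correlation terms."""
    n = len(seq)
    corr = [sum(theta(seq[i], seq[i + l]) for i in range(n - l)) / (n - l)   # Eq. (14)
            for l in range(1, lam + 1)]
    f = [seq.count(aa) / n for aa in AMINO_ACIDS]          # normalized occurrence frequencies
    denom = sum(f) + w * sum(corr)
    return [fr / denom for fr in f] + [w * c / denom for c in corr]   # 20 + lam features

if __name__ == "__main__":
    demo = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGET"
    print(len(pseaac(demo)))    # 50
```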

2.3. Feature subset selection

In this paper, five supervised filter based methods, namely Fisher score based feature selection, ReliefF, FCBF, mRMR and SVM-RFE, are used to obtain an optimal number of features. If these feature selection methods are applied to the same dataset, each of them results in a different feature subset in which the features are ranked, and the performance of a classifier for each feature subset selected by a different method may be different. Therefore, we address this problem by proposing a method for optimal feature selection through the fusion of the feature subsets produced by these methods, using the union of the features selected by the different feature selection algorithms. Brief descriptions of these five supervised feature selection methods follow.

2.3.1. ReliefF

ReliefF [23] is a simple and efficient supervised feature selection algorithm, inspired by instance based learning, that estimates the quality of features in problems with strong dependencies between features. ReliefF randomly selects an instance Ri and then searches for k of its nearest neighbors from the same class, called nearest hits Hj, and also k nearest neighbors from each of the other classes, called nearest misses Mj(C). It then updates the quality estimate W[A] of an attribute A based on the values of A for Ri, Hj and Mj(C). If instances Ri and Hj have different values of attribute A, then attribute A separates two instances of the same class, which is not desirable, so the quality estimate W[A] is decreased. On the other hand, if instances Ri and Mj(C) have different values of attribute A, then W[A] is increased. This whole process is repeated m times, where m is defined by the user, and the quality of the attributes is estimated as

\[ W[A] := W[A] - \sum_{j=1}^{k}\frac{\mathrm{diff}(A, R_i, H_j)}{m\,k} + \sum_{C \ne \mathrm{class}(R_i)}\frac{\dfrac{P(C)}{1-P(\mathrm{class}(R_i))}\displaystyle\sum_{j=1}^{k}\mathrm{diff}(A, R_i, M_j(C))}{m\,k} \tag{17} \]
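A compact, simplified sketch of the ReliefF update of Eq. (17) for numeric features (Manhattan neighbor distances, diff() normalized by the feature range). It illustrates the weighting idea only and is not a substitute for the reference algorithm in [23].

```python
import numpy as np

def relieff(X, y, m=100, k=5, rng=np.random.default_rng(0)):
    """Return one quality weight per feature, in the spirit of Eq. (17)."""
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12        # for the normalized diff() term
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n))
    W = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)             # Manhattan distances to the sample
        dist[i] = np.inf                                 # exclude the sample itself
        for c in classes:
            idx = np.where(y == c)[0]
            idx = idx[np.argsort(dist[idx])][:k]        # k nearest hits or misses in class c
            diff = np.abs(X[idx] - X[i]) / span         # per-feature diff(A, Ri, .)
            if c == y[i]:                               # nearest hits: decrease W
                W -= diff.sum(axis=0) / (m * k)
            else:                                       # nearest misses: increase W
                W += prior[c] / (1 - prior[y[i]]) * diff.sum(axis=0) / (m * k)
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)             # only the first two features matter
    print(np.argsort(relieff(X, y))[::-1][:2])          # informative features should rank highest
```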

2.3.2. mRMR (minimum redundancy maximum relevance)

The minimum redundancy maximum relevance (mRMR) method [25] is a multivariate feature selection method which starts with an empty set, uses mutual information to weight features, and applies a forward selection technique with a sequential search strategy to find the best subset of features. It selects a feature subset in which each feature has minimal redundancy with the other selected features and maximal relevance to the target class. The subset of features is obtained by calculating the mutual information between the features themselves and between the features and the class variable. For binary classification, the class variable c_k is 1 or 2. The mutual information MI(x, y) of two features x and y is calculated as

\[ MI(x, y) = \sum_{i,j \in N} p(x_i, y_j)\log\frac{p(x_i, y_j)}{p(x_i)\,p(y_j)} \tag{18} \]

where p(x_i) and p(y_j) are the marginal probability densities and p(x_i, y_j) is the joint probability. Similarly, the mutual information MI(x, c) between the class variable c and a feature x is calculated as

\[ MI(x, c) = \sum_{i,k \in N} p(x_i, c_k)\log\frac{p(x_i, c_k)}{p(x_i)\,p(c_k)} \tag{19} \]

The minimum redundancy condition is to minimize the total redundancy of all selected features, min(Redundancy), where

\[ \mathrm{Redundancy} = \frac{1}{|S|^{2}}\sum_{x,y \in S} MI(x, y) \tag{20} \]

where S is the feature subset and |S| is the number of features in S. The maximum relevance condition is to maximize the total relevance between all features in S and the class variable, max(Relevance), where


\[ \mathrm{Relevance} = \frac{1}{|S|}\sum_{x \in S} MI(x, c) \tag{21} \]

The optimal subset of features is selected by

\[ \max\left(\mathrm{Relevance} - \mathrm{Redundancy}\right) \tag{22} \]
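A sketch of the mRMR criterion of Eqs. (18)–(22) in the incremental greedy form commonly used in practice: mutual information is estimated on discretized features, and at each step the candidate maximizing relevance minus (average) redundancy is added. The binning scheme and `n_select` value are illustrative assumptions.

```python
import numpy as np

def _discretize(x, bins=10):
    """Equal-width binning so that mutual information can be estimated from counts."""
    edges = np.linspace(x.min(), x.max(), bins + 1)
    return np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)

def mutual_info(a, b):
    """Eqs. (18)-(19): MI between two discrete variables from their joint histogram."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mrmr(X, y, n_select=10, bins=10):
    """Greedy selection maximizing relevance minus redundancy, in the spirit of Eq. (22)."""
    Xd = np.column_stack([_discretize(X[:, j], bins) for j in range(X.shape[1])])
    relevance = np.array([mutual_info(Xd[:, j], y) for j in range(X.shape[1])])
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(Xd[:, j], Xd[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))
    y = (X[:, 3] - X[:, 7] > 0).astype(int)
    print(mrmr(X, y, n_select=5))
```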

2.3.3. Fast correlation based filter (FCBF)

The fast correlation based filter (FCBF) [24] is a multivariate feature selection method which starts with the full set of features, uses symmetrical uncertainty to calculate dependencies between features, and finds the best subset using a backward selection technique with a sequential search strategy. It consists of two stages. The first is a relevance analysis, aimed at ordering the input variables according to a relevance score, computed as the symmetric uncertainty with respect to the target output; this stage is also used to discard irrelevant variables, namely those whose ranking score is below a predefined threshold. The second stage is a redundancy analysis, aimed at selecting predominant features from the relevant set obtained in the first stage. It has an internal stopping criterion that makes it stop when there are no features left to eliminate. Symmetrical uncertainty (SU) is a normalized information theoretic measure which uses entropy and conditional entropy values to calculate dependencies between features. If X is a random variable and p(x) is the probability of x, the entropy of X is

\[ H(X) = -\sum_i p(x_i)\log_2 p(x_i) \tag{23} \]

The conditional uncertainty of X given a random variable Y is the average conditional uncertainty of X over Y:

\[ H(X\mid Y) = -\sum_j p(y_j)\sum_i p(x_i\mid y_j)\log_2 p(x_i\mid y_j) \tag{24} \]

Symmetrical uncertainty (SU) is then defined as

\[ SU(X, Y) = 2\left[\frac{H(X)-H(X\mid Y)}{H(X)+H(Y)}\right] \tag{25} \]

A symmetrical uncertainty value of 1 indicates that the value of either feature can be completely predicted from the other, and a value of 0 indicates that the two features are totally independent.
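A sketch of the information-theoretic quantities in Eqs. (23)–(25) for discrete variables. FCBF's full relevance/redundancy procedure is more involved; this only shows how symmetrical uncertainty is computed.

```python
import numpy as np

def entropy(x):
    """Eq. (23): Shannon entropy (base 2) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def conditional_entropy(x, y):
    """Eq. (24): H(X|Y) as the average uncertainty of X within each value of Y."""
    values, counts = np.unique(y, return_counts=True)
    weights = counts / counts.sum()
    return float(sum(w * entropy(x[y == v]) for v, w in zip(values, weights)))

def symmetrical_uncertainty(x, y):
    """Eq. (25): 2 * [H(X) - H(X|Y)] / [H(X) + H(Y)], lies in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    return 2.0 * (hx - conditional_entropy(x, y)) / (hx + hy)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=500)
    noisy_copy = np.where(rng.random(500) < 0.9, y, 1 - y)   # mostly predictable from y
    print(round(symmetrical_uncertainty(noisy_copy, y), 3),
          round(symmetrical_uncertainty(rng.integers(0, 2, 500), y), 3))
```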

2.3.4. Fisher score based filter

Fisher score [22] is one of the most widely used supervised feature selection methods. It is a univariate filter method which evaluates each feature individually; because it scores each feature independently under the Fisher criterion, it can lead to a suboptimal subset of features. The Fisher score is defined as the ratio of the between-class variance to the within-class variance:

\[ \text{Fisher score} = \frac{\sigma^{2}_{\text{between classes}}}{\sigma^{2}_{\text{within classes}}} = \frac{\text{between-class variance (difference of means)}}{\text{within-class variance (sum of variances)}} \tag{26} \]

The Fisher score is calculated for each attribute, and the features with high Fisher scores are selected.

2.3.5. Support vector machine based recursive feature elimination (SVM-RFE)

SVM-RFE [26] selects features in a sequential backward elimination manner, starting with all the features and discarding one feature at a time. It uses the weight magnitude as the ranking criterion and performs the following steps:

1. Train an SVM classifier on the training set.
2. Compute the ranking criteria, i.e. the squared weight coefficients, for all the features.
3. Rank the features using the weights of the resulting classifier.
4. Eliminate the feature with the smallest weight.
5. Repeat the process with the remaining features.
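The elimination loop above corresponds to scikit-learn's RFE wrapper around a linear-kernel SVM, and the paper's fusion step is a plain set union of the top-ranked features from the different selectors. A sketch under those assumptions (the top-10 lists for the other selectors are random stand-ins, not real rankings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=60, n_informative=8, random_state=0)

# SVM-RFE: linear SVM weights as the ranking criterion, eliminating one feature per step.
rfe = RFE(estimator=SVC(kernel="linear", C=1.0), n_features_to_select=10, step=1).fit(X, y)
svm_rfe_top10 = list(np.where(rfe.support_)[0])

# Stand-in top-10 lists for the other selectors (in the paper: Fisher score, ReliefF, FCBF, mRMR).
other_rankings = [list(rng.permutation(60)[:10]) for rng in
                  (np.random.default_rng(s) for s in range(4))]

# Fusion: UNION of the per-selector top-k lists, as done for the 47- and 187-feature subsets.
union = sorted(set(svm_rfe_top10).union(*map(set, other_rankings)))
print(len(union), union[:10])
```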

2.4. Classification of G-protein coupled receptors and their subfamilies

For the classification of G-protein coupled receptors and their subfamilies, the weighted k-nearest neighbor classifier (wkNN) [66] has been used. The k-nearest neighbor classifier is a simple and effective instance based learning algorithm which finds the k nearest neighbors of a query and takes a majority vote among the classes of these neighbors to assign a class to the query. The distance function affects the performance of the k-NN classifier. This paper proposes to use a wkNN in which an inverse kernel function is applied to calculate weighted distances, to improve the prediction performance for G-protein coupled receptor families and their subfamilies. Basically, the traditional k-NN classifier is run each time with a different k, from k = 1 up to k equal to the square root of the training set size. This paper uses hold-one-out cross-validation to find the best k between 1 and 30, which is the value specified for the k-NN parameter. The wkNN performs the following steps:

1. Let L = {(y_i, x_i), i = 1, …, n_L} be a training dataset, where y_i ∈ {1, …, c} represents the class and x_i' = (x_{i1}, …, x_{ip}) represents the predictor values. Let x be the test sample whose class label y has to be predicted.

2. Obtain the k + 1 nearest neighbors of x by using the Manhattan distance function d(x, x_i), where

\[ d(x, x_i) = \sum_{s=1}^{p}\left|x_s - x_{is}\right| \tag{27} \]

3. The (k + 1)th neighbor is used for the standardization of the k smallest distances:

\[ D(x, x_i) = \frac{d(x, x_i)}{d(x, x_{k+1})} \tag{28} \]

4. Transform and normalize the distances by using the inverse kernel function K(d) = 1/|d| to obtain the weights w_i = 1/D(x, x_i).

5. Assign to the test sample x the class y that obtains the weighted majority among the k nearest neighbors.
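A sketch of the wkNN steps above: Manhattan distances (Eq. (27)), standardization by the (k + 1)th neighbor (Eq. (28)), inverse-distance weights, and a weighted vote. It is written directly from the description; the small epsilon guarding division by zero is an implementation assumption.

```python
import numpy as np

def wknn_predict(X_train, y_train, x, k=5, eps=1e-12):
    """Classify one query x with the weighted k-nearest-neighbor rule described above."""
    dist = np.abs(X_train - x).sum(axis=1)                # Eq. (27): Manhattan distances
    order = np.argsort(dist)
    neighbors, standardizer = order[:k], dist[order[k]]   # (k+1)th distance for Eq. (28)
    D = dist[neighbors] / (standardizer + eps)            # standardized distances
    w = 1.0 / (D + eps)                                   # inverse kernel weights
    classes = np.unique(y_train[neighbors])
    votes = [w[y_train[neighbors] == c].sum() for c in classes]
    return classes[int(np.argmax(votes))]                 # weighted majority vote

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)
    query = rng.normal(size=5)
    print(wknn_predict(X, y, query, k=7))
```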


Fig. 1 – A flowchart for the proposed model.

In this paper, the proposed method uses a three-level strategy to predict G-protein coupled receptors and their subfamilies. The complete procedure of the proposed method for the prediction of G-protein coupled receptors and their subfamilies is illustrated in Fig. 1, and the steps are as follows:

1. Produce the seven feature vectors, with 1497 features in total, that represent a protein sequence.
2. Select the best 50 features with each of the Fisher score based, ReliefF, FCBF, mRMR and SVM-RFE feature selection methods.
3. Fuse the feature subsets produced by these methods by taking the UNION of the features selected by the different feature selection algorithms.
4. Apply the wkNN classifier for the prediction of G-protein coupled receptors and their subfamilies as follows. First, it is determined whether a protein sequence is a G-protein coupled receptor or a non-G-protein coupled receptor. Second, if a protein is classified as a G-protein coupled receptor, then the method classifies the subfamily of G-protein coupled receptors.


Third, if it is classified as class A G-protein coupled receptors, then the method classifies the subfamilies of class A G-protein coupled receptors.

3. Performance measures

Three cross-validation methods, namely the sub-sampling or K-fold cross-validation test, the independent data set test, and the jackknife test [67–78], are generally used to evaluate the performance of a classifier. To reduce the computational time, in this paper 10-fold cross-validation is used to measure the performance of the wkNN classifier. In K-fold cross-validation, the dataset of all proteins is partitioned into K subsets, where one subset is used for validation and the remaining K − 1 subsets are used for training. This process is repeated K times so that every subset is used once as test data. Here, the variances were calculated for different values of K (K = 3, 5, 7, 10 and 15), and the minimum variance is obtained at K = 10, so 10-fold cross-validation is used to measure the performance of the wkNN classifier (see Fig. 2).

In this paper, accuracy (ACC), precision, the Receiver Operating Characteristic (ROC) area and Matthew's correlation coefficient (MCC) are used to measure the performance. Accuracy is measured as

\[ \mathrm{ACC}(i) = \frac{C(i)}{T(i)}, \qquad i = 1, 2, \dots, n \tag{29} \]

where T(i) is the total number of sequences in class i, C(i) is the number of correctly predicted sequences of class i and n is the total number of classes.

MCC is a balanced measure that considers both true and false positives and negatives. The MCC can be obtained as

\[ \mathrm{MCC}(i) = \frac{1-\left(\dfrac{N_-^+(i)}{N^+(i)} + \dfrac{N_+^-(i)}{N^-(i)}\right)}{\sqrt{\left(1+\dfrac{N_+^-(i)-N_-^+(i)}{N^+(i)}\right)\left(1+\dfrac{N_-^+(i)-N_+^-(i)}{N^-(i)}\right)}} \tag{30} \]

where \(N^+(i)\) is the total number of samples in class i, \(N_-^+(i)\) is the number of samples in class i that are incorrectly predicted as belonging to another class, \(N^-(i)\) is the total number of samples in the other classes and \(N_+^-(i)\) is the number of samples that are incorrectly predicted as belonging to class i.

Precision is the proportion of instances classified as positive that are really positive. It is defined as

\[ \mathrm{Precision}(i) = \frac{TP(i)}{TP(i)+FP(i)} \tag{31} \]

where TP is the number of true positives, TN the true negatives, FP the false positives and FN the false negatives, with

\[ TP(i) = N^+(i) - N_-^+(i), \quad TN(i) = N^-(i) - N_+^-(i), \quad FP(i) = N_+^-(i), \quad FN(i) = N_-^+(i) \tag{32} \]

The Receiver Operating Characteristic (ROC) curve is a graph that shows the performance of a classifier by plotting the TP rate versus the FP rate at various threshold settings. The area under the ROC curve (AUC) of a classifier is the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance.
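A small sketch of the per-class measures of Eqs. (29)–(32), computed from the counts N+(i), N−(i), N+−(i) and N−+(i); the square root in the MCC denominator follows the standard form of this coefficient.

```python
import math

def class_metrics(n_pos, n_neg, false_neg, false_pos):
    """ACC, MCC and precision for one class from the counts used in Eqs. (29)-(32).

    n_pos     : N+(i), samples belonging to class i
    n_neg     : N-(i), samples belonging to the other classes
    false_neg : N-+(i), class-i samples predicted as another class
    false_pos : N+-(i), other-class samples predicted as class i
    """
    tp, tn = n_pos - false_neg, n_neg - false_pos                 # Eq. (32)
    acc = tp / n_pos                                              # Eq. (29)
    mcc = (1 - (false_neg / n_pos + false_pos / n_neg)) / math.sqrt(
        (1 + (false_pos - false_neg) / n_pos) * (1 + (false_neg - false_pos) / n_neg))  # Eq. (30)
    precision = tp / (tp + false_pos)                             # Eq. (31)
    return acc, mcc, precision

if __name__ == "__main__":
    print(class_metrics(n_pos=100, n_neg=400, false_neg=2, false_pos=3))
```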

4. Results and comparative analysis

In this paper, a wkNN with an inverse kernel function to calculate weighted distances is used to improve the prediction performance for G-protein coupled receptors and their subfamilies from sequence derived properties. Hold-one-out cross-validation is used to find the best k between 1 and 30, which is the number of nearest neighbors specified for the wkNN classifier. For partitioning the datasets into training and test sets and evaluating the performance of the proposed model, 10-fold cross-validation is used. In the subsequent subsections, the results and performance analysis of the proposed model for the prediction of G-protein coupled receptors and their subfamilies are presented and discussed.

The performance of the proposed model is shown for the 47 features obtained as the UNION of the best 10 features selected by each of the feature selection algorithms (Fisher score based feature selection, ReliefF, FCBF, mRMR and SVM-RFE), compared against 47 features selected by each of these algorithms individually. The performance of the proposed model is also shown for the UNION of the best 50 features selected by each algorithm from the 1497 total features, which yields 187 features, compared against the best 187 features selected by each algorithm individually as well as the full feature set. It is observed that the performance of the classifier is improved by using the UNION of the best 50 features selected by the five different feature selection algorithms.


Fig. 2 – Variances for different values of K-fold cross-validation.

4.1. Prediction of G-protein coupled receptors and non-G-protein coupled receptors


Fig. 3 – Accuracy and MCC for the UNION of best 10 features (47) selected by different feature selection algorithms with features selected from these algorithms for the classification of GPCR and non-GPCR.

To predict G-protein coupled receptors and non-G-protein coupled receptors, the wkNN is evaluated on the 47 features obtained as the UNION of the best 10 features selected by each of the feature selection algorithms (Fisher score based feature selection, ReliefF, FCBF, mRMR and SVM-RFE), and compared with 47 features selected by each of these algorithms individually. The performance of the proposed model is also shown for the UNION of the best 50 features selected by each algorithm from the 1497 total features (187 features), compared with the best 187 features selected by each algorithm individually. It is observed that the performance of the classifier is improved by using the UNION of the best 50 features (187) selected by the five different feature selection algorithms for the classification of G-protein coupled receptors and non-G-protein coupled receptors (see Figs. 3 and 4).

From the analysis of Table 2, it is observed that the weighted k-nearest neighbor method provides an overall accuracy of 99.9%, MCC of 1.00, ROC area of 1.00 and precision of 99.9%.


Fig. 4 – Accuracy and MCC for the classification of GPCR and non-GPCR with the UNION of best 50 features (187) selected by different feature selection algorithms with features selected from these algorithms and total number of features.


Table 2 – Result analysis for the classification of G-protein coupled receptors and non-G-protein coupled receptors. Both classifiers use the proposed feature selection method.

Family      Proposed method (weighted kNN)               SVM (RBF kernel, γ = 200, C = 100)
            ACC     MCC    ROC Area   Precision          ACC     MCC     ROC Area   Precision
Non-GPCR    99.7    1.00   1.00       99.5               99.5    0.996   0.997      99.8
GPCR        99.9    1.00   1.00       99.9               100     0.996   0.997      99.9
Overall     99.9    1.00   1.00       99.9               99.9    0.996   0.997      99.9

Further, the support vector machine based classifier was also used to examine the prediction rate for G-protein coupled receptors and non-G-protein coupled receptors using the proposed feature selection method (with an RBF kernel function and tuning parameters γ = 200, C = 100), and it provides an overall accuracy of 99.9%, MCC of 0.996, ROC area of 0.997 and precision of 99.9%. Of the two classifiers, wkNN and SVM, the wkNN based classifier performs better in conjunction with the proposed feature selection method. In the literature, Bhasin and Raghava [1] used only dipeptide composition with an SVM classifier and reported an accuracy of 99.5% and MCC of 0.99 for the prediction of G-protein coupled receptors and non-G-protein coupled receptors. It is also observed that the performance of the wkNN is better in comparison to other classifiers for the prediction of G-protein coupled receptors and non-G-protein coupled receptors (see Fig. 5).

4.2. Prediction of subfamilies of G-protein coupled receptors

To predict the subfamilies of G-protein coupled receptors, the wkNN is evaluated on the 47 features obtained as the UNION of the best 10 features selected by each of the feature selection algorithms (Fisher score based feature selection, ReliefF, FCBF, mRMR and SVM-RFE), and compared with 47 features selected by each of these algorithms individually. The performance of the proposed model is also shown for the UNION of the best 50 features selected by each algorithm from the 1497 total features (187 features), compared with the best 187 features selected by each algorithm individually. It is observed that the performance of the classifier is improved by using the UNION of the best 50 features (187) selected by the five different feature selection algorithms for the classification of the subfamilies of G-protein coupled receptors (see Figs. 6 and 7).

From the analysis of Table 3, it is observed that the proposed method (wkNN) provides an overall accuracy of 98.3%, MCC of 0.98, ROC area of 0.998 and precision of 98.3%. Further, the support vector machine based classifier was also used to examine the prediction rate for the subfamilies of GPC receptors using the proposed feature selection method (with an RBF kernel function and tuning parameters γ = 200, C = 100), and it provides an overall accuracy of 97.3%, MCC of 0.964, ROC area of 0.981 and precision of 97.3%. Of the two classifiers, wkNN and SVM, the wkNN based classifier performs better in conjunction with the proposed feature selection method (see Table 3). In the literature, Bhasin and Raghava [1] used only dipeptide composition with an SVM classifier and reported an accuracy of 91.2% for the prediction of subfamilies of G-protein coupled receptors. It is also observed that the performance of the proposed method (wkNN) is better in comparison to other classifiers for the prediction of the subfamilies of G-protein coupled receptors (see Fig. 8).

4.3. Prediction of subfamilies of class A G-protein coupled receptors


Fig. 5 – Accuracy and MCC for the classification of GPCR and non-GPCR with UNION of best 50 features (187) selected by different feature selection algorithms.


Fig. 6 – Accuracy and MCC for the UNION of best 10 features (47) selected by different feature selection algorithms with features selected from these algorithms for the classification of GPCR.

To predict the subfamilies of class A G-protein coupled receptors, the wkNN is evaluated on the 47 features obtained as the UNION of the best 10 features selected by each of the feature selection algorithms (Fisher score based feature selection, ReliefF, FCBF, mRMR and SVM-RFE), and compared with 47 features selected by each of these algorithms individually. The performance of the proposed model is also shown for the UNION of the best 50 features selected by each algorithm from the 1497 total features (187 features), compared with the best 187 features selected by each algorithm individually. It is observed that the performance of the classifier is improved by using the UNION of the best 50 features (187) selected by the five different feature selection algorithms for the classification of the subfamilies of class A G-protein coupled receptors (see Figs. 9 and 10).

From the analysis of Table 4, it is observed that the proposed method (wkNN) provides an overall accuracy of 95.4%, MCC of 0.951, ROC area of 0.996 and precision of 95.5%. Further, the support vector machine based classifier was also used to examine the prediction rate for the subfamilies of class A GPC receptors using the proposed feature selection method (with an RBF kernel function and tuning parameters γ = 200, C = 100), and it provides an overall accuracy of 91.5%, MCC of 0.909, ROC area of 0.955 and precision of 91.4%. Of the two classifiers, wkNN and SVM, the wkNN based classifier performs better in conjunction with the proposed feature selection method (see Table 4). In the literature, Bhasin and Raghava [1] used only dipeptide composition with an SVM classifier and reported an accuracy of 92.6% for the prediction of subfamilies of class A G-protein coupled receptors. It is also observed that the performance of the proposed method (wkNN) is better in comparison to other classifiers for the prediction of the subfamilies of class A G-protein coupled receptors (see Fig. 11).

Table 3 – Result analysis for the classification of subfamilies of G-protein coupled receptors. Both classifiers use the proposed feature selection method.

Subfamilies of GPCR       Proposed method (weighted kNN)               SVM (RBF kernel, γ = 200, C = 100)
                          ACC    MCC    ROC Area   Precision           ACC    MCC     ROC Area   Precision
Rhodopsin like            97.7   0.96   0.997      97.3                96.7   0.942   0.973      95.5
Secretin like             99     0.99   0.999      99.8                98.6   0.987   0.992      99.4
Metabotropic glutamate    97.1   0.98   0.999      98.7                96.9   0.975   0.984      98.9
cAMP                      87.5   0.78   0.977      70                  37.5   0.612   0.688      100
Taste                     98.5   0.98   0.998      98.3                97.1   0.963   0.982      96.7
Vomeronasal               99.8   0.99   1.00       98.6                98.4   0.974   0.989      97.2
Overall                   98.3   0.98   0.998      98.3                97.3   0.964   0.981      97.3


Fig. 7 – Accuracy and MCC for the classification of GPCR with the UNION of best 50 features (187) selected by different feature selection algorithms with features selected from these algorithms and total number of features.


Fig. 8 – Accuracy and MCC for the classification of GPCR with UNION of best 50 features (187) selected by different feature selection algorithms.


Table 4 – Result analysis for the classification of subfamilies of class A G-protein coupled receptors. Both classifiers use the proposed feature selection method.

Subfamilies of class A GPCR           Proposed method (weighted kNN)               SVM (RBF kernel, γ = 200, C = 100)
                                      ACC    MCC     ROC Area   Precision          ACC    MCC     ROC Area   Precision
Amine                                 97.4   0.986   0.996      100                98.7   0.993   0.994      100
Cannabinoid                           100    0.969   1.00       94.3               95.5   0.936   0.975      92.6
CAPA                                  70.8   0.734   0.962      77.3               58.3   0.648   0.789      73.7
Ecdysis triggering hormone receptor   72     0.845   0.987      100                56     0.617   0.777      70
Gonadotropin-releasing hormone        96.7   0.953   0.997      94.7               93.5   0.922   0.963      92.5
Hydroxycarboxylic acid receptor       97.7   0.976   1.00       97.7               95.3   0.94    0.975      93.2
Leukotriene B4 receptor               91.5   0.927   0.996      94.7               78     0.751   0.882      75.4
Lysosphingolipid and LPA              100    1.00    1.00       100                100    1.00    1.00       100
Nucleotide-like                       96     0.96    0.997      96.9               93.9   0.943   0.967      95.9
Olfactory                             100    1.00    1.00       100                100    1.00    1.00       100
Rhodopsin                             95.1   0.974   0.998      100                91.8   0.929   0.957      94.9
Peptide                               90.8   0.933   0.994      96.7               87.7   0.891   0.936      91.9
Platelet activating factor            98     0.922   0.997      87.5               96     0.911   0.976      87.3
Prostanoid                            100    0.983   1.00       96.8               96.7   0.956   0.982      95.1
Thyrotropin-releasing hormone         94.4   0.899   0.994      87                 84.5   0.814   0.915      81.1
Viral                                 95.7   0.945   0.997      93.8               93.6   0.903   0.965      88
Overall                               95.4   0.951   0.996      95.5               91.5   0.909   0.955      91.4


Fig. 9 – Accuracy and MCC for the UNION of best 10 features (47) selected by different feature selection algorithms with features selected from these algorithms for the classification of class A GPCR.


Fig. 10 – Accuracy and MCC for the classification of class A GPCR with the UNION of best 50 features (187) selected by different feature selection algorithms with features selected from these algorithms and total number of features.


Fig. 11 – Accuracy and MCC for the classification of class A GPCR with UNION of best 50 features (187) selected by different feature selection algorithms.


5. Conclusions

The G-protein coupled receptors form the largest superfamily of membrane proteins and are important targets for drug design. Here, eight feature vectors were used to represent the protein sample, including amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, sequence order descriptors and pseudo amino acid composition. In this paper, first an optimal feature subset selection method was proposed which provides a non-redundant, relevant and robust feature subset by taking the UNION of the best 50 features selected by the supervised feature selection methods Fisher score based feature selection, ReliefF, FCBF, mRMR and SVM-RFE. In the next stage, a weighted k-nearest neighbor classifier was used to predict the G-protein coupled receptors and their subfamilies. Using 10-fold cross-validation, the proposed method achieved overall accuracies of 99.9%, 98.3% and 95.4%, MCC values of 1.00, 0.98 and 0.95, ROC area values of 1.00, 0.998 and 0.996, and precisions of 99.9%, 98.3% and 95.5% for the prediction of G-protein coupled receptors and non-G-protein coupled receptors, subfamilies of G-protein coupled receptors, and subfamilies of class A G-protein coupled receptors, respectively. The performance of the proposed method indicates that it is useful for the prediction of G-protein coupled receptor families and their subfamilies. Therefore, we shall make efforts in our future work to provide a web server for the proposed method.

REFERENCES

[1] M. Bhasin, G.P.S. Raghava, GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors, Nucleic Acids Res. 32 (Suppl. 2) (2004) W383–W389. [2] M. Bhasin, G.P.S. Raghava, GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors, Nucleic Acids Res. 33 (Suppl. 2) (2005) W143–W147. [3] Q.B. Gao, Z.Z. Wang, Classification of G-protein coupled receptors at four levels, Protein Eng. Des. Sel. 19 (2006) 511– 516. [4] Q. Gu, Y. Ding, Binary particle swarm optimization based prediction of G-protein-coupled receptor families with feature selection, in: Proceedings of the first ACM/SIGEVO Summit on Genetic and Evolutionary Computation, ACM, 2009, pp. 171–176. [5] Q. Gu, Y.S. Ding, T.L. Zhang, Prediction of G-protein-coupled receptor classes in low homology using chous pseudo amino acid composition with approximate entropy and hydrophobicity patterns, Protein Pept. Lett. 17 (5) (2010) 559– 567. [6] Z.L. Peng, J.Y. Yang, X. Chen, An improved classification of G-protein-coupled receptors using sequence-derived features, BMC Bioinformatics 11 (1) (2010) 420. [7] W.Z. Lin, X. Xiao, GPCR-GIA: a web-server for identifying G-protein coupled receptors and their families with grey incidence analysis, Protein Eng. Des. Sel. 22 (2009) 699–705. [8] X. Xiao, P. Wang, GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes, J. Comput. Chem. 30 (2009) 1414–1423.


[9] X. Xiao, P. Wang, GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions, Mol. Biosyst. 7 (2011) 911–919.
[10] X. Xiao, J.L. Min, P. Wang, iGPCR-drug: a web server for predicting interaction between GPCRs and drugs in cellular networking, PLoS ONE 8 (2013) e72234.
[11] D.W. Elrod, Bioinformatical analysis of G-protein-coupled receptors, J. Proteome Res. 1 (2002) 429–433.
[12] K.C. Chou, Coupling interaction between thromboxane A2 receptor and alpha-13 subunit of guanine nucleotide-binding protein, J. Proteome Res. 4 (2005) 1681–1686.
[13] K.C. Chou, Prediction of G-protein-coupled receptor classes, J. Proteome Res. 4 (2005) 1413–1418.
[14] J.D. Qiu, J.H. Huang, R.P. Liang, X.Q. Lu, Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: an approach from discrete wavelet transform, Anal. Biochem. 390 (2009) 68–73.
[15] Zia-ur-Rehman, A. Khan, Identifying GPCRs and their types with Chou's pseudo amino acid composition: an approach from multi-scale energy representation and position specific scoring matrix, Protein Pept. Lett. 19 (2012) 890–903.
[16] H.L. Xie, L. Fu, X.D. Nie, Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou's PseAAC, Protein Eng. Des. Sel. 26 (2013) 735–742.
[17] D.W. Elrod, A study on the correlation of G-protein-coupled receptor types with amino acid composition, Protein Eng. 15 (2002) 713–715.
[18] X. Xiao, W.Z. Lin, Recent advances in predicting G-protein coupled receptor classification, Curr. Bioinform. 7 (2012) 132–142.
[19] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review), J. Theor. Biol. 273 (2011) 236–247.
[20] F. Horn, E. Bettler, L. Oliveira, F. Campagne, F.E. Cohen, G. Vriend, GPCRDB information system for G protein-coupled receptors, Nucleic Acids Res. 31 (1) (2003) 294–297.
[21] W. Li, A. Godzik, Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics 22 (2006) 1658–1659.
[22] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[23] K. Kira, L.A. Rendell, The feature selection problem: traditional methods and a new algorithm, in: AAAI, 1992, pp. 129–134.
[24] L. Yu, H. Liu, Feature selection for high-dimensional data: a fast correlation-based filter solution, in: ICML, vol. 3, 2003, pp. 856–863.
[25] C. Ding, H. Peng, Minimum redundancy feature selection from microarray gene expression data, J. Bioinform. Comput. Biol. 3 (2) (2005) 185–205.
[26] Y. Yu, SVM-RFE algorithm for gene feature selection, Computer Engineering (2008).
[27] W. Chen, H. Ding, P. Feng, iACP: a sequence-based tool for identifying anticancer peptides, Oncotarget (2016) http://dx.doi.org/10.18632/oncotarget.7815.
[28] W. Chen, P. Feng, H. Ding, H. Lin, Using deformation energy to analyze nucleosome positioning in genomes, Genomics 107 (2016) 69–75.
[29] J. Jia, Z. Liu, X. Xiao, iPPBS-Opt: a sequence-based ensemble classifier for identifying protein-protein binding sites by optimizing imbalanced training datasets, Molecules 21 (2016) 95.
[30] J. Jia, Z. Liu, X. Xiao, B. Liu, iSuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset, Anal. Biochem. 497 (2016) 48–56.
[31] J. Jia, Z. Liu, X. Xiao, pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach, J. Theor. Biol. 394 (2016) 223–230.
[32] B. Liu, L. Fang, F. Liu, X. Wang, iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach, J. Biomol. Struct. Dyn. 34 (2016) 223–235.
[33] B. Liu, L. Fang, R. Long, iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition, Bioinformatics 32 (2016) 362–389.
[34] Z. Liu, X. Xiao, D.J. Yu, J. Jia, pRNAm-PC: predicting N-methyladenosine sites in RNA sequences via physical-chemical properties, Anal. Biochem. 497 (2016) 60–67.
[35] B. Liu, F. Liu, X. Wang, J. Chen, Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences, Nucleic Acids Res. 43 (2015) W65–W71.
[36] W. Chen, P.M. Feng, H. Lin, iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Res. 41 (2013) e68.
[37] H. Lin, E.Z. Deng, H. Ding, iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition, Nucleic Acids Res. 42 (2014) 12961–12972.
[38] W. Chen, P.M. Feng, E.Z. Deng, iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition, Anal. Biochem. 462 (2014) 76–83.
[39] H. Ding, E.Z. Deng, L.F. Yuan, L. Liu, iCTX-Type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels, Biomed Res. Int. 2014 (2014) 286419.
[40] S.H. Guo, E.Z. Deng, L.Q. Xu, H. Ding, iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition, Bioinformatics 30 (2014) 1522–1529.
[41] Z. Liu, X. Xiao, W.R. Qiu, iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition, Anal. Biochem. 474 (2015) 69–77; see also Data in Brief 4 (2015) 87–89.
[42] J. Jia, Z. Liu, X. Xiao, iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC, J. Theor. Biol. 377 (2015) 47–56.
[43] B. Liu, R. Long, iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework, Bioinformatics (2016) http://dx.doi.org/10.1093/bioinformatics/btw186.
[44] H.B. Rao, F. Zhu, G.B. Yang, Z.R. Li, Y.Z. Chen, Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res. 39 (Suppl. 2) (2011) W385–W390.
[45] S. Hua, Z. Sun, Support vector machine approach for protein subcellular localization prediction, Bioinformatics 17 (8) (2001) 721–728.
[46] J. Wang, W.K. Sung, A. Krishnan, K.B. Li, Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines, BMC Bioinformatics 6 (2005) 174.
[47] E. Nasibov, C. Kandemir-Cavas, Efficiency analysis of KNN and minimum distance-based classifiers in enzyme family prediction, Comput. Biol. Chem. 33 (6) (2009) 461–464.

[48] A.N. Mbah, Application of hybrid functional groups to predict ATP binding proteins, ISRN Comput. Biol. 2014 (2014) 581245.
[49] A. Garg, M. Bhasin, G.P. Raghava, SVM-based method for subcellular localization of human proteins using amino acid compositions, their order and similarity search, J. Biol. Chem. 280 (2005) 14427–14432.
[50] Y. Huang, Y. Li, Prediction of protein subcellular locations using fuzzy k-NN method, Bioinformatics 20 (1) (2004) 21–28.
[51] I. Dubchak, I. Muchnik, S.R. Holbrook, S.H. Kim, Prediction of protein folding class using global description of amino acid sequence, Proc. Natl. Acad. Sci. U.S.A. 92 (1995) 8700–8704.
[52] I. Dubchak, I. Muchnik, C. Mayor, I. Dralyuk, S.H. Kim, Recognition of a protein fold in the context of the SCOP classification, Proteins 35 (1999) 401–407.
[53] Z. Li, X. Zhou, Z. Dai, X. Zou, Classification of G-protein coupled receptors based on support vector machine with maximum relevance minimum redundancy and genetic algorithm, BMC Bioinformatics 11 (1) (2010) 325.
[54] P. Du, X. Wang, C. Xu, Y. Gao, PseAAC-builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions, Anal. Biochem. 425 (2012) 117–119.
[55] D.S. Cao, Q.S. Xu, Y.Z. Liang, propy: a tool to generate various modes of Chou's PseAAC, Bioinformatics 29 (2013) 960–962.
[56] P. Du, S. Gu, Y. Jiao, PseAAC-general: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets, Int. J. Mol. Sci. 15 (2014) 3495–3506.
[57] Y.Z. Guo, M. Li, M. Lu, Z. Wen, K. Wang, G. Li, et al., Classifying G protein-coupled receptors and nuclear receptors on the basis of protein power spectrum from fast Fourier transform, Amino Acids 30 (4) (2006) 397–402.
[58] W.Z. Lin, J.A. Fang, X. Xiao, K.C. Chou, iDNA-Prot: identification of DNA binding proteins using random forest with grey model, PLoS ONE 6 (9) (2011) e24756.
[59] Y. Fang, Y. Guo, Y. Feng, M. Li, Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features, Amino Acids 34 (1) (2008) 103–109.
[60] J.Y. Shi, S.W. Zhang, Q. Pan, Y.M. Cheng, J. Xie, Prediction of protein subcellular localization by support vector machines using multi-scale energy and pseudo amino acid composition, Amino Acids 33 (1) (2007) 69–74.
[61] F.M. Li, Q.Z. Li, Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach, Protein Pept. Lett. 15 (2008) 612–616.
[62] J. Ma, H. Gu, A novel method for predicting protein subcellular localization based on pseudo amino acid composition, BMB Rep. 43 (10) (2010) 670–676.
[63] Y.C. Wang, X.B. Wang, Z.X. Yang, N.Y. Deng, Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature, Protein Pept. Lett. 17 (2010) 1441–1449.
[64] L. Lu, Z. Qian, Y.D. Cai, Y. Li, ECS: an automatic enzyme classifier based on functional domain composition, Comput. Biol. Chem. 31 (3) (2007) 226–232.
[65] C. Chen, Y.X. Tian, X.Y. Zou, P.X. Cai, J.Y. Mo, Using pseudo-amino acid composition and support vector machine to predict protein structural class, J. Theor. Biol. 243 (3) (2006) 444–448.
[66] K. Hechenbichler, K. Schliep, Weighted k-nearest-neighbor techniques and ordinal classification, Discussion Paper 399, SFB 386, Ludwig-Maximilians University, Munich, 2004.
[67] G.P. Zhou, An intriguing controversy over protein structural class prediction, J. Protein Chem. 17 (1998) 729–738.
[68] Y.D. Cai, Prediction of membrane protein types by incorporating amphipathic effects, J. Chem. Inf. Model. 45 (2005) 407–413.
[69] H.B. Shen, Virus-PLoc: a fusion classifier for predicting the subcellular localization of viral proteins within host and virus-infected cells, Biopolymers 85 (2007) 233–240.
[70] G.P. Zhou, K. Doctor, Subcellular location prediction of apoptosis proteins, Proteins 50 (2003) 44–48.
[71] Z. Hajisharifi, M. Piryaiee, M. Mohammad Beigi, M. Behbahani, H. Mohabatkar, Predicting anticancer peptides with Chou's pseudo amino acid composition and investigating their mutagenicity via Ames test, J. Theor. Biol. 341 (2014) 34–40.
[72] S. Mondal, P.P. Pai, Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction, J. Theor. Biol. 356 (2014) 30–35.
[73] L. Nanni, S. Brahnam, A. Lumini, Prediction of protein structure classes by incorporating different protein descriptors into general Chou's pseudo amino acid composition, J. Theor. Biol. 360 (2014) 109–116.
[74] F. Ali, M. Hayat, Classification of membrane protein types using Voting Feature Interval in combination with Chou's Pseudo Amino Acid Composition, J. Theor. Biol. 384 (2015) 78–83.
[75] A. Dehzangi, R. Heffernan, A. Sharma, J. Lyons, K. Paliwal, A. Sattar, Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou's general PseAAC, J. Theor. Biol. 364 (2015) 284–294.
[76] Z.U. Khan, M. Hayat, M.A. Khan, Discrimination of acidic and alkaline enzyme using Chou's pseudo amino acid composition in conjunction with probabilistic neural network model, J. Theor. Biol. 365 (2015) 197–203.
[77] R. Kumar, A. Srivastava, B. Kumari, M. Kumar, Prediction of beta-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine, J. Theor. Biol. 365 (2015) 96–103.
[78] R. Sharma, A. Dehzangi, J. Lyons, K. Paliwal, T. Tsunoda, A. Sharma, Predict Gram-positive and Gram-negative subcellular localization via incorporating evolutionary information and physicochemical features into Chou's general PseAAC, IEEE Trans. Nanobioscience 14 (2015) 915–926.