Analytical Biochemistry 588 (2020) 113477
Analytical Biochemistry journal homepage: www.elsevier.com/locate/yabio
iProtease-PseAAC(2L): A two-layer predictor for identifying proteases and their types using Chou's 5-step-rule and general PseAAC
Yaser Daanial Khan a,∗, Najm Amin a, Waqar Hussain b, Nouman Rasool c, Sher Afzal Khan d,f, Kuo-Chen Chou e
a Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore, 54770, Pakistan
b National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
c Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
d Faculty of Computing and Information Technology in Rabigh, Jeddah, 21577, Saudi Arabia
e Gordon Life Science Institute, Boston, MA, 02478, USA
f Abdul Wali Khan University, Department of Computer Sciences, Mardan, Pakistan
Keywords: Protease; PseAAC; Statistical moments; 5-step rule; Prediction
Abstract
Proteases are enzymes that carry out proteolysis, the degradation of proteins and peptides, which is crucial for the survival, growth and wellbeing of a cell. Moreover, proteases have a strong association with therapeutics and drug development. Proteases are classified into five types according to their nature and physicochemical characteristics. Most methods used to differentiate proteases from other proteins and to identify their class require a clinical test, which is usually time-consuming and operator-dependent. Herein, we report a classifier named iProtease-PseAAC (2L) for identifying proteases and their classes. The predictor is developed following the 5-step rule, starting with the collection of a benchmark dataset and ending with the development of the predictor. Rigorous verification and validation tests are performed, and metrics are collected to assess the reliability of the trained model. Self-consistency validation gives 98.32% accuracy, 10-fold cross-validation gives 90.71% and the jackknife test gives 96.07%. The average accuracy for level-2, i.e. protease classification, is 95.77%. Based on these results, it is concluded that iProtease-PseAAC (2L) can identify proteases and their classes from a given protein sequence.
1. Introduction
Enzymes play many critical roles in the body and perform a variety of functions depending on their nature and characteristics [1]. These characteristics depend on the composition of the protein, i.e. the presence and arrangement of its amino acids. Based on different combinations of amino acids, proteins are grouped into different classes. Proteases are enzymes that carry out proteolysis [2]. Proteolysis normally refers to protein and peptide degradation, which is crucial for the survival, growth and wellbeing of a cell [3]. Cellular proteolysis is important for most biological functions, such as the removal of misfolded proteins, regulation of transcription factors, precursor processing and many more [4]. Proteolysis involves proteases and peptidases, which are known to perform a variety of functions inside and outside the cell. They help the cell degrade abnormal proteins that may be produced under environmental and chemical stress. Sometimes a protein may undergo denaturation due to temperature fluctuation or extreme conditions [5]. Proteolytic enzymes break the molecular chain down into small fragments. Proteases are useful in addressing different conditions such as food allergies, cancer, hepatitis C, asthma and indigestion. In the manufacturing of baby foods, proteases are also used to predigest proteins [1]. These enzymes are used in different therapies, medicines and some clinical studies. Proteases play an essential role throughout the lifecycle of all organisms, including birth, growth, digestion, motorization, ageing and death [4,6,7]. Proteases are also pivotal in regulating physiological processes and in controlling the synthesis and turnover of proteins. Proteases are used for different purposes related to medicine or surgery [5,8]. They are also used as a component of different therapies and are recognized as ingredients of traditional or common medicine. Such uses include the elimination of dead or damaged tissue from a wound in order to support healing [2,9]. Proteolytic enzymes have a long history in cancer treatment: John Beard reported on cancer treatment using proteases in 1906. Currently, clinical research suggests that proteases offer vital benefits in the treatment of different types of cancer [10]. Proteases in food improve amino acid digestibility across different proteins. They increase the digestion of protein and reduce the effect of anti-nutritional elements. Proteases can break down anti-nutritional factors and reduce the amount of undigested protein entering the hind GI tract [3,6,9,11]. In the large intestine, this decreases protein fermentation and improves gut health. Protease inclusion improves the digestibility of protein, resulting in a higher energy value of the diet, and allows a broader range of protein sources, including locally available ones, to be utilized. Besides all this, proteases can cause infections in insects and animals through the transmission of diseases and are also vital for viruses and bacteria [2].
Proteases are classified into five groups, i.e. aspartic proteases, cysteine proteases, serine proteases, metalloproteases and threonine proteases. Most methods used to differentiate proteases from other proteins and to identify their class require a clinical test, which is usually time-consuming and operator-dependent [12]. In the last few years, many studies in bioinformatics and computational biology have been reported that help in identifying the function and characteristics of proteins (see, e.g., [13–64]).
To address this problem, we propose a computational model, named iProtease-PseAAC (2L), for identifying proteases and their classes using Chou's PseAAC [65]. Relative/absolute position-based features and statistical moments are incorporated into the general PseAAC. We have employed Chou's 5-step rule [66] for this purpose, which has been widely used in a series of publications (see, e.g., Refs. [16,18,63–77]) and comprises five steps: (i) benchmark dataset creation and collection; (ii) association of biological samples with target classes via mathematical formulation; (iii) incorporation of an operational prediction algorithm for classification/identification; (iv) validation of results via subsampling and other methods; and (v) development of a webserver. These steps are addressed one by one below.

∗ Corresponding author. Department of Computer Science, School of Systems and Technology, University of Management and Technology, C-II Johar Town, Lahore, Pakistan. E-mail address: [email protected] (Y.D. Khan).
https://doi.org/10.1016/j.ab.2019.113477 Received 4 February 2019; Received in revised form 2 October 2019; Accepted 18 October 2019 Available online 22 October 2019 0003-2697/ © 2019 Elsevier Inc. All rights reserved.

2. Materials and methods
In order to create a robust computational model, it is essential to assemble a dependable and precise benchmark dataset for training and testing. A predictor trained on a faulty dataset is bound to produce untrustworthy results, however rigorous the verification and validation. It is of utmost importance that the collected dataset is comprehensive, relevant, non-redundant and accurate. For the construction of the iProtease-PseAAC (2L) computational model, a dataset of protein sequences was collected. A feature vector of relevant numerical features is extracted from the primary structure of each protein. A neural network is trained on these extracted features until convergence is achieved. The first three steps of Chou's 5-step rule are addressed here (Fig. 1).

2.1. Benchmark dataset collection
The well-known, publicly available UniProt database was the main source for collecting protease and non-protease protein sequences. To obtain relevant positive sequences, ‘protease’ and/or ‘[class of protease]’ was used as a keyword. The dataset was meticulously curated by excluding ambiguous sequences: only sequences that were not annotated with dubious terms such as "potential", "by similarity" or "probable" were selected. Moreover, each sequence had to be complete, i.e. not annotated as a "fragment", and had to be annotated with one of the classes aspartic, cysteine, metallo, serine or threonine protease. CD-HIT [67] was used to remove redundancy and homology bias by excluding sequences with greater than or equal to 60% similarity. Consequently, a high-quality dataset was obtained that also contained the most recently discovered protease sequences. After CD-HIT [67], a total of 3339 proteases were collected, comprising 305 aspartic, 1207 metallo, 923 serine, 192 threonine and 712 cysteine proteases. Similarly, a set of 3500 non-protease sequences was collected from the UniProt database. Following Chou's scheme [66], a protein can be expressed as

$$\mathrm{K}_{\rho}(\beta) = \mathrm{M}_0 \mathrm{M}_1 \cdots \mathrm{M}_{(n-1)} \mathrm{M}_n \tag{1}$$

Considering all of the above, the benchmark dataset was reduced to

$$T = T^{+} \cup T^{-} \tag{2}$$

In this equation, $T^{+}$ holds the 3339 positive samples, $T^{-}$ holds the 3500 negative samples and ∪ represents the union of two sets. In total, 3339 + 3500 = 6839 samples are included in the benchmark dataset (Supplementary information S1).

2.2. Sample formulation
Amino acids constitute the polypeptide chain in a particular sequence. This sequence controls the biophysical characteristics of the protein. A characteristic is not controlled merely by the presence or absence of an amino acid: the behaviour of a protein is affected by multiple factors, such as its amino acid composition and the positioning of its residues. Observations and experience indicate that a slight change in the relative ordering or composition of amino acid residues may substantially modify the attributes of a protein. Based on these facts, features extracted from the primary structure of a protein using a mathematical model should take into consideration both the constituents of the protein and the relative positions of its amino acids. Besides this, various feature representations and ensembling methods exist, which have been employed by different researchers in various studies. Ensemble learning is an intensively studied technique in machine learning and pattern recognition. Recent work in computational biology has seen an increasing use of ensemble learning methods due to their unique advantages in dealing with small sample sizes, high dimensionality and complex data structures [68–71]. However, in this study, iProtease-PseAAC (2L) uses a feature extraction method that extends the technique used in Refs. [51,52,72]. Using Chou's general PseAAC [65], a protein sequence can be expressed as

$$P_{\varsigma} = [\psi_1\ \psi_2\ \cdots\ \psi_n\ \cdots\ \psi_{\Omega}]^{T} \tag{3}$$

where $\psi_n$ (n = 1, 2, 3, ⋯, Ω) are the components of the vector and T denotes the transpose operator. The components in Eq. (3) are the useful information extracted from the protein/peptide sequence.
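The annotation filters of Section 2.1 and the union in Eq. (2) can be illustrated with a minimal sketch. The record layout (`annotation`, `sequence`), the helper names and the toy entries are hypothetical, and applying the same annotation filter to the negative set is an assumption; the actual UniProt queries and the CD-HIT redundancy step are not reproduced here.

```python
# Terms the text lists as grounds for excluding a sequence.
DUBIOUS_TERMS = ("potential", "by similarity", "probable", "fragment")
PROTEASE_CLASSES = ("aspartic", "cysteine", "metallo", "serine", "threonine")


def is_clean(annotation):
    """True if the annotation contains none of the dubious terms."""
    text = annotation.lower()
    return not any(term in text for term in DUBIOUS_TERMS)


def protease_class(annotation):
    """Return the first protease class named in the annotation, or None."""
    text = annotation.lower()
    for name in PROTEASE_CLASSES:
        if name in text:
            return name
    return None


def build_benchmark(positives, negatives):
    """Return T = T+ (label 1) followed by T- (label 0), as in Eq. (2)."""
    t_pos = [
        (rec["sequence"], 1, protease_class(rec["annotation"]))
        for rec in positives
        if is_clean(rec["annotation"]) and protease_class(rec["annotation"])
    ]
    # Assumption: negatives are filtered with the same annotation rule.
    t_neg = [(rec["sequence"], 0, None) for rec in negatives if is_clean(rec["annotation"])]
    return t_pos + t_neg


# Toy records (hypothetical, not taken from the benchmark dataset).
positives = [
    {"annotation": "Serine protease", "sequence": "MKT"},
    {"annotation": "Probable cysteine protease", "sequence": "MAA"},
    {"annotation": "Aspartic protease (Fragment)", "sequence": "MLL"},
]
negatives = [{"annotation": "Transport protein", "sequence": "MGG"}]
benchmark = build_benchmark(positives, negatives)
```

Only the first positive record survives the filter in this toy input, because the other two carry the terms "probable" and "fragment".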
Fig. 1. Process of developing the prediction model using the first three steps of Chou's 5-step rule.
2.2.1. Statistical moments calculation
Statistical moments are quantitative measures that describe a collection of data. Moments of different orders describe different properties of the data. Some moments help evaluate the size of the data, while others indicate its eccentricity and orientation. Statisticians and mathematicians have formulated different moments based on certain distribution functions and polynomials. iProtease-PseAAC (2L) uses three categories of moments: raw, central and Hahn moments. Raw moments capture properties of a distribution such as its mean, variance and asymmetry. They are the most rudimentary form and are not invariant to location, scale or rotation. Since central moments are calculated about the centroid, they are location-invariant but still scale-variant. Another popular set is the Hahn moments, which are derived from Hahn polynomials and hence are location- and scale-variant. The main rationale for choosing these moments is their sensitivity to the placement and composition of residues, which is of fundamental significance as discussed earlier. The values computed by each scheme describe the data in their own way. Moreover, the numerical values of the moments reflect the variance in arbitrary datasets [73–76]. These orthogonal moments can represent an object with minimal loss of information [77,78]. Only 20 amino acids are relevant to protein synthesis. In order to calculate moments, a unique integer index is assigned to each amino acid residue. As long as the assigned indices are integral, unique and consistent, the specific values chosen hardly make any difference. Firstly, a mapping mechanism is developed to transform the one-dimensional primary structure into a two-dimensional representation. Suppose S denotes the protein sequence, given as
$$S = \{\beta_1, \beta_2, \beta_3, \ldots, \beta_{m-1}, \beta_m\} \tag{4}$$

where m is the number of residues in the primary sequence of the protein and $\beta_i$ is the ith amino acid residue. Also let

$$z = \lceil \sqrt{m}\, \rceil \tag{5}$$

All amino acid components of the protein S are held by the matrix S′, created with z × z dimensions:

$$S' = \begin{bmatrix} \kappa_{11} & \kappa_{12} & \cdots & \kappa_{1z} \\ \kappa_{21} & \kappa_{22} & \cdots & \kappa_{2z} \\ \vdots & \vdots & \ddots & \vdots \\ \kappa_{z1} & \kappa_{z2} & \cdots & \kappa_{zz} \end{bmatrix} \tag{6}$$

The two-dimensional matrix S′ corresponds to the sequence S. S is converted to S′ by using ν as the mapping function:

$$\nu(\beta_x) = \alpha_{pq} \tag{7}$$

where $\alpha_{pq}$ denotes the element $\kappa_{pq}$ of S′, $p = \frac{c}{d} + 1$ and $q = c \bmod d$ if S′ is populated in row-major order, with c the position of the residue in S and d the dimension of S′. Moments up to the third order are calculated from the two-dimensional matrix S′. The raw moments are computed as

$$Z_{mn} = \sum_{x=1}^{z} \sum_{y=1}^{z} x^{m} y^{n} \alpha_{xy} \tag{8}$$

where m + n denotes the order of the moment. Moments up to order three are calculated as $Z_{00}, Z_{01}, Z_{02}, Z_{10}, Z_{11}, Z_{12}, Z_{20}, Z_{21}, Z_{30}$ and $Z_{03}$. The data centre is analogous to a centre of gravity: the data is evenly distributed about this point with respect to its average weight. It is calculated from the raw moments and is given by the point $(\bar{v}, \bar{w})$, where

$$\bar{v} = \frac{Z_{10}}{Z_{00}} \quad \text{and} \quad \bar{w} = \frac{Z_{01}}{Z_{00}} \tag{9}$$

Central moments are calculated with the help of the centroid, which acts as the centre of gravity of the data. The following equation is used to calculate the central moments:

$$B_{st} = \sum_{k=1}^{z} \sum_{l=1}^{z} (k - \bar{v})^{s} (l - \bar{w})^{t} \alpha_{kl} \tag{10}$$

In order to calculate the Hahn moments, the one-dimensional representation S was converted to the square matrix representation S′, since two-dimensional Hahn moments require two-dimensional input data. The Hahn polynomial of order m is given as

$$\omega_m^{a,b}(p, M) = (M + b - 1)_m (M - 1)_m \times \sum_{l=0}^{m} (-1)^{l} \frac{(-m)_l (-p)_l (2M + a + b - m - 1)_l}{(M + b - 1)_l (M - 1)_l} \frac{1}{l!} \tag{11}$$

The above expression uses the Pochhammer symbol, generalized as

$$(b)_l = b \cdot (b + 1) \cdots (b + l - 1) \tag{12}$$

and simplified using the Gamma operator:

$$(b)_l = \frac{\Gamma(b + l)}{\Gamma(b)} \tag{13}$$

The raw values of the Hahn moments are usually scaled using a weighting function and a square norm, given as

$$\tilde{\beta}_m^{a,b}(p, M) = \beta_m^{a,b}(p, M) \sqrt{\frac{o(p)}{c_m^{2}}}, \quad m = 0, 1, \ldots, M - 1 \tag{14}$$

while

$$o(p) = \frac{\Gamma(a + p + b)\, \Gamma(b + p + 1)\, (a + b + p + 1)_M}{(a + b + 2p + 1)\, m!\, (M - p - 1)!} \tag{15}$$

The orthogonal normalized Hahn moments for two-dimensional discrete data are computed using the following equation:

$$G_{ef} = \sum_{a=0}^{M-1} \sum_{b=0}^{M-1} \alpha_{ab}\, \tilde{J}_t^{c,d}(a, M)\, \tilde{J}_s^{u,v}(b, M), \quad m, n = 0, 1, \ldots, M - 1 \tag{16}$$

The central moments and the Hahn moments are computed up to order 3. For the calculation of all moments, the protocols defined in Refs. [29,30,51,52,55,72,78] were employed.

2.2.2. Position relative incidence matrix
The information carried by the sequence is the basis of a mathematical model that predicts the role of a protein. The location of amino acids plays a key role in determining the physical properties of the protein. It is therefore important to quantify the relative placement of amino acids in the polypeptide chain. The position relative incidence matrix (PRIM) extracts the positional information of amino acids in the polypeptide chain. This matrix records the relative occurrences of amino acids in a peptide. These matrices are then used to compute moments, from which the feature vectors are formed. The PRIM is a 20 × 20 matrix, as given below.

$$Z_{PRIM} = \begin{bmatrix} Q_{1\to1} & Q_{1\to2} & Q_{1\to3} & \cdots & Q_{1\to b} & \cdots & Q_{1\to20} \\ Q_{2\to1} & Q_{2\to2} & Q_{2\to3} & \cdots & Q_{2\to b} & \cdots & Q_{2\to20} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ Q_{d\to1} & Q_{d\to2} & Q_{d\to3} & \cdots & Q_{d\to b} & \cdots & Q_{d\to20} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ Q_{U\to1} & Q_{U\to2} & Q_{U\to3} & \cdots & Q_{U\to b} & \cdots & Q_{U\to20} \end{bmatrix} \tag{17}$$

An element $Q_{d\to b}$ holds the total for the bth residue with respect to the first occurrence of the dth residue. PRIM yields 400 coefficients, which is a large number. To reduce the number of coefficients, moments are calculated. In a given protein sequence, $Q_{d\to b}$ indicates the gain of the dth position residue. In the genetic evolutionary process, this gain is replaced by the amino acid type. The values d = 1, 2, …, 20 represent the sequential order of the 20 native amino acid residues. These calculations followed the protocols defined in Refs. [77,78].
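The mapping of Eqs. (4)–(7), the raw and central moments of Eqs. (8)–(10) and the PRIM of Eq. (17) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the residue ordering in `AMINO_ACIDS`, the zero padding of unused cells, the 0-based row-major fill and the reading of $Q_{d\to b}$ as a sum of offsets counted only after the first occurrence of residue d are all assumptions, and the function names are hypothetical. Hahn moments are omitted.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # unique integer per residue


def to_square_matrix(seq):
    """Row-major fill of a z x z matrix, z = ceil(sqrt(len(seq))); seq must be non-empty."""
    z = math.isqrt(len(seq) - 1) + 1
    grid = [[0] * z for _ in range(z)]  # zero padding is an assumption
    for pos, aa in enumerate(seq):
        row, col = divmod(pos, z)
        grid[row][col] = INDEX[aa]
    return grid


def raw_moment(grid, m, n):
    """Z_mn = sum over x, y of x^m * y^n * a_xy with 1-based x, y (Eq. 8)."""
    return sum((x ** m) * (y ** n) * value
               for x, row in enumerate(grid, start=1)
               for y, value in enumerate(row, start=1))


def central_moment(grid, s, t):
    """B_st about the centroid (Z10/Z00, Z01/Z00) (Eqs. 9 and 10)."""
    z00 = raw_moment(grid, 0, 0)
    v_bar = raw_moment(grid, 1, 0) / z00
    w_bar = raw_moment(grid, 0, 1) / z00
    return sum(((k - v_bar) ** s) * ((l - w_bar) ** t) * value
               for k, row in enumerate(grid, start=1)
               for l, value in enumerate(row, start=1))


def moments_up_to_order_3(grid):
    """Ten raw moments followed by ten central moments with m + n <= 3."""
    orders = [(m, n) for m in range(4) for n in range(4) if m + n <= 3]
    raw = [raw_moment(grid, m, n) for m, n in orders]
    central = [central_moment(grid, m, n) for m, n in orders]
    return raw + central


def prim(seq):
    """20 x 20 matrix; entry [d][b] sums the offsets of residue b that
    follow the first occurrence of residue d (assumed interpretation)."""
    size = len(AMINO_ACIDS)
    matrix = [[0] * size for _ in range(size)]
    first_seen = {}
    for pos, aa in enumerate(seq):
        first_seen.setdefault(aa, pos)
    for d_aa, start in first_seen.items():
        d = INDEX[d_aa] - 1
        for pos in range(start + 1, len(seq)):
            matrix[d][INDEX[seq[pos]] - 1] += pos - start
    return matrix


grid = to_square_matrix("ACDA")
features = moments_up_to_order_3(grid)
q = prim("ACDA")
```

Under the same assumptions, an RPRIM-style matrix would be obtained as `prim(seq[::-1])`, and the 400 coefficients of either matrix can be reduced by passing the matrix itself to `moments_up_to_order_3`.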
Fig. 2. Architecture of the artificial neural network for iProtease-PseAAC (2L).
2.2.3. Reverse position relative incidence matrix
The accuracy of a machine learning algorithm depends largely on the quality of feature extraction and on the ability of the algorithm to adapt itself to obscure patterns in the data. The relative positioning of amino acids in the polypeptide chain is extracted by the PRIM. The Reverse Position Relative Incidence Matrix (RPRIM) follows the same workflow on the reversed primary sequence. Adding RPRIM reveals further hidden patterns and ambiguities in the polypeptide sequence. Like PRIM, RPRIM is a 20 × 20 matrix with 400 elements. The RPRIM matrix is represented as
$$Q_{RPRIM} = \begin{bmatrix} R_{1\to1} & R_{1\to2} & \cdots & R_{1\to k} & \cdots & R_{1\to20} \\ R_{2\to1} & R_{2\to2} & \cdots & R_{2\to k} & \cdots & R_{2\to20} \\ \vdots & \vdots & & \vdots & & \vdots \\ R_{t\to1} & R_{t\to2} & \cdots & R_{t\to k} & \cdots & R_{t\to20} \\ \vdots & \vdots & & \vdots & & \vdots \\ R_{z\to1} & R_{z\to2} & \cdots & R_{z\to k} & \cdots & R_{z\to20} \end{bmatrix} \tag{18}$$

The dimension of the RPRIM matrix is reduced by calculating raw, central and Hahn moments, following the same methodology as in Section 2.2.2.

2.2.4. Frequency matrix
The amino acid sequence determines the native shape of the protein, and the number of occurrences of each amino acid is given by the frequency matrix. The frequency matrix plays a vital role in protein alignment. Sequence-order information is captured by PRIM; the frequency matrix does not hold it. Instead, this matrix covers the composition of the protein, complementing the positional information already captured by the position relative incidence matrix (PRIM) [77,78]. The frequency matrix is calculated as

$$\xi = \{\tau_1, \tau_2, \tau_3, \tau_4, \ldots, \tau_{20}\} \tag{19}$$

In this formula, $\tau_i$ represents the frequency of the ith native amino acid.

2.2.5. Accumulative Absolute Position Incidence Vector
The frequency matrix represents the amount of each amino acid residue in the polypeptide chain and gives information relevant to protein composition. It lacks information about the positions of amino acid residues in the chain, and this deficit is addressed by the Accumulative Absolute Position Incidence Vector (AAPIV). AAPIV represents the absolute positioning of amino acid residues in the polypeptide chain. It is a vector of 20 elements in which every element holds a numerical value derived from the ordered positions of the relevant residue in the primary sequence. The AAPIV is calculated by the method defined in Ref. [72]. The occurrences of a specific residue in the primary sequence are represented as

$$\upsilon^{k}_{r_1} \cdots \upsilon^{k}_{r_2} \cdots \upsilon^{k}_{r_3} \cdots \upsilon^{k}_{r_n} \tag{20}$$

which shows that residue $\upsilon_k$ is located at positions $r_1, r_2, r_3, \cdots, r_n$. Let AAPIV be represented as

$$T = \{\nu_1, \nu_2, \nu_3, \nu_4, \cdots, \nu_{20}\} \tag{21}$$

The ith element of AAPIV is then calculated as

$$\nu_i = \sum_{t=1}^{n} s_t \tag{22}$$

2.2.6. Reverse accumulative absolute position incidence vector (RAAPIV)
As discussed earlier, feature extraction aims to uncover obscure patterns. RAAPIV serves the same purpose; it is built from the reversed sequence. RAAPIV is calculated in the same way as AAPIV, by the method defined in Ref. [72]. RAAPIV contains 20 elements and is expressed as

$$\delta = \{o_1, o_2, o_3, o_4, o_5, \cdots, o_{20}\} \tag{23}$$

A specific residue in the reversed sequence is represented as

$$\omega^{k}_{m_1} \cdots \omega^{k}_{m_2} \cdots \omega^{k}_{m_3} \cdots \omega^{k}_{m_n} \tag{24}$$

where $\omega_k$ is the residue occurring in the reversed sequence and $m_1, m_2, m_3, \ldots, m_n$ are its ordered locations. The value of any element is calculated as

$$\ell_i = \sum_{m=1}^{n} t_m \tag{25}$$

2.3. Neural network
A neural network is one of the most important tools for solving the problem discussed in this paper; it simulates information processing (Fig. 2). The neural network explains the basic shape of each residue in a given protein. For training the network, positive and negative samples are prepared and used to calculate feature vectors, which represent the two-dimensional protein structure through raw, central and Hahn moments. The dataset was constructed from the positive and negative samples, and a feature vector (FV) consisting of a large number of coefficients was extracted for each sample. The FVs are then merged to form an input matrix, while the positive or negative label of each input vector is recorded in a separate output matrix. These two matrices are used to train the multilayer neural network. The input matrix supplies the inputs to the network, while the output matrix is used to compute the errors through backpropagation. To increase prediction accuracy and reduce the error rate, the gradient descent algorithm with an adaptive learning rate was used [74,79,80].
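The composition and absolute-position descriptors of Sections 2.2.4 to 2.2.6, together with the input and output matrices described above, can be sketched as follows. This is an illustration under stated assumptions rather than the published protocol of Ref. [72]: the residue ordering, the use of 1-based positions, the reading of each AAPIV element as the sum of the positions at which a residue occurs, and all function names are assumptions. The moments of PRIM, RPRIM and the two-dimensional sequence matrix, which also enter the feature vector, are left out for brevity.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues


def frequency_vector(seq):
    """tau_i: number of occurrences of each native amino acid (Eq. 19)."""
    return [seq.count(aa) for aa in AMINO_ACIDS]


def aapiv(seq):
    """Element i accumulates the 1-based positions at which residue i occurs."""
    totals = dict.fromkeys(AMINO_ACIDS, 0)
    for position, aa in enumerate(seq, start=1):
        totals[aa] += position  # raises KeyError for non-standard residues
    return [totals[aa] for aa in AMINO_ACIDS]


def raapiv(seq):
    """Same accumulation, computed on the reversed sequence."""
    return aapiv(seq[::-1])


def feature_vector(seq):
    """Concatenate the three 20-element descriptors into one fixed-size vector."""
    return frequency_vector(seq) + aapiv(seq) + raapiv(seq)


def build_matrices(samples):
    """Stack feature vectors into an input matrix and labels into an output matrix."""
    inputs = [feature_vector(seq) for seq, _ in samples]
    targets = [[label] for _, label in samples]
    return inputs, targets


# Toy samples (hypothetical): label 1 = protease, label 0 = non-protease.
inputs, targets = build_matrices([("ACAD", 1), ("DDCA", 0)])
```

Because every descriptor has a fixed length of 20, the resulting feature vector has the same size for sequences of any length, which is what allows the vectors to be stacked into a single input matrix.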
2.3.1. Gradient descent and adaptive learning
Different algorithms with different characteristics and performance are available to train a neural network. Among them, the gradient descent algorithm performs the best. It is an iterative minimization method that finds the set of weights used for making predictions. The main objective of the algorithm is to find weights that reduce the error of the model on the training dataset. Training starts from a randomly chosen set of weights, and at each step the weights are moved in the direction in which the loss function decreases most steeply. The gradient of the loss function is calculated with respect to all parameters, and the process is repeated along the negative gradient until a satisfactory minimum is found. A gradient is a multidimensional vector containing the slope of the loss function along every axis [73,74]. The weight W is updated using the learning rate R, the objective function F(W) and its gradient ∇F(W). The central goal of the algorithm is to find the ideal weight W by minimizing F(W). The parameters are iteratively computed at every stage by

$$W = W - R\, \nabla F(W) \tag{26}$$

The execution of the algorithm depends on the learning rate R, which is mostly kept constant. It determines the time needed for minimization: a small learning rate requires more time to reach an optimal point, whereas a high learning rate may prevent the function from ever reaching it. Thus, the learning rate should be chosen so that the optimal point can be reached. Training mostly starts with a higher learning rate, which slowly decreases as training proceeds. The learning rate may also differ at each layer, which reduces the chance of vanishing gradients, i.e. of the weights in the first layer ceasing to change. Consider $W_i$ and $W_{i+1}$ as sequentially calculated parameters. Using these weights, the output and the expected error are calculated. If the error is greater than in the previous iteration, the learning rate is decreased; if the error is smaller, the learning rate is increased. The weights are then discarded and the new weight $W_{i+1}$ is calculated. The weights calculated at successive iterations are denoted $(W_1, W_2, W_3, W_4, \ldots)$. The following equation is used to calculate the weight for the successive epoch:

$$W_{t+1} = W_t - R_t\, \nabla L(W_t) \tag{27}$$

In this equation, $R_t$ is the learning rate used for the tth epoch. The adaptive algorithm adjusts the learning rate while minimizing the function at each epoch. The following condition must be fulfilled when choosing the learning rate:

$$L(W_0) \ge L(W_1) \ge L(W_2)$$

3. Results and discussion
3.1. Estimated accuracy
The objective evaluation of a newly developed predictor is a very important aspect, which helps to assess the success rate of the model [65]. For such an evaluation, one needs to consider two important factors: (i) the selection of accuracy metrics and (ii) the testing method employed to validate the model. Herein, we first formulate the metrics for objective evaluation and then employ various validation methods.

3.2. Formulation of metrics
The most common practice for the objective evaluation of a predictor is the use of the following accuracy metrics: (1) accuracy (Acc), which estimates the overall accuracy of the prediction model; (2) sensitivity (Sn), which estimates the capability of predicting positive samples; (3) specificity (Sp), which estimates the capability of predicting negative samples; and (4) the Matthews correlation coefficient (MCC), which estimates the stability of the prediction model. These measures were initially introduced in Ref. [81], and a set of four intuitive equations was derived for them in Refs. [82,83]:

$$\begin{cases} Sn = 1 - \dfrac{N^{+}_{-}}{N^{+}}, & 0 \le Sn \le 1 \\[2ex] Sp = 1 - \dfrac{N^{-}_{+}}{N^{-}}, & 0 \le Sp \le 1 \\[2ex] Acc = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, & 0 \le Acc \le 1 \\[2ex] MCC = \dfrac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}, & -1 \le MCC \le 1 \end{cases} \tag{28}$$

where $N^{-}$ represents the total number of non-proteases correctly predicted as non-proteases by iProtease-PseAAC (2L), and $N^{-}_{+}$ represents the total number of non-proteases incorrectly predicted as proteases. Moreover, $N^{+}$ is the total number of proteases correctly predicted as proteases by iProtease-PseAAC (2L), and $N^{+}_{-}$ is the total number of proteases incorrectly predicted as non-proteases. Thus, Eq. (28) makes the meaning of sensitivity, specificity, overall accuracy and stability more intuitive and easier to understand, particularly for MCC [84–86]. This set of intuitive metrics has been used in a number of recent publications (see, e.g., Refs. [82,87–98]), but only for binary-labelled data. Multi-label prediction is a completely different problem, which has become increasingly popular in computational biology [99–101] and biomedicine [102], and requires a different kind of metrics [103].

3.3. Self-consistency testing
To test the accuracy of iProtease-PseAAC (2L), self-consistency testing was performed first, in which the training dataset itself was used to test the model. The reason for performing the self-consistency test is that the true labels of the benchmark dataset are already known. The self-consistency test is used only to measure the training accuracy, i.e. how well the model has been trained. Because the same data is used for training and testing, the results are usually optimistic. This method does not provide a robust evaluation of the model, for which we use other strategies such as jackknife testing and k-fold cross-validation. The results of the self-consistency test are shown in Table 1; iProtease-PseAAC (2L) achieves 98.32% Acc, 98.76% Sp, 97.51% Sn and 0.98 MCC.

Table 1. Results of self-consistency testing for level-1.

| Predictor | Acc (%) | Sp (%) | Sn (%) | MCC |
|---|---|---|---|---|
| iProtease-PseAAC (2L) | 98.32 | 98.76 | 97.51 | 0.98 |

3.4. Validation of model via leave-one-out
In general, prediction models are trained using experimentally proven datasets, but sometimes no experimentally proven dataset is available for testing the model. Even when such a dataset is available, the data may be unsuitable or insufficient for testing prediction accuracy. Given the four metrics of Eq. (28), which testing method should be used to check the reliability of the prediction model? Normally, a prediction model can be tested using leave-one-out (jackknife), k-fold (subsampling) and independent tests [104]. In jackknife testing, the model is trained each time on N – 1 samples, where N is the total number of instances of benchmark
Analytical Biochemistry 588 (2020) 113477
Y.D. Khan, et al.
Table 2 Jackknife Validation Results (Average of n-iterations) for level-1. Predictor
Table 7 Accuracy metrics for 10-fold cross-validation of one-layer architecture.
Accuracy Metrics
iProtease-PseAAC (2L) ProtIdent [125]
Predictor
Acc (%)
Sp (%)
Sn (%)
MCC
96.07 92.0
97.39 -
96.96 -
0.92 -
iProtease-PseAAC (2L)
1 2 3 4 5 6 7 8 9 10 Average
Accuracy Metrics Acc (%)
Sn (%)
Sp (%)
MCC
89.7 90.6 90.1 91.1 91.3 90.2 91.6 90.4 91.1 91.0 90.71
85.2 86.4 85.4 85.7 87.3 84.8 86.2 85.7 86.9 87.2 86.08
93.0 93.7 93.6 95.4 94.3 94.5 92.8 93.9 94.3 93.7 93.92
0.78 0.80 0.79 0.82 0.82 0.80 0.79 0.80 0.81 0.81 0.80
Positive Samples
Negative Samples
Aspartic Protease Cysteine Protease Metallo Protease Serine Protease Threonine Protease
305 712 1207 923 192
3389–305 = 3084 3389–712 = 2677 3389–1207 = 2182 3389–923 = 2466 3389–192 = 3197
Aspartic Protease Cysteine Protease Metallo Protease Serine Protease Threonine Protease Average
Accuracy Metrics
Number of Proteases
Acc (%)
Sp (%)
Sn (%)
MCC
+
+ −
− +
−
97.00 96.08 93.63 96.13 96.02
97.24 96.30 94.96 96.84 96.15
94.43 95.22 91.22 94.26 93.75
0.84 0.89 0.86 0.90 0.73
288 678 1101 870 180
17 34 106 53 12
85 99 110 78 123
2999 2578 2072 2388 3074
95.77
96.30
93.78
0.85
-
-
-
-
Table 6 Details of dataset distribution for one-layer architecture. Class
Positive Samples
Negative Samples
Aspartic Protease Cysteine Protease Metallo Protease Serine Protease Threonine Protease Negative Samples
305 712 1207 923 192 3500
6839–305 = 6534 6839–712 = 6127 6839–1207 = 5632 6839–923 = 5916 6839–192 = 6647 6389–3500 = 3389
Sn (%)
MCC
92.11
94.30
91.52
0.81
Cross-validation is one of the best available methods to validate model prediction, cross-validation is the best option to choose and to give the validation that the iProtease-PseAAC (2L) is predicting true proteases. Using cross-validation, the benchmark dataset is distributed into total k number of unique folds, where k is the number in which the benchmark dataset is divided, for now, k = 10. In each round of validation, a different subset of data is selected randomly for validation across the rest of the data, by this, each part of the dataset is used for training and testing both. At the end of last round of cross-validation, the cumulated accuracy for k = 10 is calculated by adding the accuracy of each validation round and dividing it by 10 and it's 90.71% in this study as shown in Table 3. iProtease-PseAAC (2L) uses a position and composition variant feature extraction technique along with neural network for classification. The coefficients yielded by the iProtease-PseAAC (2L) are nondependent on such variables. The size of the feature vector is fixed, also it comprehensively computes all the possible correlation among all the possible pair of residues in the peptide chain in a succinct form. iProtease-PseAAC (2L) uses diverse sequences of both protease and nonprotease which is subsequently used for both, training and testing. As shown in Table 1, the iProtease-PseAAC (2L) exhibits higher sensitivity, specificity, accuracy, and MCC for prediction of proteases and nonproteases. At level-2, the multi-label classification was performed. The total samples in the dataset, as shown in supplementary information S1, were 3389 positive +3500 negative = 6839, however, in 3389 positive samples, 305 aspartic, 1207 metallo, 923 serine, 192 threonine and 712 cysteine proteases were present. 
At the second layer, only the positive data were used: for each protease class, the targeted class was kept as positive while all remaining protease samples were considered negative, and the 3500 negative (non-protease) samples were excluded (Table 4). For a comparative analysis of the proposed method against previously reported methods, five earlier studies were considered [125–129]. In addition, we compared the proposed two-layer architecture with a one-layer architecture. The same class distribution was used for the one-layer architecture, i.e. the 3339 positive samples comprise 305 aspartic, 1207 metallo, 923 serine, 192 threonine and 712 cysteine proteases. In the one-layer architecture, however, all data were used: for each class, the targeted class was kept as positive while all remaining samples, including the non-proteases, were considered negative (Table 6).
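The one-against-the-rest bookkeeping described above can be checked with a short calculation. The sketch below is a hedged illustration: the class counts are the ones quoted in the text (they sum to 3339 which, with the 3500 non-proteases, gives the stated total of 6839), while the function name `one_vs_rest_counts` and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, Tuple

# Class sizes as quoted in the text.
CLASS_COUNTS: Dict[str, int] = {
    "aspartic": 305,
    "cysteine": 712,
    "metallo": 1207,
    "serine": 923,
    "threonine": 192,
}
N_NON_PROTEASE = 3500


def one_vs_rest_counts(class_counts: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """For each class, return (positives, negatives) when it is targeted
    and every other sample in the pool is treated as negative."""
    total = sum(class_counts.values())
    return {name: (n, total - n) for name, n in class_counts.items()}


# Two-layer, level-2: only protease samples are in the pool (5 rounds).
two_layer = one_vs_rest_counts(CLASS_COUNTS)

# One-layer: non-proteases join the pool as a sixth class (6 rounds).
one_layer = one_vs_rest_counts({**CLASS_COUNTS, "non-protease": N_NON_PROTEASE})

for name, (pos, neg) in one_layer.items():
    print(f"{name:>12}: {pos:5d} positive, {neg:5d} negative")
```

In the one-layer setting each binary problem is considerably more imbalanced (for example, 305 aspartic samples against 6534 others), which is one plausible reason for comparing the two designs.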
Table 5 Accuracy metrics for level-2. Class
Sp (%)
3.5. Cross-validation model testing
Table 4 Details of dataset distribution for level-2. Based on this dataset distribution, accuracy metrics are computed 5 times and the average is reported as final accuracy for level-2 (Table 5). Class
Acc (%)
considering N-1 samples for training and 1 sample for testing, and the model is trained and tested on those datasets. This process is carried out for every sample in the dataset. In jackknife validation of a prediction model, both the training and testing datasets are open, and every sample of the benchmark dataset is used in turn for training and for testing; the procedure is exhaustive because each sample is moved in and out of the training set, and it excludes the memory effect. For a given benchmark dataset, the jackknife test always yields a unique result, so the arbitrariness problem of the independent test and subsampling tests is completely avoided. Under jackknife validation, the prediction model achieves 96.07% accuracy, as shown in Table 2. The jackknife test has been widely used by investigators to validate prediction models [94,105–124].
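The jackknife (leave-one-out) loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: a 1-nearest-neighbour rule and a seven-point toy dataset stand in for the actual neural network and benchmark data, and the names `jackknife_accuracy` and `fit_1nn` are hypothetical.

```python
from typing import Callable, List, Tuple

Sample = Tuple[List[float], int]            # (feature vector, class label)
Predictor = Callable[[List[float]], int]


def fit_1nn(train: List[Sample]) -> Predictor:
    """Stand-in classifier (assumption): copy the label of the nearest training sample."""
    def predict(features: List[float]) -> int:
        nearest = min(
            train,
            key=lambda s: sum((a - b) ** 2 for a, b in zip(features, s[0])),
        )
        return nearest[1]

    return predict


def jackknife_accuracy(data: List[Sample],
                       fit: Callable[[List[Sample]], Predictor]) -> float:
    """Hold out each sample once, train on the other N-1, and test on the held-out one."""
    correct = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]      # the N-1 remaining samples
        if fit(train)(features) == label:
            correct += 1
    return correct / len(data)


# Toy data (hypothetical): two clusters plus one sample that sits nearer the wrong cluster.
toy = [([0.0], 0), ([1.0], 0), ([2.0], 0),
       ([50.0], 1), ([51.0], 1), ([52.0], 1),
       ([10.0], 1)]
accuracy = jackknife_accuracy(toy, fit_1nn)
```

No random split is involved, so repeated runs on the same dataset give the same accuracy; the cost is N training runs, which is why the procedure is described as exhaustive.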
Table 3 10-fold Cross Validation results (Average of 10-folds) for level-1. Folds
Accuracy Metrics
Based on this dataset distribution, accuracy metrics are computed 6 times and the average is reported as final accuracy for one-layer architecture (Table 7).
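The per-class metrics and their average can be computed as in the sketch below, which uses the standard confusion-matrix definitions of sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC). The confusion counts in the example are made-up placeholders, not results from the paper, and the function names are assumptions.

```python
import math
from typing import Dict, List


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sn, Sp, Acc and MCC for one one-against-the-rest round."""
    sn = tp / (tp + fn)                      # fraction of positives recovered
    sp = tn / (tn + fp)                      # fraction of negatives recovered
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


def average_metrics(rounds: List[Dict[str, float]]) -> Dict[str, float]:
    """Unweighted mean of each metric over all rounds (one round per class)."""
    return {key: sum(r[key] for r in rounds) / len(rounds) for key in rounds[0]}


# Hypothetical confusion counts for two rounds, for illustration only.
round_a = binary_metrics(tp=90, fp=5, tn=95, fn=10)
round_b = binary_metrics(tp=80, fp=10, tn=90, fn=20)
mean = average_metrics([round_a, round_b])
```

Averaging the per-round values without weighting treats every class equally regardless of its size; whether the paper weights the rounds is not stated, so the unweighted mean here is an assumption.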
dataset, and testing is done on the one remaining instance of the benchmark dataset. Each time, the data for training and testing are selected by
Table 8 Success rate for identifying protease types.
Method                        Success Rate (%)
iProtease-PseAAC (2L)         95.77
iProtease-PseAAC (1L)         92.11
GO-PseAAC [126]               85.5
FunD-PseAAC Method [127]      94.8
ProtIdent [125]               95.70
PseAAC [128]                  92.74
9-Gram Coding [129]           92.5
Fig. 3. The graphical user interface (GUI) of the iProtease-PseAAC (2L) available at biopred.org/prot.
The results of the one-layer architecture with 10-fold cross-validation are reported in Table 7. Some of the previously available methods come close to the proposed method in prediction success rate; however, their accuracy at level prediction is quite low (see Ref. [125]). Also, the datasets they used were smaller than that of the proposed method. Based on these results, it can be observed that iProtease-PseAAC (2L) predicts protease types more accurately than its counterparts (Table 8) and can help to identify the class of a protease from its sequence alone, without laborious experimental work.
4. Web server
The final step of Chou's 5-step rule is the development of a user-friendly, publicly available web server for the convenience of users and biologists, as explained in recent publications by various authors [35,85,92,95,130–133]. As demonstrated in Ref. [134], user-friendly and publicly accessible web servers give the future direction for reporting various important computational analyses and findings regarding PTM. Indeed, by being easy to use and publicly available, they have considerably enhanced the impact of computational biology on medical science [135], driving medical science into an unprecedented revolution [136]. Publicly available and user-friendly web servers provide the opportunity and set the direction for the future development of computational tools and prediction methods of this kind. The web server for iProtease-PseAAC (2L) is available at http://biopred.org/iprot/. It is developed in Python 3.6, and classification is performed with the scikit-neuralnetwork Python library with a Theano backend for high throughput. A step-by-step guide on how to use the iProtease-PseAAC (2L) web server is given below.
4.1. Step 1
The iProtease-PseAAC (2L) web server is publicly available and can be opened at http://biopred.org/iprot/. Once the main page has loaded, a header with several tabs is shown: Home, Prediction, About and Supplementary Data (Fig. 3). The Home tab gives an overview of proteases and their role in different biological processes. Proteases and their classes are identified on the Prediction tab. Information about the paper and its citation can be found on the About tab. The Supplementary Data tab allows the supplementary data to be downloaded. To identify proteases and their classes, click on the Prediction tab.
4.2. Step 2
The Prediction tab contains an empty text box into which the input sequence can be inserted. After inserting the primary sequence, click on the Submit button to run the model and identify proteases and their classes. The prediction results appear on the next screen once iProtease-PseAAC (2L) has finished; this may take a couple of seconds, depending on the length of the sequence.
4.3. Step 3
To find the relevant paper describing the detailed algorithm of iProtease-PseAAC (2L), and its citation, click on About.
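The two-layer decision that the Prediction tab reports can be pictured with the sketch below. It is an assumption-laden illustration, not the server's code: the text does not specify the accepted input format, so the cleaning step assumes one-letter codes for the 20 standard amino acids with an optional FASTA header, and `is_protease` and `protease_type` are hypothetical placeholders for the trained level-1 and level-2 classifiers.

```python
from typing import Callable, Optional

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard residues (assumed alphabet)


def clean_sequence(raw: str) -> str:
    """Drop FASTA header lines and whitespace, then reject non-standard residues."""
    residues = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith(">"):            # FASTA header, not part of the sequence
            continue
        residues.append("".join(line.split()))
    sequence = "".join(residues).upper()
    invalid = sorted(set(sequence) - AMINO_ACIDS)
    if not sequence:
        raise ValueError("empty sequence")
    if invalid:
        raise ValueError(f"unexpected characters: {''.join(invalid)}")
    return sequence


def two_layer_predict(raw: str,
                      is_protease: Callable[[str], bool],
                      protease_type: Callable[[str], str]) -> Optional[str]:
    """Layer 1 decides protease vs non-protease; layer 2 runs only for proteases."""
    sequence = clean_sequence(raw)
    if not is_protease(sequence):
        return None                          # non-protease: no type is assigned
    return protease_type(sequence)


# Hypothetical stand-ins for the trained models, used only to exercise the flow.
result = two_layer_predict(">demo\nmkv lla\nGG",
                           is_protease=lambda s: True,
                           protease_type=lambda s: "serine")
```

Checking the sequence before submission avoids spending a prediction on input that contains digits or other stray characters.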
4.4. Step 4
Theor. Biol. 455 (2018) 205–211. [14] W. Chen, H. Ding, X. Zhou, H. Lin, K.-C. Chou, iRNA (m6A)-PseDNC: identifying N6-methyladenosine sites using pseudo dinucleotide composition, Anal. Biochem. 561–562 (2018) 59–65. [15] W. Chen, P. Feng, H. Ding, H. Lin, K.-C. Chou, iRNA-Methyl: identifying N6-methyladenosine sites using pseudo nucleotide composition, Anal. Biochem. 490 (2015) 26–33. [16] W. Chen, P. Feng, H. Yang, H. Ding, H. Lin, K.-C. Chou, iRNA-3typeA: identifying three types of modification at RNA's adenosine sites, Mol. Ther. Nucleic Acids 11 (2018) 468–474. [17] W. Chen, H. Tang, J. Ye, H. Lin, K.-C. Chou, iRNA-PseU: identifying RNA pseudouridine sites, Mol. Ther. Nucleic Acids 5 (2016). [18] P. Feng, H. Ding, H. Yang, W. Chen, H. Lin, K.-C. Chou, iRNA-PseColl: identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC, Mol. Ther. Nucleic Acids 7 (2017) 155–163. [19] P. Feng, H. Yang, H. Ding, H. Lin, W. Chen, K.-C. Chou, iDNA6mA-PseKNC: Identifying DNA N6-Methyladenosine Sites by Incorporating Nucleotide Physicochemical Properties into PseKNC, Genomics, 2018. [20] A. Ghauri, Y. Khan, N. Rasool, S. Khan, K. Chou, pNitro-Tyr-PseAAC, Predict Nitrotyrosine Sites in Proteins by Incorporating Five Features into Chou's General PseAAC, Current pharmaceutical design, 2018. [21] C. Jia, X. Lin, Z. Wang, Prediction of protein S-nitrosylation sites based on adapted normal distribution bi-profile Bayes and Chou's pseudo amino acid composition, Int. J. Mol. Sci. 15 (2014) 10410–10423. [22] J. Jia, Z. Liu, X. Xiao, B. Liu, K.-C. Chou, iSuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset, Anal. Biochem. 497 (2016) 48–56. [23] J. Jia, Z. Liu, X. Xiao, B. Liu, K.-C. Chou, pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach, J. Theor. 
Biol. 394 (2016) 223–230. [24] J. Jia, Z. Liu, X. Xiao, B. Liu, K.-C. Chou, iCar-PseCp: identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC, Oncotarget 7 (2016) 34558. [25] J. Jia, L. Zhang, Z. Liu, X. Xiao, K.-C. Chou, pSumo-CD: predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC, Bioinformatics 32 (2016) 3133–3141. [26] Z. Ju, J.-Z. Cao, H. Gu, Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou ׳s general PseAAC, J. Theor. Biol. 397 (2016) 145–150. [27] Z. Ju, J.-J. He, Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into Chou's general PseAAC, J. Mol. Graph. Model. 77 (2017) 200–204. [28] Z. Ju, S.-Y. Wang, Prediction of citrullination sites by incorporating k-spaced amino acid pairs into Chou's general pseudo amino acid composition, Gene 664 (2018) 78–83. [29] Y.D. Khan, N. Rasool, W. Hussain, S.A. Khan, K.-C. Chou, iPhosT-PseAAC: identify phosphothreonine sites by incorporating sequence statistical moments into PseAAC, Anal. Biochem. 550 (2018) 109–116. [30] Y.D. Khan, N. Rasool, W. Hussain, S.A. Khan, K.-C. Chou, iPhosY-PseAAC, Identify phosphotyrosine sites by incorporating sequence statistical moments into PseAAC, Mol. Biol. Rep. (2018) 1–9. [31] L.-M. Liu, Y. Xu, K.-C. Chou, iPGK-PseAAC: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general PseAAC, Med. Chem. 13 (2017) 552–559. [32] Z. Liu, X. Xiao, D.-J. Yu, J. Jia, W.-R. Qiu, K.-C. Chou, pRNAm-PC: predicting N6methyladenosine sites in RNA sequences via physical–chemical properties, Anal. Biochem. 497 (2016) 60–67. [33] W.R. Qiu, B.Q. Sun, X. Xiao, D. Xu, K.C. 
Chou, iPhos‐PseEvo: identifying human phosphorylated proteins by incorporating evolutionary information into general PseAAC via grey system theory, Mol. Inf. 36 (2017). [34] W.-R. Qiu, S.-Y. Jiang, B.-Q. Sun, X. Xiao, X. Cheng, K.-C. Chou, iRNA-2methyl: identify RNA 2'-O-methylation sites by incorporating sequence-coupled effects into general PseKNC and ensemble classifier, Med. Chem. 13 (2017) 734–743. [35] W.-R. Qiu, S.-Y. Jiang, Z.-C. Xu, X. Xiao, K.-C. Chou, iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition, Oncotarget 8 (2017) 41178. [36] W.-R. Qiu, B.-Q. Sun, X. Xiao, Z.-C. Xu, K.-C. Chou, iHyd-PseCp: identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC, Oncotarget 7 (2016) 44310. [37] W.-R. Qiu, B.-Q. Sun, X. Xiao, Z.-C. Xu, K.-C. Chou, iPTM-mLys: identifying multiple lysine PTM sites and their different types, Bioinformatics 32 (2016) 3116–3123. [38] W.-R. Qiu, X. Xiao, W.-Z. Lin, K.-C. Chou, iMethyl-PseAAC, Identification of protein methylation sites via a pseudo amino acid composition approach, BioMed Res. Int. (2014) 2014. [39] W.-R. Qiu, X. Xiao, W.-Z. Lin, K.-C. Chou, iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model, J. Biomol. Struct. Dyn. 33 (2015) 1731–1742. [40] W.-R. Qiu, X. Xiao, Z.-C. Xu, K.-C. Chou, iPhos-PseEn: identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier, Oncotarget 7 (2016) 51270. [41] M.F. Sabooh, N. Iqbal, M. Khan, M. Khan, H. Maqbool, Identifying 5-methylcytosine sites in RNA sequence using composite encoding feature into Chou's PseKNC, J. Theor. Biol. 452 (2018) 1–9.
To download the supplementary dataset for future experimentation, click on the Supplementary Data tab and download the dataset.
5. Conclusion
Proteases are enzymes that carry out proteolysis. Proteolysis normally refers to protein and peptide degradation, which is crucial for the survival, growth and wellbeing of a cell. Moreover, proteases have a strong association with therapeutics and drug development. Proteases are classified into five types according to their nature and physicochemical characteristics. Most methods used to differentiate proteases from other proteins and to identify their class require a clinical test, which is usually time-consuming and operator-dependent. In this study, using Chou's 5-step rule, we have developed an ANN-based model for identifying proteases and predicting their class. Owing to the strong biological importance of proteases, their identification is a primary and essential task. The aim of the study is to develop an efficient and more accurate protease classifier, and to make it user-friendly and available worldwide to biologists and general users. By implementing PseAAC, we have used many positional and compositional features of the protein samples. After model development, the prediction model was tested and validated with several exhaustive validation methods, i.e. self-consistency, cross-validation, and jackknife. Self-consistency validation gives 98.32% accuracy, cross-validation gives 90.71% and jackknife gives 96.07%. Moreover, the average accuracy for level-2, i.e. protease classification, is 95.77%. Based on these results, it is concluded that iProtease-PseAAC (2L) is well able to identify proteases from a given protein sequence.
In computational terms, the proposed model can still be improved, as the number of known protein sequences for proteases and their classes is growing rapidly.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.ab.2019.113477.
References
[1] A. Anwar, M. Saleemuddin, Alkaline proteases: a review, Bioresour. Technol. 64 (1998) 175–183. [2] P. Ellaiah, B. Srinivasulu, K. Adinarayana, A Review on Microbial Alkaline Proteases, (2002). [3] C. Lazure, N.G. Seidah, D. Pélaprat, M. Chrétien, Proteases and posttranslational processing of prohormones: a review, Can. J. Biochem. Cell Biol. 61 (1983) 501–515. [4] A.A. Agbowuro, W.M. Huston, A.B. Gamble, J.D. Tyndall, Proteases and protease inhibitors in infectious diseases, Med. Res. Rev. 38 (2018) 1295–1331. [5] L.E. Bröker, F.A. Kruyt, G. Giaccone, Cell death independent of caspases: a review, Clin. Cancer Res. 11 (2005) 3155–3162. [6] M.A. Shah, S.A. Mir, M.A. Paray, Plant proteases as milk-clotting enzymes in cheesemaking: a review, Dairy Sci. Technol. 94 (2014) 5–16. [7] A. Jablaoui, A. Kriaa, N. Akermi, H. Mkaouar, A. Gargouri, E. Maguin, M. Rhimi, Biotechnological applications of serine proteases: a patent review, Recent Pat. Biotechnol. 12 (2018) 280–287. [8] J.J. Sheehan, S.E. Tsirka, Fibrin-modifying serine proteases thrombin, tPA, and plasmin in ischemic stroke: a review, Glia 50 (2005) 340–350. [9] L. Salamonsen, E. Dimitriadis, R. Jones, G. Nie, Complex regulation of decidualization: a role for cytokines and proteases - a review, Placenta 24 (2003) S76–S85. [10] S. Rakash, F. Rana, S. Rafiq, A. Masood, S. Amin, Role of proteases in cancer: a review, Biotechnol. Mol. Biol. Rev. 7 (2012) 90–101. [11] N. Gonzalez-Rabade, J.A. Badillo-Corona, J.S. Aranda-Barradas, M. del Carmen Oliver-Salvador, Production of plant proteases in vivo and in vitro - a review, Biotechnol. Adv. 29 (2011) 983–996. [12] D.
Whitford, Proteins: Structure and Function, John Wiley & Sons, 2013. [13] S. Akbar, M. Hayat, iMethyl-STTNC: identification of N6-methyladenosine sites by extending the idea of SAAC into Chou's PseAAC to formulate RNA sequences, J.
(2017) 212–224. [72] M.A. Akmal, N. Rasool, Y.D. Khan, Prediction of N-linked glycosylation sites using position relative features and statistical moments, PLoS One 12 (2017) e0181966. [73] Y.D. Khan, F. Ahmad, M.W. Anwar, A neuro-cognitive approach for iris recognition using back propagation, World Appl. Sci. J. 16 (2012) 678–685. [74] Y.D. Khan, F. Ahmed, S.A. Khan, Situation recognition using image moments and recurrent neural networks, Neural Comput. Appl. 24 (2014) 1519–1529. [75] Y.D. Khan, N.S. Khan, S. Farooq, A. Abid, S.A. Khan, F. Ahmad, M.K. Mahmood, An efficient algorithm for recognition of human actions, Sci. World J. (2014) 2014. [76] Y.D. Khan, S.A. Khan, F. Ahmad, S. Islam, Iris recognition using image moments and k-means algorithm, Sci. World J. (2014) 2014. [77] W. Hussain, Y.D. Khan, N. Rasool, S.A. Khan, K.-C. Chou, SPalmitoylC-PseAAC: a sequence-based model developed via Chou's 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins, Anal. Biochem. 568 (2019) 14–23. [78] W. Hussain, Y.D. Khan, N. Rasool, S.A. Khan, K.-C. Chou, SPrenylC-PseAAC, A sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins, J. Theor. Biol. 468 (2019) 1–11. [79] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford university press, 1995. [80] S. Haykin, Neural Networks: a Comprehensive Foundation, Prentice Hall PTR, 1994. [81] K.-C. Chou, Prediction of signal peptides using scaled window, Peptides 22 (2001) 1973–1979. [82] P.-M. Feng, H. Ding, W. Chen, H. Lin, Naive Bayes classifier with feature selection to identify phage virion proteins, Comput. math. methods. med. (2013) 2013. [83] Y. Xu, X.J. Shao, L.Y. Wu, N.Y. Deng, K.C. Chou, iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins, PeerJ 1 (2013) e171. [84] W. Chen, P. Feng, H. Ding, H. Lin, K.-C. 
Chou, Using deformation energy to analyze nucleosome positioning in genomes, Genomics 107 (2016) 69–75. [85] W.R. Qiu, B.Q. Sun, X. Xiao, D. Xu, K.C. Chou, iPhos‐PseEvo: identifying human phosphorylated proteins by incorporating evolutionary information into general PseAAC via grey system theory, Mol. Inf. 36 (2017) 1600010. [86] X. Xiao, H.-X. Ye, Z. Liu, J.-H. Jia, K.-C. Chou, iROS-gPseKNC: predicting replication origin sites in DNA by incorporating dinucleotide position-specific propensity into general pseudo nucleotide composition, Oncotarget 7 (2016) 34180. [87] H. Lin, E.Z. Deng, H. Ding, W. Chen, K.C. Chou, iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition, Nucleic Acids Res. 42 (2014) 12961–12972. [88] Y. Xu, X. Wen, L.S. Wen, L.Y. Wu, N.Y. Deng, K.C. Chou, iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition, PLoS One 9 (2014) e105018. [89] J. Jia, Z. Liu, X. Xiao, B. Liu, K.C. Chou, pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach, J. Theor. Biol. 394 (2016) 223–230. [90] C.J. Zhang, H. Tang, W.C. Li, H. Lin, W. Chen, K.C. Chou, iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition, Oncotarget 7 (2016) 69783–69793. [91] W. Chen, H. Ding, P. Feng, H. Lin, K.C. Chou, iACP: a sequence-based tool for identifying anticancer peptides, Oncotarget 7 (2016) 16895–16909. [92] B. Liu, F. Yang, K.C. Chou, 2L-piRNA: a two-layer ensemble classifier for identifying piwi-interacting RNAs and their function, Mol. Ther. Nucleic Acids 7 (2017) 267–277. [93] B. Liu, S. Wang, R. Long, K.C. Chou, iRSpot-EL: identify recombination spots with an ensemble learning approach, Bioinformatics 33 (2017) 35–41. [94] W. Chen, P. Feng, H. Yang, H. Ding, H. Lin, K.C. 
Chou, iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences, Oncotarget 8 (2017) 4208–4217. [95] P. Feng, H. Ding, H. Yang, W. Chen, H. Lin, K.C. Chou, iRNA-PseColl: identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC, Mol. Ther. Nucleic Acids 7 (2017) 155–163. [96] B. Liu, F. Yang, D.S. Huang, K.C. Chou, iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC, Bioinformatics 34 (2018) 33–40. [97] A. Ehsan, K. Mahmood, Y.D. Khan, S.A. Khan, K.C. Chou, A novel modeling in mathematical biology for classification of signal peptides, Sci. Rep. 8 (2018) 1039. [98] P. Feng, H. Yang, H. Ding, H. Lin, W. Chen, K.C. Chou, iDNA6mA-PseKNC: identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC, Genomics (2018), https://doi.org/10.1016/j.ygeno. 2018.01.005. [99] K.-C. Chou, Z.-C. Wu, X. Xiao, iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites, Mol. Biosyst. 8 (2012) 629–641. [100] W.-Z. Lin, J.-A. Fang, X. Xiao, K.-C. Chou, iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins, Mol. Biosyst. 9 (2013) 634–644. [101] X. Xiao, Z.-C. Wu, K.-C. Chou, iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites, J. Theor. Biol. 284 (2011) 42–51. [102] X. Xiao, P. Wang, W.-Z. Lin, J.-H. Jia, K.-C. Chou, iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types, Anal. Biochem. 436 (2013) 168–177. [103] K.-C. Chou, Some remarks on predicting multi-label attributes in molecular biosystems, Mol. Biosyst. 9 (2013) 1092–1100.
[42] H.-L. Xie, L. Fu, X.-D. Nie, Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou's PseAAC, Protein Engineering, Des. Sel. 26 (2013) 735–742. [43] Y. Xu, K.-C. Chou, Recent progress in predicting posttranslational modification sites in proteins, Curr. Top. Med. Chem. 16 (2016) 591–603. [44] Y. Xu, J. Ding, L.-Y. Wu, K.-C. Chou, iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLoS One 8 (2013) e55844. [45] Y. Xu, X.-J. Shao, L.-Y. Wu, N.-Y. Deng, K.-C. Chou, iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins, PeerJ 1 (2013) e171. [46] Y. Xu, Z. Wang, C. Li, K.-C. Chou, iPreny-PseAAC: identify C-terminal cysteine prenylation sites in proteins by incorporating two tiers of sequence couplings into PseAAC, Med. Chem. 13 (2017) 544–551. [47] Y. Xu, X. Wen, X.-J. Shao, N.-Y. Deng, K.-C. Chou, iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide positionspecific propensity into pseudo amino acid composition, Int. J. Mol. Sci. 15 (2014) 7594–7610. [48] Y. Xu, X. Wen, L.-S. Wen, L.-Y. Wu, N.-Y. Deng, K.-C. Chou, iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition, PLoS One 9 (2014) e105018. [49] J. Zhang, X. Zhao, P. Sun, Z. Ma, PSNO: predicting cysteine S-nitrosylation sites by incorporating various sequence-derived features into the general form of Chou's PseAAC, Int. J. Mol. Sci. 15 (2014) 11204–11219. [50] A. Ehsan, K. Mahmood, Y.D. Khan, S.A. Khan, K.-C. Chou, A novel modeling in mathematical biology for classification of signal peptides, Sci. Rep. 8 (2018) 1039. [51] W. Hussain, Y.D. Khan, N. Rasool, S.A. Khan, K.-C. 
Chou, SPalmitoylC-PseAAC, A sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins, Anal. Biochem. 568 (2018) 14–23. [52] Y.D. Khan, M. Jamil, W. Hussain, N. Rasool, S.A. Khan, K.-C. Chou, pSSbondPseAAC: prediction of disulfide bonding sites by integration of PseAAC and statistical moments, J. Theor. Biol. 463 (2019) 47–55. [53] A.H. Butt, S.A. Khan, H. Jamil, N. Rasool, Y.D. Khan, A prediction model for membrane proteins using moments based features, BioMed Res. Int. (2016) 2016. [54] A.H. Butt, N. Rasool, Y.D. Khan, A treatise to computational approaches towards prediction of membrane protein and its subtypes, J. Membr. Biol. 250 (2017) 55–76. [55] A.H. Butt, N. Rasool, Y.D. Khan, Predicting membrane proteins and their types by extracting various sequence features into Chou's general PseAAC, Mol. Biol. Rep. (2018) 1–12. [56] A. Akhtar, A. Amir, W. Hussain, A. Ghaffar, N. Rasool, In silico computations of selective phytochemicals as potential inhibitors against major biological targets of diabetes mellitus, Curr. Comput. Aided Drug Des. 15 (2019) 401–408. [57] H. Amjad, W. Hussain, N. Rasool, Molecular simulation investigation of prolyl oligopeptidase from pyrobaculum calidifontis and in silico docking With substrates and inhibitors, Open Access J. Biomed. Eng. Biosci. 2 (2018) 185–194. [58] N. Arif, A. Subhani, W. Hussain, N. Rasool, In silico inhibition of BACE-1 by selective phytochemicals as novel potential inhibitors: molecular docking and DFT studies, Curr. Drug Discov. Technol. (2019) E-pub Ahead of Print. [59] W. Hussain, M. Ali, M. Sohail Afzalv, N. Rasool, Penta-1,4-Diene-3-One oxime derivatives strongly inhibit the replicase domain of tobacco mosaic virus: elucidation through molecular docking and density functional theory mechanistic computations, J. Antivir. Antiretrovir. 10 (2018). [60] W. Hussain, I. Qaddir, S. Mahmood, N. 
Rasool, In silico targeting of non-structural 4B protein from dengue virus 4 with spiropyrazolopyridone: study of molecular dynamics simulation, ADMET. virtual screening, VirusDis. (2018) 1–10. [61] I. Qaddir, N. Rasool, W. Hussain, S. Mahmood, Computer-aided analysis of phytochemicals as potential dengue virus inhibitors based on molecular docking, ADMET and DFT studies, J. Vector Borne Dis. 54 (2017) 255. [62] N. Rasool, A. Ashraf, M. Waseem, W. Hussain, S. Mahmood, Computational exploration of antiviral activity of phytochemicals against NS2B/NS3 proteases from dengue virus, Turkish J. Biochem. (2019) 261. [63] N. Rasool, S. Iftikhar, A. Amir, W. Hussain, Structural and quantum mechanical computations to elucidate the altered binding mechanism of metal and drug with pyrazinamidase from Mycobacterium tuberculosis due to mutagenicity, J. Mol. Graph. Model. 80 (2017) 126–131. [64] N. Rasool, A. Jalal, A. Amjad, W. Hussain, Probing the pharmacological parameters, molecular docking and quantum computations of plant derived compounds exhibiting strong inhibitory potential against NS5 from zika virus, Braz. Arch. Biol. Technol. (2018) 61. [65] K.-C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition, J. Theor. Biol. 273 (2011) 236–247. [66] K.-C. Chou, Using subsite coupling to predict signal peptides, Protein Eng. 14 (2001) 75–79. [67] L. Fu, B. Niu, Z. Zhu, S. Wu, W. Li, CD-HIT, Accelerated for clustering the nextgeneration sequencing data, Bioinformatics 28 (2012) 3150–3152. [68] G. Altay, F. Emmert-Streib, Revealing differences in gene network inference algorithms on the network level by ensemble methods, Bioinformatics 26 (2010) 1738–1744. [69] Y. Pengyi, Y. Yee Hwa, B.Z. Bing, Y.Z. Albert, A review of ensemble methods in bioinformatics, Curr. Bioinform. 5 (2010) 296–308. [70] S. Wan, M.W. Mak, S.Y. Kung, Ensemble linear neighborhood propagation for predicting subchloroplast localization of multi-location proteins, J. 
Proteome Res. 15 (2016) 4755–4762. [71] S. Wan, M.W. Mak, S.Y. Kung, Transductive learning for multi-label protein subchloroplast localization prediction, IEEE ACM Trans. Comput. Biol. Bioinform 14
[121] S. Wan, M.W. Mak, S.Y. Kung, Mem-mEN: predicting multi-functional types of membrane proteins by interpretable elastic nets, IEEE ACM Trans. Comput. Biol. Bioinform 13 (2016) 706–718. [122] S. Wan, M.-W. Mak, S.-Y. Kung, Gram-LocEN: interpretable prediction of subcellular multi-localization of Gram-positive and Gram-negative bacterial proteins, Chemom. Intell. Lab. Syst. 162 (2017) 1–9. [123] S. Wan, M.-W. Mak, Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme, Int. J. Mach. Learn. Cybern. 9 (2018) 399–411. [124] P. Zakeri, B. Moshiri, M. Sadeghi, Prediction of protein submitochondria locations based on data fusion of various features of sequences, J. Theor. Biol. 269 (2011) 208–216. [125] K.-C. Chou, H.-B. Shen, ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information, Biochem. Biophys. Res. Commun. 376 (2008) 321–325. [126] G.P. Zhou, Y.D. Cai, Predicting protease types by hybridizing gene ontology and pseudo amino acid composition, Proteins: Struct. Funct. Bioinform. 63 (2006) 681–684. [127] K.-C. Chou, Y.-D. Cai, Prediction of protease types in a hybridization space, Biochem. Biophys. Res. Commun. 339 (2006) 1015–1020. [128] L. Hu, L. Zheng, Z. Wang, B. Li, L. Liu, Using pseudo amino acid composition to predict protease families by incorporating a series of protein biological features, Protein Pept. Lett. 18 (2011) 552–558. [129] C. Xu, R. Shi, Based on 9-gram coding of amino acids predicting proteases types by using support vector machine, Recent Pat. Comput. Sci. 5 (2012) 220–225. [130] X. Cheng, X. Xiao, K.-C. Chou, pLoc-mPlant: predict subcellular localization of multi-location plant proteins by incorporating the optimal GO information into general PseAAC, Mol. Biosyst. 13 (2017) 1722–1727. [131] X. Cheng, X. Xiao, K.-C.
Chou, pLoc-mVirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC, Gene 628 (2017) 315–321. [132] X. Cheng, S.-G. Zhao, W.-Z. Lin, X. Xiao, K.-C. Chou, pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites, Bioinformatics 33 (2017) 3524–3531. [133] X. Cheng, S.-G. Zhao, X. Xiao, K.-C. Chou, iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals, Bioinformatics 33 (2016) 341–346. [134] K.C. Chou, H.B. Shen, Recent advances in developing web-servers for predicting protein attributes, Nat. Sci. 1 (2009) 63–92. [135] K.C. Chou, Impacts of bioinformatics to medicinal chemistry, Med. Chem. 11 (2015) 218–234. [136] K.C. Chou, An unprecedented revolution in medicinal chemistry driven by the progress of biological science, Curr. Top. Med. Chem. 17 (2017) 2337–2358.
[104] K.-C. Chou, C.-T. Zhang, Prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349. [105] A. Dehzangi, R. Heffernan, A. Sharma, J. Lyons, K. Paliwal, A. Sattar, Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou ׳s general PseAAC, J. Theor. Biol. 364 (2015) 284–294. [106] Y. Dou, B. Yao, C. Zhang, PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine, Amino Acids 46 (2014) 1459–1469. [107] K.-Y. Feng, Y.-D. Cai, K.-C. Chou, Boosting classifier for predicting protein domain structural class, Biochem. Biophys. Res. Commun. 334 (2005) 213–217. [108] R. Kumar, A. Srivastava, B. Kumari, M. Kumar, Prediction of β-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine, J. Theor. Biol. 365 (2015) 96–103. [109] S. Mondal, P.P. Pai, Chou ׳s pseudo amino acid composition improves sequencebased antifreeze protein prediction, J. Theor. Biol. 356 (2014) 30–35. [110] L. Nanni, S. Brahnam, A. Lumini, Prediction of protein structure classes by incorporating different protein descriptors into general Chou's pseudo amino acid composition, J. Theor. Biol. 360 (2014) 109–116. [111] W.-R. Qiu, X. Xiao, K.-C. Chou, iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components, Int. J. Mol. Sci. 15 (2014) 1746–1766. [112] H.-B. Shen, J. Yang, K.-C. Chou, Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction, Amino Acids 33 (2007) 57–67. [113] Z.-C. Wu, X. Xiao, K.-C. Chou, iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites, Mol. Biosyst. 7 (2011) 3287–3297. [114] G.P. Zhou, K. Doctor, Subcellular location prediction of apoptosis proteins, Proteins: Struct. Funct. Bioinform. 50 (2003) 44–48. 
[115] W. Chen, P. Feng, H. Yang, H. Ding, H. Lin, K.-C. Chou, iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences, Oncotarget 8 (2017) 4208. [116] S. Jahandideh, S. Hoseini, M. Jahandideh, A. Hoseini, F.M. Disfani, Gamma-turn types prediction in proteins using the two-stage hybrid neural discriminant model, J. Theor. Biol. 259 (2009) 517–522. [117] H. Lin, H. Ding, Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition, J. Theor. Biol. 269 (2011) 64–69. [118] M. Masso, I.I. Vaisman, Knowledge-based computational mutagenesis for predicting the disease potential of human non-synonymous single nucleotide polymorphisms, J. Theor. Biol. 266 (2010) 560–568. [119] S. Wan, M.-W. Mak, S.-Y. Kung, Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins, BMC Bioinf. 17 (2016) 97. [120] S. Wan, M.-W. Mak, S.-Y. Kung, FUEL-mLoc: feature-unified prediction and explanation of multi-localization of cellular proteins in multiple organisms, Bioinformatics 33 (2017) 749–750.