Journal Pre-proof An enhanced protein secondary structure prediction using deep learning framework on hybrid profile based features Prince Kumar, Sanjay Bankapur, Nagamma Patil
PII: S1568-4946(19)30707-0
DOI: https://doi.org/10.1016/j.asoc.2019.105926
Reference: ASOC 105926
To appear in:
Applied Soft Computing Journal
Received date : 27 May 2019 Revised date : 16 August 2019 Accepted date : 5 November 2019 Please cite this article as: P. Kumar, S. Bankapur and N. Patil, An enhanced protein secondary structure prediction using deep learning framework on hybrid profile based features, Applied Soft Computing Journal (2019), doi: https://doi.org/10.1016/j.asoc.2019.105926. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Elsevier B.V. All rights reserved.
An Enhanced Protein Secondary Structure Prediction using Deep Learning Framework on Hybrid Profile based Features

Prince Kumar a,∗, Sanjay Bankapur a, Nagamma Patil a

a National Institute of Technology Karnataka, India - 575025
Abstract
Accurate protein secondary structure prediction (PSSP) is essential for identifying structural classes, protein folds, and tertiary structure. Experimental methods determine secondary structure with higher precision, but at a high cost in time and money. In this study, we propose an effective prediction model that combines 42-dimensional hybrid features with a convolutional neural network (CNN) and a bidirectional recurrent neural network (BRNN). The proposed model is assessed on four benchmark datasets, namely CB6133, CB513, CASP10, and CASP11, using the Q3, Q8, and segment overlap (Sov) metrics. The proposed model reported Q3 accuracy of 85.4%, 85.4%, 83.7%, and 81.5%, and Q8 accuracy of 75.8%, 73.5%, 72.2%, and 70% on the CB6133, CB513, CASP10, and CASP11 datasets respectively. Compared to popular existing models on the CB513 dataset, the proposed model improves Q3 and Q8 accuracy by at least 2.5% and 2.1% respectively. Further, the quality of the Q3 results is validated by structural class prediction and compared with PSI-PRED. The experiment showed that the quality of the Q3 results of the proposed model is higher than that of PSI-PRED.
Keywords: Biological Computing, Convolutional Neural Network, Deep Learning, Protein Secondary Structure, Bidirectional Recurrent Neural Network, Sequence Profiles.

∗ Corresponding author
Email addresses: [email protected] (Prince Kumar), [email protected] (Sanjay Bankapur), [email protected] (Nagamma Patil)

Preprint submitted to Journal of LaTeX Templates, August 16, 2019
1. Introduction
A protein primary structure sequence is a linear chain of amino acid residues drawn from 20 amino acids. Protein secondary structure is comprised of regions stabilized by hydrogen bonds between atoms in the polypeptide backbone. The protein primary sequence folds up into a definite 3-dimensional structure, known as the tertiary structure. Protein function is closely related to tertiary structure, which increases the significance of predicting tertiary structure from a protein primary sequence in molecular biology [1]. However, predicting the tertiary structure directly from primary structure sequences is still an uphill task. This is addressed in two steps: first, protein secondary structure prediction (PSSP) from protein primary structure sequences, and later, protein tertiary structure prediction from protein secondary structure information [2]. Protein secondary structure is mainly described using either 3-class or 8-class elements. The 3-class elements of protein secondary structure are Strand (E), Helix (H), and Coil (C) [3], whereas the Dictionary of Protein Secondary Structure (DSSP) [4] defines the 8 classes as: β-bridge (B), β-sheet (E), 3₁₀-helix (G), α-helix (H), π-helix (I), Bend (S), Turn (T), and other residues (L).

With the rapid development of proteomics, a vast collection of protein
sequences has been deposited in protein data banks. Over the years, the number of known protein primary structure sequences has grown far larger than the number of annotated protein secondary structure sequences [5]. This is mainly due to the limitations of experimental methods (such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and electron microscopy), which are time-consuming and cost-intensive in determining protein secondary structure. However, experimental methods exhibit higher precision and hence are widely used for the annotation of protein secondary structures. As these experimental methods are slow and the number of protein sequences in the protein data banks is increasing rapidly, there is a high demand for computational models to predict protein secondary structure. Therefore, effective (accurate) PSSP from protein primary structure sequences is one of the challenging tasks in the field of biological computing.

1.1. Related Works
Protein structure prediction is classified into several levels, ranging from 1-dimensional to 4-dimensional protein structure prediction. PSSP is a form of 1-dimensional protein structure prediction [6, 7]. To measure the performance of PSSP, Q3 accuracy, Q8 accuracy, and the Sov score for the 3-class and 8-class elements are taken into account [8]. Q3 and Q8 accuracy compute the percentage of amino acid residues for which the predicted secondary structure is correct.
In early studies [6, 9], the probability of amino acid residues occurring in protein secondary structures was analyzed by statistical methods. These models have Q3 accuracy lower than 60%, as they could not extract the local information (features) from the protein primary structure sequences. Then, evolutionary information of proteins [10] and position-specific scoring matrices (PSSM) [11] were taken into consideration, and they proved advantageous for PSSP by achieving a Q3 accuracy of more than 70%. In recent works, Q3 accuracy has gradually improved to more than 80% [12, 13]. However, when these traditional models are used for 8-class PSSP, their Q8 accuracy is very low. The reason behind this failure is that they need to distinguish among 8 classes of protein secondary structure [12, 14, 15, 13]. Therefore, work that addresses both 3-class and 8-class PSSP seems to be more promising [6, 14].
In recent years, deep neural networks (DNNs) have emerged as the most effective framework and have become established for representation learning on various data [16, 17, 18, 19]. DNNs show a significant improvement in both the Q3 and Q8 accuracy of 3-class and 8-class PSSP respectively [13]. Some of the popular existing works evaluated on the benchmark CB513 dataset are as follows. Jones et al. [11] used a two-stage neural network to utilize PSSM profiles for PSSP and obtained a Q3 accuracy of 79.2% on the benchmark CB513 dataset. Zhang et al. [20] used a combination of a Bayesian model and three neural networks and obtained a Q3 accuracy of 74.8%. Karypis et al. [21] used an exponential kernel function and a combined coding scheme, and constructed a cascaded model using a two-stage SVM-based model, which resulted in a Q3 accuracy of 77.83%. Chu et al. [22] used the notion that unrelated proteins may share structural similarity within a group of local structural segments and obtained a Q3 accuracy of 72.23%. Zhong et al. [23] used Pthread and OpenMP to parallelize a Denoeux belief neural network (DBNN) and achieved a Q3 accuracy of 72.01%. Aydin et al. [24] used and extended a hidden semi-Markov model and achieved a Q3 accuracy of 63.71% on the same benchmark dataset. Chopra et al. [25] used localized interactions with cellular automata for the simulation of global phenomena for PSSP and achieved a Q3 accuracy of 56.51%. Yao et al. [26] combined DBN and a three-layered neural network having feed-forward and back-propagation layers and achieved a Q3 accuracy of 78.1%. Bidargaddi et al. [27] introduced a hybrid model with three variants, combining Bayesian segmentation based on a segmental semi-Markov model as the first stage, neural networks as the second stage, and neural network ensembles as the last stage, and obtained a Q3 accuracy of 70.89%. Kountouris et al. [28] used backbone dihedral angles of proteins to improve the predictive accuracy and obtained a Q3 accuracy of 80%. Malekpour et al. [29] used bidirectional dependency information of amino acids and a weighted model, for the probability estimation of the segments in structural classes and the calculation of the probability of each segment in the structure respectively, and obtained a Q3 accuracy of 66.48%. Zhou et al. [30] used the associate classification algorithm concept to propose a compound pyramid model and obtained a Q3 accuracy of 80.49%. Bouziane et al. [31] used an ensemble method to utilize the PSSM profiles, which combines the outputs of a k-nearest neighbor classifier, two feed-forward artificial neural networks, and three multi-class SVM classifiers, and obtained a Q3 accuracy of 76.34%. Zangooei et al. [32] used support vector regression and the non-dominated sorting genetic algorithm-II based on PSSM profiles and obtained a Q3 accuracy of 84.94%. Yang et al. [33] used a large margin nearest neighbor classification method for PSSP and obtained a Q3 accuracy of 75.44%. Ghanty et al. [34] used position-specific probability-based features with neuro-SVM on the single sequence and obtained a Q3 accuracy of 68%. Drozdetskiy et al. [35] introduced JPRED, a PSSP server which achieved a Q3 accuracy of 81.7%. Wang et al. [36] used deep recurrent encoder-decoder networks for PSSP and obtained a Q3 accuracy of 82.9% on the same benchmark dataset.
Wang et al. [37] utilized DeepCNF with a conditional random field for PSSP and attained a Q8 accuracy of 68.3% and a Q3 accuracy of 82.3%. Li et al. [12] used a multi-scale CNN followed by three stacked BRNN layers and attained a Q8 accuracy of 69.7%. Busia et al. [38] used a novel chained CNN with next-step conditioning to achieve a Q8 accuracy of 71.4%. Lin et al. [39] presented a deep CNN architecture connected with a multi-layer shift-and-stitch for PSSP and attained a Q8 accuracy of 68.4%. Pollastri et al. [40] used an RNN and profiles to improve 8-class PSSP and attained a Q8 accuracy of 51.1%. Sonderby et al. [15] utilized BRNNs with long-short term memory (LSTM) cells for PSSP and attained a Q8 accuracy of 67.4%. Guo et al. [5] used a 2-dimensional CNN with two stacked BRNNs and achieved a Q8 accuracy of 70.2%. The summary of the literature survey is shown in Table 1. Thus, deep learning methods have proved useful for PSSP [7, 11, 41] and have been followed with great attention by researchers in proteomics.
Inspired by the recent progress and effectiveness of deep neural networks in PSSP, we propose a hybrid deep learning framework, 2-dimensional convolutional bidirectional recurrent neural networks with hybrid profile features of PSSM profiles [45] and HMM profiles [46], to improve the Q3 and Q8 accuracy of 3-class and 8-class PSSP respectively. This framework combines 2-dimensional convolutional or 2-dimensional pooling operations with bidirectional gated recurrent units (BGRUs) and bidirectional long-short term memory (BLSTM) units, forming four models: 2DConv-BGRUs, 2DConv-BLSTM, 2DCNN-BGRUs, and 2DCNN-BLSTM. The first two models consist of only 2-dimensional convolutional operations, while the other two models contain both 2-dimensional convolutional operations and 2-dimensional pooling operations.

Table 1: Summary of Related Works based on CB513 Dataset

| Authors | Proposed Models | Results |
|---|---|---|
| Jones et al. [11] (1999) | PSI-PRED: a two-stage neural network for PSSP based on PSSM | Q3 - 79.2% |
| Wang et al. [37] (2011) | Raptor-X: 8-class PSSP using deep CNN with conditional random field | Q3 - 78.3%, Q8 - 68.3% |
| Faraggi et al. [42] (2012) | SPINE-X: multi-step learning combined with solvent accessible surface area and torsion angle prediction | Q3 - 78.9% |
| Magnan et al. [43] (2014) | SSPro-SS8: used profiles, machine learning and structural similarity for PSSP | Q3 - 78.5%, Q8 - 63.5% |
| Mao et al. [44] (2016) | used deep learning method, knowledge-based systems | Q3 - 82.9% |
| Pollastri et al. [40] (2002) | used RNNs and profiles | Q8 - 51.1% |
| Sonderby et al. [15] (2014) | used BRNNs with LSTM cells | Q8 - 67.4% |
| Lin et al. [39] (2016) | used deep CNN architecture with a multi-layer shift-and-stitch for PSSP | Q8 - 68.4% |
| Busia et al. [38] (2017) | used novel chained CNNs and next-step conditioning | Q8 - 71.4% |
| Guo et al. [5] (2018) | used hybrid deep learning framework, 2-dimensional CNN with two BRNN layers | Q8 - 70.2% |
1.2. Motivation and Contributions
PSSP is one of the essential and challenging steps toward protein tertiary structure prediction and functional analysis. Extracting a compelling feature set from the protein primary structure sequences is key to improving the accuracy of PSSP. In the literature, the majority of previous studies [11, 31, 32] showed that features from the evolutionary-based PSSM profile achieve satisfactory results. However, there is ample scope to explore features from the hidden Markov model (HMM) based evolutionary profile to enhance the prediction accuracy of PSSP. The HHblits tool [46] generates HMM profiles, and it is a popular tool for the identification and alignment of remote homology blocks for protein sequences.
In this study, the main contributions are as follows:

• To the best of our knowledge, this is the first work to extract hybrid features from two evolutionary-based profiles, namely, hidden Markov model (HMM) profiles from HHblits [46] and position-specific scoring matrices (PSSM) profiles from PSI-BLAST [45].
• A deep learning framework consisting of a CNN and a BRNN is proposed to utilize the extracted hybrid profile features to solve PSSP effectively.
• The proposed model is evaluated on four publicly available benchmark datasets using well-known metrics, such as Q3 (3-class elements) accuracy, Q8 (8-class elements) accuracy, and segment overlap (Sov) scores for 3-class and 8-class elements [8].
• The quality of the Q3 results of the proposed model is validated by performing protein structural class prediction, and the results are compared with PSI-PRED.
The remainder of the article is organized as follows: Section 2 describes the datasets used and elaborates the proposed methodology. Section 3 presents a detailed analysis of the results of the proposed model. Section 4 concludes with possible future works.
2. Materials and Methodology
2.1. Dataset
The datasets used for PSSP are collections of protein sequences from four publicly available benchmark datasets: CB6133 [14], CB513 [14], CASP10 [47], and CASP11 [48]. The CB6133 dataset is used to train and test the proposed deep learning framework. The CB513, CASP10, and CASP11 datasets are used only for testing.

The CB6133 dataset contains 6133 protein primary sequences and is divided by record index into [0, 5600) for training, [5600, 5877) for testing, and [5877, 6133) for validation of the proposed deep learning framework. The remaining datasets are used entirely for testing: the CB513 dataset contains 513 protein sequences, the CASP10 dataset contains 123 protein sequences, and the CASP11 dataset contains 105 protein sequences.
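The CB6133 partition quoted above can be written down as half-open index ranges. The following is a minimal sketch; the names `SPLITS` and `split_sizes` are illustrative and are not taken from the authors' code.

```python
# Half-open index ranges for the CB6133 split quoted in the text.
# All names here are hypothetical and only serve the illustration.
N_PROTEINS = 6133
SPLITS = {
    "train": range(0, 5600),
    "test": range(5600, 5877),
    "validation": range(5877, 6133),
}


def split_sizes(splits):
    """Return the number of proteins in each partition."""
    return {name: len(indices) for name, indices in splits.items()}


sizes = split_sizes(SPLITS)
print(sizes)
```

Under these ranges the three partitions are disjoint and together cover all 6133 records.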
The datasets used in this study are available at PSSP Datasets.

2.2. Proposed Methodology
The proposed methodology is organized into four major modules: generation of Hybrid Profile based Features, the Local Interaction Extraction layer, the Long-Range Interaction Extraction layer, and the Output layer. The detailed framework of the proposed model is shown in Figure 1.

Discriminating features are extracted from a hybrid of two profiles, namely, PSSM and HMM. The PSSM sequence profile is generated using the PSI-BLAST method [45], in which highly similar protein sequences are searched from a database to extract PSSM profiles. The HHblits method [46] is used to search remote homology sequences from a database to generate HMM profiles.
[Figure 1 panel labels: Profile based Features PSI-BLAST (700 × 21); Profile based Features HHblits (700 × 21); Hybrid Features (700 × 42); The Local Interaction Extractor; The Long Range Interaction Extractor (BRNN blocks); Dense Feature Layer; Output Layer]

Figure 1: The detailed framework of the Proposed Model
The local interactions between amino acid residues (LI) are extracted using a combination of a 2-dimensional convolutional layer and a 2-dimensional max-pooling layer. The long-range interactions between amino acid residues (LRI) are further extracted using BRNNs with GRUs and LSTM units. All the extracted features are finally fed into the output layer, which uses softmax as the activation function for 3-class and 8-class PSSP.

2.2.1. Hybrid Profile based Features
We extracted PSSM profiles using the PSI-BLAST method [45] on the Uniref90 database with an E-value cutoff of 0.001 in three iterations. PSI-BLAST outputs two matrices, namely, the log-odds matrix and the linear substitution probability matrix. Each matrix is of size L × 21, where L is the length of the protein sequence and 21 indicates the 20 amino acid residues plus one unknown residue. In this study, we have considered only the linear substitution probability matrices.
HMM profiles are generated using the HHblits method [46] on the UniProt20 and UniClust30 databases with an E-value cutoff of 0.001 in four iterations. HHblits outputs a matrix of size L × 30, where L is the length of the protein sequence; the first 20 of the 30 columns indicate the substitution probabilities of the amino acid residues, and the last 10 columns represent the probabilities of three states (Insertion, Deletion, and Match) together with the number of occurrences of each state in the alignment process. In this study, we have considered only 21 of the 30 columns (the 20 amino acid substitution columns and one column from the number of occurrences of the match state).
The input protein sequences vary in length. In this study, we have fixed the protein sequence length to 700: protein sequences with fewer than 700 residues are padded, and protein sequences exceeding 700 residues are truncated to 700. Hence, the profiles generated for a protein sequence by PSI-BLAST and HHblits each have a shape of 700 × 21. These generated profiles are sparse by nature and are converted into dense form in two steps: non-zero values are transformed by the logistic function, and the entire matrix is normalized by adding 0.05 (1/20).
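The length fixing and densifying steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say what value is used for padding or in which order the steps run, so zero padding followed by densifying is assumed here, and the function names are hypothetical.

```python
import numpy as np

MAX_LEN = 700  # fixed sequence length stated in the text


def pad_or_truncate(profile, max_len=MAX_LEN):
    """Return a (max_len, n_features) array; zero padding is an assumption."""
    length, n_features = profile.shape
    fixed = np.zeros((max_len, n_features), dtype=float)
    keep = min(length, max_len)
    fixed[:keep] = profile[:keep]
    return fixed


def densify(profile, offset=0.05):
    """Apply the logistic function to non-zero entries, then add the offset."""
    dense = profile.astype(float)  # astype returns a new array
    nonzero = dense != 0
    dense[nonzero] = 1.0 / (1.0 + np.exp(-dense[nonzero]))
    return dense + offset


# Toy profile with 2 residues and 2 feature columns, padded to length 3.
toy = np.array([[0.0, 2.0], [-1.0, 0.0]])
fixed = pad_or_truncate(toy, max_len=3)
dense = densify(fixed)
```

With this reading, every zero entry (including padding) ends up at 0.05, and every non-zero entry lies strictly between 0.05 and 1.05.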
For a given set of M protein sequences from a dataset, the generated PSSM and HMM profile feature sets have shapes of M × 700 × 21 and M × 700 × 21 respectively. Both profile feature sets are fused to form a hybrid feature set of shape M × 700 × 42. The hybrid protein feature set is represented as $PFS = \{PF_1, PF_2, PF_3, \ldots, PF_n\}$, $PF_i \in \mathbb{R}^{42}$, and the secondary structure labels are represented as $CLS = \{CL_1, CL_2, CL_3, \ldots, CL_n\}$, $CL_i \in \mathbb{R}^{3,8}$.
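The fusion of the two profile feature sets amounts to a concatenation along the feature axis. A minimal sketch, assuming the PSSM columns come first (the column order is not stated in the text) and using the hypothetical name `fuse_profiles`:

```python
import numpy as np


def fuse_profiles(pssm, hmm):
    """Concatenate per-residue PSSM and HMM features along the last axis."""
    if pssm.shape != hmm.shape:
        raise ValueError("PSSM and HMM feature sets must share a shape")
    return np.concatenate([pssm, hmm], axis=-1)


M = 4  # toy number of proteins
pssm = np.zeros((M, 700, 21))
hmm = np.ones((M, 700, 21))
hybrid = fuse_profiles(pssm, hmm)
print(hybrid.shape)
```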
2.2.2. Local Interaction Extraction

To extract the local interactions of amino acid residues (LI) from protein sequences, a combination of a 2-dimensional convolutional layer and a 2-dimensional max-pooling layer is applied over the 2-dimensional protein sequence profile of shape 700 × 42.
The generated matrix $PFS = \{PF_1, PF_2, \ldots, PF_n\}$ is obtained from the profile features of the PSSM and HMM matrices using PSI-BLAST and HHblits, where $PFS \in \mathbb{R}^{m \times k}$, and $PF_j$ is the pre-processed feature vector (of length $k$) of the $j$th amino acid residue in the protein primary sequence. In the convolutional layer for local interaction extraction, the 2-dimensional convolutional filter $CF \in \mathbb{R}^{cf_1 \times cf_2}$ is applied to $cf_1$ profile feature vector of PSI-BLAST and $cf_2$ profile feature vector with the rectified linear unit (ReLu) activation function, as shown in Equation (1).

$$fm_{i,j} = \mathrm{ReLu}\left(CF \odot PF_{i:i+cf_1-1,\; j:j+cf_2-1} + Bias\right) \tag{1}$$
Every possible window of the matrix $PFS$ is used to produce a feature map $fm$ by applying the filter $CF$, as shown in Equation (2).

$$fm = [fm_{1,1}, fm_{1,2}, fm_{1,3}, \ldots, fm_{m-cf_1+1,\; k-cf_2+1}] \tag{2}$$

Equation (2) describes the filter process. The convolutional layer can contain multiple filters of different sizes to learn higher mutual interactions.
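Equations (1)–(2) can be illustrated with a single filter in plain NumPy. This is a sketch, not the authors' Keras layer: it assumes no padding, so the feature map has shape (m − cf1 + 1) × (k − cf2 + 1) as in Equation (2), and it reads the filter application as an element-wise product summed over the window. The name `conv2d_relu` is hypothetical.

```python
import numpy as np


def conv2d_relu(pfs, cf, bias=0.0):
    """Slide the filter over every window, sum the products, apply ReLU."""
    m, k = pfs.shape
    cf1, cf2 = cf.shape
    fm = np.empty((m - cf1 + 1, k - cf2 + 1))
    for i in range(fm.shape[0]):
        for j in range(fm.shape[1]):
            window = pfs[i:i + cf1, j:j + cf2]
            fm[i, j] = max(0.0, float(np.sum(cf * window) + bias))
    return fm


# Toy 4 x 3 profile and a 3 x 3 filter of ones.
pfs = np.arange(12, dtype=float).reshape(4, 3)
cf = np.ones((3, 3))
fm = conv2d_relu(pfs, cf, bias=-40.0)
print(fm)
```

In the toy example the first window sums to 36 and is clipped to zero after the bias is added, while the second window sums to 63 and passes through.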
After the convolutional layer, 2-dimensional max-pooling operations $mp \in \mathbb{R}^{q_1 \times q_2}$ are further applied to extract the maximum value over each $q_1 \times q_2$ window of the feature map $fm$, as shown in Equation (3).

$$mp_{i,j} = f\left(fm_{i:i+q_1,\; j:j+q_2}\right) \tag{3}$$

$f(\cdot)$ in Equation (3) represents the 2-dimensional max-pooling function. After these pooling operations, the results are as shown in Equation (4).

$$mp = \left[mp_{1,1},\; mp_{1,1+q_2},\; \ldots,\; mp_{1+\left(\frac{m-cf_1+1}{q_1}-1\right)\cdot q_1,\; 1+\left(\frac{k-cf_2+1}{q_2}-1\right)\cdot q_2}\right] \tag{4}$$
2.2.3. Long-Range Interaction Extraction
In the previous step, 2-dimensional convolutional and 2-dimensional pooling operations are applied, and the convolutional layer is able to extract LI. However, due to the kernel size limitation, some of the non-local interactions between amino acid residues may not be captured. To retrieve these non-local interactions, i.e., long-range interactions, two BRNN [19, 40] layers with LSTM cells and GRUs are introduced in the proposed architecture. The extracted long-range and local interactions between amino acid residues carry rich information (to predict structural class) and are fed into the output layer to obtain the 3-class and 8-class PSSPs.
A BRNN uses both the past and future contexts of each residue in a finite protein sequence to predict or label that residue. To predict each residue of the sequence, the protein sequence is processed in both directions (i.e., left to right and right to left), and the outputs of the two directions are concatenated.
It is well known that RNNs tend to suffer from the vanishing gradient problem, and LSTMs and GRUs are generally used to overcome it. The proposed model utilizes BRNNs with LSTM and GRU units to extract long-range interactions between amino acid residues. The structures of these units are shown in Figure 2.
The GRU unit consists of one memory cell and two gates, namely, the update gate (u) and the reset gate (r), as shown in Figure 2a. These gates determine whether information is to be remembered or not in the network. On the other hand, the LSTM unit consists of one memory cell, one output state, and three gates known as the input gate (i), forget gate (f), and output gate (o), as shown in Figure 2b. In the network, the forget gates of the LSTM prevent backpropagated errors from vanishing, whereas the LSTM memory cells help to memorize the previously performed tasks. GRUs are similar to LSTM cells. However, because GRUs lack an output gate, they have fewer parameters than LSTMs.
Figure 2: GRU and LSTM. (a) Gated Recurrent Units (GRUs). (b) Long-Short Term Memory (LSTM) units.

The mechanism of the GRU is shown in Figure 2a. The input feature ($i$) to the GRU unit at time $t$, together with the previous state ($s$) information, is represented as $(i_t, s_{t-1})$, and the update of the GRU is given in Equations (5)–(8).

$$r_t = \sigma\left(W_{xr}\, i_t + W_{sr}\, s_{t-1} + B_r\right) \tag{5}$$

$$u_t = \sigma\left(W_{lu}\, i_t + W_{su}\, s_{t-1} + B_u\right) \tag{6}$$

$$\tilde{s}_t = \tanh\left(W_{l\tilde{s}}\, i_t + W_{s\tilde{s}}\,(r_t \odot s_{t-1}) + B_{\tilde{s}}\right) \tag{7}$$

$$s_t = u_t \odot s_{t-1} + (1 - u_t) \odot \tilde{s}_t \tag{8}$$

In Equations (5)–(8), $r_t$, $u_t$, $\tilde{s}_t$, and $s_t$ are the activations of the reset gate, update gate, internal memory cell, and GRU output respectively. $W$ is the weight matrix and $B$ is the bias term. In addition, $\odot$, $\sigma(\cdot)$, and $\tanh(\cdot)$ represent element-wise multiplication, the sigmoid function, and the hyperbolic tangent function respectively.
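A single GRU update in the spirit of Equations (5)–(8) can be sketched in NumPy. The gate terms below use the previous state $s_{t-1}$, as in the standard GRU formulation. The parameter dictionary, its key names, the hidden size of 8, and the random initialization are all assumptions made for illustration, not the authors' settings.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(n_in, n_hidden, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    p = {}
    for name in ("W_xr", "W_lu", "W_ls"):   # input-to-hidden weights
        p[name] = rng.normal(scale=0.1, size=(n_hidden, n_in))
    for name in ("W_sr", "W_su", "W_ss"):   # state-to-hidden weights
        p[name] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    for name in ("B_r", "B_u", "B_s"):      # biases
        p[name] = np.zeros(n_hidden)
    return p


def gru_step(i_t, s_prev, p):
    """One GRU update: reset gate, update gate, candidate state, new state."""
    r_t = sigmoid(p["W_xr"] @ i_t + p["W_sr"] @ s_prev + p["B_r"])
    u_t = sigmoid(p["W_lu"] @ i_t + p["W_su"] @ s_prev + p["B_u"])
    s_tilde = np.tanh(p["W_ls"] @ i_t + p["W_ss"] @ (r_t * s_prev) + p["B_s"])
    return u_t * s_prev + (1.0 - u_t) * s_tilde


rng = np.random.default_rng(0)
params = init_gru(n_in=42, n_hidden=8, rng=rng)
s_t = gru_step(rng.normal(size=42), np.zeros(8), params)
print(s_t.shape)
```

A backward GRU for the bidirectional case would apply the same step while iterating over the sequence from right to left.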
The mechanism of the LSTM is shown in Figure 2b. The input feature ($m$) to the LSTM unit at time $t$, together with the previous state ($h$) information, is represented as $(m_t, h_{t-1})$, and the update operations for the LSTM unit are given in Equations (9)–(13).

$$f_t = \sigma\left(W_{xf}\, m_t + W_{hf}\, h_{t-1} + B_f\right) \tag{9}$$

$$i_t = \sigma\left(W_{xi}\, m_t + W_{hi}\, h_{t-1} + B_i\right) \tag{10}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc}\, m_t + W_{hc}\, h_{t-1} + B_c\right) \tag{11}$$

$$o_t = \sigma\left(W_{xo}\, m_t + W_{ho}\, h_{t-1} + B_o\right) \tag{12}$$

$$h_t = o_t \odot \tanh\left(c_t\right) \tag{13}$$

In Equations (9)–(13), $f_t$, $i_t$, and $o_t$ are the activations of the forget gate, input gate, and output gate respectively, and $c_t$ is the current cell state.
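A single LSTM update in the spirit of Equations (9)–(13) can be sketched the same way, with the recurrent terms acting on the previous hidden state $h_{t-1}$ as in the standard LSTM formulation. The parameter names, sizes, and initialization below are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(n_in, n_hidden, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    p = {}
    for gate in "fico":
        p["W_x" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_in))
        p["W_h" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        p["B_" + gate] = np.zeros(n_hidden)
    return p


def lstm_step(m_t, h_prev, c_prev, p):
    """One LSTM update returning the new hidden state and cell state."""
    f_t = sigmoid(p["W_xf"] @ m_t + p["W_hf"] @ h_prev + p["B_f"])  # forget gate
    i_t = sigmoid(p["W_xi"] @ m_t + p["W_hi"] @ h_prev + p["B_i"])  # input gate
    c_t = f_t * c_prev + i_t * np.tanh(
        p["W_xc"] @ m_t + p["W_hc"] @ h_prev + p["B_c"])            # cell state
    o_t = sigmoid(p["W_xo"] @ m_t + p["W_ho"] @ h_prev + p["B_o"])  # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


rng = np.random.default_rng(0)
params = init_lstm(n_in=42, n_hidden=8, rng=rng)
h_t, c_t = lstm_step(rng.normal(size=42), np.zeros(8), np.zeros(8), params)
print(h_t.shape, c_t.shape)
```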
In PSSP, the secondary structure of the amino acid residue at any position does not depend solely on the preceding amino acid residues; it also depends on the following amino acid residues in the protein sequence. To address this, BRNNs are used for PSSP: the outputs of two RNN layers operating in opposite directions are concatenated, so that both the preceding and succeeding amino acid residues are considered at each position. Using the BRNN, two types of models are designed with GRUs and LSTM units. The first model contains forward and backward GRUs, and the second model contains forward and backward LSTM units.
To illustrate how these two models extract LRI, a generic BRNNs model is used as an example. At time $t$, the output features captured by the 2-dimensional convolutional layers are fed into the BRNNs to extract the bidirectional information of amino acid residues present in the protein sequences. The forward and backward hidden states of the BRNNs are calculated with Equations (14)–(15), where $i$ and $s_t^{i-1}$ denote the layer index and the concatenated output feature of the preceding layer in the stacked BRNNs respectively. These two parts are then concatenated to obtain the final feature set, as shown in Equation (16). In Equation (16), $X$ represents $cf$ or $mp$, where $cf$ is the feature set captured by the 2-dimensional convolutional layer operations as per Equations (1)–(2), and $mp$ is the feature set captured by the 2-dimensional convolutional layer and 2-dimensional max-pooling layer operations as per Equations (1)–(3).

$$\overrightarrow{s}_t^{\,i} = \overrightarrow{BRNNs}\left(s_t^{i-1},\; s_{t-1}^{i}\right) \tag{14}$$

$$\overleftarrow{s}_t^{\,i} = \overleftarrow{BRNNs}\left(s_t^{i-1},\; s_{t+1}^{i}\right) \tag{15}$$

$$s_t = \left[\overrightarrow{h}_t^{\,i};\; \overleftarrow{h}_t^{\,i};\; X\right] \tag{16}$$
Moreover, two BRNNs with different units (GRU or LSTM) are stacked together to improve performance. In the BGRU variation, the hidden layers are updated using Equations (5)–(8) to extract LRI, and BRNNs(·) in Equations (14)–(15) represents the BGRU model. In the BLSTM variation, the hidden layers are updated using Equations (9)–(13) to extract LRI, and BRNNs(·) in Equations (14)–(15) represents the BLSTM model.

2.2.4. Output layer
After the extraction of both LI and LRI, the obtained feature set is fed into the last layer, which is a fully connected layer, to predict the protein secondary structure. These protein features are recorded as $f = [f_1, f_2, f_3, \ldots, f_T]$. In the fully connected layer of our proposed model, softmax is used as the activation function to compute the probability of the class of each amino acid residue, as shown in Equations (17)–(18).

$$p_i(y \mid z) = \mathrm{softmax}\left(w_z h + b_z\right) \tag{17}$$

$$\mathrm{softmax}(x) = \frac{e^{x}}{\sum e^{x}} \tag{18}$$
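The output layer of Equations (17)–(18) can be sketched as a per-residue affine map followed by softmax. The code below uses a row-vector convention (`h @ w_z`) and subtracts the row maximum before exponentiating for numerical stability; the feature width of 600 and the zero-valued toy weights are assumptions made for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)


def output_layer(h, w_z, b_z):
    """Class probabilities for every residue: softmax(h @ w_z + b_z)."""
    return softmax(h @ w_z + b_z)


h = np.zeros((700, 600))   # hypothetical per-residue features
w_z = np.zeros((600, 8))   # 8 output classes (3 columns for 3-class PSSP)
b_z = np.zeros(8)
probs = output_layer(h, w_z, b_z)
print(probs.shape)
```

The predicted class of each residue is then the index of the largest probability in its row, for instance `probs.argmax(axis=-1)`.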
A stochastic gradient descent algorithm, i.e., Adam [49], and error backpropagation are used to train the proposed model in all the experiments. The primary goal of training is to minimize the cross-entropy loss function shown in Equation (19). The optimization of all the parameters is performed according to Equation (20), where $\varphi$ is the parameter set, $\beta$ is the learning rate, and $\gamma$ is the L2 regularization hyper-parameter.

$$L(\varphi) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{c} y_{ij}\log\left(p_{ij}\right) + \gamma\,\|\varphi\|^{2} \tag{19}$$
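The regularized loss of Equation (19) and a parameter update can be sketched as follows. The update is written as a plain descent step, i.e., the parameters move against the gradient; the authors train with Adam, which rescales this step, so `gradient_step` is only a simplified stand-in. The function names and the small `eps` guard inside the logarithm are assumptions.

```python
import numpy as np


def cross_entropy_l2(y_true, p_pred, phi, gamma, eps=1e-12):
    """Mean cross-entropy over N samples plus gamma times the squared L2 norm."""
    n = y_true.shape[0]
    ce = -np.sum(y_true * np.log(p_pred + eps)) / n
    return ce + gamma * np.sum(phi ** 2)


def gradient_step(phi, grad, beta):
    """Plain gradient-descent update with learning rate beta."""
    return phi - beta * grad


y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])       # one-hot labels, c = 3
p_pred = np.array([[0.5, 0.25, 0.25],
                   [0.25, 0.5, 0.25]])     # predicted probabilities
phi = np.array([1.0, 2.0])                 # toy parameter vector
loss = cross_entropy_l2(y_true, p_pred, phi, gamma=0.1)
new_phi = gradient_step(phi, np.array([0.5, -1.0]), beta=0.1)
print(loss, new_phi)
```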
$$\varphi \leftarrow \varphi - \beta\,\frac{\partial L(\varphi)}{\partial \varphi} \tag{20}$$

2.3. Evaluation metrics

In this paper, four metrics are used to evaluate the performance of the proposed model: Q3 accuracy, Q8 accuracy, and the Sov score [8] for 3-class and 8-class PSSP. Q3 and Q8 accuracy are calculated using Equations (21)–(24):
For a single conformational state $i$, $Q_i$, $Q_i^{pre}$, and $Q_i^{all}$ are defined as follows:

$$Q_i = \frac{\eta_{\text{correctly predicted residues in state } i}}{\eta_{\text{actual residues in state } i}} \times 100 \tag{21}$$

where $i \in \{H, E, C\}$ or $i \in \{B, E, I, G, H, S, T, L\}$.

$$Q_i^{pre} = \frac{\eta_{\text{correctly predicted residues in state } i}}{\eta_{\text{residues in state } i \text{ in the prediction result}}} \times 100 \tag{22}$$

$$Q_i^{all} = \frac{\eta_{\text{correctly predicted residues (both state } i \text{ and non-}i\text{)}}}{\eta_{\text{residues in the prediction result}}} \times 100 \tag{23}$$
The m-class overall per-amino acid residue accuracy $Q_m$, where $m$ is 3 or 8, is defined as:

$$Q_m = \frac{\eta_{\text{correctly predicted amino acid residues}}}{\eta_{\text{total residues}}} \times 100 \tag{24}$$
where $\eta_x$ represents the number of $x$.

For structure $i$, the set of segments of structure $i$ in the actual protein secondary structure is denoted as $S_1^i$, and the set of segments of structure $i$ in the predicted protein secondary structure is denoted as $S_2^i$, where $i \in \{H, E, C\}$ for 3-class protein secondary structures and $i \in \{L, B, E, G, I, H, S, T\}$ for 8-class protein secondary structures. Let $S(i)$ denote the set of pairs $(s_1^i, s_2^i)$ for which there is some overlap between the predicted segment $s_2^i$ and the actual segment $s_1^i$, where $s_2^i \in S_2^i$ and $s_1^i \in S_1^i$. Suppose the total number of amino acid residues present in a protein sequence is $N$. Then the Sov score is defined as:

$$Sov = \frac{1}{N}\sum_{i}\sum_{S(i)}\left[\frac{\mathrm{minov}(s_1^i, s_2^i) + \delta(s_1^i, s_2^i)}{\mathrm{maxov}(s_1^i, s_2^i)} \times |s_1^i|\right] \times 100 \tag{25}$$

where $i \in \{H, E, C\}$ for 3-class PSSP and $i \in \{L, B, E, G, I, H, S, T\}$ for 8-class PSSP, $|s|$ is the length of segment $s$, $\mathrm{minov}(s_1^i, s_2^i)$ is the length of the overlap of the two segments $s_1^i$ and $s_2^i$, $\mathrm{maxov}(s_1^i, s_2^i)$ is the total extent of both segments and is calculated by Equation (26), and $\delta(s_1^i, s_2^i)$ is defined by Equation (27):

$$\mathrm{maxov}(s_1^i, s_2^i) = |s_1^i| + |s_2^i| - \mathrm{minov}(s_1^i, s_2^i) \tag{26}$$

$$\delta(s_1^i, s_2^i) = \min\left\{\mathrm{maxov}(s_1^i, s_2^i) - \mathrm{minov}(s_1^i, s_2^i),\; \mathrm{minov}(s_1^i, s_2^i),\; \left\lfloor \frac{|s_1^i|}{2} \right\rfloor,\; \left\lfloor \frac{|s_2^i|}{2} \right\rfloor\right\} \tag{27}$$
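The per-residue accuracy of Equation (24) and the Sov score of Equations (25)–(27) can be sketched in plain Python. This follows the definitions as written here, with N taken as the sequence length. Labels are given as strings with one character per residue, and the function names are hypothetical.

```python
def q_accuracy(true_ss, pred_ss):
    """Per-residue accuracy in percent (Q3 or Q8, depending on the alphabet)."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("sequences must have equal length")
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return 100.0 * correct / len(true_ss)


def segments(ss):
    """Maximal runs of identical labels as (label, start, end), end exclusive."""
    runs, start = [], 0
    for pos in range(1, len(ss) + 1):
        if pos == len(ss) or ss[pos] != ss[start]:
            runs.append((ss[start], start, pos))
            start = pos
    return runs


def sov(true_ss, pred_ss):
    """Segment overlap score in percent, normalized by the sequence length."""
    total = 0.0
    for label, a1, b1 in segments(true_ss):
        for other, a2, b2 in segments(pred_ss):
            minov = min(b1, b2) - max(a1, a2)   # length of the overlap
            if other == label and minov > 0:    # the pair belongs to S(i)
                len1, len2 = b1 - a1, b2 - a2
                maxov = len1 + len2 - minov     # total extent of both segments
                delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
                total += (minov + delta) / maxov * len1
    return 100.0 * total / len(true_ss)


true_ss = "CCHHHHCC"
pred_ss = "CCHHHCCC"
print(q_accuracy(true_ss, pred_ss), sov(true_ss, pred_ss))
```

In this toy case one helix residue is mispredicted, so the per-residue accuracy drops to 87.5, while the segment-level score stays at 100 because the δ allowance tolerates the one-residue shift of the segment boundary.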
3. Results and Analysis
In this section, the experimental results of the proposed model are analyzed for 3-class and 8-class PSSP on the four publicly available benchmark datasets mentioned before: CB6133, CB513, CASP10, and CASP11. The proposed model is trained on the CB6133 dataset, and the other three datasets are used for testing the trained model. The proposed model for PSSP shows an improvement in performance over popular existing methods.
3.1. Experimental attributes and Implementation

All the experiments are performed on an NVIDIA Tesla M40 workstation containing two GPU nodes with 24GB of memory at each node. The proposed models are implemented with the TensorFlow (https://www.tensorflow.org) library along with Keras (https://keras.io) to build and train the deep learning models. The initial weights of the proposed model are set to the TensorFlow default values. The Adam optimizer is used to train all the layers of the proposed model simultaneously with a batch size of 64. Early stopping and dropout are used to avoid over-training and over-fitting of the proposed model.
In this study, the architecture and hyper-parameters of the proposed model are as follows: the window size of the 2-dimensional convolution layer filter is (3x3). Each feature map has 42 channels. To avoid over-fitting of the proposed model, the local interactions (LI) between amino acid residues of protein sequences are fed into a fully connected layer which consists of 400 hidden units and are then regularized with a dropout of 0.2; the features of the fully connected layer are fed into the two stacked BRNN layers, which have 400 hidden units in each layer. Meanwhile, a dropout of 0.2 is used to regularize both the LI and the LRI of protein sequences. Then the LI and LRI are fed into a fully connected layer which has 600 hidden units and ReLU as its activation function.

3.2. Performance Analysis of the Proposed Model
Two experiments are performed on the proposed model. In the first experiment, the 42-dimensional feature map consists of 21 features from HMM profiles, which are extracted using HHblits against the UniProt20 database, and another 21 features from PSSM profiles, which are extracted using PSI-BLAST against the Uniref90 database. In the second experiment, the 21 features from HMM profiles are extracted using HHblits against the UniClust30 database, and the other 21 features from PSSM profiles are the same as in the first experiment.
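The way the two profiles form a 42-dimensional hybrid feature can be sketched as a per-residue concatenation. This is only an illustration: the helper name `hybrid_features` is hypothetical, the placement of the HMM columns before the PSSM columns is an assumption, and the toy values below are placeholders rather than real HHblits or PSI-BLAST output.

```python
def hybrid_features(hmm_profile, pssm_profile):
    """Concatenate per-residue HMM and PSSM rows into 42-dimensional rows.

    Both inputs are lists with one row per residue and 21 values per row.
    The HMM-first column order is an assumption made for this sketch.
    """
    if len(hmm_profile) != len(pssm_profile):
        raise ValueError("profiles must cover the same number of residues")
    hybrid = []
    for hmm_row, pssm_row in zip(hmm_profile, pssm_profile):
        if len(hmm_row) != 21 or len(pssm_row) != 21:
            raise ValueError("each profile row must hold 21 features")
        hybrid.append(list(hmm_row) + list(pssm_row))
    return hybrid


# Toy input: a 3-residue protein with placeholder profile values.
hmm = [[0.1] * 21 for _ in range(3)]
pssm = [[0.5] * 21 for _ in range(3)]
features = hybrid_features(hmm, pssm)
print(len(features), len(features[0]))  # 3 42
```

Switching between the two experiments then only changes which database the HMM rows were extracted from; the PSSM half of each row stays the same.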
Table 2: The overall performance (Q3 accuracy and Sov score) of the proposed model using HMM profiles based on the UniProt20 database for 3-class PSSP on four publicly available datasets

| Our Models | CB6133 Q3 (%) | CB6133 Sov (%) | CB513 Q3 (%) | CB513 Sov (%) | CASP10 Q3 (%) | CASP10 Sov (%) | CASP11 Q3 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 85.2 | 82.3 | 85.1 | 81.7 | 82.5 | 77.6 | 80.5 | 76.7 |
| 2DCNN-BGRU | 84.1 | 81.1 | 83.7 | 79.4 | 73.8 | 64.2 | 71.8 | 63.4 |
| 2DConv-BLSTM | 84.3 | 80.6 | 83.8 | 79.0 | 80.1 | 73.4 | 77.9 | 71.5 |
| 2DConv-BGRU | 85.0 | 81.9 | 85.0 | 80.6 | 48.9 | 19.6 | 45.4 | 17.7 |
Table 3: The overall performance (Q3 accuracy and Sov score) of the proposed model using HMM profiles based on the UniClust30 database for 3-class PSSP on four publicly available datasets

| Our Models | CB6133 Q3 (%) | CB6133 Sov (%) | CB513 Q3 (%) | CB513 Sov (%) | CASP10 Q3 (%) | CASP10 Sov (%) | CASP11 Q3 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 85.4 | 82.9 | 85.4 | 81.7 | 83.7 | 80.5 | 81.5 | 78.6 |
| 2DCNN-BGRU | 84.8 | 82.0 | 84.4 | 80.2 | 76.9 | 68.7 | 76.4 | 70.2 |
| 2DConv-BLSTM | 83.8 | 79.9 | 82.6 | 77.8 | 76.7 | 68.8 | 73.9 | 66.0 |
| 2DConv-BGRU | 85.3 | 81.6 | 85.0 | 81.0 | 45.0 | 10.5 | 42.4 | 9.6 |
The Q3 results obtained from the proposed model are evaluated against popular existing models such as SSPro8-SS8 [43], RaptorX-SS8 [37], SPINE-X [42], PSI-PRED [11], JPRED [35], Ensemble-WMV [31], SVR-NSGAII (RBF Kernel) [32], NSVM [34], and SSREDNs [36] on four datasets. The performance evaluation is based on Q3 accuracy for 3-class PSSP and is shown in Table 4. The proposed model for 3-class PSSP exhibits an improvement in performance over the existing models, as shown in Table 4. Measured against the best existing result on each dataset (SSREDNs [36] in both cases), the proposed model shows a minimum improvement of 1.2% on the benchmark CB6133 dataset and 2.5% on the CB513 dataset.
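The minimum-improvement figures can be checked directly from the Q3 values reported in Table 4. The short Python sketch below is only a worked calculation; the dictionary layout and the function name `minimum_improvement` are illustrative choices, and `None` marks results that Table 4 reports as unavailable.

```python
# Q3 accuracies (%) of the existing methods, copied from Table 4.
existing_q3 = {
    "SSpro (without template)": {"CB6133": 79.5, "CB513": 78.5},
    "SPINE-X": {"CB6133": 81.7, "CB513": 78.9},
    "PSI-PRED": {"CB6133": 82.5, "CB513": 79.2},
    "JPRED": {"CB6133": 82.9, "CB513": 81.7},
    "RaptorX-SS8": {"CB6133": 81.2, "CB513": 78.3},
    "Ensemble-WMV": {"CB6133": None, "CB513": 76.7},
    "SVR-NSGAII (RBF Kernel)": {"CB6133": None, "CB513": 80.6},
    "NSVM": {"CB6133": None, "CB513": 68.3},
    "SSREDNs": {"CB6133": 84.2, "CB513": 82.9},
}
proposed_q3 = {"CB6133": 85.4, "CB513": 85.4}


def minimum_improvement(dataset):
    """Gap between the proposed model and the best existing method."""
    best_existing = max(
        scores[dataset]
        for scores in existing_q3.values()
        if scores[dataset] is not None
    )
    # Round to one decimal to hide floating-point noise in the subtraction.
    return round(proposed_q3[dataset] - best_existing, 1)


print(minimum_improvement("CB6133"))  # 1.2
print(minimum_improvement("CB513"))   # 2.5
```

Both gaps are differences in accuracy expressed in percentage points, which is how the 1.2% and 2.5% figures in the text should be read.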
Two more experiments are performed to evaluate the proposed 8-class PSSP model. The proposed 8-class PSSP model is assessed on four datasets using Q8 accuracy and Sov score, and the results are shown in Table 5 and Table 6. In the first experiment, the proposed 2DCNN-BLSTM model (HMM profiles extracted from the UniProt20 database) exhibits a Q8 accuracy of 75.8%, 73.5%, 72.2%, and 70.0%, and a Sov score of 77.7%, 75.2%, 72.8%, and 70.6% on the benchmark CB6133, CB513, CASP10, and CASP11 datasets, respectively.
Table 4: The performance (Q3 accuracy) comparison of the proposed model against the benchmark models on the CB6133, CB513, CASP10, and CASP11 datasets. The Q3 accuracies of the benchmark models are taken from their published work. (-) indicates the unavailability of the result.

| Methods | CB6133 Q3 (%) | CB513 Q3 (%) | CASP10 Q3 (%) | CASP11 Q3 (%) |
|---|---|---|---|---|
| SSpro (without template) [43] | 79.5 | 78.5 | 78.5 | 77.6 |
| SPINE-X [42] | 81.7 | 78.9 | 80.7 | 79.3 |
| PSI-PRED [11] | 82.5 | 79.2 | 81.2 | 80.7 |
| JPRED [35] | 82.9 | 81.7 | 81.6 | 80.4 |
| RaptorX-SS8 [37] | 81.2 | 78.3 | 78.9 | 79.1 |
| Ensemble-WMV [31] | - | 76.7 | - | - |
| SVR-NSGAII (RBF Kernel) [32] | - | 80.6 | - | - |
| NSVM [34] | - | 68.3 | - | - |
| SSREDNs [36] | 84.2 | 82.9 | - | - |
| Proposed Model | 85.4 | 85.4 | 83.7 | 81.5 |
In the second experiment, the proposed 2DCNN-BLSTM model (HMM profiles extracted from the UniClust30 database) attains a Q8 accuracy of 74.9%, 73.5%, 70.8%, and 68.1%, and a Sov score of 76.5%, 75.2%, 71.0%, and 68.6% on the benchmark CB6133, CB513, CASP10, and CASP11 datasets, respectively. From both experiments, we can observe that, for 8-class PSSP, the HMM profiles extracted from the UniProt20 database carry more information than the HMM profiles extracted from the UniClust30 database.

Table 5: The overall performance (Q8 accuracy and Sov score) of the proposed model using HMM profiles based on the UniProt20 database for 8-class PSSP on four publicly available datasets

| Our Models | CB6133 Q8 (%) | CB6133 Sov (%) | CB513 Q8 (%) | CB513 Sov (%) | CASP10 Q8 (%) | CASP10 Sov (%) | CASP11 Q8 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 75.8 | 77.7 | 73.5 | 75.2 | 72.2 | 72.8 | 70.0 | 70.6 |
| 2DCNN-BGRU | 73.9 | 75.0 | 73.1 | 74.5 | 42.4 | 38.9 | 41.8 | 39.0 |
| 2DConv-BLSTM | 73.2 | 73.9 | 70.8 | 72.0 | 52.2 | 51.0 | 50.3 | 48.6 |
| 2DConv-BGRU | 74.1 | 75.3 | 72.4 | 74.0 | 64.9 | 65.5 | 62.5 | 63.9 |
Table 6: The overall performance (Q8 accuracy and Sov score) of the proposed model using HMM profiles based on the UniClust30 database for 8-class PSSP on four publicly available datasets

| Our Models | CB6133 Q8 (%) | CB6133 Sov (%) | CB513 Q8 (%) | CB513 Sov (%) | CASP10 Q8 (%) | CASP10 Sov (%) | CASP11 Q8 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 74.9 | 76.5 | 73.5 | 75.2 | 70.8 | 71.0 | 68.1 | 68.6 |
| 2DCNN-BGRU | 74.6 | 75.8 | 72.9 | 74.1 | 46.6 | 43.4 | 46.4 | 43.5 |
| 2DConv-BLSTM | 74.5 | 76.4 | 72.8 | 74.7 | 47.8 | 46.3 | 45.6 | 44.0 |
| 2DConv-BGRU | 74.2 | 75.7 | 72.9 | 74.5 | 25.4 | 13.0 | 23.4 | 10.9 |
The proposed model for 8-class PSSP is evaluated against popular existing models such as DeepCNF-SS8 [44], CNF-SS8 [37], SSPro8-SS8 [43], GSN-SS8 [14], LSTM large [15], SSREDNs [36], Conditioned CNN [38], MUST-CNN [39], 2DCNN-BLSTM with PSSM [5], and 2DConv-BLSTM with PSSM [5] on four benchmark datasets. The performance evaluation is based on the Q8 accuracy for 8-class PSSP and is shown in Table 7.

The proposed 8-class PSSP model exhibits an improvement in performance over the popular existing models, as shown in Table 7. The proposed model shows a minimum improvement of 2.1% on the benchmark CB513 dataset.

The proposed model exhibits an improvement in performance due to the effectiveness of the hybrid features of PSSM and HMM profiles. These hybrid features are integrated into a 2-dimensional convolution layer and a 2-dimensional max pooling layer, followed by two stacked BRNN layers, i.e., BLSTM or BGRU layers. The improvement in the performance of the proposed model is also due to the effective extraction of the LI and LRI of protein primary structure sequences. The experimental results suggest that the proposed 2DCNN-BLSTM model is more expressive on the CB6133 and CB513 datasets in PSSP. This demonstrates that the hybrid features based on PSSM and HMM profiles, along with the deep learning framework, are effective and efficient for 3-class and 8-class PSSP.
3.3. Validation of Q3 Results

An experiment is performed to predict the protein structural class [50] to validate the quality of the obtained Q3 results. In structural class prediction, a protein sequence is classified into one of the four major classes, i.e., α, β, α/β, and α + β.
Table 7: The performance (Q8 accuracy) comparison of the proposed model against the benchmark models on the CB6133, CB513, CASP10, and CASP11 datasets. The Q8 accuracies of the benchmark models are taken from their published work. (-) indicates the unavailability of the result.

| Models | CB6133 Q8 (%) | CB513 Q8 (%) | CASP10 Q8 (%) | CASP11 Q8 (%) |
|---|---|---|---|---|
| SSpro-SS8 [43] (without template) | 66.6 | 63.5 | 64.9 | 65.6 |
| GSN-SS8 [14] | 72.1 | 66.4 | - | - |
| RaptorX-SS8 [37] | 69.7 | 64.9 | 64.8 | 65.1 |
| DeepCNF-SS8 [44] | 75.2 | 68.3 | 71.8 | 72.3 |
| LSTM large [15] | - | 67.4 | - | - |
| SSREDNs [36] | 73.1 | 68.2 | - | - |
| Conditioned CNN [38] | - | 71.4 | - | - |
| MUST-CNN [39] | - | 68.4 | - | - |
| 2DConv-BLSTM with PSSM [5] | 75.7 | 70.2 | 74.5 | 72.5 |
| 2DCNN-BLSTM with PSSM [5] | 74.3 | 70.0 | 74.5 | 72.6 |
| Proposed Model | 75.8 | 73.5 | 72.2 | 70.0 |
In this experiment, we have considered two popular datasets, namely z277 [51] and FC699 [52]. The z277 is a high-similarity dataset consisting of 277 protein sequences, whereas FC699 is a low-similarity dataset consisting of 858 protein sequences. The proposed model is trained on the CB6133 dataset and tested on the z277 and FC699 datasets to obtain Q3 results. For the same z277 and FC699 datasets, Q3 results (sequences) are also obtained using the PSI-PRED server [53].

In a recent investigation on structural class prediction, Bankapur et al. [54] showed that the Word2Vec technique [55] extracts highly discriminating features from a given sequence. Therefore, a set of 400 features is extracted for each dataset using Word2Vec [55] from both the proposed model's Q3 sequences and the PSI-PRED Q3 sequences. Various machine learning classifiers, namely Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and k-Nearest Neighbor (kNN), are used to classify each protein sequence into its structural class using the extracted Word2Vec features, and the results are shown in Table 8.
Table 8: The performance comparison (in percentage) of structural class prediction based on the features extracted from the proposed model Q3 results against PSI-PRED model Q3 results

| Classifiers | z277: Proposed Model | z277: PSI-PRED | FC699: Proposed Model | FC699: PSI-PRED |
|---|---|---|---|---|
| NB | 67.6 | 69 | 84.3 | 76.2 |
| SVM | 81.0 | 80.2 | 86.2 | 86.5 |
| LR | 65.7 | 73.3 | 83.5 | 82.9 |
| MLP | 75.2 | 78.5 | 83.7 | 85.2 |
| KNN | 81.4 | 84.2 | 83.7 | 84.5 |
From Table 8, it can be observed that the structural class prediction accuracy on the proposed model's Q3 sequences is higher than that on the PSI-PRED Q3 sequences. Thus, we can say that the quality of the Q3 sequences obtained from the proposed model is higher than that of the PSI-PRED model.
4. Conclusion and Future Work

In this paper, an effective model is proposed for PSSP from amino acid residue sequences. The proposed model consists of hybrid features of 42 dimensions combined with a CNN and BRNNs. The hybrid features are extracted and selected from the sequence profiles of PSI-BLAST and HHblits. We explored HHblits on two databases, namely UniProt20 and UniClust30, for the extraction of HMM profiles and explored PSI-BLAST on the Uniref90 database for the extraction of PSSM profiles. The HMM and PSSM profiles are combined to obtain a novel hybrid feature set which is fed into a deep neural network. The deep neural network consists of one CNN layer and two BRNN layers, followed by one fully connected layer as an output layer for PSSP. The proposed prediction model is assessed on four publicly available datasets and reported a Q3 accuracy of 85.4%, 85.4%, 83.7%, and 81.5% and a Q8 accuracy of 75.8%, 73.5%, 72.2%, and 70.0% on the CB6133, CB513, CASP10, and CASP11 datasets, respectively. The experimental results demonstrate that the proposed prediction model's accuracy is consistently higher than that of other existing methods, and it exhibits a minimum increase of 2.5% in Q3 accuracy and 2.1% in Q8 accuracy on the CB513 dataset. Further, the quality of the Q3 results is validated by predicting the structural class on the z277 and FC699 datasets, and the structural class prediction results outperformed the PSI-PRED method. From the experimental results, we conclude that an effective model is proposed to address PSSP.

In the future, we would like to explore features based on the physicochemical properties of amino acid residues to optimize the proposed framework. Further, we would like to propose an effective structural class prediction model by consuming the Q3 results of this work.
Acknowledgment
This research work is funded [KSTePS/VGST-RGS/F/GRD No.727/2017-18] by the Vision Group on Science and Technology, Department of Information Technology, Biotechnology and Science & Technology, Govt. of Karnataka, India.
References
[1] D. W. Mount, Bioinformatics: sequence and genome analysis, Vol. 2, Cold Spring Harbor Laboratory Press, New York, 2001.
[2] G. Wang, Y. Zhao, D. Wang, A protein secondary structure prediction framework based on the extreme learning machine, Neurocomputing 72 (13) (2008) 262–268.
[3] L. Pauling, R. B. Corey, H. R. Branson, The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain, Proceedings of the National Academy of Sciences 37 (4) (1951) 205–211.
[4] W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22 (12) (1983) 2577–2637.
[5] Y. Guo, B. Wang, W. Li, B. Yang, Protein secondary structure prediction improved by recurrent neural networks integrated with two-dimensional convolutional neural networks, Journal of bioinformatics and computational biology 16 (5) (2018) 1850021.
[6] J. Cheng, A. N. Tegge, P. Baldi, Machine learning methods for protein structure prediction, IEEE reviews in biomedical engineering 1 (2008) 41–49.
[7] S. Min, B. Lee, S. Yoon, Deep learning in bioinformatics, Briefings in bioinformatics 18 (5) (2017) 851–869.
[8] B. Rost, C. Sander, R. Schneider, Redefining the goals of protein secondary structure prediction, Journal of molecular biology 235 (1) (1994) 13–26.
[9] H. Chen, F. Gu, Z. Huang, Improved chou-fasman method for protein secondary structure prediction, BMC bioinformatics 7 (4) (2006) S14.
[10] B. Rost, C. Sander, Prediction of protein secondary structure at better than 70% accuracy, Journal of molecular biology 232 (2) (1993) 584–599.
[11] D. T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, Journal of molecular biology 292 (2) (1999) 195–202.
[12] Z. Li, Y. Yu, Protein secondary structure prediction using cascaded convolutional and recurrent neural networks, arXiv preprint arXiv:1604.07176.
[13] C. Fang, Y. Shang, D. Xu, Mufold-ss: New deep inception-inside-inception networks for protein secondary structure prediction, Proteins: Structure, Function, and Bioinformatics 86 (5) (2018) 592–598.
[14] J. Zhou, O. G. Troyanskaya, Deep supervised and convolutional generative stochastic network for protein secondary structure prediction, arXiv preprint arXiv:1403.1347.
[15] S. K. Sønderby, O. Winther, Protein secondary structure prediction with long short term memory networks, arXiv preprint arXiv:1412.7828.
[16] E. Gawehn, J. A. Hiss, G. Schneider, Deep learning in drug discovery, Molecular informatics 35 (1) (2016) 3–14.
[17] C. Angermueller, T. Pärnamaa, L. Parts, O. Stegle, Deep learning for computational biology, Molecular systems biology 12 (7) (2016) 878.
[18] E. Asgari, M. R. Mofrad, Continuous distributed representation of biological sequences for deep proteomics and genomics, PloS one 10 (11) (2015) e0141287.
[19] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al., Using recurrent neural networks for slot filling in spoken language understanding, IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3) (2015) 530–539.
[20] B. Zhang, Z. Chen, Y. L. Murphey, Protein secondary structure prediction using machine learning, in: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 1, IEEE, 2005, pp. 532–537.
[21] G. Karypis, Yasspp: better kernels and coding schemes lead to improvements in protein secondary structure prediction, Proteins: Structure, Function, and Bioinformatics 64 (3) (2006) 575–586.
[22] W. Chu, Z. Ghahramani, A. Podtelezhnikov, D. L. Wild, Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction, IEEE/ACM transactions on computational biology and bioinformatics 3 (2) (2006) 98–113.
[23] W. Zhong, G. Altun, X. Tian, R. Harrison, P. C. Tai, Y. Pan, Parallel protein secondary structure prediction schemes using pthread and openmp over hyper-threading technology, The Journal of Supercomputing 41 (1) (2007) 1–16.
[24] Z. Aydin, Y. Altunbasak, H. Erdogan, Bayesian protein secondary structure prediction with near-optimal segmentations, IEEE transactions on signal processing 55 (7) (2007) 3512–3525.
[25] P. Chopra, A. Bender, Evolved cellular automata for protein secondary structure prediction imitate the determinants for folding observed in nature, In silico biology 7 (1) (2007) 87–93.
[26] X.-Q. Yao, H. Zhu, Z.-S. She, A dynamic bayesian network approach to protein secondary structure prediction, BMC bioinformatics 9 (1) (2008) 49.
[27] N. P. Bidargaddi, M. Chetty, J. Kamruzzaman, Combining segmental semi-markov models with neural networks for protein secondary structure prediction, Neurocomputing 72 (16-18) (2009) 3943–3950.
[28] P. Kountouris, J. D. Hirst, Prediction of backbone dihedral angles and protein secondary structure using support vector machines, BMC bioinformatics 10 (1) (2009) 437.
[29] S. A. Malekpour, S. Naghizadeh, H. Pezeshk, M. Sadeghi, C. Eslahchi, A segmental semi markov model for protein secondary structure prediction, Mathematical biosciences 221 (2) (2009) 130–135.
[30] Z. Zhou, B. Yang, W. Hou, Association classification algorithm based on structure sequence in protein secondary structure prediction, Expert Systems with Applications 37 (9) (2010) 6381–6389.
[31] H. Bouziane, B. Messabih, A. Chouarfia, Profiles and majority voting-based ensemble method for protein secondary structure prediction, Evolutionary Bioinformatics 7 (2011) EBO–S7931.
[32] M. H. Zangooei, S. Jalili, Protein secondary structure prediction using dwkf based on svr-nsgaii, Neurocomputing 94 (2012) 87–101.
[33] W. Yang, K. Wang, W. Zuo, Prediction of protein secondary structure using large margin nearest neighbor classification, in: 2011 3rd International Conference on Advanced Computer Control, IEEE, 2011, pp. 202–205.
[34] P. Ghanty, N. R. Pal, R. K. Mudi, Prediction of protein secondary structure using probability based features and a hybrid system, Journal of bioinformatics and computational biology 11 (05) (2013) 1350012.
[35] A. Drozdetskiy, C. Cole, J. Procter, G. J. Barton, Jpred4: a protein secondary structure prediction server, Nucleic acids research 43 (W1) (2015) W389–W394.
[36] Y. Wang, H. Mao, Z. Yi, Protein secondary structure prediction by using deep learning method, Knowledge-Based Systems 118 (2017) 115–123.
[37] Z. Wang, F. Zhao, J. Peng, J. Xu, Protein 8-class secondary structure prediction using conditional neural fields, in: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2010, pp. 109–114.
[38] A. Busia, N. Jaitly, Next-step conditioned deep convolutional neural networks improve protein secondary structure prediction, arXiv preprint arXiv:1702.03865.
[39] Z. Lin, J. Lanchantin, Y. Qi, Must-cnn: a multilayer shift-and-stitch deep convolutional architecture for sequence-based protein structure prediction, in: Thirtieth AAAI conference on artificial intelligence, 2016, pp. 27–34.
[40] G. Pollastri, D. Przybylski, B. Rost, P. Baldi, Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins: Structure, Function, and Bioinformatics 47 (2) (2002) 228–235.
[41] P. Y. Chou, G. D. Fasman, Prediction of protein conformation, Biochemistry 13 (2) (1974) 222–245.
[42] E. Faraggi, T. Zhang, Y. Yang, L. Kurgan, Y. Zhou, Spine x: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles, Journal of computational chemistry 33 (3) (2012) 259–267.
[43] C. N. Magnan, P. Baldi, Sspro/accpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity, Bioinformatics 30 (18) (2014) 2592–2597.
[44] S. Wang, J. Peng, J. Ma, J. Xu, Protein secondary structure prediction using deep convolutional neural fields, Scientific reports 6 (2016) 18962.
[45] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, Gapped blast and psi-blast: a new generation of protein database search programs, Nucleic acids research 25 (17) (1997) 3389–3402.
[46] M. Remmert, A. Biegert, A. Hauser, J. Söding, Hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment, Nature methods 9 (2) (2012) 173.
[47] A. Kryshtafovych, A. Barbato, K. Fidelis, B. Monastyrskyy, T. Schwede, A. Tramontano, Assessment of the assessment: evaluation of the model quality estimates in casp10, Proteins: Structure, Function, and Bioinformatics 82 (2014) 112–126.
[48] J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, A. Tramontano, Critical assessment of methods of protein structure prediction (casp) round x, Proteins: Structure, Function, and Bioinformatics 82 (2014) 1–6.
[49] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[50] M. Levitt, C. Chothia, Structural patterns in globular proteins, Nature 261 (5561) (1976) 552.
[51] G.-P. Zhou, An intriguing controversy over protein structural class prediction, Journal of protein chemistry 17 (8) (1998) 729–738.
[52] L. Kurgan, K. Cios, K. Chen, Scpred: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences, BMC bioinformatics 9 (1) (2008) 226.
[53] L. J. McGuffin, K. Bryson, D. T. Jones, The psipred protein structure prediction server, Bioinformatics 16 (4) (2000) 404–405.
[54] S. Bankapur, N. Patil, Protein secondary structural class prediction using effective feature modeling and machine learning techniques, in: 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, 2018, pp. 18–21.
[55] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.
Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: None