An enhanced protein secondary structure prediction using deep learning framework on hybrid profile based features


Journal Pre-proof

Prince Kumar, Sanjay Bankapur, Nagamma Patil

PII: S1568-4946(19)30707-0
DOI: https://doi.org/10.1016/j.asoc.2019.105926
Reference: ASOC 105926

To appear in: Applied Soft Computing Journal

Received date: 27 May 2019
Revised date: 16 August 2019
Accepted date: 5 November 2019

Please cite this article as: P. Kumar, S. Bankapur and N. Patil, An enhanced protein secondary structure prediction using deep learning framework on hybrid profile based features, Applied Soft Computing Journal (2019), doi: https://doi.org/10.1016/j.asoc.2019.105926.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Elsevier B.V. All rights reserved.

Prince Kumar (a,∗), Sanjay Bankapur (a), Nagamma Patil (a)

(a) National Institute of Technology Karnataka, India - 575025

Abstract

Accurate protein secondary structure prediction (PSSP) is essential for identifying structural classes, protein folds, and tertiary structure. Experimental methods for determining secondary structure offer higher precision, but at a high cost in time and money. In this study, we propose an effective prediction model that combines 42-dimensional hybrid features with a convolutional neural network (CNN) and a bidirectional recurrent neural network (BRNN). The proposed model is assessed on four benchmark datasets, CB6133, CB513, CASP10, and CASP11, using the Q3, Q8, and segment overlap (Sov) metrics. It achieves Q3 accuracies of 85.4%, 85.4%, 83.7%, and 81.5%, and Q8 accuracies of 75.8%, 73.5%, 72.2%, and 70%, on the CB6133, CB513, CASP10, and CASP11 datasets respectively. On the CB513 dataset, these results improve on popular existing models by at least 2.5% in Q3 accuracy and 2.1% in Q8 accuracy. Further, the quality of the Q3 results is validated by structural class prediction and compared with PSI-PRED. The experiment showed that the quality of the Q3 results of the proposed model is higher than that of PSI-PRED.

Keywords: Biological Computing, Convolutional Neural Network, Deep Learning, Protein Secondary Structure, Bidirectional Recurrent Neural Network, Sequence Profiles.

∗ Corresponding author
Email addresses: [email protected] (Prince Kumar), [email protected] (Sanjay Bankapur), [email protected] (Nagamma Patil)

Preprint submitted to Journal of LATEX Templates, August 16, 2019

1. Introduction

Protein primary structure sequences are composed of 20 types of amino acid residues. A linear segment of amino acid residues constitutes the protein primary structure sequence. Protein secondary structure comprises regions stabilized by hydrogen bonds between atoms in the polypeptide backbone. The protein primary sequence folds into a definite 3-dimensional structure, known as the tertiary structure. Protein function is closely related to tertiary structure, which makes the prediction of tertiary structure from a protein primary sequence significant in molecular biology [1]. However, predicting the tertiary structure directly from primary structure sequences is still an uphill task. It is therefore addressed in two steps: first, protein secondary structure prediction (PSSP) from protein primary structure sequences, and then protein tertiary structure prediction from the secondary structure information [2]. Protein secondary structure is described with either 3-class or 8-class elements. The 3-class elements are Strand (E), Helix (H), and Coil (C) [3], whereas the Dictionary of Protein Secondary Structure (DSSP) [4] defines the 8 classes as β-bridge (B), β-sheet (E), 3₁₀-helix (G), α-helix (H), π-helix (I), Bend (S), Turn (T), and other residues (L).

With the rapid development of proteomics, a vast collection of protein sequences has been deposited in protein data banks. Over the years, the number of known protein primary structure sequences has become much larger than the number of known protein secondary structure sequences [5]. This is mainly due to the limitations of experimental methods (such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and electron microscopy), which are time-consuming and cost-intensive for determining protein secondary structure. However, experimental methods exhibit higher precision and are hence widely used for the annotation of protein secondary structures. As these experimental methods are slow and the number of protein sequences in the protein data banks is increasing rapidly, there is a high demand for computational models that predict protein secondary structure. Therefore, effective (accurate) PSSP from protein primary structure sequences is one of the challenging tasks in the field of biological computing.

1.1. Related Works

Protein structure prediction is classified into several levels, ranging from 1-dimensional to 4-dimensional protein structure prediction. PSSP is a 1-dimensional protein structure prediction task [6, 7]. To measure the performance of PSSP, Q3 accuracy, Q8 accuracy, and the Sov score for the 3-class and 8-class elements are taken into account [8]. Q3 and Q8 accuracy compute the percentage of amino acid residues for which the predicted secondary structure is correct.

In early studies [6, 9], the probability of amino acid residues occurring in protein secondary structures was analyzed by statistical methods. These models had a Q3 accuracy lower than 60%, as they were unable to extract local information (features) from the protein primary structure sequences. Then, evolutionary information of proteins [10] and position-specific scoring matrices (PSSM) [11] were taken into consideration, and they proved advantageous for PSSP by achieving a Q3 accuracy of more than 70%. In recent works, Q3 accuracy has gradually improved to more than 80% [12, 13]. However, when these traditional models are used for 8-class PSSP, they fail to achieve good accuracy, and their Q8 accuracy is very low. The reason behind this failure is that they need to distinguish among eight classes of protein secondary structure [12, 14, 15, 13]. Therefore, work that addresses both 3-class and 8-class PSSP seems to be more promising [6, 14].

In recent years, deep neural networks (DNNs) have emerged as the most effective framework and have become established for representation learning on various kinds of data [16, 17, 18, 19]. DNNs show a significant improvement in both Q3 and Q8 accuracy for 3-class and 8-class PSSP respectively [13]. Several popular existing works have been evaluated on the benchmark CB513 dataset. Jones et al. [11] used a two-stage neural network to utilize PSSM profiles for PSSP and obtained a Q3 accuracy of 79.2% on the benchmark CB513 dataset. Zhang et al. [20] used a combination of a Bayesian model and three neural networks and obtained a Q3 accuracy of 74.8%. Karypis et al. [21] used an exponential kernel function and a combined coding scheme, and constructed a cascaded model using a two-stage SVM-based model, which resulted in a Q3 accuracy of 77.83%. Chu et al. [22] used the notion that unrelated proteins may share structural similarity in a group of local structural segments and obtained a Q3 accuracy of 72.23%. Zhong et al. [23] used Pthread and OpenMP to parallelize the Denoeux belief neural network (DBNN) and achieved a Q3 accuracy of 72.01%. Aydin et al. [24] used and extended a hidden semi-Markov model and achieved a Q3 accuracy of 63.71% on the same benchmark dataset. Chopra et al. [25] used localized interactions with cellular automata for the simulation of global phenomena for PSSP and achieved a Q3 accuracy of 56.51%. Yao et al. [26] combined DBN and a three-layered neural network having feed-forward and back-propagation layers and achieved a Q3 accuracy of 78.1%. Bidargaddi et al. [27] introduced a hybrid model with three variants, combining Bayesian segmentation based on a segmental semi-Markov model as the first stage, neural networks as the second stage, and neural network ensembles as the last stage, and obtained a Q3 accuracy of 70.89%. Kountouris et al. [28] used backbone dihedral angles of proteins to improve the predictive accuracy and obtained a Q3 accuracy of 80%. Malekpour et al. [29] used bidirectional dependency information of amino acids and a weighted model, for estimating the probability of the segments in structural classes and calculating the probability of each segment in the structure respectively, and obtained a Q3 accuracy of 66.48%. Zhou et al. [30] used the associative classification algorithm concept to propose a compound pyramid model and obtained a Q3 accuracy of 80.49%. Bouziane et al. [31] used an ensemble method to utilize the PSSM profiles, which combines the outputs of k-nearest neighbor, two feed-forward artificial neural networks, and three multi-class SVM classifiers, and obtained a Q3 accuracy of 76.34%. Zangooei et al. [32] used support vector regression and the non-dominated sorting genetic algorithm-II based on PSSM profiles and obtained a Q3 accuracy of 84.94%. Yang et al. [33] used a large margin nearest neighbor classification method for PSSP and obtained a Q3 accuracy of 75.44%. Ghanty et al. [34] used position-specific probability-based features with neuro-SVM on the single sequence and obtained a Q3 accuracy of 68%. Drozdetskiy et al. [35] introduced JPRED, a PSSP server which achieved a Q3 accuracy of 81.7%. Wang et al. [36] used deep recurrent encoder-decoder networks for PSSP and obtained a Q3 accuracy of 82.9% on the same benchmark dataset.

Wang et al. [37] utilized deepCNF with a conditional random field for PSSP and attained a Q8 accuracy of 68.3% and a Q3 accuracy of 82.3%. Li et al. [12] used a multi-scale CNN followed by three stacked BRNN layers and attained a Q8 accuracy of 69.7%. Busia et al. [38] used a novel chained CNN with next-step conditioning to achieve a Q8 accuracy of 71.4%. Lin et al. [39] presented a deep CNN architecture connected with a multi-layer shift-and-stitch for PSSP and attained a Q8 accuracy of 68.4%. Pollastri et al. [40] used an RNN and profiles to improve 8-class PSSP and attained a Q8 accuracy of 51.1%. Sonderby et al. [15] utilized BRNNs with long-short term memory (LSTM) cells for PSSP and attained a Q8 accuracy of 67.4%. Guo et al. [5] used a 2-dimensional CNN with two stacked BRNNs and achieved a Q8 accuracy of 70.2%. The summary of the literature survey is shown in Table 1. Thus, deep learning methods have proved useful for PSSP [7, 11, 41] and have attracted great attention from researchers in proteomics.

Inspired by the recent progress and effectiveness of deep neural networks in PSSP, we propose a hybrid deep learning framework, 2-dimensional convolutional bidirectional recurrent neural networks, with hybrid profile features from PSSM profiles [45] and HMM profiles [46], to improve the Q3 and Q8 accuracy of 3-class and 8-class PSSP respectively. This framework combines 2-dimensional convolutional or 2-dimensional pooling operations with bidirectional gated recurrent units (BGRUs) and bidirectional long-short term memory (BLSTM) units, forming four models: 2DConv-BGRUs, 2DConv-BLSTM, 2DCNN-BGRUs, and 2DCNN-BLSTM. The first two models consist of only 2-dimensional convolutional operations, while the other two models contain both 2-dimensional convolutional operations and 2-dimensional pooling operations.

Table 1: Summary of Related Works based on CB513 Dataset

| Authors | Proposed Models | Results |
|---|---|---|
| Jones et al. [11] (1999) | PSI-PRED: a two-stage neural network for PSSP based on PSSM | Q3 - 79.2% |
| Wang et al. [37] (2011) | Raptor-X: 8-class PSSP using deep CNN with conditional random field | Q3 - 78.3%, Q8 - 68.3% |
| Faraggi et al. [42] (2012) | SPINE-X: multi-step learning combined with solvent accessible surface area and torsion angle prediction | Q3 - 78.9% |
| Magnan et al. [43] (2014) | SSPro-SS8: used profiles, machine learning and structural similarity for PSSP | Q3 - 78.5%, Q8 - 63.5% |
| Mao et al. [44] (2016) | used deep learning method, knowledge-based systems | Q3 - 82.9% |
| Pollastri et al. [40] (2002) | used RNNs and profiles | Q8 - 51.1% |
| Sonderby et al. [15] (2014) | used BRNNs with LSTM cells | Q8 - 67.4% |
| Lin et al. [39] (2016) | used deep CNN architecture with a multi-layer shift-and-stitch for PSSP | Q8 - 68.4% |
| Busia et al. [38] (2017) | used novel chained CNNs and next-step conditioning | Q8 - 71.4% |
| Guo et al. [5] (2018) | used hybrid deep learning framework, 2-dimensional CNN with two BRNN layers | Q8 - 70.2% |

1.2. Motivation and Contributions

PSSP is one of the essential and challenging steps in protein tertiary structure prediction and functional analysis. Extracting a compelling feature set from the protein primary structure sequences is key to improving the accuracy of PSSP. In the literature, the majority of previous studies [11, 31, 32] showed that features from the evolutionary-based PSSM profile achieve satisfactory results. However, there is ample scope to explore features from the hidden Markov model (HMM) based evolutionary profile to enhance the prediction accuracy of PSSP. The HHblits tool [46] generates HMM profiles, and it is a popular tool for the identification and alignment of remote homology blocks for protein sequences.

In this study, the main contributions are as follows:

• To the best of our knowledge, this is the first work to extract hybrid features from two evolutionary-based profiles, namely, hidden Markov model (HMM) profiles from HHblits [46] and position-specific scoring matrix (PSSM) profiles from PSI-BLAST [45].

• A deep learning framework consisting of a CNN and a BRNN is proposed to utilize the extracted hybrid profile features to solve PSSP effectively.

• The proposed model is evaluated on four publicly available benchmark datasets using well-known metrics, namely Q3 (3-class elements) accuracy, Q8 (8-class elements) accuracy, and segment overlap (Sov) scores for 3-class and 8-class elements [8].

• The quality of the Q3 results of the proposed model is validated by performing protein structural class prediction, and the results are compared with PSI-PRED.

The remainder of the article is organized as follows: Section 2 describes the datasets used and elaborates the proposed methodology. Section 3 presents the detailed result analysis of the proposed model. Section 4 concludes with possible future works.

2. Materials and Methodology

2.1. Dataset

The datasets used for PSSP are collections of protein sequences from four publicly available benchmark datasets: CB6133 [14], CB513 [14], CASP10 [47], and CASP11 [48]. The CB6133 dataset is used to train and test the proposed deep learning framework. The CB513, CASP10, and CASP11 datasets are used only for testing.

The CB6133 dataset contains 6133 protein primary sequences and is divided by record index into [0,5600) for training, [5600,5877) for testing, and [5877,6133) for validation of the proposed deep learning framework. The remaining datasets are used entirely for testing. The CB513 dataset contains 513 protein sequences, the CASP10 dataset contains 123 protein sequences, and the CASP11 dataset contains 105 protein sequences.
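The index-based split of CB6133 can be sketched as follows. This is only an illustration of the half-open intervals quoted above; the names `CB6133_SPLITS` and `split_records` are hypothetical, and the records here are stand-in integers rather than real protein entries.

```python
# Hypothetical helper: the index ranges are the half-open intervals quoted in the text.
CB6133_SPLITS = {
    "train": range(0, 5600),
    "test": range(5600, 5877),
    "validation": range(5877, 6133),
}

def split_records(records):
    """Slice a CB6133-style list of 6133 records into the three subsets."""
    if len(records) != 6133:
        raise ValueError("expected 6133 records")
    return {name: [records[i] for i in idx] for name, idx in CB6133_SPLITS.items()}

# Stand-in records (plain integers) just to show the subset sizes.
parts = split_records(list(range(6133)))
sizes = {name: len(part) for name, part in parts.items()}
```

Under these intervals the three subsets hold 5600, 277, and 256 records, which together cover all 6133 sequences without overlap.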

The datasets used in this study are available at PSSP Datasets.

2.2. Proposed Methodology

The proposed methodology is organized into four major modules: generation of Hybrid Profile based Features, the Local Interaction Extraction layer, the Long-Range Interaction Extraction layer, and the Output layer. The detailed framework of the proposed model is shown in Figure 1.

Discriminating features are extracted from a hybrid of two profiles, namely, PSSM and HMM. The PSSM sequence profile is generated using the PSI-BLAST method [45], in which highly similar protein sequences are searched from a database to extract PSSM profiles. The HHblits method [46] is used to search remote homology sequences from a database to generate HMM profiles.

Figure 1: The detailed framework of the Proposed Model. (Diagram blocks: profile based features from PSI-BLAST (700 × 21) and HHblits (700 × 21), fused into hybrid features (700 × 42); the local interaction extractor; the long-range interaction extractor (BRNN units); the dense feature layer; the output layer.)

The local interactions between amino acid residues (LI) are extracted using a combination of a 2-dimensional convolutional layer and a 2-dimensional max-pooling layer. The long-range interactions between amino acid residues (LRI) are then extracted using BRNNs with GRUs and LSTM units. All the extracted features are finally fed into the output layer, which uses softmax as the activation function for 3-class and 8-class PSSP.

2.2.1. Hybrid Profile based Features

We extracted PSSM profiles using the PSI-BLAST method [45] on the Uniref90 database with a cutoff value of E = 0.001 in three iterations. PSI-BLAST outputs two matrices, namely, the log odds matrix and the linear substitution probability matrix. Each matrix is of size L × 21, where L is the length of the protein sequence and 21 corresponds to the 20 amino acid residues plus one unknown residue. In this study, we have considered only the linear substitution probability matrices.

HMM profiles are generated using the HHblits method [46] on the UniProt20 and UniClust30 databases with a cutoff value of E = 0.001 in four iterations. HHblits outputs a matrix of size L × 30, where L is the length of the protein sequence. The first 20 of the 30 columns give the substitution probabilities of the amino acid residues, and the last 10 columns represent the probabilities of three states (Insertion, Deletion, and Match) together with the number of times each state occurs in the alignment process. In this study, we have considered only 21 of the 30 columns (the 20 amino acid substitution columns and one column for the number of occurrences of the match state).

The input protein sequences vary in length. In this study, we have fixed the protein sequence length to 700: sequences with fewer than 700 residues are padded, and sequences exceeding 700 residues are truncated. Hence, the profiles generated for a protein sequence by PSI-BLAST and HHblits each have a shape of 700 × 21. These generated profiles are sparse by nature and are converted into dense form in two steps: non-zero values are transformed by the logistic function, and the entire matrix is normalized by adding 0.05 ($\frac{1}{20}$).

For a given set of M protein sequences from a dataset, the generated PSSM and HMM profile feature sets have shapes of M × 700 × 21 and M × 700 × 21 respectively. The two profile feature sets are fused to form a hybrid feature set of shape M × 700 × 42. The hybrid protein feature set is represented as $PFS = \{PF_1, PF_2, PF_3, ..., PF_n\}$, $PF_i \in \mathbb{R}^{42}$, and the secondary structure labels are represented as $CLS = \{CL_1, CL_2, CL_3, ..., CL_n\}$, with $CL_i \in \mathbb{R}^{3}$ or $CL_i \in \mathbb{R}^{8}$ for 3-class and 8-class labels respectively.
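The preprocessing described above (fix the length to 700, densify each profile, then fuse PSSM and HMM columns) might look like the following NumPy sketch. Several details are assumptions rather than facts from the paper: padding uses zeros, padding happens before the logistic transform, and the function names `pad_or_truncate`, `densify`, and `hybrid_features` are hypothetical. The input profiles are random stand-ins, not real PSI-BLAST or HHblits output.

```python
import numpy as np

MAX_LEN = 700  # fixed sequence length stated in the text

def pad_or_truncate(profile, max_len=MAX_LEN):
    """Zero-pad (assumed padding value) or truncate an L x 21 profile to max_len rows."""
    out = np.zeros((max_len, profile.shape[1]), dtype=np.float64)
    n = min(max_len, profile.shape[0])
    out[:n] = profile[:n]
    return out

def densify(profile):
    """Apply the logistic function to non-zero entries, then add 0.05 (= 1/20) everywhere."""
    out = profile.astype(np.float64)  # astype returns a copy, the input is untouched
    nonzero = out != 0
    out[nonzero] = 1.0 / (1.0 + np.exp(-out[nonzero]))
    return out + 0.05

def hybrid_features(pssm, hmm):
    """Fuse a PSSM profile and an HMM profile into one 700 x 42 feature matrix."""
    return np.concatenate(
        [densify(pad_or_truncate(pssm)), densify(pad_or_truncate(hmm))], axis=1
    )

# Stand-in profiles: one short protein (120 residues) and one long protein (900 residues).
rng = np.random.default_rng(0)
short = hybrid_features(rng.normal(size=(120, 21)), rng.normal(size=(120, 21)))
long_ = hybrid_features(rng.normal(size=(900, 21)), rng.normal(size=(900, 21)))
batch = np.stack([short, long_])  # shape M x 700 x 42 with M = 2
```

With this ordering, padded rows end up at the constant 0.05, so they stay distinguishable from real residues whose transformed values lie strictly above 0.05.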

2.2.2. Local Interaction Extraction

To extract the local interactions of amino acid residues (LI) from protein sequences, a combination of a 2-dimensional convolutional layer and a 2-dimensional max-pooling layer is applied over the 2-dimensional protein sequence profile of shape 700 × 42.

The matrix $PFS = \{PF_1, PF_2, ..., PF_n\}$ is obtained from the profile features of the PSSM and HMM matrices generated by PSI-BLAST and HHblits, where $PFS \in \mathbb{R}^{m \times k}$ and $PF_j$ is the pre-processed feature vector (of length $k$) of the $j$th amino acid residue in the protein primary sequence. In the convolutional layer for local interaction extraction, the 2-dimensional convolutional filter $CF \in \mathbb{R}^{cf_1 \times cf_2}$ is applied to a window spanning $cf_1$ rows and $cf_2$ columns of the profile feature matrix, with the rectified linear unit (ReLU) as the activation function, as shown in Equation (1).

$$fm_{i,j} = \mathrm{ReLU}\left(CF \odot PF_{i:i+cf_1-1,\; j:j+cf_2-1} + Bias\right) \qquad (1)$$

The filter $CF$ is applied to every possible window of the matrix $PFS$ to produce a feature map $fm$, as shown in Equation (2).

$$fm = \left[fm_{1,1}, fm_{1,2}, fm_{1,3}, ..., fm_{m-cf_1+1,\; k-cf_2+1}\right] \qquad (2)$$

Equation (2) describes the filter process. The convolutional layer can contain multiple filters of different sizes to learn higher mutual interactions.

After the convolutional layer, 2-dimensional max-pooling operations $mp \in \mathbb{R}^{q_1 \times q_2}$ are applied to extract the maximum value over each pooling window of the feature map, as shown in Equation (3).

$$mp_{i,j} = f\left(fm_{i:i+q_1,\; j:j+q_2}\right) \qquad (3)$$

$f(\cdot)$ in Equation (3) represents the 2-dimensional max-pooling function. The result of these pooling operations is shown in Equation (4).

$$mp = \left[mp_{1,1}, mp_{1,1+q_2}, ..., mp_{1+\left(\frac{m-cf_1+1}{q_1}-1\right)\cdot q_1,\; 1+\left(\frac{k-cf_2+1}{q_2}-1\right)\cdot q_2}\right] \qquad (4)$$
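A minimal NumPy sketch of the two operations in Equations (1)–(4) follows. It assumes a single filter, stride 1 and no padding for the convolution, an element-wise product summed over the window, and non-overlapping pooling windows (stride equal to the window size, as the indices in Equation (4) suggest). The function names are hypothetical, and the loops favour readability over speed.

```python
import numpy as np

def conv2d_relu(pf, cf, bias=0.0):
    """Slide filter `cf` over every window of `pf` and apply ReLU (Equations (1)-(2))."""
    m, k = pf.shape
    cf1, cf2 = cf.shape
    fm = np.empty((m - cf1 + 1, k - cf2 + 1))
    for i in range(fm.shape[0]):
        for j in range(fm.shape[1]):
            window = pf[i:i + cf1, j:j + cf2]
            fm[i, j] = max(0.0, float(np.sum(cf * window) + bias))
    return fm

def max_pool2d(fm, q1, q2):
    """Maximum over non-overlapping q1 x q2 windows (assumed stride = window size)."""
    rows, cols = fm.shape[0] // q1, fm.shape[1] // q2
    trimmed = fm[:rows * q1, :cols * q2]  # drop any ragged border
    return trimmed.reshape(rows, q1, cols, q2).max(axis=(1, 3))

# Tiny worked example: a 4 x 5 "profile" and a 3 x 3 filter of ones.
pf = np.arange(20, dtype=float).reshape(4, 5)
fm = conv2d_relu(pf, np.ones((3, 3)))   # shape (2, 3)
mp = max_pool2d(fm, 2, 1)               # shape (1, 3)

# Shape check at the size used in the paper: 700 x 42 input, 3 x 3 filter.
full_fm = conv2d_relu(np.zeros((700, 42)), np.ones((3, 3)))
```

For the 700 × 42 hybrid profile and a 3 × 3 filter, the feature map has 700 − 3 + 1 = 698 rows and 42 − 3 + 1 = 40 columns, matching the index bounds in Equation (2).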

2.2.3. Long-Range Interaction Extraction

In the previous step, 2-dimensional convolutional and 2-dimensional pooling operations were applied, and the convolutional layer was able to extract LI. However, due to the kernel size limitation, some of the non-local interactions between amino acid residues may not be captured. To retrieve these non-local interactions, i.e., long-range interactions, two BRNN [19, 40] layers with LSTM cells and GRUs are introduced in the proposed architecture. The extracted long-range and local interactions between amino acid residues carry rich information (for predicting the structural class) and are fed into the output layer to obtain the 3-class and 8-class PSSPs.

Based on the past and future contexts of each residue in the protein sequence, a BRNN uses a finite sequence to predict or label each residue of the sequence. To predict each residue, the protein sequence is processed from both sides (i.e., left to right and right to left), and the outputs of the two directions are concatenated.

It is well known that RNNs tend to suffer from the vanishing gradient problem, and LSTMs and GRUs are generally used to overcome it. The proposed model utilizes BRNNs with LSTM and GRU units to extract long-range interactions between amino acid residues. The structure of the units is shown in Figure 2.

The GRU unit consists of one memory cell and two gates, namely, the update gate (u) and the reset gate (r), as shown in Figure 2a. These gates determine whether information is to be remembered in the network or not. On the other hand, the LSTM unit consists of one memory cell, one output state, and three gates known as the input gate (i), forget gate (f), and output gate (o), as shown in Figure 2b. In the network, backpropagated errors are prevented from vanishing by the forget gates of the LSTM, whereas the LSTM memory cells help to memorize previously performed tasks. GRUs are similar to LSTM cells. However, because GRUs lack an output gate, they have fewer parameters than LSTMs.

The mechanism of the GRU is shown in Figure 2a. The input feature (i) to the GRU unit at time t, together with the previous state (s), is represented as $(i_t, s_{t-1})$, and the update of the GRU is given in Equations (5)–(8).

$$r_t = \sigma\left(W_{xr} i_t + W_{sr} s_{t-1} + B_r\right) \qquad (5)$$

$$u_t = \sigma\left(W_{lu} i_t + W_{su} s_{t-1} + B_u\right) \qquad (6)$$

$$\tilde{s}_t = \tanh\left(W_{l\tilde{s}} i_t + W_{s\tilde{s}} \left(r_t \odot s_{t-1}\right) + B_{\tilde{s}}\right) \qquad (7)$$

$$s_t = u_t \odot s_{t-1} + \left(1 - u_t\right) \odot \tilde{s}_t \qquad (8)$$

Figure 2: GRU and LSTM. (a) Gated Recurrent Units (GRUs). (b) Long-Short Term Memory (LSTM) units.

In Equations (5)–(8), $r_t$, $u_t$, $\tilde{s}_t$, and $s_t$ are the activations of the reset gate, update gate, internal memory cell, and GRU output respectively. $W$ is the weight matrix and $B$ is the bias term. In addition, $\odot$, $\sigma(\cdot)$, and $\tanh(\cdot)$ represent element-wise multiplication, the sigmoid function, and the hyperbolic tangent function respectively.
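To make the GRU update concrete, here is a small NumPy sketch of a single time step following Equations (5)–(8). It assumes the recurrent term in both gates is the previous state $s_{t-1}$, as in the standard GRU formulation; the parameter dictionary and its key names are hypothetical, and the all-zero parameters exist only so the arithmetic can be checked by hand.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(i_t, s_prev, p):
    """One GRU update; `p` maps hypothetical key names to weights and biases."""
    r_t = sigmoid(p["W_ir"] @ i_t + p["W_sr"] @ s_prev + p["B_r"])  # reset gate, Eq. (5)
    u_t = sigmoid(p["W_iu"] @ i_t + p["W_su"] @ s_prev + p["B_u"])  # update gate, Eq. (6)
    s_cand = np.tanh(p["W_ic"] @ i_t + p["W_sc"] @ (r_t * s_prev) + p["B_c"])  # Eq. (7)
    return u_t * s_prev + (1.0 - u_t) * s_cand  # new state, Eq. (8)

def zero_params(n_in, n_hidden):
    """All-zero parameters for the reset (r), update (u) and candidate (c) paths."""
    p = {}
    for gate in "ruc":
        p[f"W_i{gate}"] = np.zeros((n_hidden, n_in))
        p[f"W_s{gate}"] = np.zeros((n_hidden, n_hidden))
        p[f"B_{gate}"] = np.zeros(n_hidden)
    return p

# With zero parameters both gates equal 0.5 and the candidate state is tanh(0) = 0,
# so the new state is exactly half of the previous state.
s_next = gru_step(np.ones(42), np.ones(4), zero_params(42, 4))
```

The last line of `gru_step` shows the role of the update gate: values of `u_t` near 1 carry the old state forward, while values near 0 replace it with the candidate state.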

The mechanism of the LSTM is shown in Figure 2b. The input feature (m) to the LSTM unit at time t, together with the previous state (h), is represented as $(m_t, h_{t-1})$, and the update operations for the LSTM unit are given in Equations (9)–(13).

$$f_t = \sigma\left(W_{xf} m_t + W_{hf} h_{t-1} + B_f\right) \qquad (9)$$

$$i_t = \sigma\left(W_{xi} m_t + W_{hi} h_{t-1} + B_i\right) \qquad (10)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} m_t + W_{hc} h_{t-1} + B_c\right) \qquad (11)$$

$$o_t = \sigma\left(W_{xo} m_t + W_{ho} h_{t-1} + B_o\right) \qquad (12)$$

$$h_t = o_t \odot \tanh\left(c_t\right) \qquad (13)$$

In Equations (9)–(13), $f_t$, $i_t$, and $o_t$ are the activations of the forget gate, input gate, and output gate respectively, and $c_t$ is the current cell state.
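The LSTM update of Equations (9)–(13) can be sketched in the same style. As with the standard LSTM, this sketch assumes the recurrent input to every gate is the previous hidden state $h_{t-1}$; the dictionary layout and the `zero_params` helper are hypothetical and serve only to make the arithmetic easy to verify.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(m_t, h_prev, c_prev, p):
    """One LSTM update returning the new hidden state and cell state."""
    f_t = sigmoid(p["W_xf"] @ m_t + p["W_hf"] @ h_prev + p["B_f"])  # forget gate, Eq. (9)
    i_t = sigmoid(p["W_xi"] @ m_t + p["W_hi"] @ h_prev + p["B_i"])  # input gate, Eq. (10)
    c_t = f_t * c_prev + i_t * np.tanh(
        p["W_xc"] @ m_t + p["W_hc"] @ h_prev + p["B_c"]
    )                                                               # cell state, Eq. (11)
    o_t = sigmoid(p["W_xo"] @ m_t + p["W_ho"] @ h_prev + p["B_o"])  # output gate, Eq. (12)
    h_t = o_t * np.tanh(c_t)                                        # hidden state, Eq. (13)
    return h_t, c_t

def zero_params(n_in, n_hidden):
    """All-zero parameters for the forget, input, cell and output paths."""
    p = {}
    for gate in "fico":
        p[f"W_x{gate}"] = np.zeros((n_hidden, n_in))
        p[f"W_h{gate}"] = np.zeros((n_hidden, n_hidden))
        p[f"B_{gate}"] = np.zeros(n_hidden)
    return p

# With zero parameters every gate equals 0.5, so c_t = 0.5 * c_prev
# and h_t = 0.5 * tanh(c_t).
h_t, c_t = lstm_step(np.ones(42), np.zeros(4), np.ones(4), zero_params(42, 4))
```

Compared with the GRU sketch, the LSTM carries two vectors between time steps (`h_t` and `c_t`) and has one more gate, which is where its additional parameters come from.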

In PSSP, the secondary structure of the amino acid residue at any position does not depend solely on the preceding amino acid residues; it also depends on the following amino acid residues in the protein sequence. To address this, BRNNs are used for PSSP: the outputs of two RNN layers operating in opposite directions are concatenated, so that both the preceding and the succeeding amino acid residues are considered at each position. Using the BRNN, two types of models are designed with GRUs and LSTM units. The first model contains forward and backward GRUs, and the second model contains forward and backward LSTM units.
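The bidirectional idea can be illustrated independently of the cell type. The sketch below runs any recurrent step function once left to right and once right to left, then concatenates the two state sequences position by position. The helper names are hypothetical, the initial state is assumed to be zero, and the toy "running sum" cell merely stands in for a GRU or LSTM step.

```python
import numpy as np

def run_direction(x, step, n_hidden, reverse=False):
    """Apply `step(x_t, s_prev) -> s_t` over the rows of x in one direction."""
    order = range(len(x) - 1, -1, -1) if reverse else range(len(x))
    state = np.zeros(n_hidden)               # assumed zero initial state
    states = np.zeros((len(x), n_hidden))
    for t in order:
        state = step(x[t], state)
        states[t] = state                    # stored at the residue's own position
    return states

def bidirectional(x, fwd_step, bwd_step, n_hidden):
    """Concatenate forward and backward states for every residue."""
    fwd = run_direction(x, fwd_step, n_hidden)
    bwd = run_direction(x, bwd_step, n_hidden, reverse=True)
    return np.concatenate([fwd, bwd], axis=1)

# Toy cell: a running sum, so the forward half accumulates everything seen so far
# and the backward half accumulates everything still to come.
running_sum = lambda x_t, s_prev: s_prev + x_t
x = np.array([[1.0], [2.0], [3.0]])
out = bidirectional(x, running_sum, running_sum, n_hidden=1)
```

For the three-residue toy input, the middle residue ends up with features `[3, 5]`: the sum of residues up to and including it, and the sum of residues from it to the end.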

To illustrate how these two models extract LRI, a generic BRNN is used as an example. At time t, the output features captured by the 2-dimensional convolutional layers are fed into the BRNN to extract the bidirectional information of the amino acid residues in the protein sequence. The forward and backward hidden states of the BRNN are calculated using Equations (14)–(15), where $i$ and $s_t^{i-1}$ denote the layer index and the concatenated output feature of the preceding layer in the stacked BRNNs respectively. These two parts are then concatenated to obtain the final feature set, as shown in Equation (16). In Equation (16), X represents cf or mp, where cf is the feature set captured by the 2-dimensional convolutional layer operations as per Equations (1)–(2), and mp is the feature set captured by the 2-dimensional convolutional layer operations and 2-dimensional max-pooling layer operations as per Equations (1)–(3).

$$\overrightarrow{s}_t^{\,i} = BRNNs\left(s_t^{i-1}, \overrightarrow{s}_{t-1}^{\,i}\right) \qquad (14)$$

$$\overleftarrow{s}_t^{\,i} = BRNNs\left(s_t^{i-1}, \overleftarrow{s}_{t+1}^{\,i}\right) \qquad (15)$$

$$s_t = \left[\overrightarrow{h}_t^{\,i};\; \overleftarrow{h}_t^{\,i};\; X\right] \qquad (16)$$

Moreover, two BRNNs with different units (GRU or LSTM) are stacked together to improve performance. In the BGRU variation, the hidden layers are updated using Equations (5)–(8) to extract LRI, and BRNNs( ) in Equations (14)–(15) represents the BGRU model. In the BLSTM variation, the hidden layers are updated using Equations (9)–(13) to extract LRI, and BRNNs( ) in Equations (14)–(15) represents the BLSTM model.

2.2.4. Output layer

After the extraction of both LI and LRI, the obtained feature set is fed into the last layer, a fully connected layer that predicts the protein secondary structure. These protein features are recorded as $f = [f_1, f_2, f_3, ..., f_T]$. In the fully connected layer of our proposed model, softmax is used as the activation function to compute the probability of each class for the amino acid residues, as shown in Equations (17)–(18).

$$p_i(y|z) = \mathrm{softmax}\left(w_z h + b_z\right) \qquad (17)$$

$$\mathrm{softmax}(x) = \frac{e^{x}}{\sum e^{x}} \qquad (18)$$
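As a rough illustration, the softmax of Equation (18) and the regularized cross-entropy objective of Equation (19) can be written in a few lines of NumPy. This is a sketch under stated assumptions, not the training code of the paper: the function names are hypothetical, the max-subtraction and the small constant inside the logarithm are numerical safeguards added here, and the L2 term is taken as the sum of squared entries over all parameter arrays.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def regularized_cross_entropy(probs, onehot, params, gamma):
    """Mean cross-entropy over N residues plus gamma times the squared L2 norm."""
    n = probs.shape[0]
    data_term = -np.sum(onehot * np.log(probs + 1e-12)) / n  # 1e-12 guards log(0)
    l2_term = gamma * sum(np.sum(w ** 2) for w in params)
    return data_term + l2_term

# Two residues, three classes (H, E, C), uniform logits.
logits = np.zeros((2, 3))
probs = softmax(logits)                      # every entry equals 1/3
labels = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
loss = regularized_cross_entropy(probs, labels, [np.ones((2, 2))], gamma=0.1)
```

With uniform predictions the data term equals ln 3, and the four unit weights contribute 0.1 × 4 = 0.4, so the total is ln 3 + 0.4.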

A stochastic gradient descent algorithm, i.e., Adam [49], and error backpropagation are used to train the proposed model in all the experiments. The primary goal of training is to minimize the cross-entropy loss function shown in Equation (19). All the parameters are optimized according to Equation (20), where φ is the parameter set, β is the learning rate, and γ is the L2 regularization hyper-parameter.

$$L(\varphi) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{c} y_{ij} \log\left(p_{ij}\right) + \gamma \lVert \varphi \rVert^{2} \qquad (19)$$

$$\varphi \leftarrow \varphi - \beta \frac{\partial L(\varphi)}{\partial \varphi} \qquad (20)$$

2.3. Evaluation metrics

In this paper, four metrics are used to evaluate the performance of the proposed model: Q3 accuracy, Q8 accuracy, and the Sov score [8] for 3-class and for 8-class PSSP. Q3 and Q8 accuracy are calculated using Equations (21)–(24).

lP

re For single conformational state i, Qi , QP andQall i i are defined as follows:

Qi =

ηcorrectly predicted residues in state i × 100 ηactual residues in state i

(21)

where i  {H, E, C} or i  {B, E, I, G, H, S, T, L}. (22)

ηcorrectly predicted residues (both state i & non−i ) × 100 ηresidues in the prediction result

(23)

urn a

ηcorrectly predicted residues in state i × 100 ηresidues in state i in the prediction result

Qpre = i

Qall i =

330

The m-class and overall per-amino acid residue accuracy Qm , where m is 3

Jo

or 8, is defined as:

Qm =

ηcorrectly predicted amino acid residues × 100 ηtotal residues

(24)

where $\eta_x$ represents the number of x.

For structure i, the set of segments each consisting of structure i in the actual protein secondary structure is denoted as $S_1^i$, and the set of segments each consisting of structure i in the predicted protein secondary structure is denoted as $S_2^i$, where $i \in \{H, E, C\}$ for 3-class protein secondary structures and $i \in \{L, B, E, G, I, H, S, T\}$ for 8-class protein secondary structures. Let $S(i)$ denote the set of pairs $(s_1^i, s_2^i)$ for which there is some overlap between the predicted segment $s_2^i$ and the actual segment $s_1^i$, where $s_2^i \in S_2^i$ and $s_1^i \in S_1^i$. Suppose the total number of amino acid residues present in a protein sequence is N, and let $|s|$ denote the length of segment s. Then the Sov score is defined as:

$$Sov = \frac{1}{N} \sum_{i} \sum_{S(i)} \left[ \frac{\mathrm{minov}(s_1^i, s_2^i) + \delta(s_1^i, s_2^i)}{\mathrm{maxov}(s_1^i, s_2^i)} \times |s_1^i| \right] \times 100 \tag{25}$$

where $i \in \{H, E, C\}$ for 3-class PSSP and $i \in \{L, B, E, G, I, H, S, T\}$ for 8-class PSSP, $\mathrm{minov}(s_1^i, s_2^i)$ is the length of the overlap of the two segments $s_1^i$ and $s_2^i$, $\mathrm{maxov}(s_1^i, s_2^i)$ is the total extent of both segments and is calculated by Equation (26), and $\delta(s_1^i, s_2^i)$ is defined by Equation (27):

$$\mathrm{maxov}(s_1^i, s_2^i) = |s_1^i| + |s_2^i| - \mathrm{minov}(s_1^i, s_2^i) \tag{26}$$

$$\delta(s_1^i, s_2^i) = \min\left\{ \mathrm{maxov}(s_1^i, s_2^i) - \mathrm{minov}(s_1^i, s_2^i),\; \mathrm{minov}(s_1^i, s_2^i),\; \left\lfloor \frac{|s_1^i|}{2} \right\rfloor,\; \left\lfloor \frac{|s_2^i|}{2} \right\rfloor \right\} \tag{27}$$
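The accuracy and Sov definitions in Equations (21), (22) and (24)–(27) can be sketched in Python as follows. This is a minimal illustration and not the authors' evaluation code: the function names (`q_accuracy`, `q_state`, `segments`, `sov`) and the encoding of a secondary structure as a string of one-letter states are assumptions made for the example. The Sov normalisation by N follows Equation (25) as written here.

```python
from itertools import groupby


def q_accuracy(actual, predicted):
    """Overall per-residue accuracy Q_m (Equation 24), in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and equally long")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


def q_state(actual, predicted, state):
    """Per-state Q_i (Equation 21) and Q_i^pre (Equation 22), in percent."""
    hits = sum(a == p == state for a, p in zip(actual, predicted))
    n_actual = actual.count(state)
    n_pred = predicted.count(state)
    q_i = 100.0 * hits / n_actual if n_actual else float("nan")
    q_pre = 100.0 * hits / n_pred if n_pred else float("nan")
    return q_i, q_pre


def segments(labels):
    """Split a label string into (state, start, end) runs; end is exclusive."""
    runs, pos = [], 0
    for state, group in groupby(labels):
        length = len(list(group))
        runs.append((state, pos, pos + length))
        pos += length
    return runs


def sov(actual, predicted):
    """Sov score following Equations (25)-(27), normalised by N residues."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and equally long")
    total = 0.0
    pred_runs = segments(predicted)
    for state, a0, a1 in segments(actual):
        for p_state, p0, p1 in pred_runs:
            if p_state != state:
                continue
            minov = min(a1, p1) - max(a0, p0)  # length of the overlap
            if minov <= 0:
                continue  # this pair of segments is not in S(i)
            len1, len2 = a1 - a0, p1 - p0
            maxov = len1 + len2 - minov  # Equation (26)
            delta = min(maxov - minov, minov, len1 // 2, len2 // 2)  # Equation (27)
            total += (minov + delta) / maxov * len1
    return 100.0 * total / len(actual)
```

For example, with `actual = "HHHHCCEEEE"` and `predicted = "HHHCCCEEEE"`, `q_accuracy` gives 90.0 while `sov` gives 100.0, which illustrates that the Sov score tolerates a one-residue shift at a segment boundary that the per-residue accuracy penalises.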

3. Results and Analysis

In this section, the experimental results of the proposed model are analyzed for 3-class and 8-class PSSP on the four publicly available benchmark datasets mentioned before: CB6133, CB513, CASP10, and CASP11. The proposed model is trained on the CB6133 dataset, and the other three datasets are used to test the trained model. The proposed model for PSSP shows an improvement in performance over popular existing methods.

3.1. Experimental attributes and Implementation

All the experiments are performed on an NVIDIA Tesla M40 workstation containing two GPU nodes with 24 GB of memory each. The proposed models are built and trained using the TensorFlow (https://www.tensorflow.org) library along with Keras (https://keras.io). The initial weights of the proposed model are set to the TensorFlow default values. The Adam optimizer is used to train all the layers of the proposed model simultaneously with a batch size of 64. Early stopping and dropout are used to avoid over-training and over-fitting of the proposed model.
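The early stopping rule mentioned above can be illustrated with a short, framework-independent sketch. The text does not state which quantity is monitored or how many epochs of patience are allowed, so the validation loss, the `patience` value and the helper name `early_stopping_epoch` are all assumptions made for this example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    stale = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: the best value is reached at epoch 2,
# so with patience=3 training stops at epoch 5.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]
stop_at = early_stopping_epoch(losses, patience=3)
```

In a Keras training loop the same behaviour is normally obtained with the built-in `EarlyStopping` callback rather than hand-written logic.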

In this study, the architecture and hyper-parameters of the proposed model are as follows: the window size of the 2-dimensional convolution layer filter is (3x3). Each feature map has 42 channels. To avoid over-fitting of the proposed model, the local interactions between amino acid residues of protein sequences are fed into a fully connected layer which consists of 400 hidden units and are then regularized with a dropout of 0.2; the features of the fully connected layer are fed into the two stacked BRNN layers, which have 400 hidden units in each layer. Meanwhile, a dropout of 0.2 is used to regularize both the LI and the LRI of protein sequences. Then the LI and LRI are fed into a fully connected layer which has 600 hidden units and ReLU as its activation function.

3.2. Performance Analysis of the Proposed Model

Two experiments are performed on the proposed model. In the first experiment, the 42-dimensional feature map consists of 21 features from HMM profiles, which are extracted using HHblits against the Uniprot20 database, and another 21 features from PSSM profiles, which are extracted using PSI-BLAST against the Uniref90 database. In the second experiment, the 21 features from HMM profiles are extracted using HHblits against the Uniclust30 database, and the other 21 features from PSSM profiles are the same as in the first experiment.
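The construction of the 42-dimensional hybrid feature map can be sketched as a per-residue concatenation of the two 21-dimensional profiles. This is only an illustration under stated assumptions: the helper name `hybrid_profile`, the list-of-lists representation, and the ordering (HMM columns first, PSSM columns second) are hypothetical, and any scaling or normalisation of the raw profiles is omitted.

```python
def hybrid_profile(hmm_rows, pssm_rows, width=21):
    """Concatenate per-residue HMM and PSSM rows into 42-dimensional rows.

    hmm_rows and pssm_rows hold one row of `width` numbers per residue,
    e.g. as parsed from HHblits and PSI-BLAST output (parsing not shown).
    """
    if len(hmm_rows) != len(pssm_rows):
        raise ValueError("profiles must cover the same number of residues")
    combined = []
    for hmm, pssm in zip(hmm_rows, pssm_rows):
        if len(hmm) != width or len(pssm) != width:
            raise ValueError(f"each profile row must have {width} values")
        # Assumed ordering: 21 HMM features followed by 21 PSSM features.
        combined.append(list(hmm) + list(pssm))
    return combined


# Toy profiles for a protein of three residues (values are placeholders).
hmm = [[0.1] * 21 for _ in range(3)]
pssm = [[0.9] * 21 for _ in range(3)]
features = hybrid_profile(hmm, pssm)
```

Swapping the HMM profile source between the two experiments then only changes the `hmm_rows` argument, while `pssm_rows` stays the same.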

Table 2: The overall performance (Q3, Sov accuracy) of the proposed model using HMM profiles based on UniProt20 database for 3-class PSSP on four publicly available datasets

| Our Models | CB6133 Q3 (%) | CB6133 Sov (%) | CB513 Q3 (%) | CB513 Sov (%) | CASP10 Q3 (%) | CASP10 Sov (%) | CASP11 Q3 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 85.2 | 82.3 | 85.1 | 81.7 | 82.5 | 77.6 | 80.5 | 76.7 |
| 2DCNN-BGRU | 84.1 | 81.1 | 83.7 | 79.4 | 73.8 | 64.2 | 71.8 | 63.4 |
| 2DConv-BLSTM | 84.3 | 80.6 | 83.8 | 79.0 | 80.1 | 73.4 | 77.9 | 71.5 |
| 2DConv-BGRU | 85.0 | 81.9 | 85.0 | 80.6 | 48.9 | 19.6 | 45.4 | 17.7 |

Table 3: The overall performance (Q3, Sov accuracy) of the proposed model using HMM profiles based on UniClust30 database for 3-class PSSP on four publicly available datasets

| Our Models | CB6133 Q3 (%) | CB6133 Sov (%) | CB513 Q3 (%) | CB513 Sov (%) | CASP10 Q3 (%) | CASP10 Sov (%) | CASP11 Q3 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 85.4 | 82.9 | 85.4 | 81.7 | 83.7 | 80.5 | 81.5 | 78.6 |
| 2DCNN-BGRU | 84.8 | 82.0 | 84.4 | 80.2 | 76.9 | 68.7 | 76.4 | 70.2 |
| 2DConv-BLSTM | 83.8 | 79.9 | 82.6 | 77.8 | 76.7 | 68.8 | 73.9 | 66.0 |
| 2DConv-BGRU | 85.3 | 81.6 | 85.0 | 81.0 | 45.0 | 10.5 | 42.4 | 9.6 |

The Q3 result obtained from the proposed model is evaluated against popular existing models such as SSPro8-SS8 [43], RaptorX-SS8 [37], SPINE-X [42], PSI-PRED [11], JPRED [35], Ensemble-WMV [31], SVR-NSGAII (RBF Kernel) [32], NSVM [34], and SSREDNs [36] on four datasets. The performance evaluation is based on Q3 accuracy for 3-class PSSP and is shown in Table 4. The proposed model for 3-class PSSP exhibits an improvement in performance over the existing models. Relative to the best existing model, it shows a minimum improvement of 1.2% and 2.5% in Q3 accuracy on the benchmark CB6133 and CB513 datasets, respectively.

Two more experiments are performed to evaluate the proposed 8-class PSSP model. The proposed 8-class PSSP model is assessed on four datasets and evaluated using Q8 accuracy and Sov score, and the results are shown in Table 5 and Table 6. In the first experiment, the proposed 2DCNN-BLSTM model (HMM profiles extracted from the Uniprot20 database) exhibits a Q8 accuracy of 75.8%, 73.5%, 72.2% and 70.0%, and a Sov score of 77.7%, 75.2%, 72.8% and 70.6% on the benchmark CB6133, CB513, CASP10 and CASP11 datasets, respectively.

Table 4: The performance (Q3 accuracy) Comparison of the Proposed Model against the Benchmark Models on the CB6133, CB513, CASP10, and CASP11 datasets. The Q3 accuracy of the Benchmark Models are taken from their published work. (-) indicates the unavailability of the result.

| Methods | CB6133 Q3 (%) | CB513 Q3 (%) | CASP10 Q3 (%) | CASP11 Q3 (%) |
|---|---|---|---|---|
| SSpro (without template) [43] | 79.5 | 78.5 | 78.5 | 77.6 |
| SPINE-X [42] | 81.7 | 78.9 | 80.7 | 79.3 |
| PSI-PRED [11] | 82.5 | 79.2 | 81.2 | 80.7 |
| JPRED [35] | 82.9 | 81.7 | 81.6 | 80.4 |
| RaptorX-SS8 [37] | 81.2 | 78.3 | 78.9 | 79.1 |
| Ensemble-WMV [31] | - | 76.7 | - | - |
| SVR-NSGAII (RBF Kernel) [32] | - | 80.6 | - | - |
| NSVM [34] | - | 68.3 | - | - |
| SSREDNs [36] | 84.2 | 82.9 | - | - |
| Proposed Model | 85.4 | 85.4 | 83.7 | 81.5 |
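The "minimum improvement" quoted for the 3-class results is the gap between the proposed model and the best existing method on a dataset. The short sketch below reproduces that arithmetic from the CB6133 and CB513 columns of Table 4; the helper name `minimum_improvement` and the use of `None` for unavailable results are assumptions made for illustration.

```python
def minimum_improvement(proposed, baselines):
    """Return proposed minus the best baseline, in percentage points."""
    reported = [score for score in baselines if score is not None]
    if not reported:
        raise ValueError("no baseline scores reported for this dataset")
    return round(proposed - max(reported), 1)


# Q3 (%) of the existing methods, copied from the CB6133 and CB513 columns
# of Table 4; None marks a result that is not available ("-").
cb6133_baselines = [79.5, 81.7, 82.5, 82.9, 81.2, None, None, None, 84.2]
cb513_baselines = [78.5, 78.9, 79.2, 81.7, 78.3, 76.7, 80.6, 68.3, 82.9]

gain_cb6133 = minimum_improvement(85.4, cb6133_baselines)
gain_cb513 = minimum_improvement(85.4, cb513_baselines)
```

With the values above, the gains come out as 1.2 on CB6133 and 2.5 on CB513, matching the figures reported in the text.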

In the second experiment, the proposed 2DCNN-BLSTM model (HMM profiles extracted from the Uniclust30 database) attains a Q8 accuracy of 74.9%, 73.5%, 70.8% and 68.1%, and a Sov score of 76.5%, 75.2%, 71.0% and 68.6% on the benchmark CB6133, CB513, CASP10 and CASP11 datasets, respectively. From both experiments, we can observe that, for 8-class PSSP, the HMM profiles extracted from the Uniprot20 database carry more information than the HMM profiles extracted from the Uniclust30 database.

Table 5: The overall performance (Q8, Sov accuracy) of the proposed model using HMM profiles based on UniProt20 database for 8-class PSSP on four publicly available datasets

| Our Models | CB6133 Q8 (%) | CB6133 Sov (%) | CB513 Q8 (%) | CB513 Sov (%) | CASP10 Q8 (%) | CASP10 Sov (%) | CASP11 Q8 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 75.8 | 77.7 | 73.5 | 75.2 | 72.2 | 72.8 | 70.0 | 70.6 |
| 2DCNN-BGRU | 73.9 | 75.0 | 73.1 | 74.5 | 42.4 | 38.9 | 41.8 | 39.0 |
| 2DConv-BLSTM | 73.2 | 73.9 | 70.8 | 72.0 | 52.2 | 51.0 | 50.3 | 48.6 |
| 2DConv-BGRU | 74.1 | 75.3 | 72.4 | 74.0 | 64.9 | 65.5 | 62.5 | 63.9 |

Table 6: The overall performance (Q8, Sov accuracy) of the proposed model using HMM profiles based on UniClust30 database for 8-class PSSP on four publicly available datasets

| Our Models | CB6133 Q8 (%) | CB6133 Sov (%) | CB513 Q8 (%) | CB513 Sov (%) | CASP10 Q8 (%) | CASP10 Sov (%) | CASP11 Q8 (%) | CASP11 Sov (%) |
|---|---|---|---|---|---|---|---|---|
| 2DCNN-BLSTM | 74.9 | 76.5 | 73.5 | 75.2 | 70.8 | 71.0 | 68.1 | 68.6 |
| 2DCNN-BGRU | 74.6 | 75.8 | 72.9 | 74.1 | 46.6 | 43.4 | 46.4 | 43.5 |
| 2DConv-BLSTM | 74.5 | 76.4 | 72.8 | 74.7 | 47.8 | 46.3 | 45.6 | 44.0 |
| 2DConv-BGRU | 74.2 | 75.7 | 72.9 | 74.5 | 25.4 | 13.0 | 23.4 | 10.9 |

The proposed model for 8-class PSSP is evaluated against popular existing models such as DeepCNF-SS8 [44], CNF-SS8 [37], SSPro8-SS8 [43], GSN-SS8 [14], LSTM large [15], SSREDNs [36], Conditioned CNN [38], MUST-CNN [39], 2DCNN-BLSTM with PSSM [5] and 2DConv-BLSTM with PSSM [5] on four benchmark datasets. The performance evaluation is based on the Q8 accuracy for 8-class PSSP and is shown in Table 7.

The proposed 8-class PSSP model exhibits an improvement in performance over the popular existing models on the CB6133 and CB513 datasets, as shown in Table 7. The proposed model shows a minimum improvement of 2.1% on the benchmark CB513 dataset.

The proposed model exhibits an improvement in performance due to the effectiveness of the hybrid features of PSSM and HMM profiles. These hybrid features are integrated into a 2-dimensional convolution layer and a 2-dimensional max pooling layer, followed by two stacked BRNN layers, i.e., BLSTM or BGRU layers. The improvement in the performance of the proposed model is also due to the effective extraction of the LI and LRI of protein primary structure sequences. The experimental results suggest that the proposed 2DCNN-BLSTM model is more expressive on the CB6133 and CB513 datasets in PSSP. This demonstrates that the hybrid features based on PSSM and HMM profiles, along with the deep learning framework, are effective and efficient for 3-class and 8-class PSSP.

3.3. Validation of Q3 Results

An experiment is performed to predict the protein structural class [50] to validate the quality of the obtained Q3 results. In structural class prediction, a protein sequence is classified into one of the four major classes, i.e., α, β, α/β, and α + β.

Table 7: The performance (Q8 accuracy) Comparison of the Proposed Model against the Benchmark Models on the CB6133, CB513, CASP10, and CASP11 datasets. The Q8 accuracy of the Benchmark Models are taken from their published work. (-) indicates the unavailability of the result.

| Models | CB6133 Q8 (%) | CB513 Q8 (%) | CASP10 Q8 (%) | CASP11 Q8 (%) |
|---|---|---|---|---|
| SSpro-SS8 [43] (without template) | 66.6 | 63.5 | 64.9 | 65.6 |
| GSN-SS8 [14] | 72.1 | 66.4 | - | - |
| RaptorX-SS8 [37] | 69.7 | 64.9 | 64.8 | 65.1 |
| DeepCNF-SS8 [44] | 75.2 | 68.3 | 71.8 | 72.3 |
| LSTM large [15] | - | 67.4 | - | - |
| SSREDNs [36] | 73.1 | 68.2 | - | - |
| Conditioned CNN [38] | - | 71.4 | - | - |
| MUST-CNN [39] | - | 68.4 | - | - |
| 2DConv-BLSTM with PSSM [5] | 75.7 | 70.2 | 74.5 | 72.5 |
| 2DCNN-BLSTM with PSSM [5] | 74.3 | 70.0 | 74.5 | 72.6 |
| Proposed Model | 75.8 | 73.5 | 72.2 | 70.0 |

In this experiment, we have considered two popular datasets, namely z277 [51] and FC699 [52]. The z277 is a high-similarity dataset consisting of 277 protein sequences, whereas the FC699 is a low-similarity dataset consisting of 858 protein sequences. The proposed model is trained on the CB6133 dataset and tested on the z277 and FC699 datasets to obtain Q3 results. For the same z277 and FC699 datasets, Q3 results (sequences) are also obtained using the PSI-PRED server [53].

In a recent investigation on structural class prediction, Bankapur et al. [54] showed that the Word2Vec technique [55] extracts highly discriminating features from a given sequence. Therefore, a set of 400 features is extracted for each dataset using Word2Vec [55] from both the proposed model Q3 sequences and the PSI-PRED Q3 sequences. Various machine learning classifiers, such as Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and k-Nearest Neighbor (kNN), are used to classify each protein sequence into its structural class using the extracted Word2Vec features, and the results are shown in Table 8.

Table 8: The Performance Comparison (in percentage) of Structural Class Prediction based on the Features Extracted from the Proposed Model Q3 Results against PSI-PRED Model

Classifiers

z277


Q3 Results

FC699

Proposed Model

PSI-PRED

Proposed Model

NB

67.6

69

84.3

76.2

SVM

81.0

80.2

86.2

86.5

LR

65.7

73.3

83.5

82.9

MLP

75.2

78.5

83.7

85.2

KNN

81.4

84.2

83.7

84.5


PSI-PRED

From Table 8, it can be observed that the structural class prediction accuracy on the proposed model Q3 sequences is higher than that on the PSI-PRED Q3 sequences. Thus, we can say that the quality of the Q3 sequences obtained from the proposed model is higher than that of the PSI-PRED model.

4. Conclusion and Future Work

In this paper, an effective model is proposed for PSSP from amino acid residue sequences. The proposed model consists of hybrid features of 42 dimensions combined with a CNN and BRNNs. The hybrid features are extracted and selected from sequence profiles of PSI-BLAST and HHblits. We explored HHblits on two databases, namely UniProt20 and UniClust30, for the extraction of HMM profiles and explored PSI-BLAST on the Uniref90 database for the extraction of PSSM profiles. The HMM and PSSM profiles are combined to obtain a novel hybrid feature set which is fed into a deep neural network. The deep neural network consists of one CNN layer and two BRNN layers, followed by one fully connected layer as an output layer for PSSP. The proposed prediction model is assessed on four publicly available datasets and reports a Q3 accuracy of 85.4%, 85.4%, 83.7%, and 81.5% and a Q8 accuracy of 75.8%, 73.5%, 72.2%, and 70.0% on the CB6133, CB513, CASP10, and CASP11 datasets, respectively. The experimental results demonstrate that the proposed prediction model's Q3 accuracy is consistently higher than that of other existing methods and that its Q8 accuracy is higher on the CB6133 and CB513 datasets; it exhibits a minimum increase of 2.5% in Q3 accuracy and 2.1% in Q8 accuracy on the CB513 dataset. Further, the quality of the Q3 results is validated by predicting structural class on the z277 and FC699 datasets, and the structural class prediction results outperformed the PSI-PRED method. From the experimental results, we conclude that an effective model is proposed to address PSSP.

In the future, we would like to explore features based on physicochemical properties of amino acid residues to optimize the proposed framework. Further, we would like to propose an effective structural class prediction model by consuming the Q3 results of this work.

Acknowledgment

This research work is funded [KSTePS/VGST-RGS/F/GRD No.727/2017-18] by Vision Group on Science and Technology, Department of Information Technology, Biotechnology and Science & Technology, Govt. of Karnataka, India.

References

[1] D. W. Mount, Bioinformatics: sequence and genome analysis, Vol. 2, Cold Spring Harbor Laboratory Press, New York, 2001.

[2] G. Wang, Y. Zhao, D. Wang, A protein secondary structure prediction framework based on the extreme learning machine, Neurocomputing 72 (13) (2008) 262–268.

[3] L. Pauling, R. B. Corey, H. R. Branson, The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain, Proceedings of the National Academy of Sciences 37 (4) (1951) 205–211.

[4] W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22 (12) (1983) 2577–2637.

[5] Y. Guo, B. Wang, W. Li, B. Yang, Protein secondary structure prediction improved by recurrent neural networks integrated with two-dimensional convolutional neural networks, Journal of bioinformatics and computational biology 16 (5) (2018) 1850021.

[6] J. Cheng, A. N. Tegge, P. Baldi, Machine learning methods for protein structure prediction, IEEE reviews in biomedical engineering 1 (2008) 41–49.

[7] S. Min, B. Lee, S. Yoon, Deep learning in bioinformatics, Briefings in bioinformatics 18 (5) (2017) 851–869.

[8] B. Rost, C. Sander, R. Schneider, Redefining the goals of protein secondary structure prediction, Journal of molecular biology 235 (1) (1994) 13–26.

[9] H. Chen, F. Gu, Z. Huang, Improved chou-fasman method for protein secondary structure prediction, BMC bioinformatics 7 (4) (2006) S14.

[10] B. Rost, C. Sander, Prediction of protein secondary structure at better than 70% accuracy, Journal of molecular biology 232 (2) (1993) 584–599.

[11] D. T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, Journal of molecular biology 292 (2) (1999) 195–202.

[12] Z. Li, Y. Yu, Protein secondary structure prediction using cascaded convolutional and recurrent neural networks, arXiv preprint arXiv:1604.07176.

[13] C. Fang, Y. Shang, D. Xu, Mufold-ss: New deep inception-inside-inception networks for protein secondary structure prediction, Proteins: Structure, Function, and Bioinformatics 86 (5) (2018) 592–598.

[14] J. Zhou, O. G. Troyanskaya, Deep supervised and convolutional generative stochastic network for protein secondary structure prediction, arXiv preprint arXiv:1403.1347.

[15] S. K. Sønderby, O. Winther, Protein secondary structure prediction with long short term memory networks, arXiv preprint arXiv:1412.7828.

[16] E. Gawehn, J. A. Hiss, G. Schneider, Deep learning in drug discovery, Molecular informatics 35 (1) (2016) 3–14.

[17] C. Angermueller, T. Pärnamaa, L. Parts, O. Stegle, Deep learning for computational biology, Molecular systems biology 12 (7) (2016) 878.

[18] E. Asgari, M. R. Mofrad, Continuous distributed representation of biological sequences for deep proteomics and genomics, PloS one 10 (11) (2015) e0141287.

[19] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al., Using recurrent neural networks for slot filling in spoken language understanding, IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3) (2015) 530–539.

[20] B. Zhang, Z. Chen, Y. L. Murphey, Protein secondary structure prediction using machine learning, in: Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 1, IEEE, 2005, pp. 532–537.

[21] G. Karypis, Yasspp: better kernels and coding schemes lead to improvements in protein secondary structure prediction, Proteins: Structure, Function, and Bioinformatics 64 (3) (2006) 575–586.

[22] W. Chu, Z. Ghahramani, A. Podtelezhnikov, D. L. Wild, Bayesian segmental models with multiple sequence alignment profiles for protein secondary structure and contact map prediction, IEEE/ACM transactions on computational biology and bioinformatics 3 (2) (2006) 98–113.

[23] W. Zhong, G. Altun, X. Tian, R. Harrison, P. C. Tai, Y. Pan, Parallel protein secondary structure prediction schemes using pthread and openmp over hyper-threading technology, The Journal of Supercomputing 41 (1) (2007) 1–16.

[24] Z. Aydin, Y. Altunbasak, H. Erdogan, Bayesian protein secondary structure prediction with near-optimal segmentations, IEEE transactions on signal processing 55 (7) (2007) 3512–3525.

[25] P. Chopra, A. Bender, Evolved cellular automata for protein secondary structure prediction imitate the determinants for folding observed in nature, In silico biology 7 (1) (2007) 87–93.

[26] X.-Q. Yao, H. Zhu, Z.-S. She, A dynamic bayesian network approach to protein secondary structure prediction, BMC bioinformatics 9 (1) (2008) 49.

[27] N. P. Bidargaddi, M. Chetty, J. Kamruzzaman, Combining segmental semi-markov models with neural networks for protein secondary structure prediction, Neurocomputing 72 (16-18) (2009) 3943–3950.

[28] P. Kountouris, J. D. Hirst, Prediction of backbone dihedral angles and protein secondary structure using support vector machines, BMC bioinformatics 10 (1) (2009) 437.

[29] S. A. Malekpour, S. Naghizadeh, H. Pezeshk, M. Sadeghi, C. Eslahchi, A segmental semi markov model for protein secondary structure prediction, Mathematical biosciences 221 (2) (2009) 130–135.

[30] Z. Zhou, B. Yang, W. Hou, Association classification algorithm based on structure sequence in protein secondary structure prediction, Expert Systems with Applications 37 (9) (2010) 6381–6389.

[31] H. Bouziane, B. Messabih, A. Chouarfia, Profiles and majority voting-based ensemble method for protein secondary structure prediction, Evolutionary Bioinformatics 7 (2011) EBO–S7931.

[32] M. H. Zangooei, S. Jalili, Protein secondary structure prediction using dwkf based on svr-nsgaii, Neurocomputing 94 (2012) 87–101.

[33] W. Yang, K. Wang, W. Zuo, Prediction of protein secondary structure using large margin nearest neighbor classification, in: 2011 3rd International Conference on Advanced Computer Control, IEEE, 2011, pp. 202–205.

[34] P. Ghanty, N. R. Pal, R. K. Mudi, Prediction of protein secondary structure using probability based features and a hybrid system, Journal of bioinformatics and computational biology 11 (05) (2013) 1350012.

[35] A. Drozdetskiy, C. Cole, J. Procter, G. J. Barton, Jpred4: a protein secondary structure prediction server, Nucleic acids research 43 (W1) (2015) W389–W394.

[36] Y. Wang, H. Mao, Z. Yi, Protein secondary structure prediction by using deep learning method, Knowledge-Based Systems 118 (2017) 115–123.

[37] Z. Wang, F. Zhao, J. Peng, J. Xu, Protein 8-class secondary structure prediction using conditional neural fields, in: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, 2010, pp. 109–114.

[38] A. Busia, N. Jaitly, Next-step conditioned deep convolutional neural networks improve protein secondary structure prediction, arXiv preprint arXiv:1702.03865.

[39] Z. Lin, J. Lanchantin, Y. Qi, Must-cnn: a multilayer shift-and-stitch deep convolutional architecture for sequence-based protein structure prediction, in: Thirtieth AAAI conference on artificial intelligence, 2016, pp. 27–34.

[40] G. Pollastri, D. Przybylski, B. Rost, P. Baldi, Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles, Proteins: Structure, Function, and Bioinformatics 47 (2) (2002) 228–235.

[41] P. Y. Chou, G. D. Fasman, Prediction of protein conformation, Biochemistry 13 (2) (1974) 222–245.

[42] E. Faraggi, T. Zhang, Y. Yang, L. Kurgan, Y. Zhou, Spine x: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles, Journal of computational chemistry 33 (3) (2012) 259–267.

[43] C. N. Magnan, P. Baldi, Sspro/accpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity, Bioinformatics 30 (18) (2014) 2592–2597.

[44] S. Wang, J. Peng, J. Ma, J. Xu, Protein secondary structure prediction using deep convolutional neural fields, Scientific reports 6 (2016) 18962.

[45] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman, Gapped blast and psi-blast: a new generation of protein database search programs, Nucleic acids research 25 (17) (1997) 3389–3402.

[46] M. Remmert, A. Biegert, A. Hauser, J. Söding, Hhblits: lightning-fast iterative protein sequence searching by hmm-hmm alignment, Nature methods 9 (2) (2012) 173.

[47] A. Kryshtafovych, A. Barbato, K. Fidelis, B. Monastyrskyy, T. Schwede, A. Tramontano, Assessment of the assessment: evaluation of the model quality estimates in casp10, Proteins: Structure, Function, and Bioinformatics 82 (2014) 112–126.

[48] J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, A. Tramontano, Critical assessment of methods of protein structure prediction (casp) round x, Proteins: Structure, Function, and Bioinformatics 82 (2014) 1–6.

[49] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.

[50] M. Levitt, C. Chothia, Structural patterns in globular proteins, Nature 261 (5561) (1976) 552.

[51] G.-P. Zhou, An intriguing controversy over protein structural class prediction, Journal of protein chemistry 17 (8) (1998) 729–738.

[52] L. Kurgan, K. Cios, K. Chen, Scpred: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences, BMC bioinformatics 9 (1) (2008) 226.

[53] L. J. McGuffin, K. Bryson, D. T. Jones, The psipred protein structure prediction server, Bioinformatics 16 (4) (2000) 404–405.

[54] S. Bankapur, N. Patil, Protein secondary structural class prediction using effective feature modeling and machine learning techniques, in: 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, 2018, pp. 18–21.

[55] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.