Journal of Theoretical Biology 361 (2014) 182–189
Knowledge base and neural network approach for protein secondary structure prediction

Maulika S. Patel a, Himanshu S. Mazumdar b

a Department of Computer Engineering, G H Patel College of Engineering & Technology, Vallabh Vidyanagar, Gujarat, India
b Research & Development Center, Dharmsinh Desai University, Nadiad, Gujarat, India
HIGHLIGHTS

• A novel neuro-statistical algorithm using a knowledge base and a neural network for PSSP is proposed.
• The association of 5-residue words with the corresponding secondary structure forms the knowledge base.
• Lateral and hierarchical validation is employed for PSSP.
• A Backpropagation neural network is used to model the exceptions in the knowledge base.
• Q3 accuracies of 90% and 82% are achieved on the RS126 and CB396 test data sets, respectively.

GRAPHICAL ABSTRACT
[Graphical abstract: flowchart of the KB-PROSSP-NN pipeline (pre-process the sequence-structure database; build the knowledge base from the training data; predict with the knowledge base, KB-PROSSP; train a neural network on the KB-PROSSP predictions and sequence-structure pairs; output the optimized prediction). Identical to Fig. 1.]
ARTICLE INFO

Article history: Received 16 December 2013; Received in revised form 1 August 2014; Accepted 4 August 2014; Available online 14 August 2014

ABSTRACT
Protein structure prediction is of great relevance given the abundant genomic and proteomic data generated by genome sequencing projects. Protein secondary structure prediction is addressed as a subtask in determining protein tertiary structure and function. In this paper a novel algorithm, KB-PROSSP-NN, is proposed for protein secondary structure prediction (PSSP); it combines a knowledge base with neural-network modeling of the exceptions in that knowledge base. The knowledge base is derived from a proteomic sequence-structure database and consists of the statistics of association between 5-residue words and the corresponding secondary structure. The predicted results obtained using the knowledge base are refined with a Backpropagation neural network, which models the exceptions of the knowledge base. Q3 accuracies of 90% and 82% are achieved on the RS126 and CB396 test sets, respectively, which suggests an improvement over existing state-of-the-art methods.

Keywords: 5-residue words; knowledge base; lateral association and validation; hierarchical validation; Backpropagation neural network
E-mail addresses: [email protected] (M.S. Patel), [email protected] (H.S. Mazumdar).
1. Introduction

Proteins are formed by the transcription and translation of one-dimensional DNA sequences into three-dimensional (3-D) molecules capable of performing diverse functions. Their secondary structures (helix, strand or coil) are hard to predict, and their folding pathways are harder still to expound. Experimental techniques such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) face challenges of cost, labor, expertise and time. Protein structure identification is necessary to understand the function of a protein, is required for drug design, and also helps in predicting the state of a disease. Often the problem of structure prediction is reduced to the secondary structure prediction problem. Protein secondary structure prediction (PSSP) is significant and relevant because it allows conclusions on fold classification and provides important clues for 3-D structure prediction. Results of secondary structure predictions are used to classify proteins, for example as all-α or all-β proteins.

The research in PSSP dates back to the 1970s, while computational techniques for protein structure prediction have gained much attention recently. The DSSP program (Kabsch and Sander, 1983) classifies each amino acid residue into eight classes: B, E, G, H, I, S, T and blank (we use '.' instead of the blank entry for clarity). These are typically collapsed into the three standard classes: G, H and I -> Helix (H); B and E -> beta-strand (E); the rest -> Coil (C). This grouping suggested by DSSP is assumed to be harder than other possible groupings.

The protein sequence comprises 'R' residues and can have 'm' possible secondary structure states. The problem can be understood as a characterization problem in which each residue is assigned a secondary structure state. The per-residue accuracy, Qm, for m structural states, is defined as the ratio of the number of residues for which the secondary structure is correctly assigned (C) to the total number of residues (R) in a protein sequence (Przybylski and Rost, 2007):

Qm = 100 × C / R    (1)
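For concreteness, the following is a minimal sketch of Eq. (1) in C# (the language in which, as stated in Section 2, the method was implemented); the class and method names are illustrative, not taken from the authors' code.

using System;

static class Accuracy
{
    // Per-residue accuracy Qm of Eq. (1): percentage of residues whose
    // predicted structure state matches the observed one.
    public static double Qm(string predicted, string observed)
    {
        if (predicted.Length != observed.Length)
            throw new ArgumentException("Structure strings must align residue by residue.");
        int correct = 0;                           // C: correctly assigned residues
        for (int i = 0; i < observed.Length; i++)
            if (predicted[i] == observed[i]) correct++;
        return 100.0 * correct / observed.Length;  // R: total residues
    }
}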
For three-class classification, m = 3 and the measure is referred to as Q3 accuracy. A refined accuracy index, Q8, has been proposed to evaluate secondary structure prediction algorithms (Zhang and Zhang, 2001). The per-residue accuracy is expected to be 3-4% higher for 3-class than for 8-class classification: the inter-class distance is comparatively larger in the 3-class case, so borderline misclassifications are fewer than in the 8-class case, where the inter-class distance is smaller. However, 8-class classification embeds more specific information.

The literature review below discusses the various approaches proposed so far for the PSSP problem, highlighting their performance in terms of Q3 or Q8 accuracy. A detailed review of the prediction methods for globular proteins is given by Rost (2003). The PHD method (Rost and Sander, 1994) makes use of sequence profiles and neural networks to predict with an accuracy of around 70%. The association of knowledge bases and neural networks dates back to the 1990s (Maclin and Shavlik, 1993). Methods employing neural networks as classifiers have largely used an ensemble of neural networks or cascaded networks for increased accuracy; the accuracy of models using neural networks alone is far below the estimated upper limit of 88% suggested by Rost (2003). Recurrent neural networks used by Chen and Chaudhari (2007) and multiple classifiers used by Ouali and King (2000) achieve Q3 accuracies of 74% and 76%, respectively. Wang et al. (2011) experimented with conditional neural fields, a combination of conditional random fields and neural networks, and reported a Q8 accuracy of 64.9%. Support Vector Machines (SVM) have received considerable attention for PSSP (Reyaz-Ahmed and Zhang, 2007; Kim and Park, 2003). Some methods take a window of amino acid residues as input and predict the secondary structure of the central residue: Rost and Sander (1993) used a window size of 13, while Chatterjee et al. (2011) used 15. A cascade of support vector machines is used as the classifier in Kim and Park (2003), where the second stage refines the output of the first. Reyaz-Ahmed and Zhang (2007) used a combination of genetic algorithms, neural networks and support vector machines (GNSVM). SSpro 2.0, proposed by Pollastri et al. (2002), uses profiles from BLAST and PSI-BLAST along with bidirectional recurrent neural networks, and outperformed the above methods with an accuracy of 78.13%. YASPIN, proposed by Lin et al. (2005), uses a single neural network for PSSP and a hidden Markov model for optimizing the results. Leong et al. took a rule-based data-mining approach that identifies dependencies between amino acids in a protein sequence and generates rules to predict secondary structure; their method is named RT-RICO (Relaxed Threshold Rule Induction from Coverings) (Leong et al., 2011). FLOPRED, proposed by Saraswathi et al. (2012), makes use of a knowledge base, a neural-network-based Extreme Learning Machine (ELM) and advanced Particle Swarm Optimization (PSO) techniques to predict protein secondary structures. Wu et al. (2004) used a knowledge base for PSSP in their method HYPROSP; the knowledge base is composed of amino acid residue words along with their secondary structures, and a measure termed 'match rate' indicates the amount of information the target protein can extract from the knowledge base. The method nearest to the one proposed here is that of Kountouris et al. (2012). The similarity is the two-phase approach: one phase for prediction and a second for smoothing the predictions. However, Kountouris et al. (2012) used 8 BRNNs (Bidirectional Recurrent Neural Networks) in the first phase, whose averaged output is fed to various filtering methods. A trend of using existing prediction servers and building a consensus model is also becoming popular; such an approach is demonstrated by Yan et al. (2014), who also used an SVM for filtering and achieved a Q3 accuracy of 85% on a 5-fold cross-validation data set.

Prediction accuracy largely depends on three parameters: (1) sequence database size, (2) the algorithm or method, and (3) better database search methods (Rost, 2002). Rost and Sander anticipate (Przybylski and Rost, 2007) that the increase in protein database sizes will lead to increased accuracy of protein structure prediction. We have exploited the availability of a huge sequence-structure database, coupled with the development of a novel PSSP algorithm. By finding and using the hidden, embedded associations of amino-acid residues together with a Backpropagation neural network, the proposed method achieves an accuracy of 90% for 3-class classification.
2. Materials and methods

The proposed algorithm, KB-PROSSP-NN, uses a hybrid approach: the knowledge-based method (KB-PROSSP) and a neural network (NN) for optimized results are used in cascade. Fig. 1 presents the flow chart of the KB-PROSSP-NN method, while the algorithmic details are given in Fig. 2. KB-PROSSP-NN is a two-phase algorithm. The first phase, KB-PROSSP, is a hierarchical lateral-validation technique for PSSP; it uses a knowledge base built from the statistics of association between 5-residue words and the corresponding secondary structure. The second phase uses a pre-trained Backpropagation neural network that corrects the discrepancies of the knowledge base used by KB-PROSSP. The implementation of the proposed method is carried out in the C#.NET environment.

Fig. 1. Flowchart of the proposed KB-PROSSP-NN method.
2.1. Data sets

Training set: A large protein sequence-secondary structure database, ss.txt, downloaded from www.pdb.org (Kabsch and Sander, 1983, 2012), is used to create the knowledge base from the statistics of the structural composition of the n-residue words (see Fig. 1). The database consists of 174,372 protein sequences and their corresponding secondary structures (the PDB ids can be provided on request). The secondary structures of all these protein sequences are obtained using the DSSP program (Kabsch and Sander, 1983). The database is preprocessed so that only the sequence and structure information is retained, and any duplicate sequence entries are removed. The test sequences are always excluded from the knowledge base creation process and the neural network training process.

Test set: Results on two popular test data sets, RS126 and CB396, are presented to provide comparability with existing state-of-the-art methods. The set of 126 non-homologous globular protein chains used by Rost and Sander (1993), referred to as the RS126 set, is used by most researchers as a benchmark for comparing accuracy. The average sequence identity is less than 31% and the average sequence length is 185 residues in the RS126 data set; currently, 117 out of the 126 sequences are available. Cuff and Barton (1999) created a data set of 396 proteins with sequence identity less than 34% and an average sequence length of 157 residues, popularly known as CB396. The method is also evaluated using additional hypothetical sequence sets, and the results are consistent. A few recent methods use the CB513 data set, which contains the 117 available RS126 sequences along with the 396 sequences of CB396.
2.2. Knowledge base

From the training set, the possible secondary structure combinations for each n-residue word are extracted; n is chosen to be 5 for lateral validation (the justification is discussed in Section 4). The knowledge base entries have the form (A1A2A3A4A5, S11S12S13:C1, S21S22S23:C2, ..., Sn1Sn2Sn3:Cn), where A1A2A3A4A5 is the 5-residue word, Sn1Sn2Sn3 is a secondary structure observed for A2A3A4, and Cn is the count of Sn1Sn2Sn3 in the entire database. For example, four such entries are given below:
AAAAA, HHH:000843, ---:000743, HHT:000063, TTS:000039, S-S:000026, --S:000026, TSS:000025, HTT:000023
AAAAC, S--:000016, -TT:000012, HHH:000004, SSS:000002, BSB:000001
AAAAD, HHH:000141, ---:000033, -ST:000023, TT-:000012, -TT:000012, B-S:000012, -GG:000009, HH-:000008
ZZVBD, ---:000002

The entry for the word "AAAAA" above means that this 5-residue word has 843 "HHH" occurrences, 63 "HHT" occurrences, and so on, for the centre three amino acid residues (AAA) in the database. Likewise, such entries are computed for all 5-residue words to form the knowledge base, the core of the proposed method. This knowledge base is used to predict the secondary structure of the test sequences using Algorithm 1, described in Fig. 2. The second phase takes the output of KB-PROSSP and uses a Backpropagation neural network to model the discrepancies in the KB-PROSSP predictions; the details are discussed in Section 2.4.
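As an illustration of this construction step, the following C# sketch counts, for every 5-residue word, the structure triplets observed for its centre three residues, as in the "AAAAA" example above. It assumes preprocessed (sequence, structure) string pairs of equal length; all names are illustrative rather than taken from the authors' implementation.

using System.Collections.Generic;

static class KnowledgeBaseBuilder
{
    // For every 5-residue word, count the DSSP triplets observed for its
    // centre three residues (A2A3A4).
    public static Dictionary<string, Dictionary<string, int>> Build(
        IEnumerable<(string Seq, string Str)> training) // sequence/structure pairs
    {
        var kb = new Dictionary<string, Dictionary<string, int>>();
        foreach (var (seq, str) in training)
            for (int i = 0; i + 5 <= seq.Length; i++)
            {
                string word = seq.Substring(i, 5);     // 5-residue word A1A2A3A4A5
                string core = str.Substring(i + 1, 3); // structure of A2A3A4
                if (!kb.TryGetValue(word, out var counts))
                    kb[word] = counts = new Dictionary<string, int>();
                counts[core] = counts.TryGetValue(core, out var c) ? c + 1 : 1;
            }
        return kb;
    }
}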
Algorithm 1: KB-PROSSP
Input: a query protein sequence S and a knowledge base
Output: predicted secondary structure S1 for the query sequence

k = 0
while k <= S.Length - 7 {                        // slide a 7-residue window over S
    sk = S.Substring(k, 7)                       // the 7 AA word at position k
    W[k] = compute(sk)                           // first pass of validation
    k = k + 1
}
for each set of 5 consecutive W's: Wi, Wi+1, Wi+2, Wi+3, Wi+4 {
    ss = predict(Wi, Wi+1, Wi+2, Wi+3, Wi+4)     // second pass of validation
    append ss to the predicted structure S1
}

Fig. 2. KB-PROSSP algorithm using a hierarchical and lateral validation technique.
2.3. KB-PROSSP

The input to the KB-PROSSP algorithm (Fig. 2) is a protein sequence and the output is the predicted secondary structure for that sequence. There are two levels of validation in KB-PROSSP. Lateral validation associates overlapping consecutive residue words and their secondary structures to predict the structural class of a residue; it is performed twice, in a hierarchical manner, which contributes to the notable results. The prediction strategy is best explained with an example. In the first pass of validation (Fig. 3a), the protein sequence is scanned with a window of 7 AA residues, which contains three overlapping consecutive 5-residue words. For example, the 7 AA word "MVLSEGE" has the three 5-residue words "MVLSE", "VLSEG" and "LSEGE". The structure statistics for the overlapping 5-residue words are read from the knowledge base. By correlating the three 5-residue words and their secondary structures as shown in Fig. 3a, a set of 5-character strings of the type "HHTTH", covering the window from its second residue onwards, is generated. Such a string is generated whenever the last two structure elements of the first word match the first two structure elements of the second word, and the last two structure elements of the second word match the first two structure elements of the third word (Fig. 3a, left). Let this set be named 'W'. Fig. 3c shows examples of W.
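A minimal C# sketch of this first pass is given below, reusing the knowledge base structure from Section 2.2. It is an illustration of the overlap-matching rule just described, not the authors' code; for instance, "HHT" + "HTT" + "TTH" combine into "HHTTH".

using System.Collections.Generic;

static class FirstPass
{
    // For one 7-residue window, correlate the candidate structure triplets of
    // its three overlapping 5-residue words and keep every consistent
    // 5-character structure string.
    public static List<string> Compute(string window7,
        Dictionary<string, Dictionary<string, int>> kb)
    {
        var w = new List<string>();
        if (!kb.TryGetValue(window7.Substring(0, 5), out var c1) ||
            !kb.TryGetValue(window7.Substring(1, 5), out var c2) ||
            !kb.TryGetValue(window7.Substring(2, 5), out var c3))
            return w;                                            // a word unseen in training
        foreach (var s1 in c1.Keys)
            foreach (var s2 in c2.Keys)
            {
                if (s1.Substring(1) != s2.Substring(0, 2)) continue; // words 1-2 overlap
                foreach (var s3 in c3.Keys)
                    if (s2.Substring(1) == s3.Substring(0, 2))       // words 2-3 overlap
                        w.Add(s1 + s2[2] + s3[2]);                   // e.g. "HHTTH"
            }
        return w;
    }
}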
The second pass of validation is shown in Fig. 3b. We take 5 sets from W, spanning a window of 9 AA with an offset of 1 (the first-pass margin), as shown in Fig. 3b and c. By comparing and correlating these sets we obtain the statistics of the structure elements for the centre AA. The structure element with the highest occurrence becomes the predicted structure for the centre AA of the window, i.e. the sixth residue from the start of the AA window (Fig. 3b). The second pass of lateral validation is illustrated with an example in Fig. 3c. Summarizing, the predicted secondary structure is 10 residues shorter than the AA sequence; this forms the output of the algorithm. The equation for the per-residue accuracy, Qm, is adapted accordingly:

Qm = 100 × C / (R - 5 - 5)    (2)

Fig. 3. Lateral association, validation and prediction. (a) The first pass of validation: comparing the secondary structures of the three consecutive overlapping 5-residue words obtained from the 7 AA substring of the query sequence; S = HHTTH is the structure returned for the 2nd penta word (VLSEG) (left); no match (right). (b) The second pass of validation: the structure combinations (W) for five consecutive overlapping 5-residue words, gathered from step (a), are evaluated for highest occurrence, thereby predicting the secondary structure of the 6th amino acid residue. (c) Example of the second pass of validation.
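A sketch of the second pass under the same assumptions follows. The index 4 - m (the position of the centre residue inside each candidate string) is our reading of the 9-AA window geometry of Fig. 3b, not code from the paper.

using System.Collections.Generic;

static class SecondPass
{
    // Five consecutive W-sets span nine residues; every candidate string
    // votes for the structure of the centre residue and the majority wins.
    public static char Predict(IReadOnlyList<List<string>> w) // w[0..4] from Compute()
    {
        var votes = new Dictionary<char, int>();
        for (int m = 0; m < 5; m++)
            foreach (string s in w[m])
            {
                char c = s[4 - m];        // centre residue's position in this string
                votes[c] = votes.TryGetValue(c, out var n) ? n + 1 : 1;
            }
        char best = 'C'; int bestCount = 0;  // assumed fallback: coil if W is empty
        foreach (var kv in votes)
            if (kv.Value > bestCount) { best = kv.Key; bestCount = kv.Value; }
        return best;
    }
}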
2.4. Neural network for refinement

A single three-layer feed-forward neural network with the Backpropagation learning algorithm is used. The rationale behind using a neural network to refine the predictions is as follows. The knowledge base is obtained by exploiting the statistics of the 5-residue words in the database, and a few rarely occurring combinations of sequence and secondary structure may not be statistically justified. A neural network, on the other hand, can use limited samples and extend the model to the non-sample space by accurately modeling the function space. PSSP in KB-PROSSP depends on discrete parameters, namely the statistics of the 5-residue words, which are not guaranteed to be optimal. Under these circumstances, scope remains for modeling the discrepancies of the knowledge-base model. Assume the error space of the knowledge-base model is E. The neural network selects a subset of E such that the population of the selected errors is higher in E than in E'. This is why a neural network is used as a refinement operator to optimize the results predicted by KB-PROSSP. The input to the neural network consists of 5 binary-coded amino acid residues and the 5 secondary structure elements predicted by KB-PROSSP; the desired output for training is the actual secondary structure. The inputs are randomly selected from the training data set.
2.4.1. Neural network architecture

The neural network architecture is decided by the application requirements and the algorithmic strategy. The neural network refines the predicted results by taking 5 amino acid residues and their predicted secondary structure obtained using KB-PROSSP. The network is configured with 5 binary input neurons for each amino acid and 3 binary input neurons for each secondary structure element, making a total of 40 input neurons. The desired output is known from the sequence-structure database; there are 3 binary output nodes for an 8-class classification. Hidden layers are responsible for model building. A large number of hidden neurons does not necessarily produce a better model; on the contrary, the network may memorize the input-output pairs instead of generalizing. Hence, one should start with a small set of hidden neurons and increase it gradually until an optimum model is built. The neural network is trained with one hidden layer of 24 hidden nodes; this number is experimentally optimized. For building a neural network, weight matrices indexed by (i, j) and (j, k), representing the input-hidden and hidden-output connections, are preferred, and all the weights have to be populated. However, in the actual model only a few connections may participate actively while the others stay dormant. To identify the optimal network, Mazumdar (1995) suggested a randomly connected network strategy in which dormant connections are removed using a pruning method, which drastically reduces the network size.
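The 40-input coding just described can be realized as below. The class orderings in the two alphabet strings are assumptions for illustration, since the paper does not list them.

static class InputEncoding
{
    const string AminoAcids = "ACDEFGHIKLMNPQRSTVWY"; // 20 residues: index fits in 5 bits
    const string Structures = "HGIEBTS.";             // 8 DSSP classes: index fits in 3 bits (assumed order)

    // 5 residues x 5 bits + 5 predicted structure elements x 3 bits = 40 inputs.
    public static double[] Encode(string residues5, string predicted5)
    {
        var x = new double[40];
        for (int i = 0; i < 5; i++)
        {
            int a = AminoAcids.IndexOf(residues5[i]);
            for (int b = 0; b < 5; b++)
                x[i * 5 + b] = (a >> b) & 1;          // binary code of the residue
            int s = Structures.IndexOf(predicted5[i]);
            for (int b = 0; b < 3; b++)
                x[25 + i * 3 + b] = (s >> b) & 1;     // binary code of the structure
        }
        return x;
    }
}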
2.4.2. Training and weight update

The neural network is trained using the output of KB-PROSSP and the actual secondary structure; after training, the network has learnt the misclassifications made by KB-PROSSP. For test sequences, the output of KB-PROSSP is obtained first and given as input to the neural network, whose output is observed to be 10-12% more accurate than KB-PROSSP alone. Eqs. (3)-(6) give the inputs and outputs of the neurons at the hidden and output layers, Eq. (7) is the cost function, and Eqs. (8) and (9) are the weight update equations for the hidden-output and input-hidden connections.

Xj = Σi Yi Wji    (3)

where Yi is the input, Wji is the weight between the i-th and j-th layers, and Xj is the input to the j-th layer.

Yj = 1 / (1 + e^(-Xj))    (4)

where Yj is the output of the j-th layer.

Xk = Σj Yj Wkj    (5)

where Wkj is the weight between the j-th and k-th layers and Xk is the input to the k-th layer.

Yk = 1 / (1 + e^(-Xk))    (6)

where Yk is the output of the k-th layer.

E = (1/2) (Yk - dk)^2    (7)

where E is the error (cost) function, Yk is the actual output and dk is the desired output.

Eqs. (8) and (9) update the weights of the output-layer and hidden-layer connections, respectively:

Wkj = Wkj + eta (dk - Yk) Yk (1 - Yk) Yj    (8)

Wji = Wji + eta [Σk (dk - Yk) Yk (1 - Yk) Wkj] Yj (1 - Yj) Yi    (9)

where eta is the learning rate.
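The update rules of Eqs. (3)-(9) translate directly into code. Below is a minimal C# sketch of one training step under the sign convention above; the dimensions, names and learning rate are illustrative, not the authors' values.

using System;

static class Backprop
{
    // One stochastic training step implementing Eqs. (3)-(9).
    // yIn: network inputs; dk: desired outputs; eta: learning rate.
    public static void TrainStep(double[] yIn, double[] dk,
                                 double[,] wJi, double[,] wKj, double eta)
    {
        int nJ = wJi.GetLength(0), nI = wJi.GetLength(1), nK = wKj.GetLength(0);
        var yJ = new double[nJ];
        var yK = new double[nK];
        for (int j = 0; j < nJ; j++)               // Eqs. (3)-(4): hidden layer
        {
            double x = 0;
            for (int i = 0; i < nI; i++) x += yIn[i] * wJi[j, i];
            yJ[j] = 1.0 / (1.0 + Math.Exp(-x));
        }
        for (int k = 0; k < nK; k++)               // Eqs. (5)-(6): output layer
        {
            double x = 0;
            for (int j = 0; j < nJ; j++) x += yJ[j] * wKj[k, j];
            yK[k] = 1.0 / (1.0 + Math.Exp(-x));
        }
        for (int j = 0; j < nJ; j++)               // Eq. (9), using the old wKj
        {
            double back = 0;
            for (int k = 0; k < nK; k++)
                back += (dk[k] - yK[k]) * yK[k] * (1 - yK[k]) * wKj[k, j];
            for (int i = 0; i < nI; i++)
                wJi[j, i] += eta * back * yJ[j] * (1 - yJ[j]) * yIn[i];
        }
        for (int k = 0; k < nK; k++)               // Eq. (8)
            for (int j = 0; j < nJ; j++)
                wKj[k, j] += eta * (dk[k] - yK[k]) * yK[k] * (1 - yK[k]) * yJ[j];
    }
}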
3. Experiments and results

The master database, ss.txt, is preprocessed to remove any duplicate sequence entries; the test data sets and the training data sets are mutually exclusive. A 10-12% improvement over KB-PROSSP alone is seen in the predicted results when the neural network is applied after KB-PROSSP. The performance of the method on the RS126 and CB396 test data sets is tabulated in Tables 1 and 2, which compare KB-PROSSP-NN with other state-of-the-art methods on the RS126 and CB396 data sets, respectively. It is very encouraging that Q3 accuracies of 90.16% and 82.28% are obtained on these two widely used test data sets. Fig. 4 is a histogram of % accuracy versus the count of sequences; over 37% of the sequences are predicted with a Q3 accuracy above 90%. An accuracy of 81% is achieved on a non-standard data set of 5625 sequences; the details are not included here.

Table 1
Prediction accuracy of different methods on the RS126 data set.

Method                                             Q8      Q3      Remarks
KB-PROSSP                                          70      77      Knowledge base
KB-PROSSP-NN                                       85.19   90.16   Knowledge base with neural network
Conditional neural fields (Wang et al., 2011)      64.7    -       Conditional neural fields
SSpro 2.0 (Pollastri et al., 2002)                 -       78.13   Recurrent neural networks
SSpro8 1.0 (Pollastri et al., 2002)                60.74   -       Recurrent neural networks
SSpro8 2.0 (Pollastri et al., 2002)                62.58   -       Recurrent neural networks
YASPIN (Lin et al., 2005)                          -       77.06   NN and HMM
BLAST-RT-RICO (Leong et al., 2011)                 -       89.93   Rule based with BLAST
RT-RICO (Leong et al., 2010)                       -       81.75   Rule based
PHD (Rost and Sander, 1994)                        -       71.4    Profile alignment and neural network

Table 2
Prediction accuracy of different methods on the CB396 data set (adapted from Leong et al. (2011)).

Method                                             Q3
KB-PROSSP-NN                                       82.28
PHD (Rost and Sander, 1994)                        71.9
Nguyen 2-stage SVM (Nguyen and Rajapakse, 2005)    76.3
BLAST-RT-RICO (Leong et al., 2011)                 87.7
RT-RICO (Leong et al., 2010)                       79.19
Fig. 4. Histogram showing the distribution of Q3 accuracy using KB-PROSSP-NN for the CB396 dataset and the RS126 dataset.
Table 3
Analysis of neural network resources.

5-residue word input space:               20^5 = 3,200,000
Secondary structure input space:          8^5 = 32,768
Total input space:                        20^5 × 8^5 = 104,857,600,000
Hidden layer capacity (24 neurons):       2^24 = 16,777,216
Ratio of input space to hidden capacity:  ~6250
Table 4
Experiments with hidden nodes reduced to 8, 12 and 16 as compared to 24 (Q8 accuracy).

Test set    No. of sequences    24 nodes    12 nodes    16 nodes    8 nodes
HM-1-386    386                 73.81       75.05       75.58       75.57
Exp386      386                 77.29       76.74       77.26       76.54
EXP1-386    386                 76.85       76.26       76.86       76.19
HSM1000     1000                74.82       75.87       76.54       75.79
CB396       386                 76.61       76.71       76.61       76.31
RS126       117                 85.19       81.21       81.16       80.85
Table 5
Experiments with hidden nodes reduced to 8 as compared to 24 (Q3 accuracy).

Test set    No. of sequences    24 nodes    8 nodes
HM-1-386    386                 79.08       81.1
Exp386      386                 82.77       82.22
EXP1-386    386                 81.95       81.42
CB396       386                 82.28       81.87
RS126       117                 90.16       86.47
HSM1000     1000                80          81.16
3.1. Avoiding over-learning

Efficient neural networks are known to generalize and evolve a black-box model of the input and expected output samples. An oversized network is often capable of memorizing the input-output pairs without generalization; such networks exhibit a small output error on training samples but perform poorly on new samples. It is therefore essential to optimize the neural network configuration to avoid over-learning (memorization). In the proposed method, the input to the neural network consists of two parameters: (1) the 5-residue word and (2) its corresponding secondary structure obtained using KB-PROSSP. The input space and the hidden neuron resources are given in Table 3. With these proportions, learning is only possible if the neural network evolves a model rather than memorizing: when a few hidden neurons are removed, the degradation in error is marginal for a model, whereas for memory-based learning the degradation would be exponential. The popular ways of validating that a network is not over-trained are to find a suitable ratio of input samples to hidden nodes, and to check the performance on independent data sets. Over-learning is associated with the availability of excess resources such as hidden nodes; adding noise to the hidden layer nodes increases the constraint on learning, which encourages model building. The model is tested with a variable number of hidden nodes, i.e. 8, 12, 16 and 24, as shown in Table 4; the variation in accuracy is about 1-2%. The Q3 accuracy with 8 and 24 hidden nodes is presented in Table 5. It is interesting to note that the differential change in accuracy due to the neural network is small and consistent across data sets of different sizes (Fig. 5). This indicates that the neural network error correction is model based, as it is independent of the size of the unknown test set.

Fig. 5. Comparative performance of KB-PROSSP and KB-PROSSP-NN on CB396, 1083, 2933 and 4891 data sets.

3.2. Error analysis of the CB396 data set

A detailed investigation revealed that RS126, although excluded from the training set, has close homologues in ss.txt, whereas CB396 has very little homology with it. Poor statistics in the knowledge base result in poor accuracy for a few sequences of the CB396 data set. To verify this, a frequency analysis was carried out. The 17 sequences with accuracy below 50% were separated (the resultant file is CB396PoorQ3Less50.txt), and the 5-residue word (penta) frequencies of the entire ss.txt and of CB396PoorQ3Less50.txt were computed using the application framework developed for protein characterization research, discussed in one of our earlier papers. The frequency distributions of ss.txt and of CB396PoorQ3Less50.txt are plotted in Figs. 6 and 7, respectively. For most of these 5-residue words the statistics differ by more than an order of magnitude and are not sufficient for prediction, which leads to erroneous predictions in the first phase, i.e. KB-PROSSP. The accuracy for an individual sequence depends on its composition in terms of 5-residue words and on the statistics of those words in the knowledge base. The Q3 accuracy of the CB396 sequences predicted below 50% by KB-PROSSP alone and together with the NN is plotted in Fig. 8, which shows the poor-performing sequences whose accuracy ranges between 11% and 54% at the KB-PROSSP output. It is clear that the anomaly in the low-accuracy group of CB396 is caused by the anomaly in KB-PROSSP; the role of the neural network is marginal compared with the share of KB-PROSSP. The neural network performance is tested with a testing set that is not part of the training set. It was observed that 17 out of the 386 sequences of the CB396 test set were predicted with a Q3 accuracy below 50%. To rule out any possibility of over-learning, two new data sets of 386 sequences each were generated; this size was chosen to allow comparison with the CB396 data set, which contains 386 sequences. The % accuracy of the three test sets is plotted in Fig. 9, and the results are comparable. The test sets are excluded from the knowledge base generation process and the neural network training.

Fig. 6. Frequency of all 5-residue words in ss.txt.

Fig. 7. Frequency of 5-residue words, in sequences of CB396 having accuracy <50%, in ss.txt.

Fig. 8. Comparison of KB-PROSSP and KB-PROSSP-NN for sequences having accuracy <50%.

Fig. 9. Comparison of CB396 data set % accuracy with two other data sets (EXP-386 and EXP-1-386) of equal size.

4. Discussion

A novel concept for protein secondary structure prediction using a two-tier architecture of knowledge base and neural network is presented here. The association of 5-residue words with the corresponding structure forms the knowledge base, which is used as a weighted look-up table followed by hierarchical and lateral validation of 5-residue words and structure. The choice of 5-residue words depends on a trade-off between memory usage and word occurrence frequency: increasing n reduces the occurrence counts of the n-residue words, leading to insufficient statistical information for prediction. Experimenting with n = 7 or higher is infeasible for the following reasons:

a. The memory requirement of the knowledge base increases by a factor of 26 × 26 when n grows from 5 to 7. (We cannot take n = 6, as an even-length word has no unique centre residue.)
b. For n = 7 there are 26^7 = 8,031,810,176 possible entries in the knowledge base, as compared to 26^5 = 11,881,376 entries for n = 5.
c. With the increase in the number of entries, the occurrence counts for each word will clearly reduce, probably becoming insufficient for prediction.
d. Storing, handling and finding the frequencies of 7-residue words requires a different algorithmic strategy, as the runtime memory would be insufficient.
e. The algorithm uses two-level hierarchical validation. For n = 5 the second level of validation uses a window of 9 amino-acid residues, which would become 13 for n = 7; this increases the computation time, as the process is carried out along the entire protein sequence.
f. Lastly, n = 5 is a subset of n = 7 but not vice-versa.
The basic motivation is to report our experimental results to the research community using a unique and novel cascaded combination of a knowledge base with a single neural network. In the first phase, the accuracy depends entirely on the statistics of the 5-residue words and their secondary structure possibilities. But there are exceptions, and for these exceptions the predicted secondary structure state may be incorrect. The neural network is expected to build a model of the discrepancies found in the knowledge-base predictions. Here, if the training data is appropriate, learning takes place; that is exactly what happens when the accuracy of certain proteins is refined effectively. The neural network is given three types of information: (1) the amino acid sequence, (2) the predicted structure and (3) the actual structure. It uses all three for model building, and the quality of the model built is proportional to the number and quality of the input samples. In simple words, the neural network alters the predictions made in the first phase, and these alterations are of two main types:
(1) Incorrect to correct. (2) Correct to incorrect. In case (1), refinement is effective because enough statistics are available to conclude the secondary structure state. In case (2), owing to insufficient input-output pairs for a particular 5-residue word, the refinement is not effective. However, the percentage of case (1) changes is larger than that of case (2), giving an overall increase in prediction accuracy. The neural network is thus used to further improve the accuracy by mapping the knowledge-base-predicted structure to the desired structure.
5. Conclusion

It is interesting to find that the knowledge base built using the 5-residue words, coupled with the Backpropagation neural network, gives a Q3 accuracy of 90%. This is because of the mapping of the knowledge-base-predicted results with the
neighboring structure elements using a supervised-learning neural network. It is intuitive to expect that the accuracy improves with lateral association, which models the binding forces acting to form a secondary structure.

Acknowledgments

The authors would like to thank Dr Rakesh Rawal for discussing the possible research avenues on oncoproteins and their characterization. This project is supported by the Planetary Exploration Research Program of the Physical Research Laboratory, Ahmedabad. The authors are thankful to the anonymous reviewers whose constructive suggestions have helped in bringing more clarity to the matter.

References

Chatterjee, P., et al., 2011. PSP_MCSVM: brainstorming consensus prediction of protein secondary structures using two-stage multiclass support vector machines. J. Mol. Model. 17, 2191–2201.
Chen, J., Chaudhari, N., 2007. Cascaded bidirectional recurrent neural networks for protein secondary structure prediction. IEEE/ACM Trans. Comput. Biol. Bioinf. 4 (4), 572–582.
Cuff, J.A., Barton, G.J., 1999. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Struct., Funct. Genet. 34, 508–519.
Kabsch, W., Sander, C., 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637.
Kim, H., Park, H., 2003. Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng. 16 (8), 553–560.
Kountouris, P., et al., 2012. A comparative study on filtering protein secondary structure prediction. IEEE/ACM Trans. Comput. Biol. Bioinf. 9 (3), 731–739.
Leong, L., Leopold, J.L., Kandoth, C., Frank, R.L., 2010. Protein secondary structure prediction using RT-RICO: a rule-based approach. Open Bioinf. J. 4, 17–30.
Leong, L., Leopold, J.L., Frank, R.L., 2011. Protein secondary structure prediction using BLAST and relaxed threshold rule induction from coverings. In: Proceedings of the Seventh IEEE International Conference on Bioinformatics and Bioengineering. IEEE, pp. 1355–1359.
Lin, K., Simossis, V.A., Taylor, W.R., Heringa, J., 2005. A simple and fast secondary structure prediction algorithm using hidden neural networks. Bioinformatics 21, 152–159.
Maclin, R., Shavlik, J.W., 1993. Using knowledge-based neural networks to improve algorithms: refining the Chou–Fasman algorithm for protein folding. Mach. Learn. 11, 195–215.
Mazumdar, H.S., 1995. A neural network toolbox using C++. CSI Commun.
Nguyen, M.N., Rajapakse, J.C., 2005. Two-stage multi-class support vector machines to protein secondary structure prediction. Pac. Symp. Biocomput. 10, 346–357.
Ouali, M., King, R.D., 2000. Cascaded multiple classifiers for secondary structure prediction. Protein Sci. 9, 1162–1176.
Pollastri, G., Przybylski, D., Rost, B., Baldi, P., 2002. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins 47, 228–235.
Przybylski, D., Rost, B., 2007. Predicting simplified features of protein structure. In: Bioinformatics—From Genomes to Therapies. Wiley-VCH, Weinheim.
Reyaz-Ahmed, A., Zhang, Y., 2007. Protein secondary structure prediction using genetic neural support vector machines. IEEE.
Rost, B., 2002. Alignments grow, secondary structure prediction improves. Proteins: Struct., Funct. Genet. 46, 197–205.
Rost, B., 2003. Rising accuracy of protein secondary structure prediction. In: Chasman, D. (Ed.), Protein Structure Determination, Analysis, and Modeling for Drug Discovery. Dekker, New York, pp. 207–249.
Rost, B., Sander, C., 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584–599.
Rost, B., Sander, C., Schneider, R., 1994. PHD—an automatic mail server for protein secondary structure prediction. Comput. Appl. Biosci. 10 (1), 53–60.
Saraswathi, S., Fernández-Martínez, J.L., Kolinski, A., Jernigan, R.L., Kloczkowski, A., 2012. Fast learning optimized prediction methodology (FLOPRED) for protein secondary structure prediction. J. Mol. Model. 18, 4275–4289.
Wang, Z., Zhao, F., Peng, J., Xu, J., 2011. Protein 8-class secondary structure prediction using conditional neural fields. Proteomics 11 (19), 3786–3792.
Wu, K.P., et al., 2004. HYPROSP: a hybrid protein secondary structure prediction algorithm—a knowledge-based approach. Nucleic Acids Res. 32 (17), 5059–5065.
Yan, J., Marcus, M., Kurgan, L., 2014. Comprehensively designed consensus of standalone secondary structure predictors improves Q3 by over 3%. J. Biomol. Struct. Dyn. 32 (1), 36–51.
Zhang, C.T., Zhang, R., 2001. A refined accuracy index to evaluate algorithms of protein secondary structure prediction. Proteins 43 (4), 520–522.
〈http://www.pdb.org〉, accessed 25 July 2012.