Applied Soft Computing 13 (2013) 1214–1221
Finding short structural motifs for re-construction of proteins 3D structure

Nikhil R. Pal a,b, Rupan Panja a,b,∗

a Electronics and Communication Sciences Unit, Indian Statistical Institute, 203 B.T. Road, Calcutta 700 108, India
b Computer Science and Engineering Department, Dr. Sudhir Chandra Sur Degree Engineering College, 540 Dum Dum Road, Surermath, Kolkata 700 074, India
Article info

Article history:
Received 1 June 2012
Received in revised form 5 October 2012
Accepted 16 October 2012
Available online 22 November 2012

Keywords: Building blocks; Structural motifs; Neural gas; Two-stage clustering; Protein folding
Abstract

With a view to finding useful building blocks (short structural motifs) for the reconstruction of the 3D structure of proteins, we propose a modified neural gas learning algorithm that we call the structural neural gas (SNG) algorithm. The SNG is applied to a benchmark protein data set and its performance is compared with a well-known algorithm from the literature, the two-stage clustering algorithm (TSCA). The SNG algorithm is found to generate better building blocks than TSCA. We have also compared the performance of the SNG algorithm with that of a recently reported Incremental Structural Mountain Clustering Method (ISMCM). In general, ISMCM is found to use more building blocks to yield results comparable to those of the SNG algorithm. We demonstrate the superiority of SNG over TSCA in terms of both local-fit and global-fit errors using fragments of length five, six, and seven. We also use a graphical means to compare the performance of the two algorithms.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Prediction of the 3D structure of proteins from the amino acid sequence is very important, as knowledge of protein structure helps us in many ways, such as in studying functionality and in drug design. But prediction of protein structure from its sequence is a difficult and challenging task. There have been many attempts to find, implicitly or explicitly, the relationship between the sequences of proteins and their 3D structures [6–23,32,33]. Possibly the most popular approach to structure prediction is based on homology modeling [1,2,6–11]. However, its applicability is limited in the sense that it requires good sequence similarity between proteins with known 3D structure and the protein whose 3D structure is to be predicted. Consequently, such a method will fail for proteins with no known homologous proteins. Neural networks such as the multilayer perceptron (MLP) and radial basis function (RBF) networks have been used extensively for successful prediction of protein structures [6–8]. Support vector machines (SVMs) have also been used for fold prediction of proteins [7–11]. These methods fall in the category of homology modeling and hence have the limitations of homology-based modeling mentioned earlier.
Ab initio modeling approaches [3–5] are the most general in nature, but they are usually extremely expensive in terms of computation, and modeling all kinds of interactions between molecules is not an easy task [3,34,35]. The building block (BB) approach identifies short structural motifs (building blocks), which occur frequently with some structure–sequence relationship, in a set of proteins that are not necessarily homologous [12–19]. The structural motifs are extracted from proteins with known 3D structure. Once identified, these building blocks can be used to construct/reconstruct 3D structures of proteins from their sequences. The building blocks can be identified from the 3D coordinates of the alpha-carbon atoms or from the dihedral angles of residues. In [20], the authors use a self-organizing map (SOM) to construct 16 protein blocks (PBs), which are three-dimensional structural fragments. Several methods of assigning PBs to sequence fragments have been developed; the Hybrid Protein Model (HPM) [21–23] is one of them. All these methods based on BBs or PBs need a clustering algorithm to find the BBs or PBs. For example, in [12], Unger et al. used a two-stage clustering algorithm, while in [30] a structural variant of the mountain clustering method is used. In this paper we use a modified version of the neural gas (NG) algorithm [24] for finding the BBs. Unlike in SOM, no topology between neurons is defined in NG. The neural gas algorithm is known to converge faster than the self-organizing map, K-means, or maximum-entropy clustering algorithms [25]. An incremental version of the NG algorithm is also in use [27]. Many successful applications of NG networks can be found in the literature [25–27].
2. Materials and methods
2.1. Database

In this paper we use the same database as used by Unger et al., referred to as the "refined Brookhaven" data, taken from the Brookhaven Protein Databank [12]. It consists of 82 proteins, out of which 21 proteins have been updated after Unger et al. collected their data. The list of updated proteins with their new PDB codes (in parentheses) is as follows [31]: 1APR (2APR), 1CPP (2CPP), 1CPV (5CPV), 1FB4h (2FB4h), 1FBJl (2FBJl), 1FDX (1DUR), 1GCR (4GCR), 1HMQa (2HMQa), 1INSa (4INSa), 1PCY (1PLC), 1SN3 (2SN3), 3PTP (5PTP), 3RXN (7RXN), 3TLN (8TLN), 4ADH (8ADH), 4ATCa (6AT1a), 1GAPa (1G6Na), 2FD1 (5FD1), 4FXN (2FOX), 2APP (3APP), and 4CYTr (5CYTr). Here we use the updated data set. Unger et al. used a set of four proteins (1BP2, 1PCY, 4HHBb, and 5PTI) as the training proteins; we use the same set for training. Primarily we consider fragments of length six (i.e., hexamers). A protein is divided into all possible overlapping fragments of length six, as done in [12]. Each fragment is then represented by the Cα coordinates of all residues in the fragment; in this case, a fragment is represented by the Cα coordinates of six consecutive residues. We use these fragments as our input for clustering/quantization to obtain a set of BBs. A protein can then be reconstructed by replacing each sequence fragment with the BB that best matches the associated actual 3D structure. Of course, this will lead to some quantization error. In addition to hexamers, we shall also use pentamers and heptamers.

2.2. Measure of similarity between fragments

Here our objective is to compute the structural similarity between two fragments, X and Y, each of length q. Two fragments of similar structure may be located far apart in 3D space and may not even be aligned with each other, so to measure their similarity we have to bring them together and align them. Thus, to make the measure of similarity shift-invariant, for each fragment the mean vector of the q 3D coordinates of the Cα atoms is first subtracted from the coordinate of each residue. Then the fragments X and Y are aligned to the best possible extent using the BMF (best molecular fit) algorithm [28,29] to compute the dissimilarity between them [12]. In other words, the objective of the BMF algorithm is to find a linear transformation T such that T(X) is as close to Y as possible. Then the root mean square (RMS) distance between the two aligned fragments is calculated as the measure of similarity [12]:

RMS(X, Y) = √( Σ_{i=1}^{q} ‖r_i^{T(X)} − r_i^{Y}‖² / (q − 2) )    (1)

Here r_i^{T(X)} is the coordinate of the Cα atom of the ith residue in fragment X after the BMF alignment, r_i^{Y} is the corresponding coordinate in fragment Y, and q is the length of the fragment. In the case of hexamers, q = 6. The BMF algorithm is the most popular approach to measuring structural homology [12,13,18,30,31].

2.3. Protein reconstruction

For reconstructing a given protein, we follow the same strategy as in [12]: we first find all possible overlapping hexamers of the protein, and then for each hexamer the most similar BB is found. Since two consecutive hexamers have five residues in common, we align five residues [12]. The last five residues of a fragment are common to the first five residues of the next fragment. Aligning the common five residues, we obtain the Cα coordinate of the sixth residue of the second fragment. In this way, the last five residues of the reconstructed chain are aligned with the first five residues of the best BB of the next fragment. This process is continued to reconstruct the whole protein.
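The BMF alignment used in Eq. (1) is the Kabsch procedure [28,29]. The following is a minimal NumPy sketch of that alignment and of the BMF-RMS distance; the function names (kabsch_align, bmf_rms) are our own illustrative choices, not the authors' implementation, and the sketch simply follows Eq. (1), including the (q − 2) normalization.

```python
import numpy as np

def kabsch_align(X, Y):
    """Rotate fragment X so that it best fits fragment Y (Kabsch [28,29]).
    X, Y: (q, 3) arrays of C-alpha coordinates. Returns the rotated, centered X."""
    Xc = X - X.mean(axis=0)                    # subtract centroids: shift invariance
    Yc = Y - Y.mean(axis=0)
    H = Xc.T @ Yc                              # covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # correct a possible reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # proper rotation
    return Xc @ R.T                            # T(X) of Eq. (1), aligned to centered Y

def bmf_rms(X, Y):
    """BMF-RMS distance of Eq. (1) between two fragments of equal length q."""
    q = X.shape[0]
    TX = kabsch_align(X, Y)
    Yc = Y - Y.mean(axis=0)
    return np.sqrt(np.sum((TX - Yc) ** 2) / (q - 2))
```

For two hexamers X and Y given as 6 × 3 coordinate arrays, bmf_rms(X, Y) returns the structural dissimilarity (in Å) used throughout the rest of the paper.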
2.4. Performance evaluation

The quality of the reconstructed proteins can be evaluated in several ways. Two common indices used for this are the local-fit RMS and the global-fit RMS; both have been used by many researchers [12,13,30,31].

1. Local-fit RMS error (LRMSE): the average RMS distance between a data fragment (hexamer) and its corresponding best BB after alignment. In our reported results, the LRMSE is computed on the test proteins.
2. Global-fit RMS error (GRMSE): the RMS error between the original protein and the reconstructed protein after alignment. The GRMSE is computed for each protein with at least 60 residues, and also for the first 60 residues only. This protocol is also followed in [12,13,30,31].

3. Extraction of building blocks

We explain the algorithms using hexamers, i.e., with fragments of length six. Later, in addition to hexamers, we shall also use pentamers and heptamers. As mentioned earlier, each protein is divided into overlapping fragments of length six; thus a protein of length S will generate S − 5 hexamers. Let Q = {Q1, Q2, . . ., Qn} be the set of training fragments and X = {x1, x2, . . ., xn}, xi ∈ Rp, be the corresponding set of vectors representing the Cα coordinates of the fragments; for hexamers, p = 6 × 3 = 18. X is our input to all building block finding algorithms.

3.1. Two-stage clustering algorithm (TSCA) of Unger et al. [12]

This algorithm has two distinct stages. The first stage of TSCA is similar to the K-nearest neighbor clustering algorithm. In this stage a hexamer is randomly chosen to form the first member of a cluster. Then all hexamers that are within a fixed distance, d*, of this lone hexamer are added to the cluster. The newly added hexamers are taken one at a time, and all hexamers that have not yet been assigned to a cluster and are within a distance d* of the current hexamer are added to the cluster. This process is continued till no other fragment can be added; in this way, we get the first primary cluster. Then, from the remaining non-assigned fragments, we pick a fragment randomly and treat it as the first member of the next cluster. The same annexation process is repeated to get the next primary cluster. This cluster-forming process is continued until all the hexamers are assigned to some cluster. These clusters are the primary clusters. In our experiments, following the protocol of Unger et al., we have used d* = 1 Å.

In this case, the diameter of a primary cluster, i.e., the maximum RMS distance between two hexamers in it, could be much higher than 1 Å. Hence, in the second stage of the algorithm these clusters are divided into sub-clusters. The objective of the second stage is to find sub-clusters such that the sub-cluster representative (BB) is within the distance d* (here 1 Å) of all other fragments in the sub-cluster. Finding such BBs is an NP-hard problem, so a heuristic algorithm is used. The hexamer that has the maximum number of neighbors within a 1 Å distance in a cluster forms a structural motif or building block (BB), and all neighbors of the BB that are within 1 Å form the sub-cluster. This process is repeated to assign all hexamers in a primary cluster to some sub-cluster. Finally, the entire process of finding sub-clusters (hence BBs) is repeated for all primary clusters.
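The two stages described above can be summarized in code. The sketch below is only a schematic rendering of this description, not Unger et al.'s program; it reuses the bmf_rms helper from the previous sketch and returns the indices of the fragments chosen as building blocks.

```python
def tsca(fragments, d_star=1.0):
    """Two-stage clustering sketch; fragments is a list of (q, 3) C-alpha arrays."""
    unassigned = set(range(len(fragments)))
    primary_clusters = []
    # Stage 1: grow primary clusters by annexing fragments within d_star.
    while unassigned:
        seed = unassigned.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [j for j in unassigned
                    if bmf_rms(fragments[current], fragments[j]) <= d_star]
            unassigned.difference_update(near)
            cluster.extend(near)
            frontier.extend(near)
        primary_clusters.append(cluster)
    # Stage 2: heuristically split each primary cluster into sub-clusters whose
    # representative (the BB) lies within d_star of every member.
    building_blocks = []
    for cluster in primary_clusters:
        remaining = set(cluster)
        while remaining:
            bb = max(remaining,
                     key=lambda i: sum(bmf_rms(fragments[i], fragments[j]) <= d_star
                                       for j in remaining))
            members = {j for j in remaining
                       if bmf_rms(fragments[bb], fragments[j]) <= d_star}
            building_blocks.append(bb)
            remaining -= members
    return building_blocks
```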
3.2. The proposed structural neural gas (SNG) network

As mentioned earlier, the neural gas (NG) algorithm is a neural network learning algorithm for vector quantization [24]. Let X = {x1, x2, . . ., xn}, xi ∈ Rp, be the input data set. The NG algorithm uses a set of, say, N neurons, where N is decided by the user. Each neuron is represented by a weight vector wi ∈ Rp; W = {w1, w2, . . ., wN}. The neurons are randomly initialized by data points taken from X. There is no predefined topology between the neurons. A typical NG algorithm uses the Euclidean distances between a randomly selected data point and the set of weight vectors in the network and then orders the neurons in decreasing order of similarity. The weight vectors are then updated, and the update amount depends on this order. In the present application, we cannot use the Euclidean distance as a measure of (dis)similarity, because two similar structures oriented in different directions may result in a very large Euclidean distance between them. We need a measure that captures the structural similarity between two fragments. Hence, we slightly modify the NG algorithm so that it can handle structural data. We call this modified network the structural neural gas (SNG) network.

In SNG, in each iteration of the algorithm, a data point xi is randomly selected from X, and the neurons are ordered in decreasing order of their structural similarity with xi. The structural similarity is computed using the best molecular fit (BMF) routine. A neuron that is higher up in the sorted list will learn more than one that is lower in the list. The adaptation rule can be described as winner-take-most. The learning or adaptation is as follows:

w(t + 1) = w(t) + ε(t) × e^{−k/λ(t)} × (x̂i − w(t))    (2)

where x̂i is xi after being aligned with w(t) using the best molecular fit algorithm [28,29], and ε(t) is the learning coefficient, which is decreased as follows:

ε(t + 1) = εi × (εf/εi)^{t/t_max}    (3)

λ(t + 1) = λi × (λf/λi)^{t/t_max}    (4)

In Eqs. (3) and (4), t is the iteration number. The excitation e^{−k/λ(t)} depends on the rank k of neural unit j, where k varies from zero to (N − 1). εi, εf, λi and λf are learning parameters. Note that in Eq. (2) we align x to w, not w to x, as our intention is to bring w closer to x. Also, to initialize the network, we use a set of randomly selected distinct data points.
3.2.1. Algorithm SNG

1. Initialize the set W with distinct data fragments xi taken randomly from X. Initialize t (the iteration number) to zero.
2. Select a random data fragment x = xi.
3. Align x to each weight wj and compute the RMS error RMSj for all j.
4. Order the neurons in increasing order of RMSj, j = 1, . . ., N.
5. Update the weights using the following rules:
   (a) w(t + 1) = w(t) + ε(t) × exp(−k/λ(t)) × (x̂i − w(t))
   (b) ε(t + 1) = εi × (εf/εi)^{t/t_max}
   (c) λ(t + 1) = λi × (λf/λi)^{t/t_max}
6. t ⇐ t + 1.
7. If t < t_max (the maximum number of iterations), go to step 2.

In [25], Martinez et al. suggested a set of values for the various learning parameters, and in this investigation we follow their suggestions. In particular, for all our experiments we have used εi = 0.5, εf = 0.005, λi = 10, λf = 0.01. In [25], Martinez et al. mention that this particular choice of values is not that critical.
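A compact sketch of the above iteration is given below. It is meant only to illustrate the rank-based, winner-take-most update of steps 1–7, assuming the kabsch_align and bmf_rms helpers sketched in Section 2.2; it does not reproduce the authors' code.

```python
import numpy as np

def train_sng(fragments, n_neurons, t_max,
              eps_i=0.5, eps_f=0.005, lam_i=10.0, lam_f=0.01, seed=0):
    """Structural neural gas sketch; fragments is a list of (q, 3) C-alpha arrays."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize weights with distinct, randomly chosen data fragments.
    init = rng.choice(len(fragments), size=n_neurons, replace=False)
    W = [fragments[i].copy() for i in init]
    for t in range(t_max):
        eps = eps_i * (eps_f / eps_i) ** (t / t_max)    # Eq. (3)
        lam = lam_i * (lam_f / lam_i) ** (t / t_max)    # Eq. (4)
        # Step 2: select a random data fragment.
        x = fragments[rng.integers(len(fragments))]
        # Steps 3-4: rank all neurons by their BMF-RMS distance to x.
        order = np.argsort([bmf_rms(x, w) for w in W])  # best neuron first
        # Step 5: winner-take-most update of Eq. (2); x is aligned to each w.
        for k, j in enumerate(order):
            x_aligned = kabsch_align(x, W[j]) + W[j].mean(axis=0)
            W[j] = W[j] + eps * np.exp(-k / lam) * (x_aligned - W[j])
    return W
```

As in the description above, the alignment direction matters: x is rotated onto w, so each prototype is updated in its own frame.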
Table 1
Values of LRMS error, GRMS error and the number of neurons for ten runs of the structural neural gas algorithm.

Trial                 Number of clusters   LRMSE          GRMSE
1                     100                  0.73           8.01
2                     100                  0.73           7.92
3                     100                  0.74           8.15
4                     101                  0.73           8.16
5                     100                  0.73           8.40
6                     99                   0.74           8.14
7                     98                   0.74           7.85
8                     98                   0.73           7.95
9                     100                  0.72           8.40
10                    100                  0.72           7.81
Average (std. dev.)   99.6                 0.73 (0.007)   8.09 (0.20475)
Unger et al.          101                  0.76           8.41
Yet we have conducted some experiments with a different choice of εi, and the performance is very similar; in fact, slightly better. We discuss this result at the appropriate places.

3.2.2. Assigning sequences to structural motifs

Unlike in other vector quantization problems, here we want to associate a sequence representation with each structural motif (building block), because finally we want to assign fragments to sequences. One approach could be to replace the weight of each neuron (BB) with the data fragment whose Cα coordinate representation is closest to the BB. But this may result in two BBs being represented by the same data fragment. So we follow a slightly different approach. We consider a bucket associated with each neuron. We take one data fragment at a time and assign it to the bucket of the neuron to which it is closest according to the BMF-RMS distance. Once all the data fragments are assigned to the buckets, we replace the weight associated with a neuron by the data fragment from its own bucket that is closest to it according to the BMF-RMS distance. There is a possibility that the bucket associated with a neuron is empty; if that happens, we delete that neuron. In this way we get a set of building blocks, where each building block also has an associated sequence representation.

4. Results

The number of neurons to be used for the structural neural gas algorithm has to be supplied externally. For a fair comparison between SNG and TSCA, the SNG algorithm starts with the number of neurons equal to the number of sub-clusters produced by TSCA. However, on termination of the SNG algorithm, the resulting number of neurons may be smaller due to the deletion of neurons. Our implementation of TSCA produced 55 primary clusters and 101 sub-clusters with LRMSE 0.76 and GRMSE 8.41. Since the quantizers produced by NG depend on the initialization, the structural neural gas algorithm was executed ten times, each time with a different initial condition. Table 1 shows the values of the local and global reconstruction errors for the ten runs. Table 1 reveals that in each of the ten runs, both the LRMSE and GRMSE are better than the corresponding errors with TSCA. The average values of LRMSE and GRMSE over the ten runs are 0.73 and 8.09, respectively. The average GRMSE is better than the GRMSE yielded by TSCA.

Following [31], we also make a graphical comparison of the quality of the BBs identified by TSCA and SNG. First, we divide the LRMS error values into a number of intervals and count the number of fragments falling into each interval. In Fig. 1, we display the frequency distribution (for run 10 in Table 1) of the LRMS error. Fig. 1 reveals that, compared to TSCA, the SNG building blocks represent more fragments with lower LRMS error. This observation also applies to the other trials of SNG.
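The curves in Figs. 1 and 2 are obtained by matching every test fragment to its best building block and then binning the resulting per-fragment errors. A small sketch of this counting is given below; it again assumes the bmf_rms helper from Section 2.2, and the bin width of 0.1 Å and the 2.0 Å range are our own illustrative choices, not values stated in [31].

```python
import numpy as np

def lrms_profile(test_fragments, building_blocks, bin_width=0.1, max_err=2.0):
    """Best-BB error for every test fragment, its mean (the LRMSE of Section 2.4),
    and the frequency-per-interval counts of the kind plotted in Figs. 1 and 2."""
    errors = np.array([min(bmf_rms(x, bb) for bb in building_blocks)
                       for x in test_fragments])
    edges = np.arange(0.0, max_err + bin_width, bin_width)
    counts, _ = np.histogram(errors, bins=edges)
    return errors.mean(), edges[:-1], counts
```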
Table 2
Comparison of TSCA and SNG algorithm with pentamers and heptamers using two sets of training data.

                      No. of BBs    SNG                                    TSCA
Fragment type         (SNG/TSCA)    LRMSE (std. dev)   GRMSE (std. dev)    LRMSE   GRMSE

First training set
  Pentamers           39/39         0.58 (0.0042)      8.55 (0.2317)       0.63    8.59
  Heptamers           159/167       0.92 (0.0055)      7.92 (0.2998)       0.93    7.64
Second training set
  Pentamers           48/48         0.54 (0.0053)      8.08 (0.233)        0.62    8.63
  Heptamers           243/265       0.84 (0.0037)      6.98 (0.1992)       0.89    7.01
Fig. 1. X-axis represents LRMS errors and Y-axis represents the frequency of fragments. The blue curve shows the results for TSCA and the red curve the results for the SNG algorithm, for trial 10 in Table 1.

Fig. 2. X-axis represents LRMS errors and Y-axis represents the frequency of fragments. The blue curve shows the results for TSCA and the red curve the results for the SNG algorithm, for trial 5 in Table 1. (For interpretation of the references to color in the text, the reader is referred to the web version of the article.)
For trial 10, compared to TSCA, the SNG algorithm replaces almost 1300 more data fragments by their nearest building block with an LRMS error of less than 0.2. Even if we consider the worst performing SNG trial (no. 5 in Table 1), we get a similar picture, as shown in Fig. 2.

We have also performed the experiments with data fragments of length five and seven, i.e., with pentamers and heptamers. In addition, we have used another four proteins as the second training set (5CYTr, 2SGA, 1LZ1, 2FOX) [12]. In the case of pentamers for the first training set, TSCA produced 39 building blocks with LRMSE 0.63 and GRMSE 8.59, as shown in Table 2. With pentamers and heptamers we also ran the SNG algorithm 10 times using the first set of training proteins; Table 2 summarizes the results. The SNG algorithm with 39 building blocks over ten trials produced an average LRMSE of 0.58 and an average GRMSE of 8.55. Although the GRMSE value remains nearly the same, the LRMSE value improves to some extent. Again, for the second set of training proteins, TSCA produced 48 building blocks with LRMSE 0.62 and GRMSE 8.63. The SNG algorithm with 48 building blocks over ten trials yielded an average LRMSE of 0.54 and GRMSE of 8.08; in this case both LRMSE and GRMSE improve. In the case of heptamers for the first set of training proteins, TSCA produced 167 building blocks and yielded an LRMSE of 0.93 and GRMSE of 7.64, while the SNG algorithm over 10 trials used on average 159 building blocks and resulted in an LRMSE of 0.92 and GRMSE of 7.92. For the second set of training proteins, TSCA generated 265 building blocks with an LRMSE of 0.89 and GRMSE of 7.01. The SNG algorithm, on the other hand, on average used 243 building blocks, producing an LRMSE of 0.84 and GRMSE of 6.98. In this case, the performance of the SNG algorithm is practically the same as that of TSCA, but SNG uses about 21 fewer building blocks than TSCA. Thus the quality of the SNG building blocks is definitely better than that of TSCA.

We have already mentioned that in all our experiments we have used the same set of parameter values recommended by Martinez et al. [25]. Nevertheless, to see the impact of these parameters on the performance of the system, we have also done some experiments with a few other significantly different choices of parameters while using the first set of training proteins. For example, we changed the parameter εi from 0.5 to 0.75 and ran the SNG algorithm ten times. This resulted in an average LRMSE of 0.73 and an average GRMSE of 7.85. Thus, we see that there is practically no change in the RMSE (the GRMSE improves slightly). Again, changing λi from 10 to 20, we obtained an average LRMSE of 0.73 and an average GRMSE of 8.18. In this case also the LRMSE (rounded) did not change, but the GRMSE increases marginally. To summarize, the effect of the parameters of the algorithm is not noticeable.

Although Tables 1 and 2 indicate that the SNG algorithm performs better than TSCA, to check whether such an improvement is statistically significant we proceed as follows. We change the order in which the proteins (hence the fragments) are given as input to TSCA. We run the TSCA algorithm five times, each with a different order of data feed, and compute the LRMSE and GRMSE. For each of these cases, we also run the SNG algorithm. Table 3 summarizes the results. We then use the Kruskal–Wallis test. Both for LRMSE and GRMSE the value of the χ² statistic is 6.82 (P value 0.009). Since in all five cases SNG has used fewer prototypes than the corresponding TSCA case, in each case the LRMSE and GRMSE are lower than the corresponding values for TSCA, and the P value is 0.009, we can infer that the improvement in performance by the SNG algorithm is statistically significant.
Table 3
Results to check the statistical significance of the improvement in performance.

            TSCA                                 SNG
Srl. no.    No. of centers   LRMSE    GRMSE      Avg. no. of centers   Avg. LRMSE   Avg. GRMSE
1           101              0.7669   8.4193     99.6                  0.7308       7.8183
2           99               0.7712   8.5952     98.3                  0.7356       8.1062
3           99               0.7709   8.6502     98.3                  0.7318       8.1125
4           100              0.7692   8.3435     99.2                  0.7324       7.9455
5           103              0.7677   8.3839     102                   0.7273       7.9791
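As a check, the reported test statistic can be reproduced from the LRMSE (or GRMSE) columns of Table 3 with any standard Kruskal–Wallis implementation; below is a small sketch using scipy.stats.kruskal (SciPy is our choice of tool here, the paper does not state which implementation was used).

```python
from scipy.stats import kruskal

# LRMSE values for the five data orderings in Table 3.
tsca_lrmse = [0.7669, 0.7712, 0.7709, 0.7692, 0.7677]
sng_lrmse = [0.7308, 0.7356, 0.7318, 0.7324, 0.7273]

stat, p = kruskal(tsca_lrmse, sng_lrmse)
print(f"chi-square = {stat:.2f}, P value = {p:.3f}")
# Because the two groups of ranks do not overlap, this yields 6.82 and 0.009,
# the values reported in the text (the GRMSE columns give the same result).
```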
Comparing our results with a recently reported method named ISMCM [31], which uses an incremental version of the structural mountain clustering method, we have found the following. For the first training set, the ISMCM algorithm uses about 10% more building blocks than the SNG algorithm when using pentamers and heptamers, while for hexamers ISMCM uses 4% more building blocks. Yet ISMCM achieves performance only similar to that of the SNG algorithm.

Let us now try to visualize how good the structural motifs found by SNG are at representing the data fragments. Fig. 3 depicts the structure of a building block representing the sequence GPNLWG, while Fig. 4 shows a target hexamer (FEKLLE) which matches the motif in Fig. 3 with an RMS error of 0.06. At first sight, one may think that the two fragments are quite dissimilar. But Fig. 5, which displays the target superimposed on the motif after BMF, reveals that it is a very good match. Similarly, Fig. 6 depicts a structural motif representing FGRKTG, and Fig. 7 shows a target fragment (EKMAEL), also with a very good match (RMS = 0.084) with the motif in Fig. 6. Figs. 5 and 8 show that after BMF it is almost impossible to find any difference between the motif and the target fragment. In Fig. 9, we depict a target hexamer (DYLIPI) of the building block shown in Fig. 6.
Fig. 3. A building block representing the sequence GPNLWG.

Fig. 4. A good-fit target hexamer representing FEKLLE.

Fig. 5. The building block (in red) is superimposed with the target hexamer (in blue) (RMSE = 0.06). (For interpretation of the references to color in the text, the reader is referred to the web version of the article.)

Fig. 6. A building block representing the sequence FGRKTG.

Fig. 7. A good-fit target hexamer representing EKMAEL.
Fig. 8. The building block (in red) is superimposed with the target hexamer (in blue) (RMSE = 0.084). (For interpretation of the references to color in the text, the reader is referred to the web version of the article.)
Fig. 9. The hexamer with sequence DYLIPI belonging to the building block in Fig. 6.
In Fig. 10 we superimpose the data fragment DYLIPI and the BB in Fig. 6 after BMF. Fig. 10 exhibits a poorer fit, although the RMSE is less than 1 Å. The RMS local and global quantization errors indicate that the BBs are good representatives of the fragments. Now we shall analyze the structural behavior of some of the most populated clusters. In particular, in Tables 4 and 5 we list the sequences, and their corresponding secondary structures, of the top 50 populated clusters produced by SNG. Table 4 corresponds to trial 10 in Table 1, while Table 5 refers to trial 5 in Table 1.
Fig. 10. Superimposed three-dimensional view of the building block and data fragment in Figs. 6 and 9, respectively (RMSE = 0.74).
Table 4
Details of top 50 frequently occurring building blocks, trial 10 in Table 1. The primary sequences, secondary structures, and frequencies of the 50 building blocks are listed column-wise, in the same order.

Building block no.: 1–50.

Primary sequence: QTHDNC EVGGEA QFNGMI KVKAHG NEITCS GKVTVN GEYSFY DAVMGN ETFEVA ENNACE GCRAKR TFVYGG DVLLGA IVFDED EITCSS SKVPYN QTFVYG FEVALS IRYFYN VMGNPK VPSEFS GSLAFV YKQAKK ISMSEE SKISMS FKNNAG TNNYSY ARIIRY FTPPVQ KKLDSC GVDASK CKVLVD NIVFDE ALSNKG CSSENN NFKSAE NKEHKN AKGETF GKVNVD TGPCKA ADDGSL PHQGAG EFTPPV SENNAC NLDKKN ISPGEK KVLVDN VVYPWT DGSLAF ALWGKV

Secondary structure: HHHHHH HHHHHH HHHHHH HHHHHH TEEEEEEEEEEEEEEE HHHHH-EEEEE T–HHH SSS–S EEEE-S EEEES-EE-TT EEEE-T HTS— EEEEEEEEE– EEEEEE HHH-HH ESSEEE –S-EE HHHHTT H—TT HHH— EEE-SS T—E –EEEE S-HHHH TT-HHH T–HHH HHHTT–EE-T E–S-E E-TT– -BSSHH -GGGBT STT-EE TT–HH –S— -TT–S GGTTTT GS-HHH TT–HH T–GGG E-TT-E HHTT– HHSGGG T–S-E HHHTT-

Frequency: 679 537 460 402 365 323 258 234 227 225 217 204 202 197 189 187 187 185 182 176 174 170 159 157 156 148 140 136 135 135 134 127 126 125 123 121 119 119 116 115 108 103 103 102 101 101 99 97 97 91
These two tables reveal that the top two most populated clusters correspond to all-alpha building blocks. In Table 5, the sequences corresponding to these two all-alpha BBs are EAFICN and QTHDNC. One wonders: are these two all-alpha building blocks almost identical? Is the RMSE between them after BMF very small? Why do we get two all-alpha BBs when they appear structurally similar? To understand this, we computed the RMSE after BMF; it is found to be 0.187, which is quite small. Fig. 11 superimposes the two BBs after BMF and reveals that they are structurally quite similar; hence, the two could possibly be merged at the cost of some additional quantization error. The picture is not the same for the other all-alpha building blocks. In fact, Table 5 has six BBs that represent an all-alpha local secondary structure. Table 6 shows the pairwise RMSE after BMF of these six BBs. Table 6 reveals that the pair (EAFICN, KVKAHG) has the smallest RMSE among all pairs of the six BBs; Fig. 12 shows these two BBs superimposed after BMF. The pair (HDNCYK, GRLLVV) has the largest RMSE after BMF; Fig. 13 shows these two BBs superimposed after BMF. Fig. 13 reveals that although the two BBs have very similar structures, even after BMF there is a noticeable difference between the two.
Table 5
Details of top 50 frequently occurring building blocks, trial 5 in Table 1. The primary sequences, secondary structures, and frequencies of the 50 building blocks are listed column-wise, in the same order.

Building block no.: 1–50.

Primary sequence: EAFICN QTHDNC QFNGMI KVPYNK NEITCS MVGKVT DAVMGN ETFEVA KVKAHG QTFVYG IDVLLG TFEVAL VMGNPK HDNCYK ENNACE IRYFYN RIIRYF AKGETF NIVFDE SGTPVD VPSEFS GSLAFV FVYGGC FKNNAG ITCSSE SLAFVP NFKSAE DVLLGA TGPCKA KKLDSC ADDGSL IPSSEP YNKEHK CKVLVD ALSNKG DGLAHL FTPPVQ AGLCQT VALSNK EEDLLN NNFKSA NAKGET DFCLEP PGEKIV FYNAKA GRLLVV GKVNVD LVVYPW CKARII CKIPSS

Secondary structure: HHHHHH HHHHHH HHHHHH TS—G TEEEE-EEEEE HHHHH-EEEEE HHHHHH EEEEE-EEEES EEEEEHHH-HH HHHHHH T–HHH EEEEEE -EEEEE STT-EE –EE-T -SS-SS ESSEEE –S-EE EEE-SS EEE-SS EEE-TT -S-EES -BSSHH EEEES–S— TT-HHH -TT–S -TT–H –GGGB HHHTTE–S-E HHHTTG S-HHHH TTEEEE EE–STT–BS-BSSH -STT-E GGGGSTT-EEE EEETTT HHHHHH TT–HH HHHSGG —EE HH-TT-

Frequency: 666 577 332 302 285 276 266 260 250 233 230 216 215 213 202 199 194 189 185 179 179 162 153 151 150 146 145 145 143 141 140 132 128 125 123 122 121 117 117 115 113 113 108 107 105 105 104 103 101 101
Table 6
The RMSE between pairs of all-alpha building blocks in Table 5.

          EAFICN   QTHDNC   QFNGMI   KVKAHG   HDNCYK   GRLLVV
EAFICN    0.000    0.187    0.194    0.153    0.327    0.387
QTHDNC    0.187    0.000    0.221    0.191    0.246    0.427
QFNGMI    0.194    0.221    0.000    0.243    0.213    0.454
KVKAHG    0.153    0.191    0.243    0.000    0.336    0.484
HDNCYK    0.327    0.246    0.213    0.336    0.000    0.555
GRLLVV    0.387    0.427    0.454    0.484    0.555    0.000
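Table 6 can be produced mechanically with the bmf_rms helper sketched in Section 2.2, given the Cα coordinates of the six all-alpha building blocks; the loop below is only an illustration of that computation, with our own naming.

```python
import numpy as np

def pairwise_bmf_rmse(bb_coords):
    """bb_coords: dict mapping a BB sequence (e.g. 'EAFICN') to its (6, 3)
    C-alpha coordinate array. Returns the sequences and the symmetric
    BMF-RMSE matrix corresponding to Table 6."""
    names = list(bb_coords)
    n = len(names)
    rmse = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = bmf_rms(bb_coords[names[i]], bb_coords[names[j]])
            rmse[i, j] = rmse[j, i] = d
    return names, rmse
```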
Fig. 11. The all-alpha building blocks EAFICN and QTHDNC superimposed after the best molecular fit (RMSE = 0.187).

Fig. 12. The all-alpha building blocks EAFICN and KVKAHG superimposed after the best molecular fit (RMSE = 0.153).

Fig. 13. The all-alpha building blocks HDNCYK and GRLLVV superimposed after the best molecular fit (RMSE = 0.555).

5. Conclusion

In this paper we have first discussed the two-stage clustering algorithm of Unger et al. [12] and then proposed a modified version of the neural gas algorithm. The structural neural gas algorithm is suitable for structural data, where the similarity between objects cannot be computed directly using the Euclidean distance. Both algorithms are tested on a benchmark data set, and the SNG algorithm is found to perform consistently better than TSCA. The superiority of SNG is demonstrated using pentamers, hexamers, and heptamers. As explained in [30,31], the poorer performance of TSCA may result from the fact that it uses only the frequency, and not the geometry, of the data set to find the building blocks.

References
[1] M.A. Marti-Renom, A.C. Stuart, A. Fiser, R. Sanchez, M. Francisco, A. Sali, Comparative protein structure modeling of genes and genomes, Annual Review of Biophysics and Biomolecular Structure 29 (2000) 291–325.
[2] D. Baker, A. Sali, Protein structure prediction and structural genomics, Science 294 (2001) 93–96.
[3] R. Bonneau, D. Baker, Ab initio protein structure prediction: progress and prospects, Annual Review of Biophysics and Biomolecular Structure 30 (2001) 173–189.
[4] R. Bonneau, J. Tsai, I. Ruczinski, D. Chivian, C. Rohl, C.E.M. Strauss, D. Baker, Rosetta in CASP4: progress in ab initio protein structure prediction, Proteins 45 (2001) 119–126.
[5] Y. Liu, D.L. Beveridge, Exploratory studies of ab-initio protein structure prediction: multiple copy simulated annealing, amber energy functions, and a generalized born/solvent accessibility solvation model, Proteins: Structure, Function, and Genetics 46 (2002) 128–146.
[6] N.R. Pal, D. Chakraborty, Some new features for protein fold prediction, in: Proceedings of the ICONIP 2003, 2003, pp. 1176–1183.
[7] I.F. Chung, C.D. Huang, Y.H. Shen, C.T. Lin, Recognition of structure classification of protein folding by NN and SVM hierarchical learning architecture, in: Proceedings of the ICONIP 2003, 2003, pp. 1159–1167.
[8] C.H.Q. Ding, I. Dubchak, Multi-class protein fold recognition using support vector machines and neural networks, Bioinformatics 17 (4) (2001) 349–358.
[9] M.N. Nguyen, J.C. Rajapakse, Multi-class support vector machines for protein secondary structure prediction, Genome Informatics 14 (2003) 218–227.
[10] F. Markowetz, L. Edler, M. Vingron, Support vector machines for protein fold class prediction, Biometrical Journal 45 (3) (2003) 377–438.
[11] X.D. Sun, R.B. Huang, Prediction of protein structural classes using support vector machines, Amino Acids 30 (4) (2006) 469–475.
[12] R. Unger, D. Harel, S. Wherland, J.L. Sussman, A 3D building blocks approach to analyzing and predicting structure of proteins, Proteins: Structure, Function, and Genetics 5 (1989) 355–373.
[13] R. Kolodny, P. Koehl, L. Guibas, M. Levitt, Small libraries of protein fragments model native protein structures accurately, Journal of Molecular Biology 323 (2002) 297–307.
[14] J.M. Bujnicki, Protein-structure prediction by recombination of fragments, ChemBioChem 7 (2006) 19–27.
[15] K.T. Simons, C. Kooperberg, E. Huang, D. Baker, Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions, Journal of Molecular Biology 268 (1997) 209–225.
[16] D. Kihara, J. Skolnick, The PDB is a covering set of small protein structures, Journal of Molecular Biology 334 (2003) 793–802.
[17] Y. Zhang, J. Skolnick, The protein structure prediction problem could be solved using the current PDB library, Proceedings of the National Academy of Sciences of the United States of America 102 (2005) 1029–1034.
[18] R. Kolodny, M. Levitt, Protein decoy assembly using short fragments under geometric constraints, Biopolymers 68 (2003) 278–285.
[19] B.H. Park, M. Levitt, The complexity and accuracy of discrete state models of protein structure, Journal of Molecular Biology 249 (1995) 493–507.
[20] A.G. de Brevern, C. Etchebest, S. Hazout, Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks, Proteins 41 (2000) 271–287.
[21] A.G. de Brevern, S. Hazout, Compacting local protein folds with a hybrid protein model, Theoretical Chemistry Accounts 106 (1/2) (2001) 36–47.
[22] A.G. de Brevern, S. Hazout, Hybrid protein model for optimally defining 3D protein structure fragments, Bioinformatics 19 (2003) 345–353.
[23] C. Benros, A.G. de Brevern, C. Etchebest, S. Hazout, Assessing a novel approach for predicting local 3D protein structures from sequence, Proteins: Structure, Function, and Bioinformatics 62 (2006) 865–880.
[24] T. Martinetz, K. Schulten, Topology representing networks, Neural Networks 7 (3) (1994) 507–522.
[25] T. Martinetz, S. Berkovich, K. Schulten, "Neural-gas" network for vector quantization and its application to time-series prediction, IEEE Transactions on Neural Networks 4 (4) (1993) 558–569.
[26] I. Licata, L. Lella, Evolutionary neural gas (ENG): a model of self organizing network from input categorization, EJTP 4 (14) (2007) 31–50.
[27] I.J. Sledge, J.M. Keller, Growing neural gas for temporal clustering, in: Proceedings of the 19th International Conference on Pattern Recognition, 2008, pp. 1–4.
[28] W. Kabsch, A solution for the best rotation to relate two sets of vectors, Acta Crystallographica A32 (1976) 922–923.
[29] W. Kabsch, A discussion of the solution for the best rotation to relate two sets of vectors, Acta Crystallographica A34 (1978) 828–829.
[30] K.L. Lin, C.T. Lin, N.R. Pal, S. Ojha, Structural building blocks: construction of protein 3-D structures using a structural variant of mountain clustering method, IEEE Engineering in Medicine and Biology Magazine 28 (4) (2009) 38–44.
[31] K.L. Lin, C.T. Lin, N.R. Pal, Incremental mountain clustering method to find building blocks for constructing structure of proteins, IEEE Transactions on NanoBioscience 9 (4) (2010) 278–288.
[32] M. Ruizhen, L.I. Xu, H. Chen, Y. Huang, Y. Xiao, A symmetry-related sequence structure relation of proteins, Chinese Science Bulletin 50 (6) (2005) 536–538.
[33] E. Krissinel, On the relationship between sequence and structure similarities in proteomics, Bioinformatics 23 (6) (2007) 717–723.
[34] A.G. Murzin, Structure classification based assessment of CASP3 predictions for the fold recognition targets, Proteins: Structure, Function, and Genetics 37 (1999) 88–103.
[35] C.A. Orengo, J.E. Bray, T. Hubbard, L. LoConte, I. Sillitoe, Analysis and assessment of ab initio three dimensional prediction, secondary structure, and contacts prediction, Proteins: Structure, Function, and Genetics 37 (1999) 149–170.