Computers in Biology and Medicine 39 (2009) 166 -- 172
STON: A novel method for protein three-dimensional structure comparison
Changiz Eslahchi a,d,∗, Hamid Pezeshk b, Mehdi Sadeghi c, Amir Massoud Rahimi d, Heydar Maboudi Afkham a, Shahriar Arab e
a Department of Mathematical Sciences, Shahid Beheshti University, Post Code 1983963113, Tehran, Iran
b School of Mathematics, Statistics and Computer Science and Center of Excellence in Biomathematics, University College of Science, University of Tehran, P.O. Box 14155-6455, Tehran, Iran
c National Institute of Genetic Engineering and Biotechnology, P.O. Box 14155-6343, Tehran, Iran
d School of Computer Science, Institute for Research in Fundamental Sciences (IPM), P.O. Box 19395-5746, Tehran, Iran
e Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, P.O. Box 13145-1384, Tehran, Iran
Article info
Article history: Received 5 August 2007; accepted 5 December 2008
Keywords: Protein structure comparison; Structure alignment; Greedy algorithm; Rigid body transformation; Dihedral angle

Abstract
Protein structure comparison is an important problem in bioinformatics and has many applications in the study of structural and functional genomics. During the last decades, various heuristic methods have been developed to solve the protein structure comparison problem. Most of these methods report the alignment with the minimum RMSD (root mean square deviation) and ignore many significant local alignments that may be important for evolutionary or functional studies. We have developed a new algorithm that finds aligned residues in two proteins with a desired RMSD value. The parameterized distance and rotation in this program enable us to search for strongly or weakly similar aligned fragments in two proteins. © 2009 Published by Elsevier Ltd.
∗ Corresponding author at: Department of Mathematical Sciences, Shahid Beheshti University, Post Code 1983963113, Tehran, Iran. E-mail addresses: [email protected], [email protected] (C. Eslahchi).
0010-4825/$ - see front matter © 2009 Published by Elsevier Ltd. doi:10.1016/j.compbiomed.2008.12.004

1. Introduction

Protein structure comparison is a fundamental task in molecular biology: it helps to understand the physical, chemical and biological properties of proteins based on the similarities and differences among the three-dimensional structures of related proteins. Since protein structures are more conserved than protein sequences during evolution, structural comparison provides more information about evolutionary history than traditional comparison based on sequence analysis. About 45 000 protein structures were available from the Protein Data Bank (PDB) [1] when this report was prepared (August 2007) (http://www.rcsb.org), and the number of new structures has been growing rapidly. Intensive efforts have been made to classify all known structures, yielding structural databases such as SCOP [2], CATH [3] and FSSP [4]. Comparison of a newly determined structure with previously classified structures is useful for determining its protein family, evolutionary relationships and function.

The general problem of protein structure comparison is NP-hard [5]. Thus, various heuristic methods have been developed during the last three decades [6–16]. The proposed algorithms usually fall into two major categories: intermolecular and intramolecular approaches. In the intermolecular approach, one protein structure is rotated and translated in three-dimensional space so that its relevant parts are superimposed on another protein structure and their intermolecular distances are minimized [6–10]. In the intramolecular approach, the internal distances between atoms within each protein are computed, and these distances are compared between the two proteins [11–16]. Despite the existence of numerous methods, the problem of protein structure alignment has not been solved, and current methods produce different results [7,17,18].

Two criteria that have been developed for comparing the quality of protein structure alignments are the root mean square deviation (RMSD) of the aligned residues and the number of residues that participate in the alignment. Different heuristic approaches have been developed to balance a lower RMSD against a larger number of matching residues, and these lead to different alignment results for the same pair of proteins.

In this article, we present a greedy algorithm to find the corresponding aligned residues in two proteins with a desired RMSD value. We first search for the best superposition of all relatively small fragments of the two proteins that have the same geometry, using rigid transformations in a small search space, to find optimal alignments that serve as seeding fragments; then, by a greedy algorithm, we choose the alignment with minimum RMSD. The best solution of this step is passed to iterative steps that reduce the RMSD of the matching. It is notable
that our approximation is parameterized: the alignment can be restricted to a user-selected maximal distance between matched residues. This enables us to search for strongly or weakly similar fragments in two proteins; in other words, near or remote structural homology can be found. A web-based implementation of this algorithm is available at http://bioinf.cs.ipm.ac.ir/softwares/ston. It takes two protein structures as input, either as uploaded PDB-format files or as PDB IDs, and reports the structural alignments between them based on the user's predefined parameters. The output is a PDB file with two chains generated by the server and a text file that shows the aligned residues with gaps.

2. Methods

Let A be a protein consisting of n residues labelled 1, 2, ..., n. For every four consecutive residues i, i+1, i+2 and i+3 with $i \le n-3$, we define $S_i$ to be the structure formed by their alpha carbons. As shown by de la Cruz et al. [19], the conformation of a protein can be completely identified by the three virtual bonds and angles of four consecutive C$\alpha$ atoms (Fig. 1). The vector $\overrightarrow{ij}$ is the transformation vector from the alpha carbon of residue i to that of residue j. To each residue $i \le n-3$ we assign a triple of angles $(\theta^A_{i1}, \theta^A_{i2}, \theta^A_{i3})$, measured in radians, where $\theta^A_{i1}$ is the angle between the vectors $\overrightarrow{i(i+1)}$ and $\overrightarrow{(i+1)(i+2)}$; $\theta^A_{i2}$ is the angle between the vectors $\overrightarrow{(i+1)(i+2)}$ and $\overrightarrow{(i+2)(i+3)}$; and $\theta^A_{i3}$ is defined by

$$\theta^A_{i3} = -\,\frac{\overrightarrow{(i+2)(i+3)} \cdot \left(\overrightarrow{i(i+1)} \times \overrightarrow{(i+1)(i+2)}\right)}{\left|\overrightarrow{(i+2)(i+3)} \cdot \left(\overrightarrow{i(i+1)} \times \overrightarrow{(i+1)(i+2)}\right)\right|}\;\varphi_{i3},$$

where $\cdot$, $\times$ and $\varphi_{i3}$ denote the inner product, the outer (cross) product and the angle between the vectors $\overrightarrow{i(i+1)}$ and $\overrightarrow{(i+2)(i+3)}$, respectively (see Fig. 1 for more details).

Let A and B be two proteins having n and m residues, respectively. For any positive real numbers $\varepsilon$, $\delta$, $\alpha$ and p with $0 \le p \le 1$, we define the distance matrix $M = [m_{ij}]$ such that for each $1 \le i \le n-3$ and $1 \le j \le m-3$, $m_{ij} = 1$ whenever

$$|\theta^A_{i1} - \theta^B_{j1}| + |\theta^A_{i2} - \theta^B_{j2}| + |\theta^A_{i3} - \theta^B_{j3}| < 3\varepsilon$$

and $m_{ij} = 0$ otherwise. Non-zero entries of M correspond to locally similar structures in the two proteins. To find the consecutive locally similar structures, we need to determine a set $S = \{m_{i_1 j_1}, m_{i_2 j_2}, \ldots, m_{i_k j_k}\}$ such that:
(1) $m_{i_l j_l} = 1$ for every $1 \le l \le k$,
(2) $i_1 < i_2 < \cdots < i_k$ and $j_1 < j_2 < \cdots < j_k$.

To obtain the longest run of consecutive locally similar structures, the largest such S should be constructed. We apply a greedy algorithm to this task. Suppose $m_{ij}$ is a member of S; the next member must be chosen from the elements $m_{rs}$ for which $r \ge i+1$ and $s \ge j+1$. Therefore, in each step of the algorithm we choose a non-zero entry that leaves the most choices for the next member of S. To do this we use the following conditions:
(3) $i_1 + j_1 = \min\{i + j \mid m_{ij} = 1\}$,
(4) $(i_t + j_t) - (i_{t-1} + j_{t-1}) = \min\{(i + j) - (i_{t-1} + j_{t-1}) \mid m_{ij} = 1,\ i > i_{t-1},\ j > j_{t-1}\}$.

Obviously, if $m_{11} = 1$ then it is chosen as the first element of S; indeed, in this case a largest S containing $m_{11}$ always exists.

In the next step we construct a comparison set, denoted by C, using the following algorithm. For $1 \le r \le n-1$ and $1 \le r' \le m-1$:
(1) Put alpha carbon $r'$ of protein B on alpha carbon r of protein A.
(2) Move the vector $\overrightarrow{r'(r'+1)}$ so that it lies on the vector $\overrightarrow{r(r+1)}$. An element $m_{ss'}$ of S is called $(r, r')$-suitable if and only if the distance between the centers of mass of the structures $S_s$ of A and $S_{s'}$ of B is less than $\delta$ Å in a rotation of protein B around the vector $\overrightarrow{r'(r'+1)}$. Let $F(r, r')$ denote the set of all $(r, r')$-suitable elements of S.
(3) If $|F(r, r')| > p|S|$, then $(r, r') \in C$.

Suppose $(r, r') \in C$. We rotate protein B by an angle $k\alpha$ around the vector $\overrightarrow{r(r+1)}$; the rotation is made in steps of $\alpha$, with $0 \le k < 2\pi/\alpha$. For each k-rotation, we construct a bipartite graph $G_k(r, r')$ as follows. The vertex set of $G_k(r, r')$ is $X \cup Y$, where X is the set of residues of protein A and

$$Y = \{a' \mid a' \text{ is a residue in } B \text{ and there exists a residue } a \in A \text{ such that } d(a, a') \le \delta\},$$

where $d(a, a')$ is the Euclidean distance between the alpha carbons of a and $a'$. For $x \in X$ and $y \in Y$, xy is an edge of $G_k(r, r')$ whenever $d(x, y) \le \delta$. Assume that the elements of X and Y are sorted according to their positions in proteins A and B, respectively. Using a greedy algorithm, we choose a matching of the graph $G_k(r, r')$ as follows:
(1) We start with the first residue in X with non-zero degree, say $a_1$. Then we choose the first vertex in Y that is adjacent to $a_1$, say $b_1$.
(2) Delete all the vertices in X that are less than or equal to $a_1$ and all the vertices in Y that are less than or equal to $b_1$.
(3) We continue choosing edges and deleting vertices in the same manner until we obtain a subgraph of $G_k(r, r')$ with no edges.
(4) Suppose $a_1 b_1, a_2 b_2, \ldots, a_t b_t$ are the selected edges, and let $M_k(r, r')$ denote this matching.

Clearly, $M_k(r, r')$ provides the equivalence of residues for the superposition of A and B corresponding to k, r and $r'$. The quality of this alignment is measured by the RMSD of the C$\alpha$ positions of the matched residues after superposition. Let $R_k(r, r')$ denote the RMSD of $M_k(r, r')$. We define an order on the set

$$\{(M_k(r, r'), R_k(r, r')) \mid (r, r') \in C,\ k \in \mathbb{N},\ 0 \le k < 2\pi/\alpha\}$$

by $(M_k(r, r'), R_k(r, r')) < (M_{k_1}(r_1, r_1'), R_{k_1}(r_1, r_1'))$ if one of the following conditions holds:
(1) $|M_k(r, r')| < 0.9\,|M_{k_1}(r_1, r_1')|$,
(2) $0.9\,|M_{k_1}(r_1, r_1')| < |M_k(r, r')| < 1.1\,|M_{k_1}(r_1, r_1')|$ and
$$\frac{R_k(r, r')}{|M_k(r, r')|} < \frac{R_{k_1}(r_1, r_1')}{|M_{k_1}(r_1, r_1')|}.$$

Fig. 1. Definition of the three angles $\theta_{i1}$, $\theta_{i2}$ and $\theta_{i3}$ for four consecutive C$\alpha$ atoms, which are numbered i to i+3.
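As a rough illustration of the local-geometry step, the following Python sketch computes the angle triple for every window of four consecutive alpha carbons, fills the binary matrix M, and then selects the chain S greedily according to conditions (3) and (4). The function names, the 0-based indexing, the default tolerance `eps=0.05` and the tie-breaking rule (smallest i among entries with equal i + j) are assumptions made for this example; it is not the authors' C++ implementation.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def angle(u, v):
    """Unsigned angle in radians between vectors u and v."""
    c = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    return math.acos(max(-1.0, min(1.0, c)))

def angle_triples(ca):
    """One (theta1, theta2, theta3) triple per window of four consecutive C-alpha atoms.

    ca is a list of (x, y, z) tuples; windows are indexed from 0 (assumption).
    """
    triples = []
    for i in range(len(ca) - 3):
        b1, b2, b3 = (sub(ca[i + k + 1], ca[i + k]) for k in range(3))
        t = dot(b3, cross(b1, b2))        # scalar triple product
        sign = (t > 0) - (t < 0)          # +1, 0 or -1
        # third angle: angle between b1 and b3, signed by minus the triple product
        triples.append((angle(b1, b2), angle(b2, b3), -sign * angle(b1, b3)))
    return triples

def similarity_matrix(ta, tb, eps=0.05):
    """m[i][j] = 1 when the summed absolute angle differences are below 3 * eps."""
    return [[int(sum(abs(x - y) for x, y in zip(a, b)) < 3 * eps) for b in tb]
            for a in ta]

def greedy_chain(m):
    """Repeatedly pick the 1-entry with the smallest i + j that lies strictly
    below and to the right of the previous pick (conditions 3 and 4)."""
    chain, pi, pj = [], -1, -1
    while True:
        cand = [(i + j, i, j)
                for i, row in enumerate(m)
                for j, v in enumerate(row)
                if v and i > pi and j > pj]
        if not cand:
            return chain
        _, pi, pj = min(cand)             # ties broken by smallest i (assumption)
        chain.append((pi, pj))

# Toy input: an ideal helix compared with itself (hypothetical coordinates).
helix = [(math.cos(t), math.sin(t), 0.5 * t) for t in range(8)]
m = similarity_matrix(angle_triples(helix), angle_triples(helix))
print(greedy_chain(m))
```

Selecting the entry with the smallest i + j at each step is what keeps the largest possible block of the matrix available for later picks, which is the stated motivation for conditions (3) and (4).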
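The greedy matching on the bipartite graph and the ordering of candidate alignments can be sketched in Python as follows. The sketch assumes that protein B has already been superposed on protein A for one particular rotation, that coordinates are given as (x, y, z) tuples of alpha carbons, and that `delta` is the distance threshold in Å; the function names and the `(size, rmsd)` tuple representation are hypothetical and chosen only for illustration.

```python
import math

def greedy_matching(xa, xb, delta):
    """Scan the residues of A in order and pair each one with the first residue
    of B, among those not yet passed, that lies within delta.  Matched indices
    therefore increase strictly in both proteins, as in steps (1)-(4)."""
    pairs, next_b = [], 0
    for i, a in enumerate(xa):
        for j in range(next_b, len(xb)):
            if math.dist(a, xb[j]) <= delta:
                pairs.append((i, j))
                next_b = j + 1            # delete all vertices of B up to b_j
                break
    return pairs

def rmsd(xa, xb, pairs):
    """RMSD of the matched alpha carbons, without any further refitting."""
    return math.sqrt(sum(math.dist(xa[i], xb[j]) ** 2 for i, j in pairs) / len(pairs))

def precedes(cand, other):
    """True if cand < other in the order on (size, rmsd) pairs:
    either cand is clearly smaller, or the sizes are within 10% of each
    other and cand has the smaller RMSD-to-size ratio."""
    (n, r), (n1, r1) = cand, other
    if n < 0.9 * n1:
        return True
    return 0.9 * n1 < n < 1.1 * n1 and r / n < r1 / n1

# Toy coordinates (hypothetical): the middle residue of A has no partner within 1 Å.
xa = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
xb = [(0.5, 0.0, 0.0), (20.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
pairs = greedy_matching(xa, xb, 1.0)
print(pairs, round(rmsd(xa, xb, pairs), 3))
```

Because every matched pair lies within `delta`, the RMSD of the matching can never exceed `delta`, which is how the distance threshold acts as an upper limit on the RMSD.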
In words, this ordering first considers the size of the alignment. Then, for every two alignments with almost the same size, we choose the one for which the ratio of its RMSD to its size is bigger. Suppose $(M_{k^*}(s, s'), R_{k^*}(s, s'))$ is the largest element of the set $\{(M_k(r, r'), R_k(r, r')) \mid (r, r') \in C,\ k\}$. Consequently, this superposition of the proteins is the optimal result of the above-mentioned algorithm. The best values that we have found for the parameters of the algorithm are as follows: angle tolerance $\varepsilon = 0.05$, distance threshold $2 \le \delta \le 7$ Å, rotation interval $\alpha \in \{5^\circ, 10^\circ\}$ and $p = 20\%$.

While the algorithm runs, many matchings are generated. The method can report all matchings, together with their RMSD, whose size is at least $\lambda\,|M_{k^*}(s, s')|$ for a given $0 < \lambda < 1$.

3. Results and discussion

3.1. Implementation and parameter optimization

We implemented the STON algorithm in C++ on a Linux platform. The running time on a 3 GHz P4 with 1 GB RAM was measured by running the program on the alignment of the structures of bacterial colicin A (PDB code 1COL:A) and phycocyanin (PDB code 1CPC:L), with polypeptide chain lengths of 204 and 172, respectively. They have been considered as typical cases with respect to chain length. The number of equivalent residues and their RMSD depend on the degree interval of rotation ($\alpha$) and on the distance threshold ($\delta$) between equivalent alpha carbons used to find the maximum number of elements that can be superposed. Table 1 shows the results of the structural alignment between 1COL:A and 1CPC:L for different distance thresholds and degree intervals of rotation. As shown in Table 1, the running time increases when the distance threshold increases or the degree interval of rotation decreases. The running time ranges from 2 min for $(\alpha, \delta)$ = (10°, 2 Å) to more than 5 h (333 min) for $(\alpha, \delta)$ = (2°, 7 Å). As expected, the number of equivalent residues increases when a smaller degree interval of rotation and a larger distance threshold are selected, but the RMSD and the running time also change significantly.

Fig. 2 shows the superimposed structures of 1CPC:L and 1COL:A when the distance thresholds are 2 and 7 Å and the degree intervals of rotation are 10° and 2°, respectively. Most methods try to balance a lower RMSD against a larger number of equivalent residues. One parameter for comparing the results of different methods, or of different parameter settings within a method, is the ratio of the number of equivalent residues (Ne) to the RMSD. As shown in Table 1, a distance threshold of 3 Å balances sensitivity against computational cost and still permits interactive access via a web site. The running time, even for a large degree interval and a low distance threshold, is too long to allow this program to search a protein against a database such as the PDB. By adjusting the parameters, one can search for regions with high or low similarity in two proteins; in other words, one can test whether two proteins are near or remote homologs.

3.2. Comparison with other methods to detect the remote similarity

We first applied STON to a number of structure pairs that are known as difficult cases [20] to test its performance in detecting weak structural similarity and also to compare STON with other
Fig. 2. The superimposed structures of 1CPC:L (black) and 1COL:A (gray) when the distance threshold and degree interval of rotation are: (A) 2 Å and 10° and (B) 7 Å and 2°.
Table 1
Results of the structural alignment of colicin A (1COL:A) to phycocyanin (1CPC:L) using different parameters.

Time (min)  Distance threshold (Å)  Rotation (degree interval)  Number of equivalent residues (Ne)  RMSD (Å)  Ne/RMSD
2           2                       10                          34                                  1.24      27.42
3           2                       5                           35                                  1.27      27.63
3           2                       3                           36                                  1.26      28.58
3           2                       2                           36                                  1.25      28.83
6           3                       10                          66                                  1.86      35.45
7           3                       5                           68                                  1.90      35.84
10          3                       3                           72                                  2.01      35.77
12          4                       10                          89                                  2.52      35.32
13          3                       2                           65                                  1.72      37.72
18          4                       5                           95                                  2.66      35.68
25          5                       10                          100                                 3.04      32.92
28          4                       3                           88                                  2.49      35.29
42          4                       2                           89                                  2.52      35.32
43          5                       5                           100                                 3.04      32.92
46          6                       10                          106                                 3.50      30.27
66          5                       3                           102                                 3.07      33.23
77          7                       10                          114                                 4.53      25.15
83          6                       5                           102                                 3.32      30.70
97          5                       2                           102                                 3.07      33.23
130         6                       3                           102                                 3.32      30.70
141         7                       5                           113                                 4.47      25.30
191         6                       2                           102                                 3.29      31.01
227         7                       3                           112                                 4.41      25.41
333         7                       2                           113                                 4.41      25.64
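The Ne/RMSD criterion used to compare parameter settings is simple to recompute. The short Python sketch below does so for a handful of rows copied from Table 1; the tuple layout and the helper name `ne_per_rmsd` are illustrative choices. Ratios recomputed from the rounded RMSD values can differ slightly in the second decimal from the published column.

```python
# (distance threshold in Å, rotation interval in degrees, Ne, RMSD in Å), from Table 1
rows = [
    (2, 10, 34, 1.24),
    (3, 10, 66, 1.86),
    (3, 2, 65, 1.72),
    (5, 5, 100, 3.04),
    (7, 2, 113, 4.41),
]

def ne_per_rmsd(ne, rmsd):
    """Number of equivalent residues per Å of RMSD; higher is better."""
    return ne / rmsd

for dist, step, ne, rmsd in rows:
    print(f"{dist} Å, {step}°: Ne/RMSD = {ne_per_rmsd(ne, rmsd):.2f}")

# Parameter setting with the highest ratio among the selected rows
best = max(rows, key=lambda r: ne_per_rmsd(r[2], r[3]))
print("highest ratio at (distance, rotation) =", best[:2])
```

Among these rows the 3 Å settings give the highest ratios, in line with the observation that a 3 Å threshold balances sensitivity against cost.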
90/2.92 77/2.31 90/2.84 180/3.12 55/3.08 85/2.43 108/3.03 264/3.23 79/2.81 70/3.0 88/2.79 118/3.16 92/3.12 239/3.01 82/2.62 76/2.68 131/3.22 80/2.62 183/3 78/2.53 69/1.85 69/1.52 77/1.6 154/2.01 41/1.62 80/1.53 87/1.73 194/1.94 65/1.72 58/1.81 78/1.67 82/1.9 71/1.99 173/1.88 74/1.55 69/1.7 84/1.8 70/1.69 156/1.82 70/1.83 68/3.72/0.01 68/3.67/0.51 82/3.16/0.65 97/3.98/0.09 49/3.48/0.43 76/3.69/0.58 71/3.98/0.4 200/3.9/0.34 38/3.83/0.28 58/3.72/0.41 41/3.51/0.21 38/3.49/0.18 65/3.98/0.3 231/3.06/0.6 76/3.81/0.42 45/3.78/0.09 40/3.44/0.33 67/3.87/0.37 113/3.22/0.39 69/3.22/0.63 STON with 5 Å distance threshold and 5◦ interval of rotation. Number of aligned positions where the alignment is identical with STON 3–5 divided by the number of aligned positions in the STON 3–5. b
c
STON with 3 Å distance threshold and 5◦ interval of rotation.
86/2.99/0.13 56/3.41/0.55 81/2.01/0.87 111/2.86/0.44 275/3.8/0.66 79/2.43/0.86 76/3.34/0.83 83/2.22/0.76 121/3.24/0.45 81/5.36/0.06 232/2.48/0.76 78/1.82/0.89 80/2.43/0.84 127/4.99/0.44 81/3.22/0.59 177/2.56/0.76 75/2.08/0.83 a
82/2.9/0.52 81/2.3/0.87 79/2.6/0.81 185/3.4/0.42 58/2.7/0.75 80/1.7/0.86 91/3/0.45 237/3.5/0.51 78/2.4/0.86 71/2.9/0.84 125/2.1/0.56 111/2.7/0.6 97/3.5/0.55 224/2.8/0.55 80/2/0.85 75/2/0.84 112/3/0.67 77/2.4/0.81 179/2.6/0.78 74/2.2/0.79 65/4.24/0 81/2.37/0.87
61/1.62/0.77
88/4.01/0.37
75/1.97/0.85 71/2.13/0.78
103/3.93/0.65
74/2.45/0.84 63/2.87/0.77 87/2.53/0.71
52/2.63/0.78 78/1.67/0.91 93/3.29/0.75
100/3.19/0.62 83/2.44/0.88 100/3.11/0.89 272/ 3.57/ 0.57 63/3.01/0.68 87/1.9/0.91 117/3.05/0.43 286/3.07/0.83 87/3.01/0.86 83/3.11/0.5 101/3.09/0.17 134/3.07/0.7 102/3.27/0.35 258/3.01/0.77 85/1.9/0.95 87/2.92/0.84 141/3.25/0.41 94/3.61/0.04 199/3.24/0.83 85/2.42/0.8 83/3.1/0.73 74/1.99/0.82 72/2.46/0.79
94/3.3/0.63 81/2.3/0.88 97/3.2/0.90 211/3.5/0.75 60/2.6/0.90 86/1.9/0.93 114/3.1/0.43 291/3.3/0.85 81/2.5/0.87 75/3/0.86 98/2.6/0.79 131/3.1/0.68 103/3.1/0.58 250/2.6/0.79 85/1.9/0.93 83/2.9/0.84 132/3/0.65 78/3.1/0.01 188/2.5/0.8 80/2.1/0.86
STON 3–5a res/RMSD MAMMOTH res/RMSD/comc TopMatch res/RMSD/comc Rapido res/RMSD/comc FATCAT res/RMSD/comc Fast res/RMSD/comc DaliLite res/RMSD/comc
2GMF:A(121) 1MOL:A(94) 2RHE (114) 1EDE(310) 1UBQ (76) 3HHR:B(197) 4FGF (124) 1NSB:A(390) 1PAZ:(120) 2RHE (114) 2TRX:A(107) 4PTP (223) 4FXN (138) 1AOZ:A(552) 1MOL:A(94) 2TRX:A(107) 1AYH (214) 1GMF:A(119) 1TCA (317) 1YCC (108)
107/3.9/0.23 81/2.3/0.71 97/2.9/0.87 219/3.8/0.87 64/3.8/0.78 87/1.9/0.91 116/2.9/0.83 275/3/0.82 84/2.9/0.50 84/3.4/0.18 64/5.2/0.14 130/3.1/0.61 108/3.6/0.03 249/2.5/0.77 78/1.7/0.91 56/4.6/0.49 147/3.7/34 96/3.5/0.04 187/2.4/0.86 76/20/0.83
Table 2 Comparison of structure alignments for 20 difficult structure pairs obtained by CE, DALI, FAST, FATCAT, Rapido, TopMatch, MAMMOTH and STON.
Structure pair
1BGE:B(159) 1CEW (108) 1CID (177) 1CRL(534) 1FXI:A(96) 1TEN(90) 1TIE (170) 2SIM(318) 2AZA:A(129) 3HLA:B(99) 1GP1:A(184) 2SNV (151) 3CHY (128) 1AFN:A(330) 1STF:I(125) 1DSB:A(188) 1SAC:A(204) 1RCB (129) 1TAH:A(318) 2MTA:C(147)
CE res/RMSD/comc

methods. The results were compared with those from seven programs available as web services: DALI [11], CE [10], FAST [21], FATCAT [22], Rapido [23], TopMatch [24] and MAMMOTH [25]. The numbers of structurally equivalent residues and the RMSD reported by these methods for each pair differ considerably. Therefore, to compare the performance of STON with these methods, equivalent residues were considered taking the RMSD into account. In addition, to show how many of the equivalent residues reported by a given method coincide with those reported by STON, the number of commonly aligned residues of the two methods divided by the number of aligned positions in STON was calculated. Table 2 shows the results of the structural alignment of 20 protein pairs by DALI, CE, FAST, FATCAT, Rapido, TopMatch, MAMMOTH and STON.

The sensitivity and the upper limit of the RMSD are determined by the parameters applied in STON, as discussed in Methods. In STON 3–5, the distance threshold between equivalent residues and the degree interval of rotation are 3 Å and 5°, respectively. As discussed before, for a lower distance threshold, sensitivity decreases but specificity increases; hence, fragments that are more similar are aligned. The distance threshold also limits the RMSD from above. A smaller degree interval of rotation increases the sensitivity of the algorithm, but computational cost is the limiting factor. As seen in Table 2, the different methods vary in RMSD and in the number of aligned residues. In general, the number of aligned residues and the RMSD obtained by STON 5–5 are comparable with those of the other methods and in some cases better. In STON 3–5, although fewer residues are aligned than with the other methods, the ratio of aligned residues to RMSD is higher than for the other methods in all cases. Comparison of the STON 3–5 structural alignments with four other methods shows that on average about 70% of the positions aligned by STON 3–5 are the same as those obtained from the other methods.

A remarkable case is the pair 1TEN and 3HHR:B, for which 91% of the aligned positions are the same. An exception is the pair 3HLA:B and 2RHE, for which only 18% of the positions aligned by STON and DALI are common. Despite the difference in common aligned positions, the overall shapes of the superpositions are nearly the same. Fig. 3 shows the superposition of 3HLA:B and 2RHE obtained by STON, CE and DALI. As shown in Fig. 3, the overall superposition is the same, but, for example, a one-residue shift in the alignment causes the number of common aligned positions between two methods to decrease significantly.

To discuss the performance of our algorithm, we applied STON to four different categories of protein pairs. These categories, introduced by Maiti et al. [26], are as follows:
(i) Proteins with the same sequences and similar structures.
(ii) Proteins with the same sequences but different structures (e.g. open and closed forms of calmodulin).
(iii) Proteins with similar structures and different lengths.
(iv) Proteins with similar structures and very different sequences.

Table 3 shows the results of the alignment of these different cases by STON and the other methods. As indicated by the results in Table 3, STON is able to align the structures of pairs of proteins with very different lengths and sequences as well as the other methods. Compared to the other methods, STON performs better in aligning pairs of proteins with very different structures. The aligned positions suggested by STON are, in most cases, the same as the ones suggested by the other methods. However, the RMSD obtained by STON depends on the degree-interval parameter, and in some cases STON superimposes the structures with a lower RMSD.

3.3. Structural comparison of functional residues

One of the goals in protein structure comparison is the analysis of functional sites and the alignment of the residues involved in
STON 5–5b res/RMSD
Fig. 3. The superposition of 3HLA:B (gray) and 2RHE (black) obtained by STON, CE and DALI.
Table 3
Comparison of structure alignments for structures with varying difficulty obtained by CE, DALI, FAST, FATCAT, Rapido, TopMatch, MAMMOTH and STON. Entries are res/RMSD/com c; for STON 3–5 a and STON 5–5 b they are res/RMSD.

Structure(s), % sequence identity (ID)         CE             DaliLite       Fast              FATCAT            Rapido          TopMatch      MAMMOTH        STON 3–5 a  STON 5–5 b

Same sequence and similar structure
Thioredoxin (2TRX_A on 2TRX_B), 100% ID        108/0.7/0.99   108/0.7/0.99   107/0.69/0.96     108/0.66/0.99     108/0.66/0.99   107/0.6/0.99  108/0.64/0.98  107/0.69    108/0.89
Hemoglobin (4HHB_A on 1DKE_A), 100% ID         140/0.4/1      141/0.4/1      141/0.37/1        141/0.37/1        141/0.37/1      141/0.4/1     140/0.37/0.99  136/0.48    140/0.68
P21 Oncogene (6Q21_A on 6Q21_B), 100% ID       168/1/0.99     171/1.2/0.99   168/1.01/0.99     171/1.22/0.99     164/0.85/0.99   165/0.9/0.99  170/1.17/0.99  163/0.67    168/1.14

∼Same sequence and different structure
Calmodulin (1A29 on 1CLL), 98.6% ID            77/1.7/0.99    87/7.6/0.99    132/14.69/0.94    137/13.15/0.96    136/14.73/0.96  70/0.6/0.96   76/1.73/0.96   68/0.6      73/1.33
Maltose Bind Prot. (1OMP on 1ANF), 100% ID     360/3.4/0.97   367/3.7/0.97   363/3.38/0.97 d   369/3.76/0.97 d   – d             312/4.1/0.3   361/3.64/0.97  220/1.3     319/3.1

Similar structure and different length
Hemoglobin (4HHB_A on 4HHB_B), 43% ID          139/1.5/0.96   138/1.4/0.96   135/1.39/0.96     139/1.49/0.96     138/1.44/0.96   137/1.8/0.87  138/1.89/0.93  135/1.35    133/1.8
Thioredoxin (3TRX on 2TRX_A), 29% ID           103/1.7/0.93   103/1.7/0.92   98/1.73/0.85      103/1.68/0.93     103/2.16/0.84   99/1.6/0.88   103/2.21/0.77  97/1.4      101/2.42
Lysozyme/lactalbumin (1DPX on 1A4V), 36% ID    121/1.3/0.96   123/1.5/0.95   122/1.64/0.93     123/1.48/0.96     120/1.38/0.96   115/1.2/0.97  122/1.88/0.85  115/1.26    119/2.09
Calmodulin/TnC (1CLL on 5TNC), 47% ID          118/5.4/0.9    91/3.4/0.93    142/6.77/0.93     143/6.22/0.19     144/6.87/0.93   90/2.8/0.93   82/2.4/0.91    67/1.07     89/2.45

Similar structure and very different sequence
Ubiquitin/elongin (1UBI on 1VCB_A), 26% ID     72/1.2/0.99    74/1.5/0.94    73/1.5/0.94       75/1.67/0.94      64/1.26/0.87    72/1.4/0.94   75/3/0.72      69/1.14     71/1.99
Thio/glutaredoxin (3TRX on 3GRX_A), 7% ID      74/2.3/0.85    73/2.2/0.82    67/3.08/0.71      74/2.33/0.82      70/2.52/0.79    67/2.1/0.81   62/3.88/0.42   62/1.63     72/2.79
Hemoglobins (1ASH on 2LHB), 17% ID             135/2.3/0.94   133/2.1/0.94   129/2.05/0.93     135/2.27/0.94     130/2.92/0.66   129/2.4/0.83  127/3.17/0.58  120/1.68    128/2.49
Thioredoxins (1NHO_A on 1DE2_A), 22% ID        64/4.1/0.39    52/2.9/0.22    56/4.33/0.48      71/3.69/0.13      52/5.92/0.15    57/3.3/0.57   43/3.81/0.22   46/2.1      63/3.51

a STON with 3 Å distance threshold and 5° interval of rotation.
b STON with 5 Å distance threshold and 5° interval of rotation.
c Number of aligned positions where the alignment is identical with STON 3–5 divided by the number of aligned positions in the STON 3–5.
d Only two values are listed for this pair across the Fast, FATCAT and Rapido columns; they are shown in their listed order, and the empty cell may belong to any of the three columns.
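The "com" value defined in footnote c can be computed directly from two alignments. The Python sketch below assumes each alignment is given as a list of (residue in first protein, residue in second protein) pairs; the function name `common_fraction` and the toy alignments are hypothetical.

```python
def common_fraction(ston_pairs, other_pairs):
    """Fraction of the positions aligned by STON that are aligned identically
    by the other method (the 'com' value of the comparison tables)."""
    ston = set(ston_pairs)
    if not ston:
        return 0.0
    return len(ston & set(other_pairs)) / len(ston)

# Toy alignments: the second method is shifted by one residue after the first pair.
ston = [(1, 1), (2, 2), (3, 3), (4, 4)]
other = [(1, 1), (2, 3), (3, 4), (4, 5)]
print(common_fraction(ston, other))
```

The toy case shows why a one-residue shift can drive the value down sharply even when the two superpositions look almost identical: only the first pair is counted as common.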
the sites. The structure of the active site in a protein may provide a good model for those in related proteins even if the overall sequence homology is low. A global alignment of protein structures that aims at more equivalent residues and a lower RMSD may lead to a spatial misalignment of the functional residues in the two proteins. The ability of the STON method to place the functional site residues in close proximity in the global structure alignment, and the effect of the distance threshold parameter, were tested by pairwise structural alignment of three protein members in each of five different families. In each family, the protein members have similar catalytic site residues and shape but low sequence similarity. Catalytic site residues were obtained from EzCatDB [27]. Table 4 shows the EC numbers, PDB codes, sequence lengths and catalytic site residues of the 15 proteins in the five protein families; the pairwise sequence identities are given in Table 5.
Table 4
Catalytic site residues of proteins in five protein families.

Protein family                      E.C. number  PDB code  Length  Catalytic residues
Fructose bisphosphate aldolase      4.1.2.13     1ald:A    364     D33-K146-E187-E189-K229
                                                 1a5c:A    369     D39-K151-E194-E196-K236
                                                 1qo5:A    364     D33-K146-E187-E189-K229
Arginyl tRNA synthetase             6.1.1.19     1bs2      607     N153-K156-H159-H162-E294-Q375
                                                 1iq0      592     N113-K116-H119-H122-E240-Q357
                                                 1lle      577     N123-K126-H129-H132-E258-Q341
Inorganic pyrophosphatase           3.6.1.1      1e6a:A    287     E48-K56-R78-Y93-Y192-K193
                                                 1qez:A    173     E1018-K1026-R1040-Y1052-Y1138-K1139
                                                 2prd      175     E21-K29-R43-Y55-Y139-K140
Pyrrolidone-carboxylate peptidase   3.4.19.3     1a2z:A    220     E80-C143-H167
                                                 1iof:A    208     E79-C142-H166
                                                 1aug:A    215     E81-C144-H168
Pyruvate kinase                     2.7.1.40     1e0t:A    470     R32-K220-T278-S312-E314
                                                 1a49:A    531     R72-K269-T327-S361-E363
                                                 1liu:A    574     R116-K313-T371-S405-E407
Table 5
RMSD for global structure alignment and catalytic site residues. "Global" is the global structure alignment (# res/RMSD); "Site" is the RMSD of the catalytic site residues.

                                       STON 3–5 a          STON 5–5 b          DALI                CE
Protein pairs  Pairwise seq. identity  Global     Site     Global     Site     Global     Site     Global     Site

Aldolase
1ald:1a5c      52.7%                   327/0.76   0.26     332/1.38   0.80     337/1.0    0.34     336/0.9    0.33
1ald:1qo5      69.5%                   343/0.87   0.41     348/1.03   0.67     353/1.1    0.39     343/0.8    0.37
1a5c:1qo5      50.1%                   328/0.86   0.46     321/1.38   0.62     340/1.1    0.36     336/1.3    0.36

Arginyl tRNA synthetase
1bs2:1lle      30.5%                   468/0.68   1.54     561/1.55   1.84     568/0.8    1.53     568/0.8    1.53
1bs2:1iq0      22.3%                   342/1.67   2.8      411/2.84   3.24     381/2.8    3.49     543/3.4    3.4
1lle:1iq0      28.7%                   339/1.7    2.18     401/2.79   2.61     529/3.7    3.24     537/3.6    3.23

Inorganic pyrophosphatase
1e6a:1qez      20.3%                   135/1.5    0.93     150/2.39   1.32     166/1.8    1.07     167/1.9    1.03
1e6a:2prd      16.7%                   156/1.41   0.82     166/2.39   0.82     171/1.7    1.04     172/1.8    1.03
1qez:2prd      46.9%                   160/0.79   0.62     149/1.17   0.85     167/0.8    0.47     166/0.7    0.47

Carboxylate peptidase
1a2z:1iof      53.2%                   202/0.74   0.42     204/0.98   0.47     208/1.4    0.34     204/0.7    0.29
1a2z:1aug      34.5%                   200/1.1    0.37     208/1.56   0.72     209/1.1    0.49     209/1.2    0.5
1iof:1aug      40.4%                   192/1.05   0.52     201/1.69   0.57     202/1.2    0.56     201/1.2    0.55

Pyruvate kinase
1e0t:1a49      43.3%                   306/1.85   2.17     340/2.46   2.35     442/2.3    2.53     425/3.5    2.86
1e0t:1liu      39.3%                   325/1.49   1.88     366/2.42   2.26     431/3.5    2.9      464/3.2    2.94
1a49:1liu      63.1%                   442/0.9    0.45     471/1.54   0.47     517/2.3    1.45     471/1.5    0.73

a STON with 3 Å distance threshold and 5° interval of rotation.
b STON with 5 Å distance threshold and 5° interval of rotation.
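The catalytic-site RMSD is evaluated on coordinates that have already been superposed by the global alignment, so no refitting on the site itself is involved. The Python sketch below illustrates the calculation under simplifying assumptions: one representative point per residue instead of all backbone atoms, coordinates stored in dictionaries keyed by residue number, and invented coordinate values. The residue pairing follows the catalytic triads listed for 1a2z:A and 1iof:A in Table 4.

```python
import math

def site_rmsd(coords_a, coords_b, site_pairs):
    """RMSD over selected residue pairs, using globally superposed coordinates.

    coords_a / coords_b map residue number -> (x, y, z); site_pairs lists the
    corresponding catalytic residues of the two proteins.
    """
    sq = [math.dist(coords_a[i], coords_b[j]) ** 2 for i, j in site_pairs]
    return math.sqrt(sum(sq) / len(sq))

# Catalytic triads E80-C143-H167 (1a2z:A) and E79-C142-H166 (1iof:A) from Table 4
site_pairs = [(80, 79), (143, 142), (167, 166)]

# Hypothetical coordinates after global superposition (values are made up)
coords_a = {80: (0.0, 0.0, 0.0), 143: (1.0, 0.0, 0.0), 167: (0.0, 1.0, 0.0)}
coords_b = {79: (0.0, 0.0, 0.3), 142: (1.0, 0.4, 0.0), 166: (0.0, 1.0, 0.0)}

print(round(site_rmsd(coords_a, coords_b, site_pairs), 3))
```

Keeping the global transformation fixed is what makes this number a test of the global alignment: a superposition that maximizes the residue count can still leave the catalytic residues displaced.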
Results obtained from the STON method with two distance thresholds, together with the structural alignments by DALI and CE, are shown in Table 5. The distance threshold parameter in STON means that the α-carbon atoms of aligned residues are closer than this threshold. With a short distance threshold, only the more similar segments are aligned; when the distance threshold increases, less similar segments are aligned as well, so the number of equivalent residues and the RMSD both increase. As shown in Table 5, the numbers of equivalent residues and the RMSD in the global alignment by STON 5–5 are larger than those of STON 3–5 and are comparable with DALI and CE, whereas the RMSD of STON 3–5 is lower than that of STON 5–5. Comparison of the STON 3–5 alignments with DALI and CE shows that in most cases the ratio of equivalent residues to RMSD for STON 3–5 is higher than for both methods. This demonstrates the capability of STON 3–5 to find significant local similarity. After structural alignment of the protein pairs, the RMSD of the backbone atoms of the catalytic site residues of each pair was calculated. The results in Table 5 show that in all cases decreasing the distance threshold places the functional site residues in close proximity with low RMSD. In most cases the catalytic site residues in STON 3–5 are in equal or closer proximity than in DALI and CE. Decreasing
the distance threshold to 1 Å gives better local alignment and puts functional residues in closer proximity.

4. Conclusion

In this work we have presented a greedy algorithm for finding the corresponding aligned residues in two proteins with a prespecified RMSD value. One advantage of our algorithm is that the alignment can be constrained by a maximal distance between matched residues, which enables us to search for strongly or weakly similar fragments. We have run our algorithm on different categories of protein pairs. The results indicate that STON performs as well as other methods in aligning the structures of pairs of proteins with different lengths and sequences. The ability of STON to place functional-site residues in correspondence was tested by pairwise structural alignment in five different families, showing the effect of the distance parameter on the alignment of these residues. A web-based implementation of this algorithm is available at http://bioinf.cs.ipm.ac.ir/softwares/ston.

Conflict of interest statement

None declared.

Acknowledgments

Changiz Eslahchi would like to thank the Center of Excellence in Biomathematics of the College of Science of the University of Tehran for its grant towards this work. This research was supported in part by a grant from IPM (no. CS1385-102).
Changiz Eslahchi graduated from the Faculty of Mathematical Sciences of Teacher Training University of Tehran with a B.Sc. degree in 1987. He received his M.Sc. degree in Mathematics in 1989 from University of Shiraz and his Ph.D. degree in 1998 from Sharif University of Technology. He is presently an Associate Professor at Shahid Beheshti University (Iran). His research interests are focused on graph theory and combinatorics, and on computational biology, especially fuzzy clustering, sequence pattern recognition and structure classification.

Hamid Pezeshk received his B.Sc. in 1987 and M.Sc. in 1989, both in Statistics, from University of Shiraz. He received his D.Phil. degree in 2000 from University of Oxford. He is presently an Associate Professor at University of Tehran (Iran). His research interests include stochastic processes, Bayesian sample size determination, probabilistic models in gene finding and structure classification.

Mehdi Sadeghi received his B.Sc. in 1991 in Cell and Molecular Biology, and his M.Sc. in 1993 and Ph.D. in 2001 in Biophysics, from University of Tehran. He is presently an Assistant Professor at the National Institute of Genetic Engineering and Biotechnology (Iran). His research interests are bioinformatics, especially protein structure prediction, classification and sequence pattern recognition.

Amir Massoud Rahimi received his Ph.D. in 1993 from the Department of Mathematics at the University of Texas at Arlington. He is presently a Researcher at the Institute for Studies in Theoretical Physics and Mathematics (Iran). His research interests are algebra and its applications and commutative ring theory.

Heydar Maboudi Afkham graduated from the Faculty of Mathematical Sciences of Shahid Beheshti University with a B.Sc. degree in 2006. He is pursuing his M.Sc. degree at the Computer Vision and Active Perception Laboratory at KTH University in Stockholm, Sweden.

Shahriar Arab graduated from University of Tehran with a B.Sc. degree in Microbiology in 1997. He received his M.Sc. degree in Biophysics in 2000 from Tarbiat Modares University. He is pursuing his Ph.D. degree in Bioinformatics at University of Tehran. He is currently working on protein structure prediction.