Finding compact structural motifs


Theoretical Computer Science 410 (2009) 2834–2839


Theoretical Computer Science journal homepage: www.elsevier.com/locate/tcs

Dongbo Bu a,c, Ming Li a,*, Shuai Cheng Li a, Jianbo Qian a, Jinbo Xu b

a David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
b Toyota Technological Institute at Chicago, 1427 East 60th Street, Chicago, IL 60637, United States
c Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Keywords: Compact; Structural motif; NP-hardness; Approximation algorithm

Abstract

Protein structural motif detection has important applications in structural genomics. Compared with sequence motifs, structural motifs are more sensitive in revealing the evolutionary relationships among proteins. A variety of algorithms have been proposed to attack this problem. However, they are either heuristics without a theoretical performance guarantee, or inefficient because they rely on exhaustive search. This paper studies a reasonably restricted version of this problem: the compact structural motif problem. We prove that this restricted version is still NP-hard, and we present a polynomial-time approximation scheme to solve it. This is the first approximation algorithm with a guaranteed ratio for the protein structural motif problem.¹ © 2009 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +1 519 888 4659; fax: +1 519 885 1208.
¹ A preliminary version of this paper appeared in CPM 2007.
doi:10.1016/j.tcs.2009.03.023

1. Introduction

It is widely accepted that during the evolution of proteins, structures are more conserved than sequences. In addition, structural motifs are tightly related to protein functions [2]. Thus, identifying the common substructures of a set of proteins, or more precisely of a family of proteins, can help us learn their evolutionary history and functions. With the rapid accumulation of protein structures in the Protein Data Bank (PDB), there is a demand for fast and accurate structural comparison and motif finding methods.

The multiple structural motif finding problem is the structural analogue of the sequence motif finding problem, which has been thoroughly studied; see for example [12]. For the structural problem, the input consists of a set of protein structures in three-dimensional (3D) space, $\mathbb{R}^3$. The objective is to find a set of substructures, one from each protein, that exhibit the highest degree of similarity.

Roughly speaking, there are two main ways to measure structural similarity: coordinate root mean squared deviation (cRMSD) and distance root mean squared deviation (dRMSD). cRMSD uses the Euclidean distance between corresponding residues of different protein structures; to compute it, the optimal rigid transformation superimposing the structures must be found first. In contrast, dRMSD first calculates the internal distances within each protein and then compares these internal distance matrices.

Various methods have been proposed to solve the structural motif finding problem under different similarity measures. Under the unit-vector RMSD (URMSD) measure, Chew and Kedem [5] proposed an iterative algorithm to compute the consensus shape and proved the convergence of the algorithm. Applying graph-based data mining tools, Bandyopadhyay et al. [4] described a method to assign a protein structure to functional families using family-specific fingerprints. Under the bottleneck metric similarity measure, Shatsky et al. [15] presented an algorithm for recognition of binding patterns common to a set of protein structures. This problem is also studied in [6,7,11,13,14,18].

A closely related problem is the structural alignment problem, for which many successful approaches have been developed. Among them, DALI [8] and CE [16] attempt to identify the alignment with minimal dRMSD, while STRUCTAL [17] and TM-align [20] employ heuristics to detect the alignment with minimal cRMSD.

However, the methods mentioned above are all heuristic, and their solutions are not guaranteed to be optimal or near optimal. Recently, Kolodny and Linial [10] proposed the first polynomial-time approximation algorithm for pairwise protein structure alignment, based on the Lipschitz property of the scoring function. Though this method can be extended to the case of multiple protein structural alignment, the simple extension has a time complexity exponential in the number of proteins.

We present an approximation algorithm employing the random sampling technique of [12] under the coordinate mean squared distance (cMSD) measure, which is the square of the cRMSD. Adopting a reasonable assumption, we prove that our algorithm is efficient and produces a good approximate solution. In contrast to the method in [10], our algorithm has a time complexity polynomial in the number of input protein structures. Furthermore, the sampling size is an adjustable parameter, which adds flexibility to our algorithm.

The rest of this manuscript is organized as follows. In Section 2, we introduce the notation and some background knowledge. In Section 3, we prove the NP-hardness of the compact structural motif problem. The algorithm, along with its performance analysis, is given in Section 4.

2. Preliminaries

A protein consists of a sequence of amino acids (also called residues), each of which contains a number of atoms including one Cα atom.
In a protein structure, each atom is associated with a 3D coordinate. Here, we only take into consideration the Cα atom of a residue; thus a protein structure can be simplified as a sequence of 3D points. A common globular protein is generally compact, with the distance between two consecutive Cα atoms restrained by the bond lengths and bond angles, and the volume of the bounding sphere linear in the number of residues.

A structural motif of a protein is a subset of its residues arranged in the order of appearance, and its length is the number of residues in this subset. In this paper we study only the $(R,C)$-compact motif: a motif bounded by a minimal ball $B$ with radius at most $R$, such that at most $C$ residues in this ball do not belong to the motif. This ball $B$ is referred to as the containing ball.

To measure the similarity of two protein structures, a transformation consisting of a rotation and a translation should be applied first. Such a transformation is known as a rigid transformation and can be expressed as a 6D vector $\tau = (t_x, t_y, t_z, r_1, r_2, r_3)$, with $r_1, r_2, r_3 \in [0, 2\pi]$ and $t_x, t_y, t_z \in \mathbb{R}$. Here, $(t_x, t_y, t_z)$ denotes a translation and $(r_1, r_2, r_3)$ denotes a rotation. Applying the transformation $\tau$ to a 3D point $u$, we get a new 3D coordinate $\tau(u)$.

We adopt cMSD to measure the similarity between two structural motifs. Formally, for two motifs $u = (u_1, u_2, \dots, u_\ell)$ and $v = (v_1, v_2, \dots, v_\ell)$, the cMSD distance between them is defined as $d(u, v) = \sum_{i=1}^{\ell} \|u_i - v_i\|^2$, where $\|\cdot\|$ is the Euclidean norm.

We will study the $(R,C)$-Compact Consensus Structural Motif ($(R,C)$-CCSM) problem: Given $n$ protein structures $P_1, P_2, \dots, P_n$ and an integer $\ell$, find an $(R,C)$-compact motif $u_i$ of length $\ell$ along with a rigid transformation $\tau_i$ for each $P_i$, and a consensus structure of $\ell$ 3D points $q = (q_1, q_2, \dots, q_\ell)$, where each $q_i$ is a point in 3D, minimizing the distance function $\sum_{i=1}^{n} d(q, \tau_i(u_i))$.
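The cMSD distance and the objective defined above can be illustrated with a minimal Python sketch. It assumes the motifs have already been selected and transformed, so only the distance and the point-wise average are shown; the names `cmsd`, `average_motif` and `objective` are illustrative and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Motif = Sequence[Point]


def cmsd(u: Motif, v: Motif) -> float:
    """cMSD distance d(u, v): sum of squared Euclidean distances between
    corresponding points of two equal-length motifs."""
    if len(u) != len(v):
        raise ValueError("motifs must have the same length")
    return sum((a - b) ** 2 for p, q in zip(u, v) for a, b in zip(p, q))


def average_motif(motifs: Sequence[Motif]) -> List[Point]:
    """Point-wise average of already-transformed motifs of equal length."""
    n = len(motifs)
    length = len(motifs[0])
    return [
        tuple(sum(m[j][k] for m in motifs) / n for k in range(3))
        for j in range(length)
    ]


def objective(q: Motif, motifs: Sequence[Motif]) -> float:
    """Objective for fixed, already-transformed motifs: sum_i d(q, motif_i)."""
    return sum(cmsd(q, m) for m in motifs)


# Two toy motifs of length 2 (assumed to be already superimposed).
motifs = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
]
q = average_motif(motifs)
print(cmsd(motifs[0], motifs[1]))  # distance between the two motifs
print(q, objective(q, motifs))     # consensus and its objective value
```

For fixed motifs and transformations, the point-wise average is the consensus that minimizes the objective, which is why the sketch uses it as `q`.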
Before presenting the approximation algorithm, we first prove that the transformation can be simplified. More specifically, we need to consider only rotations. The following lemma, due to [9], is used in our proofs. Roughly speaking, it states that if we want to superimpose two chains of points optimally, we must make their centroids coincide.

Lemma 1. Given $n$ ordered 3D points $A = (a_1, a_2, \dots, a_n)$ and $n$ ordered 3D points $B = (b_1, b_2, \dots, b_n)$, to minimize $\sum_{i} \|\rho(a_i) + T - b_i\|^2$, where $\rho$ is a rotation matrix and $T$ is a translation vector, $T$ must make the centroids of $A$ and $B$ coincide with each other.

Lemma 2. In the optimal solution of the $(R,C)$-Compact Consensus Structural Motif problem, the centroid of $\tau_i(u_i)$ must coincide with the centroid of $q$, for $1 \le i \le n$.

Lemma 2 is a direct corollary of Lemma 1. The basic idea is that to find the optimal rigid transformation, we can first translate the proteins so that their centroids coincide with each other. Therefore, a transformation can be simplified to a rotation vector $\tau = (r_1, r_2, r_3)$, where $r_1, r_2, r_3 \in [0, 2\pi]$.

Using a discrete transformation set is a common technique [10]. Specifically, for a real number $\epsilon$, we can discretize the range $[0, 2\pi]$ into a series of bins of width $\epsilon$. We refer to the discrete transformation set $\{(i \cdot \epsilon, j \cdot \epsilon, k \cdot \epsilon) \mid 0 \le i, j, k \le 2\pi/\epsilon\}$, where $i, j, k$ are integers, as an $\epsilon$-net of the rotation space $\mathcal{T}$.

3. Hardness result

In this section, we show that the $(R,C)$-Compact Consensus Structural Motif problem is NP-hard. The reduction is from the Local Multiple Alignment (LMA) problem, which has been proven to be NP-hard in [1].


[Figure: each $\ell$-mer of the sequences $S_1, S_2, \dots, S_n$ is first extended (for example, 100 becomes 100 011 000000 111111) and then mapped to 3D points.]

Fig. 1. The reduction from the Local Multiple Alignment problem to the (R, C)-Compact Consensus Structural Motif problem.

Local Multiple Alignment (sum-of-pairs variant) [1]: Given a set of sequences $S = \{s_1, s_2, \dots, s_n\}$ over an alphabet $\Sigma = \{0, 1\}$ and an integer $\ell$, find a substring $t_i$ of length $\ell$ from each $s_i$, minimizing the sum-of-pairs score $\mathrm{SPscore}(t_1, \dots, t_n) = \sum_{1 \le i < j \le n} d_H(t_i, t_j)$, where $d_H$ denotes the Hamming distance.
Given an LMA instance, we transform it into a (1,0)-CCSM instance in two steps, namely extending and mapping (see Fig. 1).

1. Extending Step: For each $\ell$-mer of each sequence $s_i$, say $t_i = s_i^{j} s_i^{j+1} \dots s_i^{j+\ell-1}$, we first append its complement, then attach a tail of $2\ell$ 0's and $2\ell$ 1's. For example, 100 becomes 100011000000111111.
2. Mapping Step: We map the extended string to $6\ell$ ordered 3D points: 0 is mapped to the 3D point $t_i^0 = (0, 2i, 0)$ and 1 is mapped to the 3D point $t_i^1 = (1, 2i, 0)$. Hence, each $\ell$-mer $t_i$ is mapped to $6\ell$ ordered points, and those points are located at only two 3D coordinates. We denote these ordered points by $M(t_i)$. Note that the centroid of $M(t_i)$ is $(1/2, 2i, 0)$.

By the above transformation, a sequence $s_i$ of length $m$ is mapped to a protein structure $P_i$ with $6\ell(m - \ell + 1)$ points. Note that in real protein structures, points do not share identical coordinates. However, it is not difficult to modify our reduction to take this aspect into consideration. We ignore it here for simplicity and clarity. With the above reduction, we can prove our hardness result.

Theorem 1. $(R,C)$-Compact Consensus Structural Motif is NP-hard.

The result can be proved by the following lemmas.

Lemma 3. $\mathrm{cMSD}(M(t_i), M(t_j)) = 2 d_H(t_i, t_j)$, where $d_H$ denotes the Hamming distance.

Proof. Suppose $\tau_i$ and $\tau_j$ are the optimal transformations of $M(t_i)$ and $M(t_j)$ to superimpose them, respectively. The centroids of $M(t_i)$ and $M(t_j)$ should coincide under the optimal superimposition according to Lemma 1. Let the angle between the two line segments $\langle \tau_i(t_i^0), \tau_i(t_i^1) \rangle$ and $\langle \tau_j(t_j^0), \tau_j(t_j^1) \rangle$ be $\theta$ (see Fig. 2). Denote $d = d_H(t_i, t_j)$. By the law of cosines, the cMSD value of $M(t_i)$ and $M(t_j)$ under this superimposition can be expressed as:

$$
\begin{aligned}
d(M(t_i), M(t_j)) &= (6\ell - 2d) \times \left( \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 - 2 \times \tfrac{1}{2} \times \tfrac{1}{2} \times \cos\theta \right) \\
&\quad + 2d \times \left( \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 + 2 \times \tfrac{1}{2} \times \tfrac{1}{2} \times \cos\theta \right) \\
&= 3\ell - (3\ell - 2d)\cos\theta .
\end{aligned}
$$

It is clear that this distance reaches its minimum when $\theta = 0$, and then $\mathrm{cMSD}(M(t_i), M(t_j)) = 2d$. □
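The extending and mapping steps, together with the closed form derived in the proof of Lemma 3, can be checked on a small instance. The following Python sketch is illustrative only: the helper names are hypothetical, the rotation is restricted to the z-axis (an assumption that suffices because all mapped points lie on a line), and centroids are moved to the origin as Lemma 1 prescribes.

```python
import math


def extend(t: str) -> str:
    """Extending step: append the complement, then 2*l zeros and 2*l ones."""
    l = len(t)
    complement = "".join("1" if ch == "0" else "0" for ch in t)
    return t + complement + "0" * (2 * l) + "1" * (2 * l)


def to_points(extended: str, i: int):
    """Mapping step: 0 -> (0, 2i, 0) and 1 -> (1, 2i, 0)."""
    return [(1.0 if ch == "1" else 0.0, 2.0 * i, 0.0) for ch in extended]


def centered(points):
    """Translate the points so that their centroid is the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [tuple(p[k] - c[k] for k in range(3)) for p in points]


def rotate_z(points, theta: float):
    """Rotate the points by angle theta about the z-axis."""
    cs, sn = math.cos(theta), math.sin(theta)
    return [(cs * x - sn * y, sn * x + cs * y, z) for x, y, z in points]


def cmsd(u, v) -> float:
    """Sum of squared Euclidean distances between corresponding points."""
    return sum((a - b) ** 2 for p, q in zip(u, v) for a, b in zip(p, q))


def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


t_i, t_j = "100", "110"
l, d = len(t_i), hamming(t_i, t_j)
A = centered(to_points(extend(t_i), 1))
B = centered(to_points(extend(t_j), 2))

theta = math.pi / 3
print(cmsd(A, rotate_z(B, theta)), 3 * l - (3 * l - 2 * d) * math.cos(theta))
print(cmsd(A, B), 2 * d)  # theta = 0 gives twice the Hamming distance
```

The tail of $2\ell$ zeros and $2\ell$ ones guarantees that zeros and ones are balanced, which is what places the centroid at the midpoint of the two coordinates.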



The following lemma, due to [19], states that for a set of structural motifs, the minimum of the sum of scores to the centroid is equivalent to the sum-of-pairs score.

Lemma 4. Given a set of structural motifs $m_1, \dots, m_n$, the minimal value of $\sum_{i=1}^{n} \mathrm{cMSD}(q, m_i)$ is reached when $q$ is the average of these motifs under the optimal transformations, and the minimal value is $\frac{1}{n} \sum_{1 \le i < j \le n} \mathrm{cMSD}(m_i, m_j)$.

Suppose there is a local alignment consisting of $t_1, \dots, t_n$ with a cost $c = \sum_{1 \le i < j \le n} d_H(t_i, t_j)$. Then, by Lemmas 3 and 4, we can find a $q$ that has a score of $\frac{2}{n} c$, i.e., $\sum_{1 \le i \le n} d(M(t_i), q) = \frac{2}{n} c$. Conversely, given an optimal solution of an instance of the $(R,C)$-Compact Consensus Structural Motif problem with a score of $c$, the corresponding sequence fragments form a local multiple alignment with a cost of $\frac{n}{2} c$. Thus Theorem 1 holds.
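The identity behind Lemma 4 can be verified directly when the transformations are held fixed. The derivation below is a sketch under that assumption: each motif $m_i$ is treated as a vector in $\mathbb{R}^{3\ell}$, so that $\|m_i - m_j\|^2$ equals the cMSD distance, and $\bar{m}$ (a symbol introduced here) denotes the average $\frac{1}{n}\sum_i m_i$.

```latex
\begin{align*}
\sum_{1 \le i < j \le n} \|m_i - m_j\|^2
  &= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
     \bigl\| (m_i - \bar{m}) - (m_j - \bar{m}) \bigr\|^2 \\
  &= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
     \bigl( \|m_i - \bar{m}\|^2 + \|m_j - \bar{m}\|^2 \bigr)
     - \Bigl\langle \sum_{i=1}^{n} (m_i - \bar{m}),\;
                    \sum_{j=1}^{n} (m_j - \bar{m}) \Bigr\rangle \\
  &= n \sum_{i=1}^{n} \|m_i - \bar{m}\|^2 ,
\end{align*}
% The inner product vanishes because \sum_i (m_i - \bar{m}) = 0.
% Dividing by n gives
%   \sum_i \|m_i - \bar{m}\|^2 = \frac{1}{n} \sum_{i<j} \|m_i - m_j\|^2 .
```

Combined with Lemma 3, a sum-of-pairs Hamming cost $c$ therefore corresponds to a consensus score of $\frac{1}{n} \cdot 2c = \frac{2}{n} c$.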

4. (R, C)-compact motif finding algorithm

In this section, we present an approximation algorithm to solve the $(R,C)$-Compact Consensus Structural Motif problem. The basic idea of our algorithm is as follows: first, we translate the proteins to make their centroids coincide.

Fig. 2. Superimposition of two motifs.

Then, for each discrete rigid transformation and each $r$-tuple of compact motifs, we calculate the median, and find from each protein the part closest to this median. Finally, the median with the minimal value of the objective function is output.
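The search just described (and formalized in the algorithm statement that follows) can be sketched in a heavily simplified form. The Python code below is not the authors' implementation and rests on several assumptions: candidate motifs of each protein are supplied explicitly rather than enumerated, every motif is centered at the origin instead of translating whole proteins, and the best rotation for each protein is searched over the same discrete net rather than computed exactly. All function names are hypothetical.

```python
import math
from itertools import product


def rotation_net(eps):
    """Discrete rotation set {(i*eps, j*eps, k*eps) : 0 <= i, j, k <= 2*pi/eps}."""
    steps = int(2 * math.pi / eps + 1e-9)
    angles = [i * eps for i in range(steps + 1)]
    return list(product(angles, repeat=3))


def rotate(p, rot):
    """Rotate point p by angles (r1, r2, r3) about the x, y and z axes in turn."""
    x, y, z = p
    r1, r2, r3 = rot
    y, z = y * math.cos(r1) - z * math.sin(r1), y * math.sin(r1) + z * math.cos(r1)
    x, z = x * math.cos(r2) + z * math.sin(r2), -x * math.sin(r2) + z * math.cos(r2)
    x, y = x * math.cos(r3) - y * math.sin(r3), x * math.sin(r3) + y * math.cos(r3)
    return (x, y, z)


def apply_rotation(rot, motif):
    return [rotate(p, rot) for p in motif]


def center(motif):
    """Translate a motif so that its centroid is the origin (cf. Lemma 2)."""
    n = len(motif)
    c = [sum(p[k] for p in motif) / n for k in range(3)]
    return [tuple(p[k] - c[k] for k in range(3)) for p in motif]


def cmsd(u, v):
    return sum((a - b) ** 2 for p, q in zip(u, v) for a, b in zip(p, q))


def best_fit(u, candidates, net):
    """Closest (cost, motif, rotation) to the median u, searching the net only."""
    return min(
        ((cmsd(u, apply_rotation(rot, v)), v, rot) for v in candidates for rot in net),
        key=lambda item: item[0],
    )


def find_motif(proteins, r, net):
    """proteins[i] lists the candidate motifs of protein i (assumed given).
    Returns (cost, median, per-protein fits) for the best sampled median."""
    proteins = [[center(m) for m in cands] for cands in proteins]
    pool = [m for cands in proteins for m in cands]
    best = None
    for sample in product(pool, repeat=r):        # every r-tuple of motifs
        for rots in product(net, repeat=r - 1):   # rotations for u_2 .. u_r
            moved = [sample[0]] + [apply_rotation(t, m) for t, m in zip(rots, sample[1:])]
            length = len(moved[0])
            u = [tuple(sum(m[j][k] for m in moved) / r for k in range(3))
                 for j in range(length)]
            fits = [best_fit(u, cands, net) for cands in proteins]
            cost = sum(f[0] for f in fits)
            if best is None or cost < best[0]:
                best = (cost, u, fits)
    return best
```

As a usage example, two proteins whose single candidate motifs differ by a quarter turn about the z-axis are matched with near-zero cost by `find_motif([[A], [B]], r=1, net=rotation_net(math.pi / 2))`. The nested loops make the exponential dependence on the sample size `r` visible, which is the trade-off between running time and approximation quality.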

(R, C)-Compact Motif Finding Algorithm

Input: $n$ protein structures $P_1, P_2, \dots, P_n$, integers $\ell$, $C$, $r$, real numbers $R$, $\epsilon$.
Output: median consensus $u$ of length $\ell$, rigid transformation $\tau_i$, and $(R,C)$-compact motif $u_i$ of length $\ell$ for $P_i$, for $1 \le i \le n$.

1. Fix $P_1$, and translate the other proteins to make their centroids coincide with that of $P_1$.
2. FOR every $r$ length-$\ell$ $(R,C)$-compact motifs $u_1, u_2, \dots, u_r$, where $u_i$ is a motif of some $P_j$, DO
3.   FOR every $r - 1$ transformations $\tau_2, \tau_3, \dots, \tau_r$ from the $\frac{\epsilon}{Rn\ell}$-net of the rotation space $\mathcal{T}$ DO
     (a) Find the average of $u_1, \tau_2(u_2), \dots, \tau_r(u_r)$: $u = (u_1 + \tau_2(u_2) + \cdots + \tau_r(u_r))/r$.
     (b) FOR $i = 1, 2, \dots, n$ DO
           Find the length-$\ell$ $(R,C)$-compact motif $v_i$ of $P_i$ and its optimal rigid transformation $\tau'_i$ that minimize $d(u, \tau'_i(v_i))$.
     (c) Let $c(u) = \sum_{i=1}^{n} d(u, \tau'_i(v_i))$.
4. Output $u$ and the corresponding $v_i$, $\tau'_i$ that minimize $c(u)$.

Let $f(x) = \sum_{i=1}^{n} (x - a_i)^2$. It is easy to see that $f(x)$ is minimized when $x$ equals the average of $\{a_i\}$. The following lemma states that if we randomly choose $r$ numbers from $\{a_i\}$, and let $x$ be the average of these $r$ numbers, then the expected value of $f(x)$ is $(1 + 1/r)$ times the minimum.

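Lemma 5. Let $a_1, a_2, \dots, a_n$ be $n$ real numbers, and let $r$ be an integer with $1 \le r \le n$. Then the following equation holds:

$$
\frac{1}{n^r} \sum_{1 \le i_1, \dots, i_r \le n} \sum_{i=1}^{n} \left( \frac{a_{i_1} + \cdots + a_{i_r}}{r} - a_i \right)^2
= \frac{r+1}{r} \sum_{i=1}^{n} \left( \frac{a_1 + \cdots + a_n}{n} - a_i \right)^2 .
$$

Proof. For the sake of simplicity, we use $\sigma$ to denote $\sum_{i=1}^{n} a_i$ and $\sigma'$ to denote $\sum_{i=1}^{n} a_i^2$. Then

$$
\begin{aligned}
&\frac{1}{n^r} \sum_{1 \le i_1, \dots, i_r \le n} \sum_{i=1}^{n} \left( \frac{a_{i_1} + \cdots + a_{i_r}}{r} - a_i \right)^2 \\
&\quad = \frac{1}{n^r} \sum_{1 \le i_1, \dots, i_r \le n} \left( n \, \frac{(a_{i_1} + \cdots + a_{i_r})^2}{r^2} - 2 \, \frac{a_{i_1} + \cdots + a_{i_r}}{r} \, \sigma + \sigma' \right) \\
&\quad = \frac{1}{n^r} \sum_{1 \le i_1, \dots, i_r \le n} \frac{n}{r^2} \left( a_{i_1}^2 + \cdots + a_{i_r}^2 + 2 (a_{i_1} a_{i_2} + \cdots + a_{i_{r-1}} a_{i_r}) \right) - \frac{2 r n^{r-1} \sigma}{r n^r} \, \sigma + \sigma' \\
&\quad = \frac{1}{n^r} \cdot \frac{n}{r^2} \left( r n^{r-1} \sigma' + r (r-1) n^{r-2} \sigma^2 \right) - \frac{2 \sigma^2}{n} + \sigma' \\
&\quad = \frac{\sigma'}{r} + \frac{r-1}{r} \cdot \frac{\sigma^2}{n} - \frac{2 \sigma^2}{n} + \sigma' \\
&\quad = \frac{r+1}{r} \left( \sigma' - \frac{\sigma^2}{n} \right) \\
&\quad = \frac{r+1}{r} \left( \frac{\sigma^2}{n} - 2 \, \frac{\sigma^2}{n} + \sigma' \right) \\
&\quad = \frac{r+1}{r} \sum_{i=1}^{n} \left( \frac{\sigma}{n} - a_i \right)^2 . \qquad \Box
\end{aligned}
$$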
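Lemma 5 can also be checked numerically by brute force on small inputs. The sketch below is only a sanity check, not part of the paper: it enumerates all $n^r$ index tuples and uses exact rational arithmetic, so both sides can be compared with `==`. The function names are illustrative, and the inputs are assumed to be integers.

```python
from fractions import Fraction
from itertools import product


def sampled_average_cost(a, r):
    """Left-hand side of Lemma 5: average over all r-tuples (with repetition)
    of f(x), where x is the mean of the sampled numbers."""
    n = len(a)
    total = Fraction(0)
    for idx in product(range(n), repeat=r):
        x = Fraction(sum(a[i] for i in idx), r)
        total += sum((x - ai) ** 2 for ai in a)
    return total / n ** r


def scaled_optimal_cost(a, r):
    """Right-hand side of Lemma 5: (r + 1)/r times f evaluated at the true mean."""
    n = len(a)
    mean = Fraction(sum(a), n)
    return Fraction(r + 1, r) * sum((mean - ai) ** 2 for ai in a)


a = [1, 2, 4, 7]
for r in (1, 2, 3):
    print(r, sampled_average_cost(a, r), scaled_optimal_cost(a, r))
```

Because the average over all tuples equals $(1 + 1/r)$ times the optimum, at least one tuple must achieve a cost no larger than that bound.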


The following lemma is needed for our analysis of the time complexity.

Lemma 6. All of the $(R,C)$-compact motifs of length $\ell$ for a protein $P$ with $m$ residues can be enumerated in $O(m^4 \ell^C)$ time.

Proof. According to the definition of the $(R,C)$-compact motif, the containing ball $B$ of a motif must contain $\ell$ to $\ell + C$ residues of $P$. In addition, due to the minimality of $B$, either there are 4 residues on the surface of $B$, or there are 3 residues on its surface and the radius of $B$ is $R$. Therefore, to enumerate the compact motifs, we can first enumerate the containing balls, which takes $O(m^4)$ time; then from each ball we enumerate the motifs, which takes $O(\ell^C)$ time. In total, it takes $O(m^4 \ell^C)$ time. □

Theorem 2. The $(R,C)$-Compact Motif Finding Algorithm outputs a solution with cost no more than

$$
\left( 1 + \frac{1}{r} \right) c_{opt} + O(\epsilon)
$$

in time $O(n^{4r-2} m^{4r+4} R^{3r-3} \ell^{Cr+C+3r-2} / \epsilon^{3r-3})$, where $c_{opt}$ is the cost of the optimal solution.

Proof. Step 1 takes $O(nm)$ time. The enumeration of $\{u_i\}$ takes $O(n^r (m^4 \ell^C)^r)$ time, and the enumeration of $\{\tau_i\}$ takes $O((Rn\ell/\epsilon)^{3(r-1)})$ time. Steps 3(a)–(c) take $O(n \cdot m^4 \ell^C \cdot \ell)$ time (finding $\tau'_i$ takes $O(\ell)$ time according to [3]). So, the time complexity of the algorithm is $O(n^{4r-2} m^{4r+4} R^{3r-3} \ell^{Cr+C+3r-2} / \epsilon^{3r-3})$.

Now, we prove the performance ratio. Given an instance of the problem, we use $u^*$ to denote the optimal median; $v_i^*$ and $\tau_i^*$ denote the optimal motif in $P_i$ and the corresponding optimal rigid transformation, respectively. Then we have $c_{opt} = \sum_{i=1}^{n} d(u^*, \tau_i^*(v_i^*))$. By the property of our cost function, it is easy to see that $u^*$ is the average of $\tau_1^*(v_1^*), \tau_2^*(v_2^*), \dots, \tau_n^*(v_n^*)$, i.e., $u^* = (\tau_1^*(v_1^*) + \tau_2^*(v_2^*) + \cdots + \tau_n^*(v_n^*))/n$.

First, we claim that $c_{opt}$ can be approximated by sampling $r$ proteins. In particular, we will show that there exist $1 \le i_1, i_2, \dots, i_r \le n$ such that

$$
\sum_{i=1}^{n} d(u^*_{i_1 \dots i_r}, \tau_i^*(v_i^*)) \le (1 + 1/r) \, c_{opt}, \qquad (1)
$$

where $u^*_{i_1 \dots i_r} = (\tau_{i_1}^*(v_{i_1}^*) + \tau_{i_2}^*(v_{i_2}^*) + \cdots + \tau_{i_r}^*(v_{i_r}^*))/r$. It suffices to prove that the average of this value over all $1 \le i_1, i_2, \dots, i_r \le n$ is $(1 + 1/r) \, c_{opt}$, which can be easily deduced by applying Lemma 5 to each coordinate of each point.

For $1 \le i \le n$, let $\tau'_i$ be a rotation in the $\frac{\epsilon}{Rn\ell}$-net of $\mathcal{T}$ (recall that $R$ is the maximum radius of the motifs) that is closest to $\tau_i^*$ in $\mathcal{T}$. Then $\tau_i^*$ can be reached from $\tau'_i$ by moving at most $\frac{\epsilon}{2Rn\ell}$ along each of the three dimensions. Let

$$
u'_{i_1 \dots i_r} = (\tau'_{i_1}(v_{i_1}^*) + \tau'_{i_2}(v_{i_2}^*) + \cdots + \tau'_{i_r}(v_{i_r}^*))/r .
$$

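Now we will prove that $\sum_{i=1}^{n} d(u'_{i_1 \dots i_r}, \tau'_i(v_i^*)) \le \sum_{i=1}^{n} d(u^*_{i_1 \dots i_r}, \tau_i^*(v_i^*)) + O(\epsilon)$. Let $u^*_{i_1 \dots i_r}[j]$ be the $j$th element of $u^*_{i_1 \dots i_r}$, and $v_i^*[j]$ be the $j$th element of $v_i^*$; then we have

$$
\begin{aligned}
&\sum_{i=1}^{n} d(u'_{i_1 \dots i_r}, \tau'_i(v_i^*)) \\
&\quad = \sum_{i=1}^{n} \sum_{j=1}^{\ell} \| u'_{i_1 \dots i_r}[j] - \tau'_i(v_i^*[j]) \|^2 \\
&\quad = \sum_{i=1}^{n} \sum_{j=1}^{\ell} \| u'_{i_1 \dots i_r}[j] - u^*_{i_1 \dots i_r}[j] + u^*_{i_1 \dots i_r}[j] - \tau_i^*(v_i^*[j]) + \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \|^2 \\
&\quad \le \sum_{i=1}^{n} \sum_{j=1}^{\ell} \Bigl( \| u^*_{i_1 \dots i_r}[j] - \tau_i^*(v_i^*[j]) \| + \bigl( \| u'_{i_1 \dots i_r}[j] - u^*_{i_1 \dots i_r}[j] \| + \| \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \| \bigr) \Bigr)^2 \\
&\quad = \sum_{i=1}^{n} \sum_{j=1}^{\ell} \Bigl( \| u^*_{i_1 \dots i_r}[j] - \tau_i^*(v_i^*[j]) \|^2 + \bigl( \| u'_{i_1 \dots i_r}[j] - u^*_{i_1 \dots i_r}[j] \| + \| \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \| \bigr)^2 \\
&\qquad\quad + 2 \, \| u^*_{i_1 \dots i_r}[j] - \tau_i^*(v_i^*[j]) \| \times \bigl( \| u'_{i_1 \dots i_r}[j] - u^*_{i_1 \dots i_r}[j] \| + \| \tau_i^*(v_i^*[j]) - \tau'_i(v_i^*[j]) \| \bigr) \Bigr) \\
&\quad \le \sum_{i=1}^{n} d(u^*_{i_1 \dots i_r}, \tau_i^*(v_i^*)) + O(\epsilon) + 8\epsilon^2 / Rn\ell \\
&\quad = \sum_{i=1}^{n} d(u^*_{i_1 \dots i_r}, \tau_i^*(v_i^*)) + O(\epsilon) .
\end{aligned}
$$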

The first inequality follows from the triangle inequality for the Euclidean norm, and the second one follows from the property of the $\epsilon$-net of the transformation space and the compactness of the motifs. Specifically, by our choice of $\tau'_{i_j}$, we have $\| \tau'_i(v_i^*) - \tau_i^*(v_i^*) \| \le \epsilon / Rn\ell$ and $\| u'_{i_1 \dots i_r} - u^*_{i_1 \dots i_r} \| \le \epsilon / Rn\ell$ (more details can be found in [10]). In addition, $\| u^*_{i_1 \dots i_r}[j] - \tau_i^*(v_i^*[j]) \| \le 2R$, since the points are bounded by a containing ball with radius at most $R$.

It is easy to see that the output of our algorithm is at least as good as $\sum_{i=1}^{n} d(u'_{i_1 i_2 \dots i_r}, \tau'_i(v_i^*))$. Together with (1), this proves the performance ratio of our algorithm. □

5. Conclusion

We present a sampling-based approximation algorithm for the problem of finding the compact consensus shape from a family of proteins. Our algorithm requires that the consensus pattern satisfy the compactness condition. Finding a good algorithm for the more general case remains an open problem.

Acknowledgments

We thank the editors and referees for their useful suggestions and efficient handling of this paper. We thank our coauthors Bin Ma and Lusheng Wang of paper [12]. This work is partially supported by the Canada Research Chairs program, the NSERC Discovery Grant OGP0046506, an 863 Grant by the Ministry of Science and Technology of China, an NSERC Collaborative grant, a MITACS grant, and a grant by the National Natural Science Foundation of China 30800168.

References

[1] T. Akutsu, H. Arimura, S. Shimozono, On approximation algorithms for local multiple alignment, in: Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, ACM, New York, NY, USA, 2000, pp. 1–7.
[2] P. Aloy, E. Querol, F.X. Aviles, M.J. Sternberg, Automated structure-based prediction of functional sites in proteins: Applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking, Journal of Molecular Biology 311 (2001) 395–408.
[3] K.S. Arun, T.S. Huang, S.D. Blostein, Least square fitting of two 3-D point sets, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (5) (1987) 698–700.
[4] D. Bandyopadhyay, J. Huan, J. Liu, J. Prins, J. Snoeyink, W. Wang, A. Tropsha, Structure-based function inference using protein family-specific fingerprints, Protein Science 15 (2006) 1537–1543.
[5] L.P. Chew, K. Kedem, Finding the consensus shape of a protein family, in: Proceedings of 18th Annual ACM Symposium on Computational Geometry, 2002, pp. 64–73.
[6] I. Gelfand, A. Kister, C. Kulikowski, O. Stoyanov, Geometric invariant core for the VL and VH domains of immunoglobulin molecules, Protein Engineering 11 (1998) 1015–1025.
[7] M. Gerstein, R.B. Altman, Average core structure and variability measures for protein families: application to the immunoglobins, Journal of Molecular Biology 112 (1995) 535–542.
[8] L. Holm, C. Sander, Dali: a network tool for protein structure comparison, Trends in Biochem Sciences 20 (11) (1995) 478–480.
[9] T.S. Huang, S.D. Blostein, E.A. Margerum, Least-square estimation of motion parameters from 3-d point correspondences, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 69 (1986) 198–201.
[10] R. Kolodny, N. Linial, Approximate protein structural alignment in polynomial time, Proceedings of the National Academy of Sciences 101 (2004) 12201–12206.
[11] N. Leibowitz, Z.Y. Fligelman, R. Nussinov, Multiple structural alignment and core detection by geometric hashing, in: Proc. 7th Int. Conf. Intell. Sys. Mol. Biol, 1999, pp. 169–177.
[12] M. Li, B. Ma, L. Wang, Finding similar regions in many strings, in: Proceedings of the 31st Annual ACM Symposium on Theory of Computing, Atlanta, 1999, pp. 473–482.
[13] C. Orengo, CORA-Topological fingerprints for protein structural family, Protein Science 8 (1999) 699–715.
[14] C. Orengo, W. Taylor, SSAP: Sequential structure alignment program for protein structure comparison, Methods in Enzymology 266 (1996) 617–635.
[15] M. Shatsky, A. Shulman-Peleg, R. Nussinov, H.J. Wolfson, The multiple common point set problem and its application to molecule binding pattern detection, Journal of Computational Biology 13 (2) (2006) 407–428.
[16] I.N. Shindyalov, P.E. Bourne, Protein structure alignment by incremental combinatorial extension CE of the optimal path, Protein Engineering 11 (9) (1998) 739–747.
[17] S. Subbiah, D.V. Laurents, M. Levitt, Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core, Current Biology 3 (1993) 141–148.
[18] J. Xu, F. Jiao, B. Berger, A parameterized algorithm for protein structure alignment, in: Proceedings of the Tenth Annual International Conference on Computational Molecular Biology, 2006, pp. 488–499.
[19] J. Ye, R. Janardan, Approximate multiple protein structure alignment using the sum-of-pairs distance, Journal of Computational Biology 11 (5) (2004) 986–1000.
[20] Y. Zhang, J. Skolnick, TM-align: A protein structure alignment algorithm based on the TM-score, Nucleic Acids Research 33 (2005) 2302–2309.