A simple method of identifying symmetric substructures of proteins


Computational Biology and Chemistry 33 (2009) 100–107 Contents lists available at ScienceDirect Computational Biology and Chemistry journal homepage...



Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Brief Communication

Hanlin Chen a,b, Yanzhao Huang a,b, Yi Xiao a,b,∗

a Biomolecular Physics and Modeling Group, Department of Physics, Huazhong University of Science and Technology, Wuhan 430074, Hubei, China
b Center of Physical Biology, Wuhan 430074, China

Article info

Article history:
Received 6 November 2007
Received in revised form 10 July 2008
Accepted 15 July 2008

Keywords: Structural symmetry; dRMSD; Similarity matrix; Pearson’s correlation

Abstract

Accurate identification of internal symmetric substructures of proteins is needed in the study of protein evolution and in protein design. To overcome the difficulties met by previous methods, we propose a simple quantitative method that combines a similarity matrix with Pearson’s correlation analysis. The distance root-mean-square deviation (dRMSD) is used to measure the similarity of two substructures in a protein. We applied this method to proteins of the β-propeller, jelly roll, and β-trefoil families, and the results show that it can not only detect the internal repetitive structures in proteins effectively, but also identify their locations easily.

© 2008 Elsevier Ltd. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (Y. Xiao).
1476-9271/$ – see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2008.07.026

1. Introduction

In protein engineering, there has recently been increasing interest in repeat proteins (e.g., ankyrin repeat, leucine-rich repeat) because of their biological importance and particular architecture (Forrer et al., 2004; Mosavi et al., 2002). Design and engineering of repeat proteins may help to elucidate their structural and biophysical properties, such as the dependence of stability and folding on the number of repeats, as well as the importance of key intra- and inter-repeat interactions. Furthermore, repeat proteins are, like immunoglobulins, versatile natural scaffolds specialized for target binding. Thus, design and engineering of repeat proteins may result in novel binding molecules suitable for biotechnological or medical applications. To do this, it is very important to determine the repetitive units in proteins (e.g., their lengths and locations) quantitatively. For example, one of the current protein engineering approaches is repeat-based consensus design, in which consensus repeats are obtained by both intra- and intermolecular sequence alignments of repeats. The efficiency of this approach therefore depends on the size and accuracy of the repeat sequence database. It is usually difficult to identify the repeats in proteins directly at the sequence level. However, it is relatively easy to do this quantitatively at the structure level.

On the other hand, many proteins have symmetric folds. For example, among the ten most frequently encountered protein folds (the so-called superfolds), six have internal structural symmetries (Salem et al., 1999): the four-helix bundle, ferredoxin, β-trefoil, (αβ)8-barrel, jelly roll and immunoglobulin (Ig) folds. It has been suggested that symmetric folds might have evolved from short peptide ancestors via gene duplication and fusion (Doolittle, 1995; Henikoff et al., 1997; Chothia et al., 2003; Miguel et al., 2001; Edward et al., 2004; Xu and Xiao, 2005). However, many of these repeat signals can so far be seen only at the structure level, since most sequences have undergone many mutations during evolution. Thus, accurate identification of internal structural repeats will be very important for the investigation of protein evolution. A very interesting example is an (αβ)8-barrel, the imidazoleglycerol phosphate synthase (HisF, 1thf) from Thermotoga maritima. This protein appears to have a pseudo eight-fold symmetry from a qualitative overview, but it was recently found that HisF evolved by tandem duplication and fusion from an ancestral half-barrel, i.e., the tertiary structure of HisF is an (αβ)8-barrel with a two-fold repeat pattern (Lang et al., 2000; Höcker et al., 2001). We shall show below that the method introduced in this paper indeed gives a clear two-fold, instead of eight-fold, repeat pattern for this protein.

In the past 20 years, a large number of automatic methods for protein structure comparison have been developed, using different representations of structure, definitions of similarity measure and optimization algorithms (Michell et al., 1989; Subbarao and Haneef, 1991; Vriend and Sander, 1991; Fischer et al., 1992; Alexandrov et al., 1992; Barakat and Dean, 1991; Taylor and Orengo, 1989; Sali and Blundell, 1990; Edward et al., 2003). Some of these methods dissect out different substructures in a qualitative overview (Flores et al., 1993; Salem et al., 1999). To get quantitative results, Taylor et al.


Fig. 1. The structures and similarity-matrix dot plot of collagenase 1 (PDB ID: 1FBL). (a) the tertiary structure, (b) the dot plot of similarity matrix, (c) repeats and their locations determined by dRMSD method and (d) the result provided by PROPEAT.

developed a fast and automatic way to detect symmetry or repetition in protein structures by using a Fourier analysis (Taylor et al., 2002). It is very simple to extract the periodicities of repeated structures by applying a Fourier analysis to the score matrix of the structure comparison program (Taylor, 1999) derived from double dynamic programming (Taylor and Orengo, 1989). However, one disadvantage of this approach is that it is impossible to tell the size or locations of the substructures that give rise to any periodicity without returning to examine the original matrix. Another way to reveal recurring substructures has also been proposed, based on optimal, permuted, and alternative alignments of protein structures (OPAAS) (Edward et al., 2004) and on vector-represented secondary structure elements (Edward et al., 2003) (see the web server PROPEAT at http://gln.ibms.sinica.edu.tw/product/repeat).

Here, we present an alternative quantitative approach to identifying recurring substructures. It is known that similar three-dimensional structures have similar inter-residue distances (Holm and Sander, 1993). In comparing three-dimensional structures of globular proteins, the distance root-mean-square deviation (dRMSD) is an effective measure, and many investigators calculate the dRMSD of the atomic positions in their model from those in the crystallographically observed native structure as a measure of their success (Fred and Michael, 1980). A very large value (generally much more than 3 Å) means the two structures are dissimilar, and zero means they are identical in conformation (Maiorov and Crippen, 1994). In this work, we use the dRMSD to measure the similarity of two local structures of the same length within a protein in order to identify its internal symmetry or repetition. By constructing a similarity matrix and applying Pearson’s correlation analysis to it, we can determine both the number and the locations of repetitive substructures in a protein, without any specific information about the secondary structures (whether they are helices or strands).

2. Methods

To investigate the repetitive substructures, we use a modified recurrence quantification analysis (Xu and Xiao, 2005; Ji et al., 2007; Konopka, 1994, 1997, 2003; Wootton, 1997; Konopka and Smythers, 1987; Konopka and Chatterjee, 1988) and represent a protein structure as a sequence of the coordinates of its Cα atoms, $S = C_1 C_2 C_3 \ldots C_N$, where $C_i$ denotes the Cα atom coordinate of the ith residue and N is the sequence length. Thus, all possible substructures of d consecutive residues in a protein structure can be represented as

$$
\begin{aligned}
X_1(d) &= C_1 C_2 \ldots C_d \\
X_2(d) &= C_2 C_3 \ldots C_{d+1} \\
&\;\;\vdots \\
X_i(d) &= C_i C_{i+1} \ldots C_{i+d-1} \\
&\;\;\vdots \\
X_{N-d+1}(d) &= C_{N-d+1} C_{N-d+2} \ldots C_N
\end{aligned}
\tag{1}
$$

where i denotes the location of the first residue of $X_i$ in the sequence. In order to determine the similarity of any pair of these substructures, we use the dRMSD measure (Levitt, 1976; Cohen and Sternberg, 1980; Maiorov and Crippen, 1994). For any two substructures $X_i(d)$ and $X_j(d)$, their dRMSD is defined as:

$$
D_{\mathrm{dRMSD}}\bigl(X_i(d), X_j(d)\bigr) = \sqrt{\frac{1}{d(d-1)/2} \sum_{m=1}^{d} \sum_{n>m} \left(r^{i}_{mn} - r^{j}_{mn}\right)^{2}}
\tag{2}
$$

where $r^{i}_{mn}$ ($r^{j}_{mn}$) is the distance between the mth and nth Cα atoms of the substructure $X_i$ ($X_j$). The substructure $X_i(d)$ is considered to be similar to $X_j(d)$ if the value of $D_{\mathrm{dRMSD}}$ is less than 3 Å (Maiorov and Crippen, 1994). The dRMSD analysis of the similarity of the substructures can be done for different values of d. Then, we can build an internal similarity matrix S for the protein structure. The element S(d, i) of S is the number of non-overlapping substructures similar to $X_i(d)$ for a given d. The similarity matrix S can be illustrated by a dot plot of d against i, in which a dot is drawn at the position (d, i) if S(d, i) is nonzero. If the protein structure has internal repetition, the dot plot should show repetitive patterns (see Fig. 1).

Fig. 2. The pseudocolor plot of the Pearson’s correlation coefficients between different sub-matrices of the similarity matrix S. The magnitude of the correlation coefficients is indicated by the colorbar. It shows that there are four strongly correlated segments which can be regarded as similar.

Fig. 3. The structures, dot plots and pseudocolor plots of propeller proteins with five-, six-, seven- and eight-fold symmetries. (a) PDB ID, (b) the tertiary structures, (c) the dot plots of similarity matrices and (d) the pseudocolor plot of the Pearson’s correlation coefficients between different sub-matrices of the similarity matrices.

Fig. 4. The structures, dot plots and pseudocolor plots of jelly roll folds. (a) PDB ID, (b) the tertiary structures, (c) the dot plots of similarity matrices and (d) the pseudocolor plot of the Pearson’s correlation coefficients between different sub-matrices of the similarity matrices.

Table 1
Evaluation of our method

Protein fold    Total number of folds    Detected number    True positive percentage (%)
β-Propeller     56                       42                 75.0
Jelly roll      32                       26                 81.3
β-Trefoil       28                       24                 85.7

Although the dot plot of the similarity matrix can reveal the internal repetitive patterns of a protein structure, in many cases it is difficult to determine the locations and lengths of the repeats from it (see Figs. 3–6 in the following). To solve this problem, we further calculate the Pearson’s correlation coefficient r between sub-matrices of the similarity matrix S. The sub-matrices have the same row size as S but a smaller column size (≤N/2). For example, we can select two sub-matrices $S^1$ and $S^2$, with $S^1$ including the residues (or the columns of S) from 1 to N/2 and $S^2$ the residues (or the columns of S) from N/2 + 1 to N. If the protein structure has a repetitive structure made of two modules of equal length (1–N/2 and (N/2 + 1)–N), the Pearson’s correlation coefficient between $S^1$ and $S^2$ should be a maximum. Any other two sub-matrices (e.g., 1–N/2 and N/4–(3N/4 − 1)) should give a much lower Pearson’s r. In a similar way, we can subdivide S into sub-matrices with column size N/n if we want to explore the n-fold repetition of protein structures. Note that the Pearson’s correlation analysis in our method incorporates information on repeats of different lengths and in this respect resembles a profile–profile comparison. This may also help to identify repeats with complex structures, e.g., repeats each of which is made of several non-consecutive regular sub-substructures.

The Pearson’s correlation coefficient r between two sub-matrices is defined as:

$$
r(j,k) = \frac{\sum_{m}\sum_{n} \left(S^{j}_{mn} - \bar{S}^{j}\right)\left(S^{k}_{mn} - \bar{S}^{k}\right)}{\sqrt{\sum_{m}\sum_{n} \left(S^{j}_{mn} - \bar{S}^{j}\right)^{2} \, \sum_{m}\sum_{n} \left(S^{k}_{mn} - \bar{S}^{k}\right)^{2}}}
\tag{3}
$$

where r(j, k) is the value of the correlation coefficient between the sub-matrices $S^j$ and $S^k$, m and n are, respectively, the row and the column indices of the elements in $S^j$ and $S^k$, and $\bar{S}^{j}$ and $\bar{S}^{k}$ are the average values of all elements in $S^j$ and $S^k$. The Pearson’s

Fig. 5. The structures, dot plots and pseudocolor plots of β-trefoil folds. (a) PDB ID, (b) the tertiary structures, (c) the dot plots of similarity matrices, (d) the pseudocolor plot of the Pearson’s correlation coefficients between different sub-matrices of the similarity matrices.


Table 2 List of proteins (SCOP 1.69) analyzed with our method

Fold

Fold

PDB code

Length

Number of repeats/length

4-Propeller

1fbl 1gen 1hxn 1itvA

185 200 210 195

4 4 4 4

43 45 47 43

5-Propeller

1pex 1qhuA 1tl2A

192 185 235

4 4 5

43 41 48

6-Propeller

1gyhA 1oyg 1s18 1uyp 1crzA 1e1aA 1e8uA 1eur 1f8eA 1h6lA 1ijqA 1k32A 1ofzB 1npeA 1q7fA 1suuA 1v04Z 1cru

318

5

58

263 312 446 361 387 353 308 353 312 263 279 293 332 448

6 6 6 6 6 6 6 6 6 6 6 6 6

38 47 55 50 59 53 46 53 47 38 41 43 41

1ms9 1v3e 2sli 3sil 1kit 1a12A 1gof 1m1x 1olzB 1k8kC 1c9lA 1gxrA 1jjuB 1jofA 1jtdB 1l0q 1nexB 1nr0A

371 431 679 379 757 401 387 438 526 354 357 335 337 355 273 317 337 314

7 7 7 7 7 7 7 7 7 7 7 7 7

52 50 57 45 45 46 42 46 45 34 40 43 39

8-Propeller

1p22A 1qfmA 1qniE

291 354 441

7 7 7

36 45 40

Jelly roll

1ri6 1tbgA 2bbkH 1erj 1k32 1gq1B 4aah 1pfq 1kb0 1flg 1a6cA 1c8nA 1ciy 1d7pM

333 298 355 348 353 383 571 444 591 582 176 189 148 159

7 7 7 7

42 37 45 44

8

42

2 2 2 2

83 70 69 74

1dlc 1eut 1f15A 1f35 1gmmA 1jhjA 1k3iA 1k12A 1kit 1nlqA 1ny72

145 142 157 157 126 161 139 158 120 105 182

2 2 2 2 2 2 2 2 2 2 2

67 66 73 58 58 60 64 74 55 47 86

7-Propeller

Jelly roll

β-Trefoil

PDB code

Length

Number of repeats/length

1nziA 1o6uA 1phm 1pmi 1sfp 1smvA 1xnaA 2arcA 2mev2 2stv 1gof 1a34 1bev 1gny 1mpx 1sdw 1tme

114 116 153 104 111 196 151 116 249 170 150 124 244 153 229 158 255

2 2 2 2 2 2 2 2 2 2 2

45 53 51 47 50 78 60 70 93 80 70

1bfg 1barb 1jqzA 1ijtA 1qqkA 1nunA 1l2h 8i1b 1n4kA 1abr 1hwm 1xyf 1knm 1avw 1ava 1dfcA 1hcd 1ggp 1ihkA 1ilr 1m2t 3bta 2aai 1fmm-s 1a8d 2ila 1epw 1jlxA 1dqg 1eyl 1tie 1wba

126 138 136 128 129 139 144 146 141 143 136 124 129 171 181 115 118 129 157 145 137 200 138 132 206 145 209 153 134 179 166 171

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

37 41 40 37 38 41 40 36 42 37 40 36 38 45 45 28 34 38 47 36 40 61 41 39 51 43 52 43

The absence of the values in “Number of repeats” means that no repeats can be clearly detected.

correlation coefficients between all possible pairs of sub-matrices form a matrix r. Using a pseudocolor plot, the distribution of the values in r and the differences among them can be visualized (Fig. 2). If the protein structure has recurring substructures, the Pearson’s correlation coefficient between the corresponding sub-matrices should be very high, and this can be seen clearly in the pseudocolor plot.

3. Results and discussions

SCOP (Murzin et al., 2004) classifies proteins into four major structural types: all α, all β, α + β, and α/β structures. Among the all-β structures, there are many proteins with symmetric structures, such as the propeller fold. In the present work, we focus on three typical symmetric structures: β-propellers, jelly rolls and β-trefoils. We describe how the internal repeats of these proteins can be extracted with our method.
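Before turning to the individual folds, the windowed dRMSD comparison and the similarity matrix S of Section 2 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`pairwise_distances`, `drmsd`, `similarity_matrix`) are hypothetical, positions are 0-based, and "non-overlapping" is assumed here to mean that a counted window does not overlap the reference window X_i(d). The 3 Å cutoff is the one quoted in the text.

```python
import numpy as np


def pairwise_distances(coords):
    """Return the matrix of distances between all pairs of C-alpha atoms."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def drmsd(xi, xj):
    """dRMSD of Eq. (2) between two substructures of equal length d >= 2."""
    d = len(xi)
    m, n = np.triu_indices(d, k=1)            # all atom pairs with n > m
    ri = pairwise_distances(xi)[m, n]
    rj = pairwise_distances(xj)[m, n]
    return np.sqrt(((ri - rj) ** 2).sum() / (d * (d - 1) / 2))


def similarity_matrix(ca, d_values, cutoff=3.0):
    """S[row, i] counts windows of length d_values[row] that are similar to
    the window starting at residue i (dRMSD < cutoff) and do not overlap it
    (assumed reading of "non-overlapping")."""
    n_res = len(ca)
    dist = pairwise_distances(ca)
    S = np.zeros((len(d_values), n_res), dtype=int)
    for row, d in enumerate(d_values):
        m, n = np.triu_indices(d, k=1)
        n_win = n_res - d + 1
        starts = np.arange(n_win)
        # intra-window distance vectors, one row per window start
        prof = np.array([dist[i:i + d, i:i + d][m, n] for i in range(n_win)])
        for i in range(n_win):
            dev = np.sqrt(((prof - prof[i]) ** 2).mean(axis=1))
            similar = (dev < cutoff) & (np.abs(starts - i) >= d)
            S[row, i] = similar.sum()
    return S
```

Given an (N, 3) array `ca` of Cα coordinates, `similarity_matrix(ca, range(5, 60))` would produce one row per window length d; a dot plot then marks every nonzero entry. The range of d used here is only an example, since the text does not state which values were scanned.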

Table 3
Alignments of 1fbl 1, 1fbl 2, 1fbl 3 and 1fbl 4

SeqA    Name    Length (aa)    SeqB    Name    Length (aa)    Score (%)
1       1fbl    43             2       1fbl    43             13
1       1fbl    43             3       1fbl    43             9
1       1fbl    43             4       1fbl    43             13
2       1fbl    43             3       1fbl    43             13
2       1fbl    43             4       1fbl    43             6
3       1fbl    43             4       1fbl    43             11

Table 4
List of proteins analyzed with the OPAAS and our method (SCOP 1.55)

Fold

PDB

Length

Number of repeats/ length (OPAAS* )

Number of repeats/ length (ours)

4-Propeller

1fbl 1gen 1hxn 1pex

185 200 210 192

4 4 4 4

44 45 49 48

4 4 4 4

43 45 47 43

5-Propeller

1tl2A

235

5

47

5

48

6-Propeller

2qwc 1nsc 1eur 1poo 1crz 1kit 1cru 2sli 3sil

387 390 361 353 263 368 448 679 379

6 6 6 6 6 6 6

38 42 50 41 44 43 37

6 6 6 6 6

59 50 50 53 38

7-Propeller

1gof 2bbk 1qni 1erj 1a12 1c9l 1qfm 2mad 1got

387 355 441 348 401 357 430 373 339

7 7 7 7 7 7 7

50 40 42 41 47 39 45

7 7 7 7 7 7 7 7 7

50 45 40 44 52 46 45 48 43

8-Propeller

1nir 1qks 1hj5 1flg 4aah

426 432 434 582 571

8 8

45 47

8 8 8

48 49 49

8

43

1phm 1hx6 1gof 1eut 1czs 1d7p 1dlc 1ciy 1nuk 1bhg 1ulo 1xna 1dy0 1dp0 1pgs

153 370 150 142 159 159 145 148 169 204 152 151 153 207 137

2 2

51 77

2 2 2 2 2 2 2 2 2 2 2 2

51 75 70 66 74 74 67 69 74 77 51 60

2

12

1bfg 1barb 2afga 1i1b 1ilr 1abr 1hwm 1avw 1ava 1a8d 1dfc 1hcd 1epw 1jlx 1qql 8i1b 2ila 2aai 1xyf 3bta 1wba 1eyl 1ce7 1dqg 1tie

126 138 129 151 145 143 136 171 181 205 111 118 209 153 131 146 145 138 124 200 171 179 122 134 166

3 3 3 3 3 3 3 3 3 3 3 3 3 3

42 43 43 42 38 41 40 48 48 47 37 39 51 47

3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

37 41 38 40 36 37 40 45 45 51 32 34 52 43 38 36 43 41 36 61

3 3 3 3

42 54 39 41

3.1. β-Propellers

The β-propeller is a closed structure built from four-stranded, antiparallel, twisted β-sheets arranged radially around a central tunnel (Tirso et al., 2003). Five classes of β-propeller are known, with a four-, five-, six-, seven-, or eight-fold pseudo-symmetry axis. Representatives of the five types are discussed in detail below.

We first take collagenase 1 (PDB ID: 1FBL) as an example. 1FBL has four-fold repeats (Fig. 1a), and its four-stranded antiparallel β-sheets are sequential, so each can be regarded as a unit. As shown in Fig. 1b, the dot plot of the similarity matrix clearly shows the locations of four similar segments. They are marked by arrows and start at positions 1, 44, 93 and 143. Each repeat is 43 residues long. The structure studied here covers residues 282–466 of 1FBL, and the locations of the four repeats are 282–324, 325–367, 374–416 and 424–466 (Fig. 1c). These differ little from the result given by PROPEAT: 282–324, 325–372, 373–417 and 424–461 (Fig. 1d).

For 1FBL, we further calculate the Pearson’s correlation coefficients between all possible pairs of sub-matrices of the similarity matrix S (see Fig. 2). In this case, the row size of the sub-matrices is the same as that of S and the column size is approximately N/4 = 185/4 ≈ 46, where N is the sequence length of 1FBL (282–466). In the plot, the color at the point (i, j) denotes the value of the Pearson’s correlation coefficient between the sub-matrices beginning at the ith and jth residues, i.e., S[:, i:i − 1 + N/4] and S[:, j:j − 1 + N/4], where the first “:” represents all rows. In the pseudocolor plot, the diagonal red band corresponds to self-similarity, for which the correlation coefficient is 1. There are also three other red bands parallel to it. They divide the sequence into four parts at residues 1, 47, 94 and 143, respectively.
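The sliding sub-matrix correlation described above can be sketched as follows. This is an illustrative reading of Eq. (3), not the authors' code: the function name `submatrix_correlations` is hypothetical, column indices are 0-based, and each sub-matrix is flattened so that the double sum over rows and columns becomes a single dot product. Sub-matrices whose entries are all equal have an undefined correlation and are left as NaN, which is an assumption of this sketch.

```python
import numpy as np


def submatrix_correlations(S, width):
    """Return r with r[i, j] = Pearson correlation between the sub-matrices
    S[:, i:i+width] and S[:, j:j+width] (all rows, `width` columns)."""
    n_start = S.shape[1] - width + 1
    # one flattened sub-matrix per start column
    blocks = np.array(
        [S[:, i:i + width].ravel() for i in range(n_start)], dtype=float
    )
    blocks -= blocks.mean(axis=1, keepdims=True)   # subtract each block mean
    norms = np.sqrt((blocks ** 2).sum(axis=1))
    r = np.full((n_start, n_start), np.nan)
    ok = norms > 0                                  # skip constant blocks
    unit = blocks[ok] / norms[ok, None]
    r[np.ix_(ok, ok)] = unit @ unit.T
    return r
```

For a four-fold protein one might call `submatrix_correlations(S, S.shape[1] // 4)` and display the result as a pseudocolor image; bands of high correlation parallel to the diagonal then mark the offsets between repeats.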
The division at residues 1, 47, 94 and 143 indicates that there are four substructures that closely resemble one another (the Pearson’s correlation coefficients between the first sub-matrix and the three others are 0.7198, 0.7446 and 0.8981, respectively). Mapped onto the actual residue numbering of the protein, these locations are 282, 328, 375 and 424, nearly the same as those given by the dot plot or PROPEAT (Fig. 1). The dot plot and the pseudocolor plot above show that both the similarity matrix and the Pearson’s correlation coefficients can identify the internal repeats and determine their locations well. However, for some proteins, e.g., 1GQ1 (see Fig. 3), it is not easy to use the dot plot to identify clear repeats and to determine unique locations and lengths for them. In this situation, the pseudocolor plot of the Pearson’s correlation coefficients is the more useful tool.

For other β-propeller proteins, internal repeats are also detectable using this method. We choose four typical examples with five-, six-, seven- and eight-fold symmetry, respectively: tachylectin-2 (PDB ID: 1TL2), lectin (PDB ID: 1OFZ), beta-lactamase inhibitor protein (PDB ID: 1JTD) and cytochrome cd1 nitrite reductase (PDB ID: 1GQ1). The dot plots of the similarity matrices and the pseudocolor plots of the Pearson’s correlation coefficients are shown in Fig. 3. The obvious repetitive patterns or parallel bands

Jelly roll

β-Trefoil

The absence of the values in “Number of repeats” means that no repeats can be detected in OPAAS or/and dRMSD method.


Fig. 6. The structure, the dot plot of the similarity matrix and the pseudocolor plot of the Pearson’s correlation coefficients (from the left to the right) of the imidazoleglycerol phosphate synthase (HisF, 1thf).

in the plots reveal the internal symmetries of these protein structures. The four protein structures show five-, six-, seven- and eight-fold pseudo-symmetries, respectively, and each repeated unit contains approximately 40 residues.

3.2. Jelly roll

The jelly roll is recognized by CATH as a superfold with diverse functions but no detectable similarity in sequence (Orengo et al., 1997; Williams and Westhead, 2002). It consists of two Greek key motifs that adopt an eight-stranded β-sandwich structure (Richardson, 1981). The hydrogen-bonding pattern between adjacent strands is broken in two places, and as a consequence the structure comprises two four-stranded β-sheets. Both sheets are purely anti-parallel, with strands adjacent in sequence appearing in different sheets, with the exception of the fourth and fifth strands, which are in the same sheet. Four examples (PDB ID: 1CIY, 1EUT, 1DLC, 1K3I) are selected here to display the symmetry of their structures (see Fig. 4). In this case, both the dot plots and the pseudocolor plots show two-fold repetitive patterns. This implies that the jelly roll folds have two-fold symmetries. Note that, from the dot plot, we can only determine the locations and lengths of repeats formed by consecutive residue segments, especially the longest ones. For example, the dot plot of 1CIY shows similar substructures of length 42 located at 502 and 563. However, the pseudocolor plots give different locations and lengths of repeats. This is because the Pearson’s correlation analysis takes into account the substructure similarity at different sizes (d), so the repeats it gives are combinations of similar substructures of different sizes. The Pearson’s correlation analysis can therefore reveal more complex repetitive patterns than the similarity matrix.

3.3. β-Trefoil

The main features of the β-trefoil fold are that it contains 12 β-strands forming six hairpins, three of which form a barrel structure. The other three hairpins form a triangular arrangement, a “hairpin triplet”, fitting on one end of the barrel. The connections between the strands enable the fold to be described in terms of three very similar β-trefoil units that adopt a “Y”-like shape (McLachlan, 1979). Similarly, we choose four proteins (PDB ID: 1GPP, 1HIKA, 1JQZA, 2ILA) to show their symmetries (see Fig. 5). Three repetitive patterns are apparent in the dot plots and pseudocolor plots, and the average length of the repeated units is about 40, which is compatible with the work of Shih and Hwang using OPAAS (Edward et al., 2004). They analyzed 22 β-trefoil proteins and found that the length of the repeated units lies between 37 and 54 and is usually around 40.

Besides the proteins above, we also analyzed other proteins of the β-propeller, jelly roll and β-trefoil families. Table 1 evaluates the efficiency of our method and Table 2 lists all the proteins we analyzed; the SCOP version used in our calculations is 1.69. Table 1 shows that the percentages of proteins whose symmetry can be detected in the three families are 75.0%, 81.3% and 85.7%, respectively. For the β-propeller family, 75.0% is not entirely satisfactory. After inspecting the tertiary structures of the proteins that cannot be well detected, we find that some of them actually have irregular features: (1) the circularly arranged β-sheets are separated by full domains or other structures, or the number of strands in the β-sheets is reduced or increased, as in 1UYP, 1KIT, 1OYG, 1CRU, 3SIL, 2SLI and 1FLG; and (2) the blades are distorted and differ from one another, as in 1S18, 1V3E, 1MS9, 4AAH, 1KB0, 1PFQ and 1K32. Although the symmetries in these proteins cannot be detected clearly, this does not mean they cannot be detected at all: the dot plots and pseudocolor plots can detect some of their similar substructures.

We also made sequence alignments of all the repeating fragments determined by our method using ClustalW (http://www.ebi.ac.uk/clustalw/index.html). We find that, except for several proteins such as 1tl2A, 1jtd, 1xyf, 1knm and 1hcd, the repeating fragments of over 90% of the proteins have very low sequence homology, with alignment scores generally less than 25%. We also find that all the repeating fragments of the jelly roll fold have very low sequence homology. Here we take 1fbl as an example, with the four repeats named 1fbl 1, 1fbl 2, 1fbl 3 and 1fbl 4. The alignment results are listed in Table 3; the alignment score between each pair of repeats is very low, much less than 25%. Of the 96 proteins whose internal repeats can be detected with our method, 87 (90.6%) have internal repeats with low sequence homology.

In order to make a comparison with OPAAS, we also applied our method to the dataset of version 1.55 of SCOP. We analyzed all the β-propellers and β-trefoils in this version of SCOP. In this version, there is no separate fold classification for jelly rolls, but under the all-β protein class the galactose-binding domain-like fold and the PNGaseF-like fold are annotated as containing a jelly roll topology, so we analyzed all the proteins in these two folds and compared the results with those given by OPAAS (see Table 4). The table shows that the results of OPAAS and ours are approximately compatible. However, it seems that OPAAS cannot detect the internal symmetry in jelly roll folds. We also made a sequence analysis of the detected substructures and find that, except for 1TL2, the sequences of the substructures of all other proteins have very low similarity.

As mentioned in Section 1, our method gives a strong two-fold, instead of eight-fold, repeat pattern for the imidazoleglycerol phosphate synthase (HisF, 1thf), although this protein forms a


(αβ)8-barrel and shows an approximate eight-fold symmetry from a qualitative overview (Fig. 6). From the pseudocolor plot of the Pearson’s correlation coefficients, we found that the substructures 1–121 and 122–242 are strongly correlated and are clearly similar to each other. This agrees almost exactly with the experiment (Lang et al., 2000; Höcker et al., 2001). Thus, accurate identification of the repeats will be very helpful in guiding the investigation of protein evolution.

In summary, we present a method to detect repetitive substructures using similarity matrix analysis and Pearson’s correlation analysis. We applied this method to proteins of the β-propeller, jelly roll and β-trefoil families, and the results show that, compared with other methods, it can not only detect the internal repetitive structures in proteins effectively but also identify their locations easily. Compared with the Fourier analysis of symmetry in protein structure, our method shows the lengths of the recurring substructures and determines their locations simply and directly, without redundant or complex processing. Unlike OPAAS, the method proposed here is not sensitive to secondary structures. A comparison with the results obtained using OPAAS shows no significant differences between the repeat lengths given by the two methods. Furthermore, we detect a slightly larger number of proteins with internal repeats. We expect that our method will be helpful in protein engineering and protein evolution analysis.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant nos. 30525037 and 30470412 and by the Foundation of the Ministry of Education of China.

References

Alexandrov, N.N., Takahashi, K., Go, N., 1992. Common spatial arrangements of backbone fragments in homologous and non-homologous proteins. J. Mol. Biol. 225, 5–9.
Barakat, M.T., Dean, P.M., 1991. Molecular structure matching by simulated annealing. III. The incorporation of null correspondences into the matching problem. J. Comp.-Aided Mol. Design 5, 107–117.
Chothia, C., Gough, J., Vogel, C., Teichmann, S.A., 2003. Evolution of the protein repertoire. Science 300, 1701–1703.
Cohen, F.E., Sternberg, J.E., 1980. On the prediction of protein structure: the significance of the root-mean-square deviation. J. Mol. Biol. 138, 321–333.
Doolittle, R.F., 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64, 287–314.
Edward, S.C., Shih, Hwang, M.J., 2003. Protein structure comparison by probability-based matching of secondary structure elements. Bioinformatics 19, 735–741.
Edward, Shih, S.C., Hwang, M.J., 2004. Alternative alignments from comparison of protein structures. Proteins 56, 519–527.
Fischer, D., Bachar, O., Nussinov, R., Wolfson, H., 1992. An efficient automated computer vision based technique for detection of three dimensional structural motifs in proteins. J. Biomol. Struct. Dynam. 9, 769–789.
Flores, T.P., Orengo, C.A., Moss, D.S., Thornton, J.M., 1993. These RMS differences are similar to values of 0.4 Å reported for comparisons of different structures of identical proteins. Protein Sci. 2, 1811–1826.
Forrer, P., Binz, H.K., Stumpp, M.T., Plückthun, A., 2004. Consensus design of repeat proteins. Chem. Bio. Chem. 5, 183–189.
Fred, E.C., Michael, J.E., 1980. On the prediction of protein structure: the significance of the root-mean-square deviation. J. Mol. Biol. 138, 321–333.
Henikoff, S., Greene, E.A., Pietrokovski, S., Bork, P., Attwood, T.K., Hood, L., 1997. Gene families: the taxonomy of protein paralogs and chimeras. Science 278, 609–614.
Höcker, B., Beismann-Driemeyer, S., Hettwer, S., Lustig, A., Sterner, R., 2001. Dissection of a (αβ)8-barrel enzyme into two folded halves. Nat. Struct. Bio. 8, 32–36.
Holm, L., Sander, C., 1993. Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138.
Ji, X.F., Chen, H.J., Xiao, Y., 2007. Hidden symmetries in the primary sequences of beta-barrel family. Comput. Biol. Chem. 31, 61–63.
Konopka, A.K., 1994. Sequences and codes: fundamentals of biomolecular cryptology. In: Smith, D. (Ed.), Biocomputing: Informatics and Genome Projects. Academic Press, San Diego, pp. 119–174.
Konopka, A.K., 2003. Sequence complexity and composition. In: Cooper, D.N. (Ed.), Nature Encyclopedia of the Human Genome, vol. 5. Nature Publishing Group Reference, London, pp. 217–224.
Konopka, A.K., 1997. Theoretical molecular biology. In: Meyers, R.A. (Ed.), Encyclopedia of Molecular Biology and Molecular Medicine, vol. 6. VCH Publishers, Weinheim, pp. 37–53.
Konopka, A.K., Smythers, G.W., 1987. DISTAN-a program which detects significant distances between short oligonucleotides. Comput. Appl. Biosci. 3, 193–201.
Konopka, A.K., Chatterjee, D., 1988. Distance analysis and sequence properties of functional domains in nucleic acids and proteins. Gene Anal. Technol. 5, 87–93.
Lang, D., Thoma, R., Henn-Sax, M., Sterner, R., Wilmanns, M., 2000. Structural evidence for evolution of the β/α barrel scaffold by gene duplication and fusion. Science 289, 1546–1550.
Levitt, M., 1976. A simplified representation of protein conformations for rapid simulation of protein folding. J. Mol. Biol. 104, 59–107.
Maiorov, V.N., Crippen, G.M., 1994. Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins. J. Mol. Biol. 235, 625–634.
McLachlan, A.D., 1979. Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins. J. Mol. Biol. 133, 557–563.
Michell, E.M., Artymiuk, P.J., Rice, D.W., Willett, P., 1989. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. J. Mol. Biol. 212, 151–166.
Miguel, A.A., Carolina, P.I., Chris, P.P., 2001. Protein repeats: structures, functions, and evolution. J. Struct. Biol. 134, 117–131.
Mosavi, L.K., Minor Jr., D.L., Peng, Z., 2002. Consensus-derived structural determinants of the ankyrin repeat motif. Proc. Natl. Acad. Sci. U.S.A. 99, 16029–16034.
Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 2004. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton, J.M., 1997. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108.
Richardson, J.S., 1981. The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339.
Salem, G.M., Hutchinson, E.G., Orengo, C.A., 1999. Correlation of observed fold frequency with the occurrence of local structural motifs. J. Mol. Biol. 287, 969–981.
Sali, A., Blundell, T., 1990. Definition of general topological equivalence in protein structure. J. Mol. Biol. 212, 403–428.
Subbarao, N., Haneef, I., 1991. Defining topological equivalences in macromolecules. Protein Eng. 4, 877–884.
Taylor, W.R., Orengo, C., 1989. Protein structure alignment. J. Mol. Biol. 208, 1–22.
Taylor, W.R., 1999. Protein structure comparison using iterated double dynamic programming. Protein Sci. 8, 654–665.
Taylor, W.R., Heringa, J., Baud, F., Flores, T.P., 2002. A Fourier analysis of symmetry in protein structure. Protein Eng. 15, 79–89.
Tirso, P., Raúl, G., Glay, C., Alfonso, V., 2003. Beta-propellers: associated functions and their role in human diseases. Curr. Med. Chem. 10, 505–524.
Vriend, G., Sander, C., 1991. Detection of common three-dimensional substructures in proteins. Proteins 11, 52–58.
Williams, A., Westhead, D.R., 2002. Sequence relationships in the legume lectin fold and other jelly rolls. Protein Eng. 15, 771–774.
Wootton, J.C., 1997. Simple sequences of protein and DNA. In: Bishop, M.J., Rawlings, C.J. (Eds.), DNA and Protein Sequence Analysis. IRL Press, Oxford, pp. 169–183.
Xu, R., Xiao, Y., 2005. A common sequence-associated physicochemical feature for proteins of beta-trefoil family. Comput. Biol. Chem. 29, 79–82.