Biochemical and Biophysical Research Communications 386 (2009) 537–543
The interstrand amino acid pairs play a significant role in determining the parallel or antiparallel orientation of β-strands
Ning Zhang a, Jishou Ruan b, Guangyou Duan a, Shan Gao a, Tao Zhang a,*
a Key Laboratory of Bioactive Materials, Ministry of Education and College of Life Science, Nankai University, Tianjin 300071, PR China
b Chern Institute of Mathematics and College of Mathematical Science, Nankai University, Tianjin 300071, PR China
Article info
Article history: Received 10 June 2009; Available online 18 June 2009
Keywords: β-Strands; Parallel or antiparallel orientation; Amino acid pairs; Singular value decomposition; Support vector machine
Abstract
It is widely held that β-strand pairs should not be treated in isolation, since other secondary structural elements (such as helices and coils), protein topology and protein tertiary structure all constrain β-strand pairing. However, to understand the underlying mechanisms of β-sheet formation, more specific aspects need to be studied separately. In this study, we focus on the parallel or antiparallel orientation of β-strands. First, statistical analysis was performed on the relative frequencies of the interstrand amino acid pairs within parallel and antiparallel β-strands. Features were then extracted from the statistical results by singular value decomposition. Using a support vector machine to distinguish the features extracted from the two types of β-strands, high accuracy was achieved (up to 99.4%). This suggests that the interstrand amino acid pairs play a significant role in determining the parallel or antiparallel orientation of β-strands. These results may provide useful information for developing algorithms to examine β-strand folding pathways, and could eventually contribute to protein structure prediction.
Crown Copyright © 2009 Published by Elsevier Inc. All rights reserved.
Introduction
In principle, structural information for protein sequences having no detectable homology to a protein of known structure could be obtained by predicting the arrangement of their secondary structural elements [1]. It is well known that the two predominant protein secondary structures are α-helices and β-sheets. Although some ab initio methods for protein structure prediction have been reported [2,3], the long-range interactions required to accurately predict β-sheet tertiary structure are still difficult to simulate [1,4]. In a β-sheet, the individual extended polypeptide segments, called β-strands, are arranged side by side to form a structure resembling a series of pleats. These strands are connected by interstrand hydrogen bonds in a parallel (N-termini of both strands at the same end) or antiparallel (otherwise) configuration. Adjacent β-strands bring distant residues into close contact with one another, and constitute a specific mode of amino acid pairing interaction [5–7], analogous to DNA base pairing (Fig. 1). Since assigning the β-strand topology of a β-sheet-containing protein would reduce the three-dimensional space to be searched by ab initio methods [1,8], there is a growing recognition of the importance of the strand-to-strand interactions among β-sheets [9]. Some approaches have been reported to predict supersecondary motifs, such as the β-hairpin [4,10,11], β-turns [12,13], γ-turns [14] and so on. Supersecondary folds are those arrangements of secondary structural elements occurring with a greater frequency than expected [15]; e.g., two antiparallel β-strands linked by a reverse turn are called a β-hairpin [16]. However, these are a small subset, accounting for approximately one-third of all proteins in the Protein Data Bank (PDB) [17]. Indeed, there have been no systematic methods to identify strand-to-strand interactions among proteins [18]. Several studies, including statistical studies examining the frequencies of nearest-neighbor amino acids in β-ladders, found a significantly different preference for certain interstrand amino acid pairs at non-hydrogen-bonded (nHB) and hydrogen-bonded (HB) sites [5–7,19–21]. Dou et al. [18] created a comprehensive database of interchain β-sheet (ICBS) interactions. In our previous study, we also developed the SheetsPair database [22] to compile both the interchain and the intrachain amino acid pairs. The question is: how do the amino acid pairs affect the β-sheet conformation? Zaremba and Gregoret [23] revealed that the contribution of interstrand residue pair interaction energies to protein stability was dependent upon the surrounding residues. Mandel-Gutfreund et al. [24] reported that pairs of residues on neighboring strands were neither more strongly conserved nor more strongly covariant than pairs of the same type in noninteracting positions. Many current studies suggest that it is not appropriate to treat each β-pair in isolation, since other secondary structural models (such as helices,
* Corresponding author. E-mail address: [email protected] (T. Zhang).
0006-291X/$ - see front matter Crown Copyright © 2009 Published by Elsevier Inc. All rights reserved. doi:10.1016/j.bbrc.2009.06.072
where $RF(A_i : A_j)$ is the relative frequency; $A_i$ and $A_j$ stand for the two pairing amino acids; $P(A_i : A_j)$ is the observed frequency of the amino acid pair $A_i : A_j$; and $P(A_i)$, $P(A_j)$ are the frequencies of the amino acids $A_i$, $A_j$ obtained from all the protein chain sequences in the dataset. The entire dataset is divided into two subsets: the parallel subset and the antiparallel subset. $RF(A_i : A_j)$ is calculated for both subsets separately. Thus, two relative frequency matrices (RFM) are obtained: one (named RFMp) for the parallel subset and the other (named RFMap) for the antiparallel subset. Note that each of the matrices is an upper triangular matrix since only 210 possible amino acid pairs are considered.
Extracting features
Fig. 1. A schematic diagram of two interacting β-strands. (a) In parallel configuration and (b) in antiparallel configuration. The boxes around amino acids represent three interstrand amino acid pairs. Here, pairing residues were defined as those adjacent to each other in neighboring strands forming either two hydrogen bonds or no hydrogen bonds (see [5]). An amino acid pair within parallel β-strands has one HB (hydrogen-bonded) residue and one nHB (non-hydrogen-bonded) residue, while residues forming an interstrand pair within antiparallel strands are either both HB or both nHB.
Transform β-strand pairs into vectors. In further steps, we will only use the RFMp and RFMap matrices to extract features for parallel and antiparallel β-strands, respectively. For a given pair of β-strands, all amino acid pairs from the common part of the two β-strands are extracted. If the β-strand pair is parallel, the RFMp matrix is used; otherwise the RFMap matrix is used. Using the relevant matrix, the β-strand pair is transformed into a 210-dimensional vector $Q = (q_1, q_2, \ldots, q_k, \ldots, q_{210})$. For an amino acid pair $A_i : A_j$ $(1 \le i \le j \le 20)$, $k$ is calculated by:
$$k = \frac{(42 - i)(i - 1)}{2} + j - i + 1$$
Next, $q_k$ is assigned by:
$$q_k = \mathrm{count}(A_i : A_j) \cdot m_{ij}$$
coils), protein topology and protein tertiary structures may limit β-strand pairing [1,4,25]. In view of the complexity of protein folding pathways, such considerations are undoubtedly valid. However, taken alone this view is vague and impractical. Interactions between two pairing β-strands consist of at least three aspects: (i) finding the partner β-strand(s), (ii) determining the relative orientation (i.e., parallel or antiparallel) and (iii) shifting the relative offset position. To understand the underlying mechanisms of β-sheet formation, separate studies must be performed on at least these three aspects. In the present study, we focus only on the second aspect.
where $m_{ij}$ is an element of the RFMp or the RFMap matrix (at row $i$ and column $j$), and $\mathrm{count}(A_i : A_j)$ is the observed occurrence of the pair type $A_i : A_j$ in the considered pair of β-strands. The 210-dimensional vector is sparsely populated. We calculate the $Q$ vectors for all β-strand pairs in the dataset.
Generate average encoding vectors. This step generates average encoding vectors, which can be directly used for encoding each β-strand pair. Each sparse vector $Q$ is transformed into a $1 \times 210$ matrix, $M_q$. The matrix $M_{qt}$ is then calculated by:
$$M_{qt} = M_q^{T} M_q$$
Note that $M_{qt}$ is a $210 \times 210$ matrix. One $M_{qt}$ matrix is obtained for each β-strand pair. We sum up all the $M_{qt}$ matrices and calculate the average separately for the parallel and the antiparallel β-strands in the dataset:
Methods
Dataset
In a previous study, we developed the SheetsPair Database [22]. All the protein structures in that database were obtained from the PDB. The dataset used in this research was taken from the SheetsPair Database. After excluding any protein chains that have no β-sheets, any patterns with a chain break or heteroatom and any with uncertain structures, 20,371 protein chains were extracted. These contain 54,519 parallel and 107,727 antiparallel interacting β-strand pairs, consisting of 756,158 interstrand amino acid pairs. The dataset is available online and may be downloaded for academic use at (http://sky.nankai.edu.cn/script/sky/english/bioinfo/PADS.zip).
$$M_{pave} = \frac{1}{n_p} \sum_{i=1}^{n_p} M_{qt}^{para}$$
$$M_{apave} = \frac{1}{n_{ap}} \sum_{i=1}^{n_{ap}} M_{qt}^{antipara}$$
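The averaging step can be illustrated with a minimal sketch in plain Python. The function name `average_outer_product` and the dense list-of-lists representation are assumptions made for illustration; the toy vectors have 3 components rather than 210, and this is not the authors' implementation.

```python
def average_outer_product(q_vectors):
    """Average of the outer products Q^T Q over a list of equal-length vectors."""
    n = len(q_vectors)
    dim = len(q_vectors[0])
    total = [[0.0] * dim for _ in range(dim)]
    for q in q_vectors:
        for r, q_r in enumerate(q):
            if q_r == 0.0:
                continue  # Q is sparse, so most rows contribute nothing
            for c, q_c in enumerate(q):
                total[r][c] += q_r * q_c
    return [[value / n for value in row] for row in total]


# Toy input: two 3-dimensional vectors standing in for 210-dimensional Q vectors.
m_ave = average_outer_product([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
```

In this sketch the function would be called once on the Q vectors of the parallel pairs and once on those of the antiparallel pairs, giving one averaged matrix per orientation.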
Matrices $M_{pave}$ and $M_{apave}$ are both $210 \times 210$, and $n_p$ and $n_{ap}$ are the numbers of parallel and antiparallel β-strand pairs in the dataset. Subsequently, both the $M_{pave}$ and $M_{apave}$ matrices are decomposed using singular value decomposition (SVD). The singular value decomposition of a matrix $M$ is defined by:
Amino acid pairing preferences
$$M = U S V^{T}$$
In this study, we do not differentiate between interchain and intrachain pairs, nor do we differentiate between HB/nHB pairs or HB/nHB residues (Fig. 1). We define 210 possible amino acid pairs for both parallel and antiparallel β-strands, regardless of the order of the two amino acids in one pair. The relative frequencies (RF) of the 210 amino acid pairs are calculated by:
where $M$ stands for the $M_{pave}$ or the $M_{apave}$ matrix, $U$ and $V$ are orthogonal (unitary), and $S$ (with the same dimensions as $M$) is a $210 \times 210$ diagonal matrix. The singular values $\sigma_i$ are the diagonal entries of the $S$ matrix and are arranged in descending order:
$$RF(A_i : A_j) = \frac{P(A_i : A_j)}{P(A_i)\,P(A_j)}, \quad 1 \le i \le 20,\; 1 \le j \le 20,\; i \le j$$
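A minimal Python sketch of the relative frequency calculation follows. The function name `relative_frequencies`, the input formats, and the normalization of each frequency by the total number of observed pairs or residues are assumptions made for illustration; the text does not spell out these details.

```python
from collections import Counter
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # same order as the table headers


def relative_frequencies(pairs, sequences):
    """Return {(a, b): RF} for the 210 unordered pairs, with a before b in AMINO_ACIDS."""
    order = {aa: n for n, aa in enumerate(AMINO_ACIDS)}
    # (L, V) and (V, L) are counted as the same pair type.
    pair_counts = Counter(tuple(sorted(p, key=order.__getitem__)) for p in pairs)
    total_pairs = sum(pair_counts.values())
    residue_counts = Counter(aa for seq in sequences for aa in seq)
    total_residues = sum(residue_counts.values())
    rf = {}
    for a, b in combinations_with_replacement(AMINO_ACIDS, 2):
        p_pair = pair_counts[(a, b)] / total_pairs if total_pairs else 0.0
        p_a = residue_counts[a] / total_residues if total_residues else 0.0
        p_b = residue_counts[b] / total_residues if total_residues else 0.0
        rf[(a, b)] = p_pair / (p_a * p_b) if p_a and p_b else 0.0
    return rf


# Toy data (hypothetical): four interstrand pairs and two chain sequences.
rf = relative_frequencies(
    [("L", "V"), ("V", "L"), ("A", "A"), ("I", "V")], ["ALVI", "AVLA"]
)
```

Under these assumptions the function would be run once on the parallel subset and once on the antiparallel subset to obtain the two matrices.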
$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_{210} \ge 0$$
The $\sigma_i$ are the singular values of $M$, and the columns of $U$ and $V$ are the left and right singular vectors of $M$. In order to reduce the
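feature dimension, we select the top 21 (10%) singular values $\sigma_1, \sigma_2, \ldots, \sigma_{21}$ by estimating the cumulative contribution $AC(\sigma_i)$ for each $\sigma_i$:
$$AC(\sigma_i) = \frac{\sum_{j=1}^{i} \sigma_j}{\sum_{j=1}^{210} \sigma_j} \times 100\%$$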
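The cumulative contribution can be sketched in a few lines of Python. The function name `cumulative_contributions` and the toy singular values are illustrative assumptions; in the study the input would be the 210 singular values of the averaged matrix.

```python
from itertools import accumulate


def cumulative_contributions(singular_values):
    """Return AC(sigma_i) in percent for singular values sorted in descending order."""
    ordered = sorted(singular_values, reverse=True)
    total = sum(ordered)
    return [100.0 * partial / total for partial in accumulate(ordered)]


# Toy example with four singular values instead of 210.
ac = cumulative_contributions([4.0, 3.0, 2.0, 1.0])
```

Reading off the entry at position 21 of such a list is what gives the percentage reported for the top 21 singular values.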
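$$\mathrm{Specificity}^{+} = \frac{TP}{TP + FP} \times 100\%$$
$$\mathrm{Sensitivity}^{+} = \frac{TP}{TP + FN} \times 100\%$$
$$\mathrm{Specificity}^{-} = \frac{TN}{TN + FN} \times 100\%$$
$$\mathrm{Sensitivity}^{-} = \frac{TN}{TN + FP} \times 100\%$$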
According to the 21 selected singular values, we then extract the corresponding 21 vectors $W_i$ $(1 \le i \le 21)$, each a 210-dimensional vector, from the corresponding columns of the matrix $U$. We obtain two sets of $W_i$ vectors (each set contains 21 vectors), one for parallel and one for antiparallel:
Results
$W^{p}_{1}, W^{p}_{2}, \ldots, W^{p}_{21}$
Relative frequencies of amino acid pairing
and $W^{ap}_{1}, W^{ap}_{2}, \ldots, W^{ap}_{21}$
This procedure is expected to remove the contributions of dimensions with small singular values. The two sets of $W_i$ vectors are the average encoding vectors, which will be used for β-strand pair encoding.
Encode β-strand pairs. Each β-strand pair is encoded into a 21-dimensional vector $X = (x_1, x_2, \ldots, x_{21})$ defined as:
A second-order statistical analysis was performed on the relative frequencies $RF(A_i : A_j)$ of the 210 possible interstrand amino acid pairs on every pair of β-strands. The relative frequency matrix for parallel β-strands (RFMp) is shown in Table 1, and the one for antiparallel β-strands (RFMap) is given in Table 2.
$$x_i = W_i \cdot Q$$
The cumulative contributions of the singular values
where $Q$ is the sparse vector calculated first for each β-strand pair, and the $W_i$ vector is $W^{p}_{i}$ for parallel β-strands and $W^{ap}_{i}$ for antiparallel β-strands. To sum up, using the encoding scheme described above, a pair of β-strands can be expressed as a 21-dimensional vector $X$. The vector $X$ is the final feature.
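The final projection step can be sketched as follows. This is a toy illustration only: the function name `encode_strand_pair`, the dictionary representation of the sparse Q vector (1-based index k mapped to its value) and the 4-dimensional example vectors are assumptions, whereas the study uses 210 dimensions and 21 encoding vectors.

```python
def encode_strand_pair(q_sparse, w_vectors):
    """Compute x_i = W_i . Q for each encoding vector W_i.

    q_sparse maps the 1-based component index k to the nonzero value q_k.
    """
    return [
        sum(w[k - 1] * value for k, value in q_sparse.items())
        for w in w_vectors
    ]


# Toy example: 4 dimensions and 2 encoding vectors instead of 210 and 21.
w_toy = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
x = encode_strand_pair({1: 2.0, 2: 4.0}, w_toy)
```

Because Q has only a handful of nonzero components per strand pair, iterating over the sparse entries avoids touching the other components.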
Singular values of the matrix $M_{pave}$ and the matrix $M_{apave}$ (data not shown) are estimated by the SVD method. The cumulative contributions of the descending-ordered singular values $\sigma_i$ are depicted in Fig. 2. The cumulative contributions of the top 21 singular values are marked in the graph.
Distinguishing between parallel and antiparallel β-strands
How are these 21-dimensional vectors $X$ distributed? Do they have a separate distribution for parallel and for antiparallel strands? A distribution of points in a space of three or fewer dimensions can easily be depicted by a scatter plot. However, it is impossible to draw a scatter plot to illustrate the distribution of 21-dimensional vectors. Instead, we adopt a well-known pattern recognition method, the support vector machine (SVM), to attempt to separate the two types of vectors. The aim is to show whether or not the vectors from the parallel and antiparallel β-strands can be separated from each other in such a 21-dimensional space. Here, we use the LIBSVM 2.83 software package [26]. In this study, we use "1" to stand for the vectors extracted from parallel β-strands, while "−1" stands for those from antiparallel β-strands. The efficiency is assessed by sevenfold cross-validation.
Performance measures
We use the following measurements: TP (the number of correctly identified parallel β-strands), TN (the number of correctly identified antiparallel β-strands), FP (the number of antiparallel β-strands falsely identified as parallel β-strands), FN (the number of parallel β-strands falsely identified as antiparallel β-strands) and N (the total number of β-strand pairs in the dataset, $N = TP + TN + FP + FN$).
$$\mathrm{Accuracy} = \frac{TP + TN}{N} \times 100\%$$
The Matthews correlation coefficient (MCC) is defined as:
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
If there is no relationship between the identified values and the actual values, the correlation coefficient is 0. A perfect fit gives a coefficient of 1.0. There are other measures as well, defined by the following formulas:
Results of distinguishing between parallel and antiparallel β-strands
The results from distinguishing between the parallel and antiparallel vectors using SVM are shown in Table 3.
An example of distinguishing between parallel and antiparallel β-strands
We give a specific protein example (PDB code: 1HZT) to illustrate the coding procedure (Fig. 3). The coding vectors and the differentiating results of protein 1HZT are shown in Table 4.
Discussion
Steward and Thornton [1] argued that the energetic contribution of individual interstrand residue pairs was one of many factors determining the global fold of a protein, but was insufficient to predict a protein's three-dimensional conformation. Nevertheless, studies ought to be performed separately on more specific aspects. Determining the parallel or antiparallel orientation of β-strands is one of the most important steps in β-sheet formation. Indeed, there are notable differences between the interstrand amino acid pairs on parallel and antiparallel β-strands [1], although the relative frequencies of some types of residue pairs are similar. For instance, antiparallel β-strands favor polar residues, as opposed to aromatic residues. Antiparallel β-strands have a larger proportion of exposed residue positions, and are more likely to contain a β-bridge with an edge strand, which may be solvent exposed on one edge [1]. Indeed, antiparallel and parallel β-strands were considered separately in many recent studies, such as [1,5,18]. However, from these preference differences, it is not immediately clear how the interstrand amino acid pairs contribute to the formation of β-strands. We investigated this question by extracting features (21-dimensional vectors) from the amino acid pairs within parallel and antiparallel β-strands. From Fig. 2, it is evident that the cumulative contributions of the singular values derived from the $M_{pave}$ and $M_{apave}$ matrices are significantly different. The cumulative
Table 1 Relative frequency matrix of 210 amino acid pairs for parallel β-strands (RFMp) computed from the current dataset.
       A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
A   1.01  1.89  1.09  0.80  2.10  1.77  1.59  4.36  1.25  2.09  1.73  1.21  0.85  1.11  1.00  0.93  2.13  4.77  2.29  2.03
C      —  2.84  1.32  0.58  4.09  1.75  1.40  4.96  1.18  2.65  2.68  1.04  0.95  0.65  0.73  1.19  1.85  5.16  3.24  2.05
D      —     —  0.61  0.78  1.35  1.28  1.66  1.66  1.40  1.12  1.35  1.68  0.48  0.62  1.11  1.01  1.58  2.18  0.96  1.33
E      —     —     —  0.23  0.94  0.90  0.87  1.70  1.49  0.92  0.77  0.75  0.52  0.85  1.33  0.68  1.27  1.71  1.16  1.37
F      —     —     —     —  1.89  2.25  2.07  6.31  1.12  3.34  2.70  1.73  0.84  1.13  1.22  1.35  2.29  6.63  2.30  2.78
G      —     —     —     —     —  0.96  1.53  3.06  1.02  2.25  2.07  1.36  0.86  0.99  0.91  1.44  1.58  3.58  1.96  2.28
H      —     —     —     —     —     —  1.16  2.76  1.04  1.79  1.67  1.24  0.85  1.23  1.34  1.78  2.80  3.50  1.49  2.27
I      —     —     —     —     —     —     —  6.24  1.59  6.68  5.26  1.59  1.41  1.21  1.35  1.44  3.23 13.24  4.73  5.82
K      —     —     —     —     —     —     —     —  0.54  0.77  0.71  0.63  0.50  0.60  0.64  0.90  1.28  1.94  1.14  1.62
L      —     —     —     —     —     —     —     —     —  1.65  2.67  0.93  0.82  0.98  1.15  1.07  1.97  6.48  2.12  3.05
M      —     —     —     —     —     —     —     —     —     —  1.41  1.28  0.63  0.75  0.88  1.27  1.60  4.98  1.42  2.30
N      —     —     —     —     —     —     —     —     —     —     —  1.43  0.45  1.25  0.62  1.08  2.43  2.04  1.39  1.51
P      —     —     —     —     —     —     —     —     —     —     —     —  0.27  0.65  0.57  0.52  0.74  1.44  0.72  1.10
Q      —     —     —     —     —     —     —     —     —     —     —     —     —  0.32  0.77  0.95  1.25  1.56  1.08  1.64
R      —     —     —     —     —     —     —     —     —     —     —     —     —     —  0.29  0.71  1.62  2.08  1.13  1.41
S      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  0.71  1.49  2.16  2.08  1.39
T      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  1.72  4.12  2.62  2.28
V      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  8.34  4.57  5.59
W      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  0.72  3.27
Y      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  2.27
Table 2 Relative frequency matrix of 210 amino acid pairs for antiparallel β-strands (RFMap) computed from the current dataset.
       A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
A   0.88  1.91  0.93  1.01  2.70  1.56  1.34  2.82  1.17  1.80  1.69  0.87  0.68  1.08  1.26  1.20  2.06  3.43  2.65  2.71
C      —  6.78  0.75  1.25  3.27  1.90  3.14  3.60  2.11  2.46  2.78  1.55  1.72  2.38  1.30  2.06  2.28  4.17 13.93  4.16
D      —     —  0.45  0.92  1.14  1.38  1.71  1.32  1.70  0.89  0.96  1.54  0.43  1.18  1.64  1.30  2.04  1.65  1.26  1.82
E      —     —     —  0.51  1.21  1.06  1.77  1.85  2.51  1.12  1.20  1.25  0.65  1.37  2.29  1.21  2.18  2.17  1.82  2.05
F      —     —     —     —  3.15  2.54  2.50  4.22  1.50  3.03  2.75  1.58  1.10  2.00  1.78  2.04  2.47  5.71  5.78  5.91
G      —     —     —     —     —  0.90  1.75  2.23  0.94  1.51  1.66  1.25  0.81  0.97  1.13  1.36  1.89  2.76  2.89  3.14
H      —     —     —     —     —     —  1.42  2.79  1.62  1.92  1.85  1.66  1.03  1.46  1.86  1.90  3.34  3.63  3.10  2.83
I      —     —     —     —     —     —     —  2.64  2.11  3.24  3.09  1.18  0.77  1.75  1.84  1.60  2.67  5.83  4.10  5.03
K      —     —     —     —     —     —     —     —  0.67  1.22  1.69  1.13  0.65  1.56  1.02  1.47  2.88  2.68  2.75  3.85
L      —     —     —     —     —     —     —     —     —  1.12  2.14  1.11  0.83  1.36  1.24  1.30  1.87  3.82  3.38  3.82
M      —     —     —     —     —     —     —     —     —     —  1.29  0.81  1.10  1.45  1.26  1.56  1.76  3.19  3.28  3.12
N      —     —     —     —     —     —     —     —     —     —     —  0.80  0.59  1.27  1.09  1.62  2.11  1.97  2.06  2.45
P      —     —     —     —     —     —     —     —     —     —     —     —  0.33  0.66  0.94  0.62  1.10  1.47  1.72  1.52
Q      —     —     —     —     —     —     —     —     —     —     —     —     —  0.65  1.81  1.51  3.27  2.30  2.85  2.59
R      —     —     —     —     —     —     —     —     —     —     —     —     —     —  0.65  1.37  2.31  2.47  3.11  3.02
S      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  1.17  3.00  2.28  2.55  2.42
T      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  2.67  3.76  3.13  5.25
V      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  4.21  5.08  5.98
W      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  4.67  6.43
Y      —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —     —  3.67
contributions of the top 21 singular values are 92.16% for parallel, but only 63.46% for antiparallel. Since the two matrices are both average matrices, this indicates that the distribution centers of the two types of β-strands are different. In addition, the selection of the top 21 singular values (10%) is reasonable, because the cumulative contributions of the top 21 exceed 60% in both cases. Although the antiparallel value is smaller, at 63.46%, the parallel value reaches 92.16%, so including further singular values would add little information for the parallel case. Overall, the top 21 singular values capture most of the information. Therefore, we select the top 21 singular values (10%) to balance the tradeoff between distinguishing the two types of β-strand and reducing the dimension of the input space. In order to examine whether or not features extracted from the two types of β-strand can be separated from each other in such a 21-dimensional space, we adopt the support vector machine (SVM) method. It is somewhat surprising that we obtain almost perfect results: up to 99.40% accuracy, 0.9867 MCC, and over 98% sensitivity and specificity. This indicates that the hyperplane output of the SVM has grasped the complicated relationship between
the interstrand amino acid pairs and the β-strand orientations. Although the method presented here cannot be directly used for the prediction of parallel and antiparallel orientation, Table 4 shows that features derived from parallel and antiparallel β-strands in protein 1HZT can be clearly distinguished from each other. Parallel and antiparallel arrangements of β-strands not only differ in their pairing preferences, but can also be separated from each other very well based on the amino acid pairing features extracted in 21-dimensional space. Note that we do treat each pair of β-strands in isolation during the feature extraction procedure. The RFMp and RFMap matrices do not contain information either about other secondary structural models, or about protein topology and protein tertiary structures. Therefore, although other factors should also be taken into account when examining protein folding pathways, our results suggest that the parallel or antiparallel orientation of β-strands is considerably correlated with the interstrand amino acid pairs. We see that a small number of interstrand amino acid pairs appear to play a significant role in the determination of parallel or antiparallel orientation during β-sheet formation. Although the result presented here is not
Fig. 2. The cumulative contributions of the descending-ordered singular values extracted from the $M_{pave}$ matrix (for parallel β-strands) and the $M_{apave}$ matrix (for antiparallel β-strands). The singular values were estimated by singular value decomposition (SVD). The cumulative contributions of the top 21 singular values are marked in the graph: 92.16% for parallel and 63.46% for antiparallel.
sufficient to deduce a protein's fold, it does allow us to differentiate between contributions made by the amino acid pairs and those due to the surroundings. A β-strand's potential forces will lead it to pair with another β-strand. Steward and Thornton [1] indicated that a single β-strand
was still able to recognize a noninteracting β-strand with greater accuracy than in the case of recognition between two random sequences. There are many kinds of forces between a pair of interacting β-strands. These forces include hydrogen bonds, van der Waals forces, electrostatic interactions, ionic bonds, hydrophobic effects, etc. Among all these forces, the hydrophobic effect is the most important factor [27]. Parallel and antiparallel β-strands differ in their relative direction. Just as in DNA pairing, it is conceivable that if we reverse the relative direction of the two strands (i.e., change parallel into antiparallel, or vice versa), the interstrand amino acid pairs will be entirely different. Thus, the forces between the amino acid pairs will be entirely different, which may result in a 'collapsed' folded state. Studies [16,27] have also shown that the association of hydrophobic side-chains appears to drive the formation of simple multistranded β-sheet structures. Similarly, Cys-Cys pairs are favoured in antiparallel β-strands due to the formation of disulfides, but not in parallel β-strands [5,28]. These strong disulfide forces would be another factor in determining the orientation. Again, this kind of disulfide bond cannot be formed if we reverse the relative direction of the two interacting β-strands. In this study, we did not consider HB/nHB pairs or residues. The high success of our results suggests that hydrogen bonds are not very important for determining the parallel or antiparallel orientation of β-strands. One reason may be that hydrogen bonds are formed between the carbonyl oxygen atom of one residue and the amino hydrogen atom of another, regardless of the type of residue. Therefore, independent of other factors, the change in the effects of hydrogen bonding is minimal if the direction is reversed. Consistent with other studies [9,16], hydrogen bonding plays a
Table 3 Results of distinguishing between the 21-dimensional vectors for parallel and antiparallel β-strands using SVM (sevenfold cross-validation test; RBF kernel function; c and gamma set to the default values in LIBSVM 2.83).
TP      FP    FN   TN       Specificity+   Sensitivity+   Specificity−   Sensitivity−   Accuracy   MCC
54519   972   0    106755   98.25%         100.00%        100.00%        99.10%         99.40%     0.9867
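As a quick consistency check, the reported measures can be recomputed from the four counts in Table 3. The sketch below is illustrative only: the function name `performance` and the dictionary keys are assumptions, and the MCC uses the standard Matthews formula.

```python
from math import sqrt


def performance(tp, fp, fn, tn):
    """Recompute the performance measures (percentages, plus MCC) from counts."""
    n = tp + fp + fn + tn
    return {
        "specificity_plus": 100.0 * tp / (tp + fp),
        "sensitivity_plus": 100.0 * tp / (tp + fn),
        "specificity_minus": 100.0 * tn / (tn + fn),
        "sensitivity_minus": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / n,
        # Standard Matthews correlation coefficient.
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Counts taken from Table 3.
measures = performance(tp=54519, fp=972, fn=0, tn=106755)
```

Rounded to the precision of the table, these values agree with the reported 98.25%, 100.00%, 100.00%, 99.10%, 99.40% and 0.9867.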
Fig. 3. The protein 1HZT example. (A) The structure of the protein produced using RASMOL. Protein 1HZT is an α/β protein with ten β-strands numbered from 1 to 10, forming seven different strand pairs. (B) The sequences of the 10 β-strands with their initial and ending residue numbers. (C) The coding procedure for the strand pair "1–8", shown as an example. Note that the RFMp matrix and the $W^{p}_{1}, W^{p}_{2}, \ldots, W^{p}_{21}$ vectors are used in Step 2 and Step 3, respectively, since it is a parallel β-strand pair. In Step 2, the sparsely populated 210-dimensional vector Q is shown with the index of each component as 1: 2: ... 210:, and components with value 0 are omitted to save space. In Step 3, only the first five components of the vector X are shown to save space.
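Step 2 of the coding procedure can be reproduced with a short Python sketch for the parallel pair "1–8" (common parts HL and EV). The function names `pair_index` and `sparse_q` and the dictionary holding the two needed RFMp entries are illustrative assumptions; the amino acid order follows the headers of Tables 1 and 2, and both strands are assumed to be written in pairing order.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # order used in Tables 1 and 2


def pair_index(a, b):
    """Map an unordered amino acid pair to k in 1..210 (the formula for k in Methods)."""
    i, j = sorted((AMINO_ACIDS.index(a) + 1, AMINO_ACIDS.index(b) + 1))
    return (42 - i) * (i - 1) // 2 + j - i + 1


def sparse_q(strand1, strand2, rfm):
    """Build the sparse Q vector {k: q_k}, where q_k = count(pair) * m_ij.

    rfm maps k to the relative frequency m_ij of that pair type.
    """
    q = {}
    for a, b in zip(strand1, strand2):
        k = pair_index(a, b)
        q[k] = q.get(k, 0.0) + rfm[k]  # adding m_ij once per occurrence
    return q


# RFMp entries for E:H and L:V, read from Table 1.
rfm_p = {pair_index("E", "H"): 0.87, pair_index("L", "V"): 6.48}
q_1_8 = sparse_q("HL", "EV", rfm_p)
```

The resulting components, 61:0.87 and 153:6.48, match the Q entry listed for pair 1–8 in Table 4.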
Table 4 Coding vectors and orientation distinguishing results of β-strand pairs in protein 1HZT.
Pair  Strand ID (a)  Sequence of the common part (b)  Orientation  Q                                                     X (first five components)            Distinguished as (c)
1–8   A1, A2         32–33: HL; 116–117: EV           Para         61:0.87 153:6.48                                      (0.5, 1.74, 3.92, 4.73, 0.76, ...)   Para
8–7   A2, A3         113–117: VENEV; 107–103: TARYR   Anti         4:1.01 74:2.05 169:1.09 193:2.47 202:3.76             (0.43, 0.44, 0.05, 0.26, 0.44, ...)  Anti
5–2   B1, B2         66–68: VCG; 37–35: SSF           Anti         35:2.06 76:2.54 198:2.28                              (0.14, 0.15, 0.02, 0.09, 0.11, ...)  Anti
2–9   B2, B3         36–40: SSWLF; 120–124: VFAAR     Para         10:2.09 19:2.29 85:1.22 86:1.35 198:2.16              (0.03, 0.06, 0.05, 0, 0.08, ...)     Para
9–6   B3, B4         120–122: VFA; 98–96: ISE         Anti         4:1.01 86:2.04 130:5.83                               (5.21, 2.43, 0.17, 0.46, 0.58, ...)  Anti
4–3   C1, C2         62–64: WTN; 50–48: RTV           Anti         172:1.97 194:3.11 201:2.67                            (0.16, 0.11, 0, 0.07, 0.11, ...)     Anti
3–10  C2, C3         46–51: LLVTRR; 141–136: WQYDMV   Anti         54:2.04 149:1.36 154:3.38 160:1.26 193:2.47 207:5.98  (1.08, 2.36, 1.27, 4.31, 2.85, ...)  Anti
(a) Indicates the sheet ID and strand number (as described in the PDB file) of the two paired β-strands.
(b) Shows the sequence of the common part of each pair of β-strands. Also shown are the initial and ending residue numbers of the segment.
(c) The distinguishing results by SVM. The SVM used here is trained with a randomly selected 6/7 of the whole dataset, with protein 1HZT not contained in the training set.
smaller role in stabilizing protein folded structures in water, because water forms hydrogen bonds with peptide amide groups. The $M_{pave}$ and $M_{apave}$ matrices, used for extracting features, are derived from statistical analysis of the amino acid pairs. On one hand, our results show the importance of statistical analysis in protein structure studies. In particular, statistical information could provide a starting point for de novo computational design methods that are now becoming successful for short, single-chain proteins [29], as well as for methods of protein structure prediction and the understanding of protein folding mechanisms [30,31]. Fooks et al. [5] also indicated that such statistical analysis results would be useful for protein structure prediction. On the other hand, the results also show the importance of the amino acid pairs. There is a growing recognition of the importance of the amino acid pairs among β-sheets [5,9,18,19,22]. Exploration of the interstrand amino acid pairs could provide a promising and feasible means of identifying β-strand interactions.
Conclusion
In this study, a statistical analysis was performed on the relative frequencies of 210 possible interstrand amino acid pairs within every interacting β-strand pair. By extracting features and using the support vector machine pattern recognition method, we examined how the interstrand amino acid pairs contribute to the determination of the orientation of β-strands in protein folding. Somewhat surprisingly, we obtained almost perfect results (up to 99.40% accuracy, 0.9867 MCC). This suggests that a small number of interstrand amino acid pairs play a significant role in determining the parallel or antiparallel orientation of β-strands. However, we have only discussed β-strand orientation; the effects of the environment cannot be ignored during the process of β-sheet formation.
Our results provide useful information for the development of algorithms to further investigate β-strand folding pathways, and may eventually contribute to protein structure prediction.
Acknowledgments
The authors thank Michelle Hanlon from Cross Cancer Institute, Edmonton, Alberta, Canada, for her kind help. We are very grateful to the editor and referees for their careful review and valuable comments on our manuscript. This work was supported by grants
from the NSFC (30870827) and the National 863 Project (2008AA02Z129).
References
[1] R.E. Steward, J.M. Thornton, Prediction of strand pairing in antiparallel and parallel β-sheets using information theory, Proteins 48 (2002) 178–191.
[2] C. Bystroff, V. Thorsson, D. Baker, HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins, J. Mol. Biol. 301 (2000) 173–190.
[3] D.J. Osguthorpe, Ab initio protein folding, Curr. Opin. Struct. Biol. 10 (2000) 146–152.
[4] M. Kuhn, J. Meiler, D. Baker, Strand-loop-strand motifs: prediction of hairpins and diverging turns in proteins, Proteins 54 (2004) 282–288.
[5] H.M. Fooks, A.C.R. Martin, D.N. Woolfson, R.B. Sessions, E.G. Hutchinson, Amino acid pairing preferences in parallel β-sheets in proteins, J. Mol. Biol. 356 (2006) 32–44.
[6] M.A. Wouters, P.M.G. Curmi, An analysis of side-chain interactions and pair correlations within antiparallel β-sheets: the differences between backbone hydrogen-bonded and non-hydrogen bonded residue pairs, Proteins 22 (1995) 119–131.
[7] G.E. Hutchinson, R.B. Sessions, J.M. Thornton, D.N. Woolfson, Determinants of strand register in antiparallel β-sheets of proteins, Protein Sci. 7 (1998) 2287–2300.
[8] A. Kolinski, M.R. Betancourt, D. Kihara, P. Rotkiewicz, J. Skolnick, Generalized comparative modeling (GENECOMP): a combination of sequence comparison, threading, and lattice modeling for protein structure prediction and refinement, Proteins 44 (2001) 133–149.
[9] J.S. Nowick, Exploring β-sheet structure and interactions with chemical model systems, Acc. Chem. Res. 41 (10) (2008) 1319–1330.
[10] X. Cruz, E.G. Hutchinson, A. Shepherd, J.M. Thornton, Toward predicting protein topology: an approach to identifying beta hairpins, Proc. Natl. Acad. Sci. USA 99 (2002) 11157–11162.
[11] X.Z. Hu, Q.Z. Li, Prediction of the β-hairpins in proteins using support vector machine, Protein J. 27 (2008) 115–122.
[12] K.C. Chou, Prediction of beta-turns, J. Pept. Res. 49 (1997) 120–144.
[13] Y.D. Cai, X.J. Liu, X.B. Xu, K.C. Chou, Support vector machines for the classification and prediction of beta-turn types, J. Pept. Sci. 8 (2002) 297–301.
[14] S. Jahandideh, A.S. Sarvestani, P. Abdolmaleki, M. Jahandideh, M. Barfeie, γ-Turn types prediction in proteins using the support vector machines, J. Theor. Biol. 249 (2007) 785–790.
[15] G.M. Salem, E.G. Hutchinson, C.A. Orengo, J.M. Thornton, Correlation of observed fold frequency with the occurrence of local structural motifs, J. Mol. Biol. 287 (1999) 969–981.
[16] M.S. Searle, B. Ciani, Design of β-sheet systems for understanding the thermodynamics and kinetics of protein folding, Curr. Opin. Struct. Biol. 14 (2004) 458–464.
[17] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The Protein Data Bank, Nucleic Acids Res. 28 (2000) 235–242.
[18] Y. Dou, P.F. Baisnée, G. Pollastri, Y. Pécout, J. Nowick, P. Baldi, ICBS: a database of interactions between protein chains mediated by β-sheet formation, Bioinformatics 20 (16) (2004) 2767–2777.
[19] M. Jager, M. Dendle, A.A. Fuller, J.W. Kelly, A cross-strand Trp-Trp pair stabilizes the hPin1 WW domain at the expense of function, Protein Sci. 16 (2007) 2306–2313.
[20] A.G. Cochran, R.T. Tong, M.A. Starovasnik, E.J. Park, R.S. McDowell, J.E. Theaker, N.J. Skelton, A minimal peptide scaffold for β-turn display: optimizing a strand position in disulfide-cyclized β-hairpins, J. Am. Chem. Soc. 123 (2001) 625–632.
[21] S.J. Russell, A. Cochran, Designing stable β-hairpins: energetic contributions from cross-strand residues, J. Am. Chem. Soc. 122 (2001) 12600–12601.
[22] N. Zhang, J.S. Ruan, J. Wu, T. Zhang, SheetsPair: a database of amino acids pairs in protein sheet structures, Data Sci. J. 6 (15) (2007) s589–s595.
[23] S.M. Zaremba, L.M. Gregoret, Context-dependence of amino acid residue pairing in antiparallel beta-sheets, J. Mol. Biol. 291 (1999) 463–479.
[24] Y. Mandel-Gutfreund, S.M. Zaremba, L.M. Gregoret, Contributions of residue pairing to beta-sheet formation: conservation and covariation of amino acid residue pairs on antiparallel beta-strands, J. Mol. Biol. 305 (5) (2001) 1145–1159.
[25] J. Meiler, D. Baker, Coupled prediction of protein secondary and tertiary structure, Proc. Natl. Acad. Sci. USA 100 (2003) 12105–12110.
[26] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001, Software available at .
[27] M. Parisien, F. Major, Ranking the factors that contribute to protein β-sheet folding, Proteins 68 (2007) 824–829.
[28] P.M. Harrison, M.J. Sternberg, The disulphide beta-cross: from cystine geometry and clustering to classification of small disulphide-rich protein folds, J. Mol. Biol. 264 (1996) 603–623.
[29] B. Kuhlman, G. Dantas, G.C. Ireton, G. Varani, B.L. Stoddard, D. Baker, Design of a novel globular protein fold with atomic-level accuracy, Science 302 (2003) 1364–1368.
[30] C.A. Rohl, C.E. Strauss, K.M. Misura, D. Baker, Protein structure prediction using Rosetta, Methods Enzymol. 383 (2004) 66–93.
[31] J. Lee, S.Y. Kim, J. Lee, Protein structure prediction based on fragment assembly and parameter optimization, Biophys. Chem. 115 (2–3) (2005) 209–214.