Chemometrics and Intelligent Laboratory Systems 138 (2014) 7–13
Chemometrics and Intelligent Laboratory Systems journal homepage: www.elsevier.com/locate/chemolab
Prediction of protein–protein binding affinity using diverse protein–protein interface features
Duo Ma, Yanzhi Guo ⁎, Jiesi Luo, Xuemei Pu, Menglong Li ⁎
College of Chemistry, Sichuan University, Chengdu 610064, PR China
Article info
Article history: Received 21 February 2014; received in revised form 5 July 2014; accepted 9 July 2014; available online 17 July 2014
Keywords: Protein–protein interaction Binding Affinity prediction Random forest Feature importance evaluation
Abstract
Protein–protein interactions play fundamental roles in almost all biological processes. Determining protein–protein binding affinity is both an important step and a challenging task for understanding molecular mechanisms and for modeling biological systems. Unlike traditional methods such as empirical scoring algorithms and molecular dynamics, which are time consuming, we developed a fast and reliable machine learning method for predicting protein–protein binding affinity. Using commonly available tools, we calculated 432 protein–protein interface features representing hydrogen bonds, Van der Waals forces, hydrophobic interactions, electrostatic forces, interface shape and configuration, and the allosteric effect. Because the number of protein complexes with both known structures and known affinities is limited, feature importance evaluation was applied to avoid overfitting and to remove noise from the feature set, and 154 optimal features were selected. A prediction model based on random forest (RF) was then constructed. The RF model yields promising results, and its predictive power is better than that of other existing methods. Using leave-one-out cross-validation, our model gives a correlation coefficient (r) of 0.708 on the whole benchmark dataset of 133 complexes and a high r of 0.806 on the validated set of 53 samples. When tested on the same two independent datasets, our method outperforms two other methods and achieves high r values of 0.793 and 0.907, respectively. All results indicate that our method can be a useful tool for determining protein–protein binding affinity. © 2014 Elsevier B.V. All rights reserved.
1. Introduction
The interactions between proteins are among the most fundamental processes in biology. Characterizing these interactions is important for understanding the functional mechanisms of biological processes, since most cellular processes are accomplished through protein complexes [1–3]. Interactions between proteins can be divided into permanent and transient ones. Permanent complexes are very strong and irreversible, whereas transient ones readily undergo changes in oligomeric state. Transient interactions are frequent in the regulation of biochemical pathways and signaling cascades [4–6]. The binding strength, or binding affinity, of a complex is a crucial factor that determines the biological function carried out through the interaction between two proteins. For a given protein complex, a change in binding energy caused by mutations or by errors in posttranslational modification may lead to various diseases [7]. Therefore, predicting the binding affinity of protein–protein interactions, which varies among different types of complexes, is of great significance for drug development and disease research. Specifically, peptide therapeutics, docking result
⁎ Corresponding authors. Tel.: +86 28 85413330; fax: +86 28 85412356. E-mail addresses:
[email protected] (Y. Guo),
[email protected] (M. Li).
http://dx.doi.org/10.1016/j.chemolab.2014.07.006 0169-7439/© 2014 Elsevier B.V. All rights reserved.
evaluating, de novo interface design and computational mutagenesis are all closely connected with the evaluation of binding energy [8–13]. Since measuring protein–protein binding affinity experimentally is very time-consuming and costly, many computational methods have been developed. The traditional computational methods can be divided into two categories: molecular simulation methods and methods based on empirical scoring functions. The former, including molecular dynamics and Monte Carlo sampling [14], can commonly achieve high accuracy. However, they are much more time consuming since every atom needs to be considered, which severely limits their practical application. Compared with molecular simulation methods, the methods based on empirical scoring functions are much faster and fall into three types [8]. The first is statistical potentials [15–18], in which the relative frequencies of atom or residue contacts observed in available structures are counted to derive a potential of mean force. The second is the thermodynamic equation [19–21], in which a group of terms associated with thermodynamic properties are collected and combined into a linear equation for the binding energy. The last type is mainly used for evaluating docking results, distinguishing differences among mutants and identifying biological assemblies [22–24]. These methods can generally use very few terms, such as complementarity, hydrophobicity or hydrogen bonding, to screen a lot of
candidates. Although they have been under development for a long time, they are mostly parameterized and validated on a narrow range of protein complexes [23,25]. Since Kastritis et al. [25] presented a structure-based protein–protein binding affinity benchmark consisting of 144 protein–protein complexes, machine learning based methods have been proposed for affinity prediction [8,26,27]. Here, a new and effective method based on random forest (RF) was developed using a comprehensive protein–protein interface feature set covering the four important non-covalent interactions, the shape and structure of interfaces, and the allosteric effect. In total, 432 structural features were obtained to represent hydrogen bonds, Van der Waals forces, hydrophobic interactions, electrostatic forces, interface shape and configuration, and the allosteric effect. For feature selection and compression, feature importance evaluation was implemented by the permutation importance analysis of RF. Based on the 154 optimal features, the RF model was constructed using the training set, and leave-one-out (LOO) cross-validation gave a satisfactory correlation coefficient (r) of 0.670. When validated on the two independent testing sets, the model gives high r values of 0.793 and 0.907, respectively, which are higher than those of two other existing methods. Further, our results highlight how strongly the accuracy of the model depends on the features, as well as on the number and diversity of the seven types of protein complexes.

2. Material and methods

2.1. Dataset preparation
For training and testing our method, we used the structure-based benchmark for protein–protein binding affinity compiled by Kastritis et al. [25]. This benchmark contains 144 protein–protein complexes with high-resolution PDB structures for both the unbound and bound states, as well as experimentally measured binding affinities (Kd values).
The samples are mostly derived from the docking benchmark 4.0 presented by Hwang et al. [28], from which redundancy has been eliminated. In addition, the protein complexes in this benchmark cover diverse types in terms of biological function. Overall, there are seven types: antigen–antibodies (AA), enzyme–inhibitors (EI), enzyme–substrates (ES), enzyme–regulatory subunits (ER), G-proteins (OG), membrane receptors (OR) and other complexes (OX). The Kd values also cover a large range, between 10^−3.2 and 10^−13.6.

2.2. The protein–protein interface features
The binding between two monomers is directly affected by the contact region, which is commonly regarded as the binding interface [5,29,30]. The descriptors were therefore collected mainly to characterize the protein–protein binding interfaces, while some other features, such as variables reflecting the structural transformation and the complex type, were also introduced. We considered six aspects to extract abundant information from a protein–protein interface: hydrogen bonds, Van der Waals forces, hydrophobic interactions, electrostatic forces, interface shape and configuration, and the allosteric effect. In total, 432 descriptors were calculated with commonly used tools such as Naccess [31], COCOMAPS [32], 2P2I [33] and PIC [34]. These features are simple and fast to calculate, so a feature vector can be constructed easily for each sample. First, four kinds of non-covalent interactions were considered: hydrogen bonds, Van der Waals forces, hydrophobic interactions and electrostatic forces. The hydrogen bond is crucial for stabilizing the spatial structures of proteins and maintaining their dynamics and biological functions. Its role in bio-macromolecular interactions has been highlighted [35]. The Van der Waals contact is a basic force that is closely related to the atomic distance.
A single Van der Waals contact is non-specific, but for the contacting surfaces of two proteins, good complementarity implies tight binding, less repulsion and a small enclosed volume [30]. The hydrophobic interaction is deemed the most important force during protein folding and self-assembly and plays an important role in bio-macromolecular interactions. As reported in previous works [36–38], the residues on the interface are more inclined to be hydrophobic than those on other parts of the surface. Moreover, many tightly bound complexes can even form hydrophobic cores in the binding process. The hot spots, known as the residues that contribute greatly to the binding energy, often form strong hydrophobic contacts [39]. The electrostatic interaction involves the polar or oppositely charged residue pairs on the interface; ion pairs and salt bridges in particular contribute significantly to the binding energy as well as the binding specificity [40]. Besides, we also considered some special static interactions formed between certain atoms, functional groups or chemical bonds, including aromatic–aromatic, aromatic–sulfur and cation–π interactions. In addition, the non-bound descriptors of the interface were also taken into account to characterize the interface shape and configuration. Since conservation is of great importance for biological function and binding specificity [41–44], conservation scores for interfacial residues were introduced based on the PSSM [45]. The flexibility, represented by the mean B-factor, was also calculated for each interface. To complete the feature set, the composition, number of segments [33], enclosed volume size [33], interfacial size, interfacial polarity [32], number of hot spots [46,47] and accessible surface area [31] were also included. Furthermore, the allosteric effect was also considered in our model construction. Since proteins are generally agreed to undergo some degree of spatial structural change during binding [48,49], the energy consumed by this change should be taken into account.
We calculated the mean RMSDs of the Cα atoms and side chains of residues located on the interface, on the surface and in the whole protein, respectively, to represent the conformational change. The names of the descriptors and the tools used to calculate them are listed in Table S1 of Supplementary file S1. Finally, in order to highlight the diversity of protein complexes, we also introduced seven-dimensional pseudo-variables that reflect the 7 different types of complexes in our dataset. For example, the pseudo-variable vectors for EI and ES are [1,0,0,0,0,0,0] and [0,1,0,0,0,0,0], respectively.
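The pseudo-variable encoding of the complex type amounts to a one-hot vector over the seven types. A minimal sketch in Python follows. Only the positions of EI and ES are given in the text, so the ordering of the remaining five types in `COMPLEX_TYPES` is an assumption made for illustration, and the function name `pseudo_variables` is hypothetical.

```python
# One-hot "pseudo-variables" for the seven complex types of the benchmark.
# EI and ES occupy positions 0 and 1 as stated in the text; the order of the
# remaining five types is an assumption made only for this illustration.
COMPLEX_TYPES = ["EI", "ES", "ER", "AA", "OG", "OR", "OX"]


def pseudo_variables(complex_type):
    """Return the 7-dimensional one-hot vector for a complex type label."""
    if complex_type not in COMPLEX_TYPES:
        raise ValueError(f"unknown complex type: {complex_type!r}")
    return [1 if t == complex_type else 0 for t in COMPLEX_TYPES]


if __name__ == "__main__":
    print(pseudo_variables("EI"))  # [1, 0, 0, 0, 0, 0, 0]
    print(pseudo_variables("ES"))  # [0, 1, 0, 0, 0, 0, 0]
```

Such a vector would simply be appended to the interface descriptors of each complex before model building.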
2.3. The regression model
Machine learning methods have been widely used in research on biological mechanisms, as these algorithms can reveal the complicated relationships between factors and the phenotype. To construct the regression model based on the descriptors mentioned above, we chose a powerful non-linear algorithm, random forest (RF). This machine learning method has been applied to many kinds of bioinformatics research, both for binary classification [50] and for regression [51]. RF gives the average prediction of a collection of decision trees, where the criterion at each node is chosen so as to minimize the variance within the branches [52]. For an effective ensemble, the tree predictions should have high accuracy and low correlation across trees. Once the researcher sets the number of decision trees (ntree value) and the maximum number of random features available for selection at each node (mtry value), the decision trees are grown fully, so that each member of the training set becomes one leaf of a decision tree. The final prediction is returned as the average over all trees. The RF can also be used to rank the importance of variables in a regression or classification problem, so that we can quantitatively evaluate the contribution of each descriptor using the permutation importance analysis of RF. It has been shown that RF performs very well in non-linear regression [53]. In this study, RF
regression and feature importance evaluation were implemented using the program of RF-score [51].

2.4. Model validation and evaluation
Among the original 144 complexes included in the benchmark, there are 11 samples for which we could not obtain the complete features, so our whole dataset contains the remaining 133 complexes. The whole dataset was divided into a training set and a testing set at a ratio of approximately 3:1, with 102 and 31 samples respectively. The samples in the testing set were selected evenly across the whole Kd value range, and the center of their distribution was kept as consistent as possible with that of the training set; the diagram can be seen in Fig. S1 of Supplementary file S1. LOO cross-validation results on the training data were used for internal validation and the performance on the testing data was used for external evaluation. Two measurements, the correlation coefficient (r) and the root mean squared error (RMSE), are used to evaluate the performance of the models; they are defined in Eqs. (1) and (2) respectively.

$$ r = \frac{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)\left(y_i^{\mathrm{pred}} - \bar{y}^{\mathrm{pred}}\right)}{\sqrt{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \sum_{i=1}^{n} \left(y_i^{\mathrm{pred}} - \bar{y}^{\mathrm{pred}}\right)^2}} \qquad (1) $$

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i^{\mathrm{pred}} - y_i\right)^2} \qquad (2) $$

where $n$ is the number of samples, $y_i$ and $y_i^{\mathrm{pred}}$ are the experimentally determined affinity and the estimated affinity respectively, and $\bar{y}$ and $\bar{y}^{\mathrm{pred}}$ are the averages of $y_i$ and $y_i^{\mathrm{pred}}$ respectively.

3. Results and discussion

3.1. Feature selection and optimization
The 432 original descriptors used in this study reflect different aspects of the structural, geometrical and physicochemical properties of protein–protein interfaces. Supplementary file S2 lists all the 432 descriptor values for all the samples. However, the descriptors may overlap strongly in information and contain background noise. Variable selection and optimization is a regular approach to solving this problem. Here, we employed the feature selection function of RF-score [51] to perform a variable selection procedure. Specifically, a model was built on the training set with all features, and this process was repeated 100 times. Each model yields a ranking of variable importance scores. The average feature scores over these 100 models were used as the feature selection standard, as shown in Table S2 of Supplementary file S1. More than half of the features (218) have importance scores lower than 0, so these features may act as noise in the prediction model, and only 114 features have importance scores higher than 10. In order to get the optimal features, six feature subsets were constructed according to the importance scores of all features. Subset 1 includes all 425 features (the 432 descriptors excluding the 7 pseudo-variables), subset 2 includes features with importance scores > 0, subset 3 those with scores > 0.1, subset 4 those with scores > 2, subset 5 those with scores > 5 and subset 6 those with scores > 10. After adding the 7 pseudo-variables to each subset, the prediction results of the six models on the training set by LOO cross-validation are shown in Table 1. It can be seen that the model on subset 3, with 154 features, performs best: it yields the highest r of 0.670 and the lowest RMSE of 1.630, so it was selected as the final prediction model.

Table 1
The LOO results on the training set for each feature subset.

Feature group (a)   No. of descriptors   r       RMSE
Subset 1            425                  0.668   1.643
Subset 2            207                  0.670   1.644
Subset 3            147                  0.670   1.630
Subset 4            138                  0.662   1.652
Subset 5            127                  0.661   1.655
Subset 6            114                  0.654   1.670

(a) Subsets 1–6 represent the feature groups with different importance score levels. Subset 1: all features; subset 2: scores > 0; subset 3: scores > 0.1; subset 4: scores > 2; subset 5: scores > 5; subset 6: scores > 10. The 7-dimensional pseudo-variables were excluded from the variable ranking.

Additionally, in order to test the performance of other regression algorithms on protein–protein binding affinity prediction, three other methods were compared with RF: support vector regression (SVR), artificial neural network (ANN) and partial least squares (PLS). We compared the different algorithms on the same training set by the LOO test and the prediction results are shown in Table 2. From Table 2, we can see that the model built by RF performed best. In addition, SVR gives a better performance than ANN and PLS. So RF was used as the final prediction model.

3.2. Validation on testing set and comparisons between different feature sets
As the predictive power of a method is mainly decided by the descriptors that it uses, we tested the effectiveness of our selected descriptors by comparing our method with other feature sets on the same dataset. Among the other feature sets proposed for protein–protein binding affinity prediction, two methods provide the descriptor values for all complexes [8,54]. Moal et al. [8] constructed a large set of molecular descriptors including energetic factors and features on structural ensembles to build a regression model. Vreven et al. [54] developed an energy function for predicting binding free energies of protein complexes. The features used by our method aim to characterize the protein–protein interfaces locally, and so they differ from those of the two methods, whose descriptors are all global features representing the whole protein–protein complexes. Since the binding between two monomers is directly affected by the contact region, the local features of protein–protein interfaces are more effective than the global ones for characterizing the binding affinity. Using the same RF regression method, our method was compared with the other two methods on the same training set and testing set. The LOO cross-validation results of the three methods on the 102 training samples are shown in Fig. 1(a1, b1 and c1). To compare the practical performance of the three methods, external validation was performed on the 31 testing samples and the prediction results can also be seen in Fig. 1(a2, b2 and c2). Our method outperforms the other two methods. For the LOO results, the other two methods give relatively low r values of only 0.276 and 0.518 respectively. On the testing data, our model is comparable to or even better than the two methods and gives an r of 0.793.
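The r and RMSE values compared above follow Eqs. (1) and (2). As a minimal sketch, they can be computed from paired experimental and predicted pKd values as shown below. The function names `pearson_r` and `rmse` and the sample numbers are illustrative only and are not taken from the paper or from RF-score.

```python
import math


def pearson_r(y_true, y_pred):
    """Correlation coefficient r between measured and predicted values (Eq. (1))."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    if var_true == 0 or var_pred == 0:
        raise ValueError("r is undefined when one sequence is constant")
    return cov / math.sqrt(var_true * var_pred)


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values (Eq. (2))."""
    n = len(y_true)
    if n != len(y_pred) or n == 0:
        raise ValueError("need two equally long, non-empty sequences")
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


if __name__ == "__main__":
    # Invented pKd-like values, used only to exercise the two functions.
    measured = [5.0, 7.0, 9.0, 11.0]
    predicted = [6.0, 7.0, 8.0, 12.0]
    print(round(pearson_r(measured, predicted), 3), round(rmse(measured, predicted), 3))
```

Because r is insensitive to a constant offset or scale in the predictions while RMSE is not, reporting both gives a fuller picture of model quality.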
The detailed prediction results of our method on the training set and testing set are listed in Tables S3 and S4 of Supplementary file S1 respectively.

Table 2
The prediction results of different algorithms on the training set.

                     LOO results on the training set
Modeling algorithm   r       RMSE
RF                   0.670   1.639
SVR                  0.658   1.870
ANN                  0.467   4.880
PLS                  0.090   34.030

Fig. 1. Scatter plots of predicted pKd values versus experimentally measured values for three models (a, b and c), including the LOO cross-validation on the training set (a1, b1, c1), predicted results on the whole testing set (a2, b2, c2) and on the testing set without the outlier sample (a3, b3, c3). Methods A and B were constructed using the features proposed by Moal et al. [8] and Vreven et al. [54] respectively.

Interestingly, from Fig. 1(a2, b2 and c2), we can see that the same sample (PDB ID: 2TGP) exhibits the worst prediction results in all three methods. Comparing it with other enzyme–inhibitor complexes, we found that it has the weakest binding affinity, with a pKd of only 5.62, while the others in this category all have affinities at least 10^2 times stronger. So it can be deemed an outlier sample. After removing it, the prediction results on the remaining 30 testing samples are shown in Fig. 1(a3, b3 and c3). The r values of all three methods are greatly improved, but our method still performs better than the other two models when this sample is excluded, and only our method reaches an r higher than 0.9.

3.3. Refinements on the whole dataset
To investigate the current experimental data biases produced during the determination of dissociation constants or by the difference between the PDB structural data and the real biological state, refinements on the whole dataset were carried out. We first performed the LOO test on the whole 133 complexes. As shown in Fig. 2(a), the prediction result of the LOO test on the whole 133 complexes is better than that on the 102 training samples, indicating that more samples help to improve the performance. The r and RMSE are 0.708 and 1.472 respectively. The protein–protein binding affinity benchmark also contains a subset named the validated set [8,25]. In this validated set, the affinities of the samples have been measured by more than one group or experimental technique and are deemed to be of high confidence. Of the 133 complexes, 53 belong to the validated set. The LOO test result on this validated set is shown in Fig. 2(b), with a high r of 0.806 and a low RMSE of 1.335, which is much better than the LOO result on the whole dataset. The comparison of Fig. 2(a) and (b) suggests that measurement errors in the dissociation constants can negatively influence the performance of the model.
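The leave-one-out (LOO) test used throughout this section can be sketched as follows. The models in the paper were built with RF-score; to keep the example free of external dependencies, a simple nearest-neighbour regressor stands in for the random forest here, so the sketch illustrates only the validation loop and not the authors' model. The names `nearest_neighbour_fit` and `leave_one_out_predictions` and the toy data are hypothetical.

```python
def nearest_neighbour_fit(X_train, y_train):
    """Stand-in regressor (NOT a random forest): predicts the target of the
    closest training sample in squared Euclidean distance."""
    def predict(x):
        best = min(
            range(len(X_train)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X_train[j], x)),
        )
        return y_train[best]
    return predict


def leave_one_out_predictions(X, y, fit):
    """Predict every sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(X)):
        # Hold out sample i and train on the remaining n - 1 samples.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        predictions.append(model(X[i]))
    return predictions


if __name__ == "__main__":
    # Toy feature vectors and pKd-like targets, invented for illustration.
    X = [[0.0, 1.0], [0.1, 1.1], [2.0, 3.0], [2.1, 3.2]]
    y = [5.0, 5.5, 9.0, 9.4]
    print(leave_one_out_predictions(X, y, nearest_neighbour_fit))
```

The held-out predictions collected this way are then scored against the measured affinities with r and RMSE. Any regressor exposing the same fit-then-predict shape could be passed as `fit`.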
Fig. 2. Scatter plots of LOO result on whole 133 samples (a), validated set (b), the subset of rigid samples (c) and the subset of flexible samples (d), respectively.
In addition, the allosteric effects in both the binding and crystallization procedures also cause noise. Complexes undergo conformational transformations to a greater or lesser degree during the binding process. This energy exchange sets up barriers to association, and the sizes of the energy barriers vary with the degree of allostery or the flexibility of the protein chains [25]. Besides this, discrepancies always exist between the PDB structural data and the real biologically active state: they introduce numerical bias relative to the real 3D structure in solution, and the crystal-packing interfaces formed during crystallization sometimes have no biological significance [55,56]. So tests were made on the rigid and flexible complexes respectively. The 20 rigid complexes were defined as those with an interface Cα RMSD < 0.5 Å between the unbound and bound states, and the remaining 111 complexes were classified as flexible [25]. As shown in Fig. 2(c) and (d), the r of the LOO test on the rigid samples is 0.832, much higher than that on the flexible ones (0.628). In our opinion, rigid interfaces
commonly undergo little conformational change during a biological process. So the comparison between rigid and flexible complexes demonstrates that the conformational changes that occur on going from the unbound to the bound state, and from the biologically active state to the crystalline state, have a great influence on the prediction of binding affinity. Therefore the allosteric effect does play an important role in biomolecular recognition and association. Our study also shows that the feature representing the allosteric effect is among the top-ranked features by importance score. Finally, we investigated the relationship between the model performance and the type of complex. We calculated the RMSE for each type of complex and the result is shown in Fig. 3. Some types have obvious distribution characteristics in contrast to others, especially the EI complexes, which have the highest RMSE. Except for the complex 2TGP, the EI complexes have relatively strong binding strengths, and the high RMSE for EI may be caused by the imbalanced distribution of dissociation constants, because almost all samples with high Kd values
Fig. 3. The RMSE for each kind of complexes.
belong to this complex category on the whole. OG and OR also have high prediction errors. The reason may be that they always have more than one binding site. For example, a G-protein can bind to many partners [25], including GTPase activating proteins (GAPs), guanine nucleotide exchange factors, protein kinases, etc. Unlike the simple competitive association of the enzyme-containing complexes with low RMSE, these multi-binding-site complexes show more complicated binding behavior.

4. Conclusions
In this article, we aimed to predict the binding affinity of protein–protein complexes. In order to characterize protein–protein interactions effectively, various descriptors related to the protein–protein interfaces were calculated and collected, including descriptors for four types of non-covalent contacts, the shape and configuration of the interfaces, and the conformational change. For feature evaluation and optimization, RF-score was used to obtain the importance scores of all features, and 154 optimal features were selected from the total of 432 for model construction. In order to demonstrate the superiority of our method, it was compared with two other existing methods. Both the LOO test result on the training set and the prediction result on the independent dataset are better for our method than for the other two. In addition, refinements on the whole dataset were carried out to investigate the current experimental data biases produced during the determination of dissociation constants or by the difference between the PDB structural data and the real biological state. The r of the model on the validated set is higher than that of the model on the whole dataset, indicating that measurement errors in the dissociation constants can negatively influence the performance of the method.
Moreover, the comparison between the models on the rigid and flexible samples demonstrates that the allosteric effect does play an important role in bio-molecular recognition and association. Finally, we also investigated the relationship between the model performance and the type of complexes and found that EI, OG and OR give high RMSE because of the imbalanced distribution or the more complicated binding behaviors. Overall, our method can be a useful tool for predicting protein–protein binding affinity, and this work will help us to further understand the mechanism of protein–protein interactions.

5. Conflict of interest
The authors have no conflict of interests concerning this work.

Acknowledgments
This work was funded by the National Natural Science Foundation of China (Nos. 21175095, 21273154, 21375090).

Appendix A. Supplementary data
Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.chemolab.2014.07.006.

References
[1] B. Alberts, The cell as a collection of protein machines: preparing the next generation of molecular biologists, Cell 92 (1998) 291–294. [2] S. Jones, J.M. Thornton, Principles of protein–protein interactions, Proc. Natl. Acad. Sci. 93 (1996) 13–20. [3] I.M. Nooren, J.M. Thornton, Diversity of protein–protein interactions, EMBO J. 22 (2003) 3486–3492. [4] D. La, M. Kong, W. Hoffman, Y.I. Choi, D. Kihara, Predicting permanent and transient protein–protein interfaces, Proteins 81 (2013) 805–818. [5] L.S. Swapna, R.M. Bhaskara, J. Sharma, N. Srinivasan, Roles of residues in the interface of transient protein–protein complexes before complexation, Sci. Rep. 2 (2012).
[6] J. Mintseris, Z. Weng, Structure, function, and evolution of transient and obligate protein–protein interactions, Proc. Natl. Acad. Sci. U. S. A. 102 (2005) 10930–10935. [7] M. Vidal, M.E. Cusick, A.-L. Barabasi, Interactome networks and human disease, Cell 144 (2011) 986–998. [8] I.H. Moal, R. Agius, P.A. Bates, Protein–protein binding affinity prediction on a diverse set of structures, Bioinformatics 27 (2011) 3002–3009. [9] M. Kumar, S. Verma, S. Sharma, A. Srinivasan, T.P. Singh, P. Kaur, Structure‐based in silico design of a high‐affinity dipeptide inhibitor for novel protein drug target Shikimate kinase of Mycobacterium tuberculosis, Chem. Biol. Drug Des. 76 (2010) 277–284. [10] T. Kortemme, D. Baker, Computational design of protein–protein interactions, Curr. Opin. Chem. Biol. 8 (2004) 91–97. [11] S.J. Fleishman, T.A. Whitehead, D.C. Ekiert, C. Dreyfus, J.E. Corn, E.-M. Strauch, I.A. Wilson, D. Baker, Computational design of proteins targeting the conserved stem region of influenza hemagglutinin, Science 332 (2011) 816–821. [12] A. Ben-Shimon, M. Eisenstein, Computational mapping of anchoring spots on protein surfaces, J. Mol. Biol. 402 (2010) 259–277. [13] D.C. Fry, Protein–protein interactions as targets for small molecule drug discovery, Pept. Sci. 84 (2006) 535–552. [14] P. Kollman, Free energy calculations: applications to chemical and biochemical phenomena, Chem. Rev. 93 (1993) 2395–2417. [15] L. Jiang, Y. Gao, F. Mao, Z. Liu, L. Lai, Potential of mean force for protein– protein interaction studies, Proteins Struct. Funct. Genet. Bioinforma. 46 (2002) 190–196. [16] Y. Su, A. Zhou, X. Xia, W. Li, Z. Sun, Quantitative prediction of protein–protein binding affinity with a potential of mean force considering volume correction, Protein Sci. 18 (2009) 2550–2558. [17] C. Zhang, S. Liu, Q. Zhu, Y. Zhou, A knowledge-based energy function for protein– ligand, protein–protein, and protein–DNA complexes, J. Med. Chem. 48 (2005) 2325–2335. [18] Z.-H. Zeng, Y.C. 
Li, Empirical parameters for estimating protein–protein binding energies: number of short-and long-distance atom–atom contacts, Protein Pept. Lett. 15 (2008) 223–231. [19] N. Horton, M. Lewis, Calculation of the free energy of association for protein complexes, Protein Sci. 1 (1992) 169–181. [20] X.H. Ma, C.X. Wang, C.H. Li, W.Zu. Chen, A fast empirical approach to binding free energy calculations based on protein interface information, Protein Eng. 15 (2002) 677–681. [21] H. Bai, K. Yang, D. Yu, C. Zhang, F. Chen, L. Lai, Predicting kinetic constants of protein–protein interactions based on structural properties, Proteins Struct. Funct. Genet. Bioinforma. 79 (2011) 720–734. [22] P. Heuser, D. Schomburg, Combination of scoring schemes for protein docking, BMC Bioinforma. 8 (2007) 279. [23] P.L. Kastritis, A.M. Bonvin, Are scoring functions in protein–protein docking ready to predict interactomes? Clues from a novel binding affinity benchmark, J. Proteome Res. 9 (2010) 2216–2225. [24] J. Audie, S. Scarlata, A novel empirical free energy function that explains and predicts protein–protein binding affinities, Biophys. Chem. 129 (2007) 198–211. [25] P.L. Kastritis, I.H. Moal, H. Hwang, Z. Weng, P.A. Bates, A.M. Bonvin, J. Janin, A structure‐based benchmark for protein–protein binding affinity, Protein Sci. 20 (2011) 482–491. [26] P. Zhou, C. Wang, F. Tian, Y. Ren, C. Yang, J. Huang, Biomacromolecular quantitative structure–activity relationship (BioQSAR): a proof-of-concept study on the modeling, prediction and interpretation of protein–protein binding affinity, J. Comput. Aided Mol. Des. 27 (2013) 67–78. [27] F. Tian, Y. Lv, L. Yang, Structure-based prediction of protein–protein binding affinity with consideration of allosteric effect, Amino Acids 43 (2012) 531–543. [28] H. Hwang, T. Vreven, J. Janin, Z. Weng, Protein–protein docking benchmark version 4.0, Proteins Struct. Funct. Genet. Bioinforma. 78 (2010) 3111–3114. [29] P.L. Kastritis, A.M. 
Bonvin, On the binding affinity of macromolecular interactions: daring to ask why proteins interact, J. R. Soc. Interface 10 (2013). [30] D. Reichmann, O. Rahat, S. Albeck, R. Meged, O. Dym, G. Schreiber, The modular architecture of protein–protein binding interfaces, Proc. Natl. Acad. Sci. U. S. A. 102 (2005) 57–62. [31] S.J. Hubbard, J.M. Thornton, Naccess, Computer Program, Department of Biochemistry and Molecular Biology, University College London 2, 1993. [32] A. Vangone, R. Spinelli, V. Scarano, L. Cavallo, R. Oliva, COCOMAPS: a web application to analyze and visualize contacts at the interface of biomolecular complexes, Bioinformatics 27 (2011) 2915–2916. [33] M.J. Basse, S. Betzi, R. Bourgeas, S. Bouzidi, B. Chetrit, V. Hamon, X. Morelli, P. Roche, 2P2Idb: a structural database dedicated to orthosteric modulation of protein– protein interactions, Nucleic Acids Res. 41 (2013) D824–D827. [34] K. Tina, R. Bhadra, N. Srinivasan, PIC: protein interactions calculator, Nucleic Acids Res. 35 (2007) W473–W476. [35] D. Xu, C.-J. Tsai, R. Nussinov, Hydrogen bonds and salt bridges across protein– protein interfaces, Protein Eng. 10 (1997) 999–1012. [36] C.J. Tsai, R. Nussinov, Hydrophobic folding units at protein–protein interfaces: implications to protein folding and to protein–protein association, Protein Sci. 6 (1997) 1426–1437. [37] L. Young, R. Jernigan, D. Covell, A role for surface hydrophobicity in protein–protein recognition, Protein Sci. 3 (1994) 717–729. [38] C.J. Tsai, S.L. Lin, H.J. Wolfson, R. Nussinov, Studies of protein–protein interfaces: a statistical analysis of the hydrophobic effect, Protein Sci. 6 (1997) 53–64. [39] I.S. Moreira, P.A. Fernandes, M.J. Ramos, Hot spots—a review of the protein–protein interface determinant amino‐acid residues, Proteins Struct. Funct. Genet. Bioinforma. 68 (2007) 803–812.
[40] S. Kumar, R. Nussinov, Close‐range electrostatic interactions in proteins, ChemBioChem 3 (2002) 604–617. [41] R. Sharan, S. Suthram, R.M. Kelley, T. Kuhn, S. McCuine, P. Uetz, T. Sittler, R.M. Karp, T. Ideker, Conserved patterns of protein interaction in multiple species, Proc. Natl. Acad. Sci. U. S. A. 102 (2005) 1974–1979. [42] S. Wuchty, Z.N. Oltvai, A.-L. Barabási, Evolutionary conservation of motif constituents in the yeast protein interaction network, Nat. Genet. 35 (2003) 176–179. [43] Y.S. Choi, J.S. Yang, Y. Choi, S.H. Ryu, S. Kim, Evolutionary conservation in multiple faces of protein interaction, Proteins Struct. Funct. Genet. Bioinforma. 77 (2009) 14–25. [44] O.N. Yogurtcu, S. Bora Erdemli, R. Nussinov, M. Turkay, O. Keskin, Restricted mobility of conserved residues in protein–protein interfaces in molecular simulations, Biophys. J. 94 (2008) 3475–3485. [45] D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol. 292 (1999) 195–202. [46] T. Kortemme, D.E. Kim, D. Baker, Computational alanine scanning of protein–protein interfaces, Sci. Signal. 2004 (2004) l2. [47] T. Kortemme, D. Baker, A simple physical model for binding energy hot spots in protein–protein complexes, Proc. Natl. Acad. Sci. 99 (2002) 14116–14121. [48] P. Tuffery, P. Derreumaux, Flexibility and binding affinity in protein–ligand, protein– protein and multi-component protein interactions: limitations of current computational approaches, J. R. Soc. Interface 9 (2012) 20–33.
[49] G.R. Smith, M.J. Sternberg, P.A. Bates, The relationship between the flexibility of proteins and their conformational states on forming protein–protein complexes with an application to protein–protein docking, J. Mol. Biol. 347 (2005) 1077–1101. [50] S.E. Hamby, J.D. Hirst, Prediction of glycosylation sites using random forests, BMC Bioinforma. 9 (2008) 500. [51] P.J. Ballester, J.B. Mitchell, A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking, Bioinformatics 26 (2010) 1169–1175. [52] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32. [53] V. Svetnik, A. Liaw, C. Tong, J.C. Culberson, R.P. Sheridan, B.P. Feuston, Random forest: a classification and regression tool for compound classification and QSAR modeling, J. Chem. Inf. Comput. Sci. 43 (2003) 1947–1958. [54] T. Vreven, H. Hwang, B.G. Pierce, Z. Weng, Prediction of protein–protein binding free energies, Protein Sci. 21 (2012) 396–404. [55] E. Krissinel, K. Henrick, Inference of macromolecular assemblies from crystalline state, J. Mol. Biol. 372 (2007) 774–797. [56] O. Carugo, P. Argos, Protein–protein crystal‐packing contacts, Protein Sci. 6 (1997) 2261–2263.