Structural properties and interaction energies affecting drug design. An approach combining molecular simulations, statistics, interaction energies and neural networks

Structural properties and interaction energies affecting drug design. An approach combining molecular simulations, statistics, interaction energies and neural networks

Computational Biology and Chemistry 56 (2015) 7–12 Contents lists available at ScienceDirect Computational Biology and Chemistry journal homepage: w...

349KB Sizes 0 Downloads 19 Views

Computational Biology and Chemistry 56 (2015) 7–12

Contents lists available at ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Structural properties and interaction energies affecting drug design. An approach combining molecular simulations, statistics, interaction energies and neural networks Dimitris Ioannidis a , Georgios E. Papadopoulos b , Georgios Anastassopoulos c , Alexandros Kortsaris a , Konstantinos Anagnostopoulos a, * a

Laboratory of Biochemistry, Department of Medicine, Democritus University of Thrace, 68100 Alexandroupolis, Greece Department of Biochemistry and Biotechnology, University of Thessaly, Ploutonos 26 & Aeolou Greece, 41221 Larisa, Greece c Laboratory of Medical Informatics, Department of Medicine, Democritus University of Thrace, 68100 Alexandroupolis, Greece b

A R T I C L E I N F O

A B S T R A C T

Article history: Received 18 October 2014 Received in revised form 17 February 2015 Accepted 22 February 2015 Available online 25 February 2015

In order to elucidate some basic principles for protein–ligand interactions, a subset of 87 structures of human proteins with their ligands was obtained from the PDB databank. After a short molecular dynamics simulation (to ensure structure stability), a variety of interaction energies and structural parameters were extracted. Linear regression was performed to determine which of these parameters have a potentially significant contribution to the protein–ligand interaction. The parameters exhibiting relatively high correlation coefficients were selected. Important factors seem to be the number of ligand atoms, the ratio of N, O and S atoms to total ligand atoms, the hydrophobic/polar aminoacid ratio and the ratio of cavity size to the sum of ligand plus water atoms in the cavity. An important factor also seems to be the immobile water molecules in the cavity. Nine of these parameters were used as known inputs to train a neural network in the prediction of seven other. Eight structures were left out of the training to test the quality of the predictions. After optimization of the neural network, the predictions were fairly accurate given the relatively small number of structures, especially in the prediction of the number of nitrogen and sulfur atoms of the ligand. ã 2015 Elsevier Ltd. All rights reserved.

Keywords: Drug design Molecular dynamics simulation Interaction energy Neural networks

1. Introduction Since the 1980s, computer-aided drug design has been employed to help with the design of new drugs (determination of the molecular target of the drug and its structure, determination of the interaction mechanism, etc.). The field of protein structure–function has contributed significantly to this (Ghersi and Sanchez, 2011). An example is the use of the complementarity function (Sobolev et al., 1997; Sobolev et al., 1999) for protein–protein interactions, where the potential for a stable interaction of two biomolecules is determined by the frequency of favorable atomic interactions versus the frequency of unfavorable ones. A trend is to use the four basic non-covalent interactions: electrostatic, hydrogen bonds, van

Abbreviation: PDB, Protein Data Bank; MDS, molecular dynamics simulation; EM, energy minimization; MDS + EM, molecular dynamics simulation followed by energy minimization. * Corresponding author. Tel.: +30 25510 30502; fax: +30 25510 30502. E-mail addresses: [email protected], [email protected] (K. Anagnostopoulos). http://dx.doi.org/10.1016/j.compbiolchem.2015.02.016 1476-9271/ ã 2015 Elsevier Ltd. All rights reserved.

der Waals and hydrophobic interactions (Lins and Brasseur, 1995). Relevant to this is the effort to statistically study the various types of aminoacids of the interaction hotspots (Boganand and Thorn, 1998; Halperin et al., 2004; Jones and Thornton, 1996). To understand the above interactions we must also use topological criteria, since these interactions depend on the distance and atom position. Initially, the lock-and-key model was used (Durrant and McCammon, 2011; Whitesides and Krishnamurthy, 2005) for the interaction of proteins with receptor agonists and antagonists and enzyme inhibitors and activators. This approach was complemented using molecular simulations, due to the increase in computing power and the experimental and theoretical determination of the behavior of atoms in semi-empirical force fields, like CHARMM (MacKerell et al., 1988, 1998). An important point is the handling of the hydrophobic interactions (Lins and Brasseur, 1995; Meyer et al., 2006). Very useful are the in silico screening methods (Schmidt et al., 2009; Wishart et al., 2008), where a chemical substance can be checked for probable interactions against a database of structures. Also, open is the question of the role of neighboring relatively conserved water molecules that regulate and enhance the

8

D. Ioannidis et al. / Computational Biology and Chemistry 56 (2015) 7–12

interaction of proteins, like e.g., hsp90 (Yan et al., 2008). Water molecules are of interest for many aspects relevant to computational drug design such as structure, thermodynamics, binding free energy calculation, molecular docking and molecular simulation studies (deBeer et al., 2010). As replacing or conserving a crystal water molecule remains a critical problem for effective drug design, the immobile water molecules emerge as potential targets. Furthermore, other atoms have also been implicated that allosterically regulate receptors, e.g., sodium ions in dopamine receptor (Selent et al., 2010). One method of structure-based or direct drug design is attempting to build a ligand that fits to a known receptor 3D structure (receptor-based drug design) (Schneider and Fechner, 2005). Such approaches include fragment-based drug discovery (Merour et al., 2014) or combinatorial chemistry by Monte Carlo methods (Grzybowski et al., 2002). The use of 3D-QSAR (quantitative structure-activity relationships) is also widespread (Verma et al., 2010). Of particular importance is the inclusion of oxygen, nitrogen and sulfur atoms in the ligand. These are atoms with high electronegativity (especially oxygen and nitrogen) and therefore they can significantly influence the properties and interactions of the ligand. Hence the importance of the prediction of how many there should be in it. In this paper, we attempted an empirical analysis of the protein–ligand interaction, using PDB databank structures and combining various approaches using statistical measurements. Various parameters of the interactions were determined to check which of these could be useful during in silico drug design. Also, the role of immobile water molecules was studied. Finally, a neural network was created from some of the interaction parameters determined, and this was used to predict other interaction parameters. 2. Materials and methods 2.1. Databases used To obtain the starting material the Protein Data Bank PDB (Berman et al., 2000) was used. All the entries of the PDB database with Homo sapiens as source organism were downloaded (a total of 20,761 entries). A script in the PERL programming language was written to facilitate the examination and selection of the appropriate PDB entries. Using this script, only the PDB entries that met the following criteria were included: (I) to contain a HETSYN record (indication of the existence of a ligand), (II) to contain a TER record only once or twice (so that the selected files will contain only one or two protein chains, (III) not to contain heme or acetylglucosamine in the HETSYN record and (IV) not to contain nucleic acids. After this selection 1528 PDB entries were left, which were further examined manually and more entries were excluded if the ligands were sugars or lipids. After this round of selection, 400 PDB records were left. Next, these 400 entries were examined and those that were similar were excluded (similarity was defined as >95% identity in the protein sequence except when two or more PDB entries with similar proteins contained different ligands. In such cases the structures were retained, e.g PDB IDs 2J4A, 1XZX and 1Q4X, supplement IDs 6, 7 and 8). In the end, 95 entries were left. Of these 8 were further excluded due to many discontinuities in the structure, to yield a total of 87 final entries for the analysis. The general aim was to select simple entries consisting mostly of a protein and a ligand. These entries with the corresponding PDB IDs are shown in the Supplementary material. These 87 entries were classified in four categories (color coded in the Supplementary material): receptor agonists/antagonists,

enzyme inhibitors, serum blockers–transporter inhibitors and conformation blockers–chaperone inhibitors. 2.2. Molecular dynamics simulations In order to to relax the protein–ligand system, energy minimization and molecular dynamics simulations (MDS) were performed using the PDB files as the initial state. The MDS were performed using the programs NAMD (Phillips et al., 2005) with the CHARMM force field (MacKerell et al., 1988; Brooks et al., 1983) (version 27). File preparation and analysis of the MDS trajectories were performed with the VMD (Humphrey et al., 1996) program. The standard topology files of the CHARMM force field were insufficient to generate the structure file to be used in the MDS for all the PDB entries because no topologies for the ligands of the sample exist in the CHARMM force field. For this reason, the PRODRG program (Schüttelkopf and Aalten, 2004) was used to generate the missing topologies and force field parameters. The force field parameters for the covalent interactions (bonds, angles, dihedral angles and impropers) were also normalized to be of the same order of magnitude as those of CHARMM. E.g., in CHARMM the bond potential function is Kb(b–b0)2. For a C—H bond, b is the bond length, b0 (the equilibrium bond length) is 1.090 Å and Kb (the spring constant) is 367.6 kcal mol 1 Å 2. When PRODRG gave a value of 13,971.0 kcal mol 1 Å 2 for Kb, this was divided by a factor of 40 to be of the same order of magnitude as that of CHARMM. Without this normalization, the simulations would crash shortly after they started. For the MDS, each system was placed in a box without periodic boundary conditions and water molecules were added using the “add Solvation box” plugin of VMD. Next, ions were added (only Na+ and Cl ) using the VMD plugin “Add Ions”. The dimensions of the water and ion-filled box were 5 Å from the coordinates of the most extreme atom of the protein–ligand structure in each dimension. The simulations were run at 300 Kelvin with a 2 femtosec timestep. The basic parameters for the MDS were: 1–4 scaling = 1.0, cutoff = 12.0, switchdist = 10.0, pairlistdist = 13.5 without harmonic or periodic boundary conditions, with constant temperature control and Langevin dynamics. A representative configuration file containing the parameters of the simulation can be found in the Supplementary materials. Before the actual simulation, an energy minimization (EM) was performed for 3000 steps. The number of steps of the actual simulation were chosen at about 60,000 steps depending on each system size and time consumption required for each trajectory. The aim of the short MDS was to relax the crystal structure in a way compatible with the force field used, thus making the energy calculations more reliable for the purpose they were used. In order to verify this, after the simulation one more energy minimization was performed for 1000 steps and the non-covalent energies of the two stages (end of MDS and final MDS + EM) were compared. In all but two cases (ID 56 and 57, PDB ID 1ZD2 and 1ZD4, Supplementary material) the differences (comparing only the protein + ligand system) were small (<13 kcal/mol in absolute values). Furthermore, the root mean square deviation (RMSD) for both the protein backbone and the ligand were determined and compared to the initial structure. In all cases the RMSD remained stable and with values of <0.8 Å (for the protein) and <2.0 Å (for the ligand). Even in the two exceptions mentioned above (ID 56 and 57, PDB ID 1ZD2 and 1ZD4) the RMSD for the ligand was stable at about 6.0 Å and 3.5 Å, respectively. For this reason, these two cases were retained for subsequent study. After the simulation, many parameters related to the structure and stability of the protein–ligand systems were extracted; e.g., the total non-bonding interaction energies before and after the simulation, the number of water molecules in the protein–ligand

D. Ioannidis et al. / Computational Biology and Chemistry 56 (2015) 7–12

interaction cavity, the number of hydrophobic, hydrophilic, basic, acidic, etc. aminoacids and many other. They are listed and described in the Supplementary material. 2.3. Neural network prediction A neural network was used to see if some of the parameters (those that can be set in the beginning of a ligand design, by being easily determined from the structure of the respective protein cavity) can be used to predict important parameters of the ligand structure. Eight of the PDB structures were left out of the neural network training to be used for testing the predictions. The free downloaded neural network program justNN (http://www.justnn. com/) was used. Nine parameters were used as input (known or desirable values) of the neural network (parameters that train the neural network) to make predictions. These were (in parentheses the parameter reference in the Supplementary material): total cavity aminoacid number (Parameter BA. This is the number of aminoacids that compose the protein cavity, that is the part of the protein with which the ligand interacts), composition of the cavity in hydrophobic, basic, acidic and neutral polar aminoacids (Parameters BB, BC, BD and BE, respectively), number of a-helices, b-sheets, turns and random coils (Parameters BF, BG, BH and BI, respectively). Seven parameters were used as output (predicted values): total number of ligand atoms (Parameter BY), number of oxygen, nitrogen and sulfur atoms of the ligand (Parameters BR, BS and BT, respectively), number of hydrophobic, polar and amphipathic segments of the ligand (Parameters BJ, BK and BL, respectively). A small number of the sample structures (IDs 2, 7, 8, 9, 18, 47, 55, 57, 69 of the Supplementary material) were used as validating inputs (used internally by the neural network program but not participating in its training). Validating inputs are a vital part of the neural network training. By comparing the interim results (those produced during the training) to those of the validating input, it helps to ascertain that the training process progresses well (supervised learning). The rest of the structures (excluding the eight used for the evaluation) were used as training inputs. The training parameters were: learning rate = 0.6, momentum = 0.8, number of weights and nodes set to automatic selection. The neural network was composed of the input layer (9 nodes), the output layer (7 nodes) and two hidden layers (each of four nodes) so as to evaluate the cases non-linearly. Eight nodes were used (4 in the first layer and 4 in the second) and 80 weights. A full schematic depiction of the neural network is included in the Supplementary materials. Training was stopped when average error <0.01 or 100% of the validating inputs being within 50% of the predicted outputs (that is, the real values being near the predicted ones). 3. Results 3.1. Protein–ligand interaction parameters The non-bonding interaction energies between protein and ligand after the MDS + final EM were negative in all but 3 cases (IDs 56, 57 and 87). These cases (with low positive values) were retained in the sample because their exclusion did not change the results significantly and furthermore, some characteristics of them could be useful in explaining the energy differences in the subsequent linear regression. Anyway, since the zero for the potential energy is arbitrary, their exclusion based solely on this criterion would have been arbitrary. The values of the structural and interaction parameters between protein and ligand were determined from the simulation files and the PDB structures. An initial observation concerning the protein–ligand cavity is that there is considerable heterogeneity in its size, from about 20 to tens of aminoacids. The same applies when the cavity is measured

9

in facets by the pck VMD plugin (Wurtz and Schwarz, 2015) (facets are a measure of the number of polyhedra created by the atoms on the surface of the protein that block the buried atoms from accessing its surface). Also, in 75 out of the 87 cases, the ligand is located in the largest cavity. The cavities of the receptors are comprised mainly of a-helices and those of enzymes mostly of b-sheets (although in the cases of the enzymes this is not so clear-cut). The average of the sample for non-bonding, electrostatic and vdW interaction energies for the protein + ligand after MDS and after MDS + final EM is shown in Table 1. They do not change significantly after the final EM, a fact that indicates a stable interaction system. Linear regression between pairs of these parameters was carried out. The energies (non-bonding or electrostatic) were used as the dependent variable and a variety of the other parameters as the independent variable. The most significant results are shown in Table 2. The regression (y = ax) was carried out with the intercept set to zero. Some of the correlation coefficients have relatively high values, like the number of ligand atoms, or the ratio of N, O and S atoms to total ligand atoms, the hydrophobic/polar aminoacid ratio or the ratio of cavity size to the sum of ligand plus water atoms in the cavity (all of these versus the total non-bonding energy as the dependent variable). The N, O and S atoms mainly affect the electrostatic interactions (electrostatic energy as dependent variable) but also have a strong impact to the total non-bonding energy. Also a less dense cavity with fewer ligand plus water atoms has an impact to the total non-bonding energy that makes the interaction stronger (more negative energy values), due to better orientation of the vdW and electrostatic forces. An interesting factor seems to be the immobile water molecules in the cavity (root mean square position Fluctuation, rmsf < 0.55 Å, “water” parameters of the Supplementary material). In 11 cases there exist no water molecules at all (mobile or immobile). However, there are 15 cases in which at least one immobile water molecule exists in the protein–ligand cavity. These water molecules usually connect the ligand to the protein via thermodynamically favorable interactions (vdW, electrostatic or hydrogen bonds), e.g., two cases with agonists (PDB ID 2AA2 and 3L0J, ID: 4 and 16). In one case of an antagonist (ID: 6, PDB ID 2J4A) there is non-favorable interaction energy between the water molecule and the ligand, which blocks the ligand from interacting with a nearby cavity helix. Immobile water molecules contribute to cavity compartmentalization in this way and this observation expands the conclusion of other works (Yashar et al., 2004) for different modes of action via different compartmentalization of the cavity (strong interactions with different helices for agonist and antagonist). In the case of the enzymes, the mode of interaction is not always so clear-cut, since the water molecules may be involved in transient states with possible rearrangement of covalent bonds, a process not included in the simulation force field. Although the analysis of the interaction energies gives some useful insights into the mode of protein–ligand interaction, no concrete conclusions can be drawn that could help predict more Table 1 Average interaction energies  standard deviations between protein and ligand after the molecular dynamics simulation and after MDS + final energy minimization. MDS (kcal/mol) Total non-bonding interactions Electrostatic interactions vdWa interactions a

van der Waals.

71  36 45  15 25  31

MDS + EM (kcal/mol) 70  38 45  19 25  31

10

D. Ioannidis et al. / Computational Biology and Chemistry 56 (2015) 7–12

Table 2 Linear regression results using parameters described in the Supplementary material. The regression (y = ax) was carried out with with the intercept set to zero. Independent variable

Dependent variable

Water_Bd : atoms H2O in cavity BOd : aminoacids with hydrogen bonds to ligand BOd : aminoacids with hydrogen bonds to ligand CBd : cavity hydrophobic/polar ligand aminoacids CBd : cavity hydrophobic/polar ligand aminoacids CAd: cavity water atoms/ligand atoms CAd: cavity water atoms/ligand atoms BZd: cavity size/(ligand + water atoms) BZd: cavity size/(ligand + water atoms) BYd : number of ligand atoms CCd : NOS ligand atoms/total ligand atoms CCd : NOS ligand atoms/total ligand atoms Water_Cd : immobile water molecules (RMSF < 0.7) Water_Ed : immobile water molecules (RMSF < 0.55) AWd : active heteroatoms (Yes/No) BBd : cavity hybrophobic aminoacids CCd : NOS ligand atoms/total ligand atoms AZd : total NB energy heteroatom-ligand after MDS

Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter Parameter

a b c d e f g

Slope

Fe Ef Fe Fe Ef Fe Ef Fe Ef Ef Fe Ef Fe Fe Fe Fe Lg Fe

90.75 25.26 25.24 41 41.8 29.31 30.1 15.9 16.6 1.26 412.64 413.73 43.38 58.38 49.3 3.4 162.36 0.41

SEaa

rb

F-probc

0.14 2.87 2.94 4.56 4.39 6.62 6.5 1.67 1.54 0.07 37 35.85 5.8 13.23 21.6 0.3 23.29 0.40

0.53 0.69 0.69 0.72 0.73 0.46 0.47 0.73 0.77 0.89 0.77 0.78 0.62 0.42 0.24 0.80 0.60 0.10

8.09  10 7 1.54  10 13 3.99  10 13 1.44  10 13 1.36  10 14 3.15  10 5 1.47  10 5 1.4  10 14 5.91  10 17 2.57  10 32 2.18  10 18 3.83  10 19 5.8  10 11 2.91  10 5 0.025 2.29  10 18 6.04  10 10 0.32

Standard error of the slope. Correlation coefficient. F-probability distribution. Variable descriptions in the Supplementary material. Total non-bonding energy of interaction after MDS + EM. Total non-bonding energy of interaction after MDS. Electrostatic energy of protein–ligand interaction after MDS + EM.

specific aspects of the ligand structural design. For this reason, we tried another approach, using neural networks to predict some structural parameters of the ligand based on structural parameters of the protein. 3.2. Neural network prediction A neural network was used to predict some of the composition parameters for a probable new ligand. Nine parameters were used as known inputs and seven were used as predicted outputs, as described in the Section 2. Eight of the PDB structures (listed in Tables 3 and 4) were excluded from training, to be used as a test of the quality of the predictions. In Table 3, the predicted and the observed results are shown for these seven parameters (total number of ligand atoms, number of nitrogen, oxygen and sulfur atoms, number of hydrophobic, polar and amphipathic segments of the ligand) used as predicted outputs. We chose not to use any interaction energies as either predictor or predicted parameters, since the zero of the potential energy is arbitrarily defined. Instead, we chose to use more clearly defined parameters for the neural

network approach, since they have to cope with a large number of complex relationships. In many cases the predicted values were close to the observed ones. The success rates (according to criteria listed in the table legends) were satisfactory given the small number of structures and better than statistical noise. One reason the criteria in most cases were not strict (exact equality), but looser (acceptable difference between observed and predicted values) was not only to improve the prediction success rate, but also to give leeway when designing a ligand, since stringent criteria for the ligand could sometimes be problematic during its chemical synthesis (for the cases of the number of sulfur atoms, the exact equality criterion was used, since the values in the table are only 0 or 1). Exact equality was also used for the number of amphipathic parts. In addition to the success rate, the binomial test for the number of sulfur atoms was performed. Also, simple linear regression was performed using the predicted parameter as the dependent variable and the observed parameter as the independent variable. The corresponding linear correlation coefficient R is also shown in the table.

Table 3 Comparison of observed and neural network predicted values for the ligand. PDB ID (ligand ID) (paper ID)

Oxygen atoms

Nitrogen atoms

Sulfur atoms

Hydrophobic parts

Hydrophilic parts

Amphipathic parts

Total atoms

Obsa

Predb

Obs

Pred

Obs

Pred

Obs

Pred

Obs

Pred

Obs

Pred

Obs

Pred

2J4A(OEF)(6) 1XAP(TTB)(17) 1VJY(460)(24) 1PF8(SU9)(44) 1ZD2(NC3)(56) 1BL7(SB4)(59) 1TVO(FRZ)(60) 1OKL(MNS)(72)

4 2 0 2 3 0 0 2

3 3 1 1 4 1 1 4

0 0 5 3 2 6 7 2

1 1 5 5 2 5 5 2

0 0 0 0 0 0 0 1

0 0 0 0 1 0 0 0

3 5 2 2 1 2 2 2

2 2 3 3 3 3 3 3

3 2 2 2 4 4 2 1

2 2 2 2 3 2 2 3

1 1 1 2 1 2 1 0

1 1 2 2 1 2 2 1

41 53 35 35 29 45 38 32

56 56 39 39 43 40 47 45

Success rate (criterion) Linear regressionf

87.5% (1c) 0.73

a b c d e f

75% (1c ) 0.90

Observed values. Neural network predicted values. Difference of 1 between observed and predicted value. Observed and predicted value being equal. Difference of 15% between observed and predicted value. Simple linear regression R value.

75% (=d ) 0.14

75% (1c ) 0.84

75% (1c ) 3.62e-16

62.5% (=d ) 0.62

50% (15%e ) 0.57

D. Ioannidis et al. / Computational Biology and Chemistry 56 (2015) 7–12

The success rates are good in most cases and definitely better than if the distribution of the values were random. However, for the number of sulfur atoms the p value of the binomial test is 0.2891, which is not satisfactory. The linear correlation coefficients exhibited good values except for the number of sulfur atoms (R = 0.14) and for the number of hydrophilic parts, for which linearity was non-existent (R = 3.62e-16). A deviation of more than 15% was observed for the total number of ligand atoms in 50% of the cases (4 out of the 8 structures, that is 2J4A, 1ZD2, 1TVO and 1OKL). In order to improve the results, the neural network was run while ignoring the number of random coils during the prediction phase, since random coils do not have one clear structure but a statistical distribution of many structures. Using the new parameters, the predictions (shown in Table 4) improved for some parameters. The success rate for the number of sulfur atoms increased from 75% to 87.5%, and the statistical significance p value of the binomial test improved from 0.2891 to 0.0703. Also, the linear correlation coefficient improved, from 0.14 (that is weak linear relationship), to 0.66. For the total number of atoms the success rate improved from 50% to 62.5%, namely the prediction of only 3 structures deviated more than 15% from the observed values (PDB IDs 2J4A, 1ZD2 and 1OKL). In the case of 2J4A the difference between observed and predicted total ligand atoms can be explained by the existence of bromide atoms in the structure, which are not included in the neural network model, since they do not exist in any other structure. In 1ZD2 the neural network continues to predict a sulfur atom, which is not observed. This case is one where the total non-bonding energy of the system after MDS + EM is positive (0.25 kcal/mol), which is an indication of instability of the system and could explain the discrepancy both in the sulfur and the total number of atoms. In 1OKL, the sulfur atom was correctly predicted (an improvement to the case in Table 3 where the existence of the sulfur atom was not predicted) while the difference in the total number of ligand atoms may be due to the existence of a Zn2+ heteroatom interacting with the protein–ligand cavity. Cavity interacting heteroatoms were defined as atoms neither of the protein nor of the ligand, located within a distance of 4.5 Å from the most extreme atom (either of the protein or the ligand) of the cavity (see the description of Parameter AW in the Supplementary material). If a heteroatom exceeds that distance threshold it will also have a near to zero total non-bonding energy between it and the ligand, and therefore was considered non-interacting for the purposes of this work.

11

4. Discussion The design of new ligands for protein targets (and therefore potential drugs) is based on empirical and semi-empirical approaches, mainly due to the great number of possible conformations, especially of the interaction cavity. It should be noted that when designing a new ligand, it is unknown whether the ligand will interact with the target protein, contrary to this work where the starting point were the PDB structures, with known and well-characterized interactions. These known interactions were studied in order to elucidate some general guidelines that should be taken into consideration when designing a new ligand. The first thing to be determined is which of the possible target protein cavities will be the one the ligand will interact with. As stated in the results section, this is usually the largest cavity. It is self-evident that the vdW radius of the involved atoms should be taken into consideration, as well as the polar, non-polar, hydrophobic and hydrophilic interactions between protein and ligand. A parameter that could help significantly is the inclusion of water molecules in the cavity. If, for some reason the ligand is or must be small (relatively to the cavity size), water molecules can be used to keep it in place, substituting for a possible missing group of the ligand. The immobile water molecules could contribute to the ligand design, by immobilizing it in a specific part of the cavity, according to its mode of interaction (e.g., agonist or antagonist) and therefore helping compartmentalize the cavity. The free-moving mobile water molecules in the cavity do not seem to have a significant contribution. In the case of exceptionally hydrophobic cavities, where there are no water molecules in the cavity and its composition in hydrophobic aminoacids is high (e.g., PDB ID 2QGT ID: 11 and PDB ID 2P7G, ID: 13), no water molecules should perhaps be allowed because the inclusion of water molecules in that place would be thermodynamically unfavorable. It should also be noted that in a newly designed ligand, the simulation will probably have to be more extensive than those performed in this work. In the present study the starting point for the simulations was a PDB structure that was well-defined, which cannot be the case with a new and unknown ligand. A neural network was used for the prediction of some parameters for the design of a new ligand. A large discrepancy present in some cases (where the neural network usually predicts a larger number of ligand atoms than the necessary ones) can be

Table 4 Comparison of observed and neural network predicted values for the ligand, after omission of the number of random coils during the prediction phase. PDB ID (ligand ID) (paper ID)

Oxygen atoms

Nitrogen atoms

Sulfur atoms

Hydrophobic parts

Hydrophilic parts

Amphipathic parts

Total atoms

Obsa

Predb

Obs

Pred

Obs

Pred

Obs

Pred

Obs

Pred

Obs

Pred

Obs

Pred

2J4A(OEF)(6) 1XAP(TTB)(17) 1VJY(460)(24) 1PF8(SU9)(44) 1ZD2(NC3)(56) 1BL7(SB4)(59) 1TVO(FRZ)(60) 1OKL(MNS)(72)

4 2 0 2 3 0 0 2

3 3 1 1 4 1 1 4

0 0 5 3 2 6 7 2

1 1 5 5 2 5 5 2

0 0 0 0 0 0 0 1

0 0 0 0 1 0 0 1

3 5 2 2 1 2 2 2

2 2 3 3 3 3 3 3

3 2 2 2 4 4 2 1

2 2 2 2 3 2 2 3

1 1 1 2 1 2 1 0

1 1 2 2 1 2 2 1

41 53 35 35 29 45 38 32

56 56 39 39 43 39 40 43

Success rate Linear regressionf

87.5%(1c ) 0.73

a b c d e f

75%(1c ) 0.90

Observed values. Neural network predicted values. Difference of 1 between observed and predicted value. Observed and predicted value being equal. Difference of 15% between observed and predicted value. Simple linear regression R value.

87.5%(=d ) 0.66

75%(1c ) 0.84

75%(1c) 3.62e-16

62.5%(=d) 0.62

62.5%(15%e ) 0.57

12

D. Ioannidis et al. / Computational Biology and Chemistry 56 (2015) 7–12

attributed to model problems such as the handling of heteroatoms (e.g., Zn2+) that are found in the interaction cavity. This matter remains unresolved from this work, as the relevant sample with heteroatoms consists of enzyme structures only. The exceptions of 1ZD2 and 1ZD4 may be due to a phosphate group in the crystal structure, non-covalently attached either to the protein or the ligand, located at about 50 Å away from the protein–ligand cavity. This phosphate group was excluded from the MDS because it caused the simulations to crash in both cases. A general reason for the discrepancies could also be that the number of structures used to train the neural network was small, although this can be remedied with the inclusion of larger sets of structures. 5. Conclusion The aim of this work is to provide proof of concept; whether it is possible to determine a relationship between parameters of protein–ligand interaction and whether neural networks can help in this and hence in the design of new ligands or modification of existing ones (mainly for but not necessarily limited to drug design). In conclusion, it is not only a few parameters that determine the interaction of a protein with a ligand, but many factors each one with a small contribution, although some of them as described above are more important, such as the immobile waters, the number of ligand atoms, the ratio of N, O and S atoms to total ligand atoms, the hydrophobic/polar aminoacid ratio and the ratio of cavity size to the sum of ligand plus water atoms in the cavity. If many of these factors are considered simultaneously as with the neural networks described above, a relatively good model for the interactions may emerge. It should also be emphasized that the neural networks work best when fed with large amounts of input data. The acquisition of the interaction parameters (which in most cases were extracted programmatically from the MDS output files, but in some cases the extraction was manual or semi-manual) could be automated. Then the whole procedure would be performed on hundreds, instead of tens, of structures, which when fed into a neural network might yield much better predictions. However, the expansion of the approach described in this paper is not trivial, just by applying the same procedure in more structures. Many of the parameters were determined manually (e.g., the number of hydrophobic, hydrophilic and amphipathic parts of the ligand). This would have to be automated. A possible approach would be to create the electronic density maps of the ligands, project them onto an enveloping surface and separate the parts by setting thresholds. Also, the determination of the topology and force field parameters for a large number of non-standard ligands will need more than the PRODRG program used in this paper. This is a possible avenue for the extension of the work described in this paper. Author contributions The manuscript was written through contributions of all authors. All authors have given approval to this manuscript.

Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j. compbiolchem.2015.02.016. References Berman, H.M., et al., 2000. The Protein data Bank. Nucleic Acids Res. 28, 235–242. Boganand, A.A., Thorn, K.S., 1998. Anatomy of hot spots in protein interfaces. J. Mol. Biol. 280, 1–9. Brooks, B.R., et al., 1983. CHARMM: a program for macromolecular energy minimization and dynamics calculations. J. Comp. Chem. 4, 187–217. deBeer, S., Vermeulen, N., Oostenbrink, C., 2010. The role of water molecules in computational drug design. Curr. Top. Med. Chem. 10 (1), 55–66. Durrant, J.D., McCammon, J.A., 2011. Molecular dynamics simulations and drug discovery. BMC Biol. 9 (71) . Ghersi, D., Sanchez, R., 2011. Beyond structural genomics: computational approaches for the identification of ligand binding sites in protein structures. J. Struct. Funct. Genomics 12 (2), 109–117. Grzybowski, B.A., et al., 2002. From knowledge-based potentials to combinatorial lead design in silico. Acc. Chem. Res. 35 (5), 261–269. Halperin, I., Woldsonand, A., Nussinov, R., 2004. Protein–protein interactions: coupling of structurally conserved residues and of hot spots across interfaces. Implications for docking. Structure 12, 1027–1038. Humphrey, W., Dalke, A., Schulten, K., 1996. {VMD} – {V}isual {M}olecular {D} ynamics}. J. Mol. Graphics 14, 33–38. Jones, S., Thornton, J.M., 1996. Principles of protein–protein interactions. Proc. Natl. Acad. Sci. U. S. A. 93, 13–20. Lins, L., Brasseur, R., 1995. The hydrophobic effect in protein folding. FASEB J. 9, 535–540. MacKerell Jr., A.D., et al., 1988. CHARMM: the energy function and its parameterization with an overview of the program. Encyclopedia Comput. Chem. 1, 271–277. MacKerell Jr., A.D., et al., 1998. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. 102, 3586–3616. Merour, J.Y., et al., 2014. The azaindole framework in the design of kinase inhibitors. Molecules 28(19(12, 19935–19979. Meyer, E.E., Rosenberg, K.J., Israelchvili, J., 2006. Recent progress in understanding hydrophobic interactions. Proc. Natl. Acad. Sci. U. S. A. 15, 739–15746. Phillips, J.C., et al., 2005. Scalable molecular dynamics with NAMD. J. Comp. Chem. 26, 1781–1802. Schüttelkopf, A.W., Aalten, D. M. F. v., 2004. PRODRG a tool for high-throughput crystallography of protein–ligand complexes. Acta Crystallogr. D60, 1355–1363. Schmidt, U., et al., 2009. SuperToxic: a comprehensive database of toxic compounds. Nucleic Acids Res. 37, D295–D299. Schneider, G., Fechner, U., 2005. Computer-based de novo design of drug-like molecules. Nat. Rev. Drug Discov. 4, 649–663. Selent, J., et al., 2010. Induced effects of sodium ions on dopaminergic G protein coupled receptors. PLoS Comput. Biol. 6 (8), e1000884. Sobolev, V., et al., 1997. CASP2 molecular docking predictions with the LIGIN software. PROTEINS: Struct. Funct. Genet. 210–214. Sobolev, V., et al., 1999. Automated analysis of interatomic contacts in proteins. Bioinformatics 15 (4), 327–332. Verma, J., Khedkar, V.M., Coutinho, E.C., 2010. 3D-QSAR in drug design – a review. Curr. Top. Med. Chem. 10, 95–115. Whitesides, G.M., Krishnamurthy, V.M., 2005. Designing ligands to bind proteins. Q. Rev. Biophys. 38 (4), 385–395. Wishart, D.S., et al., 2008. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36, D901–D906. Wurtz, J.M., Schwarz, B., 2015. Pocket Detection Software (pck). Available from: http://schwarz.benjamin.free.fr/Work/Pck/home.htm. Yan, A., Grant, G.H., Richards, W.G., 2008. Dynamics of conserved waters in human Hsp90: implications for drug design. J. R. Soc. Interface 5, S199–S205. Yashar, M., et al., 2004. The predicted 3D structure of the human D2 dopamine receptor and the binding site and binding affinities for agonists and antagonists. Proc. Natl. Acad. Sci. U. S. A. 101 (11), 3815–3820.