Computational Biology and Chemistry 85 (2020) 107237
Contents lists available at ScienceDirect
Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/cbac
Research Article
Protein structure optimization using improved simulated annealing algorithm on a three-dimensional AB off-lattice model
T
Lizhong Zhanga,b, He Maa,c,*, Wei Qiand, Haiyan Lie a
College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110169, China College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang 110142, China c Key Laboratory of Medical Image Computing (Northeastern University), Ministry of Education, Shenyang 110169, China d Department of Electrical and Computer Engineering, College of Engineering, University of Texas, El Paso TX 79968, USA e College of Pharmaceutical and Bioengineering, Shenyang University of Chemical Technology, Shenyang 110142, China b
ARTICLE INFO
ABSTRACT
Keywords: Simulated annealing Off-lattice model Cαspace-filling model Protein folding conformation
This paper proposed an improved simulated annealing (ISA) algorithm for protein structure optimization based on a three-dimensional AB off-lattice model. In the algorithm, we provided a general formula used for producing initial solution, and designed a multivariable disturbance term, relating to the parameters of simulated annealing and a tuned constant, to generate neighborhood solution. To avoid missing optimal solution, storage operation was performed in searching process. We applied the algorithm to test artificial protein sequences from literature and constructed a benchmark dataset consisting of 10 real protein sequences from the Protein Data Bank (PDB). Otherwise, we generated Cα space-filling model to represent protein folding conformation. The results indicate our algorithm outperforms the five methods before in searching lower energies of artificial protein sequences. In the testing on real proteins, our method can achieve the energy conformations with Cα-RMSD less than 3.0 Å from the PDB structures. Moreover, Cα space-filling model may simulate dynamic change of protein folding conformation at atomic level.
1. Introduction Proteins require three-dimensional (3D) structures for exploring biological mechanism (Dill and Maccallum, 2012). Although such conformations can be obtained through different experimental methods, such as protein X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy (Bagaria et al., 2013), etc., these determinations of protein structure are experimentally expensive and time-consuming (Güntert, 2004), which have largely limited their applications. In this situation, determining protein structure by computation-based methods becomes promising, even necessary (Javidpour, 2012a). According to a biological approximation, computation-based modeling for protein structure can be classified: template-based modeling (homology modeling or comparative modeling), template structure identification (fold recognition), template structure refinement and free modeling (Zhang, 2008; Márquez-Chamorro et al., 2015). The first three modeling methods are all based on known protein structures. However, there still exist a large number of proteins which do not show any homology with proteins of known structure. In light of this, free modeling has become a promising opinion for attempting to predict
⁎
structure from amino acid sequence alone without using explicit templates. Actually, free modeling is based on general principles that govern protein folding energetics and/or statistical tendencies of conformational features that native structures acquire (Simons et al., 1999; Bonneau and Baker, 2001; Bradley et al., 2005). According to Anfinsen’s remarkable assertion in 1961, protein folding is a purely physical process that depends only on its specific amino acid sequence and the surrounding solvent (Anfinsen et al., 1961; Floudas et al., 2006). In 1973, Anfinsen’s thermodynamic hypothesis on protein folding conformation clearly states that native structure of a protein corresponds to the global minimum of energy surface of protein (Anfinsen, 1973). A clear explanation for Anfinsen’s hypothesis is funneled energy landscape considering that the energy surface of protein folding is funnel-like shape and the bottom of the funnel corresponds to protein native structure (Ken, 1999; Dobson et al., 1998; Onuchic et al., 1995). The funneled energy landscape, reflecting the principle of minimal frustration hypothesis, is directly supported by the latest research (Tzul et al., 2017). Therefore, it is feasible to predict the native structure of a protein from its amino acid sequence by using theoretical computing method. However, because of
Corresponding author at: College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110169, China. E-mail address:
[email protected] (H. Ma).
https://doi.org/10.1016/j.compbiolchem.2020.107237 Received 3 October 2018; Received in revised form 11 February 2020; Accepted 15 February 2020 Available online 17 February 2020 1476-9271/ © 2020 Elsevier Ltd. All rights reserved.
Computational Biology and Chemistry 85 (2020) 107237
L. Zhang, et al.
the complexity of realistic protein structure, it is difficult to determine the exact three-dimensional structure from amino acid sequence. Thus, highly simplified protein models are often explored instead (Kim et al., 2005). Some simplified models have been proposed by abstracting essential characteristics of physical properties of real proteins. An important classification of reduced models is lattice model, a typical representative of which is HP model (Lau and Dill, 1989). Another lattice model is Face-centered-cubic lattice (FCC) (Agarwala et al., 1997). These two lattice models place strict restrictions on the positions of all amino acid residues and only involve the attractions among hydrophobic residues, and realistic protein conformation is impossible to follow such restricted requirement. To resolve this problem, Stillinger et al. put forward an off-lattice model which comprehensively specifies different interactions for hydrophobic-hydrophobic residue pairs, hydrophilic-hydrophilic residue pairs and hydrophobic-hydrophilic residue pairs. Moreover, these interactions can be are clearly reflected by optimized computing of energy function. In the model, all residues are linked by free-to-rotate chemical bonds, which can make the characteristics of formed conformation close to those of real protein. Stillinger’s off-lattice model only comprises backbone bending energy and Lennard-Jones energy. Afterwards, Irbäck et al. extended this two-dimensional off-lattice model to three dimensions (Irbäck et al., 1997). The energy function of Irbäck’s off-lattice model simultaneously considers three interactions, namely backbone bending energy, torsion energy and Lennard-Jones potentials. Other off-lattice models, such as BLN off-lattice model (Honeycutt and Thirumalai, 1992) and HPNX offlattice model (Cotta, 2005), consider more interactions due to involving more types of residues, which means that computation burden also largely increases. Consequently, Irbäck’s 3D off-lattice model is more suitable to investigate the optimization of protein folding structure. In order to search optimum structure or approximately optimal structure corresponding to the off-lattice model, researchers have designed various computing methods, including conformational space annealing (CSA) (Kim et al., 2005), particle swarm optimization (PSO) (Chen et al., 2011), genetic algorithm (GA) (Rashid et al., 2012), energy landscape paving minimizer (ELP) (Bachmann et al., 2005), improved hybrid algorithm (SATS) (Lin et al., 2014), artificial bee colony (ABC) (Li et al., 2015a; Wang et al., 2013; Li et al., 2015b), annealing contour Monte Carlo (ACMC) (Liang, 2004) and their combinations (Zhang and Lin, 2010; Zhou et al., 2014). Most approaches were studied on Stillinger’s off-lattice model, and only a few methods (Kim et al., 2005; Bachmann et al., 2005; Zhang and Lin, 2010; Liang, 2004) aimed at Irbäck’s off-lattice model and were tested only on artificial protein sequences (Fibonacci sequences). Moreover, previous researches mostly focused on developing optimization algorithms, ignoring specific insights into protein folding conformation. In addition, the construction of testing benchmark and analysis about the obtained results should be further enhanced. To overcome the above-mentioned shortcomings, we proposed an improved simulated annealing (ISA) algorithm based on Irbäck’s offlattice model, and then verified the effectiveness on simulation dataset and self-constructed benchmark dataset. The improved algorithm involves generation methods of initial solution and neighborhood solution, and employed a storage operation to remember the current best solution. Initially, the ISA algorithm found new candidate solutions for the folding structures of Fibonacci sequences. Furthermore, we constructed a benchmark dataset of 10 small proteins (< 100 residues) from the Protein Data Bank (PDB) to demonstrate the effectiveness of the algorithm. Finally, we exhibited structural change of two real proteins with Cα space-filling model in search of low energy structure.
presented a three-dimensional off-lattice protein model (Irbäck et al., 1997), in which a residue is reduced to a bead and its position is represented with the coordinate of Cα. These beads are linked up sequentially by rigid bonds, and then form spatial conformation in 3D space. The model deems that hydrophobic interactions are the main driving forces to form protein structure. Thus, the amino acid residues are divided into two types, hydrophobic (represented by letter “A”) and hydrophilic (represented by letter “B”). In the model, protein folding conformation with N residues will be determined by 2N-5 angle parameters uniquely, which consists of N-2 bond angles (α1, α2…, αN-2) and N-3 torsion angles (β1, β2…, βN-3). The bond and torsion angles are two degrees of freedom of the model. For a given protein sequence with N residues, the energy function E can be expressed: N 2
E=
k1
N 3
cos
k2
i
i=1
N 2
N
4C ( i, j )(rij 12
cos i + i=1
i =1 j =i+2
rij 6)
(1)
In Eq. (1), the three terms are backbone bending energy, torsion energy and Lennard-Jones energy, respectively. The two parameters k1 and k2 denote the strength of species-independent local interactions and finally are set to (-1, 0.5) for structural stability of protein. Backbone bending energy are irrelevant to the sequence of residues, only related to bond angle α∈[-π, π]. Torsion energy be defined as the cosine of torsion angles β∈[-π, π] among the nearby but nonadjacent residues. Lennard-Jones energy specifies the interaction between two nonadjacent residues. The coefficient C(ξi, ξj), representing the depth of minimum potential, is 1.0 for AA pairs and 0.5 for BB and AB pairs, thus producing an intramolecular mix of strong and weak attractions, roughly analogous to the situation in real proteins. The determination method of these parameters, for example, k1, k2 and C(ξi, ξj) in Eq.(1), refer to the literature (Irbäck et al., 1997). The discrete variable ξi embodies residue species in binary code. If the ith residue is hydrophobic, ξi = 1; otherwise, ξi = -1. rij denotes the Euclidean distance between residues i and j, and it can be defined as follows:
rij =
(x i
xj ) 2 + (yi
yj )2 + (z i
z j )2
(2)
In Eq. (2),(xi, yi, zi) is the coordinate of the ith residue. Locations of amino acid residues in three-dimensional space are determined (Li et al., 2015a):
i=1 i=2
(0,0, 0) (0,1, 0) (cos( 1), sin( 1) + 1, 0) (xi , yi , z i ) =
i=3 xi yi
1 1
+ cos( + sin( zi
1
i 2)
cos( 2 ) cos(
i 3),
i
i 3),
+ sin(
4
i 3)
i
N (3)
Here, the first three residues are defined in the plane z = 0. The locations of the remaining residues (i≥4) can be calculated recursively on the basis of the position of the previous one. Protein structure prediction is considered as a global optimization problem with NP-hard property (Unger and Moult, 1993), and therefore in the off-lattice model, protein structure optimization for a protein chain with N residues is equivalent to finding the optimal N-2 bond angles and N-3 torsional angles, which minimize the energy function E. 3. Improved simulated annealing algorithm Simulated annealing (SA) is a heuristic global optimization method (Kirkpatrick et al., 1983). This approach is used to find a good solution of an optimization problem by employing a random search. An important advantage of SA over other methods is the ability to jump out trap in search with a probability mechanism. In practical application,
2. Three-dimensional AB off-lattice model Studying protein structure involves the representation of protein conformations in computer (Javidpour, 2012b). In 1997, Irbäck et al. 2
Computational Biology and Chemistry 85 (2020) 107237
L. Zhang, et al.
the optimum solution of SA may be far from the global minimum, and therefore three effective strategies in the ISA algorithm are adopted to guide the search. These strategies involve initial solution generation, neighborhood solution generation and optimal solution recording.
(0.8 < γ < 1.0). We regard N-2 bond angles and N-3 torsion angles as a whole, and then execute individual disturbance using Eq. (5) for ensuring intensive computing of the energy function Eq. (1) in the neighborhood range.
3.1. Initial solution generation
3.3. Optimal solution recording
The ISA algorithm can produce initial solution, every component of which is an angle value φ∈[-π, π] obtained by Eq. (4). where random(0···1) is a uniform random number with the interval [0,1], and low and high (Low=-π, High=π) are the lower and upper bounds of neighborhood range, respectively. The initial solution consisting of 2N-5 angle values can be achieved by applying Eq. (4). Accordingly, all components of initial solution will be limited in [-π, π].
To avoid missing the optimal solution in search of low energy conformation, it is necessary to record the current optimal one timely. At the beginning of the ISA algorithm, the initial solution is considered as the optimal value. Whenever one neighborhood solution is generated and its corresponding energy value is less than that of the current solution, the solution is stored as the optimal one. When the running of the ISA algorithm is over, the optimal solution is exactly the 3D coordinates of all Cα atoms, corresponding to the optimal folding conformation.
3.2. neighborhood solution generation
3.4. Algorithm description
The generation of neighborhood solution is crucial for getting an optimal solution. Traditional SA has a low convergence, mainly because it uses random sampling to generate candidate solution. With the random sampling, the neighborhood solution would be searched blindly. In the ISA algorithm, motivated by the genetic method, we introduced a multivariable disturbance into the current solution to generate neighborhood solution. Implementation of disturbance is presented as randomly selecting the ith angle of 2N-5 angles and performing mutation. In short, the ith element selected is described as vi, i and the new element vnew of neighborhood solution is calculated by Eq. (5).
The procedure of the ISA algorithm can be described as follows: Step 1. Obtain the sequence of a protein. If the test sequence is from a real protein, it is needed to transform it into one AB string according to K-D method (Kite and Doolittle, 1982). The K-D method is adopted to distinguish between hydrophobic and hydrophilic residues. The amino acids I, V, L, P, C, M, A, G belong to class A, while D, E, F, H, K, N, Q, R, S, T, W and Y belong to class B. Choose an initial temperature T0 > 0 and a terminal temperature Te approximating to 0. Let the current temperature Tk=T0. Set the cooling coefficient γ, the Markov chain length ML and the number of temperature iteration k = 1. Step 2. Generate 2N-5 angle values for N-2 bond angles and N-3 torsion angles as the initial solution X0, and use the 2N-5 angle values to calculate energy value and 3D coordinates of Cα atoms of the protein sequence. Note that, set Xbest=X0, X = X0 (X is the current solution and Xbest is the best solution). Calculate kmax by Eq. (8) and set a value for λ. Step 3. Set the iteration number m = 1 at the current temperature Tk. Produce a neighborhood solution Xnew (new solution) by Eq. (5), calculate E, Enew and Ebest by Eq. (1) respectively, and obtain 3D coordinates of all Cα atoms corresponding to Xbest by Eq. (3). If Enew < Ebest, set Xbest=Xnew and Ebest = Enew. Set △E = Enew-E. If △E < 0 or random(0···1) < exp(-△E/Tk), set X = Xnew. Set m = m+1. The process of the Step 3 is repeated until m = ML. Step 4. If Tk < Te, stop the application; otherwise, update the current temperature by Tk+1=γTk and set k = k+1, and then go back to the Step 3.
= low + random (0
i new
=
i
1) × (high
+ sgn × random (0
low )
1) × (1
k / kmax )
(4)
(5)
Here, i in Eq. (5), determined by Eq. (6), is an random integer between 1 and 2N-5. The value of user-defined sign function sgn can be obtained by Eq. (7). (1-k/kmax) λ is an important item for ensuring the diversity of neighborhood solution. The parameter λ is defined as a tuned constant representing heterogeneous degree, based on the idea of nonuniform mutation in genetic algorithm. The parameter kmax represents the maximum number of temperature iteration calculated by Eq. (8). Another parameter k is an integer variable and denotes the number of temperature iteration, and its value changes from 1 to kmax.
i = random ( ) mod (2N
5) + 1
(6)
0.5
(7)
kmax = log10 (Te/ T0)/log10 ( )
(8)
sgn = random (0
1)
4. Results and discussions
In Eq. (8), T0 and Te represent initial and stopping temperatures, respectively. The parameter γ is called cooling coefficient
In this section, comprehensive simulations for protein folding
Fig. 1. The lowest energy conformations obtained by the ISA algorithm for Fibonacci sequences. 3
Computational Biology and Chemistry 85 (2020) 107237
L. Zhang, et al.
structure were carried out on both artificial and real protein sequences. The Fibonacci sequences are usually used as virtual testing benchmarks (artificial protein sequences). Real protein sequences were obtained from the PDB (https://www.rcsb.org/). Our ISA algorithm was implemented with Microsoft Visual C++ 6.0. The simulations were executed on an Intel core 2 Duo CPU with 4GB RAM running at 2.7 GHz under Windows 7. The parameters in the ISA algorithm were tuned through empirical tests on computer, and then they were set to as follows: the initial temperature T0 = 1.0, the terminal temperature Te = 10−12, the cooling coefficient γ = 0.99, and the tuned constant λ = 3.0. The Markov length ML = 50000 for Fibonacci sequences, and ML = 10000 for real protein sequences respectively. The ISA algorithm runs independently ten times for each protein sequence. 4.1. Testing on artificial protein sequences Fig. 2. Comparison of lowest energy values of six algorithms on four Fibonacci sequences (N = 13, 21, 34, 55).
This part used Fibonacci sequences generated in (Kim et al., 2005; Bachmann et al., 2005; Lin et al., 2014; Zhang and Lin, 2010; Liang, 2004) for comparison. These Fibonacci sequences are defined
Fig. 3. The lowest energy conformations obtained by the ISA algorithm for real protein sequences.
Fig. 4. Structural changes in simulating folding conformation of 5PTI. 4
Computational Biology and Chemistry 85 (2020) 107237
L. Zhang, et al.
Fig. 5. Structural changes in simulating folding conformation of 1UTG.
Table 1 The lowest energy values reported in the literature and obtained by the ISA algorithm for Fibonacci sequences with N = 13, 21, 34, 55. Sequence
EELP (Bachmann et al., 2005)
EACMC (Liang, 2004)
ECSA (Kim et al., 2005)
ELAGA (Zhang and Lin, 2010)
ESATS (Lin et al., 2014)
EISA
S13 S21 S34 S55
−26.498 −52.917 −92.746 −172.696
−26.507 −51.757 −94.043 −154.505
−26.471 −52.787 −97.732 −173.980
−26.498 −52.917 −98.765 −176.542
−26.507 −52.917 −99.876 −178.986
−29.474 −55.769 −101.649 −180.686
that the ISA algorithm has an improvement in searching lower energy of Fibonacci sequences, as shown in Fig. 2.
recursively by S0=A, S1=B, Si=Si-2*Si-1 (i≥2), where “*” is a concatenation operator. Therefore, the full sequences of chains corresponding to S13, S21, S34 and S55 can be obtained respectively, where the subscript indicates the length of Fibonacci sequence. In Figs. 1,3, 4 and 5, the black and grey spheres indicate hydrophobic and hydrophilic residues, respectively. Four Fibonacci sequences can fold into compact structures with hydrophobic cores, as shown in Fig. 1. The numerical results reported from the literature and obtained by the ISA algorithm are shown in Table 1.These results are calculated with Irbäck’s energy function Eq. (1). The testing indicates
4.2. Testing on real protein sequences Though the existing researches (Chen et al., 2011; Li et al., 2015a, b; Lin et al., 2014; Zhou et al., 2014; Bošković and Brest, 2016; Zhou et al., 2018) have reported their testing results on real protein sequences based on an off-lattice protein model, these methods are all based on Stillinger’s energy function (Stillinger et al., 1993), not Irbäck’s energy
Table 2 Details of amino-acid sequences of real proteins in the ISA algorithm. PDB ID
N
Protein sequence
Binary sequence
4RXN
54
1FCA
55
2GB1
56
2OVO
56
5PTI
58
1IGD
61
1I2T
61
1UTG
70
3ICB
75
2YGS
92
MKKYTCTVCGYIYDPEDGDPDDGVNPGTDFKDIPDDWVCPLCGVGKDEFEEVEE AYVINEACISCGACEPECPVDAISQGGSRYVIDADTCIDCGACAGVCPVDAPVQA MTYKLILNGKTLKGETTTEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTE LAAVSVDCSEYPKPACTMEYRPLCGSDNKTYGNKCNFCNAVVESNGTLTLSHFGKC RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA MTPAVTTYKLVINGKTLKGETTTKAVDAETAEKAFKQYANDNGVDGVWTYDDATKTFTVTE HRQALGERLYPRVQAMQPAFASKITGMLLELSPAQLLLLLASEDSLRARVDEAMELIIAHG GICPRFAHVIENLLLGTPSSYETSLKEFEPDDTMKDAGMQMKKVLDSLPQTTRENIMKLTEKIVKSPLCM KSPEELKGIFEKYAAKEGDPNQLSKEELKLLLQTEFPSLLKGPSTLDELFEELDKNGDGEVSFEEFQVLVKKISQ MDAKARNCLLQHREALEKDIKTSYIMDHMISDGFLTISEEEKVRNEPTQQQRAAMLIKMILKKDNDSYVSFYNALLHEGYKDLAALLHDGIP
ABBBBABAAABABBABBABABBAABAABBBBBAABBBAAAAAAAABBBBBBABB ABAABBAAABAAAABABAAABAABBAABBBAABABBAABAAAAAAAAAABAAABA ABBBAAABABBABABBBBBAABAABABBABBBBABBBAABABBBBBBABBBBBABB AAAABABABBBABAAABABBBAAAABBBBBBABBABBABAAABBBABABABBBABA BABBAABAABBAAABABAABBBBBABAAAABBBABAAABABBBBBBBABBAABBAAAA ABAAABBBBAAABABBABABBBBBAABABBABBABBBBABBBAABAABBBBBABBBBBABB BBBAAABBABABABAABAABABBABAAAABABAABAAAAAABBBBABABABBAABAAAABA AAAABBABAABBAAAABABBBBBBABBBBABBBABBAAABABBAABBAABBBBBBAABABBBAABBAAAA BBABBABAABBBBAABBABABBABBBBABAAABBBBABAABAABBABBABBBABBBABABABBBBBBAAABBABB ABABABBAAABBBBAABBBABBBBAABBAABBABABABBBBBABBBABBBBBAAAAABAAABBBBBBBABBBBAAABBABBBAAAAABBAAA
PDB ID: Protein Data Bank identifier.
5
Computational Biology and Chemistry 85 (2020) 107237
L. Zhang, et al.
that in a natural protein structure, the hydrophobic residues are prone to form a core that encounters water molecules, thus a hydrophobic core will be surrounded by hydrophilic residues. The testing results on real protein sequences are shown in Table 3.
Table 3 Prediction results obtained by the ISA algorithm for 10 runs. PDB ID
N
Eaverage
Ebest
S.D.
4RXN 1FCA 2GB1 2OVO 5PTI 1IGD 1I2T 1UTG 3ICB 2YGS
54 55 56 56 58 61 61 70 75 92
−169.586 −196.598 −166.923 −177.346 −189.562 −186.802 −209.593 −227.532 −235.934 −297.238
−174.612 −203.330 −176.104 −186.848 −195.335 −194.377 −220.017 −232.549 −246.683 −306.126
5.737 6.001 4.581 5.791 4.203 3.749 6.028 4.019 7.467 7.014
4.2.1. Structural changes of protein folding conformation A numerical simulation was performed for the folding structure of 5PTI and 1UTG as two examples by running the ISA algorithm on computer, and then their representative conformations were selected and depicted with Cα space-filling model, as shown in Figs. 4 and 5. The Cα space-filling model, omitting rigid bonds between two adjacent residues, is indeed a coarse-grained representation of protein tertiary structure based on Irbäck’s off-lattice model (Irbäck et al., 1997). We can observe the aggregation of hydrophobic residues inside the folding conformation and the reconfiguration of hydrophilic residues on the surface of the folding conformation. This makes it possible to simulate dynamic change of protein folding conformation from high energy to low energy with Cα space-filling model at atomic level.
The S.D. is standard deviation of energy values.
function (Irbäck et al., 1997). To the best of our knowledge, the testing results on Irbäck’s energy function for real protein sequences have not yet been reported in the previous literature (Kim et al., 2005; Bachmann et al., 2005; Zhang and Lin, 2010; Liang, 2004). To further validate the reliability of the ISA algorithm, we constructed a benchmark dataset of 10 small proteins (N < 100) from the PDB and obtained their classification results, as shown in Table 2. Using the ISA algorithm, we depicted low energy conformations of the protein sequences in Table 2, as shown in Fig. 3. These protein sequences tend to fold into compact structures mainly due to the fact
4.2.2. Comparison between predicted and experimentally determined structures The quality of predicted structure can be evaluated by similarity comparison with the known structure from the PDB. Quality measurements have been made in terms of the root-mean-square deviation (RMSD) between the positions of the Cα atoms of the predicted and
Fig. 6. Superposition of the predicted (in black) and PDB (in grey) structures using Cα space-filling model.
6
Computational Biology and Chemistry 85 (2020) 107237
L. Zhang, et al.
experimentally resolved structures, as shown in Eq. (9). N
RMSD =
1 |rimodel N i=1
rireal |2
Appendix A. Supplementary data Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.compbiolchem.2020. 107237.
(9)
In Eq. (9), and are vectors representing the positions of ith Cα atoms of the predicted and PDB structures respectively. The RMSD value is expressed in length units, and the most commonly used unit of RMSD in structural biology is the Ångström (Å) equal to 10–10m. r real is extracted from the PDB file in a special format (.pdb), and r model is achieved by the ISA algorithm. The superposition of the predicted and PDB structures with Cα space-filling model is shown in Fig. 6. It is found that these predictions, with RMSDs ranging from 1.5 to 2.5 Å, may be regarded as quite successful results as suggested in (Reva et al., 1998). However, Irbäck’s off-lattice model we used involves three energy terms, and adding other interactions may improve prediction precision. With reference to Lennard-Jones term in Irbäck’s off-lattice model (Irbäck et al., 1997) and hydrogen bonding (HB) term in ECEPP/3 force field (McAllister and Floudas, 2010), we may define (r 12 rij 10) as a simplified term representing backbone hy(i, j ) HB ij drogen bonding, and incorporate it into the energy function of Irbäck’s model for better prediction in later work.
rimodel
rireal
References Agarwala, R., Batzoglou, S., Dančik, V., et al., 1997. Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. J. Comput. Biol. 4 (3), 275–296. Anfinsen, C.B., 1973. Principles that govern the folding of protein chains. Science 181 (4096), 223–244. Anfinsen, C.B., Haber, E., Sela, M., et al., 1961. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc. Natl. Acad. Sci. U.S.A. 47 (9), 1309–1314. Bachmann, M., Arkin, H., Janke, W., 2005. Multicanonical study of coarse-grained offlattice models for folding heteropolymers. Phys. Rev. E 71, 031906. Bagaria, A., Jaravine, V., Güntert, P., 2013. Estimating structure quality trends in the Protein Data Bank by equivalent resolution. Comput. Biol. Chem. 46 (2), 8–15. Bonneau, R., Baker, D., 2001. Ab initio protein structure prediction: progress and prospects. Annu. Rev. Biophys. Biomol. Struct. 30 (1), 173–189. Bošković, B., Brest, J., 2016. Differential evolution for protein folding optimization based on a three-dimensional AB off-lattice model. J. Mol. Model. 22 (10), 252. Bradley, P., Malmström, L., Qian, B., et al., 2005. Free modeling with Rosetta in CASP6. Proteins Struct. Funct. Bioinform. 61 (S7), 128–134. Chen, X., Lv, M., Zhao, L., et al., 2011. An improved particle swarm optimization for protein folding prediction. Int. J. Inf. Eng. Electron. Bus. 3 (1), 1–8. Cotta, C., 2005. Hybrid evolutionary algorithms for protein structure prediction under the HPNX model. . Advances in soft computing 2 33, 525–534. Dill, K.A., Maccallum, J.L., 2012. The protein-folding problem, 50 years on. Science 338, 1042–1046. Dobson, C.M., Šali, Andrej, Karplus, M., 1998. Protein folding: a perspective from theory and experiment. Angew. Chemie Int. Ed. 37, 868–893. Floudas, C.A., Fung, H.K., Mcallister, S.R., et al., 2006. Advances in protein structure prediction and de novo protein design: a review. Chem. Eng. Sci. 61 (3), 966–988. Güntert, P., 2004. Automated NMR structure calculation with CYANA. Methods Mol. Biol. 278, 353–378. Honeycutt, J.D., Thirumalai, D., 1992. The nature of folded states of globular proteins. Biopolymers 32 (6), 695–709. Irbäck, A., Peterson, C., Potthast, F., Sommelius, O., 1997. Local interactions and protein folding: a three-dimensional off-lattice approach. J. Chem. Phys. 107 (1), 273–282. Javidpour, L., 2012a. Computer simulations of protein folding. Comput. Sci. Eng. 14 (2), 97–103. Javidpour, L., 2012b. Computer simulations of protein folding. Comput. Sci. Eng. 14 (2), 97–103. Ken, A., 1999. Polymer principles and protein folding. Protein Sci. 8 (6), 1166–1180. Kim, S.Y., Lee, S.B., Lee, J., 2005. Structure optimization by conformational space annealing in an off-lattice protein model. Phys. Rev. E 72 (1), 011916. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220 (4598), 671–680. Kite, J., Doolittle, R.F., 1982. A simple method for displaying the hydropathic character of a protein. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157 (1), 105–132. Lau, K.F., Dill, K.A., 1989. A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 22 (10), 3986–3997. Li, B., Chiong, R., Lin, M., 2015a. A balance-evolution artificial bee colony algorithm for protein structure optimization based on a three-dimensional AB off-lattice model. Comput. Biol. Chem. 54, 1–12. Li, B., Lin, M., Liu, Q., et al., 2015b. Protein folding optimization based on 3D off-lattice model via an improved artificial bee colony algorithm. J. Mol. Model. 21 (10), 261. Liang, F., 2004. Annealing contour Monte Carlo algorithm for structure optimization in an off-lattice protein model. J. Chem. Phys. 120 (14), 6756–6763. Lin, X., Zhang, X., Zhou, F., 2014. Protein structure prediction with local adjust tabu search algorithm. BMC Bioinformatics 15 (Suppl. 15), S1. Márquez-Chamorro, A.E., Asencio-Cortés, G., Santiesteban-Toca, C.E., Aguilar-Ruiz, J.S., 2015. Soft computing methods for the prediction of protein tertiary structures. Appl. Soft Comput. 35 (C), 398–410. McAllister, S.R., Floudas, C.A., 2010. An improved hybrid global optimization method for for protein tertiary structure prediction. Comput. Optim. Appl. 45 (2), 377–413. Onuchic, J.N., Wolynes, P.G., Lutheyschulten, Z., et al., 1995. Toward an outline of the topography of a realistic protein-folding funnel. Proc. Natl. Acad. Sci. U.S.A. 92 (8), 3626–3630. Rashid, M.A., Hoque, M.T., HakimNewton, M.A., et al., 2012. A new genetic algorithm for simplified protein structure prediction. AI 2012. Adv. Artif. Intell. 107–119. Reva, B.A., Finkelstein, A.V., Skolnick, J., 1998. What is the probability of a chance prediction of a protein structure with an rmsd of 6 Å? Fold. Des. 3 (2), 141–147. Simons, K., Bonneau, R., Ruczinski, I., et al., 1999. Ab initio structure prediction of CASP3 targets using ROSETTA. Proteins S3, 171–176. Stillinger, F.H., Head-Gordon, T., Hirshfeld, C.L., 1993. Toy model for protein folding. Phys. Rev. E 48 (2), 1469–1477. Tzul, F.O., Vasilchuk, D., Makhatadze, G.I., 2017. Evidence for the principle of minimal frustration in the evolution of protein folding landscapes. Proc. Natl. Acad. Sci. U.S.A.
5. Conclusion and future work In this paper, the ISA algorithm was proposed for protein structure optimization on Irbäck’s off-lattice model. The method was compared with the other five algorithms (Kim et al., 2005; Bachmann et al., 2005; Lin et al., 2014; Zhang and Lin, 2010; Liang, 2004) on artificial protein sequences and evaluated with real dataset. The results on artificial sequences indicate that the ISA algorithm can significantly outperform other reported algorithms. For real protein sequences, we performed verification testing and yield folding conformations with the RMSDs less than 3.0 Å from the PDB structures. When our method is applied to optimize protein structure for large proteins, it can yield the similar results with the predictions of smaller proteins. But as we know, the process will be time-consuming with growth of sequence length. Therefore, we will focus on other computing strategies (e.g., parallel processing) to shorten running time. Besides, the final-minimized conformations obtained from various methods may be inconsistent with the lowest energy conformations because of the incomplete interactions of the current off-lattice model, so the study about adding other interactions (e.g., hydrogen bonding interactions) in the well-established offlattice model will be another future work. CRediT authorship contribution statement Lizhong Zhang: Conceptualization, Methodology, Software, Writing - original draft. He Ma: Conceptualization, Methodology, Supervision. Wei Qian: Supervision. Haiyan Li: Validation. Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgments This work was supported the National Natural Science Foundation of China [grant numbers 61672146, 81671773]; the Fundamental Research Funds for the Central Universities [grant number N150408001]; and the 2011 Collaborative Innovation Program of Guizhou Province [grant number 2015-04]. 7
Computational Biology and Chemistry 85 (2020) 107237
L. Zhang, et al. 114 (9), E1627–E1632. Unger, R., Moult, J., 1993. Finding the lowest free energy conformation of a protein is an NP-hard problem: proof and implications. Bull. Math. Biol. 55 (6), 1183–1198. Wang, Y., Guo, G.D., Chen, L.F., 2013. Chaotic artificial bee colony algorithm: a new approach to the problem of minimization of energy of the 3D protein structure. Molecular biology 47 (6), 894–900. Zhang, Y., 2008. Progress and challenges in protein structure prediction. Curr. Opin.
Struct. Biol. 18 (3), 342–348. Zhang, X.L., Lin, X.L., 2010. Effective 3D protein structure prediction with local adjustment genetic-annealing algorithm. Interdiscip. Sci. 2 (3), 256–262. Zhou, C., Hou, C., Wei, X., et al., 2014. Improved hybrid optimization algorithm for 3D protein structure prediction. J. Mol. Model. 20 (7), 2289. Zhou, C., Sun, C., Wang, B., et al., 2018. An improved stochastic fractal search algorithm for 3D protein structure prediction. J. Mol. Model. 24 (6), 125.
8