Accepted Manuscript Title: Adoption of an improved PSO to explore a compound multi-objective energy function in protein structure prediction Author: Shuangbao Song Junkai Ji Xingqian Chen Shangce Gao Zheng Tang Yuki Todo PII: DOI: Reference:
S1568-4946(18)30430-7 https://doi.org/doi:10.1016/j.asoc.2018.07.042 ASOC 5010
To appear in:
Applied Soft Computing
Received date: Revised date: Accepted date:
16-3-2018 8-6-2018 22-7-2018
Please cite this article as: Shuangbao Song, Junkai Ji, Xingqian Chen, Shangce Gao, Zheng Tang, Yuki Todo, Adoption of an improved PSO to explore a compound multiobjective energy function in protein structure prediction, (2018), https://doi.org/10.1016/j.asoc.2018.07.042 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Adoption of an improved PSO to explore a compound multi-objective energy function in protein structure prediction Shuangbao Songa , Junkai Jia , Xingqian Chena , Shangce Gaoa , Zheng Tanga , Yuki Todob,∗ a Faculty of Engineering, University of Toyama, Toyama-shi, 930-8555 Japan of Electrical and Computer Engineering, Kanazawa University, Kanazawa-shi, 920-1192 Japan
ip t
b Faculty
cr
Abstract
an
us
The protein structure prediction (PSP) problem, i.e., predicting the three-dimensional structure of a protein from its sequence, remains challenging in computational biology. The inaccuracy of existing protein energy functions and the huge conformation search space make the problem difficult to solve. In this study, the PSP problem is modeled as a multi-objective optimization problem. A physics-based energy function and a knowledge-based energy function are combined to construct the three-objective energy function. An improved multi-objective particle swarm optimization coupled with two archives is employed to execute the conformation space search. In addition, a mechanism based on Pareto non-dominated sorting is designed to properly address the slightly worse solutions. Finally, the experimental results demonstrate the effectiveness of the proposed approach. A new perspective for solving the PSP problem by means of multi-objective optimization is given in this paper.
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Particle swarm optimization (PSO) is a stochastic population-based optimization method. It was originally proposed by Kennedy and Eberhart in 1995 [1] and was further improved by Shi and Eberhart in 1998 [2]. It is a nature-inspired meta-heuristic method that is based on social behavior of bird flocking or fish schooling. The social metaphor in PSO is that every individual can learn knowledge from the experiences of itself and the entire population, and an individual can share this knowledge with every possible individual in the population. Given the effectiveness and efficiency of PSO for solving single-objective optimization problems (SOOPs) [3, 4], extending PSO for solving multi-objective optimization problems (MOOPs) has become an interest of researchers in recent years [5]. Coello et al. provide the paradigm of multi-objective particle swarm optimization (MOPSO) in [6], wherein the notion of elitism was first incorporated into MOPSO. Moreover, most research about MOPSO has revolved around two points: 1) a proper mechanism of updating personal best and global best [7, 8] and 2) the balance between the convergence speed and the diversity of Pareto optimal solutions [9, 10]. Furthermore, given the promising performance of MOPSOs, they have been implemented in many complex practical problems [11, 12, 13]. The protein structure prediction (PSP) problem is also a very complex practical problem. It remains one of the most challenging problems in computational biology [14, 15]. PSP was proposed by the Nobel Prize Winner Anfinsen in the early 1970s [16] and described as predicting the three-dimensional (3D) structure of a protein staring from its amino acid sequence. Two main factors make research of the PSP problem promising and necessary. First, the knowledge hidden in the 3D structure of proteins is critical to explain the functions of proteins, such as catalysis, storage, motion and communication [17]. In fact, resolving the PSP problem has contributed to the development of biotechnology [18]. Second, there are approximately 60 million protein sequences with unknown 3D structures in UniProtKB
pt
2 3
1. Introduction
Ac ce
1
ed
M
Keywords: multi-objective optimization problem, particle swarm optimization, knowledge-based energy function, non-dominated sorting, protein structure prediction
∗ Corresponding
author Email address:
[email protected] (Yuki Todo)
Preprint submitted to Applied Soft Computing
September 5, 2018
Page 1 of 21
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
ip t
29
cr
28
us
27
an
26
M
24 25
ed
23
database [19]. However, protein structure determination via experimental methods (e.g., X-ray crystallography and NMR spectroscopy) is time-consuming and expensive. Thus, determining the protein structures by computational technologies becomes promising, even necessary. In general, the methods for solving the PSP problem can be divided into two main categories: template-based modeling (TBM) and free modeling (FM) [20]. TBM, which is also known as homology modeling, relies on the fact that protein structures are conserved, and similar sequences lead to similar protein structures [21]. The 3D structure of a target protein is constructed according to the known structures (usually released in the PDB database [22]) of its similar proteins (the “template”). Moreover, MODELLER [23], SWISS-MODEL [24], and I-TASSER [25] are considered state-of-the-art methods among TBMs. On the other hand, if the similar proteins of a target protein are not identified, the FM should be used. FMs are considered to have greater value in theory than TBMs [26]. Recently, the most successful FMs are Rosetta [27] and QUARK [28]. Nevertheless, one success factor of these methods is using partial templates [29]. These templates are generally not considered pure FMs. The vast majority of FMs treat the PSP problem as a SOOP [29]. FMs reach the native protein structure by optimizing a well-turned (physics- or knowledge-based) potential energy function. The hypothesis underlying these FMs is that a native protein structure steadily remains in the minimum of the free energy [16]. However, the performance of existing FMs is minimally satisfactory due to two main reasons. First, the conformation search space of the PSP problem is very large [30]. Second, the existing potential energy functions guiding the search process are inaccurate [31]. In addition, the landscapes of these energy functions are so rugged [30] and complex that optimizing them is not an easy task. Thus, designing an effective energy function and executing an efficient conformation search scheme are two key factors for solving the PSP problem [32]. Recently, modeling the PSP problem as a MOOP provides new perspectives for solving it because the energy function (multi-objective) and the search scheme in MOOP are completely different from the search schemes in SOOP[33]. Numerous multi-objective approaches have been proposed in the literature [33]. For example, Cutello et al. modified the well-known multi-objective approach PAES [34] with immune-inspired operators (I-PAES) to solve the PSP problem [35]. Venske et al. proposed an adaptive multi-objective differential evolution called ADEMO/D to solve the PSP problem [36]. These two works transformed the PSP problem into a MOOP by decomposition, where a physics-based energy function CHARMM27 [37] was was decomposed into two objectives: bond and nonbond energy. These studies differed mainly in their search strategies. On the other hand, adding new supplementary objectives to the original energy function for multi-objectivization has also drawn the attention of researchers. Brasil et al. proposed a method called MOEA with many tables (MEAMT) to address the PSP problem [38]. The method is a relaxed MOEA, where customized criteria replacing non-dominance criterion are used to orient the search process. The multi-objective function in the method consists of four physical terms: Van der Waals, electrostatic, solvation, and hydrogen bond. Considering the effect of solvent, Gao et al. proposed a MOEA with three objectives (MO3) to tackle the PSP problem [39]. In their work, the solvent-accessible surface area was incorporated as the third objective coupled with bond and non-bond energy. These works all strengthened the advantages of modeling the PSP as a MOOP. However, some important issues seem to have been ignored in these approaches. First, too much attention is given to physics-based energy functions (PBEFs). In fact, it is well accepted that knowledge-based energy functions (KBEFs) are more effective than PBEFs [31]. To the best of our knowledge, combining PBEF and KBEF in a MOEA for solving the PSP problem has not been explored to date. Second, it does not seem reasonable that only Pareto optimal solutions are generated as the final prediction results by a MOEA for the PSP problem. Analogous to the FM-based SOOP [27, 28], numerous solutions with slight worse energy are maintained before the final selection phase. Pareto optimal solutions can be considered extreme solutions in terms of energy functions. Some slightly worse solutions should also be given attention. Moreover, an explicit mechanism to control the number of output solutions (the lower limit) in these MOEAs is typically missing. These motivations prompted us to perform this research to make progress in these respects. In this study, a FM approach based on MOPSO is proposed to address the PSP problem. A PBEF CHARMM22 [40] and a KBEF dDFIRE [41] are combined to construct the three-objective energy function. Of note, CHARMM22 is also decomposed into bond and non-bond energy as the first and second objectives of the study, respectively, following the ideas of previous studies [35, 36]. Moreover, a MOPSO coupled with two archives constitutes the core of the proposed approach. An archive maintaining all non-dominated solutions guides the search process. The other archive stores all acceptable solutions, where a sorting mechanism based on Pareto dominance explicitly controls the
pt
22
Ac ce
21
2
Page 2 of 21
79
2. Materials and basic concepts
76 77
80
2.1. Standard PSO The population of PSO is called a swarm, and the members (individuals) of the swarm are called particles. PSO solves a problem by moving these particles (candidate solutions) in the search space to locate the optima. Each particle can be described by three vectors: x(i)(t) = (x1(i)(t) , x2(i)(t) , ..., xd(i)(t) ) (i)(t) (i)(t) v(i)(t) = (v(i)(t) 1 , v2 , ..., vd ) (i)(t)
p
86 87 88 89 90
91
an
M
84 85
where d is the dimension of the search space. x represents the position of the ith particle in generation t. v(i)(t) is the particle velocity, which determines the direction and magnitude of change. The personal best vector p(i)(t) stores the best position that the ith particle has visited up to the generation t. In the standard PSO [2], every particle updates its velocity and position in generation t with the following equations: (i)(t) v(i)(t+1) = wv(i)(t) + C1 R(i)(t) − x(i)(t) ) + C2 R2(i)(t) (g(t) − x(i)(t) ) 1 (p (2) x(i)(t+1) = x(i)(t) + v(i)(t+1) where w is the inertia weight, which control how much the previous velocity influences the current velocity. C1 and C2 are learning factors and also called cognitive and social weight, respectively. Global best vector g(t) stores the best position visited by the entire swarm up to generation t. Of note, the topology of the swarm used in the standard PSO is considered as global topology [42], which allows all particles exchanging information with each other by means of sharing the unique global best vector g. R(i)(t) and R(i)(t) are d × d diagonal matrices generated for each particle at each 1 2 iteration, separately. Every element in the main diagonal of the matrices is a random number that satisfies the uniform distribution in [0, 1].
ed
83
(i)(t)
2.2. Multi-objective optimization problem In general, a multi-objective optimization problem (MOOP) can be defined mathematically as follows: Minimize f(x) = ( f1 (x), f2 (x), ..., fk (x))
S ub ject to x = (x1 , x2 , ..., xn ) ∈ X
92 93 94 95 96
(1)
pd(i)(t) )
pt
82
p(i)(t) 2 , ...,
Ac ce
81
=
(p(i)(t) 1 ,
cr
75
us
74
ip t
78
quantity of the output solutions. Furthermore, a clustering method is introduced to filter the final predicted structures. Finally, a set of benchmark proteins is used to test and analyze the proposed approach. The experimental results demonstrate the effectiveness of the proposed approach for solving the PSP problem. The remainder of this paper is organized as follows. PSO, MOOP, protein structure, and energy functions are described in section 2. The details of the proposed approach are provided in section 3. The experiment and analysis are provided in section 4. Finally, section 5 presents the conclusions of this paper.
73
(3)
where x is called decision vector and X is the n-dimensional decision space. f(x) is the objective vector, consisting of k(>= 2) objective functions: f1 (x), f2 (x), ..., fk (x). The goal of solving a MOOP is to identify a solution x∗ that can optimize all of these objective functions. However, it is impossible in normal cases, given the conflicts among these objective functions. On the contrary, a set of optimal solutions called Pareto optimal solutions are generated after optimization. The concept of Pareto dominance is defined as follows. For a MOOP in Eq. 3, a solution a is said to dominate a solution b donated by a ≺ b [43], if (1) ∀i ∈ {1, 2, ..., k}, fi (a) ≤ fi (b) AND (2) f(a) , f(b)
(4)
It is easily concluded that the dominance relation is transitive and can be described as follows: ∀a, b, c ∈
(5)
3
Page 3 of 21
ip t cr
A solution x∗ is called Pareto optimal, if
an
¬∃x0 ∈ X, x0 ≺ x∗ .
us
Figure 1: The representation of a protein in torsion angles. A residue includes backbone torsion angles φ, ψ, ω, and side-chain torsion angles χi .
(6)
M
For a given MOOP in Eq. 3, all Pareto optimal solutions constitute the Pareto optimal set P, which is defined as follows: P = {x∗ ∈ X |¬∃ x0 ∈ X, x0 ≺ x∗ } (7) The image of P in objective space is called Pareto Front PF and is defined as follows: PF = {f(x) | x ∈ P}
(8)
101
2.3. Representation of protein
pt
99
ed
100
However, for a practical MOOP, the real Pareto optimal set is commonly unreachable or infinite. An approximation of the set is obtained via a typical optimization method [44, 45]. Therefore, the goal of solving a MOOP is translated into generating an accepted approximation of P, which is convergent to P and diverse in objective space as far as possible.
97 98
111
2.4. Protein energy function
104 105 106 107 108 109
112 113 114 115 116 117 118
Ac ce
110
Proteins are macromolecules and consist of 20 different amino acids with specific sequences. The representation of a protein is an important issue in the field of PSP. A full-atom torsion angle representation is utilized in the work. In molecular biology, a torsion angle is defined as the angle between two planes determined by four atoms [46]. Then, the conformation of a protein is determined uniquely by the dimensional vector, which is composed of backbone torsion angles φ, ψ, ω, and side-chain torsion angles χi (i ∈ {0, 1, 2, 3, 4}), as shown in Fig. 1. As a sequence, the conformation of a protein can be modified via changing these torsion angles, assuming that other variables, such as bond lengths and bond angles, have been set to ideal values. In addition, the full-atom Cartesian coordinate representation is calculated from the torsion angle space representation to evaluate the energy. However, all modification occurs only in torsion angle space in the proposed approach.
102 103
Almost all FMs are based on the thermodynamic hypothesis[16]: the native structure of a protein is determined uniquely by its amino acid sequence, and the protein steadily remains in the minimum free energy state. The hypothesis suggests that the success of protein structure prediction relies on an accurate protein energy function to filter more native-like protein conformations [30]. There are numerous protein energy functions in the field of PSP [47], and these functions can be divided into two broad classes: physics-based energy functions (PBEFs) and knowledge-based energy functions (KBEFs). The PBEFs are mainly based on physical principles of molecular interactions[48, 49, 40]. The KBEFs are typically based on the statics of a set of known protein structures selected elaborately [41, 50, 51]. 4
Page 4 of 21
Chemistry at Harvard Macromolecular Mechanics (CHARMM) force field is a widely used PBEF for molecular dynamics in computational biology. The force field was developed by the Nobel Prize winner Martin Karplus and his group at Harvard. CHARMM22 is a version of the CHARMM force field and an all-atom empirical potential, where parameters have been tuned properly for proteins [40]. The form of CHARMM22 is expressed as follows: X X E= Kb (b − b0 )2 + KU B (S − S 0 )2 + Kθ (θ − θ0 ) + 2
angle
X
X
Kχ (1 + cos(nχ − δ))+
dihedrals
Kϕ (ϕ − ϕ0 )2 +
impropers
R Rmini j 6 qi q j mini j 12 ) −( ) + ( ri j ri j 1 ri j nonbond
121 122 123 124
128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147
where R is the gas constant and T is set to 300 K. Nobs (i, j, r) is the observed number of atom pairs i and j within the distance r in the given database. rcut is set to 14.5 Å. ∆r(∆rcut ) is the bin width at r(∆rcut ). Parameter α is set to 1.61 in the original version. To improve the performance of the DFIRE-based energy function, an upgrade of DFIRE called the dipolar DFIRE (dDFIRE) energy function was proposed [41]. This energy function adds terms about the orientation dependence of interactions between polar atoms by treating each polar atom as a dipolar. Compared with other orientation-dependent KBEFs [54, 55], dDFIRE has three advantages. First, it incorporates the hydrogen bonding interaction and the dipoledipole interaction. Second, different types of orientation-dependent interactions are all considered. Finally, it is a parameter-free KBEF. PBEFs and KBEFs are quite different. They evaluate a protein conformation from two different aspects. PBEFs try to mimic the process of protein folding, and KBEFs assess the conformation according to empirical energy terms [56]. It seems wise to combine them as a compounded function because both physical properties and statistical properties of a protein conformation can be captured. In fact, the energy functions of the most successful FMs, including Rosetta [57], CSA [58], and QUARK [28], all consist of physical terms and empirical terms. In these methods, various energy terms are grouped as a linear combination with different weights. However, this treatment can lead to the issue of how to seek the optimal relative weights. It is obvious that combining various energy terms by designing a multi-objective function can avoid this issue skilfully. Moreover, the shape of the multi-objective energy landscape is more stable because it lacks the variation of the weights. It benefits the optimization process. On the other hand, previous works [35, 38, 39] based on MOOP mainly considered PBEFs. Combining PBEF and KBEF to solve the PSP problem has not been explored using these models to date; however, this combination is dominant among the methods based on SOOP [57, 58, 28]. In the study, CHARMM22 and dDFIRE are combined to constitute the three-objective energy function. CHARMM22 is decomposed into bond energy and non-bond energy
pt
127
Ac ce
126
ed
M
125
Seven terms are found in Eq. 9: bond, Urey-Bradley, angle, dihedral angle, improper angle, Lennard-Jones 6-12 and Coulombic interaction terms. In these terms, Kb , KU B , Kθ , Kχ , and Kϕ are force constants. b, S , θ, χ, and ϕ are bond length, Urey-Bradley distance, bond angle, dihedral angle, and improper torsion angle, respectively. The variables with the subscript 0 are the equilibrium constants. For the last two terms in Eq. 9, is the Lennard-Jones well depth, and Rmin is the distance at the Lennard-Jones minimum. ri j is the distance between atom i and j. qi is the atomic change of atom i, and 1 is the effective dielectric constant. In fact, these terms in Eq. 9 are classified more generally into two categories: bond terms (including the first five terms) and non-bond terms (including the last two terms) [52]. The distance-scaled, finite ideal-gas reference (DFIRE) state-based energy function is an all-atom KBEF proposed by Zhou et al. for protein structure prediction [53]. This function uses the new reference state to construct the residuespecific energy function from a database of known protein structures. For atoms i and j that are distance ri j apart, the DFIRE-based energy u(ri j ) is calculated as follows: Nobs (i, j,r) −RT ln ( r r )α ( ∆r∆r )Nobs (i, j,rcut ) , r < rcut cut cut u(ri j ) = (10) 0, r ≥ rcut
an
119 120
us
X
(9)
cr
X
ip t
UB
bonds
5
Page 5 of 21
ip t cr us an M
Figure 2: Flowchart of the proposed approach.
151
3. Description of proposed approach
pt
149
ed
150
as the first and second objectives, respectively. dDFIRE is included as the third objective of the three-objective energy function. Therefore, the proposed energy function can be seen as a compound energy function that includes PBEF and KBEF.
148
156
3.1. Main Algorithm
153 154
157 158 159 160 161 162 163 164 165 166 167 168
Ac ce
155
In this section, the proposed approach for solving the PSP problem is described. The approach consists of two main steps. First, the MOPSO is employed to execute conformation space search and generates a set of optimal solutions. Then, a clustering method MUFOLD-CL is used to select the final solution(s). The flowchart of the proposed approach is presented in Fig. 2, and the details are described in the following content.
152
To extend the standard PSO to solve MOOPs, it is natural to incorporate the notion of Pareto dominance. However, a set of different solutions (also called Pareto optimal set) will be generated in MOOP. MOOP is different from SOOP, where only one global optimal solution is generated. Therefore, not only the convergence but also the diversity of the solution set should be given attention. Moreover, a problem, namely using Pareto dominance to compare two solutions (or individuals), also arises in MOPSO, i.e., the process of updating the personal best, because it is easy to determine which of two individuals is better according to the single objective fitness function in SOOP. However, a complete order may not exist to differentiate two feasible solutions in MOOP. One solution dominates the other solution, or neither solution dominates the other solution. As a consequence, an additional strategy for addressing this issue should be also considered. The proposed MOPSO follows the standard PSO, and the global best topology is used. The main procedure of the algorithm is described as follows, and the pseudo-code is presented in Algorithm 1. Four strategies are incorporated into the proposed MOPSO. First, a non-dominated archive gA is maintained to provide global best g. Second, a ranked 6
Page 6 of 21
cr us
an
begin Initialize parameter. Initialize all x and v of each particle. Evaluate all particles. Add all particle into gA with criterion. Add all particle into rA with criterion. Update all p. t = 1. while Stopping criterion is not met do Calculate crowding distance of each solution in gA. select a solutions as g from gA for each particle. Update all v and x by Eq. 2. Mutate appointed particles. Maintain all particles within constrained search space. Evaluate all particles. Add all particle into gA with criterion. Add all particle into rA with criterion. Update all p. t = t + 1.
ip t
Algorithm 1: Main procedure of the MOPSO.
M
Print all solutions in rA.
182
3.2. Density estimation
173 174 175 176 177 178 179 180
183 184 185 186 187 188 189 190 191 192 193 194
pt
171 172
Ac ce
170
ed
181
archive rA is maintained to retain optimal solutions visited during the search process. Third, a crowd distance method is designed to deal with the condition of comparing two non-dominated solutions. Finally, a mutation operator is utilized, paying more attention to the exploration ability of the algorithm. Initially, all parameters are set, such as max iteration, population size, inertia weight, leaning factor, the size of gA, and the size of rA. Subsequently, the position x and velocity v of every particle are randomly initialized. Then, these particles are evaluated and added to the global best archive gA and the ranked archive rA, respectively. From this point, the main loop of the algorithm starts. First, the crowding distances of all non-dominated solutions stored in gA are computed. Then, the velocity v and the position x of each particle are updated using Eq. 2. Of note, the global best g for each particle is randomly selected from the solutions in the sparsest regions in gA. Next, the mutation operator works on some particles. Then, the maintained operator works on all particles to restrict them in a constrained search space. Subsequently, all particles are evaluated and added to gA and rA if joining conditions are met. Then, all personal best vectors p of each particle are updated. Finally, if the stopping criterion is met, the main loop of the MOPSO terminates, and the solutions stored in rA are printed.
169
As mentioned above, a strategy determining the optimal solution between two non-dominated solutions is necessary. The strategy can be used in many process, such as selecting the global best g and discarding a relatively worse solutions from the archive. To obtain a well-distributed Pareto front, density measurement is a widely used strategy to set the complete order over non-dominated solutions in the literature. Grid [34, 6] and crowding distance [44] are all considered classical approaches. In the approach, crowding distance, a non-parameter-dependent method, is used to assign the complete order over two non-dominated solutions. The basic idea of calculating crowding distance is to estimate the distance between a solution and its nearest neighbors in the Pareto front. For a given solution, the crowding distance is quantified as the perimeter of the cuboid formed by its neighbors. For example, the cuboid of the ith solution determined by its two neighbors in two objective function space is presented in Fig. 3. Considering that different objectives vary to different degrees, all objective values are normalized before calculating. Of note, the distance values of two boundary solutions on each objective are provided by the entire variation range (after normalization equal to 1). 7
Page 7 of 21
2EMHFWLYH L
ip t
L L
cr
2EMHFWLYH
us
Figure 3: The crowding distance of a solution in Pareto front is quantified as the perimeter of the cuboid formed by its neighbors.
5 5
5 5
U$
7 1
5N
8
FXUUHQW SRSXODWLRQ
5Z
VRUWHG 5N
8 VRUW 5N E\ FURZGLQJ GLVWDQFH
9
5N
9 SUXQH 5N DQG JHQHUDWH U$
ed
7 VRUW E\ 3DUHWR GRPLQDQFH
an
U$
M
1U
Figure 4: The updating strategy of the archive rA.
198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
pt
197
3.3. Two archives: gA and rA Retention of non-dominated solutions during search processes is common in many MOEAs [34, 59]. Two bounded external archives are maintained in the algorithm. First, a non-dominated archive gA is maintained to store the global best. A solution s can be added into gA if it matches any of the following conditions: (1) gA is empty. (2) gA is not full, and s is not dominated by any one in gA. (3) s dominates any one in gA, and the dominated ones will be deleted from gA. (4) gA is full, and s is non-dominated. In this condition, s is firstly added into gA. Then, a solution in the most crowded region is discarded. On the other hand, it is critical to select the global best g from gA because the selection of the global best can affect the convergence and diversity capability of a PSO [7]. In the proposed algorithm, selecting a solution with a maximum crowding distance appears to be a good option because it can drive the population toward the spread region of the objective space. However, this method causes the algorithm to have a high selection pressure, and the algorithm can be easily trapped in a local optimum. As a result, roulette wheel selection is used to select the global best among the top 30% solutions with the maximum crowding distance. For each personal best p, we do not use an archive to maintain non-dominated solutions visited by the particle. A particle q will take the place of its personal best p, if (1) q dominate p or (2) q and p are non-dominated and q is selected by roulette wheel selection with 50% probability. A ranked archive rA based on non-dominated sorting is also maintained in the approach and is used to store optimal solutions during the search process and output the prediction results. The updating strategy of rA is similar to the strategy in NSGA-II [44] but works on the external archive rather than the population. Initially, an empty archive rA with bounded size Nr is created. Then, at each iteration t, all N particles in the swarm will be added into rA together. If rA is not full (the number of solutions m in rA is smaller than Nr), all particles are accepted, and the next iteration t + 1 is executed. On the contrary, if rA is full, non-dominated sorting is used for pruning. The procedure of
Ac ce
195 196
8
Page 8 of 21
Table 1: Constraints of secondary structure for torsion angle φ and ψ.
helix (H) sheet (E) coil (C)
φ [−67◦ , −47◦ ] [−130◦ , −110◦ ] [−180◦ , 180◦ ]
ψ [−57◦ , −37◦ ] [110◦ , 130◦ ] [−180◦ , 180◦ ]
233
3.4. Mutation operator
223 224 225 226 227 228 229 230 231
cr
221 222
us
220
an
219
M
218
ip t
232
updating rA is presented in Fig. 4. First, each solution s in rA plus the current population is assigned a domination rank, which represents the number of solutions dominating s in rA. Undoubtedly, the smaller rank a solution has, the more competitive it is. Then, all solutions in rA are sorted in ascending order according to their domination rank. Now, all solutions are classified into non-dominated set {R0 , R1 , ..., Rw }. Next, the pruning operation is executed backward on these non-dominated sets until m is equal to Nr. To maintain exactly Nr members in rA, the last deleted set Rk is sorted in descending order according to crowding distance. The best solutions in Rk are used to fill up rA. We used a ranked archive to retain optimal solutions based on three lines of reasoning. First, it is widely accepted that the existing statistical or physical energy functions are not accurate [31]. Paying exclusive attention to the solutions with minimum energy is not reasonable. In the classical Monte Carlo Method [27, 28], PSP is modeled as a SOOP. Offspring decrease the energy, and the solutions that increase the energy but satisfy the Metropolis criterion are retained. As a result, the energies of the output solutions follow a Boltzmann distribution. To make an analogy, we model PSP as a MOOP and should not exclusively retain the solutions in the Pareto front. Second, non-dominated sorting is a clear and easy method for maintaining suboptimal solutions compared with ε-dominance [60] and anglemodified dominance [61], which are relaxed forms of Pareto dominance. Third, in contrast to the number of output solutions in other MOEAs[35, 36, 39] changing dramatically, this method can explicitly control the size of output solutions.
217
ed
It is well-known that the standard PSO converges quickly [62]. To prevent premature convergence, mutation operators are commonly used in PSO [63]. It is also important to use mutation operators in MOPSO to prevent trapping into local Pareto fronts. A mutation operator is incorporated into the proposed algorithm. It perturbs a selected variable (angle) as follows: θnew = θold + λξ (11)
Ac ce
pt
where θnew and θold are the angles after and before mutating. λ is a scale factor, and ξ is a real number, fitting uniform distribution in [−1, 1]. To create a balance between between exploration and exploitation, the mutation rate pm is shown as follows: −2t (12) pm = e Tmax
236
where t is the number of iterations and T max is the maximum allowed number of iteration. The mutation rate affects three aspects of mutation: the number of mutated particles, the number of mutated variables in the selected particles, and the variation range of the selected variables.
237
3.5. Constrained search space
234 235
238 239 240 241 242 243 244 245 246
From the perspective of computation, the search space of the PSP problem is very large given the high dimension. For example, there are approximately 300 torsion angles in protein 1ROP. Restricting these torsion angles to reduce the size of search space is a common method [64]. Three types of restriction are utilized in the work. First, we use PSIPRED [65] to predict the secondary structure of the target protein. Each residue is assigned one type: helix (H), sheet (E), or coil (C). For the backbone torsion angle φ and ψ in each type, the restriction is presented in Table 1. Second, all ω are set to 180◦ according to the natural observation that ω is very close to 180◦ in most proteins [66]. Third, for side-chain torsion angles, the restriction extracted from the rotamer library [67] is utilized. All these restrictions are used in the maintaining phase in Algorithm 1, making all particles fly through the constrained search space.
9
Page 9 of 21
Table 2: Details of the target proteins.
num. angles 192 297 191 289 185 284 345 180 278 134 289 241
num. atoms 645 942 648 947 613 905 1117 570 920 414 921 780
ip t
num. residues 46 60 46 54 39 56 70 34 56 33 52 44
cr
sec. structure α/β α α/β α β α α α α α/β α α
us
Protein ID 1AB1 1BDD 1CRN 1ENH 1I6C 1ROP 1UTG 1ZDD 2KDL 2M7T 2P6J 2P81
Table 3: The parameters applied in the proposed MOPSO.
247
3.6. Decision-making
an
description population size inertia weight cognitive weight social weight the size of global archive the size of ranked archive max iteration number
M
value 50 0.9 1.5 1.5 100 1000 2000
ed
Parameter N w C1 C2 Ng Nr It
259
4. Experimental studies
250 251 252 253 254 255 256 257
260 261 262 263 264 265 266 267
Ac ce
249
pt
258
In general, a large number of decoy structures are generated after prediction by a typical FM method. A complete prediction method should include a decision-making phase to select best native-like structures. Decision-making is also an important issue and is called model quality assessment (MQA) in Critical Assessment of protein Structure Prediction (CASP) [68]. The methods based on clustering are the most popular methods in MQA. These methods are based on the assumption that near-native structures are more likely to cluster in the basin of the protein energy landscape [69]. A population-based protein structural model analysis package named MUFOLD-CL [70] is utilized in this study. Two metrics without optimized superimposition of two structures, Dscores1 and Dscores2, are innovatively adopted in this work and make the algorithm fast. Dscores1 describes the difference between two structures and is used in the clustering process. Dscores2 describes the similarity between two structures and is used in representative selection. Finally, all cluster centers are selected as the candidates and ranked according to cluster size and spread degree.
248
This section presents the experiments and results obtained during the evaluation of the proposed approach. Twelve real proteins are used as the benchmark proteins. Detailed information of these proteins is provided in Table 2. These proteins vary in length (from 34 to 70), secondary structure, number of torsion angles, and number of atoms. In addition, the corresponding native structures for each target protein are downloaded from the PDB. These structures will be used to evaluate the quality of a predicted structure. All algorithms are implemented in C and Python language in the study. All the tests are executed on the Linux 64-bit system with a 3.4 GHz Core (TM), i5 CPU and 8 G RAM. Specially, the parameters for MOPSO are listed in Table 3.
10
Page 10 of 21
273
ip t
271 272
where a, b are the two optimally superimposed structures achieved by the Kabsch algorithm. n is the number of compared atoms, and di is the Euclidean distance between two corresponding atoms in two superimposed structures. Thus, a smaller RMSD value indicates high similarity between two structures, and zero indicates equality. Although RMSD is considered a classical method, it is sensitive to the outlying local regions of a predicted structure and is thus occasionally unsuitable for assessing similarity. Another metric, the global distance test (GDT) score [72], has become popular. It is the major measurement in CASP and is calculated as follows:
cr
270
4.1. Metrics to evaluate the predicted structure Two widely used metrics are employed in the study to assess the quantity of a predicted structure. These methods measure the similarity between the predicted structure and the native structure of a protein. The first method is the classical root-mean-square deviation (RMSD) [71]. It calculates the average distance between the match atoms of two superimposed structures as follows: s Pn 2 i=1 di (13) RMS D(a,b) = n
GDT =
279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297
298 299 300 301 302 303 304
an
M
278
where N is the number of residues in target protein. S 1 , S 2 , S 3 , and S 4 are the number of aligned residues under the distance thresholds of σ/4, σ/2, σ, and 2σ, respectively. The GDT TS (GDT total score) is calculated by setting σ to 4 Å in GDT. This vale varies from 0 to 100, and a higher GDT TS corresponds to high similarity between two structures. It should be noted that, for simplicity, it is common to consider only Cα atoms in these two metrics. 4.2. Prediction results We run the proposed MOPSO on these target proteins. A set of non-dominated solutions gA and a set of dominated solutions with rank rA are obtained for each target protein. These solutions (gA and rA) in objective space of each protein are plotted in Fig. 5. In addition, the solutions in gA consist of Pareto fronts. The gray surfaces in these figures are not the surfaces of Pareto fronts. Rather, these surfaces allow us to easily view the distribution of the solutions in objective space. From these figures, we can see that the second objective, i.e., non-bond energy, changes sharply on a large scale. This finding strengthens the necessity of separating bond energy from non-bond energy because the change of bond energy can be hidden easily. Moreover, these figures demonstrate that the non-dominated solutions in gA are diverse and span the Pareto fronts uniformly, given the density estimation scheme mentioned above. However, these non-dominated solutions are considered sparse in objective space compared with the set of dominated solutions in rA. In other words, more fruitful solutions are maintained in rA than gA, despite covering a similar scope in objective space. After generating a set of non-dominated solutions with rank in rA, the decision maker is executed to select the final solutions. We use MUFOLD-CL to cluster these solutions stored in rAs. One among five centroids with maximum size and minimum spread is selected as the final predicted result for each protein. Detailed information of the final solutions and other typical solutions in Fig. 5 is reported in Table 4, including one solution obtained with minimum RMSD, one obtained with maximum GDT TS, one obtained with minimum dDFIRE, and one selected by MUFOLDCL. Moreover, the superpositions of the native and predicted structures (selected by MUFOLD-CL) for all target proteins are shown in Fig. 6. These results also demonstrate that competitive solutions are generated by the proposed approach.
ed
277
(14)
pt
276
100(S 1 + S 2 + S 3 + S 4 ) 4N
Ac ce
274 275
us
268 269
4.3. Verify the influence of the knowledge-based energy function An accurate protein energy function governs the effectiveness of an approach for solving the PSP problem, which can guide the search to acquire a more native-like structure. In this study, we combine the PBEF CHARMM22 and KBEF dDFIRE to constitute the compound multi-objective energy function. Previous works have demonstrated that decomposing PBEF into bond energy and non-bond energy is effective. We follow the idea and incorporate a KBEF as the third objective to make the energy function more realistic. The influence of the third objective dDFIRE should be investigated. 11
Page 11 of 21
0
−20
−5
−20
−10
−30
dDFIRE
−30
−15
dDFIRE
−20 −25
−40 −50
−40 −45
−60
−30
−35
−50
−35
−70
10^5
10^5
10^5
500
ip t
dDFIRE
−25
520
440
540
10^10
10^10
700 10^15
800
non-bond (kcal/mol)
bond (kcal/mol)
10^10
560 10^15
(a) 1AB1
580 600 bond (kcal/mol)
620
480
non-bond (kcal/mol)
10^15
(b) 1BDD
bond (kcal/mol)
500
(c) 1CRN
−25 −30
us
−5
−20
−35
dDFIRE
−25 −10
dDFIRE
−30
dDFIRE
460
cr
non-bond (kcal/mol)
600
−35 −40
−15
−40 −45 −50
−45
an
−55
−20
−50
−60
−55
10^5
10^5
10^5
400
500
500
10^10
550
10^10
500 520
10^10
540 560
600
non-bond (kcal/mol)
600 10^15
non-bond (kcal/mol)
bond (kcal/mol)
non-bond (kcal/mol) 10^15
bond (kcal/mol)
(e) 1I6C
−40
−50
−30
dDFIRE
−40
−35
700
800
10^10 900 1000
−15
−20
dDFIRE
bond (kcal/mol)
300
10^5
10^4
480 10^6
300
500 10^8
320
15000
non-bond (kcal/mol)
Ac ce
(g) 1UTG
−60
10000
20000
340
520 10^10
bond (kcal/mol)
non-bond (kcal/mol)
(h) 1ZDD
540
10^12
bond (kcal/mol)
(i) 2KDL
−10
20 10
−20
dDFIRE
non-bond (kcal/mol) 10^15
−50
280
5000
0
dDFIRE
10^5
−10
pt
−60
−45
−55
−40 0
−5
bond (kcal/mol)
(f) 1ROP
−35
ed dDFIRE
−30
0
580 600
−25
−20
dDFIRE
700
M
(d) 1ENH
10^15
−30
−40
−10 −20 −30
−50
−40 450
10^5
10^5 450
350
10^10
non-bond (kcal/mol) 10^15
400
(j) 2M7T
bond (kcal/mol)
10^10
non-bond (kcal/mol)
500 10^15
550
(k) 2P6J
bond (kcal/mol)
10^10
non-bond (kcal/mol)
500 10^15
550
bond (kcal/mol)
(l) 2P81
Figure 5: The solutions in gA ( marked as ‘+’) and rA (marked as ‘·’) span in objective space for all target proteins.
12
Page 12 of 21
Table 4: The summary of the predicted results, corresponding to Fig. 5.
1ENH
1I6C
1ROP
1UTG
1ZDD
Ac ce
2KDL
2M7T
2P6J
2P81
305 306
GDT TS
ip t
43.48 51.63 38.04 36.96 59.58 60.00 56.67 50.42 53.26 53.26 48.37 46.20 53.24 58.80 56.94 46.30 41.03 45.51 35.26 32.69 76.34 76.34 73.66 66.96 48.57 52.14 36.07 41.79 83.09 86.03 77.94 77.94 39.73 44.20 41.52 41.52 43.18 45.45 37.88 32.58 53.85 54.33 53.37 35.58 63.07 67.05 52.27 47.73
cr
1CRN
-25.01 -26.72 -32.26 -24.21 -54.81 -60.50 -63.50 -51.89 -44.08 -44.08 -45.16 -41.86 -46.52 -53.51 -54.84 -43.87 -16.76 -12.88 -20.22 -16.28 -51.45 -48.26 -57.53 -53.86 -59.40 -38.22 -64.05 -49.16 -36.18 -34.86 -37.15 -34.79 -46.90 -53.64 -57.51 -49.99 -18.07 -18.18 -19.31 -13.59 -48.14 -49.25 -53.92 -39.17 -38.79 -13.10 -39.53 -28.56
RMSD ( Å) 5.72 6.16 6.48 9.80 3.73 4.03 3.83 5.64 5.16 5.16 8.35 7.57 5.81 6.58 7.36 8.92 6.51 7.03 7.29 8.47 2.21 2.26 2.81 3.51 5.68 19.68 13.39 11.13 1.84 1.89 2.24 2.15 8.69 9.73 9.80 10.29 5.04 5.10 6.80 8.46 4.81 5.26 4.92 9.44 3.55 3.66 6.16 6.28
us
1BDD
dDFIRE
an
min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min-RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL min RMSD max GDT TS min dDFIRE MUFOLD-CL
non-bond energy (kcal/mol) 1.95E9 3.17E10 4.90E10 6.69E8 4.71E7 4.34E5 1.29E8 2.27E5 4.93E7 4.93E7 4.23E6 2.54E5 1.55E12 1.02E11 1.01E8 1.41E7 3.60E11 2.28E7 4.63E10 1.64E10 1.31E13 4.26E11 9.68E6 1.65E9 6.82E12 9.36E6 5.50E8 7.83E7 9.40E5 1.84E4 3.58E6 2.16E2 1.40E5 1.12E5 9.48E6 5.44E4 2.55E15 1.50E12 7.96E4 9.82E4 2.86E8 4.14E8 2.65E6 6.63E5 7.37E10 9.78E9 2.56E11 2.11E8
M
1AB1
bond energy (kcal/mol) 503.05 501.98 498.74 494.23 569.35 599.02 595.42 583.03 475.73 475.73 457.16 463.08 543.08 584.73 584.00 542.28 496.58 464.44 489.67 482.55 530.10 541.59 544.43 552.33 746.39 741.32 796.51 787.12 328.30 311.08 328.09 305.40 499.72 513.94 522.81 512.19 318.52 336.16 333.99 338.48 530.13 527.84 515.48 507.50 510.89 458.66 517.42 467.38
ed
select methods
pt
Protein
Figure 7 displays the GDT TS score and dDFIRE energy of all decoy structures in rA for all target proteins. A negative correlation between GDT TS score and dDFIRE energy is exhibited in these figures. The negative correlation 13
Page 13 of 21
, &
523
.'/
0 7
&51
ip t
%''
(1+
us
cr
%$=''
ed
M
an
87*
3 -
3
pt
Figure 6: Superposition of the native (in gray) and predicted structures (selected by MUFOLD-CL).
311
4.4. Investigate the solutions with different rank in archive rA
308 309
312 313 314 315 316 317 318 319 320 321 322 323
Ac ce
310
meets our expectation that a structure with lower dDFIRE energy is more native-like (with higher GDT TS score). Furthermore, the result illustrates that dDFIRE energy has a positive effect on the proposed approach although the optimization of dDFIRE grapples with two additional objectives. Moreover, if we use the lowest dDFIRE energy as a decision maker, the result shown in Table 4 is also very competitive.
307
We investigate the accuracy of the solutions with different ranks in archive rA. First, we compare the solutions in archive gA and that in rA. The size of gA is smaller than that of rA, and the stored schemes described above can make the guarantee that gA ⊂ rA if gA is not full. In fact, the solutions in gA are the solutions with rank 0 (non-dominated) in rA. Fig. 8 displays the boxplot of GDT TS score in gA and rA for all proteins. It is obvious that the accuracies of the solutions in gA are generally improved compared with those in rA because the solutions with a lower rank have lower energies (three energy terms) and, thus, higher accuracy. However, the differences in the GDT TS score between the archives are not very significant. In addition, rA can maintain excellent solutions that are better than the solutions in gA. Moreover, we further investigated the difference between gA and rA. We compare the best solutions and the solutions selected by MUFOLD-CL. The head-to-head comparison of the results of gA and rA is presented in Fig. 9 with regards to GDT TS. It is obvious that the best solutions in rA are better than the solutions in gA for all proteins. In addition, comparing the solutions selected by MUFOLD-CL, the rA has advantages over the gA. These findings 14
Page 14 of 21
1AB1
1BDD
0
1CRN
−20
1ENH
−20 −20 −30
−40
−20
−40
−40
60
30
40
1I6C
50
60
−50 −60 35 40 45 50 55 30
1ROP
0
1UTG
−30
−20
−40 −20 −60 40
40
60
2KDL
80
−60 30
2M7T
40
45
2P6J
0
−40
−40
35
an
30
60
−30
−50
20
50
1ZDD
−20
−40
−10
40
cr
40
us
−40 20
ip t
−60
60
80 2P81
20
−10
40
50
−20 20
0
−40
ed
−50
−60 30
M
−20
30
40
50
−60 30
−20 −40 40
50
40
50
60
325 326 327 328 329 330
331 332 333 334 335 336 337 338 339 340
strengthen our opinion that these solutions with a minimal rank (dominated by a small number of solutions) should be also retained. Furthermore, whether a negative correlation exists between the GDT TS score and the rank of the solutions in rA is confirmed. The distribution of GDT TS score of the solutions with different ranks for protein 1BDD is shown in Fig. 10. Similar results are obtained for the remaining proteins. It is unfortunate that obvious trends were not observed because the differences in GDT TS score in different rankings are not very significant and the numbers of solutions in different rankings are small. Thus, a strong tend is not observed.
Ac ce
324
pt
Figure 7: GDT TS (x-axis) versus dDFIRE (y-axis) of the solutions in archive rA for all target proteins. The negative correlation between them is exhibited.
4.5. Investigate the conflicts among three objectives The conflicts among different objectives are characteristic of MOOPs [73]. In this study, we combine PBEF and KBEF to constitute the three objective functions. Whether these objectives are in conflict is worth verifying. Given the lack of a formal definition of conflicts in the field of MOOP, the qualitative method parallel coordinates plot (PCP) [74] is used to investigate the conflicting relationship among the three objectives. The solutions in gA compose the obtained Pareto front. We pay attention to them and plot the three objectives of these solutions in PCP. Fig. 11 shows the PCPs of three proteins with different secondary structures: 1AB1 (α/β), 1BDD (α), and 1I6C (β). Similar results are also obtained for the remaining proteins. Of note, the objective values in these figures were normalized in [0, 1] for clarity. If two objectives of two solutions are in conflict, then the lines cross in PCP. It is clear that three objectives (bond energy, non-bond energy, and dDFIRE) are in conflict because the 15
Page 15 of 21
80
gA rA
ip t
60
50
cr
GDT_TS
70
us
40
30
1ROP 1UTG 1ZDD 2KDL 2M7T 2P6J
an
1AB1 1BDD 1CRN 1ENH 1I6C
Protein
2P81
90
ed
70
60
50
40
30
20 20
30
pt
GDT_TS of the best in gA
80
40
50
60
70
80
90
GDT_TS of the best in rA
GDT_TS of the one selected by MUFOLD−CL in gA
M
Figure 8: The distribution of GDT TS of the solutions in gA and rA, for all target proteins.
90
80
70
60
50
40
30
20 20
30
40
50
60
70
80
90
Ac ce
GDT_TS of the one selected by MUFOLD−CL in rA
(a)
(b)
Figure 9: Scatter plot of the comparison between gA and rA for all proteins. The points with same color in (a) and (b) correspond to a same protein.
342
straight lines connecting different solutions cross at a high degree. This finding indicates the conflicts among these energy functions.
343
4.6. Compare with other works
341
344 345 346 347 348 349 350
We compare the results obtained by the proposed approach with other works where EAs were implemented. Specifically, the metric of prediction results used in these works is RMSD. I-PAES, MEAMT, ADEMO/D, and MO3 were introduced in section 1. The brief summaries of other works are given as follows: PSO-APL [75] is an angle probability list (APL) knowledge-based single objective method based on PSO. Rosetta scoring function [51] is implemented in the method as the single objective function. AIMOES [76] is also a MOEA with three physical energy terms. It imitates fragment assembly methods [57] where fragments are obtained from search experience rather than native structures. 16
Page 16 of 21
60
50 45
ip t
GDT_TS
55
40 35
cr
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Rnak
$ %
0.2
Obj1
Obj2
Obj3
0.6 0.4 0.2 0.0
Obj1
Obj1
1 R U P D O L ] H G Y D O X H
0.4
0.8
an
0.6
Obj2
Obj3
7 K U H H R E M H F W L Y H V
7 K U H H R E M H F W L Y H V
(b)
Obj1
0.8 0.6 0.4 0.2 0.0
Obj1
Obj2
Obj3
Obj1
7 K U H H R E M H F W L Y H V
(c)
ed
(a)
, &
1.0
M
0.8
0.0
% ' '
1.0
1 R U P D O L ] H G Y D O X H
1 R U P D O L ] H G Y D O X H
1.0
us
Figure 10: The distribution of GDT TS of the solutions with different rank in rA, for protein 1BDD.
Figure 11: Parallel coordinates plot of three objectives for proteins 1AB1 (α/β), 1BDD (α), and 1I6C (β). The obj1 is bond energy, the obj2 is non-bond energy, and the obj3 is dDFIRE.
358
5. Conclusion
354 355 356
359 360 361 362 363 364 365 366 367 368 369
Ac ce
353
pt
357
The comparison results are reported in Table 5. Two metrics are used: min-RMSD represents the RMSD of the best structure among all generated decoy structures, and DM-RMSD donates the final solution selected by a decision-making method. The best values of min-RMSD and DM-RMSD for each benchmark protein are marked in boldface. Inspecting the comparison results, our proposed approach can obtain better or competitive solutions with these methods in terms of min-RMSD and DM-RMSD. In contrast to the methods above, a feature of the proposed approach involves incorporating a KBEF term as an objective function. Incorporating KBEF into MOEA strengthens our ability to solve the PSP problem.
351 352
The difficulties of the PSP problem center around two aspects: the inaccuracy of existing potential energy functions and the large size of the conformation search space. In this paper, the PSP problem was modeled as a MOOP, and an approach based on MOPSO was proposed to address these two issues. In detail, the PBEF CHARMM22 and the KBEF dDFIRE were combined to construct the potential energy function as a three-objective energy function: bond energy, non-bond energy, and the dDFIRE. Moreover, a MOPSO coupled with two archives was employed to execute the conformation search. For completeness, a decision-making method based on clustering was introduced to select final solutions from the generated decoy structures. Finally, a set of benchmark proteins was used to test the performance of the proposed approach. The experimental results demonstrated that the proposed approach could effectively solve the PSP problem. It is worth emphasizing the contributions of this study. First, combining PBEF and KBEF in a MOEA for solving the PSP problem was explored. Second, attention was given to Pareto optimal solutions and solutions with slightly 17
Page 17 of 21
Table 5: Comparing the prediction results of different methods.
1AB1
1BDD
1CRN
1ENH
1I6C
1ROP
1UTG
1ZDD
2KDL
2M7T
2P6J
2P81
min-RMSD DM-RMSD
5.72 9.80
3.73 5.64
5.16 7.57
5.81 8.92
6.51 8.47
2.21 3.51
5.68 11.13
1.84 2.15
8.69 10.29
5.04 8.46
4.81 9.44
3.55 6.28
I-PAES
min-RMSD DM-RMSD
-
-
4.43 -
-
-
3.70 -
4.60 -
2.27 -
-
-
-
-
MEAMT
min-RMSD DM-RMSD
-
6.40 -
5.36 -
6.56 -
5.37 -
-
-
-
-
-
-
-
PSO-GAL
min-RMSD DM-RMSD
-
-
8.9 15.9
-
-
10.1 17.5
15.8 19.2
7.2 10.5
13.4 18.5
8.9 8.9
-
-
ADEMO/D
min-RMSD DM-RMSD
-
-
6.06
-
-
4.48
-
2.14
-
-
-
-
MO3
min-RMSD DM-RMSD
6.09 7.52
-
5.34 5.56
7.70 11.99
-
3.07 3.22
-
2.16 3.26
-
-
5.00 5.96
3.87 4.30
AIMOES
min-RMSD DM-RMSD
6.23 6.77
6.28 6.95
-
5.75 6.67
5.63 8.02
-
-
2.85 4.45
-
-
5.43 10.82
3.77 6.43
M
an
us
cr
ip t
MOPSO
378
References
373 374 375 376
379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399
pt
372
Ac ce
371
ed
377
worse energy. This handling method is analogous to the method that is used in most FMs based on SOOP but is rarely employed in MOOP-based methods. The necessity of maintaining slightly worse solutions was also discussed and analyzed. Third, a simple mechanism controlling the number of output solutions is proposed. Finally, the proposed approach based on MOPSO acting as a FM exhibits high performance for solving the PSP problem. In further studies, we will continue to strive to improve the potential energy function and the search strategy. In addition, minimal prior knowledge about the protein structure is incorporated in the proposed approach. This method is considered a limitation for the performance of the proposed approach. Therefore, continuous efforts in these aspects are required to solve the complex problem.
370
[1] J. Kenny, Particle swarm optimization, in: Proc. 1995 IEEE Int. Conf. Neural Networks, 1995, pp. 1942–1948. [2] Y. Shi, R. Eberhart, A modified particle swarm optimizer, in: Evolutionary Computation Proceedings, 1998. IEEE World Congress on Computational Intelligence., The 1998 IEEE International Conference on, IEEE, 1998, pp. 69–73. [3] A. Khare, S. Rangnekar, A review of particle swarm optimization and its applications in solar photovoltaic system, Applied Soft Computing 13 (5) (2013) 2997–3006. [4] M. Roshanzamir, M. A. Balafar, S. N. Razavi, Empowering particle swarm optimization algorithm using multi agents capability: A holonic approach, Knowledge-Based Systems 136 (2017) 58–74. [5] R. Zhang, P.-C. Chang, S. Song, C. Wu, Local search enhanced multi-objective pso algorithm for scheduling textile production processes with environmental considerations, Applied Soft Computing 61 (2017) 447–467. [6] C. A. C. Coello, G. T. Pulido, M. S. Lechuga, Handling multiple objectives with particle swarm optimization, IEEE Transactions on Evolutionary Computation 8 (3) (2004) 256–279. [7] N. Al Moubayed, A. Petrovski, J. McCall, D2mopso: Mopso based on decomposition and dominance with archiving using crowding distance in objective and solution spaces, Evolutionary Computation 22 (1) (2014) 47–77. [8] A. Britto, A. Pozo, Using reference points to update the archive of mopso algorithms in many-objective optimization, Neurocomputing 127 (2014) 78–87. [9] P. K. Tripathi, S. Bandyopadhyay, S. K. Pal, Multi-objective particle swarm optimization with time variant inertia and acceleration coefficients, Information Sciences 177 (22) (2007) 5033–5049. [10] M. Daneshyari, G. G. Yen, Cultural-based multiobjective particle swarm optimization, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 41 (2) (2011) 553–567. [11] B. Xue, M. Zhang, W. N. Browne, Particle swarm optimization for feature selection in classification: A multi-objective approach, IEEE Transactions on Cybernetics 43 (6) (2013) 1656–1671.
18
Page 18 of 21
412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
ip t
411
cr
409 410
us
408
an
406 407
M
405
ed
404
pt
402 403
[12] Y. Zhang, D.-w. Gong, J.-h. Zhang, Robot path planning in uncertain environment using multi-objective particle swarm optimization, Neurocomputing 103 (2013) 172–185. [13] Y.-J. Zheng, H.-F. Ling, J.-Y. Xue, S.-Y. Chen, Population classification in fire evacuation: a multiobjective particle swarm optimization approach, IEEE Transactions on Evolutionary Computation 18 (1) (2014) 70–81. [14] F. Masulli, S. Mitra, Natural computing methods in bioinformatics: A survey, Information Fusion 10 (3) (2009) 211–216. [15] B. Li, M. Lin, Q. Liu, Y. Li, C. Zhou, Protein folding optimization based on 3d off-lattice model via an improved artificial bee colony algorithm, Journal of Molecular Modeling 21 (10) (2015) 261. [16] C. B. Anfinsen, Principles that govern the folding of protein chains, Science 181 (4096) (1973) 223–230. [17] A. K. Downing, Introduction to protein architecture: the structural biology of proteins (2001). [18] G. A. Khoury, J. Smadbeck, C. A. Kieslich, C. A. Floudas, Protein folding and de novo protein design for biotechnological applications, Trends in Biotechnology 32 (2) (2014) 99–109. [19] U. Consortium, et al., Uniprot: the universal protein knowledgebase, Nucleic Acids Research 45 (D1) (2017) D158–D169. [20] J. Moult, K. Fidelis, A. Kryshtafovych, T. Schwede, A. Tramontano, Critical assessment of methods of protein structure prediction: Progress and new directions in round xi, Proteins: Structure, Function, and Bioinformatics 84 (S1) (2016) 4–14. [21] S. Kaczanowski, P. Zielenkiewicz, Why similar protein sequences encode similar three-dimensional structures?, Theoretical Chemistry Accounts 125 (3-6) (2010) 643–650. [22] P. W. Rose, A. Prli´c, A. Altunkaya, C. Bi, A. R. Bradley, C. H. Christie, L. D. Costanzo, J. M. Duarte, S. Dutta, Z. Feng, et al., The rcsb protein data bank: integrative view of protein, gene and 3d structural information, Nucleic Acids Research (2016) gkw1000. [23] B. Webb, A. Sali, Protein structure modeling with modeller, Protein Structure Prediction (2014) 1–15. [24] M. Biasini, S. Bienert, A. Waterhouse, K. Arnold, G. Studer, T. Schmidt, F. Kiefer, T. G. Cassarino, M. Bertoni, L. Bordoli, et al., Swissmodel: modelling protein tertiary and quaternary structure using evolutionary information, Nucleic Acids Research 42 (W1) (2014) W252– W258. [25] J. Yang, R. Yan, A. Roy, D. Xu, J. Poisson, Y. Zhang, The i-tasser suite: protein structure and function prediction, Nature Methods 12 (1) (2015) 7–8. [26] R. Bonneau, D. Baker, Ab initio protein structure prediction: progress and prospects, Annual Review of Biophysics and biomolecular Structure 30 (1) (2001) 173–189. [27] S. Ovchinnikov, H. Park, D. E. Kim, F. DiMaio, D. Baker, Protein structure prediction using rosetta in casp12, Proteins: Structure, Function, and Bioinformatics. [28] D. Xu, Y. Zhang, Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field, Proteins: Structure, Function, and Bioinformatics 80 (7) (2012) 1715–1735. [29] L. N. Kinch, W. Li, B. Monastyrskyy, A. Kryshtafovych, N. V. Grishin, Evaluation of free modeling targets in casp11 and roll, Proteins: Structure, Function, and Bioinformatics 84 (S1) (2016) 51–66. [30] K. A. Dill, J. L. MacCallum, The protein-folding problem, 50 years on, Science 338 (6110) (2012) 1042–1046. [31] T. Lazaridis, M. Karplus, Effective energy functions for protein structure prediction, Current Opinion in Structural Biology 10 (2) (2000) 139–145. [32] D. Baker, A. Sali, Protein structure prediction and structural genomics, Science 294 (5540) (2001) 93–96. [33] J. Handl, D. B. Kell, J. Knowles, Multiobjective optimization in bioinformatics and computational biology, IEEE/ACM Transactions on Computational Biology and Bioinformatics 4 (2). [34] J. Knowles, D. Corne, The pareto archived evolution strategy: A new baseline algorithm for pareto multiobjective optimisation, in: Evolutionary Computation, 1999. CEC 99. Proceedings of the 1999 Congress on, Vol. 1, IEEE, 1999, pp. 98–105. [35] V. Cutello, G. Narzisi, G. Nicosia, Computational studies of peptide and protein structure prediction problems via multiobjective evolutionary algorithms, in: Multiobjective Problem Solving from Nature, Springer, 2008, pp. 93–114. [36] S. M. Venske, R. A. Gonalves, E. M. Benelli, M. R. Delgado, Ademo/d: An adaptive differential evolution for protein structure prediction problem, Expert Systems with Applications 56 (2016) 209 – 226. [37] A. D. MacKerell, N. Banavali, N. Foloppe, Development and current status of the charmm force field for nucleic acids, Biopolymers 56 (4) (2000) 257–265. [38] C. R. S. Brasil, A. C. B. Delbem, F. L. B. da Silva, Multiobjective evolutionary algorithm with many tables for purely ab initio protein structure prediction, Journal of Computational Chemistry 34 (20) (2013) 1719–1734. [39] S. Gao, S. Song, J. Cheng, Y. Todo, M. Zhou, Incorporation of solvent effect into multi-objective evolutionary algorithm for improved protein structure prediction, IEEE/ACM Transactions on Computational Biology and Bioinformatics 15 (4) (2018) 1365–1378. [40] A. D. MacKerell Jr, D. Bashford, M. Bellott, R. L. Dunbrack Jr, J. D. Evanseck, M. J. Field, S. Fischer, J. Gao, H. Guo, S. Ha, et al., All-atom empirical potential for molecular modeling and dynamics studies of proteins, The Journal of Physical Chemistry B 102 (18) (1998) 3586–3616. [41] Y. Yang, Y. Zhou, Specific interactions for ab initio folding of protein terminal regions with secondary structures, Proteins: Structure, Function, and Bioinformatics 72 (2) (2008) 793–803. [42] J. Kennedy, R. Mendes, Population structure and particle swarm performance, in: Evolutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on, Vol. 2, IEEE, 2002, pp. 1671–1676. [43] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, V. G. Da Fonseca, Performance assessment of multiobjective optimizers: An analysis and review, IEEE Transactions on Evolutionary Computation 7 (2) (2003) 117–132. [44] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: Nsga-ii, IEEE transactions on Evolutionary Computation 6 (2) (2002) 182–197. [45] Q. Zhang, H. Li, Moea/d: A multiobjective evolutionary algorithm based on decomposition, IEEE Transactions on Evolutionary Computation 11 (6) (2007) 712–731. [46] A. D. McNaught, A. D. McNaught, Compendium of chemical terminology, Vol. 1669, Blackwell Science Oxford, 1997. [47] H. Zhou, J. Skolnick, Goap: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction, Biophysical
Ac ce
400 401
19
Page 19 of 21
477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520
ip t
475 476
cr
473 474
us
471 472
an
469 470
M
468
ed
467
Journal 101 (8) (2011) 2043–2052. [48] W. D. Cornell, P. Cieplak, C. I. Bayly, I. R. Gould, K. M. Merz, D. M. Ferguson, D. C. Spellmeyer, T. Fox, J. W. Caldwell, P. A. Kollman, A second generation force field for the simulation of proteins, nucleic acids, and organic molecules, Journal of the American Chemical Society 117 (19) (1995) 5179–5197. [49] W. L. Jorgensen, D. S. Maxwell, J. Tirado-Rives, Development and testing of the opls all-atom force field on conformational energetics and properties of organic liquids, J. Am. Chem. Soc 118 (45) (1996) 11225–11236. [50] J. Zhang, Y. Zhang, A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction, PloS One 5 (10) (2010) e15386. [51] A. Leaver-Fay, M. J. OMeara, M. Tyka, R. Jacak, Y. Song, E. H. Kellogg, J. Thompson, I. W. Davis, R. A. Pache, S. Lyskov, et al., Scientific benchmarks for guiding macromolecular energy function improvement, Methods in Enzymology 523 (2013) 109. [52] B. R. Brooks, C. L. Brooks, A. D. MacKerell, L. Nilsson, R. J. Petrella, B. Roux, Y. Won, G. Archontis, C. Bartels, S. Boresch, et al., Charmm: the biomolecular simulation program, Journal of Computational Chemistry 30 (10) (2009) 1545–1614. [53] H. Zhou, Y. Zhou, Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction, Protein Science 11 (11) (2002) 2714–2726. [54] T. Kortemme, A. V. Morozov, D. Baker, An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein–protein complexes, Journal of Molecular Biology 326 (4) (2003) 1239–1259. [55] S. Miyazawa, R. L. Jernigan, How effective for fold recognition is a potential of mean force that includes relative orientations between contacting residues in proteins?, The Journal of Chemical Physics 122 (2) (2005) 024901. [56] J. Lee, P. L. Freddolino, Y. Zhang, Ab initio protein structure prediction, in: From protein structure to function with bioinformatics, Springer, 2017, pp. 3–35. [57] C. A. Rohl, C. E. Strauss, K. M. Misura, D. Baker, Protein structure prediction using rosetta, Methods in Enzymology 383 (2004) 66–93. [58] J. Lee, J. Lee, T. N. Sasaki, M. Sasai, C. Seok, J. Lee, De novo protein structure prediction by dynamic fragment assembly and conformational space annealing, Proteins: Structure, Function, and Bioinformatics 79 (8) (2011) 2403–2417. [59] E. Zitzler, L. Thiele, Multiobjective evolutionary algorithms: a comparative case study and the strength pareto approach, IEEE transactions on Evolutionary Computation 3 (4) (1999) 257–271. [60] W. Gong, Z. Cai, An improved multiobjective differential evolution based on pareto-adaptive -dominance and orthogonal design, European Journal of Operational Research 198 (2) (2009) 576–601. [61] H. Sato, H. E. Aguirre, K. Tanaka, Controlling dominance area of solutions and its impact on the performance of moeas, in: International conference on evolutionary multi-criterion optimization, Springer, 2007, pp. 5–20. [62] F. Van den Bergh, A. P. Engelbrecht, A convergence proof for the particle swarm optimiser, Fundamenta Informaticae 105 (4) (2010) 341–374. [63] M. R. Bonyadi, Z. Michalewicz, A locally convergent rotationally invariant particle swarm optimization algorithm, Swarm Intelligence 8 (3) (2014) 159–198. [64] G. Helles, A comparative study of the reported performance of ab initio protein structure prediction algorithms, Journal of the Royal Society Interface 5 (21) (2008) 387–396. [65] D. W. Buchan, F. Minneci, T. C. Nugent, K. Bryson, D. T. Jones, Scalable web services for the psipred protein analysis workbench, Nucleic Acids Research 41 (W1) (2013) W349–W357. [66] D. S. Berkholz, C. M. Driggers, M. V. Shapovalov, R. L. Dunbrack, P. A. Karplus, Nonplanar peptide bonds in proteins are common and conserved but not biased toward active sites, Proceedings of the National Academy of Sciences 109 (2) (2012) 449–453. [67] R. L. Dunbrack, Rotamer libraries in the 21 st century, Current opinion in Structural Biology 12 (4) (2002) 431–440. [68] A. Kryshtafovych, A. Barbato, B. Monastyrskyy, K. Fidelis, T. Schwede, A. Tramontano, Methods of model accuracy estimation can help selecting the best models from decoy sets: Assessment of model accuracy estimations in casp11, Proteins: Structure, Function, and Bioinformatics 84 (S1) (2016) 349–369. [69] D. Shortle, K. T. Simons, D. Baker, Clustering of low-energy conformations near the native structures of small proteins, Proceedings of the National Academy of Sciences 95 (19) (1998) 11158–11162. [70] J. Zhang, D. Xu, Fast algorithm for population-based protein structural model analysis, Proteomics 13 (2) (2013) 221–229. [71] V. N. Maiorov, G. M. Crippen, Significance of root-mean-square deviation in comparing three-dimensional structures of globular proteins, Journal of Molecular Biology 235 (2) (1994) 625–634. [72] A. Zemla, Lga: a method for finding 3d similarities in protein structures, Nucleic Acids Research 31 (13) (2003) 3370–3374. [73] C. C. Coello, Evolutionary multi-objective optimization: a historical view of the field, IEEE Computational Intelligence Magazine 1 (1) (2006) 28–36. [74] R. C. Purshouse, P. J. Fleming, Conflict, harmony, and independence: Relationships in evolutionary multi-criterion optimisation, in: International Conference on Evolutionary Multi-Criterion Optimization, Springer, 2003, pp. 16–30. [75] B. Borguesan, M. B. e Silva, B. Grisci, M. Inostroza-Ponta, M. Dorn, Apl: An angle probability list to improve knowledge-based metaheuristics for the three-dimensional protein structure prediction, Computational Biology and Chemistry 59 (2015) 142–157. [76] S. Song, S. Gao, X. Chen, D. Jia, X. Qian, Y. Todo, Aimoes: Archive information assisted multi-objective evolutionary strategy for ab initio protein structure prediction, Knowledge-Based Systems 146 (2018) 58 – 72.
pt
466
Ac ce
465
20
Page 20 of 21
1. Modeling protein structure prediction problem as a multi-objective optimization problem. 2. A physics-based energy function and a knowledge-based energy function are combined to construct a compound three-objective energy function to evaluate a predicted conformation of a protein.
ip t
3. An improved multi-objective particle swarm optimization coupled with two archives is employed to execute conformation space search.
Ac
ce pt
ed
M
an
us
cr
4. A mechanism based on Pareto non-dominated sorting is designed to tackle the slightly worse solutions properly. It is analogous to other traditional methods based on single-objective optimization.
Page 21 of 21