BioSystems 105 (2011) 243–249
Contents lists available at ScienceDirect
BioSystems journal homepage: www.elsevier.com/locate/biosystems
Folding small proteins via annealing stochastic approximation Monte Carlo Sooyoung Cheon a,∗ , Faming Liang b,1 a b
Department of Informational Statistics, Korea University, Jochiwon 339-700, South Korea Department of Statistics, Texas A&M University, College Station, TX, USA
a r t i c l e
i n f o
Article history: Received 10 October 2010 Received in revised form 22 May 2011 Accepted 26 May 2011 Keywords: Protein folding Secondary structure Markov chain Monte Carlo Annealing stochastic approximation Monte Carlo
a b s t r a c t Recently, the stochastic approximation Monte Carlo algorithm has been proposed by Liang et al. (2007) as a general-purpose stochastic optimization and simulation algorithm. An annealing version of this algorithm was developed for real small protein folding problems. The numerical results indicate that it outperforms simulated annealing and conventional Monte Carlo algorithms as a stochastic optimization algorithm. We also propose one method for the use of secondary structures in protein folding. The predicted protein structures are rather close to the true structures. © 2011 Elsevier Ireland Ltd. All rights reserved.
1. Introduction In recent years there has been a great deal of interest in studying the prediction of protein folding from its amino acid sequence in biophysics. Proteins are not linear molecules such as a “string” of amino acid sequences. Rather, this “string” folds into an intricate three-dimensional structure that is unique to each protein. This three-dimensional structure allows proteins to function. Nativestate topology often plays a dominant role in the kinetics of this folding process. This implies that interactions among 20 different amino acids give rise to cooperative formation of native structure through backbone hydrogen bonding and specific side-chain packing of the native-state core. Molecular modeling attempts to predict the structure of a protein ab initio (i.e., by trying to apply laws of physics to describe the protein molecule) rather than by using databases of known structures. It is assumed that the native state of the molecule is given by those parameter values that minimize the energy function. The energy function may have multiple minima; in this case the molecule is assumed to have multiple native states, which indeed occurs in reality. Energy functions take into account all atoms in a protein molecule. Since the number of amino acids in proteins ranges from 25 to 3000 and the number of atoms ranges from around 500 to more than 10,000, dealing with even the simplest energy functions may be a difficult computational task.
∗ Corresponding author. Tel.: +82 41 860 1552; fax: +82 41 862 2000. E-mail addresses:
[email protected],
[email protected] (S. Cheon), fl
[email protected] (F. Liang). 1 Tel.: +1 979 845 8885; fax: +1 979 845 3144. 0303-2647/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2011.05.015
Some good prediction results can be obtained by using methods such as threading based on information on known structure (Moult et al., 1999; Venclovas et al., 1999). However, for small proteins the information contained in the amino acid sequence determines sufficiently the tertiary structure of protein with the lowest minimum energy value. Lu et al. (2003) noticed this point and suggested a new minimization function called “relative entropy”. But, it is difficult to find the thermodynamically stable state of the protein; it is a NPhard problem. The difficulties of the problem are that the dimension of the system is usually high because it is in the same order with the number of atoms involved in the system, and the energy landscape of the system can be characterized by a multitude of local minima separated by high-energy barriers. At low temperatures, traditional Monte Carlo and molecular dynamic simulations tend to get trapped in local minima. Hence, only a small fraction of the phase space is sampled, and the thermodynamic quantities cannot be estimated accurately. Thus it makes the simulations ineffective. Among the algorithms to improve the above problems, the simulated annealing (SA) algorithm can be used in physics. It can be described as follows. Suppose that one aims to find the global minimum of an objective function H (x), which is also called the energy function in the standard terms of simulated annealing. By augmenting to the system an auxiliary variable, the so-called temperature T, minimizing H (x) is equivalent to sampling from the Boltzmann distribution f(x, T) ∝ exp (− H(x)/T) at a very small value (closing to 0) of T. In this sense, we say that sampling is more basic than optimization. In order to sample successfully from f(x, T) at a very small value of T, Kirkpatrick et al. (1983) suggested to simulate from a sequence of Boltzmann distributions, f(x, T1 ), . . ., f(x, Tm ), in a sequential manner, where the temperatures form a decreasing ladder T1 > T2 > . . . > Tm with Tm ≈ 0 and T1 being reasonably large such
244
S. Cheon, F. Liang / BioSystems 105 (2011) 243–249
that most uphill conventional Markov chain Monte Carlo (Metropolis Hastings, MH; Metropolis et al., 1953; Hastings, 1970) moves at that level can be accepted. The simulation at high temperature levels aims to provide a good initial sample, hopefully a point in the attraction basin of the global minimum of H(x), for the simulation at low temperature levels. Simulated annealing algorithm : 1. Initialize the simulation at temperature T1 and an arbitrary sample x0 . 2. At each temperature Ti , i=1,2,. . .,m, simulate of the distribution f(x, Ti ) for Ni iterations using a MH sampler. Pass the final sample to the next lower temperature level as the initial sample. Honeycutt and Thirumalai (1990, 1992) and Veitshans et al. (1997) introduced an off-lattice model protein. This off-lattice model uses an ˛ carbon trace to represent the protein backbone. The so-called BLN proteins are modeled as a sequence of beads of three types: hydrophobic, hydrophilic, and neutral; these are designated by B, L, and N, respectively. These models exhibit many similarities with real proteins, and thus they are particularly useful in the study of predicting the native structure of a protein. Honeycutt and Thirumalai (1990, 1992) studied a 46-residue BLN protein, and Thirumalai and Guo (1995) and Honeycutt and Thirumalai (1992) demonstrated that the folding kinetics of the BLN 46-mer is very similar to that of real proteins. Brown et al. (2003) showed that the sequence mapping from a 20-letter amino acid code to the three-letter reduced code is sufficient for determining the folding to a target topology. Furthermore, there have been efforts in finding the global minimum-energy conformation by studying this off-lattice model protein to avoid the high energy barriers between local minima in recent years (Kim et al., 2003). In this paper, we apply the annealing stochastic approximation Monte Carlo (ASAMC) algorithm (Liang, 2007) to an off-lattice (BLN) model protein. The stochastic approximation Monte Carlo (SAMC) algorithm (Liang et al., 2007) is a generalization of the Wang–Landau algorithm (Wang and Landau, 2001) and the 1/kensemble algorithm (Hesselbo and Stinchcomb, 1995). It can be used for both continuous and discrete systems. The self-adjusting nature (explained in Section 3) of the algorithm enables it to overcome any barrier of the energy landscape. The ASAMC algorithm is an accelerated version of the SAMC algorithm for optimization problems. Our numerical results show that ASAMC is a very promising algorithm for finding the native topology of proteins. The remaining part of this paper is organized as follows. In Section 2, we provide a brief description for the BLN model. In Section 3, we review the ASAMC algorithm. In Section 4, we apply our algorithm to real datasets, and compare our algorithm with SA and MH. In Section 5, we describe the new method implementing the use of secondary structures and also apply to the same real datasets. In Section 6, we conclude the paper with a brief discussion.
Table 1 Sequence mapping between 20-letter (20) amino acid and coarse-grained threeletter (3) code. 20
3
20
3
20
3
20
3
Ala Cys Leu Ile Phe
B B B B B
Met Val Trp Tyr Pro
B B B B N
Gly Ser Thr Glu Asp
N N L L L
Asn His Gln Lys Arg
L L L L L
The total potential energy function of the BLN model with M residues is given by
U(x) =
M Kx
2 i=2
[Ai (1 + cos(i )) + Bi (1 + cos(3i ))]
i=4(dihedrals)
+ 4
M M−3
K 2 ( − 0 ) 2 i
i=3(angles)
M
+
M
(|xi − xi−1 | − a)2 +
Cij
i=1 j=i+3
dij
12
− Dij
dij
6 ,
(1)
where x = (x1 , x2 , . . ., xM ) , the force constants are given by Kx = 400/a2 and K = 20/(rad)2 , is the energy constant, a is the average bond length, xi is the position of the i th residue, is the Lennard-Jones parameter, and dij is the distance between two nonbonded residues i and j given by dij =| xi − xj |. The bond angle i is defined by three residuals i − 2, i − 1 and i, and is maintained by a harmonic potential with force constant K and equilibrium bond angle 0 = 1.8326 rad or 105◦ . First and second parts in the total energy are called the bond-stretching energy and the bond-angle bending energy, respectively. In the dihedral or torsional angle energy, each dihedral i in the chain is defined by four residues i − 3, i − 2, i − 1, and i and predefined: Ai = 0 and Bi = 0.2 if two or more of the four residues are neutral; Ai = Bi = 1.2 for all the other cases. Finally, the non-local interactions in the van der Waals energy are given by Cij = Dij = 1 for BB interactions, Cij = 2/3 and Dij = − 1 for LL and LB interactions, and Cij = 1 and Dij = 0 for all interactions involving M residues. The attractive forces in the model responsible for collapse are due to the interactions between hydrophobic beads (B–B interactions). The interactions among all other combinations of beads are repulsive, although different strengths of repulsion are used depending on the bead types involved (Brown et al., 2003). All simulations are performed in reduced units, with a, and all set equal to unity.
3. Annealing stochastic approximation Monte Carlo 2. Model A simple representation was adopted as the BLN protein model proposed by Kim et al. (2003), in which a residue was reduced to a bead and its coordinate was in the position of the C˛ atom of the residue. The protein chain is modeled as a sequence of beads of three types: hydrophobic (B), hydrophilic (L), and neutral (N) (Table 1). Attraction between the hydrophobic beads provides the energetic driving force for formation of a strong core. Repulsion between the hydrophilic beads and other beads is used to balance the forces and reduce the bias of the correct native fold. The neutral beads also serve as soft spheres with little repulsion and typically signal the turn regions in the sequence.
A protein folding prediction problem naturally contains the computational complexity for finding global energy minima. The conventional Markov chain Monte Carlo algorithm often suffers from a local trap problem (i.e., the sampler would get stuck in a local energy minimum for a long time, rendering the simulation ineffective), whereas the SAMC algorithm can perform better in this respect, whose self-adjusting mechanism can drive it to escape from local traps. In practice, however, the process of SAMC may be slow because of the broadness of the sample space. To accelerate the optimization process, we limit the sample space at each iteration of SAMC to a small region using the ASAMC algorithm. In this section we first describe the general SAMC algorithm and then present its space annealing version for optimization problems.
S. Cheon, F. Liang / BioSystems 105 (2011) 243–249
Suppose that we are interested in sampling from the following distribution, 1 exp Z
p(x) =
U(x) −
,
x ∈ X,
where X is the sample space, Z is the normalizing constant, is called the temperature, and U(x) is called the energy function. As in Ref. Liang et al. (2007), we assume that X is continuous and compact for the reason of mathematical simplicity. For example, the sample space X can be restricted to the region {x : U(x) ≤ Umax }, where Umax is sufficiently large such that the region {x : U(x) > Umax } is not of interest. Suppose that X has been partitioned according to the energy function U(x) into m disjoint subregions denoted by E1 = {x : U(x) ≤ u1 }, E2 = {x : u1 < U(x) ≤ u2 }, . . ., Em = {x : U(x) > um−1 } where u1 , u2 , . . ., um−1 are m − 1 specified real numbers. Let (x) be a non-negative function defined on the sample space X with 0 < X (x)dx < ∞, and gi = E (x)dx. In practice, we often set
i
(x) = exp −U(x)/ , and gi turns out to be the normalizing constant of the truncated distribution of p(x) on the subregion Ei . (t) Let gˆ i denote the estimate of gi obtained at iteration t. Let (t)
ti = log(ˆgi ) and t = ( t1 , . . ., tm ). SAMC forms an inhomogeneous Markov chain. At each iteration, it samples from the following distribution: 1 (x) I(x ∈ Ei ), Zt eti m
pt (x) =
t = 1, 2, . . . ,
(2)
i=1
where I(·) is the indicator function. Assume that t ∈ for all t, where is a compact set and set = [− B , B ]m with B = 1020 for all examples in this paper, although as a practical matter this is essentially equivalent to setting = m . = (1 , . . ., m ) be a mLet the desired sampling distribution m vector with 0 < i < 1 and = 1, which defines the desired i=1 i sampling frequency for each of the subregions. Let { t } be a positive, non-increasing sequence satisfying the two conditions, (i)
∞
t = ∞,
(ii)
t=1
∞
t < ∞,
t0 , max(t0 , t)
t = 1, 2, . . . ,
for some specified value of t0 > 1. A large value of t0 will allow the sampler to reach all subregions very quickly even for a large system. With the above notations, the SAMC simulation proceeds as follows. Let xt and ti denote the sample and the estimate of i , respectively, at the t th iteration of the simulation. The simulation starts with the initial estimates 01 = . . . = 0m = 0, and then iterates between the following steps: 1. Propose a new configuration x∗ in the neighborhood of xt according to a prespecified proposal distribution T(xt , ·). ∗ 2. Accept the proposed configuration x with probability min
e tJt e tJ∗
(x∗ ) T (x∗ →xt ) ,1 (xt ) T (xt →x∗ )
space exploration. The self-adjusting mechanism can be explained as follows: If a subregion is visited at iteration t, t will be updated accordingly such that the probability that this subregion (other subregions) will be revisited at the next iterations will decrease ∗ (increase). Mathematically, if xt+1 ∈ Ei , then t+1,i ← ti + t+1 (1 − / i. Note that the linear adjustment i ) and t+1,j ← tj − t+1 j for j = on transforms to a multiplying adjustment on eti ’s and thus leaves the distribution (2) invariant. The critical difference between SAMC and other stochastic approximation Markov chain Monte Carlo algorithms (Younes, 1988, 1999; Moyeed and Baddeley, 1991; Gu and Kong, 1998; Gelfand and Banerjee, 2000; Delyon et al., 1999; Gu and Zhu, 2001) is regarding sample space partitioning. Sample space partitioning improves its performance of stochastic approximation in optimization. Control of the sampling frequency also effectively prevents the system from getting trapped into local energy minima in simulations. For more details refer to Liang et al. (2007). Since now we are only interested in minimizing the energy function, U(x), we use the following space annealing version of the SAMC algorithm proposed by Liang (2007) to accelerate the optimization process. Liang (2007) suggested the space annealing version of the SAMC algorithm for neural network training and the numerical results indicated that ASAMC is superior to simulated annealing and the gradient-based algorithms in MLP training. Thus, we make use of this annealing concept to accelerate the optimization process of protein folding. The basic concept is that the sample space is limited at each iteration of SAMC to a small region to accelerate the process because the process may be slow due to the broadness of the sample space. Suppose that the subregions E1 , . . ., Em have been arranged in an ascending order by energy. Let (u) denote the index of the subregion that a sample x with energy U(x) = u belongs to. For example, if x ∈ Ei then (U(x)) = i. Let X(t) denote the sample space at iteration t. The simulation process is as follows: 1. Start with X(1) = 2. Set iteratively
m
E; i=1 i
i.e., all subregions are used.
(Umin + )
t=1
for some ∈ (1, 2). In this paper we set t =
245
,
where Jt and J∗ denote, respectively, the indices of the subregions that xt and x∗ belongs to. If it is accepted, set xt+1 = x∗ ; otherwise, set xt+1 = xt . 3. Set ∗ = t + t+1 (et+1 − ), where et+1 = (et+1,1 , et+1,2 , . . ., et+1,m ) and et+1,i = 1 if xt+1 ∈ Ei and 0 otherwise. If ∗ ∈ , set t+1 = ∗ ; otherwise, set t+1 = ∗ + c∗ where c∗ = (c∗ , . . ., c∗ ) can be an arbitrary vector which satisfies the condition ∗ + c∗ ∈ . The SAMC sampling is driven by its self-adjusting mechanism, which, consequently, implies the superiority of SAMC in sample
X(t) =
Ei ,
i=1
where Umin is the minimum energy value obtained so far in the run, and > 0 is a user-specified parameter. Note that Umin is changing during a simulation. Since the phase space X(t) shrinks iteration by iteration, this modified algorithm is called annealing SAMC (ASAMC). We considered several issues for an effective implementation of ASAMC. First, the sample space can be partitioned according to the energy function, which allows for minimizing the energy function. The maximum energy difference in each subregion should be bounded by a reasonable number (e.g., (2)) which ensures that the local MH moves within the same subregion have a reasonable acceptance rate. Second, the performance of ASAMC depends on the value of . If is too large or small, ASAMC may take a long time to locate the global minimum due to the broadness of the sample space. In this case, the sample space may contain only a few separated regions, and most of proposed transitions will be rejected. It is generally believed that allowing a sampler to jump to intermediate states of high energy will increase the probability of transitions from one local energy minimum to others. The proposal distribution used in ASAMC should be more spread out than that used in SAMC in order to reduce the negative effect of the sample space restriction. In this paper, we set = 50. This value works well for all cases considered in this research. Lastly, we consider the choice
246
S. Cheon, F. Liang / BioSystems 105 (2011) 243–249
of N and t0 , where N denotes the total number of iterations, and t0 determines the gain factor { t }. The gain factor controls the ability of ASAMC moving across subregions. The appropriateness of the choices of t0 and N can be diagnosed by examining the convergence of the run. In ASAMC, the desired sampling distribution has been set to uniform, so the convergence of the run can be diagnosed by examining the equality of the realized sampling frequencies on the subregions included in the limiting sample space X(∞) . As suggested by Wang and Landau (2001), a run can be regarded as converged if the sampling frequency of each subregion is not less than 80% of the average sampling frequency. If the simulation is diagnosed as unconverged, SAMC should be re-run with a larger value of t0 , a larger number of iterations, or both. We fixed a temperature = 1.0 and t0 = 2500 for ASAMC in all simulations of this paper.
4. Numerical examples In this study, we show how ASAMC can be used for protein folding prediction. Two small proteins, 1A7F and 9INS, were selected as the tested targets from the RCSB Protein Data Bank (Berman et al., 2000). ASAMC was first applied to these two real proteins. In both proteins, the sample space was partitioned into 102 subregions with an equal energy bandwidth of 1.0; that is, we set E1 = {x ∈ X : U(x) ≤ −10}, E2 = {x ∈ X : and −10 < U(x) ≤ −9}, . . . , E1 0 1 = {x ∈ X : 99 < U(x) ≤ 100}, E1 0 2 = {x ∈ X : U(x) > 100}. We had three types of local moves as follows. These moves were used equally likely at each iteration. The simulation starts with the initial estimates x(0) = ( 03 , . . ., 0M , 04 , . . ., 0M ) generated randomly from the standard normal distribution. Let x(t) = ( t3 , . . ., tM , t4 , . . ., tM ) denote the current state of the inhomogeneous Markov chain at iteration t. In the type-I move, a component of x(t) is selected randomly by modifying with Gaussian random variable ∼ N(0, s2 [U(x(t) ) − U0 ]), where s is a user tunable parameter and U0 is a user-guessed lower bound for U(x(t) ); i.e., x(t) is updated by x(t) + . We set U0 = − 10. The vari-
ance of suggests a different step size s U(x(t) ) − U0 for different states. The step size is large for high-energy states, and the step size is small for low-energy states. This allows the sampler to move through the high-energy region fast and explore the low-energy region in detail. The type-II move is the same as the type-I move, except for which two components are picked up randomly. In the type-III move, a spherical proposal distribution is used. A direction is generated uniformly, and then the radius is drawn from N(0, 2s2 [U(x(t) ) − U0 ]). In this case, we set s = 4.0 for 1A7F, and s = 3.0 for 9INS. The ASAMC algorithm was run 5 times independently, and each run consisted of 1.0 × 107 iterations. Table 2 shows the average energy minimum sampled among the five runs via ASAMC. For comparison, the SAMC, SA and MH algorithms were also applied to this example. Each one was run 5 times independently with the same parameter setting as ASAMC except that for SA, an initial temperature was set to T0 = 1.0 and decreased monotone in a function like Ts+1 = 0.99Ts , and for MH, we run the algorithm with various temperatures in 1A7F protein to find a proper temperature and found the energy minimum was obtained at a temperature, 0.3. Thus, the temperature, 0.3, was applied to all proteins in this paper. Each run consisted of 1.0 × 107 iterations. Table 2 shows the minimum energy values found by SAMC, SA and MH among the five runs, respectively. The comparison indicates that ASAMC and SAMC can have an ability to find the most stable state in protein folding for this example, and that SA and MH tend to get trapped in a local energy minimum. Although all root mean square deviations (RMSD) between the predicted protein folding and their native structures for 1A7F and 9INS are not within the reasonable range, 3 to 7.5 for small proteins
as suggested by Lu et al. (2003), the comparison shows that the ASAMC algorithm has made a significant improvement over SAMC, SA and MH in finding the ground states for the BLN model protein. For all proteins, the average minimum energy found by our algorithm is lower than the minimum energy by SAMC, SA and MH in all runs. The differences of the energy values come from the differences of the folding predicted angles. The values of RMSD were obtained from distance between the folded predicted protein structures and their native structures. Protein structures tested are as follows: 1A7F (29 residues – BBLLLBBNNL BBLBBLBBBN LLNNBBLNL); 9INS (30 residues – BBLLLBBNNL BBLBBBBBBN LLNBBBLNLB). Fig. 1 shows that the folded predicted structure using the ASAMC algorithm is somewhat similar to, but is not exactly the same as, the native structure of 9INS. That is, there may be a problem in the turn direction of the predicted structure. 5. The use of secondary structure in protein folding Levinthal (1968) and Wetlaufer (1973) argued that the time for a random search of all possible structures would be unrealistically long for even a small protein, and that something like a nucleation event must occur to permit structure formation in biologically feasible time. They found that distinct structural regions have been found in several globular proteins composed of single polypeptide chains, and proteins fold much too fast to involve an exhaustive search. This is the so called Levinthal paradox: how can a protein find a native state without a globally exhaustive search? Experiments show that a protein folds to its native state according to a relatively small number of pathways; i.e., it folds by a specific sequence of molecular events (Creighton, 1978; Kim and Baldwin, 1990). A pathway is defined by the intermediate states and the transition states which occur between the initial and final states (Creighton, 1978). Some pathways are strongly favored; i.e., folding is “cooperative” in that a “nucleating” hydrophobic–hydrophobic (H–H) contact acts as a constraint that restricts local conformational freedom on 2D hydrophobic–hydrophilic (HP models) and speeds the “zipping up” of other contacts nearby. Mainly two types of interaction contribute to protein folding: (i) the helix-coil propensities, among monomers that are connected neighbors in the chain sequence; (ii) the hydrophobic and solvent interactions among monomers that may either be near each other or far apart in the chain sequence (Miller et al., 1992; Dill et al., 1993). Liang and Wong (2001) proposed the use of secondary structures for protein folding in 2D HP model and showed that it is very successful in finding low energy states. Motivated by the above studies, we add the following steps using the secondary structures to the ASAMC algorithm for speeding up the simulation: 1. Sample the subsequences at random in the library consisting of subsequences, bond and dihedral angles, which will possibly fold to secondary structures in the native states of known proteins. 2. Perform sampling on the constrained conformation space where some subsequences are subject to possible secondary structures. The representation we use for protein subsequences is based on a library of various fragments of protein backbone according to the number of residues, and consists of bond and dihedral angles of ˛-helix and ˇ-sheet in its C˛ atoms. To generate the library we considered 122 protein domains based on Rost and Sander database (Rost and Sander, 1993) whose three-dimensional structure was accurately determined. The proteins (1A7F, 9INS, 1CRN and 1EJG), which are considered in all examples of this paper, are included in these 122 protein domains. These secondary structures, ˛-helix and ˇ-sheet, were obtained from the RCSB Protein Data Bank. Each
S. Cheon, F. Liang / BioSystems 105 (2011) 243–249
247
Table 2 Computational results produced by ASAMC, SAMC, SA and MH for the real datasets. Native: energy value of the native structure. CPU: CPU time (in min) cost by a single run of the algorithm on an Intel Windows 7 computer. Each of the other entries of the table is calculated by averaging over five independent runs, and the number in the parentheses represent the standard deviation of the corresponding average. Average: the averaged minimum energy value sampled by the algorithm and the standard error of the average. RMSD: root mean square deviation. 1A7F
9INS
Methods
CPU(m)
Average
RMSD
CPU(m)
Average
RMSD
ASAMC SAMC SA MH Native
17.50 17.25 17.53 17.16
−3.819 (0.132) −3.353 (0.617) −1.509 (0.724) 4.181 (0.299) 60.7142
8.374 (0.553) 10.533 (3.482) 19.803 (6.853) 21.762 (7.688)
18.51 18.54 18.24 18.05
−6.557 (0.401) −4.966 (1.275) −0.880 (2.232) 1.700 (0.792) 61.9850
9.552 (1.676) 12.118 (5.715) 19.623 (8.317) 20.044 (12.300)
Fig. 1. Comparison of the native structure with best target structure produced by ASAMC for 9INS. (a) A native structure. (b) A folded best structure generated by ASAMC. (c) The native structure superimposed on the best structure found by ASAMC.
of these fragments (sequences) was grouped using MCLUST (Fraley and Raftery, 2002), a software package for model-based clustering, density estimation, and discriminant analysis interfaced to the S-PLUS commercial software, to cluster angles. The possible secondary structures folded by subsequences of ˛-helix and ˇ-sheet are illustrated in Fig. 2(a–d), which correspond to ˇ-sheet (with 2 residues), and ˛-helix (with 8, 10 and 11 residues) of a real protein, respectively. However, the total number of secondary structures that could be folded by the subsequence may be huge as the number of amino acid sequence increases. Thus, an essentially arbitrary distribution should be assigned to these secondary structures with each structure having a nonzero mass value. For example, a secondary structure for ˇ-sheet with 2 residues like Fig. 2(a) consists of bond and dihedral angles. Bond angles were divided into two groups with probabilities 0.34536 and 0.65464, respectively, and dihedral angles were divided into two groups with probabilities 0.54639 and 0.45361, respectively. Each of these groups has a different mean and variance. Once one group with a specified mean and variance is selected according to its probability for each angle, new bond and dihedral angles are generated. These angles make a three-dimensional structure for protein folding. For example, we assign probabilities 0.65464 and 0.54639 to 2 residues of
the structure, shown in Fig. 2(a), for bond and dihedral angles, respectively. These groups have different means and variances. This function will then work as a proposal transition function for the move of the block of residues; i.e., the resulting simulation will be ergodic. The energy function in (1) has been changed to the function with constraints in secondary structures accordingly. Under this constrained conformation space, we consider 1A7F (29 residues), 9INS (30 residues), and two new small proteins such as 1EJG (46 residues) and 1CRN (46 residues). By sampling on the constrained conformation space, we fold all these proteins rapidly to their lowest energy states. The primary sequences of new proteins are given as follows: 1CRN/1EJG (46-LLBBNNBBBL NLBLBBLBNN LNLBBBBLBL NBBBBNNBLB NNLBBL). We set = 1.0, t0 = 2500, and U0 = 20. The sample space was partitioned into 132 subregions with an equal energy bandwidth of 1.0; that is, we set E1 = {x ∈ X : U(x) ≤ 20}, E2 = {x ∈ X : 20 < U(x) ≤ 21}, . . . , E1 3 1 = {x ∈ X : 149 < U(x) ≤ 150}, and E1 3 2 = {x ∈ X : U(x) > 150}. For a step size, we set s = 4.0 for all proteins. The secondary structure sequences (Fig. 2) used in this example were obtained with the DSSP software designed by Kabsch and Sander (1983). That is, these subsequences were used as known information. The ASAMC algorithm was run 5 times independently, and each run consisted
Fig. 2. Each one example of secondary structures folded by subsequences of ˛-helix and ˇ-sheet. (a) Extended ˇ-sheet with 2 residues. (b) ˛-helix with 8 residues. (c) ˛-helix with 10 residues. (d) ˛-helix with 11 residues.
248
S. Cheon, F. Liang / BioSystems 105 (2011) 243–249
Table 3 Computational results produced by ASAMC, SAMC, SA and MH for the real datasets via the use of secondary structures. (a) The subsequence, residues 9–18, is constrained to the secondary structure as ˛-helix for 1A7F. (b) Residues 7–17 is constrained as ˛-helix. (c) Residues 7–17 and 23–30 are constrained as ˛-helix, and 33–34 is constrained as ˇ-sheet). (d) Residues 7–17 and 23–30 are constrained as ˛-helix. 1A7Fa
9INSb
Methods
CPU(m)
Average
RMSD
CPU(m)
Average
RMSD
ASAMC SAMC SA MH Native
38.41 39.32 38.24 38.22
12.577(0.324) 14.299(0.111) 14.311(0.574) 17.713(0.388) 60.7142
6.934(0.609) 7.088(0.659) 10.000(0.635) 11.204(2.489)
41.14 42.08 40.47 40.41
13.959(0.741) 14.021(0.880) 14.543(1.210) 20.216(3.089) 61.9850
6.762(0.643) 8.000(0.987) 10.112(1.984) 10.798(2.828)
1CRNc
1EJGd
Methods
CPU(m)
Average
RMSD
CPU(m)
Average
RMSD
ASAMC SAMC SA MH Native
93.17 92.58 90.22 89.14
31.205(1.385) 33.112(2.244) 41.797(2.984) 42.980(4.887) 101.2824
7.446(0.279) 10.132(2.803) 11.368(4.034) 13.592(5.697)
87.47 87.33 84.87 84.43
27.854(1.090) 32.204(1.646) 39.260(3.212) 40.686(5.121) 103.2615
7.140(0.478) 8.245(2.631) 8.185(1.165) 13.692(3.267)
Fig. 3. Comparison of native structures superimposed on the best structure produced by ASAMC, SA and MH for 1A7F. (a) ASAMC. (b) SA. (c) MH.
Fig. 4. Comparison of native structures superimposed on the best structure produced by ASAMC, SA and MH for 9INS. (a) ASAMC. (b) SA. (c) MH.
Fig. 5. Comparison of native structures superimposed on the best structure produced by ASAMC, SA and MH for 1CRN. (a) ASAMC. (b) SA. (c) MH.
Fig. 6. Comparison of native structures superimposed on the best structure produced by ASAMC, SA and MH for 1EJG. (a) ASAMC. (b) SA. (c) MH.
S. Cheon, F. Liang / BioSystems 105 (2011) 243–249
of 1.0 × 107 iterations. For comparison, SAMC, SA and MH were also applied to this example. Each one was run 5 times independently with the same parameter setting as ASAMC, and each run consisted of 1.0 × 107 iterations. Table 3 shows the energy minimum found by ASAMC, SAMC, SA and MH among the five runs, respectively. The reported values are the lowest energy values and the RMSD data achieved during the most efficient run. The comparison indicates that ASAMC produced the lowest energy among them in all runs. The RMSD values are also within the reasonable range based on information of known structures (Lu et al., 2003). The folded predicted structures found by the constrained ASAMC sampler are shown in Figs. 3–6 for 1A7F, 9INS, 1CRN and 1EJG, respectively. The figures indicate the folded target structures predicted by our method are very similar to the native structures for all proteins. 6. Conclusion In this paper, we have applied the annealing stochastic approximation Monte Carlo algorithm to protein folding prediction. The algorithm was compared with the stochastic approximation Monte Carlo, simulated annealing and conventional Markov chain Monte Carlo algorithms on real datasets. The numerical results show that our algorithm outperforms three other algorithms. Among four methods, ASAMC found the lowest energy and root mean square deviation. It indicates that our algorithm can be effectively applied to simulations of protein folding using the BLN protein model, and thus it is a very promising algorithm for general optimization problems. Based on the suggestion of Liang and Wong (2001) and Liang (2004) that if we can incorporate some specific moves, which are designed based on some specific properties of the protein model, the performance of the algorithm may be further improved, we have proposed the new method via the use of protein secondary structures. This method shows that the average minimum energy and root mean square deviation found by the ASAMC algorithm are smaller than those by the SAMC, simulated annealing and traditional Markov chain Monte Carlo algorithms in all runs. The root mean square deviation of our simulations is also within the reasonable range (Moult et al., 1999; Venclovas et al., 1999; Lu et al., 2003), although a value of 6 Å has been suggested as a target value for a small protein (Reva et al., 1998). The improvement of accuracy of protein folding prediction will be our further research. In addition, since the numeric results of this paper indicate that the global energy minimum in BLN model does not correspond to that of the native structure of protein, the study about more appropriate model than BLN model for the global energy minimum will be also our another further research. References Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The protein data bank. Nucleic Acids Research 28, 235–242. Brown, S., Fawzi, N.J., Head-Gordon, T., 2003. Coarse-grained sequences for protein folding and design. Proceedings of the National Academy of Sciences of the United States of America 100, 10712–10717. Creighton, T.E., 1978. Experimental studies of protein folding and unfolding. Progress in Biophysics and Molecular Biology 33, 231–297. Delyon, B., Lavielle, M., Moulines, E., 1999. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics 27, 94–128.
249
Dill, K.A., Fiebig, K.M., Chan, H.S., 1993. Cooperativity in protein-folding kinetics. Proceedings of the National Academy of Sciences of the United States of America 90, 1942–1946. Fraley, C., Raftery, A.E., 2002. Model-based on clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97, 611–631. Gelfand, A.E., Banerjee, S., 2000. Computing marginal posterior modes using stochastic approximation. Unpublished Manuscript. Department of Statistics, University of Connecticut. Gu, M.G., Kong, F.H., 1998. A stochastic approximation algorithm with Markov chain Monte-Carlo method for incomplete data estimation problems. Proceedings of the National Academy of Sciences of the United States of America 95, 7270–7274. Gu, M.G., Zhu, H.T., 2001. Maximum likelihood estimation for spatial models by Markov chain Monte Carlo stochastic approximation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 95, 339–355. Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109. Hesselbo, B., Stinchcomb, R.B., 1995. Monte Carlo simulation and global optimization without parameters. Physical Review Letters 74, 2151–2155. Honeycutt, J.D., Thirumalai, D., 1990. Metastability of the folded states of globular proteins. Proceedings of the National Academy of Sciences of the United States of America 87, 3526–3529. Honeycutt, J.D., Thirumalai, D., 1992. The nature of folded states of globular proteins. Biopolymers 32, 695–709. Kabsch, W., Sander, C., 1983. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637. Kim, P.S., Baldwin, R.L., 1990. Intermediates in the folding reactions of small proteins. Annual Review of Biochemistry 59, 631–660. Kim, S.Y., Lee, S.J., Lee, J., 2003. Conformational space annealing and an off-lattice frustrated model protein. Journal of Chemical Physics 119 (19), 10274–10279. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220 (4598), 671–680. Levinthal, C., 1968. Are there pathways for protein folding? Journal de Chimie Physique et de Physico Chimie Biologique 65, 44–45. Liang, F., 2004. Annealing contour Monte Carlo algorithm for structure optimization in an off-lattice protein model. Journal of Chemical Physics 120 (14), 6756–6763. Liang, F., 2007. Annealing stochastic approximation Monte Carlo for neural network training. Machine Learning 68 (3), 201–233. Liang, F., Liu, C., Carroll, R.J., 2007. Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association 102, 305–320. Liang, F., Wong, W.H., 2001. Evolutionary Monte Carlo for protein folding simulations. Journal of Chemical Physics 115 (7), 3374–3380. Lu, B.Z., Wang, B.H., Chen, W.Z., Wang, C.X., 2003. A new computational approach for real protein folding prediction. Protein Engineering 16 (9), 659–663. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equations of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087–1091. Miller, R., Danko, C.A., Fasolka, M.J., Balazs, A.C., Chan, H.S., Dill, K.A., 1992. Folding kinetics of proteins and copolymers. Journal of Chemical Physics, 768–780. Moult, J., Hubbard, T., Fidelis, K., Pedersen, J.T., 1999. Critical assessment of methods of protein structure prediction (CASP): round III. Proteins: Structure, Function & Genetics 3, 2–6. Moyeed, R.A., Baddeley, A.J., 1991. Stochastic approximation of the MLE for a spatial point pattern. Scandinavian Journal of Statistics 18, 39–50. Reva, B.A., Finkelstein, A.V., Skolnick, J., 1998. What is the probability of a chance prediction of a protein structure with an rmsd of 6 A? Folding and Design 3 (2), 141–147. Rost, B., Sander, C., 1993. Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology 232, 584–599. Thirumalai, D., Guo, Z., 1995. Nucleation mechanism for protein folding and theoretical predictions for hydrogen-exchange labelling experiments. Biopolymers 35, 137–140. Veitshans, T., Klimov, D., Thirumalai, D., 1997. Protein folding kinetics: time scales, pathways, and energy landscapes in terms of sequence dependent properties. Folding and Design 2, 1–22. Venclovas, C., Zemla, A., Fidelis, K., Moult, J., 1999. Some measures of comparative performance in the three CASPs. Proteins: Structure, Function & Genetics 37, 231–237. Wang, F., Landau, D.P., 2001. Efficient multiple-range random walk algorithm to calculate the density of states. Physical Review Letters 86, 2050–2053. Wetlaufer, D.B., 1973. Nucleation, rapid folding, and globular intrachain regions in proteins. Proceedings of the National Academy of Sciences of the United States of America 70, 697–701. Younes, L., 1988. Estimation and annealing for Gibbsian fields. Annales de l’institut Henri Poincare (B)—Probabilites et Statistiques 24, 269–294. Younes, L., 1999. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastic and Stochatic Reports 65, 1779–228.