Evolutionary Computer Programming of Protein Folding and Structure Predictions

Evolutionary Computer Programming of Protein Folding and Structure Predictions

ARTICLE IN PRESS Journal of Theoretical Biology 229 (2004) 13–18 Evolutionary Computer Programming of Protein Folding and Structure Predictions a, a...

288KB Sizes 0 Downloads 86 Views

ARTICLE IN PRESS

Journal of Theoretical Biology 229 (2004) 13–18

Evolutionary Computer Programming of Protein Folding and Structure Predictions a, a . . Bengt Nolting *, Dennis Julich , Winfried Vonaub, Karl Andertc a

Prussian Private Institute of Technology at Berlin, Am Schlosspark 30, Berlin D-13187, Germany b University of Applied Sciences Mittweida, Technikumplatz 17, Mittweida D-09648, Germany c Institute for Biological-Medical Research and Technology, Am Sandhaus 31, Berlin D-13125, Germany Received 3 September 2003; received in revised form 9 February 2004; accepted 25 February 2004

Abstract In order to understand the mechanism of protein folding and to assist the rational de-novo design of fast-folding, non-aggregating and stable artificial enzymes it is very helpful to be able to simulate protein folding reactions and to predict the structures of proteins and other biomacromolecules. Here, we use a method of computer programming called ‘‘evolutionary computer programming’’ in which a program evolves depending on the evolutionary pressure exerted on the program. In the case of the presented application of this method on a computer program for folding simulations, the evolutionary pressure exerted was towards faster finding deep minima in the energy landscape of protein folding. Already after 20 evolution steps, the evolved program was able to find deep minima in the energy landscape more than 10 times faster than the original program prior to the evolution process. r 2004 Elsevier Ltd. All rights reserved. Keywords: Protein structure predictions; Protein folding simulations; Biomolecule dynamics; Bioinformatics methods; Biomolecule folding; Protein folding dynamics; Protein folding kinetics

1. Introduction In the unfolded state proteins can attain a gigantic number of conformations, e.g., roughly 10100 for a 100. residue protein (see, e.g., Nolting, 1999b). In view of this gigantic number of possibilities, the question how most proteins can find the unique native conformation within minutes or faster is considered one of the mysteries of modern biophysics. A large number of studies have been devoted to this question, see, e.g., Goto and Aimoto . (1991), Nolting et al. (1995, 1997), Goto et al. (1999), . . Nolting (1999b), Gruebele (1999,2002a); Nolting and Andert (2000), Oliveberg (2001); Ghosh et al. (2002), Iguchi (2002) and Gromiha (2003). To reduce the experimental effort for the resolution of protein folding and structure, it is desirable to be able to perform computer simulations of folding processes. Since it is impossible to calculate to the energies of 10100 conformations, one has to find clever approaches to speed up the simulation. Here, we present a study of *Corresponding author. . E-mail address: [email protected] (B. Nolting). 0022-5193/$ - see front matter r 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.jtbi.2004.02.016

protein folding for which we had a computer program evolve itself towards fast finding of deeper minima in the energy landscape of folding. 2. Materials and methods 2.1. Principle of operation of the program for the simulation of folding For a single folding simulation (Fig. 1), the C++ program first randomly calculates a relatively expanded structure. To perform one folding step of the molecule, the program simply calculates a molecular motion of this structure, e.g., a rotation around one bond of the protein molecule. Then it calculates the free energy of the resulting, new conformation and discards it in case the free energy has increased relative to the conformation prior to the molecular motion. Then the program proceeds by calculating a new molecular motion of the valid structure, and so on. After many thousands of molecular motions (folding steps), the program stops to fold the molecule and calculates the final energy of the protein molecule.

ARTICLE IN PRESS 14

B. Nolting et al. / Journal of Theoretical Biology 229 (2004) 13–18 .

Fig. 1. Sequence of one folding simulation, typically consisting of a few 10,000 folding steps. This scheme is valid for the wild-type program and also the mutant programs—mutant programs differ only by the specific types, magnitudes and sequences of molecular motions used to generate the folded conformation. The programs typically perform at least a few hundreds of such simulations and determine, which simulations led to the conformations with the lowest energies.

Usually the program performs a large set of such folding simulations. This set typically consists of at least a few hundreds of folding simulations, each using a different expanded starting structure. The program then determines which simulations yielded the structures with the lowest free energies. In order to speed up the simulation, the program works with a lattice model of the protein structure in which only bond angle changes between adjacent amino acid residues of 0, 745 , and 790 are allowed in one or simultaneously two of the three main planes (xy, yz, zx) of the lattice. Amino acid residues of the protein molecule are treated as spheres. For details on lattice models see, e.g., Shakhnovich et al. (1996), Shakhnovich (1997) and Mirny and Shakhnovich (2001). Calculations of sinus, cosinus and square root functions are replaced by tables containing the values needed by the program. For the potential of interaction between amino acids used in the folding simulations see Fig. 2. The potential is similar to that used by Casari and Sippl (1992), but contains a very strong repulsive component at very short distances which accounts for the atomic repulsion and prevents overlap of amino acid residues in the generated structures. 2.2. Evolution of the program for the simulation of folding Evolution of the folding-simulation program was performed on a PC cluster containing 7 PCs. The evolution procedure is shown in Fig. 3: The initial program contains a large number of program loops which generate different molecular motions, e.g., rotations around certain bonds in the protein molecule. Mutations of the program were created, e.g., by using different types or magnitudes of molecular motions, or different positions of the bonds around which rotations are performed, or different sequences of these motions. Seven mutant programs are tested on the 7 PCs at the same time. Positive mutants, i.e. those which performed better than the original (wild-type) program, are used as

Fig. 2. Energy potential used for the folding simulation (see also p. 22 . in Nolting, 2003). The radius is measured between the centers of the spheres representing the amino acid residues of the protein molecule. For each pair of amino acid residues in the protein molecule, this energy potential is multiplied with a factor which depends on the types of the two interacting residues, similar to the factors described by Casari and Sippl (1992). To improve protein foldability, an amino acid-independent constant was added to all factors of Casari and Sippl.

a new wild-type program for further mutations. Negative program mutants, i.e. those which did not find a deeper energy minimum within a certain period of time, were discarded.

3. Results and discussion 3.1. Evolution of a computer program Fig. 4 shows the evolution of the folding simulation program during 20 evolution steps. In the course of the evolution of the program, its ability to find deep energy minima within a certain period of time improves dramatically. Even when running the initial program for a 10 times longer period of time, it did not find the deepest energy minima which were found by the evolved program within the 10 times shorter period of time. Thus, in terms of time gain, the evolved program is more than 10 times faster than the initial (wild-type) program. 3.2. Application of the evolution method on protein folding simulations The evolved program was used for a relatively long simulation of folding of chymotrypsin inhibitor 2 (CI2) which involved a few thousands of single folding simulations, all starting with different, relatively expanded conformations. Similarly to simulations of other sequences, the program found a large number of shallow energy minima corresponding to relatively expanded misfolded conformations and a few very deep energy minima corresponding to compact conformations. Fig. 5 shows a folding simulation leading to a deep

ARTICLE IN PRESS B. Nolting et al. / Journal of Theoretical Biology 229 (2004) 13–18 .

15

Fig. 3. Scheme of the evolution of the computer program: (a) an initial computer program is written, e.g., a program for the simulation of protein folding, (b) a certain number of mutants of the computer program is created, (c) the mutants are run on one, or preferentially several, computers for a certain period of time, and the program which performed best within the set of mutants and wild-type is selected, (d) with the selected bestperforming program, the procedure proceeds analogously further. Each evolution step leads to an improved or unaltered program. The latter is the case if all mutants are negative mutants.

Fig. 4. Example for an evolution process of the program: 20 Evolution steps led to a program which found, within the same time, an energy minimum significantly deeper than the original (wild-type) program.

energy minimum: one can see that already after a very small number of folding steps (a few hundreds), the initial expanded conformation starts to collapse and then slowly reorganizes into a compact native-like state. This sequence of events is in good agreement with experimental results of the pathway of folding of proteins in general (see, e.g., Sosnick et al., 1996; . . Nolting, 1999b; Nolting et al., 1995, 1997; Englander, 2000; Englander et al., 2002; Gruebele, 2002b; Galzitskaya et al., 2001, 2002, 2003; Myers and Oas, 2002; Misumi et al., 2002; Klepeis and Floudas, 2002; Taulier and Chalikian, 2002; Chalikian and Filfil, 2003) and CI2 . in particular (see, e.g., Nolting, 1999a). In particular, similarly to native CI2, the simulated structure has a long loop. This is quite surprising considering the slight imperfections of the simulation regarding the assumed energy potential, the actual folding molecular movements, and regarding the rounding of numbers due to the discretization of space (lattice model imperfections). In summary, the evolution procedure shows very good results for CI2, as indicated by the improvement of

Fig. 5. Lattice simulation of protein folding with an evolved program: The sequence is equal to residues 20–83 of chymotrypsin inhibitor II (McPhalen and James, 1987). Each circle represents an amino acid residue of the protein. The sizes of the circles indicate the third dimension: large circles correspond to amino acid residues in the foreground; small circles to those in the background. (a) Step 100 of the protein folding simulation: the conformation is still mainly expanded. (b) Folding step 448: the molecule has collapsed into a compact conformation which already contains the native loop of CI2. (c) and (d) Folding steps 5,680 and 22,858: further condensation and reorganization. (e) Folding step 143,042: formation of a native-like . conformation. See also p. 21 in (Nolting, 2003).

the finding of deep energy minima by the program. The evolved program yields far better results than the initial program which was based on rational design and assumptions regarding the movements in possible folding events.

ARTICLE IN PRESS 16

B. Nolting et al. / Journal of Theoretical Biology 229 (2004) 13–18 .

3.3. Application of the evolution method on another protein To test the general applicability of the evolved program, it was used to fold the 64-residue N-terminal fragment of the protein barstar. Barstar, the inhibitor of the ribonuclease barnase, has been well characterized regarding its native structure and the structures of its . . folding transition states (Nolting et al., 1997; Nolting, 1999b). The deepest energy minimum found by the program in a 5-h simulation was at –3.26, compared with –1.20 (units are as in Fig. 4) for the non-evolved program. The 50 deepest energy minima all were below –2.85 (Fig. 6). Also here the initial, rationally designed program was not able to find comparably deep energy minima within a 10 times longer period of time. This shows that the evolved program also offers advantages for structure simulations of other proteins. The simulated structure corresponding to the deepest energy minimum found within the 5-h simulation was compact, did not contain non-native long loops, and overall displays some similarity to native barstar (Fig. 7). Further improvement should be obtainable by carrying out further evolution steps, evolving the program specifically for each protein, and using the evolved program for longer simulations.

3.4. Comparison with other methods Most computer programs for protein structure prediction use experimental information stored in protein structure databases, such as the Brookhaven National Laboratory Protein Data Bank (Abola et al., 1997). A common approach is based on sequence alignment with structurally known proteins. For an overview on such

( for (a) Fig. 7. Inter-residue contact plots for a cut-off distance of 8 A native barstar fragment (1–64) and (b) the simulated barstar fragment (1–64) corresponding to the deepest energy minimum (energy=–3.26) found within the 5-h simulation shown in Fig. 6.

Fig. 6. Distribution of the deepest energy minima (energyo–2.50; units are as in Fig. 4) found by the program during a 5-hour simulation for the barstar fragment (1–64). Each bar corresponds to one folding simulation. The time axis is only approximately. Many of the conformations with only moderate energy (–2.50y–2.60) are compact, but misfolded. The conformation with the lowest energy (–3.26) has a large degree of native-like properties (see the text and Fig. 7).

programs see, e.g., (Webster, 2000). Unfortunately, for most proteins, currently the computational prediction is not precise and reliable enough for application in the pharmaceutical industry. In particular, methods which are based on some kind of sequence alignment with structurally resolved proteins generally encounter tremendous difficulties for proteins with completely new sequences. Despite large efforts, the prediction probability has only slightly improved within the last 10

ARTICLE IN PRESS B. Nolting et al. / Journal of Theoretical Biology 229 (2004) 13–18 .

years. So most companies and academic institutions interested in protein structures still prefer the experimental approach, e.g., X-ray crystallographic analysis . (see, e.g., Nolting, 2003), nuclear magnetic resonance . (NMR), and F-value analysis (see Nolting, 1999b). Unfortunately, these experimental approaches are very expensive and time-consuming: NMR requires a significant solubility of the protein of interest and still cannot be applied on large proteins. X-ray crystallographic analysis requires well-diffracting protein crystals which often, even after years of experiments, cannot be obtained. F-value analysis can only resolve the transient structures (intermediate and transition state) of proteins if the native structure is known. Consequently, there is a significant demand in innovative approaches of protein structure resolution using little or no experimental data on the protein of interest and which are able to determine the structures of proteins that do not have sequence similarities with other, already resolved, proteins. The present evolution method of computer programming is a novel approach towards cheaper and faster structure determination in protein science. Even though the method is not yet able to calculate the structure of large proteins with great precision, it yielded dramatic improvements towards finding the global energy minimum and deep local energy minima, compared to traditional methods of computer programming. There have been many earlier attempts to resolve protein structures by using lattice simulations. However, many highly successful lattice simulations were constrained to 27-mers (see, e.g., Shakhnovich et al., 1996; Shakhnovich, 1997; Mirny and Shakhnovich, 2001). This is because the computational effort to sample all conformations rises exponentially with the size of the protein: a protein with 64 residues has roughly 1037 more possible conformations than a 27-meric peptide . (see Nolting, 1999b). Accordingly, the time for exhaustive enumeration of all conformations of a 64-mer takes roughly 1037 times longer than for a 27-mer. Obviously, for proteins of the size of CI2 or barstar, traditional methods of computer programming do not promise much success. The reason for the good results of our evolved computer program are mainly seen in the following facts: (a) the program uses general information on protein structures, (b) the program was significantly improved in the course of an evolution process, and (c) the chosen type of lattice appears to enable a relatively good representation of the real-protein structure, compared with some other types of lattices.

4. Conclusions 1. Evolution methods of computer programming can significantly speed up certain types of computations

17

in which it is difficult or impossible to find the fastest procedure by rational design. In the course of the evolution process, the program code changes significantly. This contrasts most simple self-learning programs, e.g., common programs for neuronal network analysis, in which mainly data tables change in the course of the learning process. 2. In particular, computer programs for the simulation of protein folding and protein structure prediction can be significantly improved by using an evolution approach in which the program evolves towards faster finding deep energy minima in the highly complex landscape of folding. 3. In the specific program for the simulation of protein folding and prediction of the folded conformation presented here, only 20 evolution steps yielded a more than 10-fold increase of the speed of finding deep energy minima in the energy landscape of two 64residue proteins.

References Abola, E.E., Sussman, J.L., Prilusky, J., Manning, N.O., 1997. Protein data bank archives of three-dimensional macromolecular structures. Methods Enzymol. 277, 556–571. Casari, G., Sippl, M.J., 1992. Structure-derived hydrophobic potential—hydrophobic potential derived from X-ray structures of globular proteins is able to identify native folds. J. Mol. Biol. 224, 725–732. Chalikian, T.V., Filfil, R., 2003. How large are the volume changes accompanying protein transitions and binding? Biophys. Chem. 104, 489–499. Englander, S.W., 2000. Protein folding intermediates and pathways studied by hydrogen exchange. Ann. Rev. Biophys. Biomol. Struct. 29, 213–238. Englander, S.W., Mayne, L., Rumbley, J.N., 2002. Submolecular cooperativity produces multi-state protein unfolding and refolding. Biophys. Chem. 101–102, 57–65. Galzitskaya, O.V., Ivankov, D.N., Finkelstein, A.V., 2001. Folding nuclei in proteins. FEBS Lett. 489, 113–118. Galzitskaya, O.V., Higo, J., Finkelstein, A.V., 2002. Alpha-helix and beta-hairpin folding from experiment, analytical theory and molecular dynamics simulations. Curr. Protein Pept. Sci. 3, 191–200. Galzitskaya, O.V., Garbuzynskiy, S.O., Ivankov, D.N., Finkelstein A, .V., 2003. Chain length is the main determinant of the folding rate for proteins with three-state folding kinetics. Proteins 51, 162–166. Ghosh, A., Elber, R., Scheraga, H.A., 2002. An atomically detailed study of the folding pathways of protein a with the stochastic difference equation. Proc. Natl. Acad. Sci. USA. 99, 10394–10398. Goto, Y., Aimoto, S., 1991. Anion and ph-dependent conformational transition of an amphiphilic polypeptide. J. Mol. Biol. 218, 387–396. Goto, Y., Hoshino, M., Kuwata, K., Batt, C.A., 1999. Folding of blactoglobulin, a case of the inconsistency of local and non-local interactions. In: Kuwajima, K., Arai, M. (Eds.), Old and New Views of Protein Folding. Elsevier, Amsterdam, pp. 3–11. Gromiha, M.M., 2003. Influence of cation-pi interactions in different folding types of membrane proteins. Biophys Chem. 103, 251–258. Gruebele, M., 1999. The fast protein folding problem. Annu. Rev. Phys. Chem. 50, 485–516.

ARTICLE IN PRESS 18

B. Nolting et al. / Journal of Theoretical Biology 229 (2004) 13–18 .

Gruebele, M., 2002a. Protein folding: the free energy surface. Curr. Opin. Struct. Biol. 12, 161–168. Gruebele, M., 2002b. An intermediate seeks instant gratification. Nat. Struct. Biol. 9, 154–155. Iguchi, K., 2002. Statistical mechanical foundation for the two-state transition in protein folding of small globular proteins. Int. J. Mod. Phys. B 16, 1807–1839. Klepeis, J.L., Floudas, C.A., 2002. Ab initio prediction of helical segments in polypeptides. J. Comput. Chem. 23, 245–266. McPhalen, C.A., James, M.N.G., 1987. Crystal and molecular structure of the serine protease inhibitor CI2 from barley seeds. Biochemistry 26, 261–269. Mirny, L., Shakhnovich, E., 2001. Protein folding theory: from lattice to all-atom models. Annu Rev Biophys Biomol Struct 30, 361–396. Misumi, Y., Terui, N., Yamamoto, Y., 2002. Structural characterization of non-native states of sperm whale myoglobin in aqueous ethanol or 2,2,2-trifluoroethanol media. Biochim. Biophys. Acta 1601, 75–84. Myers, J.K., Oas, T.G., 2002. Mechanism of fast protein folding. Annu. Rev. Biochem. 71, 783–815. . Nolting, B., 1999a. Analysis of the folding pathway of chymotrypsin inhibitor by correlation of F-values with inter-residue contacts. J. Theor. Biol. 197, 113–121. . Nolting, B., 1999b. Protein Folding Kinetics. Springer, Heidelberg.

. Nolting, B., 2003. Methods in Modern Biophysics. Springer, Heidelberg. . Nolting, B., Andert, K., 2000. Mechanism of protein folding. Proteins 41, 288–298. . Nolting, B., Golbik, R., Fersht, A.R., 1995. Submillisecond events in protein folding. Proc. Natl. Acad. Sci. USA 92, 10668–10672. . Nolting, B., Golbik, R., Neira, J.L., Soler-Gonzalez, A.S., Schreiber, G., Fersht, A.R., 1997. The folding pathway of a protein at high resolution from microseconds to seconds. Proc. Natl. Acad. Sci. USA 94, 826–830. Oliveberg, M., 2001. Characterisation of the transition states for protein folding: towards a new level of mechanistic detail in protein engineering analysis. Curr. Opin. Struct. Biol. 11, 94–100. Shakhnovich, E.I., 1997. Theoretical studies of protein-folding thermodynamics and kinetics. Curr. Opin. Struct. Biol. 7, 29–40. Shakhnovich, E., Abkevich, V., Ptitsyn, O., 1996. Conserved residues and the mechanism of protein folding. Nature 379, 96–98. Sosnick, T.R., Mayne, L., Englander, S.W., 1996. Molecular collapse: the rate-limiting step in two-state cytochrome c folding. Proteins: Struct. Funct. Genet. 24, 413–426. Taulier, N., Chalikian, T.V., 2002. Compressibility of protein transitions. Biochim. Biophys. Acta. 1595, 48–70. Webster, D.M., 2000. Protein Structure Prediction: Methods and Protocols. Humana Press, Towota, NJ.