Genetic algorithms for protein structure prediction
Jan T Pedersen and John Moult
Genetic algorithms are a general class of search methods that mimic natural gene-based optimization mechanisms. Mutation, cross-over and replication operations are performed on strings. When applied to structure prediction, each string describes a particular conformation of a protein molecule. There are many ways in which such search methods may be implemented. Recent results show potential for helping with protein structure prediction, but more data are needed before a complete assessment can be made.
Address Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD 20850, USA
Current Opinion in Structural Biology 1996, 6:227-231 © Current Biology Ltd ISSN 0959-440X
Abbreviations GA genetic algorithm; MC Monte Carlo; MD molecular dynamics
Introduction
Determining the functional conformation of a protein molecule from its amino acid sequence remains a central problem in computational biology. Most recent progress has been in the areas of comparative modeling [1] and fold recognition [2], and has been a consequence of the large number of experimental structures that are now known: in various ways, the database of known structures is used to assist directly in the determination of the conformation of new ones. The classical ab initio protein structure prediction problem (using just the amino acid sequence and some model of the interaction between amino acids to determine structure) is more resistant to our efforts [3]. The problem may be viewed as consisting of two parts: the development of more reliable ways of discriminating between right and wrong structures and the development of methods to search for the functional conformation. This review is concerned with one new development in the treatment of the search problem, the use of genetic algorithms (GAs). We begin with a brief introduction to the search problem, then describe the principles of genetic algorithms, and, finally, review attempts to use them so far.
Searching for the functional conformation of a protein molecule
It might be argued that we have a solution to both the search and discriminatory function problems. Established Cartesian space molecular dynamics (MD) methods, together with an all-atom force field and an explicit solvent description, are believed by some to reliably reproduce the motion of a polypeptide chain as a function of
time [4]. If this is true, then starting from a random structure and generating a long enough trajectory, the functional conformation should be found. 'Long enough' turns out to be too long. Current computational power is sufficient to generate MD trajectories representing about 10⁻⁸ s, whereas folding in vitro typically takes of the order of 1 s, a gap of roughly eight orders of magnitude.
Two ways of speeding up simulations are currently being pursued: using simplified representations of the polypeptide chain and taking larger steps in the search. Large steps have often been made by Monte Carlo (MC) methods in the dihedral angle space rather than the Cartesian space. Large trial changes are made to individual or small groups of torsion angles, and the resulting energy of the system is evaluated. If the energy decreases, the new conformation is accepted. If it goes up, there is usually some probability of accepting the change based on the Metropolis test [5].
Both MD and MC methods may be regarded as attempting to reproduce the true physical sequence of events during folding. Other methods, including genetic algorithms, assume that the functional conformation lies at the global minimum of free energy (the so-called 'thermodynamic' hypothesis [6,7]).
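The torsion-space MC procedure with a Metropolis test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy energy function, the step size and the temperature (expressed in the same units as the energy) are hypothetical and are not taken from any of the cited studies.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Metropolis test: always accept downhill moves, and accept uphill
    moves with probability exp(-delta_e / temperature)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def wrap(angle):
    """Map an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def toy_energy(angles):
    """Hypothetical energy with its minimum at all angles equal to zero.
    A real application would use a force field or potential of mean force."""
    return sum(1.0 - math.cos(a) for a in angles)


def mc_torsion_search(angles, energy, steps, max_step=math.pi,
                      temperature=1.0, seed=0):
    """Monte Carlo search in dihedral angle space: change one randomly
    chosen torsion angle per step and apply the Metropolis test."""
    rng = random.Random(seed)
    current = list(angles)
    e_current = energy(current)
    best, e_best = list(current), e_current
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] = wrap(trial[i] + rng.uniform(-max_step, max_step))
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_current, temperature, rng):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best
```

Calling `mc_torsion_search([3.0, -2.0, 1.0], toy_energy, steps=2000)` returns the lowest-energy set of angles visited together with its energy. Lowering `temperature` makes uphill moves rarer, which is the usual way such a search is annealed.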
Genetic algorithms
Genetic algorithms use the optimization procedures of natural gene-based evolution, that is, mutations, cross-overs and replication operating on strings [8,9,10•]. Mutations may be thought of as operations within a single search trajectory, analogous to those of a traditional MC procedure. Cross-overs provide the means of information exchange between trajectories. Thus, genetic algorithms are members of the class of co-operative search methods [11,12]. In these protocols, a number of searches are run in parallel and information is exchanged between them. It is expected that such information exchange can increase the efficiency by a larger factor than the number of parallel processes and this has been demonstrated for some instances [12]. In addition to offering the hope of increased search effectiveness, co-operative methods are also attractive because they lend themselves to implementation on parallel machine architectures. It is important to remember that there is nothing magic about these methods and their effectiveness is likely to be very problem and implementation dependent.
GAs may be generally described in the following way. An initial population of trial solutions is established, represented by strings. Mutations are introduced independently into each string. In the original formulations, a mutation constitutes changing a single bit in the string describing a solution. Theoretically, there is no reason why more general operators should not be used. After some number of mutations have been performed, new strings are created by cross-over operations: two members of the population
Theory and simulation
are selected, a break point in the strings is chosen, and two new population members are created by joining the left portion of one string to the right portion of the other and vice versa. The operation of creating new strings is repeated until a new population of accepted strings is established and then another phase of mutations is entered. This sequence of steps is repeated until the population converges to essentially a single string. A fitness function may be used to assess the quality of single mutations and new strings formed by cross-overs.
There are many details that must be decided in implementing such a scheme. The ratio of mutations to cross-overs must be optimized. The fitness function may be used to assess all mutations and cross-overs, or just cross-overs. Only changes that increase fitness may be accepted, or some less-fit changes may be allowed, in a manner analogous to the Metropolis test used in MC methods [5]. The selection of positions for mutations may be random or based on some measure of local fitness. Members of the population may be chosen for cross-over trials at random or on the basis of their fitness. Cross-over points in the strings may also be random or based on some criteria of the likelihood of success. Some members of the previous generation may be directly transferred into the new one without cross-over. Problems of premature convergence may be reduced by using subpopulations of strings that only cross over among themselves for an extended period in the simulation. For structure prediction applications in particular, the nature of the strings used to describe the conformation must also be selected. The only theories for the method deal with a very simple framework [9], and in practice the optimum protocol is very problem dependent.
An excellent review of the basic methodology and applications in chemistry to date may be found in [13•].
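The general scheme described in this section (a population of bit strings, a phase of single-bit mutations, then single-point cross-overs until a new population is filled) can be sketched as follows. This is one possible reading, not a prescribed protocol. The acceptance rules are assumptions chosen for simplicity: a mutation is kept if it does not lower fitness, and a child is kept if it is at least as fit as its less fit parent. The fallback to the fitter parent, the parameter values and all function names are likewise hypothetical.

```python
import random


def mutate(bits, rng):
    """Classical mutation: flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


def crossover(a, b, rng):
    """Single-point cross-over: join the left part of one string to the
    right part of the other, and vice versa."""
    k = rng.randrange(1, len(a))
    return a[:k] + b[k:], b[:k] + a[k:]


def run_ga(fitness, n_bits, pop_size=50, generations=100,
           mutations_per_string=2, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Mutation phase: each string is mutated independently.
        for idx, s in enumerate(pop):
            for _ in range(mutations_per_string):
                trial = mutate(s, rng)
                if fitness(trial) >= fitness(s):  # assumed acceptance rule
                    s = trial
            pop[idx] = s
        # Cross-over phase: build a new population of accepted strings.
        new_pop = []
        while len(new_pop) < pop_size:
            a, b = rng.sample(pop, 2)
            floor = min(fitness(a), fitness(b))
            children = [c for c in crossover(a, b, rng)
                        if fitness(c) >= floor]
            # Assumed fallback: keep the fitter parent if both children fail.
            new_pop.extend(children or [max(a, b, key=fitness)])
        pop = new_pop[:pop_size]
        # Stop once the population has converged to a single string.
        if all(s == pop[0] for s in pop):
            break
    return max(pop, key=fitness)
```

With the toy fitness `sum` (the number of 1 bits), `run_ga(sum, n_bits=20)` searches for the all-ones string. Replacing the acceptance rules with a Metropolis-style test, or choosing parents in proportion to fitness, are the kinds of implementation decisions listed above.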
GAs for protein structure prediction
A number of studies of the use of GAs for protein structure prediction have been made in the last five years [14-18,19•,20,21,22•,23,24•,25], as well as in other related structure optimization areas [26-32]. Even so, because of the wide range of ways in which the method may be applied and the difficulty of distinguishing search properties from the effectiveness of discriminatory functions, progress towards establishing the usefulness of GAs has been slow. In the following literature survey, we focus on those papers that, in our view, contribute towards answering this question.
Unger and Moult [14,15] compared the effectiveness of MC and GA searches for finding the global minimum energy on a simple two-dimensional lattice protein model of the sort developed by Lau and Dill [33]. Two types of residue, hydrophobic and hydrophilic, are used, with an energy function that scores -1 for each pair of non-bonded hydrophobic neighbors. Chain lengths were between 20 and 64 residues. Three types of MC methods
were included, in an attempt to provide a fair basis for comparison. A population size of 200 was used in the GA. The string of bond angles along the chain was used for describing a conformation, and MC steps and mutations were randomly chosen changes to a randomly selected bond angle. Cross-over sites were also selected randomly. Under these conditions, the GA is very significantly more effective than the MC search. For the shortest sequences, both methods find the global minimum, but the GA requires one or two orders of magnitude fewer energy evaluations. For the longer sequences, the MC method did not find the global minimum in the available computer time. In all but one intentionally difficult case, the GA was successful. The probable reason for the better performance of the GA is its ability to find accepted moves once the chain has adopted a compact conformation. Although it is a useful demonstration of the potential advantages of a GA for structure prediction, this model is so simple that it leaves open the question of its applicability to real proteins.
In the first attempt to apply GAs to reproducing the tertiary structure of real proteins, Sun [16] used a description of a protein molecule that consisted of a full backbone and one virtual atom per side chain. A potential of mean force derived from known protein structures was used to assess fitness. A library of peptide fragment conformations 2-5 residues long was used to construct initial conformations and to perform mutational changes. The library was constructed from known protein structures. An additional constraint was the experimental radius of gyration. A population size of 90 was used. Low final root mean square deviations from the experimental structures are reported. The significance of the results is hard to assess.
Fragments were selected from the library on the basis of sequence similarity and the library contains the two larger structures that were reproduced. In spite of arguments to the contrary in the paper, it appears that selection on the basis of sequence similarity must introduce a strong bias towards the experimental structure, particularly for the longer fragments. In view of this uncertainty, it is a pity that no subsequent papers using the method have appeared. Nevertheless, the ideas discussed are interesting and worthy of further development.
Using a method somewhat related to that of Sun [16], and with the aid of their profile method [34], Bowie and Eisenberg [17] constructed initial conformations of a small protein. Nine-residue segments were selected from a library of fragment conformations on the basis of the environment codes. A similar procedure was used for some larger fragments 15-25 residues long. Care was taken to exclude homologous structures from the database. The method of selecting initial conformations did enhance the local structure accuracy to a value higher than that expected by chance alone. Structures were then improved by a GA procedure in which each gene is the set of dihedral angles of a structure and mutations are changes
to one angle. For recombination, segments of one gene were replaced with segments of another. Mutations and cross-overs had a high probability of occurring at the fragment junctions. The fitness was evaluated with a function containing contributions from the profile fit, hydrophobicity, accessible surface area, atomic overlap and the sphericalness of the structure. The weighting of the terms in the potential was strongly biased by the experimental structure. Under these conditions, some native-like structures were efficiently generated, along with competing low-energy incorrect structures.
A number of methods have been developed which are based on the assumption of knowledge of the secondary structure of a protein [18,19•,20,21,23,24•,25]. Dandekar and Argos [18,19•,20] have developed a GA for predicting tertiary structure that stays close to the classical paradigm of a gene described by a bit string with mutations consisting of single bit changes. Each residue has seven possible conformations [35], encoded by three bits, so that a gene describing a conformation is 3 × N bits long, where N is the number of residues. In each generation, a subset of genes deemed to be of high quality is selected and each of these is changed at one randomly chosen bit position (0→1 or 1→0). A new set of genes is then constructed by randomly choosing pairs of genes and cross-over points. Gene quality is assessed using an ad hoc fitness function that combines agreement with experimental or predicted secondary structure, absence of atomic clashes, scatter around the center of mass, hydrophobic contacts and secondary structure terms (e.g. hydrogen bonds and strand persistence). The function was parameterized on a set of four helix bundle proteins and on one of the β-structure proteins reproduced.
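The three-bits-per-residue encoding and single-bit mutation just described can be illustrated with a short sketch. It is not the authors' code. Three bits give eight codes for only seven conformational states, and the text does not say how the spare code is treated, so wrapping code 7 back to state 0 is an assumption made here. The states are represented only by their index (0 to 6) rather than by specific backbone angles, and all names are hypothetical.

```python
import random

N_STATES = 7            # seven conformational states per residue
BITS_PER_RESIDUE = 3    # so a gene for N residues has 3 * N bits


def decode(gene):
    """Translate a bit-string gene into one state index (0..6) per residue."""
    if len(gene) % BITS_PER_RESIDUE != 0:
        raise ValueError("gene length must be a multiple of 3")
    states = []
    for i in range(0, len(gene), BITS_PER_RESIDUE):
        code = gene[i] * 4 + gene[i + 1] * 2 + gene[i + 2]
        # Assumption: the unused eighth code (7) wraps around to state 0.
        states.append(code % N_STATES)
    return states


def flip_random_bit(gene, rng):
    """Single-bit mutation (0 -> 1 or 1 -> 0) at a random position."""
    i = rng.randrange(len(gene))
    mutated = list(gene)
    mutated[i] ^= 1
    return mutated
```

One consequence of such an encoding is that a single bit flip always changes the conformation of exactly one residue, but the size of the conformational change depends on how the states are ordered in the code table.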
The method generally relies on the correct pre-assignment of secondary structure for success and for some of the examples there is considerable bias in the potential towards the experimental structure.
Sun et al. [21] used a method similar to that of Dandekar and Argos, in that all experimental secondary structure is explicitly introduced. The chain was described by a full backbone, with one virtual atom per side chain. Initial conformations were completed by selecting φ,ψ angles for the non-secondary structure residues from a library of observed mono and dipeptide conformations by residue type. Mutation steps consisted of changing the conformation of a single residue to a new library value or making small random increments of up to 5°. The number of mutation operations per generation was decreased exponentially during the search. Mutations and cross-overs were carried out at non-secondary structure positions. A population size of 200 was used, and 100 independent GA runs were completed for each protein. A very simple fitness function, measuring hydrophobic contacts, hydrogen bonds and steric overlap, was used. The lowest energy conformations obtained for several of the seven proteins considered have low root mean square deviations to the experimental structures. For the two
worst results, no structures with energies comparable to those of the experimental ones were found, indicating limitations in the search procedure.
We have used a GA to predict the structure of small fragments (12-22 residues long) of proteins in a blind test [22•]. The procedure used was an extension of an earlier torsion space MC method [36]. Only fragments that are expected to have their conformation determined independently of the rest of the structure were selected [37]. A full heavy atom and polar hydrogen representation of the chain is used. Conformations with excessive steric overlap were rejected. Fitness was evaluated using a potential that was based on point-charge electrostatics and accessible surface area. Terms in the force field were parameterized with a potential of mean-force analysis of experimental structures. A gene is a string of φ, ψ and χ angles representing a conformation. No mutation steps were used. Cross-over points were weighted towards positions where the conformation was most varied in the current population. Extensive annealing of side-chain conformations was performed at cross-over points before evaluating the fitness of the new gene. The population size was 200-300, and 40-50 generations were performed. Parameters of the search were optimized systematically on a set of fragments with known structure. Searches were run in parallel on groups of workstations or on a parallel-architecture machine. One of the three blind predictions did produce a native-like structure for a 22-residue fragment. Experience with this procedure shows it to be substantially more effective than the MC procedure at generating low-energy structures [36]. At present, it is limited to relatively small protein fragments.
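The idea of weighting cross-over points towards positions where the population is most varied can be sketched as below. The measure of variation is not specified in the text above, so the circular variance of each angle across the population is used here as an assumption. The small `floor` added to every weight, which keeps the weights from summing to zero in a fully converged population, and all function names are also hypothetical.

```python
import math
import random


def circular_variance(angles):
    """1 minus the mean resultant length of the angles (in radians):
    0 when all angles agree, close to 1 when they are widely spread."""
    n = len(angles)
    c = sum(math.cos(a) for a in angles) / n
    s = sum(math.sin(a) for a in angles) / n
    return 1.0 - math.hypot(c, s)


def pick_crossover_point(population, rng, floor=1e-6):
    """Choose a break point k (1 <= k < gene length), weighted by how much
    the angle at position k varies across the current population."""
    n_pos = len(population[0])
    points = list(range(1, n_pos))
    weights = [circular_variance([gene[k] for gene in population]) + floor
               for k in points]
    return rng.choices(points, weights=weights, k=1)[0]


def crossover_at(a, b, k):
    """Exchange the tails of two angle strings at break point k."""
    return a[:k] + b[k:], b[:k] + a[k:]
```

In a population where only one torsion angle still differs between genes, almost every cross-over chosen this way starts at that angle, so the search effort is concentrated where the population has not yet converged.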
Other structure prediction related uses
Several studies have been reported in which GA methods are used for less ambitious purposes than folding complete proteins. GAs have been applied to the problem of finding the correct set of side-chain rotamers for a protein, given the experimental backbone conformation [30,31]. Here, each gene is the string of side-chain rotamer angles for a complete protein. The half of the current gene pool with the lowest energy is carried forward to the next generation. The other half of the population is replaced by mutation and cross-over operations. Mutations are changes of the angle of a single rotamer and the mutation rate decreases exponentially during the run. Cross-overs are either transfer of a randomly chosen contiguous region of rotamer values between genes, or transfer of a selected subset from the whole protein. In practice, this method was found to be effective, but not as efficient as an alternative population-based procedure. It should be noted that there are a number of reports in the literature of relatively successful solutions of the rotamer selection problem, given an exact backbone conformation. In objective testing, however, side chains have not so far been built reliably [1]. A recent study [38] shows that this is likely to be due to the rapid deterioration in the
performance of the methods with increasing main-chain inaccuracy. There is no reason to think that the use of a GA can address this problem.
A GA has also been used to construct initial conformations for loop regions in proteins, using four possible conformations for each residue [32]. In the example provided (an eight-residue loop) a population of 30 initial genes consisting of bit strings describing trial conformations was used. Mutations consisted of single bit changes. Convergence is complete after eight generations. Again, the methodology used is interesting, but it is not clear how much the GA contributes to the performance of the complete procedure.
Finally, in the closely related area of protein design, GAs have also been employed to optimize the sequence to fit a given fold [18,39].
Conclusions and future prospects
None of the full protein structure prediction methods reported to date can be said to be free of experimental bias. Dandekar and Argos [18,19•,20] usually relied on experimental secondary structure for success and Sun et al. [21] explicitly did. The potentials of Dandekar and Argos and those of Bowie and Eisenberg [17] were parameterized to favor the experimental structures. The choice of fragments in the Sun method [16] appeared to favor direct selection of experimental conformations. The method of Pedersen and Moult [22•] has only been effective on relatively small peptides. So far, there have been few direct comparisons with other methods [14,40], or analysis of performance as a function of implementation. Thus, although it is clear that GAs are a useful and sometimes elegant tool for structure prediction studies, it is too early to say how effective they will eventually be.
Three advantages have, however, been established to date. Firstly, GAs are easier to run in parallel than single trajectory search procedures, and therefore allow groups of processors to be utilized for a search. Secondly, GAs appear to be more efficient at finding acceptable conformations in the condensed phase than other semi-random move methods such as MC. In general, it is hard to find acceptable moves at a late stage of such a search because of steric clashes. Finally, the 'mix and match' properties of GAs offer a useful means of testing the different modes of association of substructures, for example, secondary structure units.
More sophisticated versions of the GA, for example, versions that use subpopulations, have not yet been explored. Other sorts of co-operative methods, in which greater intelligence is used in the choice of fragments to combine, offer hope for the future. For example, it should be possible to examine substructures present in each gene to see if they are likely to be correct. The selection of segments for cross-over can then be weighted accordingly.
Acknowledgement
This work was supported by grant no. DOC/NIST 60NANB4D1594 from the National Institute of Standards and Technology.
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
1. Mosimann S, Meleshko R, James MNG: A critical assessment of comparative molecular modeling of tertiary structures of proteins. Proteins 1995, 23:301-317.
2. Lemer CMR, Rooman MJ, Wodak SJ: Protein structure prediction by threading methods: evaluation of current techniques. Proteins 1995, 23:337-355.
3. Defay T, Cohen FE: Evaluation of current techniques for ab initio protein structure prediction. Proteins 1995, 23:431-445.
4. Storch EM, Daggett V: Molecular dynamics simulation of cytochrome b5: implications for protein-protein recognition. Biochemistry 1995, 34:9682-9693.
5. Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E: Equation of state calculations by fast computing machines. J Chem Phys 1953, 21:1087-1091.
6. Anfinsen CB: Principles that govern the folding of protein chains. Science 1973, 181:223-230.
7. Privalov PL: Stability of proteins: small globular proteins. Adv Protein Chem 1979, 33:167-236.
8. Holland JH: Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press; 1975.
9. Goldberg D: Genetic Algorithms in Search, Optimization, and Machine Learning. San Mateo: Addison-Wesley; 1989.
10. • Fogel DB: Evolutionary Computation: Towards a New Philosophy of Machine Intelligence. New York: IEEE Press; 1995.
An exhaustive account of the current status of evolutionary computation, which describes the basic properties and methods for analyzing evolutionary algorithms.
11. Huberman BA: The performance of cooperative processes. Physica D 1990, 42:38-47.
12. Clearwater SH, Huberman BA, Hogg T: Cooperative solution of constraint satisfaction problems. Science 1991, 254:1181-1183.
13. • Judson RS: Genetic algorithms and their use in chemistry. Rev Comp Chem 1996, 20: in press.
An excellent review of the current use of genetic algorithms in chemistry problems.
14. Unger R, Moult J: Genetic algorithms for protein folding simulations. J Mol Biol 1993, 231:75-81.
15. Unger R, Moult J: Effect of Mutations on the Performance of Genetic Algorithms Suitable for Protein Folding Simulations, vol 2. North Holland: Elsevier Science Publishers BV; 1993.
16. Sun S: Reduced representation of protein structure prediction: statistical potential and genetic algorithms. Protein Sci 1993, 2:762-785.
17. Bowie JU, Eisenberg D: An evolutionary approach to folding small α-helical proteins that uses sequence information and an empirical guiding fitness function. Proc Natl Acad Sci USA 1994, 91:4436-4440.
18. Dandekar T, Argos P: Potential of genetic algorithms in protein folding and protein engineering simulations. Protein Eng 1992, 5:637-645.
19. • Dandekar T, Argos P: Folding the main-chain of small proteins with the genetic algorithm. J Mol Biol 1994, 236:844-861.
A bit-string encoding of a simplified protein chain representation is used together with a phenomenological potential to fold a number of helical proteins. All simulations are seeded with secondary structure prediction.
20. Dandekar T, Argos P: Identifying the tertiary fold of small proteins with different topologies from sequence and secondary structure using the genetic algorithm and extended criteria specific for strand regions. J Mol Biol 1996, in press.
21. Sun S, Thomas PD, Dill KA: Simple protein folding algorithm using a binary code and secondary structure constraints. Protein Eng 1995, 8:769-778.
22. • Pedersen JT, Moult J: Ab initio structure prediction for small polypeptides and protein fragments using genetic algorithms. Proteins 1995, 23:454-460.
A true blind test of a full atom representation torsion based genetic algorithm. A search algorithm was tested on two small protein fragments and a larger (22 residue) membrane associated peptide.
23. Le-Grand SM, Merz KM Jr: The Protein Folding Problem and Tertiary Structure Prediction: The Genetic Algorithm and Protein Tertiary Structure Prediction. Boston: Birkhauser; 1994:109-124.
24. • Le-Grand SM, Merz KM Jr: The genetic algorithm and the conformational search of polypeptides and proteins. Mol Simulat 1994, 13:299-320.
A full atom representation of the protein together with a rotamer library is used; the AMBER potential is used to evaluate the fitness of the conformations. The method is tested on several small peptides and on the 46-residue protein crambin.
25. Gunn JR, Monge A, Friesner RA, Marshall CH: Hierarchical algorithm for computer modeling of protein tertiary structure: folding of myoglobin to 6.2 Å resolution. J Phys Chem 1994, 98:702-711.
26. Judson RS, Jaeger EP, Treasurywala AM, Peterson MA: Conformational searching methods for small molecules. II. Genetic algorithm approach. J Comput Chem 1993, 14:1407-1414.
27. McGarrah DB, Judson RS: Analysis of the genetic algorithm method of molecular conformation determination. J Comput Chem 1993, 14:1385-1395.
28. Herrmann F, Suhai S: Minimization of peptide analogues using genetic algorithms. J Comput Chem 1996, in press.
29. Herrmann F, Suhai S: Genetic algorithms in protein structure prediction. In Computational Methods in Genome Research. Edited by Suhai S. New York: Plenum Press; 1994:173-190.
30. Tuffery P, Etchebest C, Hazout S, Lavery R: A new approach to the rapid determination of protein side-chain conformations. J Biomol Struct Dyn 1991, 8:1267-1289.
31. Tuffery P, Etchebest C, Hazout S, Lavery R: A critical comparison of search algorithms applied to the optimization of protein sidechain conformations. J Comput Chem 1993, 14:790-798.
32. Ring CS, Cohen FE: Conformational sampling of loop structures using genetic algorithms. Isr J Chem 1994, 34:245-252.
33. Lau KF, Dill KA: Theory for protein mutability and biogenesis. Proc Natl Acad Sci USA 1990, 87:638-642.
34. Bowie JU, Luthy R, Eisenberg D: A method to identify protein sequences that fold into a known three-dimensional structure. Science 1991, 253:164-170.
35. Rooman MJ, Kocher J-PA, Wodak SJ: Prediction of protein backbone conformation based on seven structural assignments. J Mol Biol 1991, 221:961-979.
36. Avbelj F, Moult J: Determination of the conformation of folding initiation sites in proteins by computer simulation. Proteins 1995, 23:129-141.
37. Moult J, Unger R: An analysis of protein folding pathways. Biochemistry 1991, 30:3816-3824.
38. Chung SY, Subbiah S: How similar must a template structure be for homology modeling by sidechain packing methods? Pacific Biotechnol Symp 1996, in press.
39. Jones DT: De novo protein design using pairwise potentials and a genetic algorithm. Protein Sci 1994, 3:567-574.
40. Meza JC, Judson RS, Faulkner TR, Treasurywala AM: A comparison of a direct search method and a genetic algorithm for conformational searching. J Comput Chem 1996, in press.