[12] Comparative modeling of homologous proteins


By JONATHAN GREER

Introduction

The discovery that proteins occur in homologous families provided the original impetus for attempting to model the three-dimensional structure of one member of the family from the known experimental structure of another.1 This methodology has since been used many times in a wide variety of systems to model new protein structures for the purpose of studying protein functional properties,2 performing biochemical 3 or mutagenesis experiments,4 or designing new ligands or inhibitors of biological function.5,6 Over the past decade, we have developed a methodology for the modeling of "new" protein structures from one or more homologous known structures.7-10 Unfortunately, such modeling by analogy is never entirely accurate.11-13 Therefore, it is important to develop criteria that can be used to assign confidence levels to the various parts of the structure. In particular, it is important to identify parts of the structure that are likely to deviate significantly from the usual family structural rubric. We describe here the methods that we have developed to perform the modeling and to assign reliability levels to the respective portions of the molecule.

1 W. J. Browne, A. C. T. North, D. C. Phillips, K. Brew, T. C. Vanaman, and R. L. Hill, J. Mol. Biol. 42, 65 (1969).
2 J. W. Lustbader, J. P. Arcoleo, S. Birken, and J. Greer, J. Biol. Chem. 258, 1227 (1983).
3 J. P. Arcoleo and J. Greer, J. Biol. Chem. 257, 10063 (1982).
4 K. Mollison, W. Mandecki, E. R. P. Zuiderweg, L. Fayer, T. A. Fey, R. Krause, R. G. Conway, L. Miller, R. P. Edalji, M. A. Shallcross, B. Lane, J. L. Fox, J. Greer, and G. W. Carter, Proc. Natl. Acad. Sci. U.S.A. 86, 292 (1989).
5 J. Greer, J. Mol. Biol. 153, 1043 (1981).
6 H. L. Sham, G. Bolis, H. H. Stein, S. W. Fesik, P. A. Marcotte, J. J. Plattner, C. A. Rempel, and J. Greer, J. Med. Chem. 31, 284 (1988).
7 J. Greer, Proc. Natl. Acad. Sci. U.S.A. 77, 3393 (1980).
8 J. Greer, J. Mol. Biol. 153, 1027 (1981).
9 J. Greer, Ann. N.Y. Acad. Sci. 439, 44 (1985).
10 J. Greer, Proteins 7, 317 (1990).
11 L. T. Delbaere, G. D. Brayer, and M. N. James, Nature (London) 279, 165 (1979).
12 R. J. Read, G. D. Brayer, L. Jurasek, and M. N. G. James, Biochemistry 23, 6570 (1984).
13 E. R. Zuiderweg, J. Henkin, K. W. Mollison, G. W. Carter, and J. Greer, Proteins 3, 139 (1988).

METHODS IN ENZYMOLOGY, VOL. 202

Copyright © 1991 by Academic Press, Inc. All rights of reproduction in any form reserved.

PROTEINS AND PEPTIDES: PRINCIPLES AND METHODS

Comparative Modeling Algorithm

The basic steps of the modeling method are summarized in Fig. 1. To perform comparative modeling, at least one experimental structure for a member of the homologous family is essential, together with the amino acid sequence of each protein whose structure is known. As will become evident, the more structures and amino acid sequences that are available, the more accurate the modeling of the structure of a "new" protein of interest is likely to be. Finally, the amino acid sequence of the new protein to be modeled is absolutely essential.

Alignment of Known Structures and Sequences

When more than one experimental structure is known, the first step is to superimpose these structures in three dimensions (Fig. 2a 14-20). This superposition is not straightforward. The purpose is to overlap the molecules so that the common features of the structure, called structurally conserved regions (SCRs), coincide (Fig. 2b). Several programs have been developed to perform this function analytically.21,22 When the proteins are very similar, as in the mammalian serine proteases 8,10 or the aspartic proteinases,6 aligning several of the critical active-site or known conserved residues may be adequate. Another alternative that can be effective, though not objective, is manual overlap using a computer graphics system.

Once the structures are properly superimposed, the overlapped structures (Fig. 2b) can be divided into SCRs, where all the known structures have the same conformation, and structurally variable regions (VRs), where each of the molecules has a different conformation. The SCRs can be defined as stretches of the main chain where the corresponding α carbons of the different known structures overlap within a particular tolerance, usually between 0.5 and 1 Å. Only stretches of residues are selected, not single α carbons. The VRs always lie on the surface of the protein structure and form the external loops where the main chain turns.

14 J. Moult and M. N. G. James, Proteins 1, 146 (1986).
15 S. Burt and J. Greer, Annu. Rep. Med. Chem. 23, 285 (1988).
16 R. E. Bruccoleri and M. Karplus, Biopolymers 26, 137 (1987).
17 R. E. Bruccoleri and M. Karplus, Macromolecules 18, 2767 (1987).
18 P. S. Shenkin, D. L. Yarmush, R. M. Fine, H. Wang, and C. Levinthal, Biopolymers 26, 2053 (1987).
19 A. T. Hagler, P. S. Stern, R. Sharon, J. M. Becker, and F. Naider, J. Am. Chem. Soc. 101, 6842 (1979).
20 P. Dauber, D. Osguthorpe, and A. T. Hagler, Biochem. Soc. Trans. 10, 312 (1982).
21 M. G. Rossmann and P. Argos, J. Mol. Biol. 109, 99 (1977).
22 S. J. Remington and B. W. Matthews, J. Mol. Biol. 140, 77 (1979).

[Figure 1: flow chart; boxes in order]
Known experimental structures and their sequences
    (1 structure) Assign SCRs based upon secondary structure and active site
    (>1 structure) Superimpose known structures in 3D; assign the SCRs
Add other known sequences
Align the known sequences based upon the structures
Identify characteristic homologous sequence patterns
Align "new" sequence to known sequences
Assemble structure for SCRs from known structures; mutate side chains to "new" sequence
Select best known "spare part" for each VR from known structures; assemble main chain and mutate side chains to "new" sequence
Construct missing VRs using Protein Database search; assemble main chain and mutate side chains
Check for errors, buried charges, etc.
Correct side chain overlap using χ angles
Energy minimize structure as necessary

FIG. 1. Flow chart representation of the comparative modeling method described in the text.

[Figure 2: schematic drawings, panels (a)-(e)]

FIG. 2. Schematic representation of the comparative modeling method. (a) There are three proteins in this homologous family with known structures: "A," "B," and "C." Each protein is represented with a characteristic dashed or dotted pattern throughout the figure. (b) The three proteins are superimposed, showing that parts of the structure are conserved (SCRs, bold lines) and parts are variable from one protein to the next (VRs, respective dashed and dotted lines based on the source of the VR). (c)-(e) Steps in the construction of the schematic "new" model structure. (c) The SCRs (bold lines) are constructed from the main chain coordinates of any one of the known structures since they are similar. The side chains are mutated to the new sequence as necessary. (d) The various VR conformations found in the known structures are considered for each VR of the new protein [see VRs in (a) and (b)]. The ones that do not fit are rejected (shown by crossed arrows). The most suitable one is selected in each case. In some cases, conformational search 14-18 or energetics 19,20 methods must be employed since no suitable conformation can be found for that VR among the known structures (see VR in upper left corner with the sequence C-F-N-L-Q for an example). (e) The composite structure shows the source of the respective "spare parts" selected for the model structure.

[Figure 3: sequence alignment; position numbers along the top, one row per sequence including the NEW sequence, and the conserved pattern line at the bottom]

FIG. 3. Sequence alignment for the set of proteins in the schematic homologous family corresponding to Fig. 2. The alignment is performed solely based on the overlap of the three-dimensional structures and not on sequence alignment methods. The boxes delineate the SCRs as determined from overlap of the three-dimensional structures (Fig. 2b). The sequence of the "new" protein is aligned based on the characteristic patterns of sequence homology (bottom line) found for the known structures and their sequences (see text). The IUPAC-IUB convention standard single-letter amino acid code is as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; Y, Tyr. Positions of relative deletions in the sequences are denoted by a dash. The bottom line lists the conserved, characteristic sequence patterns used to align the new sequences (see text). They are coded as follows: uppercase, almost completely conserved side chain; lowercase, high frequency of this amino acid at this position; ±, a charged residue, including K, R, D, E, and H; o, S or T (may be substituted by A occasionally).

The next step is to align the amino acid sequences for the known structures (Fig. 3). These are aligned solely on the basis of the previously described superposition of the three-dimensional structures. Standard sequence homology methods 23 are not used. Instead, wherever the α carbons of the respective proteins overlap in three-dimensional space, the sequence is aligned. In this way, all the sequence corresponding to the SCRs is aligned. The sequence in the VRs is more difficult to align at this time. If obvious sequence similarity is found between proteins within the VR, then this is used for the alignment. If subsets of the known structures have the same conformation in a particular VR, those sequences are also aligned (see, e.g., the sequences for proteins "A" and "C" in the VR at positions 3-6). Otherwise, the sequence alignment in the VRs is arbitrary.

23 M. S. Waterman, this series, Vol. 164, p. 765.

In general, each of the SCRs will display a characteristic sequence similarity pattern that is related to the fold of that portion of the structure. Once the sequences have been aligned, they must be examined carefully to identify the characteristic sequence homology patterns for each of the SCRs. In many cases, the conserved pattern will be obvious. For example, in the second SCR of Fig. 3, the pattern is clearly L-S/T-V-±-I-±, where ± stands for a charged residue. Similarly, in the third SCR, the pattern is G-I-A. Sometimes, though, the pattern may be difficult to discern, such as in the first and final SCRs of Fig. 3. This may occur because the SCR is external, with most of the side chains pointing out into solvent and thus free to vary without major influence on the structure of the protein. In these cases, the sequence together with the structure must be examined very carefully to recognize the pattern. In a very few cases, the pattern may be so subtle that it is impossible to identify.

The above superposition of structures and alignment of their sequences needs to be performed just once for a set of known structures. As a new experimental structure becomes available, it may be superimposed on the other known structures and its sequence aligned to theirs.

Sometimes, only one known structure of the homologous family is available for the modeling (Fig. 1). The first challenge in this case is to identify the SCRs. The great advantage of having more than one known structure is that it permits an objective definition of the structurally conserved regions that are representative of the homology family. With only one known structure, the SCRs cannot be rigorously defined.
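Where several superimposed structures are available, the tolerance-based definition of SCRs given earlier in this section (corresponding α carbons overlapping within 0.5 to 1 Å, stretches only) can be sketched as follows. This is a minimal illustration in Python, not the program used by the author. It assumes the structures are already superimposed and already in residue-by-residue correspondence, which ignores relative insertions and deletions. The function name `find_scrs` and the `min_length` cutoff of 3 residues are hypothetical choices made for the example.

```python
from math import dist  # available in Python 3.8+


def find_scrs(structures, tolerance=1.0, min_length=3):
    """Return (start, end) index pairs (inclusive) of conserved stretches.

    structures: list of equal-length lists of (x, y, z) alpha-carbon
    coordinates, assumed superimposed and in one-to-one correspondence.
    tolerance: maximum allowed distance (in angstroms) between
    corresponding alpha carbons of any two structures.
    min_length: shortest run accepted as an SCR (an assumed value).
    """
    n_res = len(structures[0])
    conserved = []
    for i in range(n_res):
        points = [s[i] for s in structures]
        # A position is conserved when every pair of structures agrees
        # within the tolerance at this alpha carbon.
        ok = all(
            dist(p, q) <= tolerance
            for a, p in enumerate(points)
            for q in points[a + 1:]
        )
        conserved.append(ok)

    scrs, start = [], None
    # The trailing False is a sentinel that closes a run ending at the
    # last residue.
    for i, ok in enumerate(conserved + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_length:  # keep stretches, not single atoms
                scrs.append((start, i - 1))
            start = None
    return scrs
```

Positions that fall outside every returned stretch correspond to the VRs. Requiring agreement between every pair of structures is the strictest reading of "all the known structures have the same conformation"; a looser criterion, such as distance to a mean position, would be an equally defensible assumption.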
Experience 6,8,10,24 has taught us that the SCRs usually correspond to the secondary structure elements of the protein. In our studies of anaphylatoxins,13,25 the SCRs were the α helices. In the mammalian serine proteases,8,10 the SCRs coincide with the β strands. Consequently, when faced with selecting the SCRs in a single known structure, we assign them to the secondary structure elements.24

24 J. Greer, Science 228, 1055 (1985).
25 J. Greer, Enzyme 36, 150 (1986).

When only one experimental structure is known, there are no sequences to align as was shown in Fig. 3. One can display the lone sequence of the known structure and delineate the SCRs on the sequence. However, a major problem exists in defining the characteristic sequence patterns for each SCR to permit alignment of the "new" sequence. This may be done by using other known sequences for this family, even though their structures have not been determined, and aligning them using standard sequence homology methods.23 However, at no time do we permit additions or deletions within the defined SCRs. This frequently requires some changes from the sequence alignments produced by the sequence homology methods.8,24 Alternatively, the new sequence itself may have to be aligned based on the sequence homology methods, but again with the proviso that no additions or deletions are permitted within the SCRs.
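The rule that no additions or deletions are permitted within an SCR means that a characteristic pattern is placed on a sequence by gapless sliding. The Python sketch below illustrates this with the symbols of the Fig. 3 legend: an uppercase letter is a conserved residue, a lowercase letter is a frequent but not required residue, ± is a charged residue (K, R, D, E, H), and o is S or T, occasionally A. The function names and the all-or-nothing matching rule are assumptions made for illustration; the text describes the author's alignment as done by careful inspection, not by this procedure. The test sequence is assembled from fragments of the "new" sequence quoted in the text.

```python
CHARGED = set("KRDEH")   # the "±" class of the Fig. 3 legend
HYDROXYL = set("STA")    # the "o" class: S or T, occasionally A


def position_matches(symbol, residue):
    """True if one residue satisfies one pattern symbol."""
    if symbol == "±":
        return residue in CHARGED
    if symbol == "o":
        return residue in HYDROXYL
    if symbol.isupper():
        return residue == symbol   # almost completely conserved side chain
    if symbol.islower():
        return True                # high frequency only, so not required
    raise ValueError(f"unknown pattern symbol: {symbol!r}")


def find_pattern(pattern, sequence):
    """Slide a dash-separated pattern along the sequence without gaps.

    Returns the 0-based offsets at which every pattern position matches.
    """
    symbols = pattern.split("-")
    hits = []
    for offset in range(len(sequence) - len(symbols) + 1):
        window = sequence[offset:offset + len(symbols)]
        if all(position_matches(s, r) for s, r in zip(symbols, window)):
            hits.append(offset)
    return hits


# Fragments of the "new" sequence quoted in the text, joined end to end:
# C-F-N-L-Q, L-T-V-K-I-E, A-W-D-S-L, G-I-A.
NEW_FRAGMENT = "CFNLQLTVKIEAWDSLGIA"
print(find_pattern("L-o-V-±-I-±", NEW_FRAGMENT))
print(find_pattern("G-I-A", NEW_FRAGMENT))
```

A pattern that yields several offsets, or none, corresponds to the ambiguous cases discussed in the text, where alternative alignments may have to be carried forward into the model.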

Construction of "New" Structure

We are now ready to model the structure of the "new" protein of interest. The first step is to align the new sequence to those of the known structures using the characteristic sequence similarity patterns defined in the previous section. The appropriate part of each sequence is aligned to the respective SCR using the characteristic pattern (Fig. 3). Thus, the new sequence L-T-V-K-I-E fits the pattern L-o-V-±-I-± of the second SCR comprising positions 7-12. Similarly, the sequence G-I-A is trivially aligned with the sequence of the third SCR at residues 16-18. The last SCR is aligned on the basis of its P (proline). Sometimes, though, there is no clear characteristic sequence pattern for the SCR. This is true for the first SCR in Fig. 3. Consequently, the sequence alignment for such an SCR may be ambiguous, and alternative possible alignments may have to be considered and propagated into the model construction. As each part of the new sequence is aligned to an SCR, the remaining positions in that SCR are filled with the new sequence without permitting any additions or deletions within the SCRs.

The new residues that are left correspond to the VRs. For each VR, the new sequence is compared to those of the known structures to see if one of these has the same residue length and a similar pattern of residue types so that it might serve as a model for constructing this loop. Thus, the VR at positions 13-15 in the new sequence contains 5 residues. Of the known structures, only protein "B" has 5 residues (Fig. 3 and Table I). In addition, the residues of this loop in "B," A-W-N-T-M, fit the new sequence, A-W-D-S-L, very nicely, with the tryptophan at position 14 and residues of similar character at each of the sites. Consequently, it is quite likely that the VR in the new sequence will fold the same way as this VR in "B." Similarly, the VR at positions 19 to 20 has 2 residues and can probably be modeled after the VRs in either protein "A" or "B" (Fig. 3 and Table I).

TABLE I
MODEL STRUCTURES FOR VARIABLE REGIONS OF "NEW" SEQUENCE

Fragment     No. of residues     Known          Confidence    Comments
             in new sequence     structure      level
SCR
  1-2              2             Any            Low           No sequence pattern
  7-12             6             Any            High
  16-18            3             Any            High
  21-22            2             Any            High
VR loop
  -1-0             2             "C"            Medium        Good model
  3-6              5             None           Low           No model
  13-15            5             "B"            Medium        Good model
  19-20            2             "A" or "B"     Medium        Good model

After the new sequence has been aligned, as shown in Fig. 3, construction of the new three-dimensional model coordinates can begin. The main chain coordinates of any of the known structures can be taken for the SCRs and the side chains mutated in the computer to the sequence of the new protein (Fig. 2c). A number of different methods can be used to determine the side chain torsion angles, χi, for the mutated residues. In our program, we use the χ1 value of the side chain from the known structure for the new side chain when applicable. We do not employ the angles beyond χ1 because the side chains usually differ too much after the Cγ for this to be useful. We have tried automatic torsional angle scanning based on lowest energy but find that this does not work well. There are frequently alternative conformations that are close in energy, and such scanning routines will select an unsuitable but lowest energy conformation based on energy differences that are minute and insignificant. The recent compilation of side-chain rotamer libraries 26 suggests a new approach in which the limited possible conformations of a side chain may be introduced systematically in an automated procedure that selects the best of the possible conformations for the particular site in the model structure.

26 J. W. Ponder and F. M. Richards, J. Mol. Biol. 193, 775 (1987).

The VRs must be constructed next (Table I). In each case, an appropriate structural fragment is identified or built, the main chain coordinates are inserted in the structure, and the side chains are mutated to fit the new sequence. These steps are illustrated graphically in Fig. 2d,e. Construction of the VRs falls into five categories that differ in the degree of complexity of modeling, as follows:

1. A good model structure can be found for the loop among the known structures. In those places where one of the known structures can be used as a model for the VR, the main chain coordinates for that loop are taken from the known structure and the side chains mutated to the new sequence as before. In our schematic structure, this case is illustrated by the loop at positions 13-15 (Fig. 3 and Table I). The VR of protein "B" matches the new sequence both in residue length and residue character and thus is chosen for this part of the model structure (Fig. 2d,e).

2. There is no exact match, but a similar structural theme appears among the known structures. An example of this would be when all the VRs form a β turn but with different lengths to the loop. In this case, one would construct the structure of the loop to fit a β bend following the pattern of the other members of the family.

In the next two cases, the modeling is less reliable:

3. The size of the VR differs from one of the known structures by only one or two residues. This is a very common occurrence. The relative addition or deletion of one or two residues will cause the new VR to have a conformation very different from that of the other known structures. An example of this situation appears for the VR at positions 3-6 in Figs. 2d,e and 3. No appropriate model structure can be found for this loop among the known structures. Such loops have to be constructed using more complex methods such as conformational searches 14-18 and energy evaluation.19,20 These methods are very time consuming, computer intensive, and not yet reliable.27,28 As an alternative, we have adapted a method originally developed by Kraulis and Jones 29 for constructing protein structures from fragments using NMR or crystallographic data. We search the Brookhaven Protein Data Bank 30 for structural fragments of the correct residue length whose ends closely fit the α-carbon positions of the ends of the loop of the new structure, which lie in the adjacent SCRs. Thus, for the VR at positions 3 to 6 of the new sequence, the database is searched for a fragment with a total of 11 residues: 5 residues in the middle for the VR, 3 on the N-terminal side that should fit the conformation of residues A-D-A at positions -1 to 2, and 3 more on the C-terminal side that should fit the conformation of residues L-T-V at positions 7-9. Typically, we display the 10 structural fragments that best fit [lowest root mean square (rms) deviation] the new conformation at the two ends. Those structural fragments which collide with the rest of the new protein are, of course, eliminated. Similarly, those which do not pack hydrophobic residues, which bury a charged group, or which cannot accommodate a proline residue of the new sequence can also be rejected. Using this approach, one or more tentative starting structures can be chosen for this VR. This method does not produce an exhaustive list, nor does it guarantee that the conformations which emerge include the correct conformation for this VR. It does provide conformations of loops that have previously been found in experimental protein structures.

4. Several known structures have the same size loop as the new sequence but have different conformations. When this occurs, it may be difficult to determine which, if any, of the known structures should be used to model this particular VR. If it is not evident from the character of the residues forming the loop and the immediate docking environment of the loop which best fits the new structure, then more than one possible conformation for this loop may have to be propagated.

5. A very large addition occurs in the new sequence. Such large additions (up to 10 or 20 residues in some cases 8,10) have so many possible conformations that any conformational search procedures are hopeless. Therefore, in the past,8 we just did not build these portions of the structure. However, recent analysis 10 of reported experimental crystal structures suggests that these large additions may be conformationally disordered and therefore do not need to be modeled.

27 R. M. Fine, H. Wang, P. S. Shenkin, D. L. Yarmush, and C. Levinthal, Proteins 1, 342 (1986).
28 C. Chothia, A. Lesk, M. Levitt, A. Amit, R. Mariuzza, V. Phillips, and R. Poljak, Science 233, 755 (1986).
29 P. J. Kraulis and T. A. Jones, Proteins 2, 188 (1987).
30 F. C. Bernstein, T. F. Koetzle, G. J. Williams, E. J. Meyer, M. D. Brice, J. R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977).
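The database search described in case 3 can be sketched in Python as follows. Two things in this sketch are assumptions and not the author's implementation. First, the "database" is a hypothetical dictionary mapping a chain name to a list of α-carbon coordinates; reading real Protein Data Bank files is left out. Second, the text ranks fragments by rms deviation after fitting the fragment ends onto the loop ends, whereas this sketch substitutes the rms difference of inter-α-carbon distances among the six flanking residues. That measure needs no superposition step, but it cannot distinguish a fragment from its mirror image, so it is only a stand-in.

```python
from itertools import combinations
from math import dist, sqrt


def end_fit(anchors, fragment, n_flank=3):
    """RMS difference between anchor and fragment-end inter-CA distances.

    anchors: 2 * n_flank alpha-carbon coordinates taken from the SCRs on
    either side of the loop (N-terminal flank first, then C-terminal).
    fragment: alpha-carbon coordinates of a candidate fragment of length
    n_flank + loop length + n_flank.
    """
    ends = fragment[:n_flank] + fragment[-n_flank:]
    pairs = list(combinations(range(2 * n_flank), 2))
    squared = sum(
        (dist(anchors[i], anchors[j]) - dist(ends[i], ends[j])) ** 2
        for i, j in pairs
    )
    return sqrt(squared / len(pairs))


def best_fragments(anchors, database, loop_length, n_flank=3, keep=10):
    """Rank all windows of the required length by end fit, best first.

    Returns up to `keep` tuples of (score, chain name, start index).
    """
    wanted = loop_length + 2 * n_flank
    scored = []
    for name, chain in database.items():
        # Every window of the required length in every chain is a
        # candidate; chains shorter than the window contribute nothing.
        for start in range(len(chain) - wanted + 1):
            fragment = chain[start:start + wanted]
            scored.append((end_fit(anchors, fragment, n_flank), name, start))
    scored.sort()
    return scored[:keep]
```

For the VR at positions 3 to 6, `loop_length` would be 5 and `n_flank` 3, giving the 11-residue window mentioned in the text, and `keep=10` mirrors the 10 fragments that are displayed. The later screening steps (collisions, hydrophobic packing, buried charges, proline compatibility) are not automated here.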
Using the above steps, we have assembled the complete protein from the SCR fragments and respective VR fragments of the known structures, completing the molecule with fragments selected from the wealth of database protein structures when no good model is available from the known homologous structures (Table I). The different parts of the new structure arise from quite different sources (Fig. 2e). Consequently, we may assign approximate reliability or confidence levels to the different portions of the structure based on their source.

Because the conformations of the SCRs are the same in all members of the family, confidence is very high that the structure is correct in these regions. This is true when the characteristic sequence patterns are found for the respective SCR. When no characteristic sequence similarity pattern or only a weak pattern exists for the SCR, then confidence in the model for this region will be reduced (see, e.g., the SCR at positions 1-2 in Table I and Fig. 3). If the characteristic sequence pattern is missing completely only in the new protein sequence, it may be an important warning that the structure is different in this region in this member of the family.10 Great care must be taken in modeling such regions to be sure that extrapolation from the other known structures is warranted. Confidence in such regions is accordingly reduced.

For the VRs, when a good model structure appears among the known structures that fits both in residue length and in residue character (see, e.g., the VRs at positions -1-0, 13-15, and 19-20 in Table I and Fig. 3), then the confidence level is medium. It is not as high as in the SCRs because the VRs are so much more variable. For those VRs where no model structure can be identified among the known structures, the confidence level is very low (see the VR at positions 3-6 in Table I). Conformational search methods have not yet achieved sufficient reliability to be able to predict these loop conformations correctly.11-13

The facility with which the new structure has been assembled from the respective fragments of the known structures is the result of the careful superposition of the known structures, which was the first step in the modeling process (Fig. 1). This superposition is essential for several important steps in the modeling. It permits the objective identification of the SCRs (Fig. 2b), allows the proper alignment of the amino acid sequences based on the superposition (Fig. 3), and permits the facile assembly of the diverse fragments into a coherent model structure with no significant overlap of main chain atoms (Fig. 2c-e).
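The confidence rules stated above are simple enough to write down directly. The Python sketch below encodes them and applies them to the fragments of Table I; the function name and the tuple layout are hypothetical, and the three-level scale follows the table (which records "Low" where the text says "very low").

```python
# Rows of Table I: (kind, positions, pattern found?, model found?).
# None marks a criterion that does not apply to that kind of fragment.
FRAGMENTS = [
    ("SCR", "1-2", False, None),
    ("SCR", "7-12", True, None),
    ("SCR", "16-18", True, None),
    ("SCR", "21-22", True, None),
    ("VR", "-1-0", None, True),
    ("VR", "3-6", None, False),
    ("VR", "13-15", None, True),
    ("VR", "19-20", None, True),
]


def confidence(kind, has_pattern=None, has_model=None):
    """Confidence level for one fragment, following the rules in the text."""
    if kind == "SCR":
        # Conserved conformation, trusted when the sequence pattern is found.
        return "high" if has_pattern else "low"
    if kind == "VR":
        # A loop borrowed from a known structure is only moderately reliable;
        # a loop with no model among the known structures is unreliable.
        return "medium" if has_model else "low"
    raise ValueError(f"unknown fragment kind: {kind!r}")


for kind, positions, has_pattern, has_model in FRAGMENTS:
    print(kind, positions, confidence(kind, has_pattern, has_model))
```

Carrying such a label alongside each fragment makes it easy to flag, in any later use of the model, which regions rest on conserved structure and which on a database loop.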

Refinement of Model Structure

The first step in the refinement of the structure is a general examination of the structure to see that there are no serious errors. Such errors include inappropriately buried charged residues or disulfide bridges that cannot be completed because the sulfur atoms are too far apart. These errors may be the result of a mistake in the alignment of the new sequence. When such problems are encountered, the sequence alignment must be reexamined and the structure rebuilt.

The next step is the removal of any overlapping atoms. The nature of the assembly of the structure from fragments of the known structures minimizes the likelihood of steric overlap of the main chain atoms. Only if the bad contact was in one of the known structures will it be propagated into the new structure. In the case of the main chain of the VRs, absence of overlap of the main chain and good packing were the main criteria for choosing a suitable VR conformation. However, the side-chain mutation process to the new sequence may introduce bad contacts. Such contacts can almost always be relieved by suitable rotation about the side-chain χ angles. Knowledge as to which of the colliding side chains is conserved in type and position among the known structures can help to determine which side-chain atoms should be moved to relieve the overlap. It is best to eliminate these overlaps manually to remove large disruptive energy forces in the initial minimization steps.

One more step must be taken before the structure can be introduced into an energy minimization program. The bound waters of the structure must be added to avoid the distortions that minimization will produce when structural waters are not included. We typically take the waters that are found in the various experimental crystal structures and add them to the new model structure. (This can be done directly since all the structures are in the same reference frame.) The water molecules are then examined in detail. When a water overlaps any of the atoms of the new protein, it is removed. If a water molecule is buried in what is now a hydrophobic pocket in the new structure, where there are no hydrogen bond donors or acceptors for it, the water is deleted. Remaining waters are retained for the time being. Additional waters are included as needed to prevent the molecule from collapsing into an empty active site or a polar cavity in the molecule.

Great care must be taken in the initial steps of energy minimization to ensure that the process does not degrade the structure. Since most crystal structures do not include hydrogen atoms, these must be generated on the respective "heavy" atoms with the appropriate geometry. This will, in general, generate overlaps between hydrogen atoms, leading to enormous initial nonbonded repulsive forces. Consequently, the first minimization steps must be performed with a high template forcing 31 constant to restrain the atoms from moving significantly from their starting positions. Typically, only a small number of cycles, 10-50, are required to drop the enormous starting energies down to reasonable values, requiring rms movements of less than 0.3 Å for all the atoms of the molecule. Subsequent minimization cycles are performed with a gradual decrease of the template forcing constant until finally the protein is allowed to minimize without restraints. At all stages of this minimization, the structure is examined periodically to see if any of the waters are moving significantly. Whenever this is observed, the structure is analyzed to see why the water is moving. If it is due to a distortion introduced by the water, the water may be removed. Some of the surface waters may "boil off" after the forcing constraints are relaxed. This is either ignored, or the waters may be deleted.

31 R. S. Struthers, A. T. Hagler, and J. Rivier, in "Conformationally Directed Drug Design: Peptides and Nucleic Acids as Templates or Targets" (J. A. Vida and M. Gordon, eds.), p. 239. American Chemical Society, Washington, D.C., 1984.

Further refinement may be performed using molecular dynamics on the whole molecule, with or without forcing constraints. Alternatively, it may be useful to apply molecular dynamics to selected portions, for example, particular VRs, that may benefit from a dynamics analysis of local conformation space.
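The staged relaxation of the template forcing constant described in this section can be illustrated with a toy Python sketch. Everything concrete here is an assumption made for illustration: the steepest-descent update, the restraint energy k·Σ(x − x_start)², the schedule of forcing constants, the step size, and above all the "force field," which is a one-dimensional quadratic well standing in for a real molecular mechanics energy. The sketch shows only the control flow: minimize under a strong restraint to the starting coordinates, then repeat with progressively weaker restraints until none remains.

```python
from math import sqrt


def minimize_with_template(start, gradient, k_schedule, cycles=50, step=0.01):
    """Steepest descent with a harmonic template restraint relaxed in stages.

    start: list of starting coordinates (floats).
    gradient: function mapping a coordinate list to the force-field
    gradient (a list of the same length).
    k_schedule: template forcing constants, one per stage, high to zero.
    Returns the final coordinates and a list of (k, rms shift from start)
    recorded at the end of each stage.
    """
    x = list(start)
    history = []
    for k in k_schedule:
        for _ in range(cycles):
            g = gradient(x)
            # Total gradient = force-field term + restraint term 2k(x - x0).
            x = [
                xi - step * (gi + 2.0 * k * (xi - x0))
                for xi, gi, x0 in zip(x, g, start)
            ]
        rms = sqrt(sum((xi - x0) ** 2 for xi, x0 in zip(x, start)) / len(x))
        history.append((k, rms))
    return x, history


# Toy stand-in for a force field: E = 0.5 * sum((x - target)**2).
TARGET = [1.0, 0.0, 3.0]


def toy_gradient(x):
    return [xi - ti for xi, ti in zip(x, TARGET)]


final, history = minimize_with_template(
    [0.0, 1.0, 2.0], toy_gradient, k_schedule=[50.0, 5.0, 0.5, 0.0]
)
for k, rms in history:
    print(f"k = {k:5.1f}  rms shift from start = {rms:.3f}")
```

With a fixed step size, stability requires the step to be small relative to the largest forcing constant; a production minimizer would use a line search or conjugate gradients instead.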

[13] Pattern-Based Approaches to Protein Structure Prediction

By BRUCE I. COHEN, SCOTT R. PRESNELL, and FRED E. COHEN

Introduction

In the appropriate milieu, polypeptide chains spontaneously assemble into unique tertiary structures guided by their amino acid sequence.1 Although numerous experiments suggest the existence of a folding code, explicit specification of sequence-based folding rules has proved difficult.2 The goal of our research is to develop a set of sequence-structure correlates that can be used to predict secondary structure from protein primary sequence. This chapter begins by discussing a series of principles which form a foundation for structure prediction. Next we sketch an algorithm for finding turns and describe some of the requirements for a pattern language that facilitates the identification of sequence-structure correlates. We then present the pattern language itself and finally offer examples of patterns which can be used to recognize turns or loops and α helices.

A pattern language must allow for the specification of exact residue-by-residue matches and simultaneously offer flexibility and generalizability. We have developed a convenient computer interface for the development of patterns that recognize protein substructures. Although efforts to completely automate the development of reliable sequence-structure correlates have failed in our hands, we believe that structural principles can be translated by an individual into a pattern formalism and that refinement of these initial patterns can lead to useful algorithms for predicting secondary structure.

1 C. B. Anfinsen, E. Haber, M. Sela, and F. H. White, Proc. Natl. Acad. Sci. U.S.A. 47, 1309 (1961).
2 G. E. Schultz, Annu. Rev. Biophys. Biophys. Chem. 17, 1 (1988).
