A systematic approach to the comparison of protein structures


J. Mol. Biol. (1980) 140, 77-99

S. J. REMINGTON† AND B. W. MATTHEWS

Institute of Molecular Biology and Department of Physics, University of Oregon, Eugene, Ore. 97403, U.S.A.

(Received 13 November 1979, and in revised form 18 February 1980)

A systematic method has been developed for comparing the backbone conformations of proteins (Remington & Matthews, 1978). Two proteins are compared by successively optimizing the agreement between all possible segments of a chosen length from one protein, and all possible segments of the same length from the other protein. The method reveals any similarities between the two proteins, and provides an estimate of the statistical significance of any given structure agreement that is obtained. The method has been tested in a number of cases, including comparisons of the dehydrogenases and of the pancreatic and bacterial serine proteases. These examples were chosen to test the ability of the comparison method to detect structural similarities in the presence of large insertions and deletions. The results suggest that the detection of the “nucleotide binding fold” in the dehydrogenases is at the limit of the capability of the comparison technique in its original form, although it may be possible to generalize the method to allow for insertions and deletions in proteins.

The results of many protein comparisons, made with different probe lengths, are summarized. For medium and long probe lengths, the average value of the structural agreement does not depend very much on the type of protein being compared. The average value of the structure agreement increases with the square root of the probe length, but for probe lengths above about 40 residues, the standard deviation is independent of probe length. From these observations it is possible to construct a generalized probability diagram to evaluate the significance of any structure agreement that might be obtained in comparing two proteins.

† Present address: Max-Planck-Institut für Biochemie und Physikalisch-Chemisches Institut der Technischen Universität, D-8033 Martinsried bei München, Germany.
‡ Author to whom correspondence should be addressed.

0022-2836/80/170077-23 $02.00/0 © 1980 Academic Press Inc. (London) Ltd.

1. Introduction

As the number of known three-dimensional structures of proteins has increased, it has become increasingly apparent that similar patterns of folding often occur in different proteins. These similarities occur within families of homologous proteins such as the pancreatic serine proteases, or the globins, which are obviously derived from a common precursor, and also between proteins that have no obvious evolutionary relationship. For example, the “immunoglobulin fold” has also been found to occur in superoxide dismutase (Richardson et al., 1976). In some cases it has been easy to recognize common patterns of folding in different proteins, but in other cases the intrinsic complexity of many proteins has masked underlying similarities.

It is clearly of considerable interest to identify similarities between different protein structures, even though the origin of such similarities may be open for debate. On the one hand, a common pattern of folding seen in two proteins may suggest that they evolved from the same precursor. On the other hand, it may be that the observed folding is favored energetically, and tends to occur spontaneously or is dictated, perhaps, by certain functional requirements. There is still a need for effective methods of comparing protein structures. As more protein structures are known and are compared with each other, it may become possible to determine the basis of structural similarity.

Two general approaches have been used to compare the folding of different proteins. In the topological approach, protein structures are reduced to simplified schematic diagrams that show the sequential location of α-helices and β-sheets. By visual inspection of such diagrams, it is possible to recognize potentially similar structural domains (Schulz & Schirmer, 1974; Sternberg & Thornton, 1977; Richardson, 1977; Levitt & Chothia, 1976). These methods are simple to apply but, except in a limited sense, are non-quantitative. They give only limited information about spatial equivalence, and may not detect potential regions of similarity, especially those that do not include α-helices and β-sheets. In the spatial method of structure comparison, a set of equivalent points in the two structures is identified, and their co-ordinates determined.
Then, the set of reference points from one structure is rotated and translated to determine the optimum agreement with the other structure (see e.g. Matthews et al., 1968; Freer et al., 1970; Huber et al., 1971). Rossmann and co-workers have extended this method to permit the comparison of the backbones of pairs of proteins in which extensive “insertions” or “deletions” may have occurred (Rao & Rossmann, 1973; Rossmann & Argos, 1976, 1977).

Recently, we proposed a new method of spatial comparison (Remington & Matthews, 1978) based on Fitch’s (1966, 1970) method of comparing amino acid sequences. Both techniques permit exhaustive searches to be made for structural similarity, although the Rossmann-Argos procedure requires arbitrary assumptions on the values of various constants. On the other hand, the proposed method, in its present form, may not detect structural similarities that are interspersed with extensive insertions and deletions. The method provides an estimate of the statistical significance of any apparent structural similarity by evaluating it relative to the frequency distribution of all other comparisons. In the Rossmann-Argos procedure, the significance of the agreement between two proteins is estimated by comparing the number of “equivalent” residues obtained for the best alignment with the number of equivalent residues obtained for all other alignments of the two proteins (Rossmann & Argos, 1977). The method we have proposed is useful for comparing different proteins, and for correlating repeated structural elements within a single polypeptide chain, and has been used in this context by McLachlan (1979).

In our preliminary communication (Remington & Matthews, 1978) we applied the structure comparison method to comparisons of hemoglobin and myoglobin, phage lysozyme and carp calcium-binding protein, and phage and hen egg-white lysozyme. In this paper we describe further tests of the method with the dehydrogenases and the pancreatic and bacterial serine proteases. These were chosen because they represent families of proteins in which there are regions of structural similarity interspersed with large insertions and deletions. We showed previously that the comparison method will clearly reveal close structural correspondence, as is found in the different globins. Here we are more concerned with testing the method in cases of limited similarity. The tests reported here also provide a representative set of structure comparison statistics for proteins with different types of secondary structure.

2. Methods

To compare two proteins, a suitable probe length L is chosen; e.g. 40 residues. Then, each possible backbone segment of length L residues from the first protein, 1 to 40, 2 to 41, etc. is compared in turn with each possible backbone segment of length L from the second protein. In each case the 40-alpha-carbon segment of one protein is translated and rotated so as to minimize Rca, the root-mean-square distance between the respective alpha-carbon atoms of the 2 segments:

$$R_{ca} = \left[ \frac{1}{L} \sum \left| \mathbf{X}_i - \mathbf{A}\,\mathbf{X}'_j \right|^2 \right]^{1/2}$$

where the sum is taken over the L pairs of atoms being compared, and X_i and X'_j are, respectively, the co-ordinate vectors of the ith alpha-carbon atom and the jth alpha-carbon atom of the 2 proteins relative to the center of mass of the L atoms being compared. The minimization of Rca requires that the centers of mass of the 2 sets of atoms coincide, so that minimization reduces to the determination of the rotation matrix A(α, β, γ). This matrix is the resultant of successive rotations of the angle γ about the Z axis of protein 2, the angle β about the new Y axis, and the angle α about the resultant X axis.

Several methods are available for the minimization of Rca, one being to iteratively adjust the rotation angles (α, β, γ) until a minimum value of Rca is obtained (cf. Matthews et al., 1968). This was the procedure used in our original paper (Remington & Matthews, 1978), and has been used for many of the comparisons reported here. The method has the advantage that it provides both the minimum of Rca, and the corresponding rotation matrix A(α, β, γ). Also, as judged by a variety of tests, it satisfactorily minimizes Rca for both good and bad structural agreements, and can be relied on to locate the true minimum. On the other hand, the iterative method is rather slow, which is a disadvantage when comparing large proteins (Remington & Matthews, 1978). The number of minimizations required to compare 2 proteins of length M and N residues, with a probe length L, is (M - L + 1) x (N - L + 1), which can approach 10^5 minimizations (see Table 1).

Alternatively, Rca can be minimized by a number of matrix methods. McLachlan (1979) has recently reviewed these methods and proposed a new minimization technique that is both fast and reliable, and which will, with some additional computing, provide the optimum rotation matrix A(α, β, γ). In practice, we found McLachlan’s method to be about 100 times faster than the angle adjustment method, when both algorithms were tested on a Varian V76 minicomputer.
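
The angle-adjustment idea described above can be illustrated with a minimal sketch. It is not the authors' program and not McLachlan's matrix method: the function names, the composition order Rz(γ)·Ry(β)·Rx(α) chosen for A, the simple step-halving search, and the starting step and tolerance are all assumptions made for illustration.

```python
import math

def rotation_matrix(alpha, beta, gamma):
    """A = Rz(gamma) * Ry(beta) * Rx(alpha); this composition order is an assumption."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]

def centered(points):
    """Shift a segment so that its center of mass is at the origin."""
    n = len(points)
    center = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - center[k] for k in range(3)] for p in points]

def rotate(matrix, point):
    """Apply a 3x3 rotation matrix to one co-ordinate vector."""
    return [sum(matrix[r][c] * point[c] for c in range(3)) for r in range(3)]

def rca(seg1, seg2, angles=(0.0, 0.0, 0.0)):
    """RMS distance between paired atoms after centering both segments
    and rotating the second one by A(alpha, beta, gamma)."""
    a = rotation_matrix(*angles)
    x1, x2 = centered(seg1), centered(seg2)
    total = 0.0
    for p, q in zip(x1, x2):
        rq = rotate(a, q)
        total += sum((p[k] - rq[k]) ** 2 for k in range(3))
    return math.sqrt(total / len(x1))

def minimize_rca(seg1, seg2, step=0.5, tol=1e-7):
    """Adjust one angle at a time, keeping any change that lowers Rca,
    and halve the step whenever no change helps (a crude stand-in for
    the iterative procedure; it is not guaranteed to find the global minimum)."""
    angles = [0.0, 0.0, 0.0]
    best = rca(seg1, seg2, angles)
    while step > tol:
        improved = False
        for k in range(3):
            for delta in (step, -step):
                trial = list(angles)
                trial[k] += delta
                value = rca(seg1, seg2, trial)
                if value < best:
                    angles, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return best, tuple(angles)

def n_comparisons(m, n, probe):
    """Number of segment pairs: (M - L + 1) x (N - L + 1)."""
    return (m - probe + 1) * (n - probe + 1)
```

As a usage note, chain lengths of 329 and 333 residues (assumed here, not quoted in the text) give `n_comparisons(329, 333, 40) == 85260`, which agrees with the number of comparisons listed in Table 1 for lactate dehydrogenase and glyceraldehyde-3-phosphate dehydrogenase at L = 40.
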

After the values of Rca have been obtained for all segments of the proteins being compared, the mean and standard deviation of Rca are obtained, and the values of Rca plotted as a contoured “structure comparison map”. When comparing 2 very similar proteins, such as myoglobin and hemoglobin, the comparison map will have an obvious band of good structural agreement down the diagonal (see e.g. Fig. 1 of Remington & Matthews, 1978). The peaks tend to be elongated parallel to the diagonal because, as one moves in this direction, there is a substantial overlap between the successive structure comparisons.

One of the features of the structure comparison method is that it provides a large sample of values of Rca, against which an unusually good agreement can be compared. In comparing a number of different proteins we have found that the distribution of the observed values of Rca is approximately Gaussian, excepting those values of Rca that correspond to unusually good structural agreement. It is the departure of these low Rca values from a Gaussian distribution that shows them to be statistically “unusual”. In the structural comparisons quoted below, we compare the observed distribution of Rca with the best-fit Gaussian curve, and also plot the results as a cumulative probability distribution (see e.g. Fig. 2(b)). In this case a Gaussian distribution appears as a straight line, and any unusually good structural correspondence will cause an increase in the frequency of observed values of Rca at the left-hand end of the distribution.

All co-ordinates used in this paper were taken from the Protein Data Bank (Bernstein et al., 1977).
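The cumulative probability plot described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' plotting procedure: the function names and the plotting position (rank - 0.5)/n are assumptions. Each sorted Rca value is paired with the standard-normal quantile of its cumulative frequency, so a Gaussian sample falls on a straight line and an excess of low values bends the left-hand end.

```python
from statistics import NormalDist, mean, stdev

def gaussian_fit(values):
    """Mean and sample standard deviation of a list of Rca values."""
    return mean(values), stdev(values)

def probability_plot(values):
    """Return (Rca, deviation in sigma units) pairs for a cumulative plot.

    The plotting position (rank - 0.5) / n is an assumed convention.
    """
    ordered = sorted(values)
    n = len(ordered)
    unit = NormalDist()  # standard normal: mean 0, sigma 1
    return [
        (x, unit.inv_cdf((rank - 0.5) / n))
        for rank, x in enumerate(ordered, start=1)
    ]
```

Plotting the second member of each pair against the first gives the straight-line test: the slope of the line is 1/σ and it crosses zero at the mean.
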

3. Results

(a) Glycolytic enzymes

One of the best known examples of a common structural fold is the “nucleotide binding domain”, first observed in lactate dehydrogenase and malate dehydrogenase, and subsequently in glyceraldehyde-3-phosphate dehydrogenase and horse liver alcohol dehydrogenase (Rao & Rossmann, 1973; Rossmann et al., 1974; Webb et al., 1973; Ohlsson et al., 1974). With the exception of lactate and malate dehydrogenase, none of the dehydrogenases has obviously homologous amino acid sequences. The “idealized” nucleotide binding fold consists of six strands of parallel β-sheet, connected via two helices on one side of the sheet and two helices on the other, including about 150 residues in all. In the actual structures, the respective coenzyme binding domains differ from each other somewhat and it is necessary to “ignore” these variations in superimposing one domain on another. In comparing the nucleotide binding domains of lactate dehydrogenase and glyceraldehyde-3-phosphate dehydrogenase, for example, Rossmann et al. (1974) found that 75 “equivalent” alpha-carbon atoms could be superimposed within a distance of 3.8 Å. This structural alignment is illustrated diagrammatically in Figure 1. As can be seen, there are blocks of residues that coincide, interspersed with regions where the structural superposition is poor (greater than 3.8 Å). It seemed that this would provide a good test of the structure comparison method, since one would be attempting to locate a region of structural agreement in the presence of “insertions” and “deletions”.

FIG. 1. Alignment of the nucleotide binding folds of lactate dehydrogenase (LDHase), glyceraldehyde-3-phosphate dehydrogenase (GPDHase) and liver alcohol dehydrogenase (LADHase), proposed by Rossmann et al. (1974). The connected bars indicate α-carbon atoms in the respective dehydrogenases that are structurally equivalent.

The comparison of LDHase† and GPDHase with a probe length of 80 residues is illustrated in Figure 2(a), and the corresponding probability plots are shown in Figure 2(b). In Figure 2(b) we show the observed frequency distribution and the best-fit Gaussian, as well as the same data plotted as a cumulative distribution. The average structural agreement for all possible alignments is 14.72 Å, with a standard deviation σ = 1.72 Å. The best agreement, Rca = 6.1 Å, aligns residues 22 to 101 of LDHase with 1 to 80 of GPDHase. This is essentially the same alignment as proposed by Rossmann et al. (1974) to superimpose the nucleotide binding domains of these two dehydrogenases (Fig. 1). The agreement Rca = 6.1 Å, corresponding to 5.0σ, is statistically significant. Also, the cumulative probability plot for L = 80 residues (Fig. 2(b)) is distinctly non-linear, showing an increase in the frequency of low agreements relative to that expected for a Gaussian distribution. We conclude, therefore, that the structure comparison method can successfully reveal the agreement between the “nucleotide binding domains” of LDHase and GPDHase, notwithstanding the insertions and deletions.

The next-highest peak in Figure 2(a) indicates an agreement of Rca = 7.7 Å between residues 139 to 218 of LDHase and residues 178 to 257 of GPDHase. These two segments include residues of the respective active sites of LDHase and GPDHase, and it is interesting to note that the above structural superposition of the two backbone segments also results in a superposition of the substrate binding sites of the two enzymes. On the other hand, the above superposition does not align Arg171 and His196 of LDHase with Cys149 and His176 of GPDHase. Garavito et al. (1977) have suggested that these pairs of residues play analogous roles, albeit with opposite hand, in the two dehydrogenases. Although the agreement, Rca = 7.7 Å, corresponds to 4.1σ, it is to be expected that in a comparison of proteins of this size, such an agreement will occur at least once by chance alone. Therefore, the significance of the apparent structural agreement within the catalytic domains of LDHase and GPDHase is uncertain. Nevertheless, the suggestion remains that there is a heretofore undetected structural similarity between parts of the catalytic domains of LDHase and GPDHase.

We show in Figure 3(a) and (b) the results of comparing LDHase and GPDHase with a probe length of 120 residues. Here, the agreement between the nucleotide binding regions is quite poor (2.8σ). In this case the insertions and deletions seen in Figure 1 are located in such a way that they prevent significant structural similarity occurring between any corresponding polypeptide segments of length 120 residues.
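
The remark that a 4.1σ agreement is expected at least once by chance can be checked with a short worked calculation. The sketch below is an illustration, not the authors' computation: it takes the 63,500 comparisons listed in Table 1 for this pair at L = 80, treats them as independent draws from the fitted Gaussian (an assumption, since neighbouring segments overlap heavily), and uses hypothetical function names.

```python
from statistics import NormalDist

def sigma_units(value, mean_rca, sd_rca):
    """How many standard deviations a given Rca lies below the mean."""
    return (mean_rca - value) / sd_rca

def expected_by_chance(n_comparisons, z):
    """Expected count of values at least z sigma below the mean,
    if all n comparisons were independent Gaussian draws (an assumption)."""
    return n_comparisons * NormalDist().cdf(-z)

z_a = sigma_units(6.1, 14.72, 1.72)   # nucleotide binding domains
z_b = sigma_units(7.7, 14.72, 1.72)   # catalytic domains
print(round(z_a, 1), round(z_b, 1))
print(expected_by_chance(63500, z_a), expected_by_chance(63500, z_b))
```

Under these assumptions the expected count for the 7.7 Å peak is between one and two, in line with the statement in the text, whereas the count for the 6.1 Å peak is well below one.
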

† Abbreviations used: LDHase, lactate dehydrogenase; GPDHase, glyceraldehyde-3-phosphate dehydrogenase; LADHase, liver alcohol dehydrogenase; SGPBase, protease type B from Streptomyces griseus.

FIG. 2. (a) Structure comparison map for LDHase and GPDHase with a probe length of 80 residues. Successive contour levels indicate values of Rca equal to 13.0 Å, 11.3 Å, ... and are at intervals of 1 standard deviation (1.72 Å) below the mean value of 14.73 Å. Peak A indicates the alignment of the respective nucleotide binding folds, and peak B indicates similarity in the structures of the catalytic domains. (b) Frequency distribution of Rca for the comparison of LDHase and GPDHase with a probe length of 80 residues. The best-fit Gaussian distribution is superimposed. The Figure also includes the same data plotted as a cumulative distribution. Values of Rca are grouped in increments of 0.1 ångström unit. The ordinate on the right gives the probability in units of 1σ and the corresponding cumulative frequency of Rca, expressed as a percentage.

FIG. 3. (a) Structure comparison map for LDHase and GPDHase with a probe of length 120 residues. Contours are drawn at 1σ (1.61 Å) below the mean of 16.6 Å. Peak A corresponds to alignment of the nucleotide binding domains, and peak B to superposition of part of the catalytic domains. (b) Probability distribution corresponding to (a).
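The contour levels quoted for the comparison map with a probe of 80 residues follow directly from the mean and standard deviation. A minimal sketch, with hypothetical function names and a nested-list map layout that is assumed for illustration:

```python
def contour_levels(mean_rca, sd_rca, count):
    """Levels at 1, 2, ..., count standard deviations below the mean."""
    return [mean_rca - k * sd_rca for k in range(1, count + 1)]

def cells_below(rca_map, level):
    """(i, j) segment start positions whose Rca is at or below a level.

    rca_map[i][j] is assumed to hold Rca for the segment starting at
    residue i of one protein and residue j of the other.
    """
    return [
        (i, j)
        for i, row in enumerate(rca_map)
        for j, value in enumerate(row)
        if value <= level
    ]

levels = contour_levels(14.73, 1.72, 3)
print([round(v, 1) for v in levels])
```

With a mean of 14.73 Å and a standard deviation of 1.72 Å the first two levels round to 13.0 and 11.3 Å, matching the values given in the caption to Figure 2.
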

Figure 3(a) indicates that there are two alignments with agreement better than 3σ, one of which superimposes the active site regions, but in Figure 3(b) the cumulative distribution is a straight line, indicating that these superpositions are no better than might be expected by chance alone.

On comparing LDHase and GPDHase with a probe length L = 40 residues, the best agreement, Rca = 2.8 Å (3.9σ), is for the superposition of residues 21 to 60 of LDHase on 1 to 40 of GPDHase. Although this alignment superimposes the nucleotide binding domains of the two dehydrogenases, it is not better than the statistically-expected best value, and the cumulative probability plot is a straight line. As discussed in more detail below, the level of structural agreement obtained with a short probe must be substantially better than with a long probe, in order to have the same statistical significance.

We have also compared the structure of liver alcohol dehydrogenase with both LDHase and GPDHase. In the comparison of LADHase and LDHase with a probe length L = 80 residues, the best agreement, Rca = 9.0 Å, superimposes residues 192 to 271 of LADHase on 22 to 101 of LDHase. This is the superposition of the nucleotide binding domains (Rossmann et al., 1974), but the level of significance (3.3σ) is not enough to distinguish this superposition as being unusually good. In the comparison of LADHase and GPDHase with a probe length of 80 residues, the superposition of the nucleotide binding domains shows up as one of the higher peaks on the comparison map, but is by no means obvious (Fig. 4(a) and (b)). As can be seen in Figure 4(a), the region corresponding to the respective nucleotide binding domains (1 to 140 in GPDHase and 190 to 330 in LADHase) contains all of the best alignments in the two structures, but none of these shows out as a single dominant feature. Presumably these “satellite” agreements arise in part from the fact that the nucleotide binding domain is rather repetitive, with alternating β-strands and α-helices, and also has an overall twofold repeat (Rossmann et al., 1974) so that different parts of the nucleotide binding domains can superimpose partially to give some structural correspondence. Also, the large insertion between residues 40 and 60 of GPDHase, relative to LADHase, which prevents overall good agreement between the two domains, contributes to the satellite peaks by separating regions of partial agreement (see Fig. 1). It may be noted that Eventoff & Rossmann (1975) also found the nucleotide binding folds of LADHase and GPDHase to be the most different among the dehydrogenases tested.

In summary, the above comparisons of the dehydrogenases suggest that the detection of the nucleotide binding fold is at the limit of the capability of the comparison technique in its present form. In the case of LDHase and GPDHase, the comparison method clearly confirms that the structural correspondence of the respective coenzyme binding domains of the two enzymes is statistically significant. On the other hand, the agreement between the nucleotide binding domains of, for example, LADHase and GPDHase, is not unusually significant.

FIG. 4. (a) Structural comparison of LADHase and GPDHase with a probe of 80 residues. Peak A indicates the (imperfect) superposition of the nucleotide binding regions. Contours drawn at intervals of 1σ (1.83 Å) below the mean agreement of 14.47 Å. (b) Probability distribution corresponding to (a).

(b) Microbial and pancreatic serine proteases

We wished to test the comparison method in a case where two structures were known to be related evolutionarily, yet had substantial differences in structure due to large insertions and deletions. Such an example is provided by the microbial and pancreatic serine proteases. The microbial serine proteases are smaller than the pancreatic enzymes, with molecular weights of about 20,000, compared to about 25,500 for α-chymotrypsin and elastase. The two classes of enzymes have only about 18% amino acid sequence homology, yet they have obviously derived from a common precursor. The three-dimensional structures of the microbial enzymes clearly resemble those of the pancreatic enzymes, although there are many differences, and only about two-thirds of the residues adopt topologically equivalent positions (Delbaere et al., 1975, 1979; James et al., 1978). In the following comparisons we have used the co-ordinates for the protease type B from Streptomyces griseus (Delbaere et al., 1979), and for elastase (Shotton & Watson, 1970). In Figure 5 we show the residues in SGPBase and elastase that are topologically equivalent, according to the recent alignment reported by James et al. (1978).

The results of the comparison of SGPBase and elastase with probe lengths of L = 40 and 80 residues are shown in Figures 6 and 7. In both cases the comparison maps have peaks along the diagonal, suggesting some structural correspondence, but in neither case is the structural agreement of high significance. Also, the probability plots (Figs 6(b) and 7(b)) show the distribution of the agreements to be essentially Gaussian.

The reason that the comparison method does not detect the structural correspondence of SGPBase and elastase can be seen in Figure 5. The regions of structural homology are short, typically about ten residues, and are interspersed with large insertions or deletions. Because of these insertions and deletions, there is no case where 40 (or 80) consecutive residues of one structure remain in alignment with 40 (or 80) consecutive residues of the other structure. Ordinarily, one might expect to have insertions and deletions in the two structures that would compensate, but this is not the case for SGPBase, which is much shorter than elastase. Here there are 60 insertions in the elastase sequence, but only seven compensating insertions in SGPBase (see Fig. 5; and Table 1 of James et al., 1978).

FIG. 5. Structural correspondence of elastase and S. griseus protease type B (SGPBase) proposed by James et al. (1978).

FIG. 6. (a) Structure comparison of elastase and SGPBase with a probe of length 40 residues. Contours drawn at intervals of 1σ (2.00 Å) below the mean (11.32 Å). (b) Probability distribution corresponding to (a).

FIG. 7. (a) Structure comparison of elastase and SGPBase with a probe of length 80. Contours drawn at intervals of 1σ (1.71 Å) below the mean (16.23 Å). (b) Probability distribution corresponding to (a).

As a crude attempt to compensate for the difference in length of elastase and SGPBase, we ran a comparison of the two structures in which every fourth residue of elastase was deleted. This reduces the number of elastase residues to about the same as SGPBase. The results are shown in Figures 8 and 9. Here the similarity between elastase and SGPBase is obvious. In both cases (Figs 8(a) and 9(a)) there is the characteristic band down the diagonal that one sees in comparing very similar proteins such as trypsin and chymotrypsin, or myoglobin and hemoglobin. Also, the probability plots (Figs 8(b) and 9(b)) show obvious departures from Gaussian distribution, confirming the good structural correspondence. Deleting the elastase residues as described above clearly allows the two structures to remain sufficiently “in register” that their overall structural similarity becomes obvious, despite the local perturbations resulting from the deletion of every fourth residue.

FIG. 8. (a) Structure comparison of SGPBase with a concatenated elastase in which every 4th α-carbon was deleted (see the text). Probe length 40 residues and contours at levels of 1.95 Å below the mean (11.95 Å). The continuous peak down the diagonal indicates the overall agreement of the two structures, and peaks A and B indicate agreement between one domain of elastase and the other domain of SGPBase. (b) Probability distribution corresponding to (a).

FIG. 9. (a) Structure comparison of SGPBase and concatenated elastase with every 4th α-carbon deleted. Probe length 80 residues and contours drawn at increments of 1.99 Å below the mean (16.28 Å). (b) Probability distribution for (a).

This preliminary test indicates that it may be possible to increase the ability of the comparison method to deal with insertions by making comparisons in which one or the other protein is “condensed”, by systematically deleting residues. We do not suggest that “condensation” in the form used here will always allow one to detect structural similarities in the presence of insertions and deletions. In particular, “condensation” would not be expected to work if the proteins being compared were similar except for one or two long insertions or deletions. (Although in such an idealized case, the normal comparison method would be expected to detect the structural similarity away from the insertions or deletions.) The successful result in the case of elastase and SGPBase shows that “condensation” is at least one possible way in which the comparison method may be generalized. Other approaches are possible, and need to be tested in a number of real cases.

The fact that the region of good structural agreement between SGPBase and condensed elastase extends continuously along the diagonal of Figures 8(a) and 9(a) reveals something about the relation between these two molecules. Clearly, elastase is “derived” from SGPBase by making a number of insertions that are distributed fairly uniformly along the length of the molecule. Continuous structural agreement along the diagonal would not be expected if there was a large insertion in SGPBase, as was thought to occur in the region 164 to 182 (Olson et al., 1960; McLachlan & Shotton, 1971; Delbaere et al., 1975).
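
Because the insertions are spread fairly uniformly along the chain, deleting residues at regular intervals keeps the two structures approximately in register. The condensation step can be sketched as below; the function names are hypothetical, the choice of which residue in each group of four is dropped (the 4th, 8th, ...) is an assumption, and the chain lengths of 240 and 185 residues used in the usage note are inferred rather than quoted from the text.

```python
def condense(residues, drop_every=4):
    """Delete every drop_every-th item (the 4th, 8th, ...), keeping the
    remaining items in their original order."""
    return [
        r for index, r in enumerate(residues, start=1)
        if index % drop_every != 0
    ]

def n_comparisons(m, n, probe):
    """Number of segment pairs: (M - L + 1) x (N - L + 1)."""
    return (m - probe + 1) * (n - probe + 1)

elastase = list(range(1, 241))        # stand-in for 240 alpha-carbons
condensed = condense(elastase)
print(len(condensed), n_comparisons(len(condensed), 185, 40))
```

With these assumed lengths the condensed chain has 180 positions and the comparison at L = 40 involves 20,586 segment pairs, the figure listed in Table 1 for the condensed elastase comparison.
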
In the revised alignment of the two sequences (Fig. 5; James et al., 1978) there are only three short insertions in the SGPBase molecule, relative to elastase, these being of length one, two and four residues.

In Figure 8(a) the strong elongated peak in the bottom left corner corresponds to superposition of part of the first β-structure domain of SGPBase on the second β-structure domain of elastase (Blow, 1969; McLachlan, 1979). The weaker peak in the top right corner arises from the superposition of the second SGPBase domain on the first elastase domain.

(c) Comparisons of different structural types

In the comparison of two structures, statistical analysis of the data is based on the method proposed by Fitch (1966, 1970) for amino acid sequence comparison. In essence, the many individual comparisons of different structural segments within the two proteins are used as a data base against which to assess the significance of unusually good agreements. In the statistical analysis, it is assumed that the frequency distribution of structure agreements for any two proteins will be Gaussian. Also, it is anticipated (although not strictly required) that the frequency distribution will be approximately the same when comparing different types of proteins. In this section we discuss the validity of these assumptions.

The assumption of a Gaussian distribution of Rca is supported by the experimental data. In many comparisons of proteins of different types (helical, sheet and mixed) we have found that the distribution of Rca can be fitted by a Gaussian curve (e.g. Figs 3(b), 4(b) and 6(b)), except for those cases where there is unusually good structural agreement between the structures being compared (e.g. Fig. 9(b)).

The distributions obtained in the comparisons of a number of different structures, and for different probe lengths, are summarized in Table 1. Figure 10 shows the

TABLE 1
Distribution of Rca for different structure comparisons

Proteins compared                                          Probe    Number      Best   Average   Standard
                                                          length    of Rca   Rca (Å)   Rca (Å)  deviation (Å)

T4 lysozyme; hen egg-white lysozyme                           20    15,950       1.8      5.70       1.21
Elastase; microbial protease SGPB                             20    36,686       2.6      7.42       1.87
T4 lysozyme; carp Ca-binding protein                          30    10,665       3.0      6.89       1.30
T4 lysozyme; hen egg-white lysozyme                           40    11,250       3.8      8.71       1.39
Myoglobin; hemoglobin β chain                                 40    12,198       0.9      8.19†      1.88†
Lactate dehydrogenase; glyceraldehyde-3-P-dehydrogenase       40    85,260       2.8     10.81       2.04
Lactate dehydrogenase; alcohol dehydrogenase                  40    24,360‡      3.3     10.75       2.02
T4 lysozyme; concanavalin A                                   40    24,750       6.3     11.54       1.91
Elastase; microbial protease SGPB                             40    29,346       4.7     11.32       2.00
Elastase (3/4); microbial protease SGPB                       40    20,586       4.5     11.95       1.95
Alcohol dehydrogenase; glyceraldehyde-3-P-dehydrogenase       40    24,696       2.6     10.94       2.00
T4 lysozyme; hen egg-white lysozyme                           60      7350       5.7     10.75       1.47
T4 lysozyme; carp Ca-binding protein                          60      5145       5.6     10.70       1.49
Cytochrome b5; myoglobin                                      60      2444       9.0     11.85       1.07
T4 lysozyme; hen egg-white lysozyme                           80      4250       6.1     12.32       1.63
Myoglobin; hemoglobin β chain                                 80      4958       1.5     11.42†      3.16†
Lactate dehydrogenase; glyceraldehyde-3-P-dehydrogenase       80    63,500       6.4     14.73       1.72
Lactate dehydrogenase; alcohol dehydrogenase                  80    18,500‡      9.0     14.47       1.64
Alcohol dehydrogenase; glyceraldehyde-3-P-dehydrogenase       80    18,796‡      8.7     14.47       1.83
Hexokinase; glyceraldehyde-3-P-dehydrogenase                  80    23,622‡      7.9     15.32       2.05
Hexokinase; lactate dehydrogenase                             80    92,750       I.9     15.10       2.09
Elastase; microbial protease SGPB                             80    17,066      10.0     15.23       1.71
Elastase (3/4)§; microbial protease SGPB                      80    10,706       6.9     15.28       1.99
Lactate dehydrogenase; glyceraldehyde-3-P-dehydrogenase      120    44,940      10.8     16.58       1.61
Lactate dehydrogenase; alcohol dehydrogenase                 120    13,440‡     11.9     16.40       1.65
Alcohol dehydrogenase; glyceraldehyde-3-P-dehydrogenase      120    13,696‡     10.7     16.54       1.78
Elastase; microbial protease SGPB                            120      7986      13.6     17.05       1.16

† Whenever there is very good structural agreement, as for hemoglobin and myoglobin, the low values of Rca will tend to decrease the average value of Rca and increase the standard deviation. This is particularly obvious for the hemoglobin/myoglobin comparison with a probe length of 80, and these values are not included in Figs 10 to 13. Strictly, the mean and standard deviation should be recalculated omitting the unusually low values of Rca.
‡ In these comparisons Rca was calculated for every 2nd residue of each protein. This reduces the number of Rca, and the computing time, by a factor of approx. 4.
§ See the text.
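The numbers of Rca values listed in Table 1 follow directly from the number of probe-length segments that fit into each chain. The sketch below is an illustration, not the authors' program: the function names are hypothetical, the chain lengths are those listed with Table 2, and it is assumed that "every 2nd residue" means that segment start positions are taken at every second residue along both chains.

```python
def n_segments(n_residues, probe_length, step=1):
    """Number of probe-length segments whose start is taken every `step` residues."""
    n_starts = n_residues - probe_length + 1
    if n_starts <= 0:
        return 0
    # Ceiling division: starts at positions 0, step, 2*step, ...
    return (n_starts + step - 1) // step


def n_comparisons(n_res_a, n_res_b, probe_length, step=1):
    """Number of segment pairs, i.e. the number of Rca values computed."""
    return (n_segments(n_res_a, probe_length, step)
            * n_segments(n_res_b, probe_length, step))


# Chain lengths (residues) as listed with Table 2.
T4L, HEL, LDH, GPD, ADH = 164, 129, 329, 333, 374

print(n_comparisons(T4L, HEL, 20))          # 15950
print(n_comparisons(LDH, GPD, 40))          # 85260
print(n_comparisons(LDH, ADH, 40, step=2))  # 24360
# Ratio of the full count to the every-2nd-residue count, close to 4.
print(n_comparisons(LDH, ADH, 40) / n_comparisons(LDH, ADH, 40, step=2))
```

Under these assumptions the counts reproduce the corresponding entries of Table 1, and thinning the start positions on both chains accounts for the reduction by a factor of approximately 4.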

[Figure 10: average structure agreement plotted against probe length (residues), with each point labelled by the abbreviations of the proteins compared.]

FIG. 10. Average structure agreement for different protein comparisons plotted as a function of the probe length. The Figure includes our comparisons of different proteins (triangles) together with McLachlan's (1979) comparisons of a protein with itself (circles). Abbreviations are as follows: CON, concanavalin A; PRE, prealbumin; CHT, α-chymotrypsin; SOD, superoxide dismutase; SUB, subtilisin; CPA, carboxypeptidase A; LDH, lactate dehydrogenase; TIM, triose phosphate isomerase; AK, adenylate kinase; HEL, hen egg-white lysozyme; CY5, cytochrome b5; CBP, carp calcium binding parvalbumin; MYH, myohemerythrin; MB, myoglobin; ELA, elastase; T4L, bacteriophage T4 lysozyme; SGB, microbial protease B; ADH, alcohol dehydrogenase; GPD, glyceraldehyde-3-phosphate dehydrogenase.


average structure agreement, i.e. the center of the observed distribution, for many different protein comparisons, plotted as a function of probe length. The Figure also includes data from McLachlan (1979), who has made a series of comparisons of a protein with itself, using the comparison method to look for repeated structure elements within a single polypeptide chain. As expected, the average structure agreement increases with the length of the probe. With very short probe lengths (L = 20), the average value of Rca for different proteins varies by almost a factor of two, but with longer probes the relative spread is much less. It might be expected that the average agreement when comparing two small, compact, α-helical proteins would be less than for a comparison of two large, extended β-type proteins (McLachlan, 1979). This dependence on structural type is significant with short probes, but becomes less noticeable with longer probes. For example, McLachlan (1979) found that the average structure agreement for a comparison of myoglobin with itself with a probe of 21 residues was 4.7 Å, whereas a comparison of concanavalin A with itself yielded a value of 8.9 Å. With a probe of 55 residues, the respective values become 10.9 Å and 14.9 Å, and all other proteins tested are well within these limits (see Fig. 10). The average structure agreement increases in proportion to the square-root of the probe length, and the data in Figure 10 can be fitted reasonably well by the equation:

$$\bar{R}_{ca} = 1.55\sqrt{L} \qquad (2)$$

where $\bar{R}_{ca}$ is the average structure agreement for comparing any pair of proteins, or a protein with itself, and L is the probe length. For the reasons discussed above, this equation is less reliable for short probes. In contrast to $\bar{R}_{ca}$, which depends on L, the standard deviation of the distribution of Rca is essentially independent of the probe length (Fig. 11), at least for probe lengths greater than 20 residues. This is an interesting and, perhaps, unexpected result. It

FIG. 11. Standard deviation of Rca for different protein comparisons plotted as a function of probe length (open circles). The filled circles are the standard deviations obtained from 10^6 random comparisons of 32 proteins (see the text) and the broken line is given by eqn (3). (Because McLachlan (1979) does not quote standard deviations, his data cannot be included here.)


might be anticipated that structure agreements obtained with a long probe would scatter over a much wider range than for a small probe, but this is not the case. The spread is essentially the same for a probe length of 20 residues as it is for 120.

As will be discussed below, we wished to combine the average structure agreement (eqn (2)) with the average standard deviation of Rca (Fig. 11) to obtain a generalized probability distribution of Rca. However, the generality of equation (2) and of Figure 11 might be suspect, since they were obtained from comparisons of a limited sample of proteins. Therefore, we carried out the following, more general, survey of the known protein structures. The alpha-carbon co-ordinates of 32 proteins were taken from the Protein Data Bank and used to construct a single list of about 7000 numbered co-ordinates. The 32 proteins included most of the known structures for which co-ordinates were available, excluding obviously homologous structures. Then, for a chosen probe length, a random number generator was used to choose from this list one million different pairs of structural segments for which Rca was calculated. Any chosen segment that happened to overlap the beginning or end of a protein was ignored. This survey was made for probe lengths of L = 10, 20, 40, 60, 80 and 120 residues. Thus, for each of these probe lengths, we obtained a distribution of Rca that was based on a large number of comparisons (10^6), which included data for many different proteins, and which avoided the redundancy and non-randomness that unavoidably occur when one compares a single protein with another. The results of these comparisons are summarized in Table 2. In Figure 12 we show the distributions for L = 10 and L = 60 residues. The distribution for L = 10 residues is atypical in having a pronounced maximum at an Rca value of about 0.6 Å. This is due primarily to the alignment of α-helical segments from different proteins, and presumably also includes some other localized secondary structure elements such as "hairpin bends". The plot illustrates quite clearly the limited

TABLE 2
Distribution of Rca for 10^6 random comparisons from 32 proteins

Probe length    Average Rca (Å)    Standard deviation of Rca (Å)

     10               3.88                     1.05
     20               6.75                     1.72
     40              10.47                     2.06
     60              12.76                     2.12
     80              14.30                     2.19
    120              16.70                     1.96

The proteins included in this calculation, with the number of residues in parentheses, are as follows: rubredoxin (53); ferredoxin (54); pancreatic trypsin inhibitor (59); cytochrome b5 (85); high potential iron protein (85); cytochrome c (103); Bence-Jones dimer (107); carp parvalbumin (109); cytochrome c2 (112); hen lysozyme (129); cytochrome c550 (134); flavodoxin (138); staphylococcal nuclease (142); superoxide dismutase (151); myoglobin (153); T4 phage lysozyme (164); Streptomyces griseus protease type B (185); adenylate kinase (194); immunoglobulin Fab (208); papain (212); concanavalin A (237); triose phosphate isomerase (247); carbonic anhydrase B (258); carboxypeptidase A (307); thermolysin (316); malate dehydrogenase (325); lactate dehydrogenase (329); glyceraldehyde-3-phosphate dehydrogenase (333); alcohol dehydrogenase (374); phosphoglycerate kinase (408); hexokinase (455); D-glucose-6-phosphate isomerase (514).
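The sampling step of the survey summarized in Table 2 might be sketched as follows. This is a minimal illustration under stated assumptions, not the program that was actually used: the function name, the seed and the return format are hypothetical, the sketch does not enforce that all pairs are different, and the calculation of Rca for each pair (which requires the alpha-carbon co-ordinates) is left out.

```python
import random
from bisect import bisect_right

# Chain lengths (residues) as listed with Table 2.
CHAIN_LENGTHS = [53, 54, 59, 85, 85, 103, 107, 109, 112, 129, 134, 138, 142,
                 151, 153, 164, 185, 194, 208, 212, 237, 247, 258, 307, 316,
                 325, 329, 333, 374, 408, 455, 514]


def sample_segment_pairs(chain_lengths, probe_length, n_pairs, seed=0):
    """Draw n_pairs random pairs of segments from one concatenated residue list.

    Each segment is returned as (protein_index, start_within_protein),
    both counted from 0.
    """
    if max(chain_lengths) < probe_length:
        raise ValueError("no chain is long enough for this probe length")
    rng = random.Random(seed)
    # offsets[k] is the position in the single list of the first residue of protein k.
    offsets = [0]
    for n in chain_lengths:
        offsets.append(offsets[-1] + n)
    total = offsets[-1]

    def draw_segment():
        while True:
            start = rng.randrange(total)
            k = bisect_right(offsets, start) - 1
            # Ignore a segment that would run past the end of protein k.
            if start + probe_length <= offsets[k + 1]:
                return k, start - offsets[k]

    return [(draw_segment(), draw_segment()) for _ in range(n_pairs)]


pairs = sample_segment_pairs(CHAIN_LENGTHS, probe_length=60, n_pairs=1000)
print(len(CHAIN_LENGTHS), sum(CHAIN_LENGTHS))  # 32 6680
print(pairs[0])
```

Drawing start positions from the single concatenated list, rather than first choosing a protein, weights each protein by its length; proteins shorter than the probe are never selected because every segment starting in them is rejected.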

FIG. 12. Distribution of Rca (Å) for 10^6 comparisons of structural segments chosen at random from 32 different proteins. The best-fit Gaussian is superimposed. (a) Probe length L = 10 residues. (b) Probe length L = 60 residues.

significance that can be attached to "good structural agreement" for a short probe length. Such agreement between two proteins may indicate no more than that they both contain α-helices.

As can be seen from Figure 13, the average values of Rca obtained from the extended data base agree well with equation (2) and confirm the validity of this empirical relation. The standard deviations of Rca obtained from the extended data base agree moderately well with those obtained for the individual comparisons, although they are at the high end of the ranges of individual values (Fig. 11). It will be noted in Figure 12(b), which shows the distribution for L = 60 residues for the extended data base, that there are more large values of Rca than expected for a Gaussian distribution. This is also observed for the extended data base with the other probe lengths, but is not readily apparent in the individual comparisons. A survey of the structure segments that give rise to these "bad" agreements shows that usually they are due to a superposition of a very compact region from a small protein (e.g. porcine trypsin inhibitor) on a very extended region from a larger protein (e.g. an immunoglobulin). Such comparisons will occur fairly often with the extended data base, which contains a number of very small proteins in addition to larger structures. In contrast, most of the individual comparisons are between proteins of comparable size.

As is illustrated in Figure 11, the dependence of the standard deviation of Rca on probe length, based on the extended data base, can be approximated by the empirical relation:

$$\sigma(R_{ca}) = 2.2\,\tanh(L/19) \qquad (3)$$

For probe lengths of 40 residues or longer, σ(Rca) has essentially reached its asymptotic value of 2.2 Å.
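As a check on the two empirical relations, equations (2) and (3) can be evaluated at the probe lengths of the random survey and set beside the values in Table 2. The short sketch below does this; the function names are illustrative only, and the observed values are copied from Table 2.

```python
import math


def mean_rca(probe_length):
    """Equation (2): average structure agreement, in angstroms."""
    return 1.55 * math.sqrt(probe_length)


def sigma_rca(probe_length):
    """Equation (3): standard deviation of Rca, in angstroms."""
    return 2.2 * math.tanh(probe_length / 19.0)


# Probe length -> (average Rca, standard deviation of Rca), copied from Table 2.
TABLE_2 = {10: (3.88, 1.05), 20: (6.75, 1.72), 40: (10.47, 2.06),
           60: (12.76, 2.12), 80: (14.30, 2.19), 120: (16.70, 1.96)}

for L, (avg_obs, sd_obs) in TABLE_2.items():
    print(f"L={L:3d}  mean: eqn {mean_rca(L):5.2f} / observed {avg_obs:5.2f}   "
          f"s.d.: eqn {sigma_rca(L):4.2f} / observed {sd_obs:4.2f}")
```

For L = 10 equation (2) gives about 4.9 Å against an observed average of 3.88 Å, in line with the remark that the relation is less reliable for short probes, whereas equation (3) follows the observed standard deviations closely up to L = 80.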


By combining equations (2) and (3) it is possible to construct a generalized probability distribution of Rca that can be used to estimate the significance of a given structural correspondence for any probe length. Such a formulation would be useful to quantitate the significance of a particular structural resemblance that might have been detected between two proteins. Also, the same measure of agreement could be used to calibrate the agreement between a predicted structure of a protein and the observed conformation.

The generalized probability distribution is illustrated in Figure 13. By using a scale proportional to the square-root of the probe length along the abscissa, equation (2) becomes a straight line giving the average value to be expected when comparing two protein backbone segments of length L residues. The observed average values are included in the Figure for comparison. The broken line drawn below the average value line shows values of Rca that are 1σ better than the average, for a given probe length. Successive lines for 2σ, 3σ and 4σ are also shown. The corresponding frequencies for these four lines are also included in the Figure. For example, if a value of Rca is at the 3σ level, then the probability of this occurring by chance is 0.13% or 1 in 800.

FIG. 13. Generalized structure agreement probability diagram. The solid line gives the average structure agreement as a function of probe length (eqn (2)). For comparison, the observed values for individual comparisons are also shown as open circles (cf. Fig. 10). The filled circles were obtained from 10^6 random comparisons of 32 proteins. Successive broken lines give structure agreements that are better than average by 1σ, 2σ, 3σ . . . The frequencies with which these levels of agreement are expected to occur by chance in a random population are also shown. The scale of the abscissa is proportional to the square-root of the probe length.

It has to be emphasized that Figure 13 is empirical, and will not give precise values for probabilities. Rather, it is intended as a guide to obtain the approximate expectation for a given value of Rca that has been obtained as a result of some comparison. As discussed above, the average value of Rca is the same for individual comparisons as for comparisons made from the extended data base, but the standard deviation of Rca is somewhat larger in the latter case. The value of σ(Rca) for L greater than 40 is about 2.2 Å for the extended data base, but has an average value of 1.8 Å for the individual comparisons. Since the lines in Figure 13 for 1, 2, 3, 4 standard deviations are based on the extended data base, they are, if anything, drawn conservatively and in any case are intended only as a guide to the approximate significance of a particular observation.

One of the interesting implications of Figure 13 is that a "good" structure agreement between two short segments may not be very significant. For example, an agreement of 3.5 Å between two 20-residue segments is only at the level of about 2σ (1 in 44). Two 40-residue segments have to agree within about 3.4 Å to be at the 3σ level. On the other hand, very high significance can be attached to moderate agreement over an extended length. For example, an agreement of 6.5 Å over 100 residues is at the 4σ level.

Several attempts have been made to predict the three-dimensional structures of small proteins from their amino acid sequences. It is possible to use Figure 13 to evaluate the success of such methods. For example, Levitt & Warshel (1975) used a simplified representation of protein conformation to simulate the folding of pancreatic trypsin inhibitor and, starting from a fully extended structure, obtained a conformation that differed from the native structure by an average value of 7.7 Å for the 58 residues in the protein. From Figure 13, this corresponds to about 2σ; i.e. better than the value of 11.5 Å expected for a structure drawn at random, but better with only medium significance. (By requiring that residues 48 to 58 of trypsin inhibitor be in a helix, Levitt & Warshel obtained a structure agreement of 6.5 Å, but, since this number includes some prior knowledge of the structure, its significance is unclear.) As emphasized by Hagler & Honig (1978), a "good" prediction of a protein structure should not only have a low root-mean-square discrepancy, but also have the same topology as the native structure. Neither the best Levitt-Warshel structure nor a structure predicted by Hagler & Honig with a discrepancy of 6.2 Å (corresponding to 2.5σ in Fig. 13) meets both criteria.
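The worked examples above can be reproduced numerically from equations (2) and (3). The following sketch is illustrative rather than a substitute for Figure 13: the function names are hypothetical, and it is assumed that the quoted chance frequencies are one-sided tail probabilities of a Gaussian distribution, which is consistent with the value of 1 in 44 quoted for 2σ.

```python
import math


def sigma_level(rca, probe_length):
    """Number of standard deviations by which rca is better (lower) than average."""
    mean = 1.55 * math.sqrt(probe_length)      # eqn (2)
    sd = 2.2 * math.tanh(probe_length / 19.0)  # eqn (3)
    return (mean - rca) / sd


def chance_frequency(z):
    """One-sided Gaussian tail: chance of agreement z or more s.d. better than average."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# (Rca in angstroms, probe length in residues) for examples quoted in the text.
for rca, L in [(3.5, 20), (6.5, 100), (7.7, 58)]:
    z = sigma_level(rca, L)
    print(f"Rca = {rca} A over {L} residues: {z:.2f} s.d., "
          f"about 1 in {1.0 / chance_frequency(z):.0f}")
```

With these assumptions the three examples come out close to 2σ, 4σ and 2σ respectively, as read from the diagram. The exact Gaussian tail at 3σ is nearer 1 in 740 than the rounded 1 in 800, a difference that is immaterial given the empirical nature of the diagram.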

4. Conclusion

The comparison method readily detects similar structural segments that are not interrupted by large insertions and deletions. If the insertions and deletions in two proteins "compensate" for each other, it may still be possible to detect structural similarities by using a long probe. However, if all the deletions are in one protein, and none in the other, then regions of structural agreement may not extend over long enough stretches to be detected. In principle, it should be possible to extend the method to allow for insertions and deletions, and methods of doing this are being tested.

For medium and long probes, the average value of the structure agreement does not depend very much on the type of structure being compared. The average value of the structure agreement increases with the square-root of the probe length but, for probe lengths above about 40 residues, the standard deviation of the observed structure agreements is independent of probe length. From these observations it is possible to construct a generalized probability diagram to evaluate the significance of structure agreements that are obtained in comparing any two protein structure segments. The probability diagram shows that it is relatively easy to find "good" structural equivalence between short backbone segments of 10 to 30 residues. In contrast, good agreement over extended pieces of backbone of 60 to 100 residues is much harder to find and can, therefore, be ascribed higher significance.

We thank Dr Andrew McLachlan for helpful discussions on comparison methods, and for providing a copy of his comparison algorithm. Also, we thank Dr William Bennett for a number of helpful comments on the first draft of this manuscript. This work was supported in part by grants from the National Institutes of Health (GM21967, GM20066) and the National Science Foundation (PCM77-19310).

REFERENCES

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.
Blow, D. M. (1969). Biochem. J. 112, 261-268.
Delbaere, L. T. J., Hutcheon, W. L. B., James, M. N. G. & Theissen, W. E. (1975). Nature (London), 257, 758-763.
Delbaere, L. T. J., Brayer, G. D. & James, M. N. G. (1979). Canad. J. Biochem. 57, 135-144.
Eventoff, W. & Rossmann, M. G. (1975). CRC Crit. Rev. Biochem. 3, 111-140.
Fitch, W. M. (1966). J. Mol. Biol. 16, 9-16.
Fitch, W. M. (1970). J. Mol. Biol. 49, 1-14.
Freer, S. T., Kraut, J., Robertus, J. D., Wright, H. T. & Xuong, Ng. H. (1970). Biochemistry, 9, 1997-2009.
Garavito, R. M., Rossmann, M. G., Argos, P. & Eventoff, W. (1977). Biochemistry, 16, 5065-5069.
Hagler, A. T. & Honig, B. (1978). Proc. Nat. Acad. Sci., U.S.A. 75, 554-558.
Huber, R., Epp, O., Steigemann, W. & Formanek, H. (1971). Eur. J. Biochem. 19, 42-50.
James, M. N. G., Delbaere, L. T. J. & Brayer, G. D. (1978). Canad. J. Biochem. 56, 396-402.
Levitt, M. & Chothia, C. (1976). Nature (London), 261, 552-558.
Levitt, M. & Warshel, A. (1975). Nature (London), 253, 694-698.
Matthews, B. W., Cohen, G. H., Silverton, E. W., Braxton, H. & Davies, D. R. (1968). J. Mol. Biol. 30, 179-183.
McLachlan, A. D. (1979). J. Mol. Biol. 128, 49-79.
McLachlan, A. D. & Shotton, D. M. (1971). Nature New Biol. 229, 202-205.
Ohlsson, I., Nordström, B. & Brändén, C.-I. (1974). J. Mol. Biol. 89, 339-354.
Olson, M. O. J., Nagabhushan, N., Dzwiniel, M., Smillie, L. B. & Whittaker, D. R. (1970). Nature (London), 228, 438-442.
Rao, S. T. & Rossmann, M. G. (1973). J. Mol. Biol. 76, 241-256.
Remington, S. J. & Matthews, B. W. (1978). Proc. Nat. Acad. Sci., U.S.A. 75, 2180-2184.
Richardson, J. S. (1977). Nature (London), 268, 495-500.
Richardson, J. S., Richardson, D. C., Thomas, K. A., Silverton, E. W. & Davies, D. R. (1976). J. Mol. Biol. 102, 221-235.
Rossmann, M. G. & Argos, P. (1976). J. Mol. Biol. 105, 75-96.
Rossmann, M. G. & Argos, P. (1977). J. Mol. Biol. 109, 99-129.
Rossmann, M. G., Moras, D. & Olsen, K. W. (1974). Nature (London), 250, 194-199.
Schulz, G. E. & Schirmer, R. H. (1974). Nature (London), 250, 142-144.
Shotton, D. M. & Watson, H. C. (1970). Phil. Trans. Roy. Soc. ser. B, 257, 111-118.
Sternberg, M. J. E. & Thornton, J. M. (1977). J. Mol. Biol. 110, 269-283.
Webb, L. E., Hill, E. J. & Banaszak, L. J. (1973). Biochemistry, 12, 5101-5109.