J. Mol. Biol. (1978) 126, 315-332
The Tree Structural Organization of Proteins
GORDON M. CRIPPEN
Department of Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, Calif. 94143, U.S.A.
(Received 22 May 1978)
We offer an objective definition of the domains of a protein, given its Cα co-ordinates from high-resolution X-ray crystal studies. This is done by an algorithm which groups segments of the polypeptide chain together when there are a relatively large number of contacts between the two segments. The result is an organizational tree showing a hierarchy of segments grouping together, then clusters merging until all parts of the chain are included. In this view the highest level clusters correspond well to more subjective definitions of folding domains and the lowest level, the segments, roughly match the usual assignments of pieces of secondary structure. The intermediate level clusters suggest possible folding mechanisms, which are discussed.
1. Introduction
Much effort has been devoted by innumerable workers towards understanding the structural organization of globular proteins, ever since the first protein crystal structures were solved. The majority of progress has been in recognizing and predicting the lower levels of organization: the secondary structural elements, such as α-helix, extended strands, and turns (e.g. Levitt & Greer, 1977). These features can be described solely in terms of distances among sequentially close residues (Crippen, 1977a), and even a full knowledge of such local structure is clearly insufficient to determine the tertiary folding of the protein (Havel et al., 1978). Part of the reason for this is that a sizeable fraction of the polypeptide chain of many proteins has apparently irregular conformations, all classified as “coil”. For example, Levitt & Greer (1977) report approximately 10% of the residues in their survey of proteins as not falling into their α, β, or turn categories. To date, no theory of protein folding or phenomenological survey of protein structure has given an important role to these segments, yet they are obviously important in defining the tertiary conformation.
The next higher order of structure is generally taken to be the packing of the elements of secondary structure together to form β-sheets (Chothia, 1973; Richardson, 1977; Sternberg & Thornton, 1977) or clusters of helices and other elements (Lim, 1974; Chothia et al., 1977). Not only is this thought of as a description of a higher level of organization in the final native conformation, but it is widely believed that this is a step in the actual folding process (see for instance, Lim & Efimov, 1977; Ptitsyn & Rashin, 1975; Rose et al., 1976; Tanaka & Scheraga, 1977). The concept of such a mechanism of folding is certainly appealing, but attempts to use it to
0022-2836/78/350315-18
$02.00/0 © 1978 Academic Press Inc. (London) Ltd.
calculate the native conformation have not been as generally successful as the less intuitive energy optimization methods (Levitt & Warshel, 1975; Kuntz et al., 1976, 1978). Once again, the role of the coil segments is unclear, except perhaps to serve as flexible connectors between helices and extended strands.
The highest level of tertiary structure is the packing together of “domains” to form the entire folded polypeptide chain. The existence of domains in particular protein crystal structures has been noted by many workers, especially Wetlaufer (1973), Rao & Rossmann (1973), Rossmann & Liljas (1974), and Liljas & Rossmann (1974). The general notion is that a domain consists of a compact folding of (usually) a contiguous piece of the chain, which has a rather simple boundary in space with an adjacent domain, much like that between two lumps of clay which have been pressed together. Domains are frequently referred to as “wings” or “lobes”, and are often joined by only one strand of the chain, so that two domains are held together only by weak, non-covalent forces. Rossmann & Liljas (1974) further characterize a domain as having many short residue-residue distances within itself, but few short distances between the domain and the rest of the protein. A less subjective definition of the concept has not been set down, nor is it particularly clear how the domain relates to the β-sheet or other aggregates of the previous paragraph, except that a domain contains such features.
In this paper, we propose an objective method for detecting domains given the Cartesian co-ordinates of the α-carbons of the residues from X-ray crystallography. The method is independent of any preconceived ideas about what secondary structural elements are important, yet it automatically gives a description of how they interact to form the domains.
The assumptions behind the method are as follows:
(i) The stabilization of the native conformation is due to energetically favorable interactions between residues that are for the most part relatively distant in sequence, i.e. “long-range interactions”.
(ii) These important long-range interactions can be recognized in the static crystal structure by their especially short Cα-Cα distances. Thus we disregard the possibility that some residues may be brought close together in spite of an unfavorable energy of interaction simply due to steric and other geometric constraints dictated by the amino acid sequence. There is probably enough flexibility in the side-chains and even in the backbone that for a given approximate approach of two strands of the chain, unfavorable interactions can be avoided.
(iii) We will take each short distance, hereafter referred to as a “contact”, to be equally important for the protein folding. This is an assumption that can be easily improved upon at a later date, but for the present, it allows the method to be completely independent of any estimates of interaction energies. Hence we propose an analysis of the structural organization of globular proteins based strictly on the structural evidence.
(iv) Having assumed that the only important feature of the crystal data is the set of long-range contacts, we purposely avoid making use of the usual secondary structure assignments, which are based on local distances. The coil segments of the chain are held to be as important to the overall folding as bends or the runs of regular α-helix or β strands, even though their conformations are more difficult to describe. This point is debatable on the grounds that there are statistical differences in amino acid composition among α, β, bend, and coil segments, which implies different interaction energies, but this effect is neglected to keep the algorithm simple.
(v) It is not assumed that sequentially adjacent segments preferentially form long-range contacts, compared to the ability of non-adjacent segments to do so. In other words, we neglect
the contribution to the free energy from the entropy of loop-closing as a function of loop size, when we examine the contacts between segments. Once again this assumption is made for the sake of simplicity, and to keep the method on a strictly geometric basis. Furthermore, we want to test the suggestion by some authors in the field (e.g. Wetlaufer, 1973) that sequential grouping of segments is indeed preferred.
Simply stated, the algorithm consists of first dividing up the chain into contiguous “segments” which have no internal long-range contacts. Then the segments are grouped together by finding the pair with the greatest number of contacts between them in relation to the number of contacts that could be made. The “cluster” thus formed is subsequently allowed to pair with other clusters or the remaining segments by the same criterion until all clusters have been grouped together into a single one, comprising the entire chain. The precise definition of the algorithm is given in Methods, and the resulting organizational trees for some proteins are presented in Results. In the Discussion, we consider some of the implications of the results for the analysis of protein crystal data and for the folding mechanism itself.
2. Methods
We devised a binary tree clustering algorithm for the residues of a single polypeptide chain as follows (see Hartigan (1975) for a good overview of clustering algorithms and trees):
(i) Beginning with the first residue, lengthen the forming segment by adding on the sequentially following residue, one residue at a time, always checking that the new residue has no long-range contacts with the preceding residues of the segment. The contact-no contact cutoff is taken to be a Cα-Cα distance of 9 Å. An interaction is taken to be long-range when the sequence numbers of the 2 residues differ by 7 or more. The choice of these 2 parameters is discussed later.
(ii) The next segment begins where the last one was forced to leave off according to the long-range contact criterion above. In order to counteract any tendency for the segments to “overshoot” in the direction of increasing sequence numbers, the beginning of the second and subsequent segments is allowed to advance toward decreasing sequence numbers as long as these prepended residues have no long-range contacts with the rest of the segment. This process tends to “undershoot”, so that the final boundary between two segments is taken to be the mean of the end of the first and the beginning of the second.
(iii) When all segments have been determined according to the previous 2 steps, the first segment beginning with residue 1, and the last ending with the C-terminal residue, these fixed segments of sequentially contiguous residues initially comprise the set of objects to be clustered.
(iv) If one object contains n residues, and another contains m, then the contact density between the 2 objects is defined to be the number of contacts observed between all possible pairs of residues of the one object with residues of the other, divided by n × m. Choose the 2 objects with the highest contact density as a new object, or cluster, and remove the 2 component objects from further consideration.
(v) Continue to pair objects according to the previous step until there remains only the single object, which is a cluster containing all the residues.
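Steps (i) to (v) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original program. The function names are hypothetical and residue indices are 0-based. The rounding of the mean boundary, the lower limit on the backward extension (the raw start of the preceding segment), the tie-breaking among equal densities, and the use of long-range contacts only when counting contacts between objects are all choices made here that the text does not spell out.

```python
import math
from itertools import combinations

CONTACT_CUTOFF = 9.0   # Angstrom, C-alpha to C-alpha
LONG_RANGE = 7         # minimum sequence separation of a long-range contact


def contact_map(coords):
    """contact[i][j] is True when residues i and j form a long-range contact."""
    n = len(coords)
    contact = [[False] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if j - i >= LONG_RANGE and math.dist(coords[i], coords[j]) <= CONTACT_CUTOFF:
            contact[i][j] = contact[j][i] = True
    return contact


def find_segments(contact):
    """Steps (i)-(iii): list of (first, last) residue indices, 0-based."""
    n = len(contact)
    raw, start = [], 0
    while start < n:                       # step (i): grow each segment forward
        end = start
        while end + 1 < n and not any(contact[end + 1][k]
                                      for k in range(start, end + 1)):
            end += 1
        raw.append((start, end))
        start = end + 1
    cuts = []                              # last residue of every segment but the final one
    for (p_start, p_end), (s, e) in zip(raw, raw[1:]):
        b = s                              # step (ii): extend the later segment backward
        while b - 1 >= p_start and not any(contact[b - 1][k]
                                           for k in range(b, e + 1)):
            b -= 1
        cuts.append((p_end + b) // 2)      # assumed rounding of the mean boundary
    starts = [0] + [c + 1 for c in cuts]
    return list(zip(starts, cuts + [n - 1]))


def cluster(segments, contact):
    """Steps (iv)-(v): merges as (id_a, id_b, new_id, density); ids are 1-based."""
    objects = {i + 1: list(range(a, b + 1)) for i, (a, b) in enumerate(segments)}
    next_id = len(segments) + 1
    merges = []

    def density(pair):
        x, y = objects[pair[0]], objects[pair[1]]
        hits = sum(contact[i][j] for i in x for j in y)
        return hits / (len(x) * len(y))

    while len(objects) > 1:
        a, b = max(combinations(sorted(objects), 2), key=density)
        merges.append((a, b, next_id, density((a, b))))
        objects[next_id] = objects.pop(a) + objects.pop(b)
        next_id += 1
    return merges
```

With s segments the returned list always holds s - 1 merges, and the new identifiers run from s + 1 to 2s - 1.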
In the terminology of clustering algorithms, the above method induces a binary tree structure on the primary objects, the segments. Each node of the tree above the first layer has been called a cluster, each cluster consisting of only 2 objects from lower levels, either segments or other clusters. The root of the tree is the cluster that includes all residues. If the segment-defining steps result in s segments, then the number of clusters is necessarily always s - 1; for convenience we will number the segments 1 through s and the clusters s + 1 through 2s - 1. Insisting on a binary tree may seem arbitrary at first, since for instance, how would 3 extended strands group to form a 3-strand β-sheet? This actually poses no problem,
because strands 1 and 2 could pair first, and then the cluster could add on strand 3 (or any permutation of 1, 2 and 3). The result in any case is a final cluster of all 3 strands, corresponding to the sheet. The other main problem is the choice of the contact cutoff and the long-range cutoff. The values of 9 Å and 7 residues, respectively, come from best fitting the secondary structure assignments of myoglobin to the segments generated. Table 1 shows the result on sperm whale myoglobin of 3 choices of cutoffs: 10 Å and 6 residues, 10 Å and 7 residues, and 9 Å and 7 residues. Decreasing the contact cutoff and increasing the long-range cutoff both tend to give fewer but longer segments. The choice of 9 Å and 7 residues matches the secondary structure assignment by Levitt & Greer (1977) rather well. Note however that the segments tend to extend further in both directions than the corresponding α-helices. The segment assignment only ensures that the segment does not form long-range contacts with itself. It need not otherwise have a well-defined secondary structure, nor be particularly straight. Myoglobin is especially sensitive to the choice of cutoffs because of the danger of confusing the many local intra-helix contacts with long-range contacts. The segment assignment of pancreatic trypsin inhibitor, on the other hand, is rather insensitive to the parameters, due to its low helix content and abrupt bends. Table 2 shows a wider comparison of segment choice versus secondary structure assignment for several other proteins. In general, a segment in our algorithm corresponds to one or more whole pieces of helix or extended plus parts of coil or bend regions. Thus the segments are at least intuitively reasonable units for folding. (The one segment in each of rubredoxin and ribonuclease which corresponds to 3 β strands according to Levitt & Greer is not a sheet, but rather 3 straight pieces of chain joined by broad bends.)
The proteins in Table 2 are all relatively small, but span the range from all-helical to all-β-sheet. The correspondence between segments and secondary structure is similar in the larger proteins we have examined. The sensitivity of the clustering assignments to choice of cutoffs is more difficult to assess. The lower half of Table 1 shows the residues included in each cluster for the 3 different choices of contact and long-range cutoffs. The clusters under each different parameter choice have been entered for maximum match with those of the other parameter choices. Of course the fewer the segments, the fewer the clusters. Even though the tree for myoglobin is relatively complicated, involving clusters of non-contiguous runs of residues, the 3 sets of clusters show reasonable resemblance to each other, taking into account that the first set involves many more segments (and hence clusters) than the other two. For example, note how in each set, one of the largest domains always consists of the first 20 residues plus the latter 30, while the other large domain contains the intermediate residues. The degree of complexity of the trees will be of interest in later discussion. We adopt the following quantitative (if arbitrary) definition: let the “break fraction” be the average over all clusters in the tree of the ratio of the number of breaks in sequence for the residues comprising the cluster to the number of breaks that could have been made. For example, if a cluster is made up of 5 segments in a protein of 12 segments altogether, those segments could have been chosen so that there would be at the most 4 gaps in sequence, or as few as none. However, if there are only 6 segments in total, a 5-segment cluster could have at the most 1 break in sequence, or as few as none. If the value of the break fraction is zero, there were no clusters with sequence gaps in their make-up.
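The break fraction can be sketched in Python as below. The tree is written as nested pairs of segment numbers, and the helper names are hypothetical. Two readings are assumed here rather than stated in the text: the maximum number of breaks for a cluster of c segments out of s is taken as min(c - 1, s - c), which reproduces both worked examples above, and clusters for which no break is possible (the root) are left out of the average. With the myoglobin tree for the 7 residue and 9 Å cutoffs, this reading gives the value of 0.262 quoted for myoglobin.

```python
def leaves(tree):
    """Sorted segment numbers under a node (an int or a pair of nodes)."""
    if isinstance(tree, int):
        return [tree]
    return sorted(leaves(tree[0]) + leaves(tree[1]))


def clusters(tree):
    """Segment lists of every cluster (internal node) in the tree."""
    if isinstance(tree, int):
        return []
    return clusters(tree[0]) + clusters(tree[1]) + [leaves(tree)]


def break_fraction(tree):
    s = len(leaves(tree))
    ratios = []
    for segs in clusters(tree):
        c = len(segs)
        most = min(c - 1, s - c)          # assumed maximum number of breaks
        if most == 0:                     # assumption: skip clusters that cannot break
            continue
        breaks = sum(1 for a, b in zip(segs, segs[1:]) if b != a + 1)
        ratios.append(breaks / most)
    return sum(ratios) / len(ratios) if ratios else 0.0


# Myoglobin tree, 9 segments, 7 residue / 9 Angstrom cutoffs
myoglobin = ((1, (8, 9)), (((2, 7), (3, 4)), (5, 6)))
print(round(break_fraction(myoglobin), 3))
```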
Such a protein is built up of only sequentially contiguous domains and subdomains, and is therefore relatively simple. At the other extreme, a break fraction of unity implies that every cluster in the entire tree consists of groups of sequentially non-adjacent segments. The break fraction of myoglobin, 0.262, is rather high compared to most proteins, and reflects the gaps in sequence for the clusters at the 7 residue-9 Å cutoffs choice shown in Table 1. It is important to be able to compare tree structures with each other. In graph theory it is customary to treat the nodes as indistinguishable when such comparisons are made, but this is not appropriate in our case. It is physically reasonable to compare 2 trees having different numbers of segments, since for example, they could be from closely related proteins, but the segment locating algorithm happened to leave whole a curved segment in one while declaring its more tightly curved counterpart in the other to be 2
TABLE 1
Sperm whale myoglobin tree structures for differing contact and long-range cutoff choices: comparison of segments to standard assignment of helices
Cutoffs are given as long-range (residues) / contact (Å): 6 / 10, 7 / 10, and 7 / 9.
Segments
6 / 10: 1-8, 9-16, 17-22, 23-28, 29-34, 35-44, 45-53, 54-60, 61-66, 67-74, 75-79, 80-88, 89-94, 95-105, 106-113, 114-120, 121-127, 128-133, 134-139, 140-145, 146-149, 150-153
7 / 10: 1-7, 8-19, 20-36, 37-43, 44-54, 55-61, 62-79, 80-96, 96-116, 117-129, 130-141, 142-148, 149-153
7 / 9: 1-20, 21-36, 37-46, 47-56, 57-78, 79-96, 97-119, 120-148, 149-153
Standard helices†: 3-19 “A”, 20-36 “B”, 37-43 “C”, 44-49, 51-58 “D”, 59-78 “E”, 85-97 “F”, 100-119 “G”, 124-149 “H”
Clusters (residues separated by commas belong to one cluster; clusters are separated by semicolons)
6 / 10: 146-153; 17-28; 128-139; 140-153; 54-66; 75-88; 29-44; 114-127; 89-105; 1-16; 17-28, 54-66; 29-53; 67-88; 106-127; 89-105, 140-153; 1-16, 128-139; 1-16, 106-139; 17-66; 67-105, 140-153; 1-16, 67-153; 1-153
7 / 10: 142-153; 44-61; 20-43; 1-19; 80-96, 142-153; 1-19, 130-141; 20-43, 96-116; 20-61, 96-116; 62-95, 142-153; 20-116, 142-153; 1-19, 117-141; 1-153
7 / 9: 120-153; 37-56; 57-96; 21-36, 97-119; 21-56, 97-119; 1-20, 120-153; 1-153
† Levitt & Greer (1977).
TABLE 2
Comparison of segment assignments with secondary structure assignments by Levitt & Greer (1977)
Protein | Segments | 2° structure
Pancreatic trypsin inhibitor | 1-15, 16-26, 27-39, 40-47, 48-58 | α: 2-7; β: 16-25; β: 28-37; β: 43-46; α: 47-55
Ribonuclease S | 1-27, 28-36, 37-50, 51-67, 68-92, 93-113, 114-124 | α: 3-12; β: 13-15; α: 24-33; β: 34-36; β: 42-49; α: 50-58; β: 61-63; β: 68-76; β: 78-87; β: 89-91; β: 94-111; β: 116-124
Carp myogen | 1-7, 8-21, 22-34, 35-40, 41-55, 56-71, 72-93, 94-109 | α: 7-19; α: 25-33; α: 41-51; α: 60-64; α: 65-70; α: 78-88
Rubredoxin | 1-8, 9-24, 25-41, 42-53 | β: 3-7; β: 10-14; β: 17-20; β: 23-26; β: 48-52
segments. Also the protein sequence order of the segments is of interest, because we consider the 3-segment clustering “1 and 2, then 3” to be different from “1 and 3, then 2”. However, we are not especially concerned whether the cluster consisting of segments 4 and 6 is the first one created by the algorithm or the tenth. From these considerations, we propose the following quantitative tree comparison algorithm.
(i) Given 2 binary trees, A and B, with B involving more segments (and therefore also more clusters) than A, choose a sequence-preserving, one-to-one mapping (i.e. correspondence) of the segments of A into those of B. Then each segment of A will correspond to exactly 1 segment of B, and a subsequent A segment corresponds to a subsequent one in B. However, there will be some B segments which correspond to nothing in A, when there are strictly more segments in B than A.
(ii) The B tree may now be simplified by eliminating the unused segments and condensing the clusters which include them. The result is a new tree, called B', which now has just the same number of segments and clusters as A. Of course if the original B had the same number of segments as A, then there is only one allowed mapping in the first step, and this second step is unnecessary.
(iii) The mapping of segments implies a mapping of clusters. For example, if segment 1_A corresponds to 3_B, and 2_A to 5_B, then if cluster 11_A consists of 1_A and 2_A, there should be some cluster in B' consisting of 3_B and 5_B, but this cluster need not be numbered 11_B'. Find the number of clusters in B' that match ones in A according to this rule.
(iv) Try all possible mappings in the first step, and choose the one that gives the greatest number of matches in the third step. This number may range from zero to the number of clusters in A. Let the match fraction = (best number of matches)/(number of clusters in A) be the quantitative measure of similarity between the trees while maintaining segment order. If the match fraction = 0, there is no similarity by this criterion. If the match fraction = 1, then tree A is contained within tree B, perhaps several different ways. We will refer to the number of ways to achieve the optimal match fraction as the match multiplicity. Of course intermediate values of the match fraction indicate partial resemblance of the 2 trees.
We have examined 25 proteins in all, using the 7 residue and 9 Å cutoffs throughout. These are all the proteins referred to by either Wetlaufer (1973) or Liljas & Rossmann (1974) in their discussions of domains, and for which co-ordinates were readily available through the Brookhaven Protein Data Bank. They are: tosyl-α-chymotrypsin, tosyl elastase, lactate dehydrogenase, malate dehydrogenase, sperm whale myoglobin, papain, phosphoglycerate kinase, carp myogen (calcium-binding protein or parvalbumin), bovine pancreatic trypsin inhibitor, ribonuclease S, rubredoxin, staphylococcal nuclease, subtilisin BPN', thermolysin, β-trypsin, hen egg-white lysozyme, liver alcohol dehydrogenase, carbonic anhydrase C, carboxypeptidase, concanavalin A, oxidized cytochrome c, ferredoxin, oxidized flavodoxin, glyceraldehyde-3-phosphate dehydrogenase, and high potential iron protein.
The trees for all 25 proteins are presented in compact form in Table 5.
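The tree comparison of steps (i) to (iv) above can be sketched in Python as a brute-force search over all sequence-preserving mappings. Trees are nested pairs of segment numbers, and the function names are hypothetical. One assumption is worth flagging: this sketch counts the root cluster like any other, so the best count is never below 1, whereas the text allows a count of zero; how the root was treated in the original work is not stated.

```python
from itertools import combinations


def leaf_set(tree):
    """Set of segment numbers under a node (an int or a pair of nodes)."""
    if isinstance(tree, int):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])


def cluster_sets(tree):
    """Every cluster of the tree, each as a frozenset of its segments."""
    if isinstance(tree, int):
        return set()
    return cluster_sets(tree[0]) | cluster_sets(tree[1]) | {leaf_set(tree)}


def match_fraction(tree_a, tree_b):
    """Return (match fraction, match multiplicity); A has no more segments than B."""
    segs_a = sorted(leaf_set(tree_a))
    segs_b = sorted(leaf_set(tree_b))
    clusters_a = cluster_sets(tree_a)
    clusters_b = cluster_sets(tree_b)
    best, ways = -1, 0
    # Step (i): each increasing choice of B segments is one order-preserving map.
    for chosen in combinations(segs_b, len(segs_a)):
        to_b = dict(zip(segs_a, chosen))
        used = frozenset(chosen)
        # Step (ii): condense B to B' by dropping the unused segments.
        condensed = {c & used for c in clusters_b if len(c & used) >= 2}
        # Step (iii): count A clusters whose image is a cluster of B'.
        matches = sum(frozenset(to_b[x] for x in c) in condensed
                      for c in clusters_a)
        # Step (iv): keep the best count and how many mappings reach it.
        if matches > best:
            best, ways = matches, 1
        elif matches == best:
            ways += 1
    return best / len(clusters_a), ways


tree_a = ((1, 2), 3)
tree_b = ((1, (2, 3)), 4)
print(match_fraction(tree_a, tree_b))   # → (1.0, 3)
```

The search grows combinatorially with the difference in segment counts, which is tolerable for trees of the size discussed here but would need pruning for much larger ones.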
3. Results
The clusters are formed by pairing the available clusters or segments with the highest contact density. Thus the first clusters calculated are those with their component residues in the most intimate contact, while the last clusters always have few contacts between their two component parts relative to the number of residues involved. There are large differences in the contact densities between the first and last segments for each protein examined, and the range of densities is similar in all the proteins. Figure 1 is a histogram of contact densities of all clusters in all proteins showing a consistent maximum around density 0.04 contacts/residue². It is clear there are major differences in contact densities of clusters, with the clusters of larger
FIG. 1. Histogram of the number of occurrences, N, of clusters formed with contact density ρ contacts/residue², taken over all 25 proteins.
numbers of residues being apparently rather compact so that they abut other clusters on a relatively small surface, forming few close contacts. We have quantitatively examined the sizes of clusters, as measured by both number of residues and radius of gyration, in order objectively to distinguish domains from subdomains, and to establish that our tree-calculated domains indeed have the characteristics described in the Introduction. We begin by supposing that a domain or an entire globular protein may be approximated by an ellipsoid of revolution having semi-axes a, a and ea, where e is the ellipticity. Further suppose that the n residues are packed inside the ellipsoid at some uniform density, so that there are u Å³/residue. Then the total volume
$$V = \tfrac{4}{3}\pi a^{3} e, \qquad (1)$$
and the radius of gyration
$$k = a\left[\frac{2+e^{2}}{5}\right]^{1/2} \qquad (2)$$
(Damaschun et al., 1969). Then combining equations (1) and (2) with V = un, we obtain
$$k = \left[\frac{2+e^{2}}{5}\right]^{1/2}\left(\frac{3u}{4\pi e}\right)^{1/3} n^{1/3}. \qquad (3)$$
Figure 2 shows k versus n for the 25 proteins, calculated from the Cα co-ordinates $\mathbf{v}_i$, i = 1, ..., n, by
$$k_{obs} = \left[\frac{1}{n}\sum_{i=1}^{n}\left|\mathbf{v}_i-\mathbf{v}_0\right|^{2}\right]^{1/2}, \qquad (4)$$
where $\mathbf{v}_0$ is the vector location of the center of mass of the n Cα atoms. The curve drawn through the observed points is the least-squares fit of equation (3) to the data, resulting in e = 1.04 and u = 191.00 Å³/residue. (Actually this is only a 1-parameter fit of the coefficient for $n^{1/3}$.) In other words, we expect a globular protein of n residues
FIG. 2. Radius of gyration, k, as a function of number of residues, n, for the 25 whole proteins (○); curve is the least-squares fit to k = constant × n^(1/3).
FIG. 3. Radius of gyration, k, as a function of number of residues, n, for each cluster of all 25 proteins (○); (solid line) same fitted curve as in Fig. 2; (broken line) 1.2 times the solid curve. Clusters fulfilling the radius of gyration test for domain lie between the 2 curves.
to have a radius of gyration close to
$$k_{calc} = 2.77\, n^{1/3}. \qquad (5)$$
It is reasonable to require a domain to have a relatively small k to ensure globularity, but this alone is insufficient. Figure 3 shows k_obs versus n for all 300 clusters of the 25 proteins, where k_calc is the solid line and 1.2 k_calc is the broken line. Clearly, there are many small but nearly spherical clusters of residues created in the early stages of the algorithm, which do not fit our intuitive notion of a domain. Therefore we take as a domain any cluster (other than the last cluster, containing all residues) for which k_obs/k_calc < 1.2 and contact density ρ < 0.1. This corresponds to permitting ellipticities up to 2.4, assuming they are indeed ellipsoids. The radius of gyration ratio and contact density cutoffs were chosen for reasonable agreement with the generally acknowledged domains seen in carboxypeptidase, chymotrypsin, cytochrome c, flavodoxin, staphylococcal nuclease, and lactate dehydrogenase. The result is that about 100 out of the 300 clusters qualify as “domains” by these criteria, but many of these are contained within larger domains. Comparing Figure 2, which contains only points for whole proteins, with Figure 3, where all clusters are indicated, we see that most of the small domains are less compact than entire proteins having the same number of residues. This is not very surprising, since the original segments are by definition far from compact, and two segments side by side make a more compact cluster, and so on. What generally does not happen is that a small, spherical cluster is enlarged by wrapping segments onto its surface. Instead, comparison of Figures 2 and 3 indicates that relatively elongated segments and clusters pack together to form progressively more compact and spherical domains, and ultimately entire proteins. Not only do our calculated domain-clusters correspond generally to the standard conceptions of spatially compact groups of residues separated by simple boundaries, but the precise assignments of residues to the individual domains often agree well
TABLE 3
Domains found by the clustering algorithm for all 25 proteins and comparisons with those found by inspection
Protein | Clustering† | Inspection‡,§
Alcohol dehydrogenase | 175-247; 147-174, 324-338; 248-323; 34-79, 116-146; 80-115, 147-174, 324-338; 1-79, 116-146; 175-323; 1-174, 324-338; 1-174, 324-374 |
Carbonic anhydrase C | 20-134, 233-258; 20-258 | 40-149‡
Carboxypeptidase | 136-208, 246-276; 1-135; 136-307 | 128-189‡; 1-127‡; 190-307‡
α-Chymotrypsin‖ | 171-245; 38-116; 1-37, 117-245 | 130-230‡; 133-230§; 27-130‡; 27-112§
Concanavalin A | 1-30, 204-237; 1-69, 204-237 | 1-87‡; 88-234‡
Cytochrome c | 1-47; 1-62 | 1-47‡; 48-91‡
Elastase | 1-137; 138-240 | 27-130‡; 27-127, 236-245§; 130-230‡; 16-26, 128-230§
Ferredoxin | none | 1-26‡; 27-54‡
Flavodoxin | 1-60, 123-138; 61-122 | 1-48‡; 49-138‡
Glyceraldehyde-3-phosphate dehydrogenase | 10-36, 67-77; 234-329; 10-77; 1-123; 124-162, 234-334; 124-334 | 1-77‡; 90-149‡; 150-331‡
High potential iron protein | 23-85 | 47-86‡; 1-42‡
Lactate dehydrogenase | 46-81; 262-308; 82-153; 262-329; 164-241; 164-241, 262-329; 82-241, 262-329 | 22-91‡; 92-165‡; 20-161§; 166-265‡; 162-231§; 266-329‡
Lysozyme | 1-36, 111-129; 37-110 | 1-38, 101-129§; 40-86‡; 39-87§
Malate dehydrogenase | 1-12, 29-90; 162-215, 251-325; 91-215, 251-325; 1-90, 216-250 | very similar to lactate dehydrogenase§
Myoglobin | 21-56, 97-119; 21-119 | 1-79§; 80-153§
Papain | 22-90; 127-197; 22-126, 198-212; 1-21, 127-197 | 10-111‡; 10-111§; 112-207‡; 113-207§
Carp myogen | 1-39; 40-108 | 1-33‡; 1-34§; 34-71‡; 72-108‡; 36-105§
Phosphoglycerate kinase | 204-238; 330-395; 135-190, 396-408; 53-134; 204-271; 204-329; 1-52, 135-190, 396-408; 191-203, 330-395; 1-190, 396-408; 191-395 | two domains‡
Trypsin inhibitor | none | none§
Ribonuclease S | 1-50 | 11-49, 80-103§; 1-10, 62-76, 10?-124§
Rubredoxin | none | none‡
Staphylococcal nuclease | 1-106 | 1-100‡; 101-140‡; 97-142§
Subtilisin BPN' | 119-194; 195-275; 1-18, 195-275; 19-118; 1-18, 119-275 | 1-100‡; 1-100§; 100-176‡; 101-176§; 177-275‡; 177-275§
Thermolysin | 1-35, 48-64; 245-271, 280-316; 181-244; 36-47, 65-136; 1-136; 137-244, 272-279; 137-316 | 1-167‡; 168-316‡
Trypsin | 164-223; 1-127; 1-163 | 130-230‡; 27-130‡
† According to the clustering algorithm, showing for each protein all clusters, in the order they are calculated, which have radius of gyration less than 20% above the value calculated in eqn (5), and a contact density for their formation of 0.100 or less.
‡ According to Liljas & Rossmann (1974).
§ According to Wetlaufer (1973).
‖ Chymotrypsinogen residue numbering.
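The domain test used for the clustering column (radius of gyration within 20% of the fitted curve and contact density of 0.100 or less, root cluster excluded) can be sketched in Python as follows. The function names are hypothetical. The default shape parameters are the fitted values quoted in the text (e = 1.04, u = 191 Å³/residue). The density comparison uses "0.100 or less" as in the table footnote, which is a choice made here, since the running text states the same cutoff as a strict inequality.

```python
import math


def radius_of_gyration(coords):
    """Eqn (4): root-mean-square distance of the C-alpha atoms from their centroid."""
    n = len(coords)
    center = [sum(p[d] for p in coords) / n for d in range(3)]
    return math.sqrt(sum(math.dist(p, center) ** 2 for p in coords) / n)


def k_calc(n, e=1.04, u=191.0):
    """Eqn (3): radius of gyration of an ellipsoid of revolution holding n residues."""
    shape = math.sqrt((2 + e * e) / 5)
    a_cubed = 3 * u * n / (4 * math.pi * e)   # from V = (4/3) pi a^3 e = u n
    return shape * a_cubed ** (1 / 3)


def is_domain(coords, density, is_root=False):
    """Radius-of-gyration and contact-density test; the root cluster never qualifies."""
    ratio = radius_of_gyration(coords) / k_calc(len(coords))
    return (not is_root) and ratio < 1.2 and density <= 0.1


# Coefficient of n^(1/3) in eqn (5), and the ratio reached at ellipticity 2.4
print(round(k_calc(1), 2))                           # → 2.77
print(round(k_calc(100, e=2.4) / k_calc(100), 2))    # → 1.2
```

The second printed value shows where the quoted ellipticity limit comes from: at equal volume, an ellipsoid with e = 2.4 has a radius of gyration about 1.2 times that of the fitted, nearly spherical shape.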
with usual assignments by inspection. Table 3 gives a comparison of residue assignments of domains for all 25 proteins. Generally the agreement is good, but in some cases the calculated domains are clearly not similar to those of Wetlaufer (1973) or Liljas & Rossmann (1974). For instance, the myoglobin tree structure in Table 3 indicates two domains consisting of residues 21 to 56 and 97 to 119 or the more inclusive 21 to 119, whereas Wetlaufer suggests 1 to 79 and 80 to 153. The reason for such discrepancies is probably that the clustering algorithm judges domains solely on density of contacts and radius of gyration, and has no bias in favor of a planar boundary (the “newspaper test” of Wetlaufer (1973)) or lack of sequence breaks. It is not inconceivable that these other factors may be important, but our method is a different way of looking at protein structure that is very simple and objective. Frequently the clustering algorithm produces a tree with many breaks in the sequences of the clusters, whereas this is usually thought to be an undesirable feature for domains. Out of the 25 proteins examined, 80% have a break fraction (defined in Methods) greater than zero, although a number have no breaks at all: carp myogen, ribonuclease, staphylococcal nuclease, high potential iron protein, and cytochrome c. The highest observed break fraction was 0.667 for flavodoxin, but most are in the range 0.1 to 0.3 with the mean = 0.230. To decide whether proteins are unusual in this respect, we generated a number of appropriately unbiased (Crippen, 1977b) self-avoiding random chains on a cubic lattice, calculating the mean number of segments and break fraction in the corresponding trees.
A cubic lattice walk is only a rough approximation to a protein, but in order to make the comparison reasonable, the lattice size was taken to be 3.8 Å (the correct Cα-Cα virtual bond length), and the walk was confined to a cube just large enough to accommodate the native conformation of the particular protein being simulated. The results (specifically matching lactate dehydrogenase and lysozyme) are given in Table 4. An expected value of 11 residues per segment and a break fraction between 0 and 0.2, regardless of molecular weight, are in reasonable agreement with the proteins we have examined. The break fractions for flavodoxin (0.667) and ferredoxin (0.500) are unusually high.
TABLE 4
Monte Carlo estimations of the break fraction and number of segments, for the usual 7 residue and 9 Å cutoffs. Self-avoiding walks on a cubic lattice of 3.8 Å grid size confined to the given cube size
No. residues | Cube size (Å) | No. successful walks | Break fraction | Average no. segments
129 | 40 | 396 | 0.122 ± 0.121 | 12.1 ± 1.4
331 | 56 | 585 | 0.125 ± 0.075 | 29.2 ± 1.8
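A confined self-avoiding walk of the kind tabulated above can be generated, for short chains, by plain rejection sampling, as in the Python sketch below. This is an illustrative stand-in and not the generation scheme of Crippen (1977b), which is not described here. The function names, the random starting site, and the `max_tries` limit are assumptions. Discarding a whole walk at the first collision or wall crossing keeps the accepted walks uniformly distributed, but the acceptance rate falls off steeply with chain length, so chains of 129 or 331 residues would need a more efficient sampler.

```python
import random

STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def try_walk(n, side, rng):
    """One attempt at n distinct lattice sites inside a side^3 box; None on failure."""
    pos = tuple(rng.randrange(side) for _ in range(3))
    walk, seen = [pos], {pos}
    for _ in range(n - 1):
        step = rng.choice(STEPS)
        pos = tuple(p + d for p, d in zip(pos, step))
        if pos in seen or not all(0 <= p < side for p in pos):
            return None                    # reject the whole walk, never patch it
        walk.append(pos)
        seen.add(pos)
    return walk


def confined_walk(n, cube_angstrom, grid=3.8, seed=0, max_tries=100000):
    """C-alpha-like co-ordinates (Angstrom) of one accepted walk of n residues."""
    side = int(cube_angstrom // grid) + 1  # lattice points along one cube edge
    rng = random.Random(seed)
    for _ in range(max_tries):
        walk = try_walk(n, side, rng)
        if walk is not None:
            return [tuple(grid * p for p in site) for site in walk]
    raise RuntimeError("no walk accepted; chain too long for rejection sampling")


chain = confined_walk(12, 15.0)
print(len(chain))   # → 12
```

The resulting co-ordinates can be passed to the same segmentation and clustering steps as a real chain to estimate the break fraction expected by chance.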
The details of the tree for concanavalin A illustrate some important points about this method of analysing protein conformations. Figure 4 shows graphically the tree obtained. Starting at the bottom, we see there are 13 segments, numbered 1 through 13, each made up of the contiguous runs of residues as indicated. Above the segments are the clusters, numbered 14 through 25 in order of their creation by the clustering algorithm. An examination of each segment by computer graphics revealed that
FIG. 4. Diagram of the calculated tree for concanavalin A. Segments are numbered 1 to 13 across the bottom, with the sequence numbers of the residues contained in each. Numbers 14 to 25 are the clusters, no. 25 being the “root” of the tree, which is the entire protein.
segments 4, 6, 8 and 11 appeared to be reasonably straight segments of extended chain, such as the component strands of β-sheet. Segments 3, 5 and 7 consisted of one extended part joined to a rather gently bent part, which we will call "curve" to differentiate it from the sharp kink found in a β-bend. All the other segments (1, 2, 9, 10, 12 and 13) consist entirely of curve, the curvature being fairly complex sometimes, so that the segment would not lie in a plane, but always gently curved so that the segment made no long-range contacts with itself, by definition. We similarly inspected the clusters formed using the graphics display. The first cluster, number 14, is a standard anti-parallel β hairpin built up out of two approximately straight, sequentially adjacent, rather extended strands. Cluster 15 is also such a hairpin, but both segments are strongly bent, yet bent in the same way so that the two segments form many contacts throughout the length of the hairpin. Features of this sort can be easily recognized on a direction matrix display of the chain (Crippen & Kuntz, 1977). Cluster 16 can be described as forming a two-stranded β-sheet, but cluster 17, composed of two rather complicated curve segments, can only be called a "wad". Both 16 and 19 are two-stranded β-sheets of non-sequential segments, but the algorithm clearly prefers to form them early on because their contact densities are three times that of cluster 24, which eventually unites them into a sequentially contiguous whole involving six strands of sheet altogether. Formation of cluster 20 amounts to adding the extended plus curve segment 5 to the preformed hairpin 14, consisting of segments 8 and 9.
Since the contact density for making number 20 is only one quarter of that, for making number 14, it is very tempting to interpret the tsee structure in terms of a temporal sequence of events: first there are such strong interactions between segments 8 and 9 that they form a small folded nucleus. which only then offers a suitable site for segment 7 to bind. Finally after assembling all the component sheets and coil segments in a rather complicated fashion (break fraction --= 0.253) the two largest domains, 23 and 24 join to form 25, the whole protein. The tree structures of other proteins have been similarly examined. Table 5 gives t,he segments and t,he tree clustering for all 25 proteins. Much as was thr cast fo1
TABLE 5
Calculated trees for 25 proteins

Alcohol dehydrogenase
Carbonic anhydrase C
Carboxypeptidase
α-Chymotrypsin†
Concanavalin A
Cytochrome c
Elastase
Ferredoxin
Flavodoxin
Glyceraldehyde-3-phosphate dehydrogenase
High potential iron protein
Lactate dehydrogenase
Lysozyme
    1,18,22,37,49,66,67,73,87,102,111
    ((1((2,3)11))(((4,~)((6,8)7))0))
Malate dehydrogenase
    1,13,29,43,67,91,121,138,162,180,197,211,216,230,251,269,286
    (((((1,3)5)4)((2,14)13))(((6,7)8)(((9,10)(11,12))((15,16)17))))
Myoglobin
    1,21,37,47,67,79,97,120,149
    ((1(8,9))(((2,7)(3,4))(5,6)))
Papain
    1,22,42,61,77,91,118,127,146,168,182,198
    ((((2,4)(3,~))((6,12)7))(((8,9)0)1~~
Carp myogen
    1,8,22,36,41,66,72,94
    ((1(2(3,4)))((5,6)(7,8)))
Phosphoglycerate kinase
Trypsin inhibitor
    1,16,27,40,48
    (((LW3))~)
Ribonuclease S
    1,28,37,51,68,93,114
    ((1(2,3))((4,6)(6,7)))
Rubredoxin
    1,9,25,42
    ((1NL3))
Staphylococcal nuclease
    1,21,29,48,86,96,107,118,138
    (((1(2,3))(4(5,6)))(7(8,9)))
Subtilisin BPN′
    1,19,40,53,79,100,119,142,146,161,173,184,19~,212,237,262
    (((1((13,14)(1~.l6)))((7,8)(0(11,12~~~~~~~~~~~~~~~~~~~
Thermolysin
    1,15,26,36,48,65,90,95,110,118,128,137,152,160,181,196,210,218,227,245,252,272,280,298
    ((((1,2)(3,5))(((4(7,8))6)((9,10)11)))((((15(16,17))(18,19))((13,22)(12,14)))((20,21)(23,24))))
Trypsin
    1,22,33,42,56,80,128,154,168,185,200
    (((((1~4)~)((2~3)6))7)((8,11)0))

The first line gives the sequence number of the first residue of each of the n segments. The second line uses parentheses to indicate the clustering of the segments, where the segments are referred to by numbers 1 to n, indicating their position in the polypeptide chain.
† Chymotrypsinogen numbering.
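The parenthesised notation of Table 5 can be read mechanically. The following Python fragment is a minimal sketch, not part of the original work: the names `parse_tree`, `leaves` and `count_clusters` are illustrative, and it assumes that the members of a cluster are separated by a comma, a space, or an opening parenthesis, as in the legible entries of the table. It converts a tree string into nested tuples of segment numbers and rejects strings containing unreadable characters.

```python
def parse_tree(text):
    """Parse a Table 5 tree string into nested tuples of segment numbers."""
    text = text.strip()
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(text):
            raise ValueError("unexpected end of tree string")
        if text[pos] == "(":
            pos += 1
            children = []
            while pos < len(text) and text[pos] != ")":
                if text[pos] in ", ":
                    pos += 1              # separators carry no structure
                else:
                    children.append(node())
            if pos >= len(text):
                raise ValueError("missing closing parenthesis")
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if start == pos:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        return int(text[start:pos])

    tree = node()
    if pos != len(text):
        raise ValueError("trailing characters after the root cluster")
    return tree


def leaves(tree):
    """Segment numbers contained in a cluster, in the order written."""
    if isinstance(tree, int):
        return [tree]
    return [s for child in tree for s in leaves(child)]


def count_clusters(tree):
    """Number of clusters (internal nodes), the root included."""
    if isinstance(tree, int):
        return 0
    return 1 + sum(count_clusters(child) for child in tree)


# Myoglobin entry of Table 5.
myoglobin = parse_tree("((1(8,9))(((2,7)(3,4))(5,6)))")
print(myoglobin)
print(sorted(leaves(myoglobin)), count_clusters(myoglobin))
```

For the myoglobin entry, the nine segments give eight clusters, the last of which is the root.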
concanavalin A, hairpins are among the first clusters calculated, as well as appending curved segments to more regular segments or small clusters. There are also two-segment clusters consisting of a pair of fairly straight segments passing each other as skewed lines, rather than parallel or antiparallel. The very first clusters are those of highest contact density, and correspondingly there are usually contacts between two segments up and down their entire length. There are small clusters formed later in the calculation, however, which consist of a pair of segments in close contact over only half their lengths. In such cases, the remaining residues have been included in the cluster only as a consequence of chain connectivity. Whereas the eye picks out a β-sheet as a regular object to be assembled first, the clustering algorithm often includes irregular curving segments before adding on the next strand of the sheet. Disregarding for the moment the nature of the segments, one might ask what sorts of patterns may be found in the trees themselves, and what similarity there is between the trees of homologous proteins. Using our definition of tree similarity given in Methods, we find there is one tree for two segments, three trees for three segments, and 15 trees for four segments. The trivial proto-tree for two segments may be perfectly matched (match fraction = 1.0) in any protein’s tree involving n segments with a multiplicity of $\binom{n}{2}$, the binomial coefficient. That is, this proto-tree is found $n!/[2!\,(n-2)!]$ times in all levels throughout the protein’s tree. The situation is not so trivial for the three proto-trees involving three segments. In a survey over all 25 proteins, “1 and 2, then 3” and “2 and 3, then 1” are found approximately equally often, but “1 and 3, then 2” is four times rarer. Note the favored two proto-trees have break fractions of zero, while the rare one has break fraction = 1 (i.e. the first cluster formed, which could have one sequence gap, does have a gap, since segments 1 and 3 are not contiguous). Of the 15 proto-trees involving four segments, five have break fraction = 0, five have 0.5, and five have 1. Taken over all 25 proteins, these trees
occur with perfectly monotonically decreasing frequency as the break fraction increases. The most commonly found pattern (three times more frequent than the runner-up) is “cluster of 1 and 2 combines with cluster of 3 and 4”, while the least frequent (100 times rarer than the commonest) is “1 and 4, then add on 3, then add on 2”. The frequency of occurrence drops suddenly from the first to the second most common, then decreases fairly smoothly down to the fifteenth. The match multiplicity ordering of the proto-trees with respect to a single protein may change somewhat compared to the all-protein ordering, but not dramatically so. In spite of the great range of frequencies of occurrence, there is relatively little difference in the frequencies of a pair of proto-trees which are the same except for a reversal of segment sequence. For example, the most common tree for three segments is “3 and 2, then 1”, but the next most common one, at 91% of the frequency of the first, is the sequence reverse, “1 and 2, then 3”. The third three-segment tree is sequence symmetric. Among the four-segment trees there are three symmetric cases and six pairs. For the worst pair, the less common member has a frequency only 37% of that of the more common one. The other five pairs have frequency ratios of 46%, 67%, 69%, 76% and 85%. In comparing proto-trees to protein trees, only match fractions of 1 were counted, but now, comparing the trees of proteins with each other, we will need the quantitative feature of the match fraction. Out of our set of 25 proteins, elastase, chymotrypsin, and trypsin have very similar domains, according to Liljas & Rossmann (1974). One would expect the match fraction of trypsin with chymotrypsin or elastase to be higher than that of trypsin with the other proteins of the data set.
However, it is relatively likely that a perfect match of 1.0 can be found for the small (3 clusters) protein ferredoxin somewhere in the trypsin tree (10 clusters), and this in fact happens. At the other end of the scale, the large (23 clusters) protein, phosphoglycerate kinase, contains a match fraction = 0.6 of trypsin’s tree. Chymotrypsin’s tree is clearly different from trypsin’s and yields a match fraction of only 0.5. Apparently the simple comparison of trees is not a sensitive indication of structural homology, and a fair comparison between two protein trees would have to take into account the size effects by estimating the expected match fraction with the trees of random compact conformations of the appropriate number of residues.
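The proto-tree counts quoted in this section (one, three and 15 trees for two, three and four segments; the split of the four-segment trees by break fraction; three sequence-symmetric cases and six reversal pairs) can be checked by direct enumeration. The Python sketch below is illustrative only. Since the Methods definition of the break fraction is not restated here, the code assumes that it is the number of sequence gaps observed in the clusters of a tree divided by the largest number of gaps those clusters could have; this assumption is chosen because it reproduces the values given in the text. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def proto_trees(segments):
    """All distinct binary clustering trees on the given segment numbers."""
    segments = tuple(segments)
    if len(segments) == 1:
        return [segments[0]]
    first, rest = segments[0], segments[1:]
    trees = []
    # The lowest-numbered segment always goes into the left branch, so each
    # unordered split of the segments is generated exactly once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(s for s in rest if s not in extra)
            trees.extend((lt, rt)
                         for lt in proto_trees(left)
                         for rt in proto_trees(right))
    return trees


def leaves(tree):
    if isinstance(tree, int):
        return [tree]
    return [s for child in tree for s in leaves(child)]


def clusters(tree):
    """Yield the sorted member list of every cluster, the root included."""
    if isinstance(tree, int):
        return
    for child in tree:
        yield from clusters(child)
    yield sorted(leaves(tree))


def break_fraction(tree, n):
    """Assumed definition: observed sequence gaps / maximum possible gaps."""
    observed = possible = 0
    for members in clusters(tree):
        observed += sum(b - a > 1 for a, b in zip(members, members[1:]))
        possible += min(len(members), n - len(members) + 1) - 1
    return Fraction(observed, possible) if possible else Fraction(0)


def reverse(tree, n):
    """Renumber segments from the other end of the chain."""
    if isinstance(tree, int):
        return n + 1 - tree
    return tuple(reverse(child, n) for child in tree)


def canonical(tree):
    """Order the branches of every cluster by their lowest segment number."""
    if isinstance(tree, int):
        return tree
    return tuple(sorted((canonical(c) for c in tree),
                        key=lambda c: min(leaves(c))))


for n in (2, 3, 4):
    trees = proto_trees(range(1, n + 1))
    by_break = Counter(break_fraction(t, n) for t in trees)
    symmetric = sum(canonical(reverse(t, n)) == canonical(t) for t in trees)
    print(n, len(trees), dict(by_break), symmetric)
```

Under the stated assumption the enumeration agrees with the text: the four-segment proto-trees divide five, five and five among break fractions 0, 0.5 and 1, and the three sequence-symmetric cases are the three trees built from two two-segment clusters.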
4. Discussion
The conservative conclusions that can be drawn from this study are that the algorithm presented here is an objective way to define domains, and that the domains we calculate correspond well to the more subjective notions of what they should be. That is, the calculated domains are usually compact in space and in sequence, although one break in sequence for a domain is not unusual. Further, the ultimate building blocks of domains are taken to be segments, which are simply augmented pieces of the generally recognized secondary structural elements. This fits well with the customary approach toward protein tertiary structure as being formed out of segments of secondary structure. The more novel aspects of this work are an explicit use for coil segments, the concept of a whole hierarchy of supersecondary structural elements, and the emphasis on long-range contacts, not dihedral angles, in the definitions of segments and clusters.
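The grouping of segments by contact density can be illustrated with a small sketch. It is not the program used for this study. For illustration only, the contact density between two clusters is assumed to be the number of residue-residue contacts between them divided by the product of their sizes in residues, and the pair of highest density is merged at each step. The names `contact_density` and `build_tree` and the toy contact counts are hypothetical.

```python
from itertools import combinations


def contact_density(a, b, lengths, contacts):
    """Assumed density: contacts between clusters a and b per residue pair."""
    n_contacts = sum(contacts.get(frozenset((i, j)), 0) for i in a for j in b)
    size_a = sum(lengths[i] for i in a)
    size_b = sum(lengths[j] for j in b)
    return n_contacts / (size_a * size_b)


def build_tree(lengths, contacts):
    """Greedy agglomerative clustering of chain segments.

    lengths  : {segment number: number of residues}
    contacts : {frozenset({i, j}): number of contacts between segments i, j}
    Returns the tree as nested tuples and the list of (cluster, density)
    in order of formation.
    """
    active = {frozenset([s]): s for s in lengths}   # members -> subtree
    merges = []
    while len(active) > 1:
        a, b = max(combinations(active, 2),
                   key=lambda p: contact_density(p[0], p[1], lengths, contacts))
        density = contact_density(a, b, lengths, contacts)
        left, right = sorted((a, b), key=min)       # lower segment number first
        subtree = (active.pop(left), active.pop(right))
        active[a | b] = subtree
        merges.append((subtree, density))
    (root,) = active.values()
    return root, merges


# Toy example: four segments of ten residues each (made-up contact counts).
lengths = {1: 10, 2: 10, 3: 10, 4: 10}
contacts = {frozenset((1, 2)): 30, frozenset((3, 4)): 20,
            frozenset((2, 3)): 8, frozenset((1, 4)): 2}
tree, merges = build_tree(lengths, contacts)
print(tree)
for cluster, density in merges:
    print(cluster, density)
```

In this toy case the two pairs with many mutual contacts form first and the final merge has the lowest density, which mirrors the qualitative behaviour described for the calculated trees.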
Parts of the chain which are neither α-helical nor β-strand but rather have irregular conformations are accepted quite naturally into this scheme as segments as long as there are no internal long-range contacts. Because the cluster formation depends strictly on contact density between segments, the “curve” segments can be just as important in the analysis of protein structure as α and β are. To our knowledge, there are no reports in the literature commenting on the relative energetic importance of regular versus irregular segments for the overall stability of the native conformation. The one sort of recognized secondary structure that this approach neglects is the bend, since if it is abrupt enough, it generally becomes the boundary between two sequentially adjacent segments.
The last novel feature is the formal analysis of a crystal structure as a packing tree. Workers in the field have long recognized certain levels of the tree at the top and the bottom, but have generally ignored the middle levels of organization. Thus we speak of the whole protein (subunit), the root of the tree, consisting of domains (the highest level subdivision) which are composed of secondary structure segments (the lowest level). Only recently has there been much analysis of the intermediate levels, such as β-sheet, β-barrel, hairpins and cross-overs (Richardson, 1977; Sternberg & Thornton, 1977). We believe that looking at the entire tree structure is a helpful way to grasp all levels of polypeptide chain organization, while avoiding possible overemphasis of the regular, repeating structural features. While the study of α-helices and β-sheets has been very fruitful, it is good to have an alternative approach available (such as these contact trees) which makes no assumptions about what kinds of secondary structure are important for the tertiary conformation, but rather focusses on residue-residue long-range contacts,
which are doubtless important in stabilizing the native conformation of globular proteins.
The underlying principle in the tree organization itself is that sequence breaks tend to be avoided (but not at all completely), as shown by the comparisons with proto-trees of three segments. The symmetrical buildup from subclusters of equal size (“1 and 2 combine with 3 and 4”) is preferred over adding small pieces to a large nucleus (“1 and 2, then 3, then 4”), but once again there are numerous exceptions. There is apparently no chain direction bias in the buildup of domains, such as folding up starting from the N-terminus.
The more speculative outcome of this work is a proposed mechanism of protein folding. If one assumes that the analysis of the static crystal structure can be converted into a sequence of steps for the actual folding process, then the following picture emerges: first, the protein tends to bend at a number of places, so that the chain between bends at least does not fold back on itself, although there is no requirement that it has any particular conformation, much less a very regular one. Helical or extended segments are certainly not excluded, however. These segments can then interact via hydrogen bonding, hydrophobic or electrostatic effects to bind together into small clusters. The clusters need not consist of sequentially adjacent segments, but there is a substantial likelihood that they will. The model for this cluster formation would be that of a confined self-avoiding random walk, which gives segment lengths and break fractions similar to those observed in the crystal structures. As the clusters pack together to form larger clusters, the interactions per residue will necessarily be fewer, simply because the polypeptide chain is self-avoiding. The final domains will have the weakest interactions among themselves while the initial clusters will be most strongly formed. This picture of protein folding is not very new,
in that Lim & Efimov (1977), Ptitsyn & Rashin (1975), Rose et al. (1976), Tanaka
& Scheraga (1977), and many others have expressed similar ideas. However, our proposed folding process has three novel features: first, the initial building blocks need not be helices or have any other regular secondary conformation, even temporarily. Second, the clusters formed need not be sequentially contiguous, but rather there will simply be a statistical tendency to be so. Third, the intermediates in the folding process are identified with the clusters found by the tree algorithm from the crystal structure co-ordinates. Of course, there may be “dead-end” intermediate conformations in the actual folding process, which must be unfolded before renaturation can continue, but at least all the clusters calculated from the final crystal co-ordinates must have appeared in the folding process. It is our hypothesis that they arose in roughly the same order as they were calculated.
This work was supported by a grant from the Academic Senate of the University of California. We are grateful for the use of the UCSF Computer Graphics Laboratory (NIH RR1081) and for stimulating discussions with Dr I. D. Kuntz.
REFERENCES
Chothia, C. (1973). J. Mol. Biol. 75, 295.
Chothia, C., Levitt, M. & Richardson, D. (1977). Proc. Nat. Acad. Sci., U.S.A. 74, 4130-4134.
Crippen, G. M. (1977a). J. Comp. Phys. 24, 96-107.
Crippen, G. M. (1977b). Macromolecules, 10, 21-25.
Crippen, G. M. & Kuntz, I. D. (1977). J. Theoret. Biol. 66, 47-61.
Damaschun, G., Mueller, J. J., Puerschel, H.-V. & Sommer, G. (1969). Monatsh. f. Chemie, 100, 1701-1714.
Hartigan, J. A. (1975). Clustering Algorithms, Wiley & Sons, New York.
Havel, T. F., Crippen, G. M. & Kuntz, I. D. (1978). Biopolymers, in the press.
Kuntz, I. D., Crippen, G. M., Kollman, P. A. & Kimelman, D. (1976). J. Mol. Biol. 106, 983.
Kuntz, I. D., Crippen, G. M. & Kollman, P. A. (1978). Biopolymers, in the press.
Levitt, M. & Greer, J. (1977). J. Mol. Biol. 114, 181-293.
Levitt, M. & Warshel, A. (1975). Nature (London), 253, 694-698.
Liljas, A. & Rossmann, M. G. (1974). Annu. Rev. Biochem. 43, 475-507.
Lim, V. I. (1974). J. Mol. Biol. 88, 857.
Lim, V. I. & Efimov, A. V. (1977). FEBS Letters, 78, 279-283.
Ptitsyn, O. B. & Rashin, A. A. (1975). Biophys. Chem. 3, 1-20.
Rao, S. T. & Rossmann, M. G. (1973). J. Mol. Biol. 76, 241-256.
Richardson, J. S. (1977). Nature (London), 268, 495-500.
Rose, G., Winters & Wetlaufer, D. (1976). FEBS Letters, 63, 10.
Rossmann, M. G. & Liljas, A. (1974). J. Mol. Biol. 85, 177-181.
Sternberg, M. J. E. & Thornton, J. M. (1977). J. Mol. Biol. 113, 401-418.
Tanaka, S. & Scheraga, H. A. (1977). Proc. Nat. Acad. Sci., U.S.A. 74, 1320-1323.
Wetlaufer, D. B. (1973). Proc. Nat. Acad. Sci., U.S.A. 70, 697-701.