Notes to the Editor
Amino acid preferences for secondary structure vary with protein class
Michael J. Geisow and Robin D. B. Roberts
Biophysics Division, National Institute for Medical Research, Mill Hill, London NW7 1AA, UK
(Received 22 May 1980)

Protein secondary structure (e.g. α-helix, β-strand, 3₁₀-helix etc.) is frequently estimated by the use of conformational indices published for individual amino acids. Although a fair indication of secondary (local) structure can be obtained in many cases, conformational preferences expressed by such statistics do not relate to local structure uniquely. Conformational indices, irrespective of their precise definition, are obtained from analyses of the pool of known 3D protein structures. In consequence they contain an unknown contribution from the overall architecture of proteins (tertiary structure). It is unclear at present to what extent this tertiary structure contribution modifies the conformational index and affects the accuracy of local structure prediction¹. We demonstrate here the nature and magnitude of systematic error present in conformational indices from a data base similar to those now in use for protein secondary structure estimation. By splitting this data base into α-helical, all-β and mixed (α/β) protein classes (within
which secondary and tertiary structure is more uniform) conformational indices are shown not to be constant, but to depend upon protein class. It is often feasible to assign a protein to one of the above structural groups on the evidence of circular dichroism spectra or low-resolution X-ray analysis. Where this can be done, more accurate prediction of secondary structure should be obtainable using the indices described here.

The preferences of individual amino acids for a polypeptide conformation K have been quantified in a number of ways. We calculate the index PK originally described by Chou and Fasman²⁻⁴ because this is the most popular conformational indicator. Table 1 shows PK indices for three conformations (α-helix, β-strand and aperiodic, i.e. all other conformations) calculated for all proteins and for α, β or α/β proteins separately. No β-strand index can be calculated for α-proteins, and the Pα indices for β-proteins and Pc indices for α-proteins are of low statistical significance. The other indices in Table 1 have comparable statistical weights and have essentially converged to the values given. Global indices calculated by Chou and Fasman⁴ are similar but not identical to the global set calculated here. The difference between the two global sets may result from the incorporation of twice as much β-strand in the present data base (α and β now have the same weight) in addition to systematic differences in the algorithms used to identify secondary structure⁴,⁵.

Table 1 Structural index variation with protein class (α, β or α/β). The index PK is the mean frequency with which an individual amino acid occurs in secondary structure K, normalized by the fractional content of structure K in those proteins considered³,⁴. A PK value of unity indicates that an amino acid is not especially correlated with the structure K. Variations up or down indicate preference for or avoidance of structure K respectively. The secondary structure of the proteins used was determined by the algorithm described by Levitt and Greer⁵, which uses α-carbon atom coordinates. Thus the indices form a self-consistent data set. The eight α-proteins used contained no β-sheet. The 16 β-proteins had < 10% α-helix and the 23 α/β proteins had comparable amounts of both structures. α/β proteins were not subdivided into α/β and α + β proteins as suggested by Levitt and Chothia⁷, since these two classes cannot be distinguished on the basis of spectroscopic evidence. Headings Global, α, β and α/β refer to PK values obtained by analysis of all protein classes, α, β and mixed α and β secondary structural types respectively. Particularly large and significant changes in index from the global value are underlined.

                   α-helix (Pα)                  β-strand (Pβ)          Aperiodic (Pc)
Protein class      Global α      β      α/β      Global β      α/β      Global α      β      α/β
Amino acid
Ala                1.29   1.13   1.55   1.19     0.84   0.86   0.91     0.91   0.8    1.1    0.93
Arg                1.00   1.09   0.20   1.00     1.04   1.15   0.99     1.00   0.96   0.93   1.01
Asn                0.81   1.06   1.20   0.94     0.66   0.60   0.72     1.64   1.1    1.57   1.36
Asp                1.10   0.94   1.55   1.07     0.59   0.66   0.74     1.40   1.6    1.41   1.22
Cys                0.79   1.32   1.44   0.95     1.27   0.91   1.12     0.93   0      1.05   0.92
Gln                1.07   0.93   1.13   1.32     1.02   1.11   0.90     0.94   1.6    0.81   0.83
Glu                1.49   1.20   1.67   1.64     0.57   0.37   0.41     0.97   0.4    1.40   1.05
Gly                0.63   0.83   0.59   0.60     0.94   0.86   0.91     1.51   2.0    1.30   1.45
His                1.33   1.09   1.21   1.03     0.81   1.07   1.01     0.90   0.96   0.85   0.96
Ile                1.05   1.05   1.27   1.12     1.29   1.17   1.29     0.65   0.85   0.67   0.58
Leu                1.31   1.13   1.25   1.18     1.10   1.28   1.23     0.59   0.8    0.52   0.59
Lys                1.33   1.08   1.20   1.27     0.86   1.01   0.86     0.82   0.94   0.94   0.91
Met                1.54   1.23   1.37   1.49     0.88   1.15   0.96     0.58   0.39   0.69   0.60
Phe                1.13   1.01   0.4    1.02     1.15   1.34   1.26     0.72   1.2    0.60   0.71
Pro                0.63   0.82   0.21   0.68     0.80   0.61   0.65     1.66   2.1    1.77   1.67
Ser                0.78   1.01   1.01   0.81     1.05   0.91   0.93     1.23   1.3    1.13   1.25
Thr                0.77   1.17   0.55   0.85     1.20   1.14   1.05     1.04   0.6    0.88   1.08
Trp                1.18   1.32   1.86   1.18     1.15   1.13   1.15     0.67   0      0.62   0.68
Tyr                0.71   0.88   1.08   0.77     1.39   1.37   1.21     0.92   1.8    0.41   0.98
Val                0.81   1.13   0.64   0.74     1.56   1.31   1.58     0.60   0.8    0.58   0.62
Size of data base  1775   1275   200    1221     1836   1454   1525     1574   300    940    1423

0141-8130/80/060387-03$02.00 © 1980 IPC Business Press
Int. J. Biol. Macromol., 1980, Vol 2, December

Although the indices in Table 1 were obtained by identical criteria from protein α-carbon atomic coordinates, there are highly significant differences between the same conformational index calculated from each data base. For example, using α, α/β and global data, Pα for serine is 1.01 (70 residues), 0.81 (88 residues) and 0.77 (123 residues) respectively. These differences are large by comparison with the numerical fluctuation observed as each data base is expanded. As an extreme example, the mean variation in Pβ for all the amino acids in α/β proteins was 0.07 ± 0.06 when the data base was increased from 2000 to 4200 residues. The consistent alteration of conformational preference with protein class must reflect a real difference in the distribution of amino acids amongst the same types of secondary structure in α, β and α/β proteins. Structurally important residues (hydrophobic, disulphide-bond forming and charged amino acids) have the largest index changes. The variations of index with protein class indicate one source of systematic error to be expected when using constants obtained from a global data base to predict protein structure.

Shifts in the Pβ and Pc indices calculated from a global data base and an all-β data base are represented in Figure 1. It can be seen that similar types of residues (e.g. basic, hydrophobic) are clustered, indicating similar tendencies to form β-strand or aperiodic structure. In β-proteins, occupancy of β-strands by basic residues (Lys, His, Arg) is increased at the expense of hydrophobic residues (Val, Ile, Cys). Aromatic amino acids (His, Phe, Trp) are more frequent in β-strands, while acidic residues are excluded, but now form a large component of the aperiodic structure. There is a net reduction in structure breaking residues (Asn, Pro, Gly, Ser) in β-strands. These shifts in index are reversed in all α-proteins (hydrophobic residues
Figure 1 (plot of Pβ against Pc) The nature and magnitude of changes between conformational indices derived globally and those obtained from β-proteins are represented vectorially. The arrow indicates the shift from the global to the β-protein values of the indices Pβ and Pc. Symbols distinguish hydrophobic or aromatic residues (●), acidic and basic residues, with an indication of charge (○), and reverse turn (structure breaking) residues. The broken lines represent the boundaries between favourable (> 1.0) and unfavourable (< 1.0) structure formation. It should be noted that the Pc, Pβ values redistribute in the top left and bottom right hand quadrants of the graph. This represents a less ambiguous distribution with respect to preference for β-strand or aperiodic structure.
more frequent in α-helix at the expense of ionizable amino acids). These variations in amino acid preferences are reasonable if the folding (tertiary structure) of each protein class is considered. For example, when much of the polypeptide chain forms β-sheet, as in β-proteins, more β-strand amino acids contribute to surface or subunit boundaries, and the proportion of polar residues in β-strand relative to hydrophobic residues rises. It is interesting to note the preferential exclusion of acidic residues from β-strands, which indicates that Glu and Asp, rather than Arg, Lys or His, are inimical to β-strand formation in β-proteins. In β-proteins the infrequent occurrence of Ser, Asn, Pro and Gly in β-strands suggests the loss of interactions which stabilize these structure-terminating residues in β-structure. For example, in many α/β proteins containing a central β-sheet, species-invariant glycines occur in the sheet where they allow close-packing of α-helices. Close approach of α-helix to buried β-structure is also facilitated and stabilized by hydrogen bonds formed to Ser or Asn in the buried sheet.

In a recent report⁶, preferences of amino acids for parallel or antiparallel β-strand arrangements were shown to differ substantially. It was suggested that such differences justified the refinement of secondary structure prediction in terms of the tertiary structure environment (strand arrangement). In selecting their data bases, the authors coincidentally separated β-proteins from α/β proteins, because antiparallel and parallel β-strands predominate in each class respectively. The substantial differences reported in the frequency of occurrence of Val, Ile, Glu and His in parallel and antiparallel strands are reflected by Pβ values for α/β and β-proteins respectively in Table 1. However, not only strand arrangement, but the whole protein architecture differs between the two data bases selected by the authors.
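The index PK defined in the Table 1 caption, and its dependence on the data base from which it is computed, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' program: the function name `conformational_indices`, the one-letter state labels (H helix, E strand, C aperiodic) and the two toy data bases are all hypothetical.

```python
from collections import Counter


def conformational_indices(residues, states, target):
    """Return {amino acid: P_K} for the conformation labelled `target`.

    P_K is the fraction of an amino acid's occurrences found in
    conformation K, divided by the fraction of all residues in the
    data base that are in K (the normalization in the Table 1 caption).
    """
    if not residues or len(residues) != len(states):
        raise ValueError("residues and states must be non-empty and equal in length")
    totals = Counter(residues)
    in_target = Counter(r for r, s in zip(residues, states) if s == target)
    fraction_in_target = sum(in_target.values()) / len(residues)
    if fraction_in_target == 0:
        raise ValueError("target conformation is absent from this data base")
    return {aa: (in_target[aa] / n) / fraction_in_target
            for aa, n in totals.items()}


# Hypothetical toy data bases: (sequence, observed state per residue).
# H = helix, E = strand, C = aperiodic; these are NOT real protein data.
helical_db = ("AAAAGGGGVVVV", "HHHHCCCCHHCC")
mixed_db = ("AAAAGGGGVVVV", "HHCCCCCCEEEE")

# A "global" data base is simply the pooled residues of every class.
global_db = (helical_db[0] + mixed_db[0], helical_db[1] + mixed_db[1])

helical_index = conformational_indices(*helical_db, target="H")
global_index = conformational_indices(*global_db, target="H")

# Print the class-specific and the global helix index side by side.
for aa in "AGV":
    print(aa, round(helical_index[aa], 2), round(global_index[aa], 2))
```

With these toy inputs the helix index for alanine and valine differs between the helical-class data base and the pooled one, which is the kind of class dependence that Table 1 reports for real proteins.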
To sharpen the conclusions of Lifson and Sander⁶ further, it would be valuable to analyse strand arrangement within a single protein class (e.g. α/β proteins). The large differences in indices between α and β proteins and those derived globally emphasize the need to take tertiary structure into account in predictive work. The constraints placed upon amino acid neighbours in adjacent elements of α-helix or β-strand are stereochemically quite different, so that elements of secondary structure should be considered together in future predictive schemes.

In the expectation of improved secondary structure estimation, the indices in Table 1 were used to predict secondary structure essentially as described by Chou and Fasman²⁻⁴. Their procedure was modified slightly to incorporate the information provided by the aperiodic indices (Pc) and applied by a computer program. The extent to which X-ray and predicted structures were correlated was significantly better when indices appropriate to α/β or β proteins were used (Table 2). This correlation was greatly improved in the case of α-proteins. Obviously, a substantial component of this improvement is the prior assignment of structural class in the case of α and β proteins. However, the improved accuracy could not be attributed only to the elimination of the wrong type of secondary structure but was also apparently due to more appropriate conformational indices. Despite the substantial changes in PK indices from their previously reported global values⁴, the improvement in their predictive power when applied to β-proteins was not
Table 2 Correlation of predicted and observed structures. The prediction of protein secondary structure was carried out by a computer program. To quantify the degree to which observed (X-ray structure) and predicted secondary structure were correlated, the coefficient $C_K$ described by Matthews⁸ was calculated for each structure K:

$$C_K = \frac{P/N - \bar{P}\bar{S}}{\left[\bar{P}\bar{S}\,(1-\bar{S})(1-\bar{P})\right]^{1/2}}$$

$P$ is the number of residues correctly predicted in conformation K. $N$ is the total number of residues. $\bar{P}$ is the fraction of $N$ predicted to be in conformation K and $\bar{S}$ is the fraction of $N$ observed to be in conformation K. Values of −1, 0 and 1 indicate total disagreement, random and complete agreement between predicted and observed structures respectively. The global data base used to obtain the $C_K$ values was that described in reference 4. Other data bases were as shown in Table 1.

Protein                                  Size        Data base  Cα     Cβ     Cc
                                         (residues)
Calcium-binding protein (carp muscle)    108         Global     0.12          0.22
                                                     α          0.57          0.61
Myohaemerythrin (Thermiste pyroides)     112         Global     -0.9          0.39
                                                     α          0.54          0.54
Cyanmethaemoglobin (lamprey)             148         Global     0.28          0.44
                                                     α          0.59          0.59
CuZn superoxide dismutase (bovine)       150         Global            0.69   0.39
                                                     β                 0.62   0.49
Bence-Jones IgC dimer Mcg (human)        215         Global            0.11   0.48
                                                     β                 0.43   0.43
Prealbumin (human)                       127         Global            0.19   0.10
                                                     β                 0.32   0.48
Adenylate kinase (porcine)               194         Global     0.42   0.65   0.56
                                                     α/β        0.61   0.65   0.62
Oxidized flavodoxin (Clostridium MP)     138         Global     0.65   0.61   0.22
                                                     α/β        0.67   0.72   0.1
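The coefficient CK defined in the Table 2 caption can be computed directly from per-residue observed and predicted states. The following is a minimal sketch under stated assumptions: the function name `matthews_coefficient`, the one-letter state labels and the example strings are hypothetical, and returning 0.0 when the denominator vanishes is a convention chosen for this sketch, not something specified in the paper.

```python
import math


def matthews_coefficient(observed, predicted, target):
    """Correlation C_K between observed and predicted states for `target`."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    # P: number of residues correctly predicted to be in conformation K.
    correct = sum(o == target and p == target
                  for o, p in zip(observed, predicted))
    p_bar = sum(p == target for p in predicted) / n  # fraction predicted in K
    s_bar = sum(o == target for o in observed) / n   # fraction observed in K
    denominator = math.sqrt(p_bar * s_bar * (1 - s_bar) * (1 - p_bar))
    if denominator == 0:
        # Assumption of this sketch: the coefficient is undefined when K is
        # never (or always) observed or predicted; report 0.0 in that case.
        return 0.0
    return (correct / n - p_bar * s_bar) / denominator


# Hypothetical eight-residue example: H = helix, C = aperiodic.
observed = "HHHHCCCC"
perfect = matthews_coefficient(observed, "HHHHCCCC", "H")
inverted = matthews_coefficient(observed, "CCCCHHHH", "H")
uninformative = matthews_coefficient(observed, "HHCCHHCC", "H")
print(perfect, inverted, uninformative)
```

The three example predictions reproduce the limiting cases named in the caption: complete agreement, total disagreement and a prediction no better than random.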
as marked as with α-proteins. Whether a given segment of polypeptide will form β-strand depends to a great extent upon the folding of distant regions of chain, because of the formation of strong interstrand hydrogen bonds. The manner in which structural indices are presently calculated and applied may only give accurate results for secondary structure if interactions between secondary structures are weak. This situation normally only rigorously applies to synthetic amino acid polymers. Analyses of protein tertiary structure are now needed to be able to improve further upon the estimation of local structure. The best possible secondary structure predictions will still be needed as a starting point for tertiary structure analysis and, as shown here, further subdivisions of the protein data base will be necessary to achieve this.

References
1 Sternberg, M. J. E. and Thornton, J. M. Nature 1978, 271, 15
2 Chou, P. Y. and Fasman, G. D. Biochemistry 1974, 13, 211
3 Chou, P. Y. and Fasman, G. D. Ibid. 1974, 13, 222
4 Chou, P. Y. and Fasman, G. D. Adv. Enzymol. 1978, 47, 45
5 Levitt, M. and Greer, J. J. Mol. Biol. 1977, 114, 181
6 Lifson, S. and Sander, C. Nature 1979, 282, 109
7 Levitt, M. and Chothia, C. Nature 1976, 261, 552
8 Matthews, B. W. Biochim. Biophys. Acta 1975, 405, 442

Natural cellulose sulphate: position and degree of ester sulphation determined by ¹³C n.m.r. spectroscopy
S. Hunt and T. N. Huckerby
Department of Biological Sciences and Department of Chemistry, University of Lancaster, Bailrigg, Lancaster LA1 4YQ, UK
(Received 28 January 1980)

Introduction
While cellulose is probably one of the most abundant polysaccharides, natural substituted derivatives are virtually unknown. β-1,4 linked glucans have however been isolated from molluscan mucous secretions, which carry ester sulphate groups, and these have been shown by chemical methods and by X-ray diffraction and infrared spectroscopy to be cellulose sulphates with some covalently bound peptide⁸⁻¹⁰. While these studies have established the cellulosic nature of the parent polysaccharide, the localization of the ester sulphate groups has proved a more difficult task. In the case of the cellulose sulphate peptidoglycan of Buccinum undatum hypobranchial mucin, the degree of sulphation was originally believed to be 1.0 (residues sulphate/monosaccharide)³, a figure we now revise to a higher value. It might have been reasonable to assume for a homopolymer of biological origin (with this sulphate content) that each monosaccharide might bear its sulphate group in an identical