Z theor BioL (1986) 119, 205-218
The Classification of Amino Acid Conservation WILLIAM RAMSAY TAYLOR
Laboratory of Molecular Biology, Dept of Crystallography, Birkbeck College, London W C I E 7HX, U.K. (Received 20 September 1985, and in revised form 1 November 1985) A classification of amino acid type is described which is based on a synthesis of physico-chemical and mutation data. This is organised in the form of a Venn diagram from which sub-sets are derived that include groups of amino acids likely to be conserved for similar structural reasons. These sets are used to describe conservation in aligned sequences by allocating to each position the smallest set that contains all the residue types brought together by alignment. This minimal set assignment provides a simple way of reducing the information contained in a sequence alignment to a form which can be analysed by computer yet remains readable.
1. Introduction The high level of expertise current in nucleic acid research has led to the revelation o f large numbers of protein sequences and the ability specifically to alter these in a controlled way. New sequences often exhibit a close homology with proteins which have had their structure determined crystallographically and using advanced computer graphic facilities it is possible theoretically to alter the amino acid side chains of the known structure to represent the sequence of unknown structure (e.g. Blundell et al., 1983). The new techniques are used increasingly to design proteins with altered properties using the methodology of site-specific mutagenesis. These activities require a good understanding of the basic principles of protein structure and, in particular, it is necessary to anticipate the structural effect of introducing a new amino acid into a known structure. This assessment is often based on the likelihood matrix of amino acid mutabilities derived by Dayhoff et al. (1972, 1978) or on the number of nucleotide base changes required to effect the substitution (Fitch, 1966). Such measures, however, ignore aspects of the substitution that are relevant to the local structural environment or known function of the residue and in building hypothetical structures and designing new mutants it is these details which are important. tn this paper I consider measures of amino acid relatedness in common use with the aim o f extracting from them features which will best assist a protein engineer faced with the problem of making a mutation and assessing a sequence alignment. These features are represented as groupings of amino acids (sets) which is a form that retains a descriptive quality yet allows quantitative manipulations using the formalism o f set logic. 205 0022-5193/86/060205 + 14 $03.00/0
© 1986 Academic Press Inc. (London) Ltd
206
WILLIAM
RAMSAY TAYLOR
2. Measures of Amino Acid Relatedness (A) MUTATION DATA
For every pair of the 20 naturally occurring amino acids Dayhoff et al. (1972) have determined the probability (or odds) that the mutation will occur in either direction. This matrix was most clearly presented by Sander & Schulz (1979) (see also Schulz & Schirmer, 1979) in a form where the entries have been ordered to bring frequently exchanging amino acids together. Even in its ordered form it is still difficult fully to appreciate the information contained in Dayhoff's matrix. However, using the technique of multi-dimensional scaling, French & Robson (1983) reduced the matrix to a two dimensional plot, in which frequently exchanging amino acids are closest together (see Fig. l(a)). A similar diagram (Fig. l(b)) was produced by minimising the deviation from a 2-D structure in which the entries of Dayhoff's matrix represent the inverse of ideal target distances between pairs of amino acids (Taylor, 1981). Both these figures are roughly elliptical and projecting the amino acids onto the circumference of each ellipse produces an even simpler representation with no great loss of information. In this form the long axis of the ellipse corresponds to molecular volume while the short axis corresponds to hydrophobicity. These simplifications are presented in Figs 1(c) and l(d). The cyclic order of amino acids obtained from this simplification is almost the same as that derived by Swanson (1984) (see Fig. l(e)). All the above representations of Dayhoff's matrix indicate that it can be largely accounted for by the effect of only two determining factors; hydrophobicity and size. This remarkable observation must obviously dominate any attempt to codify amino acid conservation. (B) PHYSICAL DATA
Physico-chemical properties The wide variety of physico-chemical properties manifest in the amino acid sidechains has been thoroughly considered by Sneath (1966). These have been adopted, or deduced independently, by others and used as a basis for considering relatedness between protein sequences. McLachlan (1972) summarizes these relationships by assigning a value to each transition between pairs of amino acids. These scores are presented graphically in Fig. 2 using the circle of amino acids derived from Dayhoff's matrix as a frame on which lines connect related amino acids. All high scoring transitions and most other connections on the graph are local, indicating agreement between chemistry and mutability. The less local connections mainly join hydrophobic residues, which, with the exception of proline, are relatively adjacent on the less idealised representations of Dayhoff's matrix (Fig. 1). The missing links are, perhaps, more revealing: there is only a weak link between Tyr and Trp which are strongly tied in Dayhoff's mutation matrix and there is no connection between the adjacent negatively charged residues and Gly and Pro. Measurement of properties cannot determine a common scale without reference to protein sequences and structures. The idea of such a scaling can be appreciated
THE CLASSIFICATION (a)
OF AMINO
small
(b)
®p
a.
ACID CONSERVATION small
©
/
/
/
Y® large
(c)
,~o\o'~
207
large
i
a2
(d)
(s,%,,
X A( I E'~ % E c/
i N
R
\,
%.
"2
\
~1 / o "1~ w 4 /
DO/or
\F I ',/ #
Ooo~/
\~.';"
J T
small
Cv I inner
,org,
FIG. 1. Representations of Dayhoff's mutation odds matrix. (a) Projection of the matrix by multidimensional scaling (adapted from Robson & French (1983)). Amino acids which are close together exchange frequently. (b) The equivalent diagram to (a) produced by pseudo-energy minimization to a point at which each amino acid lies in a position which gives the minimum sum of squares over the distance equation shown. (c) and (d) Idealizations of (a) and (b) respectively. The properties associated with each quadrant are indicated with the property of lesser importance bracketed. (e) A further idealization of the two plots which are constrained to a circle. Ambiguities in the cyclic order have been reconciled by consideration of both original plots. The resulting order agrees with that obtained by Swanson (1984) except for the exchange of Arg and His as indicated by an arrow. (This probably arises from Swanson's use of the Dayhoff (1978) revised matrix). Swanson's nomenclature for the quadrant characteristics is also indicated. The one letter and three letter codes for the amino acids are as follows: Gly G, Ala A, Val V, Leu L, Ile I, Ser S, Thr T, Asp D, Glu E, Asn N, Gin Q, Lys K, His H, Arg R, Phe F, Tyr Y, Trp W, Cys C, Met M, Pro P, asx b, glx z.
208
WILLIAM
RAMSAY
I/
/ I
II ('~ ( T~.
/./('TY , ~.
/
ii I I
( v 7"-'.
\
I
I
t I-~-~
TAYLOR
/'
....'////I II
..~
I
t, KJ
tX f
/.-"--~ H }
I
FIG. 2. Chemical relatedness o f the amino acids as quantified by McLachlan (1972) displayed on the circle of residues derived from Dayhoff's matrix (Fig. l(e)). The degrees o f expected conservation are indicated as follows (with McLachlan's score in brackets). Strong conservation of type = heavy circle (6); Moderate conservation o f type = lighter circle (5); Strong conservation of properties = heavy line (3); Moderate conservation of properties = finer line (2); Weak conservation o f properties = broken line (1). Most connections are local (i.e. chemistry ag[ees with mutation data) with the exception of long links from hydrophobic residues to Pro and Ser & Thr to Ash & Gin and the absence o f links between Gly & Pro and Glu & Asp.
from the relative lengths of the ellipses derived from Dayhoff's matrix (see Figs. l(c) and l(d)) in which the longer axis is associated with change of size indicating that this, on average, is dominant over hydrophobicity. Grantham (1974) scaled three properties to McLachlan's (1971) matrix of amino acid substitution frequencies. These were volume, polarity and the fraction of carbon in the side-chain (composition). He found the most dominant property was again volume followed by polarity. Secondary structure propensities
From statistical analysis of the sequences of proteins of known structure, propensities to adopt a secondary structure have been determined for each amino acid (e.g. Chou & Fasman, 1974; Gamier et al., 1979). These preferences have been used to account for clusters of amino acids which are unexpected on a physico-chemical basis. A clear example is the close association of G, P, D and E (see Fig. 1). These associate because of a propensity to lie in sharply turning regions on the surface of the protein. Gly, because of the flexibility it imparts to the local chain; Pro because of the built in turn configuration created by its back-bonding side chain; and Asp and Glu because of a requirement to expose,their charges to solvent. Robson & French (1983) indicate other instances including the close association of
THE CLASSIFICATION OF AMINO ACID CONSERVATION
209
Glu and Ala both of which favour a-helical structure, and the tight association of L, I, M, V and to a lesser extent, F and Y which tend towards fl-structure. 3. Venn Diagram of Amino Acid Sets The idea of using a Venn diagram to represent the different relationships among the amino acids was adopted from Dickerson & Gels (1969) and extended to incorporate some of the observations discussed above. The overall layout of the diagram (Fig. 3(a)) was based on the 2-D arrangement derived from Dayhoff's mutation matrix. Amino acids were then displaced (by as little as possible) from this arrangement to form groups of residues related by common physio-chemical properties. (A) SIZE AND H Y D R O P H O B I C I T Y
The major sets group the amino acids by size and hydrophobicity: both properties which were seen to dominate the structure of Dayhoff's Matrix. Two overlapping sets were used to describe hydrophobicity. One was defined as all amino acids which have a polar group in their side-chain and is referred to as polar. The second group is less well defined and contains the amino acids which were considered to be hydrophobic. This set contains some amino acids which have polar side-chains. These consequently lie in the intersecting region of the two sets which can be considered the set of amino*acids which are ambivalent to water. The inclusion of Lys in this set is justified by its long aliphatic side-chain which has been observed (Cohen et al., 1982) to extend from a buried location and expose the terminal charge to solvent. The location of Pro in Fig. I conflicts with its hydrophobic character. It was, thus, left unclassified by the two sets hydrophobic and polar. Similarly, Gly is often considered to lack a side-chain and be consequently unclassifiable in hydrophobic terms (e.g. Rees & Sternberg 1984). However, as Gly is often found buried in the interior of proteins it was classified as hydrophobic. The volume of the side-chain was considered to be sufficiently important to justify classification by two sets. The larger of these, called small, contains the nine smallest amino acids by side-chain volume (Klapper, 1971) each less than 60/~3. A subset o f this, called tiny, includes the four acids with less than three (non-H) side-chain atoms all of which are smaller than 35 A3. The relationship of Cys to the sets defined above is rather ambiguous. Although its side-chain has only two atoms, the sulphur atom is relatively large, placing it on the Tiny-Small borderline. Its classification is further complicated by the occurrence of the sulphur in two oxidation states. The reduced form contains a polarizable S - H bond which suggests a similarity to Ser ( O - H ) and with which it is associated in McLachlan's table (Fig. 2). On formation of a disulphide bond, however, this property is lost, placing the residue more firmly in the hydrophobic camp. The associated loss of conformational freedom is difficult to assess but may be associated with an effective increase in volume as the linked residue cannot accommodate as easily to structural fluctuations. Poor packing in the hydrophobic core of the
210
WILLIAM
RAMSAY TAYLOR
(o)
hny
o
sitive charged (b)
\v\ /
!
\\
FIG. 3. (a) The Venn diagram shows the relationship of the 20 naturally occurring amino acids to a selection of physio-chemical properties which are important in the determination of protein tertiary structure. The diagram is dominated by properties relating to size and hydrophobicity. The amino acids are divided into two major sets, one containing all amino acids which contain a polar group (polar) and a set which exhibit a hydrophobic effect (hydrophobic). A third major set, small, is defined by size and contains the nine smallest amino acids. Within this is an inner set of smaller residues, tiny, which have at most two side-chain atoms. The location of Cys is ambiguous as the reduced form (CH) has similar properties to serine, while the oxidized form (Css) may be more equivalent to Val. Other sets include full-charge (referred to as charged) which contains the subset positive (negative is defined by implication) and aromatic and aliphatic. The latter set is not as general as the name implies and contains only residues containing a branched aliphatic side-chain. Because of its unique backbone properties, praline was excluded from the main body of the diagram. An equivalent exclusive position is suggested for Gly by a small G. (b) An alternate representation of the relationships. A network of amino acids is formed by connecting those which differ by no more than two properties in the Venn diagram. Pairs which share the same subset are connected by a heavy bar, those with only one different property by a bold line and those which differ by two properties by a fine line. To improve clarity a few of the latter connections are omitted. Many of these only connect to one of an unresolved pair (e.g. I and L), and it can be assumed that the connection is made to both. Because of its unique position, Pro is able to make quite long range connections some of which are indicated by broken lines. immunoglobulins
near the intra-molecular
disulphide
l i n k is s u g g e s t i v e o f t h i s
( C o h e n et al., 1981, T a y l o r , 1981). C o n s i d e r i n g all t h e s e a s p e c t s , t w o l o c a t i o n s a r e suggested for Cys on the Venn diagram. However, a location in the sub-set containing T h r is a l s o p o s s i b l e .
THE
CLASSIFICATION
OF AMINO
ACID
CONSERVATION
211
(B) O T H E R SETS
The remaining set allocations are based on obvious physico-chemical properties. These include aromatic (ring containing side-chains) and aliphatic. The latter set, however, includes only amino acids with branched aliphatic side-chains and largely reflects the frequency with which this type of residue is found in/3-pleated sheet structure. The set of charged amino acids contains only those which are normally (or often) fully ionized. The subset positive is included, with negative defined by implication. For simplicity, as few sets as possible were introduced, yet even these produce almost complete segregation of the amino acids with only Y-W, I - L and A-G grouped in the same sub-sets. Additional sets can easily be imagined but these often only create further distinction between residues which are already segregated. However, an important property not well represented is hydrogen-bonding ability. To distinguish the sets of hydrogen-bond donors and acceptors on the Venn diagram (Fig. 4) would greatly reduce clarity they are thus indicated separately in Figs 4(c) and 4(d). (C) N E T W O R K R E P R E S E N T A T I O N
A useful representation of the Venn diagram can be made by removing the set boundaries and connecting adjacent residues to form a network (Fig. 3(b)) in which no connected pair differ by more than two properties. In this form the structure is more easily compared to other representations and the circular arrangement of amino acids corresponding to Fig. 1 is readily apparent. The most significant deviation from the form of Dayhoff's matrix is the separation between the negatively charged amino acids and Pro and Gly. This feature corresponds more to the physico-chemical relationships defined by McLachlan (Fig. 2). However, leaving Pro (and perhaps Gly) unclassified with respect to hydrophobicity allows connections to be made with both hydrophilic (S, N, Q) and hydrophobic residues (F, M, L, I, V and A). Some of these longer connections are indicated in Fig. 3(b) by broken lines. It is also possible to formally reduce the Venn diagram (or network) to a tree structure of the type described by Jimrnez-Montafio (1984). However, unless the corresponding tree contained multiple amino acid entries information in the Venn diagram would be lost. 4. Minimal Set Assignment The Venn diagram (Fig. 3(a)) represents a compromise between mutation data and chemistry. A similar compromise might have been achieved in a less graphical way by calculating a distance matrix of the type described by Grantham (1974). Such tables of relatedness are useful for considering amino acid changes without regard for the local environment in the protein structure in which the change occurs (e.g. exposure to solvent, secondary structure, etc.) and, consequently, can be applied uniformly along any pair of sequences to produce an overall measure of relatedness
( C)
hI ~.vr.......
a)
tiny
ch ~ed
~2itive
charged
i:] i!i::~ !! ~ ~_~/ii }
u
R
tiny small
( d)
~1
liph a t i c - . _ _ ~ ~ ~
b)
__
charged
__ tiny
:barged
/ tiny small
,,3 ,,)
THE C L A S S I F I C A T I O N OF A M I N O ACID C O N S E R V A T I O N
213
b e t w e e n t h e m . W i t h g r e a t e r k n o w l e d g e o f p r o t e i n structures, h o w e v e r , it is increasingly c o m m o n t h a t s u b s t i t u t i o n s are c o n s i d e r e d in a s e q u e n c e o f k n o w n structure. This r e q u i r e s a n e w a p p r o a c h to assessing the s u b s t i t u t i o n o d d s for a given structural e n v i r o n m e n t . T h e use o f tables o f r e l a t e d n e s s , m o r e o v e r , c a n n o t b e easily a d a p t e d to tackle this p r o b l e m as every e n v i r o n m e n t w o u l d r e q u i r e its o w n t a b l e w h i c h m u s t be p r e v i o u s l y c a l c u l a t e d from s u b s t i t u t i o n s o c c u r r i n g in t h a t e n v i r o n m e n t . This w o u l d b e a very useful set o f tables to have b u t for p r e s e n t w o r k a m o r e flexible a p p r o a c h is r e q u i r e d . W i t h the v a s t i n c r e a s e in k n o w n p r o t e i n s e q u e n c e s it is n o w c o m m o n to have m a n y r e l a t e d s e q u e n c e s to c o m p a r e a n d n o t s i m p l y a pair. This t y p e o f p r o b l e m can, o f c o u r s e , be t a c k l e d using a d i s t a n c e m e a s u r e for all o r p a r t o f a s e q u e n c e a n d c o m p i l i n g a t a b l e o f h o m o l o g y for e a c h p a i r o f s e q u e n c e s . T h e result, h o w e v e r , o b s c u r e s t h e q u a l i t a t i v e d e s c r i p t i o n o f the c o n s e r v a t i o n a n d w o r k e r s c u r r e n t l y i n t e r e s t e d in this t y p e o f p r o b l e m are often n u c l e i c - a c i d b i o c h e m i s t s w h o h a v e little time for n u m e r i c a l a b s t r a c t i o n s . W i t h t h e s e p r o b l e m s in m i n d , a m e t h o d was d e v i s e d to use t h e sets d e f i n e d b y the Venn d i a g r a m (Fig. 3 ( b ) ) to d e s c r i b e s e q u e n c e alignments. This was, s i m p l y , to find the s m a l l e s t set o r s u b - s e t w h i c h i n c l u d e s all the a m i n o a c i d t y p e s b r o u g h t t o g e t h e r b y a l i g n m e n t . T h e resulting set will be r e f e r r e d to as the m i n i m a l set. A p p l y i n g this o p e r a t i o n o v e r the w h o l e l e n g t h o f the a l i g n e d s e q u e n c e s p r o d u c e s a q u a l i t a t i v e d e s c r i p t i o n o f the c o n s e r v a t i o n at every point. F u r t h e r m o r e , the n u m b e r o f a m i n o a c i d s w h i c h constitute the a s s i g n e d set give a m e a s u r e o f the d e g r e e o f c o n s e r v a t i o n at t h a t p o i n t r a n g i n g f r o m o n e (for a b s o l u t e c o n s e r v a t i o n o f t y p e ) to t w e n t y (the set o f all a m i n o acids). D e s p i t e t h e few sets i n c l u d e d in the Venn d i a g r a m , the n u m b e r o f p o s s i b l e subsets is very large a n d i n c l u d e s m a n y which h a v e little p h y s i c a l m e a n i n g . F o r e x a m p l e ; it is difficult to i m a g i n e a structural e n v i r o n m e n t o r f u n c t i o n w h i c h w o u l d p r o m o t e the c o n s e r v a t i o n o f the set f o r m e d b y t h e u n i o n o f aliphatic a n d positive (V, I, L, H, K, R). A list o f a l m o s t seventy sets a n d subsets has, thus, b e e n c o m p i l e d w h i c h m i g h t be m a i n t a i n e d b y s t r u c t u r a l s e l e c t i o n pressures. T h e s e n a t u r a l l y consist o f the u n i o n a n d i n t e r s e c t i o n o f sets w h i c h o v e r l a p in the Venn d i a g r a m ( g r a p h i c a l e x a m p l e s o f two o f these are s h o w n in Figs 4(a) a n d (b)). O n e set w h i c h is an e x c e p t i o n to this was c r e a t e d to reflect the c o n s e r v a t i o n o f G l y , Pro a n d t h e n e g a t i v e l y FIG. 4. (a) and (b) indicate how two subsets are derived from the sets indicated on the Venn diagram (Fig. 3(a)). (a) is formed by the amino acids that are hydrophobic but not polar (or in set nomenclature: hydrophobic ^ ~ polar, where - indicates negation and A indicates set intersection). In the alternative nomenclature defined in the legend to Fig. 5 this becomes hydrophobic non-polar but as it is a commonly assigned set it is given the "trivial" name of very-hydrophobic. (b) illustrates a more complex derivative sub-set formed by tiny u (small A (polar A-hydrophobic)), where u indicates set union. In Fig. 5 polar ^ ~ hydrophobic is given the trivial name of hydrophylic and union is indicated by. or., producing the more comprehensible name of tiny. or. smaU_hydrophylic. (c) and (d) indicates the two sets of hydrogen-bond donors (c) and acceptors (d). Their definitions were taken from Baker & Hubbard (1984). It should be noted, however, that hydrogen bonds involving Cys are rare and that Met might possibly receive a hydrogen-bond. Also, the classification of His is complicated by its frequent change of ionisation state as only the unprotonated state can receive a hydrogen bond. For clarity, and as there is still some uncertainty in these sets, they were not included in the main Venn diagram despite their obvious structural importance.
214
WILLIAM
RAMSAY
Z9 "POSITZVE ° .~O " C H A R G E D " "CHARGE0_non-H ° °Negatzve" (CHARGEC_non-POS[TZVE) "Hydroohy[~c non-POSITIVE* "PydrOphylic" (POLAR_non-HYOROPHOBIC) *CHARGC_O.or. Hydrophylic ° "CHARG~O.or. Hydro~hyiic.or.P* "POLAR non-ARCMATIC.or'.CHARGEC.oP.P" 40 °POLAR ~ °POLAR.or.P" "POLAR_non-ARCHATIC.or.CVARGED" °POLAR_non-ARCMATIC_non-POS[TIVE.or.P ° "POLAR_non-ARCHATIC_non-PO$ITIV-" *SHALL_POLAR,at.P" "SHALL_POLAR* "SHALk_Hydrophylic" SO " T [ N Y " °TINY.o~.SHALL_POLAR" "TINY.or. SHALL P O L A r . o r . P " °TINY.or. Negative_HydrophyIic.or.T" "TItlY.or. NegativeHydrophylic.or.T.or.P" ~'TZN¥.or. POLAR_non-ARGHATIC ° "TINT.or. POLAR_non-ARCHATIC.or.P" °TINY.or. POLAR" "SHALL.non-P.cr. PCLAR ° ~0 "SHALL.or. POLAR" °SHALL_non-P.or. PGLAR.non-AROMATIC" "SMALL.or. POLAR n o n - A R O H A T I r ° °SHALL non--P.cr. Hydrophyli¢ ° 65 *SHALL" "SHALL_non-P" "S/4ALL_HVOROPHOBIC.oc. TIN¥° "SHALL_H¥OROPHOBIC ° 70 °SHALL_non-POLAR non-P" "SHALL_non-POLAR ° "5HALL_non-Hydrophylic" "ALIPHATIC.Or. SMALL_non-Hydrochylic ° °ALIPHATIC.or. SHALL non-PCLAR ° "ALIPHATIC.or. SHALL_HYORGPHCSIC ° "ALIPflATIC ° 20 "AL[PHATIC.or. Lar~,e_non-PCL~R* "Very-hydrophcbic" (HYORCPHOBIC.non-POLAR) "Very-hydrophobic.or.P" °Very-hydrophobic.or.T" *Very-hydrophobic.or.T.or.P" "Very-hydrophobic.or.T.or.K ° "Very-hydrophobicoOr.$/4ALL non-P.or.K" °Very-hydrophobic.or.$HALL.or.K ° "HYO~OPHOBIC.or°SMALL ° "HYOROPHOB[C.¢roSHALL non-P" 90 "HYOROPHOBIC" °H¥OROPHOBIC.or.p ° "AROHATIC.or. Ve~y-hydrophylic" "AROHATIC.or. ALIPHATIC.oc.H" °A~OMATIC.or.H ° 95 "AROHaTIC" °Largenon-Negative ° °Large_PCLAR ° °Large_non-ALIPHATI r" "Large" (non-SHALL)
b b b b b b b b b b
b b b b b b b b b b
TAYLOR
Z Z z Z z Z Z Z z z b b b b b z z Z z Z Z Z Z Z Z b b
b b b b
z z Z
R 0 D 0 S S S S P T P T P T P T S A A P A P A P A V P V P V P V V V V V V L L L L F F P F P F F F H H H H H M H H R Q Q Q
K H E R E R E N C N 0 N D N G T $ $ N T $ S N T S $ N T $ S N N n G S G T A G G T A G G T A G G T C A V C C A V C C A V C C A C A C A C A C A C A I V I V I V I V H L /4 L F H 14 L F P /4 L /4 L M L I~ Y W Y W Y U Y M T W Y M I' Wlr K H E R E R E R
K
H
EQ EQR ECRKH EGRKHP N O E O R K H O E Q R K H W Y N O E Q R K H W Y OEQRKH NOE~ CEQ ND D
SND TSNC SNOEO T S N D E G S N O E Q R K T S N C E C R K S N O E O R K H G T S N O E Q R A G T S N O E Q G T S N O E Q R A G T S N O E Q G T S N O E Q R A G T S N O GTSNO GTS GT G GP GPT CAGPT CAGP CAG Iv I V C A G L I V C A G I v C A G T L I V C A G I V C A G T I V C A G T I V C A G T F H L I V C F M L I V C F H L I V C F H L I V C F E L [ V C F H L I V
T K K K A A A A A
S S G G G G G
F/4
F W V K H
F ~
¥
K H W Y F H K H W y F H L [
M K R K R
T H W ¥ K H W Y K
N N T T T T
0 O K K K K
P S N O P S N O P
THE
CLASSIFICATION
OF
AMINO
ACID
CONSERVATION
215
c h a r g e d a m i n o acids in the b e n d regions o f p r o t e i n structures. O t h e r m i n o r sets c o n s i s t i n g o f closely r e l a t e d pairs, i n c l u d i n g Ser a n d Thr, Phe a n d Tyr, a n d Arg a n d Lys, w e r e also a d d e d . All these sets are d e f i n e d in Fig. 5 w h e r e t h e y are d e s c r i b e d using a n o m e n c l a t u r e a d a p t e d from set logic. A few e x a m p l e a p p l i c a t i o n s o f the m i n i m a l set a s s i g n m e n t s to a l i g n e d s e q u e n c e f r a g m e n t s are s h o w n in Fig. 6. To give s o m e i m p r e s s i o n o f h o w set a s s i g n m e n t varies with the n u m b e r o f a l i g n e d s e q u e n c e s a n d the overall h o m o l o g y o f the s e q u e n c e s , a p r o g r e s s i o n is s h o w n from a few c l o s e l y r e l a t e d s e q u e n c e s o f a p a r t i c u l a r i m m u n o g l o b u l i n d o m a i n to an e x t e n d e d a l i g n m e n t o f t h e s a m e d o m a i n (Fig. 6). In these it is c l e a r that c o n s e r v a t i o n is m a i n t a i n e d m a i n l y in the regions o f s e c o n d a r y structure. This o b s e r v a t i o n can b e quantified b y p l o t t i n g the set size at e a c h p o s i t i o n in the a l i g n m e n t . I n Fig. 7 this is d o n e for different n u m b e r s o f closely r e l a t e d s e q u e n c e s o f t h e i m m u n o g l o b u l i n K-chain l i g h t - v a r i a b l e d o m a i n . O f these s e q u e n c e s the B e n c e - J o n e s p r o t e i n R E I has a k n o w n c r y s t a l l o g r a p h i c s t r u c t u r e ( E p p et al., 1975). T h e s t r a n d s o f / 3 - s t r u c t u r e f o u n d in this p r o t e i n lie in two sheets w h i c h s t a c k t o g e t h e r like a s a n d w i c h ( C o h e n et al., 1981). O n e side o f each sheet is b u r i e d while the o t h e r is e x p o s e d to solvent. As the a m i n o a c i d s i d e - c h a i n s in a / 3 - s t r a n d a l t e r n a t e l y p o i n t to e i t h e r side o f the s h e e t they are c o n s e q u e n t l y a l t e r n a t e l y b u r i e d a n d e x p o s e d to solvent. This s t r u c t u r e is reflected in the d e g r e e to w h i c h they are c o n s e r v e d in Fig. 7 w h e r e a l t e r n a t e l y c o n s e r v e d a n d m u t a b l e p o s i t i o n s can b e seen in t h e / 3 - s t r a n d regions. T h e effect is m o s t clearly seen in s t r a n d s w h i c h d o n o t lie on t h e edge o f the fl-sheet. I n t h e s e the c o n s e r v e d p o s i t i o n s a r e g e n e r a l l y h y d r o p h o b i c . Plotting t h e d e g r e e o f c o n s e r v a t i o n with i n c r e a s i n g n u m b e r s o f a l i g n e d s e q u e n c e s r e v e a l e d the interesting o b s e r v a t i o n that m a n y p o s i t i o n s r a p i d l y a c q u i r e a d e g r e e o f c o n s e r v a t i o n t h a t is u n c h a n g e d b y the a d d i t i o n o f f u r t h e r s e q u e n c e s to the a l i g n m e n t . C o n s e r v a t i o n is r a p i d l y lost on a l i g n i n g the first 20 s e q u e n c e s in o r d e r o f d e c r e a s i n g h o m o l o g y to R E I (see Fig. 6 ( b ) ) b u t the a d d i t i o n o f a n o t h e r 50 s e q u e n c e s to the a l i g n m e n t causes little a l t e r a t i o n o f the c o n s e r v a t i o n profile.
FIG. 5. Intersection and union of the sets defined in Fig. 6 can produce a vast number of amino acid combinations. Those which, on an intuitive basis, seem most relevant to protein structure are defined below. These are generally unions and intersections of adjacent sets and thus produce groups of amino acids which share similar properties. Many of the sets have effectively two entries, one with proline and one without. The nomenclature used was chosen to be more readable than standard set notation and uses . or. as the inclusive "or" to represent set union. Intersection of two sets is represented by the underline character "_". Negation is indicated by the prefix "non-" and applies only to the set to which it is attached. This produces recognisable phrases such as "non-POLAR" and "SMALL_HYDROPHOBIC". Occasionally a commonly used set is given a "trivial" name: for example, the set of CHARGED_nonPOSITIVEis referred to as "Negative" and the set of HYDROPHOBIC_non-POLARas "Very-hydrophobic". Some sets which have rather long formal names are simplified by reference to individual amino acids using the one letter code, for example, the set of all hydrophobic residues excluding polar-aromatics is simply called "Very-hydrophobic. or. T. or. K". The ambiguous residue codes asx (b) and glx (z) are included when the two residues they represent occur in the set. They do not, however, count when the number of members in the set is considered.
216
WILLIAM
RAMSAY
TAYLOR
5. Conclusions In t h e r a p i d l y d e v e l o p i n g field o f p r o t e i n e n g i n e e r i n g it is i m p o r t a n t to have a m e a s u r e o f a m i n o a c i d c o n s e r v a t i o n t h a t c a n b e a p p l i e d to several h o m o l o g o u s s e q u e n c e s o f w h i c h at least o n e has a k n o w n t e r t i a r y structure. G e n e r a l m e a s u r e s o f a m i n o a c i d c o n s e r v a t i o n , such as D a y h o f f ' s l i k e l i h o o d m a t r i x are b e s t s u i t e d to s i t u a t i o n s w h e r e t h e r e is n o structural i n f o r m a t i o n a b o u t the s e q u e n c e s . T h e i r use b e c o m e s l i m i t i n g w h e n a p p l i e d to local r e g i o n s o f the p r o t e i n s e q u e n c e where, for structural r e a s o n s , the m u t a t i o n a l f r e e d o m o f a p a r t i c u l a r r e s i d u e m a y be g r e a t l y r e s t r a i n e d . W i t h k n o w l e d g e o f the local structure, h o w e v e r , it is p o s s i b l e to a n a l y s e these r e s t r i c t i o n s a n d use t h e m p r e d i c t i v e l y . A n e x a m p l e o f loss o f i n f o r m a t i o n b y a v e r a g i n g , w h i c h is a p p a r e n t in D a y h o t t ' s m a t r i x , is the resistance o f c y s t ( e ) i n e to m u t a t i o n . This, u n d o u b t e d l y , arises f r o m t h e e v o l u t i o n a r y n e e d to c o n s e r v e disulp h i d e b o n d s , b u t s u c h a r e s t r a i n t d o e s n o t a p p l y in the r e d u c i n g i n t r a c e l l u l a r e n v i r o n m e n t a n d to a p p l y it to the c o m p a r i s o n o f the s e q u e n c e s o f c y t o p l a s m i c p r o t e i n s is, t h e r e f o r e , m i s l e a d i n g . T h e c l a s s i f i c a t i o n o f a m i n o a c i d s d e f i n e d a b o v e , a n d its use in d e s c r i b i n g s e q u e n c e a l i g n m e n t s , a l l o w s the t y p e o f c o n s e r v a t i o n o b s e r v e d in s t r u c t u r a l " m i c r o - e n v i r o n m e n t s " to b e r i g o r o u s l y quantified. T h e i m p o r t a n t a s p e c t o f the a p p r o a c h is t h a t n o t o n l y c a n the d e g r e e o f c o n s e r v a t i o n b e m e a s u r e d , b u t the q u a l i t a t i v e a s p e c t o f the c o n s e r v a t i o n is also m e a s u r e d . T o g e t h e r these m e a s u r e s c a p t u r e v i r t u a l l y all the useful i n f o r m a t i o n t h a t c a n b e e x t r a c t e d f r o m a n u m b e r o f a l i g n e d s e q u e n c e s . S u c h i n f o r m a t i o n will b e o f use in d e s i g n i n g n e w m u t a n t s as a p r o t e i n e n g i n e e r can a n a l y s e a s e q u e n c e a l i g n m e n t to find, for every p o s i t i o n , the r a n g e o f p o s s i b l e a m i n o a c i d c h a n g e s t h a t m i g h t b e a c c e p t a b l e . O n a w i d e r front, the a p p r o a c h is b e i n g a p p l i e d to the a n a l y s i s o f r e s i d u e c o n s e r v a t i o n in well d e f i n e d s t r u c t u r a l m o t i f s
FIG. 6. (a) Alignment of the variable domains of the immunoglobulin kappa-chains (light-variable domain) found in the PIR database. The sequences run down the page and are represented in the one letter amino acid code (insertions are indicated by a bar "["). The numbers identify the position in the sequence of REI and next to these, heavy bars indicate regions of r-structure. To the right of the sequences the amino acid set (see Fig. 5) which best describes the alignment is indicated both by name and number. This description is indented in proportion to the number of insertions. The number just to the left of the set description is the number of amino-acids which constitute the set. This gives a measure of specificity and is the number plotted in Fig. 7. Absolutely conserved positions are indicated by the residue name only. The set assignments are divided into two regions. The outer assignments derive from the 23 sequences with greatest homology to REI while the inner is derived from the entire 73 sequences. The sequence names and their relative homology can be found in (b) which is a matrix of homology between every pair of sequences. The homology is calculated as a percentage of residue identity matches over all matched pairs in two given sequences and is entered in the matrix at a position cross-referenced by the two sequence names. For clarity, homologies over 70% are filled solid, those in the 60s are a dot, those in the 50s a "5" and those below 50% are blanked out. The entries in the matrix have been ordered such that most similar ~equences tend to be adjacent on the diagonal. This is achieved by minimising the second moment of homology about the diagonal; i.e. N
N
~, ~. H~j(i-j)2--~min I~1 J ~ l
where H~j is the percentage homology between proteins at positions i and j and N is the number of proteins. The increments of ten sequences each corresponding to a plot in Fig. 7 are indicated.
(a)
SEQUENCES
SETS
to
SETS
?3
to
Z3
?3 23 1 %
l offoD~ ~
~DOOB ~ 0 0 O O O D E E ~ E O E O & ~ O ~ O ? O O C O O ~ C ~ I ~ I
] ~E J E 0 ~ 0 ~ J ~ D ~ V g ~
10~O
~)
il
?i'SnJLL_n~-P.a¢,Hydr
o~
L ~C"
.......................... ~ t t ~ ,CC C t t t t t t t L t t t
V t L L L U L L O tUU~ ~THk Y V ~ . t ~ L U I . L ~~ V V V l t ~ VI t t VU' ' t
I~'V.,,-~,~,oo.obl..°,*S.AIL_.o.-P-o,.~"
~t ~ "#L~t ~V
Sl 15
t P , V ~ t t VVVV, t , , V V V V , Y V W ~ , ~ V~U~L~L ¥ VYLULk t t t t
VVVVVUt" t t L V ~ V ~ P l U ~ V tP V ° $ ~ V
~. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Lttti.[[Till[ll[Ill
[III[~LIt[L[ZITIIZt[IIII~Z[tlIITT~IIITIIItlIII
1111 ii1~111 ~ l sl&l& l iIill&,l
i i k~ &AIA I [ l & ~ l l l l l l l l
I J +j I I I l I I I J l I l l I I I r l I l l l )
AIIII~
,, .,r//.%.~..o,.._,o[
IFIIIIII
CCcCCC~CCCCCCeeeCCeCCC~CCCCCCCCCCCCCCCCCCCCCCCCCCC~CCeCCCCCCCCCeCCCCCCC
e'Vo,,-,,~,o~obl°"
7 t~°V°
?
~'~ot ~....-
~
t
o.
....
~c. °,. c . ~ -
I E l $ ~ 1 1 ~ 1~ l l ~ l ~1
~o
1t e l l ~ l r l l l
~O
[lllll/lllllIIIlllllJlllllllll~lllllJIIIIIIII I]11111JI1111111~11111]1111 IIIII rJ~[lllllJIjllll]lrllllli~llllljlllllllIlIIIIII]li~llolt~HIJ Ilrllo~l
)T] rT~S~T I I I r ~ . l I I . H l u E I D S T I ~ I S z t ~ I
~t
~F~t~?stO~tOtu~t~ttttnttT~tt
Iq I~'.,D.0P.OIXC.o,.S..~t_.°,-P"
~ttt~F~Et~t~t~r~rot~,~twlttte
t~P ~
~
i t t t u u t t t t ~ i ~ u t ~ E t t t u ~ v ° c u t t t t u u PUU ~ g ~ t IU~UUt t t t t t ~ t p t p t t t t . t t t t L I t t ~ t ~ t t L L t L t L L t t t t t ~ t t L L L L L t t L L L LL t L L L L d L L t t t k L L L L L ~ t t U L t L t L t H t ~ t ~ L t ~ t ~ t t t ~ t ~ L ~
~I
IIIIIlllIIIIIIIIIrlIIIIIII*IIIIIIIIrJIIIIIIII]IIIIIIIIIIIlIlIIIIIIIT$1111
~
ZlllrfllllllllJIillll~lllllllllllr
~l
I * ~ G I A T IC . i r . . t
~M&TIC..r .~"
~°..O~tX¢.o,.A~l~.~ttC.o,.~
°
L~U
11llJIzfillilijllrlf~l~)lTtflilllllrilllllJIIlllJIIlIIflllllllllllllTIIrl
iiI1 JIIiIflll11111
JIIIIIIIIIIlllltcllll
67
!ilii!i!iiiiiii;iiiiiiiiiiiiiiiiiiii!iiiiiiiii!!i i iiiiiiiiiiii!iSi!! !!l!{iiii!iiiiiiiiiiiii!iiii!!!!!!i!!!!!!!!!!;!!!!!}!!!!!!iii!iii!!!!!!!!!!!
6"~MALL_.~Dr0P.OIZC.or.t ZNT"
p.E
sltrtltTTtlttttttlttttTtTtTTttttTttTTTt~TTrTtttttTTttttTtttttttttttrTtTtT
~
°
~5 1 0 * t Z , V . o ~ . P O ~ J W , O . - A ~ O . A t l C " ?t ~'.tZ,.At~C" ,,
~z
ItlI~Z~IJIIIIliI~IJI~[IIIIH
H IIlll~ltlrJIrllllllll
~ ~
llllJilllllilillZl;ij)liJllllllS i HIIIII I in. illl 5 e IH IIJrlll~llllllrllH 12ZllllZrllll~liii HlIIrlllllJHiiil~illllliiilrH iLflJIJ IJ[ll H IIJH ligT]))llHiiiliHl~lllrrrlvvllllilillfl~lliH [lllllrx i i i i I l l I E l l i i l l r ) JJ I I I i l l i l i JN~) H lillll i j i i I~EI]111~11111~1 [ l i J iz i t i i1~
,,] iiiiiil;iliiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii!!!iiiiiiii!iiiiiii!iiiiiiiiil
~0 ~ . t D I ~ , , 0 1 1 C * ° , . ~ , , L L
',~ "?'~;'~1l;';;~11~;~1~11~;1;~'~l~2`['~'~`~;~:~[~'~`~;;~W'`~V`~J~ IIIIlll[lll]lglllllllllllllltlllllllllj~IllljllItlullllrlllllllllllllllll
~- ...................................
-
H IllllllllJIIIllAllrl
o
~t I ~ * V O , , - h , d ~ O ~ . O b I ~ _ O , . , . . , L _ , O , . P . O , . . " I~'S"',U.O,*P0~'n_.O~-IAO.AVlC"
Z'S~,LW."
I~'V*r,-.,~°°p,°b~.O,.~,ALC .°.-,.o~.~" ~'P0LI" -o.-..o~lrlc., C.,I~EO" -
•
6"De.''D
*
(b)
~
v
~
-
r~io~s
V
-
--
.~.
.
&~
-
3fg
.
_~
Was.
~oy=
W-c=.
Human
Mumo~
=
+4ous¢
--
~uman
-
, .
•
l~-~
__
5@5@555 ~ ~ ~ 5 5 6~55@5~ ~@~I 5 5 ~ 5 ~ ~5555~5~@~5~ 1 5 5 0 5 N 5 ~
~-~']
i 5o -
anti-ars
. . ~4o U6t.
Human
~ouse
-
--
.mj~on
-
r~31o~ re3ion
relian
Human
v
V-I v-I
v--_
r~ion
ion
-
Youse
re~zon
Rat 5~11 . ~o~se ~140
re
V-I n
chain c h ~
:hai
oh=it
V---
precursor
V-.':I re.acn v re~xonS
chain
c~ztn
chal
cn_~r
ch(in v re+xo. cra~n v reS~o~ chain procurtor
~auoz
~appa
~aooz ~ a o ~
K~@=
K>goa
KZ-~a
M ~ o ~
Ka~z
xeppa
~z)p)
-
-
Hul~n
.u.an
~e~ion re~iom
reZion
--
-
.
V
5 5 555~55~555555 5 5 5 5 555~5555555555~555555555 555 5555 , 55 55 5 5 5 ~ 5 5 5 T 5 5 5 ~ 5 ~ 5 5 ~ 5 5 5 5 ~ 5555~5 Imps:so 555~555~3555555555 55555555~5555~5 5 5 5 ~
5
~
2
~e~ons r(~ion
r.lion precursor
v
-
5
5 5 5
~
5
5
R(bbil ~abbi~ ~ab~it -
_
e ~~
5
~lJll
-
--
~555555~i~555~-
s~
515
555555
5 ~5
5
5
5
55 5
27~7..
~sll . . ahEO--5.
555
5
5
5
5
5
5
.5
217)
a
••• •
~5~
I
~ouse ~167 . . ~eUbil XP-1..
R~bbit
R~bbil ~ouse
GO~ . . . .
-
--
s
)5~55555555¢1~55555 ~ 55-~5~55~¢'~5~--=~ ~ 5 5 5 5 5 ~ 5 ~ 5 5 5 5 5 5 ~ 5 5 5 5 5 5 5 ~ ' - 5 5 5 5 5 5
T
-66
O5555555555555~
5~ 5~5 5 55 5 5 555 5 ~ 0555555 5 5 5~555~ 5
-
R~9--21 75~7..
~-
55 5 5~5~ 55 ~lB kS~ 5 SS i~5~5~5 5
~5~5~)¶55
5555 ))))
1555555~555~5
.
~ous
il
55-5.. 41~-S)~7~..
~bbit
))bh,
so~
-
region region
page
5
~555555555555~555~5~5~5¶5~555 U ~ 5 5 ~ 5 5 5 ~ 5 5 5 ~ 5 5 5 5 5 5 5 5 5 ~ 5 5 5 ~¢H~5~555555555~55~555555~5~5 ~ ~ ~ 5 ~ s i m 5 5 5 5 9 5 5 ~ 5 5 5 5 ~ 5 5 5 5 5 5 j i 5 5 5 5 ~ 5 5 5 T 5 5 5 5 5 ~ 5 ~ 5 5 ~ 5 5 5 5
5 5
5
5
~ 0 ~ 5 5 5 5 ~ 5 5 5
55
5 ~
-
-
--
-
v
regicn
re~io~
re~io~
v
5 5+ 5 i ~
li~e~e~IQe~5~T555~5555e~i~55~5~STS~5
re~1or
re(jxor
region
v
v
+
~ @ ~ @ ~ 5 5 5 5 5 5 5 5 5 1 1 ~ @ 0 ~ S ~ 5 5 5 ~ 5 5 5 ~ 5 ~ I ~ 55 5 5 5 5 5 5 5 5 5 5 5 ~ 5 5~55
.
51~7a..
PC2~'0.
MO
.5.+ , 5 5 5 5 5 5 5 5 5+. .5.~. . : 5 ~ 5 5 ~ 5 T ~ ~555(5 5)5 5~5 55 ~ I ~ e 6 6 5 5 5 5 5 5 5 5 5 5 5~55 5 gSQ505555555555~5~5~55555 ~
1~ 5~ ~5 55 5 ~ 5 5 5 5 5 5 5 5 5 5 5 5 ~ 5 5 5
/
.
I~ouse FOUS,
m ~ouse PcZlS& ¥ regio~s -
--
5
5-~5:~i~5 kg5~0~55~5555~55~55~55~55~555555
2___T
.
sie. T£
WOl.
-- H u m a n k i n . House CSPC iC' -- ~ o u s e pC?C~I..
~i~n ~uman
Human
Human AU Human K& . = ~uman C~r.. - Hu.wn oee.. Lay.. Cu
v-. • reolon -v-: re3io. v-! ~e9~o. ¥-1 ~e~lon V-: to,ion chain
oh;in
chain
chri~ ~am~ iapcc ~ a
,. . . . . . . . . . . .+o
chain
chai~
chai~ ch~n
v chugchain
chain c~aln
precursor V relions ~ous v re~ions MouSm PCb~ &. v re~ions - m~use ~,c'_7¢I V regions -- M o u S e PC2~85
choin V--IV r;li~n chain v region chain ¥ ~e~ions
K~)a c h ~ n V--IT{ • ~ a = h ~ v-r:*. Kn~pa c~axn v-tI:
Ka~a
K~I}~
~ap)a
• ~pa
• a=pa
K~OO) ~ap~a
~ a ~
~appa Kapoa
~)p)a
,
-
~ap~ chain v-:I re~o~ ~u~a~ ~e= . ~@pp6 Chain V ~eg~on -- M O ~ 5 @ )~--l~.. K~pp~ ch~in v-:{ r~ion Hu~(n HI) ~ p o a c~i~ v re~ion ~ouse ~C7~ ~- . ~ac=~ chal, v-I: ~e~ior w~man Cue K~po~ chain v region5 -- ~ o u S e x~,.4~
53
c~zi~ chazn
v
~appa era'In v regior • aOP~ chain v region ~ p p a chair V re~ion
• apoa Kapp)
v
v
Chair chtin
chain
chain c ~ i ~
~ZDD a kappa
Ka~p~
~ora
Kapp=
K~=oa chain v region -- R ~ b b i t ~ - 2 ~ . . V re ~ ion Ra b b i ~ ~ 5 = ==opa c h ~in " oh(in v ca;ion mrecurscr eebbi Ch~3P v Te~ior -- ~ o u s e J53~ . .
73
(facing
THE
CLASSIFICATION
OF
AMINO
ACID
217
CONSERVATION
. ~ '
0
o
"1:
~ ~,o~m
O~
0
• ~
~
~r
0
0
U
N
~
---,=
=
0
u~
0-0
o
~-~
,~
~.~-~
= ~=~
.
e.I
I:::
~',~.o
.--'--t~
0
oE
~
~
G
~.,~'~= ~'~:~,.~ o ~ o o ~'~ 0
o ~
I:~o
e '~'<
~ ~=
0
o
~'~
o ~
~
0
~..
~
o ° ~<,=
• ~
O ~
.- ~
o ~ o ~ _ ~ o ~ ' ~ = ~
--
0
~,j
~a~a..~
~,
218
WILLIAM RAMSAY TAYLOR
( s u p e r - s e c o n d a r y structures) c o m m o n l y f o u n d in g l o b u l a r p r o t e i n s ( S i b a n d a & T h o r n t o n , 1985; T a y l o r et aL, in p r e p a r a t i o n ) . I n these structures it is f o u n d that residue v a r i a t i o n is r e s t r a i n e d at p a r t i c u l a r l o c a t i o n s in the motifs for general structural reasons. T h e o b s e r v e d patterns o f c o n s e r v a t i o n can t h e n be used to predict the o c c u r r e n c e o f the structural m o t i f i n a s e q u e n c e o f u n k n o w n structure u s i n g p a t t e r n r e c o g n i t i o n t e c h n i q u e s such as the t e m p l a t e m a t c h i n g m e t h o d o f T a y l o r & T h o r n t o n (1983, 1984) a n d T a y l o r (1986). This work was begun while the author was a research fellow at IBM-UK Scientific Centre, i~Vinchester, and the author thanks IBM for support and computing facilities. The author is currently supported by the SERC as an advanced fellow and thanks Prof. T. L. Blundell and Dr J. M. Thornton for valuable discussion.
REFERENCES BAKER, E. N. & HUBBARD, R. E. (1984). Prog. Biophys. tooL BioL 44, 97. BARKER, W. C., HUNT, L. T., ORCU'I'F,B. C. et aL (1984). Protein Identification Resource. Release 3.0. BLUNDELL, T. L., SIBANDA,L. & PEARL, L. (1983). Nature, Lona~ 304, 273. CHOU, P. Y. ,fr FASHMAN,G. D. (1974). Biochemistry 13, 211. COHEN, F. C., STERNBERG, M. J. E. & TAYLOR,W. R. (1981). J. tool BioL 148, 253. COHEN, F. C., STERNBERG, M. J. E. & TAYLOR,W. R. (1982). J. tooL Biol. 156, 821. DAYHOFF, M. O. (1972). Atlas of Protein Sequence and Structure. Washington DC: National Biomedical Research Foundation. DAYHOFF, M. O. (1978). Atlas o f Protein Sequence and Structure. Supplement 3. Washington, DC: National Biomedical Research Foundation. DICKERSON, R. E. & GEIS, I. (1969). The Structure and Action o f Proteins. Ch. 1. New York: Harper & Row. EPP, O., LATTMAN,E. E., SCHIFFER, M., HUBER, R. & PALM, W. (1975). Biochemistry 14, 4943. FITCH, W. M. (1966). J. tool. Biol. 16, 9. FRENCH, S. & ROBSON, B. (1983). J. tooL EvoL 19, 171. GARNIER, J., OSOUTHORP,D. J. & ROBSON, B. (1978). J. tool. Biol. 120, 97. GRANTHAM, R. (1974). Science 185, 862. JIM~NEZ-MONTAI~O,M. A. (1984). Bull. math. BioL 46, 641. KLAPPER, M. H. (1971). Biochim. biophys. Acta 229, 557. MCLACHLAN, A. D. (1971). J. tool BioL 61, 409. McLACHLAN, A. D. (1972). J. tool. Biol. 64, 417. RUES, A. R. C. &"STERNBERG,M. J. E. (1984). From Cells toAtoms. Ch. 1. Oxford: Blackweil Scientific. SANDER, C. & SCHULTZ,G. E. (1979). J. tool. Evol. 13, 245. SCHULZ, G. E. Rr SCHIRMER, R. H. (1979). Principles o f Protein Structure. Ch. 1. New York: SpringerVerlag. SIBANDA, B. L. ,~" THORNTON,J. M. (1985). Nature, Lond. 316, 107. SNEATH, D. H. A. (1966). J. theor. Biol. 12, 157. SWANSON, R. (1984). Bull Math. Biol. 46, 187. TAYLOR, W. R. (1981). D. Phil. thesis, University of Oxford. TAYLOR, W. R. (1986). J. tool. Biol. 188 (in press). TAYLOR, W. R. & THORNTON,J. M. (1983). Nature, Lond. 301, 540. TAYLOR, W. R. & THORNTON,J. M. (1984). J. tool. Biol. 173, 487. TAYLOR, W. R., THORNTON,J. M., BARLOW,D. J., SIBANDA,L. & EDWARDS, M. (in preparation).