J. Mol. Biol. (1992) 224, 461-471

Analysis of Insertions/Deletions in Protein Structures

Stefano Pascarella¹,² and Patrick Argos¹

¹European Molecular Biology Laboratory, Postfach 10 22 09, Meyerhofstrasse 1, W-6900 Heidelberg, Germany

²Dipartimento di Scienze Biochimiche e Centro di Biologia Molecolare del Consiglio Nazionale delle Ricerche, Universita' La Sapienza, 00185 Roma, Italy

(Received 8 July 1991; accepted 28 November 1991)

An analysis of insertions and deletions (indels) occurring in a databank of multiple sequence alignments based on protein tertiary structure is reported. Indels prefer to be short (1 to 5 residues). The average intervening sequence length between them versus the percentage of residue identity in pairwise alignments shows an exponential behaviour, suggesting a stochastic process such that nearly every loop in an ancestral structure is a possible target for indels during evolution. The results also suggest a limit to the average size of indels accommodated by protein structures. The preferred indel conformations are reverse turn and coil as are the preferred conformations at the indel edges (N- and C-terminal sides). Interruptions in helices and strands were observed as very rare events.

Keywords: protein structure; evolution; insertions/deletions; structure comparison

0022-2836/92/060461-11 $03.00/0 © 1992 Academic Press Limited

1. Introduction

As the number of protein primary and tertiary structures increases, it becomes feasible to compare them. Homologous proteins differ not only in point mutations but also contain insertions and deletions of varying length. These are, nonetheless, required to maintain a reasonable and coherent match of equivalent regions in sequences and tertiary structures (e.g. see Argos & Rossmann, 1979; Taylor & Orengo, 1989; Sali & Blundell, 1990). Since an insertion in one sequence of an aligned pair necessarily implies a deletion in the remaining sequence, we will refer to an insertion/deletion as an indel (Kruskal, 1983). Indels as a class of protein structure modification are still relatively unexplored. An understanding of their peculiarities can shed light on the mechanism of protein evolution and open new possibilities for biotechnological improvement in natural enzymes. We will describe here indel characteristics observed in several protein families, each consisting of multiple structural superimpositions and sequence alignments. The results include the distribution of mean indel length, average intervening sequence length and number of indels with the percentage of residue identity observed for given sequence pairs. Secondary and tertiary structural characteristics of indels and surrounding residues will also be discussed.

2. Materials and Methods

(a) Data

The research reported here has been carried out with a collection of protein structural families, each consisting of non-redundant and multiple tertiary structural superimpositions and resulting primary sequence alignments (Pascarella & Argos, 1992). Unique 3-dimensional structures were taken from the July 1990 release of the Protein Data Bank (Bernstein et al., 1977). The sequence alignments resulting from their Cα superimposition have been taken from the literature whenever possible; in a few cases the spatial equivalencing was determined using the routine of Rossmann & Argos (1976) and Argos & Rossmann (1979). Secondary structure assignments were derived from the approach of Kabsch & Sander (1983). In the case of homomultimer proteins, only 1 subunit was used. Monomers of the heteromultimers were considered as individual structures. For multidomain proteins, when the domain folds were similar, they were considered as individual entities and superimposed (e.g. the 4 domains of wheat germ agglutinin). If the domain topologies differed and one or more crossed familial boundaries, they were also treated individually. The structural families used in this research are reported in Table 1. Sequence alignments were non-redundantly selected from the HSSP collection (Sander & Schneider, 1991) for each of the protein structures included in the multiple structural superimposition. Sequences in HSSP were originally taken from the SWISS-PROT collection (Bairoch & Boeckmann, 1991). Details regarding this databank

Table 1
List of structural families used in the present work

Each family† is followed by its members: PDB‡ code, protein (source) and, where domains or chains were taken, the domain§.

Cytochromes
  2CCY  Cytochrome c' (Rhodospirillum molischianum)
  156B  Cytochrome b562 (Escherichia coli)

Aspartic proteinases
  1CMS  Chymosin B (bovine)  Res. 1-175
  1CMS  Chymosin B (bovine)  Res. 176-323
  4APE  Endothiapepsin (chestnut blight fungus)  Res. 2-174
  4APE  Endothiapepsin (chestnut blight fungus)  Res. 175-326
  2APP  Penicillopepsin (fungus)  Res. 1-174
  2APP  Penicillopepsin (fungus)  Res. 175-323
  2APR  Rhizopuspepsin (bread mold)  Res. 1-178
  2APR  Rhizopuspepsin (bread mold)  Res. 179-325

α/β Barrel proteins
  2TAA  Taka-amylase A (Aspergillus oryzae)
  1WSY  Tryptophan synthase (Salmonella typhimurium)
  1TIM  Triose phosphate isomerase (chicken)
  1GOX  Glycolate oxidase (spinach)

Binding proteins
  2ABP  Arabinose binding protein (E. coli)
  2LIV  Leu/Ile/Val binding protein (E. coli)
  2LBP  Leu binding protein (E. coli)

Carbonic anhydrase
  1CA2  Carbonic anhydrase II (human)
  2CAB  Carbonic anhydrase form B (human)

Calcium-binding proteins
  3CLN  Calmodulin (rat)
  3CPV  Ca-binding parvalbumin B (carp)
  3ICB  Ca-binding protein (bovine)
  4TNC  Troponin C (chicken)

Crystallin
  1GCR  γ-II crystallin (calf)  Res. 1-39
  1GCR  γ-II crystallin (calf)  Res. 40-77
  1GCR  γ-II crystallin (calf)  Res. 88-128
  1GCR  γ-II crystallin (calf)  Res. 129-174

Cytochrome c
  451C  Cytochrome c551 (Pseudomonas aeruginosa)
  1CCR  Cytochrome c (rice)
  1CYC  Ferrocytochrome c (tuna fish)
  5CYT  Cytochrome c (bonito fish)
  3C2C  Cytochrome c2 (Rh. rubrum)
  155C  Cytochrome c550 (Paracoccus denitrificans)

Cytochrome c3
  1CY3  Cytochrome c3 (Desulfovibrio desulfuricans)
  2CDV  Cytochrome c3 (D. vulgaris)

Dihydrofolate reductase
  3DFR  Dihydrofolate reductase (Lactobacillus casei)
  4DFR  Dihydrofolate reductase (E. coli)
  8DFR  Dihydrofolate reductase (chicken)

Inhibitors I
  1CSE  Eglin c (leech)
  2CI2  Chymotrypsin inhibitor II (barley seeds)

Nucleotide binding I
  1PHH  p-Hydroxybenzoate hydroxylase (P. fluorescens)
  3GRS  Glutathione reductase (human)
  3GRS  Glutathione reductase (human)

Ferredoxins
  1FDX  Ferredoxin (Peptococcus aerogenes)  Res. 1-54
  1FDX  Ferredoxin (P. aerogenes)  Res. 27-54
  4FD1  Ferredoxin (Azotobacter vinelandii)  Res. 1-106
  4FD1  Ferredoxin (A. vinelandii)  Res. 31-57
  1FXB  Ferredoxin (Bacillus thermoproteolyticus)

Globins
  4HHB  Haemoglobin (human)  Chain α
  4HHB  Haemoglobin (human)  Chain β
  2MHB  Haemoglobin (equine)  Chain α
  2MHB  Haemoglobin (equine)  Chain β
  1FDH  γ-Globin (human)  Chain γ
  1MDB  Myoglobin (whale)
  1MBS  Myoglobin (seal)
  2LHB  Haemoglobin V (sea lamprey)
  1ECA  Erythrocruorin (chironomous)
  2LH1  Leghaemoglobin (lupin)

Immunoglobulins
  2FB4  Fab Kol L chain (human)  Res. 1-109
  2FB4  Fab Kol L chain (human)  Res. 110-214
  2FB4  Fab Kol H chain (human)  Res. 1-118
  2FB4  Fab Kol H chain (human)  Res. 119-221
  1FBJ  Fab Ig L chain (mouse)  Res. 1-106
  1FBJ  Fab Ig L chain (mouse)  Res. 107-213
  1FBJ  Fab Ig H chain (mouse)  Res. 1-118
  1FBJ  Fab Ig H chain (mouse)  Res. 119-218
  1FC2  Fc (human)  Res. 238-339
  1FC2  Fc (human)  Res. 340-443
  1MCP  Fab (mouse)  Res. 1-113
  1MCP  Fab (mouse)  Res. 1-122
  1PFC  Fc IgG1 (porcine)
  1REI  Fab Bence-Jones (human)
  2RHE  Fab Bence-Jones (human)
  3FAB  Fab' new L chain (human)  Res. 1-109
  3FAB  Fab' new L chain (human)  Res. 104-214
  3FAB  Fab' new H chain (human)  Res. 1-117
  3FAB  Fab' new H chain (human)  Res. 115-220
  2HFL  Fab IgG1 L chain (human)  Res. 1-105
  2HFL  Fab IgG1 H chain (human)  Res. 1-116
  2HFL  Fab IgG1 H chain (human)  Res. 117-213
  1F19  Fab L chain (mouse)  Res. 1-105
  1F19  Fab L chain (mouse)  Res. 109-215
  1F19  Fab H chain (mouse)  Res. 1-123
  1F19  Fab H chain (mouse)  Res. 124-220

Interleukin
  1I1B  Interleukin-1β (human)  Res. 3-51
  1I1B  Interleukin-1β (human)  Res. 50-107
  1I1B  Interleukin-1β (human)  Res. 108-153

Inhibitors II
  1TGS  Pancreatic secretory trypsin inhibitor (porcine)
  3SGB  Ovomucoid inhibitor third domain (turkey)
  2OVO  Ovomucoid third domain (silver pheasant)
  1OVO  Ovomucoid third domain (Japanese quail)

Lysozymes
  3LZM  Lysozyme (bacteriophage T4)
  2LZT  Lysozyme (hen)
  2LZ2  Lysozyme (turkey)
  1LZ1  Lysozyme (human)
  1ALC  α-Lactalbumin (baboon)

Nucleotide binding II
  4MDH  Cytoplasmic malate dehydrogenase (porcine)
  2LDB  Lactate dehydrogenase (B. stearothermophilus)
  1LDM  M4-lactate dehydrogenase (dogfish)
  5LDH  H4-lactate dehydrogenase (pig)
  2LDX  Lactate dehydrogenase (mouse)
  1LLC  L-Lactate dehydrogenase (L. casei)
  5ADH  Alcohol dehydrogenase (horse)
  3GPD  Glyceraldehyde 3-P dehydrogenase (human)
  1GPD  Glyceraldehyde 3-P dehydrogenase (lobster)
  1GD1  Glyceraldehyde 3-P dehydrogenase (B. stearothermophilus)
  1FX1  Flavodoxin (D. vulgaris)
  4FXN  Flavodoxin (Clostridium pasteurianum)
  2SBT  Subtilisin (B. amyloliquefaciens)
  3ADK  Adenylate kinase (porcine)
  4ATC  Aspartate carbamyl transferase (E. coli) (catalytic chain)

Sulphur proteinases
  9PAP  Papain (papaya)
  2ACT  Actinidin (kiwifruit)

Phospholipases
  1PP2  Phospholipase A2 (rattlesnake)
  1BP2  Phospholipase A2 (bovine)
  1P2P  Phospholipase A2 (porcine)

Kinases
  1PFK  Phosphofructokinase (E. coli)
  3PFK  Phosphofructokinase (B. stearothermophilus)

Plastocyanins
  2PAZ  Pseudoazurin (Alcaligenes faecalis)
  1PCY  Plastocyanin (poplar)
  1AZU  Azurin (P. aeruginosa)
  2AZA  Azurin (Al. denitrificans)

Rubredoxin
  3RXN  Rubredoxin (D. vulgaris)
  PRXN  Rubredoxin (C. pasteurianum)
  1RDG  Rubredoxin (D. gigas)

Repressors
  1LRD  λ repressor (bacteriophage λ)
  1R69  434 repressor (phage 434)
  2CRO  434 Cro protein (phage 434)

Rhodanese
  1RHD  Rhodanese (bovine)  Res. 1-146
  1RHD  Rhodanese (bovine)  Res. 152-293

Subtilisins
  1CSE  Subtilisin Carlsberg (commercial product)
  1SBT  Subtilisin (B. amyloliquefaciens)
  1TEC  Thermitase (Thermoactinomyces vulgaris)
  2PRK  Proteinase K (fungus)

Serine proteinases
  1TON  Tonin (rat)
  2PKA  Kallikrein (porcine)
  2PTN  Trypsin (bovine)
  2TRM  Trypsin (rat)
  4CHA  α-Chymotrypsin (bovine)
  3EST  Elastase (porcine)
  1HNE  Neutrophil elastase (human)
  2RP2  Mast cell protease (rat)
  1SGT  Trypsin (Streptomyces griseus)
  2SGA  Proteinase A (S. griseus)
  3SGB  Proteinase B (S. griseus)
  2ALP  α-Lytic protease (Lysobacter enzymogenes)

Toxins
  2ABX  α-Bungarotoxin (banded krait)
  1NXB  Neurotoxin B (sea snake)
  1CTX  α-Cobratoxin (cobra)

Virus capsid proteins
  4RHV  VP1 (rhinovirus)
  4RHV  VP2 (rhinovirus)
  4RHV  VP3 (rhinovirus)
  4SBV  Southern bean mosaic virus (virus)
  2MEV  VP1 (mengo virus)
  2MEV  VP2 (mengo virus)
  2MEV  VP3 (mengo virus)
  2TBV  Tomato bushy stunt virus (virus)
  2STV  Satellite tobacco necrosis virus (virus)
  2PLV  VP1 (poliovirus)
  2PLV  VP2 (poliovirus)
  2PLV  VP3 (poliovirus)

Germ agglutinin
  3WGA  Germ agglutinin (wheat)  Res. 1-43
  3WGA  Germ agglutinin (wheat)  Res. 44-86
  3WGA  Germ agglutinin (wheat)  Res. 87-129
  3WGA  Germ agglutinin (wheat)  Res. 130-171

Hemerythrin
  1HMZ  Hemerythrin (sipunculid worm)
  2MHR  Myohemerythrin (sipunculan worm)

† Family name.
‡ PDB identification code for tertiary structures, where PDB is the Brookhaven Protein Data Bank.
§ Range of residue positions in the PDB numbering scheme when domains are taken.

(referred to as 3D-ALI; Pascarella & Argos, unpublished results). Though we included aligned sequences with unknown tertiary structure in our original databank, in this analysis only sequences with known folds were examined to ensure reliability of the results. Multiple structural alignment techniques such as those developed by Sali & Blundell (1990) or Taylor & Orengo (1989) were not used, as they align the entire sequences and well-matched regions cannot be distinguished from marginally aligned segments. In the present study, careful consistency checks from pairwise superimpositions allowed elimination from the data of regions with questionable equivalencing. All programs used were written in the C language under the operating system VAX/VMS 6.4.

It could be argued that many of the structural families used here may represent convergent evolution and therefore be inappropriate for an indel analysis. This is not the case. Most, if not all, of the families are clear examples of divergent evolution where the folding topology and the secondary structures as well as function are conserved; for example, the globins, where residue identity percentages in the structurally aligned sequences reach a minimum of 16%, or the virus capsid proteins, with a low of 3%. It is important to include distant familial members in an in-depth study of indels, which take on major structural significance with increasing evolutionary time. Cases in evolutionary doubt may include the cytochromes (family 1 in Table 1), α/β barrels (family 3) and lysozymes (family 18). Of the 744 pairwise structural comparisons, these 3 families contribute only 17, which represents only about 2% of the entire dataset. Furthermore, this study concentrates on indel regions delineated by unambiguous structural equivalence within the flanking segments as well as near the indel site. In the few cases that could be convergent, the local structure would at least be divergent-like.

(b) Definition of indels

Indels in the multiple structural alignments were assigned in the following manner. In 2 structurally aligned sequences, A and B, the true indels are those marked by asterisks; i.e. those segments flanked by at least 1 equivalenced residue pair:

[Alignment scheme: sequences A and B aligned row by row, with asterisks above the true indels and hyphens at residues that could not be structurally equivalenced.]

Hyphens indicate residues that could not be structurally equivalenced due to extreme differences in local conformation; these regions were annotated in the databank and were not used in our indel statistics since the deletion or insertion could not be unequivocally delineated.

(c) Statistical analyses

We have analysed the set of true indels extracted from our database for indel length, frequency of component amino acid residues and conformations, and flanking structural environment. A filtering procedure was used throughout the research to correct for overrepresentation of the same object (indels, amino acid residues or conformations) in homologous structures. Only different objects at equivalent positions have been counted. For example, in a set of 4 structurally aligned sequences A1, A2, B and C (with A1 and A2 quite similar):

            10        20        30
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAADAAAAAAAAAAAADAAAAAAAAAAAAADAAADAAA
    BBBBBBBB     BBBBBBB         BBBBBBBBBB
    CCCCCCCC     CCCCCCC          CCCCCCCCC

our sampling criteria would count 1 indel for positions 9 to 13, a 2nd one in sites 21 to 29 and a 3rd at 21 to 30. Only unique amino acid residues or conformational symbols at each column position were counted (e.g. A only once at site 10). Residue conformations were taken as defined by the method of Kabsch & Sander (1983). The H, G and I states were classified as α-helix, B and E as β-strand, T and S as turn, and the remainder (no assignment) as coil.

All the possible pairwise structural comparisons in our databank were classified according to percentage of residue identity calculated over the number of equivalent (aligned) residues. Protein pairs were assigned to percentage identity intervals (id classes); e.g. 0% to 5%, 5% to 10%, etc. The average indel length for each range was then determined. An oversampling correction was applied such that indels with the same length, absolute alignment position, familial membership and percentage of identity class were counted only once. The mean indel length was calculated by:

$$ av(id) = \frac{1}{n^{id}} \sum_{i=1}^{n^{id}} l_i^{id} \qquad (1) $$

where $av(id)$ is the average indel length in residue identity class id; $l_i^{id}$ is the length of the ith indel in the same class, and $n^{id}$ is the total number of indels in class id. The number of indels per aligned residue (avn) was calculated for each identity class id according to:

[Equation (2): not recoverable from the source.]

where $n_i^{id}$ is the total number of indels in the ith pair, $l_{i,1}^{id}$ and $l_{i,2}^{id}$ are the sequence lengths of the 1st and 2nd structure in the ith pair and p is the total number of pairs in the class id.

The distribution of the intervening sequences with identity percentage was also analysed. An intervening sequence in a pairwise alignment is that segment that is flanked by indels or structurally non-superimposable sites and has structural counterparts in both sequences. For example, in 2 generic sequences A and B:

[Alignment scheme: sequences A and B aligned row by row, with asterisks above the aligned stretches that lie between indels or non-superimposable (hyphenated) sites.]

the intervening sequences are those marked by asterisks. Once again, correction for overrepresentation of strongly homologous intervening sequences was applied as previously described. The mean length was calculated with equation (1), where $l_i^{id}$ is now the length of the ith intervening sequence in identity class id.

Indels were lumped into length classes and the relative compositions of amino acid and conformation types were calculated. In groups of indels with the same length, absolute position and structural group, only different symbols (either amino acid residues or conformations) at each alignment position were counted. This prevented particularly similar sequences from biasing the results. The relative frequencies or preferences for particular amino acids and conformations were calculated as:

$$ p_i = \frac{ n(S)_{i,D} \Big/ \sum_{j=1}^{k} n(S)_{j,D} }{ n(S)_{i,T} \Big/ \sum_{j=1}^{k} n(S)_{j,T} } \qquad (3) $$

where $n(S)_{i,D}$ is the number of symbols S (amino acid or conformation) of type i for databank segment D; k is the total number of types; and $n(S)_{i,T}$ is the total number of symbols of type i in the total (T) databank. The preference ratio represents the observed fraction divided by that expected, such that a ratio of 1.0 indicates no preference, while greater or less than 1.0 refers to a preference or avoidance of symbol type i. The preference standard error $\sigma_p$ is given according to Palau et al. (1982):

$$ \sigma_p = \left[ \frac{1}{N_s} \cdot \frac{1 - f_i}{f_i} \right]^{1/2} \qquad (4) $$

where:

$$ f_i = \frac{ n(S)_{i,T} }{ \sum_{j=1}^{k} n(S)_{j,T} } \quad \text{and} \quad N_s = \sum_{j=1}^{k} n(S)_{j,D}. $$

To detect the preferred environment for the true indels, we have calculated the relative frequencies of amino acid residues and main-chain conformations at the edges of indels (positions marked by an asterisk in the scheme below):

         **         **
    AAAAAAAAAAAAAAAAAAAAAAA
    BBBBBB           BBBBBB
         *           *

To study the conservation of local conformation at indel flanking positions, we analysed the conformation of the 4 positions marked with an X below:

     X    X
    AAAAAAAA
    BB    BB
     X    X

X can be α-helix, β-strand, reverse turn or coil. We collected indels with all possible X combinations at the designated 4 positions (e.g. 2 X with identical conformation and the others different from each other and from the assigned two) and any combination was considered a structural class. Indels belonging to one of those classes were sampled according to the same criteria followed throughout the research; only unique indels were counted among those falling in the same class and sharing the same absolute position, length and structural group. Particular attention was paid to the classes with at least 3 identical conformations, which were considered as conservative. The non-assignable structural state "coil" was not counted in the conservative groups; i.e. X was not allowed to be coil.
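The indel definition of section (b) and the unique-counting filter of section (c) can be illustrated with a short Python sketch. It is not the authors' C program. The row layout of the four-sequence example (A1, A2, B, C) is assumed so that the gaps fall at the positions quoted in the text, a blank is assumed to mark a deleted residue, and columns in which both rows are blank are treated here as non-equivalenced. The function names are hypothetical.

```python
import re
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

GAP = " "  # assumption: a blank column marks an absent (deleted) residue


def column_states(row_a: str, row_b: str) -> str:
    """Label each column: M = residue in both rows (equivalenced),
    A = gap in row_a only, B = gap in row_b only, X = anything else
    (e.g. a hyphen for a non-equivalenced residue, or a gap in both rows)."""
    states = []
    for x, y in zip(row_a, row_b):
        if x.isalpha() and y.isalpha():
            states.append("M")
        elif x == GAP and y.isalpha():
            states.append("A")
        elif y == GAP and x.isalpha():
            states.append("B")
        else:
            states.append("X")
    return "".join(states)


def true_indels(row_a: str, row_b: str) -> List[Tuple[int, int]]:
    """Return 1-based (start, end) positions of gap runs that are flanked
    on both sides by at least one equivalenced residue pair."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    states = column_states(row_a, row_b)
    found = []
    for run in re.finditer(r"A+|B+", states):
        start, end = run.start(), run.end()  # 0-based, end exclusive
        if start > 0 and end < len(states) and states[start - 1] == "M" and states[end] == "M":
            found.append((start + 1, end))
    return found


def unique_indels(rows: Dict[str, str]) -> Set[Tuple[int, int]]:
    """Collect indels over all row pairs, counting each position range once."""
    seen: Set[Tuple[int, int]] = set()
    for (_, a), (_, b) in combinations(rows.items(), 2):
        seen.update(true_indels(a, b))
    return seen


def mean_indel_length(indels: Iterable[Tuple[int, int]]) -> float:
    """Equation (1): sum of indel lengths divided by the number of indels."""
    lengths = [end - start + 1 for start, end in indels]
    if not lengths:
        raise ValueError("no indels to average")
    return sum(lengths) / len(lengths)


# Assumed layout of the four-sequence example (39 columns).
ROWS = {
    "A1": "A" * 39,
    "A2": "AAAAD" + "A" * 12 + "D" + "A" * 13 + "D" + "AAA" + "D" + "AAA",
    "B": "B" * 8 + GAP * 5 + "B" * 7 + GAP * 9 + "B" * 10,
    "C": "C" * 8 + GAP * 5 + "C" * 7 + GAP * 10 + "C" * 9,
}

UNIQUE = unique_indels(ROWS)
MEAN_LENGTH = mean_indel_length(UNIQUE)
```

With this layout `UNIQUE` holds three position ranges (9 to 13, 21 to 29 and 21 to 30), in line with the count given in the text, even though the first gap occurs in both B and C and is seen in several sequence pairs.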

(d) Correlation of the relative frequencies of the amino acids

The residue compositions were correlated with the 222 amino acid properties deposited in the Japanese Amino Acid Index Databank (Nakai et al., 1988) to reveal residue characteristic trends. A computer program written by J. Heringa was used for this purpose (Heringa & Argos, 1991).
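As an illustration of this kind of correlation (not the Heringa program, whose details are not given here), the sketch below computes a Pearson coefficient between a set of residue preferences and one property scale. The preferences are transcribed from the 1 to 5 length class of Table 2; the use of the Kyte-Doolittle hydropathy scale as the example property is an assumption made for this sketch, standing in for any one of the 222 scales.

```python
import math
from typing import Dict, Sequence

# Preferences for the 1-5 residue length class, transcribed from Table 2.
PREFERENCE_1_TO_5: Dict[str, float] = {
    "A": 0.91, "C": 0.51, "D": 1.08, "E": 0.87, "F": 0.70,
    "G": 1.42, "H": 0.97, "K": 1.07, "I": 0.52, "L": 0.86,
    "M": 0.56, "N": 1.45, "P": 1.22, "Q": 1.08, "R": 0.82,
    "S": 1.44, "T": 1.20, "V": 0.68, "Y": 0.79, "W": 0.63,
}

# Kyte-Doolittle hydropathy values (example property chosen for this sketch).
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long value lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


RESIDUES = sorted(PREFERENCE_1_TO_5)
R_VALUE = pearson(
    [PREFERENCE_1_TO_5[r] for r in RESIDUES],
    [KYTE_DOOLITTLE[r] for r in RESIDUES],
)
```

A negative `R_VALUE` would indicate that residues favoured in short insertions tend to be less hydrophobic on this scale; repeating the calculation over many scales and ranking by the absolute coefficient is one plausible way to reveal trends.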


[Figure 1: histogram; horizontal axis, % residue identity in 5% classes from 0 to 100.]

Figure 1. Histogram of the number of pairwise sequence alignments in the databank versus residue identity percentage classes given in 5% intervals. All the possible pairs in each structural family were considered. Percentages are relative to the number of aligned residues.
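The classification behind this histogram can be sketched as follows: identity is computed over aligned residues only and each pair is assigned to a 5% class. This is a minimal illustration; the gap and hyphen conventions, the handling of class boundaries (a value of exactly 5% is placed in the 5 to 10% class) and the function names are assumptions of the sketch.

```python
from typing import Tuple


def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of identical residues over the aligned (equivalenced)
    columns, i.e. columns where both rows carry a residue letter."""
    aligned = [(x, y) for x, y in zip(row_a, row_b) if x.isalpha() and y.isalpha()]
    if not aligned:
        raise ValueError("no aligned residues")
    identical = sum(1 for x, y in aligned if x == y)
    return 100.0 * identical / len(aligned)


def identity_class(pid: float, width: float = 5.0) -> Tuple[float, float]:
    """Return the (lower, upper) bounds of the identity class holding pid.
    A value of exactly 100% is placed in the top class."""
    top_index = int(100 // width) - 1
    index = min(int(pid // width), top_index)
    return (index * width, (index + 1) * width)


# Hypothetical aligned pair: a blank is a deleted residue, a hyphen a
# residue that could not be structurally equivalenced.
ROW_A = "MKT AYIAK"
ROW_B = "MRTLAY-AK"
PID = percent_identity(ROW_A, ROW_B)
PID_CLASS = identity_class(PID)
```

In the hypothetical pair, 7 columns are aligned and 6 of them are identical, so the pair falls in the 85 to 90% class.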

3. Results and Discussion

The entire databank of aligned tertiary structures and corresponding primary sequences (Pascarella & Argos, 1992) contained about 5000 indels (1700 potential indel regions could not be structurally equivalenced and were not used in this analysis); however, after the highly homologous insertions were removed as described previously, the indel count reduced to 714, emphasizing the significance of this correction. The distribution of sequence pair frequencies in the databank with their residue identity percentage (RIP) is shown as a bar graph in Figure 1, where the RIPs have been collected in 5% intervals. Figure 2 displays the relationship between indel length and its frequency of observation. Most of the aligned sequence pairs in the databank have residue identity in the 10 to 20% range, which represents considerable evolutionary distance, and yet the most common indel length is one.

Figure 2. Distribution of the indel length with frequency of observation. Indels have been sampled as described in the text. Length is in residue units.

Figure 3. (a) Distribution of the average indel length by % sequence identity in the 2 aligned sequences considered. Points are plotted for residue identity ranges of 5%, except for the rightmost point, which considers all data above 55% identity to achieve reasonable sampling statistics. The points are plotted at the average percentage in the range and the corresponding mean indel length. (b) Logarithmic plot of the distribution in (a). The line shown is that derived from the exponential fit as given in the text.

Figure 3(a) illustrates the average indel length (AIL) found in sequence pairs with a given RIP, once again taken collectively in 5% intervals, except for the rightmost point, which covers RIPs from 55 to 80% for credible sampling. Figure 3(b) shows the corresponding logarithmic plot, demonstrating that the data are exponentially related; namely, $\mathrm{AIL} = 5.20\, e^{-0.017 \times \mathrm{RIP}}$ with a regression coefficient of 0.90. These results clearly show that the mean indel length changes little with evolutionary time measured by residue identity percentage in the two sequence species compared. The AIL remains at about two for RIPs greater than 35% and then rises to only five at approximately 15% RIP.
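An exponential relation of this kind is commonly obtained by a straight-line least-squares fit to the logarithm of the data, which is what the logarithmic plot suggests. The sketch below shows that procedure; whether the authors fitted exactly this way is an assumption. No measured points are available here, so the example points are synthetic values generated from the reported curve, and the fit simply recovers the reported coefficients.

```python
import math
from typing import List, Sequence, Tuple


def fit_exponential(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of ln(y) = ln(a) + b * x; returns (a, b) for y = a * exp(b * x)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(logs) / len(logs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    slope = sxl / sxx
    return math.exp(mean_l - slope * mean_x), slope


def average_indel_length(rip: float) -> float:
    """Reported fit: AIL = 5.20 * exp(-0.017 * RIP), with RIP in percent."""
    return 5.20 * math.exp(-0.017 * rip)


# Synthetic points at the centres of the 5% classes (not the paper's data).
RIP_POINTS: List[float] = [2.5 + 5.0 * i for i in range(11)]
AIL_POINTS: List[float] = [average_indel_length(r) for r in RIP_POINTS]
A_FIT, B_FIT = fit_exponential(RIP_POINTS, AIL_POINTS)
```

With real class averages in place of the synthetic points, the same function would return the prefactor and exponent of the fitted curve.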

Figure 4. (a) Distribution of the average intervening length by % sequence identity. The intervening length refers to the number of contiguous residues in a sequence flanked by indels. The points are plotted as described in the legend to Fig. 3(a). (b) Logarithmic plot of the distribution in (a). The line shown is that derived from the exponential fit as given in the text.

While 99% of all indels have a length of ten or less in sequence pairs with RIP between 40 and 80%, nearly the same is true for sequences with RIP in the 20 to 40% and 0 to 20% ranges, where 94% and 93%, respectively, of all indels have length ten or less. The mean insertion length and standard deviation (in parentheses) for the 0 to 20, 20 to 40 and 40 to 80% RIP classes are 4.6 (5.9), 3.0 (3.4) and 2.3 (2.1), respectively, indicating the relatively narrow distribution of the indel lengths. The five largest insertions, which are found in the 0 to 40% RIP range, contain between 26 and 56 residues and are not confined to any particular structural family. There is certainly an inverse relationship between RIP and indel length; proteins accept longer indels during evolution but, on average, not much longer. The tendency then, once indel sites are established, is to reach an equilibrium length such that residues are inserted or deleted in a balanced manner with time. Furthermore, there is a limit, in general, to the size of an indel, around five. These conclusions are at odds with generally accepted gap penalty (P) functions (P = I + Ek) used in amino acid sequence alignment (for a review, see Argos et al., 1991), where the I value to initiate a gap is much larger than the extension penalty E for a gap of length k (Gotoh, 1982, 1990). Our results suggest a more stringent extension penalty in the form of an exponential for gap lengths greater than five.

Figure 4(a) shows the relationship between RIP and the average length of the protein sequence flanked by two indels, designated as the average "intervening" sequence (AIS). Figure 4(b) illustrates the logarithmic relationship from which the function $\mathrm{AIS} = 7.43\, e^{0.031 \times \mathrm{RIP}}$ is derivable with regression coefficient 0.98. Thus, the extrapolation of the curve to 0% residue identity yields an intervening length of about seven or eight, a value very close to the mean length of a secondary structural element (α-helices and β-strands). An analysis of protein tertiary structures shows that helices average about 11 residues in length (Srinivasan, 1976), while strands are closer to five or six (Sternberg & Thornton, 1977), the mean between them being about eight. The convergence of this latter value and the mean intervening length suggests that during the evolutionary divergence of related proteins, all the elements (loops, coils and turns) that connect secondary structural units are targets for insertions and deletions. Moreover, the exponential behaviour indicates that the occurrence of an indel is essentially a stochastic process similar to that postulated for point mutations (Dayhoff et al., 1983). Of course, the few functionally or structurally constrained loops will not be as adaptable to indels as the more general case observed here.
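The contrast between the usual affine gap penalty, P = I + Ek, and a penalty whose extension term grows exponentially beyond a gap length of five can be sketched as below. The text gives no functional form or parameter values for the stricter penalty, so the exponential form, the growth rate and the values of I and E used here are all hypothetical choices made only for illustration.

```python
import math


def affine_penalty(k: int, gap_open: float = 10.0, gap_extend: float = 1.0) -> float:
    """Standard affine penalty P = I + E * k for a gap of length k."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * k


def stringent_penalty(
    k: int,
    gap_open: float = 10.0,
    gap_extend: float = 1.0,
    cutoff: int = 5,
    rate: float = 0.5,
) -> float:
    """Hypothetical variant: affine up to `cutoff`, then an exponentially
    growing extension term. The form is chosen so that the penalty is
    continuous at the cutoff and its initial slope there equals E."""
    if k <= cutoff:
        return affine_penalty(k, gap_open, gap_extend)
    excess = k - cutoff
    return gap_open + gap_extend * cutoff + gap_extend * (math.exp(rate * excess) - 1.0) / rate


# Penalties for gap lengths 1..15 under both schemes.
GAP_LENGTHS = list(range(1, 16))
AFFINE = [affine_penalty(k) for k in GAP_LENGTHS]
STRINGENT = [stringent_penalty(k) for k in GAP_LENGTHS]
```

Under these assumed parameters the two schemes agree for gaps of up to five residues and diverge quickly afterwards, which is the qualitative behaviour the indel length statistics argue for.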
Figure 5(a) and (b) display the respective normal and logarithmic plots for the number of indels per residue (NIR) versus residue identity percentage (RIP) in the usual 5% intervals. The exponential fit of the data yields $\mathrm{NIR} = 7.09\, e^{-0.023 \times \mathrm{RIP}}$ with a regression coefficient of 0.79. In this case the data do not correlate as well with the exponential model as in the previous plots of Figures 3 and 4, primarily because of the NIR values in the 0 to 15% RIP range. The average indel length also flattens in the same RIP band (Fig. 3). Apparently, the stochastic process of loop evolution reaches saturation, relative to the indel characteristics examined here, at a residue identity level of about 15%. At 0% identity, about six indels per 100 amino acid residues would be expected in a primary sequence. This correlates reasonably with the previous observations in that six indels, each about five residues long, and seven intervening sequences (to form 6 indels) at eight residues each account for about 90 amino acid residues. The RIP/NIR relationship should be useful in establishing phylogenetic trees, sequence alignments, evaluating evolutionary distances and the like.
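The consistency argument above is simple arithmetic and can be checked directly. The sketch below evaluates the three reported fits at 0% identity and adds up the residues covered by six indels and the seven intervening segments that bound them; the function and variable names are illustrative only.

```python
import math


def average_indel_length(rip: float) -> float:
    """Reported fit: AIL = 5.20 * exp(-0.017 * RIP)."""
    return 5.20 * math.exp(-0.017 * rip)


def average_intervening_length(rip: float) -> float:
    """Reported fit: AIS = 7.43 * exp(0.031 * RIP)."""
    return 7.43 * math.exp(0.031 * rip)


def indels_per_100_residues(rip: float) -> float:
    """Reported fit: NIR = 7.09 * exp(-0.023 * RIP)."""
    return 7.09 * math.exp(-0.023 * rip)


# Rounded values used in the text for the 0% identity limit.
N_INDELS = 6
INDEL_LENGTH = 5
INTERVENING_LENGTH = 8

# n indels are bounded by n + 1 intervening segments.
RESIDUES_ACCOUNTED = N_INDELS * INDEL_LENGTH + (N_INDELS + 1) * INTERVENING_LENGTH

FITS_AT_ZERO = (
    average_indel_length(0.0),
    average_intervening_length(0.0),
    indels_per_100_residues(0.0),
)
```

The sum comes to 86 residues, which is the "about 90" quoted in the text and close to the 100-residue window for which the indel count is stated.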

Figure 5. (a) Distribution of the number of indels per aligned residue by % sequence identity. Points were plotted as described in the legend to Fig. 3(a). (b) Logarithmic plot of the distribution in (a). The line shown is that derived from the exponential fit as given in the text.

(a) Statistics on the composition of insertions

Insertions were generally divided into length classes using a unit of five residues that provided a reasonable sample space for all categories but one. Groups of indel length 1 to 5, 6 to 10, 11 to 15 and 16 to 56 were used. Figure 2 shows that the first class is the most populated, while the last is relatively poor in constituency. Table 2 reports the relative frequencies or preferences of the amino acid residues and their conformations when they occur in insertions. Statistics for the rarely occurring amino acids (e.g. Cys, Met and Trp) are not always well sampled. In the short insertions (length of 1 to 15), the preferred conformations are turn and coil, while in the 16 to 56 class all conformations are possible, including helix and strand. The correlation of the relative residue frequencies with various amino acid physicochemical characteristics as found in the Japanese Amino Acid Index Databank (Nakai et al., 1988) shows that the residues in the classes 1 to 5, 6 to 10 and 11 to 15 are essentially those used by proteins in turn and coil regions; namely, Gly, Pro, Asn, Ser, Asp and Thr. They tend to be polar and/or small. In the 16 to 56 class, the correlations are not as clear, with residues such as Glu, Leu, His, Tyr and Trp appearing as preferred.

(b) Analysis of indel flanks

The residue and conformational preferences for the six indel flanking positions (3 N and 3 C termini) are given in Table 3. Consistent with the indels themselves, the terminal positions also prefer turn and coil conformations and contain amino acid residues consistent with these structures; namely, Gly, Asn, Pro, Ser, Arg and Thr. Turns, loops and coils are the favoured attack points for indels in protein structure; they generally constitute those regions most exposed to solvent, most flexible and most structurally accommodative. To analyse the protein structure surrounding indels, we have delineated a set of local environment classes as described in Materials and Methods, section (c). The main-chain conformations of four residues in the two aligned sequences that constitute the flanks (N and C termini) next to the indel edges are considered. Cases where at least three positions shared the same conformation were analysed. Of the 5000 indel examples found in our databank, after filtration about 1400 could be assigned to one of the possible combinatorial conformational classes. Of these 1400, only 24 occurred where three or more of the four flanks were in helical conformation. Similarly, only 20 were observed in the strand architecture. Together, these interruptions of secondary structure constitute around 3% of the entire sample. In the helical case, all the insertions occurred within four residues of one of the helix termini, mostly resulting in a helical extension of a few residues. For strands, only three of 20 insertions were within two residues of one of the termini, while the remaining cases occurred at middle positions, again most (13 of 20) yielding strand extensions up to five residues in length. Extensive strand interruptions occurred in seven cases where helices, strands and/or loops of ten or more residues were inserted. Interruptions of reverse turns constituted another 10% of the sample, leaving about 85% in the coil category.
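The flank classification used for these counts can be sketched as follows, using the state grouping given in Materials and Methods (H, G, I as helix; B, E as strand; T, S as turn; everything else as coil) and the rule that a class is conservative when at least three of the four flank positions share the same non-coil conformation. The function names are hypothetical, and the final line simply reproduces the quoted fraction of helix and strand interruptions.

```python
from collections import Counter
from typing import Optional, Sequence


def conformation(dssp_state: str) -> str:
    """Map a Kabsch & Sander state symbol to one of four conformations."""
    if dssp_state in ("H", "G", "I"):
        return "helix"
    if dssp_state in ("B", "E"):
        return "strand"
    if dssp_state in ("T", "S"):
        return "turn"
    return "coil"  # no assignment


def conservative_class(flank_states: Sequence[str]) -> Optional[str]:
    """Return the shared conformation if at least 3 of the 4 flank positions
    agree on a non-coil conformation, otherwise None."""
    if len(flank_states) != 4:
        raise ValueError("exactly four flank positions are expected")
    counts = Counter(conformation(state) for state in flank_states)
    for name, count in counts.items():
        if name != "coil" and count >= 3:
            return name
    return None


# Fraction of the roughly 1400 classified indels interrupting helices (24)
# or strands (20), as quoted in the text.
SECONDARY_STRUCTURE_FRACTION = (24 + 20) / 1400
```

For example, flank states `HHGT` give a conservative helical class, whereas `EETS` (two strand, two turn) and three unassigned positions plus one strand give none.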
It is clear that indels mostly intrude in turn and coil structures, and rarely encroach upon helices and strands, where short extensions at termini are the most common result. Indels within helices are the least tolerated. Figure 6 shows two tertiary structure examples of typical helical and strand extensions. Recently, Sondek & Shortle (1990) have inserted Gly or Ala at 20 different locations in a staphylococcal nuclease with known tertiary structure. They assayed for catalytic activity and also noted any changes in circular dichroism. For 12 different inser-

[Figure 6: stereo drawings with the local sequence alignments.]

(a) 4CHA  198  PLVCKKNGAWTLVGI  212
    1HNE  198  PLVCN    GLIHGI  212

(b) 3C2C  51  SYTEMKAKGLTW  62
    1CYC  53  ASKS   KGIVW  59

Figure 6. (a) Stereo illustration of an insertion extending a β-strand. The main-chain superimposition of α-chymotrypsin (databank code 4CHA, filled bonds) and neutrophil elastase (databank code 1HNE, open bonds) are shown around the insertion site. The local spatial equivalence shown is a result of superimposing Cα atoms associated with all alignable sequence elements for the 2 structures as given by Pascarella & Argos (1992) and determined by the method of Rossmann & Argos (1976). Upper-case and lower-case residue names denote the 4CHA and 1HNE sequences, respectively. The local sequence alignment is also reported. (b) Stereo illustration of an insertion extending a helix. The main-chain superimposition of cytochrome c2 (databank code 3C2C, filled bonds) and ferrocytochrome c (databank code 1CYC, open bonds) are shown around the insertion site. The local spatial equivalence shown is a result of superimposing Cα atoms associated with all alignable sequence elements for the 2 structures as given by Pascarella & Argos (1992) and determined by the method of Rossmann & Argos (1976). Upper-case and lower-case residue names denote the 3C2C and 1CYC sequences, respectively. The local sequence alignment is also reported.


Table 2
Amino acid and conformation preferences in insertions given within indel length classes

Columns for each insertion length range (1-5, 6-10 and 11-15 residues): preference‡, σ§ and observed residue counts in parentheses‖. Rows: the 20 amino acids† followed by the helix, strand, turn and coil conformations.

Amino acid†   1-5‡   σ§
A             0.91   0.07
C             0.51   0.12
D             1.08   0.09
E             0.87   0.09
F             0.70   0.10
G             1.42   0.09
H             0.97   0.15
K             1.07   0.09
I             0.52   0.07
L             0.86   0.08
M             0.56   0.11
N             1.45   0.12
P             1.22   0.12
Q             1.08   0.11
R             0.82   0.10
S             1.44   0.09
T             1.20   0.09
V             0.68   0.07
Y             0.79   0.10
W             0.63   0.16

† Amino acid symbol in 1-letter code. ‡ Preferences for insertion length range. § Standard error. ‖ Observed residue counts given in parentheses.

Table 3
Relative frequencies and preferences of the amino acid residues and conformations in the single position N- and C-terminal flanks of indels

Columns: preference‡, σ§ and observed counts in parentheses‖. Rows: the 20 amino acids† (A, C, D, E, F, G, H, K, I, L, M, N, P, Q, R, S, T, V, Y, W) followed by the helix, strand, turn and coil conformations.

† Amino acid symbol in 1-letter code. ‡ Preferences. § Standard error. ‖ Observed counts given in parentheses.

tions situated within loops or near secondary structural termini, 11 viable structures were recorded, while additions at eight unique sites in the middle region of helices and strands resulted in only five assayable structures, which displayed reduced activity. The observations are in relative agreement with those described here, albeit with rather scant statistics in both studies.

The authors thank Jaap Heringa for kindly providing the computer program to calculate correlations with the Japanese Amino Acid Index Databank. The research would have been impossible without financial support to S.P. from the European Molecular Biology Organization and the European Community Commission in the form of postdoctoral fellowships.

References
Argos, P. & Rossmann, M. G. (1979). Structural comparisons of heme binding proteins. Biochemistry, 18, 4951-4960.
Argos, P., Vingron, M. & Vogt, G. (1991). Protein sequence comparison: methods and significance. Prot. Eng. 4, 375-383.
Bairoch, A. & Boeckmann, B. (1991). The SWISS-PROT protein sequence data bank. Nucl. Acids Res. 19, 2247-2249.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.
Dayhoff, M. O., Barker, W. C. & Hunt, L. T. (1983). Establishing homologies in protein sequences. Methods Enzymol. 91, 524-545.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. J. Mol. Biol. 162, 705-708.
Gotoh, O. (1990). Optimal sequence alignment allowing for long gaps. Bull. Math. Biol. 52, 359-373.
Heringa, J. & Argos, P. (1991). Side-chain clusters in protein structures and their role in protein folding. J. Mol. Biol. 220, 151-171.
Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.
Kruskal, J. B. (1983). An overview of sequence comparison. In Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison (Sankoff, D. & Kruskal, J. B., eds), pp. 1-44, Addison-Wesley, Reading, MA.
Nakai, K., Kidera, A. & Kanehisa, M. (1988). Cluster analysis of amino acid indices for prediction of protein structure and function. Prot. Eng. 2, 93-100.
Palau, J., Argos, P. & Puigdomenech, P. (1982). Protein secondary structure. Studies on the limits of prediction accuracy. Int. J. Peptide Protein Res. 19, 394-401.
Pascarella, S. & Argos, P. (1992). A databank merging related protein structures and sequences. Prot. Eng. 5, in the press.
Rossmann, M. G. & Argos, P. (1976). Exploring structural homology of proteins. J. Mol. Biol. 105, 75-95.
Sali, A. & Blundell, T. L. (1990). Definition of general topological equivalence in protein structures: a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol. 212, 403-428.
Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68.
Sondek, J. & Shortle, D. (1990). Accommodation of single amino acid insertions by the native state of Staphylococcal nuclease. Proteins, 7, 299-305.
Srinivasan, R. (1976). Helical length distribution from protein crystallographic data. Ind. J. Biochem. Biophys. 13, 192-193.
Sternberg, M. J. E. & Thornton, J. M. (1977). On the conformation of proteins: an analysis of β-pleated sheets. J. Mol. Biol. 110, 285-296.
Taylor, W. R. & Orengo, C. A. (1989). Protein structure alignment. J. Mol. Biol. 208, 1-22.

Edited by A. R. Fersht