Biochimie (1992) 74, 571-580
© Soci6t6 francaise de biochimie et biologie mol6culaire / Elsevier, Pads
571
Evolutionary divergence plots of homologous proteins S BrouiUet, JL Risler, A H6naut, PP Slonimski* Centre de G~n~tique Mol#culaire du CNRS, Laboratoire Associ~ ~ i" Universitd P e t M Curie, 91198 Gif-sur-Yvette Cedex, France
(Received 13 February 1992; accepted 16 April 1992)
Summary m A simple and efficient method is described for analyzing quantitatively multiple protein sequence alignments and
finding the most conserved blocks as well as the maxima of divergence within the set of aligned sequences. It consists of calculating the mean distance and the root-mean-square distance in each column of the multiple alignment, averaging the values in a window of defined length and plotting the results as a function of the position of the window. Due attention is paid to the presence of gaps in the columns. Several examples are provided, using the sequences of several cytochromes c, serine proteases, lysozymes and globins. Two distance matrices are compared, namely the matrix derived by Gribskov and Burgess from the Dayhoff matrix, and the Risler Structural Superposition Matrix. In each case, the divergence plots effectively point to the specific residues which are known to be essential for the catalytic activity of the proteins. In addition, the regions of maximum divergence are clearly delineated. Interestingly, they are generally observed in positions immediately flanking the most conserved blocks. The method should therefore be useful for delineating the peptide segments which will be good candidates for site-directed mutagenesis and for visualizing the evolutionary constraints along homologous polypeptide chains.
graphical representation of protein homology I multiple alignment Introduction With the rapidly growing number of available protein sequences it is now common practice to compare simultaneoulsy dozens of sequences belonging to the same family. All the necessary tools are readily available to extract relevant sequences from existing databases such as EMBL, GenBank or PIR/NBRF. Some o f them simply scan the comments accompanying each sequence, like the STRINGS option of the GCG package [1], others use more elaborate index tables like the PSQ [2] or the ACNUC [3] programs. Once the desired sequences have been extracted (and translated if necessary), there exists a great choice of methods to perform multiple aligments of protein sequences [4-11]. There are a number of reasons for performing multiple sequence comparisons. For example, a set of prealigned sequences can be used to calculate a positionspecific 'profile' to which other sequences in a database will be compared [12]. This profile can also be of great help in aligning a new sequence with the other pre-aligned ones. In this respect it is worth noting that the accuracy of sequence alignments is notably im*Correspondence and reprints
proved by the use of multiple alignments [13]. The detection of motifs, patterns or biologically significant similarities common to a set of proteins may require the analysis of a multiple alignment of their sequences [14, 15]. Structural information can be obtained for a newly determined sequence if it is sufficiently similar to that of a protein of known structure [16]. It is common practice to display multiple sequence alignments in the form of tables where all sequences are sequentially written one above the other. The residues which are similar or identical within two or more sequences are then 'boxed'. This provides an easy and efficient way to highlight the regions where the sequences are most alike. This cannot give, however, a quantitative estimate of the degree of likeness between the sequences under study. Indeed, it is often desirable to point to those residues which are the most conserved within the sequences, keeping in mind that they are likely to be essential for the integrity of the structure and/or the function of the proteins. In the absence of precise structural information, the most conserved residues are good candidates for sitedirected mutagenesis experiments. Conversely, it is from the least conserved regions that one can estimate the degree of divergence between a set of sequences. The delineation of these regions of maximum divergence is not easily performed by visual inspection of a
572
(a} It K K It K K
m=2 m=8
0 0
M P -
G G G G G G
1.7 0.8
0 0
(f)
(b) V V V V V V
Risler
Gribskov
A A A A A -
(c) G G G G G G
V V V V V V
A -
G G G G G G
(g)
D s
e P
(d)
(e)
0 T
n W
N T K L A E
V V V V V D
A E N R M K
E L F ¥ K G
S(d) s(d) PLS
0.62 0.81 0,91 0,0
0,62 0,81 0,91 2,2
0.94 0.52 0.33
0.73 0.96 1.07
0,98 0,73 0.68
2.83 1.39 0.95
S(d) s(d) PLS
0.46 0.60 0.67 0.0
0.46 0.60 0.67 1.5
1.40 0.69 0.28
0,57 0,74 0.83
1,29 0.57 0,30
1.59 0.85 0.64
Fig 1. Some special cases of multiple alignments, showing in particular the importance of considering the gaps as amino-acids, a. The mean distance in the central column, which is unduly large if only two amino-acids are considered (m = 2), is reduced if the gaps are considered as amino-acids (m = 6) as explained in the text. In b-g, is the mean distance calculated with the gaps being considered as amino-acids, S(d) is the standard deviation of the distance calculated as explained in the text and s(d) is the classical standard deviation, all calculated with the Risler and the Gribskov matrices. In b and c, PLS is the mean distance calculated by the program PLOTSIMILARITY from the GCG package, b and ¢: two alignments where the two sets of sequences should be considered as equally similar though displaying a different number of gaps. The numbers under b and e pertain to the central column, d. An alignment where all the pairwise distances between amino-acids (taken in the Risler matrix) are close to one another and to the mean value, leading to an unduly small standard deviation s(d). e. An alignment where the standard deviation s(d) is greater than expected (see text), f and g. Two examples where all the amino-acids are different in each column, leading in the first case (f) to a small or S(d), and in the second case (g) to higher values of or S(d). In such cases, the estimation of the divergence by visual inspection is practically impossible.
multiple alignment, and a quantitative m e t h o d is therefore desirable. T w e n t y years ago, Wu and K a b a t [17] s u c c e e d e d in delineating regions o f h y p e r v a r i a bility in B e n c e - J o n e s proteins b y determining, at a n y position o f the multiple alignment, the f r e q u e n c y o f the m o s t c o m m o n a m i n o - a c i d at that position. M o r e recently, Taylor [ 18] and Z v e l e b i l et a l [ 19] i m p r o v e d the m e t h o d by organizing the p h y s i c o - c h e m i c a l and mutational properties o f the amino-acids in the f o r m o f Venn diagrams. An analysis o f the selective pressure exerted on h o m o l o g o u s proteins has been p e r f o r m e d by B l a n c a et al [201 w h o introduced in their algorithm the t i m e elapsed since d i v e r g e n c e o f the species. In this paper, we present a simple and efficient technique for displaying graphically multiple alignment plots. Starting f r o m the distances between a m i n o - a c i d s given in a scoring matrix, it is b a s e d on the analysis o f the m e a n distance and o f the standard deviation o f the distance at any position o f the multiple alignment, taking into account the p r e s e n c e o f insertions or deletions w h e n necessary. As will be shown, our m e t h o d does not discover new c o n s e r v e d regions but clearly points to the m o s t divergent ones which are not easily visible by e y e in classical multiple alignments.
Materials and methods Once the sequences have been aligned, the method consists of estimating the degree of similarity by calculating in each column either the mean distance between the residues, or the standard deviation of the distance. For this purpose, one must use a 20 x 20 distance matrix in which the elements are a measure of the distance between any pair of amino.acids, that is, a matrix in which all the diagonal elements are equal to zero. Such a matrix is simply the complement of a scoring, or similarity, matrix in which the diagonal elements have the highest values. The classical Dayhoff matrix I211 is not convenient for this purpose since its diagonal terms are not all equal (vide infra). We have therefore used either the matrix derived by Gribskov and Burgess [22] from the Dayhoff matrix - this is the 'standard' matrix in the GCG package [1] - or the Structural Superposition Matrix [23] obtained from the analysis of structurally homologous proteins. In the first distance matrix, that we shall call the Gribskov matrix, the distances range from 0 to 2.7. In the second matrix, referred to as the Risler matrix, the distances range from 0 to 4. Mean distance between amino-acids in one column
It often happens that one has to insert gaps in the sequences in order to maximize their linear superposition. Such a process is generally penalized in alignment programs since it would be possible to maximize the number of identities or similarities in any pair of unrelated sequences if one is free to create as many gaps as necessary [24]. For our purpose, however, there is no reason to make a special case for the gaps. Once they have been created (and penalized) by the (multiple) alignment program, or once they have been inserted in the sequences on the basis of structural considerations, the gaps must be treated as amino-acids. Consider an alignment performed with n sequen-
573 ces and let m be the number of amino-acids in one column (m = n if there are no gaps). The number of pairwise distances in one column is M = re(m-l)~2 and the classical mean distance, calculated without taking the gaps into account, is = (Y-,di])/ M (i = 1 ~ m - l , j = i + 1 --~ m). In the example shown in figure la, = 0 in columns ! and 3 and = 1.7 in column 2 (Gribskov matrix), thus the distribution of will contain an unduly sharp peak at this position. Similarly, the two sets of sequences shown in figure l b and c should be considered as equally similar, which will not be the case if the gaps are not taken into account. In each case, the six sequences differ by only one mutational event, that is, one deletion in one sequence (fig lb) or one insertion in one sequence (fig lc). Thus, in the calculation of the mean distance in one column, we take into consideration all the pairwise distances between the residues (including the gaps) in that column, that is, = (Y.dij)/N (i = 1 ~ n - l , j = i + 1 --~ n) where d~j is the distance between residues in rows i a n d j (i ~:j) and N = n(n-l)12 for n proteins. The distance between two gaps is zero and the distance between one gap and any one amino-acid is taken as the mean value of the extra-diagonal terms in the distance matrix. In this way, the column which contains gaps in figure 1 a has a lower mean distance, and those in figure l b and c have the same mean distance. This treatment prevented us from using the original Dayhoff matrix. Since its diagonal terms are not all equal, there is no simple way to estimate the distance between a gap and an amino acid. The program PLOTSIMILARITY from the GCG package [1] also calculates a mean distance between the amino acids in one column of a multiple alignment. In this program, the gaps are not taken into account. As shown in figure lb, the mean distance calculated by PLOTSIMILARITY for the central column that contains one gap has the same mean distance (zero) than the other two columns, while in figure lc the mean distance in the central column that contains five gaps is quite large. This clearly shows the necessity of including the gaps in the calculation.
Standard deviation of the distance between amino-acids in one column The standard deviation of the distance between amino-acids in one column is defined according to the formula:
s(d) = [~,(dir-)2/(N-i)ll/2(i = 1 ~ n - l , j = i + I ~ n) where d~j, , n and N are defined as above, that is, the gaps are counted as amino-acids. In general, s(d) should be preferred to the root-mean-square distance r(d) = [Y.,du2/(Nl)]lt2 because r(d) is not a centered quantity and therefore cannot be used to compare two sets of experiments. Note that, like in the preceding case, both s(d) and r(d) will not function properly if the gaps are not taken into account. There are some particular cases, however, where the standard deviation thus calculated will fail to give a realistic estimate of the degree of similarity in that column. Such an example is given in figure ld, where all the pairwise distances are close to the mean distance in the column. In this case, all the terms (d~j- ) are small and s(d) is low, although the aligned residues are not that similar. To circumvent this problem, the calculation of the standard deviation in each column has been modified so that each residue is also compared with itself, that is, i can be equal to j in the above formula. The standard deviation then becomes S(d) = [~,(dij-)Z/(N-l)]l/2 (i=l ~n,j = l---',n) where N = n ( n + l ) / 2 and < D > = (~,(dij)/N. This treatment does not modify significantly the
value o f s(d) when all the residues are identical or close to one another, but otherwise increases s(d) (see fig ld). On the contrary, the standard deviation s(d) calculated according to the standard formula will be unduly large in cases such as the one shown in figure le. In this example, V and D are rather distant residues (thus increasing the value of s(d)) but the set of six sequences cannot be considered as highly divergent. Comparing each residue with itself in the calculation of s(d), as explained above, will in this case reduce its value and give a better estimate of the overall divergence of the sequences. As shown in figure le, S(d) is lower than s(d). Therefore, we have systematically incorporated the comparison of each residue with itself in the calculation of the standard deviation of the distance and all the following results will refer to S(d).
Graphical representation Once calculated, the mean distance and the standard deviation S(d) in each column of the aligned sequences can be plotted as a function of the position of the columns. This will generally result in a highly 'noisy' graph (see Results). Therefore, the plots were classically smoothed by plotting not the actual value associated with each position, but the mean value in a window centered at the considered position. Note that the positions indicated on the horizontal axes will not be necessarily equal to the residue numbers in the sequences since the aligned sequences contain gaps. Obviously, the peaks will correspond to a high divergence and the troughs to a high conservation of residues.
Results T h e m e t h o d d e s c r i b e d a b o v e has b e e n t e s t e d o n five f a m i l i e s o f p r o t e i n s , n a m e l y 24 c y t o c h r o m e s c, e i g h t s e r i n e p r o t e a s e s , 2 6 g l o b i n s , 10 l y s o z y m e s a n d 31 a c t i n s ( t a b l e I). T h e c y t o c h r o m e s h a v e b e e n a l i g n e d a c c o r d i n g to D i c k e r s o n [25] a n d the p r o t e a s e s a c c o r d i n g to G r e e r [26] o n t h e b a s i s o f the k n o w n s t r u c t u r e s o f s o m e m e m b e r s o f t h e f a m i l i e s . T h e g l o b i n , the i y s o z y m e a n d the a c t i n s e q u e n c e s w e r e a l i g n e d w i t h t h e p r o g r a m C L U S T A L [4]. F o r the c y t o c h r o m e c family, all of the four classes defined by Dickerson [25] a r e r e p r e s e n t e d . F o r t h e g l o b i n f a m i l y , c a r e w a s taken to use sequences from different taxonomic fami l i e s [27]. W i t h i n e a c h f a m i l y , the p e r c e n t s i m i l a r i t y between each pair of sequences was calculated using the Gribskov matrix. The percent similarities ranged f r o m 2 7 . 2 to 98.1 ( c y t o c h r o m e s c), 4 3 . 9 to 63.1 ( p r o t e a s e s ) , 2 9 . 4 to 9 8 . 6 ( g l o b i n s ) , 32.0 to 8 9 . 9 ( l y s o z y m e s ) a n d 77.6 to 100.0 (actins).
I n f l u e n c e o f the d i s t a n c e m a t r i x Figures 2 and 3 show the divergence plots and S(d) for the lysozyme and the globin families, respect i v e l y , c a l c u l a t e d w i t h t h e R i s l e r a n d the G r i b s k o v matrices and a window length of seven positions.
574 I. List of the protein sequences used in this study. All sequences were extracted (and translated if necessary) from the EMBL nucleotide sequences database or the MIPS protein sequences database. The letters between brackets in the cytochrome c family are L, long; M, medium; S and S*, short, according to the classification introduced by Di~;kerson [21] based on the length of the polypeptide chains.
Table
Cytochromes c Paracoccus denitrificans c550 (L), Rhodopseudomonas sphaeroides c2 (L), Rhodopseudomonas capsulata c2 (L), Rhodospiriilum rubrum c2 (L), Rhodospirillum photometricum c2 (L), Rhodopseudomonas palustris c2 (L), Tuna c (M), Rhodomicrobium vannielii c2 (M), Rhodospirillum molischianum c2 iso-1 and iso-2 (M), Rhodospirillum fulvum c2 iso-1 and iso-2 (M), Rhodopseudomonas globiformis c2 (M), Pseudomonas aeruginosa c551 (S), Pseudomonas fluorescens c551 (S), Pseudomonas stutzeri c551 (S), Pseudomonas mendocina c551 (S), Azotobacter vinelandii c551 (S), Spirulina maxima c554 (S*), Anacystis nidulans c554 (S*), Aiaria esculenta f (S*), Porphyra tenera f (S*), Monochrysis iutheri f (S*), Euglena gracilis f (S*). Proteases Thrombin B chain (bovine), plasmin B chain (human), factor Xa (bovine), factor IXa (bovine), kallikrein (porcine), elastase (porcine), trypsin (bovine), chymotrypsin (bovine). Lysozymes Baboon, bovine, rat, California quail, duck, chachalaca, pigeon, hanuman langur, donkey, mouse. Globins Tylorrhynchus heterochaectus globins IIA, IIB, IIC and small chain; Lumbricus terrestris (common earthworm) erythrocruorin AIII, hemoglobin chain C and extracellular globin I; Anadara trapezta (ark shell) globin IIA, IIB, aI and bI chains; Aplysia kurodai (Kuroda's sea hare) and Aplysia limacina (sea hare) globins; Cerethidea rhizophorarum (water snail) myoglobin; Myxine glutinosa (atlantic hagfish) globin III; Chrinonomus thummi thummi (midge larva) globin CTT-I; Petromyzon marinus (sea lamprey) globin V; Anadara broughtonii (blood clam) globin; Parasponia andersonii hemoglobin I; Phaseolus vulgaris (kidney bean) leghemoglobin a; Glycine max (soybean) lel~hemoglobin a and c2; Viciafaba (broad bean) and Lupinus luteus (yellow iupin) leghemoglobin l; human hemoglobin ¢t and [.tchains; human myoglobin. Actins First set: Acanthamoeba castellani (amoeba), A thaliana, C albicans, H sapiens, S cerevisiae, Schizosaccharomyces pombe, Trypanosoma brucei, Tetrahymena pyriformis, Volvox carteri, Zea mays. Second set: Caenorhabditis briggsae (nematode) actins 1, 2 and 3, Ctenopharyngobon idella (grass carp), Dictyostellium discoidum (slime mould) A8, AI2, A3-SI and A3-$2 actins, D melanogaster locus 79B and 88F, Strongylocentrotus purpuratus (sea urchin), A nidulans, Pisaster ochraceus (starfish) muscle and cytoplasmic actins, Physarum polycephalum, Thermomyces lanuginosus, Xenopus laevi (clawed frog), mouse, chicken muscle and cytoplasmic actins, rat.
With both matrices, the divergence plots of the lysozyme family (fig 2) show two prominent troughs. Their position corresponds to residues which are highly conserved in the lysozyme sequences and which interact with the saccharide substrate, namely, Glu35 (label E35), Asp52 (label D52) and Asn59 (label N59) the numbering is that of hen egg-white lysozyme [28]. Two of these residues, E35 and D52, are even conserved in bacteriophage T4 lysozyme [29]. The divergence plots of the globin family (fig 3) also show well delineated peaks and troughs, although the contrast between high and low values is much smaller than in lysozymes (compare figs 2 and 3). The most prominent troughs are labeled in figure 3a according to the nomenclature used by Bashford et al [30], eg the label A12 means the twelfth residue in helix A. Interestingly, the trough F8 corresponds to the distal histidine which is present in all globins (it is the heme irons fifth ligand) and the trough E7 corresponds to a histidine residue conserved in most globins in the E helix near the oxygen binding site.
Figures 3c and 3d (Gribskov matrix) provide an example of a situation where a relatively high value of corresponds to a low value of $(d) (peak and trough labeled X). This happens because most of the pairwise distances are close to the mean distance. This does not happen with the Risler matrix (fig 3a, 3b) just because the pairwise distances are different (which does not mean, in this case, that they are any better). In figures 2 and 3, the peaks and troughs appear to be sharper and better defined with the Risler malrix than with the Gribskov matrix. Comparison of mean distance and standard deviation plots
The variations of and S(d) for the cytochrome c family are shown in figure 4 (values calculated with the Risler matrix and a window of seven). Two deep troughs labeled 8 are weaker in the S(d) plot than in the plot. Examination of the aligned sequences
575 indicates that the two corresponding regions contain a great number of gaps, so many indeed that the sequences cannot be reasonably considered as 'homologous' in this particular region. They are simply different in length and a few of them contain an extra loop. These troughs must therefore be considered as spurious. In such eases, the classical standard deviation of the distance s(d) would also be quite small, but the use of S(d) partly removes this effect (fig 4b). In the S(d) plot, the deepest two troughs correspond to well known conserved regions among cytochrome c sequences. The first one (labeled H) points to the
CXXCH peptide in which H is the proximal histidine liganded to the heme iron, and the second trough (labeled M) points to the distal methionine which is the sixth iron ligand in all cytochromes c. The most conserved and variable regions among cytochromes c are therefore correctly delineated by the S(d) plot.
Influence of the window length The plots corresponding to the protease family, calculated with the Risler matrix and windows of
1.0 •
~ ~ ~
O.5 0
Ei5 ;5:5 9
0
(C)
I oo
Eii5
D52
-
0.5
(d)
0
I oo
Fig 2. Divergence plots for the lysozyme family (ten proteins, see table I), all calculated with a window of seven residues. a. Risler matrix, plot of the mean distance . b. Risler matrix, plot of the root-mean-square distance S(d). c. Gribskov matrix, plot of the mean distance . d. Gribskov matrix, plot of the root-mean-square distance S(d). The labels E35, D52 and N59 point to highly conserved residues (see the text). The deep trough around residue 1I0 also corresponds to a conserved region.
576
2.0]
A12 C2 CD4
(H) F8
(H)
E7
(a)
2.0
N A12 C2 CD4 (H) E7
100
100
(H) X re
(H) F8
Ill
1.0
1.0'
1,0'
(H) A12 C2 CD4 E7
(C)
(H) (H) A12 C2 CI)4 E7 x Fa
I
i
0.5
(d)
I
0.5
0
NI 100
i
0
100
Fig 3. Divergence plots for the globin family (26 proteins, see table I), all calculated with a window of seven residues. a, b. and S(d) calculated with the Risler matrix, e, d. and S(d) calculated with the Gribskov matrix. The labels A12, C2, CD4, E7 and F8 point to structurally conserved amino-acids (see text). The labels X indicate a region which gives a peak in the plot and a trough in the S(d) plot. width 3, 5, 7, 9, 11 and 15 are shown in figure 5a-f. As expected, the larger the window, the smoother the curve. Some troughs, labeled G in figure 5, are due to a great number of gaps in the aligned sequences. These regions which are rich in gaps are not wide (at most four positions) and, consequently, the corresponding troughs tend to be wiped off by the larger windows. In the examples shown in figure 5, it seems that the window width should be set to at least seven residues.
With windows of width 9, I1 and 15 (fig 5d-f) there remain three prominent troughs. These troughs, labeled H, D and S, correspond to the peptides containing the conserved 57His, 102Asp and 19SSer (chymotrypsinogen numbering) respectively, which form the catalytic triad in all the serine proteases [31]. Like in the other examples, the divergence plots of the serine protease family allow a rapid vizualisation of the most conserved and variable regions of the sequences.
577
Robustness of the representation The question may be raised as to whether the statistics used in this study are sensitive to which sequences are selected. To try and answer it, we have studied the case of actin, an ubiquitous protein. In one case, ten sequences were selected (table I ) from widely different organisms such as amoeba, plants, yeast and mammals (see fig 6a, b). The resulting mean distance and standard deviation plots show clear peaks and troughs. To this set of sequences were then added 21 other ones (table I). The divergence plots for these 31 sequences are shown in figure 6c, d. Two conclusions can be drawn from the comparison of the two sets: i) the peaks and troughs which were found in the ten sequences set are observed again in the 31 sequences set. Thus the method is robust; ii) the overall mean distance decreases while more sequences are added, whereas the overall standard deviation increases noticeably. This shows that, as expected, the standard deviation provides a more accurate measurement of the heterogeneity in each column of the multiple alignment. If one is interested only in the positions of the peaks and troughs o f the divergence plots, then the mean distance and the standard deviation plots are equivalent. However, if one is more interested in comparing quantitatively the features of two sets of selected sequences, then the standard deviation should be preferred.
Conclusion The divergence plot method described in this paper is a tool for analyzing multiple protein sequence alignments in a simple way. Once the sequences have been aligned, it consists o f calculating the mean distance and the standard deviation of the distance in each column of the alignment. The results are then averaged in a window o f defined length and plotted as a function of the window position along the sequences. This gives the possibility to estimate quickly the degree of divergence between a set of sequences, an operation which cannot be performed easily by eye. It is obviously easy to decipher, in a set of aligned sequences, those regions where one or several aminoacids are strictly conserved. However, when all the residues are different in one column of a multiple alignment, it is practically impossible to estimate the degree of divergence. Such cases are shown in figure If and g where two colums each contain ten different residues. In one column (fig If) the mean distance is rather low, in the second column (fig lg) it is much higher. These two cases would be clearly differentiated by the divergence plots.
O f the two matrices used in this study, the Risler matrix [23] seems to be more resolutive than the Gribskov matrix [22]. The peaks and troughs indicating the most divergent segments or the most conserved blocks are generally better defined in the 2.0
H 6
6
M
(a)
II
1.o
160
0
|,~'
H 6
,.o
II
I
8
M
I
0.5
0
0
100
Fig 4. Divergence plots for the cytochrome c family (24 proteins, see table I ) calculated with the Risler matrix and a window of seven residues, a. Mean distance . b. rootmean-square distance S(d). The labels H and M point to the conserved proximal histidine and distal methionine. The labels 8 indicate a region in the multiple alignments where a great number of gaps were introduced.
578
2
SH
D
SS
SH
2
(a)
D
S S
i
!
(b)
1
.
|
0 2
1O0 SH
D
200
I
|
0
q
0
~ S
2
1O0 sH
D
!00 S S
(c)
0 21
1O0 SH
D
(d)
200
0
O0 SH
~S
D
200 6 S
Cf)
(e)
I
o
0
I00
200
,
0
0
100
2OO
Fig 5. Divergence plots of the mean distance for the protease family (eight proteins, see table I) all calculated with the Risler matrix, a-f. Window of 3, 5, 7, 9,11 and 15 residues, respectively. The labels H, D and S point to the conserved catalytic triad. The labels 8 indicate regions in the multiple alignment where most of the sequences contain gaps.
579
0.6
0.6,
O'ilLO '
0"+
o
16o
26o
3(~o
b
16o
No
3ao
Fig 6. Divergence plots of the actin family, all calculated with the Risler matrix and a window of seven, a, b. Mean distance and standard deviation, respectively, for the first set of ten sequences (see table I). e, d. Mean distance and standard deviation, respectively, for the second set of 31 sequences (see table I). mean distance plots as opposed to the standard deviation plots, but the mean distance is more sensitive to the presence of gaps which may lead to the presence of false minima. Finally, the averaging window should probably be set to a length of at least seven residues. The method described by Blanca et al [20], which takes into account the rate of evolution, is also aimed at delineating the most conserved or variable regions in a set of related sequences. In the present study, great attention has been paid to deal with the presence of gaps in the multiple alignments, as exemplified in figure lb and c. It could be useful to merge the two methods into a single unified one. In all of the four protein families studied here, the divergence plots point clearly to the most conserved
as well as to the most divergent segments. Both the mean distance plot and the standard deviation S(d) plot are effective in delineating the regions which are crucial for the activity of the proteins. In our examples, the divergence plots correctly found the catalytic triad in serine proteases, the His and Met iron ligands in cytochromes c, the amino-acids of the lysozyme peptide chain that interact with the bound saccharide and the distal histidine in globins. Moreover, the method described here allows an easy visualization of the evolutionary most divergent segments in a set of homologous proteins. This type of quantitative information cannot be assessed by standard methods. Interestingly, the most variable positions are often found in the near vicinity of the
580 most conserved ones. Two examples are worth mentioning. In lysozymes the most variable region is situated very close to the conserved 52Asp-59Asn and this maximum of divergence doesn't depend either on the matrix or on the calculation used (fig 2). Examination of the structures of hen egg-white and phage lysozymes [29] reveals that this region of high divergence consists in an external loop situated at the tip of the entrance of the active site crevice. In proteases the highest peaks of evolutionary divergence flank on both sides the conserved 102Asp residue as it is clearly visible when the noise is smoothed by increasing the window size in figure 5. Both peaks correspond in the chymotrypsin structure to external loops which are highly variable [26]. When the sequences of a group of proteins that belong to the same structural or functional families are available, the divergence plot method should therefore be a useful tool for determining the most conserved and the most variable regions. The first ones are generally crucial for protein activity and, therefore, should be good candidates for site-directed mutagenesis experiments, while the second may be used for estimating the evolutionary distances. The program EDIP (Evolutionary Divergence Plot) used in this study has been written in the C language on a Vax/VMS computer and is available upon request. Inquiries can be sent to BROUILLET@FRCGM51. BITNET. The input data files are either in the GCG 'Pretty' format or in the CLUSTAL format. The figures in this paper have been plotted with the CricketGraph TM software on a Macintosh micro-computer.
References
1 Devereux J, Haeberli P, Smithies O (1984) A comprehensive set of sequence analysis programs for the Vax. Nucleic Acids Res 12, 387-395 2 George DG, Barker WC, Hunt LT (1986) The protein identificationresource (PIR). Nucleic Acids Res 14, 11-15 3 Gouy M, Gautier C, Milleret F (1985) System analysis and nucleic acid sequence banks. Biochimie 67,433-436 4 HigginsDH. Sharp PM (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237-244 5 Corpet F (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 16, 10881-10890 6 LipmanDJ, Altschul SF, Kececioglu JI2 (1989) A tool for multiple sequence alignment. Proc Natl Acad Sci USA 86, 4412-4415 7 Bains W (1989) MULTAN, a multiple string alignment program for nucleic acids and proteins. Comput Applic Biosci 5, 51-52 8 Vingron M, Argos P (1989) A fast and sensitive multiple sequence alignment algorithm. Comput Applic Biosci 5, 115-121 9 TaylorWM (1988) A flexible method to align large numbers of biological sequences.J Mol Evo128,161-169
10 Sobel E, Martinez HM (1986) A multiple sequence alignment program. Nucleic Acids Res 14, 363-374 11 Feng D, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evo125, 351-360 12 Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci USA 84, 4355-4358 13 BartonGJ, Steinberg MJE (1987) A strategy for the rapid multiple alignment of protein sequences. J Mol Biol 198, 327-337 14 Smith HO, Annau TM, Chandrasegaran S (1990) Finding sequence motifs in groups of functionally related proteins. Proc Natl Acad Sci USA 87, 826-830 15 Bacon DJ, Anderson WF (1986) Multiple sequence alignment. J Mol Biol 191,153-161 16 Blundeil TL, Sibanda L, Pearl L (1983) Three-dimensional structure, specificity and catalytic mechanism of renin. Nature 304, 273-275 17 Wu "Vl', Kabat A (1970) An analysis of the sequences of the variable regions of Bence Jones proteins and myeloma light chains. J Exp Med 132, 211-250 18 Taylor WR (1986) The classification of aminoacid conservation. J Theor Biol ! 19, 205-218 19 Zvelebil MJ, Barton GJ, Taylor WR, Steinberg MJE (1987) Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Bioi 195, 957-961 20 Blanca F, Ferraz C, Sri Widada J, Liautard JP (1990) Quantitative analysis of the selective pressure exerted on homologous proteins. Comput Applic Biosci 6, 223-228 21 Dayhoff MO, Barker WC, Hunt LT (1983) Establishing homologies in protein sequences. Methods Enzymol 91, 524-545 22 Gribskov M, Burgess RR (1986) Sigma factors from F, coil, B subtilis, phage SP01 and phage T4 are homologous proteins. Nucleic Acids Res 14, 6745--6763 23 Risler JL, Delorme MO, Delacroix H, Henaut A (1988) Amino acid substitutions in structurally related proteins: a pattern recognition approach. J Mol Biol 204, 10191029 24 DoolittleRF (1981) Similar amino acid sequences: chance or common ancestry? Science 214, 149-159 25 Dickerson RE (1980) Cytochrome c and the evolution of energy metabolism. Sci Am 242, 99--110 26 Greer J (1981) Comparative model-building of the mammalian serine proteases. J Mol Biol 153, 1027-1042 27 Goodman M, Pedwaydon J, Czelusniak J, Suzuki T, Gotoh T, Moens L, Shishikura F, Walz D, Vinogrado S (1988) An evolutionary tree for invertebrate globin sequences. J Moi Evo127, 236--249 28 Blake CCF, Koenig DF, Mair GA, North ACT, Phillips DC, Sarma VR (1965) Structure of hen egg-white lysozyme. Nature 206, 757-761 29 Matthews8W, Remington SJ, Grtitte MG, Anderson WF (1981) Relation between hen egg-white iysozyme and bacteriophage "1"4 lysozyme: evolutionary implications. J Mol Biol 147, 545-558 30 BashfordD, Chothia C, Lesk AM (1987) Determinants of a protein fold. Unique features of the globin amino acid sequences. J Mol Biol 196, 199-216 31 BirktoftJJ, Blow DM (1972) Structure of crystalline ¢gchymotrypsin.J Mol Bio168, 187-240