J. theor. Biol. (1986) 119, 189-196
Phylogenetic Trees Constructed from Hydrophobicity Values of Protein Sequences JACK A. M. LEUNISSEN AND WILFRIED W. DE JONG
Department of Biochemistry, University of Nijmegen, Geert Grooteplein N21, 6525 EZ Nijmegen, The Netherlands (Received 8 May 1985, and in final form 25 October 1985) Information about conformational properties of a protein is contained in the hydrophobicity values of the amino acids in its primary sequence. We have investigated the possibility of extracting meaningful evolutionary information from the comparison of the hydrophobicity values of the corresponding amino acids in the sequences of homologous proteins. Distance matrices for six families of homologous proteins were made on the basis of the differences in hydrophobicity values of the amino acids. The phyiogenetic trees constructed from such matrices were at least as good (as judged from their faithful reflection of evolutionary relationships), as trees constructed from the usual minimum mutation distance matrix.
Introduction In recent years several procedures have been described in which hydrophobicity values of amino acids are employed to assess tertiary structure properties of proteins (Hopp & Woods, 1981; Kyte & Doolittle, 1982; Sweet & Eisenberg, 1983; Argos & Siezen, 1983; Lawson et al., 1984). The hydrophobicity values of the successive amino acids in a sequence allow the construction of a hydrophobicity profile of a protein. Such a hydrophobicity profile thus reflects three-dimensional structural properties of a protein. Since evolution of proteins is supposed to act primarily at the level of spatial conformation (Bajaj & Blundell, 1984), it should be meaningful to explore evolutionary pathways of homologous proteins by comparison of their hydrophobicity profiles. To this end we investigated an alternative matrix method, based upon hydrophobicity values, for the construction of phylogenetic trees. In the last decade much attention has been given to the development and comparison o f algorithms for the construction of phylogenetic trees from amino acid or nucleotide sequences (e.g. Prager & Wilson, 1978; Tateno et aL, 1982). Two different strategies can basically be distinguished. One of these uses amino acid or nucleotide sequence data directly (Dayhoff, 1972; Foulds et al., 1979), and attempts to construct a tree which needs the lowest number of replacements or substitutions, respectively. The second utilizes the sequence data indirectly, by first making a matrix of pairwise distances between all pairs of sequences, and then constructing a tree which fits these distances best (Sokal & Mitchener, 1958; Fitch & Margoliash, 1967). The latter, more commonly used method can also be applied to other types of data, like D N A - D N A hybridization and immunological distances. A third method combines both strategies, by first constructing a starting cladogram on the basis of 189
0022-5193/86/060189+08 $03.00/0
© 1986 Academic Press Inc. (London) Ltd
190
J . A . M . LEUNISSEN AND W. W. DE JONG
a distance matrix, and after that trying to find the most parsimonious tree by local rearrangements of the branches (Moore et aL, 1973). A distance matrix of homologous amino acid sequences is constructed by summing the mutational distances between all individual amino acid pairs from both sequences at equivalent positions, after proper alignment of these sequences has been made. The mutational distance between all possible amino acid pairs can be obtained from published amino acid distance matrices. The most widely used amino acid distance matrix is the one of Fitch & Margoliash (1967). It is based upon the minimum nucleotide difference between pairs of amino acids. Two less familiar distance matrix methods have been described by Goodman & Moore (1977), and by Cornish-Bowden (1983). The first authors use conformational distance values, which have been derived from Chou-Fasman amino acid conformational parameters (1974a, b). The latter method does not need amino acid sequences, but constructs a matrix solely on the basis of the amino acid compositions of the proteins under investigation. We here report a comparison of phylogenetic trees constructed on the basis of a minimum mutational distance matrix with those based on distance matrices derived from hydrophobicity values o f amino acids.
Material and Methods CONSTRUCTION
OF DISTANCE
MATRICES
Distance matrices from homologous protein sequences were constructed according to the formula
D(A, B ) -
d( ai, bi) N
where D(A, B) is the distance between sequences A and B, d(a~, b~) is the distance between the ith amino acid in sequence A, and the ith amino acid in sequence B, and N is the number of positions examined. The amino acid distances d(i,j) were obtained from two different matrices, which hereafter are referred to as FM and LS. In the FM-matrix d(i,j) is the minimum mutation distance, which is the minimum number of nucleotide substitutions required to convert the codon coding for amino acid i into the codon for amino acid j (Fitch & Margoliash, 1967). In the case of the LS-matrix, the distance d(i,j) was defined as
where H~ and ~ are the hydrophobicity values of amino acid i and j, respectively. The hydrophobicity values for the calculation of the LS-matrix were taken from Lawson et al. (1984). The ru- values, as defined by Sweet & Eisenberg (1983), which measure the similarity in three-dimensional protein structure, cannot be used for clustering, since these values are not a true distance measure, but are correlation coefficients, and thus violate the triangle inequality (Cavalli-Sforza & Edwards, 1967).
PHYLOGENY
FROM
CONSTRUCTION
HYDROPHOBICITY OF PHYLOGENETIC
VALUES
191
TREES
Phylogenetic trees were constructed according to a modification of the Fitch and Margoliash method (Fitch & Margoliash, 1967; Prager & Wilson, 1978), using the FITCH-program ofJ. Felsenstein, as distributed in his PHYLIP-package for inferring phylogenies (Version 2.5). This program starts with an unrooted tree of three OTUs (Operational Taxonomic Units, which in our case are amino acid sequences), and adds one OTU at a time to the tree. After each addition it performs local rearrangements of the branching pattern, in order to find tile most parsimonious topology. Since the program cannot search all possible branching arrangements, there is a dependance on the order of input of the OTUs. Therefore the program was run several times, using different input-orders. This program differs from the original method in discarding topologies that contain negative branch-lengths (Prager & Wilson, 1978). PROTEIN SEQUENCE DATA Proteins used for the construction of phylogenetic trees were ot-crystallin A (Stapel et al., 1984; de Jong et al., 1985, and references therein), or- and fl-hemoglobin (Chauvet & Acher, 1970; Goodman, 1981, and references therein; Leclercq et al., 1981; Kleinschmidt et al., 1982), myoglobin (Goodman, 1981), cytochrome c (Goodman, 1981) and preproinsulin (Kwok et al., 1983). These protein families were selected on the basis of the following considerations. First, the density of the data sets allows the selection of representatives of the vertebrate classes of mammals, birds, amphibians and fishes, whose phylogenetic relationships are well established on the basis of biological evidence. Moreover, these proteins are among the most frequently used for phylogenetic reconstructions. Secondly, in order to construct reliable phylogenetic trees, one should be reasonably sure not to use paralogously related sequences. This condition, however, may not hold for all selected families, except for the ot-crystallin A-chain (King & Piatigorsky, 1983). CRITERIA
FOR EVALUATION
OF PHYLOGENETIC
TREES
The constructed trees were evaluated according to established taxonomic opinions (Romer, 1966; McKenna, 1975), e.g. placental mammals should group together, and should be the sister group of the marsupials. The branching patterns of the F M - and LS-derived parsimony trees for a given protein were compared with each other and with the "biological" tree. Those computer-constructed trees were considered best which showed the greatest degree of congruence with the "biological" tree. Also the branch length, as a measure of the relative strength of inferred relationships, was taken into consideration. Results
A comparison of the F M - m a t r i x , based on minimum mutation differences between amino acids, and the LS-matrix, based on hydrophobicity differences, is given in
192
J. A. M.
~'~
g~
°~
.~
LEUNISSEN
AND
W. W. D E J O N G
PHYLOGENY FROM HYDROPHOBICITY VALUES
193
Table 1. It reveals that the LS-matrix not only provides a more extended and detailed array of distance values than the FM-matrix, but also that the values in the two matrices are often far from congruent. For instance, the difference between Ile and Trp requires the maximum of 3 mutations in the FM-matrix, but gives only a very low distance of 15 in the LS-matrix. On the other hand, the distance between Pro and Thr is only 1 mutation in the FM-matrix, and 238 in the LS-matrix. The difference between the FM- and LS-matrices is also apparent from their low correlation coefficient of 0.257. Figure 1 shows the different trees found for a-crystallin A, t~-hemoglobin, flhemoglobin, myoglobin, cytochrome c and preproinsulin, using the FM- and LSmatrices. For a-crystallin A both the FM- and LS-trees are in perfect agreement with the current opinions about vertebrate relationships. In the a-hemoglobin trees the LS-matrix performs slightly better, by joining the opossum and kangaroo on one branch, although on the other hand the alligator-avian monophyly is slightly weaker in the LS-derived tree. Both a-hemoglobin trees place rat and armadillo as sister-groups, which is not in accordance with biological evidence. Also the two fl-hemoglobin trees fail to assign the armadillo to its proper position, and both connect the birds to the mammalian stem. Although Gardiner (1982) has recently proposed a sistergroup relationship of birds and mammals, the prevailing opinion is that birds are derived from archosaurian reptiles. Overall the FM- and LS-matrices perform equally well for fl-hemoglobin. For myoglobin the two trees are also very similar and in agreement with biological opinions, apart from the separation of, again, birds and alligator. The cytochrome c trees are quite different from each other, mainly due to the position of rattlesnake, and both trees deviate greatly from the biological tree. Finally, in the preproinsulin trees the separation of the ungulates ox and horse is not biologically realistic, but both trees are congruent in their branching pattern. Other hydrophobicity-based distance measures (Hopp & Woods, 1981; Kyte & Doolittle, 1982; Sweet & Eisenberg, 1983) were investigated in tlae same manner, and performed in general equally well. These results are not shown, since they provided in almost all cases the same topology as for the LS-matrix, and only slight differences in the branch lengths. Such variations can be attributed to the fact that the hydrophobicity values have been derived by different methods. In conclusion it can be stated that using hydrophobicity distances, instead of minimum mutation distances between amino acid sequences, to construct phylogenetic trees yields essentially similar or, in a single case, slightly better results. Discussion
Hydrophobicity values for amino acids are derived from their physico-chemical properties, such as ethanol/water partition coefficients. The hydrophobicity profile therefore is a reflection of the physico-chemical characteristics of the amino acid sequence which determine its three-dimensional structure. As three-dimensional structure is known to be more conserved than protein primary structure, this implies that amino acid replacements have a greater chance of being accepted if the involved
:t
V
V
7
V
~'~0:0
rf:
~
v
Z
t~ tn
Z
Z
T~
Z
t"
PHYLOGENY FROM H Y D R O P H O B I C I T Y VALUES
195
residues are more similar in c o n f o r m a t i o n a l properties. It has been shown indeed that h y d r o p h o b i c i t y and side-chain bulk o f a m i n o acid residues tend to be conserved in evolution (French & R o b s o n , 1983). This study m o r e o v e r s h o w e d an evolutionary conservation o f s e c o n d a r y structure. C o n s e q u e n t l y h y d r o p h o b i c i t y values might p e r f o r m equally well, or even better, as a distance measure for the construction o f phylogenetic trees, as the m i n i m u m mutation distances. Unlike the F M - m e t h o d , which measures the distance between protein sequences as the m i n i m u m n u m b e r o f nucleotide substitutions, the LS-matrix measures this distance in hydrophobicity-differences, which express indirectly the difference in tertiary structure. Therefore one might hope, by using the LS-matrix, to obtain a biologically m o r e meaningful and realistic representation o f the evolutionary distance b,-t'veen sequences than by using the F M - m a t r i x . O u r results indeed illustrate that the LS-matrix p e r f o r m s at least equally well as the F M - m a t r i x in constructing phylogenetic trees. Also Soto & Toh~i (1984) recently investigated the relevance o f physico-chemical differences between a m i n o acids in the construction o f evolutionary d e n d r o g r a m s o f 9 vertebrate calcitonins. T h e y f o u n d some differences in t o p o l o g y between the phylogenetic trees based on amino acid substitutions and hydrophobicity differences. It m a y o f course be recognized that also the F M - m a t r i x reflects to a certain extent, a n d in an indirect way, physico-chemical distances between amino acids. Argyle (1980) and S w a n s o n (1984) have demonstrated, that the a m i n o acid code can be a r r a n g e d in the f o r m o f a "similarity alphabet". This expresses the m i n i m u m change principle, u p o n which the code appears to be founded. In this arrangement a minimal c h a n g e in the genetic code c o r r e s p o n d s to a minimal change in the p h y s i c o - c h e m i c a l characteristics o f the a m i n o acids involved. This model then provides the link between the two matrices F M and L S . We would like to thank Professor Morris Goodman and Dr John Czelusniak (Wayne State University, Detroit, Michigan) for sending us the protein sequence data on tape, and Dr J. Felsenstein (University of Washington, Seattle, Washington) for supplying us with his PHYLIP package. REFERENCES ARGOS, P. & SIEZEN, R. J. (1983). Eur. J. Biochem. 131, 143. ARGYLE, E. (1980). Orig. Life 10, 357. FIG. 1. Phylogenetic trees based on mutational distances (FM-matrix) (left) and on hydrophobicity distances (LS-matrix) (middle) between protein sequences, as compared to the most probable biological trees (right). (a) t~-crystallin A, (b) a-hemoglobin, (c) fl-hemoglobin, (d) myoglobin, (e) cytochrome c, and (f) preproinsulin. The F M and LS trees for the different proteins are dra,:n to an approximately equal size. Scaling factors used for a-hemoglobin, fl-hemoglobin, myoglobin, cytochrome c and preproinsulin were 0-30, 0-20, 0.23, 0-51 and 0-15, all relative to a-crystallin (= 1-0). Branch lengths in the FM- and LS-trees are proportional to mutational and hydrophobicity distances, respectively. The branch lengths in the biological trees are roughly proportional to the estimated divergence times as taken from the paleontological literature (Romer, 1966; McKenna, 1975). Note the fact that the FM- and LS-trees are rootless. Abbreviations used: H: horse; O: ox; S: sheep; D: dog; R: rat; Rb: rabbit; SI: 2-toed sloth; Ar: nine-banded armadillo; K: kangaroo; Op: opossum; Os: ostrich; Ch: chicken; Du: duck; P: penguin; A: alligator; Ra: rattlesnake; F: frog; Sh: shark; Af: anglertish; Hf: hagfish.
196
J. A. M. L E U N I S S E N A N D W. W. DE J O N G
BAJAJ, M. & BLUNDELL, T. (1984). Ann. Rev. Biophys. Bioeng. 13, 453. CAVALLI-SFORZA, L. L. & EDWARDS, A. W. F. (1967). Evolution 21, 550. CHAUVET, J. P. & ACHER, R. (1970). FEBS Lett. 10, 136. CHOU, P. Y. & FASMAN, G. D. (1974a). Biochemistry 13, 211. CHOU, P. Y. & FASMAN, G. D. (1974b). Biochemistry 13, 222. CORNISH-BOWDEN, A. (1983). ~tethods Enzymol. 91, 60. DAYHOFF, M. O. (1972). Atlas of Protein Sequence and Structure, Vol. 5. Silver Spring, MD: Natl. Biomed. Res. Found. DE JONG, W. W., ZWEERS, A., VERSTEEG, M., DESSAUER)H. C. & GOODMAN, M. (1985). Mol. Biol. Evol. 2, 484. FITCH, W. & MARGOLIASH, E. (1967). Science 155, 279. FOULDS, L. R., HENDY, M. D. & PENNY, D. (1979). J. tool. EvoL 13, 127. FRENCH, S., ROBSON, B. (1983). J. mol. Evol. 19, 171. GARDINER, B. G. (1982). Zool. J. Linn. Soc 74, 207. GOODMAN, M. & MOORE, G. W. (1977). J. tool. Evol. 10, 7. GOODMAN, M. (1981). Prog. Biophys. mole~ Biol. 37, 105. HopP, T. & WOODS, K. (1981). Proc natrt Acad. Sc£ U.S.A. 78, 3824. KING, C. R. & PIATIGORSKY, J. (1983). Cell 32, 707. KLEINSCHMIDT, T., DE JONG, W. W. & BRAUNITZER, G. (1982). Hoppe-Seyler's Z. Physiol. Chem. 363, 239. KwoK, S. C. M., CHAN, S. J. & STEINER, D. F. (1983). J. biol. Chem. 258, 2357. KY'rE, J., DOOUTrLE, R. (1982). J. mol. Biol. 159, 105. LAWSON, E. Q., SADLER, A. J., HARMATZ, D., BRANDAU, D. T., MICANOVIC, R., MCELROY, R. D. & MIDDAUGH, C. R. (1984). J. biol. Chem. 259, 2910. LECLERCQ, F., SCHNEK, A. G., BRAUNITZER, G., STANGL, A. & SCHRANK, B. (1981). Hoppe-Seyler's Z, Physiol. Chem. 362, 1151. McKENNA, M. C. (1975). In: Phylogeny of the primates: a multidisciplinary approach. (Luckett, W. P. & Szalay, F. S. eds). pp. 21-46. New York: Plenum. MOORE, G. W., GOODMAN, M., BARNABAS,J. (1973). J. theor. Biol. 38, 423. ROMER, A. S. (1966). Vertebrate paleontology, 3rd edn. Chicago: University of Chicago Press. PRAGER, E. M. & WILSON, A. C. (1978). J. mol. Evol. 11, 129. SOKAL, R. & MITCHENER, C. (1958). Univ. Kansas Sci. Bull. 382, 1409. SOTO, M. A. & TOHA, J. (1984). Orig. Life 14, 637. STAPEI.., S. O., LEUNISSEN, J. A. M., VERSTEEG, M. L. T. C., WATTEL, J., DE JONG, W. W. (1984)., Nature 311, 257. SWANSON, R. (1984). Bull' math. Biol. 46, 187. SWEET, R. M. & EISENBERG, D. (1983). J. tool. Biol. 171, 479. TATENO, Y., NEI, M. & TAJIMA, F. (1982). J. mol. Evol. 18, 387.