Stochastic versus augmented maximum parsimony method for estimating superimposed mutations in the divergent evolution of protein sequences. Methods tested on cytochrome c amino acid sequences


J. Mol. Biol. (1976) 105, 15-37

G. WILLIAM MOORE, MORRIS GOODMAN, CLARA CALLAHAN
Department of Anatomy, Wayne State University School of Medicine, Detroit, Mich. 48201, U.S.A.

RICHARD HOLMQUIST AND HERBERT MOISE
Space Sciences Laboratory, University of California at Berkeley, Berkeley, Calif. 94720, U.S.A.

(Received 1 December 1975)

Two ways of estimating superimposed fixed mutations in the divergent descent of proteins are examined. One method counts these in terms of a Poisson process operating within selective constraints. The other uses the maximum parsimony method to connect the contemporary sequences through intervening ancestral sequences in an evolutionary tree, and then, from the distribution of fixed mutations in dense regions of this genealogy, estimates how many fixations should be added to sparse regions. An algorithm is described which determines such augmented distances. The two methods yield similar estimates of genetic divergence when tested on a series of cytochrome c amino acid sequences. Within those constraints imposed by Darwinian selection, the dynamic behavior of the evolutionary divergence of proteins is described by the probabilistic pathways of the stochastic model. The parsimony model provides a valid Aufbau-Prinzip for examining which of those pathways occurred along a particular lineage. Concordance of the numerical magnitudes of genetic divergence estimates made by the two methods reveals them as logically consistent complements, not as mutually exclusive antagonists. Both methods indicate that cytochrome c has evolved in a non-uniform manner over geological time and more rapidly than previously estimated.

1. Introduction

The minimum mutation distance method (Jukes, 1963; Fitch & Margoliash, 1967) uses the genetic code to estimate the number of nucleotide point fixations separating the two genes that code for two homologous proteins since the time of their separation from a common ancestral gene. Paired amino acids are examined, a position at a time, for the minimal number of nucleotide differences between their codons. The sum of these minimum nucleotide differences over all aligned positions is the


minimum mutation distance for the pair of amino acid sequences. Because this measure of distance does not detect superimposed fixations, i.e. multiple fixations at the same nucleotide site, it underestimates evolutionary change more seriously over long stretches of time than over short stretches. Several efforts have been made to correct for this inadequacy (Zuckerkandl & Pauling, 1965; Margoliash & Smith, 1965; Jukes & Cantor, 1969; Kimura, 1969; Fitch, 1971; Dickerson, 1971; Dayhoff et al., 1972a,b; Holmquist, 1972a; Holmquist et al., 1972; Jukes & Holmquist, 1972; Moore et al., 1973; Goodman et al., 1974,1975).

The stochastic method (Holmquist et al., 1972; Jukes & Holmquist, 1972; Holmquist, 1976) corrects for the degeneracy of the genetic code, multiple fixations at the same nucleotide site or within the same codon (including back mutation), and parallelism. Nucleotide point mutations are stochastically distributed over the variable portion, T₂ codons in length, of the structural gene. For two homologous proteins, the experimentally observed ratio of minimal 2- plus minimal 3-base type amino acid replacements to minimal 1-base type amino acid replacements determines the average fixation intensity μ₂, i.e. the number of one-step nucleotide replacements each variable codon (variant) has sustained on the average. The greater this ratio, the larger is μ₂. The total estimated number of fixations or random evolutionary hits (REH) separating the corresponding structural genes is simply μ₂T₂.

A different way to account for superimposed fixations is to reconstruct the evolutionary tree and the intermediate ancestral sequences connecting each contemporary pair of sequences to their most recent common ancestor (Goodman et al., 1974). For example, a proline residue occurs at helical position A2 in both the α and β chains of human hemoglobin. Thus the minimum mutation distance value is zero between aligned α and β human hemoglobin sequences at this position.
In the reconstructed globin genealogy, however, the intervening residues connecting these two proline residues are Pro (CCU)-Ala (GCU)-Asp (GAU)-Ala (GCU)-Pro (CCU), resulting in a reconstructed distance of four fixations. When the reconstruction is done in such a way as to minimize the total number of nucleotide replacements required, the resulting ancestral sequences are termed the maximum parsimony solution. If the topology of an evolutionary tree is known, then the maximum parsimony ancestral sequences can be solved using the algorithm proposed by Moore et al. (1973). The maximum parsimony algorithms proposed by Fitch (1971), Hartigan (1973) and Sankoff (1973) do not provide for the degeneracy of the genetic code. Fitch & Farris (1974) have recently proposed a heuristic modification of Fitch's (1971) algorithm extending its applicability to proteins.

The maximum parsimony reconstruction procedure in some cases inequitably replaces missing fixations because this procedure is unable to detect all multiple nucleotide replacements between nodal points on the evolutionary tree. The number of fixations separating a pair of sequences whose evolutionary history back to their common ancestor is well represented by intermediate ancestors (i.e. for a pair of sequences which lies in a "dense" portion of the evolutionary tree) is more adequately estimated than is the number of fixations separating a pair of sequences whose evolutionary history back to their common ancestor is relatively poorly represented by intermediate ancestors (i.e. for a pair of sequences in a "sparse" portion of the evolutionary tree). Distances in sparse portions of the tree are underestimated by the maximum parsimony procedure, for just as there are gaps in the fossil record, so there are gaps in the molecular record. These gaps are of two types. Some are irreducible in the sense that they can never be filled because, as for some now-extinct soft-bodied forms, no fossil record was preserved. Other gaps are reducible; more cytochromes c have been sequenced for mammals than for worms but, given more data, the number of such gaps will diminish.

We have developed an augmented distance algorithm which uses the dense portion of an evolutionary tree to replace missing fixations in the sparse portion of the tree. As we shall show, the random evolutionary hits and augmented distance methods for replacing missing information yield correlated results when tested on a series of cytochrome c amino acid sequences. They also indicate that cytochrome c evolved not only at a faster rate than usually estimated, but in a non-uniform manner over geological time.

† We distinguish between the two concepts of covarion (Fitch & Markowitz, 1970) and varion (Holmquist, 1972b). The former is the number of codons in a protein that are free to fix mutations at a point in time and is a property of a single sequence. The latter is the number of codons in a pair of homologous proteins that have been free to fix mutations over some part of the period during which the two proteins have diverged from their common ancestor and is a joint property of the pair of sequences.
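The difference between a direct minimum mutation distance and a distance reconstructed through intermediate ancestors can be illustrated with a minimal sketch. It assumes the standard genetic code and uses hypothetical helper names (`codon_distance`, `min_mutation_distance`, `path_distance`); it is not the authors' program, only an illustration of the counting rule described in the Introduction.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in U, C, A, G order at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_distance(codon_a, codon_b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def codons_for(amino_acid):
    """All codons of the standard code that specify the given one-letter amino acid."""
    return [c for c, aa in CODON_TABLE.items() if aa == amino_acid]


def min_mutation_distance(aa_a, aa_b):
    """Minimal number of nucleotide differences between any codons of two amino acids."""
    return min(
        codon_distance(ca, cb)
        for ca in codons_for(aa_a)
        for cb in codons_for(aa_b)
    )


def sequence_min_mutation_distance(seq_a, seq_b):
    """Sum of per-position minimum mutation distances over two aligned sequences."""
    return sum(min_mutation_distance(a, b) for a, b in zip(seq_a, seq_b))


def path_distance(codons):
    """Fixations counted along a chain of reconstructed ancestral codons."""
    return sum(codon_distance(a, b) for a, b in zip(codons, codons[1:]))


# Direct comparison of the two proline residues sees no change ...
direct = min_mutation_distance("P", "P")
# ... whereas the reconstructed path Pro-Ala-Asp-Ala-Pro from the text sees four.
reconstructed = path_distance(["CCU", "GCU", "GAU", "GCU", "CCU"])
```

Here `direct` is 0 and `reconstructed` is 4, matching the hemoglobin A2 example: the direct comparison cannot see the superimposed fixations that the intervening ancestral codons reveal.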

2. Description of the Augmentation Algorithm

The minimum augmentation algorithm has as its objective the equalization of mutation distance values in densely and sparsely represented portions of the evolutionary tree. Starting data for this algorithm are: (i) an evolutionary tree topology and (ii) an initial value assigned to each pair of nodes on the topology. With our data a DD† or direct distance value is obtained for a pair of nodes by counting the number of nucleotide differences between the two directly aligned sequences. Applying the algorithm to these direct distance values results in an augmented distance (AD) for each link (internodal distance) on the topology. The AD for a pair of non-adjacent nodes is then the sum of the ADs along the intervening links. The AD for a link of DD = 0 is defined to be zero. The ADs for all links of equal DD are defined equal. The AD for a link of DD = k + 1 is at least one larger than the AD for a link of DD = k. Finally, the AD for a single link of DD = k is at least as large as the AD for some pair of nodes in the tree whose DD from one another is also k, but which are separated from one another by a maximal number of intermediate nodes. For internal consistency, the formal proof (Moore, 1976) allows as maximally separated nodes only those in which every intermediate link has AD less than k. We are assured for these link paths that ADs for DD = k in the sparsest portion of the evolutionary tree (i.e. single links) are no smaller than ADs for DD = k in the densest portions of the evolutionary tree (i.e. maximally separated nodes). Subject to these restrictions, we want the sum of ADs over the given tree to be minimized. The formal proof demonstrates that the augmentation algorithm has this property.

† Abbreviations used: DD, direct distance; AD, augmented distance; AI, augmentation increment; REH, random evolutionary hits; NDC, nucleotide differences per 100 codons, as estimated for cytochrome c messenger RNA pairs. See the first and third footnotes in Table 3. m yr bp, millions of years before present.

The example presented in Figs 1 and 2 and Table 1 shows how the augmentation algorithm works. First (Fig. 1), the evolutionary tree is converted into a network by removing the root, or most ancestral point of the tree. We now prepare a working Table (Table 1). Six columns of data are entered in the working Table. For each pair of nucleotide sequences in the network, we calculate (i) a DD value and (ii) the number of intervening links. As a check for inadvertent omissions, we note that there are exactly (N - 1)(2N - 3) pairs, where N is the number of contemporary species (exterior nodes). This collection of pairs is sorted (i) primarily in ascending order of DD values, and (ii) secondarily in ascending order of intervening link number. If two pairs of sequences have the same DD value but a different number of

FIG. 1. A hypothetical tree and its corresponding network for demonstrating the minimum augmentation algorithm. The numbering (circles) of the exterior and interior nodal points is arbitrary. Non-parsimonious codons are used for the interior points so that the workings of the algorithm can be demonstrated on a relatively simple network.

intervening links, then the pair with the larger intervening link number appears later. For each pair, its DD value and intervening link number are entered in the proper columns. A heavy line is drawn underneath the final pair with DD = 0, underneath the final pair with DD = 1, etc. These heavy lines separate the different DD categories from one another, namely DD = 0, DD = 1, DD = 2, and DD = 3.

The principal calculational effort in this algorithm is the estimation of an augmentation increment (AI) corresponding to each DD value. The AI corresponding to DD = k is the amount by which each link of DD = k must be increased to achieve the AD for that link. This calculation of the AD and AI values is best illustrated by example. The remaining columns in the working Table are filled out only in part. By definition the AI for DD = 0 is zero. For the DD = 1 category, we fill out the intervening link ADs, column 4, only for the pairs with maximal intervening link number. For DD = 1 in Fig. 1, the maximal intervening link number is 2. For the pair 1 x 2 there are two intervening links, one with DD = 0, the other with DD = 1. The augmented distance for the former is by definition zero, and for the latter is at least 1, thus disqualifying the pair because one is not less than the DD category (= 1) under consideration. In Table 1 those intervening link ADs which result in a pair being disqualified are indicated by N (for not qualified). Pairs 1 x 7 and 3 x 8 are similarly disqualified. In general, if any of the intervening link AD values are greater than or equal to the DD category value, then that pair is disqualified. If all pairs with maximal intervening links are disqualified, then we try the pairs with maximal minus one intervening links; if all pairs with maximal minus one intervening links are disqualified, then we try pairs with maximal minus two intervening links, etc.

Since all pairs are disqualified for the DD = 1 category, the AI value for this DD category simply equals the AI value for the prior DD category, namely zero. For the DD = 2 category two pairs do not disqualify: 1 x 3 and 2 x 3. To calculate the total AD for a qualified pair, we sum the corresponding intervening link ADs. For pair 1 x 3, the intervening link ADs sum to 0 + 1 + 1 = 2. Similarly, the total AD for the pair 2 x 3 is 1 + 1 + 1 = 3. The AI for this DD category is either (i) the

TABLE 1
Working Table for Figure 1

Columns: DD; Pair; Intervening link no.; Intervening link distances (DD and AD); Total AD; AI.

[Per-pair entries are illegible in this copy. Recoverable category summary:]

DD    No. of pairs    Total AD of qualified pairs    AI
0     1               -                              0
1     6               none qualified                 0
2     11              2, 3                           0
3     10              5, 6                           2

FIG. 2. Original and augmented distances for the network in Fig. 1. Species numbers are circled. Mutation distances are not circled.

In the DD = 3 category, there are two qualified ADs: 5 and 6. The minimal AD is 5, and the AI from the previous DD category is 0; therefore, the new AI is either 5 - 3 or 0, whichever is larger. For DD = 3, the AI is 2. For a larger working Table, further DD categories would be worked out in an analogous fashion. It is unnecessary to work out AI values beyond that DD category corresponding to the largest link value in the original network. The solution to the sample problem is illustrated in Figure 2. We now examine the performance of the augmentation algorithm when tested on a large collection of real sequence data.
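The worked example can be sketched in code. The following is a minimal illustration under stated assumptions, not the formal algorithm of Moore (1976): each pair of nodes is represented (hypothetically) as a tuple of its DD value and the DD values of its intervening links, the AD of a link is taken to be its DD plus the AI of its category, and the AI for category k is taken as the larger of the previous AI and the minimal qualified total AD minus k, as the DD = 3 step above suggests. The function name and the input rows are illustrative; the rows reproduce only the qualified and disqualified patterns mentioned in the text, not the full working Table.

```python
from collections import defaultdict


def augmentation_increments(pairs):
    """Return {DD category: AI} for pairs given as (dd, link_dds).

    dd       -- direct distance between the two nodes of the pair
    link_dds -- direct distances of the links lying between them
    """
    pairs = [(dd, tuple(link_dds)) for dd, link_dds in pairs]
    by_dd = defaultdict(list)
    for dd, link_dds in pairs:
        by_dd[dd].append(link_dds)

    # AI values are needed only up to the largest link value in the network.
    largest_link_dd = max(d for _, links in pairs for d in links)
    ai = {0: 0}  # the AI for DD = 0 is zero by definition

    for k in range(1, largest_link_dd + 1):
        ai[k] = ai[k - 1]  # used when every pair in the category is disqualified

        def link_ad(d):
            # AD of a link whose DD category has already been solved.
            return d + ai[d]

        # Try pairs with the maximal intervening link number first, then fewer.
        link_counts = sorted({len(links) for links in by_dd[k]}, reverse=True)
        for n_links in link_counts:
            totals = [
                sum(link_ad(d) for d in links)
                for links in by_dd[k]
                if len(links) == n_links
                # A pair qualifies only if every intervening link AD is below k.
                and all(d < k and link_ad(d) < k for d in links)
            ]
            if totals:
                ai[k] = max(min(totals) - k, ai[k - 1])
                break
    return ai


# Illustrative rows (assumed, patterned on the worked example in the text).
EXAMPLE_PAIRS = [
    (0, [0]),
    (1, [1]), (1, [1]), (1, [0, 1]),
    (2, [2]), (2, [0, 2]), (2, [0, 1, 1]), (2, [1, 1, 1]),
    (3, [3]), (3, [1, 2, 2]),
    (3, [0, 1, 2, 2]), (3, [0, 1, 2, 3]), (3, [1, 1, 2, 2]),
]

increments = augmentation_increments(EXAMPLE_PAIRS)
# Augmented distance of a single link in each DD category.
link_ads = {dd: dd + inc for dd, inc in increments.items()}
```

With these rows the increments are 0, 0, 0 and 2 for DD = 0 to 3, so a link of DD = 3 receives an AD of 5, in agreement with the values quoted in the text.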

3. Evolutionary Tree of Cytochrome c

Figure 3 shows an evolutionary tree for 53 cytochrome c amino acid sequences. The original unaugmented link lengths in this tree were solved by the maximum parsimony algorithm (Moore et al., 1973), using the procedure described by Goodman et al. (1974). In this procedure, whenever there are alternative maximum parsimony solutions, the alternative chosen distributes fixations more often than the other

FIG. 3. Cytochrome c genealogical tree. Link lengths are the numbers of nucleotide replacements (fixed mutations) between adjacent ancestor and descendant sequences and are italicized when the augmentation algorithm corrects for superimposed fixations. The ordinate is a time-scale in 10^6 years based on paleontological views concerning the ancestral separations of the organisms from which the cytochromes c came. Since we are unaware of fossil evidence for branch points among insects, among angiosperms, among fungi, and between Crithidia and Euglena, the times of these branch points were guessed at from the magnitudes of the link lengths in these regions of the tree and by interpolation between the points which were placed on the basis of paleontological views. Species names are as in the legend to Table 2 with the addition of carp (Cyprinus carpio), snail (Helix aspersa), and buckwheat (Fagopyrum esculentum).


alternatives to the link paths sparse in intervening ancestors. This procedure has two effects. The first is to reduce but not eliminate the bias towards more marked underestimation of evolutionary change in a genealogical tree's sparse regions relative to its denser regions; it cannot eliminate the bias because the basic condition causing it remains: there can never be more than one nucleotide replacement at any particular nucleotide position when two sequences are directly compared, but there can be and often is more than one replacement when two sequences are compared through intervening sequences. One purpose of the augmentation algorithm is to correct for this bias. The second effect is to introduce a different bias towards equality of evolutionary rates. Actual differences in evolutionary rates along various lineages are probably somewhat larger than indicated by the AD calculations reported here (Rates of Cytochrome c Evolution, below).

The numbers on the links in the cytochrome c evolutionary tree in Figure 3 are the AD values for the links, i.e. they are the augmented link lengths found by the augmentation algorithm. These AD link values are also listed in the Figure alongside the corresponding DD link values. The tree required a minimum of 521 nucleotide replacements (unaugmented). Among several thousand topologies examined, it was the shortest length tree found not requiring additional mutational events of gene duplication to account for the descent of the sequences. The additional superimposed fixations found by the augmentation algorithm brought the total nucleotide replacements for this tree to 1047.
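Two quantities read off such a tree are the total tree length (the sum of all link lengths) and the AD between non-adjacent nodes (the sum of the link ADs along the path joining them). A minimal sketch follows; the function name `path_length` and the toy network with its link values are assumptions made for illustration and are not the cytochrome c tree of Figure 3.

```python
def path_length(links, start, goal):
    """Sum the link lengths along the unique path between two nodes of a tree.

    links -- dict mapping an unordered node pair (a, b) to the link length
    """
    adjacency = {}
    for (a, b), length in links.items():
        adjacency.setdefault(a, []).append((b, length))
        adjacency.setdefault(b, []).append((a, length))

    # Depth-first walk; in a tree there is exactly one path to the goal.
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbour, length in adjacency.get(node, []):
            if neighbour != parent:
                stack.append((neighbour, node, dist + length))
    raise ValueError("nodes are not connected")


# Hypothetical network: exterior nodes 1-5, interior nodes 6-8, with assumed link ADs.
toy_links = {
    (1, 6): 0, (2, 6): 1, (6, 7): 1, (3, 7): 1,
    (7, 8): 2, (4, 8): 2, (5, 8): 5,
}

total_length = sum(toy_links.values())      # length of the whole toy network
ad_1_3 = path_length(toy_links, 1, 3)       # links 1-6, 6-7, 7-3
ad_4_5 = path_length(toy_links, 4, 5)       # links 4-8, 8-5
```

For the cytochrome c tree itself, the same summation over links gives the totals quoted above: 521 replacements before augmentation and 1047 after, i.e. the augmented length is roughly twice the unaugmented one.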

4. Stochastic Estimates for Cytochrome c

The total numbers of nucleotide point fixations for the 1596 pairwise comparisons among 57 distinct cytochrome c sequences representing 64 species were estimated by the stochastic method outlined in the second paragraph of the Introduction and described in detail elsewhere, both theoretically (Holmquist et al., 1972; Holmquist, 1976) and by numerical example (Jukes & Holmquist, 1972). These values are summarized in Table 2 as random evolutionary hits per 100 codons. REH values were calculated as in the Appendix of Jukes & Holmquist (1972), then normalized to 100 codons by multiplying the REH values by 100/T, where T was the number of amino acid residues compared for the two species involved. For cytochrome c, T ranges between approximately 100 and 111. The sequences and their alignment were as given by Dickerson & Timkovich (1974).

Of these comparisons, 36 are pairs which have diverged so widely that an accurate upper bound to the number of fixations cannot be assigned (Case I). There are 39 closely related pairs having relatively few amino acid replacements, an abnormally large proportion of which are of the minimal 2- and 3-base type (Case II). In both cases, we have entered a minimum estimate equal to 1.5 times the minimum base differences separating the homologous genes. The factor 1.5 corrects for third codon position degeneracy. One sometimes finds a situation (Case III) where on comparing two homologous proteins (usually closely related) there are only amino acid replacements of the minimal 1-base type and none of the minimal 2- or 3-base type, so that the ratio of the latter to the former is zero. There are 32 pairs of this type in the data. Again a value for REH equal to 1.5 times the minimal base differences (in this case also equal to 1.5 times the amino acid differences) is used.
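The bookkeeping in this section can be checked with a short sketch. The function names and the sample numbers passed to them are hypothetical; only the 100/T normalization, the factor 1.5, and the counts of sequences and special-case pairs are taken from the text. The REH calculation itself (Jukes & Holmquist, 1972) is not reproduced here.

```python
from math import comb


def reh_per_100_codons(reh, residues_compared):
    """Normalize an REH value to 100 codons by multiplying by 100/T."""
    return reh * 100.0 / residues_compared


def minimum_estimate(min_base_differences, factor=1.5):
    """Fallback entry for Cases I-III: 1.5 times the minimum base differences."""
    return factor * min_base_differences


# 57 distinct sequences give C(57, 2) pairwise comparisons.
n_comparisons = comb(57, 2)

# Numbers of pairs in each special case, as stated in the text.
special_cases = {"I": 36, "II": 39, "III": 32}
case_fractions = {
    case: count / n_comparisons for case, count in special_cases.items()
}

# Hypothetical inputs, for illustration only.
example_normalized = reh_per_100_codons(52.0, 104)   # REH of 52 over T = 104 residues
example_minimum = minimum_estimate(10)               # 10 minimum base differences
```

`n_comparisons` evaluates to 1596, in agreement with the text, and the Case I fraction 36/1596 is a little over 2% of the comparisons.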

TABLE 2

[Matrix of pairwise REH values per 100 codons for the cytochrome c sequences; the numerical entries are illegible in this copy.]

Total estimated nucleotide point fixations per 100 codons (REH, random evolutionary hits) separating the 2016 species pairs from 64 species for the cytochrome c gene. Species are abbreviated by triplets of letters as follows: MAN, human (Homo sapiens); RHE, rhesus (Macaca mulatta); HOR, horse (Equus caballus); ZEB, zebra (Equus quagga boehmi); COW, cow (Bos taurus); DOG, dog (Canis familiaris); BAT, bat (Miniopterus schreibersi); RAB, rabbit (Oryctolagus cuniculus); ELS, elephant seal (Mirounga leonina); WHA, gray whale (Rhachianectes glaucus); KAN, kangaroo (Macropus canguru); CHI, chicken (Gallus gallus); EMU, emu (Dromaeus novaehollandiae); PEN, king penguin (Aptenodytes patagonica); DUK, Pekin duck (Anas platyrhynchos); PIJ, pigeon (Columba livia); TUR, snapping turtle (Chelydra serpentina); SNA, rattlesnake (Crotalus adamanteus); FRG, bullfrog (Rana catesbeiana); TUN, tuna (Thunnus thynnus or T. alalunga); BON, bonito (Katsuwonus vagrans); DGF, dogfish (Squalus sucklii); LAM, lamprey (Entosphenus tridentatus); FRF, fruit fly (Drosophila melanogaster); SWF, screw-worm fly (Haematobia irritans); SMO, silkworm moth (Samia cynthia); TMO, tobacco horn worm moth (Manduca sexta); SES, sesame (Sesamum indicum); CAS, castor (Ricinus communis); TOM, tomato (Lycopersicum esculentum); CAU, cauliflower (Brassica oleracea); PUM, pumpkin (Cucurbita maxima); MBE, mungbean (Phaseolus aureus); HEM, hemp (Cannabis sativa); NGL, love-in-a-mist (Nigella damascena); ELD, elder (Sambucus nigra); COT, cotton (Gossypium barbadense); ABU, abutilon (Abutilon theophrasti); ACE, sycamore (Acer negundo); SUN, sunflower (Helianthus annuus); NGR, niger (Guizotia abyssinica); NAS, nasturtium (Tropaeolum majus); PAR, parsnip (Pastinaca sativa); WHE, wheat (Triticum aestivum); MAZ, maize (Zea mays); LEK, leek (Allium porrum); ARU, arum (Arum maculatum); SPI, spinach (Spinacea oleracea); GIN, ginkgo (Ginkgo biloba); DEB, yeast (Debaryomyces kloeckeri); SAC, baker's yeast (Saccharomyces oviformis); CAN, yeast (Candida krusei); NEU, neurospora (Neurospora crassa); HUM, humicola (the thermophilic fungus Humicola lanuginosa); UST, rust fungus (Ustilago sphaerogena); EUG, euglena (Euglena gracilis); CRI, crithidia (Crithidia oncopelti).

The chimpanzee (Pan troglodytes), donkey (Equus asinus), pig (Sus scrofa) or sheep (Ovis aries), camel (Camelus dromedarius), turkey (Meleagris gallopavo), and rape (Brassica napus) cytochrome c sequences are, respectively, identical to those of man, zebra, cow, whale, chicken, and cauliflower. The original publications which describe the experimental determinations of these sequences are referenced in one place in the review by Dickerson & Timkovich (1974) of the cytochromes c.

Some 36 species pairs have diverged so widely that an accurate upper bound could not be assigned (Case I). For 39 pairs, the proportion of amino acid replacements of the minimal 2- and 3-base type is abnormally large (Case II); 32 closely related pairs have no amino acid substitution of the minimal 2- or 3-base type (Case III). For each of these 3 cases, the entry in the Table is the minimum base difference multiplied by 1.5 (see text). Before interpreting the data for the 36 pairs in Case I, the reader is referred to the text.

The specific pairs falling under Case I are: dogfish versus parsnip; Debaryomyces versus rabbit, chicken, emu, penguin, bullfrog, dogfish, fruit fly, screw-worm fly, parsnip, wheat, and ginkgo; Saccharomyces versus zebra, cow, dog, and bat; Crithidia versus penguin, pigeon, sesame, castor bean, abutilon, maize, Saccharomyces, and Humicola; and tuna and bonito versus ginkgo, Debaryomyces, Saccharomyces, Candida krusei, Neurospora crassa, and Humicola.

Those under Case II are: rhesus, bat, and rabbit versus Pekin duck; horse and zebra versus cow, rabbit, and gray whale; cow versus gray whale; dog and bat versus bullfrog, tuna, and bonito; dog versus lamprey; bullfrog versus snapping turtle; bonito versus elephant seal and tuna; dogfish versus fruit fly and screw-worm fly; fruit fly versus snapping turtle, rattlesnake, and bullfrog; screw-worm fly versus snapping turtle, rattlesnake, and bullfrog; castor bean versus pigeon, tomato, elder, and sycamore; tomato versus cotton; spinach versus cauliflower and hemp; sycamore versus elder and cotton; and leek versus cotton and abutilon.

The 32 closely related pairs in Case III are: human versus rhesus; horse versus zebra; cow versus dog, elephant seal, and kangaroo; dog versus bat, elephant seal, and kangaroo; elephant seal versus bat and kangaroo; rabbit versus whale; chicken versus emu, penguin, Pekin duck, and snapping turtle; emu versus penguin, Pekin duck, and snapping turtle; penguin versus Pekin duck and snapping turtle; Pekin duck versus snapping turtle; fruit fly versus screw-worm fly; arum versus tomato and maize; niger versus hemp and cauliflower; pumpkin versus mungbean, hemp, maize, and ginkgo; mungbean versus hemp; and niger versus maize.


The REH estimates for Case I are severe underestimates (cf. the tomato-Crithidia separation in Table 2). The 36 pairs falling under this case comprise less than 3% of the data. The REH estimates for Cases II and III are reasonably accurate. To be conservative we have nonetheless omitted Case II pairs from the correlation in the next section, because they appear to be due to strong selection with respect to the number of amino acid replacements allowed rather than to a Poisson process. Those species pairs falling under Case I, II or III are individually listed in the legend to Table 2. The REH values in Table 2 are larger than other commonly used measures of genetic divergence (Table 3). The following additional points are worth noting. First,

TABLE 3

Average values for several measures of genetic divergence between cytochromes c

Measure^a      Mean    Standard deviation    No. of pairs compared
REHC^a,b       122     75                    1596
NDC^b,c         60     29                    1596
MBDC            40     19                    1596
AADC            29     13                    1596
REHC/AADC       3.96    1.35                 1596
REHC/MBDC       2.88    0.83                 1596
REHC/NDC        1.83    0.72                 1596
NDC/MBDC        1.51    0.05                 1596
MBDC/AADC       1.36    0.11                 1596
μ2              3.41    1.39                 1489^d
100T2/T        36      15                    1489^d

a REHC, random evolutionary hits/100 codons (total number of fixations/100 codons separating the pair of cytochrome c genes); NDC, nucleotide differences/100 codons (predicted total number of countable, i.e. observable, nucleotide differences separating cytochrome c mRNA pairs); MBDC, minimum base differences/100 codons; AADC, amino acid differences/100 codons; μ2, Poisson parameter (average number of total fixations sustained/codon free to accept mutations); T2, number of codons free to accept mutations over the period of divergence of the 2 species; T, total codon sites compared for the species pair. Note that REH = μ2T2.
b If the third nucleotide position of a codon fixes mutations with, on the average, a frequency f3 greater than the average frequency with which the first 2 positions fix them, then the REH and NDC values should be augmented (multiplied) by the factor 2(f1 + f2 + f3)/3(f1 + f2).
c Calculated from eqn (16) of Holmquist (1972a) with the identifications NDC = 100N'(r)/T, L = 3T2, and X = REH.
d This is less than 1596 by 107 pairs for which the data do not provide sufficient information to calculate μ2 and T2. These 107 pairs are those that fall under Case I, II and III in the text. A lower bound for T2 is simply the number of amino acid differences separating the homologous protein pair.

comparison of the REH and NDC values shows that by the time one gets to the messenger RNA level over half the historical record has already been lost. This is in agreement with our earlier observation for Figure 3, wherein the contemporary set of amino acid sequences, through the corresponding mRNAs inferred from the genetic code table, sufficed to account for 521 fixations while the augmentation procedure indicated an additional 526 due to superimposed mutations. Second, if the mRNAs could be sequenced for pairs of cytochrome c genes, the observed


nucleotide differences between them could be directly compared with the predicted NDC values, care being taken to correct the latter (see footnote b of Table 3) for the fact that the third codon position is known to sustain more fixations than the first two positions (Salser et al., 1976). Such a comparison of experiment with prediction would constitute a stringent, additional test of stochastic models of evolutionary divergence. Third, from the μ2 values, each codon free to do so has fixed on the average between 3 and 4 mutations. And fourth, from the last row, 36% of the cytochrome c codons, on the average, have been free to fix mutations over the periods of divergence of the species pairs examined. It is interesting that for cytochrome c minimal 3-base type amino acid replacements are not strongly selected against. Out of a total of 156,613 homologous amino acid sites compared, 110,212, 29,768, 15,997 and 636 are of the minimal 0-, 1-, 2-, and 3-base type, respectively. The stochastic model predicts a total of 16,048 and 585 minimal 2- and 3-base replacements, respectively.
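The relation REH = μ2T2 and the third-position correction factor given in the footnotes to Table 3 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are ours, and the frequencies passed to the correction factor are hypothetical, not values from the paper. Because the mean of a product is not the product of the means, the first result should only be close to the tabulated mean REHC of 122.

```python
def reh_per_100_codons(mu2, variable_codons_per_100):
    """REH per 100 codons from the Poisson parameter mu2 and 100*T2/T."""
    return mu2 * variable_codons_per_100


def third_position_factor(f1, f2, f3):
    """Factor 2(f1 + f2 + f3) / 3(f1 + f2) from footnote b of Table 3."""
    return 2.0 * (f1 + f2 + f3) / (3.0 * (f1 + f2))


# Mean values quoted in Table 3: mu2 = 3.41 and 100*T2/T = 36.
mean_reh = reh_per_100_codons(3.41, 36)

# Hypothetical fixation frequencies (assumed, for illustration only).
no_excess = third_position_factor(1.0, 1.0, 1.0)      # third position like the others
double_third = third_position_factor(1.0, 1.0, 2.0)   # third position twice as frequent

# Observed site counts by minimal base type (0, 1, 2, 3) quoted in the text.
observed_sites = [110212, 29768, 15997, 636]
total_sites = sum(observed_sites)
```

When the third position fixes mutations at the same frequency as the first two, the factor reduces to one and no correction is applied.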

5. Comparison of Augmented Distance and Random Evolutionary Hits Values

We have compared the stochastic estimates of genetic divergence, REH, and the augmented maximum parsimony estimates, AD, for the 666 pairwise comparisons resulting from 37 species of cytochrome c on which both our laboratories had independently completed calculations before we knew of each other's work or suspected its mutual relevance. The data did not suffice to make accurate stochastic estimates for 25 of the 666 pairs, because these 25 pairs were of the type in Case I or Case II discussed in the preceding section. These 25 pairs are enumerated in footnote f of Table 4. The correlation between AD and REH was determined for the remaining 641 sequence pairs. The AD value for each pair of contemporary species of cytochrome c in Figure 3 was obtained from the sum of the AD values of the individual links connecting the two species through their evolutionary intermediates back to their common ancestor. For the comparison with REH values in Table 2, the sum of these AD values was normalized to 100 codons by multiplication by 100/103, as Figure 3 was constructed by comparing the cytochrome c sequences at 103 amino acid positions. The 641 pairs are grouped into ten taxa in Table 4. The resulting 51 pairs of AD and REH values are plotted in Figure 4 (top). A good correlation is observed in the numerical magnitude of the two measures. For closely related species or taxa, the genetic divergence is small and the number of superimposed fixations is small. This would automatically cause the numerical magnitude of our two measures of genetic divergence to be similar. To check this possibility, Figure 4 (top) was replotted, omitting the first four diagonals of the data in Table 4. The result is shown in Figure 4 (bottom). The 21 points plotted there are at least as distant as the mammalian-teleost divergence and demonstrate that the similarity of the REH and AD values is not an artifact.
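The pairwise AD calculation described above (summing link values along the path between two species through their common ancestor, then scaling by 100/103) can be sketched as follows. This is a minimal illustration under stated assumptions: the tree, the node names and the link values are hypothetical and are not taken from Figure 3. The second helper simply counts the cells of a half-matrix that remain when the nearest diagonals are dropped, as was done for Figure 4 (bottom).

```python
def path_distance(parent, link_ad, leaf_a, leaf_b):
    """Sum the AD values on the links joining two leaves via their common ancestor.

    parent maps each node to its ancestor; link_ad[node] is the AD value on the
    link between node and parent[node].
    """
    def chain_to_root(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain

    chain_a = chain_to_root(leaf_a)
    chain_b = chain_to_root(leaf_b)
    in_chain_b = set(chain_b)
    # First node on the way up from leaf_a that is also above leaf_b.
    common = next(node for node in chain_a if node in in_chain_b)

    total = 0.0
    for chain in (chain_a, chain_b):
        for node in chain:
            if node == common:
                break
            total += link_ad[node]
    return total


def cells_at_or_beyond(n_taxa, min_offset):
    """Count half-matrix cells (diagonal included) at least min_offset from the diagonal."""
    return sum(1 for i in range(n_taxa) for j in range(i, n_taxa) if j - i >= min_offset)


# Hypothetical tree and link values (illustration only).
parent = {"human": "anc1", "rhesus": "anc1", "anc1": "root", "horse": "root"}
link_ad = {"human": 2.0, "rhesus": 3.0, "anc1": 10.0, "horse": 7.0}

raw_ad = path_distance(parent, link_ad, "human", "horse")
ad_per_100_codons = raw_ad * 100.0 / 103.0   # normalization used for comparison with REH
```

With ten taxa the full half-matrix has 55 cells, and dropping the first four diagonals leaves 21, which matches the number of points reported for the bottom panel of Figure 4.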
Finally, and most convincing, the 641 individual pairs of AD and REH values are plotted in Figure 5. If our two measures of genetic divergence agree, the correlation should be linear with a slope of unity and an intercept of zero. The observed slope is 1.01 and the observed intercept is 7.66. Confidence limits of 99% on the slope are 0.95 to 1.07

TABLE 4

Average intertaxa and intrataxa genetic distances (estimated nucleotide point fixations) for 676^f pairwise comparisons of 38 species^e of cytochrome c

[Half-matrix over the taxa M, B, R, A, TEL, LF, I, P, Fg and EC. Each cell lists the average AD^a, the average REH^b and the number of pairs averaged^c, with standard deviations^d in parentheses. The cell entries could not be recovered from the extracted text.]

a Maximum parsimony augmented distance (AD)/100 codons.
b Stochastic distance (REH)/100 codons.
c Number of pairwise comparisons in average. Where 2 numbers are given, that preceding the comma includes those pairs in which rattlesnake is a member; that following the comma excludes those pairs.
d The number in parentheses is the population standard deviation and is given to provide some indication of the spread of the individual pairwise distances.
e M, mammals: human, rhesus, horse, donkey, cow, dog, whale, rabbit, kangaroo; B, birds: chicken, penguin, duck, pigeon; R, reptiles: values without brackets in the matrix boxes are for turtle only; values within brackets are REH values for turtle and rattlesnake (pairwise augmented distance comparisons for those pairs including rattlesnake were not available); A, amphibian: frog; TEL, teleost fish: tuna; LF, lower fish: dogfish and lamprey; I, insects: fruit fly, screw-worm fly, silkworm moth, tobacco horn worm moth; P, plants: wheat, mungbean, castor bean, sesame, sunflower, cotton, abutilon, cauliflower, buckwheat, pumpkin; Fg, fungi: yeast, Candida, Neurospora; EC, Euglena and Crithidia.
f 37 species provide 666 species pairs. The following 25 pairs were excluded because the data did not suffice to determine accurate stochastic distances for them (see text): rhesus-duck; horse-cow, whale, rabbit; donkey-cow, whale; cow-whale; dog-frog, tuna; rabbit-duck; pigeon-screw-worm fly; turtle-frog, fruit fly, and screw-worm fly; frog-fruit fly and screw-worm fly; tuna-Candida and Neurospora; dogfish-fruit fly and screw-worm fly; Crithidia-penguin, pigeon, castor bean, sesame and abutilon. These 25 pairs comprise less than 4% of the data. Including the rattlesnake increases the number of species to 38 for 703 corresponding pairwise comparisons. Among these, the above 25 pairs were excluded and, in addition, the 2 pairs snake-fruit fly and snake-silkworm moth.
g The value given is for turtle-frog.
h The value given is for rattlesnake-frog.


[Figure 4: scatter plots of REH against AD (augmented distance). Top panel: REH = 12 + 1.10 AD, r = 0.93. Bottom panel: same as above with first four diagonals removed, REH = 7.6 + 1.13 AD.]

FIG. 4. Correlation between stochastic (REH) and augmented maximum parsimony (AD) estimates of genetic divergence among taxa. The top Figure includes all the data in Table 4. The bottom Figure is from the same data omitting those taxa which are related more closely than the mammalian-teleost divergence.

and on the intercept, -0.8 to +16.1. One would expect the correlation coefficient to be high, but less than unity, because a stochastic evolutionary mechanism has an inherent scatter. The observed linear correlation coefficient is 0.86. The fact that not only the means of the 641 pairs of REH and AD values agree (125 and 120, respectively), but also that the standard deviations of the two populations of values are similar (79 and 68, respectively), indicates that both measures of genetic divergence are indeed capturing at least the broader features of the same empirical distribution.
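The slope, intercept and correlation coefficient quoted above are the usual outputs of an ordinary least-squares fit of REH on AD. A minimal sketch of that calculation is given below; it is not the authors' program, and the AD and REH values are hypothetical numbers chosen only to show the call, not data from Table 2 or Table 4.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b, r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)   # linear correlation coefficient
    return intercept, slope, r


# Hypothetical AD and REH values per 100 codons (illustration only).
ad_values = [20.0, 60.0, 110.0, 150.0, 210.0]
reh_values = [30.0, 65.0, 120.0, 140.0, 230.0]
a, b, r = linear_fit(ad_values, reh_values)
```

Perfect agreement between the two measures would give a slope of one and an intercept of zero; scatter in the data lowers r below one without necessarily moving the slope.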

6. Rates of Cytochrome c Evolution

The AD values are about two-thirds of the REH values for the genetic distances between more closely related taxa within the vertebrates, but agree with the REH values for genetic distances between the more distantly related vertebrate taxa. The factor of two-thirds may be accounted for by the inability of the maximum parsimony method to correct for third codon degeneracy in some cases. Both REH and AD values increase in an orderly fashion as the phylogenetic separations between taxa increase. This does not mean that there is a simple linear relation between increasing REH or AD values and increasing evolutionary time, because certain lineages during the same span of evolutionary time may evolve more rapidly than others. In pairwise comparisons (Table 4) with invertebrate taxa, the REH and AD values for the amphibians (A), teleosts (TEL), and lower fish (LF) are, on the average,

[Figure 5: scatter plot of REH against AD (augmented distance) for cytochrome c. Fitted line REH = a + b.AD with a = 7.66, b = 1.01, r = 0.86, σ(AD.REH) = 41.]

FIG. 5. Correlation between 641 pairwise species comparisons of augmented maximum parsimony (AD) and stochastic (REH) estimates of genetic divergence for the cytochrome c gene. Data are from Table 2. The 641 species pairs used are identified in footnote f of Table 4.

about 10% larger than those for the amniotes: mammals (M), birds (B), and reptiles (R). The genetic divergence of the lower vertebrates among themselves is about three times larger than that of the amniotes among themselves. This increased divergence among the lower vertebrates can be due to either of two causes: an increase in the number of codon sites T2 free to fix mutations, or an increase in the fixation intensity μ2, i.e. in the number of mutations fixed by each codon free to do so without a necessary increase in the number of such codons. Table 5 separates these two causes for the taxa in Table 4. An increase in the number of codons free to fix mutations accounts for some of the increased divergence, while an increase in the fixation intensity accounts for most. The observations would be consistent with a constant evolutionary rate only if the average time of divergence of the lower (cold-blooded) vertebrates from one another were about three times larger than the average time of divergence of the amniotes from each other. As the fossil record makes this unlikely (see the following section), we conclude that, to the extent that the present sequence data are representative, the evolutionary rate for the cytochrome c gene has been more rapid along the lineages leading to the frog, tuna and lower fish than along those leading to the mammals, birds and turtle.

TABLE 5

Average intra- and intertaxa fixation intensity μ2^d and number of variable codons T2^d in the pairwise comparisons^b in Table 4

[Matrix over the taxa M, B, R^c, A, TEL, LF, I, P, Fg and EC. One half-matrix gives the average μ2 and the other the average T2, with standard deviations^a in parentheses. The cell entries could not be recovered from the extracted text.]

a Numbers in parentheses are the observed population standard deviations.
b Number of pairs in the average. As these numbers are the same for the corresponding cells in the upper half-matrix, they are omitted there for clarity.
c Includes rattlesnake.
d μ2 and T2 are defined in footnote a of Table 3. T2 has been normalized to 100 codons.


7. Non-uniformity of the Evolutionary Clock

Non-uniform rather than uniform rates characterize cytochrome c evolution. Assuming that rates were uniform allows the genetic distances among cytochrome c sequences to serve as a molecular clock. Ancestral branch times between taxa calculated by this supposed clock can be compared with fossil evidence. The dates from these calculations and those from the fossil record are presented in Table 6. There is poor agreement between the two sets of dates. The ancestral splitting between birds and turtle deduced from the AD and REH values is too recent, and the ancestral splittings of amniotes (mammals, birds and turtle) from frog, of tetrapods from teleosts, and of tetrapods and teleosts from shark and lamprey are too ancient.

TABLE 6

Comparison of the molecular clock model to fossil evidence

                                        Age of split (millions of years before present)
Taxa compared for                       Clock model^a
ancestral branch times                  AD             REH^b            Fossil evidence
Mammals-birds                           300            300              300
Mammals-reptiles                        255            270 (490)        300
Birds-reptiles                          135            120 (390)        225
Mammals, birds (and reptiles)-frog      480 (450)^c    640 (923)^d      340
Tetrapods-teleosts                      765            833 (873)        400
Tetrapods, teleosts-lower fish          1005           706 (744)        500
Vertebrates-insects                     1257           1153 (1208)      ≤680
Metazoans-plants^e                      2289           1516 (1526)      1000?^f
Metazoans-fungi                         2407           2052 (2073)      1000?^f
Plants-fungi                            2880           1720             1000?^f
Metazoans-protozoans                    2909           2414 (2436)      1000?^f
Plants-protozoans                       3390           2300             1000?^f
Fungi-protozoans                        3165           2870             1000?^f

a The genetic distances from Table 2 are treated as proportionate to time. The genetic distance between mammals (M) and birds (B) is set equal to 300 m yr bp (the age of the split from fossil evidence) and the genetic distances for the other taxa pairs are converted into splitting times by linear approximation. When the genetic distance involves results in more than one column of Table 2, the columns are equally weighted; e.g. the genetic distance between tetrapods and teleosts is the average of M-TEL + B-TEL + R-TEL + A-TEL.
b When 2 numbers are given, the one in parentheses is calculated from REH values in Table 2 which include rattlesnake as one of the reptiles; the number not in parentheses excludes rattlesnake.
c The number in parentheses is calculated from the average of the mammal-frog, bird-frog, and turtle-frog AD values.
d The number in parentheses is calculated from the average of the mammal-frog, bird-frog, and rattlesnake-frog REH values.
e The vertebrate-plant taxa were averaged together and this average was averaged with the insect-plant taxa.
f Knoll & Barghoorn (1975) in reassessing the evidence for Precambrian eukaryotic organisms conclude "In short, there is no good evidence for the presence of eukaryotes in Bitter Springs cherts. Similarly, all reports of older eukaryotes do not withstand critical examination. It would be hazardous for us to state that eukaryotes did not exist 900 million years ago, but if they did, their remains have yet to be found. Alternatively, multicellularity as evidenced by the several known Ediacaran faunas may have evolved quite rapidly following the origin of the nucleated cell. That is, eukaryotic cells may not have existed until very near the end of the Precambrian."
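The linear conversion described in the first footnote to Table 6 amounts to scaling every genetic distance by the ratio of the calibration age to the calibration distance. The sketch below illustrates this under stated assumptions: the function name is ours, the plants-fungi REH distance of 172 is the value quoted in the footnote to Table 7, and the mammals-birds REH distance of 30 is assumed here for illustration (it is the value that reproduces the plants-fungi REH clock date in Table 6).

```python
def clock_age(distance, calibration_distance, calibration_age_myr=300.0):
    """Convert a genetic distance to a splitting time by linear scaling.

    The calibration pair (mammals-birds in Table 6) is assigned the fossil
    age calibration_age_myr, in millions of years before present.
    """
    return distance * calibration_age_myr / calibration_distance


# Assumed inputs (see lead-in): mammals-birds REH = 30, plants-fungi REH = 172.
mammals_birds_reh = 30.0
plants_fungi_reh = 172.0
plants_fungi_age = clock_age(plants_fungi_reh, mammals_birds_reh)
```

By construction the calibration pair itself maps back onto 300 m yr bp, so any disagreement with the fossil record appears only in the other rows.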


Like Dickerson (1971) in his study of molecular evolution, we use Young (1962) and Romer (1966) as our principal sources of paleontological information on the dates of branch points in vertebrate phylogeny. The dates of about 300 m yr bp for the ancestral avian-mammalian divergence and 400 m yr bp for the teleost-tetrapod divergence are well accepted and have been used as fixed points in other studies of rates of cytochrome c evolution (e.g. Margoliash & Smith, 1965; Dickerson, 1971; McLaughlin & Dayhoff, 1972). The linear regression lines between genetic distance and time drawn by various investigators are reflective of average rather than constant evolutionary rates. On using AD and REH values for the genetic distances but still taking 300 m yr bp for the bird-mammal split as the fixed point, linear extrapolation places the tetrapod-teleost split at about 800 m yr bp. When genetic distances are measured by procedures which more adequately replace the missing mutations between anciently separated branches, no agreement with fossil evidence on branch times is observed. The clock gives a fallaciously ancient split between tetrapods and teleosts because (Table 7) cytochrome c evolved about three times more rapidly in early vertebrates than in the amniote lineages to birds and mammals. The lineage leading to the primates from their common ancestor with the other placental mammals has evolved much more rapidly than the non-primate eutherian

TABLE 7

Evolutionary rates of descending lineages of cytochromes c

                                                                               Nucleotide replacements/100 codons
                                                                               per 10^8 years
Evolutionary period                                          Age (m yr bp)     from AD values    from REH values
Eukaryote uni-multicell ancestor to protozoans               1000 to 0         12                <17
Eukaryote uni-multicell ancestor to fungi                    1000 to 0          9                >11
Eukaryote uni-multicell ancestor to plants                   1000 to 0         10                >6
Eukaryote uni-multicell ancestor to metazoans                1000 to 0          7                >8
Eukaryote uni-multicell to invertebrate-vertebrate ancestor  1000 to 680        9
Invertebrate-vertebrate ancestor to invertebrates            680 to 0           6
Invertebrate-vertebrate ancestor to vertebrates              680 to 0           7
Invertebrate-vertebrate to vertebrate ancestor               680 to 500         9
Vertebrate ancestor to anamniotes                            500 to 0           7
Vertebrate ancestor to amniotes                              500 to 0           5
Vertebrate to amniote ancestor                               500 to 300         8
Amniote ancestor to turtle                                   300 to 0           2
Amniote ancestor to birds                                    300 to 0           3
Amniote ancestor to mammals                                  300 to 0           3
Eutherian ancestor to primates                               90 to 0           19
Eutherian ancestor to non-primate eutheria                   90 to 0            3

[REH entries for the rows below the first four, in extracted order: 9, 9, 10, 7, -, 3, 2. Their row alignment could not be recovered from the extracted text.]

Evolutionary rates as nucleotide replacements/100 codons per 10^8 years were calculated from the AD values on the links in the cytochrome c genealogical tree in Fig. 3 for the appropriate evolutionary periods, using the time-scale based on paleontological views. When there is more than one lineage in an evolutionary period, the evolutionary rate shown is an average of the individual rates of the different lineages. In the REH rate calculations, nucleotide replacements were apportioned among lineages by the additive algorithm of Fitch & Margoliash (1967). For example, letting a and b be the number of replacements separating the plants and fungi, respectively, from their common ancestor, and letting c be the number of replacements separating the latter ancestor from the protozoans, from Table 4 b + c = 287, a + c = 230, and a + b = 172, whence a, b and c are 58, 114 and 172 replacements/100 codons, respectively. Divided by the 10 x 10^8 years assigned to these lineages, these give the first three REH rate entries in Table 7 (listed there in the order protozoans, fungi, plants).
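The three-lineage example in the footnote is a small linear system that can be solved in closed form. The sketch below is illustrative rather than a reproduction of the authors' program: the function name is ours, and the assignment of the three quoted distances to pairs (plants-fungi 172, plants-protozoans 230, fungi-protozoans 287) is an assumption, chosen because it is the one that reproduces the quoted branch values of 58, 114 and 172.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for the branch lengths."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    c = (d_ac + d_bc - d_ab) / 2.0
    return a, b, c


# Assumed assignment of the REH distances quoted in the footnote:
# plants-fungi = 172, plants-protozoans = 230, fungi-protozoans = 287.
plants, fungi, protozoans = three_point_branches(172.0, 230.0, 287.0)

# Rate per 10^8 years over a lineage assigned 1000 m yr (10 x 10^8 years).
rates = [branch / 10.0 for branch in (protozoans, fungi, plants)]
```

The exact solutions are 57.5, 114.5 and 172.5 replacements/100 codons, which the footnote reports as 58, 114 and 172; the corresponding rates are about 17, 11 and 6 per 10^8 years.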


lineages (Table 7). The converse is true for globin chain evolution (Holmquist et al., 1976). The average absolute rates, in nucleotide replacements per 100 codons per 10^8 years, along the lineage from the eutherian ancestor to the primates have been about 18, 16, 18, and 13, respectively, for cytochrome c, α-globin, β-globin, and myoglobin. Along the non-primate eutherian lineages the corresponding average rates are 6, 28, 23 and 13. However, for both cytochrome c and the globins, rates were much more rapid in the lineage from the eutherian ancestor to the monkey-hominoid ancestor than from the latter to present day catarrhine primates. Perhaps positive natural selection speeded up the rates of cytochrome c and globin evolution in the primate lineage, whereas stabilizing selection may have acted in later primates to preserve any functional improvements which were achieved during the earlier burst of change. In this connection, Ferguson-Miller et al. (1976) have recently found a striking difference between cytochromes c of catarrhine primates and other mammals in ability to interact with cytochrome oxidase. If the heart mitochondria used as the source of the oxidase are from a non-primate mammal, higher primate cytochromes c have very low electron transfer activity and inhibit the activity of non-primate mammalian cytochromes c. If the heart mitochondria are from primates, primate cytochromes c are highly reactive and non-primate cytochromes c somewhat less reactive.

The fossil record is less complete for prevertebrate stages of phylogeny than for vertebrate stages but still indicative of an upper limit of about 680 m yr bp for the ancestral arthropod-chordate branch point (J. W. Schopf, personal communication; Cloud, 1974). Linear extrapolation of the AD and REH values places this date at about 1.2 billion years ago (Table 6). Again the explanation could be that early evolution (prevertebrate and early vertebrate) was about three times more rapid than later amniote evolution (Table 7).

Dates for the ancestral splitting of metazoan, plant, fungi and protozoan branches are placed from REH and AD values by the uniform rate calculation at as much as 3 billion years ago (the fungi-protozoan split, for example). From the fossil record, only primitive prokaryotes existed at that time. The first evidence of eukaryotes (these are unicellular and interpreted as asexual with poorly developed mitotic apparatus) occurs about 1.3 billion years ago (Schopf et al., 1973; Cloud, 1974). Schopf (personal communication) indicates that the type of eukaryote which could have been the common ancestor of protozoans and multicellular eukaryotes first appeared at about a billion years ago. The discrepancy between clock dates and fossil dates is due in part to non-uniform evolution of cytochrome c. The rate of cytochrome c evolution has been about three to four times faster in protozoa, fungi and plants than in birds and mammals (Table 7). Larger variations in rates are observed between individual lineages (Fig. 3 and Table 2). Marked deceleration of α- and β-hemoglobin chain evolution among amniote vertebrates is also observed (Goodman et al., 1974, 1975; Holmquist et al., 1976).

8. A Paradox and its Resolution

Agreement between the stochastic and augmented maximum parsimony estimates took both our laboratories by surprise. The philosophical concept upon which the stochastic model is based is that within the constraints permitted by natural selection the evolution of protein function and genetic divergence has been by-and-large a


trial and error process, inefficient at least in the number of nucleotide replacements necessary to achieve a gene coding for a given functional protein structure. The philosophical concept upon which the maximum parsimony model is based was well summarized by Sneath (1974): "... without a general belief that evolution has followed the shortest pathways there is no constraint on the wildest of postulated pathways." The stochastic method has its operational methodology and mathematical structure embedded in statistics and the theory of Markov chain processes. The maximum parsimony method derives its operational methodology and mathematical structure from topology and set theory. The former is probabilistic in approach, the latter deterministic. Had one deliberately set out to construct such, it is difficult to conceive of two more dissimilar evolutionary viewpoints, whether considered philosophically or operationally. Yet we are now confronted with the observation that both approaches lead to essentially the same estimates of genetic divergence.

In our view, resolution of the paradox is a matter of the small versus the large. In the small, i.e. locally (in the near vicinity of any node), evolution is a sequence of unique events which can be accurately mapped if a sufficiently dense data base is known. These events are either selectively or probabilistically determined and most likely there are events of both types. A probabilistic process of genetic divergence has certain advantages that may have been selected for (Holmquist, 1975). In the large, i.e. globally, evolutionary trends dynamically conform to statistical laws (discussed by Kolata, 1975). Thus we need not be surprised that the distribution of nucleotide fixations in the maximum parsimony tree (which in the tree's dense portions can be taken as a reflection of what occurred in evolution) resembles the predicted result of a probabilistic process.
Protein evolution results from mutation pressure plus selection. The results of mutation pressure are best estimated in probabilistic terms, and in the global sense so are the results of selection. It is clear from the fossil record, e.g., that extinctions of animal species over geological time have the appearance of stochastic laws (Van Valen, 1973). Evolutionary theory leads to the same conclusion (Raup & Gould, 1974; Raup et al., 1973). Our findings suggest a similar pattern holds in protein evolution.

This research was supported by National Science Foundation grant GB36157 and National Aeronautics and Space Administration grant, The Chemistry of Living Systems, NGR06-003-460.

REFERENCES

Cloud, P. (1974). Amer. Sci. 62, 54-66.
Dayhoff, M. O., Eck, R. V. & Park, C. M. (1972a). In Atlas of Protein Sequence and Structure, Ch. 9, pp. 89-99, The National Biomedical Research Foundation, Silver Springs.
Dayhoff, M. O., Park, C. M. & McLaughlin, P. J. (1972b). In Atlas of Protein Sequence and Structure, Ch. 2, pp. 7-16, The National Biomedical Research Foundation, Silver Springs.
Dickerson, R. E. (1971). J. Mol. Evol. 1, 26-45.
Dickerson, R. E. & Timkovich, R. (1974). In The Enzymes: Oxidation-Reduction, Cytochrome c, part 2, Table IX, pp. 201-205, Academic Press, New York.
Ferguson-Miller, S., Brautigan, D. L., Chaviano, A. H. & Margoliash, E. (1976). Fed. Proc. Fed. Amer. Soc. Exp. Biol. In the press.
Fitch, W. M. (1971). Syst. Zool. 20, 406-416.
Fitch, W. M. & Farris, J. S. (1974). J. Mol. Evol. 3, 263-278.
Fitch, W. M. & Margoliash, E. (1967). Science, 155, 279-284.
Fitch, W. M. & Markowitz, E. (1970). Biochem. Genet. 4, 579-593.
Goodman, M., Moore, G. W., Barnabas, J. & Matsuda, G. (1974). J. Mol. Evol. 3, 1-48.
Goodman, M., Moore, G. W. & Matsuda, G. (1975). Nature (London), 253, 603-608.
Hartigan, J. A. (1973). Biometrics, 29, 53-65.
Holmquist, R. (1972a). J. Mol. Evol. 1, 115-149.
Holmquist, R. (1972b). J. Mol. Evol. 2, 10-16.
Holmquist, R. (1975). J. Mol. Evol. 6, 1-14.
Holmquist, R. (1976). In Molecular Anthropology (Goodman, M. & Tashian, R. E., eds), Random and Non-random Processes in the Molecular Evolution of Higher Organisms, Plenum Press, New York, in the press.
Holmquist, R., Cantor, C. R. & Jukes, T. H. (1972). J. Mol. Biol. 64, 145-161.
Holmquist, R., Jukes, T. H., Moise, H., Goodman, M. & Moore, G. W. (1976). J. Mol. Biol. 105, 39-74.
Jukes, T. H. (1963). Advan. Biol. Med. Phys. 9, 1-41.
Jukes, T. H. & Cantor, C. (1969). In Mammalian Protein Metabolism (Munro, H. N., ed.), vol. 3, pp. 21-132, Academic Press, New York.
Jukes, T. H. & Holmquist, R. (1972). J. Mol. Biol. 64, 163-179.
Kimura, M. (1969). Proc. Nat. Acad. Sci., U.S.A. 63, 1181-1186.
Knoll, A. H. & Barghoorn, E. S. (1975). Science, 190, 52-54.
Kolata, G. B. (1975). Science, 189, 984-985.
Margoliash, E. & Smith, E. L. (1965). In Evolving Genes and Proteins (Bryson, V. & Vogel, H. J., eds), pp. 221-242, Academic Press, New York.
McLaughlin, P. J. & Dayhoff, M. O. (1972). Atlas of Protein Sequence and Structure, vol. 5, Figure 6-1.
Moore, G. W. (1976). J. Theor. Biol. In the press.
Moore, G. W., Barnabas, J. & Goodman, M. (1973). J. Theoret. Biol. 38, 459-485.
Raup, D. & Gould, S. (1974). Syst. Zool. 23, 305-322.
Raup, D., Gould, S., Schopf, T. & Simberloff, D. (1973). J. Geol. 81, 525-542.
Romer, A. S. (1966). Vertebrate Paleontology, University of Chicago Press, Chicago.
Salser, W., Bowen, S., Browne, D., El Adli, F., Fedoroff, N., Fry, K., Heindell, H., Paddock, G., Poon, R., Wallace, B. & Whitcome, P. (1976). Fed. Proc. Fed. Amer. Soc. Exp. Biol. 35, 23-35.
Sankoff, D. (1973). Publication Centre de Recherches Mathématiques Tech. Rep. no. 262, Université de Montréal.
Schopf, J. W., Haugh, B. N., Molnar, R. E. & Satterthwait, D. F. (1973). J. Paleontol. 47, 1-9.
Sneath, P. H. A. (1974). Symp. Soc. Gen. Microbiol. 24, 1-39.
Van Valen, L. (1973). Evol. Theory, 1, 1-30.
Young, J. Z. (1962). The Life of Vertebrates, University of Chicago Press, Chicago.
Zuckerkandl, E. & Pauling, L. (1965). In Evolving Genes and Proteins (Bryson, V. & Vogel, H. J., eds), pp. 97-166, Academic Press, New York.