Modeling nucleotide evolution: A heterogeneous rate analysis

Modeling nucleotide evolution: A heterogeneous rate analysis

ELSEVIER Modeling Nucleotide Evolution: A Heterogeneous Rate Analysis COLLEEN KELLY Department of Computer Science and Statistics, University of Rhod...

1MB Sizes 103 Downloads 66 Views

ELSEVIER

Modeling Nucleotide Evolution: A Heterogeneous Rate Analysis COLLEEN KELLY Department of Computer Science and Statistics, University of Rhode Island, Kingston, Rhode Island 02881 AND

JOHN RICE Department of Statistics, Unioersityof California at Berkeley, Berkeley, California 94720

Received 3 September 1993; revised 20 February 1995

ABSTRACT A new model of molecular evolution is introduced that allows for heterogeneous rates across the sequence positions. The development of this model was motivated by two issues: first, a number of studies have shown that the positions in a DNA sequence evolve at different rates, and second, it has been shown that not accounting for this heterogeneity can lead to biased estimates of evolutionary parameters. The authors generalize the Markovian model of molecular evolution to allow for heterogeneous rates and explore some of the consequences of such a model. In particular, they quantify the biases incurred by incorrectly assuming an equal-rate model and consider what can be learned about evolutionary parameters under a heterogeneous model.

1.

INTRODUCTION

The Markovian model of molecular evolution was first proposed in 1969 by Jukes and Cantor [1]. Their original model was a very simple one and has since been generalized in a n u m b e r of different ways by different researchers. Although these generalizations model reality more closely, the increase in model complexity has the drawbacks of computational difficulties and identifiability problems. The ideal model, of course, balances simplicity with predictive capabilities. We argue for the need for a heterogeneous rate model of molecular evolution by discussing the ubiquity of heterogeneous rates and assessing the biases encountered when incorrectly assuming equal rates. It is widely recognized that substitution rates are not equal for all positions in a nucleotide sequence. In protein coding regions, the

MATHEMATICAL BIOSCIENCES 133:85-109 (1996) © Elsevier Science Inc., 1996 0025-5564/96/$15.00 655 Avenue of the Americas, New York, NY 10010 SSDI 0025-5564(95)00083-P

86

COLLEEN KELLY AND JOHN RICE

degeneracy of the genetic code causes the third codon position to have much higher substitution rates than the first and second positions. This problem has been dealt with in many analyses by considering only third-position sites or only first and second positions. There is certainly a loss of information in this method; nevertheless, it might be adequate if structural and functional constraints on the molecules did not also cause unequal rates of substitution. Miyata [2] discusses at length various causes of unequal substitution rates. Fitch and coworkers [3-5] first brought attention to the estimation problems caused by ignoring heterogeneous rates. They proposed a model in which some positions do not evolve (the invariable sites) and the remaining positions evolve at the same rate. They then illustrated through simulation that if a sequence evolves according to this two-state model and branch lengths are estimated with an equal-rate model, the estimates will be negatively biased. The cause of this underestimation is intuitively clear because the equal-rate model averages the observed substitutions over more sites than are actually evolving. To correct this problem, Fitch and others [3, 6-10] developed methods to estimate the proportion of invariable sites. Biological considerations lead us to believe that there may be a continuum of substitution rates rather than just the two (or possibly three) rates proposed by Fitch. Other researchers [11-17] have fit continuous distributions, such as the gamma and log-normal. Wakeley [17] argues for the use of a continuous rate distribution. Our model views the substitution rate as a continuous nonnegative random variable with some distribution. The rate of evolution in each sequence position is a realization of this random variable. In effect, evolution is allowed to speed up or slow down in certain positions. In Section 2, we introduce this model and then describe its application to data from present-day species whose evolutionary relationships are described by a known tree structure. In Section 3, we consider the inferential problems in ignoring heterogeneous rates. Section 4 proposes a likelihood ratio test for homogeneous rates. Our general model allows for an analysis of the distribution of rates. We can choose to make no assumptions about the form of the distribution and determine bounds for the mean and variance of rates, or we can assume some form such as the gamma or the log-normal and estimate the parameters of those distributions and hence their moments by maximum likelihood. Section 5 discusses these two alternatives and how to obtain posterior mean rates for each position in the sequence. Finally, in Section 6 we draw conclusions and discuss areas in need of further research.

MODELING NUCLEOTIDE EVOLUTION 2.

87

M O D E L I N G H E T E R O G E N E O U S RATES

Our heterogeneous rate model is the following generalization of the usual equal-rate Markovian model of nucleotide evolution (see Lanave [18], Felsenstein [19-21], or Tavare [22, 23]). Let Xi(t) ~ {A, G, C, or T} represent the nucleotide state in the ith position of a D N A sequence of length n at time t. The usual equal-rate model assumes that {Xi(t), i = 1..... n} are independent, stationary, and continuous Markov processes with equal transition probabilities. Under these assumptions, the transition matrix for each position may be written as P ( t ) = e Qt,

(1)

where Q is a 4 × 4 rate matrix. Our heterogeneous rate model views time, or equivalently the rate of evolution in the ith position, as a stochastic process T(t). Allowing T(t) to be a general stochastic process results in such complexity for general branching trees (see below) that we will consider only the simplest such process: T(t)= At, where A is a nonnegative random variable with cumulative distribution function F(;t). We assume that A has a moment-generating function ~(x) = E(exp(Ax)). The rates of evolution of the nucleotides at different positions are modeled by different realizations of the random variable A; thus the rate of evolution is allowed to be different at different positions. Then, conditional on A = Ai, the ith position has the transition matrix P~°(/I A = Ai) = e °a''.

(2)

To normalize the terms in the exponent we assume that E ( A ) = 1 and that Q is normalized so that 4

~, Iri,Qk k = - 1,

(3)

k=l

where ~'k are the stationary probabilities. Note that this normalization has the effect of making t equal to the evolutionary parameter k(t), the mean number of substitutions per site in time t, which is usually defined as 4

k ( t ) = - t ~_, rrkQkk. k=l

We next consider the consequences of this model for the tree-structured stochastic process of evolution. First assume that we have the simplest tree topology for s species: a star phylogeny (Figure 1). In this

COLLEEN KELLY AND JOHN RICE

88

81

82

83

84

FIG. 1. A four-species star phylogeny.

case, there is only one unobserved ancestor and all branch lengths are equal (t 1 = t 2 . . . . . t s = t) when the molecular clock hypothesis holds. Suppose that the s aligned sequences are written out in an s × n matrix N = (n (1),n(2),..., n(n)), so that n (i), the ith column of N, is an s × 1 vector of nucleotides in the ith aligned position. Let Pk be the probability of base k in the unobserved ancestor. Then 4

e(n(i)) = ~

(i) (i) PkP~N,,(t)P~N2,(t)

. . .

(i) P~Ns,(t) •

k=l

Here Pj~)(t) is the j k t h entry of the matrix P(i)(t), and Nsi is the observed nucleotide in species s in position i. Suppose that Q has a spectral decomposition with eigenvalues o), 4

Qk, = ~-, R(k~)°) •

(4)

y=l

Then the conditional probability, P(n(/) I A = / ~ i ) , may be written as 4

4

E

E

k=l jl=l

4

"'" E l~k*'kNli*'kN2i " "
x e x p { ( o ) + o).2+--. + o),) Ait}. Now the unconditional probability is

P(n (i)) fP(n") l~) dF(X) = EA(P(n(O lA)). --

(5)

MODELING NUCLEOTIDE EVOLUTION

89

Since

o~1 +

E(exp[(

O~2 + "'" Jr

o)~)At])=

(p(( o)1

+

O~.2 Jr "'" Jr

O'js)t),

we obtain

P(n(i)) =

4

4

E

E

4 "'" E

k~l jl=l

n o ( J , ) o(J2) . . . o ( J , ) + + ... + YkJlkNll'lkN2i llkNsi~Fl~ ~ Jl ~2

Js = 1

4

=

Cj,h...j,(n(')) E Jl,J2 ..... Js = 1

(p((o),

+ o~ + --- + o),)t).

(6)

Here Cjd2...,.(n ~i)) is a constant that depends on JpJ2 . . . . . Js, the eigenvectors o f Q ' a n d n ~i).

The stochastic structure for branching trees is more complex but again involves linear combinations of the moment-generating function evaluated at various points. The complexity is due to the fact that there are more unobserved ancestors and different branch lengths (see Figure 2). The probability for a branching tree depends on the particular tree topology and is thus difficult to write down in a general form. It is nevertheless easy to compute for a specific topology. For the tree depicted in Figure 2, by an argument like that above, e(n(i)) =

4 ~

(i)

(i)

(i)

pkle~,Nu(to)P~lk2(t3)P~2N2~(tl)

~ p(i) ¢t k2k3~, 41

kpk2,k 3 = I

X p(i)k3N~Att2)P(ki3)N4~(ts) 4

4

---

PklR klNli R k2N2i R k3N3i R klk3 R ~3N41 k l , k 2 , k 3ffi 1 Jo ..... Js = 1

X ~p(~-0to + o).t, + -'- + 0)5t5) 4

=

~,,

Cjo...js(n(i))q~(o)oto+O)t]+...+o)/5),

Jo, h ..... Js = 1

S1

$2

$3

84

FIG. 2. A four-species branching tree.

(7)

COLLEEN KELLY AND JOHN RICE

90 where 4

cj0j,


=

E

p

kl

R(Jo ) R ( A ) R(J2) R(J3) R(J4) R ( J s ) . kiN1 i kiN2 i k3N3i klk2 klk3 k3N4 i

(8)

kl,k2,k 3 ~ 1

In (7) we can see the complexity that would arise if a more general stochastic process T ( t ) were used. There, in place of ~( ) we would have E(exp[Ek o)kT(tk)]), the joint moment-generating function of T at several points. We note that if Pk = 7rk, then the orthogonality of left and right eigenvectors of Q will typically result in several of the coefficients being zero. We see from these expressions that the information about the branch lengths t i is entirely contained in the moment-generating function of A evaluated at various linear combinations of the ti, the coefficients of the linear combinations being the eigenvalues of Q. We also note that the distribution function of A enters into the distribution of the observable statistics N only through its moment-generating function evaluated at a finite number of points (which are unknown if the branch lengths are unknown). Clearly, problems of identifiability can arise. 3.

INFERENTIAL PROBLEMS

Having developed a model for heterogeneous rates, we can now consider how heterogeneous rates affect inference in evolutionary analyses. We first consider the effect of heterogeneous rates on estimates of evolutionary distance. Upholt [24] first suggested that rate heterogeneity could bias estimates of evolutionary distance. Fitch later [4] quantified this bias with his nomographic simulations assuming a two-state distribution for the rates and a Kimura [25] two-parameter substitution rate matrix. Takahata [15] calculated the bias in k(2t) assuming a gamma distribution for the rates and either a Jukes-Cantor [1] or a Kimura two-parameter rate matrix. Both Fitch and Takahata found that their measures are underestimated in an equal-rate model and that the bias increases nonlinearly with time. We now consider a number of measures of evolutionary distance. We show that each is a weighted average of eigenvalues of the log transition matrix and thus each measure is underestimated in a homogeneous rate model. The following measures of the evolutionary distance between two species have been proposed for homogeneous rate models. The average number of substitutions per site in time 2t, k(2t), is probably the most commonly used measure. Note that if t is the divergence time of each

91

MODELING NUCLEOTIDE EVOLUTION

species from the common ancestor, then 2t is the time separating the two species. Using the spectral decomposition of the rate matrix, Q = S ~ S -~, we can write k(2t) as the following weighted average of the eigenvalues of log[P(2t)] 4

4

"triQii= -2t ~_, ~riSik(S-X)kitrk.

k(2t) = - 2 t • i=1

i,k=l

When the evolutionary process is reversible, 4

k(2t)=-2t

~_,

(T£iaik)20"k,

i,k=l

so that the coefficients in the weighted average are all positive. The formulas for k(2t) given by Jukes and Cantor [1], k(2t) = - 3 l o g [ 1 - 4

+ v)],

and Kimura [25], k(2t) = - l l o g [ ( 1 - 2s - v ) ( ~ - Z - ~ ) ] , are special cases when Q is a specific parametric form. Above s is the probability that sites in the two species differ by a transition and v is the probability that sites differ by a transversion. Lanave et al. [18] suggest the following method to estimate time since divergence in pairs of species. Let D be a diagonal matrix containing the stationary base probabilities. Let L be a 4 × 4 matrix with entries Li,, the number of positions in the aligned sequences with base pair (i,j~. Assuming that the evolutionary process is reversible and at equilibrium, ratios of the log eigenvalues of the matrix M = --1 D - 1 / 2 L D -

1/2

n

are used as estimates of time since divergence. As the expected value of in a reversible model at equilibrium is DP(2t), the expected value of M is

L/n

E(M)

=

D1/2p( 2t )D -1/2.

Thus the log eigenvalues of E(M) are equal to the eigenvalues of the matrix log[P(2t)].

92

COLLEEN KELLY AND JOHN RICE

Barry and Hartigan [26] suggested a slightly different measure of evolutionary distance, d = - llog[det(R)], where R is the matrix of nucleotide substitution probabilities between two species. They suggest estimating this distance using R = D - 1 L / n . If evolution is reversible, then R = P(2t), and then this measure is an average of the eigenvalues of log[P(2t)]. Let Ai(R) be the ith eigenvalue of R; then 1

d=-~

4

E Iog[Ai(R)] i=1

1 4 = - ~ - Y'. log[ X,(P(2t))] i=1

1 4 = - ~ E 2~t. i=i Note that when the stationary frequencies are all 1/4, this measure is equal to k(2t). 3.1. BIASES IN MEASURES OF EVOLUTIONARY DISTANCE In the heterogeneous rate model, measures of evolutionary distance should be based on the eigenvalues of the expected value of the log of the conditional transition matrix, E(log[P(2t I A)] ) = E ( 2 Q A t ) = 2Qt, not on the log of the expected transition matrix, log[E(P(2t[A))]. Assuming a homogeneous rate model amounts to basing measures on the eigenvalues of the latter matrix, because in this model an estimate of P(2t) is obtained by averaging the transition matrices, P(i)(t), i = 1 ..... n, across the positions. If sites are evolving at different rates, then the averaged transition matrix has an expected value of P ( 2 t ) = EA(P(2t I A)) = S # ( 2 t ) S -1, where 0 ( 2 0 is a diagonal matrix with elements {~p(2trkt), k = 1..... 4}, and S and S -1 are matrices of the right and left eigenvectors of P(2t [ A), respectively. The eigenvalues of the matrix log[P(2t)] are then k = 1 ..... 4).

MODELING NUCLEOTIDE EVOLUTION

93

These eigenvalues can differ significantly from the desired eigenvalues, 2trkt, depending on t and the distribution of the rates, as shown below. Assuming a homogeneous rate model results in underestimating measures of evolutionary distance. Jensen's inequality implies that the difference in the eigenvalues is positive: log[ ~(2trkt) ] -- 2trkt = log(EA[ e 2~ktA]) - 2trkt >/E A[log( e 2~ktA) ] _ 2 trg t = EA [2 o-ktA ] --2trkt =0.

The last term is zero because the expected value of the rates is 1. In fact, the eigenvalues in the two models are equal if and only if the rates are homogeneous. Note that if A is a constant, then we certainly have equality above, and if there is equality, then EA[e 2~k'A] = exp(2o'~t), so A = 1 almost surely. In a reversible homogeneous rate model, estimates of evolutionary distance are positive weighted averages of the eigenvalues - log[ ~(2 trkt)] and thus are negatively biased: Bias=-

E C k l o g [ ~(2~rkt)] + ECk2trkt <~0, k

k

where C k > 0 for k = 1.... ,4. A Maclaurin series expansion gives an approximation for the difference in the eigenvalues of the two models. Let the cumulants of A be Ky,j = 1,2 ..... and recall that the first cumulant, E(A), is 1; then

l°g[ ~°(2trkt)] -- 2 °'kt = •

Ki( 2trkt ) i i!

--2trkt

i=1

=Eoo

Ki(2trkt)i

i=2

= Var(A)

(2trkt) 2 2

-

+

....

This expression suggests that the bias increases with increasing time and with increasing rate variability. The first relationship can be seen from

94

COLLEEN KELLY AND JOHN RICE

the mean value theorem. Since A is a nonnegative random variable, and since for negative x, exp(Ax) is a decreasing function of A, it can be seen that A and exp(Ax) are negatively correlated, so that E ( A e x p ( A x ) ) ~
~o'(x) ~< ~o(x)

for x < 0.

Now consider the time points t, t + h, and t < ~ < t + h. Then the mean value theorem implies that log[ ~o(2trk(t + h))] - l o g [ ~o(2trkt)] 2trk h

~o'(2tr k ~:) ~o(2O.k~j) ~<1.

Rearranging the above expression shows that the bias is larger for larger time points:

log[~o(2trk(t+h))]--2trk(t+h)>>-log[~(2trkt)]--2trkt

Vk.

The following example shows bias increasing with increasing rate variability. Consider two random variables A and 3{ = YA, where Y is a random variable independent of A with mean 1. Both A and 3{ have mean 1, but the variance of 3{ is larger because Var(YA) = Var(Y) Var(A) + Var(Y) + Var(A) >/Var(A). Jensen's inequality shows that the moment-generating function of 3{ evaluated at x, qJ(x), is larger than the moment-generating function of A evaluated at x, qffx):

qJ( x) = E(exp( YAx) ) = Ev(EA(exp(YAx) IY)) = e,,(9(Yx))

This implies that the bias is larger for X than for A: log[ qJ(2crkt)]--2o'kt>~log[~o(2trkt)]--2trkt

Vk.

MODELING NUCLEOTIDE EVOLUTION

95

3.2. OTHER PROBLEMS Incorrectly assuming homogeneous rates not only biases estimates of evolutionary distance but may also cause inferential problems in tree reconstruction. Olsen [16] discusses problems in the distance matrix method proposed by Fitch and Margolish [3] due to biased estimates of distance. Sankoff [10] and Cavender and Felsenstein [27] show that their quadratic invariants may not choose the correct tree topology when rates have a two-state distribution and this variability is ignored. Churchill et al. [8] have shown that the effect of heterogeneous rates on likelihood methods is to overestimate the confidence one puts in choosing one particular tree toplogy over another. 4.

TESTING FOR HETEROGENEOUS RATES

In Section 3 we discussed problems in using a homogeneous rate model when the sequence positions evolve at different rates. Because heterogeneous rates can have profound effects on inference in evolutionary analyses, it is important to test the assumption of homogeneous rates before choosing an evolutionary model. Here we propose a parametric bootstrap hypothesis test that is based on a statistic very similar to the likelihood ratio statistic. First we will consider the calculation of the likelihood ratio statistic for testing heterogeneity. The null and alternative hypotheses are H0:

Al ~ -

H~ :

Ai's are not all equal,

A2 . . . . .

An =

A

and

where Ai is the relative rate of evolution in the ith position of the sequence. The likelihood functions under these hypotheses may be calculated, but they are difficult to write down in a general form. We thus illustrate the test for the particular tree topology depicted in Figure 2. Of course, the test can be applied to other tree topologies. Using the notation of Section 2 and assuming that the positions are independent, the likelihood function under the alternative hypothesis is the product of linear combinations of exponential terms: L(Q, A1

.....

An,t O. . . . .

tsIN )

= f i L ( Q , A 1..... An,t0 ..... tsIn 'i)) i=1 n

= I-I

4

Y[

i = 1 J0 . . . . . J5

cj,,...j~(n~i')exp[(%oto + %,tl + "'" + o)st5)Ai]- (9)

96

COLLEEN KELLY AND JOHN RICE

Under the null hypothesis, the Ai's are all equal to 1, because the model is normalized with E[A] -- 1, n

L ( Q , t 0 ..... t 5 IN) = I-I

4

~-, Cjo...y,(n(i))exp[~oto

+ o~,tl + "'" + o'yst5 ].

i = 1 J0 . . . . . J5

(i0) Notice that the maximization of the likelihood functions involves the estimation of the rate matrix Q, which may have up to 12 independent parameters. Assuming a parametric form such as those proposed by Jukes and Cantor [1], Kimura [25], or Felsenstein [19] will reduce this number of parameters and simplify the maximization. The maximization under the alternative hypothesis is especially complex because it involves the estimation of one rate parameter, Ai, for each position in the sequence, the branch lengths to,...,ts, and the parameters in the rate matrix. To simplify this calculation, we propose to use the null hypothe• estimates of the branch lengths t^0 sls o..... t^0 5 and the rate matrix 6 ° as known quantities in the likelihood (9) and then maximize for the rates {Ai, i = 1,...,n}, subject to the condition that the average rate (A) is 1. We then construct the following test statistic to test the hypothesis of equal rates: max L ( Q , t o ..... tsIN ) A*(N) =

Q, t 0 , . . . , t 5

max

^o I N) " L(t)°, A1..... An,t0 ..... t5

A1,... ,An;A = 1

Certainly, small values of A*(N) are evidence against the null hypothesis, but the distribution of A*(N) is unknown. We thus suggest using a parametric bootstrap to estimate the null distribution of this statistic. Notice that there are only 4 s possible patterns of nucleotides observed in a position of aligned sequences from s species. When H 0 is true, the probabilities of each of these patterns may be calculated as in Equation (5) using Ai = 1, i = 1..... n, when Q and the branch lengths are known. Suppose that p is a 1 × 4 s vector of these probabilities. The observed number of positions that have each particular pattern is then multinomially distributed with parameters n and p, where n is the length of the sequences. The parametric maximum likelihood estimate of the distribution of the data (under H0) is multinomial with parameters n and/3, where ~ is calculated using the null hypothesis estimates of A( = 1), Q, and t'o..... t'5.

MODELING NUCLEOTIDE EVOLUTION

97

As described in Efron [28], parametric bootstrap samples may be generated by sampling from the parametric maximum likelihood estimate of the distribution of the data. For each bootstrap sample, the statistic (4) may be calculated, and in this way the distribution of the statistic is approximated. Alternatively, one could test how well a homogeneous rate Markov model fits the data with the test proposed by Goldman [32]. This is not the same test as proposed here because the alternative hypothesis in Goldman's test is more general:

HA: P(site j exhibits pattern k) =Pk

f o r j = l ..... n; k = l ..... 4 s,

where Pk are unconstrained except that they are positive, less than 1, and sum to 1. Notice that our alternative (4) assumes a Markov model, a particular tree structure, the same rate matrix for all sites, and the same relative branch lengths in the tree. Our test specifically tests for heterogeneous rates rather than being a goodness-of-fit test for the Markovian model with homogeneous rates. An advantage of the goodness-of-fit test is its relative ease to compute: the probabilities for each possible pattern (Pk,k----1 ..... 4 s) are computed by equating expectations to observations. 5.

ESTIMATION

In Section 2, we discussed the calculation of the likelihood function in a heterogeneous rate model. For any particular tree topology, this likelihood may be maximized in order to obtain estimates of the various evolutionary parameters. A difficulty is that even in a small tree, there are a large number of parameters to estimate. Again, the number of parameters may be reduced by assuming a parametric form for the rate matrix, such as those proposed by Jukes and Cantor [1], Kimura [25], or Felsenstein [20]. Assuming a distribution for the rates such as a gamma or a log-normal will also significantly reduce the number of parameters to be estimated. When a distribution is assumed, the moment-generating function can be rewritten as a function of the distributional parameters, the branch lengths, and the rate matrix parameters, and then these parameters may be estimated by maximum likelihood. These calculations are discussed in Section 5.2. When no particular distribution is assumed for the rates, estimation is a much more difficult problem. In fact, the branch lengths of the tree and the variance of the rates cannot be directly estimated. We begin this section with a discussion of what information about the evolutionary parameters can be

98

COLLEEN KELLY AND JOHN RICE

obtained from the data when a distribution is not assumed for the rates. Throughout this section we assume a parametric form for the rate matrix Q that has known eigenvalues and eigenvectors. 5.1. UNKNOWN DISTRIBUTIONAL FORM Equation (5) illustrates that information about branch lengths and the distribution of the rates enters the likelihood function only through the moment-generating function evaluated at a finite number of values. When the eigenvectors of Q are known, several of the points ~°(%0t0 + 17/1tl+

"'"

+ o~5t5),

Jo,Jl ..... J5 = 1,2,3,4,

can be estimated by maximum likelihood. In Section 2 we noted that when the process is at equilibrium, the orthogonality of the eigenvectors and the stationary probabilities results in certain coefficients being zero. The moment~enerating function evaluated at these values cannot be estimated. When both the moment-generating function and the branch lengths are unknown, the estimated values provide only a limited amount of information. We first consider inference using a set of estimated values with unknown but equal branch lengths. For example, in a four-species star phylogeny with a molecular clock, we will consider estimates of the following values of the moment-generating function:

~ (( O~'1+ t~/'2 + Oj3 "1- Oj'4) t),

jl,J2,J3,J4=l,2,3,4.

In the branching tree presented in Section 2, assuming a molecular clock, we consider estimates of the following subsets of values: {~(0), ~o(2trito),i = 2,3,4}, {~(0), ~(2tritl),i = 2,3,4}, {~(0), ~(2trit2),i = 2,3,4}.

Lower Bounds on k(t) and the Variance. In this section, we consider the random variable A ( t ) = At with mean t, variance t 2 Var(A), and moment-generating function ~t(x)-- ~(tx). Suppose we have estimated ~t(x) at r known values {x i <~0: i = 1..... r}; then we can approximate ~t(x) with an interpolating polynomial to obtain some information about the distribution of rates. Let Q(x) be Newton's interpolating polynomial; then

~Ot(X) =

Q(X)-{-

~o(tr)r(l~) fi (X-- Xk) , k=l

(11)

99

MODELING NUCLEOTIDE EVOLUTION

where ~ ~ int(x, xl, x 2..... xr), and Q(x) is the (r - 1)th-degree polynomial with the kth divided difference, ~t[xi,,xi2 ..... xik.l], as the coefficient of x k,

Q( x) = ~ot(xl) + ~t[ xl,x21( x - xl) + ...

r-1

+

t[xl,x2 ..... xr] FI ( x - x k ) .

(12)

k=l

Recall that the first divided difference is just the slope of two values, and the kth divided difference is defined recursively by

~t[Xi l~xi 2'''''Xi k+l]'~-

~t[ Xi2 . . . . . Xik+l ] -- ~Dt[Xil,Xi 2..... Xik] Xik+l-- Xil

The kth divided difference may be thought of as an approximation of the kth derivative of ¢t(x). In fact, a consequence of the mean value theorem is that

~O'[Xil'Xi2 ..... Xik+l] =

k!

(13)

for ~ ~int(xl,...,Xk+l). Since A(t) is a positive random variable, all derivatives of ~t(x) are positive for all values of x, so all divided differences of ~t are positive. Once we have estimated Q(x), the interpolating polynomial of the moment-generating function ~t(x), we can use this approximation to estimate bounds on the mean and variance of A(t). Differentiating both sides of Equation (12), evaluating at x = 0, and then considering the error term gives t = E [ A ( t ) ] = q~(0) >/Q'(0), so that Q'(0) is a lower bound on t. We actually can get the following more general result.

THEOREM 1 If ~t(x) is a moment-generating function of a positive random variable and Q(x) is its ( r - 1)th degree interpolating polynomial at the points XI~X2,...~Xr~ then Q~k)(x) <~~ k ) ( x ) ,

x >t max(x 1, x 2 .... , Xr); k = 1,2 . . . . .

100

COLLEEN KELLY AND JOHN RICE

The proof is by induction and is left to the Appendix. This result gives lower bounds on all the moments of ~t. The following theorem shows that Q"(0)-[Q'(0)] 2 is a lower bound on the variance.

THEOREM 2 For distributions with t ~ ~,~ 2( - x f 1), the approximation of the variance of A(t) with the interpolating polynomial is a lower bound, that is, V a r [ A ( t ) ] >i Q"(0) - [Q'(0)] 2. Again, the proof is left to the Appendix. The condition in Theorem 2 is met quite easily is most data sets. Recall that we parameterized the rate matrix Q so that t would equal k(t), the mean number of substitutions per site; the condition in Theorem 2 is then k ( t ) < Y'-~=2( - xj71). If we have a three-species star phylogeny and are using a Jukes-Cantor rate matrix, this is satisfied when k ( t ) < - 1 / 2 o . + ( - 1/3o.) = 5 / 8 (o. = - 4 / 3 in the Jukes-Cantor matrix). If this condition is not satisfied, more species will need to be added in the analysis. For a six-species star phylogeny and a Jukes-Cantor rate matrix, the condition is satisfied when k ( t ) < - l / 2 o . + ( - 1 / 3 o . ) + ( 1 / 4 o . ) + (-1/5o')+(-1/6o')=1.0875. Since the series Y-(I/n)diverges, we can make E ( - 1/xi) large enough for any k(t) given enough species. In practicality, any data set with k(t)> 1 would be difficult to align and would probably not be useful for this type of analysis.

Lower Bounds on Tail Probabilities. Markov's inequality gives us some more information about the distribution of rates across sites: P(At<~a)<~e-"X~ot(Xk),

k = l ..... r.

This gives r inequalities that combine into one inequality for the tail probability of At:

P ( A t >/a)/> 1 -

min e-"Xk~ot(xk).

l~k~r

(14)

Bounds on Branch Lengths. In a branching tree, bounds can be obtained for the branch lengths and rate variance. Here we describe how these bounds are obtained for the particular tree topology depicted in Figure 2. Bounds for other topologies are obtained in an analogous

101

MODELING NUCLEOTIDE EVOLUTION

way. We will assume a molecular clock so that in Figure 2, t 2 = t s, t s = t 4 + ts, t o = t 3 + t 4 + t 5. Then estimates of at least the following values of the moment-generating function may be obtained: ~o(0), ~o(2tocri) , ~o(2tltri), ~p(2tzo'i), ~o((2t o + q~((2tl + t2)tri),

~o((2to+tl+t2)tri),

tl)cri),

~o((2t o +

tz)cri),

i = 2,3,4.

The first eigenvalue of the rate matrix, o'1, is zero, and the remaining three are negative. In the J u k e s - C a n t o r rate matrix, these three eigenvalues are - 4 / 3 ; in other parametric forms there may be two or three distinct nonzero eigenvalues. Depending on the rate matrix used, additional values of the moment-generating function may also be estimated. Fitting interpolating polynomials Qo(x), Ql(x), and Q2(x) to the sets of points {~o(0),

~o(2crito), i =

2,3,4},

{~o(0), ~o(2trit,), i = 2,3,4}, {~o(0), ~o(2tr~t2), i = 2,3,4}, respectively, yields the following lower bounds on the branch lengths and rate variances:

to > / Q '00( ) ,

to2 V a r ( A ) >f Q'~(0) - [Q'o(0)] 2,

tl >t Q'1(O),

tx2 V a r ( A ) >/Q'~(0) - [ Q'1(0)] z,

' 0 ), t2 >~Q2(

t22 V a r ( A ) >/Q'~(0) - [Q~(0)] 2.

The accuracy of the bounds depends on the number of points used in the fit of the interpolating polynomial. For the Jukes-Cantor rate matrix, there are only two points with which to fit a line, and thus a rough lower bound for the branch lengths will be obtained. A lower bound for the rate variance cannot be obtained in this case. When the rate matrix has more distinct eigenvalues, better lower bounds are obtained for the branch lengths and a lower bound on the rate variance may also be obtained. O f course, the bounds are subject to estimation error.

102

COLLEEN KELLY AND JOHN RICE

Bounds on the interior branch lengths may be obtained in the following way. For ease of notation, rewrite the points as Y0 --- q~(0) = 1,

Yil = q~(2°it2),

Yi2 -----~(2Ori/1),

Yi3 = q~(2oit0),

Yi4 -----~P((2to + t2) °'i),

Yi5 = q~((2/0 + tl)tri),

Yi6 = q~((2tl + t2)°'i),

Yi7 = q~((2t0 + tl + t2) O'i),

i = 2,3,4.

We use the tree topology to determine the order relationships of the branch lengths--that is t o > t a > t 2 > 0 - - a n d then these inequalities imply the positivity of certain kth divided differences. Consider the points Yis, Yi3, Yi2, Y0; then the positivity of the first divided differences imply Yi5 < Yi3 < Yi2 < Yo = 1.

The positivity of the second divided differences imply 1--Yi2 Yi2 --Yi3 _2tritl > _2tri(to_ta

Yia--Yi5 ) > trit-----_ ~.

Rearranging these inequalities gives the following bounds on the relative branch length t a / t o : 2Yi3--2yi5 tl 1--Yi2 Yi2 - - 2 y i s + Y i 3 <~-i~o<~ 1--Yi3

for i = 2 , 3 , 4 .

These three inequalities may be combined into one: t1 1 -- Yi2 2Yi3 -- 2yi5 max ~<~-o ~ min i 1 - Yi3 i Yi2 --2yi5 + Yi3

Similarly, we can obtain the following upper and lower bounds for the relative branch length t 2 I t 1 : [ 2yi3 -- 2yi4 Yi5 -- Yi7 2yi2 -- 2Yi6 ~ t2 maxi m a x I Yil -- Yi2 + 2yi3 --2Yi4 ' Yi4 -- Yi7 ' Yil + Yi2 --2Yi6 ) <~ t l

and tz

m!nmin(1-Yil

t 1 <<"

Yi3--Yi4) 1 - Yi2 ' Yi3 - Yi5 "

MODELING NUCLEOTIDE EVOLUTION

103

Recall that all the points of the moment-generating function are estimated from the data, and thus the bounds obtained in this way are not exact. 5.2. KNOWN DISTRIBUTIONAL FORM Assuming a distributional form for A allows for the estimation of the parameters of that distribution by maximum likelihood. Suppose 0 is a vector of the parameters of the assumed distribution and ~o(x I 0) is the moment-generating function of the distribution. Recall that we have constrained A to have an expectation of 1, so there are some contraints on 0. Let q be a vector of the parameters in the rate matrix Q, and t a vector of the branch lengths in the tree. Then the likelihood of any particular tree topology may be written as a function of 0, t, and q. For example, the likelihood of the branching tree presented in Section 2 (Figure 2) may be written [from Eq. (7)] as 4

L ( O , t , q l N) = I~I

E

Cyo...ys(n(/),q)

i = 1 Jo,Jl,'",J5 = 1

× q~(o).0t0 + O~.lt1 + "'" + o~5t5 I 0),

(15)

assuming the independence of sites in the sequence. This likelihood may be maximized to obtain maximum likelihood estimates of q, t, and 0. Recall that the rate matrix is parametrized so that t = k(t), and thus the estimate i actually estimates the average number of substitutions per site in the branches. Asymptotic standard errors for these estimates may be obtained via Fisher's information matrix. An estimate of the rate variance may be calculated because the variance is a function of 0. These calculations are illustrated for different assumed distributions by Chen and Kelly [29], who discuss assessing the goodness of fit of the rate distributions. Posterior Mean Rates. Once we have estimated the parameters of the distribution 0, we can calculate a posterior mean rate for each position of the sequence. The posterior mean rates can be plotted as a function of the position in the sequence to give a picture of the variability of the sites. This type of plot can be used to look for regions or high variability (hot spots) or regions that are highly conserved. From Bayes' rule, the posterior probability that the rate in position i is hi, given that we have observed the vector n (i) at that position is P(i

= A i I n (i)) = p ( n ( i ) I i

--- A i ) P ( i

p(n(i) )

-- Ai)

(16)

The first term in the numerator of the right-hand side of Equation (16), P(n°)JA = Ai) , may be calculated for any specific tree topology as

104

COLLEEN KELLY AND JOHN RICE

described in Section 2, and the denominator, P(n°)), is the expected value with respect to A of this quantity [see Eqs. (5) and (6) for the four-species star phylogeny case]. The posterior mean of A is a reasonable summary, E [ A i nO)] = f~ AP(n°) I A)f(A) dA. f~ P(n°) I A)f(A) dA '

(17)

here f(A) is the density function of A. 6.

CONCLUSIONS AND OPEN QUESTIONS

The existence of unequal rates of evolution in DNA sequences has long been acknowledged. Using a model that makes an assumption of equal rates may cause a number of problems in evolutionary analyses [4, 5, 8, 10-2, 15, 27, 30]. Most researchers have used a simple two-state model to study inferential problems caused by heterogeneous rates. Fitch and coworkers [3, 5] first used this model to show that the biases caused by ignoring heterogeneous rates may be significant. Since then, a number of methods have been proposed to estimate the proportion of invariable sites in a sequence [3, 6-10]. A difficulty in this type of analysis is that the proportion of estimated invariable sites decreases as more species are added to the analysis. A model with continuous heterogeneous rates may be a more realistic model. The model presented here is general enough to encompass the two-state model and other models that allow for a continuum of nonnegative rates. Gamma [11-15, 17, 31] and log-normal [11, 30] distributions have also been used to model evolutionary rates along a DNA sequence. Uzzel and Corbin [13] and Wakely [17] have suggested that a gamma distribution may fit better than a discrete two-state distribution, especially when the data are from divergent species. Chen and Kelly [29] show that evolutionary estimates, such as branch lengths and rate variance, may vary with the assumed rate distribution. Since in general the rate distribution is unknown, it may be unclear which estimates are more accurate. The lower bounds for the rate variance and branch lengths obtained in Section 5.1 may then be useful. Certainly more work on determining an appropriate rate distribution model is called for. A heterogeneous rate analysis will not only yield more accurate estimates but will also yield information about the distribution of rates. Constructing a posterior mean plot of the rates in the sequence as described in Section 5.2 provides a graphical picture of this distribution and may suggest regions of structural or functional importance in a molecule. When no distribution is assumed for the rates, less informa-

105

MODELING NUCLEOTIDE EVOLUTION

tion is available. In this case, it appears that only bounds on the relative branch lengths, the rate variance, and tail probabilities can be obtained. We have not calculated errors for these bounds, but this is a reasonable next step in this analysis. As the number of species in the analysis increases, the error on these bounds should decrease. Unfortunately, as the number of species increases, the difficulty in maximizing the likelihood also increases. A good algorithm for this maximization needs to be developed. APPENDIX

PROOF OF THEOREM 1 First we show that it is true for k = 0. If Q(x) is the (r - 1)th-degree interpolating polynomial of q~t(x), then the error of the approximation is

~(r)(~) f i (X--Xk)>~O

forx>~ max Xk,

l
kffil

because ~ r ) ( ~ ) is positive for all ~. Now suppose it is true for k - 1. For any interpolating polynomial P ( x ) of a function f ( x ) such that f ( k ) ( x ) > 0 for all k and x, we have

f(k-1)(x)~p(k-1)(X),

X>~ max xj. l
Now let Q(x) be the interpolating polynomial of the function ~ot(x) at the points xl, x 2 . . . . . x r, so that ~ot(Xk) -- Q ( X k ) = O,

k = 1 . . . . . r.

By Rolle's theorem, there exist points x; < Yi < xi+l, i -- 1..... r - 1, such that q~(Yi) = Q'(Yi), so Q' is an (r -2)th-degree interpolating polynomial of ~ at the points {Yi : i = 1..... r - 1}. Then ~ and its interpolating polynomial Q' satisfy the assumptions of the theorem, and by the induction hypothesis (~p~)~k-1)(X)>~(Q')~k-1)(X)

forx>~

max l
or f o r x>~

max

l
yj,

and in particular for x >/max~ ~ j ~ • Xj ~ max~ ~ i ~ ~- ~YJ"

yj

106

COLLEEN KELLY AND JOHN RICE

PROOF OF THEOREM 2 Since Q(x) is the interpolating polynomial of q~t(x), from Equation (11) we have ~,(x) = Q ( x ) + - - ~o~r)(~ ) f i ( X -- Xk ). r[ k=l Differentiating once we obtain

~o[r+~i(~) fi

~p~(x) = Q ' ( x ) + ~ ' ( x )

~}r~(~)

+ - r!

(X--Xk)

k=l

1-I (x - x~).

j=lk~j

Differentiating again we obtain q~'(x) = Q " ( x ) + ~ " ( x )

+ [ s~'(x)]2

~°}r+~(~) H (x-x~) r! k=a

r[ ~r+:)(~) i~I (~-x~)

~r+ 1)(~) +2~'(x)

r[

k=l

r Y'~ l-I ( X - - X k )

j=lk~j

r + m

r!

g g FI (x-x~).

i=l j ¢ i k ~ i , j

These expressions give the errors on the approximations of the first and second derivatives of q~t(x). Evaluating these expressions at zero simplifies the error terms. Also, since we know ~ot(0)= 1, the first interpolation point is xl = 0, and thus any product involving this term disappears:

~0:(0)=Qt(O)-}-~r)r(l ~'''~) f i

k=2

~0;'(0) = Q"(0) + 2 ~ ' ( 0 )

(--Xk) ,

~o(r+l)(~) h r[

k=2

+ 2 q~}r)(£) 1--I (-xk). r! j=2k~j

(--Xk)

107

MODELING NUCLEOTIDE EVOLUTION T h e error t e r m in the approximation of the variance V a r ( A t ) = ~o;'(0) - [ ~o~(0)] 2 = Q " ( 0 ) - [ Q ' ( 0 ) ]

2+error

thus b e c o m e s a~(r+ 1 ) ( ~ )

error = 2 U ( O ) -,-t r! k=2

r!

_2~0t(0(¢) r]

h(_Xl,

k=2

j=2k-~j

[ r!

]2

k=2

) _~log[~otO(~:(X)) ] t=O + ~ 0

--1

1 fi - q~;(O) + 2q~}~) ( ~ ) / r t k =

(_xk)).

j = 2 xk

k=2

2

T h e first and last terms inside the parentheses are positive, so the error is positive if

j = a ~JJ > ~ p ; ( O ) = E [ A t ] = t .

REFERENCES 1 T.H. Jukes and C. H. Cantor, Evolution of protein molecules, in Mammalian Protein Metabolism, N. H. Munro, Ed., Academic, New York, 1969, pp. 21-123. 2 T. Miyata, Evolutionary changes and functional constraints in DNA sequences, in Molecular Evolution, Protein Polymorphism and the Neutral Allele Theory, M. Kimura, Ed., Japan Scientific Press, Tokyo, 1982, pp. 233-266. 3 W. M. Fitch and E. Margoliash, A method for estimating the number of invariant amino acid coding positions in a gene using cytochrome c as a model case, Biochem. Genet. 1:65-71 (1967). 4 W. M. Fitch, The estimate of total nucleotide substitutions from pairwise differences is biased, Phil. Trans. Roy. Soc. Lond. B 312:317-324 (1986). 5 J. S. Shoemaker and W. M. Fitch, Evidence from nuclear sequences that invariable sites should be considered when sequence divergence is calculated, Mol. Biol. Evol. 6(3):270-289 (1989).

108

COLLEEN KELLY AND JOHN RICE

6 A. Sidow, T. Nguyen, and T. Speed, Estimating the fraction of invariable codons with a capture-recapture method, J. Mol. Evol. 35:253-260 (1992). 7 M. Hasegawa, H. Kishino, and T. Yano, Dating of the human-ape splitting by a molecular clock of mitochondrial DNA, J. Mol. Evol. 26:132-147 (1985). 8 G. A. Churchill, A. von Haeseler, and W. C. Navidi, Sample size for a phylogenetic inference, Mol. Biol. Evol. 9:753-769 (1992). 9 S. R. Palumbi, Rates of molecular evolution and the fraction of nucleotide positions free to vary, J. Mol. Evol. 29:180-187 (1989). 10 D. Sankoff, Designer invariants for large phylogenies, Mol. Biol. Evol. 7(3):255-269 (1990). 11 G.B. Golding, Estimates of DNA and protein sequence divergence: an examination of some assumptions, Mol. Biol. Evol. 1:125-142 (1983). 12 L. Jin and M. M. Nei, Limitations of the evolutionary parsimony method of phylogenetic analysis, Mol. Biol. Evol. 7:82-102 (1990). 13 T. Uzzel and K. W. Corbin, Fitting discrete probability distributions to evolutionary events, Science 172:1089-1096 (1971). 14 R. M. Holmquist, The spatial distribution of fixed mutations within genes coding for proteins, J. Mol. Evol. 19:437-448 (1983). 15 N. Takahata, Overdispersed molecular clock at the major histocompatibility complex loci, Proc. Roy. Soc. Lond. B 243:13-18 (1991). 16 G . J . Olsen, Earliest phylogenetic branchings: comparing rRNA-based evolutionary trees inferred with various techniques, Cold Sp~qngHarbor Syrup. Quant. Biol. 52:825-837 (1987). 17 J. Wakeley, Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA, J. Mol. Evol. 37:613-623 (1993). 18 C. Lanave, G. Preparata, C. Saccone, and G. Serio, A new method for calculating evolutionary substitution rates, J. Mol. Evol. 20:86-93 (1984). 19 J. Felsenstein, Phylogenies from molecular sequences: inference and reliability, Annu. Rev. Genet. 22:521-565 (1988). 20 J. Felsenstein, Evolutionary trees from DNA sequences: a maximum likelihood approach, J. Mol. Evol. 17:368-376 (1981). 21 J. Felsenstein, Statistical inference of phylogenies, J. Roy. Stat. Soc. A 146(3):246-272 (1983). 22 S. Tavar6 and T. Janzen, On estimating substitution rates from pairs of nucleotide sequences, Preprint, 1985. 23 S. Tavar6, Some probabilistic and statistical problems in the analysis of DNA sequences, Lect. Math. Life Sci. 17:57-86 (1986). 24 W . B . Upholt, Estimation of DNA sequence divergence from comparison of restriction endonuclease digests, Nucleic Acids Res. 4:1257-1265 (1977). 25 M. Kimura, Estimation of evolutionary distances between homologous nucleotide sequences, Proc. Natl. Acad. Sci. USA 78(1):454-458 (1981). 26 D. Barry and J. A. Hartigan, Asynchronous distance between homologous DNA sequences, Biometrics 43:261-276 (1987). 27 J . A . Cavender and J. Felsenstein, lnvariants of phylogenies in a simple case of discrete states, J. Classif. 4:57-71 (1987). 28 B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, CBMS-NSF Regional Conf. Ser. Appl. Math. Vol. 38, SIAM, Philadelphia, PA, 1982.

MODELING NUCLEOTIDE EVOLUTION

109

29 J. Chen and C. Kelly, A reexamination of DNA sequence data using a heterogeneous rate model of evolution, Unpublished, 1995. 30 G. J. Olsen, Systematic underestimation of tree branch lengths by Lake's operator metrics: an effect of position-dependent substitution rates, Mol. Biol. Evol. 8(5):592-608 (1991). 31 Z. Yang, Maximum likelihood estimation of phylogeny from DNA sequences, Mol. Biol. Evol. 10:1396-1401 (1993). 32 N. Goldman, Statistical tests of models of DNA substitution, J. Mol. Evol. 36:182-198 (1993).