A note on distributions of times to coalescence, under time-dependent population size

A note on distributions of times to coalescence, under time-dependent population size

Theoretical Population Biology 63 (2003) 33–40 http://www.elsevier.com/locate/ytpbi A note on distributions of times to coalescence, under time-depe...

166KB Sizes 1 Downloads 58 Views

Theoretical Population Biology 63 (2003) 33–40

http://www.elsevier.com/locate/ytpbi

A note on distributions of times to coalescence, under time-dependent population size A. Polanski,a,b A. Bobrowski,c and M. Kimmelb,* a

Department of Statistics, Rice University, 6100 Main Street, Mail Stop 138, P.O. Box 1892, Houston, TX, USA b Institute of Automation, Silesian Technical University, Gliwice, Poland c Department of Mathematics, University of Houston, Houston, TX, USA Received 1 May 2002; received in revised form 27 August 2002; accepted 5 September 2002

Abstract Expressions for marginal distributions of times in the time-varying coalescence process are derived. The proposed method allows also for computation of joint probability distribution for pairs, triples, etc. of coalescence times. The expressions derived are useful for (1) extending several statistics from time constant to time-varying case, (2) increasing efficiency and accuracy of simulations in time-varying evolution, and (3) debugging coalescence simulation software. r 2002 Elsevier Science (USA). All rights reserved.

1. Introduction Statistical inference from random samples of chromosomes taken from population depends on probability distributions of times in the coalescence process (Kingman, 1982). In the case of a population which has evolved with constant size over many generations, times between coalescence events are independent exponential random variables. Using this fact, analytical expressions for many interesting statistics were derived under constant population size hypothesis. These include expected numbers of mutations or recombinations in the sample (Watterson, 1975; Hudson and Kaplan, 1985; Fu, 1995), expected time to the most recent common ancestor (TMRCA) which, additionally, can be conditioned on events like mutations or recombinations in the genealogy (Griffiths, 1999; Griffiths and Marjoram, 1996), expected ages of mutations (Griffiths and Tavare, 1998; Stephens, 2000), and so forth. In the case of evolution with time-varying population size, times in the coalescence tree are no longer independent. Joint probability density function of coalescence times (Griffiths and Tavare, 1998) involves a sequential dependence of times of events that happen later in the tree on those which happen earlier. Let us *Corresponding author. Tel.: (713)-348-5255; fax: (713)-348-5476. E-mail address: [email protected] (M. Kimmel).

notice that, since coalescence is usually considered in backward time, ‘‘later in the tree’’ means ‘‘earlier in chronological time’’ and conversely. Expectations of times in coalescence processes with time-varying population size, in the literature known to us, were computed by averaging over Monte Carlo simulations (e.g., Slatkin and Hudson, 1991; Griffiths and Tavare, 1998; Nielsen, 2000). In this paper we derive expressions of marginal distributions for times in the coalescence process with time-varying population size, which follow from the form of their joint probability distribution. Our method also allows computation of joint probability distribution for pairs, triples, etc. of coalescence times. The expressions derived are useful for extending several statistics from time constant to time-varying case, increasing efficiency and accuracy of simulations in time-varying evolution and debugging coalescence simulation software. We show examples of applications of our expressions to computing probabilities that a segregating site has b mutant bases (Griffiths and Tavare, 1998) and to evaluating bias in Tajima’s D statistics caused by population growth. For large genealogies (of sizes n > 50) the described method cannot be efficiently applied because expressions contain coefficients which become very large. This effect considerably limits practical applications of the described approach. Nevertheless, we were able to

S0040-5809/02/$ - see front matter r 2002 Elsevier Science (USA). All rights reserved. PII: S 0 0 4 0 - 5 8 0 9 ( 0 2 ) 0 0 0 1 0 - 2

A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40

34

develop methods of computing of two important statistics, expected time to most recent common ancestor and expected sum of lengths of all branches in the coalescence tree, applicable for large sample sizes n:

We accept notation for coalescence times as shown in Fig. 1. Time is measured from the present to the past. Random coalescence times from sample of size n to sample of size k  1 are denoted by Tk ; k ¼ 2; 3; y; n; and their realizations by corresponding lower case letters tn ; tn1 ; y; t2 ; 0otn otn1 o?ot2 : Let us stress that our notation is not standard. It is more common that capital and lower letters T; t are used to denote times between coalescence events. Our notation seems to be more convenient for this specific application. We assume that the history of population’s effective size is described by a function of the reverse time t ð1Þ

tA½0; NÞ:

ð3Þ

2.1. Marginal distributions of times

2. Distributions of times to coalescence events

Ne ðtÞ;

density of times Tk ; y; Tn1 ; Tn ; 2okpn; is ! Z tj j n Y ð2j Þ ð2Þ ds exp  pðtk ; y; tn1 ; tn Þ ¼ : Ne ðtj Þ tjþ1 Ne ðsÞ j¼k

For a random sample of n chromosomes, joint probability density function of the distribution of times T2 ; y; Tn1 ; Tn is given by the following expression (Griffiths and Tavare, 1998): ! Z tj j n Y ð2j Þ ð2Þ ds exp  pðt2 ; y; tn1 ; tn Þ ¼ ; ð2Þ Ne ðtj Þ tjþ1 Ne ðsÞ j¼2 is the binowhere 0 ¼ tnþ1 otn otn1 o?ot2 ; and mial symbol. This probability distribution has a sequential-dependence structure, i.e., the distribution ð2j Þ

Marginal probability density functions pk ðtk Þ of times tk ; k ¼ 2; 3; y; n; follow from integration Z tk Z tn2 Z tn1 ? pk ðtk Þ ¼ 0

0

0

 pðtk ; y; tn1 ; tn Þ dtn dtn1 ?dtkþ1 ;

ð4Þ

where pðtk ; y; tn1 ; tn Þ is given by (3). The probability density function above can be represented as follows: pk ðtk Þ ¼

n X

Akj qj ðtk Þ

ð5Þ

j¼k

with ðj Þ qj ðtk Þ ¼ 2 exp  Ne ðtk Þ

Z

tk 0

! ð2j Þ ds ; Ne ðsÞ

and coefficients Akj given by Qn l l¼k;laj ð2Þ k ; kpjpn A j ¼ Qn l j l¼k;laj ½ð2Þ  ð2Þ

ð6Þ

ð7Þ

and Ann ¼ 1:

ð8Þ

2.2. Distributions of pairs of times Conditional distribution density of times Tk ; y; Tm2 ; Tm1 ; given Tm ¼ tm ; m > k; is equal to pðtk ; y; tm2 ; tm1 jtm Þ ! Z tj j m 1 Y ð2j Þ ð2Þ ds exp  ¼ : Ne ðtj Þ tjþ1 Ne ðsÞ j¼k

ð9Þ

Using (9) and the marginal distribution pm ðtm Þ of time Tm one can write the integral expressing the join probability density function pkm ðtk ; tm Þ of a pair of times Tk ; Tm ; kom Z tk Z tm3 Z tm2 ? pðtk ; y; tm2 ; tm1 jtm Þ pkm ðtk ; tm Þ ¼ 0

0

0

 pm ðtm Þ dtm1 dtm2 ydtkþ1 :

ð10Þ

This density can be represented by the following sum: Fig. 1. Notation for ancestral history of a sample of DNA sequences. Random coalescence times from sample of size n to sample of size k  1 are denoted by Tk ; k ¼ 2; 3; y; n; and their realizations by corresponding lower case letters tn ; tn1 ; y; t2 ; 0otn otn1 o?ot2 :

pkm ðtk ; tm Þ ¼

m1 X n X j1 ¼k j2 ¼m

km Bkm j1 Cj2 qj1 ;j2 ðtk ; tm Þ;

ð11Þ

A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40

where qj1 ;j2 ðtk ; tm Þ ¼

Z

!

tk j1 ðj21 Þ ð 2 Þ ds ðj22 Þ exp  Ne ðtk Þ tm Ne ðsÞ Ne ðtm Þ ! Z tm j2 ð 2 Þ ds  exp  Ne ðsÞ 0

ð12Þ

and Qm1

Bkm j1

l l¼k;laj1 ð2Þ ; l j1 l¼k;laj1 ½ð2Þ  ð 2 Þ

¼ Qm1

kpj1 pm  1;

ð13Þ

mpj2 pn:

ð14Þ

Qn

Cjkm 2

l l¼k;laj2 ð2Þ ; l j2 l¼k;laj2 ½ð2Þ  ð 2 Þ

¼ Qn

2.3. Proofs of expressions (5) and (11) Let us define integral transform Uf:g of the probability density function pðtÞ; tA½0; NÞ; with the underlying function 0oNe ðtÞoN; tA½0; NÞ; as follows: PðsÞ ¼ UfpðtÞg  Z t 

ds ¼ E exp s 0 Ne ðsÞ  Z t  Z N ds pðtÞ exp s ¼ dt; 0 0 Ne ðsÞ

ð15Þ

where E stands for expectation. The transform defined is linear. We apply transform (15) to distribution qj ðtÞ given by (6). This leads to QðsÞ ¼ Ufqj ðtÞg ! Z N j Z t j ð2Þ ð2Þ ds ¼ exp  Ne ðtÞ 0 0 Ne ðsÞ  Z t  ðj Þ ds  exp s dt ¼ 2 j : s þ ð2 Þ 0 Ne ðsÞ

ð16Þ

We compute transformation (15) of the marginal distribution pk ðtk Þ from (4) Ufpk ðtk Þg Z N Z tk Z tn2 Z tn2 ¼ y 0 0 0 Z tk 0  ds  exp s 0 Ne ðsÞ

ð18Þ

was repeatedly applied. Expanding (17) into partial fractions (Korn and Korn, 1961) results in n n Y X ð2j Þ ðj Þ Ufpk ðtk Þg ¼ Akj 2 j ; ð19Þ j ¼ s þ ð2 Þ j¼k s þ ð2Þ j¼k where Akj are coefficients (7). Using (16) and linearity, one finds the inverse transform U1 f:g for (19). It has the form (5), so the first part of the proof is completed. Transformation (15) can be generalized to the multidimensional case as follows: Pðs2 ; s3 ; y; sn Þ ¼ Ufpðt2 ; y; tn1 ; tn Þg  Z t2 Z t3 ds ds  s3 ¼ E exp s2 Ne ðsÞ 0 0 Ne ðsÞ 

Z tn ds ?  sn : 0 Ne ðsÞ

ð20Þ

We substitute pðt2 ; t3 ; y; tn Þ given by expression (2) in definition (20) above and perform manipulations analogous to those in expressions (16)–(18). This leads to n Y ðj Þ Ufpðt2 ; y; tn1 ; tn Þg ¼ : ð21Þ Pj 2 j j¼2 i¼2 sl þ ð2Þ Marginal distributions for times, pairs of times, triples, etc. can be computed, as special cases, from (21). The one-dimensional case (17) is obtained as follows:  Z tk 

ds Ufpk ðtk Þg ¼ E exp s Ne ðsÞ 0

 pðtk ; y; tn1 ; tn Þ dtn dtn1 ydtkþ1 dtk  Z tk  Z NZ N Z NZ N ds ¼ y exp s 0 tn tkþ2 tkþ1 0 Ne ðsÞ  pðtk ; y; tn1 ; tn Þ dtk dtkþ1 ydtn1 dtn n Y ð2j Þ ¼ j : j¼k s þ ð2Þ

In the above, the following identity  Z tj  j Z N ð2 Þ ds exp s tjþ1 0 Ne ðsÞ Ne ðtk Þ ! Z tj j ð2Þ ds  exp  dtj tjþ1 Ne ðsÞ  Z tj  Z N j ð2Þ ds exp s ¼ tjþ1 Ne ðtk Þ 0 Ne ðsÞ ! ! Z tjþ1 j Z tj j ð2Þ ds ð2Þ ds dtj exp  exp  Ne ðsÞ 0 Ne ðsÞ 0 ! Z tjþ1 ðj Þ ½s þ ð2j Þ ds ¼ 2 j exp  Ne ðsÞ s þ ð2Þ 0 ! Z tjþ1 j ð2Þ ds exp Ne ðsÞ 0  Z tjþ1  j ðÞ s ds ¼ 2 j exp  Ne ðsÞ s þ ð2Þ 0

35

¼ Pðs2 ; s3 ; y; sn Þjsk ¼s;sjak ¼0 ; ð17Þ

ð22Þ

where jsk ¼s;sjak ¼0 denotes: substitute sk ¼ s; and sj ¼ 0; for all jak: For the two-dimensional marginal distribu-

A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40

36

tions pkm ðtk ; tm Þ in expression (10):

all branches in the coalescence tree (ETSUM) that can be computed for large sample size n:

Ufpkm ðtk ; tm Þg ¼ Pðsk ; sm Þ  Z ¼ E exp sk 0

ð23Þ tk

ds  sm Ne ðsÞ

Z

tm

0

ds Ne ðsÞ

¼ Pðs2 ; s3 ; y; sn Þjsjak;m ¼0 ¼

m 1 Y j1 ¼k

n ðj21 Þ Y ðj22 Þ ; sk þ ðj21 Þ j2 ¼m sk þ sm þ ðj22 Þ



ð24Þ ð25Þ ð26Þ

where jsjak;m ¼0 denotes: substitute sj ¼ 0; for all jak; m: Expanding both products in (26) into partial fractions yields ¼ Pðsk ; sm Þ m1 X n X j1 ¼k j2 ¼m

Bkm j1

ðj21 Þ ðj22 Þ km C ; sk þ ðj21 Þ j2 sk þ sm þ ðj22 Þ

ð27Þ

km where Bkm j1 and Cj2 are given by (13) and(14). Now, observing that

ðj21 Þ ðj22 Þ Ufqj1 ;j2 ðtk ; tm Þg ¼ ; j1 ½sk þ ð 2 Þ½sk þ sm þ ðj22 Þ

Using our notation as in expressions (5) and (7), the expected time to most recent common ancestor (ETMRCA) equals n X A2l el ; ð29Þ ETMRCA ¼ EðT2 Þ ¼ l¼2

where E stands for expectation and Z N el ¼ tql ðtÞ dt:

ð30Þ

0

It can be verified that coefficients A2l ; l ¼ 2; 3; y; n do not explode when n grows. Consequently, expectation (29) can be computed efficiently for large genealogies.

Ufpkm ðtk ; tm Þg

¼

3.1. Expected time to most recent common ancestor

ð28Þ

where qj1 ;j2 ðtk ; tm Þ is given by (12) and, again, using linearity of transform (20) yields expression (11), which completes the second part of the proof.

3.2. Expected sum of lengths of all branches in the coalescence tree Let us denote the random times between coalescence events by Sk ; i.e., Sk ¼ Tk  Tkþ1 ; k ¼ 2; 3; y; n: Times Tk are as defined in Fig. 1, with Tnþ1 ¼ 0: The expected sum of lengths of branches in the coalescence tree (ETSUM) is given by ! ! n n X X ETSUM ¼ E kSk ¼ E T2 þ Tk : ð31Þ k¼2

3. Large genealogies When trying to apply the method to large genealogies one encounters the problem of some of the coefficients Akl in (7) becoming very large. For example, when n ¼ 4 20; A10 and when n ¼ 40; A20 14 ¼ 2:2478  10 ; 29 ¼ 10 1:080  10 : Analysis of Eq. (7) reveals that some of the coefficients Akl will explode, when n is large and k takes values close to n=2: For such case, probability density function pk ðtk Þ in expression (5), for a given tk ; is computed as a sum of very large numbers with alternating signs. Performing practical numerical computations using the standard double precision arithmetics allows effective calculations of distributions pk ðtk Þ; k ¼ 2; 3; y; n; for genealogies of size np50: By ‘‘effective calculating’’ we mean that relative numerical errors do not exceed 104 : For larger genealogies one will observe a sharp increase of numerical error. The effect described considerably limits practical applications. In general, overcoming these difficulties in summation of diverging coefficients requires a future research. However, further on we show two important statistics, the expected time to most recent common ancestor (ETMRCA) and the expected sum of lengths of

k¼2

Using expressions (5) and (30) one obtains n n n X X X A2l el þ Akl el ETSUM ¼ ¼

l¼2

k¼2

n X

n l X X

A2l el þ

l¼2

l¼2

l¼k

Akl el :

ð32Þ

k¼2

However, for all odd l ¼ 2j þ 1; 1pjp½n=2 A2l þ

l X

Akl ¼ A22jþ1 þ

k¼2

2jþ1 X

Ak2jþ1 ¼ 0

ð33Þ

k¼2

and for all even l ¼ 2j; 1pjp½n=2; we have l X k¼3

Akl ¼

2j X

Ak2j ¼ 0:

ð34Þ

k¼3

Notation ½n=2 stands for the largest integer not greater than n=2: Using (33) and (34) we obtain ETSUM ¼ 2

½n=2 X

A22j e2j :

ð35Þ

j¼1

Since the above expression involves only coefficients Akl ; with k ¼ 2; it can be efficiently numerically evaluated for large size of genealogy n:

A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40

3.3. Proofs of (33) and (34) We observe that after the multiplicative term, (33) becomes 1þ

ð2j2 Þ ð2j2 Þ

þ



2j X

ð2j2 Þð2j1 2 Þ 2j1 2jþ1 ½ð2j2 Þ  ð2jþ1 2 Þ½ð 2 Þ  ð 2 Þ

þ2

þ?

2 ð2j2 Þð2j1 2 Þyð2Þ 2j1 2jþ1 2 2jþ1 ½ð2j2 Þ  ð2jþ1 2 Þ½ð 2 Þ  ð 2 Þy½ð2Þ  ð 2 Þ

ð36Þ

Similarly, expression (34) can be equivalently written as ð2j1 2 Þ

þ

2j 2j2 2j ½ð2j1 2 Þ  ð 2 Þ½ð 2 Þ  ð 2 Þ



þ?

2j2 3 ð2j1 2 Þð 2 Þyð2Þ 2j 2j2 2j ð 2 Þ½ð 2 Þ  ð 2 Þy½ð32Þ



 ð37Þ

m þ1 X

i m1 lþ1 m ðlþi ðlmÞ: l Þði1Þð1Þ ¼ ð1Þ

ð38Þ

ð2jþkÞð2jþ1kÞ this Using the identity ðk2Þ  ð2jþ1 2 Þ ¼  2 becomes 2j X

i Y

i¼2

k¼2

ð1Þi1

2j ð2j þ kÞð2j þ 1  kÞ Y kðk  1Þ ¼0  2 2 k¼iþ1

ð40Þ

or 222j ð2jÞ!ð2j  1Þ! þ 212j

2j X

ð1Þi1

i¼2

ð2j  1Þ! ð2jÞ! ð2j  1Þ! ¼ 0:  ð2j  iÞ! i! ði  1Þ!

2j1 ð2j þ k  1Þð2j  kÞ Y kðk  1Þ ¼ 0: 2 2 k¼iþ1

¼  jð2j  2Þð2j þ 1Þ

This is a particular case of Eq. (5.24) in (Graham et al., 1998). Now, (36) can be expressed in an equivalent way as follows: ! " ! !# 2j 2j Y i Y X k k 2j þ 1 2 þ  2 2 2 i¼2 k¼2 k¼2 ! 2j Y k  ¼ 0: ð39Þ 2 k¼iþ1

k¼2

¼ 2j þ 1:

ð43Þ

ð2j þ iÞ! ð2j þ 1Þ! ð41Þ

ð45Þ

This can be reduced to ! ! 2j1 X 2j  2 i 2j  1 þ i ð1Þ 2j  1 i1 i¼3

i¼1

kðk  1Þ þ 2

i1

!

2j1 i kðk  1Þ X Y þ ð1Þi1 2 i¼3 k¼3

k¼3

ð2j2 Þ

For the proof of (36) and (37) we need the fact that for all positive integers l and m;

2

2j

2j  1

This last relation is a particular case of expression (38) with l ¼ 2j and m ¼ 2j  1: In a similar way, expression (37) has an equivalent form ! " ! !# ! 2j1 2j1 2j1 i X Y Y k Y k k 2j þ  ¼0 2 2 2 2 i¼3 k¼3 k¼3 k¼iþ1

2j1 Y

¼ 0:

2j Y

!

or

2j2 ð2j1 2 Þð 2 Þ

½ð2j1 2 Þ

2j þ i

ð42Þ

ð44Þ

2j ð2j1 2 Þ  ð2Þ

þ

ð1Þ

i

i¼1

¼ 0:



After some algebra this reduces to ! ! 2j X 2j  1 i 2j þ i ð1Þ ¼ 2ð2j þ 1Þ 2j i1 i¼2 or

ð2jþ1 2 Þ

37

and, consequently, to 2j1 X i¼1

ð1Þ

i

2j  1 þ i 2j  1

!

ð46Þ

2j  2 i1

! ¼ 2j:

ð47Þ

which is (38) with m ¼ 2j  2 and l ¼ 2j  1:

4. Examples of applications of the results Let us assume an exponential scenario of population growth Ne ðtÞ ¼ Ne0 expðrtÞ: Distributions qj ðtÞ; j ¼ 2; 3; y; n are given by ! ð2j Þ ð2j Þ expðrtÞ exp  ðexpðrtÞ  1Þ qj ðtÞ ¼ Ne0 Ne0

ð48Þ

ð49Þ

Probability density functions of marginal distributions pk ðtÞ of coalescence times Tk ; k ¼ 2; 3; y; 10; which follow from (5) and (49), are plotted versus time t in Fig. 2, for the following parameters: Ne0 ¼ 10; r ¼ 0:05; n ¼ 10: In Fig. 3, probability density function p4 ðtÞ is compared to the histogram obtained from 1000 simulations performed with the use of the method described by Slatkin and Hudson (1991). The exact

A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40

38

Fig. 2. Probability density functions pk ðtÞ; k ¼ 2; y; 10 for the exponential model of population growth Ne ðtÞ ¼ Ne0 ert with Ne0 ¼ 10; r ¼ 0:05:

Fig. 4. Plots of qn;b ; n ¼ 10; versus b for exponential history of population size and different values of the parameter k ¼ rNe0 ; k ¼ 0:1 ð * Þ; k ¼ 1 ð3Þ; k ¼ 10 ðþÞ:

we obtain qnb ¼

ðn  b  1Þ!ðb  1Þ! 2ðn 

Pn

nk Pn k¼2 ðb1Þ j¼k P½n=2 2 1Þ! j¼1 A2j e2j

jðj  1ÞAkj ej

;

ð52Þ Let us again assume exponential history of population size (48). Expectations in (30) become (Slatkin and Hudson, 1991) l ð Þ

Fig. 3. Probability density function p4 ðtÞ compared to the histogram obtained from 1000 simulations. Data are assumed the same as those in Fig. 2.

value of the expectation is EðT4 Þ ¼ 4:1212: In simulations we obtained T4 ¼ 4:1292: 4.1. Probability that a segregating site has b mutant bases Griffiths and Tavare (1998) give, in their Eq. (1.3), an expression for the probability qnb that a segregating site has b mutant bases, in the terms of expectations of coalescence times. In our notation, this expression has the following form: qnb

P ðn  b  1Þ!ðb  1Þ! nk¼2 kðk  1Þðnk b1ÞEðSk Þ P ; ð50Þ ¼ ðn  1Þ! nk¼2 kEðSk Þ

where 0obon; Sk ¼ Tk  Tkþ1 ; Tnþ1 ¼ 0: Using expression (6) and notation as in expression (30) we represent EðSk Þ in the numerator as EðSk Þ ¼ Akk ek þ Pn

n X jðj  1Þ k A ej ; kðk  1Þ j j¼kþ1

el ¼ el ðNe0 ; rÞ ¼

r

! ð2l Þ Ei  ; rNe0

ð53Þ

where Ei denotes the exponential integral, EiðmÞ ¼ RN  1 expðmxÞ=x dx; ReðmÞ > 0 (Gradshteyn and Ryzhik, 1980, Section 3.351.5). Examples of values of probabilities qnb are plotted versus b in Fig. 4, for n ¼ 10: One can observe that qnb depend on one parameter, k ¼ rNe0 : In Fig. 4, three values of this parameter, k ¼ 0:1; 1; 10; are used. 4.2. Bias in Tajima’s D statistic caused by population growth The D statistic was introduced by Tajima (1989) for testing neutrality of mutations in the population of constant effective size Ne : Let us write it in the following form: k#  s# D ¼ pffiffiffiffiffia1 ; Vs

ð54Þ

where S# is the number of segregating sites in the sample, k# denotes average number of pairwise differences,

ð51Þ

while k¼2 kEðSk Þ in the denominator is the expected value of the sum of branches in the coalescence tree, evaluated already in (35). Combining (50), (51) and (35)

expðrN2e0 Þ

a1 ¼

n1 X 1 i i¼1

ð55Þ

# Sla # 1 : In practical and VS stands for variance of k– computations, VS is replaced by its estimator V# S : Here

A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40

we will use the following expression (Tajima, 1989): 2

ð56Þ

c1 ¼

nþ1 1  ; 3ðn  1Þ a1

ð57Þ

c2 ¼

n1 2ðn2 þ n þ 3Þ n þ 2 1 X 1  þ 9nðn  1Þ na1 ða1 Þ2 i¼1 i2

ð58Þ

VS ¼ c1 M þ c2 M ; where

and M ¼ 2Ne0 m is the scaled product of the population size at present and the mutation intensity m: We study the properties of EðDÞ #

#  Eð S Þ EðkÞ pffiffiffiffiffiffi a1 EðDÞ ¼ VS

ð59Þ

in the situation where population size changes exponentially, according to (48). Denote l ! ð Þ expð k2 Þ ð2l Þ Ei  el ðkÞ ¼  ð60Þ k k which yields el ðNe0 ; rÞ ¼ Ne0 el ðkÞ: Taking into account # ¼ 2me2 ðNe0 ; rÞ; Eð S# Þ ¼ m ETSUM, using (60) that EðkÞ a1 a1 and expression (35) for ETSUM, we obtain P½n=2 Mðe2 ðkÞ  a11 j¼1 A22j e2j ðkÞÞ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi EðDÞ ¼ c1 M þ c2 M 2 P½n=2 e2 ðkÞ  a11 j¼1 A22j e2j ðkÞpffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi M: ð61Þ ¼ c1 þ c2 M Observe that one can use the above formula for the case where genealogy size n is large, since it contains only coefficients Akn with index k ¼ 2: When using expression (53) for large genealogies (and/or small values of k ¼ rNe0 Þ one needs a special approach. When the argument l ð2Þ rNe0

in (53) becomes large computing el ðNe0 ; rÞ involves l ð Þ

solving product of the type N 0: For rN2e0 > 50 we used

Fig. 5. Plots of EðDÞ versus k for different genealogy sizes, n ¼ 10; 20; 50; 100:

39

expansion (Gradshteyn and Ryzhik, 1980, Section l ð Þ

8.215) which allowed canceling first factor exp ðrN2e0 Þ: In Fig. 5 we presented plots of EðDÞ versus k for M ¼ 10: Plots in Fig. 5 show a bias, towards negative values, of D statistics caused by the increase of population size in the course of its evolution. The fact that a bottleneck in population’s history leads to negative values of D was already mentioned by Tajima in his paper (1989). In Fig. 5 one can see that the bias further increases with the increase of genealogy size n: Due to the fact that we are using expression (56) for variance VS ; valid for the case of constant population size, while in reality the population size changes, plots in Fig. 5 are not very well scaled. They show minima at values of k close to 4, while one expects that bias would rather grow monotonically with k: Expression (61) can approximate bias in D statistics only for small values of k ðo3Þ: For larger values more complicated statistics would be necessary.

5. Concluding remarks For the case of constant size population, times T2 ; y; Tn1 ; Tn are sums of independent exponential random variables (times between coalescences) and decomposition of the form (5) follows by using the moment generating function, convolution theorem and partial fraction expansion (Feller, 1970, p. 40, Problem 12; Johnson et al., 1994, p. 384, see also Stephens, 2000, expression (16), Tavare, 1984, Section 5.1). Expression (5) is more general in the sense that it concerns timevariable coalescence. The proposed method is based on transformation (15) Rt which uses the time-scale change t ¼ 0 Nds ; well known e ðsÞ in the literature (e.g., Griffiths and Tavare, 1998). It is also possible to derive (5) and (11) by repeated integration and using mathematical induction. However, computations are rather tedious. We have shown examples of applying our expressions for computing probabilities of a segregating site having b mutant bases and for studying bias in Tajima’s D statistics for growing populations. For large genealogies (of size n > 50) the method described cannot be efficiently applied in practical computations, due to the fact that expressions contain coefficients which become very large when n grows. We can mention that the same problem arises in instances of using decompositions analogous to (5), in the case when population size do not vary (Stephens, 2000; Tavare, 1984). We were able to show a method of evaluating two statistics: expected time to most recent common ancestor (ETMRCA), Eq. (29) and expected sum of

40

A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40

lengths of all branches in the coalescence tree (ETSUM), Eq. (35), for genealogies of large size n: Matlab m-files that allow performing computations presented in this paper are available at authors’ web site: www.stat.rice.edu/~mathbio/coaltim.

Acknowledgments The authors were supported by NIH grants GM58545, CA75432, Polish Scientific Committee (KBN) grant 8T11E01319, research project PBZ/KBN/ 040/P04/2001 and NATO collaborative linkage grant LST.CLG.977845.

References Feller, W., 1970. An Introduction to Probability Theory and Its Applications, 3rd Edition, Vol. II. Wiley, New York. Fu, Y.X., 1995. Statistical properties of segregating sites. Theor. Pop. Biol. 48, 172–197. Gradshteyn, I.S., Ryzhik, I.M., 1980. Table of Integrals, Series and Products, 5th Edition. Academic Press, New York. Graham, R.L., Knuth, D.E., Patashnik, O., 1998. Concrete Mathematics. A Foundation for Computer Science, 2nd Edition. Addison-Wesley, Reading, MA.

Griffiths, R.C., 1999. The time to the ancestor along sequences with recombination. Theor. Pop. Biol. 55, 137–144. Griffiths, R.C., Marjoram, P., 1996. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3, 479–502. Griffiths, R.C., Tavare, S., 1998. The age of mutation in the general coalescent tree. Stochastic Models 14, 273–295. Hudson, R.R., Kaplan, N., 1985. Statistical properties of the number of recombination events in the history of DNA sequences. Genetics 111, 147–164. Johnson, N.L., Kotz, S., Balakrishnan, N., 1994. Continuous Univariate Distributions, 2nd Edition, Vol. 1. Wiley, New York. Kingman, J.F.C., 1982. The coalescent. Stochastic Process. Appl. 13, 235–248. Korn, G.A., Korn, T.M., 1961. Mathematical Handbook for Scientists and Engineers. McGraw-Hill, New York. Nielsen, R., 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154, 931–942. Slatkin, M., Hudson, R.R., 1991. Pairwise comparisons of Mitochondrial DNA in stable and exponentially growing populations. Genetics 129, 555–562. Stephens, M., 2000. Times on trees and the age of an allele. Theor. Pop. Biol. 57, 109–119. Tajima, F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585–595. Tavare, S., 1984. Line-of-descent and genealogical processes and their applications in population genetics models. Theor. Pop. Biol. 26, 119–164. Watterson, G.A., 1975. On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7, 387–407.