Theoretical Population Biology 63 (2003) 33–40
http://www.elsevier.com/locate/ytpbi
A note on distributions of times to coalescence, under time-dependent population size A. Polanski,a,b A. Bobrowski,c and M. Kimmelb,* a
Department of Statistics, Rice University, 6100 Main Street, Mail Stop 138, P.O. Box 1892, Houston, TX, USA b Institute of Automation, Silesian Technical University, Gliwice, Poland c Department of Mathematics, University of Houston, Houston, TX, USA Received 1 May 2002; received in revised form 27 August 2002; accepted 5 September 2002
Abstract Expressions for marginal distributions of times in the time-varying coalescence process are derived. The proposed method allows also for computation of joint probability distribution for pairs, triples, etc. of coalescence times. The expressions derived are useful for (1) extending several statistics from time constant to time-varying case, (2) increasing efficiency and accuracy of simulations in time-varying evolution, and (3) debugging coalescence simulation software. r 2002 Elsevier Science (USA). All rights reserved.
1. Introduction Statistical inference from random samples of chromosomes taken from population depends on probability distributions of times in the coalescence process (Kingman, 1982). In the case of a population which has evolved with constant size over many generations, times between coalescence events are independent exponential random variables. Using this fact, analytical expressions for many interesting statistics were derived under constant population size hypothesis. These include expected numbers of mutations or recombinations in the sample (Watterson, 1975; Hudson and Kaplan, 1985; Fu, 1995), expected time to the most recent common ancestor (TMRCA) which, additionally, can be conditioned on events like mutations or recombinations in the genealogy (Griffiths, 1999; Griffiths and Marjoram, 1996), expected ages of mutations (Griffiths and Tavare, 1998; Stephens, 2000), and so forth. In the case of evolution with time-varying population size, times in the coalescence tree are no longer independent. Joint probability density function of coalescence times (Griffiths and Tavare, 1998) involves a sequential dependence of times of events that happen later in the tree on those which happen earlier. Let us *Corresponding author. Tel.: (713)-348-5255; fax: (713)-348-5476. E-mail address:
[email protected] (M. Kimmel).
notice that, since coalescence is usually considered in backward time, ‘‘later in the tree’’ means ‘‘earlier in chronological time’’ and conversely. Expectations of times in coalescence processes with time-varying population size, in the literature known to us, were computed by averaging over Monte Carlo simulations (e.g., Slatkin and Hudson, 1991; Griffiths and Tavare, 1998; Nielsen, 2000). In this paper we derive expressions of marginal distributions for times in the coalescence process with time-varying population size, which follow from the form of their joint probability distribution. Our method also allows computation of joint probability distribution for pairs, triples, etc. of coalescence times. The expressions derived are useful for extending several statistics from time constant to time-varying case, increasing efficiency and accuracy of simulations in time-varying evolution and debugging coalescence simulation software. We show examples of applications of our expressions to computing probabilities that a segregating site has b mutant bases (Griffiths and Tavare, 1998) and to evaluating bias in Tajima’s D statistics caused by population growth. For large genealogies (of sizes n > 50) the described method cannot be efficiently applied because expressions contain coefficients which become very large. This effect considerably limits practical applications of the described approach. Nevertheless, we were able to
S0040-5809/02/$ - see front matter r 2002 Elsevier Science (USA). All rights reserved. PII: S 0 0 4 0 - 5 8 0 9 ( 0 2 ) 0 0 0 1 0 - 2
A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40
34
develop methods of computing of two important statistics, expected time to most recent common ancestor and expected sum of lengths of all branches in the coalescence tree, applicable for large sample sizes n:
We accept notation for coalescence times as shown in Fig. 1. Time is measured from the present to the past. Random coalescence times from sample of size n to sample of size k 1 are denoted by Tk ; k ¼ 2; 3; y; n; and their realizations by corresponding lower case letters tn ; tn1 ; y; t2 ; 0otn otn1 o?ot2 : Let us stress that our notation is not standard. It is more common that capital and lower letters T; t are used to denote times between coalescence events. Our notation seems to be more convenient for this specific application. We assume that the history of population’s effective size is described by a function of the reverse time t ð1Þ
tA½0; NÞ:
ð3Þ
2.1. Marginal distributions of times
2. Distributions of times to coalescence events
Ne ðtÞ;
density of times Tk ; y; Tn1 ; Tn ; 2okpn; is ! Z tj j n Y ð2j Þ ð2Þ ds exp pðtk ; y; tn1 ; tn Þ ¼ : Ne ðtj Þ tjþ1 Ne ðsÞ j¼k
For a random sample of n chromosomes, joint probability density function of the distribution of times T2 ; y; Tn1 ; Tn is given by the following expression (Griffiths and Tavare, 1998): ! Z tj j n Y ð2j Þ ð2Þ ds exp pðt2 ; y; tn1 ; tn Þ ¼ ; ð2Þ Ne ðtj Þ tjþ1 Ne ðsÞ j¼2 is the binowhere 0 ¼ tnþ1 otn otn1 o?ot2 ; and mial symbol. This probability distribution has a sequential-dependence structure, i.e., the distribution ð2j Þ
Marginal probability density functions pk ðtk Þ of times tk ; k ¼ 2; 3; y; n; follow from integration Z tk Z tn2 Z tn1 ? pk ðtk Þ ¼ 0
0
0
pðtk ; y; tn1 ; tn Þ dtn dtn1 ?dtkþ1 ;
ð4Þ
where pðtk ; y; tn1 ; tn Þ is given by (3). The probability density function above can be represented as follows: pk ðtk Þ ¼
n X
Akj qj ðtk Þ
ð5Þ
j¼k
with ðj Þ qj ðtk Þ ¼ 2 exp Ne ðtk Þ
Z
tk 0
! ð2j Þ ds ; Ne ðsÞ
and coefficients Akj given by Qn l l¼k;laj ð2Þ k ; kpjpn A j ¼ Qn l j l¼k;laj ½ð2Þ ð2Þ
ð6Þ
ð7Þ
and Ann ¼ 1:
ð8Þ
2.2. Distributions of pairs of times Conditional distribution density of times Tk ; y; Tm2 ; Tm1 ; given Tm ¼ tm ; m > k; is equal to pðtk ; y; tm2 ; tm1 jtm Þ ! Z tj j m 1 Y ð2j Þ ð2Þ ds exp ¼ : Ne ðtj Þ tjþ1 Ne ðsÞ j¼k
ð9Þ
Using (9) and the marginal distribution pm ðtm Þ of time Tm one can write the integral expressing the join probability density function pkm ðtk ; tm Þ of a pair of times Tk ; Tm ; kom Z tk Z tm3 Z tm2 ? pðtk ; y; tm2 ; tm1 jtm Þ pkm ðtk ; tm Þ ¼ 0
0
0
pm ðtm Þ dtm1 dtm2 ydtkþ1 :
ð10Þ
This density can be represented by the following sum: Fig. 1. Notation for ancestral history of a sample of DNA sequences. Random coalescence times from sample of size n to sample of size k 1 are denoted by Tk ; k ¼ 2; 3; y; n; and their realizations by corresponding lower case letters tn ; tn1 ; y; t2 ; 0otn otn1 o?ot2 :
pkm ðtk ; tm Þ ¼
m1 X n X j1 ¼k j2 ¼m
km Bkm j1 Cj2 qj1 ;j2 ðtk ; tm Þ;
ð11Þ
A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40
where qj1 ;j2 ðtk ; tm Þ ¼
Z
!
tk j1 ðj21 Þ ð 2 Þ ds ðj22 Þ exp Ne ðtk Þ tm Ne ðsÞ Ne ðtm Þ ! Z tm j2 ð 2 Þ ds exp Ne ðsÞ 0
ð12Þ
and Qm1
Bkm j1
l l¼k;laj1 ð2Þ ; l j1 l¼k;laj1 ½ð2Þ ð 2 Þ
¼ Qm1
kpj1 pm 1;
ð13Þ
mpj2 pn:
ð14Þ
Qn
Cjkm 2
l l¼k;laj2 ð2Þ ; l j2 l¼k;laj2 ½ð2Þ ð 2 Þ
¼ Qn
2.3. Proofs of expressions (5) and (11) Let us define integral transform Uf:g of the probability density function pðtÞ; tA½0; NÞ; with the underlying function 0oNe ðtÞoN; tA½0; NÞ; as follows: PðsÞ ¼ UfpðtÞg Z t
ds ¼ E exp s 0 Ne ðsÞ Z t Z N ds pðtÞ exp s ¼ dt; 0 0 Ne ðsÞ
ð15Þ
where E stands for expectation. The transform defined is linear. We apply transform (15) to distribution qj ðtÞ given by (6). This leads to QðsÞ ¼ Ufqj ðtÞg ! Z N j Z t j ð2Þ ð2Þ ds ¼ exp Ne ðtÞ 0 0 Ne ðsÞ Z t ðj Þ ds exp s dt ¼ 2 j : s þ ð2 Þ 0 Ne ðsÞ
ð16Þ
We compute transformation (15) of the marginal distribution pk ðtk Þ from (4) Ufpk ðtk Þg Z N Z tk Z tn2 Z tn2 ¼ y 0 0 0 Z tk 0 ds exp s 0 Ne ðsÞ
ð18Þ
was repeatedly applied. Expanding (17) into partial fractions (Korn and Korn, 1961) results in n n Y X ð2j Þ ðj Þ Ufpk ðtk Þg ¼ Akj 2 j ; ð19Þ j ¼ s þ ð2 Þ j¼k s þ ð2Þ j¼k where Akj are coefficients (7). Using (16) and linearity, one finds the inverse transform U1 f:g for (19). It has the form (5), so the first part of the proof is completed. Transformation (15) can be generalized to the multidimensional case as follows: Pðs2 ; s3 ; y; sn Þ ¼ Ufpðt2 ; y; tn1 ; tn Þg Z t2 Z t3 ds ds s3 ¼ E exp s2 Ne ðsÞ 0 0 Ne ðsÞ
Z tn ds ? sn : 0 Ne ðsÞ
ð20Þ
We substitute pðt2 ; t3 ; y; tn Þ given by expression (2) in definition (20) above and perform manipulations analogous to those in expressions (16)–(18). This leads to n Y ðj Þ Ufpðt2 ; y; tn1 ; tn Þg ¼ : ð21Þ Pj 2 j j¼2 i¼2 sl þ ð2Þ Marginal distributions for times, pairs of times, triples, etc. can be computed, as special cases, from (21). The one-dimensional case (17) is obtained as follows: Z tk
ds Ufpk ðtk Þg ¼ E exp s Ne ðsÞ 0
pðtk ; y; tn1 ; tn Þ dtn dtn1 ydtkþ1 dtk Z tk Z NZ N Z NZ N ds ¼ y exp s 0 tn tkþ2 tkþ1 0 Ne ðsÞ pðtk ; y; tn1 ; tn Þ dtk dtkþ1 ydtn1 dtn n Y ð2j Þ ¼ j : j¼k s þ ð2Þ
In the above, the following identity Z tj j Z N ð2 Þ ds exp s tjþ1 0 Ne ðsÞ Ne ðtk Þ ! Z tj j ð2Þ ds exp dtj tjþ1 Ne ðsÞ Z tj Z N j ð2Þ ds exp s ¼ tjþ1 Ne ðtk Þ 0 Ne ðsÞ ! ! Z tjþ1 j Z tj j ð2Þ ds ð2Þ ds dtj exp exp Ne ðsÞ 0 Ne ðsÞ 0 ! Z tjþ1 ðj Þ ½s þ ð2j Þ ds ¼ 2 j exp Ne ðsÞ s þ ð2Þ 0 ! Z tjþ1 j ð2Þ ds exp Ne ðsÞ 0 Z tjþ1 j ðÞ s ds ¼ 2 j exp Ne ðsÞ s þ ð2Þ 0
35
¼ Pðs2 ; s3 ; y; sn Þjsk ¼s;sjak ¼0 ; ð17Þ
ð22Þ
where jsk ¼s;sjak ¼0 denotes: substitute sk ¼ s; and sj ¼ 0; for all jak: For the two-dimensional marginal distribu-
A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40
36
tions pkm ðtk ; tm Þ in expression (10):
all branches in the coalescence tree (ETSUM) that can be computed for large sample size n:
Ufpkm ðtk ; tm Þg ¼ Pðsk ; sm Þ Z ¼ E exp sk 0
ð23Þ tk
ds sm Ne ðsÞ
Z
tm
0
ds Ne ðsÞ
¼ Pðs2 ; s3 ; y; sn Þjsjak;m ¼0 ¼
m 1 Y j1 ¼k
n ðj21 Þ Y ðj22 Þ ; sk þ ðj21 Þ j2 ¼m sk þ sm þ ðj22 Þ
ð24Þ ð25Þ ð26Þ
where jsjak;m ¼0 denotes: substitute sj ¼ 0; for all jak; m: Expanding both products in (26) into partial fractions yields ¼ Pðsk ; sm Þ m1 X n X j1 ¼k j2 ¼m
Bkm j1
ðj21 Þ ðj22 Þ km C ; sk þ ðj21 Þ j2 sk þ sm þ ðj22 Þ
ð27Þ
km where Bkm j1 and Cj2 are given by (13) and(14). Now, observing that
ðj21 Þ ðj22 Þ Ufqj1 ;j2 ðtk ; tm Þg ¼ ; j1 ½sk þ ð 2 Þ½sk þ sm þ ðj22 Þ
Using our notation as in expressions (5) and (7), the expected time to most recent common ancestor (ETMRCA) equals n X A2l el ; ð29Þ ETMRCA ¼ EðT2 Þ ¼ l¼2
where E stands for expectation and Z N el ¼ tql ðtÞ dt:
ð30Þ
0
It can be verified that coefficients A2l ; l ¼ 2; 3; y; n do not explode when n grows. Consequently, expectation (29) can be computed efficiently for large genealogies.
Ufpkm ðtk ; tm Þg
¼
3.1. Expected time to most recent common ancestor
ð28Þ
where qj1 ;j2 ðtk ; tm Þ is given by (12) and, again, using linearity of transform (20) yields expression (11), which completes the second part of the proof.
3.2. Expected sum of lengths of all branches in the coalescence tree Let us denote the random times between coalescence events by Sk ; i.e., Sk ¼ Tk Tkþ1 ; k ¼ 2; 3; y; n: Times Tk are as defined in Fig. 1, with Tnþ1 ¼ 0: The expected sum of lengths of branches in the coalescence tree (ETSUM) is given by ! ! n n X X ETSUM ¼ E kSk ¼ E T2 þ Tk : ð31Þ k¼2
3. Large genealogies When trying to apply the method to large genealogies one encounters the problem of some of the coefficients Akl in (7) becoming very large. For example, when n ¼ 4 20; A10 and when n ¼ 40; A20 14 ¼ 2:2478 10 ; 29 ¼ 10 1:080 10 : Analysis of Eq. (7) reveals that some of the coefficients Akl will explode, when n is large and k takes values close to n=2: For such case, probability density function pk ðtk Þ in expression (5), for a given tk ; is computed as a sum of very large numbers with alternating signs. Performing practical numerical computations using the standard double precision arithmetics allows effective calculations of distributions pk ðtk Þ; k ¼ 2; 3; y; n; for genealogies of size np50: By ‘‘effective calculating’’ we mean that relative numerical errors do not exceed 104 : For larger genealogies one will observe a sharp increase of numerical error. The effect described considerably limits practical applications. In general, overcoming these difficulties in summation of diverging coefficients requires a future research. However, further on we show two important statistics, the expected time to most recent common ancestor (ETMRCA) and the expected sum of lengths of
k¼2
Using expressions (5) and (30) one obtains n n n X X X A2l el þ Akl el ETSUM ¼ ¼
l¼2
k¼2
n X
n l X X
A2l el þ
l¼2
l¼2
l¼k
Akl el :
ð32Þ
k¼2
However, for all odd l ¼ 2j þ 1; 1pjp½n=2 A2l þ
l X
Akl ¼ A22jþ1 þ
k¼2
2jþ1 X
Ak2jþ1 ¼ 0
ð33Þ
k¼2
and for all even l ¼ 2j; 1pjp½n=2; we have l X k¼3
Akl ¼
2j X
Ak2j ¼ 0:
ð34Þ
k¼3
Notation ½n=2 stands for the largest integer not greater than n=2: Using (33) and (34) we obtain ETSUM ¼ 2
½n=2 X
A22j e2j :
ð35Þ
j¼1
Since the above expression involves only coefficients Akl ; with k ¼ 2; it can be efficiently numerically evaluated for large size of genealogy n:
A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40
3.3. Proofs of (33) and (34) We observe that after the multiplicative term, (33) becomes 1þ
ð2j2 Þ ð2j2 Þ
þ
2j X
ð2j2 Þð2j1 2 Þ 2j1 2jþ1 ½ð2j2 Þ ð2jþ1 2 Þ½ð 2 Þ ð 2 Þ
þ2
þ?
2 ð2j2 Þð2j1 2 Þyð2Þ 2j1 2jþ1 2 2jþ1 ½ð2j2 Þ ð2jþ1 2 Þ½ð 2 Þ ð 2 Þy½ð2Þ ð 2 Þ
ð36Þ
Similarly, expression (34) can be equivalently written as ð2j1 2 Þ
þ
2j 2j2 2j ½ð2j1 2 Þ ð 2 Þ½ð 2 Þ ð 2 Þ
þ?
2j2 3 ð2j1 2 Þð 2 Þyð2Þ 2j 2j2 2j ð 2 Þ½ð 2 Þ ð 2 Þy½ð32Þ
ð37Þ
m þ1 X
i m1 lþ1 m ðlþi ðlmÞ: l Þði1Þð1Þ ¼ ð1Þ
ð38Þ
ð2jþkÞð2jþ1kÞ this Using the identity ðk2Þ ð2jþ1 2 Þ ¼ 2 becomes 2j X
i Y
i¼2
k¼2
ð1Þi1
2j ð2j þ kÞð2j þ 1 kÞ Y kðk 1Þ ¼0 2 2 k¼iþ1
ð40Þ
or 222j ð2jÞ!ð2j 1Þ! þ 212j
2j X
ð1Þi1
i¼2
ð2j 1Þ! ð2jÞ! ð2j 1Þ! ¼ 0: ð2j iÞ! i! ði 1Þ!
2j1 ð2j þ k 1Þð2j kÞ Y kðk 1Þ ¼ 0: 2 2 k¼iþ1
¼ jð2j 2Þð2j þ 1Þ
This is a particular case of Eq. (5.24) in (Graham et al., 1998). Now, (36) can be expressed in an equivalent way as follows: ! " ! !# 2j 2j Y i Y X k k 2j þ 1 2 þ 2 2 2 i¼2 k¼2 k¼2 ! 2j Y k ¼ 0: ð39Þ 2 k¼iþ1
k¼2
¼ 2j þ 1:
ð43Þ
ð2j þ iÞ! ð2j þ 1Þ! ð41Þ
ð45Þ
This can be reduced to ! ! 2j1 X 2j 2 i 2j 1 þ i ð1Þ 2j 1 i1 i¼3
i¼1
kðk 1Þ þ 2
i1
!
2j1 i kðk 1Þ X Y þ ð1Þi1 2 i¼3 k¼3
k¼3
ð2j2 Þ
For the proof of (36) and (37) we need the fact that for all positive integers l and m;
2
2j
2j 1
This last relation is a particular case of expression (38) with l ¼ 2j and m ¼ 2j 1: In a similar way, expression (37) has an equivalent form ! " ! !# ! 2j1 2j1 2j1 i X Y Y k Y k k 2j þ ¼0 2 2 2 2 i¼3 k¼3 k¼3 k¼iþ1
2j1 Y
¼ 0:
2j Y
!
or
2j2 ð2j1 2 Þð 2 Þ
½ð2j1 2 Þ
2j þ i
ð42Þ
ð44Þ
2j ð2j1 2 Þ ð2Þ
þ
ð1Þ
i
i¼1
¼ 0:
1þ
After some algebra this reduces to ! ! 2j X 2j 1 i 2j þ i ð1Þ ¼ 2ð2j þ 1Þ 2j i1 i¼2 or
ð2jþ1 2 Þ
37
and, consequently, to 2j1 X i¼1
ð1Þ
i
2j 1 þ i 2j 1
!
ð46Þ
2j 2 i1
! ¼ 2j:
ð47Þ
which is (38) with m ¼ 2j 2 and l ¼ 2j 1:
4. Examples of applications of the results Let us assume an exponential scenario of population growth Ne ðtÞ ¼ Ne0 expðrtÞ: Distributions qj ðtÞ; j ¼ 2; 3; y; n are given by ! ð2j Þ ð2j Þ expðrtÞ exp ðexpðrtÞ 1Þ qj ðtÞ ¼ Ne0 Ne0
ð48Þ
ð49Þ
Probability density functions of marginal distributions pk ðtÞ of coalescence times Tk ; k ¼ 2; 3; y; 10; which follow from (5) and (49), are plotted versus time t in Fig. 2, for the following parameters: Ne0 ¼ 10; r ¼ 0:05; n ¼ 10: In Fig. 3, probability density function p4 ðtÞ is compared to the histogram obtained from 1000 simulations performed with the use of the method described by Slatkin and Hudson (1991). The exact
A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40
38
Fig. 2. Probability density functions pk ðtÞ; k ¼ 2; y; 10 for the exponential model of population growth Ne ðtÞ ¼ Ne0 ert with Ne0 ¼ 10; r ¼ 0:05:
Fig. 4. Plots of qn;b ; n ¼ 10; versus b for exponential history of population size and different values of the parameter k ¼ rNe0 ; k ¼ 0:1 ð * Þ; k ¼ 1 ð3Þ; k ¼ 10 ðþÞ:
we obtain qnb ¼
ðn b 1Þ!ðb 1Þ! 2ðn
Pn
nk Pn k¼2 ðb1Þ j¼k P½n=2 2 1Þ! j¼1 A2j e2j
jðj 1ÞAkj ej
;
ð52Þ Let us again assume exponential history of population size (48). Expectations in (30) become (Slatkin and Hudson, 1991) l ð Þ
Fig. 3. Probability density function p4 ðtÞ compared to the histogram obtained from 1000 simulations. Data are assumed the same as those in Fig. 2.
value of the expectation is EðT4 Þ ¼ 4:1212: In simulations we obtained T4 ¼ 4:1292: 4.1. Probability that a segregating site has b mutant bases Griffiths and Tavare (1998) give, in their Eq. (1.3), an expression for the probability qnb that a segregating site has b mutant bases, in the terms of expectations of coalescence times. In our notation, this expression has the following form: qnb
P ðn b 1Þ!ðb 1Þ! nk¼2 kðk 1Þðnk b1ÞEðSk Þ P ; ð50Þ ¼ ðn 1Þ! nk¼2 kEðSk Þ
where 0obon; Sk ¼ Tk Tkþ1 ; Tnþ1 ¼ 0: Using expression (6) and notation as in expression (30) we represent EðSk Þ in the numerator as EðSk Þ ¼ Akk ek þ Pn
n X jðj 1Þ k A ej ; kðk 1Þ j j¼kþ1
el ¼ el ðNe0 ; rÞ ¼
r
! ð2l Þ Ei ; rNe0
ð53Þ
where Ei denotes the exponential integral, EiðmÞ ¼ RN 1 expðmxÞ=x dx; ReðmÞ > 0 (Gradshteyn and Ryzhik, 1980, Section 3.351.5). Examples of values of probabilities qnb are plotted versus b in Fig. 4, for n ¼ 10: One can observe that qnb depend on one parameter, k ¼ rNe0 : In Fig. 4, three values of this parameter, k ¼ 0:1; 1; 10; are used. 4.2. Bias in Tajima’s D statistic caused by population growth The D statistic was introduced by Tajima (1989) for testing neutrality of mutations in the population of constant effective size Ne : Let us write it in the following form: k# s# D ¼ pffiffiffiffiffia1 ; Vs
ð54Þ
where S# is the number of segregating sites in the sample, k# denotes average number of pairwise differences,
ð51Þ
while k¼2 kEðSk Þ in the denominator is the expected value of the sum of branches in the coalescence tree, evaluated already in (35). Combining (50), (51) and (35)
expðrN2e0 Þ
a1 ¼
n1 X 1 i i¼1
ð55Þ
# Sla # 1 : In practical and VS stands for variance of k– computations, VS is replaced by its estimator V# S : Here
A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40
we will use the following expression (Tajima, 1989): 2
ð56Þ
c1 ¼
nþ1 1 ; 3ðn 1Þ a1
ð57Þ
c2 ¼
n1 2ðn2 þ n þ 3Þ n þ 2 1 X 1 þ 9nðn 1Þ na1 ða1 Þ2 i¼1 i2
ð58Þ
VS ¼ c1 M þ c2 M ; where
and M ¼ 2Ne0 m is the scaled product of the population size at present and the mutation intensity m: We study the properties of EðDÞ #
# Eð S Þ EðkÞ pffiffiffiffiffiffi a1 EðDÞ ¼ VS
ð59Þ
in the situation where population size changes exponentially, according to (48). Denote l ! ð Þ expð k2 Þ ð2l Þ Ei el ðkÞ ¼ ð60Þ k k which yields el ðNe0 ; rÞ ¼ Ne0 el ðkÞ: Taking into account # ¼ 2me2 ðNe0 ; rÞ; Eð S# Þ ¼ m ETSUM, using (60) that EðkÞ a1 a1 and expression (35) for ETSUM, we obtain P½n=2 Mðe2 ðkÞ a11 j¼1 A22j e2j ðkÞÞ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi EðDÞ ¼ c1 M þ c2 M 2 P½n=2 e2 ðkÞ a11 j¼1 A22j e2j ðkÞpffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi M: ð61Þ ¼ c1 þ c2 M Observe that one can use the above formula for the case where genealogy size n is large, since it contains only coefficients Akn with index k ¼ 2: When using expression (53) for large genealogies (and/or small values of k ¼ rNe0 Þ one needs a special approach. When the argument l ð2Þ rNe0
in (53) becomes large computing el ðNe0 ; rÞ involves l ð Þ
solving product of the type N 0: For rN2e0 > 50 we used
Fig. 5. Plots of EðDÞ versus k for different genealogy sizes, n ¼ 10; 20; 50; 100:
39
expansion (Gradshteyn and Ryzhik, 1980, Section l ð Þ
8.215) which allowed canceling first factor exp ðrN2e0 Þ: In Fig. 5 we presented plots of EðDÞ versus k for M ¼ 10: Plots in Fig. 5 show a bias, towards negative values, of D statistics caused by the increase of population size in the course of its evolution. The fact that a bottleneck in population’s history leads to negative values of D was already mentioned by Tajima in his paper (1989). In Fig. 5 one can see that the bias further increases with the increase of genealogy size n: Due to the fact that we are using expression (56) for variance VS ; valid for the case of constant population size, while in reality the population size changes, plots in Fig. 5 are not very well scaled. They show minima at values of k close to 4, while one expects that bias would rather grow monotonically with k: Expression (61) can approximate bias in D statistics only for small values of k ðo3Þ: For larger values more complicated statistics would be necessary.
5. Concluding remarks For the case of constant size population, times T2 ; y; Tn1 ; Tn are sums of independent exponential random variables (times between coalescences) and decomposition of the form (5) follows by using the moment generating function, convolution theorem and partial fraction expansion (Feller, 1970, p. 40, Problem 12; Johnson et al., 1994, p. 384, see also Stephens, 2000, expression (16), Tavare, 1984, Section 5.1). Expression (5) is more general in the sense that it concerns timevariable coalescence. The proposed method is based on transformation (15) Rt which uses the time-scale change t ¼ 0 Nds ; well known e ðsÞ in the literature (e.g., Griffiths and Tavare, 1998). It is also possible to derive (5) and (11) by repeated integration and using mathematical induction. However, computations are rather tedious. We have shown examples of applying our expressions for computing probabilities of a segregating site having b mutant bases and for studying bias in Tajima’s D statistics for growing populations. For large genealogies (of size n > 50) the method described cannot be efficiently applied in practical computations, due to the fact that expressions contain coefficients which become very large when n grows. We can mention that the same problem arises in instances of using decompositions analogous to (5), in the case when population size do not vary (Stephens, 2000; Tavare, 1984). We were able to show a method of evaluating two statistics: expected time to most recent common ancestor (ETMRCA), Eq. (29) and expected sum of
40
A. Polanski et al. / Theoretical Population Biology 63 (2003) 33–40
lengths of all branches in the coalescence tree (ETSUM), Eq. (35), for genealogies of large size n: Matlab m-files that allow performing computations presented in this paper are available at authors’ web site: www.stat.rice.edu/~mathbio/coaltim.
Acknowledgments The authors were supported by NIH grants GM58545, CA75432, Polish Scientific Committee (KBN) grant 8T11E01319, research project PBZ/KBN/ 040/P04/2001 and NATO collaborative linkage grant LST.CLG.977845.
References Feller, W., 1970. An Introduction to Probability Theory and Its Applications, 3rd Edition, Vol. II. Wiley, New York. Fu, Y.X., 1995. Statistical properties of segregating sites. Theor. Pop. Biol. 48, 172–197. Gradshteyn, I.S., Ryzhik, I.M., 1980. Table of Integrals, Series and Products, 5th Edition. Academic Press, New York. Graham, R.L., Knuth, D.E., Patashnik, O., 1998. Concrete Mathematics. A Foundation for Computer Science, 2nd Edition. Addison-Wesley, Reading, MA.
Griffiths, R.C., 1999. The time to the ancestor along sequences with recombination. Theor. Pop. Biol. 55, 137–144. Griffiths, R.C., Marjoram, P., 1996. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3, 479–502. Griffiths, R.C., Tavare, S., 1998. The age of mutation in the general coalescent tree. Stochastic Models 14, 273–295. Hudson, R.R., Kaplan, N., 1985. Statistical properties of the number of recombination events in the history of DNA sequences. Genetics 111, 147–164. Johnson, N.L., Kotz, S., Balakrishnan, N., 1994. Continuous Univariate Distributions, 2nd Edition, Vol. 1. Wiley, New York. Kingman, J.F.C., 1982. The coalescent. Stochastic Process. Appl. 13, 235–248. Korn, G.A., Korn, T.M., 1961. Mathematical Handbook for Scientists and Engineers. McGraw-Hill, New York. Nielsen, R., 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154, 931–942. Slatkin, M., Hudson, R.R., 1991. Pairwise comparisons of Mitochondrial DNA in stable and exponentially growing populations. Genetics 129, 555–562. Stephens, M., 2000. Times on trees and the age of an allele. Theor. Pop. Biol. 57, 109–119. Tajima, F., 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123, 585–595. Tavare, S., 1984. Line-of-descent and genealogical processes and their applications in population genetics models. Theor. Pop. Biol. 26, 119–164. Watterson, G.A., 1975. On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7, 387–407.