Some Statistical Improvements for Estimating Population Size and Mutation Rate from Segregating Sites in DNA Sequences

Some Statistical Improvements for Estimating Population Size and Mutation Rate from Segregating Sites in DNA Sequences

Theoretical Population Biology 55, 235247 (1999) Article ID tpbi.1998.1401, available online at http:www.idealibrary.com on Some Statistical Impro...

264KB Sizes 0 Downloads 30 Views

Theoretical Population Biology 55, 235247 (1999) Article ID tpbi.1998.1401, available online at http:www.idealibrary.com on

Some Statistical Improvements for Estimating Population Size and Mutation Rate from Segregating Sites in DNA Sequences Etienne K. Klein,* Frederic Austerlitz,* and Catherine Laredo *Laboratoire Evolution et Systematique, URA 2154, Universite Paris XI, Ba^timent 362, F-91405 Orsay Cedex, France; and Laboratoire de Biometrie, Institut National de Recherche Agronomique, F-78350 Jouy-en-Josas, France Received July 9, 1997

In population genetics, under a neutral WrightFisher model, the scaling parameter %=4N+ represents twice the average number of new mutants per generation. The effective population size is N and + is the mutation rate per sequence per generation. Watterson proposed a consistent estimator of this parameter based on the number of segregating sites in a sample of nucleotide sequences. We study the distribution of the Watterson estimator. Enlarging the size of the sample, we asymptotically set a Central Limit Theorem for the Watterson estimator. This exhibits asymptotic normality with a slow rate of convergence. We then prove the asymptotic efficiency of this estimator. In the second part, we illustrate the slow rate of convergence found in the Central Limit Theorem. To this end, by studying the confidence intervals, we show that the asymptotic Gaussian distribution is not a good approximation for the Watterson estimator. ] 1999 Academic Press

genealogical tree allows tracking of this shape. Thus, the polymorphism of a sample should permit estimation of the demographic and genetic parameters of the populations studied. Much data on different types of polymorphism are already available and a few estimators of demographic and genetic population parameters have been proposed. An important goal is to quantify the information contained in DNA sequences: one can ask how far back we are able to reconstruct the demographic past, and what data are necessary in terms of sample size and number of markers. Our aim here is not to call into question the elaboration of new estimators but to study the intrinsic properties of these estimators, which is a way to understand the effects of an increase of the sample size of the genes. One of the best studied and most commonly used estimation models is that of the scaling parameter %=4N+ in the WrightFisher demographic model with constant population size. The parameter N denotes the

INTRODUCTION Recently, a new method considering intraspecific polymorphism of nucleotide sequences has been developed, using the known probability laws of genealogical trees of individuals. This is the coalescence theory introduced by Kingman (1982a, b). The stochastic history of genes from individuals sampled in a population (and particularly the times since these samples genes has a common ancestor) can be analytically calculated using different models for the demography of the population. Kingman (1982a, b) employed a WrightFisher model with constant population size, and Slatkin and Hudson (1991) and Austerlitz et al. (1997) extended this to expanding populations. Takahata and Slatkin (1990) and Slatkin (1995) studied subdivided populations models, and Austerlitz et al. (1997) introduced colonization processes. The shape of a genealogy depends on demography, and the mutation process within this

235

0040-580999 K30.00 Copyright ] 1999 by Academic Press All rights of reproduction in any form reserved.

236 effective diploid population size and + the mutation rate per sequence per generation. If + is known, this allows estimation of the effective population size. Watterson (1975) proposed an estimator based on the number of segregating sites (polymorphic sites in the sampled sequences). The favorable framework of this model stimulated the development of several analytical studies concerning the properties of this estimator. Tavare (1984) replaced the Watterson estimator in a coalescent approach, and Fu and Li (1993) proved the variance to be asymptotically equivalent to the CramerRao bound. Felsenstein (1992) showed a weak performance (large variance) far from asymptotics compared to phylogenetic estimators. Fu (1994a, b) and Kuhner et al. (1995) stressed this point by proposing efficient algorithms using phylogenetic reconstructions (respectively by UPGMA and by MetropolisHastings sampling). All these studies, by using the variance as an indicator of the performance of the estimators, implicitly assume their normality. Here we clarify this assumption of normality and discuss its consequences. In Section 1, we recall the coalescence theory together with the Watterson estimator. We prove in Section 2 the Central Limit Theorem concerning the Watterson estimator, as the sample size n goes to infinity. This establishes the asymptotic normality with a slow rate of convergence. Then, in Section 3, we use tools from the theory of asymptotic statistics to study properties concerning the asymptotic efficiency in this model and apply them to the Watterson estimator. Section 4 is devoted to empirical studies at fixed sample size. By means of computer simulations, we show that the slow rate of convergence calls into question the standard Gaussian approximation. This has implications on the construction of confidence intervals.

1. THE COALESCENT AND THE WATTERSON ESTIMATOR

Klein, Austerlitz, and Laredo

sample has exactly 2, 3, ..., n ancestors. These times are independent and exponentially distributed with parameters 1, 3, ..., n(n&1)2, respectively. Now, a mutation occurs with a probability + per sequence per generation. We assume as usual that + tends to 0 as N tends to infinity in such a way that 4N+ converges to %, which is the true value of the parameter in what follows. For an n-sample, let L 2 , L 3 , ..., L n be the number of mutations that occurred during the times T 2 , T 3 , ..., T n . Conditionally on T 2 , T 3 , ..., T n , these variables follows Poisson distributions with parameters %T 2 , 3%T 3 2, ..., n%T n 2, respectively. Using the coalescence times defined above, L 2 , L 3 , ..., L n follow geometric distributions with means %, %2, ..., %(n&1), respectively. These variables are independent. 1. 2. The Watterson Estimator Under the infinite sites mutation model (Kimura, 1969), mutations never occur twice at the same site, so the number of segregating sites K n is strictly equal to the number of mutations in the coalescence tree, n

Kn = : Lm .

The Watterson estimator of %, % n , is then the moment type estimator based on the mean of K n and it is given by % n =K n a n

(2)

with n

a n = : 1(m&1).

(3)

m=2

Using the means and variances of L 2 , L 3 , ..., L n yields the first moments of the Watterson estimator (Hudson, 1990; Watterson, 1975), E[% n ]=E

1.1. The Coalescence Theory The coalescent introduced by Kingman (1982a, b) is the stochastic structure of the genealogical tree linking n sampled genes to their most recent common ancestor (MRCA). We present here only the principal results for a WrightFisher model with constant effective diploid population size N and refer to Tavare (1984) and Hudson (1990) for more details. For an n-sample, let the coalescence time T 2 , T 3 , ..., T n be the times (in units of 2N generations) during which the

(1)

m=2

Kn % n = : 1(m&1)=%, an a n m=2

_ &

so, the Watterson estimator is a moment estimator, with Var[% n ]=

% b 1+% n , an an

\

+

(4)

where n

b n = : 1(m&1) 2. m=2

(5)

237

Study of the Watterson Estimator

where O(1) is a function converging to a constant when n Ä . So,

2. A CENTRAL LIMIT THEOREM As illustrated above, the properties of the Watterson estimator are largely based on the distribution of K n , the number of mutations since the MRCA defined in (1). This random variable is the sum of n&1 independent and geometrically but not identically distributed variables. Thus, the ``classical'' Central Limit Theorem does not apply. The n-dependence of the variance of the Watterson estimator, given in (4), suggests that the rate of convergence is equal to - log n. Indeed, a n is equivalent to log(n) and b n converges to a constant Theorem 1. Assume that % is the true value of the parameter. The Watterson estimator defined in (2) converges in distribution at rate - log n to a normal distribution with mean 0 and variance %: D

Ä N(0, %). - log n(% n &%) w

(6)

Proof. This theorem can be obtained from the Lindeberg theorem (see Dacunha-Castelle and Duflo, 1987, Vol. 2); nonetheless a direct proof is not more complicated and exhibits the moments. For m2, the Laplace transform of L m is

\

8 Lm(s)=E[e &sLm ]= 1+

% (1&e &s ) m&1

+

&1

.

Now, the L m being independent variables, the Laplace transform of K n is the product of the Laplace transforms above. Therefore, the log-Laplace of the Watterson estimator is 9 %n(s)=log(8 Kn an(s))=log

\

n

+

` 8 Lm(sa n ) . (7) m=2

Let us recall that if X is a random variable with Laplace transform 8 X (s), its cumulants are defined, for k1, by C kX =(&1) k

d k8 X (0). ds k

Expanding (7) in Taylor series leads to the cumulants: for k>1, C %k n =

1 a kn

\

n

% +O(1) , m&1 m=2 :

+

C %1 n &% =0, and log n(%n &%) C=% k

(log n) k2 (1+o(1)), a k&1 n

when k>1,

where o(1) is a function converging to 0 when n Ä . The only cumulant not converging to zero is the second one (k=2), which converges to %. It follows that the asymptotic distribution is Gaussian with mean zero and variance %. Asymptotic normality of % n was not obvious. Especially, when working with branching processes, it is common to encounter exponential asymptotic distributions (see Athreya and Ney (1972, Theorem 2, p. 20)). This result therefore strengthens the result of Watterson (1975), who concludes that the Watterson estimator is approximately normally distributed for large n. This also justifies many studies based on the variance, since in a Gaussian model all the information about the distribution is contained in the variance. However, this normality is asymptotic and the rate of convergence, - log n, of this model is lower than that of regular statistical models (- n). Two points are thus non-trivial. First, one must be careful when dealing with asymptotic efficiency in non-regular models (see, for example, the Hodges estimator in Van der Vaart and Wellner (1996)). Second, such a slow convergence may induce errors when one intends to approximate the real distribution by a Gaussian one at fixed sample size. We examine the problem in Section 4.

3. ASYMPTOTIC EFFICIENCY 3.1. General Results Concerning Asymptotic Efficiency It is known that, when n is fixed, efficiency for the variance criterion is synonymous with a variance equal to the CramerRao bound (inverse of the Fisher information matrix) (see DacunhaCastelle and Duflo (1987, Vol. 1)). Asymptotically, the existence of super-efficient estimators (asymptotic variance lower than the asymptotic inverse Fisher information on a null measure set of values) denies an asymptotic generalization of the CramerRao bound. Therefore a pointwise variance

238

Klein, Austerlitz, and Laredo

criterion is not sufficient to assess asymptotic efficiency. Indeed, it requires uniformity properties on the set of values of the parameter or at least in the neighborhoods of each value of this set. We thus deal with local minimax asymptotic efficiency which means that an estimator is efficient when it asymptotically minimizes its worst performance (for a given criterion, least squares, for example) in the neighborhoods of the true parameter. Results concerning this asymptotic efficiency have been derived in some smooth models that satisfy such uniformity: ``locally asymptotically normal" (LAN) models (Van der Vaart and Wellner, 1996). Practically, this property is established from the study of the log-likelihood of the model. Here we recall the LAN property for an arbitrary statistical model: Assume that ` is a parameter in R k, X 1 , ...X n are the observations, and p n(X 1 , ...X n ; `) is the parametrized joint distribution of the (X 1 , ...X n ). We define now k_k matrices r n and I ` , and the norm &A&=sup i, j (a ij ) for any matrix A=(a ij ) i, j . Definition. This statistical model is locally asymptotically normal (LAN) with the matrix r n of rates and the Fisher information matrix I ` if, for h small enough in R k, &r

&1 n

Theorem 2 (Complete Case). The model associated with the estimation of the 2-dimensional parameter (+, 4N) is ``locally asymptotically normal,'' with the matrix of rates

rn =

\

& ww Ä0 nÄ

t p n(X 1 , ...X n ; `+r &1 hI ` h n h) = th2 n, ` & +o P(1), p n(X 1 , ...X n ; `) 2

- log n 0 , 0 -n

+

and the Fisher information

I +, 4N =

\

4N+ 0 . 0 (4N) &2

+

The proof of this property is given in the Appendix.

and

log

statistical model (called the completed case) in which the observation is the whole coalescent (the times T 2 , T 3 , ..., T n together with the intermediate numbers of mutations L 2 , L 3 , ..., L n ). We choose the 2-dimensional parameter (+, 4N). The coalescence times are observed in units of generations since observing ``continuous times'' would imply the knowledge of 4N. This approach already used by Felsenstein (1992) and Fu and Li (1993) provides information on efficiency and allows measurement of the effective loss of information between the complete and incomplete cases. In this way the Watterson estimator can be compared with phylogenetic estimators.

(8)

where 2 n, ` is a k-dimensional random vector which D Ä N k(0, I ` ), th is the transposed vector of satisfies 2 n, ` w h, and o P(1) is a random variable which converges to 0 in probability as n goes to infinity.

3. 2. Application to the Statistical Models in Which the Watterson Estimator Is Defined A statistical model is the definition of the observations, the parameter, and the associated likelihood (as considered above). We will name as the incomplete case the intuitive model in which the observation is the number of segregating sites of an n-sample of genes and the parameter is %. Its likelihood is difficult to handle (see Hudson (1990) for a recursive formula and Tavare (1984) for an explicit expression). The Watterson estimator, however, can also be studied in a more informative

Corollary. Let (+~ n , 4N n ) be the maximum likelihood estimator of (+, 4N); it follows from Theorem 2 that

\

- log n(+~ n &+) D +4N 0 w Ä N 0,  - n(4N n &4N) 0 (4N) 2

+

\ \

++ .

(9)

This result provides multiple implications. First, in this model, the estimations of + and 4N are not achieved at the same rate. Parameter + can be estimated at rate - log n, while the estimation of 4N is much faster (- n). Since these rates of convergence are bounds for any estimator in this model, these results confirm what was suggested by Felsenstein (1992) and Fu and Li (1993): When + is known, phylogenetic estimators have a variance proportional to 1n. Of course, when neither + nor 4N is known, the estimation of + slows down the estimation of % to the rate - log n. A second result is the asymptotic independence of estimations of + and 4N. Indeed, the diagonal information

239

Study of the Watterson Estimator

matrix means that asymptotic covariance of efficient estimators of (+, 4N) is zero. Considering the normality, this is equivalent to independence.

3.3. Application to the Watterson Estimator Proposition 1. The Watterson estimator is asymptotically efficient in the complete case and in the incomplete case. Proof. We can find in Van der Vaart and Wellner (1996) that in a LAN model an estimator satisfying D r n(% n &%) w Ä N(0, I &1 % ) is asymptotically efficient for many cost functions (especially the variance efficiency which is obtained with the quadratic cost). Thus an efficient estimator (+~ n , 4N n ) of the parameter (+, 4N) based on the observation of the whole tree must satisfy (9). Since %=4N+, an efficient estimator of % is % n =4N n +~ n . Now, - log n(% n &%)=- log n(+~ n &+) 4N n +

- log n -n

- n(4N n &4N) +.

D And from (9) it follows that - log n(+~ n &+) w Ä D 2 N(0, +4N), - n(4N n &4N) w Ä N(0, (4N) ), and 4N n P w Ä 4N. Moreover, - log n- n Ä 0 and 4N n and +~ n are asymptotically independent. So, the maximum likeliD hood estimator verifies - log n(% n &%) w Ä N(0, %). Since the Watterson estimator satisfies this (see (6)), we can therefore conclude that it is asymptotically efficient in the complete case. Because the incomplete case is far less informative, the Watterson estimator is obviously asymptotically efficient in the initial model. This result, stated by Fu and Li (1993), is formally proven here. Moreover, we can emphasize what happens when + is known and the effective population size N is to be estimated. The Watterson estimator does not even achieve the right rate of convergence in the complete case: from Theorem 2, the optimal rate is - n while the estimation of N using the Watterson estimator maintains the rate - log n. However, we can draw no conclusions about its asymptotic inefficiency in the incomplete case because this model is less informative. The cost of the reconstruction of the tree is hard to quantify, and could involve a large variance, especially when one requires an estimator of the coalescence times, as in Felsenstein (1992).

4. ASYMPTOTIC PROPERTIES DO NOT PROVIDE GOOD APPROXIMATIONS AT FIXED SAMPLE SIZE n Sample sizes are generally small (a few dozens up to hundreds). Here we illustrate the negative consequences of the above-mentioned rates of convergence over the practical range of values of n. In terms of variance efficiency, the results are already clear: One can build estimators with smaller variances than that of the Watterson estimator. This is not surprising for two reasons. First, there is a large difference between the asymptotic variance (which is the lower bound for the variance of any estimator) and the real variance of the Watterson estimator (see Fig. 1). This is due to the slow convergence of b n a n to 0, amplified by the % 2 for high values of % (Eq. (4)). Second, the Watterson estimator does not take into account all the information contained in the DNA sequences. Fu (1994a, b) and Kuhner et al. (1995) propose algorithms which make use of the information concerning the shape of the coalescence tree and not only its total length. These algorithms obviously give better results based on the variance criterion and, thus, give smaller confidence intervals (see Figs. 1 and 2). Here we emphasize a different outcome of the slow convergence rate. We illustrate that this results in bad approximations for the distribution of the Watterson estimator. Indeed, considering the Central Limit Theorem (6), the distribution of the Watterson estimator might be approximated by a Gaussian one with mean % and variance given by (4). Under this Gaussian assumption we build two particular (1&:) confidence intervals. The first one is symmetric around the estimation % n and is defined, as usual by I(% n )=[% n &q : - v^ n(% n ), % n +q : - v^ n(% n )],

(10)

with q : verifying P[|X| >q : ]=:, for X a standard normal variable and with the plug-in estimator of the variance given by v^ n(% n )=

% n b 1+% n n . an an

\

+

(11)

Note that this is a biased estimator of the variance, but bias is negligible here (results not shown). This interval is known to be asymptotically exact (DiCiccio and Efron, 1996). The second one is the exact confidence interval, which takes into account the fact that the variance depends on

240

Klein, Austerlitz, and Laredo

FIG. 1. Variance of the Watterson estimator calculated using (4) and asymptotic variance %log n. Four different values of n were used: 50, 100, 1000 and 10,000. Also shown is the variance of the phylogenetic estimator from Fu (1994b) for n=50.

FIG. 2. Comparison of the confidence intervals associated with a phylogenetic estimator (Fu, 1994b) and with the Watterson estimator. The first is derived by assuming the phylogenetic estimator to follow a Gaussian distribution and using the variances given in Fu (1994b). The second is derived by using the explicit formula of the distribution of the Watterson estimator (Tavare, 1984).

241

Study of the Watterson Estimator

the parameter estimated. It is derived from the following equality, which is true if normality is confirmed,

_

P % &q : <

% n &% - %(1+%b n a n )a n

&


with q : defined above. Solving in %, this leads to the % n dependent interval that satisfies P %[I g(% n ) % %]=1&:;

I g(% n )=

% n &q : " n

% n +q : " n

_(1&q b a ) , (1&q b a )& 2 :

n

2 n

2 :

n

2 n

(13)

with

" n = - v^ n(% n )+q 2: 4a 2n +q : 2a n .

(14)

FIG. 3. Empirical density of Watterson estimator plotted from the explicit form of the distribution (Tavare, 1984). Gaussian approximation with same mean and variance. Gamma approximation with same mean and variance. %=10, n=50, and 500.

242 To assert the inadequacy of the Gaussian approximation, we compare these confidence intervals to the one obtained with a gamma approximation. We choose the gamma distribution because it takes only positive values and is disymmetric with a mode smaller than its mean and a heavy tail. In practice, the empirical distribution of

Klein, Austerlitz, and Laredo

the Watterson estimator (obtained by computer simulations) confirms that the gamma is quite appropriate (see Fig. 3). A gamma distribution 1(:, *) is given by the density * : } e &*t } t :&1 1 [t0] , 1(:)

FIG. 4. Confidence intervals plotted as functions of the observation %. The diagonal is the estimation %=%. The following are plotted: the symmetrical interval I based on a Gaussian approximation (see (10) and (11)), the exact interval I g based on a Gaussian approximation (see (12)(14)), the confidence interval I G based on a gamma approximation, and the ``empirical'' interval derived by using the explicit formula of the Watterson estimator (Tavare, 1984). :=5 0; q : =1.96; and n=50 and 500. For n=50, the bootstrap interval is included.

243

Study of the Watterson Estimator

where 1 is the Euler gamma function 1(x)= x&1 &t  e dt. Its mean and variance are respectively 0 t :* and :* 2. So, fitting (:, *) to have the mean % and variance given by (4) leads to the parameters

* n(%)=

an 1+%b n a n

: n(%)=%* n(%)=

Empirical Probabilities That the Different Confidence Intervals Contain the True Value %

n

I

Ig

IG

2

50 100 500

0.8887 0.8983 0.9263

0.9479 0.9517 0.9560

0.9594 0.9518 0.9373

5

50 100 500

0.9176 0.9214 0.9414

0.9538 0.9528 0.9561

0.9542 0.9524 0.9500

10

50 100 500

0.9148 0.9358 0.9336

0.9575 0.9548 0.9501

0.9533 0.9545 0.9532

15

50 100 500

0.9274 0.9294 0.9446

0.9561 0.9540 0.9557

0.9590 0.9575 0.9594

20

50 100 500

0.9266 0.9378 0.9460

0.9527 0.9514 0.9537

0.9565 0.9562 0.9576

and (15)

%a n . 1+%b n a n

The confidence interval I G(% n )=[% min , % max ] is then obtained by calculating the lower and upper bounds % min and % max as the results of the equations

P %max(X<% n )=0.025

TABLE I

and P %min(X<% n )=0.975,

where X is a random variable following the parametrized distribution 1(: n(%), * n(%)) and % n the observed value. These equations are directly solvable by using, for example, the software Mathematica (Wolfram Research). We also compute a bootstrap interval following the Bca algorithm described in DiCiccio and Efron (1996). Last, we build the empirical confidence interval. This is a costly task because we have to compute over many values of % a confidence interval for % n (either by simulations or by using the recursive formula of Hudson (1990) or the explicit expression of Tavare (1984)). Then, by carrying out a symmetrical transformation around the diagonal, we deduce the confidence intervals for % as a function of % n . These three proposed intervals, together with the empirical confidence interval obtained from simulations, are plotted in Fig. 4 for different values of the ``observation'' % n , for different sample sizes, and for :=5 0 (q : =1.96). The diagonal is plotted to represent the estimation, given the observation. Another criterion must also be taken into account. Table I presents the score of the intervals; from simulation 10,000 confidence intervals for many values of the parameter % and counting the number of these intervals that contained the true value. Table I confirms the benefit of considering disymmetric intervals: the symmetric interval performs poorly, in most cases not attaining a score of 95 0. The normal and gamma approximations both give results significantly better than 950. However, the Gaussian confidence

Note. Results were obtained with 10,000 simulations. I is defined in (10) and (11). I g represents the normal approximation (12)(14) and I G is obtained with the gamma approximation. The sampling of 10,000 values induces a standard error equal to 0.0021.

interval is unquestionably larger than the gamma one (Fig. 4). The bootstrap interval is as wide as the Gaussian one and more difficult to calculate. Among these confidence intervals, the best estimate of the empirical confidence interval is the one obtained by a Gamma approximation. It actually reaches the goal of 95 0 and is the narrowest. The Gaussian approximation therefore appears inadequate for estimating the confidence interval around the Watterson estimator. Although the use of a gamma approximation is not optimal, it enables us to illustrate this inadequacy and furthermore to provide a confidence interval around the Watterson estimator which is easy to calculate from real data.

5. DISCUSSION The Watterson estimator is an extensively used estimator of the parameter %, which has stimulated much work about its accuracy. We therefore studied the properties of this estimator. We showed in particular the importance of dealing with distributions and not only means and variances, and of using asymptotics (here n, the sample size, goes to infinity).

244 The Watterson estimator is based on a small part of the information contained in a coalescent tree: the segregating sites of the final sample of genes. To assess the efficiency of this estimator we considered the complete statistical model (observation of the coalescence times and the intermediate numbers of mutations) and studied its statistical properties. We first proved that the complete model is locally asymptotically normal (LAN) and evaluated its rate of approach to the asymptotics and Fisher information. We thus deduced that the best estimators of % that can be built from the whole tree D satisfy the Central Limit Theorem, - log n(% n &%) w Ä N(0, %), and that the Watterson estimator does satisfy this theorem. It follows that the Watterson estimator is asymptotically efficient, whether the observation is the whole coalescent or only the final polymorphism. In fact it is asymptotically equivalent to the maximum likelihood estimator (their difference converges in probability to zero at the rate - log n) which is a stronger result than asymptotic efficiency (not shown here). The LAN property also provides a rate of convergence - n for the efficient estimation of N. This result is in accordance with the results in Felsenstein (1992) based on the variance. However, reaching this rate requires unobservable knowledge: if the exact coalescence times in units of generations could be known a maximum likelihood estimator of N based only on coalescence times could be built. It would indeed converge in distribution at the rate - n. Similarly, if the mutation rate + and the exact coalescence times in units of + generations were known (which is the case adopted by Felsenstein (1992)) the same rate could obviously be reached. In practice, however, it is absolutely impossible to know the exact coalescence times. They can be estimated only from the number of mutations observed. As soon as one needs to estimate the coalescence times from mutational events the estimation of N is slowed to the rate - log n, whether + is known or not. This was shown by Fu and Li (1993, Eq. (28)) with + unknown by calculating the asymptotic variance when only the mutational events are observable. The same analysis can be made for + known. Consequently, it seems impossible to exceed the slow convergence rate - log n and the way out in terms of asymptotic improvement should be in intrasequence recombinations. Indeed, Pluzhnikov and Donnelly (1996) showed that a non-zero rate of recombination at each site allows consistency when the number of sites is used as the asymptotics. Since the rate of convergence is too slow to concern practical sample sizes, asymptotic properties provide

Klein, Austerlitz, and Laredo

very bad approximations (in a model with rate - n, approximations are valid for n larger than 20 or 30, which would lead to n larger than 10 9 or 10 13 in a model with rate - log n !). Other authors have shown that in the range of usual sample sizes the variance of the Watterson estimator is far from the CramerRao bound (Fu and Li, 1993, Eq. (23)). Besides, this inefficiency of the Watterson estimator for fixed sample size is not in doubt since at least two algorithms provide estimators with lower standard deviations: the UPBLUE from Fu (1994b), based on an UPGMA reconstruction, and COALESCE, an algorithm based on MetropolisHastings sampling (Kuhner et al., 1995). These estimators, even if they have smaller variances and thus more narrow confidence intervals, are submitted to the Central Limit Theorem (6) which is an intrinsic lower bound whatever the observations. Consequently, they are also influenced by the slow rate of convergence, and their distribution may be as affected as that of the Watterson estimator. Indeed, on the basis of simulated empirical distribution and confidence intervals, we illustrated the inadequacy of the Gaussian distribution as an approximation for fixed sample size. First we showed that the standard symmetric confidence intervals, based on an estimation of the standard error, were not exact because the variance depends too much on the parameter estimated. Second, the exact confidence interval derived from the Gaussian approximation was quite disymmetric and exact but too wide compared to empirical confidence intervals. Though converging faster than the standard intervals (see DiCiccio and Efron (1996)), bootstrap methods did not provide narrower intervals. An approximation of the real distribution by a gamma distribution showed the possibility of quickly calculating some realistic confidence intervals (disymmetric, confined, and exact). Two implications arise. First, standard deviation, even if informative, is not the most adequate criterion for comparing the performances of estimators. Second, a standard recipe for deriving the confidence intervals from the standard deviation leads to erroneous intervals. Added to the poor performances of bootstrap methods, it appears that the proposal of confidence intervals associated with new estimators should be necessary for both theoreticians and experimenters. All these results well illustrate the benefit in considering the distributions of the estimators and not only their means and variances. This model is valuable in exhibiting such behavior since it presents some smooth asymptotic properties (LAN) but a very slow rate of convergence, which precludes the traditional approximations for a fixed sample size.

245

Study of the Watterson Estimator

APPENDIX:

Convergences Concerning f n

Estimation of (+, 4N ), when Observing L 2 , L 3 , ..., L n and T 2 , T 3 , ..., T n , Is a Locally Asymptotically Normal Model

\m, m(m&1) T m follows an exponential distribution with parameter 14N, and T 2 , T 3 , ..., T n are independent variables. f n n is the empirical mean of n i.i.d. variables. The law of large numbers gives

With observations L 2 , L 3 , ..., L n and T 2 , T 3 , ..., T n , and parameter (+, 4N), the log-likelihood is

1 P Ä E[m(m&1) T m ]=4N fn w n

log l n(m, 4N) so,

=c n &(n&1) log(4N)& f n (4N)&me n +K n log m c n only depends on the observations

{

1 2 f n 4N&(n&1) P w Ä (4N) &2. n (4N) 2

\

n

e n = : mT m m=2

+

And the standard Central Limit Theorem leads to

n

fn = : m(m&1) T m . m=2

-n

Thus, one can write

\

1 D Ä N(0, Var(m(m&1) T m )) f n &4N w n

+

D

w Ä N(0, (4N) 2 ) log

l n(++h(log n) &12, 4N+kn &12 ) l n(+, 4N)

=(h k)

\

so,

(log n) &12 (K n +&e n ) 1 & (h &12 2 ( f n (4N) &(n&1)4N) 2 n

+

0 K n(log n) &1 + &2 0 n &1(4N) &2 (2f n (4N)&(n&1))

\ h _ \k + _

1

fn

n&1

& w Ä N(0, (4N) 4N + - n \ (4N)

k)

D

&2

2

+

).

Weak Convergence of K n + &e n The Laplace transform of (T m , L m ) is given by

and we need to prove that 8 Tm , Lm(u, v)=

\

(log n) &12 (K n +&e n ) n &12( f n (4N) 2 &(n&1)4N) D

w ÄN

\

\\ 0,

4N +

0

0 (4N)

&2

K n(log n) &1 + &2

4N w Ä +

\

0

0

0 (4N) &2

+

Thus, the Laplace transform of L m +&mT m is

++

2 f n (4N)&(n&1) n(4N) 2

0

P

+

m(m&1)4N . m(m&1)4N&u&m+(e v &1)

8 Lm +&mTm(u)=8 Tm , Lm(&mu, u+) =

+

m&1 , m&1+4Nu+4N+(1&e u+ )

and log 8 Lm +&mTm(u)

.

\

=&log 1+

4N+ u 1+ &e u+ m&1 +

\

++ .

246

Klein, Austerlitz, and Laredo

Developing in u=0, we get the cumulants

C Lk m +&mTm =

Asymptotic Independence between K n + &e n and ( f n  (4N) 2 )&(n&1) 4N

4N+ 4N+ &k 1 Pk + + 2 m&1 (m&1) m&1

\

+

for k2 and with P k a polynomial function. The first cumulant always equals zero. Then, using the independence between (T m , L m ), we get when nÄ C Kk n +&en =4N+ 1&ka n +O(1)t4N+ 1&k log n, and n C log k

&12(K +&e ) n n

t4N+ 1&k(log n) 1&k2.

Calculating the Laplace transform of

\

Lm m(m&1) T m &(4N) &1 &mT m , + (4N) 2

+

leads to 8 m(u, v) =8 Tm, Lm(&um+vm(m&1)4N, u4N) e &v4N e &v4N (m&1) 4N = v(m&1) (m&1) +u& &+(e u4N &1) 4N (4N) 2 =(1&v4N+o(v 2 ))

The only cumulant converging toward a non-zero limit is the second, which converges to 4N+. This is sufficient to prove the following asymptotic normality: Kn 4N D Ä N 0, . &e n w + + - log n 1

\

+

\

+

u(4N&+) v & +o 4N (m&1) . _ 2 u(4N&+) v & +o +o + 4N (m&1) 1+

\\

\

+

+

+

The ``uv'' term from the Taylor expansion (u, v)=(0, 0) is the covariance and equals

at

1&4N+ 1&4N+ 1&4N+ &2 =& . m&1 m&1 m&1

Strong Convergence of K n + 2 log n We need the Laplace transform of L m :

Then, summing on m leads to 8 Lm(u)=

m&1 m&1&4N+(e v &1)

\

cov (log n) &12 (K n  +&e n ), n &12

fn

n&1

\(4N) & 4N ++ 2

=&(1&+4N) a n(log n) &12 n &12

so 4N+ 4N+ C Lk m = + m&1 m&1

\

2

+ \ Pk

4N+ m&1

+

and C Kk n +

2 log n

t4N+ 1&2k(log n) 1&k.

Only the first cumulant does not converge to zero when n Ä ; this is a strong convergence: Kn P 4N w Ä . + 2 log n +

t &(1&+4N)

- log n -n

Ä 0.

Since asymptotic normality is established, the convergence of the covariance to zero is sufficient to prove asymptotic independence. The following convergence is thus proved:

\

(log n) &12 (K n +&e n ) n ( f n (4N) 2 &(n&1)4N) &12

D

w ÄN

\

4N +

0

0

(4N) &2

++

.

+

247

Study of the Watterson Estimator

And this leads to the property that induces the model to be LAN, log

l n(++(log n) &12 h, 4N+n &12k) l n(+, 4N) 1 =(h k) 2 n, +, 4N & (h 2

k) I +, 4N

h

\k+ +o (1) P

with D

2 n, +, 4N w Ä N(0, I +, 4N ) and the information 4N 0 I +, 4N = + . &2 0 (4N)

\

+

ACKNOWLEDGMENTS We thank M. Slatkin, F. Rodolphe, J. Shykoff, C. Lavigne, P. H. Gouyon, and four anonymous reviewers for their valuable comments on the manuscript. This work is partly financed by a FCPR grant from French Ministere de l'Agriculture to F.A.

REFERENCES Athreya, K. B., and Ney, P. E. 1972. ``Branching Processes,'' SpringerVerlag, Berlin. Austerlitz, F., Jung-Muller, B., Godelle, B., and Gouyon, P. H. 1997. Evolution of coalescence times, genetic diversity and structure during colonization, Theor. Popul. Biol. 51, 148164.

Dacunha-Castelle, D., and Duflo, M. 1987. ``Probability 6 Statistics,'' Springer-Verlag, New York. DiCiccio, T. J., and Efron, B. 1996. Bootstrap confidence intervals, Stat. Sci. 11, 189228. Felsenstein, J. 1992. Estimating effective population size from samples of sequences: Inefficiency of pairwise and segregation sites as compared to phylogenetic estimates, Genet. Res. 56, 139147. Fu, Y.-X. 1994a. Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences, Genetics 138, 13751386. Fu, Y.-X. 1994b. A phylogenetic estimate of effective population size or mutation rate, Genetics 136, 685692. Fu, Y.-X., and Li, W.-H. 1993. Maximum likelihood estimation of population parameters, Genetics 134, 12611270. Hudson, R. R. 1990. Gene genealogies and the coalescent process, Oxford Surv. Evol. Biol. 7, 144. Kimura, M. 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations, Genetics 61, 893903. Kingman, J. F. C. 1982a. The coalescent, Stoch. Proc. Appli. 13, 235248. Kingman, J. F. C. 1982b. On the genealogies of large populations, J. Appl. Probab. A 19, 2743. Kuhner, M. K., Yamato, J., and Felsenstein, J. 1995. Estimating effective population size and mutation rate from sequence data using MetropolisHastings sampling, Genetics 140, 14211430. Pluzhnikov, A., and Donnelly, P. 1996. Optimal sequencing strategies for surveying molecular genetic diversity, Genetics 144, 12471262. Slatkin, M. 1995. A measure of population subdivisions based on microsatellite allele frequencies, Genetics 139, 457462. Slatkin, M., and Hudson, R. R. 1991. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing population, Genetics 129, 555562. Takahata, N., and Slatkin, M. 1990. Genealogy of neutral genes in two partially isolated populations, Theor. Popul. Biol. 38, 331350. Tavare, S. 1984. Line-of-descent an genealogical processes and their applications in population genetics models, Theor. Popul. Biol. 26, 119164. Van der Vaart, A. W., and Wellner, J. A. 1996. ``Weak Convergence and Empirical Processes, with Applications to Statistics,'' SpringerVerlag, New York. Watterson, G. 1975. On the number of segregation sites, Theor. Popul. Biol. 7, 256276.