Quasi-Equilibrium Theory for the Distribution of Rare Alleles in a Subdivided Population: Justification and Implications

Quasi-Equilibrium Theory for the Distribution of Rare Alleles in a Subdivided Population: Justification and Implications

Theoretical Population Biology 57, 297306 (2000) doi:10.1006tpbi.2000.1453, available online at http:www.idealibrary.com on Quasi-Equilibrium The...

193KB Sizes 0 Downloads 76 Views

Theoretical Population Biology 57, 297306 (2000) doi:10.1006tpbi.2000.1453, available online at http:www.idealibrary.com on

Quasi-Equilibrium Theory for the Distribution of Rare Alleles in a Subdivided Population: Justification and Implications Tom L. Burr 1 Los Alamos National Laboratory, Los Alamos, New Mexico 87545 Received April 26, 1999

This paper examines a quasi-equilibrium theory of rare alleles for subdivided populations that follow an island-model version of the WrightFisher model of evolution. All mutations are assumed to create new alleles. We present four results: (1) conditions for the theory to apply are formally established using properties of the moments of the binomial distribution; (2) approximations currently in the literature can be replaced with exact results that are in better agreement with our simulations; (3) a modified maximum likelihood estimator of migration rate exhibits the same good performance on island-model data or on data simulated from the multinomial mixed with the Dirichlet distribution, and (4) a connection between the rare-allele method and the Ewens Sampling Formula for the infinite-allele mutation model is made. This introduces a new and simpler proof for the expected number of alleles implied by the Ewens Sampling Formula. ] 2000 Academic Press

the time scale of interest, the population allele frequencies were constant (stationary). In contrast, Slatkin's (1985) rare-allele method was developed in studies where each new mutation was assumed to be a new allele (the infinite-allele model). A given allele is produced by mutation only once and is subject to loss by genetic drift and by mutating to a new allele. At any particular time, the distribution of alleles in a completely random (unsubdivided) population has been studied for the infinite-allele model. One key result is the Ewens Sampling Formula (ESF) and its expected number of unique alleles in a sample of size n. In the infinite-allele model there is no true equilibrium distribution for any allele (the ultimate fate of each allele is extinction). Therefore, Barton and Slatkin (1986) proposed a quasi-equilibrium theory in which each allele is in equilibrium with respect to the effects of migration. The theory was shown by simulation to explain the observed inverse relation between gene flow (migration rate among partially isolated populations) and rare-allele frequencies.

INTRODUCTION Natural populations are often subdivided into partially genetically isolated colonies (demes). The frequency of each allele at a given locus in each deme can be used to estimate the extent of migration (genetic exchange) among demes. If the migration pattern has been in effect for several generations, the frequency of a given allele tends to be more similar in each deme when the migration rate is high. Exploiting that fact in a novel way, Slatkin (1985) introduced a method to estimate migration rates that uses only the alleles that have low population frequency (rare alleles). Wright (1931) introduced the FST statistic that can also be used to estimate migration rates. However, Wright's island model assumed that over 1 I gratefully acknowledge support from the Florida State University Statistics Department and the Los Alamos National Laboratory under contract to the U.S. Department of Energy. I especially thank Professors Duane Meeter and Joseph Travis, both at Florida state University.

297

0040-580900 K35.00 Copyright ] 2000 by Academic Press All rights of reproduction in any form reserved.

298 This paper examines the quasi-equilibrium theory of rare alleles for subdivided populations in more detail. All mutations are assumed to create new alleles. We present four results. First, the conditions for the quasi-equilibrium theory to apply are formally established. We use properties of the moments of the binomial distribution, thereby confirming current simulation-based conclusions and providing a simpler alternative to diffusion theory for this and similar situations. Second, we demonstrate that approximations currently in the literature can be replaced with exact results that are in better agreement with our simulations. Third, when the quasi-equilibrium theory holds, allele frequencies evolving under the island model will be well approximated by a multinomial distribution mixed with a Dirichlet distribution. This leads to a new estimator of migration rate that we compare to established estimators in simulated data. We find that a modified maximum likelihood method outperforms both Slatkin's rare-allele method and moments-based estimators (such as F ST or G ST ) in simulated data following the island model. We have found that this same modification (reduction of the large number of alleles to a modest number) is also necessary for good estimation of the sum of the Dirichlet parameters when sampling from the multinomial mixed with the Dirichlet distribution. Given the restrictive assumptions of the island model of evolution, we believe that qualitative estimates of migration rates are appropriate. We are therefore not recommending that the current trend of using F ST or G ST to estimate migration rates be replaced by the modified likelihood method. Instead, the good performance of the modified likelihood method provides complementary support to our theoretical results that the quasi-equilibrium theory explains the distribution of allele frequencies. Further, this adds to our understanding of why both the rare-allele method and the F ST methods are adequate for data that follows (or nearly follows) the island model of evolution with mutation to new alleles. Fourth, a connection between the rare-allele method and the Ewens Sampling Formula for the infinite-allele mutation model is made. This introduces a new and simpler proof for the expected number of alleles implied by the Ewens Sampling Formula. Section 2 gives the background for the island model of evolution. Section 3 describes our simulations. Section 4 formalizes the conditions for the quasi-equilibrium theory to be valid. Under that theory, the quasi-stationary distribution of the allele frequencies is the same beta distribution that arises in Wright's island model. Section 5

Tom L. Burr

introduces the exact limiting result for the average frequency of alleles found in i populations as a function of the migration rate to replace the approximate limiting result in Barton and Slatkin (1986). Section 6 uses the result in Section 5 to present a new and simple proof of the ESF. Section 7 summarizes results of comparing candidate estimators of migration rate. Section 8 is a summary.

THE ISLAND MODEL Assume there are N diploid individuals in each of d demes and that generations follow the WrightFisher (WF) model of evolution with non-overlapping generations and neutral mutation rate + to new alleles (the ``infinite-allele'' model, Kimura and Crow, 1964). Let p t, i =X t, i 2N be the relative frequency of allele A1 in deme i. Let p t be the average relative frequency of A1 in the entire population. In deme i, migration and mutation change p t to p$, where p$i =((1&m) p i, t +mp t )(1&&).

(1)

Random sampling to form the next generation leads to p i, t+1 = p$i +e t+1 ,

(2)

where Ne t+1 tBinomial(N, p$)&Np$. In general, we use the notation XtG to denote that the random variable X has distribution G. The same model is assumed to apply to each of the d demes.

SIMULATION METHOD We assume the population evolves according to Eqs. (1) and (2) and specify N (the number of diploid individuals), d (the number of demes), & (the mutation rate to new alleles), m (the migration rate), and %=Nm (the number of immigrants into each deme each generation). The expected number of mutations per generation in the population is 2Nd&, which we assume to be approximately equal to one. Throughout this paper we use the ``big-oh'' notation, where for example, O(x)x is approximately equal to 1. For example, here we write 2Nd&=O(1). Elsewhere we write O(1(2N ) 2 ) to be any number that is approximately equal to 1(2N) 2. So far we have discussed the allele A1, but we do the same adjustment for migration and mutation for all alleles present in generation t.

299

Distribution of Rare Alleles

Our simulations are performed in Fortran-77 on a Sun Sparc 10 using 2N = 32 to 512, %=0.1 to 12.8, d2; usually d=50 or 100 because Slatkin and Barton (1989) used 50 to 100 demes and because we want to sample at least d s =10 demes. See Burr (1992) for further simulation details.

+

k

\ 2+ (1&m)

k&2

p k&2 E(e 2t+1 ) t

+O(1(2N ) 2 )

We will first consider a version of the island model in which there is a true equilibrium. Assume there is one deme with N individuals and % migrants coming from some large mainland where the relative frequency of allele A1 is a constant, say p. Let p t be the relative frequency of A1 in the deme in generation t, which has some distribution F t . We model migration with the equation (3)

and we model reproduction by random mating with the equation p t+1 = p$+e t+1 ,

E( p kt+1 | p t )=(1&m) k p kt

+k(1&m) k&1 p k&1 mp t

THE QUASI-EQUILIBRIUM ARGUMENT

p$=(1&m) p t +mp,

Ee 2t+1 comes from using p t (1& p t )(2N ) instead of p$(1& p$)2N. Conditional on p t we then have

(4)

where 2Ne t+1 tBinomial(2N, p$)&2Np$. Equations (3) and (4) imply that [ p t , t=1, 2, ...] is an irreducible, aperiodic Markov chain with a finite state space. Therefore, there is a stationary distribution for p t , say F( } ), with moments + k =E( p k ), where E denotes expected value with respect to F. We then have the following result. Theorem 1. For k1 as N Ä , with Nm=%= O(1), the moments of F satisfy + k =+ k&1(4Nmp +k&1)(4Nm+k&1)+O(1(2N ) 2 ). (5) Proof. For present purposes we can assume that the initial probability distribution of p t is the stationary distribution F. Then for any t, F( } ) is the distribution of p t . We obtain Theorem 1 by solving for the k th moment of Eq. (4) as follows. E( p kt+1 )=E( p$+e t+1 ) k, where 2Ne t+1 tBinomial(2N, p$)&2Np$, and p$ is given by Eq. (3). In view of the assumption on N and Nm, we may consider m to be O(1N ). From the moments of the Binomial(2N, p$), it follows that Ee t+1 =0, Ee 2t+1 = p t (1& p t )2N+O(1(2N ) 2 ), and Ee kt+1 =O(1(2N) 2 ) for k2. The O(1(2N ) 2 ) term in the expression for

=(1&km) p kt +p k&2 p t (1& p t ) t

k

\2+<2N

+kmpp k&1 +O(1(2N ) 2 ) t

\

k

\2+<2N + p k + kmp + \ \ 2+<2N + p

= 1&km&

+O(1(2N ) 2 ).

k t

k&1 t

(6)

Even though Ee kt+1 =O(1(2N ) 2 ) for k2, and Ee t+1 =0, the other moments must be considered to show that the neglected terms remain O(1(2N) 2 ) as claimed in Eq. (6). See Burr (1992) for the complete proof. Here, we have only shown terms involving Ee 2t+1 . This same remark applies to our proof of the quasiequilibrium theorem. One feature of F is that each of its moments must remain unchanged from generation t to generation t+1. Using that fact, we may compute the expectation with respect to F( } ) in Eq. (6) to obtain + k = (1 & km & ( k2 )2N ) + k + (kmp + ( k2 )2N ) + k & 1 +O(1(2N ) 2 ). We then have + k =+ k&1(kmp +( k2 )2N ) (km+( k2 )2N )+O(1(2N ) 2 ). The result follows. Note that the beta(: 1 , : 2 ) distribution with : 1 +: 2 = 4% and mean +=: 1 4% has moments + k which satisfy + k =+ k&1(: 1 +k&1)(4%+k&1). From Eq. (5) we see that this means that all moments of the distribution of F approach those of the beta(: 1 , : 2 ) distribution as N Ä , where : 1 =4%p and : 1 +: 2 =4%=4Nm. Let G denote the beta(: 1 , : 2 ) distribution with : 1 =4Nmp and : 1 +: 2 =4Nm. Let F N denote the stationary distribution of p t , which we previously denoted F. We have the following known result. For example, see Ewens (1979), where the result is derived in an equivalent setting using diffusion theory (which assumes that N is large and m is moderate, so that Nm is O(1)). Because the method of proof is new, we present it as a theorem. We write d F N Ä G to denote that F N converges in distribution to G.

300

Tom L. Burr d

Theorem 2. F N Ä G Nm=%=O(1).

as

N Ä ,

mÄ0

with

Proof. All moments of both F N and G are finite, and because G has support on (0, 1), G is uniquely determined by its moments. Therefore, given Eq. (5), Theorem 2 is an immediate consequence of the fact if each of the moments of distribution A converges to the respective moments of distribution B, then distribution A converges to distribution B. (See, for example, Theorem 4.5.5 of Chung (1974)). We have derived the beta(: 1 , : 2 ) distribution as the approximate stationary distribution of the p t , without appealing to standard diffusion theory, but by using a moment argument. If there are k alleles then the beta (: 1 , : 2 ) is replaced with the Dirichlet(: 1 , : 2 , ..., : k ), where : 1 , : 2 , ..., : k are defined in the obvious way. If we sample n alleles from a given deme, the observed frequencies follow a multinomial mixed with this Dirichlet distribution. Rannala (1996) and Rannala and Hartigan (1996) presented the Dirichlet distribution in the same context (the island model), but used it differently than presented here. In our version of the island model, given by Eqs. (1) and (2), there is no stationary distribution for any particular allele because the ultimate fate of each allele is extinction. However, there can be an equilibrium with respect to the effects of migration, which we call quasiequilibrium following Slatkin and Barton (1986). The quasi-equilibrium assumption will now be considered more closely for four reasons. First, Barton and Slatkin (1986) provided a justification for the assumption by graphically comparing data simulated according to the island model to what would be expected under the quasi-equilibrium distribution. We will add analytical support for the assumption. Also, Barton and Slatkin (1986) presented approximations that can be replaced with an exact result that agrees better with simulated data. Second, Slatkin and Barton (1989) studied the performance of estimators of % that have been proposed. One estimator is the maximum likelihood estimator (MLE), which performed very badly in their simulations. The MLE that they used makes the same appeal to the beta that is made in the quasi-equilibrium argument, so it is puzzling that a method that uses only private alleles would outperform the MLE, which uses all the alleles. Third, based on extensive simulation studies, Slatkin (1985) published a least-squares method for using rare alleles to estimate migration rates. Currently, to apply the method it is necessary to extrapolate somehow if the sample size in each deme is not either 20, 50, or 100, because Slatkin

(1985) published estimates only for those three sample sizes. It will be shown that the least-squares method is not needed. A slightly different estimator of % using private alleles that automatically accounts for sample size will be introduced. Fourth, we will show that the rare-allele method has a simple connection with the expected number of alleles in the infinite-allele model. Our main result is that for the model defined by Eqs. (1) and (2), we have Theorem 3. )_$+O(1(2N ) 2 ), E( p kt )=E( p k&1 t

(7)

where $=(4Nm$p t +k&1)(4Nm+k&1). Proof. The proof is very similar to the proof of Theorem 1 so will not be repeated here. We do note here that even though Ee kt+1 =O(1(2N ) 2 ) for k2, and Ee t+1 =0, the other moments must be considered to show that the neglected terms remain O(1(2N) 2 ) as claimed in Eq. (7). See Burr (1992) for the complete proof. Recall that m$=m+&. In view of Theorem 2, we see that we need m$rm for the beta(: 1 , : 2 ) distribution to be a good approximation. Let F N denote the distribution of p t when Eq. (11) holds. Let G denote the beta(: 1 , : 2 ) distribution with : 1 =2Nm$p t , : 1 +: 2 =Nm$. We have the following result. d

Theorem 4. F N Ä G as N Ä , m Ä 0 with %= Nm=O(1). Proof. Theorem 4 is an immediate consequence of Eq. (7) in view of Theorem 2. Remark 1. We note that the mutation rate, &, appears as a nuisance parameter because our goal is to estimate %, and we cannot assume that & is known. To avoid this issue, we make the biologically reasonable assumption that &<
301

Distribution of Rare Alleles

Remark 3. Again using moment arguments, it can be shown (Burr, 1992) that the ratio of moment k to moment k&1 is expected to move toward a quasi-equilibrium value at a rate determined by 1&km$&( k2 )2N. At quasiequilibrium, we require t to be sufficiently large that (1&2m$&12N ) t r0. We are waiting until the initial relative frequencies of a given allele do not matter. Any allele that is born after these initial conditions are forgotten can be considered to be in quasi-equilibrium after only one generation of migration.

THE RARE-ALLELE METHOD By the quasi-equilibrium argument, it is reasonable to assume that p 1, t , p 2, t , ..., p d, t have approximately the beta(: 1 , : 2 ) distribution with : 1 =4Nmp t and : 2 = 4Nm(1&p t ). If either Nm or p t is low, then because the beta distribution puts no mass on 0, the beta could be modified by adding a point mass at 0. Assume n< 0 to get

P(X= j | X>0)=

2n : [1 j] : 2[2n& j] (4%) [2n] ) j (4%) [2n](1&: [2n] 2

\ +

(8)

for j=1, 2, ..., 2n, and where x [2n] =x(x+1)(x+2) } } } (x+2n&1). Next, if an allele is found in only one of many sampled demes assume that : 1 is small. Formally, consider the limit of Eq. (8) as : 1 Ä 0 and we have lim P(X= j | X>0)=

:1 Ä 0

(2n) ( j ) j(4%+2n&1) ( j ) C(n, 4%)

(9)

for j=1, 2, ..., 2n, and where x (2n) =x(x&1)(x&2) } } } (x&2n+1), and C(n, %)=1%+1(%+1)+ } } } + 1(%+2n&1). This result follows from l'Ho^spital's rule and straightforward algebra. We have arrived at a distribution with only the one parameter, %. Because the goal is to estimate %, this is a potentially useful approach. The advantage is that the nuisance parameter p t has been in effect eliminated. The disadvantage is that only rare alleles qualify to be used. Now let p=Xn. We have the following two results. The first moment of the limiting conditional distribution given in Eq. (9) satisfies 1E( p | p>0) =4%4%+4%(4%+1)+ } } } 4%(4%+n&1).

(10)

The second moment of the limiting conditional distribution given in Eq. (9) satisfies E( p 2 | p>0)=E( p | p>0)(1+4%n)(4%+1).

(11)

Equations (10) and (11) follow by using l'Ho^spital's rule and straightforward algebra to compute the appropriate limits. Barton and Slatkin (1986) presented approximate results for the cases where %>0.5 and %<0.5. We see no advantage in using these approximate results in place of the exact result given in Eq. (10). Their Fig. 4 compared the observed (from simulated data) p(1) versus n to that predicted using their approximate results. The approximate analytical result systematically overestimated the observed p(1), whereas in our results (Fig. 1), both the exact and the approximate results underestimate the observed data. However, the exact result agrees more closely with the observed data for the Nm=0.32 case. Equation (10) provides a way to estimate %, using E( p | p>0). But it is not clear what should qualify an allele to be considered rare enough to appeal to the limiting result in Eq. (10). The parameter : 1 in Eq. (9) is given by : 1 =4%p t , where %=Nm. Therefore, : 1 is small if % is small or if p t is small. So, we could insist that a ``rare allele'' have a low population frequency, and then use the

302

Tom L. Burr

that K alleles are possible at the locus under study, and that the mutation rate from any allele to any other allele is &K. Then the approximate stationary distribution for the joint density of ( p 1 , p 2 , ..., p K ) is the Dirichlet (: 1 , : 2 , ..., : K ), where : 1 =: 2 = } } } =: K =#K. A sample of size n from this distribution follows the MCD distribution. And, if K Ä , then Eq. (10) holds for the expected relative frequency, say Z i =X i given that X i >0 of a particular allele. Therefore, Eq. (10) implies that 1E(Z i ) =##+#(#+1)+ } } } #(#+n&1). We then have 1=  ki=1 Z i =E  ki=1 Z i =E(k) E(Z), where the last equation follows from elementary conditional probability rules. It should be noted that Watterson (1976) began with the MCD with K alleles, worked with the allele frequencies in the sample ordered from largest to smallest, and arrived at the ESF by letting K Ä . Our proof is simpler, requiring only simple calculus (l'Ho^spital's rule applied to the MCD as K Ä ) and elementary conditional probability.

ESTIMATING THE MIGRATION RATE FIG. 1. Dependence of p(1) on n for Nm=3.2 (lower) and Nm=0.32 (upper) from 100 replicates.

second moment as a second criterion. Slatkin (1985) used the arbitrary but effective criterion that ``rare alleles'' are alleles that are found in only one of the sampled demes.

CONNECTION BETWEEN THE RAREALLELE METHOD AND THE ESF Readers who are familiar with the Ewens (1979) result for E(k), the expected number of alleles in a random sample of size n diploids from the infinite-allele model, will notice a similarity to our Eq. (10). Ewens considered one large randomly mating population. Here we could use a population of size dN and let #=4dN& be the expected number of mutations per generation. Ewens result is E(k)=##+#(#+1)+ } } } +#(#+2n&1). Note the agreement with Eq. (10), with #=4dN& replacing 4%=4Nm. It will now be shown that this is not a coincidence. Because the method of proof is new, we present this known result for E(k) as a theorem. Theorem 5. Let k be the number of alleles in a random sample of size n from the ESF for the infinite-allele model. Then E(k)=##+#(#+1)+ } } } +#(#+n&1). Proof. To derive the conditional average frequency of rare alleles, the parameter p t was allowed to approach 0. Now assume that the population is not subdivided,

File: 653J 145306 . By:XX . Date:04:05:00 . Time:10:42 LOP8M. V8.B. Page 01:01 Codes: 5146 Signs: 3809 . Length: 54 pic 0 pts, 227 mm

Assume there is a collection of demes that collectively obey the island model. We report results for Nm=0.128, 1.28, and 12.8. Our procedure for sampling was to wait approximately Nd generations before taking the first sample and then sample approximately every 100 generations for large values of Nm and approximately every 250 generations for small values of Nm. See Burr (1992) for more detail about the time to equilibrium and the gaps used between samples to minimize serial correlations that make performance of estimators more difficult to study. Each sample is intended to represent typical data from one locus. Slatkin and Barton (1989) usually assumed that data from about 10 loci were available. Here, each replication consists of data from 10 loci. Assume that the allele type is identified in each of n<
303

Distribution of Rare Alleles

method motivated by Mosimann (1962), maximum likelihood (ML), and Slatkin's (1985) private-allele method. See Burr (1992) for a complete description of all five methods. But all methods have sub versions depending on how we average over loci. Some methods have sub versions depending on how we correct for sample size or depending on how many alleles we use in reduced data. For example, if 10 alleles are present, we can work with only 5 alleles and include the rarer alleles in an ``all-other'' category. Because the island model has such restrictive assumptions, an extremely precise estimate of migration rate is not really needed, but Slatkin and Barton (1989) mention that no conclusion has been reached on the best way to estimate G ST in the k-allele case. In addition, Cockerham and Weir (1993) showed that some of Slatkin and Barton's simulation results for the F ST method were incorrect. Also, there are other applications where we want good estimates of the sum of the Dirichlet parameters using samples from the MCD distribution. The poor performance of the ML method in Slatkin and Barton raised the issue of whether poor goodness of fit (via the quasi-equilibrium argument) of the simulated data to the BCB or MCD was degrading the ML performance. Slatkin and Barton (1989) present simulation results comparing the p(1), G ST , and ML estimators for these values of Nm: 0.128, 1.28, and 12.8. Some of their Table 1 is presented here in our Table 1 (with minor changes). Note that there are two versions of the G ST method. Version 1 uses a sample size correction; version 2 does not. There are also two versions of the ML method, differing in how data from different loci are combined. Overall, the results in their Table 1 suggest that the p(1) method and the G ST method without a sample size correction perform the best. Both ML estimators perform surprisingly poorly. They used 10 replicates of 10 loci each and the parameter values (for haploids): d=100, d s =10, N=256, n=50, &=10 &4.

The present study used 500 replicates of 10 loci each and the parameter values d=50, d s =10, N=128, n=20, &=10 &4. In a few instances not all 500 replicates gave a ML estimate because the NewtonRaphson procedure did not converge. Table 2 presents results for the same three values of Nm: 0.128, 1.28, and 12.8. The notation e1, e2, ..., e6 denotes different versions of the same estimator. Specifically, e1 and e2 in the % 1 (F ST method) and % 2 (G ST method) denote two ways to average over independent loci. The version which averages the estimate of Nm over each locus is denoted e1. The version which averages the estimate of F ST or G ST over loci is denoted e2. In addition, both the F ST and the G ST methods have versions which are corrected for sample size. The sample-sizecorrected versions are denoted e1* and e 2*. The MOM and ML methods have more options because k alleles can be reduced to any number of alleles between two and k&1. Results for reducing k alleles to two and four alleles for the MOM estimator and to two, four, and seven alleles for the ML estimator are presented. We found that neither method works well unless the number of alleles is reduced to some moderate number. Again, there are two ways to average over independent loci so for % 3 (MOM method), e1 and e2 denote the two ways for reducing to two alleles, while e3 and e4 denote the two ways for reducing to four alleles. The notation e1 through e6 is used in the same way for the % 4 (ML method) results. For % 5 (private-allele estimators), e1 and e2 denote the two ways to use Slatkin's (1985) version. The version which averages the estimate of Nm over each locus is denoted e1. The two versions using Eq. (10) are denoted e3 and e4. And e5 denotes the version that uses an upper cutoff for the estimated p t to decide if an allele qualifies as rare. It appears from Table 2 that version e5 of the privateallele method performs best. However, the method used to determine an upper cutoff for p t (below which an allele

TABLE 1 Comparison of p(1), G ST , and ML Estimators (Data from Table I in Barton and Slatkin, 1989) %=0.128

G ST (version 1) G ST (version 2) ML (version 1) ML (version 2) p t (1)

%=1.28

%=12.8

Bias (0)

RMSE (0)

Bias (0)

RMSE (0)

Bias (0)

RMSE (0)

252 259 2994 306 158

413 313 3274 366 207

31 24 137 66 9

61 50 173 81 37

465 &14 18 733 &35

689 16 21 795 36

304

Tom L. Burr

TABLE 2 A Comparison of Five Estimators of %=Nm %=0.128

%=1.28

%=12.8

Bias (0)

RMSE (0)

Bias (0)

RMSE (0)

Bias (0)

RMSE (0)

%1 (F ST )

e1 e 1* e2 e 2*

1200 150681 133 153

1223 157157 145 167

157 20244 26 59

160 21099 30 65

&42 2853 &54 29

42 3520 54 35

%2 (G ST )

e1 e 1* e2 e 2*

30 39 26 35

43 52 40 48

2 27 &5 16

14 35 12 23

&51 &25 &54 33

51 500 54 44

% 3 (MOM)

e1 e2 e3 e4

&20 &28 &30 &35

50 47 44 47

21 &9 40 17

48 22 62 33

1390 &20 440 39

26274 31 1907 54

%4 (ML)

e1 e2 e3 e4 e5 e6

46 40 23 20 3 2

67 58 40 38 29 29

45 15 20 9 23 15

64 27 29 19 33 25

8 &20 24 &0.8 31 9

29 29 34 19 38 19

% 5 ( p(1))

e1 e2 e3 e4 e5

5 &22 73 27 9

105 41 289 62 23

74 &16 84 20 &4

109 40 115 54 17

&47 &57 &59 &4 &2

47 58 61 52 21

Note. Distinct versions of the five estimators are denoted e1, e2, ... . Results are based on 500 replications.

in the sample would be considered rare enough to appeal to Eq. (10)) cannot be trusted. The method was to use an estimate of Nm from some other method to get an idea of how small p t would have to be. In sampling from the BCB, the limit is nearly reached for : 1 =4Nmp t 0.05. This depends on the value of %, with the limit being reached sooner for large %. In the simulations, for each value of Nm the search for a good upper cutoff of p t began with an estimate of Nm from another method. Then, the cutoff was adjusted until Eq. (11) was approximately satisfied. However, there was typically a fairly wide range of values for the cutoff that gave approximately the same agreement between the observed and expected second moments. This would be fine, except the estimate of Nm appears to be quite sensitive to changing the cutoff. The tabled results are for the lowest RMSE within a wide range of possible RMSEs using this procedure. Therefore, version e5 appears to be better than it really is. The results were presented because the privateallele method depends very much on the criterion used to

be considered private. The conclusion is that it is safest to use Slatkin's (1985) criterion and use alleles found in only one of many sampled demes rather than the dishonest version e5. A general conclusion from Table 2 is that for all estimators it is best to use the second method of averaging over loci. For example, using the G ST method, it is best to use G ST obtained from all loci, rather than to get an estimate of % from each locus using G ST estimated from each locus. One disagreement with the results presented in Slatkin and Barton (1989) is that here, the MLE (especially e4) performs quite well for all values of % and is the best method for %=12.8. The MLE performs about the same as the G ST method for the lower two values of %. However, it was important to use a moderate value of k, with k= 2, 3, or 4 usually performing well. For Nm= 0.128, the NewtonRaphson procedure failed to converge in 4 of the 500 replicates for k=2, in 11 of the replicates for k=4, and in 173 of the replicates for k=7.

305

Distribution of Rare Alleles

For Nm=1.28, the NewtonRaphson procedure failed to converge in 3 of the replicates for k=7. Generally, small values of Nm tended to cause convergence problems, especially for large k. For larger k, many of the replications did not produce an estimate or produced very erratic estimates. We have found this same behavior in simulations from the actual MCD, so we conclude that convergence problems with the MLE are not due to departure from the assumed beta or Dirichlet distribution. Except for the F ST method (which performed so erratically that we do not consider it any further), all methods have smallest RMSE for low Nm. (We have reported percent RMSE.) This can be explained by assuming that a sampling model for this data is the MCD(n, : 1 , : 2 , ..., : k ), as has been described. Standard asymptotic theory for MLEs then shows that the variance of % = Ki=1 :^ i 2 should be reduced as Nm=% decreases. We noted in Remark 1 following Theorem 4 that our method actually estimates Nm$=N(m+&). For Nm= 0.128, 1.28, and 12.8, the values of Nm$ are 0.154, 1.31, and 12.8, respectively. We see from Table 2 that some of the estimators do tend to overestimate Nm slightly and the effect is most noticeable for small Nm. Specifically, if our goal were to estimate Nm$ then the ML method, the G ST method, and our modified version of the privateallele method all do even better than we are claiming here for estimating Nm. We also point out that we have compared the new version of the p t (1) method, which uses Eq. (10), to Slatkin's (1985) version for one of the three values of n for which Slatkin (1985) provided a slope and intercept estimate. The general conclusion is that there is no clear winner between the two versions. For Nm=12.8, the new method reduces the bias, but increases the variance, so that the RMSE is only slightly lower than Slatkin's (1985) version. For Nm=0.128 and Nm=1.28, Slatkin's (1985) version has slightly lower RMSE and a bias of approximately the same magnitude. However, as stated above, all estimators are actually estimating Nm$ instead of Nm. If we considered our goal to be estimating Nm$, then our modified version has lower RMSE than Slatkin's (1985) method for Nm$=0.154, but still has higher RMSE than Slatkin's (1985) method for Nm$=1.31. The real advantage of the new method is that it is easily implemented to work for any value of n. The island model has such restrictive assumptions that a qualitative estimate of migration rate is probably suitable. However, we have experimented with relaxing the conditions and have found that the only essential requirements are that % be O(1) and N be large. Further simulations have confirmed that the estimators are not

noticeably affected when N varies, or when the number of migrants is Poisson(%) in each deme. We have allowed N to vary between 100 and 500, with no noticeable effect. We expect no noticeable effect when N varies provided N remains reasonably large. Burr (1992) presented results for several such alternative conditions including allowing each deme to have its own average number of migrants per generation. There was only a modest decrease in the performance of the estimators. The most restrictive assumption we make is that the migration pattern remains in effect for many generations. If we want to relax this assumption, we would require another approach, such as observing allele frequencies in each deme over two or more successive generations.

SUMMARY We have provided analytical support for the p(1) method of estimating %. We used a simple moment argument to first present an alternative to the usual diffusion-theory argument to arrive at an approximating beta distribution when there is a true stationary distribution. We then used the same moment argument to prove the validity of the quasi-equilibrium theory which leads to the same approximating beta or Dirichlet distribution. To date, two candidate explanations had been presented for the observed relation between p(i ) and Nm in simulated data. The explanation in Slatkin and Takahata (1985) assumed that the observed relation between p(i ) and Nm was dominated by those alleles that never left the deme where they originated. The explanation in Barton and Slatkin (1986) assumed that the observed relation between p(i ) and Nm was dominated by alleles that had been in existence long enough to reach ``quasi-equilibrium'' under migration and drift. The agreement between simulation and theory was best for the second explanation. Here, we provided the analytical basis and further simulation support for endorsing the second explanation. Next, Slatkin and Barton (1989) reported that the p(1) method to estimate Nm outperformed the ML method. Because both methods rely on the same beta approximation but the p(1) does not use all of the data, this performance is very puzzling. We have shown that the ML method suffers dramatically unless the number of alleles is reduced to a small number (2 or 3). Also, this phenomenon held in simulations from the exact MCD in auxiliary simulations we performed. Thus, we improved the ML performance by reducing the number of alleles.

306 In general we recommend the same kind of reduction in any situation with large K in the MCD. The present study confirms that the p(1) is surprisingly accurate but finds that the modified ML method outperforms both the G ST and the p(1) methods. A probable explanation for the improved performance of the ML method here is that the number of alleles was reduced to a moderate number and a better way to average over loci was used. Given the restrictive assumptions of the island model of evolution, we believe that qualitative estimates of migration rates are appropriate. We are therefore not recommending that the current trend of using F ST or G ST to estimate migration rates be replaced by the modified likelihood method. Instead, the good performance of the modified likelihood method adds to our understanding of why both the rare-allele method and the F ST methods are adequate for data that follows (or nearly follows) the island model of evolution with mutation to new alleles. We showed here that the reason for this is that the quasiequilibrium theory is in effect after initial conditions are forgotten. However, for maximum likelihood to perform well in this setting, we have found that is is necessary to reduce the number of alleles as we have described. Finally, our Eq. (10) relates the average relative frequency of rare alleles to the migration rate in a way that both improves the agreement between simulation and theory (Fig. 1) and establishes a connection to the ESF (Theorem 5).

REFERENCES Barton, N., and Slatkin, M. 1986. A quasi-equilibrium theory of the distribution of rare alleles in a subdivided population, Heredity 56, 409415.

Tom L. Burr Burr, T. 1992. ``Estimating and Modeling Gene Flow for a Spatially Distributed Species,'' Ph.D. dissertation, Florida State University, Tallahassee. Chung, K. L. 1974. ``A Course in Probability Theory,'' Academic Press, Orlando, FL. Cockerham, C. C., and Weir, B. S. 1987. Correlations, descent measures: Drift with migration and mutation, Proc. Nat. Acad. Sci. USA 84, 85128514. Cockerham, C. C., and Weir, B. S. 1993. Estimation of gene-flow from F-statistics, Evolution 47, No. 3, 855863. Crow, J. F., and Kimura, M. 1970. ``An Introduction to Population Genetics Theory,'' Harper 6 Row, New York. Ewens, W. J. 1972. The sampling theory of selectively neutral alleles, Theor. Popul. Biol. 3, 87112. Ewens, W. J. 1979. ``Mathematical Population Genetics,'' SpringerVerlag, Berlin. Kimura, M., and Crow, J. F. 1964. The number of alleles that can be maintained in a finite population, Genetics 49, 725738. Kleinman, J. 1973. Proportions with extraneous variance: single and independent samples, J. Amer. Stat. Assoc. 68, 4654. Mosimann, J. 1962. On the compound multinomial distribution, the multivariate ;-distribution, and correlations among proportions, Biometrika 49, 6582. Rannala, B. 1996. The sampling theory of neutral alleles in an island population of fluctuating size, Theor. Popul. Biol. 50, 91104. Rannala, B., and Hartigan, J. A. 1996. Estimating gene flow in island populations, Genet. Res. 67, No. 2, 147158. Slatkin, M. 1981. Estimating levels of gene flow in natural populations, Genetics 99, 323335. Slatkin, M. 1985. Rare alleles as indicators of gene flow, Evolution 39, 5365. Slatkin, M., and Barton, N. 1989. A comparison of three indirect methods for estimating average levels of gene flow, Evolution 43, 13491368. Slatkin, M., and Takahata, N. 1985. The average frequency of private alleles in a partially isolated population, Theor. Popul. Biol. 28, 314331. Watterson, G. A. 1976. The stationary distribution of the infinitelymany neutral alleles model, J. App. Prob. 13, 639651. Wright, S. 1931. Evolution in Mendelian populations, Genetics 16, 97159.