THEORETICAL
POPULATION
3,87-112
BIOLOGY
The Sampling
Theory
(1972)
of Selectively
Neutral
Alleles*
W. J. EWENS+ Department
of Zoology,
of Texas at Austin,
University Received
August
TO THE
MEMORY
DEDICATED
Austin,
Texas,
78712
17, 1971 OF KEN
KOJIMA
In this paper a beginning is made on the sampling theory of neutral alleles. That is, we consider deductive and subsequently inductive questions relating to a sample of genes from a selectively neutral locus. The inductions concern estimation, confidence intervals and hypothesis testing. In particular the test of the hypothesis that the alleles being sampled are indeed selectively neutral will be considered. In view of the large amount of data currently being obtained by electrophoretic methods on allele frequencies and numbers, and the current interest in the possibility of extensive “non-Darwinian” evolution, such a sampling theory seems necessary. However, a large number of unsolved problems in this area remain, a partial listing being given towards the end of this paper.
MATHEMATICAL
THEORY
A number of quantities will be considered in this paper and it is useful to gather together the notations that will be used consistently throughout. We define. N = number unknown)
of individuals
in the parent
IV, = effective size of parent
population
u = mutation
new alleles (normally
71 = number K = number random variable)
rate to entirely of individuals of alleles
K = number of different a random variable)
in the population alleles observed
Press,
Inc.
(normally
unknown) unknown)
in generation
t)
t (an unknown
in the sample (a realized
Center, University of Mathematics,
87 1972 by Academic
(normally
population
in sample (taken in generation
* Supported by USPHS Grant GM-15769. + Current address: Mathematical Research WI 53706. Permanent address: Department Bundoora 3083, Victoria, Australia. 0
(diploid)
of Wisconsin, La Trobe
value of
Madison, University,
88
EWENS
e = 4N,u ES(.) = expected structure at generation EP(.) = expected
value
of some sample
variable,
value of some population
variable
given
the population
t in generation
t
EC.1= W,(.) m, = effective number ?T~ = probability i, (i = 1, 2 )...) 24
that
of alleles in the population the number
of alleles
n( = number of genes in the sample Note that x ni = 2n.
observed
of the i-th allelic
in the sample
is
type, i = I,..., k.
Consider a locus A in the population in question and suppose that alleles from the infinite series A, , A, ,... can occur at the locus. Any gene is assumed to mutate, with probability u, to form an allele not currently existing (nor previously existing) in the population. [Note: this assumption is made for mathematical reasons and in practice the expression “(nor previously existing)” can be omitted.] Before considering sample properties we review some properties of the population itself. [For an expanded version of what follows, see Ewens (1969, pp. 67-71).] Given the population configuration in generation t - 1, the configuration in generation t behaves in a stochastic fashion and we shall suppose [adapting a standard model due to Wright (1931)] that if in generation t - 1 the number of genes of any existing allele is a, then the probability that in generation t there are b genes of this allele is
a = 1, 2,..., 2N, b = 0, 1, 2 ,..., 2N. Equation (1) is actually insufficient to specify the complete stochastic process under discussion: a proper specification is given in Karlin and McGregor (1967) and involves multinomial transition probabilities. The derivations given here will not use this complete specification but will rely only on (1). To this extent, then, our derivation is partially intuitive: when this is so a proper derivation is given in the accompanying paper (Karlin and McGregor, 1972). We assume that sufficient time has elapsed so that a stationary situation has been reached. Then from the transition matrix generated by (1) we can define the “effective number of alleles” m, in the population, defined as the reciprocal of the probability F that two genes drawn at random from the popula-
SAMPLING
SELECTIVELY
NEUTRAL
tion are of the same allelic type: at equilibrium Kimura, 1964)
F = {2N(l
89
ALLELES
the latter quantity
is (Crow
and
- U) -2 - 2N + 1} -l ‘v (1 + 4Nu)-l,
so that m, ‘v 1 + 4Nu. More
generally
(2)
we have m, w 1 + 4Neu,
(3)
where N, is the effective population size. The error in (3) approaches zero as N, + cc in such a way that 13 = 4N,u is kept constant. So far as other quantities are concerned, it seems necessary to use the diffusion approximation to the model (1). Aspects of the relevant diffusion theory will be found in Ewens (1969, Chapters 5 and 6), and we will be content here simply to reproduce the relevant results. First we consider the mean value E(K) of the actual number K of alleles in the population in generation t: this is
E(K) = e* [ J;2N)-lx-y1 - xle*-l dX]? where
O* = 4Nu. In somewhat E(K)
greater generality
= e [ fjel
(4)
we can write
x-y 1 - x)0--1 &c] ,
(5)
where 0 = 4N,u and N, is the variance-effective population size. A considerable portion of our subsequent discussion will relate to the parameter 0. A statement rather more informative than (5) is that
E(K; x1 ,x2) =es$2
x-‘( 1 - ~)a-~ dx,
21
(6)
(2N)-l < x1 < x2 < 1, is the mean number of alleles occurring between x1 and x2 . Clearly the function f(x)
in the population
= 8x-1(1 - x)0--1
with
frequency
(7)
is such that f(x) 6x is the probability that an allele will occur in the population with frequency in (x, x + 6x) (since we shall assume that for sufficiently small 8x the probability of two alleles having frequency in this range can be ignored). The alternative interpretation that f(x) 6x is the mean number of alleles with
90
EWENS
frequency in (x, x + 6x) will also be used on occasion. By a slight abuse of language we shall call f(x) the frequency spectrum of the process, and it will turn out that the approach to sampling problems from a frequency-spectrum point of view considerably facilitates all the arguments. The shape of the frequency spectrum is strongly dependent on 0. If 8 < 1, most of the mass in the spectrum is near x = 1, indicating that it is most likely that one allele occurs at high frequency in the population, together with a small number of very low-frequency alleles. If 0 > 1, it is unlikely that any allele will occur with appreciable frequency and one is most likely to observe a comparatively large number of low-frequency alleles. It is a characteristic of the spectrum that one seldom observes a situation with two or three alleles of moderate frequency. From (7), the probability that a gene drawn at random from the population is of an allele whose frequency in the population is in (x, x + 6x) is xf(x) sx = e(l - x)0--1 6x,
(8)
where we may now assume 0 < x < 1 to a sufficient approximation. A further use of the frequency spectrum is as follows. Suppose that the (random) frequencies of the various alleles occurring in the population in generation and consider any function of the form Zi 4 (pi), where +( pi) t are Pl , Pz ,..., is 0( pi) at most near pi = 0. We have shown above that Pr[an allele occurs in the population
with
frequency
in (x, x + 8x)]
= ex-l( 1 - x)0--1 6X. It follows
from this that
E c 4( pi) = 0 s’ c+(x)x-‘( 1 - x)@-l dx, 0
z
where we may take the lower terminal of $( .)] with negligible loss of accuracy.
E
(cpij
as zero [because of the assumed Thus, for example,
(9) nature
= 0 J‘I (1 - ~)~-l dx = 1 0
as we expect, while Pr (two genes drawn zz
at random
are of same allelic type)
E @ pg2j = 6’J‘: x(1 - ~+‘-l dx = (1 + 0)-l.
(10)
SAMPLING
This
rederives
SELECTIVELY
(3) and indicates
me, namely that to a sufficiently
NEUTRAL
an interesting connection close approximation,
e(1 - X)+-1, (2N)-1 is a density function and to this degree variable from this distribution,
m, = [E(z)]-l We now turn to properties
91
ALLELES
between
E(K)
and
< X< 1 of approximation,
while
E(K)
if z is a random
= E(x-l).
of a sample from this population.
SAMPLING
PROPERTIES
Suppose a sample of n individuals (272 genes) is drawn from the population. It will be assumed that C< N so that although sampling is without replacement, binomial formulae can be used to a sufficient approximation. Note in particular that this means that our results cannot necessarily be taken over to describe population properties, although some such translation does seem possible (see Appendix 1). The first question we ask is: what is the mean number of different alleles in the sample ? If k is the actual number of alleles in the sample, we can write k = a, + a2 + . . .. where ai = 1 if the i-th allele in the population the sample,
in generation
t is represented
in
ai = 0 otherwise. Then W)
= c w4 = c -&%(4 = c E,[l
- (1 -pi)zn]
(if we suppose that in the population the various alleles occurring have frequencies p, , p, ,...). Using (9), this may be written
E(k) ‘v e J’
x-‘( 1 - ~)~-l{l
in generation
t
- (1 - x)~“} dx
0
2+-L+...+
e
e+2n-1.
(11)
92
EWENS
Note that if 0 is extremely small we have E(k) N 1, whereas if 0 is extremely (and from a biological point of view unrealistically) large, then E(K) E 2n and we expact all genes in the sample to be of different allelic type. Both these observations agree with the observations made as a result of Eq. (7). Note that despite this agreement, the present model does not strictly apply as 0 + 0 or e+co. It is valuable to tabulate E(k) [given by (1 I)] for selected values of n and 8: a representative set of values is given in Table I. In view of Eq. (5), it is of some interest to see whether an expression for E(k) can be obtained in terms of an integral. The parallel with (5) makes it natural for us to try an expression of the form
E(k) = 8 [J:,,,, x-1(1 - q-1 dx].
(12)
Although the mathematical forms of the alternative expressions (11) and are quite different, it is not difficult to show that they give values for which are almost identical. Further, we can use the parallel with (6) to that the mean number of alleles E(K; ni , n,) to occur in the sample (sample) frequency in (xi , x2) is given by
E(k; n, , n2) = 6’ ” X-l(l s 21
- x)@-l dx.
(12) E(K) state with
(13)
Note that we may argue in the reverse direction and state, using (1 l), that an alternative expression for the mean number of alleles in the population in generation t is 8 qq=g+O+l+-.+
Thus the mean number
of alleles in the population
0 l3 + 2n + or alternatively,
9 e+2n+1
not observed in the sample is 6
(15)
+“‘+3I+2N-1’
using (5) and (12),
0s(&L-l ~‘(1
(2N)-1
l9 e+2N-1.
I3
- ~)~-l dx N 0 log(N/n)
- 0(0 -
1)[(2n)-i
- (2N)-l].
(16)
1.04
1.04
1.00
1.00
1.00
1.00
1.01
1.01
1.01
1.01
1.01
1.01
1.01
1.01
1.01
10
20
30
40
50
60
70
80
90
100
150
200
250
1.07
1.07
1.06
1.06
1.06
1.06
1.05
1.05
1.05
1.05
1.05
0.01
of n
0.001
Value
Mean
1.66
1.64
1.61
1.57
1.56
1.55
1.54
1.52
1.50
1.48
1.45
1.41
1.34
0.1
2.30
2.26
2.20
2.12
2.10
2.07
2.05
2.01
1.98
1.93
1.88
1.79
1.65
0.2
2.92
2.85
2.76
2.64
2.61
2.57
2.53
2.49
2.43
2.36
2.28
2.16
1.95
0.3
3.51
3.42
3.31
3.14
3.10
3.05
3.00
2.94
2.87
2.78
2.66
2.50
2.22
0.4
4.09
3.98
3.83
3.63
3.58
3.52
3.45
3.38
3.28
3.17
3.03
2.83
2.48
0.5
Value
4.65
4.52
4.35
4.10
4.04
3.97
3.89
3.80
3.69
3.55
3.38
3.14
2.72
0.6
5.20
5.05
4.85
4.56
4.49
4.41
4.31
4.21
4.08
3.92
3.72
3.44
2.96
0.7
5.74
5.57
5.34
5.01
4.93
4.83
4.73
4.60
4.46
4.28
4.05
3.73
3.18
0.8
6.27
6.07
5.81
5.45
5.36
5.25
5.13
4.99
4.83
4.63
4.37
4.01
3.39
0.9
6.79
6.57
6.28
5.88
5.77
5.66
5.52
5.37
5.19
4.97
4.68
4.28
3.60
1.0
7.30
7.06
6.74
6.30
6.18
6.05
5.91
5.74
5.54
5.29
4.98
4.54
3.79
1.1
7.81
7.54
7.19
6.71
6.58
6.44
6.28
6.10
5.88
5.62
5.27
4.79
3.98
1.2
9.27
8.94
8.51
7.90
7.74
7.57
7.37
7.14
6.87
6.54
6.11
5.52
4.51
1.5
in a Sample of n Individuals (i.e., 2n genes) and with Selected Values of 6 = 4N,u
I
of 0 = 4N,z1
Number of Different Alleles Observed from a Selectively Neutral Population
TABLE
11.59
11.14
10.57
9.77
9.56
9.32
9.06
8.75
8.39
7.96
7.39
6.61
5.29
2.0
-~
23.59
22.48
21.06
19.07
18.56
17.98
17.31
16.59
15.72
14.65
13.30
11.45
8.46
5.0
39.82
37.63
34.83
30.93
29.93
28.81
27.56
26.12
24.44
22.42
19.90
16.50
11.33
10.0
z3
94
EWENS
It is now necessary to derive further characteristics of the sampling distribution of K, and we do this by finding an explicit expression for ri = Pr(k = i). It will turn out that the probability distribution {ri} = {rl , nz ,..., ran} will be central to the whole discussion of drawing inferences about the population from the sample. Despite the similarity in form between (5) and (12), it does not necessarily follow that an argument similar to that below yields the distribution of K. We find ri by a variant of the “coupon collector’s problem” (or the “law of succession”). We suppose the sample of 2n genes is drawn one by one and consider the probability that the (j + I)-th gene drawn is of an allelic type not observed on the first j draws. Now if we are given the current allele frequencies p, , pa ,... in the population, this probability is x(1 - p,)jp, . Use of (9) then shows that the required unconditional probability is e 1 (1 - X>j x[x-l(1 s Thus the probability previously drawn is
- x)“-‘1 dx = e/(e + j).
(17)
g ene is of one or other allelic type
that the (j + I)-th j/cc + j).
Note also that the probability type is
(18)
that the first j draws
E 1 pij = 0 f x+-1(1
- x)“-‘}
dx
all yield
the same allelic
[using (9)l
0
i
= e(j -
l)! r(e)/r(j
+ 0).
Hence, given that the first j draws yield only one allelic form, the probability that the first j + 1 draws yield only one allelic form, is
[W W)/W + j + 111 _ i w - 1I! m/v + 81 - P-m’ which is identical to (18). Thus the condition that the first draws all yield the same allelic type does not alter the probability that on the (j + 1)-th draw a new type is drawn. We argue more generally that the number of different allelic types drawn on the first j draws has no bearing on the probability that on the (j+ I)-th draw a new type is obtained. This arises essentially from the fact [from (l)] that the behaviour of the frequency of any allelic type is Markovian, that is, is independent of the numbers and frequencies of other allelic types. A formal proof of this statement has been obtained by Karlin and McGregor and is given in an accompanying paper (1972). Th us the probability that the firstj draws yield i allelic types (which we denote q3,i) is given, for i = 1, by 4j,1 =
e(j
-
i)i/e(e +
1) .. . (e + j -
i),
(19)
SAMPLING
and for i =j
SELECTIVELY
NEUTRAL
by qj,j = ejpqe + 1) .. . (e + j -
Further
95
ALLELES
we have the recurrence
1).
(20)
relation
Qj+l.i = 4f,i{j/(e
+ i)> + %,i-l{e/(e
(21)
+ll>*
This recurrence relation, together with the boundary conditions (19) and (20), is sufficient for calculation of qj,i for all (i, j), and in particular for j = 2n. It turns out that the most economical and useful way of writing down the solution ri(= qzn,J is by introducing the polynomial
L(B) = e(e + 1) ... (e + 2n - 1) = zle + z2e2+ ... + z2ne2n
(22)
(say): if this is done, then ni = z,ei/ge + 12e2 + ... + z,,ey.
(23)
The coefficients Zi (written more fully Zipzn) are the “Stirling’s numbers of the first kind.” Properties of these are given in David and Barton (1962, pp. 291-296), where the notation used is Dissn. The theory which follows has interesting parallels with the theory of “record-breaking,” discussed in David and Barton (1962, pp. 178-182), and which also makes substantial use of Stirling’s numbers. The polynomial L(B) is of value in exploring the properties of the distribution {niTi}. Thus we have, for example, E(k) = 1 iTi =
(C
iZ&F)/L(O)
=
eLye),qe)
=
e(djde) log L(e)
d+&+...+
[using (22)], and this expression Var(K)
= C i(i -
e
e+2+i
9
agrees with (11). Next we have
1) 7ri + [E(k)]
-
[E(K)12
e2cy0) =p+!IF3&[~]2 -w)
= eLye) ~- 82[$ L(e) =E(k)-[$+(e+1)2
logL(e)] 82
(24
96
EWENS
Note that Var(k) + 0 as 0 + 0 (as we should expect) and also as 19+ 00 (as again we expect): the most variable situation occurs when 0 assumes intermediate values [although we see in Table II that for cases of biological interest, Var (k) increases with 0). We may also use (23) to compute higher moments of the distribution (rri}: in particular the third and fourth cumulants of k are
~~ = es, - 389, + 2e3s3 , K4 = es1 -
7e2s2 j- 12e3s3 - 6e4S4,
(25) (26)
where Sj = (l/&) + [l/(0 + l)j] + ... + [l/(0 + 2n - l)j]. It follows that for n large, the skewness and kurtosis of the distribution of K approach zero. In general, all standardized cumulants above the second approach zero as n -+ co, indicating asymptotic normality of the distribution. Note, however, that our theory is designed and applies for values of n considerably less than N and approximate normality is possibly not reached for values of 1z arising in practice. Note that the skewness of the distribution of Jz is always positive for f!J< 1 and also positive for sufficiently large n: These may often be the cases of practical interest. It has been remarked above that our derivations rely on the assumption that n < N and cannot be taken over to describe population distributions. This is unfortunate since in computer simulations it is usually necessary to consider properties of these distributions. If our arguments apply approximately for population properties we would have Pr(K
= ;) = p,eyqe),
(27)
where
p(e) = e(e + 1) ... (e + 2N - 1) = p,e + p,e2 + - + p,,ev
(28)
Computer simulations suggest that (27) is approximately correct for small e but, for larger 0, yields a distribution whose variance is larger than that suggested by simulations. It is possible to obtain numerical expressions for rri(i = 1,2,... 2~2) by using the recurrence relation (21). Appendix 3 lists a FORTRAN program in which this is done for values of 12up to 1000: the user of this program is required simply to write in the appropriate values of t9 and n in statements 1 and 2. The form (23) is useful to discuss a property of the distribution of K, namely that this distribution is complete: that is, there is no function of k whose expected
0.00
0.00 0.00
0.01
0.01
0.01
0.01
0.01
0.01
0.01
0.01
0.01
20
30 40
50
60
70
80
90
100
150
200
250
0.001
0.00
of ti
10
Value
0.07
0.07
0.06
0.06
0.06
0.06
0.05
0.05
0.05
0.05 0.05
0.04
0.04
0.01
Variance
II
0.65
0.63
0.60
0.56
0.55
0.54
0.52
0.51
0.49
0.44 0.47
0.40
0.33
0.1
1.25
1.21
1.15
1.07
1.05
1.02
1.00
0.96
0.93
0.83 0.88
0.74
0.61
0.2
1.81
1.75
1.66
1.54
1.51
1.47
1.43
1.39
1.33
1.18 1.26
1.06
0.85
0.3
2.35
2.26
2.14
1.98
1.94
1.89
1.84
1.78
1.70
1.50 1.61
1.34
1.06
0.4
2.86
2.74
2.60
2.40
2.35
2.29
2.22
2.14
2.05
1.80 1.94
1.60
1.26
0.5
Value
3.34
3.21
3.04
2.80
2.73
2.66
2.58
2.49
2.38
2.08 2.25
1.84
1.43
0.6
3.82
3.66
3.46
3.18
3.10
3.02
2.93
2.82
2.70
2.34 2.54
2.06
1.59
0.7
of 0 = 4N,u
4.27
4.10
3.87
3.54
3.46
3.37
3.26
3.14
2.99
2.59 2.82
2.27
1.74
0.8
4.72
4.52
4.26
3.90
3.80
3.70
3.58
3.44
3.28
2.83 3.08
2.47
1.88
0.9
5.15
4.93
4.64
4.24
4.13
4.02
3.88
3.73
3.55
3.05 3.33
2.66
2.00
1.0
5.57
5.33
5.01
4.57
4.45
4.33
4.18
4.01
3.82
3.27 3.58
2.84
2.12
1.1
5.98
5.72
5.37
4.89
4.77
4.63
4.47
4.29
4.07
3.47 3.81
3.00
2.23
1.2
7.17
6.84
6.41
5.81
5.65
5.48
5.28
5.05
4.79
4.05 4.46
3.47
2.52
1.5
of Number of Different Alleles Observed in a Sample of n Individuals (i.e., 2n genes) from a Selectively Neutral Population and with Selected Values of 0 = 4N,u
TABLE
9.02
8.58
8.01
7.21
7.00
6.77
6.51
6.21
5.85
4.88 5.42
4.12
2.90
2.0
18.10
17.01
15.61
13.66
13.16
12.60
11.98
11.26
10.42
8.16 9.42
6.48
3.95
5.0
29.50
27.36
24.64
20.89
19.94
18.88
17.71
16.37
14.84
10.82 13.03
8.01
4.20
10.0
5
E ii
;3 F g
3
$
F;; 2
F $ ;I! 2 ;
98
EWENS
value is zero (or any required would have, from (23),
constant).
For if such a function
g(K) existed we
F g(i) z&Ii= 0. i=l
This equation can hold (for a continuum of 0 values) only when g(i) = 0. We shall make use of this property of completeness on several occasions in our later discussion.
INFERENCE
PROPERTIES
We now turn to the question of what inductions about the population can be made from the sample. We consider first the maximum likelihood estimator of 0, given only the observed value Iz of alleles in the sample; (we shall show later that ignoring the numbers ni ... nk with which these occur in the sample involves no loss of information). The likelihood of observing K alleles is, from (2,3),
and this is maximized
with respect to variation
in 0 when
k = eqeyqe) e e =gSe+l+“‘+
e e+2n-i
*
In other words, the maximum likelihood estimator of 0 is that value for which the expected number of alleles equals the observed number of alleles. Thus maximum likelihood estimates can be found by using Table I backwards and interpolating; however for convenience we exhibit in Table III the maximum likelihood estimators of 0 for selected values of K and n. A “symmetric” 95% confidence interval (0, , 0,) for 0 can be found from Neyman-Pearson theory by computing B0 and 8r from Pr(less than K alleles / 0 = 0,) = 0.025, Pr{more
than K alleles ) 0 = f?,,} = 0.025.
The numercial program given in Appendix 3 may be used (with some trial and error) to evaluate 0, and 0, from these equations. A confidence interval of some special interest arises when a sample of n indi-
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
30
40
50
60
70
80
90
100
150
200
250
1
10
of n
Likelihood
20
Value
Maximum
0.152
0.158
0.166
0.178
0.182
0.186
0.191
0.197
0.204
0.215
0.230
0.256
0.319
2
Estimates
0.314
0.326
0.343
0.371
0.379
0.389
0.400
0.414
0.431
0.455
0.492
0.553
0.718
3
4
0.485
0.504
0.532
0.578
0.591
0.607
0.626
0.849
0.678
0.720
0.782
0.894
1.209
.___
of 0 for a Given
5
0.663
0.691
0.731
0.797
0.816
0.840
0.868
0.902
0.945
1.008
1.104
1.279
III
0.848
0.886
0.939
1.029
1.055
1.087
1.125
1.172
1.232
1.319
1.455
1.709
2.524
6
of R
Number
Value
1.805
Observed
TABLE
1.040
1.088
1.157
1.272
1.306
1.347
1.397
1.459
1.538
1.653
1.836
2.187
3.391
7
(k) of Different
1.239
1.297
1.382
1.526
1.568
1.620
1.684
1.762
1.862
2.011
2.247
2.715
4.440
8
Alleles
1.444
1.514
1.616
1.790
1.842
1.905
1.982
2.080
2.204
2.390
2.690
3.296
5.718
9
in a Sample
1.654
1.737
1.858
2.065
2.127
2.203
2.295
2.414
1.870
1.966
2.107
2.350
2.424
2.512
2.622
2.763
2.944
3.219
2.793 2.565
3.674
4.638
9.252
11
3.165
3.936
7.290
10
of Size n (272 genes)
2.091
2.201
2.364
2.644
2.730
2.834
2.962
3.127
3.342
3.669
4.217
5.409
12
viduals gives rise to a single allele and a confidence interval of the form (0, 0,) is required for 8. If a 95 o/o confidence interval is wanted, the value 8, satisfies 4(2n
- l)!/L(Q
= 0.05,
Appendix 3 again being useful for determination of 0, . The above inferences relate to the parameter 0, which is the only population characteristic about which inferences can be made from the sample. [Thus note, for example, that it is impossible, from (15) and (16), to make exact inferences about the number of alleles not observed in the sample, since the nonestimable quantity N is involved in the distribution of this quantity. On the other hand the form of (16) indicates that if an estimate of N can be made by independent methods, then a reasonably good estimate of this quantity can be found.] We now note that our inferences about 0 have used n and k only: that is, they have ignored the numbers n, , n2 ,..., n, with which the alleles appear in the sample. We now show that this procedure is justified by proving that k is a sz@cient statistic for 6; in other words, given k, the distribution of n1 ,..., nk is independent of 0, and so knowledge of these quantities provides no further information about 8. This is done by noting (Appendix 2) that Pr{n, ,..., n3 = g(n, ,..., 4
ewe).
The conditional distribution of n, ,..., n, , given k, is thus independent of 8. Hence not only may the numbers n, . .. nk be ignored when carrying out inferences about 0: standard Rao-Blackwell theory indicates that it is inejicient to make use of these quantities in such inferences. The sufficiency and completeness of the distribution of k indicates that there is a unique “best” (minimum variance unbiased) estimator of any estimable function of 0, and that such an estimator will be a function of k only. One particular function of 6’ which one sometimes wishes to estimate is the “effective number of alleles” 1 + 0 [cf. Eq. (3)]. Th is is normally estimated by 1/Cxi2, where xi = nJ(2n). The above shows that such an estimator is very ine@ient in that it uses precisely the information in the sample that should be ignored. We now show that in fact 1 + 0 is not an estimable function, for if there were an unbiased estimator of (1 + 0) th ere would be a (unique) function g(K) unbiasedly estimating 1 + 8. We would then have
zgg(4 485= (1 + e)qg for all 8, and this is impossible since the left-hand side is a polynominal of degree 2n at most whereas the right-hand side is a polynomial of degree 2n + 1. Thus it appears that probably the most satisfactory estimate of the effective number of
SAMPLING
SELECTIVELY
NEUTRAL
101
ALLELES
alleles is 1 + 8, where d is the maximum likelihood estimator of 0, even though this is a biased estimator. It is however possible to find a unique “best” estimator of (1 + 8)-l. If we denote this estimator by G(k), we must have c G(i) liei = 8(8 + 2)(8 + 3) -.. (8 + 2n -
l),
i=l
so that G(2n) = 0 and
G(‘)
coeff0j-1in(6’+2)(0+3)*..(0+2n-l) = coeff 8+l in (0 + l)(e + 2) .** (0 + 272 -
1) ’
(j = 1, 2 ,..., 2n - 1). In particular
we have
G(1) = estimate of (1 + 0)-l if 1 a 11e1e is observed
in a sample of n individuals
= 1, G(2) = estimate of (1 + 0)-l if 2 alleles are observed in a sample of n individuals
= [;+;+-.+
(2nQ/[1+;+;+*-+
(2nL1)]’
G(3) = estimate of (1 + 0)-l if 3 alleles are are observed in a sample of n individuals =
[ 2.3 L+L
2,4
[L+
A+
1.2
+...+(271-2jl(2n-l) ... + (2n - 2)l(2n -
1 1
1) ’
and so on. In general, the only estimable functions of 0 are linear combinations of functions of the form l/H(8), where H(B) is a polynomial in 8 of the form (8 + u)(8 + b) ... (8 + c), where a, b, . . .. c are positive integers satisfaying l,(a
INDEX
FUNCTION
It is sometimes desired to find an index function I(R, x1 ,..., xk) whose expected value is independent of 0 and known for selectively neutral populations. It is sometimes even further desired that the index function be a function of K only. Such an index function could hopefully be used as a measure of departure from selective neutrality. Our previous theory shows that it is impossible to construct a nontrivial index function that is a function of K only. For if E@(K) = A (A
102
EWENS
constant), then the completeness of the distribution of K (proved above) yields Q(K) = A, which is trivial. We shall exhibit later a useful index function L = L(k, n, ,..., nR), which of course depends nontrivially on nr ,..., nk as well as K.
HYPOTHESIS
TESTING
A large number of tests of hypotheses may be made in areas related to the above discussion, and we shall consider here only two such tests. Firstly, it is possible to test hypotheses about the value of the parameter 19, assuming selective neutrality does hold. Such tests must be carried out using the number li only, and will normally be closely associated with the procedure for finding confidence intervals for 0 outlined above. The theory of such tests is straightforward and is not considered further here. Second, it is possible to devise a test of the hypothesis that the alleles sampled are selectively neutral. Here the numbers n, . .. nKbecome relevant and indeed our test of the hypothesis is carried out essentially by testing whether the observed values n, . .. nk conform reasonably to what is expected under selective neutrality. The nuisance parameter 0 is eliminated from such tests by making the tests conditional on the observed value of k. Thus we have, from Appendix 2, Pr{n, ,..., n, 1 k, 2n, neutrality}
= (2n)!/(K!
Z,n, . n2 ... nk).
(30)
More precisely, we assume that, once k is given, the alleles are labelled A, ... A, in some conventional manner: Eq. (30) yields the probability that there are n, genes of the allele labelled A, , . .. . nk genes of the allele labelled A, . Equation (30) can now be used in principle to find the distribution of any test statistic t(n,, . .. . nk). Standard statistical methods normally yield an “optimal” test statistic once an alternative hypothesis is given. Here the alternative hypothesis is “selection exists,” and it is not clear that this is sufficiently precise to yield an unambiguous statistic. Here we shall simply note that if heterotic selection exists, the values n, ,..., n, will tend to be “close,” whereas if one allele is selected favoured, one of the n, values will tend to be very large, the others being quite small. Under selective neutrality a situation intermediate between these two should obtain. Thus if we define xi = n,/2n (= frequency of the i-th allele in the sample), it seems reasonable to introduce as an index of neutrality, and also as a test statistic for the hypothesis of neutrality, the “information” function
B = - i xi log xi . i=l
We shall not attempt any mathematical justification for using B other than noting the well-known properties enjoyed by the information statistic.
SAMPLING
SELECTIVELY
NEUTRAL
ALLELES
103
The complete distribution of B under the neutrality theory is complicated, and we shall be satisfied here with finding only the mean and variance of this distribution. Defining i! l,,Jj! eq. (30) shows that if H is the domain 1
bl
‘.’
= w*,~ ,
ni > 1, Zni = 2n, then %-’
=
Wk,Zn
.
H
Then the distribution ph>
=
p(nr) of n, , found by appropriate
summation
n, = l,..., 2n-k+l.
Wk-1,2n-nl/(nlWk.212),
in (30), is (33)
From (33) we have
E [ -2
xi log xi]
= E [log 2n - (2n)-l c ni log ni] 2n-k+l =
log
2ti - (h-l
k
c
(log j) wk-l,2n-j/wk,2n.
(34)
j=l
Computation of (34) is possible using standard recurrence relations for Stirling numbers and is embodied in the FORTRAN program in Appendix 4. Similarly, the joint distributionp(n, , n,) of rz, and na is found (for k 3 3) to be ph
, %)
=
Wk--2,Zn-nl-nz/(nln2Wk.2n).
(35)
Joint use of (33) and (35) yields the variance of B = -C xi log xi under the neutrality theory for k > 3; (a rather simpler formula obtains when K = 2). In both cases the necessary computations are embodied in the program in Appendix 4. Since the exact distribution of B is not known, an exact test of selective neutrality cannot yet be given. An approximate procedure assumes that B/log k has a beta distribution, the parameters of which can be found from E(B) and Var (B): a standard transformation then yields a random variable with an approximate F distribution. Part of the print-out in the program in Appendix 4 provides the value of F so calculated, together with its (non-integer) degrees of freedom. Values of F significantly in excess of standard significance points indicate evidence of heterotic selection, while large values of l/F (with reversed degrees of freedom) indicate selection favouring one allele.
104
EWENS
An even more approximate
test would
be to calculate the standardized
variable
L = [B - E(B)]/+) which, under the neutrality theory, has mean zero and standard deviation 1. An approximate two-standard deviation rule could be used to test for significance of L. We exemplify these points by considering three hypothetical samples, each with N = 350, K = 4, but with different allele frequencies in the three cases. Sample 1
2 3
Allele
frequencies
L
F(d.f.)
Comment
0.350.30 0.20 0.15 2.36 34.44(3.30,4.36) F significantly large 0.83 0.11 0.04 0.02 0.02 1.02(3.30,4.36) not significant 0.99 0.005 0.0025 0.0025 - 1.70 0.07(3.30,4.36) F significantly small
We shall not, however, emphasize the F test and prefer rather to concentrate attention on the statistic L, not used as a test function but rather simply as a reasonable index function of non-neutrality. We do this because electrophoretic data usually comprise results from several loci, several species and several geographical locations. Thus while the test outlined above can be applied to any single such observation, problems naturally arise regarding piecing together the results of a number of such tests (all related in some logical fashion) and in computing an overall significance level. Since in any event we do not expect our procedures to be any more than an adjunct to other methods in testing for nonDarwinian evolution, it may often be best to compute L for all sets of observations and to draw inferences not only from the various absolute values of L, but also from the patterns exhibited by the values of L for all the data at hand. Naturally, inferences so drawn involve a degree of subjective decision and thus should be used with caution. Nevertheless, in a forthcoming paper (Ewens and Langley, 1972) which investigates actual data using the above analysis, this approach will to a large extent be used.
DISCUSSION
The aspect of the above discussion likely to be of most practical interest is the test of the hypothesis of selective neutrality based on the frequencies of the various alleles in a sample from some population. It is therefore important to add some.words of caution relating to the use of this test. Firstly, the test does not appear particularly powerful (in the statistical sense): as a consequence, one may often maintain the hypothesis of neutrality when in fact (mild) selection does
SAMPLING
SELECTIVELY
NEUTRAL
ALLELES
105
occur. Associated with this remark is the fact that any frequency configuration, and in particular that pertaining to a selectively neutral situation, can be explained by one or other selective system. Secondly, much data now exists where the same locus is considered in related populations or species, or in the same population at different time points, or where different loci in the same population are studied. In some cases of this nature a quite subjective decision may be superior to one based on the objective procedure described above. Thus if a certain allele occurs at high frequency in a large number of related species, with other (low-frequency) alleles occurring in approximately the same frequencies in the different species, one is tempted, without any formal statistical analysis, to conclude that the high-frequency allele is selected for in all the species. This is a reasonable procedure, but in order to develop an objective statistical test and in order also to obtain a testing procedure more powerful than that outlined, it would be valuable to extend the above statitstical test to cover such situations. A third problem relates to nonidentification. The above analysis assumes total ability to differentiate alleles, whereas it is likely that even present methods often identify two different but very similar alleles. Further, complications due to linkage, fluctuation in population size, etc., have not been considered. It will probably be necessary to extend the above method to a considerable degree, taking all these factors into account, before highly reliable methods are attained. Nevertheless it is believed that the theory advanced in this paper will be of some use in obtaining these methods.
SUMMARY
The question of making inferences from a sample of genes from a locus is considered. If the locus is supposed neutral, the complete probability distribution of the number of different alleles seen is obtained. It is shown that this number is a sufficient statistic for the parameter 0 characterizing the distribution of allele number and allele frequency. This indicates that previous estimators of l/( 1 + 8) have been based on inefficient procedures. The derivation of minimum varianceunbiased estimators for estimable functions of 0 is discussed. An index function is constructed which can be used as a measure of “nonneutrality”. The question of finding confidence limits for 0 is solved and a computer program provided for this purpose. Finally the question of testing hypotheses, in particular that of selective neutrality, is considered. A test is provided for the latter hypothesis. The extensions of this test which will be necessary before powerful procedures are available are also discussed. 653/3/I-8
106
EWENS APPENDIX
I
The theory given in this paper relates to a sample whose size is supposed considerably less than the population from which it was taken. In particular in connection with computer simulations, which cannot normally use a value of N in excess of 1000, it is of interest to ask how much of the theory can be used to describe population properties. The similarity in form between (5) and (12) (and between (11) and (14)) show that an immediate translation appears possible so far as mean numbers of alleles are concerned. However, as remarked elsewhere in this paper, formula (24) (with K replacing K and N replacing n) appears to overestimate the true variance of K, at least for moderate and large values of 0. This possibly arises because the distribution of k incorporates two random components: first, the stochastic nature of the parent population and second, randomness associated with the sampling process. Note however that computer simulations indicate that one important sample property is maintained in the population, namely, that estimation of functions of 0 is carried out best by using K only and ignoring the frequencies of the various alleles in the population.
APPENDIX
We wish to establish formula notation, for example, that
2
(29). T o d o so it is convenient
to introduce
the
is an ordered collection of genes in which the first m1 genes sampled are of the same allelic type, the next m2 are of the same allelic type but different from the first, the next m3 are of the same (third) allelic type, the next m4 are all of the second allelic type, the next m5 are all of the first allelic type found, and so on. We wish to find an expression for the probability that in our sample we have n, genes of one specified allelic type, .. .. nk of a K-th specified allelic type. To do this we shall first find an expression for
that is, for the probability that we first draw a run of n, genes of one allelic followed by a run of ns genes of a second allelic type, and so on, concluding a run of n, genes of a k-th allelic type. (Note that the probability in question relates to an ordered sequence of outcomes.) We shall use repeatedly the
type, with thus facts
SAMPLING
SELECTIVELY
NEUTRAL
that if j draws have been made, the probability of draw a new allelic type not seen on the first j draws probability of any ordered sequence is the same as sequence of the same length with the same numbers Consider first the case k = 2. We know that P&A,,
lA,}
= e2(n, -
I)!/{e(e
and we seek to prove by induction
the formula
Pr(n,A,
l)!/{e(e
, n,A,}
Suppose Wn,-%
= eye, -
i)!(n2 -
(A2) is true for n = m. Then , (m + 1) -4,) = Pr{nA - WnlAl = Pr{n,A,
+ 1) . .. (e + n1 + n2 -
1)).
(A4
, m-4, , lAl)
, m-4, ,lA,l , “A,} - Pr{(n + 1) A, , mA,}
iy/{e(e
- eyn, -
l)!(m -
= eynl -
l)!d/{e(e
iyj{e(e
+ 1) --' (e + n, + 171-
1
+ 1) ... (e + n, + m)} iy/{e(e
+ 1) ... (e + fir + m)}
+ 1) ... (e + n, + m)>,
that (A2) is true for arbitrary
i)! ... (nk -
641)
since
, mA2) - Pr{n,A,
- 02f2,!(m -
= @(n, -
finding on the (j + 1)-th is O/(0 + j), and that the that of any other ordered of the various alleles.
+ 1) . . . (e + z,)},
- Pr{nA , WA,, IA31 = eyn, - l)!(m - i)!/{e(e
we see by induction induction on k that
107
ALLELES
+ 1) ...
(e +
n2 . We now aim to show by
n, + ... + n, -
1)).
643)
If (A3) is true for some value of k, we have
= Ok+l(nl Proceeding WA
l)! ... (nlc -
ly/{e(e +
1) ...
(e +
n1 + ... + nk)}.
as above we find easily that
,. . ., nk+l&+ll = P+l(nl - l)! ... (nle+l -
iy/{e(e
+ 1) ... (e + n, + ... + nk+l -
Comparison of this formula with (A3), together indicates that (A3) is true for all k.
1)).
with the truth of (A3) for k = 1,
108
EWENS
The probability required in Eq. (29) will differ from that given in (A3) only be a combinatorial multiplicative factor independent of 8. Thus Eq. (29) is true, and exact evaluation of g(nl ,..., nk) is unnecessary. It remains to establish Eq. (30). W e make the condition that the sample of 2n genes has yielded exactly K allelic types. The probability that these appeared as a run of n1 of one type, followed by a run of ns of a second type, and so on, is found by dividing the right-hand side in (A3) by that in (23) (putting i = R), and is thus (nl -
I)! ... (Q -
l)!/&
.
Now suppose that these K different types had been labelled in some conventional manner as A, ... A, . The probability, given k and 2n, that the genes had been sampled as a run of n1 of A, , followed by a run of n2 of A, , and so on, is thus (n, -
l)! ... (nlE -
l)!/k!Z,
.
Finally, we wish to disregard the order in which the genes were drawn to compute the conditional probability, given k and 2n, that there were n, genes of type A1 , etc., in any arbitrary order. This will be found by myltiplying n2 of type A2 , (A4) by (2n)!/(n,!n,! ..f n,!) and
is thus (2n)!/{k!Z,n,n,
... nB}.
645)
This is the required Eq. (30). It should be kept in mind that Eq. (A5) relates to lubelled alleles. That is, once k is given, it is assumed that the alleles are labelled A, .a’ A, in some conventional manner: the probability (A5) is then the probability that there are n, genes of the ellele labelled A, ,..., nK of the allele labelled A, . A further consequence of equation (A5) is that a higher probability attaches to a case where n, ... nk differ widely than to a case where n, ... nk are close: this agrees generally with the discussion following Eq. (7) although the parallel is not immediate as one must take account also of the number of possible permutations for any configuration of n1 ,..., nk .
APPENDIX 3 The following FORTRAN program prints out values of ri [see Eq. (23)] for any 0 and any n (n < 1000). It is required merely to insert the desired values at statements 1 and 2. (Note that minor program changes may be needed for some computers.)
SAMPLING
DIMENSION 1
THETA
2
N= P(1) : 1. Q(1) = 1. M=2*N DO3K=2,M P(K) = 0. Q(K) = 0.
3
CONTINUE
4 X
SELECTIVELY
NEUTRAL
ALLELES
109
P(2OOO), Q(2000)
=
FORMAT (* VALUE OF K PROBABILITY CUM PROBABILITY *) PRINT 4 DO 6 J=2,M A=J-I Q(1) = Q(1) * A/(THETA + A) DO 5 I = 2, J K=I--1 Q(1) = (A c P(1) + THETA c P(K)/(THETA
+
4
CONTINUE DO 6 I = 1, J P(I) = Q(I) CONTINUE CUM =O. K=O K=K+l CUM = CUM + P(K) PRINT 8, K, P(K), CUM FORMAT IF (CUM.
(I 10, 2F 21.6) LT. 0.99999) GO TO
7
CONTINUE END APPENDIX 4 The following program prints out values of B, L, F and degrees of freedom for F (see text for definitions) for any set of data required. Values of n and K should be inserted in the first statement and values for x1 , x2 ,... in the fifth, sixth,..., statements. In the unlikely event x1 = xs = ... xk , the user should not proceed with this program, and may assume immediately that his data yield significant
110
EWENS
evidence of heterotic selection. The program is written for the Univac 1108 computer of the University of Wisconsin and may require minor modifications for other computers. Double precision should be used if possible. PARAMETER N =, K = PARAMETERNN=2*N,Nl=NN-l,Kl=K-l,K2=K-2 DIMENSION W(K, NN) DIMENSION X(K) X(1) = X(2) = . .. D=K Dl = Kl SUM = 0. DOlI=l,K A = X(1) B = ALOG SUM=SUM-AAB 1
CONTINUE DO2I=l,K W(1, I) = 1.
2
CONTINUE DO31=1,Kl J=I+l DO3L= J,K W(L, I) = 0.
3
CONTINUE DO41=2,NN Z=I W(1, I) = 1./z
4
CONTINUE DO51=2,K J=I+l Z=I Y=M W(1, M) = (Y -
5
CONTINUE SIGA = 0. SIGB = 0. SIGC = 0. SIGD = 0.
1.) * W(1, M -
1)/Y ‘t
Z * W(1 -
1, M -
1)/Y
SAMPLING
SELECTIVELY
NEUTRAL
ALLELES
DD=NN DO 6 I = Kl, Nl J=NN-I A=J C = ALOG PP = A/DD R = ALOG Pl = 1. - PP Rl = ALOG(P1) SIGA = SIGA + C * W(KI, I) SIGB=SIGB+A*C*CrC*W(KI,I) SIGD = SIGD + (PP * R + Pl * Rl) * (PP c R + Pl * RI) * W(K1, 1 (A * W(K NN)) 6
7
CONTINUE SIGA = SIGA * D/W(K, NN) SIGB = SIGB * D/W(K, NN) AVINF = ALOG - SIGA/DD IF (K.EQ.2) GO TO 10 Ll=NN-K+l DO71=1,Ll L2=Ll+l-I DO7J=l,L2 L3=NN-IJ SIGC = SIGC + ALOG(1) * ALOG( CONTINUE SIGC = SIGC c D * Dl/W(K, BV = SIGB + SIGC - SIGA VARINF = BV/(DD * DD) GO TO 11 = SIGD
- AVINF
J) * W(K2,
NN) t SIGA
10
VARINF
t AVINF
11
CONTINUE SDINF = SQRT(VARINF) V = (SUM - AVINF)/(SDINF) BB = ALOG X = SUM/BB TX = AVINF/BB VAX = VARINF/(BB * BB) DFI = 2. * TX * (TX * (1. - TX)/VAX DF2 = DFl * (1. - TX)/TX FR = (1. - TX) * X/(TX c (1. - X)) PRINT 9
-
1.)
L3)
111
I)/
112
EWENS
9
FORMAT (IHI, 20X, 15H VALUE OF B ,5X, 15H VALUE OF L ,5X, 15H F RATIO ,5X, 30H DEGREES OF FREEDOM OF F) PRINT 8, SUM, V, FR, DFl, DF2 8 FORMAT (IHO, 20X, 3(F15.4, 5X), 2F15.4) END 1
ACKNOWLEDGMENTS My principal debt is to James F. Crow, who has made many valuable suggestions and given considerable help during this research. The work was carried out under the hospitality of K. Kojima, who also contributed many suggestions. I have had invaluable aid from Charles Langley and Professor F. N. David. The problem of construction of an index function and of forming objective statistical tests of the neutrality hypothesis was suggested by R. K. Selander. The linal draft of this paper was prepared after the author received many incisive comments by S. Karlin. Professors Karlin and McGregor have also provided in an accompanying paper proofs of several key results used in this paper.
REFERENCES CROW, J. F. AND KIMURA, M. 1964. The number of a finite population, Genetics 49, 725-738. DAVID, F. N. AND BARTON, D. E. 1962. “Combinatorial EVENS, W. J. 1969. “Population Genetics,” Methuen, EWENS, W. J. AND LANGLBY, C. 1972. To appear. KARLIN, S. AND MCGREGOR, J. L. 1967. The number a population. In “Proc. 5th Berk. Symp. Math. Stat. University of California Press. KARLIN, S. AND MCGREGOR, J. L. 1972. Addendum to Biol. 3, X X-X X (following paper). WRIGHT, S. 1931. Evolution in Mendelian Populations,
alleles
that
can be maintained
in
Chance,” London.
Griffin,
of mutant and Prob.,”
forms maintained in Vol. IV, pp. 415-438,
a paper
of W. Ewens,
Genetics
16, 97-159.
London.
Theor.
Pop.