THEORETICAL
POPULATION
BIOLOGY
The Eigenvalues IV. Department
7,
212-220 (1975)
of the Neutral
of Statistics,
Process
J. EWENS AND K. KIRBY
of Biology, 1Jmbersity of Pennsylvania,
Department
Alleles
Princeton
and University,
Philadelphia,
Pennsylvania
19174
Princeton, New Jersey 08540
In this note we obtain the full set of eigenvalues in the (Wright) neutral alleles model. The mode of derivation gives an insight into the nature of the corresponding right eigenvectors, and generalizes readily to models other than Wright’s. It also gives information about the distribution of the population homozygosity, and further suggests an approach to the choice of test statistic for testing for selective neutrality. An alternative and quicker derivation of the eigenvalues is provided in a companion paper (Karlin and Avni, 1975).
1. BACKGROUND
We consider a selectively neutral locus in a diploid population of fixed size N (2N genes), admitting allelic types A, , A, , A, ,... . Any gene is assumed to mutate with probability u, all new mutants being assumed to be novel allelic types. Specifically, we assume that if, in any generation, there exist ai genes of type Ai (i = 1, 2,...,p, C ai = 2N), then the probability that the following generation there will exist mi genes of type Ai , together with m, new mutants, is Pr[{a, , a, ,..., a,> -+ {m, , m, ,..., m,>] =
urn0fil 1% (2N)! lJi”=, rn
(1 - ~,tm’. (1)
We call (1) the Wright model for neutral alleles; other models are possible but will not be considered in detail by us. Our interest in considering eigenvalues is to discuss the rate of approach to stationarity of the system governed by (1). H owever, the concept of stationarity in this system is not immediately transparent. Equation (1) shows that any specified allele, whatever its current frequency, eventually leaves the system with probability 1. Stationarity cannot then have any nontrivial meaning for specified labeled alleles. Rather, the only aspect of the process (1) to which a nontrivial 212 Copyright All rights
0 1975 by Academic Press, Inc. of reproduction in any form reserved.
NEUTRAL
213
ALLELES
concept of stationarity applies (and the aspect of most interest in population genetics), is the set of configurations of allele numbers. Any such configuration is of the form {a, , a2 ,..., a,}, where C aj = 2N; in the above configuration there are a, genes of one type, u2 of a second type,..., a, of an ith type. Note that ordering within configurations is unimportant; we regard, for example, ia, , a2 3.e.9ai} and {a, , a, ,..., ai} as being identical. A configuration is thus simply a partition of the number 2N into positive integers, and the number of possible configurations is the number p(2N) of such partitions. Values of p(i) for i = 1) 2, 3,..., 500, together with asymptotic formulas, are given by Abramowitz and Stegun (1965, pp. 836837); values for small i are exhibited in Table I. TABLE
I
Number p(i) of partitions of i i
1 2 3 4 5 6 7 8 9 10 100 200
P(i)
1 2 3 5 I I1 15 22 30 42 190, 569,292 3,972,999,029, 338
P(i) - PG -
1)
1 1 1 2 2 4 4 I 8 12 21, 339,417 436,926, 597,263
In the remainder of this note we consider 2N to be fixed from one generation to another and for typographical convenience will write P for p(2N); further, we shall use the words “configuration” and “partition” interchangeably in what follows. We define the “configuration process” as the Markov chain describing the transitions of the population configuration in one generation to the configuration in the following generation, This process has P states,is aperiodic and irreducible, so that one eigenvalue of the transition matrix Q is unity and the remaining eigenvalues are less than unity in absolute value. Our aim is to find the P eigenvalues of this matrix, since these describe in a sense the rate at which the process in question reaches equilibrium. The elements qjk of Q depend on the model (1) and are extremely complex and, even for small N, the transition matrix Q is very large. We have nevertheless been able to find all the eigenvalues of Q. They are provided by the statement of the following theorem.
214
EWENS AND KIRBY
T~~0~~~.D~Jin~h,=l,h~=(I-~)i(l-1/2N)(1-2/2N)~~~(l-(i-l)/2N),
i = 2, 3, 4 ,..., 2N. Then A1, A, ,..., A,, are the eigenvalues of Q, the eigenvalue hi arising with multipltity p(i) - p(i - l), i = 2, 3,..., 2N. Remark. The multiplicities p(i) - p(i - 1) are independent of N, so long as i < 2N. They may therefore be listed once and for all ; representative values are given in Table I.
2. NOTATION AND MATHEMATICAL BACKGROUND The most crucial observation in our analysis is that p(i) - p(i - 1) is the number of partitions of the number i in which the number 1 does not enter into the partition. Thus, for example, p(6) -p(5) = 4, the partitions of 6 not involving 1 being {6}, (4, 21, (3, 3) and (2, 2, 2). (The remaining seven partitions of 6 are f5, 11, (4, 1, I), (3, 2, 11, 13, 1, 1, 11, {2,2, 1, 11, 12, 1, 1, 1, 11 and (1, 1, 1, 1, 1, 1)). We now consider all the P partitions of 2N, ordering these partitions as follows: {1, 1, l,..., l}, (2, 1, 1, I,..., I}, (3, 1, 1, I)..., l}, (4, 1, 1, l,...) I}, (2, 2, 1, 1, I,..., l}, (5, 1, 1, l,..., l}, (3, 2, 1, 1, l,..., I} ,..., (2 2, 2 ,..., 2). The rule in establishing this ordering of partitions (from the second partition on) is as follows. We consider in turn the numbers 2, 3,4 ,..., 2N. For any i, 2 < i < 2N, we consider the p(i) - p(i - 1) partitions of i into integers greater than or equal to 2. These p(i) - p(i - 1) partitions are written down in dictionary order; that is, the partition {a, , a2 ,..., a,} (al 3 as 3 ... > a,, C ak = i) precedes the partition {cr , c2 ,..., ctj (cr > c, 3 ... > ct , C ck = i) if a, = c1 , a2 = c, ,.**, aj-l = cjml , but aj > cj for somej > 1. The total partition of 2N is then completed, if necessary, by a sequence of 1’s. We shall have occasion later to refer to thejth partition (j = 1, 2, 3 ,..., P) of 2N; when we do this, we will assume the ordering of partitions devised above. We now define “sample sets” in a parallel fashion, as follows. Let b, 3 b, 3 ... > bi > 2 (C bi < 2N) be positive integers greater than or equal to 2. We may construct an initial sample set {I}, and then P - 1 additional sample sets (21, (31, {4), (2, 2], {5), (3, 2>,..., (2, 2, 2Y.7 21
(2)
from these integers, where we have used the same dictionary ordering as devised above. We number these P sample sets 1,2, 3,4 ,..., P, respectively. Suppose the ith such sample set is {cr , c2, ,..., c~}. We now form vectors 2x1... Trp) where the jth element nj’ji of xi is the probability r[{cl , c2 ,..., clc>1j] that a sample of cr + c2 + ... + ck genes taken without replacement from the
NEUTRAL
ALLELES
215
population when it is in configuration j should yield cr genes of one allelic type, c2 of a second type,..., cp of a kth type. Thus
0 i:
7dPl I 11 77K4I 21
1 1
?c1=
i.
)
~c2=
mj Ipi
We now establish two lemmas. LEMMA
1. The vectors x1 ,..., zp are linearly independent.
Proof. The matrix {x1 , x2 ,..., xP} is lower triangular, with positive main diagonal entries. b, ,..., b, LEMMA 2. Leta, ,a2 ,..., uk(ai > l,Ca, < 2N)beintegersandb,, (bi > 2, C bi < C ai , m < k) be integers greater than unity. Then we can express ~[{a, ,..., ale} lj] as a linear combination of the m[(b, ,..., b,) 1j] together with a constant, where neither the constant nor the coeficients in the linear combination depend on j. More precisely,
where the summation is over all vectors {b, ,..., b,) obeying the constraints noted, and the C, (s = 0, 1, 2 ,...) are independent ojj. Proof. The lemma is trivially true if ai > 2 (i = l,..., k). We suppose it true when Y of the ai are unity, and show that this implies its truth when Y + 1 of the ai are unity. Suppose that in the set {q , aa ,..., a,}, Y of the ai are unity, and letp(a,,..., uk \j) be the (ordered) probability that a sample of a, + a2 + ... + ak genes sampled without replacement from a population in configuration j produces, in order, a, genes of one type, followed by a2 of a second type, and so on, concluding with ak genes of a kth type. If a further gene is now sampled, it will be either of a new (k + 1)th type, or of one or other of the k types already observed. Thus,
. by a combinatorial Now TT[{U~,..., ak} 1j] differs from p(q ... uk 11) plicative factor independent ofj, so that, after rearranging,
multi-
216
EWENS AND KIRRI-
where the constants Di are independent of j. Since by assumption the terms on the right-hand side may be expressed as in (4) so may also the left-hand side. The lemma is now true by induction.
3. PROOF OF THEOREM Our proof of the theorem employs an extension of a method due to Malecot (1948); see also Kimura and Crow (1964). I n considering the probability F,* that two genes taken at random in any generation should be of the same allelic type, given the model (l), Malecot argued that these genes would be of the same allelic type only if neither was a mutant, and if also they were copies of the same gene from the parent generation or of two different genes of the same type. This leads to the equation F,* = (1 - 4”[(1/2N)
+ (1 - (1/2N))F,],
where F, is the probability that two genes taken at random in the parent population should be of the same allelic type. In our notation, this equation becomes 77*W
Ii1 = (1 - 4”[UPW + (1 - (VW) 7421Iill,
(5)
where n*[{2} 1j] is the probability that two genes taken at random in a daughter generation should be of the same allelic type, given that the parent generation was in configuration j. Now the element qik of the matrix Q whose eigenvalues we seek is the probability that, under the model (l), the population changes to configuration k in the daughter generation, given it was in configuration j in the parent population. Further, we have 7~*[(2} 1j] = 1 n*[{2) / daughter generation in configuration k] qik = i 42)
I 4 qa >
(6)
since at equilibrium parent generation and daughter generation have the same probabilities. If we form a vector rF2*lT = r7T*[pl
I ll,..., ~“W) I m,
then we may form from (6) the vector equation *=2 - Qxz> using the notation of (3). However we also have, from (5), 7t2* = (2N)-y 1 - up1 + x2x2 ,
(7)
NEUTRAL
217
ALLELES
so that
Qn2 = (2N-l)(I
- u)2x1 + X,x, .
(8)
Since also
we find, after elementary manipulations,
where a = (2N)-l(l - ~)~(l - ;\a)-l. Equation (9) shows that h, is an eigenvalue of Q, with corresponding eigenvector x2 - olxr . Since geneticists have mainly been interested in the probability of homozygosity of a diploid individual, this argument has never been carried further. However it is clear that an analogous argument can be made for three identical daughter generation genes. Since these three genes must all be descended from one gene (probability (2N)-a), two genes (probability 3(2N)-l(l - 2N)-1)) or three genes (probability (1 - (2N)-I)(1 - 2(2N)-l)), we find ~*[(3} Ij] = (1 - u)~[(~N)-~ + 3(2N)-l(1 - (2N)-l) n[{2} 1j] + (1 - (2~)~‘)(l
- 2(2N)-‘) 7${3} lj]].
Proceeding as above we find eventually
where /3 and y are constants whose exact values can be calculated easily but which are unimportant. Clearly ha is an eigenvalue of Q. A similar argument, considering four identical daughter generation genes, shows that
for certain constants 6, E, 7 whose exact values are again unimportant. In the case of four genes, a second configuration of interest arises, namely (2, 21, i.e., two genes of one allelic type and two of another. Four daughter generation genes of this nature must be all nonmutants, and will descend from two parent genes of different type (each passed on twice), three parent genes (two of one type and one of another, the singleton passed on twice) or four different parent genes, two of one type and two of another. We then find n*[{2, 2) 1~1= h,4{2,2]
Ijj + ~~742, 11Ijl
+ c27dU9 11Ii1
218
EWENS
AND
KIRBY
for certain constants cl and ca independent ofj. Using Lemma 2, we may write this as ~*[(2,2} I j] = &4{2,21
Ijl +
cd33
131+ c~[Pl
Ijl + c5
)
(11)
where ca , c, and c5 are again constants independent of 1.. Proceeding as above we find
Q(n5+
6~s + p=z + 7x1) = UT
+ 0~3 +
~2
+ 7x1)
WI
for certain constants 8, p and 7. Use of Lemma 1 shows that x5 + &a + ~LX~+ mi is linearly independent of rr4 + Srr, + crc2+ ~rri . Equations (10) and (12) then show that the eigenvalue & occurs with multiplicity at least two. The remainder of the proof continues along the above lines. When we have j genes in the daughter generation, we consider the p(j) - p(j - 1) partitions of j into integers > 2. In each case recurrence relations like (I 1) arise. The eigenvalue X, arises with multiplicity p(j) - p( j - 1) since the various eigenvectors corresponding to it, viz. x,(+i)+r
+ a linear combination of xi , =z ,..., xs(+i) ,
x~(~-~)+~+ a linear combination of rrr , ~a ,..., 7~~o-i) , X~(~)+ a linear combination of xi , x2 ,..., zr9(+i) are linearly independent. This completes the proof of the theorem, since by this method all p(2N) ei genvalues are accounted for with hi having multiplicity of exactly p(i) - p(; - 1).
4. REMARKS We conclude by making several observations. Suppose we consider a “finite” process analogous to the above, in which a large but finite number p of alleles are possible at the locus in question. A gene will mutate, with probability u/p, to a specified allelic type other than its own (all p - 1 alternative alleles having equal probability). This process takes place on a space of dimension (p + Z%&-- 1) and admits (p + %& - 1) eigenvalues. These eigenvalues have been found by Karlin and McGregor (1964); they are identical to the above set except that an additional eigenvalue 1 - u arises (with multiplicity p - l), while the multiplicity of X, (i = 2, 3,..., 2N) is (i + p - 2)!/i!( p - 2)! The essential difference between the two processes is that Karlin and McGregor’s process preserves the labeling of the alleles, whereas the configuration process does not. We are therefore able to confirm the conjecture of Ewens and Gillespie (1974) that 1 - u is a “labeling” eigenvalue and to note that the larger nonunit eigenvalue of the
NEUTRAL
ALLELES
219
configuration process (which describes in a sense the rate of approach to equilibrium) is rather smaller than that for the “labeled” process. This result has practical relevance when the question of stationarity for real-world data arises. A second remark also relates to the rate of approach to stationarity, and has has been made to us by Gillespie (personal communication). Perusal of Table I indicates that the multiplicities of the smaller eigenvalues are much larger than those of the larger eigenvalues. It is thus possible that in the spectral expansion of the configurations transition matrix, most weight is placed on the smaller eigenvalues (since these will have the great majority of spectral matrices in the expansion), again leading towards a more rapid convergence to stationarity. However, a full discussion of this possibility must also take into account the as yet unknown left eigenvectors. A third remark concerning rate of approach to stationarity is that our results are relevant to the problem of what test statistic should be used to test whether a sample of genes from a population yields evidence that the population behavior follows a selectively neutral model such as (1). The main problem with testing this hypothesis is the the system, although (say) currently governed by (l), is not at its stationary state due to disturbances arisen in the recent past. The equations we have derived show that certain functions of the allele frequencies attain their stationary values more quickly than do other functions. It would be logical to use as test statistics those functions which tend to return to stationarity at greatest speed. This matter is complex and will be taken up elsewhere. Fourthly, the above mode of argument will generalize readily to cover population models other than (1). Thus, Lemmas 1 and 2 hold true independent of the population model used, as will be the number p(2N) of configurations possible. There will always be 2N distinct eigenvalues, the ith largest appearing with multiplicity p(r) - p(i - 1); all that remains is to calculate the actual values by using methods similar to the above. Note that the evaluation of the eigenvectors will be relevant to the third remark made above, concerning choice of test statistics. This remark will also be taken up elsewhere. Fifthly, we note in passing that our calculations enable us to find, as a byproduct, exact moments of the random variable C Xi2, where X1 , X, ,..., are the (random) allele frequencies in any generation. Thus,
$wq =E[CX,(X,-&)I +& = Pr[{2)] + (1/2N), and this latter expression can be calculated exactly, from (5). Similarly, the second moments can be calculated exactly from (10) and (11). We do not pursue this matter further here other than to note that if terms of order N-l are ignored, these moments agree exactly with those already calculated by Stewart (1974).
220
EWEKS
AND KIRBY
Clearly, while the main result of this note appears as a mathematical theorem, it has relevance in a number of diverse applied areas. Perhaps its most important potential use is in the development of a test statistic for nonneutrality which to some extent overcome the problem of nonstationarity, a problem hitherto described as “intrinsically insoluble” (Ewens and Gillespie, 1974).
REFERENCES M. AND STEGUN, I. 1965. “Handbook of Mathematical Functions,” Dover, New York. EWENS, W. J. AND GILLESPIE, J. H. 1974. Some simulation results for the neutral allele model, with interpretations, Theor. Pop. Biol. 6, 35-57. KARLIN, S. AND AVNI, H. 1975. Derivation of the eigenvalues of the configuration process induced by a labeled direct product branching process, Theoy. Pop. Biol. 7, 221-228. KARLIN, S. AND MCGREGOR, J. L. 1964. Direct product branching process and related Markov chains, PYOC.Nut. Acad. Sci. 51, 598-602. KIMURA, M. AND CROW, J. F. 1964. The number of alleles that can be maintained in a finite population, Genetics 49, 725-738. MALBCOT, G. 1948. “Les Mathematiques de I’HerCditC,” Masson, Paris. STEWART, F. M. 1974. Variability in the amount of heterozygosity maintained by neutral mutations, to appear.
ABRAMOWITZ,