THEORETICAL POPULATION BIOLOGY 4, 395-417 (1973)

Estimation of the Amount of Inbreeding Due to Subdivision of a Population

C. MATESSI AND S. D. JAYAKAR

Laboratorio di Genetica Biochimica ed Evoluzionistica del C.N.R., c/o Istituto di Genetica dell'Università di Pavia, Pavia, Italy

Received June 16, 1970
In a large population which is subdivided into isolated or partially isolated subpopulations polymorphic for a gene locus, there is an excess of homozygotes due to the subdivision. This excess increases with the variance of the gene frequency. The excess can be measured by the "coefficient of inbreeding." The aim of this paper is to estimate this coefficient, which is a function of various population parameters. We suggest several different estimates, each of which is the same function of unbiased estimates of the population parameters. These estimates are shown to be consistent. They have been compared by numerical methods among themselves and with two other estimates suggested previously.
1. INTRODUCTION
In a random mating population of sexual diploid organisms, the proportion of individuals heterozygous at a gene locus is characteristic and depends only on the gene frequencies of the alleles at that locus. In a number of theoretical or real situations, however, there is a deviation from panmictic proportions, i.e., an excess or a deficiency of heterozygotes. A very simple situation which leads to a deficiency of heterozygotes, and which is very likely to occur in nature, is that where a large population is subdivided into smaller almost isolated random mating units or subpopulations. If we ignore this subdivision and look at the proportions of heterozygotes in the whole population, we will find an excess of homozygotes. This result has been proved by Wahlund (1928) and the proof runs as follows. Suppose that a population is subdivided into K subpopulations completely isolated from one another. Within each isolate, matings occur at random. The sizes of these isolates are equal. Consider now a locus with two alleles, A and a, with frequencies p_i and q_i (= 1 - p_i), respectively, in the i-th isolate. The frequencies of AA and Aa individuals, respectively, in the i-th isolate
Copyright © 1973 by Academic Press, Inc. All rights of reproduction in any form reserved.
will be x_i = p_i² and y_i = 2p_iq_i. The frequencies of AA and Aa individuals in the whole population will then be

x̄ = Σ x_i/K = Σ p_i²/K = p̄² + var(p),

ȳ = Σ y_i/K = 2 Σ p_iq_i/K = 2p̄q̄ - 2 var(p),
where p̄ = (Σ p_i)/K and q̄ = 1 - p̄. We see also that the excess of homozygotes increases with the variance of the gene frequency. The increase in the proportion of homozygotes with respect to random mating could also have been produced by an amount of inbreeding such as to give an inbreeding coefficient (Wright, 1943)

F = var(p)/{p̄(1 - p̄)}.   (1.1)
Because of this analogy with inbreeding, the quantity defined by (1.1) is more useful than the variance of the gene frequency as a description of the population. It has the further advantage of being in a way standardized with respect to the average gene frequency of the population, and therefore more informative if we wish to compare populations. There are many processes which could cause variation of gene frequencies between isolates within a population, for example random drift. The study of these processes is an appealing one for population geneticists. When an experimental population geneticist wishes to describe such a process, he will probably want to measure the amount of variability that it produces. As an example of such attempts at measurement, see Cavalli-Sforza (1965, 1966). In the present paper we suggest ways in which to estimate the quantity defined in Eq. (1.1), i.e., the amount of inbreeding due to the subdivision of a population.
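Wahlund's computation above can be illustrated numerically. The following sketch uses two equal-sized isolates with assumed allele frequencies (0.2 and 0.8; these values are chosen for illustration, not taken from the paper):

```python
# Two isolates of equal size with assumed allele frequencies of A.
p = [0.2, 0.8]
K = len(p)
p_bar = sum(p) / K                                  # average gene frequency
var_p = sum((pi - p_bar) ** 2 for pi in p) / K      # variance among isolates

# Pooled genotype frequencies over the whole population.
x_bar = sum(pi ** 2 for pi in p) / K                # AA: p_bar**2 + var_p
y_bar = sum(2 * pi * (1 - pi) for pi in p) / K      # Aa: 2*p_bar*q_bar - 2*var_p

F = var_p / (p_bar * (1 - p_bar))                   # Eq. (1.1)
print(x_bar, y_bar, F)                              # 0.34 0.32 0.36
```

The pooled homozygote frequency (0.34) exceeds the panmictic value p̄² = 0.25 by exactly var(p) = 0.09, as Wahlund's identities require.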
2. STATEMENT OF THE PROBLEM
Consider a large population subdivided into many isolates. The gene frequency at a certain locus varies among isolates, and in the whole population, therefore, this will result in an excess of homozygote individuals. We are interested in measuring the amount of inbreeding in the population which corresponds to this excess. Usually such a measure will be subject to sampling errors, in particular because (a) we will be able to sample individuals only from a limited number of the isolates and (b) not all individuals in the isolates will be tested. Further, any real situation is bound to be more complicated than the simple one described in the Introduction. For example, the isolates may vary in size,
and there may not be random mating within the isolates. We can then find reasonable estimates of the inbreeding only if we define certain random variables on which the estimate is likely to depend. In other words we need a precise mathematical model of the system from which observations will be made. The model should also be general enough to be readily applicable in real situations. The model which we will use is as follows.

(i) The population consists of an infinite number of subpopulations (isolates) which are completely isolated from one another. The number of diploid individuals in an isolate is given by N, where N is a random variable with some probability distribution.

(ii) With respect to a certain locus with two alleles A and a, the three genotypes are distinguishable. Within an isolate the frequencies of the three genotypes are X, Y, and Z, and the frequency of the allele A is P = X + (1/2)Y. X, Y, and Z are random variables, i.e., they vary among isolates according to some probability distribution.

(iii) The mating system is identical in all isolates, but mating is not necessarily at random. In general, therefore, there is a certain inbreeding coefficient f which is the same for all isolates. Since the isolate size N is not infinite, it is not true in general that X = P² + fP(1 - P), etc. Let π denote the frequency of A among the parents of a given generation in a particular isolate. π is of course a random variable, but given N and π, we can assume that X, Y, and Z are multinomial variables in a sample of N individuals with the probabilities π² + fπ(1 - π), etc.

(iv) The variables N and π can be either correlated or uncorrelated. Such a correlation has been found, for example, by Thoday (1963) in a laboratory population of Drosophila melanogaster.

(v) In order to estimate the inbreeding due to subdivision we choose k isolates at random. From the i-th isolate we sample n_i individuals at random without replacement. The three genotypes are represented in our samples in the proportions x_i, y_i, and z_i, respectively. The sample gene frequency of A is p_i = x_i + (1/2)y_i; that of a is q_i = 1 - p_i. We assume that N_i, the size of the i-th isolate sampled, is also known.

(vi) Given π_i, the frequency of A among the parents of a given generation in the i-th isolate sampled, x_i, y_i, and z_i are multinomial random variables in a sample of n_i with probabilities π_i² + fπ_i(1 - π_i), etc. This assertion can be proved as follows. Let A, B, and C be nonnegative integer random variables such that A + B + C = N. Let their joint probability distribution be multinomial with probabilities p, q, and r, respectively (p + q + r = 1), i.e.,

Pr(A, B, C) = (N!/(A! B! C!)) p^A q^B r^C.
Let a, b, and c be nonnegative integer random variables such that a + b + c = n, and such that their joint probability distribution conditional on (A, B, C) is hypergeometric, i.e.,

Pr(a, b, c | A, B, C) = C(A, a) C(B, b) C(C, c) / C(N, n),

where C(·, ·) denotes a binomial coefficient. The probability of (a, b, c) unconditional on (A, B, C) is therefore

Pr(a, b, c) = Σ Pr(a, b, c | A, B, C) Pr(A, B, C),

where the sum extends over all (A, B, C) such that A ≥ a, B ≥ b, C ≥ c, and A + B + C = N. Then

Pr(a, b, c) = [n!/(a! b! c!)] p^a q^b r^c Σ [(N - n)!/((A - a)! (B - b)! (C - c)!)] p^(A-a) q^(B-b) r^(C-c)
            = [n!/(a! b! c!)] p^a q^b r^c,

since the sum is extended over all values of A - a, B - b, and C - c such that A - a ≥ 0, B - b ≥ 0, C - c ≥ 0, and (A - a) + (B - b) + (C - c) = N - n, so that it is the multinomial expansion of (p + q + r)^(N-n) = 1.
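The identity just proved, that a without-replacement subsample of a multinomial sample is itself multinomial, can also be checked by simulation. In the sketch below the population size, subsample size, probabilities, and seed are arbitrary choices of ours:

```python
import random
from math import factorial

random.seed(1)
N, n = 50, 5                       # isolate size and subsample size
probs = [0.2, 0.5, 0.3]            # genotype probabilities p, q, r
reps = 20000
target = (1, 3, 1)                 # one particular outcome (a, b, c)

hits = 0
for _ in range(reps):
    pop = random.choices([0, 1, 2], weights=probs, k=N)  # multinomial(N; p,q,r)
    sub = random.sample(pop, n)                          # hypergeometric subsample
    if (sub.count(0), sub.count(1), sub.count(2)) == target:
        hits += 1

a, b, c = target
# Multinomial(n; p, q, r) probability n!/(a! b! c!) p^a q^b r^c.
exact = (factorial(n) // (factorial(a) * factorial(b) * factorial(c))) \
        * probs[0] ** a * probs[1] ** b * probs[2] ** c
print(hits / reps, exact)          # empirical frequency close to exact = 0.15
```

The empirical frequency of the subsample outcome agrees with the multinomial probability, as the derivation predicts.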
(vii) Since the population consists of an infinite number of isolates, the definition of quantities such as the overall frequency of the genotypes, for example, may not be immediately obvious. If we sample k isolates we would estimate this frequency, X_t say, by

X̄_t(k) = (N_1 X_1 + ... + N_k X_k)/(N_1 + ... + N_k).

On the other hand we know that X̄_t(k) improves as an estimate of X_t as k becomes larger. More precisely, we can define X_t as the limit of X̄_t(k) as k tends to infinity. This limit is of course finite since 0 ≤ X̄_t(k) ≤ 1 for all k. Now X̄_t(k) can be expressed in terms of sample moments of X_i and N_i, since

X̄_t(k) = (Σ N_i X_i / k)/(Σ N_i / k).
There is a theorem of mathematical statistics which states that any statistic which is a rational function of sample moments of random variables tends in probability to the same function of the population moments of the random variables (Cramér, 1946). We can therefore define X_t as E(NX)/E(N). The same argument applies of course to any other frequency we wish to use. For example, the frequency of A among mating individuals in the total population is defined as π_t = E(Nπ)/E(N). We can now state Wahlund's principle (see Introduction) in more general terms. In fact it follows from assumptions (iii) and (iv) above that

X_t = E{N[π² + fπ(1 - π)]}/E(N),

which reduces after some manipulation to

X_t = π_t² + (1 - f) E{N(π - π_t)²}/E(N) + fπ_t(1 - π_t).   (2.1)
The term E{N(π - π_t)²}/E(N) is a weighted variance of π; in particular, if N is uncorrelated with π, it reduces to var(π), since then π_t = E(π). It is necessarily positive and arises from the subdivision of the population, i.e., this subdivision contributes an excess of homozygotes with respect to a random mating population. If we denote by G the total inbreeding in the whole population, i.e., if we put X_t = π_t² + Gπ_t(1 - π_t), and denote by F the inbreeding due solely to subdivision, i.e.,

F = [E{N(π - π_t)²}/E(N)] / [π_t(1 - π_t)],  with π_t = E(Nπ)/E(N);   (2.2)

we have from (2.1) that

G = (1 - f)F + f.   (2.3)
Equation (2.3) demonstrates that F measures only that portion of the total inbreeding which is due to the subdivision of the population. In order to measure the total inbreeding, we must know the value of f.
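Definitions (2.2) and (2.3) can be written out directly for a finite collection of isolates. The isolate sizes and parental gene frequencies below are hypothetical values of ours, chosen only to show the computation:

```python
# Sketch of Eq. (2.2): F is a size-weighted variance of pi, standardized
# by pi_t(1 - pi_t), where pi_t is the size-weighted mean gene frequency.
N  = [120, 300, 80]       # isolate sizes (hypothetical)
pi = [0.30, 0.45, 0.60]   # parental gene frequencies (hypothetical)

pi_t = sum(Ni * pii for Ni, pii in zip(N, pi)) / sum(N)
wvar = sum(Ni * (pii - pi_t) ** 2 for Ni, pii in zip(N, pi)) / sum(N)
F = wvar / (pi_t * (1 - pi_t))

# Eq. (2.3): the total inbreeding G combines F with the within-isolate
# inbreeding coefficient f (an assumed value here).
f = 0.02
G = (1 - f) * F + f
print(pi_t, F, G)
```

With these numbers π_t = 0.438 and F is about 0.036; the subdivision alone would mimic that much inbreeding even if mating within isolates were random.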
3. ESTIMATION PROCEDURE

The inbreeding coefficient F, defined by (2.2), can be expressed as a rational function of the population moments E(N), E(Nπ), and E(Nπ²). It involves terms like {E(Nπ)}² which can be estimated unbiasedly. The procedure we adopt is to find unbiased estimates for as many as possible of these quantities and to estimate F by the same function of the estimates of these quantities.
The model set out in Section 2 involves two levels of sampling, i.e., (i) a random choice of k pairs of random variables (N_i, π_i) from a joint bivariate distribution which is generally unknown, even in its form, and (ii) for each pair (N_i, π_i) a drawing of a trinomial sample of n_i individuals which are of the three genotypes with proportions x_i, y_i, and z_i. The probabilities of the three genotypes are, respectively,

π_i² + fπ_i(1 - π_i);   2(1 - f)π_i(1 - π_i);   (1 - π_i)² + fπ_i(1 - π_i);

where f is an unknown constant. Correspondingly, the estimation also involves two levels. If φ is a function of E(N), etc., we estimate it by a function φ' of the k pairs (N_i, π_i). Since π_i is in fact unknown (though by our assumption N_i is known), we then estimate φ' by a function φ'' of the variables x_i, y_i, z_i, and n_i, with i = 1,..., k. If φ' is an unbiased estimate of φ and φ'' is unbiased for φ', given N_i and π_i, then φ'' is necessarily an unbiased estimate of φ, since

E(φ'') = E{E(φ'' | N, π)} = E(φ') = φ.

Equation (2.2) can be written as

F = 1 - [E(Nπ)/E(N) - E(Nπ²)/E(N)] / {[E(Nπ)/E(N)][1 - E(Nπ)/E(N)]}.   (3.1)

Unbiased estimates can easily be found for the quantities:

E(N);  E(Nπ);  E(Nπ²);  {E(Nπ)}²;  {E(N)}².
The first three of these quantities are population moments. Their first stage estimates are the corresponding sample moments. The second stage estimate for E(N) is of course the same as the first stage estimate, since the N_i are known. The first stage estimate of E(Nπ) is Σ N_iπ_i/k.¹ Since in the second stage π_i is estimated by p_i = x_i + (1/2)y_i, the second stage estimate of E(Nπ) is Σ N_ip_i/k. E(Nπ²) can be similarly estimated. In order to estimate π_i², it has to be remembered that p_i² is not unbiased for π_i². In fact

E(p_i² | π_i) = {4n_i(n_i - 1)π_i² - n_iE(y_i | π_i) + 4n_iE(p_i | π_i)}/(4n_i²).

Hence π_i² is unbiasedly estimated by

p_i² - [p_i(1 - p_i) - y_i/4]/(n_i - 1).

The last quantity, {E(Nπ)}², is estimated in the first stage by ΣΣ N_iN_jπ_iπ_j/k(k - 1).

¹ Here and elsewhere in the text, a single sum Σ extends over i = 1,..., k; a double sum ΣΣ extends over i, j = 1,..., k with i ≠ j.
The second stage samples are drawn independently in the different isolates, and hence N_iN_jπ_iπ_j (for i ≠ j) can be estimated by N_iN_jp_ip_j. The estimate of {E(N)}² is obtained in exactly the same way, and is therefore ΣΣ N_iN_j/k(k - 1). The first and second stage estimates for each of the five quantities are summarized in Table I.

TABLE I
First and Second Stage Estimates for the Various Population Moments

Estimand     1st stage estimate          2nd stage estimate
E(N)         Σ N_i/k                     Σ N_i/k
E(Nπ)        Σ N_iπ_i/k                  Σ N_ip_i/k
E(Nπ²)       Σ N_iπ_i²/k                 Σ N_i{p_i² - [p_i(1 - p_i) - y_i/4]/(n_i - 1)}/k
{E(Nπ)}²     ΣΣ N_iN_jπ_iπ_j/k(k - 1)    ΣΣ N_iN_jp_ip_j/k(k - 1)
{E(N)}²      ΣΣ N_iN_j/k(k - 1)          ΣΣ N_iN_j/k(k - 1)

An estimate of F can then be obtained by substituting in the expression (3.1), for each of the quantities, the corresponding estimates given in the last column of Table I. The estimate thus obtained is

M'₁₁ = est(F) = 1 - {Σ N_i[n_ip_i(1 - p_i) - y_i/4]/(n_i - 1)} / {Σ N_i} ÷ {Σ N_ip_i/Σ N_i - ΣΣ N_iN_jp_ip_j/ΣΣ N_iN_j}.   (3.2)
This estimate is quite general. It has been derived without any restriction regarding the correlation between N and π, or the system of mating within isolates, i.e., the value of f. It can therefore be used regardless of whether N and π are correlated or not. However, if we have a priori knowledge that N and π are not correlated, we can arrive at a simpler estimate of F. This follows from the fact that the definition of F in this case is simpler. Similarly, we can obtain a simpler estimate of F if we know that mating is at random within isolates, since in this case it is possible to obtain an estimate of π_i² which does not depend on y_i but only on p_i. These two estimates, conditional on a priori knowledge of the correlation between N and π and of the value of f, are derived in the Appendix. We have then 4 different estimates of F, namely (1) the general estimate M'₁₁, (2) an estimate M'₀₁ which assumes that N and π are uncorrelated, (3) an estimate M'₁₀ which assumes that f = 0 in each isolate, and (4) an estimate M'₀₀ which uses both these assumptions.
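The general estimate (3.2) is straightforward to transcribe into code. The sketch below is our reading of (3.2); the function name and list-based interface are our own. It uses the identity ΣΣ_{i≠j} N_iN_jp_ip_j = (Σ N_ip_i)² - Σ (N_ip_i)², and similarly for ΣΣ_{i≠j} N_iN_j:

```python
def m11(N, n, p, y):
    """General estimate M'11 of F, Eq. (3.2).  Inputs are lists of isolate
    sizes N_i, sample sizes n_i, sample gene frequencies p_i, and sample
    heterozygote proportions y_i (names assumed)."""
    # Unbiased second-stage estimate of E(N pi) - E(N pi^2), per isolate:
    # N_i [n_i p_i (1 - p_i) - y_i/4] / (n_i - 1).
    num = sum(Ni * (ni * pi * (1 - pi) - yi / 4) / (ni - 1)
              for Ni, ni, pi, yi in zip(N, n, p, y)) / sum(N)
    # Unbiased estimate of pi_t(1 - pi_t) via the cross-product terms (i != j).
    sN = sum(N)
    sNp = sum(Ni * pi for Ni, pi in zip(N, p))
    cross_p = sNp ** 2 - sum((Ni * pi) ** 2 for Ni, pi in zip(N, p))
    cross_N = sN ** 2 - sum(Ni ** 2 for Ni in N)
    den = sNp / sN - cross_p / cross_N
    return 1 - num / den
```

For example, with two isolates of sizes 100 and 200, samples of 10 and 20, sample frequencies 0.2 and 0.5, and heterozygote proportions 0.3 and 0.4 (all hypothetical), `m11` returns about 0.239.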
4. SOME PROPERTIES OF THE ESTIMATES

M'₁₁ is of course not unbiased for the inbreeding coefficient F. Under certain conditions, however, it is asymptotically unbiased and consistent. Let us assume first that we sample a constant number of individuals from each isolate, i.e., n_i = n for i = 1,..., k. The set (N_i, x_i, y_i, z_i), or equivalently (N_i, p_i, y_i), for i = 1,..., k is then a sample of k independent and identically distributed random vectors. We introduce now a few definitions:

m₁ = Σ N_i/k        μ₁ = E(N)
m₂ = Σ N_ip_i/k     μ₂ = E(Np)
m₃ = Σ N_iy_i/k     μ₃ = E(Ny)
m₄ = Σ N_ip_i²/k    μ₄ = E(Np²)
m₅ = Σ N_i²/k       μ₅ = E(N²)
m₆ = Σ N_i²p_i²/k   μ₆ = E(N²p²)

The set of m_j's is a set of sample moments. The μ_j's are the corresponding population moments. Under suitable assumptions regarding the distribution of the random vector (N, p, y), each m_j converges in probability to μ_j. The statistic M'₁₁ is a rational function of the sample moments m_j; in fact

M'₁₁(m) = 1 - {[n(m₂ - m₄)/(n - 1) - m₃/(4(n - 1))]/m₁} / {m₂/m₁ - (k²m₂² - km₆)/(k²m₁² - km₅)}.
Therefore M'₁₁(m) converges in probability to M'₁₁(μ) as k tends to infinity, under the same assumptions as above for the distribution of (N, p, y). It is easy to show now that M'₁₁(μ) = F, and therefore that the statistic M'₁₁ is an asymptotically unbiased and consistent estimate of F. Let us first consider the following quantity that appears in the numerator of M'₁₁(μ):

[n/(n - 1)](μ₂ - μ₄) - μ₃/(4(n - 1)) = [n/(n - 1)][E(Np) - E(Np²)] - E(Ny)/(4(n - 1)) = E(Nπ) - E(Nπ²),

by taking expectations of p and y conditional on N and π. In the denominator of M'₁₁(m) we have the quantity (k²m₂² - km₆)/(k²m₁² - km₅);
clearly as k tends to infinity this ratio tends to (m₂/m₁)², which converges in probability to (μ₂/μ₁)² = {E(Nπ)/E(N)}². Most sampling programs, however, will not have a constant sample size from isolate to isolate. It is however possible to prove that even when the n_i's are variable, M'₁₁ is an asymptotically unbiased and consistent estimate of F. These properties are of course shared by the estimators M'₀₁, M'₁₀, and M'₀₀.
5. A CORRECTION FOR THE BIAS OF THE DERIVED ESTIMATES

It is almost impossible to remove completely the bias of the estimators discussed above, which we will refer to as the group M'. We have tried however to remove a large part of it. Let us assume that N and π are uncorrelated and that f = 0, and consider the simplest estimate M'₀₀. In the special case when the samples drawn from the different isolates are of the same size, i.e., n_i = n for i = 1,..., k,

M'₀₀ = 1 - [2n/(2n - 1)] Σ p_i(1 - p_i) / [Σ p_i - ΣΣ p_ip_j/(k - 1)].   (5.1)

If we put

A = 2n/(2n - 1);
m₁ = Σ p_i/k;    μ₁ = E(m₁) = E(p) = E(π);
m₂ = Σ p_i²/k;   μ₂ = E(m₂) = E(p²);

and if we take k to be large enough, we can approximate (5.1) by

M'₀₀(m₁, m₂) = 1 - A(m₁ - m₂)/(m₁ - m₁²).

Now M'₀₀(μ₁, μ₂) = F. If we expand M'₀₀(m₁, m₂) in series about the point (μ₁, μ₂), disregard terms of power higher than 2, and take expectations, we obtain the following approximation:

bias(M'₀₀) = E(M'₀₀) - F = (1/2)D₁₁ var(m₁) + (1/2)D₂₂ var(m₂) + D₁₂ cov(m₁, m₂),

where D_rs denotes the second partial derivative of M'₀₀(m₁, m₂) with respect to m_r and m_s, evaluated at (μ₁, μ₂).
In particular D₂₂ = 0, since M'₀₀(m₁, m₂) is linear in m₂. Further,

var(m₁) = var(p)/k;
cov(m₁, m₂) = cov(p, p²)/k = {E(p³) - E(p)E(p²)}/k.

If we assume that the distribution of p is approximately symmetrical, we have

E(p³) = 3E(p²)E(p) - 2{E(p)}³.

Hence cov(m₁, m₂) = 2E(p) var(p)/k. It is then easy to show that

bias(M'₀₀) = A(1 - 2μ₁)²{var(p)}²/[k{μ₁(1 - μ₁)}³] - (1 - F) var(p)/[kμ₁(1 - μ₁)].   (5.2)

This bias can be estimated by the same function of the sample moments. In more general cases, when for example the n_i's are variable and/or when N and π are correlated, it is difficult to derive an expression similar to (5.2) for any of the M's, since they cannot be expressed exactly in terms of sample moments and in any case moments of higher order are necessary. The complexity of the algebra involved makes the calculation and estimation of an approximate bias prohibitive. Nevertheless we presume that (5.2) will still be a reasonably good approximation for the bias of M' even in these cases. Hence, associated with each estimate of the group M', we propose an estimate M'' which is approximately corrected for bias as follows: M'' = M' - corr(M'), where

corr(M') = A(1 - 2p̄)²S²/[k(p̄ - p̄²)³] - (1 - M') S/[k(p̄ - p̄²)],

with

p̄ = Σ N_ip_i/Σ N_i;  S = Σ N_ip_i²/Σ N_i - p̄²;  A = 2n̄/(2n̄ - 1);  n̄ = Σ n_i/k.   (5.3)
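The simple estimate (5.1) and the bias correction of this section can be sketched as follows. The function names and interfaces are our own, and the correction follows our reading of the corr(M') formula above:

```python
def m00(n, p):
    """Simplest estimate M'00, Eq. (5.1): equal sample sizes n, random
    mating and N uncorrelated with pi assumed; p is the list of sample
    gene frequencies p_i."""
    k = len(p)
    A = 2 * n / (2 * n - 1)
    num = A * sum(pi * (1 - pi) for pi in p)
    # Double sum over i != j, via (sum p)^2 - sum p^2.
    cross = (sum(p) ** 2 - sum(pi ** 2 for pi in p)) / (k - 1)
    return 1 - num / (sum(p) - cross)

def corrected(M, p_bar, S, A, k):
    """Bias-corrected estimate M'' = M' - corr(M') (hypothetical helper;
    p_bar, S, and A as defined in (5.3))."""
    v = p_bar - p_bar ** 2
    corr = A * (1 - 2 * p_bar) ** 2 * S ** 2 / (k * v ** 3) \
           - (1 - M) * S / (k * v)
    return M - corr
```

A small check on (5.1): when all sample frequencies are equal (zero observed variance), m00 returns -1/(2n - 1), a slightly negative value, as the formula implies.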
6. ANOTHER GROUP OF ESTIMATES

From (3.1) we see that F is of the form

F = 1 - (a - b)/{a(1 - a)},

where a = E(Nπ)/E(N) and b = E(Nπ²)/E(N).
It can therefore also be written as F = 1 + (b/a) - (1 - b)/(1 - a) by splitting the second part of (3.1) into two partial fractions. We get then

F = 1 + E(Nπ²)/E(Nπ) - [E(N) - E(Nπ²)]/[E(N) - E(Nπ)],

which is a function of the population moments E(N), E(Nπ), E(Nπ²). As before we can estimate it by the same function of the unbiased estimates of these three moments. The unbiased estimators given in Table I can be substituted, and we arrive at the estimate

P'₁₁ = est(F) = 1 + Σ N_ip̂_i²/Σ N_ip_i - [Σ N_i - Σ N_ip̂_i²]/[Σ N_i - Σ N_ip_i],

where p̂_i² = p_i² - [p_i(1 - p_i) - y_i/4]/(n_i - 1) is the unbiased estimate of π_i². If we now compare P'₁₁ with M'₁₁ (given in Eq. (3.2)) we see that they differ only in the estimation of π_t(1 - π_t). Whereas in M'₁₁ this quantity is estimated unbiasedly, in P'₁₁ it is estimated by the same function of the unbiased estimate of π_t, i.e., by π̂_t(1 - π̂_t) with π̂_t = Σ N_ip_i/Σ N_i. The difference between the two estimates of π_t(1 - π_t) is of the order 1/k, and M'₁₁ and P'₁₁ tend to equality as k becomes large. Besides P'₁₁ there are three other estimates, P'₀₁, P'₁₀, and P'₀₀, which can be derived under the three hypotheses (i) that N and π are uncorrelated, (ii) that f = 0, and (iii) that both (i) and (ii) are true, respectively. These latter derivations are the same as for M'₀₁, M'₁₀, and M'₀₀. The P's are therefore also asymptotically unbiased and consistent. The four estimates of the group P' can also be corrected approximately for bias, as were the estimates of the group M'. Since the corrections were derived under the hypothesis that k is large, the corrections are identical in the two groups of estimates. The corrected estimates are then given by P'' = P' - corr(P'), where

corr(P') = A(1 - 2p̄)²S²/[k(p̄ - p̄²)³] - (1 - P') S/[k(p̄ - p̄²)],

A, p̄, and S being as defined in Section 5.
7. TWO MORE ESTIMATES

Robertson (1951) considers in general terms an estimation problem very similar to the one considered in this paper. He considers a set of k independent samples from binomial trials, the sample sizes being 2n_i, i = 1,..., k. The binomial probabilities π_i for each sample are random variables from a continuous distribution. The problem is to estimate var(π). Our problem reduces to this one when the gene frequency is uncorrelated with the isolate size and when mating within isolates is at random, the major difference being that we want to estimate not simply var(π) but the ratio var(π)/(E(π){1 - E(π)}). We can therefore modify his solution to suit our purposes. When the n_i's are not constant from isolate to isolate, as is most often the case in our problem, he suggests two estimates. The first estimate is based on the χ² statistic used in testing heterogeneity between the k samples. The second one is an approximate maximum likelihood solution. An exact maximum likelihood solution is not obtainable analytically, though it can be numerically approximated to any given degree of precision. The likelihood which is maximized is itself however an approximation to the exact likelihood, assuming small values of the variance and of higher moments. In order to obtain an estimate of F we divide Robertson's estimates by p̄(1 - p̄), which is an estimate (not unbiased) of E(π){1 - E(π)}. The two estimates thus obtained are (1) the χ² estimate

RC = [4(Σ n_i)²S - (k - 1)(2Σ n_i + 1)(1 - p̄)p̄] / (p̄(1 - p̄){4(Σ n_i)² - 4Σ n_i² - 2(k - 1)Σ n_i}),   (7.1)

where S and p̄ are as defined in (5.3), and (2) Robertson's approximate maximum likelihood estimate, similarly divided by p̄(1 - p̄), which we denote RM (here p_i = 1 - q_i and p̄ = 1 - q̄). It can easily be checked that RC and RM share with the M and P estimates the property of asymptotic consistency.
8. NUMERICAL COMPARISONS OF THE VARIOUS ESTIMATES

An analytical comparison of the various estimators of F is difficult if not impossible, since it has not been possible to derive expressions for their bias and variance. In order to assess their relative reliabilities we have resorted to numerical methods. We have simulated on a computer the model described in Section 2. For each of the k isolates we have generated a pair of random variables, (1) N_i, the isolate size, and (2) π_i, the gene frequency in the isolate; these two are independently generated. N_i is sampled from a log-normal distribution and π_i from a beta distribution. The sample size n_i for the i-th isolate is a constant fraction of N_i. The genotypes of these n_i individuals are assigned by taking random numbers so that the probabilities of the three genotypes are, respectively,

π_i² + fπ_i(1 - π_i);   2π_i(1 - π_i)(1 - f);   (1 - π_i)² + fπ_i(1 - π_i).

The numerical values of the estimates to be compared are computed for each run of k isolates. For each set of parameters we have made 100 runs, so that we have 100 numerical values for each estimator. On the basis of these 100 values we have estimated the bias and the variance of each estimator, the real value of F being obtainable from the parameters of the beta distribution. Comparisons between estimators involve these two properties. There are 7 input parameters for the simulation:

(a) the mean and the variance of the log-normal distribution of N;

(b) the mean and the variance of the beta distribution of π;

(c) the inbreeding coefficient (f) within the isolates;

(d) the proportion of individuals sampled, which is constant from isolate to isolate;

(e) the number of isolates sampled (k).
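One run of the simulation just described can be sketched as follows. The beta parameters (5, 15) are those used in the paper; the log-normal parameters, the seed, and all variable names are our assumptions:

```python
import random

random.seed(2)
f = 0.01        # within-isolate inbreeding coefficient
frac = 0.1      # sampling proportion (constant across isolates)
k = 30          # number of isolates sampled

isolates = []   # (N_i, n_i, p_i, y_i) per isolate
for _ in range(k):
    Ni = max(2, round(random.lognormvariate(5.85, 0.35)))  # assumed params
    pii = random.betavariate(5, 15)                        # mean 0.25
    ni = max(2, round(frac * Ni))
    # Genotype probabilities of Section 2 (AA, Aa, aa).
    probs = [pii ** 2 + f * pii * (1 - pii),
             2 * pii * (1 - pii) * (1 - f),
             (1 - pii) ** 2 + f * pii * (1 - pii)]
    g = random.choices([0, 1, 2], weights=probs, k=ni)
    xi, yi = g.count(0) / ni, g.count(1) / ni
    isolates.append((Ni, ni, xi + yi / 2, yi))

# True F for beta(5, 15): var/(mean*(1 - mean)) = 0.0089/0.1875, about 0.0476.
```

Any of the estimators of Sections 3-7 can then be evaluated on `isolates`, and repeating such runs 100 times gives the empirical bias and variance used in the comparisons below.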
Table II shows the values of the parameters used in the simulation. We have used all possible combinations of these 5 parameter groups. We have therefore in all 48 points of the parameter space. The mean of π has been chosen so as to have a distribution of π asymmetrical but not excessively so. Many of the approximations made in the derivations of the estimators were based on the assumption that the higher moments of π were negligible. By choosing a slightly asymmetric distribution for π, we are in a way testing the robustness of these estimators. The variances of π are a consequence of choosing

TABLE II
Values of Parameters Used in the Simulation

Parameter                      Values used
1. Mean, variance of π(a)      0.25, 0.0019    0.25, 0.0089
2. Mean, variance of N         362, 21706      382, 124422
3. Inbreeding (f)              0               0.01
4. Sampling proportion         0.1             0.5
5. Number of isolates (k)      10              30      50

(a) The corresponding values of F are 0.0099 and 0.0476.
as the parameters of the beta distribution (5, 15) and (25, 75). The parameters of the log-normal distribution are chosen so that the 3σ limits of the distribution of N are (100, 1000) in the first case and (25, 3000) in the second.

TABLE III
Estimators Used in the Absence and Presence of Inbreeding

Estimator ref. no.    f = 0     f > 0
1                     M'₁₁      M'₁₁
2                     M'₁₀      M'₁₁
3                     M'₀₀      M'₀₁
4                     M''₁₁     M''₁₁
5                     M''₁₀     M''₁₁
6                     M''₀₀     M''₀₁
7                     P'₁₁      P'₁₁
8                     P'₁₀      P'₁₁
9                     P'₀₀      P'₀₁
10                    P''₁₁     P''₁₁
11                    P''₁₀     P''₁₁
12                    P''₀₀     P''₀₁
13                    RC        RC
14                    RM        RM
In all our simulations N and π were independently sampled and are therefore uncorrelated. Different sets of estimators have been used depending on the presence or absence of inbreeding within isolates, as is shown in Table III. When f = 0, in addition to the general estimators in groups M and P, we use those which assume absence of inbreeding, whereas when f > 0, again in addition to the general estimators, those which take inbreeding into account have been used. In other words, if we have information regarding inbreeding (or for that matter regarding correlation between N and π), we can either use this information and therefore apply special estimators, or ignore it and use the general estimators. Strictly, therefore, the estimator reference number in Table III indicates an estimation strategy and not a particular estimator. It will be noted that when f > 0, strategies (1) and (2) give identical estimators; as do (4) and (5), (7) and (8), and (10) and (11). From our simulations we can compare the relative merits of these two strategies. The estimators RC and RM are computed in all situations to see whether they are robust with respect to inbreeding. The two properties used to judge the "goodness" of an estimator are usually its bias and its variance. In order to take both of these into account, the mean square error, i.e., M.S.E. = (bias)² + variance, is used. Formally any statistical decision should be based on an a priori cost function. In cases where this is not known, it is usual to consider the square error of an estimate as a reasonable approximation to the "real" cost function. The mean square error of an estimator is then the expected value of the cost of that estimator.
9. RESULTS AND CONCLUSIONS

None of the 14 estimators is the best, by any of the above criteria, over all 48 points of the parameter space. Also there is no pair of estimators such that one is consistently better than the other. This is, at least partly, demonstrated by Table IV, where for each estimator and by each criterion we show the number of points in the parameter space where it is the best. It is immediately obvious from this Table that the group M tends to be less biased than the group P. Further, within the group M the estimators corrected for bias (M'') tend to have the smaller bias. As far as the variance is concerned, the group P' seems to be the best. This is an indication that the correction for the bias, though it decreases the bias, increases the variance. Estimator RM is also fairly efficient. The situation is less clear however when we consider the mean square error. With this criterion all the estimates except no. (3) are the best at least once. Clearly Table IV is not a sufficient basis on which to choose the best estimator.

TABLE IV
Number of Points in the Parameter Space Where Each Estimator is the Best(a)

Estimator no.    Absolute Bias    Variance    M.S.E.
Group M
1                15               0           0.33
2                3.5              0           0.67
3                1.5              0           0
4                6.5              0           0.5
5                1.5              0           2.83
6                13.5             0           6
Group P
7                0                7.33        3
8                0                9.83        2
9                0                23.83       10
10               0                0           1.5
11               0                0           4.83
12               0                0           5.33
13               4.5              0           6
14               2                7           5
Total            48               48          48

(a) Fractional values are obtained because in many cases two or more estimates are best to the degree of precision used.
In order to decide on the relative merits of the estimators, one possible solution is to average the various criteria over the parameter space. This procedure seems justified especially since most of the parameters considered are often such that it is not in the hands of the experimenter to control their values. Simple averaging of the criteria over the parameter space is however not advisable, for the following reasons. If the real value of F, to be estimated, is large, the bias of the estimator is expected to be correspondingly large, and this is indeed true for our results. One is more interested in the bias as a proportion of the true value of F. For this reason we shall average the quantity

percentage bias = 100 × (bias/F).

As for the variance, we expect it to be larger in certain regions of the parameter space, for example when fewer isolates are sampled. This is also confirmed by our data. In the case of the bias, the true value of F is the obvious standard of reference. For the variance there is no such standard. In its absence we use, for each combination of parameters, the minimum variance among the 14 estimators. We therefore average the relative efficiency, defined as follows:

relative efficiency = min(variance)/variance.

At each point, therefore, the most efficient estimator has efficiency 1. For the same reason, we use a transformation of the mean square error which we call relative cost and which we define as follows:

relative cost = (E_i - min(E_i))/min(E_i),

where for a given point of the parameter space E_i is the mean square error of the i-th estimator, and min(E_i) is the smallest of the E_i's. At each point, therefore, the estimator with the smallest E_i has zero cost. For each of these criteria, within the two partitions of the parameter space, f = 0 and f = 0.01, we have computed means for each estimator. In addition we have also computed the value than which 25% of the values are worse (that is, a quartile of the distribution of each criterion). These values are shown, along with the means and the quartiles over the whole parameter space, in Table V. The reason for calculating the quartiles is as follows. As can be seen from Table IV, some estimators, e.g., (6) and (14), are at many points of the parameter space the best of all estimators. However, these estimators are also often among the worst. In other words, they are not very reliable. The quartile we have computed is a statistic which gives some indication of the unreliability of an estimator. As an example, from Table IV we see that on the basis of the relative cost estimators (6) and (14) are quite often the best, but from Table V(c) we see that they are among the worst with respect to the quartile.
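The three comparison criteria just defined can be written as small helper functions (the names and list-based interface are assumed):

```python
def percentage_bias(bias, F):
    """Bias as a percentage of the true value of F."""
    return 100 * bias / F

def relative_efficiency(variances):
    """min(variance)/variance for each estimator at one parameter point;
    the most efficient estimator gets efficiency 1."""
    m = min(variances)
    return [m / v for v in variances]

def relative_cost(mses):
    """(E_i - min(E_i))/min(E_i); the estimator with the smallest mean
    square error gets zero cost."""
    m = min(mses)
    return [(e - m) / m for e in mses]
```

For example, an estimator with twice the minimum variance has relative efficiency 0.5, and one with 1.5 times the minimum mean square error has relative cost 0.5.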
TABLE V(a)

Means and Quartiles for the Proportional Bias of the Estimators

                     Mean                        Quartile
Estimate    f = 0   f = .01   Total      f = 0   f = .01   Total
  1          9.54    9.06      9.30      15.25   13.65     13.35
  2          9.05    9.06      9.06      14.10   13.80     13.35
  3          9.25   10.53      9.89      14.90   15.80     14.90
  4          7.55    6.11      6.83      11.10    8.75      9.35
  5          7.31    6.11      6.71      10.05    8.75      9.30
  6          8.51    6.03      7.27      14.00    9.00     11.15
  7         17.20   16.43     16.82      24.40   17.75     19.95
  8         16.71   16.43     16.57      23.80   17.75     19.85
  9         16.74   18.25     17.50      23.35   21.00     21.65
 10         10.97   10.15     10.56      16.65   13.10     14.75
 11         10.49   10.15     10.32      16.45   13.10     14.30
 12          9.99   11.44     10.72      15.50   15.90     15.50
 13          8.63    7.23      7.93      13.35   11.15     12.70
 14         18.27   16.18     17.23      23.35   18.80     21.20
TABLE V(b)

Means and Quartiles for the Relative Efficiency of the Estimators

                     Mean                        Quartile
Estimate    f = 0   f = .01   Total      f = 0   f = .01   Total
  1          0.81    0.80      0.80       0.73    0.68      0.71
  2          0.81    0.80      0.80       0.73    0.68      0.71
  3          0.82    0.83      0.83       0.76    0.79      0.78
  4          0.76    0.75      0.75       0.66    0.63      0.66
  5          0.76    0.75      0.75       0.66    0.63      0.66
  6          0.77    0.77      0.77       0.66    0.69      0.68
  7          0.95    0.94      0.94       0.96    0.94      0.94
  8          0.95    0.94      0.95       0.96    0.94      0.94
  9          0.91    0.91      0.91       0.88    0.92      0.90
 10          0.88    0.87      0.87       0.83    0.83      0.83
 11          0.88    0.87      0.87       0.83    0.83      0.83
 12          0.84    0.84      0.84       0.79    0.81      0.80
 13          0.84    0.83      0.83       0.75    0.74      0.62
 14          0.82    0.79      0.80       0.69    0.74      0.65
TABLE V(c)

Means and Quartiles for the Relative Cost of the Estimators

                     Mean                        Quartile
Estimate    f = 0   f = .01   Total      f = 0   f = .01   Total
  1          0.25    0.19      0.22       0.31    0.20      0.31
  2          0.19    0.19      0.19       0.25    0.25      0.25
  3          0.20    0.17      0.19       0.19    0.19      0.19
  4          0.25    0.24      0.24       0.41    0.44      0.43
  5          0.24    0.24      0.24       0.41    0.44      0.43
  6          0.26    0.24      0.25       0.34    0.28      0.32
  7          0.08    0.08      0.08       0.11    0.11      0.11
  8          0.07    0.08      0.07       0.10    0.11      0.10
  9          0.17    0.18      0.18       0.17    0.16      0.15
 10          0.08    0.08      0.08       0.12    0.11      0.12
 11          0.08    0.08      0.08       0.11    0.11      0.11
 12          0.18    0.18      0.18       0.20    0.14      0.17
 13          0.12    0.11      0.11       0.17    0.21      0.18
 14          0.37    0.38      0.37       0.55    0.57      0.57
This can also be seen if we compare the relative efficiencies of estimators (8) and (9). From Table IV, estimator (9) is the best much more often than is estimator (8); but from Table V(b) we see that the mean and quartile of (9) are lower than those of (8).

We will now discuss the estimators with regard to each criterion separately.

Percentage Bias

(a) Estimators (1), (4), (7), and (10), which do not take into account the absence of inbreeding (i.e., estimate in the same way for f = 0 and for f ≠ 0), have on the average a larger bias than, respectively, estimators (2), (5), (8), and (11), which take different forms depending on the value of f.

(b) The correction for bias, though approximate, is effective for both the M and the P group of estimators, as can be seen by comparing the pairs (1, 4), (2, 5), etc. It is more effective for the P estimators. This was expected, since for these estimators fewer approximations have been made in the estimation of the bias (see Section 5).

(c) One unexpected result is that the bias of estimators (13) and (14) is smaller when there is inbreeding within isolates than when there is not. This is unexpected because these estimators are based on the assumption of random mating (i.e., binomial sampling of genes).

(d) Estimator (5) is the least biased, except for the mean in the case f = 0.01, when estimator (6) has an average percentage bias slightly smaller than that of (5). When f = 0, estimator (5) corresponds to M'00, and when f > 0, to M'11 (see Table III).

Relative Efficiency

(a) For both group M and group P estimators there is no difference in variance between the general and the special estimators (with respect to inbreeding), as can be seen by making the same comparisons as for the percentage bias.

(b) It is evident that correcting for bias increases the variance of both group M and group P estimators.

(c) The most efficient estimators are (7) and (8); of the two, (8) has the smaller percentage bias.
When f = 0, estimator (7) corresponds to P11, and (8) to P00; when f > 0, both correspond to P11.

Relative Cost

The best estimator with respect to relative cost is (8). The next best are (7) and (11). Estimator (11) is less efficient than (7), but being less biased, since it is corrected for bias, it is equivalent to (7) in mean square error. Although estimator (5) has the least bias, its mean square error is large because of its large variance.

We notice from the above analysis that the mean square error is affected to a large extent by the variance and to a smaller extent by the bias. For each estimator we have computed, at each point of the parameter space, the ratio var/MSE, which we will refer to as R. Table VI shows the average value of R over the parameter space and its minimum value for each estimator. We see that the contribution of the variance to the mean square error is always more than half, and the average contribution is from 85 to 97%. For all our estimators, therefore, the major source of error is the variance.

Estimator (8) is then in our opinion the most advisable one to use, especially if one is more interested in having an efficient estimator than an unbiased one. However, if one is more interested in using an estimator with a smaller bias, for example in circumstances where the variance is expected to be small anyway, we suggest that estimator (5) should be used. Explicit forms for the estimators (5) and (8) are given below.

TABLE VI

Value of var/M.S.E. for the Various Estimators

Estimator   Average   Minimum
  1          0.94      0.61
  2          0.94      0.61
  3          0.93      0.61
  4          0.96      0.66
  5          0.97      0.67
  6          0.95      0.68
  7          0.86      0.53
  8          0.86      0.54
  9          0.85      0.53
 10          0.93      0.59
 11          0.93      0.59
 12          0.91      0.61
 13          0.94      0.59
 14          0.87      0.55
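The decomposition behind R can be checked numerically. A minimal sketch with simulated estimates; the bias and standard deviation used here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
true_F = 0.2
# A biased, noisy estimator: bias 0.05, standard deviation 0.1.
estimates = true_F + 0.05 + 0.1 * rng.standard_normal(10_000)

var = estimates.var()
bias = estimates.mean() - true_F
mse = var + bias**2          # MSE = variance + bias^2
R = var / mse                # contribution of the variance to the MSE

# With sd 0.1 and bias 0.05, R is about 0.01/(0.01 + 0.0025) = 0.8.
```

When R is close to 1, as in Table VI, reducing the variance matters far more than reducing the bias.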
Estimator (5) for f = 0 is expressed in terms of

    n̄ = Σ n_i / k,    p̄ = Σ N_i p_i / Σ N_i,    s² = Σ N_i p_i² / Σ N_i − p̄².

Estimator (5) for f > 0 is expressed in terms of the same quantities together with

    M' = 1 − [Σ N_i {n_i p_i (1 − p_i)/(n_i − 1) − y_i/4(n_i − 1)} / Σ N_i] / [(Σ N_i p_i / Σ N_i) − (ΣΣ_{i≠j} N_i N_j p_i p_j)/(ΣΣ_{i≠j} N_i N_j)],

where y_i is the observed proportion of heterozygotes in the i-th isolate, and p̄, n̄, and s² are as defined above. Estimator (8), for f = 0 and for f > 0, is expressed in terms of p̄, defined as above.
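Assuming N and the gene frequencies are uncorrelated, the f = 0 form written out in the Appendix (M'00) is a short computation. A sketch of it; the function and variable names are mine, with p the sample gene frequencies p_i and n the isolate sample sizes n_i:

```python
import numpy as np

def m00_prime(p, n):
    """F estimate for f = 0 from sample gene frequencies p_i and
    isolate sample sizes n_i (Appendix form, equal weights)."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    k = p.size
    # Unbiased estimate of E(pi) - E(pi^2), summed over isolates:
    num = np.sum(2 * n * p * (1 - p) / (2 * n - 1))
    # Cross-product term: sum over i != j of p_i p_j, divided by (k - 1).
    cross = (p.sum() ** 2 - (p ** 2).sum()) / (k - 1)
    # Unbiased estimate of E(pi) - {E(pi)}^2, summed over isolates:
    den = p.sum() - cross
    return 1 - num / den
```

With identical gene frequencies in all isolates the estimate is near zero, and it grows with the dispersion of the p_i, as the Wahlund argument requires.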
APPENDIX
The expression for F which we have to estimate is

    F = 1 − [E(Nπ)/E(N) − E(Nπ²)/E(N)] / [E(Nπ)/E(N) − {E(Nπ)/E(N)}²].

The quantities E(N), E(Nπ), {E(N)}², and {E(Nπ)}² are estimated as in Table I of the main paper, irrespective of the value of f. E(Nπ²) has as a first-stage estimate Σ N_i π_i²/k. If f = 0, π_i² is estimated by

    est(π_i²) = {p_i² − p_i/(2n_i)} · 2n_i/(2n_i − 1),

since

    E(p_i² | n_i) = {π_i + π_i²(2n_i − 1)}/(2n_i).
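The unbiasedness behind this second-stage estimate can be verified by enumerating the binomial distribution directly; a quick numerical check (not from the paper):

```python
from math import comb

def expected_estimate(n, pi):
    """E[ est(pi^2) ], where p = X/(2n) with X ~ Binomial(2n, pi), and
    est(pi^2) = {p^2 - p/(2n)} * 2n/(2n - 1) = (2n*p^2 - p)/(2n - 1)."""
    total = 0.0
    for x in range(2 * n + 1):
        p = x / (2 * n)
        prob = comb(2 * n, x) * pi**x * (1 - pi) ** (2 * n - x)
        total += prob * (2 * n * p * p - p) / (2 * n - 1)
    return total

# For any n and pi the expectation equals pi^2 exactly, confirming
# that est(pi^2) is unbiased under binomial sampling of genes.
```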
Hence the estimate of F is obtained by substituting these estimates in the expression for F above.

If N and π are known to be uncorrelated, the expression for F is

    F = 1 − [E(π) − E(π²)] / [E(π) − {E(π)}²].

The first-stage estimates of E(π), E(π²), and {E(π)}² are, respectively, Σ π_i/k, Σ π_i²/k, and ΣΣ_{i≠j} π_i π_j / k(k − 1). The second-stage estimates are as in Table I or as above, depending on the value of f. Hence

    M'11 = 1 − Σ {n_i p_i(1 − p_i)/(n_i − 1) − y_i/4(n_i − 1)} / [Σ p_i − ΣΣ_{i≠j} p_i p_j/(k − 1)]

and

    M'00 = 1 − Σ {2n_i p_i(1 − p_i)/(2n_i − 1)} / [Σ p_i − ΣΣ_{i≠j} p_i p_j/(k − 1)].

ACKNOWLEDGMENTS
We wish to thank Professor L. L. Cavalli-Sforza for suggesting the problem, and for his useful advice during the preparation of this paper. We are also grateful to Dr. W. Volkers, Department of Human Genetics, University of Leiden, Holland, for bringing to our notice the paper by Robertson (1951).
REFERENCES

CAVALLI-SFORZA, L. L. 1965. Population structure and human evolution, Proc. Roy. Soc. London, Ser. B 164, 362-379.
CAVALLI-SFORZA, L. L. 1966. Genetic drift in popolazioni umane, Atti Ass. Genet. Ital. 11, 3-50.
CRAMÉR, H. 1946. "Mathematical Methods of Statistics," Princeton University Press, Princeton, NJ.
ROBERTSON, A. 1951. The analysis of heterogeneity in the binomial distribution, Ann. Eugenics 16, 1-15.
THODAY, J. M. 1963. Correlation between gene frequency and population size, Am. Naturalist 97, 409-412.
WAHLUND, S. 1928. Zusammensetzung von Populationen und Korrelationserscheinungen vom Standpunkt der Vererbungslehre aus betrachtet, Hereditas 11, 65-106.
WRIGHT, S. 1943. Isolation by distance, Genetics 28, 114-138.