THEORETICAL POPULATION BIOLOGY 24, 39-58 (1983)
Optimal Sampling for Pedigree Analysis: Parameter Estimation and Genotypic Uncertainty
E. A. THOMPSON
Statistical Laboratory, University of Cambridge, 16 Mill Lane, Cambridge CB2 1SB, England
Received July 15, 1982
Many problems of sampling strategy arise in pedigree analysis. Although specific questions have been previously considered, there is a need for a general study of the form of the likelihood surface for genetic models. Were genotypes observable, the log-likelihood would have a very simple form. The two classes of factors of genealogical structure affecting general questions of inference are thus those (such as inbreeding or assortative mating) which affect the genotype distribution within a pedigree, and those affecting the extent to which these unobservable genotypes can be inferred from phenotypes. The latter aspect is considered in some detail.
0040-5809/83 $3.00. Copyright © 1983 by Academic Press, Inc. All rights of reproduction in any form reserved.

1. INTRODUCTION: SAMPLING AND UNCERTAINTY

A topic of current discussion in genetic epidemiology is the optimal pedigree structure on which to observe phenotypic data in order to fit a genetic model. Particular questions have been considered by Moll and Sing (1979), Go et al. (1978), and Thompson et al. (1978). Although, clearly, optimal procedures depend on the class of genetic models under consideration, and also perhaps on the unknown parameter values, there is a need for a general study of the potential information in a pedigree structure, with particular reference to the problem of fitting genetic models. First, therefore, the general form of the log-likelihood on the basis of phenotypic data on a pedigree is considered.

Where a genealogy is ascertained via individuals of nonrandom phenotype the likelihood requires a correction for the ascertainment process (Morton, 1959). In general, efficient procedures must cover both ascertainment and subsequent choice of pedigree structure. The two aspects can be combined by the use of sequential sampling, where an initial ascertainment allows concentration of observations in informative pedigrees, but choice of individuals for subsequent observation is dependent on data obtained. In this case the ascertainment correction involves only conditioning on the initial event (Cannings and Thompson, 1977), the corrected likelihood being the overall likelihood divided by this initial probability. Although the initial event may be important in enhancing expected information, its probability is not dependent on the structure of observations made subsequently, and we shall therefore restrict attention to the form of the overall likelihood.

Thompson (1981a, b) has considered optimal sampling procedures based on criteria of expected log-likelihood differences between alternative hypotheses of interest. The results obtained were equivalent, in some simple cases, to those using an entropy criterion. Entropy and expected log-likelihood are closely related (Kullback, 1959). In fact, the value of the entropy of the data probability distribution is minus the overall expected log-likelihood. Entropy will enter into considerations of optimal sampling in two ways. First, observations of a priori high uncertainty are intrinsically more informative than those whose values are already almost certain, although in particular cases confirming the certainty may also be important in assessing a particular genetic hypothesis. Secondly, observations on relatives are connected by dependence between their genotypes. Phenotypic observations change the probabilities of underlying genotypes both of individuals previously observed and of those not (as yet) observed. Thereby, observations change the individuals whose phenotypes are likely to provide most information about parameters of a genetic model. Thus we shall discuss the effects of genealogical relationships on the information about parameters, via their effects on the uncertainty about underlying unobservable genotypes.
2. MODEL SPECIFICATION

The basic component in the estimation procedure is the likelihood for any model: the probability of the data observed for individuals of a specified genealogy under a specified model (Elston and Stewart, 1971). The general form of the likelihood function is determined by three elements in the specification of a genetic model. First there are the underlying genotypes ($g \in G$) at loci which affect the trait: the relevant alleles carried by an individual. There may be several such loci; they may be linked or unlinked. The “genotype” may be a continuous random variable (a “polygenic component”) rather than a discrete array of possibilities. The genetic model must specify the qualitative form of the genotypes, and their frequencies $\{Q(g; q);\ g \in G\}$ in the population, in terms of parameters $q$. Second, the genetic model must specify the transmission of genotypes from parents to offspring:
$T(g_3 \mid g_1, g_2; t)$ denotes the probability that the child of parents with genotypes $g_1$ and $g_2$ has genotype $g_3$, and $t$ are known or unknown parameters of the transmission function. The essence of genetic models is that genotypes of children are independent of each other, and of those of their grandparents, conditional on the parental genotypes. Often the class of genetic models to be considered will contain only one fixed $T$ function; for example, Mendelian segregation at a single autosomal locus. In other cases the parameters $t$ may include unknown recombination fractions (for example), but the form of $T$ will normally be strictly constrained.

The third aspect of a genetic model to be specified is the genotype-phenotype relation. The function $R(\phi \mid g; r)$ with parameters $r$ will denote the penetrance probability: the probability that an individual has phenotype $\phi \in \Phi$, given that his genotype is $g \in G$. This genotype-phenotype relationship is the most unconstrained aspect of a genetic model: $R$ may be dependent on age, sex, and other observable demographic variables as well as on genotype. We shall consider only models where functions $Q$, $T$, and $R$ are a sufficient specification, and denote the complete set of parameters by $\theta = (q, t, r)$ (Table I).

There is one generalisation of this class of models which does not fundamentally affect our discussion. In the above specification phenotypes of relatives are only dependent through correlations between their genotypes: only genotypes are transmitted. In some models of common or transmitted environmental effects, where phenotypes may be directly correlated, a similar general form of the likelihood obtains (Cannings et al., 1980).

A simple genetic model for a dichotomous trait, which still provides scope for analysis, and freedom for a wide diversity of phenotypic distributions on pedigrees, is the following. Suppose a single diallelic autosomal locus, one of the alleles, $A_1$, being that “for” the trait and having allele frequency $\pi$. Individuals homozygous for this allele are always affected, those who do not carry it never. Individuals heterozygous for the allele are affected with probability $p$. To extend the model to problems of transmission, suppose that heterozygotes transmit the $A_1$ allele with probability $\tau$ and the alternative with probability $(1 - \tau)$. Note that this segregation distortion imposes selection; allele frequencies will not remain constant unless $\tau = 1/2$. [Alternatively an apparent segregation distortion may be the result of selective death of certain offspring genotypes.] Within a set of observations on current pedigrees, the evolutionary mechanism maintaining the polymorphism is unimportant; the genotypic frequencies are those relevant to the founders of the pedigree. However, mutation rates and known selective forces may provide information about the possible values of $\pi$. Further, if selection is strong, it may be necessary to take departures from Hardy-Weinberg into account.

Table II gives genotype and phenotype probabilities for parents and offspring. The data on a set of sibships is in all cases a mixture of binomial
TABLE I
Notation

Symbol                               Meaning                                                                   Remarks
$E$                                  Expectation                                                               Indexed by the true parameter value $\theta$
$G$                                  Set of genotypes                                                          Elements $g$
$g = \{g_j;\ j \in \mathcal{P}\}$    The genotypes for the individuals of $\mathcal{P}$
$(g_3; g_1, g_2)$                    Genotype triple of individual and his two parents
$\Phi$                               Set of phenotypes                                                         Elements $\phi$
$\phi$                               Cf. $g$
$H$                                  Expected log-likelihood or entropy of the probability function specified  Indexed by the true parameter value $\theta$
$h$                                  Entropy of the distribution of the variable specified
$L$                                  Likelihood
$P$                                  Probability
$\mathcal{P}$                        Set of individuals in pedigree                                            Indexed by subscript $j$
$Q$                                  Population genotype frequencies                                           Parameters $q$; for example, $\pi$
$R$                                  Penetrance probabilities                                                  Parameters $r$; for example, $p$
$S$                                  Log-likelihood
$s$                                  Number of sibs whose phenotypes are observed
$T$                                  Transmission probabilities                                                Parameters $t$; for example, $\tau$
$\theta = (q, t, r)$                 Total set of parameters
$X$                                  Subset of $\mathcal{P}$ consisting of the founders
$Y$                                  The nonfounders; the remainder of $\mathcal{P}$
$Z$                                  The subset of $\mathcal{P}$ on which phenotypes are observed
distributions over parental genotype combinations. Estimates of the frequencies of the components of the mixture will provide estimates of $\pi$, and those of the binomial probabilities within the mixture estimates of $p$ and $\tau$.

Returning to the general model, all combinations of parameters $q$, $t$ and $r$ can in theory be estimated. In practice this is not feasible, and normally one of the three classes of parameters is of principal interest. In some cases the primary objective may be estimation of $q$ and population frequencies of traits. The optimal strategy is to sample unrelated individuals, and the problem is only of interest to pedigree analysis insofar as the estimation of $r$ and $t$ is confounded with that of $q$. In many practical problems the primary focus is the estimation of $r$, and a simple transmission model is assumed, such as a single autosomal locus. The value of $q$ will be unknown, but prior
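TABLE II
Components of the Genetic Model

Unordered parent genotypes $(g_1, g_2)$, with frequencies $Q(g_1; \pi)\,Q(g_2; \pi)$ (Hardy-Weinberg case), and parent phenotype probabilities, determined by $R(\cdot \mid g; p)$:

$(g_1, g_2)$             Frequency               Both affected   One affected   Neither affected
$A_1A_1 \times A_1A_1$   $\pi^4$                 1               0              0
$A_1A_1 \times A_1A_2$   $4\pi^3(1-\pi)$         $p$             $1-p$          0
$A_1A_1 \times A_2A_2$   $2\pi^2(1-\pi)^2$       0               1              0
$A_1A_2 \times A_1A_2$   $4\pi^2(1-\pi)^2$       $p^2$           $2p(1-p)$      $(1-p)^2$
$A_1A_2 \times A_2A_2$   $4\pi(1-\pi)^3$         0               $p$            $1-p$
$A_2A_2 \times A_2A_2$   $(1-\pi)^4$             0               0              1

Offspring genotype probabilities, determined by $T(g_3 \mid g_1, g_2; \tau)$:

$(g_1, g_2)$             $g_3 = A_1A_1$   $g_3 = A_1A_2$    $g_3 = A_2A_2$
$A_1A_1 \times A_1A_1$   1                0                 0
$A_1A_1 \times A_1A_2$   $\tau$           $1-\tau$          0
$A_1A_1 \times A_2A_2$   0                1                 0
$A_1A_2 \times A_1A_2$   $\tau^2$         $2\tau(1-\tau)$   $(1-\tau)^2$
$A_1A_2 \times A_2A_2$   0                $\tau$            $1-\tau$
$A_2A_2 \times A_2A_2$   0                0                 1

Probability offspring is affected:

$(g_1, g_2)$             General case               $\tau = 1/2$ (Thompson, 1981b)   $p = 0$ (recessive)   $p = 1/2$
$A_1A_1 \times A_1A_1$   1                          1                                1                     1
$A_1A_1 \times A_1A_2$   $\tau + (1-\tau)p$         $(1+p)/2$                        $\tau$                $(1+\tau)/2$
$A_1A_1 \times A_2A_2$   $p$                        $p$                              0                     $1/2$
$A_1A_2 \times A_1A_2$   $\tau^2 + 2\tau(1-\tau)p$  $(1+2p)/4$                       $\tau^2$              $\tau$
$A_1A_2 \times A_2A_2$   $\tau p$                   $p/2$                            0                     $\tau/2$
$A_2A_2 \times A_2A_2$   0                          0                                0                     0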
estimates of the population frequency of the trait are required in order to discriminate between alternative values of $r$ (Stewart, 1980). Related individuals provide information on underlying genotypes, in turn allowing us to estimate penetrances.

The third category of problems is the estimation of transmission parameters. Here closely related individuals, preferably parents and offspring, must be observed. Where only phenotypes are available, additional relationships between the parent-offspring pairs provide further information on underlying genotypes and haplotypes, and hence enable more accurate estimates of $t$ to be made: this is the essence of the advantage for linkage analysis of three-generation rather than nuclear families (Edwards, 1971). Accurate estimates of the phenotype-genotype relationship ($R$) are then a prerequisite; only after fitting the basic model ($q$ and $r$) for a trait is it possible to establish linkage with marker loci. The detection of a linkage may then endorse the previous genetic model, as in Kravitz et al. (1979), and/or provide further tuning to the estimation of penetrance parameters, as in Anderson et al. (1979).

In the case of the particular single-locus model above, estimates of $p$ and $\tau$ are confounded, and small departures from Hardy-Weinberg equilibrium will seldom be detected. We shall therefore focus on the problem of estimation of $p$ and $\pi$ when $\tau = 1/2$, and the estimation of $\tau$ and $\pi$ when $p = 0$ or 1.
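The three components of the example single-locus model can be written out directly. The following is a minimal sketch only: the function names, the coding of genotypes by the number of $A_1$ alleles carried, and the helper `offspring_affected` are illustrative assumptions, not notation from the paper.

```python
# Sketch of the example model: Q (founder frequencies), R (penetrance)
# and T (transmission). All names here are hypothetical.

GENOTYPES = (2, 1, 0)  # number of A1 alleles carried: A1A1, A1A2, A2A2


def Q(g, pi):
    """Hardy-Weinberg frequency of genotype g for A1 allele frequency pi."""
    return {2: pi * pi, 1: 2 * pi * (1 - pi), 0: (1 - pi) ** 2}[g]


def R(affected, g, p):
    """Penetrance P(phenotype | genotype); heterozygotes affected w.p. p."""
    prob_affected = {2: 1.0, 1: p, 0: 0.0}[g]
    return prob_affected if affected else 1.0 - prob_affected


def gamete(g, tau):
    """P(parent of genotype g transmits A1); heterozygotes do so w.p. tau."""
    return {2: 1.0, 1: tau, 0: 0.0}[g]


def T(g3, g1, g2, tau):
    """Transmission P(child has genotype g3 | parental genotypes g1, g2)."""
    a, b = gamete(g1, tau), gamete(g2, tau)
    return {2: a * b, 1: a * (1 - b) + (1 - a) * b, 0: (1 - a) * (1 - b)}[g3]


def offspring_affected(g1, g2, p, tau):
    """P(child affected | parental genotypes): sum of T * R over g3."""
    return sum(T(g3, g1, g2, tau) * R(True, g3, p) for g3 in GENOTYPES)
```

For two heterozygous parents, `offspring_affected(1, 1, p, tau)` evaluates $\tau^2 + 2\tau(1-\tau)p$, which is the general-case entry of Table II for that mating.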
3. THE LIKELIHOOD

Under the general class of models specified by functions $Q$, $T$, and $R$, the total probability of a complete set of genotypes $g = \{g_j;\ j \in \mathcal{P}\}$ on individuals $j$ in a pedigree $\mathcal{P}$ without “missing links” (i.e., all interconnecting parents specified) and phenotypes $\phi = \{\phi_j;\ j \in Z\}$ on some subset $Z$ of $\mathcal{P}$ is simply the product

\[
L(\theta) = L(q, t, r) = P(g, \phi; \theta)
= \prod_{j \in Z} R(\phi_j \mid g_j; r) \cdot \prod_{j \in Y} T(g_j \mid g_{m(j)}, g_{f(j)}; t) \cdot \prod_{j \in X} Q(g_j; q),
\tag{1}
\]

where $Y$ is the set of nonfounders in the pedigree, and $X$ the set of founders ($X \cup Y = \mathcal{P}$), and $m(j)$ and $f(j) \in \mathcal{P}$ are the parents of $j \in Y$. The log-likelihood is thus

\[
S(\theta) = \sum_{j \in Z} \log R_j(r) + \sum_{j \in Y} \log T_j(t) + \sum_{j \in X} \log Q_j(q),
\tag{2}
\]

where $\log R_j(r)$ will be used as a shorthand for $\log R(\phi_j \mid g_j; r)$, etc., when these functions are to be considered as functions of the parameters for given data.
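As an illustration of how (2) separates into founder, transmission and penetrance terms, the sketch below evaluates the complete-data log-likelihood for a parent-parent-child trio under the example model of Section 2. The dictionary representation of the pedigree, the function names and the parameter values are assumptions made for this example only.

```python
import math


def Q(g, pi):
    """Hardy-Weinberg founder frequency; g = number of A1 alleles."""
    return {2: pi * pi, 1: 2 * pi * (1 - pi), 0: (1 - pi) ** 2}[g]


def R(affected, g, p):
    """Penetrance P(phenotype | genotype) for the dichotomous trait."""
    prob = {2: 1.0, 1: p, 0: 0.0}[g]
    return prob if affected else 1.0 - prob


def T(g3, g1, g2, tau):
    """Transmission P(child g3 | parents g1, g2) with segregation tau."""
    a = {2: 1.0, 1: tau, 0: 0.0}[g1]
    b = {2: 1.0, 1: tau, 0: 0.0}[g2]
    return {2: a * b, 1: a * (1 - b) + (1 - a) * b, 0: (1 - a) * (1 - b)}[g3]


def safe_log(x):
    """Natural log, returning -inf for impossible configurations."""
    return math.log(x) if x > 0 else float("-inf")


def complete_log_likelihood(pedigree, genotypes, phenotypes, pi, p, tau):
    """Evaluate S(theta) of (2).

    pedigree:   individual -> (mother, father), or None for a founder
    genotypes:  individual -> genotype (all individuals)
    phenotypes: individual -> True/False (the observed subset Z only)
    """
    s = 0.0
    for j, parents in pedigree.items():
        if parents is None:                      # founder term, j in X
            s += safe_log(Q(genotypes[j], pi))
        else:                                    # transmission term, j in Y
            m, f = parents
            s += safe_log(T(genotypes[j], genotypes[m], genotypes[f], tau))
    for j, affected in phenotypes.items():       # penetrance term, j in Z
        s += safe_log(R(affected, genotypes[j], p))
    return s


# Hypothetical trio: heterozygous mother, A2A2 father, heterozygous child.
trio = {"mother": None, "father": None, "child": ("mother", "father")}
trio_genotypes = {"mother": 1, "father": 0, "child": 1}
trio_phenotypes = {"mother": False, "child": True}   # father not observed
S = complete_log_likelihood(trio, trio_genotypes, trio_phenotypes,
                            pi=0.1, p=0.4, tau=0.5)
```

Here the likelihood is the product $Q(A_1A_2)\,Q(A_2A_2)\,T\,R\,R = 0.18 \times 0.81 \times 0.5 \times 0.6 \times 0.4$, and each factor involves only one of $q$, $t$ and $r$.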
The likelihood for any genetic model $\theta$, on the basis only of phenotypic observations $\phi$ on a set of individuals, is

\[
L^*(\theta) = P(\phi; \theta) = \sum_{g} P(g, \phi; \theta),
\tag{3}
\]

and the log-likelihood $S^* = \log L^*$ therefore takes no simple form. Therefore consider first the case where both genotypic and phenotypic data are available; in practice the available information about parameters will be less to the extent that genotypes are unobservable and uninferrable from phenotypes. From (2) the expected log-likelihood function on the basis of $\phi$ and $g$ is

\[
E_\theta(S(\theta)) = -H_\theta(R; r) - H_\theta(T; t) - H_\theta(Q; q),
\]

where

\[
\begin{aligned}
-H_\theta(R; r) &= \sum_{g \in G} \sum_{\phi \in \Phi} E_\theta\bigl(N(Z; (g, \phi))\bigr) \log R(\phi \mid g; r), \\
-H_\theta(T; t) &= \sum_{g_1, g_2, g_3 \in G} E_\theta\bigl(N(Y; (g_3; g_1, g_2))\bigr) \log T(g_3 \mid g_1, g_2; t), \\
-H_\theta(Q; q) &= \sum_{g \in G} E_\theta\bigl(N(X; g)\bigr) \log Q(g; q).
\end{aligned}
\tag{4}
\]

Here $E_\theta$ denotes expectations taken over the distribution under parameter value $\theta$, and $N(W; A)$ denotes the number of individuals in set $W$ ($= X$, $Y$ or $Z$) with the specified attribute combination $A$. Note that the overall entropy of the joint distribution of $g$ and $\phi$ is

\[
H_\theta(R; r) + H_\theta(T; t) + H_\theta(Q; q).
\tag{5}
\]

Further,

\[
E_\theta\bigl(N(Z; (g, \phi))\bigr) = R(\phi \mid g; r)\, E_\theta\bigl(N(Z; g)\bigr)
\tag{6}
\]

and

\[
E_\theta\bigl(N(Y; (g_3; g_1, g_2))\bigr) = E_\theta\bigl(N(Y; (g_1, g_2))\bigr)\, T(g_3 \mid g_1, g_2; t),
\tag{7}
\]

$(g_1, g_2)$ denoting the genotypes of the parent couple, a couple being counted once for each child $\in Y$. Now $H_\theta(R; r)$ is a function of penetrance and population frequency, $H_\theta(T; t)$ of transmission and population frequencies, and $H_\theta(Q; q)$ of population frequency alone. Although the log-likelihood $S$ partitions into three distinct sets of terms, each involving only one of the
three classes of parameters, in expectation both the genealogical structure and parameters $q$ pervade all terms of $S$. If the pedigree has an inbred structure, this will distort the distribution of genotype numbers $N(Z; g)$ within it. If there is assortative mating for the trait of interest the expected number of couples $N(Y; (g_1, g_2))$ will not be the product of single genotype expectations. However, if these distorting factors are not present,

\[
E_\theta\bigl(N(Z; g)\bigr) = N(Z)\, Q(g; q),
\]

$N(Z)$ denoting the number of individuals in $Z$, and

\[
E_\theta\bigl(N(Y; (g_1, g_2))\bigr) = N(Y)\, Q(g_1; q)\, Q(g_2; q).
\tag{8}
\]

In this case (4) reduces to

\[
\begin{aligned}
-H_\theta(R; r) &= N(Z) \sum_{g} Q(g; q) \sum_{\phi} R(\phi \mid g; r) \log R(\phi \mid g; r), \\
-H_\theta(T; t) &= N(Y) \sum_{g_1, g_2} Q(g_1; q)\, Q(g_2; q) \sum_{g_3} T(g_3 \mid g_1, g_2; t) \log T(g_3 \mid g_1, g_2; t), \\
-H_\theta(Q; q) &= N(X) \sum_{g} Q(g; q) \log Q(g; q).
\end{aligned}
\tag{9}
\]

Note that information regarding transmission $t$ and penetrance $r$ is completely separated; the interconnection arises only because underlying genotypes are not observed and cannot be precisely inferred. The expected log-likelihood per relevant individual (i.e., individuals in $Y$ for $t$ and in $Z$ for $r$) is minus the conditional entropy of the $R$ and $T$ distributions conditional on, and averaged over, the underlying genotype frequency distribution which depends on $q$. The expected log-likelihood for $q$, per founder individual, is again of simple entropy form; the entropy of the $Q$ distribution is $H_\theta(Q; q)$ and depends only on $q$.

Were estimates of parameters to be based on log-likelihood (2) with expectation (4), satisfying also (6) and (7) though not necessarily (8), questions of identifiability and consistency would be easily resolved. Allelic or genotypic frequencies would be estimated from the terms $\log Q_j(q)$ in (2), parameters of transmission (if any) from the $\log T_j(t)$ terms, and those of penetrance from the first sum. Consistency of estimates is dependent only on the unlimited increase in the sample sizes, $N(X)$, $N(Y)$, and $N(Z)$, as the pedigree is extended, and the statistical information about parameters is that in
multinomial observations with likelihood (2). The only effect of genealogical structure on the potential information about genetic parameters would be via the distortion of genotypic frequencies within the genealogy due to inbreeding or assortative mating. These effects of genotypic distortion are considered in another paper. We note here only that the distorted genotype distribution produced by inbreeding in a pedigree may either increase or decrease the precision of parameter estimates. There is no general result; the optimal underlying genotype distribution depends on precisely which penetrance and transmission parameters are to be estimated.
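The per-individual entropies appearing in (9) are easily evaluated for the example model of Section 2. The sketch below is one way to do so, with logs to base 2; the function names and the ordering of genotypes ($A_1A_1$, $A_1A_2$, $A_2A_2$) are assumptions of this illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in probs if x > 0)


def hardy_weinberg(pi):
    """Genotype frequencies for A1A1, A1A2, A2A2."""
    return [pi * pi, 2 * pi * (1 - pi), (1 - pi) ** 2]


def entropy_Q(pi):
    """H(Q; q) per founder."""
    return entropy(hardy_weinberg(pi))


def entropy_R(pi, p):
    """H(R; r) per observed individual: penetrance entropy averaged over Q."""
    penetrance = [1.0, p, 0.0]
    return sum(q * entropy([a, 1 - a])
               for q, a in zip(hardy_weinberg(pi), penetrance))


def entropy_T(pi, tau):
    """H(T; t) per nonfounder: offspring-genotype entropy averaged over
    random-mating parental pairs, as in (8) and (9)."""
    freq = hardy_weinberg(pi)
    transmit = [1.0, tau, 0.0]          # P(parent transmits A1)
    total = 0.0
    for q1, a in zip(freq, transmit):
        for q2, b in zip(freq, transmit):
            child = [a * b, a * (1 - b) + (1 - a) * b, (1 - a) * (1 - b)]
            total += q1 * q2 * entropy(child)
    return total
```

With $\pi = 0.1$ this gives `entropy_Q(0.1)` of about 0.758 bits per founder, and with $p = 0.4$ a penetrance entropy of about 0.175 bits per observed individual. Only heterozygotes contribute to `entropy_R`, and only matings involving at least one heterozygote contribute to `entropy_T`.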
4. GENOTYPIC UNCERTAINTY IN FAMILY DATA

We turn now to the true situation, where genetic models must be fitted on the basis of phenotypic data alone, and the likelihood takes the form (3). The problem has close parallels with coding theory problems considered by G. Ott (1967). There, an underlying set of states follows a Markov transition process, and produces a set of observable outputs, the probability distribution for the output at any time being dependent on the state at that time. Here the states are genotypes, the outputs phenotypes. Our states do not form a simple Markov process, as individuals have two parents and may have many offspring, but this does not affect the basic structure of the output probability distributions (viz. (1) and (3)), nor of the probabilities of underlying states conditional on output. Petrie (1967) considers the identifiability of the Markov transition probabilities (our $T$) and state-dependent output distributions (our $R$) for such a process, and Baum and Eagon (1967) show that an iterative procedure of estimating transitions and penetrances (in our terminology) according to the current conditional probability distribution of the sequence of underlying states leads to a (locally) maximum likelihood estimate. This is a special case of the EM algorithm (Dempster et al., 1977), which J. Ott (1979) has used in the estimation of transmission probabilities in pedigree analysis.

Thus, given identifiability of parameters and consistency of their estimation, the important aspect of a genealogical structure is the information it provides about unobservable genotypes underlying the observed phenotypes. An observation on any individual provides information about his genotype; where individuals are related it provides also information about the genotypes of individuals already observed. In addition it affects the probability distribution of genotypes of individuals not yet observed; by sequential sampling observations can thus be concentrated on those individuals with high probabilities of phenotypes that will be informative about the particular parameters to be estimated. The necessity of sampling related individuals to discriminate between alternative hypotheses providing
equal trait frequencies in the population, and the improved efficiency of sampling via affected probands when the parameters to be estimated relate to a rare trait, are both simple aspects of this. More generally optimal sequential decision rules can be defined in terms of the distribution of genotypes conditional on phenotypes observed to date (Thompson, 1981b).

A measure of the (lack of) information about underlying genotypes provided by phenotypic observations is the entropy of the genotype distribution conditional on phenotypes, $h(g \mid \phi)$. We shall consider the effect of alternative structures and sampling rules on this measure. Also of interest is the total entropy, $h(\phi)$, of the phenotype distribution of potential observations; those with a priori high uncertainty are more likely to be informative. The sum of these two quantities is the total joint entropy of genotypes and phenotypes, or minus the expected log-likelihood (5) previously considered. Thus this sum is readily computed, while $h(\phi)$ can be computed using any method for the computation of likelihoods on pedigrees, for example that of Cannings et al. (1980). The conditional entropy is then given by the difference.

Consider first a single individual. The uncertainty of his genotype given his phenotype is easily computed as

\[
h(g \mid \phi) = h(g) + h(\phi \mid g) - h(\phi)
= H_\theta(Q; q) + H_\theta(R; r) + f \log f + (1 - f) \log(1 - f).
\tag{10}
\]

Minus the last two terms is the entropy of the two-point distribution $\{f, 1 - f\}$, $f$ denoting the population trait frequency. As is usual for entropy values, all logs in numerical computations will be taken to base 2.

Now consider a pair of non-inbred relatives in a relationship providing probabilities

\[
k_i = P(i \text{ common genes at given autosomal locus}), \quad i = 0, 1, 2; \qquad k_0 + k_1 + k_2 = 1.
\tag{11}
\]
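Given the $k_i$ of (11), one way to obtain the joint genotype distribution of a non-inbred pair, and from it the pairwise genotype and phenotype entropies, is to mix the three cases of 0, 1 or 2 genes in common. The sketch below does this for the example model of Section 2. It is an illustrative reconstruction, not necessarily the computation used for the paper's tables; the function name and return convention are assumptions, and the pair is treated as ordered.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in probs if x > 0)


def pair_entropies(pi, p, k0, k1, k2):
    """Return (h(g1,g2), h(phi1,phi2), h(g1,g2 | phi1,phi2)) in bits.

    Genotypes are ordered A1A1, A1A2, A2A2; k0, k1, k2 are the
    probabilities that the pair have 0, 1 or 2 genes in common.
    """
    q = [pi * pi, 2 * pi * (1 - pi), (1 - pi) ** 2]
    penetrance = [1.0, p, 0.0]
    # One gene in common: the shared gene is A1 with probability 1, 1/2, 0
    # according to the first genotype; the relative's other gene is a
    # random draw from the population.
    shared_a1 = [1.0, 0.5, 0.0]
    one = [[s * pi, s * (1 - pi) + (1 - s) * pi, (1 - s) * (1 - pi)]
           for s in shared_a1]
    joint = [[k0 * q[i] * q[j]
              + k1 * q[i] * one[i][j]
              + (k2 * q[i] if i == j else 0.0)
              for j in range(3)] for i in range(3)]
    h_g = entropy(x for row in joint for x in row)
    # Phenotype pair distribution: index 0 = affected, 1 = unaffected.
    r = [[a, 1 - a] for a in penetrance]
    pheno = [sum(joint[i][j] * r[i][u] * r[j][v]
                 for i in range(3) for j in range(3))
             for u in range(2) for v in range(2)]
    h_phi = entropy(pheno)
    # Given genotypes the two phenotypes are independent, so this term
    # does not depend on the relationship.
    h_phi_given_g = 2 * sum(qi * entropy(ri) for qi, ri in zip(q, r))
    return h_g, h_phi, h_g + h_phi_given_g - h_phi
```

For example, `pair_entropies(0.1, 0.4, 1, 0, 0)` treats the pair as unrelated and gives twice the single-individual values of (10), while `pair_entropies(0.1, 0.4, 0, 0, 1)` corresponds to monozygotic twins.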
Then
and $h(\phi_1, \phi_2 \mid g_1, g_2) = 2H_\theta(R; r)$ independently of the genealogical relationship. Relationships induce correlation in both genotypes and phenotypes, and hence reduce both $h(g_1, g_2)$ and $h(\phi_1, \phi_2)$, relative to an unrelated pair, but the latter to a lesser extent since phenotypes are correlated only via genotypes. Thus $h(g_1, g_2 \mid \phi_1, \phi_2)$ decreases with increasing genetic relationship. However, the pattern is not determined by the coefficient of kinship ($k_1/4 + k_2/2$) alone. In Table III, sibs ($k_2 = 1/4$, $k_1 = 1/2$) provide very slightly larger $h(g_1, g_2)$, equal or smaller $h(\phi_1, \phi_2)$, and larger $h(g_1, g_2 \mid \phi_1, \phi_2)$ than do parents ($k_2 = 0$, $k_1 = 1$). However, sibs always provide smaller posterior genotype uncertainty than do half-siblings ($k_0 = k_1 = 1/2$), who provide values very similar to those for an unrelated pair: only close relationships have any substantial effect. Monozygotic twins have precisely half the genotypic entropy of an unrelated pair, but for intermediate values of penetrance parameters ($p$, in the special example model) a high proportion of the phenotypic entropy. Thus the conditional genotypic entropy for twins is less than half that for an unrelated pair. The precise contours of constant conditional genotypic entropy in the space $k_0 + k_1 + k_2 = 1$ are dependent on the genetic parameters. Some typical values are given in Table III, for the example model of Section 2.

TABLE III
Information in a Pair of Relatives^a

Relationship    $h(g_1, g_2)$    $h(\phi_1, \phi_2)$    $h(g_1, g_2 \mid \phi_1, \phi_2)$

$\pi = 0.1$, $p = 0.4$
Unrelated       1.516            0.818                  1.047
Half-sib        1.478            0.812                  1.015
Sib             1.366            0.796                  0.920
Parent          1.365            0.796                  0.918
Twin            0.758            0.742                  0.365

$\pi = 0.2$, $p = 0.2$
Unrelated       2.247            0.963                  1.746
Half-sib        2.204            0.958                  1.708
Sib             2.061            0.941                  1.582
Parent          2.050            0.946                  1.566
Twin            1.124            0.869                  0.717

$\pi = 0.3$, $p = 0.8$
Unrelated       2.685            1.968                  1.323
Half-sib        2.639            1.949                  1.297
Sib             2.483            1.881                  1.209
Parent          2.459            1.889                  1.176
Twin            1.343            1.551                  0.398

^a See text for details.

From the case for a single pair of relatives we proceed to sibship data. Here the quantity of interest is $h(g_1, g_2 \mid \phi_1, \ldots, \phi_s)$, the entropy of the unordered parental genotypes when phenotypes of $s$ offspring have been observed: once the parental genotypes are correctly inferred we have independent binomial observations for the estimation of the probability that each offspring is affected (Table II). Families with certain parental genotypes give
no information about segregation and penetrance parameters, but once it is established that we have such a family no further members of it will be sampled. Thus $h(g_1, g_2 \mid \phi_1, \ldots, \phi_s)$ is still the relevant quantity. Table IV shows that substantial decreases in uncertainty are provided by the first three or four offspring. By then certain parental combinations are, if untrue, ruled out with high probability, but even at eight children substantial uncertainty remains. Sibship data is substantially less efficient than a binomial sample of the same size, and small sibships are, per individual, more useful than large ones. Thompson (1981b) has given values of the Fisher information in a sibship for distinguishing values of $\pi$ and $p$ giving equal population frequency $f$ for the trait: this measure provides similar results.

For the analysis of segregation distortion, given $p = 0$ (or $p = 1$ with allele frequency $(1 - \pi)$), Table V presents results at $\tau = 0.5$, since it is hypotheses in this neighbourhood that will be of interest. Results are qualitatively similar over $0.25 < \tau < 0.75$. For larger trait frequencies $f$, offspring continue to provide information on parental genotype combination right up to $s = 8$, but again substantial uncertainty about parental genotype remains. Sibship data alone is not ideal, but if sibships are used, then large ones are the most useful. For small $\pi$ values this type of data contributes little information about parental genotypes. In this case a very large number of sibships will be required to test $\tau = 1/2$, but small sibships will be almost as useful, on average, as large ones.
TABLE IV
Conditional Parental Genotypic Entropy under a Model of Partial Heterozygote Penetrance, When Phenotypes of s Offspring Are Observed

$\pi$    $p$      s = 0    s = 2    s = 4    s = 6    s = 8    $h(\phi_{ch} \mid g_1, g_2)$^a
0.1      0.2      1.205    1.041    0.910    0.799    0.704    0.182
         0.4      1.205    0.932    0.730    0.578    0.463    0.262
         0.6      1.205    0.818    0.571    0.414    0.314    0.307
         0.8      1.205    0.702    0.437    0.299    0.227    0.327
0.3      0.2      2.110    1.801    1.587    1.421    1.286    0.495
         0.4      2.110    1.749    1.492    1.296    1.140    0.625
         0.6      2.110    1.666    1.372    1.163    1.011    0.678
         0.8      2.110    1.556    1.233    1.028    0.894    0.664
0.5      0.2^b    2.375    1.938    1.664    1.469    1.323    0.684
         0.4^b    2.375    1.978    1.710    1.506    1.348    0.770

^a Entropy of child phenotype $\phi_{ch}$, conditional on parental genotypes $(g_1, g_2)$.
^b Values for $\pi = 0.5$ and $p = 0.6$, 0.8 given by symmetry.
TABLE V
Conditional Parental Genotypic Entropy under a Model of Segregation Distortion, When Phenotypes of s Offspring Are Observed. $\tau = 0.5$, $p = 0$.

                                  $\pi = 0.1$    0.2      0.3      0.4      0.5
Recessive case
$h(\phi_{ch} \mid g_1, g_2)$      0.030          0.109    0.219    0.341    0.453
s = 0                             1.205          1.761    2.110    2.309    2.375
s = 2                             1.115          1.528    1.735    1.810    1.773
s = 4                             1.061          1.390    1.515    1.522    1.433
s = 6                             1.027          1.303    1.376    1.342    1.227
s = 8                             1.006          1.246    1.286    1.226    1.097

Dominant case
$h(\phi_{ch} \mid g_1, g_2)$      0.318          0.493    0.555    0.533    0.453
s = 0                             1.205          1.761    2.110    2.309    2.375
s = 2                             0.579          1.023    1.375    1.627    1.773
s = 4                             0.314          0.669    0.995    1.255    1.433
s = 6                             0.203          0.496    0.789    1.039    1.227
s = 8                             0.155          0.409    0.674    0.909    1.097
A slightly closer approximation to the optimal binomial sampling is achieved when the parents' phenotypes are also observed. The genotypic uncertainty of the unordered parent pair after observation of their phenotypes alone is that provided by two unordered unrelated individuals ($s = 0$, Table VI), which is rather less than for an ordered pair (Table III). A comparison with Table IV shows that three children alone leave less uncertainty about parental genotypes than observation of the parent pair, but that observation of two children is less informative than that of the two adults. When children are observed in addition to parents, parental genotypic entropy is further reduced, but such nuclear family data is not markedly more efficient than sibship data. Observation of parents and $s$ children (Table VI) leaves very similar parental genotypic uncertainty to the observation of $(s + 2)$ children (Table IV), although the former is always the smaller. Where this is the only consideration observations should be made on nuclear families, but where the potential information about parameters is greater from the offspring members, sibship data will be preferred. For problems of segregation distortion it is best to observe more offspring, since, given parental genotypes, only offspring contribute to the estimation of $\tau$. For penetrance estimation it may also be better to observe the more environmentally and demographically homogeneous sibship. Where estimates of population frequency are required, nuclear family data may be preferred, but only if the greater efficiency of this estimation outweighs the disadvantage of the smaller number of observations within each sibship.
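The comparison between sibship data and nuclear-family data can be explored numerically. The sketch below computes the conditional entropy of the unordered parental genotype pair for the example model, given the phenotypes of $s$ offspring and, optionally, of the two parents. It assumes random mating and conditions on the numbers of affected offspring and affected parents, which carry the same information about the parental pair as the individual phenotypes; the function name and arguments are illustrative and this is not claimed to be the procedure used to produce the paper's tables.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(x * math.log2(x) for x in probs if x > 0)


def parental_entropy(pi, p, s, tau=0.5, parents_observed=False):
    """h(g1, g2 | data) in bits for the unordered parental pair."""
    q = [pi * pi, 2 * pi * (1 - pi), (1 - pi) ** 2]   # A1A1, A1A2, A2A2
    penetrance = [1.0, p, 0.0]
    transmit = [1.0, tau, 0.0]                        # P(parent transmits A1)
    weights = []        # frequency of each unordered parental combination
    h_data_given = 0.0  # entropy of the data given the parental combination
    n_parent_outcomes = 3 if parents_observed else 1
    marginal = [[0.0] * (s + 1) for _ in range(n_parent_outcomes)]
    for i in range(3):
        for j in range(i, 3):
            w = q[i] * q[j] * (1 if i == j else 2)
            a, b = transmit[i], transmit[j]
            child = [a * b, a * (1 - b) + (1 - a) * b, (1 - a) * (1 - b)]
            aff = sum(c * r for c, r in zip(child, penetrance))
            # Distribution of the number of affected offspring (binomial).
            n_aff = [math.comb(s, k) * aff ** k * (1 - aff) ** (s - k)
                     for k in range(s + 1)]
            # Distribution of the number of affected parents (0, 1 or 2).
            if parents_observed:
                x, y = penetrance[i], penetrance[j]
                n_par = [(1 - x) * (1 - y), x * (1 - y) + (1 - x) * y, x * y]
            else:
                n_par = [1.0]
            h_data_given += w * entropy(u * v for u in n_par for v in n_aff)
            for m, u in enumerate(n_par):
                for k, v in enumerate(n_aff):
                    marginal[m][k] += w * u * v
            weights.append(w)
    h_data = entropy(x for row in marginal for x in row)
    # h(c | data) = h(c) + h(data | c) - h(data)
    return entropy(weights) + h_data_given - h_data
```

With $\pi = 0.1$ and $p = 0.2$, `parental_entropy(0.1, 0.2, 2)` evaluates the case of two observed offspring and `parental_entropy(0.1, 0.2, 0, parents_observed=True)` the case of the parent pair alone, so the statement that two children are less informative than the two adults can be checked directly. Setting `p=0.0` gives the recessive case of the segregation-distortion model.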
TABLE VI
Conditional Parental Genotypic Entropy under a Model of Partial Heterozygote Penetrance, When Phenotypes of Parents and s Offspring Are Observed^a

$\pi$    $p$      s = 0    s = 2    s = 4    s = 6    s = 8
0.1      0.2      1.004    0.871    0.761    0.667    0.586
         0.4      0.871    0.681    0.537    0.428    0.345
         0.6      0.700    0.487    0.350    0.264    0.209
         0.8      0.475    0.295    0.201    0.152    0.124
0.3      0.2      1.614    1.400    1.235    1.102    0.990
         0.4      1.576    1.343    1.161    1.015    0.896
         0.6      1.437    1.183    0.998    0.863    0.763
         0.8      1.181    0.926    0.761    0.651    0.573
0.5      0.2^b    1.604    1.354    1.174    1.038    0.9?3
         0.4^b    1.735    1.499    1.316    1.172    1.055

^a Compare Table IV.
^b Values for $\pi = 0.5$ and $p = 0.6$, 0.8 given by symmetry.
5. CONNECTED FAMILIES AND PEDIGREES

From families we proceed to a similar discussion for extended pedigrees, again restricting attention in this paper to the non-inbred case, where the pedigree is simply a chain of connected nuclear families (Fig. 1). Observations on successive families are independent conditional on the genotype of the intervening “pivot” individuals ($B_1$, $C_1$, etc.). This is the useful property employed in methods for the computation of the likelihood on such family
[Figure 1: diagram of a chain of nuclear families labelled A, B, C, D and E, with family A marked as ascertained and family E as not yet observed.]

FIG. 1. Pedigree constructed as a chain of nuclear families: the case $s = 3$ is drawn. Each set of observations on a family is either of two parents and two offspring (family C), or of the final offspring, considered as a parent, and his three offspring (family B). In the former case the previous “pivot” is that offspring member who is the parent of a previously observed family ($C_1$) and the pivot for future observations the as yet unobserved offspring $D_1$. In the other case, respectively, $B_1$ and $C_1$ (or $D_1$ and $E_1$).
pedigrees. The possible genotypes of peripheral individuals (e.g., $E_1$) who are potential pivots are the discriminating factor in comparing alternative sequential sampling options. If expected pivotal genotypic entropy is small, genotypes can be accurately predicted, and the sampled pedigree extended only in informative cases. If little information about pivot genotypes is provided by phenotypic observations to date, sequential sampling will be ineffectual. Further, on a sampled pedigree, these genotype uncertainties determine the extent to which the log-likelihood approximates the sum of contributions from the separate families. Thus we compute the entropy of the pivot genotypes, given phenotypic data.

It is found that dependence, as measured by total conditional phenotypic entropy of successive families, extends little beyond immediately adjacent relatives. Nonetheless the structure of relationship within families decreases genotypic uncertainty, and the relationship between families decreases it further. Table VII shows the pivot's conditional genotypic entropy for the case $s = 2$; for larger $s$ “remote” effects are (proportionately) smaller. For rare near-recessive traits the genotypic distribution is not much altered by the phenotypic data, but there is a continuing effect over several sibships. Here large-scale sequential decision procedures will be required to make optimal use of the relationship structure. For traits with higher heterozygote
TABLE VII
Conditional Genotypic Entropy of Pivot Individual Given Phenotypic Data^a

                                 Parent pivot $E_1$                                 Child pivot $D_1$
$\pi$    $p$    Individual    $E_1 \mid D$    $E_1 \mid D,C$    $E_1 \mid D,C,B$    $D_1 \mid C$    $D_1 \mid C,B$    $D_1 \mid C,B,A$
0.1      0.2    0.758         0.688           0.684             0.676               0.658           0.648             0.639
0.3      0.2    1.342         1.199           1.195             1.190               1.142           1.134             1.128
0.5      0.2    1.500         1.288           1.280             1.278               1.221           1.215             1.212
0.1      0.4    0.758         0.641           0.628             0.618               0.601           0.591             0.577
0.3      0.4    1.342         1.180           1.172             1.170               1.122           1.114             1.113
0.5      0.4    1.500         1.315           1.307             1.307               1.251           1.244             1.244
0.1      0.6    0.758         0.580           0.557             0.547               0.535           0.529             0.516
0.3      0.6    1.342         1.138           1.124             1.123               1.077           1.071             1.070
0.5      0.6    1.500         1.315           1.307             1.307               1.251           1.244             1.244
0.1      0.8    0.758         0.499           0.474             0.466               0.461           0.457             0.445
0.3      0.8    1.342         1.067           1.050             1.047               1.006           1.002             0.998
0.5      0.8    1.500         1.288           1.280             1.278               1.221           1.215             1.212

^a See Fig. 1 for labelling of families and individual pivots.
54
E. A. THOMPSON
penetrance, the data on a nuclear family provides very substantial infor mation about the genotype of the final member, but inclusion of more distant families has scarcely any effect. For s > 2 a chain of sibships is constructed by choosing randomly any one of the pivot’s (s - 1) sibs to continue the pedigree sampling (Fig. 1). As s is increased, the proportion of founder genes in the pedigree is decreased, each gene being, on average, observed (s + 1) times. The larger s, the smaller genotypic uncertainty, and the greater the accuracy of estimation of penetrance and transmission parameters. From Table VII it is a reasonable first approximation to consider only the effect of the immediate nuclear family. Consider first the case of an offspring pivot, whose parents and s sibs have been observed. For intermediate values of p, substantial information is provided by each additional observed sib, right up to s = 8. Even then there is much more uncertainty than if parental genotypes could be precisely inferred; this is H,(T, t) which from (9) takes the value 2n(l - 7~) (2 - ~(1 - n)) at r = 0.5 for our simple model (s = co; Table VIII). As a base point for comparisons the no-data entropy (s = 0) is also given: H,(Q; q) = -2(x log x + (1 - n) log( 1 - n) + n( 1 - 7~)). For a near-dominant trait @ = 0.8, 7c< 0.5), less additional information is provided by successive sibs, but pivotal genotypic uncertainty is smaller, than for the near-recessive (p = 0.2, n < 0.5) case. When parents are not observed, there is of course more uncertainty about pivot genotype, but as in Section 4 offspring phenotypes are almost as infor mative as parental observations about parental genotypes, and thence in this case about the genotype of their pivot sib. Again, large numbers of sibs continue to be informative, especially for small 7cand intermediate values of p (Table VIII). 
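As a numerical check on the two reference entropies quoted above, the following Python sketch evaluates the no-data entropy and the known-parental-genotype entropy both from their closed forms and by direct enumeration. It assumes Hardy-Weinberg genotype proportions at allele frequency π, random mating, Mendelian segregation and logarithms to base 2. These assumptions are inferred here from the tabulated values (0.758 at π = 0.1, 1.500 at π = 0.5) rather than stated in this passage, and all function names are illustrative only.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

def hw_genotype_freqs(pi):
    """Assumed Hardy-Weinberg frequencies for 0, 1, 2 copies of the allele."""
    return [(1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2]

def no_data_entropy(pi):
    """Genotypic entropy of an individual with no phenotypic data (s = 0)."""
    return entropy_bits(hw_genotype_freqs(pi))

def no_data_entropy_closed(pi):
    """Closed form quoted in the text, with logs to base 2 (assumption)."""
    return -2 * (pi * math.log2(pi) + (1 - pi) * math.log2(1 - pi) + pi * (1 - pi))

def known_parents_entropy(pi):
    """Expected offspring genotypic entropy when both parental genotypes are known."""
    freqs = hw_genotype_freqs(pi)
    total = 0.0
    for g1, f1 in enumerate(freqs):
        for g2, f2 in enumerate(freqs):
            t1, t2 = g1 / 2, g2 / 2  # Mendelian transmission probabilities
            child = [(1 - t1) * (1 - t2), t1 * (1 - t2) + (1 - t1) * t2, t1 * t2]
            total += f1 * f2 * entropy_bits(child)
    return total

def known_parents_entropy_closed(pi):
    """Closed form quoted in the text."""
    return 2 * pi * (1 - pi) * (2 - pi * (1 - pi))

for pi in (0.1, 0.3, 0.5):
    print(pi, round(no_data_entropy(pi), 3), round(known_parents_entropy(pi), 3))
```

Under these assumptions the enumerated values agree with the closed forms, and the no-data entropies round to 0.758 and 1.500 at π = 0.1 and π = 0.5, matching the tabulated "Individual" entries.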
For higher trait frequencies, sib data alone are less efficient; parents and four offspring are preferred to eight sibs alone, who give pivot genotypic entropy 80% of that for s = 0 and 120% of the asymptotic limit. A spouse and offspring are more informative observations, particularly for near-dominant traits (p = 0.8, π < 0.5). Only in the case of a rare near-recessive (p < 0.2, π ≤ 0.1) are sibs and parents preferred. Large sibships are again informative, and at s = 8 pivotal genotypic entropy is now on average only 60% of its initial (s = 0) value. However, the asymptotic value

$$4\pi(1-\pi)\left\{\pi^{2}\left[\log(1+p)-\frac{p}{1+p}\log p\right]+(1-\pi)^{2}\left[\log(2-p)-\frac{1-p}{2-p}\log(1-p)\right]\right\}$$

is now small. Thus even beyond s = 8 there is substantial further information to be gained.
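To give a feel for how small the asymptotic value is, the sketch below evaluates the expression above in Python. Logarithms are taken to base 2, which is an assumption made here for consistency with the tabulated entropies. The function name and the comparison against a few s = 8 entries (spouse and eight offspring), as read from Table VIII, are illustrative only.

```python
import math

def asymptotic_pivot_entropy(pi, p):
    """Limiting (s -> infinity) pivot entropy, spouse and offspring observed.

    Transcribes the expression in the text; log base 2 is an assumption.
    Requires 0 < pi < 1 and 0 < p < 1.
    """
    first = pi ** 2 * (math.log2(1 + p) - (p / (1 + p)) * math.log2(p))
    second = (1 - pi) ** 2 * (
        math.log2(2 - p) - ((1 - p) / (2 - p)) * math.log2(1 - p)
    )
    return 4 * pi * (1 - pi) * (first + second)

# A few s = 8 entries (spouse and 8 offspring) as read from Table VIII,
# keyed by (pi, p).
s8_entries = {
    (0.1, 0.2): 0.559,
    (0.1, 0.8): 0.252,
    (0.3, 0.2): 0.995,
    (0.5, 0.4): 1.050,
}

for (pi, p), s8 in s8_entries.items():
    limit = asymptotic_pivot_entropy(pi, p)
    print(f"pi={pi}, p={p}: s=8 entropy {s8:.3f}, limit {limit:.3f}")
```

For each of these parameter pairs the limit lies well below the s = 8 entry, which is the sense in which further offspring remain informative.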
TABLE VIII
Effect of Increasing Family Size on Pivotal Genotypic Entropy

 π:       0.1     0.1     0.1     0.1     0.3     0.3     0.3     0.3     0.5     0.5
 p:       0.2     0.4     0.6     0.8     0.2     0.4     0.6     0.8     0.2     0.4

 s = 0    0.758   0.758   0.758   0.758   1.343   1.343   1.343   1.343   1.500   1.500

 Spouse and s offspring observed (e.g., C,, E,)
 s = 2    0.688   0.641   0.580   0.499   1.199   1.180   1.138   1.066   1.288   1.315
 s = 4    0.637   0.556   0.465   0.362   1.109   1.072   1.006   0.904   1.160   1.197
 s = 8    0.559   0.442   0.343   0.252   0.995   0.932   0.846   0.736   1.010   1.050

 Parents and s sibs observed (e.g., E,, D,)
 s = 0    0.682   0.636   0.576   0.498   1.177   1.158   1.117   1.045   1.257   1.285
 s = 2    0.637   0.570   0.501   0.433   1.114   1.090   1.043   0.973   1.190   1.221
 s = 4    0.600   0.520   0.451   0.399   1.069   1.038   0.988   0.926   1.126   1.153
 s = 8    0.543   0.451   0.399   0.372   1.009   0.965   0.921   0.877   1.087   1.067

 s sibs only
 s = 2    0.695   0.656   0.573   0.532   1.219   1.201   1.173   1.139   1.330   1.340
 s = 4    0.648   0.683   0.525   0.477   1.161   1.151   1.117   1.079   1.249   1.255
 s = 8    0.577   0.488   0.430   0.399   1.107   1.067   1.014   0.972   1.159   1.158
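The 60% figure quoted in the text for a spouse and eight offspring can be checked directly against Table VIII. The short Python sketch below uses the s = 0 and s = 8 entries as read from the table and averages their ratio over the ten (π, p) columns. The column order and variable names are choices made here for illustration.

```python
# Columns: pi = 0.1 (p = 0.2, 0.4, 0.6, 0.8), pi = 0.3 (same p), pi = 0.5 (p = 0.2, 0.4)
no_data = [0.758] * 4 + [1.343] * 4 + [1.500] * 2

# Spouse and s = 8 offspring observed, same column order
spouse_8_offspring = [0.559, 0.442, 0.343, 0.252,
                      0.995, 0.932, 0.846, 0.736,
                      1.010, 1.050]

# Ratio of s = 8 entropy to the no-data (s = 0) entropy, column by column
ratios = [s8 / s0 for s8, s0 in zip(spouse_8_offspring, no_data)]
mean_ratio = sum(ratios) / len(ratios)

print([round(r, 2) for r in ratios])
print(f"mean ratio of s = 8 to s = 0 entropy: {mean_ratio:.2f}")
```

The mean ratio comes out close to 0.6, in line with the statement in the text, and the reduction is largest in the rare near-dominant column (π = 0.1, p = 0.8).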
6. CONCLUSIONS

A structure of related individuals is not always advantageous for precision in estimation of genetic parameters. Relationships decrease the number of distinct genes observed in a given number of individuals. If the only aim in (2) is to estimate allele frequencies, N(X) should be maximised by observations on unrelated individuals. More generally, if trait frequencies are to be estimated on the basis of phenotypic data (3), unrelated individuals are again the optimal sample. On the other hand, unrelated individuals are completely uninformative with regard to distinguishing hypotheses providing the same population frequencies for traits, and if related individuals are sampled it will usually be to distinguish genetic models providing the same trait frequencies. Nonetheless we shall still normally need to have N(X) large to provide good joint estimates of genotype frequency and penetrance.

The genotypic frequencies within a genealogy, and in particular among founders X, pervade all the components of the expected log-likelihood. Where genotypes are observable, or can be accurately inferred from phenotypes, we have the straightforward case of binomial sampling. By decreasing genotypic uncertainty, pedigree structure contributes to bringing us close to this, but an exact prescription of universally optimal policies is impossible.

In Section 4 we considered several simple examples of the effects of relationships, and have seen that close relatives are more useful than distant, provided the total number of distinct genes observed is not thereby limited. An important finding is that (s + 2) sibs provide very nearly as much information about the genotypes of the family as do s sibs with their 2 parents. Where the cost of sampling is in locating families, parents of a sibship are thus valuable observations. But where the cost is per individual, sibs will almost always be preferred, especially where they are individually more informative about the particular parameters of interest. Whereas in problems of penetrance estimation small sibships and families provide as much information per individual as do large ones, in problems of segregation larger sibships are preferred, especially for more common traits.

In Section 5 preliminary consideration is given to the question of extended pedigrees, where now relationships between successive families further reduce genotype uncertainty, providing for more accurate parameter estimation and scope for sequential sampling procedures. Again an important question is the balance between observations on a large number of distinct genes and observation of more replicates of each, the latter being the effect of an interconnected structure. Here the contrast is between penetrance patterns of near-dominant and near-recessive types. For a near-recessive pattern, sibs and parents of probands are the valuable observations, and large families are useful. Further, effects of data observations on genotype probabilities of relatives can extend over several families, giving scope for quite complex sampling rules. In contrast, for near-dominant patterns spouse and offspring are the preferred observations, and sampling large numbers either of offspring or of sibs may often be wasted effort. Effects of observations on genotype distributions are very localised, suggesting that any sequential sampling rule should be based only on data on immediate relatives. The generally localised effects of observations, even in cases of quite low penetrance, will enable more complex patterns of interconnection to be studied.
REFERENCES

ANDERSON, M. W., BONNE-TAMIR, B., CARMELLI, D., AND THOMPSON, E. A. 1979. Linkage analysis and the inheritance of arches in a Habbanite isolate, Amer. J. Hum. Genet. 31, 620-629.
BAUM, L. E., AND EAGON, J. A. 1967. An inequality with application to statistical estimation for probabilistic functions of Markov chains and to a model for ecology, Bull. Amer. Math. Soc. 73, 360-363.
CANNINGS, C., AND THOMPSON, E. A. 1977. Ascertainment in the sequential sampling of pedigrees, Clin. Genet. 12, 208-212.
CANNINGS, C., THOMPSON, E. A., AND SKOLNICK, M. H. 1980. Pedigree analysis of complex models, in "Current Developments in Anthropological Genetics" (J. Mielke and M. Crawford, Eds.), Plenum, New York.
DEMPSTER, A. P., LAIRD, N. M., AND RUBIN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Statist. Soc. B 39, 1-22.
EDWARDS, J. H. 1971. The analysis of X-linkage, Ann. Hum. Genet. (London) 34, 229-250.
ELSTON, R. C., AND STEWART, J. 1971. A general model for the genetic analysis of pedigree data, Hum. Hered. 21, 523-542.
GO, R. C. P., ELSTON, R. C., AND KAPLAN, E. B. 1978. Efficiency and robustness of pedigree segregation analysis, Amer. J. Hum. Genet. 30, 62-69.
KRAVITZ, K., SKOLNICK, M. H., CANNINGS, C., CARMELLI, D., BATY, B., AMOS, B., JOHNSON, A., MENDELL, N., EDWARDS, C., AND CARTWRIGHT, G. 1979. Genetic linkage between hereditary haemochromatosis and HLA, Amer. J. Hum. Genet. 31, 601-619.
KULLBACK, S. 1959. "Information Theory and Statistics," Dover, New York.
MOLL, P. P., AND SING, C. F. 1979. Sampling strategies for the analysis of quantitative traits, in "Genetic Analysis of Common Diseases" (C. F. Sing and M. H. Skolnick, Eds.), Liss, New York.
MORTON, N. E. 1959. Genetic tests under incomplete ascertainment, Amer. J. Hum. Genet. 11, 1-16.
OTT, G. 1967. Compact encoding of stationary Markov sources, IEEE Trans. Inform. Theory IT-13, 82-86.
OTT, J. 1979. Maximum likelihood estimation by counting methods under polygenic and mixed models in human pedigrees, Amer. J. Hum. Genet. 31, 161-175.
PETRIE, T. 1967. Classification of equivalent processes which are functions of finite Markov chains, IDA-CRD Log. No. 8694.
STEWART, J. 1980. Schizophrenia: The systematic construction of genetic models, Amer. J. Hum. Genet. 32, 173-188.
THOMPSON, E. A., KRAVITZ, K., HILL, J., AND SKOLNICK, M. H. 1978. Linkage and the power of the pedigree, in "Genetic Epidemiology" (N. E. Morton, Ed.), Academic Press, New York.
THOMPSON, E. A. 1981a. Optimal sampling for pedigree analysis: relatives of affected probands, Amer. J. Hum. Genet. 33, 968-977.
THOMPSON, E. A. 1981b. Optimal sampling for pedigree analysis: sequential schemes for sibships, Biometrics 37, 313-326.