Disequilibrium Mapping: Composite Likelihood for Pairwise Disequilibrium

Disequilibrium Mapping: Composite Likelihood for Pairwise Disequilibrium

GENOMICS 36, 1–16 (1996) 0419 ARTICLE NO. Disequilibrium Mapping: Composite Likelihood for Pairwise Disequilibrium B. DEVLIN,*,1 NEIL RISCH,† AND ...

304KB Sizes 2 Downloads 122 Views

GENOMICS

36, 1–16 (1996) 0419

ARTICLE NO.

Disequilibrium Mapping: Composite Likelihood for Pairwise Disequilibrium B. DEVLIN,*,1 NEIL RISCH,†

AND

KATHRYN ROEDER*

*Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213; and †Department of Genetics, Stanford University School of Medicine, Stanford, California 94305 Received January 26, 1996; accepted May 22, 1996

tion. Generations of recombination produce a pattern of disequilibrium whereby, in expectation, disequilibrium should be greatest for markers close to the disease locus and least for distant markers. The observed disequilibrium pattern can be quite different from the expected pattern. This difference can be influenced by many evolutionary forces, including selection. However, in many cases the critical factors are genetic drift, the evolutionary sampling process, and to a lesser extent the statistical process of sampling from the current generation. For STRs, mutations at the marker loci are also important. Consequently, it is critical to take these sources of variation into account (Kaplan et al., 1995). The variability of linkage disequilibrium, due to drift, is clearly a function of population size each generation. Other factors are also important, such as the distance between the marker and disease locus and the frequency of the associated marker allele. Devlin and Risch (1995) noted the importance of the latter for simple disequilibrium mapping (see also Allamand et al., 1995). In this report, we develop a Composite Likelihood (CL) method for localizing disease genes that models the sources of variability discussed above. We elaborate the statistical features of the CL model, including its assumptions, form, how evolutionary variances are taken into account in the model, and so on. The performance of the model is evaluated using data derived from evolutionary simulations. We also illustrate this method with analyses of data from cystic fibrosis (Kerem et al., 1989) and diastrophic dysplasia (Ha¨stbacka et al., 1994). These simulations and analyses of real data show that the CL method can be an effective tool for fine-mapping disease genes, and they also illustrate the potential pitfalls of the method. Finally, we compare the CL method to other methods that use linkage disequilibrium to fine-map disease genes (Ha¨stbacka et al., 1992; Devlin and Risch, 1995; Kaplan et al., 1995; Terwilliger, 1995).

The pattern of linkage disequilibrium between a disease locus and a set of marker loci has been shown to be a useful tool for geneticists searching for disease genes. Several methods have been advanced to utilize the pairwise disequilibrium between the disease locus and each of a set of marker loci. However, none of the methods take into account the information from all pairs simultaneously while also modeling the variability in the disequilibrium values due to the evolutionary dynamics of the population. We propose a Composite Likelihood (CL) model that has these features when the physical distances between the marker loci are known or can be approximated. In this instance, and assuming that there is a single disease mutation, the CL model depends on only three parameters, the recombination fraction between the disease locus and an arbitrary marker locus, u, the age of the mutation, and a variance parameter. When the CL is maximized over a grid of u, it provides a graph that can direct the search for the disease locus. We also show how the CL model can be generalized to account for multiple disease mutations. Evolutionary simulations demonstrate the power of the analyses, as well as their potential weaknesses. Finally, we analyze the data from two mapped diseases, cystic fibrosis and diastrophic dysplasia, finding that the CL method performs well in both cases. q 1996 Academic Press, Inc.

1. INTRODUCTION

Linkage disequilibrium, the association of alleles at different loci on the same chromosome (Crow and Kimura, 1970), has proven to be a useful measure for researchers attempting to refine the location of a disease gene. The logic behind disequilibrium mapping is straightforward. At the instant a disease mutation occurs, the mutant allele and other alleles at flanking markers are in complete disequilibrium (Clegg et al., 1976). When evolutionary forces can be ignored, disequilibrium decays through the action of recombina-

2. THE CL MODEL

Where appropriate, each of the following subsections begins with a general explanation of the methods/re-

1 To whom correspondence should be addressed. Telephone: (412) 268-8973. Fax: (412) 268-7828.

1

AID

Genom 4241

/

6r1b$$$781

07-24-96 18:22:57

0888-7543/96 $18.00 Copyright q 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

gnma

AP: Genomics

2

DEVLIN, RISCH, AND ROEDER

TABLE 1

2.2. Model Structure

Counts (Left) and Frequencies (Right) of Haplotypes and Alleles in a Sample Marker

Disease allele

Normal allele

A1 A2

n11 n21

n12 n22

n/1

n/2

Disease allele

Normal allele

n1/ n2/

pP 11 pP 21

pP 12 pP 22

pP 1/ pP 2/

n

pP /1

pP /2

1

sults described therein. Then these explanations are elaborated by technical supporting matter. The gist of the methodology should be apparent without thorough consideration of the technical matter. 2.1. Assumptions and Notation To formulate the CL model, we follow Devlin and Risch (1995), Kaplan et al. (1995), and many others by assuming initial complete disequilibrium between the disease allele and a specific marker allele for each marker locus on the chromosome. Complete disequilibrium need not arise by a single disease mutation (Ha¨stbacka et al., 1992), but that will usually be the case. We build a simple model based on the assumption that marker polymorphism precedes disease mutation and that only one disease mutation exists in the population. We then generalize our model so that it will be robust to deviations from these assumptions by introducing a heterogeneity parameter (cf. Lynn et al., 1995; Terwilliger, 1995). Each multilocus haplotype consists of L ordered markers, labeled l Å 1, . . . , L, from centromere to telomere. We assume that the physical distances between markers are known and that the Haldane map function translates these physical distances into recombination fractions. Let ul denote the recombination fraction between the l’th marker and the first marker, and let u denote the recombination fraction between the disease locus and the first marker. We parameterize the recombination fraction between disease locus and the l’th marker using al Å Éu 0 ulÉ. (Note that u Å a1 .) Only the pairwise disease-marker haplotypes are required to compute the composite likelihood, and we assume that these can be determined for each individual in the sample. This information is automatically available for a recessive disease. For a dominant disease, the information can be obtained from multiplex families, extended pedigrees, molecular haplotyping, and so on. We assume that both disease and marker loci are biallelic. Our notation for counts and frequencies (probabilities) of allele and haplotypes for a disease and marker locus is given in Table 1. Naturally the pˆ ’s in Table 1 are only sample estimates of some underlying unknown parameters, which we denote by p’s. Moreover, these parameters vary over generations, G. When necessary, we use pij(G) to denote the ij’th haplotype frequency at generation G.

AID

Genom 4241

/

6r1b$$$782

07-24-96 18:22:57

For most low-dimensional problems, it is straightforward to evaluate a likelihood model to obtain the maximum likelihood estimates for the parameters. It is not straightforward, however, to analyze the full likelihood for the problem of fine-mapping disease genes. In fact, the full likelihood for multilocus haplotypes is extremely complicated to specify because of the effects of evolutionary sampling. A similar complexity arises in the analysis of spatial data. It has become a standard practice in spatial analyses to take a composite likelihood approach (Besag, 1974). The best known application of a composite likelihood is Cox’s proportional hazard model. Other applications abound; see Lindsay (1988) for new developments and a review of the subject. To construct a composite likelihood, one starts with a set of conditional or marginal events for which one can write log likelihoods Ll(u), l Å 1, . . . , L. One then constructs the composite log likelihood CL(u) Å (l Ll(u). For example, suppose we observe two dependent random variables, Y1 , Y2 . The log likelihood could be written as log f(y1 ; u) / log f(y2Éy1 ; u) or log f(y2 ; u) / log f(y1Éy2 ; u). Two natural choices for the composite likelihood are the following: log f(y1 ; u) / log f(y2 ; u) and log f(y1Éy2 ; u) / log f(y2Éy1 ; u). Usually one of the two expressions (marginal or conditional likelihood) is much simpler to express, and consequently the choice of which composite likelihood to use will be obvious in practice. Composite likelihood estimators inherit many, but not all, of the properties of ordinary likelihood estimators. For instance, they are generally consistent (the estimated parameter value converges to the true parameter value as the sample size tends to infinity), but they are often not fully efficient. Two motivations have been given for their frequent use in practice. They provide a substitute method of estimation when the full likelihood is very difficult to specify. And they often represent the portion of the model about which we have the most knowledge. A composite log likelihood differs from an ordinary log likelihood in another important way: the terms in the sum need not be independent. This is one of the features of the methodology that we will exploit. As a consequence, however, the resulting composite likelihood does not lead naturally to a confidence interval for the disease location. For this reason we will focus on determining ‘‘search regions,’’ as discussed in Section 2.6. The entire haplotype information for L marker loci consists of an (L / 1)-way table. However, even when the full haplotype information is available, we have chosen a composite likelihood approach that forms a component log likelihood based on the association between each individual marker and the disease. Assuming that each marker is biallelic, a 2 1 2 table is available for each marker–disease combination. In our anal-

gnma

AP: Genomics

3

DISEQUILIBRIUM MAPPING METHODS

ysis of these data we will condition on the one-way marginal frequencies for the disease and marker loci. For any analysis of disequilibrium to be effective at localizing disease genes, the association between markers and the disease allele must break down over generations as a function of u. In addition, the measure of disequilibrium should be a direct function of u. For large populations, such a measure is the robust attributable risk d (Bengtsson and Thomson, 1981; Devlin and Risch, 1995), which equals



p11p22 0 p12p21 . p/1p22

[1]

Furthermore, for large populations, 0(1/G)log d is approximately equal to u, although this approximation is inaccurate for small populations (Hill, 1974). Consequently, we argue that Y Å 0log dˆ is a good measure of the association between an individual marker and the disease (Appendix I), and we base our composite likelihood on these quantities. In essence, uˆ Å Y/G provides an estimating equation for u. Our composite likelihood is constructed as CL(u) Å (Ll log f(Yl ; u), where f(Yl ; u) is the likelihood of observing Yl Å 0log dˆ l for a given u. Due to evolutionary sampling and population demography, the exact distribution of Yl is unknown. However, the results from our evolutionary simulations (Devlin and Risch, 1995, and this paper) indicate that it is reasonable to assume that it follows a gamma distribution, approximately. The gamma density can be parameterized by a mean and precision parameter (ml , ll): E[Yl] Å ml , var[Yl] Å ml/ll . (Recall that the x2 distribution is a special case of the gamma distribution.) These parameters can be related to the parameters of the evolutionary process when the distances between markers are known. Because ml is the mean of the gamma distribution, it follows that ml is approximately equal to alG. To express ll as a function of the parameters of the evolutionary process, we require some model for the variance of Yl . A direct method of estimating the variance is difficult to formulate as it depends on unknowable quantities, such as the number of disease alleles present in all generations since the advent of the disease. In Section 2.3 we detail a computation that provides an estimate of the relative variances, rl Å var[Yl]/var[Y1], based only on u and the frequency of the marker alleles. Let n Å var[Y1] and assume that the relative variances, r1 , . . . , rL , can be obtained for any given value of u. One can then link the L precision parameters by writing them as a function of n and the relative variances, ll Å ml/(rln). It follows that the composite likelihood can be expressed as a function of three unknown parameters, CL(u, G, n). We call this the simple CL model to differentiate it from the heterogeneity CL, which is introduced in Section 2.5. To compute the CL, haplotype and marginal allele frequencies are estimated directly from the data. We

AID

Genom 4241

/

6r1b$$$782

07-24-96 18:22:57

then maximize CL(u, G, n) over (G, n) for a grid of u values to obtain the profile composite likelihood, CL(u, ˆ u , nˆ u). The peak of the profile is the maximum CL G estimator for u. The shape of the profile likelihood helps to direct the laboratory search effort. Presumably the search would begin in the vicinity of the maximum CL and then proceed outward toward other promising regions. A FORTRAN program that executes the analyses described in this paper is available from the first author.2 Technical matter. A gamma density with parameters (a, l) has the form f(y: a, l) Å

la a01 ly y e y ú 0, G(a )

[2]

with mean m Å a/l and variance a/l2. By assumption, then, the composite likelihood has the form L

CL(u, G, n) Å ∑ l

SD

SD D

m2l ml m2l log 0 log G Vl Vl Vl /

S

m2l ml 0 1 log yl 0 yl , [3] Vl Vl

where ml Å alG, Vl Å rln / sl , and sl is the sampling variance of Yl . ˆ u , nˆ u) for each grid value of u, we use a To obtain (G conjugate gradient method, which is fairly stable in this situation. We use a grid of u values rather than a maximization algorithm because the likelihood surface is multimodal in u and we wish to examine all local maxima. 2.3. Relative Variances We express the variance of Yl as the sum of two components, the variance due to evolutionary sampling (rln) and the variance due to statistical sampling (sl). Recall that n is the evolutionary variance of Y1 and rl is the ratio of the evolutionary variance of Yl to the evolutionary variance of Y1 . We claim that it is difficult, if not impossible, to compute rln directly. We note that in a similar situation Kaplan et al. (1995) use simulations and even then must make numerous assumptions based on little empirical data. Here we employ a series of approximations to obtain an expression for this variance for a given G, where we assume G is not large. By inspection it will be clear that one cannot reasonably expect to compute this variance in practice, as it depends critically on the number of disease alleles in each generation. We believe that these parameters cannot be modeled or estimated well. However, we next show that ratios of variances 2 Please contact Bernie Devlin by e-mail at [email protected], for information on how to obtain the program.

gnma

AP: Genomics

4

DEVLIN, RISCH, AND ROEDER

due to evolutionary sampling are essentially constant throughout the generations, var[Yl(G)] var[Yl(1)] rl(G) Å É , var[Y1(G)] var[Y1(1)]

[4]

for any G of moderate size. This result is crucial because some terms cancel out of rl(1), rendering it computable for any given u. Because we do not observe p(G), but rather an estimate of that quantity, we must also account for sampling variance. Consequently, we obtain an expression that determines the variance of Yl due to evolutionary sampling up to a positive constant n. We find that the variance due to evolutionary sampling increases with the frequency of the associated marker allele and with the recombination fraction between the disease locus and marker locus. The statistical sampling variance decreases as the sample size of both normal and diseased chromosomes increases. In general, the evolutionary variance dominates the statistical variance so that even a huge sample from the population cannot remove the effect of drift on the disequilibrium measure (see also Kaplan et al., 1995). Technical matter. We start by determining the statistical sampling variance of log(1 0 dˆ ), which is a function of Y Å 0log dˆ . This function can be expressed as log(1 0 dO ) Å log

n21 n22 0 log , n/1 n/2

where nij is the number of chromosomes of type ij in the sample. This function of Y has a sampling variance of var[log(1 0 dO )] É

p11 p12 / n/1p21 n/2p22

[5]

(Walter, 1975). This variance has implications for the evolutionary process, when random sampling is assumed, because in this case each generation of evolutionary sampling is essentially equivalent to a statistical sample. With respect to evolutionary sampling, n/1 in the formula above is the number of disease chromosomes in the population. Because it is reasonable to assume that n/1 is considerably smaller than n/2 , the first term dominates the evolutionary sampling variance for each generation. In essence, the population sampling variance of log(1 0 d) after several generations is approximately equal to the variance of log(n21/ n/1) after several generations, and so we ignore the right term in [5] when calculating variances. For simplicity of exposition, let pg denote n21/n/1 at generation g and let ng denote n/1 , the number of disease alleles at generation g. A recursion formula will be useful for computing the variance of log pg , specifically

AID

Genom 4241

/

6r1b$$$783

07-24-96 18:22:57

var[log pg] Å E{var[log pgÉpg01]} / var{E[log pgÉpg01]}. This recursion is derived by conditioning on the proportion of unassociated marker alleles (i.e., ‘‘2’’ alleles) present in the previous generation, pg01 , and by using rules to obtain an unconditional variance from a conditional variance. From this recursion, the approximation var[log pg/1Épg] É (1 0 pg)/(ngpg), and some algebra, we find that for a moderate number of generations G

var[log pG] É ∑ gÅ1

H J

1 p11(g) . ng p21(g)

[6]

Finally, using the d method, we relate var[0log d] to var[log(1 0 d)] at generation G, var[0log d] É [eGu 0 1]2var[log pG] É (Gu)2var[log pG], [7] for small Gu. Notice that [6] depends on both the haplotype frequency and the number of disease haplotypes throughout the generations. Although it is possible to model the haplotype frequencies using the current values of p1/ and p2/ and estimates of u and G, it is difficult to construct reasonable estimates of ng , g Å 1, . . . , G. We conclude that it is not practical to attempt to estimate var[0log d] using [7]. However, it can be shown that the ratio of variances rl , for any G, is essentially equal to the ratio of variances computed for the first generation. Consequently, the sample size, n1 , cancels out of [6], and the relative variances can be computed even though the variances themselves cannot be. To establish [4] we require an expression for the terms in [6]. First, we find an approximation for p11/ p21 for g Å 1. Write D1 and D0 for disequilibrium at g Å 1 and g Å 0 and write p11(1) and p11(0) for p11(g) at g Å 1 and g Å 0. Then D0 Å p11(0) 0 p1/(0)p/1(0), or, assuming marginal frequencies do not change from g Å 0 to g Å 1, D0 Å p11(0) 0 p1/(1)p/1(1). Using the assumption of constant allele frequencies and assuming complete disequilibrium at g Å 0, D0 Å p/1(1) 0 p1/(1)p/1(1) Å p/1(1)p2/(1). Now, in expectation over large populations,

p11(1) Å p1/(1)p/1(1) / D1 Å p1/(1)p/1(1) / (1 0 u)D0 Å p/1(1)[p1/(1) / (1 0 u)p2/(1)]. We assume this expression for p11(1) holds. The conditional probability p11(1)/p/1(1) is, by assumption,

gnma

AP: Genomics

5

DISEQUILIBRIUM MAPPING METHODS

p1/(1) / (1 0 u)p2/(1) É p1/(1) / p2/(1)e0u.

cal sampling can be determined directly from [5] and the d method:

For p21 under the same assumptions,

S

p21(1) Å p/1(1) 1 0

D

s2l Å

p11(1) , p/1(1)

D

p21(1) p11(1) Å 10 É p2/(1) 0 e0up2/(1) p/1(1) p/1(1) Å p2/(1)(1 0 e0u). Consequently,

p11(1) p1/(1) / p2/(1)e0u É . p21(1) p2/(1)(1 0 e0u)

[8]

This simplifies to

S

D

p11(1) p1/(1) u 1 É u e /1 . e 0 1 p2/(1) p21(1)

[9]

Similar results can be obtained for each generation. Substituting into [6] yields G

var[log pG] É ∑ gÅ1

S

D

1 p11(g) gu e / 1 . [10] ng(egu 0 1) p21(g)

Given [10], one can demonstrate that the ratio [4] holds (Appendix II). We compared our estimates of rl to the observed variances obtained from our evolutionary simulations and found that the predicted relative variances were fairly accurate, although variation in allele frequency, due to drift, has a negative impact on the accuracy. To obtain an approximate expression for rl , evaluate [10] for G Å 1, use [7], and approximate eu 0 1 with u. (For G Å 1 the sample size cancels in the ratio.) It follows that the relative variances can be approximated by rl Å

alVl a1V1

, where Vl Å

p1/ al e / 1. p2/

[11]

Notice that the relative variance increases as a function of the distance to the disease allele and as a function of the frequency of the associated marker allele, just as one would expect. In our CL model, the role of the relative variances is to down-weight the less reliable disequilibrium measures. At generation G, the present generation, a sample of disease and normal chromosomes, are drawn for disequilibrium mapping. The variance due to this statisti-

AID

Genom 4241

/

6r1b$$$783

1 0 dO dO

2

D

pP 11 pP 12 / . n/1pP 21 n/2pP 22

[12]

2.4. Weighted Composite Likelihood

so

S

S DS

07-24-96 18:22:57

As we mentioned in Section 2.1, in general, composite likelihoods are not as efficient as ordinary likelihoods because the terms in the likelihood are correlated. The physical process of recombination leads to a correlation between dˆ l and dO *l when l and l* define markers on the same side of the disease. Evolutionary simulations show that these correlations are small when markers are separated by a recombination fraction of 0.003 or more and for a disease mutation that is relatively old (say G Å 100). However, for a dense set of markers, the correlations can be nonnegligible. We experimented with a modified composite likelihood function that down-weighted observations that were highly intercorrelated, but the modification had little impact on the analysis. Technical matter. Lindsay (1988) showed that one way to enhance the efficiency is by using a weighting scheme for the component scores. Let CL Å (l wlCLl define the weighted composite log likelihood. Lindsay showed that equal weights are optimal only if (k corr[Sl , Sk] equals a constant for all l, where Sl denotes the composite score, ÌCLl(u)/Ìu. For this model the correlation among scores is difficult to compute; however, Lindsay’s result suggests that it might be beneficial to down-weight terms in CL that are highly correlated with other terms. This is intuitively reasonable because these disequilibrium values are providing partially redundant information. For weights defined by wl Å {(k corr[dˆ l , dˆ k]}01, the contribution of a marker to the CL is down-weighted when this marker is part of a dense set of markers, whereas markers separated by substantial distances from other markers received almost full weight. However, we found little or no impact on the resulting CL surface when we used this scheme. From this we concluded that the correlation among terms in the CL has little impact. Consequently, for simplicity, we neglect weights in our formulation. 2.5. Heterogeneity Parameter The simple CL model assumes a single disease mutation. This assumption is unwarranted for many diseases, resulting in poor performance of the model. To see the impact of heterogeneity, imagine a set of marker loci and a single disease locus with two mutations: a more common mutation (d1) that is in complete disequilibrium with a particular allele at each marker locus, and a less common mutation (d2), again in complete disequilibrium with specific alleles at the marker loci. Note there is no requirement that the original dis-

gnma

AP: Genomics

6

DEVLIN, RISCH, AND ROEDER

TABLE 2 Haplotype Frequencies When There Is a Set of Uncommon Disease Mutations (DU ) in Addition to a Common Disease Mutation (D) in the Population (a)

(b)

Marker

D

DU

N

A1 A2

p1D p2D

p1DU p2DU

p1N p2N

p/D

p/DU

p/N

D / DU

N

p1/ p2/

p1D* p2D*

p1N p2N

p1/ p2/

1

p/D*

p/N

1

data collected to refine disease gene location, the kinds of disease mutations are unlikely to be separable, and thus Table 2b reflects the usual case. Ideally, however, we would separate the data by disease mutation, thereby eliminating the heterogeneity induced by two or more disease mutations. Disequilibrium based only on the common mutation D is

dÅ10

p/N p2D , p2N p/D

whereas the observable disequilibrium is ease haplotypes associated with d1 and d2 be the same. In fact, they would often differ, and therein lies the problem. Under our scenario, the ideal disequilibrium pattern would be a flat pattern of values (e.g., d Å 1) across markers; however, the actual pattern could be jagged. The observed pattern depends on whether the associated allele at marker l, for disease allele d2 , coincides with that for d1 . When they coincide, dl Å 1. When they are different, the exact value will depend on the frequency of d2 , as well as haplotype frequencies. Nevertheless, in all cases, d õ 1, sometimes substantially so. Assuming a single disease mutation, the mean of Yl can be expressed as a direct function of the recombination fraction between markers and the time since the disease mutation arose, specifically ml Å alG. For the heterogeneity CL model, however, we introduce a parameter a * and a factor that depends on allele frequencies of the normal chromosomes, bl , into the expression for m; thus ml Å bla * / alG, or ml Å al / alG. Terwilliger (1995) and Lynn et al. (1995) proposed the use of a heterogeneity parameter for models of similar form, but their implementation differs from the one we propose. The interpretation of the heterogeneity parameter a * is not simple. Roughly it can be thought of as the average difference, over marker loci, in the conditional probability of carrying the associated allele (‘‘1’’ allele) on chromosomes bearing the common disease mutation D versus that conditional probability for all disease mutations. To obtain the profile likelihood, one maximizes the heterogeneity CL as a function of G, n, a *, for each value of u in the grid. We note that the heterogeneity CL model is not strictly accurate, but it is a simple means by which the assumptions of the simple CL can be relaxed. We employed this model in the analysis of cystic fibrosis and diastrophic dysplasia with good results. Technical matter. Here we provide the mathematical motivation for our model of heterogeneity. For simplicity, we suppress the notation for loci, l, in this development. In Table 2a we define frequencies for haplotypes bearing the common disease mutation D, the remaining disease mutations DV , and the normal alleles N. Similarly, in Table 2b we define the frequencies for haplotypes collapsed over the kinds of disease mutations; in

AID

Genom 4241

/

6r1b$$$784

07-24-96 18:22:57

d* Å 1 0

p/N p2D˙ p2N p/D˙

To correct for the heterogeneity, we seek a factor a* such that d* É a*d and, on a log scale, a Å 0log a*. Using the expressions for d and d* above, we have

a Å log d 0 log d* É

S

D

p/N p1D p1D˙ 0 Å ba *, p2N p/D p/D˙

where a * is the term in parentheses. The approximation for the last equation is excellent when d and d* are close to one. Consider the term a *. Because we have no a priori information on how this quantity varies over loci, it is modeled as a constant. The term b, however, is simply the fraction of N chromosomes carrying marker allele 2. This fraction is estimable from the data and therefore should be modeled. Because b varies across loci, the heterogeneity parameter a (and a*) must also vary, and we model the mean ml as ml Å bla * / Gal Å al / Gal . 2.6. Search Regions Because the terms in the CL are not independent, the usual asymptotic theory of likelihood ratios does not apply; specifically, twice the log (composite) likelihood ratio is not x2 distributed. Consequently the interval defined by the set of all u’s not rejected in a level (1 0 g) likelihood ratio test r(g) Å {u: 2[CL(uO ) 0 CL(u)] õ x2(1, g)}

[13]

is not a g-level confidence interval. In general, because the terms in the likelihood are positively correlated, one should expect such intervals to be anticonservative. As an alternative to a confidence interval, we suggest the use of a related quantity, which we call the search interval. To construct a 99% search interval, plot CL(u) 0 CL(uˆ ) and select the region for which the graph is greater than 1/2x2(1, 0.99) Å 3.32. For instance, a search might initially begin in the region of the maximum CL and expand outward to the interval defined by

gnma

AP: Genomics

7

DISEQUILIBRIUM MAPPING METHODS

TABLE 3 Relative Locations of the Six Marker Loci and the Disease Gene for the Three Chromosomal Structures Structure

1

2

3

4

5

6

7

I II III

0.0 0.0 0.0

0.003 0.0001 0.002

0.006 0.0002 0.003*

0.008* 0.0003* 0.004

0.009 0.0004 0.0041

0.012 0.0005 0.0045

0.015 0.0006 0.0075

Note. The relative locations are measured by the recombination fraction from the most centromeric marker (Marker 1). The disease gene is indicated by an asterisk.

r(0.99). If no gene is located, then the search might be expanded further to the complement of the intervals r(0.99) and r(0.999), and so on until the gene is found. The degree to which the nominal confidence level corresponds to the actual probability of finding a gene is related in a fairly complex way to the density of the markers. Highly dense markers yield a great deal of redundant information that inflates the apparent confidence without actually increasing the likelihood of locating the gene. Through analyses of simulation results, we investigate this further. 3. SIMULATION RESULTS

To evaluate the performance of the CL method we conducted a simulation study. Details of the simulation are given in Devlin and Risch (1995; Appendix). In brief, each population initially consisted of 2000 chromosomes composed of 20 disease and 1980 normal chromosomes (i.e., 1000 individuals), which then grew over G Å 100 generations to a size of 100,000 chromosomes. Population expansion occurred at a constant exponential rate. The disease allele was initially in complete disequilibrium with an allele at each of six biallelic marker loci. Recombination occurred at random, as did reproduction (standard Fisher-Wright binomial sampling model). No mutation occurred. Initial marker allele frequencies were chosen at random from the interval (0.15, 0.85); all frequencies were allowed to drift over generations. From each simulated population a random sample of 100 disease and 100 normal chromosomes was drawn. Two hundred populations were generated for each of three chromosomal structures. Structure I consists of a set of relatively widely separated markers; for Structure II, the markers are dense; and Structure III consists of an irregular grid of markers (Table 3). The disease was located at 0.008, 0.0003, and 0.003 for Structures I–III, respectively. In Figs. 1–3 we illustrate the analyses for some populations in detail for each type of model, while Table 4 summarizes the results from the entire set of simulations conducted. In Fig. 1a we see the prototypical case for Structure I—a (roughly) unimodal profile, with a peaked maximum. A 99% search interval pinpoints the disease with a great deal of precision. Contrast this with Fig. 1b, in which the profile is much flatter in the vicinity of the

AID

Genom 4241

/

6r1b$$$784

07-24-96 18:22:57

disease locus. Here a 99% search interval is only slightly larger than it is for the data in Fig. 1a, but it is less clear where in the central portion of the chromosome to begin searching for the disease gene. This is to be expected since the observed values of dˆ are not as informative in this sample as they are for the sample depicted in Fig. 1a. In Fig. 1c, we observe the beneficial effect of our relative variances. Simple disequilibrium mapping places the disease at 0.015, while the CL method essentially ignores this anomalous observation, producing a clear picture of the likely location of the disease gene. The observation at 0.015 is not influential in the CL computations because it is associated with a high marker frequency (0.98), and hence it is treated as a highly variable observation. In Fig. 1d the observed pattern of disequilibrium suggests that the disease is located to the left of the markers. Although the true location of the disease gene is included in a 99% search region, the CL incorrectly points to the region left of the first marker as a potential location of the disease gene. Fortunately, such a pattern is the exception, rather than the rule: only 1% of the samples from Structure I resulted in misleading results such as those depicted in Fig. 1d. Most of the remaining simulations resulted in profiles similar to those depicted in Figs. 1a and 1c. The data from Structure II are far less informative than those from Structure I because the markers span a much smaller range (0.06 cM versus 1.5 cM). Frequently, blocks of markers are still in perfect association with the disease locus after 100 generations. Consequently the data are often nearly uninformative, at least using these methods, as reflected in the much flatter CLs. (Note that the scale in Fig. 2 has a reduced range relative to Figs. 1 and 3.) However, even with this chromosomal structure, there is some information in the data. Figure 2 illustrates four samples from Structure II. Most of the profiles are similar to Figs. 1a–1c, but we note that one must exercise caution with this type of data because the results can be anomalous, as illustrated in Fig. 1d. Overall, for Structure II, we found that 7% of the profiles are similar to those exhibited by Figs. 1d and 2d. In each case depicted in Fig. 2, most observed disequilibrium values are large, but the resulting CL varies considerably. When there is little or no variation in

gnma

AP: Genomics

8

DEVLIN, RISCH, AND ROEDER

TABLE 4 Performance of CL Method on 200 Simulated Populations Structure

Mean u

RMSE

Length0.99

Cover0.99

Length0.999

Cover0.999

Mean G

I II III

0.00808 0.00032 0.00325

0.00145 0.00012 0.00116

0.00488 0.00034 0.00335

0.87 0.88 0.84

0.00801 0.00046 0.00515

0.96 0.97 0.94

80 75 79

Note. Each analysis is based on a random sample of 100 diseased and 100 normal haplotypes. The three chromosomal structures are given in Table 3. The column headings translate as follows: Mean u is the average of the estimated recombination fraction between marker 1 and the disease locus; RMSE is the root-mean-square error of the estimates of u; Lengthg and Coverg are the length and coverage probability of a search region, measured in recombination fraction, corresponding to a x2 cutoff value of g; Mean G is the average estimated number of generations for the simulations (the true value is 100).

the disequilibrium values, marker allele frequencies determine, to a large extent, the maximum CL as well as the profile shape. In Fig. 2a an r(0.99) search region is confined to the right of the third marker. To understand how this profile was obtained from these nearly flat disequilibrium values, we examine the allele frequencies for the markers (0.35, 0.99, 0.52, 0.39, 0.72, 0.77) in conjunction with the disequilibrium values. For this population the second marker is essentially ignored due to the high allele frequency of the associated allele. Markers 4 and 5 are perfectly associated with

FIG. 1. Composite likelihood profile and disequilibrium values for four samples from chromosomal structure I (Table 3). The dˆ values are superimposed on the profile likelihood with scale depicted on the right-vertical axis; note, however, that this is not a curve fitting to the dˆ ’s. The horizontal axis is the recombination fraction, measured from the most centromeric marker. The arrow indicates the location of the disease gene. Because the composite likelihood is not a true log likelihood, one cannot select an interval with a preset level of confidence; however, an r(g) search interval can be found by comparing a 1/2x2(1, g) to the left-vertical axis. For example, a 99.9% interval encompasses the set of u with CL (left-vertical axis) greater than 5.41.

AID

Genom 4241

/

6r1b$$$784

07-24-96 18:22:57

the disease, while Markers 1, 3, and 6 have disequilibrium values of 0.993, 0.985, and 0.979, respectively. These values pull the maximum CL just to the left of Marker 4, where the actual disease is located. In Fig. 2b, the allele frequencies (0.33, 0.98, 0.90, 0.47, 0.41, 0.43) again play a crucial role. Markers 2–4 are perfectly associated with the disease, but two of the three are high variance observations. Consequently, although the CL is essentially flat across this block with a maximum between markers 3 and 4, it does not decisively rule out the region between Markers 5 and 6 as one might initially expect. In Fig. 2c we again observe a block of markers in complete association with the disease (Markers 2–5). In this instance, it is Markers 2 and 5 that are less reliable. This, in conjunction with the relatively low disequilibrium values at the endpoints, leads to a more peaked value of profile than observed in Fig. 1b. Nevertheless, the entire region is

FIG. 2. Composite likelihood profile and disequilibrium values for four samples from chromosomal structure II (Table 3). See the legend to Fig. 1 for an explanation of axes and other features.

gnma

AP: Genomics

9

DISEQUILIBRIUM MAPPING METHODS

nominal level of coverage (Table 4). Overall, however, the coverage was high and surprisingly uniform across structures. (The search intervals sizes and the span between proximal and distal markers did vary across structures, however.) Finally, we note that the estimates of G are negatively biased (Table 4). This bias occurs, in large part, because the parameter G is positively correlated with n. A minor component of the bias is due to inbreeding. 4. DATA ANALYSES

FIG. 3. Composite likelihood profile and disequilibrium values for two samples from chromosomal structure III (Table 3). See the legend to Fig. 1 for an explanation of axes and other features.

included in an r(0.99) search region. Figure 2d exhibits a breakdown of the method. This is not surprising given that the entire block of markers is in complete association with the disease. Clearly, this method of analysis cannot work in this instance. In general, there is less to be learned from CL analysis when the markers span such a narrow range. Structure III yields interesting results because the tight block of markers in the center of the region usually leads to a dip in the CL as exhibited in both Figs. 3a and 3b. This occurs because the association would have to be almost perfect for the disease to be located within this tight block. In the first of these populations the associations are (0.917, 0.992, 0.991, 1.0, 0.996, 0.012), while in the second population, the association within the block is much weaker, leading to a more dramatic crash in the profile. When markers are dense, the profile can be quite rough as it attempts to avoid placing the disease too close to any of the markers. Based on the simulation results the method provides an unbiased estimator of u that is highly informative for Structures I and III, but less so for Structure II (Table 4). The relative proportion of the span between the distal and proximal markers occupied by the r(0.99) search region is 33, 57, and 45%, for Structures I–III, respectively; for the r(0.999) search region, it is 53, 77, and 69%. As expected, neither of the two search intervals, based on x2 cutoff values of 99 and 99.9%, attained the

AID

Genom 4241

/

6r1b$$$784

07-24-96 18:22:57

We present analyses for two data sets, cystic fibrosis (CF) and diastrophic dysplasia (DTD). Because the location of the gene causing each of these diseases is known, they provide excellent exemplars of the CL method. For the CF analyses, we used the data from Kerem et al. (1989) on 23 markers in the vicinity of the CF locus (see their Table 1). Some of the physical distances between marker loci were only approximate (see their Fig. 1), and we assumed that the approximate values were sufficiently accurate (for one value listed as ú50 kb, we assumed it was 55 kb). All physical distances and assumed recombination fractions discussed for CF markers were measured from MetD, the most centromeric marker of those assessed by Kerem et al. (1989). There are multiple disease alleles that cause CF, although roughly 70% of the northern European CF chromosomes carry a single mutation, the D F508 deletion (e.g., Do¨rk et al., 1992). Hence we present the analysis using the heterogeneity CL model. An investigator would have little difficulty deciding which of the two models to use with these data, even without foreknowledge of the number of disease alleles, because the fit of the simple CL model is poor. The CL surface for a dense set of u’s suggests that the search for the disease gene should start within the gene itself, at uˆ Å 0.0081 (Fig. 4); the CF gene ranges from approximately 0.00771 to 0.00975. However, the

FIG. 4. Composite likelihood profile (heterogeneity model) and disequilibrium values for cystic fibrosis. The disease mutation is located at 0.0087. See the legend to Fig. 1 for an explanation of axes and other features.

gnma

AP: Genomics

10

DEVLIN, RISCH, AND ROEDER

D F508 mutation is located at approximately 0.00871, or about 60 kb from the maximum CL. The mutation is well within the r(0.99) search interval, which extends from 0.0064 to 0.0094. Finally, the estimate for the number of generations since the D F508 mutation arose is 96. This estimate is based primarily on northern European populations. Consequently it may be biased, influenced by when the mutation became prevalent in Northern Europeans. The odd variation of the dˆ l values (Fig. 4), with small values from markers proximate to the CF gene and larger values from surrounding markers, is almost undoubtedly due to multiple disease alleles. Specifically, atypically small d values occur when most or all of the marker alleles associated with non-D F508 CF mutations are different than the allele associated with the D F508 mutation; conversely, when the same marker allele is associated with both non-D F508 and D F508 mutations, unusually large d values occur. Even in the face of strong heterogeneity and a somewhat misleading pattern of disequilibrium, the CL method provided excellent information on where to search for the disease gene. Moreover, it outperformed simple disequilibrium mapping (Devlin and Risch, 1995) in the sense that the maximum disequilibrium value is much further from the true location than the maximum CL is. The CL method also performed better than the large-population likelihood method (Terwilliger, 1995), both in terms of the proximity of the maxima to the D F508 mutation and in terms of coverage. In fact, while the large-population likelihood’s 99.9% interval (for two degrees of freedom) clearly captures the CF gene itself, it does not apparently capture the D F508 mutation (cf. Terwilliger’s [1995] Fig. 4 and Kerem et al.’s Fig. 1). For the analysis of DTD, we used the data from Ha¨stbacka et al. (1994) on 10 markers (see their Table 1; we excluded D5S413 because the distance between it and other loci could not be estimated). In their table, Ha¨stbacka et al. collapsed multiallelic STR markers into bialleleic markers based on strong association with the disease allele and inferred recent mutations at the markers for some disease chromosomes; we adopted their approach for these data. We estimated the location of the two outermost markers based on Ha¨stbacka et al.’s estimate of the recombination fraction between those markers and their relative location on their map (see their Fig. 1); small changes in the locations of these loci had no impact on our analyses. Again we expressed assumed recombination fractions in terms of the recombination fractions from the most centromeric marker. The DTD data have been analyzed by Ha¨stbacka et al. (1994) and Kaplan and Weir (1995) under the assumption that there is a single disease mutation in the Finnish population. This assumption seems unreasonable based on Ha¨stbacka et al.’s (1994) results, which show there must be at least two disease alleles at this locus in this population. What is not clear is the frequency of the disease alleles. It is possible that one of

AID

Genom 4241

/

6r1b$$$785

07-24-96 18:22:57

FIG. 5. Composite likelihood profile (heterogeneity model) and disequilibrium values for diastrophic dysplasia. The disease gene is roughly located between 0.007 and 0.008. See the legend to Fig. 1 for an explanation of axes and other features.

the disease alleles is at very high frequency. To accommodate the two scenarios—a single disease allele (or one at very high frequency) versus two or more disease alleles—we analyze the data using the simple and heterogeneity CL models. The CL surface for the DTD data, based on either the simple or heterogeneity CL model, places the disease locus exactly at its true location. However, the results of the heterogeneity analysis (Fig. 5) are far more plausible than the results from the simple CL model (Fig. 6): both surfaces show a decline in CL distal to the DTD locus, but the decline for the simple model seems too abrupt, especially in light of our simulation results on a similar chromosomal structure (chromosomal structure 3). Other features of the CL surface also seem implausible. Thus we believe the heterogeneity model is the appropriate model. For either model, the estimate for the number of generations since the disease mutation arose is approximately 105. 5. DISCUSSION

Linkage disequilibrium between disease alleles and alleles at proximate markers yields valuable information for refining the location of a disease gene after it has been crudely localized using recombinant mapping. While the utility of this approach is beyond doubt, the theoretical foundation for disequilibrium mapping is still maturing (Jorde, 1995), and several methods have been proposed to analyze the disequilibrium pattern. Here we discuss these methods and how they compare to the CL method. 5.1. Simple Disequilibrium Mapping Perhaps the most frequently used method for refining disease gene location, after it has been localized by recombinant mapping, is simple disequilibrium mapping. This method uses only the pattern of pairwise disequilibrium values to infer the approximate location

gnma

AP: Genomics

DISEQUILIBRIUM MAPPING METHODS

FIG. 6. Composite likelihood profile (simple model) and disequilibrium values for diastrophic dysplasia. The disease gene is roughly located between 0.007 and 0.008. See the legend to Fig. 1 for an explanation of axes and other features.

of the disease locus. This descriptive tool has been explored in some detail by Devlin and Risch (1995). In that report, we discussed the fine-mapping properties of five commonly used measures of linkage disequilibrium, the correlation coefficient D (e.g., Hill and Weir, 1994), Lewontin’s D* (Lewontin, 1964), the robust formulation of the population attributable risk d (Ozelius et al., 1992a,b; Risch et al., 1991, 1995; Lehesjoki et al., 1993; Terwilliger, 1995), Yule’s Q (Kerem et al., 1989; Olson and Wijsman, 1994), and Kaplan and Weir’s (Kaplan and Weir, 1992) proportional difference d under the assumption of initial complete disequilibrium between disease and marker loci. (See Devlin and Risch, 1995, Table 1, for formulas.) Devlin and Risch (1995) evaluated the performance of these measures for simple fine mapping in three ways: by simple deterministic calculations, by algebraic analysis, and by stochastic simulations. All of these analyses suggested that d is an excellent measure for simple disequilibrium mapping. Compared with the other measures, it was less sensitive to both marker allele frequency variation and to sampling strategies, especially case-control sampling. However, the results of the stochastic simulations show that d is also sensitive to variation in marker allele frequencies due to genetic drift. This sensitivity to marker allele frequencies, especially when the marker allele frequency is large, can make it difficult to interpret the pattern of disequilibrium values. Moreover, in the statistical sense, it would be more efficient to analyze the information contributed by a set of loci simultaneously. In this way, the pattern of disequilibrium can be evaluated in light of the information contributed by each locus, which is a direct function of the variance. These difficulties are obviated by the CL method. 5.2. Moment Method The moment method, as it was described by Kaplan et al. (1995), was originally developed and applied by

AID

Genom 4241

/

6r1b$$$785

07-24-96 18:22:57

11

Ha¨stbacka et al. (1992) and Lehesjoki et al. (1993). In deriving their method for localizing disease genes, Ha¨stbacka et al. (1992) and Lehesjoki et al. (1993) based their analyses on two historical features of the Finnish population: the Finnish population was founded about 2000 years ago and it has since undergone quite rapid population growth. They begin by assuming that a single disease chromosome was introduced into a founding population of relatively small size. They then assume the population grew at a constant exponential rate. They argue that such a growth process, with superimposed random recombination, is like the growth of bacterial populations with random mutation that was analyzed by Luria and Delbru¨ck (1943). Thus they apply similar analyses to make inferences about u, the recombination fraction between the disease locus and a single marker. To estimate u, they need to assume a value for G, the number of generations since the founding of the population. To map disease genes using the Finns, they assume G Å 100 (2000 years, 20 years per generation). Ha¨stbacka et al. and Lehesjoki et al.’s assumption that a single disease chromosome was introduced into the population is the same as assuming that disease and marker loci are in complete disequilibrium at founding, as described previously. Their approach can be derived easily using the results of Appendix I or the arguments of Devlin and Risch (1995). Devlin and Risch argued that (1 0 u)G É d, implying that 0log(d) É Gu. With an estimate of G, we can estimate u by uˆ ˆ. Å 0log(dˆ )/G Kaplan et al. (1995) have pointed out that the ‘‘confidence interval’’ for disease gene location proposed by Lehesjoki et al. underestimates the interval that would be found using maximum likelihood. While this is an expected result, it obscures the deeper issue that either interval is a confidence interval only under the assumptions used to develop the model. Those assumptions are only guesses about the evolutionary history of the population, with the possible exception of the assumption of complete disequilibrium, and are therefore error prone. Thus, when one or more of the assumptions is incorrect, as will be typical, the interval has no simple statistical interpretation. As we argued earlier, such intervals should be termed search intervals to clarify their interpretation as the most reasonable intervals to begin searching for the disease locus. When multiple marker loci have been typed in the vicinity of the disease locus, the interpretation of the moment methods becomes obscure. Should one estimate the disease location for each marker and then average over these loci? If so, what are the appropriate weights? Such questions argue for a different treatment of the problem, such as the CL method. 5.3. Maximum Likelihood Kaplan et al. (1995) argue that methodology for finemapping should concentrate on estimating the mle and confidence interval for u. To arrive at a confidence inter-

gnma

AP: Genomics

12

DEVLIN, RISCH, AND ROEDER

val for u, they make a number of assumptions, many of which are the same as those used to derive the CL method (Appendix I). Using these assumptions, Kaplan et al. argue that the evolution of disease chromosomes can be modeled as a Poisson process, and any systematic changes in normal chromosome frequencies that occur due to recombination can be ignored. Although this process does not lend itself to explicit solutions, Kaplan et al. point out that the stochastic recursive equation they provide can be used to simulate populations for a critical set of parameters. Like the moment method, they assume, rather than estimate, the age of the disease mutation. When their evolutionary assumptions are correct, Kaplan et al.’s approach will provide a description of the likelihood surface, f(u). However, their simulation approach could be made more efficient by using the results in Appendix I. Specifically, they suggest evaluating the likelihood over a relatively wide grid of u. The results of Appendix I show that their uˆ should be closely approximated by uˆ e Å 0log dˆ /G. Therefore it would be more efficient, for a given G, to evaluate the likelihood from uˆ e outward. Kaplan et al. also argue that, when G is unknown, it is conservative to overestimate it. It is not clear that overestimating G is ideal. Consider uˆ e Å 0log dˆ /G. Clearly the central location of uˆ changes in G, and presumably a molecular biologist would begin to search for the disease locus near the estimated central location. Thus, overestimating G can be misleading. Granted, the ‘‘confidence interval’’ for the location of the disease locus will also be larger for larger G. However, as we discussed earlier, this interval is not typically a confidence interval in the usual sense, making the value of a larger interval per se somewhat questionable. Like simple disequilibrium mapping and the moment method, the maximum likelihood method does not generalize naturally when multiple marker loci are available, a typical situation for fine-mapping disease genes. Methods that use the information from many loci will generally be more effective at localizing disease genes. 5.4. Large Population Likelihood Terwilliger (1995) suggests a method of disequilibrium mapping that combines pairwise information across loci. He first proposes, for a single marker, a parameter l that measures the ‘‘proportion of increase in allele i in disease chromosomes, relative to its population frequency.’’ Terwilliger’s l is not new. With some algebra, it can be shown that l is simply d, at least for small disease allele frequencies and random sampling. However, there is a subtle distinction: l equals the standard population attributable risk d*, not the robust population attributable risk d, and is therefore a biased estimator when case-control sampling is employed (Appendix III).

AID

Genom 4241

/

6r1b$$$785

07-24-96 18:22:57

Terwilliger makes a significant contribution by recognizing that information on disease gene location should be integrated over the set of marker loci. To this end, he proposes an extension of his method to the situation in which multiple marker loci are examined. He suggests that his method is a likelihood, but it is not. To the contrary, although his method may yield an estimate of disease location that is near to Kaplan et al.’s mle, it underestimates the variability of disease location by not taking into account the evolutionary sampling process (Kaplan et al., 1995; Kaplan and Weir, 1995). Instead, the method assumes no drift; hence the likelihood for any data table is readily obtained from the multinomial distribution of the sampled values. However, for most populations, the variability due to evolutionary sampling dwarfs the variability due to statistical sampling. The structure of Terwilliger’s method is identical to a CL method, being a weighted sum of marginal loglikelihoods, with each marginal likelihood based on the pairwise disequilibrium for each marker locus and the disease locus. Because evolutionary sampling is not taken into account and the dependence between information obtained from marker–disease associations is ignored, the method inherently overestimates the strength of the data. Consequently, the profile likelihood is too peaked and the resulting confidence intervals cannot be accurate. 5.5. Composite Likelihood The CL method we propose is designed to evaluate the information from multiple pairs of loci simultaneously, each pair composed of the disease locus and a single marker locus, and to account for variability due to evolutionary and statistical sampling. While simpler than a model for the evolution of complete multilocus haplotypes, our results suggest that the CL method is effective in many settings at refining the location of the disease locus. Simulation analyses show that the CL ’s search interval, r(0.99), narrows the search region by about two-thirds and contains the disease locus with reasonable probability (at least 84%). A search interval based on r(0.999) generally gave coverage probability of about 95%. Analyses of real data on cystic fibrosis and diastrophic dysplasia proved very successful, with the maximum CL coinciding with the true location of the disease locus in both cases. For cystic fibrosis, the predominant disease mutation was well within the r(0.99) search interval. Cystic fibrosis and diastrophic dysplasia are diseases with simple genetic bases. It is unclear, at present, how successful the CL method will be in refining the location of disease loci underlying complex traits. Characterizing its performance for complex diseases is one of our current research goals. Moreover, it is unlikely that the evolutionary dynamics

gnma

AP: Genomics

13

DISEQUILIBRIUM MAPPING METHODS

for a group of tightly linked loci can be fully described by the marginal associations between pairs of loci, as implemented by the CL method. Thus, despite the excellent performance of the CL method, both with simulation and real data, it must be somewhat inefficient compared to a full likelihood model. We believe that the full likelihood model will be very difficult to specify without unrealistically stringent assumptions about population history, however. The CL model does have another drawback. While CL estimators are typically similar to full likelihood models in that they are consistent (the estimated parameter value converges to the true value as the sample size tends to infinity), this is not true of the CL model we present. This defect has little to do with the structure of the model. Instead the difficulty arises because of the evolutionary process. The CL method would be consistent if the population was large when the disease mutation arose and continued to be large (i.e., no drift). However, when the population has been small during some or all of that period, drift can have a large impact on allele and haplotype frequencies. Thus drift can affect the estimate of disease location, and this impact cannot be eliminated by obtaining a larger sample of haplotypes from the current generation. Hence, evolutionary sampling may cause the CL method to be inconsistent. The same is true for any disequilibrium mapping method developed to date. One way this difficulty might be rectified is by evaluating multiple populations with distinct evolutionary histories (Kaplan et al., 1995). This inconsistency has implications for molecular biologists searching for a disease locus. Specifically, the predicted location of the disease locus, based on the maximum CL , may be far from its true location even when a large number of disease and normal chromosomes have been sampled. Misleading patterns of disequilibrium values can be seen in Figs. 1d and 2d, resulting in maximum CL that are relatively far from the true location of the disease locus. Since those analyses are based on samples of 100 disease and normal chromosomes, one might incorrectly surmise that the patterns in the population as a whole are quite different. In fact, they are not. Instead, the same pattern is seen when the whole population is analyzed, and the maximum CL would therefore typically be misleading. Despite this drawback, our results suggest that the CL method will be an effective tool for refining the interval to be searched for a disease locus, especially when the original search region is relatively large (i.e., one to several megabases). In this instance, and with adequate data, the CL method will typically narrow the search region substantially. However, for relatively small regions, for example a few hundred kilobases or less, other kinds of analyses that reconstruct the evolutionary dynamics of whole chromosomes will generally be more useful.

AID

Genom 4241

/

6r1b$$$785

07-24-96 18:22:57

APPENDIX I

Here we demonstrate that 0log(d) Å Gu is a reasonable estimating equation for u, and why. We make the following assumptions: 1. Marker polymorphism precedes disease mutation. 2. One disease allele initially (i.e., complete disequilibrium). 3. Normal chromosome marker allele frequencies constant over time (p12 , p22 constant). 4. Fisher–Wright sampling scheme (binomial)—all carriers selectively equivalent. 5. Nonoverlapping generations. The same assumptions (and others) are used in the methods developed by Kaplan et al. (1995), Terwilliger (1995), and Ha¨stbacka et al. (1992). First, we obtain the deterministic expression for the frequency of the 1D haplotype at time G, which is equivalent to the expectation of its frequency under a drift process at time G. Note that

p11(G) Å p1/p/1 / DG Å p1/p/1 / (1 0 u)GD0 . Using the assumptions of complete disequilibrium, and assuming that allele frequencies remain constant over time, we have

p11(G) Å p1/p/1 / (1 0 u)G(p/1 0 p/1p1/) É p/1(p1/ / e0Gup2/). Assume that the frequencies of haplotypes bearing disease alleles change over time, but that the frequencies of haplotypes bearing ‘‘normal’’ alleles do not. This assumption would be reasonable for a large population and a recent disease mutation because the change in expected frequencies of normal haplotypes would then be small. Focusing only on disease haplotypes, then, the conditional probability of having the original haplotype, f11 , equals

p11(G) Å p1/ / e0Gup2/ p/1 and f21 equals 10

p11(G) Å p2/(1 0 e0Gu). p/1

The kernel of a ‘‘large population likelihood’’ (where drift and relatedness are irrelevant) is n

n

L* Å f1111f2121 Å (p1/ / e0Gup2/)n11(p2/(1 0 e0Gu))n21, and its log is

gnma

AP: Genomics

14

DEVLIN, RISCH, AND ROEDER

L* Å n11log(p1/ / e0Gup2/) / n21log(p2/(1 0 e0Gu)). Then, taking the derivative with respect to u, n21Gp2/e0Gu ÌL* 0n11Gp2/e0Gu Å / 0Gu Ìu p1/ / p2/e p2/(1 0 e0Gu)

APPENDIX II l , the marginal allele frequency for Assume that pj/ the l’th locus, is essentially constant over time and from [8]

hl(g) Å

0Gu

Å

0n11Gp2/e n21G . / Gu 0Gu (e 0 1) p1/ / p2/e

Setting the derivative to zero, we obtain after some algebra n21p1/ n11 0 Å e0Gu. n/1 n/1p2/

Fact. If hl(g)/h1(g) É c for all g, and if bg is a sequence of constants, then (g bghl(g) É c. (g bgh1(g) The ratio hl(g)/h1(g) equals

Substituting the marginal observed allele frequencies pˆ for p and simplifying yields

pP 11pP 22 0 pP 12pP 21 Å dO * Å e0Gu, pP /1pP 2/ so that we obtain as an estimating equation Gu Å 0log(dˆ *). d* is the population attributable risk and equals Lewontin’s D* under certain conditions (Devlin and Risch, 1995). Because d É d* whenever p/1 is small, even under case-control sampling (Devlin and Risch, 1995), we see that 0log(dˆ ) Å Gu is a reasonable estimating equation for u. Assume we have some independent estimate of G and estimate u by uˆ e Å 0log(dˆ )/G. Next, we ask how close is uˆ e to Kaplan et al.’s uˆ , which is the maximum likelihood estimate (mle) under their assumptions. At generation G we let p11(u) denote the proportion of disease alleles of maker type 1 present in the population. Because this value is unknown, the likelihood for u is an expectation over the random quantity p11 : n11

l l p1/ / p2/ exp(0gu) . l p2/[1 0 exp(0gu)]

n21

Eu[p11(u) (1 0 p11(u)) ]. Kaplan et al. obtain an estimate of this function by averaging over evolutionary simulations. To show that uˆ e É uˆ , we use a Taylor series argument. Let h(p; u) Å p11(u)n11(1 0 p11(u))n21 and expand about f11(u) Å Eu[p11(u)]:

F

G

l l p1/ / p2/ exp(0gal) p12/[1 0 exp(0ga1)] p12/a1 É l , 1 1 l p1/ / p2/exp(0gal) p2/[1 0 exp(0gal)] p2/al

provided that gal and ga1 are small. For instance, for g Å 20 and u Å 0.01, the approximation is accurate. Therefore, under these conditions, it follows from [7] and [10] that rl Å

SD

al (Gal)2 (g ng01hl(g) É 2 01 (Ga1) (g ng h1(g) a1

2

hl(1) , h1(1)

which simplifies to [11]. When g is large, a different argument depending on population size often holds. Provided ng is large, as is often the case when g is large, the contribution to the sum from ng01hl(g) is negligible. Consequently, [11] is a good approximation even under these less stringent conditions. APPENDIX III

Here we examine the behavior of Terwilliger’s l as a measure of disequilibrium under case-control sampling. We begin with some background on disequilibrium and population attributable risks. We write the unstandardized disequilibrium in a population as D Å p11p22 0 p12p21 . D is the numerator for both Levin’s robust population attributable risk (PAR) and the standard PAR; the distinction between the two lies in their denominators. The standard form is given by

Eu[h(p; u)] É h(f; u) / Eu[(p 0 f)]h*(f; u) / 12 Eu[(p 0 f)2]h9(f, u). Notice that the first term on the right-hand side is L*, the second term is zero, and E[(p 0 f)2] Å var[p] ú 0. We showed above that L*(u) is maximized at uˆ e . By differentiation one can show that h9(f, u) also achieves its maximum approximately at uˆ e . Consequently, it follows that the true mle is closely approximated by uˆ e .

AID

Genom 4241

/

6r1b$$$786

07-24-96 18:22:57

d* Å

p1/ p1/(f 0 1) , (p1É1/ 0 p1É2/) Å p/1 1 / p1/(f 0 1)

where f Å {p11/p1/}/{p21/p2/}, the relative risk. However, it can also be written as

d* Å

gnma

D p11p22 0 p12p21 Å p/1p2/ p/1p2/

AP: Genomics

DISEQUILIBRIUM MAPPING METHODS

TABLE 5 Haplotype, Marker, and Disease Frequencies for cfold Increased Sampling of Disease Haplotypes Marker

Disease allele

Normal allele

A1

cpP 11

pP 1É/2(1 0 cpP /1)

cpP 21

pP 2É/2(1 0 cpP /1)

cpP /1

1 0 cpP /1

A2

cD / pP 12 pP /2 pP 22 0 cD pP /2 1

under conditions given in Devlin and Risch (1995). Moreover, the standard PAR is identical to Terwilliger’s (1995) l, as can be shown algebraically using the expressions in that paper. Incidentally, the robust PAR, D/(p/1p22), is derived via the odds ratio approximation to the relative risk. The impact of case-control sampling on d*, and therefore l, can be derived with some algebra. First, we note the impact of c-fold oversampling of cases versus controls on allele and haplotype frequencies in Table 5. Based on these values, we define d† Å D†/(p†/1p†2/) to be the value of d* under c-fold oversampling. Then D† Å

c(1 0 cp/1) D p/2

and

p†/1p†2/ Å

c(1 0 cp/1) D. p/2

Using these expressions, and inserting d, we find

d† Å d

(1 0 p†/1) 1 0 p†/1d

Two things are obvious from the last expression. First, whenever d õ 1, d† õ d. Second, for a set of L marker loci collected to assess the same disease mutation, the ratio d†l /dl varies with l (i.e., across the marker loci). The latter induces additional variation and potential bias into the estimate of the disease location. ACKNOWLEDGMENTS This research was supported in part by NSF Grants DMS-9496219 (K.R.) and DMS-9508427 (B.D. and K.R.) and NIH Grant HG00348 (N.R. and B.D.).

REFERENCES Allamand, V., Broux, O., Richard, I., Fougerousse, F., Chiannilkulchai, N., Bourg, N., Brenguier, L., Devaud, C., Pasturaud, P., Pereira de Souza, A., et al. (1995). Preferential localization of the limbgirdle muscular dystrophy type 2A gene in the proximal part of a 1-cM 15q15.1–q15.3 interval. Am. J. Hum. Genet. 56: 1417–1430. Bengtsson, B. O., and Thomson, G. (1981). Measuring the strength

AID

Genom 4241

/

6r1b$$$786

07-24-96 18:22:57

15

of associations between HLA antigens and diseases. Tissue Antigens 18: 356–363. Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. R. Statist. Soc. B 36: 192– 236. Clegg, M. T., Kidwell, J. F., Kidwell, M. G., and Daniel, N. J. (1976). Dynamics of correlated genetic systems. I. Selection in the region of the Glued locus of Drosophila melanogaster. Genetics 83: 793– 810. Crow, J. F., and Kimura, M. (1970). ‘‘An Introduction to Population Genetics Theory,’’ Harper and Row, New York. Devlin, B., and Risch, N. (1995). A comparison of linkage disequilibrium measures for fine scale mapping. Genomics 29: 311–322. Do¨rk, T., Neumann, T., Wulbrand, U., Wulf, B., Ka¨lin, N., Maab, G., Krawczak, M., Guillermit, H., Ferec, C., Horn, G., et al. (1992). Intra- and extrangenic marker haplotypes of CFTR mutations in cystic fibrosis families. Hum. Genet. 88: 417–425. Ha¨stbacka, J., de la Chapelle, A., Kaitila, I., Sistonen, P., Weaver, A., and Lander, E. (1992). Linkage disequilibrium mapping in isolated founder populations: Diastrophic dysplasia in Finland. Nature Genet. 2: 204–211. Ha¨stbacka, J., de la Chapelle, A., Mahanti, M. M., Clines, G., ReeveDaly, M. P., Daly, M., Hamilton, B. A., Kusumi, K., Trivedi, B., Weaver, A., Coloma, A., Lovett, M., Buckler, A., Kaitila, I., and Lander, E. S. (1994). The diastrophic dysplasia gene encodes a novel sulfate transporter: Positional cloning by fine-structure linkage disequilibrium mapping. Cell 78: 1073–1087. Hill, W. G. (1974). Disequilibrium among several linked genes in finite populations. I. Mean changes in disequilibrium. Theor. Pop. Biol. 5: 366–392. Hill, W. G., and Weir, B. S. (1994). Maximum-likelihood estimation of gene location by linkage disequilibrium. Am. J. Hum. Genet. 54: 705–714. Jorde, L. B. (1995). Linkage disequilibrium as a gene mapping tool. Am. J. Hum. Genet. 56: 11–14. Kaplan, N., and Weir, B. S. (1992). Expected behavior of conditional linkage disequilibrium. Am. J. Hum. Genet. 51: 333–343. Kaplan, N., and Weir, B. S. (1995). Are moment bounds on the recombination fraction between a marker and a disease locus too good to be true? Allelic association mapping revisited for simple genetic diseases in the Finnish population. Am. J. Hum. Genet. 57: 1486– 1498. Kaplan, N., Hill, W. G., and Weir, B. S. (1995). Likelihood methods for locating disease genes in nonequilibrium populations. Am. J. Hum. Genet. 56: 18–32. Kerem, B., Rommens, J. M., Buchanan, J. A., Markiewicz, D., Cox, T. K., Chakravarti, A., Buchwald, M., and Tsui, L-C. (1989). Identification of the cystic fibrosis gene: Genetic analysis. Science 245: 1073–1080. Lehesjoki, A-E., Koskiniemi, M., Norio, R., Tirrito, S., Sistonen, P., Lander, E., and de la Chapelle, A. (1993). Localization of the EPM1 gene for progressive myoclonus epilepsy on chromosome 21: Linkage disequilibrium allows high resolution mapping. Hum. Mol. Genet. 2: 1229–1234. Lewontin, R. C. (1964). The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 49–67. Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics 80: 221–239. Luria, S. E., and Delbru¨ck, M. (1943). Mutations of bacteria from virus sensitivity to virus resistance. Genetics 28: 491–511. Lynn, A. H., Puffenberger, E. G., Kashuk, C., and Chakravarti, A. (1995). Multipoint linkage disequilibrium mapping using ancestral recombinants. Am. J. Hum. Genet. 57(3 Suppl.): A27. Olson, J. M., and Wijsman, E. M. (1994). Design and sample-size considerations in the detection of linkage disequilibrium with a disease locus. Am. J. Hum. Genet. 55: 574–580. Ozelius, L. J., Kramer, P. L., deLeon, D., Risch, N., Bressman, S. B.,

gnma

AP: Genomics

16

DEVLIN, RISCH, AND ROEDER

Schuback, D. E., Brin, M. F., Kwaitkowski, D. J., Burke, R. E., Gusella, J. F., Fahn, S., and Breakefield, X. O. (1992a). Strong allelic association between the torsion dystonia gene (DYT1) and loci in chromosome 9q34 in Ashkenazi Jews. Am. J. Hum. Genet. 50: 619–628. Ozelius, L. J., Teitz, S. S., Buckler, A., Hervitt, J., Gasser, T., deLeon, D., Kramer, P. L., Risch, N., Bressman, S. B., Housman, D., Fahn, S., Gusella, J. F., and Breakefield, X. O. (1992b). Fine mapping of the torsion dystonia gene (DYT1) on 9q34 and evaluation of a candidate gene. Am. J. Hum. Genet. 51(Suppl.): A224. Risch, N., deLeon, D., Ozelius, L. J., Kramer, P., Bressman, S., Kwiatkowski, D., Brin, M. F., et al. (1991). The genetics of torsion

AID

Genom 4241

/

6r1b$$$787

07-24-96 18:22:57

dystonia in Ashkenazi Jews. In ‘‘Molecular Genetics and Neuropsychiatric Disorders,’’ p. A6, Scientific Program and Abstracts, Israel Ministry of Science and Technology. Risch, N., deLeon, D., Ozelius, L. J., Kramer, P., Almasy, L., Singer, B., Fahn, S., Breakefield, X., and Bressman, S. (1995). Genetic analysis of idiopathic torsion dystonia in Ashkenazi Jews and the recent descent of Ashkenazim from a small founder population. Nature Genet. 9: 152–159. Terwilliger, J. D. (1995). A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic loci. Am. J. Hum. Genet. 56: 777–787. Walter, S. D. (1975). The distribution of Levin’s measure of attributable risk. Biometrika 62: 371–374.

gnma

AP: Genomics