Insights from linked single nucleotide polymorphisms: what we can learn from linkage disequilibrium


Jeffrey D Wall

New methods for analyzing sequence polymorphism data have uncovered some striking patterns of linkage disequilibrium in both humans and fruitflies. These methods have revealed examples where the observed amount of linkage disequilibrium is either much more or much less than expected, and have led to advances in our understanding of the forces that affect naturally occurring genetic variation. With the recent explosion of sequence polymorphism data, the prospects for further progress from these methods are quite promising.

Addresses
Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA; e-mail: [email protected]

Current Opinion in Genetics & Development 2001, 11:647–651
0959-437X/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved.

Abbreviations
LD: linkage disequilibrium
N: effective population size
ρ: population recombination parameter
r: recombination rate per generation
SNPs: single nucleotide polymorphisms

Introduction

With recent advances in laboratory techniques, it is now possible to look at variation at the finest scale: that of single nucleotide changes in the DNA sequence. The number of single nucleotide polymorphisms (SNPs) discovered has grown dramatically in the past year or two, primarily in humans [1••,2,3••] but also in other species [4–6]; however, our level of understanding of the forces that shape the observed patterns of SNP variation has increased much more slowly. The challenge of the next few years is to develop the theoretical and computational tools needed to make sense of genomic data.

SNP data can help us answer many questions, including determining how natural selection and demographic history affect patterns of sequence variability, estimating relevant population genetic parameters, such as the effective population size, the mutation rate or the recombination rate, or dissecting the genetic architecture of complex traits. To approach these questions, researchers tend to focus on specific aspects of the data, such as levels of variability, measures of population differentiation (such as Wright’s Fst), the distribution of frequencies of SNPs (the ‘frequency spectrum’), or patterns of linkage disequilibrium (LD) (i.e. the nonrandom association of alleles at different SNPs). Ideally one would like to use all of the information, without first summarizing it. This turns out to be computationally feasible for only the simplest of questions. For more complex situations, with alternative models and intragenic recombination, we must rely on efficient methods of summarizing the data.

In this review, I discuss how linked SNP data can help distinguish between different models of natural selection and population demography. I focus on what new insights have been gained from using LD information. Though the patterns of LD are commonly used for association studies and fine-scale mapping [7–9], they have rarely been used in population genetic studies. Recent studies that incorporate LD into the analyses have led to important insights into the forces that shape the patterns of sequence variation. I first discuss how LD can be measured in a set of linked SNPs, then present some recent applications to human and Drosophila data.

Linkage disequilibrium

Researchers have used different aspects of SNP data to try to distinguish between competing hypotheses in both human and Drosophila studies. For example, there are two main explanations for the observed correlation between levels of variation and local rates of recombination [10–12]: the ‘hitchhiking’ model and the ‘background selection’ model (see Andolfatto, this issue [pp 635–641]). These models make different predictions about levels of variation [5,13] and the frequency spectrum [14–16], and many studies have used them to try to infer which model fits the data better ([5,14,16,17•]; M Jensen, B Charlesworth, M Kreitman, unpublished data). However, sometimes multiple models make similar predictions regarding levels of variability and the frequency spectrum (e.g. see the section on ‘Human population history’, below). In such cases, other summaries of the data, such as a summary of the levels of LD, may be useful for distinguishing between alternative models [18]. Analyses involving LD serve as a natural complement to studies that focus on other aspects of the data.

One impediment to the widespread use of LD in molecular evolutionary studies is the difficulty in quantifying the associations between multiple SNPs. Historically, measures of LD (such as r² or D′) were constructed for pairs of markers (reviewed in [19,20]). However, it is difficult to interpret these pairwise measures when analyzing data consisting of many linked SNPs; traditional pairwise measures are sensitive to allelic frequencies, and, because of linkage, any two pairs of SNPs are not independent of each other.

One new method for summarizing the extent of LD among multiple SNPs is to use the data to estimate the population recombination rate, ρ = 4Nr ([1••,21,22]; J Wall, P Andolfatto, M Przeworski, unpublished data). Here, N is the effective population size, and r is the recombination rate per generation. This method has the advantage of considering an explicit population genetic model. Also, the distribution of LD expected between two linked markers (conditional on marginal allele frequencies) is completely determined by ρ (e.g. see [7,9,23••]). Hence, estimates of ρ make a natural summary of LD when there are more than two markers.
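The pairwise measures mentioned in this section can be made concrete with a short sketch. The code below uses the standard textbook definitions of D, D′ and r² for two biallelic sites; the function name, the '0'/'1' coding of phased haplotypes and the toy data are illustrative assumptions, not taken from any of the studies cited here.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D, |D'|, r^2) for biallelic sites i and j.

    haplotypes: list of equal-length strings over '0'/'1' (phased data).
    This coding is an assumption made for the illustration.
    """
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n   # frequency of allele 1 at site i
    p_b = sum(h[j] == "1" for h in haplotypes) / n   # frequency of allele 1 at site j
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    d = p_ab - p_a * p_b                             # raw disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    r2 = d * d / denom
    # D' rescales D by the largest magnitude it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r2


# Toy sample: only three of the four possible haplotypes are present,
# so D' is 1, yet r^2 is only about 1/3 because the allele frequencies differ.
d, d_prime, r2 = pairwise_ld(["11", "11", "10", "00"], 0, 1)
print(d, d_prime, r2)
```

The toy sample illustrates the sensitivity to allele frequencies noted above: the two measures give very different impressions of the same pair of sites, and neither extends naturally to more than two markers.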

Estimating ρ

There are two main approaches for estimating ρ. One method estimates r and N separately [22,24•]; r is estimated from a comparison between physical and genetic maps [25–27,28•], whereas N is estimated from levels of diversity and divergence data. The other method estimates ρ directly from the patterns of LD in sequence polymorphism data [21,23••,29–36].

The laboratory-based method has two drawbacks. First, the quality of genetic maps is limited by the number of meioses considered (a big problem for humans but less so for D. melanogaster or D. simulans). Second, recombination rate estimates are averages over large regions, so the estimate at any specific location may be inaccurate if recombination rates vary on a scale smaller than the spacing of markers in the genetic map. There is some evidence that in humans recombination rates can vary over very small distances [37,38] but very little is known about the scale of recombination rate variation in Drosophila. The difficulty with sequence-based estimates of ρ is that they tend to be unreliable unless large stretches of sequence information are available from many individuals [23••,33].

The first methods for estimating ρ from sequence polymorphism data used moment estimators or ad hoc statistics [21,29,31,32]. Though easy to calculate, these methods are biased and inaccurate [23••,33], presumably because of their inefficient use of the available information. Recently, researchers have attempted to use all of the available information by estimating the likelihoods of the full data for different values of ρ [30,34–36]. The value of ρ under which the observed data are the most likely (i.e. the maximum-likelihood estimate of ρ) is taken as a point estimate, and it is easy to use the likelihood function to obtain credibility intervals. Because of the astronomical number of different genealogical histories that are consistent with any given data set, it is not at all straightforward to calculate the relevant likelihoods.
At present, researchers use either importance sampling [30,36] or Markov chain Monte Carlo [34,35] to estimate the likelihood function for ρ. (For more on the relative merits of these methods in similar settings, see [39].) Full-likelihood methods are extremely computationally demanding and at present are only practical for small data sets [33,36]. Even with the approximately exponential rate of increase in computer processor speed, it may take substantial improvements in methodology and implementation for full-likelihood methods to be generally applicable. More troubling, as full-likelihood methods are so computationally intensive, it is difficult to test their accuracy and sampling properties adequately. Fearnhead and Donnelly [36], for example, have shown that the method of Kuhner et al. [34] cannot accurately estimate the shape of the likelihood curve even for low recombination rates (see Figure 4 in [36]).

Two recent compromise approaches employ likelihood frameworks but are still computationally feasible for large data sets. Both methods try to reduce the complexity of the data before performing likelihood calculations. One method first summarizes the data using simple statistics, then runs simulations to estimate the value of ρ under which the observed values of the statistics are the most likely [33]. Ideally, one would want to find summary statistics whose distributions are highly sensitive to the recombination rate. The number of distinct haplotypes, and the minimum number of inferred recombination events (cf. [29]), seem to work reasonably well [23••,33]; it is hoped that future studies may uncover more effective summaries.

The other method calculates likelihoods for pairs of segregating sites, then multiplies the likelihoods over all pairs to obtain a composite likelihood [23••]. This second method is not a true likelihood, as many pairs of sites are not independent of each other. So, unlike maximum likelihood, where approximate credibility intervals are easy to obtain, credibility intervals must be found by simulation. The inclusion of all pairs, however, seems to be an effective way of simultaneously incorporating information on LD from each SNP, and the method generalizes easily to situations where there are diploid genotype data and/or missing data [23••]. Both methods have been shown to be roughly unbiased (under the standard null model, described below) and superior to the previous non-likelihood methods [23••,33].

These sequence-based estimators of ρ assume a null model with no selection, no gene conversion (but see [1••], and below), no population structure and no changes in population size.
If this null model is correct, then laboratory-based and sequence-based estimates for the same region should be roughly similar. If there are systematic differences between the two, then one knows that there must be some important element missing from the null model. If instead more complex (and realistic) models of demographic history are considered, sequence-based estimators of ρ become quite biased: for example, population structure and population bottlenecks lead to underestimates, whereas population growth generally leads to overestimates (JD Wall, unpublished data). Without a priori knowledge of an appropriate demographic model, these estimators might not provide much useful information about the actual recombination rate. However, when viewed as multilocus summaries of LD, sequence-based estimates of ρ can be compared to laboratory-based estimates to provide new insights into the relative contributions of different population genetic forces to the patterns of sequence variation that are observed. Below, I present results for two model organisms from which key insights have been gained by this comparison: humans and Drosophila.
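The two summary statistics named in this section, the number of distinct haplotypes and a minimum number of inferred recombination events, are simple to compute. The sketch below is an illustration rather than the implementation used in [29] or [33]: it assumes phased, biallelic data coded as '0'/'1' strings, applies the four-gamete test to every pair of sites, and then counts the largest set of non-overlapping intervals, which gives a lower bound in the spirit of [29]. All function names are hypothetical.

```python
from itertools import combinations


def num_haplotypes(haplotypes):
    """Number of distinct haplotypes in the sample."""
    return len(set(haplotypes))


def min_recombination_events(haplotypes):
    """Four-gamete lower bound on the number of recombination events.

    Assumes at most one mutation per site, so that observing all four
    two-site gametes implies a recombination event between the sites.
    """
    n_sites = len(haplotypes[0])
    intervals = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:          # all four gametes present
            intervals.append((i, j))   # at least one event between sites i and j
    # Greedy scan by right endpoint: count the largest set of intervals that
    # do not overlap. Intervals sharing only an endpoint count as disjoint.
    intervals.sort(key=lambda ij: ij[1])
    count, last_right = 0, -1
    for left, right in intervals:
        if left >= last_right:
            count += 1
            last_right = right
    return count


sample = ["000", "011", "101", "110"]
print(num_haplotypes(sample), min_recombination_events(sample))
```

In the simulation-based approach described above, statistics like these would be computed on the observed data and on data simulated under a grid of ρ values; the simulation step itself is not shown here.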


Gene conversion

Recent analyses of human sequence data have shown that on short scales (e.g. <10 kb), sequence-based estimates of ρ are almost always higher than laboratory-based ones [1••,24•]. In other words, at the intragenic scale, there appears to be much more recombination than was expected a priori [1••,24•,40]. In contrast, long-range studies find normal or high levels of LD [41–43]. One possible explanation for this pattern is that the laboratory-based estimates consistently underestimate the true r at small scales but are more reliable at larger scales. This would happen if most meiotic recombination events are gene conversion events (without crossing over). Gene conversion is expected to be relevant at the intragenic scale, but not at longer distances [17•,44,45]. As estimates of r from pedigree data measure only the crossing-over rate, they underestimate the true rate of recombination at the intragenic scale. In support of this hypothesis, recent studies have shown that human sequence polymorphism data are more likely under a model of gene conversion and crossing over than under a crossover-only model [1••,24•]. Not much is known about gene conversion rates in mammals but they can be quite high in yeast and Drosophila [46,47]. Preliminary evidence suggests that in humans, the gene conversion rate may be much higher than the crossing-over rate [1••,48].

Gene conversion has also played a role in the analysis and interpretation of Drosophila data. In D. melanogaster and D. simulans, where laboratory-based estimates of r are much more accurate than in humans, it is known that crossing-over rates are very close to (or equal to) zero on the tip of the X chromosome and on the 4th chromosome; however, sequence data from these regions show ample evidence of intragenic recombination ([17•]; M Jensen, B Charlesworth, M Kreitman, unpublished data). This suggests that areas where crossing-over is suppressed may have normal (or even elevated) rates of gene conversion ([17•], JD Wall, unpublished data). The presence of gene conversion has implications for many models of natural selection (e.g. the balancing selection [44], hitchhiking and background selection models), as their effect depends directly on the local recombinational environment.
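Why gene conversion matters at the intragenic scale but not at longer distances can be seen from a commonly used approximation, sketched below under stated assumptions. If crossing over occurs at rate r per base pair, conversion tracts initiate at rate g per base pair, and tract lengths are exponentially distributed with mean L, then the rate of exchange between two sites d base pairs apart is approximately r·d + 2·g·L·(1 − exp(−d/L)). The parameterization, the function name and the numerical values are illustrative assumptions, not estimates from the studies cited here.

```python
import math


def effective_recombination_rate(d, r, g, tract_len):
    """Approximate exchange rate between two sites d bp apart.

    r: crossing-over rate per bp per generation
    g: gene conversion initiation rate per bp per generation
    tract_len: mean conversion tract length in bp (exponential tracts assumed)
    """
    crossover = r * d
    # A tract separates the two sites only if it covers exactly one of them;
    # this term saturates at 2 * g * tract_len once d is much larger than tract_len.
    conversion = 2.0 * g * tract_len * (1.0 - math.exp(-d / tract_len))
    return crossover + conversion


# Hypothetical values chosen only to show the qualitative pattern.
r, g, tract_len = 1e-8, 5e-8, 500
for d in (100, 1_000, 10_000, 100_000):
    ratio = effective_recombination_rate(d, r, g, tract_len) / (r * d)
    print(d, round(ratio, 2))   # ratio of total exchange to crossing over alone
```

With these made-up values the ratio is large for sites a few hundred base pairs apart and approaches 1 at 100 kb, which is the qualitative contrast between short-range and long-range LD described above.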

Human population history

Previous human SNP studies have consistently found higher levels of variation in African populations relative to non-African populations, and a skew in the frequency spectrum towards more common variants in non-African populations relative to African populations [1••,12,49]. One demographic scenario that might explain these observations is if all non-African populations have experienced a ‘bottleneck’, or temporary reduction in effective population size, in the recent past. This would fit the ‘Out of Africa’ model of modern human origins, which posits that modern humans evolved in a small region in Africa 120,000–200,000 years ago, and from there expanded and replaced existing hominid populations around the world [50]. However, there are other demographic models that can cause similar patterns. For example, an island model with unequal island sizes [51] can generate both the reduced diversity and the skew towards common alleles observed in non-African populations (JD Wall, unpublished data). To test further the plausibility of these models, we should examine their predictions for another summary of the data, such as sequence-based estimators of ρ.

The actual patterns of LD for human data are quite striking. Recent studies have shown that the rate of decay of LD with distance varies dramatically in different ethnic groups [1••,2]. In particular, the rate of decay in African populations is much quicker than in non-African populations; equivalently, the sequence-based estimate of ρ in African populations is much higher than the estimate of ρ in non-African populations [1••]. There is no reason to believe that the recombination rate r varies between populations, and the difference in ρ estimates between African and non-African populations is much greater than the difference in levels of diversity (i.e. much greater than the difference in effective population size) [1••]. In addition, these studies sequence the same individuals in many unlinked noncoding regions. Thus, the systematic pattern suggests that there are strong differences in either the selective or demographic history of different human populations.

Preliminary results suggest that both a bottleneck model and an unequal island size model can lead to an excess of LD in non-African populations, though the amount of excess tends to be much smaller than what is observed (JD Wall, unpublished data). The patterns generated by each model are quite sensitive to the parameters used, and it remains to be seen whether either model can predict the actual difference in ρ estimates, along with the observed levels of diversity and frequency spectra, for parameter values that are plausible for human history.
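The comparison between ρ estimates and diversity levels in this section rests on a short piece of reasoning that can be written out. The sketch below assumes the standard null model and uses the usual population mutation parameter θ = 4Nμ as the measure of diversity; θ, the mutation rate μ and the subscripts A (African) and nA (non-African) are notation introduced here for illustration and do not appear in the text above.

```latex
\theta = 4N\mu, \qquad \rho = 4Nr
\quad\Longrightarrow\quad
\frac{\rho_{\mathrm{A}}}{\rho_{\mathrm{nA}}}
  = \frac{4N_{\mathrm{A}}\, r}{4N_{\mathrm{nA}}\, r}
  = \frac{N_{\mathrm{A}}}{N_{\mathrm{nA}}}
  = \frac{4N_{\mathrm{A}}\,\mu}{4N_{\mathrm{nA}}\,\mu}
  = \frac{\theta_{\mathrm{A}}}{\theta_{\mathrm{nA}}}
```

If r and μ are the same in both populations, the ratio of ρ estimates should therefore match the ratio of diversity levels. The observation that the ρ ratio is much larger than the diversity ratio is what indicates that the null model is missing something, such as a difference in demographic or selective history.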

Drosophila population history

Drosophila population history is still not well understood. D. melanogaster and D. simulans are human commensals; as with humans, they are thought to have originated in Africa, and only recently spread to other continents [52,53]. The levels of diversity seem to be higher in African populations than non-African ones [54] but there does not seem to be any clear pattern with the frequency spectrum [55]. Part of the problem is that most studies have focused on either North American or European populations, whereas the ancestral African populations are understudied (especially in D. simulans).

Even with data from New World populations, however, interesting patterns emerge when multiple facets of the data are considered. A recent analysis of levels of variation on the X chromosome and chromosome 3 in a North American population of D. simulans suggested that a hitchhiking model might account for the two-fold reduction in variability observed on the X [5]. However, the conclusions change when other aspects of the data are considered. Simple hitchhiking models are expected to cause a skew in the frequency spectrum towards rare variants [14,15] but the data do not show this skew [56]. Furthermore, estimates of ρ are much lower for the X-linked loci than the autosomal loci (i.e. there is much more LD on the X than the autosomes), and preliminary results suggest that this pattern of LD is not expected under a simple hitchhiking model (J Wall, P Andolfatto, M Przeworski, unpublished data). Our results suggest instead that a recent drastic bottleneck in the demographic history of North American D. simulans might account both for the differences in levels of diversity and the differences in levels of LD between the X and the autosomes. As the initial D. simulans colonization of North America probably occurred via ship within the past several hundred years, the founder-associated bottleneck may have been severe.

A recent bottleneck might also help explain why a recent study found sequence-based estimates of ρ to be consistently smaller than laboratory-based estimates of ρ in both D. melanogaster and D. simulans [22]. Most of the samples considered in that study were non-African, so the data may be showing the signal of a recent bottleneck. In contrast, a comparison of sequence-based and laboratory-based estimates of ρ for X-linked loci in a Zimbabwe population of D. melanogaster (which is not expected to have experienced any recent, strong demographic events) found no significant difference between the two (JD Wall, unpublished data). Thus, as with humans, LD analyses have revealed the effects of recent demographic events, and have helped us clarify what models may be appropriate for describing Drosophila SNP data.

Conclusions

Studies of LD in Drosophila and humans have revealed the important role of gene conversion [1••,17•,24•], and have provided evidence for demographic differences between African and non-African populations of both taxa ([1••,2]; J Wall, P Andolfatto, M Przeworski, unpublished data). LD serves as a natural complement to other summaries of the data, such as levels of variation or the frequency spectrum. Future studies that integrate these different aspects of SNP data into the analyses will prove to be much more informative for distinguishing between alternative models of natural selection and population demography.

Acknowledgements

I thank Y Gilad and M Przeworski for helpful discussions and comments on an earlier version of this manuscript. I am supported by a National Science Foundation Postdoctoral Fellowship in Bioinformatics.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. •• Frisse L, Hudson RR, Bartoszewicz A, Wall JD, Donfack J, Di Rienzo A: Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am J Hum Genet 2001, 69:831-843.
This paper examines variation at 10 anonymous non-coding regions in three human populations. The authors introduce a novel LD analysis, and quantify differences in levels of LD between different human populations. This is the first paper to estimate gene conversion rates from sequence polymorphism data.

2. Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, Kouyoumjian R, Farhadian SF, Ward R, Lander ES: Linkage disequilibrium in the human genome. Nature 2001, 411:199-204.

3. •• Sachidanandam R, Weissman D, Schmidt SC, Kakol JM, Stein LD, Marth G, Sherry S, Mullikin JC, Mortimore BJ, Willey DL et al.: A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001, 409:928-933.
This study provides a comprehensive description of SNPs in the human genome. It describes how levels of diversity vary over different regions and chromosomes, and documents a correlation between GC content and heterozygosity.

4. Teeter K, Naeemuddin M, Gasperini R, Zimmerman E, White KP, Hoskins R, Gibson G: Haplotype dimorphism in a SNP collection from Drosophila melanogaster. J Exp Zool 2000, 288:63-75.

5. Begun DJ, Whitley P: Reduced X-linked nucleotide polymorphism in Drosophila simulans. Proc Natl Acad Sci USA 2000, 97:5960-5965.

6. Tenaillon MI, Sawkins MC, Long AD, Gaut RL, Doebley JF, Gaut BS: Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.). Proc Natl Acad Sci USA 2001, 98:9161-9166.

7. Long AD, Langley CH: The power of association studies to detect the contribution of candidate genetic loci to variation in complex traits. Genome Res 1999, 9:720-731.

8. McPeek MS, Strahs A: Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am J Hum Genet 1999, 65:858-875.

9. Pritchard JK, Przeworski M: Linkage disequilibrium in humans: models and data. Am J Hum Genet 2001, 69:1-14.

10. Begun DJ, Aquadro CF: Levels of naturally occurring DNA polymorphism correlate with recombination rates in Drosophila melanogaster. Nature 1992, 356:519-520.

11. Nachman MW, Bauer VL, Crowell SL, Aquadro CF: DNA variability and recombination rates at X-linked loci in humans. Genetics 1998, 150:1133-1141.

12. Przeworski M, Hudson RR, Di Rienzo A: Adjusting the focus on human variation. Trends Genet 2000, 16:296-302.

13. Aquadro CF, Begun DJ, Kindahl EC: Selection, recombination, and DNA polymorphism in Drosophila. In Non-neutral Evolution: Theories and Molecular Data. Edited by Golding B. New York: Chapman and Hall; 1994:46-56.

14. Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W: The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 1995, 140:783-796.

15. Simonsen KL, Churchill GA, Aquadro CF: Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 1995, 141:413-429.

16. Andolfatto P, Przeworski M: Regions of lower crossing over harbor more rare variants in African populations of Drosophila melanogaster. Genetics 2001, 158:657-665.

17. • Langley CH, Lazzaro BP, Phillips W, Heikkinen E, Braverman JM: Linkage disequilibria and the site frequency spectra in the su(s) and su(wa) regions of the Drosophila melanogaster X chromosome. Genetics 2000, 156:1837-1852.
The authors gather polymorphism data at the tip of the X chromosome in different populations of D. melanogaster. They find evidence for a surprisingly large amount of recombination, even though crossing-over rates are known to be extremely low. They infer that this must be the result of a high local rate of gene conversion.

18. Wall JD: Detecting ancient admixture in humans using sequence polymorphism data. Genetics 2000, 154:1271-1279.

19. Devlin B, Risch N: A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 1995, 29:311-322.

20. Hudson RR: Linkage disequilibrium and recombination. In Handbook of Statistical Genetics. Edited by Balding D, Bishop M, Canning C. New York: Wiley and Sons; 2001:309-324.

21. Hudson RR: Estimating the recombination parameter of a finite population model without selection. Genet Res 1987, 50:245-250.

22. Andolfatto P, Przeworski M: A genome-wide departure from the standard neutral model in natural populations of Drosophila. Genetics 2000, 156:257-268.

23. •• Hudson RR: Two-locus sampling distributions and their application. Genetics 2001, in press.
This paper examines the distribution of LD expected at a pair of SNPs, introduces a new method for estimating the population recombination parameter, and compares the sampling properties of this method with those of other estimators. The new method is especially useful because it easily generalizes to diploid or missing data.

24. • Przeworski M, Wall JD: Why is there so little intragenic linkage disequilibrium in humans? Genet Res Camb 2001, 77:143-151.
The authors show that sequence-based estimates of the population recombination parameter for human polymorphism data sets are consistently larger than a priori expectations (i.e. there is less LD than expected at short scales). They are the first researchers to suggest that gene conversion might be an important factor in explaining the patterns of LD in humans.

25. Ashburner M: Drosophila: a Laboratory Handbook. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 1989.

26. True JR, Mercer JM, Laurie CC: Differences in crossover frequency and distribution among three sibling species of Drosophila. Genetics 1996, 142:507-523.

27. Payseur BA, Nachman MW: Microsatellite variation and recombination rate in the human genome. Genetics 2000, 156:1285-1298.

28. • Yu A, Zhao C, Fan Y, Jang W, Mungall AJ, Deloukas P, Olsen A, Doggett NA, Ghebranious N, Broman KW, Weber JL: Comparison of human genetic and sequence-based physical maps. Nature 2001, 409:951-953.
The authors make a detailed comparison of human genetic and physical maps, using the draft human genomic sequence. They find a large amount of variation in the estimated local recombination rate, which suggests that recombination rates might vary unpredictably over short scales in humans.

29. Hudson RR, Kaplan NL: Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 1985, 111:147-164.

30. Griffiths RC, Marjoram P: Ancestral inference from samples of DNA sequences with recombination. J Comp Biol 1996, 3:479-502.

31. Hey J, Wakeley J: A coalescent estimator of the population recombination rate. Genetics 1997, 145:833-846.

32. Wakeley J: Using the variance of pairwise differences to estimate the recombination rate. Genet Res 1997, 69:45-48.

33. Wall JD: A comparison of estimators of the population recombination rate. Mol Biol Evol 2000, 17:156-163.

34. Kuhner MK, Yamato J, Felsenstein J: Maximum likelihood estimation of recombination rates from population data. Genetics 2000, 156:1393-1401.

35. Nielsen R: Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 2000, 154:931-942.

36. Fearnhead P, Donnelly P: Estimating recombination rates from population genetic data. Genetics 2001, in press.

37. Fullerton SM, Harding RM, Boyce AJ, Clegg JB: Molecular and population genetic analysis of allelic sequence diversity at the human beta-globin locus. Proc Natl Acad Sci USA 1994, 91:1805-1809.

38. Badge RM, Yardley J, Jeffreys AJ, Armour JAL: Crossover breakpoint mapping identifies a subtelomeric hotspot for male meiotic recombination. Hum Mol Genet 2000, 9:1239-1244.

39. Stephens M, Donnelly P: Inference in molecular population genetics. J R Stat Soc B 2000, 62:605-635.

40. Ardlie K, Liu-Cordero SN, Eberle MA, Daly M, Barrett J, Winchester E, Lander ES, Kruglyak L: Lower-than-expected linkage disequilibrium between tightly linked markers in humans suggests a role for gene conversion. Am J Hum Genet 2001, 69:582-589.

41. Taillon-Miller P, Bauer-Sardina I, Saccone NL, Putzel J, Laitinen T, Cao A, Kere J, Pilia G, Rice JP, Kwok PY: Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28. Nat Genet 2000, 25:324-328.

42. Abecasis GR, Noguchi E, Heinzmann A, Traherne JA, Bhattacharyya S, Leaves NI, Anderson GG, Zhang Y, Lench NJ, Carey A et al.: Extent and distribution of linkage disequilibrium in three genomic regions. Am J Hum Genet 2001, 68:191-197.

43. Dunning AM, Durocher F, Healey CS, Teare MD, McBride SE, Carlomagno F, Xu CF, Dawson E, Rhodes S, Ueda S et al.: The extent of linkage disequilibrium in four populations with distinct demographic histories. Am J Hum Genet 2000, 67:1544-1554.

44. Andolfatto P, Nordborg M: The effect of gene conversion on intralocus associations. Genetics 1998, 148:1397-1399.

45. Wiuf C, Hein J: The coalescent with gene conversion. Genetics 2000, 155:451-462.

46. Hilliker AJ, Harauz G, Reaume AG, Gray M, Clark SH, Chovnick A: Meiotic gene conversion tract length distribution within the rosy locus of Drosophila melanogaster. Genetics 1994, 137:1019-1026.

47. Paques F, Haber JE: Multiple pathways of recombination induced by double-strand breaks in Saccharomyces cerevisiae. Microbiol Mol Biol Rev 1999, 63:349-404.

48. Zangenberg G, Huang MM, Arnheim N, Erlich H: New HLA-DPB1 alleles generated by interallelic gene conversion detected by analysis of sperm. Nat Genet 1995, 10:407-414.

49. Wall JD, Przeworski M: When did the human population size start increasing? Genetics 2000, 155:1865-1874.

50. Stringer CB, Andrews P: Genetic and fossil evidence for the origin of modern humans. Science 1988, 239:1263-1268.

51. Relethford JH, Harpending HC: Ancient differences in population size can mimic a recent African origin of modern humans. Curr Anthro 1995, 36:667-674.

52. David JR, Capy P: Genetic variation of Drosophila melanogaster natural populations. Trends Genet 1988, 4:106-111.

53. Lachaise D, Cariou LM, David JR, Lemeunier F, Tsacas L, Ashburner M: Historical biogeography of the Drosophila melanogaster species subgroup. Evol Biol 1988, 22:159-225.

54. Andolfatto P: Contrasting patterns of X-linked and autosomal nucleotide variation in Drosophila melanogaster and Drosophila simulans. Mol Biol Evol 2001, 18:279-290.

55. Przeworski M, Wall JD, Andolfatto P: Recombination and the frequency spectrum in Drosophila melanogaster and Drosophila simulans. Mol Biol Evol 2001, 18:291-298.

56. Begun DJ: The frequency distribution of nucleotide variation in Drosophila simulans. Mol Biol Evol 2001, 18:1343-1352.