Population Structure and Gene Flow
JP Wares, University of Georgia, Athens, GA, USA
© 2016 Elsevier Inc. All rights reserved.
Glossary
Anadromous Describes fish species whose eggs hatch in fresh water, after which the fish migrate to the sea to feed and grow to reproductive maturity.
Philopatry The process of an adult returning to the region or habitat of its birth for mating and reproduction.
Encyclopedia of Evolutionary Biology, Volume 3
doi:10.1016/B978-0-12-800049-6.00035-4
A frequently used term in biology is ‘structure’ – the organization and relation of parts. It may refer to communities or to morphology; in evolutionary biology, it often refers to population structure. This type of structure is invoked when individuals of a study species comprise more than one population, each defined by the diversity within it. A population evolves primarily through the reproductive success of its members and (for most eukaryotic organisms) through whom they mate with. When, for whatever reason, some groups of individuals are less likely to mate with other groups, we set them apart because these distinct populations, harboring distinct diversity, may be responding to different environments (Slatkin, 1987).
The primary information about diversity in a population is the number of genetic variants, or alleles, from a region of the genome that are found in a sample of individuals. First, we may simply count the number of types sampled and assess the frequency of each; a simple frame of reference comes from the Hardy–Weinberg (HW) model. This model is generally used as a null hypothesis in evolutionary studies, so that rejecting it leads us to ask what mechanism(s) of evolution (change in genetic diversity) may be acting on that population (Hartl, 2000). The test compares the frequencies of alleles with the frequencies of the diploid genotypes composed of these alleles, assuming independent assortment and no evolutionary mechanisms acting. Of course, we expect the diversity of almost any population to be influenced by immigration, finite population size, mutation, selection, or nonrandom mating. Only when we reject this simple model can we begin to identify which are acting, and whether they lead to identifiable structure.
So, the next consideration is space: where were the samples taken from? As a set of genetic samples includes more locations, we can ask whether the allele and genotype frequencies in the sample from one location are statistically consistent with those at another location. If not, these frequencies appear to have diverged across locations, and a first question to ask is whether this is the result of nonrandom mating. Here this simply means that individuals at one location are more likely to have mated with another individual from the same location than with an individual from the other location. This is one way to define population structure – that the sampled individuals from distinct locations are not often mating with one another. There may be reasons why individuals at the same location mate nonrandomly as well (Figure 1).
Figure 1 Visualizing population structure. There are two types of population genetic structure in pink salmon (Oncorhynchus gorbuscha). Branching of a highly simplified genealogy identifies relatedness of individuals from population samples. Regional samples (represented as distinct shapes) exhibit high within-sample relatedness because of philopatry, and could help identify adaptation to distinct environments. However, the same spatial samples are split temporally because of the 2-year maturation cycle of this anadromous fish (Beacham et al., 2012): the ‘odd’ years (pink) have evolved distinct allele and genotype frequencies from the ‘even’ years (blue), and have different patterns of abundance and demography. Thus, structure results from both spatial limits on gene flow and the temporal effect of alternating maturation cycles.
This nonrandom mating is a form of inbreeding, as it involves reproduction between individuals that tend to be somewhat more closely related to one another. This in turn leads to an excess of homozygous genotypes relative to the HW expectation for the pooled spatial samples, and provides a means of statistically testing for population structure. Some of these statistical tests are detailed below.
It might also seem as though the DNA sequence divergence of particular alleles would tell us about population structure; two individuals that carry mitochondrial haplotypes that differ at 5 out of every 100 nucleotides seem more likely to be from different populations, in an evolutionary sense, than two whose haplotypes are nearly identical. However, even alleles that are quite divergent may be found in the same population if individuals still mate with one another. Famously, alleles at the major histocompatibility complex in mammals may be much older than the species they are found in (Hughes and Nei, 1992). Any single locus can have an unusual history or dynamics that lead it to carry ancient diversity. To understand this diversity we require the context of other loci, to see if
individuals are clearly distinct at multiple loci, suggesting again that they are not mating with one another.
The cause of reproductive isolation may be extrinsic – a barrier between populations – or intrinsic, when the populations do not recognize one another as good mates or when there are fitness consequences for crosses between the populations (Bewick and Dyer, 2014). Leaving the biological, intrinsic mechanisms alone for now, we can see that an important way in which populations form is through simple limits on individuals encountering one another. Organisms that move very little in their lifetime would be expected to encounter mates only locally; organisms that are highly mobile during at least some stage of their life expand the spatial range of possible mates (Slatkin, 1987). If an individual moves from one area to another and successfully mates, we now have gene flow – literally, the movement of alleles from one location to another. If this happens often, the locations are likely part of the same population (we cannot statistically differentiate the individuals from different locations); if it is rare, we begin to posit ‘structure.’
It is important to note that this is a quantitative and continuous designation. It may seem odd, but rejecting a hypothesis of a single population does not necessarily mean there are two, or more, populations. A phenomenon known as ‘isolation by distance’ refers to situations in which gene flow is sufficient that neighboring locations appear consistent with one population, but as the domain is expanded to include more geographically or ecologically distinct samples, the probability of sufficient dispersal and gene flow diminishes, and allele frequencies, genotype frequencies, or patterns of substitutions become consistent with populations that are demographically isolated from one another (Wright, 1943).
So, the number of populations inferred may depend on the scale and sampling for the analysis.
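As a rough illustration of the Hardy–Weinberg null model described in this introduction, and of the homozygote excess that appears when samples from differentiated locations are pooled, the following Python sketch may help. The genotype counts are hypothetical and were chosen only so that each location fits HW proportions on its own; the function name `hw_summary` and the restriction to a single biallelic locus are assumptions made for the example, not part of any cited analysis.

```python
def hw_summary(n_AA, n_Aa, n_aa):
    """Compare observed genotype counts at one biallelic locus with HW expectations."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # frequency of allele A
    q = 1 - p                            # frequency of allele a
    expected = (n * p * p, n * 2 * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    # Chi-square goodness of fit (1 degree of freedom for two alleles)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    h_obs = n_Aa / n                     # observed heterozygosity
    h_exp = 2 * p * q                    # heterozygosity expected under HW
    f = 1 - h_obs / h_exp if h_exp > 0 else 0.0   # positive F = homozygote excess
    return {"p": p, "chi2": chi2, "F": f}


# Hypothetical counts: two locations of 100 diploid individuals each.
location_1 = hw_summary(81, 18, 1)       # allele A frequent
location_2 = hw_summary(1, 18, 81)       # allele A rare
pooled = hw_summary(82, 36, 82)          # both locations treated as one population

for name, result in [("location 1", location_1),
                     ("location 2", location_2),
                     ("pooled", pooled)]:
    print(name, result)
```

With these invented counts each location fits HW expectations on its own, whereas the pooled sample shows far fewer heterozygotes than expected (F of about 0.64, and a chi-square value well above the 3.84 threshold for 1 degree of freedom at the 5% level). This is the sense in which rejecting the HW model for a pooled sample can point toward structure.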
Inference of Structure
Analytically, our understanding of structure follows the logic identified above. There are many ways to approach this problem. Here we focus first on the diversity within and among sample locations – in other words, analyses in which the location of the sampled individual influences the result – and then on approaches that assess the fit of the data to a series of models of population structure to determine the most probable description of structure given the data. These latter approaches do not require spatial information, and can help identify patterns arising through other mechanisms.
As noted above, when individuals are sampled from multiple locations, we can measure the genetic diversity both within individual locations and across all locations combined. Traditionally, the complete data set is considered the ‘total’ diversity, and the diversity at individual ‘sites’ is evaluated for how much of the total diversity it contains. By symbolizing the site-level diversity as S and the total diversity as T, we can evaluate a family of statistics that quantify structure as XST, where S and T are as defined and X is a symbol that varies depending on assumptions about the data. Though this may sound complicated, our most general model for quantifying population structure can be expressed in simple terms:
$$X_{ST} = \frac{(\text{total diversity}) - (\text{mean within-site diversity})}{(\text{total diversity})}$$
where diversity may be measured as heterozygosity or related measures of genetic diversity (where X is replaced by F or G), variance in microsatellite allele size (where we assume the stepwise mutation model, and use RST), nucleotide diversity (which may be evaluated using ΦST), and other ‘flavors’ of this approach (Excoffier et al., 1992; Hartl and Clark, 1997). Typically, researchers will refer to FST, which was the first common ‘fixation index’ for such studies and can be estimated using a variety of computational approaches (Whitlock, 2011). Often these indices are calculated in a hierarchical ‘analysis of molecular variance,’ mirroring a standard ANOVA statistical framework.
If there is no structure – no correlation of allele frequencies by site, no increase in genotypic similarity by site – then there is as much diversity at the ‘site’ level as at the ‘total’ level and the statistic XST is close to zero. At the other extreme, where every site sampled harbors low diversity but its alleles are distinct from those at all other sites (high divergence among regional groups of alleles), the statistic approaches 1 (Figure 2).
In recent years, effort has gone toward recognizing the limits of this class of statistics, which are constrained by how much variation is found at the site level; this reduces the comparability of these indices among taxa. Improved (but slightly more complicated) statistics may be appropriate for some considerations, and may greatly improve our capacity to explore structure at multiple hierarchies of biodiversity (Jost, 2008; Whitlock, 2011; Smouse et al., 2015). In almost all cases, this family of statistics is assessed for statistical significance through permutation testing, where genotypes are randomly assigned to a spatial site to generate a null distribution of divergence.
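The general XST ratio and the permutation test described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: diversity is measured as expected heterozygosity at a single locus, no sample-size correction is applied, and the function names (`expected_het`, `fst`, `permutation_p`) and the genotype data are hypothetical. It is not the estimator used in any of the cited studies or software packages.

```python
import random
from collections import Counter


def expected_het(alleles):
    """Expected heterozygosity (gene diversity): 1 minus the sum of squared allele frequencies."""
    n = len(alleles)
    return 1.0 - sum((c / n) ** 2 for c in Counter(alleles).values())


def fst(sites):
    """(total diversity - mean within-site diversity) / total diversity.

    `sites` is a list of sites; each site is a list of diploid genotypes (a1, a2).
    """
    pooled = [a for site in sites for g in site for a in g]
    h_t = expected_het(pooled)
    h_s = sum(expected_het([a for g in site for a in g]) for site in sites) / len(sites)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


def permutation_p(sites, n_perm=999, seed=0):
    """Null distribution obtained by randomly reassigning whole genotypes to sites."""
    rng = random.Random(seed)
    observed = fst(sites)
    genotypes = [g for site in sites for g in site]
    sizes = [len(site) for site in sites]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(genotypes)
        shuffled, start = [], 0
        for size in sizes:                  # keep the original sample size per site
            shuffled.append(genotypes[start:start + size])
            start += size
        if fst(shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: two sites, 10 diploid individuals each, one biallelic locus.
site_a = [("A", "A")] * 8 + [("A", "B")] * 2
site_b = [("B", "B")] * 8 + [("A", "B")] * 2
print(permutation_p([site_a, site_b]))
```

Shuffling genotypes rather than individual alleles keeps each individual intact, which matches the description of permutation testing given in the text. With equal sample sizes the pooled allele frequencies equal the mean of the site frequencies; unequal sampling would call for a weighted estimator.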
The other way in which population structure is often assessed is by explicitly finding the best fit of genotype data from individuals to inferred populations, where it is assumed that a true population will have minimal deviations from Hardy–Weinberg expectations for allele and genotype frequencies. As discussed above, to the extent that two locations are connected by immigrants, their allele and genotype frequencies will not diverge from one another; if gene flow is limited, then the stochastic process of variation in reproductive success (drift) causes these frequencies to diverge among locations. When there are populations with heterogeneous allele and genotype frequencies, whether or not the locations of the samples are considered, the overall fit to Hardy–Weinberg expectations will be poor.
A way to identify populations, and which individuals belong to each, is implemented in programs like structure (Pritchard et al., 2000). A type of ‘clustering analysis,’ this approach identifies how well data fit Hardy–Weinberg expectations when 1, 2, 3, or more evolutionary populations are assumed. In each case, the likelihood comes from the fit (lack of deviation) of genotype data in each cluster to HW expectations, and thus the number of populations can be inferred in some circumstances (Figure 3). Other clustering approaches (Gao et al., 2007) may be used when this
underlying genetic model is likely to be violated (e.g., in selfing plants where assumptions of HW are known to be false).
[Figure 2 image: (a) map of sampling sites along the South-East Pacific coast, approximately 20°S to 40°S (site labels include Ari, Chi, Pca, Cad, Pal, Pan, Que); (b) phylogram showing the Northern, Center, and Southern clades and the outgroup Excirolana hirsuticauda, scale bar 0.1 substitutions/site; (c) haplotype network.]
Figure 2 Illustrating hierarchical population structure. Mitochondrial haplotypes were sequenced from the isopod Excirolana braziliensis (Varela and Haye, 2012), from sites indicated in part (a). In (b), a phylogram illustrates the relationships among these haplotypes; the genetic relationship between particular haplotypes as well as the locations at which they were collected are then summarized in (c). From this network it can clearly be seen that the total amount of sequence variation in Excirolana is large (the numbers along the haplotype network indicate the number of nucleotide substitutions between clades), and at particular sites (or within sub-regions) the variation is less (e.g., only 7 distinct haplotypes in the Northern Clade). For sequence data, it is typical to calculate ΦST which uses the sum of squared differences among DNA sequences as a metric within an ANOVA framework. ΦST values among locations in the Center Clade, for example, range from 0.158 to 0.444, while values between sites in the Center Clade and other regions consistently present ΦST greater than 0.95.
Population   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2   2
Allele 1    96  96  94  95  95  96  94  95  96  96  95  93  96  96  96  94  94  93  96  91
Allele 2    93  96  94  95  96  93  96  95  94  96  95  97  96  95  96  93  93  94  93  96
Figure 3 Clustering analyses. A genealogy illustrates the temporal relationship among alleles in two sites with limited gene flow. Simulated diploid microsatellite genotypes at a single locus present data that, when a single population is assumed, exhibit homozygote excess relative to the allele frequencies and the relationships of the HW model. When two populations (blue and green) are assumed, the fit of genotypes to HW is much better. This is a model-fitting approach to inference of population structure.
Of course, while it is known that migration and gene flow are the forces that will homogenize allele and genotype frequencies among locations, it is important to recognize that migration is not always (or often) omni-directional and symmetric among sites. From fundamental papers in population genetics to more contemporary efforts, we know that in these instances the upstream, or source, population diversity will end up driving the diversity in the recipient sites (Nagylaki, 1978; Wares and Pringle, 2008). This has numerous
effects on the maintenance of diversity throughout the domain of a species (Figure 4). The most direct way to identify source–sink demographic systems, where the composition of the ‘sink’ populations may be entirely reflective of the composition of the ‘source,’ is to use multi-locus genetic data to identify parentage of recruits by site (Peery et al., 2008).
[Figure 4 image: two panels, (a) Ladv = 0 and (b) Ladv = 4; x-axis: Distance; y-axis: Generations.]
Figure 4 Effects of asymmetric gene flow. When dispersal is biased by wind, currents, or other environmental features, the genetic diversity of a system may be homogenized with the diversity of the ‘upstream’ region (source region) dominating the entire system. In both panel (a) and panel (b), a simulation model is carried out where the domain is initialized with haploid individuals and 5 distinct alleles, each geographically isolated to 1/5 of the domain. The model is run for 400 generations. In (a), there is no advection (asymmetric gene flow) and diffusion is the primary mechanism of dispersal. In (b), advection is approximately 40% as strong as random diffusion of offspring, and the offspring preferentially disperse toward the right. The upstream allele (indicated in blue) quickly dominates the entire domain. Reproduced from Wares, J.P. and Pringle, J.M., 2008. Drift by drift: Effective population size is limited by advection. BMC Evolutionary Biology, 8, 235.
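The qualitative behavior described for asymmetric gene flow can be explored with a toy simulation. The Python sketch below is not the model of Wares and Pringle (2008); it is a simplified one-dimensional lottery model written for illustration. The domain size (500 sites, one haploid individual per site), the Gaussian dispersal kernel, the reading of "approximately 40%" as a mean displacement of 4 against a standard deviation of 10, and the names `simulate`, `l_adv`, and `l_diff` are all assumptions.

```python
import random


def simulate(n_sites=500, n_alleles=5, generations=400,
             l_adv=0.0, l_diff=10.0, seed=0):
    """Return final allele frequencies on a 1-D domain of haploid individuals.

    Each generation, every site is filled by an offspring whose parent sits
    `disp` sites to the left, where disp ~ Normal(l_adv, l_diff). A positive
    l_adv therefore carries offspring preferentially toward the right.
    """
    rng = random.Random(seed)
    block = n_sites // n_alleles
    # Each allele starts isolated in one contiguous fifth of the domain.
    alleles = [min(i // block, n_alleles - 1) for i in range(n_sites)]
    for _ in range(generations):
        new_alleles = []
        for site in range(n_sites):
            disp = round(rng.gauss(l_adv, l_diff))
            # Parents cannot lie outside the domain, so clamp to the edges.
            parent = min(max(site - disp, 0), n_sites - 1)
            new_alleles.append(alleles[parent])
        alleles = new_alleles
    return [alleles.count(a) / n_sites for a in range(n_alleles)]


if __name__ == "__main__":
    print("no advection:  ", simulate(l_adv=0.0, seed=1))
    print("with advection:", simulate(l_adv=4.0, seed=1))
```

Allele 0 is the upstream (leftmost) allele in this sketch. Under these assumptions it would be expected to take over most or all of the domain when `l_adv` is 4, while the run without advection retains a mix of alleles whose exact frequencies vary with the random seed.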
The Context of Population Structure
At some level the structure of a ‘species’ into ‘populations’ may seem academic, but ‘species’ are not responding to environmental change – the populations that compose them are. Repeatedly, researchers have found that individuals sampled from distinct sites – and ultimately shown to be in distinct populations – have distinct environmental tolerances, reaction norms, and levels of additive genetic diversity allowing response to change (Sanford and Kelly, 2011; Crozier and Hutchings, 2014; Evans et al., 2015). Understanding population structure gives us great insight into the frequency and mechanisms of movement between sites, as well as the variation in the environment – some of it abiotic, some biotic – that maintains population structure. Much of what we know of trait variation – even the trait of reproductive isolation – is not necessarily at the level of species; instead, the values of traits may differ by geographic location and the genetic composition of those populations (Cutter, 2012; Mandeville et al., 2015). The extent to which even subtle traits – coloration and other quantitative measures – vary among populations could be driven by local adaptation at associated genes, but is often undirected change associated with genetic structure or population structure of the focal organism instead. The context of structure is thus necessary for interpreting how evolution is changing natural patterns of diversity. Population genetics was founded on a statistical necessity; it is now necessary for exploring diversity at finer scales than can be named.
See also: Genetic Variation in Populations. Hardy−Weinberg Equilibrium and Random Mating. Inbreeding and Nonrandom Mating. Reproductive Isolation, Postzygotic. Reproductive Isolation, Prezygotic. Ring Species
References
Beacham, T.D., McIntosh, B., MacConnachie, C., Spilsted, B., White, B.A., 2012. Population structure of pink salmon (Oncorhynchus gorbuscha) in British Columbia and Washington, determined with microsatellites. Fishery Bulletin 110, 242–256.
Bewick, E.R., Dyer, K.A., 2014. Reinforcement shapes clines in female mate discrimination in Drosophila subquinaria. Evolution 68, 3082–3094.
Crozier, L.G., Hutchings, J.A., 2014. Plastic and evolutionary responses to climate change in fish. Evolutionary Applications 7, 68–87.
Cutter, A.D., 2012. The polymorphic prelude to Bateson−Dobzhansky−Muller incompatibilities. Trends in Ecology and Evolution 27, 209–218.
Evans, T.G., Padilla-Gamino, J.L., Kelly, M.W., et al., 2015. Ocean acidification research in the 'post-genomic' era: Roadmaps from the purple sea urchin Strongylocentrotus purpuratus. Comparative Biochemistry and Physiology. Part A, Molecular & Integrative Physiology 185, 33–42.
Excoffier, L., Smouse, P.E., Quattro, J.M., 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131, 479–491.
Gao, H., Williamson, S., Bustamante, C.D., 2007. A Markov Chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data. Genetics 176, 1635–1651.
Hartl, D.L., 2000. A Primer of Population Genetics. Sunderland, MA: Sinauer Associates.
Hartl, D.L., Clark, A.G., 1997. Principles of Population Genetics. Sunderland, MA: Sinauer.
Hughes, A.L., Nei, M., 1992. Maintenance of MHC polymorphism. Nature 355, 402–403.
Jost, L., 2008. G(ST) and its relatives do not measure differentiation. Molecular Ecology 17, 4015–4026.
Mandeville, E.G., Parchman, T.L., McDonald, D.B., Buerkle, C.A., 2015. Highly variable reproductive isolation among pairs of Catostomus species. Molecular Ecology 24, 1856–1872.
Nagylaki, T., 1978. Clines with asymmetric migration. Genetics 88, 813–827.
Peery, M.Z., Beissinger, S.R., House, R.F., et al., 2008. Characterizing source-sink dynamics with genetic parentage assignments. Ecology 89, 2746–2759.
Pritchard, J.K., Stephens, M., Donnelly, P., 2000. Inference of population structure using multilocus genotype data. Genetics 155, 945–959.
Sanford, E., Kelly, M.W., 2011. Local Adaptation in Marine Invertebrates. Annual Review of Marine Science 3 (3), 509–535.
Slatkin, M., 1987. Gene flow and the geographic structure of natural populations. Science 236, 787–792.
Smouse, P.E., Whitehead, M.R., Peakall, R., 2015. An informational diversity framework, illustrated with sexually deceptive orchids in early stages of speciation. Molecular Ecology Resources 15, 1375–1384.
Varela, A.I., Haye, P.A., 2012. The marine brooder Excirolana braziliensis (Crustacea: Isopoda) is also a complex of cryptic species on the coast of Chile. Revista Chilena De Historia Natural 85, 495–502.
Wares, J.P., Pringle, J.M., 2008. Drift by drift: Effective population size is limited by advection. BMC Evolutionary Biology 8, 235.
Whitlock, M.C., 2011. G'ST and D do not replace FST. Molecular Ecology 20, 1083–1091.
Wright, S., 1943. Isolation by distance. Genetics 28, 139–156.