Forensic Science International: Genetics 39 (2019) 57–60
The natural selection that shapes our genomes
Antonio Salas Unidade de Xenética, Instituto de Ciencias Forenses (INCIFOR), Facultade de Medicina, Universidade de Santiago de Compostela, and GenPoB Research Group, Instituto de Investigaciones Sanitarias (IDIS), Hospital Clínico Universitario de Santiago (SERGAS), Galicia, Spain
ARTICLE INFO
Keywords: Natural selection; Neutral evolution; Darwin; Kimura; Biogeographical ancestry; Genomics; 1000 genomes
ABSTRACT
Most of the variation in the human genome (∼95%) is constrained, directly or indirectly, by purifying selection and GC-biased gene conversion, according to a recent article by Pouyet et al. (2018). The use of ‘non-neutral’ variation to infer human demographies can lead to undesirable biases; for example, in estimation of the time of the most recent common ancestor. Further examination of ‘neutral’ variation in entire human genomes from The 1000 Genomes Project reveals that ∼99% of this variation lacks exonic function, but ∼35% of it falls in introns. In addition, estimates of biogeographical ancestry using ‘non-neutral’ SNPs differ very marginally from inferences obtained from ‘neutral’ variation. Additional investigations should be carried out before establishing the roadmap for future human population and forensic genetic studies.
E-mail address: [email protected].
https://doi.org/10.1016/j.fsigen.2018.12.003
Received 30 November 2018; Accepted 13 December 2018; Available online 14 December 2018
1872-4973/ © 2018 Elsevier B.V. All rights reserved.
1. The arrival of the genome era and the study of natural selection
The term ‘genomics’ was coined by the geneticist Tom Roderick in the mid-1980s. However, it is difficult to establish the exact moment that genomics began its journey as a new discipline of science. One of the most significant milestones of genomics, The Human Genome Project (HGP), was made public in 2000, and the technology employed to achieve it was still ‘archaic’ compared to what would be developed only a few years later. For many scholars, the genuine genome era started with the arrival of the pioneering large-scale genotyping projects aimed at exploring thousands of DNA variants across the genome in a cost-effective manner, the first of which was HapMap [1]. Many sub-disciplines of human genetics took advantage of the new technological developments, and the arrival of SNP genotyping platforms was a big step for the field, facilitating exploratory studies of human multi-factorial diseases (by means of, e.g., genome-wide association studies, or GWAS). However, there was little time to benefit from these novel technological achievements, as the scientific community next witnessed the launch of next generation sequencing (NGS) technologies and, in parallel, the birth of more ambitious, large-scale projects, such as The 1000 Genomes Project [2] (hereafter 1000 G).
The availability of the large DNA data repositories generated by 1000 G and many other genome projects has greatly improved our knowledge of the architecture and function of the human genome. In the pre-genome era, very little was known about natural selection in humans, mainly because other organisms (e.g. Drosophila, bacteria) were traditionally much more suitable for evolutionary experiments than humans. However, the new developments in genomics would completely change the rules of the game. The ingenious (still pre-genomic) proposal of Sabeti et al. [3], which revived the old idea of analyzing patterns of extended haplotype homozygosity to investigate signatures of positive selection, epitomized this general feeling. Thus, in parallel to GWAS in the field of disease studies, there was a growing interest in exploring signatures of positive selection events, taking advantage of the genotyping data available in the large genome repositories [4,5]. In only a few years, the human genome turned out to be the main focus for the study of natural selection.
2. Genomics illuminates natural selection
Thanks to present-day genomic tools, significant progress has been achieved regarding the historical controversy between selectionists and neutralists, a debate that represents one of the most exciting issues in the history of evolutionary biology. Darwin proposed that the fate of variations (in traits) is mainly driven by natural selection, tempering this assertion with a small dose of good judgement: “variations neither useful nor injurious would not be affected by natural selection, and would be left a fluctuating element…” [6]. A century later, the neutral theory proposed by the talented geneticist Motoo Kimura stated that “in sharp contrast to the Darwinian theory of evolution by natural selection, the neutral theory claims that the overwhelming majority of evolutionary changes at the molecular level are caused by random fixation (due to random sampling drift in finite populations) of selectively neutral (i.e., selectively equivalent) mutants under continued inputs of mutations” [7]. Since then, estimating the fraction of the genome that evolves under positive or purifying selection has represented one of the greatest challenges
of evolutionary genetics [8]. The recent article by Pouyet et al. [9] provides exciting new insights into this dilemma, and their results and conclusions are undeniably disruptive: 80–85% of the human genome is probably affected by background selection (BGS), and this proportion rises to ∼95% when considering GC-biased gene conversion (gBGC). These estimates were obtained by developing a measure (the average derived allele frequency per individual, or $\overline{\mathrm{DAF}}_i$) that depends exclusively on the average time to the most recent common ancestor (TMRCA) of the whole sample. Some lessons that can be learned from this study are the following:
• Due to hitchhiking, genetic variation impacted by negative selection (as direct targets or indirectly by neighboring effect) concentrates in genome regions with the lowest recombination rates.
• The site frequency spectrum (SFS) observed at synonymous sites is very different to that of ‘pure neutral’ sites.
• Demographic inferences (at least the ones tested by the authors, e.g. dating bottleneck events) are markedly different if using ‘neutral’ or ‘non-neutral’ SFSs. Unpredictable biases are expected when using ‘non-neutral’ variants.
• Functional sites are the direct target of purifying selection (8–15% of the genome), but they have indirect influence on most of the genome.
• Last but not least, the method developed by these authors can be equally employed to understand patterns of natural selection in the genomes of other species.

Table 1. Descriptive statistics of SNPs located at ‘neutral’ and ‘non-neutral’ regions according to the definition of Pouyet et al. [9]. Only biallelic SNPs with rs accession numbers are considered (which represent > 99.7% of total sequence variants in the 1000 G samples analyzed).

| Statistic | Value |
|---|---|
| Nº of continental populations | 4 (8 sub-populations) |
| Total sample size | 769 |
| Nº of total SNPs | 78,024,524 |
| ‘Neutral’ regions | |
| Nº of ‘neutral’ regions | 49,031 |
| Mean length of the ‘neutral’ regions in bp (SD) | 7589 (8712) |
| Shortest ‘neutral’ region in bp | 10,660 |
| Largest ‘neutral’ region in bp | 249,198,690 |
| Genetic variants at ‘neutral’ regions | |
| Nº of SNPs | 12,166,425 |
| Nº of ‘neutral’ (WW + SS) SNPs | 1,852,209 |
| Nº of WW ‘neutral’ SNPs | 734,307 |
| Nº of SS ‘neutral’ SNPs | 1,117,902 |
| Nº of ‘neutral’ SNPs with MAF > 0.05 | 182,885 |
| Nº of ‘neutral’ SNPs with MAF > 0.05 and r² < 0.8 | 154,660 |
| Genetic variants at ‘non-neutral’ regions | |
| Nº of SNPs | 65,863,319 |
| Nº of SNPs with MAF > 0.05 | 5,741,005 |
| Nº of SNPs with MAF > 0.05 and r² < 0.8 | 1,271,788 |
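To make the per-individual statistic of Pouyet et al. [9] more concrete, the following Python sketch computes an average derived allele frequency for each individual. It rests on assumptions that are not stated in this article: genotypes are encoded as counts of derived alleles (0, 1 or 2) at polymorphic sites, and the statistic is taken as the mean of those counts divided by the ploidy. The function name `daf_per_individual` and the toy data are hypothetical, and the exact definition used in [9] may differ.

```python
from typing import List, Sequence


def daf_per_individual(
    derived_counts: Sequence[Sequence[int]], ploidy: int = 2
) -> List[float]:
    """Mean derived-allele frequency across sites, one value per individual.

    ASSUMPTION (not from the article): derived_counts[i][s] is the number of
    derived alleles (0..ploidy) carried by individual i at polymorphic site s.
    """
    values = []
    for row in derived_counts:
        if not row:
            raise ValueError("each individual needs at least one site")
        if any(count < 0 or count > ploidy for count in row):
            raise ValueError("derived-allele counts must lie in 0..ploidy")
        # Fraction of this individual's allele copies that are derived.
        values.append(sum(row) / (ploidy * len(row)))
    return values


# Hypothetical toy data: three diploid individuals typed at four sites.
toy_genotypes = [
    [0, 1, 2, 1],
    [2, 2, 1, 1],
    [0, 0, 0, 1],
]
toy_daf = daf_per_individual(toy_genotypes)
print(toy_daf)
```

In this toy example the three individuals carry 4, 6 and 1 derived alleles out of 8 possible, giving values of 0.5, 0.75 and 0.125.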
Table 2 Characteristics of ‘neutral’ variants. A total of 1,852,209 variants could be annotated as done previously [11], using ANNOVAR [20] and RefSeq [21]. Func refGene: chromosomal regions hit by the variants. ExonicFunc refGene: exonic variant function. SNV: single nucleotide variant.
3. Impact of purifying selection on population genetic inferences
At this stage, it is difficult to predict how the findings by Pouyet et al. [9] will impact investigations in molecular anthropology, ancient DNA, forensic genetics, and disease studies. Some of the consequences were advanced in their publication: by dating demographic events in Yoruban and Japanese populations, these authors demonstrated that different demographies can be inferred if using ‘neutral’ or ‘non-neutral’ SFSs. Harris [10] highlights that, in order to understand the history of human migrations, studies should ideally focus on the 5% of the genome that seems to behave neutrally.
To further illuminate the impact of the findings of Pouyet et al. [9], I examined the distribution of ‘neutral’ variation (as defined by these authors) in human genomes, and investigated its impact on biogeographical ancestry (BGA) inference, a form of analysis that is increasingly used in anthropological and forensic studies.
There are more than 78 × 10⁶ genetic variants in 769 full genomes taken from eight 1000 G population samples representing four continental groups (Table 1). The ‘neutral’ regions, defined by Pouyet et al. [9] as those having high recombination rates (≥ 1.5 cM/Mb), account for 49,031 segments of sizes that range from 10,660 bp to ∼249 Mb (mean: 7589 bp; SD: 8717 bp). These ‘neutral’ regions altogether represent 12.9% of the human genome. Only weak-to-weak (WW; A to T and T to A) and strong-to-strong (SS; G to C and C to G) changes falling within these segments can be considered to be ‘purely neutral’. While ‘neutral’ regions account for ∼16% of the total SNPs in the full genomes (> 12 × 10⁶), the ‘pure neutral’ ones embedded in these segments sum to only ∼2% of the total (Table 1). These data indicate that ‘neutral’ regions proportionally concentrate more SNPs than ‘non-neutral’ ones (∼16% of the SNPs in 12.9% of the genome), and this observation agrees well with the absence of purifying selection pressure in these segments.
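The WW/SS classification and the proportions quoted above can be sketched in Python as follows. The counts are copied from Table 1; the function name `substitution_class` and the labels 'WS' and 'SW' for the remaining changes are illustrative choices, not taken from Pouyet et al. [9].

```python
WEAK = {"A", "T"}    # weak bases (A:T pairs)
STRONG = {"G", "C"}  # strong bases (G:C pairs)


def substitution_class(ref: str, alt: str) -> str:
    """Classify a biallelic SNP as 'WW', 'SS', 'WS' or 'SW'.

    'WW' and 'SS' are the classes treated as 'purely neutral' in the text;
    'WS' and 'SW' are hypothetical labels for all other changes.
    """
    ref, alt = ref.upper(), alt.upper()
    valid = WEAK | STRONG
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    if ref in WEAK and alt in WEAK:
        return "WW"
    if ref in STRONG and alt in STRONG:
        return "SS"
    return "WS" if ref in WEAK else "SW"


# Counts copied from Table 1.
TOTAL_SNPS = 78_024_524
SNPS_IN_NEUTRAL_REGIONS = 12_166_425
WW_NEUTRAL_SNPS = 734_307
SS_NEUTRAL_SNPS = 1_117_902

pure_neutral = WW_NEUTRAL_SNPS + SS_NEUTRAL_SNPS
share_in_neutral_regions = 100 * SNPS_IN_NEUTRAL_REGIONS / TOTAL_SNPS
share_pure_neutral = 100 * pure_neutral / TOTAL_SNPS

print(substitution_class("A", "T"), substitution_class("G", "C"))
print(f"SNPs in 'neutral' regions: {share_in_neutral_regions:.1f}% of all SNPs")
print(f"WW + SS 'neutral' SNPs: {share_pure_neutral:.1f}% of all SNPs")
```

With the Table 1 counts, SNPs in ‘neutral’ regions amount to about 15.6% of all SNPs and the WW + SS subset to about 2.4%, in line with the ∼16% and ∼2% given in the text.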
In addition, ‘neutral’ variants are heterogeneously distributed along the chromosomes, and are particularly enriched in the telomeric regions (Fig. S1; [9]). WW and SS sites are also unevenly distributed along the chromosomes (Figs. S1B and S2).
Annotation of all ‘neutral’ variants (1,852,209 SNPs) reveals that most of them do not have exonic function (∼99%). However, there is a substantial number of non-synonymous variants (accounting for 0.77% of the total ‘neutral’ variants; Table 2), suggesting that a minor proportion of ‘non-neutral’ variants could elude the ability of $\overline{\mathrm{DAF}}_i$ to detect pure ‘neutral’ variation (as anticipated by Pouyet et al. [9]). Although the largest share of the ‘neutral’ variants falls in intergenic regions (∼54%), it is remarkable that ∼35% of them are located in introns.

| Func refGene | Nº of ‘neutral’ variants (%) |
|---|---|
| UTR3 | 18,769 (1.01%) |
| UTR5 | 5,309 (0.29%) |
| UTR5; UTR3 | 6 (0.00%) |
| Downstream | 15,504 (0.84%) |
| Exonic | 19,457 (1.05%) |
| Exonic; splicing | 3 (0.00%) |
| Intergenic | 999,737 (53.98%) |
| Intronic | 656,568 (35.45%) |
| ncRNA_exonic | 9,189 (0.50%) |
| ncRNA_exonic; splicing | 1 (0.00%) |
| ncRNA_intronic | 109,699 (5.92%) |
| ncRNA_splicing | 62 (0.00%) |
| Splicing | 190 (0.01%) |
| Upstream | 17,026 (0.92%) |
| Upstream;downstream | 689 (0.04%) |

| ExonicFunc refGene | Nº of ‘neutral’ variants (%) |
|---|---|
| No exonic function | 1,832,749 (98.95%) |
| Non-synonymous SNV | 14,333 (0.77%) |
| Stopgain | 208 (0.01%) |
| Stoploss | 17 (0.00%) |
| Synonymous SNV | 4,625 (0.25%) |
| Unknown | 277 (0.01%) |

Analyses of ancestry were carried out using the ‘neutral’ panel (154,660 SNPs after filtering the full set of ‘neutral’ SNPs for minor allele frequency [MAF] > 0.05 and linkage disequilibrium [LD; r² < 0.8]) and a panel of SNPs that fall outside the defined ‘neutral’ regions. Filtering by LD affects the SFS patterns of WW + SS changes located in ‘neutral’ regions differently from those located in ‘non-neutral’ regions (Fig. S2; [9]). However, these differences have little effect on ancestry estimates. The multidimensional scaling (MDS) plot built on pairwise identity-by-state values among the 1000 G reference populations (data processed as in [11,12]) shows a similar pattern independently of the
panel of SNPs used (Fig. S3). The barplot of ancestral membership sharing (optimum K = 4; Fig. S4) agrees well with the patterns observed in the MDS. Fig. 1 shows estimates of ancestry for the 769 individual SNP profiles using ‘neutral’ versus ‘non-neutral’ SNPs. The results indicate very marginal differences between the estimates of ancestry obtained with the two SNP panels, compared to the large variability observed with commonly used forensic ancestry-informative marker (AIM) panels [13–15]. Filtering by MAF does not impact these estimates (data not shown).
Fig. 1. Boxplot of ancestry proportions for eight 1000 G population samples (n = 769) obtained from admixture analysis using ‘neutral’ versus ‘non-neutral’ SNP panels (Fig. S4). Maximum likelihood estimates of individual ancestries from multi-locus SNP data were obtained with ADMIXTURE [22], using a ‘neutral’ panel of 154,660 SNPs (see also legend of Fig. S3) (A) and a random set of 154,660 ‘non-neutral’ SNPs (B). See also the barplot in Fig. S4.
4. ‘Junk’ DNA?
It was in 1972 that Susumu Ohno coined the term ‘junk’ DNA for the noncoding fraction of the genome (∼99%). The view of DNA as ‘garbage’ (still a debatable issue in the present-day literature) could also be interpreted in the light of the findings of Pouyet et al. [9]. These authors claim that the effect of natural selection on most of the human genome is indirect, e.g. through hitchhiking. It is then tempting to conceive hitchhiking as a subtle and sophisticated mechanism that can influence the fitness of individuals. If true, the new results by Pouyet et al. [9] would indicate that virtually all the genome, either indirectly or directly, could play a biological/evolutionary role and ultimately impact fitness. There is empirical and functional evidence that fits well with this interpretation. For instance, the Encyclopedia of DNA Elements (ENCODE) pilot project [16,17], the main public international project aiming to identify functional elements in the human genome, estimated that 93% of bases are represented in a primary transcript (although this figure could be as low as ∼74%). Adding to this complexity, there is growing evidence supporting the key role of jumping (‘junk’) DNA (mobile repetitive elements scattered randomly in the human genome, amounting to > 50% of it) in DNA function [18,19].
5. Conclusions
From the present evidence, the effect of genome hitchhiking on fitness should receive more attention in future research. Efforts in this direction would perhaps help us to understand why DNA has such a complex replication machinery, where only ∼1% of it encodes genes (∼8–15% is directly targeted by natural selection [8]) even though almost all the DNA is constrained by purifying selection. For the time being, it is important to evaluate how these findings affect demographic inferences of different kinds before establishing the roadmap for future genetic studies. The impact of the findings of Pouyet et al. [9] on inference of BGA seems to be marginal (on the basis of the preliminary analyses carried out here), but it is expected to be much more relevant to other population genetic inferences.
Acknowledgments
I would like to thank J. Pardo and X. Bello for useful discussion, and F. Pouyet and C. Phillips for critically reading the document. A.S. received support from the Instituto de Salud Carlos III (Proyecto de Investigacion en Salud, Accion Estrategica en Salud; project GePEM ISCIII/PI16/01478, co-funded by FEDER) and from 2016-PG071 Consolidacion e Estructuracion REDES 2016GI-1344 G3VIP (Grupo Gallego de Genetica, Vacunas, Infecciones y Pediatria, ED341 R2016/021).
Appendix A. Supplementary data
Supplementary material related to this article can be found, in the online version, at https://doi.org/10.1016/j.fsigen.2018.12.003.
References
[1] The International HapMap Consortium, The International HapMap Project, Nature 426 (2003) 789–796.
[2] R.M. Durbin, G.R. Abecasis, D.L. Altshuler, A. Auton, L.D. Brooks, R.A. Gibbs, M.E. Hurles, G.A. McVean, A map of human genome variation from population-scale sequencing, Nature 467 (2010) 1061–1073.
[3] P.C. Sabeti, D.E. Reich, J.M. Higgins, H.Z. Levine, D.J. Richter, S.F. Schaffner, S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, et al., Detecting recent positive selection in the human genome from haplotype structure, Nature 419 (2002) 832–837.
[4] B.F. Voight, S. Kudaravalli, X. Wen, J.K. Pritchard, A map of recent positive selection in the human genome, PLoS Biol. 4 (2006) e72.
[5] J.M. Akey, G. Zhang, K. Zhang, L. Jin, M.D. Shriver, Interrogating a high-density SNP map for signatures of natural selection, Genome Res. 12 (2002) 1805–1814.
[6] C.R. Darwin, The Origin of Species, John Murray, London, 1859.
[7] M. Kimura, The neutral theory of molecular evolution: a review of recent evidence, Jpn. J. Genet. 66 (1991) 367–386.
[8] D. Graur, An upper limit on the functional fraction of the human genome, Genome Biol. Evol. 9 (2017) 1880–1885.
[9] F. Pouyet, S. Aeschbacher, A. Thiery, L. Excoffier, Background selection and biased gene conversion affect more than 95% of the human genome and bias demographic inferences, Elife 7 (2018).
[10] K. Harris, The randomness that shapes our DNA, Elife 7 (2018).
[11] A. Salas, J. Pardo-Seco, R. Barral-Arca, M. Cebey-López, A. Gómez-Carballa, I. Rivero-Calle, S. Pischedda, M.J. Curras-Tuala, J. Amigo, J. Gómez-Rial, et al., Whole exome sequencing identifies new host genomic susceptibility factors in empyema caused by Streptococcus pneumoniae in children: a pilot study, Genes (Basel) 9 (2018).
[12] A. Salas, J. Pardo-Seco, M. Cebey-López, A. Gómez-Carballa, P. Obando-Pacheco, I. Rivero-Calle, M.J. Curras-Tuala, J. Amigo, J. Gómez-Rial, F. Martinón-Torres, et al., Whole exome sequencing reveals new candidate genes in host genomic susceptibility to respiratory syncytial virus disease, Sci. Rep. 7 (2017) 15888.
[13] J. Pardo-Seco, F. Martinón-Torres, A. Salas, Evaluating the accuracy of AIM panels at quantifying genome ancestry, BMC Genomics 30 (2014) 543.
[14] C. Phillips, A. Salas, J.J. Sánchez, M. Fondevila, A. Gómez-Tato, J. Álvarez-Dios, M. Calaza, M.C. de Cal, D. Ballard, M.V. Lareu, et al., Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs, Forensic Sci. Int. Genet. 1 (2007) 273–280.
[15] J.M. Galanter, J.C. Fernández-López, C.R. Gignoux, J. Barnholtz-Sloan, C. Fernández-Rozadilla, M. Via, A. Hidalgo-Miranda, A.V. Contreras, L.U. Figueroa, P. Raska, et al., Development of a panel of genome-wide ancestry informative markers to study admixture throughout the Americas, PLoS Genet. 8 (2012) e1002554.
[16] The ENCODE Project Consortium, E. Birney, J.A. Stamatoyannopoulos, A. Dutta, R. Guigo, T.R. Gingeras, E.H. Margulies, Z. Weng, M. Snyder, E.T. Dermitzakis, et al., Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature 447 (2007) 799–816.
[17] The ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome, Nature 489 (2012) 57–74.
[18] W. Tang, S. Mun, A. Joshi, K. Han, P. Liang, Mobile elements contribute to the uniqueness of human genome with 15,000 human-specific insertions and 14 Mbp sequence increase, DNA Res. 25 (2018) 521–533.
[19] M. Jagannathan, R. Cummings, Y.M. Yamashita, A conserved function for pericentromeric satellite DNA, Elife 7 (2018).
[20] K. Wang, M. Li, H. Hakonarson, ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data, Nucleic Acids Res. 38 (2010) e164.
[21] N.A. O’Leary, M.W. Wright, J.R. Brister, S. Ciufo, D. Haddad, R. McVeigh, B. Rajput, B. Robbertse, B. Smith-White, D. Ako-Adjei, et al., Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation, Nucleic Acids Res. 44 (2016) D733–D745.
[22] D.H. Alexander, J. Novembre, K. Lange, Fast model-based estimation of ancestry in unrelated individuals, Genome Res. 19 (2009) 1655–1664.