Forensic Science International: Genetics 3 (2008) 7–13
D9S1120, a simple STR with a common Native American-specific allele: Forensic optimization, locus characterization and allele frequency studies

C. Phillips a,*, A. Rodriguez b, A. Mosquera-Miguel b, M. Fondevila b, L. Porras-Hurtado b, F. Rondon b, A. Salas b, Á. Carracedo a, M.V. Lareu b

a Genomic Medicine Group, CIBERER, University of Santiago de Compostela, Galicia, Spain
b Institute of Legal Medicine, Genomic Medicine Group, University of Santiago de Compostela, Santiago de Compostela, Spain
ARTICLE INFO

Article history: Received 15 June 2008; received in revised form 2 July 2008; accepted 4 July 2008

Keywords: STR; D9S1120; Population genetics; AIMs; Ancestry markers

ABSTRACT

The simple tetrameric STR D9S1120 exhibits a common population-specific allele of 9 repeats (9RA) reported to have an average frequency of 0.36 in Native Americans from both the North and South of the continent. Apart from the presence of 9RA in two northeast Siberian populations, D9S1120 shows variability exclusive to, and universal in, all American populations studied to date. This STR therefore provides an informative forensic marker applicable in countries with significant proportions of Native American populations or ancestry. We have re-designed PCR primers that reduce the amplified product sizes reported in NCBI UniSTS by more than a third and have characterized the repeat structure of D9S1120. The 9RA allele shares the same repeat structure as the majority of other D9S1120 alleles and so originates from a slippage-diminution mutation rather than an independent deletion. We confirm the previously reported allele frequencies from a range of populations, indicating a global heterozygosity range for D9S1120 of 66–75%, and estimate the proportion of Native American-diagnostic genotypes to average 53%, underlining the potential usefulness of this STR in both forensic identification and in population genetics studies of the Americas. © 2008 Elsevier Ireland Ltd. All rights reserved.
* Corresponding author. E-mail address: [email protected] (C. Phillips).
1872-4973/$ – see front matter © 2008 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.fsigen.2008.07.002

1. Introduction

A recent study by Wang et al. [1], surveying short tandem repeat (STR) variability amongst 29 American populations and 49 worldwide populations, highlighted the widespread distribution of a common Native American-specific allele in the STR D9S1120 consisting of 9 repeats (termed 9RA). The 9RA allele is the shortest repeat observed in D9S1120 and is unique amongst all STR alleles studied to date in being population-specific at high frequency. Notably, 9RA has been found in every American population examined and is close to fixation in the Suruí of Amazonia. A second study by Schroeder et al. [2] also observed a consistently high frequency of 9RA across all American regions: averaging 0.301 in North America and 0.471 in South America. An interesting aspect of Schroeder's study was the first observation of 9RA outside of America but confined to two northeast Siberian populations (frequencies of 0.175 in the Koryaks of Kamchatka and 0.238 in the Chukchi of the extreme east of Siberia). In contrast, Wang's study did not observe 9RA in samples of 14 Tundra Nentsi and 25 Yakut from northwestern and mid Siberia, respectively [1].

The observation of a geographically limited distribution of 9RA immediately west of the Bering Strait strongly suggests a principal migration event from the eastern edge of Siberia, with an accompanying bottleneck followed by an expansion in frequency of the 9RA allele during the initial colonization of the Americas. From the perspective of reconstructing past population demographics from modern patterns of genetic variation, the autosomal 9RA parallels the haplotypic unique event marker DYS199T of Y chromosome lineage QM3 [3]. In addition, detailed analysis of the four typical Native American mitochondrial (mtDNA) haplogroups of A2, B2, C1 and D1 [4,5] has added improved resolution to the tracking of past migrations into the Americas. Evidence from these three markers taken together points to the main contribution to modern American variability originating from a major migration from west Beringia of greater significance than serial migrations into America of more broadly distributed East Asian populations.

Given the population-specific distribution and frequencies of 9RA described above, the STR D9S1120 not only provides a highly informative marker for studying population events, but could also be useful as a forensic STR that allows the detection of Native American ancestry in a substantial proportion of individuals with this geographic origin. We report the development of new primer designs that reduce amplicon sizes by over a third to suit forensic analysis. A survey was made of six American (Maya, three native Colombian and two American-African-European admixed populations from Colombia) and three non-American populations that confirmed the previously observed allele frequency distributions, while discovering two new low frequency intermediate (x.3) alleles in the Taiwanese population sample. Finally, in order to construct a comprehensive sequenced allelic ladder, we characterized all 13 alleles observed to date and the immediate flanking sequence of D9S1120, finding deviations from a simple [GATA]n repeat structure.

2. Materials and methods

2.1. PCR primer design and amplification conditions

Primer3 software [6] was used to design new primers that shortened the amplified product size range from 272–312 bp to 168–208 bp, a median reduction of 36%. Table 1 lists the PCR primer sequences of the NCBI UniSTS consensus primers ([7], UniSTS: 62438) and two different primer pairs we used independently for typing and sequencing. The forward primer was labeled with NED™ dye (Applied Biosystems: AB, Foster City, USA) for STR typing. Attempts to reduce the amplified fragments to a typical miniSTR size range (i.e. less than 125 bp) were unsuccessful. PCR amplification used the following steps: 95 °C for 11 min, then 30 cycles of 95 °C for 45 s, 62 °C for 1 min and 72 °C for 70 s, with a final elongation at 65 °C for 60 min. Reaction components comprised: 1 ng of DNA in 12.5 µl total volume, 1.5 mM MgCl2, 0.2 mM dNTPs, 0.2 µM primers and 1 U AmpliTaq Gold™ (AB) Taq polymerase.

Table 1
D9S1120 primer designs and their binding regions in the flanking sequence of a 9RA allele
F and R denote forward and reverse strand binding sites (UniSTS designations), [NED]- shows the position of the dye label used. UniSTS binding regions shown as light grey boxes, sequencing primers first and last bases, where different to UniSTS, as dark grey boxes. Genotyping primers are shown as bold italic text and the repeat region as light grey text. Note the 29th base [T] is a G/T SNP: rs730779 reported as monomorphic T in all HapMap populations.

2.2. Populations studied

Population samples comprised: 84 Mozambicans (herein alternatively termed AFR), 121 northwest Spanish (EUR), 95 Taiwanese (ASN), 23 Maya together with Colombian populations: 30 Awa, 35 Pijao, 62 Coyaima, giving four Native American (AME) populations, plus two Colombian American-African-European admixed populations of 42 Mulalós and 50 Mestizos. The regional distribution of the five Colombian populations studied is shown in the inset map of Fig. 1.

2.3. Sequence analysis and construction of allelic ladder

Each allele was isolated by excising homozygous PCR product bands (by preference) or, for rare alleles, adequately separated heterozygous bands from silver-stained polyacrylamide gels (T: 9%, C: 5%). DNA was eluted from the gel by incubation with 50 µl of 20% Chelex™ overnight, then three freeze–thaw cycles followed by centrifugation for 4 min at 13,000 rpm. The resulting supernatant was sequenced using standard Big-dye™ protocols with redesigned UniSTS primers (forward primer 12 bp closer to the repeat region, reverse primer with an extra C base). The binding regions of all the primer designs are compared in the D9S1120 flanking sequence included with Table 1.

3. Results

3.1. Population variability and forensic informativeness of D9S1120

The frequency distributions observed for 9RA alone and for the full range of D9S1120 alleles are shown in Figs. 1 and 2, respectively. For comparison purposes Fig. 2 uses the same colour labels for D9S1120 alleles as those of Wang's study of global variability (Fig. 10 of Ref. [1]). The 9RA allele was found at an average frequency of 0.31 in the four Native American populations combined (Maya and Colombian Awa, Pijao and Coyaima) compared with a continent-wide average of 0.36 reported by Wang et al. [1]. The Mulaló sample shows a lower than average 9RA frequency and has a noticeably different overall allele frequency distribution, possibly as a result of admixture, although the Mestizo sample matches more closely the allele frequency distribution of other American populations. In general the high between-population variability of 9RA frequencies in America found by all studies so far reduces the precision of D9S1120 as a gauge of Native American contributions to admixed populations.
In non-American populations 88–93% of the variation is accounted for by alleles 15, 16 and 17 alone, indicating below-average informativeness for D9S1120 in forensic identification applications. This is underlined by comparatively low estimated heterozygosities for a forensic STR in both American and non-American populations shown in Table 2. Although the heterozygosity range of 66–71% in non-Americans is only marginally raised to 75% by the presence of a fourth common allele in Native Americans, these values compare favorably to heterozygosities (in Europeans) of the least informative CODIS loci: TPOX 64%, CSF1PO 72% and TH01 75%.

Fig. 1. Geographic location of the nine populations analyzed (with inset of Colombian populations) and frequency of American-specific 9RA allele shown as dark pie-chart segment.

Fig. 2. D9S1120 allele frequency distributions of the nine study populations. The column marked America (four population) shows estimates from combined Native American samples of Maya plus Colombian Awa, Pijao and Coyaima.

Table 2
D9S1120 allele (italics) and genotype frequency estimates in four population groups

(a) Native American: 125 Maya plus Colombian Awa, Pijao and Coyaima^a
Repeats observed    9RA    10     11     12     13     14     15     16     17     18     19
Allele frequencies  0.309  0.003  0      0      0.003  0.023  0.168  0.329  0.128  0.026  0.010
19        0.010     0.006  0.000  0      0      0.000  0.000  0.003  0.006  0.003  0.001  0.000
18        0.026     0.016  0.000  0      0      0.000  0.001  0.009  0.017  0.007  0.001
17        0.128     0.079  0.001  0      0      0.001  0.006  0.043  0.084  0.016
16        0.329     0.203  0.002  0      0      0.002  0.015  0.110  0.108
15        0.168     0.104  0.001  0      0      0.001  0.008  0.028
14        0.023     0.014  0.000  0      0      0.000  0.001
13        0.003     0.002  0.000  0      0      0.000
12        0         0      0      0      0
11        0         0      0      0
10        0.003     0.002  0.000
9RA       0.309     0.096

(b) East Asian: 95 Taiwanese^b
Repeats observed    11     12     13     14     15     16     17     17.3   18     18.3   19
Allele frequencies  0      0.005  0      0      0.212  0.471  0.196  0.005  0.106  0.005  0
19        0         0      0      0      0      0      0      0      0      0      0      0
18.3      0.005     0      0.000  0      0      0.002  0.005  0.002  0.000  0.001  0.000
18        0.106     0      0.001  0      0      0.045  0.100  0.041  0.001  0.011
17.3      0.005     0      0.000  0      0      0.002  0.005  0.002  0.000
17        0.196     0      0.002  0      0      0.083  0.184  0.038
16        0.471     0      0.005  0      0      0.199  0.222
15        0.212     0      0.002  0      0      0.045
14        0         0      0      0      0
13        0         0      0      0
12        0.005     0      0.000
11        0         0

(c) European: 121 Northwest Spanish^c
Repeats observed    11     12     13     14     15     16     17     18     19
Allele frequencies  0.004  0.004  0.017  0.054  0.236  0.426  0.215  0.041  0.004
19        0.004     0.000  0.000  0.000  0.000  0.002  0.004  0.002  0.000  0.000
18        0.041     0.000  0.000  0.001  0.004  0.019  0.035  0.018  0.002
17        0.215     0.002  0.002  0.007  0.023  0.101  0.183  0.046
16        0.426     0.004  0.004  0.014  0.046  0.200  0.181
15        0.236     0.002  0.002  0.008  0.025  0.055
14        0.054     0.000  0.000  0.002  0.003
13        0.017     0.000  0.000  0.000
12        0.004     0.000  0.000
11        0.004     0.000

(d) African: 84 Mozambicans^d
Repeats observed    11     12     13     14     15     16     17     18     19
Allele frequencies  0      0.018  0.006  0.018  0.292  0.470  0.167  0.030  0
19        0         0      0      0      0      0      0      0      0      0
18        0.030     0      0.001  0.000  0.001  0.017  0.028  0.010  0.001
17        0.167     0      0.006  0.002  0.006  0.097  0.157  0.028
16        0.470     0      0.017  0.006  0.017  0.274  0.221
15        0.292     0      0.010  0.003  0.010  0.085
14        0.018     0      0.001  0.000  0.001
13        0.006     0      0.000  0.000
12        0.018     0      0.000
11        0         0

American-specific genotypes are highlighted in bold. Genotype values below 0.1% are given as 0.000 and zero values as 0.
^a Heterozygosity 75.02%, American-specific genotypes 52.73%.
^b Heterozygosity 68.39%.
^c Heterozygosity 71.23%.
^d Heterozygosity 66.42%.

3.2. D9S1120 as an American-specific ancestry informative marker

The 9RA allele is frequent enough to make an informative ancestry marker, providing an unequivocal indication of American ancestry in 52.7% of the genotypes expected from the pooled Native American allele frequency estimates of Table 2a. One advantage of D9S1120 is the absence of 9RA in East Asian populations, a characteristic not shown by the patterns of variability in the bulk of other autosomal loci. In order to compare the ancestry informativeness of D9S1120 to a 28plex assay of the best American informative AIM-SNPs currently being developed in our laboratory, we estimated the population divergence parameter I_n [8] of the STR by treating it as a bi-allelic AIM (i.e. simplified to 9RA:non-9RA genotypes). Divergence values were calculated for a four-population group comparison of AME:AFR:EUR:ASN, and for the pairwise comparisons of American vs. AFR, EUR and ASN; each set of values is shown in Fig. 3, ranked in descending four-population divergence. Although D9S1120 shows the lowest four-population I_n value of 0.18 amongst the 29 markers (AIM-SNP I_n range: 0.59–0.24), this partly reflects the complete absence of 9RA in three of the four populations. More importantly, D9S1120 is the 25th most informative marker for AME:EUR comparisons (I_n = 0.19 in the range: 0.65–0.13) and notably is ranked 20th most informative marker for AME:ASN comparisons (I_n = 0.19 in the range: 0.68–0.004). Since the 28 AIM-SNPs compared here represent the most informative markers we have found so far for analyzing American populations, this suggests that for more challenging ancestry analyses, particularly AME vs. ASN, D9S1120 merits inclusion in the AIM sets chosen.

Fig. 3. Ranked four-population (AME:AFR:EUR:ASN) I_n divergence values for 28 American indicative AIM-SNPs plus D9S1120 re-configured as a bi-allelic marker consisting of 9RA:non-9RA genotypes. Paired population divergence plots indicate four AIM-SNPs are less informative than D9S1120 for AME:EUR comparisons and nine are less informative for AME:ASN comparisons. All D9S1120 paired population I_n values are identical at 0.19 and are shown as a white circle.

3.3. Observation of rare alleles

The 10-repeat allele of D9S1120 has been previously reported by Wang et al. [1] as American-specific but at very low frequency (average 0.003) and observed in only 3 of 29 populations studied: Maya (south Mexico), Ojibwa and Cree (central-eastern US-Canadian border region). We found a single 10 allele in a sample of 23 Maya from a total of 245 Americans examined, further suggesting that this allele is very rare. Similarly, the 11-repeat allele was only observed once in all, specifically in a European sample. Two different 3 bp repeat intermediate alleles, a 17.3 and an 18.3, were observed in 2 of 95 Taiwanese individuals. These represent newly observed variation in D9S1120 since they were not reported in the previous, more extensive global population studies of Wang et al. [1] and Schroeder et al. [2]. Both 17.3 and 18.3 alleles were characterized by sequence analysis but, to maintain clear peak spacing, were not included in the allelic ladder.
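As a worked check on the summary statistics quoted in Sections 3.1 and 3.2, the following minimal Python sketch recomputes the expected heterozygosity and the expected proportion of American-specific genotypes from the rounded allele frequencies of Table 2. It assumes Hardy–Weinberg proportions, which is consistent with the tabulated genotype estimates but is an assumption here; the function and variable names are illustrative and are not taken from the study. Because the tabulated frequencies are rounded, the results should be expected to agree with the reported values (75.02%, 71.23% and 52.73%) only to within a few tenths of a percent.

```python
# Rounded allele frequencies copied from Table 2 (pooled Native American and
# northwest Spanish samples). Rounding means each set sums to ~1, not exactly 1.
NATIVE_AMERICAN = {
    "9RA": 0.309, "10": 0.003, "11": 0.0, "12": 0.0, "13": 0.003,
    "14": 0.023, "15": 0.168, "16": 0.329, "17": 0.128, "18": 0.026,
    "19": 0.010,
}
NW_SPANISH = {
    "11": 0.004, "12": 0.004, "13": 0.017, "14": 0.054, "15": 0.236,
    "16": 0.426, "17": 0.215, "18": 0.041, "19": 0.004,
}

# Alleles described in the text as American-specific (9RA and the rare 10).
AMERICAN_SPECIFIC = ("9RA", "10")


def expected_heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2), assuming Hardy-Weinberg."""
    return 1.0 - sum(p * p for p in freqs.values())


def diagnostic_genotype_fraction(freqs, specific):
    """Expected fraction of genotypes carrying at least one specific allele.

    With q the summed frequency of the specific alleles, the fraction of
    genotypes carrying none of them is (1 - q)^2 under Hardy-Weinberg.
    """
    q = sum(freqs.get(allele, 0.0) for allele in specific)
    return 1.0 - (1.0 - q) ** 2


if __name__ == "__main__":
    for label, freqs in (("AME", NATIVE_AMERICAN), ("EUR", NW_SPANISH)):
        h = expected_heterozygosity(freqs)
        d = diagnostic_genotype_fraction(freqs, AMERICAN_SPECIFIC)
        print(f"{label}: heterozygosity {h:.1%}, American-specific genotypes {d:.1%}")
```

The diagnostic fraction is driven almost entirely by 9RA: with the 10-repeat allele at 0.003, dropping it from the specific set changes the result by well under one percentage point.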
3.4. Sequence analysis of the D9S1120 repeat structure

Alleles with 13 different mobilities were analyzed from 56 individuals by sequencing. Table 3 outlines the detected repeat structure of D9S1120. No nucleotide variants or deletions were detected in the sequences flanking the repeat region; however, Table 3 shows that a simple tandemly repeated GATA structure for D9S1120 is compounded by two independent sequence motifs:

i. The 4th repeat comprises CATA in nearly all alleles but has a rare C > G SNP found in two EUR samples and one ASN sample, creating uniform GATA tandem repeats in 11- and 12-repeat alleles. An equal number of C and G SNP alleles were found in the four 12-repeat alleles sequenced, but the observation of a single 11-repeat genotype prevented detailed study of any association between the G SNP allele and the very rare 11-repeat allele. Although each of the eight 9RA alleles we analyzed showed the common repeat pattern of D9S1120, indicating 9RA originated from a diminution-slippage mutation of a 10, 11 or 12 repeat, the C allele of the observed SNP is unlikely to occur in America at a useful enough frequency to give an additional population-informative marker within this STR.

ii. An A-nucleotide deletion in the 7th GATA repeat, creating a [GAT] repeat observed as the 17.3 or 18.3 intermediate allele described in Section 3.3 and confined to two Taiwanese in our study.

Therefore the summary structure for the D9S1120 repeat region is [GATA]3 [C/G ATA]1 [GATA]2 [GAT A/-]1 [GATA]p, with p ranging from 2 in 9RA to 12 in the 19-repeat allele. Finally, it should be noted that the repeat region shown in Table 1 has a symmetrical ATA sequence each side of the tandem repeats; therefore it is equally valid to describe the repeat unit as [ATAG].

Table 3
Repeat structure of 14 repeat alleles of D9S1120 detected by extracting a total of 56 PCR products showing 13 different mobilities

Allele   Repeat structure                          Samples sequenced   Size in bp   Source population when rare or specific (bold)
9        [GATA]3[CATA]1[GATA]m                     8                   168          AME
10       [GATA]3[CATA]1[GATA]m                     1                   172          AME
11^a     [GATA]n                                   1                   176          EUR
12       [GATA]3[CATA]1[GATA]m                     2                   180
12^a     [GATA]n                                   2                   180          EUR and ASN
13       [GATA]3[CATA]1[GATA]m                     4                   184
14       [GATA]3[CATA]1[GATA]m                     3                   188
15       [GATA]3[CATA]1[GATA]m                     10                  192
16       [GATA]3[CATA]1[GATA]m                     10                  196
17       [GATA]3[CATA]1[GATA]m                     6                   200
17.3^b   [GATA]3[CATA]1[GATA]2[GAT]1[GATA]p        1                   203          ASN
18       [GATA]3[CATA]1[GATA]m                     4                   204
18.3^b   [GATA]3[CATA]1[GATA]2[GAT]1[GATA]p        1                   207          ASN
19       [GATA]3[CATA]1[GATA]m                     3                   208          AME and EUR

Alleles that are rare or specific to a population group are indicated. n: number of repeats; m: n − 4 (segment: [GATA]3[CATA]1); p: n − 6.3 (segment: [GATA]3[CATA]1[GATA]2[GAT]1).
^a G allele at [C/G] SNP in repeat position 4.1.
^b x.3 allele = [A/-] deletion in repeat position 7.4.

4. Discussion

This study indicates that D9S1120 will be a useful supplement to Y-chromosome and mtDNA typing for the analysis of Native American populations and their origins. In addition, the 9RA allele shows informative levels of American specificity for inferring Native American ancestry in 53% of individuals from this population group. Comparison with our set of most informative American indicative AIM-SNPs shows D9S1120 will be a useful additional locus, particularly for differentiating American and East Asian ancestries.

One less useful characteristic of D9S1120 is that variation in 9RA frequencies amongst American populations is high, making estimation of Native American admixture proportions based on this marker alone imprecise. For example, the geographically close native Colombian populations of Awa and Coyaima show an almost threefold difference in 9RA frequency of 0.133 and 0.375, respectively. However, the 9RA allele is generally frequent enough across the American continent to allow the detection of a minor Native American component when examining admixed populations using a large enough sample set. We are currently analyzing all three loci plus autosomal AIM-SNPs in parallel to assess the extent to which D9S1120 helps to resolve complicated patterns of admixture previously characterized by mtDNA alone [9].

The unusual allele frequency characteristics of D9S1120 suggest it would be a useful addition to the STRs available to forensic laboratories in North and South America despite showing levels of polymorphism in all population groups lower than most of the established forensic STRs. Since D9S1120 has only 11 tetrameric repeats in total, it is well suited to adaptation as a miniSTR system. However, the flanking sequence shown in Table 1 illustrates why our attempts to design an adequate short amplicon PCR primer pair have been unsuccessful so far. The upstream sequence has several tandem dinucleotide repeat motifs and a lower than average %GC; therefore we struggled to bring the forward primer closer to the repeat region while keeping the right characteristics for good quality PCR.
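The imprecision of single-marker admixture estimates noted in the Discussion can be illustrated with a minimal Python sketch. It assumes that 9RA is absent from all non-American parental populations, so that the expected 9RA frequency in an admixed population is the Native American admixture proportion multiplied by the parental 9RA frequency. The parental frequencies are those quoted in the text and in Table 2a, but the admixed-sample frequency of 0.10 is purely hypothetical, and the function names are illustrative rather than part of the study.

```python
# 9RA frequencies quoted in the text (Awa, Coyaima) and in Table 2a (pooled).
PARENTAL_9RA = {"Awa": 0.133, "Coyaima": 0.375, "pooled (Table 2a)": 0.309}

# Hypothetical 9RA frequency in an admixed sample; NOT a value from the study.
HYPOTHETICAL_ADMIXED_9RA = 0.10


def admixture_from_private_allele(admixed_freq, parental_freq):
    """Estimate the admixture proportion m from a private allele.

    Assumes the allele is absent from every other parental population,
    so admixed_freq = m * parental_freq. The estimate is capped at 1.
    """
    if not 0.0 < parental_freq <= 1.0:
        raise ValueError("parental_freq must be in (0, 1]")
    return min(1.0, admixed_freq / parental_freq)


def prob_detect_private_allele(admixture, parental_freq, n_individuals):
    """Probability of seeing at least one copy among 2N sampled chromosomes."""
    allele_freq = admixture * parental_freq
    return 1.0 - (1.0 - allele_freq) ** (2 * n_individuals)


if __name__ == "__main__":
    for name, p in PARENTAL_9RA.items():
        m = admixture_from_private_allele(HYPOTHETICAL_ADMIXED_9RA, p)
        print(f"parental = {name}: estimated admixture {m:.2f}")
    # Chance of detecting a 5% Native American component in 50 individuals,
    # using the pooled parental frequency.
    print(f"{prob_detect_private_allele(0.05, 0.309, 50):.2f}")
```

Under these assumptions the same admixed-sample frequency gives an estimate nearly three times larger with the Awa frequency than with the Coyaima frequency, which mirrors the threefold difference between the two parental values; the second function sketches why a sufficiently large sample is needed to detect a minor Native American component.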
As well as analyzing native and admixed Colombian populations complementary to previous samples of American D9S1120 variability and characterizing the repeat structure of the STR with re-designed primers, we aimed to find sequence variability within the repeat region or flanking sequence that may provide additional markers for the analysis of American populations. Although we discovered new allele variation, it showed insufficient variability to significantly enhance D9S1120 for population genetics studies.

Two further possibilities exist to search for nucleotide variation that could have become associated with 9RA during its expansion in frequency in Beringia: screening for SNPs in a broadened range of flanking sequence, and detection of low frequency substitutions within repeats using larger numbers of samples than sequencing analysis permits. In searching for peripheral loci it is interesting to note that one SNP, rs4877301, positioned 663 bp upstream of the D9S1120 forward primer, shows considerable global divergence with an Fst of 0.33 amongst the four HapMap populations. This would be an interesting SNP to study further for possible associations with 9RA, as well as some eight other SNPs within 1000 bp each side of the STR but not yet characterized. The other strategy, of looking at larger population samples to simultaneously detect length and nucleotide variability, has recently benefitted from the development of a viable high-throughput approach to achieve this: ICEMS, a system that combines ion-pair reversed-phase high-performance liquid chromatography and electrospray ionization quadrupole time-of-flight mass spectrometry [10]. We intend to pursue both the above approaches to attempt the capture of further population indicative variability that may exist in or around the D9S1120 STR.

Acknowledgements

Funding from Xunta de Galicia (PGIDTIT06PXIB228195PR) and a grant from the Ministerio de Educación y Ciencia (project BIO2006-06178) given to MVL supported this project.

References

[1] S. Wang, C.M. Lewis Jr., M. Jakobsson, S. Ramachandran, N. Ray, et al., Genetic variation and population structure in Native Americans, PLoS Genet. 3 (11) (2007) 2049–2067.
[2] K.B. Schroeder, T.G. Schurr, J.C. Long, N.A. Rosenberg, M.H. Crawford, L.A. Tarskaia, L.P. Osipova, S.I. Zhadanov, D.G. Smith, A private allele ubiquitous in the Americas, Biol. Lett. 3 (2007) 218–223.
[3] P.A. Underhill, L. Jin, R. Zemans, P.J. Oefner, L.L. Cavalli-Sforza, A pre-Columbian Y chromosome-specific transition and its implications for human evolutionary history, Proc. Natl. Acad. Sci. U.S.A. 93 (1996) 196–200.
[4] G.F. Shields, A.M. Schmiechen, B.L. Frazier, A. Redd, M.I. Voevoda, J.K. Reed, R.H. Ward, mtDNA sequences suggest a recent evolutionary divergence for Beringian and northern North American populations, Am. J. Hum. Genet. 53 (1993) 549–562.
[5] A. Achilli, U.A. Perego, C.M. Bravi, M.D. Coble, Q.P. Kong, S.R. Woodward, A. Salas, A. Torroni, H.J. Bandelt, The Phylogeny of the four pan-American mtDNA haplogroups: implications for evolutionary and disease studies, PLoS One 3 (3) (2008) e1764.
[6] http://primer3.sourceforge.net/.
[7] http://www.ncbi.nlm.nih.gov/genome/sts/sts.cgi?uid=62438.
[8] N.A. Rosenberg, L.M. Li, R. Ward, J.K. Pritchard, Informativeness of genetic markers for inference of ancestry, Am. J. Hum. Genet. 73 (2003) 1402–1422.
[9] A. Salas, A. Acosta, V. Álvarez-Iglesias, M. Cerezo, C. Phillips, M.V. Lareu, Á. Carracedo, The mtDNA ancestry of admixed Colombian populations, Am. J. Hum. Biol. (2008) [epub ahead of print].
[10] H. Oberacher, F. Pitterl, G. Huber, H. Niederstätter, M. Steinlechner, W. Parson, Increased forensic efficiency of DNA fingerprints through simultaneous resolution of length and nucleotide variability by high-performance mass spectrometry, Hum. Mutat. 29 (3) (2008) 427–432.