Forensic Science International 151 (2005) 59–69 www.elsevier.com/locate/forsciint
On human STR sub-population structure
Diane J. Rowold, Rene J. Herrera*
Department of Biological Sciences, Florida International University, University Park Campus, Miami, FL 33199, USA
Received 4 November 2003; received in revised form 12 July 2004; accepted 15 July 2004. Available online 21 September 2004
Abstract This study is an investigation of genetic variation among the CODIS (combined DNA index system) STR (short tandem repeat) reference collections. The analysis was conducted in two parts. The first is an overall comparison (G-test) of allelic frequency distributions for 12 STR loci among 19 forensic databases representing various major groups and ethnic populations. In the second phase of the study, the impact of database replacement on DNA profile frequencies and inclusion probabilities is examined. The G-test results reveal clear allelic frequency differences among the major divisions, and, in some cases, among ethnically distinct subdivisions of the same major group affiliation. Other results indicate that database substitution may sometimes lead to substantial alterations in individual inclusion probabilities. Furthermore, there are numerous instances in which an allele present in some databases is missing in others. Fortunately, in most cases, these effects may not be forensically significant due to the increased discriminatory power of the STR markers employed in the CODIS system. However, differences in inclusion probabilities may become critical in situations in which DNA quantity is severely limited and/or compromised by degradation since, in these cases, the power of forensic STR analysis may be mitigated by reducing the number of informative loci. © 2004 Elsevier Ireland Ltd. All rights reserved. Keywords: Forensic marker systems; Forensic database; Sub-population structure; STRs; CODIS
1. Introduction
DNA forensic technology has evolved considerably since its inception in the late seventies. Yet, some controversies still persist [1]. One, for example, concerns the appropriate procedure to assess the statistical significance of a match between a suspect’s DNA and the crime scene sample [1]. Another matter of contention involves the impact the reference database may have on the profile’s probability of occurrence [1]. In a previous study [2], the issue of reference database usage was examined with respect to the polymarker, HLA-DQA1 and D1S80 loci. More specifically, the authors explored the effect of inter-population genetic variation on probability profiles. The results of their study indicate that, in general, there is a marked reduction in probability of occurrence, sometimes of up to five orders of magnitude, when a database other than that compiled from the authentic reference population is used to establish genotypic frequencies. Furthermore, this decrease appeared to be more pronounced as the genetic distance between biogeographically distinct populations increased. In addition, these previous results revealed that ethnic structure (i.e., intra-major group genetic variation) is often significant, and, as with inter-major group differences, sometimes leads to situations in which one or several of the alleles present in the test group genotypes are absent from the reference database.
* Corresponding author. Tel.: +1 305 348 1258; fax: +1 305 348 1259. E-mail address: [email protected] (R.J. Herrera).
0379-0738/$ – see front matter © 2004 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.forsciint.2004.07.007
The current investigation represents a natural progression of the Gallo et al. study [2]. This time, however, we focus on the consequences of inter-population variation and database selection using data generated from the PCR analyses of 12 microsatellite or STR (short tandem repeats) loci, a subset of the CODIS (combined DNA index system) STR panel developed by the FBI. This STR marker set contains substantially higher numbers of both alleles and loci than the polymarker/HLA-DQA1/D1S80 systems employed by Gallo et al. (12 loci and a mean of 14 alleles/locus versus 7 loci and a mean of 7 alleles/locus, respectively). The central question in this analysis is whether or not the degree of intra-population variation inherent in the CODIS STR databases is sufficient to provide the resolution necessary to counteract frequency differences among major group and/or ethnically distinct populations. To ascertain whether the degree of genetic relatedness between populations substantially affects changes in probability of occurrence (pOcc) scores upon CODIS database substitution, we compare, systematically, the pOcc values of individual STR profiles across a diverse array of reference databases. We also infer genetic relationships among the corresponding reference populations via standard G-tests for homogeneity [3].
2. Materials and methods

2.1. Population samples

Inclusion probabilities were calculated for 120 STR genotypes representing four distinct, Minnesota-based collections (30 individuals each from African-American, Caucasian, Hispanic and Native American populations) using 16 population-specific databases. Twelve CODIS loci were examined: CSF1PO, FGA, TH01, TPOX, VWA, D3S1358, D5S818, D7S820, D8S1179, D13S317, D18S51 and D21S11. These samples were collected and analyzed by the State of Minnesota Bureau of Criminal Apprehension Laboratory (SMBCAL), Minnesota, USA, and the STR frequency data were kindly provided by Terry Laber, Assistant Director of SMBCAL. For the remainder of this article, these sample collections are designated as MAA (Minnesota African-American), MC (Minnesota Caucasian), MH (Minnesota Hispanic) and MNA (Minnesota Native American) test groups (TG), to indicate the respective populations and to differentiate them from the reference databases. Four of the reference databases were derived from the same Minnesota-based populations as the four sets of test individuals. These four databases from Minnesota do not include the test individuals examined in the study. The genotypic information for these four databases was generously provided by SMBCAL. The other reference databases were either taken directly from the literature or provided by law enforcement agencies through an official web site (http://www.cfs.ca). In addition to the 16 reference databases used in probability calculations, three others were included in the G-test. Relevant information on the 19 reference collections used in this study is presented in Table 1.

At this point, we, the authors, feel that a commentary on the use (or, in our case, the lack) of the term “race” is necessary. “Race” as a biological concept refers to a group of individuals sharing a specific genetic background (ancestry) that is sufficiently different from other such groups. “Race” is also used to denote cultural, linguistic and social cohesiveness. Native Americans and Caucasians represent both distinct biological and social–cultural units since their members share a common genetic ancestry with little admixture as well as a similar cultural history and linguistic patterns. Hispanics, however, may not fit the biological definition of race since these populations represent admixtures comprised of several genetic groups. The African-American populations, to some extent, also have a mixed genetic heritage, although the sub-Saharan African populations have been known, historically, to provide the major genetic contribution to these populations. To avoid any confusion over the usage of the word “race” in this study, we employ the phrase “major group” to refer to the four major population subdivisions: Americans of African descent, Caucasians, Hispanics and Native Americans.

Table 1 Reference database information

| Database | Abbreviation | N^a | Source |
|---|---|---|---|
| Minnesota African-American | MAA | 120 | SMBCAL |
| Minnesota Caucasian | MC | 120 | SMBCAL |
| Minnesota Hispanic | MH | 120 | SMBCAL |
| Minnesota Native American | MNA | 170 | SMBCAL |
| African-American CODIS | AAC | 179–210 | [4] |
| Bahamian CODIS | BC | 157–162 | [4] |
| Caucasian CODIS^b | CC | 195–203 | [4] |
| Hispanic CODIS^c | HC | 203–209 | [4] |
| Jamaican CODIS | JC | 194–244 | [4] |
| Trinidadian CODIS | TC | 78–85 | [4] |
| Northern Ontario Aboriginal RCMP^d | NOnRC | 256–258 | [5] |
| Salishan Aboriginal RCMP^e | SalRC | 208 | [5] |
| Saskatchewan Aboriginal RCMP | SasRC | 208–210 | [5] |
| Italian Caucasian | CIt | 223 | [6] |
| Maine Caucasian | CMa | 151 | [7] |
| Swiss Caucasian | CSw | 206 | [8] |
| Japanese RCMP^f | JapRC | 172 | [9] |
| Asian CFS^g | ACF | 195 | http://www.cfs.ca |
| East Indian CFS | EICF | 167 | http://www.cfs.ca |

a A number range indicates that the number of chromosomes (N) sampled varies among loci.
b Caucasians from the United States.
c Hispanics from Southwestern United States.
d RCMP: Royal Canadian Mounted Police.
e Native Americans from Coastal British Columbia.
f The Japanese, Asian and East Indian collections were used only for the G-test, phylogenetic analysis and PCA.
g CFS: Centre of Forensic Sciences.

2.2. Data and statistical analysis

Probability of occurrence (pOcc) values for the 30 individuals in each of the four test groups (i.e., MAATG, MCTG, MHTG and MNATG) were calculated using the corresponding Minnesota database as well as a number (9–13) of the other population-specific databases. Allelic frequencies at each of the 12 STR loci were either provided by the source publication (CODIS [4], CIt [6], CMa [7] and CSw [8]) or, as in the cases of the CFS, RCMP and SMBCAL data, calculated directly by the authors of the current study using the gene count method [10]. Genotypic frequencies incorporating all 12 loci were generated according to the standard procedure outlined in [11]. All forensic STR databases used in the current study are assumed to be in both Hardy-Weinberg and linkage equilibrium.

The reciprocals of the probability of occurrence (pOcc⁻¹) for each of the 30 STR profiles within a single test group were generated across all of the databases comprising the comparison panel. The pOcc⁻¹ values derived using the corresponding Minnesota databases were then systematically compared, in a matched-pair fashion, with those obtained from each of the other databases. For every paired comparison, the exponential changes in pOcc⁻¹ values were calculated for each individual profile and grouped with respect to both magnitude and sign of the exponential difference. The proportion of individuals in each category was ascertained. Several times during our analysis, a population database failed to contain one or several alleles present in a given profile. No pOcc⁻¹ scores are obtained for these profiles (since 1 divided by 0 is undefined) and, thus, these cases are excluded from the calculations of exponential differences in pOcc⁻¹ values.

For each of the four sets of pair-wise comparisons, the percentage of individuals with missing alleles and the corresponding Nei’s genetic distance (GD) values were generated. Both Pearson’s correlation and Spearman’s rank correlation analyses were performed on the resulting data [12]. To determine any significant differences in allelic frequencies among the various STR databases, standard G-tests were executed on pair-wise combinations of databases using George Carmody’s 1990 program [3].
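The profile calculation described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the "standard procedure" amounts to the usual product rule under Hardy-Weinberg and linkage equilibrium (p² for homozygotes, 2pq for heterozygotes, multiplied across loci), and that an "exponential change" is the difference in the power-of-ten exponent of pOcc⁻¹ between two databases. All function names and allele frequencies below are hypothetical.

```python
import math

def genotype_frequency(a1, a2, allele_freqs):
    """Hardy-Weinberg genotype frequency at one locus.

    Returns None when either allele is absent from the database,
    mirroring the exclusion of such profiles described in the text.
    """
    p = allele_freqs.get(a1)
    q = allele_freqs.get(a2)
    if not p or not q:
        return None
    return p * p if a1 == a2 else 2 * p * q

def profile_pocc(profile, database):
    """Product of per-locus genotype frequencies (assumes linkage equilibrium)."""
    pocc = 1.0
    for locus, (a1, a2) in profile.items():
        freq = genotype_frequency(a1, a2, database[locus])
        if freq is None:
            return None  # missing allele: no pOcc^-1 score is defined
        pocc *= freq
    return pocc

def exponent_change(pocc_authentic, pocc_substitute):
    """Assumed definition: change in the power-of-ten exponent of pOcc^-1."""
    e_auth = math.floor(math.log10(1.0 / pocc_authentic))
    e_sub = math.floor(math.log10(1.0 / pocc_substitute))
    return e_sub - e_auth

# Hypothetical two-locus profile and invented allele frequencies.
profile = {"TH01": ("6", "9.3"), "TPOX": ("8", "8")}
db_authentic = {"TH01": {"6": 0.2, "9.3": 0.3}, "TPOX": {"8": 0.5}}
db_substitute = {"TH01": {"6": 0.02, "9.3": 0.3}, "TPOX": {"8": 0.5}}
db_incomplete = {"TH01": {"6": 0.2}, "TPOX": {"8": 0.5}}

pocc_auth = profile_pocc(profile, db_authentic)
pocc_sub = profile_pocc(profile, db_substitute)
change = exponent_change(pocc_auth, pocc_sub)
missing = profile_pocc(profile, db_incomplete)
```

A positive `change` means the profile appears rarer under the substitute database, and a `None` result marks a profile that would be counted among the individuals with missing alleles.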
3. Results

3.1. G-test

Table 2 presents the results of the G-test for all pair-wise comparisons of databases. In all four cases, the differences between each Minnesota test group and its corresponding database are not significant. Furthermore, three of the four Minnesota databases are statistically congruent with several or all of the databases within their major group. In contrast, the Native American database from Minnesota is statistically different (at α = 0.05) from all other populations examined (Native American or otherwise). Out of all possible intra-major group comparisons, 4/10 African-American/Caribbean, 11/14 Caucasian, 1/1 Hispanic and 0/6 Native American indicate homogeneity (at α = 0.05). To reduce the probability of a type I error in our G-test analysis, we applied a correction procedure for multiple comparisons which is widely used in forensics and population genetics. As expected, the number of significant comparisons at the adjusted level (the corrected α = 0.05/324 comparisons or 0.00015) decreased slightly from 304 to 295. Out of the nine comparisons affected, the majority involved different ethnicities from the same major group (Table 2). One drawback to the correction procedure, however, is that it increases the potential for type II error, especially when, as in our research, numerous tests are performed (324). Thus, some comparisons which are notably significant when few tests are conducted (e.g., MAA/TC at a P-value of 0.006) become non-significant when many comparisons are included in the analysis. In the discussion below, we report significance at the α = 0.05 level.

3.2. Genetic distance, missing alleles and probability calculations

The percentage of individuals with missing alleles upon database replacement is presented for each of the four test groups (MAATG, MCTG, MHTG and MNATG in Tables 3–6, respectively). To facilitate evaluation, databases are listed
Database MAA MAA MC MH MNA AAC BC CC HC JC TC CMa CIt CSw NOnRC SalRC SasRC AsCF EICF JapRC
MC
MH
MNA
AAC
501 915.1 167.4 0.145** 488.4 0 0.996** 186 406.2 529.8 0 0 0.055** 253.4 564 0 0 0 0.830** 1021.8 0.206 0 0 0 0.447 0 0 0 0.127 0 0.994 0 0 0 0 0 0.144 o 0 0 0 0 0 0 0.006* 0 0 0 0 0 0.786 0 0 0 0 0.149 0 0 0 0 0.429 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.037* 0 0 0 0 0 0 0 0.413 0.037* 0 0 0 0 0 0 0
BC
CC
HC
149.7 542.5 545.5 974.8 174.3
599.3 91.1 282.5 632.8 722.9 703.7
802.1 204.4 184.4 579.8 616.2 308.4 769.5 323.8 106.3 145.1 148 800.7 327.3 256.9 294 273.8 1378.9 550.8 537 671.4 963.4 213.1 192.6 675.2 770.6 900.2 167.3 178.5 672.4 741.3 455.3 1005.2 395.6 112.5 161.3 1284.1 486.2 383.3 467.2 0 259.2 929.3 1040.6 0 0 391.4 373.5 0 0 0 165.2 0 0 0 0.016* 0 0 0 0.115 0.028* 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.235 0.889 0 0 0 0 0
0 0 0.113 0.007* 0 0 0 0 0 0 0 0 0
0 0 0 0.772 0.032* 0.044* 0 0 0 0 0.051 0
JC
TC
CMa
Clt
CSw
NOnRC SalRC SasRC AsCF
EICF JapRC
635.1 127 298.3 653.1 807.7 775 159.1 486.1 111.2 396.6 141.3 165.7
1613.8 1191.9 837.9 567.2 1799.4 1726.2 1586 897.9 2186.8 1110.5 1339.5 1654.7 1619.8
343.1 114.4 149.2 272.5 374.4 358.3 142.1 197.4 484.9 197.2 122 105.1 126.1 706.4 433 345.9 177.1
0 0 0 0 0.274 0
0 0 0 0 0
1035 952.9 720.8 690.4 533 457.4 445.2 350.4 389.1 417.5 152.4 676.1 1133.1 1018.7 853.1 1049.5 979.4 806.2 943.7 782.2 586.4 469.2 338.4 534.8 1380.7 1288.8 1082.6 709.3 595.7 444.6 865.5 673.2 501.6 977.7 796.3 530.2 936.2 795.2 526.8 0503.9 301.8 1125.1 280.9 667.7 0 661.4 0 0 0 0 0.001* 0 0 0
984.6 570.8 511 794.8 1210.9 1110.5 769.1 688.6 1059.2 565.1 647.5 669.1 722.6 1375.6 841.5 747.8 185.6 190.7
0
The G-test scores and P-values occupy the upper and lower diagonals, respectively.
* P-values which become non-significant with the correction for multiple comparisons (α = 0.00015).
** P-value for the comparison between Minnesota test group and corresponding Minnesota database (MAATG/MAA, MCTG/MC, MHTG/MH and MNATG/MNA). The G-test scores are 136, 75, 140.5 and 97.4, respectively.
Table 2 G-test results
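The adjusted significance level quoted for Table 2 (0.05/324, about 0.00015) has the form of a Bonferroni-style division of α by the number of tests. The sketch below assumes that this is the correction applied; the function names are illustrative only. It shows how the MAA/TC comparison (P = 0.006) is significant at the nominal level but not at the adjusted one.

```python
def adjusted_alpha(alpha, n_tests):
    """Bonferroni-style threshold: nominal alpha divided by the number of tests."""
    return alpha / n_tests

def is_significant(p_value, threshold):
    """A comparison is called significant when its P-value falls below the threshold."""
    return p_value < threshold

NOMINAL_ALPHA = 0.05
N_COMPARISONS = 324          # number of comparisons stated in the text

alpha_adj = adjusted_alpha(NOMINAL_ALPHA, N_COMPARISONS)

p_maa_tc = 0.006             # MAA/TC P-value quoted in the text
nominal_call = is_significant(p_maa_tc, NOMINAL_ALPHA)
adjusted_call = is_significant(p_maa_tc, alpha_adj)
```

The stricter threshold lowers the chance of a false positive across the whole family of tests at the cost of more false negatives, which is the type II trade-off noted in Section 3.1.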
Table 3 Missing alleles: Minnesota African-American test group (MAATG)

| Database^a | GD^b | Ind w/Ma^c |
|---|---|---|
| MAA | 0 | 6.7 |
| AAC | 0.015 | 3.3 |
| BC | 0.015 | 6.7 |
| JC | 0.017 | 3.3 |
| TC | 0.035 | 20 |
| MC | 0.088 | 70 |
| CC | 0.091 | 63.3 |
| MH | 0.093 | 20 |
| MNA | 0.153 | 56.7 |
| NOnRC | 0.301 | 90 |

Table 5 Missing alleles: Minnesota Hispanic test group (MHTG)

| Database^a | GD^b | Ind w/Ma^c |
|---|---|---|
| MH | 0 | 13.3 |
| HC | 0.014 | 6.7 |
| MC | 0.031 | 13.3 |
| MNA | 0.037 | 16.7 |
| CC | 0.039 | 33.3 |
| TC | 0.066 | 23.3 |
| BC | 0.080 | 36.7 |
| AAC | 0.087 | 6.7 |
| MAA | 0.093 | 10 |
| JC | 0.109 | 23.3 |
| NOnRC | 0.151 | 46.7 |

a This is a subset of the 16 reference databases. We included all the intra-major group databases and at least one from each of the other major groups.
b Nei’s genetic distance between the corresponding Minnesota reference database and the one listed.
c Percentage of TG individuals with missing alleles (i.e., alleles not present in the reference database).
Table 4 Missing alleles: Minnesota Caucasian test group (MCTG)

| Database^a | GD^b | Ind w/Ma^c |
|---|---|---|
| MC | 0 | 6.7 |
| CC | 0.006 | 10 |
| CSw | 0.011 | 3.3 |
| CIt | 0.012 | 10 |
| CMa | 0.013 | 6.7 |
| MH | 0.031 | 3.3 |
| HC | 0.040 | 0 |
| TC | 0.065 | 13.3 |
| MNA | 0.076 | 6.7 |
| BC | 0.082 | 20 |
| AAC | 0.085 | 0 |
| MAA | 0.088 | 6.7 |
| JC | 0.102 | 16.7 |
| NOnRC | 0.247 | 66.7 |

a This is a subset of the 16 reference databases. We included all the intra-major group databases and at least one from each of the other major biogeographical groups.
b Nei’s genetic distance between the corresponding Minnesota reference database and the one listed.
c Percentage of TG individuals with missing alleles (i.e., alleles not present in the reference database).

in ascending order of Nei’s genetic distance (GD) [13] obtained from pair-wise comparisons of the authentic Minnesota reference collection to each of the other CODIS databases in the panel. In all four sets of tests, we note a strong major group component in the order of the databases. For the MAATG (Table 3), the African-American/Caribbean populations are first, followed by the Caucasian, Hispanic and Native American, as expected, whereas for both the MCTG and MHTG (Tables 4 and 5, respectively), the MNA reference database is the only exception in an otherwise clean partitioning of the major groups. The ranked GD
scores for the MNATG missing allele distribution (Table 6) indicate that the MNA and SasRC are the most genetically similar, and the five African-American/Caribbean databases the most distant. The results of both the Pearson’s and Spearman’s rank correlation tests reveal a very strong positive linear relationship between GD and the percentage of individuals with missing alleles (scatter plots not shown) for only one of the test groups, MAATG (P-value < 0.0001 for both tests at α = 0.05). The MCTG scatter plot depicts a linear association between the percentage of individuals with missing alleles and the genetic distance of the database employed, a result corroborated by the outcome of the Pearson correlation test (P-value < 0.0001 at α = 0.05). However, the Spearman’s rank analysis (P-value = 0.055 at α = 0.05) suggests that the correlation is almost, but not quite, significant. Although the MHTG scatter plot hints at a weak positive trend between these variables, there is no distinct linear correlation (P-values of 0.057 and 0.413 at α = 0.05 for Pearson’s and Spearman’s tests, respectively) and, for MNATG, no linear relationship is detected by either the shape of the scatter plot or the statistical analyses (P-values are 0.190 and 0.270 at α = 0.05 for Pearson’s and Spearman’s correlation tests, respectively).

Tables 7–10 summarize the exponential changes in individual pOcc⁻¹ values associated with each database substitution for the four test groups. In Table 7, the majority of the MAATG individuals display no difference in pOcc⁻¹ scores when Bahamian (BC) and Jamaican (JC) databases are consulted. A sizable proportion of profiles fall in the zero exponential change category for both AAC (47%) and TC (37%) as well. A scatter plot of GD score versus percentage of profiles undergoing zero exponential change (not shown), along with the results of both Pearson’s (P-value < 0.0001 at α = 0.05) and Spearman’s (P-value < 0.0001 at α = 0.05) correlation tests, suggests a highly significant inverse linear relationship between the two variables. A change of ≥3 orders of magnitude is observed with all inter-racial database replacements. A MAA:MH substitution results in a shift of this magnitude for 27% of the individuals. Furthermore, 10% of the profiles incur differences of ≥5 orders of magnitude with an MAA:NOnRC replacement. The net change is negative (i.e., the individual’s genotype is more common in the replacement versus the MAA reference database) only with respect to the two closest databases, AAC and BC.

The pOcc⁻¹ exponential change pattern for the set of MCTG comparisons (Table 8) is very similar to that for the MAATG. When a Caucasian database is substituted for another Caucasian database, a high percentage of MCTG pOcc⁻¹ scores are exponentially unchanged (mean of 60%). In general, the proportion of cases sustaining zero exponential change decreases with an increase in the GD index. A scatter plot (not shown), as well as the outcomes of both Pearson’s and Spearman’s correlation tests (P-value = 0.004 and P-value < 0.0001, respectively, both at α = 0.05), suggests a strong inverse linear relationship between the two variables. Only one of the four comparisons involving the Caucasian population-specific databases (CSw) yields an exponential change of two. In contrast, all nine non-Caucasian databases generated differences in pOcc⁻¹ values of three or more orders of magnitude, several of which involved a substantial proportion of individual profiles (16.7, 23.3 and 13.3% for BC, AAC and JC, respectively). In the case of the most genetically distant group, NOnRC, one person out of 30 (3%) experiences a shift of six orders of magnitude. All database substitutions, except those involving CC and CMa, result in a substantial net exponential increase (probability of occurrence decreased).

The exponential changes in pOcc⁻¹ for the MHTG profiles are presented in Table 9. As in the MAATG and MCTG exponential comparison sets, the most closely related databases display the highest percentages of individuals in the zero exponential change category. Furthermore, none of these four substitutions exceeds an exponential change of two (either + or −). A scatter plot of GD versus percent of individuals with zero exponential change in pOcc⁻¹ scores (not shown) indicates a distinct inverse linear relationship between these two variables, a result reinforced by the outcomes of the two correlation tests [Pearson’s (P-value = 0.003) and Spearman’s (P-value < 0.0001), both at α = 0.05]. All of the six most distant groups exhibit changes of ≥3 orders of magnitude. The maximum shift, a 10⁴ exponential decrease in probability, occurs in the two most distant groups, JC and NOnRC. In all reference database probability evaluations, incidents of exponential increase are considerably more common than those involving exponential reduction.

Table 10 provides the exponential changes in pOcc⁻¹ for the MNATG genotypes. The two genetically closest databases, SasRC and HC, displayed the largest percentages of profiles with zero exponential change. NOnRC and SalRC are next in line in the zero exponential change category. A scatter plot reveals a strong inverse linear correlation between the GD and the proportion of profiles at zero exponential change in pOcc⁻¹, and both Pearson’s and Spearman’s correlation tests yield highly significant P-values (0.002 and <0.0001, respectively, at α = 0.05). For all 12 database replacements, the exponential increases in pOcc⁻¹ are much more frequent than exponential decreases and, in most cases, the difference is substantial. The use of two Native American databases, NOnRC and SalRC, resulted in pOcc⁻¹ exponential differences of +3 (10 and 3%, respectively; Table 10). In contrast, for the other three test groups, no database substitution within the same major group induced exponential changes exceeding two orders of magnitude. Use of the African-American/Caribbean databases, the most genetically dissimilar reference collections to the Minnesota Native American, caused positive exponential changes of four or more.

Table 6 Missing alleles: Minnesota Native American test group (MNATG)

| Database^a | GD^b | Ind w/Ma^c |
|---|---|---|
| MNA | 0 | 3.3 |
| SasRC | 0.024 | 3.3 |
| HC | 0.034 | 3.3 |
| MH | 0.037 | 3.3 |
| SalRC | 0.073 | 13.3 |
| MC | 0.076 | 6.7 |
| NOnRC | 0.080 | 20 |
| CC | 0.088 | 43.3 |
| TC | 0.102 | 6.7 |
| BC | 0.125 | 43.3 |
| AAC | 0.131 | 3.3 |
| MAA | 0.153 | 3.3 |
| JC | 0.167 | 43.3 |

a This is a subset of the 16 reference databases. We included all the intra-major group databases and at least one from each of the other major biogeographical groups.
b Nei’s genetic distance between the corresponding Minnesota reference database and the one listed.
c Percentage of TG individuals with missing alleles (i.e., alleles not present in the reference database).

In Tables 7–10, columns headed with a negative number give exponential decreases, columns headed with a positive number give exponential increases, and all entries are percentages of individual profiles.

Table 7 Exponential change: Minnesota African-American test group (MAATG)

| Database | Total decrease^a | −6 | −2 | −1 | 0 | +1 | +2 | +3 | +4 | +5 | +7 | +8 | Total increase^b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAC | 26.6 | 0 | 3.3 | 23.3 | 46.7 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 20 |
| BC | 26.6 | 0 | 3.3 | 23.3 | 56.7 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| JC | 16.7 | 0 | 6.7 | 10 | 56.7 | 16.7 | 3.3 | 0 | 0 | 0 | 0 | 0 | 20 |
| TC | 16.7 | 0 | 0 | 16.7 | 36.7 | 23.3 | 3.3 | 0 | 0 | 0 | 0 | 0 | 26.6 |
| MC | 10 | 3.3 | 0 | 6.7 | 6.7 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| CC | 6.6 | 0 | 3.3 | 3.3 | 6.7 | 13.3 | 6.7 | 3.3 | 0 | 0 | 0 | 0 | 23.3 |
| MH | 6.7 | 0 | 0 | 6.7 | 13.3 | 10 | 20 | 20 | 6.7 | 0 | 0 | 0 | 56.7 |
| MNA | 3.3 | 0 | 0 | 3.3 | 6.7 | 6.7 | 6.7 | 16.7 | 0 | 0 | 0 | 0 | 30.1 |
| NOnRC | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3.3 | 3.3 | 3.3 | 9.9 |

a Percentage of individual profiles that underwent negative exponential change in pOcc⁻¹ upon database substitution (profile became more frequent).
b Percentage of individual profiles that underwent positive exponential change in pOcc⁻¹ upon database substitution (profile became less frequent).

Table 8 Exponential change: Minnesota Caucasian test group (MCTG)

| Database | Total decrease^a | −2 | −1 | 0 | +1 | +2 | +3 | +4 | +6 | Total increase^b |
|---|---|---|---|---|---|---|---|---|---|---|
| CC | 10 | 0 | 10 | 70 | 6.7 | 0 | 0 | 0 | 0 | 6.7 |
| CSw | 13.3 | 0 | 13.3 | 40 | 13.3 | 23.3 | 0 | 0 | 0 | 36.6 |
| CIt | 3.3 | 0 | 3.3 | 60 | 23.3 | 0 | 0 | 0 | 0 | 23.3 |
| CMa | 10 | 0 | 10 | 70 | 10 | 0 | 0 | 0 | 0 | 10 |
| MH | 16.7 | 0 | 16.7 | 26.7 | 40 | 6.7 | 3.3 | 0 | 0 | 50 |
| HC | 6.6 | 3.3 | 3.3 | 30 | 36.7 | 16.7 | 3.3 | 0 | 0 | 56.7 |
| TC | 3.3 | 0 | 3.3 | 20 | 36.7 | 16.7 | 6.7 | 0 | 0 | 60.1 |
| MNA | 6.7 | 0 | 6.7 | 13.3 | 43.3 | 20 | 6.7 | 0 | 0 | 70 |
| BC | 3.3 | 0 | 3.3 | 6.7 | 30 | 20 | 16.7 | 0 | 0 | 66.7 |
| AAC | 3.3 | 0 | 3.3 | 20 | 23.3 | 23.3 | 23.3 | 0 | 0 | 69.9 |
| MAA | 3.3 | 0 | 3.3 | 20 | 26.7 | 36.7 | 3.3 | 0 | 0 | 66.7 |
| JC | 3.3 | 0 | 3.3 | 3.3 | 23.3 | 23.3 | 13.3 | 10 | 0 | 69.9 |
| NOnRC | 0 | 0 | 0 | 3.3 | 3.3 | 3.3 | 10 | 6.7 | 3.3 | 26.6 |

a Percentage of individual profiles that underwent negative exponential change in pOcc⁻¹ upon database substitution (profile became more frequent).
b Percentage of individual profiles that underwent positive exponential change in pOcc⁻¹ upon database substitution (profile became less frequent).

Table 9 Exponential change: Minnesota Hispanic test group (MHTG)

| Database | Total decrease^a | −2 | −1 | 0 | +1 | +2 | +3 | +4 | Total increase^b |
|---|---|---|---|---|---|---|---|---|---|
| HC | 13.3 | 0 | 13.3 | 36.7 | 30 | 6.7 | 0 | 0 | 36.7 |
| MC | 6.7 | 0 | 6.7 | 43.3 | 26.7 | 6.7 | 0 | 0 | 33.4 |
| MNA | 10 | 6.7 | 3.3 | 30 | 30 | 10 | 0 | 0 | 40 |
| CC | 13.3 | 0 | 13.3 | 30 | 20 | 0 | 0 | 0 | 20.0 |
| TC | 13.3 | 0 | 13.3 | 27 | 23.3 | 6.7 | 6.7 | 0 | 36.7 |
| BC | 13.3 | 3.3 | 10.0 | 10 | 13.3 | 16.7 | 10 | 0 | 40 |
| AAC | 13.3 | 10 | 3.3 | 13.3 | 23.3 | 16.7 | 16.7 | 0 | 56.7 |
| MAA | 13.3 | 3.3 | 10 | 6.7 | 40 | 13.3 | 10 | 0 | 63.3 |
| JC | 10 | 3.3 | 6.7 | 6.7 | 23.3 | 3.3 | 13.3 | 6.7 | 46.6 |
| NOnRC | 3.3 | 0 | 3.3 | 3.3 | 13.3 | 16.7 | 13.3 | 3.3 | 46.6 |

a Percentage of individual profiles that underwent negative exponential change in pOcc⁻¹ upon database substitution (profile became more frequent).
b Percentage of individual profiles that underwent positive exponential change in pOcc⁻¹ upon database substitution (profile became less frequent).

Table 10 Exponential change: Minnesota Native American test group (MNATG)

| Database | Total decrease^a | −3 | −2 | −1 | 0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | Total increase^b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SasRC | 13.3 | 0 | 0 | 13.3 | 63.3 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 20 |
| HC | 6.7 | 0 | 0 | 6.7 | 30 | 36.7 | 16.7 | 3.3 | 3.3 | 0 | 0 | 0 | 60 |
| MH | 3.3 | 0 | 0 | 3.3 | 16.7 | 50 | 16.7 | 6.7 | 3.3 | 0 | 0 | 0 | 76.7 |
| SalRC | 13.3 | 0 | 0 | 13.3 | 23.3 | 36.7 | 10 | 3.3 | 0 | 0 | 0 | 0 | 50 |
| MC | 30 | 6.7 | 10 | 13.3 | 10 | 20 | 10 | 16.7 | 6.7 | 0 | 0 | 0 | 53.4 |
| NOnRC | 6.7 | 0 | 0 | 6.7 | 26.7 | 23.3 | 13.3 | 10.0 | 0 | 0 | 0 | 0 | 46.6 |
| CC | 3.3 | 0 | 3 | 0 | 23.3 | 10 | 13.3 | 3.3 | 3.3 | 0 | 0 | 0 | 29.9 |
| TC | 6.7 | 0 | 0 | 6.7 | 10 | 16.7 | 23.3 | 20.0 | 10 | 6.7 | 0 | 0 | 76.7 |
| BC | 3.3 | 0 | 0 | 3.3 | 6.7 | 13.3 | 6.7 | 16.7 | 10 | 0 | 0 | 0 | 46.7 |
| AAC | 6.7 | 0 | 0 | 6.7 | 0 | 6.7 | 10 | 16.7 | 26.7 | 20 | 0 | 10 | 90.1 |
| MAA | 3.3 | 0 | 0 | 3.3 | 3.3 | 10 | 20 | 20.0 | 20 | 10 | 10 | 0 | 90 |
| JC | 6.7 | 0 | 0 | 6.7 | 0.0 | 3.3 | 23.3 | 3.3 | 10 | 6.7 | 3.3 | 0 | 49.9 |

a Percentage of individual profiles that underwent negative exponential change in pOcc⁻¹ upon database substitution (profile became more frequent).
b Percentage of individual profiles that underwent positive exponential change in pOcc⁻¹ upon database substitution (profile became less frequent).
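The correlation analyses reported in this section can be illustrated with the MAATG values from Table 3 (GD against the percentage of individuals with missing alleles). The sketch below is only an illustration: it computes the Pearson and Spearman coefficients in plain Python, handles ties with average ranks, and does not compute P-values. The software the authors used is not specified beyond reference [12], and the function names here are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Table 3 (MAATG): MAA, AAC, BC, JC, TC, MC, CC, MH, MNA, NOnRC
gd = [0, 0.015, 0.015, 0.017, 0.035, 0.088, 0.091, 0.093, 0.153, 0.301]
pct_missing = [6.7, 3.3, 6.7, 3.3, 20, 70, 63.3, 20, 56.7, 90]

r = pearson_r(gd, pct_missing)
rho = spearman_rho(gd, pct_missing)
```

Both coefficients come out strongly positive for this test group, in line with the relationship described above. Assessing significance would additionally require a P-value, which this sketch leaves out.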
4. Discussion
Once it has been established that the evidence profile and that of the suspect are identical, the next step is to determine the frequency of this genotype in the appropriate population [1,14]. This requires a judicious use of reference database(s) in performing the necessary calculations. For cases in which the relevant database(s) are not defined, or not available, another is often substituted. This database replacement may result in substantial probability differences with respect to the target genotype since allelic frequencies can vary considerably among databases. This was found to be true in a comparison study involving polymarker/DQA1/D1S80 databases [2]. In the current analysis, we are interested in the magnitude of CODIS STR database variation and its impact on probability calculations. An understanding of the patterns inherent in CODIS STR diversity may be useful in planning strategies for database sampling and in establishing guidelines for their forensic application.
The polymarker/DQA1/D1S80 G-test analysis of Gallo et al. indicated that genetic homogeneity is restricted to pairwise comparisons between the test group and its corresponding Minnesota database (at α = 0.05). In contrast, approximately half of all intra-major group CODIS comparisons (16/30) yield non-significant results (at α = 0.05). The differences between these G-test findings most likely stem from the greater array of alleles per locus and loci in the CODIS STR system, a feature which amplifies the observed intra-population variation and, thus, the power of discrimination in an exponential manner. Also, high STR mutation rates, multiple routes to the same-sized alleles and length constraints [15] may further obscure statistical distinctions among CODIS databases by intensifying the potential for allelic length convergence among ethnic populations. Nevertheless, since almost half (14/30) of the intra-major group database combinations generated large G scores and significant P-values (α = 0.05), substantial ethnic STR variation persists despite this potential homogenization effect. As with the polymarker/DQA1/D1S80 system of Gallo et al., a parallel exists between genetic distance and the G-test results. On numerous occasions during the database comparisons, we encountered situations in which the database failed to contain an allele displayed by a test group profile. In several cases, the pOcc or pOcc⁻¹ calculations are affected by multiple missing alleles at several loci. Although none of the four Minnesota databases contains the full allelic array of its corresponding test group (a mean of 2.3 missing alleles over all loci per TG), the incidence of affected alleles is much higher with the use of the replacement collections (a mean of 7.3 missing alleles over all loci per TG). The number of unrepresented alleles increases when the reference database is outside the major group of the TG set (a mean of 10.6 missing alleles over all TGs).
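For readers unfamiliar with the statistic behind these comparisons, the following sketch computes a log-likelihood ratio (G) statistic for homogeneity from a table of allele counts, with one row per database and one column per allele. It is a generic textbook formulation, G = 2 Σ O ln(O/E), and is not a reconstruction of Carmody's program [3], whose internals are not described here. The allele counts are invented for illustration.

```python
import math

def g_statistic(counts):
    """G = 2 * sum(O * ln(O / E)) over all cells of a contingency table.

    counts: list of rows (one per database), each a list of allele counts.
    Expected counts E come from the row and column totals; empty cells
    contribute nothing to the sum.
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            if observed > 0:
                expected = row_totals[i] * col_totals[j] / grand_total
                g += observed * math.log(observed / expected)
    return 2.0 * g

def degrees_of_freedom(counts):
    """(rows - 1) * (columns - 1) for a test of homogeneity."""
    return (len(counts) - 1) * (len(counts[0]) - 1)

# Hypothetical allele counts at a single locus for two databases.
same_distribution = [[10, 20, 30], [10, 20, 30]]
different_distribution = [[30, 10], [10, 30]]

g_same = g_statistic(same_distribution)
g_diff = g_statistic(different_distribution)
df_diff = degrees_of_freedom(different_distribution)
```

Identical allele distributions give G = 0, and G grows as the distributions diverge. In a multi-locus comparison, one common approach (assumed here, not stated in the source) is to sum G and the degrees of freedom over loci before looking up a P-value.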
A linear correlation between genetic distance and percent of individuals with missing alleles exists for both MAATG and MCTG. However, neither the Native Americans nor the Hispanic test groups show a linear relationship between these variables. A net (total number of missing alleles per database minus those absent in the authentic database) of 10 and 17% MNATG individuals possess alleles which are absent in the SalRC and NOnRC, respectively, whereas several databases of African descent (AAC and MAA) display a much lower incidence of missing MNATG alleles (3% of the individuals in each case). The substantial inter-population differences among Native Americans arise from the numerous founder effects, genetic bottlenecks, and the long-term reproductive isolation associated with the small scale nomadic migrations out of Africa across Asia, the Bering Strait, and through out the New World [16]. Overall, the MAATG is the most affected by missing alleles (80% of inter-major group database substitutions resulted in at least five affected loci with a mean of 23 missing alleles per replacement). These findings mirror the greater genetic diversity within the populations of African
descent [17,18] in comparison to other major groups. Historically, African-American/Caribbean groups represent combinations of various West African peoples and are admixed with a number of other populations including Native Americans, Europeans and Asians [19].

An allele may be missing from a sample collection because it is either absent from the reference population or present there below polymorphic levels. In forensic cases, these two situations must be interpreted differently. In the first instance (allelic absence), an inappropriate reference database was used to calculate the frequency of occurrence and, hence, the standard procedure of substituting a value of 1/2N (where N is the number of individuals in the database) for the missing frequency may be especially damaging to the suspect if the affected allele is common in the authentic database. In the second scenario, however, both the suspect’s profile and that
obtained from the evidence DNA contain a rare allele, which may be highly supportive of a match if the correct database is consulted. In this situation, an estimate of 1/2N actually benefits the suspect since we can surmise that the true frequency is lower (assuming random sampling and Hardy-Weinberg equilibrium).

With respect to the CODIS system, the most drastic changes in exponential pOcc⁻¹ values (Tables 7–10), as well as an increased severity of missing information (i.e., a rise in the number of affected individuals (Tables 3–6), loci and/or alleles), occur when the replacement database belongs to a different major group. In addition, ethnic substructure, which is considerable in the Native Americans (as revealed by analysis of missing alleles and patterns of exponential change in Tables 6 and 10, respectively), may also impact the results. To a lesser extent,
Table 11. pOcc⁻¹ upon locus exclusion and use of authentic database (columns 3 to 6 give the resulting pOcc⁻¹)

| pOcc⁻¹ category | No. loci remaining | MAATG/MAA | MCTG/MC | MHTG/MH | MNATG/MNA |
|---|---|---|---|---|---|
| High | 12 | 5.79E+14 | 6.42E+12 | 3.18E+13 | 8.24E+11 |
| High | 11 | 7.93E+12 | 1.28E+11 | 9.91E+11 | 6.23E+10 |
| High | 10 | 2.76E+11 | 1.95E+10 | 2.33E+11 | 8.98E+09 |
| High | 9 | 2.08E+10 | 1.01E+09 | 2.84E+10 | 1.15E+09 |
| High | 8 | 1.43E+09 | 4.92E+07 | 1.15E+09 | 3.96E+07 |
| High | 7 | 2.13E+08 | 1.61E+07 | 8.66E+07 | 1.19E+07 |
| High | 6 | 3.48E+07 | 2.89E+06 | 2.03E+06 | 2.89E+05 |
| High | 5 | 4.06E+05 | 1.89E+05 | 1.45E+05 | 1.84E+04 |
| High | 4 | 6.73E+04 | 2.74E+04 | 1.89E+04 | 3.81E+03 |
| High | 3 | 7.33E+03 | 2.77E+03 | 7.65E+02 | 4.10E+02 |
| High | 2 | 6.61E+02 | 7.45E+02 | 2.29E+02 | 7.53E+01 |
| High | 1 | 1.55E+01 | 9.79E+00 | 2.68E+01 | 5.46E+00 |
| Medium | 12 | 7.13E+16 | 5.99E+15 | 1.11E+15 | 3.98E+14 |
| Medium | 11 | 4.24E+15 | 5.10E+13 | 3.41E+12 | 3.06E+13 |
| Medium | 10 | 3.81E+14 | 4.47E+12 | 4.69E+11 | 1.49E+12 |
| Medium | 9 | 4.68E+13 | 4.42E+11 | 3.44E+10 | 9.44E+10 |
| Medium | 8 | 2.46E+11 | 2.10E+10 | 5.83E+08 | 3.26E+09 |
| Medium | 7 | 9.34E+09 | 1.44E+08 | 1.67E+07 | 5.17E+08 |
| Medium | 6 | 6.02E+08 | 1.15E+07 | 5.66E+05 | 1.09E+07 |
| Medium | 5 | 4.01E+07 | 8.18E+05 | 4.71E+04 | 4.36E+05 |
| Medium | 4 | 2.07E+06 | 6.19E+04 | 4.17E+03 | 2.57E+04 |
| Medium | 3 | 1.43E+05 | 2.81E+03 | 5.24E+02 | 5.75E+02 |
| Medium | 2 | 1.58E+03 | 1.18E+02 | 4.02E+02 | 4.33E+01 |
| Medium | 1 | 2.62E+01 | 9.79E+00 | 3.13E+01 | 3.62E+00 |
| Low | 12 | 3.93E+20 | 2.26E+16 | 1.60E+18 | 2.20E+20 |
| Low | 11 | 1.18E+18 | 7.97E+14 | 6.22E+16 | 1.52E+19 |
| Low | 10 | 3.35E+16 | 1.22E+14 | 1.58E+14 | 4.24E+17 |
| Low | 9 | 1.71E+15 | 2.74E+12 | 7.10E+12 | 1.71E+16 |
| Low | 8 | 2.68E+13 | 1.33E+11 | 8.29E+09 | 5.83E+14 |
| Low | 7 | 1.29E+11 | 3.87E+10 | 2.09E+09 | 1.75E+14 |
| Low | 6 | 2.10E+10 | 3.07E+09 | 2.19E+08 | 4.26E+12 |
| Low | 5 | 6.72E+05 | 1.51E+08 | 5.96E+06 | 3.96E+09 |
| Low | 4 | 7.16E+04 | 7.64E+06 | 2.11E+04 | 1.04E+08 |
| Low | 3 | 4.93E+03 | 3.34E+05 | 2.65E+03 | 7.75E+05 |
| Low | 2 | 4.45E+02 | 7.64E+03 | 7.96E+02 | 5.93E+04 |
| Low | 1 | 1.70E+01 | 7.53E+00 | 7.24E+01 | 1.75E+03 |
significant ethnic differences are also observed within the African-American/Caribbean collections since the database replacement of MAA by TC induces a higher incidence of missing alleles (a net of 3). A more thorough comparison of the total databases (i.e., G-test in Table 2) exposed additional substructure among the African-American/Caribbean groups (6/10 significant G-test comparisons at α = 0.05), the Native American collections (6/6 significant G-test comparisons at α = 0.05) and between some Caucasian populations as well (3/14 significant G-test comparisons at α = 0.05). Although a good portion of the ethnic variation revealed by the latter analysis does not appreciably affect the genotypic frequencies of the 120 Minnesota-based TG profiles examined in the current study, it may potentially influence the outcome of future forensic cases involving other profiles (both Minnesota-based and otherwise).

The most obvious difference between our results and those of Gallo et al. is the astronomically low pOcc scores obtained with the STR system versus those generated with the polymarker/DQA1/D1S80 combination. The magnitude of the pOcc⁻¹ score generated from the most common profile (1 in 10¹²) exceeds the global population size (approximately 7 × 10⁹) and, thus, may be interpreted by some to mean that the discriminatory power of the STR system is so great that the choice of a reference database is irrelevant. Under optimal circumstances, when all or most loci are informative, this would be the case. However, forensic conditions are rarely optimal. Quantity and quality of evidence DNA, as well as both economic and time constraints, can severely decrease the number of loci successfully analyzed and, thus, drastically reduce the power of discrimination of the CODIS STR system.
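Why the loss of loci erodes discriminatory power so quickly is easiest to see from the way a profile frequency is assembled. The sketch below is illustrative only and is not the software used in this study. It assumes Hardy-Weinberg genotype frequencies (p² for homozygotes, 2pq for heterozygotes), independence among loci, and the 1/2N substitution for alleles missing from the database discussed earlier; the locus names "L1" and "L2" and all counts are invented.

```python
from typing import Dict, Tuple

AlleleCounts = Dict[str, int]  # allele -> number of copies observed at one locus


def allele_frequency(counts: AlleleCounts, allele: str) -> float:
    """Observed allele frequency, or 1/2N when the allele is missing.

    The total number of allele copies at a locus equals 2N for N typed
    individuals, so 1/2N is simply 1 / total copies.
    """
    total_copies = sum(counts.values())
    observed = counts.get(allele, 0)
    if observed == 0:
        return 1.0 / total_copies
    return observed / total_copies


def profile_frequency(profile: Dict[str, Tuple[str, str]],
                      database: Dict[str, AlleleCounts]) -> float:
    """pOcc under Hardy-Weinberg proportions and independent loci."""
    pocc = 1.0
    for locus, (a, b) in profile.items():
        p = allele_frequency(database[locus], a)
        q = allele_frequency(database[locus], b)
        pocc *= p * p if a == b else 2.0 * p * q
    return pocc


# Invented database of N = 100 individuals (200 allele copies per locus).
toy_db = {
    "L1": {"10": 100, "11": 100},
    "L2": {"8": 150, "9": 50},
}

common = profile_frequency({"L1": ("10", "11"), "L2": ("8", "8")}, toy_db)
with_missing = profile_frequency({"L1": ("10", "11"), "L2": ("8", "7")}, toy_db)
one_locus = profile_frequency({"L1": ("10", "11")}, toy_db)
```

Because pOcc is a product over loci, each locus that fails to type removes one factor smaller than 1, so pOcc⁻¹ shrinks multiplicatively with every locus lost.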
On this note, we conducted an analysis in which the CODIS loci are excluded one by one in order of mean allelic (amplicon) size (based on reports citing an inverse relation between amount of amplified product and size of CODIS STR fragments [20]). Twelve sets of pOcc⁻¹ calculations involving the four test groups and their respective Minnesota databases were performed. Three profiles from each Minnesota TG were selected, representing the low (<1 in 10¹⁸), medium (1 in 10¹⁶ to 1 in 10¹⁸) and high (1 in 10¹² to 1 in 10¹⁵) probability categories using all 12 loci. Results (Table 11) reveal that a minimum of six to nine loci are required to reach a discriminatory value of the same magnitude as the Earth’s population (10⁹). Furthermore, the mean number of loci necessary to obtain this level of discrimination increased with the probability category (a mean of 6.3, 8, and 8.5 with the low, medium and high divisions, respectively). If five loci are used, the discriminatory power of the CODIS STRs decreases such that in most cases (67% across all probability categories) the pOcc⁻¹ is 1 in 10⁵ or higher. From that point it degenerates rapidly with the loss of each locus such that in 50% of the cases with four loci, the pOcc⁻¹ exceeds 1 in 10⁴. A locus exclusion analysis involving an intra-major group substitution (MNA is replaced by NOnCF) is shown
Table 12. pOcc⁻¹ upon locus exclusion and inter-population database substitution (columns 2 and 3 give the resulting pOcc⁻¹)

| No. loci remaining | MNATG/MNA | MNATG/NOnCF |
|---|---|---|
| 12 | 3.98E+14 | 5.26E+17 |
| 11 | 3.06E+13 | 4.48E+16 |
| 10 | 1.49E+12 | 1.87E+15 |
| 9 | 9.44E+10 | 1.71E+13 |
| 8 | 3.26E+09 | 4.41E+11 |
| 7 | 5.17E+08 | 6.99E+10 |
| 6 | 1.09E+07 | 1.23E+09 |
| 5 | 4.36E+05 | 2.63E+07 |
| 4 | 2.57E+04 | 4.34E+05 |
| 3 | 5.75E+02 | 5.63E+02 |
| 2 | 4.33E+01 | 9.02E+01 |
| 1 | 3.62E+00 | 2.76E+00 |
in Table 12. In this situation, the exponential frequency difference between respective pOcc⁻¹ values is at least two orders of magnitude with the use of 12 to five loci. Although these differences may not be consequential with eight or more loci, since the pOcc⁻¹ calculated with either database approaches or exceeds the Earth’s population, they may become important with a lower number (six loci: approximately 1 in 10⁷ with the Minnesota Native American database versus more than 1 in 10⁹ with the NOnCF). When four or fewer loci are employed, the level of discrimination is poor with the use of the authentic as well as the substituted database (1 in 10⁵, 1 in 575 and 1 in 90 with 4, 3 and 2 loci, respectively).

The synergistic combination of missing alleles and locus dropout due to, for example, PCR amplification failure may be extremely problematic even when the correct database is employed and may be further intensified by database replacement (as suggested by the results listed in Table 12). In this scenario, the observed allelic frequency differences between authentic and replacement databases at the few remaining loci will exert a disproportionate influence on the probability scores. Inter-major group database substitutions can exacerbate the problem since they may result in the lack of allelic representation at multiple loci. To minimize these potential complications, it is prudent to consult databases within the same major group in cases in which the authentic population-specific database is unavailable. However, missing alleles can present a problem even with respect to these intra-major group substitutions. The situation is especially serious in the Native Americans since both the SalRC and NOnRC databases failed to contain several MNATG alleles at each of four different loci (SalRC: D3S1358, CSF1PO, D5S818 and D18S51; NOnCF: D3S1358, D7S820, D8S1179 and D18S51; results not shown).
Thus, in cases in which there is substantial intra-major group variation, like that among the Native American databases and, to a lesser degree, the African-American/Caribbean collections used in this study, it is critical to employ the true population-specific database.
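The comparison drawn from Table 12 can be re-traced with a short sketch. The pOcc⁻¹ values are copied from Table 12; the helper names are hypothetical, and the 10⁹ threshold follows the Earth-population benchmark used in the text. The "orders of magnitude" gap is taken here as the difference between the printed exponents, which is an assumption about how the comparison was made.

```python
from math import floor, log10

# Number of loci remaining, in the order used in Table 12.
LOCI = list(range(12, 0, -1))

# pOcc^-1 values transcribed from Table 12.
MNA = [3.98e14, 3.06e13, 1.49e12, 9.44e10, 3.26e9, 5.17e8,
       1.09e7, 4.36e5, 2.57e4, 5.75e2, 4.33e1, 3.62e0]
NONCF = [5.26e17, 4.48e16, 1.87e15, 1.71e13, 4.41e11, 6.99e10,
         1.23e9, 2.63e7, 4.34e5, 5.63e2, 9.02e1, 2.76e0]


def min_loci_for(values, threshold=1e9):
    """Smallest number of loci whose pOcc^-1 still reaches the threshold."""
    reached = [n for n, v in zip(LOCI, values) if v >= threshold]
    return min(reached) if reached else None


def exponent_gap(authentic, replacement):
    """Difference between the base-10 exponents of two pOcc^-1 values."""
    return floor(log10(replacement)) - floor(log10(authentic))


min_mna = min_loci_for(MNA)
min_noncf = min_loci_for(NONCF)
gaps = {n: exponent_gap(a, b) for n, a, b in zip(LOCI, MNA, NONCF)}
```

With these values the authentic database needs eight loci to reach 10⁹ whereas the replacement appears to need only six, so the substitution overstates the rarity of this profile precisely in the range of loci where partial profiles are likely to fall.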
Acknowledgements We would like to thank Terry Laber, Assistant Director of the State of Minnesota Bureau of Criminal Apprehension Laboratory (SMBCAL), Minnesota, USA, who provided us with the CODIS STR Minnesota database, and Justin Bolinger, a student at FIU, for his considerable contribution to this manuscript.
References
[1] National Research Council, The Evaluation of Forensic DNA Evidence, National Academy Press, Washington, DC, 1996.
[2] J.C. Gallo, E. Thomas, G.E. Novick, R.J. Herrera, Effects of subpopulation structure on probability calculations of DNA profiles from forensic analysis, Genetica 101 (1997) 1–12.
[3] G. Carmody, G-test, Carleton University, Ottawa, Canada, 1991.
[4] B. Budowle, T.R. Moretti, A.L. Baumstark, D.A. Defenbaugh, K.M. Keys, Population data on thirteen CODIS core short tandem repeat loci in African-Americans, U.S. Caucasians, Hispanics, Bahamians, Jamaicans and Trinidadians, J. Forensic Sci. 44 (1999) 1277–1286.
[5] S. Borys, H. Vanstone, G. Carmody, R. Fourney, Allele frequencies for the COFILER™ STR loci in the Canadian Caucasian and the Canadian First Nations Populations, J. Forensic Sci. 45 (2000) 945–946.
[6] L. Garofano, et al., Italian population data on thirteen short tandem repeat loci: HUMTH01, D21S11, D18S51, HUMVWFA31, HUMFIBRA, D8S1179, HUMTPOX, HUMCSF1PO, D16S539, D7S820, D13S317, D5S818, D3S1358, Forensic Sci. Int. 97 (1998) 53–60.
[7] T. Kuperschmid, T. Calicchio, B. Budowle, Maine Caucasian population DNA database using twelve short tandem repeat loci, J. Forensic Sci. 44 (1999) 392–395.
[8] C. Gehrig, M. Hochmeister, U.V. Borer, R. Dirnhofer, B. Budowle, Swiss Caucasian population data for 13 STR loci using AmpFlSTR Profiler Plus and Cofiler PCR amplification kits, J. Forensic Sci. 44 (1999) 1035–1038.
[9] S. Borys, R. Iwamoto, J. Miyakoshi, G. Carmody, R. Fourney, Allele frequency distributions for nine STR loci in the Japanese population, J. Forensic Sci. 44 (1999) 1319.
[10] C.C. Li, First Course in Population Genetics, Boxwood Press, 1976.
[11] J.L. Hernandez, B.S. Weir, A disequilibrium approach to Hardy-Weinberg testing, Biometrics 45 (1989) 53–70.
[12] SPSS version 10, 1989.
[13] M. Nei, Genetic distance between populations, Am. Naturalist 106 (1972) 283–292.
[14] E.S. Lander, B. Budowle, DNA fingerprinting dispute laid to rest, Nature 371 (1994) 735–738.
[15] M.J. Nauta, F.J. Weissing, Constraints on allele size at microsatellite loci: implications for genetic differentiation, Genetics 143 (1996) 1021–1032.
[16] F.M. Salzano, Molecular variability in Amerindians: widespread but uneven information, Ann. Brazilian Acad. Sci. 74 (2002) 223–263.
[17] L.L. Cavalli-Sforza, P. Menozzi, A. Piazza, The History and Geography of Human Genes, Princeton University Press, Princeton, NJ, 1994.
[18] R. Foley, The context of human genetic evolution, Genome Res. 8 (1998) 339–347.
[19] J. Thornton, Africa and Africans in the Making of the Atlantic World, 1400–1800, Cambridge University Press, London, UK, 1998.
[20] J.M. Butler, Forensic DNA Typing, Academic Press, San Diego, CA, 2001.