Forensic Science International: Reports 2 (2020) 100059
Forensic Science International: Reports journal homepage: www.elsevier.com/locate/fsir
Forensic Genetics
LUS+: Extension of the LUS designator concept to differentiate most sequence alleles for 27 STR loci
Rebecca S. Just*, Jennifer Le, Jodi A. Irwin
DNA Support Unit, Federal Bureau of Investigation Laboratory, 2501 Investigation Parkway, Quantico, VA, 22135, USA
ARTICLE INFO
Keywords: Next generation sequencing (NGS); Massively parallel sequencing (MPS); Short tandem repeat (STR); Probabilistic genotyping
ABSTRACT
We previously proposed representation of STR sequences using the LUS to expedite use of sequence-level information in probabilistic genotyping. To consider the greater diversity of allele sequences now known, here we surveyed datasets representing >3000 individuals from 12 population groups. From 1059 ForenSeq Universal Analysis Software (UAS) range alleles detected for 27 STR loci, we identified additional motifs that could be referenced in designations to differentiate >99% of alleles. To further assist labs with probabilistic interpretation of sequence-based STRs, we include lookup tables for conversion of UAS range sequences to LUS/LUS+ alleles, and provide LUS and LUS+ allele frequency tables.
1. Introduction
The longest uninterrupted stretch (LUS) concept for STR sequence allele designation was conceived to facilitate interpretation of next generation sequencing (NGS)-based STR typing results using current or near-term probabilistic genotyping programs [1]. Probabilistic genotyping programs were originally developed and programmed to interpret the length-based alleles produced via capillary electrophoresis (CE; e.g. 9.3, 13, 20), rather than the sequence strings produced via NGS. For “fully continuous” (quantitative) programs in particular, interpretation of data in stutter positions is necessarily based on defined relationships between parent alleles and their potential stutter products. As these relationships are more complex when considering sequence-level information than length-based alleles (where relationships can be defined by size), STR sequence strings cannot be used as input in most current probabilistic genotyping programs. In the format “Repeat unit allele_LUS reference region length” (e.g. 9.3_6), the LUS-based allele designators we proposed maintain the length-based information produced via CE platforms while at the same time capturing some of the sequence-level information generated via NGS. Critically, sequence alleles represented using the LUS designators maintain clear relationships between parent alleles and stutter products that can be programmed in a relatively straightforward manner in quantitative programs. Since publication of Ref [1], the EuroForMix program [2] has been modified to accommodate LUS-based allele designations (EuroForMix versions 1.11.3 and newer), and the new CaseSolver program also allows the designations [3]. In addition to the proof of concept work described in Ref [1] that used the LRmix Studio program [4,5], LUS alleles have also been utilized in a
published study to represent sequence variants for mixture interpretation in EuroForMix [6]. As detailed in Ref [1], the LUS allele designators we proposed captured greater than 80% of the sequence variation observed among four population datasets totaling 777 individuals; and using those same datasets, the defined LUS reference region was the actual longest uninterrupted stretch of identical repeats in >99% of allele sequences. Given the greater diversity of allele sequences now available from a wider variety of population groups, in this study we sought to 1) determine the rate at which the previously defined LUS reference regions produced the actual LUS when additional allele sequences were considered, and 2) extend the LUS allele designation concept to differentiate a higher percentage of alleles that are distinct by sequence. To achieve these aims, we examined STR sequence allele datasets for 12 population groups, developed from more than 3000 individuals, as well as further unique alleles from the NCBI STRseq BioProject [7], for 27 autosomal STR loci. Here, we detail additional reference regions within the ForenSeq Universal Analysis Software (UAS; Verogen, Inc., San Diego, CA) range sequence strings for 17 of these loci that can be used to designate “LUS+” alleles. We also share simple tools that can be used by forensic labs to assist interpretation of NGS-based STR typing results in probabilistic genotyping programs.
2. Materials and methods
2.1. Compilation of sequence alleles
Sequence alleles for 12 population groups, captured from nine published datasets [8–16], were used for the study. The number of typed
* Corresponding author.
E-mail addresses: [email protected] (R.S. Just), [email protected] (J. Le), [email protected] (J.A. Irwin).
http://doi.org/10.1016/j.fsir.2020.100059
Received 1 November 2019; Received in revised form 27 December 2019; Accepted 9 January 2020; Available online 11 January 2020
2665-9107/© 2020 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
R.S. Just et al.
FSIR 2 (2020) 100059
formulas referenced the correct cells, with the frequencies summing to 1 and the allele counts summing to the same total number of alleles for the locus as originally reported in Ref [15] Table S3. The frequencies calculated for the LUS and LUS+ alleles were transformed into tables formatted for probabilistic genotyping programs as follows. First, the entire set of LUS (or LUS+) allele designations was copied to the first column of a new Excel sheet, duplicates were removed, and the designations were sorted A–Z. The locus names were copied to columns 2–28, and then the entire sheet was copied three additional times such that there would be one table created for each of the four population groups (African American, Asian, Caucasian, and Hispanic). Within the same Excel file, a new tab was created for each locus and named accordingly. Individually by locus, the LUS (or LUS+) allele designations and their frequencies for all four populations were copied from the calculation sheet and pasted as values into the correspondingly named tab for the locus. Within the sheet named for each population, the Excel VLOOKUP function was then used to obtain the population-specific frequency for each LUS (or LUS+) allele for each locus. The resulting data within each population frequency table were copied and pasted as values, then NA values and frequencies of 0 were removed. Every value within each of the eight frequency tables formatted for probabilistic genotyping (four each for LUS and LUS+) was checked against the original calculated LUS and LUS+ frequencies to ensure no errors were made in preparing the tables.
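The spreadsheet steps described above (merging counts for alleles that share a designation, converting counts to frequencies, and dropping zero entries) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Excel workflow: the function name, the record layout and the counts in the usage example are all assumptions made for the sketch.

```python
from collections import defaultdict


def designation_frequencies(records):
    """Build per-population, per-locus frequency tables.

    records: iterable of (population, locus, designation, count) tuples,
    where several sequence alleles may share one designation.
    (Hypothetical layout, chosen only for this sketch.)
    """
    counts = defaultdict(lambda: defaultdict(int))
    for population, locus, designation, count in records:
        # Sequence alleles with identical designations are pooled.
        counts[(population, locus)][designation] += count

    tables = {}
    for key, by_designation in counts.items():
        total = sum(by_designation.values())
        if total == 0:
            continue  # nothing observed for this population/locus
        # Designations are sorted and zero-count entries are dropped.
        tables[key] = {
            d: c / total
            for d, c in sorted(by_designation.items())
            if c > 0
        }
    return tables


# Invented counts for illustration only.
example = [
    ("PopA", "CSF1PO", "10_10", 3),
    ("PopA", "CSF1PO", "10_10", 1),  # second sequence, same designation
    ("PopA", "CSF1PO", "11_11", 4),
    ("PopA", "CSF1PO", "12_12", 0),  # unobserved, so removed
]
freqs = designation_frequencies(example)
```

As in the manual check described above, the frequencies for each locus within a population should sum to 1.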
individuals from these nine datasets totaled 3104, and included 200 White British, 200 British Chinese, 250 Koreans, 106 Han, 59 South Brazilians, 143 Catalan, 88 Spanish Roma, 610 African Americans, 266 U.S. Asians, 641 U.S. Caucasians, 479 U.S. Hispanics and 62 Yavapai. For each dataset, only alleles for the 27 autosomal STR loci included in the MiSeq FGx DNA Signature Prep Kit (Verogen, Inc.; see Ref [17]) were used. In addition, all alleles available in the NCBI STRseq BioProject [7] for the 27 loci as of July 11, 2019 were downloaded to capture additional unique sequences. The sequence alleles from the published datasets and the STRseq BioProject varied in terms of the sequence range reported, due to the differing assays and analysis software packages used to generate the data. To develop a set of alleles with a consistent range per locus, sequences were trimmed to the specific ranges used for analysis by the ForenSeq UAS (Verogen, Inc.). Alleles for which the full UAS range for a given locus was not reported (i.e., when only the core repeat region sequence was included in the published dataset, but the UAS range for the locus includes some portion of the 5′ and/or 3′ flanking region) were eliminated from the set for this study. Lastly, the alleles compiled from the various sources were filtered to remove duplicate sequences. The final set of 1059 unique alleles for the 27 loci was used for all subsequent analyses.
2.2. Evaluation of the LUS reference region and selection of additional reference regions
3. Results and discussion
Each of the 1059 allele sequences was examined to determine a) the length of the LUS reference region and b) the actual LUS for the sequence. Sequence alleles for each locus were translated to their LUS allele designations. Loci for which all alleles could not be uniquely represented as LUS alleles were further evaluated to identify additional sequence motifs whose repeat lengths could be used to distinguish more alleles. Either one or two additional reference regions were determined for each of these loci, with selection based on maximizing the number of alleles that could be distinguished.
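The two quantities examined here can be illustrated with a short Python sketch: the actual LUS of a sequence (the longest run of consecutive identical repeat units) and the repeat length of a chosen motif. The sketch assumes a fixed repeat unit length of 4, uses hypothetical function names, and operates on invented sequences; it does not encode the locus-specific reference regions defined in Table 1 of Ref [1].

```python
import re


def lus_length(seq, unit_len=4):
    """Length, in repeat units, of the longest run of identical units.

    Every starting offset is tried, so the result does not depend on
    the reading frame. Returns 0 for sequences shorter than one unit.
    """
    seq = seq.upper()
    best = 0
    for start in range(len(seq) - unit_len + 1):
        motif = seq[start:start + unit_len]
        count = 1
        pos = start + unit_len
        # Extend the run while the next unit is identical.
        while seq[pos:pos + unit_len] == motif:
            count += 1
            pos += unit_len
        best = max(best, count)
    return best


def motif_run_length(seq, motif):
    """Longest run of consecutive copies of `motif`; 0 if it is absent."""
    runs = re.findall(f"(?:{re.escape(motif)})+", seq.upper())
    return max((len(run) // len(motif) for run in runs), default=0)


# Invented sequences for illustration only.
interrupted = "TAGA" * 3 + "TGGA" + "TAGA" * 3 + "CAGA" * 4 + "TAGA"
compound = "TCTA" + "TCTG" * 3 + "TCTA" * 5
```

For a motif that is simply present or absent, `motif_run_length` returns 1 or 0, which matches the way such regions are represented in the designations discussed in this study.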
The LUS reference region either produced the actual LUS, or was equal in length to another repeat motif in the sequence, for 1045 (98.7%) of the 1059 unique alleles identified from the various data sources. Three of the instances in which the LUS reference region was not the longest uninterrupted stretch occurred with the vWA locus; these were the distinctive size 13, 14 and 15 alleles with the structure [TAGA]n TGGA [TAGA]n [CAGA]4 [TAGA]n, and represented three of the 52 unique alleles for the locus. The remaining 11 instances occurred with the three loci with the highest counts of unique sequence alleles: D21S11 (eight of 152 alleles), D12S391 (two of 127 alleles) and D2S1338 (one of 92 alleles). As the same region of each STR produced the LUS in nearly all instances (typically 100% for a locus, but never less than 94% for a locus when exceptions occurred), the repeat regions originally defined in Table 1 of Ref [1] are appropriate to maintain as the reference regions used to determine LUS length for allele designations. Designation of sequences as LUS alleles enabled unique representation of all sequence variants for 10 of the 27 STR loci (Fig. 1). For seven of the remaining loci (CSF1PO, D16S539, D18S51, D2S441, D3S1358, D4S2408 and D6S1043), defining a secondary reference region that could be added to the LUS allele designation to produce an ‘LUS+’ allele (Tables 1 and S1) was sufficient to distinguish all sequence variants. The secondary reference regions identified for these loci tended to occur within the LUS reference region, and typically referred to the presence or absence of a specific motif whose length would thus be represented as either 1 or 0. A detailed example for the CSF1PO locus is displayed in Fig. 2. The notable departure for these loci was the secondary reference region selected for D3S1358: the number of TCTG repeats ranged from 1 to 4 among the 41 unique sequences for the locus. 
An additional exception occurred with a STRseq [7] D6S1043 sequence (MH166947.1) that exhibited two ATGT repeats. Both secondary and tertiary reference regions were defined for the last ten loci (Tables 1 and S1). The regions occurred both within and outside of the LUS reference region, and in all but one instance referred to a repeat motif. The exception occurred with D7S820, which required reference to the presence or absence of a specific T insertion for unique designation of all alleles (see Table S2). For five of the loci (D12S391, D1S1656, D2S1338, D7S820 and D8S1179), the use of secondary and tertiary reference region lengths to designate LUS+ alleles differentiated all of
2.3. Development of LUS and LUS+ allele frequencies and tables
Though nine different datasets were evaluated to identify unique sequence alleles and determine additional reference regions, allele frequencies were calculated using only data from the single largest study: the NIST 1036 population data [15]. Frequencies of both LUS and LUS+ designated alleles were calculated based on sequence data previously reported in Ref [15] for the African American, Asian, Caucasian, and Hispanic U.S. populations as follows. Using Excel Table S3 from Ref [15], a ForenSeq UAS range sequence was obtained for each sequence allele reported. For most loci this range could be obtained simply by copying the repeat region string from Column P of the original table, but for the 8 loci for which the UAS range includes a portion of the 5′ or 3′ flank (D13S317, D18S51, D19S433, D1S1656, D5S818, D7S820, Penta D and vWA), the UAS portions of each flank reported in Ref [15] Table S3 were included in the UAS range sequence string for the locus. An in-house developed lookup table was subsequently used to translate the UAS sequences to LUS alleles and LUS+ alleles, and a conditional formatting function in Excel was used to identify identical LUS alleles and LUS+ alleles within each locus. Columns E–N from the original Ref [15] Table S3, which included the formulas and cell references used to calculate the original Table S3 allele frequencies from the allele counts, were copied directly to use for the LUS and LUS+ allele frequency calculations. All counts for each unique LUS allele and (separately) LUS+ allele were edited according to the identical alleles previously identified using the Excel conditional formatting function. The frequencies calculated from these allele counts were thus based on the original Ref [15] Table S3 formulas. Each locus and both data sets (LUS and LUS+) were checked to ensure that 1) the Excel frequency formulas referenced the correct cell ranges, and 2) the summation
Fig. 1. Sequence alleles differentiated by LUS and LUS+ designations.
Table 1. Secondary and tertiary reference regions for 27 STR loci.
a Alleles are displayed using the UAS data range, but in accordance with ISFG recommendations as to strand. Repeat regions used to determine repeat unit alleles are in upper case text. LUS length reference regions are underlined, and are identical to those described in Table 1 of Ref [1]. Secondary reference regions are indicated by blue text. Tertiary reference regions are indicated by green text.
the unique sequences. For the final five loci, various options for secondary and tertiary reference regions were considered given the sequence variation observed, and the options that resulted in the greatest degree of allele differentiation were selected. For these loci (D21S11, D13S317, vWA, D19S433 and FGA), unique sequence alleles unresolved using the LUS+ designators occurred with three, two, two, one and one allele pairs, respectively (Fig. 1 and Table S3). Overall, the original LUS designators differentiated 857 of the 1059 alleles examined, or 80.93%. This value is similar to but lower than the 86.9–88.3% rate reported in Ref [1] for the same sequence ranges, a result that is
Fig. 2. Secondary reference region and LUS+ allele example. The figure shows all unique UAS-range CSF1PO allele sequences examined in this study (n = 17 alleles). Alleles identical by length-based repeat unit are highlighted in pink. When the length of the LUS (in red text) was used to designate LUS alleles, 15 of the alleles were uniquely represented, while one allele pair had the same LUS allele designations (highlighted in yellow). By adding a reference to the length of the ACCT repeat (secondary reference region, in blue text) to designate LUS+ alleles, all sequence alleles were differentiated.
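The comparison shown in Fig. 2 amounts to grouping sequences by their designation and flagging any designation shared by more than one sequence. A minimal Python sketch follows; the function name, the placeholder sequence labels and the designation strings (including the underscore-separated form used for the LUS+ values) are assumptions made for illustration and are not taken from the lookup tables.

```python
from collections import defaultdict


def undifferentiated_groups(designations):
    """Return designations shared by more than one sequence.

    designations: dict mapping a sequence (or sequence label) to its
    designation string. (Hypothetical input layout.)
    """
    groups = defaultdict(list)
    for sequence, designation in designations.items():
        groups[designation].append(sequence)
    return {d: seqs for d, seqs in groups.items() if len(seqs) > 1}


# Placeholder labels and designation strings, for illustration only.
lus_only = {"seq_A": "12_12", "seq_B": "12_12", "seq_C": "11_11"}
# Assumed format: the secondary region length appended to the LUS allele.
lus_plus = {"seq_A": "12_12_1", "seq_B": "12_12_0", "seq_C": "11_11_1"}

shared_lus = undifferentiated_groups(lus_only)
shared_lus_plus = undifferentiated_groups(lus_plus)
```

In this toy case one pair collides under the LUS-only designations and none collide once the secondary region length is appended.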
not surprising given the deliberate attempt in this study to consider a much larger set of unique sequence alleles reported for more diverse population groups. Yet, use of the secondary and tertiary reference regions selected for 17 loci in this study enabled unique LUS+ designations for 1050 of the 1059 alleles, or 99.15% (Fig. 1). Among the very small number of sequence allele pairs that could not be differentiated using LUS+ designators, at least one allele of each pair was a rare sequence variant (Table S3). For four of these nine pairs, only one of the two affected alleles was observed in the NIST 1036 dataset used to develop allele frequencies [15]; and for the remaining five pairs, the rare sequence allele was observed only once. Given the rarity of these alleles, the likelihood of encountering distinct UAS-range sequence alleles that cannot be distinguished via LUS+ allele designations in any given forensic case is extremely low. To assist labs interested in using the LUS or LUS+ designators to perform probabilistic genotyping of NGS-based STR typing results, several Excel tables were generated. These included 1) frequency tables for both LUS and LUS+ alleles (Tables S4–S5) developed from the NIST 1036 population datasets [15], and 2) lookup tables based on the STRseq BioProject alleles [7] that can be used to convert UAS sequence strings to LUS and LUS+ alleles (Table S6).
4. Conclusions
While use of the LUS length in allele designations enables unique representation of the majority of UAS-range sequences for 27 STR loci, further resolution of the known sequence variants is desirable to maximize the advantage of using sequence information for STR evidence profile interpretation. In this study, secondary and tertiary reference regions were selected for 17 loci on the basis of data developed from more than 3000 individuals originating from 12 diverse population groups [8–16], as well as additional alleles from the STRseq BioProject [7]. Use of these additional reference regions to designate LUS+ alleles increased the number of sequences that could be differentiated to greater than 99%. The Excel-based tools provided here should assist labs interested in exploring or performing probabilistic genotyping using the sequence-level information produced via NGS. For “fully continuous”/quantitative interpretation, EuroForMix [2] versions 1.11.3 and newer (available at http://www.euroformix.com/) and CaseSolver [3] accommodate the LUS and LUS+ alleles as input. For automated conversion of UAS sequences to the LUS and LUS+ formats based on the lookup tables provided here, the EuroForMix website now includes an R-package: seq2lus (developed by Ø. Bleka; http://euroformix.com/seq2lus). The lookup tables will be periodically updated as more sequences become available in the STRseq BioProject to further facilitate the use of sequence level information in probabilistic genotyping programs.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Rebecca S. Just: Conceptualization, Methodology, Investigation, Data curation, Formal analysis, Writing - original draft. Jennifer Le: Investigation, Data curation, Validation. Jodi A. Irwin: Conceptualization, Writing - review & editing, Supervision.
Acknowledgements
The authors thank Øyvind Bleka and Peter Gill of the Department of Forensic Sciences at Oslo University Hospital for discussion and collaboration; Daniel Hatch of the FBI Laboratory for development of a data compilation macro; Lilliana Moreno, Thomas Callaghan and Anthony Onorato of the FBI Laboratory for review of the manuscript; and Rebecca Mitchell for assistance with manuscript revisions. This is FBI Publication #19-23. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. This research was supported in part through the FBI’s Visiting Scientist Program, an educational opportunity administered by the Oak Ridge Institute for Science and Education (ORISE). Names of commercial manufacturers are provided for identification purposes only, and inclusion does not imply endorsement of the manufacturer, or its products or services by the FBI. The views expressed are those of the authors and do not necessarily reflect the official policy or position of the FBI or the U.S. Government.
Appendix A. Supplementary data
Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.fsir.2020.100059.
References
[1] R.S. Just, J.A. Irwin, Use of the LUS in sequence allele designations to facilitate probabilistic genotyping of NGS-based STR typing results, Forensic Sci. Int. Genet. 34 (2018) 197–205.
[2] Ø. Bleka, G. Storvik, P. Gill, EuroForMix: an open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts, Forensic Sci. Int. Genet. 21 (2016) 35–44.
[3] Ø. Bleka, L. Prieto, P. Gill, CaseSolver: an investigative open source expert system based on EuroForMix, Forensic Sci. Int. Genet. 41 (2019) 83–92.
[4] H. Haned, K. Slooten, P. Gill, Exploratory data analysis for the interpretation of low template DNA mixtures, Forensic Sci. Int. Genet. 6 (2012) 762–774.
[5] H. Haned, J. de Jong, LRmix Studio 2.1 User Manual, (2016) http://lrmixstudio.org/download/manual.pdf.
[6] H.L. Hwa, M.Y. Wu, W.C. Chung, T.M. Ko, C.P. Lin, H.I. Yin, et al., Massively parallel sequencing analysis of nondegraded and degraded DNA mixtures using the ForenSeq system in combination with EuroForMix software, Int. J. Legal Med. 133 (2019) 25–37.
[7] K.B. Gettings, L.A. Borsuk, D. Ballard, M. Bodner, B. Budowle, L. Devesse, et al., STRSeq: a catalog of sequence diversity at human identification Short Tandem Repeat loci, Forensic Sci. Int. Genet. 31 (2017) 111–117.
[8] F.R. Wendt, J.D. Churchill, N.M. Novroski, J.L. King, J. Ng, R.F. Oldt, et al., Genetic analysis of the Yavapai Native Americans from West-Central Arizona using the Illumina MiSeq FGx forensic genomics system, Forensic Sci. Int. Genet. 24 (2016) 18–23.
[9] K.B. Gettings, K.M. Kiesler, S.A. Faith, E. Montano, C.H. Baker, B.A. Young, et al., Sequence variation of 22 autosomal STR loci detected by next generation sequencing, Forensic Sci. Int. Genet. 21 (2016) 15–21.
[10] N.M. Novroski, J.L. King, J.D. Churchill, L.H. Seah, B. Budowle, Characterization of genetic sequence variation of 58 STR loci in four major population groups, Forensic Sci. Int. Genet. 25 (2016) 214–226.
[11] F. Casals, R. Anglada, N. Bonet, R. Rasal, K.J. van der Gaag, J. Hoogenboom, et al., Length and repeat-sequence variation in 58 STRs and 94 SNPs in two Spanish populations, Forensic Sci. Int. Genet. 30 (2017) 66–70.
[12] Z. Wang, D. Zhou, H. Wang, Z. Jia, J. Liu, X. Qian, et al., Massively parallel sequencing of 32 forensic markers using the precision ID GlobalFiler NGS STR panel and the Ion PGM System, Forensic Sci. Int. Genet. 31 (2017) 126–134.
[13] E.H. Kim, H.Y. Lee, S.Y. Kwon, E.Y. Lee, W.I. Yang, K.J. Shin, Sequence-based diversity of 23 autosomal STR loci in Koreans investigated using an in-house massively parallel sequencing panel, Forensic Sci. Int. Genet. 30 (2017) 134–140.
[14] L. Devesse, D. Ballard, L. Davenport, I. Riethorst, G. Mason-Buck, D. Syndercombe Court, Concordance of the ForenSeq system and characterisation of sequence-specific autosomal STR alleles across two major population groups, Forensic Sci. Int. Genet. 34 (2018) 57–61.
[15] K.B. Gettings, L.A. Borsuk, C.R. Steffen, K.M. Kiesler, P.M. Vallone, Sequence-based U.S. population data for 27 autosomal STR loci, Forensic Sci. Int. Genet. 37 (2018) 106–115.
[16] D.S.B.S. Silva, F.R. Sawitzki, M.K.R. Scheible, S.F. Bailey, C.S. Alho, S.A. Faith, Genetic analysis of Southern Brazil subjects using the PowerSeq AUTO/Y system for short tandem repeat sequencing, Forensic Sci. Int. Genet. 33 (2018) 129–135.
[17] A.C. Jager, M.L. Alvarez, C.P. Davis, E. Guzman, Y. Han, L. Way, et al., Developmental validation of the MiSeq FGx forensic genomics system for targeted next generation sequencing in forensic DNA casework and database laboratories, Forensic Sci. Int. Genet. 28 (2017) 52–70.