Accepted Manuscript
Title: Massively parallel sequencing of Identifiler and PowerPlex® Y amplified forensic samples
Author: Ryan England, Nicholas Curnow, Alex Liu, Janet Stacey, SallyAnn Harbison
PII: S1875-1768(15)30233-X
DOI: http://dx.doi.org/doi:10.1016/j.fsigss.2015.09.084
Reference: FSIGSS 1081
To appear in: Forensic Science International: Genetics Supplement Series
Received date: 4-9-2015
Accepted date: 15-9-2015
Please cite this article as: Ryan England, Nicholas Curnow, Alex Liu, Janet Stacey, SallyAnn Harbison, Massively parallel sequencing of Identifiler and PowerPlex® Y amplified forensic samples, Forensic Science International: Genetics Supplement Series http://dx.doi.org/10.1016/j.fsigss.2015.09.084
Massively parallel sequencing of Identifiler and PowerPlex® Y amplified forensic samples
Ryan England a, Nicholas Curnow a, Alex Liu a,b, Janet Stacey a, and SallyAnn Harbison a,*
a Institute of Environmental Science and Research Ltd, Private Bag 92021, Auckland 1142, New Zealand
b Department of Forensic Science, School of Chemical Sciences, The University of Auckland, Auckland, New Zealand
*Corresponding author: [email protected], T: +64 9 815 3670
Abstract
In the last few years the cost of massively parallel sequencing has reduced dramatically, to the point that it can now be practically considered as a tool in forensic casework. An important consideration for the implementation of any new approach is the ability to remain compatible with previous technology. With this in mind we conducted two sets of experiments to evaluate sequencing the previously amplified products of two commercial forensic STR multiplexes. Samples amplified with AmpFlSTR® Identifiler® and PowerPlex® Y were sequenced on the Illumina© MiSeq and Ion PGM™ Sequencer (Life Technologies). We found it is possible to sequence such amplified DNA and to accurately determine the STR genotype of forensic samples using both platforms. Sequencing these STR loci provided extra information in the form of sequence variation, something that is not possible when measuring amplicon length alone. From these results we begin to characterise the sequence data from a forensic perspective, by looking at sequence variation within the repeats, stutter and heterozygote imbalance.

1. Introduction
Current methods for determining the DNA profile of a biological sample found at, or associated with, a crime scene involve the use of a multiplex PCR reaction targeting short tandem repeats (STRs) [1]. The amplified products are detected and measured using capillary electrophoresis (CE). The length of the repeat region is determined, but any sequence variation in the STRs goes undetected. Massively parallel sequencing (MPS) [2-5] provides a new tool to determine the STR profile of a sample, together with the collection of any sequence variation found in the repeat structures, providing an extra opportunity for discrimination [2].

The purpose of this study was to find out if sequencing DNA previously amplified with two commercial STR multiplex kits, with fluorophores attached, was possible on two MPS platforms: the Ion PGM™ Sequencer (Ion PGM™) and the Illumina© MiSeq (MiSeq). Two sequencing projects to assess this question were carried out and are described below. The first involved sequencing the amplified products of five (three male and two female) known buccal samples in duplicate. These were sequenced on the Ion Torrent PGM and the Illumina MiSeq. The second experiment involved sequencing amplified product from a Quality Assurance (QA) sample set. This was performed on two replicate Ion PGM™ sequencing runs.

2. Five samples on the Ion PGM™ and MiSeq
Materials and Methods: DNA was extracted, the autosomal STRs were amplified with the Identifiler® PCR Amplification Kit (Thermo Fisher Scientific), and the Y chromosome STRs were amplified with the PowerPlex® Y (PPY®) system (Promega). The DNA profiles of all samples were determined by capillary electrophoresis (CE) using a 3130xl Genetic Analyzer (Applied Biosystems®) and Genemapper software v3.2. For male samples the Identifiler® and PPY® amplified products were combined, and all amplicons were purified using AMPure XP PCR Purification (Agencourt® Beckman Coulter®).
Barcoded sequencing libraries for the Ion PGM™ were prepared for each sample using a KAPA DNA Library Preparation Kit for Ion Torrent (KAPA Biosystems©). The libraries were quantified using the Ion Library Quantification Kit (Thermo Fisher Scientific). Libraries were diluted and pooled for template preparation on the OneTouch™ 2 (OT2) instrument using the Ion PGM™ Template OT2 400 Kit. The template was loaded onto an Ion 316™ v2 Chip and sequenced using the Ion PGM™ 400 sequencing kit on the Ion PGM™ sequencer. Libraries for MiSeq sequencing were prepared with the Prep2Seq™ DNA Library Prep Kit for Illumina™ (Affymetrix), using the pooled KAPA Ion Torrent libraries as the starting material. The Prep2Seq™ DNA libraries were quantified using the KAPA Library Quantification Kit for Illumina and were sequenced on the MiSeq using a MiSeq Reagent Kit v3 with 600 cycles for 2 × 300 bp paired-end reads.

Adapter trimming and barcode sorting were completed using the Ion Torrent Suite, BaseSpace® (Illumina), or cutadapt [6] and the Fastx barcode splitter [7]. FastQ quality trimmer [7] was used for quality trimming. The sequencing reads were aligned to the STR loci with STRait Razor (version 1.2) [4]. The output file from STRait Razor was used to call the STR alleles according to the following guidelines. A minimum of 50 reads aligned to a locus was required for alleles to be called. For an allele to be called, either a minimum of 10 reads (50-999 total reads aligned to that locus) or 1% of total reads (1000+ total reads) had to match the allele. A stutter threshold of 15% was used to assess whether observed alleles were likely to be stutter. The acceptable heterozygote balance (Hb) was set to a ratio between 0.5 and 2.0 (coverage of the shorter allele divided by coverage of the longer allele).
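The calling guidelines above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `call_locus`, the input format (allele designation mapped to read count) and the status strings are hypothetical, and the sketch assumes that stutter is only assessed at the position one full repeat shorter than a candidate allele, which the text does not specify.

```python
from decimal import Decimal

# Thresholds quoted in the text.
MIN_LOCUS_READS = 50        # reads aligned to a locus before any call is made
MIN_ALLELE_READS = 10       # per-allele minimum when 50-999 reads align
MIN_ALLELE_FRACTION = 0.01  # per-allele minimum share when 1000+ reads align
STUTTER_THRESHOLD = 0.15    # share of parent coverage treated as stutter
HB_MIN, HB_MAX = 0.5, 2.0   # acceptable heterozygote balance range


def call_locus(read_counts):
    """Call alleles at one locus from {allele designation: read count}.

    Assumption (not from the paper): stutter is checked only at the
    position one repeat shorter than another candidate allele.
    """
    # Decimal keys keep designations such as "15.2" exact when adding 1.
    counts = {Decimal(a): n for a, n in read_counts.items()}
    total = sum(counts.values())
    if total < MIN_LOCUS_READS:
        return {"status": "locus dropout", "alleles": [], "hb": None}

    min_reads = MIN_ALLELE_FRACTION * total if total >= 1000 else MIN_ALLELE_READS
    candidates = {a: n for a, n in counts.items() if n >= min_reads}

    called = []
    for allele in sorted(candidates):
        parent = candidates.get(allele + 1)
        if parent is not None and candidates[allele] <= STUTTER_THRESHOLD * parent:
            continue  # at or below 15% of the parent allele: treated as stutter
        called.append(allele)

    status = "called" if called else "no allele called"
    hb = None
    if len(called) == 2:
        # Coverage of the shorter allele divided by coverage of the longer one.
        hb = counts[called[0]] / counts[called[1]]
        if not HB_MIN <= hb <= HB_MAX:
            status = "heterozygote imbalance"
    return {"status": status, "alleles": [str(a) for a in called], "hb": hb}
```

For example, `call_locus({"14": 400, "13": 40, "17": 380, "16": 30})` treats the reads at 13 and 16 as stutter of 14 and 17 and reports a balanced heterozygote. A production caller would also need to handle mixtures and sequence-level variants, which this sketch ignores.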
Results: In general the quality of the MiSeq reads was higher along the full length of the sequencing reads, which meant that after quality trimming 66% of the MiSeq reads were long enough to span the entire STR repeat region and flanks required by STRait Razor. In comparison only 25% of the Ion PGM™ reads were aligned by STRait Razor. Both platforms produced variation in the coverage of the different STR loci. This variation could not be completely explained by either the size of the amplicons or the complexity of the STR repeat regions.

Overall from the Ion PGM™ data 72% of the alleles could be correctly called, another 21% were only partially called (due to either allele drop out or heterozygote imbalance), and 5% were not called because fewer than 50 reads aligned to the locus (locus dropout). There were seven cases of the FGA allele being incorrectly called. This was due to base calling errors in the six-T homopolymer found in FGA [8], which prevented us from distinguishing between FGA alleles that differ in length by 1 bp.

From the MiSeq data 87% of the alleles were correctly called, 5% were partially called, and 5% were not called. A further five loci were incorrectly called due to sequencing errors in the repeats that were not observed in the duplicate or in the original CE results. The larger proportion of MiSeq alleles correctly called is likely due to the increased coverage obtained from a MiSeq sequencing run. Running the samples on the larger Ion 318™ chip would likely lead to a higher number of correctly called loci from the Ion PGM™ data.

From these five individuals we found 35 alleles with sequence variation that could be used to distinguish them from an allele of the same length. Of these, seven were novel allele sequence variations:
D2S1338[22]TGCC[6]TTCC[13]GTCC[1]TTCC[2]
D2S1338[24]TGCC[7]TTCC[14]GTCC[1]TTCC[2]
D2S1338[24]TGCC[8]TTCC[13]GTCC[1]TTCC[2]
D2S1338[26]TGCC[7]TTCC[16]GTCC[1]TTCC[2]
D3S1358[18]TCTA[1]TCTG[2]TCTA[15]
D8S1179[15]TCTA[2]TCTG[1]TCTA[12]
D19S433[14.2]AAGG[1]AA[1]AAGG[1]TAGG[1]AAGG[13]

3. Sequencing of QA sample set
Materials and Methods: The QA sample set contained two reference samples, a blood stain on fabric, and a blood/semen mixture on fabric. The blood/semen mixture was differentially extracted, separating the epithelial and sperm fractions and giving a total of five samples.
DNA was amplified using both the Identifiler® and PPY® profiling kits. For male samples the Identifiler® and PPY® amplified products were combined, and all amplicons were purified using AMPure XP PCR Purification. Barcoded KAPA Ion Torrent libraries were prepared as above. Two replicate template preparations (Ion PGM™ Template OT2 400 kit) and sequencing runs (Ion PGM™ 400 sequencing kit and Ion 318™ v2 Chips) were completed for the five samples. All adapter trimming, barcode sorting and quality filtering bioinformatic steps were completed as above. The sequencing reads were aligned to STR loci using an in-house modified version of STRait Razor 2.0 [4] with a modified locus file. The guidelines from above were used to call the alleles of the samples.

Results: The two sequencing runs produced different total coverage, with the first run having an average of 19000 reads aligning to the STR loci compared to an average of 47000 reads for the second run. As the templates of each sequencing run were prepared using the same pooled libraries, this indicates there is substantial run-to-run variation. This is likely due to the manual processes involved in the OT2 template preparation and in chip loading. Despite this, the allele calling results of the two sequencing runs were very similar (Table 1). There were more loci in the first run that could not be called due to low coverage, and Run 1 had higher levels of heterozygote imbalance overall. In total 104 of the Identifiler® loci were full correct calls (both alleles correct and balanced), 33 loci had heterozygote imbalance, 12 loci had an allele dropout, and 11 had complete locus drop out (Table 1). These results are very similar to the Ion PGM™ data seen above in Section 2. Certain loci performed better than others. D21S11 and D18S51 were problematic due to low coverage: the majority of the samples had either locus drop out, allele drop out or heterozygote imbalance.
These loci also had low average coverage in the results from Section 2. This suggests the low coverage may be due to a lower abundance of these amplicons in the original Identifiler® sample, and is unlikely to be caused by the sequencing process.
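As a quick arithmetic check on the call counts reported in this section, the four categories can be converted to proportions. The sketch below is illustrative only. It assumes the counts cover every Identifiler® locus in Table 1 (16 columns including Amelogenin) for five samples in two runs. The variable names are hypothetical.

```python
# Call categories and counts as reported in the Section 3 results.
call_counts = {
    "full correct call": 104,
    "heterozygote imbalance": 33,
    "allele dropout": 12,
    "locus dropout": 11,
}

# Assumption: 16 loci per profile (Table 1 columns) x 5 samples x 2 runs.
expected_total = 16 * 5 * 2

total = sum(call_counts.values())
proportions = {category: n / total for category, n in call_counts.items()}

for category, share in proportions.items():
    print(f"{category}: {call_counts[category]} of {total} ({share:.3f})")
```

Under that assumption the four categories account for all 160 locus results, and full correct calls make up 0.65 of them. The Section 2 percentages are reported per allele rather than per locus, so the two sets of figures are comparable only approximately.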
Many loci were observed with elevated stutter above the levels expected from CE results (Table 1). There were also numerous reads containing either a one base pair insertion or a one base pair deletion within the repeat unit. In some cases the number of reads with the insertion/deletion was equal to the number of reads with the true allele sequence. These are likely caused by errors incorporated into the sequence during either the multiple polymerase amplifications or the base calling on the Ion PGM™ instrument [5].

The sequencing worked equally well using both the reference samples and the ‘casework’ type samples. In particular we were able to identify the majority of the alleles from the mixture sample (Item 4 epithelial cells, Table 1).

4. Conclusion
We found it is possible to sequence the amplified products of two STR multiplex kits, Identifiler® and PPY®, on two MPS platforms and to accurately, albeit incompletely, determine the STR genotype of forensic samples using both platforms. Sequencing these STR loci provided extra information in the form of sequence variation, something that is not possible when measuring amplicon length alone. By characterising features of the DNA sequence profiles, such as stutter and imbalance, we identified areas for future development that are needed before casework implementation can occur.

Acknowledgements: This research was funded by Core Funding from Institute of Environmental Science & Research Ltd.
Role of Funding: None
Conflict of interest: ESR has been provided with an Ion Torrent PGM™ sequencer on loan from Thermo Fisher Scientific. All reagents and consumables are paid for by ESR. SAH received funding from Thermo Fisher Scientific to attend the HIDS conference 2015 in Madrid.

References
1. Butler JM (2012) Short Tandem Repeat (STR) Loci and Kits. Advanced Topics in Forensic DNA Typing: Methodology. San Diego: Elsevier Academic Press. pp. 99-140.
2. Scheible M, Loreille O, Just R, et al. (2014) Short tandem repeat typing on the 454 platform: Strategies and considerations for targeted sequencing of common forensic markers. Forensic Science International: Genetics 12: 107-119.
3. Van Neste C, Vandewoestyne M, Van Criekinge W, et al. (2013) My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing. Forensic Science International: Genetics 9.
4. Warshauer DH, Lin D, Hari K, et al. (2013) STRait Razor: A length-based forensic STR allele-calling tool for use with second generation sequencing data. Forensic Science International: Genetics 7: 409-417.
5. Fordyce SL, Mogensen HS, Børsting C, et al. (2014) Second-generation sequencing of forensic STRs using the Ion Torrent™ HID STR 10-plex and the Ion PGM™. Forensic Science International: Genetics 14.
6. Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17.
7. (2010) Fastx Toolkit. FASTQ/A short-reads pre-processing tools. 0.0.13 ed. http://hannonlab.cshl.edu/fastx_toolkit/index.html.
8. Parson W, Strobl C, Huber G, et al. (2013) Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM). Forensic Science International: Genetics 7: 543-549.
Table 1: Identifiler® STR loci profile results from QA samples sequenced in replicate on the Ion PGM™ sequencer. Green boxes: allele profile is balanced and concordant with CE. Yellow boxes: alleles show heterozygote imbalance greater than 50%, with the lowest coverage allele following the > symbol. Red boxes: allele drop out. Orange boxes: locus drop out.

| Sample | D8 | D21 | D7 | CSF | D3 | TH01 | D13 | D16 | D2 | D19 | vWA | TPOX | D18 | Amel | D5 | FGA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ref sample 1: CE | 11 14 | 30 35 | 8 10 | 10 10 | 14 17 | 9 9.3 | 8 12 | 11 11 | 17 19 | 12 16.2 | 15 17 | 8 9 | 12 17 | XX | 12 13 | 20 23 |
| Ref sample 1: PGM Run 1 | 11-1 14 | N | 8 10 | 10 10est | 14 17 | 9 9.3 | 8 >12est | N | 17est-1+1 19 | 12 16.2 | 15 >17+1 | 8 9 | N | XX | 12est+1 13+1 | 20 23-1 |
| Ref sample 1: PGM Run 2 | 11est-1 14 | 30est-1 | 8 10 | 10 10 | 14 17 | 9 9.3 | 8 >12 | 11 11est | 17est-1+1 19 | 12 16.2est | 15 >17+1 | 8 9 | 12 >17 | XX | 12+1 13+1 | 20-1+1 23+1 |
| Ref sample 2: CE | 11 16 | 28 31.2 | 9 11 | 11 12 | 15 16 | 9 10 | 8 12 | 12 12 | 23 25 | 15.2 16 | 14 17 | 8 11 | 15 17 | XY | 11 12 | 20 21 |
| Ref sample 2: PGM Run 1 | 11 >16 | 28-1 | 9 11 | 11 12 | N | 9 10 | 8 >12 | N | N | 15.2 16 | 14 >17est | 8 >11 | N | Y >X | 11+1 12+1 | 20-1 21-1 |
| Ref sample 2: PGM Run 2 | 11 >16 | 28-1 >31.2+1 | 9 11 | 11 12 | 15est 16 | 9 10 | 8 >12 | 12 12est | 1sub | 15.2est 16 | 14sub >17est | 8 11 | 15est >17 | Y >X | 11+1 12+1 | |
| Item 3 Blood: CE | 11 14 | 30 35 | 8 10 | 10 10 | 14 17 | 9 9.3 | 8 12 | 11 11 | 17 19 | 12 16.2 | 15 17 | 8 9 | 12 17 | XX | 12 13 | 20 23 |
| Item 3 Blood: PGM Run 1 | 11 14 | 29.3 | 8 10est | 10 10est | 14 17 | 9 9.3 | 8-1 >12 | 11 11est | 17est-1+1 19 | 12 16.2 | 15 >17+1 | 8 9 | 12 | XX | 12est+1 13+1 | 20-1 23est- |
| Item 3 Blood: PGM Run 2 | 11-1 14 | sub | 8 10 | 10 10-1 | 14 17 | 9 9.3 | 8-1 >12 | 11 11 | 17est-1+1 19 | 12 16.2est | 15+1 >17 | 8 9 | 12 >17 | XX | 12+1 13+1 | 20-1 23-1 |
| Item 4 Blood/semen epi cells: CE | 11 14 16 | 28 30 31.2 35 | 8 9 10 11 | 10 11 12 | 14 15 16 17 | 9 9.3 10 | 8 12 | 11 12 | 17 19 23 25 | 12 15.2 16 16.2 | 14 15 17 | 8 9 11 | 12 15 17 | XY | 11 12 13 | 20 21 23 |
| Item 4 Blood/semen epi cells: PGM Run 1 | 11-1 14 16 | 28 >29.3 | 8 9 10 11 | 10 11 12 | 14 17est 15 16 | 9 9.3 10 | 8-1 >12 | 11 12 | 17est-1+1 19 | 12 15.2 16 16.2 | 15 17+1 14 | 8 9 11 | 12 15 | XY | 12+1 11+1 13+1 | 20-1 23 21 |
| Item 4 Blood/semen epi cells: PGM Run 2 | 11 14 >16 | 28 | 8 9 10 11 | 10 11 | N | 9 9.3 10 | 8 12 | 11 12 | 17est-1+1 19 23 25 | 12 16.2 15.2 | 15 17 14 | 8 9 11 | N | XY | 12 11 13 | 20 21 23 |
| Item 4 Blood/semen sperm cells: CE | 11 16 | 28 31.2 | 9 11 | 11 12 | 15 16 | 9 10 | 8 12 | 12 12 | 23 25 | 15.2 16 | 14 17 | 8 11 | 15 17 | XY | 11 12 | 20 21 |
| Item 4 Blood/semen sperm cells: PGM Run 1 | 11est-1 >16 | 28-1 | 9 11 | 11 12 | 15 16 | 9 10 | 8-1 >12+1 | N | 23 25 | 15.2 16 | 14 >17est | 8 >11 | N | Y >X | 11+1 12+1 | 20 21 |
| Item 4 Blood/semen sperm cells: PGM Run 2 | 11est-1 16est | 28-1 >31.2+1 | 9 11 | 11 12 | 15 16 | 9 10 | 8-1 >121+1 | 12 12est | 23est-1+1 >25-1 | 15.2est 16est | 14 17est-1 | 8 11 | 15 17 | Y >X | 11+1 12+1 | 20-1 21 |

-1: reads that match the correct allele but contain a single base deletion.
+1: reads that match the correct allele but contain a single base insertion.
est: allele has elevated stutter above 15% of the allele coverage.
sub: reads that match the correct allele but contain a single base substitution.

Cell fragments displaced from the table in the source, original position not recoverable: 29.3 30est; 23+1 25-; 20-1+1 211+1; 1.
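The cell annotations defined in the Table 1 legend (`-1`, `+1`, `est`, `sub`, and the `>` prefix marking the lowest coverage allele) are compact enough to parse mechanically. The sketch below is one hedged way to do that in Python. The `AlleleCall` class, the `parse_cell` function and the regular expression are hypothetical helpers written for illustration, not part of the study's pipeline. Cells containing only `N`, which the source does not define, yield an empty list.

```python
import re
from dataclasses import dataclass, field

# One allele token: optional ">" prefix, an allele designation
# (e.g. 11, 9.3, X, Y), then any number of legend flags.
TOKEN = re.compile(
    r"(>?)\s*(\d+(?:\.\d)?|[XY])((?:\s*(?:est|sub|[+-]1))*)",
    re.IGNORECASE,
)
FLAG = re.compile(r"est|sub|[+-]1")


@dataclass
class AlleleCall:
    allele: str                  # designation as printed, e.g. "16.2"
    lower_coverage: bool         # True when the allele follows ">"
    flags: list = field(default_factory=list)  # e.g. ["est", "-1"]


def parse_cell(cell):
    """Split one Table 1 cell into annotated allele calls."""
    calls = []
    for marker, allele, flag_text in TOKEN.findall(cell):
        flags = FLAG.findall(flag_text.lower())
        calls.append(AlleleCall(allele.upper(), marker == ">", flags))
    return calls
```

For instance, `parse_cell("14sub >17est")` returns allele 14 flagged `sub` and allele 17 flagged `est` with `lower_coverage` set. Truncated or garbled cells are parsed literally, so any downstream tally should treat them with caution.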