Understanding the characteristics of sequence-based single-source DNA profiles

Understanding the characteristics of sequence-based single-source DNA profiles

Journal Pre-proof Understanding the characteristics of sequence-based single-source DNA profiles Sarah Riman, Hari Iyer, Lisa A. Borsuk, Peter M. Vallo...

4MB Sizes 0 Downloads 9 Views

Journal Pre-proof Understanding the characteristics of sequence-based single-source DNA profiles Sarah Riman, Hari Iyer, Lisa A. Borsuk, Peter M. Vallone

PII:

S1872-4973(19)30334-5

DOI:

https://doi.org/10.1016/j.fsigen.2019.102192

Reference:

FSIGEN 102192

To appear in:

Forensic Science International: Genetics

Received Date:

17 July 2019

Revised Date:

17 October 2019

Accepted Date:

20 October 2019

Please cite this article as: { doi: https://doi.org/ This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier.

Understanding the characteristics of sequence-based single-source DNA profiles

Authors: Sarah Riman1*, Hari Iyer2, Lisa A. Borsuk1, Peter M. Vallone1

of

Author Information: 1

ro

U.S. National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Drive, Gaithersburg, MD 20899-8314, USA. Electronic address: [email protected]. 2

Highlights 

re

author: [email protected]

lP

*Corresponding

-p

U.S. National Institute of Standards and Technology, Statistical Engineering Division, 100 Bureau Drive, Gaithersburg, MD 20899-8314, USA.

We amplify and sequence STR loci over a range of template amounts of single-source DNA samples using different library construction workflows and pooling approaches. We use statistical tools to study the characteristics of single-source DNA profiles generated by

ur na



targeted sequencing. 

We use a simple cost-based approach for selecting a value for analytical threshold.



The Apparent Heterozygote Balance (AHb) and Receiver Operating Characteristic (ROC) curves are used as metrics for setting zygosity thresholds. The characteristics of the sequence-based STR profiling are compared to those obtained with

Jo



the CE-based typing methods.

Abstract The sequencing of STR markers provides additional information present in the underlying sequence variation that is typically masked by traditional fragment-based genotyping. However, 1

the interpretation of STR profiles generated by targeted sequencing methods are susceptible to the same factors encountered in profiles processed through capillary gel electrophoresis. These factors include stochastic variation, noise, stutter artifacts, heterozygote imbalance, and allelic dropout/in. Our goal is to characterize and understand how these behave in targeted sequence datasets. Here, we developed a framework using statistical tools to systematically interpret the characteristics of single-source DNA profiles generated by targeted sequencing. Sensitivity studies were performed using known single-source samples amplified with the PowerSeq 46GY System

of

Prototype with varying DNA target masses ranging from 15 pg to 500 pg. The STR loci were subjected to DNA library preparation using two commercially available library kits and sequenced on the Illumina MiSeq platform. Raw FASTQ data files were analyzed in STRait Razor v2.0

ro

without applying any thresholds (at a coverage ≥ 1). We investigated the effect of library normalization on average locus coverage and studied methods for setting analytical and zygosity

-p

thresholds. All the data were analyzed per DNA quantity as well as investigated per method. Analyses presented can be applied to sequence data generated by similar targeted sequencing

lP

re

panels and/or NGS platforms.

Keywords: Next Generation Sequencing (NGS); Short Tandem Repeat (STR); Analytical

ur na

Threshold; Receiver Operating Characteristic (ROC) plots; Noise; Zygosity

1 Introduction

Forensic DNA laboratories rely on the fluorescence-based capillary electrophoresis (CE) technology to genotype short tandem repeat (STR) markers. The DNA extracts are amplified by polymerase chain reactions (PCR) using commercial kits with dye-tagged primers to multiplex

Jo

STR fragments with overlapping size ranges. Fluorescently-labeled PCR products are then separated via CE according to their molecular weight. The outcome is electrophoretic data composed of peaks with their quantitative information, heights and sizes, expressed in relative fluorescent units and base pairs, respectively (Fig. 1) [1, 2]. Some of the limitations of the CE methods relate to the generation of only the length variants of STRs, poor allelic resolution at certain loci [3-5], and restriction in the number of STR markers that can be multiplexed in one kit [6-8]. 2

The arrival of the Next Generation Sequencing (NGS) technologies and the recent improvements in cost, chemistry, and throughput have elicited an interest in applying this new approach to forensic genetics and sequencing the STR loci [9-12]. Currently, even with the NGS workflows, PCR amplification is still a standard step to enrich for the STR markers. However, unlike CE, the target-specific primers do not need to be fluorescently labeled. Rather, the amplified products are taken into library construction, a multi-step process that flank the STR amplicons on both ends by adapters and produce sequenceable DNA libraries [13-15]. The outcome of the

of

clonally amplified and sequenced DNA libraries is raw digital data of sequencing reads containing nucleotide base calls and quality scores [16]. Different available data mining algorithms or software can identify the sequence and length-based polymorphisms and quantify the read counts

ro

(coverage) of STRs from raw data [17-21].

The major advantages offered by NGS over CE methods when studying STR markers is

-p

the ability to provide the sequence of both the repeat and flanking regions of alleles. This additional information can resolve alleles identical by length but different in sequences, increase the

re

discriminating power of loci, and assist in mixture interpretation and complex paternity cases [5, 22-24]. Another notable advantage of NGS is the ability to multiplex samples, amplify, and

lP

sequence large sets of forensic markers simultaneously in a single reaction [13-15, 25]. To allow compatibility with the CE-STR data, much of the recent NGS research work has focused on sequencing STR loci from different population samples, generating sequence-based

ur na

allele frequencies, and developing bioinformatic tools for sequence analysis [26-34]. To date, a limited number of studies have outlined methods for validating NGS-STR multiplexes, setting thresholds, and modeling the results of the sequencing data [13, 35-40]. Characterizing NGS targeted STR sequences obtained from single-source and mixture DNA profiles could assist in extending NGS-STR genotyping to forensic casework and developing probabilistic approaches for

Jo

sequences and coverage. To that end, in this pilot study we amplified and sequenced low and high template levels of single-source DNA samples using the prototype version of the PowerSeq 46 GY NGS STR assay and MiSeq platform. In the first part of this paper, we study the effect of two library construction workflows on the library product yield along with the consequences of the normalization and non-normalization approaches of the DNA libraries on depth of coverage. The remainder of the work focuses on the characteristics and behavior of single-source NGS profiles using different library kits and normalization protocols. We explore (1) the impact of DNA 3

template on the distribution of known alleles, stutter, and noise sequences, (2) a simple approach for setting an analytical threshold (AT) based on the cost tradeoff of the number of allelic dropouts versus the number of drop-ins, (3) the Receiver Operating Characteristic (ROC) curves as an approach for setting zygosity thresholds for single-source samples, and (4) coverage variability of known allelic sequences at heterozygote loci as measured by heterozygote balance. The characteristics of the studied PCR sequence-based STR profiling are compared to those obtained

Materials and methods

ro

2

of

with the CE-based typing methods.

2.1 DNA samples quantification and preparation

Sensitivity experiments were conducted using three biological single-source male DNA extracts

-p

of known genotypes (referred to herein as A, B, and C). These DNA samples were selected from the NIST population dataset [42] and quantified on ABI 7500 Real Time PCR System (Thermo

re

Fisher Scientific, Foster City, CA) using the PowerQuant kit (Promega Corporation, Madison, WI) following the manufacturer’s recommended cycling instructions and NIST SRM 2372 as an

lP

external calibrant [43]. Based on the quantification results, a series of two-fold dilutions of each of the three DNA extracts were prepared to give the following total DNA input when added at 2 µL into the PCR reaction: 15 pg (9 in total); 30 pg (9 in total); 60 pg (9 in total); 125 pg (9 in total);

ur na

250 pg (6 in total); and 500 pg (4 in total). For each sample, three replicates were generated for 15 pg – 125 pg DNA input and two replicates for the 250 pg. For the 500 pg input mass, there were two replicates for sample A and only one replicate for each of sample B and C. All work presented has been reviewed and approved by the NIST Human Subjects Protections Office.

Jo

2.2 STR typing using CE-based methods Samples were amplified using the PowerPlex Fusion multiplex kit (Promega) with a reaction volume of 25 µL at 30 cycles on GeneAmp PCR System 9700 Thermal Cycler (Thermo Fisher Scientific) following the manufacturer’s protocol [44]. No template controls (NTCs), positive controls, and PCR reactions containing DNA samples at six different DNA amounts mentioned above were amplified. Amplified fragments were then separated and detected on the Applied Biosystems 3500xL Genetic Analyzer (Thermo Fisher Scientific). The injection parameters were 4

set to 1.2 kV for 15 s. The raw data files collected from the CE run of DNA samples from the dilution series (n = 46), positive controls (n = 3), and NTCs (n = 3) were analyzed in GeneMapper ID-X version1.5 (Thermo Fisher Scientific). Allele calling and peak height information were generated using an analytical threshold set at one relative fluorescence unit (RFU) in all dye channels. An example of PCR plate including all the samples amplified and typed is illustrated in Supplementary Fig. 1.

of

2.3 STR typing using NGS-based methods In parallel with the CE-based method for typing STRs, each of the three DNA samples were amplified at six DNA target quantities mentioned above with the PowerSeq 46GY System

ro

Prototype (Promega Corporation), a PCR-based 46-loci multiplex kit [15]. The kit includes the same loci included in PowerPlex Fusion [44] (22 autosomal STR loci, AMEL, and 1 Y-STR locus

-p

(DYS391)), as well as the 23 Y-STR loci contained in the PowerPlex Y23 System [45]. Samples were prepared in a final volume of 25 µL and amplified at 30 cycles on GeneAmp PCR System

re

9700 Thermal Cycler (Thermo Fisher Scientific) following the manufacturer’s protocol [15]. Amplified PCR products were cleaned-up with AMPure XP (Beckman Coulter Inc., Brea, CA)

lP

using a 3X beads-to-sample volumetric ratio. To avoid bead clumping, 5 μL of 504 μg/mL Proteinase K was added to the 25 μL amplification reaction [15]. Beads were washed following the manufacturer’s instructions [46], and purified DNA was then eluted off the beads in 30 μL of

ur na

10 mmol/L Tris-HCl pH 8.5 (Teknova, Hollister, CA). DNA concentrations of the eluted amplicons were quantified using the Quant-iT PicoGreen dsDNA assay kit (Thermo Fisher Scientific) following manufacturer’s instructions [47]. An example of 96-well PCR plate containing 92 DNA samples and two negative controls with no DNA template amplified with PowerSeq 46GY System Prototype is illustrated in Supplementary Fig. 2 and a detailed description

Jo

of the protocols of the amplification setup, thermal cycling parameters, and preparation of amplification products for sequencing are depicted in Supplementary Fig. 3.

2.4 DNA library preparation for Illumina sequencing Two different library construction kits were used to manually prepare DNA libraries, TruSeq DNA PCR-Free High Throughput (HT) Library Prep Kit (Illumina, San Diego, CA) and Kapa Hyper Prep Kit (Kapa Biosystems, Roche, San Francisco, CA). Of the 94 amplified samples, 47 were 5

subjected to library preparation using TruSeq DNA PCR-Free HT kit and the other 47 were prepared via Kapa Hyper Prep kit. One ligation reaction control lacking insert DNA was included during library preparation per each kit (Supplementary Fig.2). Both kits utilized the same enzymatic processes (end repair, A-tailing, and ligation) but have differences in the enzymes used, overall protocol, and number of steps. The detailed preparation steps of the two library kits are shown in Supplementary Fig.3. Briefly, the total elution volume (~ 25 μL) of each purified amplification reaction was used as DNA input to end repair, A-tailing, and adaptor ligation

of

according to the suppliers recommended protocols [41, 48]. All 96 libraries were prepared for cluster generation using the 96 uniquely indexed adapter combinations contained in the TruSeq DNA PCR-Free HT Library Prep Kit (Illumina). Ligation products were purified using a 0.7X

ro

volume of AMPure XP Beads and DNA libraries were eluted off beads following manufacturer’s

-p

procedures [15, 48].

2.5 Sequencing

re

To study the effect of library normalization on downstream analysis of STRs, libraries were either normalized and then sequenced or pooled for sequencing without normalization. In both scenarios,

lP

libraries were multiplexed in batches of 96 samples. A schematic diagram showing the difference between the normalization (N) and non-normalization (NN) workflows is shown in Supplementary Fig. 4. For library normalization, each individual library generated by TruSeq and Kapa was

ur na

quantified on the 7900HT Fast Real-Time QPCR System (Thermo Fisher Scientific) using the Kapa Library Quantification Kits for Illumina platforms (Kapa Biosystems, Roche) according to manufacturer’s technical guide [49]. Based on the DNA concentration and average library fragment size, libraries were then individually normalized (diluted in 10 mmol/L Tris-HCl pH 8.5) to an equal concentration of 123 pmol/L, and then pooled by equal volume (volumetric pooling).

Jo

Libraries were denatured using a modified denaturation protocol and hybridization buffers for libraries with subnanomolar concentrations following the protocol published in [50]. A detailed protocol for preparing subnanomolar libraries is presented in Supplementary Protocol 1. For library non-normalization approach, another sequencing run was performed using the same 96 libraries generated by TruSeq and Kapa. However, instead of combining libraries at equimolar amounts, only volumetric library pooling was performed. Briefly, equal volumes of each individual library were combined into one pool. This library pool was then quantified using the 6

Kapa Library Quantification Kit (Kapa Biosystems, Roche), normalized to 4 nmol/L, and denatured for DNA sequencing following the MiSeq denaturation and dilution guide [51] . The normalized and non-normalized libraries were sequenced separately. Both sequencing runs were prepared for loading onto the reagent cartridges by diluting the denatured library pools to a final concentration of 13.5 pmol/L and spiking-in the control PhiX DNA at 1.5 % [15]. Paired-end sequencing was performed on the MiSeq FGx System (Illumina) in Research Use Only (RUO)

of

mode using 600 cycle (2×300bp reads) MiSeq v3 Reagent Kit (Illumina) [52].

2.6 Bioinformatic analysis

Raw FASTQ files were generated on the MiSeq FGx System. The sequencing data was then

ro

analyzed using the open source STRait Razor v2.0 software to identify the sequences, allele length,

to capture as much data as possible. 2.7 Method dependent analytical threshold

-p

and coverage of the STR markers and regions [53]. A minimum depth of coverage of 1X was used

re

In this study, a cost-based approach was used to set analytical thresholds (ATs) of the five different methods (Kapa-N, TruSeq-N, Kapa-NN, TruSeq-NN, and CE). This approach is based on the

lP

balance between the number of allelic drop-out and drop-in events. Here, we made this balance assessment by using a cost parameter “c” that specifies the cost of a drop-out in terms of the number of drop-ins by applying a series of different thresholds to the entire empirical dataset. For

ur na

example, if a laboratory is willing to accept 5 drop-ins to avoid a single drop-out, then this can be expressed by using a value of c = 5. If the intent is to avoid any drop-outs, then a higher value of c should be set (for e.g. c = 100). A c = 100 would mean that 100 drop-in events are tolerated to avoid a single drop-out. On the other hand, if the intent is to allow more drop-outs (e.g. 3) to avoid a single drop-in, then a c = 1/3 is used. Thus, c would be the relative cost of a drop-out versus a

Jo

drop-in event. The total cost associated with any considered threshold is evaluated using empirical data generated by the laboratory and the threshold value that results in the lowest total cost is chosen as the value for AT. The cost-based approach to select an AT value for each method was implemented as explained above. Data was combined across all loci, replicates, and template level for each method. AT was varied from 0 to 1000. Two outcome measures, (drop-out/-in events), were created for every possible threshold. The drop-out events were defined as the total number of known allelic 7

sequences or peaks with counts or heights below a given threshold. The drop-in events were defined as the total number of noise sequences or peaks with counts or heights equal to or above a given threshold. Noise was defined as any peak or sequence that was not attributed to known alleles or N-1 stutter. The rest of stutter artifacts (N-2 and N+1) were counted as noise. Further details on the categorization of peaks and sequences are provided in Section 3.3. To generate the total cost for each AT, the total number of drop-out events was multiplied by the cost parameter “c”, summed up, and added to the total number of drop-in events. For a given method, the cost per profile was determined by dividing the total cost by the total number of profiles in the empirical data set. The

of

“cost vs AT” curves were generated to evaluate the optimal AT (i.e. an AT with a minimum cost) as well as the performance of different methods in terms of detecting and distinguishing known

ro

allelic products from noise.

Below is a detailed example for the cost calculation using a c = 5 at AT of 218 for the Kapa-N

-p

method. The c = 5 and AT = 218 were used only for illustrative purposes and not as recommended

At AT=218: 

The number of alleles with coverage < 218 (over all 46 samples) was 70. Therefore, total drop-

lP

outs = 70 and cost = 70 x 5 = 350. 

re

parameters. The cost for each AT ranging from 0 to 1000 was calculated in a similar way.

The number of noise/artifact sequences with coverage ≥ 218 (over all 46 samples) was 45. Therefore, total drop-ins = 45 and cost = 45 x 1 = 45. The total cost of all drop-outs and drop-ins = 350 + 45 = 395.

ur na



The 395 value is the total cost over all 46 profiles. Therefore, a cost per profile = 395/46 = 8.586. It is easily verified that any other choice for AT will result in a higher cost per profile for the data set we used.

Jo

2.8 Concepts and definitions

We begin by defining the terms that were used to conduct the analysis of the results obtained from NGS methods. The statistics and data analysis were done for only the 22 autosomal STR loci, AMEL, and 1 Y-STR locus (DYS391)). The additional YSTRs have not been considered. 

Raw Coverage: The raw coverage for a sequence at a locus is the read count obtained for that sequence at that locus

8



Average Locus Coverage (ALC): The average locus coverage for a known single-source profile is defined as: ALCProfile = ∑raw coverage of known allelic sequences across all loci total number of loci



Apparent Heterozygote Balance (AHb): The two sequences with the highest raw coverage are first identified at each locus. The lower of the two coverages divided by the higher of the two coverages is defined as the apparent heterozygote balance. This quantity is of interest when

of

we have a profile known to have come from a single-source but the zygosities of the different loci are not known.

Heterozygote Balance (Hb): The heterozygote balance is defined as the ratio of the lower

ro



coverage value to the higher coverage value for the two allelic sequences at a heterozygous locus, irrespective of their molecular weights. The distribution of Hb ratios using this definition 

-p

ranges between 0 and 1 [54-56].

Empirical Receiver Operating Characteristic (ROC) plots: An empirical ROC plot

re

associated with a metric that is used to discriminate between two groups (e.g. heterozygotes/homozygotes) is a two-dimensional graph generated by plotting the number of

lP

true positives (TP) in the empirical dataset vs the number of false positives (FP) in the empirical dataset at each given threshold (from 0 to infinity) for decision-making (Fig. 2) [57].

ur na

2.9 The use of AHb and ROC curves as metrics for setting zygosity thresholds In this study, empirical ROC plots were generated for distinguishing homozygote from heterozygote loci using AHb. Data was combined across all loci and replicates at each template level and method. The true positive counts (TP) and false positive counts (FP) were generated from the empirical data of the five different methods (Kapa-N, TruSeq-N, Kapa-NN, TruSeq-NN,

Jo

and CE) by applying a series of different thresholds to the entire dataset. Two outcome measures or coordinates, (FP; TP), were created for every possible threshold. Plotting all these coordinates generated the ROCs that captured all the thresholds (t) simultaneously. The points in the ROC plot were connected by line segments for visualization purposes. TP and FP were plotted on the Y-axis and X-axis, respectively (Fig. 2). A point that is closest to the top left corner on the ROC plot has a high number of true positives, and low number of false positives [57].

9

For evaluating thresholds for zygosity calling, the homozygosity was used here as the reference standard. A locus was called homozygous if the value of AHb was less than or equal to a given threshold value (t). ROC plots were constructed as follows: TP represented the counts of the number of correctly classified homozygotes at a given (t) value for AHb. FP represented the counts of the number of heterozygote loci incorrectly classified as homozygotes at the given threshold (Fig. 2). True negative counts (TN) and false negative counts (FN) were also deduced from the plotted ROCs. TN were the heterozygote loci correctly called heterozygotes. FN were the counts

of

of homozygotes falsely labeled as heterozygotes.

All the concepts and definitions mentioned above were used as well to analyze the CE data.

ro

Peak height information was used instead of coverage when calculating Average Locus Peak Height (APH), AHb, and Hb. We will show only the results of 15 pg and 125 pg as this will

-p

demonstrate one example each of low and high template amounts of DNA, respectively. The rest of the data for the various DNA amounts for Sections 3.4 and 3.6 are shown in the Additional files

re

1, 3, and 4. All data analysis and visualization discussed were conducted using the open source

3 Results and discussion

lP

software R [58].

3.1 Impact of library construction workflows on library product yield

ur na

A key difference in the NGS workflows as compared to CE methods is in the construction of high-quality sequencing libraries (Fig. 1). This procedural step is necessary to flank the STR amplicons at both ends with adapters. The universal sequences contained in these adapters are essential for multiplexing, hybridization, cluster amplification, and sequencing of the STR templates on solid surfaces [59]. Ligation-based library synthesis is one of the different strategies

Jo

that are used to prepare libraries. It generally entails a cascade of multi-enzymatic reactions. After amplifying STR regions from a variety of DNA sources and amounts, the STR amplicons which constitute the library DNA input are subjected to end-repair by enzymatic fill-in and trimming, phosphorylation of the 5’ ends, and addition of a single-dA to the 3’ end fragments. Adapters that have a single ‘T’ nucleotide overhang are then ligated to the end-repaired DNA, thus generating sequenceable libraries [59, 60]. A final evaluation of the library workflow employs the quantification of the adapters’ flanked STR templates by a quantitative PCR assay. This quality 10

control checkpoint determines the concentration of each individual library (i.e. the post-ligation library yield) and guides the multiplex pooling of libraries at optimal concentrations for bridge amplification on the flow cell. This ensures optimal cluster densities and high sequencing quality [61-63]. To examine if different NGS library workflows impact the STR library output and sequencing data, two different ligation-based library synthesis kits were used in this study: Illumina TruSeq DNA PCR-Free and Kapa HyperPrep Library kits [41, 48]. The DNA samples, PCR amplification

of

set up, bead clean-up ratios, and loading concentration were kept identical between the two kits. Samples prepared via the Kapa HyperPrep workflow required less preparation steps and time (Supplementary Fig. 3) and exhibited higher yields of adapter-ligated libraries at all DNA input

ro

amounts as compared to the samples that were constructed with the TruSeq kit (Fig. 3). The Kapa kit protocol has a streamlined, single-tube chemistry [48, 64]. It combines the three enzymatic

-p

reactions, end-repair, A-tailing, and adapter ligation into a single tube and eliminates the cleanup step in between the end-repair and A-tailing (Supplementary Fig. 3). According to the

re

manufacturer, the Kapa Hyper Prep protocol does not require a rigorous post-ligation cleanup due to the novel reagent formulation that inhibits adapter-dimer formation [48, 64]. We speculate that

lP

the enzymes/reagents formulation, and reduction in the liquid handling as well as clean-up steps could have decreased the loss of DNA input molecules, improved DNA recovery, and thus resulted in a high overall DNA yield (Fig. 3).

ur na

It is important to note and as will be discussed later in Sections 3.4, 3.5, 3.6, and 3.7 that from the limited number of samples in this pilot study, the higher yield observed with the Kapa kit did not exhibit any substantial difference on correct allele calling, signal and noise resolution, zygosity, heterozygote sequence balance, and drop-out events when compared to the TruSeq kit.

Jo

3.2 Effect of normalization and non-normalization strategies on STR depth of sequencing coverage

One of the advantages of using next-generation sequencers is the ability to pool and

sequence high number of samples simultaneously in a single run. Manufacturers of library kits recommend equimolar normalization of individual libraries prior to sequencing . The normalization process ensures that libraries are added in equimolar amounts to the sequencing

11

pool. The aim is to achieve reliable clustering, equal representation from all the pooled samples, and uniformity of sequence coverage for each library [61]. The multiplex pooling requires several steps that include determining the library size in base pairs (bp), quantifying the individual libraries by qPCR, diluting libraries of different concentrations to the same concentration, and finally combining equal volumes of the normalized libraries into one single pool (volumetric pooling). After the end of this step, the pool is further diluted and loaded for sequencing [61].

of

Library normalization is another procedural step that is not performed with the CE workflows. With the CE methods, a one µL aliquot of each amplified sample is individually analyzed on the Genetic Analyzer without prior normalization [44]. Vilsen et.al [37, 38] noted that library

ro

normalization approaches resulted in average coverage that were not correlated with DNA amounts which is in stark contrast to what is observed with CE empirical data. To understand the

-p

consequences of library normalization on downstream analysis, two different multiplex library pools were generated. One pool was produced by combining equal volumes of each of the

re

normalized Kapa and TruSeq libraries. Another one was constructed by pooling libraries immediately after adapter ligation without normalization and by only adding an equal volume of

lP

each library to a single tube (Supplementary Fig. 4). The normalized library pool and the nonnormalized one were sequenced separately on two different runs. As mentioned in the materials and methods section, the same set of samples used in this sequencing study was analyzed using

ur na

CE methods. To examine the relationship between the peak height or coverage and PCR template amounts, the average locus peak height (APH) and the average locus coverage (ALC) of singlesource profiles generated from the electropherogram (EPG) and normalized and non-normalized sequencing datasets were calculated and plotted against the amounts of input DNA into the amplification process (Fig. 4).

Jo

As has been shown previously with the CE workflows [65-68] and demonstrated in this study

as well (Fig. 4A), the APH associated with a profile is approximately proportional to the amount of input DNA present at the start of the amplification. However, this correlation is lost with normalized sequencing datasets. As shown in Fig. 4B, the ALC of known alleles from the normalized sequencing datasets (Fig. 4B) did not show a strong correlation (R < 0.4) to the DNA template for either of the library kits. This observation is attributed to library normalization approaches and as mentioned above has been shown by other studies using different STR kits and 12

NGS platforms [37, 38]. Similar to the APH within an EPG, the ALC of known alleles from the non-normalized sequencing datasets, showed to be a good indicator of the amounts of the DNA input (Fig. 4C). Thus, coverage obtained using the non-normalized methods could be modelled similarly as modeling peak heights in CE.

3.3 Categorization of sequence-based STR data We used the sensitivity study performed on single-source samples to characterize NGS-STR

of

profiles. We categorized the allelic, stutter and non-allelic noise sequences, and studied signal to noise ratio, stochastic effects, and variations in sequence coverage. The ground truth of both the length and sequence variation of the genotypes of the samples were well characterized and known

ro

in advance. Those samples have been extensively typed using different CE- and NGS-STR kits and sequenced on multiple platforms [31, 42]. All sequences generated from the STRait razor

-p

analysis with a coverage of ≥ 1 were grouped into four categories: (1) A = Known allele sequences

within an allelic sequence

re

(2) S1 = Back stutter of the longest uninterrupted stretch (LUS) of the basic repeat motifs

lP

(3) S2 = Back stutter sequences not attributed to S1 (4) N = Noise sequences (not A, S1, or S2)

Known allele sequences (A): A single or pair of sequences at a locus exhibiting the true

ur na

sequence of the known ground truth alleles were identified as homozygote or heterozygote allelic sequences. In NGS profiles known allelic sequences with sufficient amounts of input DNA are associated with the highest coverage at each locus with a rank position of either 1 (in case of homozygote alleles) or 1 and 2 (in case of heterozygote alleles). However, this is not always the case especially with low template DNA profiles where allele coverage can be low and stutter or

Jo

noise sequences can have a coverage greater than or equal to that of the allele. For example, in one of the samples amplified at 15 pg and prepared with the Kapa-N workflow, a CSF1PO locus exhibited two parental alleles, A1 and A2, ranked in positions 1 and 3 with a coverage of 1853 and 195, respectively. A sequence with a coverage of 223 had the same length (i.e. number of bases) as A1 but with a base substitution was in rank position 2, an i.e. at a position higher than the parental allele, A2.

13

Stutter products are PCR artifacts generated due to strand slippage during the amplification of STR loci resulting in a deletion (back stutter or N-1) or insertion (forward stutter or N+1) of an STR motif on the newly synthesized DNA strand [70-75]. Stutter products with more than one motif unit or 2 bp lower than the parent allele have been observed as well. The mechanism and rate of stutter have been extensively characterized with the CE data and studies have reported that the LUS is the best indicator of stutter ratio [66, 71, 76-79]. Since PCR amplification is a standard step for targeting STR loci with high-throughput sequencing, stutter products are generated and

of

observed with the sequencing data. Compared to the CE method, NGS generates a higher level of base-pair resolution data allowing a more detailed characterization of the different types of stutter variants generated from the various motifs of the same parent allele. The authors of recent

ro

publications showed that LUS is not the best predictor of stutter in sequence-based STR profiles [36, 39, 40]. In this work, we only considered stutter sequences in the N-1 position (back stutter)

-p

and further categorized them into S1 or S2. Back stutter sequences of the LUS of the basic repeat motifs within an allelic sequence were categorized as S1 and the rest of the back stutter sequences

re

including the 2 bp back stutter were denoted as S2. Sequences in the stutter position of N+1 (forward stutter) and N-2 (double back stutter) occurred less frequently than N-1 and were put into

lP

the same category as noise. Additional considerations are needed in order to perform comprehensive stutter analysis and this will be reported in future work. Noise sequences (N): We observed high number of sequences at each locus affected by errors

ur na

that were introduced either during the PCR amplification or sequencing processes [80, 81]. The errors were in the form of (1) substitutions (sequences with the same length of the parent allele and/or stutter but with ≥ 1 base substitution); (2) insertions; and (3) deletions. Sequencing data that were not allelic or back stutter sequences were categorized as noise.

Jo

3.4 Impact of DNA Template Amounts on the Distribution of Known Allele, Stutter, and Noise Sequences

Irrespective of the library kits and pooling approaches, the coverage of sequences was more

variable at low amounts of template. Low DNA input into PCR reactions generated allelic sequences with coverage in the range of the coverage of noise and stutter sequences. Thus, the uncertainty of allele calling could be high due to the stochastic effects (Fig. 5A and Additional file 1). As expected, improved discrimination between alleles (A) and the remainder of the sequences 14

(N, S1, and S2) was observed as the amounts of DNA template increased (Fig. 5B and Additional file 1). When the input amount of DNA is sufficiently high the distribution of coverage (read counts) for alleles is separated from the distribution of non-alleles. Similar observations were noticed from the CE data as well; however, unlike the sequencing data, with CE the N-1 type stutter variants (S1 and S2) were grouped together in the same bin and could not be differentiated (Fig. 5C and Additional file 1), hence referred to here as S only. 3.5 Setting analytical thresholds for sequence-based STR profiles

of

The main performance metric for any STR typing method in forensic genetics is its ability to discriminate between allelic and non-allelic signal (noise). The standard approach for

ro

discriminating between alleles and non-alleles is to establish a threshold, referred to as the Analytical Threshold (AT). Baseline or noise generated by CE instruments has been well studied [2]. Typically, profiles are collected from positive controls with varying amounts of DNA template

-p

and are analyzed in genotyping software using a threshold of one RFU. Allelic, stutter, and other artifactual peaks are discarded from the profiles. The mean (µ) and standard deviation (σ) of the

re

remaining peaks (noise observations) are estimated either per dye-color channel or on a per-locus basis. Then, AT is determined by substituting the values of µ and σ in the following equation: AT

lP

= µ + k* σ, where k is a numerical factor determined based on a specified confidence level [83, 84]. AT is commonly expressed in peak height (RFU). Peaks with RFU < AT are classified as noise and are discarded from the profile. Peak heights with RFU ≥ AT are manually inspected for

ur na

their morphology. Electronic spikes, dye artifacts, and spectral pull-ups are removed. The rest of the peaks are considered allelic signal (either known alleles or PCR byproducts) and are passed to the interpretation stages [85, 86].

Laboratories employing NGS technologies have started investigating analytical thresholds. For example, the authors of a recent work used sequencing data generated by the Illumina

Jo

ForenSeq System and set a per-locus AT by multiplying the locus-specific range of noise by a scaling factor: AT= c × (Noisemax − Noisemin) [35]. Another group of researchers have removed noise sequences by using dynamic thresholds created by fitting the distribution of general noise sequences [38]. In another work, the same authors assigned marker dependent AT by using the percentage of the maximum coverage of the marker [37]. In the developmental validation study of the MiSeq Forensics Genomic System, a 1.5 % of the total locus coverage was set as an AT [13].

15

The equations used to estimate the signal thresholds are different and can lead to variability across analysts and/or laboratories. Selecting an AT is an important task that can affect the dropout/-in rate and crucially impact downstream analysis such as assigning number of contributors and calculating likelihood ratios (LRs) [86-90]. There is no ‘perfect’ value for AT. This is the reason there are different methods and equations that provide guidance to establish AT but none of the formulas are normative. Higher values for AT will prevent erroneously labeling non-alleles as alleles decreasing drop-in events (false positives). However, a high AT value can lead to drop-

of

out events (false negatives) since an actual allele with low coverage will be discarded and not pass to the interpretation stage. This situation is likely to occur in low template samples or in mixtures with low amount of DNA from one or more minor contributors. Drop-in events complicate mixture

ro

interpretation, whereas drop-outs cause the loss of valuable genotype information of the contributing individuals. Therefore, an established AT value should provide a satisfactory balance

-p

point between drop-ins and drop-outs.

Here, we illustrated how this balance assessment can be achieved by using a cost parameter

re

“c” which specifies the “cost” of a drop-out in terms of the number of drop-ins. The simple costbased approach was used to illustrate selection of an AT value. As discussed in the Section 2.7

lP

(materials and methods), AT was varied from 0 to 1000 and the cost of drop-out/-in events were computed and averaged across all profiles of varying DNA amounts per method. The “Cost vs AT” curves were generated for each method to select an optimal AT value with a minimum cost.

ur na

Cost curves were compared to assess whether any of the methods showed superior performance over the other (Fig. 6). Results shown in Fig. 6 are for c = 5. Results for different values of cost parameters are shown in Additional file 2. Higher “c” values place more penalty for drop-outs relative to drop-in events when compared to lower “c” values. The normalized methods performed better than the non-normalized ones (Fig. 6 and Additional file 2). This is most likely due to the

Jo

fact that the coverages obtained using non-normalized methods, similar to the peak heights within an EPG, increase with increasing amount of DNA (Fig. 4). Therefore, a single value of AT is unlikely to perform well for every DNA amount. On the contrary, the normalized methods have coverages that are uncorrelated with DNA amounts, therefore a single AT value has a better chance of satisfactory performance on average. The average cost curves for CE (Fig. 6 and Additional file 2) were shown only for informal comparison with the NGS methods since in practice dye-specific AT values are usually used instead. Tables in Fig. 6 and Additional file 2 show the suggested 16

optimal AT values (rounded up to the nearest 10) for our empirical data with their associated minimum cost for the different methods when a cost parameter of c = 5 was set. The cost-based method can also be used to develop locus-specific analytical thresholds. This involves running the analysis separately for each locus. Whether or not there is an advantage in doing this can only be determined by conducting both global and locus-specific thresholds analyses and comparing the results. The cost-based method can be used as well to determine analytical thresholds for mixtures. This requires ground-truth known mixture samples representative of casework. Here too, a global

of

or a locus-specific AT can be considered. Detailed illustrations of this topic will be reported in a future study.

ro

3.6 Inferring zygosity using apparent heterozygote sequence balance (AHb):

Forensic laboratories have implemented a stochastic threshold (ST) also referred to as

-p

homozygote or drop-out threshold to reliably identify the zygosity or genotype of a single allelic peak at a locus and limit mismatches between the evidential profile and a reference profile [91,

re

92]. Different methods have been implemented to assess the stochastic threshold [87, 91-95]. One of those methods is the use of the logistic regression curves to estimate the probability of drop-out

lP

as a function of either peak height [54, 91] or average peak height of a profile [65, 96]. Due to library normalization steps with the NGS workflows, the coverage and average locuscoverage of the known sequenced alleles do not show a strong correlation to the amount of

ur na

input DNA present at the start of the amplification (Fig. 4B). Therefore, neither parameters can be used to infer zygosity and predict allelic drop-out for the normalized sequenced data. This has also been mentioned by another study using a different NGS-STR kit and sequencing platform [38]. In a recent work, the same authors used the Poisson-gamma coverage model to predict the probability of drop-out [37].

Jo

Here we extended the information generated from the AHb to infer zygosity and evaluate

the risks of misidentifying heterozygous and homozygous genotypes from single-source profiles. ROC plots were plotted using the AHb information as described in the materials and methods. The plots (Additional file 3) showed that within each method, the performance (shape) of the ROC plots created from low amounts of DNA (e.g. 15 pg and 30 pg) differed substantially from those generated by higher amounts of DNA (e.g. 60 pg – 500 pg). A perfect discriminating empirical ROCs were reached with DNA amounts ≥ 125 pg with all the methods. The ROC plots showing 17

the performance of AHb at 250 pg and 500 pg were lower than the rest of the DNA amounts since a lower number of samples were analyzed at these amounts, fully described in section 2.1, thus leading to a lower number of homozygote and heterozygote loci (Additional file 3). The data associated with the ROC plots were converted into relative frequency histograms (Fig. 7 and Additional file 4). The latter were developed to understand the ROC plots and display the distribution of the calculated AHb. The cyan and green histograms represent the AHb from the homozygote and heterozygote loci, respectively. Both histograms were plotted using the same

of

horizontal axis and were separated vertically. The values of the calculated AHb were binned into a range of intervals as shown on the horizontal axis and the vertical axis represented the relative frequency (number of occurrences) of those values within each bin.

ro

For each workflow depicted here, a total of 216 observations from the ground truth data were used to build the ROCs and the histograms. Out of the 216 observations, 21 loci were

-p

homozygotes and 195 were heterozygotes. The risks of misidentifying heterozygous and homozygous genotypes were associated with low amounts of DNA (e.g. 15 pg) where stochastic

re

variations due to template sampling and amplification efficiency are usually observed (Fig. 7A and Additional file 4). As expected, improved separation and discrimination between homozygote

(Fig. 7B and Additional file 4).

lP

and heterozygote genotypes were observed as the amount of DNA template increases (e.g. 125 pg)

Here, for only illustrative purposes and not as a recommended threshold, we considered a

ur na

cut-off value of Hb = 0.2 as an example of a zygosity threshold. At Hb = 0.2, the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) shown in tables next to each histogram (Fig. 7) were deduced from the data. As mentioned in Section 2.9, the homozygosity was used here as the positive group and a locus was called homozygous if the value of its AHb was less than or equal to a given threshold value (t). Therefore, TP represented the

Jo

counts of homozygote loci that were correctly called homozygotes. FP represented the counts of heterozygote loci that were falsely called as homozygotes. FN were the counts of homozygotes falsely labeled as heterozygotes. TN were the heterozygote loci correctly called heterozygotes. Applying an Hb = 0.2 on the TruSeq-N (15 pg) dataset falsely called 2 out of 21 homozygotes as heterozygotes (FN = 2) while 46 out of 195 heterozygotes were falsely called as homozygotes (FP = 46) (Fig. 7A). TP and TN were 19 and 149, respectively. Heterozygote loci were falsely interpreted as homozygotes due to either the non-detection of one of the alleles or extreme 18

heterozygote imbalance in which one of the alleles was preferentially amplified. Homozygote loci were falsely interpreted as heterozygotes due to the high coverage of stutter or noise sequences. Similar interpretation applies for the rest of the tables. Therefore, if an analyst set a cut-off of 0 (here shown in red set at 0.2) and move this threshold from left to right through the plots, the true positives increase and false positives increase as well. However, at low amounts of DNA, the two histograms overlap and the analyst will falsely classify the zygosity no matter where the cut-off value is set. Different numeric values for various

of

cut-off parameters with their associated TP, FN, FP, and TN for each method and DNA amount are shown in the table of (Additional file 5). Each laboratory should perform sensitivity experiments and establish an optimal cut-off threshold that maximizes the TP and TN and

probability curves to model zygosity in mixture profiles.

ro

minimizes the FP and FN. The distributions of the ratio shown here can be used in the future as

-p

The information generated from the AHb can also be used to develop locus-specific zygosity thresholds. This requires a large pool of ground truth known samples consisting of both

re

homozygotes and heterozygotes for each locus. Whether or not there is a meaningful difference between using global thresholds versus locus-specific thresholds can only be determined by

lP

conducting both analyses and comparing the results. Locus-specific zygosity thresholds will be investigated in a separate study using a larger dataset.

ur na

3.7 Heterozygote sequence balance (Hb)

Sister alleles at each heterozygote locus are in perfect balance in vivo [54, 97]. STR typing studies with CE methods have shown that the heterozygote peak height ratios become imbalanced due to stochastic effects. The primary stochastic source is from the random sampling of alleles during the pipetting process especially when DNA molecules are present in low copy-number.

Jo

Another stochastic event can occur but to a less extent during the PCR amplification process when one allele gets randomly amplified more than the other. Primer site or somatic mutation and preferential amplification of shorter allelic sequences can as well result in heterozygote imbalances [54-56, 98-101]. We studied the variation in coverage between the two allelic sequences at each observed and known heterozygote locus from the sequenced single-source samples. The variation in the Hb was measured across a range of DNA template amounts, different library kits, and pooling 19

approaches (Fig. 8A). Hb was calculated from the data analyzed at a coverage of ≥ 1 and plotted against the corresponding amounts of DNA input in the PCR reactions. As expected, the variation of Hb from the sequenced data decreased as the amount of DNA increased. The Hb was equally variable for both library kits and pooling approaches as a function of DNA amount. We observed similar Hb imbalances from the reduced template amplification with the CE data (Fig. 8B). Since targeted sequencing of forensic STR markers relies solely so far on the PCRamplification process [13-15], stochastic sampling effects similar to those studies with the CE data

of

will be observed with the amplified and sequenced low-template samples [38, 56].

ro

4 Conclusion

Fully-continuous DNA interpretation systems employing probabilistic approaches have been widely adopted by the forensic laboratories. These systems are based on models that use the

-p

peak height information from EPGs to deconvolute mixtures and provide LRs [102-105]. The development of these software has been accomplished after understanding and describing the

re

behavior of DNA profiles [86].

A new generation of probabilistic genotyping (PG) software that can model the coverage

lP

and sequences can be expanded into the sequence-based STR data. To establish sequence-based interpretation systems it is necessary to first understand the characteristics of single-source NGSDNA profiles and study the (1) error sequences, (2) different stutter variants generated from the

ur na

same parental allele, (3) allelic drop-in/drop-out, (4) zygosity, (5) heterozygote sequence balance, (6) locus balance, (7) effect of library preparation protocols, and (8) impact of sample multiplexing.

Here in this pilot study, we analyzed targeted sequence datasets from single-source samples generated using different amounts of DNA, library kits (Kapa and TruSeq), and pooling

Jo

approaches (normalization and non-normalization). The yield of adapter-ligated libraries was observed to be higher with the Kapa kit as compared to the TruSeq kit. However, the results showed that the higher yield observed with the Kapa kit did not exhibit any substantial difference on correct allele calling, zygosity, and heterozygote sequence balance when compared to the TruSeq kit.. However, each method should have its own analytical and zygosity thresholds. This is not surprising as every laboratory establishes thresholds based on kits and instruments they use.

20

Our study highlighted the effects of library normalization before sequencing on the behavior of coverage. In an EPG, the average peak heights of profiles are approximately proportional to the amount of undegraded input DNA [65-68]. This relationship is not observed in NGS profiles that have been generated from equimolar normalized library products. The loss of the correlation between coverage and DNA amount might affect the modeling parameters especially if the number of the reads (coverage) will substitute the peak heights as with the current CE models.

of

Stochastic sampling effects impacted the interpretation of NGS STR profiles that were amplified and sequenced from low DNA templates. The stochastic effects were observed by the increase in variation of sequence coverage, drop-out rates, and heterozygote sequence imbalance.

ro

At low amounts of DNA, we observed an overlap between the distribution of coverage of the allelic signal and that of noise as well as between homozygote and heterozygote genotypes. The

-p

trade-off between the true positives and false positives happens mainly at low DNA template amounts.

re

In CE analysis, EPGs measure the quantity of signal from DNA fragments of different sizes. The location of an individual peak or allele is ‘binned’ according to its defined number of

lP

base-pairs. However, any allelic bin can generally consist of several different distinct fragments all of which are of the same length (or base-pair size) as the allelic sequence and cannot be distinguished from each other. The sequence-based STR data analysis further refines all the

ur na

sequences, allelic and non-allelic, at a marker with their associated coverage (counts). For convenience we have called these sequences as noise but in fact those are mostly errors that got incorporated during PCR cycles or sequencing processes. Error reads incorporated in the NGS profiles can confound the calling of unique sequences in low template amount samples or an unknown complex mixture. In this case a real allelic sequence present at a low coverage cannot be

Jo

identified. Therefore, distinguishing low-coverage alleles from artifacts or errors can be challenging and requires statistical and probabilistic modeling. We noticed that these non-allelic sequences can be obtained by inserting, deleting, or substituting a certain number of nucleotides in a parental allelic-sequence. The number of insertions/deletions/substitutions required to transform an allelic sequence into a noise sequence is quantified using what is called Levenshtein distance (LVD). LVD between two sequences is defined as the minimum number of substitutions, insertions, and deletions needed to transform one sequence into the other [106]. Current work is 21

focused on studying those error sequences and characterizing their patterns using the LVD and a larger population database (unpublished data). It is anticipated that applying LVD (or a suitably modified version of LVD) to STR sequencing will be useful in the development of probabilistic models for NGS data which will help in inferring the most likely sets of unique true allelic sequences in mixture or low DNA template. Although the forensic field is moving away from threshold-based systems, there are so far few programmatic approaches that can model noise at the interpretation stage without the use of

of

an AT. For applications involving forensic evidence interpretation one needs to understand what consequences the choice of AT has on the findings, for instance the discriminating power of the computed LRs. This can be studied by constructing ROC curves for the LRs generated from the

ro

PG software at different AT values, similar to what has been reported in You et al. [90]. Currently, PG models for analyzing sequence-based STR data are not available. Here we illustrate, using data

-p

generated in this study, how one might use the simple cost-based approach for selecting an optimal AT value for each method discussed. This AT method is not based on an equation, instead it is

re

based on the consequences of assigning a certain AT value. The cost-based approach can be used to compare the performance of different analytical methods.

lP

The aim of the paper is not to suggest any analytical or zygosity threshold. Our intent is to show the methods used for data analysis and our observations. Laboratories evaluating analytical and zygosity thresholds to interpret single-source NGS profiles should evaluate the cost or

ur na

tradeoffs associated with each threshold by assigning a cost-parameter or constructing the ROCs from their empirical sensitivity laboratory data. Thresholds derived from single-source NGS sensitivity studies should be validated prior to application on evidentiary profiles. Future studies will focus on generating and interpreting large set of empirical data of single-source (especially at low template amounts for further statistical testing) and mixture samples of varying qualities and

Jo

quantities for performance check.

Funding and disclosures XXXXXXXXXXXXXX

Conflict of interest 22

The authors declare no conflict of interest.

Acknowledgements XXXXXXXXXXXXXX

Author Contributions

of

XXXXXXXXXXXXXX

Funding and disclosures This work was supported by NIST Special Programs Office: Forensic

ro

DNA, Federal Bureau of Investigation (FBI) Biometrics Center of Excellence: Forensic DNA Typing as a Biometric tool. Points of view in this document are those of the authors and do not

-p

necessarily represent the official position or policies of the U.S. Department of Commerce. Certain commercial equipment, instruments, and materials are identified in order to specify experimental procedures as completely as possible. In no case does such identification imply a recommendation

re

or endorsement by NIST, nor does it imply that any of the materials, instruments, or equipment

Conflict of interest

lP

identified are necessarily the best available for the purpose.

ur na

The authors declare no conflict of interest.

Acknowledgements

The authors would like to thank Dr. Justin M. Zook at NIST for his helpful discussion regarding this manuscript. Also, the authors express gratitude to Doug Storts and Spencer Hermanson at

Jo

Promega for providing PowerSeq 46GY System Prototype reagents and technical guidance. We thank the anonymous reviewers for their thorough review of our manuscript and their helpful comments and suggestions.

Author Contributions S.R., H.I., and P.M.V. designed the project, planned the experiments, and discussed the data and manuscript. S.R. performed the experiments and wrote the manuscript. L.A.B. analyzed the 23

FASTQ files. H.I. designed and implemented R codes and statistical tools for data curation and

Jo

ur na

lP

re

-p

ro

of

analysis. S.R. and H.I. analyzed the data.

24

References

Jo

ur na

lP

re

-p

ro

of

[1] K. Lazaruk, P.S. Walsh, F. Oaks, D. Gilbert, B.B. Rosenblum, S. Menchen, D. Scheibler, H.M. Wenz, C. Holt, J. Wallin, Genotyping of forensic short tandem repeat (STR) systems based on sizing precision in a capillary electrophoresis instrument, Electrophoresis 19(1) (1998) 86-93. [2] J.M. Butler, Fundamentals of Forensic DNA Typing, Academic Press2009. [3] A. Aliferi, J. Thomson, A. McDonald, V.M. Paynter, S. Ferguson, D. Vanhinsbergh, D. Syndercombe Court, D. Ballard, UK and Irish Y-STR population data-A catalogue of variant alleles, Forensic science international. Genetics 34 (2018) e1-e6. [4] S. Dalsgaard, E. Rockenbauer, A. Buchard, H.S. Mogensen, R. Frank-Hansen, C. Borsting, N. Morling, Non-uniform phenotyping of D12S391 resolved by second generation sequencing, Forensic science international. Genetics 8(1) (2014) 195-9. [5] E. Rockenbauer, S. Hansen, M. Mikkelsen, C. Borsting, N. Morling, Characterization of mutations and sequence variants in the D21S11 locus by next generation sequencing, Forensic science international. Genetics 8(1) (2014) 68-72. [6] J.M. Butler, E. Buel, F. Crivellente, B.R. McCord, Forensic DNA typing by capillary electrophoresis using the ABI Prism 310 and 3100 genetic analyzers for STR analysis, Electrophoresis 25(10-11) (2004) 1397-412. [7] M.G. Ensenberger, K.A. Lenz, L.K. Matthies, G.M. Hadinoto, J.E. Schienman, A.J. Przech, M.W. Morganti, D.T. Renstrom, V.M. Baker, K.M. Gawrys, M. Hoogendoorn, C.R. Steffen, P. Martin, A. Alonso, H.R. Olson, C.J. Sprecher, D.R. Storts, Developmental validation of the PowerPlex((R)) Fusion 6C System, Forensic science international. Genetics 21 (2016) 134-44. [8] S. Gopinath, C. Zhong, V. Nguyen, J. Ge, R.E. Lagace, M.L. Short, J.J. Mulero, Developmental validation of the Yfiler((R)) Plus PCR Amplification Kit: An enhanced Y-STR multiplex for casework and database applications, Forensic science international. Genetics 24 (2016) 164-175. [9] S.L. Fordyce, M.C. Avila-Arcos, E. Rockenbauer, C. Borsting, R. Frank-Hansen, F.T. Petersen, E. Willerslev, A.J. Hansen, N. Morling, M.T. Gilbert, High-throughput sequencing of core STR loci for forensic genetic investigations using the Roche Genome Sequencer FLX platform, BioTechniques 51(2) (2011) 12733. [10] D.M. Bornman, M.E. Hester, J.M. Schuetter, M.D. Kasoji, A. Minard-Smith, C.A. Barden, S.C. Nelson, G.D. Godbold, C.H. Baker, B. Yang, J.E. Walther, I.E. Tornes, P.S. Yan, B. Rodriguez, R. Bundschuh, M.L. Dickens, B.A. Young, S.A. Faith, Short-read, high-throughput sequencing technology for STR genotyping, BioTechniques. Rapid dispatches 2012 (2012) 1-6. [11] C. Borsting, N. Morling, Next generation sequencing and its applications in forensic genetics, Forensic science international. Genetics 18 (2015) 78-89. [12] C. Van Neste, F. Van Nieuwerburgh, D. Van Hoofstat, D. Deforce, Forensic STR analysis using massive parallel sequencing, Forensic science international. Genetics 6(6) (2012) 810-8. [13] A.C. Jager, M.L. Alvarez, C.P. Davis, E. Guzman, Y. Han, L. Way, P. Walichiewicz, D. Silva, N. Pham, G. Caves, J. Bruand, F. Schlesinger, S.J.K. Pond, J. Varlaro, K.M. Stephens, C.L. Holt, Developmental validation of the MiSeq FGx Forensic Genomics System for Targeted Next Generation Sequencing in Forensic DNA Casework and Database Laboratories, Forensic science international. Genetics 28 (2017) 52-70. [14] P. Muller, A. Alonso, P.A. Barrio, B. Berger, M. Bodner, P. Martin, W. Parson, Systematic evaluation of the early access applied biosystems precision ID Globalfiler mixture ID and Globalfiler NGS STR panels for the ion S5 system, Forensic science international. Genetics 36 (2018) 95-103. [15] Instructions for use of the PowerSeq™ 46GY System Prototype. [16] E.R. Mardis, DNA sequencing technologies: 2006-2016, Nature protocols 12(2) (2017) 213-218. [17] Converge software .

25

Jo

ur na

lP

re

-p

ro

of

[18] A.E. Woerner, J.L. King, B. Budowle, Fast STR allele identification with STRait Razor 3.0, Forensic science international. Genetics 30 (2017) 18-23. [19] ForenSeq™ Universal Analysis Software Guide. . [20] C.S.L. K. Hendricks, L. Luo, J. McGuigan, T. Snyder-Leiby, N. Wan, J. Wu, Concordance Study: HighThroughput STR/YSTR results using GeneMarker®HTS software are concordant with capillary electrophoresis results and provide additional sequence variation information, ISHI, Phoenix, 2018. [21] S. Ganschow, J. Silvery, J. Kalinowski, C. Tiemann, toaSTR: A web application for forensic STR genotyping by massively parallel sequencing, Forensic science international. Genetics 37 (2018) 21-28. [22] Y. Ma, J.Z. Kuang, T.G. Nie, W. Zhu, Z. Yang, Next generation sequencing: Improved resolution for paternal/maternal duos analysis, Forensic science international. Genetics 24 (2016) 83-85. [23] A. Alonso, P.A. Barrio, P. Muller, S. Kocher, B. Berger, P. Martin, M. Bodner, S. Willuweit, W. Parson, L. Roewer, B. Budowle, Current state-of-art of STR sequencing in forensic genetics, Electrophoresis 39(21) (2018) 2655-2668. [24] S.B.S.S. D, F.R. Sawitzki, M.K.R. Scheible, S.F. Bailey, S.A. C, S.A. Faith, Paternity testing using massively parallel sequencing and the PowerSeq AUTO/Y system for short tandem repeat sequencing, Electrophoresis 39(21) (2018) 2669-2673. [25] W. Parson, Development of Basic Tools to Predict Appearance, Ancestry and Age Using Massively Parallel Sequencing (VISAGE), ISHI, Phoenix, 2018. [26] K.J. van der Gaag, R.H. de Leeuw, J. Hoogenboom, J. Patel, D.R. Storts, J.F.J. Laros, P. de Knijff, Massively parallel sequencing of short tandem repeats-Population data and mixture analysis results for the PowerSeq system, Forensic science international. Genetics 24 (2016) 86-96. [27] D.H. Warshauer, D. Lin, K. Hari, R. Jain, C. Davis, B. Larue, J.L. King, B. Budowle, STRait Razor: a lengthbased forensic STR allele-calling tool for use with second generation sequencing data, Forensic science international. Genetics 7(4) (2013) 409-17. [28] J. Hoogenboom, K.J. van der Gaag, R.H. de Leeuw, T. Sijen, P. de Knijff, J.F. Laros, FDSTools: A software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise, Forensic science international. Genetics 27 (2017) 27-40. [29] S.L. Friis, A. Buchard, E. Rockenbauer, C. Borsting, N. Morling, Introduction of the Python script STRinNGS for analysis of STR regions in FASTQ or BAM files and expansion of the Danish STR sequence database to 11 STRs, Forensic science international. Genetics 21 (2016) 68-75. [30] C. Van Neste, M. Vandewoestyne, W. Van Criekinge, D. Deforce, F. Van Nieuwerburgh, My-ForensicLoci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing, Forensic science international. Genetics 9 (2014) 1-8. [31] K.B. Gettings, L.A. Borsuk, C.R. Steffen, K.M. Kiesler, P.M. Vallone, Sequence-based U.S. population data for 27 autosomal STR loci, Forensic science international. Genetics 37 (2018) 106-115. [32] L. Devesse, D. Ballard, L. Davenport, I. Riethorst, G. Mason-Buck, D. Syndercombe Court, Concordance of the ForenSeq system and characterisation of sequence-specific autosomal STR alleles across two major population groups, Forensic science international. Genetics 34 (2018) 57-61. [33] J.D. Churchill, N.M.M. Novroski, J.L. King, L.H. Seah, B. Budowle, Population and performance analyses of four major populations with Illumina's FGx Forensic Genomics System, Forensic science international. Genetics 30 (2017) 81-92. [34] N.M.M. Novroski, J.L. King, J.D. Churchill, L.H. Seah, B. Budowle, Characterization of genetic sequence variation of 58 STR loci in four major population groups, Forensic science international. Genetics 25 (2016) 214-226.

26

Jo

ur na

lP

re

-p

ro

of

[35] B. Young, J.L. King, B. Budowle, L. Armogida, A technique for setting analytical thresholds in massively parallel sequencing-based forensic DNA analysis, PloS one 12(5) (2017) e0178005. [36] S.B. Vilsen, T. Tvedebrink, P.S. Eriksen, C. Bosting, C. Hussing, H.S. Mogensen, N. Morling, Stutter analysis of complex STR MPS data, Forensic science international. Genetics 35 (2018) 107-112. [37] S.B. Vilsen, T. Tvedebrink, P.S. Eriksen, C. Hussing, C. Borsting, N. Morling, Modelling allelic drop-outs in STR sequencing data generated by MPS, Forensic science international. Genetics 37 (2018) 6-12. [38] S.B. Vilsen, T. Tvedebrink, H.S. Mogensen, N. Morling, Statistical modelling of Ion PGM HID STR 10plex MPS data, Forensic science international. Genetics 28 (2017) 82-89. [39] A.E. Woerner, J.L. King, B. Budowle, Compound stutter in D2S1338 and D12S391, Forensic science international. Genetics 39 (2019) 50-56. [40] A.E. Woerner, J.L. King, B. Budowle, Flanking Variation Influences Rates of Stutter in Simple Repeats, Genes 8(11) (2017). [41] TruSeq DNA PCR-Free Library Prep Reference Guide. . [42] C.R. Hill, D.L. Duewer, M.C. Kline, M.D. Coble, J.M. Butler, U.S. population data for 29 autosomal STR loci, Forensic science international. Genetics 7(3) (2013) e82-3. [43] Instructions for Use of PowerQuant System. . [44] Instructions for Use of PowerPlex Fusion System for Use on the Applied Biosystems Genetic Analyzers. . [45] Instructions for Use of PowerPlex Y23 System for Use on the Applied Biosystems Genetic Analyzers. . [46] Instructions For Use Agencourt AMPure XP PCR Purification. . [47] Instructions for Use of Quant-iT PicoGreen dsDNA Reagent and Kits. . [48] KAPA Hyper Prep Kit Technical Data Sheet. . [49] Instructions for Use of KAPA Library Quantification Kit for Illumina Platforms. . [50] M.A. Quail, I. Kozarewa, F. Smith, A. Scally, P.J. Stephens, R. Durbin, H. Swerdlow, D.J. Turner, A large genome center's improvements to the Illumina sequencing system, Nature methods 5(12) (2008) 100510. [51] MiSeq System Denature and Dilute Libraries Guide. . [52] MiSeq System Guide. . [53] J.L. King, F.R. Wendt, J. Sun, B. Budowle, STRait Razor v2s: Advancing sequence-based STR allele reporting and beyond to other marker systems, Forensic science international. Genetics 29 (2017) 21-28.

27

Jo

ur na

lP

re

-p

ro

of

[54] P. Gill, H. Haned, O. Bleka, O. Hansson, G. Dorum, T. Egeland, Genotyping and interpretation of STRDNA: Low-template, mixtures and database matches-Twenty years of research and development, Forensic science international. Genetics 18 (2015) 100-17. [55] H. Kelly, J.A. Bright, J.M. Curran, J. Buckleton, Modelling heterozygote balance in forensic DNA profiles, Forensic science international. Genetics 6(6) (2012) 729-34. [56] M.D. Timken, S.B. Klein, M.R. Buoncristiani, Stochastic sampling effects in STR typing: Implications for analysis and interpretation, Forensic science international. Genetics 11 (2014) 195-204. [57] P. Sonego, A. Kocsor, S. Pongor, ROC analysis: applications to the classification of biological sequences and 3D structures, Briefings in bioinformatics 9(3) (2008) 198-209. [58] The R Project for Statistical Computing. . [59] E.R. Mardis, Next-generation sequencing platforms, Annual review of analytical chemistry (Palo Alto, Calif.) 6 (2013) 287-303. [60] I. Kozarewa, Z. Ning, M.A. Quail, M.J. Sanders, M. Berriman, D.J. Turner, Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes, Nature methods 6(4) (2009) 291-5. [61] Best practices for manually normalizing library concentrations. . [62] L. Aigrain, Y. Gu, M.A. Quail, Quantitation of next generation sequencing library preparation protocol efficiencies using droplet digital PCR assays - a systematic comparison of DNA library preparation kits for Illumina sequencing, BMC genomics 17 (2016) 458. [63] S. Fisher, A. Barry, J. Abreu, B. Minie, J. Nolan, T.M. Delorey, G. Young, T.J. Fennell, A. Allen, L. Ambrogio, A.M. Berlin, B. Blumenstiel, K. Cibulskis, D. Friedrich, R. Johnson, F. Juhn, B. Reilly, R. Shammas, J. Stalker, S.M. Sykes, J. Thompson, J. Walsh, A. Zimmer, Z. Zwirko, S. Gabriel, R. Nicol, C. Nusbaum, A scalable, fully automated process for construction of sequence-ready human exome targeted capture libraries, Genome biology 12(1) (2011) R1. [64] KAPA BIOSYSTEMS Next-Generation Solutions. . [65] T. Tvedebrink, P.S. Eriksen, H.S. Mogensen, N. Morling, Evaluating the weight of evidence by using quantitative short tandem repeat data in DNA mixtures, Journal of the Royal Statistical Society: Series C (Applied Statistics) 59(5) (2010) 855-874. [66] J.A. Bright, D. Taylor, J.M. Curran, J.S. Buckleton, Developing allelic and stutter peak height models for a continuous method of DNA interpretation, Forensic science international. Genetics 7(2) (2013) 296304. [67] J.-A. Bright, S. Neville, J.M. Curran, J.S. Buckleton, Variability of mixed DNA profiles separated on a 3130 and 3500 capillary electrophoresis instrument, Australian Journal of Forensic Sciences 46(3) (2014) 304-312. [68] L.E. Alfonse, A.D. Garrett, D.S. Lun, K.R. Duffy, C.M. Grgicak, A large-scale dataset of single and mixedsource short tandem repeat profiles to inform human identification strategies: PROVEDIt, Forensic science international. Genetics 32 (2018) 62-70. [69] W.S. Cleveland, LOWESS: A program for smoothing scatterplots by robust locally weighted regression, American Statistician 35(1) (1981) 54. [70] G. Levinson, G.A. Gutman, Slipped-strand mispairing: a major mechanism for DNA sequence evolution, Molecular biology and evolution 4(3) (1987) 203-21. [71] P.S. Walsh, N.J. Fildes, R. Reynolds, Sequence analysis and characterization of stutter products at the tetranucleotide repeat locus vWA, Nucleic acids research 24(14) (1996) 2807-12. [72] M. Klintschar, P. Wiegand, Polymerase slippage in relation to the uniformity of tetrameric repeat stretches, Forensic science international 135(2) (2003) 163-6. 28

Jo

ur na

lP

re

-p

ro

of

[73] M. Meldgaard, N. Morling, Detection and quantitative characterization of artificial extra peaks following polymerase chain reaction amplification of 14 short tandem repeat systems used in forensic investigations, Electrophoresis 18(11) (1997) 1928-35. [74] K. Lazaruk, J. Wallin, C. Holt, T. Nguyen, P.S. Walsh, Sequence variation in humans and other primates at six short tandem repeat loci used in forensic identity testing, Forensic science international 119(1) (2001) 1-10. [75] A. Edwards, A. Civitello, H.A. Hammond, C.T. Caskey, DNA typing and genetic mapping with trimeric and tetrameric tandem repeats, American journal of human genetics 49(4) (1991) 746-56. [76] J.A. Bright, J.M. Curran, J.S. Buckleton, Investigation into the performance of different models for predicting stutter, Forensic science international. Genetics 7(4) (2013) 422-7. [77] J.A. Bright, K.E. Stevenson, M.D. Coble, C.R. Hill, J.M. Curran, J.S. Buckleton, Characterising the STR locus D6S1043 and examination of its effect on stutter rates, Forensic science international. Genetics 8(1) (2014) 20-3. [78] C. Brookes, J.A. Bright, S. Harbison, J. Buckleton, Characterising stutter in forensic STR multiplexes, Forensic science international. Genetics 6(1) (2012) 58-63. [79] D. Taylor, J.A. Bright, H. Kelly, M.H. Lin, J. Buckleton, A fully continuous system of DNA profile evidence evaluation that can utilise STR profile data produced under different conditions within a single analysis, Forensic science international. Genetics 31 (2017) 149-154. [80] G. Park, J.K. Park, S.H. Shin, H.J. Jeon, N.K.D. Kim, Y.J. Kim, H.T. Shin, E. Lee, K.H. Lee, D.S. Son, W.Y. Park, D. Park, Characterization of background noise in capture-based targeted sequencing data, Genome biology 18(1) (2017) 136. [81] M. Schirmer, R. D'Amore, U.Z. Ijaz, N. Hall, C. Quince, Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data, BMC bioinformatics 17 (2016) 125. [82] R. McGill, J.W. Tukey, W.A. Larsen, Variations of box plots, The American Statistician 32(1) (1978) 1216. [83] J. Bregu, D. Conklin, E. Coronado, M. Terrill, R.W. Cotton, C.M. Grgicak, Analytical thresholds and sensitivity: establishing RFU thresholds for forensic DNA analysis, Journal of forensic sciences 58(1) (2013) 120-9. [84] J.R. Gilder, T.E. Doom, K. Inman, D.E. Krane, Run-specific limits of detection and quantitation for STRbased DNA testing, Journal of forensic sciences 52(1) (2007) 97-101. [85] U.J. Monich, K. Duffy, M. Medard, V. Cadambe, L.E. Alfonse, C. Grgicak, Probabilistic characterisation of baseline noise in STR profiles, Forensic science international. Genetics 19 (2015) 107-122. [86] D. Taylor, J.A. Bright, C. McGoven, C. Hefford, T. Kalafut, J. Buckleton, Validating multiplexes for use in conjunction with modern interpretation strategies, Forensic science international. Genetics 20 (2016) 6-19. [87] K.E. Lohmueller, N. Rudin, K. Inman, Analysis of allelic drop-out using the Identifiler((R)) and PowerPlex((R)) 16 forensic STR typing systems, Forensic science international. Genetics 12 (2014) 1-11. [88] C.A. Rakay, J. Bregu, C.M. Grgicak, Maximizing allele detection: Effects of analytical threshold and DNA levels on rates of allele and locus drop-out, Forensic science international. Genetics 6(6) (2012) 7238. [89] L.E. Alfonse, G. Tejada, H. Swaminathan, D.S. Lun, C.M. Grgicak, Inferring the Number of Contributors to Complex DNA Mixtures Using Three Methods: Exploring the Limits of Low-Template DNA Interpretation, Journal of forensic sciences 62(2) (2017) 308-316. [90] Y. You, D. Balding, A comparison of software for the evaluation of complex DNA profiles, Forensic science international. Genetics 40 (2019) 114-119. [91] P. Gill, R. Puch-Solis, J. Curran, The low-template-DNA (stochastic) threshold--its determination relative to risk analysis for national DNA databases, Forensic science international. Genetics 3(2) (2009) 104-11. 29

Jo

ur na

lP

re

-p

ro

of

[92] A. Kirkham, J. Haley, Y. Haile, A. Grout, C. Kimpton, A. Al-Marzouqi, P. Gill, High-throughput analysis using AmpFlSTR(R) Identifiler(R) with the Applied Biosystems 3500xl Genetic Analyser, Forensic science international. Genetics 7(1) (2013) 92-7. [93] A.A. Westen, L.J. Grol, J. Harteveld, A.S. Matai, P. de Knijff, T. Sijen, Assessment of the stochastic threshold, back- and forward stutter filters and low template techniques for NGM, Forensic science international. Genetics 6(6) (2012) 708-15. [94] J.A. Bright, E. Huizing, L. Melia, J. Buckleton, Determination of the variables affecting mixed MiniFiler DNA profiles, Forensic science international. Genetics 5(5) (2011) 381-5. [95] C. Luce, S. Montpetit, D. Gangitano, P. O'Donnell, Validation of the AMPFlSTR MiniFiler PCR amplification kit for use in forensic casework*, Journal of forensic sciences 54(5) (2009) 1046-54. [96] T. Tvedebrink, P.S. Eriksen, H.S. Mogensen, N. Morling, Estimating the probability of allelic drop-out of STR alleles in forensic genetics, Forensic science international. Genetics 3(4) (2009) 222-6. [97] P. Gill, R. Sparkes, C. Kimpton, Development of guidelines to designate alleles using an STR multiplex system, Forensic science international 89(3) (1997) 185-97. [98] P. Gill, J. Curran, K. Elliot, A graphical simulation model of the entire DNA process associated with the analysis of short tandem repeat loci, Nucleic acids research 33(2) (2005) 632-43. [99] J.A. Bright, K. McManus, S. Harbison, P. Gill, J. Buckleton, A comparison of stochastic variation in mixed and unmixed casework and synthetic samples, Forensic science international. Genetics 6(2) (2012) 180-4. [100] T. Tvedebrink, H.S. Mogensen, M.C. Stene, N. Morling, Performance of two 17 locus forensic identification STR kits-Applied Biosystems's AmpFlSTR(R) NGMSElect and Promega's PowerPlex(R) ESI17 kits, Forensic science international. Genetics 6(5) (2012) 523-31. [101] P.S. Walsh, H.A. Erlich, R. Higuchi, Preferential PCR amplification of alleles: mechanisms and solutions, PCR methods and applications 1(4) (1992) 241-50. [102] J.A. Bright, D. Taylor, C. McGovern, S. Cooper, L. Russell, D. Abarno, J. Buckleton, Developmental validation of STRmix, expert software for the interpretation of forensic DNA profiles, Forensic science international. Genetics 23 (2016) 226-239. [103] O. Bleka, G. Storvik, P. Gill, EuroForMix: An open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts, Forensic science international. Genetics 21 (2016) 35-44. [104] M.W. Perlin, M.M. Legler, C.E. Spencer, J.L. Smith, W.P. Allan, J.L. Belrose, B.W. Duceman, Validating TrueAllele(R) DNA mixture interpretation, Journal of forensic sciences 56(6) (2011) 1430-47. [105] D.J. Balding, Evaluation of mixed-source, low-template DNA profiles in forensic science, Proceedings of the National Academy of Sciences of the United States of America 110(30) (2013) 12241-6. [106] V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. 1966 (8), Forschungsbericht.–707–710 S (1966).

30

of ro -p re

Fig. 1: Comparison of conventional CE versus NGS-STR genotyping workflows

Jo

ur na

lP

With CE, genomic DNA is extracted, quantified, and amplified using dye labeled primers contained in the commercial multiplex STR kits. During the electrophoretic separation fluorescently labeled amplicons with different molecular weights pass by laser and the emitted light from the fluorophores are captured by charge coupled device (CCD) camera. The outcome of this process is electropherogram (EPG) composed of the length variants, heights, and sizes, of the STR alleles [2]. With NGS, PCR amplification is still a standard step to enrich for STR loci. STR amplicons are then subjected to library construction where adapters are ligated and converted into DNA libraries (discussed in detail in Supplementary Fig. 3). DNA libraries are then quantified, diluted, and pooled for clonal amplification. The final steps are the sequencing by synthesis via the chemistries of the NGS instruments (here shown the Illumina chemistry) and data analysis [41].

31

t = 15% (13;21)

t = 10% (4;18)

of

t = 5% (1;1) Number of heterozygotes incorrectly classified as homozygotes (FP)

-p

Fig. 2: Receiver Operating Characteristic plot or (ROC plot)

ro

Number of correctly classified homozygotes (TP)

Receiver Operating Characteristic Curve

Jo

ur na

lP

re

An illustration of an ROC plot used to evaluate thresholds for zygosity calling for Kapa-N: 30 pg is shown. At a given AHb threshold value (t), the counts of the number of correctly classified homozygotes (TP) and the number of heterozygote loci incorrectly classified as homozygotes (FP) are plotted on the y-axis and x-axis, respectively. The ROC captures all the AHb thresholds (t) simultaneously. Specific t values of 5%, 10%, and 15% (i.e., 0.05, 0.10, and 0.15) 2%, 10%, and 40% are highlighted in this illustration. For every possible t, a (FP; TP) value is recorded. As t increases, the number of both TP and FP decreases increases.

32

80.5 80

70

60 51.6

50

40

30 21.6 20

16.4

15.4 8.0

10 0.5

7.9 1.6

7.5 2.4

4.4

0

15 pg

30 pg

60 pg

125 pg

250 pg

500 pg

DNA Amount (pg) Kapa Library Preparation Kit

ro

Illumina Truseq Library Preparation Kit

of

Average Library Concentration by qPCR (nM)

Yield of Adaptor-Ligated Library Molecules

90

-p

Fig. 3: Average library concentration yield versus starting amount of DNA template

Jo

ur na

lP

re

Data are shown for libraries that were prepared from DNA samples with varying amounts of input DNA using either Kapa or TruSeq library preparation kits. The library DNA from Kapa and TruSeq workflows was eluted in 25 µL and 22.5 µL elution buffer, respectively. A quantitative assessment of library yields by qPCR showed that the library input DNA converted to adapter-ligated libraries was higher with Kapa HyperPrep workflow as compared to TruSeq kit across the entire range of DNA inputs. The Y-axis represents the average library concentration values (yield of DNA) obtained by qPCR after the post-ligation cleanup. The Xaxis reflects the varying amount of DNA masses (15 pg - 500 pg) from which the libraries were synthesized.

33

A.

APH

CE APH vs DNA Amount (pg)

of

Amount of Input DNA (pg)

ro

B. KAPA-N ALC vs DNA Amount (pg)

lP

Amount of Input DNA (pg)

re

ALC

ALC

-p

TruSeq-N ALC vs DNA Amount (pg)

ur na

C.

Jo

Amount of Input DNA (pg)

TruSeq-NN ALC vs DNA Amount (pg)

ALC

ALC

KAPA-NN ALC vs DNA Amount (pg)

Amount of Input DNA (pg)

Amount of Input DNA (pg)

Fig. 4: APH and ALC of known alleles relative to DNA template amounts 34

Jo

ur na

lP

re

-p

ro

of

(A) APH of single-source profiles amplified with PowerPlex Fusion and analyzed on CE 3500 xL. (B and C) ALC of the Kapa and TruSeq normalized (N) and non-normalized (NN) sequenced libraries. The horizonal axis shows the varying amounts of DNA input (pg) and the vertical axis shows the APH or ALC of profiles generated from different workflows. ALC obtained from the normalized sequenced libraries does not correlate to the amounts of input DNA present at the start of the amplification, unlike the CE and non-normalized sequencing data. LOWESS lines (shown in red) were superimposed on the plots to help visualize the trend [69].

35

Fig. 5: Impact of DNA template amount on the distribution of known allele, stutter, and noise sequences

ur na

lP

re

-p

ro

of

The coverage and distribution of alleles, noise, and stutter displayed here as datapoints and boxplots [82], respectively, have been delineated across Kapa and TruSeq normalized (N) and non-normalized (NN) sequenced libraries (A and B) and CE technology (C). Only the 15 pg and 125 pg DNA amounts are shown here. The horizontal axis shows the types of the sequences or peaks (A; S (S1; S2); and N) and the vertical axis (log10 scale) shows the coverage or height of each of the sequence or peak at each locus. Box plots are added to show the 25th, 50th (horizontal bar), and 75th percentiles of the coverage and height for each type of the sequences and peaks, respectively [82]. As the amount of DNA increases, separation between the alleles and noise or stutter increases. The total number of known alleles was 411.

Jo

B

36

of

ro

-p

re

lP

ur na

Jo C

37

Fig. 6: Setting analytical thresholds for sequence-based STR profiles

AT 218 327 266 56 68

Minimum cost 8.587 8.630 13.435 18.913 22.065

Jo

c = 5

ur na

Method Kapa-N TruSeq-N Kapa-NN TruSeq-NN CE

lP

re

-p

ro

of

Using the “cost vs AT” plots to evaluate the optimal method dependent AT (an i.e. an AT with a minimum cost) as well as the performance of different methods in terms of detecting and distinguishing known allelic products from noise. The data that was generated from each method (Kapa-N, Kapa-NN, TruSeq-N, TruSeq-NN, and CE) was combined across all the loci, replicates, and DNA amounts. The drop-out and drop-in events of the five different methods were generated by varying the AT threshold for the entire datasets of each method. A cost parameter of c = 5 was considered. Each point on the curve corresponds to a AT threshold with its corresponding cost computed as discussed in Section 2.7. Cost and AT were plotted on the Y-axis and X-axis, respectively. Connecting all the points obtained at all the possible thresholds formed the observed empirical curves. Different curves shown correspond to the different methods and are represented in different colors. The table below the curves show the suggested profile level AT values (rounded up to the nearest 10) with their associated minimum cost for the different methods. Kapa-N and TruSeq-N methods performed better than the non-normalized ones.

38

Fig. 7: Inferring zygosity using apparent heterozygote balance (AHb)

Predicted Heterozygotes

n = 216

Predicted Homozygotes

TP = 21

FN = 0

Actual Homozygotes

TP = 19

FN = 2

FP = 62

TN = 133

Actual Heterozygotes

FP = 46

Predicted Heterozygotes

TN = 149

Predicted Heterozygotes

n = 216

Predicted Homozygotes

Actual Homozygotes

TP = 21

FN = 0

Actual Homozygotes

TP = 19

FN = 0

Actual Heterozygotes

FP = 66

TN = 129

Actual Heterozygotes

FP = 47

TN = 147

n = 216

Predicted Homozygotes

Predicted Heterozygotes

re

Predicted Homozygotes

lP

n = 216 Actual Homozygotes Actual Heterozygotes

-p

ro

of

(A) Relative frequency histograms of the calculated AHb for the different sequencing methods at 15 pg; (B) Relative frequency histograms of the calculated AHb for the different sequencing methods at 125 pg. N and NN represent normalized and nonnormalized datasets, respectively. (C) Relative frequency histograms of the calculated AHb for the CE data at 15 pg and 125 pg. AHb values from 216 loci were binned into range of intervals (horizontal axes). The vertical axes show the number of times the AHb values occur within each bin. Cyan and green histograms represent the AHb values from the homozygote and heterozygote loci, respectively. Underneath each histogram is a table that shows the rates of the true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). Those values were computed after applying a cut-off value of Hb = 0.2. The histograms across the rest of the DNA amounts and various cut-off thresholds can be found in the Additional files 4 and 5.

ur na

B

Predicted Homozygotes

Predicted Heterozygotes

n = 216

Predicted Homozygotes

Actual Homozygotes

TP = 21

FN = 0

Actual Homozygotes

TP = 21

Predicted Heterozygotes FN = 0

Actual Heterozygotes

FP = 0

TN = 195

Actual Heterozygotes

FP = 0

TN = 195

Jo

n = 216

n = 216

Predicted Homozygotes

Predicted Heterozygotes

n = 216

Predicted Homozygotes

Actual Homozygotes

TP = 21

FN = 0

Actual Homozygotes

TP = 21

Predicted Heterozygotes FN = 0

Actual Heterozygotes

FP = 0

TN = 195

Actual Heterozygotes

FP = 0

TN = 195

39

Actual Homozygotes

TP = 18

FN = 3

Actual Heterozygotes

FP = 38

TN = 157

Predicted Heterozygotes

Actual Homozygotes

TP = 21

FN = 0

Actual Heterozygotes

FP = 0

TN = 195

-p

Predicted Heterozygotes

Predicted Homozygotes

re

Predicted Homozygotes

n = 216

Jo

ur na

lP

n = 216

ro

of

C.

40

Fig. 8: The amounts of input DNA in pg versus Hb (A) Empirical heterozygote sequence coverage was studied at each known heterozygote from the Kapa and TruSeq normalized (N) and non-normalized (NN) sequenced libraries. (B) Hb of heterozygotes from the single-source profiles amplified with PowerPlex Fusion and analyzed on CE 3500 xL. The horizonal axis shows the varying amounts of DNA input (pg) and the vertical axis shows the Hb calculated from different workflows. Distribution of Hb values are displayed in boxplots [82]. Increased imbalance was noticeable at low-template profiles irrespective of the workflow, pooling approaches, and library kits.

of

A. Kapa-N Hb vs DNA Amount (pg)

ro

re

Amount of Input DNA (pg)

TruSeq-NN Hb vs DNA Amount (pg)

ur na

Hb

Hb

lP

Kapa-NN Hb vs DNA Amount (pg)

-p

Hb

Hb

Amount of Input DNA (pg)

Amount of Input DNA (pg)

B.

TruSeq-N Hb vs DNA Amount (pg)

Amount of Input DNA (pg)

Hb

Jo

CE Hb vs DNA Amount (pg)

Amount of Input DNA (pg)

41

42

of

ro

-p

re

lP

ur na

Jo