DNAStatistX: Development and validation of a software suite for the data management and probabilistic interpretation of DNA profiles

DNAStatistX: Development and validation of a software suite for the data management and probabilistic interpretation of DNA profiles

Forensic Science International: Genetics 42 (2019) 81–89 Contents lists available at ScienceDirect Forensic Science International: Genetics journal ...

1MB Sizes 0 Downloads 57 Views

Forensic Science International: Genetics 42 (2019) 81–89

Contents lists available at ScienceDirect

Forensic Science International: Genetics journal homepage: www.elsevier.com/locate/fsigen

Research paper

DNAxs/DNAStatistX: Development and validation of a software suite for the data management and probabilistic interpretation of DNA profiles

T

Corina C.G. Benschopa, , Jerry Hoogenbooma, Pauline Hoversa, Martin Slagtera, Dennis Kruiseb, Raymond Paraga, Kristy Steensmaa, Klaas Slootena, Jord H.A. Nagela, Patrick Dieltjesa, Vincent van Mariona, Heidi van Paassena, Jeroen de Jongb, Christophe Creetenb, Titia Sijena, Alexander L.J. Kneppersa ⁎

a b

Netherlands Forensic Institute, Division of Biological Traces, Laan van Ypenburg 6, 2497GB, The Hague, the Netherlands Netherlands Forensic Institute, Division of Digital and Biometric Traces, Laan van Ypenburg 6, 2497GB, The Hague, the Netherlands

ARTICLE INFO

ABSTRACT

Keywords: DNA profile interpretation DNA expert system DNAxs Likelihood ratio Probabilistic genotyping DNAStatistX EuroForMix

The data management, interpretation and comparison of sets of DNA profiles can be complex, time-consuming and error-prone when performed manually. This, combined with the growing numbers of genetic markers in forensic identification systems calls for expert systems that can automatically compare genotyping results within (large) sets of DNA profiles and assist in profile interpretation. To that aim, we developed a user-friendly software program or DNA eXpert System that is denoted DNAxs. This software includes features to view, infer and match autosomal short tandem repeat profiles with connectivity to up and downstream software programs. Furthermore, DNAxs has imbedded the ‘DNAStatistX’ module, a statistical library that contains a probabilistic algorithm to calculate likelihood ratios (LRs). This algorithm is largely based on the source code of the quantitative probabilistic genotyping system EuroForMix [1]. The statistical library, DNAStatistX, supports parallel computing which can be delegated to a computer cluster and enables automated queuing of requested LR calculations. DNAStatistX is written in Java and is accessible separately or via DNAxs. Using true and non-contributors to DNA profiles with up to four contributors, the DNAStatistX accuracy and precision were assessed by comparing the DNAStatistX results to those of EuroForMix. Results were the same up to rare differences that could be attributed to the different optimizers used in both software programs. Implementation of dye specific detection thresholds resulted in larger likelihood values and thus a better explanation of the data used in this study. Furthermore, processing time, robustness of DNAStatistX results and the circumstances under which model validations failed were examined. Finally, guidelines for application of the software are shared as an example. The DNAxs software is future-proof as it applies a modular approach by which novel functionalities can be incorporated.

1. Introduction 1.1. The DNA eXpert System ‘DNAxs’ DNA-profiling is one of the most valuable forensic approaches: it can generate investigative leads by retrieving suspect names from DNA database searches and high weights of evidence when suspects and evidential traces are compared. Increasingly complex DNA profiles are over the years considered to be interpretable, which is partly the result

of the increased sensitivity and the larger number of loci incorporated into global STR typing systems. Moreover, there has been a change in the type of evidence being submitted for analysis; from high quality and quantity (often single source) stains to low quality and quantity (often mixed) samples [2]. As one case may include multiple such complex profiles, comparisons to (sometimes many) reference profiles is complex, time-consuming and error-prone when performed manually. This has increased the demand for software that enables fast and automated comparison of sets of profiles assisting the DNA experts in routine case

Corresponding author. E-mail addresses: [email protected] (C.C.G. Benschop), [email protected] (J. Hoogenboom), [email protected] (P. Hovers), [email protected] (M. Slagter), [email protected] (D. Kruise), [email protected] (R. Parag), [email protected] (K. Steensma), [email protected] (K. Slooten), [email protected] (J.H.A. Nagel), [email protected] (P. Dieltjes), [email protected] (V. van Marion), [email protected] (H. van Paassen), [email protected] (J. de Jong), [email protected] (C. Creeten), [email protected] (T. Sijen), [email protected] (A.L.J. Kneppers). ⁎

https://doi.org/10.1016/j.fsigen.2019.06.015 Received 23 May 2019; Received in revised form 19 June 2019; Accepted 20 June 2019 Available online 21 June 2019 1872-4973/ © 2019 Elsevier B.V. All rights reserved.

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al.

work. eDNA is one such application whose development was financially supported by the European Union in the period 2014–2016 [3]. Although its functionalities can be very useful, the eDNA application may not be easily implemented in every forensic laboratory, which prompted the development of other expert systems, such as CaseSolver [4] and, in our laboratory, the DNA eXpert System ‘DNAxs’. The DNAxs software is server based, was written using structured programming (so that multiple software engineers can integrate their work) and uses a modular approach so that novel features can be readily incorporated.

DNAxs supports parallel computing which can be delegated to a computer cluster and enables queuing of requested LR calculations. Since EuroForMix is written in R and C++, not all of the functions EuroForMix uses are available for DNAStatistX. These involve the optimizer and the model validation function, for which alternatives were explored and selected. The start points of the optimizer in DNAStatistX are based on a normal distribution fitted on the summed peak heights which is the same as in EuroForMix. Slightly different is the degree of variation that can be defined by the user in EuroForMix whilst it is fixed in DNAStatistX v1.0.0. In addition, a slightly different implementation of the rare allele frequency was used (see for instance LRmix (Studio) [13]). In a later version of EuroForMix (v2.0), the implementation of the rare allele frequency was modified and is the same as in DNAStatistX. Furthermore, DNAStatistX allows the use of a dye-specific or even locusspecific detection threshold (instead of an overall detection threshold) as profile analysis software (such as [14] [15],) have progressed to these specific thresholds allowing a better distinction of alleles and noise. An overview of the differences in implementation between EuroForMix and DNAStatistX is presented in Table 2. While EuroForMix includes both a degradation and stutter model, in this first version of DNAStatistX only the degradation model was implemented. The stutter model was not realized for the following rationale. In our laboratory, DNA profiles are analysed through GeneMarker® HID that applies locus-specific stutter filters at the -2, -1, -0.5, +0.5 and +1 repeat unit positions, while the stutter model in EuroForMix (at least up to version 2.0.4) considers -1 repeat unit stutters only. Some elevated stutters (predominantly occurring when amplifying low amounts of DNA) may exceed the stutter filter ratios in the analysis software and are maintained in the profile. Although a stutter model in probabilistic software is best applied on un-filtered data, one could consider it advantageous to apply a stutter model on the stutter-filtered genotyping data in order to take account of these elevated stutters. However, the stutter model will also expect to find stutter peaks at positions where they were in fact removed by the analysis software, which can affect the likelihood ratio. In EuroForMix the likelihoods are maximised under both hypotheses using the parameters that best explain the data for each hypothesis (single points in the parameter space). As uncertainty regarding this parameter estimation is not taken into account in the MLE calculation, EuroForMix contains an option to perform sensitivity analysis that considers the LR as a function of the parameters involved [16]. As a socalled conservative value (in favour of the defence hypothesis) one can use the lower 5% percentile of this sensitivity analysis. This results in

1.2. Functionalities of DNAxs The initial versions of DNAxs contain features to import, view, infer [5], match and export autosomal STR profiles (Table 1 and [6]). The software includes an audit trail that logs which action is performed when and by whom. Since its first internal release in December 2017 (DNAxs v1.0.0), new features were implemented (such as, viewing the electropherograms, supporting massively parallel sequencing data), bugs were fixed and user-friendliness was improved based on feedback from a large group of users. The DNAxs software has a prominent part in our laboratory and has become essential in the daily work of DNA experts reporting on forensic DNA casework. DNAxs has connectivity to different software programs via so-called ‘documented web-API’s that contain methods to import and export data. DNAxs is future-proof, as it is built in a modular way. New features, such as working with massively parallel sequencing data for autosomal or mitochondrial DNA can be incorporated. Existing features can be updated for improved performance. In our laboratory the current programs that are accessible from DNAxs are amongst others our LIMS system, SmartRank [7], [8], Bonaparte [9], FDStools [10], and a statistical library. This study focuses on the statistical library, denoted DNAStatistX. 1.3. DNAStatistX Since 2019, DNAxs implements ‘DNAStatistX’ (DNAxs v1.3.2 and DNAStatistX v1.0.0), a statistical library that contains the algorithm to calculate the maximum likelihood estimate (MLE). DNAStatistX is based on the model as published in Cowell et al. [11], Graversen et al. [12] and Bleka et al. [1] describing the underlying scientific principles. Alike in the software program EuroForMix from Bleka et al. [1], DNAStatistX includes theta correction, a degradation model and supports the analyses of replicates. The DNAStatistX library is written in Java and is accessible separately or via DNAxs. The statistical library in Table 1 Main functionalities of DNAxs version 1.0.0 and 1.1.0 in italics. Functionality Login Import profiles View profiles

Details

with a user name and password • Login login screen contains release notes and version changes as well as instruction videos • The file formats are accepted • Various capillary electrophoresis (CE) and massively parallel sequencing (MPS) data • Supports graphs visualizing the alleles and peaks heights/ read counts • Bar – In case of replicates: Different colours representing whether or not an allele is reproduced in at least half of the replicates. – Different colours representing whether an allele is part of the major component or not (according to LoCIM [5])

View summary statistics Match profiles Derive profiles Match matrix Export profiles Audit trail Notes

• Eectropherograms allele count (TAC) • Total allele count (MAC) • Maximum of Type I, II and III loci (according to LoCIM) • Numbers of alleles in the bar graphs • Matching of a major contributor (according to LoCIM) • Inference matrix enabling comparisons of (many) stains to (many) reference profiles • Match coding for the number of non-matching alleles and a slider to set this number • Colour match matrix can be exported to PDF • The SmartRank/ LIMS/ CODIS/ Bonaparte. • ToOptions to choose which alleles are exported (manual/ replicate/ consensus/ composite/ LoCIM inference) • Keeps track which actions are performed when and by whom • Possibility toofwrite notes within the software regarding e.g. the profile interpretation • 82

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al.

Table 2 Overview of differences in implementation for EuroForMix v1.11.4 and DNAStatistX v1.0.0. Calculations

EuroForMix v1.11.4

DNAStatistX v1.0.0

Code Optimizer Number of optimizer start points/ iterations

R and C++ nlm User defined, default = 4

Model validation Rare allele frequency

AdaptIntegrate 5/2*size of population or user defined minimum frequencya Overall Included Yes Nob No

Java CMA-ES Programmed dynamically to have 3 times the same largest likelihood out of 3 up to 10 iterations TrapezoidIntegrator 1/(2*size of population)

Detection threshold Stutter model Sensitivity analysis Support kinship model Automated queuing of LR calculation requests a b

Dye or locus-specific Not included No No Yes (within DNAxs)

In EuroForMix v2.0 the implementation of the rare allele frequency is the same as in DNAStatistX (and e.g. LRmix Studio). Kinship model is not supported in EuroForMix v1.11.4, but is supported from v2.0 and up.

somewhat lower LRs when compared to the MLE. Since the lowering effect was very small with PowerPlex® Fusion 6C (PPF6C) profiles with a minimal effect on the number of false-positives [17] and because sensitivity analyses are very time-consuming processing-wise, this option was not implemented in DNAStatistX v1.0.0.

expectation, peak height variance and degradation slope (Supplementary Table 1) as these are factors that can have an influence on the LR results. The 500 remaining reference profiles were used as non-contributors in LR calculations with this simulated dataset. Hypotheses for a total of 1900 LR calculations were defined: 500 for Hp true tests and 1400 for Hd true tests (Supplementary Tables 2 and 3). In Hp true tests, each time one true contributor was regarded the person of interest (POI) under Hp while conditioning on none or the other true contributors in all possible combinations (Supplementary Table 2). In Hd true tests, for each sample ten different non-contributors were regarded as POI under Hp, while conditioning on none or one of the true contributors (Supplementary Table 3). Parameters and population data were used as described in [14], except that the analyses were performed without theta (Fst) correction. The parameter values for mixture proportion, peak height expectation, peak height variation and degradation slope were used 1) as decided in the simulation by the gamma model and thereby representing the ‘ground truth’ parameters and 2) as obtained by the DNAStatistX optimizer using four iterations (which is the default value in EuroForMix). Comparisons were made between the results (likelihoods and LRs) obtained using the ground truth parameters and the MLE method. In addition, LR calculations with theta correction of 0.01 and 0.03 were performed for 1) Hp true tests that yielded LRs which were more than one unit on log10 scale larger with MLE than with ground truth parameters and for 2) Hd true analyses that yielded an LR larger than ten.

1.4. Software validation To guide the validation process several documents have been published [18–25], that describe that the developmental validation could, or should, address:

• sensitivity (demonstrate the range of LRs that can be expected for true contributors); • specificity (demonstrate the range of LRs that can be expected for non-contributors); • precision (variation in LRs from repeated software analyses of the same input data) [19]; • accuracy of statistical calculations and other results (comparison to an alternate statistical model or software program) [18]; • determination of the limits of the software [22] (either computational or conceptual, regarding for instance the number of unknown contributors or types of DNA profiles).

Furthermore, the internal validation was described as an ‘accumulation of representative test data within the laboratory to demonstrate that the established parameters, software settings, formulae, algorithms and functions perform as expected’ [18], [2]. In a previous study examining the performance of EuroForMix applied to PPF6C profiles, the sensitivity and specificity were assessed using a set of various mixtures and hypotheses [16]. As the statistical library DNAStatistX is based on EuroForMix, sensitivity and specificity are conceived to be analogous and not further pursued in this study. We did assess the accuracy of DNAStatistX by comparing outcomes to those obtained with EuroForMix. Furthermore precision, robustness and effects of model options and settings were examined and defined.

2.2. LR calculations using laboratory-generated DNA profiles 2.2.1. Samples and hypotheses for comparison to EuroForMix In this study, DNAStatistX v1.0.0 was compared with EuroForMix v1.11.4. Parameters and population data were used as described in [17], unless stated otherwise. Experimental DNA profiles (n = 85) and reference profiles (n = 45) were selected from existing datasets and sets of hypotheses were created (n = 160) to cover the various aspects encountered in forensic casework in our laboratory. The DNA profiles had known content and comprised degraded single donor PPF6C profiles from an in-house study, mixed NGM profiles as described in [5] and mixed PPF6C profiles as described in [17]. These DNA profiles varied for the number of contributors (one to four), amount of DNA, mixture proportion, level of drop-out, occurrence of drop-in (or elevated stutter), level of allele sharing, level of degradation, STR typing kit (NGM™ or PPF6C) or number of replicates (one to three). The hypotheses had a true or a non-contributor as POI under Hp, zero to two conditioned contributors under both hypotheses, the true or an over-/ under-estimated number of contributors, a theta correction of 0.01 or 0.03, the degradation model turned on or turned off and a detection

2. Materials and methods 2.1. Likelihood ratio (LR) calculations using simulated DNA profiles PPF6C reference profiles (n = 610) were simulated using a Dutch allele frequency database [26] for genotypes and in accordance with the gamma model for peak heights. From 110 of these reference profiles a simulated dataset of 20 single-source, ten two-person, ten threeperson and ten four-person mixtures was generated. These samples were generated to have variation for mixture proportion, peak height 83

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al.

threshold at all loci of 40, 45, 50 or 80 relative fluorescent units (RFU). These different detection thresholds were chosen as the PPF6C profiles were analysed using dye-specific detection thresholds (FL = 45, JOE = 50, TMR = 45, CXR = 80, TOM = 40 RFU) [17]; the NGM profiles were analysed using a 50 RFU threshold for all dyes. In addition, trace or reference profiles with a rare allele were included in an additional 20 sets of hypotheses that were tested with both software programs. Comparisons between the two software programs were performed using the LRs (overall and per marker, compared up to two significant digits) and estimated parameters (compared up to four significant decimals, i.e. the first four digits of the mantissa in scientific notation). For the expected peak height a minimum difference of 1.0 was used; differences that were smaller than the level up to which the comparisons were performed were regarded as equal results.

DNAStatistX. These ten analyses cover Hp and Hd analyses, one- to four-person profiles and samples with one or three replicates. In addition, a peak of improbable height was introduced at a long fragment in a sample to examine the effect of the model validation. Furthermore, model validation was performed for the diverse series of 180 sets of hypotheses described in Section 2.2.1 (i.e. the hypotheses that were used for comparison between DNAStatistX and EuroForMix (n = 160 without a rare allele and n = 20 with a rare allele) to assess when data points plot outside the 0.01-line. Additionally, three two-person PPF6C profiles generated using 32 PCR cycles [27] were submitted to DNAStatistX using each true contributor as POI under Hp. Model validations (n = 6 Hp and n = 6 Hd) were examined to infer whether the peak heights of these DNA profiles follow the expected peak heights.

2.2.2. Values and hypotheses to examine the effect of detection threshold In DNAStatistX, the detection threshold can be specified per dye (or even per locus). The PPF6C profiles were analysed in GeneMarker® HID using 1) low dye-specific detection thresholds (FL = 45, JOE = 50, TMR = 45, CXR = 80, TOM = 40 RFU) [17]; 2) high dye-specific detection thresholds (FL = 95, JOE = 140, TMR = 85, CXR = 135, TOM = 95 RFU; standard in our laboratory). DNAStatistX calculations were performed with 20 hypotheses that involved two- (n = 6), three(n = 8), or four-person mixtures (n = 6), had (n = 3) or did not have a conditioned contributor under both hypotheses (n = 17) and had the degradation model turned on (n = 14) or off (n = 6). A theta correction of 0.01 was applied, the true number of contributors was regarded and a true donor was used as POI under Hp. The DNA profiles were analysed twice in DNAStatistX, once using the dye specific thresholds and once using the lowest value of the dye specific thresholds for all loci (i.e. 40 or 85 RFUs for the low or high detection thresholds, respectively). Likelihoods and LRs were compared for these various detection thresholds.

An important part of the MLE calculation is to search for the optimal parameter values that best explain the data with respect to mixture proportion, peak height expectation, peak height variation and degradation slope. As the optimizer may not always reach the global optimum and may end up in a local optimum, the optimizer is run multiple times. By default the number of optimizer start points, or iterations, is four in EuroForMix and this was also used in the previously described analyses using DNAStatistX. As the DNAStatistX results showed robust results with four iterations, we examined whether fewer iterations suffice. Lowering the number of optimizer iterations drastically reduces processing time. The optimal number of iterations (maximised chance on global maximum in the context of processing time) was determined from 12,092 Hp and 12,092 Hd likelihoods obtained by running the optimizer from four up to 60 times in 235 analyses. Next, using the same number of iterations, the processing time of DNAStatistX v1.0.0 was examined for the LR calculations and model validations of the 180 analyses described in Section 2.2.1. DNAStatistX calculations were performed on Dell PowerEdge M1000e blade servers, with 2x Intel Xeon E5-2620 (2.0 GHz) CPUs each. Calculations with less than three unknowns under Hd were run on two processor cores, while three unknowns under Hd were run on four processor cores, and four unknowns under Hd were run on five processor cores. For optimal configuration and use of server capacity, DNAxs can delegate DNAStatistX calculations to a computer cluster and automatically queues DNAStatistX requests. An efficient workflow is created: new requests can be queued when earlier analyses are not yet finished; queued analyses for two- and three-person mixtures may start before analyses with four-person mixtures are finished.

2.4. Examination of the processing time

2.2.3. LR calculations to examine precision and robustness of DNAStatistX Duplicate analyses of ten of the LR calculations described in Section 2.2.1 (three two-person mixtures, three three-person mixtures and four four-person mixtures) were performed to examine whether repeated analyses give results of good precision. To test robustness of the software, 17 flawed scenarios were analysed: 1) scenarios for which LR calculations should not be possible because of an excessive number of peaks that cannot be explained by the number of contributors or drop-in (n=2), 2) scenarios in which the chosen STR typing kit in the calculation is different from the kit of the trace sample (n = 3), 3) scenarios in which the trace contains loci that are not in the reference profile(s) (n = 8), 4) a scenario in which the reference contains loci that are not in the trace profile (n = 1), and 5) scenarios in which the trace or reference(s) lack loci or contain empty loci (n = 3).

3. Results and discussion 3.1. Performance of the MLE algorithm using simulated samples In DNAStatistX, an algorithm calculates the MLE in which peak heights are considered to follow a gamma distribution. Thus, when simulated profiles are prepared with peak heights derived by a gamma model, the performance of the MLE algorithm can be assessed. Results from 500 Hp true tests and 1400 Hd true tests showed that 1) the MLE function had strong repeatability, 2) the MLE algorithm performed correct as the MLE likelihoods were equal to or slightly larger than the likelihoods obtained with the ground truth parameters, and 3) few Hd true tests yielded a false-positive result/ LR > 1. False-positives occurred with some three- or four-person mixtures and the largest LR was 361 without theta correction. Further detailed information regarding these analyses can be found in Supplementary Material 1.

2.3. Model validation Both EuroForMix and DNAStatistX include a model validation option during which the cumulative probability of the expected peak height for each allele is plotted against the observed peak height, yielding a PP (probability-probability) plot. EuroForMix uses the AdaptIntegrate function for model validation. As this function is not available in Java (the programming language for DNAStatistX) TrapezoidIntegrator was selected (available from the org.apache.commons.math3.analysis.integration package). During model validation, a Bonferroni-corrected ‘goodness of fit test’ is performed where the significance level is set to 0.01 by default. When the observed peak heights deviate from the expected peak heights, alleles will plot outside of this ‘0.01-line’. Ten analyses with data points in EuroForMix outside (n = 5) and within (n = 5) this 0.01-line were also analysed in

3.2. Accuracy of DNAStatistX The MLE algorithm in DNAStatistX v1.0.0 is slightly different from the MLE calculation in EuroForMix v1.11.4 as a CMA-ES instead of an 84

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al.

EuroForMix likelihood was larger. Twelve analyses yielded a log10(LR) difference of which two had a difference whose log10 was more than 0.5, namely 0.63 and -0.59. These concerned one Hd true test using a four-person mixture that yielded log10(LR) 0.630 using DNAStatistX and log10(LR) -0.004 using EuroForMix, and one Hd true test using a three-person mixture that resulted log10(LR) -18.79 with DNAStatistX and log10(LR) -18.21 using EuroForMix. These DNAStatistX and EuroForMix results would yield the same statement in forensic casework (both neutral or exclusion). In addition, when extracting the parameter values obtained by the optimizer in EuroForMix and plugging these into DNAStatistX no differences were obtained between the two software programs, showing that the implementation of the code to calculate the LRs is correct and that differences were the result of using a different optimizer.

Table 3 Overview of the largest differences observed between DNAStatistX and EuroForMix as a result of the use of a different optimizer. Further details on the differences is presented in Supplementary Fig. 2. Difference ina

Number of analyses out of 160

Range of differences (DNAStatistX minus EuroForMix)b

None of the parameters, LR or likelihoods Log10(LR) Log likelihood Hp Log likelihood Hd LR per locus Expected peak height (RFU/ %)

38/160

Not applicable

12 / 160 5 / 160 2 / 160 97 / 3520 5 / 160

Peak height variation Mixture proportion Degradation slope

27 / 160 220 / 470 26 / 160

−0.59 to +0.63 −1.3 to +7.6 +1.6 and +7.2 −72 to +5.2 −135 to +32 RFU / 0.6% to +2.1% −0.1 to +0.4 −0.4 to +0.3 −0.01 to +0.02

3.3. Effect of dye specific detection thresholds

a Comparisons between the two software programs were performed using the LRs (overall and per marker, compared up to two significant digits) and estimated parameters (compared up to four significant decimals). For the expected peak height a minimum difference of 1.0 was used; differences that were smaller than the level up to which the comparisons were performed were regarded as equal results. b A negative number means that EuroForMix yielded a larger value and a positive number that DNAStatistX yielded a larger value.

DNAStatistX requires, like EuroForMix, user defined detection thresholds, so that the model is fed with RFU values below which dropout is certain. EuroForMix (at least up to version 2.0.4) makes use of the same detection threshold at all loci (an overall detection threshold). As we use dye-specific detection thresholds for PPF6C profiles, the possibility of dye (or even locus)-specific detection thresholds was implemented in DNAStatistX. Since the genotyping data were generated using dye-specific detection thresholds, one would expect larger likelihoods with dye-specific settings in DNAStatistX. This was confirmed for both lower and higher dye-specific thresholds (the overall detection threshold used in the comparison was the lowest RFU value of the dyespecific thresholds) using a set of 20 hypotheses (Fig. 1). We noted a stronger increase for the higher dye-specific detection thresholds (Fig. 1). As the increase in likelihoods was about equal under Hp and Hd, the differences on log10(LR) were minimal, ranging from -1.0 to +1.1 (results not shown).

nlm optimizer is used (Table 2). An optimizer searches for the parameters that best explain the mixture proportion, peak height expectation, peak height variance, degradation slope (and stutter). A different optimizer may invoke slight differences between the results for the two software programs. Comparisons were performed for 160 sets of hypotheses that did not include a rare allele and used an overall detection threshold, plus 20 sets of hypotheses that included a rare allele. Results reveal small differences between the two software programs which are depicted in Table 3 and Supplementary Fig. 2. The different implementation of the rare allele frequency hardly affected the outcomes; 2/20 analyses with a rare allele revealed no differences and the differences observed for the remainder 18 analyses (results not shown) were in the same range as those presented in Table 3 for analyses without a rare allele but a different optimizer. Twenty-four percent (38/160) yielded equal results regarding the likelihoods, LRs and optimizer parameters (Table 3). We observed seven instances with a log likelihood difference between DNAStatistX and EuroForMix. These concerned one analysis of which a difference was observed under both Hp and Hd, four times under Hp only and once under Hd only. In 4/7 analyses the DNAStatistX likelihood was larger and in 3/7 analyses the

3.4. Precision and robustness When a calculation is repeated in DNAStatistX, small differences may occur as the optimizer will not always retrieve the exact same parameter values. To test the precision of DNAStatistX, ten analyses were run in duplicate. The duplicate analyses yielded LRs (overall and per locus) that were the same up to at least two decimal places, indicating good precision of the software. To test robustness of the software, 17 flawed scenarios were analysed as presented in Supplementary Table 4. All gave results as expected. For instance, DNAStatistX includes only the loci in the

Fig. 1. Differences in log likelihoods between a dye-specific or overall detection threshold (DT) in DNAStatistX. Results obtained under Hp and Hd are presented as black circles and white triangles, respectively. Comparisons for A) high detection thresholds and B) low detection thresholds. 85

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al.

calculations that are present in the trace and all reference profiles, provided these loci are not empty in the reference profiles; other loci do not receive an LR calculation. Also the loci involved in the Hd as in the Hp calculation are the same.

combined data not all individual data points follow the expected peak heights. Next, we studied model validation performance with PPF6C profiles obtained using 32 instead of 29 PCR cycles [21] which results in more stochastic variation. The stochastic threshold for these types of profiles is around 3700 RFUs (on a 3500xL apparatus), whilst the stochastic threshold for 29 cycles PPF6C profiles is 800 RFUs [27]. Model validations for all six analyses with 32 cycles PPF6C profiles resulted in many data points outside the 0.01-line under both Hp and Hd. This demonstrates that the gamma distribution model appears not suitable for these types of profiles with large stochastic variation and it was decided to not apply DNAStatistX (at least v1.0.0) to 32 cycles PPF6C profiles. Examples of the failed model validations are shown in Supplementary Fig. 1.

3.5. Model validation The model validation option is an important quality check to infer whether the peak height data follow the peak heights expected by the model. The TrapezoidIntegrator was chosen as the Java integrator for implementation of the model validation feature in DNAStatistX v1. When starting an LR calculation in DNAStatistX the model validation is automatically performed, whilst in EuroForMix the model validation is optional. For comparison between the EuroForMix and DNAStatistX model validation a total of ten analyses were selected. All ten analyses showed the same data points (with a maximum difference of 0.01) for the two software programs. As a result, the analyses that failed model validation did this in both EuroForMix and DNAStatistX. When a peak of improbable height was introduced at a long fragment model validation failed as expected. It cannot be deduced if the specific peak was causative for a failed model validation since peak height expectation considers all peaks which causes others to reside outside the 0.01-line in the model validation plot. The 180 Hp and 180 Hd analyses described in Section 2.2.1 were inspected for model validation with at least two data points outside the 0.01-line. This occurred for 27 Hp analyses and for 11 Hd analyses. These relate to three types of analyses: 1) Hp analyses with a noncontributor as POI. Failing model validations can be expected for analyses with a non-contributor as peak heights are more difficult to explain since Hp is false. Generally, these analyses will yield low (exclusionary) LRs [17]. 2) Analyses in which the degradation model was not applied while the data did show degradation (descending peak heights with increasing fragment lengths). The peak heights expected by the model will then not fit the data. 3) Analyses with three replicates having extraordinary variation for the peak heights between the replicates. As the optimizer searches for parameters that best explain the

3.6. Processing time For a two-person mixture with four alleles on a locus and two unknowns under Hd there are 15 different genotypes possible (including possibilities for drop-out and/ or drop-in) per contributor which gives a total of 225 different genotype combinations for the two contributors (15*15). These calculations require about 5000 optimizer steps, and for a PPF6C profile with 23 autosomal markers this yields a total of about 25,875,000 likelihood estimate calculations under Hd (225*5000*23). With a four-person mixture this number increases to hundreds of billions of calculations. For efficient use of the processing time the number of optimizer start points, or iterations, was examined in more detail. The optimizer runs multiple times (denoted as the number of iterations) to enable reaching the global optimum. By default the number of optimizer iterations is four in EuroForMix. DNAStatistX uses a different optimizer (Table 2) and showed robust results when using four iterations. To minimize processing time, it was examined whether fewer iterations suffice. In 96.1% (23,233/24,184) of the repeated optimizer runs, the same largest log likelihood was obtained (Fig. 2, log likelihood differences of zero on the X-axis). For the 3.9% of the optimizer runs for which a difference was found, it is apparent that lower log likelihoods present larger variation between optimizer iterations (Fig. 2). Table 4 Fig. 2. Variation in log likelihoods between iterations of the optimizer in DNAStatistX. The sizes of the circles are proportional to the number of analyses that yielded this optimum. The large circles at X=0 are the iterations yielding the global optimum. The smaller circles on the right show the iterations that did not yield the global optimum. The further to the right, the larger the difference compared to the global optimum.

86

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al.

3.7. Discussion on guidelines for use in forensic casework

Table 4 Overview of the numbers/percentages of Hp and Hd analyses that yielded the optimal likelihood (‘succeeded’) and the probabilities of missing the optimum when using one, two, three or four iterations for the optimizer in DNAStatistX.

Number of analyses Number of succeeded analyses p(missed) 1 iteration: Missed: 1 in 2 iterations: Missed: 1 in 3 iterations: Missed: 1 in 4 iterations: Missed: 1 in

Total

Hp

Hd

24,184 23,233 3.9% 25 647 16,445 418,206

12,092 11,594 4.1% 24 590 14,315 347,596

12,092 11,639 3.7% 27 713 19,020 507,691

Within the forensic community it is encouraged to share results from internal validation of probabilistic interpretation software [24]. To that effect a number of such studies have been published. Internal guidelines that address issues like when to apply probabilistic genotyping software or when to report the obtained LR are less shared. Such guidelines are based on the validation study outcomes and the usefulness to a forensic case as informed by the expertise collected in casework. In more detail, we wish to perform LR calculations when regarded useful (i.e. decide to subject traces to calculations if evidential value can be obtained, but otherwise not) and obtain uniformity among reporting officers within the laboratory in the application of software programs and the interpretation process. In our laboratory process, first the DNA profiles are analysed technically and independently using GeneMarkerHID. Then, (within DNAxs) the quality of the DNA profiles is assessed after which the profiles are compared to reference profiles and only when certain criteria are met, one proceeds to an LR calculation (using DNAStatistX) [32–34]. Calculated LRs are validated through examination of a.o. the model validation results. LRs are reported when they exceed an LR threshold (‘T’) of 10^4 as calculations resulting in lower LRs can be more sensitive to the model and parameters used. In addition, in theory it is not possible that non-contributors have a probability larger than 1/ T to obtain an LR of at least T, for models that use parameters obtained independently of the POI. We expect that also for this model, even if the MLE does depend on the POI, these bounds are respected since the MLE estimate presents LRs close to those with the generating parameters. Indeed, the largest LR for an unrelated non-contributor was 361 in 1400 trials (Supplementary Material 1). Replicate analyses are performed in cases where an LR based on one replicate is between 10^4 and 10^6. Replicate analyses tend to increase LRs for true contributors and decrease LRs for non-contributors such as relatives of a true contributor [17]. In the comparison preceding the LR calculation, mismatches to the reference profile may occur because of drop-out in the trace profile or because the reference profile is from a non-contributor. We only proceed to an LR calculation when there is a maximum of 15 mismatches with two-person, ten mismatches with three-person and five mismatches with four-person PPF6C profiles. Larger numbers of mismatches often lead to LRs in the uninformative zone, i.e. below 10^4 [17]. We are aware that sometimes a large LR for a true contributor (e.g. LR = 9000) may be missed because of these restricted guidelines which is accepted, although deviation from the guidelines is possible with proper argument. When one replicate yields an LR in the range 10^6 to 10^10, replicate analyses is performed in a case-dependent manner, i.e. when more information on the POI or number of contributors is needed. Clearly, the case circumstances, availability of other stains and profiles within the case are factors considered in the process. The summarized guidelines as presented in Fig. 3 are not intended as strict rules but are used as helpful tool in the decision making process.

shows the probabilities of missing the optimum likelihood when using one, two, three or four optimizer iterations. With three iterations the probability of missing the optimum is 1 in 16,445 (Table 4), which was regarded an acceptable number. Although this probability is rather low, for 3% of the Hp/ Hd calculations the optimum was missed in more than 30% of the repeated analyses (data not shown). Therefore, the number of optimizer iterations in DNAStatistX was programmed dynamically to have a minimum of three iterations: if with three iterations the likelihoods are not identical (with a maximum difference of two decimal places on log10 scale), the number of iterations is increased by one until three times the same largest likelihood is obtained. The maximum number of iterations was set to ten. This flexible iteration number enables efficient use of the calculation capacity as more than three iterations are only applied when needed. Furthermore, using this approach the probability of missing the optimum is almost negligible; the probability of missing the optimum after ten iterations was 6E-12. Considering the 180 analyses presented in Section 2.2.1, the processing time was specifically determined by the number of contributors. With one, two, three or four unknowns under Hd, the average processing time of the MLE calculation was less than a second, 18 s, 7.5 min and 11.75 h, respectively (Table 5). The total processing times are longer due to the model validation, which on average required 47% of the total processing time (Table 5). Processing times were comparable to those of EuroForMix [28]. Up to three-person mixtures this was faster or comparable to processing times reported for LR calculations using other probabilistic genotyping software programs [29–31]. With regard to four-person mixtures, DNAStatistX was found to have longer total processing times. With the most complex analyses (four unknowns under Hd) the total processing time varied from 10 min (with an NGM profile) or three hours (with a PPF6C profile) up to 113 h (with a PPF6C profile and three replicates). The fact that we used 23 loci-profiles including the highly complex SE33 locus, three replicates and model validation have an effect. More efficient programming is expected to decrease processing time, especially for model validation, in the near future.

Table 5 Average processing times of the DNAStatistX analyses (MLE and model validation). Number of contributors

Knowns conditioned under Hd

Unknowns under Hd

Mean time MLE calculation

Mean time model validation

Mean time for MLE & model validation

Number of analyses

1 2 2 3 3 4 4

0 1 0 1 0 1 0

1 1 2 2 3 3 4

00:00:00 00:00:00 00:00:10 00:00:26 00:07:15 00:13:16 11:45:11

00:00:04 00:00:06 00:00:20 00:00:27 00:09:13 00:01:28 13:43:03

00:00:04 00:00:06 00:00:30 00:00:53 00:16:28 00:14:44 25:28:14

2 7 49 7 42 2 48

87

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al.

Fig. 3. Summarized example of guidelines for the application of DNAStatistX (or EuroForMix) to PPF6C profiles in casework.

4. Concluding remarks

[2] M.D. Coble, J.A. Bright, Probabilistic genotyping software: an overview, Forensic Sci. Int. Genet. 38 (2019) 219–224. [3] B. Haldemann, S. Dornseifer, T. Heylen, C. Aelbrecht, O. Bleka, H.J. Larsen, S. Willuweit, A. Alonso, J.M. Teodoridis, J. Morzfeld, L. Zatkalikova, M. Krupsky, B. Berger, W. Parson, N. Morling, P. Gill, U. Neuhaus-Steinmetz, eDNA—an expert software system for comparison and evaluation of DNA profiles in forensic casework, Forensic Sci. Int. Genet. Suppl. Ser. 5 (2015) e400–e402. [4] Ø. Bleka, L. Prieto, P. Gill, CaseSolver: an investigative open source expert system based on EuroForMix, Forensic Sci. Int. Genet. 41 (2019) 83–92. [5] C.C.G. Benschop, T. Sijen, LoCIM-tool: an expert’s assistant for inferring the major contributor’s alleles in mixed consensus DNA profiles, Forensic Sci. Int. Genet. 11 (2014) 154–165. [6] https://www.forensischinstituut.nl/wetenschap–innovatie/europese-projecten/ dnaxs. Website under construction. [7] C.C.G. Benschop, J. de Jong, L. van de Merwe, H. Haned, Adapting a likelihood ratio model to enable searching DNA databases with complex STR DNA profiles, Proceedings of the 27th ISHI, Promega.com. [8] C.C.G. Benschop, L. van de Merwe, J. de Jong, V. Vanvooren, M. Kempenaers, C. van der Beek, F. Barnie, E. López Reyes, L. Moulin, L. Pene, P. Gill, H. Haned, T. Sijen, SmartRank: a likelihood ratio software for searching national DNA databases with complex DNA profiles, Forensic Sci. Int. Genet. 29 (2017) 145–153. [9] C.J. van Dongen, K. Slooten, M. Slagter, W. Burgers, W. Wiegerinck, Bonaparte: application of new software for missing persons program, Forensix Science International: Genetics Supplement Series 3 (2011) e119–e120. [10] J. Hoogenboom, K.J. van der Gaag, R.H. de Leeuw, T. Sijen, P. de Cnijff, J.F.J. Laros, FDSTools: a software package for analysis of massively parallel sequencing data with the ability to recognise and correct STR stutter and other PCR or sequencing noise, Forensic Sci. Int. Genet. 27 (2017) 27–40. [11] R.G. Cowell, T. Graversen, S. Lauritzen, J. Mortera, Analysis of forensic DNA mixtures with artefacts, J. R. Stat. Soc. Ser. C 64 (1) (2015) 1–48. [12] T. Graversen, S. Lauritzen, Computational aspects of DNA mixture analysis, Stat. Comput. 25 (3) (2015) 527–541. [13] www.lrmixstudio.org. [14] https://softgenetics.com/GeneMarkerHID.php, Accessed April 26, 2019. [15] https://www.ncbi.nlm.nih.gov/projects/SNP/osiris/, Accessed April 26, 2019. [16] Ø. Bleka, C.C.G. Benschop, G. Storvik, P. Gill, A comparative study of qualitative and quantitative models used to interpret complex STR DNA profiles, Forensic Sci. Int. Genet. 25 (2016) 85–96. [17] C.C.G. Benschop, A. Nijveld, F.E. Duijs, T. Sijen, An assessment of the performance of the probabilistic genotyping software EuroForMix: trends in likelihood ratios and analysis of Type I & II errors, Forensic Sci. Int. Genet. 42 (2019) 31–38. [18] European Network of Forensic Science Institutes (ENFSI), Best Practice Manual for the internal validation of probabilistic software to undertake DNA mixture interpretation ENFSI-BPM-DNA-01 issue 001, 17052017. [19] Scientific working group on DNA analysis methods (SWGDAM), Guidelines for the Validation of Probabilistic Genotyping Systems, Accessed April 2019 (2015) http:// media.wix.com/ugd/4344b0_22776006b67c4a32a5ffc04fe3b56515.pdf. [20] Forensic Science Regulator, Software validation for DNA mixture interpretation, FSR-G-223 (1) (2018). [21] M.D. Coble, J. Buckleton, J.M. Butler, T. Egeland, R. Fimmers, P. Gill, et al., DNA Commission of the International Society for Forensic Genetics: recommendations on the validation of software programs performing biostatistical calculations for forensic genetics applications, Forensic Sci. Int. Genet. 25 (2016) 191–197. [22] J.A. Bright, I.W. Evett, D. Taylor, J.M. Curran, J. Buckleton, A series of recommended tests when validating probabilistic DNA profile interpretation software, Forensic Sci. Int. Genet. 14 (2015) 125–131. [23] H. Haned, P. Gill, C. Lohmueller, C. Inman, N. Rudin, Validation of probabilistic genotyping software for use in forensic DNA casework: definitions and illustrations, Sci. Justice 56 (2016) 104–108.

This study demonstrates that the implementation of the EuroForMix code in DNAStatistX was correct, as same results were obtained when same parameters were used. Minor differences (within one unit on log10 scale) that were observed can be explained by the use of a different optimizer. Robust and precise answers were obtained under the tested scenarios. The probability of not obtaining the best explaining parameters is almost nil with the dynamic number of optimizer iterations as programmed in DNAStatistX. The implementation of dye-specific detection thresholds showed improved results with our data compared to an overall detection threshold. DNAStatistX, as a stand-alone software program or implemented in DNAxs, is suitable for its intended use for the interpretation of single donor and mixed autosomal DNA profiles. DNAxs with DNAStatistX adds to the set of software solutions for assessing the weight of evidence of (complex) DNA profiles that are encountered in forensic casework. DNAxs is built modular, such that new features, like working with massively parallel sequencing data for autosomal or mitochondrial DNA, can be incorporated. Existing features can be updated for improved performance. Furthermore, DNAxs is able to communicate via a web API to various systems. It is expected that the DNAxs software program will be disseminated early 2020 (details will be provided on https://www.forensischinstituut.nl/wetenschap–innovatie/europeseprojecten/dnaxs). With the DNAxs software program we laid the basis for a future-proof and increasingly automated workflow of data management and probabilistic interpretation of DNA profiles. Acknowledgements We are very grateful to Dr. Øyvind Bleka and Prof. Peter Gill for their support, helpful discussions and for critically reading the manuscript. This study was partly funded by the European Union’s Internal Security Fund — Police (Proposal Number: 820838, Proposal Acronym: DNAxs2.0). Appendix A. Supplementary data Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.fsigen.2019.06.015. References [1] Ø. Bleka, G. Storvik, P. Gill, EuroForMix: an open source software based on a continuous model to evaluate STR DNA profiles from a mixture of contributors with artefacts, Forensic Sci. Int. Genet. 21 (2016) 35–44.

88

Forensic Science International: Genetics 42 (2019) 81–89

C.C.G. Benschop, et al. [24] PCAST, Forensic Science in Criminal Courts: Ensuring Scientific Validity of Featurecomparison Methods, US President’s Council of Advisors on Science and Technology, 2016 obamawhitehouse.archives.gov/sites/default/files/microsites/ ostp/ PCAST/ pcast_forensic_science_report_final.pdf. [25] J. Ropero-Miller, Landscape Study of DNA Mixture Interpretation Software, National Institute of Justice, Forensic Technology Centre of Excellence, 2015, rti. connectsolutions.com/mist/. [26] A.A. Westen, T. Kraaijenbrink, E.A. Robles de Medina, J. Harteveld, P. Willemse, S.B. Zuniga, K.J. van der Gaag, N.E.C. Weiler, J. Warnaar, M. Kayser, T. Sijen, P. de Knijff, Comparing six commercial autosomal STR Kits in a large Dutch population sample, Forensic Sci. Int. Genet. 10 (2014) 55–63. [27] F. Duijs, L. van de Merwe, T. Sijen, C.C.G. Benschop, Low-template methods yield limited extra information for PowerPlex® Fusion 6C profiling, Leg. Med. 33 (2018) 62–65. [28] http://euroformix.com/sites/default/files/EuroForMixTheory_ISFG17.pdf. [29] J.A. Bright, D. Taylor, C. McGovern, S. Cooper, L. Russell, D. Abarno, J. Buckleton,

[30] [31] [32] [33] [34]

89

Developmental validation of STRmixTM, expert software for the interpretation of forensic DNA profiles, Forensic Sci. Int. Genet. 23 (2016) 226–239. S. Manabe, C. Morimoto, Y. Hamano, S. Fujimoto, C. Tamaki, Development and validation of open-source software for DNA mixture interpretation based on a quantitative continuous model, PLoS One 12 (11) (2017) e0188183. https://www.bioke.com/webshop/sg/mastr-from-softgenetics.html, accessed April 15 2019. T.M. Clayton, J.P. Whitaker, R. Sparkes, P. Gill, Analysis and interpretation of mixed forensic stains using DNA STR profiling, Forensic Sci. Int. 91 (1998) 55–70. A.J. Meulenbroek, T. Sijen, C.C.G. Benschop, A.D. Kloosterman, A practical model to explain results of comparative DNA testing in court, Forensic Sci. Int.: Genet. Suppl. Ser. 3 (2011) e325–e326. C.C.G. Benschop, H. Haned, T.J.P. de Blaeij, A.J. Meulenbroek, T. Sijen, Assessment of mock cases involving complex low template DNA mixtures: a descriptive study, Forensic Sci. Int. Genet. 6 (2012) 697–707.