Next generation database search algorithm for forensic mitogenome analyses

Next generation database search algorithm for forensic mitogenome analyses

Accepted Manuscript Title: Next generation database search algorithm for forensic mitogenome analyses Authors: Nicole Huber, Walther Parson, Arne Dur ...

826KB Sizes 0 Downloads 34 Views

Accepted Manuscript Title: Next generation database search algorithm for forensic mitogenome analyses Authors: Nicole Huber, Walther Parson, Arne Dur ¨ PII: DOI: Reference:

S1872-4973(18)30297-7 https://doi.org/10.1016/j.fsigen.2018.09.001 FSIGEN 1962

To appear in:

Forensic Science International: Genetics

Received date: Revised date: Accepted date:

29-5-2018 29-8-2018 3-9-2018

Please cite this article as: Huber N, Parson W, Dur ¨ A, Next generation database search algorithm for forensic mitogenome analyses, Forensic Science International: Genetics (2018), https://doi.org/10.1016/j.fsigen.2018.09.001 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Next generation database search algorithm for forensic mitogenome analyses

a

Institute of Legal Medicine, Innsbruck Medical University, Innsbruck, Austria

b

Forensic Science Program, The Pennsylvania State University, University Park, Pennsylvania,

USA Institute of Mathematics, University of Innsbruck, Austria



N

U

c

SC RI PT

Nicole Hubera, Walther Parsona,b, , Arne Dürc

corresponding author: Institute of Legal Medicine, Medical University of Innsbruck,

A

Müllerstraße 44, 6020 Innsbruck, Austria; [email protected], Tel +43 512 9003 70651,

TE

D

M

Fax +43 512 9003 73640

Keywords

A

CC

EP

Mitochondrial DNA; Sequencing; EMPOP; String search; Phylogenetic; Alignment;

1

Highlights  A new mtDNA sequence query algorithm results is presented for the EMPOP database  Updated nomenclature conventions for the instable regions are presented  Neighbours can be queried by phylogenetic costs  A total of 28 block indels containing between 2 and 264 basepairs were defined The effect of length variant regions for matches and neighbours was evaluated

A

CC

EP

TE

D

M

A

N

U

SC RI PT



2

1. Introduction The analysis of mitochondrial DNA (mtDNA) has found a vital niche in forensic genetics and is typically applied in cases where nuclear DNA is absent or insufficiently present in biological samples [1] or when individuals need to be identified along the maternal lineage across many

SC RI PT

generations [2]. This molecule was key in understanding human evolution [3] and is still being used as informative marker to investigate human migration and the peopling of particular regions of the world [4]. Mitochondrial DNA is further an important marker in medical genetic applications [5]. Due to its maternal mode of inheritance [6] the rarity of mitochondrial haplotypes (mitotypes) has been estimated using mtDNA databases [7]. The forensic community is particularly sensitive to the quality control of mitotypes stored in mtDNA databases, as probability estimates are usually biased to the disadvantage of a suspect when erroneous data are [8-19].

Therefore,

the

European

DNA

Profiling

U

present

Group

(EDNAP;

N

https://www.isfg.org/EDNAP) launched an initiative in 1999 to perform rigorous quality control

A

of mtDNA datasets and endorsed the development of a centrally curated mtDNA database that went online in 2006 under the acronym EMPOP (https://empop.online; [20]). This database has

M

soon developed from a mere repository of global mtDNA population data to include quality control functions [17] that allow the user to a posteriori check mitotypes with plausibility and

D

phylogenetic tools [20-22]. The editors of the forensic genetic journals Forensic Science

TE

International Genetics and International Journal of Legal Medicine require authors to have their data quality controlled prior to submission of the manuscript to the journal [23, 24], which has

EP

led to significant improvement of the quality of mitotypes in these journals. Another source of biased probability estimates can be differing alignments between query and

CC

database sequences when a database search is performed by comparing difference-coded haplotypes (and not the nucleotide sequences; see examples below). MtDNA sequences have

A

been conventionally reported relative to the corrected version of the first sequenced human mitogenome [25]. It is known that the alignment of insertions and deletions (indels) is often ambiguous, as usually more than one single alignment exists. The DNA Commission of the International

Society

for

Forensic

Genetics

(ISFG,

https://www.isfg.org/)

issued

recommendations in an attempt to harmonize alignment by placing indels 3’ to the event on the 3

forward sequencing strand [26], however, numerous exceptions continued to exist that required additional conventions [27]. Currently, the forensic community uses a phylogenetic approach [28] to align mitotypes [29, 30], which is being employed by worldwide laboratories and proficiency testing programs (e.g. GEDNAP, https://gednap.qualitype.de/#/home). The basic concept of phylogenetic alignment follows the rationale that ‘… good alignments of related sequences are

SC RI PT

ones that better reflect the evolutionary relationship between them’ [31]. We use the term "phylogenetic alignment" as introduced in [28]. Formally, phylogenetic alignment can be described as global pairwise alignment of the query sequence to the sequence motif of the haplogroup of the sample in question using weighted parsimony. The concept of haplogrouping by sequence motifs (alignment) was also discussed in [32]. Alignment also has forensic relevance, as mitotypes are known to slightly vary between [33] and even within tissues of the

U

same individual [34]. The observed differences include mixtures of nucleotides at given positions

N

(termed point heteroplasmy) and changes in the number of nucleotides in length variant regions (known as length heteroplasmy). The effect of the latter on alignment is shown in the following

A

example taken from the EMPOP database [20]: under phylogenetic alignment the mitotype

M

16183C 16188T 16189C (present in different haplogroups e.g. B4b or C1 [35], denoted as mitotype 1 in Table S1) shows one difference to the neighbour mitotype 16183C 16188T 16189C

D

16193del (motif 1 in Table S1), as it harbours one C-residue less in the respective region. Under

TE

(unweighted) maximum parsimony the neighbour motif would be aligned and called 16183del (motif 2 in Table S1), as it requires only one difference to the rCRS (compared to four differences under phylogenetic alignment). Therefore, the more parsimonious alignment puts the

EP

neighbour sequence (motif 2) at a distance of three differences relative to the mitotype 1, although the two sequences are separated by one mutation only. In a forensic case setting with

CC

two samples, e.g. unknown evidentiary sample and reference sample, these alignment differences can lead to different interpretation (from inclusive/inconclusive to exclusion) and should

A

therefore be avoided. Another commonly observed ambiguity caused by different alignment can be visualized in the sequence region G247 A248 A249 (rCRS) (see Table S2) that shows haplogroup-specific deletions of an A residue at 249 (e.g. CZ, F; [35]). Since it is unknown, which of the two A’s was deleted (248 or 249) the forensic 3’ convention suggests 249del as shown in mitotype 3. In some haplogroups (e.g. C1f, L1c1c, or L5b1) this deletion is 4

accompanied by a 247A transition, which would be called 247A 249del phylogenetically (mitotype 4, motif 3), while the more parsimonious alignment would favor 247del (mitotype 4, motif 4). In forensic mtDNA database searches that are based on comparing rCRS-coded mitotypes such alignment differences would result in biased database search results, as the same

SC RI PT

sequence would not be found due to different alignment. These and similar problems were solved by introducing alignment-immune database searches such as SAM [36], which was implemented in EMPOP 3 in 2010. With this approach query mitotypes as well as database profiles are turned into sequence strings and compared as unaligned sequence strings. This guarantees that a mitotype is found in the database regardless of the alignment convention applied by the user. While SAM solved the immediate need to harmonize mitotype searches in EMPOP, ambiguity remained in how (forensic) laboratories were aligning

U

and reporting mtDNA sequences in population studies. The extent of this problem can be

N

visualized by searching for 16189C and 16189del in published and in internet mtDNA datasets.

A

Here one finds all possible variants of describing the unification of the two small C-runs between 16184 - 16188 and 16190 - 16193 in the presence of 16189C that leads to a long uninterrupted C-

M

run and is typically showing extensive length heteroplasmy. Depending on the remaining Cresidues a wide variety of alignment variants are being reported that are seemingly different from

D

each other but in fact are only based on different alignment principles (e.g. 16189del, 16189C

TE

16193del). The same applies to 310C or other regions in the mitogenome, where length variants are introduced by substitutions. The application of SAM standardizes the database search in these

EP

cases. However, SAM lacks features to harmonize alignment in the reported mitotypes. Here, we scrutinized more than 50.000 (partial) mitogenome sequences of forensic quality in order to:

CC

1. characterize both phylogenetically robust and highly fluctuating regions 2. define block-indels that are composed of multiple nucleotides but mutate together

A

3. extended the search for additional hotspot length variants that are prone to length variation on an evolutionary level.

5

SAM 2 will be introduced into the updated database version EMPOP 4 to provide (forensic) users with unbiased database search results and enable harmonization of mtDNA alignment not only in the forensic community but also in other genetic fields.

SC RI PT

2. Materials and Methods SAM 2 was developed on the basis of SAM [36], an alignment-immune mtDNA database search software that was implemented into the EMPOP internet database version 3.0 (Dec-27-2010; https://empop.online) and updated Nov-20-2015 with slight modifications (see history

Terms and nomenclature

N

2.1

U

https://empop.online/updates).

Throughout the manuscript the terms ‘mitotype’ and ‘profile’ are used synonymously to describe

A

the (rCRS-coded) nucleotide sequence of an investigated mtDNA fragment. For easier reading we

M

sometimes use the term ‘variant(s)’ instead of the correct but more bulky expression ‘differences to the rCRS’ to describe variation relative to the reference sequence. Mitotypes are reported

D

relative to the rCRS [25] following the extended IUPAC code described in [30]. In brief, the

TE

traditional notation in forensics indicates the variant nucleotide as suffix to the position, whereas a prefix denotes the rCRS nucleotide, e.g. 152C, T152 (the rCRS harbours a T at position 152). Insertions are indicated with a “.” (dot) followed by the number of insertions 3’ to the position

EP

after which the insertion is observed, e.g. a T insertion relative to the rCRS between A490 and C491 is called 490.1T (490insT or 490+T are alternative notations). Deletions relative to the

CC

rCRS are notated with a “-” (gap). We note that in some mtDNA databases as well as in some publications the gap is inconsistently denoted by using “D” or “d”, which is in conflict with the

A

existing IUPAC code for “D” (mixture of A/G/T). For example, the sequence HM238207.1, a representative mitogenome in Phylotree 17 [35] that is deposited in GenBank [37] contains the symbol d at positions 526 and 527, which we consider to be the frequently observed 523del 524del. In the updated recommendations on human mtDNA analysis of the International Society for Forensic Genetics [30] the IUPAC code was extended to using small case letters to account 6

for (heteroplasmic) mixtures of a nucleotide and a deletion at a given position, which we also use here. Thus, 249g would refer to a mixture of 249G and 249del in a given sequence.

2.2

Symbols and matching in SAM 2

SC RI PT

Query and database profiles are compared as strings over the symbol alphabet with unique symbols A,C,G,T and - (the gap for a deletion), and ambiguous symbols R,Y, S, W, K, M, B, D, H, V, N, a, c, g, t, r, y, s, w, k, m, b, d, h, v, n applying the extended IUPAC code [30].

A query profile symbol matches a database profile symbol in pattern mode if both symbols share a joint symbol and in literal mode if both symbols are equal. E.g. the query 152Y matches in pattern mode all database profiles that contain a C or T or both at position 152, whereas in literal

U

mode only those database profiles match that contain a mixture of C and T at position 152

N

(=152Y). As EMPOP currently includes only profiles with 152C or T152 or 152Y, the query

A

152Y matches all EMPOP profiles in pattern mode but only 108 EMPOP profiles with 152Y in

M

literal mode. 152Y would also match the GenBank profile GQ301863.1: 16221T 16249C 16288C 16301T 16304C 16390A 16519C 73G 152H 263G 315.1C 329A 513A

D

593C 750G 1040C 1117G 1438G 2706G 4491A 4769G 6770G 7028T 7325G 7598A 8860G

TE

10316G 10609C 10877T 11719A 13145A 13359A 14766T 15326G 15885T with 152H in pattern mode but not in literal mode (H denotes A or C or T). While pattern mode is the standard choice for estimating the probability of a given mitotype in EMPOP (= typical

EP

forensic search), the literal mode can be used for specifically investigating the occurrence of point heteroplasmy in the EMPOP mitotypes. The definition of pattern match in SAM 2 slightly

CC

extends that of SAM as described in [36] in order to widen the search (this extension was integrated in EMPOP on Nov-20-2015). For example, the query 152M matches database profiles

A

with 152Y in SAM 2 (joint symbol C) but not in SAM (neither M is contained in Y nor Y is contained in M).

7

Mutations and transcripts

2.3

SAM 2 discerns three different types of mutational events: 1. single-symbol substitutions or indels at arbitrary positions (aka point mutations, e.g. T52C, 16191insC or 299delC),

SC RI PT

2. block-indels at special positions (e.g. 105-110delCGGAGC or 292insAT) and

3. run-length changes in length variant regions (including length hotspots e.g. 307309delCCC, 524insAC or 8281-8289delCCCCCTCTA).

In contrast to block-indels length hotspots show different repetition counts. For example, the rCRS shows a repetition count of 2 for the pattern “CCCCCTCTA” starting at position 8272.

U

Inspection of the Phylotree 17 mitogenomes revealed that the repetition count for this pattern may vary from 1 to 4, e.g. the GenBank sequence EF657347.1 shows a repetition count of 1

N

whereas EU092880.1 shows a repetition count of 4. Following mitogenome data and literature

A

review [38-40] we added 28 block-indels and evaluated three additional potential length hotspots

M

(currently 9 are offered in EMPOP) compared to the original version of SAM (Table 1 and Table 2).

D

Standard forensic search settings ignore run-length changes in length variant regions, as they are

TE

known to vary within tissues of the same individual [8], but can individually be considered for a query in particular scenarios. Thus, standard queries provide conservative probability estimates that are preferred in the forensic context. The location of a run in an arbitrary database profile is

EP

determined by searching for the 3'-most occurrence of the repeating motif between the junctions of the run in the rCRS. If the motif is found the start and the end of the run is identified by

CC

proceeding 5' or 3' between the junctions of the run in the rCRS until the repetition stops. If the motif is not found no length changes are possible for that run (in the current EMPOP release R11

A

no database profile exhibits a complete loss of the run). As the C-runs around 309 and 315 are connected by the loss of T310, the junctions 302 and 316 are used for both runs. E.g. in the mitotype 308del 309del 315.1C 455C 523G 573.1C 574C the C-run around 309 extends from 303 to 307, the T-run around 455 from 452 to 454, the AC-run around 524 from 515 to 522, the C-run around 573 from 568 to 573.1, and all the other runs are identical to those in the rCRS. For 8

an EMPOP search using SAM 2 all run-length changes at length hotspots can be ignored except

A

CC

EP

TE

D

M

A

N

U

SC RI PT

for the stable insertion at position 315.

9

Table 1 List of block-indels defined in SAM 2. Block-indels diagnostic of Phylotree 17 haplogroups are marked by *; In column “Pattern” only the first 15 nucleotides are shown.

EP

M

D

Length [in bp] 15 154 7 6 6 7 3 5 4 16 2 2 51 6 4 2 3 4 14 8 2 204 16 264 5 15 4 3

N

U

SC RI PT

Pattern TCTCTGTTCTTTCAT AACCCAATCCACA… GTACATA CGGAGC GGAGCA GTGTGTT TAA TAACA ATTT ACATCATAACAAA… AT AT CCCTCCCCCCGCT… TCCCCC GCTT AT ATC AGAA ACCAGATTTCAAAT TACTACTA GC AACAAAGAACCC... CACAGTTTATGTA… ACTCCTCATTGTA… CGAGC TCGCAGGATTTTT… TTAA CTA

A

Deletion positions 16032.1-16032.15 16165-16318 16310-16316 105-110 106-111 209.1-209.7 241.1-241.3 286.1-286.5 291-294 291.1-291.16 292.1-292.2 292.2-292.4 307-356 310-315 316-319.0 342.1-342.2 343.1-343.3 368.1-368.4 398.1-398.14 471-478 524.1-524.2 563.1-563.204 568.1-588.16 3327-3590 6020-6024 9487-9501 14787-14790 16006.1-16006.3

TE

Insertion position 16032 16164 16309 104 105 209 241 286 290 291 292 292 306 309 315 342 343 368 398 470 524 563 588 3326 6019 9486 14786 16006

A

CC

No. 1 2 3 4* 5* 6 7 8 9* 10 11* 12 13 14 15 16 17 18* 19 20 21 22 23 24 25 26 27 28

10

Table 2 List of implemented length hotspots in SAM and considered for SAM 2. Hotspots already implemented in SAM are indicated by *. Note that the length variant at 315 is not displayed in

A

3’ junction 16194 292 316 316 456 464 525 574 961 5900 8277 8286 8290

U

N

5’ junction 16189 285 302 302 451 460 514 567 955 5894 8271 8280 8271

M

Length variant 16193 291 309 315 455 463 524 573 960 5899 8276 8285 8289

Repeating motif C A C C T C AC C C C C C CCCCCTCTA

D

No. 1* 2 3* 4 5* 6* 7 8* 9* 10* 11* 12* 13

SC RI PT

the EMPOP query site and cannot be changed by the user.

TE

In order to describe the results of database searches, so-called transcripts are used. A transcript specifies how the database profile can be transformed into the query profile by ‘mutational events’ at disjoint positions. The event count of a transcript specifies the number of events

EP

without matching point mutations or ignored run-length changes, which are both denoted by brackets. The cost of a transcript is computed as the sum of the costs of all events. The sum is

CC

zero for matching point mutations or ignored run-length changes, or calculated as log likelihood ratios of fluctuation rates as explained below. For example, the mitotype JQ703076.1: 16093C

A

16223T 16292T 16519C 58C 60.1T 60.2T 64del 65del 66T 73G 189G 194T 204C 207A 263G 315.1C 709A 750G 1243C 1438G 2706G 3505G 4769G 5046A 5460A 6528T 7028T 8251A 8618C 8860G 8994A 9809G 11674T 11696A 11719A 11947G 12414C 12705T 14766T 15326G 15775G 15884C from GenBank can be obtained from the Phylotree 17 haplogroup motif W5b1a by the transcript T16093C [Y16519C] C66T with event count 2 and costs of 0.459 + 0.000 + 11

1.586 = 2.045. The rCRS shows G66 and writing the string JQ703076.1 in difference notation relative to the rCRS creates the artificial transversion 66T.

2.4

Fluctuation rates and log likelihood ratios

SC RI PT

Probabilistic rates are used to measure the fluctuation of mutational events within haplogroups. We here assume that these rates are symmetric, i.e. reversing the mutation does not change the rate. SAM 2 updates the weighting of mutational events introduced in [36] for the control region by using a training set of 30,141 mitotypes (instead of 7,074 used for the earlier version of SAM), including the coding region and by applying the haplogroup classification of Phylotree 17.

U

In detail, the costs of mutational events are derived as follows:

For point mutations in the control region and for indels in the coding region the rates were

N

estimated from 30,141 forensic mitotypes from EMPOP by the method described in [41]. Due to

A

the lack of sufficient forensic data for transitions and transversions in the coding region the rates 𝑛

M

were estimated roughly by 2(𝑁+1) where N=30,141 and n is the number of observations provided in Table S3 of Soares et al. [42] based on 2,196 complete mitogenomes (only 24 different values

D

of observed numbers). For diagnostic Phylotree 17 haplogroup mutations unobserved in

TE

EMPOP’s forensic mitotypes and the Soares et al. data, the rate for one occurrence was used.

𝑢 𝑟

log10 (⁡ ) 3

where r is the rate of point mutation and u is the rate of no change. This weighting

CC

LLR =

EP

The log likelihood ratio (LLR) of a point mutation without ambiguous symbols is defined by

of mutations was introduced in [41] and provides values of approximately 1.0 for standard mutations. E.g. the transition T16519C shows a rate of 11% and an LLR of 0.3. To avoid

A

excessive LLRs minimal rates of 10-6 for transitions and 10-8 for transversions or indels are used so that the maximal LLRs are about 2.0 for unobserved transitions and 2.667 for unobserved transversions or indels (our choice of the minimal rates was motivated by the weights in Table 4 of [43] and an estimated transition/transversion rate ratio of 100 [43]. For a point mutation with 12

ambiguous symbols, the LLR is defined as follows: In pattern mode the LLR is reduced to the minimum of the corresponding LLRs. In literal mode, the LLR is the arithmetic mean of the two values obtained by averaging with respect to one symbol, the minimized LLRs with respect to the other symbol. E.g., the LLR of C replaced by Y is zero in pattern mode and one fourth of the match. The LLR of a block indel is defined by LLR =

1−𝑟 ) 𝑟

log10 (⁡ 3

SC RI PT

LLR of C replaced by T in literal mode. In each mode the LLR is zero if and only if the symbols

where r is the rate of the block indel.

E.g., the block deletion “105-110delCGGAGC” has a rate of 0.050% and an LLR of 1.10 whereas the block insertion 292insAT has a rate of 0.013% and an LLR of 1.29. For insufficient statistical data the default rate is 10-8 with an LLR of about 2.667.

defined by LLR(1) =

3

where r is the rate of the length change and u is the rate of no change

N

𝑢 𝑟

log10 (⁡ )

U

For a not-ignored length hotspot the log likelihood ratio of a length change by one repetition is

A

(in case of insufficient statistical data a default value of 0.5 for LLR(1) is used). For k >1 the

M

LLR of a length change by k repetitions is extrapolated linearly by the formula LLR(k) = LLR(1) * k.

D

E.g., 308-309delCC has an LLR of 0.4*2= 0.8, whereas the run-length change 524insAC has

Search algorithm

EP

2.5

TE

LLR 0.4*1 = 0.4.

The execution of the search algorithm follows a hierarchical approach. First, for a given query

CC

profile (or query string) the feasible database profiles are determined, which include all EMPOP mitotypes that have an identical or larger sequence range and thus become the statistical basis for

A

probability estimates. In the second step, the query profile and the feasible database profiles are converted to strings and compared either by the minimal event count or the minimal cost of suitable transcripts (can be selected by the user) in order to determine the neighbouring strings that differ from the query 13

string by at most d events or costs not higher than 2.667*d (currently d=2). This comparison is executed separately for each region of the joint range and the results (event counts/costs and optimal transcripts) are combined. Distributing the feasible database sequences to the available processors enables parallelization. To speed up the comparison bounded-distance dynamic programming [44] is applied that requires a tentative upper bound for the number of symbol

SC RI PT

differences (excluding matches but including ignored length changes) and restricts the search for transcripts to a small strip in the dynamic programming table. Using only the strip, the tabular algorithm checks whether the bound is valid and, in this case, yields the correct number of symbol differences. More precisely, let m and n denote the length of the database or query sequence, respectively, and let b denote the bound, then the full table has m+1 rows and n+1 columns. If |n-m| > b the strip is empty. Otherwise the strip consists of at most b+1 diagonals and

U

is given by lower <= j-i <= upper where lower is (n-m-b+1)/2 rounded downwards, upper is (n𝑏+1

. E.g., for control region sequences of lengths

max⁡(𝑚,𝑛)+1

A

of entries in the strip or table is at most

N

m+b-1)/2 rounded upwards, i is the row and j is the column index. Hence the ratio of the numbers

about 1,122 bp and a bound of 120 the strip is only 11% of the full table. The initial value for the

M

bound is 30, which has proven adequate for control region sequences. If the estimated number of symbol differences exceeds the bound minus 10, the bound is doubled and the execution

D

repeated. This procedure is continued until the bound is large enough. Avoiding that the most

TE

parsimonious transcripts with respect to symbol differences get too close to the border of the strip makes it implausible that transcripts with minimal event count or cost leave the strip. Increasing

EP

the tentative bound for the number of symbol differences until it becomes valid is limited to a maximum value of 1010 to avoid excessive computation times for unrealistic query sequences. If

CC

this value fails, the database string is excluded from the search. For comparison, block-indels known to SAM 2 have lengths up to 204 bp (EMPOP, R11) but are considered as single mutational events. For a valid bound the minimal event counts/costs are computed by the tabular

A

algorithm of dynamic programming generalized to block-indels and run-length changes. While the implementations of block-indels and run reductions are straightforward, run extensions are implemented by prepending symbol matches or substitutions to avoid that other insertions interfere at the end of the run. 14

In the third and last step, the best transcript is computed for all neighbours of the query string. For searches by cost all transcripts with minimal costs are generated by the trace back method of dynamic programming generalized to block-indels and run-length changes. As the costs of mutational events are quite different (at least in the control region), the transcript with minimal cost is usually unique. This is in contrast to the search by count where the transcripts with

SC RI PT

minimal event count are often abundant. For search by count we therefore generate only regular transcripts as described in [41].

In both search modes the best transcript is then selected by the following hierarchical criteria:

1. minimal event count for transcripts with minimal costs, or minimal cost for transcripts with minimal event counts, respectively,

3. minimal number of symbol changes,

A

M

5. insertions or deletions arranged 3'.

N

4. minimal number of insertion or deletion clusters,

U

2. minimal number of ignored events,

The working hypothesis of SAM 2 is that the optimal transcripts with respect to event count or

D

cost can be found within the strip, which has been determined by calculating the number of

TE

symbol differences. This is a heuristic that works well for real data but may theoretically yield a suboptimal transcript because event count or costs can be zero for ignored length changes or

EP

small for highly fluctuating events. This suboptimal transcript, however, would not have higher counts or costs than any most parsimonious transcript with respect to symbol differences. A sufficient and more concrete condition for this hypothesis is that in optimal transcripts the

A

CC

number of symbol indels including split block-indels or length changes respects the bound.

2.6

Evaluation

In order to test the updated algorithm the search results of SAM and SAM 2 were compared using 34,617 mitotypes from the EMPOP database (Release 11) as well as 14,943 Phylotree mitogenomes that were condensed to the control region (CR; 16024-576). 15

3. Results We here present software for alignment-immune mtDNA sequence matching (SAM 2) that is also capable of producing rCRS-coded alignments following the phylogenetic principle elaborated in [28] and generally accepted in the forensic community [29, 30]. Thus, it constitutes a significant improvement for forensically oriented mtDNA sequence queries and is made available to

SC RI PT

(forensic) users via the EMPOP database (https://empop.online) [20]. The modifications and new developments particularly address conservativeness of mtDNA query results to minimize bias in forensic casework applications. To comply with the biology of the mtDNA molecule and the relevance of rapidly mutating sequence regions (hotspots) to forensic mtDNA database searches, new block-indels were introduced and the list of possible length variant hotspots was extended. In the course of developing SAM 2 we discerned a total of 28 block-indels (Table 1) and evaluated

U

the effect of four additional length variants (Table 2) on database searching based on careful

N

analysis of published and self-produced mitogenome sequences. The impact of the additional length variants on database matches was significant (see below) and thus requires further

A

experiments to investigate whether or not these additional length variants differ between tissues

M

within an individual. Therefore, the investigated length variants will not be implemented in EMPOP v. 4.

D

The ability to ignore length variations in forensic mtDNA searches leads to more conservative

TE

probability estimates that are desired in forensic practice. SAM 2 counts block-indels and length changes at not-ignored length hotspots as a single difference to the rCRS. Hence, the search for

EP

neighbours by count, i.e. database profiles with at most two differences, is enhanced. The option to search for neighbours by cost provides an alternative view at closely related sequences that

CC

show more than two differences to the query mitotype, but may be more relevant to the case than

A

neighbours with fewer differences that belong to a different phylogenetic background.

3.1

Realignment in the EMPOP database

While harmonizing alignment via SAM 2 we were striving to comply with the notation rules suggested in [28]. For some mtDNA regions we had to extend this concept, as new variation

16

came to our attention that was not addressed earlier. These regions include 50-70, 310-316, 455460, 961-966, 8276-8279, 16180-16193, and 16258-16262.

Region 50-70:

SC RI PT

This region generally includes little phylogenetic signal with the exception of some mutational patterns that are indicative of haplogroups, e.g. 55.1T 59- 60- 65.1T 66T for haplogroup M39 or 58C 60.1T 64T within R0a2’3 [45]. After intensive phylogenetic analysis and collation of a wide variety of alignment practices we suggest harmonizing the alignment in region 50-70 for those lineages that do not display significant mutation patterns to the consistent use of 60.1T where applicable, resulting in the following realignments: 

57A 60.1T

56.1A 57G



57A 58G 60.1T

56.1G



57G 60.1T

56.1G 58C



57G 59C 60.1T

56.1C



57C 60.1T

56.1C 57G



57C 58G 60.1T

56.1C 58C



57G 59C 60.1T



58C 60.1T

N

A

M

D

TE

57.1C

U

56.1A

In the absence of other phylogenetic evidence we suggest this convention to enable consistent reporting of mutations in this region. Should future data reveal different signatures, the

CC

EP

convention will be updated on EMPOP and the changes documented in a curated file.

Region 310-316:

A

The most frequently observed nomenclature changes relative to the current life version of EMPOP (Vs. 3, May 2018, R11) is the new suggestion to call a T after T310 as 311T 315.1C instead of the current practice 310.1T, which is affecting 79 of a total of 33,691 mitotypes. This is in better agreement with the phylogenetic alignment rules [28] that state that ‘the long C tracts of HVS-1 and HVS-II should always be scored with 16189C and 310C, respectively, so that 17

phylogenetically subsequent interruptions by novel C to T changes are encoded by the corresponding transition’ (C tract conventions). Therefore, the notation of mitotypes with the earlier variant 310.1T changes, e.g., the HVS-II haplotype 73G 263G 310.1T 315.1C changed to 73G 263G 311T 315.1C 315.2C. Consequently, the haplotype 16182C 16183C 16189C 16217C 16519C 73G 143A 146C 189R 195C 263G 309.1C 309.2C 309.3T 309.4C 309.5C 315.1C 499A 523- 524- would be notated as 16182C 16183C 16189C 16217C 16519C 73G 143A 146C 189R 195C 263G 309.1C 309.2C 313T 315.1C 315.2C 315.3C 315.4C 499A 523- 524-. For the

SC RI PT

haplotype 16111A 16519C 199C 236C 263G 309.1T 309.2C 315.1C we suggest the following alternative 16111A 16519C 199C 236C 263G 312T 315.1C 315.2C 315.3C in accordance to the newly suggested convention. In contrast the most parsimonious choice between 310.1T and 311T would depend on the length of the C-stretch around 315 which is known to fluctuate within the

U

same haplogroup.

N

Region 456-460, 961-966 and 8276-8279:

Browsing mitotypes from the literature revealed differing practices aligning those regions.

A

Insertions and deletions were allocated between the rCRS junctions (see Table 2), e.g., we

M

rewrote:



455.1T 456T

961C 965.1C



960.1C 961C



8276.1C 8277C

TE

8277C 8278.1C

D

456T 456.1T

EP

This convention is also used in Phylotree 17, e.g. in the motif 573.XC 574C for haplogroup I5a4.

CC

Region 16180-16193 and 16258-16262 Also known as difficult to align is the region around position T16189, which was already

A

discussed in detail in [28]. We adhered to the rules proposed in [28] and used the same approach to harmonize other variants that appear in this region. E.g. for 16183del 16189C we suggest the following alignment 16183C 16189C 16193del and for 16193T we suggest the longer but phylogenetically more stable variant 16191.1C 16192T 16193del for haplogroups where 16192T 18

is a signature mutation (e.g. L2a1b, I5a4, R9b1a1, H61a; Phylotree 17). The same logic was applied to 16258del and 16262del, changing the alignment to include 16262del as stable variant.

Applying these conventions to the EMPOP mitotypes (R11; N=34,617) resulted in a total of 179 necessary alignment changes (0.52%, R11.2, Fig. 1). The detailed changes are listed in Table S3. We performed comparisons with other databases (total of 28,622 haplotypes) that resulted in

SC RI PT

similar values (between 0.8 and 1.5%) for alignment changes. Observed differences are based on

EP

TE

D

M

A

N

U

a slightly different haplogroup composition of mitotypes in these datasets (data not shown).

Fig. 1. Distribution of alignment changes in the EMPOP database after realignment with SAM 2

CC

(R11.2; N=34,617).

A

Further in this manuscript we refer to the current database version that is accessible online (https://empop.online; May 2018) as EMPOP R11 and to the new database version as EMPOP R11.2, consisting of the same mitotypes (N=34,617) but presented in the updated alignment/nomenclature detailed above. The term “standard search parameter” describes the

19

default query settings that are used in EMPOP when the user does not actively change them (i.e. find neighbours by count, pattern match type and disregard all 9 length variant hotspots).

3.2

Consistency checks for EMPOP queries

Implementing new alignment rules and conventions may affect database query output. In order to

SC RI PT

enable users to understand and follow these changes we performed extensive checks and provide details of search results that differ between the database/alignment versions. Evaluation of the new search algorithm and the realigned database was conducted by comparing matches and neighbours of SAM (EMPOP R11) with SAM 2 (EMPOP R11.2) by using standard search settings (find neighbours by count, pattern match type, disregarding indels in length variants at positions 16193, 309, 455, 463, 573, 960, 5899, 8276 and 8285) and the full database size

N

U

(N=34,617).

3.2.1 Database matches

A

This comparison led to full concordance of matches between SAM and SAM 2 with the

M

exception of four mitotypes (0.01%; NIC1200125, NIC1200145, NIC1200146, and ESP1200252)

D

that resulted in a slightly decreased number of database matches with SAM 2 (Table 3).

TE

Table 3

List of mitotypes that produced a different number of database matches in the consistency check

EP

between SAM and SAM 2.

Sequencing range

CC

Sample ID

A

NIC1200125

NIC1200145

60-340 16024-16365 60-340 16024-16365

Mitotype 73G 263G 309.1C 315.1C 16182C 16183C 16189C 16217C 263G 309.1C 315.1C

No. of matches in SAM 47

No. of matches in SAM 2 46

ID(s) of missing profiles USA1202301

976

971

GRC0500123 GRC0500319 SVN0600004 SVN0600041 20

SVN0600051 60-340 16024-16365

263G 309.1C 315.1C

ESP1200252

57-429 16020-16377

73G 150T 263G 309.1C 315.1C 16189C 16193.1C 16270T 16311C 16336A

976

971

2

1

GRC0500123 GRC0500319 SVN0600004 SVN0600041 SVN0600051 USA1203897

SC RI PT

NIC1200146

Different matches between SAM and SAM 2 were not based on the string search function itself but on the fact that the starting point of the search range excluded the inserted base in R11 (56.1 and 57.1, respectively) but included it in R11.2. All concerned mitotypes are shown below, with the first line representing the R11 mitotype and the second line representing the same mitotype as

U

aligned in R11.2. It is important to note that such a scenario cannot occur between queries within

N

the same or different releases of the database, except of a nomenclature change as observed

CC

M

315.1C 315.1C

60.1T

263G 263G

315.1C 315.1C

60.1T

263G 263G

309.1C 309.2C 315.1C 309.1C 309.2C 315.1C

60.1T

263G 263G

309.1C 309.2C 315.1C 309.1C 309.2C 315.1C

60.1T

263G 263G

309.1C 309.2C 315.1C 309.1C 309.2C 315.1C

57.1C

58C

#! 30-400 16024-16400 SVN0600004 56.1A SVN0600004

SVN0600041 SVN0600041

56.1A

SVN0600051 SVN0600051

56.1A

A

60.1T

263G 263G

D

58C

EP

GRC0500319 GRC0500319

57.1C

TE

Range 16024-576 GRC0500123 GRC0500123

A

between R11 and R11.2.:

57A

57A

57A

21

#! 16008-579 USA1203897 USA1203897

16182C 16182C 73G 73G

16183C 16183C 263G 263G

16189C 16189C 309.1C 309.1C

16217C 16217C 309.2C 309.2C

16519C 16519C 315.1C 315.1C

16189C 16189C 73G 73G

16270T 16270T 150T 150T

16311C 16311C 263G 263G

16336A 56.1A 16336A 57A 309.1C 315.1C 309.1C 315.1C

56.1C 57C 499A 499A

58C 59C

60.1T

SC RI PT

#! 16008-599 USA1202301 USA1202301

60.1T

Note that NIC1200125, NIC1200145, NIC1200146 and ESP1200252 (Table 3) show an unusual sequencing range start point (position 60). Typically, HVS-I/II-generated mitotypes start at

U

position 73 or around position 50. We strongly recommend performing HVS-II database searches

N

in EMPOP by using the conventional 73-340 range, as this would include the majority of

3.2.2

M

A

database entries available in EMPOP (33,691 in R11).

Database neighbours

D

The consistency check resulted in a total of 30,745 mitotypes (88.81%) with identical number of

TE

neighbours for SAM vs. SAM 2 (data not shown). However, 3,852 mitotypes (11.13%) resulted in an increased number of neighbours in SAM 2 while for 20 mitotypes (0.06%) the number of

EP

neighbours decreased.

For example, the query profile ARE0500038 73G 146C 195C 198T 263G 315.1C 524.1A 524.2C

CC

524.3A 524.4C 16224C 16311C 16519C belonging to haplogroup K1b2 showed an increase of +3 neighbours (HUN0600080, SWE0900518 and USA1103600) with the identical mitotype 73G 146C 195C 263G 315.1C 16224C 16311C 16519C that also belongs to haplogroup K1b2 [35].

A

This mitotype differs to ARE0500038 by the transition at position 198 (1 count) and by 2 AC insertions at 524 (1 count in SAM 2), whereas they differed by 5 counts in SAM (4 counts for 524+ACAC). Hence, the increase of neighbours is more conservative in the forensic context, especially when the neighbours belong to the same or a similar haplogroup. 22

We also observed a slightly decreased number of neighbours for 20 mitotypes (0.06%, Table S4) for three reasons: First, those mitotypes that produced increased matches (due to the realignment of the database) resulted in fewer neighbours. For example, mitotypes UZB0600870, UZB0600915 and UZB0601054 that share the CR mitotype 16129A 16182C 16183C 16189C 16217C 16261T 73G 263G 309.1C 309.2C 315.1C 523del 524del (haplogroup B4a) match the haplogroup B4a

SC RI PT

mitotype KOR0801184 16129A 16182C 16183C 16189C 16217C 16261T 73G 263G 307del 308del 309del 315.1C 523del 524del in SAM 2 R11.2, but do not match in SAM R11 with the earlier alignment 307T 310C 314del 315del.

Second, as mentioned above, the interpretation range starting point at position 60 in combination with our suggested 60.1T alignment in the phylogenetically instable region 50-70 caused a different number of neighbours in mitotypes CHN1000076, NIC1200087, NIC1200124, NIC1200128,

NIC1200131,

NIC1200132,

NIC1200136,

U

NIC1200127,

NIC1200137,

N

NIC1200138, NIC1200139, NIC1200140, NIC1200141, NIC1200145, NIC1200146 and

A

ESP1200424 using SAM 2 R11.2 (Table S4).

Third, in SAM 2 indels are only applied 3’ of the run of a length hotspot. For the query profile

M

HUN0600020 263G 309.1C 309.2C 309.3C 315.1C 16080G 16183C 16189C 16193.1C 16356C 16519C the database profile PRT1000001 263G 310C 313del 314del 315del 16183C 16189C

D

16193.1C 16356C 16519C exceeded the maximum value of 2 counted differences and was

TE

therefore not listed as neighbour.

Note that for NIC1200145 and NIC1200146 two mitotypes were not considered as neighbours in

CC

EP

SAM 2 but the total balance was +4 as 6 new neighbours were found in SAM 2 (Table S4).

Conclusively, the comparison of matches and neighbours between SAM and SAM 2 revealed 4 additional matches (0.01%) and 3,852 additional neighbours (11.13%) due to the realignment of

A

the database and the counting of the block indels or length changes as single variants. In 18 mitotypes (0.05%) fewer neighbours were observed as a consequence of the increased number of matches, uncommon interpretation starting points and 3’ alignment conventions. This demonstrates that the realigned EMPOP database (R11.2) is providing very similar probability 23

estimates and tends to a higher number of matches and neighbours using standard search parameters. The same strategy was applied to the condensed Phylotree 17 mitogenomes and we observed the same trend in this data set: 88.85% of the queried samples resulted in identical neighbours while 11.14% showed an increased number of neighbours and 0.01% a decreased number of neighbours

3.3

SC RI PT

(Table S6).

Effects of disregarding indels at additional length variant regions

Here, we compared the matches and neighbours of SAM R11 with SAM 2 R11.2 using the following search parameters (find neighbours by count, pattern match type, disregarding indels in length variants at positions 16193, 291, 309, 455, 463, 524, 573, 960, 5899, 8276, 8285, and

N

U

8289) and the full database size (N=34,617).

A

3.3.1 Database matches

M

A total of 29,520 mitotypes (85.28%) resulted in identical matches and 5,097 mitotypes (14.72%) showed an increased number of matches. We note that this experiment was expected to result in a

D

significantly higher number of matches and neighbours as new block-indels were introduced and the number of ignored length variant regions increased from 9 to 12. A total of 27 (0.08%)

TE

mitotypes gave more than 525 additional matches each, 2,462 (7.11%) mitotypes resulted in 3-

A

CC

(Table 4).

EP

140 additional matches each and 2,608 (8.11%) mitotypes yielded 2 or 1 additional matches each

24

Table 4 Distribution of additional matches when disregarding length variants at positions 291, 524 and 8289. No. of samples 27 0 2462 1124 1484 5097

SC RI PT

No. of additional matches >525 141-525 3-140 2 1 SUM

The 27 mitotypes with an increase of more than 525 additional matches concern 9 distinct mitotypes that cluster around the most common Westeurasian mtDNA sequence pattern 16519C

U

263G 315.1C (Table 5):

N

Table 5

variants at positions 291, 524 and 8289

A

List of mitotypes that produced more than 525 additional matches when disregarding length Haplogroup Frequency R0 11 R0 1 R0 3 R0 1 R0 3 R0 2 R0 4 R0 1 H10+(16093) 1 27

CC

EP

TE

D

M

Mitotype 16519C 263G 315.1C 523del 524del 16309R 16519C 263G 315.1C 524.1A 524.2C 16519C 263G 309.1C 315.1C 523del 524del 16519C 263G 309.1C 315.1C 524.1A 524.2C 573.1C 573.2C 16519C 263G 309.1C 309.2C 315.1C 523del 524del 16519C 263G 315.1C 524.1A 524.2C 16519C 263G 309.1C 315.1C 524.1A 524.2C 16519C 263G 309.1C 309.2C 315.1C 524.1A 524.2C 16093Y 16519C 263G 315.1C 523del 524del SUM

A

The consideration of the additional length variant regions led to a higher number of matches in about 15% of all queried mitotypes. For a few common mitotypes (0.08%) we observed extreme increases of matches that were mostly attributed to the newly introduced option for ignoring length changes in the AC-run around position 524. E.g. the SAM query of the CR mitotype 16519C 263G 315.1C 523del 524del resulted in 23 matches in 26,127 EMPOP CR entries, while 25

the same mitotype yielded 549 matches in SAM 2. It is yet unclear - and further studies are currently pending - whether the additional potential length variants (291, 524, 8289) are stable between tissues of an individual [46-48]. If they are not, then we suggest adding those to the list of ignored length hotspots in order to achieve conservative database query results. For the time being, SAM 2 uses only those 9 length hotspots that are currently considered in SAM.

SC RI PT

3.3.2 Database neighbours

We also investigated the impact of the new block-indels and additional length variants with respect to the number of observed neighbours when comparing SAM vs. SAM 2 and found that for 17,605 (50.86%) mitotypes the number of neighbours was equal. In 11,915 (34.42%) mitotypes the number of neighbours was increased and for the 5,097 (14.72%) haplotypes that were now considered as a match (see 3.3.1) the number of neighbours decreased. The maximum

U

total increase of 1,606 additional neighbours was observed for the CR mitotype DEU120661

N

263G 291.1A 315.1C 460C 523DEL 524DEL 16519C with length changes at hotspots 291 and

3.4

Searching neighbours by costs

M

A

524.

D

In SAM 2 neighbour searches can be based on count or cost where database profiles with up to

TE

two differences (count) or costs lower than 5.34 are considered as neighbours. This value was chosen after initial tests and may be adapted in future.

EP

The CR query profile 73G 146C 152C 195C 263G 309.1C 315.1C 16126C 16163G 16186T 16189C 16294T 16360T resulted in 0 matches / 5 neighbours using ‘count’ mode and 0 matches /

CC

115 neighbours under ‘cost’ mode. Table 6 lists the five concordant profiles that were found in both query modes and Table S5 lists the additional neighbours that were found only under cost

A

mode.

26

Table 6 Common neighbours found by both neighbour search modes. ‘Count’ shows the actual number of differences, ‘mutations’ shows the specific differences with their costs in brackets, ‘total costs’ shows the sum of the individual costs, ‘ignored mutations’ indicates mutations that were not considered for the search and hence have costs of 0.00 and ‘number’ shows the total number of occurrences. Total costs 1.14 1.14 1.33

Ignored mutations 522insAC (0.00) -

Number 3 1 1 5

SC RI PT

Count Mutations 2 C16360T (0.84) C16519T (0.30) 2 C16360T (0.84) C16519T (0.30) 2 C16360T (0.84) T146C (0.49) SUM

Offering neighbour searches based on costs in addition to the conventional count method was

U

motivated by observations that multiple differences between sequences can display lower costs

N

compared to fewer differences. This is reflected by the low costs of mutations that are highly fluctuating in the human phylogeny, which is consistent with earlier findings (Table 4 in [43]).

A

Therefore, the option of searching neighbours by costs can lead to useful results in cases where 0

M

matches are observed and the breadth of the known phylogenetic neighbours is to be evaluated.

3.5

TE

forensic search scenarios.

D

We note here, however, that it is current practice to use only the number of matches for standard

Inspection of matches and mismatches

EP

Here, we are drawing particular attention to mitotypes that produced 0 matches in EMPOP R11 with SAM settings and resulted in more than 0 matches in EMPOP R11 with extended SAM 2

CC

settings as this could have a pronounced effect on probability estimates for the respective profiles.

A

We used 14,943 mitogenomes from Phylotree (condensed to the CR) as query profiles and extracted all non-matching profiles (10,913). Subsequently, we used the 10,913 non-matching profiles as input for SAM 2.

27

First, we used the standard search parameters and observed 100 % concordance for the nonmatches. Then, we included length variants at positions 291, 524 and 8289 and observed that 362 profiles now resulted in a database match. The distribution of these additional matches is shown in Figure 2:

20

40

60

80

100

120

140

160

NR. OF AFFECTED PROFILES (N=362)

EP

0

TE

D

M

A

N

U

SC RI PT

+145 matches +88 matches +82 matches +69 matches +64 matches +48 matches +47 matches +39 matches +38 matches +37 matches +36 matches +28 matches +25 matches +24 matches +23 matches +22 matches +21 matches +20 matches +19 matches +18 matches +17 matches +16 matches +14 matches +13 matches +12 matches +11 matches +10 matches +9 matches +8 matches +7 matches +6 matches +5 matches +4 matches +3 matches +2 matches +1 matches

Fig. 2. Overview of match increase observed for the Phylotree 17 profiles in EMPOP R11. The y-

CC

axis shows the actual match increase and the x-axis shows the number of affected profiles. The highest increase of +145 matches was observed for profile JQ705107.1 with the CR

A

haplotype 16519C 152C 263G 315.1C 524.1A 524.2C 524.3A 524.4C, belonging to haplogroup H1bb. These results are consistent with the results obtained in 3.3, where the reason for the match increase could be attributed to the newly introduced length variants.

28

3.6

Pitfalls with unstable ends – from the users’ perspective

Some scenarios may require the search of sequence regions that are shorter than the CR and therefore the EMPOP database can also be queried with fragments encompassing only a small number of nucleotides. However, in this case it is important to choose a sequencing range with stable ends as otherwise, potential database profiles might be mistakenly excluded as illustrated in Table 7, where the query pairs 1a and 1b, 2a and 2b, 3a and 3b are identical sequences in their

SC RI PT

joint ranges. Table 7

The effect of unstable ends in the sequencing range in SAM 2; * augmented counting (+1). Comparison Table

U

Range 50-60 50-61 16180-16190 16183-16193 455 455-460

N

Query 56.1A 57A 60.1T 16183del 16183C 16188T 16189C 16193del 455C 455del 459.1C

M

A

ID Query 1a Query 1b Query 2a Query 2b Query 3a Query 3b

SAM 2/R11.2 Matches Frequency 0/29,899 3.3445e-5* 5/29,899 1.6723e-4 1/34,617 2.8888e-5 79/34,617 2.2821e-3 2/27,753 7.2064e-5 34/27,753 1.2251e-3

SAM 2 found 0 matches for query 1a because the EMPOP database mitotype is aligned as shown

D

in query 1b. The T insertion after position 60 is not included in the queried range 50-60, which is

TE

why the matches are lost (Fig. S1). Extension of the query range by one position, to 50-61 allows SAM 2 to find the 5 expected matches. The same holds true for other positions and examples

EP

(Table 7).

We note that query 1 – 3 were artificial profiles rarely encountered under practical circumstances and shown here to demonstrate the relevance of the selection of the query range in case of region-

CC

specific mtDNA analysis. We also note that in forensic practice the full mtDNA control region

A

range (16024-576) is recommended [30].

In summary, the latter experiments that did not consider variation at length variants 291, 524 and 8289 as exclusive criterion led to substantial increase of database matches and neighbours. Although our findings show that this increase is phylogenetically reasonable, further research is 29

needed in order to understand the stability of these regions in forensically relevant human tissues. For the time being we stick to the established length variants with SAM 2 that were also in use in SAM. However, should pending studies suggest that the additional length variants are instable the search settings can be adjusted accordingly.

SC RI PT

4. Discussion MtDNA evidence is used in law enforcement and for human identification purposes on a regular basis. Both applications require databases of mitotypes representing national, geographic, political or other identities and most often derive from mtDNA population studies of randomly drawn individuals. The mtDNA control region or portions thereof is known as one of the most powerful single forensic genetic markers [1], because the number of singletons in a random population is typically high. As a consequence, mtDNA database searches often result in no

U

matches except when (relatively) common variants are considered. Errors in mtDNA mitotypes

N

thus bias search results or mislead identification cases, which is why the forensic community has

A

been striving for stringent quality control in establishing mtDNA population datasets [8, 16, 26,

M

30, 49].

The alignment problem - current status

D

4.1

TE

In forensics, it has been recognized that different alignments of the same mtDNA sequence can lead to biased query results in rCRS-coded databases searches, most often leading to

EP

overestimation of the evidence [36]. Therefore, we implemented an alignment-immune database search software (SAM) in EMPOP in 2010 that turns query and database sequences into

CC

nucleotide sequence strings in FASTA-like format [36], which guarantees that a match is found in the database regardless of the user alignment. This solved the problem of biased probability estimates due to alignment differences at the user side, as users are since then free to query any

A

(correct) alignment version that would result in the same number of matches and neighbours as any other alternative (correct) alignment. The introduction of SAM was important for harmonizing database query output, as a user could not have known the alignment of the database mitotypes and thus was confronted with the risk of using an alternative alignment that would 30

have resulted in a decreased number of matches otherwise. With SAM 2 we offer software that builds upon SAM’s search functions and provides a phylogenetic alignment estimate that could be used to harmonize mtDNA alignment within the forensic community and also across other disciplines that investigate mtDNA. This is needed as mtDNA variation is still being reported differently between laboratories and in the literature. The reasons are manifold: First, some laboratories use ambiguous nomenclature, e.g. 249D with the intention to describe a deletion at 249, while D is actually reserved in the IUPAC code for the (heteroplasmic) mixture of A, G and

SC RI PT

T. Second, some groups prefer to stick with their traditional alignment/nomenclature schemes in the absence of a harmonized and stable system. As a result, taken at face value, these datasets are not directly comparable, as the individual mutation patterns differ systematically between datasets. Third, we regularly observe inconsistent alignments between established datasets and even within individual publications of the same groups, some of which are caused by individual manual alignment corrections trying to accommodate one or the other alignment

U

recommendation. This makes mtDNA datasets even incompatible within themselves.

N

Phylogenetically correct manual alignment has proven difficult if not impossible, as this would

A

require cognizance of the entire mitochondrial phylogeny as well as knowledge on the different mutation rates of the positions involved. Fourth, molecular genetic disciplines have their

M

established traditions that can lead to significantly different mitotypes, e.g. the forensic genetic field as well as the majority of population genetic and anthropological/archaeological studies use

D

a 3’ alignment for insertions and deletions, while the medical genetic field reports indels in 5’

TE

alignment. Thus, the forensically known insertions 309.1C and 315.1C are termed 302.1C and 310.1C in medical genetics, which may seem comprehensible for an mtDNA expert but is more

EP

difficult to identify as being the same sequence for non-professionals and general matching software. For these reasons we have developed the updated SAM 2 algorithm that is freely accessible and that produces rCRS-coded phylogenetic alignments from FASTA-like sequence

A

CC

strings.

4.2

Block-indels and length-variant regions with regards to database searches

With the growing body of data and with the development of full mitogenome sequences with the emergence of Massively Parallel Sequencing technologies [50, 51] we here extend the list of (highly) variable mtDNA sequence regions in the control region and for the coding region. Based 31

on careful mitogenome analyses we were defining a total of 28 block-indels consisting of 2-264 nucleotides that are inherited together (Table 1) and thus count as single difference when comparing mitotypes. We further discern a total of 13 length variant regions (Table 2) nine of which were already used in SAM (EMPOP 3). The additional 4 length variants include indels at 291, 315 , 524 and 8289. Length variant regions (in addition to some positional hotspots) were also disregarded in Phylotree: ‘… the mutations 309.1C(C), 315.1C, AC indels at 515-522, 16182C, 16183C, 16193.1C(C) and 16519 were not considered for phylogenetic reconstruction

SC RI PT

and are therefore excluded from the tree’ [45]. We determined and use fluctuation rates for all sites and regions in SAM 2 including length variants and positional hotspots, however, length variants that are somatically instable can and should be ignored when comparing mtDNA sequences in the forensic context. Otherwise identical mtDNA sequences that only differ by the number of C-residues e.g. in the C-run between positions 302 and 310 are considered as coming from the same/closely maternally related individual(s) according the current interpretation

U

guidelines [30]. With SAM 2 this rationale can be applied to all of the 13 defined length variant

N

regions in order to provide conservative probability estimates for database queries. The number

A

of EMPOP matches (and neighbours) may differ significantly for some mitotypes when additional length variants are excluded from a database search (Table 4, Table 5 and Fig. 1),

M

particularly with respect to differences in the AC-repeat around 524. The interpretation of AC indels in region 515-524 is still not clear, as not many data have been established to understand

D

the somatic stability of this repeat in different (forensically relevant) tissues. Earlier work

TE

suggests that the interpretation of length heteroplasmy in the AC repeat varies with amplification and detection strategies [46-48]. However, further studies are required to clarify this issue, which

EP

is why the additional length variants at positions 291, 524 and 8289 will not be available in the

A

CC

online version of EMPOP until these issues are clarified.

32

5. Conclusions We developed and present SAM 2, an updated and optimized software based on tests with carefully curated full mitogenome sequences to perform unbiased and conservative EMPOP database queries to assist statistical evaluation of the evidence in forensic practice. The changes to the earlier version of the software include: i.

updated alignment/nomenclature conventions for the phylogenetically instable regions 50-

SC RI PT

70, 310-316, 455-460, 961-966, 8276-8279, 16180-16193, and 16258-16262 ii.

‘count’ and ‘cost’ search modes for neighbours

iii.

implementation of 28 block indels containing between 2 and 264 base-pairs

iv.

evaluation of additional length variant regions at positions 291, 315, 524 and 8289 (that

U

are however not yet implemented at EMPOP until pending studies suggest so)

i) The updated alignment conventions form the basis for a standardized mtDNA nomenclature

A

research. This work is currently underway.

N

system that is so far missing and would make mtDNA data comparable across different fields of

M

ii) The advantage of a cost-based system is that neighbours are sorted and displayed by the evolutionary stability (fluctuation rate) of the differing variant(s) and not only by the mere count

D

distance. This option takes the mutation rate into consideration and copes with positional

TE

hotspots. So far, this has not been considered in forensic practice but could serve as alternative to the currently applied, strictly formal interpretation of neighbours by event count distance and thus

EP

may lead to a more science-based interpretation of neighbours in the future. iii) Block indels consist of multiple basepairs that are inserted/deleted together and thus count as

CC

single mutational event in forensic practice. This function is now enabled in EMPOP to comply with the biology of the molecule.

A

iv) After careful mitogenome sequence analyses we identified a total of 13 length variant regions, 9 of which have already been considered in EMPOP 3. Length variant regions are known to be highly mutable and may differ between tissues of an individual. This is why they should be disregarded when comparing mtDNA sequences in the forensic context. The effect of disregarding the four new length variant regions 291, 315, 524 and 8289 (that are however not 33

yet implemented in EMPOP unless pending studies suggest so) is demonstrated with respect to

A

CC

EP

TE

D

M

A

N

U

SC RI PT

database matches and neighbours.

34

Acknowledgements The authors would like to thank Gregor Kofler, Stefan Troger and Martin Pircher for IT work related to the implementation of SAM 2 and the development of EMPOP 4. We greatly acknowledge the work and comments of international colleagues that tested the beta and final versions of EMPOP 4 SAM 2 R11.2, particularly Lara Adams, Laura Catelli, Constance Fisher, Mario Gysi, Douglas Hares, Jodi Irwin, Rebecca Just, Hwan-Young Lee, Sabine Lutz-Bonengel,

SC RI PT

Charla Marshall, Dixie Peters, Dirk Sauer, John Tonkyn. Special thanks go to Kimberly Andreaggi for testing the EMPOP 4 beta version and providing valuable comments to an earlier version of the manuscript.

We thank current and former colleagues at the Institute of Legal Medicine, Medical University of Innsbruck for mtDNA sequence data interpretation, particularly Christiane Bauer, Cordula Berger, Martin Bodner, Mayra Eduardoff, Liane Fendt, Theresa Harm, Antonia Heidegger,

U

Gabriela Huber, Anita Kloss-Brandstätter, Anna König, Simone Nagl, Harald Niederstätter,

N

Daniela Niederwieser, Jannika Oeke, Alexander Röck, Lisa Schnaller, Filipa Simao, Christina Strobl, Catarina Xavier, and Bettina Zimmermann. We would also like to thank many

A

international colleagues who contributed to the EMPOP project with population data and/or

M

useful comments over the past years (listed under ‘contribute section’ in EMPOP).

EP

Conflict of Interest

TE

- ISFP-2016-AG-IBA-ENFSI.

D

This work received support from the European Union, grant agreement number 779485 - STEFA

A

CC

The authors declare no conflict of interest.

35

Holland, M.M. and T.J. Parsons, Mitochondrial DNA Sequence Analysis-Validation and Use for Forensic Casework. Forensic science review, 1999. 11(1): p. 21-50.

2.

King, T.E., et al., Identification of the remains of King Richard III. Nat Commun, 2014. 5: p. 5631.

3.

Cann, R.L., M. Stoneking, and A.C. Wilson, Mitochondrial DNA and human evolution. Nature, 1987. 325(6099): p. 31-6.

4.

Bodner, M., et al., Rapid coastal spread of First Americans: novel insights from South America's Southern Cone mitochondrial genomes. Genome Res, 2012. 22(5): p. 811-20.

5.

Krishnan, K.J., et al., What causes mitochondrial DNA deletions in human cells? Nat Genet, 2008. 40(3): p. 275-9.

6.

Giles, R.E., et al., Maternal inheritance of human mitochondrial DNA. Proc Natl Acad Sci U S A, 1980. 77(11): p. 6715-9.

7.

Smith, D.R., The past, present and future of mitochondrial genomics: have we sequenced enough mtDNAs? Briefings in functional genomics, 2015. 15(1): p. 47-54.

8.

Parson, W., et al., The EDNAP mitochondrial DNA population database (EMPOP) collaborative exercises: organisation, results and perspectives. Forensic Sci Int, 2004. 139(2-3): p. 215-26.

9.

Bandelt, H.J., et al., Detecting errors in mtDNA data by phylogenetic analysis. Int J Legal Med, 2001. 115(2): p. 64-9.

10.

Bandelt, H.J., et al., 'Distorted' mitochondrial DNA sequences in schizophrenic patients. Eur J Hum Genet, 2007. 15(4): p. 400-2; author reply 402-4.

11.

Bandelt, H.J., et al., The fingerprint of phantom mutations in mitochondrial DNA data. Am J Hum Genet, 2002. 71(5): p. 1150-60.

12.

Bandelt, H.J. and A. Salas, Contamination and sample mix-up can best explain some patterns of mtDNA instabilities in buccal cells and oral squamous cell carcinoma. BMC Cancer, 2009. 9: p. 113.

13.

Bandelt, H.J., et al., High penetrance of sequencing errors and interpretative shortcomings in mtDNA sequence analysis of LHON patients. Biochem Biophys Res Commun, 2007. 352(2): p. 283-91.

14.

Brandstatter, A., et al., Phantom mutation hotspots in human mitochondrial DNA. Electrophoresis, 2005. 26(18): p. 3414-29.

15.

Kong, Q.P., et al., Distilling artificial recombinants from large sets of complete mtDNA genomes. PLoS One, 2008. 3(8): p. e3016.

16.

Parson, W. and H.J. Bandelt, Extended guidelines for mtDNA typing of population data in forensic science. Forensic Sci Int Genet, 2007. 1(1): p. 13-9.

17.

Salas, A., et al., A practical guide to mitochondrial DNA error prevention in clinical, forensic, and population genetics. Biochemical and Biophysical Research Communications, 2005. 335(3): p. 891-899.

CC

EP

TE

D

M

A

N

U

SC RI PT

1.

A

References

36

Yao, Y.G., C.M. Bravi, and H.J. Bandelt, A call for mtDNA data quality control in forensic science. Forensic Sci Int, 2004. 141(1): p. 1-6.

19.

Yao, Y.G., et al., mtDNA data mining in GenBank needs surveying. Am J Hum Genet, 2009. 85(6): p. 929-33; author reply 933.

20.

Parson, W. and A. Dür, EMPOP—a forensic mtDNA database. Forensic Science International: Genetics, 2007. 1(2): p. 88-92.

21.

Zimmermann, B., et al., Application of a west Eurasian-specific filter for quasi-median network analysis: Sharpening the blade for mtDNA error detection. Forensic Sci Int Genet, 2011. 5(2): p. 133-7.

22.

Zimmermann, B., et al., Improved visibility of character conflicts in quasi-median networks with the EMPOP NETWORK software. Croat Med J, 2014. 55(2): p. 115-20.

23.

Gusmao, L., et al., Revised guidelines for the publication of genetic population data. Forensic Sci Int Genet, 2017. 30: p. 160-163.

24.

Parson, W. and L. Roewer, Publication of population data of linearly inherited DNA markers in the International Journal of Legal Medicine. Int J Legal Med, 2010. 124(5): p. 505-9.

25.

Andrews, R.M., et al., Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet, 1999. 23(2): p. 147.

26.

Bär, W., et al., DNA Commission of the International Society for Forensic Genetics: guidelines for mitochondrial DNA typing. International journal of legal medicine, 2000. 113(4): p. 193-196.

27.

Wilson, M.R., et al., Recommendations for consistent treatment of length variants in the human mitochondrial DNA control region. Forensic Sci Int, 2002. 129(1): p. 35-42.

28.

Bandelt, H.J. and W. Parson, Consistent treatment of length variants in the human mtDNA control region: a reappraisal. Int J Legal Med, 2008. 122(1): p. 11-21.

29.

SWGDAM, Interpretation Guidelines for Mitochondrial DNA Analysis by Forensic DNA Testing Laboratories. 2013.

30.

Parson, W., et al., DNA Commission of the International Society for Forensic Genetics: revised and extended guidelines for mitochondrial DNA typing. Forensic Sci Int Genet, 2014. 13: p. 134-42.

31.

Ewens, W.W.J. and G.R. Grant, Statistical Methods in Bioinformatics: An Introduction. Statistics for Biology and Health. 2001: Springer.

CC

EP

TE

D

M

A

N

U

SC RI PT

18.

Bandelt, H.J., M. van Oven, and A. Salas, Haplogrouping mitochondrial DNA sequences in Legal Medicine/Forensic Genetics. Int J Legal Med, 2012. 126(6): p. 901-16.

33.

Naue, J., et al., Evidence for frequent and tissue-specific sequence heteroplasmy in human mitochondrial DNA. Mitochondrion, 2015. 20: p. 82-94.

34.

Tully, G., et al., Results of a collaborative study of the EDNAP group regarding mitochondrial DNA heteroplasmy and segregation in hair shafts. Forensic science international, 2004. 140(1): p. 1-11.

A

32.

37

van Oven, M., PhyloTree Build 17: Growing the human mitochondrial DNA tree. Forensic Science International: Genetics Supplement Series, 2015. 5: p. e392-e394.

36.

Röck, A., et al., SAM: String-based sequence search algorithm for mitochondrial DNA database queries. Forensic Sci Int Genet, 2011. 5(2): p. 126-32.

37.

Benson, D.A., et al., GenBank: update. Nucleic Acids Res, 2004. 32(Database issue): p. D23-6.

38.

Just, R.S., et al., Complete mitochondrial genome sequences for 265 African American and U.S. H " ispanic"individuals. Forensic Sci Int Genet, 2008. 2(3): p. e45-8.

39.

Just, R.S., et al., Full mtGenome reference data: development and characterization of 588 forensic-quality haplotypes representing three US populations. Forensic Science International: Genetics, 2015. 14: p. 141-155.

40.

King, J.L., et al., High-quality and high-throughput massively parallel sequencing of the human mitochondrial genome using the Illumina MiSeq. Forensic Science International: Genetics, 2014. 12: p. 128-135.

41.

Röck, A.W., et al., Concept for estimating mitochondrial DNA haplogroups using a maximum likelihood approach (EMMA). Forensic Science International: Genetics, 2013. 7(6): p. 601-609.

42.

Soares, P., et al., Correcting for purifying selection: an improved human mitochondrial molecular clock. Am J Hum Genet, 2009. 84(6): p. 740-59.

43.

Bandelt, H.-J., et al., Estimation of Mutation Rates and Coalescence Times: Some Caveats, in Human Mitochondrial DNA and the Evolution of Homo sapiens. 2006, Springer-Verlag Berlin Heidelberg. p. 47-90.

44.

Ukkonen, E., Algorithms for approximate string matching. Information and control, 1985. 64(1-3): p. 100-118.

45.

van Oven, M. and M. Kayser, Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum Mutat, 2009. 30(2): p. E386-94.

46.

Berger, C., et al., Evaluating sequence-derived mtDNA length heteroplasmy by amplicon size analysis. Forensic Sci Int Genet, 2011. 5(2): p. 142-5.

47.

Chung, U., et al., Mitochondrial DNA CA dinucleotide repeats in Koreans: the presence of length heteroplasmy. Int J Legal Med, 2005. 119(1): p. 50-3.

48.

Bhatti, S., et al., Problems in Mitochondrial DNA forensics: while interpreting length heteroplasmy conundrum of various Sindhi and Baluchi ethnic groups of Pakistan. Mitochondrial DNA A DNA Mapp Seq Anal, 2017: p. 1-10.

CC

EP

TE

D

M

A

N

U

SC RI PT

35.

A

49.

50.

Irwin, J.A., et al., Development and expansion of high-quality control region databases to improve forensic mtDNA evidence interpretation. Forensic Sci Int Genet, 2007. 1(2): p. 154-7. Parson, W., et al., Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM). Forensic Science International: Genetics, 2013. 7(5): p. 543-549.

38

Strobl, C., et al., Evaluation of the precision ID whole MtDNA genome panel for forensic analyses. Forensic Sci Int Genet, 2018. 35: p. 21-25.

A

CC

EP

TE

D

M

A

N

U

SC RI PT

51.

39