Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing

Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing

Journal Pre-proofs Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing María del Mar González, Amanda Ramos, Mari...

616KB Sizes 0 Downloads 86 Views

Journal Pre-proofs Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing María del Mar González, Amanda Ramos, Maria Pilar Aluja, Cristina Santos PII: DOI: Reference:

S1567-7249(19)30189-8 https://doi.org/10.1016/j.mito.2019.10.006 MITOCH 1418

To appear in:

Mitochondrion

Received Date: Accepted Date:

16 July 2019 10 October 2019

Please cite this article as: del Mar González, M., Ramos, A., Aluja, M.P., Santos, C., Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing, Mitochondrion (2019), doi: https://doi.org/ 10.1016/j.mito.2019.10.006

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Elsevier B.V. and Mitochondria Research Society. All rights reserved.

Sensitivity of mitochondrial DNA heteroplasmy detection using Next Generation Sequencing María del Mar González1, Amanda Ramos1,2,3,4, Maria Pilar Aluja1, Cristina Santos1* 1. Unitat d’Antropologia Biològica, Departament de Biologia Animal, Biologia Vegetal i Ecologia, Universitat Autònoma de Barcelona, 08193 Cerdanyola del Vallès, Barcelona. Spain. GREAB – Research Group in Biological Anthropology. 2. Faculdade de Ciências e Tecnologia, Universidade dos Açores, Ponta Delgada, Portugal. 3. Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Portugal. 4. IBMC– Instituto de Biologia Molecular e Celular, Universidade do Porto, Porto, Portugal.

*Corresponding Author: Cristina Santos Unitat Antropologia Biològica Dep. Biologia Animal, Biologia Vegetal i Ecologia Facultat Biociències Edifici C Universitat Autònoma de Barcelona 08193 Cerdanyola del Vallès, BARCELONA (SPAIN) [email protected]

Abstract Although the use of Next Generation Sequencing (NGS) in mitochondrial DNA (mtDNA) studies related to forensic and human genetics has contributed to the report of heteroplasmy at very low levels (lower than 1% and even 0.5%), their detection is not a straightforward process. Our purpose is to establish mitochondrial heteroplasmy detection limits, generating mixed bases at low frequencies by the PCR co-amplification of mtDNA and a nuclear insertion of mitochondrial origin (NUMT). A primer set that presents a perfect annealing with both mitochondrial and nuclear DNA was used to amplify the mitochondrial DNA region located between positions 6739 and 8910 and the corresponding region located inside a NUMT located in chromosome 1 (GRCh38.p12 Primary Assembly 631910-634079) that presents an identity of 98% with the corresponding region of mtDNA in two samples. Independent Nextera XT® (Illumina) NGS libraries were generated for each sample and sequenced in different MiSeq (Illumina) runs. Non-identical and identical positions between individual mtDNA and NUMT sequences were detected, and heteroplasmy detections limits were established: a) with a minor allele frequency <1.5%, false positive and negative can occur, even with a depth higher than 3000X; b) with a minor allele frequency >3%, no presence of false positive and negative were detected with a depth of ~1000X; and c) with a minor allele frequency between 1.5% and 3%, a minimal depth of 3000X was necessary to avoid false positives or negatives detection. Our results demonstrate an alternative strategy to establish a reliable limit of mitochondrial heteroplasmy detection.

Keywords: mitochondrial DNA (mtDNA); heteroplasmy; Nuclear Insertion of Mitochondrial Origin (NUMT); Next Generation Sequencing (NGS).

1. Introduction The presence of several mitochondrial DNA (mtDNA) molecules in each cell enables the occurrence of new mutations that introduce variability between mtDNA copies. Therefore, it could co-exist different mitochondrial genotypes within a cell, tissue or individual, a phenomenon known as heteroplasmy [1, 2]. For more than a decade, many studies related to forensic and human genetics have described mitochondrial heteroplasmy using Sanger sequencing [3-8]. However, this method is currently hampered by speed, efficiency, resolution and costs limitations, as well as restrictions to carry out multiple samples analysis in parallel [9-11]. Due to the appearance of Next Generation Sequencing (NGS), several studies have been focused on the mitochondrial heteroplasmy study in different fields as medical genetics (e.g. cancer) [12], molecular anthropology [13, 14], and forensics [15]. Moreover, the sensitivity of mitochondrial heteroplasmy detection has been increased, detecting variants with an imperceptible frequency by traditional sequencing, lower than 1% and even 0.5% [16, 17]. Nevertheless, the detection of low-level variants is not a straightforward process even using NGS. In NGS platforms, reads have more random and systematic errors due to artefacts and biases after alignment and after or during assembling [18]. These errors could affect the discrimination capacity between a reliable heteroplasmy at very low level and noise [19]. In fact, some authors still consider that accuracy and read length in Sanger sequencing is greater than NGS [11]. One important issue that must be accounted in the heteroplasmy detection and interpretation is the presence of nuclear insertions of mitochondrial origin (NUMTs). NUMTs represent nuclear DNA (nDNA) sequences with a high degree of identity with

mtDNA that, depending on the technical strategy used to analyse samples, could lead to false heteroplasmy detection. Studies that analysed mtDNA by Sanger sequencing gave different results: Goios et al. [20] reported that NUMTs contamination depends on the features of the sample and the type of tissue, and do not consider the existence of a significative risk contamination by NUMTs co-amplification using standard laboratory techniques; by contrast, others consider the possibility of contamination [21]. In NGS, however, nDNA co-amplification level that would remain imperceptible with traditional sequencing can be easily detected. In this sense, some studies consider that mtDNA heteroplasmic variants observed using a <2% detection level could be false positives due to NUMTs presence [22-25]. Currently, several studies based on Illumina NGS technology have established their own heteroplasmy detection limit [2, 9, 16, 17, 26-29]. All of them agree that lowvariants detection is consistent depending on the depth obtained. In this sense, different heteroplasmy detection limits were reported depending on depth: range from 10% (76X) [27] to ~2% (between 500X and 1170X) [26, 28], ~1.5% (16700X) [2] and 1% (35000X) [16]. However, other studies are able to detect very low-variants with a lower depth, ranging from 1% (1785X) [29] to 0.5-1% (between 5000X and 8000X) [9], and 0.5% (3458X) [17]. Therefore, the establishment of a heteroplasmy detection limit may be very important to ensure reliable heteroplasmy detection, particularly those detected at very low levels. Moreover, the establishment of heteroplasmy detection limit can take advantage of the NUMT co-amplification, a strategy that is not yet reported up to date. For this reason, our aims are: i) to provide a possible alternative strategy to establish a heteroplasmy detection limit, generating mixed bases at low frequencies by mtDNA and NUMTs co-

amplification; and ii) to propose limits of detection for mtDNA heteroplasmy that guaranty a maximum sensibility avoiding the presence of false negatives and positives.

2. Material and Methods Subjects’ data Blood samples from two Spanish maternally unrelated individuals were used (S1 and S2). Appropriate informed consent was obtained under strict confidentiality. The present study and the written informed consent were approved by the ethics committee of the Specialized Attention Board at the Healthcare Complex of Zamora and authorized by its Medical Director [30]. DNA was extracted using JETQUICK Blood DNA Spin Kit (Genomed) according to the manufacturer’s specifications. MtDNA amplification and Next-Generation Sequencing A specific region presents in both mtDNA and nDNA was selected to generate mixed bases at low frequencies by mtDNA and NUMTs co-amplification. This region is located in the mtDNA between positions 6739 and 8910 and the corresponding nuclear region is in the chromosome 1 (GRCh38.p12 Primary Assembly 631910-634079) [31]. The NUMT region has a 98% of identity with the corresponding mtDNA region. Primers

(6511F:

CTGCTGGCATCACTATACTACTA;

9220R:

GATTGGTGGGTCATTATGTGTTG) [32] that present a perfect annealing with both mtDNA and nDNA were used for the PCR amplification of the selected region. PCR conditions were previously described by Ramos et al. [32]. PCR products were purified using NucleoSpin Gel and PCR Clean-up kit (Macherey-Nagel) and quantified using Qubit® 2.0 fluorometer and Qubit dsDNA HS Assay kit.

Different NGS libraries and sequencing runs were carried out. For samples S1 and S2, respectively 2 and 4 independent Nextera XT® (Illumina) NGS libraries were generated according to manufacturer’s recommendations. Samples were sequenced with other samples (combining in each run the entire mtDNA sequencing of 24 samples) in different 2x250 MiSeq (Illumina) runs. The MiSeq (Illumina) sequencer from the Servei de

Genòmica

i

Bioinformàtica

at

Universitat

Autònoma

de

Barcelona

(http://sct.uab.cat/genomica-bioinformatica/es) was used. NGS Data Analysis Data from each run were analysed using the mtDNA NGS analysis workflow from mtDNA-Server [33]. This server focuses on reliable heteroplasmy detection and quantification and identifies the possible presence of contamination due to NUMT coamplification. The server is based on the following steps: 1) raw data are analysed in FASTQ or BAM format; 2) reads are aligned with the sequence aligner BWA-MEM v.0.7.5; and finally, 3) BAM files obtained from aligned reads are analysed in order to detect mtDNA variants. To detect both homoplasmic and heteroplasmic positions (represented as mixed bases) and avoid misleading results, the mtDNA-Server applies several metrics and filters to eliminate poor reads. Furthermore, the mtDNA-Server applied further criteria for heteroplasmy detection in order to report only reliable mixed bases positions (for the detailed analysis workflow see material and methods section from Weissensteiner et al. [33]). Within these criteria, a strand bias score was calculated, and those heteroplasmic sites with a value >1 were filtered out [35, 36]. Mixed bases analysis The mitochondrial genome of S1 and S2 samples was previously amplified and sequenced by Sanger [3], using primers specifically designed to avoid NUMT co-

amplification and applying a sensitivity level of mixed bases of 10%. Thus, a mtDNA sequence of individuals S1 and S2 without ambiguities generated by NUMT coamplification is available [3]. To identify identical and non-identical positions between mtDNA and the corresponding region in the chromosome 1 NUMT, the alignment of the previously reported sequences of S1 and S2 [3] with the NUMT sequence (GRCh38_Chr.1 631681-634389) was performed using BioEdit Sequence Alignment Editor [34]. Based on the alignment (Figure 1) expected non-identical positions were identified for each

individual. Figure 1. Partial alignment between individual mtDNA and NUMT sequences. Non-identical positions between sequences are highlighted in boxes.

For mixed bases detected in identical positions between mtDNA and NUMT, NUMT variability was analysed using the 1000 Genomes [35] data with the purpose to discriminate if these mixed bases are reliable heteroplasmies or, actually, are nonidentical positions due to the possible presence of polymorphic positions in both S1 and S2 NUMT region. From all the results obtained about mixed bases in both non-identical and identical positions, heteroplasmy detection limits were established taking into account the sensitivity of mixed bases detection and the depth obtained.

3. Results Mixed bases in non-identical positions between mtDNA and NUMT Based on the alignment between mtDNA sequence of S1 and S2 individuals, obtained by Sanger [3], and NUMT sequence, all expected mixed bases in non-identical positions were detected by NGS, at least, in one of the runs (Table 1). From 28 mixed bases positions expected for S1 sample, 26 (92.86%) were detected in all runs, whereas for S2, from 29 mixed bases positions expected, only 17 (58.62%) were detected in all runs. Therefore, NUMT co-amplification was detected in the analysed samples, presenting an average percentage of mixed bases of 3.79% in S1 and 1.83% in S2 (Table 1).

Table 1. Mixed bases detected in mtDNA analysed region between 6739-8910 bp in S1 and S2. All positions are expected non-identical positions between mtDNA and NUMT sequences, except those signalled with *. The most frequent base is in capital letter, while minority base is in small letter.

Sample Position

S1

individual mtDNA/ NUMT

Run1 (Mean depth 10519.62X)

Run2 (Mean depth 7071.62X)

% mixed bases

Depth

% mixed bases

Depth

Sample Position

individual mtDNA/ NUMT

Run1 (Mean depth 8995.70X)

Run2 (Mean depth 8200.77X)

Run3 (Mean depth 1491.59X)

Run4 (Mean depth 4502.56X)

% mixed bases

Depth

% mixed bases

Depth

% mixed bases

Depth

% mixed bases

Depth

6935

C/t

2.56

16550

2.64

7188

6935

C/t

1.3

7134

1.6

6134

1.43

1185

1.66

3665

6938

C/t

2.49

16569

2.39

7211

6938

C/t

1.1

7151

1.44

6174

1.75

1198

1.56

3662

7146

A/g

3.16

15946

3

6723

7028

T/c

2.15

6750

2.04

5492

1.77

1188

2

3307

7232

C/t

2.53

17525

2.07

6858

7146

A/g

2.08

6650

2.18

5089

1.29

1320

1.67

3407

7256

C/t

2.88

16686

2.55

6157

7232

C/t

1.28

6788

1.55

5341

ND

1407

ND

3462

7316

G/a

3.24

14567

2.69

5432

7256

C/t

1.47

6056

1.74

5060

ND

1309

1.02

3220

7521

G/a

2.81

12855

2.41

4943

7270*

T/c

3.69

6047

2.85

5013

2.77

1264

3.54

3164

7650

C/t

3.13

14225

2.9

6414

7316

G/a

1.41

5386

1.1

4536

ND

1203

1.23

2765

7663*

C/t

1.29

14103

ND

6326

7521

G/a

1.03

4672

1.26

4359

ND

1002

0.96

2591

7705

T/c

4

14223

3.79

6909

7650

C/t

1.89

6285

1.57

4972

1.33

1206

1.22

2865

7810

C/t

4.25

13998

4.59

6624

7705

T/c

1.97

7191

1.59

5273

1.24

1211

1.93

2951

7868

C/t

4.58

13947

4.8

7089

7810

C/t

3

7239

1.78

5392

1.44

1111

1.56

2878

7891

C/t

4.16

12630

3.77

6638

7868

C/t

3.25

7835

2.03

5873

1.66

1084

1.56

2954

7912

G/a

4.77

12244

4.9

7059

7891

C/t

2.7

7361

1.65

5503

2.08

1008

1.3

2762

8021

A/g

5.3

11702

4.89

6892

7922*

T/c

1.4

8045

1.58

5942

1.57

1082

1.39

2802

8065

G/a

5.07

11309

4.6

6693

8021

A/g

3.61

8195

2.52

5645

1.94

1030

2.07

2848

8140

C/t

5.05

9351

4.6

5689

8065

G/a

2.86

8189

1.91

5433

ND

1040

1.76

2847

8152

G/a

5.44

8386

4.48

5005

8140

C/t

2.44

6917

1.95

4831

1.24

1051

1.66

2774

8167

T/c

6.12

7483

5.09

4305

8152

G/a

2.18

6552

2.15

4649

ND

955

1.81

2600

8203

C/t

6.79

6908

5.85

3829

8155*

G/a

ND

6851

0.79

4782

ND

967

ND

2642

8254*

C/t

3.18

5724

3.07

6063

8157*

T/c

4.33

6622

3.56

4546

4.52

996

3.44

2644

S2

8392

G/a

5.51

6927

3.62

3338

8203

C/t

2.22

5856

1.64

4443

1.56

1088

1.29

2554

8455

C/t

ND

9747

3

4566

8380

C/t

2.45

4812

2.25

6322

1.8

1043

1.21

3052

8461

C/t

ND

10264

3.08

4939

8392

G/a

2.7

4474

2.34

3799

ND

767

1.51

1920

8503

T/c

3.17

11403

3.2

5346

8455

C/t

2.05

4984

2.29

4200

ND

838

1.81

2215

8545

G/a

3.1

8490

3.85

4158

8461

C/t

1.99

5380

2.3

4395

1.71

875

2.16

2318

8655

C/t

3.09

13862

3.56

6400

8503

T/c

2.21

5830

1.79

4534

1.95

870

1.84

2442

8677

A/c

3.23

13285

3.15

6200

8545

G/a

2.36

4242

ND

3194

ND

596

2.19

1825

8701

A/g

3

13976

2.92

6757

8655

C/t

1.86

6511

1.97

5534

ND

1059

1.89

2905

8718

A/g

3.26

14135

3.44

6926

8677

A/c

1.87

6408

1.88

5544

ND

1039

2.2

2824

8878*

C/t

1.16

13238

1.05

6383

8701

A/g

1.88

7115

1.76

5975

ND

1041

2.21

2983

8718

A/g

2.56

7334

2.14

4085

1.92

888

2.62

2152

3.95

12660.85

3.64

5938.86

2.14

6403.46

1.87

5132.85

1.63

1084.75

1.70

2788.37

Average**

Average**

ND: mixed bases not detected. **Average calculated without positions not detected and identical positions between mtDNA and NUMT sequences.

Taking into account the depth obtained in each run for both samples (Table 1) and the number of non-identical positions not detected, the results showed a significant negative correlation between depth and the appearing of false negative in non-identical positions: as depth decreases, the number of false negatives increases (Pearson’s Correlation 0.468; p<0.01). In this sense, the run 3 in S2 presents the lowest depth (an average depth of 1491.59X) and, consequently, the lowest number of non-identical positions detected (13/30) (Table 1). Some false negatives in non-identical positions were detected even with a high depth, suggesting that other factors can lead to an underestimation of mixed bases. On one hand, positions 8455 and 8461 of S1 were detected in run 2, but not in run 1 with a higher depth (Table 1). Although raw data revealed the presence of these positions as mixed bases, a subsequent filter revised discarded them. In these cases, the strand bias score (>1) reveals that the frequency of each base is not consistent between forward and reverse strands, so they are filtered out in this step. On the other hand, the position 8545 of S2 in run 2 is not detected since the minor base differs between forward and reverse strand (a thymine is detected in forward strand, while an adenine is detected in reverse strand), so this position is also filtered out from raw data. Mixed bases in identical positions between mtDNA and NUMT A total of 3 (positions 7663, 8254, and 8878) and 4 (positions 7270, 7922, 8155, and 8157) mixed bases were detected in identical positions between mtDNA and NUMT sequences, respectively in S1 and S2 samples (Table 1). Considering the variation of NUMT in the 1000 genomes individuals, only the position 7663 detected in S1 corresponds to a polymorphic position in nDNA (rs373437560) that is locates inside the chromosome 1 NUMT in position 632834 (GRCh38_Chr.1) (Figure 2). In accordance,

S1 individual could present variability in the chromosome 1 NUMT and the reported mixed base could represent a non-identical position. Concerning the remaining positions, they appear to be identical positions between mtDNA and the corresponding nuclear DNA region. Thus, these positions could be reliable mtDNA heteroplasmies or could also represent false positives. Nevertheless, it is important to highlight that five of seven mixed bases (positions 8254 and 8878 in S1, and positions 7270, 7922, and 8157 in S2) were detected in all the runs, so it is unlikely that they could be false positives. The remaining two mixed bases (positions 7663 and 8155 in S1 and S2, respectively), which were detected at <1.5% level, were not reproducible in all the runs, and can represent false positives.

Figure 2. Allelic frequencies of NUMT variant rs373437560 (mtDNA position 7663) in different populations. Data obtained from 1000 Genomes project.

Heteroplasmy detection limit From different depths obtained, it is possible to establish heteroplasmy detection limits based on the minor allele frequencies obtained. Firstly, all mixed bases at level >3% are detected in all cases even in run 3 of S2, where the depth ranges around 1000X. Thus, when the heteroplasmy frequency is >3%, the presence of false positives and false

negatives seems to be virtually absent with a depth of ~1000X. Secondly, it is possible to detect mixed bases at level of 1.5%< X >3% if the depth is ~3000X. It could be established observing runs in S2 (Table 2): while some mixed bases are detected in run 1 and 2 (average depth in these positions of 6079X and 4892X), they are not detected in run 3 (average depth in these positions of 1052X). However, when the depth increases in run 4 (average depth in these positions of 2733X), these mixed bases are able to be detected. Thereby, no false positives and negatives are apparently detected with a depth of ~3000X. Finally, there are many discrepancies referring to very low mixed bases (<1.5%). It is possible to detect them in several cases, but false negatives are also detected, especially in run 3 in S2, where the lowest depth is obtained. For this reason, when the heteroplasmy frequency is <1.5%, the discrimination between false and true positives is very complex even with a high depth, and it is not possible to discard the presence of false negatives.

4. Discussion Taking the large number of methodologies followed and the criteria and results reported in the literature [2, 9, 16, 17, 26-29], it is very difficult to evaluate which strategy is more accurate to obtain reliable results and establish limits to detect mtDNA heteroplasmy. Bearing in mind that heteroplasmy detection at very low levels remains a challenge to solve, the novel approach presented in this study, based on mixed bases generation by co-amplification of mtDNA and NUMTs, provides a new method to establish heteroplasmy detection limits. The present study confirms the possibility of co-amplification of mtDNA and nDNA and its detection by NGS, being appreciable in some positions where the mix-base

composition indicates a co-amplification that reaches appreciable levels, with the nDNA variant being present in 6% of the reads. This result is in accordance with others studies, which determine that the presence of identical or almost identical sequences between NUMTs and mtDNA can introduce bias and generate false positives low-level heteroplasmies from NGS [22]. Furthermore, during mtDNA amplification, the amount of NUMT amplification appears to be different between samples, probably due to stochastic reasons. These results strengthen the importance of an exhaustive PCR primer design to amplify selectively mtDNA, as previously suggested by others [31, 32]. NUMT co-amplification strategy can be used to establish heteroplasmy detection limits. Mixed bases observed at low frequencies (<1.5%) cannot always be detected in all runs, independently of the depth obtained. This result reinforces the theory that heteroplasmy discrimination at low frequencies is a very complex procedure [36, 37]. However, it is important to consider that, although the sensitivity of detection is usually higher in the present study –some mixed bases are detected at frequency <1.5% in some runs–, it is possible to ignore some reliable mixed bases –false negatives– or accept sequencing errors –false positives– at this level. This is a very important feature regarding some studies like case-controls or heteroplasmic mutation load, where it is possible to obtain biased results, giving rise to misinterpretations in research findings. Thus, although some heteroplasmies can be detected, we contemplate that all the mutations present in a frequency <1.5% must not be considered for further analysis even if the depth is higher than 3000X. Mixed bases that present slightly higher values –between 1.5% and 3%– can be reproducible in all the runs if depth is at least ~3000X. In this range, depth plays a very important role in mtDNA heteroplasmy detection [38, 39]. Finally, with a depth

from ~1000X up to ~3000X, a minimum frequency of 3% is appropriated to avoid false positives and negatives detection. Several works have been focused in establish heteroplasmy detection limit (Table 2).

Table 2. Works based on the establishment of heteroplasmy detection limit depending mainly on the

References

Heteroplasmy detection limit

Depth

Sensitivity/Reproducibility

He et al. (2010) [2]

1.6%

16700X

High sensitivity.

Tang and Huang (2010) [29]

≥5%

1785X

Detection of false positives at 1% level. Specificity low at this level.

Li et al. (2010) [27]

10%

76X

False positives and negatives detected.

1170X

No false positives and negatives with error rate threshold 0.001 and minor allele frequency 0.01.

2.3%

500X

3 methods applied show good specificity (no false positives and negatives). False discovery rate of <1%.

1%

5000X

0.5%

8000X

Variants are readily and reproducibly observed.

Kloss-Brandstatter et al. (2015) [16]

1%

35000X

False negatives observed.*

Li et al. (2015) [17]

0.5%

3458X

Original findings are reproducible (false positive rate <0.01).

3%

1000X

1.5%-3%

3000X

Goto et al. (2011) [26]

Li et al. (2012) [28]

McElhoe et al. (2014) [9]

Present study (2018)

2%

False positives and negatives are apparently not detected.

depth obtained, in order to achieve the utmost together sensitivity and reproducibility.

*Detected by our group in the Supplementary Material.

Comparing the results obtained in the present study with those described by others (table 2), several topics may be considered. Firstly, our results allow establishing heteroplasmy detection limits lower than others [27, 29], which reported their own limits with a low reproducibility, and detecting the presence of false positives and negatives. Secondly, the results of He et al. [2] agree with the results presented in the current study. Although they were able to detect heteroplasmies with a sensitivity >1%, they established a heteroplasmy detection limit at 1.6% with a very high depth, strengthening the theory that a higher depth is necessary to detect heteroplasmies between 1.5% and 3%, and that heteroplasmy behind 1.6% is not reliable detected. Thirdly, while some studies were able to establish a very low heteroplasmy detection limit (≤1%) [9, 16], the depth necessary to establish it is very higher comparing to the reported in the present study. However, it is very remarkable that false negatives could still be observed with a coverage of 35000X [16]. Finally, others studies showed more discrepancies. Compared to our results, Goto et al. and Li et al. [26, 28] were able to decrease the low-level variants detection at ~2% with a depth between 500X and ~1000X, whereas Li et al. [17] established a heteroplasmy detection limit at 0.5% with a depth of ~3500X, achieving accurate results and good reproducibility in both studies. It is important to highlight that differences in data analysis could give different result. Accordingly, the workflow applied in the current study from mtDNA-Server is mainly designed for the purpose of low mitochondrial heteroplasmies detection and validation. Besides, although using the same NGS platform (Illumina), different high-throughput sequencing models were used (Genome Analyzer, Genome Analyzer II, HiSeq 2000, HiSeq 2500, MiSeq), and in this sense, very few studies apply MiSeq high-throughput sequencing [9].

Taking into account all the results mentioned above, the new strategy presented in the present work based on the Nextera XT and MiSeq platform technology implementation, leads to ensure the detection of mtDNA heteroplasmies at 3% frequency with a high reproducibility and a depth of ~1000X. Finally, our results confirm the important role of the depth obtained during the NGS process to obtain accurate results related to heteroplasmy detection at very low levels. 5. Conclusions These results demonstrate an alternative strategy to establish a reliable mitochondrial heteroplasmy detection limit. Following this strategy, several detection limits were settled, and a minimum minor allele frequency of 3% is required to ensure that neither false positives nor false negatives are detected at a depth of 1000X. With a lower minor allele frequency, depth plays a key role to prevent inaccurate interpretations. In addition, an exhaustive primer design to avoid misleading results due to NUMTs coamplification have to be performed, when the mtDNA analysis include a previous mtDNA amplification.

Acknowledgments This work was supported by MICINN (CGL2009-08205 and CGL2014-53781) and by Generalitat

de

Catalunya

(ref.

2017SGR1630).

A

postdoctoral

fellowship

(SFRH/BPD/105660/2015) (A.R.) was supported by Fundação para a Ciência e a Tecnologia (FCT).

References

[1] H.A. Coller, K. Khrapko, N.D. Bodyak, E. Nekhaeva, P. Herrero-Jimenez, W.G. Thilly, High frequency of homoplasmic mitochondrial DNA mutations in human tumors can be explained without selection, Nature genetics 28(2) (2001) 147-50. [2] Y. He, J. Wu, D.C. Dressman, C. Iacobuzio-Donahue, S.D. Markowitz, V.E. Velculescu, L.A. Diaz, Jr., K.W. Kinzler, B. Vogelstein, N. Papadopoulos, Heteroplasmic mitochondrial DNA mutations in normal and tumour cells, Nature 464(7288) (2010) 610-4. [3] A. Ramos, C. Santos, L. Mateiu, M. Gonzalez Mdel, L. Alvarez, L. Azevedo, A. Amorim, M.P. Aluja, Frequency and pattern of heteroplasmy in the complete human mitochondrial genome, PloS one 8(10) (2013) e74636. [4] C. Santos, R. Montiel, A. Arruda, L. Alvarez, M.P. Aluja, M. Lima, Mutation patterns of mtDNA: empirical inferences for the coding region, BMC evolutionary biology 8 (2008) 167. [5] C. Santos, B. Sierra, L. Alvarez, A. Ramos, E. Fernandez, R. Nogues, M.P. Aluja, Frequency and pattern of heteroplasmy in the control region of human mitochondrial DNA, Journal of molecular evolution 67(2) (2008) 191-200. [6] C. Berger, B. Berger, W. Parson, Sequence analysis of the canine mitochondrial DNA control region from shed hair samples in criminal investigations, Methods in molecular biology (Clifton, N.J 830 (2012) 331-48. [7] J.U. Palo, M. Hedman, N. Soderholm, A. Sajantila, Repatriation and identification of the Finnish World War II soldiers, Croatian medical journal 48(4) (2007) 528-35. [8] C. Turchi, F. Stanciu, G. Paselli, L. Buscemi, W. Parson, A. Tagliabracci, The mitochondrial DNA makeup of Romanians: A forensic mtDNA control region database and phylogenetic characterization, Forensic science international 24 (2016) 136-142.

[9] J.A. McElhoe, M.M. Holland, K.D. Makova, M.S. Su, I.M. Paul, C.H. Baker, S.A. Faith, B. Young, Development and assessment of an optimized next-generation DNA sequencing approach for the mtgenome using the Illumina MiSeq, Forensic science international 13 (2014) 20-9. [10] R. Arsenic, D. Treue, A. Lehmann, M. Hummel, M. Dietel, C. Denkert, J. Budczies, Comparison of targeted next-generation sequencing and Sanger sequencing for the detection of PIK3CA mutations in breast cancer, BMC clinical pathology 15 (2015) 20. [11] M. Verma, S. Kulshrestha, A. Puri, Genome Sequencing, Methods in molecular biology (Clifton, N.J 1525 (2017) 3-33. [12] T.C. Larman, S.R. DePalma, A.G. Hadjipanayis, A. Protopopov, J. Zhang, S.B. Gabriel, L. Chin, C.E. Seidman, R. Kucherlapati, J.G. Seidman, Spectrum of somatic mitochondrial mutations in five cancers, Proceedings of the National Academy of Sciences of the United States of America 109(35) (2012) 14087-91. [13] B.A. Payne, I.J. Wilson, P. Yu-Wai-Man, J. Coxhead, D. Deehan, R. Horvath, R.W. Taylor, D.C. Samuels, M. Santibanez-Koref, P.F. Chinnery, Universal heteroplasmy of human mitochondrial DNA, Human molecular genetics 22(2) (2013) 384-90. [14] J.L. King, B.L. LaRue, N.M. Novroski, M. Stoljarova, S.B. Seo, X. Zeng, D.H. Warshauer, C.P. Davis, W. Parson, A. Sajantila, B. Budowle, High-quality and highthroughput massively parallel sequencing of the human mitochondrial genome using the Illumina MiSeq, Forensic science international 12 (2014) 128-35. [15] J. Irwin, R. Just, M. Scheible, O. Loreille, Assessing the potential of next generation sequencing technologies for missing persons identification efforts, Forensic Science International: Genetics Supplement Series 3 (2011) e447–e448.

[16] A. Kloss-Brandstatter, H. Weissensteiner, G. Erhart, G. Schafer, L. Forer, S. Schonherr, D. Pacher, C. Seifarth, A. Stockl, L. Fendt, I. Sottsas, H. Klocker, C.W. Huck, M. Rasse, F. Kronenberg, F.R. Kloss, Validation of Next-Generation Sequencing of Entire Mitochondrial Genomes and the Diversity of Mitochondrial DNA Mutations in Oral Squamous Cell Carcinoma, PloS one 10(8) (2015) e0135643. [17] M. Li, R. Schroder, S. Ni, B. Madea, M. Stoneking, Extensive tissue-related and allele-related mtDNA heteroplasmy suggests positive selection for somatic mutations, Proceedings of the National Academy of Sciences of the United States of America 112(8) (2015) 2491-6. [18] I. Abnizova, R. te Boekhorst, Y. Orlov, Computational Errors and Biases in Short Read Next Generation Sequencing, J Proteomics Bioinform 10 (2017) 1-17. [19] R.S. Just, J.A. Irwin, W. Parson, Mitochondrial DNA heteroplasmy in the emerging field of massively parallel sequencing, Forensic science international 18 (2015) 131-9. [20] A. Goios, L. Prieto, A. Amorim, L. Pereira, Specificity of mtDNA-directed PCRinfluence of NUclear MTDNA insertion (NUMT) contamination in routine samples and techniques, International journal of legal medicine 122(4) (2008) 341-5. [21] S. Calvignac, L. Konecny, F. Malard, C.J. Douady, Preventing the pollution of mitochondrial datasets with nuclear mitochondrial paralogs (numts), Mitochondrion 11(2) (2011) 246-54. [22] L. Albayrak, K. Khanipov, M. Pimenova, G. Golovko, M. Rojas, I. Pavlidis, S. Chumakov, G. Aguilar, A. Chavez, W.R. Widger, Y. Fofanov, The ability of human nuclear DNA to cause false positive low-abundance heteroplasmy calls varies across the mitochondrial genome, BMC genomics 17(1) (2016) 1017.

[23] F. Ye, D.C. Samuels, T. Clark, Y. Guo, High-throughput sequencing in mitochondrial DNA research, Mitochondrion 17 (2014) 157-63. [24] Y. Guo, J. Li, C.I. Li, Y. Shyr, D.C. Samuels, MitoSeek: extracting mitochondria information and performing high-throughput mitochondria sequencing analysis, Bioinformatics (Oxford, England) 29(9) (2013) 1210-1. [25] D.C. Samuels, L. Han, J. Li, S. Quanghu, T.A. Clark, Y. Shyr, Y. Guo, Finding the lost treasures in exome sequencing data, Trends Genet 29(10) (2013) 593-9. [26] H. Goto, B. Dickins, E. Afgan, I.M. Paul, J. Taylor, K.D. Makova, A. Nekrutenko, Dynamics of mitochondrial heteroplasmy in three families investigated via a repeatable re-sequencing study, Genome biology 12(6) (2011) R59. [27] M. Li, A. Schonberg, M. Schaefer, R. Schroeder, I. Nasidze, M. Stoneking, Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes, American journal of human genetics 87(2) (2010) 23749. [28] M. Li, M. Stoneking, A new approach for detecting low-level mutations in nextgeneration sequence data, Genome biology 13(5) (2012) R34. [29] S. Tang, T. Huang, Characterization of mitochondrial DNA heteroplasmy using a parallel sequencing system, BioTechniques 48(4) (2010) 287-96. [30] L. Alvarez, C. Santos, A. Ramos, R. Pratdesaba, P. Francalacci, M.P. Aluja, Mitochondrial DNA patterns in the Iberian Northern plateau: population dynamics and substructure of the Zamora province, American journal of physical anthropology 142(4) (2010) 531-9. [31] A. Ramos, C. Santos, E. Barbena, L. Mateiu, L. Alvarez, R. Nogues, M.P. Aluja, Validated primer set that prevents nuclear DNA sequences of mitochondrial origin co-

amplification: a revision based on the New Human Genome Reference Sequence (GRCh37), Electrophoresis 32(6-7) (2011) 782-3. [32] A. Ramos, C. Santos, L. Alvarez, R. Nogues, M.P. Aluja, Human mitochondrial DNA complete amplification and sequencing: a new validated primer set that prevents nuclear DNA sequences of mitochondrial origin co-amplification, Electrophoresis 30(9) (2009) 1587-93. [33] H. Weissensteiner, L. Forer, C. Fuchsberger, B. Schopf, A. Kloss-Brandstatter, G. Specht, F. Kronenberg, S. Schonherr, mtDNA-Server: next-generation sequencing data analysis of human mitochondrial DNA in the cloud, Nucleic acids research 44(W1) (2016) W64-9. [34] T.A. Hall, BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT Nucleic Acids Symposium Series 41 (1999) 95-98. [35] A. Auton, L.D. Brooks, R.M. Durbin, E.P. Garrison, H.M. Kang, J.O. Korbel, J.L. Marchini, S. McCarthy, G.A. McVean, G.R. Abecasis, A global reference for human genetic variation, Nature 526(7571) (2015) 68-74. [36] M. Duan, J. Tu, Z. Lu, Recent Advances in Detecting Mitochondrial DNA Heteroplasmic Variations, Molecules (Basel, Switzerland) 23(2) (2018). [37] E.J. White, T. Ross, E. Lopez, A. Nikiforov, C. Gault, R. Batorsky, C. Darcy, D. Campagna, M. Fleming, J. Thompson, Chasing a moving target: Detection of mitochondrial heteroplasmy for clinical diagnostics, https://doi.org/10.1101/222109 (2017). [38] Y. Guo, C.I. Li, Q. Sheng, J.F. Winther, Q. Cai, J.D. Boice, Y. Shyr, Very lowlevel heteroplasmy mtDNA variations are inherited in humans, Journal of genetics and genomics = Yi chuan xue bao 40(12) (2013) 607-15.

[39] A.D. Jayaprakash, E.K. Benson, S. Gone, R. Liang, J. Shim, L. Lambertini, M.M. Toloue, M. Wigler, S.A. Aaronson, R. Sachidanandam, Stable heteroplasmy at the single-cell level is facilitated by intercellular exchange of mtDNA, Nucleic acids research 43(4) (2015) 2177-87.

Abstract Although the use of Next Generation Sequencing (NGS) in mitochondrial DNA (mtDNA) studies related to forensic and human genetics has contributed to the report of heteroplasmy at very low levels (lower than 1% and even 0.5%), their detection is not a straightforward process. Our purpose is to establish mitochondrial heteroplasmy detection limits, generating mixed bases at low frequencies by the PCR co-amplification of mtDNA and a nuclear insertion of mitochondrial origin (NUMT). A primer set that presents a perfect annealing with both mitochondrial and nuclear DNA was used to amplify the mitochondrial DNA region located between positions 6739 and 8910 and the corresponding region located inside a NUMT located in chromosome 1 (GRCh38.p12 Primary Assembly 631910-634079) that presents an identity of 98% with the corresponding region of mtDNA in two samples. Independent Nextera XT® (Illumina) NGS libraries were generated for each sample and sequenced in different MiSeq (Illumina) runs. Non-identical and identical positions between individual mtDNA and NUMT sequences were detected, and heteroplasmy detections limits were established: a) with a minor allele frequency <1.5%, false positive and negative can occur, even with a depth higher than 3000X; b) with a minor allele frequency >3%, no presence of false positive and negative were detected with a depth of ~1000X; and c) with a minor allele frequency between 1.5% and 3%, a minimal depth of 3000X was necessary to avoid false positives or negatives detection. Our results demonstrate an alternative strategy to establish a reliable limit of mitochondrial heteroplasmy detection.

Keywords: mitochondrial DNA (mtDNA); heteroplasmy; Nuclear Insertion of Mitochondrial Origin (NUMT); Next Generation Sequencing (NGS).