Strategies for complete mitochondrial genome sequencing on Ion Torrent PGM™ platform in forensic sciences

Strategies for complete mitochondrial genome sequencing on Ion Torrent PGM™ platform in forensic sciences

Forensic Science International: Genetics 22 (2016) 11–21 Contents lists available at ScienceDirect Forensic Science International: Genetics journal ...

2MB Sizes 1 Downloads 72 Views

Forensic Science International: Genetics 22 (2016) 11–21

Contents lists available at ScienceDirect

Forensic Science International: Genetics journal homepage: www.elsevier.com/locate/fsig

Research paper

Strategies for complete mitochondrial genome sequencing on Ion Torrent PGMTM platform in forensic sciences Yishu Zhoua , Fei Guob,c , Jiao Yub , Feng Liub , Jinling Zhaob , Hongying Shenb , Bin Zhaob , Fei Jiab , Zhu Sunb , He Songa , Xianhua Jianga,b,* a b c

China Medical University School of Forensic Medicine, No. 77, Puhe Road, Shenyang North New Area, Shenyang, Liaoning 110122, PR China Criminal Science and Technology Institute of Liaoning Province, No. 2, Qishan Middle Road, Huanggu District, Shenyang, Liaoning 110032, PR China Department of Forensic Medicine, National Police University of China, No. 83, Tawan Street, Huanggu District, Shenyang, Liaoning 110854, PR China

A R T I C L E I N F O

A B S T R A C T

Article history: Received 21 April 2015 Received in revised form 30 December 2015 Accepted 8 January 2016 Available online 15 January 2016

Next generation sequencing (NGS) is a time saving and cost-efficient method to detect the complete mitochondrial genome (mtGenome) compared to Sanger sequencing. In this study we focused on developing strategies for mtGenome sequencing on the Ion Torrent PGMTM platform and NGS data analysis. With our experience, 4, 15 and 30 samples could be loaded onto Ion 314TM, Ion 316TM and Ion 318TM chips respectively at a pooling concentration of 26 pM, achieving to sufficient average coverage of 1500  and well strand balance of 1.05. Data processing software is essential to NGS mega data analysis. The in-house Perl scripts were developed for primary data analysis to screen out uncertain positions and samples from variant call format (VCF) reports and for pedigree study to perform pairwise comparisons. The Integrative Genomic Viewer (IGV) and the NextGENe software were introduced to secondary data analysis. The mthap and EMMA were employed for haplogroup assignment. The dataset was reviewed and approved by the EMPOP as the final version, which showed 2.66% error rate generated from the Torrent Variant Caller (TVC). Across the mtGenome, 4022 variants were found at 725 nucleotide positions, where ratio of transitions to transversions was estimated at 20.89:1 and 22.18% of variants was concentrated at hypervariable segments I and II (HVS-I and HVS-II). Totally, 107 complete mtGenome haplotypes were observed from 107 Northern Chinese Han and assigned to 88 haplogroups. The random match probability (RMP) of complete mtGenome was calculated as 0.009345794, decreasing 26.19% by comparison to that of HVS-I only, and the haplotype diversity (HD) was evaluated as 1, increasing 0.33% by comparison to that of HVS-I only. Principal component analysis (PCA) showed that our population was clustered to East and Southeast Asians. The strategies in this study are suitable for complete mtGenome sequencing on Ion Torrent PGMTM platform and Northern Chinese Han (EMP00670) is the first complete mtGenome dataset contributed to the EMPOP from East Asian. ã 2016 Elsevier Ireland Ltd. All rights reserved.

Keyword: Next generation sequencing (NGS) Ion Torrent PGMTM Mitochondrial genome (mtGenome) Northern Chinese Han Forensic sciences

1. Introduction The human mitochondrial genome (mtGenome) is an approximately 16,569 base pairs (bp) long extra-chromosomal genome presenting in the subcellular organelle. The 15,447 bp long coding region (CodR) is economical construction, which encodes 37 genes correlate to oxidative phosphorylation. The 1,122 bp long noncoding control region (CR) has proven to be informatics, which

* Corresponding author at: China Medical University School of Forensic Medicine, No. 77, Puhe Road, Shenyang North New Area, Shenyang, Liaoning 110122, PR China. Tel.: +86 24 31939433; fax: +86 24 31939433. E-mail address: [email protected] (X. Jiang). http://dx.doi.org/10.1016/j.fsigen.2016.01.004 1872-4973/ ã 2016 Elsevier Ireland Ltd. All rights reserved.

comprises an origin of replication along with three hypervariable segments (HVS) [1]. Compared with nuclear DNA, mitochondrial DNA (mtDNA) possesses some unique characteristics including simplistic inheritance and great copy number per cell [2]. It is considered to be inherited strictly from our mothers and to be transmitted without recombination across many generations. While specimens like highly decomposed remains, bones or hair shafts are hard to obtain full short tandem repeat (STR) profiles, mtDNA typing can provide an alternative clue to investigation by comparison with known maternal relatives [3]. Thus, it is commonly used in human evolutionary studies [4] and forensic practice [5]. Previous studies often restrict to sequence the HVS-I, II and III of CR [6,7] as well as some specific single nucleotide polymorphisms (SNPs) of CodR [8]. Such partial information may limit the

12

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

polymorphism information content of this genetic marker and hinder its application in practical forensic casework. Undoubtedly, the complete mtGenome sequencing will provide new insights beyond current capabilities. However, most available mtGenomes are generated with conventional Sanger sequencing technology, which is a time-consuming and expensive task particularly when large mtGenome databases are established or high quality data are required with redundant sequence coverage. Fortunately, next generation sequencing (NGS) technology has the potential to dramatically increase sample throughput, workflow efficiency and detection resolution, therefore facilitating to obtain reliable and accurate complete mtGenome information. In this study, we have developed strategies for complete mtGenome sequencing on the Ion Torrent Personal GenomeTM Machine (PGMTM) platform and investigated mtGenome features of the Northern Chinese Han population to evaluate the application in forensic sciences. 2. Materials and methods 2.1. Samples and DNA extraction A total of 125 peripheral blood samples of Northern Han were collected from Shenyang City, Liaoning Province, Northeast China after informed consent, including 107 samples from unrelated individuals and 18 samples from five maternal genealogies (four generations in three families and three generations in two families). DNA was extracted on the EZ11 Biorobot workstation using the EZ11 DNA Investigator Extraction Kit (QIAGEN, Hilden, Germany). Purified DNA was stored at 20  C. 2.2. PCR amplification The mtGenome was amplified in two independent reactions using primers described in [9], namely L644, H8982, L8789, and H877 (Table S1). Two amplicons, fragment A (8.3 kilo base pairs, kbp) and fragment B (8.6 kbp), were partially overlapped at positions 622–898 and 8764–9003. Long range PCRs were carried out using the SequalPrepTM Long PCR Kit (Thermo Fisher Scientific, MA, USA) according to the manufacturer’s recommendations with two modifications: (1) the total DNA input was 30–50 ng per reaction. Fragment A was dealt with 2 ml enhancer A and 2 ml enhancer B; fragment B was done with 1 ml enhancer A. (2) amplification was performed on the GeneAmp1 9700 System thermal cycler (Thermo Fisher) with the thermal cycling condition for both reactions: denaturation for 2 min at 94  C, amplification for 32 cycles of 10 s at 94  C, 45 s at 60  C, 9 min at 68  C, final extension for 5 min at 72  C, and hold at 4  C. PCR products (5 ml) were qualified and quantified by 1.0% agarose gel electrophoresis with GoodView Nucleic Acid Stain (SBS Genetech, Beijing, China) to confirm successful amplification. 2.3. Library construction Library construction included three steps as follows: fragmentation, adapter ligation and size selection. (1) Fragmentation: fragments A and B were quantified on the Qubit1 2.0 fluorometer using the Qubit1 dsDNA BR Assay kit (Thermo Fisher) and were pooled with each of 50 ng. Then enzymatic shearing was performed using the Ion ShearTM Plus Reagent (Thermo Fisher) following the manufacturer’s instruction. In order to yield fragments with a median size of 200 bp, shearing condition was adjusted to 8 min at 37  C; (2) Adapter ligation: both ends of each fragment were ligated with the Ion P1 Adapters and the Ion Xpress Barcode Adapters according to the manufacturer’s recommendations; (3) Size selection: it was conducted on the E-Gel1 iBaseTM Power System with E-Gel1 SizeSelectTM 2% agarose gels (Thermo

Fisher) according to the manufacturer’s recommendations. The selected libraries were amplified for 8 cycles with the reagents supplied in Ion Plus Fragment Kit (Thermo Fisher). Purification was an essential step after each process (fragmentation, adapter ligation and libraries amplification), using the Agencourt AMPure1 XP PCR Purification system following the manufacturer’s recommendations. 2.4. Template preparation and PGMTM sequencing Quantification and qualification for the amplified libraries were examined on the Agilent 2100 Bioanalyzer System with the Agilent High Sensitivity DNA Kit (Agilent Technologies, CA, USA) according to the manufacturer’s recommendations. All barcoded libraries were pooled in equimolar amounts of 26 pM to ensure equal representation of each barcoded library in the sequencing run. Pooled libraries were then subjected to emulsion PCR on the Ion OneTouchTM 2 instrument (Thermo Fisher) with the Ion PGMTM Template OT2 200 Kit (Thermo Fisher), and followed by templatepositive Ion Sphere Particles (ISPs) enrichment on the Ion OneTouchTM ES instrument (Thermo Fisher) according to the manufacturer’s recommendations. NGS was performed on PGMTM platform (Thermo Fisher) with three size chips (Ion 314TM Chip, Ion 316TM Chip and Ion 318TM Chip) using the Ion PGMTM Sequencing 200 v2 Kit. All preparations were conducted to the protocol released by the manufacturer: chlorite cleaning and 18.2 MV water cleaning, initialization the PGMTM system and chip check. Enriched template-positive ISPs were mixed with control ISPs, sequencing primers and sequencing polymerase in a consecutive way, and then loaded onto an Ion chip carefully. 2.5. Data analysis All raw reads were mapped with the revised Cambridge Reference Sequence (rCRS, NC_012920.1) [10] by the Ion Torrent Suite Software (TSS) version 4.0 based on the TMAP SmithWaterman alignment algorithm. Variants were called using the Torrent Variant Caller (TVC) plugin v.4.0 with somatic-low stringency and filed into variant call format (VCF) reports. The in-house Perl scripts (Boxes 1–3) were compiled for primary data analysis. Base by base investigation was performed using the Integrative Genomic Viewer (IGV) package v.2.3.34 [11] and the NextGENe version 2.3.4.5 (SoftGenetics, PA, USA), where binary alignment map (BAM) and binary alignment index (BAI) files were visualized based on the Torrent mapping alignment program (TMAP) using the IGV, and FASTQ-converted-FASTA files were re-aligned based on the Burrow–Wheeler transform (BWT) alignment algorithm using the NextGENe. Parameters of NextGENe were described in [12]. The preliminary haplogroup assignment was chosen from the highest ranked haplogroup from mthap (http://dna.jameslick.com/mthap) and EMMA [13] in the European DNA Profiling Group mtDNA Population (EMPOP) database (www.empop.org) [14]. The final haplogroup assignment was determined according to rCRS-oriented version of the mtDNA tree Build 16 (released 19 Feb., 2014) [15]. Homopolymers in rCRS and each sample were detected using mreps 2.6 software [16]. Profiles were compared among maternal genealogies using Perl script (Box 4) for pedigree study. 2.6. Sanger sequencing Sanger sequencing was employed to confirm variants in discordance between IGV and NextGENe. Primers for Sanger sequencing were listed in Table S1, where five primers were described in [17] and one primer was self-designed in this study.

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

Sequencing was performed on the Applied Biosystems1 3130xl Genetic Analyzer (Thermo Fisher) using the BigDye1 Terminator v3.1 Cycle Sequencing Kit (Thermo Fisher) following the manufacturer’s recommendations. Raw data was analyzed using the Sequencing Analysis Software v5.3.1 (Thermo Fisher).

13

The random match probability (RMP) was defined as RMP = pi2, where pi is the frequency of the ith haplotype. The haplotype diversity (HD) was calculated as HD = (1–RMP) (n/(n  1)), where n is the sample size. Based on haplogroup frequencies from populations, principal component analysis (PCA) was measured and visualized using R version 3.0.2. Proportion test was performed using the OpenStat [18], and analysis of variance (ANOVA), Fisher’s exact test and Chi-square test were done using SPSS version 20.0.0 software (IBM, NY, USA).

600 Mbp and 1100 Mbp (0.5 million (M), 3 M and 5.5 M reads  200 bp/read), which could load with 4, 15 and 30 samples at a pooling concentration of 26 pM within 30%–35% of low quality data and polyclonal ISPs, respectively. Most of population samples were sequenced on the Ion 316TM chips in this study. Take one of running summaries for example: 85% ISP density was obtained with 362 Mbp raw data, 34% low quality data, 35% polyclonal ISPs and 43% usable data. Interestingly, the percentage of low quality seemed quite high, but it decreased to 8% when we reanalyzed this run from Signal Processing using TSS version 4.2. We referred the Torrent Suite Software User Documentation, which noted all reads with off-scale (pinned) signal were counted as low quality ISPs for purposes of the ISP summary in version 4.0 but now as empty wells in version 4.2 or above. According to the equation of % Low Quality = Low quality ISPs/Clonal ISPs, the result would become low with the numerator decreased.

3. Results and discussion

3.2. Data analysis

3.1. PGMTM sequencing summary

Data analysis on the complete mtGenome can be seen as a challenging part in practice. The strategy was developed in this study (Fig. 1) mainly including primary data analysis (align variants from VCF reports; sort alignment by positions and by samples; collect positions by samples for investigation) and secondary data analysis (base by base review; haplogroup assignment; EMPOP evaluation).

2.7. Statistics P

The processes of complete mtGenome sequencing by NGS are basically shared among current platforms, including DNA extraction, long range PCR amplification, library construction, clonal amplification, sequencing and raw data analysis. Although manufacturers are trying to shorten the library preperation time relatively by seeking to streamline workflows, it takes longer (about 4 days) from sample to profile with our experience when factors are considered as follows: pipetting time, working hours and multiplexing number of samples. Ion 314TM, Ion 316TM and Ion 318TM chips for the PGMTM platform had the theoretical maximum throughputs of 100 Mbp,

3.2.1. Primary data analysis 3.2.1.1. Step 1: align variants from VCF reports. Information on VCF position, VCF reference, VCF variant, frequency (freq) and coverage (cov) was extracted from VCF reports by Box 1. Also, it calculated

Fig. 1. Workflow of data analysis. Data analysis mainly includes two parts presenting the strategies herein: primary data analysis (align variants from VCF reports; sort alignment by positions and by samples; collect positions by samples for investigation); secondary data analysis (base by base review; haplogroup assignment; EMPOP evaluation).

14

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

the times of variants observed at each position from all samples and the amount of variants and the average cov for each sample, and finally exported the alignment report in an .xls file (Table S2). In this file, positions followed by the reference and variant from all samples were vertically arranged from the lowest to the highest, and variant details of each sample, including variant, freq and cov, were horizontally arranged every three columns according to file names (e.g., N001–N108) from the smallest to the biggest. Herein, sample N081 was present the same variants as sample N073, possibly double-sampling from the same individual. It was removed after confirmation by STR profiles. 3.2.1.2. Step 2: sort alignment by positions and by samples. According to minimum variant frequency and sufficient coverage for detection, the alignment report from Step 1 was divided into four categories by positions (Table S3) with Box 2 and by samples (Table S4) with Box 3: I. freq  15% and cov  100 ; II. freq  15% and cov < 100 ; III. freq < 15% and cov  100 ; IV. freq < 15% and cov < 100  By positions, the rule for confirming/rejecting variants was mainly followed as Category I, confirmed (88.75%); Category II, documented and further investigated (7.98%); Category III and IV, rejected (3.27%). A total of 31 complicated positions were also documented and needed further investigation including different variants reported at the same positions (e.g., 195T ! C and 195T ! A) and multiple nucleotides reported at the reference (e.g., 12236GC ! G) and variant (e.g., 2150T ! TA). By samples, the qualification level was measured with % Category I (the amount of variants in Category I divided by total variants in each sample), where 71 good samples were convinced (90% Category I) and 36 poor samples needed to be documented and further investigated (<90% Category I). Step 3: collect positions by samples for 3.2.1.3. investigation. Collection of three parts documented and further investigated from Step 2 was integrated into an .xls file (Table S5) according to complicated positions for corresponding samples and Category II by the qualification level in each sample. At a glance, 91.96% variants in Category II were distributed in poor samples. The integrated table should be archived and referred to during base by base review in secondary data analysis.

3.2.2. Secondary data analysis 3.2.2.1. Step 1: base by base review. Each sample was reviewed base by base by two staffs with IGV and two staffs with NextGENe, who mainly focused on collection in Table S5 from primary data analysis. We strictly followed the latest International Society for Forensic Genetics (ISFG) recommendations on nomenclature [19]. Private mutations (not associated with haplogroup) and missing mutations (expected but not observed in haplogroup) were verified on the basis of current knowledge of the phylogeny. Any discrepancy between IGV and NextGENe was listed by an expert and resolved to the satisfaction of four reviewers or confirmation by Sanger sequencing. In Table S5, a total of 321 variants in Category II could be confirmed, although depth of coverage was not sufficient (<100  ). The investigation of 31 complicated positions found indels (nomenclature differences between medical and forensic genetics conventions to be adjusted, Table 1) at VCF positions 247 (249DEL), 310 (309.XC, 315.XC), 513 (524.1A, 524.2C, 521–524DEL, 523–524DEL), 567 (573.1C), 2150 (2156.1A), 2226 (2232.1A), 8270 (8281–8289DEL), 12236 (12241DEL), 15939 (15944DEL) and 16241 (16241.1T), tri-alleles at 6 positions (195, 4200, 9824, 16232, 16265 and 16291), multiple nucleotide polymorphisms at positions 10397 (10397G, 10398G) and 16179, 16181, 16183 (16182C, 16183C), substitutions at 4 positions (4246, 9523, 13758 and 16361) and wrong reports at 7 positions such as 309 (310C), 750 (750G, 752T), 8472 (8473C), 10288 (10289G), 10872 (10873C), 12704 (12705T) and 13955 (13956G). It was important to note that insertions were mainly inspected using NextGENe, because they were present as a bar in IGV no matter how many bases were inserted and alignments were obscure to variants within C stretches of HVS-I and HVS-II based on TMAP alignment algorithm (an example in Fig. S1). Also, a total of 96 variants were uncalled in VCF reports (Cn indels at positions 309, 315 and 16193 were ignored, Table 2), where 31 variants (32.29%) distributed in 71 good samples and 65 variants (67.71%) did in 36 poor samples. Although some of variants were never observed in VCF reports, e.g., 960.1C, 965.1C, 3172.1C, 3571T, 5889.1C, 8278.1C, 12389T, 13681G, 13753C, 14305C, 14755G, 16190T and 16254T, they were included in the EMPOP and

Table 1 Information of indels. Interpretation

VCF report

Visual inspection

249DEL 309DEL 309.XC 315.XC 524.1A, 524.2C 521–524DEL 523–524DEL 573.1C 960.1C 965.1C 2156.1A 2232.1A 3172.1C 5899.1C 8278.1C 8281–8289DEL

247GA ! 247G Not reported Reported but not correctly Reported but not correctly 513G ! 513GCA 513GCACA ! 513G 513GCA ! 513G 567A ! 567ACCC, 567ACCCC and 567ACCCCC Not reported Not reported 2150T ! 2150TA 2226T ! 2226TA Not reported Not reported Not reported 8270CACCCCCTCT ! 8270C or 8270CACCCCCTCTAC ! 8270CAC 12236GC ! 12236G 15939CT ! 15939C Not reported Not reported Not reported 16241A ! 16241AT

248DEL 303DEL 303.XC 311.XC 514.1C, 514.2A 514–517DEL 514–515DEL 568.XC 956.1C 956.1C 2151.1A 2227.1A 3168.1C 5895.1C 8272.XC 8271–8279DEL

12241DEL 15944DEL 16192–16193DEL 16193DEL 16193.XC 16241.1T

12237DEL 15940DEL two Cs deletion in C homopolymers C deletion in C homopolymers Cn insertion in C homopolymers 16241.1T

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

15

Table 2 Details of false negative variants in VCF reports. Varianta

# of total observed

# of false negative

Length of homopolymers (bp)

Coverage ()

318C 456T 489C 573.1C 960.1Cb 961C 965.1Cb 3172.1Cb 3571Tb 4248C 5899.1Cb 8277C 8278.1Cb 8473C 8563G 10873C 12398Tb 12406C 13681Gb 13753Cb 13759C 14287C 14305Cb 14755Gb 14766T 15535T 16172C 16182C 16183C 16189C 16190Tb 16217C 16245T 16249C 16254Tb

2 3 66 3 1 3 3 3 2 8 2 2 2 3 3 66 1 10 1 1 7 2 1 1 108 6 10 13 27 34 1 9 2 5 1

1 1 3 2 1 1 3 3 2 1 2 1 2 2 1 3 1 1 1 1 2 1 1 1 11 2 1 8 18 12 1 1 1 2 1

– 5 – 7 6 9 9 6 5 6 6 8 8 6 – 5 – – – – – 4 – – – – 3 9 9 9 – – – 4 –

39 517 33, 38, 44 380, 5021 97 606 451, 601, 1006 2091, 2369, 4243 79, 2002 159 294, 1905 299 350, 1345 23, 302 44 17, 19, 36 12 28 113 360 26, 58 12 10 106 12, 13, 16, 20, 20, 23, 23, 41, 45, 65, 112 17, 115 130 15, 34, 40, 58, 77, 81, 84, 114 5, 11, 15, 34, 36, 38, 42, 50, 66, 67, 81, 82, 87, 94, 101, 108, 115, 235 5, 12, 22, 23, 32, 33, 36, 39, 41, 47, 90, 800 22 58 18 30, 52 31

a b

Cn indels at positions 309, 315 and 16193 were ignored. Variants had never observed in VCF reports.

on the Phylotree or reported in Soares et al. [20] and Just et al. [21] studies. With regard to point heteroplasmy (PHP), the minor allele frequency (MAF) 15% was applied as detection threshold (private communication with Walther Parson). In cases of length heteroplasmy (LHP), the major molecule was reported. Contamination by nuclear insertions of mitochondrial DNA (NUMTs) sequences could be contributed to problems in the heteroplasmy detection, though amplification of NUMTs was unlikely when sufficient mtDNA was present in a sample and a long-range PCR strategy was used [23]. Briefly, the refseq representative genomes of the complete human genome were queried using Primer-BLAST [22] for the two long range PCR primer pairs. Primer specificity stringency was allowed with default parameters and max target size was changed to 10000. The result demonstrated that no potential region from the nuclear genome was identified and no NUMT amplification was observed during base by base review. A total of 34 potential PHP positions were listed in Table S3, excluding positions (309, 310, 513, 567, 8270 and 16179–16185) located in LHP regions. Finally, removal of mixtures at 10 positions was authenticated as follows: (1) the hotspots at position 469 (76/107) in the CR and positions 8772 (2/107) and 8936 (15/107) in the CodR might be rare in a single population, meanwhile discrepancies between IGV and NextGENe were observed at these positions as well as position 1164, and finally mixtures were confirmed by Sanger sequencing; (2) mixtures at positions 4248, 8251 and 14587 were caused by misalignment owing to indels resided close to these positions; (3) mixtures at positions 5196 and 9546 violated to double-strand

confirmation, where T5196 and 9546A were only observed on the forward strands, respectively; and (4) heteroplasmic mixture of deleted (10873DEL) and substituted (10873C) variant at position 10872 suggested that it might be artificial product of Ion Torrent PGMTM sequencing chemistry and/or the alignments, no matter what alignment algorithms were applied [12]. A more detailed discussion about (1), some regions affected by false priming were isolated during this procedure. A number of suspicious reads reflected sequences different from the rCRS were present in three regions (positions 445–470, 1164–1183 and 8915– 8940, Fig. S2), which had similarity to the primers used for long range PCR amplification and resulted in false positive substitutions at positions 469, 1164 and 8936 (most of which were reported as mixtures) as well as false negative substitutions at position 456. This explained the unusual high frequencies at the positions 469 and 8936 from VCF reports. Besides, two homoplasmic variants (8772C) were also reported as mixtures (Fig. S2), probably resulting from the position resided at primer L8789 binding regions and within Fragment A overlapped regions. Although several filters had been developed to eliminate potential mixtures and thereby could authenticate PHP [23], the measures in some instances were insufficient, especially in positions affected by false priming and/or positions resided close to the binding sites of primers. Hence any PHP should be carefully interpreted by referring to raw data, or confirmed by Sanger sequencing if necessary. After base by base review, all variants in each sample were documented and re-reviewed. Then, the sample haplotypes were

16

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

prepared for haplogroup assignment according to format accepted in mthap and EMMA. 3.2.2.2. Step 2: haplogroup assignment. Preliminary and final haplogroups for 107 haplotypes were listed in Table S6. Thirty-

one assignments and indications were inconsistent because of missing signature mutations and nomenclatures. For 6 assignment differences (N001, N006, N036, N055, N056 and N102), a missing signature mutation would lead to a more ancestral estimation. For example, sample N006 was assigned to F2 by mthap instead of

Fig. 2. Number of forward reads (#F) and reverse reads (#R) at each position across the complete mtGenome were derived from averaged data of all 107 samples by using the NextGENe software. Curve of coverage was plotted with #F and #R, and the distribution pattern was not change over depth of average coverage among individuals; Curve of forward strand and reverse strand balance was plotted with the ratio of #F/#R, and the distribution plot fluctuated across the mtGenome, where average strand bias presented higher toward reverse strand than toward forward strand. Gray color bar indicated regions of low coverage (487  ) and regions of high strand bias (ratio of #F/#R 0.14) across the complete mtGenome. Homopolymers in rCRS were detected using mreps software. Each bar indicated a homopolymer, color and height of which corresponded to different nucleotides and length. “Heat-map” for variants observed from 107 Northern Chinese Han were constructed by plotting mtGenome position versus the number of variants observed at each variant position.

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

F2a-@16291 because it lacked the 16291T required for F2-16291, even though the diagnostic 16203 G for F2a was observed. For 25 indication differences, (1) a missing signature mutation toward the rCRS state was not indicated with a @ prefix following a haplogroup name but it was listed a summary of “Mismatches”, even though the sample could be assigned correctly. For example, sample N002 was indicated with M7c1a by mthap instead of M7c1a-@199 even though 199C toward the M7c1 state was expected but not observed. (2) brackets represented mutations were treated as optional by mthap and not contributed to scoring, while brackets did recurrent/unstable mutations within the respective clade in rCRS-oriented version. For example, sample N007 was indicated as D5c (T16311C) by mthap, while T16311C actually was not a recurrent/unstable mutation within the clade D5c and it should be indicated as D5c-16311 instead. Although mthap was sufficient for preliminary haplogroup estimation, manual adjustment was needed for final assignment. EMMA was also utilized for preliminary assignment after the EMPOP v3 was launched. There were 3 assignment differences (N001, N015 and N055). For example, sample N015 was assigned to haplogroup M8a3 according to Phylotree and 5100T was diagnostic mutation from M8a20 3 to M8a3. EMMA’s lowest cost estimate was produced with AP013124.1 for haplogroup M8a20 3 with costs of 2.53 for a missing mutation 5460A and a private mutation 5100T. Actually, the best estimate (a virtual haplotype for haplogroup M8a3) was ranked second with costs of 3.94 resulting from private mutations 16293G, 152C, 737T, and 13488C. For samples N001 (haplogroup D4c2a-@15391@16311) and N055 (haplogroup F1b1b-@152@8772@16172), missing signature mutations caused a more conservative estimate (D4c2 and F1b1, respectively). On the aspect of indication, a missing signature mutation was found only with a virtual haplotype for haplogroup in the “Missing Mutations” column, and D5c-16311 was indicated as D5c + 16311 in EMMA. Although assignment differences between mthap and EMMA showed no statistical significance for Fisher’s exact test (p = 0.4983), EMMA would be more convenient for forensic practice because the emp-file could be uploaded directly without any further transformation and more information could be obtained such as haplotype frequency by origin and by metapopulation and similar haplotypes with a low number of differences.

17

3.2.2.3. Step 3: EMPOP evaluation. Haplotype, haplogroup assignment and raw data (BAM and BAI files) for each sample as well as details of the dataset and geographic background were documented and uploaded to the EMPOP. The whole dataset was independently evaluated by the EMPOP, and any discrepancy from authoritative feedback was inspected base by base again. The final version under .emp format was re-uploaded to the EMPOP and approved with accession number EMP00670 for Northern Chinese Han. 3.3. PGMTM performance 3.3.1. Coverage and strand balance Coverage was extracted by the NextGENe software and strand balance was calculated as the ratio of forward reads to reverse reads (#F/#R). Theoretically, coverage and strand balance at each position should be equal across the complete mtGenome. However, Fig. 2 illustrated that average values from all 107 samples were not dispersed evenly. As regards coverage, mean  standard deviation (SD) was calculated as (1269  966)  for all 107 samples, (1561  618)  for 71 good samples and (632  370)  for 37 poor samples. There was significant difference between good samples and poor samples (p < 0.05, one-way ANOVA). The distribution pattern was also observed in previous studies [24–26] and it would not change over depth of average coverage among individuals in the same population. It was likely to characteristics inherited in the mtGenome and not associated with NGS platforms applied. Herein, positions 500–900 were consistently presented as highest spike, probably due to non-specific amplification by false priming mentioned in Section 3.2.2. Using a 100-bp sliding window, some regions were presented as particularly low coverage, mostly located in HVS and NADH dehydrogenase (ND) coding regions, including positions 300–500 (HVS-II), 3500–3600 (ND1), 5300–5500 (ND2), 10900–11000 (ND4), 12300–12500 (ND5), 13600–13800 (ND5), 14100–14900 (ND6, tRNA and Cytochrome b), 15500–15600 (Cytochrome b) and 16100–16300 (HVS-I). For strand balance, mean  SD was calculated as 1.12  0.46 for all 107 samples, 1.05  0.11 for 71 good samples and 1.22  0.19 for 37 poor samples, which differed significantly between good samples and poor samples (p < 0.05, one-way

Fig. 3. Normalized coverage (coverage at each position divided by average coverage in each sample) and strand balance in areas of derived and inherent homopolymers were separated into groups according to different lengths. For normalized coverage, value < 1 means the coverage was below the average; for strand balance, ratio of #F/#R < 1 means strand bias presented higher toward reverse strand. Asterisk markers indicated 9-bp homopolymers.

18

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

ANOVA). Also, the distribution plot fluctuated across the mtGenome, where average strand bias presented higher toward reverse strand (1/0.59) than toward forward strand (1.36) although there were more forward biases (68.68%) than reverse biases (31.32%). The highest strand bias was calculated as 0.008, which meant #R was 125-fold #F at position 10939 (ND4). At first glance in Fig. 2, most of high reverse strand biases located in the regions with low coverage relative to the rest of the mtGenome. Most importantly, it seemed that most of above regions would coincide with the areas of homopolymers. 3.3.2. Homopolymers In order to investigate the association, normalized coverage (coverage at each position divided by average coverage in each sample) and strand balance in areas of inherent and derived homopolymers were separated into groups according to different lengths. Overall, the tendency of normalized coverage and strand balance declined with the length of homopolymers increased in Fig. 3. However, poly-T homopolymers were departed from the tendency, which were probably accounted for the protocol used. At 8-bp homopolymer, the normalized coverage and strand balance sharply dropped to 0.31 and 0.14, respectively. Herein, we defined the threshold as 487  for low coverage and 0.14 for high strand bias across the mtGenome (gray color bar in Fig. 2). Such single large component (8-bp) of homopolymers could directly impact on coverage at positions 287–474, 12364–12431 and 16187–16292 and strand bias at positions 299–326. Also, continuous homopolymers might have influence on coverage and strand balance. With regard to coverage, it could explain 89.08% (1044/1172) positions with lower than 487  which distributed in positions 3499–3606 with seven of 3–6 mers, positions 5417–5467 with six of 3–4 mers, positions 10885–11007 with eight of 3–6 mers, positions 13636–13781 with thirteen of 3–5 mers, positions 14139–14667 with thirty-eight of 3–7 mers, positions 14758–14820 with five of 4–5 mers, and positions 15521–15544 with four of 3–4 mers. However, the remaining regions (positions 1–38, 5315–5384 and 16550–16569) were not evident with multiple continuous homopolymers. As regards strand balance, a total of 18 regions were identified by less than 0.14 including positions 437–466, 489–503, 519–557, 3552–3581, 5421–5449, 7390–7418, 8607–8621, 10922–10960, 12375–12410, 13649–13692, 13739–13774, 14259–14293, 14337–14355, 14499–14507, 14521–14543, 14618–14636, 15526–15542 and 16262–16278, where all could be explained by multiple continuous homopolymers. The result demonstrated that low coverage and high reverse strand biases were mainly attributed to homopolymers, especially presenting a single large component (8-bp) and/or multiple continuous components in a small region, when PGMTM platform was applied.

3.3.3. Reliability of TVC We evaluated the reliability of the TVC by comparing dataset generated from VCF reports with one approved by EMPOP. The error rate was calculated as number of error variants (including false negative variants and wrongly reported variants) to number of total variants (Cn indels at nps 309, 315 and 16193 were ignored in all of the calculations), which was evaluated as 2.66% (107/4022) of the whole dataset, and 1.37% (37/2692) of 71 good samples and 5.26% (70/1330) of 36 poor samples. There was significantly lower error rate in good samples than that in poor samples (p < 0.05, Chi-square test). Also, the error rate was evaluated as 7.58% (74/ 976) within homopolymer regions and 1.08% (33/3046) outside homopolymer regions, which differed significantly between two regions (p < 0.05). Moreover, it distributed to 1.19% (6/504), 1.90% (4/211), 4.96% (7/141), 32.14% (9/28), 25.00% (2/8), 75.00% (3/4) and 53.75% (43/80) from 3-bp to 9-bp homopolymers. The error rate in  5 bp homopolymers showed significantly lower than that in  6 bp homopolymers (p < 0.05). The results demonstrated TVC was more reliable with 1500  average coverage and 5 bp homopolymer. When it existed with homopolymers 6 bp (especially 8 bp) and average coverage 500  variants should be authenticated by visual inspection in some certain regions and even across the complete mtGenome. 3.4. Population genetics 3.4.1. Variants Totally, 4022 variants were observed at 725 nucleotide positions from 107 Northern Chinese Han, where 3919 (97.44%) were substitutions and 103 (2.56%) were indels. Among these substitutions, 3739 transitions and 179 transversions were observed with 20.89:1 ratio of transitions to transvertions; a total of 24 unique PHPs were found: 6, 1, and 17 of which were in CR, non-coding region and CodR respectively (Table S7); 6 tri-alleles were observed and all were included in the EMPOP and in Soares et al. [20] (Table S8). Though indels usually reflected a low frequency in mtDNA compared to substitutions, 20 insertions were detected and 83 deletions were found. Interestingly, insertion 16241.1T was not included neither in the EMPOP nor in Mitomap database (http://www.mitomap.org), which might be a new variant. Cn insertions at nps 309, 315 and 16193 were ignored in all of the calculations. A “heat-map” (Fig. 2) was generated as described in [24]. Variants 73G, 263G, 750G, 1438G, 2706G, 4769G, 7028T, 8860G, 11719A, 14766T, and 15326G were observed from almost all samples (at least from 103 samples), and 489C, 8701G, 9540C, 10398G, 10400T, 10873C, 12705T, 14783C, 15043A, 15301A, 16223T and 16519C were observed from over half of the samples (from over 58 samples). High frequencies of variants 263G, 750G, 1438G,

Table 3 Summary statistics of mtGenome from 107 Northern Chinese Han. Parameters

# Variantsa # Haplotypesa # Unique haplotypesa Mean # of pairwise differencesa Range of differences HD RMP # Haplogroupsb # Unique haplogroupsb

HVS-I

522 94 84 7.36 [0,14] 0.9967 0.0126 65 43

HVS-I/HVS-II

892 102 98 9.66 [0,19] 0.9989 0.0104 75 56

CR

1102 105 103 11.41 [0,22] 0.9996 0.0097 79 60

CodR

2920 107 107 27.70 [1,50] 1.0000 0.0093 79 62

mtGenome

4022 107 107 39.15 [2,66] 1.0000 0.0093 88 74

Percentage increased HVS-I to HVS-I/HVS-II

HVS-I/HVS-II to CR

CR to mtGenome

8.51% 16.67% 31.25%

2.94% 5.10% 18.12%

1.90% 3.88% 243.12%

0.22% 17.46% 15.38% 30.23%

0.07% 6.73% 5.33% 7.14%

0.04% 4.12% 11.39% 23.33%

a Cn indels at positions 309, 315 and 16193 were not counted in all calculations; 523–524DEL, 521–524DEL, 524.1A and 524.2C, and 8281–8289DEL were treated as one variant respectively; PHPs were treated as variants. b Haplogroups for HVS-I, HVS-I/HVS-II, CR and CodR were assigned by mthap, while haplogroups for mtGenome were approved by the EMPOP.

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

19

4769G, 8860G and 15326G were the reflection of the rCRS, and some variants were specific to haplogroup M, e.g., 489C, 10398G, and 10400T. There were 522 (12.98%) variants within HVS-I (Table 3) which ranked the highest polymorphism density (Fig. 2). Considering the HVS-I and HVS-II together, 892 (22.18%) variants were observed, percentage of which were close to dataset reported in [24]. As expected, high polymorphism density of HVS was a common pattern regardless of population, which was extensively applied in forensic practice.

in Table S10. The highest frequencies of LHP were observed in the common poly-C tracts of the CR, with the most (72.87%) located in positions 311–315. Compared LHP distribution with those in Ramos et al. and Just et al., there were significant differences in HVS-I and II regions in both studies and in the CodR in Just et al. (all p < 0.05, Proportion test), which mainly resulted from population diversity. For example, 29.91% (32/107) individuals were observed with LHPs in HVS-I because T16189C is a haplogroup-specific signature mutation in East Asia [19].

3.4.2. Heteroplasmy In this dataset, 103 individuals (96.26%) presented heteroplasmy, and the remaining 3.74% were fully homoplasmic. A total of 24 PHPs were observed among 22 individuals, or 20.56% (22/ 107) of the individuals studied. One individual (N068) presented 3 PHPs (207R, 7498R and 14502Y). Each position appeared to be heteroplasmic in only a single individual. Compared PHPs with those in some previous studies [21,24,27–31] (Table S9), the distribution differed significantly from CR in Skonieczna et al. [30] (p < 0.05, z = 3.650, Proportion test) and in Just et al. [21] (p = 0.0476, z = 1.668), which were primarily accounted for variety in tissue types, population studied or MAF threshold. In the CR, 16093Y (RMR = 79) could be observed in all compared studies, which might be hotspots in CR with high ranked RMRs. Besides, 207R (RMR = 36) and 228R (RMR = 16) could be observed only in Just et al. In the CodR, 16 of 18 PHPs in our study were unique, and 7498R and13477R could be observed in Just et al. This pattern was similar as earlier report and further confirmed the viewpoint that CodR PHP hotspots might be rare or nonexistent [23]. As regards LHP, 95.33% (102/107) of the individuals were heteroplasmic. A total of 21 individuals had only one LHP, and remaining carried more than one LHP. The LHP regions of mtGenome were presented

3.4.3. Forensic parameters Totally, there were 107 complete mtGenome haplotypes from 107 Northern Chinese Han (Table S11). The RMP was evaluated as 0.009345794, and HD was calculated as 1. Table 3 showed summary statistics of HVS-I only, HVS-I/HVS-II, CR, CodR and the complete mtGenome. By comparison to HVS-I only, the RMP of complete mtGenome decreased 26.19%, while number of haplotypes, number of unique haplotypes, mean number of pairwise differences and haplotype diversity (HD) increased 13.83%, 27.38%, 432.93% and 0.33% respectively. The resolution with sequencing the complete mtGenome was dramatically improved by comparing value from the subsets of the molecule historically targeted for human identification. 3.4.4. Haplogroup resolution As shown in Table 3 and Table S6, haplotypes based on the complete mtGenome had potential on assigning to the most accurate haplogroups compared with HVS-I only, HVS-I/HVS-II, CR and CodR. For example, sample N088 was assigned to D4j7-@16082 or D4j-16311 based on information on the complete mtGenome, however it was misassigned to P6 according to HVS-I/HVS-II. The similar results had been discussed in [24]. Thus,

Fig. 4. Principal component analysis (PCA) map based on mtGenome haplogroup frequencies from 10 populations (East and Southeast Asians: Northern Chinese Han and Filipino; Westeurasians: Armenian, Azeri, Georgian, Iranian, Turk, and U.S. Caucasian; Africans: African American and Zambian).

20

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

haplogroup information from non-complete mtGenome should be carefully employed as additional investigation leads. Fig. 4 showed the PCA map for the first two principal components, which together accounted for 82% of the total variation based on haplogroup frequencies from 10 complete mtGenome dataset (Table S12). In the PCA map, Westeurasian (U.S. Caucasian [21]. Armenian, Azeri, Georgian, Iranian and Turk [32]) were clustered to the 1st and 4th quadrant, distinguishing from Africans, East and Southeast Asians by PC1, whereas East and Southeast Asians (Northern Chinese Han and Filipino [9]) were clustered to the 3rd quadrant, distinguishing from Africans (African American [21] and Zambian [33]) in the 2nd quadrant by PC2. This distribution mirrored the historical and geographical background of populations compared.

Acknowledgements This study was supported by grants (No. 201201ZDYJ001) from the Key Research Project of Ministry of Public Security, China. The authors wish to thank Walther Parson and Simone Nagl from the EMPOP for advices and efforts on the data evaluation. We also thank Qingqing Zhang from Department of Field Bioinformatics Support (FBS) of Thermo Fisher Scientific for advices on Perl compilation. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j. fsigen.2016.01.004.

3.5. Pedigree study References Table S13 demonstrated the number of shared and disparate variants between samples from 5 maternal genealogies (F001– F005) was compared by Perl script Box 4. The number of shared variants from members was 28.58 [21, 44] in the same families and 13.67 [11, 20] among different families, and the number of disparate variants was 3.88 [1, 11] in the same families and 36.02 [23, 51] among different families. After primary and secondary data analysis, only one discrepancy was found in minor allele frequency of 8756Y from F003: C allele frequency from grandma, mother and daughter fluctuated at 24%, 12% and 34% respectively. Otherwise, no mutation was observed from 18 generation in 13 transmissions. The patterns of these variants complied with the maternal genetic rules. 4. Conclusions The article outlines strategies for complete mtGenome sequencing on Ion Torrent PGMTM platform including (1) methodological optimization: amplification of long range PCR, incubation time of enzymatic shearing and the ratio of samples-tochips; (2) data interpretation: workflows on primary and secondary data analyses, validation on haplogroup assignments, PCA based on different populations, comparison study on maternal genealogies and some basic statistics for forensic genetics. Therefore, we believe the NGS technology has powerful potential on complete mtGenome detection compared with traditional method. However, some improvements are still needed if NGS will be applied in forensic sciences. First, the process of pre-sequencing should be refined. According to our experience, 30 samples per week by an individual are produced on PGMTM platform. Most of time is exhausted in the hands-on operations and separated preparations before sequencing. If the process would be automated and integrated, it can satisfy with current forensic demands to a large extent. Second, the algorithms of TSS and parameters of TVC should be fully optimized. Based on current software version 4.0 or above, variants are imprecisely called in some regions, especially in homopolymer structures, and discrepancies in nomenclature are not distinguished between medical and forensic genetics. These deficits must be corrected in a manual way, which is a laborious and tedious task from sample to profile for operators and has influence on the broad range of forensic practice. Some plug-ins should be added, such as alerts on potential false negative variants, options on different nomenclatures, comparisons between samples from separate runs, statistics for forensic genetics, and so forth. Above improvements will make the NGS technology applied more smoothly and conveniently for forensic scientists.

[1] J.M. Butler, Advanced Topics in Forensic DNA Typing: Methodology, Academic Press, Boston, 2011. [2] T.J. Parsons, D.S. Muniec, K. Sullivan, N. Woodyatt, R. Alliston-Greiner, M.R. Wilson, D.L. Berry, K.A. Holland, V.W. Weedn, P. Gill, M.M. Holland, A high observed substitution rate in the human mitochondrial DNA control region, Nat. Genet. 15 (1997) 363–368. [3] G. Chemale, G.G. Paneto, M.A. Menezes, J.M. de Freitas, G.S. Jacques, R.M. Cicarelli, P.R. Fagundes, Development and validation of a D-loop mtDNA SNP assay for the screening of specimens in forensic casework, Forensic Sci. Int. Genet. 7 (2013) 353–358. [4] Y.G. Yao, Q.P. Kong, H.J. Bandelt, T. Kivisild, Y.P. Zhang, Phylogeographic differentiation of mitochondrial DNA in Han Chinese, Am. J. Hum. Genet. 70 (2002) 635–651. [5] P. Gill, P.L. Ivanov, C. Kimpton, R. Piercy, N. Benson, G. Tully, I. Evett, E. Hagelberg, K. Sullivan, Identification of the remains of the Romanov family by DNA analysis, Nat. Genet. 6 (1994) 130–135. [6] J.A. Irwin, J.L. Saunier, P. Beh, K.M. Strouss, C.D. Paintner, T.J. Parsons, Mitochondrial DNA control region variation in a population sample from Hong Kong, China, Forensic Sci. Int. Genet. 3 (2009) e119–e125. [7] J.L. Saunier, J.A. Irwin, K.M. Strouss, H. Ragab, K.A. Sturk, T.J. Parsons, Mitochondrial control region sequences from an Egyptian population sample, Forensic Sci. Int. Genet. 3 (2009) e97–e103. [8] S. Köhnemann, H. Pfeiffer, Application of mtDNA SNP analysis in forensic casework, Forensic Sci. Int. Genet. 5 (2011) 216–221. [9] E.D. Gunnarsdóttir, M. Li, M. Bauchet, K. Finstermeier, M. Stoneking, Highthroughput sequencing of complete human mtDNA genomes from the Philippines, Genome Res. 21 (2011) 1–11. [10] R.M. Andrews, I. Kubacka, P.F. Chinnery, R.N. Lightowlers, D.M. Turnbull, N. Howell, Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA, Nat. Genet. 23 (1999) 147. [11] J.T. Robinson, H. Thorvaldsdóttir, W. Winckler, M. Guttman, E.S. Lander, G. Getz, J.P. Mesirov, Integrative genomics viewer, Nat. Biotechnol. 29 (2011) 24–26. [12] W. Parson, C. Strobl, G. Huber, B. Zimmermann, S.M. Gomes, L. Souto, L. Fendt, R. Delport, R. Langit, S. Wootton, R. Lagacé, J. Irwin, Reprint of: evaluation of next generation mtGenome sequencing using the Ion torrent personal genome machine (PGM), Forensic Sci. Int. Genet. 7 (2013) 632–639. [13] A.W. Röck, A. Dür, M. van Oven, W. Parson, Concept for estimating mitochondrial DNA haplogroups using a maximum likelihood approach (EMMA), Forensic Sci. Int. Genet. 7 (2013) 601–609. [14] W. Parson, A. Dür, EMPOP—a forensic mtDNA database, Forensic Sci. Int. Genet. 1 (2007) 88–92. [15] M. van Oven, M. Kayser, Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation, Hum. Mutat. 30 (2009) e386–e394. [16] R. Kolpakov, G. Bana, G. Kucherov, mreps. efficient and flexible detection of tandem repeats in DNA, Nucleic Acids Res. 31 (2003) 3672–3678. [17] L. Fendt, B. Zimmermann, M. Daniaux, W. Parson, Sequencing strategy for the whole mitochondrial genome resulting in high quality sequences, BMC Genomics 10 (2009) 139. [18] W.G. Miller, OpenStat, Iowa State University, 2008. [19] W. Parson, L. Gusmão, D.R. Hares, J.A. Irwin, W.R. Mayr, N. Morling, E. Pokorak, M. Prinz, A. Salas, P.M. Schneider, T.J. Parsons, DNA Commission of the International Society for Forensic Genetics: revised and extended guidelines for mitochondrial DNA typing, Forensic Sci, Int. Genet. 13 (2014) 134–142. [20] P. Soares, L. Ermini, N. Thomson, M. Mormina, T. Rito, A. Röhl, A. Salas, S. Oppenheimer, V. Macaulay, M.B. Richards, Correcting for purifying selection: an improved human mitochondrial molecular clock, Am. J. Hum. Genet. 84 (2009) 740–759. [21] R.S. Just, M.K. Scheible, S.A. Fast, K. Sturk-Andreaggi, A.W. Röck, J.M. Bush, J.L. Higginbotham, M.A. Peck, J.D. Ring, G.E. Huber, C. Xavier, C. Strobl, E.A. Lyons, T. M. Diegoli, M. Bodner, L. Fendt, P. Kralj, S. Nagl, D. Niederwieser, B. Zimmermann, W. Parson, J.A. Irwin, Full mtGenome reference data:

Y. Zhou et al. / Forensic Science International: Genetics 22 (2016) 11–21

[22]

[23]

[24]

[25]

[26]

[27]

development and characterization of 588 forensic-quality haplotypes representing three U.S. populations, Forensic Sci. Int. Genet. 14 (2015) 141–155. J. Ye, G. Coulouris, I. Zaretskaya, I. Cutcutache, S. Rozen, T.L. Madden, PrimerBLAST: a tool to design target-specific primers for polymerase chain reaction, BMC Bioinf. 13 (2012) 134. R.S. Just, J.A. Irwin, W. Parson, Mitochondrial DNA heteroplasmy in the emerging field of massively parallel sequencing, Forensic Sci. Int. Genet. 18 (2015) 131–139. J.L. King, B.L. LaRue, N.M. Novroski, M. Stoljarova, S.B. Seo, X. Zeng, D.H. Warshauer, C.P. Davis, W. Parson, A. Sajantila, B. Budowle, High-quality and high-throughput massively parallel sequencing of the human mitochondrial genome using the Illumina MiSeq, Forensic Sci. Int. Genet. 12 (2014) 128–135. W. Parson, G. Huber, L. Moreno, M.B. Madel, M.D. Brandhagen, S. Nagl, C. Xavier, M. Eduardoff, T.C. Callaghan, J.A. Irwin, Massively parallel sequencing of complete mitochondrial genomes from hair shaft samples, Forensic Sci. Int. Genet. 15 (2015) 8–15. J.A. McElhoe, M.M. Holland, K.D. Makova, M.S. Su, I.M. Paul, C.H. Baker, S.A. Faith, B. Young, Development and assessment of an optimized next-generation DNA sequencing approach for the mtgenome using the Illumina MiSeq, Forensic Sci. Int. Genet. 13 (2014) 20–29. A. Ramos, C. Santos, L. Mateiu, Mdel. M. Gonzalez, L. Alvarez, L. Azevedo, A. Amorim, M.P. Aluja, Frequency and pattern of heteroplasmy in the complete human mitochondrial genome, PLoS One 8 (2013) e7436.

21

[28] M. Li, A. Schönberg, M. Schaefer, R. Schroeder, I. Nasidze, M. Stoneking, Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes, Am. J. Hum. Genet. 87 (2010) 237–249. [29] M. Li, R. Schröder, S. Ni, B. Madea, M. Stoneking, Extensive tissue-related and allele-related mtDNA heteroplasmy suggests positive selection for somatic mutations, Proc. Natl. Acad. Sci. U.S.A. 112 (2015) 2491–2496. [30] K. Skonieczna, B. Malyarchuk, A. Jawien, A. Marszałek, Z. Banaszkiewicz, P. Jarmocik, M. Borcz, P. Bała, T. Grzybowski, Heteroplasmic substitutions in the entire mitochondrial genomes of human colon cells detected by ultra-deep 454 sequencing, Forensic Sci. Int. Genet. 15 (2015) 16–20. [31] M.X. Sosa, I.K. Sivakumar, S. Maragh, V. Veeramachaneni, R. Hariharan, M. Parulekar, K.M. Fredrikson, T.T. Harkins, J. Lin, A.B. Feldman, P. Tata, G.B. Ehret, A. Chakravarti, Next-generation sequencing of human mitochondrial reference genomes uncovers high heteroplasmy frequency, PLoS One 8 (2012) e1002737. [32] A. Schönberg, C. Theunert, M. Li, M. Stoneking, I. Nasidze, High-throughput sequencing of complete human mtDNA genomes from the Caucasus and West Asia: high diversity and demographic inferences, Eur. J. Hum. Genet. 19 (2011) 988–994. [33] C. Barbieri, A. Butthof, K. Bostoen, B. Pakendorf, Genetic perspectives on the origin of clicks in Bantu languages from southwestern Zambia, Eur. J. Hum. Genet. 21 (2013) 430–436.