Platform-independent models for age prediction using DNA methylation data


Forensic Science International: Genetics 38 (2019) 39–47

Forensic Science International: Genetics journal homepage: www.elsevier.com/locate/fsigen

Sae Rom Hong a,b, Kyoung-Jin Shin a,b, Sang-Eun Jung a, Eun Hee Lee a, Hwan Young Lee a,c,⁎

a Department of Forensic Medicine, Yonsei University College of Medicine, 50-1 Yonsei-ro, Seodaemun-gu, Seoul, 03722, South Korea
b Brain Korea 21 PLUS Project for Medical Science, Yonsei University, 50-1 Yonsei-ro, Seodaemun-gu, Seoul, 03722, South Korea
c Department of Forensic Medicine, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea

ARTICLE INFO

Keywords: Age prediction; DNA methylation; MPS; Methylation SNaPshot; Neural network

ABSTRACT

Age prediction has been in the spotlight recently because it can provide important information about the contributors of biological evidence left at crime scenes. Many researchers have proposed age-prediction models based on DNA methylation at several CpG sites and tested the candidates on platforms such as the HumanMethylation450 array and pyrosequencing. With DNA methylation data obtained from each platform, age prediction models were constructed using diverse statistical methods, typically multivariate linear regression. However, because each model is based on single-platform data, prediction accuracy drops when the model is applied to DNA methylation data obtained from other platforms. In this study, bisulfite sequencing data for 95 saliva samples were generated using massively parallel sequencing (MPS) and compared with methylation SNaPshot data from the same 95 individuals. The predicted age obtained by applying MPS data to an age-prediction model built for methylation SNaPshot data differed greatly from the chronological age because of platform differences. Therefore, novel variables were introduced to indicate the platform type, and platform-independent age prediction models were constructed using a neural network and multivariate linear regression. The final neural network model had a mean absolute deviation (MAD) of 3.19 years between the predicted and chronological age, and a mean absolute percentage error (MAPE) of 8.89% in the test set. Similarly, the linear regression model showed a MAD of 3.69 years and a MAPE of 10.44% in the same test set. Because it relies on platform variables, the platform-independent age-prediction model can be extended to additional platforms, and the idea of platform variables can be applied to age prediction models for other body fluids.

1. Introduction

Age prediction is a topic of interest in forensic science because it can turn biological evidence left at a crime scene into an investigative lead about the contributor [1,2]. Several age-predictive biological markers have been introduced, such as telomere length [3], mitochondrial DNA (mtDNA) deletion [4], advanced glycation end-products [5], signal-joint T-cell receptor excision circles [6,7], microRNA [8] and DNA methylation (DNAm). Among them, DNAm is considered one of the most promising biomarkers because of its high accuracy in age prediction [1,2,6]. Therefore, many researchers have suggested age-prediction models using the DNAm of forensically relevant body fluids such as blood [9–17], semen [18,19] and saliva [20–22]. Blood, as typical crime scene evidence, is the most studied subject in age prediction, and DNAm age prediction models using blood have been proposed on various platforms, including the Illumina HumanMethylation27/450 BeadChip array [9], pyrosequencing [10,12–14] and massively parallel sequencing (MPS) [16,17]. In the case of semen, Lee et al. [18] constructed an age-predictive model using the DNAm of three CpG sites based on the methylation SNaPshot method, and this model was validated with forensic casework samples in [19]. Models for saliva have used methods such as the 27 K/450 K array [20], methylation-sensitive high resolution melting (MS-HRM) [21] and methylation SNaPshot [22].

Just as the studies are diverse, there are also many ways to construct age-predictive models, such as an elastic net [9,23], multivariate linear regression (MLR) [10,12–14,18,22], and support vector regression (SVR) [15,21]. A milestone work by Horvath [23] exploited an elastic net (penalized regression) method to build a model for 27 K/450 K array data from various tissues and showed high accuracy (3.6 years of

⁎ Corresponding author at: Department of Forensic Medicine, Seoul National University College of Medicine, 103 Daehak-ro, Jongno-gu, Seoul, 03080, South Korea. E-mail address: [email protected] (H.Y. Lee).

https://doi.org/10.1016/j.fsigen.2018.10.005 Received 20 April 2018; Received in revised form 7 September 2018; Accepted 8 October 2018 Available online 09 October 2018 1872-4973/ © 2018 Elsevier B.V. All rights reserved.

Severance Hospital, Yonsei University in Seoul, Korea. Participants collected their own samples using an Oragene™ DNA self-collection kit (DNA Genotek Inc., Ottawa, Canada). All samples were stored at room temperature. DNA was extracted from 200 μL aliquots of the samples using the QIAamp® DNA Mini Kit (Qiagen, Hilden, Germany) and quantified with the Quantifiler® Duo Kit (Thermo Fisher Scientific Inc., Waltham, MA, USA). For MPS analysis, bisulfite-converted DNA was freshly obtained by modifying 200 ng of genomic DNA with the Imprint® DNA Modification Kit (Sigma-Aldrich Inc., St. Louis, MO, USA) following the manufacturer's protocol and eluted with 20 μL of nuclease-free water. Converted DNA was stored at −80 °C and used within 1 to 24 h. Additionally, Infinium HumanMethylation450 BeadChip array (Illumina, San Diego, CA, USA) data (450 K) of 54 male saliva samples (accession number: GSE92767) were downloaded from the NCBI GEO database for further modelling.

median absolute deviation). Hannum et al. [9] applied the same method for age prediction to 450 K data from the blood of 656 individuals. Their model required DNAm values of 71 age-associated CpG sites and showed 4.9 years of mean absolute deviation (MAD) from actual ages.

MLR is one of the most commonly used statistical methods for predicting age with DNAm data [10,12–14,18,22]. Using the MLR method and pyrosequencing, Weidner et al. [10] proposed an age prediction model in blood that included only three CpG sites. Similarly, Zbiec-Piekarska et al. [12], Park et al. [13], and Cho et al. [14] all constructed models for blood data by applying MLR with pyrosequencing and achieved 3.9, 3.4, and 3.3 years of MAD, respectively, from chronological ages. A preliminary study [18] of age prediction in semen, as well as blood, was also performed using MLR based on methylation SNaPshot data, and that model was validated with forensic casework samples by Lee et al. [19] (MAD = 5.2 years). Hong et al. [22] also used MLR to construct an age prediction model for saliva with methylation SNaPshot data, and that model showed high accuracy (MAD = 3.15 years, RMSE = 4.34 years).

The SVR method was introduced as another approach to forensic age prediction by Xu et al. [15]. Using EpiTYPER data from blood, they compared four analysis methods: MLR, multivariate nonlinear regression, back propagation neural network and SVR. The SVR model showed the lowest MAD, 2.8 years, among those four models. Hamano et al. [21] also used the SVR method to construct an age prediction model using MS-HRM data from saliva.

In recent years, breakthroughs in machine learning have enabled several studies to use novel analysis methods for age prediction. A random forest, an ensemble learning method that constructs multiple decision trees for classification or regression [24], was used by Naue et al. [16] and showed high accuracy (3.24 years of MAD) with DNAm data obtained from MPS. Moreover, Vidaki et al. [17] proposed a generalized regression neural network (GRNN) model with HumanMethylation27/450 data from blood (mean absolute error = 4.6 years). Furthermore, they applied MPS data from blood to the GRNN model and achieved a mean absolute error of 7.45 years from chronological age, suggesting that the model could be applicable to other platforms. However, the accuracy was quite different between the two analysis platforms. Recently, Feng et al. [25] constructed age prediction models for blood using DNAm levels obtained from EpiTYPER with MLR, SVR, and an artificial neural network (ANN). They also applied pyrosequencing data to the EpiTYPER-based model and obtained 4.03 years of MAD. In addition, Feng et al. suggested a z-score transformation to reduce those prediction errors, and the transformed model showed 2.76 years of MAD.

In this study, we obtained bisulfite sequencing MPS data from 95 saliva samples, compared the DNAm values against the methylation SNaPshot data from [22], and applied both data sets to an age-prediction model created for methylation SNaPshot data for validation. Then, to construct platform-independent age prediction models, we introduced platform variables to build neural network and MLR models.

2.2. Bisulfite sequencing using massively parallel sequencing (MPS)

2.2.1. Library construction

Based on the simplified library preparation method of Lee et al. [26], we adopted two-step PCR amplification to construct a library for the Illumina system for six age-associated CpG sites (cg00481951, cg19671120, cg14361627, cg08928145, cg12757011, and cg07547549) and one cell type-specific CpG site (cg18384097) from Hong et al. [22]. The two rounds of PCR amplification for library generation used primers modified from the Nextera® Sample Preparation Kit (Illumina).

The first PCR was intended to generate the amplicons of the seven CpG markers from our previous study [22], so its primers combined read sequences with CpG marker targeting sequences, which were the same as the PCR primers of the methylation SNaPshot (Supplementary Table 1). Using those primers, multiplex PCR reactions were performed in 20 μL reaction volumes containing 1–2 μL of bisulfite-converted DNA (freshly converted), 3 U of AmpliTaq Gold® DNA Polymerase (Thermo Fisher Scientific), 2 μL of Gold ST*R 10× Buffer (Promega, Madison, WI, USA), and 0.4–2.0 μM of each primer (Supplementary Table 1). PCR cycling was conducted using a Veriti™ 96-Well Thermal Cycler under the following conditions: 95 °C for 11 min; 27 cycles of 94 °C for 20 s, 56 °C for 60 s, and 72 °C for 30 s; and a final extension at 72 °C for 7 min.

The second PCR was performed to attach the indices and platform-specific sequences using the primer sequences listed in Supplementary Table 1. In this step, 1 μL of 10-fold diluted PCR product, 2 μL of Gold ST*R 10× Buffer (Promega), 2 μL each of i5 and i7 adapters from the Nextera® Index Kit (Illumina) and 1.5 U of AmpliTaq Gold® DNA Polymerase (Thermo Fisher Scientific) were included in a 20 μL reaction volume. PCR cycling was conducted using a Veriti™ 96-Well Thermal Cycler under the following conditions: 95 °C for 15 min; 15 cycles of 94 °C for 20 s, 61 °C for 30 s, and 72 °C for 45 s; and a final extension at 72 °C for 5 min. Following PCR cleanup with 1.2× Agencourt® AMPure® XP beads (Beckman Coulter Inc., Indianapolis, IN, USA), the libraries were quantified using KAPA library quantification kits (KAPA Biosystems Inc., Wilmington, MA, USA). The size of the amplicons was then checked using an Agilent 2100 Bioanalyzer and a DNA 1000 Kit (both Agilent Technologies, Inc., Santa Clara, CA, USA).

2. Materials and methods

2.2.2. MPS sequencing and DNAm extraction

Sequencing was conducted on a MiSeq® system using the MiSeq Reagent Kit v3 (600 cycles) (Illumina). The obtained fastq data were trimmed with cutadapt v1.9.1 (available online at: https://cutadapt.readthedocs.org/, RRID:SCR_011841), and the base quality was checked with FastQC v0.11.4 (available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, RRID:SCR_014583). A sample with a low read depth (less than 1000) was excluded from the following analyses. The reads obtained from 95 individuals were aligned to in silico bisulfite-converted genomic reference sequences of

2.1. Sample selection and preparation

In our previous work [22], we performed targeted bisulfite sequencing for seven amplicons using the methylation SNaPshot method on 226 saliva samples, which we partitioned into a training set of 113 samples and a test set of the remaining 113 samples. To validate the model using a different platform, we selected 96 individuals from the test set for this study. The 96 saliva donors, 48 males and 48 females, were aged 18 to 65 years. The samples were obtained under the supervision of the Institutional Review Board of

Fig. 1. A schematic diagram of the workflow. We used 10 ng of bisulfite-converted DNA as a template and conducted multiplex PCR for seven markers using both the methylation SNaPshot [22] (upper part) and MPS (lower part) methods. The primers differed between the two methods: the methylation SNaPshot method used general primers, whereas the MPS method used the same primers with read sequences attached. In the methylation SNaPshot method, single base extension (SBE) was used to obtain electropherograms of the seven CpG sites. In the MPS method, indexing PCR with an index kit was followed by data analysis using Bismark and R.

of the 95 individuals were taken from [22].

seven markers using Bismark v0.13.0 (RRID:SCR_005604) [27] and bowtie2 [28] with the following parameters: base quality Q > 32, paired-end, non-directional alignment, and other conditions set to default values. The DNAm values of the CpG and the non-CpG sites were extracted using the bismark-methylation-extractor and SAMtools v1.3.1 (available online at: http://www.htslib.org/, RRID:SCR_002105) [29]. Because bisulfite conversion is a key step in DNAm analysis, the bisulfite conversion efficiency of each sample was calculated by subtracting the non-CpG DNAm level from 100%.
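The conversion-efficiency calculation described above can be sketched as follows. This is a minimal illustration, assuming that non-CpG cytosines are counted as methylated (read as C) or unmethylated (read as T) calls; the function name and the example counts are hypothetical and are not taken from the study's pipeline.

```python
def conversion_efficiency(non_cpg_methylated: int, non_cpg_unmethylated: int) -> float:
    """Return the bisulfite conversion efficiency in percent.

    Non-CpG cytosines are assumed to be unmethylated, so any residual
    methylation call at those positions is treated as failed conversion.
    """
    total = non_cpg_methylated + non_cpg_unmethylated
    if total == 0:
        raise ValueError("no non-CpG cytosine calls available")
    non_cpg_level = 100.0 * non_cpg_methylated / total  # non-CpG DNAm level (%)
    return 100.0 - non_cpg_level


# Hypothetical counts: 50 unconverted calls out of 10,000 non-CpG cytosines.
print(f"{conversion_efficiency(50, 9950):.1f}%")
```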

2.3.3. Construction of platform-independent age prediction model

To generate an integrative age-prediction model, we set the platform type as a variable and included DNAm values obtained from both platforms, MPS and methylation SNaPshot. Data were separated into two sets using the caret R package: a training set containing 154 samples (77 MPS and 77 methylation SNaPshot) and a test set containing 36 samples (18 MPS and 18 methylation SNaPshot). We then performed MLR and neural network (NN) analyses on the training set. Because of the characteristics of the NN, we rescaled the chronological age to the range 0 to 1 through min-max scaling. The scaling range could be wider or narrower depending on the data; here we set the maximum to 65 years (the oldest donor) and the minimum to 18 years (the youngest donor). The predicted age in the NN model was therefore obtained by multiplying the NN output by 47, the width of this range, to convert it back into age. The models, using resilient backpropagation NN regression (neuralnet R package) and MLR, respectively, were trained with 5-fold cross-validation repeated 10 times on the training set using the caret R package. The tuned models were validated with the test set.

For further modelling, we applied the same method to DNAm data from three platforms: MPS, methylation SNaPshot, and 450 K. The training set contained 197 samples (77 MPS, 79 SNaPshot, and 41 450 K) and the test set contained 47 samples (18 MPS, 16 SNaPshot, and 13 450 K). The chronological age was converted using min-max scaling with a minimum of 18 years and a maximum of 73 years. Both the NN and MLR models were trained with 5-fold cross-validation repeated 10 times.
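The age scaling and the platform variable described in this section can be sketched in a few lines. This is an illustrative Python sketch, not the R code used in the study. The helper names are hypothetical, and the inverse transform assumes the standard min-max form, in which the minimum age is added back after multiplying by the range; the text above mentions only the multiplication by 47. The 1/0 coding of the platform variable follows the footnote of Table 2.

```python
AGE_MIN, AGE_MAX = 18.0, 65.0  # youngest and oldest donors in the two-platform data


def scale_age(age: float) -> float:
    """Min-max scale a chronological age to the range 0 to 1."""
    return (age - AGE_MIN) / (AGE_MAX - AGE_MIN)


def unscale_age(scaled: float) -> float:
    """Convert a scaled model output back to years (standard min-max inverse)."""
    return scaled * (AGE_MAX - AGE_MIN) + AGE_MIN


def make_features(dnam_values, platform: str):
    """Append the platform variable (1 = MPS, 0 = SNaPshot) to the DNAm values."""
    platform_code = {"MPS": 1.0, "SNaPshot": 0.0}[platform]
    return list(dnam_values) + [platform_code]


# Hypothetical sample: seven DNAm values measured by MPS for a 40-year-old donor.
features = make_features([0.12, 0.31, 0.22, 0.09, 0.55, 0.40, 0.18], "MPS")
target = scale_age(40)
print(features, round(target, 3), round(unscale_age(target), 1))
```

Adding a third platform such as the 450 K array would need one more indicator column and a wider age range (18 to 73 years in the three-platform model).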

2.3. Data analysis

2.3.1. Analysis of correlations with age

Because each amplicon contained several CpG sites in addition to the markers formerly targeted by the multiplex methylation SNaPshot method, we assessed correlations between chronological age and the DNAm of each CpG site in all seven amplicons. Using the statistical package IBM SPSS Statistics for Windows, Version 23.0 (IBM Corp., Armonk, NY, USA), we calculated Pearson's correlation coefficient (Pearson's r) for all 62 CpG sites. We followed a rule of thumb for interpreting r [30]: for |r|, 0 to 0.3 indicates negligible correlation, 0.3 to 0.5 low correlation, 0.5 to 0.7 moderate correlation, 0.7 to 0.9 high correlation, and 0.9 to 1 very high correlation.

2.3.2. Comparison with a methylation SNaPshot

For the seven CpG markers used in the age prediction model in [22], we compared the DNAm values from 95 saliva samples obtained using MPS and the methylation SNaPshot method. To assess whether the age prediction model constructed with the multiplex methylation SNaPshot method is applicable to MPS data, we applied the DNAm values of the same seven CpGs from the MPS result to the SNaPshot model in [22]. The predicted ages from MPS data and methylation SNaPshot data were then compared to determine the differences. The DNAm values obtained from the methylation SNaPshot method and the predicted ages

3. Results

3.1. Bisulfite sequencing using MPS

Bisulfite sequencing for the six age-associated CpG sites (cg00481951, cg19671120, cg14361627, cg08928145, cg12757011,

0.636). In Fig. 2, the DNAm patterns of the seven overlapping CpGs are plotted to show the correlations between the DNAm measures obtained from the MPS and SNaPshot methods. In all plots, most of the samples were above the y = x line, where the x-axis and y-axis designate the DNAm values from the MPS and methylation SNaPshot methods, respectively. Thus, the methylation SNaPshot data showed higher DNAm values than the MPS data. Because both data sets were obtained from the same samples, the difference between the MPS and methylation SNaPshot data probably comes from the difference in platforms.

and cg07547549) and one cell type-specific CpG site (cg18384097) from Hong et al. [22], which had previously been analyzed using the methylation SNaPshot assay (upper part of Fig. 1), was carried out using MPS. The procedure is shown in the lower part of the schematic diagram in Fig. 1.

3.1.1. MPS coverage

In this study, a total of 672 bisulfite-converted DNA fragments from 96 saliva samples were sequenced. Approximately 8.3 Gb of sequences were obtained, and more than 92% of bases had a base quality > Q32. However, one sample from a 40-year-old male showed a low depth of coverage, less than 1000x, which is below the depth that Masser et al. [31] recommended for quantitative DNAm analysis. Because of technical problems with library preparation, the data generated from that sample could not be aligned to the reference sequences using the Bismark program, so it was excluded from further data processing. Therefore, 95 samples were analyzed for DNAm and correlation with chronological age. On average, 66 million bases per sample were mapped using Bismark and bowtie2. The average coverage per amplicon using the paired-end option was about 30,000x, ranging from 3000x to 90,000x (Supplementary Figs. S1 and S2). The mean read depth of the A2 amplicon (including cg00481951) was the lowest among the seven amplicons, and that of the A1 amplicon (including cg18384097) was the highest. Because MPS data show the sequences of the amplicons, it is possible to check the bisulfite conversion efficiency of each sample. The average conversion rate was 99.5%, ranging from 98.4% to 99.9%.

3.3. SNaPshot model

The age prediction model for saliva suggested in [22] was constructed using the methylation SNaPshot data from 113 samples and validated with data from an independent test set also composed of 113 samples. Because the SNaPshot samples we selected were from the test set in [22], both the SNaPshot data and MPS data for the seven CpG markers could be applied to the model (Supplementary Table 3). The predicted age using the MPS data showed a high error: the MAD between the chronological age and predicted age was 23.42 years, and the root mean square error (RMSE) was 25.18 years. Those values are much higher than those obtained using methylation SNaPshot data from the same 95 samples: 3.11 years of MAD and 4.37 years of RMSE. Despite those differences, the slopes of the predicted ages using the MPS and SNaPshot data were quite similar, and the DNAm obtained from the MPS and methylation SNaPshot data explained 83.0% and 90.2% of the variance in age, respectively (Fig. 2(h)).

3.1.2. Age correlations of CpG sites

The DNAm profiles of 62 CpG sites within the seven amplicons are listed in Supplementary Table 2; they were analyzed to determine the correlations between DNAm and chronological age (Table 1). Among the 62 CpG sites, A2_CpG_3 (cg00481951), A2_CpG_2, A7_CpG_9 (cg07547549), and A4_CpG_6 (cg14361627) were the four most age-correlated CpG loci; their Pearson's r values were 0.814, 0.799, 0.769, and 0.756, respectively. As shown in Table 1, most CpG sites located within the same amplicon exhibited a similar age-correlation, but some CpGs showed age associations different from those of adjacent CpGs within the same amplicon. For example, in the A3 amplicon, the front region (CpG_1 to 6) was not associated with age, showing p-values higher than 0.05, but the rear region (CpG_7 to 15), including cg19671120, did show low or moderate age-association (r from 0.325 to 0.567, and p-value ≤ 0.001). Both ends of the A6 amplicon (A6_CpG_1 and 5) showed p-values higher than 0.05, but the middle region (CpG_2 to 4) presented p-values smaller than 0.05 and r from 0.230 to 0.489. The A7 amplicon DNAm pattern differed from the A3 and A6 patterns; only one CpG site (A7_CpG_2) was not correlated with age. Specifically, A7_CpG_2 was located 6 bp upstream of A7_CpG_1 and 9 bp downstream of A7_CpG_3, but the age-correlation status of CpG_2 differed markedly from that of the two neighboring CpGs. On the other hand, on A1, all CpGs, including the cell type-specific marker cg18384097, showed no age-correlation (p-value > 0.05), but their methylation values were highly and linearly correlated with each other; their R2 values with cg18384097 were higher than 0.97.
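The correlation analysis and the rule of thumb from Section 2.3.1 can be illustrated with a short sketch. The study used SPSS; the Python functions below are only an illustrative stand-in with hypothetical names, and the handling of values that fall exactly on a category boundary is an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interpret_r(r: float) -> str:
    """Label |r| using the rule of thumb cited in Section 2.3.1."""
    strength = abs(r)
    if strength < 0.3:
        return "negligible"
    if strength < 0.5:
        return "low"
    if strength < 0.7:
        return "moderate"
    if strength < 0.9:
        return "high"
    return "very high"


# Pearson's r values reported in Table 1 for three CpG sites.
for site, r in [("A2_CpG_3", 0.814), ("A3_CpG_15", 0.567), ("A1_CpG_1", -0.179)]:
    print(site, r, interpret_r(r))
```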

3.4. Construction of platform-independent age prediction models

By setting the platform type as a variable alongside the DNAm value of each marker, we constructed integrated age-prediction models using two methods, an NN and MLR. Through cross-validation, the NN was tuned to have five neurons on layer 1 and two neurons on layer 2, and the MLR model was tuned as well (Table 2, Supplementary Table 4). As shown in Fig. 3 and Table 2, both the NN and MLR models predicted ages with high accuracy using both MPS and methylation SNaPshot data. Interestingly, the mean absolute percentage error (MAPE), MAD, and RMSE were slightly higher in the MLR model than in the NN. MAD merely represents the error between chronological age and predicted age, whereas MAPE indicates the ratio of the error to chronological age; thus, it could provide more accurate information when comparing models. Although the number of samples was small, the models appeared to predict age with high accuracy. Likewise, both the NN and MLR models for MPS, methylation SNaPshot, and 450 K data presented high accuracy regardless of platform (Supplementary Figure S3 and Supplementary Table 5). Although the number of 450 K samples is smaller than that of the MPS and SNaPshot samples, those age-prediction models appear to predict ages quite precisely regardless of platform.
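The three error measures compared above can be computed as in the following sketch. The function names and the example ages are hypothetical; the formulas are the usual definitions of mean absolute deviation, root mean square error and mean absolute percentage error.

```python
import math


def mad(actual, predicted):
    """Mean absolute deviation between predicted and chronological age (years)."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error (years)."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error: each error relative to chronological age (%)."""
    return 100.0 * sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical chronological and predicted ages.
chronological = [20, 40, 50]
predicted = [22, 37, 55]
print(mad(chronological, predicted), rmse(chronological, predicted), mape(chronological, predicted))
```

The same absolute error weighs more heavily in MAPE for a young donor than for an old one: a 2-year error is 10% at age 20 but only 4% at age 50.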

3.2. Comparison with the methylation SNaPshot

Table 1 compares seven loci (cg18384097, cg00481951, cg19671120, cg14361627, cg08928145, cg12757011 and cg07547549) with the methylation SNaPshot targets from Hong et al. [22]. Interestingly, five of the six age-correlated markers in the methylation SNaPshot data showed the highest age-correlation within their amplicon in the MPS data: A2_CpG_3 (cg00481951), A3_CpG_15 (cg19671120), A4_CpG_6 (cg14361627), A6_CpG_4 (cg12757011), and A7_CpG_9 (cg07547549). On the other hand, one neighboring CpG site in the A5 amplicon had a higher Pearson's r value than the overlapping target: A5_CpG_3 (r = 0.662) vs. A5_CpG_11 (cg08928145, r =

4. Discussion

Using the same primers, MPS can provide more information about all the CpG sites in the amplicons than the methylation SNaPshot, which can target only a few CpG sites. However, unlike the SNaPshot method, which can be easily incorporated into the workflow of general forensic genetic laboratories because it is based on capillary electrophoresis (CE) [22], MPS is burdensome and expensive to process. In addition, depending on the experiment, the user must use a commercial tool or have a certain level of computer skills, especially when using in-house-developed methods, as in this

Table 1. List of 62 CpG sites by amplicon.

Amplicon  Gene     CpG     Gene location(b)  Target(c)  Pearson's r  p-value  CpG ID(a)
A1        PTPN7    CpG_1   chr1:202129633               −0.179       0.082
A1        PTPN7    CpG_2   chr1:202129623               −0.163       0.114
A1        PTPN7    CpG_3   chr1:202129617               −0.162       0.116
A1        PTPN7    CpG_4   chr1:202129583               −0.150       0.146
A1        PTPN7    CpG_5   chr1:202129575               −0.163       0.114
A1        PTPN7    CpG_6   chr1:202129566    v          −0.180       0.081    cg18384097
A2        SST      CpG_1   chr3:187387664               0.682        <0.001
A2        SST      CpG_2   chr3:187387659               0.799        <0.001
A2        SST      CpG_3   chr3:187387650    v          0.814        <0.001   cg00481951
A2        SST      CpG_4   chr3:187387621               0.501        <0.001
A2        SST      CpG_5   chr3:187387601               0.421        <0.001
A2        SST      CpG_6   chr3:187387572               0.311        0.002
A2        SST      CpG_7   chr3:187387555               0.381        <0.001
A2        SST      CpG_8   chr3:187387552               0.478        <0.001
A3        CNGA3    CpG_1   chr2:98962874                0.187        0.070
A3        CNGA3    CpG_2   chr2:98962877                0.067        0.521
A3        CNGA3    CpG_3   chr2:98962894                0.104        0.316
A3        CNGA3    CpG_4   chr2:98962900                −0.034       0.747
A3        CNGA3    CpG_5   chr2:98962902                0.135        0.191
A3        CNGA3    CpG_6   chr2:98962914                0.194        0.060
A3        CNGA3    CpG_7   chr2:98962918                0.433        <0.001
A3        CNGA3    CpG_8   chr2:98962926                0.508        <0.001
A3        CNGA3    CpG_9   chr2:98962938                0.336        0.001
A3        CNGA3    CpG_10  chr2:98962950                0.325        0.001
A3        CNGA3    CpG_11  chr2:98962955                0.501        <0.001
A3        CNGA3    CpG_12  chr2:98962967                0.483        <0.001
A3        CNGA3    CpG_13  chr2:98962969                0.521        <0.001
A3        CNGA3    CpG_14  chr2:98962972                0.560        <0.001
A3        CNGA3    CpG_15  chr2:98962974     v          0.567        <0.001   cg19671120
A4        KLF14    CpG_1   chr7:130419173               0.261        0.011
A4        KLF14    CpG_2   chr7:130419159               0.492        <0.001
A4        KLF14    CpG_3   chr7:130419136               0.557        <0.001
A4        KLF14    CpG_4   chr7:130419133               0.631        <0.001
A4        KLF14    CpG_5   chr7:130419118               0.650        <0.001
A4        KLF14    CpG_6   chr7:130419116    v          0.756        <0.001   cg14361627
A5        TSSK6    CpG_1   chr19:19625300               0.596        <0.001
A5        TSSK6    CpG_2   chr19:19625316               0.584        <0.001
A5        TSSK6    CpG_3   chr19:19625318               0.662        <0.001
A5        TSSK6    CpG_4   chr19:19625327               0.649        <0.001
A5        TSSK6    CpG_5   chr19:19625339               0.637        <0.001
A5        TSSK6    CpG_6   chr19:19625344               0.637        <0.001
A5        TSSK6    CpG_7   chr19:19625346               0.616        <0.001
A5        TSSK6    CpG_8   chr19:19625355               0.637        <0.001
A5        TSSK6    CpG_9   chr19:19625357               0.641        <0.001
A5        TSSK6    CpG_10  chr19:19625361               0.629        <0.001
A5        TSSK6    CpG_11  chr19:19625364    v          0.636        <0.001   cg08928145
A6        TBR1     CpG_1   chr2:162281132               −0.035       0.737
A6        TBR1     CpG_2   chr2:162281122               0.230        0.025
A6        TBR1     CpG_3   chr2:162281117               0.319        0.002
A6        TBR1     CpG_4   chr2:162281111    v          0.489        <0.001   cg12757011
A6        TBR1     CpG_5   chr2:162281027               −0.002       0.984
A7        SLC12A5  CpG_1   chr20:44658279               0.321        0.002
A7        SLC12A5  CpG_2   chr20:44658273               0.130        0.209
A7        SLC12A5  CpG_3   chr20:44658264               0.441        <0.001
A7        SLC12A5  CpG_4   chr20:44658253               0.571        <0.001
A7        SLC12A5  CpG_5   chr20:44658248               0.585        <0.001
A7        SLC12A5  CpG_6   chr20:44658234               0.683        <0.001
A7        SLC12A5  CpG_7   chr20:44658230               0.741        <0.001
A7        SLC12A5  CpG_8   chr20:44658228               0.679        <0.001
A7        SLC12A5  CpG_9   chr20:44658225    v          0.769        <0.001   cg07547549
A7        SLC12A5  CpG_10  chr20:44658203               0.399        <0.001
A7        SLC12A5  CpG_11  chr20:44658199               0.285        0.005

a Illumina HumanMethylation BeadChip ID. The CpG ID column also lists cg05121480, cg12146673, cg18404308, cg22285878, cg07955995, cg09499629 and cg08097417, whose row positions are not recoverable from this copy.
b GRCh37 (hg19).
c Loci that overlap the seven markers in Hong et al. [22] are marked with "v".

with chronological age. In fact, all seven amplicons in this study were designed to be shorter than 200 bp, excluding the read and index sequences. Therefore, we expected most amplicons to exhibit similar association states between DNAm and age, as reported in previous works [11–14]. However, the CpG loci on the A3, A6, and A7 amplicons in this study differed in their association between DNAm and age even within the

study. However, the advantage of bisulfite amplicon sequencing is that the workload can be reduced by using the amplicon sequences as a reference instead of the whole genome. As mentioned in Section 3.1.2, the DNAm status within amplicons showed several patterns. Most regions in the amplicons presented similar age-correlations, but some CpG sites, such as A7_CpG_2, were not associated

Table 2. Platform-independent age prediction models constructed with a neural network and multivariate linear regression(a).

Neural network model
Hidden layer  Neurons
Layer 1       5
Layer 2       2

Multivariate linear regression
Target(b)     Illumina ID  Coefficient
(intercept)                −30.087
A1_CpG_6      cg18384097   −29.199
A2_CpG_3      cg00481951   31.567
A3_CpG_3      cg19671120   46.056
A4_CpG_6      cg14361627   73.819
A5_CpG_11     cg08928145   25.141
A6_CpG_4      cg12757011   42.906
A7_CpG_9      cg07547549   91.890
Platform(c)                24.739

Model performance
Data set                Model  R2      MAD (years)  RMSE (years)  MAPE (%)
Training set (N = 154)  NN     0.9058  3.09         3.94          8.58
Training set (N = 154)  MLR    0.8680  3.66         4.80          10.36
Test set (N = 36)       NN     0.9208  3.19         4.03          8.89
Test set (N = 36)       MLR    0.8953  3.69         5.29          10.44

a MAD, RMSE and MAPE are abbreviations of mean absolute deviation, root mean square error, and mean absolute percentage error, respectively.
b SNaPshot variables were excluded from the final multivariate linear regression model.
c The platform variable was coded 1 for MPS and 0 for SNaPshot.
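To show how the platform variable enters the regression, the sketch below applies the MLR coefficients listed in Table 2. It assumes that DNAm values are expressed as fractions between 0 and 1 and that the model is a plain weighted sum plus intercept; the function name and the example methylation profile are hypothetical, so the printed ages are illustrative only.

```python
INTERCEPT = -30.087
COEFFICIENTS = {  # MLR coefficients from Table 2, keyed by Illumina ID
    "cg18384097": -29.199,
    "cg00481951": 31.567,
    "cg19671120": 46.056,
    "cg14361627": 73.819,
    "cg08928145": 25.141,
    "cg12757011": 42.906,
    "cg07547549": 91.890,
}
PLATFORM_COEFFICIENT = 24.739
PLATFORM_CODE = {"MPS": 1, "SNaPshot": 0}  # coding given in the Table 2 footnote


def predict_age(dnam: dict, platform: str) -> float:
    """Predict age (years) from DNAm fractions and the platform variable."""
    age = INTERCEPT + PLATFORM_COEFFICIENT * PLATFORM_CODE[platform]
    for cpg_id, coefficient in COEFFICIENTS.items():
        age += coefficient * dnam[cpg_id]
    return age


# Hypothetical methylation profile (fractions), not taken from the study data.
profile = {
    "cg18384097": 0.60, "cg00481951": 0.30, "cg19671120": 0.35,
    "cg14361627": 0.10, "cg08928145": 0.40, "cg12757011": 0.30,
    "cg07547549": 0.20,
}
print(round(predict_age(profile, "SNaPshot"), 1), round(predict_age(profile, "MPS"), 1))
```

For identical DNAm values, the MPS prediction is 24.739 years higher than the SNaPshot prediction, which is how the platform term compensates for the lower methylation values measured by MPS.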

Fig. 2. Comparison of DNA methylation between the MPS and SNaPshot data. (a–g) In all plots, the x-axis is the methylation value from the MPS method and the y-axis is that from the methylation SNaPshot method. The same 95 samples were used, though the SNaPshot values tended to be slightly shifted. Most samples had a higher DNAm value in the methylation SNaPshot data. Except for A3_CpG_15 (cg19671120), the MPS and methylation SNaPshot data showed a high Spearman correlation coefficient. (h) The predicted age with the MPS data using the model suggested in [22] showed a high error, but that using the SNaPshot data had high accuracy. Despite these large differences, the tendency in both data sets was quite similar. The model explains 83.0% of the age variance in the MPS data and 90.2% of the age variance in the methylation SNaPshot data.
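The Discussion attributes the upward shift of the SNaPshot values to unequal fluorescence intensities of the two dyes. One simple way to picture this is sketched below. It is an assumed toy model, not the simulation behind Supplementary Figure S4: the methylated signal is taken to be brighter than the unmethylated signal by a fixed factor, and the observed DNAm level is the methylated signal divided by the sum of both signals. The function and parameter names are hypothetical.

```python
def observed_dnam(true_level: float, intensity_ratio: float) -> float:
    """DNAm level read from dye signals when the methylated dye is brighter.

    true_level: actual fraction of methylated strands (0 to 1).
    intensity_ratio: assumed methylated-to-unmethylated signal per strand.
    """
    methylated_signal = intensity_ratio * true_level
    unmethylated_signal = 1.0 - true_level
    return methylated_signal / (methylated_signal + unmethylated_signal)


# Overestimation for the fold differences and DNAm range mentioned in the Discussion.
for ratio in (1.5, 1.7, 2.0):
    for level in (0.35, 0.45, 0.55):
        shift = observed_dnam(level, ratio) - level
        print(f"ratio {ratio}: true {level:.2f} -> shift {shift:+.3f}")
```

Under this assumption the bias vanishes at 0 and 1 and is largest at intermediate methylation levels, which matches the direction of the shift seen in panels a–g.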

could indicate the ratio of these two cell types. Because the samples were from the test set in [22], we were able to compare the age prediction results between the bisulfite MPS and methylation SNaPshot data. When the MPS data were applied to the model suggested in [22], the results showed significant differences between chronological age and predicted age (Fig. 2(h) and Supplementary Table 3). Fig. 2(a–g) shows the relationship between the DNAm data from the MPS and SNaPshot data sets at the same CpG sites. Most of the samples are above the y = x line, implying that the DNAm values calculated using the SNaPshot method were much higher than those calculated using the MPS method, possibly because of the difference in

same amplicon. Thus, it might be worth checking the age-correlation of other CpG sites around the known markers to find the candidates with the best age-correlation. In particular, A5_CpG_3 showed a higher Pearson's r value than A5_CpG_11 (cg08928145), and it could be another age-predictive marker for saliva. However, we did not construct an age prediction model for MPS data alone in this study; we focused on comparing the DNAm levels obtained from MPS and methylation SNaPshot. Therefore, we examined all seven CpGs from [22], including the cell type-specific marker A1_CpG_6 (cg18384097). In [22], inclusion of the cell type-specific marker was suggested for better age prediction because saliva has a heterogeneous cell composition containing both buccal epithelial cells and leucocytes, and the cell type-specific marker

Forensic Science International: Genetics 38 (2019) 39–47

S.R. Hong et al.

Fig. 3. Platform-independent age prediction models with MPS and methylation SNaPshot data. a A training set for a neural network (NN) model. b A test set for an NN model. c A training set for a multivariate linear regression (MLR) model. d A test set for an MLR model. In all plots, the x-axes are chronological age, and the yaxes are the predicted age from each model. Two models using an NN and MLR could explain almost 90% of the variation in age in each set. The test sets for both models showed high accuracy as well as the training sets.

fluorescence intensities used for the SNaPshot: the intensity of the blue dye (methylated signal) was divided by the sum of the intensities of the blue and green (unmethylated signal) dyes. The dye intensity might not perfectly match the number of actual methylated or unmethylated strands, because the nucleotide G intensity is higher than the nucleotide A intensity in the SNaPshot Kit, unlike the nucleotide C/T combination [2,32–34]. Indeed, the average difference in DNAm level between MPS and methylation SNaPshot was about 0.1. The BLUEPRINT consortium compared several assays for quantitative DNAm levels and verified that DNAm level measurement by bisulfite sequencing with MPS and by the gold-standard pyrosequencing is highly robust [35]. As they pointed out, relative measurement methods are not as robust as absolute methods such as pyrosequencing [35]. Because of the unequal dye intensities, methylation SNaPshot can overestimate the DNAm level. As shown in Supplementary Figure S4, the difference between the simulated and actual DNAm levels was higher when the difference in G/A dye intensity was larger. In particular, the difference in DNA methylation was 0.1 or more when the actual DNA methylation level was in the range of 0.35 to 0.55 for 1.5-fold, 1.7-fold, and 2-fold differences in fluorescence signal. This could explain the higher DNAm level measured by SNaPshot and, in turn, the difference between chronological age and predicted age that occurred when we substituted the MPS data into the SNaPshot model. As age increased, the trend of increasing DNAm was concordant, but there were differences in the actual values between the MPS and methylation SNaPshot data, as shown

in Fig. 2(a–g). Therefore, an age-prediction model trained for a specific platform cannot be applied directly to another platform, even if both target the same locus. Feng et al. [25] suggested z-score transformation as a normalization step to exclude the platform effect: with this normalization, their model for the EpiTYPER method reduced the differences between the ages predicted from EpiTYPER and pyrosequencing data. In contrast, our models can be applied directly without any normalization step, because we addressed the cross-platform limitation by introducing “platform variables” into new age-prediction models. Both our NN and MLR models predicted ages with high accuracy using both MPS and methylation SNaPshot data (Table 2 and Fig. 3). Age-prediction models without platform variables had lower accuracy than the models with platform variables (Supplementary Figure S5): the MAD values from the test set were 8.19 years and 8.03 years for the NN and MLR models, respectively, and the MAPE values from the test set were higher than 20% (NN = 23.04% and MLR = 22.30%). Moreover, using platform variables provides extensibility; as platforms become more diverse, new models can be made by simply increasing the number of platform variables. As shown in Supplementary Figure S3, models extended in this way had MAPE values from the test set of 9.36% and 8.72% for the NN and MLR methods, respectively. Compared with the models for two platforms, it is encouraging that the MAPE value remains about 10%, which means the accuracy of the models remains similar even as the number of platform variables increases. Additionally, when constructing an age-prediction model with a small number of samples, it could be useful to find the optimal model through k-fold cross-validation rather than simply dividing the samples into just two sets, a training set and a test set. However, it is important to choose a model that suits the statistical theory and the features of the data rather than simply applying a method. In particular, because age-correlated DNAm markers seem to have a linear correlation with age, it might be appropriate to analyze DNAm markers using a regression-related method.

Our model, composed of seven CpG sites from [22] and platform variables, enabled age prediction in a platform-independent manner with high accuracy (MAD from chronological age of 3.19 years and 3.69 years in the NN and MLR models, respectively) using both MPS and methylation SNaPshot data. The prediction accuracy is much higher (Fig. 3 and Table 2) than that obtained by applying MPS data directly to the methylation SNaPshot data-based model, which produced a MAD from chronological age of more than 20 years (Fig. 2(h) and Supplementary Table 3). Therefore, we suggest that the platform variable could extend the applicability of data produced using various platforms and help build a platform-independent model. Moreover, the idea of introducing a platform variable when developing an age-prediction model might be applicable to other forensically relevant body fluids, such as blood and semen. If more data from a variety of platforms, such as MPS, methylation SNaPshot, and 450K/EPIC array data, were used, more sophisticated platform-independent age-prediction models could be constructed for use in the forensic field.

5. Conclusion

Bisulfite MPS was targeted to the same seven age-predictive CpG markers for saliva used in our previous work. Both the MPS data and the methylation SNaPshot data were highly associated with chronological age, but the DNA methylation values from the two methods did not match exactly, so the MAD between chronological age and the age predicted from MPS data with the SNaPshot-based age-prediction model was more than 20 years. To construct an age-prediction model that is independent of platform, a platform variable was introduced. With a platform variable, both age-prediction models, built with a neural network and with multivariate linear regression, showed high accuracy regardless of whether the data were from MPS or methylation SNaPshot. This platform-independent age-prediction model can be extended to an increasing number of platforms, and the idea of constructing an age-predictive model with platform variables can be applied to age-prediction models for other body fluids.

Conflict of interest statement

The authors declare that they have no conflicts of interest.

Acknowledgement

This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation of Korea (NRF) funded by the Korean government (NRF-2014M3A9E1069992). We thank Prof. Jae Joon Ahn (Department of Information & Statistics, Yonsei University, Wonju, South Korea) for statistical consultation.

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at https://doi.org/10.1016/j.fsigen.2018.10.005.

References

[1] S.E. Jung, K.J. Shin, H.Y. Lee, DNA methylation-based age prediction from various tissues and body fluids, BMB Rep. 50 (2017) 546–553.
[2] A. Freire-Aradas, C. Phillips, M.V. Lareu, Forensic individual age estimation with DNA: from initial approaches to methylation tests, Forensic Sci. Rev. 29 (2017) 121–144.
[3] S. Hewakapuge, R.A. van Oorschot, P. Lewandowski, S. Baindur-Hudson, Investigation of telomere lengths measurement by quantitative real-time PCR to predict age, Leg. Med. (Tokyo) 10 (2008) 236–242.
[4] C. Meissner, S. Ritz-Timme, Molecular pathology and age estimation, Forensic Sci. Int. 203 (2010) 34–43.
[5] A. Pilin, F. Pudil, V. Bencko, Changes in colour of different human tissues as a marker of age, Int. J. Legal Med. 121 (2007) 158–162.
[6] D. Zubakov, F. Liu, I. Kokmeijer, Y. Choi, J.B.J. van Meurs, I.W.F.J. van, A.G. Uitterlinden, A. Hofman, L. Broer, C.M. van Duijn, J. Lewin, M. Kayser, Human age estimation from blood using mRNA, DNA methylation, DNA rearrangement, and telomere length, Forensic Sci. Int. Genet. 24 (2016) 33–43.
[7] D. Zubakov, F. Liu, M.C. van Zelm, J. Vermeulen, B.A. Oostra, C.M. van Duijn, G.J. Driessen, J.J. van Dongen, M. Kayser, A.W. Langerak, Estimating human age from T-cell DNA rearrangements, Curr. Biol. 20 (2010) R970–1.
[8] T. Huan, G. Chen, C. Liu, A. Bhattacharya, J. Rong, B.H. Chen, S. Seshadri, K. Tanriverdi, J.E. Freedman, M.G. Larson, J.M. Murabito, D. Levy, Age-associated microRNA expression in human peripheral blood is associated with all-cause mortality and age-related traits, Aging Cell 17 (2018), https://doi.org/10.1111/acel.12687.
[9] G. Hannum, J. Guinney, L. Zhao, L. Zhang, G. Hughes, S. Sadda, B. Klotzle, M. Bibikova, J.B. Fan, Y. Gao, R. Deconde, M. Chen, I. Rajapakse, S. Friend, T. Ideker, K. Zhang, Genome-wide methylation profiles reveal quantitative views of human aging rates, Mol. Cell. 49 (2013) 359–367.
[10] C.I. Weidner, Q. Lin, C.M. Koch, L. Eisele, F. Beier, P. Ziegler, D.O. Bauerschlag, K.H. Jockel, R. Erbel, T.W. Muhleisen, M. Zenke, T.H. Brummendorf, W. Wagner, Aging of blood can be tracked by DNA methylation changes at just three CpG sites, Genome Biol. 15 (2014) R24.
[11] R. Zbiec-Piekarska, M. Spolnicka, T. Kupiec, Z. Makowska, A. Spas, A. Parys-Proszek, K. Kucharczyk, R. Ploski, W. Branicki, Examination of DNA methylation status of the ELOVL2 marker may be useful for human age prediction in forensic science, Forensic Sci. Int. Genet. 14 (2015) 161–167.
[12] R. Zbiec-Piekarska, M. Spolnicka, T. Kupiec, A. Parys-Proszek, Z. Makowska, A. Paleczka, K. Kucharczyk, R. Ploski, W. Branicki, Development of a forensically useful age prediction method based on DNA methylation analysis, Forensic Sci. Int. Genet. 17 (2015) 173–179.
[13] J.L. Park, J.H. Kim, E. Seo, D.H. Bae, S.Y. Kim, H.C. Lee, K.M. Woo, Y.S. Kim, Identification and evaluation of age-correlated DNA methylation markers for forensic use, Forensic Sci. Int. Genet. 23 (2016) 64–70.
[14] S. Cho, S.E. Jung, S.R. Hong, E.H. Lee, J.H. Lee, S.D. Lee, H.Y. Lee, Independent validation of DNA-based approaches for age prediction in blood, Forensic Sci. Int. Genet. 29 (2017) 250–256.
[15] C. Xu, H. Qu, G. Wang, B. Xie, Y. Shi, Y. Yang, Z. Zhao, L. Hu, X. Fang, J. Yan, L. Feng, A novel strategy for forensic age prediction by DNA methylation and support vector regression model, Sci. Rep. 5 (2015) 17788.
[16] J. Naue, H.C.J. Hoefsloot, O.R.F. Mook, L. Rijlaarsdam-Hoekstra, M.C.H. van der Zwalm, P. Henneman, A.D. Kloosterman, P.J. Verschure, Chronological age prediction based on DNA methylation: massive parallel sequencing and random forest regression, Forensic Sci. Int. Genet. 31 (2017) 19–28.
[17] A. Vidaki, D. Ballard, A. Aliferi, T.H. Miller, L.P. Barron, D. Syndercombe Court, DNA methylation-based forensic age prediction using artificial neural networks and next generation sequencing, Forensic Sci. Int. Genet. 28 (2017) 225–236.
[18] H.Y. Lee, S.E. Jung, Y.N. Oh, A. Choi, W.I. Yang, K.J. Shin, Epigenetic age signatures in the forensically relevant body fluid of semen: a preliminary study, Forensic Sci. Int. Genet. 19 (2015) 28–34.
[19] J.W. Lee, C.M. Choung, J.Y. Jung, H.Y. Lee, S.K. Lim, A validation study of DNA methylation-based age prediction using semen in forensic casework samples, Leg. Med. (Tokyo) 31 (2018) 74–77.
[20] S. Bocklandt, W. Lin, M.E. Sehl, F.J. Sanchez, J.S. Sinsheimer, S. Horvath, E. Vilain, Epigenetic predictor of age, PLoS One 6 (2011) e14821.
[21] Y. Hamano, S. Manabe, C. Morimoto, S. Fujimoto, K. Tamaki, Forensic age prediction for saliva samples using methylation-sensitive high resolution melting: exploratory application for cigarette butts, Sci. Rep. 7 (2017) 10444.
[22] S.R. Hong, S.E. Jung, E.H. Lee, K.J. Shin, W.I. Yang, H.Y. Lee, DNA methylation-based age prediction from saliva: High age predictability by combination of 7 CpG markers, Forensic Sci. Int. Genet. 29 (2017) 118–125.
[23] S. Horvath, DNA methylation age of human tissues and cell types, Genome Biol. 14 (2013) R115.
[24] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (2002) 18–22.
[25] L. Feng, F. Peng, S. Li, L. Jiang, H. Sun, A. Ji, C. Zeng, C. Li, F. Liu, Systematic feature selection improves accuracy of methylation-based forensic age estimation in Han Chinese males, Forensic Sci. Int. Genet. 35 (2018) 38–45.
[26] E.Y. Lee, H.Y. Lee, S.Y. Oh, S.E. Jung, I.S. Yang, Y.H. Lee, W.I. Yang, K.J. Shin, Massively parallel sequencing of the entire control region and targeted coding region SNPs of degraded mtDNA using a simplified library preparation method, Forensic Sci. Int. Genet. 22 (2016) 37–43.
[27] F. Krueger, S.R. Andrews, Bismark: a flexible aligner and methylation caller for bisulfite-seq applications, Bioinformatics 27 (2011) 1571–1572, https://doi.org/10.1093/bioinformatics/btr167.
[28] B. Langmead, S.L. Salzberg, Fast gapped-read alignment with bowtie 2, Nat. Methods 9 (2012) 357–359, https://doi.org/10.1038/nmeth.1923.
[29] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, The sequence alignment/map format and SAMtools, Bioinformatics 25 (2009) 2078–2079.
[30] M.M. Mukaka, Statistics Corner: a guide to appropriate use of correlation coefficient in medical research, Malawi Med. J. 24 (2012) 69–71.
[31] D.R. Masser, A.S. Berg, W.M. Freeman, Focused, high accuracy 5-methylcytosine quantitation with base resolution by benchtop next-generation sequencing, Epigenetics Chromatin 6 (2013) 33, https://doi.org/10.1186/1756-8935-6-33.
[32] J.J. Sanchez, C. Børsting, K. Balogh, B. Berger, M. Bogus, J.M. Butler, A. Carracedo, D.S. Court, L.A. Dixon, B. Filipović, M. Fondevila, P. Gill, C.D. Harrison, C. Hohoff, R. Huel, B. Ludes, W. Parson, T.J. Parsons, E. Petkovski, C. Phillips, H. Schmitter, P.M. Schneider, P.M. Vallone, N. Morling, Forensic typing of autosomal SNPs with a 29 SNP-multiplex: Results of a collaborative EDNAP exercise, Forensic Sci. Int. Genet. 2 (2008) 176–183.
[33] C. Lou, B. Cong, S. Li, L. Fu, X. Zhang, T. Feng, S. Su, C. Ma, F. Yu, J. Ye, L. Pei, A SNaPshot assay for genotyping 44 individual identification single nucleotide polymorphisms, Electrophoresis 32 (2011) 368–378.
[34] M. Fondevila, C. Børsting, C. Phillips, M. de la Puente, E.N. Consortium, A. Carracedo, N. Morling, M.V. Lareu, Forensic SNP genotyping with SNaPshot: technical considerations for the development and optimization of multiplexed SNP assays, Forensic Sci. Rev. 29 (2017) 57–76.
[35] BLUEPRINT consortium, Quantitative comparison of DNA methylation assays for biomarker development and clinical applications, Nat. Biotechnol. 34 (2016) 726–737.
