Journal of Chromatography B xxx (xxxx) xxx–xxx
Contents lists available at ScienceDirect
Journal of Chromatography B journal homepage: www.elsevier.com/locate/jchromb
The use of LC predicted retention times to extend metabolites identification with SWATH data acquisition
Tobias Bruderer a, Emmanuel Varesio b, Gérard Hopfgartner a,⁎
a Life Sciences Mass Spectrometry, Department of Inorganic and Analytical Chemistry, University of Geneva, Quai Ernest-Ansermet 30, CH-1211 Geneva 4, Switzerland
b School of Pharmaceutical Sciences, University of Lausanne, University of Geneva, Rue Michel Servet 1, CH-1211 Geneva 4, Switzerland
ARTICLE INFO
Keywords: Liquid chromatography; High resolution mass spectrometry; SWATH; LC retention time prediction; QSRR; Metabolomics
ABSTRACT
The application of predicted LC retention times to support metabolite identification was evaluated for a metabolomics MS/MS database containing 532 compounds representative of the major human metabolite classes. LC retention times could be measured on two C18-type columns using a mobile phase of pH = 3.0 for positive ESI mode (n = 337, 228) and pH = 8.0 for negative ESI mode (n = 410, 233). QSRR modelling was applied with a small set of model compounds selected with the Kennard-Stone algorithm. The models were implemented in the R environment and can be applied to any library. The prediction model was built with two molecular descriptors, LogD2 and the molecular volume. A limited set of model compounds (LC CalMix, n = 16) could be validated on two different C18 reversed-phase LC columns with comparable prediction accuracy. The CalMix can be used to compensate for differences between LC systems. In addition, LC retention prediction in combination with SWATH-MS was found to be attractive for eliminating false positive identifications as well as for ranking isomeric forms of metabolites.
1. Introduction
The identification of low molecular weight compounds from liquid chromatography mass spectrometry analyses, to support metabolomics investigations, is mainly based on accurate mass measurements and comparison of liquid chromatographic retention with authentic standards. Due to the large chemical space of the metabolome, the assignment of peak features (m/z and retention time) to molecules remains a challenging task [1–3]. Four major parameters can be considered: elemental formula, product ion spectrum, ionization polarity and chromatographic retention time. With modern high resolution mass spectrometry, elemental formula calculation based on accurate mass and isotopic match is straightforward. Assignment of product ion spectra with libraries is still limited and does not differentiate isomers. As standards are not available in many cases, chromatographic prediction plays a major role, as shown for analyte screening or identification [4–10]. Quantitative structure-retention relationship (QSRR) modelling is well suited for retention time prediction based on molecular physicochemical properties [11]. The major advantage of QSRR compared to other retention prediction approaches is that it can be applied to any given compound based on molecular descriptors calculated from its structure. LogD and LogP have been extensively applied as molecular descriptors, in particular for reversed-phase chromatography. QSRR was
applied for global LC–MS based metabolomics analysis by Creek et al. [4]. They developed a model based on multiple linear regression between the logD(pH), five additional molecular descriptors and the measured retention times of 120 metabolite standards on a HILIC column. They reported a 40% reduction in false positive annotations for putatively annotated metabolites from cell extracts. Recently, Cao et al. [12] proposed a QSRR model for non-targeted lipid analysis using multiple linear regression (MLR) modelling and the random forests (RF) approach, with significantly improved retention prediction. These types of QSRR models can be used for retention prediction, but they are limited by the need for large sets of model compounds (i.e. typically more than 100 reference standards) [4,12,13]. Furthermore, their precision is somewhat limited, which is regularly attributed to the limited accuracy of the physicochemical descriptors used, such as logP, pKa and logD(pH) [6], which are calculated with open source software, often from two-dimensional molecular descriptors and more rarely from three-dimensional descriptors [13]. A strategy for reducing the number of model compounds in QSRR models has been presented by Andries et al. [14]. A simple logP linear regression model, Abraham's solvation equation [15–17] and the Quantum Indices model [11] were compared for a small set of model compounds across 76 different LC conditions obtained from the literature. The use of a small number of model compounds, i.e. 7–15, did not
⁎ Corresponding author. E-mail address: [email protected] (G. Hopfgartner).
http://dx.doi.org/10.1016/j.jchromb.2017.07.016 Received 6 November 2016; Received in revised form 30 June 2017; Accepted 7 July 2017 1570-0232/ © 2017 Elsevier B.V. All rights reserved.
Please cite this article as: Bruderer, T., Journal of Chromatography B (2017), http://dx.doi.org/10.1016/j.jchromb.2017.07.016
T. Bruderer et al.
Fig. 1. Strategies to apply in-silico LC retention time prediction to support metabolite identification.
Table 1 Recommended model compound mixture for both LC columns, Waters T3 and C18 (n = 3–20), selected with the Kennard-Stone algorithm with Euclidean distance for the two molecular descriptors logD2 (consensus logP and classic pKa, calculated with ACD Labs 2012, Build 2076, July 25, 2012) and MV (molecular volume calculated with ACD Labs 2010).

| Nr. | Compound (pH = 3.0 selection) | logD (pH = 3.0) | MV | Compound (pH = 8.0 selection) | logD (pH = 8.0) | MV |
|---|---|---|---|---|---|---|
| 1 | Chenodeoxycholic acid | 3.75 | 347.8 | Purine | −0.48 | 81.5 |
| 2 | 2-Pyrrolidinone | −0.80 | 81.2 | Hyodeoxycholic acid | 0.72 | 347.8 |
| 3 | Chlorogenic acid | −0.49 | 214.5 | Chlorogenic acid | −4.06 | 214.5 |
| 4 | 3-Chlorotyrosine | −1.42 | 147.8 | Cortisol | 1.66 | 281.3 |
| 5 | Cortisol | 1.66 | 281.3 | 3-Chlorotyrosine | −1.61 | 147.8 |
| 6 | N-Methyl-a-aminoisobutyric acid | −2.57 | 114.5 | Indole-3-carboxylic acid | −0.91 | 114.4 |
| 7 | Epicatechin | 0.57 | 182.1 | Octadecanedioic acid | 0.77 | 314.9 |
| 8 | Diphenhydramine | 0.61 | 249.2 | Tetradecanedioic acid | −1.29 | 248.8 |
| 9 | Sphinganine | 2.57 | 325.0 | 3,5-Diiodo-L-tyrosine | −1.20 | 180.0 |
| 10 | Loratadine | 4.12 | 303.5 | Melatonin | 1.74 | 197.6 |
| 11 | 1,11-Undecanedicarboxylic acid | 3.07 | 232.3 | Estrone | 3.38 | 232.1 |
| 12 | Ranitidine | −2.82 | 265.4 | Pyridoxamine | −0.96 | 131.1 |
| 13 | Tryptophanol | 1.70 | 132.1 | Ranitidine | −0.15 | 265.4 |
| 14 | Ribothymidine | −1.49 | 163.7 | Ribothymidine | −1.51 | 163.7 |
| 15 | 2-Piperidinone | −0.50 | 98.9 | Salicylic acid | −0.78 | 100.3 |
| 16 | Melatonin | 1.74 | 197.6 | Biocytin | −2.72 | 301.7 |
| 17 | Luteolin | 2.35 | 172.9 | L-Aspartyl-L-phenylalanine | −3.64 | 204.4 |
| 18 | 2-Hydroxyphenethylamine | −2.57 | 124.1 | 1-Methylhistidine | −3.07 | 122.6 |
| 19 | Liothyronine | 1.58 | 272.6 | N-Acetylserotonin | 1.13 | 172.0 |
| 20 | 5-Methoxytryptophol | 1.67 | 156.1 | 5-Methoxytryptophol | 1.67 | 156.1 |
result in any significant loss of prediction accuracy. They applied the Kennard-Stone algorithm [18] to obtain an equidistant distribution of the molecular descriptors over the investigated variable space. Falchi et al. [10] have shown that a kernel-based partial least squares quantitative structure-retention relationship model can be a useful tool for metabolite identification. They built a model based on 1383 compounds covering different chemical classes and demonstrated that it succeeded in the RT prediction of drug metabolites. In addition, for unknowns of interest, product ion spectra are acquired and identification is further enabled with the use of libraries or de novo spectra interpretation [19]. Data independent acquisition (DIA) techniques such as MSE or SWATH have gained interest as they enable the recording of full scan and product ion spectra in a single LC–MS run. While SWATH/MS has mostly been used for proteomics [20], these features are valuable for metabolomics studies [21,22] and provide qualitative and quantitative analysis in the same run (QUAL/QUAN), with enhanced selectivity of the precursor ions compared to MSE as the size of the windows can be selected. Compared to DDA, SWATH spectra enable not only a library search on any precursor but also a structural analog search based on product fragments by using the chromatographic profile.
In the present work, to allow a better assignment of metabolites in complex samples, and in particular of isomeric metabolites, QSRR modelling was applied with a small set of model compounds (n < 20) selected with the Kennard-Stone algorithm from a larger set of 532 compounds present in metabolomics datasets and referenced in the Human Metabolome Database [23]. SWATH MS/MS spectra were available in positive and negative mode for 532 compounds [24]. The
Fig. 2. Residuals versus experimental retention times, logD2 and molecular volume for the T3 column at pH = 3.0 for all compounds in the data set (A–C) versus compounds inside the chemical space (D–F), defined as k' > 0.5, retention time < 21 min (5–95% gradient phase), logD > −5 and logD < 6, molecular volume < 350 and exclusion of lipids. k: number of validation compounds.
molecular descriptors were calculated with the ACD Labs software suite, which contains a large database of measured logP and pKa values and advanced logD(pH) modelling. Retention times were predicted by multiple linear regression based on the logD and the molecular volume. The models were evaluated for three different RP-LC columns and two pH conditions, pH = 3.0 for positive ESI mode and pH = 8.0 for negative ESI mode.
2. Materials and methods
2.1. Chemicals and solvents
Water (Millipore), methanol, acetonitrile and isopropanol (all HPLC grade) were provided by VWR (Darmstadt, Germany); formic acid, ammonium hydroxide, ammonium formate and ammonium acetate were provided by Sigma-Aldrich (Buchs, Switzerland).
2.2. Metabolite reference compounds and accurate mass metabolite library (AMML)
532 reference compounds were obtained from the Human Metabolome Database (HMDB, [23]) as powder or liquid (Tables S1 and S2, Supplemental information) or from Sigma-Aldrich. To build an accurate mass metabolite library (AMML), accurate mass spectra were collected at discrete collision energies (5 eV–100 eV) for these 532 metabolites. Reference compounds were analysed by flow injection in positive and negative mode on a QqTOF instrument (TTOF5600, Sciex) and the spectra were compiled into composite spectra over a selected large collision energy range (e.g. 70 eV) [24]. For LC–MS analysis, stock solutions in the range of 0.5–1 mg/mL were prepared by dissolving the analytes in mixtures of methanol, acetonitrile, isopropanol, acetone, ethanol or water. Analytical solutions for retention time determination were prepared as mixtures of 10 non-isobaric compounds at 10 μg/mL in LC gradient initial conditions.
Table 2 Ranking for 11 isomer pairs and 2 triplets for the Waters T3 column at pH = 3.0, ESI positive mode.

| Nr | Isomers | Compound name | Formula | Exact mass | RT (min) | Predicted RT (min) | Residual (min) | Ranking |
|---|---|---|---|---|---|---|---|---|
| 1 | Pair 1 | Delta-Hexanolactone | C6H10O2 | 114.06808 | 6.3 | 6.4 | 0.1 | no distinction |
| 2 | Pair 1 | Gamma-Caprolactone | C6H10O2 | 114.06808 | 7.5 | 6.4 | 1.1 | no distinction |
| 9 | Pair 2 | Tyramine | C8H11NO | 137.08406 | 2.9 | 1.0 | 1.9 | no distinction |
| 10 | Pair 2 | 2-Hydroxyphenethylamine | C8H11NO | 137.08406 | 3.8 | 1.0 | 2.8 | no distinction |
| 11 | Pair 3 | Methylglutaric acid | C6H10O4 | 146.05791 | 5.9 | 6.0 | 0.1 | correct |
| 12 | Pair 3 | Monomethyl glutaric acid | C6H10O4 | 146.05791 | 6.9 | 7.2 | 0.3 | correct |
| 13 | Pair 4 | 3-Methyladenine | C6H7N5 | 149.07015 | 1.8 | −0.6 | 2.4 | correct |
| 14 | Pair 4 | 6-Methyladenine | C6H7N5 | 149.07015 | 3.1 | 4.0 | 0.9 | correct |
| 15 | Pair 5 | 2-Phenylglycine | C8H9NO2 | 151.06333 | 2.1 | 2.9 | 0.8 | correct |
| 16 | Pair 5 | Acetaminophen | C8H9NO2 | 151.06333 | 4.9 | 7.0 | 2.1 | correct |
| 17 | Pair 6 | 4-Hydroxy-3-methylbenzoic acid | C8H8O3 | 152.04734 | 9.5 | 9.9 | 0.4 | correct |
| 18 | Pair 6 | p-Anisic acid | C8H8O3 | 152.04734 | 11.6 | 10.2 | 1.4 | correct |
| 19 | Pair 7 | L-Phenylalanine | C9H11NO2 | 165.07898 | 4.4 | 3.6 | 0.8 | correct |
| 20 | Pair 7 | Benzocaine | C9H11NO2 | 165.07898 | 11.6 | 10.5 | 1.1 | correct |
| 21 | Pair 8 | Phenylephrine | C9H13NO2 | 167.09463 | 2.7 | 0.3 | 2.4 | correct |
| 22 | Pair 8 | 3-Methoxytyramine | C9H13NO2 | 167.09463 | 3.7 | 1.5 | 2.2 | correct |
| 23 | Pair 9 | 12-Hydroxydodecanoic acid | C12H24O3 | 216.17254 | 16.9 | 15.2 | 1.7 | correct |
| 24 | Pair 9 | 3-Hydroxydodecanoic acid | C12H24O3 | 216.17254 | 18.8 | 17.2 | 1.6 | correct |
| 25 | Pair 10 | Estriol | C18H24O3 | 288.17254 | 12.8 | 15.3 | 2.5 | no distinction |
| 26 | Pair 10 | 16b-Hydroxyestradiol | C18H24O3 | 288.17254 | 14.4 | 15.3 | 0.9 | no distinction |
| 32 | Pair 11 | m-Aminobenzoic acid | C7H7NO2 | 137.04768 | 3.9 | 5.7 | 1.8 | correct |
| 33 | Pair 11 | p-Aminobenzoic acid | C7H7NO2 | 137.04768 | 5.0 | 7.1 | 2.1 | correct |
| 29 | Triplet 1 | Glutaric acid | C5H8O4 | 132.04226 | 3.9 | 4.7 | 0.8 | correct |
| 30 | Triplet 1 | Methylsuccinic acid | C5H8O4 | 132.04226 | 4.7 | 5.1 | 0.4 | correct |
| 31 | Triplet 1 | Monoethyl malonic acid | C5H8O4 | 132.04226 | 5.5 | 5.5 | 0.0 | correct |
| 35 | Triplet 2 | Homogentisic acid | C8H8O4 | 168.04226 | 4.1 | 6.8 | 2.7 | no distinction |
| 36 | Triplet 2 | Vanillic acid | C8H8O4 | 168.04226 | 7.8 | 9.0 | 1.2 | no distinction |
| 37 | Triplet 2 | 5-Methoxysalicylic acid | C8H8O4 | 168.04226 | 10.5 | 9.0 | 1.5 | no distinction |
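The ranking column of Table 2 is consistent with a simple rule, sketched below in Python as an illustration and not as the authors' stated procedure: a group is "no distinction" when two isomers receive (nearly) the same predicted RT, and "correct" when the predicted elution order matches the measured one. The function name `rank_isomers`, the tie tolerance `tol` and the label "incorrect" are assumptions introduced for this sketch.

```python
def rank_isomers(measured_rt, predicted_rt, tol=0.05):
    """Classify one isomer group from measured and predicted RTs (min).

    tol is a hypothetical tie threshold: predictions closer than tol
    are treated as indistinguishable.
    """
    n = len(measured_rt)
    # Any pair of (nearly) identical predictions cannot be ranked.
    for i in range(n):
        for j in range(i + 1, n):
            if abs(predicted_rt[i] - predicted_rt[j]) < tol:
                return "no distinction"
    # Compare the elution order implied by measurement and by prediction.
    measured_order = sorted(range(n), key=lambda i: measured_rt[i])
    predicted_order = sorted(range(n), key=lambda i: predicted_rt[i])
    return "correct" if measured_order == predicted_order else "incorrect"


# Values taken from Table 2 (Waters T3, pH = 3.0, ESI positive).
pair_1 = rank_isomers([6.3, 7.5], [6.4, 6.4])
pair_3 = rank_isomers([5.9, 6.9], [6.0, 7.2])
triplet_1 = rank_isomers([3.9, 4.7, 5.5], [4.7, 5.1, 5.5])
triplet_2 = rank_isomers([4.1, 7.8, 10.5], [6.8, 9.0, 9.0])
print(pair_1, pair_3, triplet_1, triplet_2)
```

Applied to the rows of Table 2, this rule gives the same labels as the published ranking column, which suggests (but does not prove) that the ranking was derived in a comparable way.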
2.3. Human urine samples
Urine samples were collected from ten healthy volunteers and immediately stored at −20 °C. They were centrifuged and pooled on the same day and stored at −80 °C until analysis. Urine samples were analysed either without dilution or after 10-fold dilution in LC gradient initial conditions. Isotopically labelled internal standards were added to each sample at final concentrations of 1 μg/mL tryptophan-15N2 and 100 ng/mL benzamide-15N (both obtained from Cambridge Isotope Laboratories, Andover MA, USA), 1 μg/mL phenylalanine-13C, 1 μg/mL estrone-13C3, and either 1 μg/mL myristic acid-13C (negative ESI only) or 10 ng/mL testosterone-13C3 (positive ESI only) (all obtained from Sigma-Aldrich, Buchs, Switzerland).
2.5. LC-SWATH/MS data acquisition
For SWATH acquisition, a single TOF MS scan was followed by 12 MS/MS experiments using variable Q1 windows. The cycle time was set to 831 ms based on an average LC peak width of 12 s, to obtain at least 12 points per peak. A collision energy spread of 50 ± 30 eV was applied. The sprayer capillary voltage was 5300 V in positive mode and −4300 V in negative mode, with a DP of ± 70 V and a source temperature of 450 °C for both modes. The curtain gas was set at 25 and gas 1 and gas 2 at 40. Retention times for the reference compounds were determined by targeted product ion scans for mixtures of ten non-isobaric compounds with a mass difference of at least 5 u. One TOF MS scan was followed by 10 TOF MS/MS scans over a mass range of m/z 50–1000. The accumulation time for each MS experiment was 71 ms, with a total cycle time of 810 ms.
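The timing figures above can be checked with a few lines of arithmetic. The following Python sketch only restates the numbers given in the text; the helper names are hypothetical, and reading the difference between the summed accumulation times and the stated cycle time as instrument overhead is an assumption.

```python
def points_per_peak(peak_width_s, cycle_time_ms):
    """Number of acquisition cycles that fit across one LC peak."""
    return peak_width_s * 1000.0 / cycle_time_ms


def summed_accumulation_ms(n_experiments, accumulation_ms):
    """Total accumulation time of all experiments in one cycle."""
    return n_experiments * accumulation_ms


# SWATH method: 831 ms cycle time, 12 s average peak width.
swath_points = points_per_peak(12, 831)

# Targeted method: 1 TOF MS + 10 TOF MS/MS scans of 71 ms, 810 ms cycle.
targeted_points = points_per_peak(12, 810)
targeted_accumulation = summed_accumulation_ms(1 + 10, 71)
# Assumption: the remainder of the cycle is instrument overhead.
overhead_ms = 810 - targeted_accumulation

print(round(swath_points, 1), round(targeted_points, 1),
      targeted_accumulation, overhead_ms)
```

Both methods therefore sample a 12 s peak more than 12 times, in line with the stated target of at least 12 points per peak.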
2.4. Liquid chromatography
An UltiMate 3000 RSLC chromatography system (Dionex, Germering, Germany) was used for analysis of the reference standards and the human urine samples. Three different columns were investigated: 2.1 × 150 mm, 2.5 μm T3 Xselect column HSS XP (Waters), 2.1 × 100 mm, 2.7 μm Express C18 column (Ascentis) and 2.1 × 100 mm, 2.7 μm Express F5 column (Ascentis). The flow rate on each column was 300 μl/min and the columns were heated at 40 °C. For positive ionization mode, mobile phase A was 5 mM ammonium formate in water adjusted to pH 3.0 by the addition of formic acid, and mobile phase B was methanol. For negative ionization mode, mobile phase C was 5 mM ammonium acetate in water adjusted to pH 8.0 by the addition of ammonium hydroxide. The gradient used in positive mode was 0–1 min, 5% B; 1–20 min, 5–95% B; 20–25 min, 95% B; 25–28 min, 5% B. In negative mode, mobile phase C was used instead of mobile phase A. The injection volume was 1 μl for the urine samples and 1–10 μl for the reference compounds. Samples were cooled at 6 °C. The dead time was 1.0 min for the 150 mm Waters column and 0.6 min for the two 100 mm Ascentis columns.
2.6. Variable Q1 SWATH windows generation
A SWATH method was used with total ion current (TIC) optimized variable Q1 windows [25] based on the TOF MS scan of the first analysed urine sample. The tool "enhance LC/MS peak-finding filter" in the PeakView software (v2.0, SCIEX) was used to generate a peak list. LC–MS peaks were extracted within the gradient phase from 1 to 26 min with an estimated LC peak width of 12 s and background subtraction. The generated peak list was saved as a text file and imported into the swathTUNER software tool [25]. The following settings were used with swathTUNER: 12 windows, m/z range 50–893 for ESI positive and 50–932 for ESI negative mode. The resulting tables containing the start and end m/z values of each window were copied (.txt file format) and saved as .wpoa files, which were imported into the Analyst TF 1.5.1 software to create the SWATH acquisition method. The variable Q1 windows used are shown in Table S3, Supplemental information.
2.7. Metabolite identification with the AMML library
Skyline (version 3.1.0.7382, [26]) was used for comparison of the
Fig. 3. Residual (min) versus number of model compounds n = 10, 16, 20. A) T3, positive ESI mode; B) T3, negative ESI mode; C) C18, positive ESI mode; D) C18, negative ESI mode. k: number of validation compounds. The horizontal black lines represent the 95th percentile.
observed extracted fragment chromatograms (fragment XICs) for the fragments contained in the AMML library (m/z values and relative intensities). Each fragment XIC was manually inspected. Urine samples were screened for selected compounds of the AMML library with MasterView 1.1 and PeakView 2.2 (both Sciex). Retention time shifts were small, as shown by the five or four isotopically labelled internal standards, respectively, which were well distributed over the complete LC gradient.
2.8. Retention time prediction and molecular descriptors
Twelve molecular descriptors were selected from ACD/Labs® (Release 12.00, Advanced Chemistry Development, ON, Canada). The logD values were calculated based on a database containing 18,500 experimental logP values and 15,900 pKa values (ACD Labs 2012, Build 2076, July 25, 2012) according to six different combinations of logP and pKa models: logD1 (consensus logP and GALAS pKa), logD2 (consensus logP and classic pKa), logD3 (classic logP and GALAS pKa), logD4 (classic logP and classic pKa), logD5 (GALAS logP and GALAS pKa), logD6 (GALAS logP and classic pKa). The molecular weight (MW), molecular volume (MV), molecular refractivity (MR), number of hydrogen bond acceptors (HBA), number of hydrogen bond donors (HBD) and the topological polar surface area (TPSA) were calculated with ACD Labs 2010. The molecular descriptors used for the 532 AMML library compounds are shown for positive mode at mobile phase pH = 3.0 in Table S1 and for negative mode at pH = 8.0 in Table S2, Supplemental information. The models were implemented in the R environment with the RStudio interface [27]. The following packages were used: "dplyr" to manage data frames, "lme4" to perform MLR, "prospectr" for the Kennard-Stone algorithm and "xlsx" to create Excel outputs. The R script used is provided in Supplemental information (RScript_RTPrediction.txt), as well as the input file for the R script (Input R-Script.csv).
3. Results and discussion
The focus of the present work is to investigate the use of a simple prediction model for ranking LC retention times, using a small set of reference compounds in addition to MS/MS spectral libraries, with the emphasis on improving the identification of isobaric metabolites. The challenge of using LC retention times for identification is that it requires either the use of reference compounds or that the analysis of unknown samples is performed under exactly the same LC conditions as those used to generate the retention time library. Batch-to-batch variation in LC columns or different LC dead volumes can jeopardize the use of LC retention time libraries. In-silico libraries therefore offer a very attractive alternative to experimentally determined retention times. The various possible approaches for the use of LC retention time in addition to mass spectrometric data can be considered and are
Fig. 4. Extracted ion currents of the 135 metabolites identified in urine with the Waters T3 column (ESI positive and negative: 37 compounds; ESI positive only: 45 compounds; ESI negative only: 53 compounds). A) pH = 3.0, positive ESI; B) pH = 8.0, negative ESI; C) extracted ion current of precursor m/z 377.146 ([M+H]+) corresponding to riboflavin.
summarized in Fig. 1. The first approach assumes that the retention time of each analyte is available for the given chromatographic separation. A limited set of model compounds (typically n = 16–20) is used to check the chromatographic performance or to realign the library retention times, mostly after hardware changes (pumps, injector). Analyte identification is then performed based on a spectral match with the library within a defined LC retention time window. In the second case, where LC columns from different manufacturers are considered, the model compounds are used to recalculate the RT of the complete library. Analyte identification is then again performed based on a spectral match with the library within a defined LC retention time window. One specificity of data independent acquisition, such as SWATH, is that product ion fragments are generated for any precursor ion. With the use of a collision energy spread, a SWATH spectrum contains the precursor ion as well as the fragments. Product ion spectra can be processed in different ways: i) a classical MS/MS spectral library search can be performed on the SWATH spectra using various scoring algorithms, or ii) extracted ion current profiles can be generated using the m/z of the precursor and/or selected fragments. The second way can be attractive to search for isomeric metabolites that may not be present in the MS/MS spectral library. In the last scenario, a library search based on the precursor ion extracted ion current is performed without the constraint of the LC retention time. When several additional peaks are
observed, hypothesis-driven generation of compound structures (structural modifications of the library compounds, e.g. isomers, analogues) is considered and the retention times are predicted based on molecular descriptors.
3.1. Retention time prediction
The approach builds on previous work where the LC Simulator approach from ACD Labs was applied to enable the rapid screening of co-medication interferences in quantitative LC-SRM/MS analysis; it is scalable to any analyte within the applicability domain of the model and transferable to any other LC–MS system [28]. The LC Simulator uses QSRR with multiple linear regression between the measured retention factors of a set of model compounds and their logP or logD(pH), combined with either the molecular volume, weight or refractivity. The molecular descriptors were selected depending on the model with the best fit (i.e. the regression coefficient). The retention model was extended and applied to a diverse set of 532 metabolites contained in our in-house accurate mass metabolite library (AMML library). LC retention could be measured for 337, 228 and 238 metabolites at pH = 3.0 with ESI positive mode detection and for 410, 233 and 215 metabolites at pH = 8.0 with ESI negative mode detection on the T3, C18 and F5 columns, respectively (Supplemental information, Tables S1 and S2). Prediction models were built for two reversed-phase C18 columns (T3, C18) from different suppliers with the
correlated molecular descriptors and different combinations with regard to the split between the number of model and validation compounds (data not shown) and obtained similar results. More elaborate models than multiple linear regression (e.g. random forest) might result in better predictions, but these approaches often require a large number of model compounds. Reducing the number of required reference compounds is important because pure reference compounds (> 99% purity) for high-resolution mass spectrometry, and in particular isotopically labelled internal standards, are expensive.
different mobile phase conditions (Supplemental information, Tables S1 and S2). The QSRR model allows the prediction of LC retention times for basically any metabolite within the applicability domain of the model based on its chemical properties (i.e. molecular descriptors).
3.1.1. Selection of molecular descriptors
The basis for a QSRR model is the selection of appropriate molecular descriptors. In this study we selected twelve molecular descriptors (MDs), which were calculated with the ACD Labs package, and a model was developed in the R environment. Six logD(pH) values were calculated according to the different models proposed in ACD Labs 2012 and used together with the molecular weight (MW), molecular volume (MV), molecular refractivity (MR), the number of hydrogen bond acceptors (HBA), the number of hydrogen bond donors (HBD) and the topological polar surface area (TPSA). Unfortunately, ACD Labs 2012 is not able to calculate the molecular volumes or refractivities for charged tertiary amines. These compounds (i.e. eight metabolites out of the 532 of the AMML library) were not considered for further investigations. ChemAxon was also considered as an alternative to ACD Labs, but a direct comparison was not within the scope of the present work. The molecular descriptors for the 528 AMML library compounds at pH = 3.0 and at pH = 8.0 are shown in Tables S1 and S2, Supplemental information. This set of twelve molecular descriptors was reduced by pairwise comparison of the correlation between the descriptors. The correlation plots for the twelve MDs are shown in Figs. S1 and S2, Supplemental information. As the aim of the work is to predict retention windows based on only a small set of model compounds, the goal was to use the simplest possible model, and only the molecular descriptors with the lowest correlation were considered. The retained molecular descriptors in positive and negative ESI mode were logD2, MW, MV, HBD and HBA.
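The pairwise correlation screening described above can be illustrated with a short Python sketch. It is not the authors' R script: the greedy rule, the function names and the cut-off of |r| = 0.9 are assumptions made for illustration, and the descriptor values are invented toy numbers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def prune_correlated(descriptors, threshold=0.9):
    """Greedily keep descriptors whose |r| with every kept one is below threshold.

    descriptors: dict mapping descriptor name -> list of values.
    threshold:   hypothetical cut-off, not taken from the paper.
    """
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values (invented): MR tracks MV closely and is therefore dropped.
descriptors = {
    "logD": [1.0, -2.0, 0.5, 3.0, -1.0],
    "MV": [100.0, 200.0, 150.0, 300.0, 250.0],
    "MR": [10.1, 20.2, 15.0, 30.3, 24.9],
}
print(prune_correlated(descriptors))
```

Because the rule is greedy, the order in which descriptors are listed decides which member of a correlated pair survives; listing the descriptor of greatest chemical interest first is one way to control that.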
The following filters were applied: i) retention times from the C18 and T3 columns with k* ≥ 0.5 and ii) logD values between −5 and +6.
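As an illustration of these two filters, the Python sketch below computes the retention factor from the retention time and the column dead time and then applies both limits. The standard definition k = (tR − t0)/t0 is used; the function names are hypothetical, and treating the logD limits as inclusive is an assumption, since the text does not say whether the bounds themselves are included.

```python
def retention_factor(rt_min, t0_min):
    """Retention factor k = (tR - t0) / t0."""
    return (rt_min - t0_min) / t0_min


def in_chemical_space(rt_min, logd, t0_min,
                      k_min=0.5, logd_min=-5.0, logd_max=6.0):
    """True if a compound passes the retention factor and logD filters.

    The inclusive logD bounds are an assumption of this sketch.
    """
    k = retention_factor(rt_min, t0_min)
    return k >= k_min and logd_min <= logd <= logd_max


# Dead time of 1.0 min, as reported for the 150 mm Waters T3 column.
T0_T3 = 1.0
print(in_chemical_space(1.2, 0.0, T0_T3))   # elutes close to the dead time
print(in_chemical_space(6.3, 1.0, T0_T3))   # retained, logD within range
print(in_chemical_space(6.3, 6.5, T0_T3))   # logD above the upper limit
```

With a dead time of 1.0 min, the k ≥ 0.5 criterion corresponds to retention times of at least 1.5 min on the T3 column.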
3.1.4. Proposed set of model compounds
The minimum number of model analytes needed to build the linear regression is 3, since Eq. (1) has three coefficients. Table 1 presents the recommended model compounds for 3–20 analytes, while Table S4 (Supplemental information) presents the model coefficients. The residuals (min) were calculated for n = 3 to n = 20 and are shown for n = 10, 16 and 20 in Fig. 3. From n = 16 onwards the residuals are below 4 min. We were able to develop a robust QSRR model for retention prediction. The predicted retention window (respectively the RMSE) was below 4 min for the models with only 16 model compounds, for both C18-type columns and both pH conditions (pH = 3.0 and pH = 8.0). The goal of this approach was not to obtain predictions as accurate as possible but to show that it is possible to predict retention windows of a certain width (i.e. below 4 min) based on only a small set of model compounds, and to use the Kennard-Stone algorithm to select the model compounds from a diverse set of compounds. The resulting general prediction equation, Eq. (1), is shown below, where PRT is the predicted retention time, logD(pH) is the logD at the mobile phase pH and MV is the molecular volume; the coefficients a, b and c are calculated for each set of compounds and LC conditions (see Table S5).

$$\mathrm{PRT}(\text{compound}) = a \cdot \log D(\mathrm{pH}) + b \cdot \mathrm{MV} + c \qquad (1)$$
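A minimal Python sketch of fitting Eq. (1) by ordinary least squares is given below. It is not the authors' R implementation: the function names are hypothetical, the fit uses the normal equations solved by Gaussian elimination to stay dependency-free, and the retention times are synthetic values generated from invented coefficients (a = 2.0, b = 0.02, c = 3.0) purely to show that the fit recovers them. The descriptor values are the first five pH = 3.0 entries of Table 1.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_prt(logd, mv, rt):
    """Least-squares coefficients [a, b, c] of PRT = a*logD + b*MV + c."""
    rows = [[d, v, 1.0] for d, v in zip(logd, mv)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, rt)) for i in range(3)]
    return solve_linear(xtx, xty)


def predict_rt(coeffs, logd, mv):
    """Evaluate Eq. (1) for one compound."""
    a, b, c = coeffs
    return a * logd + b * mv + c


# Descriptors of the first five pH = 3.0 model compounds in Table 1.
logd = [3.75, -0.80, -0.49, -1.42, 1.66]
mv = [347.8, 81.2, 214.5, 147.8, 281.3]
# Synthetic retention times from invented coefficients (not measured data).
true_coeffs = (2.0, 0.02, 3.0)
rt = [predict_rt(true_coeffs, d, v) for d, v in zip(logd, mv)]

coeffs = fit_prt(logd, mv, rt)
print([round(x, 4) for x in coeffs])
```

With real data the measured retention times of the model compounds would replace the synthetic `rt` list, and the fitted coefficients would then be applied to the descriptors of every library compound.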
3.1.2. Selection of model compounds
In routine work, it is highly preferable to model the retention on a given LC setup with a limited set of analytes, as previously described by Andries et al. [14]. The correlation between the retention factors for all three columns and pH conditions is shown in Figs. S3 and S4, Supplemental information. The best correlation was obtained between the two C18 columns (Waters T3 and C18). A lower correlation was found for the F5 column, most likely due to the differences in selectivity. To create the training and validation sets, we only included the compounds which were detected on all three columns with a retention factor k* > 0.5. This resulted in 148 compounds for positive mode and 103 compounds for negative mode. We then used the Kennard-Stone algorithm for the selection of up to 20 model compounds based on the univariate Euclidean distance for the uncorrelated molecular descriptors. This resulted in a set of 20 model compounds with an equidistant distribution over the investigated variable space (i.e. the molecular descriptors), which are listed in Table 1 for positive and negative ESI mode.
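The selection principle can be sketched in a few lines of Python. This is a generic textbook form of the Kennard-Stone rule, not the "prospectr" implementation used by the authors: the function name is hypothetical, the Euclidean distance is computed on unscaled (logD, MV) pairs, which is an assumption, and the candidate pool is a toy set of five pairs borrowed from Table 1 instead of the full pool of 148 or 103 compounds.

```python
from math import dist


def kennard_stone(points, n_select):
    """Return the indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    # Start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: dist(points[p[0]], points[p[1]]),
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Add the candidate whose nearest selected neighbour is farthest away.
        nxt = max(
            remaining,
            key=lambda i: min(dist(points[i], points[s]) for s in selected),
        )
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Toy candidate pool: (logD at pH 3.0, MV) pairs taken from Table 1.
names = ["Chlorogenic acid", "Chenodeoxycholic acid", "3-Chlorotyrosine",
         "2-Pyrrolidinone", "Cortisol"]
pool = [(-0.49, 214.5), (3.75, 347.8), (-1.42, 147.8),
        (-0.80, 81.2), (1.66, 281.3)]
chosen = kennard_stone(pool, 3)
print([names[i] for i in chosen])
```

On unscaled descriptors the molecular volume dominates the distance because its numerical range is much larger than that of logD; whether and how descriptors should be scaled before selection is a design choice that the sketch leaves open.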
3.2. Application of retention time prediction to the characterization of metabolite isomers in urine
Human urine samples, as well as the 16 model compounds presented in Table 1, were analysed with the Waters T3 column at pH = 3.0 and pH = 8.0. In urine, 135 metabolites were identified using the Waters T3 column (ESI positive and negative: 37 compounds, ESI positive only: 45 compounds, ESI negative only: 53 compounds) based on elemental formula, isotopic fit, spectral library search and experimental retention time library match (Fig. 4A and B). The metabolites identified in positive ESI mode are listed in Table S6 and those identified in negative ESI mode in Table S7 (Supplemental Info), and the measured retention times obtained on the Waters T3 column are compared with the predicted LC retention times. With the two mobile phase conditions (pH 3 and pH 8), 46 and 43 metabolites, respectively, provided acceptable residuals considering the selection criteria. Plasma and urine contain many polar metabolites which elute in the solvent front, and their modelling remains critical. Nevertheless, in our approach the MS/MS spectrum is the basis for identification, while RT prediction serves as a confirmation tool to avoid false positives and as a tool to rank isomeric metabolites. The use of different separation conditions, in particular different pH values, should improve the accuracy of the assignment. Another important challenge is false positive assignment. When extracting ion current profiles, even at high resolution, several peaks are often present in the LC trace. With data independent acquisition such as SWATH, the selectivity can be increased with the use of fragment ions. Riboflavin (C17H20N4O6) was identified in the urine sample with a retention time of 8.8 min. An extracted ion current trace of riboflavin ([M+H]+ m/z 377.1461) in human urine (Waters T3, pH = 3.0, positive ESI mode) is shown in Fig. 4C.
Two peaks are observed at the retention times 8.8 min (error = −2 ppm, −0.8 mmu) and 16.5 min (error = 15.1 ppm, 5.7
3.1.3. Model training and evaluation
Finally, only two variables were retained: logD2 and the molecular volume, for both C18-type columns in positive and negative ESI mode. At this point, the modelling for the F5 column was not pursued further. The model used in this study was considered not suited for this type of column, or at least for the specific column used during this work. Fig. 2 illustrates the residuals versus experimental retention times, logD2 and molecular volume for the T3 column at pH = 3.0 in positive ESI mode for all compounds in the data set (A–C) versus compounds inside the chemical space (D–F). The data for T3 in negative ESI mode and for C18 in positive and negative mode are presented in Figs. S5 to S7 (Supplemental Info). We did run some variations of the models with alternative
[3] J.G. Xia, I.V. Sinelnikov, B. Han, D.S. Wishart, MetaboAnalyst 3.0-making metabolomics more meaningful, Nucleic Acids Res. 43 (2015) W251–W257.
[4] D.J. Creek, A. Jankevics, R. Breitling, D.G. Watson, M.P. Barrett, K.E.V. Burgess, Toward global metabolomics analysis with hydrophilic interaction liquid chromatography–mass spectrometry: improved metabolite identification by retention time prediction, Anal. Chem. 83 (2011) 8703–8710.
[5] F. Aicheler, J. Li, M. Hoene, R. Lehmann, G. Xu, O. Kohlbacher, Retention time prediction improves identification in nontargeted lipidomics approaches, Anal. Chem. 87 (2015) 7698–7704.
[6] J. Stanstrup, S. Neumann, U. Vrhovsek, PredRet: prediction of retention time by direct mapping between multiple chromatographic systems, Anal. Chem. 87 (2015) 9421–9428.
[7] C.D. Broeckling, A. Ganna, M. Layer, K. Brown, B. Sutton, E. Ingelsson, G. Peers, J.E. Prenni, Enabling efficient and confident annotation of LC-MS metabolomics data through MS1 spectrum and time prediction, Anal. Chem. 88 (2016) 9226–9234.
[8] G.M. Randazzo, D. Tonoli, S. Hambye, D. Guillarme, F. Jeanneret, A. Nurisso, L. Goracci, J. Boccard, S. Rudaz, Prediction of retention time in reversed-phase liquid chromatography as a tool for steroid identification, Anal. Chim. Acta 916 (2016) 8–16.
[9] R. Bade, L. Bijlsma, T.H. Miller, L.P. Barron, J.V. Sancho, F. Hernández, Suspect screening of large numbers of emerging contaminants in environmental waters using artificial neural networks for chromatographic retention time prediction and high resolution mass spectrometry data analysis, Sci. Total Environ. 538 (2015) 934–941.
[10] F. Falchi, S.M. Bertozzi, G. Ottonello, G.F. Ruda, G. Colombano, C. Fiorelli, C. Martucci, R. Bertorelli, R. Scarpelli, A. Cavalli, T. Bandiera, A. Armirotti, Kernel-based, partial least squares quantitative structure-retention relationship model for UPLC retention time prediction: a useful tool for metabolite identification, Anal. Chem. 88 (2016) 9510–9517.
[11] R. Put, Y. Vander Heyden, Review on modelling aspects in reversed-phase liquid chromatographic quantitative structure–retention relationships, Anal. Chim. Acta 602 (2007) 164–172.
[12] M. Cao, K. Fraser, J. Huege, T. Featonby, S. Rasmussen, C. Jones, Predicting retention time in hydrophilic interaction liquid chromatography mass spectrometry and its use for peak annotation in metabolomics, Metabolomics 11 (2015) 696–706.
[13] K. Gorynski, B. Bojko, A. Nowaczyk, A. Bucinski, J. Pawliszyn, R. Kaliszan, Quantitative structure-retention relationships models for prediction of high performance liquid chromatography retention time of small molecules: endogenous metabolites and banned compounds, Anal. Chim. Acta 797 (2013) 13–19.
[14] J.P.M. Andries, H.A. Claessens, Y.V. Heyden, L.M.C. Buydens, Strategy for reduced calibration sets to develop quantitative structure–retention relationships in high-performance liquid chromatography, Anal. Chim. Acta 652 (2009) 180–188.
[15] M.H. Abraham, C.F. Poole, S.K. Poole, Classification of stationary phases and other materials by gas chromatography, J. Chromatogr. A 842 (1999) 79–114.
[16] J.S. Arey, W.H. Green, P.M. Gschwend, The electrostatic origin of Abraham's solute polarity parameter, J. Phys. Chem. B 109 (2005) 7564–7573.
[17] J.C. Bradley, M.H. Abraham, W.E. Acree Jr., A.S. Lang, Predicting Abraham model solvent coefficients, Chem. Cent. J. 9 (2015) 12.
[18] R.W. Kennard, L.A. Stone, Computer aided design of experiments, Technometrics 11 (1969) 137–148.
[19] M. Vinaixa, E.L. Schymanski, S. Neumann, M. Navarro, R.M. Salek, O. Yanes, Mass spectral databases for LC/MS- and GC/MS-based metabolomics: state of the field and future prospects, TrAC Trends in Anal. Chem. 78 (2016) 23–35.
[20] L.C. Gillet, P. Navarro, S. Tate, H. Rost, N. Selevsek, L. Reiter, R. Bonner, R. Aebersold, Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis, Mol. Cell. Proteomics 11 (2012) O111.016717.
[21] R. Bonner, G. Hopfgartner, SWATH acquisition mode for drug metabolism and metabolomics investigations, Bioanalysis 8 (2016) 1735–1750.
[22] D. Siegel, A.C. Meinema, H. Permentier, G. Hopfgartner, R. Bischoff, Integrated quantification and identification of aldehydes and ketones in biological samples, Anal. Chem. 86 (2014) 5089–5100.
[23] D.S. Wishart, T. Jewison, A.C. Guo, M. Wilson, C. Knox, Y. Liu, Y. Djoumbou, R. Mandal, F. Aziat, E. Dong, S. Bouatra, I. Sinelnikov, D. Arndt, J. Xia, P. Liu, F. Yallou, T. Bjorndahl, R. Perez-Pineiro, R. Eisner, F. Allen, V. Neveu, R. Greiner, A. Scalbert, HMDB 3.0—The human metabolome database in 2013, Nucleic Acids Res. 41 (2013) D801–D807.
[24] G. Hopfgartner, E. Varesio, L. Burton, E. Duchoslav, R. Bonner, SWATH libraries and common fragment libraries for metabolites identification in urine, 62nd ASMS Conference on Mass Spectrometry and Allied Topics, Baltimore, USA, 2014.
[25] Y. Zhang, A. Bilbao, T. Bruderer, J. Luban, C. Strambio-De-Castillia, F. Lisacek, G. Hopfgartner, E. Varesio, The use of variable Q1 isolation windows improves selectivity in LC-SWATH-MS acquisition, J. Proteome Res. (2015), http://dx.doi.org/10.1021/acs.jproteome.5b00543.
[26] B. MacLean, D.M. Tomazela, N. Shulman, M. Chambers, G.L. Finney, B. Frewen, R. Kern, D.L. Tabb, D.C. Liebler, M.J. MacCoss, Skyline: an open source document editor for creating and analyzing targeted proteomics experiments, Bioinformatics 26 (2010) 966–968.
[27] RStudio Team, RStudio: Integrated Development for R, RStudio, Boston, MA, 2015.
[28] T. Bruderer, E. Varesio, S. Winter, G. Hopfgartner, In silico prediction for the investigation of comedication interferences in quantitative LC–MS detection in the SRM mode, Bioanalysis 4 (2012) 1907–1917.
mmu). The predicted LC retention time for riboflavin was calculated according to Eq. (2) (PRT: predicted retention time; logD1 and MV calculated with ACD Labs 2012):

$$\mathrm{PRT}(\text{riboflavin}) = 2.00 \cdot \mathrm{logD1}\,(\mathrm{pH} = 3.0) + 0.03 \cdot \mathrm{MV} + 2.13 = 6.8\ \text{min} \qquad (2)$$

This results in a predicted LC retention window of 6.8 ± 2.9 min, using the RMSE for the Waters T3 column at pH = 3.0 (see Table S4), and confirms the peak at 8.8 min as riboflavin. This is in good accordance with the larger mass error (15.1 ppm) obtained for the peak at 16.5 min. The use of multiple parameters to eliminate false positives is therefore of great importance.

As shown, liquid chromatography retention time (LC RT) prediction has some limitations regarding accuracy. In previous work we found that LC RT prediction is most powerful as a ranking tool [28]. Because many compounds in the metabolome are isomeric, the prediction was evaluated with regard to the ranking of the elution order. The ranking for 13 isomeric metabolite pairs or triplets present in the AMML library is presented in Table 2 for the Waters T3 column at pH = 3.0. Three pairs and one triplet did not show any distinction in their predicted retention times and one pair was ranked incorrectly, but seven pairs and two triplets showed the same elution order for measured and calculated LC RT. In addition to the SWATH MS/MS spectra, which are available for any precursor ion, the LC retention time of any postulated isomeric metabolite can be predicted and investigated, which is of great benefit because many metabolites are not available as reference compounds.

4. Conclusions

The application of predicted LC retention time to support metabolite identification was evaluated for a metabolomics database containing 532 compounds from all the major metabolite classes of the Human Metabolome Database. A prediction model was built based on the QSRR approach by MLR with two molecular descriptors, logD2 and the molecular volume, for both types of C18 columns in either positive or negative ESI mode.
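How such a two-descriptor linear model is applied can be illustrated with a minimal sketch. It is written in Python for illustration only (the models in this work were implemented in R) and is not the authors' code. The coefficients are those printed in Eq. (2) for the Waters T3 column at pH = 3.0, and the riboflavin values (PRT 6.8 min, RMSE 2.9 min, candidate peaks at 8.8 and 16.5 min) are taken from the text. The function names `predict_rt`, `within_window` and `same_elution_order` are hypothetical, and no descriptor values for real compounds are assumed.

```python
from typing import Sequence

# Coefficients as printed in Eq. (2) (Waters T3 column, pH = 3.0).
SLOPE_LOGD = 2.00
SLOPE_MV = 0.03
INTERCEPT = 2.13


def predict_rt(log_d: float, molecular_volume: float) -> float:
    """Predicted retention time (min) from logD and molecular volume."""
    return SLOPE_LOGD * log_d + SLOPE_MV * molecular_volume + INTERCEPT


def within_window(measured_rt: float, predicted_rt: float, rmse: float) -> bool:
    """True if a measured peak lies inside predicted_rt +/- rmse."""
    return abs(measured_rt - predicted_rt) <= rmse


def same_elution_order(predicted: Sequence[float], measured: Sequence[float]) -> bool:
    """True if ranking isomers by predicted RT reproduces the measured order.

    Ties keep their input order (stable sort), so isomers with identical
    predicted RT should be flagged separately.
    """
    by_pred = sorted(range(len(predicted)), key=lambda i: predicted[i])
    by_meas = sorted(range(len(measured)), key=lambda i: measured[i])
    return by_pred == by_meas


# Values quoted in the text for riboflavin.
PRT_RIBOFLAVIN = 6.8   # min, result of Eq. (2)
RMSE_T3_PH3 = 2.9      # min, Waters T3 column, pH = 3.0
candidates = [8.8, 16.5]  # min, the two candidate peaks

kept = [rt for rt in candidates if within_window(rt, PRT_RIBOFLAVIN, RMSE_T3_PH3)]
print(kept)  # [8.8]
```

Only the 8.8 min peak falls inside the 6.8 ± 2.9 min window, in line with the assignment discussed for riboflavin. The `same_elution_order` helper reflects the use of the prediction as a ranking tool for isomers instead of an absolute retention time filter.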
Sixteen model compounds (LC CalMix) representative of the complete library could be used to build a model of sufficient quality to enable identification based on ranking. The models were implemented in the R environment and can be applied to any library. The set of model compounds was validated on two different C18 reversed-phase LC columns and mobile phase conditions (pH = 3.0 and pH = 8.0) with comparable prediction accuracy. In addition to accurate mass, isotopic matching and SWATH spectral library search, LC retention time prediction using an LC CalMix was found to be attractive to enhance metabolite identification, in particular for isomeric metabolites, and to exclude false positive identifications.

Acknowledgements

We would like to thank Marco Randazzo for discussions regarding retention time prediction modelling and to acknowledge David Bell, who provided the two Ascentis LC columns used during this study.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.jchromb.2017.07.016.

References

[1] C.H. Johnson, J. Ivanisevic, H.P. Benton, G. Siuzdak, Bioinformatics: the next frontier of metabolomics, Anal. Chem. 87 (2015) 147–156.
[2] E. Rathahao-Paris, S. Alves, C. Junot, J.-C. Tabet, High resolution mass spectrometry for structural identification of metabolites in metabolomics, Metabolomics 12 (2015) 10.