The chemical space for non-target analysis


Accepted Manuscript

The chemical space for non-target analysis
Boris L. Milman, Inna K. Zhurkovich

PII: S0165-9936(17)30270-4
DOI: 10.1016/j.trac.2017.09.013
Reference: TRAC 15005
To appear in: Trends in Analytical Chemistry
Received Date: 25 July 2017
Revised Date: 5 September 2017
Accepted Date: 11 September 2017

Please cite this article as: B.L. Milman, I.K. Zhurkovich, The chemical space for non-target analysis, Trends in Analytical Chemistry (2017), doi: 10.1016/j.trac.2017.09.013. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


The chemical space for non-target analysis

Boris L. Milman a,* and Inna K. Zhurkovich b

a Institute of Experimental Medicine, ul. Akad. Pavlova 12, 197376 Saint Petersburg, Russia
b Institute of Toxicology, ul. Bekhtereva 1, 192019 Saint Petersburg, Russia

ABSTRACT

The review describes chemical space, i.e. the set of known and possible compounds, and presents the use of the corresponding chemical data in non-target analysis. The implementation of non-target analysis is briefly considered. General and dedicated chemical databases are outlined. Citation and co-citation of chemical compounds in databases are considered. The transfer of data from high-resolution mass spectrometry to chemical databases is noted to be the key stage of modern non-target analysis. The retrieved structures are further filtered with the use of reference and computational data. Related issues are also addressed.

Keywords: Non-target analysis, Mass spectrometry, Chemical database, Chemical identification, Citation

Abbreviations: CAS, Chemical Abstracts Service; CID, Collision-induced dissociation; EPA, Environmental Protection Agency; EI, Electron ionization; ESI, Electrospray ionization; GC, Gas chromatography; HMDB, Human Metabolome Database; HPLC, High-performance liquid chromatography; HILIC, Hydrophilic interaction liquid chromatography; HRMS, High-resolution mass spectrometry; LC, Liquid chromatography; MS, Mass spectrometry; MS1, One-stage mass spectrometry; MS2, Tandem mass spectrometry; MSn, Multi-stage mass spectrometry; NMR, Nuclear magnetic resonance; PAH, Polycyclic aromatic hydrocarbon(s); QuEChERS, Quick Easy Cheap Effective Rugged and Safe; RI, Retention index(ices); RP, Reversed phase; RT, Retention time(s); UPLC, Ultraperformance liquid chromatography.

* Phone: +7(921) 766 52 96. Fax: +7 (812) 692 26 54. E-mail: [email protected]; [email protected].

Contents

1. Introduction
2. Non-target analysis
3. A chemical space
4. Chemical databases
4.1. General
4.2. Citation and co-citation
5. Prior, intermediate, and posterior data search
6. Searching chemical databases as an analytical workflow
6.1. Mass spectrometry
6.2. Chromatography
6.3. Interlaboratory comparisons in identification
6.4. Structure elucidation
7. Conclusions
References

1. Introduction

Progress in analytical instrumentation, chemical informatics, and related fields has resulted in powerful techniques of chromatography-mass spectrometry. First of all, this concerns HRMS and its combinations with HPLC/UPLC, which provide very high selectivity and sufficient sensitivity. These features are necessary for a special kind of analytical determination, namely non-target analysis (also called non-targeted analysis, untargeted analysis, or unknown analysis). In this analysis, the sample composition is unknown to the analyst before the analytical procedures are performed. High selectivity and sensitivity are not the only challenges here. One should also note the demands on (a) sample preparation procedures, (b) methods for processing raw mass spectrometric and chromatographic data, and (c) sophisticated analysis (interpretation) of the information extracted from the raw data. These challenges are increasingly met in modern chromatography-mass spectrometry, as seen from the articles on various fields of non-target analysis [1-29] (Table 1).

In the identification processes of general non-target analysis, all known chemical compounds (considered here as ‘unknown knowns’) are taken into account, and the presence of unknown ones (‘unknown unknowns’) is not excluded. The set of known, or known and possible, compounds constitutes the chemical space. The space is characterized by records in chemical databases that contain structures of compounds together with other identifiers and the corresponding properties, features, values, references, links, and so on. Different samples/matrices (Table 1) contain analytes from different chemical subspaces. Analysts access chemical databases at different stages of the analytical workflow, from gathering the prior data to estimating the plausibility of the results. The direct use of chemical data is increasingly important in identification based on mass spectrometry and chromatography. A holistic view of these operations has not been addressed sufficiently in previous reviews on non-target analysis (Table 1); it is essential to progress in the field and hence is chosen as the subject of this article. We will consider the principles of non-target analysis, the size of the chemical space and its subspaces, the principal chemical databases and their use, and the incorporation of chemical data in analytical workflows. The review is restricted to low-molecular organic compounds (whose molecular mass is hundreds of Da, with a maximum of 1000-2000 Da).

2. Non-target analysis

Non-target analysis can be divided into two categories: semi-target analysis (formally covered by the same term, see [30]) and non-target analysis proper. In the first case, classes of compounds (e.g., metabolites) or some individual substances are expected or even known before the analysis is performed. Their properties, e.g. toxicity or odor, may also be known. Semi-target analysis is often called suspect analysis or screening [2, 4]. Comprehensive non-target determinations arise when potential analytes are essentially unlimited in number and origin. The task is especially complicated if the nature of the matrix is not clear. In this article, both types of non-target analysis are considered together. The general strategy is depicted in Fig. 1. Various strategies of non-target analysis have been widely discussed in the literature [4, 5, 11, 12, 26, 27].

The analysis begins with sample treatment intended to separate analytes from the matrix and foreign substances, to separate analytes from each other, and to concentrate analyte solutions. Different extraction procedures are commonly used. In non-target analysis, all the unknown analytes should be present in the subsamples prepared for determination. A key problem is that chemical compounds differ from each other in many respects, beginning with general properties: analytes may be volatile or nonvolatile, ionic or nonionic, acidic or basic, hydrophobic or hydrophilic. Therefore, no single method of sample treatment is appropriate for every analyte. In practice, a combination of several extraction operations with different sorbents may be successful [6, 31]. In general, methods providing unselective extraction of diverse analytes, if they exist, should be preferred, although subsequent matrix effects, such as ion suppression in ESI-MS, may lead to the loss of some compounds [6]. The QuEChERS method seems to be suitable for both non-polar and polar analytes [12, 32, 33]. As advanced guides for non-target sample preparation, multi-analyte methods validated to a greater or lesser extent (see, e.g., [32, 34-36]) are recommended.

Various hyphenated techniques of chromatography-mass spectrometry have been used in the analysis under consideration. For volatile compounds, GC-MS is commonly used, with some innovations such as GCxGC-MS, GC-MS2, and GC-HRMS. A comparison of experimental EI mass spectra with reference spectra from mass spectral libraries [37, 38] yields candidates for identification (in other words, identification hypotheses). Experimental RI are also determined and serve as additional identification points. To support this kind of identification, a large reference database of gas-chromatographic RI [39] was developed.
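As an illustration of how an experimental RI can be obtained and compared with a reference value, the sketch below computes a linear (temperature-programmed) retention index from the retention times of the analyte and of the two n-alkanes that bracket it. It uses the widely known van den Dool and Kratz formula rather than a procedure taken from this review; the function names, the retention times, and the tolerance of 10 index units are assumptions made for the example.

```python
def linear_retention_index(t_analyte, t_alkane_before, t_alkane_after, carbons_before):
    """Linear retention index for temperature-programmed GC.

    t_analyte        retention time of the analyte
    t_alkane_before  retention time of the n-alkane eluting just before it
    t_alkane_after   retention time of the next n-alkane (one more carbon)
    carbons_before   carbon number of the earlier-eluting n-alkane
    """
    if t_alkane_after == t_alkane_before:
        raise ValueError("bracketing n-alkanes must have different retention times")
    if not t_alkane_before <= t_analyte <= t_alkane_after:
        raise ValueError("analyte must elute between the two bracketing n-alkanes")
    fraction = (t_analyte - t_alkane_before) / (t_alkane_after - t_alkane_before)
    return 100.0 * (carbons_before + fraction)


def ri_matches(experimental_ri, reference_ri, tolerance=10.0):
    """True if the experimental RI agrees with a reference RI within the tolerance.

    The default tolerance is a hypothetical value chosen for illustration.
    """
    return abs(experimental_ri - reference_ri) <= tolerance


# Made-up retention times (min): the analyte elutes between the C15 and C16 n-alkanes.
ri = linear_retention_index(12.5, 12.0, 13.0, 15)
print(ri)                       # 1550.0
print(ri_matches(ri, 1556.0))   # True
```

A candidate from the mass spectral library search would be kept only if its reference RI passes such a check on a comparable stationary phase.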

Non-volatile compounds are more difficult targets for identification. The reason is that their mass spectra (MS2, MSn), usually acquired by electrospray ionization and subsequent collision-induced dissociation, depend on many factors and are not very reproducible [37]. The corresponding MS libraries are not yet complete, and therefore the prediction of reference mass spectra is becoming increasingly important (see below), although such in silico spectra may not provide unambiguous identification [40]. HRMS and HRMS2 techniques have increasingly been involved in solving non-target problems [2, 4-6, 9, 11, 12, 18, 22, 26, 27, 30, 37, 41]. These techniques make it possible to generate molecular and fragment ion formulas. A molecular formula search in chemical databases, or even a search by the accurate ion mass, leads to known compounds as candidates for identification. ESI-MS may be used without or with LC. In the first case, direct injection of solutions into the ESI mass spectrometer formally provides conditions for the ionization of all dissolved analytes, but ion suppression prevents uniform coverage of analytes [6, 29]. In combination with mass spectrometry, RP-HPLC is commonly used. HILIC columns may improve the separation of polar analytes [4, 6, 12, 42]. Further, UPLC was introduced as an advanced version of HPLC [43]. Estimation of retention parameters (such as RT) is not as popular as in GC but is coming into use [4, 6, 42].

The reliability of non-target analysis can be considered in two ways. First, the trueness of identification of individual compounds unambiguously detected in analytical procedures can be estimated. A true identification result is provided by at least two independent (orthogonal) methods, supplemented by co-analysis with analytical standards [37, 57]. The second estimate is the completeness of analysis, defined as the number (percentage) of analytes reliably identified in the sample. This rate cannot be estimated directly because the original sample composition is unknown. The completeness of analysis can be described indirectly, e.g. (a) by specifying all the techniques and methods used together with their performance characteristics (instrumental resolution, limit of detection, and so on) and/or (b) by reference to the successful participation of the laboratory in pertinent interlaboratory comparisons (see below) or proficiency tests.

The established practice is that non-target analysis does not require immediate quantitation of identified analytes. However, when quantitative estimates are critical for solving a particular problem, a suitable method can be found in the literature, or an ad hoc version can be developed in-house using analytical standards selected among the numerous chemicals (see below) supplied by vendors.

At different stages of non-target analysis, non-spectral and non-chromatographic information on chemical compounds is used directly or indirectly, as shown in Fig. 1. In comprehensive determinations, a large number of candidate compounds should be taken into account, i.e. one could say that the analyst enters the full chemical space. In semi-target/suspect screening, the number of possible analytes is appreciably reduced.

3. A chemical space

By definition, the results of non-target analysis lie within the boundaries of the chemical space or its subspaces. In general, an unlimited number of chemical compounds, predominantly unknown ones, can theoretically be synthesized by chemists; an enormous number can be produced in nature. When a limiting molecular mass or number of atoms is set, a finite size of the virtual chemical space can be derived (Table 2). Problems with synthesis, decomposition, and enhanced chemical reactivity of many compounds decrease this number by many orders of magnitude. This is evident, e.g., from the fact that the relationships between elements in known molecules obey certain rules [59].

In the determination of unknown knowns, the chemical space is much smaller in size. Two estimates can be advanced in this regard. The first is the number of synthesized and potentially synthesizable compounds together with those isolated from natural sources: 400 000 000 (the ZINC database, Table 2). Of these, 14 000 000 chemicals can be quickly supplied by vendors [47]. The second estimate is the number of compounds and substances registered in the largest chemical data system: > 129 000 000 (CAS, Table 2).

The composition of individual real-world samples is certainly far poorer than the space of all known compounds. This applies to all three basic matrix types: biological, environmental, and food (Table 1). Tens of millions of compounds of biological value are known, and the number of common/abundant compounds in different biological matrices is less than one million (the Metlin and HMDB databases, Table 2). The estimate is approximately the same for environmental matrices, as demonstrated by the EPA database (Table 2). The size of the subspace of abundant compounds entering the environment, i.e. many hundreds of thousands, can also be derived in a different way. This subset contains dangerous (regulated) chemicals supplemented with their decomposition/transformation products and metabolites. Accordingly, the number of original chemicals (approx. 350 000, Table 2) should be multiplied several-fold. Another subspace, that of food, looks less populated (about 27 000, Table 2); additional accounting for adulterants, agrochemical residues, veterinary drugs and their metabolites, and other contaminants [17] may increase its size up to the level of biological matrices.

Thus, the total number of common analytes in biological and environmental matrices is estimated as 10^5-10^6 individual compounds, which represents ≤ 1 % of the more than 10^8 known compounds (or substances). The rest of the full space, i.e. ≈ 99 % of compounds, consists of rare ones, which agrees with the previous estimate [37]. As a whole, rare substances are less frequently cited than the ≈ 1 % of abundant compounds [37], and it is the identification of rare molecules, for which few reliable experimental reference data exist, that is the really hard problem.
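The share of abundant compounds quoted above follows from simple arithmetic, which the short sketch below reproduces. It takes the lower bound of 129 000 000 registered substances (CAS, Table 2) as the size of the known space; the function name is an assumption made for the example, and the result is only an order-of-magnitude estimate.

```python
def share_percent(subset_size, space_size):
    """Percentage of the chemical space occupied by a subset of compounds."""
    return 100.0 * subset_size / space_size


KNOWN_COMPOUNDS = 129_000_000  # lower bound quoted in the text for CAS (Table 2)

# Lower and upper estimates of the number of common analytes (10^5 and 10^6).
for common in (10**5, 10**6):
    share = share_percent(common, KNOWN_COMPOUNDS)
    print(f"{common:>9,d} common analytes: {share:.2f} % of known compounds")
```

Both estimates stay below 1 %, in line with the statement that roughly 99 % of known compounds are rare.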

4. Chemical databases


4.1. General

Chemical space and its subspaces can be said to be contained in chemical databases (Table 2), also called structure databases. There are three large general chemical databases, CAS [48], PubChem [49], and ChemSpider [50], which combine bibliographic and factual information and are therefore relevant to non-target analysis in many respects. CAS is a commercial service accessible through the SciFinder or STN interface [48]. PubChem and ChemSpider are better suited to the purpose under consideration because they are non-commercial and give more direct factual information, first of all the accurate monoisotopic mass of molecules, which is crucial for identification with the use of HRMS. These two databases also contain values of some important properties, such as pKa and logP, that could be used for the enumeration of candidates for identification [60].
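To illustrate why the monoisotopic mass matters for HRMS-based searching, the sketch below computes it from a molecular formula and keeps only the candidate formulas that agree with a measured mass within a ppm tolerance. It is a minimal example under stated assumptions: the element table is a small subset with masses rounded to six decimals, the parser handles only simple formulas without brackets, the measured value is treated as a neutral mass (adduct and electron-mass corrections are left out), and the 5 ppm tolerance and the candidate list are made up.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes, rounded to six decimals.
MONOISOTOPIC_MASS = {
    "C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915,
    "P": 30.973762, "S": 31.972071, "F": 18.998403, "Cl": 34.968853,
}


def parse_formula(formula):
    """Turn a simple formula such as 'C20H12' into {'C': 20, 'H': 12}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Monoisotopic mass of a neutral molecule with the given formula."""
    total = 0.0
    for element, count in parse_formula(formula).items():
        if element not in MONOISOTOPIC_MASS:
            raise ValueError(f"element {element!r} is not in the illustrative table")
        total += MONOISOTOPIC_MASS[element] * count
    return total


def ppm_error(measured, theoretical):
    """Relative mass deviation in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def candidates_within(measured_mass, formulas, tolerance_ppm=5.0):
    """Formulas whose monoisotopic mass matches the measured mass within the tolerance."""
    return [f for f in formulas
            if abs(ppm_error(measured_mass, monoisotopic_mass(f))) <= tolerance_ppm]


# Hypothetical measured neutral mass and a made-up list of candidate formulas.
print(round(monoisotopic_mass("C20H12"), 4))                              # 252.0939
print(candidates_within(252.0944, ["C20H12", "C19H24", "C16H12O3"]))     # ['C20H12']
```

In a real workflow the surviving formula, or the accurate mass itself, would then be used as the query to PubChem or ChemSpider.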

Modern software for processing mass spectral data may directly access chemical databases to provide effective online searches. Examples are the Bruker CompoundCrawler [61] and Thermo Sieve [62] software, which access ChemSpider and other data resources. The SmartMass program interprets MS2 spectra of candidates for identification retrieved by accurate ion mass from ChemSpider or PubChem [25]. Likewise, the MetFrag software retrieves candidate structures for their in silico fragmentation [63, 64]. Various issues of access to online chemical resources are discussed in the literature [65, 66]. For distributed chemical information, e.g. MS2 spectra of nonvolatile compounds [38] or references to some compounds and their properties, big data approaches [67] could be implemented.

Apart from the three databases mentioned above, there are other electronic chemical information resources, see [11, 23, 24, 37, 57, 68-70]. In-house collections of data extracted from common databases can be generated and used. Progress in chemical and related databases is addressed in the articles [71, 72].

4.2. Citation and co-citation

Bibliographic references are important for the identification of unknown knowns in many respects. Our previous works [37, 73-76] demonstrated that chemical databases, first of all CAS, can be used to calculate statistical rates of citation (occurrence) and co-citation (co-occurrence) of chemical compounds in the literature. The citation rate proved to be a relative measure of the abundance (in other words, popularity or significance) of compounds and of the related probability of their presence in the real-world samples to be analyzed. Highly cited compounds are considered as identification hypotheses that should be tested experimentally by means of MS, chromatography, or another technique [37, 75, 77, 78]. On the other hand, rare candidate compounds with low citation rates can be excluded from consideration for analytical purposes. Most known compounds are rare ones, see above.

Facts and rates of the co-citation of chemical compounds with each other and with matrices in the literature or databases make it possible to predict a priori a group of compounds present in the sample under analysis [37, 76]. The estimation of co-citations seems to be effective in semi-target analysis or in the course of non-target determinations, i.e. when some analytes are or become known.

These rates can be exemplified for PAH in Fig. 2. These compounds were identified in waste gas and subsequently searched for literature references in CAS [74]. In 2017, searches in ChemSpider and PubChem led to similar conclusions. For example, the retrieval of C20H12 gives the highest citation rank to benzo[a]pyrene, which was reliably identified previously by GC-MS [74]. Here, the compound has the maximum score in terms of four different citation rates (Fig. 2a). One of them is the number of references in PubChem (more exactly, in PubMed, the related bibliographic database). Citation rates, expressed as the numbers of corresponding references in the two information sources, are sufficiently correlated (Fig. 3).
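Ranking isomeric candidates by citation rate, as described for C20H12, can be sketched as follows. The candidate names and reference counts are placeholders invented for the example and are not values from CAS, ChemSpider, or PubChem; the function name and the handling of ties (equal counts share a rank) are likewise assumptions.

```python
def rank_by_citations(reference_counts):
    """Rank candidates by the number of literature references (highest first).

    reference_counts maps a candidate name to its reference count.
    Returns a list of (rank, name, count); candidates with equal counts share a rank.
    """
    ordered = sorted(reference_counts.items(), key=lambda item: (-item[1], item[0]))
    ranked = []
    previous_count = None
    rank = 0
    for position, (name, count) in enumerate(ordered, start=1):
        if count != previous_count:
            rank = position
            previous_count = count
        ranked.append((rank, name, count))
    return ranked


# Placeholder counts for four hypothetical isomers retrieved by one molecular formula.
counts = {"candidate A": 500, "candidate B": 20, "candidate C": 20, "candidate D": 0}
for rank, name, count in rank_by_citations(counts):
    print(rank, name, count)
```

The top-ranked candidate is only an identification hypothesis and still has to be tested experimentally, as stated above.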

Fig. 2b shows examples of some relevant references. Another reference from the list is shown individually (Fig. 2c). Here, the name of the environmental matrix, sediment, can be found in the title. Searches for four different PAH compounds led to the same reference. This is an example of co-citation of benzo[a]pyrene and those four PAH (Fig. 2c), calculated both between the compounds and between every PAH and this matrix. Reliably identified compounds and candidate compounds rejected by identification criteria (primarily isomers with similar mass spectra and retention parameters) differ appreciably in citation and co-citation rates, as shown in the previous research based on CAS [74]. A renewed search using the current versions of ChemSpider and PubChem confirms that conclusion: both the average citation score (and its rank) and the corresponding rate of co-citation with matrix names are appreciably higher for analytes than for candidate compounds (Table 3).
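Co-citation can be pictured as the overlap between the sets of references that mention two compounds, or a compound and a matrix. The sketch below counts shared references and expresses them as a fraction of the compound's own references. This normalization is an assumption made for illustration and is not necessarily the rate used in [37, 74, 76]; the reference identifiers and names are invented.

```python
def cocitation_count(refs_a, refs_b):
    """Number of references that mention both items."""
    return len(set(refs_a) & set(refs_b))


def cocitation_rate(refs_compound, refs_partner):
    """Fraction of the compound's references that also mention the partner.

    The partner may be another compound or a matrix name such as 'sediment'.
    Returns 0.0 when the compound has no references at all.
    """
    refs_compound = set(refs_compound)
    if not refs_compound:
        return 0.0
    return len(refs_compound & set(refs_partner)) / len(refs_compound)


# Invented reference identifiers for two compounds and one matrix.
references = {
    "compound A": {1, 2, 3, 4},
    "compound B": {3, 4, 5},
    "sediment": {2, 3, 4, 9},
}

print(cocitation_count(references["compound A"], references["compound B"]))  # 2
print(cocitation_rate(references["compound A"], references["sediment"]))     # 0.75
```

A candidate that is frequently co-cited with an already identified analyte and with the matrix would be given priority for experimental testing.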

5. Prior, intermediate, and posterior data search

A prior search of databases may be useful for advancing identification hypotheses and discovering candidate compounds [37, 75, 77]. This applies to both kinds of determination of unknowns. In semi-target analysis, the matrix type and the class of suspected chemical compounds are known. This simplifies the analytical problem because previous topical reports can be used to select a suitable technique/method or to estimate the prior probability of the presence of candidates in the matrix, e.g. the presence of pesticides in fruit and vegetable samples [79, 80].

At the intermediate stage of analysis, after the determination of an abundant or easily identifiable substance, solving the problem can be further facilitated. First, advanced analytical methods can be found in chemical databases, e.g. in the ‘Identification’ section of PubChem records, and used when the method applied earlier is not equally suited to all analytes. Second, a search for co-citation with the known analyte and the known matrix may lead to new candidates for identification. The discovery of new natural compounds begins with screening samples for known constituents, which is called ‘dereplication’ [23, 24]. Here, the use of the co-citation technique to detect other unknown knowns looks challenging. Some other approaches to involving prior data are discussed in the book [37].

Non-target analysis of an unknown matrix is far more complicated than that of a known one. Testing its solution in an appropriate solvent or its various extracts may result in the detection and identification of some sample constituents. The chemical databases can then be searched for those compounds, e.g. to retrieve co-citation data (see above).

Searches in chemical databases can also be made a posteriori. The presence of a particular compound in the matrix, confirmed by previous analytical reports, supports the plausibility of the corresponding result of non-target analysis [37].

6. Searching chemical databases as an analytical workflow

As a rule, combinations of mass spectrometry with gas or liquid chromatography have been used in the analyses under consideration, supplemented with the search for or generation of reference data and with various estimations. In the ideal case, an individual chromatographic peak and MS1 and MS2 spectra should be acquired for every analyte. That is not easy to do when analytical signals are low, and special approaches are required [81]. Therefore, the non-target analysis considered below is often limited to major constituents of samples.

6.1. Mass spectrometry

For many years, it was rather standard practice for analytical mass spectrometrists to search chemical handbooks and databases for candidate chemical compounds by the molecular formula obtained from HRMS, e.g. see [82, 83]. Since the publications using CAS [84] and ChemSpider [85], where the terms ‘known unknowns’ and ‘unknown unknowns’ were introduced into the chemical discourse, the number of relevant articles has increased considerably. It has become common practice to search ChemSpider [11-13, 35, 62, 85-88] and PubChem [4, 12, 16, 17, 25, 63, 67] for structures by molecular formula or accurate ion mass derived from HRMS data. The number of references/sources in ChemSpider ([2, 4, 6, 10, 22, 64, 89]) and PubChem ([2, 22, 37]) has also been taken into account. CAS [12, 13] and other databases [10, 12, 13, 26, 27, 72, 90] have also been involved in such analyses, though CAS is not suited for searching by monoisotopic mass values.

In searching the databases, candidate compounds are ranked by the number of corresponding references or data sources. The 1st rank is often preferable, but all highly ranked candidates should generally be taken into account, especially as different databases may lead to different citation rankings. In environmental analysis, searching a specialized database, EPA’s CompTox Chemistry Dashboard (Table 2), was more effective than searching a common database (ChemSpider) [10]. The fact that search results may depend on the particular database is discussed in recent publications [14, 22, 60, 71].

Candidate structures found in chemical databases should be confirmed or rejected by means of another method or technique. By now, searching reference MS2 libraries (Fig. 1) has become routine [37, 38, 57, 58]. Also, the retrieved structures can be fragmented in silico to compare predicted and experimental MS2 spectra, as discussed, e.g., in the articles [17, 71]. Software such as MetFrag [2, 64, 86, 87, 91], MetFusion [2, 4, 6, 87, 91], MOLGEN [2, 87, 91], and some other programs [6, 12, 86, 87, 91, 92], as well as the CSI:FingerID program implementing the related approach of computing fragmentation trees [14, 64], has been explored.
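The comparison of a predicted (or library) MS2 spectrum with an experimental one can be illustrated with a simple cosine similarity between binned peak lists. This is a generic sketch and not the scoring scheme of MetFrag, MetFusion, CSI:FingerID, or any other program cited above; the bin width, the function names, and the peak lists are assumptions. Fixed binning can also split nearly identical peaks across a bin border, which dedicated software handles more carefully.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Sum peak intensities into m/z bins of the given width.

    peaks is a list of (mz, intensity) pairs.
    """
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=0.01):
    """Cosine similarity (0 to 1) between two spectra given as (mz, intensity) lists."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up peak lists: an experimental spectrum and two predicted candidate spectra.
experimental = [(91.05, 100.0), (119.08, 40.0), (147.08, 15.0)]
candidate_1 = [(91.05, 90.0), (119.08, 50.0)]
candidate_2 = [(77.04, 100.0), (105.03, 60.0)]

scores = {
    "candidate 1": cosine_similarity(experimental, candidate_1),
    "candidate 2": cosine_similarity(experimental, candidate_2),
}
best = max(scores, key=scores.get)
print(best)  # candidate 1
```

Such a score would typically be combined with the citation rank and, where available, retention data before a candidate is accepted or rejected.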

Three further research contributions are noteworthy. (1) Characteristic substructures of some veterinary drugs can be generated by software from experimental MS2 spectra and then used to filter candidate compounds retrieved from PubChem [25]. (2) A library of predicted MS2 spectra of human metabolites and their transformation products was built on the basis of the HMDB database (Table 2) [15]. (3) By combining predicted fragmentation with searches in both chemical databases and common MS2 libraries, an especially high rate of true results can be reached [86].

A structure found in ChemSpider is reliably confirmed when MS2 spectra of the corresponding analytical standard are acquired [2, 88]. In the search for analytical standards, one may access the ZINC database (Table 2) or other resources listing suppliers of various chemicals.


6.2. Chromatography

Chromatography is another, orthogonal technique that is highly recommended for use in combination with MS for reliable non-target identification. In gas chromatography, the large database of RI [37-39] has been widely used. RI are not very reproducible in liquid chromatography; instead, a different retention variable (RT) can be predicted by chemometric methods for the analyte structures retrieved from chemical databases [4, 6]. This is another way to filter candidates initially derived from HRMS data.

6.3. Interlaboratory comparisons in identification

Since 2012, the “Critical Assessment of Small Molecule Identification (CASMI)” contests have been held [64, 87, 91]. MS and chromatography techniques were used, supplemented with the chemical information under consideration. In 2016, a set of HRMS2 data was distributed among the participating teams, which were requested to identify 208 low-molecular compounds by means of predicted fragmentation, mass spectral libraries, and other methods of computational mass spectrometry and chromatography [64]. More than 30 % (specifically, 34 %) of analytes were correctly identified using MS, and this rate increased to 70 % when the 1st-rank results of searching ChemSpider were added [64, 72]. In the post-contest research [86], the rate was higher still. Searching chemical databases was also used in the interlaboratory comparison on water analysis [5].

6.4. Structure elucidation

When database searches initiated by MS information do not lead to definite results, i.e. only very low or even zero citation is recorded, two conclusions can be drawn. First, the database is not yet complete. Second, a very rare or even a new (i.e., unknown unknown) compound may be present in the sample. Other techniques, first of all NMR, are additionally required to elucidate its structure [93].

7. Conclusions

Rapidly developing organic, bio-, and environmental chemistry fill the space of known chemical compounds with new molecules. The chemical space is ‘perceived’ through chemical databases. Subspaces are ‘enclosed’ within special databases, e.g. those of bio-, regulated, or commercial chemicals. Principal chemical databases, such as CAS, ChemSpider, and PubChem,

cited.

TE D

contain many bibliographic references to abundant compounds, i.e. the compounds are highly

Chemical compounds found in various samples as analytes may be unknown to the chemists of the corresponding laboratories before the analytical workflows are executed. This is the case of non-target analysis, which has no general, complete solution and is therefore complicated to implement. Fortunately, both analytical instrumentation and chemical informatics are well advanced and offer new analytical approaches.

Identification of volatile compounds is commonly based on reference EI mass spectral libraries and RI databases. Advances in the determination of nonvolatile compounds are due to progress in HRMS and MSn. High-resolution mass spectrometers are the principal instruments for non-target determinations because they make it possible to obtain an accurate molecular mass and subsequently to generate molecular formulas. A subsequent search of chemical databases proposes candidate structures for identification. Their citation and co-citation rates can reasonably be taken into account. Theoretical MS2 spectra and retention times of candidate compounds are often predicted and compared with experimental ones. Alternatively, the experimental MS2 spectra (acquired together with MS1 scans) can be compared with the corresponding reference data from established libraries. Through this series of procedures, candidate compounds are filtered and ranked, and the top structures can subsequently be tested.

Further development of chromatography-mass spectrometry and chemical informatics should lead to more reliable solutions of non-target problems, above all for minor sample constituents. It would be advisable to verify and validate the corresponding analytical methods in dedicated interlaboratory comparisons.
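The step from an accurate molecular mass to a molecular formula can be illustrated with a minimal sketch. It assumes a neutral monoisotopic mass, simple formulas without brackets or charges, and a 5 ppm tolerance; the function names, the measured mass, and the short candidate list are hypothetical and are not taken from any cited tool.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes of a few elements.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207117,
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C20H12'.

    Elements missing from MONOISOTOPIC raise KeyError.
    """
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def ppm_error(measured, theoretical):
    """Relative mass deviation in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def matching_formulas(measured_mass, formulas, tol_ppm=5.0):
    """Keep formulas whose theoretical mass lies within tol_ppm of the measurement."""
    hits = []
    for formula in formulas:
        err = ppm_error(measured_mass, monoisotopic_mass(formula))
        if abs(err) <= tol_ppm:
            hits.append((formula, err))
    # Smallest absolute deviation first.
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Hypothetical measured neutral mass and hypothetical candidate formulas.
hits = matching_formulas(252.0941, ["C20H12", "C19H8O", "C16H12O3"])
print([formula for formula, _ in hits])
```

In this toy case only C20H12 survives the tolerance filter; in practice the surviving formulas would then be submitted to a chemical database search to retrieve candidate structures.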

References

[1] R. Khaiwal, Determination of atmospheric volatile and semi-volatile compounds, in: M. Quante, R. Ebinghaus, G. Flöser (Eds.), Persistent Pollution–Past, Present and Future, Springer, Berlin, 2011, pp. 177-205.

[2] A.C. Chiaia-Hernandez, E.L. Schymanski, P. Kumar, H.P. Singer, J. Hollender, Suspect and nontarget screening approaches to identify organic contaminant records in lake sediments, Anal. Bioanal. Chem. 406 (2014) 7323-7335.

[3] A.A. Bletsou, J. Jeon, J. Hollender, E. Archontaki, N.S. Thomaidis, Targeted and nontargeted liquid chromatography-mass spectrometric workflows for identification of transformation products of emerging pollutants in the aquatic environment, TrAC, Trends Anal. Chem. 66 (2015) 32-44.

[4] P. Gago-Ferrero, E.L. Schymanski, A.A. Bletsou, R. Aalizadeh, J. Hollender, N.S. Thomaidis, Extended suspect and non-target strategies to characterize emerging polar organic contaminants in raw wastewater with LC-HRMS/MS, Environ. Sci. Technol. 49 (2015) 12333-12341.

[5] E.L. Schymanski, H.P. Singer, J. Slobodnik, I.M. Ipolyi, P. Oswald, M. Krauss, T. Schulze, P. Haglund, T. Letzel, S. Grosse, N.S. Thomaidis, A. Bletsou, C. Zwiener, M. Ibáñez, T. Portolés, R. de Boer, M.J. Reid, M. Onghena, U. Kunkel, W. Schulz, A. Guillon, N. Noyon, G. Leroy, P. Bados, S. Bogialli, D. Stipaničev, P. Rostkowski, J. Hollender, Nontarget screening with high-resolution mass spectrometry: critical review using a collaborative trial on water analysis, Anal. Bioanal. Chem. 407 (2015) 6237-6255.

[6] P. Gago-Ferrero, E.L. Schymanski, J. Hollender, N.S. Thomaidis, Nontarget analysis of environmental samples based on liquid chromatography coupled to high resolution mass spectrometry (LC-HRMS), Compr. Anal. Chem. 71 (2016) 381-403.

[7] M.M. Plassmann, E. Tengstrand, K.M. Åberg, J.P. Benskin, Non-target time trend screening: a data reduction strategy for detecting emerging contaminants in biological samples, Anal. Bioanal. Chem. 408 (2016) 4203-4208.

[8] B. Wang, Y. Wan, G. Zheng, J. Hu, Evaluating a tap water contamination incident attributed to oil contamination by nontargeted screening strategies, Environ. Sci. Technol. 50 (2016) 2956-2963.

[9] Y. Zushi, S. Hashimoto, K. Tanabe, Nontarget approach for environmental monitoring by GC×GC-HRTOFMS in the Tokyo Bay basin, Chemosphere 156 (2016) 398-406.

[10] A.D. McEachran, J.R. Sobus, A.J. Williams, Identifying known unknowns using the US EPA’s CompTox Chemistry Dashboard, Anal. Bioanal. Chem. 409 (2017) 1729-1735.

[11] W.M.A. Niessen, R.A. Correa C., Identification strategies, in: W.M.A. Niessen, R.A. Correa C., Interpretation of MS–MS Mass Spectra of Drugs and Pesticides, Wiley, Hoboken, 2017, pp. 351-379.

[12] A.M. Knolhoff, T.R. Croley, Non-targeted screening approaches for contaminants and adulterants in food using liquid chromatography hyphenated to high resolution mass spectrometry, J. Chromatogr. A 1428 (2016) 86-96.

[13] A.M. Knolhoff, J.A. Zweigenbaum, T.R. Croley, Nontargeted screening of food matrices: development of a chemometric software strategy to identify unknowns in liquid chromatography–mass spectrometry data, Anal. Chem. 88 (2016) 3617-3623.

[14] K. Dührkop, H. Shen, M. Meusel, J. Rousu, S. Böcker, Searching molecular structure databases with tandem mass spectra using CSI:FingerID, Proc. Natl. Acad. Sci. 112 (2015) 12580-12585.

[15] T. Huan, C. Tang, R. Li, Y. Shi, G. Lin, L. Li, MyCompoundID MS/MS Search: Metabolite identification using a library of predicted fragment-ion-spectra of 383,830 possible human metabolites, Anal. Chem. 87 (2015) 10619-10626.

[16] S. Böcker, K. Dührkop, Fragmentation trees reloaded, J. Cheminform. 8 (2016) 5.

[17] F. Hufsky, S. Böcker, Mining molecular structure databases: Identification of small molecules based on fragmentation mass spectrometry data, Mass Spectrom. Rev. 2016, 1–10; doi: 10.1002/mas.21489.

[18] B. Rochat, From targeted quantification to untargeted metabolomics: Why LC-high-resolution-MS will become a key instrument in clinical labs, TrAC, Trends Anal. Chem. 84 (2016) 151-164.

[19] K. Uppal, D.I. Walker, K. Liu, S. Li, Y.M. Go, D.P. Jones, Computational metabolomics: a framework for the million metabolome, Chem. Res. Toxicol. 29 (2016) 1956–1975.

[20] M. Vinaixa, E.L. Schymanski, S. Neumann, M. Navarro, R.M. Salek, O. Yanes, Mass spectral databases for LC/MS- and GC/MS-based metabolomics: State of the field and future prospects, TrAC, Trends Anal. Chem. 78 (2016) 23-35.

[21] G. Lubes, M. Goodarzi, Analysis of volatile compounds by advanced analytical techniques and multivariate chemometrics, Chem. Rev. 117 (2017) 6399-6422.

[22] B. Rochat, Proposed confidence scale and ID score in the identification of known-unknown compounds using high resolution MS data, J. Am. Soc. Mass Spectrom. 28 (2017) 709-723.

[23] H. Laatsch, Dereplication of natural products using databases, in: B.J. Baker (Ed.), Marine Biomedicine: From Beach to Bedside, CRC Press, Boca Raton, 2015, pp. 65-88.

[24] I. Pérez-Victoria, J. Martín, F. Reyes, Combined LC/UV/MS and NMR strategies for the dereplication of marine natural products, Planta Med. 82 (2016) 857-871.

[25] B. Xia, X. Liu, Y.C. Gu, Z.H. Zhang, H.Y. Wang, L.S. Ding, Y. Zhou, Non-target screening of veterinary drugs using tandem mass spectrometry on SmartMass, J. Am. Soc. Mass Spectrom. 24 (2013) 789-793.

[26] C.B. Mollerup, P.W. Dalsgaard, M. Mardal, K. Linnet, Targeted and non-targeted drug screening in whole blood by UHPLC-TOF-MS with data-independent acquisition, Drug Test. Anal. 9 (2017) 1052-1061.

[27] C. Chen, A. Wohlfarth, H. Xu, D. Su, X. Wang, H. Jiang, Y. Feng, M. Zhu, Untargeted screening of unknown xenobiotics and potential toxins in plasma of poisoned patients using high-resolution mass spectrometry: Generation of xenobiotic fingerprint using background subtraction, Anal. Chim. Acta 944 (2016) 37-43.

[28] H. Oberacher, K. Arnhard, Current status of non-targeted liquid chromatography-tandem mass spectrometry in forensic toxicology, TrAC, Trends Anal. Chem. 84 (2016) 94-105.

[29] K.M. Rentsch, Knowing the unknown–State of the art of LCMS in toxicology, TrAC, Trends Anal. Chem. 84 (2016) 88-93.

[30] A. Kaufmann, P. Butcher, K. Maden, S. Walker, M. Widmer, Semi-targeted residue screening in complex matrices with liquid chromatography coupled to high resolution mass spectrometry: current possibilities and limitations, Analyst 136 (2011) 1898-1909.

[31] K. Levsen, Sample preparation for water analysis, Compr. Anal. Chem. 37 (2002) 721-778.

[32] C. Baduel, J.F. Mueller, H. Tsai, M.J.G. Ramos, Development of sample extraction and clean-up strategies for target and non-target analysis of environmental contaminants in biological matrices, J. Chromatogr. A 1426 (2015) 33-47.

[33] M.M. Plassmann, M. Schmidt, W. Brack, M. Krauss, Detecting a wide range of environmental contaminants in human blood samples–combining QuEChERS with LC-MS and GC-MS methods, Anal. Bioanal. Chem. 407 (2015) 7047-7054.

[34] A. Masiá, C. Blasco, Y. Picó, Last trends in pesticide residue determination by liquid chromatography–mass spectrometry, Trends Environ. Anal. Chem. 2 (2014) 11-24.

[35] S.J. Lehotay, Y. Sapozhnikova, H.G. Mol, Current issues involving screening and identification of chemical contaminants in foods by mass spectrometry, TrAC, Trends Anal. Chem. 69 (2015) 62-75.

[36] J. Cotton, F. Leroux, S. Broudin, M. Poirel, B. Corman, C. Junot, C. Ducruix, Development and validation of a multiresidue method for the analysis of more than 500 pesticides and drugs in water based on on-line and liquid chromatography coupled to high resolution mass spectrometry, Water Res. 104 (2016) 20-27.

[37] B.L. Milman, Chemical identification and its quality assurance, Springer, Berlin, 2011.

[38] B.L. Milman, I.K. Zhurkovich, Mass spectral libraries: A statistical review of the visible use, TrAC, Trends Anal. Chem. 80 (2016) 636-640.

[39] V.I. Babushok, Chromatographic retention indices in identification of chemical compounds, TrAC, Trends Anal. Chem. 69 (2015) 98-104.

[40] B.L. Milman, Y.V. Russkikh, L.V. Nekrasova, Z.A. Zhakovskaya, An approach to the mass spectrometry identification of cyanobacterial peptides. The case of demethylmicrocystin-LR, J. Anal. Chem. 66 (2011) 1423-1431.

[41] S.S. Andra, C. Austin, D. Patel, G. Dolios, M. Awawda, M. Arora, Trends in the application of high-resolution mass spectrometry for human biomonitoring: An analytical primer to studying the environmental chemical space of the human exposome, Environ. Int. 100 (2017) 32–61.

[42] D.J. Creek, A. Jankevics, R. Breitling, D.G. Watson, M.P. Barrett, K.E. Burgess, Toward global metabolomics analysis with hydrophilic interaction liquid chromatography–mass spectrometry: improved metabolite identification by retention time prediction, Anal. Chem. 83 (2011) 8703-8710.

[43] M. Ibáñez, J.V. Sancho, F. Hernández, D. McMillan, R. Rao, Rapid non-target screening of organic pollutants in water by ultraperformance liquid chromatography coupled to time-of-flight mass spectrometry, TrAC, Trends Anal. Chem. 27 (2008) 481-489.

[44] R. van Deursen, J.L. Reymond, Chemical space travel, ChemMedChem 2 (2007) 636-640.

[45] R. Visini, M. Awale, J.L. Reymond, Fragment database FDB-17, J. Chem. Inf. Model. 57 (2017) 700-709.

[46] ZINC 15, a free database of commercially-available compounds. http://zinc.docking.org (accessed 20.06.17).

[47] John Irwin, personal communication, 30.04.17.

[48] CAS, Chemical Abstracts Service. http://www.cas.org (accessed 20.06.17).

[49] PubChem database. https://pubchem.ncbi.nlm.nih.gov (accessed 20.06.17).

[50] ChemSpider database. http://www.chemspider.com (accessed 20.06.17).

[51] CHEMLIST database. http://www.cas.org/content/regulated-chemicals (accessed 20.06.17).

[52] EPA’s CompTox Chemistry Dashboard database. https://comptox.epa.gov/dashboard (accessed 20.06.17).

[53] METLIN database. https://metlin.scripps.edu/landing_page.php?pgcontent=mainPage (accessed 20.06.17).

[54] Human Metabolome Database. http://www.hmdb.ca (accessed 20.06.17).

[55] FoodB database. http://foodb.ca (accessed 20.06.17).

[56] Wiley Registry of Mass Spectral Data, 11th Edition. http://eu.wiley.com/WileyCDA/WileyTitle/productCd-1119171016.html (accessed 20.06.17).

[57] B.L. Milman, General principles of identification by mass spectrometry, TrAC, Trends Anal. Chem. 69 (2015) 24-33.

[58] T. Kind, H. Tsugawa, T. Cajka, Y. Ma, Z. Lai, S.S. Mehta, G. Wohlgemuth, D.K. Barupal, M.R. Showalter, M. Arita, O. Fiehn, Identification of small molecules using accurate mass MS/MS search, Mass Spectrom. Rev. 2017, doi: 10.1002/mas.21535.

[59] T. Kind, O. Fiehn, Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry, BMC Bioinformatics 8 (2007) 105.

[60] M. Krauss, High-resolution mass spectrometry in the effect-directed analysis of water resources, Compr. Anal. Chem. 71 (2016) 433-457.

[61] A.C. Ionas, A.B. Gómez, P.E. Leonards, A. Covaci, Identification strategies for flame retardants employing time-of-flight mass spectrometric detectors along with spectral and spectra-less databases, J. Mass Spectrom. 50 (2015) 1031-1038.

[62] J.J. Lohne, S.B. Turnipseed, W.C. Andersen, J. Storey, M.R. Madson, Application of single-stage Orbitrap mass spectrometry and differential analysis software to nontargeted analysis of contaminants in dog food: detection, identification, and quantification of glycoalkaloids, J. Agric. Food Chem. 63 (2015) 4790-4798.

[63] C. Ruttkies, E.L. Schymanski, S. Wolf, J. Hollender, S. Neumann, MetFrag relaunched: incorporating strategies beyond in silico fragmentation, J. Cheminform. 8 (2016) 3.

[64] E.L. Schymanski, C. Ruttkies, M. Krauss, C. Brouard, T. Kind, K. Dührkop, F. Allen, A. Vaniya, D. Verdegem, S. Böcker, J. Rousu, H. Shen, H. Tsugawa, T. Sajed, O. Fiehn, B. Ghesquiére, S. Neumann, Critical assessment of small molecule identification 2016: automated methods, J. Cheminform. 9 (2017) 22.

[65] A.E. Day, S.J. Coles, C.L. Bird, J.G. Frey, R.J. Whitby, V.E. Tkachenko, A.J. Williams, ChemTrove: Enabling a generic ELN to support Chemistry through the use of transferable plug-ins and online data sources, J. Chem. Inf. Model. 55 (2015) 501-509.

[66] S. Kim, P.A. Thiessen, E.E. Bolton, S.H. Bryant, PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem, Nucleic Acids Res. 43 (2015) W605–W611.

[67] I.V. Tetko, O. Engkvist, U. Koch, J.L. Reymond, H. Chen, BIGCHEM: challenges and opportunities for Big Data analysis in chemistry, Mol. Inf. 35 (2016) 615-621.

[68] L. Borland, M. Brickhouse, T. Thomas, A.W. Fountain, Review of chemical signature databases, Anal. Bioanal. Chem. 397 (2010) 1019-1028.

[69] J.L. Reymond, L. Ruddigkeit, L. Blum, R. van Deursen, The enumeration of chemical space, WIREs Comput. Mol. Sci. 2 (2012) 717-733.

[70] J.L. Reymond, The chemical space project, Acc. Chem. Res. 48 (2015) 722-730.

[71] S. Böcker, Searching molecular structure databases using tandem MS data: are we there yet? Curr. Opin. Chem. Biol. 36 (2017) 1-6.

[72] E.L. Schymanski, A.J. Williams, Open Science for Identifying “Known Unknown” Chemicals, Environ. Sci. Technol. 51 (2017) 5357–5359.

[73] B.L. Milman, M.A. Kovrizhnych, Identification of chemical substances by testing and screening of hypotheses II. Determination of impurities in n-hexane and naphthalene, Fresenius J. Anal. Chem. 367 (2000) 629-634.

[74] B.L. Milman, A Procedure for decreasing uncertainty in the identification of chemical compounds based on their literature citation and cocitation. Two case studies, Anal. Chem. 74 (2002) 1484-1492.

[75] B.L. Milman, Identification of chemical compounds, Trends Anal. Chem. 24 (2005) 493–508.

[76] B.L. Milman, Literature-based generation of hypotheses on chemical composition using database co-occurrence of chemical compounds, J. Chem. Inf. Model. 45 (2005) 1153-1158.

[77] B.L. Milman, L.A. Konopelko, Identification of chemical substances by testing and screening of hypotheses I. General, Fresenius J. Anal. Chem. 367 (2000) 621-628.

[78] M. Valcárcel, S. Cárdenas, D. Barceló, L. Buydens, K. Heydorn, B. Karlberg, K. Klemm, B. Lendl, B. Milman, B. Neidhart, Á. Ríos, R. Stephany, A. Townshend, A. Zschunke, Metrology of qualitative chemical analysis: Project Report EUR 20605, EC, 2002.

[79] M. Woldegebriel, Novel method for calculating a nonsubjective informative prior for a Bayesian model in toxicology screening: A Theoretical framework, Anal. Chem. 87 (2015) 11398-11406.

[80] M. Woldegebriel, G. Vivó-Truyols, A New Bayesian approach for estimating the presence of a suspected compound in routine screening analysis, Anal. Chem. 88 (2016) 9843-9849.

[81] T. Bader, W. Schulz, K. Kümmerer, R. Winzenbacher, General strategies to increase the repeatability in non-target screening by liquid chromatography-high resolution mass spectrometry, Anal. Chim. Acta 935 (2016) 173-186.

[82] J.C. Wolff, L.A. Thomson, C. Eckers, Identification of the ‘wrong’ active pharmaceutical ingredient in a counterfeit Halfan™ drug product using accurate mass electrospray ionisation mass spectrometry, accurate mass tandem mass spectrometry and liquid chromatography/mass spectrometry, Rapid Commun. Mass Spectrom. 17 (2003) 215-221.

[83] E.M. Thurman, I. Ferrer, A.R. Fernández-Alba, Matching unknown empirical formulas to chemical structure using LC/MS TOF accurate mass and database searching: example of unknown pesticides on tomato skins, J. Chromatogr. A 1067 (2005) 127-134.

[84] J.L. Little, C.D. Cleven, S.D. Brown, Identification of “known unknowns” utilizing accurate mass data and chemical abstracts service databases, J. Am. Soc. Mass Spectrom. 22 (2011) 348-359.

[85] J.L. Little, A.J. Williams, A. Pshenichnov, V. Tkachenko, Identification of “known unknowns” utilizing accurate mass data and ChemSpider, J. Am. Soc. Mass Spectrom. 23 (2012) 179-185.

[86] I. Blaženović, T. Kind, H. Torbašinović, S. Obrenović, S.S. Mehta, H. Tsugawa, T. Wermuth, N. Schauer, M. Jahn, R. Biedendieck, D. Jahn, O. Fiehn, Comprehensive comparison of in silico MS/MS fragmentation tools of the CASMI contest: database boosting is needed to achieve 93% accuracy, J. Cheminform. 9 (2017) 32.

[87] T. Nishioka, T. Kasama, T. Kinumi, H. Makabe, F. Matsuda, D. Miura, M. Miyashita, T. Nakamura, K. Tanaka, A. Yamamoto, Winners of CASMI2013: automated tools and challenge data, Mass Spectrom. 3 (2014) S0039-S0039.

[88] Q. Chang, Y.E. Peng, L. Yun, Q. Zhu, S. Hu, Q. Shuai, Rapid identification of unknown organic iodine in small-volume complex biological samples based on nanospray mass spectrometry coupled with in-tube solid phase microextraction, Anal. Chem. 89 (2017) 4147-4152.

[89] J.E. Rager, M.J. Strynar, S. Liang, R.L. McMahen, A.M. Richard, C.M. Grulke, J.F. Wambaugh, K.K. Isaacs, R. Judson, A.J. Williams, J.R. Sobus, Linking high resolution mass spectrometry data with exposure and toxicity forecasts to advance high-throughput environmental monitoring, Environ. Int. 88 (2016) 269-280.

[90] E.M. Thurman, I. Ferrer, P. Zavitsanos, J.A. Zweigenbaum, Identification of imidacloprid metabolites in onion (Allium cepa L.) using high-resolution mass spectrometry and accurate mass tools, Rapid Commun. Mass Spectrom. 27 (2013) 1891-1903.

[91] E.L. Schymanski, M. Gerlich, C. Ruttkies, S. Neumann, Solving CASMI 2013 with MetFrag, MetFusion and MOLGEN-MS/MS, Mass Spectrom. 3 (2014) S0036.

[92] D.W. Hill, T.M. Kertesz, D. Fontaine, R. Friedman, D.F. Grant, Mass spectral metabonomics beyond elemental formula: chemical database querying by matching experimental with computational fragmentation spectra, Anal. Chem. 80 (2008) 5574-5582.

[93] J. Přichystal, K.A. Schug, K. Lemr, J. Novak, V. Havlíček, Structural analysis of natural products, Anal. Chem. 88 (2016) 10338–10346.

Figure captions


Fig. 1. Flowchart for non-target analysis. Chemical databases play a twofold role. First, they are searched for identification candidates by molecular formula or accurate molecular mass. Second, the databases may also be needed both to select appropriate methods or techniques and to estimate the plausibility of identification results. Intermediate results are conventionally shown as different lists of candidate compounds. The candidates can be ranked. In practice, one technique may provide data for filtering the results acquired by another one. For example, reference experimental or predicted MS2 spectra and HPLC-RT are filters for the information supplied by HRMS.
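The filtering and ranking of candidate lists outlined in this flowchart can be sketched in a few lines. This is only an illustration under stated assumptions: the `Candidate` fields, the thresholds, and all values are hypothetical, except that the benzo[a]pyrene citation count is the ChemSpider figure quoted in the Fig. 2 caption.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    citations: int       # citation count retrieved from a chemical database
    rt_error_min: float  # |predicted - observed| retention time, minutes
    ms2_score: float     # spectral similarity between 0 and 1


def filter_and_rank(candidates, max_rt_error=0.5, min_ms2_score=0.6):
    """Drop candidates failing the RT or MS2 filter, then rank by citation count."""
    kept = [
        c for c in candidates
        if c.rt_error_min <= max_rt_error and c.ms2_score >= min_ms2_score
    ]
    return sorted(kept, key=lambda c: c.citations, reverse=True)


# Hypothetical candidate list for one molecular formula.
candidates = [
    Candidate("benzo[a]pyrene", 10227, 0.1, 0.92),
    Candidate("isomer A", 350, 0.2, 0.88),
    Candidate("isomer B", 1200, 1.4, 0.90),  # fails the retention time filter
    Candidate("isomer C", 40, 0.3, 0.35),    # fails the MS2 filter
]

ranked = filter_and_rank(candidates)
print([c.name for c in ranked])
```

Applying the hard filters before the citation ranking is a design choice of this sketch; a weighted score combining all three lines of evidence would be an equally plausible alternative.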


Fig. 2. (a) The C20H12 formula search in ChemSpider returned benzo[a]pyrene ranked first by citation rate; May 12, 2017. (b) The same-day benzo[a]pyrene search in PubChem for the relevant literature (only the top references are shown). The citation rate (8376) is not exactly equal to the corresponding value retrieved by searching ChemSpider (10227, a), but both searches give the top rank for this and many other PAH formulas. (c) One of the references that co-cites both five PAHs with each other and a PAH compound with the sediment matrix.

Fig. 3. Correlation of the updated citation rates of the 25 PAHs identified in Ref. [74]. The rates were retrieved by searching ChemSpider and PubChem.

Table 1. Fields and samples of non-target analysis

Field | References
Environmental | [1-11]
Food | [11-13]
Metabolomics | [11, 14-22]
Natural/marine products | [11, 23-24]
Pharmaceuticals, drugs, doping | [11, 12, 25, 26]
Toxicology | [10, 11, 27-29]

Matrix a | Marks as given for the column
Air, water, soil, sediment | ++ +
Food, plant/animal raw | + ++ + ++ +
Blood, urine, tissue | +
Marks not attached to a column in the source | +; ++ ++ ++; + + +

a Specification: ++ is the basic matrix, + is the minor matrix

Table 2. Chemical space and subspaces

Space/subspace | Comments, information source | Compounds
Virtual space | |
All possible compounds | Druglike molecules, molecular mass ≤500 Da [44] | 10^20-10^200
 | Druglike molecules [45] | 10^60
All molecules of up to 30 atoms | Druglike molecules [45] | 10^20-10^24
All molecules of up to 17 non-hydrogen atoms | Druglike molecules [45] | 1.7·10^11
Virtual and known compounds | |
Synthesizable compounds | Catalogs/advertising collected in the ZINC database [46, 47] | ~4·10^8
Known compounds | |
All compounds | Registered in Chemical Abstracts [48] | >129 000 000
Chemical compounds of biological significance | PubChem database [49] | >63 000 000
Common chemical compounds | ChemSpider database [50] | >59 000 000
Delivered chemicals (ZINC) | Catalogs/advertising collected in the ZINC db [46, 47] | 14 000 000
Regulated chemicals | CHEMLIST database [51] | >348 000
Environmental compounds | EPA’s CompTox Chemistry Dashboard database [52] | 747 000
Metabolites | Metabolites, bio compounds in the METLIN database [53] | 961 829
 | Human Metabolome Database [54] | 74 461
Natural compounds | Described compounds [23, 24] | >215 000; 260 000
Food | FoodB database [55] | 26 630
Volatile compounds having EI mass spectra | Wiley mass spectral library [56] | 599 700
Non-volatile compounds having mass spectra | Crude estimation for reference spectra of the experimental origin based on reviews [57, 58] | 50 000; 100 000

Table 3. Updated citation and co-citation rates of identified PAH and rejected candidate compounds [74]

Rate | Analytes | Candidates
n | 25 | 14 (13)*
Average citation, ChemSpider | 780 | 96
Average citation, PubChem | 1161 | 45
Average rank in the formula search, ChemSpider | 1.4 | 3.4
Compound-to-matrix co-citation, % of references in PubChem: | |
blood | 0.7 | 0.2
food | 1.6 | 0
plasma | 0.5 | 0.3
sediment | 1.9 | 0.6
soil | 3.8 | 0.3
tissue | 1.2 | 0
urine | 0.9 | 0.2
waste | 0.7 | 0
water | 3.2 | 1.6
all the matrices | 14 | 3.2

* One compound is not available in the databases


Highlights

The implementation of non-target analysis is briefly considered



The chemical space is the origin of candidates for identification



Chemical databases are highly essential for performing the analysis



The transfer of accurate molecular mass to chemical databases is the key stage



Citation of compounds in databases provides advanced identification hypotheses
