Prediction on the mutagenicity of nitroaromatic compounds using quantum chemistry descriptors based QSAR and machine learning derived classification methods

Prediction on the mutagenicity of nitroaromatic compounds using quantum chemistry descriptors based QSAR and machine learning derived classification methods

Ecotoxicology and Environmental Safety 186 (2019) 109822 Contents lists available at ScienceDirect Ecotoxicology and Environmental Safety journal ho...

2MB Sizes 0 Downloads 38 Views

Ecotoxicology and Environmental Safety 186 (2019) 109822

Contents lists available at ScienceDirect

Ecotoxicology and Environmental Safety journal homepage: www.elsevier.com/locate/ecoenv

Prediction on the mutagenicity of nitroaromatic compounds using quantum chemistry descriptors based QSAR and machine learning derived classification methods

T

Yuxing Haoa, Guohui Suna,∗, Tengjiao Fana, Xiaodong Suna, Yongdong Liua, Na Zhanga, Lijiao Zhaoa, Rugang Zhonga, Yongzhen Pengb a

Beijing Key Laboratory of Environmental and Viral Oncology, College of Life Science and Bioengineering, Beijing University of Technology, Beijing, 100124, PR China National Engineering Laboratory for Advanced Municipal Wastewater Treatment and Reuse Technology, Engineering Research Center of Beijing, Beijing University of Technology, Beijing, 100124, China

b

A R T I C LE I N FO

A B S T R A C T

Keywords: Nitroaromatic compounds Mutagenicity QSAR Classification Toxicity mechanism Hazard assessment

Nitroaromatic compounds (NACs) are an important type of environmental organic pollutants. However, it is lack of sufficient information relating to their potential adverse effects on human health and the environment due to the limited resources. Thus, using in silico technologies to assess their potential hazardous effects is urgent and promising. In this study, quantitative structure activity relationship (QSAR) and classification models were constructed using a set of NACs based on their mutagenicity against Salmonella typhimurium TA100 strain. For QSAR studies, DRAGON descriptors together with quantum chemistry descriptors were calculated for characterizing the detailed molecular information. Based on genetic algorithm (GA) and multiple linear regression (MLR) analyses, we screened descriptors and developed QSAR models. For classification studies, seven machine learning methods along with six molecular fingerprints were applied to develop qualitative classification models. The goodness of fitting, reliability, robustness and predictive performance of all developed models were measured by rigorous statistical validation criteria, then the best QSAR and classification models were chosen. Moreover, the QSAR models with quantum chemistry descriptors were compared to that without quantum chemistry descriptors and previously reported models. Notably, we also obtained some specific molecular properties or privileged substructures responsible for the high mutagenicity of NACs. Overall, the developed QSAR and classification models can be utilized as potential tools for rapidly predicting the mutagenicity of new or untested NACs for environmental hazard assessment and regulatory purposes, and may provide insights into the in vivo toxicity mechanisms of NACs and related compounds.

Abbreviations: NACs, nitroaromatic compounds; QSAR, quantitative structure activity relationship; GA, genetic algorithm; MLR, multiple linear regression; NPAHs, nitropolycyclic aromatic hydrocarbons; US EPA, United States Environmental Protection Agency; 1-NP, 1-nitropyrene; NOx, nitrogen oxides; 2-NFR, 2-nitrofluoranthene; 2-NP, 2-nitropyrene; PMs, particulate matters; CEC, contaminants of emerging concern; LD50, 50% lethal dose; AD, applicability domain; EHOMO, highest occupied molecular orbital energy; ELOMO, lowest unoccupied molecular orbital energy; μ, dipole moment; E, total energy; OLS, ordinary least square; Q2Loo, leave-one-out cross-validation correlation coefficient; RMSE, root mean squared of errors; MAE, mean absolute error; h, hat value; MCDM, multi-criteria decision making; SMILES, simplified molecular input line entry specification; Est, estate fingerprint; SubFP, substructure fingerprint; Ext, CDK Extended fingerprint; Graph, CDK graph only fingerprint; kNN, k-nearest neighbor; LR, logistic regression; NB, Naïve Bayes; RF, random forest; NN, neural network; SVM, support vector machine; TP, true positives; FP, false positives; TN, true negatives; FN, false negatives; SE, sensitivity; SP, specificity; CA, the overall predictive accuracy; ROC, receiver operating characteristic; AUC, area under the ROC curve; IG, information gain; SFA, substructure frequency analysis; MW, molecular weight; ALogP, Ghose-Crippen LogKow; BaP, benzo[a]pyrene; 1,6-DNP, 1,6-dinitropyrene ∗ Corresponding author. E-mail addresses: [email protected] (Y. Hao), [email protected] (G. Sun), [email protected] (T. Fan), [email protected] (X. Sun), [email protected] (Y. Liu), [email protected] (N. Zhang), [email protected] (L. Zhao), [email protected] (R. Zhong), [email protected] (Y. Peng). https://doi.org/10.1016/j.ecoenv.2019.109822 Received 22 July 2019; Received in revised form 11 October 2019; Accepted 14 October 2019 0147-6513/ © 2019 Elsevier Inc. All rights reserved.

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

1. Introduction

Therefore, environmental monitoring and risk assessment are of great importance to ensure their potential hazard or safe use at certain exposure levels. Given that the presence of various NACs in environment, the biological properties of these chemicals cannot be sufficiently measured by the currently available in vitro (non-animal) or in vivo (animal) testing methods. Consequently, it is lack of sufficient information on the hazard that they may pose to human health and the environment. In silico modelling methods, i.e., quantitative structure activity relationship (QSAR) and qualitative classification, if they meet the principles of Organization for Economic Co-operation and Development (OECD), are recognized and recommended by European Union (EU) REACH (Registration, Evaluation, Authorization and Restriction of Chemicals) legislation to satisfy the regulatory testing needs of hazardous chemicals [OECD, 2007; Gozalbes and Julian-Ortiz, 2018]. In this respect, QSAR and classification approaches are used as “green” alternative tools to provide reliable activity or property data, thereby avoiding or reducing some in vitro testing or in vivo animal experiments. Meanwhile, they are conducive to explain the toxic mechanism of these hazardous chemicals. QSAR and classification methods are usually used to explore the relationship between molecular structural information and the biological or physicochemical properties of chemicals, including environmental toxicants, drugs, and so on [Sun et al., 2016; Cronin, 2017; Tratnyek et al., 2017; Sun et al., 2018; Fan et al., 2018]. In the field of environmental toxicology, toxicity values or levels can be accurately predicted for untested chemicals by validated QSAR [Tugcu et al., 2017; Onlu and Sacan, 2018; Tugcu and Sacan, 2018; Cao et al., 2018; Khan et al., 2019] and classification models [Gramatica et al., 2007a; Xu et al., 2012; Li et al., 2014a], which can also be applied for chemicals’ screening and prioritization regarding their potential hazard. In several recent studies, QSAR models were built for exploring the aquatic toxicity of substituted phenols or anilines towards Chlorella vulgaris [Tugcu et al., 2017; Tugcu and Sacan, 2018], the toxicity of ionic liquids towards rat leukemia cells [Cao et al., 2018], toxicity of contaminants of emerging concern (CEC) towards Dugesia japonica [Onlu and Sacan, 2018]. Interestingly, interspecies correlation models were also established in the above studies [Tugcu et al., 2017; Onlu and Sacan, 2018]. Machine learning methods derived classification models were also established for predicting the oral acute toxicity of chemicals to rats [Li et al., 2014a], mutagenicity of chemicals [Gramatica et al., 2007a; Xu et al., 2012] and pesticide toxicity against Daphnia magna [He et al., 2019]. In a recent published perspective, the current status and future needs of QSAR and classification methods to predict environmental toxicities were assessed and discussed [Cronin, 2017]. The sources of reliable mutagenic endpoint data of NACs are very limited because of experimental determinations are expensive and timeconsuming processes. QSAR and classification methods are thus particularly practical tools for investigating the mutagenicity or in vivo toxicity of NACs with different structures. Gooch et al. [Gooch et al., 2017] performed a comprehensive QSAR study to analyse the in vivo toxicity of 90 NACs in relating to their 50% lethal dose for rats (LD50). Additionally, several previously published papers reported the QSAR modelling of NACs mutagenicity, however, the prediction accuracy of these models was limited to some extent, even with an excess of descriptors [Gramatica et al., 2007b; Singh et al., 2008; Toropov et al., 2009]. In this study, we performed the in silico prediction of NACs mutagenicity using QSAR methods combined with quantum chemistry descriptors according to the OECD guidance principles [OECD, 2007]. Notably, qualitative classification models for mutagenicity were further established by seven machine learning methods, along with six molecular fingerprints. The established models were evaluated through rigorous internal and external validation criteria. The detailed analysis of these models may prompt us to identify the main factors responsible for the mutagenicity of NACs, as well as provide rapid preliminary assessment of the mutagenicity for other experimentally untested

Nitroaromatic compounds (NACs), such as nitropolycyclic aromatic hydrocarbons (NPAHs), are widely existed in the atmosphere, aqueous environment, and soil as an important type of environmental organic pollutants [Zhang et al., 2018; Li et al., 2019]. More importantly, NACs were also found in certain cooking foods (e.g., Yakitori, vegetables, smoked and grilled foods, oils, tea, coffee, spices, fresh and cured meat products) [Ohnishi et al., 1985; Schlemitz and Pfannhauser, 1996; Deng et al., 2015], thus may pose potential hazard to humans. Many NACs have shown different degree of toxicity (e.g., mutagenicity and carcinogenicity) towards aquatic organisms, experimental animals and possibly humans, along with other adverse effects, including allergic reactions, endocrine disruption, skin irritation and so on [Kovacic and Somanathan, 2014]. In fact, apart from their own toxicity of NACs, their metabolic intermediates and products, such as arylhydroxylamines, arylamines, azo and azoxy compounds may exhibit similar or even higher toxicity than their parent NACs [Marwood et al., 1995; Kurian et al., 2006]. Among millions of chemicals, the United States Environmental Protection Agency (US EPA) has listed 129 chemicals as priority pollutants in 1976, in which several NACs are also involved (e.g., 2-nitrophenol, 4-nitrophenol and 2,4-dinitrophenol) [Keith and Telliard, 1979]. The environmental NACs, especially NPAHs (e.g., 1-nitropyrene, 1NP), primarily result from the imperfect combustion of biomass and fossil fuels, and automobile exhaust [Hayakawa, 2016]. In addition, NPAHs are also produced via the photochemical reactions of their parent PAHs with nitrogen oxides (NOx), such as 2-nitrofluoranthene (2-NFR) and 2-nitropyrene (2-NP) [Arey et al., 1986; Hayakawa, 2016]. With the rapid development of industry and economy, a large amount of energy is consumed, typically the consumption of fossil fuels. In China, coal is the main energy sources, especially in the Northeast China, coal-burning equipment in winter (e.g., coal-heating boilers) exhaust many pollutants into the atmosphere, including gases and airborne particulate matters (PMs) [Hayakawa, 2016]. Furthermore, the number of vehicles has a remarkable increase over the past decades in China, the automobile emission also contributes to atmospheric PMs to a great extent. It should be noted that PM2.5 refers to a kind of fine inhalable particles with 2.5 μm or smaller diameter in general. In fact, it is a mixture of liquid droplets and solid particles found in the atmosphere, which has various sizes and shapes. PM2.5 is composed of hundreds of different chemicals, among PAHs and NPAHs are the major toxic components [Albinet et al., 2007; Wang et al., 2011]. Exposure to high levels of PM2.5 poses the great risk to human health, which may result in respiratory diseases (e.g., asthma and lung cancer) and cardiovascular diseases, of note, PAHs and NPAHs are considered as the primary causes of indirect-acting and direct-acting mutagenicity of PMs, respectively [Albinet et al., 2007; Hayakawa, 2016]. The International Agency for Research on Cancer (IARC) has classified PM2.5 as carcinogen for humans (Group 1) in 2013 [Loomis et al., 2013]. This is indirectly supported by the fact that several PAHs and NPAHs contained in PM2.5, such as benzo[a]pyrene (BaP) and 1-NP, are classified by IARC as Group 1 or 2A carcinogens [Benbrahim-Tallaa et al., 2012]. The ubiquity of NACs in polluted atmosphere and their close associations with PM2.5 dramatically improve the concern of NACs from governments, environmental protection agencies and scientists. Other applications (e.g., pesticides, dyes, solvents, plastics, shoe polish, explosives, etc.) will also inevitably lead to the atmospheric, water, soil and food pollution of NACs [Hartter, 1985; Kovacic and Somanathan, 2014; Zhang et al., 2018]. Especially, NACs are difficult to control because they show high resistance to hydrolysis, biological or chemical oxidation due to the presence of the electron-withdrawing nitro groups [Zhang et al., 2018]. Production and usage of these NACs ultimately pose environmental risk towards the aquatic living organisms, plants, animals and humans through atmosphere system, aquatic ecosystem and diets [Kovacic and Somanathan, 2014; Tugcu et al., 2017]. 2

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

search the solution space by maximization of the leave-one-out crossvalidation correlation coefficient (Q2Loo) as the fitness function. To acquire the best variables, the following parameters were set, namely, population size = 200, mutation rate = 20 and number of generations = 500. Q2Loo was selected because it gives a measure of model stability and robustness. Finally, a population of good models were generated after repeating this procedure. According to the empirical rule [Tropsha, 2010], the ratio of the number of compounds in the training set to the number of selected variables (descriptors) should be more than 5, which means that at most 7 descriptors are allowed to develop models in the present study. The internal fitness and robustness of QSAR models were rigorously assessed by the statistical parameters like Q2Loo, correlation coefficient R2, and modified form R2adj. The QUIK (Q Under Influence of K) rule [Todeschini et al., 1999; Gramatica et al., 2013] was used for checking the intercorrelation between descriptors and was set as 0.05 to avoid models with multicollinearity and chance correlation, which has been mentioned in the 191th term of OECD guideline document [OECD, 2007]. A Y-scrambling step (the response values were deranged and randomly reordered) with 2000 iterations was also carried out to assess the possibility of chance correlation of the QSAR models. If the new QSAR models produced by randomly reordering the logTA100 values gave significantly lower Q2Loo than the original model, the proposed QSAR model should not be obtained casually. The external predictive performance of the developed QSAR models was validated by Q2ext and R2ext [Golbraikh and Tropsha, 2002; Tropsha et al., 2003; Gramatica, 2007; Gramatica, 2014]. Q2ext = (SD-PRESS)/ SD, in which PRESS means the sum of squared differences between the actual and predicted responses in the test set, and SD represents the sum of squared differences between the actual value for each compound in the test set and the mean actual value of the training set [Tropsha et al., 2003]. Regarding external predictivity, several additional validation parameters, including Q2F1, Q2F2, Q2F3, concordance correlation coefficient (CCC), CCCext, r2m and Δr2m [Gramatica and Sangion, 2016] were also involved. Root mean squared of errors (RMSE) for the training set (RMSEtr) and external test set (RMSEext) that characterize the overall error of the models were included as an extra measurement to evaluate the accuracy for the developed QSAR models [Golbraikh and Tropsha, 2002; Gramatica and Sangion, 2016]. We also considered mean absolute error (MAE), which was proposed by Roy and coworkers [Roy et al., 2016]. All the statistical parameters used in this study can be seen in Table S2 in the Supplementary material.

compounds in this class before they produce potential hazardous effects on both the environment and humans. 2. Materials and methods 2.1. QSAR study 2.1.1. Collection of experimental data In this work, a mutagenicity data set of 48 nitroaromatic compounds (NACs) for Salmonella typhimurium TA100 strain was chosen from the Benigni's Report (QSARs for mutagenicity and carcinogenicity) for OECD, all the mutagenic activities were determined by bacterial Ames test without the S9 activation system [OECD, 2004]. To obtain a normally distributed dataset, all original mutagenicity data were converted into logTA100 values as metrics of mutagenic potency, and were used as the dependent (response) variables in QSAR modelling (Fig. S1 in the Supplementary material). The logTA100 values ranged from −2.1 to 4.63 (spanned more than 6 log units), suggesting an adequate data set for QSAR study. All of these NACs were ordered by their logTA100 values, then one was chosen as the test set from every four molecules and the remaining molecules were assigned as the training set. The training set was used for model development, while the compounds in the test set were used for model validation. The maximum and minimum values of the logTA100 were designated as training to make sure that the external test set compounds are within the model applicability domain (AD) [Gramatica et al., 2013]. The names, chemical structures, CAS no., mutagenicity values of NACs studied can be seen in Table S1 (Supplementary material). 2.1.2. Calculation of molecular descriptors The chemical structures of 48 NACs were drawn manually using GaussView 5.0 software and geometrically optimized by the Gaussian 09 program at the B3LYP/6-31 + G(d,p) theoretical level [Frisch et al., 2009]. Frequency analysis on each optimized geometry was used to ensure that the structure was a local saddle point rather than a transition state. After geometry optimization, five quantum chemistry descriptors, including the highest occupied molecular orbital energy (EHOMO), the lowest unoccupied molecular orbital energy (ELUMO), ELUMO − EHOMO gap, dipole moment (μ) and total energy (E), were calculated. DRAGON 7.0 software [available at https://chm.kodesolutions.net/] was used to calculate the theoretical molecular descriptors (DRAGON descriptors), which include 22 2D molecular descriptor groups (e.g., constitutional indices, charge descriptors, ring descriptors, topological indices, drug-like indices, etc.). 3D descriptors were removed because of the extreme sensitivity of 3D descriptor groups to the quantum chemistry calculations [Onlu and Sacan, 2017]. Finally, 3822 2D descriptors combined with the quantum chemistry descriptors were utilized for QSAR modelling. It should be noted that the wide range of descriptors may contribute to the discovery of hidden important information.

2.1.4. Applicability domain According to OECD guidance principles [OECD, 2007], a validated QSAR predictive models should have a defined applicability domain (AD). Leverage approach was commonly used for defining the AD. In the original variable space, a compound's leverage is described as hat value (h) [Gramatica, 2007]. The warning hat value (h*) was calculated by h* = 3 (p+1)/n, where p is the number of model descriptors, and n is the number of compounds used for model building [Gramatica, 2007]. If a compound with h > h*, it will be identified as a structural outlier. For response space, a critical standardized residual value of 3 was usually utilized to identify response outliers [Gramatica, 2007; Gadaleta et al., 2016]. A compound will be identified as a response outlier if its standardized residual was bigger than ± 3. Based on the standardized residuals against leverages (h), a Williams plot was constructed to visually characterize the AD. Only a compound falls into the AD of the proposed model, its prediction is considered to be reliable.

2.1.3. QSAR modelling and validation QSAR modelling was carried out by multiple linear regression (MLR) ordinary least square (OLS) procedure as implemented in QSARINS 2.2.2 software (Varese, Italy) [Gramatica et al., 2013; Gramatica et al., 2014]. Prior to QSAR modelling, we performed a preliminary screening to exclude the constant or nearly constant (> 80% compounds sharing the same value for a descriptor) and descriptors with high intercorrelation (pair-wise correlations among all pairs of descriptors > 0.95) due to statistical insignificance. Descriptor selection for training set was performed by all subsets and GA tools of QSARINS. Firstly, all low-dimensional models (up to 2–3 descriptors) were generated using the all subset facility to find the best descriptors encoding the mutagenicity and to avert GA with a completely random start. The best descriptors subset determined at this step was used as the core of chromosomes of the initial GA population. Then, GA was used to

2.1.5. Selection of the best model A procedure called Multi-Criteria Decision Making (MCDM), which is also implemented in the QSARINS, was utilized to evaluate the model performance relevant to internal/external validations simultaneously using scores (0–1), where 0 means the worst criterion and 1 means the 3

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

weighted parameters, and a k value of 5 was set. Logistic Regression (LR). LR method was created in 1958 by statistician David Cox [Cox, 1958; Walker and Duncan, 1967], which was commonly used for classifying a binary response. Two possible response values are labelled with symbols of “0” and “1”, which represent outcomes like success/failure, positive/negative or yes/no, respectively. Naïve Bayes (NB). It is based on the Bayes rule for conditional probability, which allows users to classify samples based on the equal and independent contribution of their attributes [Sun, 2005; Watson, 2008]. Here, NB classification was performed by the default settings in Orange Canvas 3.11 software. Random Forest (RF). It was created by Breiman as an ensemble learning approach for classification and regression [Breiman, 2001]. The forest is assembled by trees, in which each tree is formed from a bootstrapped sample of the training set. Each tree is grown up to maximum size without pruning. The classification results depend on the majority of individual tree's output. In this study, the number of trees was set as 20 in the forest. Neural Network (NN). NN can effectively recognize complicated nonlinear relationship between input and output variables [Basheer and Hajmeer, 2000]. Thus, the NN algorithm produces a platform where different models can be shaped and developed. For NN modelling, one input layer, one hidden layer, and one output layer are included in the network. In this study, the number of neurons in hidden layer was set as 200. Support Vector Machine (SVM). It is a kernel-based algorithm to perform binary classification, first developed by Vapnik et al. [Cortes and Vapnik, 1995]. By substructure pattern recognition approach, each sample is described as a binary string, and used as an eigenvector for SVM, which projects input vectors to a higher dimensional space in which an optimal separating hyperplane is generated to discriminate samples from different categories. In addition, two parallel hyperplanes are formed on each side of the optimal hyperplane. After training, SVM produces a decision function. In our study, gaussian radial basis function (RBF) kernel was selected, the parameters C and slack variable γ were tuned on the training set by 10-fold cross-validation. The SVM module from the LIBSVM package [Chang and Lin, 2011] is embedded in Orange Canvas 3.11 software. Tree. It is a standard benchmark in machine learning. A decision tree includes decision nodes, branches, and leaves, for a categorical dependent variable described by on one or more predictor variables, it inputs an object or situation and outputs a decision. An object is categorized by starting at the root node of a tree, testing the attribute defined by this node, then shifting down to the tree branch based on the attribute value [Plewczynski et al., 2006]. In the pre-pruning process, the minimal instance in leaves was set as 3, and stop splitting nodes with fewer instances than 5.

best criterion [Gramatica et al., 2013]. Various validation parameters, including internal fitting (R2), cross-validation (Q2), external validation Q2ext and R2ext, are altogether considered by the MCDM procedure [Gramatica et al., 2013]. Thus, the selected models are those with the highest values for all statistical parameters discussed above and obviously the least RMSE and MAE values on the training and test sets [Gramatica and Sangion, 2016]. After numerous trials, the final model was chosen with the best MCDM score and the least possible number of descriptors, fulfilling the OECD validation requirements [OECD, 2007] and statistical thresholds for various validation parameters [Golbraikh and Tropsha, 2002; Tropsha et al., 2003; Gramatica, 2007; Gramatica, 2014; Gramatica and Sangion, 2016; Roy et al., 2016; Gadaleta et al., 2016]. 2.2. Classification study 2.2.1. Data preparation A same dataset of NACs used in the QSAR study was also used for classification study. According to the mutagenicity data (logTA100) distribution of all NACs used in this study (Fig. S1 in the Supplementary material), the logTA100 of the compound closest to the vertex (mean value) was considered as threshold value. Therefore, a mutagenic cutoff value (logTA100) of 0.91 was chosen as the classification criterion. Compounds with logTA100 values higher than 0.91 were identified as high mutagenic NACs (denoted as “H”), while less than 0.91 were identified as low mutagenic NACs (denoted as “L”). Finally, a dataset consisting of 24 “H” NACs with high mutagenic potency and 24 “L” NACs with low mutagenic potency was obtained. All these compounds were randomly split into a training set and an external test set at a ratio of 3:2. A summary for the compounds’ classification was listed in Table S1 (Supplementary material). 2.2.2. Molecular fingerprint Molecular structure information was represented by molecular fingerprint, a substructure recognition method, which has been widely used in classification study for similarity searching. Each molecule is represented as a binary string of structural keys through this method. If a given molecule contains a substructure, the corresponding bit is “1” and otherwise “0” [Shen et al., 2010]. Each corresponding bit was used as an “independent variable” in model establishment. SMARTS uses rules that are extensions of simplified molecular input line entry specification (SMILES), is a language for specifying substructures. Six fingerprints were employed in this study, including Estate fingerprint (Est, 79 bits), MACCS keys (166 bits), PubChem fingerprints (881 bits), Substructure fingerprint (SubFP, 307 bits), CDK Extended fingerprint (Ext, 1024 bits) and CDK graph only fingerprint (Graph, 1024 bits). Detailed description of these fingerprints was shown in the original literatures [Klekota and Roth, 2008; Yap, 2011]. PaDEL-Descriptor software was applied to generate these fingerprints [Yap, 2011].

2.2.4. Model performance evaluation A 10-fold cross-validation and an external test set were used for evaluating the model performance. All classification models were assessed by the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In addition, sensitivity (SE), specificity (SP) and the overall predictive accuracy (CA), which represent the predictive accuracy of the “H” class, the predictive accuracy of the “L” class, and the total predictive accuracy of “H” and “L” NACs, respectively, were calculated by the following formulas [Sun et al., 2018; Fan et al., 2018].

2.2.3. Model building Seven machine learning methods were used to build model, including k-nearest neighbor (kNN), logistic regression (LR), Naïve Bayes (NB), random forest (RF), neural network (NN), support vector machine (SVM) and classification tree. All the methods were performed in Orange Canvas 3.11 software (freely available at https://orange.biolab. si/) and used with the parameters according to our previous studies [Sun et al., 2018; Fan et al., 2018]. k-Nearest Neighbor (kNN). It is a non-parametric method to classify objects according to the nearest training examples in the feature space [Cover and Hart, 1967]. For a test sample, its nearest neighbor is judged by the algorithm computed distance or similarity between each training sample. Then a sample is classified by a majority vote of its k nearest neighbors. A distance-weighted approach is used to reduce the impact of k value (the number of nearest neighbors). In this work, the nearness was determined by the Euclidean distance and distance-

SE = TP /( TP + FN)

(1)

SP = TN /( TN + FP)

(2)

CA = (TP + TN )/( TP + TN + FP + FN)

(3)

Additionally, the receiver operating characteristic (ROC) curve was constructed based on the TP rate (sensitivity) and the FP rate (14

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

specificity). The model quality was evaluated by the area under the ROC curve (AUC). The AUC is the probability of “H” compounds being ranked earlier than “L” compounds in this study, and the value of AUC ranges from 0.5 (no discriminative ability) to 1 (perfect classifier) [Perez-Garrido et al., 2011].

logTA100 = −2.4885 − 38.1695EHOMO + 1.2564Infective − 50 + 0.9804 Hypnotic − 80 + 0.1974CATS2D_04_LL − 0.0295TIC2 (5) 2 2 Ntr = 36, QLoo = 0.950, R2 = 0.967, Radj = 0.961, F = 174.821,RMS

2.2.5. Identification of privileged substructures as structural alerts In this study, molecular functional groups that have high potential to cause mutagenesis were defined as privileged substructures or structural alerts. The analysis of privileged substructures and structural alerts were performed by the information gain (IG) along with substructure frequency analysis (SFA) [Shen et al., 2010.; Xu et al., 2012; Li et al., 2014a]. Their presence in a compound warns the researchers or professional exposure population to the potential mutagenicity of the untested compounds. Structural alerts are important predictive tools of mutagenic toxicity because they are usually directly derived from the mechanistic understanding [Benigni and Bossa, 2008]. The privileged substructures known to provide ligands for various receptors were able to clarify the corresponding biological mechanism [Evans et al., 1988]. If a substructure was more frequently presented in the “H” class, it was named a privileged substructure potentially involved in high chemical mutation. The frequency of a substructure in “H” class was calculated as follows [Sun et al., 2018; Fan et al., 2018]:

frequency of a substructure =

Etr = 0.316, CCCtr = 0.983; 2 2 2 Ntest = 12, Qext = 0.836, Rext = 0.843, RMSEtest = 0.724, QF1 = 0.808, 2 2 = 0.808, QF3 = 0.826, CCCtest = 0.914; QF2 2 2 (Rext − R 02)/R 2ext = 0.0008, k = 1.0011, (Rext − R′20 )/R 2ext = 0.013,

k ′ = 0.873, |R 02 − R′20 |= 0.011. It is obvious that the model satisfies all the statistical criteria described above. Ntr and Ntest are the number of compounds in the training set and external test set, respectively. The relatively high quality of cross-validation parameter (Q2Loo) and fitting parameters (R2, R2adj and RMSEtr) indicated that the model had good internal fitting ability, reliability and robustness. An external test set, which never participated in the model building steps, was used for evaluating the external predictive ability of the model. Good external predictive performance of the model was reflected by the crucial statistical parameters like Q2ext (0.836), R2ext (0.843) and RMSEtest (0.724). Compared to the original model, significantly lower values of R2Yscr and Q2Yscr observed in Y-scrambling procedure indicated the reliability of our model, suggesting no any chance correlation (Table 1). As well, the close values of Q2F1 (0.808), Q2F2 (0.808) and Q2F3 (0.826) indicated that the selected test set had similar response distribution with the training set. The high value of CCCext (0.914) also suggested that the model had excellent external predictability. Furthermore, we gave the MAEtest (mean absolute error for 95% of test set data) proposed by Roy and coworkers [Gramatica and Sangion, 2016; Roy et al., 2016]. Regarding this metrics, MAEtest ≤ 0.1 training set range (TSR) and MAEtest + 3σ ≤ 0.2TSR should be satisfied for good predictions. In view of TSR = 6.73, calculated MAEtest = 0.606, and MAEtest + 3σ = 0.906, our model further displayed good predictive ability. Fig. 1 showed a good linear correlation between predicted and experimental logTA100 values, where the red circles and blue squares represent the training and test sets, respectively. All compounds were homogeneously distributed around the optimal line, indicating our model had good predictive performance. On the other hand, we also performed the modelling without quantum chemistry descriptors, where only DRAGON descriptors were considered. As shown in Table S3 in the Supplementary material, the best models containing 1–7 descriptors (no quantum chemistry descriptors) were picked out based on the ranking of MCDM, in which the Q2Loo and R2 ranged from 0.719 to 0.936 and 0.749 to 0.957, respectively. Moreover, we also compared the quality of models built with quantum chemistry descriptors, without quantum chemistry descriptors, and previously reported models using NCSS software [Singh et al., 2008]. As shown in Fig. 2A, the models developed in the present work, regardless of quantum chemistry descriptors, exhibited substantially higher R2 values than previously reported models built by Singh et al. [Singh et al., 2008], in which models with quantum chemistry descriptors also had higher R2 values than that without quantum chemistry descriptors. Similar to R2, Q2Loo values of models with quantum chemistry descriptors were also higher than that without quantum chemistry descriptors (Fig. 2B). In another published literature, Gramatica et al. reported the QSAR modelling using the same data, they obtained the best models with two topological molecular descriptors using different splitting method [Gramatica et al., 2007b]. However, the Q2Loo (0.705–0.861) and R2 (0.768–0.885) of models developed by Gramatica et al. [Gramatica et al., 2007b] were also lower than that of our QSAR model with two descriptors (EHOMO, CATS2D_04_LL) (Q2Loo = 0.874 and R2 = 0.896). Therefore, we consider

I Nfragment × Ntotal

Nfragment _ total × NI

(4)

where NIfragment is the count of compounds having the fragment in the “H” class, Ntotal is the total count of NACs, Nfragment_total is the total count of NACs having the fragment, and NI represents the count of “H” class in the dataset. 3. Results and discussion 3.1. QSAR models 3.1.1. Model validation After pre-filtration of the constant values and highly intercorrelated descriptors, a total of the remaining 339 Dragon descriptors along with the quantum chemistry descriptors were used to establish the QSAR models. Then, GA combined with MLR procedure were used to further screen the best modelling variables, followed by 110 models were generated. After utilizing QUIK rule [Todeschini et al., 1999; Gramatica et al., 2013], 70 models were retained without multicollinearity. For an acceptable QSAR predictive model, the following statistical criteria should be satisfied [Golbraikh and Tropsha, 2002; Tropsha et al., 2003; Gramatica, 2007; Gramatica, 2014; Gramatica and Sangion, 2016; Roy et al., 2016; Gadaleta et al., 2016]: (i) Q2Loo > 0.5; (ii) R2ext > 0.6; (iii) (R2ext − R02)/R2ext < 0.1 and 0.85 ≤ k ≤ 1.15 or (R2ext − R′02)/ R2ext < 0.1 and 0.85 ≤ k’ ≤ 1.15; (iv) |R02 − R′02| < 0.3. R02 and R′02 are the correlation coefficients of predicted versus experimental activities and experimental versus predicted activities for regression lines through the origin, respectively. The corresponding slopes k and k’ represent the regression lines through the origin (predicted versus experimental activities, and experimental versus predicted activities, respectively). Finally, the performance of all models was ranked by MCDM and the best candidate model was chosen. Due to the allowed maximum number of descriptor was 7 in this study, the best models containing 1–7 descriptors were picked out and listed in Table 1, in which the Q2Loo ranged from 0.792 to 0.972, R2 ranged from 0.817 to 0.981. Although the models with 6 and 7 descriptors in this study showed higher Q2Loo and R2 than models with less than 5 descriptors, they were mentioned with “listed data may be unreliable” because of multi-collinearity. Therefore, the best QSAR model for the prediction of NACs mutagenicity was shown in Equation (5) with 5 descriptors. 5

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

Table 1 Internal fitting, and external validation parameters of the best QSAR models containing 1–7 descriptors based on quantum chemistry descriptors. Fitting and internal validation parameters Model

Descriptors

R2

R2adj

RMSEtr

CCCtr

F

Q2Loo

RMSEcv

CCCcv

Q2Lmo

R2Yscr

Q2Yscr

1

EHOMO, SIC2, ATSC8e, MATS8m, CATS2D_04_LL, Hypnotic-80, Infective-50 EHOMO, TIC2, VE1sign_A, CATS2D_04_LL, Hypnotic-80, Infective50 EHOMO, TIC2, CATS2D_04_LL, Hypnotic-80, Infective-50 EHOMO, P_VSA_MR_5, CATS2D_04_LL, Hypnotic-80 EHOMO, CATS2D_04_LL, Hypnotic-80 EHOMO, CATS2D_04_LL SpDiam_A

0.981

0.977

0.238

0.991

208.953

0.972

0.289

0.986

0.967

0.196

−0.362

0.974

0.969

0.278

0.987

183.679

0.960

0.348

0.980

0.955

0.170

−0.292

0.967 0.950 0.908 0.896 0.817

0.961 0.943 0.900 0.889 0.811

0.316 0.389 0.528 0.560 0.743

0.983 0.974 0.952 0.945 0.899

174.821 146.545 114.630 141.676 151.340

0.950 0.936 0.890 0.874 0.792

0.387 0.439 0.577 0.616 0.790

0.975 0.967 0.942 0.934 0.887

0.947 0.932 0.885 0.869 0.786

0.141 0.115 0.080 0.058 0.028

−0.249 −0.199 −0.147 −0.125 −0.091

2 3 4 5 6 7

External validation parameters Model

R2ext

Q2ext

RMSEext

Q2F1

Q2F2

Q2F3

CCCext

MAEext

r2m

Δr2m

k

k’

(R2ext - R02)/R2ext

(R2ext – R′02)/R2ext

1 2 3 4 5 6 7

0.831 0.844 0.843 0.845 0.779 0.770 0.644

0.833 0.839 0.836 0.843 0.923 0.942 0.638

0.789 0.716 0.724 0.707 0.777 0.834 1.074

0.773 0.812 0.808 0.817 0.779 0.745 0.578

0.772 0.812 0.808 0.817 0.778 0.744 0.577

0.793 0.830 0.826 0.834 0.799 0.769 0.617

0.901 0.915 0.914 0.916 0.872 0.874 0.730

0.668 0.570 0.606 0.611 0.644 0.717 0.944

0.729 0.797 0.788 0.802 0.720 0.706 0.379

0.072 0.067 0.067 0.067 0.099 0.090 0.237

0.996 0.999 1.001 0.992 0.817 0.871 0.785

0.857 0.877 0.873 0.885 1.021 0.933 0.885

0.0078 0.0002 0.0008 0.0001 0.0249 0.0008 0.5521

0.0334 0.0107 0.013 0.0095 0.0003 0.0265 0.0804

structures among all NACs in the data set as they are the only two anthracenes with one nitro group. Overall, the mutagenic potency of NACs can be well predicted and the proposed model are reliable. 3.1.3. Mechanism interpretation of the model As shown in Equation (5), the best QSAR model contains one quantum chemistry descriptor (EHOMO) and four DRAGON descriptors (TIC2, CATS2D_04_LL, Hypnotic-80 and Infective-50). The chemical meanings of the five descriptors were listed in Table S4 in the Supplementary material by descending order based on the standardized coefficients in model equation. The detailed explanations of DRAGON descriptors are available in the Handbook of Molecular Descriptors [Todeschini et al., 2000]. Regarding the coefficients of variables (descriptors) in Equation (5), three descriptors (Infective-50, Hypnotic-80 and CATS2D_04_LL) exhibited positive contribution to the mutagenicity, whereas two descriptors (EHOMO and TIC2) were negatively correlated with the mutagenicity. EHOMO, a quantum chemistry descriptor, which is a measure of the nucleophilicity of a molecule, is the most significant descriptor in the proposed model. Chemicals with higher EHOMO values are more likely attacked by the strong electrophile hydroxyl radical (·OH) [Li et al., 2014b]. Compared to compounds with low EHOMO values, compounds with high EHOMO values can donate their electrons more easily and hence they are more reactive with electrophiles (e.g., ·OH), thereby detoxification may occur. Indeed, the negative coefficient of EHOMO indicates that NACs with higher nucleophilicity (EHOMO) tend to have low mutagenicity. In a recent study performed in our laboratory, EHOMO was also selected as a key descriptor in modelling the rat oral acute toxicity of N-nitro compounds [Fan et al., 2018]. EHOMO was also reported by Li et al. as the most important descriptor in the modelling of hydroxyl radical reaction rate constants of organic chemicals [Li et al., 2014b]. TIC2 is the total information content index (neighbourhood symmetry of 2-order) from the information indices group, it represents a measurement of the molecular graph complexity. The minimal standardized coefficient of TIC2 (−0.0295) implies that it might have weaker effect than other descriptors in the equation. This descriptor was used for the prediction of solubility for pesticide compounds in water and the design of anti-bladder cancer agents [Deeb and Goodarzi, 2010; Speck-Planche et al., 2013]. Infective-50 and Hypnotic-80 belong to drug-like indices descriptors, representing Ghose-ViswanadhanWendoloski anti-infective-like index at 50% and hypnotic-like index at

Fig. 1. Scatter plot of experimental versus predicted mutagenicity values (LogTA100) for compounds in the training and test sets of the QSAR model in Eq. (5).

the quantum chemistry descriptors based QSAR models in the present study are fit for predicting the mutagenicity of NACs. 3.1.2. Analysis on applicability domain and outliers The leverage along with standardization approaches allow us to define our model's AD. A Williams plot representing the AD of the best QSAR model was shown in Fig. 3. It was found that the leverage values of all NACs were lower than the warning hat value (h* = 0.500), thus no structural outliers were identified for both the training and test sets. A 2D plot of hat value vs. model deviation was shown in Fig. S2 in the Supplementary material. As to response values, only two compounds (8 and 47) fell outside the AD because their standardized residuals were bigger than 3. It is noteworthy that 2-nitroanthracene (8) and 9-nitroanthracene (47) share similar structures that are independent of other compounds in the data set, as shown in Table S1 (Supplementary material). They were both assigned to the external test set and neither of their structural information was included in the model building. Therefore, the response outliers can be explained based on their unique 6

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

Fig. 2. Comparison of the models built with quantum chemistry descriptors, without quantum chemistry descriptors and previously reported by Singh et al. (A) R2 in three types of models; (B) Q2LOO of models built with or without quantum chemistry descriptors.

and Sacan, 2017; Onlu and Sacan, 2018]. Detailed values of calculated model descriptors can be seen in Table S5 in the Supplementary material. 3.2. Classification models 3.2.1. Data set analysis As described in Materials and Methods, all NACs were classified into two categories (“H” and “L”) based on their logTA100 values. The training set consisted of 16 “H” class and 13 “L” class, while the external test set consisted of 8 “H” class and 11 “L” class (Table S6 in the Supplementary material). Thus, each set had roughly the balanced distribution of high mutagenic NACs (training set = 55%, test set = 42%), which was suitable for evaluating the real predictive performance of the models. This is also the reason why the data set in the QSAR study was not used in the classification study. The diversity of the data set is critical for the predictive accuracy of the models. In this study, molecular weight (MW) and Ghose-Crippen LogKow (ALogP) were applied to define the chemical space distribution of each set [Xu et al., 2012; Sun et al., 2018; Fan et al., 2018] and the corresponding scatter plot was shown in Fig. S3A in the Supplementary material. Obviously, the chemical space distribution of external test set was within the range of training set. Furthermore, the Euclidian distance metrics of the data set was calculated by PubChem keys fingerprint to further explore the chemical diversity of the whole data set. Figs. S3B and 3C (Supplementary material) presented the heat map of molecular similarity for the training vs. training set and training vs. external test set constructed by Euclidian distance metrics, respectively. Apparently, low similarity between the training and external test sets was observed.

Fig. 3. Standardized residuals versus hat values of the compounds in the training and test sets (Williams plot). The transverse dash lines represent ± 3 standard residuals, vertical black line represents warning leverage h* = 0.500.

80%, respectively. Bearing positive coefficients, these two descriptors positively contribute to mutagenicity of NACs. Likewise, Erzincan et al. reported that Infective-50 positively contributes to antioxidant activity of coumarin derivatives [Erzincan et al., 2015]. Hypnotic-80 is a descriptor that can take values of 1 or 0, characterizing the lipophilicity (hydrophobicity), its positive coefficient (0.9804) indicates that cell membrane permeability has an important role for eliciting mutagenic effects. It was also used as one of the two most significant descriptors to positively describe the antigiardial activity of thiazole derivatives [Mocelo-Castell et al., 2015]. The last descriptor CATS2D_04_LL is a 2D structure-based atom-pair descriptor representing topological information (CATS means a chemically advanced template search), LL represents a lipophilic-lipophilic pair of pharmacophore (L-L PP) point at different topological distance in the molecule. CATS2D_04_LL represents CATS2D L-L PP at 4-lags topological distance, positively contribute to mutagenicity. To some extent, lipophilicity (hydrophobicity), as a common physicochemical property, represents the capacity of a chemical to arrive at the active site of toxic action, the first step for exerting its adverse effects. A good correlation was usually observed between the lipophilicity (hydrophobicity) of chemicals and their bioactivities or toxic effects [Wu et al., 2016; Tugcu et al., 2017; Onlu

3.2.2. Performance of 10-fold cross-validation The combinatorial predictive models for binary classification were built using six different fingerprints combined with seven machine learning approaches. Consequently, a total of 42 classification models were constructed based on the training set. A 10-fold cross-validation for the training set was performed to evaluate the performance of all developed models. The best models were selected in consideration of the CA and AUC values. As seen in Fig. 4, most models were observed with CA and AUC values higher than 0.7 except for the Graph-NB model. The Ext-NN, Ext-LR, PubChem-NN, PubChem-LR, PubChem-RF, Ext-RF, MACCS-NN, PubChem-Tree, Ext-Tree and Ext-SVM models were in the top ten ranking. For the top ten models, their CA and AUC values ranged from 0.828–1.000 and 0.962–1.000, respectively. Their SE values (0.820–1.000) had no significant difference when compared to SP values (0.830–1.000), indicating that these models had good 7

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

Table 2 Performance of the top ten models for training and testing sets in classification study. Data set

Model

SE

SP

TP

FP

FN

TN

AUC

CA

Training set

Ext-NN Ext-LR PubChem-NN PubChem-LR PubChem-RF Ext-RF MACCS-NN PubChem-Tree Ext-Tree Ext-SVM Ext-NN Ext-LR PubChem-NN PubChem-LR PubChem-RF Ext-RF MACCS-NN PubChem-Tree Ext-Tree Ext-SVM

1.00 1.00 1.00 1.00 1.00 0.93 0.88 0.88 0.93 0.82 0.80 0.89 0.78 0.88 1.00 0.78 0.88 0.64 0.75 0.80

1.00 1.00 1.00 1.00 0.81 0.80 0.92 0.92 0.86 0.83 1.00 1.00 0.90 0.91 0.92 0.90 0.91 0.88 0.82 1.00

16 16 16 16 13 13 15 15 14 14 8 8 7 7 7 7 7 7 6 8

0 0 0 0 3 3 1 1 2 2 0 0 1 1 1 1 1 1 2 0

0 0 0 0 0 1 2 2 1 3 2 1 2 1 0 2 1 4 2 2

13 13 13 13 13 12 11 11 12 10 9 10 9 10 11 9 10 7 9 9

1.000 1.000 1.000 1.000 0.986 0.981 0.971 0.971 0.966 0.962 0.977 0.977 0.955 0.955 0.989 0.966 0.920 0.892 0.818 0.966

1.000 1.000 1.000 1.000 0.897 0.862 0.897 0.897 0.897 0.828 0.895 0.947 0.842 0.895 0.947 0.842 0.895 0.737 0.789 0.895

Test Set

patterns. It appears that the 79 bits are too short to characterize molecules. SubFP fingerprint is arisen from a suit of SMARTS patterns representing functional groups. Each bit of SubFP refers to a specific chemical feature instead of to the general patterns. Thus, features defined in advance are usually limited to represent a great deal of chemicals [Xu et al., 2012], which might lead to result that SubFP was unreasonable as attribute for models. Overall, when the same fingerprints were used, NN and LR algorithms performed better than other machine learning algorithms. For example, the results of 10-fold crossvalidation given by Ext fingerprint revealed that the AUC values of ExtNN and Ext-LR models were both of 1.000, while the AUC values of ExtkNN, Ext-NB, Ext-SVM, Ext-RF and Ext-Tree models were 0.862, 0.828, 0.828, 0.862 and 0.897, respectively (Fig. 4). Among these models, ExtNN, Ext-LR, PubChem-NN and PubChem-LR models gave the best predictive results.

3.2.3. Predictive performance of external test set The predictive performance of the best ten models was evaluated by the external test set. The test set was independent from the training set and was not used to develop model, therefore, the evaluation on the test set could objectively reflect the external predictive power of the models. As seen in Table 2, most models showed good predictivity for the test set, their CA and AUC values ranged from 0.842 to 0.947 and 0.920 to 0.989, respectively, except for PubChem-Tree and Ext-Tree models. Interestingly, the SE values of these models were slightly lower than the SP values in general, which reflected the better predictive performance for “L” NACs in these models. This may be due to the slight imbalance of “H” and “L” class of NACs in the test set with a ratio of 0.7. The PubChem-Tree and Ext-Tree models were not excellent enough due to their inferior SE values. Nevertheless, most models still had SE values more than 0.78 in the present study, especially for PubChem-RF model, a value of 100% predictive accuracy for “H” NACs was obtained. The influence of imbalanced data on the models was analyzed and discussed in many previous studies [Xu et al., 2012; Du et al., 2017; Fan et al., 2018]. Among these models, not only Ext-LR model (CA = 0.947, AUC = 0.977) and PubChem-RF model (CA = 0.947, AUC = 0.989) gave the best results with high predictive accuracy, but also they exhibited good performance for both “H” and “L” NACs. The Ext-LR model was obtained with accuracy of 89% for high mutagenic NACs, and 100% for low mutagenic NACs. The accuracy of PubChem-RF model for high and low mutagenic NACs were 100% and 92%, respectively. On the basis of the results from training and test sets, Ext and

Fig. 4. Performance of the 42 classification models evaluated by 10-fold crossvalidation for the training set. CA, AUC, SE and SP are the classification accuracy; the area under the ROC curve, sensitivity and specificity, respectively.

predictive ability for both “H” and “L” classes. The detailed performance of the top ten models was listed in Table 2. Based on the 10-fold cross-validation results, Ext and PubChem fingerprints yielded the best results. When the same algorithms were used, Est and SubFP fingerprints performed worst in general. Ext fingerprint (1024 bits) is an extension of the CDK fingerprint that take into account ring features, which includes abundant structural information [Sawada et al., 2014]. Of note, NACs is a kind of polycyclic compounds that Ext fingerprint may well characterize their molecular structure information. For instance, Ext fingerprint was shown to perform well for the classification of both androgen/estrogen receptors binders/non-binders and acetylcholinesterase inhibitors/non-inhibitors [Chen et al., 2014; Simeon et al., 2016]. PubChem fingerprint contains 881 bits in length, it contains a variety of different substructures and features defined in the PubChem database, which is a public, freely accessible platform for digging the bio-information resources of small molecules [Bolton et al., 2008]. Est fingerprint performed worst might be its own nature that only 79 bits are involved signified substructure 8

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

Table 3 Representative privileged substructures responsible for the high mutagenicity of NACs that obtained from PubChem fingerprint. No.

Privileged Substructure

PubchemFP12 PubchemFP199 PubchemFP200

General Substructures

IG

FP

FN

≥16 C ≥4 any ring size 6 ≥4 saturated or aromatic carbon-only ring size 6

0.311 0.191 0.191

24 (12) 24 (8) 24 (8)

24 (0) 24 (0) 24 (0)

PubchemFP261

≥4 aromatic rings

0.311

24 (12)

24 (0)

PubchemFP144

≥1 saturated or aromatic carbon-only ring size 5

0.219

24 (9)

24 (0)

PubchemFP839

CC1CC(C)CC1

0.219

24 (9)

24 (0)

PubchemFP860

CC1C(C)CCC1

0.219

24 (9)

24 (0)

PubchemFP192 PubchemFP193

≥3 any ring size 6 ≥3 saturated or aromatic carbon-only ring size 6

0.281 0.281

24 (14) 24 (14)

24 (1) 24 (1)

PubChem fingerprints were proposed for constructing the in silico predictive models of mutagenicity. As a machine learning algorithm, LR may be the best choice for discovering the relationship of structuremutagenicity for NACs. At same time, the predicted results demonstrated the robustness and good prediction accuracy of the proposed models.

Representative Compounds

epoxide), followed by the formation of DNA adducts [Hirano et al., 2013]. In addition, the one-electron reduction of nitro group on the aromatic rings also lead to the generation of electrophilic reagents, which might further strengthen the formation of DNA adducts by NACs, especially for NPAHs, and then acquire higher mutagenicity or carcinogenicity [Peres and Agathos, 2000]. Overall, it is not difficult to understand why these substructures or fragments in NACs will result in high levels of mutation.

3.2.4. Analysis of privileged substructures or structural alerts In order to identify the structural differences between the high mutagenic and low mutagenic NACs, the IG combined with the SFA method [Shen et al., 2010; Xu et al., 2012; Li et al., 2014a] were carried out to identify privileged substructures in the data set based on PubChem fingerprint. The higher IG value means the substructure is more important. For each fragment presented in the “H” and “L” classes, their detailed IG values and frequencies can be seen in Table S7 in the Supplementary material, we found 35 substructures responsible for mutagenicity of NACs. Table 3 listed some representative privileged substructures (fragments) and NACs containing these substructures. As shown in Table 3, nine substructures (≥16C, ≥ 4 aromatic rings, ≥ 1 saturated or aromatic carbon-only ring size 5, 1,3-dimethylcyclopentane, 1,2-dimethylcyclopentane, ≥ 4 any ring size 6, ≥4 saturated or aromatic carbon-only ring size 6, ≥3 any ring size 6, and ≥3 saturated or aromatic carbon-only ring size 6) appeared more frequently in high mutagenic NACs than in low mutagenic NACs. This implies that these substructures are associated with highly mutagenicity of NACs. The former seven substructures were only present in high mutagenic NACs, and the latter two substructures were present in 14 high mutagenic NACs and just one low mutagenic NACs. As a result, these substructures should be responsible for the high mutagenic potency of NACs according to their appearance in each class, and can be considered as structural alerts to predict potential mutagenicity of new or untested NACs. In fact, many mutagenic or carcinogenic PAHs/NPAHs, such as BaP, 1-NP, 1,6-dinitropyrene (1,6-DNP) and 1,8-DNP, contain a certain type of these identified privileged substructures [Hirano et al., 2013; Hayakawa, 2016], in which 1-NP (12) 1,6-DNP (4) and 1,8-DNP (5) were included in the present study as high mutagenic NACs. Other NACs, such as 2,7-dinitrofluorene (9) containing cyclopentane (PubChemFP144), 1-nitrofluoranthene (7) and 3-nitrofluoranthene (10) containing 1,3/2-dimethylcyclopentane (PubChemFP839, 860) fragments, were also three NACs with high mutagenic potency. Once these chemicals enter the cells in vivo, they can be further activated via cytochrome P450 enzymes to generate diol epoxide (e.g., BaP diol

4. Conclusions In the present study, we built the QSAR and classification models for predicting the mutagenicity of NACs. Based on the DRAGON descriptors and quantum chemistry descriptors, the QSAR models were constructed. Finally, the best QSAR model with five molecular descriptors was observed with high quality of statistical parameters (e.g., Q2LOO = 0.950, R2 = 0.967, Q2ext = 0.836 and R2ext = 0.843), exhibiting excellent internal fitting ability, robustness and external predictive performance, significantly exceeded the models without quantum chemistry descriptors and previously reported models. These descriptors (EHOMO, Infective-50, Hypnotic-80, CATS2D_04_LL and TIC2) well explain the contribution of specific molecular properties to mutagenicity. For classification study, the accuracies of 10-fold cross-validation for the training set were 0.828–1.000 in top ten models, meanwhile, for external test set, the range of accuracy was 0.737–0.947. In view of the overall prediction accuracy, the best classification model was obtained with Ext fingerprint along with LR algorithm, which showed an accuracy of 1.000 and 0.947 for the training set and the external test set, respectively. Furthermore, several privileged substructures for screening NACs with high mutagenicity were identified using the IG and SFA method. To some extent, these privileged substructures may be the key structural alerts of mutagenicity or carcinogenicity in vivo. In summary, our predictive models are markedly simple and robust as well as satisfying the rigorous validation criteria recognized by OECD, enabling the production of reliable predictive information using existed mutagenicity data, while reducing the requirement of in vivo and in vitro experiments. In addition, the models can give straightforward understanding of the structural features responsible for the mutagenicity of NACs, thereby providing useful insights into the in vivo toxicity mechanisms for this class of compounds.

9

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

Declaration of competing interest

on prediction of acute oral toxicity of N-nitroso compounds. Int. J. Mol. Sci. 19, 3015. Frisch, M.J., Trucks, G.W., Schlegel, H.B., et al., 2009. Gaussian 09. Gaussian Inc., Wallingford, CT, USA. Gozalbes, R., Julian-Ortiz, J.V.de, 2018. Applications of chemoinformatics in predictive toxicology for regulatory purposes, especially in the context of the EU REACH legislation. Int. J. Quant. Struct.-Prop. Relatsh. 3, 1–24. Gooch, A., Sizochenko, N., Rasulev, B., Gorb, L., Leszczynski, J., 2017. In vivo toxicity of nitroaromatics: a comprehensive quantitative structure-activity relationship study. Environ. Toxicol. Chem. 36, 2227–2233. Golbraikh, A., Tropsha, A., 2002. Beware of q2!. J. Mol. Graph. 20, 269–276. Gramatica, P., Papa, E., Marrocchi, A., Minuti, L., Taticchi, A., 2007a. Quantitative structure-activity relationship modeling of polycyclic aromatic hydrocarbon mutagenicity by classification methods based on holistic theoretical molecular descriptors. Ecotoxicol. Environ. Saf. 66, 353–361. Gramatica, P., Pilutti, P., Papa, E., 2007b. Approaches for externally validated QSAR modelling of nitrated polycyclic aromatic hydrocarbon mutagenicity. SAR QSAR Environ. Res. 18, 169–178. Gramatica, P., 2007. Principles of QSAR models validation: internal and external. QSAR Comb. Sci. 26, 694–701. Gramatica, P., Chirico, N., Papa, E., Cassani, S., Kovarich, S., 2013. QSARINS: a new software for the development, analysis, and validation of QSAR MLR models. J. Comput. Chem. 34, 2121–2132. Gramatica, P., Cassani, S., Chirico, N., 2014. QSARINS-chem: insubria data sets and new QSAR/QSPR models for environmental pollutants in QSARINS. J. Comput. Chem. 35, 1036–1044. Gramatica, P., 2014. External evaluation of QSAR Models, in addition to crossvalidation: verification of predictive capability on totally new chemicals. Mol. Inf. 33, 311–314. Gramatica, P., Sangion, A., 2016. A historical excursus on the statistical validation parameters for QSAR models: a clarification concerning metrics and terminology. J. Chem. Inf. Model. 56, 1127–1131. Gadaleta, D., Mangiatordi, G.F., Catto, M., Carotti, A., Nicolotti, O., 2016. Applicability domain for QSAR models: where theory meets reality. Int. J. Quant. Struct. Prop. Relatsh. (IJQSPR) 1, 45–63. Hayakawa, K., 2016. Environmental behaviors and toxicities of polycyclic aromatic hydrocarbons and nitropolycyclic aromatic hydrocarbons. Chem. Pharm. Bull. 64, 83–94. Hartter, D.R., 1985. The use and importance of nitroaromatic chemicals in the chemical industry. In: Rickert, D.E. (Ed.), Toxicity of Nitroaromatic Compounds. Hemisphere Publishing Corporation, Washington, pp. 1–13. He, L.J., Xiao, K.Y., Zhou, C., Li, G.L., Yang, H.B., Li, Z., Cheng, J.G., 2019. Insights into pesticide toxicity against aquatic organism: QSTR models on Daphnia Magna. Ecotoxicol. Environ. Saf. 173, 285–292. Hirano, M., Tanaka, S., Asami, O., 2013. Classification of polycyclic aromatic hydrocarbons based on mutagenicity in lung tissue through DNA microarray. Environ. Toxicol. 28, 652–659. Kovacic, P., Somanathan, R., 2014. Nitroaromatic compounds: environmental toxicity, carcinogenicity, mutagenicity, therapy and mechanism. J. Appl. Toxicol. 34, 810–824. Kurian, J.R., Chin, N.A., Longlais, B.J., Hayes, K.L., Trepanier, L.A., 2006. Reductive detoxification of arylhydroxylamine carcinogens by human NADH cytochrome b(5) reductase and cytochrome b(5). Chem. Res. Toxicol. 19, 1366–1373. Keith, L.H., Telliard, W.A., 1979. Priority pollutants I-A perspective view. Environ. Sci. Technol. 13, 416–423. Khan, K., Roy, K., Benfenati, E., 2019. Ecotoxicological QSAR modeling of endocrine disruptor chemicals. J. Hazard Mater. 369, 707–718. Klekota, J., Roth, F.P., 2008. Chemical substructures that enrich for biological activity. Bioinformatics 24, 2518–2525. Li, J.S., Yang, L.X., Gao, Y., Jiang, P., Li, Y.Y., Zhao, T., Zhang, J.M., Wang, W.X., 2019. Seasonal variations of NPAHs and OPAHs in PM2.5 at heavily polluted urban and suburban sites in North China: concentrations, molecular compositions, cancer risk assessments and sources. Ecotoxicol. Environ. Saf. 178, 58–65. Li, X., Chen, L., Cheng, F.X., Wu, Z.R., Bian, H.P., Xu, C.Y., Li, W.H., Liu, G.X., Shen, X., Tang, Y., 2014a. In silico prediction of chemical acute oral toxicity using multiclassification methods. J. Chem. Inf. Model. 54, 1061–1069. Li, C., Yang, X.H., Li, X.H., Chen, J.W., Qiao, X.L., 2014b. Development of a model for predicting hydroxyl radical reaction rate constants of organic chemicals at different temperatures. Chemosphere 95, 613–618. Loomis, D., Grosse, Y., Lauby-Secretan, B., El Ghissassi, F., Bouvard, V., BenbrahimTallaa, L., Guha, N., Baan, R., Mattock, H., Straif, K., IARC, 2013. The carcinogenicity of outdoor air pollution. Lancet Oncol. 14, 1262–1263. Marwood, T.M., Meyer, D., Josephy, P.D., 1995. Escherichia-coli lacz strains engineered for detection of frameshift mutations induced by aromatic-amines and nitroaromatic compounds. Carcinogenesis 16, 2037–2043. Mocelo-Castell, R., Villanueva-Novelo, C., Caceres-Castillo, D., Carballo, R.M., QuijanoQuinones, R.F., Quesadas-Rojas, M., Cantillo-Ciau, Z., Cedillo-Rivera, R., Moo-Puc, R.E., Moujir, L.M., Mena-Rejon, G.J., 2015. 2-Amino-4-arylthiazole derivatives as anti-giardial agents: synthesis, biological evaluation and QSAR studies. Open Chem. 13, 1127–1136. OECD (Organization for Economic Co-Operation and Development), 2007. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationships [(Q) SAR] Models. OECD Environment Health and Safety Publications Series on Testing and Assessment No. 69, Paris. OECD (Organization for Economic Co-Operation and Development), 2004. The Report from the Expert Group on (Quantitative) Structure-Activity Relationships [(Q)SARs] on the Principles for the Validation of (Q)SARs. 2nd Meeting of the Ad Hoc Expert Group on QSARs. OECD Headquarters 20-21 September.

The authors declare no potential conflicts of interest. Acknowledgements This work was supported by the Beijing Natural Science Foundation (No. 7184192 and 7162015), the National Natural Science Foundation of China (No. 21778011), the Great Wall Scholars Program of Beijing Municipal Education Commission (No. CIT&TCD20180308), China Postdoctoral Science Foundation funded project (No. 2017M620567), Beijing Postdoctoral Research Foundation (No. 2018-ZZ-022), Chaoyang District Postdoctoral Research Foundation (No. 2018ZZ-0125) and Graduate Science & Technology Foundation of BJUT (No. ykj2018-00860). The authors would like to thank Prof. P. Gramatica in University of Insubria (Varese, Italy) for providing the QSARINS 2.2.2 software. Appendix A. Supplementary data Supplementary data to this article can be found online at https:// doi.org/10.1016/j.ecoenv.2019.109822. References Arey, J., Zielinska, B., Atkinson, R., Winer, A.M., Ramdahl, T., Pitts, J.N., 1986. The formation of nitro-PAH from the gas-phase reactions of fluoranthene and pyrene with the OH radical in the presence of NOX. Atmos. Environ. 20, 2339–2345. Albinet, A., Leoz-Garziandia, E., Budzinski, H., ViIlenave, E., 2007. Polycyclic aromatic hydrocarbons (PAHs), nitrated PAHs and oxygenated PAHs in ambient air of the Marseilles area (South of France): concentrations and sources. Sci. Total Environ. 384, 280–292. Benbrahim-Tallaa, L., Baan, R.A., Grosse, Y., Lauby-Secretan, B., El Ghissassi, F., Bouvard, V., Guha, N., Loomis, D., Straif, K., IARC, 2012. Carcinogenicity of diesel-engine and gasoline-engine exhausts and some nitroarenes. Lancet Oncol. 13, 663–664. Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. Basheer, I.A., Hajmeer, M., 2000. Artificial neural networks: fundamentals, computing, design, and application. J. Microbiol. Methods 43, 3–31. Benigni, R., Bossa, C., 2008. Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat. Res. Rev. Mutat. Res. 659, 248–261. Bolton, E.E., Wang, Y., Thiessen, P.A., Bryant, S.H., 2008. Chapter 12 - PubChem: integrated platform of small molecules and biological activities. Annu. Rep. Comput. Chem. 4, 217–241. Cronin, M.T.D., 2017. (Q)SARs to predict environmental toxicities: current status and future needs. Environ. Sci. Proc. Improv. 19, 213–220. Cao, L.D., Zhu, P., Zhao, Y.S., Zhao, J.H., 2018. Using machine learning and quantum chemistry descriptors to predict the toxicity of ionic liquids. J. Hazard Mater. 352, 17–26. Cover, T.M., Hart, P., 1967. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13, 21–27. Cox, D., 1958. The regression analysis of binary sequences. J. R. Stat. Soc. 2, 215–242. Cortes, C., Vapnik, V., 1995. Support-vector networks. Mach. Learn. 20, 273–297. Chang, C., Lin, C., 2011. LIBSVM: a library for support vector machines. ACM. Trans. Intell. Syst. Technol. 2, 1–27. Chen, Y.J., Cheng, F.X., Sun, L., Li, W.H., Liu, G.X., Tang, Y., 2014. Computational models to predict endocrine-disrupting chemical binding with androgen or oestrogen receptors. Ecotoxicol. Environ. Saf. 110, 280–287. Dragon Software for Molecular Descriptor Calculation V 7.0.6, Kode Srl. Available online: https://chm.kode-solutions.net/(accessed on 3 September 2017).. Deeb, O., Goodarzi, M., 2010. Predicting the solubility of pesticide compounds in water using QSPR methods. Mol. Phys. 108, 181–192. Deng, K.L., Wong, T.Y., Wang, Y.N., Leung, E.M.K., Chan, W., 2015. Combination of precolumn nitro-reduction and ultraperformance liquid chromatography with fluorescence detection for the sensitive quantification of 1-nitronaphthalene, 2-nitrofluorene, and 1-nitropyrene in meat products. J. Agric. Food Chem. 63, 3161–3167. Du, H.W., Cai, Y.C., Yang, H.B., Zhang, H.X., Xue, Y.H., Liu, G.X., Tang, Y., Li, W.H., 2017. In silico prediction of chemicals binding to aromatase with machine learning methods. Chem. Res. Toxicol. 30, 1209–1218. Evans, B.E., Rittle, K.E., Bock, M.G., Dipardo, R.M., Freidinger, R.M., Whitter, W.L., Lundell, G.F., Veber, D.F., Anderson, P.S., Chang, R.S.L., Lotti, V.J., Cerino, D.J., Chen, T.B., Kling, P.J., Kunkel, K.A., Springer, J.P., Hirshfield, J., 1988. Methods for drug discovery - development of potent, selective, orally effective cholecystokinin antagonists. J. Med. Chem. 31, 2235–2246. Erzincan, P., Sacan, M.T., Yuce-Dursun, B., Danis, O., Demir, S., Erdem, S.S., Ogan, A., 2015. QSAR models for antioxidant activity of new coumarin derivatives. SAR QSAR Environ. Res. 26, 721–737. Fan, T.J., Sun, G.H., Zhao, L.J., Cui, X., Zhong, R.G., 2018. QSAR and classification study

10

Ecotoxicology and Environmental Safety 186 (2019) 109822

Y. Hao, et al.

acetylcholinesterase inhibition via QSAR modeling and molecular docking. Peerj 4, e2322. Tugcu, G., Erturk, M.D., Sacan, M.T., 2017. On the aquatic toxicity of substituted phenols to Chlorella vulgaris: QSTR with an extended novel data set and interspecies models. J. Hazard Mater. 339, 122–130. Tratnyek, P.G., Bylaska, E.J., Weber, E.J., 2017. In silico environmental chemical science: properties and processes from statistical and computational modelling. Environ. Sci. Proc. Improv. 19, 188–202. Tugcu, G., Sacan, M.T., 2018. A multipronged QSAR approach to predict algal low-toxiceffect concentrations of substituted phenols and anilines. J. Hazard Mater. 344, 893–901. Toropov, A.A., Toropova, A.P., Benfenati, E., 2009. Simplified Molecular Input Line Entry System-based optimal descriptors: quantitative structure-activity relationship modeling mutagenicity of nitrated polycyclic aromatic hydrocarbons. Chem. Biol. Drug Des. 73, 515–525. Tropsha, A., 2010. Best practices for QSAR model development, validation, and exploitation. Mol. Inf. 29, 476–488. Todeschini, R., Consonni, V., Maiocchi, A., 1999. The K correlation index: theory development and its application in chemometrics. Chemometr. Intell. Lab. Syst. 46, 13–29. Tropsha, A., Gramatica, P., Gombar, V.K., 2003. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 22, 69–77. Todeschini, R., Consonni, V., Mauri, A., Pavan, M., 2000. Handbook of Molecular Descriptors. Wiley-VCH, Germany. Wang, W.T., Jariyasopit, N., Schrlau, J., Jia, Y.L., Tao, S., Yu, T.W., Dashwood, R.H., Zhang, W., Wang, X.J., Simonich, S.L.M., 2011. Concentration and photochemistry of PAHs, NPAHs, and OPAHs and toxicity of PM2.5 during the beijing olympic games. Environ. Sci. Technol. 45, 6887–6895. Walker, S.H., Duncan, D.B., 1967. Estimation of the probability of an event as a function of several independent variables. Biometrika 54, 167–179. Watson, P., 2008. Naive Bayes classification using 2D pharmacophore feature triplet vectors. J. Chem. Inf. Model. 48, 166–178. Wu, X., Zhang, Q., Hu, J., 2016. QSAR study of the acute toxicity to fathead minnow based on a large dataset. SAR QSAR Environ. Res. 27, 147–164. Xu, C.Y., Cheng, F.X., Chen, L., Du, Z., Li, W.H., Liu, G.X., Lee, P.W., Tang, Y., 2012. In silico prediction of chemical Ames mutagenicity. J. Chem. Inf. Model. 52, 2840–2847. Yap, C.W., 2011. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 32, 1466–1474. Zhang, C.L., Yu, Y.Y., Fang, Z., Naraginti, S., Zhang, Y.H., Yong, Y.C., 2018. Recent advances in nitroaromatic pollutants bioreduction by electroactive bacteria. Process Biochem. 70, 129–135.

Ohnishi, Y., Kinouchi, T., Manabe, Y., Tsutsui, H., Otsuka, H., Tokiwa, H., Otofuji, T., 1985. Nitro Compounds in Environmental Mixtures and Foods. Short-Term Bioassays in the Analysis of Complex Environmental Mixtures IV. Springer, US, pp. 195–204. Onlu, S., Sacan, M.T., 2018. Toxicity of contaminants of emerging concern to Dugesia japonica: QSTR modeling and toxicity relationship with Daphnia magna. J. Hazard Mater. 351, 20–28. Onlu, S., Sacan, M.T., 2017. Impact of geometry optimization methods on QSAR modelling: a case study for predicting human serum albumin binding affinity. SAR QSAR Environ. Res. 28, 491–509. Plewczynski, D., Spieser, S., Koch, U., 2006. Assessing different classification methods for virtual screening. J. Chem. Inf. Model. 46, 1098–1106. Perez-Garrido, A., Helguera, A.M., Borges, F., Cordeiro, M., Rivero, V., Escudero, A.G., 2011. Two new parameters based on distances in a receiver operating characteristic chart for the selection of classification models. J. Chem. Inf. Model. 51, 2746–2759. Peres, C.M., Agathos, S.N., 2000. Biodegradation of nitroaromatic pollutants: from pathways to remediation. Biotechnol. Annu. Rev. 6, 197–220. Roy, K., Das, R.N., Ambure, P., Aher, R.B., 2016. Be aware of error measures. Further studies on validation of predictive QSAR models. Chemometr. Intell. Lab. Syst. 152, 18–33. Schlemitz, S., Pfannhauser, W., 1996. Monitoring of nitropolycyclic aromatic hydrocarbons in food using gas chromatography. Zeitschrift Fur LebensmittelUntersuchung Und-Forschung 203, 61–64. Sun, G.H., Fan, T.J., Sun, X.D., Hao, Y.X., Cui, X., Zhao, L.J., Ren, T., Zhou, Y., Zhong, R.G., Peng, Y.Z., 2018. In Silico prediction of O6-methylguanine-DNA methyltransferase inhibitory potency of base analogs with QSAR and machine learning methods. Molecules 23, 2892. Sun, G.H., Fan, T.J., Zhang, N., Ren, T., Zhao, L.J., Zhong, R.G., 2016. Identification of the structural features of guanine derivatives as MGMT inhibitors using 3D-QSAR modeling combined with molecular docking. Molecules 21, 823. Sun, H.M., 2005. A naive Bayes classifier for prediction of multidrug resistance reversal activity on the basis of atom typing. J. Med. Chem. 48, 4031–4039. Speck-Planche, A., Kleandrova, V.V., Luan, F., Cordeiro, M., 2013. Unified multi-target approach for the rational in silico design of anti-bladder cancer agents. Anti Cancer Agents Med. Chem. 13, 791–800. Singh, J., Singh, S., Shaik, B., Deeb, O., Sohani, N., Agrawal, V.K., Khadikar, P.V., 2008. Mutagenicity of nitrated polycyclic aromatic hydrocarbons: a QSAR investigation. Chem. Biol. Drug Des. 71, 230–243. Shen, J., Cheng, F.X., Xu, Y., Li, W.H., Tang, Y., 2010. Estimation of ADME properties with substructure pattern recognition. J. Chem. Inf. Model. 50, 1034–1041. Sawada, R., Kotera, M., Yamanishi, Y., 2014. Benchmarking a wide range of chemical descriptors for drug-target interaction prediction using a chemogenomic approach. Mol. Inf. 33, 719–731. Simeon, S., Anuwongcharoen, N., Shoombuatong, W., Malik, A.A., Prachayasittikul, V., Wikberg, J.E.S., Nantasenamat, C., 2016. Probing the origins of human

11