Global versus local QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line: The predictive ability

Global versus local QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line: The predictive ability

Accepted Manuscript Global versus local QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line: The predictive abilit...

1MB Sizes 1 Downloads 33 Views

Accepted Manuscript Global versus local QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line: The predictive ability

A. Sosnowska, M. Grzonkowska, T. Puzyn PII: DOI: Reference:

S0167-7322(16)32855-0 doi: 10.1016/j.molliq.2017.02.025 MOLLIQ 6938

To appear in:

Journal of Molecular Liquids

Received date: Accepted date:

20 October 2016 7 February 2017

Please cite this article as: A. Sosnowska, M. Grzonkowska, T. Puzyn , Global versus local QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line: The predictive ability. The address for the corresponding author was captured as affiliation for all authors. Please check if appropriate. Molliq(2017), doi: 10.1016/ j.molliq.2017.02.025

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Global versus local QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line: the predictive

SC RI PT

ability

NU

A. Sosnowska, M. Grzonkowska and T.Puzyn*

Laboratory of Environmental Chemometrics, Institute for Environmental and Human Health

AC

CE

PT

ED

MA

Protection, Faculty of Chemistry, University of Gdansk, Gdansk, Poland

*Address correspondence: T. Puzyn, Laboratory of Environmental Chemometrics, Department of Chemistry, University of Gdansk, Wita Stwosza 63, 80-308 Gdansk, Poland, phone: (+48 58) 523 52 48; fax: (+48 58) 523 50 12; e-mail: [email protected]

ACCEPTED MANUSCRIPT

Abstract The unlimited possibility of modifying the structure of the cation and anion of a given ionic liquid (IL) delivers a countless number of potential derivatives having different toxicological activities. The experimental measurements for such huge set of compounds are impossible

SC RI PT

due to time and costs limitations. In the absence of experimental data, computational methods such as quantitative structure–activity relationship (QSAR) method can be applied to characterize those species for which experimental data is not available. However, two different ways of QSAR modeling could be applied: local modeling - dedicating to one specific group and more complex global models. The aim of this study was to investigate how local/global QSAR modeling works for the ionic liquids. Which of these two approaches

NU

is more suitable in ILs’ modeling? We have developed six local models and a global one for predicting the toxicity against the leukemia rat cell line IPC-81. Based on the obtained results,

MA

we conclude that local models are useful to solve specific mechanistic problems among one group of ILs, but in case of predicting the toxicity for a new designed compound the global

Keywords:

AC

CE

PT

ED

model, whenever fulfills all quality recommendations, is recommended.

Global models; local model; QSAR; ionic liquids; toxicity against the leukemia rat cell line IPC-

81

ACCEPTED MANUSCRIPT 1. Introduction Ionic liquids (ILs) are nowadays one of the very interesting groups of compounds, which have very desirable properties in many branches of industry like: the electronic devices[1], agriculture[2, 3] as well as chemical industry [4]. Their unique properties like low vapor pressure, melting point below 100C and high thermal

SC RI PT

stability, allow their application as a promising alternative to the classical solvents [5]. Our recently studies were focused on the investigating, which structural features are responsible for IL’s properties (physicochemical properties [6, 7] or toxicity [8]). This knowledge is essential in the proces of desiging new IL with predefined

NU

properties. Since ionic liquids can exhibit wide-ranging toxicity, and in some cases they are more toxic than the organic solvents [9], the step of designing the ILs with

MA

wished properties always has to consider also the toxicity potential of the proposed compounds. Because the number of possible cation-anion combinations, which built

ED

every IL is extensive, the empirical toxicity measurements are impossible, due to time

PT

limitation and high cost of the analytical procedures. In this case the alternative, computational methods are recommended for predicting the toxicity of ILs [10].

CE

Recently, one of the widely used groups of in silico method is the quantitative structure-activity/property relationships (QSAR/QSPR) approach. This method is

AC

based on defining mathematical dependencies between the variance in molecular structures, encoded by so-called ‘molecular descriptors’ and the variance in biological activity/property in a set of compounds. The molecular descriptors are the final results of a logic and mathematical transformations of chemical information encoded in the molecule structure into a useful number or the result of some standardized experiment [11]. This means, that in case of QSAR if one has calculated or experimentally measured molecular descriptors for a group of similar chemicals and the toxicological

ACCEPTED MANUSCRIPT data are available only for a part of this group, one is able to predict the lacking data from the molecular descriptors with a proper mathematical model. There are two possible QSAR/QSPR modeling strategies: the local and global models. The local models approach can be used only for one specific class of compounds (e.g., only for imidazolium ILs). Therefore the local models have limited

SC RI PT

chemical domain and are an alternative to global one[12], which is calibrated on large number of compounds (e.g., ammonium, imidazolium, pyridinium, morpholinium, phosphonium etc. ionic liquids). In the literature, the global and local QSPR/QSAR models for different compounds are commonly described[12-14]. Oberg et al.[12]

NU

developed the global and local PLS regression models to predict vapor pressure for organic compound. The Helgee et al.[13] presented the evaluation of global and local

MA

QSAR modeling strategies for simulated data. The comparison was conducted for risk assessment and computational cost. They demonstrated that the local models have

ED

better predictive ability, although the global one is more desirable from the economic point – such approach enables to additionally save resources by predicting data for

PT

larger group of compounds in the same time. Puzyn et al.[14] also described the global

CE

and local QSPR approaches. They recommended application of the global model, because the predictive ability of both approaches was similar. Moreover, they point out

AC

that global modeling may be the only useful approach, when the number of chemicals from one specific group is insufficient for the development of local QSPR model. In the light of these studies the huge advantage of the global QSAR/QSPR model is the ability to predict properties for hundreds compounds, but at the same time such strategy may lead to mechanistic simplification and/or higher errors in the predicted data [15].

ACCEPTED MANUSCRIPT The main goal of presented study was to investigate how local/global modelling works for the ionic liquids. Is it possible to develop the global QSAR model for predicting ionic liquids toxicity that has comparable predictive ability to the local ones? To answer this question we have developed six local QSAR models for predicting the IL’s toxicity against the leukemia rat cell line IPC-81 separately for each

SC RI PT

individual group of ILs (ammonium, imidazolium, morpholinium, phosphonium, piperidinium, pyridinium, pyrrolidinium). Simultaneously, the global QSAR model for all collected ILs’s data (304) was developed. Than local and global models were compared in case of the average residuals from the predictions. We verified which of

NU

these two strategies provide more reliable results. Conclusions from our work would

MA

be further used for evaluation of the toxicity of newly designed ionic liquids.

2. Material and methods

ED

2.1. Development of the global and local QSAR models The high-quality QSAR models was developed according to the following five steps:

PT

(i) collecting experimental data; (ii) calculating molecular descriptors; (iii) splitting the

CE

dataset into training and validation subset; (iv) determining the relationship between structure and activity, and validating the developed models and (v) comparing the local

AC

and global QSAR models.

2.2. Dataset The development of the high-quality QSAR models characterized by true predictive ability requires reliable experimental data. This step is very significant during model development, because quality of the data affected to the results of the modeling. It is

ACCEPTED MANUSCRIPT crucial, that every experimental data point should be measured in the same conditions, according to the same procedure. For developing the QSAR models, which quantitatively describe the relationship between the structure and ILs toxicity to leukemia rat cell line, we collected the 304 experimental data from the literature [16-25] (please refer to the Table ES1 in

SC RI PT

Electronic Supplementary Materials). The data were divided into 10 groups according to ILs’ cation type (ammonium: 35; imidazolium: 104; morpholinium: 33; phosphonium: 4; protic ionic liquids (PILs), based on substituted amines (monethanolamine, diethanolamine, triethanolamine): 9; piperidinium: 17; pyridinium:

AC

CE

PT

ED

MA

NU

67; pyrrolidinium: 28; quinolinium: 5; sulfonium: 2), which are presented in Figure 1.

Fig.1. Structures of tested ionic liquids cations

ACCEPTED MANUSCRIPT

In the collected dataset, we found 43 points for which the measurements were above or below the detection limit of the method. Because, in case of building the local model we needed as much as possible data to properly calibrate the model, therefore we decided to not exclude these compounds from our dataset. In order to solve the

SC RI PT

problem of the measurements that are above or below the limit detection we applied the transforming method recommended by Brereton [26]. The experimental data, which the values of EC50 were greater than the detection limit, were multiplied twice (e.g. data from the literature: EC50 > 20000 [M] was transformed into  EC50 =

NU

40000 [M]), whereas the experimental data which the values of EC 50 were below the detection limit were divided by two (e.g. data from the literature: EC 50 < 1.0 [M] was

MA

transformed into: EC50 =0.5 [M]). After this step, all collected data were transferred into logarithmic units (log EC50 [M]).

ED

The experimental data collected in this study are also available in our online ILsTOX

PT

database (www.db.qsar.eu.org).

CE

2.3. Descriptor calculations

AC

The structures of IL’s cation and anion were drawn separately in ChemSketch software [27]. The geometry of each ions was optimized at the semi-empirical PM7 level using MOPAC 2012 software [28]. It has been previously proved, that the semi-empirical PM7 allows to get similar results of predictions compare to ab initio and DFT method of molecular geometry optimization [29]. The molecular descriptors were calculated by the Dragon (ver. 6.0) software [30]. Taking to account our previous study on the toxicity of ionic liquids [31] and our unpublished data, we pre-selected five groups of descriptors (which exhibit the high correlation with the toxicity) to be employed for

ACCEPTED MANUSCRIPT developing the QSAR model, namely: (i) Weighted Holistic Invariant Molecular Descriptors (WHIM), (ii) ring descriptors, (iii) functional group counts, (iv) topological and (v) constitutional indices.

2.4. Splitting data

SC RI PT

To develop the local QSAR models we used only these groups of ILs, which contain more than 10 ionic liquids. In this way, we have 6 local groups (ammonium: 35 ILs; imidazolium: 104 ILs; morpholinium: 33 ILs; piperidinium: 17 ILs; pyridinium: 67 ILs; pyrrolidinium: 28 ILs). For the global one we used all collected 304 ILs.

NU

In the next step, the ionic liquids were sorted according to the increasing values of the log EC50, and each of datasets was split into two subsets: training and test (validation)

MA

set. The splitting was performed using a “Z:1 algorithm”, according to which, after sorting, every Z-th compound was put into a test set [32]. In our analysis Z=3. The

ED

application of this ‘‘three-to-one’’ splitting algorithm ensured that the both training and validation sets contained the compounds evenly distributed within the range of ILs

PT

toxicity [33]. The ILs with extreme toxicity values, were included in the training set. In

CE

this way, we obtained the training set containing 70-75%, and validation set with about 25-30% of ionic liquids. The details about splitting the data into the training and

AC

validation sets for particular groups of ILs are presented in the Table ES2A- ES2F, ES3 in the Electronic Supplementary Material.

ACCEPTED MANUSCRIPT

2.5. QSAR modeling - calibration and validation We developed QSAR models following the recommendations of the Organization for Economic Cooperation and Development (OECD) [34]. We employed the Multiple Linear Regression (MLR) as the chemometric method of modeling. The MLR allows presenting

SC RI PT

simple linear relationship based on few independent variables (molecular descriptors). To select the most optimal combination of descriptors for QSAR modeling we used the Genetic Algorithm (GA) implemented in the QSARINs (ver. 2.1) software [35]. The detailed control parameters for particular genetic algorithm are available in Table ES4 in Electronic Supplementary Material section. The descriptors selected according to the GA were utilized

NU

in the QSAR models development.

The goodness-of-fit of the developed models was measured by the determination coefficient

MA

(R2)and root-mean-square error of calibration (RMSEC). The internal validation was conducted using the Leave-One-Out (LOO) cross-validation technique. The squared cross-

ED

validation coefficient (Q2CV) and the root mean square error of cross-validation (RMSECV)

models’ overfitting.

PT

were calculated in order to assess the robustness of the models and reduce probability of the

CE

The predictivity of the model was determined by external validation based on the predictions

AC

performed for the independent test set (the compounds that were not previously used during the model’s development or calibration). There are many proposed metric for validating the QSAR/QSPR models: Q2 metrics (Q2F1, Q2F2, Q2F3) [36-38], r2m metrics, namely: r2m, r2m and average r2m aver for internal, external and overall validation tests [39] or concordance correlation coefficient (CCCext) [40]. However, recent studies of Todeschini et al. [41] suggest that the most stable and reliable parameter in evaluating the prediction ability of regression models is Q2F3 calculating according to the following equation:

ACCEPTED MANUSCRIPT 2 𝑄𝐹3

=1−

𝑛𝑂𝑈𝑇 ∑𝑖=1 (𝑦𝑖 − 𝑦̂ 𝑖/𝑖 )2 /𝑛𝑂𝑈𝑇 𝑛𝑇𝑅 ∑𝑖=1 (𝑦𝑖 − 𝑦̅ 𝑇𝑅 )2 /𝑛 𝑇𝑅

where 𝑦𝑖 corresponds to the experimental response of the i-th object, 𝑦̂ 𝑖/𝑖 is the predicted response when i-th object is not in the training set, 𝑛 𝑇𝑅 and 𝑛𝑂𝑈𝑇 corresponds to the number of training and test set sets, 𝑦̅ 𝑇𝑅 is the average value of the training set experimental response.

SC RI PT

Therefore we utilized the squared external validation coefficient (Q 2F3) and the root mean square error of prediction (RMSEP) to determine the predictivity of the developed model. The root mean squared error values for training, cross-validation and external validation should be low and simultaneously as similar to each other as possible [42].

NU

The applicability domain (AD) is a theoretical spatial region defined by the values of molecular descriptors and the modelled response. In the presented study, the applicability

MA

domain was verified by using the leverages (Williams plot) [42] and standardization approach [43]. The Williams plot presents the values of standardized residuals versus the

ED

leverage values (h). The boundary of the AD is defined by the critical value of leverage, h*

PT

(h*=3p'/n, where p' is the number of model variables plus one, and n is the number of the training compounds used to develop the model) and the values of the standardized cross-

CE

validated residuals differing by more than ± 3 standard deviation units (). The value of the

AC

hi greater than the h* means that the compound’s structure significantly differs from these from the training set. Compound from the training set with standardized residuals <  3 and high leverage would reinforce the QSAR model (good leverage), but such a compound in the test set could have unreliable predicted data (bad leverage) [42] The standardization approach based on the assumption, that the compounds descriptors should follow a normal distribution pattern; where 99.7% of the population will remain within the range mean  3 [43].

ACCEPTED MANUSCRIPT

2.6. Methodology of comparing local and global QSAR models We performed comparison between the local QSAR models and the global one. We took into account the quality of each, characterized by the previously mentioned statistics. We also employed the paired statistic Student’s t test to verify, whether the average residuals

SC RI PT

from the predictions with the local and global QSAR models differ significantly and if the local models provide more reliable results according to global model.

3. Results 3.1. Local QSAR models

NU

We have developed a statistically significant local QSAR models based on the experimental

MA

log EC50 values and calculated descriptors using the MLR method. The linear equations and statistic parameters describing quality of the developed models are presented in Table 1 and

ED

Table 2 respectively. Table 1. Local QSAR models equations

PW4A -path/walk 4 - Randic shape index; RPerimC - ring perimeter E1iC- 1st component accessibility directional WHIM index, weighted by ionization potential

CE

ImidazoliumQSAR model

logEC50 = 3.29(0.09) – 0.61 (0.09)PWA – 1.10(0.10)RPerimC – 0.43(0.09)E1iC

logEC50 = 2.87(0.05) – 0.30 (0.05)nFA + 0.37(0.05)nOC – 0.88(0.05)S3KC

AC

AmmoniumQSAR model

PT

Local QSAR models

nFA - number of Fluorine atoms; nOC - number of Oxygen atoms; S3KC - 3-path Kier atpha-modified shape index

logEC50 = 3.80(0.05) – 0.33(0.06)L1uA – 0.34(0.06)E3mA – 0.15(0.06)G2pC

MorpholiniumQSAR model

PiperidiniumQSAR model

L1uA - unweighted 1st component size WHIM index; E3mA - 3rd component accessibility index weighted by mass; G2pC - 2nd component symmetry index weighted by polarizability

logEC50 = 3.91(0.04) – 0.50(0.04)SiA + 0.17(0.04)Psi_i_AC SiA - sum of first ionization potentials scaled on Carbon atom; Psi_i_AC - intrinsic state pseudoconnectivity index - type S average

ACCEPTED MANUSCRIPT logEC50 = 3.03(0.06) – 0.09(0.06)MSDA – 0.22(0.06)nNC – 0.76(0.06)ON0VC+ 0.23(0.06)L2mC

PyridiniumQSAR model

MSDA - mean square distance index; nNC - number of Nitrogen atoms; ON0VC - overall modified Zagreb index of order 0 by valence vertex degrees; L2mC - 2nd component size directional WHIM index weighted by mass

logEC50 = 3.41(0.09) – 0.53(0.09)MAXDNA – 0.70(0.10)DECCC– 0.17(0.10)E3pC

PyrrolidiniumQSAR model

MAXDNA - maximal electropological negative variation; DECCC - eccentric; E3pC - 3rd component accessibility directional WHIM index weighted by polarizability

SC RI PT

The indexes A and C means that the descriptors were calculated for anion or cation.

Table 2. Statistic parameters obtained in developed local and global QSAR models R2

Q2CV

Q2F3

RMSEC

26

9

0.90

0.87

0.88

0.41

78

26

0.84

0.81

0.81

0.43

25

8

0.81

0.73

0.88

12

5

0.95

0.92

0.94

50

17

0.79

0.75

21

7

0.83

228

76

0.77

RMSEP

F

p

0.49

0.46

69.20

2.31*10-11

0.46

0.46

126.45

4.75*10-29

0.25

0.30

0.20

30.75

7.2*10-8

0.11

0.15

0.14

95.95

9.10*10-7

0.75

0.39

0.43

0.43

42.26

1.10*10-14

0.75

0.80

0.38

0.45

0.36

27.16

1.033*10-6

0.76

0.84

0.51

0.53

0.43

150.38

3.206*10-69

NU

RMSECV

MA

nv

PT

Global model

nt

ED

QSAR model Ammoniummodel Imidazoliummodel Morpholinium - model Piperidiniummodel Pyridiniummodel Pyrrolidiniummodel

CE

The goodness-of-fit and predictive ability of the presented models have been confirmed by the values of the statistic measures. The determination coefficient, R 2 and the values of both

AC

internal and external validation coefficients (Q2CV and Q2F3) (although the number of external validation compounds in piperidinium and pyrrolidinium sets was small) are close to one and similar to each other. This means that models are stable and correctly predict the endpoint values for the compounds used for training. The parameters for external validation (Q2F3 and RMSEP) are satisfactory, therefore they can be considered as predictive. These conclusions are additionally supported by the high visual correlation between the observed and predicted values of the log EC50 (Figure 2). We have proven also the models’ robustness

ACCEPTED MANUSCRIPT using the Y-scrambling analysis [44] (Fig.ES1 in Supplementary Info). Following the Yscrambling algorithm we utilized the same descriptors like in QSAR development , to build 100 random “models” – every time the descriptors were correlated with randomly shuffled values of the log EC50 for the training compounds. The two times higher values of RMSE C and RMSECV calculated for the randomly generated models than these of the real QSAR

AC

CE

PT

ED

MA

NU

SC RI PT

model confirmed that model was not a result of by-chance correlation.

Figure 2. Observed vs. predicted toxicity plot for the local regression models: A) ammonium, B) imidazolium, C) morpholinium, D) piperidinium, E) pyridinium, F) pyrrolidinium

In order to define the applicability domain of the developed models we applied the Williams plot (Fig. 3) and the standardization approach (Table ES2A-ES2F). According to both

ACCEPTED MANUSCRIPT approaches almost all compounds belong to model’ applicability domain. Only, in case of ammonium (Table ES2A), imidazolium- (Table ES2B) and morpholinium- (Table ES2C) QSAR models the standardization approach indicates that several points are defined as outliers or points lying outside the AD. According to the Williams plots (Fig. 3B and Fig. 3C), the value of hi for these compounds from training set of imidazolium- and

SC RI PT

morpolinium-models are greater than the h*, which means that structures of these compounds are significantly differ from those from the training set, but the predictions for such compounds are reliable, so they reinforce the QSAR models (good leverage). However, two compounds from validation set of imidazolium-model (B60, B65) are defined as bad leverages, because they could have unreliable predicted data. The standardization and

AC

CE

PT

ED

MA

NU

leverage approaches indicate identical ILs located above AD.

ACCEPTED MANUSCRIPT Figure 3. Williams plot showing the domain of applicability: A) ammonium (h*=0.46), B) imidazolium (h*=0.15), C) morpholinium(h*=0.48), D) piperidinium (h*=0.75), E) pyridinium (h*=0.30), F) pyrrolidinium (h*=0.57) - based local models

According to the OECD recommendations [34] each QSAR model, if possible, should has mechanistic interpretation that explains the relationship between the structure and biological activity of the studied compounds.

SC RI PT

Descriptors selected to develop the local QSAR models (Table 1, eq. 1-6) are mostly related to the shape or size of the cation structure (WHIM descriptors, calculated for cation and weighted by different schemes (polarizability or mass) and topological descriptors related to the size and shape of the ions structure). It has been previously mentioned, that major impact

NU

on the ionic liquids toxicity has the cation structure, mostly branching and size of attached alkyl chains [16] [31] [45]. According to studies by Stolte et al. [23], the investigated anions

MA

show no or only insignificant cytotoxicity effect. However, there are two possible explanations of the ILs cytotoxicity. One is related to the lipophilic effect, and the second is

ED

based on the intrinsic reactivity of their hydrolysis products. The lipophilicity increases with

PT

the cation’ chain length [16]. Results predicted by our QSAR models are in agreement with the previously mentioned studies. The shape and size of the cation have the largest influence

CE

on the ionic liquids cytotoxicity. The presence of the hydrogen groups in the cation structure decreases the ILs toxicity. For ions that contain the same number of atoms, the less branched

AC

(more linear) structures are more toxic than the more branched ones. The anions’ structure usually has insignificant impact on the ILs toxicity, but in case of similar cations, the anion decides about ILs toxicity. The application of the local model strategies allowed choosing the molecular descriptors, which more precisely describe a specific ILs group. The values of the calculated descriptors used in the models are presented in the Electronic Supplementary Material. 3.2. Global QSAR model

ACCEPTED MANUSCRIPT Based on the whole set of experimental toxicity values (log EC 50) collected for 304 ILs and the calculated molecular descriptors we have developed a statistically significant global QSAR model. The linear equation and statistic parameters describing quality of the developed models are presented in following equation and in the Table 2: logEC50 = 3.15(0.03) – 0.30(0.04)SNarA – 0.78(0.03)MAXDPC + 0.19(0.06)nRORC – 0.21(0.04)D/Dtr10C

SC RI PT

nt=228, nv=76, R2=0.772, RMSEC=0.513, Q2CV=0.758, RMSECV=0.528 Q2EXT=0.839, RMSEP=0.434, F=150.38, p=3.21*10-6

The predictive ability and the goodness-of-fit of the developed model have been proved by

NU

the values of the statistic measures. The values of both validation coefficients are relatively close to one. There are no significant differences between particular mean square errors

MA

(RMSEC, RMSECV and RMSEP) therefore the model is not overfitted and predicts endpoint correctly, not only in the space of training set, but also for other (new) ILs. We

ED

have proven also the models’ robustness using the Y-scrambling analysis [44] (Fig.ES1 in Supplementary Info). Almost two times higher values of RMSEC and RMSECV calculated

PT

for the randomly generated models than these of the real QSAR model confirmed that model

CE

was not a result of by-chance correlation. Furthermore, the high visual correlation between the observed (experimental) and predicted values of the log EC50 (Figure 4A) additionally

AC

supports the conclusions from the validation step.

SC RI PT

ACCEPTED MANUSCRIPT

Figure 4. A) Observed vs. predicted toxicity (in M units) plot for the global-QSAR model, and B) Williams

MA

NU

plot showing the applicability domain (h*=0.08) of the developed global QSAR model.

The applicability domain of the global-QSAR model was defined by application of the

ED

standardization (Table ES3) and leverages (Fig. 4B) approaches. According to the standardization approach 12 compounds from training and 3 from validation set do not

PT

belong to model’ applicability domain. On the Williams plots (Fig. 4B) 11 ILs from training, and 2 ILs from validation set characterizes by the greater hi values than the h*. Generally,

CE

this is consistent with results obtained from standardization approach, however the

AC

standardization approach also showed two another points lying above the AD: B54 - 1hexadecyl-3-methylimidazolium chloride and A25 benzyltetradecyldimethylammonium chloride (on Williams plot these compounds lying inside the AD). Five of the 13 compounds (with hi > h*) belong to the quinolinium ILs group (I1-I5), or are characterized by different (in more cases-bigger) cation or anion structures e.g. trihexyltetradecylphosphanium bis(trifluoromethylsulfonyl)amide

(D1)

or

1-methyl-3-(3,3,4,4,5,5,6,6,7,7,8,8,8-

tridecafluorooctyl)imidazolium hexafluorophosphate (B93).

ACCEPTED MANUSCRIPT The global QSAR model is a combination of five descriptors. One of them is calculated for anion structure (Narumi simple topological index (SNarA)) while the rest are calculated based on the cation structure. They are: mean square distance index (MSD C), maximal electrotopological positive variation (MAXDP C), and the number of ethers groups (nRORC) and distance/detour ring index of order 10 (D/Dtr10C).

SC RI PT

The SNarA descriptor belongs to the topological indices and is a molecular descriptor related to molecular branching, calculated of the vertex degrees [11]. For small anions the values of this descriptor are low (e.g. tetrafluoroborate, SNarA= 1.386, hexafluorophosphate, SNarA =1.792), and increase for larger structures (e.g. bis(trifluoromethylsulfonyl)amide SNarA = 6.238). One can observe that the increase of the value of the SNar descriptor results

NU

increasing toxicity value. Second descriptor, the mean square distance index (MSDC) also

MA

belongs to the group of topological indices and is calculated from the second-order distance distribution moment [11] related to the size and the structure branching. For small cations the values of this descriptor are low, and increase for large structures. In fact the increase of

ED

MSDC value (usually) caused the increase of ionic liquid toxicity. For example, 1-butyl-3-

ionic

liquids

having

PT

methylimidazolium chloride (MSDC=3.360; log EC50obs= 3.55) is less toxic than the other the

same

anion,

but

bigger

cation,

namely:

1-hexyl-3-

CE

methylimidazolium (MSDC=4.255; log EC50obs= 2.85) and 1-octyl-4-methylpyridinium

AC

(MSDC=5.416; log EC50obs= 1.63). However, few ionic liquids characterized by low MSD C values exhibit higher toxicity as well. For example: tetraethylammonium (MSDC=2.920; log EC50obs= 3.78) and 1-butyl-3-methylimidazolium (MSDC=3.360; log EC50obs= 3.55) are more toxic

than

1-(3-methoxypropyl)-1-methylpiperidiniumchloride

(MSDC=3.764;

log

EC50obs= 4.40). The MAXDP descriptor describes the maximum positive intrinsic state difference and can be related to the electrophilicity of the molecule [46]. According to the eq. 7 it can be observe that the increase of the MAXDP C descriptor value results in the

ACCEPTED MANUSCRIPT decrease of ILs toxicity. For example 1-propylpyridinium bis(trifluoromethylsulfonyl)amide (MAXDPC= 0.181, log EC50obs= 3.20) is more toxic than 1-(2-hydroxyethyl)pyridinium bis(trifluoromethylsulfonyl)amide

(MAXDPC= 2.499, log EC50obs=

3.79).

The

next

descriptor is the number of ether groups (nROR C). The increase of the nRORC value results in the decrease of the ILs toxicity. For example, ethyl(2ethoxyethyl)dimethylammonium

ionic

liquids

having

the

same

hydroxyethyl)dimethylammonium

anion,

SC RI PT

bis(trifluoromethylsulfonyl)amide (nRORC=1, log EC50=3.27) is less toxic, than the other but

(nRORC=0,

various

log

cation,

EC50=3.70).

namely:

The

ethyl(2-

D/Dtr10C

ring

descriptor describes the distance ring index of order 10 and assumes the value equal 1 only

NU

in case of quinolinium ILs.

MA

3.3. Comparison of global and local QSAR models

Due to compare the two approaches of QSAR models, first we evaluate their statistics.

ED

Without doubts, the measures of goodness-of-fit (Table 1, and global model) favour the local QSAR models. However, in some cases the measures of robustness and predictivity

PT

favour the global QSAR model. For example, the lower values (differences in the second decimal place) of Q2CV were notice for morpholinium-, pyridinium- and pyrrolidinium-

CE

QSAR models. Also the Q2F3 values calculated for some local models (ammonium,

AC

imidazolium, pyridinium and pyrrolidinium) are slightly lower than Q2F3 for the global one. Higher correlations coefficient (R2) and lower values of the root mean square errors proved that the local models were better fitted to the experimental points than the global model. We also performed the comparison of the experimental log EC 50 values in pairs with predictions from the particular local and global models. Figure 5 presents the results based on piperidinium and morpholinium ionic liquids. The comparison for other ILs group is presented in the Figure ES2 in the Electronic Supplementary Materials. In case of

ACCEPTED MANUSCRIPT piperidinium, the prediction errors were lower for the local model than the prediction errors for the global model (Fig 5A). By employing the Student’s t signed-rank test, we have confirmed that the average residuals (for 33 piperidinium ionic liquids) for both models differed significantly (p=0.0023, p*=0.0500). The piperidinium local model is the only case, when the employed statistics tests confirmed the significant differences between residuals

SC RI PT

obtained from global and local models (Table 3). In case of morpholinium (for 37 ILs) local model the prediction errors are slightly lower than the prediction errors from the global one (Fig. 5B). We employed paired Student’s t test, which confirmed that the average residuals for both models do not differed significantly (p=0.6137, p*=0.0500). Similar results were obtained for the rest of local models (ammonium, imidazolium, pyridinium, pyrrolidinium)

AC

CE

PT

ED

MA

NU

Fig. ES2 and Table 3.

Fig. 5. Residual values (in log units) calculated for A) piperidinium and B) morpholinium ionic liquids based on the predictions with global and local QSAR models

ACCEPTED MANUSCRIPT

Table 3. Comparison between the residuals derived from the prediction of log EC50 (for rat cell line IPC-81) with local and global MLR models by the paired Student t test; p*=0.0500

p value

ammonium

0.4521

imidazolium

0.9791

morpholinium piperidinium pyridinium

SC RI PT

Model

0.6137 0.0023 0.4346 0.2413

NU

pyrrolidinium

MA

The applicability domains of local and global models were verified by using the leverage (Fig.3, Fig 4B) and standardization approaches. The global model has been calibrated and

ED

validated also on ILs groups not used to develop any of the local models (e.g. phosphonium, quinolinium, sulfonium, PILs). In the global QSAR model the applicability domain is wider,

PT

but the similarity among the compounds is lower. According to Williams plot presented for the global model (Fig. 4B) there are 13 points outside of the applicability domain. Among

CE

these compounds there are ionic liquids belonging to the applicability domain of particular

AC

local models. This is observed, for example, in case of imidazolium ionic liquids (marked on plots as B). Imidazolium ionic liquids outside the local model AD belong to the global model AD, and vice versa. In order to verify the impact of the selected descriptors (from the local or from the global dataset) on the results of the QSAR modelling, we developed the local models based on descriptors previously used in the global one (excluding the descriptors with constant values for specific classes of ILs). In our study, we assumed that the ratio between the number of descriptors and training compounds should be at least 5:1 [47]. The global model was

ACCEPTED MANUSCRIPT calibrated on five molecular descriptors, so according to the above criterion, we cannot develop the piperidinium model based on these descriptors (n training=12). What is more we have applied two different algorithms for splitting the data into the local training sets (1:x algorithm, and Kennard-Stone algorithm[48]). The results of comparison confirm that better local models could be obtained by selecting descriptors characteristic for the specific group

SC RI PT

of ionic liquids (Table 4). This means that local models development could be more useful to solve specific mechanistic problems, for example to characterise which particular structure feature among the one group of ILs affect the toxicity. Nevertheless, in the development of local models one should always remember of the risk corresponds to the

NU

training overfitting due to the high number of descriptors dedicated to the number of variables. The second crucial potential disadvantage of local model is the problem in the

MA

model’s validation procedure because of the few number of the molecules in the validation set. The development of several local models, based on smaller dataset, in one time are more

ED

time consuming than development of one global model. What is more, the huge advantage of global model is the ability to capture such structural features, (which can be shared in

PT

different groups of compounds) that allow for the good endpoint prediction for molecules

CE

that do not belong to a very specific group.

AC

Table 4. Basic statistics original local QSAR models (descriptors selection from the local training set), and local models developed based on molecular descriptors from global QSAR equation (descriptors selection from the global training set) Descriptors selection from local training set (1:x) global training set (1:x) global training set (KS) local training set (1:x) global training set (1:x) global training set (KS) local training set (1:x) global training set (1:x)

R2 Q2CV Q2F3 RMSEC Ammonium-QSAR model 0.90 0.87 0.88 0.41 0.81 0.70 0.83 0.59 0.81 0.73 0.81 0.58 Imidazolium-QSAR model 0.84 0.81 0.81 0.43 0.80 0.76 0.82 0.48 0.81 0.77 0.65 0.50 Morpholinium-QSAR model 0.81 0.73 0.88 0.25 0.48 0.18 0.71 0.42

RMSECV

RMSEP

0.49 0.73 0.73

0.46 0.55 0.54

0.46 0.52 0.55

0.46 0.44 0.34

0.30 0.53

0.20 0.30

ACCEPTED MANUSCRIPT 0.50 0.26 0.62 Pyridinium-QSAR model 0.79 0.75 0.75 local training set (1:x) 0.68 0.59 0.75 global training set (1:x) 0.70 0.64 0.65 global training set (KS) Pyrrolidinium-QSAR model 0.83 0.75 0.80 local training set (1:x) 0.70 0.48 0.70 global training set (1:x) 0.73 0.53 0.75 global training set (KS) KS – splitting according to the Kennard-Stone algorithm.

0.42

0.52

0.27

0.39 0.49 0.48

0.43 0.55 0.53

0.43 0.43 0.46

0.38 0.50 0.47

0.45 0.65 0.62

0.36 0.45 0.52

SC RI PT

global training set (KS)

4. Conclusion

In this study we have verified how the QSAR local/global modelling works for ionic liquids. We have developed six local models and one global for predicting the toxicity against the

NU

leukemia rat cell line IPC-81. The first approach (local QSAR models) allows developing the QSAR models based on reduced number of structurally similar compounds. This

MA

approach improves the goodness-of-fit and allows selecting the molecular descriptors describing mechanism of the predicted endpoint for the specific groups of ionic liquids. The

ED

second approach (global-QSAR) used a wider, more structurally diversified training set,

PT

which may leads to the decrease of the model’s accuracy. Based on the obtained results, we recommend that if the developed QSAR models

CE

fulfil the OECD criteria, the global model should be applied in practice. There is no need to develop several smaller local models, while the obtained residuals values do not differ

AC

significantly, than these employed from global models. What is more, the global model allows predicting the toxicity for hundreds of ILs simultaneously. Such approach is more efficient especially from the economic point of view, since the number of new ILs synthesis as well as identification in the environment increase every day. Moreover, development of the local QSAR models for the particular groups of ILs one-by-one is more time consuming and requires sufficient number of experimental data for specific group to calibration and

ACCEPTED MANUSCRIPT validation the models. Finally, as we demonstrated, the predictive ability of the presented global model could be compare to the local models.

Acknowledgments The authors would like to thank Professor Paola Gramatica and her team for providing

Science

Center

(Poland)

(grant

SC RI PT

QSARINS 2.1 (www.qsar.it). This material is based on research founded by the National no. UMO-2012/05/E/NZ7/01148,

project

CRAB).

Calculations were carried out at the Academic Computer Center in Gdańsk.

NU

References

AC

CE

PT

ED

MA

[1] D.R. MacFarlane, N. Tachikawa, M. Forsyth, J.M. Pringle, P.C. Howlett, G.D. Elliott, J.H. Davis, M. Watanabe, P. Simon, C.A. Angell, Energy applications of ionic liquids, Energ Environ Sci, 7 (2014) 232250. [2] J. Pernak, M. Niemczak, K. Materna, K. Marcinkowska, T. Praczyk, Ionic liquids as herbicides and plant growth regulators, Tetrahedron, 69 (2013) 4665-4669. [3] P. Dominguez de Maria, Recent trends in (ligno)cellulose dissolution using neoteric solvents: switchable, distillable and bio-based ionic liquids, Journal of Chemical Technology & Biotechnology, 89 (2014) 11-18. [4] P. Sun, D.W. Armstrong, Ionic liquids in analytical chemistry, Analytica chimica acta, 661 (2010) 116. [5] R. Jindal, Sablok, A, Preparation and Applications of Room Temperature Ionic Liquids in Organic Synthesis: A Review on Recent Efforts,, Curr. Green Chem. , 2 (2015) 135-155. [6] A. Sosnowska, M. Barycki, A. Gajewicz, M. Bobrowski, S. Freza, P. Skurski, S. Uhl, E. Laux, T. Journot, L. Jeandupeux, H. Keppner, T. Puzyn, Towards the Application of Structure-Property Relationship Modeling in Materials Science: Predicting the Seebeck Coefficient for Ionic Liquid/Redox Couple Systems, Chemphyschem : a European journal of chemical physics and physical chemistry, 17 (2016) 1591-1600. [7] M. Barycki, A. Sosnowska, T. Puzyn, Which structural features stand behind micelization of ionic liquids? Quantitative Structure-Property Relationship studies, Journal of colloid and interface science, 487 (2017) 475-483. [8] M. Grzonkowska, A. Sosnowska, M. Barycki, A. Rybinska, T. Puzyn, How the structure of ionic liquid affects its toxicity to Vibrio fischeri?, Chemosphere, 159 (2016) 199-207. [9] K.M. Docherty, C.F. Kulpa, Toxicity and antimicrobial activity of imidazolium and pyridinium ionic liquids, Green Chem, 7 (2005) 185-189. [10] E. Regulation (EC) No 1907/2006 of the European Parliament and of the Council of 18 December 2006 concerming the Registration, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, and amemding Directive 1999/45/EC and repealing Council Regulation (EEC) No 793/93 and Commission Regulation (EC) No 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC. Official Journal of the European Union, L 396/1, 2006.

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

MA

NU

SC RI PT

[11] R. Todeschini, Consonni, V., Molecular descriptors for chemoinformatics, WILEY-VCH:, Weinheim: Germany, 2009. [12] T. Oberg, T. Liu, Global and local PLS regression models to predict vapor pressure, Qsar Comb Sci, 27 (2008) 273-279. [13] E.A. Helgee, L. Carlsson, S. Boyer, U. Norinder, Evaluation of Quantitative Structure-Activity Relationship Modeling Strategies: Local and Global Models, Journal of chemical information and modeling, 50 (2010) 677-689. [14] T. Puzyn, A. Gajewicz, A. Rybacka, M. Haranczyk, Global versus local QSPR models for persistent organic pollutants: balancing between predictivity and economy, Struct Chem, 22 (2011) 873-884. [15] B.L. Lei, Y.M. Ma, J.Z. Li, H.X. Liu, X.J. Yao, P. Gramatica, Prediction of the adsorption capability onto activated carbon of a large data set of chemicals by local lazy regression method, Atmos Environ, 44 (2010) 2954-2960. [16] J. Ranke, K. Molter, F. Stock, U. Bottin-Weber, J. Poczobutt, J. Hoffmann, B. Ondruschka, J. Filser, B. Jastorff, Biological effects of imidazolium ionic liquids with varying chain lengths in acute Vibrio fischeri and WST-1 cell viability assays, Ecotox Environ Safe, 58 (2004) 396-404. [17] J. Ranke, S. Stolte, R. Stormann, J. Arning, B. Jastorff, Design of sustainable chemical products - The example of ionic liquids, Chem Rev, 107 (2007) 2183-2206. [18] J.S. Torrecilla, J. Garcia, E. Rojo, F. Rodriguez, Estimation of toxicity of ionic liquids in Leukemia Rat Cell Line and Acetylcholinesterase enzyme by principal component analysis, neural networks and multiple lineal regressions, Journal of hazardous materials, 164 (2009) 182-194. [19] S. Stolte, M. Matzke, J. Arning, A. Boschen, W.R. Pitner, U. Welz-Biermann, B. Jastorff, J. Ranke, Effects of different head groups and functionalised side chains on the aquatic toxicity of ionic liquids, Green Chem, 9 (2007) 1170-1179. [20] M. Stasiewicz, E. Mulkiewicz, R. Tomczak-Wandzel, J. Kumirska, E.M. Siedlecka, M. Golebiowski, J. Gajdus, M. Czerwicka, P. Stepnowski, Assessing toxicity and biodegradation of novel, environmentally benign ionic liquids (1-alkoxymethyl-3-hydroxypyridinium chloride, saccharinate and acesulfamates) on cellular and molecular level, Ecotox Environ Safe, 71 (2008) 157-165. [21] M.H. Fatemi, P. Izadiyan, Cytotoxicity estimation of ionic liquids based on their effective structural features, Chemosphere, 84 (2011) 553-563. [22] J. Pernak, N. Borucka, F. Walkiewicz, B. Markiewicz, P. Fochtman, S. Stolte, S. Steudte, P. Stepnowski, Synthesis, toxicity, biodegradability and physicochemical properties of 4-benzyl-4methylmorpholinium-based ionic liquids, Green Chem, 13 (2011) 2901-2910. [23] S. Stolte, J. Arning, U. Bottin-Weber, M. Matzke, F. Stock, K. Thiele, M. Uerdingen, U. WelzBiermann, B. Jastorff, J. Ranke, Anion effects on the cytotoxicity of ionic liquids, Green Chem, 8 (2006) 621-629. [24] J. Ranke, A. Muller, U. Bottin-Weber, F. Stock, S. Stolte, J. Arning, R. Stormann, B. Jastorff, Lipophilicity parameters for ionic liquid cations and their correlation to in vitro cytotoxicity, Ecotox Environ Safe, 67 (2007) 430-438. [25] B. Peric, J. Sierra, E. Marti, R. Cruanas, M.A. Garau, J. Arning, U. Bottin-Weber, S. Stolte, (Eco)toxicity and biodegradability of selected protic and aprotic ionic liquids, Journal of hazardous materials, 261 (2013) 99-105. [26] R.G. Brereton, Chemometrics for pattern recognition, Wiley2009. [27] A.C.D. Inc, Advanced Chemistry Development Inc: Toronto, , Toronto, Canada, 2014. [28] J.J.P. Steward, Stewart Computational Chemistry: Colorado Springs, CO,, USA, 20012. [29] A. Rybinska, A. Sosnowska, M. Barycki, T. Puzyn, Geometry optimization method versus predictive ability in QSPR modeling for ionic liquids, Journal of computer-aided molecular design, 30 (2016) 165176. [30] Talete., Talete: Milano, Italy, 2014. [31] A. Sosnowska, M. Barycki, M. Zaborowska, A. Rybinska, T. Puzyn, Towards designing environmentally safe ionic liquids: the influence of the cation structure, Green Chem, 16 (2014) 47494757. [32] T. Puzyn, A. Mostrag-Szlichtyng, A. Gajewicz, M. Skrzynski, A.P. Worth, Investigating the influence of data splitting on the predictive ability of QSAR/QSPR models, Struct Chem, 22 (2011) 795-804.

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

MA

NU

SC RI PT

[33] M. Hewitt, M.T.D. Cronin, J.C. Madden, P.H. Rowe, C. Johnson, A. Obi, S.J. Enoch, Consensus QSAR models: Do the benefits outweigh the complexity?, Journal of chemical information and modeling, 47 (2007) 1460-1468. [34] OECD, OECD Principles for the validation, for regulatory purposes, of (Quantittative) Structure Activity Relationship models, 37thJoint Meeting of the Chemicals Committee and Working Party on Chemicals, Pesticides and Biotechnology, Organisation for Economic Co-Operation and Development, Paris, France, OECD. [35] P. Gramatica, N. Chirico, E. Papa, S. Cassani, S. Kovarich, QSARINS: A new software for the development, analysis, and validation of QSAR MLR models, J Comput Chem, 34 (2013) 2121-2132. [36] G. Schuurmann, R.U. Ebert, J.W. Chen, B. Wang, R. Kuhne, External Validation and Prediction Employing the Predictive Squared Correlation Coefficient - Test Set Activity Mean vs Training Set Activity Mean, Journal of chemical information and modeling, 48 (2008) 2140-2145. [37] V. Consonni, D. Ballabio, R. Todeschini, Comments on the Definition of the Q(2) Parameter for QSAR Validation, Journal of chemical information and modeling, 49 (2009) 1669-1678. [38] V. Consonni, D. Ballabio, R. Todeschini, Evaluation of model predictive ability by external validation techniques, J Chemometr, 24 (2010) 194-201. [39] K. Roy, I. Mitra, S. Kar, P.K. Ojha, R.N. Das, H. Kabir, Comparative Studies on Some Metrics for External Validation of QSPR Models, Journal of chemical information and modeling, 52 (2012) 396408. [40] N. Chirico, P. Gramatica, Real External Predictivity of QSAR Models: How To Evaluate It? Comparison of Different Validation Criteria and Proposal of Using the Concordance Correlation Coefficient, Journal of chemical information and modeling, 51 (2011) 2320-2335. [41] R. Todeschini, D. Ballabio, F. Grisoni, Beware of Unreliable Q(2)! A Comparative Study of Regression Metrics for Predictivity Assessment of QSAR Models, Journal of chemical information and modeling, 56 (2016) 1905-1913. [42] P. Gramatica, Principles of QSAR models validation: internal and external., QSAR and Combinatorial Science, 26 (2007) 694-701. [43] K. Roy, S. Kar, P. Ambure, On a simple approach for determining applicability domain of QSAR models, Chemometr Intell Lab, 145 (2015) 22-29. [44] S.E. Wold, Statistical validation of QSAR Results, Wiley-VCH, Weinheim, 1995. [45] Y.S. Zhao, J.H. Zhao, Y. Huang, Q. Zhou, X.P. Zhang, S.J. Zhang, Toxicity of ionic liquids: Database and prediction via quantitative structure-activity relationship method, Journal of hazardous materials, 278 (2014) 320-329. [46] P. Gramatica, M. Corradi, V. Consonni, Modelling and prediction of soil sorption coefficients of non-ionic organic pesticides by molecular descriptors, Chemosphere, 41 (2000) 763-777. [47] J.G. Topliss, R.J. Costello, Change correlations in structure-activity studies using multiple regression analysis, Journal of medicinal chemistry, 15 (1972) 1066-1068. [48] R.W. Kennard, L.A. Stone, Computer Aided Design of Experiments, Technometrics, 11 (1969) 137148.

ACCEPTED MANUSCRIPT

Highlights

The predictive ability of global versus local QSAR models for ionic liquids are verified. QSAR models for predicting ionic liquids toxicity against IPC-81 leukemia rat cell line are proposed. The predictive abilities of developed local and global models are similar.

AC

CE

PT

ED

MA

NU

SC RI PT

Whenever global model fulfills all quality recommendations by OECD, it should be applied in practice.