Fluid Phase Equilibria 397 (2015) 44–49

Journal homepage: www.elsevier.com/locate/fluid

Short communication

CORAL: Model for octanol/water partition coefficient

Andrey A. Toropov *, Alla P. Toropova, Claudia Ileana Cappelli, Emilio Benfenati

IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Laboratory of Environmental Chemistry and Toxicology, Via La Masa 19, 20156 Milano, Italy

Article info

Article history: Received 27 February 2015; received in revised form 27 March 2015; accepted 30 March 2015; available online 2 April 2015

Keywords: QSPR; Monte Carlo method; Octanol/water partition coefficient; CORAL software

Abstract

A new model for a large dataset on the octanol/water partition coefficient (log P, n = 10005) has been built with the CORAL software (http://www.insilico.eu/coral). The simplified molecular input-line entry system (SMILES) is used to represent the molecular structure. Building the model includes three phases, beginning with (i) the distribution of the available data into the sub-training set (n = 2634), calibration set (n = 2528), test set (n = 2410), and validation set (n = 2433). No data on compounds assigned to the validation set are involved in building the model: these compounds are used solely to check the predictive potential of the built model. The statistical characteristics of the model for the external validation set are the following: r² = 0.89, s = 0.59. The mechanistic interpretation of the model is discussed. © 2015 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +39 2 39014595; fax: +39 2 39014735.
E-mail address: [email protected] (A.A. Toropov).
http://dx.doi.org/10.1016/j.fluid.2015.03.051
0378-3812/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

The octanol/water partition coefficient (log P) is an important ecological and biochemical indicator [1]. Quantitative structure–property/activity relationships (QSPRs/QSARs) are a tool to predict various physicochemical [2–6], toxicological [7–14], and therapeutic [15] endpoints. QSPR/QSAR analyses can be based on the representation of the molecular structure by molecular graphs [6,10] and/or by the simplified molecular input-line entry system (SMILES) [16–18]. The CORAL software [19] is used to build QSPR models for various endpoints, with SMILES as the representation of the molecular structure. The aim of the present study is to assess the CORAL software [19] as a tool to model log P for a large set of diverse organic compounds.

2. Method

2.1. Data set

The numerical data on the octanol/water partition coefficient (log P) were taken from available sources [20–22]. The collection of compounds (n = 10005) was randomly distributed into the sub-training set (2634, 26.3%), calibration set (2528, 25.3%), test set (2410, 24.1%), and validation set (2433, 24.3%). No information on the compounds of the validation set is used to build the model for log P.

2.2. Optimal descriptor

The SMILES-based optimal descriptor described in the literature [23] was used to build the log P model. The optimal descriptor for each compound is calculated with so-called correlation weights of different molecular features extracted from the SMILES. There are local attributes of SMILES (in fact, these are fragments of molecules) and global attributes of SMILES (these are characteristics of the molecule as a whole). The optimal descriptors are calculated from the numerical data on the correlation weights using the following formula:

$$
\mathrm{DCW}(\mathrm{SMILES}, T, N) = \sum \mathrm{CW}(S_k) + \sum \mathrm{CW}(SS_k) + \sum \mathrm{CW}(SSS_k) + \mathrm{CW}(\mathrm{BOND}) + \mathrm{CW}(\mathrm{HALO}) + \mathrm{CW}(\mathrm{NOSP}) + \mathrm{CW}(\mathrm{PAIR}) \tag{1}
$$

where $S_k$, $SS_k$, $SSS_k$, BOND, HALO, NOSP, and PAIR are features of the molecule, each represented by a line of twelve symbols (Fig. 1). BOND reflects the presence or absence in the molecule of double (=), triple (#), and stereochemical (@, @@) bonds; HALO reflects the presence or absence of fluorine (F), chlorine (Cl), bromine (Br), and iodine (I) atoms; NOSP reflects the presence or absence of nitrogen (N), oxygen (O), sulfur (S), and phosphorus (P) atoms; and PAIR reflects the presence of pairs of the above-listed features (i.e., the presence of a pair from the list F, Cl, Br, I, N, O, S, P, =, #, and @). To avoid treating XY and YX (or XYZ and ZYX) as different molecular fragments, the symbols of the attributes “XY” and “XYZ” are ordered according to their ASCII codes.

Fig. 1. An example of the calculation of DCW(SMILES, T*, N*), which is used in Eq. (4).

The numerical data on the correlation weights of the above-mentioned molecular features are calculated by an optimization procedure based on the Monte Carlo method. The correlation coefficients between DCW(SMILES, T, N) and log P for the sub-training set and the calibration set are the basis for the target function (TF):

$$
\mathrm{TF} = A \times (r + r') - B \times \mathrm{abs}(r - r') \tag{2}
$$

where r and r′ are the correlation coefficients for the sub-training set and the calibration set, respectively; A = 0.1 and B = 0.01 are empirical constants. The TF is a mathematical function of the threshold and the number of epochs of the Monte Carlo optimization. The threshold T (1, 2, ..., m) is a coefficient that defines two categories of SMILES attributes: (i) rare attributes, which should be neglected; and (ii) not rare attributes, which should be involved in building up a model. For instance, if an attribute X occurs in the sub-training set two times and T = 3, then X should be classified as rare; but if T = 2, then X should be classified as not rare. N is the number of epochs of the optimization procedure. Fig. 2 shows the influence of T and N upon the correlation coefficients for the sub-training set, the calibration set, and the test set: the values T* and N* (Fig. 2) should be considered preferable from the point of view of the predictive potential of the model. The T* and N* can be defined from preliminary computational experiments [23]. The final CORAL model is the following:

$$
\log P = C_0 + C_1 \times \mathrm{DCW}(\mathrm{SMILES}, T^*, N^*) \tag{3}
$$

Thus, the four sub-sets have special roles: the sub-training set is the “builder” of the model; the calibration set is the “critic” of the model; the test set is the “preliminary” estimator of the predictive potential of the model; finally, the validation set is the “final” estimator of the predictive potential of the model.

Fig. 2. Graphical representation of the influence of the parameters T and N upon the correlation coefficient r for the sub-training set, calibration set, and test set; the logic of selection of the preferable T* and N*.

Table 1 Correlation weights for molecular features [22] which are promoters of increase or decrease for octanol/water partition coefficient (log P).

| SMILES attribute | Correlation weight | Comment |
|---|---|---|
| Promoters of log P increase | | |
| C........... | 4.74700 | Aromatic carbon |
| =........... | 0.50200 | Double bond |
| O........... | 1.75011 | Oxygen |
| (........... | 0.50100 | Branching |
| HALO00100000 | 4.24900 | Bromine |
| HALO01000000 | 1.00000 | Chlorine |
| HALO10000000 | 2.75500 | Fluorine |
| Promoters of log P decrease | | |
| ++++N---O=== | -1.24900 | Oxygen and nitrogen in molecule |
| 1........... | -2.74700 | Cycles in molecule |
| 2........... | -1.24600 | – |
| 3........... | -2.25200 | – |
| 4........... | -4.25000 | – |
| 5........... | -1.00000 | – |
| 6........... | -3.50000 | – |
| 7........... | -0.24500 | – |
| N...#....... | -1.74900 | Nitrogen with triple bond |
| I...(....... | -2.25300 | Iodine (in organic molecule) |
| ++++O---P=== | -0.49700 | Oxygen and phosphorus in molecule |

Table 2 Examples of molecular structures which contain stable promoters of increase or decrease of octanol/water partition coefficient (log P).

| CAS | log P | Comments |
|---|---|---|
| 117-84-0 | 8.10 | Aromatic carbon, double bonds, and oxygen |
| 132213-91-3 | 10.89 | Aromatic carbon, oxygen, branching of molecular skeleton, double bond, and cycles |
| 83992-73-8 | 8.07 | Aromatic carbon, oxygen, and chlorine |
| 305-62-4 | 4.64 | Oxygen together with nitrogen |
| 66108-95-0 | 3.05 | Oxygen, nitrogen, iodine, aromatic system, and double bond |
| 1071-83-6 | 4.00 | Oxygen, nitrogen, and phosphorus |
| 540-61-4 | 1.37 | Triple bond for nitrogen |
| 60-92-4 | 2.00 | Oxygen, nitrogen, phosphorus, and cycles |
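As a rough illustration of Section 2.2, the local part of the descriptor in Eq. (1) (the sums over $S_k$, $SS_k$, and $SSS_k$), the rarity threshold T, the target function of Eq. (2), and the linear model of Eq. (3) can be sketched in Python. This is a minimal sketch under stated assumptions, not the CORAL implementation: the tokenizer, the `S:`/`SS:`/`SSS:` key prefixes, all function names, the choice to count an attribute once per SMILES when applying the threshold, and the weights in the usage lines are hypothetical. The global attributes BOND, HALO, NOSP, and PAIR are omitted for brevity.

```python
from collections import Counter

# Two-character SMILES symbols kept as single tokens (assumption of this sketch).
TWO_CHAR = {"Cl", "Br", "@@"}


def tokenize(smiles):
    """Split a SMILES string into symbols, keeping Cl, Br and @@ together."""
    tokens, i = [], 0
    while i < len(smiles):
        step = 2 if smiles[i:i + 2] in TWO_CHAR else 1
        tokens.append(smiles[i:i + step])
        i += step
    return tokens


def canonical(fragment):
    """Pick one orientation so that XY/YX and XYZ/ZYX map to the same attribute."""
    fragment = tuple(fragment)
    return min(fragment, fragment[::-1])  # strings compare by character code


def local_attributes(smiles):
    """Return the S_k, SS_k and SSS_k attributes of one SMILES string."""
    t = tokenize(smiles)
    attrs = ["S:" + a for a in t]
    attrs += ["SS:" + "".join(canonical(t[i:i + 2])) for i in range(len(t) - 1)]
    attrs += ["SSS:" + "".join(canonical(t[i:i + 3])) for i in range(len(t) - 2)]
    return attrs


def not_rare(sub_training_smiles, threshold):
    """Attributes present in at least `threshold` sub-training SMILES (assumed counting rule)."""
    counts = Counter()
    for s in sub_training_smiles:
        counts.update(set(local_attributes(s)))
    return {a for a, n in counts.items() if n >= threshold}


def dcw(smiles, weights):
    """Sum the correlation weights; attributes without a weight (rare) contribute 0."""
    return sum(weights.get(a, 0.0) for a in local_attributes(smiles))


def target_function(r_sub, r_cal, a=0.1, b=0.01):
    """Eq. (2): reward high correlations, penalise the gap between the two sets."""
    return a * (r_sub + r_cal) - b * abs(r_sub - r_cal)


def predict_log_p(dcw_value, c0, c1):
    """Eq. (3): linear model on the descriptor."""
    return c0 + c1 * dcw_value


# Usage with made-up correlation weights (hypothetical, for illustration only).
weights = {"S:C": 1.0, "S:O": -0.5, "SS:=O": 0.25}
print(local_attributes("C=O"))
print(dcw("C=O", weights))
```

In the procedure described above, the Monte Carlo optimization would repeatedly perturb the entries of `weights` and keep changes that raise `target_function`; that loop is left out here because the text does not specify its details.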

3. Results and discussion

The square of the correlation coefficient (r²), the cross-validated correlation coefficient (q²), the standard error of estimation (s), and the Fisher F-ratio (F) were used as statistical criteria of quality for the CORAL models. The statistical quality of the model calculated with the CORAL software for all 10005 compounds is the following (T* = 1 and N* = 7):

$$
\log P = 0.2872\,(\pm 0.0005) + 0.04890\,(\pm 0.00001) \times \mathrm{DCW}(\mathrm{SMILES}, 1, 7) \tag{4}
$$

n = 2634, r² = 0.8806, q² = 0.8804, s = 0.631, F = 19415 (sub-training set)
n = 2528, r² = 0.8873, s = 0.624 (calibration set)
n = 2410, r² = 0.8691, s = 0.654 (test set)
n = 2433, r² = 0.8667, s = 0.592 (validation set)

For the sub-set of compounds which contain carbon, nitrogen, and oxygen, the approach gives the following model (T* = 1 and N* = 14):

$$
\log P = 1.5506\,(\pm 0.0026) + 0.08721\,(\pm 0.00005) \times \mathrm{DCW}(\mathrm{SMILES}, 1, 14) \tag{5}
$$

n = 367, r² = 0.9599, q² = 0.9594, s = 0.405, F = 8730 (sub-training set)
n = 320, r² = 0.9598, s = 0.410 (calibration set)
n = 328, r² = 0.9260, s = 0.511 (test set)
n = 343, r² = 0.9380, s = 0.513 (validation set)

For the sub-set of compounds which contain carbon and nitrogen, the approach gives the following model (T* = 1 and N* = 5):

$$
\log P = 0.6750\,(\pm 0.0048) + 0.09109\,(\pm 0.00009) \times \mathrm{DCW}(\mathrm{SMILES}, 1, 5) \tag{6}
$$

n = 169, r² = 0.9694, q² = 0.9688, s = 0.396, F = 5295 (sub-training set)
n = 145, r² = 0.9771, s = 0.334 (calibration set)
n = 123, r² = 0.9594, s = 0.423 (test set)
n = 146, r² = 0.9519, s = 0.497 (validation set)

For the sub-set of compounds which contain carbon and oxygen, the approach gives the following model (T* = 1 and N* = 6):

$$
\log P = 0.03608\,(\pm 0.00443) + 0.07937\,(\pm 0.00008) \times \mathrm{DCW}(\mathrm{SMILES}, 1, 6) \tag{7}
$$

n = 134, r² = 0.9830, q² = 0.9824, s = 0.323, F = 7619 (sub-training set)
n = 119, r² = 0.9827, s = 0.353 (calibration set)
n = 129, r² = 0.9641, s = 0.456 (test set)
n = 138, r² = 0.9609, s = 0.507 (validation set)

For the sub-set of compounds which contain solely carbon, the approach gives the following model (T* = 1 and N* = 2):

$$
\log P = 0.6001\,(\pm 0.0105) + 0.07021\,(\pm 0.00015) \times \mathrm{DCW}(\mathrm{SMILES}, 1, 2) \tag{8}
$$

n = 49, r² = 0.9907, q² = 0.9899, s = 0.192, F = 4983 (sub-training set)
n = 41, r² = 0.9771, s = 0.257 (calibration set)
n = 31, r² = 0.9770, s = 0.306 (test set)
n = 41, r² = 0.9538, s = 0.373 (validation set)

Thus, in the case of the widest domain of applicability (all available compounds, which combine carbon, oxygen, nitrogen, sulfur, phosphorus, chlorine, bromine, etc.), the statistical quality of the model is satisfactory (validation set: r² = 0.8667, s = 0.592). However, narrowing the domain of applicability is accompanied by a considerable improvement in the accuracy of the CORAL models (validation-set r² between 0.9380 and 0.9609 for the four sub-sets). This makes it possible to define a preferable CORAL model of log P according to the chemical genesis of the substances selected for specific physicochemical or biochemical roles. Having the correlation weights [23] of various molecular features obtained in a group of runs of the Monte Carlo optimization, one can select promoters of increase (i.e., stable positive values of correlation weights) and of decrease (i.e., stable negative values of correlation weights) for an endpoint. Table 1 lists the correlation weights for molecular features extracted from SMILES which are promoters of increase or decrease for log P. Table 2 gives examples of compounds with the above molecular features and numerical log P data. Table 3 compares the statistical quality of log P models from the literature with that of the models suggested in this work. The comparison indicates that the CORAL models are comparable with models from the literature. The Supplementary materials section contains technical details related to the CORAL models.

Table 3 Comparison of the statistical quality of different QSPR models for octanol/water partition coefficient (log P) which are suggested in the literature (n, r², and s are the number of substances in the set, the square of the correlation coefficient, and the root mean square error, respectively).

| Reference or comments | Training set* n | Training set* r² | Training set* s | Validation set n | Validation set r² | Validation set s |
|---|---|---|---|---|---|---|
| [14] | 22 | 0.9306 | 0.601 | 12 | 0.9776 | 0.727 |
| [11] | 69 | 0.9872 | 0.156 | 70 | 0.9841 | 0.179 |
| [8] | 69 | 0.9953 | 0.096 | 70 | 0.9930 | 0.119 |
| [20] | 2524 | 0.87 | 0.69 | – | – | – |
| [7] | 2569 | 0.49 | 1.07 | – | – | – |
| [7], log D (pH-dependent log P) | 8122 | 0.23 | 1.67 | – | – | – |
| CORAL models built up in this study: | | | | | | |
| All 10005 compounds | 7572 | 0.8695 | 0.64 | 2433 | 0.8667 | 0.59 |
| Compounds contain only carbon, oxygen, and nitrogen | 1015 | 0.9499 | 0.44 | 343 | 0.9380 | 0.51 |
| Compounds contain only carbon and nitrogen | 437 | 0.9689 | 0.38 | 146 | 0.9519 | 0.50 |
| Compounds contain only carbon and oxygen | 382 | 0.9770 | 0.38 | 138 | 0.9609 | 0.51 |
| Compounds contain only carbon | 121 | 0.9820 | 0.25 | 41 | 0.9538 | 0.37 |

* The training set for the CORAL models is the combination of the sub-training set, the calibration set, and the test set. These sets are “visible” while a model is built, whereas the validation set is “invisible”.

4. Conclusions

The CORAL software gives a satisfactory model of the octanol/water partition coefficient (log P) for a large data set of diverse organic compounds (n = 10005). This model has a mechanistic interpretation, defined in terms of SMILES attributes which are stable promoters of increase or decrease of log P. Applying the approach to selected groups of compounds according to their chemical genesis leads to an improvement in the predictive potential of the CORAL models. Thus, the CORAL software (http://www.insilico.eu/coral) can be a tool for the theoretical prediction of the octanol/water partition coefficient.

Acknowledgement

The authors are grateful for the contribution of the EU project PROSIL funded under the LIFE program (project LIFE12 ENV/IT/000154). We also thank CALEIDOS (project number LIFE11-INV/IT 00295) and project No. FKZ 3110 64 405 “Prioritization of chemicals: a methodology embracing PBT parameters into a unified strategy (PROMETHEUS)”, funded by the Environmental Research of the Federal Ministry for the Environment, Nature Conservation and Nuclear Safety, for the financial support.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.fluid.2015.03.051.

References

[1] K.M. Yerramsetty, B.J. Neely, K.A.M. Gasem, A non-linear structure–property model for octanol–water partition coefficient, Fluid Phase Equilib. 332 (2012) 85–93.
[2] Y. Liu, X. Li, L. Wang, H. Sun, Prediction of partition coefficients and infinite dilution activity coefficients of 1-ethylpropylamine and 3-methyl-1-pentanol using force field methods, Fluid Phase Equilib. 285 (2009) 19–23.
[3] V. Karthick, A.P. Toropova, A.A. Toropov, K. Ramanathan, Discovery of potential, non-toxic influenza virus inhibitor by computational techniques, Mol. Inf. 33 (2014) 559–565.
[4] R. Tabaraki, A. Toulabi, Solubility modeling in three supercritical carbon dioxide ethane and trifluoromethane fluids by one set of molecular descriptors, Fluid Phase Equilib. 383 (2014) 108–114.
[5] A.P. Toropova, A.A. Toropov, S.E. Martyanov, E. Benfenati, G. Gini, D. Leszczynska, J. Leszczynski, CORAL: Monte Carlo method as a tool for the prediction of the bioconcentration factor of industrial pollutants, Mol. Inf. 32 (2013) 145–154.
[6] S. Jullian, X. Longaygue, From conventional to more sustainable fuels. Trends and needs in research in the thermodynamics area, Fluid Phase Equilib. 362 (2014) 192–195.
[7] H. Zhu, W. Guo, Z. Shen, Q. Tang, W. Ji, L. Jia, QSAR models for degradation of organic pollutants in ozonation process under acidic condition, Chemosphere 119 (2015) 65–71.
[8] I. Raska Jr., A.A. Toropov, Comparison of QSPR models of octanol/water partition coefficient for vitamins and non vitamins, Eur. J. Med. Chem. 41 (2006) 1271–1278.
[9] K. Roy, I. Sanyal, G. Ghosh, QSPR of n-octanol/water partition coefficient of nonionic organic compounds using extended topochemical atom (ETA) indices, QSAR Comb. Sci. 26 (2007) 629–646.
[10] A.A. Toropov, A.P. Toropova, I. Raska Jr., QSPR modeling of octanol/water partition coefficient for vitamins by optimal descriptors calculated with SMILES, Eur. J. Med. Chem. 43 (2008) 714–740.
[11] A.A. Toropov, A.P. Toropova, E. Benfenati, QSPR modelling of the octanol/water partition coefficient of organometallic substances by optimal SMILES-based descriptors, Cent. Eur. J. Chem. 7 (2009) 846–856.
[12] A.A. Toropov, A.P. Toropova, E. Benfenati, QSPR modeling of octanol water partition coefficient of platinum complexes by InChI-based optimal descriptors, J. Math. Chem. 46 (2009) 1060–1073.
[13] A.A. Toropov, A.P. Toropova, E. Benfenati, QSPR modelling of normal boiling points and octanol/water partition coefficient for acyclic and cyclic hydrocarbons using SMILES-based optimal descriptors, Cent. Eur. J. Chem. 8 (2010) 1047–1052.
[14] A.A. Toropov, A.P. Toropova, I. Raska, E. Benfenati, QSPR modeling of octanol/water partition coefficient of antineoplastic agents by balance of correlations, Eur. J. Med. Chem. 45 (2010) 1639–1647.
[15] D. Saaber, S. Wollenhaupt, K. Baumann, S. Reichl, Recent progress in tight junction modulation for improving bioavailability, Expert Opin. Drug Discovery 9 (2014) 367–381.
[16] D. Weininger, SMILES. 3. DEPICT. Graphical depiction of chemical structures, J. Chem. Inf. Comput. Sci. 30 (1990) 237–243.
[17] D. Weininger, A. Weininger, J.L. Weininger, SMILES. 2. Algorithm for generation of unique SMILES notation, J. Chem. Inf. Comput. Sci. 29 (1989) 97–101.
[18] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 28 (1988) 31–36.
[19] CORAL (2014) http://www.insilico.eu/CORAL (accessed 07.01.15).
[20] VEGA (2014) http://www.vega-qsar.eu/index.php (accessed 07.01.15).
[21] EPI suite (2014) http://www.epa.gov/opptintr/exposure/pubs/episuitedl.htm (accessed 07.01.15).
[22] U.S. National Library of Medicine, http://toxnet.nlm.nih.gov/ (accessed 07.01.15).
[23] A.P. Toropova, A.A. Toropov, E. Benfenati, G. Gini, D. Leszczynska, J. Leszczynski, CORAL: quantitative structure–activity relationship models for estimating toxicity of organic compounds in rats, J. Comput. Chem. 32 (2011) 2727–2733.