Accepted Manuscript Title: In Silico Prediction of the -Cyclodextrin Complexation Based on Monte Carlo Method Author: Aleksandar M. Veselinovi´c Jovana B. Veselinovi´c Andrey A. Toropov Alla P. Toropova Goran M. Nikoli´c PII: DOI: Reference:
S0378-5173(15)30175-7 http://dx.doi.org/doi:10.1016/j.ijpharm.2015.08.078 IJP 15161
To appear in:
International Journal of Pharmaceutics
Received date: Accepted date:
20-4-2015 24-8-2015
Please cite this article as: Veselinovi´c, Aleksandar M., Veselinovi´c, Jovana B., Toropov, Andrey A., Toropova, Alla P., Nikoli´c, Goran M., In Silico Prediction of the rmbetaCyclodextrin Complexation Based on Monte Carlo Method.International Journal of Pharmaceutics http://dx.doi.org/10.1016/j.ijpharm.2015.08.078 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Research paper Title In Silico Prediction of the β-Cyclodextrin Complexation Based on Monte Carlo Method Aleksandar M. Veselinović1*, Jovana B. Veselinović1, Andrey A. Toropov2, Alla P. 5
Toropova2, Goran M. Nikolić1 1
Faculty of Medicine, Department of Chemistry, University of Niš, Niš, Serbia
2
IRCCS- Istituto di Ricerche Farmacologiche Mario Negri, Milano, Italy
*
Corresponding author:
10
Aleksandar M. Veselinović, Faculty of Medicine, Department of Chemistry, University of Niš, Bulevar Dr Zorana Đinđića 81, 18000 Niš, Serbia Fax: +381 18 4238770; Phone: +381 18 4570029 E-mail:
[email protected]
15
1
20
Graphical abstract
Abstract In this study QSPR models were developed to predict the complexation of structurally diverse compounds with β-cyclodextrin based on SMILES notation optimal descriptors using 25
Monte Carlo method. The predictive potential of the applied approach was tested with three random splits into the sub-training, calibration, test and validation sets and with different statistical methods. Obtained results demonstrate that Monte Carlo method based modeling is a very promising computational method in the QSPR studies for predicting the complexation of structurally diverse compounds with β-cyclodextrin. The SMILES attributes (structural
30
features both local and global), defined as molecular fragments, which are promoters of the increase/decrease of molecular binding constants were identified. These structural features were correlated to the complexation process and their identification helped to improve the understanding for the complexation mechanisms of the host molecules.
35
Keywords: QSAR/QSPR; SMILES; β-cyclodextrin; CORAL software; Monte Carlo method
2
1. Introduction 40
Cyclodextrins (CDs) are a family of macrocyclic oligosaccharides which includes three wellknown members, α-CD (six glucose units), β-CD (seven units) and γ-CD (eight units), as well as several other less well-known oligosaccharides (1). CDs have attracted much interest pharmaceutical industries, since they are capable to form host–guest complexes with hydrophobic molecules. These interactions greatly modify physical and chemical properties
45
of guest molecules (2,3). CD complexes have important application for drug delivery because they can overcome various undesirable properties of drug molecules via inclusion complexation leading to enhanced absorption performance, improved solubility, modified release kinetics of drugs and reduced side-effects (4-6). One of important feature of CD complexes is their application in site-specific drug delivery (7).
50
Molecule’s cyclodextrin complexation capability can be determined with the dissociation equilibrium of the complexation process and the binding constant (K) of the complex (8-10). The binding constant can be used for estimation of the binding strength between the guest and the host molecules and the disassociation/association stability of the complex. Further, binding constant is essential for evaluating the influence of cyclodextrins on in vivo
55
absorption and bioavailability of drugs (11,12). Cyclodextrin binding constants of guest molecules have been determined with application of many different methods like spectroscopic methods, fluorometry, nuclear magnetic resonance, phase-solubility technique, high-pressure liquid chromatography, surface tension methods and electrochemistry (13). However, these experimental methods are often difficult, require complex apparatuses and
60
time-consuming because of the low aqueous solubility of the guest molecules. For this reasons it is necessary to develop theoretical models with sufficient accuracy for predicting the binding constant values of CD inclusion complexes.
3
The quantitative structure–property (activity) relationship (QSPR/QSAR) quantitatively correlates molecular structure with well defined properties or activities. The main advantage 65
of QSPR/QSAR modeling is that only the knowledge of molecular structure is required for predicting desired activity or property. The QSPR/QSAR modeling can be considered as a useful alternative approach for estimating the binding constant and thermodynamic stability of cyclodextrin inclusion complexes. There are some published studies. Previous studies showed that QSPR/QSAR modeling can be successfully applied for the prediction of the
70
binding constant values of CD inclusion complexes (14-24). QSPR modeling often uses descriptors calculated on the basis of the molecular graph (22-28). The simplified molecular input-line entry system (SMILES) can be considered as an attractive alternative for the representation of the molecular structure by graph (29-33). On the basis of molecule’s SMILES notation optimal descriptors can be calculated where SMILES notation optimal
75
descriptor is a molecular descriptor which depends both on the molecular structure and the property under analysis, but does not explicitly depends on details from the molecular 3D geometry. The development of QSPR approaches, where SMILES is the representation of the molecular structure and it is used for developing optimal descriptors is an attractive direction of research work in the field of QSPR theory and applications because every QSPR model
80
that includes geometry-dependent molecular descriptors usually includes a relatively difficult calculation of the optimum molecular geometry, have high computational costs and long developing time (34,35). QSPR/QSAR modeling based on SMILES notation optimal descriptors has been successfully applied for various molecules properties/activities (36-40). The main aim of the presented study was to develop some new and efficient QSPR models
85
for predicting the binding properties of β-CD based on SMILES notation optimal descriptors. Further aims of this study were to identify molecular structures and molecular fragments that have influence on β-CD complexation. This information may be useful for understanding
4
their mechanism of β-CD binding and for anticipating the complexation potencies of molecules with unknown binding constants. 90
2. Materials and methods 2.1. Data set The data set of β-CD complex stability constants (logK) was taken from literature (17). Data set was comprised with various organic molecules that include acids, alcohols, aldehydes, anilines, aromatic hydrocarbons, phenols, ethers, ketones, esters, nitriles, halogenated
95
compounds, heterocycles, nitro, sulfur and steroids and barbital compounds. Molecular structure of studied compounds was represented with SMILES notation, which was generated with the ACD/ChemSketch program (ACD/ChemSketch v. 11.0). Table S1 presents a complete list of the chemical substances with the experimentally determined logK values. All three splits are generated according to the following principles: a) the range of the endpoint is
100
approximately the same for each sub-set b) the splits are not identical. For test and validation sets 20% of molecules from original data set were used (47 molecules). The rest of data was splited even between sub-training and calibration sets (70 and 69 molecules, respectively). 2.2. Method Molecule structure can be defined with different approaches. One of most convenient is the
105
representation with simplified molecular input-line entry system (SMILES) notation. SMILES notation can be used for defininig of molecular optimal descriptor (DCW) which is further used for QSAR modeling. DCW can be calculated as a function of SMILES notation with Eq. 1: DCW = F(SMILES)
(1)
5
110
There are two types of optimal descriptors defined with molecular SMILES notation. First is a local SMILES attribute, a fragment of SMILES notation which contains one ('C', 'N', '=', etc.) or two symbols ('Cl', 'Br', '@@', etc.) which cannot be examined separately. This SMILES attributes can be defined as SMILES atoms. Local SMILES attributes are also linear combination of two or three SMILES atoms. Not only local SMILES attributes can be used
115
for QSAR modeling but also global SMILES attribute: a charecteristic of molecule in whole, e.g. the presence of nitrogen together with oxygen, presence of double bonds and cycles, the presence of nitrogen and sulphur and absence of oxygen, and others. When both of SMILES based optimal descriptors are used for QSPR modeling DCW is calculated as following (Eq. 2):
120
DCW(T,Nepoch) = zCW(ATOMPAIR) + xCW(NOSP) + yCW(BOND) + tCW(HALO) + αΣCW(Sk) + βΣCW(SSk) + γΣCW(SSSk)
(2)
Coefficients z, x, y, t, α, β and γ can be 1 (yes) or 0 (no). If the value of coefficient is 1 then appropriate SMILES based descriptor is included in model building. However, if the value is 0 then appropriate SMILES notation based descriptor is discarded form model building. 125
Various combinations of their values give the possibility of defining a particular version of the optimal descriptor. Sk is a descriptor related to one symbol (or two that can not be separated) - SMILES atoms. SSk and SSSk are descriptors related to combination of two and three SMILES atoms. ATOMPAIR, BOND, NOSP, HALO are SMILES notation based indices that are used as descriptors (36,40). For building QSPR models presented in this study
130
all SMILES notation based descriptors, both local and global attributes, were took into consideration. Correlation weight (CW) can be defined as a numerical value for each SMILES notation base optimal descriptor calculated with the application of Monte Carlo method. Monte Carlo
6
method solves a problem by generating suitable random numbers and observing how that 135
fraction of numbers obeys some property or properties where CW value is randomly assigned to SMILES based descriptors in each independent Monte Carlo run and for defined endpoint. For obtaining the distribution of an unknown probabilistic entity Monte Carlo method simulations based on iterative algorithms are ran. The calculation of the numerical data on the correlation weights which give maximal value of the correlation coefficient between an
140
endpoint and the optimal descriptor is achived with application of Monte Carlo optimization process. When Monte Carlo method is used for QSPR model building two parameters should be taken into consideration - Threshold (T) and the Number of epochs (Nepoch). Threshold is a parameter used for the classification of various molecular features (both SMILES based molecular fragments and SMILES based indices) calculated from SMILES notation into two
145
categories: a) active (in this case correlation weight is involved in the modeling process); and b) rare (in this case correlation weight is not involved in the modeling process). This process is achieved in the following manner: if a molecular feature (X) defined from SMILES notation of molecules in the training set takes place less than T times, then the X should be removed from the building of the model. Therefore the numerical value that represents this
150
feature (correlation weight of the X, CW(X)) is set as zero and that molecular feature is defined as rare. All other molecular features are active and can be used for model building. Nepoch is the number of epochs of the Monte Carlo optimization which gives the best statistical quality for the training set. When the unlimited number of epochs is applied the above-mentioned Monte Carlo optimization will provide the maximum correlation coefficient
155
for the training set. However, the maximum correlation coefficient between the optimal descriptor and the endpoint for the external test set exists for a certain number of epochs. This amount of epochs is preferable in calculations, because when the number of epochs reaches that value, the obtained model is characterized by a good predictive potential. On the other
7
hand, the increase of T is accompanied by the decrease of the correlation coefficient for the 160
training set, but still, the threshold which provides the maximum correlation coefficient for the test set exists. This threshold is preferable from the practical point of view. In order to prepare a good QSPR model with application of SMILES based optimal descriptor, preferable values of the threshold and the number of epochs of the Monte Carlo optimization (T and Nepoch) must be defined (36-40). The search for most predictive combination of T and
165
Nepoch for QSPR models presented in this study for all splits was performed from values 0–10 for T and 0–60 for Nepoch. After obtaining numerical data on correlation weights from the model with the preferable statistics for the test set, the QSPR model is calculated (by using the training set) with application of linear regression approach (Eq. 3). CORAL software was used for all calculations related to building QSPR models (http://www.insilico.eu/coral).
170
Endpoint = C0 + C1×DCW(Threshold, Nepoch)
(3)
The main goal of any QSPR modeling is to developing of a robust model capable of predicting the properties of new molecules in objective, reliable and precise manner (41). For the assessment of robustness and the reliability of developed QSAR models presented in this study three methods were applied according to published approach: a) internal validation or 175
cross-validation using the training set compounds, b) external validation using the test set compounds and c) data randomization or Y-scrambling (42). Developed QSPR model must have defined applicability domain (AD). AD of a QSPR model is defined as the biological, structural or physico-chemical space, knowledge or information on which the model of the training set is developed, and for which it is applicable to make predictions for new
180
compounds. If predicted compounds are within AD then the predictions of a QSPR model are more reliable. When a compound is very much dissimilar to all compounds of the modeling set, reliable prediction of its property is uncertain (43,44). For defining the domain of applicability methodology suggested by Toropov et al. was used (45). 8
3. Results and discussion 185
Table 1 shows the statistical quality of the built QSPR models. Presented data revealed that there is the reproduction of the statistical quality for calculated models in three runs of the Monte Carlo optimization. Further, results from Tab. 1 show that the predictability is very good. In addition, all presented models for pK are satisfactory from the point of view of new criteria suggested by Roy et al. (46,47). Table S2 (supplementary material) shows Y-
190
randomization which also confirms the robustness of suggested models (48). The search for preferable T and Nepoch revealed that for splits 1 and 2 preferable T is 1 and preferable Nepoch 11 and for split 3 preferable T is 0 and preferable Nepoch 10. Results from applicability domain indicate that molecules have typical (“average”) behavior and all were included in developing QSPR models. Figure 1 presents graphically best Monte Carlo optimization run (the highest
195
value for r2) for built QSPR models. [Tab. 1]
9
[Fig. 1] The application of above mentioned T and the Nepoch gives the following models for the pK calculated according to Eq. 3: 200
Split 1 pK = -0.377(± 0.017) + 0.058(± 0.0003) × DCW(1,11)
(4)
Split 2 pK = -0.239(± 0.023) + 0.053(± 0.0004) × DCW(1,11)
(5)
Split 3 205
pK = -0.314(± 0.014) + 0.051(± 0.0002) × DCW(0,10)
(6)
The prediction of binding constants or binding free energies of CD with organic guest molecules has been vastly modeled with different QSPR approaches. Of particular interest are QSPR models built with guest molecules with structural diversity (18-24). Suzuki has developed a model with a group-contribution method for calculating the binding constants or 210
the free energies of complexation (22). The nonlinear model for binary (1:1 stoichiometry) complexes of beta-CDs was derived with r2 of 0.917. The parameters used for model development were first-order molecular connectivity index as a measure of molecular bulk and atom/group counts in the guest molecules. The predictive performance of the model was tested with 27 compounds as a test set and results indicated that the predicted values by the
215
developed models are in good agreement with the experimentally determined data. Katritzky et al. applied CODESSA-PRO approach to model binding energies for 1:1 complexation systems between 218 organic guest molecules and beta-cyclodextrin, using a seven-parameter equation and obtained model with r2 = 0.796 and rcv2 = 0.779. Further, authors employed
10
fragment-based TRAIL calculations and obtained model with r2 = 0.943 and rcv2 = 0.848 (23). 220
Regardless the goodness of fit between experimental and calculated values for binding constants both model have several disadvantages. The molecule’s fragment composition is necessary to analyze carefully because the specific contribution to the overall binding free energies of each fragment or functional group is needed to be fixed before summing individual contributions. Therefore, molecules with very little or none of fragmental
225
information in the training set will not be properly characterized. The more common and popular approach for predicting the binding constants or free energies of complexation is to establish a mathematical relationship between them and the structural descriptors. To the best of our knowledge, at least six QSPR models by different research groups have been derived from the same or nearly the same data set of diverse guest
230
molecules (18-24). Four QSPR models were based on topological descriptors implemented in the commercial package of program and most introduce dozens of variables in the process of modeling (17–19,24). The main disadvantage of these models is the difficult interpretation of complex topological indices in terms of interaction mechanism. QSPR study on the structurally diverse data set with the structural descriptors derived not merely from the
235
topological features of a molecule was reported by Katritzky et al. (23). Models were developed on the basis of the seven descriptors divided into four categories: topological, charge-distribution related, geometrical and molecular orbital ones. Diversity and complexity of descriptors introduced, it was difficult, to elucidate the direct links between those descriptors and the physical phenomena involved in host–guest complexation, as the authors
240
commented23. Sang et al. developed QSPR models based molecules electronic densities, electrostatic potentials, molecular surface area (As) and volume (Vmc) and the energy levels of the highest occupied orbital (EHOMO) and the lowest unoccupied orbital (ELUMO). Authors employed different chemometeric approaches like multiple linear regression (MLR), partial
11
least-squares regression (PLS), support vector machine (SVM), least-squares support vector 245
machine (LSSVM), random forest (RF) and Gaussian process (GP) (21). The correlation coefficients r2 and q2, root mean error of estimation and root mean error of cross-validation of the seven-parameter QSPR model developed with MLR were 0.685, 0.653, 0.503, and 0.527, respectively. Predictive statistics of test set were rpred2 = 0.681 and the root mean error of prediction 0.514. Model developed with PLS explained 72.5% of the variance for dependent
250
variable y (r2 = 0.725), and predict 62.9% of the variance for y by cross-validation (q2 = 0.629). Prediction of the logK values of test compounds, revealed that 58.7% of the variance for test y could be explained sufficiently (rpred2 = 0.587). Developed QSPR model with SVM had values of r2, q2 and rpred2 0.866, 0.712, 0.494, respectively. Obtained values for r2 and q2 with LSSVM modeling were 0.891 and 0.721, respectively. The statistical results for the
255
optimal GP model had r2 value of 0.931 and GP model showed good predictive ability for the external test set with rpred2 value of 0.832. Our model has several advantages in comparison to stated ones. When statistical parameters of QSPR models are compared it can be seen that presented model have similar or better values for correlation coefficient, cross-validated correlation coefficient, mean error of
260
estimation and root mean square error. Also, our model has satisfactory values of novel statistical metric that represent true predictive potential of model. However, the most important advantage of presented QSPR modeling is the ease of descriptor calculation and interpretation. Every QSPR model that includes geometry-dependent molecular descriptors usually includes a relatively difficult calculation of the optimum molecular geometry, have
265
high computational costs and long developing time. For these reasons, the conformationindependent 0D, 1D and 2D-QSPR methods are alternative approach for developing models. Further, the exclusion of 3D-structural descriptors also avoids problems associated with ambiguities, resulting from an incorrect geometry optimization due to the existence of
12
compounds in various conformational states. Unlike topological descriptors SMILES notation 270
based ones do not have difficult mechanistic interpretation. The example
of calculations
of DCW and pK
for
SMILES
molecule
with
“Clc1ccc(OC(C)=O)cc1” SMILES notation is presented in Table 2. Figure 2 presents mechanistic interpretation of CW values for Sk - one SMILES atom. Similar analysis can be made for SSk and SSSk. 275
[Tab 2] [Fig 2] The correlation weights (CW) of molecular features (SAk) for SMILES notation based optimal descriptors can be used for their classification according to values from three Monte Carlo optimization runs. There are three types of SAks: with the stable positive values of CW
280
- the promoters of the increase of an endpoint value; with the stable negative values of CW the promoters of the decrease of an endpoint value; and unstable SAk which have both positive and negative values of CW for three Monte Carlo optimization runs. Further, if the CW(SAk) is > 0 in all three runs of the Monte Carlo optimization process, then the SAk is the promoter of pIC50 increase; if CW(SAk) is < 0 in all three runs of the optimization, then the
285
SAk is the promoter of pIC50 decrease. In the end, if there are both positive and negative values of CW(Sk) in three runs of the Monte Carlo optimization process, then SAk has an undefined role (36-38). The list of all SAk, with the correlation weights for three runs of the Monte Carlo optimization process for defined split of data set is given in Table S3. According to presented results for split 1 some of SMILES notation based descriptors that have positive
290
influence of pK and therefore increase pK value are: global SMILES attributes ++++F--B2==, ++++Br--B2==, ++++I---B2== (molecule contains fluorine, bromine or iodine atom and double bond), ++++CL--S=== (molecule contains chlorine and sulfur atoms),
13
BOND01000000, HALO01000000 (molecule contains chlorine atom), HALO00100000 (molecule contains bromine atom), NOSP11000000 (molecule contains nitrogen and oxygen 295
atoms); local SMILES attributes C........... (molecule contains aliphatic carbon atom), C...(...C... (molecule contains molecular branching on aliphatic carbon atoms), c...O....... (molecule contains oxygen atom bonded to aromatic carbon atom), c...O...C... (molecule contains ether group bonded to aromatic carbon atom), etc. According to presented results for split 1 some of SMILES notation based descriptors that have negative influence of pK and
300
therefore decrease pK value are: global SMILES attributes ++++B2--B3== (molecule contains double and triple bond), ++++S---B2== (molecule contains sulfur atom and double bond), ++++F---Cl==, ++++F---N===, ++++F---O=== (molecule contains fluorine atom and chlorine, nitrogen or oxygen atom), HALO11000000 (same as ++++F---Cl==), =........... (molecule contains double bond), N...=....... (molecule contains nitrogen atom with double
305
bond), N...#....... (molecule contains nitrogen atom with triple bond), N...#...C... (molecule contains nitrogen atom connected to carbon atom with triple bond), etc. 4. Conclusion In this study QSPR models were developed to predict the complexation of structurally diverse compounds with β-CD for a data set of 233 compounds based on SMILES notation
310
optimal descriptors using Monte Carlo method. The predictive potential of the applied approach was tested with three random splits into the sub-training, calibration, test and validation sets and with different statistical methods. Best obtained model had following statistical parameters: sub-calibration set 0.9046 for r2, 0.8981 for q2, 0.205 for mae; calibration set 0.9242 for r2, 0.9185 for q2, 0.218 for mae; test set 0.9337 for r2, 0.9278 for q2,
315
0.192 for mae, 0.8998 for rm(a)2, 0.0613 for Δrm2; validation set 0.8891 for r2, 0.8794 for q2, 0.237 for mae, 0.304 for rmse, 0.8402 for rm(a)2 and 0.0674 for Δrm2. This demonstrates that Monte Carlo method based modeling is a very promising computational method in the QSPR 14
studies for predicting the complexation of structurally diverse compounds with β-CD. The SMILES attributes (both local and global) which are promoters of increase/decrease 320
molecular binding constant were identified. These structural features are related to the complexation process and their identification helped to improve the understanding for the complexation mechanisms of the host molecules. Acknowledgment This work has been supported by Ministry of Education and Science, Republic of Serbia,
325
under Project Number 172044. APT and AAT acknowledge support from the EU project PROSIL funded under the LIFE program (project LIFE12 ENV/IT/000154). References 1. Szejtli J. Introduction and general overview of cyclodextrin chemistry. Chem Rev. 1998;98:1743–1754.
330
2. Davis ME, Brewster ME. Cyclodextrin-based pharmaceutics: Past, present and future. Nat Rev Drug Discov. 2004;3(12):1023-1035. 3. Loftsson T, Duchêne D. Cyclodextrins and their pharmaceutical applications. Int J Pharm. 2007;329(1-2):1-11. 4. Stella V J, Rajewski RA. Cyclodextrins: Their future in drug formulation and delivery.
335
Pharm Res. 1997;14(5):556-567. 5. Brewster ME, Loftsson T. Cyclodextrins as pharmaceutical solubilizers. Adv Drug Deliver Rev. 2007;59(7):645-666. 6. Challa R, Ahuja A, Ali J, Khar RK. Cyclodextrins in drug delivery: an updated review. AAPS PharmSciTech 2005;6(2):E329-357.
15
340
7. Uekama K, Hirayama F, Arima H. Recent aspect of cyclodextrin-based drug delivery system. J Incl Phenom Macrocycl Chem. 2006;56:3-8. 8. Liu L, Guo Q-X. The driving forces in the inclusion complexation of cyclodextrins. J Inclusion Phenom 2002;42(1-2):1-14. 9. Loftsson T, Brewster ME. Pharmaceutical applications of cyclodextrins: Basic science and
345
product development. J Pharm Pharmacol. 2010;62(11):1607-1621. 10. Loftsson T, Brewster ME. Cyclodextrins as functional excipients: Methods to enhance complexation efficiency. J Pharm Sci. 2012;101(9):3019-3032. 11. Gamsiz ED, Miller L, Thombre AG, Ahmed I, Carrier RL. Modeling the influence of cyclodextrins on oral absorption of low-solubility drugs: I. Model development. Biotechnol
350
Bioeng. 2010;105(2):409–420. 12. Gamsiz ED, Miller L, Thombre AG, Ahmed I, Carrier RL. Modeling the influence of cyclodextrins on oral absorption of low solubility drugs: II. Experimental validation. Biotechnol Bioeng. 2010;105(2):421-430. 13. Dodziuk H. Cyclodextrins and Their Complexes: Chemistry, Analytical Methods,
355
Applications. Weinheim:Wiley-VCH; 2006. 14. Liu L, Guo Q-X. Wavelet neural network and its application to the inclusion of βcyclodextrin with benzene derivatives. J Chem Inf Comput Sci. 1999;39(1):133-138. 15. Liu L, Guo Q-X. Novel prediction for the driving force and guest orientation in the complexation of α- and β-cyclodextrin with benzene derivatives. J Phys Chem B
360
1999;103(17):3461-3467.
16
16. Estrada E, Perdomo-López I, Torres-Labandeira JJ. Combination of 2D-, 3D-connectivity and quantum chemical descriptors in QSPR. Complexation of α- and β-cyclodextrin with benzene derivatives, J Chem Inf Comput Sci. 2001;41(6):1561-1568. 17. Pérez-Garrido A, Helguera AM, Guillén AA, Cordeiro MNDS, Escudero AG. Convenient 365
QSAR model for predicting the complexation of structurally diverse compounds with βcyclodextrins. Bioorg Med Chem. 2009;17(2):896-904. 18. Pérez-Garrido A, Helguera AM, Cordeiro MNDS, Escudero AG. QSPR modelling with the topological substructural molecular design approach: β-cyclodextrin complexation. J Pharm Sci 2009;98(12):4557-4576.
370
19. Chari R, Qureshi F, Moschera J, Tarantino R, Kalonia D. Development of improved empirical models for estimating the binding constant of a β-cyclodextrin inclusion complex. Pharm Res. 2009;26(1):161-171. 20. Li HY, Sun J, Wang YJ, Sui XF, Sun L, Zhang JW, He ZG. Structure-based in silico model profiles the binding constant of poorly soluble drugs with β-cyclodextrin. Eur J Pharm.
375
Sci 2011;42(1-2):55-64. 21. Sang P, Zou J-W, Dai D-M, Hu G-X, Jiang Y-J. Prediction of the complexation of structurally diverse compounds with β-cyclodextrin using structural descriptors derived from electrostatic potentials on molecular surface and different chemometric methods. Chemometr Intell Lab 2013;127:166–176.
380
22. Suzuki T. A nonlinear group contribution method for predicting the free energies of inclusion complexation of organic molecules with α- and β-cyclodextrins. J Chem Inf Comput Sci. 2001;41(5):1266-1273.
17
23. Katritzky AR, Fara DC, Yang H, Karelson M, Suzuki T, Solov'ev VP, Varnek A. Quantitative structure–property relationship modeling of β-cyclodextrin complexation free 385
energies. J Chem Inf Comput Sci. 2004;44(2):529-541. 24. Merzlikine A, Abramov YA, Kowsz SJ, Thomas VH, Mano T. Development of machine learning models of β-cyclodextrin and sulfobutylether-β-cyclodextrin complexation free energies. Int J Pharm. 2011;418(2):207-216. 25. Randic M, Basak SC. New descriptor for structure-property and structure-activity
390
correlations, J Chem Inf Comput Sci. 2010;41(3):650-656. 26. da Silva Junkes B, Arruda ACS, Yunes RA, Porto LC, Heinzen VEF. Semi-empirical topological index: A tool for QSPR/QSAR studies. J Mol Model. 2005;11(2):128-134. 27. Ivanciuc O. QSAR comparative study of wiener descriptors for weighted molecular graphs. J Chem Inf Comput Sci. 2000;40(6):1412-1422.
395
28. Ivanciuc O. Chemical graphs, molecular matrices and topological indices in chemoinformatics and quantitative structure–activity relationships. Curr Comput Aided Drug Des. 2013;9(2):153-163. 29. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28(1):31-36.
400
30. Weininger D, Weininger A, Weininger JL. SMILES. 2. Algorithm for generation of unique SMILES notation. J Chem Inf Comput Sci. 1989;29(2):97-101. 31. Weininger D. SMILES. 3. Depict. Graphical depiction of chemical structures. J Chem Inf Comput Sci. 1990;30(3):237-243.
18
32. Toropov AA, Benfenati E. SMILES in QSPR/QSAR modeling: results and perspectives. 405
Curr Drug Discov Technol. 2007;4(2):77-116. 33. Toropov AA, Benfenati E. SMILES as an alternative to the graph in QSAR modelling of bee toxicity. Comput Biol Chem. 2007;31(1):57-60. 34. Duchowicz PR, Comelli NC, Ortiz EV, Castro EA. QSAR study for carcinogenicity in a large set of organic compounds. Curr Drug Saf. 2012;7(4):282-288.
410
35. Talevi A, Bellera CL, Ianni MD, Duchowicz PR, Bruno-Blanch LE, Castro EA. An integrated drug development approach applying topological descriptors. Curr Comput Aided Drug Des. 2012;8(3):172-181. 36. Veselinović AM, Milosavljević JB, Toropov AA, Nikolić GM. SMILES-based QSAR model for arylpiperazines as high-affinity 5-HT1A receptor ligands using CORAL. Eur J
415
Pharm Sci. 2013;48(3):532-541. 37. Veselinović JB, Toropov AA, Toropova AP, Nikolić GM, Veselinović AM. Monte Carlo Method-Based QSAR Modeling of Penicillins Binding to Human Serum Proteins. Arch Pharm. 2015;348:1-6. 38. Veselinović AM, Milosavljević JB, Toropov AA, Nikolić GM. SMILES-Based QSAR
420
models for the calcium channel-antagonistic effect of 1,4-dihydropyridines. Arch Pharm. 2013;346(2):134-139. 39. Toropov AA, Toropova AP, Diaza RG, Benfenati E, Gini G. SMILES-based optimal descriptors: QSAR modeling of estrogen receptor binding affinity by correlation balance. Struct Chem. 2012;23(2):529-544.
19
425
40. Toropov AA, Toropova AP, Benfenati E, Gini G, Leszczynska D, Leszczynski J. SMILES-based QSAR approaches for carcinogenicity and anticancer activity: comparison of correlation weights for identical SMILES attributes. Anticancer Agents Med Chem 2011;11(10):974-982. 41. Roy K. On some aspects of validation of predictive quantitative structure activity
430
relationship models. Expert Opin Drug Dis. 2007;2(12):1567-1577. 42. Roy PP, Leonard JT, Roy K. Exploring the impact of the size of training sets for the development of predictive QSAR models. Chemom Intell Lab Syst. 2008;90(1):31-42. 43. Weaver S, Gleeson MP. The importance of the domain of applicability in QSAR modeling. J Mol Graph Model. 2008;26(8):1315-1326.
435
44. Gramatica P. Principles of QSAR models validation: Internal and external. QSAR Comb Sci. 2007;26(5):694-701. 45. Toropov AA, Toropova AP, Lombardo A, Roncaglioni A, Benfenati E, Gini G. CORAL: building up the model for bioconcentration factor and defining it's applicability domain. Eur J Med Chem. 2011;46(4):1400-1403.
440
46. Roy PP, Roy K. QSAR studies of CYP2D6 inhibitor aryloxypropanolamines using 2D and 3D descriptors. Chem Biol Drug Des. 2009;73(4):442-455. 47. Ojha PK, Mitra I, Das RN, Roy K. Further exploring rm2 metrics for validation of QSPR models. Chemometr Intell Lab Syst. 2011;107(1):194-205. 48. Ojha PK, Roy K. Comparative QSARs for antimalarial endochins: importance of
445
descriptor-thinning and noise reduction prior to feature selection. Chemometr Intell Lab Syst. 2011;109(2):146-161.
20
Figure captions Figure 1. Graphical representation of built QSPR models. Figure 2. Mechanistic interpretation of CW values (Table 2) for Sk - one SMILES atom for 450
molecule 86 (Table S1).
21
Table titles Table 1. The statistical parameters of QSPR models Table 2. Example of DCW(1,11) calculation for molecule 86 (Table S1) with application of 455
Eq. 4
Table 1. Statistical parameters of QSPR models
1 2
Split 1
3 A v 1 2
Split 2
3 A v 1 2
Split 3
3 A v
Sub-trainig set r2 q2 ma e 0.9 0.8 0.2 046 981 05 0.8 0.8 0.2 984 914 13 0.9 0.8 0.2 012 948 09 0.9 0.8 0.2 014 948 09 0.9 0.8 0.1 093 995 67 0.9 0.8 0.1 048 945 73 0.9 0.9 0.1 102 005 72 0.9 0.8 0.1 081 982 71 0.9 0.9 0.1 249 195 87 0.9 0.9 0.1 225 169 90 0.9 0.9 0.1 269 215 90 0.9 0.9 0.1 247 193 89
Calibration set r2 q2 ma e 0.9 0.9 0.2 242 185 18 0.9 0.9 0.2 207 146 27 0.9 0.9 0.2 193 133 24 0.9 0.9 0.2 214 155 23 0.9 0.9 0.2 245 203 22 0.9 0.9 0.2 239 198 25 0.9 0.9 0.2 242 202 23 0.9 0.9 0.2 242 201 23 0.8 0.8 0.2 829 741 75 0.8 0.8 0.2 799 705 72 0.8 0.8 0.2 783 690 80 0.8 0.8 0.2 804 712 76
Test set r2 q2 0.9 337 0.9 272 0.9 199 0.9 269 0.9 456 0.9 463 0.9 418 0.9 446 0.8 621 0.8 740 0.8 717 0.8 693
0.9 278 0.9 206 0.9 119 0.9 201 0.9 402 0.9 409 0.9 361 0.9 391 0.8 411 0.8 547 0.8 525 0.8 494
ma e 0.1 92 0.1 96 0.2 11 0.2 00 0.1 97 0.1 99 0.2 08 0.2 01 0.2 97 0.2 85 0.2 89 0.2 90
rm(a)
Δrm
2
2
0.8 998 0.8 942 0.8 838 0.8 926 0.8 740 0.8 556 0.8 653 0.8 650 0.8 020 0.8 190 0.8 156 0.8 122
0.0 613 0.0 280 0.0 357 0.0 417 0.0 542 0.0 568 0.0 582 0.0 564 0.0 942 0.0 656 0.0 800 0.0 799
Validation set r2 q2 ma e 0.8 0.8 0.2 891 794 37 0.8 0.8 0.2 709 596 60 0.8 0.8 0.2 729 618 47 0.8 0.8 0.2 776 669 48 0.8 0.8 0.2 441 265 28 0.8 0.8 0.2 424 242 22 0.8 0.8 0.2 478 300 22 0.8 0.8 0.2 448 269 24 0.7 0.7 0.3 443 231 25 0.7 0.7 0.3 738 562 15 0.7 0.7 0.3 399 188 31 0.7 0.7 0.3 527 327 24
rm se 0.3 04 0.3 30 0.3 22 0.3 18 0.3 24 0.3 26 0.3 21 0.3 23 0.4 63 0.4 37 0.4 67 0.4 56
rm(a)
Δrm
2
2
0.8 402 0.8 151 0.8 176 0.8 243 0.7 788 0.7 765 0.7 838 0.7 797 0.6 432 0.6 815 0.6 381 0.6 543
0.0 674 0.0 604 0.0 736 0.0 671 0.0 117 0.0 087 0.0 165 0.0 123 0.1 438 0.1 430 0.1 309 0.1 392
Best model for each split is indicated in italic font Av - average statistical parameter values from three independent Monte Carlo optimization runs r2 - Correlation coefficient q2 - Cross-validated correlation coefficient mae - Mean error of estimation rmse - Root mean square error rm(a)2, Δrm2 - Metrics sugested by Roy et al. (46,47)
460
22
Table 2. Example of DCW(1,11) calculation for molecule 86 (Table S1) with application of Eq. 4 Local SMILES attributes
465
Global SMILES attributes
SA
CW
SA
CW
SA
CW
SA
CW
Cl.......... c........... 1........... c........... c........... c........... (........... O........... C........... (........... C........... (........... =........... O........... (........... c........... c........... 1...........
1.4652 0.8082 0.4984 0.8082 0.8082 0.8082 -0.661 -1.3743 0.5983 -0.661 0.5983 -0.661 -1.3076 -1.3743 -0.661 0.8082 0.8082 0.4984
c...Cl...... c...1....... c...1....... c...c....... c...c....... c...(....... O...(....... O...C....... C...(....... C...(....... C...(....... =...(....... O...=....... O...(....... c...(....... c...c....... c...1.......
-0.5035 2.5895 2.5895 0.8426 0.8426 -0.1218 -0.7192 -0.9397 1.5666 1.5666 1.5666 -0.9731 -1.3738 -0.7192 -0.1218 0.8426 2.5895
Cl..c...1... c...1...c... c...c...1... c...c...c... c...c...(... c...(...O... C...O...(... O...C...(... C...(...C... (...C...(... C...(...=... O...=...(... =...O...(... c...(...O... c...c...(... c...c...1...
1.7509 2.5596 4.4083 1.7526 1.6267 1.9684 3.5028 1.6208 2.0931 2.0965 -1.1263 0.2486 -1.1586 1.9684 1.6267 4.4083
NOSP01000000 HALO01000000 BOND10000000 ++++CL--O=== ++++Cl--B2== ++++O---B2==
2.3457 2.6581 1.0024 -1.5278 -0.3773 1.0013
Fig. 1
23
470
Fig. 2
24