Accepted Manuscript
QSAR model for blood-brain barrier permeation
Andrey A. Toropov, Alla P. Toropova, Marten Beeg, Marco Gobbi, Mario Salmona
PII: S1056-8719(17)30014-X
DOI: 10.1016/j.vascn.2017.04.014
Reference: JPM 6446
To appear in: Journal of Pharmacological and Toxicological Methods
Received date: 27 January 2017
Revised date: 20 April 2017
Accepted date: 30 April 2017
Please cite this article as: Andrey A. Toropov, Alla P. Toropova, Marten Beeg, Marco Gobbi, Mario Salmona , QSAR model for blood-brain barrier permeation, Journal of Pharmacological and Toxicological Methods (2017), doi: 10.1016/j.vascn.2017.04.014
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
QSAR model for Blood-Brain Barrier Permeation
Andrey A. Toropov1, Alla P. Toropova*1, Marten Beeg2, Marco Gobbi2, Mario Salmona3
1 Department of Environmental Health Science, Laboratory of Environmental Chemistry and Toxicology, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy
2 Department of Molecular Biochemistry and Pharmacology, Laboratory of Pharmacodynamics and Pharmacokinetics, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy
3 Department of Molecular Biochemistry and Pharmacology, Laboratory of Biochemistry and Protein Chemistry, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy
Abstract
Background and Objective: Predicting blood-brain barrier permeability for novel compounds is an important goal for neurotherapeutics-focused drug discovery. It is impossible to determine experimentally the blood-brain barrier partitioning of all possible candidates. Consequently, alternative evaluation methods based on computational models are desirable or even necessary. The CORAL software (http://www.insilico.eu/coral) has been evaluated as a tool to build quantitative structure – activity relationships for blood-brain barrier permeation.
Methods: The Monte Carlo technique makes it possible to build a predictive model of an endpoint by selecting so-called correlation weights of various molecular features. Descriptors calculated with these weights are the basis for "structure – endpoint" correlations.
Results: The approach gives good models for three random splits into training and validation sets. The best model is characterized by the following statistics for the external validation set: the number of compounds is 41, the determination coefficient is 0.896, and the root mean squared error is 0.175.
Conclusions: The suggested approach can be applied as a tool for the prediction of blood-brain barrier permeation.
Keywords: QSAR; Blood–Brain Barrier; Monte Carlo method; Computer-aided drug design; CORAL software
*) Corresponding author: Alla P. Toropova, Laboratory of Environmental Chemistry and Toxicology, IRCCS - Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy
Tel: +39 02 3901 4595
Fax: +39 02 3901 4735
Email: [email protected]
1. Introduction
The blood-brain barrier (BBB) is a major factor hindering the development of neurotherapeutics. Experimental determination of BBB permeation, like that of many other biomedical endpoints, is cumbersome and expensive; thus, computational methods for predicting BBB permeation are an attractive alternative to experiment (Hou & Xu, 2002, 2003; Hou et al., 2006). The BBB is a physiological barrier in the circulatory system that maintains the homeostasis of the central nervous system by separating the brain from the systemic blood circulation, thereby shielding the central nervous system from many substances. The blood-brain distribution of a molecule is a key characteristic for assessing its suitability as a potential drug for the central nervous system (Konovalov et al., 2007).
One of the biopharmaceutical properties with critical influence on drug design is the ability of a molecule to penetrate the BBB. Agents intended to interact with molecular targets in the central nervous system must cross the BBB in order to be used as therapeutic agents. At the same time, peripherally acting agents should not cross the BBB, so as to avoid side effects. In both cases the BBB permeability of the molecules must be known. The experimental determination of BBB permeability is usually difficult, time-consuming, and expensive, and it requires a sufficient quantity of the pure compound; hence it is not suitable for providing results in a high-throughput manner. Therefore, there is increasing interest in reliable and easily applicable computational approaches that can rapidly predict the BBB penetration capability of molecules. Such predictive models can be widely used in the drug discovery process, especially in the area of the central nervous system.
The computational modeling of BBB permeability started in 1988, when the importance of lipophilicity for brain penetration was established statistically (Young et al., 1988). Subsequently, models were built with traditional statistical approaches, such as multiple linear regression, partial least squares, and linear discriminant analysis, in order to predict the BBB permeation of untested substances (Crivori et al., 2000). BBB models have also been built using artificial intelligence techniques such as genetic algorithms (Iyer et al., 2002) and artificial neural networks (Garg & Verma, 2006). The advantage of these methods is that they can efficiently handle nonlinear data. In addition, there have been multiple attempts to build quantitative structure-activity relationships (QSARs) for BBB permeation (Ooms et al., 2002; Luco & Marchevsky, 2006; de Sá et al., 2010; Carpenter et al., 2014; Bujak et al., 2015; Chena et al., 2009), which are based on different descriptors, such as physicochemical properties (octanol/water partition coefficient, solubility, lipophilicity), as well as topological and constitutional descriptors.
The aim of this work is to build a QSAR model for BBB permeation by means of the CORAL software (http://www.insilico.eu/coral).
2. Method
2.1 Data
The database of BBB permeation (logBB) values for 291 substances is available from the literature (Hou & Xu, 2002). These substances were randomly split three times into training (≈35%), invisible training (≈35%), calibration (≈15%), and validation (≈15%) sets. These splits are random and non-identical (Toropova & Toropov, 2014).
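The random splitting described in this section can be illustrated with a minimal sketch. The function name `random_split`, the set names, the seeds, and the rounding of set sizes are assumptions made for illustration only; the actual set sizes used in this work may differ from those produced here.

```python
import random

# Hypothetical helper: names, seeds and rounding are illustrative assumptions.
SET_NAMES = ("training", "invisible_training", "calibration", "validation")

def random_split(ids, fractions=(0.35, 0.35, 0.15, 0.15), seed=0):
    """Randomly distribute compound IDs over the four sets."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n = len(shuffled)
    # Round the first three set sizes; the last set takes the remainder.
    sizes = [round(f * n) for f in fractions[:-1]]
    sizes.append(n - sum(sizes))
    split, start = {}, 0
    for name, size in zip(SET_NAMES, sizes):
        split[name] = shuffled[start:start + size]
        start += size
    return split

# Three random, non-identical splits of 291 compounds (IDs 1..291).
splits = [random_split(range(1, 292), seed=s) for s in (1, 2, 3)]
for s in splits:
    print({name: len(members) for name, members in s.items()})
```

Each compound falls into exactly one set per split, and different seeds give different distributions.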
2.2. Optimal descriptor
Figure 1 shows the scheme of building up QSAR models with the CORAL software. The available data on molecular structure are represented by the simplified molecular input-line entry system (SMILES) (Weininger, 1988). CORAL extracts molecular features according to the selected method (Toropova & Toropov, 2014). The QSAR models for logBB are built with the following two versions of the optimal descriptor:
$$DCW_1(T^*, N^*) = \sum CW(S_k) + \sum CW(SS_k) + CW(HARD) \qquad (1)$$
$$DCW_2(T^*, N^*) = \sum CW(S_k) + \sum CW(SS_k) + CW(BOND) + CW(NOSP) + CW(HALO) \qquad (2)$$
The SMILES attributes Sk, SSk, BOND, NOSP, and HALO are described in the literature (Toropova & Toropov, 2014). Sk is a SMILES-atom, i.e. one symbol or a group of symbols which cannot be examined separately (e.g. ‘Cl’, ‘Br’, etc.). SSk is a combination of two SMILES-atoms. Table 1 contains an example of the definition of the listed attributes, which are represented by sequences of twelve symbols. One can see that HARD is the association of BOND, NOSP, and HALO (Table 1). The CW(x) are the correlation weights of the above SMILES attributes. The numerical values of the correlation weights are calculated by Monte Carlo optimization. During several epochs (an epoch is the modification of all correlation weights involved in building up a model), the correlation weights that give the maximal correlation coefficient between the endpoint and the optimal descriptor are calculated. Figure 2 shows the scheme to define the number of epochs of the optimization.
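To make the attribute definitions concrete, the following sketch extracts Sk, SSk, and HARD from the example SMILES of Table 1 and sums the correlation weights into a DCW1-like value. It is an illustration and not the CORAL implementation. The tokenizer handles only ‘Cl’ and ‘Br’ as multi-symbol SMILES-atoms. The four-symbol fields and the ordering of the two atoms inside SSk are assumptions inferred from the Table 1 example. The correlation weights in the usage lines are invented numbers.

```python
import re

# Simplified tokenizer (assumption): 'Cl' and 'Br' are kept whole,
# every other character is treated as one SMILES-atom.
TOKEN = re.compile(r"Cl|Br|.")

# Order of the presence flags in HARD, as read from Table 1.
HARD_FLAGS = ["=", "#", "@", "N", "O", "S", "P", "F", "Cl", "Br", "I"]

def smiles_atoms(smiles):
    # Per the Table 1 footnote, branching is represented only by "(".
    return ["(" if t == ")" else t for t in TOKEN.findall(smiles)]

def pad12(*parts):
    # Each SMILES-atom occupies a four-symbol field; the rest is dots.
    return "".join(p.ljust(4, ".") for p in parts).ljust(12, ".")

def sk_attributes(smiles):
    return [pad12(a) for a in smiles_atoms(smiles)]

def ssk_attributes(smiles):
    atoms = smiles_atoms(smiles)
    # Assumed convention: the atom with the larger character code goes first.
    return [pad12(*sorted(pair, reverse=True)) for pair in zip(atoms, atoms[1:])]

def hard_attribute(smiles):
    present = set(smiles_atoms(smiles))
    return "$" + "".join("1" if f in present else "0" for f in HARD_FLAGS)

def dcw1(smiles, cw):
    # Attributes without a weight (e.g. rare ones) contribute zero.
    attrs = sk_attributes(smiles) + ssk_attributes(smiles) + [hard_attribute(smiles)]
    return sum(cw.get(a, 0.0) for a in attrs)

example = "NC(SCCF)=N"
print(sk_attributes(example))
print(ssk_attributes(example))
print(hard_attribute(example))

# Invented correlation weights, for illustration only.
weights = {hard_attribute(example): 1.0, pad12("N"): 0.5}
print(dcw1(example, weights))
```

The DCW2 variant would replace the single HARD term by three separate BOND, NOSP, and HALO terms built from the same presence flags.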
T is the threshold, i.e. an integer value used to separate SMILES attributes into two classes: rare and not rare. If T=2, then SMILES attributes which appear in the training set twice or less are classified as rare. The correlation weights of rare attributes are fixed at zero. Consequently, these attributes are not involved in building up a model. N is the number of epochs of the Monte Carlo optimization. T* and N* are the values of these parameters which give the best statistical characteristics for the calibration set. Figure 3 shows the general strategy to select T* and N*.
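The role of the threshold can be sketched as follows. The function name `active_attributes` is hypothetical. Counting each attribute once per training structure is an assumption, since the text does not state whether repeated occurrences within one SMILES are counted. The comparison follows the wording above, where an attribute seen T times or less is rare.

```python
from collections import Counter

def active_attributes(training_attribute_lists, threshold):
    """Return the attributes that are not rare for the given threshold T."""
    # Assumption: an attribute is counted once per training structure.
    counts = Counter(a for attrs in training_attribute_lists for a in set(attrs))
    # Attributes seen T times or less are rare; their weights stay at zero.
    return {a for a, n in counts.items() if n > threshold}

# Toy training set: each inner list holds the attributes of one structure.
training = [["C", "N"], ["C", "O"], ["C", "N"], ["C"]]
print(active_attributes(training, threshold=2))  # only "C" remains active
print(active_attributes(training, threshold=1))  # "C" and "N" remain active
```

Only the active attributes would receive correlation weights during the optimization; all others are blocked.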
There are two ways to build a QSAR model using the CORAL software: the traditional scheme and the balance of correlations.
The traditional scheme is based on three sets: (i) training set; (ii) calibration set; and (iii) validation set.
The balance of correlations is based on four sets: (i) training set; (ii) invisible training set; (iii) calibration set; and (iv) validation set. Figure 2 illustrates the tasks of the four sets used in the balance of correlations.
In the balance of correlations, the four sets have special roles. The training set is the builder of the model. The invisible training set is the inspector: it checks for the absence of overtraining (i.e. the situation where perfect statistical quality for the training set is accompanied by poor statistical quality for some external set). The calibration set is the estimator: it indicates whether the current model has predictive potential. The external validation set is the final estimator of the predictive potential for unknown substances. In the traditional scheme, the training and invisible training sets are merged into a common training set. In other words, in the traditional scheme the inspector, which checks whether overtraining happens, is absent. Figure 3 shows the general scheme of building up a model with the CORAL software (definition of T* and N*).
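A simplified sketch of the optimization loop is given below. It follows the traditional scheme only: the weights are perturbed at random and a change is kept when the correlation on the training set improves, while the calibration set is used to pick the number of epochs. The target function of the balance of correlations is not spelled out in the text, so it is not implemented here. The function names, the step size, the starting weights, and the toy data are all assumptions.

```python
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def dcw(attrs, cw):
    # Attributes without a weight contribute zero.
    return sum(cw.get(a, 0.0) for a in attrs)

def corr(data, cw):
    return pearson([dcw(attrs, cw) for attrs, _ in data], [y for _, y in data])

def monte_carlo(train, calib, max_epochs, step=0.1, seed=0):
    """Return (best calibration r, epoch at which it was reached, weights)."""
    rng = random.Random(seed)
    names = sorted({a for attrs, _ in train for a in attrs})
    cw = {a: rng.uniform(0.9, 1.1) for a in names}
    best_train = corr(train, cw)
    best = (corr(calib, cw), 0, dict(cw))
    for epoch in range(1, max_epochs + 1):
        # One epoch: try to modify every correlation weight once.
        for a in names:
            old = cw[a]
            cw[a] = old + rng.uniform(-step, step)
            trial = corr(train, cw)
            if trial > best_train:
                best_train = trial
            else:
                cw[a] = old
        r_cal = corr(calib, cw)
        if r_cal > best[0]:
            best = (r_cal, epoch, dict(cw))
    return best

# Toy data: (list of attributes, endpoint value); invented for illustration.
train = [(["C", "C", "N"], 0.5), (["C", "O"], -0.3), (["C", "C", "C"], 0.6),
         (["N", "O"], -0.1), (["C", "N", "N"], 0.7), (["O", "O"], -0.8)]
calib = [(["C", "N"], 0.4), (["C", "O", "O"], -0.5),
         (["N", "N"], 0.5), (["C", "C", "O"], 0.1)]

r_cal, n_star, weights = monte_carlo(train, calib, max_epochs=20)
print(r_cal, n_star, weights)
```

Repeating the run for several thresholds and keeping the pair with the best calibration statistics would correspond to the selection of T* and N*.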
Having the numerical data on the correlation weights, one can calculate the following model with the training set:
Endpoint = C0 + C1 * DCWx(T*,N*), x=1,2    (3)
The predictive potential of the model calculated with Eq. 3 should be checked with the external validation set.
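The one-variable model of Eq. 3 and its validation statistics can be sketched as follows. The function names are hypothetical, the data are invented, and the determination coefficient is taken here as the squared Pearson correlation between observed and predicted values, which is an assumption about the statistic reported in this work.

```python
import math

def fit_line(x, y):
    """Least-squares estimates of C0 and C1 in Endpoint = C0 + C1 * DCW."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    c1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    c0 = my - c1 * mx
    return c0, c1

def predict(c0, c1, dcw_value):
    return c0 + c1 * dcw_value

def r2_rmse(observed, predicted):
    """Squared Pearson correlation and root mean squared error."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    r2 = sop * sop / (soo * spp)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return r2, rmse

# Invented descriptor values and endpoints, for illustration only.
dcw_train, y_train = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
c0, c1 = fit_line(dcw_train, y_train)

dcw_valid, y_valid = [0.5, 1.5, 2.5], [2.1, 3.9, 6.2]
y_pred = [predict(c0, c1, d) for d in dcw_valid]
print(c0, c1, r2_rmse(y_valid, y_pred))
```

The coefficients are fitted on the training set only; the validation compounds enter solely through `r2_rmse`.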
3. Results
The balance of correlations based on DCW1(T*,N*) gives the following models:
Split 1
logBB = -0.0138320 (± 0.0031331) + 0.0331486 (± 0.0002262) * DCW1(1,15)    (4)
Split 2
logBB = 0.0087213 (± 0.0032976) + 0.0311008 (± 0.0001652) * DCW1(1,10)    (5)
Split 3
logBB = -0.0002735 (± 0.0031709) + 0.0488865 (± 0.0002740) * DCW1(1,20)    (6)
Table 2 contains the distributions into split 1, split 2, and split 3 together with the experimental values of logBB and the values calculated with Eqs. 4-6. These splits were also used for DCW2(T*,N*) as well as for the traditional scheme, i.e. training set – calibration set – validation set (without the invisible training set).
The traditional scheme based on DCW1(T*,N*) gives the following models:
Split 1
logBB = -0.1758388 (± 0.0016122) + 0.0382357 (± 0.0001028) * DCW2(1,20)    (7)
Split 2
logBB = -0.1636285 (± 0.0015508) + 0.0303687 (± 0.0000765) * DCW2(1,10)    (8)
Split 3
logBB = -0.0991360 (± 0.0015057) + 0.0482449 (± 0.0001139) * DCW2(1,21)    (9)
The values of logBB calculated with Eqs. 7-9 are not represented in Table 2, because the predictive potential of these models is lower than the predictive potential of the models calculated with Eqs. 4-6 (Table 3).
4. Discussion
Table 3 contains the statistical characteristics of the models. One can see from Table 3 that: (i) the balance of correlations gives models which are better than the models calculated with the traditional scheme; and (ii) the first descriptor, calculated with the attribute HARD suggested in this work, gives better models than the second descriptor, calculated with separate BOND, NOSP, and HALO (Toropova & Toropov, 2014; Toropova et al., 2011). Thus, using HARD instead of separated or united BOND, NOSP, and HALO improves the predictive potential of the models based on optimal descriptors.
The review of works dedicated to the development of computational approaches to predict biochemical endpoints (Hou & Xu, 2002, 2003; Hou et al., 2006; Konovalov et al., 2007; Young et al., 1988; Crivori et al., 2000; Iyer et al., 2002; Garg & Verma, 2006; Ooms et al., 2002; Luco & Marchevsky, 2006; de Sá et al., 2010; Carpenter et al., 2014; Bujak et al., 2015; Chena et al., 2009) confirms that the search for effective methods to calculate biochemical phenomena is an important task of modern natural sciences. Improving the CORAL software is also part of the global task of improving the theory and practice of QSAR (Toropova & Toropov, 2014; Gobbi et al., 2016). The novelty of this work is the use of the new global SMILES attribute named HARD (Table 1). The suggested attribute HARD reflects the presence (absence) of different molecular features (Table 1). Involving the correlation weights of those molecular features improves the predictive potential of the models for logBB (Table 3). There are rare (noise) versions of the HARD descriptor; fortunately, rare HARD descriptors are removed from model building by the scheme described in the literature (Toropova & Toropov, 2014; Gobbi et al., 2016). In other words, if a version of HARD is rare, this version is removed from the modeling process (e.g. $00000001010, $00000001110, and $01001000000, which are absent from the training set, have correlation weights equal to zero; therefore they are not involved in building up a model). In spite of the interconnection between the described BOND, NOSP, and HALO and HARD, there is an apparent difference between the model based on the descriptor DCW1(T*,N*) (calculated with HARD) and the model based on DCW2(T*,N*) (calculated without HARD). The difference is visible from Table 3. It should be noted that T* and N* are different for the examined random splits.
Table 4 contains the comparison of different models for logBB suggested in the literature. The comparison shows that the models suggested in this work are comparable to models suggested in the literature (Hou & Xu, 2002; Konovalov et al., 2007; Crivori et al., 2000; Iyer et al., 2002; Garg & Verma, 2006; Ooms et al., 2002; Chena et al., 2009). It should be noted that the majority of the above-mentioned models involve physicochemical parameters (Konovalov et al., 2007; Bujak et al., 2015) and/or 3D stereochemical calculations (Ooms et al., 2002), whereas the approach suggested here involves solely 2D data on the molecular structure represented by SMILES notation.
Having the numerical data on the correlation weights of molecular features expressed via SMILES attributes, obtained in several runs of the Monte Carlo optimization, one can extract four categories of attributes. The first category is attributes with positive correlation weights in all runs. The second category is attributes with negative correlation weights in all runs. The third category is attributes which have both positive and negative correlation weights. The fourth category is blocked attributes. Table 5 contains a collection of attributes of the first and second categories. The attributes of the first category can be regarded as promoters of increase of logBB. The attributes of the second category can be regarded as promoters of decrease of logBB. Thus, the presence of nitrogen atoms and of double covalent bonds are promoters of increase of logBB, whereas the presence of oxygen and of two rings are promoters of decrease of logBB. This is qualitative information; however, it can be the basis for clear hypotheses, which can be confirmed (or rejected) by carrying out the corresponding experiments. Table 6 contains examples of molecules that confirm the influence of two promoters of increase (large logBB values) and decrease (small logBB values).
The statistically significant promoters of logBB increase, as well as the statistically significant promoters of logBB decrease, are molecular features which are important for blood-brain barrier permeation. Statistical significance here means, firstly, considerable prevalence of the molecular feature in the training and calibration sets; and secondly, stable positive values (or stable negative values) of the correlation weights of the given feature, observed in several runs of the Monte Carlo optimization.
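The four categories can be obtained from the correlation weights of several runs with a few lines of code. The sketch below uses invented weights and hypothetical names. Attributes that are zero in all runs are treated as blocked, and any mixture of signs is placed in the third category, which is an interpretation of the description above. The prevalence criterion is not checked here.

```python
def classify_attributes(weights_by_run):
    """Assign each attribute to one of the four categories.

    weights_by_run: list of dicts (attribute -> correlation weight),
    one dict per run of the Monte Carlo optimization.
    """
    categories = {}
    for attr in set().union(*weights_by_run):
        values = [run.get(attr, 0.0) for run in weights_by_run]
        if all(v > 0 for v in values):
            categories[attr] = "promoter of increase"
        elif all(v < 0 for v in values):
            categories[attr] = "promoter of decrease"
        elif all(v == 0 for v in values):
            categories[attr] = "blocked"
        else:
            categories[attr] = "mixed"
    return categories

# Invented correlation weights from two runs, for illustration only.
runs = [
    {"N": 0.5, "O": -0.3, "C": 0.2, "#": 0.0},
    {"N": 1.1, "O": -0.9, "C": -0.4, "#": 0.0},
]
print(classify_attributes(runs))
```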
The SMILES-attribute “N…C…….” is a promoter of logBB increase (Table 5). Table 7 contains an example of a modification of molecular structure that leads to an increase of logBB. The SMILES-attribute “O………….” is a promoter of logBB decrease (Table 5). Table 8 contains an example of a modification of molecular structure that leads to a decrease of logBB. Thus, the suggested approach makes it possible to formulate hypotheses on how one should modify the molecular structure in order to change the logBB value.
The numerical data on the correlation weights for the model calculated with Eq. 4 (Split 1) are available in the Supplementary materials.
Conclusions
1. The CORAL software gives good predictions of blood-brain barrier permeation (logBB);
2. The suggested SMILES attribute HARD improves the statistical quality of prediction;
3. The balance of correlations gives better models than the traditional scheme for all three random non-identical splits;
4. The described approach has a mechanistic interpretation in terms of promoters of increase or decrease of logBB.
Thus, the suggested models are built in accordance with OECD principles (OECD, 2007).
Acknowledgments
AAT and APT thank the project EU-ToxRisk (Project reference: 681002) and the LIFE-COMBASE contract (LIFE15 ENV/ES/000416) for financial support.
References
Bujak, R., Struck-Lewicka, W., Kaliszan, M., Kaliszan, R., & Markuszewski, M.J. (2015). Blood–brain barrier permeability mechanisms in view of quantitative structure–activity relationships (QSAR). Journal of Pharmaceutical and Biomedical Analysis, 108, 29–37.
Carpenter, T.S., Kirshner, D.A., Lau, E.Y., Wong, S.E., Nilmeier, J.P., & Lightstone, F.C. (2014). A method to predict Blood-Brain Barrier permeability of drug-like compounds using molecular dynamics simulations. Biophysical Journal, 107, 630–641.
Chena, Y., Zhu, Q.-J., Pan, J., Yang, Y., & Wu, X.-P. (2009). A prediction model for blood–brain barrier permeation and analysis on its parameter biologically. Computer Methods and Programs in Biomedicine, 95, 280–287.
Crivori, P., Cruciani, G., Carrupt, P.A., & Testa, B. (2000). Predicting blood-brain barrier permeation from three-dimensional molecular structure. Journal of Medicinal Chemistry, 43, 2204–2216.
de Sá, M.M., Pasqualoto, K.F.M., & Rangel-Yagui, C.O. (2010). A 2D-QSPR approach to predict blood-brain barrier penetration of drugs acting on the central nervous system. Brazilian Journal of Pharmaceutical Science, 46, 741–751.
Iyer, M., Mishru, R., Han, Y., & Hopfinger, A.J. (2002). Predicting blood-brain barrier partitioning of organic molecules using membrane-interaction QSAR analysis. Pharmaceutical Research, 19, 1611–1621.
Garg, P., & Verma, J. (2006). In silico prediction of blood brain barrier permeability: an artificial neural network model. Journal of Chemical Information and Modeling, 46, 289–297.
Gobbi, M., Beeg, M., Toropova, M.A., Toropov, A.A., & Salmona, M. (2016). Monte Carlo method for predicting of cardiac toxicity: hERG blocker compounds. Toxicology Letters, 250–251, 42–46.
Hou, T.J., & Xu, X.J. (2002). ADME evaluation in drug discovery 1. Applications of genetic algorithms to the prediction of blood–brain partitioning of a large set of drugs. Journal of Molecular Modeling, 8, 337–349.
Hou, T.J., & Xu, X.J. (2003). ADME Evaluation in Drug Discovery. 3. Modeling Blood-Brain Barrier Partitioning Using Simple Molecular Descriptors. Journal of Chemical Information and Computer Sciences, 43, 2137–2152.
Hou, T., Wang, J., Zhang, W., Wang, W., & Xu, X. (2006). Recent Advances in Computational Prediction of Drug Absorption and Permeability in Drug Discovery. Current Medicinal Chemistry, 13, 2653–2667.
Konovalov, D.A., Coomans, D., Deconinck, E., & Heyden, Y.V. (2007). Benchmarking of QSAR models for Blood-Brain Barrier Permeation. Journal of Chemical Information and Modeling, 47, 1648–1656.
Luco, J.M., & Marchevsky, E. (2006). QSAR Studies on Blood-Brain Barrier permeation. Current Computer-Aided Drug Design, 2, 31–55.
OECD (2007). Environment Health and Safety Publications Series on Testing and Assessment No. 69. Guidance Document on the Validation of (Quantitative) Structure–Activity Relationship [(Q)SAR] Models. http://search.oecd.org/officialdocuments/ (accessed 12.11.16).
Ooms, F., Weber, P., Carrupt, P.-A., & Testa, B. (2002). A simple model to predict Blood–Brain Barrier permeation from 3D molecular fields. Biochimica et Biophysica Acta, 1587, 118–125.
Toropova, A.P., Toropov, A.A., Benfenati, E., Gini, G., Leszczynska, D., & Leszczynski, J. (2011). CORAL: Quantitative Structure–Activity Relationship models for estimating toxicity of organic compounds in rats. Journal of Computational Chemistry, 32, 2727–2733.
Toropova, A.P., & Toropov, A.A. (2014). CORAL software: Prediction of carcinogenicity of drugs by means of the Monte Carlo method. European Journal of Pharmaceutical Sciences, 52, 21–25.
Young, R.C., Mitchell, R.C., Brown, T.H., Ganellin, C.R., Griffiths, R., Jones, M., Rana, K.K., Saunders, D., Smith, I.R., Sore, N.E., & Wilks, T.J. (1988). Development of a new physicochemical model for brain penetration and its application to the design of centrally acting H2 receptor histamine antagonists. Journal of Medicinal Chemistry, 31, 656–671.
Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28, 31.
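Table 1 Examples of representation of SMILES attributes by means of twelve symbols [SMILES = “NC(SCCF)=N”]

ID | Comment | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
1 | Representation of Sk | N | . | . | . | . | . | . | . | . | . | . | .
  |  | C | . | . | . | . | . | . | . | . | . | . | .
  |  | (* | . | . | . | . | . | . | . | . | . | . | .
  |  | S | . | . | . | . | . | . | . | . | . | . | .
  |  | C | . | . | . | . | . | . | . | . | . | . | .
  |  | C | . | . | . | . | . | . | . | . | . | . | .
  |  | F | . | . | . | . | . | . | . | . | . | . | .
  |  | ( | . | . | . | . | . | . | . | . | . | . | .
  |  | = | . | . | . | . | . | . | . | . | . | . | .
  |  | N | . | . | . | . | . | . | . | . | . | . | .
2 | Representation of SSk | N | . | . | . | C | . | . | . | . | . | . | .
  |  | C | . | . | . | ( | . | . | . | . | . | . | .
  |  | S | . | . | . | ( | . | . | . | . | . | . | .
  |  | S | . | . | . | C | . | . | . | . | . | . | .
  |  | C | . | . | . | C | . | . | . | . | . | . | .
  |  | F | . | . | . | C | . | . | . | . | . | . | .
  |  | F | . | . | . | ( | . | . | . | . | . | . | .
  |  | = | . | . | . | ( | . | . | . | . | . | . | .
  |  | N | . | . | . | = | . | . | . | . | . | . | .
3 | Definition of BOND attribute (symbol) | B | O | N | D | = | # | @ |  |  |  |  |
  | (value) |  |  |  |  | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | Definition of NOSP attribute (symbol) | N | O | S | P | N | O | S | P |  |  |  |
  | (value) |  |  |  |  | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0
5 | Definition of HALO attribute (symbol) | H | A | L | O | F | Cl | Br | I |  |  |  |
  | (value) |  |  |  |  | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
6 | Definition of HARD attribute (symbol) | $ | = | # | @ | N | O | S | P | F | Cl | Br | I
  | (value) |  | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0

*) Brackets are the representation of molecular branching; only “(” is used, without “)”.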
Table 2 Experimental (Hou & Xu, 2002) and calculated values of logBB
1.0200
0.7779
0.7203
0.8271
Y
Y Y
-0.3420 0.1510 -0.2800 -0.1400 0.2890 -0.1000 0.3300 0.1150 0.2700 0.5600 0.2670 0.1420 0.3900 0.3050 0.0400 -0.1300 0.3700 -0.0600 0.7400 0.8600
-0.2067 0.0484 0.0050 -0.1652 0.1977 0.0463 0.1953 0.1073 -0.0362 0.1583 0.1551 0.2672 -0.0763 0.0918 0.0272 -0.1402 -0.1710 0.1333 0.3407 0.3822
-0.3416 0.1126 -0.1030 -0.3087 0.3568 -0.0351 0.2442 0.2485 0.3969 0.4687 0.2250 0.1744 0.2946 0.1775 0.0760 0.1811 0.0690 0.2104 0.3746 0.4074
-0.2387 0.0274 -0.0736 -0.1960 0.1560 0.0607 0.2810 0.1580 0.2898 0.4671 0.3967 0.1523 0.3940 0.1459 0.0338 0.0246 0.1989 0.1886 0.4020 0.4447
N Y Y N Y N N N Y Y Y Y Y Y N N N Y Y Y
Y Y Y Y Y Y Y Y Y Y Y Y Y Y N N N Y Y Y
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
V C I V T V T V T I V T T C T I T T I I
Eq. 5 0.0661 0.0448 -0.1118 0.6220 0.6548 0.6876 0.7205 0.7533 0.7861 0.6043 0.6372 0.6700 0.7028 0.6043 0.6372 0.5596 0.6747 0.7732 0.7732 0.8060 0.7379 0.9200
I V T V T C T C T T C T I T T T T T V T
Eq. 4 -0.0804 -0.0079 -0.0266 0.5105 0.5520 0.5935 0.6349 0.6764 0.7179 0.5977 0.6391 0.6806 0.7221 0.5977 0.6391 0.5810 0.3919 0.5164 0.5164 0.5579 0.6907 0.6740
V I V T C I T T V I C I V I T T V T T C
Experiment 0.0300 0.0300 0.0300 0.6320 0.6800 0.4420 0.6890 0.5200 0.6650 0.9700 0.8600 0.9800 1.0500 1.0130 0.8990 1.0370 0.1100 0.9330 1.1100 0.9600 1.0700 0.6100
28
SMILES [N][N] [N-]=[N+]=O [H]C([H])([H])[H] CCCCC CCCCCC CCCCCCC CCCCCCCC CCCCCCCCC CCCCCCCCCC CC(C)CCC CC(C)CCCC CC(C)CCCCC CC(C)CCCCCC CCC(C)CC CCCC(C)CC CCC(C)(C)C C1CC1 CC1CCCC1 C1CCCCC1 CC1CCCCC1 C1CC(C)C(C)CC1 CC(C1CCCCC1)( C)C C1C(C)C(C)CC(C )C1 ClCCl ClC(Cl)Cl ClC(Cl)C ClCCCl CC(Cl)(Cl)Cl ClC(Cl)CCl ClC(Cl)(Cl)CCl ClCC(F)(F)F BrCCC CC(Br)C FC(Br)C(F)(F)F FC(F)(F)C(Cl)Br FS(F)(F)(F)(F)F C=C Cl/C=C/Cl Cl/C=C\Cl Cl/C(Cl)=C(Cl)\Cl CC=C CCCCCCC=C CCCCCCCC=C
V I
ID 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Domain of applicability Eq.6 1 2 3 ** 0.0670 Y Y Y -0.2267 N N Y -0.1642 Y Y Y 0.5981 Y Y Y 0.6408 Y Y Y 0.6835 Y Y Y 0.7261 Y Y Y 0.7688 Y Y Y 0.8115 Y Y Y 0.6712 Y Y Y 0.7139 Y Y Y 0.7566 Y Y Y 0.7993 Y Y Y 0.6712 Y Y Y 0.7139 Y Y Y 0.6651 Y Y Y 0.4797 Y Y Y 0.6077 Y Y Y 0.6077 Y Y Y 0.6504 Y Y Y 0.7539 Y Y Y 0.8486 Y Y Y
I
3 I C T V I I T T V C I V C T C C T T T C V T
2 T T I V C C I C T V I V V V C V I I T C V I
Split 1 I* I T C V I V T C C C C I V I T T V V T I I
Y Y Y Y Y Y Y Y N N N N N Y N N N Y Y Y
C 71
T T V C C I T T T T C T T I
I I I I V V C C I C I I I T
C I C T I V T V C C T T I T
C I C
V V 86 I T 87 V I 88
I
T T 89
V
V C 90
T
T I
72 73 74 75 76 77 78 79 80 81 82 83 84 85
91
0.4873 0.2280 0.1032 0.1731 0.4298 -0.0527 -0.0100 0.0327 0.0754 0.1180 -0.1016 0.1058 0.1485 -0.0650 -0.0224 0.1394 0.0843 0.0416 0.0843 0.2771 0.0998
Y Y N Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
0.1400
-0.0655
-0.0574
-0.0103
Y
Y Y
0.3000
0.1847
0.3466
0.1389
Y
Y Y
0.0100 0.1400 0.1300 -0.1700 -0.1700 -0.0100 -0.1340 0.0030 0.1200 0.2770 0.4000 0.3990 0.4500 0.5540
-0.0060 0.0949 0.1580 0.0155 -0.0798 0.0985 0.0256 0.0671 -0.1152 0.1501 -0.0322 -0.0695 0.0590 0.0135
0.1034 0.0911 0.2337 0.0762 0.0099 0.1419 0.0918 0.1246 -0.0175 0.1903 0.0482 -0.0679 0.0407 -0.0023
-0.0339 0.1025 0.1055 0.0602 0.0021 0.1455 -0.0258 0.0169 -0.1052 0.1023 -0.0198 -0.0747 0.0319 0.0106
Y Y Y Y Y Y Y Y Y Y Y Y Y Y
Y Y Y Y Y Y Y Y Y Y Y Y Y Y
0.1980 0.2710 0.4100
0.1121 0.1536 0.1764
0.3010 0.3338 0.4076
0.1485 0.1912 0.2065
Y Y Y
Y Y Y Y Y Y
0.2200
0.1104
0.2228
0.1726
Y
Y Y
0.3800
0.2408
0.3162
0.2643
Y
Y Y
0.2600
0.1951
0.3667
0.2339
Y
Y Y
I
0.4402 0.2181 0.3243 -0.1186 0.6111 0.0105 0.0433 0.0762 0.1090 0.1418 -0.1492 0.0585 0.0914 -0.1611 -0.1283 0.1848 -0.0197 -0.0525 -0.0197 0.3527 0.0299
I
0.4237 0.1975 0.0752 0.1533 0.1932 -0.1113 -0.0698 -0.0283 0.0132 0.0547 -0.2064 0.0589 0.1004 -0.1816 -0.1401 0.1002 -0.0116 -0.0531 -0.0116 0.3041 0.0653
V T 70
0.9600 -0.1660 0.1050 -0.0200 0.6000 0.0200 -0.1240 -0.0820 -0.0230 0.2030 -0.1130 -0.1400 0.0380 0.1100 0.0700 -0.0100 0.2200 0.3600 0.1700 0.1190 0.1900
I
CCCCCCCCC=C C=CC=C ClC=C(Cl)Cl ClC=C(F)F S=C=S CO CCO CCCO CCCCO CCCCCO CC(C)O CC(C)CO CCC(C)CO CC(C)(C)O CCC(C)(C)O CCOCC CC(OCC)(C)C CC(OC)(C)C CCC(C)(C)OC COC(F)(F)C(Cl)Cl FC(F)OC(Cl)C(F)( F)F FC(F)OC(F)(F)C( F)Cl FCOC(C(F)(F)F)C (F)(F)F C1CO1 FC(F)(F)COC=C C=COC=C CC(C)=O CC(=O)CC CCCC(C)=O COC(C)=O CCOC(C)=O CC(OCCC)=O CCCCOC(C)=O CC(OCCCCC)=O CC(C)OC(C)=O CC(C)COC(=O)C CC(OCCC(C)C)= O C1=CC=CC=C1 CC1=CC=CC=C1 CC1=CC=CC=C1 C CC1=CC(C)=CC= C1 CC1=CC=C(C)C= C1 CCC1=CC=CC=C
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
I T I V I C T I I V C I V T T I I T T T V
I I T V T V T C T I C V V T C T C T C T T
I I I I I T I V I V C T C V V V I C C I T
ACCEPTED MANUSCRIPT Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y
Y Y Y Y Y Y Y Y Y Y Y Y Y Y
V
T I
I I T
T V 96 T V 97 T T 98
I
V I
99
I
I
I
100
I
I
C 101
C
T T 102
T
T T 103
T
C C 104
C
I
T
T T 106
I
T I
T
T T 108
C
C I
V
I
95
T 105
107
109
T 110
Y
Y Y
0.4300
0.2697
0.4479
0.3894
Y
Y Y
0.1600
0.1976
0.2051
0.2458
Y
Y Y
0.1700
0.3353
0.2666
0.4784
Y
Y Y
-0.2300 -0.4000 -0.2400
-0.0826 -0.3807 0.0619
-0.2229 -0.4932 0.1487
-0.0466 -0.7246 0.0749
Y Y Y
N Y N Y Y Y
-0.1700
-0.0578
-0.0410
0.0530
Y
Y Y
-0.2022
-0.0505
-0.1914
Y
Y Y
1.0100
0.7384
0.8743
0.7407
N
N Y
-0.0700
-0.0613
0.0552
0.0308
Y
N Y
1.3800
0.7011
0.7581
0.6857
N
N Y
-0.0500
0.0046
0.0290
0.0534
Y
Y Y
-0.0300
-0.3700
-0.0772
-0.1309
Y
N Y
0.9800
1.1219
1.2164
1.1778
N
N Y
-0.1000
-0.0242
-0.0994
-0.0714
Y
N Y
0.6400
1.0988
1.0967
1.1156
N
N Y
0.1880
-0.0039
0.0218
0.0410
Y
Y Y
0.0170
0.1206
0.1203
0.1691
Y
Y Y
-0.3010
C T 94
0.2306
C
0.3415
C C 93
0.2178
V
0.4500
V 92
I
C
1 C=CC1=CC=CC= C1 CC(C1=CC=CC= C1)(C)C CC1=CC(C)=C(C) C=C1 FC(F)(C1=CC=C( Cl)C=C1)F OCC#C C=CC#N FCCCN1C=CN=C 1N(=O)=O FCCCCCCCCN1 C(N(=O)=O)=NC =C1 OC1=CC4=C(C3C CC(C(C(CC2C3C C4)F)O)2C)C=C1 IC1=CC=C(N2CC N(CC2)CCCCCC) C=C1 OCC3=NC=C4CN =C(C2=C(N34)C= CC(Cl)=C2)C1=C( F)C=CC=C1 IC1=CC=C(N2CC N(CC2)C(CC)C)C =C1 FC1=C(C)N(C)N( C2=CC=CC=C2)C 1=O CC3=NC=C4C(N =C(C2=C(N34)C= CC(Cl)=C2)C1=C( F)C=CC=C1)O IC3=C(C=C(C=C3 )CN2CCCCC2)C N1CCCCC1 CC2=C(C(N(N2C) C1=CC=CC=C1)= O)I IC1=C(C=C(C=C1 )NC(CC)C)NC(C C)C O=C1C(CC)(C(N C(N1)=O)=O)CC CC O=C1C(CC)(C(N C(N1)=O)=O)CC CCCCC
I
T T 115
C
T V 116
T
C I
117
T
T I
118
T
T T 119
I
T V 120
T
I
I
T
I
T 122
I
V I
C
T C 125
C
V T 126
V 114
121
124
Y Y
-0.2220
-0.1284
-0.0767
-0.0870
Y
Y Y
0.2410
0.1621
0.1531
0.2117
Y
Y Y
0.0860
0.0376
0.0546
0.0837
Y
Y Y
0.0860
-0.0454
-0.0017
Y
Y Y
-0.6700
-0.9847
-0.7638
-0.9372
Y
N N
0.5143
0.4362
0.1687
Y
Y Y
0.2045
0.2604
0.1804
N
Y N
-0.2600
0.2045
0.2604
0.1804
N
Y N
-0.5900
-0.3734
-0.3418
-0.4655
Y
Y Y
-0.3800
-0.4197
-0.3271
-0.5121
N
Y Y
0.0000
0.3516
0.3234
0.3706
N
N N
0.0000
0.1520
0.1318
0.1326
Y
Y Y
0.0400
-0.2232
-0.1783
-0.1365
Y
Y Y
-0.1600
-0.0139
0.1108
0.0262
Y
Y Y
0.5480
I
Y
0.6750
-0.0111
T
0.1264
C C 113
0.0874
V
0.0791
V C 112
0.3640
V
O=C1C(CC)(C(N C(N1)=O)=O)CC CCCC O=C1C(CC)(C(N C(N1)=O)=O)C O=C1C(CC)(C(N C(N1)=O)=O)CC CCCCCC O=C1C(CC)(C(N C(N1)=O)=O)CC CCC O=C1C(CC)(C(N C(N1)=O)=O)CC C O=C2N1CCCC(O) C1=NC(C)=C2CC N3CCC(C4=NOC 5=C4C=CC(F)=C5 )CC3 FCCC2=CC(N)=C (C=C2)SC1=CC= CC=C1N(C)C CC1N(C3=C(C)C( C)=NC(NC4=CC= C(C=C4)F)=N3)C CC2=CC=CC=C1 2 CC1N(C3=C(C)C( C)=NC(NC4=CC= C(F)C=C4)=N3)C CC2=CC=CC=C1 2 CC2=CN(C([NH] C2=O)=O)C1CC( C(O1)CO)F [O]/[N+](C(C)(C)C) =C/C1=CC=[N+]( [O-])C=C1 CC3=NN=C4CN= C(C2=CC(Cl)=CC =C2N34)C1=CC= CC=C1 CN(C2=C(N(N(C2 =O)C1=CC=CC= C1)C)C)C CCC(C(NC(NC1= O)=O)=O)1CCC( C)C CC2=CC(N(N2C) C1=CC=CC=C1)=
T T 111
T
C I
129
T
T I
130
T
I
I
T I
V
T C 133
I
T T 134
T
T T 135
C
T I
136
T
I
I
137
T
I
I
138
V
V T 139
T
V I
140
V
I
142
T 131
-0.1691
-0.2320
Y
Y Y
-0.1370
-0.3934
-0.2264
-0.2950
Y
Y Y
0.8200
0.7054
0.6980
0.5327
N
Y Y
-0.4300
-0.3677
-0.1012
-0.1958
Y
Y N
-0.3000
0.1101
0.2579
0.1910
Y
Y Y
0.8525
0.8350
1.0461
Y
Y Y
-0.1200
-0.5467
-0.4352
-0.4391
Y
Y Y
-0.3014
-0.3540
-0.2444
N
N N
-0.0100
-0.3263
-0.0684
-0.1532
Y
Y N
-0.0900
-0.0799
-0.2462
-0.1491
Y
Y N
0.0700
-0.2954
-0.3186
-0.2443
N
N N
0.4200
0.1723
0.2629
0.1809
Y
Y Y
0.0300
-0.2584
-0.1939
-0.3708
Y
Y Y
0.0000
0.1789
0.1809
0.1815
Y
Y Y
0.8450
-0.5500
I
132
-0.2257
T
-0.6990
128
V I
I
O CC(NCC(COC1= CC=C(C=C1)CC( N)=O)O)C CCC(C(NC(NC1= O)=O)=O)1CC CN4CC3C1=CC= CC(C)=C1OC2=C C=CC=C2C(CC4) 3O CCN3CN(C4=CC =C(Br)C=C4)C2( C3=O)CCN(CC2) CCCC(C1=CC=C( C=C1)F)=O CNCCC1=NC=CC =C1 OC(C2=CC=CC= C2)(C3C4CC(C=C 4)C3)CCN1CCCC C1 CC(C1=CC=C(O) C=C1)(C2=CC=C( C=C2)O)C BrC1=CC=C(N4C 3(C(N(C)C4)=O)C CN(CC3)CCCC(C 2=CC=C(F)C=C2) =O)C=C1 CCCN3CN(C4=C C=C(Br)C=C4)C2 (C3=O)CCN(CC2) CCCC(C1=CC=C( C=C1)F)=O CC(C)(OC(C1=C( C2CCCN2C(C4= C3C=CC(Br)=C4) =O)N3C=N1)=O) C BrC1=CC=C(N4C 3(C(NC4)=O)CCN (CC3)CCCC(C2= CC=C(C=C2)F)= O)C=C1 CCCCOC(C1=CC =C(C=C1)N)=O CN1C=NC(N(C(N 2C)=O)C)=C1C2= O NC(N2C1=C(C=C C3=C2C=CC=C3)
I
V 145
I
C I
148
T
I
I
149
T
I
I
150
T
I
T 151
T
I
I
152
I
I
I
154
I
T I
155
V
C T 156
I
I
V
T C 158
V
I
T 159
C
I
V 160
T
V I
T 157
162
N
Y Y
-0.4000
-0.3831
-0.3233
-0.3627
Y
Y Y
1.0600
1.0490
0.5860
0.5727
Y
Y Y
0.3500
0.0094
0.2012
-0.0511
Y
Y Y
0.1100
0.1123
T
144
-0.0844
0.1615
0.2431
N
N N
0.6000
0.0858
0.0287
-0.0570
Y
Y Y
-0.2200
-0.2283
-0.4195
-0.2080
N
N N
-0.3480
0.0317
0.0960
0.0108
Y
Y Y
-0.1500
-0.5297
-0.4720
-0.5715
Y
Y Y
1.0000
0.5127
0.4526
0.4994
Y
Y Y
1.0600
0.9059
0.7403
0.9331
Y
Y Y
0.3600
-0.1195
0.0055
-0.0669
Y
Y Y
0.5000
0.1581
0.2369
0.2144
Y
Y Y
0.5900
0.8996
0.7175
0.6299
Y
N N
0.2750
0.1959
0.3707
0.1504
Y
Y Y
-1.3000
-1.0316
-0.7147
-1.0611
Y
Y Y
I
0.0244
I
-0.1010
T
-0.5200
T 143
I
I
C=CC=C1)=O ClCCNC(N(N=O) CCCl)=O CC(C)(NCC(COC 1=CC=CC(N2)=C 1CCC2=O)O)C CN(CCCN2C1=C( SC3=C2C=C(Cl)C =C3)C=CC=C1)C CN2C(CC(N(C3= C2C=CC(Cl)=C3) C1=CC=CC=C1)= O)=O ClC1=CC=CC(Cl) =C1/N=C2NCCN\ 2 COC(C1C(OC(C3 =CC=CC=C3)=O) CC2CCC1N2C)= O OC1C4C2(CCN5 C)C(C5CC3=C2C( O4)=C(OC)C=C3) C=C1 CN1C(C2=CN=C C=C2)CCC1=O OC1=CC=C(C3=C OC2=C(C3=O)C= CC(O)=C2)C=C1 CNCCCN2C1=C( CCC3=C2C=CC= C3)C=CC=C1 NCCCN1C(C=CC =C3)=C3CCC2=C 1C=CC=C2 ClC3=CC(N(C(C2 )=O)C1=CC=CC= C1)=C(C=C3)NC2 =O ClC(C=CC=C1N2 )=C1C(C3=CC=C C=C3)=NCC2=O CNCCCN1C(C=C C=C3)=C3SC2=C 1C=CC=C2 CN2C(CN=C(C3= C2C=CC(Cl)=C3) C1=CC=CC=C1)= O OCC1CCC(N2C= NC3=C2N=C[NH]
T 165
I
V V 166
I
C T 167
T
C C 168
T
T T 170
T
I
T 171
I
I
I
I
T T 173
T
T T 174
T
C V 175
I
T T 176
T
T V 177
I
I
172
T 178
Y
Y Y
-0.9250
-0.5013
-0.5532
-0.7145
N
N N
0.6400
0.2159
0.2150
0.2905
Y
Y Y
-0.0460
0.0623
-0.0081
Y
Y Y
0.2700
0.0893
0.1972
0.0956
Y
Y Y
0.5780
0.5171
0.4728
0.6237
Y
Y Y
-0.0950
-0.4324
-0.4866
-0.3906
Y
Y Y
-0.0825
0.0397
-0.1494
Y
Y Y
0.0600
-0.2648
-0.2535
-0.2810
Y
Y Y
-0.1400
-0.0869
0.0181
-0.0364
Y
Y Y
-0.0100
-0.4522
-0.2344
-0.4820
Y
Y Y
0.3600
0.4735
0.3878
0.4066
Y
Y Y
-1.3980
-0.9507
-1.2482
-1.4100
Y
N N
0.0400
-0.0308
-0.0676
-0.0357
Y
Y Y
0.0730
-0.1506
-0.1633
-0.1114
Y
Y Y
-0.2900
I
0.4006
0.1778
T
0.3088
T 164
0.4149
I
1.2600
I
T T 163
I
C3=O)O1 CN(CCOC(C2=C C=CC=C2)C1=CC =CC=C1)C ClC5=CC=C4N(C ([NH]C4=C5)=O) C3CCN(CC3)CCC N1C([NH]C2=CC =CC=C12)=O CN(CCOC(C1=C C=CC=C1)(C2=C C=CC=N2)C)C CCOC1=CC=CC= C1C(N)=O CCOC(C1=CC=C( N)C=C1)=O CCC(N(C3=CC=C C=C3)C1CCN(CC C2=CC=CC=C2)C C1)=O FC1=CC3=C(N2C =NC(C(OC(C)F)= O)=C2CN(C3=O) C)C=C1 CCOC(C1=C(CN( C(C3=C2C=CC(F) =C3)=O)C)N2C= N1)=O CN2C(CN=C(C3= C2C=CC(N(=O)= O)=C3)C1=C(F)C =CC=C1)=O CCOC(C1=C(CN( C(C3=C2C=CC(F) =C3)=O)CCF)N2 C=N1)=O O=N(C1=NC=CN 1CC(COCF)O)=O CNCCC(C2=CC= CC=C2)OC1=CC= C(C(F)(F)F)C=C1 CC(C(CCC(C1CC C(C2C(C=C5C3C C(CCC(CCC45C) 3C)(C(O)=O)C)= O)4C)2C)O)1C OC(CC)(CC(N)=O )C1=CC=CC=C1 CN2C(NC(C(C1= CCCCC1)(C2=O) C)=O)=O
I
T 181
I
I
I
I
I
V 183
V
C C 184
I
T I
I
I
T 186
I
I
I
C
C V 189
C
I
I
T I
C
T C 193
I
T I
194
I
T I
195
182
0.8709
0.8343
0.6232
Y
Y Y
-0.2540
-0.2432
-0.0908
-0.2321
N
N N
1.1340
0.7606
0.9888
0.9489
Y
N Y
-0.1800
-0.1671
-0.0635
Y
Y Y
0.8300
0.4693
0.5279
0.5451
Y
Y Y
-0.7968
-0.8679
-0.7499
Y
N N
-0.2457
-0.3646
-0.3094
Y
Y Y
0.4800
0.2081
0.3525
0.0956
Y
Y Y
0.3400
0.2679
0.1479
0.2390
Y
Y Y
0.4400
0.1108
0.4482
0.1778
Y
Y Y
-1.0600
-0.5593
-0.8450
-0.8594
Y
N Y
-1.6000
-1.5478
-1.4244
-1.5294
Y
Y Y
0.6300
0.6220
0.8123
0.5807
Y
Y Y
0.9000
0.3526
0.4617
0.5026
Y
Y Y
-0.7450
-0.1025
-1.2600
C 190
188
185
0.3900
T
OCCOCCN1CCN( C(C3=CC=C(C=C 3)Cl)C2=CC=CC= C2)CC1 IC1=CC(C(CCCN 4CCC2(CC4)C(N( CN2C3=CC=CC= C3)C)=O)=O)=CS 1 I/C=C/CN1CCC(C OC2=CC=C(C#N) C=C2)CC1 CC(CC1=CC=C(C (C(O)=O)C)C=C1) C CN(CCCN2C1=C( CCC3=C2C=CC= C3)C=CC=C1)C CC(C)(NC(C1CN( CC5=CN=CC=C5) CCN1CC(CC(C(N C3C(CC4=CC=C C=C34)O)=O)CC2 =CC=CC=C2)O)= O)C COC1=CC3=C(N( C(C)=C3CC(O)=O )C(C2=CC=C(C= C2)Cl)=O)C=C1 NC1=NC(N)=C(C 2=C(Cl)C(Cl)=CC =C2)N=N1 CCN(CC(NC1=C( C=CC=C1C)C)=O )CC OC3N=C(C2=CC( Cl)=CC=C2NC3= O)C1=CC=CC=C1 Cl CN(CC3=CC=C(O 3)CSCC[NH]C1= NC(C(CC2=CN=C (C)C=C2)=C[NH] 1)=O)C OCC(C(C(C(CO) O)O)O)O OC(C2=CC(C(F)( F)F)=NC3=C2C= CC=C3C(F)(F)F)C 1CCCCN1 CNC(CC1=CC=C
180
I
I
T
192
198
T
I
201
I
T C 202
I
I
V 203
I
I
I
I
V C 205
I
T I
206
I
I
I
207
C
I
I
208
T
I
I
209
I
T I
210
I
T I
211
I
I
204
I
V 212
N Y
0.9900
0.6266
0.6240
0.6954
Y
Y Y
0.4260
0.0155
0.5179
0.2142
Y
N Y
-0.9210
-0.9115
-0.6346
Y
Y Y
0.5300
0.4340
0.5133
0.5028
Y
Y Y
-0.0545
-0.0127
-0.1887
Y
Y Y
-0.4879
-0.4659
-0.5501
Y
N N
0.3178
0.3619
0.4764
Y
Y Y
-0.6580
-0.3352
-0.5224
-0.3933
Y
Y Y
0.0000
-0.2744
-0.2273
-0.3231
Y
Y Y
0.2360
0.3383
0.4126
0.3627
Y
Y Y
0.7500
0.7793
0.8311
0.8835
Y
Y Y
0.7800
0.2931
0.6159
0.2670
Y
N Y
0.1600
0.3375
0.5380
0.4259
Y
Y Y
0.3900
0.0714
0.3136
0.0043
Y
Y Y
-0.0100
T I
Y
-0.6480
0.4600
-0.6542
I
-0.2867
T 197
-0.1751
I
-0.3248
C
-0.0600
T 196
I
T
C=C1)C CCC#CC(C(C(NC (N(C1=O)C)=O)= O)1CC=C)C CN1CCN3C(C2= CC=CC=C2CC4= CC=CC=C34)C1 CC3=NC=C4CN= C(C2=C(N34)C=C C(Cl)=C2)C1=C(F )C=CC=C1 CON1C=C(C(C3= C1C=C2OCOC2= C3)=O)C(O)=O CN1CCN3C(C2= CC=CC=C2CC4= CC=CN=C34)C1 OC(CN1C=CN=C 1N(=O)=O)COCF OC4=C(O5)C2=C( C=C4)CC3C1C=C C(O)C5C12CCN3 C CN3CN(C4=CC= CC=C4)C2(C3=O) CCN(CC2)CCCC( C1=CC=C(C=C1) F)=O CCN1C=C(C(C2= C1N=C(C)C=C2)= O)C(O)=O CC1=CC=NC(N(C 4=C3C=CC=N4)C 2CC2)=C1NC3=O CN1CCCC1C2=C C=CN=C2 CSC4=CC2=C(C= C4)SC1=CC=CC= C1N2CCC3CCCN C3 CN1CCN(C3=NC 2=CC=CC=C2NC 4=C3C=C(C)S4)C C1 O=C1N(CCCCN2 CCN(C3=CC(C(F) (F)F)=CC=N3)CC 2)CCC1 ClC4=CC1=C(C= C4)OC3=C(C=CC =C3)C2C1CNC2
I
I
215
T
I
I
216
C
V C 218
I
I
T 219
T
I
I
C
I
T 222
T
I
I
T
V V 226
I
I
T
C I
T
I
I
T T 231
T
I
C 232
I
I
T 234
T
T I
220
224
T 227
228
T 229
235
0.5844
Y
Y Y
0.0000
0.1498
-0.0043
0.2539
Y
Y Y
1.0300
0.5250
0.6570
0.5264
Y
Y Y
0.5800
0.2185
0.1931
Y
Y Y
-0.3400
-0.3066
-0.1677
-0.2034
Y
Y Y
-0.2692
-0.1395
-0.3220
Y
Y Y
0.5669
0.5904
0.5788
Y
N N
0.1400
-0.2232
-0.1783
-0.1365
Y
Y Y
-0.0300
-0.2893
-0.2754
-0.2416
N
N N
0.4400
0.2767
0.4464
0.6098
Y
Y Y
-0.5200
0.0652
0.0961
0.0861
Y
Y Y
0.0500
0.1156
0.2081
0.1177
N
Y Y
-0.1000
0.4520
0.3730
0.4560
Y
Y Y
0.0800
0.2681
0.2873
0.2764
Y
Y Y
-0.1350
-0.2265
-0.3341
-0.1580
Y
Y Y
0.4800
0.1816
0.2508
0.2473
Y
Y Y
-1.2600
0.0482
0.0937
0.0067
Y
Y Y
0.0600 0.5440
I
0.4147
0.2579
214
0.5769
T I
0.5200
T
OC12C4=C(C=CC =C4)OC3=C(C=C C=C3C)C1CNCC 2 C=CCC(N)C(C=C C=C3)=C3C1=NO C2=C1C=CC=C2 CN1CC3C(C2=C( OC4=C3C=CC=C 4)C=CC(Cl)=C2)C 1 OC3N=C(C2=CC( Cl)=CC=C2NC3= O)C1=CC=CC=C1 CC(NC1=CC=C(O )C=C1)=O O=C1N(C(NC2=C 1N(C)C=N2)=O)C OC1=CC2=C(CC3 C(C)C(C)2CCN3C /C=C(C)\C)C=C1 CCCC(C(C(NC(N C1=O)=O)=O)1C C)C C12=NN=NN1CC CCC2 N1(C2(C3=CC=C C=C3)CCCCC2)C CCCC1 CCCCC1C(N(C3= CC=CC=C3)N(C2 =CC=CC=C2)C1= O)=O [O]/[N+](C(C)(C)C) =C/C1=CC=CC=C 1 O=C1NC(C(C2=C C=CC=C2)(C3=C C=CC=C3)N1)=O CNC(OC3=CC=C 2N(C1N(CCC(C2 =C3)1C)C)C)=O CC([NH]CC(COC 1=CC=CC2=C1C =C[NH]2)O)C CCC(N(C(C2=CC 1=CC=CC=C1C(C 3=C(Cl)C=CC=C3 )=N2)=O)C)C OC(C1=CC=C(C2
213
C I
I
T
I
I
T T 241
T
T T 242
I
I
C
T C 244
T
T T 245
T
I
I
246
I
T I
247
I
I
I
248
V
I
C 249
T T
T I 250 I T 251
I
T T 252
I
I
I
243
0.1103
Y
Y Y
0.9500
0.9413
1.0121
0.9483
Y
N N
0.6970
0.0981
0.0007
0.0347
Y
Y Y
0.6400
-0.0680
-0.1030
-0.1276
Y
Y Y
0.5500
0.1308
0.2300
0.1383
Y
Y Y
0.2300
0.4117
0.5274
0.5147
Y
Y Y
0.3009
0.2364
0.2228
Y
Y Y
0.1000
0.3246
0.2697
0.1925
Y
Y Y
-1.3572
-0.6809
-0.5615
N
Y Y
-0.0200
-0.3113
-0.2664
-0.1549
Y
N N
0.6100
0.0855
-0.0171
0.0068
Y
Y Y
0.2500
0.2948
0.2365
0.3436
Y
Y Y
-0.2750 -1.3000
0.1206 -0.7148
0.1864 -0.6273
-0.0437 -0.5675
Y Y
Y Y Y Y
-1.1610
-0.2383
-0.1383
-0.4234
Y
Y Y
-1.0300
-0.9013
-1.0265
-0.9605
Y
N N
0.5000
-1.4230
I
239
0.1828
T T 237
0.1246
T
0.0500
C T 236
T
=CC=CC=C2)C= C1)=O CCN(CCOC(C1= CC=C(C=C1)N)= O)CC CN(CCCN2C1=C C=CC=C1SC3=C C=CC=C23)C CC(C1=CC=CC(C (C)C)=C1O)C CC(NCC(COC2= C1C=CC=CC1=C C=C2)O)C CCCOC(C1=CC= C(N)C=C1)=O C13=CC=CC4=C1 C2=C(C=C4)C=C C=C2C=C3 COC2=CC=C(C= C2)CN(C1=CC=C C=N1)CCN(C)C OC(C3C(CC4C=C )CCC4C3)C2=C1 C=C(C=CC1=NC =C2)OC CCC1=CC=CC2= C1[NH]C3=C2CC OC(CC([OH])=O) 3CC CC(N=C1CCCCN 12)=C(CCN3CCC (C4=NOC5=C4C= CC(F)=C5)CC3)C 2=O COC1=C(OC2CC CC2)C=C(C3CNC (C3)=O)C=C1 CCCN(CCC(C=C C=C1N2)=C1CC2 =O)CCC NC(SCCF)=N CC(C)(NCC(C1= CC=C(C(CO)=C1) O)O)C OC(C1=CC=CC= C1O)=O CC(C)(NC(C2CC1 CCCCC1CN2CC( C(NC(C(NC(C4= NC5=C(C=CC=C5 )C=C4)=O)CC(N)
T 254
I
259
V
I
I
260
C
V T 261
V
T T 262
C
V V 263
I
T V 264
C
I
T
C T 266
V
T I
267
V
I
268
265
-0.6600
-0.7818
-0.6670
-0.8220
Y
Y Y
-0.1200
-0.4769
-0.4697
-0.5570
Y
Y Y
-0.1800
-0.1637
-0.3891
-0.2312
N
N N
-0.0299
-0.0948
-0.0627
N
N Y
-1.0412
-0.7545
-1.2422
Y
N N
-1.3303
-1.2857
-1.5546
Y
N N
-1.5400
-1.4534
-1.0667
-1.6299
Y
N Y
-1.1200
-1.1613
-1.4778
-1.1212
Y
N Y
-0.7300
-0.6178
-0.9112
-0.9447
Y
N Y
-0.2700
-0.2717
-0.3698
-0.3362
N
Y Y
-0.2800
-0.2879
-0.2401
-0.2328
Y
Y Y
-0.4600
-0.2256
-0.0416
-0.1765
Y
Y Y
-0.0400 -1.1500
-1.5700
I
Y N
I
N
T
-0.5587
T T 258
-0.7251
I
-0.7031
C C 257
-0.6700
C
V T 256
T
=O)=O)CC3=CC= CC=C3)O)=O)C BrC1=CC=CN=C1 CSCC[NH]C2=C( N(=O)=O)C=C[N H]2 O=N(C1=C([NH] CCSCC2=NC=CC =C2)[NH]C=C1)= O O=N(C1=C([NH] CCSCC3=NC=CC =C3)[NH]C=C1C C2=CC=CC=C2)= O [NH2]/C([NH2])= N\C1=NC(C2=CC =CC=C2)=CS1 CC1=CSC(/N=C([ NH2])/[NH2])=N1 N/C(N)=N\C1=NC (C2=CC=CC(N)= C2)=CS1 CC([NH]C1=CC= CC(C2=CSC(/N= C(N)\N)=N2)=C1) =O C[NH]/C([NH]C1 =CC(C2=CSC(/N= C(N)/N)=N2)=CC =C1)=N/C#N CN(CC2=CC=C(O 2)CSCC[NH]C1= C(N(=O)=O)C=C[ NH]1)C CN(CC3=CC=C(O 3)CSCC[NH]C1= C(N(=O)=O)C(CC 2=CC=CC=C2)=C [NH]1)C CN(CC1=CC=C(C 2=CC([NH]C3=C( N(=O)=O)C=C[N H]3)=CC=C2)O1) C CN(CC1=CC=NC( C2=CC([NH]C3= C(N(=O)=O)C=C[ NH]3)=CC=C2)= C1)C CC([NH]CCCOC1
I
V
V V 271
I
V T 272
T
I
T
T I
I C
I T 275 T I 276
V
T V 277
I
T I
T
I
I
T I
I
C T 281
V
T C 282
C
T C 283
I
270
V 273
0.1222
Y
Y Y
-0.0200
0.0109
0.0464
-0.0478
Y
Y Y
0.6900
0.5001
0.4456
0.6541
Y
Y Y
0.4400
0.2183
0.0357
0.4313
Y
Y Y
0.2200
0.1421
0.4539
0.1318
Y
Y Y
-0.4300
-0.2274
-0.2416
-0.1331
Y
Y Y
-0.3101 0.1936
-0.1476 0.1814
-0.3461 0.2939
Y Y
Y Y Y Y
-0.4820
-0.2385
-0.3141
-0.4049
Y
Y Y
0.1800
0.1227
0.2087
0.2272
Y
Y Y
0.0950
0.1310
0.2688
0.0928
Y
N Y
1.0000
0.3713
0.3892
0.5009
N
Y Y
-0.3000
-0.1389
-0.2753
-0.1117
Y
Y Y
-0.3450
-0.3435
-0.1395
-0.3770
Y
Y Y
-0.1470
-0.2873
-0.2662
-0.3198
Y
Y Y
-0.6020 0.2600
278
274
0.1425
I
0.1454
V
-0.2400
T T 269
T
=CC(CN2CCCCC 2)=CC=C1)=O O=C(C3=CC=CC =C3)[NH]CCCOC 1=CC=CC(CN2C CCCC2)=C1 OCCCOC1=CC(C N2CCCCC2)=CC =C1 C1(NCCCOC2=C C(CN3CCCCC3)= CC=C2)=CC=CC =N1 C1(NCCCOC2=C C=CC(CN3CCCC C3)=C2)=NC=CS 1 C(NCCCOC2=CC (CN4CCCCC4)=C C=C2)3=NC1=C( O3)C=CC=C1 CCCN(CCC(C=C C(O)=C1N2)=C1C C2=O)CCC NC(SC)=N FC1=CC=C(C(CC CN2CCC3(N(C4= CC=CC=C4)CNC 3=O)CC2)=O)C= C1 CC2=CN(C([NH] C2=O)=O)C1OC( C=C1)CO CN1CCCCC1CC N3C2=C(SC4=CC =C(S(C)(=O)=O)C =C34)C=CC=C2 CN(CC2=C1C=C C=CC1=CC=C2)C /C=C/C#CC(C)(C) C CC(C)(OC(CCCC 1=CC=C(N(CCCl) CCCl)C=C1)=O)C CN1C=NC(N(C([ NH]2)=O)C)=C1C 2=O CN2C(N(C1=C(C 2=O)NC=N1)C)= O CCCC(C(C(NC(N
T 279
280
I
I
C 290
V
C I
291
I
T I
293
T
I
I
294
I
V I
295
I
I
I
296
T
I
I
297
C T
V V 299 I I 300
T T
T T 301 T T 302
C
I
V
C T 304
289
C 303
N Y
0.2830
0.4631
0.3994
0.4431
Y
N N
0.4000
0.3926
0.1452
0.3246
N
N Y
-0.8200
-0.6138
-0.7376
-0.7840
Y
N Y
-0.0826
-0.0229
-0.0117
Y
Y Y
0.2069
0.5778
0.2826
N
N N
1.0707
1.1393
0.7612
Y
Y Y
0.8900
0.6497
0.9401
0.6338
Y
N N
-1.1550
-0.3057
-0.2434
-0.2122
Y
Y Y
-0.4690
-0.0409
-0.1093
0.0260
N
N Y
-0.0400 -0.0600
-0.2397 0.0977
-0.0981 0.3854
-0.1593 0.2641
Y Y
Y Y Y Y
-0.4200 -0.7200
-0.2504 -0.8285
-0.0744 -0.8745
-0.1937 -0.9819
Y N
Y N N Y
0.1400
0.0155
0.2670
0.2144
Y
Y Y
0.5000
0.2142
0.3303
0.2511
Y
Y Y
0.2800
0.5700
I
Y
I
0.0446
T
0.1218
T T 286
0.0780
T
-0.1600
1.4400
T 285
I
I
C1=O)=S)=O)1CC )C S=C(N2CCC(C3= CNC=N3)CC2)NC 1CCCCC1 CSC4=CC=C3SC1 =C(N(C3=C4)CC C2CCCCN2C)C= CC=C1 CC3CC1=C(C4CC C2(C(C34)CCC(C #C)2O)C)CCC(C1 )=O CN/C(NCCSCC1= CSC(/N=C(N)/N)= N1)=N/C#N CC(NCC(COC1= CC(C)=CC=C1)O) C CC3=NN=C4CN= C(C2CC(CCC2N3 4)Cl)C1=CC=CC= C1Cl CN1CCN(CCCN3 C2=C(SC4=C3C( C(F)(F)F)=CC=C4 )C=CC=C2)CC1 CN(CC/C=C2\C1= C(CCC3=C2C=C C=C3)C=CC=C1) C CCCC(C(O)=O)C CC COC2=CC=C(C= C2OC)CCN(CCC C(C(C)C)(C1=CC( OC)=C(OC)C=C1) C#N)C O CN(CCC1=CC=C C=N1)C NCCC1=NC=CS1 CC2=CN(C(NC2= O)=O)C1CC(C(O1 )CO)N=N#N N1(CC2=CC=CC( OCCCNC4=NC3= C(S4)C=CC=C3)= C2)CCCCC1 ClC3=CC=C2NC( CN=C(C2=C3)C1
T V 308
T
I
T 309
T
I
I
I
T T 312
V
C I
313
V
I
314
T
T T 315
V
I
T 316
T
I
T 317
I
T V 320
I
T I
I
311
321
Y
Y Y
-0.6700
-0.6341
-0.4257
-0.5771
N
Y N
-0.6600
-0.3398
-0.4337
-0.3910
Y
Y Y
-0.1200
-0.1384
-0.3104
-0.1140
Y
Y Y
-0.1800
-0.7678
-0.6203
-0.6555
Y
N N
-0.9790
-1.0932
-0.8303
Y
N Y
-1.5400
-1.2423
-1.4554
-1.5356
Y
N N
-0.5370
-0.6417
-0.7266
Y
Y Y
-0.7300
-0.3356
-0.5184
-0.4495
Y
Y Y
-0.2700
-0.4449
-0.6149
-0.4275
Y
Y Y
-0.2800
-0.4611
-0.4851
-0.3241
Y
Y Y
-0.4600
0.2161
0.1102
0.0985
Y
Y Y
0.6900
0.2578
0.2711
0.3884
Y
Y Y
0.4400
0.1438
0.1235
0.1377
Y
Y Y
-1.5700
I
-0.7901
T T 307
306
-0.6008
I
I
-0.9213
I
-1.1700
-1.1200
T
T T 305
T
=CC=CC=C1)=O CN(C1=NC=CC(C 2=NNC(N)=N2)= C1)C BrC1=CC=CN=C1 CSCCNC2=C(C= CN2)N(=O)=O O=N(C(C=CN2)= C2NCCSCC1=NC =CC=C1)=O O=N(C(C(CC3=C C=CC=C3)=CN2) =C2NCCSCC1=N C=CC=C1)=O N/C(N)=N/C1=NC (C2=CC=CC=C2) =CS1 N/C(N)=N/C1=NC (C2=CC(NC(C)=O )=CC=C2)=CS1 N/C(N)=N\C1=NC (C2=CC(N/C(NC) =N/C#N)=CC=C2) =CS1 CN(CC1=CC=C(S CCNC2=C(C=CN 2)N(=O)=O)O1)C CN(CC1=CC=C(S CCNC2=C(C(CC3 =CC=CC=C3)=C N2)N(=O)=O)O1) C CN(CC1=CC=C(C 2=CC=CC(NC3=C (C=CN3)N(=O)=O )=C2)O1)C CN(CC1=CC(C2= CC(NC3=C(C=CN 3)N(=O)=O)=CC= C2)=NC=C1)C O=C(C)NCCCOC 1=CC(CN2CCCC C2)=CC=C1 N1(CC2=CC=CC( OCCCNC3=NC= CC=C3)=C2)CCC CC1 N1(CC2=CC=CC( OCCCNC3=NC= CS3)=C2)CCCCC 1
I
T 325
T
I
C 326
I
I
C 327
I
C I
328
0.2414
0.5764
0.1717
Y
Y Y
0.3000
0.3229
0.0803
0.2988
Y
Y Y
-0.3400
-0.0011
0.1580
-0.0103
Y
Y Y
-0.3000
-0.4750
-0.5072
Y
Y Y
I
0.2200
-0.5039
C V 324
N1(CC2=CC(OCC CNC4=NC3=C(O 4)C=CC=C3)=CC =C2)CCCCC1 CCC(NC(C2=C(C (C4=CC=CC=C4) =NC3=C2C=CC= C3)C)=O)C1=CC= CC=C1 NC(N3C1=CC=C C=C1C2CC2C4= CC=CC=C34)=O O=C3C1=C(N2C= NC(C4=NOC(C(C )C)=N4)=C2CN3C )C=CC=C1Cl O=C3C1=C(N2C= NC(C4=NOC(C(C )(O)C)=N4)=C2C N3C)C=CC=C1Cl O=C3C1=C(N2C= NC(C4=NOC(C(C )(O)CO)=N4)=C2 CN3C)C=CC=C1 Cl
-1.3400
I
-1.8200
-0.9291
-0.8830
-0.9288
Y
Y Y
-1.1428
-1.0423
-1.1795
Y
Y Y
T = training set; I = invisible training set; C = calibration set; V = validation set.
Y = inside the domain of applicability; N = outside the domain of applicability.
*)
322
T I
T
**)
Table 3 The statistical characteristics of QSAR models for logBB (n is the number of compounds in a set, r2 is the determination coefficient, RMSE is the root-mean-square error). The two r2/RMSE column pairs, (a) and (b), refer to the descriptors DCW1(T*,N*) (Eq. 1) and DCW2(T*,N*) (Eq. 2); which pair belongs to which descriptor cannot be recovered from this copy.

Split  Method                   Set                 n     r2 (a)   RMSE (a)  r2 (b)   RMSE (b)
1      Traditional scheme       Training            205   0.7587   0.299     0.7265   0.318
                                Calibration         43    0.8569   0.262     0.8350   0.261
                                Validation          43    0.7324   0.291     0.6746   0.325
       Balance of correlations  Training            101   0.6989   0.310     0.6767   0.322
                                Invisible training  104   0.7563   0.353     0.7085   0.365
                                Calibration         43    0.9143   0.224     0.8753   0.229
                                Validation          43    0.8003*  0.272     0.7297   0.298
2      Traditional scheme       Training            210   0.7338   0.310     0.7066   0.326
                                Calibration         41    0.7903   0.239     0.8403   0.222
                                Validation          40    0.8619   0.268     0.7542   0.339
       Balance of correlations  Training            103   0.7080   0.336     0.6803   0.351
                                Invisible training  107   0.6783   0.335     0.6572   0.344
                                Calibration         41    0.8599   0.216     0.8620   0.220
                                Validation          40    0.8948   0.268     0.8612   0.299
3      Traditional scheme       Training            209   0.7511   0.306     0.7194   0.324
                                Calibration         41    0.7128   0.337     0.6591   0.398
                                Validation          41    0.7796   0.278     0.7222   0.311
       Balance of correlations  Training            104   0.7054   0.324     0.6703   0.343
                                Invisible training  105   0.7207   0.348     0.6884   0.371
                                Calibration         41    0.9296   0.190     0.8775   0.238
                                Validation          41    0.8964   0.175     0.7988   0.240

*) Models with the highest predictive potential (maximal correlation coefficient) are indicated in bold.
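The two statistics reported in Table 3 can be illustrated with a minimal Python sketch. It assumes that r2 is the squared Pearson correlation between experimental and calculated logBB and that RMSE is averaged over all n compounds of a set; the function name `r2_and_rmse` and the sample values are hypothetical and are not taken from Table 2.

```python
import math


def r2_and_rmse(observed, predicted):
    """Return (squared Pearson correlation, root-mean-square error)."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    # Covariance and variances (unnormalized; the factor 1/n cancels in r2).
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return r2, rmse


# Hypothetical logBB values, for illustration only.
observed = [0.5, -0.3, 1.0, -1.2]
predicted = [0.4, -0.1, 0.8, -1.0]
r2, rmse = r2_and_rmse(observed, predicted)
print(f"r2 = {r2:.4f}, RMSE = {rmse:.3f}")
```

Applying the same function separately to the training, invisible training, calibration, and validation compounds of a split would give one row of the table per set.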
Table 4 Comparison of logBB models suggested in the literature (n is the number of compounds in a set, r2 is the determination coefficient). The columns are listed one per line, in the order in which their entries appear.
No.: 1; 2; 3; 4; 5; 6; 7; 8
References: (Garg & Verma, 2006); (Konovalov et al., 2007); (Crivori et al., 2000); (Iyer et al., 2002); (Ooms et al., 2002); (Chena et al., 2009); (Hou & Xu, 2002); In this work
ntraining: 21; 56; 79; 150; 260; 120; 291; 250
r2training: 0.88; 0.845; 0.78; 0.69; 0.86; 0.672; q2=0.766*; 0.740
nvalidation: 63; 21; 41
r2validation: 0.58; 0.518; 0.896
Comment**: ANN; MLR, kNN; 3D-QSAR; MI-QSAR; 3D-QSAR; VPCP; VPCP; SMILES
*) q2 is the cross-validated correlation coefficient.
**) ANN = artificial neural networks; MLR = multiple linear regression; kNN = k nearest neighbors; 3D-QSAR = QSAR based on three-dimensional representation of molecules; MI-QSAR = QSAR based on parameters of membrane interactions; VPCP = vector of physicochemical properties.
Table 5 Correlation weights (CWs) of attributes that are promoters of an increase (all correlation weights are positive) or a decrease (all correlation weights are negative) of logBB
Split
Attribute
CWs run 1
CWs run 2
CWs run 3
3.24511 3.81416 2.49742 2.87618
2.12423 2.68351 1.93346 5.31253
1.37910 2.12465 3.06416 3.05954
-6.31039 -0.87153 -4.06125
-7.24883 -2.19137 -4.62422
-6.31605 -1.43359 -4.62395
0.81302 2.50269 4.74762 4.75052
2.12609 0.43945 5.12636 3.80957
-8.00464 -1.62872 -2.93503
Frequency in training set
Frequency in invisible training set
Frequency in calibration set
59 37 23 23
20 12 7 6
66 60 36
69 59 35
27 16 10
1.74603 1.00255 4.93289 3.62906
61 40 26 24
68 45 27 30
19 11 5 13
-4.81494 -0.87586 -3.12997
-5.37674 -1.43289 -1.99677
64 56 35
75 70 43
28 15 8
1.93776 2.12858 6.24741 3.43855
2.31176 2.87755 4.93620 1.18458
0.81079 4.18775 3.43821 2.50436
61 38 28 17
63 45 29 30
21 12 11 10
-10.6200 -2.56148 -2.93355
-6.87903 -1.25434 -3.49987
-8.37164 -1.06561 -2.00268
66 55 37
69 66 36
28 17 11
3
Increase N...C....... =...2....... $10011000000 N...1....... Decrease O........... C...2....... 2...(.......
Increase N...C....... =...2....... N...1....... $10011000000 Decrease O........... C...2....... 2...(.......
2
61 41 32 23
Increase N...C....... =...2....... $10011000000 N...1....... Decrease O........... C...2....... 2...(.......
1
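How correlation weights such as those in Table 5 enter a prediction can be sketched in a few lines of Python. The sketch assumes the usual CORAL form, in which the descriptor DCW is the sum of the correlation weights of all attributes of a SMILES and logBB is a one-variable linear function of DCW; the weights, the attribute list, and the coefficients `c0` and `c1` below are hypothetical placeholders, not values from this work.

```python
# Hypothetical correlation weights. Real values come from the Monte Carlo
# optimization and differ between splits and runs (see Table 5).
CW = {
    "N...C.......": 2.0,   # promoter of increase (positive weight)
    "=...2.......": 1.5,   # promoter of increase (positive weight)
    "O...........": -5.0,  # promoter of decrease (negative weight)
    "C...2.......": -1.0,  # promoter of decrease (negative weight)
}


def dcw(attributes, weights):
    """Sum the correlation weights of all attributes of one molecule.

    Attributes without a weight contribute 0 (assumption: this mimics
    rare attributes that are excluded from the model).
    """
    return sum(weights.get(a, 0.0) for a in attributes)


def predict_logbb(attributes, weights, c0, c1):
    """One-variable linear model: logBB = c0 + c1 * DCW."""
    return c0 + c1 * dcw(attributes, weights)


# Hypothetical attribute list of one molecule; "X..........." has no weight.
attrs = ["N...C.......", "N...C.......", "O...........", "X..........."]
print(dcw(attrs, CW))
print(predict_logbb(attrs, CW, c0=0.1, c1=0.05))
```

Under this form, every additional occurrence of an attribute with a positive weight raises the calculated logBB and every occurrence of an attribute with a negative weight lowers it, which is the sense in which the attributes of Table 5 act as promoters of increase or decrease.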
Table 6 A collection of molecules with large and small logBB values (structure drawings are not reproduced; structures are given as SMILES)

SMILES                                                  Experimental logBB  Calculated logBB  Comments
Large logBB
CN4CC3C1=CC=CC(C)=C1OC2=CC=CC=C2C(CC4)3O                0.82                0.62              Presence of HARD='$10011000000'
OC(C2=CC=CC=C2)(C3C4CC(C=C4)C3)CCN1CCCCC1               0.85                0.97              Presence of HARD='$10011000000'
OC1=CC2=C(CC3C(C)C(C)2CCN3C/C=C(C)\C)C=C1               0.54                0.64              Presence of HARD='$10011000000'
Small logBB
CCC1=CC=CC2=C1[NH]C3=C2CCOC(CC([OH])=O)3CC              -1.42               -1.44             Presence of SSk='C...2.......'
C[NH]/C([NH]C1=CC(C2=CSC(/N=C(N)/N)=N2)=CC=C1)=N/C#N    -1.54               -1.35             Presence of SSk='C...2.......'
Table 7 Example of modification of the molecular structure for increase of logBB: fragment "N...C......." is added (structure drawings are not reproduced).

                    SMILES                                      logBB (calc)
Original structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCCCC1    0.2578
Modified structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCNCC1    0.2700
Table 8 Example of modification of the molecular structure for decrease of logBB: fragment "O..........." is added (structure drawings are not reproduced).

                    SMILES                                      logBB (calc)
Original structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCCCC1    0.2578
Modified structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCOCC1    0.2264
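The effect of the two modifications on the SMILES attributes can be checked with a short Python sketch. It is a simplified reading of the attribute notation, not the CORAL implementation: each SMILES symbol is padded with dots to four characters, a single-symbol attribute is padded to 12 characters, and a pair of neighbouring symbols is written with the symbol of higher character code first (an ordering inferred from attribute names such as "N...C......." and "=...2......."). The tokenizer only handles one-character symbols plus Cl and Br, which is sufficient for the three SMILES of Tables 7 and 8; all function names are hypothetical.

```python
import re

# Simplified tokenizer: two-letter halogens first, then any single character.
TOKEN = re.compile(r"Cl|Br|.")


def pad(symbol):
    """Pad one SMILES symbol with dots to a 4-character field."""
    return symbol.ljust(4, ".")


def attributes(smiles):
    """Return single-symbol and neighbouring-pair attributes of a SMILES."""
    tokens = TOKEN.findall(smiles)
    singles = [pad(t) + "." * 8 for t in tokens]
    pairs = []
    for a, b in zip(tokens, tokens[1:]):
        first, second = sorted((a, b), reverse=True)  # higher code first
        pairs.append(pad(first) + pad(second) + "." * 4)
    return singles + pairs


ORIGINAL = "N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCCCC1"
WITH_N = "N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCNCC1"   # Table 7
WITH_O = "N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCOCC1"   # Table 8

for name, smiles in [("original", ORIGINAL), ("with N", WITH_N), ("with O", WITH_O)]:
    attrs = attributes(smiles)
    print(name, attrs.count("N...C......."), attrs.count("O..........."))
```

With this counting, the modification of Table 7 raises the number of "N...C......." pairs from 3 to 5, and the modification of Table 8 raises the number of "O..........." symbols from 1 to 2, in line with the direction of the calculated logBB changes reported in the two tables.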
Figure 1 The scheme of building up a model for logBB by the CORAL software
Figure 2 The tasks of training, invisible training, calibration, and validation sets in building up a model
Figure 3 The scheme for defining T* and N* for the Monte Carlo optimization