QSAR model for blood-brain barrier permeation


Accepted Manuscript

Andrey A. Toropov, Alla P. Toropova, Marten Beeg, Marco Gobbi, Mario Salmona

PII: S1056-8719(17)30014-X
DOI: 10.1016/j.vascn.2017.04.014
Reference: JPM 6446

To appear in: Journal of Pharmacological and Toxicological Methods

Received date: 27 January 2017
Revised date: 20 April 2017
Accepted date: 30 April 2017

Please cite this article as: Andrey A. Toropov, Alla P. Toropova, Marten Beeg, Marco Gobbi, Mario Salmona , QSAR model for blood-brain barrier permeation, Journal of Pharmacological and Toxicological Methods (2017), doi: 10.1016/j.vascn.2017.04.014

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


QSAR model for Blood-Brain Barrier Permeation

Andrey A. Toropov1, Alla P. Toropova*1, Marten Beeg2, Marco Gobbi2, Mario Salmona3

1 Department of Environmental Health Science, Laboratory of Environmental Chemistry and Toxicology, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy

2 Department of Molecular Biochemistry and Pharmacology, Laboratory of Pharmacodynamics and Pharmacokinetics, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy

3 Department of Molecular Biochemistry and Pharmacology, Laboratory of Biochemistry and Protein Chemistry, IRCCS-Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy

Abstract

Background and Objective: Predicting blood-brain barrier permeability for novel compounds is an important goal for neurotherapeutics-focused drug discovery. It is impossible to determine experimentally the blood-brain barrier partitioning of all possible candidates. Consequently, alternative evaluation methods based on computational models are desirable or even necessary. The CORAL software (http://www.insilico.eu/coral) has been tested as a tool for building quantitative structure-activity relationships for blood-brain barrier permeation.

Methods: The Monte Carlo technique makes it possible to build a predictive model of an endpoint by selecting so-called correlation weights of various molecular features. Descriptors calculated with these weights are the basis for "structure-endpoint" correlations.

Results: The approach gives good models for three random splits into the training and validation sets. The best model is characterized by the following statistics for the external validation set: the number of compounds is 41, the determination coefficient is 0.896, and the root mean squared error is 0.175.

Conclusions: The suggested approach can be applied as a tool for prediction of blood-brain barrier permeation.

Keywords: QSAR; Blood–Brain Barrier; Monte Carlo method; Computer-aided drug design; CORAL software

*) Corresponding author
Alla P. Toropova
Laboratory of Environmental Chemistry and Toxicology, IRCCS - Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy
Tel: +39 02 3901 4595
Fax: +39 02 3901 4735
Email: [email protected]

1. Introduction

The blood-brain barrier (BBB) is a major factor hindering the development of neurotherapeutics. Experimental determination of BBB permeation, like the experimental determination of many other biomedical endpoints, is cumbersome and expensive; thus, computational methods for predicting BBB permeation are an attractive alternative to experiment (Hou & Xu, 2002, 2003; Hou et al., 2006). The BBB is a physiological barrier in the circulatory system that maintains the homeostasis of the central nervous system by separating the brain from the systemic blood circulation, thereby preventing many substances from acting on the central nervous system. The blood-brain distribution of a molecule is a key characteristic for assessing its suitability as a potential drug for the central nervous system (Konovalov et al., 2007).

One of the biopharmaceutical properties with a critical influence on drug design is the ability of a molecule to penetrate the BBB. Potentially effective agents which are intended to interact with molecular targets in the central nervous system must cross the BBB in order to be used as therapeutic agents. At the same time, peripherally acting agents should not cross the BBB, so as to avoid side effects. In both cases the BBB permeability of the molecules must be known. The experimental determination of BBB permeability is usually very difficult, time consuming, and expensive; it also requires a sufficient quantity of the pure compound and hence is not suitable for providing results in a high-throughput manner. Therefore, there is an increasing interest in good, reliable, and easily applicable computational approaches which can rapidly predict the BBB penetration capability of molecules. Such predictive models can be widely used in the drug discovery process, especially in the area of the central nervous system.

In fact, the computational modeling of BBB permeability started in 1988, when the importance of lipophilicity for brain penetration was established statistically (Young et al., 1988). Subsequently, models were built with traditional statistical approaches such as multiple linear regression, partial least squares, linear discriminant analysis, and others in order to predict the BBB permeation of unknown substances (Crivori et al., 2000). BBB models have also been built using artificial intelligence techniques such as genetic algorithms (Iyer et al., 2002) and artificial neural networks (Garg & Verma, 2006). The advantage of these methods is that they can efficiently handle nonlinear data.

In addition, there have been multiple attempts to build quantitative structure-activity relationships (QSARs) for BBB permeation (Ooms et al., 2002; Luco & Marchevsky, 2006; de Sá et al., 2010; Carpenter et al., 2014; Bujak et al., 2015; Chena et al., 2009), which are based on different descriptors, such as physicochemical properties (octanol/water partition coefficient, solubility, lipophilicity), as well as topological and constitutional descriptors.

The aim of this work is to build a QSAR model for BBB permeation by means of the CORAL software (http://www.insilico.eu/coral).

2. Method

2.1. Data

The database of BBB permeation (logBB) values for 291 substances is available from the literature (Hou & Xu, 2002). These substances were randomly split three times into training (≈35%), invisible training (≈35%), calibration (≈15%), and validation (≈15%) sets. These splits are random and non-identical (Toropova & Toropov, 2014).
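A random split of this kind can be sketched in Python as follows. This is only an illustration under assumptions, not the procedure implemented in CORAL: the function name `random_split`, the seeds, the placeholder ID range, and the rounding of set sizes are hypothetical choices; only the approximate proportions come from the text.

```python
import random

# Approximate proportions taken from the text; the set names are labels chosen here.
FRACTIONS = (
    ("training", 0.35),
    ("invisible_training", 0.35),
    ("calibration", 0.15),
    ("validation", 0.15),
)

def random_split(compound_ids, seed):
    """Shuffle the compound IDs and cut them into four disjoint sets."""
    rng = random.Random(seed)
    shuffled = list(compound_ids)
    rng.shuffle(shuffled)

    n = len(shuffled)
    split = {}
    start = 0
    cumulative = 0.0
    for position, (name, fraction) in enumerate(FRACTIONS):
        cumulative += fraction
        # The last set takes whatever remains, so every compound is assigned.
        end = n if position == len(FRACTIONS) - 1 else round(cumulative * n)
        split[name] = shuffled[start:end]
        start = end
    return split

# Three splits with different (arbitrary) seeds; the IDs 1..291 are placeholders.
splits = [random_split(range(1, 292), seed) for seed in (1, 2, 3)]
```

Using a separate seed per split keeps each split reproducible while making the three splits differ from one another.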

2.2. Optimal descriptor

Figure 1 shows the scheme of building QSAR models with the CORAL software. Available data on the molecular structure are represented by the simplified molecular input-line entry system (SMILES) (Weininger, 1988). CORAL extracts molecular features according to the selected method (Toropova & Toropov, 2014). The QSAR models for logBB are built with the following two versions of the optimal descriptor:

$$DCW_1(T^*, N^*) = \sum CW(S_k) + \sum CW(SS_k) + CW(HARD) \tag{1}$$

$$DCW_2(T^*, N^*) = \sum CW(S_k) + \sum CW(SS_k) + CW(BOND) + CW(NOSP) + CW(HALO) \tag{2}$$

The SMILES attributes Sk, SSk, BOND, NOSP, and HALO are described in the literature (Toropova & Toropov, 2014). Sk is a SMILES-atom, i.e. one symbol or a group of symbols which cannot be examined separately (e.g. 'Cl', 'Br', etc.). SSk is a combination of two SMILES-atoms. Table 1 contains an example of the definition of the listed attributes, which are represented by sequences of twelve symbols. One can see that HARD is the association of BOND, NOSP, and HALO (Table 1). The CW(x) are the correlation weights of the above SMILES attributes. The numerical values of the correlation weights are calculated by Monte Carlo optimization. Over several epochs (an epoch is a modification of all correlation weights involved in building a model), the optimization finds the correlation weights which give the maximal correlation coefficient between the endpoint and the optimal descriptor. Figure 2 shows the scheme to define the number of epochs of the optimization.
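The attribute extraction and the descriptor of Eq. 1 can be illustrated with a short Python sketch for the SMILES of Table 1. It is a simplified illustration, not CORAL's parser: the function names are hypothetical, only 'Cl' and 'Br' are treated as two-character SMILES-atoms, bracketed atoms are not handled specially, and the fixed ordering inside each SSk pair is an assumption read off Table 1.

```python
# Order of the eleven presence flags in HARD, as read from Table 1:
# BOND (=, #, @), NOSP (N, O, S, P), HALO (F, Cl, Br, I).
HARD_FLAGS = ("=", "#", "@", "N", "O", "S", "P", "F", "Cl", "Br", "I")

def smiles_atoms(smiles):
    """Split a SMILES string into SMILES-atoms (Sk).

    Simplification: only 'Cl' and 'Br' are kept as two-character atoms,
    and ')' is mapped to '(' as in the footnote of Table 1.
    """
    atoms = []
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in ("Cl", "Br"):
            atoms.append(smiles[i:i + 2])
            i += 2
        else:
            atoms.append("(" if smiles[i] == ")" else smiles[i])
            i += 1
    return atoms

def atom_pairs(atoms):
    """Pairs of neighbouring SMILES-atoms (SSk), stored in one fixed order.

    Assumption: sorting each pair makes 'C' next to 'N' the same attribute
    regardless of which of the two comes first in the SMILES.
    """
    return [tuple(sorted(pair, reverse=True)) for pair in zip(atoms, atoms[1:])]

def hard_attribute(atoms):
    """Build the HARD string: '$' followed by eleven presence flags."""
    present = set(atoms)
    return "$" + "".join("1" if flag in present else "0" for flag in HARD_FLAGS)

def dcw1(smiles, weights):
    """Eq. 1: sum of CW(Sk), sum of CW(SSk), plus CW(HARD).

    Attributes without a weight contribute zero.
    """
    atoms = smiles_atoms(smiles)
    attributes = atoms + atom_pairs(atoms) + [hard_attribute(atoms)]
    return sum(weights.get(attribute, 0.0) for attribute in attributes)

atoms = smiles_atoms("NC(SCCF)=N")
print(atoms)
print(atom_pairs(atoms))
print(hard_attribute(atoms))
```

In this sketch the correlation weights are a plain dictionary keyed by attribute (strings for Sk and HARD, tuples for SSk), so any attribute that is missing from the dictionary simply drops out of the sum.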

T is the threshold, i.e. an integer value that separates SMILES attributes into two classes: rare and not rare. If T=2, then SMILES attributes which appear in the training set twice or less are classified as rare. The correlation weights of rare attributes are fixed at zero; consequently, these attributes are not involved in building a model. N is the number of epochs of the Monte Carlo optimization. T* and N* are the values of these parameters which give the best statistical characteristics for the calibration set. Figure 3 shows the general strategy for selecting T* and N*.
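The effect of the threshold can be sketched as follows. The helper names are hypothetical, and counting the number of training molecules that contain an attribute (rather than the total number of occurrences) is an assumption, since the text does not specify which count is meant.

```python
from collections import Counter

def non_rare_attributes(training_attribute_lists, threshold):
    """Return the attributes found in more than `threshold` training molecules."""
    counts = Counter()
    for attributes in training_attribute_lists:
        # Assumption: each molecule contributes at most once per attribute.
        counts.update(set(attributes))
    return {attribute for attribute, n in counts.items() if n > threshold}

def block_rare(weights, active):
    """Fix the correlation weights of rare attributes at zero."""
    return {a: (w if a in active else 0.0) for a, w in weights.items()}

# Toy attribute lists, not taken from the data set.
training = [["N", "C"], ["C", "O"], ["C", "C", "N"]]
active = non_rare_attributes(training, threshold=2)
print(active)
print(block_rare({"N": 0.4, "C": -0.2}, active))
```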

There are two possibilities (methods) for building a QSAR model with the CORAL software: the traditional scheme and the balance of correlations.

The traditional scheme is based on three special sets: (i) training set; (ii) calibration set; and (iii) validation set.

The balance of correlations is based on four sets: (i) training set; (ii) invisible training set; (iii) calibration set; and (iv) validation set. Figure 2 illustrates the tasks of the four sets used in the balance of correlations.

In the case of the balance of correlations, the four sets have special roles. The training set is the builder of the model. The invisible training set is the inspector: it checks for the absence of overtraining (i.e. a situation where perfect statistical quality for the training set is accompanied by poor statistical quality for some external set). The calibration set is the estimator: it indicates whether the current model has predictive potential. The external validation set is the final estimator of the predictive potential for unknown substances. In the case of the traditional scheme, the training and the invisible training sets are merged into a common training set. In other words, in the traditional scheme the inspector, which checks whether overtraining happens, is absent. Figure 3 shows the general scheme of building a model with the CORAL software (definition of T* and N*).
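The optimization itself is not spelled out in this text, so the following Python sketch shows only a generic random search, not CORAL's algorithm. It assumes that a random change of one correlation weight is kept only if it raises the correlation coefficient on the training set, and that the correlation on a second set (the "inspector") is merely recorded after each epoch. All function names, the step size, and the toy data are hypothetical.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def descriptor(attributes, weights):
    """Sum of correlation weights over the attributes of one molecule."""
    return sum(weights.get(a, 0.0) for a in attributes)

def correlation(dataset, weights):
    """Correlation between descriptor and endpoint for (attributes, endpoint) pairs."""
    x = [descriptor(attributes, weights) for attributes, _ in dataset]
    y = [endpoint for _, endpoint in dataset]
    return pearson_r(x, y)

def monte_carlo(training, inspector, n_epochs, step=0.1, seed=0):
    """Random search over correlation weights; returns weights and per-epoch history."""
    rng = random.Random(seed)
    names = sorted({a for attributes, _ in training for a in attributes})
    weights = {name: rng.uniform(-1.0, 1.0) for name in names}
    best = correlation(training, weights)
    history = [(0, best, correlation(inspector, weights))]
    for epoch in range(1, n_epochs + 1):
        for name in names:  # one epoch: try to modify every weight once
            old = weights[name]
            weights[name] = old + rng.uniform(-step, step)
            r = correlation(training, weights)
            if r > best:
                best = r              # keep the improvement
            else:
                weights[name] = old   # otherwise undo the change
        history.append((epoch, best, correlation(inspector, weights)))
    return weights, history

# Toy data: (list of attributes, endpoint value); not taken from the paper.
training = [(["N", "C"], 0.5), (["C", "O"], -0.4), (["C", "C", "N"], 0.6), (["O", "O"], -0.9)]
inspector = [(["N", "O"], 0.0), (["C", "N", "N"], 0.8), (["O", "C"], -0.3)]
weights, history = monte_carlo(training, inspector, n_epochs=20)
```

Recording the inspector correlation after every epoch makes it possible to see whether the training correlation keeps rising while the correlation on unseen data stalls or falls, which is the overtraining situation described above.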

Having the numerical data on the correlation weights, one can calculate the following model with the training set:

$$\mathrm{Endpoint} = C_0 + C_1 \cdot DCW_x(T^*, N^*), \quad x = 1, 2 \tag{3}$$

The predictive potential of the model calculated with Eq. 3 should be checked with the external validation set.
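Fitting Eq. 3 and checking it on an external set can be sketched as below. The sketch assumes that C0 and C1 are obtained by ordinary least squares and that the determination coefficient is the squared Pearson correlation between predicted and observed values; neither choice is stated explicitly in the text, and the function names and numbers are illustrative only.

```python
def fit_line(dcw, endpoint):
    """Least-squares estimates of C0 and C1 in Endpoint = C0 + C1 * DCW."""
    n = len(dcw)
    mean_x, mean_y = sum(dcw) / n, sum(endpoint) / n
    sxx = sum((x - mean_x) ** 2 for x in dcw)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dcw, endpoint))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

def validation_statistics(c0, c1, dcw, observed):
    """Return (r squared, RMSE) for a set that was not used in the fit."""
    predicted = [c0 + c1 * x for x in dcw]
    n = len(observed)
    rmse = (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n) ** 0.5
    mean_p, mean_o = sum(predicted) / n, sum(observed) / n
    spo = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    spp = sum((p - mean_p) ** 2 for p in predicted)
    soo = sum((o - mean_o) ** 2 for o in observed)
    r2 = spo ** 2 / (spp * soo)
    return r2, rmse

# Toy numbers lying exactly on the line y = 1 + 2x.
c0, c1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
r2, rmse = validation_statistics(c0, c1, [4.0, 5.0, 6.0], [9.0, 11.0, 13.0])
```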

3. Results

The balance of correlations based on DCW1(T*,N*) gives the following models:

Split 1

$$\mathrm{logBB} = -0.0138320\ (\pm 0.0031331) + 0.0331486\ (\pm 0.0002262) \cdot DCW_1(1,15) \tag{4}$$

Split 2

$$\mathrm{logBB} = 0.0087213\ (\pm 0.0032976) + 0.0311008\ (\pm 0.0001652) \cdot DCW_1(1,10) \tag{5}$$

Split 3

$$\mathrm{logBB} = -0.0002735\ (\pm 0.0031709) + 0.0488865\ (\pm 0.0002740) \cdot DCW_1(1,20) \tag{6}$$

Table 2 contains the distributions into split 1, split 2, and split 3 together with the experimental values of logBB and the values calculated with Eqs. 4-6. These splits were used for DCW2(T*,N*) as well as for the traditional scheme, i.e. training set, calibration set, and validation set (without the invisible training set).

The traditional scheme based on DCW1(T*,N*) gives the following models:

Split 1

$$\mathrm{logBB} = -0.1758388\ (\pm 0.0016122) + 0.0382357\ (\pm 0.0001028) \cdot DCW_2(1,20) \tag{7}$$

Split 2

$$\mathrm{logBB} = -0.1636285\ (\pm 0.0015508) + 0.0303687\ (\pm 0.0000765) \cdot DCW_2(1,10) \tag{8}$$

Split 3

$$\mathrm{logBB} = -0.0991360\ (\pm 0.0015057) + 0.0482449\ (\pm 0.0001139) \cdot DCW_2(1,21) \tag{9}$$

The values of logBB calculated with Eqs. 7-9 are not represented in Table 2, because the predictive potential of these models is lower than the predictive potential of the models calculated with Eqs. 4-6 (Table 3).

4. Discussion

Table 3 contains the statistical characteristics of the models. One can see from Table 3 that (i) the balance of correlations gives models which are better than the models calculated with the traditional scheme; and (ii) the first descriptor, calculated with the attribute HARD suggested in this work, gives better models than the second descriptor, calculated with separate BOND, NOSP, and HALO (Toropova & Toropov, 2014; Toropova et al., 2011). Thus, using HARD instead of separate or united BOND, NOSP, and HALO improves the predictive potential of the models based on optimal descriptors.

The review of works dedicated to the development of computational approaches to predict biochemical endpoints (Hou & Xu, 2002, 2003; Hou et al., 2006; Konovalov et al., 2007; Young et al., 1988; Crivori et al., 2000; Iyer et al., 2002; Garg & Verma, 2006; Ooms et al., 2002; Luco & Marchevsky, 2006; de Sá et al., 2010; Carpenter et al., 2014; Bujak et al., 2015; Chena et al., 2009) confirms that the search for effective methods to calculate biochemical phenomena is an important task of modern natural sciences. Improving the CORAL software is likewise a part of the global task of improving the theory and practice of QSAR (Toropova & Toropov, 2014; Gobbi et al., 2016). The novelty of this work is the use of the new global SMILES attribute named HARD (Table 1). The suggested attribute HARD reflects the presence (absence) of different molecular features (Table 1). Involving the correlation weights of those molecular features improves the predictive potential of the models for logBB (Table 3). There are rare (noise) versions of the HARD descriptor; fortunately, the rare versions are removed from model building by the scheme described in the literature (Toropova & Toropov, 2014; Gobbi et al., 2016). In other words, if a version of HARD is rare, this version is removed from the modeling process (e.g. $00000001010, $00000001110, and $01001000000, which are absent in the training set, have correlation weights equal to zero and are therefore not involved in building a model). In spite of the interconnection between the described BOND, NOSP, HALO and HARD, there is an apparent difference between the model based on the descriptor DCW1(T*,N*) (calculated with HARD) and the model based on DCW2(T*,N*) (calculated without HARD). It should be noted that T* and N* are different for the examined random splits. The difference is visible from Table 3.

Table 4 contains the comparison of different models for logBB suggested in the literature. The comparison shows that the models suggested in this work are comparable to the models suggested in the literature (Hou & Xu, 2002; Konovalov et al., 2007; Crivori et al., 2000; Iyer et al., 2002; Garg & Verma, 2006; Ooms et al., 2002; Chena et al., 2009). It should be noted that the majority of the above-mentioned models involve physicochemical parameters (Konovalov et al., 2007; Bujak et al., 2015) and/or 3D stereochemical calculations (Ooms et al., 2002), whereas the approach suggested here involves solely 2D data on the molecular structure represented by the SMILES notation.

Having the numerical data on the correlation weights of molecular features expressed via SMILES attributes, obtained in several runs of the Monte Carlo optimization, one can distinguish four categories of attributes. The first category is attributes with positive correlation weights in all the runs. The second category is attributes with negative correlation weights in all the runs. The third category is attributes which have both positive and negative correlation weights. The fourth category is blocked attributes. Table 5 contains a collection of attributes of the first and second categories. The attributes of the first category can be regarded as promoters of increase of logBB. The attributes of the second category can be regarded as promoters of decrease of logBB. Thus, the presence of nitrogen atoms and of double covalent bonds are promoters of increase of logBB, whereas the presence of oxygen and of two rings are promoters of decrease of logBB. This is qualitative information; however, it can be a basis for clear hypotheses, which can be confirmed (or rejected) after carrying out the corresponding experiments. Table 6 contains examples of molecules that confirm the influence of two promoters of increase (large logBB values) and of decrease (small logBB values).

The statistically significant promoters of logBB increase, as well as the statistically significant promoters of logBB decrease, are molecular features which are important for blood-brain barrier permeation. Statistical significance here means, firstly, a considerable prevalence of the molecular feature in the training and calibration sets; and secondly, stable positive (or stable negative) values of the correlation weights for the given feature, observed in several runs of the Monte Carlo optimization.
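The sorting of attributes into the four categories can be sketched in a few lines of Python. The function name and the numerical weights below are made up for illustration (they are not values from Table 5), and treating an attribute whose weight is zero in every run as blocked is an assumption.

```python
def classify_attributes(runs):
    """Assign each attribute to one of four categories from its weights over several runs.

    `runs` is a list of dictionaries, one per Monte Carlo run, mapping an
    attribute to its correlation weight in that run.
    """
    names = set()
    for run in runs:
        names.update(run)

    categories = {}
    for name in names:
        values = [run.get(name, 0.0) for run in runs]
        if all(v == 0 for v in values):
            categories[name] = "blocked"
        elif all(v > 0 for v in values):
            categories[name] = "promoter of increase"
        elif all(v < 0 for v in values):
            categories[name] = "promoter of decrease"
        else:
            categories[name] = "mixed"
    return categories

# Hypothetical weights from two runs, for illustration only.
runs = [
    {"N...C.......": 1.2, "O...........": -0.8, "=...........": 0.3, "#...........": 0.0},
    {"N...C.......": 0.9, "O...........": -1.1, "=...........": -0.2, "#...........": 0.0},
]
categories = classify_attributes(runs)
```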

The SMILES-attribute "N...C......." is a promoter of logBB increase (Table 5). Table 7 contains an example of a modification of molecular structure that leads to an increase of logBB. The SMILES-attribute "O..........." is a promoter of logBB decrease (Table 5). Table 8 contains an example of a modification of molecular structure that leads to a decrease of logBB. Thus, the suggested approach makes it possible to formulate hypotheses on how one should modify the molecular structure in order to change the logBB value.

The numerical data on the correlation weights for the model calculated with Eq. 4 (Split 1) are available in the Supplementary materials.

Conclusions

1. The CORAL software gives good prediction for the blood-brain barrier permeation (logBB);
2. The suggested SMILES attribute HARD improves the statistical quality of prediction;
3. The balance of correlations gives better models in comparison with the traditional scheme for all three random non-identical splits;
4. The described approach has a mechanistic interpretation in terms of promoters of increase or decrease of logBB.

Thus, the suggested models are built in accordance with the OECD principles (OECD, 2007).

Acknowledgments

AAT and APT thank the project EU-ToxRisk (Project reference: 681002) and LIFE-COMBASE contract (LIFE15 ENV/ES/000416) for financial support.

References

Bujak, R., Struck-Lewicka, W., Kaliszan, M., Kaliszan, R., & Markuszewski, M.J. (2015). Blood–brain barrier permeability mechanisms in view of quantitative structure–activity relationships (QSAR). Journal of Pharmaceutical and Biomedical Analysis, 108, 29–37.

Carpenter, T.S., Kirshner, D.A., Lau, E.Y., Wong, S.E., Nilmeier, J.P., & Lightstone, F.C. (2014). A method to predict Blood-Brain Barrier permeability of drug-like compounds using molecular dynamics simulations. Biophysical Journal, 107, 630–641.

Chena, Y., Zhu, Q.-J., Pan, J., Yang, Y., & Wu, X.-P. (2009). A prediction model for blood–brain barrier permeation and analysis on its parameter biologically. Computer Methods and Programs in Biomedicine, 95, 280–287.

Crivori, P., Cruciani, G., Carrupt, P.A., & Testa, B. (2000). Predicting blood-brain barrier permeation from three-dimensional molecular structure. Journal of Medicinal Chemistry, 43, 2204–2216.

de Sá, M.M., Pasqualoto, K.F.M., & Rangel-Yagui, C.O. (2010). A 2D-QSPR approach to predict blood-brain barrier penetration of drugs acting on the central nervous system. Brazilian Journal of Pharmaceutical Science, 46, 741–751.

Iyer, M., Mishru, R., Han, Y., & Hopfinger, A.J. (2002). Predicting blood-brain barrier partitioning of organic molecules using membrane-interaction QSAR analysis. Pharmaceutical Research, 19, 1611–1621.

Garg, P., & Verma, J. (2006). In silico prediction of blood brain barrier permeability: an artificial neural network model. Journal of Chemical Information and Modeling, 46, 289–297.

Gobbi, M., Beeg, M., Toropova, M.A., Toropov, A.A., & Salmona, M. (2016). Monte Carlo method for predicting of cardiac toxicity: hERG blocker compounds. Toxicology Letters, 250–251, 42–46.

Hou, T.J., & Xu, X.J. (2002). ADME evaluation in drug discovery 1. Applications of genetic algorithms to the prediction of blood–brain partitioning of a large set of drugs. Journal of Molecular Modeling, 8, 337–349.

Hou, T.J., & Xu, X.J. (2003). ADME Evaluation in Drug Discovery. 3. Modeling Blood-Brain Barrier Partitioning Using Simple Molecular Descriptors. Journal of Chemical Information and Computer Sciences, 43, 2137–2152.

Hou, T., Wang, J., Zhang, W., Wang, W., & Xu, X. (2006). Recent Advances in Computational Prediction of Drug Absorption and Permeability in Drug Discovery. Current Medicinal Chemistry, 13, 2653–2667.

Konovalov, D.A., Coomans, D., Deconinck, E., & Heyden, Y.V. (2007). Benchmarking of QSAR models for Blood-Brain Barrier Permeation. Journal of Chemical Information and Modeling, 47, 1648–1656.

Luco, J.M., & Marchevsky, E. (2006). QSAR Studies on Blood-Brain Barrier permeation. Current Computer-Aided Drug Design, 2, 31–55.

OECD (2007). Environment Health and Safety Publications Series on Testing and Assessment No. 69. Guidance Document on the Validation of (Quantitative) Structure–Activity Relationship [(Q)SAR] Models. http://search.oecd.org/officialdocuments/ (accessed 12.11.16).

Ooms, F., Weber, P., Carrupt, P.-A., & Testa, B. (2002). A simple model to predict Blood–Brain Barrier permeation from 3D molecular fields. Biochimica et Biophysica Acta, 1587, 118–125.

Toropova, A.P., Toropov, A.A., Benfenati, E., Gini, G., Leszczynska, D., & Leszczynski, J. (2011). CORAL: Quantitative Structure–Activity Relationship models for estimating toxicity of organic compounds in rats. Journal of Computational Chemistry, 32, 2727–2733.

Toropova, A.P., & Toropov, A.A. (2014). CORAL software: Prediction of carcinogenicity of drugs by means of the Monte Carlo method. European Journal of Pharmaceutical Sciences, 52, 21–25.

Young, R.C., Mitchell, R.C., Brown, T.H., Ganellin, C.R., Griffiths, R., Jones, M., Rana, K.K., Saunders, D., Smith, I.R., Sore, N.E., & Wilks, T.J. (1988). Development of a new physicochemical model for brain penetration and its application to the design of centrally acting H2 receptor histamine antagonists. Journal of Medicinal Chemistry, 31, 656–671.

Weininger, D. (1988). SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28, 31.

Table 1
Examples of representation of SMILES attributes by means of twelve symbols [SMILES = "NC(SCCF)=N"]

ID  Comment                        Representation (twelve symbols per attribute)
1   Representation of Sk           N...........  C...........  (........... *)  S...........  C...........  C...........  F...........  (...........  =...........  N...........
2   Representation of SSk          N...C.......  C...(.......  S...(.......  S...C.......  C...C.......  F...C.......  F...(.......  =...(.......  N...=.......
3   Definition of BOND attribute   BOND10000000  (flags after the name: "=" 1, "#" 0, "@" 0; remaining positions 0)
4   Definition of NOSP attribute   NOSP10100000  (flags after the name: N 1, O 0, S 1, P 0; remaining positions 0)
5   Definition of HALO attribute   HALO10000000  (flags after the name: F 1, Cl 0, Br 0, I 0; remaining positions 0)
6   Definition of HARD attribute   $10010101000  (flags after "$": "=" 1, "#" 0, "@" 0, N 1, O 0, S 1, P 0, F 1, Cl 0, Br 0, I 0)

*) Brackets are the representation of molecular branching; only "(" is used, without ")".

Table 2 Experimental (Hou & Xu, 2002) and calculated values of logBB

1.0200

0.7779

0.7203

0.8271

Y

Y Y

-0.3420 0.1510 -0.2800 -0.1400 0.2890 -0.1000 0.3300 0.1150 0.2700 0.5600 0.2670 0.1420 0.3900 0.3050 0.0400 -0.1300 0.3700 -0.0600 0.7400 0.8600

-0.2067 0.0484 0.0050 -0.1652 0.1977 0.0463 0.1953 0.1073 -0.0362 0.1583 0.1551 0.2672 -0.0763 0.0918 0.0272 -0.1402 -0.1710 0.1333 0.3407 0.3822

-0.3416 0.1126 -0.1030 -0.3087 0.3568 -0.0351 0.2442 0.2485 0.3969 0.4687 0.2250 0.1744 0.2946 0.1775 0.0760 0.1811 0.0690 0.2104 0.3746 0.4074

-0.2387 0.0274 -0.0736 -0.1960 0.1560 0.0607 0.2810 0.1580 0.2898 0.4671 0.3967 0.1523 0.3940 0.1459 0.0338 0.0246 0.1989 0.1886 0.4020 0.4447

N Y Y N Y N N N Y Y Y Y Y Y N N N Y Y Y

Y Y Y Y Y Y Y Y Y Y Y Y Y Y N N N Y Y Y


29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

V C I V T V T V T I V T T C T I T T I I

Eq. 5 0.0661 0.0448 -0.1118 0.6220 0.6548 0.6876 0.7205 0.7533 0.7861 0.6043 0.6372 0.6700 0.7028 0.6043 0.6372 0.5596 0.6747 0.7732 0.7732 0.8060 0.7379 0.9200


I V T V T C T C T T C T I T T T T T V T

Eq. 4 -0.0804 -0.0079 -0.0266 0.5105 0.5520 0.5935 0.6349 0.6764 0.7179 0.5977 0.6391 0.6806 0.7221 0.5977 0.6391 0.5810 0.3919 0.5164 0.5164 0.5579 0.6907 0.6740


V I V T C I T T V I C I V I T T V T T C

Experiment 0.0300 0.0300 0.0300 0.6320 0.6800 0.4420 0.6890 0.5200 0.6650 0.9700 0.8600 0.9800 1.0500 1.0130 0.8990 1.0370 0.1100 0.9330 1.1100 0.9600 1.0700 0.6100


28

SMILES [N][N] [N-]=[N+]=O [H]C([H])([H])[H] CCCCC CCCCCC CCCCCCC CCCCCCCC CCCCCCCCC CCCCCCCCCC CC(C)CCC CC(C)CCCC CC(C)CCCCC CC(C)CCCCCC CCC(C)CC CCCC(C)CC CCC(C)(C)C C1CC1 CC1CCCC1 C1CCCCC1 CC1CCCCC1 C1CC(C)C(C)CC1 CC(C1CCCCC1)( C)C C1C(C)C(C)CC(C )C1 ClCCl ClC(Cl)Cl ClC(Cl)C ClCCCl CC(Cl)(Cl)Cl ClC(Cl)CCl ClC(Cl)(Cl)CCl ClCC(F)(F)F BrCCC CC(Br)C FC(Br)C(F)(F)F FC(F)(F)C(Cl)Br FS(F)(F)(F)(F)F C=C Cl/C=C/Cl Cl/C=C\Cl Cl/C(Cl)=C(Cl)\Cl CC=C CCCCCCC=C CCCCCCCC=C


V I

ID 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

Domain of applicability Eq.6 1 2 3 ** 0.0670 Y Y Y -0.2267 N N Y -0.1642 Y Y Y 0.5981 Y Y Y 0.6408 Y Y Y 0.6835 Y Y Y 0.7261 Y Y Y 0.7688 Y Y Y 0.8115 Y Y Y 0.6712 Y Y Y 0.7139 Y Y Y 0.7566 Y Y Y 0.7993 Y Y Y 0.6712 Y Y Y 0.7139 Y Y Y 0.6651 Y Y Y 0.4797 Y Y Y 0.6077 Y Y Y 0.6077 Y Y Y 0.6504 Y Y Y 0.7539 Y Y Y 0.8486 Y Y Y


I

3 I C T V I I T T V C I V C T C C T T T C V T


2 T T I V C C I C T V I V V V C V I I T C V I


Split 1 I* I T C V I V T C C C C I V I T T V V T I I

Y Y Y Y Y Y Y Y N N N N N Y N N N Y Y Y

C 71

T T V C C I T T T T C T T I

I I I I V V C C I C I I I T

C I C T I V T V C C T T I T

C I C

V V 86 I T 87 V I 88

I

T T 89

V

V C 90

T

T I

72 73 74 75 76 77 78 79 80 81 82 83 84 85

91

0.4873 0.2280 0.1032 0.1731 0.4298 -0.0527 -0.0100 0.0327 0.0754 0.1180 -0.1016 0.1058 0.1485 -0.0650 -0.0224 0.1394 0.0843 0.0416 0.0843 0.2771 0.0998

Y Y N Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y

Y Y Y Y N Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y

0.1400

-0.0655

-0.0574

-0.0103

Y

Y Y

0.3000

0.1847

0.3466

0.1389

Y

Y Y

0.0100 0.1400 0.1300 -0.1700 -0.1700 -0.0100 -0.1340 0.0030 0.1200 0.2770 0.4000 0.3990 0.4500 0.5540

-0.0060 0.0949 0.1580 0.0155 -0.0798 0.0985 0.0256 0.0671 -0.1152 0.1501 -0.0322 -0.0695 0.0590 0.0135

0.1034 0.0911 0.2337 0.0762 0.0099 0.1419 0.0918 0.1246 -0.0175 0.1903 0.0482 -0.0679 0.0407 -0.0023

-0.0339 0.1025 0.1055 0.0602 0.0021 0.1455 -0.0258 0.0169 -0.1052 0.1023 -0.0198 -0.0747 0.0319 0.0106

Y Y Y Y Y Y Y Y Y Y Y Y Y Y

Y Y Y Y Y Y Y Y Y Y Y Y Y Y

0.1980 0.2710 0.4100

0.1121 0.1536 0.1764

0.3010 0.3338 0.4076

0.1485 0.1912 0.2065

Y Y Y

Y Y Y Y Y Y

0.2200

0.1104

0.2228

0.1726

Y

Y Y

0.3800

0.2408

0.3162

0.2643

Y

Y Y

0.2600

0.1951

0.3667

0.2339

Y

Y Y


I

0.4402 0.2181 0.3243 -0.1186 0.6111 0.0105 0.0433 0.0762 0.1090 0.1418 -0.1492 0.0585 0.0914 -0.1611 -0.1283 0.1848 -0.0197 -0.0525 -0.0197 0.3527 0.0299


I

0.4237 0.1975 0.0752 0.1533 0.1932 -0.1113 -0.0698 -0.0283 0.0132 0.0547 -0.2064 0.0589 0.1004 -0.1816 -0.1401 0.1002 -0.0116 -0.0531 -0.0116 0.3041 0.0653


V T 70

0.9600 -0.1660 0.1050 -0.0200 0.6000 0.0200 -0.1240 -0.0820 -0.0230 0.2030 -0.1130 -0.1400 0.0380 0.1100 0.0700 -0.0100 0.2200 0.3600 0.1700 0.1190 0.1900


I

CCCCCCCCC=C C=CC=C ClC=C(Cl)Cl ClC=C(F)F S=C=S CO CCO CCCO CCCCO CCCCCO CC(C)O CC(C)CO CCC(C)CO CC(C)(C)O CCC(C)(C)O CCOCC CC(OCC)(C)C CC(OC)(C)C CCC(C)(C)OC COC(F)(F)C(Cl)Cl FC(F)OC(Cl)C(F)( F)F FC(F)OC(F)(F)C( F)Cl FCOC(C(F)(F)F)C (F)(F)F C1CO1 FC(F)(F)COC=C C=COC=C CC(C)=O CC(=O)CC CCCC(C)=O COC(C)=O CCOC(C)=O CC(OCCC)=O CCCCOC(C)=O CC(OCCCCC)=O CC(C)OC(C)=O CC(C)COC(=O)C CC(OCCC(C)C)= O C1=CC=CC=C1 CC1=CC=CC=C1 CC1=CC=CC=C1 C CC1=CC(C)=CC= C1 CC1=CC=C(C)C= C1 CCC1=CC=CC=C


49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69


I T I V I C T I I V C I V T T I I T T T V


I I T V T V T C T I C V V T C T C T C T T


I I I I I T I V I V C T C V V V I C C I T


ACCEPTED MANUSCRIPT Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y

Y Y Y Y Y Y Y Y Y Y Y Y Y Y

V

T I

I I T

T V 96 T V 97 T T 98

I

V I

99

I

I

I

100

I

I

C 101

C

T T 102

T

T T 103

T

C C 104

C

I

T

T T 106

I

T I

T

T T 108

C

C I

V

I

95


T 105

107

109

T 110

Y

Y Y

0.4300

0.2697

0.4479

0.3894

Y

Y Y

0.1600

0.1976

0.2051

0.2458

Y

Y Y

0.1700

0.3353

0.2666

0.4784

Y

Y Y

-0.2300 -0.4000 -0.2400

-0.0826 -0.3807 0.0619

-0.2229 -0.4932 0.1487

-0.0466 -0.7246 0.0749

Y Y Y

N Y N Y Y Y

-0.1700

-0.0578

-0.0410

0.0530

Y

Y Y

-0.2022

-0.0505

-0.1914

Y

Y Y

1.0100

0.7384

0.8743

0.7407

N

N Y

-0.0700

-0.0613

0.0552

0.0308

Y

N Y

1.3800

0.7011

0.7581

0.6857

N

N Y

-0.0500

0.0046

0.0290

0.0534

Y

Y Y

-0.0300

-0.3700

-0.0772

-0.1309

Y

N Y

0.9800

1.1219

1.2164

1.1778

N

N Y

-0.1000

-0.0242

-0.0994

-0.0714

Y

N Y

0.6400

1.0988

1.0967

1.1156

N

N Y

0.1880

-0.0039

0.0218

0.0410

Y

Y Y

0.0170

0.1206

0.1203

0.1691

Y

Y Y

-0.3010


C T 94

0.2306


C

0.3415


C C 93

0.2178


V

0.4500


V 92


I


C

1 C=CC1=CC=CC= C1 CC(C1=CC=CC= C1)(C)C CC1=CC(C)=C(C) C=C1 FC(F)(C1=CC=C( Cl)C=C1)F OCC#C C=CC#N FCCCN1C=CN=C 1N(=O)=O FCCCCCCCCN1 C(N(=O)=O)=NC =C1 OC1=CC4=C(C3C CC(C(C(CC2C3C C4)F)O)2C)C=C1 IC1=CC=C(N2CC N(CC2)CCCCCC) C=C1 OCC3=NC=C4CN =C(C2=C(N34)C= CC(Cl)=C2)C1=C( F)C=CC=C1 IC1=CC=C(N2CC N(CC2)C(CC)C)C =C1 FC1=C(C)N(C)N( C2=CC=CC=C2)C 1=O CC3=NC=C4C(N =C(C2=C(N34)C= CC(Cl)=C2)C1=C( F)C=CC=C1)O IC3=C(C=C(C=C3 )CN2CCCCC2)C N1CCCCC1 CC2=C(C(N(N2C) C1=CC=CC=C1)= O)I IC1=C(C=C(C=C1 )NC(CC)C)NC(C C)C O=C1C(CC)(C(N C(N1)=O)=O)CC CC O=C1C(CC)(C(N C(N1)=O)=O)CC CCCCC


I

T T 115

C

T V 116

T

C I

117

T

T I

118

T

T T 119

I

T V 120

T

I

I

T

I

T 122

I

V I

C

T C 125

C

V T 126

V 114


121

124

Y Y

-0.2220

-0.1284

-0.0767

-0.0870

Y

Y Y

0.2410

0.1621

0.1531

0.2117

Y

Y Y

0.0860

0.0376

0.0546

0.0837

Y

Y Y

0.0860

-0.0454

-0.0017

Y

Y Y

-0.6700

-0.9847

-0.7638

-0.9372

Y

N N

0.5143

0.4362

0.1687

Y

Y Y

0.2045

0.2604

0.1804

N

Y N

-0.2600

0.2045

0.2604

0.1804

N

Y N

-0.5900

-0.3734

-0.3418

-0.4655

Y

Y Y

-0.3800

-0.4197

-0.3271

-0.5121

N

Y Y

0.0000

0.3516

0.3234

0.3706

N

N N

0.0000

0.1520

0.1318

0.1326

Y

Y Y

0.0400

-0.2232

-0.1783

-0.1365

Y

Y Y

-0.1600

-0.0139

0.1108

0.0262

Y

Y Y

0.5480


I

Y

0.6750

-0.0111


T

0.1264


C C 113

0.0874


V

0.0791


V C 112

0.3640


V

O=C1C(CC)(C(N C(N1)=O)=O)CC CCCC O=C1C(CC)(C(N C(N1)=O)=O)C O=C1C(CC)(C(N C(N1)=O)=O)CC CCCCCC O=C1C(CC)(C(N C(N1)=O)=O)CC CCC O=C1C(CC)(C(N C(N1)=O)=O)CC C O=C2N1CCCC(O) C1=NC(C)=C2CC N3CCC(C4=NOC 5=C4C=CC(F)=C5 )CC3 FCCC2=CC(N)=C (C=C2)SC1=CC= CC=C1N(C)C CC1N(C3=C(C)C( C)=NC(NC4=CC= C(C=C4)F)=N3)C CC2=CC=CC=C1 2 CC1N(C3=C(C)C( C)=NC(NC4=CC= C(F)C=C4)=N3)C CC2=CC=CC=C1 2 CC2=CN(C([NH] C2=O)=O)C1CC( C(O1)CO)F [O]/[N+](C(C)(C)C) =C/C1=CC=[N+]( [O-])C=C1 CC3=NN=C4CN= C(C2=CC(Cl)=CC =C2N34)C1=CC= CC=C1 CN(C2=C(N(N(C2 =O)C1=CC=CC= C1)C)C)C CCC(C(NC(NC1= O)=O)=O)1CCC( C)C CC2=CC(N(N2C) C1=CC=CC=C1)=


T T 111


T

C I

129

T

T I

130

T

I

I

T I

V

T C 133

I

T T 134

T

T T 135

C

T I

136

T

I

I

137

T

I

I

138

V

V T 139

T

V I

140

V

I

142

T 131

-0.1691

-0.2320

Y

Y Y

-0.1370

-0.3934

-0.2264

-0.2950

Y

Y Y

0.8200

0.7054

0.6980

0.5327

N

Y Y

-0.4300

-0.3677

-0.1012

-0.1958

Y

Y N

-0.3000

0.1101

0.2579

0.1910

Y

Y Y

0.8525

0.8350

1.0461

Y

Y Y

-0.1200

-0.5467

-0.4352

-0.4391

Y

Y Y

-0.3014

-0.3540

-0.2444

N

N N

-0.0100

-0.3263

-0.0684

-0.1532

Y

Y N

-0.0900

-0.0799

-0.2462

-0.1491

Y

Y N

0.0700

-0.2954

-0.3186

-0.2443

N

N N

0.4200

0.1723

0.2629

0.1809

Y

Y Y

0.0300

-0.2584

-0.1939

-0.3708

Y

Y Y

0.0000

0.1789

0.1809

0.1815

Y

Y Y

0.8450

-0.5500

I


132

-0.2257


T

-0.6990


128


V I


I

O CC(NCC(COC1= CC=C(C=C1)CC( N)=O)O)C CCC(C(NC(NC1= O)=O)=O)1CC CN4CC3C1=CC= CC(C)=C1OC2=C C=CC=C2C(CC4) 3O CCN3CN(C4=CC =C(Br)C=C4)C2( C3=O)CCN(CC2) CCCC(C1=CC=C( C=C1)F)=O CNCCC1=NC=CC =C1 OC(C2=CC=CC= C2)(C3C4CC(C=C 4)C3)CCN1CCCC C1 CC(C1=CC=C(O) C=C1)(C2=CC=C( C=C2)O)C BrC1=CC=C(N4C 3(C(N(C)C4)=O)C CN(CC3)CCCC(C 2=CC=C(F)C=C2) =O)C=C1 CCCN3CN(C4=C C=C(Br)C=C4)C2 (C3=O)CCN(CC2) CCCC(C1=CC=C( C=C1)F)=O CC(C)(OC(C1=C( C2CCCN2C(C4= C3C=CC(Br)=C4) =O)N3C=N1)=O) C BrC1=CC=C(N4C 3(C(NC4)=O)CCN (CC3)CCCC(C2= CC=C(C=C2)F)= O)C=C1 CCCCOC(C1=CC =C(C=C1)N)=O CN1C=NC(N(C(N 2C)=O)C)=C1C2= O NC(N2C1=C(C=C C3=C2C=CC=C3)


I

V 145

I

C I

148

T

I

I

149

T

I

I

150

T

I

T 151

T

I

I

152

I

I

I

154

I

T I

155

V

C T 156

I

I

V

T C 158

V

I

T 159

C

I

V 160

T

V I


T 157

162

N

Y Y

-0.4000

-0.3831

-0.3233

-0.3627

Y

Y Y

1.0600

1.0490

0.5860

0.5727

Y

Y Y

0.3500

0.0094

0.2012

-0.0511

Y

Y Y

0.1100

0.1123


T

144

-0.0844

0.1615

0.2431

N

N N

0.6000

0.0858

0.0287

-0.0570

Y

Y Y

-0.2200

-0.2283

-0.4195

-0.2080

N

N N

-0.3480

0.0317

0.0960

0.0108

Y

Y Y

-0.1500

-0.5297

-0.4720

-0.5715

Y

Y Y

1.0000

0.5127

0.4526

0.4994

Y

Y Y

1.0600

0.9059

0.7403

0.9331

Y

Y Y

0.3600

-0.1195

0.0055

-0.0669

Y

Y Y

0.5000

0.1581

0.2369

0.2144

Y

Y Y

0.5900

0.8996

0.7175

0.6299

Y

N N

0.2750

0.1959

0.3707

0.1504

Y

Y Y

-1.3000

-1.0316

-0.7147

-1.0611

Y

Y Y


I

0.0244


I

-0.1010


T

-0.5200


T 143


I


I

C=CC=C1)=O ClCCNC(N(N=O) CCCl)=O CC(C)(NCC(COC 1=CC=CC(N2)=C 1CCC2=O)O)C CN(CCCN2C1=C( SC3=C2C=C(Cl)C =C3)C=CC=C1)C CN2C(CC(N(C3= C2C=CC(Cl)=C3) C1=CC=CC=C1)= O)=O ClC1=CC=CC(Cl) =C1/N=C2NCCN\ 2 COC(C1C(OC(C3 =CC=CC=C3)=O) CC2CCC1N2C)= O OC1C4C2(CCN5 C)C(C5CC3=C2C( O4)=C(OC)C=C3) C=C1 CN1C(C2=CN=C C=C2)CCC1=O OC1=CC=C(C3=C OC2=C(C3=O)C= CC(O)=C2)C=C1 CNCCCN2C1=C( CCC3=C2C=CC= C3)C=CC=C1 NCCCN1C(C=CC =C3)=C3CCC2=C 1C=CC=C2 ClC3=CC(N(C(C2 )=O)C1=CC=CC= C1)=C(C=C3)NC2 =O ClC(C=CC=C1N2 )=C1C(C3=CC=C C=C3)=NCC2=O CNCCCN1C(C=C C=C3)=C3SC2=C 1C=CC=C2 CN2C(CN=C(C3= C2C=CC(Cl)=C3) C1=CC=CC=C1)= O OCC1CCC(N2C= NC3=C2N=C[NH]


T 165

I

V V 166

I

C T 167

T

C C 168

T

T T 170

T

I

T 171

I

I

I

I

T T 173

T

T T 174

T

C V 175

I

T T 176

T

T V 177

I

I


172

T 178

Y

Y Y

-0.9250

-0.5013

-0.5532

-0.7145

N

N N

0.6400

0.2159

0.2150

0.2905

Y

Y Y

-0.0460

0.0623

-0.0081

Y

Y Y

0.2700

0.0893

0.1972

0.0956

Y

Y Y

0.5780

0.5171

0.4728

0.6237

Y

Y Y

-0.0950

-0.4324

-0.4866

-0.3906

Y

Y Y

-0.0825

0.0397

-0.1494

Y

Y Y

0.0600

-0.2648

-0.2535

-0.2810

Y

Y Y

-0.1400

-0.0869

0.0181

-0.0364

Y

Y Y

-0.0100

-0.4522

-0.2344

-0.4820

Y

Y Y

0.3600

0.4735

0.3878

0.4066

Y

Y Y

-1.3980

-0.9507

-1.2482

-1.4100

Y

N N

0.0400

-0.0308

-0.0676

-0.0357

Y

Y Y

0.0730

-0.1506

-0.1633

-0.1114

Y

Y Y

-0.2900


I

0.4006

0.1778


T

0.3088


T 164

0.4149


I

1.2600


I


T T 163


I

C3=O)O1 CN(CCOC(C2=C C=CC=C2)C1=CC =CC=C1)C ClC5=CC=C4N(C ([NH]C4=C5)=O) C3CCN(CC3)CCC N1C([NH]C2=CC =CC=C12)=O CN(CCOC(C1=C C=CC=C1)(C2=C C=CC=N2)C)C CCOC1=CC=CC= C1C(N)=O CCOC(C1=CC=C( N)C=C1)=O CCC(N(C3=CC=C C=C3)C1CCN(CC C2=CC=CC=C2)C C1)=O FC1=CC3=C(N2C =NC(C(OC(C)F)= O)=C2CN(C3=O) C)C=C1 CCOC(C1=C(CN( C(C3=C2C=CC(F) =C3)=O)C)N2C= N1)=O CN2C(CN=C(C3= C2C=CC(N(=O)= O)=C3)C1=C(F)C =CC=C1)=O CCOC(C1=C(CN( C(C3=C2C=CC(F) =C3)=O)CCF)N2 C=N1)=O O=N(C1=NC=CN 1CC(COCF)O)=O CNCCC(C2=CC= CC=C2)OC1=CC= C(C(F)(F)F)C=C1 CC(C(CCC(C1CC C(C2C(C=C5C3C C(CCC(CCC45C) 3C)(C(O)=O)C)= O)4C)2C)O)1C OC(CC)(CC(N)=O )C1=CC=CC=C1 CN2C(NC(C(C1= CCCCC1)(C2=O) C)=O)=O


I

T 181

I

I

I

I

I

V 183

V

C C 184

I

T I

I

I

T 186

I

I

I

C

C V 189

C

I

I

T I

C

T C 193

I

T I

194

I

T I

195

182

0.8709

0.8343

0.6232

Y

Y Y

-0.2540

-0.2432

-0.0908

-0.2321

N

N N

1.1340

0.7606

0.9888

0.9489

Y

N Y

-0.1800

-0.1671

-0.0635

Y

Y Y

0.8300

0.4693

0.5279

0.5451

Y

Y Y

-0.7968

-0.8679

-0.7499

Y

N N

-0.2457

-0.3646

-0.3094

Y

Y Y

0.4800

0.2081

0.3525

0.0956

Y

Y Y

0.3400

0.2679

0.1479

0.2390

Y

Y Y

0.4400

0.1108

0.4482

0.1778

Y

Y Y

-1.0600

-0.5593

-0.8450

-0.8594

Y

N Y

-1.6000

-1.5478

-1.4244

-1.5294

Y

Y Y

0.6300

0.6220

0.8123

0.5807

Y

Y Y

0.9000

0.3526

0.4617

0.5026

Y

Y Y

-0.7450

-0.1025

-1.2600


C 190


188


185

0.3900


T

OCCOCCN1CCN( C(C3=CC=C(C=C 3)Cl)C2=CC=CC= C2)CC1 IC1=CC(C(CCCN 4CCC2(CC4)C(N( CN2C3=CC=CC= C3)C)=O)=O)=CS 1 I/C=C/CN1CCC(C OC2=CC=C(C#N) C=C2)CC1 CC(CC1=CC=C(C (C(O)=O)C)C=C1) C CN(CCCN2C1=C( CCC3=C2C=CC= C3)C=CC=C1)C CC(C)(NC(C1CN( CC5=CN=CC=C5) CCN1CC(CC(C(N C3C(CC4=CC=C C=C34)O)=O)CC2 =CC=CC=C2)O)= O)C COC1=CC3=C(N( C(C)=C3CC(O)=O )C(C2=CC=C(C= C2)Cl)=O)C=C1 NC1=NC(N)=C(C 2=C(Cl)C(Cl)=CC =C2)N=N1 CCN(CC(NC1=C( C=CC=C1C)C)=O )CC OC3N=C(C2=CC( Cl)=CC=C2NC3= O)C1=CC=CC=C1 Cl CN(CC3=CC=C(O 3)CSCC[NH]C1= NC(C(CC2=CN=C (C)C=C2)=C[NH] 1)=O)C OCC(C(C(C(CO) O)O)O)O OC(C2=CC(C(F)( F)F)=NC3=C2C= CC=C3C(F)(F)F)C 1CCCCN1 CNC(CC1=CC=C


180


I


I


T

192


198

T

I

201

I

T C 202

I

I

V 203

I

I

I

I

V C 205

I

T I

206

I

I

I

207

C

I

I

208

T

I

I

209

I

T I

210

I

T I

211

I

I

204


I

V 212

N Y

0.9900

0.6266

0.6240

0.6954

Y

Y Y

0.4260

0.0155

0.5179

0.2142

Y

N Y

-0.9210

-0.9115

-0.6346

Y

Y Y

0.5300

0.4340

0.5133

0.5028

Y

Y Y

-0.0545

-0.0127

-0.1887

Y

Y Y

-0.4879

-0.4659

-0.5501

Y

N N

0.3178

0.3619

0.4764

Y

Y Y

-0.6580

-0.3352

-0.5224

-0.3933

Y

Y Y

0.0000

-0.2744

-0.2273

-0.3231

Y

Y Y

0.2360

0.3383

0.4126

0.3627

Y

Y Y

0.7500

0.7793

0.8311

0.8835

Y

Y Y

0.7800

0.2931

0.6159

0.2670

Y

N Y

0.1600

0.3375

0.5380

0.4259

Y

Y Y

0.3900

0.0714

0.3136

0.0043

Y

Y Y

-0.0100


T I

Y

-0.6480

0.4600

-0.6542


I

-0.2867


T 197

-0.1751


I

-0.3248


C

-0.0600


T 196


I


T

C=C1)C CCC#CC(C(C(NC (N(C1=O)C)=O)= O)1CC=C)C CN1CCN3C(C2= CC=CC=C2CC4= CC=CC=C34)C1 CC3=NC=C4CN= C(C2=C(N34)C=C C(Cl)=C2)C1=C(F )C=CC=C1 CON1C=C(C(C3= C1C=C2OCOC2= C3)=O)C(O)=O CN1CCN3C(C2= CC=CC=C2CC4= CC=CN=C34)C1 OC(CN1C=CN=C 1N(=O)=O)COCF OC4=C(O5)C2=C( C=C4)CC3C1C=C C(O)C5C12CCN3 C CN3CN(C4=CC= CC=C4)C2(C3=O) CCN(CC2)CCCC( C1=CC=C(C=C1) F)=O CCN1C=C(C(C2= C1N=C(C)C=C2)= O)C(O)=O CC1=CC=NC(N(C 4=C3C=CC=N4)C 2CC2)=C1NC3=O CN1CCCC1C2=C C=CN=C2 CSC4=CC2=C(C= C4)SC1=CC=CC= C1N2CCC3CCCN C3 CN1CCN(C3=NC 2=CC=CC=C2NC 4=C3C=C(C)S4)C C1 O=C1N(CCCCN2 CCN(C3=CC(C(F) (F)F)=CC=N3)CC 2)CCC1 ClC4=CC1=C(C= C4)OC3=C(C=CC =C3)C2C1CNC2

I

I

215

T

I

I

216

C

V C 218

I

I

T 219

T

I

I

C

I

T 222

T

I

I

T

V V 226

I

I

T

C I

T

I

I

T T 231

T

I

C 232

I

I

T 234

T

T I

220

224

T 227


228

T 229

235

0.5844

Y

Y Y

0.0000

0.1498

-0.0043

0.2539

Y

Y Y

1.0300

0.5250

0.6570

0.5264

Y

Y Y

0.5800

0.2185

0.1931

Y

Y Y

-0.3400

-0.3066

-0.1677

-0.2034

Y

Y Y

-0.2692

-0.1395

-0.3220

Y

Y Y

0.5669

0.5904

0.5788

Y

N N

0.1400

-0.2232

-0.1783

-0.1365

Y

Y Y

-0.0300

-0.2893

-0.2754

-0.2416

N

N N

0.4400

0.2767

0.4464

0.6098

Y

Y Y

-0.5200

0.0652

0.0961

0.0861

Y

Y Y

0.0500

0.1156

0.2081

0.1177

N

Y Y

-0.1000

0.4520

0.3730

0.4560

Y

Y Y

0.0800

0.2681

0.2873

0.2764

Y

Y Y

-0.1350

-0.2265

-0.3341

-0.1580

Y

Y Y

0.4800

0.1816

0.2508

0.2473

Y

Y Y

-1.2600

0.0482

0.0937

0.0067

Y

Y Y

0.0600 0.5440


I

0.4147

0.2579


214

0.5769


T I

0.5200


T

OC12C4=C(C=CC =C4)OC3=C(C=C C=C3C)C1CNCC 2 C=CCC(N)C(C=C C=C3)=C3C1=NO C2=C1C=CC=C2 CN1CC3C(C2=C( OC4=C3C=CC=C 4)C=CC(Cl)=C2)C 1 OC3N=C(C2=CC( Cl)=CC=C2NC3= O)C1=CC=CC=C1 CC(NC1=CC=C(O )C=C1)=O O=C1N(C(NC2=C 1N(C)C=N2)=O)C OC1=CC2=C(CC3 C(C)C(C)2CCN3C /C=C(C)\C)C=C1 CCCC(C(C(NC(N C1=O)=O)=O)1C C)C C12=NN=NN1CC CCC2 N1(C2(C3=CC=C C=C3)CCCCC2)C CCCC1 CCCCC1C(N(C3= CC=CC=C3)N(C2 =CC=CC=C2)C1= O)=O [O]/[N+](C(C)(C)C) =C/C1=CC=CC=C 1 O=C1NC(C(C2=C C=CC=C2)(C3=C C=CC=C3)N1)=O CNC(OC3=CC=C 2N(C1N(CCC(C2 =C3)1C)C)C)=O CC([NH]CC(COC 1=CC=CC2=C1C =C[NH]2)O)C CCC(N(C(C2=CC 1=CC=CC=C1C(C 3=C(Cl)C=CC=C3 )=N2)=O)C)C OC(C1=CC=C(C2


213


C I


I


T

I

I

T T 241

T

T T 242

I

I

C

T C 244

T

T T 245

T

I

I

246

I

T I

247

I

I

I

248

V

I

C 249

T T

T I 250 I T 251

I

T T 252

I

I

I

243

0.1103

Y

Y Y

0.9500

0.9413

1.0121

0.9483

Y

N N

0.6970

0.0981

0.0007

0.0347

Y

Y Y

0.6400

-0.0680

-0.1030

-0.1276

Y

Y Y

0.5500

0.1308

0.2300

0.1383

Y

Y Y

0.2300

0.4117

0.5274

0.5147

Y

Y Y

0.3009

0.2364

0.2228

Y

Y Y

0.1000

0.3246

0.2697

0.1925

Y

Y Y

-1.3572

-0.6809

-0.5615

N

Y Y

-0.0200

-0.3113

-0.2664

-0.1549

Y

N N

0.6100

0.0855

-0.0171

0.0068

Y

Y Y

0.2500

0.2948

0.2365

0.3436

Y

Y Y

-0.2750 -1.3000

0.1206 -0.7148

0.1864 -0.6273

-0.0437 -0.5675

Y Y

Y Y Y Y

-1.1610

-0.2383

-0.1383

-0.4234

Y

Y Y

-1.0300

-0.9013

-1.0265

-0.9605

Y

N N

0.5000

-1.4230


I

239

0.1828


T T 237

0.1246


T

0.0500


C T 236


T

=CC=CC=C2)C= C1)=O CCN(CCOC(C1= CC=C(C=C1)N)= O)CC CN(CCCN2C1=C C=CC=C1SC3=C C=CC=C23)C CC(C1=CC=CC(C (C)C)=C1O)C CC(NCC(COC2= C1C=CC=CC1=C C=C2)O)C CCCOC(C1=CC= C(N)C=C1)=O C13=CC=CC4=C1 C2=C(C=C4)C=C C=C2C=C3 COC2=CC=C(C= C2)CN(C1=CC=C C=N1)CCN(C)C OC(C3C(CC4C=C )CCC4C3)C2=C1 C=C(C=CC1=NC =C2)OC CCC1=CC=CC2= C1[NH]C3=C2CC OC(CC([OH])=O) 3CC CC(N=C1CCCCN 12)=C(CCN3CCC (C4=NOC5=C4C= CC(F)=C5)CC3)C 2=O COC1=C(OC2CC CC2)C=C(C3CNC (C3)=O)C=C1 CCCN(CCC(C=C C=C1N2)=C1CC2 =O)CCC NC(SCCF)=N CC(C)(NCC(C1= CC=C(C(CO)=C1) O)O)C OC(C1=CC=CC= C1O)=O CC(C)(NC(C2CC1 CCCCC1CN2CC( C(NC(C(NC(C4= NC5=C(C=CC=C5 )C=C4)=O)CC(N)


T 254


I

259

V

I

I

260

C

V T 261

V

T T 262

C

V V 263

I

T V 264

C

I

T

C T 266

V

T I

267

V

I

268

265

-0.6600

-0.7818

-0.6670

-0.8220

Y

Y Y

-0.1200

-0.4769

-0.4697

-0.5570

Y

Y Y

-0.1800

-0.1637

-0.3891

-0.2312

N

N N

-0.0299

-0.0948

-0.0627

N

N Y

-1.0412

-0.7545

-1.2422

Y

N N

-1.3303

-1.2857

-1.5546

Y

N N

-1.5400

-1.4534

-1.0667

-1.6299

Y

N Y

-1.1200

-1.1613

-1.4778

-1.1212

Y

N Y

-0.7300

-0.6178

-0.9112

-0.9447

Y

N Y

-0.2700

-0.2717

-0.3698

-0.3362

N

Y Y

-0.2800

-0.2879

-0.2401

-0.2328

Y

Y Y

-0.4600

-0.2256

-0.0416

-0.1765

Y

Y Y

-0.0400 -1.1500

-1.5700


I

Y N


I

N


T

-0.5587


T T 258

-0.7251


I

-0.7031


C C 257

-0.6700


C


V T 256


T

=O)=O)CC3=CC= CC=C3)O)=O)C BrC1=CC=CN=C1 CSCC[NH]C2=C( N(=O)=O)C=C[N H]2 O=N(C1=C([NH] CCSCC2=NC=CC =C2)[NH]C=C1)= O O=N(C1=C([NH] CCSCC3=NC=CC =C3)[NH]C=C1C C2=CC=CC=C2)= O [NH2]/C([NH2])= N\C1=NC(C2=CC =CC=C2)=CS1 CC1=CSC(/N=C([ NH2])/[NH2])=N1 N/C(N)=N\C1=NC (C2=CC=CC(N)= C2)=CS1 CC([NH]C1=CC= CC(C2=CSC(/N= C(N)\N)=N2)=C1) =O C[NH]/C([NH]C1 =CC(C2=CSC(/N= C(N)/N)=N2)=CC =C1)=N/C#N CN(CC2=CC=C(O 2)CSCC[NH]C1= C(N(=O)=O)C=C[ NH]1)C CN(CC3=CC=C(O 3)CSCC[NH]C1= C(N(=O)=O)C(CC 2=CC=CC=C2)=C [NH]1)C CN(CC1=CC=C(C 2=CC([NH]C3=C( N(=O)=O)C=C[N H]3)=CC=C2)O1) C CN(CC1=CC=NC( C2=CC([NH]C3= C(N(=O)=O)C=C[ NH]3)=CC=C2)= C1)C CC([NH]CCCOC1

I

V

V V 271

I

V T 272

T

I

T

T I

I C

I T 275 T I 276

V

T V 277

I

T I

T

I

I

T I

I

C T 281

V

T C 282

C

T C 283

I

270

V 273

0.1222

Y

Y Y

-0.0200

0.0109

0.0464

-0.0478

Y

Y Y

0.6900

0.5001

0.4456

0.6541

Y

Y Y

0.4400

0.2183

0.0357

0.4313

Y

Y Y

0.2200

0.1421

0.4539

0.1318

Y

Y Y

-0.4300

-0.2274

-0.2416

-0.1331

Y

Y Y

-0.3101 0.1936

-0.1476 0.1814

-0.3461 0.2939

Y Y

Y Y Y Y

-0.4820

-0.2385

-0.3141

-0.4049

Y

Y Y

0.1800

0.1227

0.2087

0.2272

Y

Y Y

0.0950

0.1310

0.2688

0.0928

Y

N Y

1.0000

0.3713

0.3892

0.5009

N

Y Y

-0.3000

-0.1389

-0.2753

-0.1117

Y

Y Y

-0.3450

-0.3435

-0.1395

-0.3770

Y

Y Y

-0.1470

-0.2873

-0.2662

-0.3198

Y

Y Y

-0.6020 0.2600


278


274

0.1425


I

0.1454


V

-0.2400


T T 269


T

=CC(CN2CCCCC 2)=CC=C1)=O O=C(C3=CC=CC =C3)[NH]CCCOC 1=CC=CC(CN2C CCCC2)=C1 OCCCOC1=CC(C N2CCCCC2)=CC =C1 C1(NCCCOC2=C C(CN3CCCCC3)= CC=C2)=CC=CC =N1 C1(NCCCOC2=C C=CC(CN3CCCC C3)=C2)=NC=CS 1 C(NCCCOC2=CC (CN4CCCCC4)=C C=C2)3=NC1=C( O3)C=CC=C1 CCCN(CCC(C=C C(O)=C1N2)=C1C C2=O)CCC NC(SC)=N FC1=CC=C(C(CC CN2CCC3(N(C4= CC=CC=C4)CNC 3=O)CC2)=O)C= C1 CC2=CN(C([NH] C2=O)=O)C1OC( C=C1)CO CN1CCCCC1CC N3C2=C(SC4=CC =C(S(C)(=O)=O)C =C34)C=CC=C2 CN(CC2=C1C=C C=CC1=CC=C2)C /C=C/C#CC(C)(C) C CC(C)(OC(CCCC 1=CC=C(N(CCCl) CCCl)C=C1)=O)C CN1C=NC(N(C([ NH]2)=O)C)=C1C 2=O CN2C(N(C1=C(C 2=O)NC=N1)C)= O CCCC(C(C(NC(N


T 279

280


I

I

C 290

V

C I

291

I

T I

293

T

I

I

294

I

V I

295

I

I

I

296

T

I

I

297

C T

V V 299 I I 300

T T

T T 301 T T 302

C

I

V

C T 304


289

C 303

N Y

0.2830

0.4631

0.3994

0.4431

Y

N N

0.4000

0.3926

0.1452

0.3246

N

N Y

-0.8200

-0.6138

-0.7376

-0.7840

Y

N Y

-0.0826

-0.0229

-0.0117

Y

Y Y

0.2069

0.5778

0.2826

N

N N

1.0707

1.1393

0.7612

Y

Y Y

0.8900

0.6497

0.9401

0.6338

Y

N N

-1.1550

-0.3057

-0.2434

-0.2122

Y

Y Y

-0.4690

-0.0409

-0.1093

0.0260

N

N Y

-0.0400 -0.0600

-0.2397 0.0977

-0.0981 0.3854

-0.1593 0.2641

Y Y

Y Y Y Y

-0.4200 -0.7200

-0.2504 -0.8285

-0.0744 -0.8745

-0.1937 -0.9819

Y N

Y N N Y

0.1400

0.0155

0.2670

0.2144

Y

Y Y

0.5000

0.2142

0.3303

0.2511

Y

Y Y

0.2800

0.5700


I

Y


I

0.0446


T

0.1218


T T 286

0.0780


T

-0.1600

1.4400


T 285


I


I

C1=O)=S)=O)1CC )C S=C(N2CCC(C3= CNC=N3)CC2)NC 1CCCCC1 CSC4=CC=C3SC1 =C(N(C3=C4)CC C2CCCCN2C)C= CC=C1 CC3CC1=C(C4CC C2(C(C34)CCC(C #C)2O)C)CCC(C1 )=O CN/C(NCCSCC1= CSC(/N=C(N)/N)= N1)=N/C#N CC(NCC(COC1= CC(C)=CC=C1)O) C CC3=NN=C4CN= C(C2CC(CCC2N3 4)Cl)C1=CC=CC= C1Cl CN1CCN(CCCN3 C2=C(SC4=C3C( C(F)(F)F)=CC=C4 )C=CC=C2)CC1 CN(CC/C=C2\C1= C(CCC3=C2C=C C=C3)C=CC=C1) C CCCC(C(O)=O)C CC COC2=CC=C(C= C2OC)CCN(CCC C(C(C)C)(C1=CC( OC)=C(OC)C=C1) C#N)C O CN(CCC1=CC=C C=N1)C NCCC1=NC=CS1 CC2=CN(C(NC2= O)=O)C1CC(C(O1 )CO)N=N#N N1(CC2=CC=CC( OCCCNC4=NC3= C(S4)C=CC=C3)= C2)CCCCC1 ClC3=CC=C2NC( CN=C(C2=C3)C1

T V 308

T

I

T 309

T

I

I

I

T T 312

V

C I

313

V

I

314

T

T T 315

V

I

T 316

T

I

T 317

I

T V 320

I

T I


I

311

321

Y

Y Y

-0.6700

-0.6341

-0.4257

-0.5771

N

Y N

-0.6600

-0.3398

-0.4337

-0.3910

Y

Y Y

-0.1200

-0.1384

-0.3104

-0.1140

Y

Y Y

-0.1800

-0.7678

-0.6203

-0.6555

Y

N N

-0.9790

-1.0932

-0.8303

Y

N Y

-1.5400

-1.2423

-1.4554

-1.5356

Y

N N

-0.5370

-0.6417

-0.7266

Y

Y Y

-0.7300

-0.3356

-0.5184

-0.4495

Y

Y Y

-0.2700

-0.4449

-0.6149

-0.4275

Y

Y Y

-0.2800

-0.4611

-0.4851

-0.3241

Y

Y Y

-0.4600

0.2161

0.1102

0.0985

Y

Y Y

0.6900

0.2578

0.2711

0.3884

Y

Y Y

0.4400

0.1438

0.1235

0.1377

Y

Y Y

-1.5700


I

-0.7901


T T 307

306

-0.6008


I

I

-0.9213


I

-1.1700

-1.1200


T


T T 305


T

=CC=CC=C1)=O CN(C1=NC=CC(C 2=NNC(N)=N2)= C1)C BrC1=CC=CN=C1 CSCCNC2=C(C= CN2)N(=O)=O O=N(C(C=CN2)= C2NCCSCC1=NC =CC=C1)=O O=N(C(C(CC3=C C=CC=C3)=CN2) =C2NCCSCC1=N C=CC=C1)=O N/C(N)=N/C1=NC (C2=CC=CC=C2) =CS1 N/C(N)=N/C1=NC (C2=CC(NC(C)=O )=CC=C2)=CS1 N/C(N)=N\C1=NC (C2=CC(N/C(NC) =N/C#N)=CC=C2) =CS1 CN(CC1=CC=C(S CCNC2=C(C=CN 2)N(=O)=O)O1)C CN(CC1=CC=C(S CCNC2=C(C(CC3 =CC=CC=C3)=C N2)N(=O)=O)O1) C CN(CC1=CC=C(C 2=CC=CC(NC3=C (C=CN3)N(=O)=O )=C2)O1)C CN(CC1=CC(C2= CC(NC3=C(C=CN 3)N(=O)=O)=CC= C2)=NC=C1)C O=C(C)NCCCOC 1=CC(CN2CCCC C2)=CC=C1 N1(CC2=CC=CC( OCCCNC3=NC= CC=C3)=C2)CCC CC1 N1(CC2=CC=CC( OCCCNC3=NC= CS3)=C2)CCCCC 1


I

T 325

T

I

C 326

I

I

C 327

I

C I

328

0.2414

0.5764

0.1717

Y

Y Y

0.3000

0.3229

0.0803

0.2988

Y

Y Y

-0.3400

-0.0011

0.1580

-0.0103

Y

Y Y

-0.3000

-0.4750

-0.5072

Y

Y Y


I

0.2200

-0.5039


C V 324

N1(CC2=CC(OCC CNC4=NC3=C(O 4)C=CC=C3)=CC =C2)CCCCC1 CCC(NC(C2=C(C (C4=CC=CC=C4) =NC3=C2C=CC= C3)C)=O)C1=CC= CC=C1 NC(N3C1=CC=C C=C1C2CC2C4= CC=CC=C34)=O O=C3C1=C(N2C= NC(C4=NOC(C(C )C)=N4)=C2CN3C )C=CC=C1Cl O=C3C1=C(N2C= NC(C4=NOC(C(C )(O)C)=N4)=C2C N3C)C=CC=C1Cl O=C3C1=C(N2C= NC(C4=NOC(C(C )(O)CO)=N4)=C2 CN3C)C=CC=C1 Cl

-1.3400


I

-1.8200

-0.9291

-0.8830

-0.9288

Y

Y Y

-1.1428

-1.0423

-1.1795

Y

Y Y

322

T I

T

*) T = training set; I = invisible training set; C = calibration set; V = validation set.
**) Y = the compound falls within the domain of applicability; N = the compound falls outside the domain of applicability.

Table 3 The statistical characteristics of the QSAR models for logBB (n is the number of compounds in a set, r2 is the determination coefficient, RMSE is the root-mean-square error).

                                                          DCW1(T*,N*), Eq. 1  DCW2(T*,N*), Eq. 2
Split  Method                   Set                 n     r2         RMSE     r2         RMSE
1      Traditional scheme       Training            205   0.7587     0.299    0.7265     0.318
1      Traditional scheme       Calibration         43    0.8569     0.262    0.8350     0.261
1      Traditional scheme       Validation          43    0.7324     0.291    0.6746     0.325
1      Balance of correlations  Training            101   0.6989     0.310    0.6767     0.322
1      Balance of correlations  Invisible training  104   0.7563     0.353    0.7085     0.365
1      Balance of correlations  Calibration         43    0.9143     0.224    0.8753     0.229
1      Balance of correlations  Validation          43    0.8003*    0.272    0.7297     0.298
2      Traditional scheme       Training            210   0.7338     0.310    0.7066     0.326
2      Traditional scheme       Calibration         41    0.7903     0.239    0.8403     0.222
2      Traditional scheme       Validation          40    0.8619     0.268    0.7542     0.339
2      Balance of correlations  Training            103   0.7080     0.336    0.6803     0.351
2      Balance of correlations  Invisible training  107   0.6783     0.335    0.6572     0.344
2      Balance of correlations  Calibration         41    0.8599     0.216    0.8620     0.220
2      Balance of correlations  Validation          40    0.8948     0.268    0.8612     0.299
3      Traditional scheme       Training            209   0.7511     0.306    0.7194     0.324
3      Traditional scheme       Calibration         41    0.7128     0.337    0.6591     0.398
3      Traditional scheme       Validation          41    0.7796     0.278    0.7222     0.311
3      Balance of correlations  Training            104   0.7054     0.324    0.6703     0.343
3      Balance of correlations  Invisible training  105   0.7207     0.348    0.6884     0.371
3      Balance of correlations  Calibration         41    0.9296     0.190    0.8775     0.238
3      Balance of correlations  Validation          41    0.8964     0.175    0.7988     0.240

*) Models with the highest predictive potential (maximal correlation coefficient) are indicated in bold.
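The two statistics reported in Table 3 can be computed as in the following minimal sketch. It assumes that r2 is taken as the squared Pearson correlation between experimental and calculated logBB, which the table itself does not state. The function names and the sample values are illustrative only and are not taken from the data set.

```python
import math


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values
    (assumed definition of r2 for this sketch)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


if __name__ == "__main__":
    # Hypothetical logBB values, used only to exercise the functions.
    experimental = [0.50, -0.20, 1.00, -1.10]
    calculated = [0.40, -0.10, 0.80, -0.90]
    print(r_squared(experimental, calculated), rmse(experimental, calculated))
```

Applying both functions to each set (training, invisible training, calibration, validation) separately gives one row of the table per set.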

Table 4 Comparison of logBB models suggested in the literature (n is the number of compounds in a set, r2 is the determination coefficient).

No.: 1; 2; 3; 4; 5; 6; 7; 8
References: (Garg & Verma, 2006); (Konovalov et al., 2007); (Crivori et al., 2000); (Iyer et al., 2002); (Ooms et al., 2002); (Chena et al., 2009); (Hou & Xu, 2002); In this work
Comment**: ANN; MLR, kNN; 3D-QSAR; MI-QSAR; 3D-QSAR; VPCP; VPCP; SMILES
ntraining: 21; 56; 79; 150; 260; 120; 291; 250
r2training: 0.88; 0.845; 0.78; 0.69; 0.86; 0.672; q2=0.766*; 0.740
nvalidation: 63; 21; 41
r2validation: 0.58; 0.518; 0.896

*) q2 is the cross-validated correlation coefficient.
**) ANN = artificial neural networks; MLR = multiple linear regression; kNN = k nearest neighbors; 3D-QSAR = QSAR based on a three-dimensional representation of molecules; MI-QSAR = QSAR based on parameters of membrane interactions; VPCP = vector of physicochemical properties.

ACCEPTED MANUSCRIPT

Table 5 Correlation weights (CWs) of the attributes that promote an increase of logBB (all correlation weights are positive) or a decrease of logBB (all correlation weights are negative).

Split

Attribute

CWs run 1

CWs run 2

CWs run 3

3.24511 3.81416 2.49742 2.87618

2.12423 2.68351 1.93346 5.31253

1.37910 2.12465 3.06416 3.05954

-6.31039 -0.87153 -4.06125

-7.24883 -2.19137 -4.62422

-6.31605 -1.43359 -4.62395

0.81302 2.50269 4.74762 4.75052

2.12609 0.43945 5.12636 3.80957

-8.00464 -1.62872 -2.93503

Frequency in training set

Frequency in invisible training set

Frequency in calibration set

59 37 23 23

20 12 7 6

66 60 36

69 59 35

27 16 10

1.74603 1.00255 4.93289 3.62906

61 40 26 24

68 45 27 30

19 11 5 13

-4.81494 -0.87586 -3.12997

-5.37674 -1.43289 -1.99677

64 56 35

75 70 43

28 15 8

1.93776 2.12858 6.24741 3.43855

2.31176 2.87755 4.93620 1.18458

0.81079 4.18775 3.43821 2.50436

61 38 28 17

63 45 29 30

21 12 11 10

-10.6200 -2.56148 -2.93355

-6.87903 -1.25434 -3.49987

-8.37164 -1.06561 -2.00268

66 55 37

69 66 36

28 17 11

3


Increase: N...C.......; =...2.......; $10011000000; N...1.......
Decrease: O...........; C...2.......; 2...(.......


Increase: N...C.......; =...2.......; N...1.......; $10011000000
Decrease: O...........; C...2.......; 2...(.......


2


61 41 32 23


Increase: N...C.......; =...2.......; $10011000000; N...1.......
Decrease: O...........; C...2.......; 2...(.......


1
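As a hedged illustration of how correlation weights such as those in Table 5 enter a prediction: in CORAL-type models the descriptor is a sum of the correlation weights of the attributes found in a SMILES string, and the endpoint is taken as a linear function of that descriptor. The sketch below assumes this form. The weights, the intercept `c0`, and the slope `c1` are hypothetical placeholders, not values from Table 5, and the attribute list is supplied by hand rather than parsed from SMILES.

```python
def dcw(attributes, weights):
    """Sum the correlation weights of the attributes found in one molecule.

    Attributes that have no entry in `weights` (for example, attributes
    treated as rare) contribute zero in this sketch.
    """
    return sum(weights.get(attribute, 0.0) for attribute in attributes)


def predict_logbb(attributes, weights, c0, c1):
    """Assumed linear model: logBB = c0 + c1 * DCW."""
    return c0 + c1 * dcw(attributes, weights)


if __name__ == "__main__":
    # Hypothetical weights: positive values promote an increase of logBB,
    # negative values promote a decrease.
    example_weights = {"N...C.......": 2.0, "O...........": -6.0}
    example_attributes = ["N...C.......", "N...C.......", "O..........."]
    print(predict_logbb(example_attributes, example_weights, c0=0.5, c1=0.1))
```

With positive `c1`, each occurrence of an attribute with a positive weight raises the calculated logBB and each occurrence of an attribute with a negative weight lowers it, which is the sense in which the attributes are called promoters of increase or decrease.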


Table 6 A collection of molecules with large and small logBB values

SMILES                                                        Experimental logBB  Calculated logBB  Comments

Large logBB
CN4CC3C1=CC=CC(C)=C1OC2=CC=CC=C2C(CC4)3O                      0.82                0.62              Presence of HARD='$10011000000'
OC(C2=CC=CC=C2)(C3C4CC(C=C4)C3)CCN1CCCCC1                     0.85                0.97              Presence of HARD='$10011000000'
OC1=CC2=C(CC3C(C)C(C)2CCN3C/C=C(C)\C)C=C1                     0.54                0.64              Presence of HARD='$10011000000'

Small logBB
CCC1=CC=CC2=C1[NH]C3=C2CCOC(CC([OH])=O)3CC                    -1.42               -1.44             Presence of SSk='C...2.......'
C[NH]/C([NH]C1=CC(C2=CSC(/N=C(N)/N)=N2)=CC=C1)=N/C#N          -1.54               -1.35             Presence of SSk='C...2.......'

Table 7 Example of modification of the molecular structure to increase logBB: the fragment "N...C......." is added.

Structure           SMILES                                     logBB (calc)
Original structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCCCC1   0.2578
Modified structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCNCC1   0.2700


Table 8 Example of modification of the molecular structure to decrease logBB: the fragment "O..........." is added.

Structure           SMILES                                     logBB (calc)
Original structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCCCC1   0.2578
Modified structure  N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCOCC1   0.2264
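The substitutions in Tables 7 and 8 can be checked mechanically by comparing the SMILES tokens before and after the modification. The tokenizer below is a simplified regular expression written for this illustration; it is not the CORAL attribute extractor, and the function names are assumptions.

```python
import re
from collections import Counter

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens,
# single-letter atoms, ring-closure digits, bonds and branch symbols.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|\d|[=#()/\\+\-.@%]")

ORIGINAL = "N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCCCC1"
MODIFIED_TABLE_7 = "N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCNCC1"
MODIFIED_TABLE_8 = "N1(CC2=CC=CC(OCCCNC3=NC=CC=C3)=C2)CCOCC1"


def tokens(smiles):
    """Split a SMILES string into tokens with the simplified pattern above."""
    return TOKEN.findall(smiles)


def token_change(original, modified):
    """Return (added, removed) token counts between two SMILES strings."""
    before = Counter(tokens(original))
    after = Counter(tokens(modified))
    # Counter subtraction keeps only positive differences.
    return dict(after - before), dict(before - after)


if __name__ == "__main__":
    print(token_change(ORIGINAL, MODIFIED_TABLE_7))
    print(token_change(ORIGINAL, MODIFIED_TABLE_8))
```

In both tables one ring carbon of the piperidine fragment is replaced, by nitrogen in Table 7 and by oxygen in Table 8, and the calculated logBB moves from 0.2578 to 0.2700 and 0.2264, respectively.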


Figure 1 The scheme of building up a model for logBB by the CORAL software


Figure 2 The tasks of training, invisible training, calibration, and validation sets in building up a model


Figure 3 The scheme for defining T* and N* for the Monte Carlo optimization