Using quantitative structure activity relationship models to predict an appropriate solvent system from a common solvent system family for countercurrent chromatography separation


Accepted Manuscript

Title: Using Quantitative Structure Activity Relationship Models to Predict an Appropriate Solvent System from a Common Solvent System Family for Countercurrent Chromatography Separation

Author: Siân Marsden-Jones, Nicola Colclough, Ian Garrard, Neil Sumner, Svetlana Ignatova

PII: S0021-9673(15)00563-4
DOI: http://dx.doi.org/doi:10.1016/j.chroma.2015.04.020
Reference: CHROMA 356440

To appear in: Journal of Chromatography A

Received date: 10-12-2014
Revised date: 26-3-2015
Accepted date: 8-4-2015

Please cite this article as: S. Marsden-Jones, N. Colclough, I. Garrard, N. Sumner, S. Ignatova, Using Quantitative Structure Activity Relationship Models to Predict an Appropriate Solvent System from a Common Solvent System Family for Countercurrent Chromatography Separation, Journal of Chromatography A (2015), http://dx.doi.org/10.1016/j.chroma.2015.04.020 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Siân Marsden-Jones (a), Nicola Colclough (b), Ian Garrard (a), Neil Sumner (b), Svetlana Ignatova (a)

(a) Advanced Bioprocessing Centre, Institute of Environment, Health and Societies, Brunel University, Uxbridge, Middlesex, UB8 3PH, UK

(b) AstraZeneca UK Limited, Alderley Park, Cheshire, SK10 4TG, UK

Abstract

Countercurrent Chromatography (CCC) is a form of liquid-liquid chromatography. It works by running one immiscible solvent (mobile phase) over another solvent (stationary phase) being held in a CCC column using centrifugal force. The concentration of compound in each phase is characterised by the partition coefficient (Kd), which is the concentration in the stationary phase divided by the concentration in the mobile phase. When Kd is between approximately 0.2 and 2, it is most likely that optimal separation will be achieved. Having the Kd in this range allows the compound enough time in the column to be separated without resulting in a broad peak and long run time. In this paper we report the development of Quantitative Structure Activity Relationship (QSAR) models to predict logKd. The QSAR models use only the molecule’s 2D structure to predict the molecular property logKd.

Key words: countercurrent chromatography, centrifugal partition chromatography, solvent system selection, method development, Quantitative Structure Activity Relationships (QSAR)

Highlights

Quantitative Structure Activity Relationship (QSAR) models have been developed.

The QSAR models have been built for the six HEMWat solvent systems.

A data set of the logKd values of 54 compounds in six HEMWat systems was generated.

1. Introduction

Countercurrent Chromatography (CCC) was invented in 1966 by Ito [1]. In CCC, the compounds partition between two immiscible liquids (phases). One phase (stationary) is retained inside the column, which is spun in planetary motion, whereas the other (mobile phase) is pumped through the column. Separation is achieved as compounds that spend more time in the stationary phase take longer to pass through the column than compounds that spend more time in the mobile phase. CCC has many advantages over traditional liquid-solid chromatography, including total recovery of compound; crude samples containing particulates can also be separated, and higher loading capacities are tolerated [2,3]. CCC is also reproducible and scalable. A disadvantage of CCC is that the choice of the solvent system is currently based on an analyst’s past experience, trial and error or literature analysis. This may mean that systems that would give very well defined chromatography are missed, or that large quantities of time and solvents/samples are used to select an appropriate solvent system. Being able to predict the Kd values of target compounds would speed up the method development phase of the CCC process without time consuming, solvent intensive experiments.

There have been previous attempts to computationally predict the partition coefficients of compounds. Hopmann et al. [4] used the software COSMO-RS to predict Kd values using activity coefficients of the upper and lower phases. The conformation of the molecule plays a very important role in the calculation and is computationally expensive to calculate, taking up to 9 hours for large molecules.

Another method, developed by Li et al. [5], used the UNIFAC (Universal quasichemical functional-group activity coefficients) model. This model uses thermodynamics to calculate Kd. A potential disadvantage of this programme is its dependency on group interaction parameters, which are often limited.

Ren et al. [6] used NRTL-SAC (Non-Random Two Liquid – Segment Activity Coefficient) in combination with UNIFAC and GA (Genetic Algorithm) to predict partition coefficients. This method is not purely computational as some experimental Kd values are needed for the prediction. This is a disadvantage if the compound to be separated is expensive or supply is limited.

The modelling approach that has been investigated in this work is Quantitative Structure Activity Relationships (QSARs). QSARs are relationships that are used to predict a molecular property, in this case logKd, from a molecule’s structure. The logKd is predicted instead of the Kd value as this normalises the experimental values. QSARs offer fast computational predictions. They rely on molecular descriptors which can be calculated manually (for example the number of hydrogen bond acceptors) or from a number of widely available software packages (e.g. ACD Labs logP/D, Daylight/Biobyte ClogP). As long as a complete set of descriptors is available, QSAR models of the type explored in this work can be run in Microsoft Excel or an equivalent. This type of software is widely available, so using the model would be convenient. This could allow more automation of CCC, increasing the technique’s appeal, especially to industry.

The QSAR models that have been built in this study work with the HEMWat solvent systems. These contain heptane (or hexane), ethyl acetate, methanol and water in varying proportions. HEMWat is a very versatile family, as changing the proportions of each solvent changes the polarity of the overall system as well as the polarity difference between the two phases. This control over the polarity allows the system to be adjusted to optimise the partitioning of many different compounds. In the Brunel University CCC literature database containing 2322 papers, 1121 of these (48%) use HEMWat based solvent systems. The next most commonly used solvent system is based on butanol, which was used in 542 papers (23%). This implies that HEMWat is the most commonly used solvent system, making it ideal for this proof of concept study [7]. Garrard [8] adopted a numeric labelling scale from 6 - 28 to denote polarity, within which HEMWat6 was the most polar and HEMWat28 was the least polar. The HEMWat systems denoted 1-6 contain butanol and do not always contain all four of the other solvents. The QSAR approach can be applied to any solvent system. However, in this work we have chosen to focus on the HEMWat solvent system. Traditionally, QSARs have been developed for much simpler two solvent systems such as octanol/water and cyclohexane/water [9]. Therefore, successfully applying the QSAR methodology to the much more complex HEMWat systems would by analogy show that QSARs are likely to be applicable to all solvent system families. Liquid handling robots are commonly used in industry for logP measurements, so could easily be used for fast, accurate partition coefficient measurement. Therefore, measuring Kd values for other solvent system families for QSAR generation would not be too time consuming. This work attempts to develop QSAR models to increase the speed and efficiency of solvent selection in CCC. Through the use of a diverse data set to train each HEMWat QSAR, the aim is that the models will be able to accurately predict logKd values for a large range of molecules.

2. Experimental

2.1 Materials and Chemicals

The solvents used were HPLC grade heptane, ethyl acetate, methanol, acetonitrile and ethanol purchased from Fisher Chemicals (Loughborough, UK). The water used was deionised in house using a Purite Select Fusion purification system (Thame, UK). All compounds were purchased from Sigma Aldrich (Gillingham, UK) (including quality control compound, 2-ethylanthraquinone) with a minimum purity of 95%. Ammonia solution (35%) and TFA (99%) were purchased from Fisher Scientific (Loughborough, UK).

2.2 Apparatus

HPLC analysis was conducted on a HP1100 Agilent system (Stockport, UK) with detection at 254, 260, 275, 295, and 310 nm with a Symmetry C18 column (75 × 4.6 mm I.D., 3.5 μm) (Waters, USA). An Eppendorf Concentrator 5301 (Hamburg, Germany) was used as a centrifuge at 1400 rpm (240 g) at room temperature. The balance was a Sartorius Mechatronics analytical balance 1601A MP8-1 (Epsom, UK) unit with a range from 0.1 mg to 110 g.

2.3 Preparation of two phase solvent system and determination of logKd

The predictive ability of the QSAR is dependent on the accuracy of the experimentally determined partition coefficient values used to train the model. Therefore physical factors were controlled to minimise the experimental error. These included temperature and pH, which were held constant while the compound was in the two phase system. Once each phase had been sampled, it was diluted in ethanol to remove any matrix effect from the solvent system.

To avoid volume variations caused by possible temperature fluctuations when preparing the HEMWat solvent systems, six HEMWat solvent systems were made up by mass according to Table 1 [8] using thermostated solvents (at 20 °C for 20 minutes in a water bath), and the mixtures were left overnight to equilibrate at room temperature. Before sampling, the solvent systems were placed in a 20 °C water bath for 20 minutes. As these solvent systems have been made up by mass as opposed to volume, the final percentage composition is slightly different from the conventional HEMWat systems described by Garrard [8]. Therefore, they have been distinguished by the addition of the letter “m” to the HEMWat numbers.

The six HEMWat systems were chosen as they gave a large polarity range across the whole series. To remove the effect of pH on ionising compounds, such molecules were converted to their neutral form: acidic compounds were run in acidified HEMWat (0.1% TFA in water, replacing pure water) and basic compounds were run in basified HEMWat (1% ammonia solution in water, replacing pure water).

Compounds with a negative ClogP (octanol/water partition coefficient from Biobyte, Inc. of Claremont, CA, USA and Daylight, Laguna Niguel, CA, USA) were dissolved in 1.5 ml of the lower phase of HEMWat until the phase was saturated. Compounds with a positive ClogP were dissolved in the upper phase of the HEMWat system until the phase was saturated. This ensured that the maximum amount of compound was dissolved in the HEMWat system. The solutions were centrifuged (1400 rpm for 30 seconds) to remove all particulates from the supernatant. An aliquot of 400 μl of supernatant was mixed and centrifuged with 1400 μl of the alternative phase (1400 rpm for 30 seconds).

An aliquot of 80 μl of the 1400 μl volume phase and 320 μl of the 400 μl phase were separately diluted using 1 ml of ethanol. To avoid cross contamination, before the lower phase was sampled the remaining upper phase was removed by pipette until no upper phase could be seen on visual inspection. The samples were run on a 10 minute gradient method on the HPLC using a Symmetry C18 column (4.6 × 75 mm, 3.5 μm), at 1 ml/min and 40 °C. The mobile phase consisted of 0.1% aqueous trifluoroacetic acid (solvent A) and acetonitrile (solvent B). The gradient elution program was as follows: 0-6 min, 10% B; 2-8 min, 80% B.
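As an illustration only, the arithmetic behind a logKd value obtained from the two diluted samples might look like the sketch below. It assumes that HPLC peak area is proportional to concentration, that both samples are injected at the same volume, that volumes are additive on dilution, and that the analyst knows which phase acts as the stationary phase. The function and argument names are hypothetical and do not come from the original method.

```python
import math

ETHANOL_UL = 1000.0  # each aliquot was diluted with 1 ml of ethanol


def dilution_factor(aliquot_ul, diluent_ul=ETHANOL_UL):
    """Factor by which an aliquot is diluted (assumes additive volumes)."""
    return (aliquot_ul + diluent_ul) / aliquot_ul


def log_kd(area_stationary, area_mobile, aliquot_stationary_ul, aliquot_mobile_ul):
    """log10(Kd), with Kd = C_stationary / C_mobile.

    Peak areas are corrected for the different dilutions of the two phases.
    Assumes peak area is proportional to concentration (hypothetical helper).
    """
    c_stationary = area_stationary * dilution_factor(aliquot_stationary_ul)
    c_mobile = area_mobile * dilution_factor(aliquot_mobile_ul)
    return math.log10(c_stationary / c_mobile)


def in_preferred_kd_window(log_kd_value, low=0.2, high=2.0):
    """True when Kd lies in the approximate 0.2 to 2 range quoted in the abstract."""
    return low <= 10 ** log_kd_value <= high
```

With the volumes given above, `dilution_factor(80)` is 13.5 and `dilution_factor(320)` is 4.125, so equal peak areas in the two samples do not imply equal concentrations in the two phases.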

The logKd values of a quality control (QC) compound, 2-ethylanthraquinone, were measured in each of the six HEMWat systems alongside the other compounds. The mean and standard deviation for 2-ethylanthraquinone over 15 runs can be found in Table 2.



The compound concentration dependence of the Kd measurement was evaluated using three structurally diverse compounds including a carboxylic acid, a compound class known to dimerise at high concentrations in non-polar media. These 3 compounds were: 3-bromobenzoic acid (16-249 mM), warfarin (5-81 mM) and phenol (30-531 mM). Throughout the range of tested concentrations, the measured Kd value was the same for all three compounds. It was therefore concluded that the concentrations used in this method did not affect the Kd value (see Supplementary data S1, S2 and S3 for Kd results).

2.4 Data sets

A data set of the logKd values of 54 compounds in six HEMWat systems was measured. Each logKd value was measured in triplicate and averaged. The set of 54 compounds chosen contained 38 neutral compounds, 10 acidic compounds and 6 basic compounds. From this data set, 50 compounds were selected as a training set to build the QSAR and 4 compounds were selected as a test set. Figure 1 shows a principal component analysis (PCA) carried out on all 54 compounds. The first principal component is chosen to account for the maximum amount of variance and therefore describe most of the variability of the model. The second component is fixed as orthogonal to the first component and then made to cover the maximum variance possible under this constraint. The PCA was used to ensure a spread of property space of the training set compounds and also to select the test set for the models. This PCA (Figure 1) was used to select 4 compounds that were well spread across parameter space to ensure the model was tested on a diverse range of compounds. The four compounds chosen were Biphenyl, Tolbutamide, Quinine and Benzoquinone as they represent distinct areas of parameter space to test.

2.5 Generating QSAR models

The QSAR models were developed using two dimensional molecular descriptors from CLab, an in-house AstraZeneca software package which generates 196 descriptors for each compound (see supplementary data table S4). The descriptors are generated using SMILES (Simplified Molecular-Input Line-Entry System) and fall into seven main categories: lipophilicity, hydrogen bonding, size and shape, charge and polarity, atom counts, topology and drugability (i.e. Lipinski rule of five [10]). In addition to these 196 parameters, five Abraham parameters (Hydrogen bonding acidity, A, Hydrogen bonding basicity, B, Polarisability, S, Excess molar refraction, E, and McGowan volume, V) [11] were investigated as they have been used extensively for modelling partitioning [12]. It was decided to use Partial Least Squares (PLS) to explore the ability to generate predictive models for logKd.

SIMCA version 13 (Umetrics, Umea, Sweden) was used to perform the PLS regression. The initial QSAR models were generated using the automated fit tool within the software with the 196 2D descriptors. Any descriptors with a Variable Importance in the Projection (VIP) value of less than 1 were removed from the model and the PLS models rebuilt. This was applied for a second time to the resulting model, leading to three PLS models for each HEMWat system. The models were then compared using the Root Mean Square Error (RMSE) and R2 values for the training and test sets. The R2 term quantifies how much of the variation in the response, logKd, is explained by the model. An R2 value of 1 is indicative of a perfect model and a value of 0 suggests the model is very poor. We considered an R2 value above 0.80 as acceptable. RMSE is calculated using Equation 1. The smaller the RMSE the better the model, but we considered a RMSE value of less than 0.5 as acceptable.
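A minimal sketch of how the two acceptance statistics could be computed outside SIMCA is given below. It assumes the usual root-mean-square definition of RMSE and takes R2 as the coefficient of determination (1 minus the ratio of residual to total sum of squares); the exact R2 convention used by the software for the test set is not stated here, so that choice is an assumption, as are the function names.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted logKd values."""
    n = len(observed)
    squared_errors = ((p - o) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sum(squared_errors) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total (assumed convention)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


def meets_acceptance_criteria(observed, predicted, r2_min=0.80, rmse_max=0.5):
    """Apply the thresholds quoted in the text: R2 above 0.80 and RMSE below 0.5."""
    return r_squared(observed, predicted) > r2_min and rmse(observed, predicted) < rmse_max
```

The thresholds are exposed as parameters so that stricter or looser criteria can be tried without editing the functions.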

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

Equation 1 - Root Mean Square Error (RMSE), where $\hat{y}_i$ is the predicted value, $y_i$ is the observed value and $n$ is the number of values compared.

The R2 values of the training sets were calculated. The predictive performance of the models was assessed according to the cross validated coefficient of correlation Q2. In addition, an external validation of the models was undertaken by predicting the logKd of the external test set of 4 compounds and calculating the RMSE and R2. One model was chosen for each HEMWat system on the basis of these statistics.

3. Results and Discussion

The best PLS model for each of the HEMWat systems was selected on the basis of the R2 and Q2 values. The QSAR equations are explicitly described in the supplementary data S5-S10. Summing each coefficient multiplied by its corresponding compound specific value, together with the residual coefficient, predicts the logKd value for the compound. A summary of the statistics for these models can be found in Table 3. The best models for HEMWat8m, 14m, 17m and 20m were obtained after all the descriptors with a VIP value of less than 1 were removed once. The best performing models for HEMWat22m and HEMWat26m were achieved after the descriptors with a VIP value of less than 1 were removed twice. The models for five of the six HEMWat systems produced an R2 value for the training set within our acceptance criteria of greater than 0.8. HEMWat8m was the exception, with an R2 value for the training set of 0.69. This suggests that the model for HEMWat8m may not produce as accurate predictions as the other five models. When the RMSE for the test set is analysed, the models for HEMWat17m, 20m, 22m and 26m are all within our target acceptance criteria of 0.5.
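The way a QSAR equation of this form turns descriptor values into a predicted logKd can be sketched as a weighted sum plus a constant term. The descriptor names and numbers below are invented purely for illustration; the real coefficients are those listed in the supplementary data, and any scaling or centring of descriptors applied by the modelling software is not represented here.

```python
def predict_log_kd(coefficients, descriptor_values, constant):
    """Sum coefficient * descriptor value over all model descriptors, plus a constant.

    coefficients and descriptor_values are dicts keyed by descriptor name.
    A KeyError is raised if a descriptor required by the model is missing.
    """
    weighted_sum = sum(
        coefficient * descriptor_values[name]
        for name, coefficient in coefficients.items()
    )
    return constant + weighted_sum


# Hypothetical two-descriptor model (values are NOT from the paper).
example_coefficients = {"lipophilicity": 0.5, "h_bond_donors": -0.3}
example_compound = {"lipophilicity": 2.0, "h_bond_donors": 1.0}
example_prediction = predict_log_kd(example_coefficients, example_compound, constant=0.1)
```

Because the calculation is a plain weighted sum, the same arithmetic can be reproduced in a spreadsheet once the descriptor values are available.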

Figure 2 shows the difference in the measured and predicted values of the test set (see Supplementary data S11 for experimental and predicted values). Of the 24 predictions, 70% are within +/- 0.5 log units and 79% are within +/- 0.52 log units. Interestingly, it is the systems with higher HEMWat numbers that produce the most accurate predictions.

Figure 3 shows the coefficient plots for the QSARs for each of the six HEMWat systems. From each HEMWat model’s coefficient plot, the descriptors contributing to the model can be observed. Interestingly, the octanol/water based lipophilicity descriptors dominate the lower HEMWat numbers whilst hydrogen bonding descriptors are more prevalent in the models of the higher HEMWat numbers. This possibly reflects the change in the organic phase from mostly ethyl acetate to mostly heptane as the HEMWat number increases. As heptane is unable to hydrogen bond to solutes, hydrogen bonding descriptors become important and show negative coefficients, since hydrogen bonding favours solutes partitioning into the aqueous phase.

3.1 Predicting a solvent system for optimal separation conditions

The QSAR models generated using the PLS regression method were used to predict the logKd values of the 4 test set compounds in the six HEMWat systems. By applying a linear fit between the six HEMWat system numbers and the predicted logKd for each compound, the six individual predicted logKd values can be used to suggest the HEMWat system in which the compound will have a logKd of zero (Kd equal to one). This system should provide optimal separation conditions.
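The linear-fit step can be sketched as an ordinary least-squares line through the six (HEMWat number, predicted logKd) points, solved for the HEMWat number at which logKd is zero. The logKd values in the example are made up so that the answer is easy to check by hand; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hemwat_number_at_log_kd_zero(hemwat_numbers, log_kds):
    """HEMWat number at which the fitted line crosses logKd = 0 (Kd = 1)."""
    slope, intercept = fit_line(hemwat_numbers, log_kds)
    if slope == 0:
        raise ValueError("logKd does not vary with HEMWat number; no crossing point")
    return -intercept / slope


HEMWAT_NUMBERS = [8, 14, 17, 20, 22, 26]
# Invented logKd values lying exactly on logKd = 0.2 * (17 - HEMWat number).
example_log_kds = [1.8, 0.6, 0.0, -0.6, -1.0, -1.8]
suggested_system = hemwat_number_at_log_kd_zero(HEMWAT_NUMBERS, example_log_kds)
```

For these invented values the crossing point is HEMWat number 17. With real predictions the result will generally fall between tabulated systems, or outside the 8 to 26 range, in which case it is an extrapolation.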

Table 4 shows a good comparison between the HEMWat systems predicted to have optimal separation based on extrapolating the experimentally determined logKd results and extrapolating the QSAR predicted logKd values for three of the four test compounds. The linear relationships used to predict this optimal system provide high R2 values and low RMSE values, indicating a strong linear fit for both the predicted logKd values and the experimentally determined logKd values.

4. Conclusion

In this work, for the first time, QSAR models were generated to predict the logKd values of compounds in HEMWat systems from their molecular structure alone. Of the 24 logKd predictions made for the 4 test compounds, 71% were within +/- 0.5 log units using the PLS QSARs. These QSARs will be developed further as they have the potential to speed up the solvent system selection for CCC.

5. Acknowledgements

The first author especially would like to thank AstraZeneca and the Engineering and Physical Science Research Council (EPSRC) for funding this project as part of her PhD. Furthermore, the authors are grateful to Dr Jonathan Huddleston for support and advice at the beginning of this work. Thanks are also extended to Professor Michael Abraham and Dr Joelle Gola for their time and assistance.

6. Bibliography

[1] Y. Ito, M. Weinstein, I. Aoki, R. Harada, E. Kimura, K. Nunogaki, Nature, The Coil Planet Centrifuge, 1966, Vol. 212, pp. 985-987.

[2] A. Marston, K. Hostettmann, Journal of Chromatography A, Developments in the Application of Counter-Current Chromatography to Plant Analysis, 2006, Vol. 1112, pp. 181-194.

[3] L. Chen, Q. Zhang, G. Yang, L. Fan, J. Tang, I. Garrard, S. Ignatova, D. Fisher, I. A. Sutherland, Journal of Chromatography A, Rapid Purification and Scale-up of Honokiol and Magnolol using High-Capacity High-Speed Counter-Current Chromatography, 2007, Vol. 1142, pp. 115-122.

[4] E. Hopmann, W. Arlt, M. Minceva, Journal of Chromatography A, Solvent system selection in counter-current chromatography using the conductor-like screening model for real solvents, 2011, Vol. 1218, pp. 242-250.

[5] Z. Li, Y. Zhou, F. Chen, L. Zhang, Y. Yang, Journal of Liquid Chromatography and Related Technologies, Property Calculation and Prediction for Selecting Solvent Systems in CCC, 2003, Vol. 26, pp. 1397-1415.

[6] D-B Ren, Z-H Yang, Y-Z Liang, Q. Ding, C. Chen, M-L Ouyang, Journal of Chromatography A, Correlation and prediction of partition coefficients using non-random two-liquid segment activity coefficient model for solvent system selection in counter-current chromatography separation, 2013, Vol. 1301, pp. 10-18.

[7] Krystyna Skalicka, Internal Report, Medical University of Lublin.

[8] I. J. Garrard, Journal of Liquid Chromatography & Related Technologies, Simple Approach to the Development of a CCC Solvent Selection Protocol Suitable for Automation, 2005, Vol. 28, pp. 1923-1935.

[9] M. H. Abraham, H. S. Chadha, Applications of a solvation equation to drug transport properties, in Lipophilicity in Drug Action and Toxicology, Edited by V. Pliska, B. Testa, H. Van der Waterbeemd, 1996, VCH Weinheim.

[10] C. A. Lipinski, F. Lombardo, B. W. Dominy, P. J. Feeney, Advanced Drug Delivery Reviews, Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings, 1997, Vol. 23, pp. 3-25.

[11] M. H. Abraham, Chemical Society Reviews, Scales of solute Hydrogen-bonding: Their Construction and Application to Physicochemical and Biochemical Processes, 1993, Vol. 22, pp. 73-83.

[12] M. H. Abraham, J. M. R. Gola, A. Ibrahim, W. E. Acree Jr, X. Liu, Pest Management Science, The prediction of blood-tissue partitions, water-skin partitions and skin permeation for agrochemicals, 2014, Vol. 70, pp. 1130-1137.

Figure Legends

Figure 1 - Principal Component Analysis (PCA) used to select Tolbutamide, Quinine, Biphenyl and Benzoquinone.

Figure 2 - The difference between the test set experimentally determined logKd values and the predicted logKd values.

Figure 3 - The coefficient plot for each of the PLS models built for (a) HEMWat8m (b) HEMWat14m (c) HEMWat17m (d) HEMWat20m (e) HEMWat22m and (f) HEMWat26m (see supplementary information S4 for descriptor definitions).

Table 1 - Ratios of solvents to make up the selected HEMWat solvent systems (heptane, ethyl acetate, methanol and water)

HEMWat system number [7]    Heptane (g)    Ethyl Acetate (g)    Methanol (g)    Water (g)
8m                          1              9                    1               9
14m                         3              6                    3               6
17m                         4              4                    4               4
20m                         6              3                    6               3
22m                         6              2                    6               2
26m                         9              1                    9               1

Table 2 - The average and standard deviation for the experimentally measured logKd value for the QC compound, 2-ethylanthraquinone, across 15 HPLC runs in triplicate.

HEMWat System    Average    Standard Deviation
8m               3.19       0.36
14m              2.07       0.13
17m              1.18       0.07
20m              0.70       0.10
22m              0.52       0.04
26m              0.19       0.02

Table 3 - Statistics for the best performing QSARs generated using Partial Least Squares.

HEMWat number    R2 training set    Q2 training set    R2 test set    RMSE test set
8m               0.69               0.66               0.86           0.66
14m              0.83               0.69               0.67           0.60
17m              0.81               0.65               0.81           0.43
20m              0.85               0.68               0.83           0.44
22m              0.89               0.80               0.93           0.27
26m              0.92               0.87               0.82           0.47

Table 4 – The HEMWat system most likely to provide optimised chromatography QSAR predicted

R2

RMSE

Experimentally determined

R2

RMSE

0.92

0.15

Compound HEMWat system 0.98

0.19

2

Biphenyl

25

0.99

0.29

27

Quinine

16

0.95

0.34

16

Tolbutamide

14

0.88

0.39

16

0.98

0.20

0.98

0.25

0.95

0.31

13

Benzoquinone

HEMWat system

[Figure 1: Principal Component 2 plotted against Principal Component 1; legend: Training set, Test Set]

[Figure 2: Difference between experimentally determined logKd and predicted logKd, plotted against HEMWat number (8, 14, 17, 20, 22, 26); legend: Benzoquinone, Biphenyl, Tolbutamide, Quinine]

[Figure 3: panels a, b, c, d, e, f]