Fluid Phase Equilibria 309 (2011) 8–14
Estimation of Hansen solubility parameters using multivariate nonlinear QSPR modeling with COSMO screening charge density moments

Gábor Járvás a,∗, Christian Quellet b, András Dallos a

a Department of Chemistry, University of Pannonia, H-8200 Veszprém, Egyetem str. 10, Hungary
b Givaudan Schweiz AG, CH-8600 Dübendorf, Ueberlandstrasse 138, Switzerland

Article info

Article history: Received 3 May 2011; received in revised form 17 June 2011; accepted 22 June 2011; available online 29 June 2011

Keywords: Hansen; Solubility parameter; Estimation; QSPR; Artificial neural network; COSMO; Ionic liquids
Abstract

New multivariate nonlinear QSPR models based on artificial neural networks (ANN) were developed for the prediction of the components of the three-dimensional Hansen solubility parameters (HSPs) using COSMO-RS sigma-moments as molecular descriptors. The sigma-moments are obtained from high quality quantum chemical calculations using the continuum solvation model COSMO and a subsequent statistical decomposition of the resulting polarization charge densities. The models for HSPs were built on a training/validation data set of 128 compounds having a broad diversity of chemical characters (alkanes, alkenes, aromatics, haloalkanes, nitroalkanes, amines, amides, alcohols, ketones, ethers, esters, acids, ion-pairs: amine/acid associates, ionic liquids). The prediction power of the correlation equation models for HSPs was validated on a test set of 17 compounds with various functional groups and polarity, among them drug-like molecules, organic salts, solvents and ion-pairs. It was established that COSMO sigma-moments can be used as excellent independent variables in nonlinear structure–property relationships using ANN approaches. The resulting optimal multivariate nonlinear QSPR models involve the five basic sigma-moments having defined physical meaning and possess superior predictive ability for the dispersion, polar and hydrogen bonding HSPs components, with test set squared correlation coefficients $R^2_d$ = 0.85, $R^2_p$ = 0.91, $R^2_h$ = 0.92 and mean absolute errors of 1.37 MPa$^{1/2}$ for $\delta_d$, 1.85 MPa$^{1/2}$ for $\delta_p$ and 2.58 MPa$^{1/2}$ for $\delta_h$.

© 2011 Elsevier B.V. All rights reserved.
Abbreviations: ANN, artificial neural network; ANN14, artificial neural network with 14 sigma moments; ANN5, artificial neural network with 5 sigma moments; CAMD, computer assisted molecular design; CNN, computational neural networks; COSMO, COnductor-like Screening MOdel; COSMO-RS, COnductor-like Screening MOdel for Realistic Solvents; HSP, Hansen solubility parameter; MAE, mean absolute error; PVT, pressure volume temperature; QSAR, quantitative structure–activity relationship; QSPR, quantitative structure–property relationship; R, cross-correlation coefficients.

∗ Corresponding author at: Department of Chemistry, University of Pannonia, P.O. Box 158, H-8200 Veszprém, Hungary. Tel.: +36 88 624109; fax: +36 88 624548. E-mail address: [email protected] (G. Járvás).

0378-3812/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.fluid.2011.06.030

1. Introduction

The Hildebrand solubility parameter [1], defined as the square root of the cohesive energy density, is characteristic of the miscibility features of solvents. Hansen [2] proposed an extension of the solubility parameter to a three-dimensional system. Based on the assumption that the cohesive energy is a sum of the contributions from nonpolar, polar and hydrogen bonding molecular interactions, he divided the overall solubility parameter into three parts: a dispersion force component ($\delta_d$), a hydrogen bonding component ($\delta_h$), and a polar component ($\delta_p$). These so-called Hansen solubility parameters (HSPs) are additive in their squares:

$$\delta_t^2 = \delta_d^2 + \delta_p^2 + \delta_h^2 \tag{1}$$
where $\delta_t$ stands for the total (Hildebrand) solubility parameter [2]. The three-dimensional Hansen solubility scale gives information about the relative strengths of solvents and makes it possible to identify solvents that can dissolve a specific solute. This approach has significantly increased the power and usefulness of the solubility parameter in the screening and selection of appropriate solvents in industry and in laboratory applications. HSPs are among the key parameters for selecting solvents in the chemical, paint and coatings industries, and for selecting suitable solvents for polymeric resins. They are widely used for characterizing surfaces; for predicting solubilities, degree of rubber swelling, polymer compatibility, chemical resistance, suitable chemical protective clothing, environmental stress cracking and permeation rates; for explaining different properties of the components forming a formulation in pharmacy; and in solvent replacement and substitution programs [3]. The solubility parameter and its components can be applied for the complete description and selection of the best excipient materials to form stable pharmaceutical liquid mixtures or stable coating formulations [4]. Furthermore, both Hildebrand and
Hansen solubility parameters are related to the thermodynamic chemical potential of the ingredients in binary or multi-component systems [5], which reinforces the physical soundness of this model.

Although the definition of the solubility parameters is simple, their experimental determination is not always easy, especially for nonvolatile compounds. Several different methods exist for the determination of the solubility parameters of materials: swelling measurements [6], inverse gas chromatography [7], mechanical measurements [8], solubility/miscibility measurements in liquids with known cohesive energy [9] and viscosity measurements [10]. The partial solubility parameters can also be calculated from experimental PVT data of the systems using equation-of-state models [11,12]. In all cases, the experimental determination of the HSPs requires pure materials and is generally expensive.

In the absence of reliable experimental data, the HSPs components can be estimated from the molecular structure by the cohesive energy density method using molecular dynamics computer simulation [13], or by group contribution methods [14–16]. However, the group contribution methods require knowledge of all chemical group contributions, which is difficult for ionic liquids or acid/base mixtures (organic salts) involving molecular association. Alternatively, multivariate linear or nonlinear regression methods, such as quantitative structure–property relationships (QSPR), based on purely theoretical molecular descriptors have been proposed [17,18]. The development of such predictive QSPR models for the HSPs components, based on theoretical chemical structure alone, is of great interest, because they would make it possible to obtain valuable information in the early phase of the development of new molecules, i.e. even before their synthesis is started.
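As a brief numerical illustration of Eq. (1), the sketch below combines three partial parameters into the total solubility parameter. The function name is hypothetical, and the input values are the experimental acetonitrile components listed in Table 3; the snippet is not part of the authors' workflow.

```python
import math

def total_solubility_parameter(delta_d, delta_p, delta_h):
    """Eq. (1): total parameter from the three Hansen components (MPa^0.5)."""
    return math.sqrt(delta_d**2 + delta_p**2 + delta_h**2)

# Experimental HSP components of acetonitrile from Table 3 (MPa^0.5)
delta_t = total_solubility_parameter(15.3, 18.0, 6.1)
print(f"delta_t = {delta_t:.1f} MPa^0.5")  # prints "delta_t = 24.4 MPa^0.5"
```

Because the components enter as squares, the largest component dominates the total: here the polar term alone accounts for more than half of $\delta_t^2$.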
Additionally, QSPR seems to be the only way to obtain the HSPs components of ionic liquids, which are of growing interest in industry owing to their unique properties as sustainable solvents. However, the molecular descriptors generally used in QSPR are often abstract quantities related to topological, structural, electrostatic, and quantum chemical features of the molecules, and the models obtained do not always have a straightforward physical meaning. In particular, the link between molecular descriptors and the thermodynamic properties of materials is generally not obvious.

1.1. COSMO σ-moment approach

The COSMO-RS theory [19] provides an alternative way to connect the molecular and thermodynamic levels. COSMO-RS is a variant of the dielectric continuum solvation models, which calculate the molecular interactions from the screening (polarization) charge densities (σ) on the molecular surface (Fig. 1).

Fig. 1. Visualization of COSMO screening charges on the ion pair molecular surface of the 1-butyl-3-methylimidazolium tetrafluoroborate ([bmim]BF4) ionic liquid.

Fig. 2. Screening charge distributions and σ-profiles of the ionic components and ion pair of the 1-butyl-3-methylimidazolium tetrafluoroborate ([bmim]BF4) ionic liquid.

The moments of the screening charge density distribution (Fig. 2), the σ-moments, are described [20] as excellent linear descriptors, derived from theory, for regression functions relating important material characteristics (P) to molecular properties:

$$
\begin{aligned}
P ={} & C_0 + C_1 M_0^X + C_2 M_1^X + C_3 M_2^X + C_4 M_3^X + C_5 M_4^X + C_6 M_5^X + C_7 M_6^X \\
& + C_8 M_{Hbacc1}^X + C_9 M_{Hbacc2}^X + C_{10} M_{Hbacc3}^X + C_{11} M_{Hbacc4}^X \\
& + C_{12} M_{Hbdon1}^X + C_{13} M_{Hbdon2}^X + C_{14} M_{Hbdon3}^X + C_{15} M_{Hbdon4}^X
\end{aligned}
\tag{2}
$$
where $M_i^X$ is the ith σ-moment. The coefficients ($C_0$–$C_{15}$) can be derived by multilinear regression of the σ-moments against a sufficient number of reliable experimental data. Some of the 15 moments have a well-defined physical meaning, e.g. the surface area of the molecule, $M_0^X = M_{area}^X$; the total charge, $M_1^X = M_{charge}^X$; the electrostatic interaction energy, $M_2^X = M_{el}^X$; the skewness of the σ-profile, $M_3^X = M_{skew}^X$; and the acceptor and donor functions, $M_{Hbacc3}^X$ and $M_{Hbdon3}^X$. However, the other moments ($M_4^X$, $M_5^X$, $M_6^X$, $M_{Hbacc1,2,4}^X$, $M_{Hbdon1,2,4}^X$) do not have simple physical interpretations [20]; their role is mathematical rather than physical, and they are included solely for QSPR correlations. To some extent, the σ-moment approach is similar to the Abraham empirical solvation model [20,21]. The σ-moment approach has been successfully applied to such diverse problems as olive oil–gas partitioning, blood–brain partitioning, intestinal absorption and soil sorption [22]. Considering the physics behind the sigma function, it is tempting to use the σ-moments as QSPR descriptors to model solubility as such, and to predict the Hansen solubility parameters in a more deterministic/phenomenological way. As pointed out by Katritzky et al. [23], however, the real world is rarely “linear” and most QSAR/QSPR relationships are nonlinear in nature. These hidden nonlinearities between the property and the descriptors can be detected and described by artificial or computational neural networks (ANN, CNN), which belong to the nonlinear approaches [24–29]. In this paper, multivariate nonlinear QSPR models were constructed and tested for the prediction of Hansen’s partial solubility parameters for molecules of various polarity classes, including those formed by ion-pairs, such as ionic liquids or organic salts. COSMO screening charge density moments, computed by quantum-chemical techniques, were used as independent variables.
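The multilinear regression behind Eq. (2) can be sketched in plain Python. This is a minimal illustration, not the authors' procedure: it uses only two descriptors ($M_0^X$ and $M_2^X$ values from Table 1), the "true" coefficients are hypothetical numbers chosen to generate a synthetic property, and `fit_linear` is an illustrative helper that solves the normal equations.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) c = X^T y."""
    rows = [[1.0] + list(x) for x in X]  # leading 1.0 carries the intercept C0
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

# (M0, M2) pairs for six compounds of Table 1
X = [(1.667, 113.3), (1.214, 27.81), (1.529, 75.42),
     (2.407, 209.7), (1.169, 64.98), (1.569, 7.92)]
true_c = [10.0, 2.0, 0.05]  # hypothetical C0, C1, C2 (not values from the paper)
y = [true_c[0] + true_c[1] * area + true_c[2] * m2 for area, m2 in X]

coeffs = fit_linear(X, y)
print([round(c, 6) for c in coeffs])
```

With noise-free synthetic data the fit recovers the generating coefficients; with real HSP data the residuals of such a linear fit are what would motivate a nonlinear model.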
Partial least square statistics using artificial neural network correlations
were employed for generating the correlation and predictive models.

2. Methods

2.1. Training/validation and test sets of molecules with partial solubility parameter data

The experimental HSPs component values were taken from the official HSPs chemical database [14,15] and from selected references [9,30–33]. A training/validation set of 128 molecules with chemically diverse characters, including a wide range of molecular size, complexity, polarity and hydrogen bond forming ability (alkanes, alkenes, aromatics, haloalkanes, nitroalkanes, amines, amides, alcohols, ketones, ethers, esters, acids, organic salts, ionic liquids), was selected to cover a wide numerical range of HSPs component values, i.e. $\delta_d$ values ranging from 14.3 to 24.7 MPa$^{1/2}$, $\delta_p$ values ranging from 0 to 29.2 MPa$^{1/2}$, and $\delta_h$ values ranging from 0 to 35.1 MPa$^{1/2}$. This represents a challenging set because of its structural diversity, the multiple functional groups present in large molecules, and the inclusion of organic salts and ionic liquids. The test set consists of 17 compounds with various functional groups and polarity.

2.2. Computational procedure

The COSMO files and σ-moments of the molecules considered in this study were mostly taken from the database of the COSMOtherm software [34]. The 3D structures of the amine/acid associates and ionic liquids, which were not included in the database, were built using GaussView 3.09. The raw 3D structures were exported in Sybyl Mol2 file format to OpenBabel version 2.2.3 and converted to Cartesian XYZ format. The organic salts and ionic liquids were considered as neutral ion-pairs, since charged species cannot be observed without the presence of counter ions, and the measured HSPs parameters were defined and reported in Hansen’s HSPs database [14,15,31] for bulk phases and not for individual ions.
Molecular geometries were optimized with the TURBOMOLE 6.0 quantum chemical program package [35] using high quality quantum chemical ab initio electronic structure optimization at both the BP-TZVP-COSMO and the BP-TZVP-GAS level. Sigma moments of molecules and ion-pairs were calculated with COSMOtherm Version C21 0110 software using the BP TZVP C21 1010 parameterization [34]. Table 1 shows the calculated five basic σ-moments for selected chemicals, to illustrate how their values depend on the chemical character of specific compounds involved in the regression and prediction (e.g. drug-like molecule, organic salt, solvent, ionic liquid). The extended version of Table 1, with the σ-moments of all of the compounds, can be found in the online version as supplementary material associated with this article. In the earlier phase of the QSPR model construction, we tested the assumption that Hansen’s partial solubility parameters may be directly related to linear combinations of σ-moments, as given by Eq. (2). However, the shortcomings of the linear QSPR models suggest the existence of nonlinear relations between HSPs and σ-moments.

2.3. Nonlinear QSPR models

The nonlinear QSPR models were developed with artificial neural networks. Three-layered feed-forward networks with a backpropagation training function were chosen as the nonlinear regression model, using the Neural Network Toolbox 7 of MATLAB 7.11.0.584 (R2010b) [36] and an in-house developed MATLAB routine for process automation. Two sets of σ-moments were included in
the QSPR models: the first set consisted only of the five basic moments, which have a well-defined meaning ($M_0^X$, $M_2^X$, $M_3^X$, $M_{Hbacc3}^X$, $M_{Hbdon3}^X$); the second set consisted of an extended array of input variables with 14 σ-moments. The first σ-moment ($M_1^X = M_{charge}^X$) vanishes in both models, because the positive and negative charges in organic salts and ionic liquids compensate each other; the quantum chemical calculations were accordingly performed on electrically neutral chemical entities with a total net charge of zero. The number of neurons in the input and output layers was automatically determined by the number of input and output variables (5 or 14 σ-moments and one HSPs component, respectively). To define the ANN topologies and to determine the number of neurons in the hidden layer, several ANNs with different architectures were developed by simultaneously building and validating the ANN models, for which the cross-correlation coefficients (R) between input and output variables were compared. A symmetric sigmoidal transfer function was employed in the hidden layer and a linear transfer function in the output layer (Fig. 3). Each network calculation was started many times with random initial values to avoid convergence to local minima. The architectures which showed the highest R values for the training and validation sets were chosen for the final models. Models were constructed using the training set of compounds, and a validation subset was used to provide an indication of the model performance, using the Levenberg–Marquardt backpropagation training algorithm and the mean squared error performance function. Since the models are nonlinear, the determination of the regression coefficients required iterative processes. To avoid “overtraining”, the ANN models obtained were first internally validated by the leave-many-out cross-validation technique and finally externally validated.
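The network topology described above can be sketched in a few lines. This is only an illustrative forward pass, not the authors' MATLAB model: the weights are random placeholders, tanh is assumed as the symmetric sigmoid, 12 hidden neurons are assumed for the five-descriptor network, and the inputs are assumed to be already scaled to roughly [-1, 1].

```python
import math
import random

def make_network(n_inputs, n_hidden, seed=0):
    """Random placeholder weights; a trained model would supply fitted values."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_out = rng.uniform(-1, 1)
    return w_hidden, b_hidden, w_out, b_out

def forward(x, network):
    """One hidden layer with tanh activation, followed by a linear output."""
    w_hidden, b_hidden, w_out, b_out = network
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

network = make_network(n_inputs=5, n_hidden=12)
scaled_moments = [0.1, -0.3, 0.5, 0.0, 0.2]  # hypothetical scaled sigma-moments
prediction = forward(scaled_moments, network)

# Weights and biases: 5*12 hidden weights + 12 hidden biases + 12 output weights + 1
n_params = 5 * 12 + 12 + 12 + 1
print(prediction, n_params)
```

Under these assumptions a single-output network of this size carries 85 adjustable parameters, which is of the same order as the number of training compounds and illustrates why the validation steps described in the text matter.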
113 data points were chosen for training, 15 compounds were selected for post-training analysis (internal validation) and the test set was used for testing (external validation). The test set consists of 17 characteristic compounds, among them drug-like molecules, ionic liquids, organic salts and solvents. The compounds of the test set are listed in Table 3 (marked with superscript “b”). These compounds were not included in the parameter estimation; they were reserved for prediction. The test set was used only to assess the performance of the specified ANN. Only the training set was used for learning, i.e. to fit the parameters (the weights and biases). In fact, not only the 17 compounds of the test set were excluded from the parameter estimation, but also the 15 compounds of the validation set, altogether 32 experimental data points. The number of compounds left for the test and validation sets is determined by the total number of experimental data and the training to validation/test set ratio. Training sets typically contain 70–80% of the original data. The remainder (20–30%) is usually set aside for testing and/or validation. Therefore, we selected 78% of the total number of experimental data (113 of 145 data points), covering a wide variety of compounds, for the training set; the remaining 22% (32 data points), which was not included in the parameter fitting, was used for validation and testing and was split into two parts: cross-validation (10%) and testing (12%). The MLR and ANN models were statistically evaluated by the squared correlation coefficient of the experimental versus fitted or predicted values ($R^2$) and by the mean absolute error, $\mathrm{MAE}_j = \frac{1}{N}\sum_{i=1}^{N} |\delta_{i,j,\mathrm{predicted}} - \delta_{i,j,\mathrm{experimental}}|$, where i = 1, . . . , N and j = d, p or h.
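The partition sizes and the squared correlation coefficient can be checked with a short script. The helper name `squared_correlation` is hypothetical, and $R^2$ is assumed here to be the square of the Pearson correlation coefficient; the counts (113, 15, 17) are those stated above.

```python
def squared_correlation(x, y):
    """Square of the Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Data partition stated in the text
counts = {"training": 113, "validation": 15, "test": 17}
total = sum(counts.values())
shares = {name: round(100 * n / total) for name, n in counts.items()}
print(total, shares)  # prints: 145 {'training': 78, 'validation': 10, 'test': 12}

# Toy check: a perfectly linear relation gives a value of 1 (up to rounding)
print(squared_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The rounded shares add up to 100%, with 22% of the data held out from parameter fitting.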
Table 1. Basic σ-moments for selected chemical entities calculated by COSMOtherm.

| Chemical entity | $M_0^X$ (nm²) | $M_2^X$ | $M_3^X$ | $M_{Hbacc3}^X$ | $M_{Hbdon3}^X$ |
|---|---|---|---|---|---|
| 4-Aminobenzoic acid | 1.667 | 113.3 | −27.36 | 2.65 | 5.719 |
| Benzene | 1.214 | 27.81 | −0.436 | 0 | 0 |
| Benzoic acid | 1.529 | 75.42 | −15.91 | 1.34 | 3.938 |
| [bmim]PF6 | 2.407 | 209.7 | 192.2 | 25.04 | 1.561 |
| γ-Butyrolactone | 1.169 | 64.98 | 40.34 | 2.647 | 0 |
| Diethylethanolamine + acetic acid | 2.25 | 145.2 | 111.2 | 13.95 | 2.361 |
| Hexane | 1.569 | 7.92 | 0.434 | 0 | 0 |
| Ibuprofen | 2.575 | 85.3 | −9.78 | 1.34 | 3.942 |
| Lactose | 3.091 | 297.9 | 29.51 | 13.61 | 12.82 |
| Na-benzoate | 1.696 | 230.3 | −170.9 | 12.08 | 0 |
| Na-diclofenac | 3.063 | 269.2 | −156.7 | 14.2 | 0.527 |
| Salicylic acid | 1.585 | 79.77 | −27.46 | 0.863 | 4.64 |
| Tetrahydrofurfuryl alcohol | 1.395 | 64.9 | 48.03 | 4.995 | 0.618 |
| Urea | 0.911 | 122.7 | 16.32 | 8.096 | 5.416 |

Table 2. Statistical data of multiple nonlinear regressions for QSPR models based on ANN with 5 and 14 σ-moments as independent variables. $R^2$ is the squared correlation coefficient and MAE is the mean absolute error. The exact list of chemical species involved in the training and test sets is provided in the extended form of Table 3, which can be found in the online version as supplementary material associated with this article.

| Statistics | Set | ANN5 $\delta_d$ | ANN5 $\delta_p$ | ANN5 $\delta_h$ | ANN14 $\delta_d$ | ANN14 $\delta_p$ | ANN14 $\delta_h$ |
|---|---|---|---|---|---|---|---|
| $R^2$ | Training | 0.86 | 0.9 | 0.93 | 0.91 | 0.92 | 0.97 |
| $R^2$ | Test | 0.85 | 0.91 | 0.92 | 0.87 | 0.91 | 0.94 |
| MAE (MPa$^{1/2}$) | Training | 0.48 | 1.66 | 2.21 | 0.37 | 1.45 | 0.98 |
| MAE (MPa$^{1/2}$) | Test | 1.37 | 1.85 | 2.58 | 1.09 | 1.7 | 1.96 |

3. Results and discussion

The multivariate nonlinear QSPR models were based on the optimized ANN topology and parameters. The final ANN architectures contained 12 and 13 neurons in the hidden layer, according to the two sets of σ-moments (Fig. 3). After optimization of the ANN architecture, the networks were trained using the training set for the adjustment of weights and bias values. The external validation set was used to monitor the generalisation ability of the neural networks at each learning cycle. After the training of the ANNs was completed, the optimized weights and biases were set in the networks and the best-trained neural networks were saved. The total MAE and $R^2$ values obtained by the trained ANNs on the training set are summarized in Table 2. As is apparent from the statistical results of both ANN models listed in Table 2, the multivariate ANN based nonlinear QSPR models for the correlation of HSPs components and the σ-moments are reliable, even if only the five basic σ-moments with well-defined physical meaning (ANN5) are used. The MAE values for HSPs data of the compounds in the training set are comparable
to the experimental errors of different methods [33]. However, as expected, the nonlinear QSPR model with 14 σ-moments (ANN14) produced slightly better results for all three HSPs components. In order to evaluate the prediction power of the obtained nonlinear QSPR models, the trained and validated ANNs were used to calculate the HSPs of the test set molecules, which were not involved in the regression process. The computed squared correlation coefficients (0.85 ≤ $R^2$ ≤ 0.94) and mean absolute errors (1.09 MPa$^{1/2}$ ≤ MAE ≤ 2.58 MPa$^{1/2}$) obtained for $\delta_d$, $\delta_p$ and $\delta_h$ (Table 2) of the test-set compounds confirm that both the ANN5 and ANN14 models satisfactorily predict all three HSPs components when applied to an external dataset. However, both linear and nonlinear QSPR models with 14 sigma-moments produced slightly better results for all three HSPs components, and in this respect outperform the basic parameter set containing only 5 sigma-moments. This is expected, because the expansion of the number of adjustable parameters in most cases improves the quality of the correlations. Nevertheless, using too many parameters in correlation functions can result in over-parameterization or over-fitting. In addition, the application of parameters without well-defined physical meaning is disadvantageous in QSPRs, because the interpretation of the correlation results is difficult. Therefore, the application of the basic parameter set containing five sigma-moments with well-defined meaning is preferable. Furthermore, despite the smaller number of independent variables, the ANN5 model possesses about the same prediction power as the ANN14 model. The $R^2$ values of the ANN5 model for the training set are not significantly better than those for the test set, indicating that no over-fitting occurred. The MAE values of the test set are also close to those of the training set for $\delta_p$ and $\delta_h$; only for $\delta_d$ is the test-set MAE higher, because of some outlying points, which have more influence on the correlation than others (see Fig. 4). This provides reassurance that over-parameterization was avoided when training the ANNs using multivariate QSPRs with only the basic COSMO σ-moment descriptors ($M_0^X = M_{area}^X$, $M_2^X = M_{el}^X$, $M_3^X = M_{skew}^X$, $M_{Hbacc3}^X$, $M_{Hbdon3}^X$). It can be concluded from the above that, in the prediction of Hansen solubility parameters, the five basic theoretical Klamt descriptors encode almost the same chemical information on molecular interactions as the total σ-moment set. This confirms the statements of Abraham et al. [21] and Klamt [20] that the solvent space is approximately five-dimensional, so that a small number of descriptors, probably no more than five, is enough to describe the most important intermolecular interactions.

Fig. 3. MATLAB standard visualization of the architectures and specifications of the optimized ANNs with one hidden layer: (a) ANN5, using 5 σ-moments; (b) ANN14, using 14 σ-moments as input variables.
A principal drawback of the proposed neural networks is that they are too complex to allow a straightforward interpretation of the interrelationships between the HSPs components and the σ-moments. The application of the suggested ANN5/QSPR prediction method requires advanced knowledge of computer techniques and the ability to work with user-friendly commercial software (Turbomole, COSMOtherm, MATLAB). Unfortunately, due to size and space limits, the article cannot include all the parameters necessary for full reproduction of the results. However, they can be found in the online version as supplementary material associated with this article. Furthermore, the ready-to-use MATLAB objects of the constructed ANNs (including all necessary parameters such as weights and biases) are available on request from the authors. The good agreement between the observed dispersion, polar and hydrogen bonding HSPs components of the compounds in the training set and those fitted by ANN5 is demonstrated in Figs. 4–6, and for a series of characteristic molecules of the training set in Table 3. The regression lines, which represent the linear correlation between the predicted and measured values, are also depicted in the figures. These lines confirm the good prediction quality of the nonlinear ANN5 models and the absence of significant bias. The regression lines (dashed) of the predicted versus observed data almost coincide with the diagonal of the plot (1:1 relationship, solid line). The estimated HSPs values obtained by the ANN5 models and the experimental ones are compared in Table 3. The quantitative predictions for the HSPs components are quite accurate over a wide range of values for the dispersion, polar and hydrogen bonding HSPs components. Even chemical entities with high HSPs components are predicted well, and the model is able to differentiate quantitatively between compounds with high and low HSPs values. This demonstrates the usefulness of the nonlinear multivariate QSPR models with five σ-moments for the estimation of the HSPs of very strongly polar chemical species, which is particularly interesting from a practical standpoint. Our HSP estimation method has been tested on several drug-like molecules (e.g. Piroxicam, Ibuprofen, Na-diclofenac, salicylic acid, mannitol, niacinamide, 4-amino-benzoic acid), but has not been tested on polymers. This will be the subject of further investigations, because the sigma-profile, and therefore the sigma-moments, of large molecules can be obtained only approximately by the COSMOfrag method using the sigma-profiles of similar molecular fragments. Hence the method has also been verified on polymer precursors (e.g. γ-butyrolactone, propylene carbonate, methacrylic acid, styrene).

Fig. 4. Fitted and predicted (ANN5) Hansen dispersion solubility parameters as function of experimental data for the training and test sets.

Fig. 5. Fitted and predicted (ANN5) Hansen polar solubility parameters as function of experimental data for the training and test sets.

Fig. 6. Fitted and predicted (ANN5) Hansen hydrogen bonding solubility parameters as function of experimental data for the training and test sets.
Table 3. Comparison of experimental HSPs components to those obtained by fitting and estimation using multivariate nonlinear QSPR models with 5 σ-moments (ANN5). All values in MPa$^{1/2}$.

| Name | Dispersion, Calc. | Dispersion, Exp. | Polar, Calc. | Polar, Exp. | Hydrogen bonding, Calc. | Hydrogen bonding, Exp. |
|---|---|---|---|---|---|---|
| Acetic acid 2-ethylhexyl ester^a | 15.4 | 15.8 | 3.8 | 2.9 | 2.8 | 5.1 |
| Acrylic acid^a | 17.4 | 17.7 | 7.3 | 6.4 | 12.3 | 14.9 |
| 4-Aminobenzoic acid^a | 17.2 | 17.3 | 13.2 | 14.3 | 15.6 | 14.4 |
| Benzoic acid^a | 17.1 | 17.6 | 8.6 | 10.1 | 10.8 | 10.7 |
| Benzyl alcohol^a | 18 | 18.4 | 6.2 | 6.3 | 12.4 | 13.7 |
| [bmim]PF6^a | 21.1 | 21 | 18.6 | 17.2 | 8.6 | 10.9 |
| Bis(2-chloroethyl)ether^a | 18.9 | 18.8 | 7.3 | 9 | 4 | 5.7 |
| Citric acid^a | 20.9 | 20.9 | 9.4 | 8.2 | 20.6 | 21.9 |
| γ-Butyrolactone^a | 17 | 19 | 14.6 | 16.6 | 6.1 | 7.4 |
| Dibutylphthalate^a | 18.5 | 17.8 | 8.2 | 8.6 | 2.3 | 4.1 |
| Dipropyleneglycol^a | 16.5 | 16.5 | 10.7 | 10.6 | 16.7 | 17.7 |
| Ethylenecyanohydrin^a | 18.6 | 17.2 | 20.1 | 18.8 | 15.1 | 17.6 |
| Formic acid^a | 14.2 | 14.3 | 12.1 | 11.9 | 14.9 | 16.6 |
| Hexafluoro-1-propanol^a | 17.3 | 17.2 | 5.4 | 4.5 | 12.5 | 14.7 |
| Hexamethylphosphoramide^a | 18.1 | 18.5 | 6.3 | 8.6 | 9.4 | 11.3 |
| Na-benzoate^a | 16.3 | 16.3 | 27.2 | 29.2 | 9.8 | 13 |
| Na-diclofenac^a | 16.4 | 16.3 | 17.9 | 18 | 10.4 | 13.5 |
| N,N-Dimethylacetamide^a | 17.5 | 16.8 | 12.4 | 11.5 | 8.8 | 10.2 |
| Propylenecarbonate^a | 18.5 | 20 | 17.9 | 18 | 6.1 | 4.1 |
| Salicylic acid^a | 17 | 16.6 | 11.5 | 12.4 | 10.5 | 14.6 |
| Sorbitol^a | 19.4 | 19 | 8.9 | 10.3 | 27 | 33.5 |
| Sucrose^a | 24.7 | 24.7 | 10.9 | 11.3 | 33.1 | 35.1 |
| Tetrahydrofurfurylalcohol^a | 17.3 | 17.8 | 8.1 | 8.2 | 10 | 10.2 |
| Trichloromethane^a | 17.8 | 17.8 | 3.1 | 3.1 | 6 | 5.7 |
| Tricresyl phosphate^a | 18.7 | 19 | 10 | 12.3 | 2.7 | 4.5 |
| Triethyl phosphate^a | 15.8 | 16.7 | 10.1 | 11.4 | 7.2 | 9.2 |
| Trimethyl phosphate^a | 17.3 | 16.7 | 12.6 | 15.9 | 8.1 | 10.2 |
| Acetonitrile^b | 14.5 | 15.3 | 15 | 18 | 5.5 | 6.1 |
| [bmim]BF4^b | 26 | 23 | 18.6 | 19 | 8.8 | 10 |
| 2-Butoxyethanol^b | 15.1 | 16 | 6.7 | 5.1 | 8.1 | 12.3 |
| Butyl acetate^b | 16 | 15.8 | 6.5 | 3.7 | 3.8 | 6.3 |
| Butyl benzyl phthalate^b | 20.4 | 19 | 13.3 | 11.2 | 3.1 | 3.1 |
| Diethylethanolamine/acetic acid^b | 16.1 | 16 | 20.8 | 20.3 | 18.4 | 18.4 |
| Diethyl ether^b | 15.6 | 14.5 | 3.4 | 2.9 | 3.6 | 5.1 |
| Dimethyl-ethanolamine^b | 16.3 | 16.1 | 6.9 | 9.2 | 14.5 | 15.3 |
| Dipropyleneglycol^b | 15.9 | 16.5 | 10.6 | 10.6 | 16.5 | 17.7 |
| Ethanolamine^b | 18.2 | 17 | 18.5 | 15.5 | 12 | 21.2 |
| Ethylbenzene^b | 18.8 | 17.8 | 3 | 0.6 | 0.5 | 1.4 |
| Hexafluoro-i-propanol^b | 17.6 | 17.2 | 7.1 | 4.5 | 12.4 | 14.7 |
| Ibuprofen^b | 19.2 | 16.4 | 11.3 | 6.4 | 10.6 | 8.9 |
| Lactose^b | 28.1 | 24.2 | 11.9 | 11.2 | 32.4 | 34.9 |
| Mannitol^b | 20 | 19 | 11.1 | 10.3 | 27.2 | 33.5 |
| Piroxicam^b | 19.6 | 16.8 | 19.1 | 21.4 | 5.9 | 6.6 |
| Urea^b | 19 | 20.9 | 20.2 | 18.7 | 18.1 | 26.4 |

a Some characteristic, randomly selected compounds and data taken from the training set.
b Compounds and data of the test set.
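The test-set errors can be recomputed directly from the “b” rows of Table 3. The sketch below assumes that, within each pair of columns, the calculated value precedes the experimental one; the variable and function names are illustrative only.

```python
# (calculated, experimental) pairs for the 17 test-set compounds of Table 3,
# in table order from acetonitrile to urea; all values in MPa^0.5
test_set = {
    "d": [(14.5, 15.3), (26.0, 23.0), (15.1, 16.0), (16.0, 15.8), (20.4, 19.0),
          (16.1, 16.0), (15.6, 14.5), (16.3, 16.1), (15.9, 16.5), (18.2, 17.0),
          (18.8, 17.8), (17.6, 17.2), (19.2, 16.4), (28.1, 24.2), (20.0, 19.0),
          (19.6, 16.8), (19.0, 20.9)],
    "p": [(15.0, 18.0), (18.6, 19.0), (6.7, 5.1), (6.5, 3.7), (13.3, 11.2),
          (20.8, 20.3), (3.4, 2.9), (6.9, 9.2), (10.6, 10.6), (18.5, 15.5),
          (3.0, 0.6), (7.1, 4.5), (11.3, 6.4), (11.9, 11.2), (11.1, 10.3),
          (19.1, 21.4), (20.2, 18.7)],
    "h": [(5.5, 6.1), (8.8, 10.0), (8.1, 12.3), (3.8, 6.3), (3.1, 3.1),
          (18.4, 18.4), (3.6, 5.1), (14.5, 15.3), (16.5, 17.7), (12.0, 21.2),
          (0.5, 1.4), (12.4, 14.7), (10.6, 8.9), (32.4, 34.9), (27.2, 33.5),
          (5.9, 6.6), (18.1, 26.4)],
}

def mean_absolute_error(pairs):
    """Average of |calculated - experimental| over all compounds."""
    return sum(abs(calc - exp) for calc, exp in pairs) / len(pairs)

mae = {component: round(mean_absolute_error(pairs), 2)
       for component, pairs in test_set.items()}
print(mae)  # prints: {'d': 1.37, 'p': 1.85, 'h': 2.58}
```

These values agree with the ANN5 test-set MAE entries of Table 2, which supports the column assignment assumed here.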
The majority of estimated values were close to or within the experimental error associated with the determination of solubility parameters [14]. To our knowledge, there are no other QSPR studies of HSPs in the literature dealing with data sets comprising base/acid molecular associates and ionic liquids, and therefore our nonlinear σ-moment HSPs models cannot be compared directly to the models of other authors. Even so, we compared our ANN5/QSPR COSMO σ-moment method with some representative, highly efficient predictive methods, namely an equation-of-state model [12], a molecular dynamics computer simulation method [13], and a group contribution method [16], all applied to less challenging datasets. The ANN5/QSPR method presented in this work uses no adjustable parameters for the direct estimation of HSP components and does not need experimental input information, like the CED molecular dynamics method of Belmares et al. [13] and the group contribution method of Stefanis and Panayiotou [16]. By contrast, the equation-of-state model of Stefanis et al. [12] is a parametric method, which needs experimental input data (vapor pressures, density, heat of vaporization or PVT data) for the determination of the adjustable characteristic constants of fluids [12].
The equation-of-state approach [12] is stated to be applicable to ordinary solvents, to supercritical fluids, and to high polymers. The group contribution method [16] can be used for a broad series of organic compounds, including those having complex multi-ring, heterocyclic, and aromatic molecular structures. The CED molecular dynamics method [13] has been tested on common solvents, reactants, and monomers. However, our ANN5/QSPR estimation method can additionally predict the HSPs of complex mixtures, including organic salts, base/acid associates, and ionic liquids, besides organic compounds with a broad diversity of chemical characters (solvents, reactants and monomers). A numerical comparison of the predictive ability (measured by the mean absolute estimation error) of the HSP estimation methods is shown in Table 4. The equation-of-state model [12] and the group contribution method [16], with specific fitted constants and molecular fragments, give the best estimation results. The accuracies of the prediction methods using quantum chemical or molecular dynamics methods, like the CED MD method [13] and the ANN5/QSPR method (this study), are lower, probably because of the generality of these methods, which can also deal with complex mixtures.
Table 4
Comparison of the estimation errors of representative HSP prediction methods.

Estimation method                                   Mean absolute errors (MPa^1/2)
                                                    δd      δp      δh
ANN5/QSPR COSMO σ-moment method (this study) (a)    1.37    1.85    2.58
CED MD method [13] (a)                              0.98    3.84    5.96
Equation-of-state model [12] (a)                    0.77    0.72    0.16
Group contribution method [16] (b)                  0.41    0.86    0.8

(a) Calculated from the differences between estimated values and experimental data obtained from Ref. [15].
(b) Stated values by the authors of Ref. [16].
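The error measure used in Table 4 can be illustrated with a minimal Python sketch. It assumes the usual definition of the mean absolute error (the average of |estimated − experimental| over all compounds); the function names `mean_absolute_error` and `best_method`, the dictionary keys, and the three-compound example values are hypothetical and chosen only for illustration. The `TABLE_4` entries are the values reported in Table 4.

```python
def mean_absolute_error(estimated, experimental):
    """Average absolute deviation between paired estimates and reference values."""
    if len(estimated) != len(experimental) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - x) for e, x in zip(estimated, experimental)) / len(estimated)


# Hypothetical dispersion-component values in MPa^1/2 (illustration only,
# not taken from the paper or its data set).
estimated = [15.0, 17.5, 18.0]
experimental = [16.0, 17.0, 19.5]
mae_example = mean_absolute_error(estimated, experimental)

# Mean absolute errors as reported in Table 4 (MPa^1/2).
TABLE_4 = {
    "ANN5/QSPR (this study)": {"delta_d": 1.37, "delta_p": 1.85, "delta_h": 2.58},
    "CED MD [13]": {"delta_d": 0.98, "delta_p": 3.84, "delta_h": 5.96},
    "Equation-of-state [12]": {"delta_d": 0.77, "delta_p": 0.72, "delta_h": 0.16},
    "Group contribution [16]": {"delta_d": 0.41, "delta_p": 0.86, "delta_h": 0.80},
}


def best_method(component):
    """Return the method with the lowest reported MAE for one HSP component."""
    return min(TABLE_4, key=lambda method: TABLE_4[method][component])


if __name__ == "__main__":
    print(f"example MAE: {mae_example:.2f} MPa^1/2")
    for component in ("delta_d", "delta_p", "delta_h"):
        print(component, "->", best_method(component))
```

Such a ranking should be read with care: as noted above, the methods were evaluated on different data sets, so the tabulated errors are not a like-for-like comparison.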
4. Conclusion

Nonlinear models built using artificial neural networks yielded flexible QSPR correlation models between the COSMO σ-moments and Hansen solubility parameters over a wide range of HSP component values. The reliability of these models was confirmed by statistical analysis of the training and test data sets, which clearly indicates the superiority of the ANN models over the MLR approaches. A QSPR model set developed via ANN, using only the five basic COSMO σ-moments with defined physical meaning as molecular descriptors, is proposed as the optimal method for the estimation of the dispersion, polar and hydrogen bonding HSP components. This nonlinear QSPR set exhibits very good ability to estimate the HSP components within the test set, as confirmed by the relatively low MAE values (in the range of 1.37–2.58 MPa^1/2) and high correlation coefficients (0.85 ≤ R² ≤ 0.92). The COSMO σ-moments included in these models as molecular descriptors can be calculated purely by quantum chemical methods based on the molecular structure, and provide useful information related to various molecular structural features that can participate in solution processes. Furthermore, the results provide new insights into the sigma function of COSMO-RS and support the view that the solvent space can be fully characterized by a limited set of parameters. The multivariable nonlinear QSPR correlation models presented in this work may be an important tool for providing Hansen solubility parameters for solvents in process design, for molecules in early drug discovery, or in the CAMD of new chemical entities with high polarity, even if they involve unusual chemical functionality or ion-pairs.

Acknowledgement

We acknowledge the financial support of this work by the Hungarian State and the European Union under the TAMOP4.2.1/B-09/1/KONV-2010-0003 project and the grant of Foundation for Engineer Education of Veszprém.
Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.fluid.2011.06.030.

References

[1] J.H. Hildebrand, The Solubility of Non-Electrolytes, Reinhold, New York, 1936.
[2] C.M. Hansen, J. Paint Technol. 39 (1967) 104–117.
[3] C.M. Hansen, Prog. Org. Coat. 51 (2004) 77–84.
[4] K. Adamska, A. Voelkel, K. Héberger, J. Chromatogr. A 1171 (2007) 90–97.
[5] A.F.M. Barton, Handbook of Solubility Parameters and Other Cohesion Parameters, CRC Press, London, 1991.
[6] S.M. Aharoni, J. Appl. Polym. Sci. 45 (1992) 813–817.
[7] K. Adamska, R. Bellinghausen, A. Voelkel, J. Chromatogr. 1195 (2008) 146–149.
[8] R.J. Roberts, R.C. Rowe, Int. J. Pharm. 99 (1993) 157–164.
[9] P. Bustamante, M.A. Pena, J. Barra, Int. J. Pharm. 194 (2000) 117–124.
[10] F. Gharagheizi, A.M. Torabi, J. Macromol. Sci. B: Phys. 45 (2006) 285–290.
[11] A. Eslamimanesh, F. Esmaeilzadeh, Fluid Phase Equilib. 291 (2010) 141–150.
[12] E. Stefanis, I. Tsivintzelis, C. Panayiotou, Fluid Phase Equilib. 240 (2006) 144–154.
[13] M. Belmares, M. Blanco, W.A. Goddard, R.B. Ross, G. Caldwell, S.H. Chou, J. Pham, P.M. Olofson, C. Thomas, J. Comput. Chem. 25 (2004) 1814–1826.
[14] S. Abbott, C.M. Hansen, HSPiP: Hansen Solubility Parameters in Practice Complete with Software, Data, and Examples, 3rd ed. (eBook), 2010.
[15] C.M. Hansen, Hansen Solubility Parameters: A User’s Handbook, CRC Press, Boca Raton, 2007.
[16] E. Stefanis, C. Panayiotou, Int. J. Thermophys. 29 (2008) 568–585.
[17] E.A. Hoffmann, Z.A. Fekete, R. Rajkó, I. Pálinkó, T. Körtvélyesi, J. Chromatogr. A 1216 (2009) 2540–2547.
[18] V. Tantishaiyakul, N. Worakul, W. Wongpoowarak, Int. J. Pharm. 325 (2006) 8–14.
[19] A. Klamt, F. Eckert, M. Hornig, J. Comput. Aided Mol. Des. 15 (2001) 355–365.
[20] A. Klamt, COSMO-RS from Quantum Chemistry to Fluid Phase Thermodynamics and Drug Design, Elsevier, Amsterdam, 2005.
[21] M.H. Abraham, A. Ibrahim, A.M. Zissimos, J. Chromatogr. A 1037 (2004) 29–47.
[22] A. Klamt, F. Eckert, W. Arlt, Annu. Rev. Chem. Biomol. Eng. 1 (2010) 101–122.
[23] A.R. Katritzky, M. Kuanar, S. Slavov, C.D. Hall, M. Karelson, I. Kahn, D.A. Dobchev, Chem. Rev. 110 (2010) 5714–5789.
[24] I.I. Baskin, V.A. Palyulin, N.S. Zefirov, in: D.J. Livingstone (Ed.), Artificial Neural Networks, Humana Press Inc./Springer Science, Totowa, 2009, pp. 133–154.
[25] R. Guha, P.C. Jurs, J. Chem. Inf. Model. 45 (2005) 800–806.
[26] R. Guha, D.T. Stanton, P.C. Jurs, J. Chem. Inf. Model. 45 (2005) 1109–1121.
[27] K. Héberger, J. Chromatogr. A 1158 (2007) 273–305.
[28] L. Molnár, G.M. Keserű, Á. Papp, Z. Gulyás, F. Darvas, Bioorg. Med. Chem. Lett. 14 (2004) 851–853.
[29] A. Tompos, J.L. Margitfalvi, E. Tfirst, K. Héberger, Appl. Catal. A 324 (2007) 90–93.
[30] J. Barra, M.A. Pena, P. Bustamante, Eur. J. Pharm. Sci. 10 (2000) 153–161.
[31] C.M. Hansen, Hansen Solubility Parameters, http://www.hansensolubility.com/index.php?id=21, last accessed 30.03.2011.
[32] P. Bustamante, M.A. Pena, J. Barra, Int. J. Pharm. 174 (1998) 141–150.
[33] A. Voelkel, A. Adamska, B. Strzemiecka, K. Batko, Acta Chromatogr. 20 (2008) 1–14.
[34] F. Eckert, A. Klamt, COSMOtherm, Version C2.1, Release 01.10, COSMOlogic GmbH & Co. KG, Leverkusen, Germany, 2009.
[35] R. Ahlrichs, M. Bar, M. Haser, H. Horn, C. Kölmel, Chem. Phys. Lett. 162 (1989) 165–169.
[36] M.H. Beale, M.T. Hagan, H.B. Demuth, Neural Network Toolbox 7 of MATLAB, The MathWorks Inc., 2010.