Journal of Molecular Liquids 179 (2013) 78–87
Contents lists available at SciVerse ScienceDirect
Journal of Molecular Liquids journal homepage: www.elsevier.com/locate/molliq
Prediction of surface tension of ionic liquids by molecular approach Seyyed Alireza Mirkhani a,⁎, Farhad Gharagheizi a,⁎, Nasrin Farahani b, Kaniki Tumba c a b c
Department of Chemical Engineering, Buinzahra Branch, Islamic Azad University, Buinzahra, Iran Department of Chemistry, Buinzahra Branch, Islamic Azad University, Buinzahra, Iran Department of Chemical Engineering, Mangosuthu University of Technology, Durban, South Africa
a r t i c l e
i n f o
Article history: Received 23 October 2012 Received in revised form 21 November 2012 Accepted 23 November 2012 Available online 12 December 2012 Keywords: Surface tension QSPR model Ionic liquids LSSVM Validation techniques
a b s t r a c t Originally, Quantitative Structure Property Relationship (QSPR) models for the surface tension of ionic liquids are developed based on molecular descriptors. A large data set of 930 experimental surface tension data points for 48 ionic liquids is applied to derive the model. Seven descriptors are selected by genetic function approximation to relate the surface tension of ionic liquids to their corresponding anions and cation structures. To capture the nonlinear nature of surface tension, a model based on Least-Squared Supported Vector Machine (LSSVM) is also developed. The derived models are authenticated with several statistical validation techniques. © 2012 Elsevier B.V. All rights reserved.
1. Introduction Recently, ionic liquids (ILs) as the new generation of conductive materials find their way in the innovative industrial and chemical applications. The term ionic liquid or more specifically room temperature ionic liquid (RTIL) refers to the salts entirely composed of ions and have a melting point below the normal boiling point of water. Same as other salts, ionic liquids possess very negligible vapour pressures even at conditions well above room temperature [1]. In addition, low-toxicity as well as non-volatility is the other novel characteristics of ILs. By possessing aforementioned qualities, ionic liquids are promising candidates to supersede convenient organic solvents in industrial applications. Ionic liquids consisted of low-symmetry, large and unreactive cation such as phosphorus or sulphur containing ring and an anion that largely controls its physical and chemical properties. Altering in the cation and anion combinations permits the physical, chemical and biological properties of ionic liquids to be tailored for specific applications, as largely manifested by the task-specific ionic liquids (TSILs). One of the recent applications of ILs is their employment in the processes such as extraction and multiphasic homogeneous catalytic reactions which are mainly governed by interfacial phenomena. The mentioned type of reactions occurs in the two-phase systems; one phase contains the reactant and products and the other, the immiscible one, act as the catalyst solvent. Such processes occur at the interface between the IL and the overlying aqueous or organic phase, and are dependent on the accessibility of the material to the surface and ⁎ Corresponding authors. Fax: +98 21 77 92 65 80. E-mail addresses:
[email protected] (S.A. Mirkhani),
[email protected] (F. Gharagheizi). 0167-7322/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.molliq.2012.11.018
the transfer of material across the interface. More exhaustive studies of the surface-related properties are required to enhance our insight into the mechanisms behind these processes. The available surface tension data of ionic liquids are very limited in comparison of possible ionic liquids (≈ 10 15). To “tailor” an ionic liquid with the desired properties, it is important to have a rational estimation of its properties e.g. surface tension prior to the synthesis in the absence of the experimental data. For this purpose, prediction methods which provide accurate estimations of desired properties are essential. Models based on parachors, group contribution methods or corresponding state theories (CST) are widely applied for the prediction of the surface tension of the ionic liquids [2]. The foundation of parachor-based models are an empirical formula originally proposed by MacLeod [3], which relates density to the surface tension via temperature-independent Eq. (1): 1
σ4 ¼ K ρ
ð1Þ
Sugden [4] modified the original formula by multiplying each side of the expression by the molecular weight, Mw, to give a constant K.Mw labelled as parachor, Pch: 1
P ch ¼ KM W ¼
MW ⋅σ 4 : ρ
ð2Þ
Sugden [4] also proposed that the parachor of a compound is an additive property which can be described by the sum of its parachor contribution.
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
Knotts et al. [5] compiled a vast amount of available physical DIPPR data to derive a group contribution correlation for the estimation of parachors. For their model, deviations of 8% for multifunctional compounds were obtained with maximum deviation of 34%. Deetlefs et al. [6] were the first who applied Knotts et al. [5] parachors for ionic liquids. They found that the differences between estimated parachor values based on non-ionic solvents and experimental ionic liquids' parachors are rather small. Although their calculations are based on the very limited data, they postulated that the QSPR correlation based on neutral species could be used for ionic liquids. Gardas and Coutinho [7] are the first who successfully tackled the estimation of the surface tension of ionic liquids by applying Knotts et al. [5] parachor approach. By compiling 361 data points of 38 imidazolium based ionic liquids with thirteen anions, they derived a model with the overall deviation of 5.75% and maximum deviation less than 16%. Another fascinating outcome of their model is the successful application of Knotts' correlation, which is basically derived for neutral solvents without regard Coulombic interactions, for ionic liquids. Finally, Mousazadeh and Faramarzi [8] applied CST methodology to derive a model for surface tension of ionic liquids. Since the CST correlations are primarily based on the critical properties, which are not available for ionic liquids, they chose to use the melting (Tm) and boiling points of ionic liquids (Tb) along with the surface tension at the melting point (σm) to define their corresponding states correlation. σ ¼ 0:819
T b −T T σ m þ 0:5 σ m: T b −T m Tb
ð3Þ
The deviations reported for the surface tension of the 30 ionic liquids used in the development of the correlation represent only 3.0% while predictions for 4 ionic liquids not used in the correlation development are of 6.5%. Since many ionic liquids don't have a reliable melting point and determination of their boiling points is as elusive as their critical temperatures, the application of the proposed CST model for practical estimation of surface tension of ionic liquids is extremely limited. In this communication, the predictive model derived from molecular descriptor is presented to estimate the surface tension of the studied ionic liquids. Gharegheizi et al. [9] proposed a group contribution model to predict surface tensions of 51 ionic liquids based on 920 experimental data. 19 chemical substructures (12 for anions and 7 for cation) were considered to develop a group contribution model. The model generates acceptable results in terms of R2 =0.919 and Average absolute relative deviation (ARD)=3.61. In comparison with our new QSPR model, GC model has simpler terms which are easily accessible from the chemical structure. In term of number of variables, the new QSPR model with 7 descriptors is superior than previous GC model with 19 parameters.
79
is called training set. The other one (test set) is used to assess the learning ability of the model from training set to produce reliable results for absent compounds. In this study, K-means clustering is applied to select training and test sets. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. As the rule of thumb, 20% of collecting data was retained to test the model and remain was applied for model derivation [30]. For LSSVM model derivation, 80%–10%–10% of data points split into TrainingValidation-Test sets respectively. This selection like the previous one is performed by k-mean clustering. 2.3. Calculation of descriptors The aim of this study is to relate the surface tension of ionic liquids to their ionic constituents' structures via anion- and cation-based descriptors. For this purpose, anion and cation descriptors are separately calculated for each ionic liquid. This approach is successful to correlate the studied physical property with the structure of both anion and cation. However, it discarded anion/cation interactions. In order to optimize the 3D chemical structure of each cation and anion, the Dreiding Force field as explained by Chemaxon's JChem was employed [31]. About 3000 descriptors from 22 diverse classes of descriptors are calculated by Dragon software [32]. These 22 classes of descriptors are: Constitutional descriptors, Topological indices, Walk and path counts, Connectivity indices, Information indices, 2D autocorrelations, Burden Eigen values, Edge-adjacency indices, Functional group counts, Atom-centred fragments, Molecular properties, topological charge indices, Eigenvalue-based indices, Randic molecular profiles, geometrical descriptors, RDF descriptors, 3D-MORSE descriptors, WHIM descriptors, GETAWAY descriptors, charge descriptors, 2D binary fingerprint, and 2D frequency finger print. After the completion of descriptor calculation, those couldn't be calculated for several anions or cations are excluded completely from the list. Next, the pair correlations for each binary group of descriptors (all anions and cations descriptors) are calculated. For binary groups with the pair correlation greater than 0.9, one of the descriptors is omitted randomly. 2.4. Diversity test
2. Methodology
To ensure the diversity of ionic liquids present in both training and test sets the diversity test is conducted in this study. There are many approaches to measure diversity, owing to its diverse definitions. In this study, the Euclidean distance approach is applied to measure the diversity of present ionic liquids. In this approach, each IL (Xi) is described by a vector of corresponding molecular descriptors incorporating both anion and cation descriptors (xim) as its elements where m is the number of all calculated descriptors.
2.1. Data preparation
X i ¼ ðxi1 ; xi2 ; xi3 ; …; xim Þ:
An extensive literature review is conducted to extract 930 experimental surface tension data points belonging to 48 ionic liquids [10–29]. The collected data covers a wide range of surface tension (0.02315–0.06399 N.m−1) as well as temperature (268.3–743.7 K). Table 1 describes the present ionic liquids with the range of their surface tension as well as temperature. The structures of 12 anions and 19 cations present in our study are tabulated as Tables 2 and 3, respectively.
A distance of two different ionic liquids is defined as follows: vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u m 2 uX dij ¼ ‖X i −X j ‖ ¼ t xik −xjk :
ð5Þ
k¼1
Next, the mean distance of one sample to the remaining ones is calculated as follows:
2.2. Training and test set selection Typically, in QSPR modelling, the selected experimental data set is split into two subsets: the subset which involved in model derivation
ð4Þ
n X
di ¼
di;j
j¼1
n−1
:
ð6Þ
80
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
Table 1 Temperature and corresponding surface tensions ranges for studied ionic liquids. No.
Compound
ARD%
T range/K
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
1-decyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl]amide 1-methyl-3-pentylimidazolium bis[(trifluoromethyl)sulfonyl]amide N-butyl-N-propyl-N,N-dimethylammonium bis[(trifluoromethyl)sulfonyl]amide Hexyltrimethylammonium bis[(trifluoromethyl)sulfonyl]amide Triethylhexylammonium bis[(trifluoromethyl)sulfonyl]amide Trihexyl(tetradecyl)phosphonium bis[(trifluoromethyl)sulfonyl]amide 1,3-dimethylimidazolium methylsulfate 1-(2-hydroxyethyl)-3-methylimidazolium tetrafluoroborate 1-butyl-1-methylpyrrolidinium cyanocyanamide 1-butyl-1-methylpyrrolidinium thiocyanate 1-butyl-2,3-dimethyl-imidazolium hexafluorophosphate 1-butyl-3-methylimidazolium octylsulfate 1-butyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl]amide 1-butyl-3-methylimidazolium chloride 1-butyl-3-methylimidazolium cyanocyanamide 1-butyl-3-methylimidazolium hexafluorophosphate 1-butyl-3-methylimidazolium iodide 1-butyl-3-methylimidazolium methylsulfate 1-butyl-3-methylimidazolium tetrafluoroborate 1-butyl-3-methylimidazolium thiocyanate 1-butyl-3-methylimidazolium trifluoromethanesulfonate 1-butyl-4-methylpyridinium tetrafluoroborate 1-butyl-4-methylpyridinium thiocyanate 1-ethyl-3-methylimidazolium L-lactate 1-ethyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl]amide 1-ethyl-3-methylimidazolium cyanocyanamide 1-ethyl-3-methylimidazolium tetrafluoroborate 1-ethyl-3-methylimidazolium trifluoromethanesulfonate 1-heptyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl]amide 1-hexyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl]amide 1-hexyl-3-methylimidazolium chloride 1-hexyl-3-methylimidazolium hexafluorophosphate 1-hexyl-3-methylimidazolium iodide 1-hexyl-3-methylimidazolium tetrachloroindate 1-hexyl-3-methylimidazolium tetrafluoroborate 1-hexyl-3-methylimidazolium trifluoromethanesulfonate 1-methyl-3-propylimidazolium bis[(trifluoromethyl)sulfonyl]amide 1-methyl-3-propylimidazolium hexafluorophosphate 1-octyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl]amide 1-octyl-3-methylimidazolium chloride 1-octyl-3-methylimidazolium hexafluorophosphate 1-octyl-3-methylimidazolium tetrafluoroborate 3-methyl-1-octyl-imidazolium hexafluorophosphate 3-methyl-1-octyl-imidazolium tetrafluoroborate N-butyl-3-methylpyridinium cyanocyanamide N-butyl-4-methylpyridinium tetrafluoroborate Trihexyl(tetradecyl)phosphonium chloride Trihexyl(tetradecyl)phosphonium tris(pentafluoroethyl)trifluorophosphate
17.8 3.1 0.6 8.9 3.6 19.3 0.4 1.2 1.4 9.4 1.8 18.9 14.1 2.9 4.5 3.1 11.2 2.5 1.0 7.5 0.7 0.7 6.0 3.8 19.8 8.1 6.2 4.3 8.2 1.4 4.2 0.7 6.4 4.0 2.3 11.2 5.5 7.9 15.1 12.2 1.0 7.4 1.3 1.1 6.0 0.7 6.9 12.5
293.2 293.2 299.1 300.4 299.5 303.0 288.2 572.0 293.1 303.4 303.2 278.8 278.8 298.0 282.5 287.6 298.0 283.2 284.2 292.4 292.2 291.2 302.9 283.2 293.2 278.2 288.1 268.3 293.2 283.2 298.0 293.2 298.0 283.2 268.6 294.0 293.2 318.3 293.2 298.0 283.9 298.5 293.2 288.2 293.2 291.2 298.0 298.5
Where n refers to the number of all compounds. Then, the calculated mean distances are normalized according to following definition:
di Norm ¼
di −d min d max −d min
:
ð7Þ
Norm
di reveals the structural diversity of ionic liquid (i) in comparison to others. Fig. 1 presents the values of the similarity test for both test and train sets of the studied ionic liquids. 3. Sub-set variable selection In this study Genetic Function Approximation (GFA) is employed for a subset variable selection. GFA – as a genetically based variable selection approach – involves the combination of multivariate adaptive regression splines (MARS) algorithm [33] with genetic algorithm [34] to evolve a series of equations instead of one that best fit the training set data. The approach was originally proposed by the pioneering work of Rogers and Hopfinger [35].
STexp/N.m−1
– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –
343.2 343.2 324.9 333.4 347.1 332.9 313.2 743.7 353.5 344.2 353.2 278.8 351.6 393.0 356.5 353.0 393.0 354.4 393.0 350.9 356.0 350.8 342.2 333.2 303.2 356.5 356.0 356.3 353.2 351.1 393.0 353.2 393.0 338.2 393.0 355.4 343.2 352.1 343.2 393.0 353.6 393.0 343.2 343.2 350.5 350.8 347.7 343.3
0.02868 0.03031 0.03693 0.03377 0.03145 0.03028 0.0589 0.0536 0.0508 0.0421 0.04137 0.02808 0.03041 0.0418 0.04266 0.04047 0.0456 0.0377 0.0382 0.0366 0.03282 0.0423 0.0384 0.04863 0.03643 0.05562 0.05026 0.03831 0.02829 0.02891 0.0362 0.03515 0.0365 0.0389 0.0343 0.03006 0.03218 0.04431 0.02886 0.0267 0.03083 0.026 0.03198 0.0302 0.0363 0.0423 0.02955 0.02657
– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –
STpred/N.m−1 0.03212 0.0329 0.03838 0.03588 0.03517 0.03257 0.0609 0.0647 0.0564 0.0498 0.0448 0.02808 0.0338 0.0482 0.0488 0.04447 0.0533 0.0459 0.04529 0.0473 0.03552 0.0452 0.0477 0.05337 0.03694 0.064 0.05461 0.04446 0.032 0.03234 0.0418 0.03902 0.0435 0.042 0.03981 0.03264 0.03487 0.04675 0.03193 0.0319 0.03475 0.0323 0.03516 0.03415 0.0437 0.0452 0.03362 0.02899
0.02371 0.03128 0.03721 0.03704 0.03325 0.02458 0.05920 0.05519 0.05187 0.04994 0.04103 0.03339 0.03487 0.04134 0.04590 0.04216 0.04134 0.03934 0.03922 0.04378 0.03266 0.04194 0.04478 0.05165 0.04371 0.05226 0.04748 0.03901 0.02609 0.02846 0.03491 0.03572 0.03491 0.04055 0.03279 0.02626 0.03406 0.04097 0.02449 0.03053 0.03131 0.02841 0.03185 0.03097 0.04075 0.04194 0.03243 0.02315
– – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – – –
N 0.02627 0.03384 0.03853 0.03873 0.03568 0.02612 0.06048 0.06399 0.05496 0.05203 0.04359 0.03339 0.03860 0.04621 0.04969 0.04551 0.04621 0.04299 0.04480 0.04678 0.03593 0.04499 0.04679 0.05422 0.04422 0.05627 0.05096 0.04352 0.02917 0.03194 0.03978 0.03879 0.03978 0.04337 0.03917 0.02941 0.03663 0.04270 0.02705 0.03540 0.03489 0.03326 0.03441 0.03379 0.04369 0.04499 0.03498 0.02545
6 6 14 11 10 7 6 12 30 11 6 1 32 20 54 33 20 53 49 9 56 16 10 10 2 32 29 48 7 18 20 7 20 12 55 26 6 16 6 21 28 27 6 7 18 16 12 9
In the most cases, the QSPR models are presented as the sum of linear terms: F ðX Þ ¼ a0 þ
M X
ak X k :
ð8Þ
k¼1
Where a0 is the intercept, ak is the model coefficient and Xks are molecular descriptors. The initial QSPR models are constructed by random selection of the number of molecular descriptors. In the next step, the qualities of the derived models are evaluated by Friedman's LOF scoring function, which is a penalized least-squares error measures: LOF ðmodelÞ ¼
1 LSEðmodelÞ : N 1− ðcþ1þðdpÞÞ 2
ð9Þ
N
In this LOF function, c is the number of non-constant basis functions, N is the number of samples in the data set, d is a smoothing factor to be set by the user, and p is the total number of parameters in the model and the LSE is the least square error of the model. Employment of LOF leads to the models with better prediction without over fitting.
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
At this point, we repeatedly perform the genetic recombination or crossover operation: • Two good models in terms of their fitness are selected as ‘parents’. • Each is randomly ‘cut’ into two sections. A new model is created using the basis functions taken from a section of each parent. • The model with the worst fitness is replaced by this new model. • The overall process is ended when the average fitness of the models in the population stops improving.
No.
Anion
Table 2 (continued) No.
Anion
Cl4In-
N
211
tetrafluoroborate
11 1
N
12
9
10
Table 2 Anion structures of studied ionic liquids.
81
1
30 thiocyanate
octylsulfate
12 2
130
125 trifluoromethanesulfonate bis[(trifluoromethyl)sulfonyl]amide
3
73 chloride
4
134 cyanocyanamide
5
In this study, population and the number of maximum generations are set to 100 and 5000, respectively. The value of Mutation probability is considered to be 1.5 in this study. To study the non-linear nature of surface tension, the least square support vector machine (LSSVM) is also introduced. 4. Results and discussions 4.1. Linear model
96
The final linear QSPR model obtained by GFA contains 7 descriptors. In order to investigate the contribution of anion and cation on the surface tension of ionic liquids, their corresponding descriptors in the model are presented as σAnion and σCation, respectively:
hexafluorophosphate σ ¼ 0:08258−5:26608 10−5 T þ σ Anion þ σ Cation σ Anion ¼ 0:00197 X5AAnion þ 0:01616 Mor07vAnion −0:00261 HATS3mAnion
6
40
ð10Þ
σ Cation ¼ 0:02596 nOCation −0:00248 GATS4eCation −0:00676 IC5Cation þ0:69760 R5mCation
iodide
R2 ¼ 0:8948; R2adj ¼ 0:8946; nTraining ¼ 742; nTest ¼ 188; F ¼ 6293:56; Q 2 ¼ 0:9376; Q 2boot ¼ 0:8941; Q 2ext ¼ 0:9016 2 a R ¼ −0:022; ΔK ¼ 0:946; ΔQ ¼ 0; Rp ¼ 0; RN ¼ 0:996: 7
10
The definitions of the present descriptors are enlisted in Table 4. L-lactate
8
59
methylsulfate
• X5A is the kier and Hall index reflects information concerning geometrical feature of the molecule namely, the molecular Van der Waals volume. Since the polarizability of an atom or molecule is proportional to its volume, X5A represents a measure of the molecular polarizability. Since polarizability of the molecule is closely related to the strength of the London forces, X5A accounts for dispersion interaction between molecules [36]. • The number of atoms of oxygen (nO) in the molecules represents information about polar interactions among ions. The introduction of oxygen atoms into organic cations produces a charge separation in
82
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
Table 3 Cations structures of studied ionic liquids.
Table 3 (continued) No.
No.
N
Cation
Cation
11 1
6
N
327 1-butyl-3-methylimidazolium
1-decyl-3-methylimidazolium
2
6
12
1-methyl-3-pentylimidazolium
3
26 1-butyl-4-methylpyridinium
11
13
121
hexyltrimethylammonium 1-ethyl-3-methylimidazolium
4
14 14
N-butyl-N-propyl-N,N-dimethylammonium
7 1-heptyl-3-methylimidazolium
5
10 15
triethylhexylammonium trihexyl(tetradecyl)phosphonium
158 1-hexyl-3-methylimidazolium
16 6
22 1-methyl-3-propylimidazolium
7
17 7
82 1-octyl-3-methylimidazolium
12
1-(2-hydroxyethyl)-3-methylimidazolium 18
13 3-methyl-1-octyl-imidazolium
8
6 1,3-dimethylimidazolium 19
18 N-butyl-3-methylpyridinium
9
41 1-butyl-1-methylpyrrolidinium
20
16
N-butyl-4-methylpyridinium 10
6 21
21
1-butyl-2,3-dimethyl-imidazolium
trihexyl(tetradecyl)phosphonium
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
1 0.9
The mean distance
feature space. The detailed mathematical explanation of the optimization problem treated by LSSVM approach is not provided here and can be found in detail in mentioned references [37–39]. To implement the original SVM algorithm to handle nonlinear problem, Radial Basis Function (RBF) is defined as the kernel function. The objective of the definition of the kernel function is to map the data into the higher dimensional feature space in order to increase computational power. The simulated annealing optimization method is actuated to find the proper combination of the LSSVM parameters, namely (γ,σ2), considering the minimum mean squared error (MSE) of leave-one-out (LOO) crossvalidation of the training set as the optimal condition. The obtained parameters of the final model is described as follows:
Training set Test set
0.8
83
0.7 0.6 0.5 0.4 0.3 0.2
2
γ ¼ 2891:0126; σ ¼ 0:1726:
0.1 0 0.02
0.03
0.04
0.05
0.06
0.07
ST exp / N.m−1 Fig. 1. Diversity of training and test sets.
the bond which produces a dipole moment, and provided the dipole moment orientations of neighbouring molecules. There will be an attractive interaction among them which will be seen reflected in an enhanced value of surface tension. The positive value of the model correlation coefficient for this descriptor supports this interpretation [36]. • Mor07v belongs to MORSE descriptors with the weighting scheme of Van der Waals volumes. This descriptor simultaneously provides information about the effect of 3D structure of the anion as well as the contribution of the Van der Waals volumes of the constituent atoms. • HATS3m encodes information about cyclicity and branching. It is well established that branching lowered the surface tension owing to disruption of the packing of the molecule in the surface layer. The positive value of the model correlation coefficient for this descriptor supports this interpretation. 4.2. Non-linear model In order to study the nonlinear nature of the surface tension of ionic liquids, the selected descriptors by GFA are introduced as inputs to Least Squared Support Vector Machine (LSSVM) method for the sake of nonlinear model derivation. LSSVM belongs to the larger class of machine-learning based algorithm namely support vector machine (SVM) which profoundly is based on the seeking of an optimal separating hyperplane to minimize expected generalization error in the
As it is described by Sedev [40], surface tension of ionic liquids decreases linearly with temperature and the value of the gradient, − dσ , dT are lying in the range of 0.04–0.2 mJ/m 2.K. The predicted gradient by the model is 0.05 which is consistent with the experimental findings. Among the reviewed models in the Introduction section, merely the work of Gardas and Coutinho merits consideration owing to the large implemented database as well as valid theoretical background. Despite of their originality of the applied method, in their model the knowledge of the molar volume of the ionic liquid is mandatory for the prediction of surface tension. Since, the experimental molar volumes are not available for new ionic liquids; this property should be estimated formerly, which indeed decrease the accuracy of the surface tension prediction. In terms of implemented database, the investigated data base in this work is approximately threefold of the dataset applied by Gardas and Coutinho. This would result in a more comprehensive correlation. Unlike the work of Gardas and Coutinho, all collected experimental data points in this work are associated with their measurement errors. In the case of multiple reported values for a very ionic liquid, the one with the least error is retained in our database. In addition, in terms of the accuracy of prediction the new model is superior to the Gardas and Coutinho model. The value of the Absolute average relative deviation from nonlinear model is equal to 1.05 with 930 data points; however the reported value for this parameter in Gardas and Coutinho model is equal to 5.65 for 261 data points. In the light of the mentioned parameter, it is clear that the prediction ability of the new model is superior to the previous model by Gardas and Coutinho. 4.3. Applicability domain Organization for Economic Cooperation and Development (OECD) proposed five principles for QSAR validation: 1. 2. 3. 4.
A defined endpoint; An unambiguous algorithm; A defined applicability domain; Appropriate measures of goodness-of-fit, robustness and predictivity; and 5. Mechanistic interpretation, if possible.
Table 4 Model's descriptors definition. Descriptor Definition X5A Mor07v HATS3m nO GATS4e IC5 R5m
Average connectivity index chi-5
Group
Connectivity indices Signal 7/weighted by atomic Van der Waals volumes 3D-MoRSE Leverage-weighted autocorrelation of lag 3/weighted GETAWAY by atomic masses Number of oxygen atoms in the molecule Constitutional Geary autocorrelation—lag 4/weighted by atomic Geary 2D Sanderson electronegativities autocorrelations Information content index (neighbourhood Information symmetry of 5-order) indices R autocorrelation of lag 5/weighted by atomic masses GETAWAY
Applicability domain [41] is a theoretical spatial domain defined by the molecular descriptors and the modelled responses and the nature of training sets. When a model is built on the specific domain based on the compounds of the training set, the prediction of the properties of absent compound would be valid if it is “similar” to the training set molecules. In other word, prediction conducted within the domain based on interpolation is reliable. Otherwise, the validity of the prediction outside of domain conducted by extrapolation is disputed. The lack of general and accepted definition of similarity, makes researchers to define AD to test the reliability of the QSPR models. In this study, Williams graph is used to depict AD. In this approach, leverage or hat
84
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
5
indices are calculated based on Hat matrix (H) with the following definition:
ARD% ¼
3
ð11Þ
Where X is a two-dimensional matrix comprising n ionic liquids (rows) and k descriptors belonging to both anion and cation (columns). The leverages or hat values (hi) of the chemicals in the descriptor space are the diagonal elements of H. Williams graph shows the correlation of hat values and standardized cross-validated residuals (R). A warning leverage (h⁎ =0.02903) – blue vertical line – is generally fixed at 3n/p, where n is number of training ionic liquids and p the number of model variables plus one. The leverage of 3 is considered as a cutoff value to accept the points that lay ±3 (two horizontal red lines) standard deviations from the mean (to cover 99% normally distributed data). The AD is located in the region of 0 ≤ h≤ 0.02903 and − 3 ≤ R≤ +3. Existence of the majority of training and test data points in this domain reveals that both model derivation and prediction are done in applicability domain which results in a valid model. Two data points depicted by blue and red circles are wrongly predicted by the model (3b R or R b −3), however, their hat values lie in the domain of AD. This erroneous prediction could probably be attributed to wrong experimental data rather than to molecular structure [42]. Fig. 2 shows Williams graph of the studied ionic liquids. The absolute relative deviation is defined as follows: !N p 1 X σ exp −σ calc : Np i¼1 σ exp
4
ð12Þ
Where Np is the number of total points estimated for each system. The GFA driven model shows that for 48 studied ionic liquids the mean ARD% is 4.9% with maximum deviation of 20%. The highest error in this study belongs to 1-ethyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl] amide. From these 45.8% of the estimated surface tensions were within absolute deviation of 0.00–3.00%, 18.3% were within 3.001–6.00%, 19.35% were within 6.001–10.00%, 7.42% were within 10.001–13.00, and only 9.13% were within 13.001–18.9%. The results obtained by the nonlinear model present that 89.46% of the estimated surface tensions were within absolute deviation of 0.00–3.00%, 8.81% were within 3.001–6.00%, 1.4% were within 6.001–10.00% and merely 0.63% of the prediction have the error between 10.001 and 14.4%. The statistical values for the nonlinear model are tabulated in Table 5. One potential application of the derived model is to determine which combination of studied anion and cation may probably result in highest and lowest surface tensions. Since the model is temperature dependent we calculate the highest and lowest surface tensions at the specific temperature (T = 298.15 K). Among 12 present anions in the structure of studied ionic liquids, octylsulfate ([OctSo4] −) has the lowest anion contribution with σAnion = −0.01653. In addition, 1-decyl-3-methylimidazolium ([C10Mim] +) with the σCation = −0.03158 has the lowest cation contribution among 21 studied cations. It is estimated that the ionic liquid [C10Mim]+[OctSo4] − has the lowest surface tension at 298 K from 252 possible ionic liquids. Furthermore, the ionic liquid with the combination of tetrachloroindate ([InCI4] − with the σAnion = 0.00340, and 1-(2-hydroxyethyl)-3methylimidazolium ([HEMIm] +) with σCation = 0.01418 have the highest surface tension among possible ionic liquids. 5. Validation Validation process is the critical stage of the assessment of the model stability and its predictive capability. If the developed model stands up to the validation scrutiny, it is dubbed as “verified” model and could be safely employed to estimate the particular properties.
Standardized Residuals
−1 T T X : H¼X X X
2 1 0 −1 −2
valid data (Training set) suspended data (Training set) suspended limit (Training set) leverage limit (Training set) valid data (Test set) suspended data (Test set)
−3 −4 −5
0
0.01
0.02
0.03
0.04
Hat Fig. 2. Applicability domain (AD) of the studied ionic liquids.
The various validation techniques applied in this study are described as follows: 5.1. F-test [43] F is the F-ratio defined as follows:
F¼
MSS=df M : RSS=df E
ð13Þ
Where dfM and dfE denote the degree of freedom of the obtained model and the overall error respectively and MSS and RSS are model summation of squares and residual summation of squares, respectively. It is a comparison between the model explained variance and the residual variance. It should be noted that high values of the F-ratio test indicate the reliability of models.
Table 5 Statistical parameters for nonlinear model. Statistical parameter Training set R2 Absolute average relative deviation Standard deviation error Root mean square error N
0.990 1.02 0.000835 0.000834 744
Validation set R2 Absolute average relative deviation Standard deviation error Root mean square error N
0.987 1.14 0.000825 0.000822 93
Test set R2 Absolute average relative deviation Standard deviation error Root mean square error N
0.991 1.05 0.000806 0.000803 93
Total R2 Absolute average relative deviation Standard deviation error Root mean square error N
0.989 1.03 0.00083 0.00083 930
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
5.2. LOO (leave one out) validation technique Leave-one-out is the most common and extensively used internal validation technique. This method is based on partitioning of the sample data into two different subsets one is used as training set and the other as a validation set. The modified training set was generated by deleting one object from the original data set. For each reduced data set, the model output is calculated, and responses for the deleted object were determined from the model. The evaluated leave-one-out cross validation parameter of the obtained linear model is 0.8941. 2 5.3. Adjusted R-squared (Radj )
In a multiple linear regression model, adjusted R square measures the proportion of the variation in the dependent variable accounted for by the explanatory variables. Unlike R square, adjusted R square allows for the degrees of freedom associated with the sums of the squares. Therefore, even though the residual sum of squares decreases or remains the same as new explanatory variables are added, the residual variance does not. For this reason, adjusted R square is generally considered to be a more accurate goodness-of-fit measure than R square.
R
2 adj
n−1 2 ¼ 1− 1−R : n−p′
ð14Þ
Where n and p′ are the numbers of experimental values and the model parameters, respectively. The less difference between this value and the R 2 parameter, the more validity of the model would be expected. The evaluated adjusted-R 2 parameter of the obtained linear model is 0.8946. 5.4. RQK validation technique In lieu of avoiding chance correlations in the model and improved its prediction, Todeschini et al. [44] proposed 4 RQK constraints which must be completely satisfied [38,45–48]: 1. 2. 3. 4.
ΔK = KXY − KX > 0 (Quick rule) ΔQ ¼ Q 2 LOO −Q 2 ASYM > 0 (Asymptotic Q 2rule) R P > 0 (Redundancy RP rule) R N > 0 (Over-fitting RN rule).
The calculated values of RQK test are presented as follows: ΔK = 0.946, ΔQ= 0, RP = 0 and RN = 0.996. The procedure of calculation of parameters of RQK validation techniques, clearly described in the mentioned references. These values which are greater or equal to zero indicate not only the validity of the model, but also approval for non-chance correlation.
85
prediction power of the model in terms of R 2 or Q 2 doesn't change significantly, then the validity of the model is disputable. The y-scrambling parameter is the intercept of the following equation: 2
~k Þ: Q k ¼ a þ brk ðy; y
ð15Þ
Where Qk2 is the explained variance of the model obtained using the same predictors but the kth y-scrambled vector; rk is the correlation between the true response vector and the kth y-scrambled vector. The numerical value of the intercept a is a criteria for assessing of the model if it is a chance correlation or not. The numerical values close to zero verify that the model is not a chance correlation. In other hand, the large values cast doubt on the validity of model and interpret the model as unstable, chance correlation. The y-scrambling should be repeated hundreds of times (in this work 300 times). The value of intercept a has been calculated as −0.027 for the developed linear model. 5.7. External validation technique [51] External validation technique is conducted by testing additional compound for validation set in order to assess the prediction capability of the model. The Q2ext demonstrated as follows:
Q
ntest 2 X yˆ i= −yi i
2 ext
¼ 1− n i¼1 2 test X yi −y training
ð16Þ
i¼1
where y training is the average value of the surface tension of the compounds present in training set, yˆ i= is response of ith object predicted i by the obtained model ignoring the value of the related object (ith experimental surface tension). The less difference between this value and the R 2 parameter, the more validity of the model would be expected. The evaluated Q 2ext parameter of the obtained linear model is 0.8449. Ultimately, all the validation techniques demonstrate the final model as valid, stable, non-chance correlation with high predictive power. Fig. 3 depicts the predicted surface tensions versus the experimental ones. As it is obvious in this figure the majority of points is located in 0.07 0.065
Training set Test set
0.06
5.5. Bootstrap validation technique [49]
ST pred / N.m−1
The bootstrap approach was applied to verify robustness and internal prediction power of the model. In this method, K n-dimensional groups are generated by a repeated random selection of n-chemicals from the original data set. The model obtained on the first selected chemicals is used to predict the values for the excluded compounds and then Q2 is calculated for each model. The bootstrapping was repeated 5000 times, in this study. Consequently, the value Q2boot parameter of the obtained model has been evaluated to be 0.8938.
0.055 0.05 0.045 0.04 0.035 0.03 0.025
5.6. Y-scrambling validation technique [50] The objective of this approach is to assure the developed model is not to be a chance correlation. For this purpose, all responses variable are shuffled randomly without any changes in the predictor set. If the
0.02 0.02
0.03
0.04
0.05
0.06
0.07
ST exp / N.m−1 Fig. 3. Predicted surface tension by the linear model versus experimental ones.
86
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
25
The ionic liquid names and predicted surface tension values were provided as supplementary material.
20
Relative deviation %
15
6. Conclusion
10 5 0 −5 −10 −15 Training set Test set
−20 −25 0.02
0.03
0.04
0.05
0.06
0.07
ST exp / N.m−1 Fig. 4. Relative deviations of predicted surface tensions by the linear model from experimental data.
the vicinity of bisection. This indicates the acceptable accuracy of the prediction. Relative errors of the predicted surface tension values in comparison with experimental ones are portrayed in Fig. 4. As it is shown in this figure, the relative errors of the majority of points lie in 0–3% interval which indicated acceptable prediction error. Fig. 5 illustrates the predicted surface tensions obtained by the LSSVM model versus the experimental ones. Fig. 6 depicts relative errors of the predicted surface tension values by the LSSVM in comparison with the experimental ones. As it is shown in this figure, the relative errors of the majority of points lie in 0–3% interval which indicated acceptable prediction error.
Accurate surface tension values are essential for adequate design of chemical processes which involved ILs specifically those based on interfacial phenomena. Despite of flourishing studies concerning ionic liquids and their properties in the last decade, the amount of available experimental data in this field is still a tiny fraction of all possible ionic liquids (≈ 10 15). In the lack of experimental data, the accurate predictive model is essential to rationally determine unknown properties of ILs prior to their synthesis. This study is the first to tackle the accurate estimation of surface tension of ILs via theoretical molecular descriptors obtained solely from their structures. First, GFA algorithm is applied to select relevant descriptors as well as model derivation. To capture the nonlinear intrinsic nature of surface tension, the LSSVM is also applied to derive a model. The second model in terms of correlation coefficient (R2) as well as accuracy of the prediction (ARD%) is the optimal model in this field so far and generates much better results than previous models. Analysis based on applicability domain of the derived model and LSSVM (0≤h≤0.02903 and R>3 or b −3) implies that reported experimental data for 1-butyl-1-methylpyrrolidinium thiocyanate at 341.35 K and 1-butyl-3-methylimidazolium dicyanamide at 298.15 K [22] are ambiguous and need modification. By the aid of derived model parameters as well as its applicability domain, the reliability of the experimental data could be analyzed to find flawed data points. Furthermore, all seven model descriptors have physical meaning, being primarily related to the size, structure, branching, dispersion and polar interactions. Moreover, the reliability and predictive capability of the model are adequately scrutinized by various statistical validation techniques. The results of validation techniques pronounced that the model is stable and accurate and is immune of chance correlation.
0.07 training set validation set test set
0.065 0.06
ST pred / N.m−1
0.055 0.05 0.045 0.04 0.035 0.03 0.025 0.02 0.02
0.025
0.03
0.035
0.04
0.045
0.05
0.055
0.06
0.065
ST exp / N.m−1 Fig. 5. Predicted surface tension values of LSSVM model versus experimental ones.
0.07
S.A. Mirkhani et al. / Journal of Molecular Liquids 179 (2013) 78–87
87
10
Relative deviation %
5
0
−5
−10 training set validation set test set
−15 0.02
0.025
0.03
0.035
0.04
0.045
0.05
0.055
0.06
0.065
0.07
ST exp / N.m−1 Fig. 6. Relative deviations of predicted surface tensions of LSSVM model from experimental data.
Finally, since the majority of the cations of studied ionic liquids belong to imidazolium family, the model prediction capability will be in favour of this particular group. Appendix A. Supplementary data Supplementary data to this article can be found online at http://dx. doi.org/10.1016/j.molliq.2012.11.018. References [1] M.J. Earle, J.M.S.S. Esperança, M.A. Gilea, J.N.C. Lopes, L.P.N. Rebelo, J.W. Magee, K.R. Seddon, J.A. Widegren, Nature 439 (2006) 831–834. [2] M. Tariq, M.G. Freire, B. Saramago, J.A.P. Coutinho, J.N.C. Lopes, L.P.N. Rebelo, Chemical Society Reviews 41 (2012) 829–868. [3] D.B. Macleod, Transactions of the Faraday Society 19 (1923) 38–41. [4] S. Sugden, Journal of the Chemical Society, Transactions 125 (1924) 1167–1177. [5] T.A. Knotts, W.V. Wilding, J.L. Oscarson, R.L. Rowley, Journal of Chemical & Engineering Data 46 (2001) 1007–1012. [6] M. Deetlefs, K.R. Seddon, M. Shara, PCCP 8 (2006) 642–649. [7] R.L. Gardas, J.A.P. Coutinho, Fluid Phase Equilibria 265 (2008) 57–65. [8] M.H. Mousazadeh, E. Faramarzi, Ionics 17 (2011) 217–222. [9] F. Gharagheizi, P. Ilani-Kashkouli, A.H. Mohammadi, Chemical Engineering Science 78 (2012) 204–208. [10] M.H. Ghatee, A.R. Zolghadr, Fluid Phase Equilibria 263 (2008) 168–175. [11] E. Rilo, J. Pico, S. García-Garabal, L.M. Varela, O. Cabeza, Fluid Phase Equilibria 285 (2009) 83–89. [12] M. Součková, J. Klomfar, J. Pátek, Fluid Phase Equilibria 303 (2011) 184–190. [13] P.J. Carvalho, M.G. Freire, I.M. Marrucho, A.n.J. Queimada, J.o.A.P. Coutinho, Journal of Chemical & Engineering Data 53 (2008) 1346–1350. [14] A. Fernandez, J.n. Garcia, J.S. Torrecilla, M. Oliet, F. Rodriguez, Journal of Chemical & Engineering Data 53 (2008) 1518–1522. [15] S.I. Fletcher, F.B. Sillars, N.E. Hudson, P.J. Hall, Journal of Chemical & Engineering Data 55 (2009) 778–782. [16] P. Kilaru, G.A. Baker, P. Scovazzo, Journal of Chemical & Engineering Data 52 (2007) 2306–2314. [17] J. Klomfar, M. Souckova, J. Patek, Journal of Chemical & Engineering Data 54 (2009) 1389–1394. [18] J. Klomfar, M. Souckova, J. Patek, Journal of Chemical & Engineering Data 56 (2011) 3454–3462. [19] A.B. Pereiro, F. Santamarta, E. Tojo, A. Rodriguez, J. Tojo, Journal of Chemical & Engineering Data 51 (2006) 952–954. [20] A.B. Pereiro, P. Verdia, E. Tojo, A. Rodríguez, Journal of Chemical & Engineering Data 52 (2007) 377–380. [21] J. Restolho, A.P. Serro, J.L. Mata, B. Saramago, Journal of Chemical & Engineering Data 54 (2009) 950–955. [22] L.G.n. Sanchez, J.R. Espel, F. Onink, G.W. Meindersma, A.B.d. Haan, Journal of Chemical & Engineering Data 54 (2009) 2803–2812.
[23] J. Tong, Q.-S. Liu, P. Zhang, J.-Z. Yang, Journal of Chemical & Engineering Data 52 (2007) 1497–1500. [24] A. Wandschneider, J.K. Lehmann, A. Heintz, Journal of Chemical & Engineering Data 53 (2008) 596–599. [25] J. Klomfar, M. Součková, J. Pátek, The Journal of Chemical Thermodynamics 42 (2010) 323–329. [26] A. Muhammad, M.I. Abdul Mutalib, C.D. Wilfred, T. Murugesan, A. Shafeeq, The Journal of Chemical Thermodynamics 40 (2008) 1433–1438. [27] J.-y. Wang, H.-c. Jiang, Y.-m. Liu, Y.-q. Hu, The Journal of Chemical Thermodynamics 43 (2011) 800–804. [28] M.G. Freire, P.J. Carvalho, A.M. Fernandes, I.M. Marrucho, A.J. Queimada, J.A.P. Coutinho, Journal of Colloid and Interface Science 314 (2007) 621–630. [29] G. Law, P.R. Watson, Langmuir 17 (2001) 6138–6141. [30] F. Gharagheizi, Computation Materials Science 40 (2007) 159–167. [31] S.L. Mayo, B.D. Olafson, W.A. Goddard, Journal of Physical Chemistry 94 (1990) 8897–8909. [32] Talete srl, DRAGON for Windows (software for molecular descriptor calculations) .Version 5.5 – 2007 http://www.talete.mi.it. [33] J.H. Friedman, The Annals of Statistics 19 (1991) 1–67. [34] J.H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis With Applications to Biology, Control, and Artificial Intelligence, 1st MIT Press ed. MIT Press, 1992. [35] D. Rogers, A.J. Hopfinger, Journal of Chemical Information and Computer Sciences 34 (1994) 854–866. [36] E.J. Delgado, G.A. Diaz, SAR and QSAR in Environmental Research 17 (2006) 483–496. [37] A. Eslamimanesh, F. Gharagheizi, M. Illbeigi, A.H. Mohammadi, A. Fazlali, D. Richon, Fluid Phase Equilibria 316 (2012) 34–45. [38] F. Gharagheizi, A. Eslamimanesh, F. Farjood, A.H. Mohammadi, D. Richon, Industrial and Engineering Chemistry Research 50 (2011) 11382–11395. [39] S.M. Mousavisafavi, F. Gharagheizi, S.A. Mirkhani, J. Akbari, Journal of Thermal Analysis and Calorimetry in press, http://dx.doi.org/10.1007/s10973-012-2208-7. [40] R. Sedev, Current Opinion in Colloid & Interface Science 16 (2011) 310–316. [41] P. Gramatica, The Royal Society of Chemistry (2012) 458–478. [42] P. Gramatica, QSAR and Combinatorial Science 26 (2007) 694–701. [43] W.J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, Rev ed. Oxford University Press, Oxford, 2000. [44] R. Todeschini, V. Consonni, A. Mauri, M. Pavan, Analytica Chimica Acta 515 (2004) 199–208. [45] F. Gharagheizi, A. Eslamimanesh, A.H. Mohammadi, D. Richon, Chemical Engineering Science 66 (2011) 2959–2967. [46] F. Gharagheizi, A. Eslamimanesh, A.H. Mohammadi, D. Richon, Journal of Chemical & Engineering Data 56 (2011) 720–726. [47] F. Gharagheizi, A. Eslamimanesh, A.H. Mohammadi, D. Richon, Journal of Chemical & Engineering Data 56 (2011) 2587–2601. [48] F. Gharagheizi, M.R.S. Gohar, M.G. Vayeghan, Journal of Thermal Analysis and Calorimetry 109 (2012) 501–506. [49] B. Efron, Journal of the American Statistical Association 82 (1987) 171–185. [50] F. Lindgren, B. Hansen, W. Karcher, M. Sjöström, L. Eriksson, Journal of Chemometrics 10 (1996) 521–532. [51] J. Chiou, Computers and Chemical Engineering 23 (1999) 1277–1291.