A chemical structure based model for the determination of speed of sound in ionic liquids


Journal of Molecular Liquids 196 (2014) 7–13


Journal of Molecular Liquids journal homepage: www.elsevier.com/locate/molliq

Mehdi Sattari a,c, Farhad Gharagheizi a,c, Poorandokht Ilani-Kashkouli a,c, Amir H. Mohammadi a,b,⁎, Deresh Ramjugernath a,⁎⁎

a Thermodynamics Research Unit, School of Engineering, University of KwaZulu-Natal, Howard College Campus, King George V Avenue, Durban 4041, South Africa
b Institut de Recherche en Génie Chimique et Pétrolier (IRGCP), Paris Cedex, France
c Department of Chemical Engineering, Buenzahra Branch, Islamic Azad University, Buenzahra, Iran

Article info

Article history: Received 6 February 2014; Accepted 27 February 2014; Available online 15 March 2014

Keywords: Speed of sound; Ionic liquid; Quantitative structure–property relationship; QSPR; Genetic function approximation; GFA; Model; Prediction

Abstract

In this communication, a reliable quantitative structure–property relationship (QSPR) model is developed to estimate the speed of sound (SS) in ionic liquids at atmospheric pressure. A data set comprising 446 experimental data values for 41 ionic liquids, extracted from the NIST Standard Reference Database, was used to develop and evaluate the model (80% as the training set and 20% as the test set). The effects of both anions and cations are considered in the development of the model. Genetic function approximation (GFA) is applied to select the model parameters (molecular descriptors) and to develop a 9-variable multivariate linear QSPR model. Statistical analysis of the performance of the model with respect to the data set indicates an average absolute relative deviation (AARD%) of 0.92, a coefficient of determination (R²) of 0.9862, and a root mean square error (RMSE) of 18.72 m s−1. © 2014 Elsevier B.V. All rights reserved.

⁎ Correspondence to: A.H. Mohammadi, Institut de Recherche en Génie Chimique et Pétrolier (IRGCP), Paris Cedex, France.
⁎⁎ Corresponding author.
E-mail addresses: [email protected] (A.H. Mohammadi), [email protected] (D. Ramjugernath).

http://dx.doi.org/10.1016/j.molliq.2014.02.041
0167-7322/© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Ionic liquids (ILs) are molten salts that are liquid at or near room temperature (typically below 100 °C) because their ions are poorly coordinated [1]. Because of their unusual thermophysical properties, such as a negligible vapor pressure and high thermal stability, ILs have received great interest, mostly for industrial applications. Owing mainly to their very low vapor pressure, and consequently low volatility, most of them are considered a “green” alternative to the volatile organic solvents commonly used in the chemical industry today. In addition, they can be used in various applications such as separation processes, catalysis and lubricants [2–9].

Another reason for the interest in ILs is the number of possible anion and cation pairings from which an IL can be constituted. This high number of permutations means that an IL could in theory be designed, by combining a suitable ion pair, to have a desired thermophysical property. However, to explore this “tunability” and “designability” of ILs further, models have to be developed that relate the thermophysical properties to the chemical structure or to other physicochemical properties [10,11]. In this communication, the relationship between the speed of sound and chemical structure is determined and an appropriate QSPR model is developed.

The speed of sound (SS) is an important property in chemistry and physics. It is commonly used in equation of state development and can consequently be used to derive several thermophysical properties, such as the reduced bulk modulus, isobaric thermal expansion coefficient, thermal pressure coefficient, isentropic and isothermal compressibilities, isobaric and isochoric heat capacities, and the Joule–Thomson coefficient [12–15]. Unfortunately, because experimental data are limited, the SS has not been extensively utilized for deriving thermodynamic properties of ILs.

Relatively few studies on SS model development for ILs have been reported in the literature. The first, in 2008, was undertaken by Gardas and Coutinho [13], who used 133 data points for 14 imidazolium-based ILs and related the SS to the surface tension and density of the IL. Starting from the relation of Auerbach [16], they applied some modifications and developed a log-scaled model.


$$\log SS = 0.6199\,\log\left(\frac{\sigma}{\rho}\right) + 5.9447 \qquad (1)$$

where σ and ρ are the surface tension in N m−1 and the density in kg m−3, respectively. The proposed model was reported to have an overall relative deviation of 1.96%. Because limited experimental data were available at the time for both the density and the surface tension of ILs, they used their own correlations to predict these two properties. Consequently, a large number of computations were required to use this model to predict SS. In a subsequent study, Singh and Singh [17] developed a model for 3 additional imidazolium-based ILs using the same approach as Gardas and Coutinho.

The aim of this study is to develop a quantitative structure–property relationship (QSPR) model that predicts the SS of various classes of ILs with high accuracy, without any dependency on other properties or correlations. QSPR is a methodology that relates macroscopic properties to the microscopic structure of molecules. This is done by introducing “descriptors”. A molecular descriptor is the final result of a logical and mathematical procedure which transforms chemical information encoded within a symbolic representation of a molecule into a useful number, or the result of some standardized experiment [18]. By this definition, molecular descriptors can be classified into two categories: experimental measurements, such as log P, molar refractivity, dipole moment, polarizability and, in general, physico-chemical properties; and theoretical molecular descriptors, which are derived from a symbolic representation of the molecule, such as 1D-descriptors, 2D-descriptors, and so on.

After the calculation of the molecular descriptors, the next step is to find a model, using statistical and regression analysis, that has the minimum number of descriptors and the highest ability to fit and predict data. The benefit of the predictive capability of the model is that the property of an unknown compound can be estimated without the need for chemical synthesis and experimental measurements.

2. Methodology

2.1. Data preparation

The NIST Standard Reference Database #103b [19] was used to extract the physical property data for the ionic liquids. The database contained 446 reported experimental data points of SS for 41 ILs at atmospheric pressure. The molecules used in the study belong to the imidazolium, phosphonium, pyridinium, pyrrolidinium, and amino acid classes of ILs. The SS data covered a wide range of temperatures (278.15–343.15 K) and values (1128.4–1885.4 m s−1). The ILs used in this study were constituted from 29 cations and 11 anions. The names and structures of the cations and anions are shown in Tables 1 and 2, respectively. Table 3 lists the ILs studied and their ranges of temperature and SS.

The data were screened as follows: where several values of SS were reported for a single temperature, the value with the lowest uncertainty was incorporated into the data set. If the reported values had the same uncertainty, the most recently published value was used. The complete data set of SS values, including the original reference source of the experimental data, is presented as Supplementary materials.

Table 1. Abbreviations and structures of the cations in this study.

No. | Abbreviation | Cation
1 | [TEMA]+ | Triethylmethylammonium
2 | [NH3(CH2)2OH]+ | 2-Hydroxyethylammonium
3 | [OMIm]+ | 1-Methyl-3-octylimidazolium
4 | [C1MIm]+ | 1,3-Dimethylimidazolium
5 | [C2MIm]+ | 1-Ethyl-3-methylimidazolium
6 | [C4MIm]+ | 1-Butyl-3-methylimidazolium
7 | [C8MIm]+ | 1-Octyl-3-methylimidazolium
8 | [M0-py]+ | N-butylpyridinium
9 | [C3MIm]+ | 1-Methyl-3-propylimidazolium
10 | [C4MPyr]+ | 1-Butyl-1-methylpyrrolidinium
11 | [C3Py]+ | 1-Propylpyridinium
12 | [C6MIm]+ | 1-Hexyl-3-methylimidazolium
13 | [C5Mim]+ | 1-Pentyl-3-methylimidazolium
14 | [bmpy]+ | 1-Butyl-3-methylpyridinium
15 | [P14,6,6,6]+ | Trihexyltetradecylphosphonium
16 | [C8MPy]+ | N-octyl-3-methylpyridinium
17 | [C2MPyr]+ | 1-Ethyl-1-methylpyrrolidinium
18 | [C2Py]+ | 1-Ethylpyridinium
19 | [C4EPyr]+ | 1-Butyl-1-ethylpyrrolidinium
20 | [AlaC3]+ | L-Alanine, 1-methylethyl ester
21 | [AlaC4]+ | L-Alanine, 2-methylpropyl ester
22 | [GlyC3]+ | Glycine, 1-methylethyl ester
23 | [GlyC4]+ | Glycine, 2-methylpropyl ester
24 | [GluC3]+ | L-Glutamic acid, 1,5-bis(1-methylethyl) ester
25 | [GluC4]+ | L-Glutamic acid, 1,5-bis(2-methylpropyl) ester
26 | [ValC3]+ | L-Valine, 1-methylethyl ester
27 | [ValC4]+ | L-Valine, 2-methylpropyl ester
28 | [ProC3]+ | L-Proline, 1-methylethyl ester
29 | [ProC4]+ | L-Proline, 2-methylpropyl ester

Table 2. Abbreviations and structures of the anions in this study.

No. | Abbreviation | Anion
1 | [MeSO4]− | Methylsulfate
2 | [ac]− | Acetate
3 | [Cl]− | Chloride
4 | [TfO]− | Trifluoromethanesulfonate
5 | [PF6]− | Hexafluorophosphate
6 | [BF4]− | Tetrafluoroborate
7 | [NTf2]− | Bis[(trifluoromethyl)sulfonyl]imide
8 | [EtSO4]− | Ethyl sulfate
9 | [OcSO4]− | Octylsulfate
10 | [dca]− | Dicyanamide
11 | [DS]− | Dodecyl sulfate

2.2. Calculation of the descriptors

In order to relate the desired property to the constituent cation and anion combinations, the descriptors of both ions were calculated for each IL. To optimize the 3D chemical structure of each cation and anion, the Dreiding force field as implemented in ChemAxon's JChem was employed [20]. Thereafter, over 2000 molecular descriptors were calculated by importing the SMILES (simplified molecular input line entry specification) structures of all cations and anions, separately. These descriptors belong to 15 classes of descriptors: constitutional descriptors; topological indices; walk and path counts; connectivity indices; information indices; 2D autocorrelations; Burden eigenvalues; edge–adjacency indices; functional group counts; atom-centered fragments; molecular properties; topological charge indices; eigenvalue-based indices; 2D binary fingerprint; 2D frequency fingerprint; and 3D conformational descriptors.

Descriptors that the software could not calculate were removed from the list. Thereafter, pair correlation was evaluated for all descriptors: for each pair of descriptors with a correlation coefficient greater than 0.9, one descriptor was removed and the other was retained for model development.

2.3. Subset selection

In QSPR modeling, the experimental data is normally divided into two subsets. One, the training set, is used to develop and train the model. The other, the test set, is used to determine the predictive capability of the model for new compounds which have not been used in model development. For this purpose, K-means clustering is used to select the training and test sets. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. As a result, 20% of the ILs were kept for the test set and the remainder were used as the training set to develop the model.

Table 3. List of ionic liquids and their frequency in this study.

No. | Compound | T range/K | SS range/m s−1 | AARD% | Subset | N
1 | [TEMA][MeSO4] | 308.15–343.15 | 1761.5–1853.5 | 1.77 | Test | 8
2 | [NHHH,(CH2)2OH][ac] | 571.3–571.3 | 1790.73–1790.73 | 1.84 | Training | 1
3 | [OMIm][Cl] | 278.15–343.15 | 1510.2–1885.4 | 3.35 | Test | 14
4 | [C1MIm][MeSO4] | 283.15–343.15 | 1708–1851 | 0.36 | Training | 13
5 | [C2MIm][TfO] | 278.15–338.15 | 1348.51–1482.23 | 0.63 | Training | 10
6 | [C4MIm][PF6] | 278.15–343.15 | 1329.4–1492.5 | 0.88 | Training | 14
7 | [C4MIm][BF4] | 283.15–343.15 | 1462.1–1604.5 | 1.39 | Training | 13
8 | [C4MIm][TfO] | 293.15–318.15 | 1348.1–1403.4 | 0.29 | Training | 6
9 | [C2MIm][NTf2] | 293.15–293.15 | 1240–1240 | 2.91 | Training | 1
10 | [C4MIm][NTf2] | 293.15–293.15 | 1227–1227 | 1.11 | Training | 1
11 | [C8MIm][NTf2] | 293.15–293.15 | 1232–1232 | 3.43 | Test | 1
12 | [M0-py][BF4] | 293.15–323.15 | 1543.45–1611.47 | 1.31 | Test | 4
13 | [C3MIm][NTf2] | 293.15–343.15 | 1137–1243 | 0.71 | Training | 11
14 | [C4MPyr][NTf2] | 278.15–343.15 | 1173–1316 | 0.77 | Training | 14
15 | [C3Py][BF4] | 278.15–338.15 | 1549.54–1691.9 | 0.36 | Training | 25
16 | [C6MIm][BF4] | 293.15–318.15 | 1470.5–1532.7 | 0.33 | Test | 6
17 | [C8MIm][BF4] | 293.15–343.15 | 1361.1–1495.6 | 1.32 | Training | 11
18 | [C5Mim][NTf2] | 298.15–298.15 | 1227–1227 | 0.95 | Training | 1
19 | [C6MIm][PF6] | 278.15–343.15 | 1318.4–1490.7 | 0.34 | Test | 14
20 | [C8MIm][PF6] | 278.15–343.15 | 1294.6–1481.6 | 0.40 | Training | 14
21 | [C2MIm][EtSO4] | 288.15–343.15 | 1566.4–1703.9 | 1.04 | Training | 12
22 | [C6MIm][NTf2] | 283.15–343.15 | 1128.4–1262 | 2.86 | Test | 8
23 | [C4MIm][MeSO4] | 278.15–343.15 | 1552.1–1711.3 | 1.31 | Training | 14
24 | [C4MIm][OcSO4] | 278.15–343.15 | 1349.6–1557.2 | 0.57 | Training | 50
25 | [bmpy][BF4] | 293.15–318.15 | 1538.9–1599.8 | 1.47 | Training | 6
26 | [P14,6,6,6][dca] | 278.15–343.15 | 1390–1599 | 0.52 | Training | 14
27 | [C8MPyr][BF4] | 278.15–328.15 | 1433.6–1571.9 | 0.44 | Training | 21
28 | [C4MPyr][MeSO4] | 298.15–343.15 | 1625.5–1741.6 | 0.70 | Training | 10
29 | [C2MPyr][EtSO4] | 308.15–343.15 | 1665.7–1750.2 | 0.63 | Training | 8
30 | [C2Py][EtSO4] | 298.15–343.15 | 1608–1711 | 0.78 | Training | 10
31 | [C4EPyr][EtSO4] | 328.15–343.15 | 1564.9–1602 | 0.81 | Training | 4
32 | [AlaC3][DS] | 288.15–343.15 | 1266.3–1432.3 | 0.88 | Training | 12
33 | [GlyC3][DS] | 303.15–343.15 | 1245.8–1373.4 | 0.93 | Test | 9
34 | [GlyC4][DS] | 303.15–343.15 | 1248.7–1372.1 | 0.32 | Training | 9
35 | [AlaC4][DS] | 288.15–343.15 | 1246.6–1447.6 | 1.00 | Training | 12
36 | [GluC3][DS] | 288.15–343.15 | 1233.6–1439.7 | 1.28 | Training | 12
37 | [GluC4][DS] | 323.15–343.15 | 1270.5–1330.6 | 1.93 | Training | 5
38 | [ValC3][DS] | 288.15–343.15 | 1254.9–1438.4 | 1.57 | Test | 12
39 | [ValC4][DS] | 288.15–343.15 | 1249.5–1434.4 | 0.73 | Training | 12
40 | [ProC3][DS] | 288.15–343.15 | 1284–1455.5 | 0.93 | Training | 12
41 | [ProC4][DS] | 288.15–343.15 | 1286.2–1460.2 | 0.54 | Training | 12

2.4. Genetic function approximation (GFA) for model development

Genetic function approximation (GFA) is a fusion of two apparently distinct algorithms: the multivariate adaptive regression splines (MARS) of Friedman [21] and the genetic algorithm (GA) introduced by Holland [22]. It was originally proposed in the pioneering work of Rogers and Hopfinger [23]. Generally, the target of most QSPR studies is to introduce a linear combination of basis functions φk(X) of the features X = {x1, …, xm} in the training data set of size M:

$$F(X) = b_0 + \sum_{k=1}^{M} b_k\,\varphi_k(X) \qquad (2)$$

The GFA approach works by generating the initial population of equations by a random selection of descriptors. The fitness function used in GFA during the evolution is Friedman's lack of fit (LOF) score, which is described by the following formula:

$$\mathrm{LOF}(\text{model}) = \frac{1}{N}\,\frac{\mathrm{LSE}(\text{model})}{\left(1-\dfrac{c+1+(d\times p)}{N}\right)^{2}} \qquad (3)$$

In this LOF function, c is the number of non-constant basis functions, N is the number of samples in the data set, d is a smoothing factor to be set by the user, p is the total number of parameters in the model, and LSE is the least square error of the model. Using the LOF score leads to models with better predictive ability without the problem of overfitting.

The initial QSPR models were developed by selecting random sets of descriptors from the pool. The next step was a genetic recombination, or crossover, process conducted on the linear string of descriptors. The two best models in terms of fitness were selected as parents. Each parent was then split randomly into two parts at a crossing point, and the first substring of the first parent was combined with the second substring of the second parent to create two new children. Next, the worst model was replaced by the best new child model. This process continued until no significant improvement in fitness was observed in the population. For a population of 300 models, 3000 to 10,000 genetic operations are normally sufficient to achieve convergence.

3. Results and discussion

To obtain the simplest and most accurate predictive model, several models with different numbers of descriptors were studied and the best model in terms of R² was selected. The final model consisted of an intercept, temperature, and cation and anion descriptors. To simplify the model, the effects of the cation and anion descriptors are shown as SSCation and SSAnion.

$$
\begin{aligned}
SS &= \text{intercept} + SS_{\text{Cation}} + SS_{\text{Anion}} - 2.7271\,T \\
\text{intercept} &= 3000.68056 \\
SS_{\text{Cation}} &= 98.66368 \times \text{Mor03u} - 88.27463 \times \text{ATS5m} - 638.70715 \times \text{X2A} - 49.4333 \times \text{R⋯CR⋯R} \\
SS_{\text{Anion}} &= -30.37573 \times nC - 47.8716 \times nF - 104.91867 \times \text{MATS3p} - 510.95776 \times \text{JGI2}
\end{aligned}
\qquad (4)
$$

R² = 0.9919, R²adj = 0.9917, nTraining = 370, nTest = 76
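As a minimal sketch of how Eq. (4) might be evaluated in practice, the Python function below sums the intercept, the cation and anion contributions, and the temperature term. The coefficients are copied from Eq. (4); the function name, the dictionary layout, and the ASCII key `"R..CR..R"` (standing in for the R⋯CR⋯R fragment count) are illustrative assumptions, and the descriptor values themselves would have to come from a descriptor-calculation package or from the Supplementary materials.

```python
# Coefficients copied from Eq. (4); everything else here is an illustrative assumption.
INTERCEPT = 3000.68056
T_COEFF = -2.7271  # multiplies the absolute temperature T in K

CATION_COEFFS = {
    "Mor03u": 98.66368,
    "ATS5m": -88.27463,
    "X2A": -638.70715,
    "R..CR..R": -49.4333,  # ASCII stand-in for the R⋯CR⋯R fragment count
}
ANION_COEFFS = {
    "nC": -30.37573,
    "nF": -47.8716,
    "MATS3p": -104.91867,
    "JGI2": -510.95776,
}


def _contribution(coeffs, descriptors):
    """Weighted sum of descriptor values; fails loudly if one is missing."""
    missing = sorted(set(coeffs) - set(descriptors))
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    return sum(coeff * descriptors[name] for name, coeff in coeffs.items())


def predict_speed_of_sound(temperature_k, cation, anion):
    """Speed of sound in m s^-1 from Eq. (4) for one cation/anion pair."""
    return (
        INTERCEPT
        + _contribution(CATION_COEFFS, cation)
        + _contribution(ANION_COEFFS, anion)
        + T_COEFF * temperature_k
    )
```

Because the model is linear in T, raising the temperature by 10 K lowers the predicted SS by 27.271 m s−1 for any ion pair, whatever the descriptor values are.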


Table 4. Statistical parameters of the model.

Statistical parameter | Training set | Test set | Total
R² (a) | 0.9919 | 0.9701 | 0.9862
Average absolute relative deviation (b) | 0.76 | 1.66 | 0.92
Standard deviation error (c) | 13.53 | 34.10 | 18.72
Root mean square error (d) | 13.54 | 34.12 | 18.72
N | 370 | 76 | 446

(a) $$R^2 = 1 - \frac{\sum_{i=1}^{N}\left[y(i)_{calc} - y(i)_{exp}\right]^2}{\sum_{i=1}^{N}\left[y(i)_{calc} - \bar{y}_{exp}\right]^2}$$

(b) $$\mathrm{AARD\%} = \frac{100}{N}\sum_{i=1}^{N}\frac{\left|y(i)_{calc} - y(i)_{exp}\right|}{y(i)_{exp}}$$

(c) $$\mathrm{SDEC} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[e(i) - \bar{e}\right]^2},\qquad e = y_{calc} - y_{exp}$$

(d) $$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[y(i)_{calc} - y(i)_{exp}\right]^2}$$
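The four footnote formulas of Table 4 can be sketched in a few lines of plain Python. The function names are illustrative assumptions. The R² function follows footnote (a) as printed, with the calculated values in the denominator; note that the more common convention uses the experimental values there instead.

```python
import math


def r_squared(y_calc, y_exp):
    """R^2 as written in footnote (a): the denominator uses y_calc - mean(y_exp)."""
    mean_exp = sum(y_exp) / len(y_exp)
    numerator = sum((c - e) ** 2 for c, e in zip(y_calc, y_exp))
    denominator = sum((c - mean_exp) ** 2 for c in y_calc)
    return 1.0 - numerator / denominator


def aard_percent(y_calc, y_exp):
    """Average absolute relative deviation in percent, footnote (b)."""
    n = len(y_exp)
    return 100.0 / n * sum(abs(c - e) / e for c, e in zip(y_calc, y_exp))


def sdec(y_calc, y_exp):
    """Standard deviation of the errors e = y_calc - y_exp, footnote (c)."""
    errors = [c - e for c, e in zip(y_calc, y_exp)]
    mean_error = sum(errors) / len(errors)
    return math.sqrt(sum((err - mean_error) ** 2 for err in errors) / len(errors))


def rmse(y_calc, y_exp):
    """Root mean square error, footnote (d)."""
    n = len(y_exp)
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(y_calc, y_exp)) / n)
```

SDEC and RMSE coincide when the mean error is zero, which is consistent with the nearly identical values reported for them in Table 4.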

Fig. 1. Comparison between the experimental and calculated/predicted speed of sound in ionic liquids; (*) for the training set, and (o) for the test set.

Fig. 2. Relative error of the model from the experimental values of speed of sound; (*) for the training set, and (o) for the test set.

Fig. 3. Relative deviation of calculated/predicted speed of sound from experimental data; (*) for the training set, and (o) for the test set.

In Eq. (4), T is the absolute temperature and nC and nF are the numbers of carbon and fluorine atoms, respectively. R⋯CR⋯R is an atom-centered fragment descriptor: the number of carbon atoms on an aromatic ring that have three carbon neighbors on the same aromatic ring. Mor03u is the signal 3 / unweighted 3D-MoRSE descriptor (3D molecule representation of structures based on electron diffraction); these descriptors are derived from infrared spectra simulation using a generalized scattering function [24]. X2A is the average connectivity index chi-2, which belongs to the Kier–Hall connectivity indices. A molecular connectivity index is calculated by drawing the chemical as a hydrogen-suppressed molecular structure and counting the number of adjacent non-hydrogen atoms for each atom. This descriptor reflects the relative accessibility of each bond to encounter other bonds of the same molecule [25]. ATS5m is the Broto–Moreau autocorrelation of a topological structure, lag 5, weighted by atomic masses [26]. MATS3p is the Moran autocorrelation, lag 3, weighted by atomic polarizabilities [27]. JGI2 is the mean topological charge index of order 2. The topological charge indices evaluate the charge transfer between pairs of atoms and hence the global charge transfer in the molecule [28].

According to the explanation of the descriptors and the model parameters, the speed of sound is affected mostly by the X2A descriptor in cations and by JGI2 in anions. This means that the effects of bonds on each other in cations, and the charge transfer between atoms in anions, are the most important factors affecting the speed of sound of ILs.

The statistical parameters of the final model are shown below Eq. (4). nTraining and nTest are the numbers of data points used in the training and test sets, respectively, and R² is the squared correlation coefficient of the model. A summary of the statistical parameters of the model for the training and test sets can be found in Table 4.
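To make the connectivity-index description above concrete, the sketch below computes an average second-order connectivity index for a hydrogen-suppressed graph. It assumes the usual Kier–Hall path definition, in which each two-bond path i–j–k contributes (δi δj δk)^(−1/2) with δ the number of non-hydrogen neighbors, and it assumes that "average" means dividing by the number of such paths. The function name and the adjacency-dictionary input are hypothetical, and descriptor software may apply further conventions, so values need not match those used for Eq. (4).

```python
from itertools import combinations


def average_chi2(adjacency):
    """Average second-order connectivity index of a hydrogen-suppressed graph.

    adjacency maps each heavy-atom index to the indices of its bonded heavy atoms.
    """
    # delta: number of non-hydrogen neighbours of each atom
    degree = {atom: len(neighbours) for atom, neighbours in adjacency.items()}
    terms = []
    for centre, neighbours in adjacency.items():
        # every unordered pair of neighbours defines one two-bond path through centre
        for left, right in combinations(sorted(neighbours), 2):
            terms.append((degree[left] * degree[centre] * degree[right]) ** -0.5)
    if not terms:
        raise ValueError("the graph contains no two-bond paths")
    return sum(terms) / len(terms)


# n-butane as a hydrogen-suppressed chain C0-C1-C2-C3
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(average_chi2(n_butane))  # two paths, each (1*2*2)**-0.5 = 0.5, so the mean is 0.5
```

For isobutane, with one central carbon bonded to three methyl groups, the same function gives three equal path terms of (1·3·1)^(−1/2), illustrating how branching changes the index.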


The predicted values of SS are compared with the experimental values in Fig. 1. In addition, the deviation of the model from the experimental data is shown in Fig. 2. According to Tables 3 and 4, the overall AARD of the training set is less than 0.8%, which indicates that the model fits the data very well. In the training set, about 73% of the data points have absolute deviations between 0 and 1%, about 27% between 1.01 and 3.00%, and just one point has a deviation above 3% (3.01%). A summary of these deviations is presented in Fig. 3 for both the training set and the test set.

For the test set, the AARD is less than 1.7%, which shows the good predictive ability of the model. About 38% of the predicted values have deviations from 0 to 1.00%, 46% from 1.01 to 3.00%, and 16% above 3.0%. The maximum prediction error belongs to 1-octyl-3-methylimidazolium bis[(trifluoromethyl)sulfonyl]imide, which has an AARD% of 3.43 for one data point. Also, 1-methyl-3-octylimidazolium chloride shows a relatively large deviation because of the non-linear temperature dependence of its SS, which cannot be predicted by the current linear model. The other ILs have satisfactory predicted results owing to the linear temperature dependence of their SS. All information for the whole data set, as well as the values of the descriptors of the ILs, is available in the Supplementary materials.

4. Validation

Validation is the most critical step in evaluating the stability of a model and its predictive capability. If the model developed stands up to the validation check, it is dubbed a “verified” model and can be used with some degree of confidence to estimate the particular properties [29]. The various validation techniques applied in this study are described below.

4.1. F-test

F is the Fisher function, which is defined as the ratio between the model summation of squares (MSS) and the residual summation of squares (RSS) [30]:

$$F = \frac{MSS/df_M}{RSS/df_E} \qquad (5)$$

where df_M and df_E denote the degrees of freedom of the obtained model and of the overall error, respectively. It is a comparison between the variance explained by the model and the residual variance. High values of the F-ratio test indicate the reliability of models [31]. In this study, the F-test was applied in the process of model selection.

4.2. LOO (leave one out) validation technique

Leave-one-out is one of the most common and extensively used of the validation techniques known as internal validation. Internal or cross-validation techniques are based on partitioning the sample data into two different subsets; one serves as a training set and the other as a validation set. Each modified training set is generated by deleting one object from the original data set. For each reduced data set, the model is calculated and the response for the deleted object is calculated from the model. The squared difference between the true response and the predicted one for the omitted object is used to calculate R²CV, which is also called Q²LOO [32]:

$$R^2_{CV} \equiv Q^2_{LOO} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_{ic}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (6)$$

where y_i, ȳ, and ŷ_ic are the SS of the ith IL, the mean value of SS for all ILs, and the response of the ith object, respectively. A small difference between Q²LOO and R² indicates the predictive ability of the proposed model. In this study Q²LOO was 0.9913, which is very close to the model R² of 0.9919.

4.3. Adjusted R-squared (R²adj)

In a multiple linear regression model, adjusted R² measures the proportion of the variation in the dependent variable accounted for by the explanatory variables. Unlike R², adjusted R² allows for the degrees of freedom associated with the sums of the squares. Therefore, even though the residual sum of squares decreases or remains the same as new explanatory variables are added, the residual variance does not. For this reason, adjusted R² is generally considered to be a more accurate goodness-of-fit measure than R² [33].

$$R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p'} \qquad (7)$$

where n and p′ are the numbers of experimental values and of model parameters, respectively. The smaller the difference between this value and R², the greater the expected validity of the model. The value of R²adj for the proposed model was calculated as 0.9917.

4.4. RQK validation technique

Todeschini et al. [34] proposed 4 RQK constraints which must be completely satisfied to ensure the prediction capability of the model and to verify that the model is not a chance correlation:

$$
\begin{aligned}
&1{:}\ \Delta K = K_{XY} - K_X > 0 \quad \text{(quick rule)} \\
&2{:}\ \Delta Q = Q^2_{LOO} - Q^2_{ASYM} > 0 \quad \text{(asymptotic } Q^2 \text{ rule)} \\
&3{:}\ R^P > 0 \quad \text{(redundancy RP rule)} \\
&4{:}\ R^N > 0 \quad \text{(over-fitting RN rule)}
\end{aligned}
\qquad (8)
$$

The calculated values of the RQK test for the proposed model are:

ΔK = 5.49, ΔQ = 0.01, R^P = 0.099, R^N = 0.

These values are greater than or equal to zero, which indicates that the model is valid and is not a chance correlation.

4.5. Bootstrap validation technique

Bootstrapping is a statistical method for estimating the sampling distribution of an estimator by sampling with replacement from the original sample [35]. In this technique, the original size of the data set (n) is preserved for the training set by selecting the objects with repetition. The training set therefore contains some duplicated objects, and the evaluation set contains the objects that were left out. The model is computed using the training set and responses are predicted on the evaluation set. This procedure is repeated thousands of times and, finally, the average predictive power (Q²boot) is calculated. In this study, bootstrapping was repeated 5000 times and the Q²boot of the model was evaluated as 0.9911.

4.6. y-Scrambling validation technique

To verify whether the model is a chance correlation or not, this validation technique randomly shuffles all response variables without changing the predictor set [36]. If the predictive power of the model in terms of R² or Q² changes significantly after the scrambling has been repeated several hundred times, then the model is not a chance correlation.

$$Q^2_k = a + b \times r_k\left(y, \tilde{y}_k\right) \qquad (9)$$

where Q²k is the explained variance of the model obtained using the same predictors but the kth y-scrambled vector, and r_k is the correlation between the true response vector and the kth y-scrambled vector. The value of the intercept a shows whether the obtained model is a chance correlation. If it is close to zero, the model was not found by chance and is a stable one [34]. In this study the y-scrambling was performed 300 times and the intercept was calculated as −0.064, which is close to zero.

4.7. External validation technique

External validation is a technique for evaluating the predictive power of a model using an external data set called a test set. Q²ext is defined as:

$$Q^2_{ext} = 1 - \frac{\sum_{i=1}^{n_{test}}\left(\hat{y}_{i/i} - y_i\right)^2}{\sum_{i=1}^{n_{test}}\left(y_i - \bar{y}_{training}\right)^2} \qquad (10)$$

where ȳ_training is the average value of the SS of the compounds present in the training set and ŷ_{i/i} is the response of the ith object predicted by the obtained model ignoring the value of the related object (the ith experimental SS). The smaller the difference between Q²ext and Q², the greater the expected validity of the model. The evaluated value of this parameter for the model was 0.9486.

All the validation techniques demonstrate that the final model is valid, stable and a non-chance correlation with high predictive capability.

5. Conclusion

A QSPR model was developed in this study to predict the speed of sound in several ionic liquids. The proposed model is a 9-variable multivariate linear model which has been developed based on the experimental data for 41 ionic liquids. The molecular descriptors were selected using the GFA technique and are calculated from the SMILES structures of the ionic liquids, optimized using the Dreiding force field. The results show that using eight descriptors and temperature as the input parameters produces a simple and accurate model, with an AARD% of 0.76% for the training set and 1.66% for the test set. It should be noted that, because the speed of sound of the ionic liquids varies linearly with temperature in the available data sets, the proposed model can only predict linear behavior, but does so with good accuracy. The overall AARD% of the model was 0.92%, which indicates its accuracy in correlating and predicting the speed of sound.

Acknowledgment

The authors would like to thank Dr. Michael Frenkel of the National Institute of Standards and Technology for providing the NIST Standard Reference Database. This work is based upon research supported by the South African Research Chairs Initiative of the Department of Science and Technology and National Research Foundation.

Appendix A. Supplementary data

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.molliq.2014.02.041.
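As a brief numerical companion to the validation statistics of Section 4, the sketch below implements the F-ratio of Eq. (5), the adjusted R² of Eq. (7) and the external Q² of Eq. (10) in plain Python. The function and argument names are illustrative assumptions, and the paper does not state whether p′ counts the intercept, so both readings are noted.

```python
def f_ratio(mss, rss, df_model, df_error):
    """Eq. (5): model mean square divided by residual mean square."""
    return (mss / df_model) / (rss / df_error)


def adjusted_r2(r2, n, p):
    """Eq. (7): R^2 corrected for n data points and p model parameters."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p)


def q2_ext(y_test, y_pred, y_training_mean):
    """Eq. (10): external Q^2, referenced to the mean of the training-set responses."""
    press = sum((pred - obs) ** 2 for pred, obs in zip(y_pred, y_test))
    total = sum((obs - y_training_mean) ** 2 for obs in y_test)
    return 1.0 - press / total


# Training-set figures reported with Eq. (4): R^2 = 0.9919, n = 370.
# p = 9 (variables only) and p = 10 (variables plus intercept) both round to 0.9917.
print(round(adjusted_r2(0.9919, 370, 9), 4), round(adjusted_r2(0.9919, 370, 10), 4))
```

With either reading of p′, the result agrees with the R²adj of 0.9917 reported in Section 4.3, so the check does not distinguish between the two conventions.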


References

[1] G. Wypych, Handbook of Solvents, ChemTec Publishing, Toronto, 2001.
[2] L.A. Aslanov, J. Mol. Liq. 162 (3) (2011) 101. http://dx.doi.org/10.1016/j.molliq.2011.06.006.
[3] E. Clavero, J. Rodriguez, J. Mol. Liq. 163 (2) (2011) 64. http://dx.doi.org/10.1016/j.molliq.2011.07.014.
[4] H.S. Kim, R. Pani, S.H. Ha, Y.-M. Koo, Y.G. Yingling, J. Mol. Liq. 166 (0) (2012) 25. http://dx.doi.org/10.1016/j.molliq.2011.11.008.
[5] H. Jun, Y. Ouchi, D. Kim, J. Mol. Liq. 179 (0) (2013) 54. http://dx.doi.org/10.1016/j.molliq.2012.12.013.
[6] C.V. Manohar, T. Banerjee, K. Mohanty, J. Mol. Liq. 180 (0) (2013) 145. http://dx.doi.org/10.1016/j.molliq.2013.01.019.
[7] X. Zhu, Y. Gao, L. Zhang, H. Li, J. Mol. Liq. 190 (0) (2014) 174. http://dx.doi.org/10.1016/j.molliq.2013.11.007.
[8] L. Guo, X. Pan, C. Zhang, M. Wang, M. Cai, X. Fang, S. Dai, J. Mol. Liq. 158 (2) (2011) 75. http://dx.doi.org/10.1016/j.molliq.2010.10.011.
[9] M.M. Papari, S. Amighi, M. Kiani, D. Mohammad-Aghaie, B. Haghighi, J. Mol. Liq. 175 (0) (2012) 61. http://dx.doi.org/10.1016/j.molliq.2012.08.013.
[10] S. Zahn, M. Brehm, M. Brüssel, O. Hollóczki, M. Kohagen, S. Lehmann, F. Malberg, A.S. Pensado, M. Schöppke, H. Weber, B. Kirchner, J. Mol. Liq. (2014). http://dx.doi.org/10.1016/j.molliq.2013.08.015 (in press).
[11] Y. Chen, Y. Cao, X. Sun, T. Mu, J. Mol. Liq. 190 (0) (2014) 151. http://dx.doi.org/10.1016/j.molliq.2013.11.010.
[12] C.W. Lin, J.P.M. Trusler, J. Chem. Phys. 136 (9) (2012) 094511.
[13] R.L. Gardas, J.A.P. Coutinho, Fluid Phase Equilib. 267 (2) (2008) 188. http://dx.doi.org/10.1016/j.fluid.2008.03.008.
[14] A.F. Estrada-Alexanders, D. Justo, J. Chem. Thermodyn. 36 (5) (2004) 419. http://dx.doi.org/10.1016/j.jct.2004.02.002.
[15] I.A. Johnston, (Defence Science And Technology Organisation Edinburgh (Australia), Weapons Systems Division, 2005).
[16] R. Auerbach, Cell. Mol. Life Sci. 4 (12) (1948) 473. http://dx.doi.org/10.1007/bf02164502.
[17] M.P. Singh, R.K. Singh, Fluid Phase Equilib. 304 (1–2) (2011) 1. http://dx.doi.org/10.1016/j.fluid.2011.01.029.
[18] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, Chichester, 2000.
[19] M. Frenkel, R.D. Chirico, V. Diky, C.D. Muzny, A.F. Kazakov, J.W. Magee, I.M. Abdulagatov, K. Kroenlein, C.A. Diaz-Tovar, J.W. Kang, R. Gani, (National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, http://www.nist.gov/srd/nist103b.cfm, 2011).
[20] S.L. Mayo, B.D. Olafson, W.A. Goddard, J. Phys. Chem. 94 (26) (1990) 8897. http://dx.doi.org/10.1021/j100389a010.
[21] J.H. Friedman, Ann. Stat. 19 (1) (1991) 1. http://dx.doi.org/10.1214/aos/1176347963.
[22] J.H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis With Applications to Biology, Control, and Artificial Intelligence, 1st MIT Press ed. University of Michigan Press, 1975. (p xiv, 211 pp.).
[23] D. Rogers, A.J. Hopfinger, J. Chem. Inf. Comput. Sci. 34 (4) (1994) 854. http://dx.doi.org/10.1021/ci00020a020.
[24] J.H. Schuur, P. Selzer, J. Gasteiger, J. Chem. Inf. Comput. Sci. 36 (2) (1996) 334. http://dx.doi.org/10.1021/ci950164c.
[25] L.B. Kier, L.H. Hall, Molecular Connectivity in Structure–Activity Analysis, Research Studies Press, 1986.
[26] P. Broto, G. Moreau, C. Vandycke, Eur. J. Med. Chem. 19 (1) (1984) 79.
[27] P.A.P. Moran, Biometrika 37 (1/2) (1950) 17.
[28] J. Galvez, R. Garcia, M.T. Salabert, R. Soler, J. Chem. Inf. Comput. Sci. 34 (3) (1994) 520. http://dx.doi.org/10.1021/ci00019a008.
[29] M. Sattari, F. Gharagheizi, Chemosphere 72 (9) (2008) 1298. http://dx.doi.org/10.1016/j.chemosphere.2008.04.049.
[30] W.J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, Rev ed. Oxford University Press, Oxford, 2000. (p xxi, 586 pp.).
[31] F. Gharagheizi, M. Sattari, P. Ilani-Kashkouli, A.H. Mohammadi, D. Ramjugernath, D. Richon, Chem. Eng. Sci. 84 (2012) 557. http://dx.doi.org/10.1016/j.ces.2012.08.036.
[32] A. Eslamimanesh, F. Gharagheizi, A.H. Mohammadi, D. Richon, J. Chem. Eng. Data 56 (10) (2011) 3775. http://dx.doi.org/10.1021/je200444f.
[33] S.A. Mirkhani, F. Gharagheizi, M. Sattari, Chemosphere 86 (9) (2012) 959. http://dx.doi.org/10.1016/j.chemosphere.2011.11.021.
[34] R. Todeschini, V. Consonni, A. Mauri, M. Pavan, Anal. Chim. Acta. 515 (1) (2004) 199. http://dx.doi.org/10.1016/j.aca.2003.12.010.
[35] B. Efron, Ann. Stat. 7 (1) (1979) 1 (citeulike-article-id:2825416).
[36] F. Lindgren, B. Hansen, W. Karcher, M. Sjöström, L. Eriksson, J. Chemom. 10 (5–6) (1996) 521. http://dx.doi.org/10.1002/(sici)1099-128x(199609)10:5/6<521::aid-cem448>3.0.co;2-j