Fluid Phase Equilibria 383 (2014) 108–114
Fluid Phase Equilibria, journal homepage: www.elsevier.com/locate/fluid
Solubility modeling in three supercritical carbon dioxide, ethane and trifluoromethane fluids by one set of molecular descriptors
Reza Tabaraki *, Aref Toulabi
Department of Chemistry, Faculty of Science, Ilam University, Ilam, Iran
ARTICLE INFO
Article history: Received 12 February 2014; Received in revised form 22 August 2014; Accepted 5 October 2014; Available online 8 October 2014
Keywords: Solubility; Supercritical fluid; Carbon dioxide; Ethane; Trifluoromethane; ANN; MLR

ABSTRACT
Quantitative structure–property relationships (QSPR) were developed for the first time for predicting solubility in supercritical carbon dioxide, ethane and trifluoromethane over a wide range of pressures (5.1–36.2 MPa) and temperatures (308–343 K). A large number of descriptors were calculated and a subset of them was selected by genetic algorithm–multiple linear regression (GA–MLR). Four molecular descriptors and three experimental descriptors (pressure, temperature and melting point) were selected as the most suitable descriptors for predicting solubility in the three supercritical fluids. The data set consisted of 14 molecules measured at various temperatures and pressures, giving 586 solubility data points. The relationship between the selected descriptors and the solubility data was modeled by linear (multiple linear regression; MLR) and nonlinear (artificial neural network; ANN) methods. The artificial neural network architectures and their parameters were optimized simultaneously. The root mean square errors (RMSE) for supercritical carbon dioxide, ethane and trifluoromethane were 0.56, 0.68 and 0.72, respectively. The performance of the ANN models was also compared with that of the multiple linear regression models, and the results showed the superiority of the ANN over the MLR model. © 2014 Elsevier B.V. All rights reserved.
1. Introduction
Supercritical fluid technology (SFT) finds applications in the chemical, biochemical, pharmaceutical and food processing industries. Supercritical fluids (SCFs) have diffusivities between those of gases and liquids, compressibilities comparable to gases, densities comparable to liquids and negligible surface tension. These properties make them attractive solvents for many industrial applications [1]. Solubility data in SCFs are important for the successful implementation of SFT, but the experimental determination of the solubility of organic solids in SCFs at various temperatures and pressures is expensive. Given these difficulties, a mathematical model that predicts the solubility of new or even not-yet-synthesized compounds saves both time and money. In the mathematical modeling of solubility data in supercritical fluids, the solubility systems can be categorized into three groups: a single solute in a supercritical fluid, mixed solutes in a supercritical fluid, and a single solute in mixed supercritical fluids or in a supercritical fluid plus an organic solvent. Different models have been presented for solubility in supercritical fluids and can be
* Corresponding author. Tel.: +98 841 2227022; fax: +98 841 2227022 (R. Tabaraki).
0378-3812/$ – see front matter © 2014 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.fluid.2014.10.008
categorized into two groups: theoretical models (such as equations of state and semi-empirical equations) and empirical equations (such as density-based equations). Numerous theoretical approaches have been applied to solubility in binary solid–SC fluid systems, such as cubic equations of state, perturbed hard-sphere equations of state, lattice models, Kirkwood–Buff solution theory, Monte Carlo simulation and mean field theory [2]. Equations of state often require properties such as critical temperature, critical pressure and acentric factor that are not available for the solid solutes. These models also require one or more temperature-dependent parameters, which must be obtained from binary solid solubility data [3,4]. The empirical models are based on simple error minimization using the least squares method and, for most of them, there is no need for physicochemical properties [5].
One of the most successful approaches to predicting chemical properties from molecular structural information is the quantitative structure–property/activity relationship. In QSPR, the molecular structure is translated into so-called molecular descriptors using chemical graph theory, information theory, quantum mechanics, etc., and mathematical equations relate the chemical structure to a wide variety of physical, chemical and biological properties [6,7]. QSPR models can be used to predict properties of new compounds. The major steps in constructing a QSPR model are (i) the proper calculation of molecular descriptors that satisfactorily describe the properties of a set of chemical substances, (ii) selection of the best descriptors, (iii) constructing a
mathematical model having the best prediction of property data and (iv) validating the quality and predictivity of the model [8].

Table 1
Molecular structure of organic solids (structure drawings not reproduced).
No.  Molecule
1    1,4-Naphthoquinone
2    2-Aminofluorene
3    5-Aminoindole
4    5-Hydroxyindole
5    Indole 3-carboxylic acid
6    Naphthalene
7    Phenanthrene
8    Oxindole
9    Skatole
10   2-Naphthol
11   5-Methoxyindole
12   Acridine
13   Benzoic acid
14   Indole-3-aldehyde

Genetic algorithms (GA) are optimization tools and randomized search techniques guided by the principles of evolution and natural genetics [9]. They have proved to be very efficient for the feature selection problem. Variables are represented as genes on a chromosome and are generally coded as binary strings. A population of strings is created randomly. In variable selection, each string is a row vector containing as many elements as there are variables. Each element is coded as 1 if the corresponding variable is selected and 0 if it is not. The fitness of a string is equal to the evaluation response, which is based on the predictive ability of the given subset of selected variables. The GA variable selection is designed to select the variables with the lowest prediction error. Through natural selection and the genetic operators, mutation and recombination, chromosomes with better fitness are found. Natural selection ensures that chromosomes with the best fitness propagate into future populations. Mutation allows new areas of the response surface to be explored. GA offers a generational improvement in the fitness of the chromosomes and, after many generations, creates chromosomes containing the optimized variable settings. GA has several advantages: it can escape from local optima on the response surface, it requires no knowledge of or gradient information about the response surface, and it can be employed for a wide variety of optimization problems.
An artificial neural network consists of a large number of processing elements (neurons) and connections between them. A function f(x) maps a set of given input values x to output values y = f(x). A neural network tries to find the best possible approximation of the function f(x).
This approximation is coded in the neurons of the network using weights that are associated with each neuron. The weights of a neural network are learned using an iterative procedure during which examples of correct input–output associations are shown to the network and the weights are modified so that the network starts to mimic this desirable input–output behavior. Learning in a neural network thus means finding an appropriate set of weights. This ability to learn from examples, and on that basis to generalize to new situations, is the most attractive feature of neural networks. One of the most popular learning algorithms is the back propagation algorithm. The architecture of a network used with the back propagation algorithm is the feed forward layered network, in which the processing elements are divided into disjoint subsets called layers (input, hidden and output layers). The input data flow through the network from the input layer through the hidden layer towards the output layer. The number of hidden layers in a
feed forward network must be optimized [10]. Artificial neural networks have been used to represent non-linear functional relationships between variables. The ability of an ANN to learn and generalize the behavior of any complex and non-linear process makes it a powerful modeling tool.
A wide variety of descriptors have been reported for use in QSPR analysis [11]. In most cases, it is more convenient to assume a linear relationship between solubility and descriptors and to fit it by multiple linear regression (MLR) [12–15]. Artificial neural networks (ANN) [13–15], wavelet neural networks (WNN) [12–14] and Bayesian methods [16] are the most commonly used nonlinear models for predicting solubility in supercritical carbon dioxide. Bayesian neural networks can perform the descriptor selection and network architecture optimization automatically [17]. Based on our literature search, there is no report on the application of QSPR models to the prediction of solubility in different supercritical fluids by one set of descriptors. In this work, MLR (linear model) and ANN (nonlinear model) were used to predict the solubility in supercritical carbon dioxide, ethane and trifluoromethane over a wide range of pressures and temperatures by one set of descriptors.

2. Data and methodology
2.1. Data set
The data set consists of 14 organic solids whose structures are given in Table 1. The experimental solubility values at different temperatures and pressures were collected from the following references: (1,4-naphthoquinone, 2-aminofluorene, naphthalene, phenanthrene, 2-naphthol, acridine and benzoic acid) [18,19]; (5-aminoindole, 5-hydroxyindole, indole 3-carboxylic acid, oxindole and indole-3-aldehyde) [20]; (skatole and 5-methoxyindole) [21]. Based on our literature search, these compounds had solubility data in all three supercritical fluids: supercritical carbon dioxide (SC-CO2), supercritical ethane (SC-C2H6) and supercritical trifluoromethane (SC-CHF3). The data set consisted of 586 experimental solubility values at different temperatures and pressures. The data were divided into a training set (SC-CO2: 80; SC-C2H6: 81; SC-CHF3: 44; total = 205 data) and a test set (SC-CO2: 155; SC-C2H6: 148; SC-CHF3: 78; total = 381 data). The test data consisted of two data sets, so that the adequacy of the models was evaluated by both interpolation and extrapolation (molecules within and beyond the samples used for constructing the model) [22]. The internal test set consisted of the same nine molecules as the training set, with data that were randomly withheld from model construction (SC-CO2: 50; SC-C2H6: 52; SC-CHF3: 34 data; total = 136). The external test set comprised five organic solids with molecular structures that were new for the model (SC-CO2: 105; SC-C2H6: 96; SC-CHF3: 44 data; total = 245). The molecular structures of those molecules are given in Table 1 (the last 5 molecules).

2.2. Descriptor calculation
All molecules were drawn in the HyperChem 6 software (version 7.0, Hypercube, Inc.). The molecular structures were then optimized by the semi-empirical AM1 method using the Fletcher–Reeves algorithm until the root mean square gradient was 0.01. The resulting geometry was loaded into the Dragon software [23] to calculate 1497 descriptors in 18 different classes.

2.3. Descriptor selection
The selection of relevant descriptors, which relate the solubility to the molecular structure, is an important step in constructing a predictive model. In this work, the following method was used to select the best descriptors among the 1497 calculated theoretical descriptors using the training sets:
(1) All descriptors with zero or identical values for all the molecules in the training set were eliminated.
(2) The collinearity of the descriptors was calculated. For each pair of descriptors with a pairwise correlation coefficient above 0.9 (R > 0.9), the descriptor that had the larger correlation coefficient with the other descriptors was eliminated. After this step, 187 descriptors remained.
(3) Genetic algorithm–multiple linear regression (GA–MLR) was used to select the most relevant descriptors.
To select the most relevant descriptors with GA, the evolution of the population was simulated. Each individual of the population, defined by a chromosome of binary values as the coding technique, represented a subset of descriptors. The number of genes on each chromosome was equal to the number of descriptors. The population of the first generation was selected randomly. A gene was given the value of one if its corresponding descriptor was included in the subset; otherwise, it was given the value of zero. The GA performs its optimization by variation and selection via the evaluation of the fitness function (h). The fitness function was used to evaluate alternative descriptor subsets, which were finally ordered according to the predictive performance of the related model. The fitness function used was the one proposed by Depczynski et al. [24]. The parameters of GA–MLR are shown in Table 2.

Table 2
Parameters of the genetic algorithm.
Parameter                                                        Value
Population size                                                  30 chromosomes
Number of variables per chromosome in the original population   2–9
Regression method                                                MLR
Elitism                                                          True
Crossover                                                        Multipoint
Probability of crossover                                         50%
Mutation                                                         Multipoint
Probability of mutation                                          1%
Number of runs                                                   2000

Table 3
Calculated values of selected descriptors.
No.  Molecule                   Tm (K)  SIC3  Mor03m  Mor10m  C-026
1    1,4-Naphthoquinone         399     0.79  1.37    0.39    0
2    2-Aminofluorene            404     0.91  1.65    0.44    1
3    5-Aminoindole              405     0.91  0.96    0.49    1
4    5-Hydroxyindole            381     0.97  1.00    0.52    1
5    Indole 3-carboxylic acid   475     0.98  1.08    0.36    0
6    Naphthalene                353     0.67  1.16    0.24    0
7    Phenanthrene               373     0.65  1.76    0.70    0
8    Oxindole                   399     0.94  1.23    0.37    0
9    Skatole                    368     0.92  0.88    0.36    0
10   2-Naphthol                 395     0.92  1.24    0.34    1
11   5-Methoxyindole            327     0.92  1.15    0.59    1
12   Acridine                   381     0.89  1.47    0.50    0
13   Benzoic acid               395     0.85  1.11    0.24    0
14   Indole-3-aldehyde          469     0.97  0.85    0.31    0

Table 4
Correlation coefficients between various descriptors.
Descriptor  Tm    SIC3  Mor03m  Mor10m  C-026
Tm          1
SIC3        0.44  1
Mor03m      0.23  0.53  1
Mor10m      0.35  0.13  0.48    1
C-026       0.24  0.36  0.02    0.34    1

Table 5
RMSE and mean relative error (%RE) of MLR and ANN models.
                 SC-CO2        SC-CHF3       SC-C2H6
                 MLR    ANN    MLR    ANN    MLR    ANN
RMSE Training    0.6    0.4    0.7    0.4    1.0    0.5
%RE Training     10.2   6.6    8.0    4.7    13.0   5.8
RMSE Int. test   0.6    0.4    1.0    0.9    0.8    0.6
%RE Int. test    10.1   7.6    9.3    5.6    12.1   6.4
RMSE Ext. test   0.8    0.7    1.1    0.8    1.0    0.8
%RE Ext. test    9.2    8.7    11.9   8.4    10.8   8.6
Int. test: internal test set; Ext. test: external test set.

Fig. 2. Scatter plot of MLR and ANN models in supercritical trifluoromethane.
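The three selection steps of Section 2.3 can be illustrated with a short Python/NumPy sketch. It is not the authors' MATLAB program. The function names `prefilter` and `ga_mlr` are hypothetical, the fitness is a simple hold-out RMSE used here as a stand-in for the Depczynski et al. function [24], uniform gene swapping stands in for the multipoint crossover of Table 2, and the number of generations is an arbitrary choice. The population size, crossover probability, mutation probability and the 2–9 variables per initial chromosome follow Table 2.

```python
import numpy as np

def prefilter(X, r_max=0.9):
    """Steps (1)-(2): drop constant columns, then thin out collinear pairs."""
    keep = [j for j in range(X.shape[1]) if X[:, j].max() != X[:, j].min()]
    R = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    np.fill_diagonal(R, 0.0)
    alive = np.ones(len(keep), dtype=bool)
    for a in range(len(keep)):
        for b in range(a + 1, len(keep)):
            if alive[a] and alive[b] and R[a, b] > r_max:
                # drop the member of the pair more correlated with the rest
                drop = a if R[a, alive].sum() >= R[b, alive].sum() else b
                alive[drop] = False
    return [k for k, ok in zip(keep, alive) if ok]

def ga_mlr(X, y, pop_size=30, p_cross=0.5, p_mut=0.01, generations=50, seed=0):
    """Step (3): binary-coded GA whose fitness is the error of an MLR model."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows = rng.permutation(n)
    tr, va = rows[: n * 2 // 3], rows[n * 2 // 3:]   # assumed hold-out split

    def error(mask):
        cols = np.flatnonzero(mask)
        if cols.size == 0:
            return np.inf
        A = np.column_stack([np.ones(tr.size), X[np.ix_(tr, cols)]])
        coef, *_ = np.linalg.lstsq(A, y[tr], rcond=None)
        B = np.column_stack([np.ones(va.size), X[np.ix_(va, cols)]])
        return float(np.sqrt(np.mean((B @ coef - y[va]) ** 2)))

    # first generation: 2-9 randomly chosen descriptors per chromosome
    pop = np.zeros((pop_size, p), dtype=bool)
    for chrom in pop:
        k = rng.integers(2, min(9, p) + 1)
        chrom[rng.choice(p, size=k, replace=False)] = True

    for _ in range(generations):
        err = np.array([error(c) for c in pop])
        pop = pop[np.argsort(err)]
        children = [pop[0].copy()]                   # elitism
        while len(children) < pop_size:
            i, j = rng.integers(0, pop_size // 2, size=2)  # parents from better half
            child = pop[i].copy()
            if rng.random() < p_cross:
                swap = rng.random(p) < 0.5           # uniform gene swap (stand-in)
                child[swap] = pop[j][swap]
            child ^= rng.random(p) < p_mut           # bit-flip mutation
            children.append(child)
        pop = np.array(children)

    err = np.array([error(c) for c in pop])
    return np.flatnonzero(pop[np.argmin(err)]), float(err.min())
```

Calling `prefilter(X)` on a descriptor matrix returns the indices of the surviving columns, and `ga_mlr(X[:, kept], y)` returns the positions of the selected descriptors within that reduced matrix together with the hold-out error of the best chromosome.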
2.4. Modeling
Modeling of the negative logarithm of the solubility (−ln y) from the selected descriptors for the three supercritical fluids was performed by linear (multiple linear regression; MLR) and nonlinear (artificial neural network; ANN) methods. The neural networks had sigmoid functions as the hidden-layer transfer function and linear functions as the output transfer function. A back propagation learning algorithm was employed to adjust the weights. ANN models were developed using seven neurons in the input layer, corresponding to the seven selected descriptors. The output layer had one node, which predicts the solubility. In this work, the number of nodes in the hidden layer and the other parameters, except the number of iterations, were optimized simultaneously. In other words, the best value for each variable was not obtained by a "one at a time" optimization method. The number of nodes in the hidden layer was varied from 2 to 11, the learning rate from 0.001 to 0.1 with a step of 0.001 and the momentum from 0.1 to 0.99 with a step of 0.01. ANN models were constructed for all possible combinations of these three variables. The root mean square error (RMSE) for the training and test sets was calculated for each ANN model. Training was stopped when the RMSE for the training set was low and the RMSE for the test set reached a minimum. Finally, the number of iterations was optimized with the optimized variable values.

2.5. Software
All molecules were drawn in the HyperChem 6 software (version 7.0, Hypercube, Inc.). Dragon software was used to calculate molecular descriptors in 18 different classes. The multiple linear regression, genetic algorithm and artificial neural network programs were written by the authors in the MATLAB environment (version 6.1, Mathworks, Inc.).

Table 6
Optimized parameters of ANN models.
Parameter        SC-CO2   SC-CHF3   SC-C2H6
Input neurons    7        7         7
Hidden neurons   3        3         3
Output neurons   1        1         1
Learning rate    0.088    0.077     0.006
Momentum         0.25     0.75      0.95
Iterations       5000     7000      14000

Fig. 1. Scatter plot of MLR and ANN models in supercritical ethane.
Fig. 3. Scatter plot of MLR and ANN models in supercritical carbon dioxide.

3. Results and discussion
3.1. Descriptor selection and MLR model
The GA–MLR technique was performed to select the best general molecular descriptors for the three supercritical fluids. Therefore, the three training sets were pooled into one new training set. The best equation was as follows:
$$-\ln y = -8.224(2.367) - 0.019(0.007)\,T - 0.096(0.007)\,P + 0.038(0.004)\,T_m + 5.727(1.179)\,\mathrm{SIC3} - 1.496(0.422)\,\mathrm{Mor03m} + 4.427(0.697)\,\mathrm{Mor10m} + 1.033(0.220)\,\mathrm{C026} \qquad (1)$$
n = 205; R² = 0.892; std. error = 0.918; F = 231.777
The values of the selected descriptors are shown in Table 3. Table 4 shows that the correlation coefficient for each pair of descriptors was less than 0.55, which means that the selected descriptors were not strongly correlated with one another. Based on the work of Topliss [25], the relation between the number of observations and the number of variables for a chance correlation level Pc < 0.01 and r² ≥ 0.9 can be evaluated by extrapolation. For 187 variables (this work), 208 observations are needed at a chance correlation level of 1%; 205 observations were used in this work. Although GA–MLR was used as the feature selection method, the selected general descriptors were also used to develop MLR models for each supercritical fluid. The best equations were:
$$-\ln y\,(\text{SC-CO}_2) = -6.824(2.862) - 0.025(0.008)\,T - 0.128(0.011)\,P + 0.042(0.005)\,T_m + 4.742(1.439)\,\mathrm{SIC3} - 1.944(0.510)\,\mathrm{Mor03m} + 2.166(0.884)\,\mathrm{Mor10m} + 1.378(0.268)\,\mathrm{C026} \qquad (2)$$
n = 80; R² = 0.949; std. error = 0.665; F = 189.908
$$-\ln y\,(\text{SC-C}_2\text{H}_6) = -12.709(3.887) - 0.011(0.012)\,T - 0.073(0.012)\,P + 0.035(0.007)\,T_m + 8.739(2.187)\,\mathrm{SIC3} - 0.926(0.749)\,\mathrm{Mor03m} + 6.674(1.154)\,\mathrm{Mor10m} + 0.508(0.398)\,\mathrm{C026} \qquad (3)$$
n = 81; R² = 0.887; std. error = 1.009; F = 82.243
$$-\ln y\,(\text{SC-CHF}_3) = -6.991(5.893) - 0.005(0.019)\,T - 0.096(0.013)\,P + 0.034(0.006)\,T_m + 1.635(1.803)\,\mathrm{SIC3} - 0.833(0.896)\,\mathrm{Mor03m} + 3.810(1.303)\,\mathrm{Mor10m} + 1.345(0.358)\,\mathrm{C026} \qquad (4)$$
n = 44; R² = 0.890; std. error = 0.731; F = 41.505
The predictive ability of the MLR models was evaluated by calculating the RMSE and mean relative error for the training and test sets (Table 5). The training set was used for model generation and the test sets for evaluating the predictive ability of the models. The performance of the MLR models in the three supercritical fluids was also evaluated by plotting the calculated versus experimental values of the solubility for the training set, internal test set and external test set (Figs. 1–3).
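As a worked check of how such an equation is applied, the sketch below evaluates the SC-CHF3 model, Eq. (4), for acridine at the three conditions listed for it in Table 7 and then computes the two error measures reported in Table 5. Several things here are assumptions rather than statements from the paper: the intercept and the T, P and Mor03m coefficients are taken as negative, the Mor03m value of acridine is taken as −1.47 (the magnitude is from Table 3, the sign is assumed because it reproduces the tabulated MLR values), and the mean relative error is taken as the mean of |calc − exp| / |exp| in percent, since the paper does not spell out its definition. All names in the code are hypothetical.

```python
import math

# Eq. (4), SC-CHF3. Signs are assumptions (see the lead-in).
EQ4 = {"const": -6.991, "T": -0.005, "P": -0.096, "Tm": 0.034,
       "SIC3": 1.635, "Mor03m": -0.833, "Mor10m": 3.810, "C026": 1.345}

# Acridine descriptors from Table 3; the sign of Mor03m is assumed negative.
ACRIDINE = {"Tm": 381, "SIC3": 0.89, "Mor03m": -1.47, "Mor10m": 0.50, "C026": 0}

def neg_ln_y(coef, T, P, desc):
    """-ln y from a linear descriptor model (T in K, P in MPa)."""
    value = coef["const"] + coef["T"] * T + coef["P"] * P
    return value + sum(coef[name] * desc[name] for name in desc)

def rmse(calc, exp):
    """Root mean square error between calculated and experimental values."""
    return math.sqrt(sum((c - e) ** 2 for c, e in zip(calc, exp)) / len(exp))

def mean_relative_error(calc, exp):
    """Mean of |calc - exp| / |exp| in percent (assumed definition of %RE)."""
    return 100.0 * sum(abs(c - e) / abs(e) for c, e in zip(calc, exp)) / len(exp)

# (T, P, experimental -ln y, MLR value listed in Table 7) for acridine in SC-CHF3
ACRIDINE_CHF3 = [(318, 13.45, 7.48, 7.67), (318, 21.7, 7.23, 6.88), (328, 21.7, 6.99, 6.83)]

predicted = [neg_ln_y(EQ4, T, P, ACRIDINE) for T, P, _, _ in ACRIDINE_CHF3]
experimental = [row[2] for row in ACRIDINE_CHF3]
tabulated_mlr = [row[3] for row in ACRIDINE_CHF3]
```

Under these sign assumptions the three predictions agree with the MLR column of Table 7 to within about 0.01, and `rmse(predicted, experimental)` and `mean_relative_error(predicted, experimental)` give the two error measures for these three points only.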
Table 7
Some experimental and calculated values of −ln(y) using MLR and ANN models for the external test set.
Compound            T (K)  P (MPa)  Exp. −ln(y)  Calc. (MLR)  Abs. error (MLR)  Calc. (ANN)  Abs. error (ANN)
SC-C2H6
Acridine            308    6.17     9.01         9.56         0.55              8.83         0.18
                    328    8.23     8.25         9.20         0.95              8.11         0.14
                    343    8.22     8.63         9.04         0.41              8.61         0.02
Benzoic acid        308    11.10    6.94         7.28         0.34              6.93         0.01
                    318    20.15    6.17         6.51         0.34              6.20         0.02
                    343    36.35    4.64         5.06         0.42              4.44         0.20
2-Naphthol          308    7.35     9.43         9.46         0.03              9.66         0.23
                    328    6.60     10.56        9.30         1.26              10.83        0.27
                    343    9.30     8.54         8.95         0.41              9.09         0.55
5-Methoxyindole     308    5.85     8.29         8.74         0.45              7.80         0.49
                    308    16.30    7.25         7.98         0.73              7.25         0.00
                    308    19.2     7.16         7.76         0.60              7.24         0.08
Indole-3-aldehyde   308    6.59     11.41        11.51        0.10              11.35        0.06
                    308    8.35     11.30        11.38        0.08              10.43        0.87
                    308    13.9     11.11        10.97        0.14              10.05        1.06
SC-CHF3
Acridine            318    13.45    7.48         7.67         0.19              7.48         0.00
                    318    21.7     7.23         6.88         0.35              7.23         0.00
                    328    21.7     6.99         6.83         0.16              7.14         0.14
Benzoic acid        318    8.32     8.11         7.28         0.83              8.51         0.40
                    318    13.1     7.22         6.82         0.40              6.93         0.29
                    328    13.1     6.98         6.77         0.21              7.08         0.10
2-Naphthol          328    13.55    8.03         8.68         0.65              7.58         0.45
                    328    24.1     7.47         7.67         0.20              7.73         0.26
                    343    24.1     6.87         7.59         0.72              7.08         0.21
5-Methoxyindole     308    6.69     7.02         8.00         0.98              6.68         0.34
                    308    7.8      6.56         7.89         1.33              6.69         0.13
                    308    9.65     6.16         7.72         1.56              6.72         0.56
Indole-3-aldehyde   308    10.66    10.55        9.86         0.69              10.10        0.45
                    308    12.7     10.41        9.67         0.74              9.95         0.46
                    308    15.01    10.36        9.45         0.91              9.58         0.78
SC-CO2
Acridine            318    20       6.79         6.96         0.17              6.76         0.03
                    328    12.25    8.55         7.70         0.85              8.41         0.14
                    343    19.97    6.73         6.34         0.39              6.70         0.03
Benzoic acid        318    12       6.77         7.13         0.36              6.73         0.04
                    343    15.12    6.48         6.10         0.38              6.52         0.04
                    308    15.1     6.35         6.98         0.63              6.46         0.10
2-Naphthol          328    12       8.57         9.05         0.48              8.60         0.03
                    343    11.15    9.25         8.79         0.46              9.42         0.17
                    343    20.1     6.95         7.65         0.70              6.99         0.04
5-Methoxyindole     308    10.9     6.73         7.18         0.45              7.15         0.42
                    308    13.7     6.50         6.82         0.32              6.53         0.03
                    308    15.9     6.47         6.54         0.07              6.15         0.32
Indole-3-aldehyde   308    8.72     11.70        11.14        0.56              10.50        1.20
                    308    10.76    11.55        10.88        0.67              10.33        1.22

3.2. Interpretation of the selected descriptors
By interpreting the descriptors contained in the model, it is possible to gain useful chemical insights into the chemical property. The results show that four calculated and three experimental descriptors are the most suitable ones. The calculated descriptors were one topological descriptor (SIC3: structural information content, neighborhood symmetry of order 3), one atom-centered fragment descriptor (C-026: R–CX–R) and two 3D-MoRSE descriptors (Mor03m: 3D-MoRSE signal 3, weighted by atomic masses; Mor10m: 3D-MoRSE signal 10, weighted by atomic masses). The selection of pressure, temperature and the melting temperature of the solids as the experimental descriptors is important: the density of the supercritical fluid, which is the key parameter for the solubility of different compounds, depends on both the temperature and the pressure of the supercritical fluid. Topological descriptors such as SIC3 are based on a graph representation of the molecule. They are numerical quantifiers of molecular topology obtained by the application of algebraic
operators to matrices representing molecular graphs, and their values are independent of vertex numbering or labeling. They can be sensitive to one or more structural features of the molecule, such as size, shape, symmetry, branching and cyclicity, and can also encode chemical information concerning atom type and bond multiplicity [11]. The next descriptor, C-026, is one of the atom-centered fragment descriptors. C-026 corresponds to R–CX–R and represents the number of substituent groups bonded to the benzene ring, excluding those attached through a carbon–carbon bond, i.e., C(benzene ring)–C(substituent group) [26]. The third and fourth descriptors are Mor03m (3D-MoRSE signal 3, weighted by atomic masses) and Mor10m (3D-MoRSE signal 10, weighted by atomic masses). The 3D-MoRSE descriptors (3D molecule representation of structures based on electron diffraction) are computed from the 3D atomic coordinates using a generalized scattering function of the kind used in electron diffraction studies, and were originally applied to infrared spectra simulation [11].
A comparison between the descriptors and models selected in this work and those of previous works [12–16] was performed. In most of these works, temperature and pressure were used because the density of the supercritical fluid is a function of these parameters and has a significant effect on solubility. Only in this work and in that of Hemmateenejad et al. [15] was a large number of descriptors calculated. Different feature selection methods, such as stepwise MLR [12–14], the combined data splitting feature selection strategy (CDFS) [15], MLR with expectation maximization (MLREM) [16] and GA–MLR (this study), were used to select the best descriptors. Both linear and nonlinear models were used in all of these studies. In this work and in previous works [12–16], topological, geometrical, charge, atom-centered fragment, functional group, WHIM and 3D-MoRSE descriptors were the most frequently selected descriptors.

3.3. ANN model
In order to improve the predictive ability, a nonlinear ANN model was constructed. The best MLR model had seven descriptors. ANN models were developed using seven neurons in the input layer, corresponding to these seven descriptors. The mean relative error and RMSE of the ANN models and their optimized parameters are shown in Tables 5 and 6, respectively. The capability of the models was also evaluated by predicting the solubility of five organic solids whose data were not used in any of the previous data sets. These organic solids were 2-naphthol, acridine, benzoic acid, indole-3-aldehyde and 5-methoxyindole. Their structures are given in Table 1 and their calculated descriptors are presented in Table 3. Some of the prediction results of the models for the external validation set are given in Table 7. The performance of the ANN models in the three supercritical fluids was evaluated by plotting the calculated versus experimental values of the solubility for the training and test sets (Figs. 1–3).
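The kind of network described in Sections 2.4 and 3.3 (sigmoid hidden layer, one linear output node, back propagation with a learning rate and a momentum term) can be sketched in Python/NumPy as follows. This is an illustration under stated assumptions, not the authors' MATLAB program: the full-batch weight update, the weight initialization, the function names `train_ann`, `full_grid` and `select_best`, and the use of inputs that have already been scaled to comparable ranges are all choices made here. The search ranges are the ones quoted in Section 2.4.

```python
import itertools
import numpy as np

def train_ann(X, y, hidden, lr, momentum, iterations, seed=0):
    """Train an inputs-hidden-1 feed forward network; return a predict function."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W1, b1 = rng.normal(0, 0.5, (p, hidden)), np.zeros(hidden)
    W2, b2 = rng.normal(0, 0.5, hidden), 0.0
    vel = [np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(W2), 0.0]
    for _ in range(iterations):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))    # sigmoid hidden layer
        err = H @ W2 + b2 - y                        # linear output minus target
        dH = np.outer(err, W2) * H * (1.0 - H)       # error propagated back
        grads = [X.T @ dH / n, dH.mean(axis=0), H.T @ err / n, err.mean()]
        # momentum update: keep part of the previous step, add the new gradient step
        vel = [momentum * v - lr * g for v, g in zip(vel, grads)]
        W1, b1, W2, b2 = W1 + vel[0], b1 + vel[1], W2 + vel[2], b2 + vel[3]

    def predict(Xn):
        return (1.0 / (1.0 + np.exp(-(Xn @ W1 + b1)))) @ W2 + b2
    return predict

def rmse(pred, y):
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# Search ranges quoted in Section 2.4
HIDDEN = range(2, 12)                          # 2 ... 11 hidden nodes
RATES = [k / 1000 for k in range(1, 101)]      # 0.001 ... 0.100
MOMENTA = [k / 100 for k in range(10, 100)]    # 0.10 ... 0.99
GRID_SIZE = len(HIDDEN) * len(RATES) * len(MOMENTA)

def full_grid():
    """All (hidden, learning rate, momentum) combinations, generated lazily."""
    return itertools.product(HIDDEN, RATES, MOMENTA)

def select_best(settings, X_tr, y_tr, X_te, y_te, iterations=500):
    """Return (test RMSE, setting) for the setting with the lowest test RMSE."""
    scored = []
    for hidden, lr, momentum in settings:
        model = train_ann(X_tr, y_tr, hidden, lr, momentum, iterations)
        scored.append((rmse(model(X_te), y_te), (hidden, lr, momentum)))
    return min(scored)
```

With the quoted step sizes the grid holds 10 × 100 × 90 = 90,000 combinations, which is why the settings are searched jointly rather than one variable at a time. In this simple sketch, combinations with a large learning rate and a momentum close to 1 may not converge, so a practical search would pass `select_best` a subset of `full_grid()` or add a divergence check.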
4. Conclusions
MLR and ANN models were developed for predicting the solubility of 14 organic solids in supercritical carbon dioxide, ethane and trifluoromethane over a wide range of pressures and temperatures. Some crucial implications of this study are listed below:
(a) This study is the first report on the application of QSPR models to the prediction of solubility in different supercritical fluids by one set of descriptors.
(b) The capabilities of the models were also evaluated by predicting the solubility of five organic solids with molecular structures that were new for the models.
(c) The performance of the ANN model was compared with that of the MLR model. The results in Table 5 and Figs. 1–3 indicate the superiority of the ANN model over the MLR models. It can be concluded that the relationship between the molecular descriptors and the solubility values in supercritical fluids is nonlinear, as mentioned in previous works [12–16]. Similar results were obtained for supercritical C2H6 and CHF3.
(d) The root mean square errors (RMSE) for supercritical carbon dioxide, ethane and trifluoromethane were 0.56, 0.68 and 0.72, respectively.
References
[1] A. Akgerman, G. Madras, Fundamentals of solids extraction by supercritical fluids, in: E. Kiran, J.M.H.L. Sengers (Eds.), Supercritical Fluids: Fundamentals and Applications, Kluwer Academic Publishers, Dordrecht, 1994.
[2] K.P. Johnston, D.G. Peck, S. Kim, Ind. Eng. Chem. Res. 29 (1989) 1115–1125.
[3] G.S. Foster, J.S.L. Gurdial, K.D. Liong, S.S.T. Tilly, H. Ting, Singh, J.H. Lee, Ind. Eng. Chem. Res. 30 (1991) 1955–1964.
[4] J. Mendez-Santiago, A.S. Teja, Fluid Phase Equilibria 158–160 (1999) 501–510.
[5] A. Jouyban, H.K. Chan, N.R. Foster, J. Supercrit. Fluids 24 (2002) 19–35.
[6] M. Karelson, V.S. Lobanov, Chem. Rev. 96 (1996) 1027–1044.
[7] T. Le, V.C. Epa, F.R. Burden, D.A. Winkler, Chem. Rev. 112 (2012) 2889–2919.
[8] H.L. Engelhardt, P.C. Jurs, J. Chem. Inf. Comput. Sci. 37 (1997) 478–484.
[9] D.E. Goldberg, Genetic Algorithms: Search, Optimization and Machine Learning, Addison-Wesley, New York, 1989.
[10] J. Zupan, J. Gasteiger, Neural Networks for Chemists: An Introduction, VCH, Weinheim, 1993.
[11] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, John Wiley & Sons, 2008.
[12] T. Khayamian, M. Esteki, J. Supercrit. Fluids 32 (2004) 73.
[13] R. Tabaraki, T. Khayamian, A.A. Ensafi, Dyes Pigments 73 (2007) 230.
[14] R. Tabaraki, T. Khayamian, A.A. Ensafi, J. Mol. Graph. Model. 25 (2006) 46.
[15] B. Hemmateenejad, M. Shamsipur, R. Miri, M. Elyasi, F. Foroghinia, H. Sharghi, Anal. Chim. Acta 610 (2008) 25.
[16] A. Tarasova, F. Burden, J. Gasteiger, D.A. Winkler, J. Mol. Graph. Model. 28 (2010) 593–597.
[17] F.R. Burden, D.A. Winkler, QSAR Comb. Sci. 28 (2009) 645–653.
[18] W.J. Schmitt, R.C. Reid, J. Chem. Eng. Data 31 (1986) 204–212.
[19] M. McHugh, M.E. Paulaitis, J. Chem. Eng. Data 25 (1980) 326–329.
[20] K. Nakatani, J. Supercrit. Fluids 2 (1989) 9–14.
[21] S. Sako, K. Shibata, K. Ohgaki, T. Katayama, J. Supercrit. Fluids 2 (1989) 3–8.
[22] H. Martens, T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
[23] R. Todeschini, Milano Chemometrics and QSPR Group. http://www.disatunimib.it/vhml/.
[24] U. Depczynski, V.J. Frost, K. Molt, Anal. Chim. Acta 420 (2000) 217–227.
[25] J.G. Topliss, R.P. Edwards, J. Med. Chem. 22 (1979) 1238.
[26] B. Rasulev, H. Kusic, D. Leszczynska, J. Leszczynski, N. Koprivana, J. Environ. Monit. 12 (2010) 1037–1044.