Chemical Engineering Science 97 (2013) 186–197
Contents lists available at SciVerse ScienceDirect
Chemical Engineering Science journal homepage: www.elsevier.com/locate/ces
Evaluation and refinement of the property prediction stage of the targeted QSPR method for 15 constant properties and 80 groups of compounds Mordechai Shacham a,n, Michael Elly a, Inga Paster a, Neima Brauner b a b
Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel School of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel
H I G H L I G H T S
We predicted 15 constant properties for 470 compounds with a Targeted QSPR method. Most compound/property combinations were predicted on data uncertainty level. Common causes of excessive prediction errors were identified. To eliminate such errors improvements of the method are suggested and demonstrated. The results of this study are applicable to other property prediction techniques.
art ic l e i nf o
a b s t r a c t
Article history: Received 26 December 2012 Received in revised form 11 March 2013 Accepted 5 April 2013 Available online 18 April 2013
The dominant descriptor version of the targeted QSPR method (TQSPR1, [Shacham and Brauner, Chem. Eng. Sci., 66, 2606, 2011]) is used to predict 15 constant properties for 80 groups of compounds in order to assess the range of applicability of the prediction technique and to identify frequent causes of excessive prediction errors. It is demonstrated that the TQSPR1 method can model the majority of the properties within the data uncertainty level for most groups of compounds that were included in this study. It is also shown that the accuracy of the TQSPR1 predictions is comparable or higher than the accuracy of predictions obtained by commonly used group contribution methods. Frequent causes of excessive prediction errors are identified and ways for improvement of the prediction accuracy in such cases are proposed and demonstrated. The results reported and the suggestions for improvements are applicable to other property prediction techniques, as well, and are not limited to the TQSPR1 method. The results of the study enable improving the consistency of the physical property and descriptor databases for increasing the robustness of QSPRs that can be derived for a variety of properties and compounds. & 2013 Elsevier Ltd. All rights reserved.
Keywords: Computational chemistry Parameter identification Systems engineering Mathematical modeling Property prediction QSPR
1. Introduction Pure-compound property data are widely used in process design, simulation and optimization, environmental impact assessment, hazard and operability analysis and additional diverse areas as chemistry and chemical engineering. The amount of measured property data is rather limited consequently there is a growing need to predict properties. Current methods used to predict physical and thermodynamic properties can be classified into “group contribution”
n
Corresponding author. Tel.: +972 8 6461481; fax: +972 8 6472916. E-mail addresses:
[email protected],
[email protected] (M. Shacham). 0009-2509/$ - see front matter & 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.ces.2013.04.008
(GC) methods (see, for example, Constantinou and Gani, 1994; Poling et al., 2001), “asymptotic behavior” correlations (Marano and Holder, 1997) and Quantitative Structure Property Relationships (QSPRs, Dearden, 2003). There are commercial software packages such as the Dorthmund Data Bank (DDBST, 2011 release, http://www.ddbst. de) and CRANIUM (Molecular Knowledge Systems, Inc., http://www. molecularknowledge.com/ ) that are able to predict several properties of chemical compounds based on the molecular structure. Most of the prediction techniques used in these programs are based on “group contribution”. However the applicability of the group contribution techniques is limited to the properties and the particular atom groups for which the values of the “contributions” are available. The TQSPR method (Targeted QSPR, Brauner et al., 2006) attempts to remove the above limitation of the group contribution
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
methods as the prediction of a property for a “target compound” can be carried out using only molecular structure data and property data for molecules “similar” to the target compound. This method has undergone so far two phases of evaluation and refinement. Kahrs et al. (2008) carried out an evaluation of the TQSPR method using a database of 1630 molecular descriptors for 259 hydrocarbons by testing the prediction of the critical temperature (TC). The results of that study lead to the development of the single (dominant) descriptor version of the TQSPR method (TQSPR1, Shacham et al., 2007). Various considerations involving the training set selection for developing the TQSPR1 model were analyzed (Shacham and Brauner, 2011) using a database of 3224 molecular descriptors. The analysis involved prediction of 33 properties for a particular compound, n-hexyl mercaptan. Based on the results of that study we have developed a reduced descriptor database for improving the training set selection (RDBS) and a program that can predict various properties for any target compound. Obviously, the prediction accuracy depends on the availability of similar molecular structures and their physical properties in the database. However, the TQSPR1 method offers reliable measures to estimate the prediction accuracy for the unknown property values of the target compound. In this paper we report the results of the evaluation of this program by predicting 15 properties for 471 compounds. Common reasons for excessive prediction errors are analyzed and further improvements of the method for reducing prediction errors are proposed. The reasons for excessive prediction errors and the solutions offered are not limited to the TQSPR1 method, but are relevant and can be helpful to other property prediction methods also.
2. The dominant descriptor targeted QSPR method (TQSPR1) The Dominant Descriptor Targeted QSPR (TQSPR1) method was described in detail by Shacham and Brauner (2011) and it is only briefly reviewed here. The first stage of the method involves identification of a training set structurally related to the target compound. The similarity between the target compound (the compound for which the property needs to be predicted) and a potential predictive compound is measured by the partial correlation coefficient between their molecular-descriptor vectors. The training set is established by selecting the p predictive compounds (usually p¼ 10) with the highest correlation with the target compound for which experimental property values yi are available. The selected training set is used for the development of a TQSPR1 model for a particular property of the target compound. A linear structure–property relation is assumed of the form y ¼ β 0 þ β 1 ζD þ ε
ð1Þ
where y is a p-dimensional vector of the respective property values, ζD is a p-dimensional vector of the (dominant) molecular descriptor (to be selected via a stepwise regression algorithm), β0 ; β1 are the corresponding model parameters to be estimated, and ε is a p-dimensional vector of random errors. To identify the dominant descriptor, we examine the partial correlation coefficient between the vector of (target) property value of the compounds included in the training set (i.e., y) and the vector of descriptor values for these compounds, for all descriptors available in the database. These correlation coefficients will be further called descriptor–property (D–P) correlation coefficients for the particular property. The dominant descriptor, ζD, is the descriptor that has the highest D–P correlation coefficient. The TQSPR1 so-obtained can be subsequently employed for estimating the property value for the target compound and for other compounds in the similarity group that do not have measured
187
data, by applying the TQSPR1 model, whereby y~ t ¼ β0 þ β1 ζ Dt
ð2Þ
where ζ Dt is the dominant-descriptor value of the targeted compound and y~ t is estimated property value. In recent years there has been considerable progress in development of statistical metrics for external validation of QSPR models (see, for example Roy et al., 2012, 2013). These metrics require the use of a set of compounds (training set) for derivation of the regression model (QSPR) and a different set of compounds (validation set) to validate the model. These metrics are useful for judging QSPR models where the same regression model is used eventually for predicting a property value for many different compounds. In contrast, the TQSPR method requires derivation of a unique regression model for a property of each target compound, using a unique training set for the particular target compound. Therefore, metrics intended for the QSPR method are not suitable for representing the accuracy of the TQSPR model. Yet, there are several measures associated with the training set, which can be used for assessing the consistency of the predicted target property values for the members of the training set and also for the target compound (Shacham and Brauner, 2011). The main measure used for assessing the quality of the fit of the TQSPR1 model for the members of the training set is the Training Set Average Error (TSAE): 100 p TSAEð%Þ ¼ ð3Þ ∑ y −ðβ0 þ β1 ζ t Þ =yi p i¼1 i The advantage of TSAE over statistical measures (such as variance or standard deviation) is that it can be compared with the uncertainty values of the reported data (Ui) or the maximal uncertainty value (Umax) of the training set members. A consistent set of property values for the training set should yield TSAE⪡Umax. If no reported property value (yt) is available for the target compound, the TSAE can be used as an estimate for its prediction error (δy). Otherwise, the residual value (rt) can be used for reporting the prediction error for the target compound property value: r t ¼ ðyt −y~ t Þ
ð4Þ
where yt is the reported property value for the target compound. The corresponding relative prediction error is δyð%Þ ¼ 100jðyt −y~ t Þ=yt j
ð5Þ
If δyoUt, the predicted property value for the target compound is considered to be consistent with the property values of the training set members. Otherwise, the reasons for the excessive prediction error need to be detected.
3. Methodology The database used in this study contains physical property data for 1798 compounds. Included in this database are numerical values and data uncertainties (Ui) for 34 properties (e.g., critical properties, normal melting and boiling temperatures, heat of formation, flammability limits, etc.). All the property data is from the DIPPR database (Rowley et al., 2010). The database contains also 3224 molecular descriptors generated by the Dragon, version 5.5 software (DRAGON is copyrighted by TALETE srl, http://www. talete.mi.it) from minimized 3D molecular structures that were obtained from Rowley (2010). A Visual Basic program that uses the TQSPR1 method was developed. The program can operate in batch mode and attempts to predict all the properties for all the compounds in the database. Most of the computations reported were carried out using this program. For the examples presented here, some additional details
188
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
were computed with a special version of the SROV (MATLAB) program of Shacham and Brauner (2003), which was revised in order to fit the needs of the TQSPR1 algorithm. For identification of the training sets, a subset of 294 selected (mostly 1D) descriptors from the “atom centered fragment” and “functional group count” categories were used. This subset was selected based on its ability to identify compounds belonging to the target compound’s homologous series (if available) and discriminate between compounds according to the number and location of branches and double bonds (Paster, 2012). Some of the results were compared with predictions obtained by various GC methods. The implementations of the GC methods in the DDBST package were used for the predictions.
4. Analysis of the TQSPR1 prediction errors for a set of 471 compounds and 15 properties. The selected 15 constant properties included in this study from the DIPPR database are listed in Table 1. Included in Table 1 are the symbols of the properties (as defined by DIPPR) and short descriptions of the properties and their units. Out of the 34 properties, 19 were excluded for several reasons. Some were excluded because they are categorized as “defined”, namely, they are calculated from other properties and/or from the molecular structure (i.e., the molecular weight, or critical compressibility factor, which is calculated based on the critical properties). Excluded are also properties for which most of the values in the database are predicted and/or are associated with very high uncertainties. The 471 compounds included in this study could be associated with certain groups (mostly homologous series). Fifty, out of the 84 groups included, are listed in Table 2. The application of the TQSPR1 method on these compounds enabled the detection of the reasons for excessive prediction error by analyzing the consistency of the experimental/predicted values for homologous series, as well as compact, tabular presentation of the results. The predictions obtained for 15 properties by applying the TQSPR1 method on the 50 groups are summarized in Table 2. The results shown for every group–property combination include the average relative prediction error (δyt ), number of compounds for which the prediction error is o 5% (nl) and the number of components in the group for which the property value is available (nt), using the format: δyt ðnl =nt Þ. For example, for the critical volume (VC) of the Aldehydes group (row 50 in Table 2) the reported prediction reads 2.6 (10/13), namely, the average prediction error is 2.6%, the number of compounds for which VC data are available is 13 and for 10 compounds the prediction error is o5%.
Compound/property combinations for which the prediction error is4 6% are marked in bold numerals. The nt values vary in the range of 1≤nt≤32. There is only one compound from the trans-4alkene series (row 3 in Table 2) and there are 32 compounds from the n-alkane series (not shown). This high variability enables comparing the prediction accuracy of the TQSPR1 method for groups which are widely represented in the property database with groups that are sparsely represented. The compound groups are sorted according to increasing value of the median of the average prediction error. Thus, the groups with higher prediction errors are those shown toward the bottom of Table 2. The properties from left to right are sorted according to increasing value of the median of the average prediction error. Thus, the most accurate results are obtained for the refractive index, followed by the liquid and vapor properties NBP, LVOL, ENT and the critical properties. The least accurate results are obtained for the solid properties (melting point, and heat of fusion) and entropy and enthalpy of formation values at the standard state. In the following some of the causes of the excessive prediction errors are discussed.
5. Common reasons of excessive prediction errors 5.1. Prediction errors for low carbon number compounds Some of the reasons of excessive prediction errors will be demonstrated by developing a TQSPR1 model for the normal boiling temperature (Tb) for the first 11 compounds of the 1-alkene homologous series using 1-heptene (number of carbon atoms, nC ¼7) or ethylene (nC ¼2) as target compounds. Data and results of the TQSPR1 prediction of Tb for the 1-alkene series, with 1-heptene as target compound, are shown in Table 3. The training set consists of ten members of the 1-alkene series, which are the closest to 1-heptene in terms of nC. The Tb values and their uncertainties (from the DIPPR database) for the target compound and the members of the training set, are shown in Table 3. The uncertainties for most compounds are o1%, for propylene, 1-hexene and 1-heptene it is o0.2%. The DIPPR Tb data are depicted versus nC in Fig. 1, showing that Tb changes nonlinearly with nC. The rate of the change is the highest for low nC compounds and diminishes to zero for nC-∞ (as Tb converges to a finite value at the limit of nC-∞). To fit a TQSPR1 model to the Tb data, the descriptor with the highest correlation with the training set data is sought. From among the 3224 descriptors, the one that has the highest correlation with Tb is the VEA1 descriptor. This descriptor belongs to the 2D “eigenvaluebased” indices category and it is calculated from eigenvector
Table 1 Constant properties included in this study from the DIPPR database. No.
Symbol
Property description
Units
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ENT FP HCOM HFOR HFUS HSTD HSUB LVOL MP NBP PC RI SSTD TC VC
Absolute Entropy of Ideal Gas at 298.15 K and 100,000 Pa Flash point Net enthalpy of combustion standard state (298.15 K) Enthalpy of formation of ideal gas at 298.15 K and 100,000 Pa Enthalpy of fusion at melting point Enthalpy of formation in standard state at 298.15 K and 100,000 Pa Heat of sublimation at the triple point Liquid molar volume at 298.15 K Melting point (1 atm) Normal boiling point (1 atm) Critical pressure Refractive index at 298.15 K Absolute entropy in standard state at 298.15 K and 100,000 Pa Critical temperature Critical volume
J/kmol K K J/kmol J/kmol J/kmol J/kmol J/kmol m\widehat3/kmol K K Pa – J/kmol K K m3/kmol
Table 2 Partial prediction summary table for 15 properties and 471 compounds. No. 1 2
3 4
5
14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29 30 31
32 33 34 35 36 37
LVOL
ENT
TC
VC
PC
FP
HCOM
SSTD
0 (3/3) 0.1 (2/2)
0.1 (3/3) 0.4 (2/2)
0.2 (3/3) 0.4 (2/2)
0.5 (3/3) 0.2 (2/2)
0.1 (3/3) 0.4 (2/2)
0.7 (3/3) 0.4 (2/2)
0.2 (3/3) 1.4 (2/2)
0.1 (3/3) 0.8 (2/2)
0.4 (3/3) 0.2 (2/2)
0.4 (3/3) 0.4 (2/2)
0.6 (3/3) 0.3 (2/2)
0.5 (3/3) 0.9 (2/2)
0.1 (1/1) 0.2 (2/2)
0.2 (1/1) 0.3 (2/2)
0.4 (1/1) 0.5 (2/2)
0.9 (1/1) 0.9 (2/2)
0.2 (1/1) 0.4 (2/2)
0.1 (1/1) 0.7 (2/2)
0.2 (1/1) 0.6 (2/2)
0.5 (1/1) 0.9 (2/2)
0.4 (1/1) 0.4 (2/2)
0 (1/1) 0.4 (2/2)
0.9 (1/1) 0.9 (2/2)
0.1 (1/1)
0.7 (1/1)
0.7 (1/1)
1.7 (1/1)
0.6 (1/1)
0.6 (1/1)
0.8 (1/1)
0.1 (1/1)
0.2 (1/1)
0.2 (1/1)
1.9 (1/1)
0 0.1 0.1 0 0.1 0.2 0 0.6
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
0.4 0.6 0.4 0.1 0.2 0.8 0.4 0.6
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
0.1 0.2 0.3 0.2 0.3 0.5 0.2 1.9
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
0.8 0.5 1.3 0.5 0.1 0.3 0.4 0.2
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
0.3 0.4 0.2 0.3 0.3 0.6 0.3 1.1
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
2 1.1 0.4 0.9 2.5 1.1 1.3 0.9
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
2.2 2.5 3.7 0.7 1 1.4 2.2 1.6
(4/6) (2/2) (2/3) (3/3) (5/5) (4/4) (2/2) (1/1)
0.4 0.7 0.9 0.3 0.2 0.4 1 0.2
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
1.1 0.3 1.2 0 1.1 6 0.4 0.4
(6/6) (2/2) (3/3) (3/3) (5/5) (2/4) (2/2) (1/1)
0.9 2.5 1 0.7 0.8 1 1.5 0.7
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
0.8 0.8 0.9 2.6 1.6 1.6 1.4 1.8
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
0.2 0.9 0.1 0.1 0.1 0.3
(3/3) (4/4) (3/3) (6/6) (10/10) (2/2)
1.6 1.5 0.7 0.3 0.6 0.4
(3/3) (4/4) (3/3) (6/6) (10/10) (2/2)
0.5 1 0.4 0.3 0.8 1.5
(3/3) (4/4) (3/3) (6/6) (9/10) (2/2)
1.6 4.6 0.9 1.2 0.7 0.2
(3/3) (2/4) (3/3) (6/6) (10/10) (2/2)
0.8 1.7 1.2 0.3 0.5 1.4
(3/3) (4/4) (3/3) (6/6) (10/10) (2/2)
0.8 1.4 0.3 0.4 1.1 1.7
(3/3) (4/4) (3/3) (6/6) (10/10) (2/2)
1.1 5.9 1.9 1.2 0.6 1.2
(3/3) (3/4) (3/3) (6/6) (10/10) (2/2)
1.7 0.7 2.7 0.4 3.7 0.7
(3/3) (4/4) (3/3) (6/6) (9/10) (2/2)
0.6 0.4 0.2 0.4 0.4 0.4
(3/3) (4/4) (3/3) (6/6) (10/10) (2/2)
1.3 1.4 0.4 0.9 5.6 0.3
(3/3) (4/4) (3/3) (6/6) (7/10) (2/2)
1.4 1.6 0.9 0.5 0.4 0.2
(3/3) (4/4) (3/3) (6/6) (10/10) (2/2)
1 1.5 2.3 2.5 1.3 3.2
(3/3) 14.3 (0/3) (4/4) 1.2 (4/4) (3/3) 9.2 (0/2) (4/6) 6.6 (3/6) (10/10) 5.5 (5/10) (2/2) 6.5 (1/2)
0.1 0.2 0.2 0 0.2 0.2 0.1 0.2 0.2 0.2 0.1 0.6
(4/4) (5/5) (2/2) (6/6) (3/3) (11/11) (3/3) (5/5) (10/10) (1/1) (4/4) (2/2)
0.4 0.5 0 0.3 0.5 0.4 0.4 0.4 1.7 0.2 0.6 1.6
(4/4) (5/5) (2/2) (6/6) (3/3) (11/11) (3/3) (5/5) (9/10) (1/1) (4/4) (2/2)
0.5 0.3 1 0.3 0.7 0.6 0.6 1 0.9 0.6 1 2.6
(4/4) (5/5) (2/2) (6/6) (3/3) (11/11) (3/3) (5/5) (10/10) (1/1) (4/4) (2/2)
0.4 1.9 1.5 0.8 0.7 0.2 2.3 2.1 0.5 0.9 1 0.8
(4/4) (5/5) (2/2) (6/6) (3/3) (11/11) (3/3) (5/5) (10/10) (1/1) (4/4) (2/2)
0.5 0.2 0.7 0.2 1 0.3 0.1 1 1.4 0.7 0.5 1.7
(4/4) (5/5) (2/2) (6/6) (3/3) (11/11) (3/3) (5/5) (10/10) (1/1) (4/4) (2/2)
1.1 1.5 4.7 3 0.8 0.4 1.3 1.4 4.5 3.9 3.1 0.6
(4/4) (4/5) (2/2) (4/6) (3/3) (11/11) (3/3) (5/5) (8/10) (1/1) (2/4) (2/2)
1.3 1.1 1.3 2.3 2.1 0.3 1 0.8 1.2 0.2 1.9 1.5
(4/4) (5/5) (2/2) (4/6) (3/3) (11/11) (3/3) (5/5) (10/10) (1/1) (4/4) (2/2)
2.6 1 1.3 0.4 2.2 2.3 0.7 2.5 1.3 1.7 1.6 1.6
(3/4) 0.4 (4/4) (5/5) 0.5 (5/5) (2/2) 2.7 (2/2) (6/6) 0.2 (6/6) (2/3) 1 (3/3) (9/11) 0.7 (11/11) (3/3) 0.5 (3/3) (5/5) 4.2 (3/5) (10/10) 0.6 (10/10) (1/1) 1.6 (1/1) (4/4) 2.2 (3/4) (2/2) 0.7 (2/2)
3.4 (3/4) 1.6 (4/5) 0.7 (2/2) 1.6 (6/6) 5 (0/3) 1.9 (8/11) 2.4 (3/3) 0.8 (5/5) 3 (8/10) 0 (1/1) 0.8 (4/4) 0.4 (2/2)
1.4 1.8 2.3 0.5 4.6 1.3 2.3 3.8 1.5 0 5.3 0.6
(4/4) (4/5) (2/2) (6/6) (0/3) (9/11) (3/3) (4/5) (10/10) (1/1) (3/4) (2/2)
1.2 1.8 0.6 0.5 1.4 7.2 0.5 1.7 1.5 2.8 1.7 7.6
(4/4) 6.4 (5/5) 4 (2/2) 8.1 (6/6) 2.8 (3/3) 5.5 (5/11) 3.4 (3/3) 6.4 (5/5) 4.7 (10/10) 3.4 (1/1) 2.8 (4/4) 5.1 (0/2) 19.3
0.2 2 0.3 3.4
(3/3) (8/9) (7/7) (3/3)
1.4 0.3 0.7 1.6
(3/3) (9/9) (7/7) (3/3)
0.6 1.4 1 6.6
(3/3) (8/9) (7/7) (0/3)
0.8 0.8 0.7 0.7
(3/3) (9/9) (7/7) (3/3)
1.1 0.6 0.6 2.1
(3/3) (9/9) (7/7) (3/3)
2.5 0.4 0.4 3.3
(3/3) (9/9) (7/7) (2/3)
1.1 2.6 0.3 2.5
(3/3) (8/9) (7/7) (3/3)
1.3 2 1.4 3.2
(3/3) (8/9) (7/7) (2/3)
1.3 2.6 7.3 0.6
3.5 2.8 2 7.3
(2/3) (8/9) (5/7) (0/3)
(3/3) (1/9) (2/7) (3/3)
0.7 (6/6) 1.2 (2/2) 1.2 (3/3) 0.1 (3/3) 0.5 (5/5) 0.8 (4/4) 0.9 (2/2) 0.9 (1/1)
HSUB
0.5 0.6 1.8 2.5
(3/3) (9/9) (5/7) (3/3)
(3/3) (7/9) (3/7) (3/3)
0.1 (3/3) 0.2 (2/2)
0.2 (3/3) 0.3 (2/2)
0.5 (3/3) 0.7 (2/2)
0.9 (3/3) 3.4 (2/2)
0.5 (3/3) 0.7 (2/2)
3.2 (2/3) 1.3 (2/2)
0.2 (3/3) 2.2 (2/2)
3.9 (2/3) 1 (2/2)
0.5 (3/3) 0.5 (2/2)
0.1 (3/3) 5.6 (0/2)
0.3 0.4 0.2 0.4
0.7 0.9 1.4 1.1
0.7 0.8 0.2 3
3.8 0.4 0.6 3.3
2.5 1.7 0.4 1
1.6 0.7 0.4 4
2 1.1 0.9 0.4
1.7 3.7 5.3 3
1 1.3 0.9 5.3
3.3 3.3 3.4 1.7
(2/2) (4/4) (5/5) (7/7)
(2/2) (4/4) (5/5) (7/7)
(2/2) (4/4) (5/5) (5/7)
(1/2) (4/4) (5/5) (4/7)
(2/2) (4/4) (5/5) (7/7)
(2/2) (4/4) (5/5) (5/7)
(2/2) (4/4) (5/5) (7/7)
(2/2) (3/4) (3/5) (5/7)
(2/2) (4/4) (5/5) (4/6)
(2/2) (3/4) (5/5) (7/7)
1 (3/3) 2.1 (2/2) 6.9 11.2 1.8 14.6
(1/2) (1/4) (5/5) (2/7)
HFOR
HSTD
HFUS
4.1 (2/3) 3 (2/2)
0.5 (3/3) 0.8 (2/2)
18.3 (0/3) 15.8 (1/2)
2.7 (1/1) 1 (2/2)
10.4 (0/1) 6.7 (1/2)
2.1 (1/1) 0.9 (2/2)
33.2 (0/2)
1 (1/1)
2.6 (1/1)
0.7 (1/1)
4.9 (1/1)
2.4 20.1 23 2.7
5.9 (0/3) 2.9 (2/2) 3.3 0.6 2.6 2.8
(2/2) (4/4) (5/5) (7/7)
MP
8.7 11.1 1 8.5 5.9 6.5 8.7 24.7
4.6 3.1 3.3 11.1
(4/6) (0/2) (3/3) (0/3) (3/5) (2/4) (0/2) (0/1)
1 0.5 0.6 1.6 2.1 1.1 0.2 1.1
(6/6) (2/2) (3/3) (3/3) (5/5) (4/4) (2/2) (1/1)
14.4 6.7 3.5 18.2 7.6 5.3 13.9 20.6
(1/6) (0/2) (2/3) (0/2) (1/5) (1/4) (0/2) (0/1)
1.2 1.2 0.9 2 1.6 3.7
(3/3) (4/4) (3/3) (4/6) (10/10) (2/2)
10.7 15.6 5.2 25.2 7.3 29.3
(2/3) (0/3) (1/2) (0/6) (6/10) (0/2)
(1/4) (4/5) (1/2) (4/6) (2/3) (8/11) (0/3) (2/4) (8/10) (1/1) (1/2) (0/2)
2.2 0.7 1.3 0.7 1.1 10.6 0.4 2.2 1.1 2.6 2.5 4.8
(4/4) (5/5) (2/2) (6/6) (3/3) (6/11) (3/3) (5/5) (10/10) (1/1) (4/4) (1/2)
10.3 5.1 24.3 6.3 20.4 8.3 12.7 50.5 10.4 7.6 46.1 75
(0/4) (3/5) (0/2) (1/6) (0/3) (6/11) (0/3) (0/5) (2/10) (0/1) (0/2) (0/2)
(0/3) (8/9) (7/7) (0/3)
2.3 37.5 23 3.4
(3/3) (1/9) (2/7) (3/3)
5 10.8 8 50.7
(1/2) (5/8) (3/7) (0/3)
3.9 (2/3) 8.7 (1/2) 10.5 11.4 4 3.9
(1/2) (1/4) (3/5) (4/7)
4.1 (2/3) 3.7 (2/2) 1.6 6 6 2
(2/2) (3/4) (2/5) (7/7)
7.5 (0/2) 11 (1/2) 6.2 8.1 10 17.5
(0/1) (2/4) (2/5) (2/5)
189
38 39 40 41
NBP
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
6 7 8 9 10 11 12 13
4-Methylalkanes Trans-1,2dimethylcycloalkanes Trans-4-alkenes Trans-1,3dimethylcycloalkanes Trans-1,4dimethylcycloalkanes 3-Methylalkanes 2,6-Dimethylalkanes 4-Ketones Trans-3-alkenes 2-Methylalkanes n-Alkyl propionate 2,5-Dimethylalkanes Cis-1,4dimethylcycloalkanes 2-Butyl-1-alkanols 2-Methyl-alkanal 2,4-Dimethylalkanes 2-Alkanols Acetates Cis-1,3dimethylcycloalkanes Alkyl n-butyrate 3-Ketones 3,3-Dimethylalkanes 2-Ketones Isoalkyl acetate Trans-2-alkenes 3-Alkanols 2,2-Dimethylalkanes Formates Cis-4-alkenes 2,3-Dimethylalkanes Cis-1,2dimethylcycloalkanes Alkyl isobutyrate 1-n-Naphthalenes Cis-2-alkenes 2,2,3,3-Tetramethylalkanes Cis-3-alkenes 3-Methyl-1-aliphatic alcohols 2-Ethyl-1-alcohols Methyl n-alkyl ether Di-n-alkyl ether Di-n-alkyl phthalate
RI
(0/4) (0/3) (0/2) (5/20) (0/3) (7/13) 13.9 40.9 71.8 10.1 6.6 8.4 (3/4) 1.3 (4/4) 7.4 (2/4) 4.5 (2/4) 8 (2/4) (3/3) 12.6 (0/3) 3.9 (2/3) 11.8 (0/3) 2.9 (2/3) (3/3) 11 (0/3) 12.4 (0/3) 1.7 (3/3) 21.5 (0/3) (17/20) 11 (11/20) 0.9 (19/20) 3.3 (14/20) 0.9 (20/20) (2/3) 3.9 (2/3) 12.9 (0/3) 6 (2/3) 16.1 (0/3) (10/13) 5.5 (8/13) 2.8 (11/13) 5.8 (7/13) 4.8 (9/13) 4.1 1.5 1.5 4.1 4.8 3.4 (2/4) (2/3) (0/3) (17/20) (0/3) (13/13) 5 5.2 5.8 3.3 5.7 0.7 (4/4) (3/3) (3/3) (18/20) (3/3) (10/13) 1 2.4 2.2 6.6 1.3 4.3 (4/4) (0/3) (3/3) (17/20) (3/3) (9/13) (4/4) 1.7 (3/3) 5.6 (3/3) 2.5 (20/20) 3.4 (3/3) 2.1 (10/13) 11.2 0.3 1.4 1.9 0.4 3.3 2.6 (4/4) (3/3) (3/3) (20/20) (3/3) (13/13) 2.1 0.7 0.5 0.3 2.6 0.6 (4/4) 0.8 (4/4) (3/3) 2.2 (3/3) (2/3) 1.8 (3/3) (19/20) 10.8 (17/20) (3/3) 2.5 (3/3) (11/13) 0.3 (13/13) 1.2 1.6 2.3 1.5 2.1 0.9 (4/4) (3/3) (3/3) (17/20) (3/3) (11/13) 1 1.3 0.4 5.4 1.9 7.7 (4/4) (3/3) (3/3) (20/20) (3/3) (12/12) 0.1 0.4 0.1 0.2 0.9 0.1
23.6 (2/12) 24.8 (0/3) 7.9 (9/12) 1 (4/4) 7.6 (3/12) 8.5 (0/3) 4 (9/12) 3 (3/4) 12.4 (9/12) 4.7 (2/4) 0.6 (12/12) 2.9 (11/12) 0.7 (11/12) 2.7 (9/12) 5.5 (2/4) 1.3 (4/4) 11.8 (2/4) 2.6 (3/4) 0.2 (12/12) 0.6 (12/12) 0.4 (12/12) 0.7 (4/4) 2.3 (4/4) 1.7 (4/4) 0.9 (11/12) 0.4 (12/12) 1.1 (4/4) 0.3 (4/4) 0 (11/11) 0.2 (4/4)
2.3 (5/7) 9.5 (1/6) 4.4 (4/7) 6.2 (3/7) 3.1 (4/7) 0.9 (7/7) 1.5 (5/7) 3.5 (5/7) 1.3 (7/7) 2.1 (7/7) 4.4 (4/7) 0.6 (7/7) 0.3 (7/7)
45 46 47 48 49 50
43 44
2-Methyl-1-aliphatic alcohols Mercaptans 2-Methyl-alkanoic acid Alkyl-cyclopentane Neo-alkanoic acid 3-Methyl-1-alkenes n-Aliphatic acids n-Alkyl chloride Aldehydes 42
0.7 (7/7)
HSTD MP HFOR SSTD HCOM HSUB FP PC VC TC ENT LVOL NBP RI No.
Table 2 (continued )
9.3 (4/6)
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
HFUS
190
coefficient sum from adjacency matrix (Todeschini and Consonni, 2000). The last column in Table 3 shows the normalized values (VEA1) of this descriptor for the first 11 members of the 1-alkene series (VEA1¼VEA1/5.876). Using the training set data the TQSPR1 model obtained is Tb ¼1022.8 VEA1-72.8. Plotting Tb of the training set and the target compound versus VEA1 (Fig. 2) demonstrates the linear relationship between the descriptor and Tb for the compounds involved. The differences between the DIPPR Tb data and the predicted values are indicated by the ri (Eq. (4)) and the δyi values (Table 3). Observe that for the ethylene r1 ¼−3.91 K and δy1 ¼ 2.31% and for 1dodecene r10 ¼−3.27 K and δy10 ¼ 0.67%. Thus, while the magnitude of the absolute error is very similar for the two compounds, the relative errors are quite different. For ethylene δy1 is significantly larger than the data uncertainty (U1 o1%), whereas for 1-dodecene δy10 oU10 o1%. This demonstrates one of the reasons of the higher prediction errors for low nC compounds. Many property values are considerably smaller for low nC compounds, thus the relative error tends to inflate the prediction error for such compounds. Minimization of the sum of squares of the relative errors should be considered for avoiding excessive prediction errors (see Section 5.4) in cases where the property values change considerably (i.e., order of magnitude) within the training set members. The results presented in Table 3 were obtained with ethylene included in the training set. To study the effect of extrapolation on the prediction error, 1-heptene is replaced by ethylene as target compound. Repeating the TQSPR1 prediction procedure, the descriptor VEA1 is again identified as the one with the highest correlation with the Tb values for the members of the training set. The TQSPR1 model parameters changed slightly to Tb ¼1011 VEA1– 67, the relative prediction error increased to 3.95% and the residual increased in absolute value to r t ¼ 6.7 K. The comparison of the residual plots for Tb prediction with ethylene and 1-heptene serving as target compounds (Fig. 3) demonstrates the considerable increase of the prediction error caused by the extrapolation involved in the prediction of the properties of ethylene. The DDBST package was used to predict the Tb of ethylene by the prediction techniques available for this compound. Using the Joback and Reid (1987) method yielded Tb ¼234.56 K (δy¼38.46%) was obtained, the GVS (Group Vector Space) method of Wen and Qiang (2002) yielded Tb ¼180.29 K (δy¼6.4%) and the method of Constantinou and Gani (1994) returned an error message “Group assignment failed”. Thus, in this case the TQSPR1 method yields lower prediction errors than the pertinent GC methods. In Table 4 the prediction error for 15 properties of 1-heptene and ethylene are presented for the cases where 1-heptene is the target compound (ethylene is a member of the training set) and the case when ethylene is the target compound. The training set average error (TSAE) is also presented (only for the case when 1-heptene is the target compound). Let us consider first the case when 1-heptene is the target compound. In this case the TSAE values are less than 1% for 8 properties and less than 2% for two additional properties. In general the prediction errors for ethylene and 1-heptene are consistent with the TSAE values. Too high δy values can often be explained by the respective Ui or Umax values. For example, the TSAE value for FP is 0.74% and the prediction error for 1-heptene is: δy ¼2.5%. This discrepancy can be explained by the experimental uncertainty of FP for this compound: Ui ¼3% Similarly, for VC, TSAE ¼1.47%, the prediction error for ethylene is 7.16%, a discrepancy that can be explained by Umax ¼ 10% for VC. The reasons for excessive TSAE values and/or prediction errors for HSTD, SSTD and LVOL are discussed in detail in Section 5.2; for HFUS see Section 5.3 and for HFOR and HSTD see Section 5.4. In the case when ethylene is the target compound the prediction error increases significantly for all the properties considered.
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
191
Table 3 Training set and target property data and predicted values for modeling of Tb with a TQSPR1 for 1-heptene. Compound no.
1 2 3 4 5 6 7 8 9 10 Target a b
Name
Ethylene Propylene 1-Butene 1-Pentene 1-Hexene 1-Octene 1-Nonene 1-Decene 1-Undecene 1-Dodecene 1-Heptene
nC
2 3 4 5 6 8 9 10 11 12 7
DIPPR dataa
Value of VEA1b
TQSPR1 prediction
Tb (K)
U (%)
Tb (K)
δy (%)
Residual
Descriptor
169.41 225.45 266.91 303.22 336.63 394.41 420.018 443.75 465.82 486.15 366.79
o1 o 0.2 o1 o1 o 0.2 o1 o1 o1 o1 o1 o 0.2
173.32 224.32 265.92 302.30 334.85 392.47 418.75 443.47 466.97 489.42 364.79
2.31 0.50 0.37 0.30 0.53 0.49 0.30 0.06 0.25 0.67 0.55
-3.91 1.13 0.99 0.92 1.78 1.94 1.27 0.28 -1.15 -3.27 2.00
0.24064 0.2905 0.33118 0.36675 0.39857 0.4549 0.4806 0.50477 0.52774 0.54969 0.42784
Source: DIPPR database (Rowley et al., 2010). Normalized value, VEA1¼ 5.876 VEA1
Fig. 1. Plot of Tb of the first 11 members of the 1-alkene series versus nC.
Fig. 2. Plot of Tb of the first 11 members of the 1-alkene series versus the normalized value of the descriptor VEA1.
Thus, extrapolation to a lower carbon number can be clearly identified as a principal reason for increase of the prediction error. The reasons for the excessive prediction errors that we have identified are listed in the last column of Table 4. For all 15 properties, extrapolation to a lower carbon number (of the target compound) has been identified as a reason for an excessive prediction error. Thus, when the target compound is the first member in the series, the inherent need to extrapolate to lower nC increases the prediction error compared to the expected error based on the training set members.
Fig. 3. Comparison of residual plots for Tb prediction with ethylene and 1-heptene serving as target compounds.
The outlying behavior of the properties of the low nC compounds can be best explained with reference to homologous series. Homologous series have the general formula H(CH2)nR, where –CH2– is the methylene group, n is the number of methylene groups and R is an end (or functional) group, which defines the homologous series. For low nC compounds the functional group has a dominant influence on the property value, while for higher nC compounds the contribution of the methylene groups becomes more significant and the influence of the functional group diminishes. As there may be a considerable discrepancy between the property contributions of the various functional groups, there are also differences between the limiting nC values (nCmin) below which the functional group contribution to the property dominates. Obviously, the nCmin should exceed the number carbon numbers in the functional group R, nCR, and typically nCmin ¼nCR+2 (Shacham et al., 2013). The melting point temperatures (Tm), of the first 11 members of the 1-alkene series are plotted versus nC in Fig. 4. Observe that for this series the Tm value increases monotonically with increasing nC starting from nC ¼4, however for nC ¼2 and 3 the trend is different: Tm decreases with increasing nC. Because of this change in the trend, there is no descriptor in the database that can represent correctly the full set of data. The selected descriptor (Mor04e, from the 3D-MoRSE descriptor group—signal 04/weighted by atomic Sanderson electronegativities) linearizes the Tm data (see Fig. 5), but the deviation of the straight line representation is considerable larger for the low nC compounds (ethylene and 1-butene) than for the rest of the compounds. Thus, a possible cause of excessive prediction error for low nC compounds can be the difference in the
192
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
Table 4 Dominant descriptors and prediction errors for ethylene when ethylene or 1-heptene is the target compound. No.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Property
ENT FP HCOM HFOR HFUS HSTD HSUB LVOL MP NBP PC RI SSTD TC VC
Descriptor
X0v SPI WA H-046 DP07 Mor19m AMR H-046 Mor04e VEA1 HVcpx HATS0p Mor20e DP01 piID
1-Heptene target compound
Ethylene target compound
TSAE
δy (%, C7)
δy (%, C2)
Descriptor
δy (%)
0.17 0.74 0.15 9.80 10.45 127.90 0.60 1.89 3.02 0.58 0.74 0.22 3.02 0.25 1.47
0.12 2.50 0.15 1.54 7.34 2.41 0.45 0.61 3.90 0.54 0.86 0.07 6.42 0.07 0.10
0.18 0.26 0.41 1.97 8.68 2.91 1.85 9.56 8.59 2.31 0.89 0.65 7.03 1.08 7.16
Sv TIC1 G2 Sp Mor11u Mor19v Seig BID DP09 VEA1 ESpm02x VRZ2 Mor20u piPC01 piID
4.5 9.72 1.6 22.78 137 13.85 14.34 13.85 14.74 3.95 8.44 9.8 11.13 3.81 10.01
Causes of excessive pred. errora
1 1 1 1 1, 1, 1 1, 1, 1 1 1, 1, 1 1
2, 4 3 3 2
3 3
a Causes for excessive prediction errors: 1. Extrapolation to target compound, 2. Change of trend for low nC compounds, 3. Phase change within the training set at standard state and 4. Crystalline structure change within the training set.
trend of the property change with nC in the low nC region in comparison to the other regions. The prediction errors for Tm of the first four members of 1-alkene series using four prediction techniques are summarized in Table 5. The methods used are the TQSPR1, Joback and Reid (1987), GVS (Wen and Qiang, 2002) and Constantinou and Gani (1994). Observe that for all the methods involved the prediction error was either excessively large for all compounds (Joback's and GVS), or with acceptable prediction errors for some compounds and excessively large for others (TQSPR1, Constantinou and Gani). Thus, the GC methods are adversely affected also by the change in the trend of the property values for low nC compounds. The heat (enthalpy) of fusion is an additional solid property that has been included in this study, where the change of trend at low nC has been identified as a reason for excessive prediction error. However for this property, a change in the crystalline structure within the training set has been identified as an additional source of inaccuracy. The prediction of the heat of fusion of ethylene is discussed in detail in Section 5.3.
Fig. 4. Plot of Tm of the first 11 members of the 1-alkene series versus nC.
5.2. Phase change at standard state within the training set There are several properties in the DIPPR database for which the reported values are for the “standard state”, which is defined as the stable phase at 298.15 K and 1 bar. These properties include standard state enthalpy of formation (HSTD), entropy of formation (SSTD) and enthalpy (heat) of combustion (HCOM). Refractive indexes (RI) are reported usually for the liquid phase, which may not be the valid phase at the “standard state” (i.e., the compound is in gas phase at the standard state). For compounds with Tm higher than 218.15 K, RI is reported for the undercooled liquid at the standard state. Liquid molar volume (LVOL) is usually reported at standard state temperature, unless TC o 298.15, in which case LVOL is reported at Tb, or at the triple point temperature if Ttp 4298.15. Obviously, phase and/or temperature variations maybe reflected also in the property variation within the group of similar compounds (i.e., homologous series). The data for the TQSPR1 model prediction of SSTD with ntridecane as target compound are shown in Table 6. The training set comprises of n-alkanes in the 12≤nC≤22 region. The SSTD values of the target compound and its training set are plotted versus nC in Fig. 6. Observe that there is a sharp “step” change in the SSTD value between n-heptadecane (nC ¼17) and n-octadecane (nC ¼18). The last column in Table 6 shows that Tm ¼ 295.13 K for
Fig. 5. Plot of Tm of the first 11 members of the 1-alkene series versus the descriptor Mor04e.
n-heptadecane, thus it is in liquid phase at the standard state temperature, while n-octadecane (Tm ¼ 301.31 K) is solid at the standard state. Attempting to develop a TQSPR1 model using all ten members of the training set results in the selection of EEig13d as the dominant descriptor. This descriptor belongs to the 2D “edge adjacency indices” category and it represents the “eigenvalue 13 from edge adj. matrix weighted by dipole moments”. The
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
193
Table 5 Comparison of accuracy of Tm predictions by various methods for the 1st four members of the 1-alkene series. Comp. no.
Name
1 2 3 4
Ethylene Propylene 1-Butene 1-Pentene a
Jobacka
GVSa
Constantinou, Gania
DIPPR
TQSPR1
Tm (K)
Tm (K)
δy (%)
Tm (K)
δy (%)
Tm (K)
δy (%)
Tm (K)
δy (%)
104 87.9 87.8 108.02
95.06 88.94 97.80 110.55
8.59 1.19 11.39 2.35
113.86 121.81 133.08 144.35
−9.48 −38.58 −51.57 −33.64
101.46 108.37 107.52 124.32
2.44 −23.28 −22.46 −15.09
– 87.90 91.73 124.54
– 0.00 −4.48 −15.30
References: Joback and Reid (1987); GVS, Wen and Qiang (2002); Constantinou and Gani (1994).
Table 6 Training set and target property data and predicted values for modeling of SSTD with a TQSPR1 for n-tridecane. Comp. no.
Name
1 2 3 4 5 6 7 8 9 10 Target
nC
n-Tetradecane n-Dodecane n-Pentadecane n-Hexadecane n-Heptadecane n-Octadecane n-Nonadecane n-Eicosane n-Heneicosane n-Docosane n-Tridecane
14 12 15 16 17 18 19 20 21 22 13
DIPPR data
TQSPR1 prediction
SSTD (J/kmol K)
U (%)
SSTD (J/kmol K)
δy (%)
Descriptor
Tm (K)
555430 490660 587520 619650 652240 480200 503550 526900 550250 573610 522870
o1 o1 o1 o1 o1 o1 o3 o3 o3 o3 o1
589847.7 502471.3 584336.3 576987.7 568698.2 560094.9 551491.7 543157.4 535226.3 527698.5 502471.3
6.20 2.41 0.54 6.88 12.81 16.64 9.52 3.09 2.73 8.00 3.90
−1.95 0 −1.827 −1.663 −1.478 −1.286 −1.094 −0.908 −0.731 −0.563 0
279.01 263.57 283.07 291.31 295.13 301.31 305.04 309.58 313.35 317.15 267.76
Standard State Entropy (J/Kmol*K)
7.0E+05
6.5E+05 Training Set Target
6.0E+05
5.5E+05
5.0E+05
4.5E+05
4.0E+05 11
13
15
17
EEig13D
19
21
23
No. of C atoms
Fig. 6. Plot of SSTD of n-tridecane and its training set versus nC.
resultant TQSPR1 model is SSTD¼−44,808 EEig13d +502,471. The predicted values using this TQSPR1 model (Table 5) are very inaccurate reaching δy 415% even for a member of the training set. Thus, in terms of the SSTD property values, the training set compounds belong to two separate populations (Fig. 6) and as such cannot be satisfactorily represented with the same TQSPR1 model. An additional notable problem in this example is related to the definition of the EEig13d descriptor. The EEig13d descriptor values decrease linearly from −0.563 (for nC ¼22) to −1.95 (for nC ¼14). But for nC o13 (i.e., for n-tridecane and n-tetradecane, Table 6) the descriptor value is 0. The zero value in this case is assigned by the DRAGON software to the particular compound in cases where the descriptor calculation is applied outside the range of its definition. This issue is discussed in detail in Section 5.6. To find a satisfactory TQSPR1 for representing the SSTD of n-tridecane, a training set which includes only compounds that are liquid at the standard state should be selected. Indeed using the
five compounds with Tm o 298.15 K as a training set, can yield very accurate TQSPR1, based on a variety of dominant descriptors that are identified as highly correlated with the training set SSTD data. One option is to use nC as the dominant descriptor. In this case the TQSPR1 obtained is SSTD ¼103,280+32285.1351 nC, with TSAE¼ 0.018 and δy¼ 0.022%, indicating excellent representations of both the training set and the target compound. The results of property prediction for ethylene which are summarized in Table 4 can be used for assessing the effect of phase change on the additional properties defined at the standard state, as the first three compounds in the 1-alkene series ( nC ¼2, 3 and 4) are in a gaseous phase at the standard state, while the rest (up to nC ¼18 ) are in the liquid phase. Comparison of the prediction errors for the case when 1-heptene is used as the target compound can provide an assessment for the effect of the phase change as this excludes the error associated with the extrapolation to the target. The prediction error for HCOM is very low (δy¼−0.41%), so there is very small influence of the phase condition on HCOM. This can be explained by the orders of magnitude differences between the heat of fusion and vaporization and the heat of combustion. For SSTD δy¼7.03% and for HSTD δy¼2.61%, thus for both properties the change of phase conditions within the training set effects the prediction error, while the deterioration of the prediction accuracy for SSTD is more severe than for HSTD. The LVOL of ethylene is reported in the DIPPR database at its Tb ( ¼169 K) instead of the standard state. Consequently the prediction error is very high (δy¼9.56%). RI for ethylene is measured also well below the standard state temperature (at 173 K), however in this case the dominant descriptor identified matches the irregular behavior of ethylene in the training set yielding low δy value (0.65%). The DDBST software does not include prediction techniques for RI of ethylene and its LVOL cannot be predicted either. The values of the Gibbs energy (which are needed for prediction of SSTD) are predicted only at the vapor phase in the standard state. Therefore, comparison with GC method is not included in this section.
194
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
Table 7 Training set and target property data and predicted values for modeling of HFUS with a TQSPR1 for ethylene. Comp. No.
1 2 3 4 5 6 7 8 9 10 Target
Name
Propylene 1-Butene 1-Pentene 1-Hexene 1-Heptene 1-Octene 1-Nonene 1-Decene 1-Undecene 1-Dodecene Ethylene
nC
3 4 5 6 7 8 9 10 11 12 2
DIPPR Data
TQSPR1 Prediction
Value of Mor11u Descriptor
HFUS (J/kmol)
U (%)
HFUS (J/kmol)
δy (%)
2.936E+06 3.848E+06 5.937E+06 9.347E+06 1.264E+07 1.531E+07 1.936E+07 1.381E+07 1.699E+07 1.991E+07 3.351E+06
o1 o1 o1 o1 o5 o1 o1 o0.2 o0.2 o0.2 o1
1.980E+06 3.940E+06 6.931E+06 1.010E+07 1.297E+07 1.463E+07 1.627E+07 1.710E+07 1.808E+07 1.809E+07 −1.239E+06
32.56 2.38 16.74 8.04 2.59 4.44 15.93 23.80 6.39 9.12 136.98
Fig. 7. Plot of HFUS of the first 11 members of the 1-alkene series versus nC.
5.3. Irregularities in the solid properties related to the crystalline structure The solid properties (e.g., Tm and HFUS) may behave irregularly within the group of similar compounds for various reasons (i.e., melting from different crystalline structures, Broadhust, 1962, or crystallization in a state of orientation disorder, McCullough et al., 1957). The irregularity caused by orientation disorder will be demonstrated here by modeling HFUS of the first 11 members of the 1-alkene series, using ethylene as the target compound. The HFUS data of the target compound and the training set members are listed in Table 7 and depicted versus nC in Fig. 7. As in the case of Tm, a different trend of the HFUS variation with nC can be observed in the low nC region. But additionally, there is a “step” change in the TFUS between 1-nonene (nC ¼ 9) and 1-decene (nC ¼10). It seems very unlikely that the training set shown in Fig. 7 can be satisfactorily represented by a TQSPR1 model. Attempting to construct such a model yields the Mor11u descriptor as the “dominant” descriptor. This descriptor belongs to the “3D-MoRSE descriptors” category (just like the Mor04e descriptor that was selected for Tm) and defined as “3D-MoRSE—signal 11, un-weighted” descriptor. The last column in Table 7 shows the normalized values of this descriptor (Mor11u¼Mor11u/9.79). Using the training set data, the TQSPR1 model obtained is TFUS¼ −1.7134E+08 Mor11u—6.6105E+06. This model indeed gives very poor predictions; even for a member of training set the prediction error reaches 32.56% (Table 7). For the target compound, negative heat of fusion is obtained with a prediction error of 137% (Tables 4 and 7). For this property only the Joback and Reid (1987) GC prediction method is available in the DDBSP package. This
−0.050138 −0.061575 −0.079036 −0.097519 −0.11427 −0.12397 −0.13356 −0.13836 −0.14408 −0.14418 −0.031349
Fig. 8. Plot of HFUS of the first 8 members of the 1-alkene series versus the descriptor RDF040m.
prediction method provides a value HFUS¼−1.826e6 J/kmol, which represents a prediction error of 154.49%. The accuracy of the prediction can be improved considerably in this case by removing the last three compounds from the training set. The dominant descriptor selected with the 7-member training set is the RDF040m descriptor (belongs to the 3D RDF descriptors category, “Radial Distribution Function—4.0/ weighted by atomic masses”). Observe that this descriptor represents HFUS very well for the training set and the target compound (Fig. 8). The associated prediction error for ethylene is δy¼1.87%. 5.4. Change of property values by orders of magnitude within the training set The effects of changes of property values by orders of magnitude within the training set are demonstrated for the ideal gas enthalpy of formation (HFOR) of the first 11 members of the 1-alkene series. The respective data are shown in Table 8. Observe that HFOR changes sign between propylene and 1-butene. Consequently the absolute value of HFOR of 1-butene is smaller by 3 or 2 orders of magnitude than the values for the rest of the compounds. Using 1-heptene as the target compound, the descriptor H-046 (from the 1D “atom-centered fragments” category, H atom attached to C0(sp3) no X attached to next C atom) is identified as the dominant descriptor. Using normalized values of this descriptor (H-046¼H-046/87), the TQSPR1 model obtained is HFOR¼5.1474E+07−8.9684E+08 H-046. This model yields δy values below the data uncertainty for most compounds. However for 1-butene, δy is extremely large (86.3%). Comparing the residual values shows that the residual for 1-butene is r¼−4.3E+05 J/kmole,
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
195
Table 8 Training set and target property data and predicted values for modeling of HFOR with a TQSPR1 for 1-heptene using absolute and relative error least squares. Comp. no.
1 2 3 4 5 6 7 8 9 10 Target a b
Name
nC
Ethylene Propylene 1-Butene 1-Pentene 1-Hexene 1-Octene 1-Nonene 1-Decene 1-Undecene 1-Dodecene 1-Heptene
2 3 4 5 6 8 9 10 11 12 7
TQSPR1 predictiona
DIPPR data
TQSPR1 predictionb
Value of H-046 descriptor
HFOR (J/kmol)
U (%)
HFOR (J/kmol)
δy (%)
HFOR (J/kmol)
δy (%)
5.251E+07 2.023E+07 −5.000E+05 −2.162E+07 −4.167E+07 −8.194E+07 −1.035E+08 −1.221E+08 −1.449E+08 −1.654E+08 −6.289E+07
o3 o1 o3 o3 o3 o3 o3 o3 o3 o1 o3
5.147E+07 2.055E+07 −6.870E+04 −2.069E+07 −4.130E+07 −8.254E+07 −1.032E+08 −1.238E+08 −1.444E+08 −1.650E+08 −6.192E+07
1.97 1.57 86.31 4.32 0.88 0.73 0.34 1.37 0.35 0.24 1.54
5.123E+07 2.019E+07 −5.000E+05 −2.119E+07 −4.189E+07 −8.328E+07 −1.040E+08 −1.247E+08 −1.454E+08 −1.661E+08 −6.258E+07
2.43 0.18 0.00 1.97 0.53 1.63 0.45 2.10 0.31 0.39 0.49
0 0.034483 0.057471 0.08046 0.10345 0.14943 0.17241 0.1954 0.21839 0.24138 0.12644
Absolute error least squares: HFOR ¼ 5.1474E+07−8.9684E+08 H-046 Relative error least squares: HFOR ¼5.1234E+07−9.0018E+08 H-046
Table 9 Training set and target property data and predicted values for modeling VC with a TQSPR1 for 1-decanol.
1 2 3 4 5 6 7 8 9 10 Target
Name
1-Undecanol 1-Tridecanol 1-Tetradecanol 1-Pentadecanol 1-Hexadecanol 1-Heptadecanol 1-Octadecanol 1-Nonadecanol 1-Eicosanol 1-Nonanol 1-Decanol
nC
11 13 14 15 16 17 18 19 20 9 10
DIPPR data
TQSPR1 prediction
Value of Ds descriptor
Vc (m3/kmol)
U (%)
Vc (m3/kmol)
δy (%)
0.715 0.86 0.802 1.01 1.08 1.16 1.23 1.31 1.12 0.576 0.645
o3 o5 o25 o5 o6 o7 o8 o10 o25 o3 o3
0.68481 0.82838 0.90016 0.97194 1.0437 1.1155 1.1873 1.2591 1.2591 0.61302 1.4026
4.22 3.68 12.24 3.77 3.36 3.84 3.47 3.89 12.42 6.43 117.46
while for ethylene r¼1.0E+06 J/kmole. Thus, the large prediction error for 1-butene is caused by the orders of magnitude difference between HFOR of 1-butene and the rest of the compounds. To alleviate this problem, we used the same descriptor for derivation of the TQSPR1 model; however the sum of squares of the relative errors was minimized to obtain the model parameters. The parameter values of the TQSPR1 model are only slightly different than that in the previous case (HFOR ¼5.1234E+07 −9.0018E+08 H-046 ), but the minimization of the relative errors brings all the δy values below the HFOR data uncertainties. The DIPPR database lists over 15 sources for HFOR of 1-butene, some of the reported values are experimental some are predicted. The value recommended by DIPPR (−5.0E5 J/kmol) is a value that was predicted by the GC method of Domalski and Hearing (1993). Consequently, there is no point in comparing the TQSPR1 prediction to other methods. The analysis that was carried out in this section suggests, however, that in terms of the relative error, the recommended value for 1-butene is consistent with the HFOR values of the rest of the compounds in the 1-alkene series. 5.5. Inconsistency in the 3D molecular structure files and or 3D descriptors In Table 9 data and results for prediction of the critical volume (VC) for 1-decanol are shown. The training set in this case includes members of the 1-alkanol series in the 9≤nC≤20 range. The Ds descriptor was identified as the dominant descriptor. This descriptor belongs to the 3D WHIM descriptors category and it represents “D total accessibility index/ weighted by atomic electrotopological states”. The TQSPR1 model obtained is VC ¼16.9797−71.7838 Ds.
0.227 0.225 0.224 0.223 0.222 0.221 0.22 0.219 0.219 0.228 0.217
0.23 0.228 Training Set Target
0.226 Descriptor Ds
Compound no.
0.224 0.222 0.22 0.218 0.216 5
7
9
11
13
15
17
19
21
No. of C atoms
Fig. 9. Plot of the descriptor Ds of 1-decanol and its training set versus nC.
The δy values obtained with this model for the members of the training set are all smaller than the respective data uncertainty values (Table 9). However, for the target compound, 1-decanol δy¼117.46%, thus the prediction error is rather excessive. A plot of the descriptor Ds vs. nC (Fig. 9) indicates that the Ds descriptor value is inconsistent with the descriptor values of the other members of the 1-alkanol series. Graphical representation of the 3D molecular structure (MOL) files of 1-decanol and its two immediate neighbors in the homologous series (1-nonanol and 1-undecanol, Fig. 10) reveals the
196
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
cause of the inconsistency of the Ds values: The –CH2OH group in the 1-decanol structure file is rotated relatively to its position in the structure files of the other members of the 1-alkanol series. Indeed, upon modifying the 1-alkanol structure as to match the configuration of the other members of the 1-alkanol series, the resulting descriptor value is Ds ¼0.229. Introducing this value into the TQSPR1 shown above yields VC ¼0.541 m3/kmol, where the prediction error is δy ¼16%. While such an error is still considered as relatively large, its magnitude can be attributed to the Umax value ( ¼25%) of the training set data We have carried out some studies (Paster et al., 2009, 2010) on the effects of the 3D molecular structure variation on the calculated values of the 3D descriptors and have found that some groups of descriptors are more sensitive to variations of the 3D structure than the others. The present study shows, however, that the dangerous situation arises when the 3D structure of the target compound is inconsistent with those of the training set members. Thus, it is very important to obtain the 3D optimized structure of the target compound using the same tools and algorithms that were used for the rest of the compounds of the database. In fact, applying the TQSPR1 method for property prediction of members
Fig. 10. The 3D structure (MOL) files of 1-decanol (target compound) and its two immediate neighbors in the homologous series.
of homologous series has proven to be an efficient method for detecting irregularities or inconsistencies in the representation of the minimized 3D structures which were used to obtain the molecular descriptor database.
5.6. Use of descriptors whose range of definition does not cover the entire compound space There are many 2D descriptors whose range of definition is limited in terms of the number of non-Hydrogen molecule atoms. The descriptor EEig13d, for example, which was mentioned in Section 5.2, is defined only for molecules that contain more the 12 non-hydrogen atoms. Usually a zero value is assigned by Dragon to the descriptors when their calculation is attempted outside their range of definition. Table 10 shows the data and results for the prediction of TC for ethylene using 10 members of the 1-alkene series as training set and the ESpm02d as the dominant descriptor. This descriptor belongs to the 2D “edge adjacency indices” group and defined as “spectral moment 02 from edge adjacency matrix weighted by dipole moments”. The edge adjacency matrix maps the molecule with respect to the connection between edges (bonds) via nonHydrogen atoms assigning a value of “1” if there is a connection between two edges and a value of “0” otherwise. Since for ethylene there is only one edge which is connected to non-Hydrogen atoms, the edge adjacency matrix consists of one term with the value of zero. Consequently the value of the ESpm02d descriptor for ethylene is zero, which is inconsistent with the descriptor values of the other members of the 1-alkene series that are included in the training set (Table 10). The TQSPR1 model for the training set, TC ¼163.56 ESpm02d+145.82 yields TSAE value of 1.45%, while the prediction error for ethylene is over 30 times larger (δy¼48.35%). In practice, it is possible to divide descriptors such as EEig13d and ESpm02d into two regions: one region where the descriptor changes with the change of the molecular structure (region A) and where the descriptor value is constant (undefined and/or) zero (region B). The danger in the use of such descriptors in QSPR models, is that all (or most) of training set members will probably be associated with region A, while the target compound may be located in region B (in particular in cases of extrapolation to lower nC compounds). This may lead to a very good fit of the TQSPR1 model to the properties of the training set members, while the prediction error for the target compound can be excessively large. To prevent prediction errors resulting from such a situation, the use of a 2D descriptor whose value is zero for some members of the training set and/or for the target compound (and the zero
Table 10 Training set and target property data and predicted values for modeling of TC with ESpm02d descriptor. Compound no.
1 2 3 4 5 6 7 8 9 10 Target
Name
Propylene 1-Butene 1-Pentene 1-Hexene 1-Heptene 1-Octene 1-Nonene 1-Decene 1-Undecene 1-Dodecene Ethylene
nC
3 4 5 6 7 8 9 10 11 12 2
DIPPR data
TQSPR1 prediction
Value of ESpm02d descriptor
TC (K)
U (%)
TC (K)
δy (%)
364.85 419.5 464.8 504 537.4 566.9 593.1 616.6 637.8 657.1 282.34
o 0.2 o 0.2 o 0.2 o 0.2 o 0.2 o1 o1 o1 o1 o1 o 0.2
348.96 423.54 474.58 513.34 544.74 571.08 593.65 613.60 631.27 647.29 145.82
4.35 0.96 2.10 1.85 1.37 0.74 0.09 0.49 1.02 1.49 48.35
1.242 1.698 2.01 2.247 2.439 2.6 2.738 2.86 2.968 3.066 0
M. Shacham et al. / Chemical Engineering Science 97 (2013) 186–197
actually represents an undefined value), should be avoided. This can be accomplished at the dominant descriptor selection stage. Conclusions This study involved the development and testing of a program which is able to predict a variety of properties for different groups of compounds using the TQSPR1 method. The program was tested by predicting 15 constant properties for a set of 471 compounds belonging to 80 groups (mostly homologous series). It was demonstrated that the TQSPR1 method can model the majority of the properties within the data uncertainty level for most groups of compounds that were included in this study. It was also shown that the accuracy of the TQSPR1 predictions is comparable to, or higher than the accuracy of the commonly used GC predictions. This study enabled the identification of the frequent causes of excessive prediction errors for some properties, groups of compounds and property–group combinations. The following causes of excessive prediction errors were demonstrated in this paper: (1) extrapolation to low nC compounds; (2) phase change at standard state within the training set; (3) irregularities in the solid properties affected by changes in the crystalline structure; (4) change of property values by orders of magnitude within the training set; (5) inconsistency in representation of the molecular structure by some of the 2D descriptors; and (6) inconsistency in the 3D molecular structure files Upon identifying the reasons of excessive prediction errors, improvements and modifications of the property prediction algorithm can be followed in order to eliminate or minimize their adverse effects. It was demonstrated that in the cases where the property values change by orders of magnitude within the training set, the minimization of the sum of squares of the relative errors can significantly reduce the prediction error. For properties that are strongly influenced by the phase condition at the standard state, the phase condition of the target compound at the standard state should be first defined (or predicted), and only compounds in the same phase should be retained in the training set. In fact, the property prediction system that we have developed can serve as a very efficient and effective tool for evaluation of the proposed improvements of the property prediction algorithm. The results reported and the suggestions for improvements are applicable to other property prediction techniques, as well, and are not limited to the TQSPR1 method. This study demonstrates the need to develop an interactive computational tool which provides the user the information and tools to make wise decisions in the course of the training set selection and in the derivation of the QSPR model in order to increase the prediction reliability and accuracy. There is also a need to prepare a “recommended” list of reliable non-3D descriptors, which were proven successful in predicting particular properties and their range of definition includes the whole compound space. Limiting the availability of descriptors to the selected ones when predicting the same property for a new target compound, can resolve many of the problems associated
197
with the unreliability of some of the descriptors and the need to extrapolate to low nC. In subsequent stages of this research our property prediction system will be used for identifying preferred set of descriptors for the various properties. References Brauner, N., Stateva, R.P., Cholakov, G, St., Shacham, M., 2006. A structurally “targeted” QSPR method for property prediction. Ind. Eng. Chem. Res. 45, 8430–8437. Broadhurst, M.G., 1962. Extrapolation of the orthorhombic n–paraffin melting properties to very long chain lengths. J. Chem. phys. 36, 2578–2582. Constantinou, L., Gani, R., 1994. New group-contribution method for estimating properties of pure compounds. AIChE J. 10, 1697–1710. Dearden, J.C., 2003. Quantitative structure–property relationships for prediction of boiling point, vapor pressure, and melting point. Environ. Toxicol. Chem. 22, 1696–1709. Domalski, E.S., Hearing, E.D., 1993. Estimation of the thermodynamic properties of C–H–N–O–S–Halogen compounds at 298.15-K. J. Phys. Chem. Ref. Data 22, 805–1159. Joback, K.G., Reid, R.C., 1987. Estimation of pure-component properties from groupcontributions. Chem. Eng. Commun. 57, 233–243. Kahrs, O., Brauner, N., Cholakov, G.St., Stateva, R.P., Marquardt, W., Shacham, M., 2008. Analysis and refinement of the targeted QSPR method. Comput. Chem. Eng. 32, 1397–1410. Marano, J.J., Holder, G.D., 1997. General equations for correlating the thermophysical properties of n-paraffins, n-olefins and other homologous series. 2. Asymptotic behavior correlations for PVT properties. Ind. Eng. Chem. Res. 36, 1895. McCullough, J.P., Finke, H.L., Gross, M. E., Messerly, J.F., Waddington, G., 1957. Low temperature calorimetric studies of seven 1-olefins: effect of orientational disorder in the solid state. J. Phys. Chem. 61, 289–301. Paster, I., Shacham, M., Brauner, N., 2009. Investigation of the relationships between molecular structure, molecular descriptors and physical properties. Ind. Eng. Chem. Res. 48, 9723–9734. Paster, I., Tovarovski, G., Shacham, M., Brauner, N., 2010. Combining statistical and physical considerations in deriving targeted QSPRs using very large molecular descriptor databases, in: S. Pierucci, G. Buzzi Ferraris (Eds.), Proceedings of the 20th European Symposium on Computer Aided Process Engineering–ESCAPE 20, June 6–9, Ischia, Naples, Italy, pp. 61–66. Paster, I. Prediction of Properties Using Molecular Descriptors, Ph.D Thesis, BenGurion University of the Negev, 2012. Poling, B.E., Prausnitz, J.M., O’Connel, J.P., 2001. Properties of Gases and Liquids, fifth ed.. McGraw-Hill, New York. Rowley, R.L., Wilding, W.V., Oscarson, J.L., Yang, Y., Zundel, N.A., 2010. DIPPR Data Compilation of Pure Chemical Properties Design Institute for Physical Properties. Brigham Young University Provo Utah. 〈http//dippr.byu.edu〉. Rowley, R.L., 2010. Personal Communication. Roy, K., Chakraborty, P., Mitra, I., Ojha, P.K., Kar, S., Das, R.N., 2013. Some case studies on application of “rm2” metrics for judging quality of quantitative structure–activity relationship predictions: emphasis on scaling of response data. J. Comput. Chem., http://dx.doi.org/10.1002/jcc.23231. (Article first published online: 8 January 2013). Roy, K., Mitra, I., Kar, S., Ojha, P.K., Das, R.N., Kabir, H., 2012. Comparative studies on some metrics for external validation of QSPR models. J. Chem. Inf. Model. 52, 396–408. Shacham, M., Paster, I., Brauner, N., 2013. Property prediction and consistency analysis by a reference series method. AIChE J. 59, 420–428. Shacham, M., Brauner, N., 2011. Analysis and refinement of the training set in predicting a variety of constant pure compound properties by the targeted QSPR method. Chem. Eng. Sci. 66, 2606–2615. Shacham, M., Kahrs, O., Cholakov, G.St., Stateva, R., Marquardt, W., Brauner, N., 2007. The role of the dominant descriptor in targeted quantitative structure property relationships. Chem. Eng. Sci. 62, 6222–6233. Shacham, M., Brauner, N., 2003. The SROV program for data analysis and regression model identification. Comput. Chem. Eng. 27, 701–714. Todeschini, R., Consonni, V., 2000. Handbook of Molecular Descriptors. Wiley-VCH. Wen, X., Qiang, Y., 2002. Group vector space (GVS) method for estimating boiling and melting points of hydrocarbons. J. Chem. Eng. Data. 47, 286–288.