Analysis and refinement of the training set in predicting a variety of constant pure compound properties by the targeted QSPR method

Chemical Engineering Science 66 (2011) 2606–2615 Contents lists available at ScienceDirect Chemical Engineering Science journal homepage: www.elsevi...


Chemical Engineering Science 66 (2011) 2606–2615

Contents lists available at ScienceDirect

Chemical Engineering Science journal homepage: www.elsevier.com/locate/ces

Analysis and refinement of the training set in predicting a variety of constant pure compound properties by the targeted QSPR method

Mordechai Shacham a,*, Neima Brauner b

a Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel
b School of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel

Article info

Article history: Received 2 January 2011; received in revised form 6 March 2011; accepted 9 March 2011; available online 17 March 2011

Keywords: Computation chemistry; Parameter identification; Molecular descriptor; Systems engineering; QSPR; Property prediction

Abstract

The possibility of obtaining reliable predictions of a wide variety of constant properties is examined. To this aim, a modified version of the Targeted QSPR (Brauner et al., 2006) method is applied. In the present study new statistical indicators are introduced, which enable a reliable estimation of the prediction uncertainty for the (unknown) property of the target compound based on the training set data. It is shown that while increasing the number of descriptors in the QSPR enables better representation of the training set data, it may significantly deteriorate the prediction of the target compound property value. If necessary, improved prediction is achievable by using the statistical information to refine the training set, rather than by increasing the number of the descriptors used. It is demonstrated that by proper adjustment of the training set, the great majority of the constant properties can be predicted within the experimental error level. © 2011 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +972 8 6461481; fax: +972 8 6472916. E-mail address: [email protected] (M. Shacham).

0009-2509/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.ces.2011.03.019

1. Introduction

Pure-compound property data are at present available for only a small fraction of the compounds pertaining to such diverse areas as chemistry and chemical engineering, environmental engineering and environmental impact assessment, and hazard and operability analysis. Therefore, methods for reliable prediction of property data are needed. Current methods used to predict physical and thermodynamic properties can be classified into "group contribution" methods (see, for example, Marrero and Gani, 2002; Poling et al., 2001), "asymptotic behavior" correlations (Marano and Holder, 1997) and Quantitative Structure Property Relationships (QSPRs; Dearden, 2003).

Developers of property prediction models usually report the average and maximal prediction errors for the training set (compounds whose structure and properties were used for developing the model) and for the evaluation set (compounds that were used only for testing the model). While this information is useful for comparing different prediction techniques, it usually provides only a crude estimate of the property prediction error for a particular "target" compound.

Kahrs et al. (2008) carried out an evaluation of the Targeted QSPR (TQSPR) method (Brauner et al., 2006) using a database that contained 1630 molecular descriptors for 259 hydrocarbons. Only the prediction of the critical temperature (TC) was tested in that study. The objective of the present study is to develop a system that can be used as a general prediction tool irrespective of the property to be predicted. The technique used should provide a good estimate of the prediction error and suggest ways to improve the prediction accuracy, if the available data enable it. The recently developed Dominant Descriptor version of the targeted QSPR method (TQSPR1, Shacham et al., 2007) is employed in such a system.

To this aim, a new database covering a much larger variety of compounds, which contains physical property data for 1798 compounds, has been established. Included in this database are numerical values and data uncertainties for 34 properties (critical properties, normal melting and boiling temperatures, heat of formation, flammability limits, etc.). All the property data are from the DIPPR database (Rowley et al., 2010). The database contains 3224 molecular descriptors generated by the Dragon software, version 5.5 (DRAGON is copyrighted by TALETE srl, http://www.talete.mi.it) from minimized 3D molecular structures. The 3D molecular structures (obtained from Rowley (2010)) for about 1000 compounds were optimized by Gaussian 03 (Frisch et al., 2004) using B3LYP/6-311+G(3df,2p), a density functional method with a large basis set. Most of the other compounds were optimized using HF/6-31G*, a Hartree–Fock ab initio method with a medium-sized basis set. All the computations in the present study were carried out with a special version of the SROV (MATLAB) program of Shacham and Brauner (2003), which was revised in order to fit the needs of the TQSPR1 algorithm.


In this study the TQSPR1 method is evaluated by predicting a large number of pure compound constant properties for a target compound. The novel aspects introduced include the statistical measures that are used for estimating the expected prediction error and for identifying its source. A new approach for reducing the prediction error by refining the training set, rather than by increasing the number of descriptors/terms in the QSPR, is also presented. In Section 2 the basic concepts of the TQSPR1 method are reviewed. Some indicators that can signal inaccurate predictions and help to analyze the possible causes of the inaccuracy are introduced in Section 3. Then, an example is presented in which a training set selected on the basis of the full database of 3224 molecular descriptors is used to predict 32 constant properties of a target compound (Section 4). The sources of excessive prediction errors for some of the properties are analyzed in Section 5, followed by a description of techniques for training set refinement, which can reduce the prediction errors considerably (Section 6). Finally, some conclusions are drawn.

2. The dominant descriptor targeted QSPR method (TQSPR1)

The first stage of the method involves identification of a similarity group structurally related to the target compound. For identification of the similarity group, a database of molecular descriptors, $x_{ij}$, is required, where $i$ is the number of the compound and $j$ is the number of the descriptor. The similarity between potential predictive compounds and the target compound is measured by the partial correlation coefficient, $r_{ti}$, between the vector of the molecular descriptors of the target compound, $\mathbf{x}_t$, and that of a potential predictive compound, $\mathbf{x}_i$. The partial correlation coefficient is defined as $r_{ti} = \mathbf{x}_t \mathbf{x}_i^T$, where $\mathbf{x}_t$ and $\mathbf{x}_i$ are row vectors, centered (by subtracting the mean) and normalized to unit length (by dividing by the Euclidean norm of the vector). Absolute $r_{ti}$ values close to one ($|r_{ti}| \approx 1$) indicate high correlation between the vectors $\mathbf{x}_t$ and $\mathbf{x}_i$, and thus a high level of similarity between the molecular structures of the target compound and the predictive compound $i$.

The training set is established by selecting the $p$ compounds with the highest $|r_{ti}|$ values for which experimental property values $y_i$ are available. Based on our past experience, in most cases we use similarity groups of 50 compounds and training sets of 10 compounds. The remaining similarity group members are used for replacing some members of the training set whenever the need arises. Examples of cases where training set members need to be replaced will be shown later.

The selected training set is used for the development of a TQSPR1 model for a particular property of the target compound. A linear structure–property relation is assumed of the form:

$$\mathbf{y} = \beta_0 + \beta_1 \boldsymbol{\zeta}_D + \boldsymbol{\varepsilon} \tag{1}$$

In this equation $\mathbf{y}$ is a $p$-dimensional vector of the respective property values of the $p$ components in the training set, $\boldsymbol{\zeta}_D$ is a $p$-dimensional vector of the (dominant) molecular descriptor (to be selected via a stepwise regression algorithm), $\beta_0, \beta_1$ are the corresponding model parameters to be estimated, and $\boldsymbol{\varepsilon}$ is a $p$-dimensional vector of random errors.

To identify the dominant descriptor $\boldsymbol{\zeta}_D$, we examine the partial correlation coefficient between the vector of (target) property values of the $p$ compounds included in the training set (i.e., $\mathbf{y}$) and the vector of (target) descriptor values for these compounds, for all the descriptors available in the database. These correlation coefficients will be further called descriptor–property (D–P) correlation coefficients for the particular property: $r_{Pj} = \mathbf{y}\boldsymbol{\zeta}_j^T$. The dominant descriptor, $\boldsymbol{\zeta}_D$, is the descriptor that has the highest D–P correlation coefficient, $r_{DP} = \max_j (r_{Pj})$. The TQSPR1 so obtained can subsequently be employed for calculating estimated property values for the target compound, and for other compounds in the similarity group that do not have measured data, by the use of:

$$\tilde{y}_t = \beta_0 + \beta_1 \zeta_{Dt} \tag{2}$$

In this equation $\tilde{y}_t$ is the estimated unknown property value of the respective compound and $\zeta_{Dt}$ is the corresponding dominant molecular descriptor value.

The identification of the most adequate similarity group and training set is critical for precise prediction of some of the properties. The similarity group selected depends strongly on the composition of the molecular descriptor database and on the reliability of the individual descriptors. Paster et al. (2009), for example, have shown that many of the 3D descriptors exhibit inconsistent (sometimes even random) behavior and that their numerical values depend very much on the minimization algorithm used to obtain the 3D structure of the molecule. Many of the 2D descriptors are undefined for an atom number lower than a threshold value. Unbalanced descriptor database composition and inconsistency of some of the descriptors may prevent the identification of the most adequate similarity group and training set for a particular target compound. In cases where an inadequate training set adversely affects the prediction accuracy, it is important to detect, based on the training set results, the inadequacy of the TQSPR1 model, and to identify the sources of the inaccuracy.
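The two stages described in this section (similarity-based selection of the training set, then selection of the dominant descriptor and fitting of the one-descriptor model) can be sketched in a few lines of Python. This is only an illustrative sketch, not the SROV program used in the study: the function names, the four candidate compounds and their descriptor and property values are all hypothetical, and ranking descriptors by the absolute value of $r_{Pj}$ is an assumption made here so that negatively correlated descriptors are not excluded.

```python
import math

def unit_centered(v):
    """Center a vector and scale it to unit Euclidean length."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    norm = math.sqrt(sum(a * a for a in centered))
    # A constant vector has zero norm; return zeros so its correlation is 0.
    return [a / norm for a in centered] if norm > 0 else [0.0] * len(v)

def corr(u, v):
    """Correlation coefficient as the dot product of unit, centered vectors."""
    return sum(a * b for a, b in zip(unit_centered(u), unit_centered(v)))

def select_training_set(x_target, candidates, p):
    """Return the names of the p candidates with the highest |r_ti|."""
    ranked = sorted(candidates,
                    key=lambda name: abs(corr(x_target, candidates[name][0])),
                    reverse=True)
    return ranked[:p]

def fit_tqspr1(descriptor_rows, y):
    """Pick the descriptor with the highest |r_Pj| (assumption: absolute value)
    and fit y = b0 + b1 * zeta_D by ordinary least squares."""
    n_desc = len(descriptor_rows[0])
    columns = [[row[j] for row in descriptor_rows] for j in range(n_desc)]
    j_dom = max(range(n_desc), key=lambda j: abs(corr(y, columns[j])))
    z = columns[j_dom]
    z_avg, y_avg = sum(z) / len(z), sum(y) / len(y)
    s_zz = sum((zi - z_avg) ** 2 for zi in z)
    s_zy = sum((zi - z_avg) * (yi - y_avg) for zi, yi in zip(z, y))
    b1 = s_zy / s_zz
    b0 = y_avg - b1 * z_avg
    return j_dom, b0, b1

# Hypothetical data: three descriptors per compound; property known for A-D only.
x_target = [1.0, 2.0, 4.0]
candidates = {
    "A": ([1.0, 2.1, 4.1], 9.2),
    "B": ([2.0, 4.0, 8.0], 17.0),
    "C": ([1.5, 2.5, 5.0], 11.0),
    "D": ([3.0, 1.0, 2.0], 5.0),
}

training = select_training_set(x_target, candidates, p=3)
rows = [candidates[name][0] for name in training]
y = [candidates[name][1] for name in training]
j_dom, b0, b1 = fit_tqspr1(rows, y)
y_target = b0 + b1 * x_target[j_dom]   # Eq. (2) applied to the target compound
print(training, j_dom, round(b0, 6), round(b1, 6), round(y_target, 6))
```

In this toy data set compound D is rejected because its descriptor vector correlates poorly with that of the target, and the third descriptor is selected as dominant because the property values were constructed to be exactly linear in it.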

3. Measures of uncertainty of the predicted property value

There are several measures associated with the training set which can be used for assessing the prediction uncertainty of the target property for the target compound. In cases where reported property values for the target compound are available (from the DIPPR database or other sources), the property predicted by the TQSPR1 method can be compared with the reported value. The absolute difference (in %) between the reported and the predicted values is the prediction uncertainty.

The uncertainty of the property data for the members of the training set provides the attainable lower limit on the prediction uncertainty, as the predicted value cannot be more accurate than the data used for its estimation. The DIPPR database provides for most properties the upper limit of uncertainty (experimental error) $U_i$ (in %). As there may be different $U_i$ values for the $p$ members of the training set, the average uncertainty ($U_{avg}$, %) and maximal uncertainty ($U_{max}$, %) of the particular property are used to represent the precision of the available property data.

As a measure of the level of structural similarity between the training set compounds and the target compound, the Geometric Average Correlation Coefficient (GACC) is used:

$$\mathrm{GACC} = \sqrt{r_{t1}\, r_{tp}} \tag{3}$$

where $r_{t1}$ and $r_{tp}$ are the correlation coefficients of the first (most similar) and the last ($p$th) members of the training set with the target compound.

When comparing the experimental uncertainty with the prediction uncertainty, the range of variation of the property value within the training set, TSRV, should be examined. It is defined by:

$$\mathrm{TSRV}\,(\%) = 100\,\frac{y_{max} - y_{min}}{y_{avg}}; \qquad y_{avg} = \frac{1}{p}\sum_{i=1}^{p} y_i \tag{4}$$

where $y_{avg}$, $y_{max}$ and $y_{min}$ are the average, maximal and minimal property values of the training set members, respectively.

To represent the quality of the fit of the TQSPR for the members of the training set, a Training Set Average Error (TSAE) is used:

$$\mathrm{TSAE}\,(\%) = \frac{100}{p}\sum_{i=1}^{p} \frac{\left|y_i - (\beta_0 + \beta_1 \zeta_{Di})\right|}{y_i} \tag{5}$$
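As a quick numerical check of Eqs. (3)–(5), the three indicators can be evaluated with the correlation coefficients of Table 1 and the liquid molar volume (LVOL) data of Table 2, both presented in Section 4. The sketch below is illustrative only; the variable names are chosen here, and the TSAE is computed from the rounded TQSPR1 predictions listed in Table 2 rather than from a refitted model.

```python
import math

# |r_ti| of the ten training-set members, in rank order (Table 1)
r_ti = [0.977, 0.975, 0.962, 0.957, 0.952, 0.947, 0.946, 0.944, 0.942, 0.942]

# LVOL (m3/kmol) of the same compounds: DIPPR values and TQSPR1 predictions (Table 2)
y = [0.15769, 0.12453, 0.12439, 0.1409, 0.1268,
     0.19042, 0.10781, 0.17439, 0.1252, 0.14192]
y_hat = [0.15794, 0.12535, 0.12535, 0.14164, 0.12535,
         0.19053, 0.10905, 0.17423, 0.12298, 0.14164]

p = len(y)
gacc = math.sqrt(r_ti[0] * r_ti[-1])                     # Eq. (3)
y_avg = sum(y) / p
tsrv = 100.0 * (max(y) - min(y)) / y_avg                 # Eq. (4)
tsae = 100.0 / p * sum(abs(yi - fi) / yi
                       for yi, fi in zip(y, y_hat))      # Eq. (5)

print(f"GACC = {gacc:.3f}, TSRV = {tsrv:.1f} %, TSAE = {tsae:.3f} %")
```

The results agree, to the printed precision, with the values reported for LVOL in Section 4 (GACC = 0.959, TSRV = 58.4%, TSAE = 0.653%).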

Cases where the TSAE is higher than $U_{avg}$ (or $U_{max}$) imply that an excessive prediction uncertainty can be expected also for the target compound. Such cases are further analyzed by examining the values of numerical indicators for outlying property ($y_i$) or descriptor ($\zeta_{Di}$) values.

Outlying (or high leverage) descriptor values can be detected based on the diagonal elements, $h_{ii}$, of the hat matrix, $\mathbf{H}$, which is defined as

$$\mathbf{H} = \mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T \tag{6}$$

In this equation $\mathbf{Z} = [\mathbf{1}\ \boldsymbol{\zeta}_D]$ is a $p \times 2$ matrix, $\mathbf{1}$ is a $p$-dimensional unity vector and $\boldsymbol{\zeta}_D$ is a $p$-dimensional vector of the (dominant) descriptor values. For the 2-parameter TQSPR1 model, the sum of the diagonal elements is 2, and their values are always in the range $0 \le h_{ii} \le 1$. If $h_{ii} > 6/p$ (= 0.6 for a training set of 10 compounds), the $i$th compound is considered an outlier in terms of its descriptor ($\zeta_{Di}$) value (a high leverage point; e.g., Neter et al., 1996, pp. 373–378).

An outlying property ($y_i$) value can be identified by comparing the studentized deleted residual $t_i$ with the $t$ distribution value $t(1 - \alpha/2p;\ \nu)$, where $\alpha$ is the significance level and $\nu$ is the number of degrees of freedom. The studentized deleted residuals can be calculated from:

$$t_i = e_i \left[\frac{\nu}{\mathrm{SSE}\,(1 - h_{ii}) - e_i^2}\right]^{1/2} \tag{7}$$

where $e_i$ is the residual of component $i$ and SSE is the error sum of squares of the TQSPR1 model (for the members of the training set). For example, the studentized deleted residual $t_i$ with a significance level of $\alpha = 0.1$ and $\nu = 8$ degrees of freedom (10 data points of the training set members, 2 parameters of the TQSPR1 model) is considered excessive if $t_i > 3.3554$.
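The two outlier indicators can be illustrated with the LVOL training set of Section 4, using the molecular mass (MW) column of Table 2 as the dominant descriptor. The sketch below makes two simplifications that are assumptions of this example rather than part of the method: the hat-matrix diagonal is evaluated with the closed form that holds for a model with an intercept and a single descriptor, $h_{ii} = 1/p + (\zeta_{Di} - \bar{\zeta}_D)^2 / \sum_k (\zeta_{Dk} - \bar{\zeta}_D)^2$, and the residuals are taken from the rounded predictions listed in Table 2. The critical value 3.3554 is copied from the text rather than computed.

```python
import math

# Training-set data for LVOL, copied from Table 2: descriptor MW,
# DIPPR LVOL (m3/kmol) and TQSPR1-predicted LVOL (m3/kmol).
mw = [132.3, 104.24, 104.24, 118.27, 104.24,
      160.36, 90.21, 146.33, 102.2, 118.27]
y = [0.15769, 0.12453, 0.12439, 0.1409, 0.1268,
     0.19042, 0.10781, 0.17439, 0.1252, 0.14192]
y_hat = [0.15794, 0.12535, 0.12535, 0.14164, 0.12535,
         0.19053, 0.10905, 0.17423, 0.12298, 0.14164]

p = len(mw)
z_avg = sum(mw) / p
s_zz = sum((z - z_avg) ** 2 for z in mw)

# Diagonal of H = Z (Z^T Z)^-1 Z^T for Z = [1 zeta_D] (closed form, Eq. (6))
h = [1.0 / p + (z - z_avg) ** 2 / s_zz for z in mw]

# Studentized deleted residuals, Eq. (7), with nu = p - 2 degrees of freedom
e = [yi - fi for yi, fi in zip(y, y_hat)]
sse = sum(ei * ei for ei in e)
nu = p - 2
t = [ei * math.sqrt(nu / (sse * (1.0 - hi) - ei * ei)) for ei, hi in zip(e, h)]

T_CRIT = 3.3554  # t(1 - alpha/2p; nu) for alpha = 0.1, p = 10, nu = 8 (from the text)
high_leverage = [i + 1 for i, hi in enumerate(h) if hi > 6.0 / p]
outlying_y = [i + 1 for i, ti in enumerate(t) if abs(ti) > T_CRIT]
print("high-leverage compounds:", high_leverage)
print("outlying property values:", outlying_y)
```

For this property both lists come out empty, in line with the absence of excessive $t_i$ and $h_{ii}$ entries for LVOL in Table 4; the largest studentized deleted residual belongs to compound no. 9 (1-hexanol).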

4. Predicting constant properties of n-hexyl mercaptan using the full database

The study reported in this paper was carried out using n-hexyl mercaptan (number of carbon atoms, $n_C = 6$) as the target compound. The target compound belongs to the n-mercaptan homologous series. The DIPPR database contains data for the first 12 members of this series (methyl mercaptan, $n_C = 1$, to n-dodecyl mercaptan, $n_C = 12$).

4.1. The training set and the similarity group of n-hexyl mercaptan

The similarity group and the training set were selected based on the full database of 3224 molecular descriptors. Data regarding the molecular structure of the training set members and the target compound are shown in Table 1. The compounds included contain between 4 and 9 carbon atoms, hydrogen atoms and one sulfur atom. One compound (1-hexanol) contains an oxygen atom instead of the sulfur atom. There are two types of sulfur containing molecules: mercaptans, in which there is an –SH group at the end of the paraffinic chain, and sulfides, in which the –S– atom is inside the chain. The compounds most similar to the target are its two immediate neighbors in the mercaptan homologous series: n-heptyl mercaptan, whose correlation coefficient with the target is $|r_{ti}| = 0.977$, and n-pentyl mercaptan with $|r_{ti}| = 0.975$. For the last (10th) member of the training set, di-n-propyl sulfide, the

correlation coefficient is reduced to $|r_{ti}| = 0.942$. The geometric average correlation coefficient (GACC) of this training set is 0.959. The additional 40 members of the similarity group include a wide variety of compounds, all of them in the range of $n_C = 4$ to 10; most of them contain one or more oxygen atoms (alcohols, ethers, acids), but there are also some additional sulfur, nitrogen and even bromine containing compounds. The $n_C$ distribution for the compounds included in the similarity group and in the training set is shown in Fig. 1. Obviously $n_C$ has a dominant influence on the selection of the members of the similarity group: 80% of the selected compounds are in the range $5 \le n_C \le 7$, whereas members of the n-mercaptan homologous series for which $n_C < 4$ or $n_C > 10$ are not included in the similarity group.

Table 1. Molecular structure and correlation coefficient data for the target compound (n-hexyl mercaptan) and the training set members.

| Comp. no. | Name | Structural formula | No. of C atoms | \|r_ti\| |
|---|---|---|---|---|
| 1 | n-Heptyl mercaptan | CH3(CH2)6SH | 7 | 0.977 |
| 2 | n-Pentyl mercaptan | CH3(CH2)4SH | 5 | 0.975 |
| 3 | Methyl n-butyl sulfide | CH3(CH2)3SCH3 | 5 | 0.962 |
| 4 | Methyl pentyl sulfide | CH3S(CH2)4CH3 | 6 | 0.957 |
| 5 | Ethyl propyl sulfide | CH3CH2S(CH2)2CH3 | 5 | 0.952 |
| 6 | n-Nonyl mercaptan | CH3(CH2)8SH | 9 | 0.947 |
| 7 | n-Butyl mercaptan | CH3(CH2)3SH | 4 | 0.946 |
| 8 | n-Octyl mercaptan | CH3(CH2)7SH | 8 | 0.944 |
| 9 | 1-Hexanol | CH3(CH2)5OH | 6 | 0.942 |
| 10 | Di-n-propyl sulfide | CH3(CH2)2S(CH2)2CH3 | 6 | 0.942 |
| Target | n-Hexyl mercaptan | CH3(CH2)5SH | 6 | – |

Fig. 1. The nC distribution for the compounds included in the similarity group and the training set for n-hexyl mercaptan. (Axes: number of compounds versus number of C atoms; series: similarity group, training set.)

4.2. Predicting liquid molar volume at 298.15 K (LVOL) of n-hexyl mercaptan

The TQSPR1 methodology is demonstrated with respect to the prediction of the liquid molar volume (LVOL) of n-hexyl mercaptan at 298.15 K. For prediction of LVOL for the target compound, property data for the members of the training set are needed. The LVOL values and their uncertainties (from the DIPPR database), for the target compound and the members of the training set, are shown in Table 2. The uncertainties for all but two compounds are <1%; for compounds no. 4 and 5 they are <3%.

Table 2. Training set and target compound data for prediction of LVOL for n-hexyl mercaptan.

| Comp. no. | Name | DIPPR LVOL (m3/kmol) | DIPPR uncertainty (%) | TQSPR1 LVOL (m3/kmol) | TQSPR1 prediction uncertainty (%) | MW (descriptor) |
|---|---|---|---|---|---|---|
| 1 | n-Heptyl mercaptan | 0.15769 | <1 | 0.15794 | 0.15 | 132.3 |
| 2 | n-Pentyl mercaptan | 0.12453 | <1 | 0.12535 | 0.66 | 104.24 |
| 3 | Methyl n-butyl sulfide | 0.12439 | <1 | 0.12535 | 0.77 | 104.24 |
| 4 | Methyl pentyl sulfide | 0.1409 | <3 | 0.14164 | 0.53 | 118.27 |
| 5 | Ethyl propyl sulfide | 0.1268 | <3 | 0.12535 | 1.15 | 104.24 |
| 6 | n-Nonyl mercaptan | 0.19042 | <1 | 0.19053 | 0.06 | 160.36 |
| 7 | n-Butyl mercaptan | 0.10781 | <1 | 0.10905 | 1.15 | 90.21 |
| 8 | n-Octyl mercaptan | 0.17439 | <1 | 0.17423 | 0.09 | 146.33 |
| 9 | 1-Hexanol | 0.1252 | <1 | 0.12298 | 1.78 | 102.2 |
| 10 | Di-n-propyl sulfide | 0.14192 | <1 | 0.14164 | 0.19 | 118.27 |
| Target | n-Hexyl mercaptan | 0.14101 | <1 | 0.14164 | 0.45 | 118.27 |
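The DIPPR LVOL values and the MW descriptor values of the ten training-set compounds in Table 2 are sufficient to refit a one-descriptor linear model of the form of Eq. (1) and to apply it to the target compound. The following sketch uses plain ordinary least squares; it is meant as an illustrative check under that assumption and not as a reproduction of the SROV computation, and the variable names are chosen here.

```python
# MW descriptor values and DIPPR LVOL values (m3/kmol) of training compounds 1-10 (Table 2)
mw = [132.3, 104.24, 104.24, 118.27, 104.24,
      160.36, 90.21, 146.33, 102.2, 118.27]
lvol = [0.15769, 0.12453, 0.12439, 0.1409, 0.1268,
        0.19042, 0.10781, 0.17439, 0.1252, 0.14192]

# Target compound (n-hexyl mercaptan): descriptor value and DIPPR-recommended LVOL
MW_TARGET = 118.27
LVOL_TARGET_DIPPR = 0.14101

p = len(mw)
z_avg = sum(mw) / p
y_avg = sum(lvol) / p
s_zz = sum((z - z_avg) ** 2 for z in mw)
s_yy = sum((y - y_avg) ** 2 for y in lvol)
s_zy = sum((z - z_avg) * (y - y_avg) for z, y in zip(mw, lvol))

b1 = s_zy / s_zz                    # slope of the least-squares line
b0 = y_avg - b1 * z_avg             # intercept
r2 = s_zy ** 2 / (s_zz * s_yy)      # squared correlation coefficient

lvol_pred = b0 + b1 * MW_TARGET     # Eq. (2) for the target compound
err_pct = 100.0 * abs(lvol_pred - LVOL_TARGET_DIPPR) / LVOL_TARGET_DIPPR

print(f"b0 = {b0:.7f}, b1 = {b1:.7f}, R2 = {r2:.5f}")
print(f"predicted LVOL = {lvol_pred:.5f} m3/kmol, deviation = {err_pct:.2f} %")
```

The fitted coefficients, the predicted target value and the deviation from the DIPPR value can be compared directly with the TQSPR1 column of Table 2.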

From among the 3224 descriptors, the one that has the highest correlation with LVOL is the molecular mass (MW). Fig. 2 shows the plot of the LVOL data of the training set members versus MW. The LVOL values are aligned along a straight line, LVOL = 0.0042756 + 0.0011615 MW, with a correlation coefficient of R² = 0.998. Using this TQSPR and the descriptor value MW = 118.27 for the target compound yields LVOL_target = 0.14164 m3/kmol, which differs from the DIPPR recommended value (0.14101 m3/kmol) by 0.45%. Since the uncertainty of most of the DIPPR data is in the 1% range, it is impossible to reduce the prediction uncertainty further.

Table 2 shows the predicted LVOL values and the respective prediction uncertainties for the target compound and the members of the training set. Observe that the prediction uncertainty is slightly higher than the data uncertainty for only two members of the training set: n-butyl mercaptan and 1-hexanol. For 1-hexanol the discrepancy can be explained by the replacement of the sulfur atom by the oxygen atom, whereas for n-butyl mercaptan ($n_C = 4$) the deviation can be explained by its being one of the first members of the homologous series.

The statistical indicators described in Section 3 were calculated in order to assess the prediction accuracy. The low TSAE value (0.653%), as well as the closeness of the dominant descriptor correlation coefficient to the value of 1 ($r_{DP} = 0.999$), indicate that the expected prediction uncertainty is low (as indeed it is). No statistically significant outlying property ($y$) values were detected.

Based on this study it is possible to get the impression that MW is always the best descriptor to represent LVOL. However, if a training set of 20 compounds (instead of 10) is used for deriving the TQSPR1 for the LVOL of the same target compound, the Sp descriptor (sum of atomic polarizabilities, scaled on Carbon atom; Todeschini and Consonni, 2000) is identified as the dominant descriptor. This TQSPR1 model yields a prediction uncertainty of 0.2%, while using MW instead of Sp with the same training set increases the prediction uncertainty to 1.2%.

Fig. 2. Plot of LVOL of n-hexyl mercaptan and its training set versus the molecular mass (MW). (Axes: liquid molar volume (m3/kmol) versus descriptor MW; linear fit of the training set: y = 0.0011615x + 0.0042756, R² = 0.9981632.)

4.3. Predicting all constant properties available in DIPPR for n-hexyl mercaptan

The 34 constant properties available in the DIPPR database are listed in Table 3. Included in the table are the symbols used for abbreviating the property names (as defined by DIPPR), short descriptions of the properties and their units. Eight of the properties are categorized as "defined", meaning that they can be calculated from other properties and/or from the molecular structure. This provides more flexibility in their prediction, as their value can be calculated from their definition, or they can be predicted directly using the TQSPR1 technique.

The TQSPR1 method was used to predict 32 out of the 34 properties shown in Table 3 for the target compound, n-hexyl mercaptan. The results of the prediction are shown in Table 4. Molecular weight was excluded as it is included both in the property and the descriptor databases. The triple point temperature was excluded because for all the compounds involved it has the same numerical value as the normal melting point. The following information is presented for each property predicted: GACC (Eq. (3)), $U_{avg}$, $U_{max}$, TSRV (Eq. (4)), $r_{DP}$, the parameters ($\beta_0$, $\beta_1$) and the descriptor used in the TQSPR1 model, TSAE (Eq. (5)), excessive $t_i$ (Eq. (7)) and $h_{ii}$ (Eq. (6)) values, the value of the property as reported by DIPPR, the TQSPR1 prediction for the property value, and the percent absolute difference between the DIPPR and the predicted values (prediction uncertainty).

The same training set (shown in Table 1) was used for all the properties, except for AIT, DC and PAR. Accordingly, GACC = 0.959 for all the properties except these three. Replacement of one compound causes the reduction of the GACC value to 0.956 (for AIT), while replacement of seven compounds (in the case of PAR) yields GACC = 0.936. Note that at this stage the replacement of compounds in the training set was carried out because of the lack of property data (and not due to similarity issues). In the case of PAR, for example, data are available only for compounds Nos. 4, 5 and 9 in Table 1. Consequently, a training set of 10 compounds was obtained by adding the 1-alcohols and n-acids from the similarity group to the training set. For several of the properties (ACEN, DC, DM, FLTL, FLTU, FP, and ZC) DIPPR does not provide uncertainty estimates for all the

compounds involved. In such cases only $U_{max}$ is presented, and this value is considered to be an estimate for the lower limit of the prediction uncertainty.

It is important to point out that the DIPPR database includes both experimental and predicted property values. For some properties the data are experimental for both the target and the training set compounds (e.g., NBP). For other properties most of the data are experimental (MP, LVOL), while for others most of the data (e.g., TC, PC, VC) or all of the data (e.g., FLTL, FLTU) are predicted. In this work our objective was to show that the TQSPR1 model can represent well the behavior of the various properties for similar compounds. Therefore, we did not differentiate between experimental and predicted DIPPR data and assumed that we can rely on the uncertainty estimates for assessing the reliability of the data.

The ultimate test of the success of the TQSPR1 method in predicting a property correctly is the comparison of the prediction uncertainty with $U_{avg}$ (or $U_{max}$, when $U_{avg}$ is not available). In Fig. 3 the prediction uncertainty is plotted versus TSAE for 28 properties. The properties which are not shown are PAR, for which no value is available in DIPPR, and GFOR, GSTD and TPP, for which both the TSAE and the prediction uncertainty are too large for the scale used in the figure. For 24 of the properties TSAE < 3%, and all of these properties satisfy the requirement of the prediction uncertainty being smaller than the data uncertainty. Even for FP, for which the prediction uncertainty is 5.21%, the maximal data uncertainty is $U_{max}$ = 10%, thus the prediction is of acceptable accuracy. However, there are several properties for which the prediction uncertainty is considerably higher than the data uncertainty: MP, GSTD, HFOR and HSTD. The potential causes of excessive prediction errors and their detection are discussed in the next section.

Table 3. Constant properties available in DIPPR database.

| No. | Symbol | Property description | Units |
|---|---|---|---|
| 1 | ACEN | Acentric factor | – |
| 2 | AIT | Auto ignition temperature | K |
| 3 | DC | Dielectric constant | – |
| 4 | DM | Dipole moment | C m |
| 5 | ENT | Absolute entropy of ideal gas at 298.15 K and 100,000 Pa | J/(kmol K) |
| 6 | FLTL | Lower flammability limit temperature | K |
| 7 | FLTU | Upper flammability limit temperature | K |
| 8 | FLVL | Lower flammability limit | vol. % in air |
| 9 | FLVU | Upper flammability limit | vol. % in air |
| 10 | FP | Flash point | K |
| 11 | GFOR | Gibbs energy of formation of ideal gas at 298.15 K and 100,000 Pa | J/kmol |
| 12 | GSTD | Gibbs energy of formation in standard state at 298.15 K and 100,000 Pa | J/kmol |
| 13 | HCOM | Net enthalpy of combustion standard state (298.15 K) | J/kmol |
| 14 | HFOR | Enthalpy of formation of ideal gas at 298.15 K and 100,000 Pa | J/kmol |
| 15 | HFUS | Enthalpy of fusion at melting point | J/kmol |
| 16 | HSTD | Enthalpy of formation in standard state at 298.15 K and 100,000 Pa | J/kmol |
| 17 | HSUB | Heat of sublimation at the triple point | J/kmol |
| 18 | LVOL | Liquid molar volume at 298.15 K | m3/kmol |
| 19 | MP | Melting point (1 atm) | K |
| 20 | MW | Molecular weight | kg/kmol |
| 21 | NBP | Normal boiling point (1 atm) | K |
| 22 | PAR | Parachor | – |
| 23 | PC | Critical pressure | Pa |
| 24 | RG | Radius of gyration | m |
| 25 | RI | Refractive index at 298.15 K | – |
| 26 | SOLP | Solubility parameter at 298.15 K | (J/m3)^(1/2) |
| 27 | SSTD | Absolute entropy in standard state at 298.15 K and 100,000 Pa | J/(kmol K) |
| 28 | TC | Critical temperature | K |
| 29 | TPP | Triple point pressure | Pa |
| 30 | TPT | Triple point temperature | K |
| 31 | VC | Critical volume | m3/kmol |
| 32 | VDWA | van der Waals area | m2/kmol |
| 33 | VDWV | van der Waals reduced volume | m3/kmol |
| 34 | ZC | Critical compressibility factor | – |

5. Analysis of the sources of excessive prediction errors and their detection

The variables that can be used as diagnostic tools are $r_{DP}$, TSAE, excessive values of $t_i$ (> 3.3554) and excessive values of $h_{ii}$ (> 0.6). Cases where TSAE > $U_{avg}$ imply that the prediction uncertainty for the target compound will also be too high. We have found that when $t_i$ is only slightly larger than the maximal value (as in the case of FLVL and PC, Table 4), the prediction uncertainty is not considerably deteriorated. In the cases of HFOR and TPP, the high $t_i$ values are associated with high $h_{ii}$ values, and their effects cannot be separated. Analysis of cases involving inadequate TQSPR1 models follows.

5.1. Model inadequacy is characterized by $h_{ii} = 1$

There are several properties in Table 4 for which the maximal value of the hat matrix diagonal is $h_{ii} = 1$, indicating extremely high leverage of one of the compounds included in the training set. Such a case will be analyzed with respect to the prediction of GSTD for n-hexyl mercaptan. The values of this property for the training set and for the target compound, plotted versus the dominant descriptor Mv (mean atomic van der Waals volume, scaled on Carbon atom), are shown in Fig. 4. Observe that this descriptor has the same value for all the sulfur containing compounds (0.54) and a different value for 1-hexanol (0.51). The 1-hexanol is also an outlier in terms of its property value (its GSTD value is −1.5020E+08 J/kmol, while the range of the GSTD values for the sulfur containing compounds is 9.4655E+06 to 3.1730E+07 J/kmol). The TQSPR1 in this case is actually a line connecting the property value of the 1-hexanol with the point representing the average property value of the other (sulfur

Table 4 Prediction of the constant properties of n-hexyl mercaptan with the TQSPR1 method. Training set

TQSPR1 model

Property value

Prediction

No.

Symbol

GACC

Uavg (%)

Umax (%)

TSRV (%)

qDP

b0

b1

DD

TSAE

Excess ti

Excess hii

DIPPR

Predicted

Uncertainty %

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 21 22 23 24 25 26 27 28 29 31 32 33 34

ACEN AIT DC DM ENT FLTL FLTU FLVL FLVU FP GFOR GSTD HCOM HFOR HFUS HSTD HSUB LVOL MP NBP PAR PC RG RI SOLP SSTD TC TPP VC VDWA VDWV ZC

0.959 0.956 0.953 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.936 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959 0.959

– 25 – – 2.32 – – 22 26 – 4.6 4.4 1.24 2.2 3.6 1.72 11.5 1.4 0.96 0.76 4.11 8.1 3 0.64 5 2.32 3.12 42.5 19.8 4 1.5 –

25 25 100 10 3 25 25 25 50 10 25 25 3 3 25 3 25 3 3 1 10 10 3 3 10 5 5 100 25 5 3 25

72.8 12.9 174 10.1 43.2 26.5 25.6 65.7 60.3 24.7 1197.6 6964.2 72.9 115.2 129 129.9 48.8 58.4 49.6 28.6 76.2 52.6 46.3 2.6 28 50.85 17.98 426.5 63.2 60.6 64.1 12.8

0.978 0.85 0.987 0.833 0.999 0.991 0.995 0.996 0.999 0.99 0.976 0.9889 0.9997 0.983 0.987 0.987 0.9814 0.9991 0.974 0.997 0.997 0.997 0.9947 0.92 0.988 0.997 0.991 0.8 0.997 0.9999 0.9999 0.936

0.20808 427.3872 4.42 4.80E 30 173,032 180.9915 136.4434 2.419 14.5143 185.6166 2.96E þ09 3,036,655,405 8.69E þ07 1,136,961,654 32,958 1,300,085,682 15,328,247 0.0042717 158.53 111.41 20.2528 51,145,233.5 2.43E 10 0.886 25,248.5 54,974 494.395 0.025964 0.033132 2.671E þ07 0.0036556 0.28325

0.093585 644.7616 8.61 4.82E 31 24,824 10.0382 13.275 0.79407 4.2991 9.6553 5.54E þ09 5.66E þ09 6.12E þ09 1.18E þ09 8.03Eþ 07 1.31E þ09 4.23E þ06 1.16E-03 5.75E þ02 1.67E þ02 2.29E þ01 1.36E þ07 2.36E 12 1.0394 12,474 7625 510.6456 0.26718 0.02948 1.900E þ08 0.0058375 0.01234

RDF020u R1eþ nO Mor12e Sv Ss Rte ATS3p ATS4e Ss Mv Mv PHI X0Av Mor23v X0av Ss MW Mor29m ESpm01r Sp BEHP1 H3D Mv SEigZ AMR Mor29v RDF100p Sp X0sol Sp Mor08m

4.7 1.702 7.9 1.51 0.365 0.861 0.55 1.2 0.577 0.966 37.17 60.55 0.48246 7.53 7.34 5.396 2.96 0.653 3.39 0.509 1.6357 0.968 1.225 0.233 1.054 1.029 0.672 2147 1.1365 0.25 0.377 1.19

– – – – – – – t10 ¼ 3.53 – – – – – t9 ¼3.69 – – – – – – – t7 ¼4.17 – – – – – t2 ¼11 – – – –

– – h77 ¼ 1 – – – – – – – h99 ¼ 1 h99 ¼ 1 – h99 ¼ 0.68 h66 ¼ 0.64 h99 ¼ 0.68 – – – – h*ii ¼ 0.84 – – h99 ¼ 1 h99 ¼ 1 – – h11 ¼ 0.73 – – – –

0.3681 520 4.436 5.10E 30 454,600 307 351 1 8.4 293.15 2.759E þ07 1.431E þ07 4.176Eþ 09 1.292Eþ 08 1.801Eþ 07 1.757Eþ 08 6.68E þ07 0.141 1.926E þ02 425.81 – 3,080,000 4.25E-10 1.4473 17,450 343,210 623 0.013096 0.412 1.12E þ09 0.07963 0.245

0.3756 513.79 4.42 5.26E 30 452,803 308.7 351.15 1.04 8.47 308.4 3.220E þ07 1.959E þ07 4.167E þ09 1.486E þ08 2.116E þ07 1.961E þ08 6.92E þ07 0.1416 2.115E þ02 428.14 317.92 3,058,107 4.15E-10 1.4473 17,452 340,640 624.61 0.02596 0.416 1.12E þ09 0.079427 0.248

2.04 1.195 0.36 3.07 0.395 0.546 0.043 4.45 0.88 5.21 16.7 36.9 0.21 15 17.5 11.6 3.52 0.45 9.78 0.55 – 0.71 2.4288 1.50E 04 0.0127 0.749 0.26 98 0.92 0.09 0.25 1.238

M. Shacham, N. Brauner / Chemical Engineering Science 66 (2011) 2606–2615

Property


Fig. 3. Plot of prediction uncertainty versus TSAE for 28 properties (obtained using the “basic” training set). [Axes: prediction uncertainty (%) versus TSAE (%).]

Fig. 4. Plot of GSTD of n-hexyl mercaptan and its training set versus the descriptor Mv. [Axes: Gibbs energy of formation (J/kmol) versus descriptor Mv. Series: training set, target, linear fit to the training set, y = 5.6597E+09x - 3.0367E+09, R2 = 9.7798E-01.]

Fig. 6. Plot of TPP of n-hexyl mercaptan and its training set versus nC. [Axes: triple point pressure (Pa), logarithmic scale, versus number of C atoms. Series: training set, target.]

containing) compounds. Hence, in practice, the resulting TQSPR1 can be considered “orthogonal” to the properties of the sulfur-containing compounds. The existence of an outlier is reflected in the very large (absurd) TSRV value (TSRV = 6964%). Removing 1-hexanol from the training set reduces the TSRV to 141%. The outlying GSTD value of 1-hexanol prevents the detection of a descriptor that adequately represents the variation of the GSTD values of the sulfur-containing compounds. This in turn causes the excessive TSAE value for the training set and the high prediction error for the target compound.

There are four additional properties, DC, GFOR, RI and SOLP, for which hii = 1 is associated with 1-hexanol. Just as in the case of GSTD, in all these cases the TQSPR1 derived using the selected descriptor is “orthogonal” to the sulfur-containing compounds. For GFOR, the use of an inappropriate descriptor in the TQSPR1 model leads to a high TSAE value and a high prediction uncertainty. For DC, RI and SOLP, the prediction uncertainties are less than 1%. However, the reason for the apparently successful prediction is that, for these properties, the average property value of the sulfur-containing compounds is very close to the target compound property value. The low TSAE values for RI and SOLP can further be attributed to the unusually small range of variation of the property within the training set when 1-hexanol is excluded (TSRV = 0.92% for RI and TSRV = 3.95% for SOLP). The above results demonstrate that TQSPR1 models associated with hii = 1 are actually “orthogonal” to the relevant data, which renders the TSAE indicator very unreliable for estimating the prediction uncertainty.

5.2. Model inadequacy is characterized by hii > 0.6

Such a situation is demonstrated for HFOR of n-hexyl mercaptan, which is plotted in Fig. 5 versus the descriptor X0Av (average valence connectivity index chi-0). As in the previous case, 1-hexanol is actually an outlier in terms of the property value. Its HFOR value is -3.162E+08 J/kmol, while for the sulfur-containing compounds the range is -1.908E+08 J/kmol ≤ HFOR ≤ -8.780E+07 J/kmol. The existence of an outlier prevents the identification of the descriptor that is collinear with HFOR of the sulfur-containing compounds. This in turn leads to a high TSAE value and a high prediction uncertainty.

Fig. 5. Plot of HFOR of n-hexyl mercaptan and its training set versus the descriptor X0Av. [Axes: heat of formation (J/kmol) versus descriptor X0Av. Series: training set, target, linear fit to the training set, y = 1.1766E+09x - 1.1370E+09, R2 = 9.6565E-01.]

There are four additional properties, HSTD, HFUS, PAR and TPP, for which hii > 0.6 is associated with 1-hexanol or with a different compound. The case of HSTD is the same as that of HFOR. In the case of HFUS there is an additional source of inaccuracy besides 1-hexanol being an outlier; this case is discussed in the next section. For the parachor (PAR), the selected descriptor (Sp, the sum of atomic polarizabilities scaled on the carbon atom) has a high leverage value for 1-hexanol; however, this compound is not an outlier in terms of the PAR value. Consequently, the TSAE value is considerably smaller than the uncertainty of the PAR data, and it can be assumed that the predicted value is acceptable (no PAR data are available for the target compound).

The triple point pressure data for the training set and the target compound are plotted versus nC on a semi-logarithmic scale in Fig. 6. Observe that the TPP values spread over four orders of magnitude without any distinguishable trend. Several indicators (see Table 4) show the high level of uncertainty and inconsistency of this data set. Uavg = 42.5% and Umax = 100%, meaning, for example, that the TPP of one of the training set members (methyl pentyl sulfide) may take any value between 0 and 0.0017 Pa. The correlation coefficient of the dominant descriptor (rDP = 0.8) is much lower than for any other property, TSAE takes the absurd value of 2150%, and there are indications of an outlying TPP value and a high-leverage point. Based on these extreme values we conclude that the noise level in these data prevents their representation by a TQSPR1 model.

5.3. Physical considerations need to be used to identify the model inadequacy

For the melting point there are no excessive ti or hii values, yet the TSAE and the prediction uncertainty are considerably higher than Uavg and even Umax. Thus, the statistical indicators are unable to detect the reason for the model inadequacy. Fig. 7 shows a plot of MP versus the descriptor Mor29m (3D-MoRSE signal 29, weighted by atomic masses) for the training set and the target compound. The values are practically aligned along two straight lines. A similar phenomenon was observed, for example, by Marano and Holder (1997) for the n-paraffin and n-olefin homologous series. They explained that compounds with odd and even carbon numbers melt from different crystalline phases for nC ≤ 20, resulting in the unusual behavior of the MP when plotted versus nC. The difference between the odd and even nC populations also exists for HFUS and probably for TPT and TPP. However, the TPT values provided by DIPPR are the same as the MP values, while for the TPP the noise in the data prevents the detection of the two curves for odd and even nC compounds.

Adding more descriptors to a QSPR model is a commonly used technique for attempting to reduce the prediction uncertainty. Table 5 shows three TQSPR models, containing 1, 2 and 3 descriptors, for the prediction of the MP of n-hexyl mercaptan. The addition of descriptors enables better representation of the training set (TSAE is reduced from 3.39% for a single-descriptor TQSPR to 0.72% for a 3-descriptor TQSPR). However, adding more descriptors causes the prediction uncertainty to increase from 9.78% (single descriptor) to 12.55% (3 descriptors).
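
The leverage values hii and the residual-based ti values used as indicators in Sections 5.1 and 5.2 can be illustrated with a minimal Python sketch for a single-descriptor linear model. It assumes the textbook definitions for simple linear regression (hii = 1/n + (xi - mean(x))^2 / Sxx, and internally studentized residuals); Eqs. (6) and (7) of the paper are not reproduced in this section and may differ in detail. The function name and the data are hypothetical, chosen so that one compound lies far from the others in descriptor space, as 1-hexanol does in the cases discussed above.

```python
import math


def linear_fit_diagnostics(x, y):
    """Least-squares fit of y = b0 + b1*x.

    Returns (b0, b1, leverages, studentized_residuals).
    Textbook formulas; assumed, not taken from the paper's Eqs. (6)-(7).
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Residual variance with n - 2 degrees of freedom (two fitted parameters).
    s2 = sum(e * e for e in residuals) / (n - 2)
    leverages = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    studentized = [
        e / math.sqrt(s2 * (1.0 - h)) if h < 1.0 and s2 > 0.0 else float("nan")
        for e, h in zip(residuals, leverages)
    ]
    return b0, b1, leverages, studentized


# Hypothetical descriptor and property values: five similar compounds
# and one compound far away in descriptor space.
descriptor = [0.50, 0.51, 0.52, 0.53, 0.54, 0.90]
prop = [10.0, 10.4, 10.9, 11.6, 12.1, 3.0]

b0, b1, h, t = linear_fit_diagnostics(descriptor, prop)

# The 0.6 limit is the leverage threshold quoted in the text.
high_leverage = [i for i, hi in enumerate(h) if hi > 0.6]
print("leverages:", [round(hi, 3) for hi in h])
print("high-leverage compounds (index):", high_leverage)
```

For a model with two parameters the leverages always sum to 2, so a single value close to 1 means that one compound alone determines the slope of the fitted line.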

6. Refining the training set for improving the prediction

Fig. 7. Plot of MP of n-hexyl mercaptan and its training set versus the descriptor Mor29m. [Axes: melting point (K) versus descriptor Mor29m. Series: training set, target, linear fit to the training set, y = -575.38x + 158.53, R2 = 0.9478.]

Table 5
Prediction of MP for n-hexyl mercaptan with various TQSPR models.

No. of descriptors   TQSPR model                                                       TSAE (%)   Prediction uncertainty (%)
1                    MP = 158.5311 - 575.3786 Mor29m                                   3.39       9.78
2                    MP = 182.497 - 494.3558 Mor29m - 473.8941 P2s                     1.50       11.47
3                    MP = 162.7925 - 381.9012 Mor29m - 587.4024 P2s + 227.8096 R5v     0.72       12.55
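
To make the three models of Table 5 concrete, the following Python sketch evaluates them side by side. The coefficients are those listed in Table 5. The descriptor values used in the usage lines are hypothetical placeholders, since the descriptor values of the target compound are not tabulated here, and the function names are illustrative only.

```python
def mp_one_descriptor(mor29m):
    """Table 5, one-descriptor model: melting point in K."""
    return 158.5311 - 575.3786 * mor29m


def mp_two_descriptors(mor29m, p2s):
    """Table 5, two-descriptor model: melting point in K."""
    return 182.497 - 494.3558 * mor29m - 473.8941 * p2s


def mp_three_descriptors(mor29m, p2s, r5v):
    """Table 5, three-descriptor model: melting point in K."""
    return 162.7925 - 381.9012 * mor29m - 587.4024 * p2s + 227.8096 * r5v


# Hypothetical descriptor values, for illustration only.
mor29m, p2s, r5v = -0.10, 0.05, 0.02

predictions = {
    1: mp_one_descriptor(mor29m),
    2: mp_two_descriptors(mor29m, p2s),
    3: mp_three_descriptors(mor29m, p2s, r5v),
}
for n_descriptors, mp in predictions.items():
    print(f"{n_descriptors} descriptor(s): MP = {mp:.2f} K")
```

Evaluating the models for the same compound in this way makes it easy to see how far the predictions drift apart as descriptors are added, which is the effect summarized in the last column of Table 5.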

It was noted in the previous section that the use of the full database of descriptors, which includes “noisy” 3D descriptors, resulted in masking some important details of the structure and in a dominance of nC in the selection of the members of the similarity group. For some of the properties, the TQSPR1 algorithm can still identify a (dominant) descriptor that is highly correlated with the property values of the selected training set and yields an acceptable prediction for the property value of the target compound. However, for some properties it resulted in poor prediction.

To avoid the effect of “noisy” descriptors on the similarity group selection, a robust subset of the descriptor database that still reflects the diversity in the chemical structures should be used. As noted in Section 2, 2D and 3D descriptors may exhibit inconsistent behavior; consequently, only 0D and 1D descriptors were included in the robust subset. The selected subset comprises 202 descriptors from the “constitutional” and “functional group count” categories. The “constitutional” descriptor group includes, for example, MW, the number of various atoms, the number of double and triple bonds, the number of rings, etc. Using this subset of descriptors for identifying the similarity group of n-hexyl mercaptan, the training set shown in Table 6 is obtained. The compounds included contain between 4 and 12 carbon atoms, hydrogen atoms and one sulfur atom. Eight of the ten compounds belong to the n-mercaptan homologous series,

Table 6
Molecular structure and correlation coefficient data for the target compound (n-hexyl mercaptan) and the training set members (only “constitutional” and “functional group count” descriptors used).

Comp. No.   Name                    Structural formula     No. of C atoms   |rti|
1           n-Heptyl mercaptan      CH3(CH2)6SH            7                0.9985
2           n-Pentyl mercaptan      CH3(CH2)4SH            5                0.9978
3           n-Octyl mercaptan       CH3(CH2)7SH            8                0.9947
4           n-Nonyl mercaptan       CH3(CH2)8SH            9                0.9892
5           n-Butyl mercaptan       CH3(CH2)3SH            4                0.9892
6           2-Pentanethiol          CH3CH(SH)CH2CH2CH3     5                0.9863
7           n-Decyl mercaptan       CH3(CH2)9SH            10               0.9823
8           Tert-nonyl mercaptan    CH3(CH2)5C(CH3)2SH     9                0.9796
9           Undecyl mercaptan       CH3(CH2)10SH           11               0.9744
10          n-Dodecyl mercaptan     CH3(CH2)11SH           12               0.9653
Target      n-Hexyl mercaptan       CH3(CH2)5SH            6                –
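
The ranking behind Table 6 can be sketched in Python as follows, assuming (as in the targeted QSPR method of Brauner et al., 2006) that rti is the correlation coefficient between the descriptor vector of the target and that of a candidate compound. The descriptor vectors below are short hypothetical stand-ins, not entries of the 202-descriptor subset, and the helper names are illustrative.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return numerator / denominator


def rank_by_similarity(target, candidates, size):
    """Return the `size` candidates with the largest |r_ti|, best first."""
    scored = [(abs(pearson(target, vec)), name) for name, vec in candidates.items()]
    scored.sort(reverse=True)
    return [(name, r) for r, name in scored[:size]]


# Hypothetical descriptor vectors (e.g. atom counts), for illustration only.
target = [6.0, 14.0, 1.0, 0.0]
candidates = {
    "compound_A": [7.0, 16.0, 1.0, 0.0],
    "compound_B": [2.0, 6.0, 2.0, 1.0],
    "compound_C": [6.0, 6.0, 0.0, 1.0],
}

ranking = rank_by_similarity(target, candidates, size=2)
for name, r in ranking:
    print(f"{name}: |r_ti| = {r:.4f}")
```

Restricting the descriptor vectors to a robust subset, as done above with the 0D and 1D descriptors, changes only the inputs of such a ranking, not the ranking procedure itself.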


Table 7
Prediction of some constant properties of n-hexyl mercaptan using refined training sets.

Property                   Training set   TQSPR1 model                                                                  Prediction
No. in Table 5   Symbol    Size           b0           b1           DD         TSAE    Excess ti    Excess hii   Value        Uncertainty (%)
3                DC        6              7.0854       6.3355       Ms         0.62    t10 = 4.14   –            4.445        0.209
11               GFOR      10             5.6601E+06   8.3312E+06   nCs        1.466   t6 = 4.67    –            2.766E+07    0.27
12               GSTD      9              2.8259E+07   1.8043E+06   XMOD       1.506   –            –            1.508E+00    5.4
14               HFOR      10             4.7991E+07   2.0480E+07   F02[C-C]   0.799   t6 = 7.3     –            1.299E+08    0.55
15               HFUS      6              7.8898E+06   8.7557E+06   TI2        5.06    –            –            1.737E+07    3.55
16               HSTD      10             4.7418E+07   1.4339E+07   RHyDp      0.5     –            –            1.738E+08    1.08
19               MP        6              20.4046      73.0258      IDE        0.94    t6 = 6.24    –            195.52       1.5
25               RI        10             1.362        0.054984     Hnar       0.012   –            –            1.4476       0.020
26               SOLP      10             12,934.60    9719.36      MSD        1.054   t6 = 3.42    –            17,512       0.357

and there are two compounds with slightly different structures, 2-pentanethiol and tert-nonyl mercaptan. The two immediate neighbors of the target (n-heptyl mercaptan with |rti| = 0.9985; n-pentyl mercaptan with |rti| = 0.9978) are again identified as the “most similar” to the target. The last member of the training set is n-dodecyl mercaptan (nC = 12 with |rti| = 0.9653), and the geometric average correlation coefficient (GACC) of this training set is 0.982. Comparison of this training set with that of Table 1 shows that it is more balanced in terms of nC, the chemical composition and the topology of the molecules. This new training set was used to obtain new TQSPR1 models for all of the properties, including those that were adequately predicted using the training set based on the full descriptor database. In particular, the new training set was used as the basis for improving the prediction of properties for which high TSAE and/or excessive ti or hii values indicated inadequacy of the TQSPR1 models. The results of the predictions of nine such properties based on refined training sets are summarized in Table 7. The following information is presented for each of the predicted properties: the number of the property (see Tables 3 and 4), the number of compounds included in the training set, the descriptor used in the TQSPR1 model and the corresponding (b0, b1) parameters, the TSAE (Eq. (5)), excessive values of ti (Eq. (7)) and hii (Eq. (6)), the TQSPR1 prediction for the property value, and the percent absolute difference between the DIPPR and the predicted values (prediction uncertainty). For five properties, GFOR, HFOR, HSTD, RI and SOLP (see Table 3), the full training set of Table 6 was used. For all these properties the use of the new training set reduced hii below 0.6, and the TSAE values were also reduced to (or remained) below the data uncertainty level. The prediction uncertainties for these five properties are below Uavg.
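
The two error measures reported in Table 7 can be sketched in Python. The prediction uncertainty follows the definition given above (percent absolute difference between the DIPPR and the predicted values). For TSAE, the sketch assumes a mean absolute percentage error over the training set; Eq. (5) is not reproduced in this section, so this form is an assumption. All numbers are hypothetical.

```python
def prediction_uncertainty(reference, predicted):
    """Percent absolute difference between a reference and a predicted value."""
    return abs(reference - predicted) / abs(reference) * 100.0


def tsae(observed, fitted):
    """Training set average error, assumed here to be the mean absolute
    percentage error over the training set members."""
    errors = [abs(o - f) / abs(o) * 100.0 for o, f in zip(observed, fitted)]
    return sum(errors) / len(errors)


# Hypothetical training set property values and the corresponding model fits.
observed = [100.0, 200.0, 400.0]
fitted = [101.0, 198.0, 404.0]

training_error = tsae(observed, fitted)
target_error = prediction_uncertainty(250.0, 245.0)
print(f"TSAE = {training_error:.2f}%, prediction uncertainty = {target_error:.2f}%")
```

Comparing both numbers with the data uncertainty (Uavg) is the acceptance check applied to each property in the discussion that follows.
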
For GFOR, HFOR and SOLP there are ti values that exceed the maximal recommended value (3.3554). However, since the outlying property value (yi) does not increase the TSAE above the acceptable limit, there is no need to remove such an outlier. For GSTD, an attempt to use all 10 compounds of the training set yielded a TSAE value of 5.7 (higher than Uavg), and 2-pentanethiol was flagged as an outlier with ti = 5.49. Removing 2-pentanethiol from the training set yielded the results shown in Table 7. The lack of DC property data required the inclusion of the compounds sec-butyl mercaptan, thioglycolic acid, isopropyl mercaptan and 1,2-ethanedithiol as the last four compounds in the training set. Their presence prevented the identification of an appropriate descriptor (hii associated with thioglycolic acid, for example, took the value of 0.92). Removal of these four compounds gave the satisfactory results shown in Table 7.

According to the analysis carried out in the previous section, the prediction of MP and HFUS requires inclusion in the training set of the first 10 compounds from the similarity group with even nC. The first four compounds in the selected training set are members of the n-mercaptan homologous series; the two compounds that follow are branched mercaptans. The last four compounds contain sulfur atoms but differ in many respects from the target compound: three of them contain oxygen atoms, two of them contain two sulfur atoms (instead of one), three of them contain only two carbon atoms, etc. Consequently, their |rti| values are relatively low, and it is preferable to remove them from the training set. Using the first six compounds of the training set results in TQSPR1 models with TSAE very close to the respective Uavg values (see Table 7), and the same is true for the prediction uncertainties. Note that for the case of MP, for which high-precision data are available, these results can be improved further. In this case, a training set of the first four compounds can be used, yielding the TQSPR1 model MP = 411.1093 - 339.3565 SIC2, with TSAE = 0.118% and a prediction uncertainty of 0.2% (for a description of SIC2 see Todeschini and Consonni, 2000). Such an improvement cannot be achieved for HFUS, as the data uncertainty for HFUS of n-dodecyl mercaptan is 25%.

In Fig. 8 the prediction uncertainty is plotted versus TSAE for 30 properties when refined training sets are used (TPP and PAR are not included, for reasons mentioned earlier). Observe that all except five properties are within the TSAE < 2% and prediction uncertainty < 2% window. For the remaining five properties the prediction uncertainties are higher, but still below the data uncertainty level.

Fig. 8. Plot of prediction uncertainty versus TSAE for 30 properties (obtained using refined training sets). [Axes: prediction uncertainty (%) versus TSAE (%).]

Another important point to note is that when the full descriptor database is used for establishing the similarity group, the dominant descriptor selected for the TQSPR1 model is often a 3D descriptor. In Table 4, for example, there are ten 3D descriptors (RDF020u, R1e+, Mor12e, Rte and similar; for explanations regarding these descriptors see Todeschini and Consonni, 2000). With the refined training sets, on the other hand, there was no need to include 3D descriptors in the TQSPR1 models. This is an additional advantage of the use of the refined training sets, as 3D descriptors are often noisy and may take different values when different algorithms are used for minimizing the 3D structure (Paster et al., 2009).
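
The even-nC restriction used earlier in this section for MP and HFUS amounts to a simple filter over the similarity-ranked candidates. The Python sketch below applies it to the ten compounds of Table 6, in their Table 6 order. The similarity group actually used for these two properties contains further compounds that are not listed in this section, so the result is only illustrative, and the function name is hypothetical.

```python
# Training set members of Table 6 in order of decreasing |r_ti|: (name, nC).
ranked = [
    ("n-heptyl mercaptan", 7),
    ("n-pentyl mercaptan", 5),
    ("n-octyl mercaptan", 8),
    ("n-nonyl mercaptan", 9),
    ("n-butyl mercaptan", 4),
    ("2-pentanethiol", 5),
    ("n-decyl mercaptan", 10),
    ("tert-nonyl mercaptan", 9),
    ("undecyl mercaptan", 11),
    ("n-dodecyl mercaptan", 12),
]


def same_parity_subset(ranked_compounds, target_nc, size):
    """Keep the first `size` compounds whose carbon number has the same
    parity (odd or even) as that of the target, preserving the ranking."""
    subset = [
        (name, nc)
        for name, nc in ranked_compounds
        if nc % 2 == target_nc % 2
    ]
    return subset[:size]


# n-Hexyl mercaptan has nC = 6, so only even-nC compounds are retained.
even_set = same_parity_subset(ranked, target_nc=6, size=4)
for name, nc in even_set:
    print(f"{name} (nC = {nc})")
```

With the Table 6 compounds this keeps n-octyl, n-butyl, n-decyl and n-dodecyl mercaptan, all members of the n-mercaptan homologous series.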

7. Discussion and conclusions

This study included the prediction of 32 constant properties available in the DIPPR database for the compound n-hexyl mercaptan, using the targeted QSPR method with a single-descriptor linear model. It has been shown that for this compound, using the full data set of molecular descriptors, the “basic” training set identified by the TQSPR algorithm yields satisfactory predictions for 22 of the properties considered (i.e., the prediction uncertainty is smaller than the uncertainty of the property data). The training set average error (TSAE) and the additional statistical indicators used have proven to be reliable indicators of the accuracy of the prediction. For properties where the prediction error was too high and/or the statistical indicators signaled outliers in the training set data, the causes of the model inadequacy were identified. In all of the cases (except the triple point pressure, TPP) the cause of the problem could be clearly identified as an inadequacy of the training set to represent the variation of the particular property. In the case of TPP, the noise in the data prevents clear identification of the causes of the model inadequacy. It was further demonstrated that the TSAE of the training set can be reduced by increasing the number of descriptors in the QSPR model. Yet the prediction error for the target compound may still increase. To reduce the prediction error, the similarity group for the target compound should be refined. To this aim, the similarity group selection should be based on a robust subset of molecular descriptors. Such a subset should avoid “noisy” descriptors that mask the structural similarity between the target and other potentially similar compounds included in the database. In this study, a subset of 202 descriptors (0D and 1D) from the “constitutional” and “functional group count” categories was used at the stage of similarity group identification.
The resulting refined training set enabled prediction, within experimental uncertainty limits, of the properties that were already adequately predicted and of 6 additional properties. In the cases of the Gibbs energy of formation (GSTD) and the dielectric constant (DC), compounds with outlying property values had to be excluded from the training set, and in the case of the melting point and the heat of fusion, a training set of six compounds similar to the target with an even number of carbon atoms had to be used in order to meet experimental uncertainty limits in the predicted values.

Several principles were demonstrated in this study. These include the use of a robust, representative subset of descriptors for identification of the similarity group and the training set, improving the model and the prediction accuracy by refinement of the training set instead of adding more descriptors to the TQSPR, and using statistical indicators associated with the training set to obtain a reliable estimate of the prediction uncertainty. These render the TQSPR method more accurate, robust and reliable.

In this paper the advantages of the proposed method were demonstrated in detail for only one compound, n-hexyl mercaptan. We have tested the new technique by applying it to a variety of target compounds. For several groups of compounds the results were similar, while for some other groups additional descriptors had to be included in the similarity group selection subset in order to obtain accurate predictions. A detailed description of the results of these studies is outside the scope of the present paper.

References

Brauner, N., Stateva, R.P., Cholakov, G.St., Shacham, M., 2006. A structurally “targeted” QSPR method for property prediction. Industrial & Engineering Chemistry Research 45, 8430–8437.
Dearden, J.C., 2003. Quantitative structure–property relationships for prediction of boiling point, vapor pressure, and melting point. Environmental Toxicology and Chemistry 22, 1696–1709.
Frisch, M.J., Trucks, G.W., Schlegel, H.B., et al., 2004. Gaussian03, Revision A.6. Gaussian, Inc., Pittsburgh, PA.
Kahrs, O., Brauner, N., Cholakov, G.St., Stateva, R.P., Marquardt, W., Shacham, M., 2008. Analysis and refinement of the targeted QSPR method. Computers & Chemical Engineering 32, 1397–1410.
Marano, J.J., Holder, G.D., 1997. General equations for correlating the thermophysical properties of n-paraffins, n-olefins and other homologous series. 2. Asymptotic behavior correlations for PVT properties. Industrial & Engineering Chemistry Research 36, 1895.
Marrero, J., Gani, R., 2002. Group-contribution-based estimation of octanol/water partition coefficient and aqueous solubility. Industrial & Engineering Chemistry Research 41, 6623–6633.
Neter, J., Wasserman, W., Kutner, M.H., 1996. Applied Linear Statistical Models, third ed. Irwin, Burr Ridge.
Paster, I., Shacham, M., Brauner, N., 2009. Investigation of the relationships between molecular structure, molecular descriptors and physical properties. Industrial & Engineering Chemistry Research 48, 9723–9734.
Poling, B.E., Prausnitz, J.M., O'Connell, J.P., 2001. Properties of Gases and Liquids, fifth ed. McGraw-Hill, New York.
Rowley, R.L., Wilding, W.V., Oscarson, J.L., Yang, Y., Zundel, N.A., 2010. DIPPR Data Compilation of Pure Chemical Properties. Design Institute for Physical Properties, Brigham Young University, Provo, Utah. http://dippr.byu.edu.
Rowley, R.L., 2010. Personal communication.
Shacham, M., Kahrs, O., Cholakov, G.St., Stateva, R., Marquardt, W., Brauner, N., 2007. The role of the dominant descriptor in targeted quantitative structure property relationships. Chemical Engineering Science 62, 6222–6233.
Shacham, M., Brauner, N., 2003. The SROV program for data analysis and regression model identification. Computers and Chemical Engineering 27, 701–714.
Todeschini, R., Consonni, V., 2000. Handbook of Molecular Descriptors. Wiley-VCH.
