Analysis and refinement of the targeted QSPR method

Analysis and refinement of the targeted QSPR method

Available online at www.sciencedirect.com Computers and Chemical Engineering 32 (2008) 1397–1410 Analysis and refinement of the targeted QSPR method...

949KB Sizes 0 Downloads 21 Views

Available online at www.sciencedirect.com

Computers and Chemical Engineering 32 (2008) 1397–1410

Analysis and refinement of the targeted QSPR method Olaf Kahrs a , Neima Brauner b , Georgi St. Cholakov c , Roumiana P. Stateva d , Wolfgang Marquardt a , Mordechai Shacham e,∗ a

Process Systems Engineering, RWTH Aachen University, Aachen 52064, Germany b School of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel c Department of Organic Synthesis and Fuels, University of Chemical Technology and Metallurgy, Sofia 1756, Bulgaria d Institute of Chemical Engineering, Bulgarian Academy of Sciences, Sofia 1113, Bulgaria e Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel Received 7 December 2006; received in revised form 18 May 2007; accepted 13 June 2007 Available online 19 June 2007

Abstract The targeted quantitative structure–property relationship (TQSPR) method of Brauner et al. [Brauner, N., Stateva, R. P., Cholakov, G. St., & Shacham, M. (2006). A structurally “targeted” QSPR method for property prediction. Industrial & Engineering Chemistry Research, 45, 8430–8437] is analyzed in this study with respect to its various algorithmic steps. It is shown that accurate QSPRs for predicting the critical temperature can be developed using a training set of 10 compounds that exhibit the highest level of similarity with the target compound (the compound for which a property has to be predicted). Alternative methods to compute the similarity of compounds and to assemble the training set are compared. The potential of a principal component analysis of the molecular descriptor data to improve the TQSPR performance is assessed and a new stopping criterion for QSPR refinement based on the discrepancy principle is introduced. It is shown that collinearity between molecular descriptors and the increase of the number of compounds and descriptors in the database do not have adverse effects on the performance of the TQSPR method. © 2007 Elsevier Ltd. All rights reserved. Keywords: Computational chemistry; Property prediction; QSPR; PCA; Cluster analysis; Stepwise regression

1. Introduction Pure component properties are essential for chemical engineering computations in steady-state and dynamic simulation, process and product design, environmental impact assessment, and hazard and operability analysis. The number of compounds presently used by industry, or being of its immediate interest, is estimated at several hundred thousands, while the chemical structures, which are theoretically possible and may eventually interest industry in the future, are at least several tens of millions (Horvath, 1992). The number of compounds for which measured data of pure component properties are available is at most several thousands and for many properties it is much less. The discrepancy between the number of compounds of interest and of pure compound data available makes it obvious that prediction of properties is a must. Furthermore, it is increasingly needed to



Corresponding author. Tel.: +972 8 6461481; fax: +972 8 6472916. E-mail address: [email protected] (M. Shacham).

0098-1354/$ – see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.compchemeng.2007.06.006

synthesize molecules with prespecified values of properties and this requires prediction of properties for compounds that do not yet exist. A wide range of property prediction methods for pure components is described by Poling, Prausnitz, and O’Connel (2001). Similarly to any prediction method used in chemical engineering, also molecular structure–property prediction methods can be classified as entirely theory-based (mechanical), semiempirical, or empirical (see, for example, Gani & Constantinou, 1996). Within this general classification, the structure–property prediction methods are a special and relatively new challenge (Prausnitz & Tavares, 2004), because they depend on the adequate description of molecular structures, and the variation of their structural elements cannot typically be done uniformly. The only methods that can be strictly classified as non-empirical are the ab initio quantum chemical methods, which describe chemical structure “from the beginning”. Coupled with molecular dynamics and modern mechanistic methods, accessing properties of molecular clusters, they have the highest potential, but are still limited to small molecules, require huge compu-

1398

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

tational time and sophisticated software (e.g., Klamt & Eckert, 2000). The much more widely used methods are typically empirical, though the molecular structures may be characterized by non-empirical descriptors (e.g., topological indices). They can be classified broadly in two major groups: group/bond contribution methods (following Franklin, 1949) and “most significant common features” methods (Wakeham, Cholakov, & Stateva, 2002). In the group/bond contribution approach, the physical property of a compound is calculated from the number of occurrences of predefined fragments (groups) in the molecule. Since only the number of occurrences of a group and not the placement of the groups is considered, different isomers cannot be distinguished. Higher order groups account for some isomeric molecular structures and, thereby, can increase the prediction accuracy (Marrero & Gani, 2001). Methods based on topological or geometric information provide a higher level of molecular representation than group contribution methods and, thus, can achieve higher accuracy. The connectivity index (Bicerano, 1993; Kier & Hall, 1986) specifies the topology of the molecule, i.e., the spatial arrangement of the atoms in the molecule. Geometric information is used to some extent by methods based on conjugation (Constantinou & Gani, 1994). In the context of the “most significant common features” methods, the different molecular descriptors in the database may be computed by group/bond contributions, simulated molecular mechanics, quantum chemical methods, the topology of the molecules, and other techniques (Cholakov, Wakeham, & Stateva, 1999). The term quantitative structure–property relationship (QSPR) is widely used for such empirical models. Compared to group/bond contribution approaches, QSPRs may employ more informative molecular descriptors and, thus, may achieve higher accuracy. However, users usually have less confidence in QSPR predictions, especially if for each compound and physical property an individual QSPR is identified. Furthermore, extrapolation of QSPRs beyond the training data set is risky. Therefore, methods to assess the QSPR prediction accuracy are of major importance. QSPR development is facilitated by programs such as the commercial programs CODESSA PRO (Katritzky, Karelson, & Petrukhin, 2005) and Dragon (Todeschini, Consonni, Mauri, & Pavan, 2006) or the freely available program SROV (Shacham & Brauner, 2003). A recent review of the field can be found in (Dearden, 2003). The relationship between molecular structure and physical property can be more accurately described if the QSPR model is tailored to a group of similar compounds (Basak, Gute, Mills, & Hawkins, 2003). Recently, Brauner, Stateva, Cholakov, and Shacham (2006) proposed the “targeted” QSPR (TQSPR) approach, which carries on the idea of tailoring QSPRs, by identifying an individual QSPR for each target compound (the compound for which a physical property has to be predicted). The TQSPR algorithm consists of four major subsequent steps (see Fig. 1): (i) compilation of a database of molecular descriptors and physical property data, (ii) selection of a set of compounds that exhibit a high level of similarity with the target compound (i.e., the training set), (iii) identification of the QSPR

Fig. 1. Major algorithmic steps and investigated algorithmic modifications of the TQSPR identification algorithm.

model, including the selection of molecular descriptors as independent variables and the estimation of the associated regression parameters, and (iv) assessment of the identified TQSPR model. Several other QSPR identification algorithms, which employ similar algorithmic steps as those used in the TQSPR method have been proposed in the literature (e.g., Basak et al., 2003; Katritzky et al., 2005; Lucic & Trinajstic, 1999). Which algorithmic steps are most critical for an accurate QSPR prediction has not been analyzed systematically yet, but would help to improve the QSPR methodology and to focus research on those parts of a QSPR algorithm that exhibit the largest potential for improvement. Due to the large number of possible QSPR algorithm designs, we focus in this paper on the analysis of the TQSPR algorithm of Brauner et al. (2006) and modifications thereof. The results are to some extent transferable to other QSPR algorithms. A new molecular descriptor database is used in this study. It includes 1630 descriptors computed by version 5.3 of the Dragon program (Todeschini et al., 2006) for the 260 hydrocarbons listed in Wakeham et al. (2002). One compound, n-hexacontane, was removed from the original database because descriptor calculation with the Dragon version used did not allow so many atoms in the molecule. The DIPPR (Rowley, Wilding, Oscarson, Yang, & Zundel, 2006) database was used as a source of experimental data. In parts of the study, where measured critical temperature data are required, only 108 compounds (for which such data are available) out of the 259 compounds in the database were used. It should be noted that QSPRs have been employed in the literature for various more challenging prediction tasks, such as melting point prediction (Dearden, 2003) or chemical toxicity (e.g., Basak et al., 2003). In this paper, we had to choose a relatively simple prediction task in order to focus on the algorithmic analysis and refinement. Our implementation of the targeted QSPR method requires specification of error (noise) levels for the physical property and molecular descriptor data. Such data are also essential for assessing the property prediction errors. The DIPPR database provides reliabilities of the experimental critical temperature data, which are considered in this paper as estimates of the standard deviation of the experimental error. Errors in the molecular descriptors

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

may stem from different sources, including (i) inaccuracies and simplifications of the algorithms used to minimize the molecular energy function and to calculate the molecular descriptors, and (ii) rounding errors of the descriptor data. Since no methods to estimate the first error source are available, only the rounding errors are considered in this work, as described by Shacham and Brauner (2003). The paper is organized as follows. Section 2 briefly reviews the TQSPR method of Brauner et al. (2006). The algorithmic steps of this method are described in detail in Section 3, including several new algorithmic alternatives and modifications. Comparisons of the various alternatives are carried out in Section 4 while Section 5 discusses the results and defines directions for future research. 2. The targeted QSPR method In the first step of the TQSPR method, a database of molecular descriptors xij and physical property yi has to be assembled, where i is the number of the compound and j is the number of the descriptor. Besides the possibility to obtain molecular descriptor and physical property data from different sources, the user may also use various scaling methods and non-linear transformations of the database values. Furthermore, physical properties different from the one to be predicted may be added to the database as pseudo-molecular descriptors. After the database has been assembled, a similarity group that contains compounds structurally related to the target compound (the compound for which a physical property has to be predicted) is identified. Structural similarity is calculated from the molecular descriptors of the target compound (xtj ) and those of the predictive compounds. The training set is established by selecting the first p compounds of the similarity group for which experimental values yi of the desired property are available. Since the real physical relationship between molecular structure and physical property is approximated by the TQSPR method only locally (i.e., on the training set), a linear structure–property relationship is assumed, y = β0 + β1 ζ 1 + β2 ζ 2 + · · · + βm ζ m ,

(1)

where y is a p-dimensional vector of the respective property values (p is the number of compounds included in the training set), ζ 1 , ζ 2 , . . ., ζ m the p-dimensional vectors of predictive molecular descriptors, and β0 , β1 , β2 , . . ., βm are the corresponding model parameters to be estimated. The QSPR thus identified can be subsequently used for predicting the desired physical property of the target compound, i.e., using the equation yt = β0 + β1 ζt1 + β2 ζt2 + · · · + βm ζtm ,

(2)

where yt is the predicted unknown property value of the target compound and ζ t1 , ζ t2 , . . ., ζ tm are its molecular descriptor values. The selection of a suitable set of predictive molecular descriptors is a challenging problem, since the number of candidates is in the order of O(103 ), which prohibits the determination of the best of all possible sets of predictive molecular descriptors

1399

by a full search procedure. As in a typical most “significant common features” method (Wakeham et al., 2002), a stepwise regression program is used to determine which molecular descriptors should be included in the QSPR to best represent the measured property data of the training set and to calculate the QSPR parameter values. The stepwise regression program SROV (Shacham & Brauner, 2003) is used, which selects in each step one molecular descriptor that reduces the prediction error most strongly. Two criteria for measuring the signal-to-noise ratio in the jth candidate descriptor (TNRj ) and in the partial correlation of the jth candidate descriptor with the prediction residual (CNRj ) ensure that the selected descriptors contain valuable information. A detailed description of the TNR and CNR criteria and further algorithmic details can be found in Shacham and Brauner (2003). Modification of the first three algorithmic steps of the TQSPR algorithm are presented in Section 3 and subsequently analyzed in Section 4. 3. Algorithmic steps and parameters of the TQSPR method 3.1. Similarity group identification Four subsequent algorithmic steps for the similarity group identification are considered: 1. Downsizing of the database by removing descriptors with low information content, such as constant, near-constant, or pairwise linearly correlated descriptors. Furthermore, a principal component analysis can be used to downsize the database. 2. Scaling of the descriptor values, as required for computing the similarity between different compounds. 3. Calculation of the similarity between the predictive compounds and the target compound. The predictive compounds, which enter the similarity group first and for which property data are available, form the training set. 4. A second reduction of the database size, since some descriptors may have become non-informative (e.g., constant) in the training set. These algorithmic steps are described in more detail in the following. 3.1.1. Reduction of database size An existing practice of traditional QSPR identification is to use all available descriptors during the early model development stages and to subsequently exclude descriptors that are pairwise correlated above a certain threshold value from the final model (Turner, Costello, & Jurs, 1998). The overall computational performance of the TQSPR method may, however, be improved if low-informative descriptors are removed from the database before the TQSPR is identified. Furthermore, the numerical values of the similarity measures presented in Section 3.1.4 change when low-informative descriptors are removed and, thus, may lead to different training sets and TQSPR models. An excessive number of low-informative descriptors might even hide the sim-

1400

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

ilarity between different compounds and deteriorate the TQSPR model. In this study, we investigate the options provided by the Dragon program (Todeschini et al., 2006) to remove a priori all constant, near-constant, and pairwise linearly correlated descriptors from the database: • Constant descriptors within standard deviation lower than 0.0001 or all values missing are removed. • Near-constant descriptors with only one value different from the remaining ones are removed. If only one compound in the database contains a specific group, one or more of the descriptors, relevant to this group, will have near constant values (i.e., zeros for all, except the compound which has the group). • Pairwise linearly correlated descriptors with a correlation coefficient equal or greater than a selected threshold value are removed. For each pair of correlated descriptors, the one showing the highest pairwise correlation with the other descriptors is excluded. 3.1.2. Principal component analysis for identification of the similarity group and the QSPR regression model In many data mining applications, a principal component analysis (PCA) (Hastie, Tibshirani, & Friedman, 2001) is used to reduce the dimensionality of the dataset. In relation with QSPR development, a PCA can be used to lower the number of independent variables required for identification of an accurate QSPR model and to improve the computational performance of the identification algorithm. Each of the 259 compounds considered in this paper can be represented as one data point in the 1630-dimensional molecular descriptor space. The PCA transforms this dataset to a new coordinate system whose axis are called principal components (PC). The coordinates of the ith compound in the PC space are denoted by χi . The first PC points in the direction of strongest variation in the dataset, the second one in an orthogonal direction of second largest variation, and so on. Each PC is a linear combination of all non-constant molecular descriptors (plus a constant term) and can, thus, be interpreted as a pseudo-molecular descriptor. For illustration, Fig. 2 depicts the dataset in the space of

Fig. 2. Two hundred and fifty-nine compounds in the space of the first two principal components PC1 and PC2 . The markers classify the compounds with respect to their molecular structure.

Fig. 3. Eigenvalues of principal components for original and perturbed datasets of 259 compounds.

the first two PCs. Compounds belonging to the same structural groups (or homologous series) are located in a small region or along a curve through the PC space. In the two-dimensional PC space these regions still show considerable overlap, whereas in a higher dimensional PC space the structural groups are more easily distinguishable (see Section 4.4 and Fig. 6). Usually, the first few PCs with large eigenvalues (large variation in the data) are considered to be sufficient to represent the characteristics of the dataset well. From theoretical considerations we know that the 259 distinct data points in the 1630-dimensional molecular descriptor space can at most span a 258-dimensional subspace (consider a 258-dimensional simplex with 259 edge points). Since many molecular descriptors are constant or linearly correlated with others, the dataset can even be represented in a PC space of much lower dimension. The number of PCs that represent the dataset sufficiently well can be estimated from the given molecular descriptor data and the error estimates thereof. The round-off errors of the molecular descriptor data result in random fluctuations in the set of principal components {PC}i>i* with small eigenvalues. The number of significant PCs i* is determined by comparison of the eigenvalues of the original dataset with the eigenvalues of a perturbed dataset. The perturbed dataset is constructed by adding uniformly distributed random noise to the original dataset that mimics the respective round-off errors. All integer values are perturbed by values in the order of the machine accuracy (i.e., ∼10−16 ). The number of significant PCs is defined as the number of PCs with eigenvalues larger than the difference of the eigenvalues of the original and the perturbed datasets. The influence of the actual realization of the random noise is negligible, as shown by Fig. 3, which depicts the eigenvalues for 10 different noise realizations. The full dataset can be represented by 32 significant PCs. The number of significant PCs strongly reduces if subsets of similar compounds are considered, as shown in Table 1. For example, the homologous series of 34 n-alkanes (saturated unbranched acyclic) can be described by just 6 PCs, demonstrating their high level of similarity within the series.

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410 Table 1 Number of compounds in the full database and in subsets of similar compounds, and the associated number of significant principal components (PCs) Compounds

Number of compounds

Number of significant PCs

All hydrocarbons Saturated unbranched acyclic Saturated branched acyclic Unsaturated unbranched acyclic Unsaturated branched acyclic Saturated unbranched monocyclic Substituted, saturated monocyclic Unsaturated monocyclic (unbranched and substituted) Monocyclic aromatic (unbranched and substituted) Hybrid hydrocarbons (having features of more than one of the above groups)

259 34 51 30

32 6 9 6

25 6

8 1

35

6

5

2

63

15

10

5

In the targeted QSPR approach the similarity group as well as the regression model can be identified either in the original molecular descriptor space or in the space of principal components. In this paper, we investigate the following two PCA strategies: PCA strategy 1: The similarity group is identified in the PC space because the similarity measures may suffer from the high dimensionality of the molecular descriptor database. The reduced dimensionality may uncover molecular similarity. The regression models are still identified on the full set of molecular descriptors since we look for a small subset of molecular descriptors as independent variables of the TQSPR model. PCA strategy 2: The similarity group and the regression model are identified in the PC space. For the similarity group identification, the PCA is applied to the full database, whereas for the regression model identification, the PCA is applied only to the training set including the target compound to further reduce the dimensionality. The CNR/TNR stopping criterion of the SROV regression program requires an error estimate of the molecular descriptors in the PC coordinate system. The error in the PCs χi,err is calculated from the molecular descriptor error xi,err by χi,err = χ(xi + xi,err ) − χ(xi ),

(3)

where the function χ(·) returns the coordinates of the molecular descriptor data in the PC space. 3.1.3. Descriptor scaling After the database has been downsized, the similarity group is identified. Since the descriptor values obtained from the Dragon program differ strongly in magnitude (from about 10−2 to 106 ), they are scaled before the similarity between different com-

1401

pounds is calculated. Furthermore, the calculated similarities between compounds change in value if the descriptors are scaled and may thus lead to different similarity groups. Therefore, we compare three different scaling approaches to investigate how strongly descriptor scaling influences the TQSPR performance: (i) linear scaling of the descriptor values to a maximum absolute value of 1 (‘maxabs’ scaling), (ii) affine scaling of the descriptor values to zero mean and a standard deviation of one (‘meanstd’ scaling), and (iii) no scaling at all. 3.1.4. Similarity measures Brauner et al. (2006) measure the similarity between the predictive compounds and the target compound by the partial correlation coefficient rti . The partial correlation coefficient is defined as rti = x¯ t x¯ Ti , where x¯ t and x¯ i are the molecular descriptor vectors of the target and the ith predictive compound. These row vectors are centered and normalized to unit length. The similarity can be measured alternatively by the Euclidean distance dti = (xt − xi )(xt − xi )T between the molecular descriptor vectors as suggested by Basak et al. (2003). |rti | ≈ 1 or dti ≈ 0 indicate a high similarity between the molecular structures of the target compound and the ith predictive compound. 3.1.5. Assignment of predictive compounds to the similarity group The similarity group is identified iteratively starting with the target compound as its first member. Three different methods for adding the predictive compounds to the similarity group, all related to cluster algorithms (Hastie et al., 2001), are compared in this paper. In contrast to typical cluster analysis, only one cluster (the similarity group) is constructed. Furthermore, the order, in which the compounds enter the cluster, is of key importance for selecting the training set. The training set comprises the p first compounds with available measurement data that enter the similarity group. The three assignment methods are described for the Euclidean distance similarity measure, because it is more intuitive than the partial correlation similarity measure. If the latter one is used, it replaces the Euclidean distance dti by 1 − |rti | in the following equations: Assignment method A1 : The predictive compound iadd = mini∈Γ / (dti ) that is closest to the target compound t is added to the similarity group Γ . Assignment method A2 : The predictive compound i / (dci ) that is closest add = mini∈Γ to the center of gravity xc = j ∈ Γ xj of the already selected compounds j is added to the similarity group. Assignment method A3 : The predictive compound iadd = mini∈Γ,j / ∈ Γ (dji ) that is closest to one of the already selected compounds j is added to the similarity group. The similarity groups identified may differ significantly. The assignment method A3 can result in long stretched similar-

1402

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

ity groups (it might, for example, select all compounds of the homologous series of saturated unbranched acyclics depicted in Fig. 2), whereas A1 and A2 favour compact similarity groups.

Table 2 Algorithmic settings used for TQSPR identification Algorithmic step

Alternatives

Removed descriptors

(None); constant; constant and near-constant; constant, near-constant and pairwise linearly correlated (None); PCA strategy 1; PCA strategy 2 (1); [0.1, . . ., 10] (maxi (|xij |)) = 1; mean(xj ) = 0, std(xj ) = 1; no scaling (Partial correlation); Euclidean distance (A1 ), A2 , A3 (10); [5, . . ., 30] 1% (CNR/TNR/DP); CNR/TNR

3.2. Stopping criteria for addition of descriptors to the QSPR Principal component analysis

When a descriptor is added to the QSPR model by the stepwise regression procedure, the partial correlation of the jth candidate descriptor with the prediction residual (CNRj ) and the signal-to-noise ratio of the jth candidate descriptor (TNRj ) typically decrease. Therefore, Brauner et al. (2006) stop descriptor selection when no candidate descriptors are left with CNRj > 1 and TNRj > 1. However, our preliminary results have shown that the experimental error data reported in published databases is not sufficiently accurate for selecting the total number of descriptors by the CNR/TNR criterion. If the reported errors are too high, the CNRj and TNRj undervalue the information content of the descriptors so that less than the necessary number of descriptors will be included in the model yielding an excessive prediction error. In contrast, if the experimental errors are too low, an excessive number of descriptors is included in the model, which may also increase the prediction error. Therefore, as an additional stopping criterion, we propose to use the discrepancy principle (DP) (Hansen, 1997), which stops model refinement if the estimated variance of the prediction error on the training set s2 =

1 p−m−1 ×

p 

(˜yi − (β0 + β1 ζ1,i + β2 ζ2,i + · · · + βm ζm,i ))2

(4)

i=1 2 . For an optimally falls below a prespecified threshold value sgoal refined model, the prediction error should be approximately as large as the measurement error present in the property data. The measured property values of the compounds in the training set 2 are denoted by y˜ i . The threshold value sgoal is calculated from the relative reliability εi of the property values reported in the DIPPR database by

1 (εi y˜ i )2 . p p

2 = sgoal

(5)

i=1

For the critical temperature values of the compounds considered, the DIPPR database states relative reliabilities of 0.2–3%. In simulation studies we found that the low reliability values of 0.2% lead to an excessive number of descriptors in the TQSPR models without significantly lowering the prediction error. Therefore, we use a uniform relative error estimate of εi = 1% for all compounds in the database. In the combined CNR/TNR/DP stopping criterion, descriptors are added to the QSPR as long as the estimated variance 2 ) and the is larger than the prespecified variance goal (s2 > sgoal CNRj as well as the TNRj are larger than one for at least one of the descriptors, which is not yet included in the model.

Error multiplier Scaling of descriptors Similarity measure Assignment method Size of training set Error goal for discrepancy principle Stopping criterion

Default algorithmic settings are enclosed in parentheses.

4. Analysis of the algorithmic steps of the TQSPR method The modifications presented in Section 3 are added to the implementation of the SROV algorithm to test whether they improve the TQSPR method. Only one modification is applied at a time to ease analysis. The default algorithmic steps are shown inside parenthesis in Table 2. For each algorithmic setting and for each of the 108 compounds in the database with experimental critical temperature data, an individual TQSPR is identified. The property prediction for each target compound is compared with its experimental value, which was not used for TQSPR identification. The TQSPR predictions of the default algorithm are shown for individual target compounds in Section 4.1. The subsequent sections present averaged TQSPR results in order to condense the large amount of information and, thereby, to uncover the general influence of the algorithmic modifications on the TQSPR method. 4.1. TQSPR results obtained by the default algorithm This section presents the TQSPR identification results for the default algorithmic setting listed in Table 2. Table 3 shows the TQSPR predictions of the critical temperature Tc for 20 compounds with lowest and 20 compounds with highest prediction errors. The critical temperature of 63 of the 108 compounds is predicted with less than 1% error, and for 20 compounds with an error between 1 and 2%. For the remaining 25 target compounds the prediction errors are between 2 and 10%, while the mean prediction errors on the respective training sets are still below 1%, as preset by the discrepancy principle. The target compound may exhibit a higher prediction error than the mean for the compounds in the training set, because it is not included in the latter. The set of compounds with low prediction errors includes, as expected, many compounds well represented in the database. These predictions are of similar accuracy or even better than those reported by Owczarek and Blazej (2003), and Yan, Dong,

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

1403

Table 3 Selected TQSPR predictions of the critical temperature Tc using the default algorithmic settings Compound

Tc (published) (K)

Tc (predicted) (K)

Prediction error (%)

Number of descriptors

With lowest errors n-Decane 1-Nonene n-Undecane 2,2,3,4-Tetramethylpentane Propylbenzene n-Dodecane n-Pentadecane Cycloheptane 2,4-Dimethylpentane 1-Decene n-Heptadecane 2,2,3,3-Tetramethylpentane n-Nonane p-Cymene n-Octane n-Tetradecane 1-Pentene n-Tridecane n-Pentane Ethylbenzene

617.7 593.1 639.0 592.6 638.4 658.0 708.0 604.2 519.8 616.6 736.0 607.5 594.6 652.0 568.7 693.0 464.8 675.0 469.7 617.2

617.9 593.3 638.8 592.9 638.1 658.3 707.6 603.8 520.1 616.2 735.5 607.0 595.1 651.4 569.2 692.2 465.4 675.9 469.0 618.0

0.02 0.03 0.03 0.04 0.05 0.05 0.06 0.06 0.06 0.06 0.07 0.07 0.08 0.09 0.09 0.12 0.14 0.14 0.14 0.14

1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 2 1 1 2

With highest errors 3,3,5-Trimethylheptane 2,3-Dimethylbutane Cyclopentane m-Terphenyl 3-Methyl-1-butene trans-1,4-Dimethylcyclohexane Indane p-Terphenyl Cyclopropane Naphthalene Cyclooctane Phenanthrene 1,2,3,4-Tetrahydronaphthalene 2-Methyl-2-butene 2,2-Dimethylpropane 2,2,3,3-Tetramethylhexane a-Pinene Ethane Cyclopentene n-Butane

609.5 500.0 511.7 883.0 452.7 587.7 684.9 908.0 398.0 748.4 647.2 869.0 720.0 470.0 433.8 623.0 644.0 305.3 475.0 425.1

591.3 516.4 494.0 913.8 468.7 609.0 709.7 875.0 383.4 720.6 623.0 905.1 689.9 448.4 412.9 657.7 683.5 325.3 510.0 382.9

2.98 3.27 3.46 3.49 3.54 3.62 3.63 3.64 3.66 3.72 3.74 4.15 4.18 4.60 4.82 5.56 6.14 6.56 7.36 9.93

1 3 3 2 2 2 2 2 3 2 3 2 2 2 1 2 1 2 2 2

Twenty compounds with lowest and 20 compounds with highest prediction errors are shown.

and Hong (2003) who compared critical temperature predictions of different group contribution methods. In many cases, the TQSPRs contain just one descriptor. The set of compounds with high prediction errors includes compounds with structures that are very different from the other compounds (e.g., a-pinene) and the first members of the homologous series (e.g., ethane) that require extrapolation of the TQSPR model. For the compounds ethane and n-butane, the TQSPR predictions are clearly inferior to those reported in Owczarek and Blazej (2003). However, the critical temperature of some other compounds, such as 2,2-dimethylpropane and 2,2,3,3-tetramethylhexane, are predicted with similar accuracy as by Fedors method (Fedors, 1982), which exhibits prediction errors of 3.78 and 4.42%, respectively. In general, the number of descriptors that are selected for the TQSPR model tends to be larger for compounds with high prediction error, indicating that the training set contains larger variation of structures. The

different reasons for high prediction errors have to be analyzed for each compound individually. The results of such an analysis are reported elsewhere. In the following sections, we investigate general trends revealing the influence of the algorithmic modifications on the average prediction error and the average number of descriptors of the TQSPR models. For comparison, the average prediction error for the default algorithmic setting is 1.45% and the average number of predictive descriptors in the model is 1.55. The correlation between the target property (critical temperature in this case) and a molecular descriptor is measured by the correlation coefficient YXj . The limiting values are |YXj | = 1 if the property is collinear with the descriptor and YXj = 0 if they are orthogonal. For the default algorithm, we report in Table 4 the 20 molecular descriptors that show on average the highest correlation (i.e., the |YXj | value) with the experimental values of the critical temperature on the training set of 10 compounds.

1404

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

Table 4 List of the 20 most frequently used molecular descriptors with highest mean absolute YXj values for TQSPR models (left) and for one QSPR model with all 108 compounds in the database (right) Order

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

TQSPR (10 compounds in training set)

QSPR (108 compounds in training set)

Mean(abs(YXj ))

Name

Abs(YXj )

Name

0.96684 0.96149 0.96018 0.96018 0.96018 0.96018 0.96018 0.95648 0.95612 0.95525 0.95369 0.95335 0.95290 0.94685 0.94619 0.94312 0.94287 0.94287 0.94287 0.94287

ESpm01r VRD2 VRZ2 VRm2 VRv2 VRe2 VRp2 Har G2 CID RDSQ ATS1m RHyDp VEA1 EPS1 Har2 VRZ1 VRm1 VRv1 VRe1

0.98206 0.97896 0.97896 0.97896 0.97896 0.97896 0.97878 0.97611 0.96789 0.96401 0.96401 0.96401 0.96401 0.96392 0.96145 0.96040 0.95882 0.95722 0.95696 0.95653

IVDM VRZ2 VRm2 VRv2 VRe2 VRp2 VRD2 ESpm01x ESpm01r nBO MWC01 SRW02 MPC01 EPS1 ZM1 Har VEA1 SNar G2 RHyDp

The bold typed descriptors appear in both lists.

Frequently used descriptors may give valuable insight into the relationship between molecular structure and physical property. The database contains several molecular descriptors that exhibit very high |YXj | values. However, none of the descriptors is clearly superior to the others. A description of the molecular descriptors used is given in Todeschini et al. (2006) and Todeschini and Consonni (2000). For comparison, Table 4 also lists the |YXj | values of the highest ranking molecular descriptors for a QSPR model with all 108 compounds in the training set. The descriptors highlighted are among the top 20 in both rankings, which indicates a strong linear relationship between the critical temperature and the respective molecular descriptor. If a strong but non-linear relationship exists, the |YXj | values for the TQSPR models are much higher than those for the QSPR model with 108 compounds. In this case, the TQSPR models can be interpreted as a local linear model of the non-linear descriptor–property relationship. For example, the mean |YXj | value of the piID descriptor is 0.91756 for the TQSPR model and only 0.61931 for the QSPR model. In this example, the |YXj | value increases to 0.92849 for the QSPR model, if the piID descriptor is replaced by log(piID) in the database. Thus, non-linear transformations of the original molecular descriptors may be added to the descriptor database to improve the accuracy of linear QSPRs. 4.2. Stopping criterion for variable selection The discrepancy principle is used as an additional stopping criterion for variable selection to reduce the adverse effect of inaccurate error estimates on the CNR/TNR criterion. Inaccurate error estimates are mimicked by multiplying the error estimates available to us by a constant factor ranging from 0.1 to 10. TQSPRs with a varying error multiplication factor are

identified for the traditional CNR/TNR criterion and the new CNR/TNR/DP criterion. Using the traditional CNR/TNR criterion, the mean number of predictive descriptors selected increases if the error estimates decrease (see Fig. 4, right), because the signal-to-noise ratios CNRj and TNRj increase. The mean prediction error for the target compound (see Fig. 4, left) varies only in a narrow range of about 1.5–1.72%, which demonstrates the robustness of the stepwise regression program SROV against overfitting. For the combined CNR/TNR/DP stopping criterion the mean relative error for the target compound and the mean number of descriptors is almost unaffected by the inaccurate error estimates, because the variable selection is stopped as soon as the 2 ). The strong decrease of prediction is satisfactory (s2 > sgoal the prediction error for an error multiplier of about 8 is due to a change of the set of predictive compounds for the target compound n-butane, which leads to a strong decrease of the prediction error from 9.23% to about 0.3%. Still, the CNR/TNR criterion is found to be an essential part of the combined CNR/TNR/DP stopping criterion, because it prevents the selection of low-informative descriptors. If the CNR/TNR criterion is removed, the SROV program is free to choose any descriptor that minimizes the residual most strongly, which increases the mean prediction error for the target by about 0.68% to 2.13%. Thus, the combined CNR/TNR/DP stopping criterion improves the robustness of the SROV algorithm against inaccurate error estimates. 4.3. Reduction of the descriptor database size Following the three database reduction options provided by the Dragon program, we investigated their influence on the

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

1405

Fig. 4. Influence of inaccurate error estimates on the mean prediction error for the target compounds (left) and the mean number of predictive descriptors (right) for the CNR/TNR stopping criterion and the combined CNR/TNR/DP stopping criterion.

TQSPR model in four different case studies (P0 , P1 , P2 and P3 ) with an increasing number of descriptors being removed. The downsizing was carried out twice, before and after similarity group identification, because some informative descriptors might become non-informative for the reduced set of compounds that constitute the training set. The downsizing options were implemented in MATLAB1 in order to facilitate downsizing of the full database as well as of the training set for each target compound. 4.3.1. No descriptors removed (P0 ) If all 1630 descriptors are retained in the database, the mean prediction error is 1.45% and the mean number of descriptors is 1.55, as shown in Table 5. The mean computational time for database preprocessing and similarity group identification is only 0.175 s, while the mean time required for QSPR identification is 0.247 s.2 4.3.2. Removal of constant descriptors (P1 ) Removal of constant descriptors from the database does not change the mean prediction error and the mean number of variables, which can be expected since the constant descriptors are typically for groups which are not present in hydrocarbons (e.g., oxygen groups). For a small number of lower molecular mass compounds certain descriptors cannot be calculated by Dragon, because the descriptors are not defined for such compounds. Although the values of the similarity measure slightly change when constant descriptors are removed, the training sets remain unchanged for all target compounds. Three hundred and twentyseven descriptors are constant throughout the full database, while the training sets with 10 compounds each contain on average 500 constant descriptors. The total computational time is lower than for P0 . Thus, this preprocessing step improves the TQSPR method.

1 MATLAB is a trademark of The Math Works, Inc. (http://www. mathworks.com). 2 All calculations were carried out on a PC with a 1.92 GHz CPU using MATLAB.

4.3.3. Removal of constant and near-constant descriptors (P2 ) The full database of 108 compounds contains only 5 nearconstant descriptors (see difference of values for P2 and P1 ), while the training sets contain 28 near-constant descriptors on average. The mean prediction error and the mean number of variables are almost identical to the corresponding values of case studies P0 and P1 . Further analysis reveals that none of the 108 TQSPR models for P0 and P1 contains a near-constant descriptor. The overall time required to identify the near-constant descriptors is much larger than the time saving in the QSPR identification step. Thus, the removal of near-constant descriptors does not improve the overall algorithmic performance if applied twice, before and after similarity group identification. 4.3.4. Removal of constant, near-constant and pairwise linearly correlated descriptors with a threshold value t (P3 ) For threshold values (t) lower than about 0.999 an increasing number of descriptors is required to reach the variance goal of the DP criterion. This indicates that an increasing number of the most informative descriptors are removed. Due to the robustness of the SROV algorithm and the CNR/TNR/DP stopping criterion, the mean prediction error fluctuates, but stays at about the same level down to threshold values of 0.8. For lower threshold values an increase of the mean prediction error becomes visible, indicating a loss of essential information in the descriptor database. The computational time for the processing steps and TQSPR identification both decrease with decreasing threshold values due to the strong reduction of the number of descriptors to be considered. The QSPR identification time scales linearly with the number of descriptors in the database. Despite the strong reduction of the number of descriptors in the database, the total computational demand remains in all cases much larger than for P0 , P1 and P2 . The TQSPR models are almost unaffected if constant and perfectly pairwise linearly correlated (t = 1) descriptors are removed from the database. The reason for this behavior is the QR decomposition approach used in the SROV program. At the beginning of the SROV program, all descriptors are centered, so that constant descriptors become zero vectors with CNRj = 0 and TNRj = 0 and, thus, cannot enter the model. Furthermore, the

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

orthogonalization procedure ensures that only one out of a pair of perfectly linearly correlated descriptors can enter the model.

986 926 894 826 671 448 361 271 214 173 143 63 32 18 11 1237 1189 1176 1111 945 712 597 474 400 349 305 175 112 71 51 0.172 0.161 0.159 0.147 0.123 0.091 0.078 0.063 0.054 0.044 0.041 0.029 0.023 0.020 0.019 1.664 1.654 1.661 1.662 1.663 1.641 1.619 1.603 1.589 1.584 1.581 1.544 1.546 1.525 1.518 1 0.999999 0.99999 0.9999 0.999 0.99 0.98 0.96 0.94 0.92 0.9 0.8 0.7 0.6 0.5 P3

1.47 1.44 1.42 1.46 1.37 1.46 1.35 1.55 1.53 1.65 1.36 1.76 1.86 2.12 2.77

1.56 1.57 1.58 1.56 1.56 1.72 1.87 2.03 2.06 2.14 2.17 2.51 2.76 3.13 3.60

0.247 0.188 0.184 0.175 0.212 0.828 – – – P0 P1 P2

1.45 1.45 1.46

1.55 1.55 1.56

Time for QSPR identification (s) Threshold

Mean prediction error (%)

Mean no. of descriptors in the model

Time for downsizing and similarity group identification (s)

1630 1303 1298

1630 1130 1102

4.4. Reduction of database size by principal component analysis

Method

Table 5 Influence of database downsizing on the TQSPR method

Mean no. of descriptors in database before similarity group identification

Mean no. of descriptors in database before QSPR identification

1406

This section presents the results for both PCA-based TQSPR identification strategies described in Section 3.1.2. 4.4.1. PCA strategy 1 The number of PCs used to identify the similarity group is varied from 1 to the 32 in Fig. 5. The similarity of the compounds is measured either by the Euclidean distance or the partial correlation. Both similarity measures lead to similar mean prediction errors, whereas the number of predictive descriptors increases for the partial correlation similarity measure. At least five PCs should be used to obtain a reasonable prediction error of about 1.5% for the target compounds (cf. Fig. 5, left). If less PCs are used, the prediction error is much larger, implying that a significant amount of the structural variation in the database is not represented by those PCs, and they do not allow for distinguishing similar from less similar compounds. The lowest mean prediction errors are achieved between about 10 and 20 PCs. In this range, PCA strategy 1 slightly outperforms the results of the default algorithm presented in Section 3.1. However, due to the relatively small database, these observations might not be significant and have to be further investigated for a larger database. An additional indicator for how well the similarity group identification algorithm identifies the training set for the given molecular descriptor data is the number of compounds in the training set that belongs to the same homologous series or group of structures (see Table 1) as the target. Fig. 6 shows this information as an average over all 108 compounds and for the three largest homologous series and structural groups for the Euclidean distance similarity measure with maxabs scaling. If just one PC is used, the training sets contain compounds with diverse structures, which leads to high prediction errors as can be seen in Fig. 5 (left). 4.4.2. PCA strategy 2 Fig. 7 shows in comparison to Fig. 5 that the QSPR models identified in the PC space according to PCA strategy 2 yield a higher mean prediction error (about 0.5% higher) and require about two additional predictive descriptors. This demonstrates that, although a PCA is suitable to reduce the dimensionality of the molecular descriptor data and to identify the similarity group, it does not perform well for the QSPR regression task. The results for the default algorithm in Section 3.1 showed that in most cases accurate QSPRs can be developed with just one or two molecular descriptors, demonstrating that highly predictive descriptors exist in the database. The PCA does not account for the correlation of the calculated PCs with the physical property that has to be predicted. Furthermore, the highly descriptive descriptors cannot be accessed directly in the PC space, since each PC is a linear combination of all non-constant molecular descriptors.

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

1407

Fig. 5. Influence of the number of principal components (PC) on the mean prediction error for the target compounds (left) and on the mean number of predictive descriptors (right) in the TQSPR models for PCA strategy 1. The diagrams compare the results for two similarity measures with the results of the default case when no PCA is used.

4.5. Comparison of similarity measures

Fig. 6. Mean number of compounds with similar structures in the training set for Euclidean distance with maxabs scaling for PCA strategy 1.

How many compounds should be assigned to the training set and which similarity measure and scaling method should be used is still an open question. If the training set becomes too small, the statistical significance of the identified QSPR model decreases. If the training set becomes too large, the real physical relationship probably cannot be described by a linear model with few predictive descriptors. Therefore, we compare the average TQSPR results for a varying training set size of 5–30 compounds. Furthermore, all combinations of the two similarity measures (Euclidean distance (E) and partial correlation coefficient (C)) and the three alternative scaling approaches (no scaling (−), scaling to a maximal absolute value of 1 (1), and normalization to zero mean and standard deviation 1 (N)) are investigated. As an example for the influence of these alternatives, Fig. 8 compares the calculated similarities between the predictive compounds and the target compound a-pinene for three methods E1, C1 and CN (first character: sim-

Fig. 7. Influence of the number of principal components (PC) on the mean prediction error for the target compounds (left) and on the mean number of predictive descriptors in the TQSPR models (right) for PCA strategy 2. The diagrams compare the results for two similarity measures with the results of the default case when no PCA is used.

1408

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

Fig. 8. Comparison of three alternative similarity calculations between 107 predictive compounds in the database and the target compound a-pinene.

Fig. 9. Influence of the training set size on the mean prediction error for the target compounds (left) and on the mean number of predictive descriptors in the TQSPR model (right). The diagrams show the results for three different scaling methods for the partial correlation similarity measure.

ilarity measure used; second character: scaling method used). In all three cases, the training sets comprise different sets of compounds. Figs. 9 and 10 show that the descriptor values should always be scaled because scaling decreases the mean prediction error in most cases. The training set should comprise at least about 8–10 compounds in order to obtain a QSPR with low prediction error for the target compound and just few predictive descrip-

tors. If a larger training set is used, the TQSPR models require more predictive descriptors to reach the variance goal of 1% of the CNR/TNR/DP criterion. No combination of the two scaling methods (‘1’and ‘N’) and two similarity measures clearly outperforms the others. Thus, the similarity measure and scaling method used may change the training sets, but are of minor importance for attaining a low average prediction error, at least for the database employed.

Fig. 10. Influence of the training set size on the mean prediction error for the target compounds (left) and on the mean number of predictive descriptors in the TQSPR model (right). The diagrams show the results for three different scaling methods for the Euclidean distance similarity measure.

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

Fig. 11. Training sets identified by assignment method A1 , A2 and A3 for the target compound n-heptane. For ease of visualization, the compounds are displayed in the PC space. The similarity groups are, however, identified using the default algorithm (see Table 2) without PCA.

Table 6 Influence of the assignment methods A1 , A2 and A3 on the mean prediction error, on the mean number of predictive descriptors in the model, and on the mean number of predictive compounds in the training set that belong to the same structural group Assignment method

Mean prediction error (%)

Mean no. of descriptors

Mean no. of compounds from same group

A1 A2 A3

1.45 1.90 2.24

1.5 1.4 1.6

6.9 7.0 6.4

4.6. Assignment of compounds to the similarity group and the training set Fig. 11 demonstrates for the target compound n-heptane that the three assignment methods A1 , A2 and A3 can yield very different training sets and, thus, will result in different TQSPR models. Table 6 compares the average results using the three different assignment methods. The assignment method A1 exhibits the lowest mean prediction error, whereas the mean number of descriptors in the model and the mean number of compounds in the training set that belong to the same homologous series or structural groups are similar for all three methods. The much higher mean prediction errors for A2 and A3 result from very high prediction errors for few compounds. However, even if the median prediction error, which is robust against outliers, is used to compare the assignment methods, A1 still outperforms the others (0.83% compared to 0.97% for A2 and 1.01% for A3 ). 5. Conclusions Several algorithmic modifications of the TQSPR method are proposed and analyzed using a database of 108 compounds, for which the critical temperature is predicted. Accurate QSPRs can

1409

be developed with a training set of 10 compounds if the level of similarity with the target compound is high. Including more compounds in the training set does not lower the average prediction error for the target significantly, but leads to a larger number of variables in the TQSPR model. The optimal size may depend on the database used and on the physical property predicted. Adapting the size to the used database and the predicted physical property, as described in Section 4.5, may improve the TQSPR accuracy. The descriptors should be scaled before similarity group identification. However, none of the different combinations of similarity measures and descriptor scaling methods investigated clearly outperforms the others. The similarity group should be assembled from compounds closest to the target compound. Both other assignment methods investigated resulted in clearly higher prediction errors. Constant and perfectly pairwise linearly correlated descriptors can be removed from the database, since they are useless for TQSPR identification and only increase computational demand. Near-constant descriptors can be removed as well, since they only explain the structural difference of one compound from all other compounds in the training set. In this study, none of the TQSPR models contained a near-constant descriptor. Removing correlated descriptors with threshold values lower than one is unreasonable, because the identification time is less than 1 s in all cases and valuable information may be discarded. Additional downsizing of the database for each of the training sets strongly increases the computational demand and should, thus, be avoided. However, the incentive to extensively downsize the descriptor database may be more prominent, when other identification algorithms are used that rely on non-linear parameter estimation techniques, or employ computationally more demanding algorithms to identify the best set of independent variables of the QSPR model. A principal component analysis of the molecular descriptor data can slightly lower the mean prediction error of the TQSPR method, if at least five principal components are used for similarity group identification (for the database considered in this study). The QSPR regression model should, however, not be identified in the PC space, because the PCs are less descriptive than some of the original molecular descriptors. An increase of the number of descriptors in the database will not negatively affect the TQSPR method, because the computational demand of the SROV program increases only linearly with the number of descriptors in the database. Furthermore, the orthogonalization approach used by the stepwise regression program ensures the identification of statistically sound QSPR models, because constant and collinear descriptors are automatically rejected from entering the model. The detrimental effect of inaccurate error estimates of the descriptor and physical property data is reduced by the proposed CNR/TNR/DP stopping criterion. Since these observations are mainly due to the SROV algorithm used, they will be qualitatively similar for other databases and predicted properties. In 23% of the cases, we observe relative prediction errors for the target compound in the range of 2–10%. We have investigated the sources of these enlarged errors and developed methods to assess the prediction error for cases when no measurement data

1410

O. Kahrs et al. / Computers and Chemical Engineering 32 (2008) 1397–1410

are available for comparison. The results of these studies will be reported separately, in the future. Our current investigations focus on more difficult prediction tasks, such as the melting point prediction of hydrocarbons, alcohols and acids. The results will be summarized in a future paper and demonstrate that the TQSPR method does not behave differently than for critical temperature prediction of hydrocarbons presented in this paper. References Basak, S. C., Gute, B. D., Mills, D., & Hawkins, D. M. (2003). Quantitative molecular similarity methods in the property/toxicity estimation of chemicals: a comparison of arbitrary versus tailored similarity spaces. Journal of Molecular Structure: Theochem, 622, 127–145. Bicerano, J. (1993). Prediction of polymer properties. Marcel Dekker Inc.. Brauner, N., Stateva, R. P., Cholakov, G. St., & Shacham, M. (2006). A structurally “targeted” QSPR method for property prediction. Industrial & Engineering Chemistry Research, 45, 8430–8437. Cholakov, G. St., Wakeham, W. A., & Stateva, R. P. (1999). Estimation of normal boiling temperature of industrially important hydrocarbons from descriptors of molecular structure. Fluid Phase Equilibrium, 163, 21–42. Constantinou, L., & Gani, R. (1994). New group-contribution method for estimating properties of pure compounds. AIChE Journal, 40, 1697– 1710. Dearden, J. C. (2003). Quantitative structure–property relationships for prediction of boiling point, vapor pressure, and melting point. Environmental Toxicology and Chemistry, 22, 1696–1709. Franklin, J. L. (1949). Prediction of heat and free energies of organic compounds. Industrial & Engineering Chemistry, 41, 1070–1076. Fedors, R. F. (1982). A relationship between chemical-structure and the critical-temperature. Chemical Engineering Communications, 16, 149– 151. Gani, R., & Constantinou, L. (1996). Molecular structure based estimation of properties for process design. Fluid Phase Equilibrium, 116, 75–86. Hansen, P. C. (1997). Rank-deficient and discrete ill-posed problems: Numerical aspects of linear inversion. SIAM. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning; data mining, inference, and prediction. Springer-Verlag. Horvath, A. L. (1992). Molecular design: Chemical structure generation from the properties of pure organic compounds. Amsterdam: Elsevier.

Katritzky, A. R., Karelson, M., & Petrukhin, R. (2005). CODESSA PRO user’s manual. University of Florida. Kier, L., & Hall, L. H. (1986). Molecular connectivity in structural-activity analysis. New York, USA: Wiley. Klamt, A., & Eckert, F. (2000). COSMO-RS: A novel and efficient method for the a priori prediction of thermophysical data of liquids. Fluid Phase Equilibrium, 172, 43–72. Lucic, B., & Trinajstic, N. (1999). Multivariate regression outperforms several robust architectures of neural networks in QSAR modeling. Journal of Chemical Information and Computer Sciences, 39, 121– 132. Marrero, J., & Gani, R. (2001). Group-contribution based estimation of pure component properties. Fluid Phase Equilibrium, 183, 183–208. Owczarek, I., & Blazej, K. (2003). Recommended critical temperatures. Part I. Aliphatic hydrocarbons. Journal of Physical and Chemical Reference Data, 32, 1411–1427. Poling, B. E., Prausnitz, J. M., & O’Connel, J. P. (2001). Properties of gases and liquids. New York: McGraw-Hill. Prausnitz, J. M., & Tavares, F. W. (2004). Thermodynamics of fluid-phase equilibria for standard chemical engineering operations. AIChE Journal, 50, 739–761. Rowley, R. L., Wilding, W. V., Oscarson, J. L., Yang, Y., & Zundel, N. A. (2006). DIPPR data compilation of pure chemical properties, Design Institute for Physical Properties. Provo, Utah: Brigham Young University. http://dippr.byu.edu Shacham, M., & Brauner, N. (2003). The SROV program for data analysis and regression model identification. Computers and Chemical Engineering, 27, 701–714. Todeschini, R., Consonni, V., Mauri, A., & Pavan, M. (2006). DRAGON user manual. Milano, Italy: Talete srl. Todeschini, R., & Consonni, V. (2000). Handbook of molecular descriptors. Wiley-VCH. Turner, B. E., Costello, C. L., & Jurs, P. C. (1998). Prediction of critical temperatures and pressures of industrially important organic compounds from molecular structure. Journal of Chemical Information and Computer Sciences, 38, 639–645. Wakeham, W. A., Cholakov, G. St., & Stateva, R. P. (2002). Liquid density and critical properties of hydrocarbons estimated from molecular structure. Journal of Chemical & Engineering Data, 47, 559– 570. Yan, X., Dong, Q., & Hong, X. (2003). Reliability analysis of group-contribution methods in predicting critical temperatures of organic compounds. Journal of Chemical & Engineering Data, 48, 374–380.