Available online at www.sciencedirect.com
Fluid Phase Equilibria 263 (2008) 33–42
QSPR prediction of N-boiling point and critical properties of organic compounds and comparison with a group-contribution method Davide Sola, Ada Ferri ∗ , Mauro Banchero, Luigi Manna, Silvio Sicardi Dipartimento di Scienza dei Materiali e Ingegneria Chimica, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy Received 16 July 2007; received in revised form 20 September 2007; accepted 24 September 2007 Available online 29 September 2007
Abstract Three QSPR models for the prediction of the critical temperature, the critical pressure and the normal boiling point of organic compounds were developed. The chemical structures were selected from the DECHEMA database among those with experimentally available properties and included different chemical groups: single ring, fused rings, halogens, OH, COOH, COOR, CON, CN, NH2 and NO2 , etc. The AMPAC software was used to model the molecules using the AM1 Hamiltonian. The CODESSA software was used to calculate the molecular descriptors and the Heuristic method was chosen to find the best multi-linear relation between the property of interest and the most significant set of descriptors. The final QSPR models showed a significantly higher accuracy with respect to the best available group-contribution method. Comparable results were obtained with respect to other QSPR models despite the different composition of the database, confirming the versatility and robustness of the QSPR method. © 2007 Elsevier B.V. All rights reserved. Keywords: QSPR model; Boiling point; Critical properties; Heuristic method
1. Introduction Critical properties and normal boiling point of organic compounds are widely used in the chemical industry. For example, critical properties are required to calculate reduced state variables in the corresponding states theory while the normal boiling point is used in the estimation of many other properties such as vapour pressure, heat of vaporization, liquid molar volume, etc. [1–3]. Experimental data of these properties are often missing, and, for this reason, methods for their estimation are necessary. The classical approach to property estimation consists in using “group contribution methods” (GC methods), where each molecule is considered as made of fundamental groups, each one giving a contribution to the property of interest, which is calculated by adding together the contributions of each group. Typical GC methods are those of Lydersen [4], Joback and Reid [5] and
∗
Corresponding author. Tel.: +39 015 401407; fax: +39 011 5464648. E-mail addresses:
[email protected] (A. Ferri),
[email protected] (M. Banchero),
[email protected] (L. Manna),
[email protected] (S. Sicardi). 0378-3812/$ – see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.fluid.2007.09.022
Marrero–Gani [6]. The principal advantage of these methods relies on their simplicity but, on the other hand, they have some disadvantages: - not all groups are listed, and if only a single group is missing, no calculation can be performed on the molecule; - the group contributions are additive and do not take into account interactions between different groups or the role of the group position inside the structure; - even in more sophisticated methods (such as that of Marrero and Gani) where isomers are differently treated, using “second” and “third order” groups, the problem remains unsolved for stereo-isomers; - they are not adequately accurate. In addition to the GC methods, Quantitative Structure– Property Relationship (QSPR) models are widely used nowadays to perform property estimation [7–9]. A QSPR model consists of a correlation between the property of interest and a variety of molecular features (named descriptors) that range from structural and topological indices to electronic and quanto-chemical properties.
34
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
The method requires a database of substances (training set) for which the property of interest is known and a specific software that accomplishes the molecular modelling of the structure so that a great number of molecular descriptors can be evaluated. The following step, which is performed via classical statistical analysis, is the selection of the subset of descriptors more closely related to the bulk property of interest. Finally, a multi-linear relationship between the property and the selected descriptors is developed. The validity of the model is proved through the prediction of the property for a validation set of structures. The diffusion of QSPR models coincides with the development of semi-empirical models (AM1, PM3, MNDO, etc.) that made it possible to approximately solve the wavefunction of complex molecular structures with little computational effort. The oldest semi-empirical models, in fact, encoded parameters for few sets of atoms and, as a consequence, the QSPR models were restricted to data sets of limited structural diversity, frequently only hydrocarbons [10]. Nowadays, the new semiempirical models encode parameters for larger sets of atoms and the majority of organic compounds can be treated. Several studies on QSPR were developed for the boiling point [11–15]. In their work Katritzky et al. [15] focused on predicting the boiling point for a huge set of 612 organic compounds containing C, H, N, O, S, F, Cl, Br and I. The general performance of the final eight-parameters model is good and the results show that sulfur-containing compounds are quite difficult to be predicted, and lead to a lower model performance. Wessel and Jurs [16], who used ADAPT software with a database of more than 300 organic compounds, found that the best correlations could be found dividing the original database into two subsets: a nitrogen-containing subset and a nitrogen-excluded one. The correlations showed a good agreement with the experimental data and have been used as a reference to compare the boiling point prediction in the present work. As far as critical properties are concerned, the present paper results have been compared with the work by Turner et al. [17] where more than 140 compounds were extracted from the DIPPR database [18] and eight-parameter multi-linear correlations were constructed. For both the critical temperature and pressure good results were obtained, even though the critical pressure, as stated by the authors, is much more difficult to model. On the other hand, the database considered by the authors includes a lot of hydrocarbons, a little presence of halogen-containing compounds (∼ =3%), of aromatics (∼ =14%) and a total absence of NO2 -containing compounds and amides. A database with strictly experimentally known properties was selected in the present work: 155 among hydrocarbons, aromatics, oxygenated, halogenated and nitrogen-containing compounds, were selected from the DECHEMA database [19]. In comparison with the work by Turner, the percentage of aromatics is higher (24%) and a noticeable number of NO2 containing compounds have been introduced (10%) with the aim of assembling a varied database and obtaining general models. Apart from the comparison with literature QSPR models, the correlations for the boiling point, the critical temperature and the
critical pressure were also compared with the most sophisticated GC method (the Marrero–Gani’s one). 2. Methods 2.1. Database selection A subset of substances was selected from the DECHEMA database, giving rise to a training set and a validation set of respectively 132 and 23 compounds. Even though more than 850 compounds are included in the DECHEMA database, only those organic compounds for which the corresponding properties were experimentally measured were included in this work. The complete list is reported in Table 1: the number of the compounds in the database depends on the property being considered. Since critical pressure data are more difficult to measure, fewer compounds could be included for that property (121 in the training set and 18 in the validation set). Anyway, for each studied property the compounds in the validation set are about 15% of those in the training set. An effort was made to include in the database as many chemical families as possible: hydrocarbons (saturated and unsaturated), aromatics, oxygen containing compounds (33% of the total amount of molecules, distributed in organic acids, alcohols, aldehydes, ketones, esthers, halogen-containing compounds (5%), nitrogen-containing compounds (36% with amines, amides, nitriles), and some “hybrid” molecules (such as pyridine). 2.2. Molecular modelling and descriptors generation The 155 structures were sketched using the AMPAC software package [20] and subsequently modelled with the AM1 semiempirical model included in the software [21]. AM1 belongs to the Neglet of Diatomic Differential Overlap (NDDO) semiempirical model, which neglects the differential overlap only for atomic orbitals on different atoms. AM1 calculations are fast and reasonably robust over a large range of chemical functionalities and, thus, AM1 continues to be used for a wide variety of applications [22]. Geometry optimization was carried out by searching for local minima on the potential energy surface (PES), which is the hypersurface representing the potential energy of the collection of atoms in the structure over all possible atomic arrangements. The TRUSTE algorithm implemented in the AMPAC software was used to identify the lower energy configuration: the root mean square gradient tolerance was set equal to 0.05. The output files from AMPAC were given as an input to the CODESSA software [23] in order to calculate the molecular descriptors. For each molecule about 500 descriptors can be calculated, grouped into the following classes: constitutional descriptors (related to chemical composition), topological descriptors (related to the connectivity of the atoms), geometrical descriptors (related to the 3-D geometrical structure), electrostatic descriptors (related to the charge distribution), quanto-chemical descriptors (related to molecular orbital calculations), thermodynamic descriptors (related to the vibrational energy). The actual number of descriptors depends on the specific structure being considered.
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
35
Table 1 List of compounds taken from the DECHEMA database used to develop the QSPR models
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
Compound name
Exp. Tc (K)
Calc. Tc (K)
A%D*
Exp. Tb (K)
Calc. Tb (K)
A%D
Exp. Pc (MPa)
Calc. Pc (MPa)
A%D
1,1-Diphenyl hydrazine 1,2,3-Trimethylbenzene 1,2,3-Trinitrobenzene 1,2,4-Trimethylbenzene 1,2-Dichloro-benzene 1,2-Diethylbenzene 1,2-Dimethyl-3-ethylbenzene 1,2-Dimethyl-4-ethylbenzene 1,2-Dimethylbenzene 1,3,5-Trimethylbenzene 1,3,5-Trinitrobenzene 1,3-Benzenediamine 1,3-Diethylbenzene 1,3-Dimethyl-2-ethylbenzene 1,3-Dimethyl-4-ethylbenzenea,b,c 1,3-Dimethylbenzeneb,c 1,4-Diethylbenzene 1,4-Dimethyl-2-ethylbenzene 1,4-Dimethylbenzene 1,3-Dinitrobenzenea,b 1-Aminonaphtalene 1-Butanamine 1-Butanol 1-Butanolo-2-methyl 1-Chloro-3-isocyanato benzene 1-Chloro-4-isocyanato benzenea,b,c 1-Ethyl-2-nitrobenzene 1-Ethyl-4-nitrobenzene 1-Ethyl-naphtalene 1-Methyl-2-ethylbenzenea 1-Methyl-2-nitrobenzene 1-Methyl-3-ethylbenzene 1-Methyl-3-nitrobenzene 1-Methyl-3-propylbenzene 1-Methyl-4-ethylbenzene 1-Methyl-4-nitrobenzene 1-Methyl-3,5-dinitrobenzenea,b,c 1-Methylethylbenzene 1-Methylnaphtalene 1-Nitrobutanea,b 1-Nitropropano 1-Propanamine 2-Aminonaphtalenea,b,c 2-Methyl-2-butanol 2,3-Dimethylphenol 2,4,5-Trimethyl aminobenzene 2,4,6-Trinitro aminobenzene 2,4-Dimethyl aminobenzene 2,4-Dimethylphenol 2,4-Dinitro-1-methylbenzene 2,5-Dimethylphenol 2,6-Dimethyl aminobenzene 2,6-Dimethylphenola,b,c 2-Butanamine 2-Butanol 2-Chloro aminobenzenea,b,c 2-Chlorophenol 2-Ethylnaphtalene 2-Ethylphenola,b 2-Etossi aminobenzene 2-Methoxy benzenamine 2-Methyl aminobenzene 2-Methyl benzonitrile 2-Methyl-1,3-dinitrobenzene
830 664.5 920 649.2 677.6 669.6 665.1 662.6 631.1 640.1 835 815 663.6 665.1 662.6 619 657.9 662.6 617.4 800 850 524 561.4 580.1 695 695 720 740 774.9 651.5 720 645.1 725 666.6 638.7 735 800 631.7 772 624 606 496.1 850 544.9 722.8 725 920 714 707.6 815 706.9 722 701 511 538 730 675 774.9 703 731 728 710 709 780
862.1 651.3 890.7 650.5 667.4 660.7 662.0 661.0 629.6 648.8 851.2 824.3 660.1 661.9 660.9 628.6 658.6 661.4 627.0 794.6 835.6 528.1 561.5 577.2 704.9 699.4 737.2 746.5 767.0 642.7 726.7 642.0 733.8 660.3 640.3 735.4 805.1 638.9 762.7 605.0 577.5 503.4 844.6 543.0 701.4 733.2 910.8 723.1 707.5 829.5 703.9 712.2 702.0 507.9 536.3 721.8 708.6 768.7 699.6 726.2 732.0 696.0 723.1 788.2
3.9 2.0 3.2 0.2 1.5 1.3 0.5 0.2 0.2 1.4 1.9 1.1 0.5 0.5 0.3 1.6 0.1 0.2 1.6 0.7 1.7 0.8 0.0 0.5 1.4 0.6 2.4 0.9 1.0 1.4 0.9 0.5 1.2 0.9 0.3 0.1 0.6 1.2 1.2 3.0 4.7 1.5 0.6 0.4 3.0 1.1 1.0 1.3 0.0 1.8 0.4 1.4 0.1 0.6 0.3 1.1 5.0 0.8 0.5 0.7 0.6 2.0 2.0 1.1
595.25 449.27 676.89 442.26 449.1 457.02 467 462.9 417.46 437.89 629.63 558.2 454.25 463.1 461.55 412.22 456.96 460.06 411.44 574.73 573.93 350.12 390.61 403.95 476.89 477.21 503.03 524.5 527.35 438.29 495.3 434.44 505.75 454.95 435.15 506.4 590.09 426.31 517.82 426.05 404.74 322.15 579.24 375.26 491.15 507.65 692.64 490.27 484.65 592.68 484.33 491.05 485.15 336.15 372.8 481.97 444.15 524.15 480.15 503.83 493.66 473.31 478.24 560.5
600.2 440.0 661.3 438.6 436.7 461.7 461.0 459.3 417.6 436.8 634.4 569.0 459.8 461.1 459.9 416.3 458.9 460.2 415.0 578.1 567.7 343.7 388.0 405.9 466.8 460.3 510.2 518.8 530.9 440.2 491.4 438.8 499.0 460.2 437.7 498.8 595.1 438.9 512.9 417.0 392.2 320.3 574.1 375.4 483.8 505.9 673.4 491.8 487.2 605.9 485.2 481.8 478.2 327.9 365.5 480.1 473.3 531.1 484.5 510.0 503.5 467.6 487.9 576.2
0.8 2.1 2.3 0.8 2.8 1.0 1.3 0.8 0.0 0.3 0.8 1.9 1.2 0.4 0.4 1.0 0.4 0.0 0.9 0.6 1.1 1.9 0.7 0.5 2.1 3.6 1.4 1.1 0.7 0.4 0.8 1.0 1.3 1.2 0.6 1.5 0.9 3.0 1.0 2.1 3.1 0.6 0.9 0.0 1.5 0.4 2.8 0.3 0.5 2.2 0.2 1.9 1.4 2.4 2.0 0.4 6.6 1.3 0.9 1.2 2.0 1.2 2.0 2.8
4 3.27 1.65 3.23 – 2.99 2.93 2.9 3.74 3.24 1.95 2.93 2.93 2.9 3.56 2.81 2.9 3.53 – 5 3.98 4.42 4.56 3.55 4 3.35 3 3.14 3.14 3.4 3.14 – 2.9 3.14 – 2.15 3.2 3.49 – 4.04 4.72 4.9 – 4.86 – 1.95 4 4.36 2.15 4.86 4.2 4.26 3.97 4.18 4.6 5 3.14 – 5 5.85 4.65 3.95 2.4
3.95 3.32 1.59 3.28 – 2.85 2.96 3.01 3.67 3.18 1.88 2.99 2.81 2.94 3.63 2.95 2.85 3.68 – 4.90 4.12 4.66 4.24 3.46 3.68 3.13 3.25 2.98 3.22 3.83 3.25 – 2.80 3.34 – 2.26 3.17 3.38 – 4.00 4.78 4.90 – 4.45 – 1.90 4.33 4.51 2.40 4.48 4.47 4.41 4.03 4.44 4.89 5.10 2.98 – 4.60 5.36 4.70 3.57 2.71
1.3 1.5 3.9 1.4 – 4.7 1.1 3.6 1.8 1.8 3.6 2.2 4.0 1.5 2.1 4.9 1.8 4.3 – 2.0 3.6 5.4 7.0 2.5 8.1 6.7 8.4 5.1 2.4 12.6 3.6 – 3.3 6.3 – 4.9 1.1 3.3 – 0.9 1.2 0.1 – 8.4 – 2.5 8.3 3.4 11.7 7.9 6.5 3.4 1.5 6.3 6.3 2.1 4.9 – 8.0 8.3 1.1 9.6 12.8
36
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
Table 1 (Continued )
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129
Compound name
Exp. Tc (K)
Calc. Tc (K)
A%D*
Exp. Tb (K)
Calc. Tb (K)
A%D
Exp. Pc (MPa)
Calc. Pc (MPa)
A%D
2-Methyl-1-propanamine 2-Methyl-1-propanol 2-Methyl-2-propanol 2-Methylnaphtalene 2-Methylphenolc 2-Methyl-2-propanaminea,b,c 2-Nitrobenzenamine 2-Nitrobutano 2-Propanamine 2-Propanol 1-ammino 2-Propanolo 2-Propanone 2-Propylphenol 3-Methyl aminobenzenea,b,c 3,4-Dimethylphenol 3,5-Dimethyl 1-ethylbenzene 3,5-Dimethylphenol 3-Chloro benzenamine 3-Ethylphenol 3-Methyl benzonitrile 3-Methylphenol 3-Nitro aminobenzene 4-Chloro aminobenzene 4-Ethyl aminobenzene 4-Ethylphenol 4-Methyl aminobenzene 4-Methyl benzonitrile 4-Methyl-1,3-diaminobenzene 4-Methylphenol 4-Nitro aminobenzene 4-Propylphenol Acetic acid 1-methyl propylesther Acetic acid 2-methylpropylesthera,b,c Acetic acid buthylesther Acetic acid methylesther Acetic acid propylesther Anilinea,b,c Anthracene Benzaldehyde Benzene Benzenemetanamine Benzonitrile Bromobenzene Butanea,b,c Butanenitrile Butanoic acid 3-methyl methylesther Butanoic acid ethylesther Chlorobenzene Cyclohexanone Diphenyl Ethane Ethanola,b,c Ethyl acetate Ethylbenzene Formic acid 1-methyl ethylesther Formic acid 1-methyl propylesther Formic acid 2-methyl propylesther Formic acid buthylesther Formic acid ethylesthera,b,c Formic acid methylesther Formic acid propylesther Isocyanato benzene Methanol Methoxy benzene Methyl hydrazine
519 547.7 508 764.3 695.3 483.9 800 615 496.1 623 516.6 508.4 705 708.7 729.8 649.1 715.6 753 716.4 710 705.4 840 756 715 716.4 704 720 818 704.5 870 715 561 563.9 578.4 506.9 549.4 698.9 873.1 695 562 686 699.4 670.1 425.6 582.2 567 577 632.6 629.1 789 305.4 515.8 524.1 617.9 521 545 551.3 559.8 507.9 486.7 536.8 650 – 643.6 567.1
527.7 557.9 536.7 764.1 692.1 504.7 805.2 600.7 481.1 611.2 516.9 522.4 708.5 711.5 703.2 660.0 705.4 753.7 705.9 726.1 691.8 843.1 768.4 729.4 706.4 712.2 727.4 800.4 692.6 857.0 718.7 556.9 572.2 564.9 503.4 543.0 705.3 870.6 699.4 581.3 673.0 674.5 662.5 429.9 568.4 575.4 577.3 627.2 623.8 778.2 308.6 504.9 530.5 620.2 527.0 555.5 554.9 559.9 505.3 479.7 534.9 681.5 – 618.7 592.8
1.7 1.9 5.7 0.0 0.5 4.3 0.7 2.3 3.0 1.9 0.1 2.8 0.5 0.4 3.7 1.7 1.4 0.1 1.5 2.3 1.9 0.4 1.7 2.0 1.4 1.2 1.0 2.2 1.7 1.5 0.5 0.7 1.5 2.3 0.7 1.2 0.9 0.3 0.6 3.4 1.9 3.6 1.1 1.0 2.4 1.5 0.1 0.9 0.8 1.4 1.1 2.1 1.2 0.4 1.2 1.9 0.6 0.0 0.5 1.4 0.4 4.9 – 3.9 4.5
340.64 381.12 355.88 514.22 464.08 316.91 557.45 412.85 322.15 432.8 355.45 329.35 493.15 476.48 498.4 456.8 492.65 501.61 487.15 468.15 475.4 581.44 505.75 490.37 492.15 473.55 490.53 565.15 474.98 604.85 505.75 385.35 389.02 397.95 330.25 374.58 457.2 613.1 452.15 353.31 457.79 464.1 429.1 272.98 390.37 389.41 394.57 405.06 428.15 528.15 184.63 351.45 349.84 409.15 341.55 366.75 371.24 379 327.54 305.07 354.05 437.22 337.77 426.84 362
345.4 385.4 364.8 513.1 471.1 328.8 564.6 413.0 308.8 425.4 352.4 336.6 498.9 479.3 485.1 458.3 486.9 505.7 489.3 488.5 471.7 593.9 514.6 498.4 489.4 478.8 490.6 556.3 471.0 605.4 505.7 381.0 395.1 383.4 326.1 360.5 467.3 588.4 457.1 369.6 444.2 441.3 434.5 270.8 376.4 395.5 395.4 405.4 423.8 522.4 179.5 342.8 350.1 416.8 356.4 380.6 379.6 380.9 331.8 306.6 357.0 446.0 325.3 410.7 379.1
1.4 1.1 2.5 0.2 1.5 3.8 1.3 0.0 4.1 1.7 0.9 2.2 1.2 0.6 2.7 0.3 1.2 0.8 0.4 4.3 0.8 2.1 1.8 1.6 0.6 1.1 0.0 1.6 0.8 0.1 0.0 1.1 1.6 3.7 1.3 3.8 2.2 4.0 1.1 4.6 3.0 4.9 1.3 0.8 3.6 1.6 0.2 0.1 1.0 1.1 2.8 2.5 0.1 1.9 4.4 3.8 2.3 0.5 1.3 0.5 0.8 2.0 3.7 3.8 4.7
4.07 4.3 4.23 3.48 5.01 3.85 5.5 3.6 4.72 5.2 5.37 4.78 3.85 4.15 4.96 2.8 3.65 5 4.15 3.75 4.56 5 4.45 3.85 4.05 – 3.55 4.6 5.15 5.5 3.6 3.13 3.18 3.05 4.69 3.33 5.3 – – 4.89 4.8 4.22 4.52 3.76 3.79 3.22 3.06 4.52 – 3.22 4.88 6.36 3.85 3.73 4 3.83 3.88 3.59 4.74 6 4.06 3.8 – 4.18 –
4.15 4.62 3.96 3.34 4.82 3.52 4.95 3.51 4.36 5.06 4.90 4.61 4.03 4.72 4.31 2.88 4.12 4.92 4.24 3.53 4.61 5.06 4.92 4.49 4.39 – 3.62 4.93 4.70 5.17 4.07 3.26 2.73 3.09 4.56 3.45 5.77 – – 5.22 4.78 4.06 4.20 3.69 3.84 2.90 3.06 4.47 – 3.26 4.71 6.04 3.73 3.67 4.09 3.69 3.75 3.81 4.96 6.03 4.29 4.31 – 4.47 –
1.9 7.5 6.4 4.0 3.7 8.5 10.0 2.6 7.7 2.7 8.7 3.5 4.7 13.8 13.1 3.0 12.9 1.6 2.3 5.7 1.1 1.2 10.5 16.5 8.5 – 1.9 7.1 8.7 6.0 13.0 4.1 14.2 1.2 2.8 3.7 8.9 – – 6.7 0.4 3.9 7.1 1.9 1.2 9.9 0.1 1.1 – 1.3 3.5 5.0 3.1 1.5 2.2 3.8 3.4 6.2 4.7 0.6 5.6 13.3 – 6.9 –
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
37
Table 1 (Continued )
130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 a b c *
Compound name
Exp. Tc (K)
Calc. Tc (K)
A%D*
Exp. Tb (K)
Calc. Tb (K)
A%D
Exp. Pc (MPa)
Calc. Pc (MPa)
A%D
n,n,2-Trimethyl aminobenzene n,n,4-Trimethyl diaminobenzene n,n-Diethyl aminobenzene n,n-Diethyl formamide n,n-Dimethyl formamide n,n-Dimethyl methanamine Naphthalene n-Fenil aminobenzene n-Ethyletanaminea,b,c Nitrobenzene n-Methyl aminobenzene Phenol Phenylacetamidea,b,c Phenyl hydrazine Propane Propanenitrile Propanol Propionic acid 2-methyl ethylesther Propionic acid 2-methyl methylesther Propionic acid ethylesther Propionic acid methylesther Propionic acid propylesther Propylbenzenea,b,c Pyridine Styrene Toluene
668 695 695 660 650 – 747.8 825 496.2 712 701.6 692.9 825 750 369.7 562.5 536.9 563.6 540.7 546.1 530.6 578 638.6 618.7 617.1 593.1
693.4 698.1 703.4 628.3 639.5 – 755.4 814.3 515.8 719.9 716.3 679.4 837.0 757.7 378.4 549.6 532.6 575.0 556.4 557.0 520.6 579.2 640.7 589.6 632.7 604.0
3.8 0.5 1.2 4.8 1.6 – 1.0 1.3 4.0 1.1 2.1 2.0 1.5 1.0 2.4 2.3 0.8 2.0 2.9 2.0 1.9 0.2 0.3 4.7 2.5 1.8
457.95 482.76 488.65 442.2 425.65 – 491.13 575.15 329.16 481.45 467.76 454.89 576 514.13 231.14 370.3 370.4 383.2 365.27 371.94 352.7 395.45 430.17 388.38 418.15 383.73
473.4 475.2 493.8 427.2 410.0 – 493.5 558.7 338.8 480.3 477.5 454.2 576.1 510.0 229.5 359.7 363.2 394.7 375.0 374.3 338.9 395.5 439.5 383.9 424.9 392.3
3.4 1.6 1.1 3.4 3.7 – 0.5 2.9 2.9 0.2 2.1 0.2 0.0 0.8 0.7 2.9 1.9 3.0 2.7 0.6 3.9 0.0 2.2 1.2 1.6 2.2
3.12 3.75 3.2 5 5.5 4.1 4.04 3.8 3.92 – 5.2 6.13 4.7 5.2 4.25 4.18 5.12 3.02 3.43 3.36 4 3 3.21 5.88 3.69 4.21
3.08 3.27 3.09 4.35 5.55 4.27 4.18 3.96 4.19 – 5.27 5.63 4.56 5.61 4.20 4.72 5.26 3.07 3.34 3.41 4.13 3.11 3.18 5.74 3.82 4.12
1.2 12.8 3.6 13.0 0.8 4.1 3.5 4.2 6.9 – 1.3 8.1 3.0 7.9 1.1 12.9 2.8 1.7 2.5 1.5 3.2 3.8 0.9 2.4 3.5 2.2
compound in the validation set for the boiling point correlation. compound in the validation set for the critical temperature correlation. compound in the validation set for the critical pressure correlation. |Y −Y | A%D = expYexpcalc · 100.
2.3. Development of the correlations The aim of this stage is to select those descriptors more strictly correlated to the macroscopic property of interest. The contribution of each descriptor is assumed to be linear and a multi-parameter correlation with the following form is developed: Y = a0 + a1 X1 + a2 X2 + · · · + an Xn
(1)
where Xi is the ith descriptor and Y is the property being calculated. Among the huge number of molecular descriptors calculated by the CODESSA software, a pre-selection is performed, removing descriptors not available for each structures and those having a constant value for all structures. The heuristic algorithm was chosen for the development of the correlation: starting from one-parameter correlations, multi-linear ones are developed in a step-by-step procedure. According to this method, descriptors matching any of the following criteria are eliminated: (i) the Fisher F-test’s value for the one-parameter correlations is lower than 1.0; (ii) the R2 value of the one-parameter correlation is lower than a user-defined value (R2min = 0.1); (iii) the Student’s t-value in the one-parameter correlation and in the multi-parameter correlation are respectively lower than the userdefined values t1 = 1.5 and t2 = 1.95; (iv) the pair-correlation coefficient is greater than a user-defined value (rfull = 0.99). A high value of the pair-correlation coefficient for two descriptors
means that they can be considered collinear: if it exceeds the fixed value (rfull ), the descriptor with the higher R2 is retained for further investigation while the other is discarded. The use of the branching criterion (NS = 3) and the significant pair-correlation coefficient (rsig = 0.8) are further explained in this paragraph. The last parameters requested by the heuristic algorithm is the maximum number of descriptors (ND) that was set to 8 for comparison with other literature works [17]. All the remaining descriptors are, then, listed in decreasing order according to the one-parameter R2 . The selection of the best correlations proceeds as follows. 1. Starting from the top descriptor, the two-parameter correlations are calculated by combining each descriptor with each of the remaining ones. In this procedure the correlations involving pairs of descriptors with pair-correlation coefficient above rsig are not considered. Descriptors that do not fulfill the multi-parameter t-test are discarded as well. This procedure is continued until for some ith descriptor in the list no correlations with an F-value above one-third of the maximum are found. 2. The NS best pairs (those with the highest F-value) are selected and these correlations compose the working set. 3. The remaining descriptors are added one-by-one to each of the working set correlations. If the resulting three-parameter correlation has a F-value greater than the F-value of the two-parameter correlation multiplied by p/(p + 1) where p is
38
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
the number of parameters of the two-parameter correlation (i.e. the number of descriptors plus 1), the three-parameter correlation is considered to be more significant then the two-parameter one and the extended set of descriptors is considered for further treatment. 4. After all descriptors have been applied one-by-one, the NS best three-parameter correlations are selected as the new working set. If the maximum number of descriptors ND is not yet achieved, the working set is recursively submitted to the procedure from step 3. Otherwise the procedure is completed and the best ND-descriptor correlations are found. As a final result the lists of the 10 correlations with the highest R2 and the 10 correlations with the highest F-values are generated. The optimal correlation is defined as that with the highest R2 among the 10 best according to the F-value. 2.4. Validation of the correlations This section consists in testing the accuracy of the correlations by using them to predict the property of interest for a proper validation set. This set is composed of substances similar to those of the training set and for which the property of inter-
est is experimentally known as well. The predictive capability of the correlation is evaluated by comparing the predicted and the experimental value of the property. The root mean square error (R.M.S.E.) was assumed as an estimate of the predictive capability of the correlations. 3. Results and discussions Table 1 reports the experimental and the predicted data for each substance included both in the training and in the validation set. The parity plots, which help in detecting any outliers and provide a prompt indication of the accuracy of the correlation, are reported in Figs. 1–3. Only one molecule (2-methyl-2-propanol) has a relative error (A%D) >5% for Tc and another one (2chlorophenol) for Tb . Pc is much more difficult to predict as it was also stated by Turner et al. [17]: 12 molecules have an error greater than 10%, whereas 32 structures show an error greater than 5%. The descriptors for each correlation are listed in Tables 2–4. The boiling point correlation involves the cubic root of the gravitation index over all bonded atoms, which accounts for the distribution of the atomic masses within the molecular space and quantifies the bulk cohesiveness of a compound due to the
Fig. 1. Plot of experimental vs. calculated boiling point for the training (a) and validation set (b) using the eight-descriptors multi-linear model.
Fig. 2. Plot of experimental vs. calculated critical temperature for the training (a) and validation set (b) using the eight-descriptors multi-linear model.
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
39
Fig. 3. Plot of experimental vs. calculated critical pressure for the training (a) and validation set (b) using the eight-descriptors multi-linear model. Table 2 Regression model for the boiling point developed with the Heuristic algorithm using Codessa Property: Tb (R2 = 0.9864, n = 134) Coefficient
t-Test
Descriptor
−1.33E+03 5.18E+01 1.94E+04 −5.45E+01 7.33E+01 8.94E+00 6.18E+00 1.12E+02 −1.10E+01
−5.37 24.5 27.1 −7.01 8.11 9.69 7.86 5.76 −4.88
Intercept Cubic root of the gravitation index (all bonds) HDCA2 /TMSA Topographic electronic index (all pairs) FHBSA Number of C atoms Tot point-charge component of the molecular dipole Average bond order of a C atom Min atomic state energy for a C atom
dispersion interactions [24]. The boiling temperature increases with the molecular weight and the gravitational index was frequently used in literature either in association with the molecular weight or not [24]. Two descriptors in the boiling point correlation are related to the ability of a compound to form hydrogen bonds: HDCA2 /TMSA, which is the solvent-accessible surface area of the hydrogen donor atoms divided by the total molecular surface area, and FHBSA, which is the surface area of both hydrogens which can be donated and hydrogen acceptor atoms divided by Table 3 Regression model for the critical temperature developed with the Heuristic algorithm using Codessa Property: Tc (R2 = 0.9856, n = 133)
the total molecular surface area. These results agree with literature, where a two-descriptor correlation involving the gravitation index and the hydrogen donor charged surface area proved to account for most of the boiling point variability (R2 = 0.9544, s = 16.2 K) for a set of structurally diverse organic compounds [10]. The topographic electronic index and the total point-charge component of the molecular dipole account for the polar intermolecular interactions that contribute to the liquid phase cohesiveness in association with the dispersion forces and the hydrogen bonds. Finally, the number of C atoms, the average bond order of a C atom and the minimum atomic state energy for a C atom quantify the bond strength between the C atoms. A molecule locked in a rigid conformation due to strong intramolecular interactions is in fact less free to move and is expected to have a higher boiling point. Five descriptors are involved both in the boiling point and in the critical temperature correlations: this was expected since both properties depend on the interactions between molecules. The complementary information content (CIC0 ) and the number of aromatic bonds are strictly present in the Tc correlation. The former is a topological descriptor that encodes the molecule complexity and quantifies the heterogeneity and redundancy of topological neighbourhoods of atoms in molecules [25] while the number of aromatic bonds can roughly be related to the flexibility of the structure. Table 4 Regression model for critical pressure developed with the Heuristic algorithm using Codessa
Coefficient
t-Test value
Descriptor
Property: Pc (R2 = 0.9209, n = 121)
3.68E+03 7.65E+01 2.75E+04 −5.74E+01 1.69E+02 4.78E+00 7.12E+00
3.16 27.9 29.5 −9.07 7.65 4.94 6.64
Intercept Cubic root of gravitation index (all bonds) HDCA2 /TMSA Topographic electronic index (all pairs) Average bond order of a C atom Number of aromatic bonds Tot point-charge component of the molecular dipole Average complementary information content (order 0) Max atomic state energy for a C atom
Coefficient
t-Test value
Descriptor
4.78E+01 −8.59E−01 8.86E−02 −8.81E+00 −3.42E+01 7.23E−02 4.03E+00 3.65E−01 3.28E−02
9.66 −15.6 7.57 −5.82 −6.81 10.1 9.09 8.46 2.74
Intercept Kier shape index (order 1) HA dependent HDCA-1 Min (>0.1) bond order of a H atom Max SIGMA–SIGMA bond order WPSA-2 weight. PPSA (PPSA2 × TMSA/1000) RPCG Rel. Pos. charge (QMPOS/QTPLUS) Randic index (order 3) WNSA-2 weight. PNSA (PNSA2 × TMSA/1000)
2.03E+01
4.68
3.82E+01
3.37
40
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
The critical pressure correlation does not share any descriptor with the critical and the boiling point ones. Two topological descriptors (the Kier shape index and the Randic index) are involved in the critical pressure correlation. The first order Kier shape index is related to the cyclicity of a molecule and it increases in the following direction: aromatic → cyclic → linear structure [25]. The Randic index (order 3) belongs to the -indices, each encoding some aspects of the molecular connectivity [26], the order being related to the number of atoms involved in the molecular connectivity subgraph. The presence of these descriptors suggests that the critical pressure is affected by the shape of the molecule. This is due to the steric effects that become more consistent as the distance between the molecules shortens with the increasing pressure. Three descriptors belong to the class of the charged partial surface area descriptors (CPSA descriptors) which combine shape and electronic information to characterize the polar interactions between molecules [27]. These are: the partial charge of the most positive charged atom divided by the total positive charge (RPCG), the total charge weighted positive surface area multiplied by the total solvent accessible surface area and divided by 1000 (WPSA2 ), the total charge weighted negative surface area multiplied by the total solvent accessible surface area and divided by 1000 (WNSA2 ). Hydrogen bonding is also considered in the Pc correlation via the HDCA1 , defined as the sum of charged surface areas of hydrogens which can be donated. Finally, the bond order of the H atom and the SIGMA–SIGMA bond order, having reference to intramolecular interactions, again take into account for the molecular flexibility of the structure. The performances of the correlations obtained in this work were compared with the QSPR models presented by the Jurs’s research group: the one by Wessel and Jurs [16] as far as the boiling point is concerned and the ones by Turner et al. [17] for the critical properties. The literature boiling point correlation contains 10 descriptors but 3 descriptors pertain exclusively to the S-containing and the F-containing structures (the number of sulfide groups, the number of sulfurs relative to the total number of atoms and the number of fluorines). Their presence was forced by the researchers (though these descriptors had been originally removed by the software because of more than 90% identical values) to correct the large residuals exhibited by the molecules containing S and F as heteroatoms. The contribution of these descriptors vanishes if the structure does not contain any S or F atoms. Though the number of fluorines had been forced in the correlation, 17 structures out of 277 showed a relative error higher than 5% and 10 of them were F-containing structures. Wessel and Jurs selected 298 structures from the DIPPR 801 database, each containing at least one heteroatom (N, O, S, F, Cl, Br, I); hydrocarbons were excluded from the database because the same authors had formerly developed a specific correlation [28]. The authors split the database into two subsets (a N-excluded and a N-containing data subset) and developed a correlation for each subset: the root mean square error (R.M.S.E.) for the N-excluded database is 11.6 K while for the other is 10.7 K.
Table 5 Comparison between Marrero–Gani GC method [5], Wessel and Jurs QSPR model [6] and this work for the boiling point correlation
Number of substances in training set Number of substances in validation set Number of descriptors R2a R.M.S.E. for training set (K) R.M.S.E. for validation set (K)
Marrero–Gani GC method
Wessel and Jurs’s model
This work
108
248
135
15
27
20
– – 15.49
10 0.991 11.6
8 0.985 9.10
12.70
10.5
7.33
In this work no distinction was done between heteroatoms; many aromatics were included (68% with respect to 15% of the N-excluded database by Wessel and Jurs), 33% of them do not contain any heteroatom and 46% of them contain nitrogen. The R.M.S.E. for the boiling point correlation is 9.1 K but it decreases to 7.8 K if the N-containing structures are excluded. If the correlation is exclusively applied to the N-containing molecules the R.M.S.E. is 10.1 K. F-containing structures and S-containing structures were not considered in this work because they need a specific approach in light of the results obtained by Wessel and Jurs [16]. Table 5 reports the comparison between this work and literature as far as the boiling point is concerned. The performance of the correlation was also compared with the Marrero–Gani GC method but the GC method could be applied on a smaller set of compounds (108) because the lack of functional groups precluded the use of the method for the rest of the set. The R.M.S.E. both for the training and the validation set are at least 5 K lower when compared to those obtained with the GC method, even though the correlations were derived on a smaller number of substances and chemical families. The adjusted R-squared correlation coefficient (R2a ) is also reported since it allows to compare the results of models with different number of parameters and different number of observations. It is defined as follows: n−1 R2a = 1 − (1 − R2 ) (2) n−p where n is the number of the database compounds and p is the number of parameters of the multi-linear equation. The results obtained for the critical properties are reported in Table 6. The literature data set contained 165 structures whereof 46% are hydrocarbons, 37% are oxygen-containing compounds, 14% are nitrogen-containing compounds and 3% are halogencontaining compounds. It was richer in hydrocarbons (68%) than in aromatics (21%) and a noticeable number of cyclic structures were included (11%) with respect to this work data set where the cyclic molecules are 1% of the total. The authors developed a eight-descriptors correlation both for the critical temperature and the critical pressure with a correlation coefficient of 0.993 for the temperature and 0.964 for the pressure. The cubic root of the gravitation index is present both in literature and in this work for the critical temperature correlation whereas no descriptor
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
41
Table 6 Comparison between Marrero–Gani GC method [5], Wessel and Jurs [6] and this work for the critical temperature and the critical pressure correlations
Number of substances in training set Number of substances in validation set Number of descriptors R2a for training set R.M.S.E. for training set R.M.S.E. for validation set
Marrero–Gani GC method
Turner et al. QSPR model
This work
Tc
Pc
Tc
Pc
Tc
Pc
108 15 – – 20.48 K 13.41 K
94 13 – – 0.35 MPa 0.31 MPa
147 18 8 0.993 9.16 K 9.13 K
147 18 8 0.925 0.21 MPa 0.28 MPa
133 20 8 0.984 12.6 K 9.66 K
121 18 8 0.936 0.25 MPa 0.28 MPa
accounts for hydrogen bonding in the literature work, presumably because this class of descriptors had not been implemented yet in the ADAPT software at the time when the work was presented. Two shape indices (the Wiener number and the number of paths length 3) are included also in the literature correlation confirming that the steric effects are decisive for the critical pressure determination. The charge distribution is a key factor as well: descriptors related to the distribution of the positive and the negative charges are present both in this work and in the literature. In particular WNSA2 is present in both correlations. Table 6 also reports the details about the performances of the QSPR correlations with respect to the Marrero–Gani GC method. It can be observed that the QSPR models have better ability to correlate and predict both the critical temperature and the critical pressure but the enhancement in the QSPR model performances is more significant for the critical temperature than the pressure.
p Pc rfull
4. Conclusions
[1] W.J. Lyman, W.F. Reehl, D.H. Rosenblatt, Handbook of Chemical Property Estimation Methods, American Chemical Society, Washington, DC, 1990. [2] C.H. Fisher, Chem. Eng. 96 (1989) 157–158. [3] B. Sladkov, J. Appl. Chem. U.S.S.R. 64 (1991) 11. [4] A.L. Lydersen, Estimation of Critical Properties of Organic Compounds, University of Wisconsin College Engineering, Eng. Exp. Stn. Rep. 3, Madison Wisconsin, 1955. [5] K.G. Joback, R.C. Reid, Chem. Eng. Commun. 57 (1987) 233–243. [6] J. Marrero, R. Gani, Fluid Phase Equilib. 183–184 (2001) 183–208. [7] J. Chen, X. Quan, Z. Yazhi, Y. Yan, F. Yang, Chemosphere 44 (2001) 1369–1374. [8] R. Hiob, M. Karelson, Comput. Chem. 26 (2002) 237–243. [9] M. Karelson, A. Perkons, Comput. Chem. 23 (1999) 49–59. [10] A.R. Katritzky, U. Maran, V.S. Lobanov, M. Karelson, J. Chem. Inf. Comput. Sci. 40 (2000) 1–18. [11] P.R. Duchowicz, E.A. Castro, F.M. Fernandez, M.P. Gonzales, Chem. Phys. Lett. 412 (2005) 376–380. [12] L.M. Egolf, M.D. Wessel, P.C. Jurs, J. Chem. Inf. Comp. Sci. 34 (1994) 947–956. [13] O. Ivanciuc, T. Ivanciuc, A.T. Balaban, Tetrahedron 54 (1998) 9129–9142. [14] A.R. Katritzky, L. Mu, V.S. Lobanov, M. Karelson, J. Phys. Chem. 100 (24) (1996) 10400–10407. [15] A.R. Katritzky, V.S. Lobanov, M. Karelson, J. Chem. Inf. Comp. Sci. 38 (1998) 28–41. [16] M.D. Wessel, P.C. Jurs, J. Chem. Inf. Comput. Sci. 35 (1995) 841–850. [17] B.E. Turner, C.L. Costello, P.C. Jurs, J. Chem. Inf. Comput. Sci. 38 (1998) 639–645. [18] Design Institute for Physical Property Data, DIPPR® 801 Database— electronic version with Diadem® , 2002. [19] K.H. Simmrock, R. Janowsky, A. Ohnsorge, Critical Data of Pure Substances, Chemistry Data Series, vol. II, Part 2, DECHEMA, Frankfurt am Main, 1986. [20] AMPAC 8.15 Semichem, Inc. 1649 Shawnee Mission KS 66222 USA, 2004.
Three multi-linear equations for the prediction of the normal boiling point, the critical temperature and the critical pressure were developed using the CODESSA software. Each correlation involved eight descriptors belonging to the classes of constitutional, geometrical, topological, electrostatic and quanto-chemical descriptors. The performances of the correlations were compared with two literature works [16,17] and analogous predictive capability was obtained despite of the different composition of the database. In this work the aromatics structures are the majority (68%) while in the two literature works the presence of aromatics does not exceed 24%. As far as the boiling point is concerned, the literature database was exclusively composed by heteroatom-containing structures while this work also included hydrocarbons without any heteroatoms (24%). Moreover, the N-containing compounds were treated apart from the others while no distinction was carried out among heteroatoms in this work. Finally, the best available group contribution method (that of Marrero–Gani) was used as a reference for assessing the quality of the QSPR correlations. It was found that the QSPR models were much more accurate than the GC method in all cases. List of symbols a0 , a1 , an coefficients of the multilinear equation in Eq. (1) F Fisher’s test value n number of compounds
rsig R Ra t Tb Tc Xi Y
number of parameters in the correlation critical pressure (MPa) minimum value of the pair correlation coefficient for two descriptors to be considered collinear maximum value of the pair correlation coefficient for two descriptors to be considered non-collinear correlation coefficient adjusted correlation coefficient Student’s test value boiling temperature (K) critical temperature (K) descriptors of a generic QSPR model in Eq. (1) generic property to be estimated via a QSPR model in Eq. (1)
References
42
D. Sola et al. / Fluid Phase Equilibria 263 (2008) 33–42
[21] M.J.S. Dewar, E.G. Zoebisch, E.F. Healy, J.J.P. Stewart, J. Am. Chem. Soc. 107 (1985) 3902–3909. [22] C.J. Cramer, Essential of Computational Chemistry: Theories and Models, John Wiley & Sons Ltd., West Sussex, England, 2002, p. 136. [23] CODESSA 2.642 Semichem, Inc. 1649 Shawnee Mission KS 66222 USA, 1995. [24] A.R. Katritzky, R. Petrukhin, R. Jain, M. Karelson, J. Chem. Inf. Comput. Sci. 41 (2001) 1521–1530.
[25] J. Devillers, A.T. Balaban, Topological Indices and Related Descriptors in QSAR and QSPR, Gordon and Breach Science Publishers, 1999, p. 563. [26] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, first ed., Wiley-VCH, 2002, p. 84. [27] D.T. Stanton, P.C. Jurs, Anal. Chem. 62 (1990) 2323–2329. [28] M.D. Wessel, P.C. Jurs, J. Chem. Inf. Comput. Sci. 35 (1995) 68–76.