QSPR prediction of densities of organic liquids


PERGAMON

Computers & Chemistry 23 (1999) 49-59

M. Karelson *, A. Perkson

Department of Chemistry, University of Tartu, 2 Jakobi Str, Tartu, EE 2400, Estonia

Received 22 April 1998; accepted 14 September 1998

Abstract

A general quantitative structure-property relationship (QSPR) treatment of a data set of 303 individual structures (containing C, H, N, O, S, F, Cl, Br and I) from a wide cross section of classes of organic liquids has given an excellent two-parameter correlation for densities ($R^2 = 0.9749$, $s^2 = 0.0021$ for $\rho_{20}$). The statistical treatment to find the best multi-parameter correlations from subsets of given size within larger sets of molecular descriptors was performed using the scale forward selection method within the CODESSA (comprehensive descriptors for structural and statistical analysis) program. The descriptors in the correlation equation for the density ($\rho$) are the intrinsic density, calculated as the ratio of the molecular mass to the theoretically calculated van der Waals molecular volume, and the total molecular electrostatic interaction per atom in the molecule. With just a two-parameter equation, the densities of compounds that are unknown, unavailable, not easily handled (toxic, odorous, etc.) or not yet synthesized can be predicted with a considerable degree of confidence. © 1999 Elsevier Science Ltd. All rights reserved.

Keywords: QSAR/QSPR; Density; Prediction; Modeling

1. Introduction

The normal density (i.e. the density at 1 atm and 20 °C) is one of the main physicochemical properties used to characterize and identify a compound. In addition, densities can be used to predict or estimate other physical properties, such as critical pressure, viscosity, thermal conductivity, diffusion coefficients, and surface tension (Liu and Chen, 1995). The density is often the first property measured for a new compound and one of the few characteristics known for almost every compound. The density is usually easy to determine; however, when a chemical is unavailable, as yet unknown, or hazardous to handle, a reliable procedure to predict density would be beneficial. Furthermore, the rapid growth of combinatorial chemistry, where literally millions of new compounds are synthesized and tested without isolation, could render such a procedure very useful for many applications.

Notably, only a few methods to estimate densities from the structure of compounds have been devised (Llano et al., 1993; Kontogeorgis et al., 1995) and, to our knowledge, no QSPR correlations and predictions of densities at ambient conditions have been reported. In the present work, we have developed a two-descriptor QSPR model that successfully correlates ($R^2 = 0.9749$) the densities of 303 organic liquid compounds containing O, N, S, F, Cl, Br and I. The two descriptors involved in this model were (1) the 'intrinsic density' $\rho_R$, defined as the ratio of the molecular mass to the theoretically calculated molecular volume

$$\rho_R = \frac{M_w}{M_{vol}} \qquad (1)$$

where $M_w$ is the molecular mass and $M_{vol}$ denotes the intrinsic molecular volume calculated as the sum of overlapping van der Waals spheres (see footnote 1) of atoms in the molecule (Stouch and Jurs, 1986) and (2) $E_{el.stat}$, the total molecular electrostatic interaction energy per atom. The total molecular electrostatic interaction is defined as the sum of the quantum-chemically calculated electron-electron repulsion, $E_{ee}(AB)$, electron-nuclear attraction, $E_{ne}(AB)$, and nuclear-nuclear repulsion, $E_{nn}(AB)$, energies. The electrostatic energy per atom is thus calculated as follows:

$$E_{el.stat}(AB) = \frac{E_{ee}(AB) + E_{ne}(AB) + E_{nn}(AB)}{N} \qquad (2)$$

where $N$ denotes the number of atoms in the molecule. The first descriptor accounts successfully for most of the variation in the densities of hydrocarbon liquids and halogenated organic liquids, whereas the second descriptor effectively quantifies the contraction due to the interatomic electrostatic attraction and repulsion in the molecule. In the following, we examine the applicability of these two descriptors for the prediction of normal densities for compounds from various classes of organic compounds and investigate the additional factors needed to devise a more general QSPR model for the densities of organic compounds.

* Corresponding author. Fax: 00372 7 465 264; e-mail: [email protected].

Footnote 1: The following van der Waals radii were used in the calculation of volumes (in angstroms): H = 1.20; C: sp = 1.77, sp2 = 1.70, sp3 = 1.70, allenyl = 1.70, aromatic = 1.77; N: sp = 1.60, sp2 = 1.55, sp3 = 1.55, aromatic = 1.60; O: sp2 = 1.52, sp3 = 1.50; S: sp2 = 1.80, sp3 = 1.80; Cl = 1.75; Br = 1.85; I = 1.97.

0097-8485/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved. PII: S0097-8485(98)00029-1

2. Methodology

2.1. Data set

The data set of densities was selected from the CRC Handbook of Chemistry and Physics (1994), the Aldrich Catalog Handbook of Fine Chemicals (1996) and the Merck Index (1989). The choice was based on maximum diversity of the structures of the compounds and of the numerical values of the densities. The final set of 303 compounds, all liquid at 20 °C, was representative of all major classes of organic compounds containing C, H, O, N, S, F, Cl, Br, and I, and included saturated and unsaturated hydrocarbons, halogenated hydrocarbons, esters, aldehydes, organic acids, alcohols, thiols, and aromatic compounds (Table 1).

2.2. Descriptor calculation and regression analysis

All structures were drawn with the PCMODEL program (PCMODEL, 1992). The conformational search and the pre-optimization of the geometry of the compounds were performed using the molecular mechanics MMX method. The pre-optimized structures were submitted to the AMPAC 5.0 program (AMPAC, 1994) for further geometry refinement and for the calculation of molecular geometry and wavefunction-related descriptors using the semi-empirical AM1 parametrization (Dewar et al., 1985). The output files from AMPAC were transferred to the CODESSA software (CODESSA 2.0 Manual, 1994) to calculate molecular descriptors. The procedures for the calculation of a large selection of molecular descriptors, including a variety of constitutional, topological, geometric, and electrostatic descriptors, have been implemented in CODESSA. The quantum-chemical descriptors extracted from the output of the molecular orbital calculations (AMPAC 5.0, 1994; Stewart, 1989) included Mulliken net atomic charges, the total dipole moment of the molecule and its components, the frontier molecular orbital (FMO) energies and the respective FMO reactivity indices, molecular polarizability terms, bond orders, and different energy partitioning terms (Katritzky et al., 1995; Karelson et al., 1996). The quantum-chemically calculated atomic charges were also used to calculate the charged partial surface area (CPSA) descriptors introduced by Stanton and Jurs (Stanton and Jurs, 1990; Stanton et al., 1992). The topological descriptors describe the atomic connectivity in the molecule (Kier and Hall, 1986; Stankevich et al., 1988; El-Basil and Randic, 1992).

Altogether, more than 700 molecular descriptors were calculated. For each compound, the actual number of descriptors calculated depended on the specific structure of the compound. Descriptors that were not defined for every compound were either omitted or padded with zeros when physically justified (e.g. the partial charge on a Cl atom or the number of Cl atoms in a compound without Cl atoms). The scale forward selection approach (described elsewhere by Draper and Smith, 1966; Dillon and Goldstein, 1984), available in the CODESSA program, was used to develop multi-linear correlation models from the pool of calculated descriptors. This procedure (described in detail elsewhere, Katritzky et al., 1994; Katritzky et al., 1996) proceeds from the statistical significance and collinearity control of the descriptors selected into the correlation equations.

3. Results and discussion

3.1.
Correlation analysis of subsets

First, the original set of 303 compounds was divided into subsets according to the elements represented in a compound: (1) set CH included 72 compounds containing only C and H atoms; (2) set O included 39 oxygen-containing compounds; (3) set N included 36 nitrogen-containing compounds; (4) the halogen set included 61 F-, Cl-, Br- and I-containing compounds; (5) set S consisted of 10 sulfur-containing compounds. For each subset, the general two-parameter correlation model (involving $\rho_R$ and $E_{el.stat}$) was refitted. Also, one-parameter correlations were calculated for each subset using only the intrinsic density, $\rho_R$ (column 3 in Table 2). These one-parameter fits were poor for

Table 1 Experimental and predicted densities for 303 organic compounds used in the study

| Structure | $\rho_{exp}$ | $\rho_{calc}$ | $\rho_{calc} - \rho_{exp}$ | Subset |
|---|---|---|---|---|
| (E)-1,2-Dichloroethene | 1.2565 | 1.3686 | 0.1121 | Halogen |
| (E)-2-Butenal | 0.8516 | 0.9166 | 0.0650 | Overall |
| (Z)-2-Pentene | 0.6556 | 0.7034 | 0.0478 | CH |
| 1,1,1,2-Tetrachloroethane | 1.5406 | 1.5240 | -0.0166 | Halogen |
| 1,1,1-Trichloroethane | 1.3390 | 1.3883 | 0.0493 | Halogen |
| 1,1,2,2-Tetrachloroethane | 1.5953 | 1.5334 | -0.0619 | Halogen |
| 1,1,2-Trichloroethane | 1.4397 | 1.3893 | -0.0504 | Halogen |
| 1,1-Dichloroethane | 1.1757 | 1.2136 | 0.0379 | Halogen |
| 1,1-Dichloroethene | 1.2130 | 1.3679 | 0.1549 | Halogen |
| 1,2,3-Trichlorobenzene | 1.4533 | 1.4560 | 0.0027 | Halogen |
| 1,2,3-Trimethylbenzene | 0.8944 | 0.8637 | -0.0307 | CH |
| 1,2,4-Trichlorobenzene | 1.4590 | 1.4561 | -0.0029 | Halogen |
| 1,2,4-Trimethylbenzene | 0.8758 | 0.8639 | -0.0119 | CH |
| 1,2-Dibromoethane | 2.1791 | 2.0685 | -0.1106 | Halogen |
| 1,2-Dichlorobenzene | 1.3059 | 1.3139 | 0.0080 | Halogen |
| 1,2-Dichloroethane | 1.2351 | 1.2143 | -0.0208 | Halogen |
| 1,2-Dichloropropane | 1.1560 | 1.1310 | -0.0250 | Halogen |
| 1,3,5-Cycloheptatriene | 0.8875 | 0.9245 | 0.0370 | CH |
| 1,3,5-Trimethylbenzene | 0.8652 | 0.8605 | -0.0047 | CH |
| 1,3-Dichlorobenzene | 1.2884 | 1.3142 | 0.0258 | Halogen |
| 1,3-Dichloropropane | 1.1876 | 1.1281 | -0.0595 | Halogen |
| 1,4-Dichlorobenzene | 1.2475 | 1.3139 | 0.0664 | Halogen |
| 1,4-Dichlorobutane | 1.1408 | 1.0738 | -0.0670 | Halogen |
| 1,4-Dioxane | 1.0337 | 1.0157 | -0.0180 | CH |
| 1,4-Methylnaphthalene | 1.0157 | 0.9666 | -0.0491 | CH |
| 1,4-Pentadiene | 0.6608 | 0.7563 | 0.0955 | CH |
| 1,5-Hexadiene | 0.6878 | 0.7569 | 0.0691 | CH |
| 1-Bromo-2-methylpropane | 1.2720 | 1.2678 | -0.0042 | Halogen |
| 1-Bromobutane | 1.2578 | 1.2654 | 0.0076 | Halogen |
| 1-Bromoheptane | 1.1330 | 1.0972 | -0.0358 | Halogen |
| 1-Bromohexane | 1.1725 | 1.1419 | -0.0306 | Halogen |
| 1-Bromooctane | 1.1080 | 1.0632 | -0.0448 | Halogen |
| 1-Bromopentane | 1.2180 | 1.1952 | -0.0228 | Halogen |
| 1-Bromopropane | 1.3530 | 1.3689 | 0.0159 | Halogen |
| 1-Butanol | 0.8098 | 0.7810 | -0.0288 | O |
| 1-Chloro-2-propene | 0.9350 | 0.9988 | 0.0638 | Halogen |
| 1-Chlorobutane | 0.8919 | 0.8863 | -0.0056 | Halogen |
| 1-Chloroheptane | 0.8758 | 0.8432 | -0.0326 | Halogen |
| 1-Chlorohexane | 0.8759 | 0.8531 | -0.0228 | Halogen |
| 1-Chloropentane | 0.8820 | 0.8673 | -0.0147 | Halogen |
| 1-Chloropropane | 0.8820 | 0.9112 | 0.0292 | Halogen |
| 1-Cyanobutane | 0.8014 | 0.8135 | 0.0121 | Overall |
| 1-Cyanopropane | 0.7940 | 0.8224 | 0.0284 | Overall |
| 1-Ethylnaphthalene | 1.0082 | 0.9603 | -0.0478 | CH |
| 1-Heptanol | 0.8219 | 0.7738 | -0.0481 | O |
| 1-Heptene | 0.6970 | 0.7195 | 0.0225 | CH |
| 1-Heptyne | 0.7369 | 0.7467 | 0.0098 | CH |
| 1-Hexanol | 0.8186 | 0.7759 | -0.0427 | O |
| 1-Hexene | 0.6732 | 0.7101 | 0.0369 | CH |
| 1-Hexyne | 0.7149 | 0.7454 | 0.0305 | CH |
| 1-Iodobutane | 1.6170 | 1.6196 | 0.0026 | Halogen |
| 1-Iodoheptane | 1.3791 | 1.3410 | -0.0381 | Halogen |
| 1-Iodohexane | 1.4414 | 1.4114 | -0.0300 | Halogen |
| 1-Iodopentane | 1.5161 | 1.5073 | -0.0088 | Halogen |
| 1-Iodopropane | 1.7471 | 1.7873 | 0.0402 | Halogen |
| 1-Methylnaphthalene | 1.0203 | 0.9843 | -0.0360 | CH |
| 1-Naphthol | 1.0954 | 1.1073 | 0.0119 | O |
| 1-Naphthylamine | 1.1229 | 1.0704 | -0.0525 | N |
| 1-Nitropropane | 1.0030 | 1.0185 | 0.0155 | N |
| 1-Nonanol | 0.8274 | 0.7710 | -0.0564 | O |
| 1-Nonene | 0.7292 | 0.7273 | -0.0019 | CH |
| 1-Octanol | 0.8246 | 0.7714 | -0.0532 | O |
| 1-Octene | 0.7144 | 0.7212 | 0.0067 | CH |
| 1-Octyne | 0.7460 | 0.7447 | -0.0013 | CH |
| 1-Pentanol | 0.8144 | 0.7758 | -0.0386 | O |
| 1-Pentene | 0.6405 | 0.7032 | 0.0627 | CH |
| 1-Pentyne | 0.7127 | 0.7378 | 0.0251 | CH |
| 1-Propanol | 0.8044 | 0.7855 | -0.0189 | O |
| 2,2,4-Trimethylpentane | 0.6919 | 0.6982 | 0.0063 | CH |
| 2,2,5-Trimethylhexane | 0.7160 | 0.7017 | -0.0143 | CH |
| 2,2-Dimethylbutane | 0.6492 | 0.6783 | 0.0291 | CH |
| 2,2-Dimethylpentane | 0.6738 | 0.6887 | 0.0148 | CH |
| 2,3-Dimethylbutane | 0.6616 | 0.6771 | 0.0155 | CH |
| 2,3-Dimethylnaphthalene | 1.0080 | 0.9634 | -0.0446 | CH |
| 2,3-Dimethylpentane | 0.6951 | 0.6882 | -0.0069 | CH |
| 2,3-dimethyl-1,3-butadiene | 0.7264 | 0.7657 | 0.0393 | CH |
| 2,4-Dimethylpentane | 0.6727 | 0.6872 | 0.0145 | CH |
| 2,4-Dimethylphenol | 1.0360 | 0.9666 | -0.0694 | O |
| 2,4-Dimethylpyridine | 0.9273 | 0.9295 | 0.0022 | N |
| 2,4-dimethyl-3-pentanone | 0.8062 | 0.8080 | 0.0018 | Overall |
| 2,6-Dimethylaniline | 0.9760 | 0.9327 | -0.0433 | N |
| 2,6-Dimethylpyridine | 0.9420 | 0.9277 | -0.0143 | N |
| 2-Bromo-2-methylpropane | 1.2220 | 1.2642 | 0.0422 | Halogen |
| 2-Bromopropane | 1.3100 | 1.3688 | 0.0588 | Halogen |
| 2-Butanol | 0.8060 | 0.7797 | -0.0263 | O |
| 2-Chloro-2-methylpropane | 0.8470 | 0.8839 | 0.0369 | Halogen |
| 2-Chloroaniline | 1.2130 | 1.1846 | -0.0284 | Halogen |
| 2-Chlorobutane | 0.8707 | 0.8850 | 0.0143 | Halogen |
| 2-Chlorophenol | 1.2350 | 1.2446 | 0.0096 | Halogen |
| 2-Chloropropane | 0.8590 | 0.9101 | 0.0511 | Halogen |
| 2-Chloropyridine | 1.2050 | 1.2347 | 0.0297 | Halogen |
| 2-Chlorotoluene | 1.0817 | 1.0770 | -0.0047 | Halogen |
| 2-Decanone | 0.8230 | 0.7947 | -0.0283 | Overall |
| 2-Ethyltoluene | 0.8807 | 0.8584 | -0.0223 | CH |
| 2-Heptanone | 0.8220 | 0.8084 | -0.0136 | Overall |
| 2-Hexanone | 0.8300 | 0.8152 | -0.0148 | Overall |
| 2-Methoxyaniline | 1.0923 | 1.0375 | -0.0548 | N |
| 2-Methoxyethanol | 0.9370 | 0.9270 | -0.0100 | Overall |
| 2-Methyl-1,3-butadiene | 0.6809 | 0.7589 | 0.0780 | CH |
| 2-Methyl-1-butanol | 0.8193 | 0.7810 | -0.0383 | O |
| 2-Methyl-1-propanol | 0.8050 | 0.7813 | -0.0237 | O |
| 2-Methyl-2-butanol | 0.8090 | 0.7783 | -0.0307 | O |
| 2-Methyl-2-butene | 0.6623 | 0.7038 | 0.0415 | CH |
| 2-Methyl-2-pentanol | 0.8090 | 0.7759 | -0.0331 | O |
| 2-Methyl-2-propanol | 0.7887 | 0.7785 | -0.0102 | O |
| 2-Methyl-3-pentanol | 0.8243 | 0.7783 | -0.0460 | O |
| 2-Methylbutane | 0.6197 | 0.6594 | 0.0397 | CH |
| 2-Methylpentane | 0.6599 | 0.6745 | 0.0146 | CH |
| 2-Methylpyridine | 0.9500 | 0.9584 | 0.0084 | N |
| 2-Methylthiophene | 1.0194 | 0.9467 | -0.0727 | S |
| 2-Naphthol | 1.2170 | 1.1063 | -0.1107 | O |
| 2-Naphthylamine | 1.0614 | 1.0681 | 0.0067 | N |
| 2-Nitropropane | 0.9920 | 1.0172 | 0.0252 | N |
| 2-Nonanone | 0.8188 | 0.7980 | -0.0208 | Overall |
| 2-Octanone | 0.8202 | 0.8026 | -0.0176 | Overall |
| 2-Pentanol | 0.8103 | 0.7759 | -0.0344 | O |
| 2-Phenylethanol | 0.9580 | 0.9566 | -0.0014 | O |
| 2-Propenol | 0.7851 | 0.8658 | 0.0807 | O |
| 2-Undecanone | 0.8260 | 0.7921 | -0.0339 | Overall |
| 2-Pentanone | 0.8089 | 0.8225 | 0.0136 | Overall |
| 3-Chloroaniline | 1.2160 | 1.1804 | -0.0356 | Halogen |
| 3-Chlorophenol | 1.2680 | 1.2429 | -0.0251 | Halogen |
| 3-Ethylphenol | 1.0250 | 0.9670 | -0.0580 | O |
| 3-Ethylpyridine | 0.9500 | 0.9316 | -0.0184 | N |
| 3-Hexanol | 0.8188 | 0.7731 | -0.0457 | O |
| 3-Methoxyaniline | 1.0960 | 1.0357 | -0.0603 | N |
| 3-Methyl-1-butanol | 0.8129 | 0.7787 | -0.0342 | O |
| 3-Methyl-1-butene | 0.6480 | 0.7032 | 0.0552 | Overall |
| 3-Methylbutanoic acid | 0.9332 | 0.9387 | 0.0055 | Overall |
| 3-Methylheptane | 0.7058 | 0.6962 | -0.0096 | Overall |
| 3-Methylhexane | 0.6868 | 0.6879 | 0.0011 | Overall |
| 3-Methylpentane | 0.6643 | 0.6745 | 0.0102 | Overall |
| 3-Methylpyridine | 0.9613 | 0.9606 | -0.0007 | N |
| 3-Nitrotoluene | 1.1581 | 1.1379 | -0.0202 | N |
| 3-Pentanol | 0.8150 | 0.7798 | -0.0352 | O |
| 3-Phenyl-1-propanol | 0.9950 | 0.9329 | -0.0621 | Overall |
| 3-Methyl-2-butanone | 0.8046 | 0.8237 | 0.0191 | Overall |
| 3-Pentanone | 0.8159 | 0.8209 | 0.0050 | Overall |
| 4-Bromotoluene | 1.3898 | 1.3728 | -0.0170 | Halogen |
| 4-Ethylpyridine | 0.9417 | 0.9298 | -0.0119 | N |
| 4-Ethyltoluene | 0.8612 | 0.8580 | -0.0032 | CH |
| 4-Heptanone | 0.8174 | 0.8078 | -0.0096 | Overall |
| 4-Isopropyltoluene | 0.8586 | 0.8459 | -0.0127 | CH |
| 4-Methoxyaniline | 1.0710 | 1.0374 | -0.0336 | N |
| 4-Methyl-2-pentanol | 0.8130 | 0.7771 | -0.0359 | O |
| 4-Methyl-2-pentanone | 0.8017 | 0.8167 | 0.0150 | Overall |
| 4-Methylacetophenone | 0.9891 | 0.9804 | -0.0087 | Overall |
| 4-Methylbenzaldehyde | 1.0194 | 1.0134 | -0.0060 | Overall |
| 4-Methylpyridine | 0.9571 | 0.9594 | 0.0023 | N |
| 4-n-Propylphenol | 1.0890 | 0.9374 | -0.1516 | O |
| 4-tert-Butylphenol | 0.9081 | 0.9230 | 0.0149 | O |
| 5-Nonanone | 0.8180 | 0.7966 | -0.0214 | Overall |
| Acetaldehyde | 0.7834 | 0.8942 | 0.1108 | Overall |
| Acetic acid | 1.0492 | 1.1133 | 0.0641 | Overall |
| Acetonitrile | 0.7857 | 0.8752 | 0.0895 | N |
| Acetophenone | 1.0281 | 1.0133 | -0.0148 | Overall |
| Benzaldehyde | 1.0415 | 1.0602 | 0.0187 | Overall |
| Benzene | 0.8787 | 0.9201 | 0.0415 | CH |
| Benzonitrile | 1.0102 | 1.0690 | 0.0588 | N |
| Benzyl alcohol | 1.0419 | 0.9901 | -0.0518 | Overall |
| Biphenyl | 0.9920 | 1.0023 | 0.0103 | CH |
| Bromobenzene | 1.4951 | 1.4749 | -0.0202 | Halogen |
| Bromomethane | 1.7300 | 1.7969 | -0.0669 | Halogen |
| Butanoic acid | 0.9577 | 0.9736 | 0.0159 | Overall |
| Butanone | 0.8054 | 0.8346 | 0.0292 | Overall |
| Butyl acetate | 0.8810 | 0.9068 | -0.0258 | Overall |
| Butylamine | 0.7401 | 0.7401 | 0.0000 | N |
| Butyraldehyde | 0.8170 | 0.8366 | 0.0196 | Overall |
| Chlorobenzene | 1.1058 | 1.1417 | 0.0359 | Halogen |
| Cycloheptanol | 0.9554 | 0.8525 | -0.1029 | O |
| Cyclohexane | 0.7785 | 0.7601 | -0.0184 | CH |
| Cyclohexanone | 0.9478 | 0.9174 | -0.0304 | Overall |
| Cyclohexene | 0.8102 | 0.8094 | -0.0008 | CH |
| Cyclohexylamine | 0.8191 | 0.8299 | 0.0108 | N |
| Cyclopentane | 0.7457 | 0.7529 | 0.0072 | CH |
| Cyclopentanol | 0.9478 | 0.8745 | -0.0733 | O |
| Cyclopentanone | 0.9487 | 0.9384 | -0.0103 | Overall |
| Cyclopentene | 0.7720 | 0.8109 | 0.0389 | CH |
| Decyl alcohol | 0.8292 | 0.7718 | -0.0574 | O |
| Di-n-Propylamine | 0.7384 | 0.7363 | -0.0021 | N |
| Di-n-butyl ether | 0.7690 | 0.7702 | 0.0012 | Overall |
| Di-n-butylamine | 0.7670 | 0.7448 | -0.0222 | N |
| Di-n-propyl ether | 0.7360 | 0.7687 | 0.0327 | Overall |
| Di-n-propyl sulfide | 0.8358 | 0.7968 | -0.0390 | S |
| Dibromomethane | 2.4969 | 2.4012 | -0.0957 | Halogen |
| Dichloromethane | 1.3360 | 1.3426 | 0.0066 | Halogen |
| Diethyl disulfide | 0.9927 | 0.9135 | -0.0792 | S |
| Diethyl ether | 0.7135 | 0.7677 | 0.0542 | Overall |
| Diethyl sulfide | 0.8370 | 0.8099 | -0.0271 | S |
| Diethylamine | 0.7056 | 0.7305 | 0.0249 | N |
| Diisopropyl ether | 0.7258 | 0.7660 | 0.0402 | Overall |
| Diisopropyl sulfide | 0.8135 | 0.8004 | -0.0131 | S |
| Diisopropylamine | 0.7220 | 0.7360 | 0.0140 | N |
| Ethanol | 0.7893 | 0.7937 | 0.0044 | O |
| Ethyl acetate | 0.9003 | 0.9639 | 0.0636 | Overall |
| Ethyl benzoate | 1.0468 | 1.0487 | 0.0019 | Overall |
| Ethyl phenyl ether | 0.9660 | 0.9491 | -0.0169 | Overall |
| Ethylamine | 0.6829 | 0.7236 | 0.0407 | N |
| Ethylbenzene | 0.8670 | 0.8689 | 0.0019 | Overall |
| Fluorobenzene | 1.0225 | 1.0923 | 0.0698 | Overall |
| Heptanal | 0.8132 | 0.8073 | -0.0059 | Halogen |
| Hexanal | 0.8335 | 0.8119 | -0.0216 | Overall |
| Hexanoic acid | 0.9274 | 0.9121 | -0.0153 | Overall |
| Hexyl acetate | 0.8902 | 0.8759 | -0.0143 | Overall |
| Hexylamine | 0.7630 | 0.7428 | -0.0202 | N |
| Indane | 0.9639 | 0.9404 | -0.0235 | CH |
| Iodobenzene | 1.8308 | 1.7969 | -0.0339 | Halogen |
| Iodoethane | 1.9358 | 2.0316 | 0.0958 | Halogen |
| Iodomethane | 2.2790 | 2.4386 | 0.1596 | Halogen |
| Isoamyl acetate | 0.8740 | 0.8894 | 0.0154 | Overall |
| Isoamyl formate | 0.8773 | 0.9060 | 0.0287 | Overall |
| Isobutyl acetate | 0.8700 | 0.9075 | 0.0375 | Overall |
| Isobutyl formate | 0.8776 | 0.9322 | 0.0546 | Overall |
| Isobutylbenzene | 0.8532 | 0.8445 | -0.0087 | CH |
| Isobutyraldehyde | 0.7891 | 0.8350 | 0.0459 | Overall |
| Isopropyl acetate | 0.8690 | 0.9304 | 0.0614 | Overall |
| Isopropyl formate | 0.8728 | 0.9622 | 0.0894 | Overall |
| Isopropylbenzene | 0.8618 | 0.8552 | -0.0066 | Overall |
| Methanol | 0.7914 | 0.8053 | 0.0139 | O |
| Methyl acetate | 0.9244 | 1.0132 | 0.0888 | Overall |
| Methyl benzoate | 1.0880 | 1.0896 | 0.0016 | Overall |
| Methyl cyclohexanecarboxylate | 0.9954 | 0.9553 | -0.0401 | Overall |
| Methyl formate | 0.9742 | 1.0892 | 0.1150 | Overall |
| Methyl pentanoate | 0.8950 | 0.9071 | 0.0121 | Overall |
| Methyl phenyl ether | 0.9954 | 0.9857 | -0.0097 | Overall |
| Methyl propanoate | 0.9148 | 0.9636 | 0.0488 | Overall |
| Methyl tert-butyl ether | 0.7405 | 0.7731 | 0.0326 | Overall |
| Methylcyclohexane | 0.7694 | 0.7606 | -0.0088 | Overall |
| Methylcyclopentane | 0.7486 | 0.7545 | 0.0059 | Overall |
| Morpholine | 1.0005 | 0.9630 | -0.0375 | Overall |
| N,N-dimethylformamide | 0.9940 | 0.9316 | -0.0624 | N |
| N-Methylmorpholine | 0.9051 | 0.9269 | 0.0218 | Overall |
| N-methylaniline | 0.9891 | 0.9425 | -0.0466 | N |
| N-methylpiperidine | 0.8159 | 0.8185 | 0.0026 | N |
| Naphthalene | 1.1680 | 1.0266 | -0.1414 | Overall |
| Nitrobenzene | 1.2037 | 1.2037 | 0.0000 | N |
| Nitroethane | 1.0448 | 1.0852 | 0.0404 | N |
| Nitromethane | 1.1371 | 1.1988 | 0.0617 | N |
| Nonanal | 0.8264 | 0.7987 | -0.0277 | Overall |
| Octanal | 0.8211 | 0.8026 | -0.0185 | Overall |
| Octylamine | 0.7769 | 0.7497 | -0.0272 | N |
| Pentachloroethane | 1.6796 | 1.6313 | -0.0483 | Overall |
| Pentanal | 0.8185 | 0.8243 | 0.0058 | Overall |
| Pentanoic acid | 0.9420 | 0.9375 | -0.0045 | Overall |
| Phenol | 1.0545 | 1.0496 | -0.0049 | O |
| Phenyl acetate | 1.0770 | 1.0926 | 0.0156 | Overall |
| Phenyl methyl sulfide | 1.0579 | 0.9913 | -0.0666 | S |
| Propanoic acid | 0.9930 | 1.0212 | 0.0282 | Overall |
| Propanone | 0.7908 | 0.8553 | 0.0645 | Overall |
| Propanonitrile | 0.7818 | 0.8402 | 0.0584 | N |
| Propionaldehyde | 0.8070 | 0.8566 | 0.0496 | Overall |
| Propyl acetate | 0.9147 | 0.9317 | 0.0170 | Overall |
| Propylamine | 0.7190 | 0.7323 | 0.0133 | N |
| Propylbenzene | 0.8620 | 0.8552 | -0.0068 | Overall |
| Pyridine | 0.9772 | 1.0121 | 0.0349 | N |
| Quinoline | 1.0950 | 1.0897 | -0.0053 | Overall |
| Styrene | 0.9074 | 0.9249 | 0.0175 | CH |
| Tetrachloroethylene | 1.6311 | 1.7013 | 0.0702 | Halogen |
| Tetrachloromethane | 1.5950 | 1.6599 | 0.0649 | Halogen |
| Tetrahydrofuran | 0.8890 | 0.8983 | 0.0093 | CH |
| Tetrahydropyran | 0.8810 | 0.8791 | -0.0019 | Overall |
| Thiophene | 1.0644 | 0.9809 | -0.0835 | S |
| Thiophenol | 1.0780 | 1.0588 | -0.0192 | S |
| Toluene | 0.8669 | 0.8882 | 0.0213 | CH |
| Trans-2-heptene | 0.7034 | 0.7189 | 0.0155 | CH |
| Trans-2-hexenal | 0.8480 | 0.8618 | 0.0138 | Overall |
| Trans-2-octenal | 0.8460 | 0.8391 | -0.0069 | Overall |
| Tribromomethane | 2.8900 | 2.7257 | -0.1643 | Halogen |
| Trichloroethylene | 1.4556 | 1.5616 | 0.1060 | Halogen |
| Trichloromethane | 1.4985 | 1.5334 | 0.0349 | Halogen |
| Triethylamine | 0.7229 | 0.7343 | 0.0114 | N |
| Water | 0.9982 | 0.8951 | -0.1031 | Overall |
| alpha-Methylstyrene | 0.9077 | 0.9046 | -0.0031 | CH |
| cis-1,2-Dichloroethylene | 1.2910 | 1.3754 | 0.0844 | Halogen |
| cis-1,2-Dimethylcyclohexane | 0.7960 | 0.7599 | -0.0361 | CH |
| m-Xylene | 0.8642 | 0.8707 | 0.0065 | CH |
| n-Butylbenzene | 0.8600 | 0.8433 | -0.0167 | CH |
| n-Decane | 0.7301 | 0.7034 | -0.0267 | CH |
| n-Heptane | 0.6838 | 0.6857 | 0.0019 | CH |
| n-Heptylamine | 0.7770 | 0.7505 | -0.0265 | N |
| n-Hexane | 0.6603 | 0.6763 | 0.0160 | CH |
| n-Hexylbenzene | 0.8610 | 0.8284 | -0.0326 | CH |
| n-Nonane | 0.7176 | 0.7013 | -0.0163 | CH |
| n-Octane | 0.7036 | 0.6931 | -0.0105 | CH |
| n-Pentane | 0.6260 | 0.6631 | 0.0371 | CH |
| n-Pentylamine | 0.7614 | 0.7405 | -0.0209 | N |
| n-Pentylbenzene | 0.8600 | 0.8359 | -0.0241 | CH |
| n-Pentylcyclopentane | 0.7912 | 0.7536 | -0.0376 | CH |
| n-Propyl butanoate | 0.8790 | 0.8871 | 0.0081 | Overall |
| n-Propyl propanoate | 0.8730 | 0.9072 | 0.0342 | Overall |
| n-Propylcyclopentane | 0.7763 | 0.7547 | -0.0216 | Overall |
| n-butanethiol | 0.8580 | 0.8137 | -0.0443 | S |
| n-propanethiol | 0.8357 | 0.8204 | -0.0153 | S |
| o-Cresol | 1.0465 | 1.0006 | -0.0459 | O |
| o-Toluidine | 1.0040 | 0.9548 | -0.0492 | N |
| o-Xylene | 0.8802 | 0.8745 | -0.0057 | CH |
| p-Cresol | 1.0347 | 0.9977 | -0.0370 | O |
| p-Toluidine | 1.0460 | 0.9537 | -0.0923 | N |
| p-Xylene | 0.8610 | 0.8740 | 0.0130 | CH |
| sec-Butyl benzene | 0.8210 | 0.8418 | 0.0208 | CH |
| tert-Butylbenzene | 0.8670 | 0.8482 | -0.0188 | CH |
| trans-1,4-Dimethylcyclohexane | 0.7730 | 0.7645 | -0.0085 | CH |

Table 2 Correlation analysis for subsets of compounds: n denotes the number of structures; $R^2$ and $s$ denote the squared correlation coefficient and the standard deviation of a correlation equation, respectively

| Subset | n | $R^2$ (a) | $s$ (a) | $R^2$ (b) | $s$ (b) | $R^2$ (c) | $s$ (c) |
|---|---|---|---|---|---|---|---|
| CH | 71 | 0.9423 | 0.028 | 0.9511 | 0.026 | 0.9771 | 0.017 |
| O | 39 | 0.8349 | 0.044 | 0.8786 | 0.037 | 0.9198 | 0.032 |
| N | 36 | 0.9497 | 0.039 | 0.9546 | 0.037 | 0.9708 | 0.03 |
| S | 10 | 0.8431 | 0.047 | 0.9567 | 0.026 | 0.9891 | 0.014 |
| Halogens | 61 | 0.9464 | 0.085 | 0.9754 | 0.058 | 0.9848 | 0.046 |

(a) One-parameter correlation equation with the relative (intrinsic) density, $\rho_R$.
(b) Correlation equation with the intrinsic density and $E_{el.stat}$.
(c) The best two-parameter heuristic correlation equation for the subset.

O and S subsets but significantly better for the CH, N and halogen subsets. The addition of the descriptor $E_{el.stat}$ improved the correlations significantly for most subsets, as expected (compare columns 3, 4 and 5, 6 in Table 2). These results demonstrate, however, that the overall two-parameter equation does not model the O-containing compounds well. In the case of the O-containing compounds, the largest deviations of the predicted densities were observed for substituted phenols. Among the nitrogen-containing compounds, the largest errors of prediction were found for substituted anilines. For the CH and Halogen subsets of compounds, the intrinsic density $\rho_R$ alone resulted in a good correlation. The densities of aliphatic, nonaliphatic, cyclic, acyclic, and aromatic hydrocarbons can be predicted with a standard error of less than 0.014 g/cc using the following equation:

$$\rho = 1.8384(\pm 0.0055)\,\rho_R - 0.7543(\pm 0.0460);$$

$$n = 72; \quad R^2 = 0.9423; \quad s = 0.014; \quad F = 1126; \quad R^2_{cv} = 0.9359.$$
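As a minimal sketch of how the one-parameter hydrocarbon equation above could be applied, the following Python functions combine Eq. (1) with the quoted slope and intercept. The function names and the example inputs are hypothetical; the molecular mass and van der Waals volume are assumed to be given in the same units that were used to compute $\rho_R$ for the fit, which the text does not state explicitly.

```python
# Coefficients of the one-parameter CH-subset equation quoted above.
CH_SLOPE = 1.8384
CH_INTERCEPT = -0.7543


def intrinsic_density(molecular_mass, vdw_volume):
    """Eq. (1): ratio of molecular mass to van der Waals molecular volume."""
    if vdw_volume <= 0:
        raise ValueError("van der Waals volume must be positive")
    return molecular_mass / vdw_volume


def predict_density_ch(rho_r):
    """Estimate the density of a hydrocarbon from its intrinsic density."""
    return CH_SLOPE * rho_r + CH_INTERCEPT


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the call sequence.
    rho_r = intrinsic_density(86.18, 112.0)
    print(f"rho_R = {rho_r:.4f}, predicted density = {predict_density_ch(rho_r):.4f}")
```

The uncertainties quoted in parentheses for the coefficients are not propagated here; the sketch returns only the point estimate.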

In the last equation, $n$ is the number of points, $R^2$ is the square of the correlation coefficient, $F$ denotes Fisher's criterion at the 95% probability level, $s$ denotes the standard error and $R^2_{cv}$ is the square of the cross-validated correlation coefficient (Draper and Smith, 1966). For the Halogen subset, the intrinsic density $\rho_R$ also provided a good one-parameter model:

$$\rho = 0.8706(\pm 0.0269)\,\rho_R + 0.0769(\pm 0.0403);$$

$$n = 61; \quad R^2 = 0.9464; \quad s = 0.085; \quad F = 1042; \quad R^2_{cv} = 0.9389.$$
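The cross-validated coefficient $R^2_{cv}$ reported with these equations can be illustrated with a leave-one-out scheme. The sketch below is one common definition, assumed here for illustration; the exact cross-validation procedure used by CODESSA is not described in the text, and the data in the example are hypothetical rather than taken from Table 1.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def loo_r_squared(x, y):
    """Leave-one-out R^2: each point is predicted by a fit that excludes it."""
    predicted = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predicted.append(slope * x[i] + intercept)
    return r_squared(y, predicted)


if __name__ == "__main__":
    # Hypothetical (rho_R, density) pairs, not taken from Table 1.
    rho_r = [0.75, 0.78, 0.80, 0.85, 0.90, 0.95]
    rho = [0.63, 0.68, 0.71, 0.81, 0.90, 0.99]
    slope, intercept = fit_line(rho_r, rho)
    fitted = [slope * v + intercept for v in rho_r]
    print("R2   =", round(r_squared(rho, fitted), 4))
    print("R2cv =", round(loo_r_squared(rho_r, rho), 4))
```

With this definition $R^2_{cv}$ can never exceed the fitted $R^2$, which is consistent with the pairs of values quoted for the subsets.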

For all subsets, the two-descriptor correlation analysis was also carried out with the inclusion of all available descriptors to determine which two parameters are successful in capturing the structural features essential for determining their densities. A summary of the best correlations for different subsets of organic compounds is given in Table 2 (columns 7, 8). For the O subset, the best correlation model includes the intrinsic density and the minimal coulombic interaction for a C-H bond, $E_{col.C-H}$:

$$\rho = 2.1545(\pm 0.1199)\,\rho_R - 0.8016(\pm 0.1298)\,E_{col.C-H} + 1.8768(\pm 0.4021);$$

$$n = 39; \quad R^2 = 0.9198; \quad s = 0.032; \quad F = 207; \quad R^2_{cv} = 0.9001.$$
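The search for the best pair of descriptors described above can be sketched as an exhaustive comparison of all two-descriptor least-squares fits. This is only an illustration under stated assumptions: it ranks pairs by $R^2$ alone, whereas the heuristic procedure in CODESSA also applies significance and collinearity controls, and the descriptor names and values in the example are hypothetical.

```python
from itertools import combinations


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_r2(columns, y):
    """Least-squares fit y ~ b0 + sum(b_i * column_i); returns (coefficients, R^2)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    coef = solve_linear_system(xtx, xty)
    pred = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot


def best_pair(descriptors, y):
    """Return ((name_a, name_b), R^2) for the best two-descriptor model."""
    best = None
    for a, b in combinations(sorted(descriptors), 2):
        _, r2 = fit_r2([descriptors[a], descriptors[b]], y)
        if best is None or r2 > best[1]:
            best = ((a, b), r2)
    return best


if __name__ == "__main__":
    # Hypothetical descriptor pool; the target depends on d1 and d3 only.
    pool = {"d1": [1, 2, 3, 4, 5, 6], "d2": [1, 0, 1, 0, 1, 0], "d3": [2, 1, 4, 3, 7, 5]}
    target = [2 * a - 3 * c + 1 for a, c in zip(pool["d1"], pool["d3"])]
    print(best_pair(pool, target))
```

An exhaustive search grows quadratically with the number of descriptors, so for a pool of several hundred descriptors some pre-filtering would usually be applied first.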

The two-parameter model for the S subset, surprisingly, did not directly contain the descriptor for an intrinsic density. A much better correlation involved the average atomic mass, $AA_w$, in the molecule and the weighted partial negatively charged surface area of the molecule calculated from Zefirov's atomic charges, $Z_z$ (Zefirov et al., 1987):

$$\rho = 0.0643(\pm 0.0039)\,AA_w - 0.02141(\pm 0.0029)\,Z_z + 3.9634(\pm 0.2402);$$

$$n = 10; \quad R^2 = 0.9891; \quad s = 0.014; \quad F = 317; \quad R^2_{cv} = 0.9801.$$
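The $F$ values quoted with the subset equations can be cross-checked from $R^2$, the number of points and the number of descriptors. The sketch below assumes the standard regression $F$ statistic; the text says only that $F$ is Fisher's criterion, so this formula is an assumption. For the S subset it uses n = 10, the number of compounds listed for that subset in Table 2.

```python
def f_statistic(r2, n_points, n_descriptors):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a model with k descriptors."""
    dof = n_points - n_descriptors - 1
    if dof <= 0 or not 0.0 <= r2 < 1.0:
        raise ValueError("need n > k + 1 and 0 <= R^2 < 1")
    return (r2 / n_descriptors) / ((1.0 - r2) / dof)


if __name__ == "__main__":
    # (label, R^2, n, k, reported F) as quoted in the text and Table 2.
    reported = [
        ("Halogen, one descriptor", 0.9464, 61, 1, 1042),
        ("O, two descriptors", 0.9198, 39, 2, 207),
        ("S, two descriptors", 0.9891, 10, 2, 317),
    ]
    for label, r2, n, k, f_reported in reported:
        print(f"{label}: F = {f_statistic(r2, n, k):.1f} (reported {f_reported})")
```

Under this assumption the recomputed values agree with the reported ones to within rounding of $R^2$.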

However, whereas the first descriptor is still related to the intrinsic density of the compound, the second descriptor clearly indicates the importance of intramolecular electrostatic interactions in determining the density of the compounds from this structural subset.

3.2. Correlation analysis of the combined set

Nine descriptors (see Table 3) involved in the best correlation models for the subsets were combined and regressed against the whole set of 303 compounds. The resulting correlation equation had $R^2 = 0.9852$, $F = 2213.56$ and $s = 0.035$. However,

Table 3 Correlation matrix of the descriptors involved in the best correlation model for the densities of 303 diverse organic compounds

Descriptors:
1. $\rho_R$
2. $E_{el.stat}(AB)$
3. AM1 β-polarizability
4. Number of occupied electronic levels per number of atoms
5. Randic index (order 3) (El-Basil and Randic, 1992)
6. Information content (order 2)
7. Relative number of rings
8. Relative number of aromatic bonds
9. Difference in charged partial surface areas (DPSA-3 = PPSA3 - PNSA3) (Stanton and Jurs, 1990; Stanton et al., 1992)

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.0000 | -0.2622 | -0.1462 | 0.6811 | -0.2373 | -0.5034 | -0.0302 | 0.0296 | -0.3952 |
| 2 | -0.2622 | 1.0000 | 0.1290 | 0.1498 | 0.6450 | 0.1130 | 0.0296 | 0.7631 | 0.6225 |
| 3 | -0.1462 | 0.1290 | 1.0000 | -0.0906 | 0.2503 | 0.2307 | -0.3952 | 0.6225 | 0.2575 |
| 4 | 0.6811 | 0.1498 | -0.0906 | 1.0000 | -0.0433 | -0.5264 | 0.6811 | 0.1498 | 0.0906 |
| 5 | -0.2373 | 0.6450 | 0.2503 | -0.0433 | 1.0000 | 0.5475 | -0.2373 | 0.6450 | 0.2503 |
| 6 | -0.5034 | 0.1130 | 0.2307 | -0.5264 | 0.5475 | 1.0000 | -0.5034 | 0.1130 | 0.2307 |
| 7 | -0.0302 | 0.0296 | -0.3952 | 0.6811 | -0.2373 | -0.5034 | 1.0000 | 0.8798 | 0.2836 |
| 8 | 0.0296 | 0.7631 | 0.6225 | 0.1498 | 0.6450 | 0.1130 | 0.8798 | 1.0000 | 0.3531 |
| 9 | -0.3952 | 0.6225 | 0.2575 | 0.0906 | 0.2503 | 0.2307 | 0.2863 | 0.3531 | 1.0000 |

five descriptors (the number of electrons/number of atoms, the β-polarizability, the information content index (order 1) (Kier, 1980; Bonchev, 1983), the relative number of rings and the relative number of aromatic bonds) were not significant when applied to the whole set. The elimination of these descriptors resulted in a four-parameter correlation model with $R^2 = 0.9837$, $F = 4598$ and $s = 0.036$ g/cc. Finally, all 9 significant descriptors were used in all possible combinations to find the best two-parameter model. The most significant correlation equation was obtained with the intrinsic density and the total molecular electrostatic interaction/number of atoms. Further attempts to improve the correlation model were not successful. As reported above, the final best two-parameter correlation model obtained for the set of 303 diverse organic compounds had $R^2 = 0.9749$, $F = 5937.25$ and $s = 0.046$ g/cc. The corresponding graph of predicted vs experimental densities is given in Fig. 1. The correlation matrix (Table 3) demonstrates that the descriptors involved in the model for predicting densities are almost orthogonal ($R^2 = 0.0688$). The best correlation model was used to predict densities for a set of 40 compounds from Aldrich and

Fig. 1. Plot of predicted vs experimental densities for the set of 303 organic compounds.

Table 4 Test prediction of densities of 40 compounds using the best correlation equation developed for the set of 303 organic compounds

| Structure | $\rho_{calc}$ | $\rho_{exp}$ | $\rho_{calc} - \rho_{exp}$ |
|---|---|---|---|
| 1,2,3,4,5-Pentamethyl cyclopentadiene | 0.8154 | 0.8440 | -0.0286 |
| 1,2-Pentanediol | 0.8814 | 0.9710 | -0.0896 |
| 1,7-Dibromoheptane | 1.4143 | 1.5306 | -0.1163 |
| 1-Butanethiol | 0.8115 | 0.8416 | -0.0301 |
| 1-Heptene | 0.7197 | 0.6970 | 0.0227 |
| 2,3-Dibromoheptane | 1.4191 | 1.5139 | -0.0948 |
| 2,6-Dibromotoluene | 1.7397 | 1.8120 | -0.0723 |
| 2-Bromohexane | 1.1377 | 1.1658 | -0.0281 |
| 2-Chlorohexane | 0.8555 | 0.8694 | -0.0139 |
| 3-Bromohexane | 1.1365 | 1.1799 | -0.0434 |
| 3-Chlorohexane | 0.8491 | 0.8684 | -0.0193 |
| 4-Decanol | 0.7703 | 0.8261 | -0.0558 |
| 4-Decanone | 0.7650 | 0.8240 | -0.059 |
| Benzeneacetic acid, ethyl ester | 1.0161 | 1.0333 | -0.0172 |
| Benzenemethanethiol | 0.9702 | 1.0680 | -0.0978 |
| Benzoic acid, 2-nitro, 2-methyl | 1.2838 | 1.2855 | -0.0017 |
| Benzoic acid, propyl ester | 1.0182 | 1.0230 | -0.0048 |
| Benzoyl chloride, 4-ethyl | 1.1708 | 1.1686 | 0.0022 |
| Benzoyl iodide | 1.7908 | 1.7460 | 0.0448 |
| Butane, 1-ethoxy | 0.7727 | 0.7495 | 0.0232 |
| Butane, 1-ethylthio | 0.7972 | 0.8376 | -0.0404 |
| Butylbenzoate | 0.9921 | 1.0000 | -0.0079 |
| Cyclobutane, methyl | 0.7388 | 0.6884 | 0.0504 |
| Ethane, 1,2-diiodo (Z) | 2.9649 | 3.0625 | -0.0976 |
| Ethane, methylthio | 0.8112 | 0.8422 | -0.0310 |
| Ethanol, 2-ethylthio | 0.9110 | 1.0166 | -0.1056 |
| Heptane | 0.6906 | 0.6837 | 0.0069 |
| Heptane, 2,6-dimethyl | 0.7015 | 0.7089 | -0.0074 |
| Heptanoic acid | 0.8949 | 0.9181 | -0.0232 |
| Heptanoic acid, ethyl ester | 0.866 | 0.8817 | -0.0157 |
| Hexane | 0.6730 | 0.6548 | 0.0182 |
| Hexane, 1-ethoxy | 0.7675 | 0.7722 | -0.0047 |
| Methane, dibromodinitro | 2.4373 | 2.4439 | -0.0066 |
| Pentamethyl 1,2,2,6,6-piperidine | 0.8021 | 0.8590 | -0.0569 |
| Pentane, 1-ethoxy | 0.7676 | 0.7622 | 0.0054 |
| Pentane, 1-methoxy | 0.7716 | 0.7590 | 0.0126 |
| Pentane, 3,3-diethyl | 0.7083 | 0.7536 | -0.0453 |
| Pentane, 3-ethyl | 0.6859 | 0.6982 | -0.0123 |
| Piperidine | 0.8317 | 0.8606 | -0.0289 |
| Propane, 1,1-diethoxy | 0.8273 | 0.8232 | 0.0041 |

CRC handbooks which covered a large diversity of densities and compound classes. A comparison of predicted and experimental densities is given in Table 4. Most of the predicted values are well within the standard error ($s^2 = 0.0388$).

4. Conclusions

Successful correlation equations were developed for the densities of different subsets of organic compounds containing various heteroatoms. The resulting individual QSPR correlation equations involve one to four parameters and have standard errors ranging from 0.027 g/cm3 for hydrocarbons to 0.085 g/cm3 for halogenated compounds. Proceeding from the correlation equations for the subsets of compounds, a general two-parameter correlation model was developed for the prediction of the density of any organic compound containing C, H, O, N, S, F, Cl, Br, and I atoms. This correlation model covers a large diversity of organic structures (Table 1) and offers a standard prediction error of 0.046 g/cm3. In conclusion, the QSPR equations developed in the present study enable the confident prediction of the densities of organic compounds on the basis of their chemical structure alone. The density can thus be predicted for any organic compound, including those not yet synthesized or otherwise unavailable, which makes the presented equations attractive for various technological, environmental and chemical applications.
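The test-set comparison in Table 4 can be summarised numerically from the residual column. The sketch below copies the 40 residuals from Table 4 and computes the mean error, mean absolute error and root-mean-square error. These summary measures are chosen here for illustration and are not necessarily the statistics used by the authors.

```python
import math

# Residuals (rho_calc - rho_exp) for the 40 test compounds, copied from Table 4.
RESIDUALS = [
    -0.0286, -0.0896, -0.1163, -0.0301, 0.0227, -0.0948, -0.0723, -0.0281,
    -0.0139, -0.0434, -0.0193, -0.0558, -0.059, -0.0172, -0.0978, -0.0017,
    -0.0048, 0.0022, 0.0448, 0.0232, -0.0404, -0.0079, 0.0504, -0.0976,
    -0.0310, -0.1056, 0.0069, -0.0074, -0.0232, -0.0157, 0.0182, -0.0047,
    -0.0066, -0.0569, 0.0054, 0.0126, -0.0453, -0.0123, -0.0289, 0.0041,
]


def mean_error(values):
    """Signed mean of the residuals (bias)."""
    return sum(values) / len(values)


def mean_absolute_error(values):
    """Mean of the absolute residuals."""
    return sum(abs(v) for v in values) / len(values)


def rms_error(values):
    """Root-mean-square of the residuals."""
    return math.sqrt(sum(v * v for v in values) / len(values))


if __name__ == "__main__":
    print(f"n    = {len(RESIDUALS)}")
    print(f"bias = {mean_error(RESIDUALS):.4f}")
    print(f"MAE  = {mean_absolute_error(RESIDUALS):.4f}")
    print(f"RMS  = {rms_error(RESIDUALS):.4f}")
```

A negative bias indicates that the model tends to underestimate the densities of this test set.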

Acknowledgements

We thank Ms Helle Kuura for help in collecting data and Dr Uko Maran for technical assistance. This work was carried out within the Estonian Science Foundation grants no. 3051 and no. 3306.

References

Aldrich Catalog Handbook of Fine Chemicals, 1996. Aldrich, Milwaukee, WI.
AMPAC 5.0, 1994. Semichem, 7128 Summit, Shawnee, KS.
Bonchev, D., 1983. Information Theoretical Indices for Characterization of Chemical Structure. Wiley-Interscience, New York.
CODESSA 2.0, 1994. University of Florida, Gainesville, FL.
Dewar, M.J.S., Zoebisch, E.G., Healy, E.F., Stewart, J.J.P., 1985. AM1: a new general purpose quantum mechanical molecular model. J. Am. Chem. Soc. 107, 3902.
Dillon, W.R., Goldstein, M., 1984. Multivariate Analysis: Methods and Applications. Wiley, New York.
Draper, N.R., Smith, H., 1966. Applied Regression Analysis. Wiley, New York.
El-Basil, S., Randic, M., 1992. Adv. Quant. Chem. 24, 239.
Handbook of Chemistry and Physics, 1994. CRC Press, Cleveland, Ohio.
Karelson, M., Lobanov, V.S., Katritzky, A.R., 1996. Quantum-chemical descriptors in QSAR/QSPR studies. Chem. Rev. 96, 1027.
Katritzky, A.R., Mu, L., Lobanov, V.S., Karelson, M., 1996. Correlation of boiling points with molecular structure. 1. A training set of 298 diverse organics and a test set of 9 simple inorganics. J. Phys. Chem. 100, 10400.
Katritzky, A.R., Lobanov, V.S., Karelson, M., 1995. QSPR: the correlation and quantitative prediction of chemical and physical properties from structure. Chem. Soc. Rev. 24, 279.
Katritzky, A.R., Ignatchenko, E.S., Barcock, R.A., Lobanov, V.S., Karelson, M., 1994. Prediction of gas chromatographic retention times and response factors using a general quantitative structure-property relationship treatment. Anal. Chem. 66, 1799.
Kier, L.B., Hall, L., 1986. Molecular Connectivity in Structure-Activity Analysis. Research Studies Press, Letchworth.
Kier, L.B., 1980. J. Pharm. Sci. 69, 807.
Kontogeorgis, G.M., Fredenslund, A., Tassios, D.P., 1995. Chain-length dependence of the critical density of organic homologous series. Fluid Phase Equilibria 108, 1-2.
Liu, Z.Y., Chen, Z.C., 1995. Estimation of critical pressures of pure substances from data of density and vaporization heat of liquids. Chemical Engineering Journal and the Biochemical Engineering Journal 59, 2.
Llano, J.J., Garcia, R., Rosal, R., Sastre, H., Diez, F., 1993. Program to estimate the density, viscosity and thermal conductivity in pure fluids and mixtures. Afinidad 50, 446.
Merck Index, 1989. Merck & Co. Inc., Rahway, NJ.
PCMODEL, 5th ed., 1992. Serena Software, Bloomington, IN.
Stankevich, M.I., Stankevich, I.V., Zefirov, N.S., 1988. Russ. Chem. Rev. 57, 191.
Stanton, D.T., Egolf, L.M., Jurs, P.C., Hicks, M.G., 1992. Computer-assisted prediction of normal boiling points of pyrans and pyrroles. J. Chem. Inf. Comput. Sci. 32, 306.
Stanton, D.T., Jurs, P.C., 1990. Development and use of charged partial surface area structural descriptors in computer-assisted quantitative structure-property relationship studies. Anal. Chem. 62, 2323.
Stewart, J.J.P., 1989. MOPAC program package, QCPE no. 455.
Stouch, T.R., Jurs, P.C., 1986. A simple method for the representation, quantification, and comparison of the volumes and shapes of chemical compounds. J. Chem. Inf. Comput. Sci. 26, 4.
Zefirov, N.S., Kirpichenok, M.A., Izmailov, F.F., Trofimov, M.I., 1987. Scheme for the calculation of the electronegativities of atoms in a molecule in the framework of Sanderson's principle. Dok. Akad. Nauk USSR 296, 883.