QSPR analysis of fluorophilicity for organic compounds

QSPR analysis of fluorophilicity for organic compounds

Journal of Fluorine Chemistry 128 (2007) 484–492 www.elsevier.com/locate/fluor QSPR analysis of fluorophilicity for organic compounds Andrew G. Merca...

877KB Sizes 32 Downloads 123 Views

Journal of Fluorine Chemistry 128 (2007) 484–492 www.elsevier.com/locate/fluor

QSPR analysis of fluorophilicity for organic compounds Andrew G. Mercader *, Pablo R. Duchowicz, Miguel A. Sanservino, Francisco M. Ferna´ndez, Eduardo A. Castro INIFTA, Divisio´n Quı´mica Teo´rica, Departamento de Quı´mica, Facultad de Ciencias Exactas, Universidad Nacional de La Plata, Diag. 113 y 64, Suc. 4, C.C. 16, 1900 La Plata, Argentina Received 20 November 2006; received in revised form 20 December 2006; accepted 21 December 2006 Available online 3 January 2007

Abstract We constructed a QSPR model from 116 organic compounds for the prediction of fluorophilicity. The 1268 theoretical descriptors explored by means of linear regressions, encoding different aspects of the topological, geometrical, and electronic molecular structure, lead to an optimal sevenparameter equation with a correlation coefficient R = 0.9807 and cross-validation parameter Rl-15%-o = 0.9677. As a more realistic and practical application of present optimal QSPR model, it is applied to the estimation of the fluorophilicity of 69 non-yet synthesized molecular structures. # 2007 Elsevier B.V. All rights reserved. Keywords: QSPR theory; Molecular descriptors; Replacement method; Fluorophilicity

1. Introduction Nowadays, the fluorous chemistry of compounds exhibit many interesting applications in synthesis and catalysis. The tendency of an organic substance to dissolve in fluorous media has continuously gained importance after the disclosure of the fluorous biphase catalysis in 1994 [1], as biphasic reactions take advantage of the fact that organic and fluorous phases are typically immiscible at room temperature, but may homogenize at elevated temperatures. One can therefore expose reactants in an organic phase to a catalyst in a fluorous phase simply by heating, and separate products (in the organic phase) from the catalyst (in the fluorous phase) on cooling. Other useful advantages of the techniques based on the fluor content rely on the unique physical and chemical properties associated with perfluorinated type of solvents, such as inertness, non-toxicity, and easy separation [2]. The fluorophilicity of a compound can be quantified through the associated partition coefficient (P) between f1uorous (CF3C6F11) and organic (CH3C6H5) layers [3].   cðCF3 C6 F11 Þ ln P ¼ ln ; cðCH3 C6 H5 Þ

T ¼ 298 K

(1)

* Corresponding author. Fax: +54 221 425 4642. E-mail address: [email protected] (A.G. Mercader). 0022-1139/$ – see front matter # 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.jfluchem.2006.12.011

It is well-known that the experimental design of fluorophilic molecules demands a minimum fluorine content of 60%, the presence of one or more perfluorinated-alkyl chain called ‘ponytail’, and the absence of hydrogen bonding or polar groups which may interact with the organic phase. Furthermore, the fluorination of molecules often takes the form of adding long ponytails [4]. Clearly, the design of fluorous biphase catalysis experiments would be substantially improved by a reliable prediction of the fluorophilicity for a given molecule. A generally accepted remedy for overcoming the lack of experimental data in complex chemical phenomena is the analysis based on quantitative structure-property relationships (QSPR) [5], which in the present case may provide adequate predictions of fluorophilicity. The ultimate role of the different formulations of the QSPR theory is to suggest mathematical models for estimating relevant properties of interest, especially when they cannot be experimentally determined for some reason. These studies simply rely on the assumption that the structure of a compound determines its physicochemical properties. The molecular structure is therefore translated into the so-called molecular descriptors through mathematical formulae obtained from several theories, such as Chemical Graph Theory, Information Theory, Quantum Mechanics, etc. [6,7]. There exist more than a thousand theoretical descriptors available in the literature, and one usually faces the problem of selecting

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492

those which are the most representative of the property under consideration. Several QSPR studies on fluorophilicity were published in the last 5 years. In 2001, Kiss et al. [8] estimated this property for 59 fluorinated organic molecules using a neural network (NN) combination of eight molecular descriptors chosen from a pool of almost a hundred variables. In 2002, Huque et al. [9] employed a modified version of the linear free-energy relationships (LFER) on 91 chemicals to derive a fivedescriptor model with structural interpretation, characterized with statistical parameters R = 0.9742 and S = 0.566. In 2004, Duchowicz et al. [10] employed the same data set to propose a different QSPR model based on linear regression and the atoms and chemical bonds as molecular descriptors. In 2004, de Wolf et al. [11] applied a universal lipophilicity model based on the mobile order/disorder (MOD) solution theory to predict partition coefficients for 88 molecules in either PFMCH/toluene or FC-72/benzene. However, those predictions required the knowledge of molecular volumes and modified non-specific cohesion parameters for the solute; data which are commonly unavailable. The same year, Daniels et al. [12] proposed a modified LFER for 93 organic compounds by means of five molecular surface area descriptors, achieving R = 0.9716 and S = 0.638 and showing an almost equal accuracy to the previous reported models. The present study reports the predictions of fluorophilicity for 116 organic compounds whose experimental data were collected from two previous publications [9,13]. For this purpose, two widely applied modeling strategies based on linear regressions are employed, namely the forward stepwise regression and the replacement method [14–17]. In Section 2 we show and discuss our results. In Section 3 we summarize the main conclusions of this study. Finally, in Section 4 we outline the employed methods. 2. Results and discussion We first apply the RM to the search for the best fluorophilicity–structure relationships. The equation that leads

485

to the best cross-validation parameters Rl-15%-o and Sl-15%-o contains seven molecular descriptors of different type: ln P ¼ 5:688ð0:5Þ  0:348ð0:01ÞSEigp þ0:164ð0:02ÞRDF055p þ 0:0531ð0:05ÞMAXDP 0:197ð0:01ÞHar þ 1:178ð0:1ÞCIC1 6:488ð0:9ÞMATS1v þ 1:606ð0:1ÞHOMA N ¼ 116; 4

p < 10 ;

R ¼ 0:9806; AIC ¼ 0:280;

S ¼ 0:494;

F ¼ 387:7;

FIT ¼ 16:450

Rloo ¼ 0:9778; Sloo ¼ 0:511Rl-15%-o ¼ 0:9677; Sl-15%-o ¼ 0:620

(2)

where the absolute errors of the regression coefficients are given in parentheses; R the correlation coefficient of the model, F the Fisher ratio, p the significance of the model, AIC the Akaike’s information criterion and FIT is the Kubinyi function. A brief description for each variable appearing in all the proposed models is presented in Table 1. The application of the FSR procedure does not improve the quality of this relationship, as it leads to the following equation: ln P ¼ 2:238ð0:6Þ þ 0:151ð0:005ÞRTm 1:902ð0:2ÞHATSp  0:00611ð0:0007ÞPCD þ0:526ð0:07ÞH2e  0:362ð0:06ÞC  006 þ0:760ð0:2ÞN  068 þ 0:807ð0:3ÞGATS1p N ¼ 116; 4

p < 10 ;

R ¼ 0:9740; AIC ¼ 0:375;

S ¼ 0:572;

F ¼ 285:5;

FIT ¼ 12:110

Rloo ¼ 0:9700; Sloo ¼ 0:5932Rl15%o ¼ 0:9281; Sl15%o ¼ 0:9153

(3)

Once again, we conclude that the RM is preferable to the FSR for exploring large sets of descriptors. Table 2 shows the predicted values of fluorophilicity and the corresponding residuals between parentheses. We may consider the closely related aromatic esters 64 (Rf7C(O)OCH2Ph) and 65

Table 1 Classification of molecular descriptors involved in the QSPR models Symbol

Description

Type

MATS1v RDF055p MAXDP Har CIC1 HOMA SEigp RTm HATSp PCD H2e C-006 N-068 GATS1p

Moran autocorrelation-lag 1/weighted by atomic van der Waals volumes RDF-5.5 weighted by atomic polarizabilities Maximal electrotopological positive variation Harary H index Complementary information content (neighborhood symmetry of first-order) Armonic oscillator model of aromaticity index Eigenvalue sum from polarizability weighted distance matrix R total index/weighted by atomic masses Leverage-weighted total index/weighted by atomic polarizabilities Difference of multiple path counts to path counts H autocorrelation of lag 2/weighted by atomic Sanderson electronegativities Number of CH2RX groups Number of Al3–N groupsc Geary autocorrelation-lag 1/weighted by atomic polarizabilities

2D-autocorrelations RDFa Topological Topological Topological Aromaticity indices Topological GETAWAYb GETAWAY Topological GETAWAY Atom-centred fragments Atom-centred fragments 2D-autocorrelations

a b c

RDF, radial distribution function. GETAWAY, GEometry, Topology and Atoms-Weighted AssemblY. Al, aliphatic groups.

486

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492

Table 2 Experimental values of fluorophilicity and predictions achieved by Eq. (2) and Huque et al. [9] N 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

Compound name

Exp.

Eq. (2)

Huke et al. [9]

Decane Undecane Dodecane Tridecane Tetradecane Hexadecane Dec-1-ene Undec-1-ene Dodec-1-ene Tridec-1-ene Tetradec-1-ene Hexadec-1-ene Rf8CH CH2 Cyclohexanone Cyclohexenone Cyclohexanol Trifluoroethanol (CF3)2CHOH Rf6(CH2)2OH Rf6(CH2)3OH Rf8(CH2)2OH Rf8(CH2)3OH Rf10(CH2)3OH Pentafluorobenzene Hexafluorobenzene Ethylbenzene Dodecylbenzene Rf8(CH2)3C6H5 o-Rf6(CH2)3C6H4(CH2)3Rf6 o-Rf8(CH2)3C6H4(CH2)3Rf8 o-Rf10(CH2)3C6H4(CH2)3Rf10 m-Rf8(CH2)3C6H4(CH2)3Rf8 p-Rf8(CH2)3C6H4(CH2)3Rf8 Rf8(CH2)3Cl Rf8(CH2)3NH2 Rf8(CH2)3NH(CH2)3Rf8 (Rf6(CH2)2)3P (Rf8(CH2)3)3P (Rf8(CH2)4)3P (Rf8(CH2)5)3P (Rf6(CH2)2)2PC10H19 (menthyl) (Rf8(CH2)2)2PC10H19 (menthyl) ( p-Rf6C6H4)3P ( p-Rf8C6H4)3P Ph(CH2)2SiH3 Ph(CH2)2SiOC8H15 Ph(CH2)2SiOC6H11 (cyclohexyl) Rf6I Rf8I Rf10I Rf8CH CH2 Rf8(CH2)3SH Rf8N(CH2CH2)2 Rf6S(CH2)2CO2Et Rf8S(CH2)2CO2Et CF3SPh m-CF3SC6H4CF3 Rf8SPh Rf7CH2NHMe Rf7CH2NMe2 Rf7CH2N(CH2CH2)2O Rf7CH2NHCH(Me)Ph Rf7C(O)Ph Rf7C(O)OCH2Ph

2.86 3.13 3.35 3.71 3.94 4.50 2.99 3.26 3.66 3.94 4.12 4.70 2.67 3.79 4.06 4.12 1.77 1.02 0.10 0.24 1.02 0.59 1.42 1.24 0.94 4.41 4.70 0.02 1.03 2.34 3.62 2.28 2.33 0.03 0.85 3.32 4.41 4.41 4.50 4.50 1.29 2.70 1.32 0.76 3.29 5.11 4.82 1.31 2.04 2.84 2.67 0.24 0.86 0.67 0.04 2.45 1.58 0.59 1.07 1.53 0.14 0.87 0.48 2.14

3.09 3.28 3.47 3.67 3.87 4.30 3.50 3.65 3.81 3.98 4.17 4.56 1.65 3.87 4.57 4.31 1.92 1.07 0.05 0.14 1.47 1.16 2.10 1.40 0.34 3.31 4.53 0.48 1.20 2.69 3.40 2.96 2.97 0.74 0.29 2.75 4.46 – – – 0.92 1.92 1.31 – 3.22 5.15 4.89 1.53 2.69 2.43 1.92 1.04 0.91 0.17 0.76 2.87 2.17 1.03 0.79 1.10 0.43 0.73 0.18 0.54

3.07 3.13 3.19 3.24 3.30 3.41 3.29 3.34 3.40 3.46 3.51 3.62 2.82 3.96 4.25 4.74 1.37 0.70 0.47 0.50 0.72 0.80 1.25 0.58 0.12 4.23 4.79 0.38 1.37 2.32 3.23 2.32 2.32 0.37 1.29 3.34 3.75 4.79 4.53 4.27 1.11 2.10 0.57 0.78 4.53 5.56 5.56 0.34 0.93 1.48 2.82 1.23 1.48 0.05 0.49 2.01 0.85 0.15 1.49 1.63 0.60 0.65 – –

(0.23) (0.15) (0.12) (0.04) (0.07) (0.20) (0.51) (0.39) (0.15) (0.04) (0.05) (0.14) (1.02) (0.08) (0.51) (0.19) (0.15) (0.05) (0.05) (0.10) (0.45) (0.57) (0.68) (0.16) (0.60) (1.10) (0.17) (0.50) (0.17) (0.35) (0.22) (0.68) (0.64) (0.71) (0.56) (0.57) (0.05)

(0.37) (0.78) (0.01) (0.07) (0.04) (0.07) (0.22) (0.65) (0.41) (0.75) (0.80) (0.05) (0.50) (0.72) (0.42) (0.59) (0.44) (0.28) (0.43) (0.29) (0.14) (0.30) (1.60)

(0.21) (0.00) (0.16) (0.47) (0.64) (1.09) (0.30) (0.08) (0.26) (0.48) (0.61) (1.08) (0.15) (0.17) (0.19) (0.62) (0.40) (0.32) (0.37) (0.74) (0.30) (0.21) (0.17) (0.66) (0.82) (0.18) (0.09) (0.40) (0.34) (0.02) (0.39) (0.04) (0.01) (0.34) (0.44) (0.02) (0.66) (0.38) (0.03) (0.23) (0.18) (0.60) (0.75) (0.02) (1.24) (0.45) (0.74) (0.97) (1.11) (1.36) (0.15) (0.99) (0.62) (0.62) (0.45) (0.44) (0.73) (0.74) (0.42) (0.10) (0.46) (0.22)

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492

487

Table 2 (Continued ) N

Compound name

Exp.

Eq. (2)

Huke et al. [9]

65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120

p-Rf7C(O)OCH2C6H4OCF3 Rf7C(O)SMe Rf7C(O)NHMe Rf7C(O)NMe2 Rf7C(O)N(CH2CH2)2O Rf7C(S)Me Rf7C(S)NMe2 Rf7C(S)N(CH2CH2)2O Rf7C(S)NHCH(Me)Ph C6H6 CF3Ph Rf6Ph Rf8Ph Rf10Ph o-Rf8C6H4CF3 m-Rf8C6H4CF3 p-Rf8C6H4CF3 p-Rf8C6H4Rf8 [p-CF3C6H4(CF2)4]2 o-Rf6(CH2)2C6H4Cl p-Rf6(CH2)2C6H4Cl p-Rf8(CH2)2C6H4Cl o-Rf6(CH2)2C6H4Br m-Rf6(CH2)2C6H4Br p-Rf6(CH2)2C6H4Br o-Rf8C6H4CO2Me m-Rf8C6H4CO2Me p-Rf8C6H4CO2Me 1,3,5-Rf8C6H3(CF3)2 1,3,5-(Rf8)2C6H3CO2Me 1,3,5(Rf8)2C6H3CH2OH 1,3,5-(Rf8)2C6H3CHO 2-Rf8C5H4N (pyridine) 3-Rf8C5H4N (pyridine) 4-Rf8C5H4N (pyridine) (CF3)3CO(CH2)2NH2 (CF3)3CO(CH2)2NH(CH3) [(CF3)3CO(CH2)2]2NH (CF3)3CO(CH2)2NH(CH2)3Rf8 (CF3)3CO(CH2)2N(CH3)2 [(CF3)3CO(CH2)2]2NCH3 [(CF3)3CO(CH2)2]3N Rf8(CH2)3NH2 Rf8(CH2)4NH2 Rf8(CH2)5NH2 Rf7CH2NH(CH3) Rf8(CH2)3NH(CH3) [Rf4(CH2)3]2NH [Rf6(CH2)3]2NH [Rf10(CH2)3]2NH [Rf8(CH2)3]2NH [Rf8(CH2)4]2NH [Rf8(CH2)5]2NH Rf7CH2N(CH3)2 Rf8(CH2)3N(CH3)2 [Rf8(CH2)3]2NCH3

3.15 1.16 0.15 0.34 0.62 1.08 0.66 1.56 1.84 2.77 1.96 0.54 1.24 1.77 1.50 2.37 2.13 4.98 0.56 0.64 1.02 0.37 1.05 1.44 1.49 0.39 0.12 0.01 4.05 4.41 3.62 4.25 0.54 0.88 0.80 0.14 0.08 1.82 2.69 0.34 2.03 3.62 0.85 0.54 0.28 1.07 0.88 0.71 1.98 4.08 3.32 2.97 2.59 1.53 1.37 3.63

1.55 0.92 0.82 0.72 0.32 1.46 0.22 1.06 1.03 2.58 2.39 0.33 1.38 2.29 1.52 1.99 2.01 4.63 0.10 0.99 1.02 0.05 1.12 1.09 1.13 0.34 0.11 0.00 2.70 3.60 3.19 4.10 1.12 1.02 1.28 0.46 0.19 2.02 2.32 0.25 2.20 3.63 0.29 0.40 0.24 0.79 0.76 0.91 2.12 4.28 3.53 3.43 2.88 1.10 0.92 3.28

– 0.57 0.23 0.66 0.38 0.19 0.20 1.18 3.18 4.12 1.82 0.24 0.78 1.28 1.37 1.37 1.37 – 0.18 0.63 0.63 0.04 1.22 1.22 1.22 0.18 0.18 0.18 – – – – 0.64 0.64 0.64 – – – – – – – – – – – – – – – – – – – – –

(1.60) (0.24) (0.67) (0.38) (0.30) (0.38) (0.88) (0.50) (0.81) (0.19) (0.43) (0.21) (0.14) (0.52) (0.02) (0.38) (0.12) (0.35) (0.46) (0.35) (0.00) (0.32) (0.07) (0.35) (0.36) (0.73) (0.23) (0.01) (1.35) (0.81) (0.43) (0.15) (0.58) (0.14) (0.48) (0.32) (0.10) (0.20) (0.37) (0.08) (0.17) (0.01) (0.56) (0.14) (0.04) (0.28) (0.12) (0.20) (0.14) (0.20) (0.21) (0.46) (0.29) (0.43) (0.45) (0.35)

(0.59) (0.38) (0.32) (0.24) (0.89) (0.46) (0.38) (1.34) (1.35) (0.14) (0.30) (0.46) (0.49) (0.13) (1.00) (0.76) (0.38) (0.01) (0.39) (0.33) (0.17) (0.22) (0.27) (0.21) (0.30) (0.17)

(0.10) (0.24) (0.16)

Residuals are presented in parentheses. Rfn refers to (CF2)n1CF3.

( p-Rf7C(O)OCH2C6H4OCF3) as outliers, with a residual value exceeding 3S. It is not possible to determine if such deviation is either a statistical consequence of present selection of descriptors in Eq. (2) or a physical (meaningful) result. It is quite possible that these compounds are somewhat structurally different to the others in the training set. In many

cases present residuals are smaller than those of Huque et al. [9] also shown in Table 2. We mention that we were unable to include four molecules (38, 39, 40 and 44) from that previous study because the freely downloadable version of Dragon 3.0 is limited up to 100 atoms. On the other hand, those molecules identified by Huque et al. as outliers and omitted from their

488

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492

Fig. 3. Histogram of experimental data. Fig. 1. Predicted vs. experimental fluorophilicity.

final model (63, 64, 65, 82, 93, 94, 95, 96) were included in present calculation. The plot of predicted versus experimental values of fluorophilicity shown in Fig. 1 suggests that the 116 compounds roughly follow a straight line. Fig. 2 shows the residuals in terms of the experimental fluorophilicities, and demonstrates that the best molecular descriptors given in Eq. (2) lead to a model that follows a normal distribution and that does not obey any kind of undesired pattern that would probably suggest the presence of non-modelled factors contributing to the fluorophilicity. The correlation matrix for Eq. (2) (indicated in Table 3) reveals that there exists some degree of intercorrelation between SEigp and Har (Rij = 0.9738), although these descriptors carry some non-overlapping structural information that makes the model to exhibits adequate predictive l-10%-o cross-validation parameters measured on 100,000 randomly generated cases of compounds exclusion. Figs. 3–10 include the histograms of the experimental property and of each molecular descriptor selected, revealing the distribution of the chemical compounds in the different numerical intervals of the descriptor variation. Present QSPR equation includes different theoretical descriptors derived from the molecular graph (G) and the three-dimensional geometry, as Table 2 shows. The numerical characterization of the molecular structure is given by: (a) four

Fig. 2. Dispersion plot of the residuals for Eq. (2).

Fig. 4. Histogram of SEigp descriptor.

topologicals: SEigp, the eigenvalue sum from polarizability weighted distance matrix; MAXDP, the maximal electrotopological positive variation; Har, the Harary index; CIC1, the complementary information content (neighborhood symmetry of first-order); (b) a 2D-autocorrelation: MATS1v, the Moran autocorrelation-lag 1/weighted by atomic van der Waals volumes; (c) a radial distribution function: RDF055p, 5.5 weighted by atomic polarizabilities; (d) an aromaticity index: HOMA, the armonic oscillator model of aromaticity index. It is possible to argue some structural interpretation for the numerical variables appearing in Eq. (2). In Chemical Graph Theory, the distance matrix introduced by Harary in the 1960s [18] accounts for the ‘‘through bond’’ interactions of atoms in

Fig. 5. Histogram of RDF055p descriptor.

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492

489

Table 3 Correlation matrix for the descriptors of Eq. (2)

SEigp RDF055p MAXDP Har CIC1 MATS1v HOMA

SEigp

RDF055p

MAXDP

Har

CIC1

MATS1v

HOMA

1

0.8059 1

0.7854 0.6357 1

0.9738 0.8496 0.7283 1

0.4121 0.4684 0.0242 0.5284 1

0.09542 0.2885 0.1353 0.1988 0.2766 1

0.09487 0.1510 0.2478 0.1868 0.09254 0.5124 1

Fig. 6. Histogram of MAXDP descriptor.

Fig. 8. Histogram of ClC1 descriptor.

molecules. The molecular descriptor SEigp characterizes the distribution of the topological distances in G and differentiates the nature of atoms through the atomic polarizability values. Another descriptor from this equation is the index Har, which contemplates in its calculation the reciprocal entries of the distance matrix. This results from the fact that the interactions among atoms decrease as the distance between them increases. The topological descriptor CIC1 is obtained from the Information Theory [19], and this sort of theoretical descriptor measures the complexity of the molecule in terms of the diversity of elements that includes in its chemical structure, such as the type of atoms, bonds, cycles, etc. It expresses the molecular symmetry by measuring the neighborhood of the atoms (through the value of the vertex degrees) located at a

first-order distance (one single bond) of a considered atom, for each vertex in G. The variable MAXDP is derived from the hydrogen-depleted molecular graph and obtained from the Kier–Hall intrinsic states of atoms as [20]:

Fig. 7. Histogram of Har descriptor.

Fig. 9. Histogram of MATS1v descriptor.

MAXDP ¼ maxi jDI i j

(4)

where DIi is the field effect on the ith atom due to the perturbation of all other atoms as defined by Kier and Hall: DI i ¼

X ðI i  I j Þ i ðd þ 1Þ ij

(5)

The sum runs over all the other atoms in the molecular graph, I is the atomic intrinsic state and dij is the topological

490

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492 Table 4 Prediction of unknown fluorophilicity

Fig. 10. Histogram of HOMA descriptor.

distance between the two considered atoms. The intrinsic state of an atom is calculated as the ratio between the Kier–Hall atomic electronegativity and the vertex degree, i.e. the number of bonds of an atom, encoding information related to both atomic partial charges and their topological position relative to the whole molecule. Therefore, MAXDP represents the maximum positive intrinsic state difference and can be related to the electrophilicity of the molecule. This fact reveals the importance of describing the fluorous electronegative contributions to fluorophilicity, in complete agreement with the empirical criteria for the design of fluorophilic compounds. The structural variables introduced by Moran [21] correspond to bi-dimensional autocorrelations between pairs of atoms in the molecule, and are defined in order to reflect the contribution of a considered atomic property to the property being investigated. These can be readily calculated, i.e.: by summing products of atomic weights of the terminal atoms of all the paths of a prescribed length. For the case of MATS1v, the path connecting a pair of atoms has a bond length and involves the van der Waals volumes as weighting scheme to distinguish their nature. Another optimal molecular descriptor selected by the RM is RDF055p. The radial distribution function [22] of an ensemble of atoms can be interpreted as the probability distribution of finding an atom in a spherical volume of certain radius. For RDF055p, ˚ and atomic polarizabilities are the sphere radius is 5.5 A employed as weighting scheme. Finally, the aromaticity index HOMA [23,24] considers the presence/absence of aromatic rings in the set of compounds, measuring thus the tendency of a bond length to be comprised between a single and a double bond length. HOMA takes the value 0 for an hypotetic non-aromatic system, while it equals 1 in case of aromaticity. The standardization of the regression coefficients [14] in Eq. (2) enables to assign more importance to the variables of the model exhibiting larger absolute standardized coefficients (shown in parentheses), therefore achieving the following ranking of contributions to ln P: Seigp > Har > CIC1 > RDF055p > HOMA > MATS1v ð3:398Þ

ð3:053Þ

> MAXDP ð0:041Þ

ð0:348Þ

ð0:327Þ

ð0:320Þ

ð0:178Þ

(6)

From Eq. (2) it can be concluded that increased numerical values of descriptors such as CIC1, RDF055p, HOMA and

N

Name

Eq. (2)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

(Rf8)3P 1,3,4-(Rf8)3C6H3 p-Rf8C6F4Rf8 Rf11CF CF2 Rf9CF3 Rf9(CH2)2NH(CH2)2Rf9 p-Rf8C6H3FRf8 1,3,5-(Rf4)2C6H3Rf8 (Rf7CH2)3P m-Rf16C6H3CHO m-Rf16C6H3CH2OH m-Rf16C6H3CO2Me 1,3-(Rf8)2-5-CO2CF3C6H3 CH3CH2NH(CH2)3Rf20 1-Rf16-2,3-F2-5-CH2OHC6H2 m-Rf17C6H3CO2Me Rf9(CH2)2NH(CH2)2Rf2 (CF3)3CO(CF2)2NF2 o-Rf10(CH2)3C6H4(CH2)3Rf3 CH3(CH2)2NCH3(CH2)3Rf16 Rf8(CH2)3NH(CH2)3CF3 F5C6(CF2)2SiOC8F15 CF3SC6(CF3)5 o-Rf3(CH2)3C6F4(CH2)3Rf2 o-Rf4(CH2)3C6H4(CH2)3Rf3 o-Rf3(CH2)3C6H4(CH2)3Rf3 1,3-(Rf5)2-5-(CH2)2SiOC8H15C6H3 o-Rf3(CH2)3C6H4(CH2)3Rf2 1,3-(Rf5)2-2,4-F2-5-(CH2)2SiOC8H15C6H Rf8C(S)NHCH(Me)Ph Pentafluoroethanol 7,10-Rf4-hexadec-1-ene 1-Rf7-4-(CH2)2SiOC8H15C6H4 Ph(CH2)2SiF3 4-F-1-CF3C6H4 1,3,5-Trifluorobenzene 1,13-Difluorotridecane 1-(CF2)4CF2H-4-(CH2)2SiOC8H15C6H4 6-Fluoroundecane 1,14-Difluorotetradecane 1-Fluorododecane Cis-1,2-difluorododec-1-ene 6-Fluorodec-1-ene 2-Fluoroundec-1-ene 1,1-Difluorotridec-1-ene 10-Rf4-hexadec-1-ene Propylbenzene F5C6(CH2)2SiOC8H15 F5C6(CF2)2SiOC8H15 F5C6(CFH)2SiOC8H15 Tetradec-1,13-ene PhSiOPh Hexadec-1,5,9,13-ene CH2 CH(CH2)10C6H5 1,2,3,4-F4-6-(CH2)2SiOC8H15C6H F5C6CFHCH2SiOC8H15 C8H15SiOC8H15 Ph(CH2)2SiOPh 1,2,3-F3-5-(CH2)2SiOC8H15C6H2 2,5-Cyclohexadienone 1,2,3-F3-4-(CH2)2SiOC8H15C6H2 3-Cyclohexenol 1,2-F2-4-(CH2)2SiOC8H15C6H3 1,2-F2-3-(CH2)2SiOC8H15C6H3

5.84 5.21 4.85 4.61 4.47 4.45 4.19 4.16 4.10 3.97 3.69 3.64 3.05 3.01 2.94 2.55 2.31 2.18 1.68 1.55 1.07 0.95 0.68 0.55 0.32 0.04 0.12 0.47 0.73 0.76 0.81 1.63 2.06 2.08 2.35 2.44 2.90 2.92 2.97 3.09 3.10 3.13 3.24 3.28 3.30 3.34 3.41 3.72 3.82 3.88 4.33 4.34 4.46 4.60 4.67 4.67 4.70 4.82 4.83 4.86 4.88 5.01 5.03 5.04

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492 Table 4 (Continued ) N

Name

Eq. (2)

65 66 67 68 69

Hexadec-1,3,5,7,9,11,13,15-ene m-FC6H4(CH2)2SiOC8H15 Icosane o-FC6H4(CH2)2SiOC8H15 1,3,5-Trihydroxybenzene

5.11 5.18 5.21 5.24 5.43

Rfn refers to (CF2)n1CF3.

MAXDP (with positive regression coefficients) and decreasing values for the descriptors SEigp, Har and MATS1v (with negative coefficients) would tend to predict higher fluorophilicities. The ranking of contributions given by Eq. (6) suggest that the distribution of topological distances in the molecules under investigation, expressed by descriptors such as SEigp and Har, plays an essential role that influences the values of ln P. As a practical application of the model obtained above we predict the fluorophilicity of some non-yet synthesized structures. Table 4 shows 69 compounds sorted according to their fluorofilicity values. Present theoretical analysis reveals greatly fluorophilic organic chemicals: (Rf8)3P(ln P = 5.84), 1,3,4-(Rf8)3C6H3 (ln P = 5.21), p-Rf8C6F4Rf8 (ln P = 5.85), that in principle could be synthesized and employed as new candidates for synthesis and catalysis. 3. Conclusions We have derived an alternative structure–fluorophilicity relationship presenting good predictive performance in a training set composed of 116 organic compounds. Some of the molecules included in the analysis were synthesized in a recent study [9,13]. The statistical parameters of present QSPR model compare fairly well with other ones reported previously in the literature based on LFER and MOD [9,11]. The best theoretical descriptors appearing in the final equation are able to reflect the molecular size, symmetry, aromaticity, as well as the importance of the fluorous-content in the organic compounds under investigation. 4. Experimental

491

functional groups and atom-centred fragments [26]. To this end we resort to the free-software Dragon version 3.0 available in the Web [27]. We excluded from our calculations the empirical and property-based descriptors provided by the software. Furthermore ten constitutional descriptors and four quantum-chemical descriptors (molecular dipole moments, total energies, and homo–lumo energies), not provided by the program, were added to the pool. As it is custommary in this sort of studies, we validate the predictive power of the model. The theoretical validation practiced on our models is based on the leave-more-out crossvalidation procedure (l-n%-o) [28], with n% representing the number of molecules removed from the training set. The number of cases for random removal in l-n%-o are 100,000. We proceed now to describe briefly the two different techniques employed for searching the best descriptors via linear regression algorithms. 4.2. The forward stepwise regression It is our purpose to search between a large number of D descriptors for an optimal subset of d descriptors that minimize the standard deviation S. Therefore, d will be the number of descriptors which are used for constructing a model. In other words, we want to obtain the global minimum of S(d) where d is a point in a space of D![d!(D  d)!] ones. A full search (FS) of optimal variables requires D![d!(D  d)!] linear regressions. The forward stepwise regression (FSR) [14] consists of a step by step addition of descriptors to the model, initially without any independent variable present in the regression, until there is no variable left outside the equation that minimizes its S. Here we define the standard deviation as follows: S¼

N X 1 res2 N  d  1 i¼1 i

(7)

where N is the number of molecules in the training set, and resi is the residual for molecule i (difference between the experimental and predicted property). The FSR sacrifices accuracy for a much smaller number of linear regressions than a FS.

4.1. General procedures 4.3. The replacement method The structures of the compounds are firstly pre-optimized with the molecular mechanics force field (MM+) procedure included in Hyperchem version 6.03 [25], and the resulting geometries are further refined by means of the semiempirical method Parametric Method-3 (PM3). We chose a gradient norm ˚ for the geometry optimization. limit of 0.01 kcal/A We derive a set of 1268 molecular descriptors including several types of variables: constitutional, topological, geometrical, charge, GEometry, Topology and Atoms-Weighted AssemblY (GETAWAY), Weighted Holistic Invariant Molecular descriptors (WHIM), 3D-Molecular Representation of Structure based on Electron diffraction (3D-MoRSE), molecular walk counts, BCUT descriptors, 2D-autocorrelations, aromaticity indices, Randic molecular profiles, radial distribution functions,

Some time ago we proposed the replacement method (RM) [15–17] that produces QSPR models that are quite close the FS ones with much less computational work. The RM gives better statistical parameters than the FSR and similar ones to the more elaborated Genetics Algorithms [29]. The RM approaches the minimum of S by judiciously taking into account the relative errors of the coefficients of the least-squares model given by a set of d descriptors d = {X1, X2, . . ., Xd}. The quality of the final optimized equations obtained via the two approaches FSR and RM was compared by means of two different criteria: the Akaike criterion and the Kubinyi function. Akaike’s information criterion (AIC) [30,31] considers the statistical goodness of fit and the number of parameters that

492

A.G. Mercader et al. / Journal of Fluorine Chemistry 128 (2007) 484–492

have to be estimated to achieve that degree of fit: X  N Nþdþ1 2 AIC ¼ resi ðN  d  1Þ2 i¼1

(8)

Therefore, the model that produces the minimum AIC value should be considered potentially the most useful. The Kubinyi function (FIT) [32,33] closely relates to the Fisher ratio (F), although the main disadvantage of F is its sensitivity to changes in d, if d is small, and its lower sensitivity if d is large. The FIT criterion has a low sensitivity toward changes in d values, as long as they are small numbers, and a substantially increasing sensitivity for large d values. The following equation is employed: FIT ¼

R2 ðN  d  1Þ ðN þ d2 Þð1  R2 Þ

(9)

where R is the correlation coefficient. The best model will present the highest value of the FIT function. Acknowledgement We would like to thank the Consejo Nacional de Investigaciones Cientı´ficas y Te´cnicas (CONICET). References [1] I.T. Horvath, J. Rabai, Science 72 (1994) 266. [2] D.S. Jozsef Rabai, E.K. Borbas, I. Kovesi, I. Kovesdi, A.G. Antal Csampai, V.E. Pashinnik, Y.G. Shermolovich, J. Fluorine Chem. 114 (2002) 199–207. [3] C.D.R. Rocaboy, B.L. Bennett, J.A. Gladysz, J. Phys. Org. Chem. 13 (2000) 596–603. [4] P. Bhattacharyya, B. Croxtall, J. Fawcett, D. Gudmunsen, E.G. Hope, R.D.W. Kemmitt, D.R. Paige, D.R. Russell, A.M. Stuart, D.R.W. Wood, J. Fluorine Chem. 101 (2000) 247–255. [5] C. Hansch, A. Leo, Exploring QSAR Fundamentals. Applications in Chemistry and Biology, American Chemical Society, Washington, DC, 1995.

[6] A.R. Katritzky, V.S. Lobanov, M. Karelson, Chem. Rev. Soc. 24 (1995) 279–287. [7] N. Trinajstic, Chemical Graph Theory, CRC Press, Boca Raton, FL, 1992. [8] L.E. Kiss, I. Ko¨vesdi, J. Ra´bai, J. Fluorine Chem. 108 (2001) 95. [9] F.T.T. Huque, K. Jones, R.A. Saunders, J.A. Platts, J. Fluorine Chem. 115 (2002) 119–128. [10] P.R. Duchowicz, F.M. Ferna´ndez, E.A. Castro, J. Fluorine Chem. 125 (2004) 43–48. [11] E. de Wolf, P. Ruelle, J. van den Brocke, B. Deelman, G. van Koten, J. Phys. Chem. 108 (2004) 1458. [12] M.S. Daniels, R.A. Saunders, J.A. Platts, J. Fluorine Chem. 125 (2004) 1291–1298. [13] D. Szabo´, J. Mohl, A.M. Ba´lint, A. Bodor, J. Ra´bai, J. Fluorine Chem. 127 (2006) 1496–1504. [14] N.R. Draper, H. Smith, Applied Regression Analysis, John Wiley & Sons, New York, 1981. [15] P.R. Duchowicz, E.A. Castro, F.M. Ferna´ndez, M.P. Gonza´lez, Chem. Phys. Lett. 412 (2005) 376–380. [16] P.R. Duchowicz, E.A. Castro, F.M. Ferna´ndez, Commun. Math. Comput. Chem. (MATCH) 55 (2006) 179–192. [17] A.M. Helguera, P.R. Duchowicz, M.A.C. Pe´rez, E.A. Castro, M.N.D.S. Cordeiro, M.P. Gonza´lez, Chemometr. Intell. Lab. 81 (2006) 180–187. [18] F. Harary, Graph Theory, Addison-Wesley, Miami, 1969. [19] D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structures, RSP-Wiley, Chichester, UK, 1983. [20] L.B. Kier, L.H. Hall, J.W. Frazer, J. Math. Chem. 7 (1991) 229. [21] P.A.P. Moran, Biometrika 37 (1950) 17–23. [22] M.C. Hemmer, V. Steinhauer, J. Gasteiger, Vib. Spectrosc. 19 (1999) 151. [23] T.M. Krygowski, J. Chem. Inf. Model. 33 (1993). [24] T.M. Krygowski, M. Cyran˜ski, Chem. Rev. 101 (2001). [25] HYPERCHEM (Hypercube) available from . [26] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley VCH, 2002. [27] DRAGON, Web 3.0 available from . [28] D.M. Hawkins, S.C. Basak, D. Mills, J. Chem. Inf. Model. 43 (2003) 579. [29] S.S. So, M. Karplus, J. Med. Chem. 39 (1996) 1521. [30] H. Akaike, in: B.N. Petrov, F. Csa´ki (Eds.), Second International Symposium on Information Theory, Akademiai Kiado, Budapest, (1973), pp. 267–281. [31] H. Akaike, IEEE Trans. Automat. Control AC-19 (1974) 716. [32] H. Kubinyi, Quant. Struct. Act. Relat. 13 (1994) 393–401. [33] H. Kubinyi, Quant. Struct. Act. Relat. 13 (1994) 285–294.