A norm indexes-based QSPR model for predicting the standard vaporization enthalpy and formation enthalpy of organic compounds


Journal Pre-proof

Xue Yan, Tian Lan, Qingzhu Jia, Fangyou Yan, Qiang Wang

PII: S0378-3812(19)30499-6

DOI: https://doi.org/10.1016/j.fluid.2019.112437

Reference: FLUID 112437

To appear in: Fluid Phase Equilibria

Received Date: 19 October 2019

Revised Date: 28 November 2019

Accepted Date: 15 December 2019

Please cite this article as: X. Yan, T. Lan, Q. Jia, F. Yan, Q. Wang, A norm indexes-based QSPR model for predicting the standard vaporization enthalpy and formation enthalpy of organic compounds, Fluid Phase Equilibria (2020), doi: https://doi.org/10.1016/j.fluid.2019.112437. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.

Xue Yan¹, Tian Lan², Qingzhu Jia¹,*, Fangyou Yan², Qiang Wang²

¹ School of Marine and Environmental Science, Tianjin Marine Environmental Protection and Restoration Technology Engineering Center, Tianjin University of Science and Technology, 13St. 29, TEDA, 300457 Tianjin, PR China

² School of Chemical Engineering and Material Science, Tianjin University of Science and Technology, 13St. 29, TEDA, 300457 Tianjin, PR China

* Corresponding author: Tel: 86-22-60600046; Fax: 86-22-60600046; E-mail: [email protected] (Q. Jia)

Abstract:

As important thermodynamic properties, vaporization enthalpy and formation enthalpy are extensively used in chemical process and chemical engineering design, and in environmental and agricultural applications. Based on the concept of the norm index proposed previously by our group, a unified QSPR model was built for predicting four property endpoints for 14 families of organic compounds. The four thermodynamic endpoints, namely the standard vaporization enthalpy ($\Delta H_v^0$), the standard formation enthalpy in the gas state ($\Delta H_f^0(\mathrm{g})$), the standard formation enthalpy in the solid state ($\Delta H_f^0(\mathrm{s})$) and the standard formation enthalpy in the liquid state ($\Delta H_f^0(\mathrm{l})$), were treated in the same modelling work. The model fits all four endpoints well, with $R^2$ of 0.967 for $\Delta H_v^0$, 0.990 for $\Delta H_f^0(\mathrm{g})$, 0.989 for $\Delta H_f^0(\mathrm{s})$ and 0.987 for $\Delta H_f^0(\mathrm{l})$. Moreover, the results of internal validation, external validation and applicability domain analysis indicated the good stability and robustness of this model. This work not only calculates vaporization enthalpy and formation enthalpy with the same formula, but also covers the gas, solid and liquid phases for formation enthalpy. The satisfactory results obtained in the present work suggest that this model and the norm indexes have good reliability and generalization ability.

Keywords: Norm indexes; Standard vaporization enthalpy; Standard formation enthalpy; QSPR; Atomic distribution matrix

1 Introduction

The physical and thermodynamic properties of organic compounds play an important role in chemical process and chemical engineering design [1-4], and in environmental and agricultural applications [5-9]. The vaporization enthalpy ($\Delta H_v$) and the formation enthalpy ($\Delta H_f$) are basic physical properties of organics. $\Delta H_v$ is an essential tool for correlating and predicting many physical phenomena, such as vapor pressures [10], surface tension [11], and the Hildebrand and Hansen solubility parameters of hydrocarbons [12]. $\Delta H_f$ can be used to calculate the equilibrium constants of reactions, and is of great significance in the study of resonance energies, bond energies, the nature of the chemical bond and other related features [13]. It is therefore necessary to understand the fundamental physicochemical properties of these compounds, such as $\Delta H_v$ and $\Delta H_f$. The number of newly synthesized and discovered organic compounds is increasing exponentially, but experimental data cannot be measured for most of them, and data obtained through experimental studies are both expensive and time-consuming. Therefore, a stable and reliable calculation method is urgently needed to solve these problems.

A great deal of effort has been devoted to developing methods for estimating enthalpy over the past decades. The empirical correlation method, the group contribution method (GCM) and the quantitative structure-property relationship (QSPR) method are usually used to calculate $\Delta H_v$ and $\Delta H_f$. Estimation of $\Delta H_v$ by the empirical correlation method depends on physicochemical properties including the critical properties ($T_c$, $P_c$, $V_c$) and the normal boiling point ($T_b$), as shown in the investigations of Riedel [14], Chen [15], Alibakhshi [16] and Belghit et al. [17]. In terms of GCM, Benson et al. [18-20] proposed a general method to estimate the thermochemical properties of chemical species on the basis of group additive contributions. Joback and Reid [21] developed the first-order contribution method, which used 41 first-order groups to calculate the $\Delta H_v$ and $\Delta H_f$ of organics. Constantinou and Gani [22] proposed the second-order contribution method, which calculated physicochemical properties including the critical properties, $\Delta H_v$ and $\Delta H_f$. Interestingly, they all tried to distinguish the isomers of some organic compounds. Jia and Wang [23-26] proposed the positional distributive group contribution method, which could effectively distinguish organic isomers and was successfully used to predict the critical properties and enthalpy of organic compounds. On the whole, the GCM is simple and general, yet it relies on the contribution values of the groups and cannot be applied to new structural classes.

Another way to predict physicochemical properties is to establish a QSPR model from the structure of molecules. QSPR [27-33] is an effective method to link the thermodynamic, physical and chemical properties of organics with their compositions and structures. Recently, several QSPR models for estimating the $\Delta H_v$ of organic compounds have been proposed. In Abooali and Sobati's investigation [34], a model was built with the MLR (Multiple Linear Regression) method for predicting the $\Delta H_v$ at $T_b$ of 180 pure refrigerants, with $R^2$ of 0.96 and AARD of 6.83%. Krasnykh et al. [35] determined the $\Delta H_v^0$ (standard enthalpy of vaporization) of trimethylolpropane and carboxylic acid esters by the transpiration method, and good results were obtained with a relative error of less than 2%. In addition, using the MLR method, Sosnowska et al. [28] proposed a QSPR model for estimating the $\Delta H_v^0$ of persistent organic pollutants with good fitting and predictive ability ($R^2 = 0.888$, $Q^2_{CV} = 0.878$). In terms of $\Delta H_f$, Hu et al. [36] presented a neural-network-based QSPR model to predict the $\Delta H_f^0$ of 180 organic molecules with good effect. Mercader et al. [37] established a QSPR model to predict the $\Delta H_f$ of 51 hydrocarbons by using correlation weighting of local invariants in atomic orbital molecular graphs (AOMGs); their model provided satisfactory results with low average deviation. Furthermore, Vatani et al. [38] predicted the $\Delta H_f^0$ of 1115 compounds based on a multivariate linear genetic algorithm; during their modelling work, five structural descriptors were calculated and selected from 1664 descriptor libraries.

In this study, a structure-property modelling study covering four property endpoints was performed with a uniform set of norm descriptors for predicting the enthalpy of organic chemicals. The standard vaporization enthalpy ($\Delta H_v^0$), the standard formation enthalpy in the gas state ($\Delta H_f^0(\mathrm{g})$), the standard formation enthalpy in the solid state ($\Delta H_f^0(\mathrm{s})$) and the standard formation enthalpy in the liquid state ($\Delta H_f^0(\mathrm{l})$) were used as the property endpoints in the same modelling work. The prediction ability of this model was also tested using several validation methods.

2 Method

2.1 Dataset

Experimental data for 573 $\Delta H_v^0$, 964 $\Delta H_f^0(\mathrm{g})$, 367 $\Delta H_f^0(\mathrm{s})$ and 873 $\Delta H_f^0(\mathrm{l})$ values were taken from the CRC Handbook of Chemistry and Physics [39]. The dataset covers 14 families of organic compounds, including chain and cyclic hydrocarbons, alcohols, ketones, carboxylic acids, amines, halogenated hydrocarbons, sulfur compounds and so on. The organics involved in $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$, together with the corresponding experimental values, are listed in Tables S1-S4 in the Supplementary Material.

2.2 Atomic distribution matrix

In this work, the most stable state and accurate spatial structure of the atoms in each molecule were obtained using the quantum chemistry ab initio method [40], based on Restricted Hartree-Fock (RHF) at the STO-3G level, in the HyperChem 7.0 software (http://www.hyper.com/).

To reveal the composition of atoms in the molecular structure, a series of property matrices (PE) composed of atomic properties were proposed and defined as follows.

$$PE_1 = \left[ \frac{aw_i}{\sum aw_i} \right] \tag{1}$$

$$PE_2 = \left[ r_i \times en_i \right] \tag{2}$$

$$PE_3 = \left[ \frac{ac_i}{z_i} \right] \tag{3}$$

$$PE_4 = \left[ \left( oe_i \times en_i \right)^{1/2} \right] \tag{4}$$

$$PE_5 = \left[ \exp\left( ie_i \right) \right] \tag{5}$$

$$PE_6 = \left[ q_{ij} \right], \quad q_{ij} = ac_i - ac_j \tag{6}$$

where $r_i$, $z_i$, $ie_i$, $en_i$, $ac_i$, $oe_i$ and $aw_i$ represent the Van der Waals radius (Å), number of protons, ionization energy (10 eV), electronegativity, atom charge (C, coulomb), outermost number of electrons and atom weight (g/mol) of atom $i$ in a molecule, respectively. The property matrices (PE) were unitless.

Of the seven atomic parameters used in the property matrices (PE), six are constants; the exception is the atomic charge, which was calculated by the Mulliken method. The properties of the atoms involved are listed in Table S5 in the Supplementary Material.
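The property matrices can be illustrated with a short NumPy sketch. It assumes the forms of Eqs. (1)-(6) as read here (in particular, that $PE_1$ and $PE_3$ are ratios), and the function name `property_vectors` and all numerical inputs are illustrative placeholders rather than the Table S5 entries or real Mulliken charges.

```python
import numpy as np

def property_vectors(aw, r, en, ac, z, oe, ie):
    """Build PE1..PE6 from per-atom arrays (one entry per atom)."""
    aw, r, en, ac, z, oe, ie = (
        np.asarray(a, dtype=float) for a in (aw, r, en, ac, z, oe, ie)
    )
    pe1 = aw / aw.sum()              # Eq. (1): atom weight share (assumed ratio)
    pe2 = r * en                     # Eq. (2): radius times electronegativity
    pe3 = ac / z                     # Eq. (3): charge over proton number (assumed ratio)
    pe4 = np.sqrt(oe * en)           # Eq. (4): (outer electrons x electronegativity)^(1/2)
    pe5 = np.exp(ie)                 # Eq. (5): exp of ionization energy (in 10 eV)
    pe6 = ac[:, None] - ac[None, :]  # Eq. (6): pairwise charge difference q_ij
    return pe1, pe2, pe3, pe4, pe5, pe6

# Placeholder three-atom input (C, H, Br); the charges are invented for illustration.
pe1, pe2, pe3, pe4, pe5, pe6 = property_vectors(
    aw=[12.011, 1.008, 79.904],
    r=[1.70, 1.20, 1.85],
    en=[2.55, 2.20, 2.96],
    ac=[-0.10, 0.05, 0.05],
    z=[6, 1, 35],
    oe=[4, 1, 7],
    ie=[1.126, 1.360, 1.181],
)
```

Here $PE_1$ to $PE_5$ come out as length-$n$ vectors and $PE_6$ as an $n \times n$ matrix, which is what the later matrix products require.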

To further describe the spatial position of each atom in the molecule, position matrices including the adjacent matrix $P_1$, the interval matrix $P_2$, the Euclidean distance matrix $P_3$ and the adjacent Euclidean distance matrix $P_4$ were used. The four matrices were defined as follows:

$$P_1 = \left[ p_{ij} \right], \quad p_{ij} = \begin{cases} 1 & s_{ij} = 1 \\ 0 & s_{ij} \neq 1 \end{cases} \tag{7}$$

$$P_2 = \left[ p_{ij} \right], \quad p_{ij} = \begin{cases} 2 & s_{ij} = 2 \\ 0 & s_{ij} \neq 2 \end{cases} \tag{8}$$

$$P_3 = \left[ d_{ij} \right] \tag{9}$$

$$P_4 = \left[ p_{ij} \right], \quad p_{ij} = \begin{cases} d_{ij} & s_{ij} = 1 \\ 0 & s_{ij} \neq 1 \end{cases} \tag{10}$$

where $d_{ij}$ is the Euclidean spatial distance between atoms $i$ and $j$, and $s_{ij}$ is the path between atoms $i$ and $j$. The unit of $d_{ij}$ is Angstrom (Å).
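As a minimal sketch of Eqs. (7)-(10), the position matrices can be built from a bond list and 3-D coordinates. The sketch assumes that $s_{ij}$ is the shortest path counted in bonds, which the text does not state explicitly; the helper names `path_lengths` and `position_matrices` and the toy geometry are hypothetical.

```python
import numpy as np
from collections import deque

def path_lengths(n_atoms, bonds):
    """Shortest bond-count path s_ij for all atom pairs (breadth-first search)."""
    adj = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    s = np.full((n_atoms, n_atoms), -1, dtype=int)  # -1 marks "not connected"
    for start in range(n_atoms):
        s[start, start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if s[start, v] < 0:
                    s[start, v] = s[start, u] + 1
                    queue.append(v)
    return s

def position_matrices(coords, bonds):
    """Return P1 (adjacent), P2 (interval), P3 (distance), P4 (adjacent distance)."""
    coords = np.asarray(coords, dtype=float)
    s = path_lengths(len(coords), bonds)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    p1 = (s == 1).astype(float)     # Eq. (7)
    p2 = 2.0 * (s == 2)             # Eq. (8)
    p3 = d                          # Eq. (9)
    p4 = np.where(s == 1, d, 0.0)   # Eq. (10)
    return p1, p2, p3, p4

# Toy three-atom chain 0-1-2 with a right angle at atom 1 (coordinates in Å).
p1, p2, p3, p4 = position_matrices(
    coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)],
    bonds=[(0, 1), (1, 2)],
)
```

In this toy chain, atoms 0 and 2 are two bonds apart, so they contribute to $P_2$ and $P_3$ but not to $P_1$ or $P_4$.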

Then, combining the above matrices, 18 atomic distribution matrices (M) were further proposed and listed in Table 1.

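Table 1 The 18 atomic distribution matrices.

| $M_i$ | Definition | $M_i$ | Definition |
|---|---|---|---|
| $M_1$ | $PE_2^T \; .\!\times P_1$ | $M_{10}$ | $PE_4 \times PE_4^T \times (P_3 \times PE_6)$ |
| $M_2$ | $PE_4 \times PE_4^T \times P_1$ | $M_{11}$ | $PE_2 \times PE_2^T \times (P_3 \times PE_6)$ |
| $M_3$ | $PE_4 \times PE_4^T \times P_2$ | $M_{12}$ | $PE_4 \times PE_4^T \; .\!\times (P_3 \times PE_6)$ |
| $M_4$ | $PE_5 \times PE_5^T \; .\!\times P_4$ | $M_{13}$ | $PE_2 \times PE_2^T \; .\!\times (P_3 \times PE_6)$ |
| $M_5$ | $PE_4^T \; .\!\times P_1$ | $M_{14}$ | $PE_2 \times PE_2^T \times P_1$ |
| $M_6$ | $PE_2 \times PE_2^T \; .\!\times P_1$ | $M_{15}$ | $PE_3 \times PE_3^T \; .\!\times P_3$ |
| $M_7$ | $PE_5 \times PE_5^T \times P_4$ | $M_{16}$ | $PE_2^T \; .\!\times P_4$ |
| $M_8$ | $PE_5 \times PE_5^T \times P_2$ | $M_{17}$ | $PE_5^T \; .\!\times P_2$ |
| $M_9$ | $PE_1^T \; .\!\times P_1$ | $M_{18}$ | $PE_4 \times PE_4^T \; .\!\times P_4$ |

Matrix $Q^T$ is the transpose of matrix $Q$. The operator $[.\times]$ means:

(Ⅰ) for $Q = [q_j]$ and $W = [w_{ij}]$, $Q \; .\!\times W = [q_j \times w_{ij}]$;

(Ⅱ) for $Q = [q_{ij}]$ and $W = [w_{ij}]$, $Q \; .\!\times W = [q_{ij} \times w_{ij}]$.

2.3 Norm expressions used

In this work, the norm expressions in Eqs. (11)-(16) were used. The norm ($I$) of each matrix is shown in Table 2. Based on the above atomic distribution matrices, 22 norm descriptors were calculated using the following norm formulas.

$$\lVert M \rVert_1 = \sum_i \lambda_i(M) \tag{11}$$

$$\lVert M \rVert_2 = \sum_j \sum_i m_{ij} \tag{12}$$

$$\lVert M \rVert_3 = \frac{1}{n} \left( \sum_j \sum_i m_{ij} \right) \tag{13}$$

$$\lVert M \rVert_4 = \frac{1}{n^2} \left( \sum_j \sum_i m_{ij} \right) \tag{14}$$

$$\lVert M \rVert_5 = \left( \max \lambda_i \left( M \times M^H \right) \right) \tag{15}$$

$$\lVert M \rVert_6 = \sum_j \sum_i m_{ij}^2 \tag{16}$$

where $\lambda_i$ is the eigenvalue of the matrix and $M^H$ is the Hermitian (conjugate) transpose of $M$.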
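To make the construction concrete, the following sketch builds one atomic distribution matrix of the form $PE_2 \times PE_2^T \; .\!\times P_1$ and evaluates Eqs. (11)-(16) literally as they read in this text. Several points are assumptions: that the product is evaluated left to right, that $n$ is the number of atoms (the matrix dimension), and that no radical or absolute value is missing from the printed norm expressions. The names `dot_times` and `norm_indexes` and the toy inputs are hypothetical.

```python
import numpy as np

def dot_times(q, w):
    """The [.x] operator: a vector q scales column j of w by q_j; a matrix q acts element-wise."""
    q = np.asarray(q, dtype=float)
    w = np.asarray(w, dtype=float)
    return q[None, :] * w if q.ndim == 1 else q * w

def norm_indexes(m):
    """Evaluate Eqs. (11)-(16) for a square matrix m, keyed by norm number."""
    m = np.asarray(m, dtype=float)
    n = m.shape[0]                     # assumed: n = number of atoms
    total = m.sum()
    return {
        1: float(np.sum(np.linalg.eigvals(m)).real),             # Eq. (11)
        2: float(total),                                         # Eq. (12)
        3: float(total / n),                                     # Eq. (13)
        4: float(total / n**2),                                  # Eq. (14)
        5: float(np.max(np.linalg.eigvals(m @ m.conj().T).real)),  # Eq. (15)
        6: float(np.sum(m**2)),                                  # Eq. (16)
    }

# Toy three-atom chain: placeholder PE2 values and its adjacency matrix P1.
pe2 = np.array([1.0, 2.0, 3.0])
p1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

m6 = dot_times(np.outer(pe2, pe2), p1)   # (PE2 x PE2^T) .x P1, left to right
m1 = dot_times(pe2, p1)                  # PE2^T .x P1, case (I) of the operator
I = norm_indexes(m6)
```

For this toy input `m6` is `[[0, 2, 0], [2, 0, 6], [0, 6, 0]]`, so the entry sum in Eq. (12) is 16 and the sum of squares in Eq. (16) is 80.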

2.4 Model Validation

The quality and predictability of a QSPR model should be strictly and comprehensively verified [41]. Leave-one-out, 5-fold and 10-fold cross-validations [42] were used to evaluate the robustness of this QSPR model. The prediction ability of the model was verified using the external validation method [43]. The Y-randomization test was used to rule out chance correlation in the modelling work [44]. The reliability of the model was assessed through its applicability domain (AD) [45].

3 Results and discussion

3.1 Model proposed

A unified QSPR model, Eq. (17), was built for predicting the four property endpoints using the same norm descriptors. The parameters $b_k$ of this model for the four properties are shown in Table 2.

$$\Delta H = b_0 + \sum_{k=1}^{22} b_k I_k \tag{17}$$

where $\Delta H$ includes $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$, and $I_k$ represents the norm indexes.
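Eq. (17) is a multiple linear regression in the norm indexes. The sketch below shows how such a model could be fitted and applied, assuming ordinary least squares and using synthetic descriptors with three columns instead of the 22 norm indexes; the coefficients, the data and the names `fit_mlr` and `predict` are invented for illustration.

```python
import numpy as np

def fit_mlr(I, y):
    """Least-squares estimate of [b0, b1, ..., bK] for y = b0 + sum_k b_k * I_k."""
    A = np.column_stack([np.ones(len(I)), I])   # leading column of ones gives b0
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

def predict(b, I):
    """Apply the linear model to a descriptor matrix I (rows = compounds)."""
    return b[0] + np.asarray(I, dtype=float) @ b[1:]

# Synthetic stand-in data: 200 "compounds", 3 descriptors, small noise.
rng = np.random.default_rng(0)
I = rng.normal(size=(200, 3))
true_b = np.array([2.0, 1.0, -0.5, 0.25])
y = true_b[0] + I @ true_b[1:] + rng.normal(scale=0.1, size=200)

b = fit_mlr(I, y)
y_hat = predict(b, I)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
mae = np.mean(np.abs(y - y_hat))
```

With the real descriptors, `I` would hold one row per compound and 22 columns, and one coefficient vector would be fitted per property endpoint.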

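Table 2 The parameters of this model for predicting the four property endpoints.

| $k$ | $I_k$ | $b_k$ for $\Delta H_v^0$ | $b_k$ for $\Delta H_f^0(\mathrm{g})$ | $b_k$ for $\Delta H_f^0(\mathrm{s})$ | $b_k$ for $\Delta H_f^0(\mathrm{l})$ |
|---|---|---|---|---|---|
| 0 | | 31.165 | 96.502 | 267.380 | 94.648 |
| 1 | $\lVert M_1 \rVert_1$ | 8.252 | 57.788 | 31.200 | 61.273 |
| 2 | $\lVert M_2 \rVert_1$ | -3.190 | 52.149 | 53.784 | 52.336 |
| 3 | $\lVert M_3 \rVert_1$ | -0.066 | 0.902 | 0.608 | 0.813 |
| 4 | $\lVert M_4 \rVert_1$ | -1.956 | -15.327 | -10.160 | -16.876 |
| 5 | $\lVert M_5 \rVert_2$ | 30.295 | -90.640 | -61.929 | -91.048 |
| 6 | $\lVert M_1 \rVert_2$ | -25.545 | -6.830 | -32.970 | -8.069 |
| 7 | $\lVert M_6 \rVert_2$ | 1.454 | -18.145 | -14.733 | -18.112 |
| 8 | $\lVert M_7 \rVert_3$ | 0.852 | 5.506 | 2.808 | 6.007 |
| 9 | $\lVert M_8 \rVert_3$ | 0.050 | -1.456 | -1.595 | -1.517 |
| 10 | $\lVert M_9 \rVert_4$ | -165.655 | -1125.487 | -1468.420 | -1217.389 |
| 11 | $\lVert M_{10} \rVert_4$ | $2.852 \times 10^{-3}$ | 0.252 | 0.064 | 0.309 |
| 12 | $\lVert M_{11} \rVert_4$ | $-2.449 \times 10^{-3}$ | -0.108 | -0.041 | -0.127 |
| 13 | $\lVert M_{12} \rVert_4$ | -0.229 | -25.654 | -16.304 | -25.268 |
| 14 | $\lVert M_{13} \rVert_4$ | 0.126 | 11.275 | 7.777 | 10.846 |
| 15 | $\lVert M_7 \rVert_5$ | -0.131 | 6.986 | 8.581 | 7.210 |
| 16 | $\lVert M_2 \rVert_5$ | -0.965 | -29.308 | -34.834 | -29.671 |
| 17 | $\lVert M_{14} \rVert_5$ | 0.878 | 13.850 | 16.899 | 14.150 |
| 18 | $\lVert M_{15} \rVert_5$ | 33.946 | 311.366 | 101.511 | 246.837 |
| 19 | $\lVert M_{16} \rVert_6$ | -0.883 | 67.100 | 77.722 | 61.254 |
| 20 | $\lVert M_{17} \rVert_6$ | 0.186 | -34.884 | -32.648 | -35.958 |
| 21 | $\lVert M_4 \rVert_6$ | -0.798 | 24.326 | 15.457 | 26.600 |
| 22 | $\lVert M_{18} \rVert_6$ | 0.734 | -31.801 | -35.528 | -31.593 |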

The predicted values of the four properties and the corresponding absolute errors are summarized in Tables S1-S4. Statistical results for the model predictions are shown in Table 3. The $R^2$ values for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ were 0.967, 0.990, 0.989 and 0.987, respectively, indicating that the model fits the data well. Fig. 1 is a scatter plot of the experimental values versus the calculated values of $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$. The calculated values of the four endpoints coincide well with the experimental values, which further affirms the accuracy of this model.

Table 4 lists the mean absolute error (MAE) for the 14 families of organic compounds. For $\Delta H_v^0$, the calculated values for chain and cyclic hydrocarbons, ketones and monoaromatic hydrocarbons were close to the experimental values, while the errors fluctuated more for alcohols, amines and other nitrogen compounds. For $\Delta H_f^0(\mathrm{g})$, the MAE of each family of compounds is relatively close to the overall MAE, with no significant differences. For $\Delta H_f^0(\mathrm{s})$, the high deviations for chain hydrocarbons, aldehydes and esters might be attributed to the small number of samples. For $\Delta H_f^0(\mathrm{l})$, the MAE of each class was satisfactory except for the acids. Although the QSPR model has some error in calculating the enthalpy of certain kinds of compounds, it is stable and reliable as a whole, which supports the further evaluation and application of the model. Also, to explain the calculation of the norm indexes in detail, the $\Delta H_v^0$ of 3-bromopropene (compound No. 30 in Table S1) is worked through as an example in the Supplementary Material.

Fig. 1 The calculated values vs. experimental values for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$.

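Table 3 Summary of statistical results of this model.

| Properties | Samples | $R^2$ | $F$ | MAE | $s$ | $Q^2_{LOO}$ | $Q^2_{5\text{-fold}}$ | $Q^2_{10\text{-fold}}$ | $R^2_{training}$ | $R^2_{testing}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $\Delta H_v^0$ | 573 | 0.967 | 737.462 | 2.499 | 3.449 | 0.963 | 0.963 | 0.963 | 0.965 | 0.972 |
| $\Delta H_f^0(\mathrm{g})$ | 964 | 0.990 | 4207.505 | 22.460 | 29.570 | 0.989 | 0.988 | 0.989 | 0.989 | 0.991 |
| $\Delta H_f^0(\mathrm{s})$ | 367 | 0.989 | 1398.014 | 32.402 | 41.725 | 0.986 | 0.986 | 0.986 | 0.989 | 0.986 |
| $\Delta H_f^0(\mathrm{l})$ | 873 | 0.987 | 3030.539 | 22.787 | 31.142 | 0.986 | 0.986 | 0.986 | 0.988 | 0.986 |

where $s$ is the standard deviation.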

Table 4 Mean Absolute Error (MAE, kJ/mol) of 14 families of organic compounds.

| Compound type | $\Delta H_v^0$ Numbers | $\Delta H_v^0$ MAE | $\Delta H_f^0(\mathrm{g})$ Numbers | $\Delta H_f^0(\mathrm{g})$ MAE | $\Delta H_f^0(\mathrm{s})$ Numbers | $\Delta H_f^0(\mathrm{s})$ MAE | $\Delta H_f^0(\mathrm{l})$ Numbers | $\Delta H_f^0(\mathrm{l})$ MAE |
|---|---|---|---|---|---|---|---|---|
| Chain hydrocarbons | 121 | 1.26 | 121 | 18.07 | 2 | 47.08 | 136 | 22.61 |
| Cyclic hydrocarbons | 30 | 1.58 | 65 | 20.71 | 11 | 35.80 | 66 | 17.86 |
| Alcohols | 57 | 3.52 | 51 | 18.24 | 26 | 39.82 | 63 | 12.24 |
| Ethers | 42 | 2.85 | 35 | 22.07 | 1 | 25.97 | 36 | 27.67 |
| Aldehydes | 3 | 2.82 | 18 | 23.92 | 2 | 58.48 | 14 | 30.87 |
| Ketones | 25 | 1.63 | 29 | 16.42 | 7 | 22.98 | 23 | 17.15 |
| Acids | 3 | 2.38 | 32 | 29.02 | 37 | 32.18 | 24 | 35.03 |
| Esters | 41 | 2.73 | 55 | 25.27 | 6 | 43.10 | 58 | 23.08 |
| Monoaromatic hydrocarbons | 36 | 3.32 | 65 | 18.48 | 39 | 22.35 | 53 | 16.33 |
| Amines | 31 | 5.02 | 41 | 16.60 | 13 | 24.09 | 43 | 20.24 |
| Other nitrogen compounds | 48 | 4.08 | 164 | 26.33 | 183 | 34.25 | 118 | 30.23 |
| Halogenated hydrocarbons | 79 | 1.80 | 140 | 26.86 | 13 | 35.80 | 106 | 24.57 |
| Sulfur compounds | 47 | 2.45 | 72 | 17.24 | 4 | 15.33 | 65 | 17.28 |
| Other compounds | 10 | 1.40 | 76 | 26.18 | 23 | 27.64 | 68 | 27.00 |
| Overall | 573 | 2.50 | 964 | 22.46 | 367 | 32.40 | 873 | 22.79 |

3.2 Internal validation

The leave-one-out (LOO), 5-fold and 10-fold cross-validation methods were performed in this study, and the statistical results are shown in Table 3. The high values of $Q^2_{LOO}$ (0.963, 0.989, 0.986, 0.986), $Q^2_{5\text{-fold}}$ (0.963, 0.988, 0.986, 0.986) and $Q^2_{10\text{-fold}}$ (0.963, 0.989, 0.986, 0.986) for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ demonstrate the robustness and reliability of this model. Meanwhile, the error distributions of this model and of the LOO-CV, 5-fold CV and 10-fold CV models are presented in Fig. 2. The error distributions obtained from internal validation were all very similar to those of the full model, which further suggests the stability and robustness of this norm index-based model.
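The cross-validation statistics can be sketched as follows. The text does not give the formula for $Q^2$, so the sketch assumes the common definition $Q^2 = 1 - \mathrm{PRESS}/SS_{tot}$, with random folds and an ordinary least-squares refit per fold; the data are synthetic and the names `ols` and `q2_kfold` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients [b0, b1, ...] with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

def q2_kfold(X, y, k, seed=0):
    """Cross-validated Q^2 with k random folds; k = len(y) gives leave-one-out."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # all samples outside this fold
        b = ols(X[train], y[train])
        pred = b[0] + X[fold] @ b[1:]
        press += np.sum((y[fold] - pred) ** 2)   # prediction error on held-out samples
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic stand-in data (100 samples, 3 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=100)

q2_loo = q2_kfold(X, y, k=len(y))
q2_5 = q2_kfold(X, y, k=5)
q2_10 = q2_kfold(X, y, k=10)
```

A $Q^2$ close to the fitted $R^2$, as reported in Table 3, is the pattern expected when no small group of samples dominates the regression.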

Fig. 2 Distributions of errors for this model and internal validation for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$.

3.3 External validation

External validation plays an important role in proving the predictive ability of a QSPR model [43]. In this work, for each property endpoint, the original dataset was randomly divided into a training set and a testing set at a ratio of approximately 4:1. The validation results are shown in Table 3. The relationship between the experimental values and the calculated values of the four properties for the training set and the testing set is illustrated in Fig. 3. The calculated values of both sets were consistent with the experimental values, suggesting satisfactory external predictability of this model. The $R^2_{training}$ and $R^2_{testing}$ values were 0.965 and 0.972 for $\Delta H_v^0$, 0.989 and 0.991 for $\Delta H_f^0(\mathrm{g})$, 0.989 and 0.986 for $\Delta H_f^0(\mathrm{s})$, and 0.988 and 0.986 for $\Delta H_f^0(\mathrm{l})$, which are all similar to the overall $R^2$ and further demonstrate the good stability of this model. In general, these satisfactory predictions show that the norm index proposed by our group can effectively describe the relationship between the structure of organic compounds and their enthalpy.
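The external validation procedure described above can be sketched in a few lines. The sketch assumes a purely random 4:1 split, an ordinary least-squares fit on the training part only, and $R^2$ computed against the mean of each subset; the data are synthetic and the names `r2_score` and `external_validation` are hypothetical.

```python
import numpy as np

def r2_score(y, y_hat):
    """Coefficient of determination relative to the mean of y."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def external_validation(X, y, test_fraction=0.2, seed=0):
    """Random split (4:1 by default), fit on the training set, score both sets."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test, train = idx[:n_test], idx[n_test:]
    A = np.column_stack([np.ones(len(train)), X[train]])
    b, *_ = np.linalg.lstsq(A, y[train], rcond=None)   # the testing set is never used here

    def model(Z):
        return b[0] + Z @ b[1:]

    return (r2_score(y[train], model(X[train])),
            r2_score(y[test], model(X[test])),
            len(train), len(test))

# Synthetic stand-in data (100 samples, 3 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=100)

r2_train, r2_test, n_train, n_test = external_validation(X, y)
```

Keeping the testing set out of the fit is what makes $R^2_{testing}$ an estimate of predictive rather than fitting ability.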

Fig. 3 The calculated values of external validation and experimental values for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$.

3.4 Y-randomization test

In the Y-randomization validation process, all experimental values were randomly shuffled to build a new QSPR model. To make the results reliable, the Y-randomization was repeated 10,000 times in this work. The average $R_i^2$ values were 0.038, 0.023, 0.060 and 0.025 for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$, respectively, which are far lower than the original $R^2$ values. Accordingly, there was no chance correlation in the modelling process, which further verifies the good stability of the original model.
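A minimal sketch of the Y-randomization test follows, assuming an ordinary least-squares model and synthetic data; only 200 shuffles are run here to keep the example fast, whereas the study used 10,000. The names `fit_r2` and `y_randomization` are hypothetical.

```python
import numpy as np

def fit_r2(X, y):
    """R^2 of a least-squares fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_rounds=200, seed=0):
    """Average R^2 over models refitted to randomly shuffled responses."""
    rng = np.random.default_rng(seed)
    scores = [fit_r2(X, rng.permutation(y)) for _ in range(n_rounds)]
    return float(np.mean(scores))

# Synthetic stand-in data (100 samples, 3 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=100)

r2_original = fit_r2(X, y)
r2_shuffled_mean = y_randomization(X, y)
```

Shuffling breaks the link between descriptors and responses, so a shuffled $R^2$ near zero alongside a high original $R^2$ indicates that the original fit is not a chance correlation.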

3.5 Applicability Domain (AD)

The applicability domain, characterized by the molecular properties of the training set, is of great significance in QSPR research [46]. In this study, the leverage values and the standardized residuals (calculated versus experimental values) for this model are presented as Williams plots in Fig. 4.

The results in Fig. 4 indicate that most of the compounds (about 95.1%-96.0%) are located in the acceptable domain, which is bounded by the critical leverage value ($h^*$) of the Hat matrix and the standardized residual range from -3 to 3. This shows that the model has a wide applicability domain with good predicted results. However, 2.4%-3.2% of the chemicals have a leverage $h$ exceeding $h^*$ while remaining within three standard deviation units. In fact, a high $h$ does not always indicate an outlier of the model, and these chemicals with high $h$ are valuable for the stability of the QSPR model, so they can be regarded as good influence points [47]. Fig. 4 also shows that 1.6%-1.9% of the data points lie outside the standardized residual range of [-3, 3]. For these compounds, the predictions might not be reliable, owing to the high error caused by unreasonable experimental data [48]. On the whole, the verification results demonstrate that this model covers a large response and structural applicability domain for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ prediction.
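The quantities behind a Williams plot can be computed as in the sketch below. Two conventions are assumed because the text does not state them: the critical leverage is taken as $h^* = 3(p+1)/n$ for $p$ descriptors and $n$ compounds, and residuals are standardized by the residual standard deviation. The data are synthetic and the name `williams_quantities` is hypothetical.

```python
import numpy as np

def williams_quantities(X, y):
    """Leverages, standardized residuals, critical leverage and in-domain mask."""
    A = np.column_stack([np.ones(len(X)), X])   # design matrix with intercept
    n, n_params = A.shape
    hat = A @ np.linalg.pinv(A.T @ A) @ A.T     # Hat matrix H = A (A^T A)^-1 A^T
    h = np.diag(hat)                            # leverage of each compound
    resid = y - hat @ y
    s = np.sqrt(np.sum(resid ** 2) / (n - n_params))
    std_resid = resid / s
    h_star = 3.0 * n_params / n                 # assumed convention for h*
    inside = (h <= h_star) & (np.abs(std_resid) <= 3.0)
    return h, std_resid, h_star, inside

# Synthetic stand-in data (100 samples, 3 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=100)

h, std_resid, h_star, inside = williams_quantities(X, y)
fraction_inside = inside.mean()
```

Plotting `std_resid` against `h`, with a vertical line at `h_star` and horizontal lines at ±3, gives the Williams plot; `fraction_inside` corresponds to the share of compounds inside the acceptable domain.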

Fig. 4 Applicability domain of this model for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ prediction.

3.6 Model comparison

To confirm the quality and performance of this model, it was compared with other reference models, and the results are summarized in Table 5.

For $\Delta H_v^0$, good prediction results were provided by Puri et al. [49] and Padmanabhan et al. [27]. However, these models were developed for a single class of compounds, polychlorinated biphenyls, and it was not possible to confirm their external prediction ability. In the work of Sosnowska et al. [28], an MLR model was established with good fitting and robustness ($R^2 = 0.888$, $Q^2_{CV} = 0.878$, $RMSE_{CV} = 5.34$) for calculating the $\Delta H_v^0$ of 78 Persistent Organic Pollutants (POPs), yet few samples were involved in their modelling work.

For $\Delta H_f^0$, some researchers have tried to develop molecular structure-based models with satisfactory performance. The models of Gharagheizi [50] and Vatani et al. [38] could accurately predict $\Delta H_f^0$ with $R^2$ of 0.950 and 0.983, respectively, yet the sample data used in the modelling contained calculated values. The models of Begam et al. [51] and the linear models of Adams et al. [52] all have high $R^2$ and $F$ values, but few types of compounds were involved in their samples, and these models might not extrapolate. Albahri and Aljasmi [53] proposed two GCM models based on the ANN (Artificial Neural Networks) and MNLR (Multivariable Nonlinear Regression) methods, respectively. Although these models cover many kinds of organic compounds, the GCM depends on the group division and the group contribution values.

Compared with the references above, our work covers a variety of organic compounds with 14 families of structures, including chain and cyclic hydrocarbons, alcohols, ketones, carboxylic acids, amines, halogenated hydrocarbons, sulfur compounds and so on, which might greatly expand the application scope of this model. Also, a unified QSPR model was developed for predicting the four different properties, and it has satisfactory fitting accuracy, with $R^2$ of 0.967 for $\Delta H_v^0$, 0.990 for $\Delta H_f^0(\mathrm{g})$, 0.989 for $\Delta H_f^0(\mathrm{s})$ and 0.987 for $\Delta H_f^0(\mathrm{l})$. Moreover, the statistical results of this model were even slightly better than those of the GCM models. On the whole, this QSPR model has reasonable predictability, good statistical accuracy and good generalization.

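Table 5 Comparison with references.

| Property | Reference | Compound type | Method | N | $R^2$ | $Q^2$ | $F$ | $s$ |
|---|---|---|---|---|---|---|---|---|
| $\Delta H_v^0$ | Puri et al. [49] | Polychlorinated biphenyls | PLS | 17 | 0.996 | 0.852 | 427 | 0.845 |
| $\Delta H_v^0$ | Padmanabhan et al. [27] | Polychlorinated biphenyls | MLR | 17 | 0.976 | 0.948 | 200 | 2.051 |
| $\Delta H_v^0$ | Padmanabhan et al. [27] | Polychlorinated biphenyls | MLR | 27 | 0.858 | 0.758 | 48 | 3.567 |
| $\Delta H_v^0$ | Sosnowska et al. [28] | Persistent Organic Pollutants | MLR | 78 | 0.888 | 0.878 | 302 | 5.113 |
| $\Delta H_v^0$ | this work | 14 kinds of organic compounds | MLR | 573 | 0.967 | 0.963 | 737.462 | 3.449 |
| $\Delta H_f^0$ | Gharagheizi [50] | 14 kinds of organic compounds | GA-MLR | 1692 | 0.950 | 0.948 | 2830.03 | 104.03 |
| $\Delta H_f^0$ | Vatani et al. [38] | 14 kinds of organic compounds | GA-MLR | 1115 | 0.983 | 0.9826 | 10239.02 | 58.541 |
| $\Delta H_f^0$ | Adams et al. [52] | oxygen-containing heterocycles | GA-MLR | 46 | 0.945 | 0.93 | 137 | 14.17 |
| $\Delta H_f^0$ | Begam et al. [51] | hydrocarbon | PR | 60 | 1 | | 1139.9 | 2.7 |
| $\Delta H_f^0$ | Begam et al. [51] | hydrocarbon | PCR | 60 | 0.984 | | 624.213 | 3.768 |
| $\Delta H_f^0$ | Begam et al. [51] | hydrocarbon | PLSR | 60 | 0.984 | | 697.650 | 3.732 |
| $\Delta H_f^0$ | Albahri and Aljasmi [53] | 14 kinds of organic compounds | SGC-ANN | 584 | 0.987 | | | 22.401 |
| $\Delta H_f^0$ | Albahri and Aljasmi [53] | 14 kinds of organic compounds | SGC-MNLR | 584 | 0.81 | | | 88.399 |
| $\Delta H_f^0(\mathrm{g})$ | this work | 14 kinds of organic compounds | MLR | 964 | 0.990 | 0.989 | 4207.505 | 29.570 |
| $\Delta H_f^0(\mathrm{s})$ | this work | 14 kinds of organic compounds | MLR | 367 | 0.989 | 0.986 | 1398.014 | 41.725 |
| $\Delta H_f^0(\mathrm{l})$ | this work | 14 kinds of organic compounds | MLR | 873 | 0.987 | 0.986 | 3030.539 | 31.142 |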

3.7 Explanation of norm descriptors

The norm indexes have been successfully applied to describe the relationship between molecular structure and biological activity/toxicity or physicochemical properties in various research fields, including the enthalpy of vaporization of organic compounds at the boiling point [54], the heat capacity of organic compounds at different temperatures [55], the flash point of multi-component mixtures [56], multiple toxicity endpoint-structure relationships for substituted phenols and anilines [57], the toxicity and heat capacity of ionic liquids [58, 59], and so on.

Here, based on the concept of the norm index proposed by our group, 18 atomic distribution matrices were presented for predicting the $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ of organic compounds. The atomic distribution matrices contain both the atomic spatial distribution and the atomic properties, and so combine the position information with the contribution of the atoms in each molecule. The good prediction results obtained in the present work suggest that these atomic distribution matrices are valuable and can be successfully applied to calculating different property endpoints of various substances.

4 Conclusions

In this work, a set of atomic distribution matrices and norm descriptors was proposed. A unified QSPR model was built to predict four property endpoints for 14 families of organic compounds, such as chain and cyclic hydrocarbons, alcohols, ketones, carboxylic acids, amines, halogenated hydrocarbons, sulfur compounds, etc. The four thermodynamic endpoints $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ were treated in the same modelling work. The high $R^2$ and $F$ values show that this model provides satisfactory calculation accuracy and fitting. Moreover, the results of external validation and internal validation indicate that the model has good predictive ability and robustness. The verification of the applicability domain shows that this QSPR model can be applied in a large response and structural domain. Hence, it can be concluded that the norm index, with its generality, was successful in describing the $\Delta H_v$ and $\Delta H_f$ of organics, and that the unified QSPR model provides satisfactory results for predicting the four property endpoints of organic compounds.

Supplementary Material

The experimental and calculated values of $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ are listed in Tables S1-S4. The properties of the atoms involved in this study are shown in Table S5. An example of calculating $\Delta H_v^0$ with the established model is also given.

Conflict of interest statement

The authors confirm that this article has no conflicts of interest.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 21808167, 21676203 and 21306137).

References:

325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366

[1] W. Fang, Q. Lei, R. Lin, Enthalpies of vaporization of petroleum fractions from vapor pressure measurements and their correlation along with pure hydrocarbons, Fluid Phase Equilibria, 205 (2003) 149-161.
[2] T.N.G. Borhani, M. Bagheri, Z.A. Manan, Molecular modeling of the ideal gas enthalpy of formation of hydrocarbons, Fluid Phase Equilibria, 360 (2013) 423-434.
[3] L. Ogorodova, M. Vigasina, L. Mel'chakova, V. Rusakov, D. Kosova, D. Ksenofontov, I. Bryzgalov, Enthalpy of formation of natural hydrous iron phosphate: Vivianite, Journal of Chemical Thermodynamics, 110 (2017) 193-200.
[4] G. Xin, P.S. Maram, A. Navrotsky, A correlation between formation enthalpy and ionic conductivity in perovskite-structured Li3xLa0.67-xTiO3 solid lithium ion conductors, Journal of Materials Chemistry A, 5 (2017).
[5] A.H. Mohammadi, D. Richon, New Predictive Methods for Estimating the Vaporization Enthalpies of Hydrocarbons and Petroleum Fractions, Industrial & Engineering Chemistry Research, 46 (2007) 2665-2671.
[6] S. Genheden, Predicting Partition Coefficients with a Simple All-Atom/Coarse-Grained Hybrid Model, Journal of Chemical Theory & Computation, 12 (2016) 297-304.
[7] V.N. Emel'yanenko, S.P. Verevkin, A. Heintz, The gaseous enthalpy of formation of the ionic liquid 1-butyl-3-methylimidazolium dicyanamide from combustion calorimetry, vapor pressure measurements, and ab initio calculations, Journal of the American Chemical Society, 129 (2007) 3930-3937.
[8] L. Yang, C. Adam, S.L. Cockroft, Quantifying solvophobic effects in nonpolar cohesive interactions, Journal of the American Chemical Society, 137 (2015) 10084-10087.
[9] L.M.N.B.F. Santos, J.N.C. Lopes, J.A.P. Coutinho, J.M.S.S. Esperança, L.R. Gomes, I.M. Marrucho, L.P.N. Rebelo, Ionic liquids: first direct determination of their cohesive energy, Journal of the American Chemical Society, 129 (2007) 284-285.
[10] Z. Kolská, M. Zábranský, A. Randová, Group contribution methods for estimation of selected physico-chemical properties of organic compounds, Thermodynamics–Fundamentals and Its Application in Science, (2012) 1-28.
[11] A.A. Strechan, G.J. Kabo, Y.U. Paulechka, The correlations of the enthalpy of vaporization and the surface tension of molecular liquids, Fluid Phase Equilibria, 250 (2006) 125-130.
[12] A.M. Benkouider, R. Kessas, S. Guella, A. Yahiaoui, F. Bagui, Estimation of the enthalpy of vaporization of organic components as a function of temperature using a new group contribution method, Journal of Molecular Liquids, 194 (2014) 48-56.
[13] M. Firpo, L. Gavernet, E. Castro, A. Toropov, Maximum topological distances based indices as molecular descriptors for QSPR. Part 1. Application to alkyl benzenes boiling points, Journal of Molecular Structure: THEOCHEM, 501 (2000) 419-425.
[14] L. Riedel, Eine neue universelle Dampfdruckformel Untersuchungen über eine Erweiterung des Theorems der übereinstimmenden Zustände. Teil I, Chemie Ingenieur Technik, 26 (1954) 83-89.
[15] N.H. Chen, Generalized Correlation for Latent Heat of Vaporization, Journal of Chemical & Engineering Data, 10 (1965) 207-210.
[16] A. Alibakhshi, Enthalpy of vaporization, its temperature dependence and correlation with surface tension: A theoretical approach, Fluid Phase Equilibria, 432 (2017) 62-69.


[17] C. Belghit, Y. Lahiouel, T.A. Albahri, New empirical correlation for estimation of vaporization enthalpy of Algerian Saharan blend petroleum fractions, Petroleum Science and Technology, 36 (2018) 1181-1186.
[18] S.W. Benson, J.H. Buss, Additivity rules for the estimation of molecular properties. Thermodynamic properties, The Journal of Chemical Physics, 29 (1958) 546-572.
[19] S.W. Benson, F. Cruickshank, D. Golden, G.R. Haugen, H.E. O'Neal, A. Rodgers, R. Shaw, R. Walsh, Additivity rules for the estimation of thermochemical properties, Chemical Reviews, 69 (1969) 279-324.
[20] S.W. Benson, Thermochemical kinetics: methods for the estimation of thermochemical data and rate parameters, Wiley, 1968.
[21] K.G. Joback, R.C. Reid, Estimation of pure-component properties from group-contributions, Chemical Engineering Communications, 57 (1987) 233-243.
[22] L. Constantinou, R. Gani, New group contribution method for estimating properties of pure compounds, AIChE Journal, 40 (1994) 1697-1710.
[23] Q. Wang, Q. Jia, P. Ma, Position Group Contribution Method for the Prediction of the Critical Compressibility Factor of Organic Compounds, Journal of Chemical & Engineering Data, 54 (2009) 1916-1922.
[24] Q. Jia, Q. Wang, P. Ma, Prediction of the Enthalpy of Vaporization of Organic Compounds at Their Normal Boiling Point with the Positional Distributive Contribution Method, Journal of Chemical & Engineering Data, 55 (2010) 5614-5620.
[25] Q. Wang, Q. Jia, P. Ma, Prediction of the Acentric Factor of Organic Compounds with the Positional Distributive Contribution Method, Journal of Chemical & Engineering Data, 57 (2012) 169-189.
[26] Q. Wang, P. Ma, C. Wang, S. Xia, Position Group Contribution Method for Predicting the Normal Boiling Point of Organic Compounds, Chinese Journal of Chemical Engineering, 17 (2009) 468-472.
[27] J. Padmanabhan, R. Parthasarathi, V. Subramanian, P.K. Chattaraj, Using QSPR Models to Predict the Enthalpy of Vaporization of 209 Polychlorinated Biphenyl Congeners, QSAR & Combinatorial Science, 26 (2007) 227-237.
[28] A. Sosnowska, M. Barycki, K. Jagiello, M. Haranczyk, A. Gajewicz, T. Kawai, N. Suzuki, T. Puzyn, Predicting enthalpy of vaporization for Persistent Organic Pollutants with Quantitative Structure–Property Relationship (QSPR) incorporating the influence of temperature on volatility, Atmospheric Environment, 87 (2014) 10-18.
[29] K. Roy, Advances in QSAR Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences, 2017.
[30] F. Gharagheizi, M.R.S. Gohar, M.G. Vayeghan, A quantitative structure–property relationship for determination of enthalpy of fusion of pure compounds, Journal of Thermal Analysis & Calorimetry, 109 (2012) 501-506.
[31] M. Goodarzi, T. Chen, M.P. Freitas, QSPR predictions of heat of fusion of organic compounds using Bayesian regularized artificial neural networks, Chemometrics and Intelligent Laboratory Systems, 104 (2010) 260-264.
[32] M.G. Freire, C.M.S.S. Neves, S.P.M. Ventura, M.J. Pratas, I.M. Marrucho, J. Oliveira, J.A.P. Coutinho, A.M. Fernandes, Solubility of non-aromatic ionic liquids in water and correlation using a QSPR approach, Fluid Phase Equilibria, 294 (2010) 234-240.


[33] Y. Ye, Y. Sun, D. Wang, R. Liu, S. Gu, G. Liang, X. Jie, Quantitative structure-property relationship study of liquid vapor pressures for polychlorinated diphenyl ethers, Fluid Phase Equilibria, 391 (2015) 31-38.
[34] D. Abooali, M.A. Sobati, Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach, International Journal of Refrigeration, 40 (2014) 282-293.
[35] E.L. Krasnykh, Y.A. Druzhinina, S.V. Portnova, Y.A. Smirnova, Vapor pressure and enthalpy of vaporization of trimethylolpropane and carboxylic acids esters, Fluid Phase Equilibria, 462 (2018) 111-117.
[36] L. Hu, X. Wang, L. Wong, G. Chen, Combined first-principles calculation and neural-network correction approach for heat of formation, The Journal of Chemical Physics, 119 (2003) 11501-11507.
[37] A. Mercader, E.A. Castro, A.A. Toropov, QSPR modeling of the enthalpy of formation from elements by means of correlation weighting of local invariants of atomic orbital molecular graphs, Chemical Physics Letters, 330 (2000) 612-623.
[38] A. Vatani, M. Mehrpooya, F. Gharagheizi, Prediction of Standard Enthalpy of Formation by a QSPR Model, International Journal of Molecular Sciences, 8 (2007) 407-432.
[39] C. Press, CRC Handbook of Chemistry and Physics, Apple Academic Press Inc.; 96th New edition, (2015).
[40] R.A. Friesner, Ab initio quantum chemistry: Methodology and applications, Proc. Natl. Acad. Sci. U. S. A., 102 (2005) 6648-6653.
[41] A. Tropsha, P. Gramatica, V.K. Gombar, The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models, QSAR & Combinatorial Science, 22 (2003) 69-77.
[42] K. Roy, S. Kar, R.N. Das, Understanding the basics of QSAR for applications in pharmaceutical sciences and risk assessment, Academic Press, 2015.
[43] K. Roy, I. Mitra, S. Kar, P.K. Ojha, R.N. Das, H. Kabir, Comparative studies on some metrics for external validation of QSPR models, Journal of Chemical Information & Modeling, 52 (2012) 396-408.
[44] C. Rücker, G. Rücker, M. Meringer, y-Randomization and its variants in QSPR/QSAR, Journal of Chemical Information & Modeling, 47 (2007) 2345-2357.
[45] K. Roy, S. Kar, P. Ambure, On a simple approach for determining applicability domain of QSAR models, Chemometrics & Intelligent Laboratory Systems, 145 (2015) 22-29.
[46] K. Roy, P. Ambure, R.B. Aher, How important is to detect systematic error in predictions and understand statistical applicability domain of QSAR models?, Chemometrics & Intelligent Laboratory Systems, 162 (2017) 44-54.
[47] J. Jaworska, N. Nikolovajeliazkova, T. Aldenberg, QSAR applicability domain estimation by projection of the training set in descriptor space: A review, Alternatives to Laboratory Animals Atla, 33 (2005) 445-459.
[48] P. Gramatica, Principles of QSAR models validation: internal and external, QSAR & Combinatorial Science, 26 (2007) 694-701.
[49] S. Puri, J.S. Chickos, W.J. Welsh, Three-dimensional quantitative structure-property relationship (3D-QSPR) models for prediction of thermodynamic properties of polychlorinated biphenyls (PCBs): enthalpies of fusion and their application to estimates of enthalpies of sublimation and aqueous, J Chem Inf Comput Sci, 34 (2003) 55-62.
[50] F. Gharagheizi, Prediction of the Standard Enthalpy of Formation of Pure Compounds Using Molecular Structure, Australian Journal of Chemistry, 62 (2009) 376-381.
[51] B.F. Begam, J.S. Kumar, G.S. Chae, Optimized peer to peer QSPR prediction of enthalpy of formation using outlier detection and subset selection, Peer-to-Peer Networking and Applications, (2018) 1-10.
[52] N. Adams, J. Clauss, M. Meunier, U.S. Schubert, Predicting thermochemical parameters of oxygen-containing heterocycles using simple QSPR models, Molecular Simulation, 32 (2006) 125-134.
[53] T.A. Albahri, A.F. Aljasmi, SGC method for predicting the standard enthalpy of formation of pure compounds from their molecular structures, Thermochimica Acta, 568 (2013) 46-60.
[54] Q. Jia, X. Yan, T. Lan, F. Yan, Q. Wang, Norm indexes for predicting enthalpy of vaporization of organic compounds at the boiling point, Journal of Molecular Liquids, 282 (2019) 484-488.
[55] J. Yin, Q. Jia, F. Yan, Q. Wang, Predicting heat capacity of gas for diverse organic compounds at different temperatures, Fluid Phase Equilibria, 446 (2017) 1-8.
[56] Y. Wang, F. Yan, Q. Jia, Q. Wang, Distributive structure-properties relationship for flash point of multiple components mixture, Fluid Phase Equilibria, 474 (2018) 1-5.
[57] F. Yan, T. Liu, Q. Jia, Q. Wang, Multiple toxicity endpoint–structure relationships for substituted phenols and anilines, Science of The Total Environment, 663 (2019) 560-567.
[58] F. Yan, T. Lan, X. Yan, Q. Jia, Q. Wang, Norm index-based QSTR model to predict the eco-toxicity of ionic liquids towards Leukemia rat cell line, Chemosphere, 234 (2019) 116-122.
[59] W. He, F. Yan, Q. Jia, S. Xia, Q. Wang, Prediction of ionic liquids heat capacity at variable temperatures based on the norm indexes, Fluid Phase Equilibria, 500 (2019) 112260.


Highlights:
- A unified QSPR model was proposed for predicting four thermodynamic property endpoints.
- The model for the four property endpoints was established using a unified set of norm indexes.
- Norm indexes could be used to describe the $\Delta H_v^0$, $\Delta H_f^0(g)$, $\Delta H_f^0(s)$ and $\Delta H_f^0(l)$ of organic compounds.

Author Contribution Section

Each author's contribution(s) to the manuscript, using the numbered list below:
1. Research concept and design
2. Collection of data
3. Data analysis and interpretation
4. Writing the article
5. Critical revision of the article
6. Final approval of article

Name of Author: Contribution(s) (using the numbered list above)
Xue Yan: 1, 2, 3, 4, 5
Tian Lan: 1, 3, 5
Qingzhu Jia: 1, 5, 6
Fangyou Yan: 5, 6
Qiang Wang: 5, 6

Declaration of interests

■ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☒ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: