Journal Pre-proof

A norm indexes-based QSPR model for predicting the standard vaporization enthalpy and formation enthalpy of organic compounds

Xue Yan, Tian Lan, Qingzhu Jia, Fangyou Yan, Qiang Wang

PII: S0378-3812(19)30499-6
DOI: https://doi.org/10.1016/j.fluid.2019.112437
Reference: FLUID 112437
To appear in: Fluid Phase Equilibria
Received Date: 19 October 2019
Revised Date: 28 November 2019
Accepted Date: 15 December 2019
Please cite this article as: X. Yan, T. Lan, Q. Jia, F. Yan, Q. Wang, A norm indexes-based QSPR model for predicting the standard vaporization enthalpy and formation enthalpy of organic compounds, Fluid Phase Equilibria (2020), doi: https://doi.org/10.1016/j.fluid.2019.112437. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.
A norm indexes-based QSPR model for predicting the standard vaporization enthalpy and formation enthalpy of organic compounds

Xue Yan1, Tian Lan2, Qingzhu Jia1,*, Fangyou Yan2, Qiang Wang2

1 School of Marine and Environmental Science, Tianjin Marine Environmental Protection and Restoration Technology Engineering Center, Tianjin University of Science and Technology, 13St. 29, TEDA, 300457 Tianjin, PR China
2 School of Chemical Engineering and Material Science, Tianjin University of Science and Technology, 13St. 29, TEDA, 300457 Tianjin, PR China

* Corresponding author: Tel: 86-22-60600046; Fax: 86-22-60600046; E-mail: [email protected] (Q. Jia)
Abstract:
As important thermodynamic properties, vaporization enthalpy and formation enthalpy are extensively used in chemical process and chemical engineering design, environmental science and agriculture. Based on the concept of the norm index previously proposed by our group, a unified QSPR model was built to predict four property endpoints for 14 families of organic compounds. The four thermodynamic endpoints, namely the standard vaporization enthalpy ($\Delta H_v^0$) and the standard formation enthalpies in the gas state ($\Delta H_f^0(\mathrm{g})$), solid state ($\Delta H_f^0(\mathrm{s})$) and liquid state ($\Delta H_f^0(\mathrm{l})$), were treated in the same modelling work. The model fits all four endpoints satisfactorily, with $R^2$ of 0.967 for $\Delta H_v^0$, 0.990 for $\Delta H_f^0(\mathrm{g})$, 0.989 for $\Delta H_f^0(\mathrm{s})$ and 0.987 for $\Delta H_f^0(\mathrm{l})$. Moreover, the results of internal validation, external validation and applicability domain analysis indicated good stability and robustness of the model. This work not only calculates vaporization enthalpy and formation enthalpy with the same formula, but also covers the gas, solid and liquid phases for formation enthalpy. The satisfactory results obtained suggest that the model and the norm indexes have good reliability and generalization ability.

Keywords: Norm indexes; Standard vaporization enthalpy; Standard formation enthalpy; QSPR; Atomic distribution matrix
1 Introduction
The physical and thermodynamic properties of organic compounds play an important role in chemical process and chemical engineering design [1-4], as well as in environmental and agricultural applications [5-9]. The vaporization enthalpy ($\Delta H_v$) and the formation enthalpy ($\Delta H_f$) are basic physical properties of organics. $\Delta H_v$ is an essential tool for correlating and predicting many physical phenomena, such as vapor pressure [10], surface tension [11], and the Hildebrand and Hansen solubility parameters of hydrocarbons [12]. $\Delta H_f$ can be used to calculate the equilibrium constants of reactions, and is of great significance in the study of resonance energies, bond energies, the nature of the chemical bond and other related features [13]. It is therefore necessary to know fundamental physicochemical properties of these compounds such as $\Delta H_v$ and $\Delta H_f$. The number of newly synthesized and discovered organic compounds is increasing exponentially, but the experimental data of most of them are not measurable, and the data obtained through experimental studies are both expensive and time-consuming. Therefore, it is urgent to develop a stable and reliable calculation method to solve these problems.
A great deal of effort has been devoted to methods for estimating enthalpy over the past decades. The empirical correlation method, the group contribution method (GCM) and the quantitative structure-property relationship (QSPR) method are usually used to calculate $\Delta H_v$ and $\Delta H_f$. Estimation of $\Delta H_v$ by the empirical correlation method depends on other physicochemical properties, including the critical properties (Tc, Pc, Vc) and the normal boiling point (Tb), as shown in the investigations of Riedel [14], Chen [15], Alibakhshi [16] and Belghit et al. [17]. In terms of GCM, Benson et al. [18-20] proposed a general method to estimate the thermochemical properties of chemical species on the basis of group additive contributions. Joback and Reid [21] developed the first-order contribution method, which used 41 first-order groups to calculate the $\Delta H_v$ and $\Delta H_f$ of organics. Constantinou and Gani [22] proposed the second-order contribution method, which calculated physicochemical properties including the critical properties, $\Delta H_v$ and $\Delta H_f$. Interestingly, they all tried to distinguish the isomers of some organic compounds. Jia and Wang [23-26] proposed the positional distributive group contribution method, which could effectively distinguish organic isomers and was successfully used to predict the critical properties and enthalpy of organic compounds. On the whole, the GCM is simple and general, yet it relies on the contribution values of the groups and cannot be applied to new structural classes.
Another way to predict physicochemical properties is to establish a QSPR model from the structure of molecules. QSPR [27-33] is an effective method to link the thermodynamic, physical and chemical properties of organics with their compositions and structures. Recently, several QSPR models for estimating the $\Delta H_v$ of organic compounds have been proposed. In the investigation of Abooali and Sobati [34], a model was built with the MLR (Multiple Linear Regression) method for predicting the $\Delta H_v$ at Tb of 180 pure refrigerants, with $R^2$ of 0.96 and AARD of 6.83 %. Krasnykh et al. [35] determined the $\Delta H_v^0$ (standard enthalpy of vaporization) of trimethylolpropane and carboxylic acid esters by the transpiration method, and good results were obtained with a relative error of less than 2 %. In addition, using the MLR method, Sosnowska et al. [28] proposed a QSPR model for estimating the $\Delta H_v^0$ of persistent organic pollutants with good fitting and external predictive ability ($R^2$ = 0.888, $Q^2_{CV}$ = 0.878). In terms of $\Delta H_f$, Hu et al. [36] presented a QSPR model based on a neural-network method to predict the $\Delta H_f^0$ of 180 organic molecules with good effect. Mercader et al. [37] established a QSPR model to predict the $\Delta H_f$ of 51 hydrocarbons by using correlation weighting of local invariants in atomic orbital molecular graphs (AOMGs); their model provided satisfactory results with a low average deviation. Furthermore, Vatani et al. [38] predicted the $\Delta H_f^0$ of 1115 compounds with a multivariate linear genetic algorithm, and in their modelling work five structural descriptors were selected from a library of 1664 calculated descriptors.
In this study, a structure-property modelling study covering four property endpoints was performed with uniform norm descriptors to predict the enthalpy of organic chemicals. The standard vaporization enthalpy ($\Delta H_v^0$) and the standard formation enthalpies in the gas state ($\Delta H_f^0(\mathrm{g})$), solid state ($\Delta H_f^0(\mathrm{s})$) and liquid state ($\Delta H_f^0(\mathrm{l})$) were used as the property endpoints in the same modelling work. The prediction ability of the model was also tested with several validation methods.
2 Method
2.1 Dataset
The experimental data, comprising 573 $\Delta H_v^0$, 964 $\Delta H_f^0(\mathrm{g})$, 367 $\Delta H_f^0(\mathrm{s})$ and 873 $\Delta H_f^0(\mathrm{l})$ values, were taken from the CRC Handbook of Chemistry and Physics [39]. The dataset covers 14 families of organic compounds, including chain and cyclic hydrocarbons, alcohols, ketones, carboxylic acids, amines, halogenated hydrocarbons, sulfur compounds and so on. The organics involved for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$, together with the corresponding experimental values, are shown in Tables S1~S4 in the Supplementary Material.
2.2 Atomic distribution matrix
In this work, the most stable state and accurate spatial structure of the atoms in each molecule were obtained with the quantum chemistry ab initio method [40], based on Restricted Hartree-Fock (RHF) at the STO-3G level in the HyperChem 7.0 software (http://www.hyper.com/).

To reveal the composition of atoms in the molecular structure, a series of property matrices (PE) composed of atomic properties were proposed and defined as follows.

$$PE_1 = \left[ \frac{aw_i}{\sum aw_i} \right] \qquad (1)$$
$$PE_2 = \left[ r_i \times en_i \right] \qquad (2)$$
$$PE_3 = \left[ \frac{ac_i}{z_i} \right] \qquad (3)$$
$$PE_4 = \left[ (oe_i \times en_i)^{1/2} \right] \qquad (4)$$
$$PE_5 = \left[ \exp(ie_i) \right] \qquad (5)$$
$$PE_6 = \left[ q_{ij} \right], \quad q_{ij} = ac_i - ac_j \qquad (6)$$

where $r_i$, $z_i$, $ie_i$, $en_i$, $ac_i$, $oe_i$ and $aw_i$ represent the Van der Waals radius (Å), number of protons, ionization energy (10 eV), electronegativity, atomic charge (C, coulomb), number of outermost electrons and atomic weight (g/mol) of atom i in a molecule, respectively. The property matrices (PE) were treated as unitless.

Of the seven atomic parameters used in the property matrices (PE), six are constants for a given element; the exception is the atomic charge, which was calculated by the Mulliken method. The properties of the atoms involved are shown in Table S5 in the Supplementary Material.
To further describe the spatial position of each atom in the molecule, four position matrices were used: the adjacency matrix P1, the interval matrix P2, the Euclidean distance matrix P3 and the adjacent Euclidean distance matrix P4. The four matrices were defined as follows:

$$P_1 = \left[ p_{ij} \right], \quad p_{ij} = \begin{cases} 1 & s_{ij} = 1 \\ 0 & s_{ij} \neq 1 \end{cases} \qquad (7)$$
$$P_2 = \left[ p_{ij} \right], \quad p_{ij} = \begin{cases} 2 & s_{ij} = 2 \\ 0 & s_{ij} \neq 2 \end{cases} \qquad (8)$$
$$P_3 = \left[ d_{ij} \right] \qquad (9)$$
$$P_4 = \left[ p_{ij} \right], \quad p_{ij} = \begin{cases} d_{ij} & s_{ij} = 1 \\ 0 & s_{ij} \neq 1 \end{cases} \qquad (10)$$

where $d_{ij}$ is the Euclidean spatial distance between atoms i and j, and $s_{ij}$ is the path length between atoms i and j. The unit of $d_{ij}$ is Angstrom (Å).

Then, combining the above matrices, 18 atomic distribution matrices (M) were further proposed and listed in Table 1.
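The position matrices of Eqs. (7)-(10) can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the text: $s_{ij}$ is read as the number of bonds on the shortest path between atoms i and j, the non-zero entry of P2 is taken as 2 as printed in Eq. (8), and the function name `position_matrices`, the coordinates and the bond list are hypothetical illustrations rather than the authors' implementation.

```python
import math
from collections import deque

def position_matrices(coords, bonds):
    """Return (P1, P2, P3, P4) as nested lists for one molecule.

    coords: list of (x, y, z) atomic positions in Angstrom.
    bonds:  list of (i, j) index pairs for bonded atoms.
    """
    n = len(coords)
    neighbours = [[] for _ in range(n)]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # s[i][j]: number of bonds on the shortest path between i and j (assumed meaning of s_ij).
    s = [[math.inf] * n for _ in range(n)]
    for start in range(n):
        s[start][start] = 0
        queue = deque([start])
        while queue:  # breadth-first search over the bond graph
            u = queue.popleft()
            for v in neighbours[u]:
                if s[start][v] == math.inf:
                    s[start][v] = s[start][u] + 1
                    queue.append(v)

    d = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    p1 = [[1 if s[i][j] == 1 else 0 for j in range(n)] for i in range(n)]          # Eq. (7)
    p2 = [[2 if s[i][j] == 2 else 0 for j in range(n)] for i in range(n)]          # Eq. (8)
    p3 = d                                                                         # Eq. (9)
    p4 = [[d[i][j] if s[i][j] == 1 else 0.0 for j in range(n)] for i in range(n)]  # Eq. (10)
    return p1, p2, p3, p4

# Hypothetical bent three-atom chain A-B-C; the coordinates are illustrative only.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
bonds = [(0, 1), (1, 2)]
P1, P2, P3, P4 = position_matrices(coords, bonds)
```

In this toy chain, atoms 0 and 2 are two bonds apart, so they contribute to P2 and P3 but not to P1 or P4.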
Table 1 The 18 atomic distribution matrices.

| Mi | Definition |
|---|---|
| M1 | PE2^T .× P1 |
| M2 | PE4 × PE4^T × P1 |
| M3 | PE4 × PE4^T × P2 |
| M4 | PE5 × PE5^T .× P4 |
| M5 | PE4^T .× P1 |
| M6 | PE2 × PE2^T .× P1 |
| M7 | PE5 × PE5^T × P4 |
| M8 | PE5 × PE5^T × P2 |
| M9 | PE1^T .× P1 |
| M10 | PE4 × PE4^T × (P3 × PE6) |
| M11 | PE2 × PE2^T × (P3 × PE6) |
| M12 | PE4 × PE4^T .× (P3 × PE6) |
| M13 | PE2 × PE2^T .× (P3 × PE6) |
| M14 | PE2 × PE2^T × P1 |
| M15 | PE3 × PE3^T .× P3 |
| M16 | PE2^T .× P4 |
| M17 | PE5^T .× P2 |
| M18 | PE4 × PE4^T .× P4 |

Matrix Q^T is the transpose of matrix Q. The operational character [.×] means:
(Ⅰ) for Q = [q_j] and W = [w_ij], Q .× W = [q_j × w_ij];
(Ⅱ) for Q = [q_ij] and W = [w_ij], Q .× W = [q_ij × w_ij].

2.3 Norm expressions used
In this work, the norm expressions of Eqs. (11)-(16) were used. The norm (I) applied to each matrix is shown in Table 2. Based on the above atomic distribution matrices, 22 norm descriptors were calculated by using the following norm formulas.

$$\|M\|_1 = \sum_i \lambda_i(M) \qquad (11)$$
$$\|M\|_2 = \sum_j \sum_i m_{ij} \qquad (12)$$
$$\|M\|_3 = \frac{1}{n} \left( \sum_j \sum_i m_{ij} \right) \qquad (13)$$
$$\|M\|_4 = \frac{1}{n^2} \left( \sum_j \sum_i m_{ij} \right) \qquad (14)$$
$$\|M\|_5 = \max \lambda_i \left( M \times M^H \right) \qquad (15)$$
$$\|M\|_6 = \sum_j \sum_i m_{ij}^2 \qquad (16)$$

where $\lambda_i$ is an eigenvalue of the matrix and $M^H$ is the Hermitian (conjugate) transpose of M.
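As an illustration of how one atomic distribution matrix and its norm indexes might be evaluated, the following NumPy sketch implements the [.×] operator and Eqs. (11)-(16) as they are printed here. Several points are assumptions: n is taken to be the order of the matrix, PE × PE^T is read as an outer product of a per-atom property vector, and the property values in `pe` are invented for the example. The function names `dot_times` and `norm_indexes` are hypothetical.

```python
import numpy as np

def dot_times(q, w):
    """'.×' operator: element-wise product. A 1-D q = [q_j] scales column j of w."""
    return np.asarray(q, dtype=float) * np.asarray(w, dtype=float)

def norm_indexes(m):
    """Evaluate Eqs. (11)-(16), as printed, for a square matrix m."""
    m = np.asarray(m, dtype=float)
    n = m.shape[0]                      # assumption: n is the order of the matrix
    total = m.sum()
    return {
        1: float(np.linalg.eigvals(m).sum().real),            # Eq. (11): sum of eigenvalues
        2: float(total),                                      # Eq. (12)
        3: float(total / n),                                  # Eq. (13)
        4: float(total / n**2),                               # Eq. (14)
        5: float(np.linalg.eigvalsh(m @ m.conj().T).max()),   # Eq. (15): largest eigenvalue of M x M^H
        6: float((m**2).sum()),                               # Eq. (16)
    }

pe = np.array([1.0, 2.0, 3.0])                    # hypothetical property values, one per atom
p1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # adjacency matrix of a three-atom chain
m = dot_times(np.outer(pe, pe), p1)               # same form as M6 = PE2 × PE2^T .× P1
indexes = norm_indexes(m)
```

NumPy broadcasting covers both cases of the operator: a 1-D `q` multiplies column j of `w` by `q[j]` (case Ⅰ), while a 2-D `q` is multiplied entry by entry (case Ⅱ).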
2.4 Model Validation
The quality and predictability of a QSPR model should be strictly and comprehensively verified [41]. Leave-one-out, 5-fold and 10-fold cross-validations [42] were used to evaluate the robustness of this QSPR model. The prediction ability of the model was verified with the external validation method [43]. A Y-randomization test was used to check for chance correlation in the modelling work [44]. The reliability of the model was assessed through its applicability domain (AD) [45].
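The cross-validation step can be sketched as follows for a multiple linear regression model. This is only an illustration under assumptions: the definition $Q^2 = 1 - \mathrm{PRESS}/SS_{total}$ is a commonly used one and is not spelled out in the text, the helper names are hypothetical, and the data are synthetic stand-ins for norm indexes and enthalpies.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + np.asarray(X) @ coef[1:]

def q2_cv(X, y, n_folds, seed=0):
    """Cross-validated Q^2 = 1 - PRESS / SS_total; n_folds == len(y) gives leave-one-out."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for held_out in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held_out)          # all samples not in the held-out fold
        coef = fit_mlr(X[train], y[train])
        press += np.sum((y[held_out] - predict_mlr(coef, X[held_out])) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic data (not from the paper): 60 samples, 3 descriptors, small noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)
q2_loo = q2_cv(X, y, n_folds=len(y))
q2_5fold = q2_cv(X, y, n_folds=5)
q2_10fold = q2_cv(X, y, n_folds=10)
```

A $Q^2$ close to the fitted $R^2$ suggests that no single sample or fold dominates the regression.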
3 Results and discussion
3.1 Model proposed
A unified QSPR model, Eq. (17), was built to predict the four property endpoints using the same norm descriptors. The parameters $b_k$ of this model for the four properties are shown in Table 2.

$$\Delta H = b_0 + \sum_{k=1}^{22} b_k I_k \qquad (17)$$

where $\Delta H$ stands for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ or $\Delta H_f^0(\mathrm{l})$, and $I_k$ represents the norm indexes.
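Table 2 The parameters $b_k$ of this model for predicting the four property endpoints.

| k | $I_k$ | $\Delta H_v^0$ | $\Delta H_f^0(\mathrm{g})$ | $\Delta H_f^0(\mathrm{s})$ | $\Delta H_f^0(\mathrm{l})$ |
|---|---|---|---|---|---|
| 0 | - | 31.165 | 96.502 | 267.380 | 94.648 |
| 1 | $\|M_1\|_1$ | 8.252 | 57.788 | 31.200 | 61.273 |
| 2 | $\|M_2\|_1$ | -3.190 | 52.149 | 53.784 | 52.336 |
| 3 | $\|M_3\|_1$ | -0.066 | 0.902 | 0.608 | 0.813 |
| 4 | $\|M_4\|_1$ | -1.956 | -15.327 | -10.160 | -16.876 |
| 5 | $\|M_5\|_2$ | 30.295 | -90.640 | -61.929 | -91.048 |
| 6 | $\|M_1\|_2$ | -25.545 | -6.830 | -32.970 | -8.069 |
| 7 | $\|M_6\|_2$ | 1.454 | -18.145 | -14.733 | -18.112 |
| 8 | $\|M_7\|_3$ | 0.852 | 5.506 | 2.808 | 6.007 |
| 9 | $\|M_8\|_3$ | 0.050 | -1.456 | -1.595 | -1.517 |
| 10 | $\|M_9\|_4$ | -165.655 | -1125.487 | -1468.420 | -1217.389 |
| 11 | $\|M_{10}\|_4$ | 2.852×10^-3 | 0.252 | 0.064 | 0.309 |
| 12 | $\|M_{11}\|_4$ | -2.449×10^-3 | -0.108 | -0.041 | -0.127 |
| 13 | $\|M_{12}\|_4$ | -0.229 | -25.654 | -16.304 | -25.268 |
| 14 | $\|M_{13}\|_4$ | 0.126 | 11.275 | 7.777 | 10.846 |
| 15 | $\|M_7\|_5$ | -0.131 | 6.986 | 8.581 | 7.210 |
| 16 | $\|M_2\|_5$ | -0.965 | -29.308 | -34.834 | -29.671 |
| 17 | $\|M_{14}\|_5$ | 0.878 | 13.850 | 16.899 | 14.150 |
| 18 | $\|M_{15}\|_5$ | 33.946 | 311.366 | 101.511 | 246.837 |
| 19 | $\|M_{16}\|_6$ | -0.883 | 67.100 | 77.722 | 61.254 |
| 20 | $\|M_{17}\|_6$ | 0.186 | -34.884 | -32.648 | -35.958 |
| 21 | $\|M_4\|_6$ | -0.798 | 24.326 | 15.457 | 26.600 |
| 22 | $\|M_{18}\|_6$ | 0.734 | -31.801 | -35.528 | -31.593 |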
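Eq. (17) is a plain linear combination, so applying it is straightforward once the 22 norm indexes of a compound are known. The sketch below uses the $\Delta H_v^0$ coefficients copied from Table 2; the function name `predict_enthalpy` and the input vectors are hypothetical and serve only to show the arithmetic, not real compounds.

```python
# Coefficients b0..b22 for the standard vaporization enthalpy, copied from Table 2.
B_VAP = [31.165, 8.252, -3.190, -0.066, -1.956, 30.295, -25.545, 1.454,
         0.852, 0.050, -165.655, 2.852e-3, -2.449e-3, -0.229, 0.126,
         -0.131, -0.965, 0.878, 33.946, -0.883, 0.186, -0.798, 0.734]

def predict_enthalpy(norm_indexes, coefficients=B_VAP):
    """Evaluate Eq. (17): b0 plus the sum of b_k * I_k over k = 1..22."""
    if len(norm_indexes) != len(coefficients) - 1:
        raise ValueError("expected one norm index per coefficient b1..b22")
    b0, *bk = coefficients
    return b0 + sum(b * i for b, i in zip(bk, norm_indexes))

# Hypothetical inputs: all indexes zero returns the intercept b0;
# setting only I1 = 1 adds b1 to it.
only_intercept = predict_enthalpy([0.0] * 22)
intercept_plus_b1 = predict_enthalpy([1.0] + [0.0] * 21)
```

The same function would apply to the three formation enthalpies by passing the corresponding column of Table 2 as `coefficients`.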
The predicted values of the four properties and the corresponding absolute errors are summarized in Tables S1~S4. Statistical results for the model are shown in Table 3. The $R^2$ values for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ were 0.967, 0.990, 0.989 and 0.987, respectively, indicating that the model fits well. Fig. 1 is a scatter plot of the experimental values versus the calculated values of $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$. The calculated values of the four endpoints coincide well with the experimental values, which further affirms the accuracy of the model.

Table 4 lists the mean absolute error (MAE) for the 14 families of organic compounds. For $\Delta H_v^0$, the calculated values for chain and cyclic hydrocarbons, ketones and monoaromatic hydrocarbons were close to the experimental values, while the errors were larger for alcohols, amines and other nitrogen compounds. For $\Delta H_f^0(\mathrm{g})$, the MAE of each family is close to the overall MAE, with no significant differences. For $\Delta H_f^0(\mathrm{s})$, the high deviations for chain hydrocarbons, aldehydes and esters might be attributed to the small number of samples. For $\Delta H_f^0(\mathrm{l})$, the MAE of each class was satisfactory except for the acids. Although the QSPR model shows some error for certain kinds of compounds, it is stable and reliable as a whole, which supports the further evaluation and application of the model. To explain the calculation of the norm indexes in detail, the $\Delta H_v^0$ of 3-bromopropene (compound No. 30 in Table S1) is worked through as an example in the Supplementary Material.

Fig. 1 The calculated values vs. experimental values for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$.
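Table 3 Summary of statistical results of this model.

| Properties | Samples | $R^2$ | F | MAE | s | $Q^2_{LOO}$ | $Q^2_{5\text{-fold}}$ | $Q^2_{10\text{-fold}}$ | $R^2_{training}$ | $R^2_{testing}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $\Delta H_v^0$ | 573 | 0.967 | 737.462 | 2.499 | 3.449 | 0.963 | 0.963 | 0.963 | 0.965 | 0.972 |
| $\Delta H_f^0(\mathrm{g})$ | 964 | 0.990 | 4207.505 | 22.460 | 29.570 | 0.989 | 0.988 | 0.989 | 0.989 | 0.991 |
| $\Delta H_f^0(\mathrm{s})$ | 367 | 0.989 | 1398.014 | 32.402 | 41.725 | 0.986 | 0.986 | 0.986 | 0.989 | 0.986 |
| $\Delta H_f^0(\mathrm{l})$ | 873 | 0.987 | 3030.539 | 22.787 | 31.142 | 0.986 | 0.986 | 0.986 | 0.988 | 0.986 |

where s is the standard deviation.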
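As a rough consistency check on Table 3, the F value of a linear regression can be related to $R^2$, the number of samples and the number of descriptors. The sketch below assumes the usual overall F statistic of multiple linear regression with an intercept; the text does not state which definition was used, and the function name `f_statistic` is hypothetical.

```python
def f_statistic(r2, n_samples, n_descriptors):
    """Overall F statistic of a linear regression with an intercept."""
    explained = r2 / n_descriptors
    unexplained = (1.0 - r2) / (n_samples - n_descriptors - 1)
    return explained / unexplained

# Rounded values for the vaporization enthalpy from Table 3, with 22 descriptors.
f_vap = f_statistic(0.967, 573, 22)
```

With the rounded $R^2$ of 0.967 this gives about 733, close to the tabulated 737.462; the gap is consistent with $R^2$ having been rounded to three decimals.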
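Table 4 Mean Absolute Error (MAE, kJ/mol) of 14 families of organic compounds.

| Compound type | $\Delta H_v^0$ Numbers | $\Delta H_v^0$ MAE | $\Delta H_f^0(\mathrm{g})$ Numbers | $\Delta H_f^0(\mathrm{g})$ MAE | $\Delta H_f^0(\mathrm{s})$ Numbers | $\Delta H_f^0(\mathrm{s})$ MAE | $\Delta H_f^0(\mathrm{l})$ Numbers | $\Delta H_f^0(\mathrm{l})$ MAE |
|---|---|---|---|---|---|---|---|---|
| Chain hydrocarbons | 121 | 1.26 | 121 | 18.07 | 2 | 47.08 | 136 | 22.61 |
| Cyclic hydrocarbons | 30 | 1.58 | 65 | 20.71 | 11 | 35.80 | 66 | 17.86 |
| Alcohols | 57 | 3.52 | 51 | 18.24 | 26 | 39.82 | 63 | 12.24 |
| Ethers | 42 | 2.85 | 35 | 22.07 | 1 | 25.97 | 36 | 27.67 |
| Aldehydes | 3 | 2.82 | 18 | 23.92 | 2 | 58.48 | 14 | 30.87 |
| Ketones | 25 | 1.63 | 29 | 16.42 | 7 | 22.98 | 23 | 17.15 |
| Acids | 3 | 2.38 | 32 | 29.02 | 37 | 32.18 | 24 | 35.03 |
| Esters | 41 | 2.73 | 55 | 25.27 | 6 | 43.10 | 58 | 23.08 |
| Monoaromatic hydrocarbons | 36 | 3.32 | 65 | 18.48 | 39 | 22.35 | 53 | 16.33 |
| Amines | 31 | 5.02 | 41 | 16.60 | 13 | 24.09 | 43 | 20.24 |
| Other nitrogen compounds | 48 | 4.08 | 164 | 26.33 | 183 | 34.25 | 118 | 30.23 |
| Halogenated hydrocarbons | 79 | 1.80 | 140 | 26.86 | 13 | 35.80 | 106 | 24.57 |
| Sulfur compounds | 47 | 2.45 | 72 | 17.24 | 4 | 15.33 | 65 | 17.28 |
| Other compounds | 10 | 1.40 | 76 | 26.18 | 23 | 27.64 | 68 | 27.00 |
| Overall | 573 | 2.50 | 964 | 22.46 | 367 | 32.40 | 873 | 22.79 |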
3.2 Internal validation
The leave-one-out (LOO), 5-fold and 10-fold cross-validation methods were performed in this study, and the statistical results are shown in Table 3. The high values of $Q^2_{LOO}$ (0.963, 0.989, 0.986, 0.986), $Q^2_{5\text{-fold}}$ (0.963, 0.988, 0.986, 0.986) and $Q^2_{10\text{-fold}}$ (0.963, 0.989, 0.986, 0.986) for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$, respectively, demonstrated the robustness and reliability of this model. Meanwhile, the error distributions of this model and of the LOO, 5-fold and 10-fold cross-validation models are presented in Fig. 2. The error distributions obtained from internal validation were all very similar to those of the fitted model, which further suggested the stability and robustness of this norm index-based model.

Fig. 2 Distributions of errors for this model and internal validation for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$.
3.3 External validation
External validation plays an important role in proving the predictive ability of a QSPR model [43]. In this work, for each property endpoint, the original dataset was randomly divided into a training set and a testing set at a ratio of approximately 4:1. The validation results are shown in Table 3. The relationship between the experimental values and the calculated values of the four properties for the training set and the testing set is illustrated in Fig. 3. The calculated values of both sets were consistent with the experimental values, suggesting satisfactory external predictability of this model. The $R^2_{training}$ and $R^2_{testing}$ values were 0.965 and 0.972 for $\Delta H_v^0$, 0.989 and 0.991 for $\Delta H_f^0(\mathrm{g})$, 0.989 and 0.986 for $\Delta H_f^0(\mathrm{s})$, and 0.988 and 0.986 for $\Delta H_f^0(\mathrm{l})$, all similar to the overall $R^2$, which further demonstrated the good stability of this model. In general, these satisfactory predictions showed that the norm index proposed by our group could effectively describe the relationship between the structure of organic compounds and their enthalpies.

Fig. 3 The calculated values of external validation and experimental values for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$.
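The external validation procedure described above (a random split of roughly 4:1, a fit on the training set, and scoring of both sets) might look like the following sketch. The data are synthetic, the name `split_and_score` is hypothetical, and the $R^2$ definition used for the testing set is an assumption, since the text does not give one.

```python
import numpy as np

def split_and_score(X, y, test_fraction=0.2, seed=0):
    """Random train/test split, least-squares fit on the training set, R^2 on both sets."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    idx = np.random.default_rng(seed).permutation(len(y))
    n_test = int(round(test_fraction * len(y)))
    test, train = idx[:n_test], idx[n_test:]

    A_train = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

    def r2(rows):
        # R^2 relative to the mean of the evaluated subset (assumed definition).
        pred = coef[0] + X[rows] @ coef[1:]
        return 1.0 - np.sum((y[rows] - pred) ** 2) / np.sum((y[rows] - y[rows].mean()) ** 2)

    return r2(train), r2(test), len(train), len(test)

# Synthetic data (not from the paper): 100 samples, 3 descriptors.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)
r2_train, r2_test, n_train, n_test = split_and_score(X, y)
```

Similar training and testing $R^2$ values, as reported in Table 3, indicate that the fit is not specific to the compounds used for training.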
3.4 Y-randomization test
In the Y-randomization validation process, all experimental values were randomly shuffled to build a new QSPR model. To make the results reliable, the Y-randomization was repeated 10,000 times in this work. The average $R_i^2$ values over these repetitions were 0.038, 0.023, 0.060 and 0.025 for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$, respectively, far below the original $R^2$ values. Accordingly, chance correlation in the modelling process can be ruled out, which further verifies the good stability of the original model.
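A Y-randomization test of this kind can be sketched in a few lines. The example below uses synthetic data and far fewer repetitions than the 10,000 used in this work; the function names are hypothetical and the regression is ordinary least squares with an intercept, which is assumed to match the MLR model.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)

def y_randomization(X, y, n_rounds=1000, seed=0):
    """Average R^2 obtained after refitting on randomly shuffled responses."""
    rng = np.random.default_rng(seed)
    return float(np.mean([r_squared(X, rng.permutation(y)) for _ in range(n_rounds)]))

# Synthetic data (not from the paper): 100 samples, 3 descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=100)
r2_original = r_squared(X, y)
r2_shuffled = y_randomization(X, y, n_rounds=200)
```

As a rough cross-check, for a least-squares model with p descriptors and an intercept the expected $R^2$ on permuted responses is p/(n - 1). With p = 22 this gives about 0.038, 0.023, 0.060 and 0.025 for n = 573, 964, 367 and 873, in line with the averages reported above.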
3.5 Applicability Domain (AD)
The applicability domain, characterized by the molecular properties of the training set, is of great significance in QSPR research [46]. In this study, the leverage values and standardized residuals of the model are presented as a Williams plot in Fig. 4.

Fig. 4 indicates that most of the compounds (about 95.1 %~96.0 %) were located in the acceptable domain, which is bounded by the critical leverage value (h*) of the Hat matrix and the standardized residual range of -3 to 3. This shows that the model has a wide applicability domain with good predicted results. There were 2.4 %~3.2 % of chemicals with h exceeding h* but with residuals within three standard deviation units. A high h does not always indicate an outlier of the model; such chemicals are valuable for the stability of the QSPR model and can be regarded as good influence points [47]. Fig. 4 also shows that 1.6 %~1.9 % of the data points lay outside the standardized residual range of [-3, 3]. For these compounds, the predictions might not be reliable owing to high errors caused by unreasonable experimental data [48]. On the whole, the verification results demonstrated that this model covers a large response and structural applicability domain for the prediction of $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$.

Fig. 4 Applicability domain of this model for $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ prediction.
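The quantities behind a Williams plot can be computed as in the sketch below. It is an illustration under assumptions: the warning leverage is taken as h* = 3(p + 1)/n, a common convention that the text does not spell out, residuals are standardized by the residual standard deviation of the fit, and the data and the name `williams_data` are hypothetical.

```python
import numpy as np

def williams_data(X, y):
    """Leverages, standardized residuals and warning leverage h* for a least-squares model."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])            # design matrix with intercept
    hat = A @ np.linalg.pinv(A.T @ A) @ A.T         # Hat matrix H
    leverage = np.diag(hat)                         # h_i
    residual = y - hat @ y
    s = np.sqrt(residual @ residual / (n - p - 1))  # residual standard deviation
    h_star = 3.0 * (p + 1) / n                      # assumed convention for h*
    return leverage, residual / s, h_star

# Synthetic data (not from the paper): 200 samples, 3 descriptors.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=200)
leverage, std_residual, h_star = williams_data(X, y)
inside = (leverage <= h_star) & (np.abs(std_residual) <= 3.0)
fraction_inside = float(inside.mean())
```

Plotting `std_residual` against `leverage`, with a vertical line at `h_star` and horizontal lines at -3 and 3, gives the kind of diagram shown in Fig. 4.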
3.6 Model comparison
To confirm the quality and performance of this model, it was compared with other reference models, and the comparison results are summarized in Table 5.

For $\Delta H_v^0$, good prediction results were provided by Puri et al. [49] and Padmanabhan et al. [27]. However, these models were developed for a single class of compounds, polychlorinated biphenyls, and their external prediction ability could not be confirmed. In the work of Sosnowska et al. [28], an MLR model with good fitting and robustness ($R^2$ = 0.888, $Q^2_{CV}$ = 0.878, RMSECV = 5.34) was established for calculating the $\Delta H_v^0$ of 78 persistent organic pollutants (POPs), yet few samples were involved in their modelling work.

For $\Delta H_f^0$, some researchers have developed molecular structure-based models with satisfactory performance. The models of Gharagheizi [50] and Vatani et al. [38] predicted $\Delta H_f^0$ accurately, with $R^2$ of 0.950 and 0.983, respectively, yet the sample data used in their modelling contained calculated values. The models of Begam et al. [51] and the linear models of Adams et al. [52] all have high $R^2$ and F values, but few types of compounds were involved in their samples, so these models might not extrapolate well. Albahri and Aljasmi [53] proposed two GCM models based on the ANN (Artificial Neural Networks) and MNLR (Multivariable Nonlinear Regression) methods, respectively. Although these models cover many kinds of organic compounds, the GCM depends on group division and group contribution values.

Compared with the references above, our work covers a variety of organic compounds from 14 structural families, including chain and cyclic hydrocarbons, alcohols, ketones, carboxylic acids, amines, halogenated hydrocarbons, sulfur compounds and so on, which might greatly expand the application scope of this model. Also, a unified QSPR model was developed for predicting four different properties, and it has satisfactory fitting accuracy, with $R^2$ of 0.967 for $\Delta H_v^0$, 0.990 for $\Delta H_f^0(\mathrm{g})$, 0.989 for $\Delta H_f^0(\mathrm{s})$ and 0.987 for $\Delta H_f^0(\mathrm{l})$. Moreover, the statistical results of this model were even slightly better than those of the GCM models. On the whole, this QSPR model has reasonable predictability, good statistical accuracy and good generalization.
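Table 5 Comparison with references.

| Property | Reference | Compound type | Method | N | $R^2$ | $Q^2$ | F | s |
|---|---|---|---|---|---|---|---|---|
| $\Delta H_v^0$ | Puri et al. [49] | Polychlorinated biphenyls | PLS | 17 | 0.996 | 0.852 | 427 | 0.845 |
| $\Delta H_v^0$ | Padmanabhan et al. [27] | Polychlorinated biphenyls | MLR | 17 | 0.976 | 0.948 | 200 | 2.051 |
| $\Delta H_v^0$ | Padmanabhan et al. [27] | Polychlorinated biphenyls | MLR | 27 | 0.858 | 0.758 | 48 | 3.567 |
| $\Delta H_v^0$ | Sosnowska et al. [28] | Persistent Organic Pollutants | MLR | 78 | 0.888 | 0.878 | 302 | 5.113 |
| $\Delta H_v^0$ | this work | 14 kinds of organic compounds | MLR | 573 | 0.967 | 0.963 | 737.462 | 3.449 |
| $\Delta H_f^0$ | Gharagheizi [50] | 14 kinds of organic compounds | GA-MLR | 1692 | 0.950 | 0.948 | 2830.03 | 104.03 |
| $\Delta H_f^0$ | Vatani et al. [38] | 14 kinds of organic compounds | GA-MLR | 1115 | 0.983 | 0.9826 | 10239.02 | 58.541 |
| $\Delta H_f^0$ | Adams et al. [52] | oxygen-containing heterocycles | GA-MLR | 46 | 0.945 | 0.93 | 137 | 14.17 |
| $\Delta H_f^0$ | Begam et al. [51] | hydrocarbon | PR | 60 | 1 | - | 1139.9 | 2.7 |
| $\Delta H_f^0$ | Begam et al. [51] | hydrocarbon | PCR | 60 | 0.984 | - | 624.213 | 3.768 |
| $\Delta H_f^0$ | Begam et al. [51] | hydrocarbon | PLSR | 60 | 0.984 | - | 697.650 | 3.732 |
| $\Delta H_f^0$ | Albahri and Aljasmi [53] | 14 kinds of organic compounds | SGC-ANN | 584 | 0.987 | - | - | 22.401 |
| $\Delta H_f^0$ | Albahri and Aljasmi [53] | 14 kinds of organic compounds | SGC-MNLR | 584 | 0.81 | - | - | 88.399 |
| $\Delta H_f^0(\mathrm{g})$ | this work | 14 kinds of organic compounds | MLR | 964 | 0.990 | 0.989 | 4207.505 | 29.570 |
| $\Delta H_f^0(\mathrm{s})$ | this work | 14 kinds of organic compounds | MLR | 367 | 0.989 | 0.986 | 1398.014 | 41.725 |
| $\Delta H_f^0(\mathrm{l})$ | this work | 14 kinds of organic compounds | MLR | 873 | 0.987 | 0.986 | 3030.539 | 31.142 |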
3.7 Explanation of norm descriptors
The norm indexes have been successfully applied to describe the relationship between molecular structure and biological activity, toxicity and physicochemical properties in various research fields, including the enthalpy of vaporization of organic compounds at the boiling point [54], the heat capacity of organic compounds at different temperatures [55], the flash point of multi-component mixtures [56], multiple toxicity endpoint-structure relationships for substituted phenols and anilines [57], the toxicity and heat capacity of ionic liquids [58, 59], and so on.

Here, based on the concept of the norm index proposed by our group, 18 atomic distribution matrices were presented for predicting the $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ of organic compounds. The atomic distribution matrices contain both the spatial distribution and the properties of the atoms, and thus combine position information with the contribution of the atoms in each molecule. The good prediction results obtained in the present work suggest that these atomic distribution matrices are valuable and can be successfully applied to calculating different property endpoints of various substances.
4 Conclusions
In this work, a set of atomic distribution matrices and norm descriptors were proposed. A unified QSPR model was built to predict four property endpoints for 14 families of organic compounds such as chain and cyclic hydrocarbons, alcohols, ketones, carboxylic acids, amines, halogenated hydrocarbons, sulfur compounds, etc. The four thermodynamic property endpoints, $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$, were treated in the same modelling work. High $R^2$ and F values showed that this model could provide satisfactory calculation accuracy and fitting. Moreover, the results of external validation and internal validation indicated that the model has good predictive ability and robustness. The verification of the applicability domain showed that this QSPR model could be applied in a large response and structural domain. Hence, it could be concluded that the norm index, with its generality, was successful for describing the $\Delta H_v$ and $\Delta H_f$ of organics, and the unified QSPR model could provide satisfactory results for the prediction of the four property endpoints of organic compounds.
Supplementary Material
The experimental and calculated values of $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ are listed in Tables S1~S4. The properties of the atoms involved in this study are shown in Table S5. Also, an example of calculating $\Delta H_v^0$ with the established model is given.

Conflict of interest statement
The authors confirm that this article has no conflicts of interest.

Acknowledgements
This work was supported by the National Natural Science Foundation of China [NO: 21808167, 21676203 and 21306137].
References:
325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366
[1] W. Fang, Q. Lei, R. Lin, Enthalpies of vaporization of petroleum fractions from vapor pressure measurements and their correlation along with pure hydrocarbons, Fluid Phase Equilibria, 205 (2003) 149-161. [2] T.N.G. Borhani, M. Bagheri, Z.A. Manan, Molecular modeling of the ideal gas enthalpy of formation of hydrocarbons, Fluid Phase Equilibria, 360 (2013) 423-434. [3] L. Ogorodova, M. Vigasina, L. Mel'Chakova, V. Rusakov, D. Kosova, D. Ksenofontov, I. Bryzgalov, Enthalpy of formation of natural hydrous iron phosphate: Vivianite, Journal of Chemical Thermodynamics, 110 (2017) 193-200. [4] G. Xin, P.S. Maram, A. Navrotsky, A correlation between formation enthalpy and ionic conductivity in perovskite-structured Li3xLa0.67-xTiO3 solid lithium ion conductors, Journal of Materials Chemistry A, 5 (2017). [5] A.H. Mohammadi, D. Richon, New Predictive Methods for Estimating the Vaporization Enthalpies of Hydrocarbons and Petroleum Fractions, Industrial & Engineering Chemistry Research, 46 (2007) 2665-2671. [6] S. Genheden, Predicting Partition Coefficients with a Simple All-Atom/Coarse-Grained Hybrid Model, Journal of Chemical Theory & Computation, 12 (2016) 297-304. [7] V.N. Emel'Yanenko, S.P. Verevkin, H. Andreas, The gaseous enthalpy of formation of the ionic liquid 1-butyl-3-methylimidazolium dicyanamide from combustion calorimetry, vapor pressure measurements, and ab initio calculations, Journal of the American Chemical Society, 129 (2007) 3930-3937. [8] L. Yang, C. Adam, S.L. Cockroft, Quantifying solvophobic effects in nonpolar cohesive interactions, Journal of the American Chemical Society, 137 (2015) 10084-10087. [9] L.M.N.B.F. Santos, J.N.C. Lopes, J.O.A.P. Coutinho, J.M.S.S. EsperancA, L.R. Gomes, I.M. Marrucho, L.P.N. Rebelo, Ionic liquids: first direct determination of their cohesive energy, Journal of the American Chemical Society, 129 (2007) 284-285. [10] Z.k. Kolská, M. Zábranský, A. 
Randová, Group contribution methods for estimation of selected physico-chemical properties of organic compounds, Thermodynamics–Fundamentals and Its Application in Science, (2012) 1-28. [11] A.A. Strechan, G.J. Kabo, Y.U. Paulechka, The correlations of the enthalpy of vaporization and the surface tension of molecular liquids, Fluid Phase Equilibria, 250 (2006) 125-130. [12] A.M. Benkouider, R. Kessas, S. Guella, A. Yahiaoui, F. Bagui, Estimation of the enthalpy of vaporization of organic components as a function of temperature using a new group contribution method, Journal of Molecular Liquids, 194 (2014) 48-56. [13] M. Firpo, L. Gavernet, E. Castro, A. Toropov, Maximum topological distances based indices as molecular descriptors for QSPR. Part 1. Application to alkyl benzenes boiling points, Journal of Molecular Structure: THEOCHEM, 501 (2000) 419-425. [14] L. Riedel, Eine neue universelle Dampfdruckformel Untersuchungen über eine Erweiterung des Theorems der übereinstimmenden Zustände. Teil I, Chemie Ingenieur Technik, 26 (1954) 83-89. [15] N.H. Chen, Generalized Correlation for Latent Heat of Vaporization, Journal of Chemical & Engineering Data, 10 (1965) 207-210. [16] A. Alibakhshi, Enthalpy of vaporization, its temperature dependence and correlation with surface tension: A theoretical approach, Fluid Phase Equilibria, 432 (2017) 62-69. 24
[17] C. Belghit, Y. Lahiouel, T.A. Albahri, New empirical correlation for estimation of vaporization enthalpy of Algerian Saharan blend petroleum fractions, Petroleum Science and Technology, 36 (2018) 1181-1186.
[18] S.W. Benson, J.H. Buss, Additivity rules for the estimation of molecular properties. Thermodynamic properties, The Journal of Chemical Physics, 29 (1958) 546-572.
[19] S.W. Benson, F. Cruickshank, D. Golden, G.R. Haugen, H.E. O'Neal, A. Rodgers, R. Shaw, R. Walsh, Additivity rules for the estimation of thermochemical properties, Chemical Reviews, 69 (1969) 279-324.
[20] S.W. Benson, Thermochemical kinetics: methods for the estimation of thermochemical data and rate parameters, Wiley, 1968.
[21] K.G. Joback, R.C. Reid, Estimation of pure-component properties from group-contributions, Chemical Engineering Communications, 57 (1987) 233-243.
[22] L. Constantinou, R. Gani, New group contribution method for estimating properties of pure compounds, AIChE Journal, 40 (1994) 1697-1710.
[23] Q. Wang, Q. Jia, P. Ma, Position Group Contribution Method for the Prediction of the Critical Compressibility Factor of Organic Compounds, Journal of Chemical & Engineering Data, 54 (2009) 1916-1922.
[24] Q. Jia, Q. Wang, P. Ma, Prediction of the Enthalpy of Vaporization of Organic Compounds at Their Normal Boiling Point with the Positional Distributive Contribution Method, Journal of Chemical & Engineering Data, 55 (2010) 5614-5620.
[25] Q. Wang, Q. Jia, P. Ma, Prediction of the Acentric Factor of Organic Compounds with the Positional Distributive Contribution Method, Journal of Chemical & Engineering Data, 57 (2012) 169-189.
[26] W. Qiang, P. Ma, C. Wang, S. Xia, Position Group Contribution Method for Predicting the Normal Boiling Point of Organic Compounds, Chinese Journal of Chemical Engineering, 17 (2009) 468-472.
[27] J. Padmanabhan, R. Parthasarathi, V. Subramanian, P.K. Chattaraj, Using QSPR Models to Predict the Enthalpy of Vaporization of 209 Polychlorinated Biphenyl Congeners, QSAR & Combinatorial Science, 26 (2007) 227-237.
[28] A. Sosnowska, M. Barycki, K. Jagiello, M. Haranczyk, A. Gajewicz, T. Kawai, N. Suzuki, T. Puzyn, Predicting enthalpy of vaporization for Persistent Organic Pollutants with Quantitative Structure–Property Relationship (QSPR) incorporating the influence of temperature on volatility, Atmospheric Environment, 87 (2014) 10-18.
[29] K. Roy, Advances in QSAR Modeling: Applications in Pharmaceutical, Chemical, Food, Agricultural and Environmental Sciences, 2017.
[30] F. Gharagheizi, M.R.S. Gohar, M.G. Vayeghan, A quantitative structure–property relationship for determination of enthalpy of fusion of pure compounds, Journal of Thermal Analysis & Calorimetry, 109 (2012) 501-506.
[31] M. Goodarzi, T. Chen, M.P. Freitas, QSPR predictions of heat of fusion of organic compounds using Bayesian regularized artificial neural networks, Chemometrics and Intelligent Laboratory Systems, 104 (2010) 260-264.
[32] M.G. Freire, C.M.S.S. Neves, S.P.M. Ventura, M.J. Pratas, I.M. Marrucho, J. Oliveira, J.A.P. Coutinho, A.M. Fernandes, Solubility of non-aromatic ionic liquids in water and correlation using a QSPR approach, Fluid Phase Equilibria, 294 (2010) 234-240.
[33] Y. Ye, Y. Sun, D. Wang, R. Liu, S. Gu, G. Liang, X. Jie, Quantitative structure-property relationship study of liquid vapor pressures for polychlorinated diphenyl ethers, Fluid Phase Equilibria, 391 (2015) 31-38.
[34] D. Abooali, M.A. Sobati, Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach, International Journal of Refrigeration, 40 (2014) 282-293.
[35] E.L. Krasnykh, Y.A. Druzhinina, S.V. Portnova, Y.A. Smirnova, Vapor pressure and enthalpy of vaporization of trimethylolpropane and carboxylic acids esters, Fluid Phase Equilibria, 462 (2018) 111-117.
[36] L. Hu, X. Wang, L. Wong, G. Chen, Combined first-principles calculation and neural-network correction approach for heat of formation, The Journal of Chemical Physics, 119 (2003) 11501-11507.
[37] A. Mercader, E.A. Castro, A.A. Toropov, QSPR modeling of the enthalpy of formation from elements by means of correlation weighting of local invariants of atomic orbital molecular graphs, Chemical Physics Letters, 330 (2000) 612-623.
[38] A. Vatani, M. Mehrpooya, F. Gharagheizi, Prediction of Standard Enthalpy of Formation by a QSPR Model, International Journal of Molecular Sciences, 8 (2007) 407-432.
[39] C. Press, CRC Handbook of Chemistry and Physics, Apple Academic Press Inc.; 96th New edition, (2015).
[40] R.A. Friesner, Ab initio quantum chemistry: Methodology and applications, Proc. Natl. Acad. Sci. U. S. A., 102 (2005) 6648-6653.
[41] A. Tropsha, P. Gramatica, V.K. Gombar, The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models, QSAR & Combinatorial Science, 22 (2003) 69-77.
[42] K. Roy, S. Kar, R.N. Das, Understanding the basics of QSAR for applications in pharmaceutical sciences and risk assessment, Academic Press, 2015.
[43] K. Roy, I. Mitra, S. Kar, P.K. Ojha, R.N. Das, H. Kabir, Comparative studies on some metrics for external validation of QSPR models, Journal of Chemical Information & Modeling, 52 (2012) 396-408.
[44] C. Rücker, G. Rücker, M. Meringer, y-Randomization and its variants in QSPR/QSAR, Journal of Chemical Information & Modeling, 47 (2007) 2345-2357.
[45] K. Roy, S. Kar, P. Ambure, On a simple approach for determining applicability domain of QSAR models, Chemometrics & Intelligent Laboratory Systems, 145 (2015) 22-29.
[46] K. Roy, P. Ambure, R.B. Aher, How important is to detect systematic error in predictions and understand statistical applicability domain of QSAR models?, Chemometrics & Intelligent Laboratory Systems, 162 (2017) 44-54.
[47] J. Jaworska, N. Nikolova-Jeliazkova, T. Aldenberg, QSAR applicability domain estimation by projection of the training set in descriptor space: A review, Alternatives to Laboratory Animals Atla, 33 (2005) 445-459.
[48] P. Gramatica, Principles of QSAR models validation: internal and external, QSAR & Combinatorial Science, 26 (2007) 694-701.
[49] S. Puri, J.S. Chickos, W.J. Welsh, Three-dimensional quantitative structure-property relationship (3D-QSPR) models for prediction of thermodynamic properties of polychlorinated biphenyls (PCBs): enthalpies of fusion and their application to estimates of enthalpies of sublimation and aqueous, J Chem Inf Comput Sci, 34 (2003) 55-62.
[50] F. Gharagheizi, Prediction of the Standard Enthalpy of Formation of Pure Compounds Using Molecular Structure, Australian Journal of Chemistry, 62 (2009) 376-381.
[51] B.F. Begam, J.S. Kumar, G.S. Chae, Optimized peer to peer QSPR prediction of enthalpy of formation using outlier detection and subset selection, Peer-to-Peer Networking and Applications, (2018) 1-10.
[52] N. Adams, J. Clauss, M. Meunier, U.S. Schubert, Predicting thermochemical parameters of oxygen-containing heterocycles using simple QSPR models, Molecular Simulation, 32 (2006) 125-134.
[53] T.A. Albahri, A.F. Aljasmi, SGC method for predicting the standard enthalpy of formation of pure compounds from their molecular structures, Thermochimica Acta, 568 (2013) 46-60.
[54] Q. Jia, X. Yan, T. Lan, F. Yan, Q. Wang, Norm indexes for predicting enthalpy of vaporization of organic compounds at the boiling point, Journal of Molecular Liquids, 282 (2019) 484-488.
[55] J. Yin, Q. Jia, F. Yan, Q. Wang, Predicting heat capacity of gas for diverse organic compounds at different temperatures, Fluid Phase Equilibria, 446 (2017) 1-8.
[56] Y. Wang, F. Yan, Q. Jia, Q. Wang, Distributive structure-properties relationship for flash point of multiple components mixture, Fluid Phase Equilibria, 474 (2018) 1-5.
[57] F. Yan, T. Liu, Q. Jia, Q. Wang, Multiple toxicity endpoint–structure relationships for substituted phenols and anilines, Science of The Total Environment, 663 (2019) 560-567.
[58] F. Yan, T. Lan, X. Yan, Q. Jia, Q. Wang, Norm index-based QSTR model to predict the eco-toxicity of ionic liquids towards Leukemia rat cell line, Chemosphere, 234 (2019) 116-122.
[59] W. He, F. Yan, Q. Jia, S. Xia, Q. Wang, Prediction of ionic liquids heat capacity at variable temperatures based on the norm indexes, Fluid Phase Equilibria, 500 (2019) 112260.
Highlights:
A unified QSPR model was proposed for predicting four thermodynamic property endpoints.
The model for the four property endpoints was established using unified norm indexes.
Norm indexes could be used to describe the $\Delta H_v^0$, $\Delta H_f^0(\mathrm{g})$, $\Delta H_f^0(\mathrm{s})$ and $\Delta H_f^0(\mathrm{l})$ of organic compounds.
Author Contribution Section
Each author's contribution(s) to the manuscript, using the numbered list below:
1. Research concept and design
2. Collection of data
3. Data analysis and interpretation
4. Writing the article
5. Critical revision of the article
6. Final approval of article

Name of Author: Contribution(s) (using the numbered list above)
Xue Yan: 1, 2, 3, 4, 5
Tian Lan: 1, 3, 5
Qingzhu Jia: 1, 5, 6
Fangyou Yan: 5, 6
Qiang Wang: 5, 6
Declaration of interests
■ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☒ The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: