Author's Accepted Manuscript
Modeling adsorption of organic compounds on activated carbon using ETA indices Supratim Ray, Kunal Roy
www.elsevier.com/locate/ces
PII: S0009-2509(13)00631-3
DOI: http://dx.doi.org/10.1016/j.ces.2013.09.018
Reference: CES11299
To appear in:
Chemical Engineering Science
Received date: 2 April 2013
Revised date: 24 August 2013
Accepted date: 5 September 2013

Cite this article as: Supratim Ray, Kunal Roy, Modeling adsorption of organic compounds on activated carbon using ETA indices, Chemical Engineering Science, http://dx.doi.org/10.1016/j.ces.2013.09.018

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Modeling adsorption of organic compounds on activated carbon using ETA indices

Supratim Ray a,† and Kunal Roy b,*

a Division of Pharmaceutical Chemistry, Dr. B C Roy College of Pharmacy & Allied Health Sciences, Bidhannagar, Durgapur 713 206, India
b Drug Theoretics and Cheminformatics Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India

---------------------------------------
* Corresponding author: Kunal Roy
Email: [email protected]; Phone: +91 98315 94140; Fax: +91-33-2837-1078
URL: http://sites.google.com/site/kunalroyindia/
† Presently at Assam University, Silchar 788 011, India

Abstract

The aim of the present work is to develop quantitative structure-property relationship (QSPR) models for the adsorption capability of a large dataset of chemicals (n = 3483) onto activated carbon. Two splitting techniques, k-means clustering and principal component analysis (PCA) combined with the duplex method, were used to divide the data set into training and test sets. An attempt was made to identify the descriptors common to the various models, indicating their importance for the adsorption capacity onto activated carbon. Despite the large number of compounds in the training and test sets (3:1 in size ratio), we did not omit any compounds showing outlier behavior to artificially enhance the values of the validation metrics, thus ensuring the predictive quality of the models for diverse types of compounds. The models were developed to study the predictive ability of extended topochemical atom (ETA) parameters, which are calculated from the two-dimensional representation of molecules and were introduced by the present group of authors. The ETA models were compared to non-ETA models involving topological, spatial and structural descriptors. In all cases, the data set was first subjected to stepwise regression to identify the contributing variables, and the selected variables were then subjected to partial least squares (PLS) regression. The PLS models indicate that the ETA descriptors provide better external validation characteristics, in terms of predictive $R^2$, than the non-ETA ones. The best ETA model shows encouraging statistical quality ($Q^2_{int}$ = 0.8059, $Q^2_{ext(F1)}$ = 0.7914, $Q^2_{ext(F2)}$ = 0.7909, $Q^2_{ext(F3)}$ = 0.8492).

Keywords: Adsorption; Computation; Computational chemistry; Mathematical modeling; QSPR; ETA

1 Introduction

Industrial processes that produce aqueous effluents rich in many heavy metal ions, and waters leaching from agricultural and forest land after the application of chemical fertilizers and pesticides, remain an important source of potential human toxicity (Easton, 1995; Karanfil et al., 1996). Harmful volatile organic chemicals are also commonly present in many industrial manufacturing environments. Though several control technologies have been applied to many industrial and municipal sources, the total quantity of these agents released to the environment remains high (Nriagu and Pacyna, 1988; Mohan and Pittman Jr., 2006). Activated carbon is one of the most popular adsorbents, used widely in different types of industries for the removal of toxic pollutants, ions and non-biodegradable wastewaters (Balci et al., 2011; Zhang et al., 2005), as well as for various gas separation and purification processes (Le Leuch and Bandosz, 2007). Activated carbon is a crude form of graphite having an amorphous structure. It is highly porous, exhibiting a broad range of pore sizes, from visible cracks to crevices and slits of molecular dimensions. The starting material and the activation method used for activated carbon production determine the surface functional groups. Activation also refines the pore structure, and surface areas of up to 2000 m²/g can be obtained (Radovic et al., 2000). Activated carbons have been prepared from coconut shells, wood char, lignin, petroleum coke, bone char, peat, sawdust, carbon black, rice hulls, sugar, peach pits, fish, fertilizer waste, waste rubber tire, etc. (Pollard et al., 1992; Mohan and Singh, 2005). The adsorption
capability of activated carbon for chemical adsorption depends on several factors, mainly on the carbon's characteristics such as texture (surface area, pore size distribution), surface chemistry (surface functional groups) and ash content (Gregg and Singh, 1982; Radovic et al., 1997). It also depends on adsorptive characteristics like molecular weight, polarity, pKa, molecular size and functional groups (Villacañas et al., 2006; Karanfil and Kilduff, 1999). Finally, environmental factors such as pH, adsorptive concentration and the presence of other possible adsorptives also affect the adsorption capability of carbon (Nouri and Haghseresht, 2002; Haghseresht et al., 2002). The activated carbon surface chemistry and the pH of the medium play a significant role in the adsorption capability of activated carbon (Moreno-Castilla et al., 1995). The surface of activated carbon consists of basal planes, heterogeneous superficial groups (mainly oxygen-containing surface groups) and inorganic ash. For aromatic compounds, most of the adsorption sites are found on the basal planes, which correspond to about 90% of the carbon surface. Heterogeneous groups also contribute towards activity and define the chemical characteristics of the carbon surface (Franz et al., 2000). The heteroatoms are located at the edges, or in the defects, of the basal planes of carbon atoms. The amount of these oxygenated surface groups varies with the nature of the raw material and the activation process (Giraudet et al., 2006). The nature of the surface groups can also be modified by physical, chemical and electrochemical treatments (Bansal et al., 1988; Figueiredo et al., 1999).

Since pollutants comprise many different types of chemicals with varying structural composition, it is difficult to predict which particular compound will be adsorbed onto activated carbon, and to what extent. At the same time, it is expensive to determine experimentally the adsorption capability of different classes of chemicals onto activated carbon. Thus, prediction of the adsorption capability of activated carbon for a diverse set of chemicals is a relevant topic for quantitative structure-property relationship (QSPR) studies. Studies of this kind are relevant in the context of the Materials Genome Initiative, a new, multi-stakeholder effort to develop an infrastructure using a computational approach to accelerate advanced materials discovery and deployment in the United States (Service, 2012). Against this background, in the present study we have performed a QSPR analysis on a large number of compounds to quantitatively relate the adsorption capability of activated carbon to the structure of the chemicals using linear regression tools. The descriptors of interest in the present study are extended topochemical atom (ETA) indices, which are calculated simply from the 2D representation of the molecules. The ETA indices, which were developed by the present group of authors (Roy and Ghosh, 2003), require less computational time than complex 3D descriptors and possess good diagnostic power for developing robust predictive models. The models were developed to study the predictive ability of ETA parameters in comparison to non-ETA descriptors (topological, structural, physicochemical, electronic and spatial descriptors).

2 Materials and Methods

2.1 The data set

In the present work, the adsorption capacities of 3483 organic compounds on activated carbon, as used by Lei et al. (2010) and originally taken from Yaws' Handbook of Thermodynamic and Physical Properties of Chemical Compounds (Yaws, 2003-2004), were used as the model data set. All the compounds, their observed and calculated adsorption capabilities, and the ETA and non-ETA parameters present in the models are listed in the Supplementary Materials.

2.2 Types of descriptors

2.2.1 ETA descriptors

The extended topochemical atom (ETA) indices were formulated by the present group of authors (Roy and Ghosh, 2003) based on a modification of the TAU descriptors (Pal et al., 1988; Pal et al., 1989). The TAU descriptors were developed considering the valence electron mobile (VEM) environment. The ETA scheme includes various basic parameters related to size or bulk, electronegativity and electronic contribution. Some of the basic ETA indices of all the compounds were calculated using DRAGON version 6.0 software (offered by TALETE srl, Italy). Other ETA indices were calculated according to previously published work (Roy and Das, 2011). The work started with forty ETA descriptors. The significance of the different ETA descriptors is listed in Table 1. The ETA indices are also now available for computation in version 2.11 of PaDEL-Descriptor (Yap, 2011), an open source software available at http://padel.nus.edu.sg/software/padeldescriptor.

Table 1 near here

2.2.2 Non-ETA descriptors

Several types of non-ETA descriptors, namely two-dimensional (2D) descriptors (topological, structural, physicochemical and electronic indices) and three-dimensional (3D) descriptors (spatial indices), were calculated for the given data set of compounds to assess the performance of ETA descriptors in comparison to non-ETA descriptors. For the calculation of the 3D descriptors, multiple conformations of each molecule were generated using the optimal search as the conformational search method. Each conformer was subjected to an energy minimization procedure using the smart minimizer under the open force field (OFF) to generate the lowest energy conformation for each structure. The charges were calculated according to the Gasteiger method (Gasteiger, 1978). These descriptors were calculated using Cerius2 version 4.10 software (a product of Accelrys, Inc., San Diego, USA). This study involved 103 non-ETA descriptors. The categorical list of the different non-ETA descriptors is shown in Table 2.

Table 2 near here

2.3 Model development

2.3.1 Splitting of the data set

Two different splitting techniques were used in order to identify the descriptors common to the various models, indicating their importance for the adsorption capacity of a large data set of chemicals onto activated carbon. First, we used the same split as reported by Lei et al. (2010): a training set of 2612 compounds and a test set of 871 compounds, obtained by principal component analysis (PCA) combined with the duplex method. Second, k-means clustering was used as a splitting technique for selection of the training set for model development. The whole data set (n = 3483) was divided into training (n = 2613, 75% of the total number of compounds) and test (n = 870, 25% of the total number of compounds) sets by the k-means clustering technique (MacQueen, 1967) applied to the standardized descriptor matrix of the combined sets (ETA and non-ETA). Figure 1 (PCA combined with the duplex method as the splitting technique) and Figure 2 (k-means clustering as the splitting technique) show plots of the first three principal components of the descriptor matrix, suggesting that each of the test set compounds lies in close vicinity of some training set molecules. All developed models were then validated externally using the test set compounds.

Figure 1 near here

Figure 2 near here
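
The cluster-based split described above can be illustrated with a minimal sketch. It assumes a plain Lloyd's k-means on the autoscaled descriptor matrix and sends every fourth member of each cluster to the test set to approximate the 75:25 ratio. The within-cluster selection rule, the number of clusters and the function names (standardize, kmeans, split_by_cluster) are illustrative assumptions, not details reported in this work.

```python
import random

def standardize(X):
    """Autoscale every descriptor column to zero mean and unit variance."""
    n, k = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(k)]
    sds = []
    for j in range(k):
        var = sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)
        sds.append(var ** 0.5 or 1.0)  # constant column: avoid division by zero
    return [[(row[j] - means[j]) / sds[j] for j in range(k)] for row in X]

def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    rng = random.Random(seed)
    centers = [list(X[i]) for i in rng.sample(range(len(X)), n_clusters)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(range(n_clusters),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centers[c])))
            for x in X
        ]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def split_by_cluster(labels, test_every=4):
    """Assumed rule: every `test_every`-th member of each cluster is a test compound."""
    train, test, seen = [], [], {}
    for i, lab in enumerate(labels):
        seen[lab] = seen.get(lab, 0) + 1
        (test if seen[lab] % test_every == 0 else train).append(i)
    return train, test
```

Drawing test compounds from every cluster is one simple way to keep each test compound close to some training compounds in descriptor space, which is the property the principal component plots are meant to show.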

2.3.2 Chemometric tools

Initially, stepwise regression was applied to the data set to find useful subsets of descriptors, which were then subjected to partial least squares (PLS) analysis for model development. PLS is a generalization of regression that can handle data with strongly correlated and/or noisy or numerous X variables (Wold, 1995; Fan et al., 2001). It gives a reduced solution, which is statistically more robust than multiple linear regression (MLR). The linear PLS model finds "new variables" (latent variables or X scores) which are linear combinations of the original variables. To avoid overfitting, each consecutive PLS component is subjected to a strict significance test, and extraction stops when the components become non-significant. Application of PLS thus allows the construction of larger QSAR equations while still avoiding overfitting and eliminating most variables. PLS is normally used in combination with cross-validation to obtain the optimum number of components. This ensures that the QSAR equations are selected based on their ability to predict the data rather than to fit the data. In the PLS analysis of the present data set, the variables with smaller standardized regression coefficients were removed from the PLS regression until there was no further improvement in the $Q^2$ value, irrespective of the number of components.
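
The cross-validated $Q^2$ that drives this selection can be sketched as follows. For brevity the sketch swaps in a single-descriptor ordinary least-squares line for the PLS model, so it illustrates only the leave-one-out bookkeeping, not PLS itself. The function names are hypothetical, and the total sum of squares is taken about the mean of the full training response, which is one common convention.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x (stand-in for the PLS model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def q2_loo(x, y):
    """Leave-one-out Q2 = 1 - PRESS / SS_total for lists x and y."""
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    my = sum(y) / len(y)
    ss_total = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / ss_total
```

Because every prediction is made for a compound left out of the fit, $Q^2$ can only reward a variable (or an additional component) that improves prediction rather than fit.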

2.3.3 Software used for model development

MINITAB version 14 (statistical software of Minitab Inc, USA) was used for the stepwise regression and partial least squares (PLS) methods. STATISTICA version 7 (statistical software of Stat Soft Inc) was used for the determination of the LOO (leave-one-out) values of the training set compounds. SIMCA-P 10.0 was used to calculate the DModX (distance to the model in X-space) values of the compounds.

2.3.4 Model validation

The statistical quality of the various equations was judged by calculating several metrics, namely the determination coefficient ($R^2$) as a measure of the total variance of the response explained by the regression model (fitting), the explained variance ($R_a^2$) and the variance ratio (F) at specified degrees of freedom (df) (Snedecor and Cochran, 1967). Both internal and external validation were performed to assess the reliability and the predictive potential of the developed models: (a) internal validation or cross-validation using the training set compounds, and (b) external validation using the test set compounds.

All the generated models were validated internally by the leave-one-out procedure ($Q^2_{int}$) (Wold and Ericsson, 1995). Besides leave-one-out validation, the internal predictive ability and robustness of the developed models were further evaluated by leave-25%-out cross-validation. The developed models were judged by different external validation parameters, namely $Q^2_{ext(F1)}$ and $Q^2_{ext(F2)}$ (Hawkins, 2004; Schuurmann et al., 2008) and $Q^2_{ext(F3)}$ (Consonni et al., 2009). Besides the above parameters, two more external validation parameters were employed to check the predictive ability of the developed models, as external validation is the most desired tool for establishing the predictive quality of QSPR models. The $r_m^2$ metrics ($r_m^2$ and $\Delta r_m^2$) are employed to better indicate both the internal and external predictive capacities of a model and to ascertain the proximity between the predicted and observed response data (Ojha et al., 2011; Roy and Roy, 2008). The $r_m^2$ and $\Delta r_m^2$ metrics are applied for internal validation of the training set compounds ($r^2_{m(LOO)}$ and $\Delta r^2_{m(LOO)}$), external validation of the test set compounds ($r^2_{m(test)}$ and $\Delta r^2_{m(test)}$) and overall validation for all compounds ($r^2_{m(overall)}$ and $\Delta r^2_{m(overall)}$).
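
As an illustration of how the three external metrics differ, the sketch below computes them from the training responses, the test responses and the test predictions. It follows the commonly cited definitions, in which F1 references the training-set mean, F2 the test-set mean, and F3 scales the mean squared prediction error by the training-set variance. The function name q2_external is hypothetical, and the exact expressions used in this work are those of the cited references.

```python
def q2_external(y_train, y_test, y_pred):
    """Return (Q2_F1, Q2_F2, Q2_F3) for test-set predictions y_pred."""
    n_tr, n_te = len(y_train), len(y_test)
    mean_tr = sum(y_train) / n_tr
    mean_te = sum(y_test) / n_te
    # Predictive residual sum of squares on the test set.
    press = sum((o - p) ** 2 for o, p in zip(y_test, y_pred))
    ss_test_about_train = sum((o - mean_tr) ** 2 for o in y_test)
    ss_test_about_test = sum((o - mean_te) ** 2 for o in y_test)
    ss_train = sum((o - mean_tr) ** 2 for o in y_train)
    f1 = 1.0 - press / ss_test_about_train
    f2 = 1.0 - press / ss_test_about_test
    f3 = 1.0 - (press / n_te) / (ss_train / n_tr)
    return f1, f2, f3
```

Since only F3 is normalized by the spread of the training responses, it can move independently of F1 and F2 when the training and test sets differ in response variance.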

The developed equations were also validated applying the parameters proposed by Golbraikh and Tropsha, i.e., (i) $Q^2_{int} > 0.5$, (ii) $r^2 > 0.6$, (iii) $r_0^2$ or $r_0'^2$ is close to $r^2$, such that $[(r^2 - r_0^2)/r^2]$ or $[(r^2 - r_0'^2)/r^2] < 0.1$, and $0.85 \le k \le 1.15$ or $0.85 \le k' \le 1.15$ (Golbraikh and Tropsha, 2002). The detailed explanation of the notation is given in the Supplementary Materials section. For a developed model to have high predictive ability, the correlation coefficient between the actual and calculated activity must be close to one. The regression of actual activity against calculated activity, or of calculated activity against actual activity, through the origin can be characterized by the slope ($k$ or $k'$), which should be close to one.
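
A minimal sketch of these checks is given below. It assumes one common convention for the through-origin quantities ($k = \sum y\hat{y} / \sum \hat{y}^2$ for the regression of observed on predicted values, with the primed quantities obtained by swapping the axes) and the form $r_m^2 = r^2 (1 - \sqrt{|r^2 - r_0^2|})$. Conventions differ between software packages, so the function names and the exact axis assignment here are assumptions rather than the definitions given in the Supplementary Materials.

```python
from math import sqrt

def pearson_r2(y, yhat):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y)
    my, mp = sum(y) / n, sum(yhat) / n
    sxy = sum((a - my) * (b - mp) for a, b in zip(y, yhat))
    syy = sum((a - my) ** 2 for a in y)
    spp = sum((b - mp) ** 2 for b in yhat)
    return sxy * sxy / (syy * spp)

def through_origin(y, x):
    """Slope k and r0^2 for the regression of y on x forced through the origin."""
    k = sum(a * b for a, b in zip(y, x)) / sum(b * b for b in x)
    my = sum(y) / len(y)
    ss_res = sum((a - k * b) ** 2 for a, b in zip(y, x))
    ss_tot = sum((a - my) ** 2 for a in y)
    return k, 1.0 - ss_res / ss_tot

def golbraikh_tropsha(y, yhat, q2_int):
    """Evaluate criteria (i)-(iii) and the slope condition; returns a dict of booleans."""
    r2 = pearson_r2(y, yhat)
    k, r0_2 = through_origin(y, yhat)      # observed vs predicted
    kp, r0p_2 = through_origin(yhat, y)    # axes swapped
    return {
        "q2_int": q2_int > 0.5,
        "r2": r2 > 0.6,
        "r0_close": (r2 - r0_2) / r2 < 0.1 or (r2 - r0p_2) / r2 < 0.1,
        "slope": 0.85 <= k <= 1.15 or 0.85 <= kp <= 1.15,
    }

def rm2(y, yhat):
    """r_m^2 under the assumed convention; Delta r_m^2 is |rm2(y, yhat) - rm2(yhat, y)|."""
    r2 = pearson_r2(y, yhat)
    _, r0_2 = through_origin(y, yhat)
    return r2 * (1.0 - sqrt(abs(r2 - r0_2)))
```

A model passes when all four entries of the returned dictionary are true. The $r_m^2$ value penalizes a high $r^2$ that is not accompanied by a regression line close to the origin-constrained one.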

2.3.5 Applicability domain

The applicability domain (AD) of a QSPR model defines the region within which its predictions can be trusted: predictions are more reliable if the compounds being predicted lie within the applicability domain of the model. The purpose of the AD is to state whether the model's assumptions are met. The applicability domain provides information about the chemical domain of the training set molecules used for the development of the QSPR model and allows efficient prediction of new molecules lying within this chemical domain. For compounds which are markedly dissimilar from the training ones, the predictions made are quite uncertain. Thus, the idea of the AD is used to avoid such unfounded extrapolation of property predictions, improving the reliability of the application of the developed QSPR models.

The residuals of Y and X are of diagnostic value for the quality of the model (Wold et al., 2001). Since there are many X-residuals, one needs a summary for each observation (compound). This is accomplished by the residual standard deviation (SD) of the X-residuals of the corresponding row of the residual matrix E. Because this SD is proportional to the distance between the data point and the model plane in X-space, it is also often called DModX (distance to the model in X-space). Here, X is the matrix of predictor variables, of size (N×K); Y is the matrix of response variables, of size (N×M); E is the (N×K) matrix of X-residuals; N is the number of objects (cases, observations); k is the index of X-variables (k = 1, 2, ..., K); and m is the index of Y-variables (m = 1, 2, ..., M). A DModX larger than around 2.5 times the overall SD of the X-residuals (corresponding to an F-value of 6.25) indicates that the observation is outside the applicability domain of the model (Wold et al., 2001).
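
The DModX screen can be sketched as follows, assuming the X-residual matrix E of a model with A components is already available. The degrees-of-freedom corrections used here (K - A per row and (N - A - 1)(K - A) for the pooled SD) follow a common textbook form and are an assumption; SIMCA-P may apply further correction factors. The function name dmodx_flags is hypothetical.

```python
from math import sqrt

def dmodx_flags(E, n_components, factor=2.5):
    """Return (normalized DModX, outside_domain) for each row of the residual matrix E.

    E is a list of N rows with K residuals each; requires K > A and N > A + 1.
    """
    n, k = len(E), len(E[0])
    dof_row = k - n_components
    # Residual SD of each observation (row of E).
    row_sd = [sqrt(sum(e * e for e in row) / dof_row) for row in E]
    # Pooled residual SD over all observations.
    pooled = sqrt(
        sum(e * e for row in E for e in row) / ((n - n_components - 1) * dof_row)
    )
    return [(s / pooled, s / pooled > factor) for s in row_sd]
```

The cut-off of 2.5 on the SD ratio corresponds to the quoted F-value because the F statistic is a ratio of variances: 2.5² = 6.25.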

3 Results and discussion

Two different splitting strategies (k-means clustering, and principal component analysis (PCA) combined with the duplex method) were employed for the division of the data set into training and test sets, followed by model development and validation. In all cases, the data set was first subjected to stepwise regression (stepping criteria: F = 30 for inclusion; F = 29.99 for exclusion) to find the useful variables, and the selected variables were then subjected to PLS regression. The models were developed using partial least squares (PLS) analysis for each set of descriptors (ETA, non-ETA and combined sets).
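
The stepping criteria can be read as thresholds on a partial F statistic. The sketch below shows the usual form of that statistic for adding one variable to a linear model, together with the two thresholds quoted above. The function names, and whether the software compares with a strict or a non-strict inequality, are assumptions.

```python
F_ENTER = 30.0    # F for inclusion (from the stepping criteria above)
F_REMOVE = 29.99  # F for exclusion (from the stepping criteria above)

def partial_f(rss_without, rss_with, n_obs, n_params_with):
    """Partial F for one added variable.

    rss_without / rss_with: residual sums of squares before and after adding it.
    n_params_with: number of fitted coefficients in the larger model,
    including the intercept.
    """
    return (rss_without - rss_with) / (rss_with / (n_obs - n_params_with))

def should_enter(f_value):
    """A candidate variable enters when its partial F reaches the inclusion threshold."""
    return f_value >= F_ENTER

def should_remove(f_value):
    """A variable already in the model is dropped when its partial F falls below the exclusion threshold."""
    return f_value < F_REMOVE
```

Keeping the exclusion threshold just below the inclusion threshold prevents a variable from being added and removed repeatedly in successive steps.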

3.1 Development of models applying k-means clustering as the splitting technique

3.1.1 Models with ETA descriptors

The model from PLS analysis shows the importance of twelve different ETA descriptors in predicting the adsorption capacity onto activated carbon for the large data set of chemicals (model no 1 in Table 3). Descriptors like $\Sigma\alpha$, $\Delta\varepsilon_D$, $\Sigma\alpha/N_v$, $\eta'_F$ and $\eta'$ have positive contributions towards the adsorption capacity, whereas descriptors like $\Delta\varepsilon_A$, $\eta$, $\eta_F$, $[\Sigma\alpha]_P/\Sigma\alpha$, $\psi_1$ and $\varepsilon_2$ have negative contributions towards the adsorption capacity. The terms $\Sigma\alpha$ and $\Sigma\alpha/N_v$ indicate the importance of molecular size for the adsorption capacity. The model confirms the importance of hydrogen bond donor atoms (shown by the parameter $\Delta\varepsilon_D$). A further positively contributing term indicates the importance of the electron richness of a molecule for the adsorption capacity. The contributions of the overall topological nature and functionality of a molecule (corresponding to $\eta$ and $\eta_F$) and of those relative to molecular size (corresponding to $\eta'$ and $\eta'_F$) are also evident from the model. The parameter $\Delta\varepsilon_A$, signifying the contribution of unsaturation and the presence of electronegative atoms in a molecule, has a negative contribution towards the adsorption capacity. The shape parameter $[\Sigma\alpha]_P/\Sigma\alpha$, which also contributes negatively, signifies the importance of the branching pattern present in a molecule. The term $\psi_1$, a measure of the hydrogen-bonding propensity of the molecules and/or polar surface area, has a negative coefficient for the adsorption capacity. The parameter $\varepsilon_2$, indicating the presence of electronegative atoms in a molecule excluding hydrogen, is detrimental to the adsorption capacity. The statistical quality parameters of the model showed good internal validation ($Q^2_{int}$ = 0.8059, $r^2_{m(LOO)}$ = 0.7219 and $\Delta r^2_{m(LOO)}$ = 0.1633), external validation ($Q^2_{ext(F1)}$ = 0.7914, $Q^2_{ext(F2)}$ = 0.7909, $Q^2_{ext(F3)}$ = 0.8492, $r^2_{m(test)}$ = 0.7108 and $\Delta r^2_{m(test)}$ = 0.0613) and overall validation ($r^2_{m(overall)}$ = 0.7194 and $\Delta r^2_{m(overall)}$ = 0.1425) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.808. The model also satisfies the set of criteria proposed by Golbraikh and Tropsha for evaluation of the predictive ability of the developed model (Tables 3 and 4).

Table 3 near here

Table 4 near here

3.1.2 Models with non-ETA descriptors

The model involving non-ETA descriptors indicates the importance of twelve non-ETA descriptors in predicting the adsorption capacity (model no 2 in Table 3). Parameters like Jurs SASA, Jurs PPSA-3, $^3\chi_p$, Jx, Density and $^3\kappa$ have positive contributions towards the adsorption capacity, whereas terms like S_sF, Jurs FPSA-3, Jurs PNSA-3, Wiener, $^3\chi_{CH}$ and $^1\kappa$ have negative contributions. The positive coefficient of Jurs SASA indicates that a higher value of total molecular solvent accessible surface area will increase the adsorption capacity. The term Jurs PPSA-3 signifies the importance of atomic charge weighted positive surface area for the adsorption capacity. The term $^3\chi_p$ signifies the importance of third order molecular connectivity of path type for the adsorption capacity. This term emphasizes particular atom connectivity within a molecule considering paths of three edges. The positive regression coefficient of Jx implies a positive correlation between Jx and the adsorption capacity. The parameter Jx indicates the average distance sum of the connectivity among different groups within a molecule. Density has a positive regression coefficient for the adsorption capacity, reflecting the types of atoms and their packing pattern in a molecule necessary for adsorption onto activated carbon. The term $^3\kappa$, reflecting the shape of a molecule considering a path length of three, has a positive contribution to the adsorption capacity. The electrotopological state parameter S_sF has a negative contribution, signifying the importance of the fragment –F. The parameters Jurs PNSA-3 and Jurs FPSA-3 have negative coefficients, signifying the contribution of the solvent accessible surface area of a molecule in relation to both negatively charged atoms and fractional charged partial positive surface area towards the adsorption capacity. The Wiener index, which is the sum of the number of chemical bonds existing between all pairs of heavy atoms in the molecule, contributes negatively. The term $^3\chi_{CH}$ has a negative coefficient, indicating that a compound having a higher value of the third order connectivity index (ring type) has a lower adsorption capacity. The term $^1\kappa$, signifying the shape of the molecule considering the count of atoms and the presence of cycles relative to the minimal and maximal graphs, has a negative contribution to the adsorption capacity. The resulting statistical parameters of the model showed good internal validation ($Q^2_{int}$ = 0.7341, $r^2_{m(LOO)}$ = 0.6263 and $\Delta r^2_{m(LOO)}$ = 0.2123), external validation ($Q^2_{ext(F1)}$ = 0.7039, $Q^2_{ext(F2)}$ = 0.7033, $Q^2_{ext(F3)}$ = 0.786, $r^2_{m(test)}$ = 0.6064 and $\Delta r^2_{m(test)}$ = 0.0437) and overall validation ($r^2_{m(overall)}$ = 0.6208 and $\Delta r^2_{m(overall)}$ = 0.1756) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.731. From Table 4 it is observed that, for non-ETA descriptors, the squared correlation coefficient values between the observed and predicted values of the training set compounds (leave-one-out predicted values) with intercept ($r^2$) and without intercept after changing the axes ($r_0'^2$) are not close enough to each other.

3.1.3 Models with ETA and non-ETA descriptors

Fourteen descriptors emerged in the best equation from the PLS regression analysis, representing their obvious importance in predicting the adsorption capacity (model no 3 in Table 3). Parameters like Jurs SASA, $^3\kappa$, $\eta'_F$, $\Sigma\alpha/N_v$, $[\Sigma\alpha]_Y/\Sigma\alpha$ and $\Delta\varepsilon_D$ have positive coefficients, whereas terms like S_sF, Jurs WPSA-1, Wiener, $[\Sigma\alpha]_P/\Sigma\alpha$, AlogP98, $^2\chi^v$ and $\Delta\alpha_B$ have negative coefficients for the adsorption capacity. The model shows a positive contribution of the shape parameter $[\Sigma\alpha]_Y/\Sigma\alpha$, signifying the importance of the branching pattern present in a molecule. The descriptor Jurs WPSA-1 has a negative contribution to the adsorption capacity, indicating the importance of surface weighted positively charged partial surface area. The term AlogP98 (computed values corresponding to the log of the partition coefficient) has a negative coefficient, indicating that an increase in the lipophilicity of a molecule will decrease the adsorption capacity. The parameter $^2\chi^v$ signifies the importance of second order valence molecular connectivity. The term $\Delta\alpha_B$ has a negative coefficient for the adsorption capacity. It is an indicator of the presence of hydrogen-bond acceptor atoms. This parameter may also be considered an indicator of polar surface area. The statistical parameters of the model showed good internal validation ($Q^2_{int}$ = 0.813, $r^2_{m(LOO)}$ = 0.732 and $\Delta r^2_{m(LOO)}$ = 0.1592), external validation ($Q^2_{ext(F1)}$ = 0.8023, $Q^2_{ext(F2)}$ = 0.8019, $Q^2_{ext(F3)}$ = 0.857, $r^2_{m(test)}$ = 0.7332 and $\Delta r^2_{m(test)}$ = 0.0039) and overall validation ($r^2_{m(overall)}$ = 0.7311 and $\Delta r^2_{m(overall)}$ = 0.1247) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.815. The developed model also satisfies all the external validation criteria proposed by Golbraikh and Tropsha (Tables 3 and 4).

3.2 Development of models applying principal component analysis (PCA) combined with duplex method as the splitting technique

3.2.1 Models with ETA descriptors

The model from PLS shows the importance of twelve descriptors in predicting the adsorption capacity onto activated carbon (model no 4 in Table 3). Descriptors like $\Sigma\alpha$, $\Sigma\alpha/N_v$, $\Delta\varepsilon_D$, $\eta'$ and $\eta'_F$ have positive contributions towards the adsorption capacity, whereas descriptors like $\eta$, $\eta_F$, $[\Sigma\alpha]_P/\Sigma\alpha$, $[\Sigma\alpha]_X/\Sigma\alpha$, $\psi_1$ and $\varepsilon_2$ have negative contributions. The model shows a negative contribution of the shape parameter $[\Sigma\alpha]_X/\Sigma\alpha$, signifying the importance of the specific type of branching pattern present in a molecule. The resulting statistical parameters of the PLS model showed good internal validation ($Q^2_{int}$ = 0.7993, $r^2_{m(LOO)}$ = 0.7123 and $\Delta r^2_{m(LOO)}$ = 0.1672), external validation ($Q^2_{ext(F1)}$ = 0.827, $Q^2_{ext(F2)}$ = 0.8239, $Q^2_{ext(F3)}$ = 0.622, $r^2_{m(test)}$ = 0.7099 and $\Delta r^2_{m(test)}$ = 0.1587) and overall validation ($r^2_{m(overall)}$ = 0.7174 and $\Delta r^2_{m(overall)}$ = 0.1692) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.800. The developed model also satisfies all the external validation criteria proposed by Golbraikh and Tropsha (Tables 3 and 4).

3.2.2 Models with non-ETA descriptors

The model involving non-ETA descriptors contains twelve descriptors, which indicates their importance in predicting the adsorption capacity (model no 5 in Table 3). Parameters like Jurs SASA, Jurs WNSA-1, $^3\chi_p$, Jx and Density have positive contributions towards the adsorption capacity, whereas terms like S_sF, PMI-mag, Jurs FNSA-1, AlogP98, Wiener, SC-3-P and $^1\kappa_{\alpha m}$ have negative contributions. The descriptor Jurs WNSA-1 signifies the importance of charge weighted partial negative surface area for the adsorption capacity. The term PMI-mag indicates the influence of the principal moments of inertia about the principal axes of a molecule on adsorption. Jurs FNSA-1 shows a negative contribution of fractional charged partial surface area towards the adsorption capacity. Compounds having higher values of the third order subgraph count of path type (SC-3-P) will have a low adsorption capacity. The Kappa shape index ($^1\kappa_{\alpha m}$) of path length 1 has a negative contribution to the adsorption capacity, signifying the importance of molecular shape (including all the heavy atoms present in a molecule). The developed model has acceptable statistical limits for internal validation ($Q^2_{int}$ = 0.7196, $r^2_{m(LOO)}$ = 0.6039 and $\Delta r^2_{m(LOO)}$ = 0.2259), external validation ($Q^2_{ext(F1)}$ = 0.7299, $Q^2_{ext(F2)}$ = 0.725, $Q^2_{ext(F3)}$ = 0.4105, $r^2_{m(test)}$ = 0.559 and $\Delta r^2_{m(test)}$ = 0.2417) and overall validation ($r^2_{m(overall)}$ = 0.5886 and $\Delta r^2_{m(overall)}$ = 0.2352) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.718. From Table 4 it is observed that, for non-ETA descriptors, the squared correlation coefficient values between the observed and predicted values of the test, training (leave-one-out predicted values) and overall set compounds with intercept ($r^2$) and without intercept after changing the axes ($r_0'^2$) are not close enough to each other.
3.2.3 Models with ETA and non-ETA descriptors

Fourteen descriptors emerged in the best equation obtained by PLS analysis, indicating their importance in predicting the adsorption capacity (model no. 6 in Table 3). Parameters like Jurs SASA, Density, $^1\kappa$, , $\eta'_F$, $\Delta\varepsilon_D$ and $\sum\beta'$ have positive coefficients whereas terms like S_sF, Jurs WPSA-1, , $\sum\varepsilon/N$, $\Delta\beta$, $[\sum\alpha]_P/\sum\alpha$ and ns have negative coefficients for the adsorption capacity. The model shows the importance of the descriptor $\sum\beta'$, which is a measure of the sigma and non-sigma contributions of atoms to the valence electron mobile count. The electronic parameter $\sum\varepsilon/N$ indicates the importance of electronegativity. The model also shows the negative contributions of the relative unsaturation content ($\Delta\beta$ and ns) of molecules towards the adsorption capacity. The PLS model shows good internal validation ($Q^2_{int}$ = 0.817, $r^2_{m(LOO)}$ = 0.7366 and $\Delta r^2_{m(LOO)}$ = 0.156), external validation ($Q^2_{ext(F1)}$ = 0.8145, $Q^2_{ext(F2)}$ = 0.8112, $Q^2_{ext(F3)}$ = 0.5952, $r^2_{m(test)}$ = 0.6921 and $\Delta r^2_{m(test)}$ = 0.1697) and overall validation ($r^2_{m(overall)}$ = 0.7203 and $\Delta r^2_{m(overall)}$ = 0.165) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.816. The model also satisfies the set of criteria proposed by Golbraikh and Tropsha for evaluation of the predictive ability of the developed model (Tables 3 and 4).

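The $r^2_m$ metrics and the Golbraikh and Tropsha check mentioned above can be illustrated as follows. This is only a sketch under stated assumptions: it uses the widely quoted form $r^2_m = r^2(1 - \sqrt{|r^2 - r^2_0|})$ and the commonly cited thresholds (0.6, 0.1, 0.85 to 1.15), neither of which is spelled out in this section; all function names are hypothetical, and any data scaling applied in the original metric papers is not reproduced here.

```python
from math import sqrt
from statistics import fmean


def r2_with_intercept(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = fmean(obs), fmean(pred)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return sxy * sxy / (sxx * syy)


def slope_and_r2_through_origin(y, x):
    """Regress y on x through the origin; return (slope k, r0^2)."""
    k = sum(a * b for a, b in zip(y, x)) / sum(b * b for b in x)
    my = fmean(y)
    ss_res = sum((a - k * b) ** 2 for a, b in zip(y, x))
    ss_tot = sum((a - my) ** 2 for a in y)
    return k, 1.0 - ss_res / ss_tot


def rm2_metrics(obs, pred):
    """Return (rm2, rm2 with axes interchanged, absolute difference)."""
    r2 = r2_with_intercept(obs, pred)
    _, r0 = slope_and_r2_through_origin(obs, pred)
    _, r0_swapped = slope_and_r2_through_origin(pred, obs)
    rm2 = r2 * (1.0 - sqrt(abs(r2 - r0)))
    rm2_swapped = r2 * (1.0 - sqrt(abs(r2 - r0_swapped)))
    return rm2, rm2_swapped, abs(rm2 - rm2_swapped)


def golbraikh_tropsha_ok(obs, pred):
    """Check the assumed acceptance thresholds on r2, r0^2 and the slopes."""
    r2 = r2_with_intercept(obs, pred)
    k, r0 = slope_and_r2_through_origin(obs, pred)
    k_swapped, r0_swapped = slope_and_r2_through_origin(pred, obs)
    return (
        r2 > 0.6
        and ((r2 - r0) / r2 < 0.1 or (r2 - r0_swapped) / r2 < 0.1)
        and (0.85 <= k <= 1.15 or 0.85 <= k_swapped <= 1.15)
    )
```

A small difference between the two $r^2_m$ values indicates that predictions agree with observations regardless of which variable is placed on which axis.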
3.3 Comparison of models obtained from different splitting techniques

3.3.1 Models with ETA descriptors

Two different splitting techniques were used for the development of training sets. In both cases, the PLS models contain twelve variables (Table 3). Out of the twelve variables, eleven ($\sum\alpha$, $\sum\alpha/N_V$, $\Delta\varepsilon_D$, $\eta$, $\eta'$, $\eta_F$, $\eta'_F$, $\sum\varepsilon$, $[\sum\alpha]_P/\sum\alpha$, $\psi_1$ and $\varepsilon_2$) are present in both equations. This observation is interesting because it indicates that the combination of selected descriptors remains the same even when different training sets are obtained by different splitting techniques. These descriptors are therefore important in predicting the adsorption capacity. The two models (model nos. 1 and 4) are also comparable in terms of adjusted $R^2$ ($R^2_a$; 0.8139 and 0.8094 respectively), $Q^2_{int}$ (0.8059 and 0.7993 respectively), $r^2_{m(test)}$ (0.7108 and 0.7099 respectively) and $r^2_{m(LOO)}$ (0.7219 and 0.7123 respectively). However, one of the external validation parameters, $Q^2_{ext(F3)}$ (which considers both test and training set compounds), shows a lower value (0.622) for the PLS model obtained with PCA combined with the duplex method as the splitting technique (model no. 4) than for the PLS model (0.8492) developed using the k-means clustering technique (model no. 1).

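The idea of drawing a test set from clustered data can be sketched as below. The details of the authors' k-means based selection are not given in this section, so this is only an assumed illustration: cluster labels are taken as already computed, roughly 25% of each cluster is sent to the test set (consistent with 870 test compounds out of 3483), and the name `split_by_cluster` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), sampling each cluster separately.

    cluster_labels[i] is the cluster assigned to compound i, for example by
    k-means on standardized descriptors (that step is not shown here).
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    train, test = [], []
    for label in sorted(members):
        indices = members[label][:]
        rng.shuffle(indices)
        n_test = round(test_fraction * len(indices))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Sampling within every cluster keeps both sets spread over the same regions of descriptor space, which is the usual motivation for cluster-based splitting.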
3.3.2 Models with non-ETA descriptors

Both the developed PLS models (model nos. 2 and 5) contain twelve variables (Table 3). However, only five variables (S_sF, Jurs SASA, $^3\chi_p$, Jx and Wiener) are present in both equations, indicating their contributions towards the adsorption capability on activated carbon. The statistical qualities of the two models are not as close to each other as those of the models with ETA descriptors. Here the $Q^2_{ext(F3)}$ value (0.4105) obtained using PCA combined with the duplex method as the splitting technique (model no. 5) does not cross the threshold value (0.5).

3.3.3 Models with ETA and non-ETA descriptors

The PLS models developed after applying the two splitting techniques contain fourteen variables (Table 3). Seven variables (S_sF, Jurs SASA, Jurs WPSA-1, , $\eta'_F$, $[\sum\alpha]_P/\sum\alpha$ and $\Delta\varepsilon_D$) are present in both models. Thus, these parameters are important for model development. The two models (model nos. 3 and 6) are also comparable in terms of $Q^2_{int}$ (0.813 and 0.817 respectively) and $r^2_{m(LOO)}$ (0.732 and 0.736 respectively). The $Q^2_{ext(F3)}$ value (0.8570) for the PLS model obtained after k-means clustering (model no. 3) is better than that obtained using PCA combined with the duplex method as the splitting technique (0.5952; model no. 6).

3.4 Comparison of ETA and non-ETA models

The PLS models involving ETA and non-ETA parameters, developed after applying the two splitting techniques, indicate that the ETA models provide better external validation characteristics in terms of predictive $R^2$, i.e., $Q^2_{ext(F1)}$ (0.7914, 0.827; models 1, 4), $Q^2_{ext(F2)}$ (0.7909, 0.8239; models 1, 4), $Q^2_{ext(F3)}$ (0.8492, 0.4622; models 1, 4) and $r^2_{m(test)}$ (0.7108, 0.7099; models 1, 4), which are greater than the corresponding values of the non-ETA models (Table 3). The values for the non-ETA models are (0.7039, 0.7299; models 2, 5), (0.7033, 0.725; models 2, 5), (0.786, 0.4105; models 2, 5) and (0.6064, 0.559; models 2, 5) respectively. The models involving the combined set of descriptors showed improved external validation parameters in comparison to the models using non-ETA descriptors. The values are (0.8023, 0.8145; models 3, 6), (0.8019, 0.8112; models 3, 6), (0.8570, 0.5952; models 3, 6) and (0.7332, 0.6921; models 3, 6) respectively. Thus, these observations signify the importance of ETA descriptors for improving the statistical quality of non-ETA models: when the two are used in combination, they provide a more robust prediction of the adsorption capacity of activated carbon for the large set of chemicals.

3.5 Discussion on the best ETA models

Considering all the statistical parameters used for internal and external validation, the best ETA model is obtained after applying k-means clustering as the splitting technique (model 1). The concerned equation is shown below:

$$
\begin{aligned}
\log_{10}(\text{adsorption}) = {} & -3.50574 + 0.16110\sum\alpha + 2.01862\,\Delta\varepsilon_D - 0.52995\,\Delta\varepsilon_A + 6.86622\,\sum\alpha/N_V \\
& - 0.23966\,\eta + 3.49750\,\eta' + 4.04585\,\eta'_F - 0.21032\,\eta_F + 0.13426\sum\varepsilon - 0.90770\,[\sum\alpha]_P/\sum\alpha \\
& - 1.74413\,\psi_1 - 1.10706\,\varepsilon_2
\end{aligned}
\qquad (1)
$$

$n_{training}$ = 2613, $R^2$ = 0.8147, $R^2_a$ = 0.8139, $F$ = 1144.5 (df 10, 2602), $Q^2_{int}$ = 0.8059, $r^2_{m(LOO)}$ = 0.7219, $\Delta r^2_{m(LOO)}$ = 0.1633, $n_{test}$ = 870, $Q^2_{ext(F1)}$ = 0.7914, $Q^2_{ext(F2)}$ = 0.7909, $Q^2_{ext(F3)}$ = 0.8492, $r^2_{m(test)}$ = 0.7108, $\Delta r^2_{m(test)}$ = 0.0613, $r^2_{m(overall)}$ = 0.7194, $\Delta r^2_{m(overall)}$ = 0.1427

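Applying Eq. (1) to a compound amounts to a weighted sum of its twelve ETA descriptor values. The sketch below is illustrative only: the coefficient magnitudes are taken from Eq. (1), the signs follow the lists of positively and negatively contributing descriptors given in the text, the dictionary keys are hypothetical names chosen here, and the descriptor values are assumed to be on the same scale as those used to fit the model.

```python
EQ1_INTERCEPT = -3.50574

# Keys are hypothetical labels for the descriptors of Eq. (1).
EQ1_COEFFICIENTS = {
    "sum_alpha": 0.16110,           # sum of alpha (molecular bulk)
    "delta_epsilon_D": 2.01862,     # hydrogen bond donor contribution
    "delta_epsilon_A": -0.52995,
    "sum_alpha_over_Nv": 6.86622,   # bulk relative to molecular size
    "eta": -0.23966,
    "eta_prime": 3.49750,
    "eta_prime_F": 4.04585,
    "eta_F": -0.21032,              # overall functionality contribution
    "sum_epsilon": 0.13426,
    "shape_index_P": -0.90770,      # [sum alpha]_P / sum alpha
    "psi_1": -1.74413,
    "epsilon_2": -1.10706,
}


def predict_log10_adsorption(descriptors):
    """Evaluate Eq. (1) for one compound given a dict of descriptor values."""
    missing = set(EQ1_COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptor values: {sorted(missing)}")
    return EQ1_INTERCEPT + sum(
        coefficient * descriptors[name]
        for name, coefficient in EQ1_COEFFICIENTS.items()
    )
```

The prediction is on the log10 scale; raising 10 to that power gives the adsorption capacity in the units of the original data set.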
Eq. (1) could explain 81.39% and predict 80.59% of the variance of the adsorption capability on activated carbon in the gas phase. When this equation was applied for prediction of the test set compounds, the external validation metric values [$Q^2_{ext(F1)}$, $Q^2_{ext(F2)}$, $Q^2_{ext(F3)}$] were found to be 0.7914, 0.7909 and 0.8492 respectively. Descriptors like $\sum\alpha$, $\Delta\varepsilon_D$, $\sum\alpha/N_V$, $\eta'$, $\eta'_F$ and $\sum\varepsilon$ have positive contributions towards the adsorption capacity, whereas descriptors like $\Delta\varepsilon_A$, $\eta$, $\eta_F$, $[\sum\alpha]_P/\sum\alpha$, $\psi_1$ and $\varepsilon_2$ have negative contributions. We further elaborate the contributions of the descriptors and the structure-property relations with some examples.

Formaldehyde (compound no. 1667), having the lowest value of $\sum\alpha$, shows the lowest adsorption capacity on activated carbon. Compounds like methyl alcohol (compound no. 2411) and methyl amine (compound no. 2412) also show lower adsorption capacity due to their lower molecular size. It is also observed that octamethylcyclotetrasiloxane (compound no. 2678), possessing the highest value of $\sum\alpha$, shows comparatively higher adsorption capacity. The parameter $\Delta\varepsilon_D$ indicates the importance of hydrogen bond donor atoms. Compounds like formamide (compound no. 1668), having two hydrogen bond donor atoms, show higher adsorption capacity. It is observed that all the compounds having hydrogen bond donor groups show adsorption capacity in the higher range, except compounds like dimethyl amine (compound no. 1049), methyl amine (compound no. 2412), methyl alcohol (compound no. 2411), formic acid (compound no. 1669), ethyl amine (compound no. 1451) and methyl mercaptan (compound no. 2518). All these compounds have low molecular size and hence lower adsorption capacity. Compounds like diiodomethane (compound no. 670), 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890), possessing high values of $\sum\alpha/N_V$, show high adsorption capacity. Again, formic acid (compound no. 1669) and formaldehyde (compound no. 1667), having lower values of $\sum\alpha/N_V$, show lower adsorption capacity due to their lower molecular size. Compounds like 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890), possessing higher values of $\eta'$, show high adsorption capacity on activated carbon, whereas formaldehyde (compound no. 1667), having the lowest value of $\eta'$, shows the lowest adsorption capacity. The parameter $\sum\varepsilon$ indicates the importance of the electron richness of a molecule. It is observed that the absence of electronegative atoms in molecules like ethylene (compound no. 1483) and acetylene (compound no. 11) is responsible for their low adsorption capacity. On the other hand, compounds like diiodomethane (compound no. 670), 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890) contain halogen atoms, which contribute to the electron richness of the molecules, resulting in adsorption capacity in the higher range.

It is observed that compound nos. 3109-3114, 680-682, 665, 668 and 670 possess values of $\Delta\varepsilon_A$ in the lower range, but they show adsorption capacity in the higher range. All these compounds contain more than one halogen atom, which may be responsible for their high adsorption capacity. Compounds like chlorotrifluoromethane (compound no. 356), possessing relatively higher values of $\eta$, show lower adsorption capacity. Compounds like 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890), having values of $\eta_F$ in the lower range, show the highest adsorption capacity. The high adsorption capacity of these compounds may be due to their overall functionality contribution. It is observed that compounds with values of $\eta_F$ from zero to the negative range show adsorption capacity in the higher range. The shape index $[\sum\alpha]_P/\sum\alpha$ indicates the branching pattern. It is observed that molecules like ethane (compound no. 1333), ethylene (compound no. 1483) and formaldehyde (compound no. 1667), which possess the highest value of $[\sum\alpha]_P/\sum\alpha$, show the lowest adsorption capacity. Hexafluoropropylene (compound no. 1722) possesses the lowest value of $\psi_1$ and thus shows adsorption capacity in the higher range, whereas ethane (compound no. 1333), ethylene (compound no. 1483) and acetylene (compound no. 11), having values of $\psi_1$ in a comparatively higher range, show adsorption capacity in the lower range. The parameter $\varepsilon_2$ is a measure of electronegative atom count. Chlorotrifluoromethane (compound no. 356), having the highest value of $\varepsilon_2$, shows adsorption capacity in the lower range. It is also observed that ethane (compound no. 1333), ethylene (compound no. 1483) and acetylene (compound no. 11), having values of $\varepsilon_2$ in a comparatively higher range, show adsorption capacity in the lowest range.

Figures 3 and 4 show scatter plots of observed vs calculated/predicted values (in log scale) of the training and test set compounds respectively for the best ETA model (model no. 1). The same scatter plots for the absolute values of adsorption capacity are shown in the Supplementary Materials section.

Figure 3 near here

Figure 4 near here

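The fit statistics reported with Eq. (1) can be cross-checked against each other with the standard textbook formulas for adjusted $R^2$ and the regression $F$ statistic. The sketch below assumes that the model degrees of freedom equal the first value of the reported df pair (10, 2602); the function names are hypothetical.

```python
def adjusted_r2(r2, n_samples, n_model_df):
    """Adjusted R2 for n_samples observations and n_model_df model terms."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_model_df - 1)


def f_statistic(r2, n_samples, n_model_df):
    """Regression F ratio on (n_model_df, n_samples - n_model_df - 1) df."""
    residual_df = n_samples - n_model_df - 1
    return (r2 / n_model_df) / ((1.0 - r2) / residual_df)


# Values reported for model 1: R2 = 0.8147, n_training = 2613, df = (10, 2602)
r2_adj = adjusted_r2(0.8147, 2613, 10)
f_value = f_statistic(0.8147, 2613, 10)
```

With these inputs both quantities land close to the reported $R^2_a$ = 0.8139 and $F$ = 1144.5; the small residual differences are what one would expect from starting with $R^2$ rounded to four decimals.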
3.6 Validation of models according to OECD guidelines

In this study, we have focused on all five OECD principles (http://www.oecd.org/env/ehs/risk-assessment/37849783.pdf) to develop reliable models, as follows: (i) a defined end point (experimentally determined adsorption capacities of 3483 organic compounds on activated carbon in the gas phase; the logarithm values were used as the dependent variable); (ii) an unambiguous algorithm (in the present work, descriptors were calculated using the Dragon software version 6 and the Cerius2 version 4.10 software, and models were developed using the PLS algorithm with the MINITAB software); (iii) a defined domain of applicability (the applicability domain of the molecules was assessed based on the DModX (distance to the model in X-space) values); (iv) appropriate measures of goodness-of-fit, robustness and predictivity (the developed models were extensively validated using different traditional and novel statistical parameters); and (v) a mechanistic interpretation (the models developed in the present work have been given a mechanistic basis of interpretation as far as possible through explanation of the various descriptors appearing in the different regression models). According to OECD principle 5: “It is recognised that it is not always possible, from a scientific viewpoint, to provide a mechanistic interpretation of a given (Q)SAR (Principle 5), or that there even be multiple mechanistic interpretations of a given model. The absence of a mechanistic interpretation for a model does not mean that a model is not potentially useful in the regulatory context. The intent of Principle 5 is not to reject models that have no apparent mechanistic basis, but to ensure that some consideration is given to the possibility of a mechanistic association between the descriptors used in a model and the endpoint being predicted, and to ensure that this association is documented”. Thus, the models developed in the present work comply well with all five OECD principles.

3.7 Comparison with previously reported models on this data set

Lei et al. (2010) developed QSPR models on the data set of the present work using 421 different types of molecular descriptors. The global MLR model developed by them showed an $r^2$ value of 0.789 with a corresponding leave-one-out cross-validation $R^2$ ($Q^2$) of 0.785. In the present work, we have performed the QSPR study with 40 ETA descriptors and 103 non-ETA descriptors. PLS analysis was used as the chemometric tool. Internal, external and overall validation parameters were reported. The models obtained from the two splitting methods using ETA descriptors and PLS analysis showed improved statistical quality. A comparison of the results is listed in Table 5.

Table 5 near here

4 Overview and Conclusion

In this work, two different splitting techniques, k-means clustering and principal component analysis (PCA) combined with the duplex method, were used to split the data set into training and test sets for subsequent model development. An attempt was made to find the common descriptors present in the various models, indicating their importance for the adsorption capacity on activated carbon for a large data set of chemicals. The models were developed to study the predictive ability of ETA parameters in comparison to non-ETA descriptors (topological, structural, physicochemical, electronic and spatial descriptors). In all the cases, the data set was first subjected to stepwise regression to find the useful variables, and the selected variables were further subjected to PLS regression. The ETA models from the two splitting techniques share eleven descriptors out of a total of twelve. Thus, this finding shows the importance of ETA parameters for adsorption capacity. In contrast, the models obtained from the two splitting techniques with non-ETA parameters share only five descriptors out of twelve. In the case of models using the combined set of descriptors (ETA and non-ETA), seven descriptors out of fourteen were common to both models. The models obtained using ETA descriptors indicate the importance of molecular size, presence of hydrogen bond donor atoms, electron richness and overall topological nature of molecules for adsorption capacity on activated carbon. From the models with non-ETA descriptors, it was found that molecular solvent accessible surface area, presence of electronegative atoms like fluorine, and the connectivity pattern among different groups of a molecule are important for adsorption. The models using the combined set of descriptors show the importance of surface weighted charged partial surface area, topological nature and functionality of a molecule for the adsorption capacity. The present modeling analysis suggests that ETA descriptors are sufficiently rich in chemical information for modeling adsorption capacity on activated carbon.

Compounds having DModX values higher than the critical value at the 99% confidence level were removed from the test sets, and the external prediction quality of the models was checked again. From Table 6, it is observed that fewer compounds had to be removed from the test sets for all models (ETA, non-ETA and combined) when k-means clustering was used as the splitting technique than when principal component analysis (PCA) combined with the duplex method was used. For the best ETA model (model no. 1), the outlier compounds for the test and training sets are shown in Figures 5 and 6 respectively. Two compounds (compound numbers 139 and 441) present in the test set possess very high DModX values. These two compounds are halogen substituted methane derivatives. Besides these, a few other outlier compounds contain silicon (compound number 2581) or sulphur (compound numbers 1490 and 3089). From the training set, it is observed that the compounds having very high DModX values (compound numbers 143, 156, 446, 670, 2419 and 2508) are halogen substituted methane derivatives. The structures of the outlier compounds are shown in Figure 7.
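The applicability domain check described above can be sketched as a simple filter followed by a re-evaluation of an external metric. This is a hedged illustration: the DModX values are assumed to have been computed elsewhere (for example by the PLS software), the predictive $R^2$ formula is the commonly used one based on the training set mean, and all function names are hypothetical.

```python
def indices_within_domain(dmodx_values, critical_value):
    """Indices of compounds whose DModX does not exceed the critical value."""
    return [i for i, d in enumerate(dmodx_values) if d <= critical_value]


def predictive_r2(y_obs, y_pred, y_train_mean):
    """1 - PRESS / sum of squared deviations from the training set mean."""
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss = sum((o - y_train_mean) ** 2 for o in y_obs)
    return 1.0 - press / ss


def predictive_r2_inside_domain(y_obs, y_pred, dmodx_values,
                                critical_value, y_train_mean):
    """Return (predictive R2 after filtering, number of compounds removed)."""
    keep = indices_within_domain(dmodx_values, critical_value)
    kept_obs = [y_obs[i] for i in keep]
    kept_pred = [y_pred[i] for i in keep]
    removed = len(y_obs) - len(keep)
    return predictive_r2(kept_obs, kept_pred, y_train_mean), removed
```

For model 1 the critical DModX value quoted in the captions of Figures 5 and 6 is 2.219, so a call would pass `critical_value=2.219` together with the test set responses, predictions and DModX values.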

Figure 5 near here

Figure 6 near here

Figure 7 near here

Table 6 near here

Declaration of interest

The authors declare no conflict of interest.

Acknowledgement

The authors thank the Council of Scientific and Industrial Research (CSIR), New Delhi, for providing financial assistance to KR in the form of a major research project.

References
Bansal, R.C., Donnet, J., Stoeckli, H.F., 1988. Active Carbon. Marcel Dekker, New York.
Balci, B., Keskinkan, O., Avci, M., 2011. Use of BDST and an ANN model for prediction of dye adsorption efficiency of Eucalyptus camaldulensis barks in fixed-bed system. Expert Systems with Applications 38, 949-946.
Consonni, V., Ballabio, D., Todeschini, R., 2009. Comments on the definition of the Q2 parameter for QSAR validation. Journal of Chemical Information and Modeling 49, 1669-1678.
Cerius2 version 4.8 is a product of Accelrys, Inc., San Diego, USA, http://www.accelrys.com/cerius2.
DRAGON version 6.0 software is offered by TALETE SRL, Italy; the software is available at http://www.talete.mi.it/products/dragon_description.htm
Easton, J.R., 1995. The dye maker's view, in: Cooper, P. (Ed.), Colour in Dyehouse Effluent. Society of Dyers and Colourists, Oxford, England, pp. 9-21.
Fan, Y., Shi, L.M., Kohn, K.W., Pommier, Y., Weinstein, J.N., 2001. Quantitative structure-antitumor activity relationships of camptothecin analogs: Cluster analysis and genetic algorithm-based studies. Journal of Medicinal Chemistry 44, 3254-3263.
Franz, M., Arafat, H.A., Pinto, N.G., 2000. Effect of chemical surface heterogeneity on the adsorption mechanism of dissolved aromatics on activated carbon. Carbon 38, 1807-1819.
Figueiredo, J.L., Pereira, M.F.R., Freitas, M.M.A., Órfão, J.J.M., 1999. Modification of the surface chemistry of activated carbons. Carbon 37, 1379-1389.
Golbraikh, A., Tropsha, A., 2002. Beware of q2. Journal of Molecular Graphics and Modelling 20, 269-276.
Gasteiger, J., Marsili, M., 1978. A new model for calculating atomic charges in molecules. Tetrahedron Letters 19, 3181-3184.
Giraudet, S., Pre, P., Tezel, H., Le Cloirec, P., 2006. Estimation of adsorption energies using physical characteristics of activated carbons and VOCs' molecular properties. Carbon 44, 1873-1883.
Gregg, S.J., Sing, K.S.W., 1982. Adsorption, Surface Area and Porosity. Academic Press, London.
Hawkins, D.M., 2004. The problem of overfitting. Journal of Chemical Information and Computer Sciences 44, 1-12.
Haghseresht, F., Nouri, S., Finnerty, J.J., Lu, G.Q., 2002. Effects of surface chemistry on aromatic compound adsorption from dilute aqueous solutions by activated carbon. The Journal of Physical Chemistry B 106, 10935-10943.
Karanfil, T., Kilduff, J.E., 1999. Role of granular activated carbon surface chemistry on the adsorption of organic compounds. 1. Priority pollutants. Environmental Science & Technology 33, 3217-3224.
Lei, B., Ma, Y., Li, J., Liu, H., Yao, X., Gramatica, P., 2010. Prediction of the adsorption capability onto activated carbon of a large data set of chemicals by local lazy regression method. Atmospheric Environment 44, 2954-2960.
Le Leuch, L.M., Bandosz, T.J., 2007. The role of water and surface acidity on the reactive adsorption of ammonia on modified activated carbons. Carbon 45, 568-578.
MacQueen, J.B., 1967. Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press 1, 281-297.
MINITAB version 14 is statistical software of Minitab Inc, USA, http://www.minitab.com.
Mohan, D., Pittman Jr., C.U., 2006. Activated carbons and low cost adsorbents for remediation of tri- and hexavalent chromium from water. Journal of Hazardous Materials B137, 762-811.
Moreno-Castilla, C., Rivera-Utrilla, J., López-Ramón, M.V., Carrasco-Marín, F., 1995. Adsorption of some substituted phenols on activated carbons from a bituminous coal. Carbon 33, 845-851.
Mohan, D., Singh, K.P., 2005. Granular activated carbon, in: Lehr, J., Keeley, J., Lehr, J. (Eds.), Water Encyclopedia: Domestic, Municipal, and Industrial Water Supply and Waste Disposal. John Wiley & Sons, New York, pp. 106-115.
Nouri, S., Haghseresht, F., 2002. Adsorption of dissociating aromatic compounds by activated carbon: effects of ionization of the adsorption capacity. Adsorption Science & Technology 20, 417-432.
Nriagu, J.O., Pacyna, J.M., 1988. Quantitative assessment of worldwide contamination of air, water and soils by trace metals. Nature 333, 134-139.
Ojha, P.K., Mitra, I., Das, R.N., Roy, K., 2011. Further exploring rm2 metrics for validation of QSPR models dataset. Chemometrics and Intelligent Laboratory Systems 107, 194-205.
Pollard, S.J.T., Fowler, G.D., Sollars, C.J., Perry, R., 1992. Low-cost adsorbents for waste and wastewater treatment: a review. Science of the Total Environment 116, 31-52.
Pal, D.K., Sengupta, C., De, A.U., 1988. A new topochemical descriptors (TAU) in molecular connectivity concept: Part I-aliphatic compounds. Indian Journal of Chemistry 27B, 734-739.
Pal, D.K., Sengupta, C., De, A.U., 1989. Introduction of a novel topochemical index and exploitation of group connectivity concept to achieve predictability in QSAR and RDD. Indian Journal of Chemistry 28B, 261-267.
Radovic, L.R., Moreno-Castilla, C., Rivera-Utrilla, J., 2000. Carbon materials as adsorbents in aqueous solutions, in: Radovic, L.R. (Ed.), Chemistry and Physics of Carbon, vol. 27. Marcel Dekker, Inc., New York.
Roy, K., Das, R.N., 2011. On some novel extended topochemical atom (ETA) parameters for effective encoding of chemical information and modeling of fundamental physicochemical properties. SAR and QSAR in Environmental Research 22, 451-472.
Roy, K., Ghosh, G., 2003. Introduction of Extended Topochemical Atom (ETA) Indices in the Valence Electron Mobile (VEM) Environment as Tools for QSAR/QSPR Studies. Internet Electronic Journal of Molecular Design 2, 599-620.
Roy, P.P., Roy, K., 2008. On some aspects of variable selection for partial least squares regression models. QSAR and Combinatorial Science 27, 302-313.
Radovic, L.R., Silva, I.F., Ume, J.I., Menéndez, J.A., Leon y Leon, C.A., Scaroni, A.W., 1997. An experimental and theoretical study of the adsorption of aromatics possessing electron-withdrawing and electron-donating functional groups by chemically modified activated carbons. Carbon 35, 1339-1348.
STATISTICA version 7 is statistical software of Stat Soft Inc, www.statsoft.com
Snedecor, G.W., Cochran, W.G., 1967. Statistical Methods. Oxford & IBH Publishing Co. Pvt. Ltd, New Delhi.
Schuurmann, G., Ebert, R.U., Chen, J., Wang, B., Kuhne, R., 2008. External validation and prediction employing the predictive squared correlation coefficient-test set activity mean vs training set activity mean. Journal of Chemical Information and Modeling 48, 2140-2145.
Service, R.F., 2012. Material Scientists Look to a Data-Intensive Future. Science 335, 1434-1435.
UMETRICS SIMCA-P 10.0, [email protected]: www.umetrics.com, Umea, Sweden, 2002
Villacañas, F., Pereira, M.F.R., Órfão, J.J.M., Figueiredo, J.L., 2006. Adsorption of simple aromatic compounds on activated carbons. Journal of Colloid and Interface Science 293, 128-136.
Wold, S., 1995. PLS for multivariate linear modeling, in: van de Waterbeemd, H. (Ed.), Chemometric Methods in Molecular Design (Methods and Principles in Medicinal Chemistry). Weinheim-VCH, New York, pp. 195-218.
Wold, S., Eriksson, L., 1995. Validation tools, in: van de Waterbeemd, H. (Ed.), Chemometric Methods in Molecular Design (Methods and Principles in Medicinal Chemistry). Weinheim-VCH, New York, pp. 312-317.
Wold, S., Sjostrom, M., Eriksson, L., 2001. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58, 109-130.
Yap, C.W., 2011. PaDEL-Descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry 32, 1466-1474.
Yaws, C.L., 2003-2004. Yaws' handbook of thermodynamic and physical properties of chemical compounds: Physical, thermodynamic and transport properties of 5000 organic chemicals compounds. Lamar University, Beaumont, Texas, Norwich, New York.
Zhang, K., Cheung, W.H., Valix, M., 2005. Roles of physical and chemical properties of activated carbon in the adsorption of lead ions. Chemosphere 60, 1129-1140.

Figure captions

Figure 1: PCA score plot of the first three components for the standardized descriptor matrix of the combined set (ETA and non-ETA) in case of PCA combined with duplex method as the splitting technique.

Figure 2: PCA score plot of the first three components for the standardized descriptor matrix of the combined set (ETA and non-ETA) in case of k-means clustering as the splitting technique.

Figure 3: Scatter plot of observed vs calculated/predicted values of the training set compounds.

Figure 4: Scatter plot of observed vs calculated/predicted values of the test set compounds.

Figure 5: DModX values of the 870 test set compounds at 99% level for model 1. The thick horizontal line signifies the critical DModX value (2.219) at the 99% confidence level.

Figure 6: DModX values of the 2613 training set compounds at 99% level for model 1. The thick horizontal line signifies the critical DModX value (2.219) at the 99% confidence level.

Figure 7: Structures of some outlier compounds of the test and training sets for model 1.

Table captions

Table 1: List of ETA descriptors used in the development of QSPR models

Table 2: Categorical list of Non-ETA descriptors used in the development of QSPR models

Table 3: Comparison of models obtained after applying different splitting techniques and chemometric tools

Table 4: External validation of the developed model using parameters proposed by Golbraikh and Tropsha

Table 5: Comparison of statistical parameters of the ETA models developed in the present work with a previously reported model

Table 6: Comparison of predictive quality of models after removing compounds having DModX values higher than the critical values from test sets

Highlights
• QSPR models were developed for adsorption of organic chemicals on activated carbon
• The ETA indices were used for development of models, which were compared to non-ETA ones
• The ETA indices can be easily calculated from the 2D representation of chemical structure
• The use of ETA descriptors along with non-ETA ones improves the statistical quality of models
• The models can be used for prediction of adsorption of organic compounds on activated carbon
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Table 1: List of ETA descriptors used in the development of QSPR models

Sl No. | Parameter | Significance
1 | $\sum\alpha$ | Sum of core count of non-hydrogen vertex (α). It is a measure of molecular bulk
2 | $\sum\alpha/N_V$ | A measure of molecular bulk relative to molecular size [$N_V$ is the total number of atoms excluding hydrogen]
3 | $\sum\varepsilon$ | A measure of electron richness in a molecule
4 | $\sum\beta_s$ | A measure of electronegative atom count (sigma contribution) of the molecule
5 | $\sum\beta'_s = \sum\beta_s/N_V$ | A measure of electronegative atom count of the molecule relative to molecular size
6 | $\sum\beta_{ns}$ | A measure of electron richness of the molecule (non-sigma contribution)
7 | $\sum\beta'_{ns} = \sum\beta_{ns}/N_V$ | A measure of electron richness of the molecule relative to molecular size
8 | $\sum\beta$ | Sum of β (valence electron mobile) values of all non-hydrogen vertices in a molecule
9 | $\sum\beta' = \sum\beta/N_V$ | A measure of electronic features of the molecule relative to molecular size
10 | $\eta$ | Contribution of the overall topological nature
11 | $\eta' = \eta/N_V$ | Contribution of the overall topological nature relative to molecular size
12 | $\eta_{local}$ | Local composite index. A measure of covalently bonded interaction of the local atoms. Imparts information on local molecular topology.
13 | $\eta'_{local} = \eta_{local}/N_V$ | Local composite index relative to molecular size. Imparts information on local molecular topology.
14 | $\eta_F$ | Overall functionality contribution
15 | $\eta'_F = \eta_F/N_V$ | Overall functionality contribution relative to molecular size
16 | $\eta_F^{local}$ | Contribution of local functional groups. Gives a measure of local branching and topology.
17 | $\eta_F'^{\,local} = \eta_F^{local}/N_V$ | Contribution of local functional groups relative to molecular size. Gives a measure of local branching and topology.
18 | $B$ | Branching index. Explains branchedness of a molecule.
19 | $B' = B/N_V$ | Branching index relative to molecular size. A measure of overall branchedness of a molecule.
20 | $[\sum\alpha]_P/\sum\alpha$ | $[\sum\alpha]_P$ stands for summation of α values of the vertices that are joined to one other non-hydrogen vertex in the molecular graph
21 | $[\sum\alpha]_Y/\sum\alpha$ | $[\sum\alpha]_Y$ stands for summation of α values of the vertices that are joined to three other non-hydrogen vertices in the molecular graph
22 | $[\sum\alpha]_X/\sum\alpha$ | $[\sum\alpha]_X$ stands for summation of α values of the vertices that are joined to four other non-hydrogen vertices in the molecular graph
23 | $\Delta\alpha_A$ | A measure of count of non-hydrogen heteroatoms [$N_V$ is the total number of atoms excluding hydrogen, R is the reference alkane]
24 | $\Delta\alpha_B$ | A measure of count of hydrogen bond acceptor atoms and/or polar surface area
25 | $\varepsilon_1$ | A measure of electronegative atom count [N is the total number of atoms including hydrogen]
26 | $\varepsilon_2$ | A measure of electronegative atom count [EH indicates excluding hydrogen]
27 | $\varepsilon_3$ | A measure of electronegative atom count [R is the reference alkane]
28 | $\varepsilon_4$ | A measure of electronegative atom count [SS indicates saturated carbon skeleton]
29 | $\varepsilon_5$ | A measure of electronegative atom count [XH indicates hydrogen attached to a heteroatom]
30 | $\Delta\varepsilon_A = \varepsilon_1 - \varepsilon_3$ | A measure of contribution of unsaturation and electronegative atom count
31 | $\Delta\varepsilon_B = \varepsilon_1 - \varepsilon_4$ | A measure of contribution of unsaturation
32 | $\Delta\varepsilon_C = \varepsilon_3 - \varepsilon_4$ | A measure of contribution of electronegativity
33 | $\Delta\varepsilon_D = \varepsilon_2 - \varepsilon_5$ | A measure of contribution of hydrogen-bond donor atoms
34 | $\psi_1$ | A measure of hydrogen-bonding propensity of the molecules and/or polar surface area
35 | $\Delta\psi_A = 0.714 - \psi_1$ | A measure of hydrogen-bonding propensity of the molecules
36 | $\Delta\psi_B = \psi_1 - 0.714$ | A measure of hydrogen-bonding propensity of the molecules
37 | $\Delta\beta = \sum\beta_{ns} - \sum\beta_s$ | A measure of relative unsaturation content
38 | $\Delta\beta' = \Delta\beta/N_V$ | A relative measure of relative unsaturation content
39 | $\sum\beta_{ns(\delta)}$ | A measure of lone electrons entering into resonance
40 | $\sum\beta'_{ns(\delta)} = \sum\beta_{ns(\delta)}/N_V$ | A measure of lone electrons entering into resonance relative to molecular size

Table 2: Categorical list of Non-ETA descriptors used in the development of QSPR models
Category of descriptors | Name of the descriptors | Definition | Comment, if any
Topological | Balaban's J index (Jx) | Average distance-based connectivity index |
Topological | Molecular connectivity index (⁰χ, ¹χ, ²χ, ³χP, ³χC, ³χCH, ⁰χv, ¹χv, ²χv, ³χvP, ³χvC, ³χvCH) | These indices are based on graph-theoretical invariant introduced by Randic. |
Topological | kappa shape index (¹κ, ²κ, ³κ, ¹κam, ²κam, ³κam) | These indices capture different aspects of molecular shape. |
Topological | Subgraph count index (SC-0, SC-1, SC-2, SC-3_P, SC-3_C, SC-3_CH) | This is the number of subgraphs of a given type and order. |
Topological | Flexibility index (Φ) | This is a descriptor based on structural properties that restrict a molecule from being 'infinitely flexible' |
Topological | Wiener | Total number of bonds between all pairs of atoms in the hydrogen-suppressed graph. |
Topological | Zagreb | Sum of the squares of vertex valencies |
Topological | E-State parameters (S_sCH3, S_dCH2, S_ssCH2, S_tCH, S_dsCH, S_aaCH, S_sssCH, S_ddC, S_tsC, S_dssC, S_aasC, S_aaaC, S_ssssC, S_sNH2, S_ssNH, S_tN, S_dsN, S_aaN, S_sssN, S_sOH, S_dO, S_ssO, S_aaO, S_sSH, S_ssS, S_aaS, S_sF, S_sCl, S_sBr, S_sI) | Electrotopological state parameters of atoms having different electronic and topological environment. |
Structural | Chiral centers | Number of chiral centers (R or S) in a molecule. |
Structural | MW | Molecular weight |
Structural | Rotlbonds | Number of rotatable bonds. |
Structural | H-bond acceptor | Number of hydrogen-bond acceptors. |
Structural | H-bond donor | Number of hydrogen-bond donors. |
Physicochemical | Molref | Molar refractivity | Ghose, A.K., Crippen, G.M., 1986. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition coefficients as a Measure of hydrophobicity. Journal of Computational Chemistry 7, 565-577
Physicochemical | AlogP | Log of the partition coefficient | Ghose, A.K., Crippen, G.M., 1986. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition coefficients as a Measure of hydrophobicity. Journal of Computational Chemistry 7, 565-577
Physicochemical | AlogP98 | Log of partition coefficient | Ghose, A., Viswanadhan, V.N., Wendoloski, J.J., 1998. Prediction of hydrophobic (lipophilic) properties of small organic molecules, using fragmental methods. An analysis of ALOGP and CLOGP methods. Journal of Physical Chemistry 102, 3762–3772.
Physicochemical | logP | Octanol water partition coefficient |
Electronic | Dipole moment | It is a vector quantity which encodes displacement with respect to the centre of gravity of positive and negative charges in a molecule |
Spatial | Jurs descriptors (Jurs SASA, Jurs PPSA 1, Jurs PNSA 1, Jurs DPSA 1, Jurs PPSA 2, Jurs PNSA 2, Jurs DPSA 2, Jurs PPSA 3, Jurs PNSA 3, Jurs DPSA 3, Jurs FPSA 1, Jurs FNSA 1, Jurs FPSA 2, Jurs FNSA 2, Jurs FPSA 3, Jurs FNSA 3, Jurs WPSA 1, Jurs WNSA 1, Jurs WPSA 2, Jurs WNSA 2, Jurs WPSA 3, Jurs WNSA 3, Jurs RPCG, Jurs RNCG, Jurs RPCS, Jurs RNCS, Jurs TPSA, Jurs TASA, Jurs RPSA, Jurs RASA) | This set of descriptors combines shape and electronic information to characterize the molecules. | These are calculated by mapping atomic partial charges on solvent accessible surface areas of individual atoms
Spatial | Area | Van der Waals area of a molecule |
Spatial | Vm | Molecular volume inside the contact surface. |
Spatial | PMI-mag | It calculates the principal moments of inertia about the principal axes of a molecule. |
Spatial | Density | The ratio of molecular weight to molecular volume. | It reflects the types of atoms and how tightly they are packed in a molecule.
Spatial | RadOfGyration | √[Σ(xi² + yi² + zi²)/N] | N is the number of atoms and x, y, and z are the atomic coordinates relative to the center of mass.
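Three of the Table 2 definitions are simple enough to illustrate directly. The sketch below is only an illustration under stated assumptions: the function names, the adjacency-list input format and the example graphs are hypothetical, "vertex valency" is taken to mean the vertex degree in the hydrogen-suppressed graph, and the radius of gyration uses the center of mass computed from user-supplied atomic masses. It is not the descriptor software used in the study.

```python
import math
from collections import deque


def wiener_index(adj):
    """Sum of shortest-path lengths (in bonds) over all atom pairs."""
    total = 0
    for source in adj:
        # Breadth-first search gives bond-count distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for nbr in adj[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # every pair was reached from both of its ends


def zagreb_index(adj):
    """Sum of squared vertex degrees (assumed meaning of 'valencies')."""
    return sum(len(nbrs) ** 2 for nbrs in adj.values())


def radius_of_gyration(coords, masses):
    """sqrt(sum(x^2 + y^2 + z^2) / N), coordinates relative to the center of mass."""
    total_mass = sum(masses)
    center = [
        sum(m * c[k] for m, c in zip(masses, coords)) / total_mass
        for k in range(3)
    ]
    squared = sum((c[k] - center[k]) ** 2 for c in coords for k in range(3))
    return math.sqrt(squared / len(coords))


# Hydrogen-suppressed graphs as adjacency lists (hypothetical atom numbering).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), zagreb_index(n_butane))    # 10 10
print(wiener_index(isobutane), zagreb_index(isobutane))  # 9 12
```

The two C4 isomers show why both indices are useful: branching lowers the Wiener index (10 to 9) while raising the Zagreb index (10 to 12).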
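Model no. | Type of descriptors | Splitting Technique
1 | ETA | k-means clustering
2 | NonETA | k-means clustering
3 | Combined | k-means clustering
4 | ETA | PCA combined with duplex method
5 | NonETA | PCA combined with duplex method
6 | Combined | PCA combined with duplex method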
ηF, [Σα]P/Σα,
1 and
D , Σα/Nv, η′, η'F, Σε,
2
Wiener,
F
[Σα]P/Σα,
, η' , η′, Σα/N , D , S_sF, Jurs 3
P
X
F,
F
and Σβns
'
D , , S_sF, Jurs WPSA1, η, / N , , [Σα]P/Σα
Jurs SASA, Jurs-WNSA-1, 3p, Jx, Density, S_sF, PMI mag, Jurs FNSA-1, AlogP98, Wiener, SC-3-P and 1m Jurs SASA, Density, 1, η′, η'F,
v
AlogP98,
V and B Σα, Σα/N , D , Σε, η′, η' η, η , [Σα] /Σα , [Σα] /Σα, 1 and 2
WPSA-1,
[Σα]Y/Σα,
Jurs SASA,
1 v
, S_sF, Jurs FPSA-3, 3 CH and
3
Jurs PNSA-3, Wiener,
Density,
Jurs SASA, Jurs PPSA-3, 3p, Jx,
2
A , η,
,
Descriptors
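Table 3: Comparison of models obtained after applying different splitting techniques and chemometric tools

Internal validation
Model no. | Type of descriptors | Splitting Technique | LVs | R² | R²a | F (df) | Q²int | Average r²m(LOO) | Δr²m(LOO)
1 | ETA | k-means clustering | 10 | 0.8147 | 0.8139 | 1144.5 (10, 2602) | 0.8059 | 0.7219 | 0.1633
2 | NonETA | k-means clustering | 10 | 0.7443 | 0.7432 | 757.49 (10, 2602) | 0.7341 | 0.6263 | 0.2123
3 | Combined | k-means clustering | 10 | 0.8191 | 0.8184 | 1178.92 (10, 2602) | 0.813 | 0.732 | 0.1592
4 | ETA | PCA combined with duplex method | 10 | 0.8101 | 0.8094 | 1109.8 (10, 2601) | 0.7993 | 0.7123 | 0.1672
5 | NonETA | PCA combined with duplex method | 10 | 0.7281 | 0.7270 | 696.54 (10, 2601) | 0.7196 | 0.6039 | 0.2259
6 | Combined | PCA combined with duplex method | 10 | 0.8263 | 0.8256 | 1237.32 (10, 2601) | 0.8170 | 0.7366 | 0.156

External and overall validation
Model no. | Q²ext(F1) | Q²ext(F2) | Q²ext(F3) | Average r²m(test) | Δr²m(test) | Average r²m(overall) | Δr²m(overall)
1 | 0.7914 | 0.7909 | 0.8492 | 0.7108 | 0.0613 | 0.7194 | 0.1425
2 | 0.7039 | 0.7033 | 0.786 | 0.6064 | 0.0437 | 0.6208 | 0.1756
3 | 0.8023 | 0.8019 | 0.8570 | 0.7332 | 0.0039 | 0.7311 | 0.1247
4 | 0.827 | 0.8239 | 0.622 | 0.7099 | 0.1587 | 0.7174 | 0.1692
5 | 0.7299 | 0.725 | 0.4105 | 0.559 | 0.2417 | 0.5886 | 0.2352
6 | 0.8145 | 0.8112 | 0.5952 | 0.6921 | 0.1697 | 0.7203 | 0.165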
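The external validation metrics reported in Table 3 can be sketched in a few lines. The code below assumes the definitions commonly used in the QSAR literature for Q²ext(F1), Q²ext(F2), Q²ext(F3) and for the r²m pair (average and difference); the function and argument names are hypothetical, and the absolute value inside the square root is a guard added here for the case where r² − r0² is slightly negative. The table itself does not spell out the formulas, so this should be read as an illustration rather than the authors' exact procedure.

```python
import math


def q2_external(y_train, y_ext, y_ext_pred):
    """Return Q2ext(F1), Q2ext(F2) and Q2ext(F3) for an external test set."""
    n_tr, n_ext = len(y_train), len(y_ext)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_ext) / n_ext
    # Predictive residual sum of squares on the external set.
    press = sum((y - p) ** 2 for y, p in zip(y_ext, y_ext_pred))
    ss_ext_about_train_mean = sum((y - mean_tr) ** 2 for y in y_ext)
    ss_ext_about_ext_mean = sum((y - mean_ext) ** 2 for y in y_ext)
    ss_train = sum((y - mean_tr) ** 2 for y in y_train)
    return {
        "F1": 1 - press / ss_ext_about_train_mean,
        "F2": 1 - press / ss_ext_about_ext_mean,
        "F3": 1 - (press / n_ext) / (ss_train / n_tr),
    }


def rm2_metrics(r2, r0_sq, r0_prime_sq):
    """Average and difference of r2m and r'2m from r2, r0^2 and r'0^2."""
    rm2 = r2 * (1 - math.sqrt(abs(r2 - r0_sq)))
    rm2_prime = r2 * (1 - math.sqrt(abs(r2 - r0_prime_sq)))
    return (rm2 + rm2_prime) / 2, abs(rm2 - rm2_prime)


# Toy data (hypothetical values, not taken from the study).
q2 = q2_external([1.0, 2.0, 3.0, 4.0], [2.0, 3.0], [2.5, 2.5])
print(q2)
print(rm2_metrics(0.8, 0.79, 0.75))
```

Because F3 scales the test-set error by the training-set variance, it can differ markedly from F1 and F2 when the test set is more or less spread out than the training set, which is one way to read the gap between the F1/F2 and F3 columns for models 4 to 6.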
Table 4: External validation of the developed model using parameters proposed by Golbraikh and Tropsha

Model no. | Type of descriptors | Splitting Technique | (r² − r0²)/r² Training | (r² − r0²)/r² Test | (r² − r0²)/r² Overall | (r² − r′0²)/r² Training | (r² − r′0²)/r² Test | (r² − r′0²)/r² Overall
1 | ETA | k-means clustering | 3.73×10⁻³ | 5.78×10⁻³ | 3.13×10⁻⁴ | 5.24×10⁻² | 2.64×10⁻² | 4.65×10⁻²
2 | NonETA | k-means clustering | 2.31×10⁻⁵ | 2.04×10⁻² | 1.13×10⁻³ | 0.116 | 4.63×10⁻² | 9.92×10⁻²
3 | Combined | k-means clustering | 4.91×10⁻⁶ | 1.20×10⁻² | 6.17×10⁻⁴ | 4.80×10⁻² | 1.09×10⁻² | 3.81×10⁻²
4 | ETA | PCA combined with duplex method | 2.50×10⁻⁵ | 2.36×10⁻³ | 2.97×10⁻⁴ | 5.70×10⁻² | 6.78×10⁻² | 6.12×10⁻²
5 | NonETA | PCA combined with duplex method | 1.94×10⁻⁵ | 6.28×10⁻³ | 8.35×10⁻⁴ | 0.140 | 0.217 | 0.1168
6 | Combined | PCA combined with duplex method | 1.10×10⁻⁵ | 2.45×10⁻³ | 3.28×10⁻⁴ | 4.60×10⁻² | 7.91×10⁻² | 5.82×10⁻²

Model no. | k Training | k Test | k Overall | k′ Training | k′ Test | k′ Overall
1 | 0.9995 | 0.9998 | 0.9996 | 0.9880 | 0.9905 | 0.9886
2 | 0.9996 | 0.9982 | 0.9992 | 0.9835 | 0.9880 | 0.9846
3 | 0.9997 | 0.9975 | 0.9991 | 0.9883 | 0.9934 | 0.9895
4 | 0.9996 | 1.006 | 1.001 | 0.9912 | 0.9759 | 0.9874
5 | 0.9996 | 1.008 | 1.001 | 0.9875 | 0.9646 | 0.9819
6 | 0.9996 | 0.9961 | 0.9988 | 0.9920 | 0.9848 | 0.9902
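The quantities tabulated in Table 4 can be computed from paired observed and predicted values as sketched below. This is a minimal illustration under an explicit assumption about conventions: k and r0² come from regressing observed on predicted values through the origin, and k′ and r′0² from the reverse regression. Published implementations differ on which axis carries which variable, so the names and the axis choice here are assumptions rather than the authors' code.

```python
def golbraikh_tropsha(y_obs, y_pred):
    """Slopes through the origin and the (r2 - r0^2)/r2 style ratios."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n
    sum_op = sum(o * p for o, p in zip(y_obs, y_pred))
    # Slopes of the two regression lines forced through the origin.
    k = sum_op / sum(p * p for p in y_pred)        # observed = k * predicted
    k_prime = sum_op / sum(o * o for o in y_obs)   # predicted = k' * observed
    ss_obs = sum((o - mean_obs) ** 2 for o in y_obs)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    r2 = cov ** 2 / (ss_obs * ss_pred)             # squared Pearson correlation
    r0_sq = 1 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / ss_obs
    r0_prime_sq = 1 - sum(
        (p - k_prime * o) ** 2 for o, p in zip(y_obs, y_pred)
    ) / ss_pred
    return {
        "k": k,
        "k_prime": k_prime,
        "r2": r2,
        "ratio": (r2 - r0_sq) / r2,
        "ratio_prime": (r2 - r0_prime_sq) / r2,
    }


# Toy data (hypothetical values, not taken from the study).
print(golbraikh_tropsha([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

The acceptance windows usually quoted for this test are a ratio below 0.1 and a slope between 0.85 and 1.15; every k and k′ value in Table 4 lies within that slope window.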
Table 5: Comparison of statistical parameters of the ETA models developed in the present work with a previously reported model
Reference | Modeling technique | Splitting Technique | Q²int | Q²ext(F1) | Q²ext(F2) | Q²ext(F3) | r²
Lei et al. 2010 | Global MLR | PCA combined with duplex method (a) | 0.785 | 0.773 | --- | --- | 0.789
Present work | PLS | k-means clustering (b) | 0.8059 | 0.7914 | 0.7909 | 0.8492 | 0.8147
Present work | PLS | PCA combined with duplex method (a) | 0.7993 | 0.827 | 0.8239 | 0.4814 | 0.8101
(a) ntraining = 2612, ntest = 871; (b) ntraining = 2613, ntest = 870
Table 6: Comparison of predictive quality of models after removing compounds having DModX values higher than the critical values from test sets
Splitting Technique | Type of descriptors | No. of compounds removed from test set | Critical DModX value | Q²ext(F1) | Q²ext(F2) | Q²ext(F3) | Average r²m(test) | Δr²m(test) | Average r²m(overall) | Δr²m(overall)
k-means clustering | ETA | 35 | 2.219 | 0.8017 | 0.7995 | 0.8886 | 0.7404 | 0.065 | 0.7234 | 0.1251
k-means clustering | Non-ETA | 34 | 2.219 | 0.7456 | 0.7425 | 0.8544 | 0.6694 | 0.035 | 0.6318 | 0.1695
k-means clustering | Combined | 28 | 1.875 | 0.8173 | 0.8162 | 0.8923 | 0.7586 | 0.0558 | 0.7349 | 0.1215
PCA combined with duplex method | ETA | 92 | 2.219 | 0.8356 | 0.8353 | 0.7963 | 0.7736 | 0.0436 | 0.7278 | 0.1326
PCA combined with duplex method | Non-ETA | 75 | 2.219 | 0.7406 | 0.7381 | 0.5487 | 0.6191 | 0.219 | 0.6115 | 0.2274
PCA combined with duplex method | Combined | 116 | 1.875 | 0.8057 | 0.8046 | 0.7135 | 0.7298 | 0.0958 | 0.7337 | 0.1363