Modeling adsorption of organic compounds on activated carbon using ETA indices

Author's Accepted Manuscript

Supratim Ray, Kunal Roy

www.elsevier.com/locate/ces

PII: S0009-2509(13)00631-3
DOI: http://dx.doi.org/10.1016/j.ces.2013.09.018
Reference: CES11299

To appear in: Chemical Engineering Science

Received date: 2 April 2013
Revised date: 24 August 2013
Accepted date: 5 September 2013

Cite this article as: Supratim Ray, Kunal Roy, Modeling adsorption of organic compounds on activated carbon using ETA indices, Chemical Engineering Science, http://dx.doi.org/10.1016/j.ces.2013.09.018

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Supratim Ray a, < and Kunal Roy b, *

a Division of Pharmaceutical Chemistry, Dr. B C Roy College of Pharmacy & Allied Health Sciences, Bidhannagar, Durgapur 713 206, India

b Drug Theoretics and Cheminformatics Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata 700 032, India

---------------------------------------

* Corresponding author: Kunal Roy
Email: [email protected]; Phone: +91 98315 94140; Fax: +91-33-2837-1078
URL: http://sites.google.com/site/kunalroyindia/

< Presently at Assam University, Silchar 788 011, India

Abstract

The aim of the present work is to develop quantitative structure-property relationship (QSPR) models for the adsorption capability of a large data set of chemicals (n = 3483) on activated carbon. Two different splitting techniques, k-means clustering and principal component analysis (PCA) combined with the duplex method, were used to divide the data set into training and test sets. An attempt was made to identify the descriptors common to the various models, indicating their importance for adsorption capacity on activated carbon. Despite the large number of compounds in the training and test sets (3:1 in size ratio), we did not omit any compounds showing outlier behavior to artificially enhance the values of the validation metrics, thus ensuring the predictive quality of the models for diverse types of compounds. The models were developed to study the predictive ability of extended topochemical atom (ETA) parameters, which are calculated from the two-dimensional representation of molecules and were introduced by the present group of authors. The ETA models were compared to non-ETA models involving topological, spatial and structural descriptors. In all cases, the data set was first subjected to stepwise regression to identify the contributing variables, and the selected variables were further subjected to partial least squares (PLS) regression. The PLS models indicate that ETA descriptors provide better external validation characteristics in terms of predictive $R^2$ than the non-ETA ones. The best ETA model shows encouraging statistical quality ($Q^2_{int}$ = 0.8059, $Q^2_{ext(F1)}$ = 0.7914, $Q^2_{ext(F2)}$ = 0.7909, $Q^2_{ext(F3)}$ = 0.8492).

Keywords: Adsorption; Computation; Computational chemistry; Mathematical modeling; QSPR; ETA

1 Introduction

Industrial processes that produce aqueous effluents rich in many heavy metal ions, and waters leaching from agricultural and forest land after the application of chemical fertilizers and pesticides, remain an important source of potential human toxicity (Easton, 1995; Karanfil et al., 1996). Harmful volatile organic chemicals are also commonly present in many industrial manufacturing environments. Though several control technologies have been applied to many industrial and municipal sources, the total quantity of these agents released to the environment remains high (Nriagu and Pacyna, 1988; Mohan and Pittman Jr., 2006). Activated carbon is one of the most popular adsorbents, used widely in different types of industries for the removal of toxic pollutants and ions, the treatment of non-biodegradable wastewaters (Balci et al., 2011; Zhang et al., 2005), and various gas separation and purification processes (Le Leuch and Bandosz, 2007). Activated carbon is a crude form of graphite having an amorphous structure. It is highly porous, exhibiting a broad range of pore sizes, from visible cracks to crevices and slits of molecular dimensions. The starting material and the activation method used for activated carbon production determine the surface functional groups. Activation also refines the pore structure, and surface areas of up to 2000 m2/g can be obtained (Radovic et al., 2000). Activated carbons have been prepared from coconut shells, wood char, lignin, petroleum coke, bone char, peat, sawdust, carbon black, rice hulls, sugar, peach pits, fish, fertilizer waste, waste rubber tire, etc. (Pollard et al., 1992; Mohan and Singh, 2005).

The adsorption capability of activated carbon depends on several factors, mainly on the carbon's characteristics such as texture (surface area, pore size distribution), surface chemistry (surface functional groups), and ash content (Gregg and Singh, 1982; Radovic et al., 1997). It also depends on the characteristics of the adsorptive, such as molecular weight, polarity, pKa, molecular size and functional groups (Villacañas et al., 2006; Karanfil and Kilduff, 1999). Finally, environmental factors such as pH, adsorptive concentration and the presence of other possible adsorptives also affect the adsorption capability of carbon (Nouri and Haghseresht, 2002; Haghseresht et al., 2002). The activated carbon surface chemistry and the pH of the medium play a significant role in the adsorption capability of activated carbon (Moreno-Castilla et al., 1995). The surface of activated carbon consists of basal planes, heterogeneous superficial groups (mainly oxygen-containing surface groups) and inorganic ash. For aromatic compounds, most of the adsorption sites are found on the basal planes, which correspond to about 90% of the carbon surface. Heterogeneous groups also contribute towards activity and define the chemical characteristics of the carbon surface (Franz et al., 2000). The heteroatoms are located at the edges, or in the defects, of the basal planes of carbon atoms. The amount of these oxygenated surface groups varies with the nature of the raw material and the activation process (Giraudet et al., 2006). The nature of the surface groups can also be modified by physical, chemical and electrochemical treatments (Bansal et al., 1988; Figueiredo et al., 1999).

Since pollutants comprise many different types of chemicals with varying structural composition, it is difficult to predict to what extent a particular compound will be adsorbed on activated carbon. At the same time, it is expensive to determine experimentally the adsorption capability of different classes of chemicals on activated carbon. Thus, prediction of the adsorption capability of activated carbon using a diverse set of chemicals is a relevant topic for quantitative structure-property relationship (QSPR) studies. Such studies are relevant in the context of the Materials Genome Initiative, a new, multi-stakeholder effort to develop an infrastructure using computational approaches to accelerate advanced materials discovery and deployment in the United States (Service, 2012). Against this background, in the present study we have performed a QSPR analysis on a large number of compounds to quantitatively relate the adsorption capability of activated carbon to the structure of the chemicals using linear regression tools. The descriptors of interest in the present study are extended topochemical atom (ETA) indices, which are calculated simply from the 2D representation of the molecules. The ETA indices, which have been developed by the present group of authors (Roy and Ghosh, 2003), require less computational time than complex 3D descriptors and possess good diagnostic power for developing robust predictive models. The models were developed to study the predictive ability of ETA parameters in comparison to non-ETA descriptors (topological, structural, physicochemical, electronic and spatial descriptors).

2 Materials and Methods

2.1 The data set

In the present work, the adsorption capacities of 3483 organic compounds on activated carbon, as used by Lei et al. (2010) and originally taken from Yaws' Handbook of thermodynamic and physical properties of chemical compounds (Yaws, 2003-2004), were used as the model data set. All the compounds and their observed and calculated adsorption capabilities, along with the ETA and non-ETA parameters present in the models, are listed in the Supplementary Materials.

2.2 Types of descriptors

2.2.1 ETA descriptors

The extended topochemical atom (ETA) indices were formulated by the present group of authors (Roy and Ghosh, 2003) based on a modification of the TAU descriptors (Pal et al., 1988; Pal et al., 1989). The TAU descriptors were developed considering the valence electron mobile (VEM) environment. The ETA scheme includes various basic parameters related to size or bulk, electronegativity and electronic contribution. Some of the basic ETA indices of all the compounds were calculated using DRAGON version 6.0 software (offered by TALETE srl, Italy). Other ETA indices were calculated according to previously published work (Roy and Das, 2011). The work started with forty ETA descriptors. The significance of the different ETA descriptors is listed in Table 1. The ETA indices are also now available for computation in version 2.11 of PaDEL-Descriptor (Yap, 2011), an open source software available at http://padel.nus.edu.sg/software/padeldescriptor .

Table 1 near here

2.2.2 Non-ETA descriptors

Several types of non-ETA descriptors, namely two-dimensional (2D) descriptors (topological, structural, physicochemical and electronic indices) and three-dimensional (3D) descriptors (spatial indices), were calculated for the given data set of compounds to assess the performance of ETA descriptors in comparison to non-ETA descriptors. For the calculation of 3D descriptors, multiple conformations of each molecule were generated using the optimal search as the conformational search method. Each conformer was subjected to an energy minimization procedure using the smart minimizer under the open force field (OFF) to generate the lowest energy conformation for each structure. The charges were calculated according to the Gasteiger method (Gasteiger, 1978). These descriptors were calculated using Cerius2 version 4.10 software (a product of Accelrys, Inc., San Diego, USA). This study involved 103 non-ETA descriptors. The categorical list of the different non-ETA descriptors is shown in Table 2.

Table 2 near here

2.3 Model development

2.3.1 Splitting of the data set

Two different splitting techniques were used in order to identify the descriptors common to the various models, indicating their importance for the adsorption capacity of large data sets of chemicals on activated carbon. First, we used the same data set split as reported by Lei et al. (2010), with a training set of 2612 compounds and a test set of 871 compounds, obtained by principal component analysis (PCA) combined with the duplex method. Then k-means clustering was also used as a splitting technique for selection of the training set for model development. The whole data set (n = 3483) was divided into training (n = 2613, 75% of the total number of compounds) and test (n = 870, 25% of the total number of compounds) sets by the k-means clustering technique (MacQueen, 1967) applied on the standardized descriptor matrix of the combined sets (ETA and non-ETA). Figure 1 (PCA combined with the duplex method as the splitting technique) and Figure 2 (k-means clustering as the splitting technique) show plots of the first three principal components of the descriptor matrix, suggesting that each of the test set compounds lies in close vicinity of some training set molecules. All developed models were then validated externally using the test set compounds.

Figure 1 near here

Figure 2 near here
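The cluster-based split described above can be illustrated with a small sketch. It is not the authors' procedure: the manuscript does not state how compounds were drawn from each cluster, so sending every fourth member of a cluster to the test set (roughly 25%) is an assumption, as are the toy descriptor matrix, the function names `kmeans` and `split_by_cluster`, and the simple centroid initialization.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels

def split_by_cluster(labels, test_every=4):
    """Within each cluster, send every `test_every`-th compound to the test set (assumed rule)."""
    train, test = [], []
    seen = {}
    for idx, lab in enumerate(labels):
        seen[lab] = seen.get(lab, 0) + 1
        (test if seen[lab] % test_every == 0 else train).append(idx)
    return train, test

# Hypothetical standardized descriptor matrix: two well-separated groups of 8 compounds.
X = [[0.1 * i, 0.0] for i in range(8)] + [[5.0 + 0.1 * i, 5.0] for i in range(8)]
labels = kmeans(X, k=2)
train_idx, test_idx = split_by_cluster(labels)
print(len(train_idx), len(test_idx))
```

Drawing test compounds from every cluster is what keeps each test compound close to some training compounds in descriptor space, which is the property the principal component plots are meant to show.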

2.3.2 Chemometric tools

Initially, stepwise regression was applied on the data set to find useful subsets of descriptors, which were further subjected to partial least squares (PLS) analysis for model development. PLS is a generalization of regression which can handle data with strongly correlated and/or noisy or numerous X variables (Wold, 1995; Fan et al., 2001). It gives a reduced solution, which is statistically more robust than multiple linear regression (MLR). The linear PLS model finds "new variables" (latent variables or X scores) which are linear combinations of the original variables. To avoid overfitting, each consecutive PLS component must pass a strict significance test, and extraction stops when the components become non-significant. Application of PLS thus allows the construction of larger QSAR equations while still avoiding overfitting and eliminating most variables. PLS is normally used in combination with cross-validation to obtain the optimum number of components. This ensures that the QSAR equations are selected based on their ability to predict the data rather than to fit the data. In the PLS analysis of the present data set, the variables with smaller standardized regression coefficients were removed from the PLS regression until there was no further improvement in the $Q^2$ value, irrespective of the number of components.
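The cross-validated $Q^2$ used above as the stopping criterion can be sketched as follows. To keep the example short and dependency-free, an ordinary least-squares line with a single descriptor stands in for the PLS model; this substitution, the function names and the toy data are assumptions for illustration, and the total sum of squares is taken about the mean of all training responses.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def q2_loo(x, y):
    """Leave-one-out Q2 = 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without compound i, then predict compound i.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    y_mean = sum(y) / len(y)
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / ss_total

# Hypothetical data: one descriptor, five compounds.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [2.0, 4.0, 6.0, 8.0, 10.0]   # perfectly linear response
y_noisy = [2.1, 3.9, 6.2, 7.8, 10.1]   # same trend with small errors
print(q2_loo(x_demo, y_exact), q2_loo(x_demo, y_noisy))
```

Because every prediction comes from a model that never saw the compound being predicted, $Q^2$ rewards predictive ability rather than fit, which is why it is a natural criterion for pruning variables.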

2.3.3 Software used for model development

MINITAB version 14 software (statistical software of Minitab Inc, USA) was used for the stepwise regression and partial least squares (PLS) methods. STATISTICA version 7 software (statistical software of Stat Soft Inc) was used for the determination of the LOO (leave-one-out) values of the training set compounds. SIMCA-P 10.0 was used to calculate the DModX (distance to the model in X-space) values of the compounds.

2.3.4 Model validation

The statistical quality of the various equations was judged by calculating several metrics, namely the determination coefficient ($R^2$) as a measure of the total variance of the response explained by the regression models (fitting), the explained variance ($R_a^2$) and the variance ratio (F) at specified degrees of freedom (df) (Snedecor and Cochran, 1967). Both internal and external validations were performed to assess the reliability and the predictive potential of the developed models. To determine their predictive quality, the models were further validated using two techniques: (a) internal validation or cross-validation using the training set compounds, and (b) external validation using the test set compounds.

All the generated models were validated internally by the leave-one-out procedure ($Q^2_{int}$) (Wold and Ericsson, 1995). Besides leave-one-out validation, the internal predictive ability and robustness of the developed models were further evaluated by leave-25%-out cross-validation. The developed models were judged by different external validation parameters, namely $Q^2_{ext(F1)}$ and $Q^2_{ext(F2)}$ (Hawkins, 2004; Schuurmann et al., 2008) and $Q^2_{ext(F3)}$ (Consonni et al., 2009). Besides the above parameters, two more external validation parameters were employed to check the predictive ability of the developed models, as external validation is the most desired tool for establishing the predictive quality of QSPR models. The $r_m^2$ metrics ($r_m^2$ and $\Delta r_m^2$) are employed to better indicate both the internal and external predictive capacities of a model and to ascertain the proximity of the predicted and observed response data (Ojha et al., 2011; Roy and Roy, 2008). The $r_m^2$ and $\Delta r_m^2$ metrics are applied for internal validation of the training set compounds ($r^2_{m(LOO)}$ and $\Delta r^2_{m(LOO)}$), external validation of the test set compounds ($r^2_{m(test)}$ and $\Delta r^2_{m(test)}$) and overall validation for all compounds ($r^2_{m(overall)}$ and $\Delta r^2_{m(overall)}$).
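As a rough illustration of the three external validation parameters named above, the sketch below uses the forms in which they are commonly written: F1 references the test-set deviations to the training-set mean, F2 to the test-set mean, and F3 scales the mean squared prediction error by the training-set variance. The authors' exact definitions are in their Supplementary Materials, so the formulas as coded here, the function name `q2_external` and the toy numbers are assumptions.

```python
def q2_external(y_train, y_test, y_pred):
    """Return (Q2_F1, Q2_F2, Q2_F3) for test-set predictions y_pred."""
    n_tr, n_te = len(y_train), len(y_test)
    mean_tr = sum(y_train) / n_tr
    mean_te = sum(y_test) / n_te
    # Predictive residual sum of squares over the test set.
    press = sum((obs - pred) ** 2 for obs, pred in zip(y_test, y_pred))
    ss_test_about_train_mean = sum((obs - mean_tr) ** 2 for obs in y_test)
    ss_test_about_test_mean = sum((obs - mean_te) ** 2 for obs in y_test)
    ss_train = sum((obs - mean_tr) ** 2 for obs in y_train)
    q2_f1 = 1.0 - press / ss_test_about_train_mean
    q2_f2 = 1.0 - press / ss_test_about_test_mean
    q2_f3 = 1.0 - (press / n_te) / (ss_train / n_tr)
    return q2_f1, q2_f2, q2_f3

# Hypothetical responses: five training compounds, two test compounds.
f1, f2, f3 = q2_external(
    y_train=[1.0, 2.0, 3.0, 4.0, 5.0],
    y_test=[2.0, 4.0],
    y_pred=[2.5, 3.5],
)
print(f1, f2, f3)
```

F3 does not depend on how the test-set responses are spread, which is one reason it can differ noticeably from F1 and F2 for the same model.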

The developed equations were further validated by applying the criteria proposed by Golbraikh and Tropsha, i.e., (i) $Q^2_{int} > 0.5$, (ii) $r^2 > 0.6$, (iii) $r_0^2$ or $r'^2_0$ is close to $r^2$, such that $[(r^2 - r_0^2)/r^2]$ or $[(r^2 - r'^2_0)/r^2] < 0.1$, and $0.85 \le k \le 1.15$ or $0.85 \le k' \le 1.15$ (Golbraikh and Tropsha, 2002). The detailed explanation of the notations is given in the Supplementary Materials section. For a developed model to have high predictive ability, the correlation coefficient between actual and calculated activity must be close to one. The regression of actual activity against calculated activity, or of calculated activity against actual activity, through the origin can be characterized by the slope ($k$ or $k'$), which should be close to one.
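The criteria above, together with the $r_m^2$ metrics, can be put into a short sketch. Several choices here are assumptions rather than the authors' definitions (which are in their Supplementary Materials): unprimed quantities regress observed on predicted values through the origin and primed ones swap the axes, $r_m^2$ is taken as $r^2(1 - \sqrt{|r^2 - r_0^2|})$, and the function names and data are hypothetical.

```python
import math

def pearson_r2(a, b):
    """Squared Pearson correlation coefficient between two sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab * sab / (saa * sbb)

def origin_fit(target, predictor):
    """Slope k and r0^2 for target = k * predictor (regression through the origin)."""
    k = sum(t * p for t, p in zip(target, predictor)) / sum(p * p for p in predictor)
    mean_t = sum(target) / len(target)
    ss_res = sum((t - k * p) ** 2 for t, p in zip(target, predictor))
    ss_tot = sum((t - mean_t) ** 2 for t in target)
    return k, 1.0 - ss_res / ss_tot

def validation_summary(y_obs, y_pred, q2_int):
    r2 = pearson_r2(y_obs, y_pred)
    k, r02 = origin_fit(y_obs, y_pred)        # unprimed: observed vs predicted (assumed)
    k_p, r02_p = origin_fit(y_pred, y_obs)    # primed: axes interchanged
    rm2 = r2 * (1.0 - math.sqrt(abs(r2 - r02)))
    rm2_p = r2 * (1.0 - math.sqrt(abs(r2 - r02_p)))
    passes = (
        q2_int > 0.5
        and r2 > 0.6
        and ((r2 - r02) / r2 < 0.1 or (r2 - r02_p) / r2 < 0.1)
        and (0.85 <= k <= 1.15 or 0.85 <= k_p <= 1.15)
    )
    return {
        "r2": r2, "k": k, "k_prime": k_p,
        "rm2_mean": (rm2 + rm2_p) / 2.0,
        "delta_rm2": abs(rm2 - rm2_p),
        "passes": passes,
    }

# Hypothetical observed and predicted responses for five compounds.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = validation_summary(y_obs, y_obs, q2_int=0.8)
close = validation_summary(y_obs, [1.2, 1.9, 3.1, 3.8, 5.2], q2_int=0.8)
print(perfect["rm2_mean"], close["passes"])
```

A large gap between $r^2$ and $r_0^2$ means the predictions correlate with the observations but do not lie on the identity line, and $r_m^2$ penalizes exactly that.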

2.3.5 Applicability domain

Predictions made with a developed QSPR model are more reliable if the compounds being predicted lie within the applicability domain (AD) of the model. The purpose of the AD is to state whether the model's assumptions are met. The applicability domain provides information about the chemical domain of the training set molecules used for the development of the QSPR model and allows efficient prediction of new molecules lying within this chemical domain. For compounds which are markedly dissimilar from the training ones, the predictions made are quite uncertain. Thus, the idea of the AD is used to avoid such unfounded extrapolation of property predictions, which improves the reliability of the developed QSPR models in application.

The residuals of Y and X are of diagnostic value for the quality of the model (Wold et al., 2001). Since there are many X-residuals, one needs a summary for each observation (compound). This is accomplished by the residual standard deviation (SD) of the X-residuals of the corresponding row of the residual matrix E. Because this SD is proportional to the distance between the data point and the model plane in X-space, it is also often called DModX (distance to the model in X-space). Here, X is the matrix of predictor variables, of size (N×K), Y is the matrix of response variables, of size (N×M), E is the (N×K) matrix of X-residuals, N is the number of objects (cases, observations), k is the index of X-variables (k = 1, 2, ..., K) and m is the index of Y-variables (m = 1, 2, ..., M). A DModX larger than around 2.5 times the overall SD of the X-residuals (corresponding to an F-value of 6.25) indicates that the observation is outside the applicability domain of the model (Wold et al., 2001).
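The DModX rule above amounts to comparing a per-compound residual SD with a pooled one. The following sketch assumes the X-residual matrices are already available from a fitted PLS model; it omits the degrees-of-freedom corrections that software such as SIMCA-P applies, and the function name `dmodx_flags` and the numbers are hypothetical.

```python
import math

def dmodx_flags(E_train, E_new, factor=2.5):
    """Flag rows of E_new whose residual SD exceeds `factor` times the overall training SD.

    E_train, E_new: lists of rows of X-residuals (N x K), same K.
    Degrees-of-freedom corrections are deliberately left out (simplification).
    """
    n, k = len(E_train), len(E_train[0])
    # Overall SD of the X-residuals pooled over all training rows and variables.
    overall_sd = math.sqrt(sum(e * e for row in E_train for e in row) / (n * k))
    # Per-compound residual SD, expressed relative to the overall SD.
    ratios = [math.sqrt(sum(e * e for e in row) / k) / overall_sd for row in E_new]
    return ratios, [r > factor for r in ratios]

# Hypothetical residuals: three training compounds, two query compounds, K = 2.
E_train = [[0.1, -0.1], [0.2, 0.1], [-0.1, 0.2]]
E_new = [[0.1, 0.1], [0.5, -0.5]]
ratios, outside_ad = dmodx_flags(E_train, E_new)
print(ratios, outside_ad)
```

The factor 2.5 on the SD scale corresponds to 6.25 on the variance-ratio scale, which is the F-value quoted in the text.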

3 Results and discussion

Two different splitting strategies (k-means clustering and principal component analysis (PCA) combined with the duplex method) were employed for the division of the data set into training and test sets, followed by model development and validation. In all cases, the data set was first subjected to stepwise regression (stepping criteria: F = 30 for inclusion; F = 29.99 for exclusion) to find the useful variables, and the selected variables were further subjected to PLS regression. The models were developed using partial least squares (PLS) analysis for each set of descriptors (ETA, non-ETA and combined sets).

3.1 Development of models applying k-means clustering as the splitting technique

3.1.1 Models with ETA descriptors

The model from PLS analysis shows the importance of twelve different ETA descriptors in predicting adsorption capacity on activated carbon for the large data set of chemicals (model no. 1 in Table 3). Descriptors like $\sum\alpha$, $\Delta\varepsilon_D$, $\sum\alpha/N_v$, $\eta'$, $\eta'_F$ and […] have positive contributions towards the adsorption capacity, whereas descriptors like $\Delta\varepsilon_A$, $\eta$, $\eta_F$, $[\sum\alpha]_P/\sum\alpha$, $\psi_1$ and $\varepsilon_2$ have negative contributions towards the adsorption capacity. The terms $\sum\alpha$ and $\sum\alpha/N_v$ indicate the importance of molecular size for adsorption capacity. The models confirm the importance of hydrogen bond donor atoms (shown by the parameter $\Delta\varepsilon_D$). The positive coefficient of […] indicates the importance of electron richness of a molecule towards the adsorption capacity. The contributions of the overall topological nature and the functionality of a molecule (corresponding to $\eta$ and $\eta_F$) and of those relative to molecular size (corresponding to $\eta'$ and $\eta'_F$) are also evident from the models. The parameter $\Delta\varepsilon_A$, signifying the contribution of unsaturation and the presence of electronegative atoms in a molecule, has a negative contribution towards the adsorption capacity. The shape parameter $[\sum\alpha]_P/\sum\alpha$, which also contributed negatively, signifies the importance of the branching pattern present in a molecule. The term $\psi_1$, a measure of the hydrogen-bonding propensity of the molecules and/or polar surface area, has a negative coefficient for the adsorption capacity. The parameter $\varepsilon_2$, indicating the presence of electronegative atoms in a molecule excluding hydrogen, is detrimental for the adsorption capacity. The statistical quality parameters of the model showed good internal validation ($Q^2_{int}$ = 0.8059, $r^2_{m(LOO)}$ = 0.7219 and $\Delta r^2_{m(LOO)}$ = 0.1633), external validation ($Q^2_{ext(F1)}$ = 0.7914, $Q^2_{ext(F2)}$ = 0.7909, $Q^2_{ext(F3)}$ = 0.8492, $r^2_{m(test)}$ = 0.7108 and $\Delta r^2_{m(test)}$ = 0.0613), and overall validation ($r^2_{m(overall)}$ = 0.7194 and $\Delta r^2_{m(overall)}$ = 0.1425) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.808. The model also satisfies the set of criteria proposed by Golbraikh and Tropsha for evaluation of the predictive ability of the developed model (Tables 3 and 4).

Table 3 near here

Table 4 near here

3.1.2 Models with non-ETA descriptors

The model involving non-ETA descriptors indicates the importance of twelve non-ETA descriptors in predicting the adsorption capacity (model no. 2 in Table 3). Parameters like Jurs SASA, Jurs PPSA-3, $^3\chi_p$, Jx, Density and $^3\kappa$ have positive contributions towards the adsorption capacity, whereas terms like S_sF, Jurs FPSA-3, Jurs PNSA-3, Wiener, $^3\chi_{CH}$ and $^1\kappa$ have negative contributions for the adsorption capacity. The positive coefficient of Jurs SASA indicates that a higher value of total molecular solvent accessible surface area will increase the adsorption capacity. The term Jurs PPSA-3 signifies the importance of atomic charge weighted positive surface area towards the adsorption capacity. The term $^3\chi_p$ signifies the importance of third order molecular connectivity of path type towards the adsorption capacity. This term emphasizes particular atom connectivity within a molecule considering paths of three edges. The positive regression coefficient of Jx implies a positive correlation between Jx and the adsorption capacity. The parameter Jx indicates the average distance sum of the connectivity among different groups within a molecule. Density has a positive regression coefficient for adsorption capacity, reflecting the types of atoms and their packing pattern in a molecule necessary for adsorption onto activated carbon. The term $^3\kappa$, reflecting the shape of a molecule considering a path length of three, has a positive contribution to the adsorption capacity. The electrotopological state parameter S_sF has a negative contribution, signifying the importance of the fragment -F. The parameters Jurs PNSA-3 and Jurs FPSA-3 have negative coefficients, signifying the contribution of the solvent accessible surface area of a molecule in relation to both negatively charged atoms and fractional charged partial positive surface area towards adsorption capacity. The Wiener index, which is the sum of the number of chemical bonds existing between all pairs of heavy atoms in the molecule, contributes negatively. The term $^3\chi_{CH}$ has a negative coefficient, indicating that a compound having a higher value of the third order connectivity index (ring type) has lower adsorption capacity. The term $^1\kappa$, signifying the shape of the molecule considering the count of atoms and the presence of cycles relative to the minimal and maximal graphs, has a negative contribution for the adsorption capacity. The resulting statistical parameters of the model showed good internal validation ($Q^2_{int}$ = 0.7341, $r^2_{m(LOO)}$ = 0.6263 and $\Delta r^2_{m(LOO)}$ = 0.2123), external validation ($Q^2_{ext(F1)}$ = 0.7039, $Q^2_{ext(F2)}$ = 0.7033, $Q^2_{ext(F3)}$ = 0.786, $r^2_{m(test)}$ = 0.6064 and $\Delta r^2_{m(test)}$ = 0.0437), and overall validation ($r^2_{m(overall)}$ = 0.6208 and $\Delta r^2_{m(overall)}$ = 0.1756) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.731. From Table 4 it is observed that for non-ETA descriptors, the squared correlation coefficient values between the observed and predicted values of the training set compounds (leave-one-out predicted values) with intercept ($r^2$) and without intercept after changing the axes ($r'^2_0$) are not close enough to each other.

3.1.3 Models with ETA and non-ETA descriptors

Fourteen descriptors emerged in the best equation from PLS regression analysis, reflecting their importance in predicting the adsorption capacity (model no. 3 in Table 3). Parameters like Jurs SASA, $^3\kappa$, $\eta'_F$, […], $\sum\alpha/N_v$, $[\sum\alpha]_Y/\sum\alpha$ and $\Delta\varepsilon_D$ have positive coefficients, whereas terms like S_sF, Jurs WPSA-1, Wiener, $[\sum\alpha]_P/\sum\alpha$, AlogP98, $^2\chi^v$ and $\Delta\alpha_B$ have negative coefficients for the adsorption capacity. The models show a positive contribution of the shape parameter $[\sum\alpha]_Y/\sum\alpha$, signifying the importance of the branching pattern present in a molecule. The descriptor Jurs WPSA-1 has a negative contribution to the adsorption capacity, indicating the importance of surface weighted positively charged partial surface area. The term AlogP98 (computed values corresponding to the log of the partition coefficient) has a negative coefficient, indicating that an increase in the lipophilicity of a molecule will decrease the adsorption capacity. The parameter $^2\chi^v$ signifies the importance of second order valence molecular connectivity. The term $\Delta\alpha_B$ has a negative coefficient towards the adsorption capacity. It is an indicator of the presence of hydrogen-bond acceptor atoms. This parameter may also be considered an indicator of polar surface area. The statistical parameters of the model showed good internal validation ($Q^2_{int}$ = 0.813, $r^2_{m(LOO)}$ = 0.732 and $\Delta r^2_{m(LOO)}$ = 0.1592), external validation ($Q^2_{ext(F1)}$ = 0.8023, $Q^2_{ext(F2)}$ = 0.8019, $Q^2_{ext(F3)}$ = 0.857, $r^2_{m(test)}$ = 0.7332 and $\Delta r^2_{m(test)}$ = 0.0039), and overall validation ($r^2_{m(overall)}$ = 0.7311 and $\Delta r^2_{m(overall)}$ = 0.1247) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.815. The developed model also satisfies all the external validation criteria proposed by Golbraikh and Tropsha (Tables 3 and 4).

3.2 Development of models applying principal component analysis (PCA) combined with duplex method as the splitting technique

3.2.1 Models with ETA descriptors

The model from PLS shows the importance of twelve descriptors in predicting adsorption capacity on activated carbon (model no. 4 in Table 3). Descriptors like $\sum\alpha$, $\sum\alpha/N_v$, $\Delta\varepsilon_D$, […], $\eta'$ and $\eta'_F$ have positive contributions towards adsorption capacity, whereas descriptors like $\eta$, $\eta_F$, $[\sum\alpha]_P/\sum\alpha$, $[\sum\alpha]_X/\sum\alpha$, $\psi_1$ and $\varepsilon_2$ have negative contributions towards adsorption capacity. The models show a negative contribution of the shape parameter $[\sum\alpha]_X/\sum\alpha$, signifying the importance of the specific type of branching pattern present in a molecule. The resulting statistical parameters of the PLS model showed good internal validation ($Q^2_{int}$ = 0.7993, $r^2_{m(LOO)}$ = 0.7123 and $\Delta r^2_{m(LOO)}$ = 0.1672), external validation ($Q^2_{ext(F1)}$ = 0.827, $Q^2_{ext(F2)}$ = 0.8239, $Q^2_{ext(F3)}$ = 0.622, $r^2_{m(test)}$ = 0.7099 and $\Delta r^2_{m(test)}$ = 0.1587), and overall validation ($r^2_{m(overall)}$ = 0.7174 and $\Delta r^2_{m(overall)}$ = 0.1692) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.800. The developed model also satisfies all the external validation criteria proposed by Golbraikh and Tropsha (Tables 3 and 4).

3.2.2 Models with non-ETA descriptors 16

400

The model involving non-ETA descriptors contain twelve descriptors which indicate

401

their importance in predicting the adsorption capacity (model no 5 in Table 3).

402

Parameters like Jurs SASA, Jurs-WNSA-1, 3Fp, Jx and Density have positive

403

contributions towards the adsorption capacity whereas terms like S_sF, PMI mag, Jurs

404

FNSA-1, AlogP98, Wiener, SC-3-P and 1NDm have negative contributions. The

405

descriptor Jurs WNSA-1 signifies the importance of charge weighted partial negative

406

surface area towards the adsorption capacity.

407

influence of principal moments of inertia about the principal axes of a molecule

408

towards adsorption. Jurs FNSA-1 shows a negative contribution of fractional charged

409

partial surface area towards the adsorption capacity. Compounds having higher values

410

of third order subgraph count of path type (SC-3-P) will have a low adsorption

411

capacity. The Kappa shape index (1NDm) of path length 1 has a negative contribution

412

for the adsorption capacity, signifying the importance of molecular shape (including

413

all the heavy atoms present in a molecule). The developed model has acceptable

414

statistical

415

'rm2( LOO ) =0.2259),

416

2 Qext ( F 3) =0.4105,

417

2 2 =0.5886 and 'rm(overall) =0.2352) characteristics. The Q2 value after applying ( rm(overall)

418

leave-25%-out cross-validation is 0.718. From Table 4 it is observed that for non-

419

ETA descriptors the squared correlation coefficient values between the observed and

420

predicted values of the test, training (leave-one out predicted values) and overall set

421

compounds with intercept (r2) and without intercept after changing the axes (r/02) are

422

not close enough to each other.

limit

like

internal

external

rm2(test ) =0.559

validation validation and

The term PMI-mag indicates the

2 ( Qint =0.7196,

rm2( LOO ) =0.6039

2 ( Qext ( F 1) =0.7299,

'rm2( test ) =0.2417),

and

and

2 Qext ( F 2) =0.725,

overall

validation

423 17
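The closeness of $r^2$ to the through-origin values, and the $r^2_m$ metrics built on it, can be sketched as follows. This is a hedged illustration assuming the usual formulation $r^2_m = r^2\,(1 - \sqrt{|r^2 - r^2_0|})$, with observed values on the y-axis for $r^2_0$ and the axes interchanged for $r'^2_0$; the function names are illustrative and do not come from the manuscript.

```python
from math import sqrt
from statistics import mean


def r2_pearson(x, y):
    """Squared Pearson correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def r2_through_origin(y, x):
    """Determination coefficient for y regressed on x with zero intercept."""
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    my = mean(y)
    ss_res = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


def rm2_metrics(observed, predicted):
    """Return r2, the two through-origin values and the rm2-type metrics."""
    r2 = r2_pearson(observed, predicted)
    r0_2 = r2_through_origin(observed, predicted)        # observed on y-axis
    r0_2_prime = r2_through_origin(predicted, observed)  # axes interchanged
    rm2 = r2 * (1 - sqrt(abs(r2 - r0_2)))
    rm2_prime = r2 * (1 - sqrt(abs(r2 - r0_2_prime)))
    return {
        "r2": r2,
        "r0_2": r0_2,
        "r0_2_prime": r0_2_prime,
        "rm2": rm2,
        "rm2_prime": rm2_prime,
        "avg_rm2": (rm2 + rm2_prime) / 2,
        "delta_rm2": abs(rm2 - rm2_prime),
    }


# Toy data (hypothetical): predictions close to the observed values
print(rm2_metrics([1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.1, 3.8]))
```

A large gap between $r^2$ and either through-origin value lowers the corresponding $r^2_m$ and widens $\Delta r^2_m$, which is the pattern described for the non-ETA model.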

3.2.3 Models with ETA and non-ETA descriptors

Fourteen descriptors emerged in the best equations using PLS analysis, representing their obvious importance in predicting the adsorption capacity (model no. 6 in Table 3). Parameters like Jurs-SASA, Density, $^1\kappa$, […], $\eta'_F$, $\Delta\varepsilon_D$ and $\sum\beta'$ have positive coefficients, whereas terms like S_sF, Jurs-WPSA-1, […], $\sum\varepsilon/N$, $\Delta\beta$, $[\sum\alpha]_P/\sum\alpha$ and […]$_{ns}$ have negative coefficients for the adsorption capacity. The models show the importance of the descriptor $\sum\beta'$, which is a measure of the sigma and non-sigma contribution of atoms to the valence electron mobile count. The electronic parameter $\sum\varepsilon/N$ indicates the importance of electronegativity. The models also showed the negative contributions of relative unsaturation content ($\Delta\beta$ and […]$_{ns}$) of molecules towards the adsorption capacity. The statistical parameters of the PLS model showed good internal validation ($Q^2_{int}$ = 0.817, $r^2_{m(LOO)}$ = 0.7366 and $\Delta r^2_{m(LOO)}$ = 0.156), external validation ($Q^2_{ext(F1)}$ = 0.8145, $Q^2_{ext(F2)}$ = 0.8112, $Q^2_{ext(F3)}$ = 0.5952, $r^2_{m(test)}$ = 0.6921 and $\Delta r^2_{m(test)}$ = 0.1697), and overall validation ($r^2_{m(overall)}$ = 0.7203 and $\Delta r^2_{m(overall)}$ = 0.165) characteristics. The $Q^2$ value after applying leave-25%-out cross-validation is 0.816. The model also satisfies the set of criteria proposed by Golbraikh and Tropsha for evaluation of the predictive ability of the developed model (Tables 3 and 4).
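The Golbraikh and Tropsha criteria referred to above can be expressed as a small checklist. The thresholds below are the commonly quoted ones ($Q^2 > 0.5$, $r^2 > 0.6$, relative closeness of $r^2$ to $r^2_0$ or $r'^2_0$ below 0.1, a through-origin slope between 0.85 and 1.15, and $|r^2_0 - r'^2_0| < 0.3$); whether the manuscript applies exactly this set is an assumption, and the input values in the example are hypothetical.

```python
def golbraikh_tropsha_checks(q2, r2, r0_2, r0_2_prime, k, k_prime):
    """Return a dict of pass/fail flags for the commonly quoted criteria.

    q2:          cross-validated Q2 of the training set
    r2:          squared correlation, observed vs predicted (test set)
    r0_2:        through-origin determination coefficient
    r0_2_prime:  same, with the axes interchanged
    k, k_prime:  slopes of the two through-origin regression lines
    """
    return {
        "q2": q2 > 0.5,
        "r2": r2 > 0.6,
        "r0_closeness": (r2 - r0_2) / r2 < 0.1
        or (r2 - r0_2_prime) / r2 < 0.1,
        "slope": 0.85 <= k <= 1.15 or 0.85 <= k_prime <= 1.15,
        "r0_difference": abs(r0_2 - r0_2_prime) < 0.3,
    }


# Hypothetical statistics, chosen only to show the output format
checks = golbraikh_tropsha_checks(0.80, 0.82, 0.81, 0.80, 1.00, 0.99)
print(checks, "all passed:", all(checks.values()))
```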

3.3 Comparison of models obtained from different splitting techniques

3.3.1 Models with ETA descriptors

For the development of training sets, two different splitting techniques were used. In both cases, the PLS models contain twelve variables (Table 3). Out of the twelve variables, eleven ($\sum\alpha$, $\sum\alpha/N_V$, $\Delta\varepsilon_D$, $\eta$, $\eta'$, $\eta_F$, $\eta'_F$, $\sum\varepsilon$, $[\sum\alpha]_P/\sum\alpha$, $\psi_1$ and $\varepsilon_2$) are present in both equations. This observation is interesting, as it indicates that the combination of selected descriptors remains nearly the same even when different training sets are obtained by applying different splitting techniques. The descriptors indicate their obvious importance in predicting the adsorption capacity. The two models (model nos. 1 and 4) are also comparable in terms of adjusted $R^2$ ($R^2_a$; 0.8139 and 0.8094, respectively), $Q^2_{int}$ (0.8059 and 0.7993, respectively), $r^2_{m(test)}$ (0.7108 and 0.7099, respectively) and $r^2_{m(LOO)}$ (0.7219 and 0.7123, respectively). However, one of the external validation parameters, $Q^2_{ext(F3)}$ (which considers both test and training set compounds), is lower (0.622) for the PLS model obtained with PCA combined with the duplex method as the splitting technique (model no. 4) than for the PLS model (0.8492) developed with the k-means clustering technique (model no. 1).
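As a rough illustration of cluster-based splitting, the sketch below assumes that cluster labels have already been assigned (for example by k-means on the standardized descriptor matrix) and then draws roughly every fourth compound of each cluster into the test set, which approximates a 75:25 ratio similar to the 2613/870 split. The selection rule, the function name and the labels are assumptions for illustration, not the exact procedure of the manuscript.

```python
from collections import defaultdict


def split_by_cluster(cluster_labels, test_every=4):
    """Split compound indices into training and test sets cluster by cluster.

    cluster_labels: one cluster label per compound (index = compound position)
    test_every:     every n-th member of each cluster goes to the test set
    """
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    train, test = [], []
    for label in sorted(members):
        for position, index in enumerate(members[label]):
            # The last compound of every block of `test_every` is held out
            if position % test_every == test_every - 1:
                test.append(index)
            else:
                train.append(index)
    return sorted(train), sorted(test)


# Hypothetical labels for twelve compounds in two clusters
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
train_idx, test_idx = split_by_cluster(labels)
print(train_idx, test_idx)
```

Sampling within each cluster keeps every region of descriptor space represented in both sets, which is the motivation usually given for cluster-based splitting.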

3.3.2 Models with non-ETA descriptors

Both the developed PLS models (model nos. 2 and 5) contain twelve variables (Table 3), but only five variables (S_sF, Jurs-SASA, $^3\chi_p$, Jx and Wiener) are present in both equations, indicating their contributions towards the adsorption capability on activated carbon. The statistical qualities of the two models are not as close as those of the models with ETA descriptors. Here the $Q^2_{ext(F3)}$ value (0.4105) obtained using PCA combined with the duplex method as the splitting technique (model no. 5) does not cross the threshold value (0.5).

3.3.3 Models with ETA and non-ETA descriptors

The developed PLS models after applying different splitting techniques contain fourteen variables (Table 3). Seven variables (S_sF, Jurs-SASA, Jurs-WPSA-1, […], $\eta'_F$, $[\sum\alpha]_P/\sum\alpha$ and $\Delta\varepsilon_D$) are present in both models. Thus, these parameters are important for model development. The two models (model nos. 3 and 6) are also comparable in terms of $Q^2_{int}$ (0.813 and 0.817, respectively) and $r^2_{m(LOO)}$ (0.732 and 0.736, respectively). The $Q^2_{ext(F3)}$ value (0.8570) for the PLS model obtained after applying the k-means clustering technique (model no. 3) is better than that obtained using PCA combined with the duplex method (0.5952) as the splitting technique (model no. 6).

3.4 Comparison of ETA and non-ETA models

The PLS models involving ETA and non-ETA parameters after applying two different splitting techniques indicate that ETA models provide better external validation characteristics in terms of predictive $R^2$, i.e., $Q^2_{ext(F1)}$ (0.7914, 0.827; models 1, 4), $Q^2_{ext(F2)}$ (0.7909, 0.8239; models 1, 4), $Q^2_{ext(F3)}$ (0.8492, 0.622; models 1, 4) and $r^2_{m(test)}$ (0.7108, 0.7099; models 1, 4), which are greater than the corresponding values of the non-ETA models (Table 3). The values for the non-ETA models are (0.7039, 0.7299; models 2, 5), (0.7033, 0.725; models 2, 5), (0.786, 0.4105; models 2, 5) and (0.6064, 0.559; models 2, 5), respectively. The models involving the combined set of descriptors showed improved external validation parameters in comparison to models using non-ETA descriptors. The values are (0.8023, 0.8145; models 3, 6), (0.8019, 0.8112; models 3, 6), (0.8570, 0.5952; models 3, 6) and (0.7332, 0.6921; models 3, 6), respectively. Thus, these observations signify the importance of ETA descriptors for improving the statistical quality of non-ETA models: when used in combination, they provide a more robust prediction of the adsorption capacity of activated carbon for the large set of chemicals.

3.5 Discussion on the best ETA models

Considering all the statistical parameters used for internal and external validation, the best ETA model is obtained after applying k-means clustering as the splitting technique (model 1). The concerned equation is shown below:

$$
\begin{aligned}
\log_{10}(\text{adsorption}) = {} & -3.50574 + 0.16110\,\textstyle\sum\alpha + 2.01862\,\Delta\varepsilon_D - 0.52995\,\Delta\varepsilon_A + 6.86622\,\sum\alpha/N_V \\
& + 0.23966\,\eta + 3.49750\,\eta' + 4.04585\,\eta'_F - 0.21032\,\eta_F - 0.13426\,\textstyle\sum\varepsilon - 0.90770\,[\sum\alpha]_P/\sum\alpha \\
& - 1.74413\,\psi_1 - 1.10706\,\varepsilon_2
\end{aligned}
\qquad (1)
$$

$n_{training}$ = 2613, $R^2$ = 0.8147, $R^2_a$ = 0.8139, $F$ = 1144.5 (df 10, 2602), $Q^2_{int}$ = 0.8059, $r^2_{m(LOO)}$ = 0.7219, $\Delta r^2_{m(LOO)}$ = 0.1633, $n_{test}$ = 870, $Q^2_{ext(F1)}$ = 0.7914, $Q^2_{ext(F2)}$ = 0.7909, $Q^2_{ext(F3)}$ = 0.8492, $r^2_{m(test)}$ = 0.7108, $\Delta r^2_{m(test)}$ = 0.0613, $r^2_{m(overall)}$ = 0.7194, $\Delta r^2_{m(overall)}$ = 0.1427
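To make the use of Eq. (1) concrete, the following sketch evaluates it for a set of descriptor values. The coefficients are those of Eq. (1), with signs following the positive and negative contributions stated in the text; the dictionary keys are illustrative names chosen here, and the descriptor values in the example are hypothetical rather than values of any compound in the data set.

```python
# Intercept and coefficients of Eq. (1); key names are illustrative labels
INTERCEPT = -3.50574
COEFFICIENTS = {
    "sum_alpha": 0.16110,           # sum of alpha (molecular bulk)
    "delta_epsilon_D": 2.01862,     # hydrogen bond donor term
    "delta_epsilon_A": -0.52995,
    "sum_alpha_per_Nv": 6.86622,    # sum of alpha divided by N_V
    "eta": 0.23966,
    "eta_prime": 3.49750,
    "eta_prime_F": 4.04585,
    "eta_F": -0.21032,              # overall functionality contribution
    "sum_epsilon": -0.13426,
    "shape_P": -0.90770,            # [sum alpha]_P / sum alpha
    "psi_1": -1.74413,
    "epsilon_2": -1.10706,
}


def predict_log10_adsorption(descriptors):
    """Evaluate Eq. (1) for a mapping of descriptor name to value."""
    missing = set(COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return INTERCEPT + sum(
        coefficient * descriptors[name]
        for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical descriptor values, for illustration only
example = dict.fromkeys(COEFFICIENTS, 0.0)
example["sum_alpha"] = 2.0
example["sum_alpha_per_Nv"] = 0.5
print(predict_log10_adsorption(example))
```

The result is on the log10 scale of the response, so the adsorption capacity itself is recovered as 10 raised to the returned value.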

Eq. (1) could explain 81.39% and predict 80.59% of the variance in adsorption capability onto activated carbon in the gas phase. When this equation was applied for prediction of the test set compounds, the external validation metric values [$Q^2_{ext(F1)}$, $Q^2_{ext(F2)}$, $Q^2_{ext(F3)}$] were found to be 0.7914, 0.7909 and 0.8492, respectively. Descriptors like $\sum\alpha$, $\Delta\varepsilon_D$, $\sum\alpha/N_V$, $\eta$, $\eta'_F$ and $\eta'$ have positive contributions towards adsorption capacity, whereas descriptors like $\Delta\varepsilon_A$, $\sum\varepsilon$, $\eta_F$, $[\sum\alpha]_P/\sum\alpha$, $\psi_1$ and $\varepsilon_2$ have negative contributions towards the adsorption capacity. We further elaborate the contributions of the descriptors and structure-property relations taking some examples. Formaldehyde (compound no. 1667), having the lowest value of $\sum\alpha$, shows the lowest adsorption capacity on activated carbon. Compounds like methyl alcohol (compound no. 2411) and methyl amine (compound no. 2412) also show lower adsorption capacity due to their lower molecular size. It is also observed that octamethylcyclotetrasiloxane (compound no. 2678), possessing the highest value of $\sum\alpha$, shows comparatively higher adsorption capacity. The parameter $\Delta\varepsilon_D$ indicates the importance of hydrogen bond donor atoms. Compounds like formamide (compound no. 1668) having two hydrogen bond donor atoms show higher adsorption capacity. It is observed that all the compounds having hydrogen bond donor groups show adsorption capacity in the higher range, except compounds like dimethyl amine (compound no. 1049), methyl amine (compound no. 2412), methyl alcohol (compound no. 2411), formic acid (compound no. 1669), ethyl amine (compound no. 1451) and methyl mercaptan (compound no. 2518). All these compounds have low molecular size and hence lower adsorption capacity. Compounds like diiodomethane (compound no. 670), 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890), possessing high values of $\sum\alpha/N_V$, show high adsorption capacity. Again, formic acid (compound no. 1669) and formaldehyde (compound no. 1667), having lower values of $\sum\alpha/N_V$, show lower adsorption capacity due to their lower molecular size. Compounds like 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890), possessing higher values of […], show high adsorption capacity onto activated carbon. But formaldehyde (compound no. 1667), having the lowest value of […], shows the lowest adsorption capacity. The parameter […] indicates the importance of electron richness of a molecule. It is observed that the absence of electronegative atoms in molecules like ethylene (compound no. 1483) and acetylene (compound no. 11) is responsible for their low adsorption capacity. On the other hand, compounds like diiodomethane (compound no. 670), 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890) contain halogen atoms, which contribute to the electron richness of the molecules, resulting in adsorption capacity in the higher range. It is observed that compounds like 3109-3114, 680-682, 665, 668 and 670 possess values of $\Delta\varepsilon_A$ in the lower range, but they show adsorption capacity in the higher range. All these compounds contain more than one halogen atom, which may be responsible for their high adsorption capacity.

Compounds like chlorotrifluoromethane (compound no. 356), possessing relatively higher values of $\sum\varepsilon$, show lower adsorption capacity. Compounds like 1,1,2,2-tetrabromoethane (compound no. 2889) and 1,1,1,2-tetrabromoethane (compound no. 2890), having values of $\eta_F$ in the lower range, show the highest adsorption capacity. The high adsorption capacity of these compounds may be due to their overall functionality contribution. It is observed that compounds with values of $\eta_F$ from zero to the negative range show adsorption capacity in the higher range. The shape index $[\sum\alpha]_P/\sum\alpha$ indicates the branching pattern. It is observed that molecules like ethane (compound no. 1333), ethylene (compound no. 1483) and formaldehyde (compound no. 1667), which possess the highest value of $[\sum\alpha]_P/\sum\alpha$, show the lowest adsorption capacity. Hexafluoropropylene (compound no. 1722) possesses the lowest value of $\psi_1$, thus showing adsorption capacity in the higher range, whereas ethane (compound no. 1333), ethylene (compound no. 1483) and acetylene (compound no. 11), having values of $\psi_1$ in a comparatively higher range, show adsorption capacity in the lower range. The parameter $\varepsilon_2$ is a measure of electronegative atom count. Chlorotrifluoromethane (compound no. 356), having the highest value of $\varepsilon_2$, shows adsorption capacity in the lower range. It is also observed that ethane (compound no. 1333), ethylene (compound no. 1483) and acetylene (compound no. 11), having values of $\varepsilon_2$ in a comparatively higher range, show adsorption capacity in the lowest range. Figures 3 and 4 show scatter plots of observed vs calculated/predicted values (in log scale) of the training and test set compounds, respectively, for the best ETA model (model no. 1). The same scatter plots for the absolute values of adsorption capacity are shown in the Supplementary Materials section.

Figure 3 near here

Figure 4 near here

3.6 Validation of models according to OECD guidelines

In this study, we have focused on all five OECD principles (http://www.oecd.org/env/ehs/risk-assessment/37849783.pdf) to develop reliable models, as follows: (i) a defined end point (experimentally determined adsorption capacities of 3483 organic compounds to activated carbon in gas phase; the logarithm values were used as the dependent variable); (ii) an unambiguous algorithm (in the present work, descriptors were calculated using the Dragon software version 6 and Cerius2 version 4.10 software, and models were developed using the PLS algorithm with the MINITAB software); (iii) a defined domain of applicability (the applicability domain of the molecules was assessed based on the DModX (distance to the model in X-space) values); (iv) appropriate measures of goodness-of-fit, robustness and predictivity (the developed models were extensively validated using different traditional and novel statistical parameters); and (v) a mechanistic interpretation (the descriptors appearing in the different regression models have been explained, as far as possible, to provide a mechanistic basis for interpretation of the models developed in the present work). According to OECD principle 5: “It is recognised that it is not always possible, from a scientific viewpoint, to provide a mechanistic interpretation of a given (Q)SAR (Principle 5), or that there even be multiple mechanistic interpretations of a given model. The absence of a mechanistic interpretation for a model does not mean that a model is not potentially useful in the regulatory context. The intent of Principle 5 is not to reject models that have no apparent mechanistic basis, but to ensure that some consideration is given to the possibility of a mechanistic association between the descriptors used in a model and the endpoint being predicted, and to ensure that this association is documented”. Thus, the models developed in the present work comply well with all five OECD principles.

3.7 Comparison with previously reported models on this data set

Lei et al. (2010) developed QSPR models on the dataset of the present work using 421 different types of molecular descriptors. The global MLR model developed by them showed an $r^2$ value of 0.789 with a corresponding leave-one-out cross-validation $R^2$ ($Q^2$) of 0.785. In our present work, we have performed the QSPR study with 40 ETA descriptors and 103 non-ETA descriptors. PLS analysis was used as the chemometric tool. Both internal and external validation parameters, along with overall validation parameters, were reported. The models obtained from two different splitting methods using ETA descriptors and applying PLS analysis showed improved statistical quality. A comparison of the results is listed in Table 5.

Table 5 near here

4 Overview and Conclusion

In this work, two different splitting techniques, k-means clustering and principal component analysis (PCA) combined with the duplex method, were used to split the data set into training and test sets for subsequent model development. An attempt was made to find the common descriptors present in the various models, indicating their importance to the adsorption capacity onto activated carbon for a large data set of chemicals. The models were developed to study the predictive ability of ETA parameters in comparison to non-ETA descriptors (topological, structural, physicochemical, electronic and spatial descriptors). In all cases, the data set was first subjected to stepwise regression to find the useful variables, and the selected variables were further subjected to PLS regression. The ETA models from both splitting techniques share eleven descriptors out of a total of twelve. Thus, this finding shows the importance of ETA parameters for adsorption capacity. In contrast, the models obtained from the two splitting techniques with non-ETA parameters share only five descriptors out of twelve. In the case of models using the combined set of descriptors (ETA and non-ETA), seven descriptors out of fourteen were common. The models obtained using ETA descriptors indicate the importance of molecular size, presence of hydrogen bond donor atoms, electron richness, and overall topological nature of molecules towards adsorption capacity onto activated carbon. From the models with non-ETA descriptors, it was found that molecular solvent accessible surface area, presence of electronegative atoms like fluorine, and the connectivity pattern among different groups of a molecule are important for adsorption. The models using the combined set of descriptors show the importance of surface weighted charged partial surface area, topological nature and functionality of a molecule for the adsorption capacity. The present modeling analysis suggests that ETA descriptors are sufficiently rich in chemical information for modeling adsorption capacity onto activated carbon.

Compounds having DModX values higher than the critical values at the 99% confidence level were removed from the test sets, and the external prediction quality of the models was checked again. From Table 6, it is observed that fewer compounds had to be removed from the test sets for all models (ETA, non-ETA and combined) when k-means clustering was used than when principal component analysis (PCA) combined with the duplex method was used as the splitting technique. For the best ETA model (model no. 1), the outlier compounds for the test and training sets are shown in Figures 5 and 6, respectively. Two compounds (compound numbers 139 and 441) present in the test set possess very high DModX values. These two compounds are halogen substituted methane derivatives. Besides these, a few other outlier compounds contain silicon (compound number 2581) or sulphur (compound numbers 1490 and 3089). From the training set, it is observed that compounds having very high DModX values (compound numbers 143, 156, 446, 670, 2419 and 2508) are halogen substituted methane derivatives. The structures of the outlier compounds are shown in Figure 7.

Figure 5 near here

Figure 6 near here

Figure 7 near here

Table 6 near here
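The applicability-domain filtering described above amounts to a threshold test on DModX. The sketch below assumes that DModX values have already been computed by the modelling software and uses the critical value of 2.219 quoted for model 1 at the 99% confidence level; the function name and the example DModX values are hypothetical.

```python
def split_by_dmodx(dmodx_by_compound, critical_value=2.219):
    """Separate compounds inside and outside the applicability domain.

    dmodx_by_compound: mapping of compound number to its DModX value
    critical_value:    DModX threshold (2.219 is the value quoted for model 1)
    """
    inside = {}
    outside = {}
    for compound, dmodx in dmodx_by_compound.items():
        # Compounds above the critical distance are treated as outliers
        if dmodx > critical_value:
            outside[compound] = dmodx
        else:
            inside[compound] = dmodx
    return inside, outside


# Hypothetical DModX values for four test-set compounds
example = {139: 5.8, 441: 6.1, 1333: 0.9, 1483: 1.4}
kept, removed = split_by_dmodx(example)
print(sorted(kept), sorted(removed))
```

External validation metrics would then be recomputed on the retained compounds only, which is the comparison summarised in Table 6.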

Declaration of interest

The authors declare no conflict of interest.

Acknowledgement

The authors thank the Council for Scientific and Industrial Research (CSIR), New Delhi, for providing financial assistance to KR in the form of a major research project.

References

Bansal, R.C., Donnet, J., Stoeckli, H.F., 1988. Active Carbon, Marcel Dekker, New York.

Balci, B., Keskinkan, O., Avci, M., 2011. Use of BDST and an ANN model for prediction of dye adsorption efficiency of Eucalyptus camaldulensis barks in fixed-bed system. Expert System with Applications 38, 949-946.

Consonni, V., Ballabio, D., Todeschini, R., 2009. Comments on the definition of the Q2 parameter for QSAR validation. Journal of Chemical Information and Modeling 49, 1669-1678.

Cerius2 version 4.8 is a product of Accelrys, Inc., San Diego, USA, http://www.accelrys.com/cerius2.

DRAGON version 6.0 software is offered by TALETE SRL, Italy; the software available at http://www.talete.mi.it/products/dragon_description.htm

Easton, J.R., 1995. The dye maker’s view, in: Cooper, P. (Ed.), Colour in Dyehouse Effluent. Society of Dyers and Colourists, Oxford, England, pp. 9–21.

Fan, Y., Shi, L.M., Kohn, K.W., Pommier, Y., Weinstein, J.N., 2001. Quantitative structure-antitumor activity relationships of camptothecin analogs: Cluster analysis and genetic algorithm-based studies. Journal of Medicinal Chemistry 44, 3254-3263.

Franz, M., Arafat, H.A., Pinto, N.G., 2000. Effect of chemical surface heterogeneity on the adsorption mechanism of dissolved aromatics on activated carbon. Carbon 38, 1807-1819.

Figueiredo, J.L., Pereira, M.F.R., Freitas, M.M.A., Órfão, J.J.M., 1999. Modification of the surface chemistry of activated carbons. Carbon 37, 1379-1389.

Golbraikh, A., Tropsha, A., 2002. Beware of q2. Journal of Molecular Graphics and Modeling 20, 269-276.

Gasteiger, J., Marsili, M., 1978. A new model for calculating atomic charges in molecules. Tetrahedron Letters 19, 3181-3184.

Giraudet, S., Pre, P., Tezel, H., Le Cloirec, P., 2006. Estimation of adsorption energies using physical characteristics of activated carbons and VOCs’ molecular properties. Carbon 44, 1873–1883.

Gregg, S.J., Singh, K.S.W., 1982. Adsorption, Surface Area and Porosity. Academic Press, London.

Hawkins, D.M., 2004. The problem of overfitting. Journal of Chemical Information and Computer Science 44, 1-12.

Haghseresht, F., Nouri, S., Finnerty, J.J., Lu, G.Q., 2002. Effects of surface chemistry on aromatic compound adsorption from dilute aqueous solutions by activated carbon. The Journal of Physical Chemistry B 106, 10935-10943.

Karanfil, T., Kilduff, J.E., 1999. Role of granular activated carbon surface chemistry on the adsorption of organic compounds. 1. Priority pollutants. Environmental Science & Technology 33, 3217-3224.

Lei, B., Ma, Y., Li, J., Liu, H., Yao, X., Gramatica, P., 2010. Prediction of the adsorption capability onto activated carbon of a large data set of chemicals by local lazy regression method. Atmospheric Environment 44, 2954-2960.

Le Leuch, L.M., Bandosz, T.J., 2007. The role of water and surface acidity on the reactive adsorption of ammonia on modified activated carbons. Carbon 45, 568-578.

MacQueen, J.B., 1967. Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press 1, 281-297.

MINITAB version 14 is statistical software of Minitab Inc, USA, http://www.minitab.com.

Mohan, D., Pittman Jr., C.U., 2006. Activated carbons and low cost adsorbents for remediation of tri- and hexavalent chromium from water. Journal of Hazardous Materials B137, 762–811.

Moreno-Castilla, C., Rivera-Utrilla, J., López-Ramón, M.V., Carrasco-Marín, F., 1995. Adsorption of some substituted phenols on activated carbons from a bituminous coal. Carbon 33, 845-851.

Mohan, D., Singh, K.P., 2005. Granular activated carbon, in: Lehr, J., Keeley, J., Lehr, J. (Eds.), Water Encyclopedia: Domestic, Municipal, and Industrial Water Supply and Waste Disposal, John Wiley & Sons, New York, pp. 106-115.

Nouri, S., Haghseresht, F., 2002. Adsorption of dissociating aromatic compounds by activated carbon: effects of ionization of the adsorption capacity. Adsorption Science & Technology 20, 417-432.

Nriagu, J.O., Pacyna, J.M., 1988. Quantitative assessment of worldwide contamination of air, water and soils by trace metals. Nature 333, 134–139.

Ojha, P.K., Mitra, I., Das, R.N., Roy, K., 2011. Further exploring rm2 metrics for validation of QSPR models dataset. Chemometrics and Intelligent Laboratory System 107, 194–205.

Pollard, S.J.T., Fowler, G.D., Sollars, C.J., Perry, R., 1992. Lowcost adsorbents for waste and wastewater treatment: a review. Science of the Total Environment 116, 31–52.

Pal, D.K., Sengupta, C., De, A.U., 1988. A new topochemical descriptors (TAU) in molecular connectivity concept: Part I-aliphatic compounds. Indian Journal of Chemistry 27B, 734-739.

Pal, D.K., Sengupta, C., De, A.U., 1989. Introduction of a novel topochemical index and exploitation of group connectivity concept to achieve predictibility in QSAR and RDD. Indian Journal of Chemistry 28B, 261-267.

Radovic, L.R., Moreno-Castilla, C., Rivera-Utrilla, J., 2000. Carbon materials as adsorbents in aqueous solutions, in: Radovic, L.R. (Ed.), Chemistry and Physics of Carbon, vol. 27, Marcel Dekker, Inc., New York.

Roy, K., Das, R.N., 2011. On some novel extended topochemical atom (ETA) parameters for effective encoding of chemical information and modeling of fundamental physicochemical properties. SAR and QSAR in Environmental Research 22, 451-472.

Roy, K., Ghosh, G., 2003. Introduction of Extended Topochemical Atom (ETA) Indices in the Valence Electron Mobile (VEM) Environment as Tools for QSAR/QSPR Studies. Internet Electronic Journal of Molecular Design 2, 599–620.

Roy, P.P., Roy, K., 2008. On some aspects of variable selection for partial least squares regression models. QSAR and Combinatorial Science 27, 302-313.

Radovic, L.R., Silva, I.F., Ume, J.I., Menéndez, J.A., Leony Leon, C.A., Scaroni, A.W., 1997. An experimental and theoretical study of the adsorption of aromatics possessing electron-withdrawing and electron-donating functional groups by chemically modified activated carbons. Carbon 35, 1339-1348.

STATISTICA version 7 is statistical software of Stat Soft Inc, www.statsoft.com

Snedecor, G.W., Cochran, W.G., 1967. Statistical Methods, Oxford & IBH Publishing Co. Pvt. Ltd, New Delhi.

Schuurmann, G., Ebert, R.U., Chen, J., Wang, B., Kuhne, R., 2008. External validation and prediction employing the predictive squared correlation coefficient-test set activity mean vs training set activity mean. Journal of Chemical Information and Modeling 48, 2140-2145.

Service, R.F., 2012. Material Scientists Look to a Data-Intensive Future. Science 335, 1434-1435.

UMETRICS SIMCA-P 10.0, www.umetrics.com, Umea, Sweden, 2002.

Villacañas, F., Pereira, M.F.R., Órfão, J.J.M., Figueiredo, J.L., 2006. Adsorption of simple aromatic compounds on activated carbons. Journal of Colloid and Interface Science 293, 128–136.

Wold, S., 1995. PLS for multivariate linear modeling, in: van de Waterbeemd, H. (Ed.), Chemometric Methods in Molecular Design (Methods and Principles in Medicinal Chemistry), Weinheim-VCH, New York, pp. 195-218.

Wold, S., Eriksson, L., 1995. Validation tools, in: van de Waterbeemd, H. (Ed.), Chemometric Methods in Molecular Design (Methods and Principles in Medicinal Chemistry), Weinheim-VCH, New York, pp. 312–317.

Wold, S., Sjostrom, M., Eriksson, L., 2001. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligence Laboratory System 58, 109–130.

Yap, C.W., 2011. PaDEL-Descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry 32, 1466-1474.

Yaws, C.L., 2003-2004. Yaws’ handbook of thermodynamic and physical properties of chemical compounds: Physical, thermodynamic and transport properties of 5000 organic chemicals compounds, Lamar University, Beaumount, Texas, Norwich, New York.

Zhang, K., Cheung, W.H., Valix, M., 2005. Roles of physical and chemical properties of activated carbon in the adsorption of lead ions. Chemosphere 60, 1129–1140.

Figure captions

Figure 1: PCA score plot of first three components for the standardized descriptor matrix of the combined set (ETA and non-ETA) in case of PCA combined with duplex method as the splitting technique.

Figure 2: PCA score plot of first three components for the standardized descriptor matrix of the combined set (ETA and non-ETA) in case of k-means clustering as the splitting technique.

Figure 3: Scatter plot of observed vs calculated/predicted values of the training set compounds.

Figure 4: Scatter plot of observed vs calculated/predicted values of the test set compounds.

Figure 5: DModX values of the 870 test set compounds at 99% level for model 1. The thick horizontal line signifies the critical DModX value (2.219) at the 99% confidence level.

Figure 6: DModX values of the 2613 training set compounds at 99% level for model 1. The thick horizontal line signifies the critical DModX value (2.219) at the 99% confidence level.

Figure 7: Structures of some outlier compounds of the test and training sets for model 1

Table captions

Table 1: List of ETA descriptors used in the development of QSPR models

Table 2: Categorical list of Non-ETA descriptors used in the development of QSPR models

Table 3: Comparison of models obtained after applying different splitting techniques and chemometric tools

Table 4: External validation of the developed model using parameters proposed by Golbraikh and Tropsha

Table 5: Comparison of statistical parameters of the ETA models developed in the present work with a previously reported model

Table 6: Comparison of predictive quality of models after removing compounds having DModX values higher than the critical values from test sets

Highlights

• QSPR models were developed for adsorption of organic chemicals onto activated carbon

• The ETA indices were used for development of models, which were compared to non-ETA ones

• The ETA indices can be easily calculated from the 2D representation of chemical structure

• The use of ETA descriptors along with non-ETA ones improves the statistical quality of models

• The models can be used for prediction of adsorption of organic compounds on activated carbon


Table 1: List of ETA descriptors used in the development of QSPR models

Sl No. | Parameter | Significance
--- | --- | ---
1 | Σα | Sum of core count of non-hydrogen vertex (α). It is a measure of molecular bulk
2 | Σα/NV | A measure of molecular bulk relative to molecular size [NV is the total number of atoms excluding hydrogen]
3 | Σε | A measure of electron richness in a molecule
4 | Σβs | A measure of electronegative atom count (sigma contribution) of the molecule
5 | Σβ′s = Σβs/NV | A measure of electronegative atom count of the molecule relative to molecular size
6 | Σβns | A measure of electron richness of the molecule (non-sigma contribution)
7 | Σβ′ns = Σβns/NV | A measure of electron richness of the molecule relative to molecular size
8 | Σβ | Sum of β (valence electron mobile) values of all non-hydrogen vertices in a molecule
9 | Σβ′ = Σβ/NV | A measure of electronic features of the molecule relative to molecular size
10 | η | Contribution of the overall topological nature
11 | η′ = η/NV | Contribution of the overall topological nature relative to molecular size
12 | ηlocal | Local composite index. A measure of covalently bonded interaction of the local atoms. Imparts information on local molecular topology.
13 | η′local = ηlocal/NV | Local composite index relative to molecular size. Imparts information on local molecular topology.
14 | ηF | Overall functionality contribution
15 | η′F = ηF/NV | Overall functionality contribution relative to molecular size
16 | ηFlocal | Contribution of local functional groups. Gives a measure of local branching and topology.
17 | η′Flocal = ηFlocal/NV | Contribution of local functional groups relative to molecular size. Gives a measure of local branching and topology.
18 | ηB | Branching index. Explains branchedness of a molecule.
19 | η′B = ηB/NV | Branching index relative to molecular size. A measure of overall branchedness of a molecule.
20 | [Σα]P/Σα | [Σα]P stands for summation of α values of the vertices that are joined to one other non-hydrogen vertex in the molecular graph
21 | [Σα]Y/Σα | [Σα]Y stands for summation of α values of the vertices that are joined to three other non-hydrogen vertices in the molecular graph
22 | [Σα]X/Σα | [Σα]X stands for summation of α values of the vertices that are joined to four other non-hydrogen vertices in the molecular graph
23 | ΔαA = (Σα − [Σα]R)/NV | A measure of count of non-hydrogen heteroatoms [NV is the total number of atoms excluding hydrogen, R is the reference alkane]
24 | ΔαB = ([Σα]R − Σα)/NV | A measure of count of hydrogen bond acceptor atoms and/or polar surface area
25 | ε1 = Σε/N | A measure of electronegative atom count [N is the total number of atoms including hydrogen]
26 | ε2 = [Σε]EH/NV | A measure of electronegative atom count [EH indicates excluding hydrogen]
27 | ε3 = [Σε]R/NR | [R is the reference alkane]
28 | ε4 = [Σε]SS/NSS | [SS indicates saturated carbon skeleton]
29 | ε5 = ([Σε]EH + [Σε]XH)/(NV + NXH) | [XH indicates hydrogen attached to a heteroatom]
30 | ΔεA = ε1 − ε3 | A measure of contribution of unsaturation and electronegative atom count
31 | ΔεB = ε1 − ε4 | A measure of contribution of unsaturation
32 | ΔεC = ε3 − ε4 | A measure of contribution of electronegativity
33 | ΔεD = ε2 − ε5 | A measure of contribution of hydrogen-bond donor atoms
34 | ψ1 = (Σα/NV)/ε2 = Σα/[Σε]EH | A measure of hydrogen-bonding propensity of the molecules and/or polar surface area
35 | ΔψA = 0.714 − ψ1 | A measure of hydrogen-bonding propensity of the molecules
36 | ΔψB = ψ1 − 0.714 | A measure of hydrogen-bonding propensity of the molecules
37 | Δβ = Σβns − Σβs | A measure of relative unsaturation content
38 | Δβ′ = Δβ/NV | A relative measure of relative unsaturation content
39 | Σβns(δ) | A measure of lone electrons entering into resonance
40 | Σβ′ns(δ) = Σβns(δ)/NV | A measure of lone electrons entering into resonance relative to molecular size
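The difference-type indices in Table 1 (entries 30 to 38) are simple arithmetic combinations of the primary indices. The following minimal Python sketch illustrates that arithmetic only. The function name, the argument layout, and the numerical inputs are hypothetical placeholders, not values from the data set, and the primary indices (ε1 to ε5, ψ1, Σβs, Σβns, NV) are assumed to have been computed elsewhere.

```python
def derived_eta_indices(eps, psi1, sum_beta_s, sum_beta_ns, n_v):
    """Combine primary ETA indices into the difference-type indices of Table 1.

    eps         : dict mapping 1..5 to epsilon_1 .. epsilon_5 (entries 25-29)
    psi1        : psi_1 (entry 34)
    sum_beta_s  : sigma contribution, sum of beta_s (entry 4)
    sum_beta_ns : non-sigma contribution, sum of beta_ns (entry 6)
    n_v         : number of non-hydrogen atoms (N_V)
    """
    delta_beta = sum_beta_ns - sum_beta_s
    return {
        "delta_eps_A": eps[1] - eps[3],   # unsaturation + electronegative atom count
        "delta_eps_B": eps[1] - eps[4],   # unsaturation
        "delta_eps_C": eps[3] - eps[4],   # electronegativity
        "delta_eps_D": eps[2] - eps[5],   # hydrogen-bond donor atoms
        "delta_psi_A": 0.714 - psi1,      # hydrogen-bonding propensity
        "delta_psi_B": psi1 - 0.714,
        "delta_beta": delta_beta,         # relative unsaturation content
        "delta_beta_prime": delta_beta / n_v,
    }


# Hypothetical input values, chosen only to exercise the arithmetic.
example = derived_eta_indices(
    eps={1: 0.60, 2: 0.80, 3: 0.50, 4: 0.45, 5: 0.75},
    psi1=0.600,
    sum_beta_s=4.0,
    sum_beta_ns=3.0,
    n_v=8,
)
```

Some software implementations may additionally truncate negative differences to zero; the sketch follows the formulas as listed in the table and does not do so.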

Table 2: Categorical list of Non-ETA descriptors used in the development of QSPR models

Category of descriptors | Name of the descriptors | Definition | Comment, if any
--- | --- | --- | ---
Topological | Balaban's J index (Jx) | Average distance-based connectivity index |
Topological | Kappa shape index (¹κ, ²κ, ³κ, ¹κam, ²κam, ³κam) | These indices capture different aspects of molecular shape. |
Topological | Subgraph count index (SC-0, SC-1, SC-2, SC-3_P, SC-3_C, SC-3_CH) | This is the number of subgraphs of a given type and order. |
Topological | Flexibility index (Φ) | This is a descriptor based on structural properties that restrict a molecule from being 'infinitely flexible' |
Topological | Wiener | Total number of bonds between all pairs of atoms in the hydrogen-suppressed graph. |
Topological | Molecular connectivity index (⁰χ, ¹χ, ²χ, ³χP, ³χC, ³χCH, ⁰χv, ¹χv, ²χv, ³χvP, ³χvC, ³χvCH) | These indices are based on graph-theoretical invariant introduced by Randic. |
Topological | Zagreb | Sum of the squares of vertex valencies |
Topological | E-State parameters (S_sCH3, S_dCH2, S_ssCH2, S_tCH, S_dsCH, S_aaCH, S_sssCH, S_ddC, S_tsC, S_dssC, S_aasC, S_aaaC, S_ssssC, S_sNH2, S_ssNH, S_tN, S_dsN, S_aaN, S_sssN, S_sOH, S_dO, S_ssO, S_aaO, S_sSH, S_ssS, S_aaS, S_sF, S_sCl, S_sBr, S_sI) | Electrotopological state parameters of atoms having different electronic and topological environment. |
Structural | Chiral centers | Number of chiral centers (R or S) in a molecule. |
Structural | MW | Molecular weight |
Structural | Rotlbonds | Number of rotatable bonds. |
Structural | H-bond acceptor | Number of hydrogen-bond acceptors. |
Structural | H-bond donor | Number of hydrogen-bond donors. |
Physicochemical | Molref | Molar refractivity | Ghose, A.K., Crippen, G.M., 1986. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition coefficients as a Measure of hydrophobicity. Journal of Computational Chemistry 7, 565-577
Physicochemical | AlogP | Log of the partition coefficient | Ghose, A.K., Crippen, G.M., 1986. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition coefficients as a Measure of hydrophobicity. Journal of Computational Chemistry 7, 565-577
Physicochemical | AlogP98 | Log of partition coefficient | Ghose, A., Viswanadhan, V.N., Wendoloski, J.J., 1998. Prediction of hydrophobic (lipophilic) properties of small organic molecules, using fragmental methods. An analysis of ALOGP and CLOGP methods. Journal of Physical Chemistry 102, 3762–3772.
Physicochemical | logP | Octanol water partition coefficient |
Electronic | Dipole moment | It is a vector quantity which encodes displacement with respect to the centre of gravity of positive and negative charges in a molecule |
Spatial | Jurs descriptors (Jurs SASA, Jurs PPSA 1, Jurs PNSA 1, Jurs DPSA 1, Jurs PPSA 2, Jurs PNSA 2, Jurs DPSA 2, Jurs PPSA 3, Jurs PNSA 3, Jurs DPSA 3, Jurs FPSA 1, Jurs FNSA 1, Jurs FPSA 2, Jurs FNSA 2, Jurs FPSA 3, Jurs FNSA 3, Jurs WPSA 1, Jurs WNSA 1, Jurs WPSA 2, Jurs WNSA 2, Jurs WPSA 3, Jurs WNSA 3, Jurs RPCG, Jurs RNCG, Jurs RPCS, Jurs RNCS, Jurs TPSA, Jurs TASA, Jurs RPSA, Jurs RASA) | These are calculated by mapping atomic partial charges on solvent accessible surface areas of individual atoms | This set of descriptors combines shape and electronic information to characterize the molecules.
Spatial | Area | Van der Waals area of a molecule |
Spatial | Vm | Molecular volume inside the contact surface. |
Spatial | PMI-mag | It calculates the principal moments of inertia about the principal axes of a molecule. |
Spatial | Density | The ratio of molecular weight to molecular volume. | It reflects the types of atoms and how tightly they are packed in a molecule.
Spatial | RadOfGyration | √[Σ(xi² + yi² + zi²)/N] | N is the number of atoms and x, y, and z are the atomic coordinates relative to the center of mass.
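As an illustration of the RadOfGyration entry in Table 2, the sketch below evaluates √[Σ(xi² + yi² + zi²)/N] with coordinates taken relative to the center of mass. The function name and the optional `masses` argument are assumptions made for this example; if no masses are supplied, all atoms are treated as having equal mass, which may differ from the convention of the descriptor software actually used.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Radius of gyration from Cartesian coordinates.

    coords : list of (x, y, z) tuples, one per atom
    masses : optional list of atomic masses used only to locate the
             center of mass (assumption: equal masses if omitted)
    """
    n = len(coords)
    if masses is None:
        masses = [1.0] * n
    total_mass = sum(masses)
    # Center of mass along each Cartesian axis.
    com = [
        sum(m * c[axis] for m, c in zip(masses, coords)) / total_mass
        for axis in range(3)
    ]
    # Sum of squared distances of the atoms from the center of mass.
    squared = sum(
        (c[0] - com[0]) ** 2 + (c[1] - com[1]) ** 2 + (c[2] - com[2]) ** 2
        for c in coords
    )
    return math.sqrt(squared / n)


# Two equal-mass atoms placed symmetrically about the origin.
rg_example = radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```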

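Model no. | Splitting technique | Type of descriptors
--- | --- | ---
1 | k-means clustering | ETA
2 | k-means clustering | NonETA
3 | k-means clustering | Combined
4 | PCA combined with duplex method | ETA
5 | PCA combined with duplex method | NonETA
6 | PCA combined with duplex method | Combined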

ηF, [Σα]P/Σα,

1 and

D , Σα/Nv, η′, η'F, Σε,



2

Wiener,

F

[Σα]P/Σα,

 , η' , η′, Σα/N ,  D , S_sF, Jurs 3

P

X

F,

F

and Σβns

'

 D ,   , S_sF, Jurs WPSA1, η,   / N ,  , [Σα]P/Σα

Jurs SASA, Jurs-WNSA-1, 3p, Jx, Density, S_sF, PMI mag, Jurs FNSA-1, AlogP98, Wiener, SC-3-P and 1m Jurs SASA, Density, 1, η′, η'F,

v

AlogP98,

V and  B Σα, Σα/N ,  D , Σε, η′, η' η, η , [Σα] /Σα , [Σα] /Σα,  1 and  2

WPSA-1,

[Σα]Y/Σα,

Jurs SASA,

1 v

 , S_sF, Jurs FPSA-3, 3 CH and

3

Jurs PNSA-3, Wiener,

Density,

Jurs SASA, Jurs PPSA-3, 3p, Jx,

2

 A , η,

  , 

Descriptors

Table 3: Comparison of models obtained after applying different splitting techniques and chemometric tools

Model no. | LVs | R² | R²a | F (df) | Q²int | Average r²m(LOO) | Δr²m(LOO)
--- | --- | --- | --- | --- | --- | --- | ---
1 | 10 | 0.8147 | 0.8139 | 1144.5 (10, 2602) | 0.8059 | 0.7219 | 0.1633
2 | 10 | 0.7443 | 0.7432 | 757.49 (10, 2602) | 0.7341 | 0.6263 | 0.2123
3 | 10 | 0.8191 | 0.8184 | 1178.92 (10, 2602) | 0.813 | 0.732 | 0.1592
4 | 10 | 0.8101 | 0.8094 | 1109.8 (10, 2601) | 0.7993 | 0.7123 | 0.1672
5 | 10 | 0.7281 | 0.7270 | 696.54 (10, 2601) | 0.7196 | 0.6039 | 0.2259
6 | 10 | 0.8263 | 0.8256 | 1237.32 (10, 2601) | 0.8170 | 0.7366 | 0.156

Table 3 (continued)

Model no. | Q²ext(F1) | Q²ext(F2) | Q²ext(F3) | Average r²m(test) | Δr²m(test) | Average r²m(overall) | Δr²m(overall)
--- | --- | --- | --- | --- | --- | --- | ---
1 | 0.7914 | 0.7909 | 0.8492 | 0.7108 | 0.0613 | 0.7194 | 0.1425
2 | 0.7039 | 0.7033 | 0.786 | 0.6064 | 0.0437 | 0.6208 | 0.1756
3 | 0.8023 | 0.8019 | 0.8570 | 0.7332 | 0.0039 | 0.7311 | 0.1247
4 | 0.827 | 0.8239 | 0.622 | 0.7099 | 0.1587 | 0.7174 | 0.1692
5 | 0.7299 | 0.725 | 0.4105 | 0.559 | 0.2417 | 0.5886 | 0.2352
6 | 0.8145 | 0.8112 | 0.5952 | 0.6921 | 0.1697 | 0.7203 | 0.165
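For orientation, the three external validation metrics reported in Table 3 can be sketched as below. This is a minimal illustration assuming the commonly used definitions: Q²ext(F1) scales the prediction error by the test-set deviation from the training mean, Q²ext(F2) by the deviation from the test mean, and Q²ext(F3) by the training-set variance. The function name and the toy numbers are hypothetical and are not taken from the data set.

```python
def q2_external(y_train, y_test, y_pred):
    """Return Q2ext(F1), Q2ext(F2) and Q2ext(F3) for a test set.

    y_train : observed responses of the training set
    y_test  : observed responses of the test set
    y_pred  : model predictions for the test set (same order as y_test)
    """
    n_train, n_test = len(y_train), len(y_test)
    mean_train = sum(y_train) / n_train
    mean_test = sum(y_test) / n_test

    # Predictive residual sum of squares on the test set.
    press = sum((obs - pred) ** 2 for obs, pred in zip(y_test, y_pred))

    ss_test_vs_train_mean = sum((obs - mean_train) ** 2 for obs in y_test)
    ss_test_vs_test_mean = sum((obs - mean_test) ** 2 for obs in y_test)
    ss_train = sum((obs - mean_train) ** 2 for obs in y_train)

    return {
        "Q2_F1": 1.0 - press / ss_test_vs_train_mean,
        "Q2_F2": 1.0 - press / ss_test_vs_test_mean,
        "Q2_F3": 1.0 - (press / n_test) / (ss_train / n_train),
    }


# Toy data, used only to exercise the formulas.
toy = q2_external(
    y_train=[1.0, 2.0, 3.0, 4.0, 5.0],
    y_test=[2.0, 4.0],
    y_pred=[2.5, 3.5],
)
```

Because the test-set mean minimizes the sum of squared deviations of the test responses, Q²ext(F2) can never exceed Q²ext(F1) under these definitions, whereas Q²ext(F3) depends on the spread of the training set and can differ substantially from both.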

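Table 4: External validation of the developed model using parameters proposed by Golbraikh and Tropsha

Splitting technique | Type of descriptors | Model no. | (r² − r0²)/r² Test | (r² − r0²)/r² Training | (r² − r0²)/r² Overall | (r² − r′0²)/r² Test | (r² − r′0²)/r² Training | (r² − r′0²)/r² Overall
--- | --- | --- | --- | --- | --- | --- | --- | ---
k-means clustering | ETA | 1 | 5.78×10⁻³ | 1.94×10⁻⁵ | 3.13×10⁻⁴ | 2.64×10⁻² | 5.24×10⁻² | 4.65×10⁻²
k-means clustering | NonETA | 2 | 2.04×10⁻² | 1.10×10⁻⁵ | 1.13×10⁻³ | 4.63×10⁻² | 0.116 | 9.92×10⁻²
k-means clustering | Combined | 3 | 1.20×10⁻² | 2.31×10⁻⁵ | 6.17×10⁻⁴ | 1.09×10⁻² | 4.80×10⁻² | 3.81×10⁻²
PCA combined with duplex method | ETA | 4 | 2.36×10⁻³ | 4.91×10⁻⁶ | 2.97×10⁻⁴ | 6.78×10⁻² | 5.70×10⁻² | 6.12×10⁻²
PCA combined with duplex method | NonETA | 5 | 6.28×10⁻³ | 2.50×10⁻⁵ | 8.35×10⁻⁴ | 0.217 | 0.140 | 0.1168
PCA combined with duplex method | Combined | 6 | 2.45×10⁻³ | 3.73×10⁻³ | 3.28×10⁻⁴ | 7.91×10⁻² | 4.60×10⁻² | 5.82×10⁻²

Table 4 (continued)

Model no. | k Test | k Training | k Overall | k′ Test | k′ Training | k′ Overall
--- | --- | --- | --- | --- | --- | ---
1 | 0.9998 | 0.9995 | 0.9996 | 0.9905 | 0.9880 | 0.9886
2 | 0.9982 | 0.9996 | 0.9992 | 0.9880 | 0.9835 | 0.9846
3 | 0.9975 | 0.9997 | 0.9991 | 0.9934 | 0.9883 | 0.9895
4 | 1.006 | 0.9996 | 1.001 | 0.9759 | 0.9912 | 0.9874
5 | 1.008 | 0.9996 | 1.001 | 0.9646 | 0.9875 | 0.9819
6 | 0.9961 | 0.9996 | 0.9988 | 0.9848 | 0.9920 | 0.9902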
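The quantities in Table 4 can be illustrated with the short sketch below. It assumes one common reading of the Golbraikh and Tropsha parameters: k is the slope of the regression of observed on predicted values through the origin, k′ is the slope of the reverse regression, and r0² and r′0² are the corresponding coefficients of determination for the lines forced through the origin. The function name is hypothetical, and other formulations of r0² exist in the literature, so the sketch should be read as illustrative rather than as the exact procedure used for the table.

```python
def golbraikh_tropsha(y_obs, y_pred):
    """Slopes through the origin and the (r2 - r0^2)/r2 criteria."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n

    ss_obs = sum((o - mean_obs) ** 2 for o in y_obs)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    r2 = cov ** 2 / (ss_obs * ss_pred)  # squared Pearson correlation

    cross = sum(o * p for o, p in zip(y_obs, y_pred))
    k = cross / sum(p * p for p in y_pred)        # observed = k * predicted
    k_prime = cross / sum(o * o for o in y_obs)   # predicted = k' * observed

    # Coefficients of determination for the regressions through the origin.
    r0_sq = 1.0 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / ss_obs
    r0_prime_sq = 1.0 - sum(
        (p - k_prime * o) ** 2 for o, p in zip(y_obs, y_pred)
    ) / ss_pred

    return {
        "r2": r2,
        "k": k,
        "k_prime": k_prime,
        "crit": (r2 - r0_sq) / r2,
        "crit_prime": (r2 - r0_prime_sq) / r2,
    }


# Predictions that are exactly twice the observed values (toy case).
gt_toy = golbraikh_tropsha([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In the Golbraikh and Tropsha scheme a model is usually regarded as acceptable when at least one of the two ratios is below 0.1 and the corresponding slope lies between 0.85 and 1.15.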

Table 5: Comparison of statistical parameters of the ETA models developed in the present work with a previously reported model

Reference | Modeling technique | Splitting technique | Q²int | Q²ext(F1) | Q²ext(F2) | Q²ext(F3) | r²
--- | --- | --- | --- | --- | --- | --- | ---
Lei et al. 2010 | Global MLR | PCA combined with duplex method (a) | 0.785 | 0.773 | --- | --- | 0.789
Present work | PLS | k-means clustering (b) | 0.8059 | 0.7914 | 0.7909 | 0.8492 | 0.8147
Present work | PLS | PCA combined with duplex method (a) | 0.7993 | 0.827 | 0.8239 | 0.4814 | 0.8101

(a) ntraining = 2612, ntest = 871; (b) ntraining = 2613, ntest = 870

Table 6: Comparison of predictive quality of models after removing compounds having DModX values higher than the critical values from test sets

Splitting technique | Type of descriptors | Critical DModX value | No. of compounds removed from test set | Q²ext(F1) | Q²ext(F2) | Q²ext(F3) | Average r²m(test) | Δr²m(test) | Average r²m(overall) | Δr²m(overall)
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
k-means clustering | ETA | 2.219 | 35 | 0.8017 | 0.7995 | 0.8886 | 0.7404 | 0.065 | 0.7234 | 0.1251
k-means clustering | Non-ETA | 2.219 | 34 | 0.7456 | 0.7425 | 0.8544 | 0.6694 | 0.035 | 0.6318 | 0.1695
k-means clustering | Combined | 1.875 | 28 | 0.8173 | 0.8162 | 0.8923 | 0.7586 | 0.0558 | 0.7349 | 0.1215
PCA combined with duplex method | ETA | 2.219 | 92 | 0.8356 | 0.8353 | 0.7963 | 0.7736 | 0.0436 | 0.7278 | 0.1326
PCA combined with duplex method | NonETA | 2.219 | 116 | 0.7406 | 0.7381 | 0.5487 | 0.6191 | 0.219 | 0.6115 | 0.2274
PCA combined with duplex method | Combined | 1.875 | 75 | 0.8057 | 0.8046 | 0.7135 | 0.7298 | 0.0958 | 0.7337 | 0.1363