Prediction of hydrophile–lipophile balance values of anionic surfactants using a quantitative structure–property relationship


Journal of Colloid and Interface Science 336 (2009) 773–779

Contents lists available at ScienceDirect

Journal of Colloid and Interface Science www.elsevier.com/locate/jcis

Feng Luan a,*, Huitao Liu a, Yuan Gao a, Qingzhong Li a, Xiaoyun Zhang b, Yun Guo c

a Department of Applied Chemistry, Yantai University, Yantai 264005, People's Republic of China
b Department of Chemistry, Lanzhou University, Lanzhou 730000, People's Republic of China
c National Laboratory of Vacuum & Cryogenics Technology and Physics, Lanzhou Institute, Lanzhou 730000, People's Republic of China

Article info

Article history:
Received 18 December 2008
Accepted 1 April 2009
Available online 10 April 2009

Keywords:
Hydrophile–lipophile balance (HLB)
Stepwise multiple linear regression (MLR)
Radial basis function neural network (RBFNN)
Quantitative structure–property relationship (QSPR)

Abstract

A quantitative structure–property relationship study was performed on the hydrophile–lipophile balance (HLB) values of anionic surfactants. Stepwise multiple linear regression (MLR) and a nonlinear radial basis function neural network (RBFNN) were used to build the models. A four-descriptor equation with a squared correlation coefficient (R²) of 0.983 and a root mean square error (RMS) of 1.7309 was obtained for the training set, with R² = 0.989 and RMS = 1.3509 for the external test set. The RBFNN model gave better results: R² = 0.997 and RMS = 0.6750 for the training set, and R² = 0.991 and RMS = 1.1895 for the test set. The QSPR model established may provide a new, powerful method for predicting HLB values of anionic surfactants.

© 2009 Elsevier Inc. All rights reserved.

* Corresponding author. Fax: +86 535 6902063. E-mail address: fl[email protected] (F. Luan).
0021-9797/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jcis.2009.04.002

1. Introduction

Amphiphilicity is a fundamental characteristic of any surfactant. To assess it quantitatively, the so-called hydrophile–lipophile balance (HLB) is widely used. The HLB indicates the relative strength of the hydrophilic and hydrophobic portions of the molecule and can be used to characterize the relative affinity of surfactants for aqueous and organic phases [1]. The most commonly used expressions of the HLB of surfactants are Griffin's HLB values [2,3] and Davies' HLB values [4]. A high HLB value generally indicates good surfactant solubility in water, while a low HLB value indicates a lower aqueous solubility and a higher relative affinity for the organic phase. There are other independent methods for defining the HLB, including phase inversion temperature, polarity indexes, surfactant affinity difference, etc. [5]. Besides the semiempirical methods noted above, there are some experimental methods for determining HLB values [6–9].

Quantitative structure–activity and structure–property relationship (QSAR/QSPR) studies are unquestionably of great importance in modern chemistry and biochemistry [10]. Currently, these methods are increasingly employed in the prediction of chemical and physical properties, or bioactivities, of different types of molecules. In the field of surfactant technology, these methods have potential applications for risk assessment purposes, the investigation of toxicity mechanisms, and product development [11,12]. Studies of this kind, concerning the prediction of physical properties of surfactants such as critical micelle concentration, surface tension, and cloud point, have been largely developed with the help of QSPR [13–19]. These successful applications have inspired us to perform a more exhaustive study in order to validate the applicability of traditional QSPR in this field. To our knowledge, no such information is available for predicting the HLB of surface-active compounds, and this work is the first to study this property by the QSPR method. We therefore expect that more general structure–HLB relationships can be developed through the systematic QSPR approach.

The aim of this study is to develop QSPR models for estimating the HLB values of anionic surfactants. Descriptors calculated from the molecular structures alone were used to represent the characteristics of the compounds. The stepwise multiple linear regression (MLR) technique and nonlinear radial basis function neural networks (RBFNN) were employed to develop the QSPR models. Finally, we also wish to demonstrate which structural features or groups are likely to be most important to the HLB values of anionic surfactants.

2. Materials and methods

2.1. Data set

The studied anionic surfactants (Table 1) were collected from the literature [9]. The compounds studied covered a wide range


Table 1
Compounds, experimental and calculated HLB.

No.   Anionic surfactant      Observed HLB  MLR pred.  Residual  RBFNN pred.  Residual

Linear alkylsulfates
1*    C8H17SO4−               42            42.09      0.090     40.38        1.620
2     C11H23SO4−              40.5          40.82      0.316     40.35        0.154
3     C12H25SO4−              40.1          40.40      0.302     40.04        0.061
4     C13H27SO4−              39.6          39.99      0.394     39.61        0.013
5     C14H29SO4−              39.4          39.59      0.191     39.08        0.319
6*    C15H31SO4−              38.7          39.21      0.505     38.46        0.243
7     C16H33SO4−              38.2          38.81      0.610     37.75        0.445
8     C18H37SO4−              37.2          38.81      1.610     37.75        0.555

Branched alkylsulfate
9     C6H13CH(CH3)SO4−        42.2          43.25      1.050     41.66        0.542
10    C8H17CH(CH3)SO4−        41.3          42.38      1.083     41.75        0.453
11*   C12H25CH(CH3)SO4−       39.4          40.74      1.343     40.02        0.621
12    C16H33CH(CH3)SO4−       37.6          38.65      1.049     37.11        0.494
13    C8H17CH(C2H5)SO4−       41            40.72      0.282     40.25        0.750
14    C11H23CH(C2H5)SO4−      39.6          39.50      0.102     40.55        0.949
15    C12H25CH(C2H5)SO4−      39.1          39.10      0.005     40.20        1.097
16*   C9H19CH(C4H9)SO4−       39.9          37.49      2.411     39.98        0.083
17    C10H21CH(C4H9)SO4−      39.4          37.09      2.307     39.82        0.422
18    C7H15CH(C7H15)SO4−      39.9          36.99      2.907     39.78        0.123
19    C8H17CH(C7H15)SO4−      39.6          36.60      2.999     39.41        0.192
20    C9H19CH(C9H19)SO4−      38.5          35.43      3.074     37.11        1.389
21*   C12H25CH(C3H7)SO4−      39            36.78      2.218     39.56        0.555
22    C10H21CH(C3H7)SO4−      39.2          36.65      2.548     39.46        0.258
23    C13H27CH(CH3)SO4−       38.4          40.14      1.743     38.66        0.260
24    C12H25CH(C2H5)CH2SO4−   38.5          40.47      1.968     39.50        0.995
25    C11H23CH(C3H7)CH2SO4−   38.7          39.25      0.549     38.89        0.189
26*   C10H21CH(C4H9)CH2SO4−   38.9          39.28      0.376     38.91        0.011
27    C9H19CH(C5H11)CH2SO4−   39.1          39.28      0.178     38.86        0.238
28    C8H17CH(C6H13)CH2SO4−   39.2          39.13      0.069     38.84        0.356
29    C7H15CH(C7H15)CH2SO4−   39.4          39.29      0.110     38.91        0.489

Linear alkylsulfonate
30    C6H13SO3−               15.2          13.47      1.734     15.78        0.580
31*   C8H17SO3−               14.2          12.61      1.594     15.03        0.830
32    C10H21SO3−              13.3          11.78      1.522     14.07        0.768
33    C12H25SO3−              12.3          10.99      1.310     12.98        0.680
34    C13H27SO3−              11.8          10.58      1.218     12.36        0.563
35    C14H29SO3−              11.4          10.21      1.190     11.84        0.435
36*   C15H31SO3−              10.9          9.82       1.081     11.22        0.321
37    C16H33SO3−              10.4          9.44       0.964     10.69        0.293

Alkyl benzenesulfonate
38    C6H13C6H4SO3−           13.5          15.34      1.841     12.67        0.826
39    C7H15C6H4SO3−           13            14.80      1.799     12.43        0.566
40    C8H17C6H4SO3−           12.5          14.29      1.788     12.13        0.374
41*   C10H21C6H4SO3−          11.6          13.30      1.702     11.37        0.233
42    C12H25C6H4SO3−          10.6          12.38      1.784     10.48        0.116
43    C14H29C6H4SO3−          9.7           11.50      1.798     9.53         0.169
44    C16H33C6H4SO3−          8.7           10.60      1.905     8.54         0.161
45    C17H35C6H4SO3−          8.2           10.19      1.987     8.03         0.166
46*   C18H37C6H4SO3−          7.8           9.74       1.943     7.52         0.276

Dodecyl polyoxyethylene sulfate
47    C12H25(OC2H4)1SO4−      39.5          45.41      5.906     39.78        0.281
48    C12H25(OC2H4)2SO4−      39.3          38.23      1.072     38.77        0.528
49    C12H25(OC2H4)3SO4−      39.1          36.78      2.317     39.40        0.297
50    C12H25(OC2H4)4SO4−      38.8          36.19      2.614     38.78        0.021

Alkyl acetate 2-sulfonates
51*   C6H13OOCCH2SO3−         17.1          16.26      0.842     16.04        1.057
52    C8H17OOCCH2SO3−         16.1          15.34      0.758     16.15        0.046
53    C10H21OOCCH2SO3−        15.2          14.05      1.146     14.64        0.558

Alkyl propionate 3-sulfonate
54    C8H17OOC(CH2)2SO3−      15.7          15.87      0.175     15.77        0.073
55    C10H21OOC(CH2)2SO3−     14.7          15.00      0.298     15.32        0.621
56*   C12H25OOC(CH2)2SO3−     13.8          14.15      0.347     14.03        0.226
57    C14H29OOC(CH2)2SO3−     12.8          13.32      0.518     12.15        0.650
58    C14H29COO(OC2H4)SO3−    13.13         15.71      2.581     14.22        1.092

Linear alkyl acetate
59    C10H21COO−              21.4          20.52      0.881     19.24        2.159
60    C12H25COO−              20.4          19.75      0.647     21.48        1.081
61*   C14H29COO−              19.5          19.00      0.501     19.86        0.361
62    C16H33COO−              18.5          18.26      0.244     18.89        0.389
63    C18H37COO−              17.6          17.52      0.081     18.63        1.029

Fluorinated linear alkylsulfonate
64    C7F15SO3−               11.9          12.61      0.709     11.86        0.039
65    C8F17SO3−               11            12.19      1.193     10.71        0.294

Fluorinated linear alkyl acetate
66*   C6F13COO−               20.9          21.26      0.358     24.80        3.899
67    C7F15COO−               20            20.74      0.736     19.97        0.034
68    C8F17COO−               19.1          20.34      1.236     19.60        0.504
69    C10F21COO−              17.4          16.39      1.005     17.00        0.404
70    C12F25COO−              15.7          15.44      0.263     17.01        1.306
71*   CF3C8H17COO−            21.4          19.71      1.686     20.75        0.647
72    CF3C10H21COO−           20.5          18.73      1.765     19.22        1.281
73    CF3C12H25COO−           20            17.83      2.170     18.67        1.330

* Compound in the external test set.

and included linear and branched alkylsulfates, linear alkylsulfonates, alkyl benzenesulfonates, dodecyl polyoxyethylene sulfates, alkyl acetate 2-sulfonates, alkyl propionate 3-sulfonates, linear alkyl acetates, fluorinated linear alkylsulfonates, and fluorinated linear alkyl acetates. The HLB values taken as experimental values for these compounds were calculated by Griffin's method [2], except for the branched alkylsulfates, whose values were obtained from the critical micelle concentration [20]. The data set was then split into a training set and an external test set, which were used for both the linear and the nonlinear models. The training set used to build the models consists of 58 compounds. The test set used to evaluate the prediction ability of the models consists of 15 compounds. Members of each set were assigned randomly.

2.2. Molecular optimization and descriptor generation

To obtain a QSPR model, the compounds must be represented by molecular descriptors that retain as much structural information as possible. Here, five classes of descriptors (constitutional, topological, geometrical, electrostatic, and quantum-chemical) were calculated. The descriptors were generated as follows. The anionic part of each compound was drawn using ISIS Draw 2.4 [21] and preoptimized using the molecular mechanics force field method (MM+) available in HyperChem 7.0 [22]. The molecular structures were optimized using the Polak–Ribiere algorithm until the root mean square gradient was equal to or less than 0.01. A more precise optimization was then done with the semiempirical PM3 method [23] in MOPAC [24]. Thereafter, CODESSA PRO [25,26] was used to calculate the five types of molecular descriptors. Altogether, 466 descriptors were calculated for each of the 73 anionic surfactants studied.

2.3. Selection of molecular descriptors

A successful QSPR model depends on the selection of suitable descriptors. If molecular structures are represented by improper descriptors, they will not lead to reasonable predictions. The feature selection process entails pruning the descriptor pool through the heuristic method (HM) available in the framework of the CODESSA program [25,26]. HM can either quickly give a good estimate of the quality of correlation to expect from the data or derive several best regression models. It also shows which descriptors have bad or missing values, are insignificant (from the standpoint of a single-parameter correlation), or are highly intercorrelated. A detailed discussion of the HM can be found in Ref. [26]; only the main steps of the method are given here.

The heuristic method of descriptor selection starts with a preselection that eliminates descriptors that are not available for every structure; descriptors having a small variation in magnitude over all structures; descriptors that give an F-test value below 1.0 in the one-parameter correlation; and descriptors whose t values are less than a user-specified value, etc. This procedure orders the descriptors by decreasing correlation coefficient when used in one-parameter correlations. Following the preselection of descriptors, MLR models are developed in a stepwise procedure.

2.4. Methodology of modeling

The descriptors and the HLB values of the set of 73 anionic surfactants were correlated by multiple linear regression (MLR) analysis and artificial neural networks. Forward stepwise multiple regression analysis, a commonly used method in QSRR study, was employed to establish the quantitative regression models [27,28].

The theory of RBFNN has been presented extensively elsewhere [29,30], so only a brief description of the RBFNN principle is given here. Fig. 1 shows the basic network architecture. It consists of an input layer, a hidden layer, and an output layer. The input layer does not process the information; it only distributes the input vectors to the hidden layer. The hidden layer of RBFNN consists of a number of RBF units (nh) and a bias (bk). Each hidden layer unit represents a single radial basis function, with an associated center position and width. Each neuron in the hidden layer employs a radial basis function as a nonlinear transfer function to operate on the input data.

Fig. 1. The architecture of RBFNN (input layer: x1, x2, x3, ..., xn; hidden layer: RBF units hj with centers cj and widths rj, plus a bias; weights wkj; output layer: yk).

The most often used RBF is a Gaussian function that is characterized by a center (cj) and a width (rj). The RBF function performs the nonlinear transformation by measuring the Euclidean distance between the input vector (X) and the radial basis function center (cj). The RBF in the hidden layer is

$$h_j(X) = \exp\left(-\lVert X - c_j \rVert^2 / r_j^2\right). \tag{1}$$

In this equation, hj is the notation for the output of the jth RBF unit. For the jth RBF, cj and rj are the center and the spread, respectively. The operation of the output layer is linear, which is

$$y_k(X) = \sum_{j=1}^{n_h} w_{kj}\, h_j(X) + b_k, \tag{2}$$

where yk is the kth output unit for the input vector X, wkj is the weight connection between the kth output unit and the jth hidden layer unit, and bk is the bias. It can be seen from Eqs. (1) and (2) that designing an RBFNN involves the selection of the centers, the number of hidden layer units, the width, and the weights. There are various ways of selecting the centers, such as random subset selection, K-means clustering, the orthogonal least-squares learning algorithm, RBF-PLS, etc. The widths of the radial basis functions can either be chosen the same for all units or be chosen differently for each unit. In this paper, considerations were limited to Gaussian functions with a constant width, which was the same for all units. The connection weights between the hidden layer and the output layer are adjusted using a least-squares solution after the centers and the width of the radial basis functions have been selected. The overall performance of RBFNN is evaluated in terms of a root mean squared error (RMS) according to the equation

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n_k} (y_i - \hat{y}_i)^2}{n_k}}, \tag{3}$$

where $y_i$ is the desired output and $\hat{y}_i$ is the actual output of the network for the ith compound; nk is the number of compounds in the analyzed set.

The performance of RBFNN is determined by the values of the following parameters: the number nh of radial basis functions, the center cj and the width rj of each radial basis function, and the connection weight wkj between the jth hidden layer unit and the kth output unit. The centers of RBFNN are determined with the forward subset selection method proposed by Orr [31,32]. The optimal width was determined experimentally over a number of trials by taking into account the leave-one-out (LOO) cross-validation error. The width that gives the minimum LOO cross-validation error is chosen as the optimal value.

3. Results and discussion

3.1. Results of MLR

The HM combined with the stepwise MLR method was used to develop the linear model for the prediction of HLB using all the descriptors. First, the HM was used to reduce the pool of descriptors; in the present study, the number of descriptors was reduced from 466 to 157. Second, various subset sizes were investigated to determine the optimum number of descriptors. When adding another descriptor did not significantly improve the statistics of the model, the optimum subset size had been reached. In the present study, four descriptors were eventually selected. A detailed description of the linear model based on the training set is summarized in Table 2. For the test set, the statistical parameters of the predictions were R² = 0.989, F = 1142.12, and RMS = 1.3509. The HLB values predicted by MLR are shown in Table 1. Fig. 2 shows the predicted vs observed HLB values for all 73 compounds studied (the training set and the test set). For the overall data set, the RMS in prediction is 1.6601, with R² = 0.981 and F = 3729.452. The correlation matrix of the four selected descriptors is given in Table 3. From Table 3, it can be seen that the linear correlation coefficient between any two descriptors is less than 0.80, which means that the descriptors are independent in the analysis [33].

Table 2
The four-parameter model for hydrophile–lipophile balance (HLB) value of anionic surfactants.

No.   Coefficient   t test    Descriptor
0     58.86         43.16     Intercept
1     −835.59       −52.10    Max nucleophilic reactivity index for a C atom ($N_C^{\max}$)
2     1602.40       31.65     HA dependent HDCA-2/√TMSA [Quantum-Chemical PC] (HDCA-2/√TMSA)
3     −326.04       −15.09    Min nucleophilic reactivity index for an O atom ($N_O^{\min}$)
4     −0.13         −7.71     Information content (order 0) (IC0)

$$\mathrm{HLB} = 58.86 - 835.59\,N_C^{\max} + 1602.40\,\text{HDCA-2} - 326.04\,N_O^{\min} - 0.13\,\mathrm{IC0}$$

N = 58, R² = 0.9829, R²CV = 0.9784, F = 763.69, s² = 2.9960, where R² is the squared correlation coefficient, R²CV is the squared cross-validated correlation coefficient, F is the Fisher criterion, s² is the squared standard error, and N is the number of data points in the training set.

Fig. 2. Plot of observed vs calculated HLB by MLR (x axis: the observed HLB; y axis: the predicted HLB by MLR; series: training set and test set).

3.2. Results of RBFNN

After the linear model had been established, RBFNN was used to develop a nonlinear model based on the same subset of descriptors, for comparison with the regression model above. To obtain better results, the parameters that influence the performance of RBFNN were optimized. The optimal width value for RBFNN was selected by systematically changing its value in the training step. It is known that neural networks can become overtrained. An overtrained network has usually learned the pattern it has seen (the training set) perfectly but cannot give accurate predictions for unseen compounds; it is no longer able to generalize. There are several methods for avoiding this problem.
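The RBFNN of Eqs. (1)–(3) and the LOO-based width search described in Section 2.4 can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic, the function names are hypothetical, and the centers are drawn as a random subset of the training points rather than by Orr's forward subset selection used in this work.

```python
import numpy as np


def rbf_design(X, centers, width):
    """Gaussian hidden-layer outputs h_j(X) = exp(-||X - c_j||^2 / r^2), as in Eq. (1)."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / width**2)


def fit_output_layer(X, y, centers, width):
    """Least-squares weights and bias of the linear output layer, as in Eq. (2)."""
    H = np.hstack([rbf_design(X, centers, width), np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)
    return coef  # the last entry is the bias


def predict(X, centers, width, coef):
    """Network output for each row of X."""
    H = np.hstack([rbf_design(X, centers, width), np.ones((len(X), 1))])
    return H @ coef


def rms(y_true, y_pred):
    """Root mean squared error, as in Eq. (3)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff**2)))


def loo_rms(X, y, width, n_centers, seed=0):
    """Leave-one-out RMS for one candidate width (same width for all units)."""
    rng = np.random.default_rng(seed)
    preds = np.empty(len(y))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        X_tr, y_tr = X[keep], y[keep]
        # Assumption: random subset of training points as centers,
        # a stand-in for forward subset selection.
        centers = X_tr[rng.choice(len(X_tr), size=n_centers, replace=False)]
        coef = fit_output_layer(X_tr, y_tr, centers, width)
        preds[i] = predict(X[i:i + 1], centers, width, coef)[0]
    return rms(y, preds)


# Synthetic stand-in data: 30 "compounds" described by 4 descriptors.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]

# Scan candidate widths and keep the one with the smallest LOO error.
widths = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
scores = {w: loo_rms(X_demo, y_demo, w, n_centers=8) for w in widths}
best_width = min(scores, key=scores.get)
print(best_width, scores[best_width])
```

Solving the output layer by least squares keeps each candidate width cheap to evaluate, which is what makes a full leave-one-out scan practical for a data set of this size.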
One of the superior methods for avoiding overtraining is to use a test set to validate the prediction power of the network during its training [34]. In the present study, leave-one-out cross-validation was used for the training set. The width value that gave the best LOO cross-validation result was used in the model. For this data set, the optimal spread was determined to be 2.5 (see Fig. 3). The test set was then used to evaluate the prediction ability of the models. As can be seen from Fig. 3, the trend lines of the training and the test sets are the same. The corresponding number of centers (hidden layer nodes) of RBFNN is 22.

Table 3
Correlation matrix of the four descriptors.

                  HDCA-2   $N_O^{\min}$   $N_C^{\max}$   IC0
HDCA-2            1.000    0.048          0.402          0.413
$N_O^{\min}$               1.000          0.143          0.393
$N_C^{\max}$                              1.000          0.091
IC0                                                      1.000

Fig. 3. The spread vs RMS error of training and test set based on LOO cross-validation (x axis: spread of RBF function; y axis: RMS).

The predicted results of the nonlinear model are shown in Table 1 and Fig. 4. The model obtained had a squared correlation coefficient R² = 0.9972 and F = 19647.91, with an RMS error of 0.6750, for the training set. The statistical parameters for the test set were R² = 0.9912, F = 1453.52, and RMS = 1.1895. For the overall data set, the RMS in prediction was 0.8080, with R² = 0.996 and F = 17322.59. Comparing the results of MLR and RBFNN, it can be concluded that both methods gave good results, but the RBFNN results are preferred because of their higher accuracy. On the other hand, the relation obtained by MLR is much easier to apply than the RBFNN model. Residuals versus the experimental HLB values of the data set for both the MLR and RBFNN models are shown in Fig. 5. As can be seen from Fig. 5, the performance of the RBFNN model is better than that obtained by the MLR model; the distribution of the residuals on both sides of zero indicates that no systematic error exists in the development of either the MLR or the RBFNN model.

Fig. 4. Plot of observed vs predicted HLB by RBFNN (x axis: the observed HLB; y axis: the predicted HLB by RBFNN; series: training set and test set).

Fig. 5. Plot of residuals vs the observed HLB (x axis: the observed HLB; y axis: the residuals of each sample; series: residuals of MLR and residuals of RBFNN).

3.3. Discussions of the input parameters

By interpretation of the descriptors in the model, it is possible to gain some insight into factors that are likely to govern the HLB of these anionic surfactants. The model includes four descriptors.

$N_C^{\max}$ and $N_O^{\min}$ are two quantum-chemical descriptors. They belong to the class of charge distribution-related descriptors [35]. These descriptors represent or depend directly on the quantum-chemically calculated charge distribution in the molecules, and therefore describe the polar interactions between molecules or their chemical reactivity. Here, the reactivity indices estimate the relative reactivity of the atoms (C and O) in the molecule for a given series of compounds and are related to the activation energy of the corresponding chemical reaction. As shown in Table 2, both have negative coefficients, indicating that the predicted HLB values are correlated negatively with the two descriptors. Hence, it can be concluded that molecules with larger reactivity indices tend to be less soluble in water and to have a higher affinity for the organic phase.

HDCA-2, an electrostatic descriptor, accounts for the hydrogen-bonding donor ability of the molecule and is calculated as the sum of the solvent-accessible surface areas of the hydrogen atoms that are possible hydrogen donors [36]. The hydrogen atoms directly connected to an electronegative atom in the molecule are considered possible hydrogen-bonding donors. HDCA-2 is thus a hydrogen donor charged solvent-accessible surface area. Formation of hydrogen bonds increases the hydrophilicity of a compound and thus increases its HLB value, which is consistent with the positive coefficient of this descriptor in the linear model.
IC0, a topological descriptor, is defined on the basis of Shannon information theory [37]. It reflects the branching of the molecule and how information-rich the molecule is. In other words, it represents the difference between the maximum possible complexity of a graph and the realized topological information of the chemical species as defined by the information content. It can therefore describe differences in the hydrophobicity and steric properties of the anionic surfactants comprehensively. Hydrophobic and steric interactions are known to influence the hydrophile–lipophile balance. The negative coefficient in the model implies that increasing the value of this descriptor makes the anionic surfactant more lipophilic.
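As a rough illustration of a Shannon-type, zero-order information content, the sketch below groups the atoms of a molecule by chemical element and computes the entropy of that partition. It assumes the common textbook form, a sum over classes of p·log2(1/p), and reports both the mean value per atom and that value multiplied by the atom count. The exact definition and scaling used by CODESSA may differ, and the function name is hypothetical.

```python
import math
from collections import Counter


def information_content_order0(atom_symbols):
    """Return (mean, total) zero-order information content in bits.

    Atoms are partitioned by element symbol; p_i is the fraction of atoms
    in class i, and the mean content is sum(p_i * log2(1 / p_i)).
    """
    counts = Counter(atom_symbols)
    n = sum(counts.values())
    mean_ic = sum((c / n) * math.log2(n / c) for c in counts.values())
    return mean_ic, n * mean_ic


# Octyl sulfate anion, C8H17SO4-, written out as a list of atoms.
octyl_sulfate = ["C"] * 8 + ["H"] * 17 + ["S"] + ["O"] * 4
mean_ic, total_ic = information_content_order0(octyl_sulfate)
print(round(mean_ic, 3), round(total_ic, 2))
```

Under this assumed definition, the value depends only on the element composition, so it changes as the alkyl chain is lengthened or as hydrogen is replaced by fluorine.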


Analysis of the results indicates that the proposed models correctly represent the structure–property relationships of these surfactants. It can generally be concluded that anionic surfactants with greater hydrogen-bonding donor ability, lower reactivity indices, and smaller steric size tend to be less soluble in lipophilic phases and more soluble in water.
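The direction of each effect can be checked directly from the four-descriptor equation of Table 2. The sketch below simply evaluates that equation; the function name and the descriptor values passed to it are hypothetical and chosen only to show the sign of each contribution, not taken from any compound in Table 1.

```python
def predict_hlb(n_max_c, hdca2, n_min_o, ic0):
    """Evaluate the four-descriptor MLR equation reported in Table 2."""
    return (58.86
            - 835.59 * n_max_c   # max nucleophilic reactivity index, C atom
            + 1602.40 * hdca2    # HDCA-2 / sqrt(TMSA)
            - 326.04 * n_min_o   # min nucleophilic reactivity index, O atom
            - 0.13 * ic0)        # information content (order 0)


# Hypothetical descriptor values, for illustration only.
base = predict_hlb(n_max_c=0.010, hdca2=0.002, n_min_o=0.020, ic0=40.0)
more_h_bonding = predict_hlb(n_max_c=0.010, hdca2=0.003, n_min_o=0.020, ic0=40.0)
more_reactive = predict_hlb(n_max_c=0.015, hdca2=0.002, n_min_o=0.020, ic0=40.0)

print(more_h_bonding > base)  # a larger HDCA-2 term raises the predicted HLB
print(more_reactive < base)   # a larger reactivity index lowers it
```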

3.4. Molecular diversity validation

Two fundamental research themes in chemical database analysis are similarity and diversity sampling [38]. The diversity problem involves defining a diverse subset of "representative" compounds so that researchers can scan only a subset of the huge database each time. In this study, diversity analysis was performed on the data set to make sure that the structures in the training and test sets represent those of the whole set. We consider a database of n compounds generated from m highly correlated chemical descriptors $\{x_j\}_{j=1}^{m}$. Each compound $X_i$ is represented as a vector

$$X_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im})^T \quad \text{for } i = 1, 2, \ldots, n,$$

where $x_{ij}$ denotes the value of descriptor j of compound $X_i$. The collective database $X = \{X_i\}_{i=1}^{n}$ is represented by the n × m matrix X:

$$X = (X_1, X_2, \ldots, X_n)^T = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix}.$$

Here the superscript T denotes the vector/matrix transpose. A distance score for two different compounds $X_i$ and $X_j$ can be measured by the Euclidean distance norm based on the compound descriptors:

$$d_{ij} = \lVert X_i - X_j \rVert = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2}.$$

The mean distances of one sample to the remaining ones were computed as follows:

$$\bar{d}_i = \frac{\sum_{j=1}^{n} d_{ij}}{n - 1}, \quad i = 1, 2, \ldots, n.$$

The mean distances were then normalized to the interval [0, 1]. The closer to one the distance is, the more the compound differs from the others. For the data sets, the mean distances of the samples vs HLB are shown in Fig. 6, which illustrates the diversity of the molecules in the training and test sets. As can be seen from the figure, the structures of the compounds are diverse in both sets. The training set, with its broad representation of the chemistry space, was adequate to ensure the stability of the models, and the diversity of the test set supports the predictive capability of the model.

Fig. 6. Scatter plot of samples for training and test set (x axis: the experimental HLB; y axis: the mean distance of each sample to others).

4. Conclusion

Linear and nonlinear models for the prediction of HLB values of a set of 73 anionic surfactants were developed using descriptors calculated from the molecular structure alone. From the results obtained, the following can be concluded: (1) The proposed models help to identify which calculated descriptors are related to HLB. (2) The nonlinear model using RBFNN showed better predictive ability than the linear regression model. (3) The general QSPR methods provide an alternative way to predict the HLB values of various anionic surfactants. Furthermore, the proposed approach validates the applicability of traditional QSPR in the field of colloid and interface science.

References

[1] Available from: .
[2] W.C. Griffin, J. Soc. Cosmet. Chem. 1 (1949) 311.
[3] W.C. Griffin, J. Soc. Cosmet. Chem. 5 (1954) 259.
[4] J.T. Davies, in: Proceedings of the International Congress of Surface Activity, 1957, pp. 426–438.
[5] P.M. Kruglyakov, Hydrophile Lipophile Balance of Surfactants and Solid Particles: Physicochemical Aspects and Applications, Elsevier, Amsterdam, 2000.
[6] G. Trapani, C. Altomare, M. Franco, A. Latrofa, G. Liso, Int. J. Pharm. 116 (1995) 95.
[7] L.G. Wallace, J. Pharm. Sci. 50 (1960) 238.
[8] M.J. Rosen, Surfactants and Interfacial Phenomena, second ed., Wiley, New York, 1989.
[9] G.X. Zhao, Physical Chemistry of Surfactant, revised ed., Peking Univ. Press, Beijing, 1991, pp. 470–480.
[10] D.W. Roberts, S.J. Marshall, SAR QSAR Environ. Res. 4 (1995) 67.
[11] Y.J. Vishal, M.K. Mahesh, R.S. Manohar, J. Surfact. Deterg. 10 (2007) 25.
[12] L. Uppagrad, A. Lindgren, M. Sjostrom, S. Wold, J. Surfact. Deterg. 3 (2003) 33.
[13] P.D.T. Huibers, V.S. Lobanov, A.R. Katritzky, D.O. Shah, M. Karelson, Langmuir 12 (1996) 1462.
[14] S.L. Yuan, Z.T. Cai, G.Y. Xu, Y.S. Jiang, J. Dispersion Sci. Technol. 23 (2002) 465.
[15] Y.M. Li, G.Y. Xu, Y.X. Luan, S.L. Yuan, X. Xin, J. Dispersion Sci. Technol. 26 (2005) 799.
[16] P.D.T. Huibers, D.O. Shah, A.R. Katritzky, J. Colloid Interface Sci. 19 (1997) 132.
[17] P.D.T. Huibers, V.S. Lobanov, A.R. Katritzky, D.O. Shah, M. Karelson, J. Colloid Interface Sci. 187 (1997) 113.
[18] Z.W. Wang, J.L. Feng, H.J. Wang, Z.G. Cui, G.Z. Li, J. Dispersion Sci. Technol. 26 (2005) 441.
[19] Z.W. Wang, J.L. Feng, Z.N. Wang, G.Z. Li, A.J. Lou, J. Dispersion Sci. Technol. 27 (2006) 11.
[20] J.H. Zhou, Y.D. Cui, Speciality Petrochem. 2 (2001) 11. In Chinese.
[21] ISIS Draw 2.3, MDL Information Systems, Inc., 1990–2000.
[22] HyperChem 6.01, Hypercube, Inc., 2000.
[23] M.J.S. Dewar, E.G. Zoebisch, E.F. Healy, J.J.P. Stewart, J. Am. Chem. Soc. 107 (1985) 3898.
[24] MOPAC, v.6.0 Quantum Chemistry Program Exchange, Program 455, Indiana University, Bloomington, IN.
[25] A.R. Katritzky, V.S. Lobanov, M. Karelson, CODESSA: Training Manual; Univ. of Florida, Gainesville, FL, 1995.
[26] A.R. Katritzky, V.S. Lobanov, M. Karelson, CODESSA: Reference Manual; Univ. of Florida, Gainesville, FL, 1994.
[27] E. Deconinck, D. Coomans, Y. Vander Heyden, J. Pharm. Biomed. Anal. 43 (2007) 119.
[28] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S. De Jong, P.J. Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics, Part A, Elsevier Science, Amsterdam, 1997.
[29] X.J. Yao, A. Panaye, P. Doucet, R.S. Zhang, H.F. Chen, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 1257.
[30] Y.H. Xiang, M.C. Liu, X.Y. Zhang, R.S. Zhang, Z.D. Hu, B.T. Fan, J.P. Doucet, A. Panaye, J. Chem. Inf. Comput. Sci. 42 (2002) 592–597.
[31] M.J.L. Orr, Introduction to Radial Basis Function Networks, Centre for Cognitive Science, Edinburgh Univ., 1996.
[32] M.J.L. Orr, MATLAB Routines for Subset Selection and Ridge Regression in Linear Neural Networks, Centre for Cognitive Science, Edinburgh Univ., 1996.
[33] J.G. Topliss, R.P. Edwards, J. Med. Chem. 22 (1979) 1238.
[34] M. Jalali-Heravi, M.H. Fatemi, J. Chromatogr. A 897 (2000) 227.
[35] K. Fukui, Theory of Orientation and Stereoselection, Springer-Verlag, Berlin, 1975.
[36] M. Karelson, Molecular Descriptors in QSAR/QSPR, Wiley-Interscience, New York, 2000.
[37] D. Bonchev, Information Theoretic Indices for Characterization of Chemical Structure, Wiley-Interscience, New York, 1983.
[38] A.G. Maldonado, J.P. Doucet, M. Petitjean, B.T. Fan, Mol. Divers. 10 (2006) 39.