Chemometrics and Intelligent Laboratory Systems 144 (2015) 122–127
A genetic programming-based QSPR model for predicting solubility parameters of polymers
Dilek İmren Koç a,⁎, Mehmet Levent Koç b,1
a Chemical Engineering Department, Faculty of Engineering, Cumhuriyet University, 58140, Sivas, Turkey
b Civil Engineering Department, Faculty of Engineering, Cumhuriyet University, 58140, Sivas, Turkey
Article info
Article history: Received 3 November 2014; received in revised form 7 April 2015; accepted 8 April 2015; available online 16 April 2015
Keywords: Solubility parameter; Genetic programming; Polymers; Linear regression; QSPR
Abstract
In this study, linear and nonlinear quantitative structure-property relationship (QSPR) models, respectively called the multiple linear regression based QSPR (MLR-QSPR) model and the genetic programming based QSPR (GP-QSPR) model, were built to predict the solubility parameters of polymers with structure –(C1H2–C2R3R4)– as a function of some constitutional, topological and quantum chemical descriptors. The results from the internal validation analysis indicated that the GP-QSPR model has better goodness of fit statistics. The external and overall validation measures also confirmed that the GP-QSPR model significantly outperforms the MLR-QSPR model in terms of some performance metrics over the same testing data set, and that genetic programming has good potential to obtain more accurate models in QSPR studies. © 2015 Elsevier B.V. All rights reserved.
⁎ Corresponding author. Tel.: +90 346 2191010/2242; fax: +90 346 2191170.
1 Tel.: +90 346 2191010/1318; fax: +90 346 2191170.
E-mail addresses: [email protected] (D.İ. Koç), [email protected] (M.L. Koç).
http://dx.doi.org/10.1016/j.chemolab.2015.04.005
0169-7439/© 2015 Elsevier B.V. All rights reserved.

1. Introduction
Prediction of polymers' solubility parameters is of great importance in many technological or industrial applications of polymers [1–5]. The solubility parameter is an intrinsic physicochemical parameter which is defined simply from Hildebrand–Scatchard solution theory [6,7]. However, its experimental determination is not easy, and traditional methodologies for predicting polymeric solubility (e.g., group contribution methods) do not meet accuracy requirements [8–11]. A few studies in the literature show that, in such cases, the development of predictive quantitative structure–property relationship (QSPR) models using linear or nonlinear data-driven methods (e.g., multiple linear regression, artificial neural networks and fuzzy set theory) seems to be a good alternative for overcoming the shortcomings or limitations of conventional approaches such as the group contribution methods [12–14]. As an example, Yu et al. [12] introduced a multiple linear regression based QSPR model for the prediction of solubility parameters of amorphous polymers. They concluded that the presented QSPR model has a valuable ability to correlate the solubility parameters with the six molecular descriptors and that its predictions were better than those of the previous models [13,14]. However, the value of the correlation coefficient between experimental and predicted solubility parameters was limited to 0.840 during the testing stage of the proposed model.

This study examines the applicability of another data-driven method, genetic programming (GP), to the prediction of the solubility parameters of polymers. GP [15] is a purely nonlinear modeling approach that can be described as an extension of the well-known genetic algorithm (GA). The main difference between them is the representation of the solution. While GA uses a string of numbers to represent the solution, GP solutions are computer programs. GP creates computer programs to solve a problem using the principle of Darwinian natural selection. It mainly differs from other data-driven models (e.g., artificial neural networks) in that it defines an explicit functional relationship between input and output variables by optimizing the model structure and its coefficients simultaneously. To the authors' knowledge, applications of GP in QSPR studies are very few and include prediction of the wavelength of the lowest UV transition for a system of 18 anthocyanidins [16] and of the sublimation enthalpy of a wide range of organic contaminants only from their 3D molecular structures [17].

The present work focuses on further development of QSPR models for accurately predicting the solubility parameters of polymers. The GP based QSPR model was developed by using the experimental solubility parameters and molecular descriptors of 97 polymers with structure –(C1H2–C2R3R4)–, which were previously given in Yu et al. [12]. Its predictive performance was compared with that of a multiple linear regression based QSPR model. This study is the first to investigate the implementation of GP in this field.

Table 1. Range of variables in the training data set (40 data samples).

| Parameter | Minimum | Maximum |
|---|---|---|
| hb | −0.0483 | 0 |
| alk | 0 | 1 |
| nN | 0 | 1 |
| Qii (Debye Å) | −90.1358 | −18.9995 |
| Eint (4.19 × 10^9 J/mol) | 36.9640 | 229.0590 |
| QH (a.u.) | 0.1259 | 0.4075 |
| δ (J/cc)^0.5 | 16.00 | 31.00 |

Table 2. Range of variables in the testing data set (57 data samples).

| Parameter | Minimum | Maximum |
|---|---|---|
| hb | −0.0173 | 0 |
| alk | 0 | 1 |
| nN | 0 | 1 |
| Qii (Debye Å) | −83.9006 | −14.8853 |
| Eint (4.19 × 10^9 J/mol) | 39.3470 | 210.6090 |
| QH (a.u.) | 0.1332 | 0.2365 |
| δ (J/cc)^0.5 | 16.80 | 25.90 |

2. Theory and calculation
The GP process starts with a random initial population of computer programs. Each individual program in the population is represented as a parse tree, generated by combining functions (nodes) and terminals (leaves) taken from a function set and a terminal set appropriate to the problem [15]. A function set may consist of basic arithmetic operators, mathematical functions, conditional operators, Boolean operators, iterative functions and any user-defined functions or operators, while a terminal set contains the arguments for the functions. Once the initial population has been created, the current population is repeatedly replaced with a new population (a new generation) by applying genetic operators (reproduction, crossover and mutation) probabilistically until the best fitness in the population reaches the desired level or the maximum number of generations is reached. The genetic operators applied in GP are the basic GA operators. Reproduction copies a selected individual program to the new population. Crossover creates a new offspring program for the new population by recombining randomly chosen parts of two selected programs. Mutation changes a randomly chosen part of one selected program to create one new offspring program for the new population. The best program appearing in any generation (the best-so-far solution) is taken as the solution to the given problem [18].
Table 3. Experimental solubility parameters and calculated molecular descriptors [12] comprising the training data set.

| No. | Polymer | hb | alk | nN | Qii (Debye Å) | Eint (4.19 × 10^9 J/mol) | QH (a.u.) | δ (J/cc)^0.5 |
|---|---|---|---|---|---|---|---|---|
| 1 | Poly(vinyl alcohol) | −0.0483 | 0 | 0 | −18.9995 | 53.080 | 0.389411 | 31.00 |
| 2 | Poly(vinyl ethyl ether) | 0.0000 | 0 | 0 | −32.1344 | 90.549 | 0.157379 | 17.40 |
| 3 | Poly(vinyl n-butyl ether) | 0.0000 | 0 | 0 | −45.3354 | 128.148 | 0.157283 | 17.40 |
| 4 | Poly(methyl acrylate) | 0.0000 | 0 | 0 | −35.3226 | 79.391 | 0.169044 | 21.40 |
| 5 | Poly(vinylide bromide) | 0.0000 | 0 | 0 | −48.5852 | 38.757 | 0.225322 | 22.80 |
| 6 | Polyacrylonitrile | −0.0173 | 0 | 1 | −25.8884 | 49.879 | 0.195491 | 27.50 |
| 7 | Poly(methyl methacrylate) | 0.0000 | 0 | 0 | −42.0122 | 98.081 | 0.169770 | 20.20 |
| 8 | Polymethacrylonitrile | −0.0173 | 0 | 1 | −32.9640 | 68.458 | 0.169271 | 24.50 |
| 9 | Poly(cyclohexyl methacrylate) | 0.0000 | 0 | 0 | −67.5836 | 159.418 | 0.170439 | 19.80 |
| 10 | Poly(benzyl methacrylate) | 0.0000 | 0 | 0 | −75.1338 | 151.847 | 0.171913 | 20.70 |
| 11 | Poly(n-octyl methacrylate) | 0.0000 | 0 | 0 | −90.1358 | 229.059 | 0.168645 | 18.60 |
| 12 | Poly(α-vinyl naphthalene) | 0.0000 | 0 | 0 | −67.7182 | 134.601 | 0.153432 | 20.90 |
| 13 | Polyisobutylene | 0.0000 | 1 | 0 | −27.6679 | 81.980 | 0.196383 | 16.00 |
| 14 | Poly(1-butene) | 0.0000 | 1 | 0 | −28.2546 | 86.964 | 0.141067 | 17.10 |
| 15 | Poly(4-methyl-1-pentene) | 0.0000 | 1 | 0 | −41.6106 | 124.335 | 0.142244 | 16.80 |
| 16 | Poly(1,2-butadiene) | 0.0000 | 1 | 0 | −26.3875 | 71.702 | 0.143982 | 17.20 |
| 17 | Poly(t-butyl methacrylate) | 0.0000 | 0 | 0 | −61.6507 | 152.874 | 0.168645 | 18.30 |
| 18 | Poly(vinyl sec-butyl ether) | 0.0000 | 0 | 0 | −45.4648 | 127.823 | 0.156900 | 17.00 |
| 19 | Poly(acrylic acid) | −0.0121 | 0 | 0 | −29.4530 | 60.651 | 0.407577 | 25.70 |
| 20 | Polyacrylamide | −0.0128 | 0 | 1 | −29.5498 | 68.445 | 0.338456 | 28.10 |
| 21 | Poly(vinyl butyrate) | 0.0000 | 0 | 0 | −48.3401 | 116.916 | 0.174955 | 18.33 |
| 22 | Poly(vinyl methyl sulfide) | 0.0000 | 0 | 0 | −33.1509 | 69.903 | 0.182660 | 19.52 |
| 23 | Poly(vinyl methyl ether) | 0.0000 | 0 | 0 | −25.7032 | 71.814 | 0.158178 | 19.66 |
| 24 | Poly(1-ethyl vinyl ethyl ether) | 0.0000 | 0 | 0 | −45.4476 | 127.805 | 0.163138 | 19.21 |
| 25 | Poly(vinyl ethyl ketone) | 0.0000 | 0 | 0 | −38.0354 | 93.956 | 0.160771 | 22.14 |
| 26 | Poly(4-hydroxystyrene) | −0.0019 | 0 | 0 | −51.4885 | 106.037 | 0.405212 | 24.55 |
| 27 | Poly(divinyl ether) | 0.0000 | 0 | 0 | −30.7963 | 74.624 | 0.164267 | 18.93 |
| 28 | Poly(vinyl-1-phenyl methyl ether) | 0.0000 | 0 | 0 | −57.2652 | 125.420 | 0.125940 | 20.17 |
| 29 | Poly(nitro styrene) | 0.0000 | 0 | 1 | −66.0630 | 106.629 | 0.181215 | 22.71 |
| 30 | Poly(benzyl acrylate) | 0.0000 | 0 | 0 | −68.6248 | 132.557 | 0.173749 | 19.38 |
| 31 | Poly(o-methyl styrene) | 0.0000 | 0 | 0 | −53.6503 | 121.882 | 0.161500 | 19.33 |
| 32 | Poly(allyl isocyanide) | −0.0077 | 0 | 1 | −33.0703 | 68.590 | 0.180890 | 25.45 |
| 33 | Poly(isopropyl methacrylate) | 0.0000 | 0 | 0 | −55.2181 | 135.204 | 0.173121 | 18.39 |
| 34 | Poly(allyl isopropyl ether) | 0.0000 | 0 | 0 | −45.4556 | 127.799 | 0.165117 | 19.21 |
| 35 | Poly(allyl acetate) | 0.0000 | 0 | 0 | −42.6179 | 98.039 | 0.181694 | 18.27 |
| 36 | Poly(propyl methacrylate) | 0.0000 | 0 | 0 | −55.2974 | 135.592 | 0.164884 | 18.37 |
| 37 | Poly(3-chloropylpropyl methacrylate) | 0.0000 | 0 | 0 | −71.2170 | 130.324 | 0.195761 | 19.60 |
| 38 | Poly(4-acetoxy styrene) | 0.0000 | 0 | 0 | −67.1294 | 131.974 | 0.185200 | 21.70 |
| 39 | Poly(vinyl methyl ketone) | 0.0000 | 0 | 0 | −32.4567 | 74.479 | 0.179187 | 22.92 |
| 40 | Poly(vinyl-1-amyl methyl ether) | 0.0000 | 0 | 0 | −60.5029 | 165.748 | 0.155492 | 20.83 |
Table 4. Experimental solubility parameters and calculated molecular descriptors [12] comprising the testing data set.

| No. | Polymer | hb | alk | nN | Qii (Debye Å) | Eint (4.19 × 10^9 J/mol) | QH (a.u.) | δ (J/cc)^0.5 |
|---|---|---|---|---|---|---|---|---|
| 1 | Polyethylene | 0.0000 | 1 | 0 | −14.8853 | 49.390 | 0.144396 | 17.50 |
| 2 | Poly(vinyl chloride) | 0.0000 | 0 | 0 | −26.1104 | 44.649 | 0.193699 | 21.20 |
| 3 | Poly(vinyl bromide) | 0.0000 | 0 | 0 | −30.9478 | 44.301 | 0.189865 | 21.10 |
| 4 | Poly(vinyl acetate) | 0.0000 | 0 | 0 | −37.5420 | 79.081 | 0.187471 | 21.00 |
| 5 | Poly(N-vinyl pyrrolidone) | 0.0000 | 0 | 1 | −49.4005 | 110.814 | 0.185949 | 22.30 |
| 6 | Poly(vinyl propionate) | 0.0000 | 0 | 0 | −42.0187 | 98.073 | 0.171231 | 20.40 |
| 7 | Poly(p-t-butyl styrene) | 0.0000 | 0 | 0 | −73.1496 | 177.764 | 0.150444 | 18.30 |
| 8 | Polypropylene | 0.0000 | 1 | 0 | −21.5668 | 68.170 | 0.141027 | 16.80 |
| 9 | Poly(vinyl cyclohexane) | 0.0000 | 0 | 0 | −53.2023 | 148.496 | 0.133253 | 18.80 |
| 10 | Poly(p-vinyl pyridine) | 0.0000 | 0 | 1 | −48.2670 | 96.003 | 0.153451 | 22.40 |
| 11 | Poly(vinylidene chloride) | 0.0000 | 0 | 0 | −38.0945 | 39.347 | 0.236595 | 22.80 |
| 12 | Poly(ethyl acrylate) | 0.0000 | 0 | 0 | −42.0185 | 98.080 | 0.171261 | 20.50 |
| 13 | Poly(n-butyl acrylate) | 0.0000 | 0 | 0 | −55.7523 | 135.730 | 0.171100 | 19.70 |
| 14 | Poly(n-butyl methacrylate) | 0.0000 | 0 | 0 | −62.0600 | 153.878 | 0.169082 | 19.10 |
| 15 | Poly(methyl α-cyanoacrylate) | −0.0173 | 0 | 1 | −49.3941 | 79.472 | 0.202023 | 25.90 |
| 16 | Poly(sec-butyl methacrylate) | 0.0000 | 0 | 0 | −61.9267 | 154.070 | 0.172628 | 18.80 |
| 17 | Poly(isobutyl methacrylate) | 0.0000 | 0 | 0 | −62.3509 | 154.108 | 0.163776 | 18.80 |
| 18 | Polystyrene | 0.0000 | 0 | 0 | −46.9407 | 103.346 | 0.151529 | 20.10 |
| 19 | Poly(p-methyl styrene) | 0.0000 | 0 | 0 | −53.1736 | 121.827 | 0.159854 | 19.40 |
| 20 | Poly(p-chloro styrene) | 0.0000 | 0 | 0 | −59.8069 | 97.483 | 0.155354 | 21.20 |
| 21 | Poly(o-chloro styrene) | 0.0000 | 0 | 0 | −64.0734 | 98.119 | 0.167444 | 21.20 |
| 22 | Poly(p-bromo styrene) | 0.0000 | 0 | 0 | −64.0734 | 97.991 | 0.155541 | 21.30 |
| 23 | Poly(ethyl methacrylate) | 0.0000 | 0 | 0 | −48.6274 | 116.764 | 0.164616 | 19.60 |
| 24 | Poly(α-methyl styrene) | 0.0000 | 0 | 0 | −53.7554 | 121.931 | 0.149414 | 19.30 |
| 25 | Poly(n-propyl acrylate) | 0.0000 | 0 | 0 | −48.7428 | 116.904 | 0.171168 | 20.00 |
| 26 | Poly(n-hexyl methacrylate) | 0.0000 | 0 | 0 | −76.0284 | 191.485 | 0.169107 | 18.80 |
| 27 | Poly(N-vinyl carbazole) | 0.0000 | 0 | 1 | −83.3121 | 154.192 | 0.163543 | 21.60 |
| 28 | Poly(ethyl α-chloroacrylate) | 0.0000 | 0 | 0 | −53.7529 | 93.038 | 0.208716 | 21.50 |
| 29 | Poly(p-fluoro styrene) | 0.0000 | 0 | 0 | −52.0063 | 98.161 | 0.152659 | 20.00 |
| 30 | Poly(isobutyl acrylate) | 0.0000 | 0 | 0 | −57.5115 | 135.272 | 0.175174 | 19.40 |
| 31 | Poly(2-ethoxyethyl methacrylate) | 0.0000 | 0 | 0 | −65.8952 | 157.397 | 0.170405 | 19.20 |
| 32 | Poly(vinyl phenyl ether) | 0.0000 | 0 | 0 | −51.1778 | 106.834 | 0.163371 | 20.19 |
| 33 | Poly(m-methyl styrene) | 0.0000 | 0 | 0 | −53.2890 | 121.184 | 0.160541 | 19.33 |
| 34 | Poly(methoxy styrene) | 0.0000 | 0 | 0 | −57.4224 | 125.550 | 0.167221 | 20.19 |
| 35 | Poly(1-methyl vinyl ethyl ether) | 0.0000 | 0 | 0 | −38.8704 | 108.994 | 0.157001 | 19.29 |
| 36 | Poly(1-phenyl vinyl ethyl ether) | 0.0000 | 0 | 0 | −63.8969 | 144.154 | 0.159666 | 20.03 |
| 37 | Poly(1,1-diphenyl ethylene) | 0.0000 | 0 | 0 | −79.4181 | 157.147 | 0.151710 | 19.93 |
| 38 | Poly(allyl 4,tolyl ether) | 0.0000 | 0 | 0 | −63.9397 | 144.110 | 0.158223 | 20.03 |
| 39 | Poly(allyl cyanide) | −0.0077 | 0 | 1 | −34.0233 | 68.638 | 0.190630 | 25.45 |
| 40 | Poly(vinyl phenyl sulfide) | 0.0000 | 0 | 0 | −58.5896 | 105.160 | 0.178789 | 20.28 |
| 41 | Poly(vinyl propyl ether) | 0.0000 | 0 | 0 | −38.7548 | 109.372 | 0.157346 | 19.29 |
| 42 | Poly(vinyl isopropyl ether) | 0.0000 | 0 | 0 | −38.8705 | 108.993 | 0.157011 | 19.29 |
| 43 | Poly(vinyl isoamyl ether) | 0.0000 | 0 | 0 | −52.1351 | 146.684 | 0.157383 | 19.13 |
| 44 | Poly(vinyl-1-methyl phenyl ether) | 0.0000 | 0 | 0 | −57.9679 | 125.272 | 0.160936 | 20.19 |
| 45 | Poly(vinyl-1-phenyl phenyl ether) | 0.0000 | 0 | 0 | −83.9006 | 160.406 | 0.162306 | 20.50 |
| 46 | Poly(2-nitro styrene) | 0.0000 | 0 | 1 | −63.9986 | 106.664 | 0.182582 | 22.20 |
| 47 | Poly(vinyl isobutyl ether) | 0.0000 | 0 | 0 | −45.2702 | 127.889 | 0.157283 | 19.21 |
| 48 | Poly(2-ethyl hexyl acrylate) | 0.0000 | 0 | 0 | −83.3478 | 210.609 | 0.171944 | 18.45 |
| 49 | Poly(allyl phenyl ether) | 0.0000 | 0 | 0 | −57.9832 | 125.686 | 0.153756 | 20.19 |
| 50 | Poly(allyl methyl ether) | 0.0000 | 0 | 0 | −32.1471 | 90.652 | 0.156307 | 19.44 |
| 51 | Poly(allyl ethyl ether) | 0.0000 | 0 | 0 | −38.6678 | 109.369 | 0.157454 | 19.28 |
| 52 | Poly(allyl propyl ether) | 0.0000 | 0 | 0 | −45.3318 | 128.193 | 0.149001 | 19.21 |
| 53 | Poly(diallyl ether) | 0.0000 | 0 | 0 | −43.5797 | 112.800 | 0.149843 | 18.84 |
| 54 | Poly(allyl 2, tolyl ether) | 0.0000 | 0 | 0 | −64.3572 | 143.642 | 0.177444 | 20.03 |
| 55 | Poly(allyl 3, tolyl ether) | 0.0000 | 0 | 0 | −63.8511 | 143.549 | 0.161917 | 20.03 |
| 56 | Poly(allyl acetonitrile) | −0.0043 | 0 | 1 | −41.9580 | 87.485 | 0.190713 | 24.18 |
| 57 | Poly(cyano styrene) | −0.0019 | 0 | 1 | −62.9157 | 103.633 | 0.161041 | 22.36 |
In this study, all computations were performed with programs written in MATLAB and run on an Intel Core i7 based PC.
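The tree representation and the crossover and mutation operators outlined in Section 2 can be illustrated with a short sketch. It is written in Python purely for illustration and is not the authors' MATLAB implementation. The function set, the terminal names (`x0`, `x1`), the tuple encoding of trees and every helper name are assumptions made for this example; fitness-based selection and the generational loop are left out.

```python
import math
import random

# Hypothetical function set: name -> (arity, implementation).
FUNCTIONS = {
    "add": (2, lambda a, b: a + b),
    "sub": (2, lambda a, b: a - b),
    "mul": (2, lambda a, b: a * b),
    "exp": (1, lambda a: math.exp(min(a, 50.0))),  # argument clipped to avoid overflow
}
# Hypothetical terminal set: two input variables and one constant.
TERMINALS = ["x0", "x1", 1.0]


def random_tree(depth, rng):
    """Grow a random parse tree; tuples are function nodes, anything else is a leaf."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    arity = FUNCTIONS[name][0]
    return (name,) + tuple(random_tree(depth - 1, rng) for _ in range(arity))


def evaluate(tree, env):
    """Evaluate a tree for the variable values in env, e.g. {"x0": 1.0, "x1": 2.0}."""
    if isinstance(tree, tuple):
        func = FUNCTIONS[tree[0]][1]
        return func(*(evaluate(child, env) for child in tree[1:]))
    if isinstance(tree, str):
        return env[tree]
    return tree


def paths(tree, prefix=()):
    """Return the index path of every node; the root has the empty path."""
    found = [prefix]
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            found.extend(paths(child, prefix + (i,)))
    return found


def get_subtree(tree, path):
    """Follow an index path down from the root."""
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree in which the node at path is replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_subtree(tree[i], path[1:], new),) + tree[i + 1:]


def crossover(parent_a, parent_b, rng):
    """Subtree crossover: graft a randomly chosen part of parent_b into parent_a."""
    donor = get_subtree(parent_b, rng.choice(paths(parent_b)))
    return replace_subtree(parent_a, rng.choice(paths(parent_a)), donor)


def mutate(parent, rng):
    """Subtree mutation: replace a randomly chosen part with a freshly grown subtree."""
    return replace_subtree(parent, rng.choice(paths(parent)), random_tree(2, rng))


rng = random.Random(0)
a, b = random_tree(3, rng), random_tree(3, rng)
child = mutate(crossover(a, b, rng), rng)
print(child, evaluate(child, {"x0": 0.5, "x1": 2.0}))
```

Immutable tuples are used so that each operator returns a new offspring and leaves the parents unchanged, which mirrors the description of offspring being created for the new population.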
3. Results and discussion
The GP based QSPR model was developed for exploring explicit relationships between the solubility parameter and influencing variables (i.e., molecular descriptors calculated directly from repeating unit structures of polymers). Yu et al. [12] showed that the formulation of the solubility parameter (δ) can be expressed as follows:

$$\delta = f(hb,\ alk,\ n_N,\ Q_{ii},\ E_{int},\ Q_H) \qquad (1)$$

where $hb = m Q_{\pm} / n^{2}$, m is the number of –OH, –NH or –CN groups in the side groups, Q± is the hydrogen bond descriptor, n is the number of atoms from the terminal atom in the constituent group R3 or R4 to the atom C2, alk is a descriptor that indicates whether a polymer belongs to the polyalkenes or not, nN is the number of nitrogen atoms in the repeating unit, Qii is the quadrupole moment (Debye Å), Eint is the thermal energy (4.19 × 10^9 J/mol) and QH is the most positive charge of a hydrogen atom (a.u.). The ranges of the model variables of the randomly selected training and testing subsets are given in Tables 1 and 2, respectively. The training set (Table 3) was used in the training (or genetic evolution) phase, while the testing set (Table 4) was used to measure the performance of the model obtained by the GP algorithm.

Table 5. Goodness of fit metrics for the models of GP and MLR.

| Model | R | Q² | Q²_LOO | RMSE (J/cc)^0.5 | MAPE (%) | F | AIC |
|---|---|---|---|---|---|---|---|
| GP-QSPR | 0.940 | 0.884 | 0.835 | 1.137 | 4.43 | 1916 | 22.28 |
| MLR-QSPR | 0.935 | 0.874 | 0.832 | 1.186 | 4.91 | 38 | 27.65 |

Table 6. Q² and Q²_LOO values after Y-randomization tests for the models of GP and MLR.

| Iteration | GP-QSPR Q²_LOO | GP-QSPR Q² | MLR-QSPR Q²_LOO | MLR-QSPR Q² |
|---|---|---|---|---|
| 1 | −0.491 | 0.109 | −0.269 | 0.205 |
| 2 | −0.582 | 0.192 | −0.778 | 0.116 |
| 3 | −0.682 | 0.097 | −0.178 | 0.073 |
| 4 | −1.980 | −0.040 | −0.315 | 0.155 |
| 5 | −1.250 | −0.068 | −1.441 | 0.161 |
| 6 | −0.530 | 0.152 | −0.384 | 0.155 |
| 7 | −1.228 | −0.030 | −0.650 | 0.187 |
| 8 | −0.299 | 0.044 | −0.613 | 0.199 |
| 9 | −0.243 | 0.1067 | −0.255 | 0.099 |
| 10 | −0.793 | −0.065 | −0.349 | 0.159 |

The various parameters involved in the GP algorithm were selected as follows. The size of the randomly generated initial population was set to 100. The function set was $F = \{+,\ -,\ \times,\ /,\ \exp,\ \log,\ \sqrt{\ },\ \sqrt[3]{\ }\}$. The selection operator was implemented by using the tournament selection method [15,19] to reproduce computational programs in proportion to their fitness values. The fitness (f) was calculated with the following equation:

$$f = \frac{100}{SSE} \qquad (2)$$
where SSE is the sum of the squared errors for the training data set. The crossover and mutation operators were applied probabilistically, with probabilities set to 0.5 and 0.1, respectively. As the termination criterion, the maximum number of generations was set to 250. The GP algorithm was applied to the training data set to build the GP based QSPR (GP-QSPR) model. The GP-QSPR model is as follows:

$$\delta = -102.705\,\frac{n_N}{Q_{ii}} - 27.603\,\bigl(alk \cdot \exp(Q_H)\bigr) + 17.243\,\bigl(Q_H + \exp(alk)\bigr) - 0.058\,Q_{ii} - 149.279\,hb - 0.029\,E_{int} \qquad (3)$$
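As a quick plausibility check, Eq. (3) and the fitness of Eq. (2) can be evaluated on a few rows of Tables 3 and 4. The sketch below is in Python rather than the authors' MATLAB, and it assumes that Eq. (3) reads δ = −102.705 nN/Qii − 27.603 alk·exp(QH) + 17.243 (QH + exp(alk)) − 0.058 Qii − 149.279 hb − 0.029 Eint. The function names `gp_qspr` and `fitness` and the three-row sample are illustrative choices, not part of the original study.

```python
import math


def gp_qspr(hb, alk, n_n, q_ii, e_int, q_h):
    """Evaluate Eq. (3) as read above; q_ii must be non-zero."""
    return (-102.705 * n_n / q_ii
            - 27.603 * alk * math.exp(q_h)
            + 17.243 * (q_h + math.exp(alk))
            - 0.058 * q_ii
            - 149.279 * hb
            - 0.029 * e_int)


def fitness(rows):
    """Eq. (2): f = 100 / SSE over (descriptors, experimental delta) pairs."""
    sse = sum((gp_qspr(*x) - y) ** 2 for x, y in rows)
    return 100.0 / sse


# ((hb, alk, nN, Qii, Eint, QH), experimental delta), copied from Tables 3 and 4.
ROWS = [
    ((-0.0173, 0, 1, -25.8884, 49.879, 0.195491), 27.50),  # polyacrylonitrile (Table 3)
    ((0.0, 1, 0, -27.6679, 81.980, 0.196383), 16.00),      # polyisobutylene (Table 3)
    ((0.0, 1, 0, -14.8853, 49.390, 0.144396), 17.50),      # polyethylene (Table 4)
]

for x, y in ROWS:
    print(f"predicted {gp_qspr(*x):.2f}   experimental {y:.2f}")
print(f"fitness on these three rows: {fitness(ROWS):.1f}")
```

For these three polymers the predicted values lie within 1 (J/cc)^0.5 of the experimental ones. The fitness printed here covers only the three sample rows, so it is not comparable with the value obtained over the full 40-sample training set.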
In addition to the GP-QSPR model, a multiple linear regression based QSPR (MLR-QSPR) model was developed by employing the least squares method based on singular value decomposition [20]. The MLR-QSPR model is given below:

$$\delta = 17.728 - 161.280\,hb - 2.762\,alk + 2.902\,n_N - 0.052\,Q_{ii} - 0.029\,E_{int} + 15.846\,Q_H \qquad (4)$$

The goodness of fit of both models was assessed (Table 5) by examining the correlation coefficient (R), root mean square error (RMSE), mean absolute percentage error (MAPE), squared correlation coefficient (Q²) [21], squared correlation coefficient in cross-validation by the leave-one-out procedure (Q²_LOO) [21], Fisher (F) ratio [22], and Akaike's information criterion (AIC) [23]. In addition, the robustness of the QSPR models was confirmed by a Y-randomization test [24] over 10 iterations (Table 6). The external validation analysis was conducted on the testing data set in order to assess the prediction capability of the developed models (Table 7). In this study, besides the common validation metrics listed above, the r_m² and r_m²(overall) metrics [25,26] were also used to compare the performances of the GP-QSPR and MLR-QSPR models (Table 7), since there is no unique evaluation criterion [27–29]. The former was applied to the testing set, while the latter was calculated on the basis of the overall data set consisting of the training and testing data sets. Figs. 1 and 2 show scatter diagrams, with correlation coefficients (R), that compare the measured solubility parameters with those predicted by the GP-QSPR and MLR-QSPR models, respectively.

Table 7. External and overall validation metrics for the models of GP and MLR.

| Model | R | Q² | RMSE (J/cc)^0.5 | MAPE (%) | r_m² | r_m²(overall) |
|---|---|---|---|---|---|---|
| GP-QSPR | 0.977 | 0.944 | 0.394 | 1.699 | 0.922 | 0.889 |
| MLR-QSPR | 0.959 | 0.889 | 0.553 | 2.155 | 0.138 | 0.173 |

Fig. 1. Comparison between experimental and predicted solubility parameters by the GP-QSPR model.

Fig. 2. Comparison between experimental and predicted solubility parameters by the MLR-QSPR model.

Fig. 3. Plot of GP-QSPR model predicted values of the response and residuals for training and testing sets.

In general, good quality of the models is indicated by small RMSE, MAPE and AIC values, R, Q² and Q²_LOO values close to one, and a large F. On the other hand, Roy et al. [30] showed that in many cases the developed models could satisfy the requirements of conventional parameters such as Q² and Q²_LOO but fail to achieve the required values (more than 0.5) of the novel metrics r_m²(overall) and r_m². Hence, an examination of these validation statistics has been suggested as a more stringent requirement than the traditional validation parameters for deciding the acceptability of a predictive model [31]. An examination of the scatter diagrams (Figs. 1–2) indicates that the GP-QSPR model's predictions are very consistent with the experimental data and are in better agreement with the experimental data than those from the MLR-QSPR model. As also shown in Table 5, the GP-QSPR model has better goodness of fit with respect to the internal validation measures, as evidenced by the lowest values of RMSE, MAPE and AIC and the highest values of R, Q², Q²_LOO and F. Also, low Q² and Q²_LOO values were obtained by the Y-randomization test (Table 6). The results from the external validation analysis also confirmed that the GP-QSPR model outperformed the MLR-QSPR model in terms of common performance measures over the same testing data set (Table 7). This corresponds to improvements of about 6%, 21% and 29% in Q², MAPE and RMSE, respectively, when comparing the GP-QSPR model to the MLR-QSPR model. Moreover, the r_m² and r_m²(overall) metrics were respectively obtained as follows: 0.922 and 0.889 for the GP-QSPR model (after adding an intercept term) and 0.138 and 0.173 for the MLR-QSPR model. These results showed (Table 7) that the GP based QSPR model was acceptable (r_m² > 0.5 and r_m²(overall) > 0.5). On the other hand, acceptable values were not observed (r_m² < 0.5 and r_m²(overall) < 0.5) for the multiple linear regression based QSPR model, although this model was acceptable considering the values of the conventional validation metrics. It can be explained that the MLR-QSPR model forced through the origin led to a significantly worse fit for the data when compared to the GP-QSPR model (i.e., inclusion of an intercept did not significantly alter the quality of the GP-QSPR model's fit). These results also indicate that some nonlinear components exist in the relationship and that the GP algorithm, which combines the linear and nonlinear terms in the general solvation equation (Eq. (3)), can capture mixed linear and nonlinear data structures without prior assumption about the
underlying relationship as opposed to the traditional regression method. It can be noted that the molecular descriptors appearing in the GP-QSPR model (as well as the MLR-QSPR model) have clear physical meaning (being primarily related to molecular structure, branching and functional groups), since they can be calculated from the repeating unit structure. Thus, these descriptors can account for structural features that affect the solubility parameter of the polymers of interest. A plot of the GP-QSPR predicted values of the response variable for both the training and testing data sets against the residuals (Fig. 3) and the Williams plot [32] of the GP-QSPR model (Fig. 4) show that i) the residuals are distributed randomly on both sides of the zero line (Fig. 3); ii) all testing set compounds lie within the applicability domain (i.e., the square area within ±3 standard deviations and a leverage threshold h* of 0.450) of the GP-QSPR model; iii) compound 1, poly(vinyl alcohol), in the training set lies outside the applicability domain and can be considered an influential compound; iv) only two outliers were detected, corresponding to compounds 39, poly(vinyl methyl ketone), and 25, poly(vinyl ethyl ketone), in the training data set.

Fig. 4. Williams plot for the GP-QSPR model.

4. Conclusions
This study investigates the development of a GP based QSPR model to predict the solubility parameters of polymers. The results show that the predictive performance of the GP based model is better than that of the traditional regression based model, since GP can capture complex real-world relationships more effectively than the conventional regression methods. Apart from the improvements in prediction performance gained by using GP, this study demonstrates that GP can reconstruct a transparent functional relationship which is convenient to use later (e.g., unlike artificial neural networks). However, we noticed that one possible disadvantage of GP is that it can produce extremely complex functions, which may not be useful for knowledge induction (e.g., like black-box modeling techniques). This study also shows that GP can be regarded as a beneficial technique for QSPR studies in the future.

Conflict of interest
The authors declare that there is no conflict of interest.

Acknowledgments
The authors thank Xinliang Yu, Xueye Wang, Hanlu Wang, Xiaobing Li and Jinwei Gao for all data used in this work.

References
[1] J. Bicerano, Prediction of Polymer Properties, Marcel Dekker, New York, 2002.
[2] G. Inzelt, Conducting Polymers a New Era in Electrochemistry, Springer, Berlin, 2008.
[3] N. Nanto, N. Dougami, T. Mukai, M. Habara, E. Kusano, A. Kinbara, T. Ogawa, T. Oyabu, A smart gas sensor using polymer-film coated quartz resonator microbalance, Sens. Actuators B 66 (2000) 16–18.
[4] A.V. Shevade, M.A. Ryan, M.L. Homer, A.M. Manfreda, H. Zhou, K.S. Manatt, Molecular modeling of polymer composite-analyte interactions in electronic nose sensors, Sens. Actuators B 93 (2003) 84–91.
[5] X. Chen, C. Yuan, C.K.Y. Wong, G. Zhang, Molecular modeling of temperature dependence of solubility parameters for amorphous polymers, J. Mol. Model. 18 (2012) 2333–2341.
[6] A. Adjei, J. Newburger, A. Martin, Extended Hildebrand approach: solubility of caffeine in dioxane water mixtures, J. Pharm. Sci. 69 (1980) 659–661.
[7] J.H. Hildebrand, J.M. Prausnitz, R.L. Scott, Regular and Related Solutions: The Solubility of Gases, Liquids, and Solids, Reinhold, New York, 1970.
[8] K.C. Satyanarayana, R. Gani, J. Abildskov, Polymer property modeling using grid technology for design of structured products, Fluid Phase Equilib. 261 (2007) 58–63.
[9] E.J. Delgado, Predicting aqueous solubility of chlorinated hydrocarbons from molecular structure, Fluid Phase Equilib. 199 (2002) 101–107.
[10] J. Bicerano, Prediction of Polymer Properties, Marcel Dekker, New York, 1996.
[11] B.A. Miller-Chou, J.L. Koenig, A review of polymer dissolution, Prog. Polym. Sci. 28 (2003) 1223–1270.
[12] X. Yu, X. Wang, H. Wang, X. Li, J. Gao, Prediction of solubility parameters for polymers by a QSPR, QSAR Comb. Sci. 25 (2) (2006) 156–161.
[13] H. Sun, Y. Tang, G. Wu, F. Zhang, X. Chen, Structure-property relationship modeling for linear chain polymers by artificial neural networks, Comput. Appl. Chem. 1 (2003) 210–216.
[14] H. Sun, Y. Tang, G. Wu, F. Zhang, X. Chen, Structure-property relationship modeling for linear chain polymers by fuzzy set theory, Comput. Appl. Chem. 4 (2003) 381–385.
[15] J.R. Koza, Genetic Programming on the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, Massachusetts, 1992.
[16] B.K. Alsberg, N.M. Geneste, R.D. King, A new 3D molecular structure representation using quantum topology with application to structure–property relationships, Chemom. Intell. Lab. Syst. 54 (2000) 75–91.
[17] M. Bagheri, M. Bagheri, H.A. Gandomi, A. Golbraikh, Simple yet accurate prediction method for sublimation enthalpies of organic contaminants using their molecular structure, Thermochim. Acta 543 (2012) 96–106.
[18] J.R. Koza, M.J. Streeter, M.A. Keane, Routine high-return human-competitive automated problem-solving by means of genetic programming, Inf. Sci. 178 (2008) 4434–4452.
[19] D.E. Goldberg, Genetic Algorithms in Search, Optimizations and Machine Learning, Addison-Wesley Longman, Boston, MA, 1989.
[20] S.I. Gass, T. Rapcsák, Singular value decomposition in AHP, Eur. J. Oper. Res. 154 (3) (2004) 573–584.
[21] K. Roy, H. Kabir, QSPR with extended topochemical atom (ETA) indices: modeling of critical micelle concentration of non-ionic surfactants, Chem. Eng. Sci. 73 (2012) 86–98.
[22] W.J. Krzanowski, Principles of Multivariate Analysis: A User's Perspective, Oxford University Press, Oxford, 2000.
[23] H. Akaike, A new look at the statistical identification model, IEEE Trans. Autom. Control AC-19 (1974) 716–723.
[24] S. Wold, L. Eriksson, Statistical validation of QSAR results, in: H. van de Waterbeemd (Ed.), Chemometrics Methods in Molecular Design, Weinheim, VCH, 1995.
[25] K. Roy, S. Kar, The rm2 metrics and regression through origin approach: reliable and useful validation tools for predictive QSAR models (Commentary on ‘Is regression through origin useful in external validation of QSAR models?’), Eur. J. Pharm. Sci. 62 (2014) 111–114.
[26] P.K. Ojha, I. Mitra, R.N. Das, K. Roy, Further exploring rm2 metrics for validation of QSPR models, Chemom. Intell. Lab. Syst. 107 (2011) 194–205.
[27] M.L. Koç, Ü. Özdemir, D. İmren, Prediction of the pH and the temperature-dependent swelling behavior of Ca2+-alginate hydrogels by artificial neural networks, Chem. Eng. Sci. 63 (2008) 2913–2919.
[28] C.E. Balas, L. Koç, L. Balas, Predictions of missing wave data by recurrent neuronets, J. Waterw. Port Coast. Ocean Eng. ASCE 130 (5) (2004) 256–265.
[29] M.L. Koç, C.E. Balas, Genetic algorithms based logic-driven fuzzy neural networks for stability assessment of rubble-mound breakwaters, Appl. Ocean Res. 37 (2012) 211–219.
[30] P.P. Roy, S. Paul, I. Mitra, K. Roy, On two novel parameters for validation of predictive QSAR models, Molecules 14 (2009) 1660–1701.
[31] P.P. Roy, K. Roy, On some aspects of variable selection for partial least squares regression models, QSAR Comb. Sci. 27 (3) (2008) 302–313.
[32] L. Eriksson, E. Johansson, M. Müller, S. Wold, On the selection of the training set in environmental QSAR analysis when compounds are clustered, J. Chem. 14 (2000) 599–616.