Journal of Chromatography A, 1113 (2006) 140–147
Quantitative structure–property relationships for pesticides in biopartitioning micellar chromatography
Weiping Ma a, Feng Luan a, Haixia Zhang a, Xiaoyun Zhang a,∗, Mancang Liu a, Zhide Hu a, Botao Fan b
a Department of Chemistry, Lanzhou University, Lanzhou 730000, Gansu, PR China
b Université Paris 7-Denis Diderot, ITODYS 1, Rue Guy de la Brosse, 75005 Paris, France
Received 18 September 2005; received in revised form 19 December 2005; accepted 31 January 2006 Available online 21 February 2006
Abstract
The retention factor (log k) of 79 heterogeneous pesticides in biopartitioning micellar chromatography (BMC) was studied by the quantitative structure–property relationship (QSPR) method. The heuristic method (HM) and the support vector machine (SVM) method were used to build linear and nonlinear models, respectively. Of the two, the SVM model gave much better results: for the test set, a predictive correlation coefficient (R) of 0.9755 and a root-mean-square (RMS) error of 0.1403 were obtained. The proposed QSPR models, both by HM and SVM, contain the same descriptors, which agree with the classical Abraham parameters of the well-known linear solvation energy relationships (LSER).
© 2006 Elsevier B.V. All rights reserved.
Keywords: Heuristic methods; Quantitative structure–property relationship; Support vector machine; Biopartitioning micellar chromatography
1. Introduction
Pesticides, used to control pests, embrace an enormous diversity of products that are applied in a number of different activities, including agriculture, amateur gardening, woodworm treatment, and public health applications. Owing to the physicochemical properties of these chemical agents, such as water solubility, vapor pressure and partition coefficients between organic matter (in soil or sediment) and water, they can disperse in various environmental media. The range of damage across environmental media and receptors is equally great, providing a particularly complex example of multidimensional environmental impacts. Loss of aquatic and terrestrial biodiversity, contamination of groundwater and agricultural produce, and poisoning of agricultural workers are among the potential consequences of pesticide use in agriculture alone. Therefore, there is a need to evaluate the toxicity of pesticides for risk assessment.
Chromatography has been established for years as the technique of choice for the analysis of pesticides [1,2]. As a mode of
∗ Corresponding author. Tel.: +86 931 891 2578; fax: +86 931 891 2582. E-mail address: [email protected] (X. Zhang).
0021-9673/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.chroma.2006.01.136
micellar liquid chromatography, biopartitioning micellar chromatography (BMC) is a simple and reproducible approach for emulating the partitioning of chemicals in biomembranes. Usually, it comprises a C18 reversed-phase stationary phase and polyoxyethylene (23) lauryl ether (Brij35) micellar mobile phases. The retention data obtained in this chromatographic system under adequate experimental conditions can be related to the biological behavior of different kinds of drugs [3–8] and the toxicity of chemicals [9–11]. This can be attributed to the fact that the characteristics of BMC systems are similar to those of biological barriers and extracellular fluids. First, the stationary phase modified by hydrophobic adsorption of Brij35 surfactant monomers structurally resembles the ordered array of the membranous hydrocarbon chains. Second, the hydrophilic/hydrophobic character of the adsorbed surfactant monomers resembles the polar membrane regions. In addition, Brij35 micellar mobile phases prepared under physiological conditions can also mimic the environment of partitioning.
Linear solvation energy relationships (LSER) methodology has been extensively applied in conventional reversed-phase liquid chromatography [12–18], gas chromatography [19,20], micellar liquid chromatography (MLC) [21–24] and micellar electrokinetic chromatography [21,25–28]. The general
W. Ma et al. / J. Chromatogr. A 1113 (2006) 140–147
solvation equation of LSER proposed by Abraham [29] is defined as:
$$\log \mathrm{SP} = c + eE + sS + aA + bB + vV \tag{1}$$
In the equation, E is the excess molar refraction, which is obtained from the refractive index. S is the dipolarity/polarizability, which can be obtained from gas–liquid chromatographic measurements on polar stationary phases or, more generally, from water/solvent partitions. The parameters A and B are the overall or effective hydrogen bond acidity and basicity, respectively. V is the McGowan characteristic volume, which can be calculated readily from bond and atom contributions. These parameters represent the solute influence on various solute/solvent phase interactions. In the present paper, the descriptors were calculated from the pesticide structures by CODESSA. However, their interpretation is complicated by the fact that the system is commonly described using a three-phase model (mobile, stationary, and micellar phases) with three accompanying partition coefficients (mobile to stationary phase, mobile to micelle phase, and stationary to micelle phase transfers) [24]. LSER shows the fundamental chemical interactions governing retention in MLC. In this study, the quantitative structure–property relationship (QSPR) method was used to predict the chromatographic retention behavior of pesticides using a large number of calculated descriptors instead of the Abraham parameters of LSER. To better understand the retention mechanism of BMC, the chemical meaning of the calculated descriptors was compared with that of the Abraham parameters of LSER.
QSPR studies have been demonstrated to be an effective computational tool for understanding the correlation between the structure of molecules and their properties [30–32]. In a QSPR study one seeks a mathematical relationship between the property and one or more descriptors. Thus, such a study can indicate which structural factors may play an important role in determining the property.
The advantage of QSPR over other methods lies in the fact that the descriptors used to build the models can be calculated from the structure alone and do not depend on any experimental properties. However, the main problems encountered in this kind of research are still the description of the molecular structure using appropriate molecular descriptors and the selection of suitable modeling methods. At present, many types of molecular descriptors, such as constitutional, topological, geometrical, electrostatic, and quantum-chemical descriptors, have been proposed to describe the structural features of molecules [33–35]. The available chemometric and chemoinformatic methods are as diverse as the descriptors: multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), different types of artificial neural networks (ANN) and genetic algorithms (GA) can all be employed to derive correlation models between molecular structures and properties. Recently, there has been growing interest in applying the support vector machine (SVM) to chemical problems because of its remarkable generalization performance in modeling nonlinear problems.
SVM is a new algorithm developed in the machine learning community that has attracted attention and gained extensive application, for example in pattern recognition problems [36–38], drug design [39], prediction of protein structure [40], gene identification [41], quantitative structure–activity relationship (QSAR) [42], and QSPR analysis [43–45]. Nevertheless, to the best of our knowledge, there has been no prediction of the retention factor (log k) in BMC by a QSPR approach based on SVM.
In the present work, HM and SVM were used for the prediction of log k of 79 pesticides in BMC using descriptors calculated and selected by the software CODESSA. The aim was to establish a QSPR model that could be used for the prediction of log k, to show the flexible modeling ability of SVM, and, at the same time, to identify the important structural features related to the retention behavior of pesticides.
2. Experimental section
2.1. Data set
In our study, a set of 79 pesticides collected from Ref. [11] was investigated. A complete list of the pesticides' names and their corresponding experimental retention data (log k) is given in Table 1. The retention factor (k) of the pesticides was estimated by Eq. (2):
$$k = \frac{t_R}{t_{R(\mathrm{REF})}}\,(1 + k_{\mathrm{REF}}) - 1 \tag{2}$$
where $t_R$ is the experimental retention time of the pesticide assayed, $t_{R(\mathrm{REF})}$ the experimental retention time of a reference compound (acetanilide) injected during the working session, and $k_{\mathrm{REF}}$ the retention factor of acetanilide. Micellar mobile phases were prepared by dissolving Brij35 in aqueous solutions of 0.05 M phosphate buffer to give a final surfactant concentration of 0.04 M. The buffer solution was prepared with sodium dihydrogen phosphate. The pH was adjusted potentiometrically to 7.0 by addition of aqueous sodium hydroxide solution. The entire set of pesticides was divided randomly into two subsets: a training set of 63 compounds and a test set of 16 compounds.
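The retention-factor estimate of Eq. (2) and the random 63/16 split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed seed and the numerical retention times are hypothetical and do not come from the paper, which reports neither raw retention times nor how the random split was generated.

```python
import math
import random


def retention_factor(t_r, t_r_ref, k_ref):
    """Eq. (2): k = (t_R / t_R(REF)) * (1 + k_REF) - 1."""
    if t_r <= 0 or t_r_ref <= 0:
        raise ValueError("retention times must be positive")
    return (t_r / t_r_ref) * (1.0 + k_ref) - 1.0


def random_split(n_compounds, n_test, seed=0):
    """Randomly split compound numbers 1..n_compounds into training and test lists."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    ids = list(range(1, n_compounds + 1))
    test_ids = sorted(rng.sample(ids, n_test))
    held_out = set(test_ids)
    train_ids = [i for i in ids if i not in held_out]
    return train_ids, test_ids


# Hypothetical retention data (not taken from the paper).
k = retention_factor(t_r=12.0, t_r_ref=4.0, k_ref=3.0)  # (12 / 4) * (1 + 3) - 1
log_k = math.log10(k)

train_ids, test_ids = random_split(79, 16)
print(k, round(log_k, 3), len(train_ids), len(test_ids))
```

A compound that co-elutes with the reference (t_R equal to t_R(REF)) gets k equal to k_REF, which is a quick sanity check on the formula.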
The training set was used to build the models and the test set was used to evaluate them once they were built.
2.2. Descriptors calculation
All structures of the pesticides were drawn with the HyperChem 4.0 program [46]. The final structural optimizations were performed using the PM3 parameterization within the semi-empirical quantum-chemical program MOPAC 6.0 [47]. The geometry optimization was performed without symmetry restrictions. In all cases, frequency calculations were performed to ensure that the optimized geometries correspond to true energy minima. The output files exported from MOPAC were transferred into the Microsoft Windows version of the CODESSA program developed by Katritzky et al. [48] to calculate molecular descriptors. Five types of molecular descriptors were calculated: constitutional,
Table 1
The compounds and the predicted log k

| No. | CAS number | Compound | Experimental^a | Calculated (HM)^b | Abs. error^c | Calculated (SVM)^d | Abs. error^c |
|---|---|---|---|---|---|---|---|
| 1 | 52-68-6 | Trichlorfon | 0.722 | 0.516 | 0.206 | 0.733 | 0.011 |
| 2 | 60-51-5 | Dimethoate | 0.941 | 1.291 | 0.350 | 0.952 | 0.011 |
| 3 | 950-37-8 | Methidathion | 1.748 | 1.368 | 0.380 | 1.737 | 0.011 |
| 4 | 121-75-5 | Malathion | 1.893 | 1.938 | 0.045 | 1.882 | 0.011 |
| 5 | 2595-54-2 | Mecarbam | 1.885 | 1.962 | 0.077 | 1.874 | 0.011 |
| 6 | 732-11-6 | Phosmet | 1.714 | 1.664 | 0.050 | 1.703 | 0.011 |
| 7 | 298-00-0 | Parathion-methyl | 1.817 | 1.706 | 0.111 | 1.806 | 0.011 |
| 8 | 29232-93-7 | Pirimiphos-methyl | 2.116 | 1.969 | 0.147 | 2.105 | 0.011 |
| 9 | 2642-71-9 | Azinphos-ethyl | 1.887 | 2.069 | 0.182 | 1.898 | 0.011 |
| 10 | 5598-13-0 | Chlorpyrifos-methyl | 2.085 | 1.837 | 0.248 | 2.074 | 0.011 |
| 11 | 333-41-5 | Diazinon | 2.112 | 2.304 | 0.192 | 2.123 | 0.011 |
| 12 | 55-38-9 | Fenthion | 1.965 | 1.931 | 0.034 | 1.976 | 0.011 |
| 13 | 2310-17-0 | Phosalone | 2.179 | 2.420 | 0.241 | 2.168 | 0.011 |
| 14^e | 56-72-4 | Coumaphos | 2.014 | 1.836 | 0.178 | 2.025 | 0.011 |
| 15^e | 2921-88-2 | Chlorpyrifos | 2.3 | 2.184 | 0.116 | 2.195 | 0.105 |
| 16 | 57018-04-9 | Tolclofos-methyl | 2.037 | 2.130 | 0.093 | 2.033 | 0.004 |
| 17 | 23135-22-0 | Oxamyl | 0.169 | 0.785 | 0.616 | 0.180 | 0.011 |
| 18^e | 16752-77-5 | Methomyl | 0.409 | 0.845 | 0.436 | 0.384 | 0.025 |
| 19 | 23103-98-2 | Pirimicarb | 1.354 | 1.615 | 0.261 | 1.365 | 0.011 |
| 20 | 114-26-1 | Propoxur | 1.28 | 1.183 | 0.097 | 1.291 | 0.011 |
| 21 | 17804-35-2 | Benomyl | 1.149 | 1.363 | 0.214 | 1.138 | 0.011 |
| 22^e | 1563-66-2 | Carbofuran | 1.328 | 1.501 | 0.173 | 1.666 | 0.338 |
| 23 | 63-25-2 | Carbaryl | 1.493 | 1.741 | 0.248 | 1.504 | 0.011 |
| 24^e | 2212-67-1 | Molinate | 1.814 | 1.734 | 0.080 | 1.818 | 0.004 |
| 25 | 1114-71-2 | Pebulate | 2.13 | 1.824 | 0.306 | 2.119 | 0.011 |
| 26 | 82560-54-1 | Benfuracarb | 2.257 | 2.415 | 0.158 | 2.246 | 0.011 |
| 27 | 122-34-9 | Simazine | 1.394 | 1.453 | 0.059 | 1.405 | 0.011 |
| 28 | 21725-46-2 | Cyanazine | 1.415 | 1.501 | 0.086 | 1.426 | 0.011 |
| 29 | 1014-69-3 | Desmetryn | 1.566 | 1.410 | 0.156 | 1.555 | 0.011 |
| 30 | 841-06-5 | Methoprotryn | 1.674 | 1.743 | 0.069 | 1.663 | 0.011 |
| 31 | 5915-41-3 | Terbuthylazine | 1.736 | 1.706 | 0.031 | 1.725 | 0.011 |
| 32^e | 834-12-8 | Ametryne | 1.698 | 1.699 | 0.001 | 1.801 | 0.103 |
| 33 | 1610-18-0 | Prometon | 1.599 | 1.333 | 0.266 | 1.610 | 0.011 |
| 34 | 7287-19-6 | Prometryn | 1.815 | 1.812 | 0.004 | 1.844 | 0.029 |
| 35 | 886-50-0 | Terbutryne | 1.845 | 1.688 | 0.157 | 1.834 | 0.011 |
| 36 | 4147-51-7 | Dipropetryn | 1.942 | 2.025 | 0.083 | 1.931 | 0.011 |
| 37 | 108-90-7 | Chlorobenzene | 1.82 | 1.806 | 0.014 | 1.809 | 0.011 |
| 38 | 99-30-9 | Dicloran | 1.668 | 1.507 | 0.162 | 1.657 | 0.011 |
| 39^e | 95-50-1 | 1,2-Dichlorobenzene | 1.867 | 1.972 | 0.105 | 1.893 | 0.026 |
| 40 | 106-46-7 | 1,4-Dichlorobenzene | 1.982 | 2.010 | 0.028 | 1.977 | 0.005 |
| 41 | 541-73-1 | 1,3-Dichlorobenzene | 1.952 | 1.975 | 0.023 | 1.948 | 0.004 |
| 42 | 120-82-1 | 1,2,4-Trichlorobenzene | 2.046 | 2.142 | 0.096 | 2.093 | 0.047 |
| 43 | 108-70-3 | 1,3,5-Trichlorobenzene | 2.128 | 2.130 | 0.002 | 2.117 | 0.011 |
| 44^e | 87-61-6 | 1,2,3-Trichlorobenzene | 1.944 | 2.118 | 0.174 | 2.086 | 0.142 |
| 45 | 510-15-6 | Chlorbenzylate | 2.13 | 1.979 | 0.151 | 2.141 | 0.011 |
| 46 | 5836-10-2 | Chlorpropylate | 2.239 | 2.080 | 0.159 | 2.228 | 0.011 |
| 47^e | 608-93-5 | Pentachlorobenzene | 2.236 | 2.398 | 0.162 | 2.315 | 0.079 |
| 48 | 118-74-1 | Hexachlorobenzene | 2.354 | 2.467 | 0.113 | 2.343 | 0.011 |
| 49 | 3547-04-4 | DDE | 2.497 | 2.305 | 0.192 | 2.497 | 0.000 |
| 50 | 72-54-8 | DDD | 2.42 | 2.173 | 0.247 | 2.409 | 0.011 |
| 51^e | 115-32-2 | Dicofol | 2.413 | 2.268 | 0.145 | 2.203 | 0.210 |
| 52 | 940-31-8 | 2-PPA | −0.026 | 0.424 | 0.450 | −0.015 | 0.011 |
| 53 | 122-88-3 | 4-CPA | 0.863 | 0.539 | 0.324 | 0.852 | 0.011 |
| 54 | 1918-00-9 | DC | 0.145 | 0.655 | 0.510 | 0.156 | 0.011 |
| 55 | 101-10-0 | 3-CPPA | 0.527 | 0.657 | 0.130 | 0.661 | 0.134 |
| 56^e | 3307-39-9 | 4-CPP | 0.544 | 0.607 | 0.063 | 0.493 | 0.051 |
| 57 | 94-74-6 | MCPA | 0.844 | 0.611 | 0.234 | 0.527 | 0.317 |
| 58 | 94-75-7 | 2,4-D | 0.855 | 0.630 | 0.225 | 0.729 | 0.126 |
| 59^e | 93-65-2 | MCPP | 0.904 | 0.834 | 0.070 | 0.742 | 0.162 |
| 60 | 120-36-5 | 2,4-DCPPA | 0.936 | 0.884 | 0.052 | 0.925 | 0.011 |
| 61 | 93-76-5 | 2,4,5-T | 1.067 | 0.837 | 0.230 | 1.078 | 0.011 |
| 62 | 94-81-5 | MCPB | 1.251 | 0.994 | 0.257 | 1.240 | 0.011 |
| 63^e | 93-72-1 | 2,4,5-TCPPA | 1.135 | 0.999 | 0.136 | 0.961 | 0.174 |
| 64 | 113158-40-0 | Fenoxaprop-P | 1.084 | 1.391 | 0.307 | 1.095 | 0.011 |
| 65 | 101-42-8 | Fenuron | 0.886 | 1.195 | 0.309 | 0.897 | 0.011 |
| 66 | 150-68-5 | Monuron | 1.372 | 1.405 | 0.033 | 1.360 | 0.012 |
| 67 | 19937-59-8 | Metoxuron | 1.232 | 1.327 | 0.095 | 1.243 | 0.011 |
| 68^e | 1746-81-2 | Monolinuron | 1.535 | 1.372 | 0.163 | 1.440 | 0.095 |
| 69 | 2164-17-2 | Fluometuron | 1.554 | 1.266 | 0.288 | 1.543 | 0.011 |
| 70^e | 15545-48-9 | Chlorotoluron | 1.486 | 1.539 | 0.053 | 1.503 | 0.017 |
| 71 | 18691-97-9 | Methabenzthiazuron | 1.49 | 1.573 | 0.083 | 1.479 | 0.011 |
| 72 | 330-54-1 | Diuron | 1.543 | 1.561 | 0.018 | 1.554 | 0.011 |
| 73 | 34123-59-6 | Isoproturon | 1.531 | 1.561 | 0.030 | 1.542 | 0.011 |
| 74^e | 330-55-2 | Linuron | 1.653 | 1.515 | 0.138 | 1.617 | 0.036 |
| 75 | 13360-45-7 | Chlorbromuron | 1.672 | 1.548 | 0.124 | 1.660 | 0.012 |
| 76 | 1982-47-4 | Chloroxuron | 1.689 | 1.787 | 0.098 | 1.700 | 0.011 |
| 77 | 555-37-3 | Neburon | 1.957 | 1.984 | 0.027 | 1.946 | 0.011 |
| 78 | 3766-60-7 | Buturon | 1.776 | 1.600 | 0.176 | 1.765 | 0.012 |
| 79 | 25366-23-8 | Thiazafluron | 1.532 | 1.124 | 0.408 | 1.521 | 0.011 |

^a Experimental log k.
^b Predicted log k by HM.
^c Absolute value of (calculated − experimental).
^d Predicted log k by SVM.
^e Compounds in the test set.
topological, geometric, electrostatic, and quantum-chemical descriptors.
3. Methodology
3.1. Heuristic method
A successful QSPR is largely determined by good descriptor selection. In recent years, methodology for a general QSPR or QSAR approach has been developed and coded as the CODESSA software package, which combines different ways of quantifying the structural information about the chemicals with advanced statistical analyses for the establishment of molecular structure–property relationships. HM in CODESSA was used to select descriptors and build the linear model. The advantages of this method are that it usually produces correlations two to five times faster than other methods and has no restrictions on the size of the data set. HM can either quickly give a good estimate of what quality of correlation to expect from the data, or derive several best regression models. Besides, it shows which descriptors have bad or missing values, which descriptors are insignificant (from the standpoint of a single-parameter correlation), and which descriptors are highly inter-correlated.
The HM descriptor selection starts with a pre-selection that eliminates: (1) descriptors that are not available for every structure, (2) descriptors whose values show little or no variation across the structures, (3) descriptors that give an F-test value below 1.0 in the one-parameter correlation, and (4) descriptors whose t-values are less than a user-specified value. Following the pre-selection of descriptors, HM proceeds stepwise. Descriptors and correlations are ranked according to the values of the F-test and the correlation coefficient. Starting with the top descriptor from the list, two-parameter correlations are calculated. In the following steps new descriptors are added one by one until the pre-selected number
of descriptors in the model is reached. The final result is a list of the 10 best models according to the values of the F-test and correlation coefficient. The goodness of the correlation is tested by the squared correlation coefficient (R²), the F-test (F), and the standard deviation (s²).
3.2. Support vector machine
SVM is gaining popularity due to many attractive features and promising empirical performance. It originated from early concepts developed by Vapnik and Chervonenkis [49–51]. This method has proven to be very effective for addressing general purpose classification and regression problems [52–56]. The main advantage of SVM is that it adopts the structural risk minimization (SRM) principle, which has been shown to be superior to the traditional empirical risk minimization (ERM) principle [57] employed by conventional neural networks. SRM minimizes an upper bound of the generalization error based on the Vapnik–Chervonenkis (VC) dimension, whereas ERM minimizes the training error. SVM is therefore usually less vulnerable to overfitting. Since a number of introductions to SVM are available [33,58–61], only the main ideas of SVM for regression are summarized here. SVM can be applied to regression problems by the introduction of an alternative loss function. The basic idea in support vector regression (SVR) is to map the input data x into a higher dimensional feature space F via a nonlinear mapping φ, so that a linear regression problem is obtained and solved in this feature space. The regression approximation thus addresses the problem of estimating a function based on a given data set $G = \{(x_i, y_i)\}_{i=1}^{n}$ ($x_i$ is the input vector, $y_i$ the desired value). In the SVM method, the regression function is approximated by the following function:
$$y = \sum_{i=1}^{n} w_i \phi_i(x) + b \tag{3}$$
where $\{\phi_i(x)\}_{i=1}^{n}$ are the features of the inputs, and $\{w_i\}_{i=1}^{n}$ and b are coefficients.
3.3. SVM implementation and computation environment
All calculation programs implementing SVM were written as R files based on the R script for SVM [62]. All scripts were run under R 1.7.1 on a Pentium IV with 256 MB of RAM.
4. Results and discussion
Table 2
Descriptors, coefficients and t-values for the linear model^a

| Descriptor | Chemical meaning | Coefficient | t-test |
|---|---|---|---|
| (Constant) | Intercept | 0.8668 | 3.9657 |
| HDCA-1/TMSA | HA dependent HDCA-1/TMSA [Zefirov's PC] | −110.1300 | −9.4063 |
| PPSA-1 | Partial positive surface area [Quantum-Chemical PC] | 0.0030 | 7.1083 |
| LUMO | LUMO energy | −0.2481 | −6.1854 |
| RNO | Relative number of O atoms | −4.0212 | −6.5199 |
| RNR | Relative number of rings | 4.7644 | 3.0132 |
4.1. Result of the HM
The CODESSA program provides a large variety of nonempirical molecular descriptors. In the present treatment, the preliminary regression analysis was carried out using all the original CODESSA descriptors [63]. To find the optimum number of descriptors describing log k for the current set of structures, heuristic correlations were performed for the training set, giving the optimal equations for 1–8 descriptors. Fig. 1 shows that R² and R²cv rise steadily as the number of descriptors increases from 1 to 8. To avoid "overparameterization" of the model, an increase in R² of less than 0.02 was chosen as the breakpoint criterion [64]. Five descriptors were eventually selected. A detailed description of the linear model based on the compounds in the training set is summarized in Table 2. This model contains one electrostatic (HDCA-1/TMSA), two quantum-chemical (PPSA-1, LUMO) and two constitutional (RNO, RNR) descriptors. These descriptors encode different aspects of the molecular structure. By comparing these five descriptors with the Abraham solute parameters of LSER, it is possible to gain some insight into the factors that are likely to govern the retention mechanism of pesticides in BMC.
HDCA-1/TMSA is the quotient of the summed solvent-accessible surface area of the hydrogen bond donor atoms and the total molecular surface area. Its mathematical expression is
Fig. 1.
^a R² = 0.8526; s² = 0.0504; F = 65.92; R²cv = 0.8386; n = 63.
given below:
$$\mathrm{HDCA1/TMSA} = \frac{\sum_{D} S_D}{\mathrm{TMSA}}, \qquad D \in H_{\text{H-donor}} \tag{4}$$
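As a worked illustration of Eq. (4), the ratio can be computed directly once the solvent-accessible areas of the donor atoms are known. The sketch below is not CODESSA code: the function name and the surface-area values are hypothetical, chosen only to show the arithmetic.

```python
def hdca1_over_tmsa(donor_areas, total_area):
    """Eq. (4): summed solvent-accessible area S_D of the H-bond donor atoms
    divided by the total molecular surface area (TMSA)."""
    if total_area <= 0:
        raise ValueError("total molecular surface area must be positive")
    if any(s < 0 for s in donor_areas):
        raise ValueError("surface areas cannot be negative")
    return sum(donor_areas) / total_area


# Hypothetical molecule: two donor atoms exposing 6.0 and 4.0 area units
# on a total surface of 400.0 units (same units for all areas).
ratio = hdca1_over_tmsa([6.0, 4.0], 400.0)

# A molecule with no hydrogen bond donors gives a descriptor value of zero.
no_donor = hdca1_over_tmsa([], 250.0)
print(ratio, no_donor)
```

Because the descriptor is a ratio of areas, it is dimensionless and bounded between 0 and 1.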
HDCA1/TMSA belongs to the charged partial surface area (CPSA) descriptors developed by Jurs et al. [65]. It represents the hydrogen bond donation ability of the molecule. As HDCA1/TMSA increases, the fraction of the total molecular surface area contributed by the solvent-accessible surface of hydrogen bond donors increases, and hydrogen bond formation becomes easier. The lowest unoccupied molecular orbital (LUMO) is the lowest energy level in the molecule that contains no electrons. When a molecule acts as an electron-pair acceptor in bond formation, incoming electron pairs are received in its LUMO. A lower LUMO energy indicates a greater tendency to form a hydrogen bond, so it also represents the hydrogen bond donation ability of the molecule. An increase in these two descriptors leads to a decrease in log k in BMC. This is presumably because the nonionic Brij35 surfactant contains a rather polar surface layer of 23 oxyethylene moieties, which can act as a hydrogen bond acceptor. Moreover, the pesticide structures contain many heteroatoms (O, N), and the H atoms attached to these heteroatoms can act as hydrogen bond donors. Thus, HDCA1/TMSA together with LUMO describes the effect of the hydrogen bond donation ability of the pesticides on log k. The relative number of O atoms (RNO) is a constitutional descriptor calculated as the number of O atoms divided by the total number of atoms. RNO affects the electron cloud density of the molecule. By accounting for carbonyl and carboxyl groups in the pesticides, RNO reflects hydrogen bond acceptor ability, since these groups possess sufficient electron density to participate in hydrogen bonds. As hydrogen bond acceptors, pesticides can form hydrogen bonds with both the mobile phase and the stationary phase.
The negative coefficient of RNO in the model implies that increasing the value of this descriptor favors elution and that the hydrogen bond interactions between pesticides and the mobile phase play the dominant role. The partial positive surface area (PPSA-1) indicates the polar surface area formed by the positive charge distribution in the molecule and encodes information about polar interactions. The positive
Fig. 2.
Fig. 3.
coefficient of this descriptor indicates that pesticides with larger PPSA-1 tend to be retained more strongly. The reason is probably that the polarity of the stationary phase was changed by the surfactant Brij35. RNR is defined as the quotient of the total number of rings and the total number of bonds. It accounts for the chain stiffness and steric hindrance of compounds. RNR also reflects the size of the molecule: the larger the molecule, the more strongly the pesticide is retained.
From the above discussion, HDCA-1/TMSA reflects the hydrogen bond donor ability and can be related to the hydrogen bond donor acidity parameter A of LSER. The concept of parameter A can also be applied to LUMO, since it describes the availability to accept electrons [66]. RNO can be related to the hydrogen bond acceptor basicity parameter B, since it describes the availability to supply electrons. PPSA-1 corresponds to the dipolarity/polarizability parameter S. Because the parameter V can be calculated quite simply for any structure from the molecular formula and the number of rings in the molecule [67], RNR is related to the parameter V. These descriptors can therefore account for the structural features responsible for the retention factor of pesticides under the given conditions.
The log k values calculated by HM are given in Table 1, and the scatter plot is shown in Fig. 2. The model gave an RMS of 0.2245 for the training set, 0.1651 for the test set, and 0.2055 for the whole set, and the corresponding correlation coefficients (R) were 0.9234, 0.9588, and 0.9303, respectively. Table 1 and Fig. 2 show that the HM model was not sufficiently accurate, which suggests that not all of the descriptors are linearly correlated with log k. A nonlinear model (SVM) was therefore used to examine further the correlation between molecular structure and log k.
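The five-descriptor linear model of Table 2 and the R² breakpoint rule used to choose its size can be written out as a short Python sketch. The coefficients are those listed in Table 2; everything else is an assumption made for illustration: the function names are hypothetical, the descriptor values are invented rather than taken from any compound in Table 1, and the R² sequence is made up to show how the 0.02 criterion operates.

```python
# Coefficients of the linear (HM) model as listed in Table 2.
INTERCEPT = 0.8668
COEFFS = {
    "HDCA-1/TMSA": -110.1300,
    "PPSA-1": 0.0030,
    "LUMO": -0.2481,
    "RNO": -4.0212,
    "RNR": 4.7644,
}


def predict_log_k(descriptors):
    """log k = intercept + sum(coefficient * descriptor value)."""
    return INTERCEPT + sum(COEFFS[name] * descriptors[name] for name in COEFFS)


def size_at_breakpoint(r2_by_size, min_gain=0.02):
    """Return the number of descriptors kept when adding one more descriptor
    raises R^2 by less than min_gain. r2_by_size[0] is the 1-descriptor model."""
    for idx in range(1, len(r2_by_size)):
        if r2_by_size[idx] - r2_by_size[idx - 1] < min_gain:
            return idx  # models with 1..idx descriptors are retained
    return len(r2_by_size)


# Hypothetical descriptor values (not from the paper).
example = {"HDCA-1/TMSA": 0.0, "PPSA-1": 200.0, "LUMO": -0.5, "RNO": 0.05, "RNR": 0.04}
log_k = predict_log_k(example)

# Hypothetical R^2 values for models with 1 to 8 descriptors.
r2_curve = [0.50, 0.65, 0.75, 0.81, 0.85, 0.86, 0.865, 0.87]
n_selected = size_at_breakpoint(r2_curve)
print(round(log_k, 4), n_selected)
```

With these made-up inputs the signs behave as the discussion describes: raising HDCA-1/TMSA, LUMO or RNO lowers the predicted log k, while raising PPSA-1 or RNR increases it.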
4.2. Result of SVM
To obtain a more accurate model, SVM was used to develop a nonlinear model based on the same subset of descriptors. The performance of SVM for regression depends on the combination of several parameters: the capacity parameter C, ε of the ε-insensitive loss function, and γ, which controls the amplitude of the Gaussian function. These parameters should be optimized to obtain better results. To select proper values, different values were tried, and the set with the best leave-one-out cross-validation performance was selected as optimal. The overall performance of SVM was evaluated in terms of the root-mean-square (RMS) error, defined as:
$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{n} (d_i - o_i)^2}{n}} \tag{5}$$
where $d_i$ are the desired outputs in the validation set, $o_i$ the actual outputs, and n is the number of samples in the validation set. The influence of the parameters on the performance of SVM is shown in Figs. 3–5. Through the above process, γ, ε and C were fixed at 0.30, 0.02 and 30, respectively, for which the number of support vectors was 59. The predicted results of the optimal SVM are shown in Table 1 and Fig. 6. The model gave an RMS of 0.0483 for the training set, 0.1403 for the test set, and 0.0737 for the whole set, and the corresponding correlation coefficients (R) were 0.9963, 0.9755 and 0.9917, respectively. The overall performance of SVM is much better than that of HM.
Fig. 4.
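The ingredients of the parameter search described in Section 4.2 can be sketched in plain Python: the RMS of Eq. (5), the ε-insensitive loss, a Gaussian kernel, and a leave-one-out loop over candidate γ values. Several points are assumptions rather than the authors' implementation: the kernel is written as exp(−γ‖x − z‖²), a common parameterisation that the paper does not spell out; the regressor inside the loop is a simple Gaussian-weighted average used as a stand-in, not an SVM; and the toy data and function names are hypothetical. Only the three experimental and SVM-predicted values in the RMS example are taken from Table 1.

```python
import math


def rms(desired, actual):
    """Eq. (5): sqrt(sum((d_i - o_i)^2) / n)."""
    if len(desired) != len(actual) or not desired:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((d - o) ** 2 for d, o in zip(desired, actual)) / len(desired))


def eps_insensitive_loss(d, o, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(d - o) - eps)


def gaussian_kernel(x, z, gamma):
    """Assumed form: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_average(train_x, train_y, x, gamma):
    """Stand-in regressor (NOT an SVM): Gaussian-weighted mean of training targets."""
    weights = [gaussian_kernel(xi, x, gamma) for xi in train_x]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed; fall back to the plain mean
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def loo_rms(xs, ys, gamma):
    """Leave-one-out RMS: predict each sample from all the others."""
    preds = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        preds.append(kernel_average(rest_x, rest_y, xs[i], gamma))
    return rms(ys, preds)


def select_gamma(xs, ys, grid):
    """Pick the candidate gamma with the lowest leave-one-out RMS."""
    return min(grid, key=lambda g: loo_rms(xs, ys, g))


# RMS for compounds 1-3 of Table 1 (experimental vs. SVM-predicted log k).
rms_three = rms([0.722, 0.941, 1.748], [0.733, 0.952, 1.737])

# Hypothetical one-descriptor toy data for the leave-one-out search.
toy_x = [[0.0], [1.0], [2.0], [3.0], [4.0]]
toy_y = [0.1, 0.9, 2.1, 2.9, 4.2]
best_gamma = select_gamma(toy_x, toy_y, [0.01, 0.30, 3.0])
print(round(rms_three, 3), best_gamma)
```

The same loop extends to a grid over C and ε once the stand-in regressor is replaced by an actual ε-SVR fit; that substitution is left out here to keep the sketch free of third-party dependencies.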
Fig. 5.
Fig. 6.
5. Conclusion
A QSPR study of the retention factor of various pesticides in BMC was performed using HM and SVM, based on electrostatic, constitutional and quantum-chemical descriptors. Satisfactory results were obtained with the proposed method. Analysis of the results with the linear model indicated that hydrogen bond interactions, polarizability and volume play an important role in the retention behavior in BMC. The nonlinear SVM model produced much better predictive ability based on the same set of descriptors. SVM also shows better overall performance because it embodies the structural risk minimization principle and, unlike some other techniques, converges to the global optimum rather than to a local optimum. The SVM is therefore a very promising machine learning technique and is likely to gain more extensive application. Furthermore, the proposed approach can also be extended to other QSPR/QSAR investigations.
Acknowledgements
The authors thank the Association Franco-Chinoise pour la Recherche Scientifique & Technique (AFCRST) for supporting this study (Programme PRA SI 02-03). The authors also thank the R Development Core Team for providing the free R 1.7.1 software.
References
[1] G.R. Van der Hoft, P. Van Zoonen, J. Chromatogr. A 843 (1999) 301.
[2] E. Hogendoorn, P. Van Zoonen, J. Chromatogr. A 892 (2000) 435.
[3] M. Molero-Monfort, L. Escuder-Gilabert, R.M. Villanueva-Camañas, S. Sagrado, M.J. Medina-Hernández, J. Chromatogr. B 753 (2001) 225.
[4] L. Escuder-Gilabert, J.J. Martínez-Pla, S. Sagrado, R.M. Villanueva-Camañas, M.J. Medina-Hernández, J. Chromatogr. B 797 (2003) 21.
[5] L. Escuder-Gilabert, M. Molero-Monfort, R.M. Villanueva-Camañas, S. Sagrado, M.J. Medina-Hernández, J. Chromatogr. B 807 (2004) 193.
[6] C. Quiñones-Torrelo, S. Sagrado, M. Villanueva-Camañas, M.J. Medina-Hernández, J. Chromatogr. B 761 (2001) 13.
[7] J.J. Martínez-Pla, S. Sagrado, R.M. Villanueva-Camañas, M.J. Medina-Hernández, J. Chromatogr. B 757 (2001) 89.
[8] M. Molero-Monfort, Y. Martín-Biosca, S. Sagrado, R.M. Villanueva-Camañas, M.J. Medina-Hernández, J. Chromatogr. A 870 (2000) 1.
[9] L. Escuder-Gilabert, Y. Martín-Biosca, S. Sagrado, R.M. Villanueva-Camañas, M.J. Medina-Hernández, Anal. Chim. Acta 448 (2001) 173.
[10] J.M. Bermúdez-Saldaña, L. Escuder-Gilabert, M.J. Medina-Hernández, R.M. Villanueva-Camañas, S. Sagrado, J. Chromatogr. B 814 (2005) 115.
[11] J.M. Bermúdez-Saldaña, L. Escuder-Gilabert, M.J. Medina-Hernández, R.M. Villanueva-Camañas, S. Sagrado, J. Chromatogr. A 1063 (2005) 153.
[12] P.W. Carr, Microchem. J. 48 (1993) 4.
[13] M.H. Abraham, H.S. Chadha, A.J. Leo, J. Chromatogr. A 685 (1994) 203.
[14] J.H. Park, P.W. Carr, M.H. Abraham, R.W. Taft, R.M. Doherty, M.J. Kamlet, Chromatographia 25 (1988) 373.
[15] P.W. Carr, R.M. Doherty, M.J. Kamlet, R.W. Taft, W. Melander, Cs. Horvath, Anal. Chem. 58 (1986) 2674.
[16] J.H. Park, J.J. Chae, T.H. Nah, M.D. Jang, J. Chromatogr. A 664 (1994) 149.
[17] L.C. Tan, P.W. Carr, J. Chromatogr. A 799 (1998) 1.
[18] J.A. Blackwell, P.W. Carr, J. High Resolut. Chromatogr. 21 (1998) 427.
[19] C.F. Poole, T.O. Kollie, S.K. Poole, Chromatographia 34 (1992) 281.
[20] M.H. Abraham, G.S. Whiting, R.M. Doherty, W.J. Shuely, J. Chem. Soc. 2 (1990) 1451.
[21] S. Yang, M.G. Khaledi, J. Chromatogr. A 692 (1995) 301.
[22] M.A. García, M.F. Vitha, M.L. Marina, J. Liq. Chromatogr. Rel. Technol. 23 (2000) 873.
[23] M.H. Guermouche, D. Habel, S. Guermouche, Fluid Phase Equilibr. 147 (1998) 301.
[24] M.A. García, M.F. Vitha, J. Sandquist, K. Mulville, M.L. Marina, J. Chromatogr. A 918 (2001) 1.
[25] S. Yang, M.G. Khaledi, Anal. Chem. 67 (1995) 499.
[26] S.K. Poole, C.F. Poole, Analyst 122 (1997) 267.
[27] M.G. Khaledi, J.G. Bumgarner, M. Hadjmohammadi, J. Chromatogr. A 802 (1998) 35.
[28] M.D. Trone, M.G. Khaledi, Anal. Chem. 71 (1999) 1270.
[29] M.H. Abraham, Chem. Rev. 22 (1993) 73.
[30] C. Hansch, A. Leo, Exploring QSAR: Fundamentals and Applications in Chemistry and Biology, American Chemical Society, Washington, DC, 1995.
[31] H. Kubinyi, Drug Discov. Today 2 (1997) 457.
[32] H. Kubinyi, Drug Discov. Today 2 (1997) 538.
[33] M. Karelson, Molecular Descriptors in QSAR/QSPR, Wiley, New York, 2000.
[34] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, Germany, 2000.
[35] J. Devillers, A.T. Balaban, Topological Indices and Related Descriptors in QSAR and QSPR, Gordon and Breach, Amsterdam, 1999.
[36] S.N. Pang, D. Kim, S.Y. Bang, Pattern Recognit. Lett. 24 (2003) 215.
[37] H.X. Liu, R.S. Zhang, F. Luan, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci. 43 (2003) 900.
[38] E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem. Inf. Comput. Sci. 43 (2003) 1882.
[39] R. Burbidge, M. Trotter, B. Buxton, S. Holden, Comput. Chem. 26 (2001) 5.
[40] Y.D. Cai, X.J. Liu, X.B. Xu, K.C. Chou, Comput. Chem. 26 (2002) 293.
[41] L. Bao, Z.R. Sun, FEBS Lett. 521 (2002) 109.
[42] H.X. Liu, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci. 43 (2003) 1288.
[43] H.X. Liu, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci. 43 (2004) 161.
[44] C.X. Xue, R.S. Zhang, H.X. Liu, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 669.
[45] C.X. Xue, R.S. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 950.
[46] HyperChem 4.0, Hypercube, 1994.
[47] J.P.P. Stewart, MOPAC 6.0, Quantum Chemistry Program Exchange, QCPE, No. 455, Indiana University, Bloomington, IN, 1989.
[48] A.R. Katritzky, V.S. Lobanov, M. Karelson, Comprehensive Descriptors for structural and Statistical Analysis, Reference Manual, Version 2.0, 1994.
[49] C. Cortes, V. Vapnik, Machine Learn. 20 (1995) 273.
[50] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[51] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[52] C.Y. Zhao, R.S. Zhang, H.X. Liu, C.X. Xue, S.G. Zhao, X.F. Zhou, M.C. Liu, B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 2040.
[53] F. Luan, W.P. Ma, X.Y. Zhang, H.X. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, Chemosphere (2005).
[54] W.P. Ma, X.Y. Zhang, F. Luan, H.X. Zhang, R.S. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, J. Phys. Chem. A. 109 (2005) 3485.
[55] V.Z. Vladimir, V.B. Konstantin, A.I. Andrey, P.S. Nikolay, V.P. Igor, J. Chem. Inf. Comput. Sci. 43 (2003) 2048.
[56] F. Luan, C.X. Xue, R.S. Zhang, C.Y. Zhao, M.C. Liu, Z.D. Hu, B.T. Fan, Anal. Chim. Acta 537 (2005) 101.
[57] C.J.C. Burges, Data Min. Knowl. Disc. 2 (1998) 121.
[58] W.J. Wang, Z.B. Xu, Neurocomputing 61 (2004) 259.
[59] B. Schölkopf, C. Burges, A. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[60] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.
[61] http://www.kernel-machines.org.
[62] W.N.D. Venables, M. Smith, the R Development Core Team, R manuals, 2003.
[63] K. Tamm, D.C. Fara, A.R. Katritzky, P. Burk, M. Karelson, J. Phys. Chem. A. 108 (2004) 4812.
[64] J.D. Dyekjar, S.O. Jonsdottir, Ind. Eng. Chem. Res. 42 (2003) 4241.
[65] D.T. Stanton, P.C. Jurs, Anal. Chem. 62 (1990) 2323.
[66] L.Y. Wilson, G.R. Famini, J. Med. Chem. 34 (1991) 1668.
[67] M.H. Abraham, G.S. Whiting, R.M. Doherty, W.J. Shuely, J. Chromatogr. 587 (1991) 213.