Chemosphere 63 (2006) 1142–1153 www.elsevier.com/locate/chemosphere
Quantitative structure-activity relationship models for prediction of sensory irritants (log RD50) of volatile organic chemicals
Feng Luan a, Weiping Ma a, Xiaoyun Zhang a, Haixia Zhang a,*, Mancan Liu a, Zhide Hu a, B.T. Fan b
a Department of Chemistry, Lanzhou University, Lanzhou 730000, China
b Université Paris 7-Denis Diderot, ITODYS 1, Rue Guy de la Brosse, 75005 Paris, France
Received 24 June 2005; received in revised form 14 September 2005; accepted 14 September 2005 Available online 22 November 2005
Abstract
Quantitative classification and regression models for the prediction of sensory irritants (log RD50) of volatile organic chemicals (VOCs) have been developed. Each compound was represented by calculated structural descriptors that encode constitutional, topological, geometrical, electrostatic, and quantum-chemical features. The heuristic method (HM) was then used to search the descriptor space and select the descriptors responsible for activity. The best classification results were obtained with the support vector machine (SVM): the accuracy for the training, test and overall data sets is 96.5%, 85.7% and 94.4%, respectively. The nonlinear regression models were built with radial basis function neural networks (RBFNN) and SVM, respectively. The root mean squared (RMS) errors in prediction for the training, test and overall data sets obtained by RBFNN are 0.4755, 0.6322 and 0.5009 for the reactive group, and 0.2430, 0.4798 and 0.3064 for the nonreactive group. The corresponding results obtained by SVM are 0.4415, 0.7430 and 0.5140 for the reactive group, and 0.3920, 0.4520 and 0.4050 for the nonreactive group, respectively. This paper proposes an effective method for screening and evaluating poisonous chemicals. © 2005 Elsevier Ltd. All rights reserved.
Keywords: Sensory irritants; The heuristic method (HM); Linear discriminant analysis (LDA); Support vector machine (SVM); Radial basis function neural networks (RBFNN); QSAR/QSPR
* Corresponding author. Tel.: +86 931 891 2578; fax: +86 931 891 2582. E-mail address: [email protected] (H. Zhang).
0045-6535/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.chemosphere.2005.09.053
F. Luan et al. / Chemosphere 63 (2006) 1142–1153

1. Introduction
Volatile organic compounds, sometimes referred to as VOCs, are organic compounds that easily become vapors or gases. They are released either from burning fuel, such as gasoline, wood, coal, and natural gas, or from solvents, paints, glues, and other products that are used and stored at home or at work (http://toxtown.nlm.nih.gov/text_version/chemical/volatile.html). Both indoor and outdoor exposure to VOCs can cause sensory irritation, which is usually expressed as log RD50 (Alarie, 1973; Nielsen, 1991; Nielsen and Alarie, 1992). The RD50 is the concentration of a sensory irritant that would produce intolerable burning of the eyes, nose, and throat. Other common reactions include lacrimation and coughing (http://www.science.duq.edu/esm/Course_Material/ESM552/Notes/indoor_air/indoor_air.html). Generally, the effects of short-term exposure include symptoms of intoxication (dizziness, headache, confusion, nausea), anemia and fatigue. The effects of long-term exposure can include cancer, liver damage, spasms, and impaired speech, hearing and vision (Ferguson, 1939). These compounds have therefore been an important environmental issue for decades and have attracted significant attention from different research groups. In 1939, Ferguson proposed two main mechanisms for some acute toxic effects of vapors, depending upon the type of their interaction with biological receptors. One is a physical mechanism (p), proposed for the interaction of nonreactive VOCs with a biological receptor; it includes the forces of all types of weak noncovalent bonds: electrostatic interactions, hydrogen bonds, van der Waals attractions, and hydrophobic forces. The other is a chemical mechanism (c), proposed for the interaction of reactive volatile organic chemicals. This mechanism includes all covalent or ionic binding mechanisms, reversible or nonreversible, with the receptor. Thus, following Ferguson's principle, the irritation effect of gases and vapors is believed to be caused by their direct interaction with one or more proteins (receptors) on trigeminal nerve endings in the cornea and nasal mucosa (Alarie, 1973; Nielsen, 1991). Some studies have focused on the mechanisms and effects of different VOCs on animals or humans (Flemming et al., 1996; Enrique et al., 1998; Monique and Dalton, 2002; Boucher et al., 2003), while several other investigations have dealt with the effect of physicochemical properties, such as hydrogen bonding, vapour pressure, lipophilicity and solute dipolarity, on the chemical activation of VOCs (Nielsen and Bakbo, 1985; Abraham et al., 1990; Abraham et al., 1994; Alarie et al., 1995, 1996).
Modeling chemical and biological effects is an important objective of chemistry and toxicology today. Chemical and biological effects are closely related to molecular properties, which can be calculated or predicted from structure by various methods. Whether or not a compound has been synthesized, its descriptors can be calculated from its known structure. Once a reliable model is established, we can predict the activity of compounds and identify which structural factors play an important role in the activity. The main steps involved in model building are: data collection, calculation and selection of molecular descriptors, correlation model development, and finally model evaluation. Among them, the description of the molecular structure using appropriate molecular descriptors and the selection of suitable modeling methods are the most important. At present, many types of molecular descriptors, such as constitutional, topological, geometrical, electrostatic, and quantum chemical descriptors, have been proposed to describe the structural features of molecules (Devillers and Balaban, 1999; Karelson, 2000; Todeschini and Consonni, 2000). Linear or nonlinear statistical methods such as multiple linear regression (MLR), principal component regression (PCR), partial least squares (PLS), different types of artificial neural networks (ANN), genetic algorithms (GA) and support vector machines (SVM) can be used to develop a mathematical relationship between the structural descriptors and biological effects. The advances in modeling studies have widened the scope for rationalizing screening and for the search for the toxicological mechanisms of toxic pollutants.
The goal of this study is first to develop a robust binary classification model to categorize the 142 VOCs, and then to build regression QSAR models that can predict the sensory irritants (log RD50) of the reactive and nonreactive groups, respectively. A large number of descriptors were calculated with the CODESSA software. The heuristic method (HM) was then used to search the descriptor space and select the descriptors relevant to the classification and regression problems. Statistical methods such as linear discriminant analysis (LDA), SVM and radial basis function neural networks (RBFNN) were then used to build the models.
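The two-stage strategy stated above (classify a compound first, then apply the regression model of the predicted class) can be illustrated with a minimal sketch. The function name `predict_log_rd50`, the callables it takes and the toy models below are hypothetical placeholders introduced for illustration only; they are not the fitted models of this work.

```python
from typing import Callable, Mapping, Tuple

Descriptors = Mapping[str, float]


def predict_log_rd50(
    descriptors: Descriptors,
    classify: Callable[[Descriptors], str],
    regress_reactive: Callable[[Descriptors], float],
    regress_nonreactive: Callable[[Descriptors], float],
) -> Tuple[str, float]:
    """Assign the class ('c' reactive, 'p' nonreactive), then use the matching regression model."""
    label = classify(descriptors)
    if label == "c":
        return label, regress_reactive(descriptors)
    if label == "p":
        return label, regress_nonreactive(descriptors)
    raise ValueError(f"unexpected class label: {label!r}")


# Toy stand-ins (hypothetical; they only make the sketch runnable).
def toy_classifier(d: Descriptors) -> str:
    return "c" if d["score"] > 0 else "p"


def toy_reactive(d: Descriptors) -> float:
    return 1.0 + 0.5 * d["score"]


def toy_nonreactive(d: Descriptors) -> float:
    return 3.0 + 0.5 * d["score"]
```

Keeping the classifier and the two regressors as separate callables mirrors the separation between the classification and regression models in this study.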
2. Experimental

2.1. Database construction
The 142 VOCs with sensory irritation data summarized in Table 1 were used in this study (Alarie et al., 1998). The log RD50 values listed for each compound in Table 1 are experimental results obtained with male Swiss-Webster, OF1, or CF1 mice (Schaper, 1993). The chemicals in the database were also classified according to their possible mechanism of action: p for nonreactive chemicals and c for reactive chemicals, following the rule proposed by Ferguson. Of these compounds, 59 are p (nonreactive) and 83 are c (reactive) chemicals. The entire set of compounds was randomly divided into two subsets for both the classification and regression models: a training set of 114 compounds (67 reactive and 47 nonreactive), which was used to build the models, and a test set of the remaining 28 compounds (16 reactive and 12 nonreactive), which was used to validate the models once they were built.

2.2. Calculation of molecular descriptors
Chemicals were sketched, and initial three-dimensional modeling was performed using HyperChem (1994). This produced connection tables, which
Table 1 Compounds, classification, experimental and calculated log RD50
No. | Name | CAS No. | Class (Reference) | Class (LDA) | Class (SVM) | log RD50 (Experiment) | log RD50 (HM) | log RD50 (SVM) | log RD50 (RBFNN)
1 | 1,6-Hexamethylene | 822-06-0 | c | c | c | 0.770 | 1.2685 | 0.6636 | 0.9118
2 | 2,4-Toluene diisocyanate | 584-84-9 | c | c | c | 0.699 | 1.2326 | 0.5922 | 0.9355
3 | p-Toluene isocyanate | 622-58-2 | c | c | c | 0.201 | 0.1142 | 0.2341 | 0.0874
4 | Phenyl isocyanate | 103-71-9 | c | c | c | 0.137 | 0.7365 | 0.0305 | 0.0085
5 | Methyl isocyanate | 624-83-9 | c | c | c | 0.114 | 0.7874 | 0.2203 | 0.2752
6 | o-Toluene isocyanate | 614-68-6 | c | c | c | 0.161 | 0.0284 | 0.0546 | 0.0250
7 | Allyl alcohol | 107-18-6 | c | c | c | 0.439 | 0.8067 | 0.5278 | 0.0076
8 | Allyl acetate | 591-87-7 | c | c | c | 0.462 | 2.0644 | 2.2271 | 1.9886
9 | Crotonaldehyde | 4170-30-3 | c | c | c | 0.548 | 1.8780 | 1.9888 | 1.7897
10 | Diallylamine | 124-02-7 | c | c | c | 0.602 | 0.2839 | 0.7352 | 0.4652
11 | Benzyl iodide | 620-05-3 | c | c | c | 0.633 | 1.4248 | 1.0232 | 1.3169
12 | Hexyl isocyanate | 2525-62-4 | c | c | c | 0.681 | 0.8089 | 0.5739 | 0.7901
13 | Benzyl bromide | 100-39-0 | c | c | c | 0.716 | 1.5953 | 1.3208 | 1.5645
14 | Methyl vinyl ketone | 78-94-4 | c | c | c | 0.723 | 1.8255 | 1.7720 | 1.5042
15 | o-Chlorobenzylchloride | 611-19-8 | c | c | c | 0.756 | 1.3425 | 0.9743 | 1.3847
16 | Chloropicrin | 1976-6-2 | c | c | c | 0.902 | 0.8084 | 1.0091 | 0.8302
17 | Crotyl alcohol | 6117-91-5 | c | c | c | 0.949 | 1.1731 | 0.8428 | 0.7991
18 | Allyl amine | 107-11-9 | c | c | c | 0.954 | 0.6969 | 0.7545 | 0.4452
19 | p-Chlorobenzylchloride | 104-83-6 | c | c | c | 1.146 | 1.3377 | 1.0043 | 1.3581
20 | Vinyl toluene | 25013-15-4 | c | c | c | 1.215 | 1.0564 | 1.1084 | 0.9161
21 | α,α-Dichlorotoluene | 98-87-3 | c | c | c | 1.301 | 1.8186 | 1.6382 | 1.9502
22 | Benzyl chloride | 100-44-7 | c | c | c | 1.342 | 1.6680 | 1.4516 | 1.6824
23 | Heptylamine | 111-68-2 | c | c | c | 1.425 | 1.6245 | 1.6872 | 1.6983
24 | Isophorone | 78-59-1 | c | pᵃ | c | 1.444 | 1.4235 | 1.5512 | 1.5173
25 | Cyclohexylamine | 108-91-8 | c | c | c | 1.591 | 1.7668 | 1.7594 | 1.8174
26 | Menthol | 89-78-1 | c | pᵃ | pᵃ | 1.653 | 1.6857 | 1.5458 | 1.9049
27 | 2,3,4-Trichloro-1-butene | 2431-50-7 | c | c | c | 1.764 | 1.2912 | 1.6573 | 1.4482
28 | Trimethylamine | 75-50-3 | c | c | c | 1.785 | 2.6568 | 2.5412 | 2.5169
29 | Mesityl oxide | 141-79-7 | c | pᵃ | c | 1.786 | 2.1055 | 2.2736 | 2.1457
30 | Allyl iodide | 556-56-9 | c | c | c | 1.838 | 1.6841 | 1.9447 | 1.9400
31 | Chloro-2-ethylbenzene | 622-24-2 | c | c | c | 1.924 | 1.3446 | 1.1536 | 1.3001
32 | Dimethylisopropylamine | 996-35-0 | c | c | c | 1.954 | 2.2985 | 2.2448 | 2.4022
33 | Isobutylamine | 78-81-9 | c | c | c | 1.959 | 1.9093 | 1.9563 | 1.9932
34 | Dipropylamine | 142-84-7 | c | c | c | 1.964 | 1.8603 | 1.9687 | 1.9528
35 | Pentylamine | 110-58-7 | c | c | c | 1.987 | 1.7759 | 1.7949 | 1.8550
36 | n-Butylamine | 109-73-9 | c | c | c | 2.066 | 1.9201 | 1.9591 | 1.9933
37 | Dibutylamine | 111-92-2 | c | pᵃ | c | 2.104 | 1.6774 | 1.8382 | 1.7601
38 | Methylamine | 74-89-5 | c | c | c | 2.149 | 2.2756 | 2.2536 | 2.0618
39 | Ethylamine | 1975-4-7 | c | c | c | 2.179 | 2.2094 | 2.2247 | 2.1602
40 | Isopropylamine | 75-31-0 | c | c | c | 2.196 | 2.1497 | 2.1516 | 2.1917
41 | Diisopropylamine | 108-18-9 | c | c | c | 2.207 | 2.2803 | 2.3160 | 2.4082
42 | Dimethylethylamine | 598-56-1 | c | c | c | 2.207 | 2.4317 | 2.3106 | 2.5006
43 | 3-Dimethylamino-1-propylamine | 109-55-7 | c | c | c | 2.246 | 1.9665 | 1.9974 | 1.9800
44 | tert-Butylamine | 75-64-9 | c | c | c | 2.250 | 2.1680 | 2.2250 | 2.2324
45 | Diethylamine | 109-89-7 | c | c | c | 2.286 | 2.2758 | 2.2742 | 2.3458
46 | Methyl crotonate | 623-43-8 | c | c | c | 2.308 | 2.2459 | 2.3247 | 2.1490
47 | α-Methyl styrene | 98-83-9 | c | c | c | 2.436 | 1.3150 | 1.3322 | 1.1520
48 | 2-Furaldehyde | 1998-1-1 | c | c | c | 2.458 | 1.6671 | 2.3516 | 2.0864
49 | Dimethylamine | 124-40-3 | c | c | c | 2.463 | 2.4213 | 2.3567 | 2.3143
50 | Ethyl acrylate | 140-88-5 | c | c | c | 2.498 | 2.2088 | 2.3918 | 2.1470
51 | Propionic acid | 1979-9-4 | c | c | c | 2.584 | 2.1214 | 2.4771 | 2.5867
52 | Bromobenzene | 108-86-1 | c | c | c | 2.613 | 2.2122 | 2.5062 | 2.2423
53 | 4-Methylpentan-2-ol | 108-11-2 | c | pᵃ | c | 2.628 | 2.5719 | 2.7352 | 2.7040
54 | 2-Methoxyethyl | 110-49-6 | c | c | c | 2.756 | 2.8240 | 2.6488 | 2.4118
55 | Ethyl acetate | 141-78-6 | c | c | c | 2.776 | 3.3760 | 3.2630 | 3.2361
Table 1 (continued)
No. | Name | CAS No. | Class (Reference) | Class (LDA) | Class (SVM) | log RD50 (Experiment) | log RD50 (HM) | log RD50 (SVM) | log RD50 (RBFNN)
56 | n-Butyl acetate | 123-86-4 | c | pᵃ | c | 2.865 | 3.0690 | 2.9490 | 2.8198
57 | Propyl acetate | 109-60-4 | c | c | c | 2.899 | 3.1333 | 3.0051 | 2.9388
58 | Isobutyl acetate | 110-19-0 | c | pᵃ | c | 2.913 | 3.0141 | 2.9055 | 2.7688
59 | 2-Ethyl-butyraldehyde | 97-96-1 | c | c | pᵃ | 2.926 | 2.5561 | 2.7629 | 2.7070
60 | Isovaleraldehyde | 590-86-3 | c | pᵃ | c | 3.003 | 2.6164 | 2.8959 | 2.7995
61 | Butyraldehyde | 123-72-8 | c | c | c | 3.006 | 2.6617 | 2.9993 | 2.8432
62 | Chlorobenzene | 108-90-7 | c | c | c | 3.023 | 2.3960 | 2.8759 | 2.5436
63 | Allyl chloride | 107-05-1 | c | c | c | 3.241 | 1.8785 | 1.9329 | 2.0806
64 | Acetaldehyde | 75-07-0 | c | c | c | 3.591 | 2.8369 | 3.4841 | 2.7905
65 | Isobutyraldehyde | 78-84-2 | c | c | c | 3.620 | 2.7505 | 3.1768 | 2.9166
66 | Isopropyl acetate | 108-21-4 | c | pᵃ | c | 3.629 | 3.3105 | 3.2563 | 3.1497
67 | Propionaldehyde | 123-38-6 | c | c | c | 3.755 | 2.8124 | 3.3262 | 2.9096
68 | m-Chlorobenzylchloride | 620-20-2 | p | cᵃ | cᵃ | 1.431 | 2.0808 | 2.3165 | 1.7181
69 | Ethyl-2-hexanol | 104-76-7 | p | p | p | 1.643 | 1.9308 | 2.2895 | 1.7346
70 | n-Octanol | 111-87-5 | p | p | p | 1.674 | 2.1457 | 2.4297 | 1.8792
71 | n-Heptanol Heptan-1-ol | 111-70-6 | p | p | p | 1.993 | 2.5041 | 2.7362 | 2.2218
72 | Acetophenone | 98-86-2 | p | p | p | 2.009 | 2.3645 | 2.5255 | 2.1105
73 | Phenol | 108-95-2 | p | p | p | 2.220 | 1.8837 | 2.3287 | 2.2721
74 | 1,2-Dichlorobenzene | 95-50-1 | p | p | p | 2.259 | 2.3734 | 2.5203 | 2.4064
75 | Hexachloro-1,3-butadiene | 87-68-3 | p | cᵃ | p | 2.324 | 2.5049 | 2.5522 | 2.0138
76 | n-Amylbenzene | 538-68-1 | p | p | p | 2.362 | 2.3282 | 2.3876 | 2.3631
77 | Dibutylacetone | 502-56-7 | p | p | p | 2.436 | 2.2091 | 2.3537 | 2.5023
78 | Diisobutyl acetone | 108-83-8 | p | p | p | 2.505 | 1.9030 | 2.1375 | 2.3353
79 | Benzaldehyde | 100-52-7 | p | cᵃ | p | 2.522 | 2.1891 | 2.4493 | 2.3766
80 | p-tert-Butyltoluene | 98-51-1 | p | p | p | 2.556 | 2.6433 | 2.5799 | 2.6958
81 | o-Chlorotoluene | 95-49-8 | p | cᵃ | p | 2.756 | 2.6088 | 2.7218 | 2.6861
82 | n-Butylbenzene | 104-51-8 | p | p | p | 2.851 | 2.6152 | 2.6264 | 2.6693
83 | 2-Ethoxyethyl acetate | 111-15-9 | p | p | p | 2.857 | 3.5793 | 3.4801 | 3.5450
84 | n-Hexyl acetate | 142-92-7 | p | p | p | 2.869 | 2.8210 | 2.9035 | 3.0034
85 | 5-Methylheptan-3-one | 541-85-5 | p | p | p | 2.880 | 2.7476 | 2.8395 | 2.8645
86 | tert-Butylbenzene | 1998-6-6 | p | p | p | 2.881 | 3.0526 | 2.8814 | 2.9982
87 | Heptan-2-one | 110-43-0 | p | p | p | 2.951 | 3.1634 | 3.1809 | 3.1282
88 | Isopentyl acetate | 123-92-2 | p | p | p | 3.024 | 3.1242 | 3.1749 | 2.8277
89 | 5-Methylhexane-2-one | 110-12-3 | p | p | p | 3.091 | 3.1899 | 3.2041 | 3.1209
90 | p-Xylene | 106-42-3 | p | p | p | 3.122 | 3.3430 | 3.1511 | 3.3514
91 | o-Xylene | 95-47-6 | p | p | p | 3.166 | 3.5731 | 3.3366 | 3.5037
92 | n-Pentyl acetate | 628-63-7 | p | p | p | 3.179 | 3.1026 | 3.1555 | 2.8627
93 | Isobutyl alcohol | 78-83-1 | p | p | p | 3.260 | 3.6873 | 3.7263 | 3.7116
94 | Isopropylbenzene | 98-82-8 | p | p | p | 3.345 | 2.9474 | 2.8620 | 3.0114
95 | n-Pentanol | 71-41-0 | p | p | p | 3.366 | 3.1797 | 3.3313 | 3.2124
96 | Ethylidene norbornene | 16219-75-3 | p | cᵃ | p | 3.398 | 3.4949 | 3.3638 | 3.4003
97 | Isoamyl alcohol | 123-51-3 | p | p | p | 3.413 | 3.2248 | 3.3695 | 3.2671
98 | Ethylbenzene | 100-41-4 | p | cᵃ | p | 3.439 | 3.1513 | 3.0572 | 3.1696
99 | 2-Butoxyethanol | 111-76-2 | p | p | p | 3.451 | 2.9246 | 3.0597 | 2.8473
100 | 4-Methylpentan-2-one | 108-10-1 | p | cᵃ | p | 3.504 | 3.6436 | 3.5444 | 3.6927
101 | n-Butyl alcohol | 71-36-3 | p | cᵃ | p | 3.641 | 3.5067 | 3.6188 | 3.6844
102 | Toluene | 108-88-3 | p | cᵃ | p | 3.656 | 3.3152 | 3.1862 | 3.3854
103 | 3,3-Dimethylbutan-2-one | 75-97-8 | p | p | p | 3.747 | 4.0470 | 3.7815 | 3.5016
104 | Pentan-2-one | 107-87-9 | p | cᵃ | p | 3.773 | 3.9577 | 3.8258 | 3.8455
105 | n-Propyl alcohol | 71-23-8 | p | cᵃ | p | 4.016 | 3.8258 | 3.9024 | 3.9630
106 | Isopropyl alcohol | 67-63-0 | p | p | p | 4.055 | 4.0886 | 4.0384 | 3.9124
107 | Heptane | 142-82-5 | p | p | p | 4.193 | 4.8397 | 4.2268 | 4.5998
108 | tert-Butyl acetate | 540-88-5 | p | p | p | 4.203 | 3.7901 | 3.6282 | 4.3267
109 | Ethyl alcohol | 64-17-5 | p | cᵃ | p | 4.311 | 4.1846 | 4.1900 | 4.3170
110 | 2,2,2-Trifluoroethanol | 75-89-8 | p | p | p | 4.320 | 4.1385 | 4.1375 | 4.1499
111 | Methyl alcohol | 67-56-1 | p | cᵃ | p | 4.523 | 4.6712 | 4.5574 | 4.6286
Table 1 (continued)
No. | Name | CAS No. | Class (Reference) | Class (LDA) | Class (SVM) | log RD50 (Experiment) | log RD50 (HM) | log RD50 (SVM) | log RD50 (RBFNN)
112 | Methyl ethyl ketone | 78-93-3 | p | cᵃ | cᵃ | 4.701 | 4.3915 | 4.1553 | 4.7212
113 | Nonane | 111-84-2 | p | p | p | 4.794 | 4.0972 | 3.6139 | 4.6199
114 | Propyl ether | 111-43-3 | p | cᵃ | p | 4.949 | 4.5303 | 4.0953 | 4.4417
Test set
1 | 2,6-Toluene diisocyanate | 1991-8-7 | c | c | c | 0.585 | 1.4785 | 0.7890 | 1.1240
2 | Acrolein | 107-02-8 | c | c | c | 0.318 | 1.7402 | 2.1360 | 1.0251
3 | Formaldehyde | 50-00-0 | c | c | c | 0.628 | 0.6209 | 0.3404 | 0.3191
4 | Allyl glycidyl ether | 106-92-3 | c | c | c | 0.756 | 0.7000 | 1.0952 | 0.7584
5 | 4-Vinylpyridine | 100-43-6 | c | c | c | 1.072 | 0.7114 | 0.4248 | 0.6342
6 | 2-Vinylpyridine | 100-69-6 | c | c | c | 1.407 | 0.7538 | 0.6153 | 0.6857
7 | Hexylamine | 111-26-2 | c | c | c | 1.703 | 1.7236 | 1.7744 | 1.8082
8 | Divinyl benzene | 1321-74-0 | c | c | c | 1.892 | 0.9109 | 0.4333 | 0.5655
9 | 3-Cyclohexene | 100-50-5 | c | c | c | 1.978 | 1.9735 | 2.1265 | 2.1623
10 | n-Propylamine | 107-10-8 | c | c | c | 2.175 | 1.9665 | 1.9974 | 1.9800
11 | Triethylamine | 121-44-8 | c | pᵃ | c | 2.233 | 2.4248 | 2.2944 | 2.5222
12 | Allyl bromide | 106-95-6 | c | c | c | 2.332 | 1.7672 | 1.8945 | 1.9951
13 | Acetic acid | 64-19-7 | c | pᵃ | c | 2.568 | 2.4175 | 3.0713 | 2.5838
14 | Styrene | 100-42-5 | c | c | c | 2.759 | 1.7174 | 1.5734 | 1.5454
15 | Methyl acetate | 79-20-9 | c | c | pᵃ | 2.919 | 3.4682 | 3.3309 | 3.3396
16 | Valeraldehyde | 110-62-3 | c | c | c | 3.050 | 2.5768 | 2.8359 | 2.7636
17 | Undecan-2-one | 112-12-9 | p | p | p | 1.558 | 1.5154 | 1.8393 | 1.5826
18 | n-Hexylbenzene | 1077-16-3 | p | p | p | 2.097 | 2.0520 | 2.1515 | 1.9314
19 | Hexan-1-ol | 111-27-3 | p | p | p | 2.378 | 2.8454 | 3.0345 | 2.6729
20 | Octan-2-one | 111-13-7 | p | p | p | 2.680 | 2.7619 | 2.8497 | 2.8432
21 | Cyclohexanone | 108-94-1 | p | p | cᵃ | 2.879 | 4.3109 | 4.0039 | 4.1245
22 | Heptan-4-one | 123-19-3 | p | p | p | 3.041 | 3.3125 | 3.2522 | 3.2661
23 | Propylbenzene | 103-65-1 | p | cᵃ | p | 3.185 | 2.8794 | 2.8464 | 2.8394
24 | Hexan-2-one | 591-78-6 | p | cᵃ | p | 3.407 | 3.5616 | 3.5042 | 3.4879
25 | Caproaldehyde | 66-25-1 | p | cᵃ | p | 3.624 | 3.2717 | 3.3117 | 3.5764
26 | 3-Picoline | 108-99-6 | p | cᵃ | cᵃ | 3.906 | 3.3895 | 3.3932 | 3.1897
27 | Octane | 111-65-9 | p | p | p | 4.259 | 4.4738 | 3.9236 | 4.7863
28 | Acetone | 67-64-1 | p | cᵃ | cᵃ | 4.703 | 5.1182 | 4.6214 | 4.3846
Note: c, reactive chemicals; p, nonreactive chemicals. ᵃ Misclassified chemicals.
contained atom types and bonding information. The three-dimensional structure of each compound was refined with the semi-empirical molecular orbital program MOPAC using the PM3 Hamiltonian (Stewart, 1989). All geometries were fully optimized without symmetry restrictions. In all cases, frequency calculations were performed to ensure that the calculated geometries correspond to true minima. The resulting geometries were transferred into the CODESSA software, developed by the Katritzky group (1994, 1995), which can calculate constitutional, topological, geometrical, electrostatic, and quantum chemical descriptors. In the present work, these five classes of structural descriptors were computed, giving about 612 descriptors. Additionally, some macroscopic descriptors, including log P and refractivity, were calculated with HyperChem (1994).
2.3. Data pre-processing
A successful QSAR model depends on the selection of suitable descriptors. If molecular structures are represented by improper descriptors, the model will not lead to reasonable predictions. Feature selection entails pruning the descriptor pool with the HM available in the framework of the CODESSA program (Katritzky et al., 1994; Oblak and Randic, 2000). HM can either quickly give a good estimate of the quality of correlation to expect from the data, or derive several best regression models. In addition, it identifies descriptors that have bad or missing values, are insignificant (from the standpoint of a single-parameter correlation), or are highly inter-correlated. In this study, log RD50 of the investigated compounds was used as the dependent variable for this purpose.
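The three screening criteria named above (missing values, insignificant one-parameter correlation, high inter-correlation) can be illustrated with a minimal sketch. This is not the CODESSA implementation: the function names and both thresholds (`min_r2`, `max_inter_r`) are assumptions chosen only for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def prescreen(
    descriptors: Dict[str, List[Optional[float]]],
    activity: Sequence[float],
    min_r2: float = 0.01,       # assumed threshold, not taken from the paper
    max_inter_r: float = 0.99,  # assumed threshold, not taken from the paper
) -> List[str]:
    """Return the names of the descriptors that survive three simple filters."""
    # 1) Drop descriptors with missing values.
    complete = {k: v for k, v in descriptors.items() if all(x is not None for x in v)}
    # 2) Drop descriptors whose one-parameter correlation with the activity is negligible.
    r2 = {k: pearson_r(v, activity) ** 2 for k, v in complete.items()}
    significant = [k for k in complete if r2[k] >= min_r2]
    # 3) Within each highly inter-correlated group keep the best-correlated descriptor.
    significant.sort(key=lambda k: r2[k], reverse=True)
    kept: List[str] = []
    for name in significant:
        if all(abs(pearson_r(complete[name], complete[k])) < max_inter_r for k in kept):
            kept.append(name)
    return kept
```

Ranking by the one-parameter R2 before the inter-correlation check means that, of two nearly collinear descriptors, the one more strongly related to log RD50 is retained.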
Finally, the descriptor pool used in the following study was reduced from 612 to 166.

3. Methodology
After the descriptors are pre-selected, the next step is to build the classification and regression models using statistical methods. In this paper, LDA and SVM were used for the classification model, and HM, SVM and RBFNN were used for the regression models. The general flowchart of the database-mining procedure is shown in Fig. 1. As the theories of these statistical methods have been well described in many monographs and articles, only a brief description of each method is given here.

Fig. 1. Flow chart of this work. (Boxes: chemicals; descriptor selection, LDA; classification model, SVM; reactive chemicals and nonreactive chemicals; for each group, descriptor selection by HM, a linear model by HM, and a nonlinear model by SVM or RBFNNs.)

3.1. Theory of LDA (Fisher, 1936; Kachigan, 1986)
The basic idea of LDA is to classify the dependent variable by dividing an n-dimensional descriptor space into two regions separated by a hyperplane, which is defined by a linear discriminant function as follows:
$$Y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n \qquad (1)$$
where $Y$ is the discriminant score, that is, the dependent variable, $x_1, \ldots, x_n$ represent the specific descriptors, and $b_1, \ldots, b_n$ are the weights associated with the respective descriptors.

3.2. Theory of SVM
SVM, developed by Vapnik (1995) as a novel type of learning machine, is gaining popularity due to many
attractive features and promising empirical performance. The main advantage of SVM is that it adopts the structural risk minimization (SRM) principle, which has been shown to be superior to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks (Burges, 1998). SRM minimizes an upper bound on the generalization error based on the Vapnik–Chervonenkis (VC) dimension, as opposed to ERM, which minimizes the training error. This method has proven to be very effective for addressing general-purpose classification and regression problems (Burbidge et al., 2001; Evgeny et al., 2003; Liu et al., 2003; Vladimir et al., 2003; Cai et al., 2004; Liu et al., 2004; Xue et al., 2004). As there are a number of introductions to SVM (Vapnik, 1998; Schölkopf et al., 1999; Cristianini and Shawe-Taylor, 2000; URL: http://www.kernel-machines.org/), we only briefly summarize here the main ideas of SVM for classification and regression. SVM classifiers are generated by a two-step procedure. First, the sample data vectors are mapped to a very high-dimensional space, whose dimension is significantly larger than that of the original data space. Then, the SVM algorithm finds the hyperplane in this space with the largest margin separating the classes of data. The decision function is
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{l} y_i \alpha_i K(x, x_i) + b\right) \qquad (2)$$
where sign is the sign function, which returns +1 for a positive argument and −1 for a negative argument; $y_i$ are the input class labels, which take a value of +1 or −1; $x_i$ is the set of descriptors of the ith sample; and $K(x, x_i)$ is a kernel function, whose value is equal to the inner product of the two vectors $x$ and $x_i$ in the feature space, $\Phi(x)$ and $\Phi(x_i)$. That is, $K(x, x_i) = \Phi(x) \cdot \Phi(x_i)$. Any function that satisfies Mercer's condition can be used as the kernel function, and $\alpha_i$ are the Lagrange multipliers. SVM can also be applied to regression by introducing an alternative loss function. The decision function for regression is as follows:
$$f(x) = \sum_{i=1}^{l} y_i \alpha_i K(x, x_i) + b \qquad (3)$$
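To make Eq. (2) concrete, the following minimal sketch evaluates the decision function for a given set of support vectors. It assumes the common Gaussian form of the RBF kernel, K(x, z) = exp(−γ‖x − z‖²), which is not written out in the text; the support vectors, labels, multipliers, bias and γ are hand-picked, hypothetical values rather than a trained model.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, labels, alphas, b, gamma) -> float:
    """Argument of sign(.) in Eq. (2): sum_i y_i * alpha_i * K(x, x_i) + b."""
    return sum(
        y_i * a_i * rbf_kernel(x, x_i, gamma)
        for x_i, y_i, a_i in zip(support_vectors, labels, alphas)
    ) + b


def classify(x, support_vectors, labels, alphas, b, gamma) -> int:
    """Eq. (2): +1 for a positive decision value, -1 otherwise."""
    return 1 if decision_value(x, support_vectors, labels, alphas, b, gamma) > 0 else -1


# Hypothetical values for illustration only (not a trained model).
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
b = 0.0
gamma = 0.2
```

A query point close to the first support vector receives the label +1 and one close to the second receives −1; training an SVM amounts to finding the multipliers and the bias, which this sketch takes as given.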
The constraints are the same as those of Eq. (2).

3.3. Theory of HM
HM in CODESSA was used to select descriptors and build the linear model. Following the pre-selection of descriptors, multiple linear regression models are developed in a stepwise procedure (Katritzky et al., 1994; Oblak and Randic, 2000). The goodness of the correlation is tested by the squared correlation coefficient (R2), the F-test value (F), and the squared standard deviation (S2).

3.4. Theory of RBFNN
The theory of RBFNN has been extensively presented in the literature (Derks et al., 1995; Xiang et al., 2002), so only a brief description of the RBFNN principle is given here. The network consists of an input layer, a hidden layer, and an output layer. The input layer does not process the information; it only distributes the input vectors to the hidden layer. The hidden layer consists of a number of RBF units ($n_h$) and a bias ($b_k$). Each hidden layer unit represents a single radial basis function, with an associated center position and width, and employs it as a nonlinear transfer function to operate on the input data. The most often used RBF is the Gaussian function, which is characterized by a center ($c_j$) and a width ($\sigma_j$). In this study, the Gaussian was selected as the radial basis function. The operation of the output layer is linear and is given by
$$y_k(x) = \sum_{j=1}^{n_h} w_{kj} h_j(x) + b_k \qquad (4)$$
where $y_k$ is the kth output unit for the input vector $x$, $w_{kj}$ is the weight connection between the kth output unit and the jth hidden layer unit, $h_j(x)$ is the output of the jth hidden layer unit, and $b_k$ is the bias. Training an RBFNN involves selecting the centers, widths and weights. In this paper, the forward subset selection routine was used to select the centers from the training set samples. The connection weights between the hidden layer and the output layer were adjusted using a least-squares solution after the centers and widths of the radial basis functions had been selected.

3.5. Algorithm implementation and computation environment
All calculation programs implementing SVM were written in R-files based on an R script for SVM. All calculation programs implementing RBFNN were written in M-files based on a basic MATLAB script for RBFNN. The scripts were run on a Pentium IV PC with 256 MB of RAM.

4. Results and discussion

4.1. Results of classification model
4.1.1. Results of LDA model
LDA was performed with the remaining 166 descriptors in the SPSS statistical software. For the purposes of modeling, a value of "1" was assigned to reactive compounds and a value of "−1" to nonreactive compounds. The linear classification was performed in a stepwise manner: at each step the variable that adds the most to the separation of the groups is entered into (or the variable that adds the least is removed from) the discriminant function. The selection of the descriptors was based on the parameter F. The criteria for selecting the best LDA equation included a comparison of the tabulated F and Wilks' λ statistical values. In this study, the minimum partial F value to enter was set to the default value of 3.84 and the maximum partial F to remove was 2.71. The prior probabilities were computed from the group sizes (0.585 and 0.415 for the reactive and nonreactive compounds, respectively). The linear discriminant function is as follows:
$$\text{Discriminant score} = 17.532\,\mathrm{RNBR} + 0.032\,{}^{2}\mathrm{CIC} + 0.057\,\mathrm{XY} - 0.373\,\mathrm{PPSA\text{-}3} - 0.020\,H_f + 72.481\,\mathrm{HDCA\text{-}1} - 2.422 \qquad (5)$$
It contains six molecular descriptors, whose chemical meanings are listed in Table 2. The results of the LDA model are listed in Table 1. The model gave a total accuracy of 77.5% for the whole dataset; the accuracy for the training set and the test set was 80.7% and 75.0%, respectively.

4.1.2. Results of SVM model
The discrimination process was carried out by SVM using the same subset of descriptors. As follows from the above discussion of the theory of SVM, the performance of SVM depends on the combination of several parameters. These are the capacity parameter C, ε of the ε-insensitive loss function, the kernel type K and its corresponding
Table 2 Descriptor names and chemical meanings used in this work
Descriptor name | Chemical meaning
¹BIC | Bonding information content (order 1)
BO_C^min | Min (>0.1) bond order of a C atom
CHDon | Count of H-donors sites [quantum-chemical PC]
²CIC | Complementary information content (order 2)
HDCA-1/TMSA | HA dependent HDCA-1/TMSA [quantum-chemical PC]
Hf | Final heat of formation
²ICave | Average information content (order 2)
PPSA-3 | PPSA-3 atomic charge weighted PPSA [Zefirov's PC]
Re | Refractivity
RNBR | Relative number of benzene rings
RNSB | Relative number of single bonds
RPCG | RPCG relative positive charge (QMPOS/QTPLUS) [Zefirov's PC]
V_C^max | Max valency of a C atom
XY | XY Shadow
ZX | ZX Shadow
parameters. For the classification problem, only the kernel type K with its corresponding parameters and the capacity parameter C need to be optimized. The RBF kernel was used in the present work. In order to find the optimal combination of the two parameters, leave-one-out (LOO) cross-validation of the whole training set was performed, and the LOO accuracy was used as the selection criterion. We considered the parameter γ from 0.0001 to 1 with an increment of 0.0002. The parameter C was chosen from values between 10 and 100 with an increment of 10, and between 100 and 1000 with an increment of 100. The optimal γ was found to be 0.2 and the optimal value of C was 70. Then, the predictive ability of the model was tested on the test set compounds. The results are listed in Table 1, which also lists the LOO results for the training set obtained with the above optimal parameters. The samples misclassified by SVM are marked with footnote a in Table 1. The total accuracy for SVM was 94.4%, which was much higher than that of LDA (77.5%). The accuracy for the training set and the test set was 96.5% and 85.7%, respectively. This comparison shows that SVM performed better than LDA, which implies that, using the same descriptors, the SVM method is capable of recognizing highly nonlinear structure-activity relationships, whereas LDA can only capture linear relationships between molecular characteristics.

4.1.3. Discussion of the discriminant features
By interpreting the descriptors in the classification model, it is possible to gain some insight into the factors that are likely to distinguish the sensory irritation properties of the VOCs. Of the six descriptors, RNBR is constitutional, 2CIC is topological, XY is geometrical, PPSA-3 is electrostatic, and Hf and HDCA-1 are quantum chemical descriptors. These descriptors encode different aspects of the molecular structure.
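How the six descriptors enter the classification can be illustrated by evaluating the linear discriminant function of Eq. (5) directly. The sketch below takes the signs as discussed in this section (positive coefficients for RNBR, 2CIC, XY and HDCA-1, negative coefficients for PPSA-3 and Hf); the negative sign of the constant term is an assumption made here, and any descriptor values passed in are hypothetical.

```python
from typing import Mapping

# Coefficients of Eq. (5); signs as discussed in Section 4.1.3.
LDA_COEFFICIENTS = {
    "RNBR": 17.532,
    "2CIC": 0.032,
    "XY": 0.057,
    "PPSA-3": -0.373,
    "Hf": -0.020,
    "HDCA-1": 72.481,
}
# Constant term of Eq. (5); its negative sign is an assumption.
LDA_INTERCEPT = -2.422


def discriminant_score(descriptors: Mapping[str, float]) -> float:
    """Linear discriminant score: intercept plus the weighted sum of the six descriptors."""
    return LDA_INTERCEPT + sum(
        coefficient * descriptors[name]
        for name, coefficient in LDA_COEFFICIENTS.items()
    )
```

With all descriptors set to zero the score equals the constant term; raising a descriptor with a positive coefficient raises the score, and raising PPSA-3 or Hf lowers it, which is the behaviour interpreted in the discussion of the individual descriptors.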
RNBR reflects the size of the molecule to some degree and can therefore partially account for the steric hindrance of the compounds. As the benzene ring is a lipophilic functional group, RNBR can also partially account for the lipophilicity of the compounds. 2CIC is a topological descriptor defined on the basis of Shannon information theory. It reflects the branching of the molecule and the diversity of the atoms in the branches. In other words, it represents the difference between the maximum possible complexity of a graph and the realized topological information of the chemical species as defined by the information content. The complexity of the topological graph and the differences between chemical species may lead to differences in the steric and hydrophobic properties of the compounds. The XY shadow is a geometrical descriptor calculated from the projection of the molecular shadow onto a 2D plane according to the orientation of the molecule in space. It therefore reflects the size
and geometrical shape of the molecule. HDCA-1 is a hydrogen bonding acceptor dependent hydrogen bonding donor surface area, and this descriptor describes the hydrogen bonding acceptor properties of the compounds. All of the above descriptors have positive coefficients in the LDA model, which means that compounds with larger values of these descriptors tend to fall into the reactive group. Hf is a quantum-mechanical energy-related descriptor that gives the energy of the molecule on the thermodynamic standard scale (elements in the ideal gas state at 298.15 K and 101 325 Pa). It characterizes the stability of the molecule. PPSA-3 is one of the charged partial surface area (CPSA) descriptors, which are based on the surface area of the whole molecule and on its charge distribution. These descriptors thus combine shape and electronic information and encode features responsible for polar interactions between molecules. The negative coefficients of these two descriptors in the model indicate that increasing their values shifts a compound toward the nonreactive group.

4.2. Results of regression models

4.2.1. Results of HM models

After classification of the data set into two classes, HM was used to develop linear models for the prediction of log RD50 for each class from the calculated structural descriptors. The first step of this method is to select a proper number of descriptors. To determine the optimum number, models with a variety of subset sizes were investigated. When adding another descriptor did not significantly improve the statistics of a model, the optimum subset size was considered to have been reached. The six-parameter correlation models for each group are as follows:

$$\log \mathrm{RD}_{50}^{\mathrm{rea}} = 0.214\,\mathrm{PPSA3} - 0.017\,H_f + 22.510\,V_{\mathrm{C}}^{\max} - 0.229\,{}^{1}\mathrm{BIC} - 44.508\,\mathrm{HDCA\text{-}1/TMSA} - 8.438 - 0.049\,\mathrm{BO}_{\mathrm{C}}^{\min} \tag{6}$$

$N = 67$; $R^2 = 0.7372$; $R^2_{\mathrm{CV}} = 0.6707$; $F = 28.05$; $S^2 = 0.3921$

$$\log \mathrm{RD}_{50}^{\mathrm{nonrea}} = 5.550 - 0.043\,\mathrm{Re} - 6.329\,\mathrm{RPCG} - 0.377\,{}^{2}\mathrm{IC}_{\mathrm{ave}} - 0.049\,\mathrm{CH}_{\mathrm{donor}} + 3.826\,\mathrm{RNSB} - 0.047\,\mathrm{ZX} \tag{7}$$

$N = 47$; $R^2 = 0.8442$; $R^2_{\mathrm{CV}} = 0.7606$; $F = 36.14$; $S^2 = 0.1313$

The chemical meanings of the molecular descriptors are listed in Table 2. Fig. 2(a) and (b) show the predicted vs. experimental log RD50 for both HM models. The reactive model gave an RMS of 0.626 for the training set, 0.630 for the prediction set and 0.627 for the overall data set, and the corresponding correlation coefficients (R) were 0.859, 0.840 and 0.845, respectively. For the nonreactive group, the RMS values of the three data sets are 0.362, 0.507 and 0.395, and the corresponding correlation coefficients (R) were 0.919, 0.866 and 0.900, respectively.

4.2.2. Results of SVM models

To obtain more accurate prediction models, SVM was used to develop nonlinear models based on the same subsets of descriptors. To obtain better results, the parameters that influence the performance of SVM also need to be optimized. Besides C and γ, which were mentioned above for the classification procedure, ε of the ε-insensitive loss function also needs to be optimized for the regression problem. Each parameter was selected by systematically changing its value in the training step, and the value that gave the best LOO cross-validation result was used in the model. RMS was used as the error function for the regression task. From this process, γ, ε and C were fixed at 0.02, 0.1 and 100 for the reactive group, and at 0.0004, 0.04 and 500 for the nonreactive group, respectively. The predicted results of the optimal SVM are shown in Fig. 2(c) and (d) and in Table 1. For the reactive group, the RMS errors of the training set, the test set and the whole data set are 0.4415, 0.7430 and 0.5140, and the correlation coefficients are 0.910, 0.874 and 0.759, respectively. For the nonreactive group, the RMS values of the three data sets are 0.372, 0.452 and 0.405, and the corresponding correlation coefficients (R) are 0.900, 0.859 and 0.888, respectively. Considering the high noise in the biological activity data of these volatile organic compounds, the predicted values can be regarded as being in agreement with the experimental values. Nevertheless, the results of the nonlinear SVM models were not fully satisfactory, so RBFNN was used to see whether the prediction results could be improved.
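The parameter selection used for the SVM models (change one value systematically, keep the value with the lowest LOO cross-validated RMS) can be sketched as below. This is only a minimal illustration under stated assumptions: a kernel-weighted average is used as a simple stand-in regressor and is not an SVM, the Gaussian kernel form exp(-γ‖x − x'‖²) is assumed, and the toy data, the candidate grid and all function names are hypothetical.

```python
import math


def rms(y_true, y_pred):
    """Root mean squared error, used here as the error function."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def eps_insensitive_loss(y_true, y_pred, eps):
    """Mean epsilon-insensitive loss: residuals smaller than eps cost nothing."""
    return sum(max(0.0, abs(t - p) - eps)
               for t, p in zip(y_true, y_pred)) / len(y_true)


def gaussian_kernel(a, b, gamma):
    """Assumed Gaussian kernel exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))


def kernel_predict(X_train, y_train, x, gamma):
    """Stand-in regressor (kernel-weighted average of training targets)."""
    weights = [gaussian_kernel(xi, x, gamma) for xi in X_train]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed: fall back to the mean
        return sum(y_train) / len(y_train)
    return sum(w * yi for w, yi in zip(weights, y_train)) / total


def loo_rms(X, y, gamma):
    """Leave-one-out RMS: each sample is predicted from all the others."""
    preds = []
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        preds.append(kernel_predict(X_rest, y_rest, X[i], gamma))
    return rms(y, preds)


def select_gamma(X, y, grid):
    """Return the grid value with the lowest LOO RMS, plus all scores."""
    scores = {g: loo_rms(X, y, g) for g in grid}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical one-descriptor toy data, for illustration only.
X = [[0.0], [0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [0.0, 0.48, 0.84, 1.0, 0.91, 0.6, 0.14]
best_gamma, scores = select_gamma(X, y, [0.01, 0.1, 1.0, 10.0])
print(best_gamma, round(scores[best_gamma], 4))
```

For an actual SVM regression, the same loop would wrap the SVM training call and would be repeated for ε and C as well as γ.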
Fig. 2. Predicted vs. experimental log RD50 of VOCs: (a) reactive compounds by HM; (b) nonreactive compounds by HM; (c) reactive compounds by SVM; (d) nonreactive compounds by SVM; (e) reactive compounds by RBFNN; (f) nonreactive compounds by RBFNN. In each panel the calculated log RD50 is plotted against the experimental log RD50, with the training set and the test set shown separately.

4.2.3. Results of RBFNN models

To obtain better results, the parameters that influence the performance of RBFNN were optimized. The optimal width of the RBFNN was selected by systematically changing its value in the training step, and the values that gave the best LOO cross-validation results were used in the models. For this data set, the optimal widths were determined to be 2.7 for the reactive group and 1.1 for the nonreactive group. The corresponding number of centers (hidden layer nodes) of the RBFNN is 13 for the reactive group and 18 for the nonreactive group. The predicted results of the nonlinear models are shown in Table 1 and Fig. 2(e) and (f). For the reactive group, the RMS errors of the training set, the test set and the
whole data set are 0.4755, 0.6342 and 0.5009, and the correlation coefficients are 0.8892, 0.8805 and 0.8770, respectively. For the nonreactive group, the RMS values of the three data sets are 0.2430, 0.4798 and 0.3064, and the corresponding correlation coefficients (R) are 0.9580, 0.8578 and 0.9340, respectively. Comparing the results obtained, it can be seen that the RBFNN models give the best results, although the improvement is not significant.

4.2.4. Discussion of the input features

A prerequisite of a good predictive model for any biological activity is that it should be transparent and mechanistically interpretable. To achieve this, the physicochemical meaning of the parameters used for the modeling needs to be elucidated. Mechanistic interpretation of QSAR descriptors is seldom easy, especially for heterogeneous data sets. In this study, six descriptors were found to be important for each of the reactive and nonreactive groups.

We first give a brief interpretation of the reactive model. Of the six descriptors, three are the same as the classification descriptors, namely HDCA-1, Hf and PPSA-3. The other three are the max valency of a C atom ($V_{\mathrm{C}}^{\max}$), the bonding information content of order 1 (¹BIC) and the min (>0.1) bond order of a C atom ($\mathrm{BO}_{\mathrm{C}}^{\min}$). $V_{\mathrm{C}}^{\max}$ and $\mathrm{BO}_{\mathrm{C}}^{\min}$ are quantum-mechanical valency-related descriptors. These descriptors relate to the strength of intramolecular bonding interactions and characterize the stability of the molecules, their conformational flexibility and other valency-related properties. ¹BIC is a topological descriptor defined on the basis of Shannon information theory. This index encodes the branching ratio, unsaturation and constitutional diversity of a molecule. The combination of these descriptors, comprising shape, electronic and bond information about the molecules, adequately represents hydrophobic, steric and stability effects on the activity of a molecule.

For the nonreactive group, the descriptors are RNSB, ZX shadow, CHDon, ²ICave, RPCG and Re. RNSB affects the density of the electron cloud of the molecule: the larger the RNSB, the lower the density of the electron cloud. The density of the electron cloud is in turn the main factor that influences the polarity of the molecule.
Thus, differences in the polarity of the chemicals may lead to different polar interactions between the chemical and the receptor. The definition and meaning of ZX shadow are similar to those of XY shadow above. CHDon distinguishes the molecules according to the number of hydrogen donor sites that are capable of donating a hydrogen to the surrounding medium; it thus indicates noncovalent hydrogen-bonding interactions. The topological descriptor ²ICave describes the size, branching and composition of a molecule and relates to the dispersion interactions among molecules. RPCG is defined as the most negative atomic charge divided by the sum of all of the negative atomic charges of the molecule, and it represents the effect of polar intermolecular interactions. Refractivity (Re), a physicochemical descriptor, is a combined measure of the size and polarizability of a molecule and can account for polarization effects.

As can be seen from Eqs. (6) and (7), a positive coefficient implies that increasing the value of the descriptor increases log RD50, and vice versa. From the above discussion, descriptors representing noncovalent properties, such as dispersion, polar and hydrogen-bond interactions, account for the structural features underlying the sensory irritant activity of nonreactive chemicals. For reactive compounds, however, covalent-related descriptors, such as the valency-related descriptors, also strongly influence the interaction. The chemical mechanism is more complex than the physical mechanism, and so are its influencing factors. This explains the better prediction ability for the nonreactive group than for the reactive group.
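Two of the descriptor families discussed above can be made concrete with a short sketch. It assumes the commonly quoted definitions IC = -Σ p_i log2 p_i and CIC = log2 n - IC for the information-content indices (these formulas are not spelled out in this section), and it uses the RPCG definition given in the text. The atom partition, the partial charges and the function names are hypothetical, and the neighborhood comparison needed to build order-1 or order-2 equivalence classes is deliberately left out.

```python
import math
from collections import Counter


def information_content(class_labels):
    """Shannon information content (bits per atom) of an atom partition.

    class_labels holds one label per atom; atoms sharing a label are
    treated as equivalent (for order 0 the label is simply the element).
    """
    n = len(class_labels)
    counts = Counter(class_labels).values()
    return -sum((c / n) * math.log2(c / n) for c in counts)


def complementary_information_content(class_labels):
    """CIC = log2(n) - IC: maximum possible complexity minus the realized one."""
    return math.log2(len(class_labels)) - information_content(class_labels)


def rpcg(partial_charges):
    """Most negative atomic charge divided by the sum of all negative charges."""
    negatives = [q for q in partial_charges if q < 0]
    if not negatives:
        raise ValueError("no negatively charged atoms")
    return min(negatives) / sum(negatives)


# Hypothetical order-0 partition for ethanol (C2H6O): atoms grouped by element.
ethanol_atoms = ["C"] * 2 + ["H"] * 6 + ["O"]
ic0 = information_content(ethanol_atoms)
cic0 = complementary_information_content(ethanol_atoms)

# Made-up partial charges, for illustration only.
example_rpcg = rpcg([-0.4, -0.1, 0.3, 0.2])
print(round(ic0, 4), round(cic0, 4), round(example_rpcg, 4))
```

A molecule whose atoms all fall into one class has IC = 0 and the largest possible CIC, whereas a molecule whose atoms are all distinct has CIC = 0, which is the sense in which CIC measures the gap between the maximum and the realized complexity.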
5. Conclusion

Quantitative modeling of VOCs as sensory irritants was performed by different statistical methods based on descriptors calculated from the molecular structure alone. For the classification models, the nonlinear model built by SVM was far better than the linear LDA model, which indicates that SVM is a very promising tool for classification. For the regression models, however, the nonlinear models (SVM and RBFNN) gave only slightly better results than the linear model (HM). The results for the nonreactive group show better prediction ability than those for the reactive group. Considering the high noise in the biological activity data of these VOCs, the predicted values can be regarded as being in agreement with the experimental values. At the same time, the proposed models can identify, and provide some insight into, the structural features that are related to sensory irritation. Furthermore, the proposed approach can be extended to other model-building investigations.
Acknowledgement

The authors thank the National Natural Science Foundation of China (NSFC) Fund (No. 20305008) for financial support. The authors also thank the R Development Core Team for providing the free R 1.7.1 software.