
Toxicology 217 (2006) 105–119

Application of support vector machine (SVM) for prediction of toxic activity of different data sets

C.Y. Zhao a, H.X. Zhang a,∗, X.Y. Zhang a, M.C. Liu a, Z.D. Hu a, B.T. Fan b

a Department of Chemistry, Lanzhou University, Lanzhou 730000, China
b Université Paris 7-Denis Diderot, ITODYS 1, Rue Guy de la Brosse, 75005 Paris, France

Received 5 August 2005; received in revised form 31 August 2005; accepted 31 August 2005 Available online 5 October 2005

Abstract

As a new method, the support vector machine (SVM) was applied to predict the toxicity of different data sets and compared with two other common methods, multiple linear regression (MLR) and the radial basis function neural network (RBFNN). Quantitative structure–activity relationship (QSAR) models based on calculated molecular descriptors were clearly established. Among them, the SVM model gave the highest q2 and correlation coefficient R, indicating that SVM has better generalization ability than the MLR and RBFNN methods, especially on the test set and the whole data set. This ultimately leads to better generalization than neural networks, which implement the empirical risk minimization principle and may not converge to global solutions. We therefore expect the SVM method to be a powerful tool for the prediction of molecular properties.
© 2005 Elsevier Ireland Ltd. All rights reserved.

Keywords: Quantitative structure–activity relationship (QSAR); Toxicity; Multiple linear regression (MLR); Radial basis function neural network (RBFNN); Support vector machine (SVM)

1. Introduction

Chemical byproducts from industrial systems that are allowed to escape into the environment can have toxic effects. Each of these chemicals has the potential to be harmful, and it is crucial that each compound be assessed for its toxicity level. Several experimental methods are available for screening the estrogenic activity of chemicals (e.g., in vivo and in vitro assay tests), and these have been carried out using receptors and other biological materials of at least human, rat, mouse, and calf origin (Hill, 1972). However, such experiments can be costly and time-consuming, and could potentially pro-

Corresponding author. Tel.: +86 931 891 2578; fax: +86 931 891 2582. E-mail address: [email protected] (H.X. Zhang).

duce toxic side products. This has made the development of computational methods as an alternative tool for predicting the properties of chemicals a subject of intensive study. The properties of a chemical are all derived from, and related to, its unique molecular structure; it follows that relationships also exist between the different properties of the chemical. These principles form the underlying basis for the prediction of toxicity from chemical structure. Among the computational methods, the quantitative structure–activity relationship (QSAR) approach has found diverse applications in predicting compounds' properties, including biological activity prediction (Eldred et al., 1999; Kauffman and Jurs, 2000; Patankar and

0300-483X/$ – see front matter © 2005 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.tox.2005.08.019


Jurs, 2000; Wessel et al., 1998), physical property prediction (Johnson and Jurs, 1999; Engelhardt McClelland and Jurs, 2000; Cumpson, 2001), and toxicity prediction (DeWeese and Schultz, 2001; Filov et al., 1979; LeBlond et al., 2000). The endocrine disruptor screening and testing advisory committee (EDSTAC), organized by the Environmental Protection Agency (EPA), considers the quantitative structure–activity relationship method an important part of its priority-setting process (Fang et al., 2003). QSAR is one of the most promising and successful approaches, providing not only a rapid and useful means of predicting biological activity and/or toxicity but also an aid to understanding the mechanisms behind chemicals' properties. Many different techniques can be applied for QSAR development, e.g., multiple linear regression (MLR; Kapur et al., 2001), partial least squares (PLS; Zhao et al., 2004), the heuristic method (HM; Xue et al., 2004; Katritzky et al., 1994, 1995), and neural networks (Wan and Harrington, 1999). Among these methods, linear approaches are quite limited for a complex biological system. The flexibility of neural networks enables them to discover more complex nonlinear relationships in experimental data. However, neural systems have some problems inherent in their architecture, such as overtraining, overfitting, and network optimization. Other problems with the use of neural networks concern the reproducibility of results, due largely to random initialization of the networks and variation of stopping criteria. For these reasons, there is a continuing need for more accurate and informative techniques in QSAR analysis.

The support vector machine (SVM) is a new algorithm developed by the machine learning community (Vapnik, 1982; Burges, 1998). The SVM approach automatically controls the flexibility of the resulting classifier on the training data. Additionally, by the design of the algorithm, the deteriorating influence of the input dimensionality on the generalization ability is largely suppressed. Owing to its remarkable generalization performance, the SVM has attracted attention and gained extensive application, e.g., isolated handwritten digit recognition (Cun et al., 1995), object recognition (Blanz et al., 1996), drug design (Burbidge et al., 2001), prediction of protein structure (Cai et al., 2002), gene identification (Bao and Sun, 2002), and disease diagnosis (Zhao et al., 2004). In most of these cases, the performance of SVM modeling either matches or is significantly better than that of traditional machine learning approaches. The SVM has a number of interesting properties, including an effective avoidance of overfitting, which improves its ability to build models using large numbers of molecular property

descriptors with relatively few experimental results in the training set. In this report, the aim was to test the performance of the SVM method on two data sets and to evaluate its applicability as a powerful screening method for predicting compounds' physico-chemical properties, given only the molecular structures.

2. Methodology

2.1. Data set

2.1.1. Data set 1
The toxicity values log(1/EC50) are expressed as the negative logarithm of the effective concentration at which the Microtox test organisms reduce their bioluminescence emission by 50% (Wang et al., 1997). The data set included 76 compounds, whose experimental toxicity values are listed in Table 1. The whole data set was divided into a training set (61 compounds) for model development/calibration and a test set (15 compounds) for external prediction.

2.1.2. Data set 2
This data set included 146 natural, synthetic, and environmental chemicals, which can bind to the human androgen receptor (AR), may possess androgenic or antiandrogenic activities, and act as steroid hormone ligand agonists and antagonists (Beger et al., 2001; Kavlock et al., 1996; Colborn et al., 1993; Waller et al., 1996; Tong et al., 1997). These chemicals have been designated endocrine disrupting compounds (EDs), defined as "an exogenous agent that interferes with the production, release, transport, metabolism, binding, action, or elimination of natural hormones in the body responsible for the maintenance of homeostasis and the regulation of developmental processes" (Beger et al., 2001; Kavlock et al., 1996). The AR binding affinity of each compound, expressed in log units of relative binding affinity (log(RBA)), is listed in Table 2. The whole data set was divided into a training set (118 compounds) and a test set (28 compounds).

2.2. Descriptor generation

The structures of the compounds in both data sets were drawn with the ISIS DRAW 2.3 program (ISIS Draw, 1990).
The final geometries were obtained with the semi-empirical PM3 method in the HYPERCHEM 6.03 program (HyperChem, 2000). All calculations were carried out at restricted Hartree-Fock level with no configuration interaction. The molecular structures were


Table 1
Experimental and calculated log(1/EC50) by the MLR, RBFNN, and SVM methods

No.  Compound                       Exp. log(1/EC50)a  MLRb   RBFNNb  SVMb

Training set (61 compounds)
1    Furan                          0.9    1.88   0.79   1.47
2    2-Methyl-2-propanol            1.45   2.51   1.42   1.86
3    Diethyl ether                  1.68   2.51   1.52   1.68
4    1-Butanol                      1.78   2.51   1.47   1.68
5    Acetone                        1.9    1.93   1.59   1.76
6    Dichloromethane                1.96   2.42   2.93   2.52
7    1,2-Dichloroethane             2.43   2.42   2.56   2.87
8    Trichloromethane               2.55   3.19   2.27   2.74
9    Cyclohexene                    2.95   3.14   2.92   2.60
10   Toluene                        3.08   3.37   3.61   3.42
11   Cyclohexane                    3.16   3.37   3.13   3.20
12   Hexane                         3.27   3.01   3.65   3.05
13   Aniline                        3.28   3.19   3.67   3.26
14   Benzene                        3.34   4.00   3.82   4.13
15   p-Chloroaniline                3.57   3.37   3.42   3.68
16   Phenol                         3.64   4.00   3.74   4.16
17   m-Xylene                       3.65   3.46   3.41   3.62
18   2,2′-Biphenyldiol              3.65   4.00   3.93   3.92
19   p-Xylene                       3.68   3.68   4.08   3.82
20   p-Nitroaniline                 3.7    3.86   3.80   4.02
21   1,1,2,2-Tetrachloroethane      3.7    3.50   3.61   3.71
22   m-Nitrotoluene                 3.74   3.50   4.06   3.69
23   o-Cresol                       3.75   4.04   4.31   4.12
24   m-Nitroaniline                 3.77   4.00   3.82   3.89
25   Dibromobenzene                 3.78   4.09   3.86   3.97
26   Nitrobenzene                   3.82   4.04   4.48   4.18
27   Chlorobenzene                  3.86   3.41   3.55   4.04
28   p-Chlorotoluene                3.88   4.40   3.52   3.74
29   p-Nitrotoluene                 3.9    4.04   4.32   4.07
30   o-Nitrotoluene                 3.91   3.68   4.12   3.91
31   p-Bromoaniline                 3.92   4.40   3.80   4.19
32   p-Chloronitrobenzene           3.94   4.54   3.78   4.51
33   Tetrachloroethylene            3.94   4.36   4.51   4.52
34   o-Chloro-p-nitroaniline        3.99   4.04   4.67   4.22
35   m-Chlorobenzaldehyde           4      4.40   3.44   4.07
36   m-Chloronitrobenzene           4.05   4.54   4.50   4.62
37   p-Nitrophenol                  4.05   4.04   4.48   4.19
38   2,4-Dinitroaniline             4.09   4.54   4.91   4.75
39   3,4-Dichloronitrobenzene       4.12   4.04   4.59   4.48
40   o-Chlorophenol                 4.14   4.49   4.59   4.95
41   2,4-Dinitroaniline             4.16   3.68   4.16   3.89
42   Chlorophenylacetylene          4.2    4.49   4.16   4.47
43   3,4-Dinitroaniline             4.2    4.58   5.00   4.87
44   3,4-Dichlorophenylacetylene    4.22   4.22   4.71   4.55
45   1,3-Dichlorobenzene            4.24   4.94   4.07   4.76
46   1,2-Dichlorobenzene            4.38   4.58   4.75   4.64
47   2,5-Dichlorotoluene            4.38   4.45   4.67   4.81
48   1,4-Dichlorobenzene            4.39   4.58   5.23   4.82
49   2,6-Dinitrotoluene             4.44   5.08   5.04   5.00
50   2,4-Dichlorophenol             4.45   4.99   4.42   5.03
51   p-Chlorophenol                 4.48   6.38   4.90   4.96
52   2,4-Dinitrotoluene             4.49   4.54   4.92   4.80
53   1,2,4-Trichlorobenzene         4.5    4.45   4.64   4.51
54   p-Bromochlorobenzene           4.5    5.53   4.80   5.19
55   p-Chlorobenzaldehyde           4.51   4.36   4.87   5.51
56   2,4,6-Trinitroaniline          4.51   4.54   4.75   4.48
57   1,2,3-Trichlorobenzene         4.53   5.12   5.17   5.25
58   1,4-Dibromobenzene             4.54   4.54   4.50   5.46
59   3,4-Dichlorobenzaldehyde       4.68   5.80   5.84   5.49
60   p-Bromonitrobenzene            4.68   4.00   5.75   5.10
61   2,4,5-Tribromotoluene          4.86   6.16   6.28   6.14

Test set (15 compounds)
62   Naphthalene                    4.86   2.19   1.13   1.87
63   Benzidine                      4.88   3.72   3.88   3.63
64   1-Octanol                      4.9    3.53   3.28   3.67
65   m-Dinitrobromobenzene          4.93   3.45   2.71   3.62
66   1,3-Dibromobenzene             4.99   3.93   4.10   3.82
67   α-Chlorobenzene                4.99   4.19   4.01   3.97
68   2,4-Dinitrobromobenzene        5      4.23   4.88   4.38
69   4-Chlorobenzyl chloride        5.01   3.69   4.06   3.78
70   p-Dichlorotoluene              5.5    4.23   4.62   4.31
71   1,2,4,5-Tetrachlorobenzene     5.51   4.34   4.85   4.67
72   Hexachloroethane               5.52   4.20   4.85   4.29
73   Pentachlorophenol              5.69   4.54   4.29   4.20
74   p-Dinitrobromobenzene          5.84   4.50   5.08   4.48
75   o-Dinitrobromobenzene          6.03   4.74   4.91   4.70
76   Hexachlorobenzene              6.32   4.39   4.84   6.28

a Experimental log(1/EC50) obtained from the literature.
b Calculated log(1/EC50) obtained from the MLR, RBFNN, and SVM models.

optimized using the Polak-Ribiere algorithm until the root-mean-square gradient was 0.001. The resulting geometries were then transferred into the CODESSA software, developed by the Katritzky group (Katritzky et al., 1994, 1995), which can calculate constitutional, topological, geometrical, electrostatic, and quantum chemical descriptors. Constitutional descriptors are related to the number of atoms and bonds in each molecule. Topological descriptors include valence and non-valence molecular connectivity indices calculated from the hydrogen-suppressed formula of the molecule, encoding information about the size, composition, and degree of branching of a molecule; they describe the atomic connectivity in the molecule. The geometrical descriptors describe the size of the molecule and require 3D coordinates of the atoms in the given molecule. The electrostatic descriptors reflect characteristics of the charge distribution of the molecule. The quantum chemical descriptors offer information about binding and formation energies, partial atomic charges, dipole moment, and molecular orbital energy levels.

2.3. Descriptor selection

Whether a QSAR model is successful or not is largely determined by the selection of variables, and

their ability to represent the essential determinants of the molecular interaction. Information-rich descriptors capable of yielding mechanistic insight are preferred. In this work, a heuristic method (Xue et al., 2004; Katritzky et al., 1994, 1995) was employed to select the optimal subset from the originally calculated descriptors. The procedure provides collinearity control (i.e., any two descriptors intercorrelated above 0.8 are never involved in the same model) and implements heuristic algorithms for the rapid selection of the best correlation, without testing all possible combinations of the available descriptors. In addition, to avoid the dominance of a few large descriptors on the diversity score, scaling becomes necessary when the chemical descriptors use different units and vary significantly in magnitude. It is customary to standardize data entries by the minimum–maximum transformation

xstd = (x − xmin) / (xmax − xmin)

where x is the original datum, xstd the standardized datum, and xmax and xmin are the maximum and minimum values of the descriptor, respectively. The heuristic method of descriptor selection then proceeds with a pre-selection of descriptors by eliminating: (i) those descriptors that are not available for each structure, (ii) descriptors hav-


Table 2
Experimental and calculated log(RBA) by the MLR, RBFNN, and SVM methods

No.  Compound                                Exp. log(RBA)a  MLRb    RBFNNb  SVMb

Training set (118 compounds)
1    5α-Androstan                            −3.32   −1.92    −2.56   −2.86
2    5,6-Didehydroisoandrosterone            −1.98   −0.072   0.47    −0.62
3    5α-Androstane-3,11,17-trione            −1.64   −0.44    −1.33   −0.80
4    Epitestosterone                         −1.00   −0.056   0.50    0.46
5    3α-Androstanediol                       −0.81   −0.59    −0.46   −0.74
6    T propionate                            −0.79   0.0069   −0.45   −0.11
7    Androstenediol                          −0.66   −0.75    −0.40   −0.98
8    4-Androstenedione                       −0.62   −0.44    −0.97   −0.70
9    4-Androstenediol                        −0.31   −0.75    −0.41   −0.98
10   Etiocholan-17β-ol-3-one                 −0.1    −0.0085  0.49    0.70
11   DHT benzoate                            0.07    −0.38    −0.56   −1.24
12   3β-Androstanediol                       0.36    −0.59    −0.46   −0.84
13   11-Keto-testosterone                    0.54    −0.11    0.25    0.097
14   T                                       1.28    −0.056   0.50    0.31
15   5α-Androstan-17β-ol                     1.45    0.025    0.72    0.64
16   Methyltrienolone (R 1881)               2.00    −0.11    0.41    0.20
17   DHT                                     2.14    −0.0056  0.50    0.71
18   17α-Estradiol                           −2.4    −1.72    −1.77   −1.61
19   3-Methylestriol                         −2.25   −1.68    −1.79   −1.57
20   16β-OH-16α-Me-3-Me-estradiol            −2.08   −1.68    −1.79   −1.57
21   2-OH-estradiol                          −1.44   −1.93    −1.91   −1.84
22   Ethynylestradiol (EE)                   −1.42   −0.95    −1.36   −1.29
23   4-OH-estradiol                          −0.91   −1.96    −1.91   −1.85
24   Estradiol (E2)                          −0.12   −1.72    −1.77   −1.34
25   Cortisol                                −2.77   −1.96    −1.91   −1.77
26   Dexamethasone                           −2.42   −0.84    −2.38   −1.49
27   Corticosterone                          −1.87   −0.66    −1.41   −1.05
28   Norethynodrel                           −0.7    −0.071   0.43    0.42
29   Progesterone                            −0.7    −0.16    −0.55   −0.39
30   Promegestone                            −0.64   −0.18    −0.62   −0.50
31   6α-Me-17β-OH-progesterone               −0.41   0.23     −0.39   0.58
32   Spironolactone                          −0.35   0.14     −0.41   −0.91
33   Cyproterone acetate                     −0.32   0.22     −0.80   −0.91
34   Norethindrone                           0.41    −0.076   0.42    0.21
35   Norgestrel                              1.22    −0.027   0.39    0.39
36   trans-4-Hydroxystilbene                 −2.13   −1.93    −2.13   −2.52
37   3,3′-Dihydroxyhexestrol                 −2.08   −2.71    −1.94   −2.59
38   3,4-Diphenyltetrahydrofuran             −1.98   −2.30    −2.30   −2.23
39   Dimethylstilbestrol                     −1.66   −2.42    −2.14   −2.14
40   Hexestrol monomethyl ether              −1.63   −2.02    −1.91   −1.86
41   Clomiphene                              −1.64   −1.53    −1.80   −1.64
42   Nafoxidine                              −1.63   −0.93    −1.20   −1.32
43   Tamoxifen                               −1.59   −1.20    −1.56   −1.44
44   4′-Hydroxy-tamoxifen                    −1.49   −1.60    −1.92   −1.80
45   6-Hydroxyflavone                        −2.77   −2.05    −2.65   −2.48
46   4′-Hydroxyflavanone                     −2.48   −2.07    −2.70   −2.49
47   Genistein                               −2.44   −2.98    −2.14   −2.53
48   Flavone                                 −2.4    −2.36    −2.24   −2.12
49   Equol                                   −2.39   −2.56    −2.10   −2.23
50   4′-Hydroxychalcone                      −2.27   −1.94    −2.29   −2.41
51   Flavanone                               −2.25   −2.22    −2.35   −2.09
52   4-Hydroxychalcone                       −2.19   −1.95    −2.31   −2.23
53   Zearalanone                             −2.14   −1.96    −1.83   −2.29
54   β-Zearalenol                            −2.09   −1.65    −1.98   −1.93
55   6-Hydroxyflavanone                      −1.78   −2.42    −2.20   −2.12
56   Zearalenol                              −1.64   −1.45    −2.09   −1.73
57   Propyl parabene                         −3.00   −2.28    −2.96   −2.60
58   4-Benzyloxylphenol                      −2.89   −2.62    −2.62   −2.36
59   Isoeugenol                              −2.81   −2.67    −2.81   −2.65
60   4-tert-Butylphenol                      −2.67   −1.89    −2.59   −2.56
61   2-sec-Butylphenol                       −2.52   −2.09    −2.68   −2.69
62   4-sec-Butylphenol                       −2.44   −2.06    −2.67   −2.69
63   4-tert-Amylphenol                       −2.39   −1.86    −2.44   −2.37
64   4-n-Octylphenol                         −1.8    −1.66    −1.65   −2.03
65   Igepal CO-210                           −1.78   −1.88    −1.46   −1.65
66   4-Heptyloxyphenol                       −1.69   −2.40    −2.06   −1.38
67   Nonylphenol                             −1.57   −1.59    −1.38   −1.92
68   4′-Chloroacetoacetanilide               −3.46   −2.46    −2.65   −2.80
69   Procymidone                             −2.61   −1.92    −2.09   −2.79
70   Vinclozolin                             −2.5    −1.87    −2.18   −2.01
71   Flutamide                               −2.42   −2.16    −2.08   −1.97
72   Linuron                                 −2.25   −2.42    −2.82   −2.35
73   Propanil (DCPA)                         −2.22   −2.50    −2.94   −2.53
74   2,2′,4,4′-Tetrachlorobiphenyl           −1.74   −2.18    −1.81   −2.04
75   2,3,4,5-Tetrachloro-4′-biphenylol       −1.73   −1.90    −1.77   −2.43
76   2,4′-Dichlorobiphenyl                   −1.72   −2.36    −1.85   −2.33
77   2,4,5-T                                 −3.18   −2.74    −2.63   −3.02
78   Lindane (γ-HCH)                         −2.12   −1.85    −1.83   −2.26
79   Aldrin                                  −2.02   −1.86    −1.79   −1.66
80   Endosulfan (technical grade)            −1.64   −1.69    −2.18   −1.92
81   Heptachlor                              −1.64   −1.47    −1.86   −1.77
82   Chlordane                               −1.51   −1.43    −1.80   −1.83
83   4-Hydroxybenzophenone                   −2.78   −2.16    −2.33   −2.63
84   4,4′-Dihydroxybenzophenone              −2.67   −2.88    −2.11   −2.40
85   Benzophenone                            −2.63   −2.29    −2.53   −2.51
86   Bisphenol A                             −2.39   −2.48    −2.09   −2.17
87   p-Cumyl phenol                          −2.11   −1.87    −2.18   −2.02
88   Bisphenol A                             −2.09   −2.44    −2.05   −2.14
89   o,p′-DDE                                −1.81   −2.16    −1.92   −1.93
90   p,p′-DDT                                −1.76   −1.99    −1.88   −1.75
91   p,p′-DDD                                −1.7    −2.11    −1.88   −1.90
92   o,p′-DDT                                −1.69   −2.01    −1.89   −1.77
93   o,p′-DDD                                −1.52   −2.14    −1.89   −1.54
94   p,p′-Methoxychlor olefin                −2.20   −2.49    −2.09   −2.12
95   p,p′-Methoxychlor                       −1.94   −2.34    −2.01   −2.01
96   Monohydroxymethoxychlor olefin          −1.84   −2.44    −2.15   −2.12
97   HPTE                                    −1.47   −2.50    −2.08   −1.58
98   Dihydroxymethoxychlor olefin            −1.31   −2.70    −2.04   −1.58
99   Diisononylphthalate                     −3.56   −2.86    −3.55   −3.37
100  di-i-Butyl phthalate (DIBP)             −3.22   −2.89    −1.99   −2.50
101  Butylbenzyl phthalate                   −2.07   −2.65    −2.10   −2.36
102  di-n-Butyl phthalate (DBuP)             −1.95   −3.04    −1.94   −2.59
103  Triphenylethylene                       −1.98   −1.97    −1.73   −1.78
104  Diisobutyl adipate                      −2.84   −2.68    −2.96   −2.62
105  Dibutyl adipate                         −2.73   −2.85    −2.86   −2.69
106  4-Amino butylbenzoate                   −2.85   −2.41    −2.95   −2.47
107  4-Heptyloxybenzoic acid                 −2.74   −2.38    −1.97   −2.38
108  1-Methoxy-4-[1-propenyl]benzene         −3.19   −2.82    −3.12   −2.79
109  Carbaryl                                −3.12   −2.26    −2.90   −2.45
110  4-(3,5-Diphenylcyclohexyl)phenol        −2.27   −1.11    −2.32   −1.55
111  2-Benzyl-isoindole-1,3-dione            −3.12   −2.28    −2.24   −3.04
112  2-(4-OH-Benzyl)isoindole-1,3-dione      −2.76   −2.12    −2.92   −2.52
113  2-(4-Nitro-Benzyl)isoindole-1,3-dione   −2.46   −2.24    −2.03   −1.93
114  Methylparathion                         −2.26   −2.62    −2.56   −2.59
115  Ethylparathion                          −2.05   −2.64    −2.35   −2.57
116  1,3-Diphenyltetramethyldisiloxane       −3.13   −2.60    −2.56   −2.83
117  Triphenylsilanol                        −2.05   −2.19    −2.27   −2.30
118  Aurin                                   −1.7    −2.25    −2.09   −2.03

Test set (28 compounds)
1    Androsterone                            −2.12   −0.029   0.46    −0.81
2    5α-Androstan-3β-ol                      −0.74   0.081    0.69    0.068
3    Methyltestosterone                      1.28    0.061    0.30    0.62
4    Trenbolone                              2.05    −0.28    0.52    0.86
5    Mibolerone                              2.27    0.11     0.25    1.45
6    Estriol (E3)                            −3.15   −1.94    −1.95   −2.85
7    17-Deoxyestradiol                       −2.13   −1.94    −1.95   −2.85
8    3-Deoxyestradiol                        0.54    −0.98    −1.37   0.30
9    6α-Me-17α-OH-Progesterone               0.94    0.24     −0.53   0.26
10   4,4′-Dihydroxystilbene                  −2.44   −2.71    −2.25   −2.33
11   DES                                     −1.66   −2.26    −2.01   −2.02
12   Chalcone                                −2.32   −2.08    −2.32   −2.29
13   β-Zearalanol                            −1.72   −1.72    −1.79   −1.92
14   3-Chlorophenol                          −3.17   −2.47    −2.66   −3.22
15   4-Chloro-2-methyl phenol                −2.59   −2.30    −2.86   −3.00
16   4-Dodecylphenol                         −1.81   −1.23    −0.73   −1.35
17   Metolachlor                             −2.61   −1.68    −1.74   −1.71
18   Fenpicionil                             −1.61   −2.49    −2.31   −2.30
19   3,3′,5,5′-Tetrachloro-4,4′-biphenyldiol −2.1    −2.59    −2.07   −2.20
20   4-Hydroxybiphenyl                       −1.43   −2.17    −2.33   −1.74
21   Kepone                                  −1.58   −1.04    −1.18   −1.56
22   2,4-Dihydroxybenzophenone               −2.53   −2.87    −2.21   −2.44
23   p,p′-DDE                                −1.7    −2.13    −1.90   −1.91
24   Diethylphthalate                        −3.44   −3.03    −2.38   −2.59
25   Bis(n-octyl)phthalate                   −3.28   −3.04    −2.99   −2.84
26   Nordihydroguaiaretic acid               −2.28   −2.61    −2.05   −2.41
27   Triphenyl phosphate                     −1.69   −1.92    −1.99   −1.79
28   4,4′-Sulfonyldiphenol                   −3.09   −2.69    −2.08   −2.26

a Experimental log(RBA) obtained from the literature.
b Calculated log(RBA) obtained from the MLR, RBFNN, and SVM models.

ing a small variation in magnitude for all structures, (iii) descriptors that give an F-test value below 1.0 in the one-parameter correlation, and (iv) descriptors whose t-values are less than the user-specified value. This procedure orders the descriptors by decreasing correlation coefficient when used in one-parameter correlations. The next step involves correlation of the given property with (i) the top descriptor in the above list paired with each of the remaining descriptors, (ii) the next descriptor paired with each of the remaining ones, etc. The best pairs, as evidenced by the highest F-values in the two-parameter correlations, are chosen and used for further inclusion of descriptors in a similar manner.
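The min–max scaling and the 0.8 collinearity cutoff described above can be sketched in a few lines. This is an illustrative reimplementation, not the CODESSA/heuristic-method code; the function names are ours:

```python
import numpy as np

def min_max_scale(X):
    """Min-max transformation: scale each descriptor column to [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def collinearity_filter(X, threshold=0.8):
    """Greedily keep descriptors whose pairwise |r| with every kept one is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept
```

The greedy pass keeps the first descriptor of any intercorrelated pair, which mirrors the rule that two descriptors with intercorrelation above 0.8 never enter the same model.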

The heuristic method usually produces correlations two to five times faster than other methods, with comparable quality. This rapidity makes it the first method of choice in practical research. Thus, we used this method to build the linear model.

2.4. Diversity analysis

Two fundamental research themes in chemical database analysis are similarity and diversity sampling (Maldonado et al., in press). The diversity problem involves defining a diverse subset of "representative" compounds so that researchers can scan only a subset


of the huge database each time. In this study, diversity analysis was performed for the two data sets to make sure that the structures of the training cases can represent those of the whole sets. We consider a database of n compounds generated from m highly correlated chemical descriptors {xj}, j = 1, ..., m. Each compound is represented as a vector

Xi = (xi1, xi2, ..., xim)T,  for i = 1, 2, ..., n,

where xij denotes the value of descriptor j for compound Xi, and the superscript T denotes the vector/matrix transpose. The collective database X = {Xi}, i = 1, ..., n, is represented by the n × m matrix

X = (X1, X2, ..., Xn)T,

whose element in row i and column j is xij. A distance score for two different compounds Xi and Xj can be measured by the Euclidean distance norm based on the compound descriptors:

||Xi − Xj|| = sqrt( Σk=1..m (xik − xjk)² )
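The distance score above is straightforward to compute for all compound pairs at once; a small numpy sketch (variable names are illustrative):

```python
import numpy as np

def distance_matrix(X):
    """Pairwise Euclidean distances ||Xi - Xj|| between the rows (compounds) of X."""
    diff = X[:, None, :] - X[None, :, :]      # shape (n, n, m)
    return np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)
```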

With the descriptors scaled as above, the distances are all comprised within the interval [0, 1]; the closer the distance is to one, the more diverse the two compounds are. For the two data sets, Fig. 1a and b illustrate the diversity of the molecules in the training sets. As shown, the structures of the compounds are diverse. The two training sets, with a broad representation of chemistry space, were adequate to ensure the models' predictive capability for a large number of diverse chemicals in future studies.

2.5. Model validation

Validation of the models was required to test the predictivity and generalization ability of the methods, as well as to enable comparison between them. In this study, each data set was divided into a training set for model development/calibration and a test set for external prediction. The test set was constructed so that its members were representative of the training set in terms of the ranges of experimental values. Initially, all the predictive models underwent a leave-one-out (LOO) procedure. Every compound in the training set was omitted once, and the

Fig. 1. Diversity analysis for the cases in the training sets of data set 1 and data set 2. (a) Diversity analysis for the cases in the training sets of data set 1. (b) Diversity analysis for the cases in the training sets of data set 2.

model developed from all the remaining chemicals in the training set predicted its activity. The q2 and RMS error of each model, chosen as the criteria to obtain the best model, are defined as follows:

q2 = 1 − Σk=1..ns (yk − ŷk)² / Σk=1..ns (yk − ȳ)²

RMS = sqrt( Σk=1..ns (yk − ŷk)² / ns )

where yk is the desired (experimental) value, ŷk the value calculated by the model, ȳ the mean of the experimental values, and ns the number of compounds in the analyzed set.
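Both criteria follow directly from the definitions above; a minimal sketch (here y_pred stands for the LOO predictions, which this snippet does not generate):

```python
import numpy as np

def q2_score(y, y_pred):
    """Cross-validated q2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return 1.0 - ((y - y_pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def rms_error(y, y_pred):
    """RMS = sqrt(sum((y - y_hat)^2) / ns) over the ns compounds in the set."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return np.sqrt(((y - y_pred) ** 2).mean())
```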


2.6. Theory of the multiple linear regression (MLR) method

Multiple linear regression (Kapur et al., 2001) is used to develop a linear model of the property of interest, which takes the form

Y = b0 + b1X1 + b2X2 + · · · + bnXn

where Y is the property (the dependent variable), X1 to Xn are the specific descriptors, b1 to bn the coefficients of those descriptors, and b0 the intercept of the equation.

2.7. Theory of the RBFNN method

The RBFNN (Wan and Harrington, 1999) is a type of neural network used to solve several kinds of modeling and classification problems; it is a nonlinear method and often yields results superior to those of standard statistical models. The radial basis function most often used is a Gaussian function, characterized by a center and a width. In this paper, the training procedure involves selecting the centers, the widths, and the weights of the RBFs. A forward subset selection routine was used to select the centers from the training set samples. After the selection of the centers and widths of the radial basis functions, the connection weights between the hidden layer and the output layer were adjusted using a least-squares solution. However, such neural systems have some problems inherent in their architecture, related to overtraining, overfitting, and network optimization.

2.8. Theory of the SVM method

SVM algorithms (Vapnik, 1982; Burges, 1998) were developed mainly from the work of Vapnik and Burges. The main advantage of the SVM is that it adopts the structural risk minimization (SRM) principle, which has been shown to be superior to the traditional empirical risk minimization (ERM) principle employed by conventional neural networks. SRM minimizes an upper bound on the generalization error based on the Vapnik–Chervonenkis (VC) dimension, as opposed to ERM, which minimizes the training error. For the case of regression approximation, suppose we are given a set of data points G = {(xi, di)}, i = 1, ..., n (xi is the input vector, di the desired value, and n the total number of data patterns), drawn independently and identically from an unknown function. SVMs approximate the function with three distinct characteristics: (i) SVMs estimate the regression in a set of linear functions, (ii) SVMs define the regression estimation as the problem of risk minimization with respect to the ε-insensitive loss


function, and (iii) SVMs minimize the risk based on the SRM principle, whereby elements of the structure are defined by the inequality ||w||² ≤ constant. The linear function is formulated in the high-dimensional feature space, with the form of function (1):

y = f(x) = w · φ(x) + b    (1)

where φ(x) is the high-dimensional feature space, onto which the input space x is nonlinearly mapped. Characteristics (ii) and (iii) are reflected in the minimization of the regularized risk function (2), by which the coefficients w and b are estimated. The goal of this risk function is to find a function that deviates by at most ε from the actual values at all the training data points while being as flat as possible:

R_SVMs(C) = C (1/n) Σi=1..n Lε(di, yi) + (1/2) ||w||²    (2)

Lε(d, y) = |d − y| − ε  if |d − y| ≥ ε;  0 otherwise    (3)

The first term, C (1/n) Σi=1..n Lε(di, yi), is the so-called empirical error (risk), measured by the ε-insensitive loss function (3). This loss function provides the advantage of using sparse data points to represent the designed function (1). The second term, (1/2)||w||², is called the regularization term. ε is called the tube size of the SVM, and C is the regularization constant determining the trade-off between the empirical error and the regularization term. Introducing the positive slack variables ξ, ξ* leads to the following constrained formulation, Eq. (4):

minimize R_SVMs(w, ξ(*)) = (1/2) ||w||² + C Σi=1..n (ξi + ξi*)    (4)

subject to di − w · φ(xi) − b ≤ ε + ξi, w · φ(xi) + b − di ≤ ε + ξi*, and ξi, ξi* ≥ 0, for i = 1, ..., n. Finally, by introducing Lagrange multipliers and exploiting the optimality constraints, the decision function (5) takes the form

f(x, ai, ai*) = Σi=1..n (ai − ai*) K(x, xi) + b    (5)

where ai and ai* are the introduced Lagrange multipliers. By the Karush–Kuhn–Tucker (KKT) conditions, only some of the coefficients ai and ai* are nonzero, and the data points associated with them are referred to as support vectors. In this equation, K is the kernel function; common choices include linear, polynomial, spline, and radial basis function kernels. In support


Table 3
List of the descriptors used in data set 1 and data set 2 and their physical–chemical meanings

Descriptor  Chemical meaning                                    S.E.a   t-Testb

Data set 1
DIP         Dipole moment                                       0.505   −2.366
ELUMO       Energy of the lowest unoccupied molecular orbital   0.321   −3.771
WIE         Wiener index                                        0.558   0.270
SAA         Surface area (approximate)                          0.628   −0.528
V           Volume of the molecules                             0.631   1.767
log P       Octanol/water partition coefficient                 0.896   1.361
POL         Polarizability                                      0.732   1.284

Data set 2
HDS         Count of H-donor sites                              0.01    5.49
TEI         Topographic electronic index (all bonds)            0.22    −4.16
RI2         Randic index (order 2)                              0.044   3.86
PP          Polarity parameter/square distance                  0.23    2.53
BOCmin      Min (>0.1) bond order of a C atom                   0.61    2.25

a Standard error of each descriptor.
b The t-test tests the difference between a sample mean and a known or hypothesized value.

vector regression, the Gaussian radial basis function (6) is commonly used:

K(xi, xj) = exp(−γ ||xi − xj||²)    (6)
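Eqs. (5) and (6) translate directly into code. The sketch below assumes the Lagrange multiplier differences (ai − ai*) and the bias b have already been obtained by solving the quadratic program, which is omitted here; the names are illustrative, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma):
    """Gaussian radial basis function, Eq. (6): exp(-gamma * ||xi - xj||^2)."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def svr_predict(x, support_vectors, alpha_diff, b, gamma):
    """Decision function, Eq. (5): f(x) = sum_i (ai - ai*) * K(x, xi) + b.
    Only the support vectors (nonzero ai - ai*) contribute to the sum."""
    return sum(a * rbf_kernel(x, sv, gamma)
               for a, sv in zip(alpha_diff, support_vectors)) + b
```

Because the kernel equals 1 when x coincides with a support vector and decays with distance, predictions are local weighted combinations of the support vectors.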

3. Results and discussion

As described above, descriptors were calculated for the compounds in the two data sets, and the HM method was applied to each data set to select the proper descriptors for constructing the predictive models. The physical–chemical meanings of the selected descriptors are listed in Table 3.

3.1. Result of MLR model

Based on the selected descriptors, two linear functions were constructed as follows for the two data sets. The predicted values by the MLR method are given in Table 1 (data set 1) and Table 2 (data set 2). The statistical values of q², RMS, and correlation coefficient R for the two data sets are shown in Table 4. Figs. 2a and 3a show the predicted versus experimental values by the MLR method for data sets 1 and 2.

Data set 1:

log(1/EC50) = 3.145 − 0.759 × DIP − 2.104 × ELUMO + 0.169 × WIE − 0.333 × SAA + 1.584 × V + 0.997 × log P + 1.632 × POL,  N = 61, F = 76.660

Data set 2:

log(RBA) = −3.023 + 1.899 × HDS − 1.736 × TEI + 1.706 × RI2 + 0.495 × PP + 0.383 × BOCmin,  N = 118, F = 42.00
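Linear models of the form above are fitted by ordinary least squares. A minimal sketch with NumPy follows; the data are synthetic stand-ins, not the paper's descriptor matrices, and the coefficient values are illustrative only.

```python
import numpy as np

# Fit a multiple linear regression y = b0 + b1*x1 + b2*x2 + b3*x3 by least
# squares, the same model form as the equations above (synthetic data).
rng = np.random.default_rng(1)
X = rng.normal(size=(61, 3))                  # N = 61 "compounds", 3 descriptors
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.05 * rng.normal(size=61)

A = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # [b0, b1, b2, b3]
pred = A @ coef
R = np.corrcoef(pred, y)[0, 1]                # correlation coefficient R
print(np.round(coef, 2), round(R, 3))
```

The F statistic quoted with each equation tests whether the regression as a whole is significant; it can be derived from the same residuals.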

Table 4
Statistical results of different QSAR models for the two data sets

Method    q²      RMS                         R
                  Training  Test    Whole     Training  Test    Whole

Data set 1
MLR       0.77    0.50      0.75    0.71      0.87      0.81    0.85
RBFNN     0.81    0.45      0.51    0.46      0.91      0.80    0.88
SVM       0.92    0.29      0.35    0.30      0.96      0.93    0.95

Data set 2
MLR       0.56    0.70      0.96    0.76      0.77      0.80    0.78
RBFNN     0.72    0.58      1.03    0.69      0.86      0.75    0.82
SVM       0.76    0.54      0.59    0.55      0.88      0.93    0.90
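The statistics reported in Table 4 can be computed as sketched below. The q² formula used here (1 − PRESS/SS_tot from leave-one-out predictions) is the common definition and is an assumption about the authors' exact formula; the toy values are illustrative.

```python
import numpy as np

def rms(y, pred):
    """Root-mean-square error between experimental and calculated values."""
    y, pred = np.asarray(y, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def corr_r(y, pred):
    """Correlation coefficient R between experimental and calculated values."""
    return float(np.corrcoef(np.asarray(y, float), np.asarray(pred, float))[0, 1])

def q2_loo(y, loo_pred):
    """Cross-validated q^2 = 1 - PRESS / SS_tot, computed from leave-one-out
    predictions (assumed definition)."""
    y, loo_pred = np.asarray(y, float), np.asarray(loo_pred, float)
    press = np.sum((y - loo_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - press / ss_tot)

# Toy experimental values and (hypothetical) LOO predictions
y = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
print(rms(y, pred), round(corr_r(y, pred), 3), round(q2_loo(y, pred), 3))
```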


Fig. 2. Calculated vs. experimental values by MLR, RBFNN, and SVM methods for data set 1. (a) Calculated vs. experimental values by MLR methods for data set 1. (b) Calculated vs. experimental values by RBFNN methods for data set 1. (c) Calculated vs. experimental values by SVM methods for data set 1.

3.2. Result of RBFNN model

For the designed RBF network used in this report, the parameter to be adjusted is the width (r) of the network. To find the optimal one, the RMS error on LOO cross-validation was recorded as a function of the width, and the width giving the minimum error was chosen. The optimal widths were r = 0.93 and r = 0.98 for data sets 1 and 2, respectively. Based on these designed RBF networks, the final results were obtained and listed in Tables 1 and 2. The plots of predicted versus experimental values for each data set are shown in Figs. 2b and 3b.

3.3. Result of SVM model

In this work, the training of the SVM model included the selection of the capacity parameter C, the ε of the ε-insensitive loss function, and the corresponding parameters of the kernel function. Firstly, the kernel function should be decided, as it determines the sample distribution in the mapping space. The radial basis function (RBF) is commonly used in many studies because of its good general

performance and the few parameters to be adjusted (Bishop, 1997). In this work, the radial basis function was used, the form of which in R is as follows: exp(−γ * |u − v|²), where γ is a parameter of the kernel and u and v are the two independent variables. Secondly, the kernel parameter γ greatly affects the number of support vectors, which has a close relation with the performance of the SVM and the training time. Too many support vectors can produce overfitting and increase the training time. In addition, γ controls the amplitude of the RBF function and, therefore, the generalization ability of the SVM. The plot of γ versus RMS error on the LOO cross-validation for each data set is shown in Figs. 4a and 5a. As can be seen from the figures, the optimal γ values for the two data sets were 0.008 and 0.009, respectively.

The ε-insensitive parameter prevents the entire training set from meeting the boundary conditions and so allows for the possibility of sparsity in the dual formulation's solu-


Fig. 3. Calculated vs. experimental values by three methods for data set 2. (a) Calculated vs. experimental values by MLR methods for data set 2. (b) Calculated vs. experimental values by RBFNN methods for data set 2. (c) Calculated vs. experimental values by SVM methods for data set 2.

Fig. 4. Selection of gamma, eps, and cost for data set 1. (a) Gamma vs. RMS error on LOO cross-validation for data set 1. (b) Eps vs. RMS error on LOO cross-validation for data set 1. (c) Cost vs. RMS error on LOO cross-validation for data set 1.


Fig. 5. Selection of gamma, eps, and cost for data set 2. (a) Gamma vs. RMS error on LOO cross-validation for data set 2. (b) Eps vs. RMS error on LOO cross-validation for data set 2. (c) Cost vs. RMS error on LOO cross-validation for data set 2.

tion. The optimal value of ε depends on the type of noise present in the data, which is usually unknown. Similar to the selection procedure for the optimal γ, the RMS error of LOO cross-validation for each data set at different ε values is recorded in Figs. 4b and 5b. The optimal values were found to be 0.202 and 0.180, respectively.

Lastly, the effect of the capacity parameter C was tested. It controls the trade-off between maximizing the margin and minimizing the training error. If C is too small, insufficient stress will be placed on fitting the training data; if C is too large, the algorithm will overfit the training data. However, the reference (Wang et al., 2003) indicated that the prediction error is scarcely influenced by C. To make the learning process stable, a large value should be set for C initially (e.g., C = 100). The plots of RMS error versus C for the two data sets are shown in Figs. 4c and 5c. The optimal value of C was 100 for both data sets.

Therefore, for each data set, the best choices of γ, ε, and C were obtained. The predictive results are listed in Tables 1 and 2. The plots of predicted versus

experimental values for each data set are recorded in Figs. 2c and 3c.

4. Conclusion

Three methods, MLR, RBFNN, and SVM, were used to construct linear and nonlinear quantitative relationships for two different data sets. Table 4 gives the comparison among the results obtained by the three methods based on q², RMS error, and the correlation coefficient R. As shown in the table, the SVM model gave the highest q² and correlation coefficient R, indicating that the SVM performed better than the MLR and RBFNN methods, especially on the test set and the whole data set, and showed the better generalization ability. The reason may be that the SVM method embodies the structural risk minimization principle, which minimizes an upper bound of the generalization error rather than the training error. This eventually leads to better generalization than neural networks, which implement the empirical risk minimization principle and may not converge to global solutions.


Acknowledgement

The authors thank the Association Franco-Chinoise pour la Recherche Scientifique & Technique (AFCRST) for supporting this study (Programme PRA SI 02-02).

References

Bao, L., Sun, Z.R., 2002. Identifying genes related to drug anticancer mechanisms using support vector machine. FEBS Lett. 521, 109–114.
Beger, R.D., Freeman, J.P., Lay, J.O., Wilkes, J.G., Miller, D.W., 2001. Use of 13C NMR spectrometric data to produce a predictive model of estrogen receptor binding activity. J. Chem. Inf. Comput. Sci. 41, 219–224.
Bishop, C., 1997. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Blanz, V., Schölkopf, B., Bülthoff, H., Burges, C., Vapnik, V., Vetter, T., 1996. Comparison of view-based object recognition algorithms using realistic 3D models. In: Malsburg, C.V.D., Seelen, W.V., Vorbrüggen, J.C., Sendhoff, B. (Eds.), Artificial Neural Networks—ICANN'96. Springer, Lect. Notes Comput. Sci. 1112, 251–256.
Burbidge, R., Trotter, M., Buxton, B., Holden, S., 2001. Drug design by machine learning: support vector machines for pharmaceutical data analysis. Comput. Chem. 26, 5–14.
Burges, C.J.C., 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2, 121–167.
Cai, Y.D., Liu, X.J., Xu, X.B., Chou, K.C., 2002. Prediction of protein structural classes by support vector machines. Comput. Chem. 26, 293–296.
Colborn, T., vom Saal, F.S., Soto, A.M., 1993. Developmental effects of endocrine-disrupting chemicals in wildlife and humans. Environ. Health Perspect. 101, 378–384.
Cumpson, P.J., 2001. Estimation of inelastic mean free paths for polymers and other organic materials: use of quantitative structure–property relationships. Surf. Interface Anal. 31, 23–34.
Cun, Y.L., Jackel, L., Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Müller, U., Säckinger, E., Simard, P.Y., Vapnik, V., 1995. Learning algorithms for classification: a comparison on handwritten digit recognition. In: Neural Networks: The Statistical Mechanics Perspective. World Scientific, 261–276.
DeWeese, A.D., Schultz, T.W., 2001. Structure–activity relationships for aquatic toxicity to Tetrahymena: halogen-substituted aliphatic esters. Environ. Toxicol. 16, 54–60.
EDSTAC, http://www.epa.gov/opptintr/opptendo/finalrpt.htm.
Eldred, D.V., Weikel, C.L., Jurs, P.C., Kaiser, K.L.E., 1999. Prediction of fathead minnow acute toxicity of organic compounds from molecular structure. Chem. Res. Toxicol. 12, 670–678.
Engelhardt McClelland, H., Jurs, P.C., 2000. Quantitative structure–property relationships for the prediction of vapor pressures of organic compounds from molecular structure. J. Chem. Inf. Comput. Sci. 40, 967–975.
Fang, H., Tong, W., Branham, W.S., Moland, C.L., Dial, S.L., Hong, H.X., Xie, Q., Perkins, R., Owens, W., Sheehan, D.M., 2003. Study of 202 natural, synthetic, and environmental chemicals for binding to the androgen receptor. Chem. Res. Toxicol. 16, 1338–1358.

Filov, V.A., Golubev, A.A., Liublina, E.I., Tokontsev, N.A., 1979. Quantitative Toxicology, first ed. John Wiley & Sons, New York, p. 462.
Hill, D.L., 1972. The Biochemistry and Physiology of Tetrahymena, first ed. Academic Press, New York, p. 230.
HyperChem 6.01, Hypercube, Inc., 2000.
ISIS Draw 2.3, MDL Information Systems, Inc., 1990.
Johnson, S.R., Jurs, P.C., 1999. Prediction of the clearing temperatures of a series of liquid crystals from molecular structure. Chem. Mater. 11, 1007–1023.
Kapur, G.S., Ecker, A., Meusinger, R., 2001. Establishing quantitative structure–property relationships (QSPR) of diesel samples by proton-NMR & multiple linear regression (MLR) analysis. Energy & Fuels 15, 943–948.
Katritzky, A.R., Lobanov, V.S., Karelson, M., 1994. CODESSA: Reference Manual. University of Florida, Gainesville.
Katritzky, A.R., Lobanov, V.S., Karelson, M., 1995. CODESSA: Training Manual. University of Florida, Gainesville.
Kauffman, G.W., Jurs, P.C., 2000. Prediction of inhibition of the sodium ion-proton antiporter by benzoylguanidine derivatives from molecular structure. J. Chem. Inf. Comput. Sci. 40, 753–761.
Kavlock, R.J., Daston, G.P., DeRosa, C., Fenner-Crisp, P., Gray, L.E., Kaattari, S., Lucier, G., Luster, M., Mac, M.J., Maczka, C., Miller, R., Moore, J., Rolland, R., Scott, G., Sheehan, D.M., Sinks, T., Tilson, H.A., 1996. Research needs for the risk assessment of health and environmental effects of endocrine disruptors: a report of the U.S. EPA-sponsored workshop. Environ. Health Perspect. 104, 715–740.
LeBlond, J.D., Applegate, B.M., Menn, F.M., Schultz, T.W., Sayler, G.S., 2000. Structure-toxicity assessment of metabolites of the aerobic bacterial transformation of substituted naphthalenes. Environ. Toxicol. Chem. 19, 1235–1244.
Maldonado, A.G., Doucet, J.P., Petitjean, M., Fan, B.T., 2005. Molecular similarity and diversity in chemoinformatics: from theory to application. Mol. Divers., in press.
Patankar, S.J., Jurs, P.C., 2000. Prediction of IC50 values for ACAT inhibitors from molecular structure. J. Chem. Inf. Comput. Sci. 40, 706–723.
Tong, W., Perkins, R., Xing, L., Welsh, W.J., Sheehan, D.M., 1997. QSAR models for binding of estrogenic compounds to estrogen receptor α and β subtypes. Endocrinology 138, 4022–4025.
Vapnik, V., 1982. Estimation of Dependencies Based on Empirical Data. Springer, Berlin.
Waller, C.L., Oprea, T.I., Chae, K., Park, H.K., Korach, K.S., Laws, S.C., Wiese, T.E., Kelce, W.R., Gray Jr., L.E., 1996. Ligand-based identification of environmental estrogens. Chem. Res. Toxicol. 9, 1240–1248.
Wan, C., Harrington, P.B., 1999. Self-configuring radial basis function neural networks for chemical pattern recognition. J. Chem. Inf. Comput. Sci. 39, 1049–1056.
Wang, L.S., Han, S.K., Kong, L.R., 1997. Molecular Structure-Quality and Activity. Chemical Industry Press, Beijing.
Wang, W.J., Xu, Z.B., Lu, W.Z., Zhang, X.Y., 2003. Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing 55, 643–663.
Wessel, M.D., Jurs, P.C., Tolan, J.W., Muskal, S.M., 1998. Prediction of human intestinal absorption of drug compounds from molecular structure. J. Chem. Inf. Comput. Sci. 38, 726–735.

Xue, C.X., Zhang, R.S., Liu, H.X., Yao, X.J., Liu, M.C., Hu, Z.D., Fan, B.T., 2004. QSAR models for the prediction of binding affinities to human serum albumin using the heuristic method and a support vector machine. J. Chem. Inf. Comput. Sci. 44, 1693–1700.


Zhao, C.Y., Zhang, R.S., Liu, H.X., Xue, C.X., Zhao, S.G., Zhou, X.F., Liu, M.C., Fan, B.T., 2004. Diagnosing anorexia based on partial least squares, back propagation neural network, and support vector machines. J. Chem. Inf. Comput. Sci. 44, 2040–2046.