Discriminative models using molecular descriptors for predicting increased serum ALT levels in repeated-dose toxicity studies of rats


Computational Toxicology xxx (2017) xxx–xxx

Contents lists available at ScienceDirect

Computational Toxicology journal homepage: www.elsevier.com/locate/comtox

Jun-ichi Takeshita a,⇑, Haruka Nakayama b, Yoko Kitsunai b, Misako Tanabe b, Hitomi Oki b, Takamitsu Sasaki b, Kouichi Yoshinari b

a Research Institute of Science for Safety and Sustainability, National Institute of Advanced Industrial Science and Technology, 16-1 Onogawa, Tsukuba, Ibaraki 305-8569, Japan
b Department of Molecular Toxicology, School of Pharmaceutical Science, University of Shizuoka, 52-1 Yada, Suruga-ku, Shizuoka 422-8526, Japan

Article info

Article history: Received 14 January 2017; Received in revised form 3 May 2017; Accepted 4 May 2017; Available online xxxx

Keywords: Hepatotoxicity; Molecular descriptors; Discriminative models; Feature selection; Imbalanced data set

Abstract

The demand for alternatives to animal experiment-based assessment is increasing. Alternatives for assessing repeated-dose toxicity, however, have yet to be developed. Our aim was to develop discriminative models for predicting an increase in serum ALT levels in rats, using molecular descriptors. In vivo data for rats in the training data sets were obtained using the Hazard Evaluation Support System Integrated Platform (HESS), and molecular descriptors were calculated using DRAGON 6. We developed the discriminative models based on logistic regression models; however, there were two statistical difficulties to be overcome: (i) the number of molecular descriptors was much greater than the number of compounds; (ii) the training data sets were imbalanced. In order to overcome these difficulties, the k-medoids method was employed in the case of the first difficulty, and the Synthetic Minority Over-sampling Technique (SMOTE) algorithm in the case of the second. One of the resulting models showed predictive capability, with sensitivity of 0.783, specificity of 0.745, and concordance of 0.750. Our results show that a statistical learning approach can create a discriminative model with high predictive capability using only information on the molecular descriptors of chemicals. © 2017 Elsevier B.V. All rights reserved.

⇑ Corresponding author. E-mail address: [email protected] (J.-i. Takeshita).

Introduction

Today, chemical hazard evaluations are typically based on the results of animal experiments. However, considerations of time, cost efficiency, and animal welfare have led to increasing worldwide demand for alternatives to such experiment-based evaluations. For example, the European Union has banned the sale of cosmetic products whose development has involved animal experimentation, under the Registration, Evaluation, Authorization and Restriction of Chemicals (REACH) regulation [1,2]. Some alternatives have been established for assessment of skin and ocular toxicity, phototoxicity, and mutagenicity [3]. However, alternatives for the assessment of systemic toxicity have yet to be developed. This includes repeated-dose toxicity (RDT), one of the most important toxicities in the hazard evaluation not only of chemical substances but also of pharmaceuticals, food additives, and pesticides. This is because RDT tests have various endpoints, and the mechanisms underlying RDTs are either very complex or simply unknown.

There are only a few extant studies on alternative forms of evaluation for RDT. Liu et al. [4] attempted to predict three types of hepatotoxicity, defined on the basis of observed rat liver histopathology, using six different machine learning algorithms, and compared the accuracy of the six algorithms. Low et al. [5] evaluated the power of several statistical models for predicting drug-induced hepatotoxicity in rats, using four different classification methods. Recently, a review of existing in silico models for RDT was also published [6].

In order to develop prediction methods for RDT, Japanese government and academic institutes developed the Hazard Evaluation Support System Integrated Platform (HESS) [7–9]. The HESS is a platform equipped with 28-day RDT rat test results, compatible with Good Laboratory Practice (GLP), mainly for existing chemicals falling under the Chemical Substance Control Law in Japan. However, though the HESS includes a high-quality in vivo database in rats, it offers no quantitative prediction methods for RDT. Nonetheless, we believe that the HESS data (HESS-DB) are very useful for developing methods to predict RDT in rats.

The present study utilizes HESS-DB to develop statistical models for predicting 28-day RDT in rats from molecular descriptors of chemicals. We first used HESS-DB to create a common training data set of RDT data in rats for studying methods of predicting hepatotoxicity. Among various endpoints of RDT, we focus on hepatotoxicity in rats, especially an increase in serum ALT levels. The liver is one of the most important target

http://dx.doi.org/10.1016/j.comtox.2017.05.002 2468-1113/© 2017 Elsevier B.V. All rights reserved.

Please cite this article in press as: J.-i. Takeshita et al., Discriminative models using molecular descriptors for predicting increased serum ALT levels in repeated-dose toxicity studies of rats, Comput. Toxicol. (2017), http://dx.doi.org/10.1016/j.comtox.2017.05.002


organs for toxic chemical compounds, and an increase in serum ALT levels is a well-known marker for hepatotoxicity. However, when discriminative models using molecular descriptors were being developed, two statistical difficulties appeared: (1) there were a large number of molecular descriptors that were candidates for explanatory variables in the models, and (2) the training data sets were imbalanced, that is, the number of compounds in minor categories was much smaller than in major categories. Therefore, the present study mainly aimed to overcome these difficulties, with the k-medoids method being applied to address the first, and the Synthetic Minority Over-sampling Technique (SMOTE) algorithm being used to address the second. We then develop two discriminative models for an increase in serum ALT levels, using logistic regression models: one discriminating “positive” and “negative” compounds, and the other discriminating “strong” and “weak” compounds (these four compound categories are defined in Section “Materials and methods”).

Materials and methods

RDT data in rats

RDT data in rats were obtained from HESS-DB as of September 2014. First, compounds having 28-day or 32-day studies were extracted, because these studies were treated as 28-day RDT studies in the present study. Meanwhile, endpoints related to the effects on the liver were extracted from HESS-DB. As a result, 208 compounds and 55 endpoints were selected. Second, endpoints meeting one of the three following conditions were eliminated: (a) endpoints that were not investigated for more than 11 compounds, which corresponded to more than 5% of the 208 compounds; (b) endpoints including the term “other findings”, which is a miscellaneous category; and (c) endpoints related to relative organ weights. A total of 16, 2, and 2 endpoints respectively met conditions (a), (b), and (c), leaving 35 endpoints.

Third, compounds were omitted according to the following criteria: (1) compounds having a missing value at any of the 35 endpoints; (2) metals and metal-containing compounds; (3) compounds whose chemical structure could not be determined uniquely. A total of 22, 6, and 4 compounds were respectively omitted by criteria (1), (2), and (3), leaving 176 compounds. Supplementary Tables 1 and 2 show the 35 endpoints and 176 compounds, respectively.

The 176 compounds were divided into two categories, in two ways, based on changes in serum ALT levels. The first division included “positive” and “negative” categories. If any Lowest-Observed-Effect Level (LOEL) value was observed in HESS-DB for an increase in serum ALT level, the respective compound was registered in the positive category; otherwise it was registered in the negative category. Respectively, 40 and 136 compounds were registered in the positive and negative categories. The second division included “strong” and “weak” categories. If ALT levels were increased by a treatment at a dose below 1000 mg/kg body weight/day, the compound was registered in the strong category; otherwise it was registered in the weak category. Respectively, 23 and 153 compounds were registered in the strong and weak categories. Both data sets were therefore clearly imbalanced. Note that, for each endpoint, we did not consider sex differences, but combined the effects in both male and female rats in the present study.

Molecular descriptors

The chemical structures of the 176 compounds were depicted by ChemDraw Professional 15 (PerkinElmer, Waltham, MA), and the molecular descriptors were calculated by DRAGON 6 (Talete,

Milano, Italy). DRAGON 6 can calculate 4885 descriptors, but only 3636 of them could be calculated for all 176 compounds in the present study. Therefore, these 3636 descriptors were used in the subsequent analysis.

Statistical analysis software

All statistical analyses, except hierarchical clustering, were performed using R version 3.3.1; hierarchical clustering was performed using JMP 12 (SAS Institute, Cary, NC). The same methods were used to develop the models to discriminate positive- and negative-category compounds, and strong- and weak-category compounds.

Selection of molecular descriptors I: clustering methods

We used a logistic regression model to discriminate positive/negative-category compounds and strong/weak-category compounds, in terms of an increase in serum ALT levels. Although some of the molecular descriptors would be explanatory variables in the logistic regression models, their number was much greater than the number of compounds in the present study. Therefore, the list of molecular descriptors was narrowed down by the following procedures. First, molecular descriptors having the same values for all 176 compounds were omitted, leaving 2159 descriptors. Second, the k-medoids method was applied to select statistically representative molecular descriptors; see details on the method in the Appendix (Section “The k-medoids method”). The k-medoids method was performed using the PAM function in the PAM package in R. The method, however, required us to decide the number of clusters in advance, to which end an agglomerative hierarchical clustering method was used. The following dissimilarity measure was used for the hierarchical clustering: for any molecular descriptors x and y, the dissimilarity measure, or distance, between x and y, say $d(x, y)$, was defined by

$$d(x, y) = \frac{1 - \mathrm{corr}(x, y)}{2},$$

where $\mathrm{corr}(x, y)$ is the correlation coefficient between x and y. The dissimilarity measure took a value between 0 and 1, which implied that $1 - d(x, y)$ was a normalized similarity. Ward’s method could therefore be used, even though the similarity was a non-Euclidean similarity measure [10].

Resolution of the imbalanced data

In order to resolve the imbalance of the data sets, the SMOTE algorithm was used; see details on the algorithm in the Appendix (Section “The Synthetic Minority Over-sampling Technique (SMOTE) algorithm”). The SMOTE function in the DMwR package in R was used to conduct the SMOTE algorithm. More precisely, 200% synthetic samples belonging to the positive or strong categories were generated for the positive- or strong-category samples. In order to generate these synthetic samples, five neighbors were considered. On the other hand, negative- or weak-category samples were selected randomly, such that the numbers of positive- and negative-category samples, and of strong- and weak-category samples, became identical. Since the algorithm had inherent randomness (each balanced data set created by the algorithm was different), 100 balanced data sets were created in this manner. Then, a balanced data set was selected for the subsequent analysis, using 5-fold cross-validation. For each balanced data set created by the SMOTE algorithm, the following approaches were taken. First, for each training data set, the procedures described in Sections “Selection of molecular descriptors II: statistical importance of variables” and “Model selection” were conducted. However,


14 molecular descriptors were used instead of 17 molecular descriptors in Section “Selection of molecular descriptors II: statistical importance of variables”. Since the number of compounds in the training data sets was 141 or 142, 14 was the largest number not exceeding one-tenth of the number of compounds in these sets. Second, for each test data set, Youden’s index was calculated, defined as follows:

$$\text{Youden's index} = \text{sensitivity} + \text{specificity} - 1,$$

where

$$\text{Sensitivity} = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false negatives}}$$

and

$$\text{Specificity} = \frac{\text{number of true negatives}}{\text{number of true negatives} + \text{number of false positives}}.$$

Third, for each balanced data set, the mean of the five Youden’s indices was calculated. It should be noted that five test data sets existed for each balanced data set, since 5-fold cross-validation was applied. Finally, the balanced data set with the largest mean of the five Youden’s indices was selected as the data set used in the following steps.

Selection of molecular descriptors II: statistical importance of variables

Even though the number of molecular descriptors, which were the candidates for explanatory variables in the logistic regression models, had been reduced in Section “Selection of molecular descriptors I: clustering methods”, the reduction was not enough. Therefore, for each extracted molecular descriptor, the statistical importance of the variable was defined by applying logistic regression analysis, which was performed using the glm function in R. More precisely, we conducted the following procedures.

(1) For each molecular descriptor, a logistic regression model with only one explanatory variable, which was the molecular descriptor, was constructed.
(2) For each compound, if its predicted probability of belonging to the positive or strong category according to the model was greater than or equal to 0.5, the compound was assumed to be a positive- or strong-category compound; otherwise, the compound was assumed to be a negative- or weak-category compound.
(3) The sensitivity and specificity of the model were calculated by comparing the prediction results for all the compounds with their true categories.
(4) The importance of the molecular descriptor was defined by the Youden’s index of the model.

Then, the top 17 molecular descriptors in importance were selected for the subsequent analysis. The number 17 was chosen as the largest number not exceeding one-tenth of 176, which was the number of compounds in the original training data set.

Model selection

Akaike’s information criterion (AIC) was used to select the final logistic regression models, using some or all of the 17 molecular descriptors selected in Section “Selection of molecular descriptors II: statistical importance of variables”. However, since the number of candidate models was $2^{17}$, it was too difficult to calculate the AICs for all the candidate models. Therefore, 100 models using some or all of the 17 molecular descriptors were randomly selected. Then, these 100 models were set as the initial models, and stepwise forward-backward selections were conducted, using the stepAIC function in the MASS package in R. In this way, for each initial model, the model with the minimal AIC value was generated, and 100 models which were candidates for the final logistic regression model for prediction were constructed. Finally, we chose the logistic regression model with the minimum AIC value of the 100 models as the final model.

Evaluation of the discriminative models

In order to evaluate the discriminative ability of the logistic regression models, sensitivity, specificity, and concordance were employed. The definitions of sensitivity and specificity are given above (Section “Resolution of the imbalanced data”), and the definition of concordance was as follows:

$$\text{Concordance} = \frac{\text{number of true positives} + \text{number of true negatives}}{\text{total number of compounds}}.$$

Results

Statistical difficulties and their overcoming

In order to create discriminative models, we faced two difficulties to be overcome. (1) The number of molecular descriptors, which were candidates for explanatory variables in the model, was much greater than the number of compounds. (2) There was a large bias between the numbers of major-category compounds and minor-category compounds; in other words, the training data sets were imbalanced. To overcome the first difficulty, we applied the k-medoids method (Section 2 in [11]), which is a nonhierarchical clustering algorithm, to select molecular descriptors for constructing the models. To overcome the second, we applied the SMOTE algorithm [12].

Selection of molecular descriptors with clustering methods

Molecular descriptors having the same values for the 176 compounds were removed, leaving 2159 molecular descriptors. Fig. 1 shows a dendrogram obtained by agglomerative hierarchical clustering (Ward’s method) applied to the data set with the 176 compounds and 2159 molecular descriptors. In Fig. 1, the y-axis shows the dissimilarity measures where two clusters were merged, while the x-axis shows the distribution of molecular descriptors.

Fig. 1. A dendrogram produced by applying agglomerative hierarchical clustering (Ward’s method) to the data set with the 2159 molecular descriptors and 176 compounds. The y-axis marks the dissimilarity measures at which the clusters merge, and the x-axis shows the distribution of molecular descriptors.

Fig. 2 shows the differences between the dissimilarity measures when the number of clusters was n and (n − 1), plotted at x = n (30 ≤ n ≤ 200). We decided that the number of clusters used for the k-medoids method should be 74, as there was a large gap when the number of clusters was reduced from 74 to 73.
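The two ingredients of this step, the correlation-based dissimilarity $d(x, y) = (1 - \mathrm{corr}(x, y))/2$ and the choice of the cluster count from the largest jump between consecutive merge heights, can be sketched as follows. This is a minimal pure-Python illustration, not the JMP/R workflow used in the study; the function names and the toy merge heights are hypothetical, and the gap is assumed to be the height of the merge that takes n clusters to n − 1 minus the height of the merge immediately before it.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (constant descriptors were
    removed before clustering in the text above).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def dissimilarity(x, y):
    """d(x, y) = (1 - corr(x, y)) / 2, a value between 0 and 1."""
    return (1.0 - pearson(x, y)) / 2.0


def pick_cluster_count(merge_heights, n_min, n_max):
    """Return the cluster count n in [n_min, n_max] with the largest gap.

    merge_heights holds the m - 1 merge heights of a dendrogram over m
    items. After sorting, the merge at index k leaves m - k - 1 clusters,
    so the merge taking n clusters to n - 1 has index m - n.
    """
    heights = sorted(merge_heights)
    m = len(heights) + 1
    best_n, best_gap = None, float("-inf")
    for n in range(max(n_min, 2), min(n_max, m - 1) + 1):
        gap = heights[m - n] - heights[m - n - 1]
        if gap > best_gap:
            best_n, best_gap = n, gap
    return best_n


# Hypothetical toy dendrogram over six items (five merges).
toy_heights = [0.1, 0.2, 0.3, 0.9, 1.0]
chosen = pick_cluster_count(toy_heights, n_min=2, n_max=5)
print("chosen number of clusters:", chosen)
```

In the toy example the largest jump occurs when three clusters are merged into two, so three clusters are kept; this mirrors the reasoning that led to 74 clusters in the text.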

Fig. 2. A graph showing the differences between the dissimilarity measures at which the clusters merge and the ones immediately before. The x-axis indicates the number of clusters, while the y-axis marks the differences between the dissimilarity measures at which the clusters merge and the ones immediately before. A peak appears when the cluster number equals 74.

The logistic regression model for discrimination between positive- and negative-category compounds

Our method selected the following logistic regression model, with seven explanatory variables:

$$\log\left(\frac{p_i}{1 - p_i}\right) = 10.823 - 1.129\,\mathrm{Eig07\_EA(bo)} - 0.647\,\mathrm{EE\_B(s)} + 2.317\,\mathrm{GATS3p} + 2.966\,\mathrm{MATS3v} - 1.280\,\mathrm{Rbrid} + 3.767\,\mathrm{EE\_H2} - 0.708\,\mathrm{ATS5v},$$

where $p_i$ is the probability that Compound i belongs to the positive category, and Eig07_EA(bo), EE_B(s), GATS3p, MATS3v, Rbrid, EE_H2, and ATS5v respectively mean eigenvalue n. 7 from edge adjacency matrix weighted by bond order, Estrada-like index (log function) from Burden matrix weighted by I-State, Geary autocorrelation of lag 3 weighted by polarizability, Moran autocorrelation of lag 3 weighted by van der Waals volume, Ring bridge count, Estrada-like index (log function) from reciprocal squared distance matrix, and Broto-Moreau autocorrelation of lag 5 (log function) weighted by I-state; see the DRAGON 6 user’s manual [13] for more details. All coefficients were rounded off to three decimal places.

For each of the 176 compounds, the probability of belonging to the positive category was predicted by the logistic regression model. If the probability was greater than or equal to 0.5, the compound was assumed to be in the positive category; otherwise, the compound was assumed to be in the negative category. The sensitivity, specificity, and concordance of the prediction results were 0.650, 0.581, and 0.600, respectively. The respective confusion matrix is shown in Table 1. There were 14 false negatives and 57 false positives.

Table 1. The confusion matrix between positive- and negative-category compounds.

| Reference \ Prediction | Number of positive | Number of negative |
|---|---|---|
| Number of positive | 26 | 14 |
| Number of negative | 57 | 79 |

The logistic regression model for discrimination between strong- and weak-category compounds

Our method selected the following logistic regression model, with nine explanatory variables:

$$\log\left(\frac{q_i}{1 - q_i}\right) = 1.610 - 1.033\,\mathrm{ON0V} + 0.765\,\mathrm{SM04\_EA(dm)} + 142.319\,\mathrm{NtN} - 2.639\,\mathrm{Psi\_i\_A} + 1.648\,\mathrm{nN} - 1.099\,\mathrm{Eig07\_EA(bo)} - 0.258\,\mathrm{T(N..N)} + 3.094\,\mathrm{GATS3p} + 0.221\,\mathrm{Wi\_B(e)},$$

where $q_i$ is the probability that Compound i belongs to the strong category, and ON0V, SM04_EA(dm), NtN, Psi_i_A, nN, Eig07_EA(bo), T(N..N), GATS3p, and Wi_B(e) respectively mean overall modified Zagreb index of order 0 by valence vertex degrees, spectral moment of order 4 from edge adjacency matrix weighted by edge degree, number of atoms of type tN, intrinsic state pseudoconnectivity index - type S average, number of nitrogen atoms, eigenvalue n. 7 from edge adjacency matrix weighted by bond order, sum of topological distances between N...N, Geary autocorrelation of lag 3 weighted by polarizability, and Wiener-like index from Burden matrix weighted by Sanderson electronegativity; see the DRAGON 6 user’s manual [13] for more details. All coefficients were rounded off to three decimal places.

For each of the 176 compounds, the probability of belonging to the strong category was predicted by the logistic regression model. If the probability was greater than or equal to 0.5, the compound was assumed to be in the strong category; otherwise the compound was assumed to be in the weak category. The sensitivity, specificity, and concordance of the prediction results were 0.783, 0.745, and 0.750, respectively. The respective confusion matrix is shown in Table 2. There were 5 false negatives and 39 false positives.

Table 2. The confusion matrix between strong- and weak-category compounds.

| Reference \ Prediction | Number of strong | Number of weak |
|---|---|---|
| Number of strong | 18 | 5 |
| Number of weak | 39 | 114 |
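The evaluation measures defined in the Methods can be checked directly against the counts in Table 2. The snippet below is a small sketch of that arithmetic; the function name and dictionary keys are hypothetical, and only the four counts come from the table.

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, concordance and Youden's index
    computed from the four cells of a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    concordance = (tp + tn) / (tp + fn + fp + tn)
    youden = sensitivity + specificity - 1.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "concordance": concordance,
        "youden_index": youden,
    }


# Counts taken from Table 2 (strong = positive class, weak = negative class).
metrics = classification_metrics(tp=18, fn=5, fp=39, tn=114)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With the Table 2 counts, the rounded sensitivity, specificity, and concordance agree with the reported values of 0.783, 0.745, and 0.750.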

Discussion and concluding remarks

After creating a basic training data set of RDT in rats using HESS-DB, to study methods of predicting hepatotoxicity, we developed two discriminative models for predicting an increase in serum ALT levels. In order to develop the models, we faced two statistical difficulties. One was that the number of in silico parameters was quite large. The other was that the training data sets were imbalanced.

To overcome the former difficulty, the k-medoids method was applied, and the statistical importance of variables was introduced. This approach worked well for developing our simple models, since the two discriminative models had fewer than 10 explanatory variables.

To overcome the latter difficulty, the SMOTE algorithm was applied. We often face this type of difficulty when mining toxicology data; Liu et al. [4], for example, also noted this problem, and showed that resolving imbalances contributed to an increase in the accuracy of discriminative models. They only conducted undersampling of major-category compounds, which can be used only for cases in which there are a sufficient number of minor-category compounds. In the case of the present study, the numbers of positive- and strong-category compounds, which formed the minor classes, were only 40 and 23, respectively. In such a case, oversampling is also needed, in addition to undersampling, and the SMOTE algorithm offers a powerful tool.

In the internal validation, both models achieved sensitivity, specificity, and concordance of roughly 60–70%. In addition, the discriminative model for “strong” and “weak” compounds had higher accuracy than that for “positive” and “negative” compounds. The fact that the accuracy of discrimination was improved by applying a toxicological perspective implies that compounds having a high-dose LOEL value do not have serious toxicity, and may almost be considered negative-category compounds.
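The combination of oversampling and undersampling discussed above can be sketched in a few lines. The code below is a simplified pure-Python illustration of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbors), followed by random selection of majority samples; it is not the DMwR implementation used in the study. The function names, the seeds, and the toy data are hypothetical; the 200% rate follows the Methods, while the neighbor count in the toy call is reduced because the toy set is tiny.

```python
import math
import random


def smote(minority, percent=200, k=5, seed=0):
    """Generate percent/100 synthetic samples per minority sample.

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority neighbors (Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbors = others[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbors)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def undersample(majority, size, seed=0):
    """Randomly keep `size` majority samples without replacement."""
    return random.Random(seed).sample(majority, size)


# Hypothetical toy data: 4 minority and 20 majority points in 2-D.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_majority = [[float(i), 5.0] for i in range(20)]

toy_synthetic = smote(toy_minority, percent=200, k=3, seed=1)
balanced_minority = toy_minority + toy_synthetic
balanced_majority = undersample(toy_majority, len(balanced_minority), seed=1)
print(len(balanced_minority), len(balanced_majority))
```

Because the procedure is random, repeating it with different seeds yields different balanced data sets, which is why the study generated 100 of them and chose among them by cross-validation.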
Since external validation would also be useful in assessing the predictive capability of the models, we conducted external validation for the model discriminating between strong- and weak-category compounds. External data were obtained from the Toxicity Reference Database (ToxRefDB) as of October 2014, which was created and curated by the U.S. Environmental Protection Agency (US EPA) [14]. In the database, 59 compounds other than the compounds in HESS-DB were available for the external validation. Those compounds had 28-day (from 21 to 35 days) RDT studies in rats, and their chemical structures and molecular descriptors could be obtained. A total of 23 and 36 out of the 59 compounds belonged to strong- and weak-category compounds, respectively. In the external validation, the model discriminating strong and weak compounds achieved sensitivity of roughly 60%, and specificity and concordance of roughly 40–50%. The predictive performance for the external data was lower than for the internal data. However, we cannot rule out that the chemical domains of the internal data and external data are different. In fact, according to two-sample t-tests between the external data and internal data for 197 essential molecular descriptors, there were significant differences between the two groups, at a 5% level, for 52 molecular descriptors out of the 197. (The data related to the external validation are not shown.) The results indicate that it is important to clarify the applicable chemical domain of our models. Indeed, one of the Organisation for Economic Co-operation and Development (OECD) principles regarding (Quantitative) Structure-Activity Relationship ((Q)SAR) models calls for a defined domain of applicability [15], and this is a consideration for the future.

The examples in Hayashi et al. [16] are useful for evaluating the classification capability of in silico toxicity prediction models. The study evaluated three commercially available in silico (Q)SAR systems for predicting chemical genotoxicity (DEREK, MultiCASE, and ADMEWorks), using 206 chemicals falling under the Chemical Substance Control Law in Japan. Since genotoxicity is a yes/no-type toxicity, and these in silico systems are relatively acceptable in a regulatory context, it is appropriate to compare the accuracy of our models with the results reported in that paper. Table 3 shows a comparative summary of the results in the paper and those of our model for discrimination between strong- and weak-category compounds. The average sensitivity, specificity, and concordance of the three in silico systems are 0.704, 0.830, and 0.815, respectively. In terms of sensitivity, the capability of our model is comparable to that of the three in silico systems; however, our model shows less capability in the case of specificity and concordance.

Table 3. Performance of the three in silico systems and our model.

| System or model | Reference category | Number of compounds predicted as positive/strong | Number of compounds predicted as negative/weak | Sensitivity | Specificity | Concordance |
|---|---|---|---|---|---|---|
| DEREK | Positive | 19 | 7 | 0.731 | 0.883 | 0.864 |
| | Negative | 21 | 159 | | | |
| MultiCASE | Positive | 13 | 7 | 0.650 | 0.911 | 0.880 |
| | Negative | 13 | 133 | | | |
| ADMEWorks | Positive | 19 | 7 | 0.731 | 0.697 | 0.701 |
| | Negative | 54 | 124 | | | |
| Our model for discrimination between strong- and weak-category compounds | Strong | 18 | 5 | 0.783 | 0.745 | 0.750 |
| | Weak | 39 | 114 | | | |

Since, in the context of toxicity prediction, false negatives are less acceptable than false positives, it is notable that the sensitivity of our model is superior to that of the three in silico systems. Five compounds were false negatives in the case of discrimination between strong-category and weak-category compounds (Table 4). Our approach did not consider in vivo metabolism, and this fact may explain the false negatives. As HESS includes a rat liver metabolism simulator, metabolites were predicted for the five compounds. Then, the molecular descriptors of the major metabolites obtained were calculated by DRAGON 6, and the probability of their belonging to the strong category was predicted by the logistic regression model described in Section “The logistic regression model for discrimination between strong- and weak-category compounds”. As a result, the metabolites of two compounds (CAS No. 90-13-1 and CAS No. 98-10-2) were predicted as strong compounds. Fig. 3 shows the two compounds and their major metabolites predicted by the simulator. This implies that if we could consider in vivo metabolism of original compounds, the number of false negatives could be reduced.

Table 4. List of the false-negative compounds.

| CAS No. | Name |
|---|---|
| 80-43-3 | Dicumyl peroxide |
| 112-26-5 | 1,2-Bis(2-chloroethoxy)ethane |
| 103-44-6 | 2-Ethylhexyl vinyl ether |
| 90-13-1 | 1-Chloronaphthalene |
| 98-10-2 | Benzenesulphonamide |

Fig. 3. Upper panel: the chemical structure of 1-chloronaphthalene (CAS No. 90-13-1) (left side) and one of its metabolites predicted by the simulator in HESS (right side). Lower panel: the chemical structure of benzenesulphonamide (CAS No. 98-10-2) (left side) and one of its metabolites predicted by the simulator in HESS (right side).
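Rescoring a metabolite with an already fitted model, as described above, amounts to evaluating the logit and applying the 0.5 threshold. The sketch below shows that computation in a generic form; the function names are hypothetical, and the coefficient and descriptor values in the example are placeholders chosen for illustration, not values from the fitted models or from DRAGON 6.

```python
import math


def logit_probability(intercept, coefficients, descriptors):
    """Probability from a logistic model: 1 / (1 + exp(-eta)), where
    eta = intercept + sum(coefficient * descriptor value)."""
    eta = intercept + sum(
        coefficients[name] * descriptors[name] for name in coefficients
    )
    # Evaluate in a form that avoids overflow for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    exp_eta = math.exp(eta)
    return exp_eta / (1.0 + exp_eta)


def classify(probability, threshold=0.5, labels=("strong", "weak")):
    """Assign the first label when probability >= threshold."""
    return labels[0] if probability >= threshold else labels[1]


# Placeholder model and descriptor values, for illustration only.
example_coefficients = {"descriptor_a": 1.0, "descriptor_b": -0.5}
parent = {"descriptor_a": -2.0, "descriptor_b": 1.0}
metabolite = {"descriptor_a": 1.5, "descriptor_b": 1.0}

for label, values in (("parent", parent), ("metabolite", metabolite)):
    p = logit_probability(0.0, example_coefficients, values)
    print(label, round(p, 3), classify(p))
```

In this toy case the parent compound falls below the threshold while the metabolite exceeds it, which is the pattern reported for the two compounds in Fig. 3.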

Table 5. List of the false-positive compounds.

| CAS No. | Name |
|---|---|
| 95-64-7 | 3,4-Dimethylaniline |
| 88-53-9 | 2-Amino-5-chloro-4-methyl-benzenesulfonic acid |
| 538-75-0 | N,N'-Dicyclohexylcarbodiimide |
| 5460-09-3 | Monosodium 4-Amino-5-hydroxy-2,7-naphthalenedisulfonate |
| 86-87-3 | 1-Naphthylacetic acid |
| 121-47-1 | 3-Aminobenzenesulfonic acid |
| 583-39-1 | 2-Mercaptobenzimidazole |
| 623-26-7 | Terephthalonitrile; 1,4-Dicyanobenzene |
| 1477-55-0 | 1,3-Bis(aminomethyl)benzene |
| 87-02-5 | 7-Amino-4-hydroxy-2-naphthalenesulfonic acid |
| 95-33-0 | N-Cyclohexyl-2-benzothiazolesulfenamide |
| 100-69-6 | 2-Vinylpyridine |
| 103-83-3 | N,N-Dimethylbenzylamine |
| 123-30-8 | 4-Aminophenol |
| 16219-75-3 | 5-Ethylidene-2-norbornene |
| 80-09-1 | 4,4'-Sulfonyldiphenol |
| 51-28-5 | 2,4-Dinitrophenol |
| 75-59-2 | N,N,N-Trimethylmethanaminium hydroxide |
| 88-89-1 | 2,4,6-Trinitrophenol |
| 824-78-2 | p-Nitrophenol sodium salt |
| 5039-78-1 | (Methacryloyloxyethyl)trimethylammonium chloride |
| 5124-25-4 | Disperse yellow 42 |
| 26630-7-5 | Disperse red 206 |
| 95-32-9 | 2-(4-Morpholinyldithio)benzothiazole |
| 111-17-1 | 3,3'-Thiobispropionic acid; thiodipropionic acid |
| 96-45-7 | 2-Imidazolidinethione |
| 100-47-0 | Benzonitrile |
| 121-60-8 | p-(Acetylamino)benzenesulfonyl chloride |
| 461-72-3 | Hydantoin |
| 2580-78-1 | Reactive blue 19 |
| 95-68-1 | 2,4-Dimethylaniline |
| 51-52-5 | 6-n-Propyl-2-thiouracil |
| 14816-18-3 | Phoxim |
| 3618-60-8 | Sodium 5-chloro-3-[(1,5-dihydroxy-2-naphthyl)azo]-2-hydroxybenzenesulphonate |
| 92-77-3 | 3-Hydroxy-2-naphthanilide |
| 52-51-7 | 2-Bromo-2-nitropropane-1,3-diol |
| 90-41-5 | Biphenyl-2-ylamine |
| 100-43-6 | 4-Vinylpyridine |
| 281-23-2 | Tricyclo[3.3.1.1(3,7)]decane |

Meanwhile, 39 compounds were false positives (Table 5). The prediction of false positives may depend strongly on the database. Fig. 4, a histogram of the maximum doses used in the animal tests for the 39 compounds, shows that the maximum doses for 21 of the 39 were less than 1000 mg/kg body weight/day; that is, roughly half of the false-positive compounds were never tested at high doses in RDT studies. In such cases, LOEL values were usually observed at low doses for endpoints in organs or tissues other than the liver. Hence, if RDT tests had been conducted for these 21 compounds at high doses, LOEL values for increased serum ALT levels might have been observed. We therefore cannot rule out the possibility that compounds having no LOEL values for increased serum ALT levels belong to the positive category. In other words, it is difficult to determine whether such compounds truly belong to the negative category. This is a limitation of studies based on databases consisting of data from existing in vivo experiments.

The present study used only molecular descriptors to predict an increase in serum ALT levels. However, molecular descriptors cannot reflect the biological characteristics of compounds. Since in vitro assays can reflect some of these characteristics, using the results of in vitro assays as explanatory variables in discriminative models could increase the accuracy of the models. For this purpose, in vitro assays showing high reactivity would be preferable, since such assays have the potential to offer a wide range of measurement results. Liu et al. [4], for example, noted that using both in silico and in vitro parameters increased the accuracy of discriminative models for hepatotoxicity, compared with using only in silico parameters. Similarly, Low et al. [5] reported that using toxicogenomics data, which were treated as biological descriptors (in vitro parameters), or using both chemical descriptors (in silico parameters) and toxicogenomics data, increased the correct classification rate, compared with using only chemical descriptors.

Finally, it should be noted that our methods were developed based only on statistical correlations between the molecular descriptors of chemicals and the target response, that is, an increase in serum ALT levels in rats. Furthermore, the model selection was based on AIC values. Thus, our models were designed for predictive capability, and are unsuitable for discussing toxicity mechanisms, that is, the toxicological relationship between the selected molecular descriptors and the response.

Fig. 4. A histogram of the maximum doses of animal tests for the false-positive compounds. The maximum doses for 21 compounds were less than 1000 mg/kg body weight/day.

Acknowledgments

The authors are grateful to Dr. Takashi Yamada, who provided us with the HESS data. This study was supported in part by JSPS KAKENHI Grant Number JP16K21674 (JT), and the Japan Chemical Industry Association Long-range Research Initiative (JCIA-LRI) (KY).

Appendix A.

The k-medoids method

The k-medoids method is a non-hierarchical clustering method. Although the k-means method is the best known of the non-hierarchical clustering methods, the k-medoids method differs from it in two respects. (1) While the k-means method minimizes the mean of the dissimilarity measures between the centroid and all the members in the same cluster, the k-medoids method minimizes the mean of the dissimilarity measures between the medoid and all the members in the same cluster. (2) In the k-means method, the representative point of each cluster is the centroid, which need not be a member of the data set; in the k-medoids method, the representative point of each cluster is the medoid, which is the member having the minimum sum of dissimilarity measures to all the other members in the same cluster. Because each medoid is an actual member of its cluster, the medoids can be selected as statistically representative molecular descriptors. More details on the k-medoids method can be found in Chapter 2 of Kaufman et al. [11].
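The procedure can be illustrated with a minimal sketch. It assumes a precomputed dissimilarity matrix and uses a simple alternating scheme (assign members to the nearest medoid, then update each medoid); this is not necessarily the variant or the dissimilarity measure used in the present study, and the function name k_medoids and the toy data are hypothetical.

```python
import numpy as np

def k_medoids(D, k, max_iter=100, seed=0):
    """Alternating k-medoids on a precomputed (n x n) dissimilarity matrix D.

    Illustrative sketch only; returns (medoid indices, cluster labels).
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)  # random initial medoids
    for _ in range(max_iter):
        # Assign every item to its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster becomes empty
            # The medoid is the member with the smallest summed
            # dissimilarity to all members of its own cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy example (hypothetical data): six items forming two obvious groups.
x = np.array([0.0, 1.0, 2.0, 50.0, 51.0, 52.0])
D = np.abs(x[:, None] - x[None, :])  # pairwise absolute differences
medoids, labels = k_medoids(D, k=2)
```

In the setting of this paper, the items being clustered would be the molecular descriptors rather than the compounds, so that each returned medoid is itself a descriptor that can stand for its cluster.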

The Synthetic Minority Over-sampling Technique (SMOTE) algorithm

For a focused category, the algorithm produces synthetic samples in the category as follows. (1) For each sample in the focused category, its n nearest neighbors in the same category are selected, where n is a given positive integer. (2) From the n neighbors, m samples are chosen randomly, where m < n is a given positive integer. (3) The m line segments between the initial sample in (1) and the m chosen samples are considered. (4) One synthetic sample is produced at an arbitrary position on each line segment. In this manner, m synthetic samples are generated for each original sample, that is, (100 × m)% of the original number of samples in the category. More details on the algorithm can be found in Chawla et al. [12].
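Steps (1) to (4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the present study: neighbors are found with Euclidean distance, the m neighbors are drawn without replacement (following the condition m < n above), and the function name smote and the toy data are hypothetical.

```python
import numpy as np

def smote(X, n_neighbors, m, seed=0):
    """Generate m synthetic samples per row of X (samples of one category).

    Illustrative sketch: Euclidean neighbors, sampling without replacement.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    if not (0 < m <= n_neighbors < n_samples):
        raise ValueError("need 0 < m <= n_neighbors < number of samples")
    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    synthetic = []
    for i in range(n_samples):
        # Step (1): the n nearest neighbors within the same category.
        neighbors = np.argsort(dist[i])[:n_neighbors]
        # Step (2): choose m of them at random.
        chosen = rng.choice(neighbors, size=m, replace=False)
        for j in chosen:
            # Steps (3)-(4): a random point on the segment from X[i] to X[j].
            t = rng.random()
            synthetic.append(X[i] + t * (X[j] - X[i]))
    return np.array(synthetic)

# Toy example (hypothetical data): 5 minority samples in 2 dimensions.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
S = smote(X_min, n_neighbors=3, m=2)
```

With 5 original samples and m = 2, the sketch returns 10 synthetic samples, that is, 200% of the original category size. Each synthetic sample lies on a segment between two original samples, so it stays within the region already spanned by the category.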


Appendix B. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.comtox.2017.05.002.

References

[1] The European Parliament and the Council of the European Union, Regulation (EC) No 1223/2009 of the European Parliament and of the Council of 30 November 2009 on cosmetic products (recast), Off. J. Eur. Union L342 (2009) 59–209.
[2] ECHA, Factsheet Interface between REACH and Cosmetics regulations, 2015. http://echa.europa.eu/documents/10162/13628/reach_cosmetics_factsheet_en.pdf.
[3] JaCVAM, JaCVAM Statements, http://www.jacvam.jp/en_effort/effort02.html (accessed January 10, 2017).
[4] J. Liu, K. Mansouri, R.S. Judson, M.T. Martin, H. Hong, M. Chen, X. Xu, R.S. Thomas, I. Shah, Predicting hepatotoxicity using ToxCast in vitro bioactivity and chemical structure, Chem. Res. Toxicol. 28 (2015) 738–751, http://dx.doi.org/10.1021/tx500501h.
[5] Y. Low, T. Uehara, Y. Minowa, H. Yamada, Y. Ohno, T. Urushidani, A. Sedykh, E. Muratov, V. Kuzmin, D. Fourches, H. Zhu, I. Rusyn, A. Tropsha, Predicting drug-induced hepatotoxicity using QSAR and toxicogenomics approaches, Chem. Res. Toxicol. 24 (2011) 1251–1262, http://dx.doi.org/10.1021/tx200148a.
[6] F. Pizzo, E. Benfenati, In silico models for repeated-dose toxicity (RDT): prediction of the no observed adverse effect level (NOAEL) and the lowest observed adverse effect level (LOAEL) for drugs, in: E. Benfenati (Ed.), In Silico Methods for Predicting Drug Toxicity, Humana Press, NY, USA, 2016, pp. 163–176.
[7] Y. Sakuratani, H.Q. Zhang, S. Nishikawa, K. Yamazaki, T. Yamada, J. Yamada, K. Gerova, G. Chankov, O. Mekenyan, M. Hayashi, Hazard Evaluation Support System (HESS) for predicting repeated dose toxicity using toxicological categories, SAR QSAR Environ. Res. 24 (2013) 351–363, http://dx.doi.org/10.1080/1062936X.2013.773375.
[8] J.C. Madden, Tools for grouping chemicals and forming categories, in: M. Cronin, J. Madden, S. Enoch, D. Roberts (Eds.), Chemical Toxicity Prediction: Category Formation and Read-Across, Royal Society of Chemistry, London, 2013, pp. 72–97.
[9] NITE, Hazard Evaluation Support System Integrated Platform (HESS), http://www.nite.go.jp/en/chem/qsar/hess-e.html (accessed January 10, 2017).
[10] S. Miyamoto, R. Abe, Y. Endo, J. Takeshita, Ward method of hierarchical clustering for non-Euclidean similarity measures, in: Proceedings of the 2015 Seventh International Conference of Soft Computing and Pattern Recognition (SoCPaR 2015), 2015, pp. 60–63.
[11] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons Inc, 2005.
[12] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357, http://dx.doi.org/10.1613/jair.953.
[13] Talete, Dragon 6 user’s manual, 2013.
[14] US EPA, Toxicity ForeCaster (ToxCast) Data, https://www.epa.gov/chemical-research/toxicity-forecaster-toxcasttm-data (accessed March 1, 2017).
[15] OECD, Guidance document on the validation of (quantitative) structure-activity relationships [(Q)SAR] models, Number 69, ENV/JM/MONO(2007)2, 2007.
[16] M. Hayashi, E. Kamata, A. Hirose, M. Takahashi, T. Morita, M. Ema, In silico assessment of chemical mutagenesis in comparison with results of Salmonella microsome assay on 909 chemicals, Mutat. Res. - Genet. Toxicol. Environ. Mutagen. 588 (2005) 129–135, http://dx.doi.org/10.1016/j.mrgentox.2005.09.009.

Please cite this article in press as: J.-i. Takeshita et al., Discriminative models using molecular descriptors for predicting increased serum ALT levels in repeated-dose toxicity studies of rats, Comput. Toxicol. (2017), http://dx.doi.org/10.1016/j.comtox.2017.05.002