Environment International 45 (2012) 51–58
Quantitative consensus of bioaccumulation models for integrated testing strategies

Alberto Fernández a,⁎, Anna Lombardo b, Robert Rallo c, Alessandra Roncaglioni b, Francesc Giralt a, Emilio Benfenati b

a Departament d'Enginyeria Quimica, Universitat Rovira i Virgili, Tarragona, Catalunya, Spain
b Istituto di Ricerche Farmacologiche “Mario Negri”, Milano, Italy
c Departament d'Enginyeria Informatica i Matematiques, Universitat Rovira i Virgili, Tarragona, Catalunya, Spain
Article info

Article history: Received 13 January 2012; Accepted 7 March 2012; Available online 8 May 2012

Keywords: Quantitative consensus; ITS; Bioaccumulation; Bayesian theory; QSAR integration
Abstract

A quantitative consensus model based on bioconcentration factor (BCF) predictions obtained from five quantitative structure–activity relationship models was developed for bioaccumulation assessment as an integrated testing approach for waiving. Three categories were considered: non-bioaccumulative, bioaccumulative and very bioaccumulative. Five in silico BCF models were selected and integrated into a quantitative consensus model by means of the continuous formulation of Bayes' theorem. The discrete likelihoods commonly used in the qualitative Bayesian model were replaced by probability density functions to reduce the loss of information that occurs when continuous BCF values are distributed across the three bioaccumulation categories. Results showed that the continuous Bayesian model yielded better classification predictions than both the discrete Bayesian model and the individual BCF models. The proposed quantitative consensus model proved to be a suitable approach for integrated testing strategies for continuous endpoints of environmental interest. © 2012 Elsevier Ltd. All rights reserved.
Abbreviations: B, bioaccumulative; BCF, bioconcentration factor; CASRN, chemical abstracts service registry number; ITS, integrated testing strategy; nB, non-bioaccumulative; PBT, persistent, bioaccumulative and toxic; QSAR, quantitative structure–activity relationship; REACH, registration, evaluation, authorisation and restriction of chemicals; vB, very bioaccumulative; vPvB, very persistent and very bioaccumulative.

⁎ Corresponding author at: Departament d'Enginyeria Química, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007 Tarragona, Catalunya, Spain. Tel.: +34 977 558 549; fax: +34 977 559 621. E-mail address: [email protected] (A. Fernández).

0160-4120/$ – see front matter © 2012 Elsevier Ltd. All rights reserved. doi:10.1016/j.envint.2012.03.004

1. Introduction

The European legislation on Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH, 2006) requires an evaluation of the bioaccumulation potential for prioritization (i.e., the identification of the most dangerous chemicals). This is compulsory for substances produced/imported in quantities equal to or exceeding 10 t/year. Chemicals produced/imported above 100 t/year require a full bioaccumulation assessment since bioaccumulation is a key property for subsequent PBT (persistent, bioaccumulative and toxic) or vPvB (very persistent and very bioaccumulative) profiling. Similar regulations have emerged outside the EU and, for instance, the importance of identifying persistent and bioaccumulative chemicals used in commerce within the US toxic substances database has been recognized by Howard and Muir (2010).

Bioconcentration, defined as the process by which a chemical substance is absorbed by an organism from the environment only through its respiratory and dermal surface (Arnot and Gobas, 2006), is used to assess the bioaccumulation potential of a chemical. Bioconcentration is expressed by the BCF, defined as the ratio between the chemical concentration in the organism and in the environment. In addition to its use for prioritization, BCF can also be used for classification and labelling (requested in the EU for substances produced/imported above 1 t/year) as well as for chemical safety assessment. For classification and labelling purposes, BCF is used to evaluate whether a chemical should also be classified in the chronic aquatic toxicity category, according to the classification, labelling and packaging regulation. Therefore, it becomes of particular interest within the REACH legislative context to explore the use of BCF for classification purposes and to reflect the margins of reliability present in the bioconcentration assessment.

According to REACH Annex XIII (2011), substances that do not reach the bioaccumulative criterion (i.e., their BCF is lower than or equal to 2000 l/kg) are classified as non-bioaccumulative (nB) chemicals, substances that fulfil the bioaccumulative criterion but do not reach the very bioaccumulative criterion (i.e., their BCF is greater than 2000 l/kg but lower than or equal to 5000 l/kg) are classified as bioaccumulative (B) chemicals, and the remaining substances (i.e., their BCF is greater than 5000 l/kg) are classified as very bioaccumulative (vB) chemicals. The above thresholds for classifying substances into three bioaccumulation categories are 3.3 and 3.7 in terms of log units, respectively. The difference between these two thresholds (0.4 log units) is of the same order of magnitude as the experimental variability reported for available BCF datasets, which ranges from 0.42 to 0.75 log units (Dimitrov et al., 2005; Lombardo et al., 2010).
Therefore, the correct classification of chemicals having log BCF values near the thresholds is still a challenging task, particularly for bioaccumulative chemicals. Methods for data integration have been reported to be a suitable approach for increasing the reliability in decision making within an integrated testing strategy (ITS) scheme when multiple sources of information are available (Jaworska and Hoffmann, 2010). These methods have been extensively used for the combination of raw data obtained from several sensors (sensor fusion) in environmental applications (Ashraf et al., 2012). A different approach consists in the integration of different models through a boosting procedure to obtain a new, more robust and accurate model (Rallo et al., 2005). On the other hand, weight-of-evidence approaches are quantitative decision-making methods that integrate a set of evidences by weighting them according to experts' criteria (Benedetti et al., 2012; Morales-Caselles et al., 2008). Alternatively, decision-making approaches can be implemented to integrate data by using optimization methods such as multicriteria decision analysis (Passuello et al., 2012; Zabeo et al., 2011). Qualitative consensus models, such as Dempster–Shafer theory of evidence, have also been reported for consensus reasoning with discrete datasets (Fernández et al., 2009). A similar approach for quantitative consensus can be applied to integrate the output of different models for environmental endpoints, such as quantitative structure–activity relationship (QSAR) models, in a statistically sound manner by using Bayesian rules. The aim of this study was to develop a consensus approach suitable for integrating diverse non-testing methods within the workflow of an ITS for bioconcentration assessment. The purpose of this data integration is to reduce the number of animal tests (more than 100 fish for each substance are usually required) and to optimize available information. 
The quantitative consensus approach developed here, based on the combination of BCF values estimated from five in silico models, is capable of reliably classifying chemicals into three bioaccumulation categories. Instead of developing a new model for BCF, the goal is to take advantage of the good properties of available models to improve classification scores. To this end, all sources of evidence must be weighted according to their efficiency in predicting the bioconcentration of chemicals, and a consensus must be reached on the most probable class assignment when the classifications obtained from the different models conflict. Both objectives have been attained with a Bayesian consensus model where the different sources of evidence have been weighted by their corresponding likelihood density functions.

2. Materials and methods

2.1. BCF data and in silico models

The dataset used to validate the quantitative consensus approach contains experimental log BCF values for 701 chemicals that were compiled from four different databases: Arnot and Gobas (2006), Dimitrov et al. (2005), EURAS (2011) and Fu et al. (2009). Data were carefully checked to eliminate non-reliable BCF values and chemicals that could not be predicted by the in silico models. Mixtures (isomer mixtures and experiments performed with mixtures), inorganic compounds, and compounds without sufficient information to correctly generate their structure were not considered. Salts were considered in their acidic form. Chemicals reported in more than one database were considered only once. For the Arnot and Gobas database, only the compounds with a reliability score equal to 1 (i.e., reliable data) were considered in our study. Moreover, we only took the data related to the eight fish species suggested in the OECD 305 Guideline. For the EURAS dataset, the compounds with a reliability score equal to 3 or 4 (not reliable or not assignable) were also excluded from the analysis.
Details of data pre-processing are reported by Lombardo et al. (2010) for the first and third datasets, Zhao et al. (2008) for the second dataset, and Toropova et al. (2010) for the last dataset. Since the number of experimental BCF measures for each reported chemical varies from 1 to 62, the arithmetic mean of the available experimental log BCF values for each chemical was computed and compared with the predictions obtained from five BCF in silico models: CAESAR, TEST, BCFBAF/M, BCFBAF/A and CHEMPROP.

CAESAR (CAESAR, 2011; Lombardo et al., 2010; Zhao et al., 2008) is a model built from chemicals extracted from the Dimitrov et al. (2005) database. It uses two different models that are afterwards combined into a single model. It is based on eight molecular descriptors, the most influential being MlogP, i.e., log Kow calculated using the Moriguchi method (Moriguchi et al., 1992, 1994). The applicability domain of the model is indicated by alerts (e.g., chemicals outside the descriptor ranges and/or presence of fragments linked to estimations of low reliability).

TEST (2011) is a model based on compounds obtained by combining the Dimitrov et al. (2005), EURAS (2011), and Arnot and Gobas (2006) databases. This model uses five different methods (i.e., hierarchical, Food and Drug Administration, single model, group contribution, and nearest neighbour) plus a consensus method (i.e., the average of the values predicted using the other five methods). Only the results of the consensus model have been considered here. Each model has a defined applicability domain based on the descriptors used and the fragments found.

BCFBAF/M is a model by Meylan et al. (1999) based on log Kow and implemented in the software BCFBAF (2011). Its applicability domain is defined by the range of descriptors (molecular weight and log Kow).

BCFBAF/A is a model by Arnot and Gobas (Arnot and Gobas, 2003; Arnot et al., 2008) based on log Kow and metabolism (represented by biotransformation rates), and it is also implemented in the software BCFBAF.
This model provides separate predictions for three trophic levels of fish (i.e., lower, mid and upper) and can also predict bioconcentration assuming no metabolism. To ensure independence of the predictive models, two out of the three trophic levels were discarded because their corresponding models were highly correlated with each other, with correlation coefficients higher than 0.98. The mid trophic level of fish was selected because it yielded the highest correlation with the experimental log BCF values. Its applicability domain is defined by the log Kow range and by identifying predictions with low reliability for a few chemical classes.

CHEMPROP (2011) integrates several models to calculate physicochemical, toxicological and fate properties. A non-linear BCF model by Dimitrov et al. (2002) based on log Kow has been considered in this work. The applicability domain of this model is defined by a few chemical classes and by descriptor ranges (log Kow and molecular weight).

The target dataset that initially contained 701 substances was filtered to remove chemicals outside the applicability domain of any of the five selected models. This filtering process reduced the total number of working chemicals to a list of 522 chemicals that are provided as Table S1 in supplementary materials, together with log BCF predictions obtained from individual QSAR models. The accuracy of each in silico model has been assessed by means of contingency tables that include the distribution of correct/incorrect classifications, according to the experimentally observed BCF values. For instance, Table 1 shows the contingency table corresponding to the CAESAR model. Contingency tables for the other models are included in Table S2 of supplementary materials.
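The classification and contingency-table step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `classify`, `contingency` and `row_frequencies` are hypothetical, and the thresholds are simply the REACH limits of 2000 and 5000 l/kg expressed in log units.

```python
import math
from collections import Counter

CATEGORIES = ("nB", "B", "vB")
LOG_B = math.log10(2000)   # about 3.30, upper limit of the nB category
LOG_VB = math.log10(5000)  # about 3.70, upper limit of the B category


def classify(log_bcf):
    """Map a log BCF value to a bioaccumulation category."""
    if log_bcf <= LOG_B:
        return "nB"
    if log_bcf <= LOG_VB:
        return "B"
    return "vB"


def contingency(experimental, predicted):
    """Count chemicals by (experimental category, predicted category)."""
    counts = Counter(
        (classify(e), classify(p)) for e, p in zip(experimental, predicted)
    )
    return {
        exp: {pred: counts[(exp, pred)] for pred in CATEGORIES}
        for exp in CATEGORIES
    }


def row_frequencies(table):
    """Normalize each experimental row so that it sums to one."""
    freqs = {}
    for exp, row in table.items():
        total = sum(row.values())
        freqs[exp] = {p: (n / total if total else 0.0) for p, n in row.items()}
    return freqs


# Single chemical quoted in the text, tris(4-chlorophenyl)methanol:
# experimental log BCF = 3.95, CAESAR prediction = 3.49.
table = contingency([3.95], [3.49])
```

With the full lists of 522 experimental and predicted values, the same calls would yield the counts and row-normalized frequencies of a contingency table such as Table 1.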
2.2. Bayesian consensus with discrete probability distributions

Bayes' theory provides a useful tool for representing and updating subjective probabilities based on probabilistic inference (Billoir et al., 2008; Duda et al., 1979). Let H be a categorical variable denoting a set of hypotheses which can take several values, such as the three
Table 1
Distribution of bioaccumulation category predictions for the 449 chemicals experimentally known to be nB, for the 34 compounds known to be B, and for the 39 chemicals known to be vB. The fractions in parentheses are the relative frequencies of predictions in each experimental class (i.e., normalized by rows).

                 CAESAR (predicted)
Experimental     nB            B            vB
nB               439 (0.98)    7 (0.02)     3 (0.01)
B                22 (0.65)     7 (0.21)     5 (0.15)
vB               3 (0.08)      8 (0.21)     28 (0.72)
different categories nB, B and vB for bioaccumulation of chemicals (REACH Annex XIII, 2011). For instance, the composition of the screened dataset in terms of the current set of hypotheses H = {nB, B, vB} is the following: 449 (86.0%) substances are experimentally classified as nB chemicals, 34 (6.5%) as B, and the remaining 39 (7.5%) as vB chemicals. The procedure is depicted in detail in Fig. 1 for only the CAESAR model, with the evidence E supported by the bioaccumulation category e (nB, B or vB) predicted by this model. Two different probability distributions have to be taken into account to calculate the probability P(H = h | E = e) that hypothesis h (nB, B or vB) is true given evidence e. First, the prior probability P(H = h) that hypothesis h is true in the absence of any specific evidence. Here the usual uninformative priors P(H = nB) = P(H = B) = P(H = vB) = 1/3 were adopted. Second, the likelihood or probability P(E = e | H = h) that evidence e will be observed given that hypothesis h is true was estimated. Discrete likelihood distributions were obtained from the contingency tables. For instance, the three likelihood distributions for the CAESAR
model shown in Fig. 1 were obtained from the corresponding relative frequencies given in Table 1 for each experimental category. CAESAR predicts that tris(4-chlorophenyl)methanol, with chemical abstracts service registry number (CASRN) 3010-80-8, is a B compound (log BCF = 3.49). Therefore, the probability values 0.02, 0.21 and 0.21 corresponding to the B category in each one of the three likelihood distributions were multiplied by the corresponding prior probabilities (see Fig. 1) and the three resulting values were normalized (so that their sum equals one) to assess if e provides enough evidence to accept hypothesis h. The posterior probabilities P(H = nB) = 0.04, P(H = B) = 0.48 and P(H = vB) = 0.48 were thus obtained by applying Bayes' theorem,

$$P(H = h \mid E = e) = \frac{P(E = e \mid H = h)\, P(H = h)}{\sum_{j=1}^{3} P(E = e \mid H = h_j)\, P(H = h_j)} \qquad (1)$$
where h1 = nB, h2 = B and h3 = vB. According to the posterior probabilities obtained by Eq. (1) and reported in Fig. 1 for the CAESAR model, the most probable hypothesis is that tris(4-chlorophenyl)methanol is either a B or vB substance. The consensus method merges this evidence with that corresponding to the four other BCF predictions obtained from the TEST, BCFBAF/M, BCFBAF/A and CHEMPROP models in the following iterative procedure. The posterior probabilities obtained with the CAESAR model in the aforementioned first step of the consensus process become the new prior probabilities for the second step. In this second step the likelihood distributions are those of the TEST model, which classifies tris(4-chlorophenyl)methanol as an nB substance, and the posterior probabilities become P(H = nB) = 0.07, P(H = B) = 0.76 and P(H = vB) = 0.17. After the second iterative step is completed, the
Fig. 1. Illustration of the discrete Bayesian model for tris(4-chlorophenyl)methanol and its BCF categorical prediction by the CAESAR model in the first iteration of this consensus model. The three likelihood distributions are obtained from the corresponding relative frequencies given in Table 1 for each experimental category. Probability values corresponding to the B category in the likelihood distributions are used because the CAESAR model predicts tris(4-chlorophenyl)methanol to be a B chemical. The categorical prediction after the first iteration is B/vB, which changes to B after the second and third iterations, and to vB in the fourth and final fifth iterations. Iterations are identified by encircled numbers.
most probable bioaccumulation hypothesis for tris(4-chlorophenyl)methanol is the B category. The above second step posterior probabilities become the new prior probabilities in the third iteration of the consensus process and the above calculation procedure is repeated with the evidence obtained from the BCFBAF/M, BCFBAF/A and CHEMPROP models. At the end of the fifth iteration of the discrete consensus process, the following posterior probabilities are obtained: P(H = nB) = 0.00, P(H = B) = 0.01 and P(H = vB) = 0.99. This result indicates that tris(4-chlorophenyl)methanol is a vB substance, which is in agreement with its experimental log BCF value of 3.95.

However, the Bayesian consensus framework built around discrete probabilities is not the optimal combination approach when it is applied to continuous variables. In such cases, as in the current BCF study, the continuous variable is first discretized into several disjoint categories (e.g., nB, B and vB) and Bayes' theorem is applied afterwards. The drawback of this approach is that the exact numerical value of BCF is lost during the discretization process.

2.3. Bayesian consensus with continuous probability distributions

To avoid the loss of information caused by the categorization of continuous variables, Bayes' theorem can be formulated using continuous probability distributions instead of discrete likelihoods. Let evidence E now be a numerical variable containing the log BCF value predicted by a QSAR model. These log BCF values can follow different probability distributions for chemicals in different experimental categories. Therefore, different Gaussian probability density functions f(E = e | H = h) have to be fitted for each hypothesis h (nB, B, vB).
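Both formulations share the same update rule: multiply the current probabilities by one likelihood value per hypothesis and normalize. The following Python sketch illustrates this under stated assumptions; the function names are hypothetical and this is not the authors' implementation. The discrete likelihoods are the B-column frequencies of Table 1. The density values 0.04, 0.81 and 0.37 are read from the marks in Fig. 2, and their assignment to nB, B and vB is an assumption made here because it approximately reproduces the posteriors reported for the continuous model.

```python
from statistics import NormalDist

CATEGORIES = ("nB", "B", "vB")
UNIFORM = [1 / 3, 1 / 3, 1 / 3]  # uninformative priors used in the text


def bayes_update(prior, likelihood):
    """One iteration: prior times likelihood, normalized to sum to one."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    if total == 0:
        raise ValueError("evidence has zero likelihood under every hypothesis")
    return [j / total for j in joint]


def consensus(prior, likelihoods):
    """Chain the updates: each posterior becomes the next prior."""
    posterior = list(prior)
    for likelihood in likelihoods:
        posterior = bayes_update(posterior, likelihood)
    return posterior


def gaussian_likelihoods(log_bcf, params):
    """Density of a predicted log BCF under one Gaussian per category.

    params holds one (mean, std) pair per category; the fitted values are
    in the paper's Table S2 and are not reproduced here.
    """
    return [NormalDist(mean, std).pdf(log_bcf) for mean, std in params]


# Discrete case: CAESAR predicts B, so the B column of Table 1 is used.
discrete = bayes_update(UNIFORM, [7 / 449, 7 / 34, 8 / 39])

# Continuous case: densities at log BCF = 3.49 as marked in Fig. 2
# (assumed order nB, B, vB; the figure values are rounded).
continuous = bayes_update(UNIFORM, [0.04, 0.81, 0.37])
```

Rounded to two decimals, `discrete` matches the posteriors quoted for the first CAESAR iteration, and `continuous` lies within rounding error of the values reported for the continuous model. Passing the likelihood vectors of all applicable models to `consensus` would carry out the full iterative procedure.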
That is, the mean and standard deviation of log BCF values predicted by every in silico model have been separately calculated for the 449 chemicals experimentally known to be nB, for the 34 compounds known to be B, and for the 39 chemicals known to be vB. The parameters of the Gaussian probability density functions estimated for all the QSAR models are included in Table S2 of supplementary materials. The three discrete likelihood distributions (histograms) inside a dashed box in Fig. 1 are replaced in the continuous consensus methodology by the three continuous probability density functions corresponding to the CAESAR model (see Fig. 2). In the case of the example chemical, tris(4-chlorophenyl)methanol, the likelihood values taken from the probability density functions were those corresponding to a log BCF value equal to 3.49, which is the numerical prediction given by the CAESAR model for this chemical. At the end of the first step of the continuous Bayesian process, the posterior probabilities were obtained with the continuous-variable version of Bayes' theorem,

$$P(H = h \mid E = e) = \frac{f(E = e \mid H = h)\, P(H = h)}{\sum_{j=1}^{3} f(E = e \mid H = h_j)\, P(H = h_j)} \qquad (2)$$

where, again, the hypothesis space is defined by h1 = nB, h2 = B and h3 = vB. The resulting posterior probabilities were P(H = nB) = 0.03, P(H = B) = 0.67 and P(H = vB) = 0.30. It should be noticed that Eqs. (1) and (2) differ only in the substitution of discrete likelihoods by
continuous probability distributions. The iterative continuous consensus process, which started with the estimation of tris(4-chlorophenyl)methanol as a B chemical after the first iteration (CAESAR), ends with a vB assessment for this chemical when the evidence of the fifth model (CHEMPROP) is accounted for. The above examples are not intended for the assessment of the consensus approach but to illustrate how it works for both the discrete and the continuous versions. The quality of the consensus approach based on discrete and continuous Bayesian formulations is assessed and discussed in Section 3, together with an example of how consensus of existing QSAR models can be applied to new chemicals.

2.4. Sensitivity and cost of misclassifications

Quality in classification problems with two target categories is usually assessed in terms of sensitivity and specificity, defined as the proportion of correctly classified elements out of those known to belong to the “positive” and “negative” classes, respectively. In this study a generalization of the sensitivity/specificity concepts as well as a cost-based quality measure was adopted to analyze classification performances with respect to the three target categories (nB, B and vB) considered. The sensitivity for a given category is defined as the fraction of compounds correctly classified out of those belonging to the category (Schüürmann et al., 2003). The average of class sensitivities is known as the balanced accuracy of a classifier (Sokolova et al., 2006; Velez et al., 2007), and it is an especially well suited measure for unbalanced datasets, i.e., when categories are not uniformly populated. The proportion of misclassifications does not suffice to evaluate performance in the current three category problem for bioaccumulation because one has to consider two distinct types of misclassification for each experimental category.
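As a worked illustration of these definitions, the sketch below computes the class sensitivities, the balanced accuracy and the plain accuracy from the CAESAR counts in Table 1. The function names are hypothetical; only the counts come from the paper.

```python
# CAESAR contingency table (Table 1): rows are experimental categories,
# columns are predicted categories.
CAESAR = {
    "nB": {"nB": 439, "B": 7, "vB": 3},
    "B": {"nB": 22, "B": 7, "vB": 5},
    "vB": {"nB": 3, "B": 8, "vB": 28},
}


def sensitivities(table):
    """Fraction of correctly classified compounds in each experimental class."""
    return {c: row[c] / sum(row.values()) for c, row in table.items()}


def balanced_accuracy(table):
    """Unweighted mean of the class sensitivities."""
    sens = sensitivities(table)
    return sum(sens.values()) / len(sens)


def accuracy(table):
    """Overall fraction of correct classifications, dominated by large classes."""
    correct = sum(row[c] for c, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return correct / total


sens = sensitivities(CAESAR)
```

For these counts the plain accuracy is about 0.91 while the balanced accuracy is about 0.63, which shows how the large nB class can mask poor performance on the B category.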
For example, from the viewpoint of risk assessment, classifying a chemical known to be B as nB is much worse than classifying it as vB. Thus, misclassifications should be weighted according to their relevance in a regulatory context. Here we used a cost matrix (Domingos, 1999; Turney, 1995) that assisted in evaluating the consensus results obtained by the integration of the different in silico models. Each element cij in the cost matrix C specifies the cost of misclassifying a compound into class j when it actually belongs to class i. Table 2 weights misclassifications according to two distinct criteria. Firstly, since bioaccumulation categories are properly sorted in contingency tables such as Table 1, misclassifications are more severe the farther they are from the correct category cell. Secondly, any mistake above the diagonal of the contingency table (cost values 1 and 2 in Table 2) is less severe than mistakes below the diagonal (cost values 3 and 4), because below-diagonal counts correspond to false negatives, i.e. to chemicals that have been classified into a class less bioaccumulative than the corresponding experimental class. Diagonal elements in Table 2 are zero (i.e., correct classifications have no cost). The costs of misclassification in Table 2, together with the relative frequencies of bioaccumulation categories obtained for each in silico model, constitute the basis for calculating an expected cost of misclassifications for the three bioaccumulation categories considered. For instance, given the relative frequencies for the nB experimental category in Table 1, the corresponding expected cost of misclassifications is calculated as: 0 × 0.98 + 1 × 0.02 + 2 × 0.01 = 0.04. The same weighted sum of errors is calculated for the B and vB experimental
[Figure 2 plot not reproduced. Three panels (nB, B, vB), each marked at log BCF = 3.49; marked density values: 0.81, 0.37 and 0.04.]

Fig. 2. Gaussian likelihood distributions estimated from the corresponding log BCF values predicted by the CAESAR model for each experimental category: nB (left), B (middle) and vB (right). Probability values corresponding to a log BCF value equal to 3.49 in the likelihood distributions are used because this is the numerical prediction obtained from the CAESAR model for tris(4-chlorophenyl)methanol.
Table 2
Costs assigned to misclassifications according to both the distance to the experimental category and the type of error in terms of false positives (above-diagonal elements) and false negatives (below-diagonal elements).

                 In silico
Experimental     nB    B     vB
nB               0     1     2
B                3     0     1
vB               4     3     0
categories. Finally, the average of the three expectation values is taken as the overall cost of the corresponding classifier.

3. Results and discussion

3.1. Assessment of models

The two consensus approaches presented in Section 2, the discrete Bayesian model, using discrete likelihoods, and the continuous Bayesian model, using continuous probability distributions, were assessed and compared to the CAESAR, TEST, BCFBAF/M, BCFBAF/A and CHEMPROP individual QSAR models. Contingency tables with the results obtained from all the classifiers are provided in Table S2 of supplementary materials.

Fig. 3 compares the sensitivities calculated for the five individual QSAR models with those obtained using the two consensus approaches. In terms of sensitivity, individual models show a high performance for nB compounds whereas the quality of the results decreases when it comes to the B and vB classes. This can be attributed to the influence of the class imbalance in model development. The integration of models with the Bayesian consensus approach yields more uniform proportions of correct classifications across categories, especially for the continuous Bayesian model, which results in a higher average sensitivity (i.e., balanced accuracy). In particular, the highest balanced accuracy values obtained are 0.79, 0.73 and 0.71, for continuous Bayes, discrete Bayes and BCFBAF/A models, respectively. Average sensitivities for the other four models are lower, in the range [0.49, 0.63].

To understand the need for an additional cost analysis in terms of misclassifications, a closer look at the sensitivity values in Fig. 3 for
B chemicals obtained by CAESAR, CHEMPROP and BCFBAF/M reveals that the first two models correctly classify the same number of compounds (7 out of 34) and, therefore, they have the same sensitivity for the B category (0.21). The sensitivity of BCFBAF/M is slightly lower (0.18) because the number of correct classifications drops to 6 out of 34. However, important differences are observed when the type of misclassifications is analyzed and the costs in Table 2 are applied. First, although CAESAR and CHEMPROP yield the same number of errors (27 out of 34), the errors of CAESAR are of less concern in terms of regulatory decision making because it only misclassifies 22 compounds as nB compared to the 26 misclassified by CHEMPROP. Similarly, BCFBAF/M is also preferred over CHEMPROP for decision making since it only produces one more error and its misclassifications have less impact, with 22 B chemicals misclassified as nB. Cost analysis is better than sensitivity analysis for regulatory decision making because it provides more insight into the type of errors committed by a classifier. While sensitivity analysis is purely quantitative, cost analysis also takes into consideration qualitative issues such as the real impact of misclassifications.

Cost analysis of misclassifications in Fig. 4 indicates that the classification of bioaccumulation categories is, on average, enhanced by consensus methods, with the continuous Bayesian approach performing best, at the lowest average cost of misclassification (0.42). A detailed analysis of the results reveals that while individual models are accurate mainly in predicting nB compounds (i.e., the majority class), the integration of individual models via Bayesian consensus also performs reasonably well for B and vB categories. As a matter of fact, the bioaccumulation category corresponding to B compounds is the most difficult to predict because it is associated with the narrowest range of log BCF values.
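The comparison above can be checked numerically with the costs of Table 2. The sketch below is an illustration with hypothetical function names. The CAESAR row is taken from Table 1; the CHEMPROP and BCFBAF/M rows are derived here from the counts quoted in the text (correct classifications and nB misclassifications out of 34), with the remainder assumed to be vB predictions. Costs are computed from counts, so they may differ slightly from values obtained with rounded relative frequencies.

```python
# Misclassification costs from Table 2: COST[experimental][predicted].
COST = {
    "nB": {"nB": 0, "B": 1, "vB": 2},
    "B": {"nB": 3, "B": 0, "vB": 1},
    "vB": {"nB": 4, "B": 3, "vB": 0},
}


def expected_cost(row_counts, true_class):
    """Cost-weighted average over the predictions for one experimental class."""
    total = sum(row_counts.values())
    weighted = sum(COST[true_class][pred] * n for pred, n in row_counts.items())
    return weighted / total


# Predictions for the 34 experimentally B chemicals.
B_ROWS = {
    "CAESAR": {"nB": 22, "B": 7, "vB": 5},    # Table 1
    "CHEMPROP": {"nB": 26, "B": 7, "vB": 1},  # derived: 34 - 26 - 7 = 1
    "BCFBAF/M": {"nB": 22, "B": 6, "vB": 6},  # derived: 34 - 22 - 6 = 6
}

costs = {model: expected_cost(row, "B") for model, row in B_ROWS.items()}
```

Under these assumptions CAESAR and CHEMPROP share the same sensitivity for B chemicals but CAESAR has the lower expected cost, and BCFBAF/M also costs less than CHEMPROP despite one additional error, in line with the ranking discussed above.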
This difficulty is clearly reflected in the costs of misclassification obtained by individual QSAR models, which range from 1.71 to 2.47, and in the cost of 1.35 obtained with the discrete Bayesian consensus approach. In contrast, the cost of misclassification of the continuous Bayesian approach for B compounds is only 0.24, which is a remarkably low value.

3.2. Protective consensus for regulatory decision making

Another fundamental advantage of the Bayesian consensus model is that it is possible to associate a probability value to the overall
[Figure 3 plot not reproduced. Vertical axis: 0.0 to 1.0. Horizontal axis: CAESAR, TEST, BCFBAF/M, BCFBAF/A, CHEMPROP, Discrete Bayes, Cont. Bayes. Series: nB, B, vB, average.]

Fig. 3. Sensitivity for the five individual models and the two consensus models (discrete and continuous Bayes), according to three bioaccumulation categories (nB, B and vB). The experimental category of each chemical is compared to that predicted by in silico models.
[Figure 4 plot not reproduced. Vertical axis: 0.0 to 2.5. Horizontal axis: CAESAR, TEST, BCFBAF/M, BCFBAF/A, CHEMPROP, Discrete Bayes, Cont. Bayes. Series: nB, B, vB, average.]

Fig. 4. Cost of misclassifications for the five individual models and the two consensus models (discrete and continuous Bayes), according to three bioaccumulation categories (nB, B and vB). The experimental category of each chemical is compared to that predicted by in silico models, and misclassifications are penalized by the costs in Table 2.
prediction of the category with maximum posterior probability since the different predictions are weighted in a probabilistic way. This overall probability can be used to accept or reject a predicted category depending on predefined probability thresholds, which can be adjusted for each category according to regulatory decision-making criteria. A few examples for probability thresholds equal to 0.99, 0.95 and 0.90 for accepting nB, B and vB classifications, respectively, are reported in Table 3. Using this approach it is possible to detect situations where the uncertainty of the classification is too high and therefore a category cannot be assigned to the affected chemical with enough confidence. A total of 66 out of the 522 (12.6%) chemicals considered in this study did not pass any of the above probability thresholds (see Table S3 in supplementary materials). The analysis in terms of sensitivity and cost performed on the remaining 456 chemicals indicates that both quality measures improved as a result of the reduction in the uncertainty of the classification. In particular, an average sensitivity equal to 0.87 and an average cost equal to 0.24 are attained. Both performance indicators are significantly better than the corresponding best results obtained so far with the continuous Bayesian consensus model (0.79 and 0.42, respectively). Therefore, using this protective approach it is possible to increase the reliability and to deal with safety margins in the consensus procedure for bioaccumulation assessment.
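The protective rule can be sketched as follows, using the thresholds quoted above and the posterior probabilities listed in Table 3. The function name is hypothetical, and the use of "greater than or equal to" for passing a threshold is an assumption, since the text does not state whether the comparison is strict.

```python
# Acceptance thresholds quoted in the text for each predicted category.
THRESHOLDS = {"nB": 0.99, "B": 0.95, "vB": 0.90}


def protective_consensus(posterior):
    """Accept the most probable category only if it passes its threshold."""
    best = max(posterior, key=posterior.get)
    if posterior[best] >= THRESHOLDS[best]:  # assumed non-strict comparison
        return best
    return "Not reached"


# Posterior probabilities from Table 3.
EXAMPLES = {
    "Naphthalene": {"nB": 1.00, "B": 0.00, "vB": 0.00},
    "t-Decalin": {"nB": 0.02, "B": 0.98, "vB": 0.00},
    "Pentachlorobenzene": {"nB": 0.00, "B": 0.54, "vB": 0.46},
    "2,6-Dicyclohexylphenol": {"nB": 0.18, "B": 0.82, "vB": 0.00},
    "2,2'-Dichlorobiphenyl": {"nB": 0.00, "B": 0.28, "vB": 0.72},
    "2,5-Dichlorobiphenyl": {"nB": 0.00, "B": 0.08, "vB": 0.92},
}

results = {name: protective_consensus(p) for name, p in EXAMPLES.items()}
```

Applied to these six chemicals, the rule returns the consensus column of Table 3: a category is assigned to three of them and no category is reached for the other three.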
3.3. Consensus classification of new chemicals
Triphenyltin chloride (CASRN 639-58-7), a chemical not present in the screened dataset of 522 compounds used to validate the consensus approach with the five individual models, has been selected to describe how the discrete and continuous Bayesian consensus can be applied to new chemicals. Since a new chemical may fall outside the applicability domain of some models, it is first necessary to determine which QSAR models can be applied in either of the two consensus approaches. For the selected chemical, this check shows that it belongs to the applicability domain of the TEST, BCFBAF/M and BCFBAF/A models, while it is outside the applicability domain of both the CAESAR and CHEMPROP models. Thus, the consensus approaches described in subsections 2.2 and 2.3 for five models are applied here to only three models to estimate the bioaccumulation class of triphenyltin chloride. When the discrete Bayesian consensus procedure illustrated in detail in Fig. 1 is applied, the TEST model predicts that triphenyltin chloride is an nB compound (estimated log BCF = 2.4) and the posterior probabilities P(H = nB) = 0.50, P(H = B) = 0.41 and P(H = vB) = 0.09 are obtained by applying Bayes' theorem, i.e., Eq. (1). According to these posterior probabilities obtained after the first iteration, the most probable hypothesis is that triphenyltin chloride is an nB
Table 3. Examples from the protective continuous Bayesian model.

| Name | CASRN | Experimental | P(nB) (a) | P(B) (a) | P(vB) (a) | Consensus (b) |
|---|---|---|---|---|---|---|
| Naphthalene | 91-20-3 | nB | 1.00 | 0.00 | 0.00 | nB |
| t-Decalin | 493-02-7 | B | 0.02 | 0.98 | 0.00 | B |
| Pentachlorobenzene | 608-93-5 | vB | 0.00 | 0.54 | 0.46 | Not reached |
| 2,6-Dicyclohexylphenol | 4821-19-6 | nB | 0.18 | 0.82 | 0.00 | Not reached |
| 2,2′-Dichlorobiphenyl | 13029-08-8 | B | 0.00 | 0.28 | 0.72 | Not reached |
| 2,5-Dichlorobiphenyl | 34883-39-1 | vB | 0.00 | 0.08 | 0.92 | vB |

(a) Posterior probabilities obtained with the combination of CAESAR, TEST, BCFBAF/M, BCFBAF/A and CHEMPROP individual predictions.
(b) Protective consensus classification for each chemical, according to three different probability thresholds equal to 0.99, 0.95 and 0.90 corresponding to nB, B and vB categories, respectively.
substance. When the process is repeated with the TEST posterior probabilities becoming the new prior probabilities and the BCFBAF/M model is considered (estimated log BCF = 3.83), posterior probabilities P(H = nB) = 0.10, P(H = B) = 0.46 and P(H = vB) = 0.44 are obtained, and triphenyltin chloride is classified as a B substance. The evidence from the BCFBAF/A model (estimated log BCF = 2.33) in the third iterative step reinforces the hypothesis that triphenyltin chloride is a B compound, with posterior probabilities P(H = nB) = 0.26, P(H = B) = 0.65 and P(H = vB) = 0.09. However, the experimental log BCF value is equal to 3.14 and, thus, triphenyltin chloride belongs to the nB category, which is adjacent to the one estimated by the discrete Bayesian consensus approach. This exemplifies the finding, depicted in Figs. 3 and 4, that the Bayesian consensus framework built around discrete probabilities is not the optimal method for data integration when it is applied to continuous variables.
Fig. 5 depicts one complete step of the continuous Bayesian consensus process for the same new chemical and the TEST model. The likelihood values for the continuous consensus approach are obtained from the probability density functions, taking into account that the log BCF prediction given by the TEST model for triphenyltin chloride is equal to 2.4. The posterior probabilities obtained from Eq. (2) are P(H = nB) = 0.36, P(H = B) = 0.58 and P(H = vB) = 0.06 (see Fig. 5). According to these probabilities, the most probable hypothesis is that triphenyltin chloride is a B chemical. This result differs from the one reached at the end of the first step of the discrete consensus process.
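A single step of the continuous update, and its chaining over several models, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, a uniform (uninformative) prior is assumed, and the density values 0.25, 0.40 and 0.04 are used as illustrative class-conditional likelihoods for nB, B and vB. With a uniform prior these inputs normalise to 0.36, 0.58 and 0.06 after rounding, the same values as the first-step posteriors reported above. The Gaussian parameters fitted in the paper are not given here, so `gaussian_pdf` is shown only as the building block that would supply such density values.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, sd: float) -> float:
    """Gaussian probability density at x (the mean and sd would come from
    the fitted class-conditional distributions, which are not given here)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_update(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """One Bayes step: posterior_i is proportional to likelihood_i * prior_i."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    if total <= 0.0:
        raise ValueError("likelihoods and priors give zero total evidence")
    return [j / total for j in joint]


def sequential_consensus(prior: Sequence[float],
                         likelihoods: Sequence[Sequence[float]]) -> List[float]:
    """Chain the update over several models: each posterior becomes the
    prior of the next step."""
    posterior = list(prior)
    for likelihood in likelihoods:
        posterior = bayes_update(posterior, likelihood)
    return posterior


# Order of hypotheses: nB, B, vB. Uniform prior is an assumption of this sketch.
uniform_prior = [1.0 / 3.0] * 3
step1 = bayes_update(uniform_prior, [0.25, 0.40, 0.04])
print([round(p, 2) for p in step1])  # [0.36, 0.58, 0.06]
```

Because each step only multiplies by the new likelihoods and renormalises, a model outside its applicability domain can simply be left out of the list passed to `sequential_consensus`.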
In the second iteration of the continuous consensus process we take into account that BCFBAF/M predicts a log BCF value equal to 3.83 for triphenyltin chloride and, using the likelihood values 0.25, 0.40 and 0.04 from the corresponding Gaussian probability density functions adjusted for the BCFBAF/M model, the posterior probabilities
P(H = nB) = 0.06, P(H = B) = 0.78 and P(H = vB) = 0.16 are obtained and triphenyltin chloride is labelled as a B chemical. Finally, the posterior probabilities P(H = nB) = 0.55, P(H = B) = 0.45 and P(H = vB) = 0.00 are obtained by integrating the evidence from the BCFBAF/A model (estimated log BCF = 2.33) in the third step of the iterative process. Hence, the addition of a third source of evidence has changed the odds and the most probable hypothesis now is that triphenyltin chloride is an nB substance. This is in agreement with the experimental log BCF value equal to 3.14 for this chemical and its classification into the nB category. The above bioaccumulation assessment carried out for the new chemical triphenyltin chloride illustrates the superior performance of the continuous Bayesian consensus model, which is already shown in Fig. 3 in terms of sensitivity for the three categories (nB, B and vB) and the average one.

Fig. 5. Description of one step of the continuous Bayesian model for the chemical triphenyltin chloride and its BCF numerical prediction by the TEST model. The three likelihood distributions are Gaussian probability density functions estimated from the corresponding log BCF values predicted by the TEST model for each experimental category. Probability values corresponding to a log BCF value equal to 2.4 in the likelihood distributions are used because this is the numerical prediction obtained from the TEST model for triphenyltin chloride. The categorical prediction after the first and second iterations is B, which changes to nB after the final third iteration. Iterations are identified by encircled numbers.

4. Conclusions
The integration of different QSAR models for BCF provides a suitable framework for informed decision making within the workflow of a bioaccumulation ITS. Using a consensus approach, one can improve the reliability of the predictions obtained using individual models, especially when there is enough diversity among them. The current study demonstrates that the use of a discrete Bayesian combination rule enhances the reliability of individual BCF predictions when three bioaccumulation categories (nB, B and vB) are considered. Similarly, a consensus approach based on a continuous Bayesian formulation outperforms all individual BCF models as well as the consensus scheme based on a discrete Bayes' formulation, even when using simple uninformative priors. The enhanced performance of the continuous consensus approach can be attributed to the fine-grained adjustment of continuous probability distributions for the integration of evidence from different individual QSAR models. For this reason, and in accordance with the previously discussed results, it can be concluded that the integration of different in silico models using a consensus approach based on a continuous Bayesian formulation constitutes an excellent methodology to improve the classification of continuous environmental endpoints.
The current study could be further extended by using more sophisticated probability models, such as priors generated for each chemical according to a given local sample (e.g., the class distribution for a subset of the most similar compounds), which convey a more realistic data description. The use of locally tuned priors could be especially relevant for datasets with unbalanced class distributions, as is the case for the bioaccumulation endpoint. Another avenue for future research is the incorporation of recent studies on terrestrial and human bioaccumulation (McLachlan et al., 2011), which are now being discussed under REACH, US and Canadian legislation as a way to improve the bioaccumulation assessment of chemicals. These studies show that terrestrial and human bioaccumulation is in some cases different from aquatic bioaccumulation.
Supplementary data related to this article can be found online at doi:10.1016/j.envint.2012.03.004.

Conflict of interest
The authors do not have competing financial interests to declare.

Acknowledgments
This research was financially supported by the European Union (OSIRIS Project, European Commission, FP6 Contract No. 037017), the Spanish Ministry of Science and Innovation (MICINN, CTM201124303) and the Generalitat de Catalunya (2009SGR-1529).

References
Arnot JA, Gobas FAPC. A generic QSAR for assessing the bioaccumulation potential of organic chemicals in aquatic food webs. QSAR Comb Sci 2003;22(3):337–45.
Arnot JA, Gobas FAPC. A review of bioconcentration factor (BCF) and bioaccumulation factor (BAF) assessments for organic chemicals in aquatic organisms. Environ Rev 2006;14(4):257–97.
Arnot JA, Mackay D, Parkerton TF, Bonnell M. A database of fish biotransformation rates for organic chemicals. Environ Toxicol Chem 2008;27(11):2263–70.
Ashraf S, Brabyn L, Hicks BJ. Image data fusion for the remote sensing of freshwater environments. Appl Geogr 2012;32(2):619–28.
BCFBAF (EPISUITE). http://www.epa.gov/opptintr/exposure/pubs/episuite.htm. 2011.
Benedetti M, Ciaprini F, Piva F, Onorati F, Fattorini D, Notti A, et al. A multidisciplinary weight of evidence approach for classifying polluted sediments: integrating sediment chemistry, bioavailability, biomarkers responses and bioassays. Environ Int 2012;38(1):17–28.
Billoir E, Delignette-Muller ML, Péry ARR, Charles S. A Bayesian approach to analyzing ecotoxicological data. Environ Sci Technol 2008;42(23):8978–84.
CAESAR. http://www.caesar-project.eu. 2011.
CHEMPROP. http://www.ufz.de/index.php?en=10684. 2011.
Dimitrov S, Breton R, MacDonald D, Walker JD, Mekenyan O. Quantitative prediction of biodegradability, metabolite distribution and toxicity of stable metabolites. SAR QSAR Environ Res 2002;13(3–4):445–55.
Dimitrov S, Dimitrova N, Parkerton T, Comber M, Bonnell M, Mekenyan O. Base-line model for identifying the bioaccumulation potential of chemicals. SAR QSAR Environ Res 2005;16(6):531–54.
Domingos P. MetaCost: A General Method for Making Classifiers Cost-sensitive. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM Press; 1999. p. 155–64.
Duda RO, Hart PE, Konolige K, Reboh R. A Computer-based Consultant for Mineral Exploration. Final Report SRI Project, 6415. Menlo Park, CA: SRI International; 1979.
EURAS (CEFIC LRI). http://www.cefic-lri.org. 2011.
Fernández A, Rallo R, Giralt F. Uncertainty reduction in environmental data with conflicting information. Environ Sci Technol 2009;43(13):5001–6.
Fu W, Franco A, Trapp S. Methods for estimating the bioconcentration factor of ionizable organic chemicals. Environ Toxicol Chem 2009;28(7):1372–9.
Howard PH, Muir DCG. Identifying new persistent and bioaccumulative organics among chemicals in commerce. Environ Sci Technol 2010;44(7):2277–85.
Jaworska J, Hoffmann S. Integrated testing strategy (ITS) – opportunities to better use existing data and guide future testing in toxicology. ALTEX 2010;27(4):231–42.
Lombardo A, Roncaglioni A, Boriani E, Milan C, Benfenati E. Assessment and validation of the CAESAR predictive model for bioconcentration factor (BCF) in fish. Chem Cent J 2010;4(Suppl. 1):S1.
McLachlan MS, Czub G, MacLeod M, Arnot JA. Bioaccumulation of organic contaminants in humans: a multimedia perspective and the importance of biotransformation. Environ Sci Technol 2011;45(1):197–202.
Meylan WM, Howard PH, Boethling RS, Aronson D, Printup H, Gouchie S. Improved method for estimating bioconcentration/bioaccumulation factor from octanol/water partition coefficient. Environ Toxicol Chem 1999;18(4):664–72.
Morales-Caselles C, Riba I, Sarasquete C, Ángel DelValls T. Using a classical weight-of-evidence approach for 4-years' monitoring of the impact of an accidental oil spill on sediment quality. Environ Int 2008;34(4):514–23.
Moriguchi I, Hirono S, Liu Q, Nakagome I, Matsushita Y. Simple method of calculating octanol/water partition coefficient. Chem Pharm Bull 1992;40(1):127–30.
Moriguchi I, Hirono S, Nakagome I, Hirano H. Binding of carprofen to human and bovine serum albumins. Chem Pharm Bull 1994;42(4):937–40.
Passuello A, Cadiach O, Perez Y, Schuhmacher M. A spatial multicriteria decision making tool to define the best agricultural areas for sewage sludge amendment. Environ Int 2012;38(1):1–9.
Rallo R, Espinosa G, Giralt F. Using an ensemble of neural based QSARs for the prediction of toxicological properties of chemical contaminants. Process Saf Environ Prot 2005;83(4):387–92.
REACH. REGULATION (EC) No 1907/2006 OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, amending Directive 1999/45/EC and repealing Council Regulation (EEC) No 793/93 and Commission Regulation (EC) No 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC.
REACH Annex XIII. COMMISSION REGULATION (EU) No 253/2011 of 15 March 2011 amending Regulation (EC) No 1907/2006 of the European Parliament and of the Council on the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) as regards Annex XIII.
Schüürmann G, Aptula AO, Kühne R, Ebert RU. Stepwise discrimination between four modes of toxic action of phenols in the Tetrahymena pyriformis assay. Chem Res Toxicol 2003;16(8):974–87.
Sokolova M, Japkowicz N, Szpakowicz S. Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In: Sattar A, Kang BH, editors. Advances in Artificial Intelligence (AI 2006). Berlin / Heidelberg: Springer; 2006. p. 1015–21.
TEST. http://www.epa.gov/nrmrl/std/cppb/qsar. 2011.
Toropova AP, Toropov AA, Lombardo A, Roncaglioni A, Benfenati E, Gini G. A new bioconcentration factor model based on SMILES and indices of presence of atoms. Eur J Med Chem 2010;45(9):4399–402.
Turney PD. Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm. J Artif Intell Res 1995;2:369–409.
Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, et al. A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol 2007;31(4):306–15.
Zabeo A, Pizzol L, Agostini P, Critto A, Giove S, Marcomini A. Regional risk assessment for contaminated sites part 1: vulnerability assessment by multicriteria decision analysis. Environ Int 2011;37(8):1295–306.
Zhao C, Boriani E, Chana A, Roncaglioni A, Benfenati E. A new hybrid system of QSAR models for predicting bioconcentration factors (BCF). Chemosphere 2008;73:1701–7.