Ecotoxicology and Environmental Safety 145 (2017) 193–199
Contents lists available at ScienceDirect
Ecotoxicology and Environmental Safety journal homepage: www.elsevier.com/locate/ecoenv
Species sensitivity distribution for pentachlorophenol to aquatic organisms based on interval ecotoxicological data
MARK
⁎
Jinsong Zhaoa, , Run Zhanga,b a b
College of Resources and Environment, Huazhong Agricultural University, Wuhan 430070, China Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
A R T I C L E I N F O
A B S T R A C T
Keywords: Species sensitivity distribution Interval ecotoxicological data Bayesian statistics Pentachlorophenol Aquatic organisms
Species sensitivity distribution (SSD) model is often used to extrapolate the chemicals’ effects from the ecotoxicological data on individual species to ecosystems, and is widely applied to derive water quality criteria or to assess ecological risk. Because of the influence of various factors, the ecotoxicological data of a specific chemicals to an individual usually exist in a range. The feasibility of interval ecotoxicological data directly applied to build SSD model has not been clearly stated. In the present study, by means of Bayesian statistics, the half maximal effective concentration (EC50) of pentachlorophenol (PCP) to 161 aquatic organisms, which were organized into 7 groups, i.e., single determined value, geometric mean estimation, median estimation, interval data, and combination of single determined data with other groups, were used to develop SSD models and to estimate the minimum sample sizes. The results showed that the interval data could be directly applied to build SSD model, and when combined with single point data could give the narrowest credible interval that indicates a stable and robust SSD model. Meanwhile, the results also implied that at least 6–14 ecotoxicological data were required to build a stable SSD model. It suggests that the utilization of interval data in building SSD model can effectively enhance the availability of ecotoxicological data, reduce the uncertainty brought by sample size or point estimation, and provide a reliable way to widen the application of SSD model.
1. Introduction Species sensitivity distribution (SSD) model is often used to extrapolate the ecotoxicological effects of a specific chemicals from individuals to ecological community or higher level. It is widely applied to derive ecological or environmental criteria, or to assess the ecological risk of chemicals or other agents. It assumes that the sensitivity or response of different species in an ecosystem with complex structure to certain stress, such as toxic chemicals, obeys a certain probability distribution (Forbes and Calow, 2002; Posthuma et al., 2002; Vighi et al., 2006), e.g., normal distribution (Aldenberg and Jaworska, 2000). Thus, a SSD model describes the sensitivity of different species to a stress by an empirical probability distribution. The basis for developing a SSD model is sufficient, representative, accurate and reliable ecotoxicological data from a specific ecosystem (Dowse et al., 2013). However, due to realistic conditions, the half maximal effective concentration (EC50) or no observed effect concentration (NOEC), the most common ecotoxicological endpoints used to build a SSD model (Dowse et al., 2013; Duboudin et al., 2004; Hickey et al., 2012; Wheeler et al., 2002), are mainly obtained from certain
⁎
Corresponding author. E-mail address:
[email protected] (J. Zhao).
http://dx.doi.org/10.1016/j.ecoenv.2017.07.029 Received 4 July 2016; Received in revised form 6 July 2017; Accepted 11 July 2017 0147-6513/ © 2017 Elsevier Inc. All rights reserved.
database, such as ECOTOX from U.S. Environmental Protection Agency, which collects lots of experimental ecotoxicological data of various chemicals from published literatures or reports (U.S. U.S. Environmental Protection Agency, 2017). Although most of them were acquired according to standard protocols, due to many factors affecting biological response, the ecotoxicological data for a certain chemicals to a specific species are quite different in such database. In ECOTOX, for example, the EC50 for pentachlorophenol to Daphnia magna is in the range of 0.143 μmol L−1 to 14.680 μmol L−1, with a difference of more than 100 folds. The general statistical method used to derive SSD model, such as generalized linear model, cannot deal with the above mentioned interval data (Lind, 2010). Thus, for such data, two options can be chosen. Firstly, the species having interval data are discarded, which will definitely increase the cost to obtain the ecotoxicological data, and reduce the biodiversity involved in SSD model, and thereby affect the application scope of SSD model. Secondly, a point estimation derived from the interval are used, such as (geometric) mean and median (European Commission, 1996; Raimondo et al., 2008; RIVM, 2001). However, the representativeness and effectiveness of a point estimation
Ecotoxicology and Environmental Safety 145 (2017) 193–199
J. Zhao, R. Zhang
parameters of normal distribution; erf is the error function. Because of the interval data, the parameters of normal distribution would be estimated by means of Bayesian statistics.
is less than the interval data (Lind, 2010). Therefore, to establish a SSD model based on the interval data has a clear practical significance. Since the information about empirical judgement on parameters and other interesting aspects can be incorporated into the data analysis process, Bayesian statistics, has been widely applied to SSD model development (Aldenberg and Rorije, 2013; Dowse et al., 2013; Grist et al., 2006; Hayashi and Kashiwagi, 2010a, b; Hickey et al., 2012). However, it is rarely used to build a SSD model based on interval ecotoxicological endpoints. Hayashi and Kashiwagi (2010a) had used Bayesian statistics to establish SSD model based on different taxonomy in which interval data were included. However, they did not compare the effectiveness of interval data with that of point estimation. At present, there is no any detailed comparative study on SSD models based on interval data retrieved from a database. In addition, there is no report on the minimum sample size for building a SSD model with interval data, though a lot of researches had been performed in the aspect of the minimum sample size requirement. Although it is already banned as pesticides and disinfectants, pentachlorophenol (2,3,4,5,6-Pentachlorophenol, PCP) is still widely used in wood preservation, Oncomelania snails killer, and many others (Crosby, 1981). Because of significant toxicity and low biodegradability, the fate and the ecological effects of PCP have been caused widespread concerns (Yadid et al., 2013; Zheng et al., 2012). Currently, only a handful of countries have developed ecological criteria or water quality standards for PCP. In this study, SSD models based on interval data published in ECOTOX were developed by means of Bayesian statistics, and compared with those based on single determined data or point estimations. The purposes were to confirm the feasibility of developing a SSD model and to determine the minimum sample size required for a stable SSD model based on interval data.
2.2.1. Prior distribution of parameters Since no effective prior information about each parameter had been reported, non-information prior distribution was used as follows (Gelman et al., 2004):
σ ∼ dgamma(0. 001, 0. 001)
(3)
2.2.2. Likelihood function For dataset D1, D2, D3, D5 and D6, the likelihood function was defined as:
Xi ~dnorm(μ, τ )
(4)
For dataset D4, the likelihood function was defined as:
Zi ~dinterval(Ti , Ii )
(5)
Ti ~dnorm(μ, τ )
(6)
For dataset D7, its likelihood function was the combination of formulae (4)–(6). That meant, the formulae (4) was used to deal with the mean part of D7, and formulae (5) and (6) were used to process the interval part of D7. In the above likelihood functions, Zi was an indicator variable that took the value 1 if Xi was an interval data and 0 otherwise; τ = 1/σ2 was the precision of normal distribution; Ii was the interval data of the ith species. dinterval implemented the discretization and set up the observed values (Ti) as stochastic nodes as required for the Gibbs sampler to work. Because its theoretical expression was unknown, the posterior distribution for each parameter is simulated by Markov Chain Monte Carlo (MCMC) with Gibbs sampler (Kruschke, 2011). In the simulation, the number of Markov chain was set to 3; the iterations for each Markov chain was set to 60,000 times, with initial 10,000 discarded to obtain stable estimates. The convergence for each parameter was identified when Gelman-Rubin convergence index fell in the range of 1.0–1.1 (Gelman et al., 2004).
2.1. Data sources The ecotoxicological data, i.e., EC50 of PCP to aquatic organisms in this study, were retrieved from ECOTOX (U.S. Environmental Protection Agency, 2017). All EC50 were obtained in laboratory with exposure type of flow-through, renewal, or static, and concentration type of active ingredient. Generally, each sample for a specific species in ECOTOX database has 3, i.e., minimum, mean and maximum EC50. If the mean EC50 had “ < ” or “ > ” operator, it would be treated as missing value. When there were multiple samples for one species, the interval (i.e., the upper and lower limit) and point estimation (i.e., geometric mean and median) of EC50 were calculated based on the gather of all the three EC50 if available. Finally, the EC50 of PCP to 161 aquatic organisms were obtained (cf. Table S1). In order to verify the objects of this study, the whole dataset was divided into 7 subsets: D1, with only one determined EC50 for one species in ECOTOX; D2, with the geometric mean for multiple EC50 for one species; D3, same as D2 but with median; D4, same as D2 but with interval data; D5, the combination of D1 and D2; D6, the combination of D1 and D3; D7, the combination of D1 and D4. The sample size of D1 was 65, D2 to D4 was 96 and D5 to D7 was 161.
2.2.3. Credible interval The credible interval, which represented the degree of centralization of each parameter or sample, was represented by the 95% interval of highest posterior density (HPD). The width of the HPD credible interval was an alternative way to measure the uncertainty of the parameters of SSD model (Kruschke, 2011). The credible interval of μ and σ was the estimation of HPD interval based on 150,000 pairs of μ and σ in the MCMC sample. The credible interval of HC5 (hazardous concentration at which 5% species in an ecosystem may be affected) was the estimation of HPD interval based on 5% quantile values calculated from 150,000 pairs of μ and σ in the MCMC sample. 2.3. Determination of minimum sample size
2.2. SSD model development based on Bayesian statistics
To determine the minimum sample size that was required to build a stable SSD model, the parameters and HC5 of SSD models based on a series of simulated dataset with different size were estimated. The simulated dataset with a specific sample size were generated from each of the 7 ecotoxicological datasets using the procedure of basic bootstrap (Davison and Hinkley, 1997). For each sample size, totally 5000 simulated samples were generated. SSD models were built based on those simulated samples, and then mean and credible interval of parameters and HC5 were estimated. A change point analysis was used to determine
In this study, the EC50 of PCP to aquatic organisms were assumed to follow normal distribution (Aldenberg and Jaworska, 2000). The cumulative probability function of normal distribution is:
X−μ X − μ ⎞⎤ ⎞ = 1 ⎡1 + erf ⎛ σ ⎠ 2⎢ ⎝ σ 2 ⎠⎥ ⎣ ⎦
(2)
where, dnorm and dgamma are probability density functions for normal and gamma distribution.
2. Materials and methods
F (X ; μ, σ ) = Φ ⎛ ⎝
μ ∼ dnorm(0, 0. 001)
(1)
where, X is logarithmic EC50 of PCP to aquatic organisms; μ and σ are 2 194
Ecotoxicology and Environmental Safety 145 (2017) 193–199
J. Zhao, R. Zhang
D4
D3
D2
D1
a tipping point, which corresponded to the minimum sample size that was the lowest requirement of the ecotoxicological data points for building a stable SSD model (Qian et al., 2003). The tipping point was detected based on the variance and mean changing of the parameters and HC5 of SSD models under a significance level at 0.05. All simulations and statistical analyses were performed using R 3.4.0 (R Core Team, 2017), the language and environment for statistical computing, and add-on packages such as rjags (Plummer, 2016) for invoking JAGS 4.2.0 (Plummer, 2003) to estimate the parameters of SSD models based on Bayesian statistics, and changepoint (Killick and Eckley, 2014) for detecting minimum sample size.
D5
3. Results
D6
3.1. Ecotoxicological data of PCP to aquatic organisms
D7
The ecotoxicological data, i.e., EC50 of PCP to 161 aquatic species were collected through retrieving and collating from ECOTOX (cf. Table S1). The involved aquatic species included 9 algae, mosses and fungi, 5 amphibians, 35 crustaceans, 52 fish, 13 aquatic plants, 9 insects, 5 invertebrates, 15 mollusks and 18 worms. Among which, there were 37 standard test organisms, 16 U.S. threatened and endangered species and 11 exotic / nuisance species. The EC50 of PCP to aquatic organisms was in the range of 0.00075μmol L−1 for Elodea canadensis to 6123.752 μmol L−1 for Chironomus ripaius. Among the total 161 species, there were 96 species that have more than one EC50. The smallest range is 0.300–0.338 μmol L−1 for Cyprinodon bovinus, while the largest range was 0.413–6123.752 μmol L−1 for Chironomus ripaius with a difference of more than 14,000 folds.
−0.2
0.2
μ
0.6
0.6
0.8
σ
1.0
0.025 0.1 0.25 HC5 (μmol L−1)
Fig. 1. The box-whisker plot and credible interval for the parameters and HC5 of SSD model estimated by means of Bayesian statistics. The red dot and green line indicate the mean and credible interval of the parameters or HC5.
D1 with each of D2, D3 and D4. Their parameters, standard deviation (Table 1) had the same pattern as that in D2–D4 datasets (Fig. 1), but with smaller values and narrower credible interval. When taking the taxonomic groups into account, the parameters and their standard deviation for D5 and D7 had the same pattern with the above results, but with a slight higher values (cf. Table S2). The box-and-whisker plot (Fig. 1) intuitively showed the parameter distribution of 7 datasets evaluated by means of Bayesian statistics. The box section showed the upper and lower quartiles of the estimated parameters, while outside of the whisker was generally considered outliers. The parameter μ and σ from D1 had the widest box, and also relatively high ratio of outliers. The range between the left and right whisker for D2–D4 were consistent, and narrower than that for D1. Among the 3 datasets, D2 had the narrowest box. The range of parameter for D5–D7 were narrower with less outliers at both sides of whisker. Because of the existence of interval data, D4 and D7 had a little wider box than the other datasets with the same size. Therefore, with increasing sample size, the width of the range for the distribution and the credible interval of each parameter decreased, and the precision of the estimated parameter increased at the same time (cf. Table 1 and Fig. 1).
3.2. SSD model based on Bayesian statistics Parameters of SSD model were estimated by means of Bayesian statistics for the 7 datasets (Table 1). SSD models based on D1, D2, D3 and D5 passed through Kolmogorov-Smirnov and Anderson-Darling test, while that based on D6 did not pass Kolmogorov-Smirnov test, but Anderson-Darling test (Table 1). That indicates they had reliable goodness-of-fit (Stephens, 1986). The SSD models of D4 and D7 could not be assessed by classical statistical hypothesis testing. The D1 dataset had the smallest sample size among 7 datasets, and only included the single determined EC50 of 65 species. The parameters of SSD model for D1 had the largest standard deviation (Table 1). Meanwhile, the credible interval of each parameter was also widest (Fig. 1). It implies that the quality of SSD model is poorer than the others. The sample size for D2, D3 and D4 were 96, and the datasets were consisted of the mean, median estimation and interval of the EC50 of PCP. The parameters for the 3 datasets were close with maximal estimation in D3. Standard deviation (Table 1) and credible interval of each parameter for D2 and D3 were nearly same. The parameters for D4 had larger standard deviation and wider credible interval (Fig. 1). D5, D6 and D7 were size of 161, and consisted of the combination of
Table 1 Parameters, HC5 and their standard deviation (shown in parentheses), and statistics and its p-value (shown in parentheses) of hypothesis test for SSD models based on 7 datasets by means of Bayesian statistics. Data sets
μ
D1 D2 D3 D4 D5 D6 D7
0.150 0.203 0.248 0.211 0.182 0.208 0.183
σ (0.098) (0.066) (0.066) (0.070) (0.055) (0.055) (0.057)
0.783 0.644 0.647 0.627 0.697 0.701 0.696
(0.070) (0.047) (0.047) (0.052) (0.039) (0.039) (0.041)
HC5 (μmol L−1)
Kolmogorov-Smirnov test
Anderson-Darling test
0.077 0.143 0.157 0.156 0.110 0.116 0.111
0.100 0.129 0.130 – 0.101 0.110 –
0.568 2.099 2.396 – 2.128 2.322 –
(0.026) (0.033) (0.036) (0.039) (0.021) (0.022) (0.022)
Note: - means not applicable.
195
(0.540) (0.073) (0.079) (0.074) (0.039)
(0.678) (0.081) (0.056) (0.078) (0.062)
Ecotoxicology and Environmental Safety 145 (2017) 193–199
D1
Flowers, Trees, Shrubs, Ferns Algae, Moss, Fungi Amphibians Crustaceans Fish Insects/Spiders Invertebrates Standard Test Molluscs + Exotic/Nuisance Worms + Threatened/Endangered Exotic/Nuisance Threatened/Endangered
D2
D5
D3
D6
D4
D7
30
50
70
90
10
30
50
70
Fig. 2. Cumulative probability curves and their credible interval of SSD for PCP to aquatic organisms based on Bayesian statistics. The horizontal line in D4 and D7 indicated the interval of the ecotoxicological data. Only standard test, endangered, exotic and nuisance species were shown. The solid line was the SSD curve. The shadow area was the 95% credible interval based on the highest posterior density.
10
30
50
70
90
10
Potentially affected fraction (%)
90
10
30
50
70
90
J. Zhao, R. Zhang
0.01
0.1
1
10
100
0.01
EC50 (μmol L
−1
0.1
1
10
100
) standard test, endangered, exotic and nuisance species were shown), most standard test organisms appeared at the upper location with PAF > 10%, indicating they were well tolerant to PCP. In D1 dataset, only Oncorhynchus clarkii was in the zone with PAF below 5%. There were no species that appeared in the zone with PAF below 5% in D2–D4 datasets. It should be noted that the exotic and nuisance species were tolerant to PCP. The most sensitivity exotic and nuisance species was Eurytemora attinis which appeared in the zone of PAF > 25%. D4 and D7 were both consisted of interval data. In Fig. 2, the horizontal lines indicated the interval of ecotoxicological data. The SSD curves with credible interval evaluated by means of Bayesian statistics effectively overlap the interval data, which implies the SSD model based on interval data could well describe the species sensitivity. The estimated ecotoxicological data based on the interval, which were used to build SSD model by Bayesian statistics, had a significant linear correlation with the mean (r = 0.963, p < 0.001, n = 56, Fig. 3a) and the median (r = 0.942, p < 0.001, n = 56, Fig. 3b) of observed EC50.
3.3. SSD curves Based on the parameters of each SSD model, the curves that presented each dataset and corresponding SSD model were shown in Fig. 2. The credible interval of each SSD curve (shaded area in Fig. 2) had an obvious difference. For D1 dataset, which had the widest credible interval, at percentage of affected fraction (PAF) of 50%, the difference between the upper and lower bound of credible interval was 1.258 μmol L−1. D5–D7 datasets had the narrowest credible interval. The pattern of credible interval was similar to that of model parameters and closely related to the corresponding sample size. Similar to the pattern of parameters, the credible interval of SSD curve for D4 and D7 was also slightly wider; and at 50% PAF, the difference of upper and lower bound of credible interval was 1.041 μmol L−1 and 0.797 μmol L−1, respectively; while the narrowest difference was 0.753 μmol L−1 for D5 dataset. When taking into account that species sensitivity to PCP could differ greatly among taxonomic groups during SSD model developing process, the difference of upper and lower bound of credible interval at 50% PAF was 1.427 μmol L−1 and 1.691 μmol L−1 respectively for D5 and D7 (cf. Fig. S1.). According to SSD curve of PCP to aquatic organisms (Fig. 2, only the
3.4. HC5 Based on the parameters of SSD model, the hazardous concentration 196
Ecotoxicology and Environmental Safety 145 (2017) 193–199
(b)
EC50 interval r = 0.942 p < 0.001 n = 96
Fig. 3. The correlation between the predicated EC50 based interval data by Bayesian statistics and the point estimation (a) mean and (b) median of corresponding observed EC50.
1
10
100
(a)
EC50 interval r = 0.963 p < 0.001 n = 96
0.1
Predicted EC 50 (μmol L−1)
1000
J. Zhao, R. Zhang
0.01
0.1
1
10
100
Observed EC 50 (μmol L
)
−1
1000
0.01
0.1
1
10
100
Observed EC50 (μmol L
1000
)
−1
analysis, the minimum sample size for D5 and D7 were 25 and 20 respectively (cf. Fig. S2). The minimum sample sizes were different when using different parameters or HC5 as the subject of change-point analysis. The minimum sample size based on μ was slightly larger than that based on σ and HC5 (Fig. 4).
at which 5% species in an ecosystem may be affected or 95% species are protected, i.e., HC5, and its standard deviation and credible interval could be easily calculated for PCP to aquatic ecosystems (Table 1). D1 had the lowest HC5, 0.071 μmol L−1, indicating the tightest threshold for an ecosystem. It implied that species in the ecosystem might be over-protected if an ecological criterion was established based on the HC5. In contrast, D2–D4 had a larger HC5. Like the pattern of μ and σ, D3 had a highest HC5 among the 3 datasets, while D2 had a stricter HC5. The HC5 of D5–D7 are consistent with that of D2–D4 and had a similar pattern, i.e., D6, which included median estimation, had a looser HC5, while D7 including interval data had a tighter HC5. When taking into account taxonomic groups during SSD model developing process, the HC5 for D5 and D7 were 0.144 μmol L−1 and 0.152 μmol L−1 respectively that were higher than the corresponding HC5 in Table 1 (cf. Table S2). The box-whisker plot also showed the variation of the distribution of HC5 and its credible interval among different dataset based on Bayesian statistics (Fig. 1). The credible interval of HC5 based on each dataset (Fig. 1) had the similar pattern with that of SSD curve (Fig. 2). The width between the left and right whisker decreased with increasing sample size. Under the same condition, the variation of HC5 also significantly decreased; and the width of upper and lower quartiles (box) was consistent with the credible interval of HC5.
4. Discussion 4.1. Uncertainty of ecotoxicological data The ecotoxicological data used in SSD contain numerous uncertainties both within and among studies were due to confounding factors, such as varying test conditions, statistical methods, and husbandry and culturing practices (Aldenberg and Rorije, 2013). Therefore, even though using the same protocol for a specific species, the associated ecotoxicological data were still different. For example, the EC50 of PCP to Daphnia magna collected from ECOTOX was in the range of 0.143–14.680 μmol L−1 (cf. Table S1). When building SSD, only using the point estimation of this kind of interval data, e.g., 2.689 μmol L−1 for Daphnia magna, may ignore the contained uncertainties. This did not mean the point estimation did not contain uncertainties, however, the conventional methods for model construction could not incorporate the uncertainties, which were hidden behind the point estimation. On the other hand, the interval data could be thought of a representation of uncertainties, so the model built on the interval data includes those uncertainties, and was to some extend superior to the model based on point estimation. It should be noted that the interval data used in this study did not mean that the EC50 of PCP to aquatic organisms should be limited to the given interval. In fact, the interval data used in this study was the range of EC50 measured, including the minimum and maximum EC50 of the same species collected from ECOTOX database, and it did not refer to real scope of biological responses to PCP. However, this did not affect the validity of the model based on Bayesian statistics.
3.5. Minimum sample size By means of the basic bootstrap and Bayesian statistical techniques, the distribution, mean and credible interval of parameters and HC5 of SSD models built on the simulated datasets with different sample size generated from D1 to D7 were shown in Fig. 4. It clearly showed that with increasing sample size the mean of parameters and HC5 (shown in red solid line) all decreased gradually and trend to be stable and smooth. The credible interval also gradually decreased, and close to the mean. The change-point analysis on the mean of each parameter or HC5 showed that the parameter or HC5 of SSD model was gradually stable when sample size reached 6–14, which implies to obtain a stable and reliable SSD model the minimum sample size should be greater than 6–14. The minimum sample size for building a stable SSD model increased along with the size of parent sample. When the size of parent sample was 65, i.e., D1, a minimum sample size should be greater than 10, while the minimum sample size based on the parent datasets of D2–D4 should be greater than 11. When using D5–D7 as a parent sample, the minimum size should be greater than 14. When taking into account taxonomic groups during SSD model developing process, and using HC5 as the subject of change point
4.2. SSD model based on interval ecotoxicological data The SSD model based on Bayesian statistics with different prior or/ and posterior functions, i.e. hierarchical Bayesian methods, could largely incorporate the uncertainties from various sources. For instance, Clark et al. (2005) demonstrated the ability of the hierarchical Bayesian model to successfully estimate underlying parameters, even with incomplete census data. In this study, the interval ecotoxicological data were used to construct SSD model using Bayesian methods with the implementation of dinterval distribution, which was not really a distribution, but imputed a random value that fell on the side of interval as 197
Ecotoxicology and Environmental Safety 145 (2017) 193–199
J. Zhao, R. Zhang
D2
D3
D4
D5
D6
D7
10
8
11
10
13
14
14
7
7
10
7
11
10
12
8
11
9
6
12
10
12
HC5 (μmol L−1) 0.025 0.1 0.5
0.4
σ
0.8
1.2
−0.5
0.0
μ
0.5
1.0
D1
0
50 0
40
80 0
40
80 0
40
80 0
50
100
150 0
50
100
150 0
50
100
150
Sample size Fig. 4. The parameters and HC5 of SSD model changed with sample size. D1–D7 indicated the parent sample for bootstrap resampling; the mean estimations of parameters and HC5 were shown in red solid line, and their credible interval were shown in blue dashed line. The minimum sample size for each datasets was shown with number and vertical bar.
showed that the use of interval data (e.g., D4 and D7) enhanced the data availability required by the SSD development, and furthermore, greatly improved the application of SSD model. Utilization of interval data to estimate the minimum sample size was feasible. Although there had been various studies using different methods to estimate the minimum sample size, but no study that determines the minimum sample size based on interval data had been reported. In this study, the basic bootstrap simulations based on different datasets consisting of point estimations (D1–D3, D5 and D6), interval data (D4), and the combination of point estimation and interval data (D7), gave similar minimum sample size. It indicated that a stable SSD model could be build when sample size was larger than 6–14 (Fig. 4). This was to some extent consistent with previous findings (Newman et al., 2000; Wheeler et al., 2002). Furthermore, the hybrid way of data utilization (e.g., D7) could provide a stable model with a smaller sample size. It should be noted that this results are obtained based on a limited population. If the population was infinite or very complex in structure, a large number of species were necessary to form a representative sample Therefore, Newman et al. (2000) suggested a range of 15–55 species is needed to build a reliable model. The minimum sample size was closely related to the size of parent sample. The larger of parent sample, the larger of the minimum size that were required by a SSD model. That meant more samples are needed to represent the features of large parent sample or population. When extrapolated the SSD to ecosystem level, it might be necessary to have the ecotoxicological data from a relatively large number of species. However, for the relatively simple aquatic ecosystem, a small size of sample may be enough. When taking into account that species sensitivity to PCP could differ greatly among taxonomic groups during SSD model developing process, the credible interval of SSD curve and HC5 were wider, and minimum sample size were larger (cf. Fig. S1.). It implied that incorporation the taxonomic group into the building process of SSD model was a good practice. Hayashi and Kashiwagi (2010b) had classified the taxonomic
stochastic nodes as required for the sampler to work. Hickey et al. (2012) used a similar Bayesian method to quantify the interest variability in ecotoxicological data with (interval-) censored data. With non-informative prior, the SSD models based on interval data gave similar parameters and HC5 to that based on point estimation. For instance, the parameters and HC5 and the corresponding credible interval of the SSD based on interval data in D4 were similar to that built on mean in D2 and median in D3 (Figs. 1 and 2). The SSD based on the combination of point estimation and interval data in D7 also had the similar properties to that based on mean in D5 or median in D6. Those implied that using only interval data could give consistent SSD with the ones based on the determined point data, but as stated above include more uncertainties embedded by the interval data. In addition, the SSD based on interval data also reduced the uncertainties brought by choosing the method of point estimation, such as the geometric mean, median. The estimation of interval data by Bayesian statistics had a significant correlation with the geometric mean estimation with higher correlation coefficient than that with median estimation (Fig. 3). It indicated that to some extent the estimation of interval data was consistent with the mean estimation. Meanwhile, it also indirectly implied that the geometric mean was a better point estimation when there were multiple points exists for one species.
4.3. Sample size with interval ecotoxicological data The sample size strongly affected the accuracy and/or precision and stability of model parameters (Duboudin et al., 2004; Newman et al., 2000; Wheeler et al., 2002). In this study, with the size of sample increasing from 65 in D1, 96 in D2–D4, to 161 in D5–D7, the credible interval of the parameters and HC5 were gradually narrowed (Fig. 1), indicating that increased sample size could improve the accuracy and precision of the SSD models. Although numerous studies had drawn similar conclusions, it was still rare to make such statement with including the interval data into SSD model development. This also 198
Ecotoxicology and Environmental Safety 145 (2017) 193–199
J. Zhao, R. Zhang
Regulation (EC) No 1488/94 on risk assessment for existing substance. Part II. Office for Official Publications of the European Communities, Luxembourg. Forbes, V.E., Calow, P., 2002. Species sensitivity distributions revisited: a critical appraisal. Hum. Ecol. Risk Assess. 8, 473–492. http://dx.doi.org/10.1080/ 10807030290879781. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B., 2004. Bayesian Data Analysis, 2nd ed. Chapman & Hall/CRC, Boca Raton, FL. Grist, E.P.M., O’Hagan, A., Crane, M., Sorokin, N., Sims, I., Whitehouse, P., 2006. Bayesian and Time-Independent Species Sensitivity Distributions for Risk Assessment of Chemicals. Environ. Sci. Technol. 40, 395–401. http://dx.doi.org/10.1021/ es050871e. Hayashi, T.I., Kashiwagi, N., 2010a. A Bayesian method for deriving species-sensitivity distributions: selecting the best-fit tolerance distributions of taxonomic groups. Hum. Ecol. Risk Assess. 16, 251–263. http://dx.doi.org/10.1080/10807031003670279. Hayashi, T.I., Kashiwagi, N., 2010b. A Bayesian approach to probabilistic ecological risk assessment: risk comparison of nine toxic substances in Tokyo surface waters. Environ. Sci. Pollut. Res. 18, 365–375. http://dx.doi.org/10.1007/s11356-0100380-5. Hickey, G.L., Craig, P.S., Luttik, R., de Zwart, D., 2012. On the quantification of intertest variability in ecotoxicity data with application to species sensitivity distributions. Environ. Toxicol. Chem. 31, 1903–1910. http://dx.doi.org/10.1002/etc.1891. Killick, R., Eckley, I.A., 2014. changepoint: an R package for changepoint analysis. J. Stat. Softw. 58, 1–19. http://dx.doi.org/10.18637/jss.v058.i03. Kruschke, J., 2011. Doing Bayesian Data Analysis: a Tutorial Introduction with R and BUGS. Academic Press, Burlington, MA. Lind, P., 2010. QSAR analysis involving assay results which are only known to be greater than, or less than some cut-off limit. Mol. Inf. 29, 845–852. http://dx.doi.org/10. 1002/minf.201000074. Newman, M.C., Ownby, D.R., Mézin, L.C.A., Powell, D.C., Christensen, T.R.L., Lerberg, S.B., Anderson, B.-A., 2000. Applying species-sensitivity distributions in ecological risk assessment: assumptions of distribution type and sufficient numbers of species. Environ. Toxicol. Chem. 19, 508–515. http://dx.doi.org/10.1002/etc.5620190233. Plummer, M., 2016. rjags: Bayesian graphical models using MCMC. R package version 46. [WWW Document]. URL 〈http://CRAN.R-project.org/package=rjags〉. Plummer, M., 2003. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling, In: Hornik, K., Leisch, F., Zeileis, A. (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), March20–22. Vienna, Austria. Posthuma, L., Traas, T.P., Suter II, G.W., 2002. General introduction to species sensitivity distributions. In: Posthuma, L., Suter II, G.W., Traas, T.P. (Eds.), Species Sensitivity Distributions in Ecotoxicology. CRC Press, Boca Raton, FL, pp. 3–10. Qian, S.S., King, R.S., Richardson, C.J., 2003. Two statistical methods for the detection of environmental thresholds. Ecol. Model. 166, 87–97. http://dx.doi.org/10.1016/ S0304-3800(03)00097-8. R Core Team, 2017. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Raimondo, S., Vivian, D.N., Delos, C., Barron, M.G., 2008. Protectiveness of species sensitivity distribution hazard concentrations for acute toxicity used in endangered species risk assessment. Environ. Toxicol. Chem. 27, 2599–2607. http://dx.doi.org/ 10.1897/08-157.1. RIVM, 2001. Guidance document on deriving environmental risk limits in the Netherlands (No. 601501012). National Institute of Public Health and the Environment, Bilthoven, The Netherlands. Stephens, M.A., 1986. Tests based on EDF statistics. In: D’Antonio, R.B., Stephens, M.A. (Eds.), Goodness-of-Fit Techniques. Marcel Dekker, New York. U.S. Environmental Protection Agency, 2017. ECOTOX user guide: ECOTOXicology Knowledgebase System. Version 4.0. [WWW Document]. URL 〈http:/www.epa.gov/ ecotox/〉 (accessed 12 June 2017). Vighi, M., Finizio, A., Villa, S., 2006. The evolution of the environmental quality concept: from the US EPA Red Book to the European Water Framework Directive. Environ. Sci. Pollut. Res. 13. pp. 9–14. http://dx.doi.org/10.1065/espr2006.01.003. Wheeler, J.R., Grist, E.P.M., Leung, K.M.Y., Morritt, D., Crane, M., 2002. Species sensitivity distributions: data and model choice. Mar. Pollut. Bull. 45, 192–202. http://dx. doi.org/10.1016/S0025-326X(01)00327-7. Yadid, I., Rudolph, J., Hlouchova, K., Copley, S.D., 2013. Sequestration of a highly reactive intermediate in an evolving pathway for degradation of pentachlorophenol. Proc. Natl. Acad. Sci. USA 110, E2182–E2190. http://dx.doi.org/10.1073/pnas. 1214052110. Zheng, W., Yu, H., Wang, X., Qu, W., 2012. Systematic review of pentachlorophenol occurrence in the environment and in humans in China: not a negligible health risk due to the re-emergence of schistosomiasis. Environ. Int. 42, 105–116. http://dx.doi. org/10.1016/j.envint.2011.04.014.
group when study the risk of nine toxic substances in Tokyo surface waters. However, the resampling process did not take the taxonomic group into consideration. On one hand, the taxonomic group could not be evenly sampled when the sample size was small and the number of taxonomic group was large; on the other hand, due to the large quantity of randomly selection with replacement to generate new sample, the taxonomic group would be gradually and evenly involved. 5. Conclusions By means of Bayesian statistics, a series of SSD models were developed based on 7 datasets of EC50 of PCP to aquatic organisms, with or without interval data. It was feasible to build a valid SSD model and determine the minimum sample size in case only the interval ecotoxicological data were available. The combination of the single measured ecotoxicological data or point estimation with interval data would enhance the validity of SSD model. The minimum sample size for the SSD model only based on interval data was consistent with that based on point estimation. The Bayesian statistics could integrate prior information into posterior distribution of parameters of SSD model that made it possible to incorporate the numerous uncertainties included in the interval data or behind the single data point into SSD model building processes. Therefore, the establishment of SSD model based on interval EC50 by means of Bayesian statistics had important practical significance for effective using the existing ecotoxicological data to build reliable SSD model and to enhance its applications. Acknowledgements This study was funded by the National Science Foundation of China (Grant No. 40601085) and the Fundamental Research Funds for Central Universities (Program No. 2012ZYTS031). Appendix A. Supplementary material Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.ecoenv.2017.07.029. References Aldenberg, T., Jaworska, J.S., 2000. Uncertainty of the hazardous concentration and fraction affected for normal species sensitivity distributions. Ecotoxicol. Environ. Saf. 46, 1–18. http://dx.doi.org/10.1006/eesa.1999.1869. Aldenberg, T., Rorije, E., 2013. Species sensitivity distribution estimation from uncertain (QSAR-based) effects data. Altern. Lab. Anim. 41, 19–31. Clark, J.S., Ferraz, G., Oguge, N., Hays, H., DiCostanzo, J., 2005. Hierarchical bayes for structured, variable populations: from recapture data to life-history prediction. Ecology 86, 2232–2244. http://dx.doi.org/10.1890/04-1348. Crosby, D.G., 1981. Environmental chemistry of pentachlorophenol. Pure Appl. Chem. 53, 1051–1080. http://doi.org/10.1351/pac198153051051. Davison, A.C., Hinkley, D.V., 1997. Bootstrap Methods and their Application. Cambridge University Press, New York. Dowse, R., Tang, D., Palmer, C.G., Kefford, B.J., 2013. Risk assessment using the species sensitivity distribution method: data quality versus data quantity. Environ. Toxicol. Chem. 32, 1360–1369. http://dx.doi.org/10.1002/etc.2190. Duboudin, C., Ciffroy, P., Magaud, H., 2004. Effects of data manipulation and statistical methods on species sensitivity distributions. Environ. Toxicol. Chem. 23, 489–499. http://dx.doi.org/10.1897/03-159. European Commission, 1996. Technical guidance document in support of commission directive 93/67/EEC on risk assessment for new notified substance and Commission
199