Control performance of fish short term reproduction assays with fathead minnow (Pimephales promelas)


Regulatory Toxicology and Pharmacology 108 (2019) 104424


James R. Wheeler a, Pablo Valverde-Garcia b, Mark Crane c,*

a Corteva Agriscience™, Agriculture Division of DowDuPont™, 3B Park Square, Milton Park, Abingdon, Oxfordshire, OX14 4RN, UK
b Corteva Agriscience™, Agriculture Division of DowDuPont™, 9330 Zionsville Road, Indianapolis, IN, 46268, USA
c AG-HERA, 23 London Street, Faringdon, SN7 7AG, UK

ARTICLE INFO

Keywords: Fish short-term reproduction assay; Fathead minnow; Historical control data; Variability; Statistical power

ABSTRACT

The fish short-term reproduction assay (FSTRA) is an in vivo screen to assess potential interactions with the fish endocrine system. After a 21-day exposure period vitellogenin (VTG) and secondary sexual characteristics are measured in males and females. Egg production and fertility are also monitored daily throughout the test. This paper presents data from 49 studies performed to satisfy test orders from the United States Environmental Protection Agency's Endocrine Disruptor Screening Program. Data Evaluation Records were used to collate the typical control variability and performance of test parameters in FSTRAs conducted in different laboratories with fathead minnow (Pimephales promelas). We also examine the statistical power of FSTRA endpoints and assess whether available historical control data (HCD) assist evidence-based interpretation of the endpoints. Statistically significant inter-laboratory differences were found for all endpoints except survival. HCD could therefore be usefully developed on a laboratory-by-laboratory basis to aid interpretation of new study data. Reliable HCD ranges could be developed for survival, body weight/length, gonadal somatic index, fertilisation success, and male tubercle score, and used in association with stated test acceptability criteria to interpret FSTRA data. In contrast, high intra- and inter-laboratory control variability for VTG and fecundity means that HCD for these endpoints are of limited use during study interpretation.

* Corresponding author.

https://doi.org/10.1016/j.yrtph.2019.104424
Received 5 April 2019; Received in revised form 19 June 2019; Accepted 15 July 2019; Available online 19 July 2019
0273-2300/© 2019 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/BY-NC-ND/4.0/).

1. Introduction

The potential effects of endocrine disrupting chemicals (EDCs) on wildlife have been a concern for many years (Campbell and Hutchinson, 1998; Kidd et al., 2007; Tyler et al., 2008). In response to this, regulatory authorities across different jurisdictions have developed a battery of tests to detect EDCs and determine whether the endocrine activity of a substance is sufficient to cause adverse effects at the wildlife population level. The OECD Conceptual Framework (OECD, 2018) is an example of this approach and comprises five levels of testing. The fish short-term reproduction or fish screening assay (FSTRA; OECD 2012a&b, OPPTS (now known as OCSPP) 2009) sits at level 3 of the OECD's Conceptual Framework. Here we focus on the fish short-term reproduction test designs (i.e., OECD TG 229 (OECD, 2012a) and OCSPP 890.1350 (OPPTS, 2009)). These are in vivo screening assays in which sexually mature male and female spawning fish are held together and exposed to a chemical (e.g. Table 1). At the end of a 21-day exposure period vitellogenin (VTG) and secondary sexual characteristics are measured in males and females. In addition to this, egg production and fertility are monitored daily throughout the test. Gonads are also preserved and may be examined histopathologically to help assess the reproductive fitness of test animals and to add to the weight-of-evidence for potential endocrine activity. In some variants of the test design other optional endpoints are also assessed, including the gonadal somatic index (GSI) and plasma hormone measurements. The differences amongst the three test guidelines are fully described in Wheeler et al. (2013).

Table 1. Experimental conditions for the fish endocrine screening assay with P. promelas (OECD, 2012).

• Test type: Flow-through
• Water temperature: 25 ± 2 °C
• Illumination quality: Fluorescent bulbs (wide spectrum)
• Light intensity: 10–20 μE/m²/s, 540–1000 lux, or 50–100 ft-c (ambient laboratory levels)
• Photoperiod (dawn/dusk transitions are optional, however not considered necessary): 16 h light, 8 h dark
• Loading rate: < 5 g per L
• Test chamber size: 10 L (minimum)
• Test solution volume: 8 L (minimum)
• Volume exchanges of test solutions: Minimum of 6 daily
• Age of test organisms: The exposure phase is started with sexually dimorphic adult fish from a laboratory supply of reproductively mature animals (e.g. with clear secondary sexual characteristics visible), and actively spawning. For general guidance only (and not to be considered in isolation from observing the actual reproductive status of a given batch of fish), fathead minnows should be approximately 20 (± 2) weeks of age, assuming they have been cultured at 25 ± 2 °C throughout their lifespan.
• Approximate wet weight of adult fish (g): Females: 1.5 ± 20%; Males: 2.5 ± 20%
• No. of fish per test vessel: 6 (2 males and 4 females)
• No. of treatments: 3 (plus appropriate controls)
• No. vessels per treatment: 4 minimum
• No. of fish per test concentration: 16 adult females and 8 males (4 females and 2 males in each replicate vessel)
• Feeding regime: Live or frozen adult or nauplii brine shrimp two or three times daily (ad libitum), commercially available food or a combination of the above
• Aeration: None unless DO concentration falls below 60% air saturation
• Dilution water: Clean surface, well or reconstituted water or dechlorinated tap water
• Pre-exposure period: 7–14 days recommended
• Chemical exposure duration: 21-d
• Biological endpoints: survival, behaviour, fecundity, secondary sex characteristics, VTG plus, optionally, gonadal histopathology

Ankley and Villeneuve (2006) identify two “unique perspectives” of this assay from a regulatory point of view: i) biomarkers such as VTG are included; and ii) reproduction is implicitly considered in a relatively short-term assay. Dang et al. (2011) reviewed 137 studies in which 35 chemicals were tested and concluded that parameters such as VTG and secondary sexual characteristics in the FSTRA are reliable indicators of substances with oestrogenic, anti-oestrogenic, and androgenic properties. The utility of the FSTRA has been demonstrated within a battery of endocrine screening assays as a ‘gate-keeper’ that might be employed first to optimise resources and efficiency (Ankley and Gray, 2013). However, questions remain about the performance, power, and specificity of the test. Schapaugh et al. (2015) analysed 22 sets of fathead minnow (FHM; Pimephales promelas) FSTRA control data submitted to the USEPA (a subset of the studies analysed in the current paper). They found that 1 study (4.55%) failed the performance criterion for control mortality, 1 study (4.55%) failed the performance criterion for egg production, and 7 studies (31.82%) failed the performance criterion for fertilisation success. These findings raise concerns about the ability of laboratories to perform tests that routinely pass the test guideline validity criteria. An analysis by Coady et al. (2017) identified high variability in some test endpoints, such as average plasma VTG protein concentrations in female fathead minnows, which varied more than 10-fold across 11 different FSTRA studies conducted in the same laboratory. In their view, this may have occurred for several reasons, including the inherent variability of VTG levels among individual female fish, variability in the ELISA kit performance, and variability associated with different laboratory technicians conducting the assay. A post hoc power analysis showed that the power to detect 20% or 80% decreases in female VTG varied from 5% to 99% across the 11 studies.

Valverde-Garcia et al. (2018) point out, in the context of avian reproduction studies, that the limited number of treatment levels, low replication, and high number and nature of the variables tested mean that statistical differences between the concurrent control and treated groups may arise by chance alone. When FHM are the species used in the FSTRA, the test design typically includes three treatment levels plus one or more controls, with four vessels per treatment each containing two males and four females (i.e. eight males and 16 females per treatment), and measurement of several, often highly variable, endpoints (Table 1). Coady et al. (2014, 2017) conclude that the presence of so many endpoints in the FSTRA increases the likelihood of false positive results. Parallels between avian reproduction studies and the FSTRA are therefore clear and are particularly important if FSTRA results are used for purposes beyond the primary application of in vivo endocrine activity screening, such as for risk assessment, for which it is not optimised (Wheeler et al., 2014a) as acknowledged by the OECD Test Guideline (OECD, 2012a).

Historical control data (HCD) are often used when analysing mammalian toxicity studies, and there have been recent calls for this to be extended to ecotoxicity studies (Brooks et al., 2019). Valverde-Garcia et al. (2018) suggest that HCD from the same or multiple laboratories can assist in ecotoxicity data interpretation by clarifying seemingly random statistically significant findings or statistically insignificant trends. They suggest that HCD would help investigators to:

• understand the normal range of given endpoints;
• characterise rare findings;
• monitor biological variability over time that might be influenced by external factors (e.g. genetic drift in specific strains of test animals); and
• monitor internal factors that may have changed in the performing laboratory (e.g., diet, staff proficiency, etc.).

An understanding of these issues is important once a Test Guideline moves from the validation/validated phase into common use. It is likely that the FSTRA will now become a routine test in Europe, as it forms one of the key in vivo screening requirements to meet ‘sufficiency’ for the assessment of pesticidal and biocidal active substances for potential endocrine disrupting properties (ECHA/EFSA, 2018). Therefore, understanding the HCD may become increasingly important as the test is performed more frequently in a greater number of laboratories of varying degrees of competence. Indeed, the OECD has shown interest in post-validation monitoring of its Test Guidelines to help with methodological and interpretational improvement. A recent case in point is the 2013 revision of the fish early life-stage test guideline (OECD, 2013) following a retrospective data analysis of studies performed according to the older version of the test (Oris et al., 2012). These analyses, driven by discussions at the OECD's Fish Toxicity Testing framework meeting (OECD, 2012c), implemented modification of the validity criteria and specified an ECx design.

In this paper, we review results from studies performed to fulfil USEPA Endocrine Disruptor Screening Program (EDSP) Tier 1 requirements for the first list of substances screened. These studies provide a robust database of guideline studies performed to Good Laboratory Practice. The studies were conducted by multiple laboratories outside the process for validation of the Test Guideline(s) (OECD, 2006a&b, 2007) and therefore constitute the first significant performance of the test for regulatory purposes. The main aims of this analysis are:

1. To understand the typical control variability and performance of the test parameters in FSTRAs with fathead minnow;
2. To examine the statistical power of FSTRA endpoints with fathead minnow; and
3. To examine whether available HCD assist with evidence-based interpretation of the endpoints.

Throughout this paper simple, but robust, statistical methods have been used to aid understanding and to allow these analyses to be updated easily as more data become available.

2. Methods

FSTRA data were obtained from Data Evaluation Records (DERs) made publicly available by the USEPA: https://www.epa.gov/endocrine-disruption/endocrine-disruptor-screening-program-tier-1-screening-determinations-and. These DERs have been submitted by different companies as part of the EDSP Tier 1 Screening determinations for list 1 chemicals. All were from GLP-compliant Test Guideline studies performed in commercial contract research laboratories. The data in the DERs are limited to summaries of the original reports as reproduced and interpreted by the USEPA. It was not possible to verify values independently, as original reports were not available to us. In most cases values met with expectations. However, if a value appeared incongruous it is highlighted and discussed in the results section. Unless otherwise stated, the following FSTRA endpoints were extracted from the DERs for male and female FHM from dilution water control and solvent control (when these were included in the test) replicates:

• Percent survival;
• Mean weight (g) at test beginning and end;
• Mean length (mm) at test end;
• Mean fecundity (number of eggs per surviving female per reproductive day);
• Mean percent fertilisation success;
• Median tubercle score;
• Mean gonadosomatic index (GSI); and
• Mean vitellogenin (VTG) concentration (ng/mL).

These data were used to calculate descriptive statistics for each parameter across all studies and to determine any statistical differences due to solvent type or laboratory in which the test was performed. In some cases log transformations were performed so that parametric statistics could be employed.

One of the objectives of this paper was to evaluate the different sources of variation for FSTRA endpoints: variance within study (between individuals for survival and between replicate tanks for the other endpoints), variance between dilution water and solvent controls, variance between studies performed in the same laboratory, and variance between different laboratories. The variance components $\sigma^2_{\text{laboratory}}$, $\sigma^2_{\text{solvent type}}$, and $\sigma^2_{\text{study}}$ were estimated with the study means and the model:

$$y_{ijk} = \mu + \text{laboratory}_i + \text{solvent type}_j + \text{study}_{ijk},$$

where laboratory, solvent type, and study are random effects (study = residual error) (Stroup, 2012). The raw data (values per replicate) were not available, so for all endpoints (except fish survival) $\sigma^2_{\text{within study}}$ was calculated as the pooled variance using the variances from the different studies. For survival data the variance within study (variability between individuals within study) was calculated as the variance for a sample proportion, $\frac{p \times q}{n - 1}$ (Zar, 2010), and then the pooled variance across studies was calculated. The variance estimates and their contribution to the total variance, expressed as a percentage, were reported, making possible the relative and absolute comparison of sources of variation between different FSTRA parameters.

Means and standard deviations for control data from each study and parameter, where available, were also analysed to determine the sample size required to detect treatment differences from control values. The coefficient of variation (CV) for each study and endpoint was calculated, and the mean and standard deviation at, or closest to, the minimum, 25th, 50th, and 75th percentiles, and the maximum value of the CV data distribution were selected to represent the range of response variability across the available DER studies. These mean and standard deviation values were used to determine the minimum sample size per treatment required to detect a 10%, 20%, or 50% difference from the controls based on procedures described in Rosner (2011) as implemented in an online calculator (https://clincalc.com/stats/SampleSize.aspx).

Finally, the results presented in DERs after fish were exposed to test chemicals were analysed for both exposed and unexposed fish to determine the smallest level of effect that was actually reported using the recommended statistical analyses (the Jonckheere-Terpstra step-down test for all continuous quantitative endpoints with a monotonic dose-response; and Dunnett's test, Tamhane-Dunnett, or Mann-Whitney-Wilcoxon U test for non-monotonic responses, depending on whether the data were normally distributed with homogeneous variance). The statistically significant empirical results from these studies were then compared with the predicted power for each endpoint.

Statistical analyses used Analyse-it for Microsoft Excel 4.97.4 except for the variance components analysis which was performed using JMP Pro 12.2.

3. Results and discussion

3.1. Distribution of control data across control type and laboratory

Sixty-five sets of control data were available, comprising 49 dilution water controls, and 2 acetone, 13 dimethylformamide (DMF) and 1 triethylene glycol (TEG) solvent controls (Supplementary Material Table S1). In one case the FSTRA data in the DER that was publicly available had been entered erroneously and are in fact data for another substance. These FSTRA data therefore could not be included in this analysis.

The studies were performed in 8 different laboratories, anonymised alphabetically in this paper. The numbers of studies performed in each laboratory varied widely: Laboratory A performed studies with 5 substances, Laboratory B with 3, Laboratory C with 2, Laboratory D with 1, Laboratory E with 17, Laboratory F with 7, Laboratory G with 13, and Laboratory H with 1. Each study included either a dilution water control alone, or a dilution water control plus a solvent control, and usually 3 test substance concentrations. The only exceptions to this were that in one study there were 4 test substance concentrations and in another there were 5 test substance concentrations.
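The sample-size step described in the Methods (minimum replication per treatment needed to detect a 10%, 20%, or 50% difference from the control mean) can be sketched as follows. This is a minimal illustration using the standard normal-approximation formula for comparing two means with equal group sizes and a common standard deviation. The function name `n_per_group`, the two-sided alpha of 0.05, the power of 0.80, and the example mean and standard deviation are assumptions made for illustration; they are not values stated in the paper.

```python
import math
from statistics import NormalDist


def n_per_group(mean, sd, pct_change, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a change of
    `pct_change` percent from `mean`, for a two-sided two-sample
    comparison of means with common standard deviation `sd`.

    alpha and power are illustrative defaults (assumptions).
    """
    delta = mean * pct_change / 100.0            # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical endpoint with a coefficient of variation of 20%.
    for pct in (10, 20, 50):
        print(pct, n_per_group(mean=100.0, sd=20.0, pct_change=pct))
```

Because the required sample size scales with the square of the CV divided by the relative effect size, endpoints with control CVs well above 20% need far more replication than the four vessels per treatment in the standard design to detect small changes.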

3.2. Test validity and performance criteria

The two FSTRA test guidelines take different approaches to establishing study ‘validity’ or evidence of the scientific soundness of study results. OCSPP establishes ‘hard’ validity criteria that provide an indication of the general performance of the FSTRA under test conditions: e.g., control survival that does not meet the specified value (> 90%). It also establishes performance criteria for specific environmental and water chemistry parameters that, if breached, would not necessarily invalidate a study but might provide further context to the outcome. The OECD test guideline specifies performance criteria that combine these two concepts, with ranges for acceptable biological responses, environmental parameters, and test substance exposure. Test validity and performance criteria for the two test guidelines are summarised alongside the outcomes of the studies comprising the HCD in Table 2. Additional performance criteria from the OCSPP guideline covering dilution water quality and pH are not considered.

An analysis of validity and performance criteria is informative as it highlights the contrast in approach between the OECD and OCSPP test guidelines. The former is pragmatic and open to interpretation, while the latter is more prescriptive. This criteria analysis demonstrates the high overall failure rate if the criteria are taken literally – i.e. a 39% failure rate if based on maintenance of test concentrations alone. It also highlights the difficulties that some laboratories had in meeting the fecundity requirements (20% failure rate). There were also failures of more basic elements of the test system, such as temperature (14% failure rate) and dissolved oxygen (8% failure rate). It is important to note that this analysis is for the studies submitted to USEPA and it is therefore likely that other ‘failed runs’ of some studies were also performed but not included in this analysis. Hence failure rates indicated here are potentially an under-estimate of the work performed to meet the data requirement. Many of the deviations from the criteria were transient and, in most cases, corrective action rapidly restored acceptable conditions. Furthermore, considerable latitude was taken by USEPA in the application of test concentration maintenance validity. Indeed, all the studies analysed here were considered to satisfy the EDSP Tier 1 Test Order by USEPA.

Descriptive statistics for FSTRA endpoints reported in the DERs are shown in Table 3 and discussed in the section below.


Table 2. Test validity and performance criteria for OECD and OCSPP FSTRA guidelines.

| OECD requirement | OCSPP requirement | Failures reported in DERs | Comment |
|---|---|---|---|
| Mortality in the water (or solvent) controls should not exceed 10% at the end of the exposure period. | Survival (in controls) > 90% | 3/49 studies. 3/65 control groups: 88% male and female survival in one study; 88% male and female survival driven by female mortality (81% survival) in a second study; 83.3% male and female survival driven by male and female mortality in a third study. | No solvent control failed the criterion. |
| Dissolved oxygen concentration should be at least 60% of the air saturation value (ASV) throughout the exposure period. | Dissolved oxygen > 4.9 mg/L (> 60% ASV). | 7/49 studies. 5/7 of these studies employed a solvent control. | In most cases the decrease was marginally below 60% ASV and corrective action quickly restored acceptable values. There were indications in several studies that the root cause was most likely microbial growth. |
| Temperature should not differ by more than ± 1.5 °C between test vessels at any one time during the exposure period and be maintained within a range of 2 °C of the temperature ranges specified for the test species (i.e. 25 °C for fathead minnow). | Temperature (mean) 25 ± 1 °C. | 4/49 studies (OCSPP). 1/49 studies (OECD). | Two studies had temperatures above and below the OCSPP requirement. Only 1 study exceeded the broader range allowed by the OECD criterion. However, this was suspected to be an issue with faulty wiring to the temperature probe. It was not possible to verify the OECD inter-test vessel variance, but it can reasonably be assumed to have been met considering the tight minimum, maximum and mean temperatures reported. |
| Concentrations of the test substance in solution should be maintained within ± 20% of the mean measured values. | Measured test concentration (each test item treatment, all replicates) coefficient of variation < 20% over 21 days. | 19/49 studies (OCSPP). | It was not possible to verify the OECD requirements from data provided in the DERs. |
| Evidence that fish are actively spawning in all replicates prior to initiating chemical exposure and in control replicates during the test. | Reproduction (in controls, each replicate). Spawning at least every four days, or: Fecundity greater than approximately 15 eggs/female/reproductive day/replicate and: Fertilization success > 95%. | 10/49 studies (OCSPP). 6/49 studies (OCSPP); 7/65 control groups (OCSPP). | Failures due to specific requirements of the OCSPP guideline. Note that fertilisation success failures were observed in all studies in laboratory A only. Latitude in the OECD guideline could be interpreted to mean that all studies met this criterion. |
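The OCSPP criteria in Table 2 can be read as a simple checklist applied to each study. The sketch below is one hedged way to encode them. The class `StudySummary`, its field names, and the function `ocspp_failures` are hypothetical, and the example values are invented. Two interpretive assumptions are made: the reproduction criterion is read as (spawning at least every four days OR fecundity above 15 eggs/female/reproductive day) AND fertilisation success above 95%, and "approximately 15" is treated as a strict threshold of 15.

```python
from dataclasses import dataclass


@dataclass
class StudySummary:
    """Hypothetical per-study summary values (field names are assumptions)."""
    survival_pct: float             # control survival (%)
    min_do_mg_l: float              # lowest dissolved oxygen reading (mg/L)
    mean_temp_c: float              # mean test temperature (deg C)
    conc_cv_pct: float              # CV of measured test concentration (%)
    max_days_between_spawns: float  # longest gap between control spawnings
    fecundity: float                # eggs/female/reproductive day (controls)
    fertilisation_pct: float        # control fertilisation success (%)


def ocspp_failures(s: StudySummary) -> list:
    """Return the names of the OCSPP criteria (as read from Table 2) not met."""
    checks = {
        "survival": s.survival_pct > 90,
        "dissolved oxygen": s.min_do_mg_l > 4.9,
        "temperature": abs(s.mean_temp_c - 25.0) <= 1.0,
        "test concentration CV": s.conc_cv_pct < 20,
        # Assumed reading of the combined "or ... and" reproduction criterion.
        "reproduction": (s.max_days_between_spawns <= 4 or s.fecundity > 15)
        and s.fertilisation_pct > 95,
    }
    return [name for name, passed in checks.items() if not passed]


if __name__ == "__main__":
    example = StudySummary(87.5, 5.5, 25.3, 12.0, 3, 20.0, 91.9)  # invented values
    print(ocspp_failures(example))
```

A literal checklist of this kind is what produces the high failure rates discussed above; it does not capture the latitude that USEPA applied when accepting the studies.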

3.3. Intra- and inter-laboratory variability

3.3.1. Control survival

There was no significant difference amongst laboratories in either female or male survival. Type of control (Kruskal-Wallis test (KW)) had no significant effect on female survival (H3 = 2.02, p = 0.57), but had a borderline significant effect on male survival (H3 = 7.97, p = 0.047). This was because male survival in the two acetone controls was significantly lower than in the dilution water (i.e. no solvent) controls (Steel-Dwass-Critchlow-Fligner (SDCF) test p = 0.035). This is likely to be an artefact of the low number of males per replicate (two fish) and the low number of acetone control groups available (also only two). In both cases only 1 male fish died across each of the 4 replicates, suggesting that these mortalities were incidental. Overall solvent control mortality did not invalidate any of the studies (see section 3.2) and at the levels typically employed (0.02–0.10 mL/L) would not be expected to affect test parameters (Green and Wheeler, 2013).
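The sensitivity of male survival to a single death follows directly from the test design: with two males in each of four replicate vessels there are only eight control males, so one death lowers survival to 87.5%, which is below the > 90% criterion. The short sketch below works through that arithmetic and also computes the variance of a sample proportion, taken here as p × q / (n − 1). The function name `survival_summary` is hypothetical.

```python
def survival_summary(n_fish, n_dead):
    """Return (percent survival, variance of the survival proportion).

    The variance uses the sample-proportion form p * q / (n - 1).
    """
    p = (n_fish - n_dead) / n_fish  # proportion surviving
    q = 1 - p                       # proportion dying
    return 100 * p, p * q / (n_fish - 1)


if __name__ == "__main__":
    # 8 control males (2 per vessel x 4 vessels), one death.
    print(survival_summary(8, 1))
    # 16 control females (4 per vessel x 4 vessels), one death.
    print(survival_summary(16, 1))
```

One death among the 16 control females still leaves survival above 90%, which illustrates why the male survival endpoint is the more fragile of the two.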

3.3.2. Control body weight and length at test end

Log(weight) and untransformed length data passed the Shapiro-Wilk test for normality at the 5% significance level. Laboratory A did not report length values for either males or females so these endpoints could not be included for this laboratory.

Mean female and male weights at test end did not differ significantly among control types. However, both female and male weights at test end differed significantly amongst laboratories (Fig. 1; females: ANOVA F7,57 = 26.31, p < 0.0001; males: ANOVA F7,57 = 14.66, p < 0.0001). A Tukey-Kramer (TK) comparison test showed that this was because females from Lab D controls weighed significantly less than those from all other Labs; females from Lab G controls weighed less than those from all other Lab controls except for C and D; females from Lab E weighed less than those from Lab A and F controls; and females from Lab C controls weighed less than those from Lab A. A TK comparison test showed that males from Lab D controls weighed significantly less than those from Lab A, B, E, F, and H controls; males from Lab G controls weighed less than those from Lab A, B, and F controls; males from Lab E controls weighed less than those from Lab A and F controls; and males from Lab C controls weighed less than those from Lab A controls.

Fig. 1. Box-and-whisker plot of male and female weights in controls at end of FSTRA performed in different laboratories.

Mean female and male length at test end did not differ significantly between control types but did differ between laboratories (Fig. 2; females: ANOVA F6,51 = 5.08, p = 0.0004; males: ANOVA F6,51 = 2.88, p = 0.0171). A TK test showed that this was because females from Lab D controls were shorter than those from Lab C, E, and G controls; and males from Lab D controls were shorter than those from Lab E controls.

Fig. 2. Box-and-whisker plot of male and female lengths in controls at end of FSTRA performed in different laboratories.

Fish were statistically smaller in Lab D. However, this laboratory performed only 1 study. Interestingly, females in this study did not meet the performance criteria for fecundity or fertilisation success and were older (6–7 months) than the age range recommended by the test guidelines (4.5–6 months). The fish used in Lab G were also smaller than those used by the other laboratories, but this laboratory had conducted a larger number of the studies (13 studies) and consistently met the performance criteria for reproduction, suggesting that organism source and culture conditions may influence test organism size without compromising the performance of the study.

3.3.3. Control body weight changes during test

Pre- and post-study mean body weights were reported from all laboratories except for Lab A. The way in which this parameter was measured varied by laboratory, with some taking weights of a subsample of the batch of fish to be used and some laboratories weighing each fish into the experiment. The percentage change in mean body weight over the duration of the FSTRA did not differ significantly with solvent type. However, there was a significant difference amongst laboratories in both female and male percentage body weight change (Fig. 3; females: KW H6 = 21.18, p = 0.0017; males: KW H6 = 33.7, p < 0.0001). This was because, on average, females in controls from Labs C, E, and H lost weight over the test duration, while those in the controls from the other Labs gained weight. Males from Labs C, D, E, and H also lost weight over the test duration, while those in the controls from the other Labs gained weight.

Fig. 3. Box-and-whisker plot of male and female percentage weight change in controls from FSTRA performed in different laboratories.

This variation in control data highlights the lack of utility of wet weight as a growth metric for adverse effects determination caused by treatment. In fact, growth impacts would not necessarily be expected in the FSTRA because the fish used in this test are essentially adult (4.5–6 months old) and are not within the exponential growth phase that is typical for test designs optimised to assess this parameter (i.e. OECD, 2000, 2013). However, as an interpretative measure the concept of weight gain or loss as employed in mammalian toxicology may be useful. Though not a measure of growth per se, weight loss could aid in developing a weight-of-evidence argument for non-specific (i.e. not endocrine mediated) changes in mechanistic endpoints such as female vitellogenin (Wheeler and Coady, 2016; Marty et al., 2018).

Table 3. Descriptive statistics for FSTRA control parameters. Mean and quantiles calculated using the empirical distribution of the study means.

| Endpoint | Lowest | 2.5%ile | 25%ile | Mean (STD; CV) | Median | 75%ile | 97.5%ile | Highest |
|---|---|---|---|---|---|---|---|---|
| Male survival (%) | 87.5 | 87.5 | 100 | 97.98 (4.53; 4.62%) | 100 | 100 | 100 | 100 |
| Female survival (%) | 81 | 83.5 | 94 | 97.24 (4.83; 4.97%) | 100 | 100 | 100 | 100 |
| Male mean weight (g) | 1.95 | 2.04 | 2.7 | 3.26 (0.76; 23.31%) | 3.12 | 3.73 | 4.79 | 5.06 |
| Female mean weight (g) | 0.79 | 0.96 | 1.22 | 1.46 (0.3; 20.55%) | 1.42 | 1.73 | 2.0 | 2.05 |
| Male mean length (mm) | 47 | 47.0 | 52.28 | 56.86 (5.52; 9.71%) | 56.7 | 61.08 | 66.84 | 69.1 |
| Female mean length (mm) | 35 | 36.28 | 42.78 | 45.73 (4.5; 9.84%) | 46.0 | 49.08 | 53.48 | 53.87 |
| Male median tubercle score | 15 | 17.6 | 26.0 | 30.1 (7.7; 25.58%) | 31.0 | 34.0 | 46.2 | 57 |
| Male mean GSI | 0.93 | 0.97 | 1.18 | 1.37 (0.25; 18.25%) | 1.31 | 1.58 | 1.82 | 1.95 |
| Female mean GSI | 9.39 | 10.28 | 12.1 | 13.55 (1.92; 14.17%) | 13.2 | 14.5 | 16.92 | 22 |
| Male mean VTG (ng/mL) | 27 | 52 | 283 | 19188 (116672; 608%) | 700 | 1625 | 86913 | 920000 |
| Female VTG (ng/mL × 10^6) | 0.81 | 1.0 | 1.81 | 21.94 (43.65; 199%) | 3.26 | 19.1 | 165.91 | 207.7 |
| Fecundity (mean eggs/female/d) | 6.5 | 9 | 16.1 | 23.3 (10.2; 43.78%) | 22 | 29.5 | 49.2 | 54 |
| Fertilisation success (%) | 90 | 90.55 | 96.0 | 96.87 (2.45; 2.53%) | 97.6 | 98.6 | 99.62 | 100 |
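As an illustration of how the descriptive statistics in Table 3 might be used as a historical control range, the sketch below flags whether a new control mean falls between the 2.5th and 97.5th percentiles of the study means, and recomputes a coefficient of variation from a mean and standard deviation. The dictionary `HCD_RANGE` and the functions `within_hcd` and `cv_percent` are hypothetical names. The percentiles are the pooled, all-laboratory values from Table 3 and are used only for illustration, given that laboratories differed significantly for most endpoints.

```python
# (2.5th percentile, 97.5th percentile) of control study means, from Table 3.
HCD_RANGE = {
    "male weight (g)": (2.04, 4.79),
    "female weight (g)": (0.96, 2.0),
    "fecundity (eggs/female/d)": (9, 49.2),
    "fertilisation success (%)": (90.55, 99.62),
}


def within_hcd(endpoint, value):
    """True if `value` lies inside the pooled 2.5-97.5 percentile range."""
    low, high = HCD_RANGE[endpoint]
    return low <= value <= high


def cv_percent(mean, sd):
    """Coefficient of variation expressed as a percentage."""
    return 100 * sd / mean


if __name__ == "__main__":
    # The lowest control fecundity in Table 3 (6.5) falls outside the range.
    print(within_hcd("fecundity (eggs/female/d)", 6.5))
    # CV for fecundity from the Table 3 mean and standard deviation.
    print(round(cv_percent(23.3, 10.2), 2))
```

A laboratory-specific version would replace the pooled percentiles with ranges built from that laboratory's own control data.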

3.3.4. Control fertilisation success and fecundity

Laboratory D did not report fertilisation success. Type of control had no significant effect on fertilisation success in the other Labs, but there was a significant difference in overall fertilisation success amongst Labs (Fig. 4; KW H6 = 31.31, p < 0.0001); an SDCF test showed that Lab A (mean = 91.9%, range 90–94%) reported significantly lower success rates than Labs E (mean = 96.7%, range 91–100%), F (mean = 98.6%, range 96.9–99.2%), and G (mean = 98%, range 95.4–99.3%). In fact, none of Lab A's studies met the validity criterion of > 95% fertilisation success (see section 3.2).

Log(fecundity) passed the Shapiro-Wilk test for normality at the 5% significance level. Fecundity (expressed as log mean number of eggs per surviving female per reproductive day per replicate) differed significantly amongst solvent types (ANOVA F3,61 = 3.76, p = 0.0153) because it was significantly lower in the acetone controls compared to the dilution water and TEG controls (and fell just short of the 5% significance level when compared to DMF controls (p = 0.0864)). However, only 2 studies employed acetone (at different concentrations of 0.1 and 0.013 mL/L) and both studies failed the control reproduction validity criterion. Further, a potential adverse effect of acetone is not supported by the acetone concentration-response FSTRA study (acetone was also one of the test materials), where no effect on fecundity was noted at 0.6 and 71.4 mg/L (i.e. at levels comparable to solvent control levels of up to 0.091 mL/L, assuming a density of 784.6 kg/m³), although an effect was noted at 6.2 mg/L. Consequently, there is no compelling evidence that use of acetone as a solvent affects fish reproduction.

Fecundity also differed significantly amongst laboratories (Fig. 5; ANOVA F7,57 = 16.17, p < 0.0001) because fecundity in Lab A controls was significantly higher than in all other Labs apart from Labs C and F; fecundity in Lab C, F, and G controls was significantly higher than in Labs D, E, and H; and fecundity in Lab B controls was significantly higher than in Lab D (TK p < 0.05). Fecundity was positively correlated with female body weight (t63 = 3.03, p = 0.0035 for log transformed fecundity and female weight).

Fig. 4. Box-and-whisker plot of fertilisation success in controls from FSTRA performed in different laboratories.

Fig. 5. Box-and-whisker plot of fecundity in controls from FSTRA performed in different laboratories.

3.3.5. Control tubercle scores

Female median tubercle score data were either not collected for controls or were scored at zero, as it is not usual for females to express this male characteristic. Log male median tubercle score data passed the Shapiro-Wilk test for normality at the 5% significance level. There was no significant difference in median male tubercle scores amongst different control types, but there was a significant difference amongst laboratories (Fig. 6; ANOVA F7,57 = 10.68, p < 0.0001) because the tubercle score in Lab C was significantly higher than in other Labs; the score in Lab A was higher than in Lab G; and the score in Lab E was higher than in Labs F and G (T-K p < 0.05). Male tubercle score was positively correlated with male body weight (t63 = 3.71, p = 0.0004 for log transformed male tubercle score and weight). Similar observations have been made with male fat pad weight in fathead minnow (Watanabe et al., 2007).
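The correlation between male tubercle score and body weight is reported only as a t value with its degrees of freedom. Assuming it comes from the usual t-test of a Pearson correlation (an assumption; this passage does not name the test), the implied correlation coefficient can be recovered as r = t / sqrt(t² + df). A minimal sketch, with a hypothetical function name:

```python
import math


def r_from_t(t_value, df):
    """Correlation coefficient implied by a t statistic with df = n - 2.

    Assumes the t statistic tests a Pearson correlation (or, equivalently,
    the slope of a simple linear regression).
    """
    return t_value / math.sqrt(t_value ** 2 + df)


# t and df as quoted in the text for log male tubercle score vs. male weight.
r_tubercle = r_from_t(3.71, 63)
print(f"r = {r_tubercle:.2f}, r^2 = {r_tubercle ** 2:.2f}")
```

Under that assumption the correlation is statistically significant but moderate (r of roughly 0.42), so body weight would account for less than a fifth of the variance in log tubercle score.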

3.3.6. Gonadosomatic index (GSI)

Female GSI data did not pass the Shapiro-Wilk test for normality at the 5% significance level, even after transformation, so a Kruskal-Wallis test was used for analysis of both male and female GSI data. Solvent type had no significant effect on female or male GSI, but laboratory did have a significant effect (Fig. 7; females: KW H7 = 16.57, p = 0.02; males: KW H7 = 37.75, p < 0.0001), because GSI in female controls from Lab C was lower than in other Labs, and male control GSI in Lab G was significantly higher than in Labs A, E, and F. Male GSI was negatively correlated with male body weight (t70 = −2.35, p = 0.0215), but there was no correlation between female body weight and GSI.

3.3.7. Control plasma VTG

The limit of detection for VTG in Lab C was 100,000 ng/mL. This was adequate for detecting control concentrations of female VTG, but too high for detecting control male VTG. Male VTG data were therefore analysed without Lab C data. Male and female VTG data did not pass the Shapiro-Wilk test for normality at the 5% significance level, even after transformation, so a Kruskal-Wallis test was used for analysis. Solvent type had no significant effect on female or male VTG. A literature review by Hutchinson et al. (2006) suggests that some solvents may specifically impact biomarkers of endocrine disruption (Lech et al., 1996; Ren et al., 1996) although such effects require careful interpretation (Lech, 1997). The Hutchinson et al. (2006) review also discusses a fathead minnow fish full lifecycle study employing DMF in which the plasma vitellogenin in solvent control (0.1 mL/L DMF) females was significantly higher than in the dilution water control and there was a significant reduction in fecundity. This raises questions about the potential for undesirable endocrine interactions with carrier solvents. Therefore, the lack of a solvent effect observed here is important confirmation that, although not preferred, the FSTRA can be performed reliably using solvent delivery without compromising measures such as vitellogenin expression. Laboratory had a significant effect on female VTG alone (Fig. 8; females: KW H7 = 34.15, p < 0.0001), because VTG in female controls from Lab F was significantly higher than in controls from Labs A and E. Female and male VTG were both positively correlated with body weight (females: t70 = 2.43, p = 0.0178; males t70 = 2.4, p = 0.0191).

3.4. Variance components analysis

Table 4 presents results from the variance components analysis of control data. Variation within a study (i.e. biological and experimental variation between replicates) accounted for most of the variation (54.4–88.1%) in the control datasets for male survival, male and female GSI, and female VTG. Variation between studies (i.e. between studies performed within the same laboratory) accounted for most of the variation (52.2–69.5%) in the control datasets for female survival, male and female length, and male VTG. Variation between laboratories (i.e. between studies performed across different laboratories) accounted for most of the variation (51.6–61.9%) in male and female weight, and fecundity. Variation within studies (42.3%), between studies (20.9%), and between laboratories (35.7%) all accounted for variation in fertilisation success. Solvent type had a negligible effect on variance. These results indicate that the major sources of variation differ across different endpoints even for the reproduction and VTG endpoints considered to be most important in the FSTRA (Ankley and Villeneuve, 2006). Differences between laboratories accounted for > 60% of the variation in fecundity, while differences in fertilisation success occurred as a result of variation within and between studies and laboratories. Most variation in control VTG occurred between different studies performed in the same laboratory for males, but between replicates within the same study for females, reflecting the high female variation to be expected in these fractionally spawning fish (Watanabe et al., 2007).

Fig. 6. Box-and-whisker plot of male tubercle scores in controls from FSTRA performed in different laboratories.

Fig. 7. Box-and-whisker plot of male (top figure) and female (bottom figure) GSI scores in controls from FSTRA performed in different laboratories.

Fig. 8. Box-and-whisker plot of VTG in controls from FSTRA performed in different laboratories.

Table 4
Variance components analysis of between laboratory, between study, between solvent type, and within study variance for FSTRA control parameters.

| End point | Laboratory: variance | Laboratory: % of total | Study: variance | Study: % of total | Solvent type: variance | Solvent type: % of total | Within study: variance | Within study: % of total |
|---|---|---|---|---|---|---|---|---|
| Male proportion surviving | 9.85E-05 | 2.17 | 1.78E-03 | 39.17 | 1.95E-04 | 4.28 | 2.47E-03 | 54.38 |
| Female proportion surviving | 1.49E-04 | 3.70 | 2.10E-03 | 52.21 | 9.51E-05 | 2.36 | 1.68E-03 | 41.72 |
| Male mean length (mm) | 4.35E+00 | 12.35 | 2.45E+01 | 69.51 | 1.13E+00 | 3.20 | 5.26E+00 | 14.94 |
| Female mean length (mm) | 6.80E+00 | 29.29 | 1.37E+01 | 59.12 | 6.38E-01 | 2.75 | 2.05E+00 | 8.85 |
| Male mean weight (g) | 4.15E-01 | 51.56 | 2.23E-01 | 27.71 | 1.14E-02 | 1.42 | 1.55E-01 | 19.31 |
| Female mean weight (g) | 9.32E-02 | 61.94 | 2.62E-02 | 17.44 | 1.54E-03 | 1.02 | 2.95E-02 | 19.59 |
| Male mean GSI | 2.39E-02 | 18.53 | 2.66E-02 | 20.69 | 1.64E-03 | 1.28 | 7.66E-02 | 59.51 |
| Female mean GSI | 2.94E-01 | 2.98 | 3.22E+00 | 32.58 | 1.55E-01 | 1.56 | 6.22E+00 | 62.88 |
| Male VTG ng/mL plasma | 4.97E+08 | 2.17 | 1.29E+10 | 56.21 | 5.37E+08 | 2.34 | 9.02E+09 | 39.28 |
| Female VTG ng/mL plasma | 5.05E+14 | 3.04 | 1.48E+15 | 8.89 | 2.80E+11 | 0.00 | 1.46E+16 | 88.07 |
| Mean % fertilisation success | 3.92E+00 | 35.72 | 2.29E+00 | 20.87 | 1.25E-01 | 1.14 | 4.64E+00 | 42.27 |
| Mean Fecundity (eggs/fem/d) | 1.17E+02 | 60.04 | 3.46E+01 | 17.80 | 3.07E+00 | 1.58 | 3.99E+01 | 20.57 |
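The "% of Total" columns in Table 4 follow from dividing each variance component by the sum of the four components for that endpoint. A minimal sketch of that arithmetic is given below, using the male mean weight row as printed; the function name is hypothetical, and because the table reports variances to three significant figures the recomputed percentages can differ from the published ones in the second decimal place.

```python
def percent_of_total(components):
    """Express each variance component as a percentage of their sum."""
    total = sum(components.values())
    return {name: 100.0 * value / total for name, value in components.items()}


# Male mean weight (g) row of Table 4, as printed (three significant figures).
male_weight = {
    "laboratory": 4.15e-01,
    "study": 2.23e-01,
    "solvent type": 1.14e-02,
    "within study": 1.55e-01,
}

shares = percent_of_total(male_weight)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Ranking the resulting shares gives the dominant source of variation for an endpoint, which for male weight is the laboratory component.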

3.5. Historical control data for study interpretation

One of the objectives of this analysis was to consider the potential for using HCD collation as a tool to help with better interpretation of the outcome of FSTRA studies. EC (2013) stipulates that for mammalian toxicological studies, where available, HCD should be collated for endpoints that could represent critical adverse effects, be species- and strain-specific, obtained from the same laboratory, and cover a five-year period, centred as closely as possible on the date of the study of interest. HCD should be presented on a study-by-study basis giving absolute values plus percentage and relative or transformed values where these are helpful in the evaluation. If combined or summary data are submitted, these should contain information on the range of values, the mean, median and, if applicable, standard deviation. Additional HCD from other laboratories may also be reported separately as supplementary information. However, despite this common practice in mammalian toxicology, HCD are not requested and are rarely used to aid the interpretation of ecotoxicological studies. This is despite the relevance to ecotoxicological studies of many of the issues found in mammalian studies, especially in other vertebrate studies such as the FSTRA that employ limited numbers of treatment levels and assess variable endpoints (Brooks et al., submitted). Indeed, the power of using HCD approaches has recently been demonstrated for avian reproduction studies (Valverde-Garcia et al., 2018). The analyses presented here for the FSTRA are not as extensive as for avian reproduction, for which studies have been performed for over 40 years and are routine requirements for plant protection active substances globally.
However, it is likely that FSTRA studies will be performed in much greater numbers to meet the requirements for an assessment of potential endocrine disrupting properties of plant protection and biocidal active substances in Europe (ECHA/EFSA, 2018) as well as in screening exercises such as the USEPA EDSP and Japan's SPEED programmes. Consequently, we hope the analyses presented here will encourage laboratories to collate their HCD for FSTRA studies to aid study design optimisation and interpretation. Furthermore, as the study is performed more frequently, such HCD records may provide the basis for subsequent amendments to the test guidelines at the OECD and OCSPP levels, where appropriate. The information presented here is a unique resource of high-quality data, performed to standardised guidelines but, unlike typical test guideline validation studies, performed on substances with limited a priori expectations about optimal conditions and outcomes. Consequently, these data present a realistic measure of likely FSTRA responses in multiple laboratories. However, the dataset is still somewhat limited as 6 of the laboratories performed 7 or fewer of the FSTRAs reported in DERs, with just 2 (Labs E (n = 17 substances) and G (n = 13 substances)) performing 61% of the DER FSTRAs. Table 5 summarises descriptive endpoint statistics for all laboratories and for these two laboratories alone to help determine the feasibility of using HCD when interpreting FSTRA results. The data in Table 5 show that some studies, even from the two most experienced laboratories, failed test acceptability criteria despite satisfying the EDSP Tier 1 Test Order (see Section 3.2 above). The data also indicate that some endpoints are more amenable than others to HCD interpretation. Intra-laboratory differences between the 5th and 95th percentile values for most endpoints are < 50%, with inter-laboratory differences a little higher. However, three endpoints (fecundity, and male and female VTG) show substantially higher intra- and inter-laboratory differences between low and high values. In the case of these latter three endpoints the use of HCD is unlikely to assist study interpretation due to this high inherent variation. However, this finding is informative, especially considering the current aspirations to employ predictions from such studies in quantitative Adverse Outcome Pathway applications (Watanabe et al., 2016; Conolly et al., 2017).

3.6. Statistical power to detect differences between treatments

Table 6 shows the number of samples required to detect a 10% or 50% difference from the controls, based on DER studies that had the lowest, 25th percentile, 50th percentile, 75th percentile, and highest control variability for each endpoint. A 10% difference in male or female weight could be detected with the usual test guideline replication of n = 4 tanks only in those studies with very low variability. In more than 75% of studies this percentage difference in body weight could not be detected. In contrast, a 50% difference in weight could be detected in all studies except a very small number with the most highly variable female weight. A 10% difference in male or female length could be detected in all studies except a very small number with the most highly variable lengths. This finding is not surprising considering the small numbers of animals included in each replicate (2 males and 4 females) and the fact that the adult life stage is not optimised to assess growth. A 10% difference in fertilisation success could also be detected in almost all studies. This likely reflects the tight response, further truncated by presentation of the response parameter as a percentage value. In contrast, a 10% difference in fecundity could not be detected in most studies, and a 50% difference could not be detected in > 25% of studies. A 10% difference in male and female GSI could be detected only in those studies with the very lowest variability, while a 50% difference could be detected in all studies except those with the very highest variability. This shows the highly variable nature of this measure and is why the OECD decided not to include it as an endpoint in the OECD versions of the test guidelines (OECD, 2012a, b).
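The link between control variability and required replication can be illustrated with a standard two-sample normal approximation. This is a sketch under stated assumptions rather than the authors' calculation: the formula and the function name are assumptions, while the 80% power and 5% significance level are taken from the Table 6 caption. For a relative difference d and coefficient of variation CV, the approximation is n = 2 (z_{1-alpha/2} + z_{power})² (CV / d)².

```python
from statistics import NormalDist


def replicates_needed(mean, sd, rel_diff, alpha=0.05, power=0.80):
    """Approximate replicates per group needed to detect a relative difference.

    Two-sided, two-sample normal approximation (an assumed formula):
        n = 2 * (z_{1-alpha/2} + z_{power})^2 * (CV / rel_diff)^2
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    cv = sd / mean
    return 2 * (z_alpha + z_power) ** 2 * (cv / rel_diff) ** 2


# Control mean and SD of the study with the most variable male weight
# in Table 6: mean 3.57 g, SD 0.88 g.
for diff in (0.10, 0.20, 0.50):
    n = replicates_needed(3.57, 0.88, diff)
    print(f"{diff:.0%} difference: n = {n:.1f}")
```

Rounded to the nearest whole number, these three values agree with the Table 6 entries for that study (n = 95, 24, and 4). Other cells need not match exactly, because the table reports rounded means and standard deviations. Required replication scales with the square of CV / d, which is why halving the detectable difference roughly quadruples the number of replicates.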


Table 5
Range of results (5th to 95th percentile) for FSTRA control parameters.

| Endpoint | Laboratory E: 5%ile | Laboratory E: 95%ile | Laboratory E: % increase | Laboratory G: 5%ile | Laboratory G: 95%ile | Laboratory G: % increase | Combined data (Labs E & G): 5%ile | Combined data (Labs E & G): 95%ile | Combined data (Labs E & G): % increase | All Laboratories: 5%ile | All Laboratories: 95%ile | All Laboratories: % increase |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Male survival (%) | 97.6 | 100 | 2% | 87.8 | 100 | 14% | 88 | 100 | 14% | 88 | 100 | 14% |
| Female survival (%) | 84.2 | 100 | 19% | 93.9 | 100 | 6% | 86.4 | 100 | 16% | 88 | 100 | 14% |
| Male mean weight (g) | 2.62 | 3.76 | 44% | 2.12 | 3.29 | 55% | 2.22 | 3.66 | 65% | 2.21 | 4.59 | 108% |
| Male weight change (%) | −11.25 | 12.77 | | 7.74 | 46.6 | | −9.6 | 39.22 | | −8.91 | 41.56 | |
| Female mean weight (g) | 1.3 | 1.75 | 35% | 1.06 | 1.36 | 28% | 1.08 | 1.65 | 53% | 1.07 | 1.96 | 84% |
| Female weight change (%) | −0.2 | 0.22 | | −0.04 | 0.32 | | −0.19 | 0.3 | | −20.25 | 29.1 | |
| Male mean length (mm) | 48.5 | 66.38 | 37% | 53.2 | 61.8 | 16% | 49.84 | 65.34 | 31% | 47.26 | 65.77 | 39% |
| Female mean length (mm) | 39.78 | 53.29 | 34% | 45 | 49 | 9% | 40.12 | 52.32 | 30% | 38.26 | 52.79 | 38% |
| Male median tubercle score | 26.8 | 41 | 53% | 17.6 | 31.4 | 78% | 19.4 | 40.1 | 107% | | | |
| Male mean GSI | 1.08 | 1.63 | 51% | 1.33 | 1.87 | 41% | 1.1 | 1.8 | 64% | 1.01 | 1.78 | 77% |
| Female mean GSI | 11.8 | 16.62 | 41% | 11.3 | 15.86 | 40% | 11.41 | 16 | 40% | 11.02 | 15.9 | 44% |
| Male mean VTG (ng/mL) | 38.2 | 288000 | 753827% | 181.8 | 1318 | 625% | 50 | 83650 | 167200% | 85.4 | 26000 | 30345% |
| Female VTG (ng/mL × 10⁶) | 0.88 | 5.52 | 527% | 1.74 | 139.4 | 7911% | 1 | 81.2 | 8020% | 1.14 | 109.31 | 9522% |
| Fecundity (eggs/female/d) | 10.8 | 23.2 | 115% | 17.6 | 29.9 | 70% | 11.5 | 29.4 | 156% | 9.78 | 43.5 | 345% |
| Fertilisation success (%) | 91.8 | 100 | 9% | 96.5 | 99 | 3% | 93.5 | 99.7 | 7% | 91.7 | 99.2 | 8% |

Neither a 10% nor a 50% difference in male VTG could be detected in any study. However, the primary purpose of this measure is to detect oestrogenic activity that typically induces orders of magnitude increases in male VTG (Wheeler et al., 2005). In contrast, although a 10% difference in female VTG could be detected only in studies with the very lowest variability, a 50% difference could be detected in between 25% and 50% of studies. The predicted power to detect differences for each FSTRA parameter based upon control variability, as described above, compared reasonably well with results from the complete substance datasets (i.e. including test item treatment responses) presented in the DERs. Table 7 shows the smallest percentage mean and median difference from controls that was detected empirically in those studies in which a statistically significant difference was found between treatment concentrations and the control, as reported in the DERs (see Supplemental Data Tables S2–S12 for substance-specific data summaries). These results reflect measurable differences under test item treatment conditions; we use the median and interquartile ranges of these data below to summarise the magnitude of detectable effects for each endpoint. Most studies in which a significant effect on male (n = 12 studies) or female (n = 14 studies) weight was reported were able to detect effects of < 20%, which was the predicted power for male weight for up to almost 50% of the DER studies, and for female weight for up to almost 75% of the DER studies. A significant difference in body weight is therefore an effect measured in the FSTRA at a moderate frequency and with reasonably high statistical power. However, as discussed in Section 3.3.3, it has limited value as a true growth measure, as evidenced by the high variability in control growth across these studies, although it may be helpful to establish weight loss over the course of the exposure.
The small number of studies in which a significant effect on male (n = 4 studies) or female (n = 6 studies) length was reported were able to detect effects of < 10%, which was the predicted power for male and female length for > 75% of the DER studies. A significant difference in body length is therefore an effect measured in the FSTRA at a low frequency but with very high statistical power. However, fish should already be adult at the start of a test, so whether length loss actually occurred during the test is questionable. Most studies in which a significant effect on male (n = 18 studies) or female (n = 15 studies) GSI was reported were able to detect effects of > 30%, which was the predicted power for male and female GSI for up to approximately 50% of the DER studies. A significant difference in GSI is therefore an effect measured in the FSTRA at both a moderate frequency and with moderate statistical power. Most studies in which a significant effect on male (n = 6 studies) or female (n = 11 studies) VTG was reported could only detect effects

of > 150% (males) or > 40% (females), which was the predicted power for male and female VTG for > 75% of the studies. A significant difference in VTG is therefore an effect measured in the FSTRA at a low (male) or moderate (female) frequency and with very low statistical power. VTG is the core mechanistic endpoint of the FSTRA, so this finding is somewhat worrying. However, the power for males is reasonable when considering that the dynamic range is very large because basal male VTG expression is very low (Wheeler et al., 2005). The magnitude of male response to even weakly oestrogenic substances is therefore substantial, and lack of statistical power to detect marginal effects is not a substantial limitation when considering the primary screening purpose of the assay. Conversely, statistical analysis can indicate significant effects for relatively minor effects. Considering the high background variability demonstrated here, the relevance of any minor changes should be interpreted with caution. An effect on fecundity was the most common significant effect across all studies (n = 32 studies). Most studies in which a significant effect on fecundity was reported could detect effects > 50%, which was the predicted power for > 75% of the studies. A significant difference in fecundity is therefore an effect measured in the FSTRA at a high frequency and with moderate to low statistical power. An effect on fertilisation success was the second most common significant effect across all studies (n = 20 studies). Most studies in which a significant effect on fertilisation success was reported could detect effects of < 10%, which was the predicted power for > 75% of the studies. A significant difference in fertilisation success is therefore an effect measured in the FSTRA at a moderately high frequency and with very high statistical power.

4. Conclusions

The analysis presented here builds on that of Schapaugh et al. (2015) and represents a unique opportunity to assess the performance of this important assay after its OECD and USEPA validation. It includes 49 high quality studies, performed to GLP and, unlike the validation exercises, these studies were performed in several different commercial laboratories using different sources of animals, culture conditions, equipment, and personnel. Most importantly, the test items themselves were unlike validation standard or reference items in that they included different physicochemical properties, toxicities, and generally no expectation for any known endocrine activity. These studies therefore represent the first large-scale application of the FSTRA for its intended regulatory purpose, and this post-validation assessment provides valuable information on test performance, typical variation in control responses, and consequent data interpretation. The validity criteria analysis demonstrates the high failure rate of

the FSTRA if all the validity/performance criteria are strictly adhered to. However, it also demonstrates that, within USEPA at least, latitude in interpretation can be applied to the criteria, since all the studies were considered sufficient to satisfy the regulatory requirement. Perhaps a lesson here is for there to be wider acknowledgement and a priori discussion of the extent of this latitude. This regulatory flexibility contrasts with the prescriptive language used in test guidelines, which inevitably leads both laboratories and study sponsors to repeat these animal-intensive, expensive, and time-consuming studies unnecessarily. This is a potentially contentious issue, bearing in mind that significant deviations are known to impact upon the core endpoints. For example, Feifarek et al. (2018) and Shappell et al. (2018) investigated the influence of dissolved oxygen, sodium chloride, temperature, and food availability on male FHM responses to estrone exposure in the FSTRA.

Table 6
Sample sizes required to detect a 10% or 50% difference in response from controls based on the lowest, 25th percentile, 50th percentile, 75th percentile and highest control variability (coefficient of variation) for each endpoint. Sample size calculations used a power of 80% and a 5% significance level. Values are means and (standard deviations) and minimum sample size to detect 10%, 20%, and 50% difference between a treatment and the control.

| Endpoint | Statistic | Lowest variability | 25th %ile variability | 50th %ile variability | 75th %ile variability | Highest variability |
|---|---|---|---|---|---|---|
| Male weight (g) | Mean (STD) | 4.19 (0.07) | 2.8 (0.19) | 2.6 (0.28) | 2.25 (0.33) | 3.57 (0.88) |
| | 10% difference | n < 4 | n = 7 | n = 18 | n = 34 | n = 95 |
| | 20% difference | n < 4 | n < 4 | n = 5 | n = 8 | n = 24 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n < 4 | n = 4 |
| Female weight (g) | Mean (STD) | 0.79 (0.02) | 1.4 (0.08) | 1.44 (0.12) | 1.1 (0.12) | 1.42 (0.59) |
| | 10% difference | n < 4 | n = 5 | n = 11 | n = 20 | n = 271 |
| | 20% difference | n < 4 | n < 4 | n < 4 | n = 5 | n = 68 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n < 4 | n = 11 |
| Male length (mm) | Mean (STD) | 52 (0.5) | 63 (1.6) | 57.4 (2.1) | 48.8 (2.2) | 61.1 (4.8) |
| | 10% difference | n < 4 | n < 4 | n < 4 | n < 4 | n = 10 |
| | 20% difference | n < 4 | n < 4 | n < 4 | n < 4 | n < 4 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n < 4 | n < 4 |
| Female length (mm) | Mean (STD) | 44.5 (0.12) | 47.8 (0.9) | 51 (1.3) | 51.25 (1.73) | 53.87 (3.47) |
| | 10% difference | n < 4 | n < 4 | n < 4 | n < 4 | n = 7 |
| | 20% difference | n < 4 | n < 4 | n < 4 | n < 4 | n < 4 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n < 4 | n < 4 |
| Male GSI | Mean (STD) | 0.98 (0.04) | 1.4 (0.17) | 1.1 (0.2) | 1.3 (0.297) | 1.4 (0.55) |
| | 10% difference | n < 4 | n = 23 | n = 52 | n = 82 | n = 242 |
| | 20% difference | n < 4 | n = 6 | n = 13 | n = 20 | n = 61 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n < 4 | n = 10 |
| Female GSI | Mean (STD) | 14 (0.28) | 14 (1.19) | 14 (1.6) | 12.4 (1.99) | 22 (12) |
| | 10% difference | n < 4 | n = 22 | n = 40 | n = 40 | n = 565 |
| | 20% difference | n < 4 | n < 4 | n < 4 | n = 10 | n = 141 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n < 4 | n = 19 |
| Male VTG (ng/mL) | Mean (STD) | 849 (234) | 4780 (3670) | 900 (947) | 300 (400) | 27000 (63000) |
| | 10% difference | n = 119 | n = 925 | n = 1738 | n = 2791 | n = 8547 |
| | 20% difference | n = 30 | n = 231 | n = 435 | n = 698 | n = 2137 |
| | 50% difference | n = 5 | n = 37 | n = 70 | n = 112 | n = 342 |
| Female VTG (ng/mL × 10⁶) | Mean (STD) | 25.9 (0.26) | 4.73 (1.06) | 66.3 (22.9) | 3.1 (1.76) | 207.7 (841) |
| | 10% difference | n < 4 | n = 78 | n = 187 | n = 506 | n = 25737 |
| | 20% difference | n < 4 | n = 20 | n = 47 | n = 126 | n = 6434 |
| | 50% difference | n < 4 | n < 4 | n = 7 | n = 20 | n = 1029 |
| Fecundity | Mean (STD) | 23.5 (0.33) | 22 (3.2) | 24.5 (5.58) | 37.3 (12.9) | 9.1 (5.8) |
| | 10% difference | n < 4 | n = 40 | n = 81 | n = 188 | n = 638 |
| | 20% difference | n < 4 | n = 10 | n = 20 | n = 47 | n = 159 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n = 8 | n = 26 |
| Fertilisation success | Mean (STD) | 99.2 (0.05) | 98.3 (0.55) | 97.9 (1.08) | 97 (2.2) | 92 (6.7) |
| | 10% difference | n < 4 | n < 4 | n < 4 | n < 4 | n = 8 |
| | 20% difference | n < 4 | n < 4 | n < 4 | n < 4 | n < 4 |
| | 50% difference | n < 4 | n < 4 | n < 4 | n < 4 | n < 4 |

Table 7
Statistically significant differences (p < 0.05) found for FSTRA endpoints in DER studies.

| Endpoint | Number of substances for which statistically significant difference found for each endpoint | Mean lowest percentage change from control (STD) | Median lowest percentage change from control (interquartile range) |
|---|---|---|---|
| Male weight | 12 | 14.49 (5.78) | 13.73 (9.62–18.51) |
| Female weight | 14 | 15.56 (7.55) | 13.15 (10.23–20.63) |
| Male length | 4 | 5.75 (1.72) | 5.3 (4.79–6.27) |
| Female length | 6 | 4.33 (1.16) | 4.14 (3.71–5.14) |
| Male tubercle score | 18 | 33.92 (26.51) | 21.53 (14.89–40.1) |
| Male GSI | 18 | 37.9 (17.08) | 31.5 (28.25–42.0) |
| Female GSI | 15 | 33.38 (14.44) | 34.2 (22.13–41.34) |
| Male VTG | 6 | 113833 (277903) | 336 (152–870) |
| Female VTG | 11 | 58.15 (20.05) | 58.3 (40.21–74.1) |
| Fecundity | 32 | 61.41 (25.27) | 54 (47.5–82.5) |
| Fertilisation success | 20 | 17.89 (29.10) | 5.6 (2.81–12.40) |

They found that higher temperature (26 °C compared to 18 °C) was significantly associated with lower testis weight, GSI, and VTG. Food restriction was associated with lower GSI in one experiment, and low dissolved oxygen was associated with a lower secondary sex characteristic score in another experiment. However, robust conclusions for the purpose of the assay can usually be drawn if there are only minor deviations from the acceptability criteria, which is a pragmatic approach in the context of reducing vertebrate animal use (Hutchinson et al., 2016; Burden et al., 2017). The analysis in this paper also demonstrates inter-relationships between some endpoints which could be important when evaluating results to conclude on whether an endocrine interaction is likely to be operating. This is highlighted by the correlations between body weight and the fecundity, tubercle score, VTG and male GSI endpoints. In those

cases where the more mechanistic markers (tubercle score and VTG) appear to respond to exposure, one interpretation may be that this is evidence for an endocrine interaction. However, the correlation of these parameters with body weight shows that other modes of action or general systemic toxicity that may affect body weight or condition could also secondarily lead to changes in these markers for endocrine activity. There is a strong evidence base across multiple taxa which shows that ‘endocrine parameters’ may also respond to non-endocrine stressors (Marty et al., 2018). From a regulatory perspective it will become increasingly important to evaluate all responses in the FSTRA holistically to avoid unnecessary higher tier testing or the misidentification of substances as endocrine disrupters. Establishing such modes of action in fish is a complex process (Mihaich et al., 2017) and will inevitably rely on consideration of each of the endpoints in a weight-of-evidence assessment. To avoid study-specific confounders, where possible, we recommend that test concentrations are selected to avoid systemic toxicity (Wheeler et al., 2013) and observed effects are interpreted in the context of the full fish toxicity database and a within-study evidence base for organism stress (sublethal observations and body weight change). Reading across thresholds for systemic effects amongst different studies with the test item may be critical to this evaluation when considering the poor power of the test design to assess less than overt measures of systemic toxicity, as demonstrated here by body weight and length analyses. Indeed, using the FSTRA to determine reliable thresholds for effects that may be used in risk assessment is a challenge. Wheeler et al. (2014a) conclude that apical endpoints from a FSTRA should not be used directly in risk assessment.
Instead, the FSTRA provides additional information on reproductive effects for comparison with established fish chronic No Observed Effects Concentrations (NOECs) used in risk assessment (for pesticides these are typically from fish early life stage studies). This is because the FSTRA covers only a limited proportion of the fish lifecycle, uses low numbers of animals and replicates, and includes only a limited number of treatment levels spaced by a large factor. It has also long been demonstrated (McKim, 1977; Woltering, 1984) and recently reconfirmed for pesticide active substances (Wheeler et al., 2014b), that fish early life stage NOECs typically cover the sensitivity of fish reproduction from fish full lifecycle tests. Therefore, the analyses presented in this paper support and extend the conclusion that there should be only cautious application of data from FSTRAs for purposes other than screening for potential endocrine activity. The database for solvent control treatments was relatively small and there was only limited evidence for potential solvent effects at the levels employed in the studies. Importantly, there was no evidence for a VTG induction in studies using DMF, as had previously been suggested in the literature (Hutchinson et al., 2006). Nonetheless, the guideline
recommendations against use of solvent delivery if at all practically feasible, or to minimise the solvent concentration, seem prudent. The reality of performing these complex studies, often with substances that can be considered ‘difficult to test’, is that degradation in the test system might be expected, or the test substance physicochemical properties are not amenable to delivery with a flow-through system. There is a trade-off between the advantages of solvent-free delivery (reducing the number of treatments and potential interactions) and the ability to maintain stable concentrations or prepare a suitably concentrated stock solution (Green and Wheeler, 2013). Indeed, the EDSP greatly encouraged the further development and use of non-solvent delivery flow-through systems such as generator columns, sonication, large volume (typically dilute) saturated aqueous stock solutions, and passive dosing (see OECD, 2019 for descriptions of these methods). However, the fact that solvents were employed in almost a quarter of the studies for the program demonstrates the importance of solvent delivery for such study types and chemistries. In any case, the evaluations and determinations by USEPA appear to accept these limitations and have not hampered the use of these data in regulatory decision making. There were statistically significant differences in the results reported by different laboratories for all endpoints, except survival. HCD should therefore be developed on a laboratory-by-laboratory basis to aid better interpretation of new study data. Potentially useful HCD ranges could be developed for survival, body weight and length, GSI, fertilisation success, and male tubercle score. HCD for these endpoints can be reliably used, in association with stated OECD and OCSPP test acceptability criteria, to interpret FSTRA data. 
For example, a small but statistically significant effect on fertilisation success might not be considered of biological significance if fertilisation success remained ≥95%.

Ankley and Villeneuve (2006) identify two “unique perspectives” of this assay from a regulatory point of view: i) biomarkers such as VTG are included; and ii) reproduction is implicitly considered in a relatively short-term assay. Unfortunately, both VTG and fecundity are highly variable endpoints in the FSTRA, so only large effects can be detected with the current assay design. High intra- and inter-laboratory control variability for VTG and fecundity also means that HCD for these endpoints are of limited use during study interpretation, because almost any value is likely to fall within the very wide HCD ranges. This underscores the need for careful evaluation of such data. Here it may be particularly important to use within-study weight-of-evidence with due consideration of outlier analyses (particularly for VTG, where a time course is not typically available) to ensure that robust conclusions are drawn.

This analysis of published FSTRA summary data shows the potential use of HCD for several parameters (survival, body weight and length, GSI, fertilisation success, and male tubercle score), but not for others (fecundity and VTG), primarily because both intra- and inter-laboratory variability is very high. Variability in completed tests should be monitored regularly so that Test Guidelines and any guidance on test interpretation can be updated to reflect knowledge gained from practical application of the guidelines.
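The link between high control variability and the ability to detect only large effects can be illustrated with a rough calculation. The sketch below is not the power analysis used in this paper: it assumes a simple two-sided comparison of two equally sized groups with equal variance, uses a normal approximation (which will understate the detectable change at very low replication), and takes four replicates per group as an assumed design. The function name `minimum_detectable_change` and the coefficient of variation (CV) values are hypothetical and chosen for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_change(cv_percent, n_per_group, alpha=0.05, power=0.80):
    """Approximate smallest change (as % of the control mean) detectable in a
    two-sided comparison of two groups with equal variance and equal size.

    Normal approximation: delta = (z_{1-alpha/2} + z_{power}) * CV * sqrt(2 / n).
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile for the target power
    return (z_alpha + z_power) * cv_percent * sqrt(2 / n_per_group)


if __name__ == "__main__":
    # Hypothetical control CVs (%), for illustration only (not values from this paper).
    for cv in (10, 50, 100):
        mdc = minimum_detectable_change(cv, n_per_group=4)  # 4 replicates is an assumption
        print(f"CV {cv:>3}% -> minimum detectable change ~{mdc:.0f}% of control mean")
```

Under these assumptions the detectable change scales linearly with the control CV and shrinks only with the square root of the replicate number, which is why endpoints with very high control variability need large effects or substantially more replication to give a statistically significant result.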

Funding

James R Wheeler and Pablo Valverde-Garcia were employed by Corteva Agriscience during the completion of this work. Mark Crane is an independent consultant and was funded for his work on this project by Corteva Agriscience.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.yrtph.2019.104424.

References

Ankley, G.T., Gray, L.E., 2013. Cross-species conservation of endocrine pathways: a critical analysis of Tier 1 fish and rat screening assays with 12 model chemicals. Environ. Toxicol. Chem. 32, 1084–1087.

Ankley, G.T., Villeneuve, D.L., 2006. The fathead minnow in aquatic toxicology: past, present and future. Aquat. Toxicol. 78, 91–102.

Brooks, A., Foudoulakis, M., Schuster, H.S., Wheeler, J.R., 2019. Historical control data for the interpretation of ecotoxicity data: are we missing a trick? Submitted Ecotoxicology.

Burden, N., Gellatly, N., Benstead, R., Benyon, K., Blickley, T.M., Clook, M., Doyle, I., Edwards, P., Handley, J., Katsiadaki, I., Lillicrap, A., Mead, C., Ryder, K., Salinas, E., Wheeler, J., 2017. Reducing repetition of regulatory vertebrate ecotoxicology studies. Integr. Environ. Assess. Manag. 13, 955–957.

Campbell, P.M., Hutchinson, T.H., 1998. Wildlife and endocrine disrupters: requirements for hazard identification. Environ. Toxicol. Chem. 17, 127–135.

Coady, K.K., Biever, R.C., Denslow, N.D., Gross, M., Guiney, P.D., Holbech, H., Karouna-Renier, N.K., Katsiadaki, I., Krueger, H., Levine, S.L., Maack, G., Williams, M., Wolf, J.C., Ankley, G.T., 2017. Current limitations and recommendations to improve testing for the environmental assessment of endocrine active substances. Integr. Environ. Assess. Manag. 13, 302–316.

Coady, K.K., Lehman, C.M., Currie, R.J., Marino, T.A., 2014. Challenges and approaches to conducting and interpreting the amphibian metamorphosis assay and the fish short-term reproduction assay. Birth Defects Res. Part B 101, 80–89.

Conolly, R.B., Ankley, G.T., Cheng, W., Mayo, M.L., Miller, D.H., Perkins, E.J., Villeneuve, D.L., Watanabe, K.H., 2017. Quantitative adverse outcome pathways and their application to predictive toxicology. Environ. Sci. Technol. 51, 4661–4672.

Dang, Z., Traas, T., Vermeire, T., 2011. Evaluation of the fish short term reproduction assay for detecting endocrine disruptors. Chemosphere 85, 1592–1603.

[EC] European Commission, 2013. Commission Regulation (EU) No 283/2013 of 1 March 2013 setting out the data requirements for active substances, in accordance with Regulation (EC) No 1107/2009 of the European Parliament and of the Council concerning the placing of plant protection products on the market. Off. J. Eur. Pat. Off. Union 93, 1–84.

[ECHA/EFSA], 2018. European Chemicals Agency (ECHA) and European Food Safety Authority (EFSA) with support from the Joint Research Centre (JRC). In: Guidance for the Identification of Endocrine Disruptors in the Context of Regulations (EU) No 528/2012 and (EC) No 1107/2009, (Pre-publication version; June 2018).

Feifarek, D.J., Shappell, N.W., Schoenfuss, H.L., 2018. Do environmental factors affect fathead minnow (Pimephales promelas) response to estrone? Part 1. Dissolved oxygen and sodium chloride. Sci. Total Environ. 610/611 1282-1270.

Green, J., Wheeler, J.R., 2013. The use of carrier solvents in regulatory aquatic toxicology testing: practical, statistical and regulatory considerations. Aquat. Toxicol. 144/145, 242–249.

Hutchinson, T., Ankley, G., Segner, H., Tyler, C., 2006. Screening and testing for endocrine disruption in fish – biomarkers as signposts not traffic lights in risk assessment. Environ. Health Perspect. 114, 106–114.

Hutchinson, T.H., Wheeler, J.R., Gourmelon, A., Burden, N., 2016. Promoting the 3Rs to enhance the OECD fish toxicity testing framework. Regul. Toxicol. Pharmacol. 76, 231–233.

Kidd, K.A., Blanchfield, P.J., Mills, K.H., Palace, V.P., Evans, R.E., Lazorchak, J.M., Flick, R.W., 2007. Collapse of a fish population after exposure to a synthetic estrogen. Proc. Natl. Acad. Sci. USA 104, 8897–8901.

Lech, J., Lewis, S., Ren, L., 1996. In vivo estrogenic activity of nonylphenol in rainbow trout. Fundam. Appl. Toxicol. 30, 229–232.

Lech, J., 1997. Letter to the editor. Chem. Biol. Interact. 108, 135.

Marty, S., Mihaich, E., Levene, S., Ortego, L., Yi, S., Wheeler, J., Green, R., Borgert, C., Hannas, B., Coady, C., Zorilla, L., 2018. Distinguishing between primary endocrine-mediated effects of endocrine active substances versus secondary endocrine changes caused by general systemic toxicity or indirectly by non-endocrine specific toxicities. Regul. Toxicol. Pharmacol. 99, 142–158.

McKim, J., 1977. Evaluation of tests with early life stages of fish for predicting long-term toxicity. J. Fish. Res. Board Can. 34, 1148–1154.

Mihaich, E.M.,S.C., Dreier, D.A., Hecker, M., Ortego, L., Kawashima, Y., Dang, Z.-C., Solomon, K., 2017. Challenges in assigning endocrine-specific modes of action: recommendations for researchers and regulators. Integr. Environ. Assess. Manag. 13, 280–292.

[OECD] Organization for Economic Cooperation and Development, 2000. OECD Guideline for the Testing of Chemicals 215: Fish Juvenile Growth Test. Adopted 21 January 2000, Paris, France.

[OECD] Organization for Economic Cooperation and Development, 2006a. Report of the Initial Work towards the Validation of the 21-day Fish Screening Assay for the Detection of Endocrine Active Substances (Phase 1a). Series on Testing and Assessment Number 60. ENV/JM/MONO(2006)27, Paris, France, 61 pp.

[OECD] Organization for Economic Cooperation and Development, 2006b. Report of the Validation of the 21-day Fish Screening Assay for the Detection of Endocrine Substances (Phase 1b). Series on Testing and Assessment Number 61. ENV/JM/MONO(2006)29, Paris, France, 140 pp.

[OECD] Organization for Economic Cooperation and Development, 2007. Final Report of the Validation of the 21-day Fish Screening Assay for the Detection of Endocrine Active Substances. Phase 2: Testing Negative Substances. Series on Testing and Assessment Number 78. ENV/JM/MONO(2007)25, Paris, France, 73 pp.

[OECD] Organization for Economic Cooperation and Development, 2012a. OECD Guideline for the Testing of Chemicals 229: Fish Short Term Reproduction Assay. Adopted 2 October 2012, Paris, France.

[OECD] Organization for Economic Cooperation and Development, 2012b. OECD Guideline for the Testing of Chemicals 230: 21-day Fish Assay: A Short-Term Screening for Oestrogenic and Androgenic Activity, and Aromatase Inhibition. Adopted 2 October 2012, Paris, France.

[OECD] Organization for Economic Cooperation and Development, 2012c. Fish Toxicity Testing Framework. Paris, France.

[OECD] Organization for Economic Cooperation and Development, 2013. OECD Guideline for the Testing of Chemicals 210: Fish Early-Life Stage Toxicity Test. Adopted 26 July 2013, Paris, France.

[OECD] Organization for Economic Cooperation and Development, 2018. The OECD Conceptual Framework for Testing and Assessment of Endocrine Disrupters. Paris, France.

[OECD] Organization for Economic Cooperation and Development, 2019. Guidance Document on Aqueous-phase Aquatic Toxicity Testing of Difficult Test Chemicals. Series on Testing and Assessment No. 23, second ed. ENV/JM/MONO(2000)6/REV. Paris, France. 81 pp.

[OPPTS] Office of Chemical Safety and Pollution Prevention, 2009. Fish short-term reproduction assay. In: Endocrine Disruptor Screening Program Test Guidelines OPPTS 890.1350. United States Environmental Protection Agency.

Oris, J.T., Belanger, S.E., Bailer, J.A., 2012. Baseline characteristics and statistical implications for the OECD 210 fish early-life stage chronic toxicity test. Environ. Toxicol. Chem. 31 (2), 370–376.

Ren, L., Meldahl, A., Lech, J., 1996. Dimethyl formamide (DMFA) and ethylene glycol (EG) are estrogenic in rainbow trout. Chem. Biol. Interact. 102, 63–67.

Rosner, B., 2011. Fundamentals of Biostatistics, seventh ed. Brooks/Cole, Boston, MA.

Schapaugh, A.W., McFadden, L.G., Zorilla, L.M., Geter, D.R., Stuchal, L.D., Sunger, N., Borgert, C.J., 2015. Analysis of EPA's endocrine screening battery and recommendations for further review. Regul. Toxicol. Pharmacol. 72, 552–561.

Shappell, N.W., Feifarek, D.J., Rearick, D.C., Bartell, S.E., Schoenfuss, H.L., 2018. Do environmental factors affect fathead minnow (Pimephales promelas) response to estrone? Part 2. Temperature and food availability. Sci. Total Environ. 610/611, 32–43.

Stroup, W., 2012. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. CRC Press, London.

Tyler, C.R., Jobling, S., Sumpter, J.P., 2008. Endocrine disruption in wildlife: a critical review of the evidence. Crit. Rev. Toxicol. 28, 319–361.

Valverde-Garcia, P., Springer, T., Kramer, V., Foudoulakis, M., Wheeler, J.R., 2018. An avian reproduction study historical control database: a tool for data interpretation. Regul. Toxicol. Pharmacol. 92, 295–302.

Watanabe, K.H., Mayo, M., Jensen, K.M., Villeneuve, D.L., Ankley, G.T., Perkins, E.J., 2016. Predicting fecundity of fathead minnows (Pimephales promelas) exposed to endocrine-disrupting chemicals using a MATLAB(R)-Based model of oocyte growth dynamics. PLoS One 11, e0146594.

Watanabe, K.H., Jensen, K.M., Orlando, E.F., Ankley, G.T., 2007. What is normal? A characterization of the values and variability in reproductive endpoints of the fathead minnow, Pimephales promelas. Comp. Biochem. Physiol., C 146, 348–356.

Wheeler, J.R., Coady, K., 2016. Are all chemicals endocrine disruptors? Integr. Environ. Assess. Manag. 12 (2), 402–403.

Wheeler, J.R., Gimeno, S., Crane, M., Lopez-Juez, E., Morritt, D., 2005. Vitellogenin: a review of analytical methods to detect (anti) estrogenic activity in fish. Toxicol. Mech. Methods 15, 293–306.

Wheeler, J.R., Panter, G.H., Weltje, L., Thorpe, K.L., 2013. Test concentration setting for fish in vivo endocrine screening assays. Chemosphere 92, 1067–1076.

Wheeler, J.R., Weltje, L., Green, R.M., 2014a. Mind the gap: concerns using endpoints from endocrine screening assays in risk assessment. Regul. Toxicol. Pharmacol. 69, 289–295.

Wheeler, J.R., Maynard, S.K., Crane, M., 2014b. An evaluation of fish early life stage tests for predicting reproductive and longer-term toxicity from plant protection product active substances. Environ. Toxicol. Chem. 33 (8), 1874–1878.

Woltering, D., 1984. The growth response in fish chronic and early lifestage toxicity tests: a critical review. Aquat. Toxicol. 5, 1–21.

Zar, J.H., 2010. Biostatistical Analysis, fifth ed. Prentice Hall, pp. 944.
