Statistical analysis of the hen's egg test for micronucleus induction (HET-MN assay)

Mutation Research 757 (2013) 68–78
Ludwig A. Hothorn a, Kerstin Reisinger b, Thorsten Wolf c, Albrecht Poth d, Dagmar Fieblinger e, Manfred Liebsch e, Ralph Pirow e,∗

a Leibniz Universität Hannover, Herrenhäuser Straße 2, 30419 Hannover, Germany
b Henkel AG & Co KGaA, Henkelstraße 67, 40589 Düsseldorf, Germany
c Universität Osnabrück, Albrechtstraße 28, 49076 Osnabrück, Germany
d Harlan CCR Research GmbH, In den Leppsteinswiesen 19, 64380 Roßdorf, Germany
e Bundesinstitut für Risikobewertung, Max-Dohrn-Straße 8-10, 10589 Berlin, Germany

Article history: Received 16 January 2013; Received in revised form 24 April 2013; Accepted 28 April 2013; Available online 26 July 2013

Keywords: Williams-trend test; Historical control; HET-MN assay

Abstract

The HET-MN assay (hen’s egg test for micronucleus induction) is different from other in vitro genotoxicity assays in that it includes toxicologically important features such as absorption, distribution, metabolic activation, and excretion of the test compound. As a promising follow-up to complement existing in vitro test batteries for genotoxicity, the HET-MN is currently undergoing a formal validation. To optimize the validation, the present study describes a critical analysis of previously obtained HET-MN data to check the experimental design and to identify the most appropriate statistical procedure to evaluate treatment effects. Six statistical challenges (I–VI) of general relevance were identified, and remedies were provided which can be transferred to similarly designed test methods: a Williams-type trend test is proposed for overdispersed counts (II) by means of a square-root transformation which is robust for small sample sizes (I), variance heterogeneity (III), and possible downturn effects at high doses (IV). Due to near-to-zero or even zero-count data occurring in the negative control (V), a conditional comparison of the treatment groups against the mean of the historical controls (VI) instead of the concurrent control was proposed, which is in accordance with US-FDA recommendations. For the modified Williams-type tests, the power can be estimated depending on the magnitude and shape of the trend, the number of dose groups, and the magnitude of the MN counts in the negative control. The experimental design used previously (i.e. six eggs per dose group, scoring of 1000 cells per egg) was confirmed. The proposed approaches are easily available in the statistical computing environment R, and the corresponding R-codes are provided. © 2013 Elsevier B.V. All rights reserved.

1. Introduction

In vitro methods as an alternative to animal testing play an important role in the safety assessment of chemicals. Examples illustrating the need for alternatives are the testing requirements of the European Chemicals Regulation (EC) No 1907/2006 (REACH) as well as the recent Cosmetics Regulation (EC) No 1223/2009, which foresees a complete ban on animal experiments for the safety assessment of cosmetic ingredients. The genotoxic potential of substances is currently analyzed by an in vitro test battery comprising different bacterial tests and mammalian cell-based assays [1,2]. The relatively high number of false-positive results (65–95%) [2,3] for different test-battery combinations has led to multiple efforts to optimize the strategy for

∗ Corresponding author. E-mail address: [email protected] (R. Pirow).
1383-5718/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.mrgentox.2013.04.023

testing genotoxicity [2]. The measures to reduce the rate of false-positives include, among others, the development of new in vitro methods. One of the promising methods to complement existing in vitro test batteries for genotoxicity is the hen’s egg test for micronucleus induction (HET-MN), which mirrors absorption, distribution, metabolic activation, and excretion, and thereby the systemic availability of a test compound [4–8]. The substance is directly applied onto the inner shell membrane above the highly vascularized chorioallantoic membrane of the egg [8]. After incubation, the nucleated peripheral erythrocytes are screened for the presence of micronuclei. The HET-MN is able to detect both clastogenic and aneugenic effects and does not fall under the scope of Directive 2010/63/EU on the protection of animals used for scientific purposes or under any other national legislation. The performance of the HET-MN in terms of transferability, reproducibility and predictivity is currently analyzed in a validation study in three participating laboratories.

For the upcoming blinded phase of the HET-MN inter-laboratory study, the term ‘validation’ is used because this final phase is designed according to internationally agreed principles for the validation of toxicological test methods [16]: standard operating procedures (SOPs) developed and optimized in previous phases are now fixed; coded (blinded) test chemicals will be tested in three laboratories (and each test in each laboratory will be independently repeated at least two times); and a pre-defined statistical procedure is available, which involves the application of two competing prediction models (PMs) to classify test results as genotoxic or non-genotoxic. The application of two competing PMs may seem unusual but has been used previously in formal validation studies, finally resulting in tests accepted by the OECD. One example is the validation study on skin-corrosion tests, where EPISKIN entered the study with two alternative PMs [17]. During the development of the HET-MN, a first prediction model based on a combined threshold-and-testing approach was established [8]. To optimize the validation study, and to support the scientific and regulatory acceptance of the method, the experimental design and the statistical procedure to evaluate the treatment effects have been subjected to thorough analysis. Data from the previous method-transfer phase have been used to (i) explore the applicability and performance of alternative statistical approaches and to (ii) confirm the experimental design before the start of the blind-testing phase. The present paper describes the results of this evaluation and method-selection process. For the HET-MN, we have identified six statistical issues which are detailed below and for which an adequate overall solution is proposed. The operation of the proposed testing procedure is illustrated by three statistically challenging examples that are, however, not necessarily representative of the majority of the HET-MN data.

Finally, an approach is presented to easily calculate the power of the proposed testing procedure, given the experimental design and the response profile. The selection of the best-suited prediction model will be made after the experimental phase of the validation study on the basis of all acquired data sets, which is in line with the principles of the European Union Reference Laboratory for alternatives to animal testing (EURL ECVAM) [9]. Since the approach described below is advanced and of general importance, it may be of first priority after the validation.

Fig. 1. Example data set of the HET-MN assay showing the number of micronucleated cells (MN) in relation to the treatment (NC: negative control, D1–3: dose groups, PC: positive control). Data are shown as raw data (filled circles), boxplots, and means ± standard deviation. The numbers above the x-axis indicate the sample sizes per group.

1.1. The statistical challenges of the HET-MN assay

The experimental design of the HET-MN is a one-way layout with a minimum of three doses that are compared against a solvent control, which is considered as the negative control (Fig. 1). A positive control (PC) is also included to demonstrate assay sensitivity. The randomized unit is the egg, and six eggs per group are commonly used in a balanced design. Such a number of units per treatment is not uncommon in toxicology; even lower sample sizes are used in other genotoxicity assays, such as the triplicates in the Ames test. Nevertheless, the small sample size (from a statistical point of view) is a first issue (No. I) (see [10] for further statistical details). The endpoint is the number of micronucleated erythrocytes (MN), which can be considered as a ‘count’ since the number of scored cells (1000 per egg) is constant and the incidence rates are low. Although count data can follow a normal distribution under certain circumstances, binomial or Poisson distributions are the statistical distributions of choice for the analysis of such data, and they have already been used earlier for the MN assay [11]. A test for counts based on binomial or Poisson distributions is straightforward, but the possible presence of between-egg heterogeneity (in combination with small sample sizes) is a second point that should be taken into account (No. II) [12]. In models with Poisson distribution, for example, this behavior is associated with variances greater

than the expected value and, therefore, with dispersion parameters greater than unity, which is a violation of the model assumptions on the variance-to-mean relationship. The presence of overdispersion suggests the application of generalized linear models with quasi-binomial or quasi-Poisson distributions [12]. An alternative is the generalized linear mixed-effects model with binomial or Poisson distributions, in which overdispersion is accounted for by a random effect (see [13], particularly the example under 6.2). In all such models, simultaneous inference for overdispersed counts by multi-group test procedures such as Dunnett or Williams (see below) is asymptotically (i.e. for large sample sizes) possible. However, both testing procedures are numerically unstable (or do not converge) for small sample sizes in combination with overdispersion. Moreover, all these modeling approaches may appear opaque to non-statisticians. When a toxic response occurs, an increase in both mean and variance is rather typical (Fig. 1). Many multi-group test procedures such as Dunnett, which compares the treatments against the control, normally assume a homogeneous variance across all treatment levels. The third issue (No. III) is therefore to find a test that is robust against variance heterogeneity across the treatment levels. The US-NTP [14] recommends the use of the Williams procedure for the statistical evaluation of related toxicological assays, with the claim of a significant increasing trend (against the negative control) in the case of a positive response. In general, a positive call for genotoxicity can only be maintained if the cytotoxicity of the test compound is analyzed in parallel. The cytotoxicity test not only determines the maximum dose for the evaluation of genotoxic effects, but also demonstrates the bioavailability of the substance [8], thereby avoiding false-negative decisions.
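The overdispersion check described above can be sketched in R with a quasi-Poisson generalized linear model. This is an illustrative sketch with invented counts, not the paper's Appendix A2 snippet; a fitted dispersion parameter clearly above 1 indicates variances exceeding the Poisson assumption.

```r
# Hypothetical MN counts for a negative control and three dose groups
# (six eggs each); the numbers are invented for illustration only.
d <- data.frame(
  group = factor(rep(c("NC", "D1", "D2", "D3"), each = 6),
                 levels = c("NC", "D1", "D2", "D3")),
  mn    = c(0, 1, 0, 2, 1, 0,   1, 0, 2, 1, 3, 1,
            2, 4, 1, 3, 6, 2,   5, 9, 3, 12, 6, 8)
)

# Quasi-Poisson GLM: the dispersion parameter is estimated from the data
# instead of being fixed at 1 as in a plain Poisson model.
fit <- glm(mn ~ group, family = quasipoisson(link = "log"), data = d)
summary(fit)$dispersion  # > 1 suggests overdispersion relative to Poisson
```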
Despite controlling for cytotoxicity, downturn effects near cytotoxic doses are nevertheless possible for a multitude of other (e.g., toxicokinetic) reasons, which raises the fourth issue (No. IV) of making the Williams-type trend test robust against downturn effects. The occurrence of MN under conditions of the negative control is a rare event. The counts in the negative control, therefore, tend to show small values, even zero, which is not surprising for the outcome of a pathological process, but raises the fifth issue (No. V) of analyzing results with near-to-zero counts. Here, the common comparison of dose groups vs control requires special attention as inference for ratio-to-control is either not feasible or unstable, and the inference for difference-to-control is highly sensitive to

the number of zero values, which is not a desired property (see, e.g., [15]). To overcome the difficulty associated with near-to-zero counts in the negative control, a series of historical controls is available for the HET-MN (Fig. 2), so that this information could be used instead of, or in addition to, the concurrent control. The sixth issue (No. VI) is then to test the concurrent control against the historical controls in order to decide whether the concurrent control or the historical controls are to be used to evaluate the treatment effect.

To handle the six statistical challenges of the HET-MN assay (Table 1), the present paper proposes (i) a square-root transformation of the counts into pseudo-normally distributed data and (ii) a Williams-type trend test which is robust for small sample sizes and between-egg heterogeneity, variance heterogeneity across the treatment levels, and downturn effects at high doses. To tackle the issue of near-to-zero counts in the negative control, a further modification of the Williams-type trend test is proposed, which conditionally compares the treatment groups against the mean of the historical controls, not against the concurrent control. The performance of the derived Williams-type procedures is compared with that of the well-known Dunnett procedure and the standard (simple) Williams procedure. A software implementation of the proposed procedures is available in the statistical computing environment R.

Table 1
Statistical challenges of the HET-MN assay.

No.    Issue
I      Small sample size
II     Possible between-egg heterogeneity
III    Variance heterogeneity across treatment levels
IV     Possible downturn effects at high doses
V      Near-to-zero counts in the negative control
VI     Consideration of historical controls

Fig. 2. Historical control data from three laboratories involved in the HET-MN validation study. The mean of the negative controls (circles) and positive controls (triangles) of successive experiments is shown. White, gray and black symbols indicate the data of the three laboratories. Note that the y-axis was scaled by means of the Freeman-Tukey square-root transformation.

2. Materials and methods

2.1. Data sets

Example data sets and a set of historical control data (see Appendix A1) were taken from the method-transfer phase, which was performed to prepare the ongoing HET-MN validation study. The three data sets are used as examples for overdispersion (Fig. 1), for a downturn effect at high doses (Fig. 3), and for only-zero counts in the negative control (Fig. 5). The data presented here were selected to give the best support to the statistical approach outlined below, and not to mirror the majority of the data sets obtained.

2.2. Data transformation

Transforming counts into pseudo-normally distributed endpoints is common in toxicology [e.g., 18,19] and has, for example, recently been used in the BALB/c 3T3 cell-transformation assay [20] and the Ames fluctuation assay [21]. Different transformations are available; for overdispersed near-to-zero counts (x_ij), the Freeman-Tukey (FT) root transformation [22] can be recommended [23]:

x_{ij}^{FT} = \sqrt{x_{ij}} + \sqrt{x_{ij} + 1}

where i represents the group index and j the egg index. This transformation has a variance-homogenizing effect (cf. Fig. 3) and is, therefore, particularly appropriate for heteroscedastic data, so that further adjustments for heterogeneous variances, such as those obtained via the so-called sandwich estimator [24], are not needed.

Fig. 3. Example data set with a downturn effect at the highest dose. Data are shown as untransformed counts (A) and as FT-transformed values (B). Note the variance-homogenizing effect of the FT-transformation in (B).

2.3. Standard Dunnett and Williams procedures

The Dunnett and Williams procedures are introduced here as standard procedures from which the Dunnett–Williams and Umbrella–Williams procedures are derived (see below). The procedures can be formulated for the FT-transformed individual counts x_{ij}^{FT} (with the group means \bar{x}_i^{FT}) as multiple contrast tests (MCT):

t_q = \frac{\sum_{i=0}^{k} c_{qi} \, \bar{x}_i^{FT}}{S \sqrt{\sum_{i=0}^{k} c_{qi}^2 / n_i}}

where t_MCT = max(t_1, ..., t_Q) is jointly (t_1, ..., t_Q) Q-variate t-distributed with common degrees of freedom df and a correlation matrix R, where R is a function of the contrast coefficients c_qi and the group sizes n_i only. Here, k is the number of treatment levels, q is the contrast index, Q the number of contrasts, and S^2 is the pooled variance estimator. The contrast coefficients c_qi are the elements of the contrast matrix, which defines how the treatments (D_i) are compared to the negative control. The contrast matrix can be simplified for a balanced design with three doses (D1, D2 and D3)

as multiple contrasts (Ca, Cb and Cc) for the Dunnett procedure [25] and for the Williams procedure [26]:

Dunnett procedure:
cqi   NC    D1    D2    D3
Ca    −1     0     0     1
Cb    −1     0     1     0
Cc    −1     1     0     0

Williams procedure:
cqi   NC    D1    D2    D3
Ca    −1     0     0     1
Cb    −1     0    1/2   1/2
Cc    −1    1/3   1/3   1/3

Both contrast matrices contain sum-to-zero contrasts, which means that the coefficients in each row must sum up to zero. The statistical evaluation of a given contrast is based on the weighted sum of all group responses, where the contrast coefficients are used as multiplicative weights (e.g., −1, 0, 1/2, 1/2). In the Dunnett procedure, each treatment is separately compared to the control, such as the highest dose D3 against the negative control for the first contrast Ca. The Williams procedure also includes the comparison of the highest dose against the control in the contrast Ca. The remaining contrasts (Cb and Cc), however, use a weighted average of the treatment effects and thus compare pooled treatment estimates with the control.

2.4. Dunnett–Williams and Umbrella–Williams procedures

The Dunnett–Williams procedure integrates a test for any change in the treatment levels (Dunnett) with a test for a trend against the control (standard Williams). This is achieved by combining the contrast matrices of both standard procedures by row, as shown below by the Dunnett–Williams contrast matrix for a balanced design with three doses [27]. A further modification is the downturn-protected Williams-type procedure called Umbrella–Williams [28]:

Dunnett–Williams procedure:
cqi   NC    D1    D2    D3
Ca    −1     0     0     1
Cb    −1     0    1/2   1/2
Cc    −1    1/3   1/3   1/3
Cd    −1     0     1     0
Ce    −1     1     0     0

Umbrella–Williams procedure:
cqi   NC    D1    D2    D3
Ca    −1     0     0     1
Cb    −1     0    1/2   1/2
Cc    −1    1/3   1/3   1/3
Cd    −1     0     1     0
Ce    −1    1/2   1/2    0
Cf    −1     1     0     0

2.5. Williams-type procedure using historical controls

The four procedures introduced above are useful, except for cases where the (concurrent) negative control tends to be zero or near-to-zero. Under these circumstances, it is appropriate to take the historical negative controls into account because the sensitivity of the test against the concurrent negative control depends heavily on the number of zero values. For the Williams-type approach, the procedure against the arithmetic mean (ϑ) of the individual arithmetic means of the FT-transformed historical negative controls can be formulated as follows [29]:

t_q^{modified} = \frac{\sum_{i=1}^{k} c_{qi} \, \bar{x}_i^{FT} - \vartheta}{S_{historical} \sqrt{\sum_{i=1}^{k} c_{qi}^2 / n_i}}

where S_historical is the joint variance estimator for all historical studies.

Fig. 4. Dependence of the power of the Williams-type procedure on the sample size, the shape of the dose-response profile (linear and convex), and the magnitude of the expected value in the negative control NC (values on the untransformed scale). The subfigures show the power for expected numbers of MN (per 1000 cells) in the negative control of 0 (A), 0.2 (B) and 0.5 (C).

Fig. 5. Example data set with only zero counts in the negative control. Note that the data are shown on the FT-transformed scale, where the values of the negative controls are all 1.

2.6. Statistical software and code snippets

All analyses were carried out in the statistical computing environment R [30] by use of the R package multcomp [31], which contains predefined contrast matrices for the Dunnett, Williams, and Umbrella–Williams procedures and additionally accepts user-defined contrast matrices (e.g., for the Dunnett–Williams procedure). The power of the multiple-contrast testing procedures was calculated in a closed form [32] by use of the R package MCPAN [33]. R code snippets are given in the Appendix for a generalized linear model with quasi-Poisson link (A2), for the application of the Dunnett and Williams procedures to FT-transformed data (A3), and for the performance comparison of all four procedures when applied to a data example with a downturn effect at the highest dose (A4). Code snippet A5 illustrates the implementation of the Williams-type procedure with historical controls by means of the parameter rhs in the function glht of the R package multcomp. Code snippet A6 shows a pairwise-test modification with conditional consideration of historical controls.
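As a minimal illustration of the workflow described in sections 2.2 and 2.3 (FT transformation followed by a Williams-type multiple contrast test), the following R sketch uses invented counts. It parallels, but is not identical to, the code in Appendix A3; `contrMat` and `glht` are the multcomp functions named above.

```r
library(multcomp)

# Invented example counts (not HET-MN data): NC and three dose groups,
# six eggs per group, as in the balanced design discussed in the text.
d <- data.frame(
  group = factor(rep(c("NC", "D1", "D2", "D3"), each = 6),
                 levels = c("NC", "D1", "D2", "D3")),
  mn    = c(0, 1, 0, 2, 1, 0,   1, 0, 2, 1, 3, 1,
            2, 4, 1, 3, 6, 2,   5, 9, 3, 12, 6, 8)
)

# Freeman-Tukey square-root transformation (section 2.2)
d$mnFT <- sqrt(d$mn) + sqrt(d$mn + 1)

# One-way layout on the FT-transformed scale
fit <- aov(mnFT ~ group, data = d)

# Predefined Williams contrasts; "Dunnett" or "UmbrellaWilliams" can be
# requested analogously via contrMat(), cf. section 2.6.
K <- contrMat(table(d$group), type = "Williams")

# One-sided (increasing) Williams-type multiple contrast test
summary(glht(fit, linfct = mcp(group = K), alternative = "greater"))
```

A user-defined matrix (e.g., the Dunnett–Williams rows shown in section 2.4) can be passed in place of `K` to reproduce the combined procedure.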

3. Results and discussion

Assay data from the method-transfer phase of the HET-MN validation study were subjected to a thorough statistical evaluation to identify robust testing procedures and to confirm the experimental design. As a result of this exploratory process, a square-root transformation of counts into pseudo-normally distributed data and the application of a Williams-type trend test (of which different modifications are available) were recommended. The performance of the different Williams-type procedures is first demonstrated for the standard Williams procedure in comparison with the well-known Dunnett procedure. The performance comparison is then extended to derived Williams-type procedures, which offer protection against downturn effects at high doses, or use historical data to deal with the issue of near-to-zero counts in the negative control.

3.1. Comparison of the standard Williams and Dunnett procedures

The standard Williams and Dunnett procedures were applied to the square-root-transformed counts of the data example in Fig. 1, comprising a balanced design with a negative control and three doses. In the following, the principle of both procedures is explained by means of the contrast matrices introduced above, which encode how the treatments are compared with the negative control. The application of the standard Williams and Dunnett procedures yields multiplicity-adjusted p-values for the individual contrasts as shown in Table 2. The corresponding R code is given in Appendix A3.

Table 2
P-values of Williams- and Dunnett-type contrasts against the concurrent control. Multiple comparisons were applied to the FT-transformed data of the example in Fig. 1. The contrast coefficients (NC, D1, D2, D3) are additionally given for the balanced design of the example data.

           Dunnett procedure                    Williams procedure
Contrast   Contrast coefficients   p-Value     Contrast coefficients     p-Value
Ca         (−1, 0, 0, 1)           0.009       (−1, 0, 0, 1)             0.006
Cb         (−1, 0, 1, 0)           0.773       (−1, 0, 1/2, 1/2)         0.085
Cc         (−1, 1, 0, 0)           0.934       (−1, 1/3, 1/3, 1/3)       0.272

The interpretation of the p-values in Table 2 is as follows. First, because some p-values are less than 0.05, a significant global trend exists. A monotone trend, i.e. an increasing frequency of MN with increasing dose against the zero-dose control, can therefore be concluded. Second, the smallest p-value for both procedures is for contrast Ca, the comparison of the highest dose D3 against the control, indicating that the trend is dominated by D3. Whether the highest dose alone contributes to the treatment effect can be answered by use of the Dunnett procedure: inspection of the p-values of the contrasts Cb and Cc shows that only the highest dose is significantly increased against the control. The Dunnett procedure allows statements on individual treatment-to-control comparisons, whereas the Williams procedure allows a trend statement. Remarkably, for monotonically ordered data (i.e.
data with a trend), the p-values of the Williams procedure are always smaller than the corresponding p-values of the Dunnett procedure. This is demonstrated in Table 2 for each contrast.

3.2. A Williams-type procedure to protect against downturn effects

Neither the Dunnett procedure nor the standard Williams procedure is sufficient to cope with all possible response scenarios. The Dunnett procedure may be too insensitive for moderate trends, whereas the Williams test may fail in case of a downturn effect at high doses. A compromise solution is to combine the two into one multiple-contrast test, the so-called Dunnett–Williams procedure [27]. An alternative is the modified downturn-protected Williams procedure called Umbrella–Williams [28]. Again, the principle of these two Williams-type procedures is explained by the contrast matrices introduced above for a balanced design with one negative control and three doses. In these example contrast matrices, the increase in the number of multiple comparisons from three (Dunnett and standard Williams) to five (Dunnett–Williams) and

Table 3
Power of the Williams test, Dunnett test, Dunnett–Williams test, and Umbrella–Williams test to detect a dose-dependent increase, using simulated data based on different response profiles.

Profile   Williams   Dunnett   Dunnett–Williams   Umbrella–Williams
P1        0.83       0.76      0.78               0.77
P2        0.78       0.71      0.71               0.71
P3        0.89       0.84      0.86               0.85
P4        0.78       0.76      0.78               0.77
P5        0.60       0.72      0.72               0.72

six (Umbrella–Williams) may reduce power. However, the benefit of additional information on the individual comparisons far outweighs the loss of statistical power, which is only marginal because of the high correlation of the individual contrasts. The power of the four procedures (Dunnett, Williams, Dunnett–Williams, Umbrella–Williams) to detect a true, dose-dependent increase was determined by simulation. Simulated data were generated for five different response profiles (linear trend, hockey-stick and plateau shapes, umbrella-like profiles; see Table 3) based on four groups (negative control, D1, D2 and D3), a sample size of ni = 6 per group, and characteristic parameters for the group-specific expected values (untransformed: μNC = 0.5, μmax = 6; FT-transformed: μ0FT = 1.93, μmaxFT = 5.10) and the common standard deviation (SDFT = 2). The power was calculated with the R package MCPAN [33]. Using a family-wise (experiment-wise) type-I error rate of 0.05, a randomly generated response profile gave a positive call if at least one of the individual contrasts was significant. The results of this (any-pairs) power comparison for the five different response profiles are shown in Table 3. As long as the dose-response profile is monotone (profiles P1–P3), the Williams test is the most powerful of the four procedures. For profile P1, for example, the Williams test detects a significant increase in 83% of all simulated cases (power = 0.83; see Table 3). In case of a mild downturn effect (profile P4), the Williams test is robust due to its dose-pooling property, whereas for a strong downturn effect (profile P5) its power diminishes considerably to 0.60. The alternative tests, namely Dunnett, Dunnett–Williams, and Umbrella–Williams, are better able to detect a response with a strong downturn effect, all with essentially the same power (0.72). However, the diagnostic properties of the Dunnett–Williams and Umbrella–Williams procedures are beneficial, i.e.
they provide more individual comparisons. To demonstrate these diagnostic properties of the two modified Williams procedures, a second data example with a downturn effect is used, which nicely illustrates the variance-homogenizing effect of the FT-transformation (Fig. 3). The corresponding R code is given in Appendix A4. The p-values of the different procedures applied to this data set are given in Table 4.

Table 4
p-Values of the different test procedures applied to the data set with a downturn effect at the highest dose (Fig. 3).

Contrast   Williams   Dunnett   Dunnett–Williams   Umbrella–Williams
Ca         0.045      0.065     0.068              0.070
Cb         0.007      0.007     0.011              0.011
Cc         0.006      0.037     0.009              0.009
Cd         –          –         0.007              0.008
Ce         –          –         0.040              0.008
Cf         –          –         –                  0.041

The Williams test allows the conclusion of a global trend with a plateau effect for all doses (contrast Cc, p = 0.006). The Dunnett test detects the most significant increase for the middle dose D2 (contrast Cb, p = 0.007), as does the Dunnett–Williams test (contrast Cd, p = 0.007). The Umbrella–Williams test signals a global trend


with a minimum p-value of 0.008 for contrast Ce, which is a partial trend up to D2, detected by pooling D1 and D2. In summary, the Williams-type contrast test based on FT-transformed data is appropriate for the evaluation of the results of the HET-MN assay. Its umbrella-protected version seems to be most appropriate because of its useful diagnostic properties to detect a monotone trend or a trend up to a peak point. In addition, it permits comparisons against the negative control, all without substantial power loss compared with the Dunnett test.

3.3. Power of the standard Williams procedure

A side aspect of the present study was to check the experimental design of the HET-MN before the validation proceeds to the blind-testing phase. We estimated the power of the (standard) Williams procedure to detect a practically relevant minimum increase, given the experimental design. The approximate power of a multiple-contrast testing procedure can be calculated in a closed form [32] based on treatment-specific expected values and a common variance using the R package MCPAN [33]. Representative expected values are available from the method-transfer phase of the HET-MN validation study, and a common variance can be assumed for the HET-MN if FT-transformed variables xijFT are used. The power of the Williams-type procedure was evaluated for linear and convex dose-response profiles with three different expected values for the negative control (μNC = 0, 0.2, 0.5) and sample sizes (ni) ranging from 3 to 9 eggs per group (Fig. 4). For the linear profile, expected values of μD1 = 1, μD2 = 1.5 and μD3 = 2 (untransformed scale) were assumed for the three treatment groups. The convex profile was specified as μD1 = 0.2, μD2 = 0.2 and μD3 = 2, and a standard deviation of SDFT = 1 was chosen as a representative value for the common variance on the FT-transformed scale. An increasing (i.e. one-sided) response was chosen as the alternative hypothesis, and the (overall) significance level α was set to 0.05, indicating the acceptance of a false-positive rate of 5%.

As expected, the power increases with increasing sample size, but with diminishing slope (Fig. 4). Also as expected, the power increases with lower values in the negative control and reaches a maximum for μNC = 0 (Fig. 4A). Note that the effect of the near-to-zero behavior in NC is very strong. Furthermore, the power is higher for the linear profile than for the convex profile, although this effect is not very strong. For a representative set of parameter values (μNC = 0.2, μD3 = 2, SDFT = 1) reflecting a minimum increase of toxicological relevance, it can be concluded that a sample size of six eggs per group is sufficient for most scenarios to achieve a power of at least 80% (Fig. 4B). The dependence of the power on alternative designs, such as four instead of three doses, unbalanced sample sizes with nNC > nD, and different variance-to-effect sizes, can be easily analyzed, with the additional option of alternative trend tests such as the downturn-protected Williams procedure (not shown here).

3.4. A simple Williams-type procedure by use of historical controls

Toxicological endpoints such as MN induction represent the outcome of a specific pathological process. These endpoints are counts or proportions, they are inherently increasing, and they tend to be zero or near-to-zero in the negative control. If the concurrent negative control tends to be near-to-zero, it is appropriate to take the historical controls into account because the sensitivity of the test against the concurrent negative control strongly depends on the number of zero values. For example, whether a single micronucleus or none is observed overall among the six control eggs has a serious effect on the p-value of the test for a treatment effect (The reader may apply


Table 5
p-Values for the Williams test against concurrent and historical control(s), as applied to the data set with only-zero counts in the negative control (Fig. 5). The contrast coefficients (negative control, D1, D2, D3, D4) are additionally given for the balanced design of the example data.

Contrast   Contrast coefficients        Concurrent control   Historical controls
Ca         (−1, 0, 0, 0, 1)             0.020                0.184
Cb         (−1, 0, 0, 1/2, 1/2)         0.051                0.437
Ce         (−1, 0, 1/3, 1/3, 1/3)       0.070                0.596
Cf         (−1, 1/4, 1/4, 1/4, 1/4)     0.033                0.293

the model in Appendix A2 to data set 3, and may then replace a single zero value of the negative control by 1.)

Rather complex approaches for both proportions and counts, including Bayesian approaches, are available [34–37], but they are rarely used in practice. They are rather complex for non-statisticians and do not follow the US-FDA recommendation [38]: "The concurrent control group is always the most appropriate and important in testing drug-related increases in tumor rates in a carcinogenicity experiment. However, if used appropriately, historical control data can be very valuable in the final interpretation of the study results."

The standard operating procedure (SOP) of the HET-MN includes a validity criterion that restricts the mean of the concurrent negative control to the upper limit of the normal (2σ) range of the historical controls. Zero or near-to-zero counts in the concurrent negative control therefore conform to the SOP.

To deal with the statistically challenging zero or near-to-zero counts in the negative control, a conditional two-step approach is proposed. The key is to use the arithmetic mean of the FT-transformed historical negative controls as a standard value for the Williams-type multiple contrast tests whenever the mean of the FT-transformed concurrent negative control is below the range of the historical data (or below 1). Naive 2σ intervals [39] for FT-transformed historical data can be recommended based on a trade-off between simplicity and validity [40]. This approach does not depend on the number of historical negative controls (n_histNC), neither directly, as in the case of including the historical negative controls in the test [7], nor indirectly, provided that n_histNC is sufficiently large (e.g., >10) [36].

In the proposed conditional two-step approach, the concurrent negative control is first checked to be within the 2σ range of the historical controls. When within the range, the common Williams-type approach against the concurrent negative control is used. When below the range, a modified Williams-type procedure against the arithmetic mean of the individual arithmetic means of the FT-transformed historical negative controls is applied [29].

The proposed two-step approach is illustrated by a data example with only-zero counts in the concurrent negative control (Fig. 5). The arithmetic mean of the FT-transformed concurrent control, x̄_NC^FT, is 1.0, which is below the normal (2σ) range of the historical controls [1.23, 2.35] (see R code in Appendix A5); the modified Williams-type procedure comparing against the historical controls is therefore recommended to avoid decisions based on too small p-values (Table 5). Because of the statistical limitations of the naive 2σ intervals, it may happen that the lower limit of the historical control range is <1 on the FT-transformed scale. Even in this case, the comparison against the historical controls can be recommended for data sets with only-zero counts in the concurrent control. This pragmatic approach is unilateral, as there can be arbitrarily large but not arbitrarily small values (the counts are limited at zero).

The proposed parametric approach of conditionally comparing the doses against the concurrent and historical controls is more adequate than the non-parametric approach that was previously used for the HET-MN [7], in which the minimum of the p-values for the (exact) Wilcoxon test against the concurrent control, and


L.A. Hothorn et al. / Mutation Research 757 (2013) 68–78

Table 6
Unadjusted p-values of pairwise comparisons against the concurrent and historical control(s). The tests were applied to FT-transformed data of the example with an unusually elevated negative control (Fig. 1).

Comparison   Concurrent control   Historical controls
D3 – NC      0.003                <0.001
D2 – NC      0.526                0.101
D1 – NC      0.767                0.321

for the (asymptotic) Wilcoxon test against the historical controls was used to evaluate the treatment effect. The issues here are that (i) the familywise type-I error rate is not controlled and that (ii) the p-value of the asymptotic Wilcoxon test is additionally influenced by the number of available historical controls, which is not a desirable property. The versatility and advantage of the Williams-type procedure, for which a non-parametric version is available [41], also extends to trend analysis. Compared with the non-parametric Jonckheere-Terpstra trend test, which is sensitive to linear dose-response relationships only [42], the Williams-type procedures are additionally able to detect other types of monotone response profiles.

3.5. Simple pairwise tests with historical controls

Another kind of balance between type-I and type-II error rates is the use of pairwise tests, each at significance level α. Here, the individual p-values are not adjusted for multiple testing, and the control of the type-I error rate is therefore comparison-wise rather than family-wise (experiment-wise) [43]. A pairwise-test modification with conditional consideration of the historical controls can easily be implemented in R (see Appendix A6). Separate two-sample tests could in principle be used, but are not recommended because of the rather small sample sizes and the resulting low degrees of freedom (df). The better alternative is a procedure of pairwise tests with common df and a common variance estimator.

Our proposal is demonstrated again with the data example shown in Fig. 1. As expected, the p-values of the pairwise comparisons against the concurrent control (Table 6) are smaller than the respective p-values of the treatment-to-control comparisons of the Dunnett procedure (Table 2), indicating the influence of controlling the type-I error rate either in a comparison-wise manner (unadjusted p-values) or in an experiment-wise manner (multiplicity-adjusted p-values). The pairwise comparisons against the mean of the historical controls yield even smaller (and not reportable) p-values (Table 6) because the unusually elevated concurrent control is replaced. Non-parametric pairwise comparisons should be discussed as well, to compare with the approach previously used for the HET-MN [7]. Unfortunately, for such a shifted one-sample test only an asymptotic version is available, which causes serious problems in case of small sample sizes and tied data.

4. Conclusions

To evaluate the number of micronucleated erythrocytes (MN) in the HET-MN assay, a modified Williams-type procedure with Freeman-Tukey (FT) square-root transformed data is proposed. It compares either against the concurrent control or against the arithmetic mean of the historical controls, depending on whether the concurrent control is within or below the normal range of the historical controls. The FT-transformation ensures a simple test for overdispersed near-to-zero counts, which is particularly suitable for small sample sizes (such as six eggs per group) and possible variance heterogeneity. The modified Williams-type procedure allows the conclusion of a trend, either for global monotonicity in the dose-response relationship, or up to the peak point when a downturn effect occurs at high doses. When the mean of the FT-transformed concurrent control is smaller than the normal range of the historical controls, the new Williams-type test against a historical standard avoids false-positive outcomes. The approach is easily available in the statistical computing environment R. Subsequently, a second, more advanced statistical method was established to evaluate the test-compound effect, providing an optimized preparation of the validation study. In addition, the experimental design of the HET-MN assay for the validation was confirmed.

Conflict of interest

One of the authors (LAH) declared a conflict of interest due to a contract with the BfR of March 2012 for the development of an R program for statistical analyses of the HET-MN assay. All other authors declare that there is no conflict of interest.

Acknowledgements

The funding of the present study by the German Federal Institute for Risk Assessment (BfR) is gratefully acknowledged.
The experimental data used in this study originate from a validation study, which is funded by the German Federal Ministry for Research and Technology (BMBF Funding Priority “Replacement methods of animal experiments”, funding code: 0315803). The authors are grateful to two anonymous referees for the helpful comments which considerably improved the article.


Appendix A. R Code

A.1. Code-Snippet 1: The example data sets

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Experimental data set 1 (elevated NC)
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dose <- gl(4, 6, labels=c("NC","D1","D2","D3"))
MN   <- c(2,2,4,0,3,2,0,1,7,0,0,1,1,8,2,2,0,1,3,14,28,19,2,6)
d1   <- data.frame(Dose, MN)

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Experimental data set 2 (downturn effect in D3)
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dose <- gl(4, 6, labels=c("NC","D1","D2","D3"))
MN   <- c(2,0,2,3,0,0,7,3,3,5,3,29,24,3,6,4,21,10,2,22,6,7,1,4)
d2   <- data.frame(Dose, MN)

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Experimental data set 3 (zero counts in NC)
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Dose <- gl(5, 6, labels=c("NC","D1","D2","D3","D4"))
MN   <- c(0,0,0,0,0,0,3,1,0,1,1,2,0,1,2,0,0,0,1,0,0,2,0,0,4,0,0,3,3,0)
d3   <- data.frame(Dose, MN)

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Historical data set (NCs only)
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
lab <- factor(c(rep("A",120), rep("B",160), rep("C",122)))
run <- c( 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,
          4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7,
          7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9,10,10,10,10,10,10,
         11,11,11,11,11,11,12,12,12,12,12,12,13,13,13,13,13,13,14,14,
         14,14,14,14,15,15,15,15,15,15,16,16,16,16,16,16,17,17,17,17,
         17,17,17,17,18,18,18,18,18,18,18,18,19,19,19,19,19,19,19,19,
          1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,
          4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7,
          7, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9,10,10,10,10,
         10,10,10,10,11,11,11,11,11,11,12,12,12,12,12,12,13,13,13,13,
         13,13,14,14,14,14,14,14,14,14,15,15,15,15,15,15,15,15,16,16,
         16,16,16,16,17,17,17,17,17,17,18,18,18,18,18,18,18,18,19,19,
         19,19,19,19,19,19,20,20,20,20,20,20,21,21,21,21,21,21,22,22,
         22,22,22,22,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,
          1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5,
          5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8,
          8, 8, 9, 9, 9, 9, 9, 9,10,10,10,10,10,10,11,11,11,11,11,11,
         12,12,12,12,12,12,13,13,13,13,13,13,14,14,14,14,14,14,15,15,
         15,15,15,15,16,16,16,16,16,16,17,17,17,17,17,17,17,17,18,18,
         18,18,18,18,18,18,19,19,19,19,19,19,19,19,20,20,20,20,20,20,
         20,20)
MN <- c(0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,1,0,0,0,1,0,0,0,0,1,0,1,0,
        0,0,0,0,0,1,2,1,0,1,0,1,2,0,0,0,1,1,0,0,1,2,1,2,0,1,1,0,2,0,
        0,1,1,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
        0,0,0,1,0,0,0,0,0,2,0,0,0,0,2,0,1,1,0,1,0,2,0,0,0,1,0,0,1,1,
        1,2,0,0,0,2,0,0,1,0,3,2,0,1,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,
        3,0,1,0,0,0,2,0,2,3,0,0,0,0,3,1,0,0,1,0,1,3,0,1,1,0,0,2,0,0,
        1,0,0,0,1,0,0,1,3,0,1,3,0,0,0,2,0,0,4,3,0,0,3,1,1,3,0,0,1,0,
        2,0,1,0,1,0,2,1,0,5,0,0,1,1,0,1,0,1,1,0,0,0,0,1,3,1,0,0,0,0,
        2,1,0,0,0,0,1,0,0,1,0,2,1,0,0,0,1,0,0,0,1,2,2,0,0,2,0,0,0,1,
        2,0,1,0,0,1,0,0,0,0,1,2,0,0,0,1,1,0,0,1,1,2,0,1,1,1,0,1,0,0,
        0,0,0,1,0,0,0,2,1,1,2,1,2,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,
        0,0,1,0,2,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,2,1,1,0,0,
        1,0,1,0,0,1,1,0,0,1,1,0,1,1,1,1,1,1,0,0,2,0,0,0,0,1,0,1,0,1,
        0,0,3,1,0,0,0,0,0,0,0,1)
labrun <- droplevels(factor(lab):factor(run))
dh <- data.frame(lab, run, labrun, MN)




A.2. Code-Snippet 2: GLM with quasi-Poisson link

library(multcomp)
f0  <- glm(MN ~ Dose, data=d1, family=quasipoisson(link="log"))
g0D <- glht(f0, linfct=mcp(Dose="Dunnett"))
g0W <- glht(f0, linfct=mcp(Dose="Williams"))
summary(f0)$dispersion  # Dispersion parameter
summary(g0D)            # Dunnett test
summary(g0W)            # Williams test

A.3. Code-Snippet 3: FT-transformed MCT

library(multcomp)
n <- table(d1$Dose)                   # Sample sizes for each group
contrMat(n, type="Dunnett")           # contrast matrix for Dunnett procedure
contrMat(n, type="Williams")          # contrast matrix for Williams procedure

d1$ft <- sqrt(d1$MN)+ sqrt(d1$MN+1)   # FT transformation
f1 <- lm(ft ~ Dose, data=d1)          # fit linear model
g1D <- glht(f1, linfct=mcp(Dose="Dunnett" ), alternative="greater")
g1W <- glht(f1, linfct=mcp(Dose="Williams"), alternative="greater")
Dun <- summary(g1D)$test$pvalues      # FT-Dunnett test
Wil <- summary(g1W)$test$pvalues      # FT-Williams test
round(cbind(Dun=rev(Dun), Wil), digits=3)

A.4. Code-Snippet 4: Example with downturn effect at the highest dose

library(multcomp)
n <- table(d2$Dose)                           # Sample sizes for each group
cmDW <- rbind(contrMat(n, type="Williams"),   # contrast matrix
              contrMat(n, type="Dunnett")[2:1,])  # Dunnett-Williams
d2$ft <- sqrt(d2$MN)+ sqrt(d2$MN+1)           # FT transformation
f2 <- lm(ft ~ Dose, data=d2)                  # fit linear model
g2D  <- glht(f2, linfct=mcp(Dose="Dunnett" ), alternative="greater")
g2W  <- glht(f2, linfct=mcp(Dose="Williams"), alternative="greater")
g2DW <- glht(f2, linfct=mcp(Dose=cmDW      ), alternative="greater")
g2UW <- glht(f2, linfct=mcp(Dose="UmbrellaWilliams"), alternative="greater")
Dun   <- summary(g2D )$test$pvalues   # FT-Dunnett test
Wil   <- summary(g2W )$test$pvalues   # FT-Williams test
DuWil <- summary(g2DW)$test$pvalues   # FT-Dunnett-Williams test
uWil  <- summary(g2UW)$test$pvalues   # FT-UmbrellaWilliams test
round(cbind(Wil=c(Wil, rep(NA,3)), Dun=c(rev(Dun), rep(NA,3)),
            DuWil=c(DuWil, NA), UmbWil=uWil), digits=3)


A.5. Code-Snippet 5: Example against historical control

# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# Analysis of historical control data
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
library(qcc)
dh$ft <- sqrt(dh$MN)+ sqrt(dh$MN+1)   # FT transformation
dH  <- subset(dh, lab=="B")           # select data of lab B
grp <- qcc.groups(dH$ft, dH$run)
qq  <- qcc(grp, type="xbar")
MwHist <- qq$center                   # mean of the means
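The conditional two-step rule of Section 3.4 can also be sketched without qcc internals, by computing naive 2σ limits directly from the run means of the FT-transformed historical controls. This is an illustrative sketch, not part of the published SOP code; the object names (runMeans, rangeHist, useHist) are ours, and only dh and d3 from Code-Snippet 1 are assumed.

```r
# Sketch of the conditional two-step decision rule (Section 3.4)
dhB <- subset(dh, lab=="B")                          # historical NCs of lab B
dhB$ft <- sqrt(dhB$MN) + sqrt(dhB$MN+1)              # FT transformation
runMeans  <- tapply(dhB$ft, dhB$run, mean)           # mean per historical run
rangeHist <- mean(runMeans) + c(-2, 2)*sd(runMeans)  # naive 2-sigma interval
d3$ft <- sqrt(d3$MN) + sqrt(d3$MN+1)
ftNCmean <- mean(d3$ft[d3$Dose=="NC"])               # equals 1 for only-zero counts
useHist  <- ftNCmean < rangeHist[1] ||
            all(d3$MN[d3$Dose=="NC"] == 0)           # below range, or only zeros
# If useHist is TRUE: Williams-type test against mean(runMeans), cf. Snippet 6;
# otherwise: standard Williams-type test against the concurrent NC, cf. Snippet 3.
```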
A.6. Code-Snippet 6: FT-transformed two-sample tests

library(multcomp)
d1$ft <- sqrt(d1$MN)+ sqrt(d1$MN+1)

# Pairwise tests against historical control mean (see Code-Snippet 5)
n       <- table(d1$Dose)                    # Sample sizes for each group
cMatrix <- contrMat(n, type="Dunnett")[,-1]  # contrast matrix w/o concurrent NC
d1a <- droplevels(subset(d1, Dose != "NC"))  # omit concurrent control
f1a <- lm(ft ~ Dose - 1, data=d1a)           # fit reduced linear model
g1a <- glht(f1a, linfct=cMatrix, rhs=MwHist, alternative="greater")
pairwiseHist <- summary(g1a, test=univariate())$test$pvalues

# Pairwise tests against concurrent control
f1b <- lm(ft ~ Dose, data=d1)                # fit complete linear model
g1b <- glht(f1b, linfct=mcp(Dose="Dunnett"), alternative="greater")
pairwiseConc <- summary(g1b, test=univariate())$test$pvalues
round(cbind(pairwiseConc, pairwiseHist), digits=3)
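The closed-form power calculation of Section 3.3 [32] can be sketched as follows. This is a minimal illustration using the mvtnorm package and a normal approximation to the multivariate t distribution; the published analysis used the package MCPAN [33], and the function name powerWilliamsFT as well as the example parameter values are ours, chosen for a balanced design.

```r
# Approximate power of a Williams-type multiple contrast test (sketch):
# reject if any contrast statistic exceeds the equicoordinate critical value.
library(multcomp)  # contrMat
library(mvtnorm)   # qmvnorm, pmvnorm

powerWilliamsFT <- function(mu, s, n, alpha = 0.05) {
  cm <- contrMat(rep(n, length(mu)), type = "Williams")  # contrast matrix
  V  <- (s^2/n) * tcrossprod(cm)       # covariance of the contrast estimates
  R  <- cov2cor(V)                     # correlation among the contrasts
  q  <- qmvnorm(1 - alpha, tail = "lower.tail", corr = R)$quantile
  delta <- as.vector(cm %*% mu) / sqrt(diag(V))  # noncentrality per contrast
  1 - pmvnorm(upper = rep(q, nrow(cm)), mean = delta, corr = R)[1]
}

# Illustrative call: a linear profile on the FT-transformed scale, common
# SD_FT = 1, six eggs per group (values are examples, not those of Fig. 4):
# powerWilliamsFT(mu = c(1.2, 2.4, 2.9, 3.3), s = 1, n = 6)
```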




References

[1] SCCS, SCCS/1416/11, The SCCS's Notes of Guidance for the Testing of Cosmetic Ingredients and their Safety Evaluation, 7th revision, 2010 (adopted by the SCCS during the 9th plenary meeting of 14 December 2010).
[2] S. Pfuhler, A. Kirst, M. Aardema, N. Banduhn, C. Goebel, D. Araki, M. Costabel-Farkas, E. Dufour, R. Fautz, J. Harvey, N.J. Hewitt, J. Hibatallah, P. Carmichael, M. Macfarlane, K. Reisinger, J. Rowland, F. Schellauf, A. Schepkyn, J. Scheel, A tiered approach to the use of alternatives to animal testing for the safety assessment of cosmetics: genotoxicity. A COLIPA analysis, Regul. Toxicol. Pharm. 57 (2/3) (2010) 315–324.
[3] D. Kirkland, M. Aardema, L. Henderson, L. Muller, Evaluation of the ability of a battery of three in vitro genotoxicity tests to discriminate rodent carcinogens and non-carcinogens. I. Sensitivity, specificity and relative predictivity, Mutat. Res. -Genet. Toxicol. Environ. 584 (1/2) (2005) 1–256.
[4] T. Wolf, N.P. Luepke, Formation of micronuclei in incubated hen's eggs as a measure of genotoxicity, Mutat. Res. -Genet. Toxicol. Environ. 394 (1–3) (1997) 163–175.
[5] T. Wolf, C. Niehaus-Rolf, N.P. Luepke, Some new methodological aspects of the hen's egg test for micronucleus induction (HET-MN), Mutat. Res. -Genet. Toxicol. Environ. 514 (1/2) (2002) 59–76.
[6] T. Wolf, C. Niehaus-Rolf, N.P. Luepke, Investigating genotoxic and hematotoxic effects of N-nitrosodimethylamine, N-nitrosodiethylamine and N-nitrosodiethanolamine in the hen's egg-micronucleus test (HET-MN), Food Chem. Toxicol. 41 (4) (2003) 561–573.
[7] T. Wolf, C. Niehaus-Rolf, N. Banduhn, D. Eschrich, J. Scheel, N.P. Luepke, The hen's egg test for micronucleus induction (HET-MN): novel analyses with a series of well-characterized substances support the further evaluation of the test system, Mutat. Res. -Genet. Toxicol. Environ. 650 (2) (2008) 150–164.
[8] D. Greywe, J. Kreutz, N. Banduhn, M. Krauledat, J. Scheel, K.R. Schroeder, T. Wolf, K. Reisinger, Applicability and robustness of the hen's egg test for analysis of micronucleus induction (HET-MN): results from an inter-laboratory trial, Mutat. Res. -Genet. Toxicol. Environ. 747 (1) (2012) 118–134.
[9] T. Hartung, S. Bremer, S. Casati, S. Coecke, R. Corvi, S. Fortaner, L. Gribaldo, M. Halder, S. Hoffmann, A. Roi, P. Prieto, E. Sabbioni, L. Scott, A. Worth, V. Zuang, A modular approach to the ECVAM principles on test validity, ATLA-Altern. Lab. Anim. 32 (5) (2004) 467–472.
[10] A.R.S. Bhat, K.A. Rao, K.V.A. Latha, Small sample comparison of likelihood ratio, Wald and score tests in generalised linear models involving count data, Int. J. Agric. Stat. Sci. 8 (2) (2012) 667–677.
[11] J. Lindsey, C. Laurent, Estimating the proportion of lymphoblastoid cells affected by exposure to ethylene oxide through micronuclei counts, Statistician 45 (1996) 223–229.
[12] D.A. Noe, A.J. Bailer, R.B. Noble, Comparing methods for analyzing overdispersed count data in aquatic toxicology, Environ. Toxicol. Chem. 29 (2010) 212–219.
[13] L.C. Fabio, G.A. Paula, M. de Castro, A Poisson mixed model with nonnormal random effect distribution, Comput. Stat. Data Anal. 56 (2012) 1499–1510.
[14] NTP (US National Toxicology Program), Description of NTP Study Types, Statistical Procedures, Expanded Overview, 2013. Available at: http://ntp.niehs.nih.gov/?objectid=72015e2c-bdb7-ceba-f17f9aca7ae5346d (accessed 15.03.13; web page last updated on 18.09.04).
[15] G. Dilba, E. Bretz, V. Guiard, L. Hothorn, Simultaneous confidence intervals for ratios with applications to the comparison of several treatments with a control, Methods Inf. Med. 43 (5) (2004) 465–469.
[16] OECD, OECD Guidance Document 34 – Guidance Document on the Validation and International Acceptance of New or Updated Test Methods for Hazard Assessment, ENV/JM/MONO(2005)14, 2005. Available at: http://search.oecd.org/officialdocuments/displaydocumentpdf/?cote=env/jm/mono%282005%2914&doclanguage=en (accessed 04.04.13).
[17] J. Fentem, G. Archer, M. Balls, P. Botham, R. Curren, L. Earl, D. Esdaile, H. Holzhutter, M. Liebsch, The ECVAM international validation study on in vitro tests for skin corrosivity. 2. Results and evaluation by the management team, Toxicol. In Vitro 12 (1998) 483–524.
[18] H. Nishiyama, T. Omori, I. Yoshimura, A composite statistical procedure for evaluating genotoxicity using cell transformation assay data, Environmetrics 14 (2) (2003) 183–192.
[19] L.A. Hothorn, A robust statistical procedure for evaluating genotoxicity data, Environmetrics 15 (6) (2004) 635–641.
[20] S. Hoffmann, L.A. Hothorn, L. Edler, A. Kleensang, M. Suzuki, P. Phrakonkham, D. Gerhard, Two new approaches to improve the analysis of BALB/c 3T3 cell transformation assay data, Mutat. Res. -Genet. Toxicol. Environ. 744 (1) (2012) 36–41.
[21] G. Reifferscheid, H.M. Maes, B. Allner, J. Badurova, S. Belkin, K. Bluhm, F. Brauer, J. Bressling, S. Domeneghetti, T. Elad, S. Fluckiger-Isler, H.J. Grummt, R. Guertler, A. Hecht, M.B. Heringa, H. Hollert, S. Huber, M. Kramer, A. Magdeburg, H.T. Ratte, R. Sauerborn-Klobucar, A. Sokolowski, P. Soldan, T. Smital, D. Stalter, P. Venier, C. Ziemann, J. Zipperle, S. Buchinger, International round-robin study on the Ames fluctuation test, Environ. Mol. Mutagen. 53 (3) (2012) 185–197.
[22] M.F. Freeman, J.W. Tukey, Transformations related to the angular and the square root, Ann. Math. Stat. 21 (4) (1950) 607–611.
[23] Y. Guan, Variance stabilizing transformations of Poisson, binomial and negative binomial distributions, Stat. Probab. Lett. 79 (14) (2009) 1621–1629.
[24] E. Herberich, J. Sikorski, T. Hothorn, A robust procedure for comparing multiple means under heteroscedasticity in unbalanced designs, PLoS ONE 5 (3) (2010) e9788.
[25] C.W. Dunnett, A multiple comparison procedure for comparing several treatments with a control, J. Amer. Statist. Assoc. 50 (272) (1955) 1096–1121.
[26] F. Bretz, An extension of the Williams trend test to general unbalanced linear models, Comput. Stat. Data Anal. 50 (7) (2006) 1735–1748.
[27] L.A. Hothorn, T. Jaki, Statistical evaluation of toxicological assays: Dunnett or Williams test? Take both, Arch. Toxicol. (2013), http://dx.doi.org/10.1007/s00204-013-1065-x (Epub ahead of print).
[28] F. Bretz, L.A. Hothorn, Statistical analysis of monotone or non-monotone dose-response data from in vitro toxicological assays, ATLA-Altern. Lab. Anim. 31 (2003) 81–96.
[29] T. Jaki, L.A. Hothorn, Statistical evaluation of toxicological assays with zero or near-to-zero proportions or counts in the concurrent negative control group: a tutorial, Report, Leibniz University, 2013, www.biostat.uni-hannover.de
[30] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2012, http://www.R-project.org/
[31] T. Hothorn, F. Bretz, P. Westfall, Simultaneous inference in general parametric models, Biometrical J. 50 (3) (2008) 346–363.
[32] F. Bretz, A. Genz, L.A. Hothorn, On the numerical availability of multiple comparison procedures, Biometrical J. 43 (5) (2001) 645–656.
[33] F. Schaarschmidt, D. Gerhard, M. Sill, MCPAN: Multiple Comparisons Using Normal Approximation, R package version 1.1-14, 2012, http://CRAN.R-project.org/package=MCPAN
[34] R.E. Tarone, The use of historical control information in testing for a trend in Poisson means, Biometrics 38 (2) (1982) 457–462.
[35] G.E. Dinse, S.D. Peddada, Comparing tumor rates in current and historical control groups in rodent cancer bioassays, Stat. Biopharm. Res. 3 (1) (2011) 97–105.
[36] A. Kitsche, L.A. Hothorn, F. Schaarschmidt, The use of historical controls in estimating simultaneous confidence intervals for comparisons against a concurrent control, Comput. Stat. Data Anal. 56 (12) (2012) 3865–3875.
[37] D.G. Chen, Incorporating historical control information into quantal bioassay with Bayesian approach, Comput. Stat. Data Anal. 54 (6) (2010) 1646–1656.
[38] FDA, Guidance for Industry: Statistical Aspects of the Design, Analysis, and Interpretation of Chronic Rodent Carcinogenicity Studies of Pharmaceuticals, 2001.
[39] L.S. Nelson, When should the limits on a Shewhart control chart be other than a center line ±3-sigma? J. Qual. Technol. 35 (4) (2003) 424–425.
[40] S. Aebtarm, N. Bouguila, An empirical evaluation of attribute control charts for monitoring defects, Expert Syst. Appl. 38 (6) (2011) 7869–7880.
[41] F. Konietschke, L.A. Hothorn, Evaluation of toxicological studies using a nonparametric Shirley-type trend test for comparing several dose levels with a control group, Stat. Biopharm. Res. 4 (1) (2012) 14–27.
[42] M. Neuhauser, P.Y. Liu, L.A. Hothorn, Nonparametric tests for trend: Jonckheere's test, a modification and a maximum test, Biometrical J. 40 (8) (1998) 899–909.
[43] L.A. Hothorn, M. Hasler, Proof of hazard and proof of safety in toxicological studies using simultaneous confidence intervals for differences and ratios to control, J. Biopharm. Stat. 18 (5) (2008) 915–933.