Mutation Research, 272 (1992) 73-77
© 1992 Elsevier Science Publishers B.V. All rights reserved 0165-1161/92/$05.00
MUTENV 08832
Significance testing in mutagen screening: the dependence of statistical power on the control sample size
H. Traut and W. Scheid
Institute of Radiation Biology, University of Münster, D-4400 Münster, Germany
(Received 2 July 1991) (Revision received 3 March 1992) (Accepted 4 March 1992)
Keywords: Biological dosimetry; Conditional binomial test; Dicentric chromosomes; Drosophila; Historical control test; Power
Summary
The continuous accumulation of control data in multicellular mutagen screening systems prompted us to study the dependence of the statistical power on the size of the control sample (for fixed control values). Two widely used screening systems were chosen: dicentric chromosomes in human lymphocytes and recessive sex-linked lethals in Drosophila melanogaster. The power increases rapidly at first as the control sample size increases, then levels off at a few tens of thousands of control units tested and thereafter remains almost constant up to the historical control. The practical implications of our study are discussed.
Correspondence: H. Traut, Institute of Radiation Biology, University of Münster, D-4400 Münster, Germany.

Problem
Mutagen screening experiments with multicellular organisms have one property in common: the continuous accumulation of control data. Therefore, in the course of time, the control frequency of the genetic effect tested in a mutagen screening laboratory becomes based on a very large sample, e.g., on several tens of thousands of units. In the following study we show to what extent significance testing in mutagen screening profits from this almost automatically increasing size of the control sample. If the substance screened is indeed mutagenic, we may ask: how large is the probability that we are able to demonstrate the significance of the difference between the experimental and the control frequency (at significance level $\alpha$)? More specifically we may ask: how does this probability depend on the size of the control sample? The probability we would like to know is the 'power' of the test, i.e., the probability of rejecting the null hypothesis when in fact it is false. It is immediately evident that the power increases with the difference between the experimental and the control mutation frequency, with the size of the experimental and the control sample, and with $\alpha$, the significance level chosen (e.g., 0.05). The role of sample size in significance testing in mutagenicity experiments has been studied
from various points of view (e.g., Würgler et al., 1975; Katz, 1978; Bateman, 1979; Traut, 1980; Margolin et al., 1983; Mann et al., 1985). For instance, Würgler et al. (1975) and Katz (1978) have shown how the total sample size $N = n_1$ (experimental sample) $+\ n_2$ (control sample) has to be subdivided to make the significance test most powerful (solution: $n_1 = n_2$).

Method
In order to study the dependence of the statistical power on the control sample size, $n_2$, we used the power function, $\pi$, of the conditional binomial test. A computer program to evaluate this power function has been developed by Papworth (personal communication). The power function is computed for the one-tailed version of the test; for the significance level, $\alpha = 0.05$ was chosen. Note that the conditional binomial test provides a good approximation to the Fisher exact test (Berchtold, 1975), and that the tables for determining the statistical significance of mutation frequencies developed by Kastenbaum and Bowman (1970) are based on the binomial test.

In the present communication $r_1$ = number of mutated units obtained in a mutagen screening experiment, $n_1$ = number of units tested in that experiment, and $r_2$ and $n_2$ = the corresponding control values. Then the mutation frequency obtained in the mutagenicity experiment is $r_1/n_1$, and that of the control experiment is $r_2/n_2$. (In screening systems registering > 1 mutational events per unit, like dicentric chromosomes per cell, the term 'mutation yield' seems more appropriate than 'mutation frequency'.) If $\lambda_1$ and $\lambda_2$ are the Poisson means belonging to the experimental and the control samples, respectively, then $n_1\lambda_1$ is the mean belonging to $r_1$ and $n_2\lambda_2$ the mean belonging to $r_2$.

Example: If $\lambda_2 = 0.0005$ is the Poisson mean for dicentric chromosomes arising spontaneously in human lymphocytes, we expect in a control sample of $n_2 = 1000$ cells $n_2\lambda_2 = 1000 \times 0.0005 = 0.5$ dicentrics. It is with this figure that the number of dicentrics ($r_1$) actually observed in an experimental sample ($n_1$) has to be compared, e.g., $r_1 = 3$ dicentrics among $n_1 = 1000$ cells.
The power function, which is similar to that derived by Przyborowski and Wilenski (1939), is

$$\pi = \sum_{r=r_0}^{\infty} \left\{ \frac{\lambda^{r}}{r!}\, e^{-\lambda} \sum_{i=k(r,\alpha)}^{r} \binom{r}{i}\, \theta^{i} (1-\theta)^{r-i} \right\} \tag{1}$$

where $r = r_1 + r_2$, $\lambda = n_1\lambda_1 + n_2\lambda_2$, and $\theta = n_1\lambda_1/(n_1\lambda_1 + n_2\lambda_2)$. $r_0$ is the smallest value of $r$ capable of yielding a significant difference at significance level $\alpha$, and $k(r,\alpha)$ denotes the critical value of $r_1$ conditional upon the value of $r$. Hence $\pi = f(\lambda_1, \lambda_2, n_1, n_2, \alpha)$. The part $\sum_{i=k(r,\alpha)}^{r} \binom{r}{i}\, \theta^{i} (1-\theta)^{r-i}$ of formula (1) represents the conditional binomial test.

To test the significance of the difference between an experimental frequency and a control value based on very large material ($n_2 \to \infty$), i.e., for comparisons with historical controls, a procedure proposed by Traut (1980) can be applied. If $\lambda_0$ denotes the historical control mean, the mean number of mutated units in the experimental sample under the null hypothesis ($H_0$) that $\lambda_1 = \lambda_0$ is $n_1\lambda_0$. The 'historical control test' (Traut, 1980) consists of rejecting $H_0$ at significance level $\alpha$ if

$$\sum_{i=r_1}^{\infty} \frac{(n_1\lambda_0)^{i}}{i!}\, e^{-n_1\lambda_0} \le \alpha. \tag{2}$$
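As an illustration of inequality (2), the following is a minimal Python sketch of the historical control test, not the program by Papworth referred to in the text. The function names `poisson_upper_tail` and `historical_control_test` are hypothetical; the numbers are those of the dicentric example given earlier in this section ($n_1 = 1000$, control mean 0.0005 per cell, $r_1 = 3$).

```python
import math


def poisson_upper_tail(r1, mean):
    """P(X >= r1) for X ~ Poisson(mean), computed as 1 - P(X <= r1 - 1)."""
    lower = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(r1))
    return 1.0 - lower


def historical_control_test(r1, n1, lam0, alpha=0.05):
    """One-tailed test of inequality (2): reject H0 if the upper tail is <= alpha.

    r1   : number of mutated units observed in the experimental sample
    n1   : number of units tested in the experimental sample
    lam0 : historical control mean per unit (treated as known)
    """
    p_value = poisson_upper_tail(r1, n1 * lam0)
    return p_value, p_value <= alpha


# Example from the text: 3 dicentrics among 1000 cells, control yield 0.0005.
p, reject = historical_control_test(r1=3, n1=1000, lam0=0.0005)
print(f"upper-tail probability = {p:.4f}, reject H0: {reject}")
```

With these inputs the expected number of dicentrics under $H_0$ is 0.5, so observing three or more is already unlikely enough to reject $H_0$ at $\alpha = 0.05$, whereas two would not be.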
If $\lambda_1$ is different from $\lambda_0$, the power of this test to detect the difference between $\lambda_1$ and $\lambda_0$ is

$$\pi = \sum_{i=k}^{\infty} \frac{(n_1\lambda_1)^{i}}{i!}\, e^{-n_1\lambda_1} \tag{3}$$

or, to facilitate the computation of $\pi$,

$$\pi = 1 - e^{-n_1\lambda_1} \sum_{i=0}^{k-1} \frac{(n_1\lambda_1)^{i}}{i!} \tag{4}$$

where $k$ (which is a function of $\lambda_0$, $n_1$ and $\alpha$) is the smallest observation $r_1$ satisfying (2). This power function has also been programmed by Papworth (personal communication). In order to show how the power depends on the control sample size (for fixed control values)
we have chosen two mutational changes often used in mutagen screening: dicentric chromosomes in human lymphocytes and recessive sex-linked lethals in Drosophila melanogaster. The former system is increasingly used in 'biological dosimetry' (e.g., Lloyd and Purrott, 1981); the latter is one of the standard mutagenicity tests in Drosophila (e.g., Würgler et al., 1977). We have tried to choose realistic values for the concrete figures ($r_1$, $n_1$, etc.). Researchers using other multicellular systems for mutagenicity screening might, on the basis of this communication, compute the $\pi$-$n_2$ dependence characteristic of their own system.
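The power of the historical control test, formulas (3) and (4), can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the program by Papworth: the function names `poisson_cdf`, `critical_value` and `power_historical` are hypothetical, and the experimental yields are taken as the true Poisson means $\lambda_1$.

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean); returns 0.0 when k < 0."""
    term = math.exp(-mean)  # Poisson probability of 0
    total = 0.0
    for i in range(k + 1):
        total += term
        term *= mean / (i + 1)  # move from P(X = i) to P(X = i + 1)
    return total


def critical_value(n1, lam0, alpha=0.05):
    """Smallest r1 whose upper-tail probability under H0 is <= alpha (inequality 2)."""
    k = 0
    while 1.0 - poisson_cdf(k - 1, n1 * lam0) > alpha:
        k += 1
    return k


def power_historical(n1, lam1, lam0, alpha=0.05):
    """Formula (4): 1 minus the Poisson lower tail up to k - 1 at mean n1 * lam1."""
    k = critical_value(n1, lam0, alpha)
    return 1.0 - poisson_cdf(k - 1, n1 * lam1)


# Settings of the dicentric example: n1 = 1000 cells, control yield 0.0005.
n1, lam0 = 1000, 0.0005
print("critical value k =", critical_value(n1, lam0))
for lam1 in (0.002, 0.003, 0.004):
    print(f"lambda_1 = {lam1}: power = {power_historical(n1, lam1, lam0):.3f}")
```

Using the complement form (4) rather than the infinite sum (3) keeps the computation to $k$ terms, which is the simplification the text refers to.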
Results and discussion
Because of the discreteness of the Poisson distribution, the power curves (power as a function of $n_2$) presented in the Figures are not monotonic but exhibit slight irregularities.
Dicentric chromosomes
Figs. 1 and 2 show the dependence of the statistical power, $\pi$, on the size of the control sample, $n_2$, under the following realistic assumptions: the fixed control yield is assumed to be either 0.0005 (Fig. 1) or 0.001 (Fig. 2). The 'true' historical control yield certainly is not outside this range (Lloyd et al., 1980). For the size of the experimental sample, $n_1$, 1000 metaphases scored have been assumed in both Fig. 1 and Fig. 2; for the number of dicentrics found in the screening tests, $r_1$ = 2, 3 and 4 (Fig. 1) and $r_1$ = 2, 3, 4 and 5 (Fig. 2) have been chosen. Starting with $n_2$ = 1000, $\pi$ increases rapidly in both Fig. 1 and Fig. 2 when $n_2$ increases and remains almost constant for $n_2$ values > 30,000. $\pi$ values computed for very large $n_2$ values are indistinguishable from those belonging to the historical control test ($n_2 = \infty$).

Recessive sex-linked lethals
Fig. 3 shows the $\pi$-$n_2$ dependence for recessive sex-linked lethals (Drosophila melanogaster), again under realistic assumptions: fixed control frequency = 0.0025, $n_1$ = 1000 germ cells tested, $r_1$ = 5, 6, 7 and 8. Essentially the same results as for dicentric chromosomes (Figs. 1 and 2) are
[Figure 1: power (vertical axis) plotted against control sample size in units of 1000 (horizontal axis).]
Fig. 1. Dependence of the statistical power, $\pi$, on the control sample size, $n_2$, for dicentric chromosomes in human lymphocytes. For the control a fixed yield of 0.0005 (dicentrics per cell) has been assumed, this yield being also considered as historical control. The experimental yields 2/1000 (a), 3/1000 (b), and 4/1000 (c) have been chosen (number of dicentrics per 1000 metaphases scored). The power belonging to $n_2$ = 200,000 is marked by ○, that belonging to $n_2 = \infty$ (historical control) by ●.
obtained: the $\pi$-$n_2$ curve levels off at $n_2 \approx$ 40,000 germ cells, and again $\pi$ values computed for large $n_2$ values are very similar to those belonging to the historical control.
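The $\pi$-$n_2$ dependence described above can be explored with a minimal Python sketch of formula (1). It is written under explicit assumptions and is not the program by Papworth: the function names are hypothetical, the critical value $k(r,\alpha)$ is taken from the binomial distribution with success probability $n_1/(n_1+n_2)$ (the value of $\theta$ when $\lambda_1 = \lambda_2$), the infinite sum over $r$ is truncated about ten standard deviations above the mean $\lambda$, and both Poisson means are assumed to be strictly positive.

```python
import math


def log_binom_pmf(i, r, theta):
    """Log of the binomial probability of i successes in r trials (0 < theta < 1)."""
    return (math.lgamma(r + 1) - math.lgamma(i + 1) - math.lgamma(r - i + 1)
            + i * math.log(theta) + (r - i) * math.log(1.0 - theta))


def critical_r1(r, theta0, alpha):
    """Smallest r1 with P(R1 >= r1 | r, theta0) <= alpha, or None if none exists."""
    tail, k = 0.0, None
    for i in range(r, -1, -1):  # accumulate the upper tail from i = r downwards
        tail += math.exp(log_binom_pmf(i, r, theta0))
        if tail > alpha:
            break
        k = i
    return k


def power_conditional_binomial(n1, n2, lam1, lam2, alpha=0.05):
    """Formula (1), one-tailed, with a truncated outer sum (an assumption)."""
    lam = n1 * lam1 + n2 * lam2          # Poisson mean of r = r1 + r2
    theta = n1 * lam1 / lam              # binomial parameter under the alternative
    theta0 = n1 / (n1 + n2)              # binomial parameter under H0
    r_max = int(lam + 10.0 * math.sqrt(lam) + 30)
    power = 0.0
    for r in range(1, r_max + 1):
        k = critical_r1(r, theta0, alpha)
        if k is None:                    # this r cannot yield significance (r < r0)
            continue
        log_pois = -lam + r * math.log(lam) - math.lgamma(r + 1)
        tail = sum(math.exp(log_binom_pmf(i, r, theta)) for i in range(k, r + 1))
        power += math.exp(log_pois) * tail
    return power


# Dicentric setting: n1 = 1000, control yield 0.0005, experimental yield 3/1000.
for n2 in (1000, 5000, 30000, 200000):
    pw = power_conditional_binomial(1000, n2, 0.003, 0.0005)
    print(f"n2 = {n2:>6}: power = {pw:.3f}")
```

Working in log space (via `math.lgamma`) avoids underflow of $e^{-\lambda}$ when $n_2\lambda_2$ becomes large, which matters for control samples of several hundred thousand units.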
Conclusions
There are three kinds of control frequencies in mutagen screening with multicellular organisms like Drosophila, mice, and man: concurrent, laboratory, and historical controls. The concurrent control frequency is obtained from a control experiment carried out simultaneously with a screening experiment. The laboratory control frequency is calculated from the concurrent control data collected in the same laboratory. (We are, however, not sure whether this term is generally used in this sense.) The laboratory control sample also increases because control experiments have to be repeated from time to time to ensure that the mutagen screening system still works reliably. The historical control frequency is based on the laboratory controls of several laboratories. Of course, before pooling individual control data to obtain laboratory or historical controls, homogeneity tests have to show that this pooling is legitimate.

The $\pi$-$n_2$ dependence of the examples studied by us is characterized by relatively low power values up to a few thousand control units tested and by a saturation of the curve above $n_2$ values of a few tens of thousands ($n_2 \approx$ 30,000 for dicentrics in human lymphocytes, $n_2 \approx$ 40,000 for lethals in Drosophila). This suggests that, in principle, significance tests between the result of a single experiment and a laboratory or a historical control are superior to significance tests with a
[Figure 2: power (vertical axis) plotted against control sample size in units of 1000 (horizontal axis).]
Fig. 2. As in Fig. 1, but a fixed control yield of 0.001 has been assumed, this yield again being considered as historical control. In addition, an experimental yield of 5/1000 (d) has been introduced. The power belonging to $n_2$ = 200,000 is marked by ○, that belonging to $n_2 = \infty$ (historical control) by ●.

[Figure 3: power (vertical axis) plotted against control sample size in units of 1000 (horizontal axis).]
Fig. 3. Essentially as in Fig. 1, but the $\pi$-$n_2$ dependence refers to recessive sex-linked lethals in Drosophila melanogaster. A fixed control frequency of 0.0025 has been assumed, this frequency also being considered as historical control. The experimental frequencies 5/1000 (a), 6/1000 (b), 7/1000 (c), and 8/1000 (d) have been chosen (number of lethals per 1000 germ cells scored). The power belonging to $n_2$ = 300,000 is marked by ○, that belonging to $n_2 = \infty$ (historical control) by ●.
concurrent control. If, however, there are special reasons for basing the significance test on a concurrent control, the most powerful experimental designs are those in which approximately equal numbers of units are tested in the experimental and the control sample (Katz, 1978; Margolin et al., 1983). A further conclusion from our study is that collecting more control material than corresponds to the beginning of the saturation region of the $\pi$-$n_2$ curve does not markedly improve significance testing in experiments using multicellular mutagen screening systems.
Acknowledgement
This communication has profited considerably from the competent advice of Dr. D.G. Papworth, Medical Research Council, Radiobiology Unit, Chilton, Didcot, UK. Dr. Papworth also has generously made available to us the computer programs developed by him for computing the statistical power. We gratefully acknowledge his help.
References
Bateman, A.J. (1979) Significance of specific-locus germ-cell mutations in mice, Mutation Res., 64, 345-351.
Berchtold, W. (1975) Comparison of the Kastenbaum-Bowman test and Fisher's exact test, Arch. Genet., 48, 151-157.
Kastenbaum, M.A., and K.O. Bowman (1970) Tables for determining the statistical significance of mutation frequencies, Mutation Res., 9, 527-549.
Katz, A.J. (1978) Design and analysis of experiments on mutagenicity. I. Minimal sample sizes, Mutation Res., 50, 301-307.
Lloyd, D.C., and R.J. Purrott (1981) Chromosome aberration analysis in radiological protection dosimetry, Radiat. Protec. Dosimetry, 1, 19-28.
Lloyd, D.C., R.J. Purrott and E.J. Reeder (1980) The incidence of unstable chromosome aberrations in peripheral blood lymphocytes from unirradiated and occupationally exposed people, Mutation Res., 72, 523-532.
Mann, R.C., D.M. Popp and R.A. Popp (1985) Critical sample sizes for determining the statistical significance of mutation frequencies, Mutation Res., 143, 93-100.
Margolin, B.H., B.J. Collings and J.M. Mason (1983) Statistical analysis and sample-size determinations for mutagenicity experiments with binomial responses, Environ. Mutagen., 5, 705-716.
Przyborowski, J., and H. Wilenski (1939) Homogeneity of results in testing samples from Poisson series, Biometrika, 31, 313-323.
Traut, H. (1980) A method for determining the statistical significance of mutation frequencies, Biometr. J., 22, 73-78.
Würgler, F.E., U. Graf and W. Berchtold (1975) Statistical problems connected with the sex-linked recessive lethal test in Drosophila melanogaster. I. The use of the Kastenbaum-Bowman test, Arch. Genet., 48, 158-178.
Würgler, F.E., F.H. Sobels and E. Vogel (1977) Drosophila as assay system for detecting genetic changes, in: B.J. Kilbey, M.S. Legator, W. Nichols and C. Ramel (Eds.), Handbook of Mutagenicity Test Procedures, Elsevier, Amsterdam, pp. 335-373.