Radiotherapy and Oncology 33 (1994) 177-178
Letter to the Editor
The interpretation of multiple P-values
To the Editors

In their paper on the use of p-values, Beck-Bornholdt and Dubben [1] discuss the problem that in the statistical literature is known as multiple comparisons or simultaneous inference. Although in essence technically correct, their approach seems somewhat over-simplistic. They argue almost exclusively from the point of view of the null hypothesis of no difference or relation. Their aim is to ensure that the probability of at least one chance finding will not exceed a chosen limit (often 5%) if, for all p-values calculated in a paper, the null hypothesis is in fact true and so no difference or relation should have been found. However, the use of multiple comparisons techniques (MCTs) as advocated by Beck-Bornholdt and Dubben inevitably lowers the level below which a single p-value is considered statistically significant, and thereby reduces the power considerably when the null hypothesis is not true. Therefore, indiscriminate use of MCTs to control the overall type I error (the cumulative value Pc in Beck-Bornholdt and Dubben) may be as harmful as the use of individual p-values per test. Instead, in each individual case one needs to consider carefully whether or not MCTs are appropriate. If, for instance, the 16 parameters per study of Chart B in Fig. 1 of the paper by Beck-Bornholdt and Dubben had been tested in 16 different studies, we would again have the situation of Chart A of their Fig. 1, and Beck-Bornholdt and Dubben would have been perfectly happy, while nothing would have changed fundamentally! Using the approach of Beck-Bornholdt and Dubben, the investigators of Chart B of Fig. 1 would have been punished rather than rewarded for their diligent use of resources and their honesty in reporting the results in one publication. To quote Miller [2]: 'At the other extreme is the ultraconservative statistician who has just a single family consisting of every statistical statement he might make during his lifetime.
If all statisticians operated at the 5% level, then 95% of the world's statisticians would never falsely reject a null hypothesis, and 5% would be guilty of some sin against nullity.' Miller finishes with the following conclusion: 'There are no hard-and-fast rules for where the family [of tests] lines should be drawn, and the statistician must rely on his own judgment for the problem at hand'. Unfortunately, the situation has not changed since then. Nevertheless, one can think of some guidelines on whether or not to use an MCT in a particular instance. At the least, the questions addressed simultaneously should be closely related. Miller considers this to be the case, at least in principle, when they are addressed in the same experiment. However, this again could punish experimenters for using their resources diligently. For instance, in the recent history of operable breast cancer two questions have been addressed regarding treatment: (a) which is better, radical mastectomy or breast-conserving treatment? and (b) can adjuvant treatment improve results? In principle these two questions could have been addressed in one clinical trial using a factorial design, resulting in four arms, or in two different two-armed trials, each addressing a separate question. The first design would be more economical in terms of the total number of patients required and the time to reach a conclusion on both questions. However, according to Miller's criterion the α used should then be approximately half that used in two separate clinical trials. To me there does not seem to be a logical basis for such an attitude.

My own main rule of thumb is to consider whether or not a global test, such as a one-way ANOVA, a Kruskal-Wallis test, a χ²-test for contingency tables or Hotelling's test in the case of multiple response variables, could make sense. If so, an MCT might also be appropriate. This would be the case, for instance, when multiple dose levels are used to answer the question whether or not a dose-response relation exists. On the other hand, in the factorial design of the breast cancer example a global test of equality of all four arms does not seem very relevant in answering the two basic therapeutic questions.

Subgroup analyses form another type of multiple comparisons problem. In general one should first test for interaction between group and (treatment) effect and perform subgroup analyses only when such an interaction is found. Then an MCT should be used in those subgroup analyses.
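The trade-off argued above can be made concrete with a small numerical sketch (in Python; not part of the original letter). It computes the cumulative type I error Pc for k independent tests, and then simulates the power cost of the Bonferroni level α/k when one null hypothesis is actually false; the effect size δ = 2.8 is an illustrative choice giving roughly 80% power for a single two-sided test at the 5% level, not a value from the letter or from Beck-Bornholdt and Dubben:

```python
import math
import random

def familywise_error(alpha: float, k: int) -> float:
    """P(at least one false positive) among k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

# Cumulative type I error Pc for 16 tests at the conventional 5% level,
# and at the Bonferroni-adjusted per-test level 0.05/16:
print(f"Pc unadjusted: {familywise_error(0.05, 16):.3f}")       # about 0.56
print(f"Pc Bonferroni: {familywise_error(0.05 / 16, 16):.3f}")  # held near 0.05

def two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Power cost: the test statistic for the one real effect is normal with
# mean delta = 2.8 (about 80% power at the unadjusted two-sided 5% level).
random.seed(1)
trials = 20_000
delta = 2.8
hits = sum(two_sided_p(random.gauss(delta, 1)) < 0.05 for _ in range(trials))
hits_bonf = sum(two_sided_p(random.gauss(delta, 1)) < 0.05 / 16
                for _ in range(trials))
print(f"power at 0.05:    {hits / trials:.2f}")       # about 0.80
print(f"power at 0.05/16: {hits_bonf / trials:.2f}")  # about 0.44
```

The drop in power, from roughly 80% to roughly 44% in this hypothetical setting, is the cost the letter warns about when an MCT is applied indiscriminately.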
To conclude, rather than automatically use an MCT when more than one p-value is calculated, one should:
- report all individual unadjusted p-values as numbers; this enables readers to use MCTs if they so wish,
- address the multiple comparison problem carefully in drawing conclusions from the data and mention it in the discussion,
- use an MCT if this facilitates the drawing of conclusions,
- accept that by doing more, one automatically has a larger chance of making at least one error, but of course also of being at least once right!

Sincerely,

A.A.M. Hart
(Received 5 September 1994; accepted 25 October 1994)
The Netherlands Cancer Institute, Antoni van Leeuwenhoekhuis, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
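As an aside on the first recommendation: when a paper reports its unadjusted p-values as numbers, a reader can apply an MCT after the fact. A minimal sketch (in Python; the four p-values are invented for illustration, not taken from any paper) of the Holm step-down adjustment, which controls the familywise error rate yet is uniformly less conservative than the plain Bonferroni correction:

```python
def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls the familywise error rate)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    adjusted = [0.0] * k
    running = 0.0
    for rank, i in enumerate(order):
        # Multiply the (rank+1)-th smallest p-value by (k - rank),
        # then enforce monotonicity of the adjusted values.
        running = max(running, (k - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted

# Four hypothetical unadjusted p-values reported in a single paper:
print([round(p, 3) for p in holm_adjust([0.01, 0.04, 0.03, 0.005])])
# -> [0.03, 0.06, 0.06, 0.02]
```

An adjusted value below 0.05 then keeps the familywise error of the whole set of four tests at 5%, which is exactly the reader-side use of MCTs the recommendation has in mind.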
References
[1] Beck-Bornholdt, H.-P. and Dubben, H.-H. Potential pitfalls in the use of p-values and in interpretation of significance levels. Radiother. Oncol. 33: 171-176, 1994.
[2] Miller, R.G. Jr. Simultaneous Statistical Inference, pp. 31-32. Springer-Verlag, New York, 1981.