CHAPTER 24
Multiple Comparisons

INTRODUCTION
A population has a mean of 50 units and a standard deviation of 10 units. Draw 10 random samples with N = 25 from this population. The Central Limit Theorem indicates that the means of the 10 samples are distributed normally about a mean of 50 with a standard error of the mean estimated as 10/√25 = 2. Therefore the 95% confidence limits for these means are 50 ± t0.05 × 2 = 50 ± 2.262 × 2 = 45.476 to 54.524. If the highest and lowest means from these 10 samples are compared by an unpaired t-test, t is 3.199, P = 0.005. What is wrong with this scenario? Only about 5% of the means of such samples drawn from the same population will be >2 standard errors from the population mean, so only about 1 time in 20 would we incorrectly reject the null hypothesis. However, by selecting the highest and lowest means from such a series, the difference between them has led us to reject the null hypothesis. In fact, rejecting the null hypothesis for a mean in the lower tail of the distribution implies rejecting it for a comparison between that mean and any mean that is above the population mean.
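As a minimal illustrative sketch (not from the original text), this scenario can be simulated: repeatedly draw 10 samples of N = 25 from a Normal(50, 10) population, compare the highest and lowest sample means with an unpaired t-test, and count how often P < 0.05 even though every sample comes from the same population. The number of repetitions and the random seed are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 2000
false_rejections = 0

for _ in range(n_experiments):
    samples = rng.normal(loc=50, scale=10, size=(10, 25))  # 10 samples, N = 25 each
    means = samples.mean(axis=1)
    lowest = samples[np.argmin(means)]
    highest = samples[np.argmax(means)]
    t, p = stats.ttest_ind(highest, lowest)                # unpaired t-test on the extreme pair
    if p < 0.05:
        false_rejections += 1

# The proportion is far above the nominal 5%, which is the point of the example.
print(f"Proportion of 'significant' extreme-pair comparisons: {false_rejections / n_experiments:.2f}")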
More realistic scenarios are frequent in the scientific literature. Consider Fig. 24.1, which illustrates figures and comparisons commonly shown in the literature.

Fig. 24.1 Multiple comparisons. *P < 0.05; **P < 0.01.
The left panel shows 7 sequential time periods with several differences. The right panel shows similar differences for all pairs of comparisons. Short vertical lines are standard errors. The question raised by these multiple comparisons is whether they are legitimate, given that multiple t-tests have been done. The arguments to follow apply equally to other multiple tests, such as chi-square tests. The subject is controversial, with some asserting that there is no need to correct for the multiple comparisons and others that correction is always necessary. Most statisticians occupy the middle ground and use correction selectively.

Tukey (1977) provided the following argument. Consider drawing at random 2 samples of the same size from the same population. The logic of the t-test tells us that there is a 0.05 probability of rejecting the null hypothesis falsely (α, Type I error), and therefore a 1 − α = 1 − 0.05 = 0.95 chance of correctly accepting the null hypothesis. If we draw another 2 samples from that population, then by the same argument there is a 0.95 chance of correctly accepting the null hypothesis. What then is the chance of correctly accepting the null hypothesis both times? By the product rule for probabilities, it is (1 − α)^2 = 0.95 × 0.95 = 0.9025. Therefore the probability of incorrectly rejecting the null hypothesis at least once is 1 − (1 − α)^2 = 1 − 0.9025 = 0.0975, even though we use the conventional 0.05 level of α for each test. Draw a third set of samples from that population. Then the probability of correctly accepting the null hypothesis all three times is (1 − α)^3 = 0.95^3 = 0.8574. Similarly, drawing 10 pairs of samples, the chance of correctly accepting the null hypothesis in all 10 comparisons is (1 − α)^10 = 0.95^10 = 0.5987. Therefore the probability of incorrectly rejecting the null hypothesis at least once is 1 − (1 − α)^10 = 1 − 0.5987 = 0.4013, even though each individual comparison was tested at α = 0.05. The value of α has thus been inflated. A similar example illustrated by Bland and Altman (1995) involved 20 comparisons, for which the chance of finding at least one comparison rejecting the null hypothesis at the 0.05 level is 0.64. For 100 comparisons, the probability of incorrectly rejecting the null hypothesis at least once at the 0.05 level is 1 − (1 − α)^100 = 1 − (1 − 0.05)^100 = 0.9941. A similar inflation of the Type I error occurs when the results of an experiment are divided into very many small groups until a "significant" result appears. This was described by Martin (1984), who termed it "Munchausen's statistical grid," the "great virtue of which was that it can resurrect a 'significant' result from any foundering therapeutic trial."
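This inflation is easy to reproduce; a short sketch (the values of k are chosen to match the examples in the text):

alpha = 0.05
for k in (1, 2, 3, 10, 20, 100):
    family_error = 1 - (1 - alpha) ** k     # P(at least one false rejection in k tests)
    print(f"k = {k:3d}: P(at least one false rejection) = {family_error:.4f}")
# k = 2 gives 0.0975, k = 3 gives 0.1426, k = 10 gives 0.4013,
# k = 20 gives 0.6415, and k = 100 gives 0.9941, matching the figures quoted above.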
BONFERRONI CORRECTION AND EQUIVALENT TESTS
One way to correct for this inflation is attributed to the Italian mathematician Carlo Bonferroni (1892–1960), although the concept had been known for centuries (Bland and Altman, 1995). To keep the value of α at 0.05 for the whole set of comparisons (call this αF, indicating the value of α for the whole family of experiments), solve the following equation for the value of αc (c indicating the individual comparison) for each independent comparison: 1 − (1 − αc)^k = 0.05, where k is the number of tests conducted. The expression αF = 1 − (1 − αc)^k may also be written αc = 1 − (1 − αF)^(1/k). In either form it is known as the Dunn-Šidák equation. Thus if k is 10, then αc = 0.005116. For k = 20, 50, and 100, and for αF = 0.05 and 0.01, the values of αc are given in Table 24.1.
Table 24.1 Results of Bonferroni correction

                 α = 0.05                    α = 0.01
  k          αc           α/k            αc             α/k
  10         0.005116     0.005          0.00100453     0.001
  20         0.002561     0.0025         0.00050239     0.0005
  50         0.001025     0.001          0.00020099     0.0002
  100        0.000513     0.0005         0.0001005      0.0001
These values of αc are close to what we get by dividing α by k. This is known as the Bonferroni equation, and its results differ only very slightly from those of the Dunn-Šidák equation (Abdi, 2007). Therefore the Bonferroni correction for k t-tests requires that we reject the null hypothesis for each individual comparison only if P < α/k. The approximation works because, with k independent t-tests each performed at α = 0.05, the probability of getting no differences at this level of α is (1 − α)^k. Because α is small, expanding the expression gives approximately 1 − kα, because all the higher powers of α are tiny. Then, if the null hypothesis is true, for 1 of the k comparisons to have P < 0.05, kα must be <0.05, so that α must be <0.05/k (Bland and Altman, 1995). Reminder: P < 0.05 may be an unsafe level to use.

Table 24.1 shows that with 100 comparisons the corrected value of α is so small that it may be difficult to achieve. By trying to avoid Type I errors, the Type II error becomes larger, with the risk of failing too often to reject the null hypothesis when it is false. There is thus a loss of statistical power. More efficient procedures were described by Holm and by Hochberg and Benjamini. For the Holm test, rank the P values from the k comparisons from smallest to largest. Then test the smallest against 0.05/k, where k is the number of comparisons. If that does not allow rejection of the null hypothesis, accept the null hypothesis for all the comparisons. If we reject the null hypothesis, test the second smallest P value against 0.05/(k − 1), and so on. Assume that the P values are 0.005, 0.020, 0.026, and 0.09. Then test 0.005 against 0.05/4 = 0.0125. Because it is smaller, reject the null hypothesis for that comparison. Then test the second smallest P value, 0.020, against 0.05/3 = 0.017. Because it is bigger, accept the null hypothesis for this and all remaining comparisons. An online calculator using Excel is at http://www.researchgate.net/publication/236969037_Holm-Bonferroni_Sequential_Correction_An_EXCEL_Calculator; alternatively use http://www.quantitativeskills.com/sisa/calculations/bonfer.htm
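A minimal sketch of these corrections, assuming nothing beyond the formulas given above (the function names are invented for illustration): it reproduces the per-comparison thresholds of Table 24.1 and applies the Holm step-down to the worked P values.

def sidak_alpha(alpha_family, k):
    """Per-comparison alpha from the Dunn-Sidak equation."""
    return 1 - (1 - alpha_family) ** (1 / k)

def bonferroni_alpha(alpha_family, k):
    """Simpler Bonferroni approximation: alpha_family / k."""
    return alpha_family / k

for k in (10, 20, 50, 100):
    print(k, round(sidak_alpha(0.05, k), 6), bonferroni_alpha(0.05, k))
# Reproduces Table 24.1; for example, k = 10 gives 0.005116 (Sidak) versus 0.005 (Bonferroni).

def holm(p_values, alpha=0.05):
    """Holm step-down: test the ordered P values against alpha/k, alpha/(k-1), ...
    and stop at the first non-rejection."""
    k = len(p_values)
    rejected = []
    for i, p in enumerate(sorted(p_values)):
        if p < alpha / (k - i):
            rejected.append(p)
        else:
            break
    return rejected

print(holm([0.005, 0.020, 0.026, 0.09]))   # only 0.005 is rejected, as in the worked example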
For the Hochberg test, rank the P values from largest (pk) to smallest (p1). If pk < α, reject H0(i) for i = 1, …, k, that is, for all comparisons. If pk > α, determine whether p(k−1) < α/2. If it is, reject H0(i) for i = 1, …, k − 1, and so on. Assume that these values are 0.09, 0.026, 0.020, and 0.005. Because the first P value exceeds 0.05, do not reject the null hypothesis. Then compare the next P value with 0.05/2 = 0.025. Because P = 0.026 exceeds this, do not reject the null hypothesis for this comparison. For the third comparison, the critical P value is 0.05/3 = 0.017. Because P = 0.020 exceeds this critical value, do not reject the null hypothesis for the third comparison. The fourth P value is compared with the critical value of 0.05/4 = 0.0125. Because P = 0.005 is less than the critical value, reject the null hypothesis for the fourth comparison. If the second comparison had had a P value of 0.021, then this would have been less than the critical value of 0.025, and we would have rejected the null hypothesis for this and all subsequent comparisons. Both of these tests keep the error rate for the whole set of tests at 0.05 but greatly reduce the Type II error. The Hochberg test is more powerful than the Holm test, and both improve considerably on the original Bonferroni correction. An online but complex program for this test is at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3263024/. The procedure is discussed clearly with examples at http://onbiostatistics.blogspot.com/2009/08/hochberg-procedure-for-adjustment-for.html. Variations of these methods are used when the various end points are correlated with each other; for example, different symptoms in a disease are often correlated (Wright, 1992; Yao and Wei, 1996). Several statistical programs, including R, implement these tests.
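A corresponding sketch of the Hochberg step-up procedure described above (the function name is invented), applied to the same worked P values:

def hochberg(p_values, alpha=0.05):
    """Work from the largest P value down, comparing against alpha, alpha/2, alpha/3, ...
    The first P value to fall below its critical value is rejected together with all
    smaller P values."""
    ordered = sorted(p_values, reverse=True)      # largest first
    for j, p in enumerate(ordered):               # j = 0 compares against alpha/1
        if p < alpha / (j + 1):
            return ordered[j:]                    # reject this and every smaller P value
    return []                                     # nothing rejected

print(hochberg([0.09, 0.026, 0.020, 0.005]))      # [0.005]: only the fourth comparison is rejected
print(hochberg([0.09, 0.021, 0.020, 0.005]))      # [0.021, 0.020, 0.005]: the 0.021 case in the text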
The Bonferroni correction is often misused. Glantz (1980) and Wallenstein et al. (1980) wrote editorials to address some common statistical errors in the journals Circulation and Circulation Research, one of the errors being the use of multiple t-tests. They recommended the Bonferroni correction. Similar conclusions were reached by Pocock et al. (1987), who examined the reports of 45 comparative trials published in the British Medical Journal, the Lancet, and the New England Journal of Medicine. Unfortunately, the Bonferroni correction is needed for only some multiple t-tests, and there is a tendency to apply it to all such tests, whether or not they require the correction (Kusuoka and Hoffman, 2002).

Error Rates
Some statisticians reject the Bonferroni adjustment (Rothman, 1990) and others restrict its use to certain types of experiments (Perneger, 1998). An important distinction is between experiment-wise and comparison-wise error rates. A simple example to explain the difference was published by Carter (2010). If we want to make bicycles with only a 5% chance of being defective, the (comparison-wise) defect rate for each component (frame, chain, handlebars, etc.) must be well below 5% if the 5% (experiment-wise) defect rate for the whole bicycle is to be met. As an example, Creasy et al. (1972) embolized the placenta in pregnant sheep with microspheres and compared the runted and control lambs.
They measured body and organ weights and organ blood flows, as well as hematocrit, blood glucose concentrations, and arterial pH and oxygen and carbon dioxide tensions. In all, 14 t-test comparisons were made. Fig. 24.2 shows some of their data.

Fig. 24.2 Comparison between control and runted lambs. *P < 0.05; **P < 0.01. P(b) Bonferroni adjustment. Vertical lines are standard errors.

Was it correct to do so many t-tests, and did they have to use the Bonferroni correction? There are two main types of error rates (O'Neill and Wetherill, 1971; O'Brien and Shampo, 1988a; Abdi, 2007). One is the comparison-wise error rate, defined as:

  Comparison-wise error rate = (Number of comparisons leading to erroneous rejection of the null hypothesis) / (Total number of comparisons).

The second is the experiment-wise (or family-wise) error rate, defined as:

  Experiment-wise error rate = (Number of families with one or more erroneous rejections of the null hypothesis) / (Total number of families).

Most statisticians agree that the Bonferroni adjustment should be used if the universal null hypothesis of no differences in any variable is being tested, for example, when doing a battery of biochemical tests on a presumed normal subject. In the Creasy example above, each comparison-wise error rate is 5% and would not change if another six variables had been measured. The family-wise error rate determines whether we accept the hypothesis that runting does or does not change any of the variables, and it is this rate that needs protection by the Bonferroni or other methods.
There are many scenarios on the theme of simultaneous multiple comparisons: for example, making consecutive measurements over time, as in Fig. 24.1 (left panel) (O'Brien and Shampo, 1988b); using multiple statistical tests to examine possible heterogeneity of response (O'Brien and Shampo, 1988c); or combining the results of several different end points to provide a global measure of superiority of one treatment over another (O'Brien and Shampo, 1988d). The controversies about doing multiple t-tests extend to more complex analyses and will be discussed in detail in Chapter 25 on Analysis of Variance.
Extreme Multiplicity and False Discovery Rates
Multiplicity problems become extreme when sets of hundreds or thousands of data points are examined, most notably in microarrays used to evaluate gene or protein expression and in voxels used in imaging. In these studies, the test object (e.g., blood for genes and proteins, organs such as brain or heart for imaging) is divided into thousands of samples, each compared with control normal values. These normal values have their own variability, and some threshold is needed to determine whether any one locus differs between test and control. The comparison-wise error rate (CWER) with α = 0.05 examines each comparison separately. For any single comparison there is a 5% chance of falsely rejecting the null hypothesis at P = 0.05, so standard t-tests on 10,000 spots in a microarray chip would be expected to produce about 500 false positive "significant" differences even if the test and control material were identical. Family-wise error rates (FWER) reduce this potential error and keep the total error rate for the whole array below 5%. An example of such a test is the Bonferroni inequality, in which the null hypothesis for each spot is rejected only at a value of α/k, where k is the number of spots. With k = 10,000 the power of the test is very low, and the Type I error is controlled at a given value of α at the cost of an inflated Type II error. Consequently, many differences that might be important would not be detected.

In 1995 Benjamini and Hochberg proposed the concept of the false discovery rate (FDR) to deal with these problems. The possible results of testing a large number of samples are classified in Table 24.2. The comparison-wise error rate is stipulated to be E(V/m), where E indicates the expectation of the ratio in the long run. Testing each hypothesis at level α guarantees that E(V/m) ≤ α, so setting a value for α sets a limit for the value of V/m. Testing each hypothesis at level α/m gives the family-wise error rate that keeps the Type I error for the whole data set ≤ α. The false discovery rate is represented by the random variable Q = V/(V + S) = V/R and indicates the proportion of rejected null hypotheses that are erroneously rejected. If all the null hypotheses are true, the false discovery and family-wise error rates are the same; if some null hypotheses are false, the false discovery rate is lower.
Table 24.2 Type I and Type II errors applied to microarrays

                            Do not reject H0 (Non-DE)              Reject H0 (DE)                        Total
H0 true (Non-DE)            U = true negative                      V = false positive (Type I error)     m0
H0 false (HA true; DE)      T = false negative (Type II error)     S = true positive                     m − m0
Total                       m − R                                  R                                      m

DE, differential expression (difference between test and control value); m, total number of hypotheses tested; m0, number of true null hypotheses; m − m0, number of true alternative hypotheses. m is known in advance and R is an observed random variable, but none of the other values are known; S, T, U, and V are unobservable random variables.
Given that the FDR decreases both Type I and Type II errors, how is it possible to calculate V/R? One method has been to plot the histogram of the relative frequency with which each p value (for the individual t-tests) occurs (Fig. 24.3; Karp and Lilley, 2007; Storey and Tibshirani, 2003).
Fig. 24.3 Diagram to illustrate one approach to calculating FDR. If there were no differences between test and control materials, then the relative frequency of each p value (the number of times it occurred relative to what is expected when the null hypothesis is true) would be constant, as indicated by the horizontal solid line. On the reasonable assumption that high p values indicate that the null hypothesis is likely and that there is no difference between the control and the test material, the relatively uniform part of the histogram with p > 0.5 provides an estimate of the true background level, as shown by the dashed line. Portions of the columns above this dashed line indicate the true positives (shaded area for the lowest p values) as compared with the baseline false positives shown in the dark area. (Based on published figures by Karp, N.A., Lilley, K.S., 2007. Design and analysis issues in quantitative proteomics studies. Proteomics, 7(Suppl 1), 42–50; Storey, J.D., Tibshirani, R., 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100, 9440–9445.)
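One simple version of this background estimate (Storey's estimator with λ = 0.5, one of several possible choices, and not described further in the text) can be sketched as follows; the simulated P values are invented purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
# 900 P values from true null hypotheses (uniform) plus 100 from genuine effects (near zero)
p = np.concatenate([rng.uniform(size=900), rng.beta(0.5, 10, size=100)])

lam = 0.5
pi0_hat = np.mean(p > lam) / (1 - lam)   # proportion of P values above 0.5, scaled to the whole range
print(f"Estimated background (proportion of true null hypotheses): {pi0_hat:.2f}")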
A simple technique is to set a value Q for the FDR (Q is the expected proportion of false positives, as determined from Fig. 24.3). Assume we choose Q = 0.2. List the P values from smallest (rank 1) to largest (rank m) and calculate (i/m)Q for each rank i. Thus for the 4th ranked P value out of 17, (i/m)Q = (4/17) × 0.2 = 0.0471. If the P value is <0.0471, reject the null hypothesis; if it is greater, it is unsafe to reject the null hypothesis. As an example, consider the data in Table 24.3.

Table 24.3 Calculation of FDR
Rank    P value    (i/m)Q    Comment
 1      0.010      0.0118    Reject H0
 2      0.022      0.0235    Reject H0
 3      0.029      0.0353    Reject H0
 4      0.043      0.0471    Reject H0
 5      0.11       0.0588    Do not reject H0
 6      0.16       0.0706    Do not reject H0
 7      0.27       0.0824    Do not reject H0
 8      0.31       0.0941    Do not reject H0
 9      0.38       0.1059    Do not reject H0
10      0.49       0.1176    Do not reject H0
11      0.55       0.1294    Do not reject H0
12      0.61       0.1412    Do not reject H0
13      0.63       0.1529    Do not reject H0
14      0.69       0.1647    Do not reject H0
15      0.75       0.1765    Do not reject H0
16      0.83       0.1882    Do not reject H0
17      0.98       0.2000    Do not reject H0
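A short sketch reproducing the Table 24.3 calculation (the P values and Q = 0.2 are those used in the text):

p_values = [0.010, 0.022, 0.029, 0.043, 0.11, 0.16, 0.27, 0.31, 0.38, 0.49,
            0.55, 0.61, 0.63, 0.69, 0.75, 0.83, 0.98]
Q = 0.2
m = len(p_values)

for i, p in enumerate(sorted(p_values), start=1):
    critical = (i / m) * Q                                    # (i/m)Q for rank i
    decision = "Reject H0" if p < critical else "Do not reject H0"
    print(f"{i:2d}  P = {p:5.3f}  (i/m)Q = {critical:.4f}  {decision}")
# The first four P values (0.010-0.043) fall below their critical values, as in Table 24.3.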
Excel spreadsheets for performing this test are available at www.biostathandbook.com/benjaminihochberg.xls. The subject is discussed simply at http://www.biostathandbook.com/multiplecomparisons.html. There are numerous approaches to computing FDR, and numerous complications, such as the fact that often several genes or proteins rise or fall together, so consultation with an expert in the field is essential. Different methods of setting the threshold for declaring a difference that rejects the null hypothesis give different results. There is a trade-off between sensitivity and specificity, and the results should not be accepted blindly without understanding their limitations.
GROUP SEQUENTIAL BOUNDARIES
It is common in clinical trials to plan a study involving many subjects for a long time, but to check the results at intervals to determine if the study should be ended prematurely
because of unexpectedly bad or good results in one group. It would be unethical to continue the study when one group receives a less good treatment. In this context, the multiple comparisons problem is invoked (McPherson, 1974). The approach depends in part on how the trial is likely to proceed. In some trials, many subjects are readily available and unambiguous results are obtained early for each patient. The trial supervisors then estimate how many patients are needed in each group, what the primary outcome will be, and how long the trial is expected to continue. At the other extreme is a trial in which patient accrual is slow and irregular, the time to completion of the trial is uncertain, and the outcome may not be known for several years. These two extremes need different analyses. Furthermore, because most clinical trials test treatments that will produce only modest improvements, the patient numbers required can be very large (Mehta et al., 2009).

One of the most often used methods for interim checks was devised by O'Brien and Fleming. The number of interim tests is defined in advance. Because earlier interim tests involve smaller numbers of patients, they have larger standard errors and wider confidence limits, and they therefore demand a more stringent level of significance before the trial should be stopped. With each succeeding interim examination, the criterion for rejecting the null hypothesis becomes less strict, until for the final test at the end of the trial the conventional predetermined value of α is used. O'Brien and Fleming calculated the critical values of α for a predetermined per-experiment error rate of 0.05 for 1, 2, 3, or 4 interim tests plus the final test (Table 24.4, column 2). Table 24.4, column 2, shows critical P values for a final Type I error rate of 0.05, based on tables from Pocock (2006) and O'Brien and Shampo. The final test on the completed trial has a critical value close to the designated 0.05. If the final error rate is to be 0.01, then the O'Brien-Fleming method requires critical P values of 0.0000001, 0.00001, 0.001, and 0.004 for the first to fourth interim analyses, respectively. Column 3 shows alternative critical values developed by Haybittle and Peto et al. (1976). These authors use a constant boundary for the interim analyses but retain the α value of 0.05 if early termination does not occur (Table 24.4, column 3). Finally, one other type of test has been proposed to deal with the problem that preliminary data may suggest the need to change the times at which interim analyses are made, or that because of slow recruitment the trial has to be extended so that more interim analyses are needed. The adaptive procedures used were developed by Lan and DeMets and are variants of the O'Brien and Fleming method. They proposed an "alpha spending function" that controls how much of the false positive error can be used at each interim analysis as a function of the proportion of the total information available for the whole trial. This proportion is usually based on the fraction of total patients enrolled or the proportion of expected events that have occurred. Some specialized computer programs, such as PASS or Cytel, will perform the calculations.
Table 24.4 Probabilities for interim tests

                       Critical P value
Test number       O'Brien-Fleming    Haybittle-Peto

One interim test
  1               0.005              0.001
  2               0.049              0.050

Two interim tests
  1               0.0006             0.001
  2               0.0151             0.001
  3               0.0471             0.0495

Three interim tests
  1               0.00005            0.001
  2               0.004              0.001
  3               0.018              0.001
  4               0.042              0.0492

Four interim tests
  1               0.000005           0.001
  2               0.0013             0.001
  3               0.009              0.001
  4               0.023              0.001
  5               0.042              0.0489
A simple rough test was advocated by Pocock (2006) for examining interim results. With the reasonable assumptions that the two groups being compared have approximately equal numbers and that the incidence of events is small and therefore follows a Poisson distribution, the ratio z = (a − b)/√(a + b), where a and b are the numbers of events in the two groups, has an approximately normal distribution. If z is 1.96, then P = 0.05; P values of 0.01, 0.001, and 0.0001 are equivalent to z values of 2.68, 3.29, and 3.89, respectively. This method does not eliminate the need for more accurate interim analyses, but it serves as a check on the arithmetic and provides a more easily understandable figure to evaluate. The reasons for early termination, however, are not affected by this calculation.
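A minimal sketch of this quick check (the event counts 30 and 15 are hypothetical, chosen only to show the arithmetic):

from math import sqrt
from scipy.stats import norm

def quick_z(a, b):
    """Approximate z statistic for comparing two small Poisson event counts."""
    return (a - b) / sqrt(a + b)

z = quick_z(30, 15)
p = 2 * (1 - norm.cdf(abs(z)))                   # two-sided P value from the normal approximation
print(f"z = {z:.2f}, approximate P = {p:.3f}")   # z = 2.24, P = 0.025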
The issue of prematurely stopping a clinical trial is complex, and data monitoring committees must take more into account than a P value that refers to the primary end point. Complications, the possibility of late-occurring results, and the importance of that particular trial all need to be taken into account. Pocock discussed these issues in detail (Pocock and White, 1999; Pocock, 2006). He pointed out that most reported trials that were terminated early for substantial benefit were based on limited data and showed unrealistically and unexpectedly large treatment effects. In fact, in some trials that were continued after apparent interim significance had been reached, subsequent results indicated a less marked difference between the treatments; Pocock termed this effect "regression to the truth" (Pocock and White, 1999).

The decision to correct for multiplicity is often complicated, and the problem is intensified in clinical trials in which multiple end points may be involved. An error due to multiple comparisons also occurs when a low P value is not achieved in an experiment and the investigators then add more subjects, retest, and continue adding subjects until a satisfactorily low P value is obtained. This illicit process, termed P-hacking (Simmons et al., 2011), cannot be condoned. An adequate sample size should be determined before beginning the experiment. A similar error occurs when two treatments that are "not significantly different" are compared and the large samples are then subdivided into many smaller samples, each of which is tested (Lee et al., 1980).
SEQUENTIAL ANALYSIS
The ultimate form of interim analysis is sequential analysis, performed after each pair of data points is accrued, whether the data are measured values or preferences (Armitage, 1975). As an example, for comparing preferences (e.g., A is better than B or worse than B), a grid is prepared (Fig. 24.4). To construct the figure, the α and β errors are designated, usually α = 0.05 and β = 0.1–0.2, and the magnitude of the expected difference θ is selected (often 0.85). Based on these three numbers, the appropriate tables are consulted to determine how to draw the boundary lines.
Fig. 24.4 Closed pattern with treatment A being no better than treatment B, and open pattern with A being better than B.
Left panel: For the first pair of preferences (two subjects, two forearms, two cough remedies, etc.), if A is better than B then a diagonal line is drawn upwards in the first square. The same is true of the second preference, so another diagonal line is drawn upwards. If the next preference is for B over A, the next diagonal line passes downwards, as in the third box. Eventually one of three patterns will occur. The jagged preference line crosses the upper boundary, indicating that A is better than B at the stated α and β levels; or it crosses the lower boundary, which means that B is better than A at the stated α and β levels; or it crosses the middle boundary, which means that the trial has failed to show a difference of the expected magnitude. The figure shows a closed pattern because the middle boundaries are set. It is also possible to have an open design (right panel), for which the tables provide pairs of parallel lines that serve the same function as the boundaries above. This design has some advantages, except that with unbounded parallel lines it is possible for the observed preference line to meander indefinitely without crossing a boundary. A good example of sequential analysis with an open design was published by Lewis et al. (1983) on the use of aspirin in angina pectoris.

Cautionary Tale
It is important to graph correctly. One study (Hellier, 1963) reported results comparing the effects of trimeprazine versus amylobarbitone in controlling pruritus. A preference for trimeprazine over amylobarbitone produced a line going up at 45 degrees, and a preference for amylobarbitone over trimeprazine produced a line going down at 45 degrees. This graph also showed horizontal lines that indicated no preference. These should not be part of the figure construction; drawing horizontal portions has the effect of decreasing the mean angle and making the preference line cross the null boundary, giving a false sense of no difference between the treatments. It is, of course, possible for the vast majority of comparisons to result in ties. If that happened, the correct conclusion would be that for most subjects there was no difference between the two treatments, but that for the few who expressed a difference, A was preferred more often than B.
Sequential analysis can also be done for measured values. For example, if paired differences are examined, then for each pair calculate the ratio z = (Σd)^2/Σd^2, where d is the difference (A − B) between the members of each pair and may be positive or negative, Σd is the running sum of the successive values of d, and Σd^2 is the running sum of their squares. If the null hypothesis is true, Σd will remain small, and (Σd)^2 will become a smaller and smaller fraction of Σd^2. If there is a meaningful difference, z will increase and eventually reach and cross the critical boundary. The data are plotted on a figure (Fig. 24.5) in which the boundaries are set by tables available in Armitage's book (Armitage, 1975). Several variations of these designs are possible (Armitage, 1975).
Fig. 24.5 Sequential trial of paired measurements that do not allow us to reject the null hypothesis.
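A small sketch of this running calculation (the paired differences are invented for illustration):

differences = [1.2, -0.4, 0.8, 1.5, -0.2, 0.9]   # d = A - B for successive pairs

sum_d = 0.0
sum_d2 = 0.0
for i, d in enumerate(differences, start=1):
    sum_d += d
    sum_d2 += d * d
    z = sum_d ** 2 / sum_d2                      # z = (sum of d)^2 / (sum of d^2)
    print(f"after pair {i}: z = {z:.2f}")
# Under the null hypothesis z stays small; a real difference makes z grow until it crosses
# the boundary taken from Armitage's tables.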
Sequential analyses may minimize the number of patients involved in a trial (Armitage, 1975). All the precautions discussed above concerning early termination of a group sequential clinical trial apply here too.
ADAPTIVE METHODS
Randomized sampling is the hallmark of good clinical trials (Chapter 38), but apart from the practical problems of designing an impeccable trial, such trials raise a major ethical issue: is it ethical to give one group of patients what will turn out to be an inferior treatment (Royall et al., 1991)? This cannot be known beforehand, because that is why the trial is being done, but is there a way to shorten the time during which the less effective treatment is used? One method, described above, is to examine the results at different times after starting the trial, but this cannot be done too often. A second method, sequential sampling, on average decreases the needed sample size but is sometimes difficult to implement. An interesting variant is an adaptive strategy termed randomized play-the-winner that can be used if there are binary outcomes detectable soon after treatment has started. As described by Rosenberger (1999), for two treatments A and B an urn (actual or theoretical) is set up to contain αA balls of type A and αB balls of type B. For the null hypothesis that A and B are equally good, αA = αB. After the first few patients are selected at random, the adaptive strategy begins. Any success with A or failure with B causes another type A ball to be added to the urn; conversely, any success with B or failure with A causes another type B ball to be added to the urn. In this way, if treatment A tends to produce better results, the number of A type balls grows faster than the number of B type balls, and the next patient has a greater chance of receiving the apparently better
treatment, and only a minimum number of patients will have received the inferior treatment. The theory and practice of this technique have been described (Rosenberger, 1999; Yao and Wei, 1996).

Cautionary Tale
In the first reported clinical trial with this method (Bartlett et al., 1985), the value of ECMO (extracorporeal membrane oxygenation) was assessed in moribund infants with respiratory failure and no response to optimal therapy; their mortality risk was estimated to be 80%–100%. The first patient was randomized to receive ECMO and survived. The next patient was assigned to no ECMO and died. The next patient was assigned to ECMO and survived, and all subsequent patients were assigned to ECMO and survived. Eventually there were 11 survivors of ECMO and 1 death of a patient who did not receive ECMO. Unfortunately, there were problems in that trial, not the least of which were that only one subject did not receive ECMO and that none of the infants who were not in the trial and did not receive ECMO died (Paneth and Wallenstein, 1985; Ware and Epstein, 1985). (Subsequently, more extensive trials confirmed the value of ECMO; UK Collaborative ECMO Trial Group, 1996.)

Care is needed when selecting adaptive designs, and power is difficult to determine. Nevertheless, sometimes such a design might give a clear answer using the minimum number of subjects. In a conventional trial of AZT in preventing the vertical transmission of HIV from mother to infant, 239 women received AZT and 238 received placebo; 60 infants in the placebo group had HIV, as against 20 in the treated group (Connor et al., 1994). An analysis of this study using an adaptive design suggested that a similar result could have been attained with only 7 failures in the placebo group and thus would have involved a much smaller trial (Yao and Wei, 1996). Adaptive methods have recently been recommended as a more effective way to conduct clinical trials (Berry et al., 2015; Meurer et al., 2012). A readable review of various adaptive designs is provided by Chow and Chang (2008).
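A minimal simulation sketch of the urn scheme described above (the success probabilities of 0.8 for A and 0.5 for B, and the number of patients, are invented for illustration):

import random

random.seed(2)
urn = ["A", "B"]                        # start with one ball of each type (alpha_A = alpha_B)
p_success = {"A": 0.8, "B": 0.5}
assignments = []

for _ in range(100):                    # 100 hypothetical patients
    treatment = random.choice(urn)      # draw a ball (with replacement) to assign treatment
    success = random.random() < p_success[treatment]
    assignments.append(treatment)
    other = "B" if treatment == "A" else "A"
    # a success adds a ball of the same type; a failure adds a ball of the other type
    urn.append(treatment if success else other)

print("patients assigned to A:", assignments.count("A"))
print("patients assigned to B:", assignments.count("B"))
# As A succeeds more often, A-type balls accumulate and more patients receive A.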
REFERENCES
Abdi, H., 2007. The Bonferroni and Šidák Corrections for Multiple Comparisons. Sage, Thousand Oaks, CA. Available: http://wwwpub.utdallas.edu/herve/Abdi-Bonferroni2007-pretty.pdf. Armitage, P., 1975. Sequential Medical Trials. Wiley & Sons, New York. Bartlett, R.H., Roloff, D.W., Cornell, R.G., Andrews, A.F., Dillon, P.W., Zwischenberger, J.B., 1985. Extracorporeal circulation in neonatal respiratory failure: a prospective randomized study. Pediatrics 76, 479–487. Berry, S.M., Connor, J.T., Lewis, R.J., 2015. The platform trial: an efficient strategy for evaluating multiple treatments. JAMA 313, 1619–1620. Bland, J.M., Altman, D.G., 1995. Multiple significance tests: the Bonferroni method. BMJ 310, 170. Carter, R.E., 2010. A simple illustration for the need of multiple comparison procedures. Teach. Stat. 32, 90–91. Chow, S.C., Chang, M., 2008. Adaptive design methods in clinical trials—a review. Orphanet J. Rare Dis. 3, 11.
Connor, E.M., Sperling, R.S., Gelber, R., Kiselev, P., Scott, G., O’sullivan, M.J., Vandyke, R., Bey, M., Shearer, W., Jacobson, R.L., et al., 1994. Reduction of maternal-infant transmission of human immunodeficiency virus type 1 with zidovudine treatment. Pediatric AIDS Clinical Trials Group Protocol 076 Study Group. New Engl. J. Med. 331, 1173–1180. Creasy, R.K., Barrett, C.T., De Swiet, M., Kahanpaa, K.V., Rudolph, A.M., 1972. Experimental intrauterine growth retardation in the sheep. Am. J. Obstet. Gynecol. 112, 566–573. Glantz, S.A., 1980. Biostatistics: how to detect, correct, and prevent errors in the medical literature. Circulation 61, 1–7. Hellier, F.F., 1963. A comparative trial of trimeprazine and amylobarbitone in pruritus. Lancet 1, 471–472. Karp, N.A., Lilley, K.S., 2007. Design and analysis issues in quantitative proteomics studies. Proteomics 7 (Suppl. 1), 42–50. Kusuoka, H., Hoffman, J.I., 2002. Advice on statistical analysis for circulation research. Circ. Res. 91, 662–671. Lee, K.L., Mcneer, J.F., Starmer, C.F., Harris, P.J., Rosati, R.A., 1980. Clinical judgment and statistics. Lessons from a simulated randomized trial in coronary artery disease. Circulation 61, 508–515. Lewis, H.D.J., Davis, J.W., Archibald, D.G., Steinke, W.E., Smitherman, T.C., Doherty III, J.E., Schnaper, H.W., Lewinter, M.M., Linares, E., Pouget, J.M., Sabharwal, S.C., Chesler, E., Demots, H., 1983. Protective effects of aspirin against acute myocardial infarction and death in men with unstable angina. Results of a Veterans Administration Cooperative Study. New Engl. J. Med. 309, 396–403. Martin, G., 1984. Munchausen’s statistical grid, which makes all trials significant. Lancet 2, 1457. McPherson, K., 1974. Statistics: the problem of examining accumulating data more than once. New Engl. J. Med. 290, 501–502. Mehta, C., Gao, P., Bhatt, D.L., Harrington, R.A., Skerjanec, S., Ware, J.H., 2009. Optimizing trial design: sequential, adaptive, and enrichment strategies. Circulation 119, 597–605. Meurer, W.J., Lewis, R.J., Tagle, D., Fetters, M.D., Legocki, L., Berry, S., Connor, J., Durkalski, V., Elm, J., Zhao, W., Frederiksen, S., Silbergleit, R., Palesch, Y., Berry, D.A., Barsan, W.G., 2012. An overview of the adaptive designs accelerating promising trials into treatments (ADAPT-IT) project. Ann. Emerg. Med. 60, 451–457. O’Brien, P.C., Shampo, M.A., 1988a. Statistical considerations for performing multiple tests in a single experiment. 1. Introduction. Mayo Clin. Proc. 63, 813–815. O’Brien, P.C., Shampo, M.A., 1988b. Statistical considerations for performing multiple tests in a single experiment. 3. Repeated measures over time. Mayo Clin. Proc. 63, 918–920. O’Brien, P.C., Shampo, M.A., 1988c. Statistical considerations for performing multiple tests in a single experiment. 4. Performing multiple statistical tests on the same data. Mayo Clin. Proc. 63, 1043–1045. O’Brien, P.C., Shampo, M.A., 1988d. Statistical considerations for performing multiple tests in a single experiment. 5. Comparing two therapies with respect to several endpoints. Mayo Clin. Proc. 63, 1140–1143. O’Neill, R., Wetherill, G.B., 1971. The present state of multiple comparison methods. J. Rol. Stat. Soc. Ser. B 33, 218–241. Paneth, N., Wallenstein, S., 1985. Extracorporeal membrane oxygenation and the play the winner rule. Pediatrics 76, 622–623. Perneger, T.V., 1998. What’s wrong with Bonferroni adjustments. BMJ 316, 1236–1238. 
Peto, R., Pike, M.C., Armitage, P., Breslow, N.E., Cox, D.R., Howard, S.V., Mantel, N., McPherson, K., Peto, J., Smith, P.G., 1976. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br. J. Cancer 34, 585–612. Pocock, S.J., 2006. Current controversies in data monitoring for clinical trials. Clin. Trials 3, 513–521. Pocock, S., White, I., 1999. Trials stopped early: too good to be true? Lancet 353, 943–944. Pocock, S.J., Hughes, M.D., Lee, R.J., 1987. Statistical problems in the reporting of clinical trials. New Engl. J. Med. 317, 426–432. Rosenberger, W.F., 1999. Randomized play-the-winner clinical trials: review and recommendations. Control. Clin. Trials 20, 328–342. Rothman, K.J., 1990. No adjustments are needed for multiple comparisons. Epidemiology 1, 43–46.
Royall, R.M., Bartlett, R.H., Cornell, R.G., Byar, D.P., DuPont, W.D., Levine, R.J., Lindley, F., Simes, R.J., Zelen, M., 1991. Ethics and statistics in randomized clinical trials. Stat. Sci. 6, 52–88. Simmons, J.P., Nelson, L.D., Simonsohn, U., 2011. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol. Sci. 22, 1359–1366. Storey, J.D., Tibshirani, R., 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100, 9440–9445. Tukey, J.W., 1977. Some thoughts on clinical trials, especially problems of multiplicity. Science 198, 679–684. UK Collaborative ECMO Trial Group, 1996. UK collaborative randomised trial of neonatal extracorporeal membrane oxygenation. Lancet 348, 75–82. Wallenstein, S., Zucker, C.L., Fleiss, J.L., 1980. Some statistical methods useful in circulation research. Circ. Res. 47, 1–9. Ware, J.H., Epstein, M.F., 1985. Extracorporeal circulation in neonatal respiratory failure: a prospective randomized study. Pediatrics 76, 849–851. Wright, S.P., 1992. Adjusted P-values for simultaneous inference. Biometrics 48, 1005–1013. Yao, Q., Wei, L.J., 1996. Play the winner for phase II/III clinical trials. Stat. Med. 15, 2413–2423 (discussion 2455-8).