J Clin Epidemiol Vol. 51, No. 7, pp. 537–545, 1998
COMMENTARY
The Quest for "Power": Contradictory Hypotheses and Inflated Sample Sizes

Alvan R. Feinstein and John Concato

Departments of Medicine and Epidemiology, Yale University School of Medicine, New Haven, Connecticut, and West Haven Veterans Affairs Medical Center, West Haven, Connecticut

ABSTRACT. To have the "power" of avoiding undersized clinical trials, the customary statistical strategy used in the past few decades is aimed at rejecting both a null stochastic hypothesis and a contradictory alternative hypothesis. This approach gives a trial the "power" to confirm the "insignificance" of differences much smaller than the large value of δ desired in trials done to show efficacy. In many instances, however, a prime problem is that the current "double-significance" approach produces sample sizes 2–3 times larger than needed for stochastic confirmation of large differences (≥ δ). The inflated sample sizes and consequent problems can be avoided if a realistic value for δ is chosen and maintained thereafter, and if an adequate "capacity" is calculated for "single significance." J Clin Epidemiol 51;7:537–545, 1998. © 1998 Elsevier Science Inc.

KEY WORDS. Statistical significance, Neyman-Pearson, clinical trials, sample sizes, power

Address correspondence to: Alvan R. Feinstein, MD, Yale University School of Medicine, 333 Cedar Street—Rm. I 456 SHM, New Haven, CT 06510. Accepted for publication on 24 March 1998.
[Note from the authors: Readers are herewith warned that the discussion presented here is a departure from what has been accepted as "conventional wisdom." To understand what is being proposed, try to set aside whatever entrenched ideas you may have developed or been given about sample-size calculations. Instead, begin with an open mind, asking what do you want, why do you want it, and how do you go about getting it.]

Until about 25 years ago, investigators doing a clinical trial of efficacy would select δ as the lower boundary of the anticipated large difference in outcomes that would show Treatment A was better than Treatment B. After the trial was completed, with observed success rates of pA and pB and an increment d0 = pA − pB, the hope was that d0 would be big enough to equal or exceed the boundary value of δ. If the observed result showed the expected big difference (i.e., d0 ≥ δ), "statistical significance" could be declared if a suitable stochastic test (such as Z or chi-square) led to rejection of the conventional null hypothesis that success rates of the two treatments have a parametric difference of 0. For this rejection, the lower boundary of the confidence interval around the observed d0 must exclude 0, that is

d0 − Zα(SED) ≥ 0.    (1)
In formula (1), SED is the standard error of the difference, pA − pB, and Zα is a Gaussian factor corresponding to a selected level of α, which represents the probability of a "false positive" rejection if the null hypothesis is true. In most instances, with α set at a two-tailed value of .05, the corresponding Zα is 1.96. For the results of two rates (or proportions), the standard error of the difference is usually calculated as

SED = √(NPQ/(nAnB)),    (2)
where nA and nB are the respective group sizes; N = nA + nB; and under the null hypothesis that the two treatments are equivalent, P is the "common" success-rate value (nApA + nBpB)/N, with Q = 1 − P. When sample size is calculated in advance for the desired goal, the usual approach assigns the two groups equal size, so that each group's membership is nA = nB = n = N/2. With equal sample sizes, the value of P̂ for the anticipated p̂A and p̂B becomes (p̂A + p̂B)/2, and Q̂ = 1 − P̂. The value of SED in formula (2) can then be estimated as √(2P̂Q̂/n) and inserted into formula (1), with δ substituted for d0. Squaring and rearranging terms to solve for n produces the conventional formula that expresses the sample size needed for one group as

n ≥ 2P̂Q̂(Zα)²/δ².    (3)
To show how the calculations work, suppose δ is set at .15 for a clinical trial. If p̂B is estimated as .268, the estimate for p̂A will be .418; and P̂ (with equal group sizes) will be (.268 + .418)/2 = .343. The value of Q̂ will be 1 − .343 = .657. With Zα = 1.96 for a two-tailed α = .05, formula (3) will produce n ≥ 2(.343)(.657)(1.96)²/(.15)² = 77.0. Thus, about 154 persons would be needed in the trial.
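For readers who want to verify such arithmetic, the short Python sketch below implements formula (3); the function name and its default Zα are illustrative choices, not part of the original presentation.

```python
from math import ceil

# A minimal sketch of formula (3), the "single-significance" sample size per
# group; the function name and default z_alpha are illustrative assumptions.
def single_significance_n(p_b, delta, z_alpha=1.96):
    p_a = p_b + delta            # anticipated success rate for Treatment A
    p_bar = (p_a + p_b) / 2      # estimated "common" proportion with equal group sizes
    q_bar = 1 - p_bar
    return 2 * p_bar * q_bar * z_alpha**2 / delta**2

n = single_significance_n(p_b=0.268, delta=0.15)
print(n, ceil(n) * 2)            # about 77 per group, about 154 persons in total
```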
This relatively straightforward approach to calculating sample size was essentially abandoned, however, after a 1978 publication [1] showing that clinical trials were often undersized and lacked "power." According to the authors, investigators using too-small samples may have been "missing an important therapeutic improvement" when reaching a "negative" conclusion and conceding the null hypothesis after finding P > .05 for the compared results. With a large enough sample size, such trials might often have attained "statistical significance." The suggested solution to the problem was based on concepts of alternative hypotheses and "power" first proposed by Neyman and Pearson [2,3] in 1928 and 1933.

NEYMAN-PEARSON CONCEPT OF "POWER"

In the Neyman-Pearson concepts, the rejection of the conventional null hypothesis would imply that an alternative stochastic hypothesis had been conceded. A level of β would be chosen for the probability that the alternative hypothesis would be falsely rejected if correct. Thus, when decisions were made about the null hypothesis, rejection would be accompanied by an α probability of a "false positive" or Type I error, and concession would involve the β probability of a "false negative" or Type II error. Because the alternative hypothesis could be placed at various locations, the different settings would lead to various "power functions." Just as 1 − α had been employed to construct confidence intervals, the value of 1 − β was used to represent "power," which will increase as β is made smaller.

In planning clinical trials and other research, the alternative stochastic hypothesis is located at δ, with the statement that the parametric difference in the compared groups is ≥ δ. To reject the alternative hypothesis, the upper boundary of the confidence interval for the observed result must exclude δ, so that

d0 + Zβ(SED) ≤ δ.    (4)
Like Zα, the value of Zβ in formula (4) is also a Gaussian factor, suitably chosen according to the selected level of β. Because β is usually interpreted in a one-tailed direction, Zβ = 1.645 if a one-tailed β is set at .05. With other choices, Zβ = 1.28 if the one-tailed β = .10, or .84 for β = .20.

Not fully understanding this statistical formulation, most investigators think of "power" as denoting the general numerical adequacy of a sample, rather than its specific ability to reject the alternative stochastic hypothesis. Consequently, when planning a trial intended to show efficacy, most investigators do not realize that "power" is calculated for a contradictory goal. For efficacy, the hope is to find a "big" result (i.e., d0 ≥ δ) and to show stochastically that it is not zero. When the observed result, however, is disappointingly small
(i.e., d0 < δ), the role of power is to offer stochastic confirmation for the smallness. For example, in the foregoing trial where the investigators hoped to find δ = .15 with 77 persons in each group, suppose the results showed pB = 21/77 = .272 and pA = 26/77 = .338, so that d0 = .338 − .272 = .066. With a one-tailed β set at .05, and a power of 95%, the upper end of the confidence interval stated in formula (4) is determined as .066 + (1.645)(.0740) = .188. Since this value does not exclude δ = .15, the investigators cannot reject the alternative hypothesis. The observed result, although much smaller than desired, is still compatible with a true difference as large as δ.

If the investigators had really expected to find a difference as small as the observed d0, and if they had really wanted stochastic confirmation of its smallness, a larger sample would have been needed. Its size could be determined from the strategy stated in formula (4). The goal is to achieve a value of SED that will produce

SED ≤ (δ − d0)/Zβ.    (5)
Under the alternative, rather than the null, hypothesis, a different formula is used to calculate SED as

SED = √(pAqA/nA + pBqB/nB).    (6)
For most practical purposes, however, formulas (2) and (6) give closely similar results. (For the data now under discussion, SED is .0742 with formula (2) and .0740 with formula (6).) With equal sample sizes, the alternative SED is estimated as

√((p̂Aq̂A + p̂Bq̂B)/n).    (7)
When this estimate is inserted into formula (4), which is then solved for n, the result is

n ≥ (Zβ)²(p̂Aq̂A + p̂Bq̂B)/(δ − d0)².    (8)
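As a programmatic check on the worked example that follows, the sketch below implements formula (8) with the trial's observed rates; the function name is an illustrative choice, and the default Zβ assumes a one-tailed β of .05.

```python
from math import ceil

# A sketch of formula (8): the per-group size needed to reject the alternative
# hypothesis when the observed d0 is smaller than delta; names are illustrative.
def reject_alternative_n(p_a, p_b, delta, z_beta=1.645):
    d0 = p_a - p_b
    variance_term = p_a * (1 - p_a) + p_b * (1 - p_b)   # pA*qA + pB*qB
    return z_beta**2 * variance_term / (delta - d0)**2

n = reject_alternative_n(p_a=0.338, p_b=0.272, delta=0.15)
print(ceil(n))   # about 162 per group, i.e., roughly 324 patients in total
```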
Thus, to reject the alternative hypothesis here, with a one-tailed Zβ = 1.645 and a "power" of 95%, the required sample size would be at least n ≥ (1.645)²[(.338)(.662) + (.272)(.728)]/(.15 − .066)² = 162. The trial would therefore require a total of at least 324 patients to provide stochastic confirmation that the observed d0 = .066 is really "small."

SCIENTIFIC AND STOCHASTIC HYPOTHESES

At this point, however, a thoughtful investigator will wonder why sample size is being calculated for a stochastic statistical goal that is contrary to the scientific goal of the trial. Wanting to show efficacy, the investigator would like to find and stochastically confirm a large value of d0 (i.e., ≥ δ). In fact, if d0 turns out to be smaller than δ, the investigator will usually be pleased if the confidence interval includes δ, thus making the observed small result stochastically compatible with the desired big δ. Rejecting the possible
compatibility with δ is not what the investigator sought in doing the trial.

The investigator might now be told, however, that the larger sample size calculated with formula (8) will be advantageous if d0 turns out to be slightly smaller than the preset δ of .15. For example, consider the trial that was done with 77 persons in each group [as calculated with formula (3)]. If the results showed pB = 21/77 = .272 and pA = 31/77 = .403, so that d0 = .13, the difference is smaller than the desired δ = .15, and will not be stochastically confirmed as "significant" because the SED will be too large to allow rejection of the null hypothesis. Nevertheless, the observed d0 = .13 can hardly be regarded as a "small" difference. On the other hand, if the trial had been done with 162 persons in each group [as calculated with formula (8)], the null hypothesis could be rejected for the same d0 = .13 with results of pB = 44/162 = .272 and pA = 65/162 = .401. For these results, the d0 of .13 would be called "statistically significant," with its 95% confidence interval extending from .027 to .233.

A thoughtful viewer of the process might now ask why the original δ was set at .15, if a smaller value such as d0 = .13 was also acceptable as "big." If even smaller values for d0—such as .12, .10, or even .09—are all acceptable as "big," why were they not chosen for the boundary of δ? Thus, if d0 = .09 is acceptable as "big," but .08 is not, the value of δ should presumably be set at the value of .09. The sample size for rejection of the null hypothesis with formula (3) and with p̂B estimated as .268 would then become n ≥ 2(.313)(.687)(1.96)²/(.09)² = 203.9, so that 204 persons would be needed in each group. This size is larger than the previous 77 calculated with δ = .15, and is also larger than the 162 required for the "power" calculation in formula (8).

The next question, therefore, is why do investigators demarcate the boundary of "big" with a large δ (such as .15), although they will accept a substantially smaller value (such as .09) as still being "big"? The answer to this question is provided—as noted shortly—by the current "double" approach to power, which produces a sample size that can simultaneously reject both the null and the alternative hypothesis. (The "double-significance" concept here refers to two stochastic hypotheses, not to the use of a "one-tailed" or "two-tailed" approach for evaluating individual hypotheses.) Before the "double" approach is introduced, however, some further discussion is needed for the two different types of "non-significance" under the conventional null hypothesis.

TWO TYPES OF "NON-SIGNIFICANCE"

In the now-classic paper [1] that demonstrated a too-small size for the groups studied in clinical trials, the authors did not separate two distinctly different situations in which a result can be "non-significant."
Impressive ("Big") Value of d0

In the first situation, the trial produces the expected big value of d0, but the investigators are chagrined to find that it is not "statistically significant." For example, with the big value of δ set at .15, a trial might show pA = 18/42 = .429 and pB = 11/41 = .268, so that d0 = .429 − .268 = .161, which exceeds the level of δ = .15. When this result is tested for "significance" using formula (1), the common proportion is calculated as P = [42(.429) + 41(.268)]/83 = .349 and Q = 1 − P = .651, so that SED = √[83(.349)(.651)/(42)(41)] = .105. If Zα is set at 1.96 for the customary two-tailed 95% confidence interval, the lower boundary of the interval will extend to .161 − (1.96)(.105) = −.045, which does not exclude zero. Thus, although quantitatively impressive, the observed result would be deemed statistically "non-significant." This situation is illustrated in Figure 1A.

The problem just cited occurred because the sample size was definitely too small. Although the observed d0 exceeded the desired δ = .15, the sample lacked the "capacity" to reject the null hypothesis. For the observed result of pA = .429 and pB = .268, with d0 = .161, formula (3) shows that a sample size of adequate capacity would have required n ≥ 2(.3485)(.6515)(1.96)²/(.161)² = 67.3. With 68 persons in each group, a total of 136 would have been needed. Thus, the trial containing 83 persons had a capacity of only 83/136 = 61% for confirming "statistical significance" of the observed big difference.

Unimpressive ("Small") Value of d0

The other "non-significant" situation arises when the value of d0 is distinctly smaller than δ. For example, suppose the investigators hoping to demonstrate a big difference for the 83 patients in the foregoing trial had instead found a very small one, so that pA = 12/42 = .285 and pB = 11/41 = .268, with d0 = .285 − .268 = .017. This small difference would probably not be tested for "significance," but the test, if done, would determine P = [42(.285) + 41(.268)]/83 = .277, Q = .723, and SED = √[83(.277)(.723)/(42)(41)] = .098. With Zα set at 1.96, the lower boundary of the confidence interval is calculated as .017 − .192 = −.175. The extension below 0 indicates that the result is not stochastically significant.

On the other hand, when d0 is small, the upper end of the confidence interval can be explored to see how large the difference might have been under the possibilities of random chance. Thus, if

d0 + Zα(SED) ≥ δ,    (9)

there is a reasonable possibility, within the limits set by 1 − α, that the "small" result for d0 emerged as a stochastic variation from a true difference that is at least as large as δ.
FIGURE 1. Confidence intervals for three clinical situations. In each instance the confidence interval for the observed result, d0, is shown in the upper (dashed) line. The lower (solid) line shows the location of d0 as well as the desired value of δ and the null value of 0. (A), the lower boundary of the confidence interval includes 0, although d0 > δ. (B), the upper boundary of the confidence interval exceeds δ although d0 < δ. (C), the location of d0 is shown for the confidence interval that satisfies the "double-significance" goal of being simultaneously >0 and <δ.
In the example here, the upper boundary of the confidence interval is .017 + .192 = .209, a large value that offers consolation to the investigators. Despite the small d0 = .017, the result is stochastically compatible with the desired large value of δ. This situation is shown in Figure 1B.

If sample size is calculated using formula (3) for "single significance," the anticipated value of Zα(SED) is δ. When added to d0, this result will fulfill the requirement stated in formula (9), unless d0 is either negative (in the reverse direction of what was expected) or small enough to reduce the observed PQ substantially below the anticipated value. Another (uncommonly used) strategy for satisfying formula (9) is to have a sufficiently small value of n. Expansion and rearrangement of the algebra in formula (9) produces n ≤ Zα²(2PQ)/(δ − d0)². In the situation under discussion, this calculation would produce (1.96)²(2)(.277)(.723)/(.15 − .017)² = 87 members as a maximum for each group. With a total of only 83 persons in the trial, formula (9) will easily be satisfied.

The foregoing discussion demonstrates that the first problem (i.e., stochastic failure to confirm an observed big difference) can always be avoided if a sample size with large enough capacity is calculated in advance with the relatively simple formula (3). With this approach in a trial done with the hope of showing a big difference for d0, the investigators will seldom be badly disappointed by the results as long as d0 exceeds 0. If suitably large, d0 will be stochastically confirmed as "significant"; and if not too small, d0 will almost always have an upper confidence limit compatible with the large δ.
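Both "non-significant" situations can be reproduced with a single confidence-interval routine. The sketch below is our own illustration (not a formula given in the paper): it uses the pooled SED of formula (2) and prints the intervals for the "big" (.161) and "small" (.017) examples above.

```python
from math import sqrt

# A sketch of the null-hypothesis confidence interval d0 +/- Z_alpha * SED,
# with SED from formula (2); the helper name is illustrative.
def null_ci(successes_a, n_a, successes_b, n_b, z_alpha=1.96):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    d0 = p_a - p_b
    p = (successes_a + successes_b) / (n_a + n_b)     # pooled ("common") proportion
    sed = sqrt((n_a + n_b) * p * (1 - p) / (n_a * n_b))
    return d0 - z_alpha * sed, d0 + z_alpha * sed

print(null_ci(18, 42, 11, 41))   # roughly (-.04, .37): "big" d0, lower bound below 0
print(null_ci(12, 42, 11, 41))   # roughly (-.18, .21): "small" d0, upper bound above delta = .15
```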
THE "DOUBLE-SIGNIFICANCE" STRATEGY

To avoid sample sizes that will be too small, the strategy used in the past two decades has pursued a different approach, aimed at achieving "double significance." The goal is to have both the "capacity" to reject the original null stochastic hypothesis if d0 is big (i.e., ≥ δ), and also the "power" to reject the alternative stochastic hypothesis if d0 is small (i.e., < δ). With this approach, the trial should provide a "statistically significant" result no matter which way things turn out, and regardless of what the investigators had hoped to find. The results will be stochastically confirmed whether d0 turns out to be impressively big or insignificantly small.

One of the current authors (ARF) helped advocate the "double-significance" approach in a paper, about 20 years ago, titled "The other side of 'statistical significance': Alpha, beta, delta, and the calculation of sample size" [4]. Because the dual-hypothesis approach was described and illustrated in a manner "accessible" to clinical readers, the paper was frequently cited thereafter and often used to help persuade investigators that "power" and "double-significance" calculations were needed to avoid "false negative" conclusions. The paper was written because its author was pleased both to have discerned the mathematical argument of the Neyman-Pearson doctrine and to have succeeded in preparing a reasonably clear account of it. Like most investigators, however, the author had accepted the doctrine itself without giving further thought to its inherent contradictions and without thoroughly understanding its problems.

In the mathematical calculations used to satisfy both stochastic goals, the value of d0 in formula (1) is entered into formula (4) to produce

δ ≥ (Zα + Zβ)(SED).    (10)

With suitable substitution of the estimated value for SED, formula (10) can be solved to yield

n ≥ 2P̂Q̂(Zα + Zβ)²/δ².    (11)

Formula (11) is the simplest mathematical statement of the "double-significance" strategy.
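A compact way to see the relation between formulas (3) and (11) is that setting Zβ = 0 in formula (11) recovers formula (3). The sketch below makes the point; the function name is illustrative, and the numerical case anticipates the δ = .09 example worked in the next paragraph.

```python
# A sketch of formula (11), the "double-significance" sample size per group;
# z_beta = 0 reduces it to the "single-significance" formula (3).
def double_significance_n(p_b, delta, z_alpha=1.96, z_beta=1.28):
    p_a = p_b + delta
    p_bar = (p_a + p_b) / 2
    return 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta)**2 / delta**2

print(double_significance_n(0.268, 0.09))              # about 557 per group
print(double_significance_n(0.268, 0.09, z_beta=0))    # about 204: formula (3) again
```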
As noted in the Appendix, the formula may be slightly modified for a different calculation of SED or for application of a "continuity correction," but the basic version just described is relatively straightforward and will be used for the subsequent discussions here.

To show the numerical effects of this process, consider the clinical trial discussed earlier, where the investigators anticipated p̂A = .358 and p̂B = .268 with δ = .09. Using the "singly-significant" formula (3), the required sample size was 204 persons in each group. Using the "doubly-significant" formula (11), with Zβ set at 1.28 for a one-tailed β = .10, having "power" of 90%, the sample size would be n ≥ 2(.313)(.687)(1.96 + 1.28)²/(.09)² = 557.4. This size is more than two and one-half times the 204 persons required for "single significance" with δ = .09.

To avoid the dismayingly large sample sizes that can arise with the "double-significance" formula, its components can be altered in three ways. First, the null hypothesis can be tested in a uni-directional manner, so that Zα for a one-tailed α = .05 can be converted to a smaller value of 1.645. Second, Zβ can be relaxed to .84 for 80% power with a one-tailed β = .20. With these two changes alone, the foregoing sample size would be reduced to n ≥ 2(.313)(.687)(1.645 + 0.84)²/(.09)² = 328, but it is still about 50% larger than the previous value of 204. Furthermore, the alterations of α and β may be stochastically unappealing to data analysts who want a two-tailed α and a power of about 90%. Accordingly, a third option is often chosen, which increases the value of δ. For δ = .15 and p̂B = .268, the value of p̂A will be .418 and P̂ will be .343. If Zα and Zβ are returned respectively to values of 1.96 and 1.28, the sample size will be n ≥ 2(.343)(.657)(1.96 + 1.28)²/(.15)² = 210.3, which is smaller than any of the previous "double-significance" calculations with δ = .09.

The latter sample size of about 211, although fulfilling the mathematical goals of a "double-significance" calculation, will still be larger, however, than the 204 needed for "single significance" if the observed d0 turns out to be .09 (or larger). Thus, the advantage of the "double-significance" formula is somewhat illusory. It seems to offer 90% power for rejecting the null hypothesis at a two-tailed α = .05 if the observed d0 is bigger than δ = .15, but the main advantage is that the sample size will be large enough to reject the null hypothesis if d0 has the much smaller value of .09. Furthermore, the "double-significance" sample size of 211 at δ = .15 is larger than the corresponding "single-significance" size of 204 calculated directly for δ = .09.

MATHEMATICAL AND SCIENTIFIC PROBLEMS

Despite its current well-accepted status as the "authorized" method for calculating sample size, the "double-significance" approach creates some major problems, in both mathematics and science, that are discussed in the next seven sections.
Inflated Sample Size

A comparison of formulas (3) and (11) shows that (for the same value of δ) formula (11) will always enlarge the size by a factor of [(Zα + Zβ)/Zα]², or [1 + (Zβ/Zα)]². If Zα is set at 1.96 and Zβ is set at 1.645, the factor will be [1 + (1.645/1.96)]² = 3.38. If Zβ is lowered to the now-often-used one-tailed 0.84, the value of Zβ/Zα = 0.84/1.96 = .43, and the "doubly-significant" sample will be 2.04 times the "singly-significant" size. Because the sample size becomes much larger than what is needed to achieve ample capacity for "single significance," the "double-significance" strategy has led to an unnecessary increase in the effort, time, and costs of clinical trials in the past two decades.

The use of an excessively large sample size can promptly be detected when readers examine the P-values published for clinical trial results. If α is set in advance at .05, a trial that finds d0 equal to or slightly larger than the anticipated δ should have a P-value slightly below .05. The P-value might be substantially smaller, at levels such as .01 or even .001, if the trial later shows an unexpectedly large value for d0. Even if d0 is substantially larger than expected, however, the P-values will seldom descend to such levels as .0001 or .00001 unless the sample size was excessively inflated [5].

Deflation of Original Boundary and "Truthful" Reporting

Although the trial is planned with the idea that an impressive difference requires d0 ≥ δ, the inflated sample size will regularly allow statistical "significance" to be attained for d0 < δ. Thus, a difference that was originally deemed too small to be impressive may nevertheless emerge as "significant." For example, in the earlier trial where 204 persons per group were needed for "double significance" with δ = .15, the investigators might be disappointed to discover that pA = 73/204 = .3578 and pB = 54/204 = .2647, so that d0 = .093, which is substantially below the desired δ = .15. Nevertheless, for this result, the value of SED turns out to be .0458, and the lower end of the 95% confidence interval will extend down to .093 − (1.96)(.0458) = .003, which exceeds 0. Thus, although the observed d0 of .093 is much smaller than the desired δ = .15, the investigators will nevertheless have attained a statistically "significant" result.

When "significance" is declared despite the finding that d0 < δ, the advance demarcation of boundaries has been violated as a prime principle of statistical hypothesis testing. According to this principle, all decisional boundary values such as α and β should be fixed before the results are observed, and should not be changed afterward. Nevertheless, the pre-established level of δ, which might be expected to have a similar sanctity, is constantly altered to fit the observed results. It becomes adjusted downward to comply
with whatever value of d0 turned out to be stochastically significant. Furthermore, an investigator who has done the trial believing that Treatment A is better than B may not acknowledge that the observed distinction is much smaller than desired. The claim of "significance," which is correct stochastically but not strictly correct quantitatively, then becomes an example of incomplete "truth in reporting." The issue is a matter of scientific taste rather than morality, however, because the investigators can usually cite the results (with expressions such as proportionate increments) to convince themselves (as well as reviewers and readers) that the relatively unimpressive d0 is really impressive. The occurrence of the "incomplete-truth" phenomenon can be suspected whenever the published report of a randomized trial does not mention the level of δ that was originally used for calculating the sample size of a result that emerges as "significant" [5].

Forcing d0 to Be < δ

A striking peculiarity of "doubly-significant" sample sizes is that the anticipated value of d0 is forced to be < δ for the calculation. As shown in Figure 1C, the value of d0 must lie between 0 and δ in order to reject both the null hypothesis, which requires that d0 − Zα(SED) ≥ 0, and the alternative hypothesis, which requires that d0 + Zβ(SED) ≤ δ. For d0 simultaneously to be ≥ Zα(SED) and ≤ δ − Zβ(SED), the unequal parts of the two equations can be ignored when they are added and solved to show that 2d0 ≅ δ + (Zα − Zβ)(SED). If Zα = Zβ, then d0 ≅ δ/2, and is forced to be half the value of δ. If Zα = 1.96 and Zβ = 0.84, d0 is somewhat larger, at δ/2 + (1.12)(SED)/2.

This constraint on d0 was not overtly specified earlier when formula (1) was solved for d0, which was then substituted into formula (4) to produce the doubly-significant sample-size formula (11). A careful examination of the required location of d0, however, shows that the "doubly-significant" sample size becomes a deliberate mechanism for getting statistical "significance" under the null hypothesis for values of d0 that are < δ.

Conflicting Goals of the "Double Hypothesis"

Beyond these mathematical difficulties, a distinct scientific problem is the conflicting goals of the "double hypothesis." The two compared treatments may not have been convincingly shown to be different when a trial begins in the state of "equipoise," but the investigator seldom wants to prove that the treatments are similar. Except for uncommon instances where the goal is really to show equivalence (a topic not under discussion here), most trials are done to demonstrate efficacy, where the hope is that d0 ≥ δ. With this
hope, an investigator unhappy to find that d0 < δ when the trial is completed will be pleased that δ is included in the upper end of the confidence interval. In the reasoning used for the alternative hypothesis, however, the investigators are trying to show that d0 is really small or that the main treatment under investigation is essentially no better than the comparative agent. The stochastic proof of this goal is determined with the confidence interval shown in formula (4), which leads to the sample size cited in formula (8). This size, used to reject the alternative hypothesis, will almost always be larger than the size determined with formula (3) to reject the null hypothesis.

The trial is not being done, however, to demonstrate that d0 is small and to get stochastic confirmation for the smallness. The investigators' goal is to show a big difference (i.e., that d0 ≥ δ). Consequently, wanting to show "efficacy" as a big difference in the compared treatments, the investigators would have no reason to calculate sample size using formula (8). They could apply the simple, direct approach of formula (3).

Not interested in getting stochastic proof that d0 < δ, an investigator who truly understands the consequences of the current statistical custom may be distressed by its implications. Established without regard to an investigative target or to a scientific hypothesis, the "double-significance" approach does not care whether d0 turns out to be big (≥ δ) or small (< δ) as long as the result achieves "significance" by rejecting a statistical hypothesis, no matter which way it goes.

Confusion about "Power"

Perhaps the most scientifically confusing aspect of the "power" strategy is the use of a hypothesis in which the investigators are not interested. Wanting to confirm or at least accept the idea that d0 ≥ δ, not to prove the opposite, investigators are told that the sample size will have inadequate "power" unless it can reject something that they want to accept.

An investigator who hopes to show that d0 ≥ δ can use formula (3) to determine a sample size of adequate capacity. If statistical "significance" is not obtained when the trial shows d0 ≥ δ, the group sizes were too small; and their inadequacy can readily be demonstrated by comparing their actual sizes with what would have been needed. In the currently popular approach, however, sample sizes that are too small are said to lack "power" for a different task: the ability to reject the alternative hypothesis that the difference is large.

When a sample size is calculated with only Zα in formula (3), the procedure has set a value of 0 for Zβ in formula (11). When Zβ = 0, a two-tailed P-value, determined under the alternative hypothesis, will be 1. For the one-tailed calculation of power, half of the two-tailed P-value will be 0.5, and "power" will be 1 − 0.5 = 0.5. Traumatized by the threat that formula (3) will produce
only 50% power despite adequate capacity, the investigators may accept the blandishment of the "double-significance" strategy, and then calculate an inflated sample size, much larger than needed either for stochastically confirming that d0 ≥ δ or for showing that the result might be as large as δ if d0 turns out to be < δ.

The main scientific reason for including "power" in the calculation with formula (11) would be a desire to contradict the goal of the trial and to reject the alternative hypothesis by stochastically confirming "insignificance" if d0 < δ. Because most investigators do not understand that the popular calculation and the idea of "power" are aimed at this contradictory goal, the statistical procedure is usually proposed and accepted without a suitably "informed consent." Furthermore, since "power" is a concept established by the β level chosen for sample-size calculations before the research is done, the observed results afterward should not receive a post hoc calculation of "power" [6,7]. When the research is completed, an investigator who wants to know how large d0 might have been can readily determine a suitable confidence interval without resorting to confusing ideas and calculations of "power."
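The 50% figure noted above can also be checked with a conventional approximate power calculation. The sketch below is our own illustration, using one common textbook approximation for the power of a two-proportion comparison at the alternative pA = pB + δ; it is not a formula given in the paper.

```python
from math import sqrt
from statistics import NormalDist

# With n taken from formula (3), the usual approximate power at the alternative
# pA = pB + delta is Phi((delta - Z_alpha*SED_null)/SED_alt), which comes out near 0.5.
def power_at_delta(p_b, delta, z_alpha=1.96):
    p_a = p_b + delta
    p_bar = (p_a + p_b) / 2
    n = 2 * p_bar * (1 - p_bar) * z_alpha**2 / delta**2            # formula (3)
    sed_null = sqrt(2 * p_bar * (1 - p_bar) / n)                   # formula (2), equal n
    sed_alt = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n)        # formula (7)
    return NormalDist().cdf((delta - z_alpha * sed_null) / sed_alt)

print(power_at_delta(0.268, 0.15))   # approximately 0.5, as stated above
```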
Problems in Premature Cessation of Trials

With the "single-significance" formula (3) for sample size, and with the demand that the result show d0 ≥ δ, a trial would require its full size to achieve "significance" if d0 = δ. The trial could be ended prematurely, with less than full size, only if d0 substantially exceeded δ. With the inflated sample size, however, "significance" can be achieved when d0 is smaller than δ; and as the accruing groups in the trial become larger, SED will be smaller, and "significance" can occur for progressively smaller values of d0. Consequently, to end a trial as soon as a "significant" result is achieved, the investigators must intermittently "peek" at the data to check for "significance." The previous "peeks" then produce the challenge of suitably adjusting the α level for "multiple comparisons." Although diverse methods have been proposed for this adjustment [8], the problem itself would essentially vanish if the sample sizes were not inflated, if the investigators insisted that d0 must equal or exceed δ for a trial to be stopped early, and if the intermittent "peeks" were confined to examining only the value of d0, avoiding any tests of "significance" unless d0 >> δ.

Inadequacy of a Two-Zone Decision Space

Perhaps the most striking intellectual problem of the "double-significance" formulation is the idea that a single boundary of δ can be used to separate small from big. This two-zone decision space is not compatible with the customary three zones [9] used for such clinical decisions as too low, normal, and too high; or negative, equivocal, and positive; or absent, uncertain, and present. If δ is set at .15 for the minimum magnitude of an impressive difference in two success rates, a two-zone decision space will make such differences as .13, .08, and .03 all be regarded as unimpressive. With two-zone demarcations, if diastolic hypertension is ≥90 mm Hg, anyone below that level is hypotensive; and if a fasting blood sugar level of 120 is hyperglycemic, any lower values are hypoglycemic.

According to the two-zone scheme of reasoning, an investigator who deliberately wants to show that two treatments produce essentially similar results need not demarcate a separate boundary, such as ζ, for the maximum magnitude of a small or trivial difference [10]. Instead, the claim of similarity or equivalence can be made merely if the difference is anything less than the "big" value of δ. This two-zone approach might be satisfactory for theoretical mathematics, but is contrary to the realities of science and life [9]. Furthermore, the absence of a boundary to demarcate ζ and a third zone of small differences leads to major difficulties when clinical trials are done to demonstrate equivalence rather than efficacy. Those difficulties, however, are beyond the scope of the current discussion.

ALTERNATIVE APPROACH

A simple alternative approach is readily available to solve the cited problems. After realizing that research is done to investigate a scientific hypothesis, investigators can give priority to the scientific goals and can then abandon a doctrine that gives prime attention only to statistical hypotheses. If the scientific goal is to show efficacy as a big difference (i.e., d0 ≥ δ), the sample size will have adequate capacity if calculated with the relatively simple formula (3). If the trial then shows that d0 < δ, with a P-value that exceeds α, a simple confidence interval will demonstrate how large the "non-significant" difference might have been. Conversely, if the scientific goal is to demonstrate equivalence (i.e., a really small difference) between the treatments, the dual-hypothesis strategy is fundamentally flawed because the relatively large value of δ does not demarcate a small difference, which would require an additional boundary value, such as ζ.

For clinical trials aimed at showing efficacy, therefore, a capacious sample can be determined without the need for invoking "power," "alternative hypotheses," and "double-significance." The key issue is then the choice of δ. As noted earlier, it is often made much larger than necessary because δ² appears in the denominators of both formulas (3) and (11). Because small values of δ will lead to larger sample sizes, investigators and statisticians who want to get a "feasible" size will usually give δ an unrealistically large boundary, such as the δ = .15 level used throughout the discussion here. For both "truth in reporting" and scientific reality in
doing a trial, however, the investigators can choose a reasonable value for δ beforehand, and then maintain it afterward. It should be chosen and preserved as the smallest level of d0 that will still be regarded afterward as an impressive difference. Thus, if d0 = .09 or .07 would still be impressive but d0 = .06 is not, δ should be lowered to .07.

The lowered δ may then lead to an increased sample size when calculations are done only for "single-significance" capacity. For example, if δ is lowered from .15 to .10 in the "single-significance" calculation with formula (3), the sample size will be increased by a factor of about (.15/.10)² = 2.25. The larger value will often still be smaller, however, than what might have been required at δ = .15 with the corresponding "double-significance" formula.

Although we do not want to get into the controversy about one- versus two-tailed tests, or into new arguments about ethics in clinical trials, it should be noted that most trials are done to support, not merely test, a scientific hypothesis. The hypothesis almost always takes a distinct direction in favor of one of the treatments, but the plan can ethically be regarded as having "equipoise" because the hypothesis has not been suitably confirmed. If this directional goal is appropriately acknowledged before and after the trial, however, sample sizes might be further reduced with use of a one-tailed Zα.

Although sample sizes will almost always be smaller with a "single" rather than "double-significance" formulation, the main scientific advantages of the new approach (which is really a return to the older, simpler method) will be a statistical strategy that corresponds to the investigative goals, the maintenance of realistic "honest" boundaries for δ, and a sample size that can stochastically confirm the goal if it is reached.

CONCLUSIONS

Despite the persistent loyalty of clinical trialists, the Neyman-Pearson principles of "power" have been rejected by some prominent leaders in the statistical pantheon. Sir Ronald A. Fisher [11] said the principles come from "an unrealistic formalism" and "are liable to mislead those who follow them into much wasted effort." Maurice Kendall and Alan Stuart [12] declared that "we cannot obtain an 'optimum' combination of α, β and n for any given problem." W. E. Deming [13] contended that "there is no such thing as the power of a test in an analytic problem, despite all the pages covered by the mathematics of testing hypotheses." Egon Pearson [14] himself, about 30 years after his original work with Neyman, did not overtly recant the basic strategy, but said it should be regarded as part of the "historical process of development of thought." He confessed that "the emphasis which we gave to certain types of situations may now seem out of balance." Despite these dissents and caveats, however, the Neyman-Pearson ideas have become thoroughly entrenched. The approach is now routinely used
and ubiquitously advocated not just for clinical trials, but for almost any advance calculations of sample size.

Can modern investigators and statisticians escape from the scientific misconceptions and pragmatic problems of the "double-significance" strategy? Perhaps the most authoritative encouragement to do so was offered, almost 50 years after the original proposal, by one of the original proponents. According to Egon Pearson [15], "Hitherto the user has been accustomed to accept the function of probability laid down by the mathematicians, but it would be good if he could take a larger share in formulating himself what are the practical requirements that the theory should satisfy in application." Pearson might also have pointed out that he and Neyman neither intended nor urged that their mathematical theories be converted into rigid dogma. The original authors are not responsible for what has been done with their theoretical discussion of testing two statistical hypotheses, contemplating "power," and drawing true and false stochastic conclusions. Without any disrespect for either Jerzy Neyman or Egon Pearson, the dogma can now be altered after thoughtful attention to the "practical requirements."

APPENDIX A

The "double-significance" formula (11) is often slightly modified for some mathematical niceties in the statistical reasoning.
Alternative SED

Under the alternative (rather than null) statistical hypothesis, the standard error of the difference in two compared groups is calculated with formula (6) or formula (7), not with formula (2). When the alternative SED approach is used, formula (11) becomes converted to

n ≥ [Zα√(2P̂Q̂) + Zβ√(p̂Aq̂A + p̂Bq̂B)]²/δ².    (12)
Formulas (11) and (12) are regularly cited in discussions devoted to calculating sample size according to the commonly used "double-significance" principles. Since SED will usually have similar values with either method of calculation, the two formulas will yield quite close results.
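As an illustration of how close the two versions usually are, the sketch below computes both for the δ = .15 example used earlier; the function names are our own.

```python
from math import sqrt

# Sketches of formula (11) and of the alternative-SED version, formula (12).
def n_formula_11(p_a, p_b, delta, z_alpha=1.96, z_beta=1.28):
    p_bar = (p_a + p_b) / 2
    return 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta)**2 / delta**2

def n_formula_12(p_a, p_b, delta, z_alpha=1.96, z_beta=1.28):
    p_bar = (p_a + p_b) / 2
    null_part = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    alt_part = z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    return (null_part + alt_part)**2 / delta**2

print(n_formula_11(0.418, 0.268, 0.15))   # about 210 per group
print(n_formula_12(0.418, 0.268, 0.15))   # about 208 per group: quite close
```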
"Continuity Correction"

The values for proportions such as pA and pB come from discrete data, but the values of Zα and Zβ come from a continuous Gaussian curve. To account for the slight disparity between theory and reality, a "continuity correction" can be introduced, using a formula that is well described by Fleiss [16]. The net effect of the correction is a slight increase in the value calculated for n.
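For readers who want a sense of the magnitude of the effect, the sketch below uses the continuity-corrected sample-size formula commonly attributed to Fleiss, n' = (n/4)[1 + √(1 + 4/(nd))]², where d is the anticipated pA − pB; this particular algebraic form is our addition from the standard textbook treatment, not a formula quoted in the paper itself.

```python
from math import sqrt

# Continuity-corrected sample size, in the form commonly attributed to Fleiss [16];
# n is an uncorrected per-group size and d the anticipated difference pA - pB.
def continuity_corrected_n(n, d):
    return (n / 4) * (1 + sqrt(1 + 4 / (n * d)))**2

print(continuity_corrected_n(210.3, 0.15))   # about 223: a modest increase over 210
```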
References
1. Freiman JA, Chalmers TC, Smith H Jr, et al. The importance of beta, the Type II error and sample size in the design and interpretation of the randomized control trial: Survey of 71 "negative" trials. N Engl J Med 1978; 299: 690–694.
2. Neyman J, Pearson ES. On the use and interpretation of certain test criteria for the purposes of statistical inference. Biometrika 1928; 20: 175–240.
3. Neyman J, Pearson ES. On the problem of the most efficient tests of statistical hypotheses. Phil Trans A 1933; 231: 289–337.
4. Feinstein AR. Clinical biostatistics: XXXIV. The other side of "statistical significance": Alpha, beta, delta, and the calculation of sample size. Clin Pharmacol Ther 1975; 18: 492–505.
5. Fischer MA, Feinstein AR. Excess sample size in randomized controlled trials. J Invest Med 1997; 45: 198A.
6. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994; 121: 200–206.
7. Smith AH, Bates MN. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology 1992; 3: 449–452.
8. O'Brien PC, Shampo MA. Statistical considerations for performing multiple tests in a single experiment. 6. Testing accumulating data repeatedly over time. Mayo Clin Proc 1988; 63: 1245–1249.
9. Feinstein AR. The inadequacy of binary models for the clinical reality of three-zone diagnostic decisions. J Clin Epidemiol 1990; 43: 109–113.
10. Feinstein AR. Zeta and delta: Critical descriptive boundaries in statistical analysis. J Clin Epidemiol. (In press)
11. Fisher RA. Statistical Methods and Scientific Inference, 2nd Edition. London, UK: Oliver and Boyd; 1959.
12. Kendall MG, Stuart A. The Advanced Theory of Statistics. Vol. 2. Inference and Relationship, 3rd Edition. London, UK: C. Griffin & Co., Ltd.; 1973: 190–191.
13. Deming WE. Review article. The New York Statistician 1972; 25: 1–2.
14. Pearson ES. Some thoughts on statistical inference. Ann Math Statist 1962; 33: 394–403.
15. Pearson ES. In: Edwards AWF. Likelihood. Cambridge: Cambridge University Press; 1976: v.
16. Fleiss JL. Statistical Methods for Rates and Proportions, 2nd Edition. New York, NY: John Wiley & Sons; 1981.