
Psychiatry Research, 23, 295-299
Elsevier

The Power of Analysis: Statistical Perspectives. Part 1

Ann E. Pulver, John J. Bartko, and John A. McGrath

Received March 26, 1987; revised version received July 23, 1987; accepted July 30, 1987.

Abstract. Failure to consider statistical power when reporting apparently "negative" results prevents accurate interpretation of those results. A nonsignificant result can be obtained when one includes an insufficient number of subjects to permit observation of a true effect (low power to detect an effect), or when one has an adequate number of subjects but no meaningful effect exists (high power, no effect); one can also have a situation of low power and no real effect. Without considering power, one is unable to distinguish a "negative" experiment from an inadequate one. This article examines 154 published nonsignificant t-test results. When power is calculated with an effect size equal to a standardized difference of unity, over 50% of the tests have inadequate power.

Key Words. Statistical methods, power, Type II error.

Failure to find a statistically significant difference does not imply that one can accept a null hypothesis. The null hypothesis states that the statistics being compared (e.g., two means) come from the same population and that, given random sampling, any difference between the statistics is due to chance. For the instance of mean comparisons, such a hypothesis can be expressed as μ1 − μ2 = 0, or μ1 = μ2. Alternatively, the statistics could come from different populations, and their difference would result from this fact: μ1 − μ2 ≠ 0, or μ1 ≠ μ2. Hypothesis testing, in an attempt to rule out the null hypothesis, is subject to two types of errors: Type I error occurs when a true null hypothesis is declared false (one concludes a difference when no real difference exists); Type II error results from failure to declare a null hypothesis false when it is in fact false (one misses a real difference). Unlike the probability of a Type I error, which is set by choosing the α level for the statistical test, the probability of a Type II error (β) is not a single value. While the probability of a Type I error is based on the assumption of the null hypothesis, the probability of a Type II error is calculated under the alternative assumption that a difference does exist in the population parameters (e.g., μ1 − μ2 ≠ 0); since this difference can take on an infinite number of nonzero values, an infinite range of probabilities of Type II error can exist. However, one can calculate an exact probability of Type II error by assuming a specified alternative hypothesis (e.g., μ1 − μ2 = 10). Since the probability of missing a true difference or effect (Type II error) is equal to β, the probability of noting a true difference is equal to 1 − β. This probability is the power of the statistical test. When the probability of Type II error is low, power is high, and vice versa.

Psychiatric journals seem to pay most attention to Type I errors: significance or α levels are usually specified, while power is often ignored, even when sample sizes are small and variances are large, conditions that maximize the probability of Type II errors. Large variance in many of the measures used in psychiatry may be expected in part because diagnostic schemes do not provide homogeneous groups of patients. All other things being equal, one is in a better position to uncover an existing difference or effect (i.e., one has high power) when the effect is large rather than small, or when the number of subjects examined is large enough to overcome the noise introduced by chance variation (e.g., unknown heterogeneity within the groups being contrasted). It has been pointed out that there is frequent neglect of power considerations in clinical trials (Freiman et al., 1978) and behavioral science research (Cohen, 1977; Buchsbaum and Rieder, 1979; Tachibana, 1980; Rothpearl et al., 1981; Cohen, 1982).

In this article (Part 1 of a 2-part series), the results of a study are reported in which we examined the level of power for a series of analyses in which the investigators failed to reject the null hypothesis. For simplicity, we chose to examine the results of analyses from one statistical test, and selected the unpaired Student's t test because of the frequency with which it is used.

Ann E. Pulver, Sc.D., is Associate Professor of Psychiatry (Epidemiology), University of Maryland School of Medicine, Maryland Psychiatric Research Center, Baltimore. John J. Bartko, Ph.D., is a member of the Theoretical Statistics and Mathematics Branch, Division of Biometry and Applied Sciences, National Institute of Mental Health, Bethesda, MD. John A. McGrath, M.A., is Research Associate, Department of Psychiatry, University of Maryland School of Medicine, Maryland Psychiatric Research Center. (Reprint requests to Dr. A.E. Pulver, Maryland Psychiatric Research Center, P.O. Box 21247, Baltimore, MD 21228, USA.)

0165-1781/88/$03.50 © 1988 Elsevier Scientific Publishers Ireland Ltd.
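As a rough illustration of these definitions (not part of the original article), the sketch below simulates repeated two-group experiments. The function names, the choice of 10 subjects per group, and the simulation itself are ours; the critical value 2.101 is the two-tailed 5% point of the t distribution with df = 18, matching two groups of 10.

```python
import random
import statistics

def two_sample_t(x, y):
    """Pooled-variance (unpaired) two-sample t statistic."""
    n1, n2 = len(x), len(y)
    s2 = ((n1 - 1) * statistics.variance(x)
          + (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (s2 * (1/n1 + 1/n2)) ** 0.5

def simulated_power(delta, n_per_group, reps=5000, t_crit=2.101, seed=1):
    """Fraction of simulated experiments that reject H0 at alpha = 0.05.

    t_crit = 2.101 is the two-tailed critical value for df = 18
    (two groups of 10); delta is the true mean difference in SD units.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(delta, 1) for _ in range(n_per_group)]
        y = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(two_sample_t(x, y)) > t_crit:
            hits += 1
    return hits / reps

# With a true effect of one standard deviation and only 10 subjects per
# group, a real difference is missed (Type II error) roughly 4 times in 10.
print(simulated_power(delta=1.0, n_per_group=10))
```

With delta = 0 the rejection rate stays near the α level of 0.05 (Type I error); with delta = 1 it is the power, well below conventional targets at this sample size.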
The data were not evaluated to determine whether the assumptions underlying the t test were met. These statistical assumptions are (1) normality (the underlying populations are assumed to have a Gaussian bell-shaped distribution); (2) independence (responses are not influenced by other responses); and (3) homogeneity of variances (the underlying normal populations are assumed to have the same variance). In addition, we did not attempt to determine whether the studies reported were of an exploratory or hypothesis-testing nature. We also did not address the Bonferroni adjustment (Fleiss, 1986) for inflated Type I error.

Methods

Criteria. The articles in all issues of Volumes 14, 15, and 16 of Psychiatry Research (108 articles) were searched to identify analyses that met the following criteria: (1) the analysis was reported in the Results section; (2) the unpaired Student's t test was the test statistic; and (3) the result of the test was reported to be nonsignificant (p > 0.05).

Sample. One hundred and sixty-three test procedures in 23 articles met the above criteria.

Analysis. For each test procedure, the following data were abstracted: (1) the number of observations in each group (n1, n2); (2) the mean value reported for each group (x̄1, x̄2); and (3) the standard deviations for each group (s1, s2).

We calculated the following for each test procedure:

(1) The observed mean difference (x̄1 − x̄2).

(2) The standardized difference (or effect size). The actual powers associated with the nonsignificant t tests would be low, owing to small numbers of subjects, small mean differences, or both. We recalculated power for each t test by inflating the observed mean difference to a standard level, but did not adjust the actual sample size for each test. This allowed us to observe whether, given a "true effect" as represented by our artificially inflated mean difference, researchers would have had sufficient subjects to realize an acceptable level of power (e.g., 80%) for detecting such an effect. We imposed a common effect size across all studies by setting the mean difference equal to a standardized difference of 1. The standardized difference is equal to the difference between the two means divided by the pooled standard deviation:

    effect size = (x̄1 − x̄2) / pooled SD

where

    pooled SD = √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)]

and

    s² = Σ(x − x̄)² / (n − 1)

(3) Student's t.

(4) The probability (p) value associated with the t test.

(5) Power, setting the standardized difference equal to unity and α = 0.05, two-tailed. (In 158 of the 163 statistical tests examined, the standardized difference of unity is greater than the actual standardized difference.) The selection of a standardized difference of unity provides a crude benchmark for a clinically meaningful difference between means (i.e., one standard deviation), although it is not necessarily meaningful in all cases. These power estimates can then be examined across the 163 tests, since they vary as a function of sample size alone rather than of sample size and effect size.
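The recalculation described above can be sketched in a few lines. This is our own minimal sketch, not the authors' code: the exact power of a t test involves the noncentral t distribution, and the stdlib-only function below substitutes the standard large-sample normal approximation.

```python
from statistics import NormalDist

def pooled_sd(n1, s1, n2, s2):
    """Pooled standard deviation from the two group sizes and SDs."""
    return (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5

def approx_power(d, n1, n2, alpha=0.05):
    """Approximate two-tailed power of an unpaired t test for a
    standardized difference d, via the normal approximation.
    (The exact value uses the noncentral t distribution and is
    somewhat lower for small samples.)"""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = d * (n1 * n2 / (n1 + n2)) ** 0.5   # noncentrality parameter
    return 1 - z.cdf(z_crit - ncp) + z.cdf(-z_crit - ncp)

# Example with made-up summary statistics: two groups of 10
sd = pooled_sd(10, 2.0, 10, 3.0)
# Approximate power to detect a one-SD difference with n = 10 per group
print(round(approx_power(1.0, 10, 10), 2))   # ≈ 0.61 (exact noncentral-t
                                             # value is lower, about 0.56)
```

Because the standardized difference is fixed at 1, power here depends only on the two sample sizes, which is what allows comparison across studies.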

Results

t-Test Calculations. Evaluation of the data revealed that 9 of the 163 tests reported by the authors to be statistically nonsignificant (p > 0.05) were in fact statistically significant (p < 0.05). In one of these tests, the authors reported a t of 1.32 when in fact the t was 7.32, perhaps a seven having been misread as a one. Possible reasons for the errors in the other eight tests were not apparent.

Power. Power associated with each statistical test, assuming an effect size of one standardized difference and α = 0.05, is reported in Table 1. Approximately 15% of the analyses had power < 0.50, and 56% of the analyses had power < 0.80.

Discussion

Power analysis performed after an experiment enables one to draw conclusions from the results of the statistical analysis. If a test statistic is not significant and the experiment has low power (e.g., power < 0.50), the experiment was relatively insensitive and the evidence in favor of the null hypothesis is not strong. However, if the test statistic is not significant and the experiment has a high level of power (e.g., power > 0.80), the experiment was relatively sensitive and the evidence in favor of the null hypothesis is strong. An experiment with high power provides strong support for the decision not to reject the null hypothesis, while an experiment with low power provides little support for either the null or the alternative hypothesis.
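A planning-stage counterpart to this post hoc analysis is to ask how many subjects per group an adequately powered study would need. The sketch below is ours, not the article's; it uses the standard normal-approximation formula n = 2((z_α/2 + z_power)/d)², which underestimates the exact noncentral-t answer by a subject or two.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate subjects needed per group for a two-tailed unpaired
    t test to detect a standardized difference d, via the
    normal-approximation formula n = 2 * ((z_a + z_b) / d) ** 2."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # critical value, two-tailed alpha
    z_b = z.inv_cdf(power)           # quantile for the target power
    return ceil(2 * ((z_a + z_b) / d) ** 2)

# A one-SD difference needs roughly 16-17 subjects per group for 80% power;
# halving the effect size roughly quadruples the requirement.
print(n_per_group(1.0), n_per_group(0.5))
```

The inverse-square dependence on d is why the small-sample studies tabulated above fare so poorly: many had fewer subjects per group than even a one-SD effect requires.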

Table 1. Distribution of power associated with the t tests, assuming α = 0.05 and an effect size equal to a standardized difference of unity

    Power        Number of tests    % of tests    Cumulative % of tests
    0.00-0.09           0               0.00              0.00
    0.10-0.19           0               0.00              0.00
    0.20-0.29           3               1.95              1.95
    0.30-0.39           8               5.19              7.14
    0.40-0.49          12               7.79             14.94
    0.50-0.59          18              11.69             26.62
    0.60-0.69          23              14.94             41.56
    0.70-0.79          23              14.94             56.49
    0.80-0.89          26              16.88             73.38
    0.90-1.00          41              26.63            100.00

The power calculation using the standardized difference of unity allows us to ask whether the investigators had sufficient sample size to allow the detection of a difference between the two means equal to one standard deviation. Although this difference may not be clinically meaningful for all the analyses, it does provide a crude substitute for a meaningful difference and can be used to look across the many studies. Fifteen percent of the analyses had power < 0.50, and 56% of the analyses had power < 0.80. The number of analyses that have inadequate power to detect the standard effect size, and the lack of any discussion about power, suggest that power is not routinely taken into consideration in the planning stages of studies or in the data analysis.

Our conclusions do not differ from those of Freiman et al. (1978), who calculated the probability of a Type II error (1 − power) for 71 negative randomized clinical trials identified from 20 different journals over the period 1960-1977. They found that of the 71 trials, only 20% had power ≥ 0.50 for detecting a 25% therapeutic improvement in the response; for a 50% therapeutic improvement, only 56% had power ≥ 0.50.

If power is important in influencing the conclusions that can be drawn from the results of a statistical analysis that failed to reject the null hypothesis, why isn't power routinely reported? Why do so many "negative" studies have inadequate power? One answer to both of these questions might be a lack of familiarity with power: its definition, how it is calculated, why it is calculated, and when it is calculated. We address these issues in Part 2.

Acknowledgment. This investigation was supported in part by grant #MH-35712-05.

References

Buchsbaum, M.S., and Rieder, R.O. Biologic heterogeneity and psychiatric research: Platelet MAO activity as a case study. Archives of General Psychiatry, 36, 1163 (1979).

Cohen, J. Statistical Power Analysis for the Behavioral Sciences. Rev. ed. Academic Press, New York (1977).

Cohen, P. To be or not to be: Control and balancing of Type I and Type II errors. Evaluation and Program Planning, 5, 247 (1982).

Fleiss, J.L. The Design and Analysis of Clinical Experiments. John Wiley & Sons, Inc., New York (1986).

Freiman, J.A., Chalmers, T.C., Smith, H., Jr., and Kuebler, R.R. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: Survey of 71 "negative" trials. New England Journal of Medicine, 299, 690 (1978).

Rothpearl, A.B., Mohs, R.C., and Davis, K.L. Statistical power in biological psychiatry. Psychiatry Research, 5, 157 (1981).

Tachibana, T. Persistent erroneous interpretation of negative data and assessment of statistical power. Perceptual and Motor Skills, 51, 37 (1980).
