# Statistical Power Analysis


L V Hedges and C Rhoads, Northwestern University, Evanston, IL, USA

© 2010 Elsevier Ltd. All rights reserved.

## Glossary

Intraclass correlation – In two-stage samples that first sample intact groups (statistical clusters) and then sample individuals within the groups, the intraclass correlation is the ratio of between-group variance to total variance. It is used to describe the degree of clustering within samples.

Multilevel (multistage) sample – A sample that first obtains a sample of intact groups (statistical clusters) and then samples individuals within the groups. Statistics based on multilevel samples (also called multistage samples) have different properties than those based on simple random (one-level) samples.

A critical aspect of planning any research design is ensuring that it is well enough designed to provide definitive evidence for the phenomenon under investigation. Power analysis is used to ensure that a study will have an appropriate chance of yielding a statistically significant result, given the treatment effect that is expected. Power analyses are required as part of proposals for funding research, and it would be almost unthinkable to embark on a large-scale study without conducting a power analysis. This article is an introduction to power analysis for research designs that are most often used in education research.

## Statistical Hypothesis Testing

The dominant form of statistical inference in educational research involves hypothesis testing. A null hypothesis that is inconsistent with the research (or alternative) hypothesis of the investigation is formulated. If the observed data would have only a very small chance of occurring if the null hypothesis were true, then the null hypothesis is rejected in favor of the research hypothesis. Formal hypothesis testing usually requires the explicit specification of how unlikely the test statistic must be to lead to rejection of the null hypothesis. This is typically a small number such as 5% (0.05) or 1% (0.01). This small number is called the significance level (often denoted by the Greek letter $\alpha$). Statistical power is the probability of making the correct decision to reject the null hypothesis when it is false. The statistical power of a research design represents the chance that, if the researcher's substantive hypothesis is correct, the research design will lead to the correct conclusion that the research hypothesis is true.

Statistical power always depends on the research design, the statistical significance level ($\alpha$), the effect size, and the sample size. The power of the design depends on how that design is organized (e.g., if covariates are used, how the sampling and/or randomization is carried out). All other things being equal, a smaller significance level reduces power. All other things being equal, a design with a larger sample size will have higher statistical power. Finally, statistical power depends on the effect size, which is a quantification of how false the null hypothesis is. All other things being equal, the larger the effect size, the higher the power. In more complex designs, there are also additional factors that affect power.

Only the type of research design, the significance level, and the sample size can, in principle, be changed by the investigator. However, there are often practical constraints on the research designs that can be used. Moreover, a strong scientific convention makes it virtually impossible to utilize significance levels larger than 0.05. The effect size is determined by the phenomenon under investigation, so it cannot be changed. This implies that sample size is usually the only factor that can be easily changed to increase statistical power.

## Statistical Power Analysis


Statistical power analysis involves deciding on a specific type of design and a statistical significance level (frequently $\alpha = 0.05$) and then investigating the three interrelated factors of sample size, effect size, and power. Power analysis might start with a particular sample size and an effect size (often the smallest effect deemed by the investigator to be of scientific interest) and then compute the statistical power. If the statistical power is unacceptably low, the investigator may choose to increase the sample size or change the research design in fundamental ways (such as introducing covariates) to obtain higher power. Alternatively, power analysis might start with a desired power (e.g., 0.80) and an effect size, and compute the smallest sample size that yields the specified power. Or, power analysis could start with the desired power and the sample size and then compute the smallest effect size that would yield the specified power, called the minimum detectable effect size.

Retrospective power analyses are sometimes conducted after a study has been completed to better interpret a finding of no statistically significant effect. The question is whether the study had sufficient power to detect the smallest effect size that is scientifically important. If not, failure to reject the null hypothesis does not provide strong confirmation that the effects are too small to be scientifically important.

## Procedures for Power Analysis

Regardless of the type of research design involved, the procedure for carrying out statistical power analyses is the same, utilizing tables that relate power to significance level, sample size, and effect size. Different designs have different natural effect size measures that will be used in power analysis. The exact procedure depends on the specific goal of the power analysis, but generally the specific design and a significance level (usually $\alpha = 0.05$) are chosen first. We illustrate the procedure using Table 1, a table of power values obtained using the noncentral t-distribution that is appropriate for several designs.

### Computing power for fixed sample size and effect size

Finding the power associated with a given sample size and effect size involves entering the table at the row for the appropriate sample size and then moving to the column of that row for the appropriate effect size and reading the entry, which is the power value. This procedure is the same regardless of whether it is conducted prospectively or retrospectively.

### Computing sample size for fixed power and effect size

Determining the sample size that will result in a fixed power (often 0.80) for a fixed effect size can also be accomplished with a table such as Table 1. First, enter the table in the column corresponding to the appropriate effect size. Then, read down the column until the power value for a row is as large as desired or larger. The sample size corresponding to that row is the desired sample size. Some sources (such as Cohen, 1988) make the process easier by providing different tables giving the minimum sample sizes necessary to obtain a specified power (usually 80%) for various effect sizes.

### Computing the minimum detectable effect size

Using a table such as Table 1, enter the table on the row corresponding to the appropriate sample size. Then, move across the row until the power is at least the desired value (e.g., 0.80). The effect size heading that column will be approximately the minimum detectable effect size. The value is approximate because the desired power value may not occur exactly in that row for any tabulated effect size, so that the desired power occurs for an effect size that is in between those defining columns of the table. A more accurate value of the minimum detectable effect size can be obtained by interpolation.

## Designs Comparing Independent Groups Using Simple Random Samples

Perhaps the most common research design compares the outcomes in independent groups that differ in a specific way (e.g., receive different treatments). Experiments with completely randomized designs and quasi-experiments using the nonequivalent control group design are examples of these designs. In this section, we consider only experiments that use simple random samples (e.g., subjects all in one school or sampled independently from several schools) and not experiments with more complicated sampling designs (such as those that assign whole schools to the same treatment). Some of these more complicated designs will be considered later.

### Effect size

The statistical analyses of these designs may involve the analysis of variance to test a null hypothesis that several groups have identical means. However, such designs are almost always planned by examining the power of the contrast between a particular pair of groups, such as a control group with one of the treatment groups. As a consequence, the most relevant power analysis for planning the design would be the power of that contrast. The effect size for comparing two groups is Cohen's $d$, defined by

$$d = (\mu_1 - \mu_2)/\sigma$$

where $\mu_1$ and $\mu_2$ are the population means and $\sigma$ is the within-treatment-group standard deviation in the experiment.

### How should an effect size be chosen for power analysis?

Choosing the smallest effect size that is scientifically important is a matter of judgment and experience. The effect size $d$ is intuitive to many educational researchers and is widely used in meta-analysis. Consequently, there are many sources of data on $d$ values actually obtained in educational research. Compendia of effect sizes in meta-analyses by Lipsey and Wilson (1993) and other compendia by Bloom et al. (2007) may be particularly useful. A somewhat outdated and potentially misleading guideline was offered by Cohen (1988): an effect size of $d = 0.20$ is small, $d = 0.50$ is medium, and $d = 0.80$ is large. These guidelines should be used only if there is no other information that could be used to judge the smallest effect size that is likely to be scientifically important.

Table 1. Power of the two-sided test for treatment effects as a function of operational sample size $N_0$ and operational effect size $d_0$ for the $\alpha = 0.05$ level of significance. Rows give $N_0$; columns give $d_0$.

| $N_0$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 | 1.6 | 1.7 | 1.8 | 1.9 | 2.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.05 | 0.05 | 0.05 | 0.06 | 0.06 | 0.07 | 0.07 | 0.08 | 0.09 | 0.10 | 0.10 | 0.11 | 0.13 | 0.14 | 0.15 | 0.16 | 0.17 | 0.19 | 0.20 | 0.22 |
| 3 | 0.05 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.10 | 0.12 | 0.14 | 0.16 | 0.18 | 0.21 | 0.23 | 0.26 | 0.29 | 0.33 | 0.36 | 0.39 | 0.43 | 0.46 |
| 4 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | 0.11 | 0.13 | 0.16 | 0.19 | 0.22 | 0.26 | 0.30 | 0.34 | 0.38 | 0.43 | 0.48 | 0.52 | 0.57 | 0.61 | 0.66 |
| 5 | 0.05 | 0.06 | 0.07 | 0.09 | 0.11 | 0.13 | 0.16 | 0.20 | 0.24 | 0.29 | 0.33 | 0.39 | 0.44 | 0.49 | 0.55 | 0.60 | 0.65 | 0.70 | 0.75 | 0.79 |
| 6 | 0.05 | 0.06 | 0.08 | 0.10 | 0.12 | 0.16 | 0.20 | 0.24 | 0.29 | 0.35 | 0.41 | 0.47 | 0.53 | 0.59 | 0.65 | 0.71 | 0.76 | 0.80 | 0.84 | 0.88 |
| 7 | 0.05 | 0.06 | 0.08 | 0.11 | 0.14 | 0.18 | 0.23 | 0.28 | 0.34 | 0.41 | 0.47 | 0.54 | 0.61 | 0.67 | 0.73 | 0.78 | 0.83 | 0.87 | 0.90 | 0.93 |
| 8 | 0.05 | 0.07 | 0.09 | 0.12 | 0.15 | 0.20 | 0.26 | 0.32 | 0.39 | 0.46 | 0.54 | 0.61 | 0.68 | 0.74 | 0.80 | 0.84 | 0.88 | 0.92 | 0.94 | 0.96 |
| 9 | 0.05 | 0.07 | 0.09 | 0.13 | 0.17 | 0.22 | 0.29 | 0.36 | 0.43 | 0.51 | 0.59 | 0.67 | 0.74 | 0.80 | 0.85 | 0.89 | 0.92 | 0.95 | 0.97 | 0.98 |
| 10 | 0.06 | 0.07 | 0.10 | 0.14 | 0.19 | 0.25 | 0.32 | 0.40 | 0.48 | 0.56 | 0.64 | 0.72 | 0.78 | 0.84 | 0.89 | 0.92 | 0.95 | 0.97 | 0.98 | 0.99 |
| 11 | 0.06 | 0.07 | 0.10 | 0.15 | 0.20 | 0.27 | 0.35 | 0.43 | 0.52 | 0.61 | 0.69 | 0.76 | 0.83 | 0.88 | 0.92 | 0.95 | 0.97 | 0.98 | 0.99 | 0.99 |
| 12 | 0.06 | 0.08 | 0.11 | 0.16 | 0.22 | 0.29 | 0.37 | 0.47 | 0.56 | 0.65 | 0.73 | 0.80 | 0.86 | 0.91 | 0.94 | 0.96 | 0.98 | 0.99 | 0.99 | 1.00 |
| 13 | 0.06 | 0.08 | 0.11 | 0.17 | 0.23 | 0.31 | 0.40 | 0.50 | 0.60 | 0.69 | 0.77 | 0.84 | 0.89 | 0.93 | 0.96 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 |
| 14 | 0.06 | 0.08 | 0.12 | 0.17 | 0.25 | 0.33 | 0.43 | 0.53 | 0.63 | 0.72 | 0.80 | 0.86 | 0.91 | 0.95 | 0.97 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 |
| 15 | 0.06 | 0.08 | 0.12 | 0.18 | 0.26 | 0.35 | 0.46 | 0.56 | 0.66 | 0.75 | 0.83 | 0.89 | 0.93 | 0.96 | 0.98 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 |
| 16 | 0.06 | 0.09 | 0.13 | 0.19 | 0.28 | 0.38 | 0.48 | 0.59 | 0.69 | 0.78 | 0.85 | 0.91 | 0.94 | 0.97 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
| 17 | 0.06 | 0.09 | 0.14 | 0.20 | 0.29 | 0.40 | 0.51 | 0.62 | 0.72 | 0.81 | 0.87 | 0.92 | 0.96 | 0.98 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 |
| 18 | 0.06 | 0.09 | 0.14 | 0.21 | 0.31 | 0.42 | 0.53 | 0.65 | 0.75 | 0.83 | 0.89 | 0.94 | 0.97 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 19 | 0.06 | 0.09 | 0.15 | 0.22 | 0.32 | 0.44 | 0.56 | 0.67 | 0.77 | 0.85 | 0.91 | 0.95 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 20 | 0.06 | 0.09 | 0.15 | 0.23 | 0.34 | 0.46 | 0.58 | 0.69 | 0.79 | 0.87 | 0.92 | 0.96 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 21 | 0.06 | 0.10 | 0.16 | 0.24 | 0.35 | 0.48 | 0.60 | 0.72 | 0.81 | 0.89 | 0.94 | 0.97 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 22 | 0.06 | 0.10 | 0.16 | 0.25 | 0.37 | 0.49 | 0.62 | 0.74 | 0.83 | 0.90 | 0.95 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 23 | 0.06 | 0.10 | 0.17 | 0.26 | 0.38 | 0.51 | 0.64 | 0.76 | 0.85 | 0.91 | 0.95 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 24 | 0.06 | 0.10 | 0.17 | 0.27 | 0.40 | 0.53 | 0.66 | 0.77 | 0.86 | 0.92 | 0.96 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 25 | 0.06 | 0.11 | 0.18 | 0.28 | 0.41 | 0.55 | 0.68 | 0.79 | 0.88 | 0.93 | 0.97 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 26 | 0.06 | 0.11 | 0.19 | 0.29 | 0.42 | 0.56 | 0.70 | 0.81 | 0.89 | 0.94 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 27 | 0.07 | 0.11 | 0.19 | 0.30 | 0.44 | 0.58 | 0.71 | 0.82 | 0.90 | 0.95 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 28 | 0.07 | 0.11 | 0.20 | 0.31 | 0.45 | 0.60 | 0.73 | 0.84 | 0.91 | 0.96 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 29 | 0.07 | 0.12 | 0.20 | 0.32 | 0.46 | 0.61 | 0.75 | 0.85 | 0.92 | 0.96 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 30 | 0.07 | 0.12 | 0.21 | 0.33 | 0.48 | 0.63 | 0.76 | 0.86 | 0.93 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 31 | 0.07 | 0.12 | 0.21 | 0.34 | 0.49 | 0.64 | 0.77 | 0.87 | 0.94 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 32 | 0.07 | 0.12 | 0.22 | 0.35 | 0.50 | 0.66 | 0.79 | 0.88 | 0.94 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 33 | 0.07 | 0.13 | 0.22 | 0.36 | 0.52 | 0.67 | 0.80 | 0.89 | 0.95 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 34 | 0.07 | 0.13 | 0.23 | 0.37 | 0.53 | 0.68 | 0.81 | 0.90 | 0.96 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 35 | 0.07 | 0.13 | 0.24 | 0.38 | 0.54 | 0.70 | 0.82 | 0.91 | 0.96 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 36 | 0.07 | 0.13 | 0.24 | 0.39 | 0.55 | 0.71 | 0.83 | 0.92 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 37 | 0.07 | 0.14 | 0.25 | 0.40 | 0.56 | 0.72 | 0.84 | 0.92 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 38 | 0.07 | 0.14 | 0.25 | 0.41 | 0.58 | 0.73 | 0.85 | 0.93 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 39 | 0.07 | 0.14 | 0.26 | 0.41 | 0.59 | 0.74 | 0.86 | 0.94 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 40 | 0.07 | 0.14 | 0.26 | 0.42 | 0.60 | 0.75 | 0.87 | 0.94 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 41 | 0.07 | 0.15 | 0.27 | 0.43 | 0.61 | 0.77 | 0.88 | 0.95 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 42 | 0.07 | 0.15 | 0.27 | 0.44 | 0.62 | 0.78 | 0.89 | 0.95 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 43 | 0.07 | 0.15 | 0.28 | 0.45 | 0.63 | 0.79 | 0.89 | 0.96 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 44 | 0.07 | 0.15 | 0.29 | 0.46 | 0.64 | 0.79 | 0.90 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 45 | 0.08 | 0.16 | 0.29 | 0.47 | 0.65 | 0.80 | 0.91 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 46 | 0.08 | 0.16 | 0.30 | 0.48 | 0.66 | 0.81 | 0.91 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 47 | 0.08 | 0.16 | 0.30 | 0.48 | 0.67 | 0.82 | 0.92 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 48 | 0.08 | 0.16 | 0.31 | 0.49 | 0.68 | 0.83 | 0.92 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 49 | 0.08 | 0.17 | 0.31 | 0.50 | 0.69 | 0.84 | 0.93 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 50 | 0.08 | 0.17 | 0.32 | 0.51 | 0.70 | 0.84 | 0.93 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 52 | 0.08 | 0.17 | 0.33 | 0.52 | 0.71 | 0.86 | 0.94 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 54 | 0.08 | 0.18 | 0.34 | 0.54 | 0.73 | 0.87 | 0.95 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 56 | 0.08 | 0.18 | 0.35 | 0.55 | 0.75 | 0.88 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 58 | 0.08 | 0.19 | 0.36 | 0.57 | 0.76 | 0.89 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 60 | 0.08 | 0.19 | 0.37 | 0.58 | 0.78 | 0.90 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 64 | 0.09 | 0.20 | 0.39 | 0.61 | 0.80 | 0.92 | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 68 | 0.09 | 0.21 | 0.41 | 0.64 | 0.82 | 0.93 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 72 | 0.09 | 0.22 | 0.43 | 0.66 | 0.85 | 0.95 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 76 | 0.09 | 0.23 | 0.45 | 0.69 | 0.86 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 80 | 0.10 | 0.24 | 0.47 | 0.71 | 0.88 | 0.96 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 84 | 0.10 | 0.25 | 0.49 | 0.73 | 0.90 | 0.97 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 88 | 0.10 | 0.26 | 0.51 | 0.75 | 0.91 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 92 | 0.10 | 0.27 | 0.53 | 0.77 | 0.92 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 96 | 0.11 | 0.28 | 0.54 | 0.79 | 0.93 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 100 | 0.11 | 0.29 | 0.56 | 0.80 | 0.94 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 120 | 0.12 | 0.34 | 0.64 | 0.87 | 0.97 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 140 | 0.13 | 0.39 | 0.71 | 0.92 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 160 | 0.14 | 0.43 | 0.76 | 0.95 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 180 | 0.16 | 0.47 | 0.81 | 0.97 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 200 | 0.17 | 0.51 | 0.85 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 250 | 0.20 | 0.61 | 0.92 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 300 | 0.23 | 0.69 | 0.96 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 350 | 0.26 | 0.75 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 400 | 0.29 | 0.81 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 450 | 0.32 | 0.85 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| 500 | 0.35 | 0.88 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

### Example of computing power using Table 1

Suppose that we are interested in determining the statistical power of an experiment that contrasted a treatment group with a control group with the $\alpha = 0.05$ level of significance, a sample size of $N = 30$ per group, and an effect size of $d = 0.30$. Entering Table 1 on the row for $N_0 = 30$ and moving across to the column for an effect size of $d_0 = 0.30$, we read the power value of 0.21.

How large would the sample size have to be in order to obtain a power of 0.80? To find out, we enter the table in the column for the effect size of 0.30 and move down the rows until we reach a power value of 0.80 or larger. Doing so yields a sample size of $N = 180$ per group.

Alternatively, we might ask what the minimum detectable effect size is for a power of 0.80 and a sample size of $N = 30$. Entering the table on the row for $N_0 = 30$ per group and moving across the columns of the table, we see that an effect size of $d_0 = 0.70$ yields a power of 0.76 and an effect size of $d_0 = 0.80$ yields a power of 0.86; thus, the minimum detectable effect size is between $d = 0.70$ and $d = 0.80$. Interpolating, we see that a power of 0.80 is 4/10 of the way between 0.76 and 0.86, so the minimum detectable effect size is approximately 4/10 of the way between $d = 0.70$ and $d = 0.80$, or $d = 0.74$.

## Designs Utilizing Covariates to Increase Power in Comparing Independent Groups Using Simple Random Samples

The use of covariates can dramatically increase precision in designs comparing independent groups. The covariate is often a pretest of some kind, but it could be any variable that is correlated with the outcome and is measured before treatment begins. In designs that involve covariates, power depends not only on significance level, sample size, and effect size, but also on the effectiveness of the covariate in explaining variance in the outcome variable. Nonetheless, the computations using covariates can be carried out using almost the same procedures as without covariates. The only difference is that instead of using the actual effect size and the sample size to directly compute power, a slightly modified version of each, called the operational effect size and operational sample size, is used.

### Operational effect size and operational sample size

Suppose that the actual effect size of interest is $d$, the actual sample size in each group is $N$, and the correlation (or multiple correlation if there is more than one covariate) between covariate(s) and the outcome is $R$. Then, the operational sample size per group is $N_0 = N - q/2$, where $q$ is the number of covariates used, and the operational effect size $d_0$ is

$$d_0 = \frac{d}{\sqrt{1 - R^2}}\left[\sqrt{\frac{2N}{2N - q}}\,\right] \qquad [1]$$

If the number of covariates is small (e.g., 1), it may be sufficiently accurate to use $N_0 = N - 1$ and

$$d_0 = \frac{d}{\sqrt{1 - R^2}} \qquad [2]$$

as the operational sample size and effect size, respectively. The actual effect size $d$ has the same scientific interpretation in this design as in the design without covariates and should be chosen in the same way. The operational effect size is simply a convenient device that allows us to compute statistical power from standard tables.

### Example of computing power using Table 1 in a design using covariates

Suppose that, as in the earlier example, we are interested in determining the statistical power of an experiment that contrasts a treatment group with a control group with the $\alpha = 0.05$ level of significance, a sample size of $N = 30$ per group, and an effect size of $d = 0.30$. Now, we introduce a covariate that is correlated ($R = 0.66$) with the outcome variable. The operational sample size is $N_0 = 30 - 1/2 = 29.5$ and the operational effect size is

$$d_0 = \frac{0.30}{\sqrt{1 - 0.66^2}}\left[\sqrt{\frac{2 \times 30}{(2 \times 30) - 1}}\,\right] = 0.403$$

Note that if we had ignored the last term in brackets, we would have computed

$$d_0 = \frac{0.30}{\sqrt{1 - 0.66^2}} = 0.399$$

which is the same to two decimal places (namely $d_0 = 0.40$). Because $N_0 = 29.5$ is between $N_0 = 29$ and $N_0 = 30$, we will have to interpolate between rows. Entering Table 1 on the row for $N_0 = 29$ and moving across to the column for an effect size of $d_0 = 0.40$, we read the power value of 0.32, and on the row for $N_0 = 30$ for an effect size of $d_0 = 0.40$, we read the power value of 0.33; thus, the power value is between 0.32 and 0.33, or about 0.325. Note that introducing the covariate has increased the power by approximately 50% (from 0.21 to 0.325) compared to the design with no covariate.

How large would the sample size have to be in order to obtain a power of 0.80? To find out, enter the table in the column for the (operational) effect size of 0.40 and move down the rows until we reach a power value of 0.80 or larger. Doing so yields an (operational) sample size of $N_0 = 100$ per group, or a true sample size of $N = 101$, which is much less than the $N = 180$ required without covariates.

Alternatively, we might ask what the minimum effect size that yields a power of 0.80 is for a sample size of $N = 30$ per group when the covariate is included (the minimum detectable effect size for a power of 0.80 and an operational sample size of $N_0 = 29$). Entering the table on the row for $N_0 = 29$ per group and moving across the columns of the table, we see that an effect size of $d_0 = 0.70$ yields a power of 0.75 and an effect size of $d_0 = 0.80$ yields a power of 0.85; thus, the minimum detectable operational effect size is between $d_0 = 0.70$ and $d_0 = 0.80$. Interpolating, we see that a power of 0.80 is 1/2 of the way between 0.75 and 0.85, so the minimum detectable effect size is approximately 1/2 of the way between 0.70 and 0.80, or about 0.75. Note, however, that this is the minimum detectable operational (not actual) effect size. We can translate this minimum detectable operational effect size into the minimum detectable actual effect size by inverting eqn [2] to obtain

$$d = d_0\sqrt{1 - R^2} = 0.75\sqrt{1 - 0.66^2} = 0.56$$

which is much smaller than the value of 0.74 obtained without covariates.
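The table lookups in the two worked examples can be cross-checked numerically. The following is a minimal sketch rather than the authors' procedure: it assumes that each entry of Table 1 is the power of a two-sided two-sample t test with $2N_0 - 2$ degrees of freedom and noncentrality parameter $d_0\sqrt{N_0/2}$, an assumption that is consistent with the tabulated values quoted in the text. All function names are hypothetical, and the noncentral t distribution is evaluated by simple numerical integration so that only the Python standard library is needed.

```python
import math
from functools import lru_cache


def _norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def nct_cdf(t, df, nc, steps=1000):
    """P(T <= t) for a noncentral t variable (df > 1, noncentrality nc).

    Writes T = (Z + nc) / S with S = sqrt(chi2_df / df), so that
    P(T <= t) = E[Phi(t * S - nc)], and integrates over S with Simpson's rule.
    """
    upper = 1.0 + 10.0 / math.sqrt(df)  # density of S is negligible beyond this
    h = upper / steps                   # steps must be even for Simpson's rule
    log_c = math.log(2.0) + 0.5 * df * math.log(0.5 * df) - math.lgamma(0.5 * df)

    def integrand(s):
        if s <= 0.0:
            return 0.0                  # density of S vanishes at 0 when df > 1
        log_pdf = log_c + (df - 1.0) * math.log(s) - 0.5 * df * s * s
        return math.exp(log_pdf) * _norm_cdf(t * s - nc)

    total = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0


@lru_cache(maxsize=None)
def t_crit(df, alpha):
    """Two-sided critical value of the central t distribution, by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, df, 0.0) < 1.0 - alpha / 2.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_two_sided(n_op, d_op, alpha=0.05):
    """Power for an operational sample size and operational effect size."""
    df = 2.0 * n_op - 2.0               # assumed degrees of freedom
    nc = d_op * math.sqrt(n_op / 2.0)   # assumed noncentrality parameter
    c = t_crit(df, alpha)
    return (1.0 - nct_cdf(c, df, nc)) + nct_cdf(-c, df, nc)


def operational_values(d, n, r, q):
    """Operational sample size and effect size with q covariates (eqn [1])."""
    n_op = n - q / 2.0
    d_op = d / math.sqrt(1.0 - r * r) * math.sqrt(2.0 * n / (2.0 * n - q))
    return n_op, d_op


def minimum_detectable_effect(n_op, target=0.80, alpha=0.05, hi=5.0):
    """Smallest operational effect size with power >= target, by bisection.

    Assumes the target power is reached somewhere below `hi`.
    """
    lo = 0.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if power_two_sided(n_op, mid, alpha) < target:
            lo = mid
        else:
            hi = mid
    return hi


# Design without a covariate: N = 30 per group, d = 0.30.
print(round(power_two_sided(30, 0.30), 3))

# Same design with one covariate correlated R = 0.66 with the outcome.
n_op, d_op = operational_values(0.30, 30, 0.66, 1)
print(n_op, round(d_op, 3), round(power_two_sided(n_op, d_op), 3))

# Minimum detectable effect size for N0 = 30 and power 0.80.
print(round(minimum_detectable_effect(30), 2))
```

Under these assumptions the first call should land close to the tabulated 0.21 and the covariate-adjusted power close to the interpolated 0.325. Because the search in `minimum_detectable_effect` is not restricted to tabulated columns, its result can differ slightly from an interpolated table value.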

## Designs Comparing Independent Groups Using Multilevel (Clustered) Samples

Many research designs in education compare the outcomes in independent groups but obtain their sample via a multilevel (multistage) sampling process. For example, many educational studies obtain the groups of participants by first obtaining schools and then assigning all of the students or teachers in the same school to the same treatment. Cluster-randomized (hierarchical) experiments and many quasi-experiments patterned after them are examples of designs in this category. The term 'cluster-randomized' refers to the concept of first sampling intact groups (statistical clusters), then sampling individuals within groups to obtain a two-stage (two-level) cluster sample, and then randomly assigning whole clusters to treatments. We discuss here only designs with two levels of sampling (e.g., schools assigned to treatments with students nested within schools).

The use of multilevel samples and cluster randomization can dramatically decrease statistical power in designs comparing independent groups as compared with designs wherein simple random sampling and individual randomization to treatment are used. However, these designs are commonly used either because they are much less costly than the alternatives or because it is politically infeasible to assign individuals within the same school to different treatments. They are also the only option for evaluating treatments that must be administered to the entire school because of theoretical considerations (e.g., whole-school reforms).

In designs that involve two-level sampling, the power depends not only on significance level, the sample size, and the effect size but also on the amount of clustering in the sample. The amount of clustering is measured by a quantity called the intraclass correlation, which measures the proportion of the total variance in the outcome that is between (as opposed to within) clusters (e.g., schools). Intraclass correlations range from 0 to 1, but values between 0.05 and 0.30 are most common in educational research. An additional complication is that power in these designs depends not only on the total sample size, but also on the number of clusters in each treatment group, $m$, and the number of individuals, $n$, within each cluster (assumed to be the same for each cluster).

Choosing a reasonable intraclass correlation value is important for the accuracy of the power computation, so it is desirable to choose on the basis of empirical evidence from populations similar to those in the experiment. One extensive set of intraclass correlation values from representative samples of the United States population was compiled by Hedges and Hedberg (2007).

Power computations for multilevel samples such as cluster-randomized experiments can be carried out using almost the same procedures as for experiments with single-level samples. The only difference is that, as in the case of single-level experiments with covariates, operational effect size and operational sample size are used.
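The verbal definition of the intraclass correlation can be written as a formula. In the expression below, the symbols $\sigma_B^2$ (between-cluster variance) and $\sigma_W^2$ (within-cluster variance) are notation assumed here for illustration and do not appear in the article; $\rho$ denotes the intraclass correlation.

$$\rho = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_W^2}$$

For instance, if the between-school variance were 15 and the within-school variance 85 on some outcome scale, the intraclass correlation would be $15/(15 + 85) = 0.15$.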

### Operational effect size and operational sample size

Suppose that the actual effect size is $d$, the actual number of clusters assigned to each treatment is $m$, the sample size in each cluster is $n$, and the intraclass correlation is $\rho$. Then, the operational sample size is $N_0 = m$. The operational effect size, $d_0$, is

$$d_0 = d\sqrt{\frac{n}{1 + (n - 1)\rho}} \qquad [3]$$

Note that unless $\rho = 1$, the operational effect size $d_0$ is always larger than the actual effect size $d$. However, the operational sample size, $N_0 = m$, is smaller than the actual sample size, $mn$, so the power in the design with clustering will always be smaller than in the design without clustering.
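To see how eqn [3] behaves as the cluster size changes, the formula can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical and the input values are chosen for demonstration. It also evaluates the limiting value $d/\sqrt{\rho}$, which follows from eqn [3] because $n/[1 + (n - 1)\rho]$ approaches $1/\rho$ as $n$ grows without bound.

```python
import math


def operational_effect_size(d, n, icc):
    """Operational effect size for a two-level design without covariates (eqn [3])."""
    return d * math.sqrt(n / (1.0 + (n - 1.0) * icc))


def limiting_effect_size(d, icc):
    """Value approached by the operational effect size as n grows without bound."""
    return d / math.sqrt(icc)


d, icc = 0.30, 0.15  # illustrative values, not estimates from data
for n in (5, 25, 100, 1000):
    print(n, round(operational_effect_size(d, n, icc), 3))
print("limit", round(limiting_effect_size(d, icc), 3))
```

The operational effect size rises quickly for small cluster sizes and then flattens, which is why adding individuals to existing clusters eventually contributes little to power.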

### Example of computing power in a multilevel design using Table 1

Suppose that we are interested in determining the statistical power of a multilevel experiment that contrasted two treatment groups using an $\alpha = 0.05$ level of significance, a sample size in each group of $m = 20$ schools, $n = 25$ students in each school, and an actual effect size of $d = 0.30$. Assume that the intraclass correlation is $\rho = 0.15$. The operational sample size is $N_0 = 20$ and the operational effect size is

$$d_0 = 0.30\sqrt{\frac{25}{1 + (25 - 1) \times 0.15}} = 0.700$$

Entering Table 1 on the row for the operational sample size $N_0 = 20$ and moving to the column for an operational effect size of $d_0 = 0.70$, we see that the power is 0.58. Note that the total sample size is $2 \times 20 \times 25 = 1000$, which is rather large, yet the power is not particularly large. A study using a simple random sample of this size would have had power in excess of 0.99.

How many clusters would be required to obtain a power of 0.80? To find out, enter the table in the column for the (operational) effect size of 0.70 and move down the rows until we reach a power value of 0.80 or larger. Doing so yields a sample size of $m = 33$ clusters (each with $n = 25$ individuals) per group.

We might have wondered how increasing the total sample size by increasing the number of individuals within each cluster (increasing $n$) and keeping the number of clusters fixed at $m = 20$ would affect the power. It can be shown that the largest the operational effect size can possibly be, even if $n$ is infinitely large, is $d_0 = d/\sqrt{\rho} = 0.30/\sqrt{0.15} \approx 0.77$, which (with $N_0 = m = 20$) corresponds to a power of about 0.66. This example illustrates that even an infinite sample size, if it is obtained by increasing the number of individuals within the clusters and not the number of clusters, may not increase the power to 0.80. Thus, increasing sample size by increasing the number of clusters is a much more effective way to increase power in multilevel designs than is increasing the number of individuals within the clusters.

## Designs Comparing Independent Groups with Multilevel Samples Using Covariates

The use of covariates can dramatically increase precision in designs comparing independent groups using multilevel samples. In designs with two-level sampling and covariates, the power depends not only on the significance level, sample size (the number $m$ of clusters and sample size $n$ within each cluster), effect size, and the amount of clustering in the sample (measured by the intraclass correlation $\rho$) but also on the effectiveness of the covariate(s) in explaining variance in the outcome variable at the cluster level and the individual (within-cluster) level. Typically, covariate effectiveness is measured by the multiple correlations $R_1$ and $R_2$ between the covariate(s) and the outcome at the individual and cluster levels, respectively. Power computations using covariates can be carried out using a procedure similar to what was used without covariates. The only difference is in how the operational effect size and operational sample size are defined.

### Operational effect size and operational sample size

Suppose that the actual effect size is $d$, the number of clusters assigned to each treatment is $m$, the sample size within each cluster is $n$, the intraclass correlation is $\rho$, there are $q_I$ covariates used at the individual level and $q_C$ covariates used at the cluster level, and the multiple correlations between the covariate(s) and outcome at the individual and cluster level are $R_1$ and $R_2$, respectively. Then, the operational sample size is $N_0 = m - q_C/2$. The operational effect size, $d_0$, is

$$d_0 = d\sqrt{\frac{n}{1 + (n - 1)\rho - R_1^2 - (nR_2^2 - R_1^2)\rho}}\left[\sqrt{\frac{2m}{2m - q_C}}\,\right] \qquad [4]$$

The operational effect size $d_0$ is always larger than the operational effect size that would have been computed if there were no covariates. If the number of covariates is small (e.g., 1), it may be sufficiently accurate to use $N_0 = m - 1$ and

$$d_0 = d\sqrt{\frac{n}{1 + (n - 1)\rho - R_1^2 - (nR_2^2 - R_1^2)\rho}} \qquad [5]$$

as the operational sample size and effect size, respectively. The covariate–outcome correlations at both the individual and (especially) the cluster level have important effects on the accuracy of the power computation, so they should be chosen on the basis of empirical evidence gathered from populations similar to those in the experiment. One extensive compilation of covariate–outcome correlations at the individual and school level from representative samples of the US population is Hedges and Hedberg (2007).

Example of computing power in a multilevel design with covariates using Table 1

Suppose that we are interested in determining the statistical power of a multilevel experiment that will contrast treatment groups using a significance level of $a = 0.05$, a sample size in each group of $m = 20$ schools with $n = 25$ students in each school, and an effect size of $d = 0.30$. Assume that the intraclass correlation is $r = 0.15$ and that the squared multiple correlations between the $q_I = q_C = 1$ covariate and the outcome are $R_1^2 = 0.60$ and $R_2^2 = 0.50$ at the individual and cluster levels, respectively. The operational sample size is approximately $N_0 = 20 - 1 = 19$ and the operational effect size is approximately

$$d_0 = 0.30\,\sqrt{\frac{25}{1 + (25 - 1)(0.15) - 0.60 - (25 \times 0.50 - 0.60)(0.15)}} = 1.008$$

Entering Table 1 on the row for the operational sample size $N_0 = 19$ and moving to the column for an operational effect size of $d_0 = 1.00$, we see that the power of this design is approximately 0.85. Note that if we had used the more complicated expression for the operational effect size given in eqn [4], we would have obtained $d_0 = 1.02$ and essentially the same power.
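The table lookup can be cross-checked numerically. The sketch below assumes that Table 1 tabulates the power of a two-sided, two-sample t test with $2N_0 - 2$ degrees of freedom and noncentrality parameter $d_0\sqrt{N_0/2}$; that correspondence is an assumption made here, since the construction of the table is not described in this section. All function names are hypothetical, and the noncentral t tail probability is obtained by simple numerical integration using only the Python standard library.

```python
import math
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def _chi2_pdf(x, k):
    """Chi-square density with k degrees of freedom (k > 2 assumed)."""
    if x <= 0.0:
        return 0.0
    half_k = k / 2
    return math.exp((half_k - 1) * math.log(x) - x / 2
                    - half_k * math.log(2) - math.lgamma(half_k))


def nct_upper_tail(c, df, ncp, steps=2000):
    """P(T > c) for a noncentral t variable, via Simpson's rule.

    Writes T = (Z + ncp) / sqrt(V / df), with Z standard normal and V
    chi-square(df), and averages P(Z > c * sqrt(V / df) - ncp) over V.
    """
    upper = df + 12 * math.sqrt(2 * df) + 40  # integration limit deep in the tail
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        v = i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        tail = 1.0 - _Z.cdf(c * math.sqrt(v / df) - ncp)
        total += weight * tail * _chi2_pdf(v, df)
    return total * h / 3


def t_critical(alpha, df):
    """Two-sided critical value c with P(|T| > c) = alpha for a central t."""
    lo, hi = 0.0, 50.0
    for _ in range(50):  # bisection on the upper-tail probability
        mid = (lo + hi) / 2
        if nct_upper_tail(mid, df, 0.0) > alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def power_two_sided(N0, d0, alpha=0.05):
    """Power of a two-sided t test, assuming df = 2*N0 - 2 and ncp = d0*sqrt(N0/2)."""
    df = 2 * N0 - 2
    ncp = d0 * math.sqrt(N0 / 2)
    c = t_critical(alpha, df)
    upper = nct_upper_tail(c, df, ncp)
    lower = max(0.0, 1.0 - nct_upper_tail(-c, df, ncp))  # P(T < -c)
    return upper + lower


print(f"power = {power_two_sided(19, 1.00):.2f}")
```

Under the stated assumption, the printed value can be compared with the 0.85 read from Table 1. A statistical library that provides the noncentral t distribution would normally be preferable to the hand-rolled integration used here.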


Conclusion

The procedures described in this article can be used to carry out statistical power analysis with any of three goals: finding the statistical power for a fixed sample size and effect size, finding the sample size that will provide acceptable power for a fixed effect size, or finding the minimum detectable effect size for a fixed sample size and power. One complication is that the within-treatment-group sample sizes or (in designs with clustering) the within-cluster sample sizes may not be equal. In this case, using the harmonic mean of the sample sizes yields reasonably accurate power computations. Thus, for designs involving simple random samples, we would use $N = 2N_1N_2/(N_1 + N_2)$. In designs in which the numbers of clusters in the treatment and control groups are unequal, we would use $m = 2m_1m_2/(m_1 + m_2)$. When the cluster sizes are unequal, we would use the harmonic mean $n = m/(1/n_1 + 1/n_2 + \cdots + 1/n_m)$ as the cluster size in formulas involving $n$.

See also: Analysis of Variance; Design of Experiments; Hypothesis Testing and Confidence Intervals; Statistical Significance Versus Effect Size.
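The harmonic-mean adjustment recommended in the Conclusion can be sketched with the Python standard library. The sample sizes below are hypothetical; `harmonic_mean` is the function from Python's `statistics` module, and for two values it reduces to the two-group expression $2N_1N_2/(N_1 + N_2)$.

```python
from statistics import harmonic_mean

# Two treatment groups of unequal size (hypothetical numbers).
# For two values the harmonic mean equals 2*N1*N2 / (N1 + N2).
N1, N2 = 40, 60
N = harmonic_mean([N1, N2])

# Unequal cluster sizes (hypothetical numbers).
# The harmonic mean is (number of clusters) / (sum of reciprocal sizes).
cluster_sizes = [10, 20, 20, 40]
n = harmonic_mean(cluster_sizes)

print(f"N = {N:.1f}, n = {n:.2f}")
```

The resulting values of `N` and `n` would then replace the common group size or cluster size in the power formulas.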

Bibliography

Bloom, H. S., Richburg-Hayes, L., and Black, A. R. (2007). Using covariates to improve precision: Empirical guidelines for studies that randomize schools to measure the impacts of educational interventions. Educational Evaluation and Policy Analysis 29, 30–59.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd edn. New York: Academic Press.

Hedges, L. V. and Hedberg, E. C. (2007). Intraclass correlations for planning group-randomized experiments in education. Educational Evaluation and Policy Analysis 29, 60–87.

Lipsey, M. W. and Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist 48(2), 1181–1209.


effect-size benchmarks for educational interventions. Journal for Research on Educational Effectiveness 1, 289–328.

Borenstein, M., Rothstein, H., and Cohen, J. (2001). Power and Precision. Teaneck, NJ: Biostat.

Elashoff, J. (2007). nQuery Advisor. Saugus, MA: Statistical Solutions.

Hedges, L. V. and Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological Methods 6, 203–217.

Hedges, L. V. and Pigott, T. D. (2004). The power of statistical tests for moderators in meta-analysis. Psychological Methods 9, 426–445.

Kaplan, D. (1995). Statistical power in structural equation modeling. In Hoyle, R. H. (ed.) Structural Equation Modeling: Concepts, Issues, and Applications, pp 100–117. Newbury Park, CA: Sage.

Konstantopoulos, S. (2008). The power of tests for treatment effects in three-level cluster randomized designs. Journal of Research on Educational Effectiveness 1, 66–88.

Konstantopoulos, S. (2008). The power of the test for treatment effects in three-level randomized block designs. Journal of Research on Educational Effectiveness 1, 265–288.

Kraemer, H. C. and Thiemann, S. (1987). How Many Subjects? Statistical Power Analysis in Research. Newbury Park, CA: Sage.

Lipsey, M. W. (1990). Design Sensitivity: Statistical Power Analysis for Experimental Research. Newbury Park, CA: Sage.

Murphy, K. R. and Myors, B. (2004). Statistical Power Analysis. Mahwah, NJ: Erlbaum.

O'Brien, R. G. and Muller, K. E. (1993). Unified power analysis for t-tests through multivariate hypotheses. In Edwards, L. K. (ed.) Applied Analysis of Variance in Behavioral Science, pp 297–344. New York: Marcel Dekker.

Raudenbush, S. W. (1997). Statistical analysis and optimal design for cluster randomized experiments. Psychological Methods 2, 173–185.

Raudenbush, S. W. and Liu, X. (2000). Statistical power and optimal design for multisite randomized trials. Psychological Methods 5(3), 199–213.

Raudenbush, S. W. and Liu, X. (2001). Effects of study duration, frequency of observation, and sample size on power in studies of group differences in polynomial change. Psychological Methods 6(4), 387–401.

Satorra, A. (1989). Alternative test criteria in covariance structure analysis: A unified approach. Psychometrika 54, 131–151.

Satorra, A. and Saris, W. E. (1985). Power of the likelihood ratio test in covariance structure analysis. Psychometrika 50, 83–90.

Schochet, P. Z. (2008). Statistical power for random assignment evaluations of educational programs. Journal of Educational and Behavioral Statistics 33, 62–87.

Spybrook, J., Raudenbush, S. W., Liu, X., Congdon, R., and Martinez, A. (2008). Optimal Design. Ann Arbor, MI: University of Michigan.