INVITED EDITORIAL
Testing with confidence: The use (and misuse) of confidence intervals in biomedical research
In this issue of the Journal, Wolfe and Cumming [1] discuss the advantages of using confidence intervals, rather than p-values, when reporting study findings. They point out that the p-value combines two pieces of information, the precision of the study and the magnitude of the treatment effect, that should be kept separate. Thus, p-values can be extremely misleading in some situations, especially when reported as a simple inequality (eg, p>0.05), rather than as a continuous numerical value (eg, p=0.06). Confidence intervals can be readily computed, in most situations, using the estimated standard errors produced by all major statistical software packages. In view of the advantages of confidence intervals over p-values, detailed by Wolfe and Cumming [1], and others [2-4], researchers are strongly encouraged to report confidence intervals instead of p-values.

Use of p-values and hypothesis testing is widespread and deeply entrenched within the culture of modern biomedical research. Wolfe and Cumming state that a p-value reduces the information presented in a confidence interval to just a single item: namely, whether the result has attained statistical significance. The sad truth, however, is that this unfortunate reductionism is performed by the scientist, not the statistic. While hypothesis testing clearly has its place in some randomised studies, the biomedical research community as a whole has become overly reliant on the concept of hypothesis testing [2]. To demonstrate the limitations of this hypothesis testing mindset, we need only note that a p=0.06 finding has far more in common with a p=0.04 finding than it does with a p=0.99 finding.
However, a pure hypothesis-testing mindset would mislead us into concluding that the p=0.06 result and the p=0.99 result convey the same information, and that the p=0.04 finding is somehow vastly different from the p=0.06 finding. In fact, a p=0.06 result could easily become p=0.04 due to a small improvement in statistical power (perhaps stemming from the inclusion of a small number of additional subjects), without any tangible increase in the magnitude of the treatment effect. This problem is particularly germane to observational epidemiologic studies involving non-random samples and/or non-randomised exposures, since the Type 1 error rate (alpha itself) has no clear interpretation for this type of study [5].

A fundamental paradigm shift is required. The binary yes/no hypothesis testing approach (typically based on p<0.05 or a similar cut-point) should itself be rejected in favor of an estimation-based approach that quantifies the strength and precision of study effects [4]. Use of confidence intervals, as advocated by Wolfe and Cumming [1], is an excellent first step in moving towards estimation-based science. Sadly, experience suggests that converting researchers from the use of p-values to confidence intervals doesn't guarantee conversion from the hypothesis testing paradigm to the estimation paradigm. This is because researchers can continue to employ the confidence interval as a hypothesis test, using the convenient fact (highlighted by Wolfe and Cumming) that exclusion of the null value from the 95% confidence interval is generically
equivalent to a hypothesis test using the p=0.05 criterion.

To encourage movement away from hypothesis testing and towards estimation-based science, the confidence limit ratio (CLR) has been proposed [4]. The CLR is simply computed as the upper limit of the confidence interval divided by the lower limit of the confidence interval. It provides a handy marker of the degree of precision associated with a given estimate, and is particularly useful if ratio measures (eg, odds ratios) are being used.

To illustrate the use of the CLR, consider the two odds ratios (ORs) in Table 1. Let us assume they come from two hypothetical prospective cohort studies assessing risk factors for injury in football players. From a hypothesis testing standpoint, the first study clearly provides stronger evidence for an association between not stretching and injury risk, since here the confidence interval excludes the null value (1.00 for the OR) whereas the confidence interval for the second study includes the null. However, this conclusion is incorrect. A researcher utilising the estimation paradigm would note that both studies observed an elevation in injury risk due to non-stretching, but in the second ("non-significant") study the effect of non-stretching has been more precisely estimated (CLR of 4.4 versus 8.8). Obviously, the point estimate for the OR is lower in the second study (2 versus 3), so ultimately the comparison is possibly equivocal. However, this example demonstrates that (mis)using the confidence interval as a hypothesis test (ie, checking whether the null is outside the interval) can be as misleading as any other form of hypothesis testing.
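The CLR arithmetic for the two hypothetical studies can be sketched in a few lines of Python. This is only an illustration: the function name `confidence_limit_ratio` is an assumption introduced here, and the limits are the hypothetical values from Table 1.

```python
def confidence_limit_ratio(lower, upper):
    """Return the upper confidence limit divided by the lower confidence limit."""
    # The ratio is only meaningful for positive limits, as with ratio measures (eg, ORs).
    if lower <= 0 or upper <= 0:
        raise ValueError("confidence limits must be positive to form a ratio")
    return upper / lower

# Hypothetical 95% confidence limits for the odds ratios in Table 1
study_1_limits = (1.01, 8.90)
study_2_limits = (0.95, 4.21)

clr_1 = confidence_limit_ratio(*study_1_limits)
clr_2 = confidence_limit_ratio(*study_2_limits)

print(f"Study 1 CLR: {clr_1:.1f}")  # prints "Study 1 CLR: 8.8"
print(f"Study 2 CLR: {clr_2:.1f}")  # prints "Study 2 CLR: 4.4"
```

The smaller ratio for the second study reflects its narrower interval, even though that interval includes the null value.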
Correct use of the CI involves assessment of precision using the entire interval (potentially through a summary measure of the interval width, such as the CLR), not just the portion of the interval that includes (or excludes) the null.

Finally, it must be noted that the discussion above, and in Wolfe and Cumming [1], relates only to the issue of imprecision (so-called "random error"). Systematic error, or bias, is frequently a greater threat to the overall accuracy of observational studies [6]. Recent work extends the concept of a confidence interval to include sources of non-random error, such as selection bias or misclassification error [7].

In summary, the binary yes/no mindset of hypothesis testing should be discouraged in favor of estimation. In particular, the magnitude and precision of study effects can be quantified through point estimates and confidence intervals, rather than through p-values. Use of confidence intervals in place of p-values is an important step in the right direction, away from hypothesis testing and towards estimation. However, many users of confidence intervals become fixated on whether the interval excludes the null value (0 for a difference, 1 for a ratio). This procedure is equivalent to hypothesis testing and should be replaced with a focus on precision, which can be quantified with the CLR.

Table 1: Hypothetical Odds Ratios for Effect of Not Stretching versus Any Stretching, and Risk of Injury.

               Study 1              Study 2
OR (95% CI)    3.0 (1.01, 8.90)     2.0 (0.95, 4.21)
CLR*           8.8                  4.4

*CLR = Confidence Limit Ratio, defined as Upper Confidence Limit divided by Lower Confidence Limit.
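The point that a modest gain in sample size can move a result across the p=0.05 line without any change in the estimated effect can also be sketched numerically. The example below is an illustration under stated assumptions, not an analysis from the editorial: it assumes a Wald-type interval on the log odds ratio scale, back-calculates the standard error from the hypothetical Study 2 limits in Table 1, and assumes the standard error shrinks with the square root of the sample size (here, a hypothetical 20% increase in subjects). The helper name `wald_summary` is introduced only for this sketch.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def wald_summary(odds_ratio, se_log_or):
    """Wald 95% limits, CLR, and two-sided p-value for an odds ratio."""
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - Z_95 * se_log_or)
    upper = math.exp(log_or + Z_95 * se_log_or)
    z = log_or / se_log_or
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return {"lower": lower, "upper": upper, "clr": upper / lower, "p": p_value}

# Assumption: standard error back-calculated from the Study 2 limits (0.95, 4.21)
se_original = (math.log(4.21) - math.log(0.95)) / (2 * Z_95)

# Assumption: 20% more subjects, with the SE scaling as 1 / sqrt(sample size)
se_larger = se_original / math.sqrt(1.2)

original = wald_summary(2.0, se_original)
larger = wald_summary(2.0, se_larger)

for label, result in (("original study", original), ("20% more subjects", larger)):
    print(
        f"{label}: OR 2.0 ({result['lower']:.2f}, {result['upper']:.2f}), "
        f"CLR {result['clr']:.1f}, p={result['p']:.3f}"
    )
```

Under these assumptions the point estimate is identical in both rows; only the precision changes, which is why the CLR, rather than the position of the null value, is the more informative summary.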
References
1. Wolfe, Cumming. J Sci Med Sport. 2004:7(2):138-143.
2. Poole C. Beyond the confidence interval. Am J Public Health. 1987:77(2):195-9.
3. Rothman KJ. A show of confidence. N Engl J Med. 1978:299(24):1362-3.
4. Poole C. Low p-values or narrow confidence intervals: which are more durable? Epidemiology. 2001:12(3):291-294.
5. Greenland S. Randomization, statistics, and causal inference. Epidemiology. 1990:1(6):421-429.
6. Greenland S. Basic Methods for Sensitivity Analysis and External Adjustment (Chapter 19). In: Modern Epidemiology (2nd Edn). Eds: Rothman K, Greenland S. Lippincott-Raven: New York NY: 1998: pp343-358.
7. Lash TL, Fink AK. Semi-automated sensitivity analysis to assess systematic errors in observational data. Epidemiology. 2003:14(4):451-458.
Stephen W Marshall
Biostatistician, Injury Prevention Research Center; Department of Epidemiology, School of Public Health; Department of Orthopedics, School of Medicine; University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.