Journal of Clinical Epidemiology 64 (2011) 1047–1048
LETTERS TO THE EDITOR

P-values are misunderstood, but do not confound

To the Editor:
I agree with Professor Stang [1] that P-values cannot rule out the null hypothesis and do not substitute for measures of effect, and that these points bear repeating. But in cleaning up the language of published quantitative research, we should be careful not to throw P-values out with the bath water. The idea that the P-value "confounds" effect size and sample size is itself a fallacy (pace Rothman) [2]. P is calculated using the effect size and sample size, but the two are combined in such a way as to weigh the evidence against the null hypothesis, adjusting for sample size (large effects are less convincing in smaller studies) [3]. P-values and confidence intervals cannot substitute for each other because they answer different questions: for the confidence interval, this is "how big do we think the effect might be?"; for the P-value, it is "do we think there is an effect at all?" In some situations, we may have moved beyond the latter, but in others this is the first question we ask, through an application of Occam's razor.

The real problem with the way people interpret P-values is calibration. Many find P = 0.05 highly convincing, whereas realistically it does little more than tip the balance of probabilities in favor of the alternative hypothesis [3,4]. Smaller P-values are progressively more persuasive, which is why we need to see a P-value in addition to a 95% confidence interval. We should not abandon P-values, but instead try to understand them better.
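A minimal illustration of this calibration point (an editorial sketch, not part of the letter): the Sellke-Bayarri-Berger bound, -e * p * ln(p), is the smallest Bayes factor in favor of the null hypothesis that a given P-value can represent, so it marks the most evidence a P-value can carry against the null. The short Python sketch below applies it to a few P-values; the function name and the choice of even prior odds are illustrative assumptions.

```python
# Editorial sketch: calibrating P-values with the Sellke-Bayarri-Berger
# lower bound on the Bayes factor, -e * p * ln(p), valid for p < 1/e.
import math

def min_bayes_factor(p):
    """Smallest Bayes factor in favor of H0 that a P-value p can represent."""
    assert 0 < p < 1 / math.e
    return -math.e * p * math.log(p)

for p in (0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    # Posterior probability of the null, starting from even (1:1) prior odds.
    post_h0 = bf / (1 + bf)
    print(f"P = {p:<6} min Bayes factor = {bf:.3f}  P(H0 | data) >= {post_h0:.2f}")
```

At P = 0.05 the bound is about 0.41, so the null retains a posterior probability of at least 0.29 even from even prior odds: the balance of probabilities has tipped, but not by much. At P = 0.001 the floor falls below 0.02, consistent with smaller P-values being progressively more persuasive.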
Richard Hooper
Blizard Institute, Queen Mary, University of London
Abernethy Building, 2 Newark Street, London E1 2AT, UK
Tel.: +44-207-7882-7324; fax: +44-207-7882-2552
E-mail address: [email protected]

References

[1] Stang A. Low P-values exclude nothing, and P-values are no substitute for measures of effect. J Clin Epidemiol 2011;64:452–3.
[2] Lang JM, Rothman KJ, Cann CI. That confounded P-value. Epidemiology 1998;9:7–8.
[3] Hooper R. The Bayesian interpretation of a P-value depends only weakly on statistical power in realistic situations. J Clin Epidemiol 2009;62:1242–7.
[4] Ioannidis JPA. Why most published research findings are false. PLoS Med 2005;2:696–701.

doi: 10.1016/j.jclinepi.2011.03.003

That confounded P-value revisited

In reply:

Hooper [1] defends using P-values to answer the question, "do we think there is an effect at all." But what advantage is there in viewing measurable phenomena as a dichotomy? The P-value's role in significance testing only fosters this unfortunate dichotomous thinking. Quantitative thinking is preferable [2]. Although one could argue that a zero effect is qualitatively different from other values, one cannot distinguish zero from values close to it. It makes far more sense to consider zero on an equal footing with all other possible effect values. The question for the investigator ought to be "what is the best estimate of effect given the data in hand?" [3].

Lang et al. [4] described the P-value as confounded because it blends effect size with precision, thus failing to give a clear assessment of either. Hooper called this description a fallacy but offered no reason. He also described smaller P-values as being progressively more persuasive; this claim is an actual fallacy. Many researchers dream of conducting large studies. If you are lucky enough to conduct a sufficiently large study, every association measured will be accompanied by very small P-values and thus be statistically significant. Unfortunately, these significant P-values will tell you nothing about effect size. In a large study, random error is reduced to small levels, but whatever systematic error there is from confounding, selection bias, unequal follow-up, information error, and other sources remains unaffected by the large study size. No study is completely free of systematic error; thus, as confidence intervals shrink with increasing study size, they shrink around a value that is incorrect because of the systematic error. Consequently, as study size increases, confidence intervals become less likely to contain the truth, and P-values become small but irrelevant because they contain little information about where the truth lies.

Even in small studies, P-values confuse rather than enlighten because they mix effect size with study size. Instead of checking P-values, it is far better to estimate an effect with a point estimate, use the size of a confidence interval to gauge statistical imprecision, and consider sources of bias that systematically distort the estimate obtained. As Lash [5] nicely summarized: "Epidemiologic research is an exercise in measurement. Its objective is to obtain a valid and precise estimate of either the occurrence of disease in a population or the effect of an exposure on the occurrence of disease." To address validity, the investigator should focus on methods that control confounding and compensate for bias stemming from misclassification or selection [6].

We do agree with Hooper's statement, "P-values and confidence intervals cannot substitute for each other." But, as explained by Poole [7], confidence intervals are far more useful than P-values, expressing both effect size, seen in the point estimate that anchors the confidence interval, and precision, seen in the width of the interval. The P-value blends these two messages into one number that fails to elucidate either. Hooper says, "We should not abandon P-values, but instead try to understand them better." When we come to understand them better, we will want to abandon them.
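The reply's argument about systematic error lends itself to a small simulation (again an editorial sketch, not the authors' own analysis; the bias value of 0.1 and all names below are illustrative assumptions). A fixed systematic error is added to every observation while the true effect is zero; as the sample grows, the confidence interval tightens around the biased estimate and the P-value collapses, yet the interval stops covering the truth.

```python
# Editorial sketch: large samples shrink random error but leave a fixed
# systematic error untouched, producing tiny P-values around a wrong value.
import math
import random

random.seed(1)
TRUE_EFFECT = 0.0  # the truth: no effect at all
BIAS = 0.1         # constant systematic error, e.g., residual confounding

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for n in (100, 10_000, 1_000_000):
    sample = [random.gauss(TRUE_EFFECT + BIAS, 1.0) for _ in range(n)]
    est = sum(sample) / n     # point estimate of the effect
    se = 1.0 / math.sqrt(n)   # standard error, unit variance assumed known
    lo, hi = est - 1.96 * se, est + 1.96 * se
    p = 2.0 * (1.0 - normal_cdf(abs(est / se)))
    covers = lo <= TRUE_EFFECT <= hi
    print(f"n={n:>9}  estimate={est:+.3f}  95% CI=({lo:+.3f}, {hi:+.3f})  "
          f"P={p:.2g}  covers truth: {covers}")
```

At n = 100 the interval is wide and typically covers zero; at n = 1,000,000 it is razor-thin around 0.1, the P-value is vanishingly small, and the truth lies outside the interval: small but irrelevant, exactly as described above.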
[email protected]
Kenneth J. Rothman RTI Health Solutions Research Triangle Park NC, USA
References

[1] Hooper R. P-values are misunderstood, but do not confound. J Clin Epidemiol 2011;64:1047. [in this issue].
[2] Stang A, Poole C, Kuss O. The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol 2010;25:225–30.
[3] Oakes MW. Statistical inference. Chichester, UK: Wiley; 1986.
[4] Lang JM, Rothman KJ, Cann CI. That confounded P-value. Epidemiology 1998;9:7–8.
[5] Lash TL. Heuristic thinking and inference from observational epidemiology. Epidemiology 2007;18:67–72.
[6] Lash TL, Fox MP, Fink AK. Applying quantitative bias analysis to epidemiologic data. Dordrecht, The Netherlands: Springer; 2009.
[7] Poole C. Low P-values or narrow confidence intervals: which are more durable? Epidemiology 2001;12:291–4.

doi: 10.1016/j.jclinepi.2011.03.004