The Seduction of P Values

Christian J. Nelson, PhD, and Elizabeth Schofield, MPH

Earlier this year, the American Statistical Association published an excellent statement on P values and how frequently they are misinterpreted.1 This statement led us to think about the common mistakes related to P values seen in the Journal of Sexual Medicine and in presentations at the annual meeting of the North American Society of Sexual Medicine and the Sexual Medicine World Meeting sponsored by the International Society of Sexual Medicine. In general, researchers tend to be “seduced” by P values, often relying on them too heavily and missing important interpretations of their data. Make no mistake: P values are needed as part of the hypothesis testing we use to make inferences about our data. Despite growing concern about the misinterpretation of P values, their use will continue to be part of the fabric of our research. As such, it is important to use them in an appropriate manner that helps push our field forward.

We have seen researchers misinterpret P values in three ways. First, many researchers confuse statistical significance with clinical significance. A P value below the common threshold of .05 indicates that differences or relations as large as those observed would be unlikely if chance alone were at work. Statistical significance does not imply anything about clinical significance or how important the differences or associations are to the specific study. Once statistical significance is determined, researchers need to answer the question “so what?” and explain why the results are meaningful within the context of their study. For example, men are taller than women. We can easily demonstrate significance in a small study, but who cares? At times it is better to be taller (putting items in an overhead bin in an airplane), and at times it is better to be shorter (taking a seat in an airplane after you have stowed your luggage). Most of the time, it simply does not matter.
Second, researchers frequently overstate the importance of very low P values. We have often heard presenters say something like, “You can see that the P value is less than .001, so these findings are ‘very’ significant.” In most cases, once it has been determined that the differences or associations are unlikely to occur by chance (P < .05), it really does not matter if they are “very” unlikely to occur by chance (P < .001). Both values indicate statistical significance, but neither indicates anything about the clinical importance of the findings related to the study question. The reason for this is that P values are partly determined by sample size. As the sample size increases (and group differences or associations remain the same), the P value will decrease. As a result, any nonzero group difference or association can produce a “very” low P value if the sample is large enough. For example, the average height of women in the United States is approximately 5 feet 4 inches, and the average height of men is approximately 5 feet 9 inches. If we randomly select 30 women and 30 men, this difference would most likely be significant at a P value less than .05. Now, because P values are determined in part by sample size, if we randomly select 100 women and 100 men with no changes in average height, then the P value could decrease to less than .001. Because the difference between the groups has not changed, the findings are no more meaningful or “significant” despite the lower P value. And, as stated earlier, the differences in height are probably unimportant in the first place.

Received October 3, 2016. Accepted October 4, 2016.
Department of Psychiatry and Behavioral Sciences, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
Copyright © 2016, International Society for Sexual Medicine. Published by Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.jsxm.2016.10.001
J Sex Med 2016;-:1-2

Third, we have seen researchers dismiss or overlook potentially important findings when the P value is slightly above .05, for example, .06 to .15. This is generally a concern for studies with small samples. Again, P values are based in part on sample size. The concern with small studies is that there might be a meaningful effect (meaningful differences or associations), but the sample is not large enough to produce a P value less than .05. This is what is meant by an “underpowered” study, and the result is potentially a type II error. This highlights the importance of understanding the study in context and determining whether the results are clinically meaningful or important. For example, suppose that in a small random sample the P value for comparing men’s heights with women’s heights is .09, although the observed mean difference is 5 inches.
Despite not having statistical evidence of a difference, we still cannot conclude that the heights of men and women are equivalent based on this statistical test alone. Instead, when this occurs, researchers can state something to the effect of,2 “Although the P value (P = .09) did not reach conventional statistical significance, the magnitude of the difference between the means has been found to be clinically meaningful in previous studies (cite the studies). We believe these results have important clinical significance, and this indicates that continued research in this area is needed with larger samples.” It is often helpful to label these types of studies as pilot or exploratory studies, and it is important to recognize the possible value (within their limitations) of these studies. A continuum of research is needed to push science forward, from exploratory, to pilot, to large randomized studies. All types of studies have their place if labeled and interpreted correctly.
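The dependence of the P value on sample size described above can be checked with a few lines of Python. The sketch below holds the 5-inch mean difference fixed and varies only the number of people per group. It rests on illustrative assumptions that are not taken from the article: a z test with a known, common standard deviation of 3 inches (a t test would be the usual choice when the SD is estimated from a small sample), and a hypothetical helper named `two_sided_p`. Because of these assumptions, the printed P values will not match the thresholds quoted in the text; only the trend is the point.

```python
import math


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided P value for a difference in means between two equal-sized groups.

    Simplifying assumption: a z test with a known, common standard deviation.
    """
    # Standard error of the difference between two independent group means
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Equivalent to 2 * (1 - Phi(|z|)); erfc keeps very small P values from rounding to 0
    return math.erfc(abs(z) / math.sqrt(2.0))


MEAN_DIFF_INCHES = 5.0   # from the text: about 5 ft 9 in versus 5 ft 4 in
ASSUMED_SD_INCHES = 3.0  # hypothetical value chosen for illustration only

# Same difference, different sample sizes
p_by_n = {
    n: two_sided_p(MEAN_DIFF_INCHES, ASSUMED_SD_INCHES, n)
    for n in (2, 30, 100)
}

for n, p in p_by_n.items():
    print(f"n = {n:>3} per group, difference = {MEAN_DIFF_INCHES} in, P = {p:.3g}")
```

Under these assumptions the P value shrinks as the groups grow even though the difference stays at 5 inches, and with only 2 people per group it remains above .05 despite the same observed difference.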

P values will continue to be a central part of our research, and it is important that we use them and interpret them correctly.

Corresponding Author: Christian J. Nelson, PhD, Associate Attending Psychologist, Associate Member, Department of Psychiatry and Behavioral Sciences, Memorial Sloan-Kettering Cancer Center, 641 Lexington Avenue, 7th Floor, New York, NY 10022, USA; E-mail: [email protected]

Conflicts of Interest: The authors report no conflicts of interest.

Funding: This article was funded by NIH grants R01 CA190636 and P30 CA008748.

STATEMENT OF AUTHORSHIP

Category 1
(a) Conception and Design: Christian J. Nelson; Elizabeth Schofield
(b) Acquisition of Data: Not applicable
(c) Analysis and Interpretation of Data: Not applicable

Category 2
(a) Drafting the Article: Christian J. Nelson; Elizabeth Schofield
(b) Revising It for Intellectual Content: Christian J. Nelson; Elizabeth Schofield

Category 3
(a) Final Approval of the Completed Article: Christian J. Nelson; Elizabeth Schofield

REFERENCES

1. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. Am Stat 2016;70:129-133.

2. Vickers AJ, Sjoberg DD. European Urology. Guidelines for reporting of statistics in European Urology. Eur Urol 2015;67:181-187.
