Multiple Sclerosis and Related Disorders 38 (2020) 101490
Commentary
Effect size or statistical significance, where to put your money

Gary Cutter

University of Alabama at Birmingham, School of Public Health, Department of Biostatistics, 1665 University Boulevard, Ryals Public Health Building, Room 327, Birmingham, AL, United States
Keywords: p-values; effect size; hypothesis testing

ABSTRACT
There is continuing controversy over the excessive reliance on p-values. This paper elaborates on the use of p-values, points out their utility, and reminds readers that there is no single measure that is a universal measure of the value of a study. All measures have their warts and moles, but each has utility. This paper discusses not only p-values but also important measures of effect, effect sizes, and cautions against over- or under-interpreting the various measures. The p-value is just a function of the summary statistic chosen for the outcome and the sample size, and it indicates a binary decision about the hypothesis. Using p-values is still OK, but they should be coupled with other measures, such as effect sizes. A p-value alone won't get you through an ethics review board, no matter what its value, and it shouldn't get you through a journal review either.
1. Introduction

There is continuing controversy over the excessive reliance on p-values. With the increased focus on the use of p-values, many journals and national organizations are arguing for, or demanding, that studies be described using some standardized measure of the magnitude of the differences found. Some journals encourage the use of such measures, and some publish without any p-values at all. The size of the differences found is commonly called the effect size, and this commentary will try to provide a balanced view of the situation. However, let me start with what some may see as a disclaimer and others a fact. There is no single measure that is a universal measure of the value of a study. All measures have their warts and moles, but many are quite useful.

2. The Maligned p-value

Why has this discontent with the p-value arisen? It has often taken on the role of the sole determinant of the outcome of a study. Presumably readers here believe they know what a p-value is, but it is a good place to start. The concept of performing a hypothesis test is a mathematical approach to decision making based on evidence. When we propose testing a hypothesis, it is either true or false. There are two possible decisions: reject the hypothesis, our so-called null hypothesis, or fail to reject it. In statistical decision theory, we are setting up a decision rule that is a function of a summarization of the data, mathematically called a sufficient statistic. In more common vernacular, we are setting up a neutral or no-difference scenario and asking how often
would the data we observe result by chance from a true population of results that adhered to the null hypothesis? If the outcomes are summarized and fall within our a priori rejection region, commonly chosen to contain 5% of the observations or fewer, we use our decision rule to declare that the null hypothesis is likely not true and that the alternative hypothesis seems more plausible and appealing. This decision rule is required to be decided in advance, and the calculation yields a binary result: reject the null hypothesis or fail to reject the null hypothesis. Maybe more important is the fact that the p-value doesn't tell us the value of the decision; all it tells us, if we reject the null hypothesis, is how often we are wrong! The decision theoretic approach is taken so that we have some clear idea of the potential errors in our decision making: our false positive and false negative decisions (Type I and Type II errors). Kuffner and Walker (Kuffner and Walker, 2019) recently argued that there are three types of "testers":

1. Those who set a Type I error probability α, see their data, compute the p-value and reject the null hypothesis if the p-value < α;
2. Those who compute the p-value and assert that they would have chosen an α bigger than the p-value observed, had it been chosen beforehand, and thus reject the null hypothesis; and
3. Those who compute the p-value, believe it is small, and argue that the Type I error is the p-value.

The only tester who is using the p-value appropriately is the first type.
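To make the decision-rule framing concrete, here is a minimal simulation sketch (not from the paper; the sample sizes, seed and normality assumption are illustrative): with a pre-specified α = 0.05 rule, a true null hypothesis is rejected about 5% of the time, which is exactly the "how often we are wrong" guarantee described above.

```python
# Illustrative sketch: a fixed alpha = 0.05 decision rule applied to data
# generated under a true null hypothesis rejects in about 5% of studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05          # pre-specified rejection region (Tester 1 behaviour)
n_per_group = 100     # hypothetical sample size per arm
n_studies = 10_000    # number of simulated studies

rejections = 0
for _ in range(n_studies):
    a = rng.normal(0.0, 1.0, n_per_group)   # both arms from the same population
    b = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        rejections += 1

print(f"Null rejected in {rejections / n_studies:.1%} of simulated studies")
# Expected output: roughly 5%, matching the chosen Type I error rate.
```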
The literature, however, is crowded with Testers 2 and 3. The arguments over the p-value are not new, but as the number of journals increases and more and more research is published, the p-value has also become a decision tool for accepting articles. Statistically significant results are seemingly the most important and thus play a major role in the selection criteria. Even in the early days of meta-analyses, almost 50 years ago, publication bias, a preference for p < 0.05 studies, was a concern (the file drawer problem), but the problems and concerns have certainly escalated in the past decade. In February 2014, the editors of the journal Basic and Applied Social Psychology (BASP) banned the use of the "null hypothesis significance testing procedure" (NHSTP) and confidence intervals. They went on to stipulate "…BASP will require strong descriptive statistics, including effect sizes. We also encourage the presentation of frequency or distributional data when this is feasible." (Trafimow and Marks, 2015) Did banning p-values work? Fricker et al., in a recent article (Fricker et al., 2019), argued that such a ban did not work. A year after the ban, they reviewed 31 articles published in BASP and concluded: "We found multiple instances of authors overstating conclusions beyond what the data would support if statistical significance had been considered. Readers would be largely unable to recognize this because the necessary information to do so was not readily available." Partly in response to the BASP ban on p-values, the American Statistical Association issued "The ASA Statement on p-Values: Context, Process, and Purpose" (Wasserstein and Lazar, 2016), which contained five principles:
1. "P-values can indicate how incompatible the data are with a specified statistical model.
2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result."
Thus, the misuse and abuse of p-values has become clear, and this has led to calls for additional information in published articles. The American Psychological Association (APA) (American Psychological Association, 2009) suggests additional reporting including effect sizes, confidence intervals, and extensive description. Effect sizes are designed to eliminate one key problem of p-values: the impact of the sample size on the p-value. When p-values are used properly, in a decision theoretic context, this is not directly a problem, since the decision is binary; the p-value does not imply importance, it is only a decision tool with controlled error rates. However, this is difficult to keep in mind when reading and interpreting the literature. Suppose two clinical trials are conducted to compare the results of drugs A and B. Trial 1 compares the annualized relapse rate in a study of Drug A versus Drug B: the p-value is shown to be p < 0.01, with A outperforming B (i.e., a lower annualized relapse rate [ARR] than Drug B). Trial 2 makes the same comparison and finds the p-value to be p = 0.17, with A outperforming B. Trial 1 achieves its goal and is called successful, while Trial 2 is considered a failed trial. If we look at the results a bit more closely, as shown below, we see that the effects of Drug A and Drug B are identical: they have the same annualized relapse rates and standard deviations in both studies, and the results differ only because of the sample sizes. In Trial 1 there were 270 participants per group, whereas Trial 2 had only 45. This caused the p-values based on identical results to differ substantially, p = 0.01 versus p = 0.17.

Trial 1 (successful): Drug A (n = 270), ARR (sd) = 0.16 (0.50) and Drug B (n = 270), ARR (sd) = 0.26 (0.50); ARR difference = 0.10, % reduction = 38%, t-test = 2.326, p-value = 0.01.

Trial 2 (unsuccessful): Drug A (n = 45), ARR (sd) = 0.16 (0.50) and Drug B (n = 45), ARR (sd) = 0.26 (0.50); ARR difference = 0.10, % reduction = 38%, t-test = 0.9542, p-value = 0.17.
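A short sketch (illustrative only, assuming equal variances and a one-sided test, using SciPy) reproduces the two trials' t-tests from the summary statistics above and makes plain that only the sample size drives the different p-values.

```python
# Hedged sketch (not the paper's code): two-sample t-tests from summary statistics
# for the two hypothetical trials, showing how the p-value depends on n.
import math
from scipy import stats

def one_sided_p(mean_a, mean_b, sd, n_per_group):
    """Two-sample t-test from summary stats, equal sds, one-sided p-value."""
    se = sd * math.sqrt(2.0 / n_per_group)   # standard error of the difference
    t = (mean_b - mean_a) / se               # Drug A has the lower ARR
    df = 2 * n_per_group - 2
    return t, stats.t.sf(t, df)              # P(T > t) under the null hypothesis

for label, n in [("Trial 1", 270), ("Trial 2", 45)]:
    t, p = one_sided_p(mean_a=0.16, mean_b=0.26, sd=0.50, n_per_group=n)
    print(f"{label}: ARR difference = 0.10, t = {t:.3f}, one-sided p = {p:.3f}")

# Identical ARRs and standard deviations, yet Trial 1 gives p ≈ 0.01 and
# Trial 2 gives p ≈ 0.17 -- only the sample sizes differ.
```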
The above summarization of the two studies, using just the terms successful or unsuccessful, uses the p-value to describe the studies, or the value of the studies, rather than simply using it as the decision tool it was originally designed to be. Enter effect sizes. Effect sizes have been described for well over 100 years. The effect size (ES) is a standardized difference between two or more groups and does not depend on the sample size. For continuous variables, the most common ES is Cohen's d (Cohen, 1988): the difference between the means of the two treatment groups being compared, divided by the standard deviation of one or the other population (since the two are often assumed to be equal). When they are unequal, we often use the larger of the standard deviations, to be conservative, or the average of the two standard deviations weighted by the sample sizes. In the example above, the ES would be calculated as the difference between the two groups, 0.10, divided by the larger of the standard deviations, which here are equal (s = 0.50):

ES = 0.10/0.50 = 0.20

The ES is called a standardized difference. It is a unitless quantity and is thereby comparable across studies. It can be interpreted as the number of standard deviation units by which the treatment effect has been able to move the mean. Thus, it is a summarization of the results on a standardized, unitless scale.
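The same point can be made in code. A hypothetical helper for Cohen's d (the function name and layout are illustrative), applied to both trials, returns the identical ES of 0.20 regardless of whether 45 or 270 participants per arm were enrolled.

```python
# Hedged sketch: Cohen's d for the two hypothetical trials above.
# The effect size ignores the sample size, so both trials give the same value.
def cohens_d(mean_a, mean_b, sd_a, sd_b):
    """Standardized mean difference using the larger (conservative) sd."""
    return abs(mean_b - mean_a) / max(sd_a, sd_b)

d = cohens_d(mean_a=0.16, mean_b=0.26, sd_a=0.50, sd_b=0.50)
print(f"ES = {d:.2f}")  # 0.20 in both Trial 1 (270 per arm) and Trial 2 (45 per arm)
```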
However, this formulation is for continuous variables; there is no single formula for an effect size. The ES depends on the mathematical summarization of the data used to assess the treatment effects. Nevertheless, almost all effect size indices can be described as effects that are small, medium or large. Cohen argued that an ES of 0.20 is small, 0.50 is medium and 0.80 is large. These guidelines are arbitrary because it is difficult to make such statements out of context. A small effect at an individual level may have major implications at a population level. Think about treating elevations in cholesterol. For any one individual the reduction might be small, but the societal benefit might be larger. A 60-year-old non-African American male with total cholesterol of 220, HDL of 45, blood pressure 140/90, nonsmoking and nondiabetic has a 10-year risk of heart disease or stroke of 11.8%. Lowering his cholesterol to 160, with no changes in the other risk factors, lowers the risk over 10 years to 8.8% (http://www.cvriskcalculator.com/). Is this a small, medium or large effect size? It is an absolute reduction of 3% in the risk of a cardiovascular event over the next 10 years; for the individual it is a 25% relative reduction compared with no intervention. Since the occurrence of an event is relatively low, the risk is still small, but cumulated over millions of individuals it can save many lives, costs, etc. Countries have recognized that, at a population level, cholesterol at this level should be treated to reduce the impact of the disease because of its high prevalence in the general population. Within the multiple sclerosis world, effect sizes may be more difficult to assess. When betaseron was first introduced, the annual exacerbation rate after 2 years was 1.27 for patients receiving placebo, 1.17 for the 1.6 MIU IFNB group, and 0.84 for the 8 MIU IFNB group. The absolute difference between the 8 MIU dose and placebo was 0.43 relapses per year.
Fig. 1. Estimated Effect Size for Annualized Relapse Rate Reductions.
To give a general idea of how effect sizes for annualized relapse rates compare, Fig. 1 shows the differences in mean ARR for the treatment and control groups divided by the estimated standard deviation of the comparison group (adapted from McCool et al., 2019). For simplicity and illustration we made the simple assumption that relapses follow a Poisson distribution, since we can then calculate the standard deviation by simply taking the square root of the ARR. This distribution has a smaller standard deviation than would be generated using a negative binomial distribution (often used to analyze ARRs), but it allows us to illustrate the ES concept using the same approach. The Poisson assumption makes the ES a bit larger (since the standard deviation is smaller with the Poisson distribution than with the negative binomial). The general appearance of these effect sizes in Fig. 1 follows what might be expected from knowledge of the treatments and the studies. The AFFIRM trial produces one of the largest ESs, since the treatment was compared to a placebo. The red bars indicate that the comparator group is an active comparator. What can be seen in Fig. 1 is the wide variability in effect sizes, despite the statistical significance found in these studies. Using the guidelines of small, medium and large ES, only the AFFIRM study approaches the medium ES guideline, and none of these studies demonstrates a large ES. Keep in mind that relapse rates have fallen over time, especially before 2010, so time, entrance criteria, or adherence to tighter definitions of relapse are embedded in these results, but Fig. 1 indicates that there is more to consider than simply p-values.
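As a hedged illustration of the Fig. 1 approach (a sketch of the adapted calculation, not the original authors' code), the betaseron ARRs quoted earlier give an ES of roughly 0.38 under the simplifying Poisson assumption.

```python
# Hedged sketch of the Fig. 1 calculation: ES = (control ARR - treated ARR) / sd
# of the comparator, with the sd approximated by sqrt(ARR) under the Poisson
# assumption described above.
import math

def poisson_arr_effect_size(arr_control, arr_treated):
    sd_control = math.sqrt(arr_control)   # Poisson: variance equals the mean
    return (arr_control - arr_treated) / sd_control

# Betaseron pivotal-trial ARRs quoted earlier: placebo 1.27, 8 MIU IFNB 0.84.
es = poisson_arr_effect_size(arr_control=1.27, arr_treated=0.84)
print(f"Estimated ES ≈ {es:.2f}")  # about 0.38, below Cohen's 'medium' guideline
```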
One characteristic of the ES that is rarely discussed is that Cohen's d is equivalent to calculating a z-score from a normal distribution. This in turn provides another way of interpreting the ES. Suppose we compare Drug C to a placebo and the ES of the ARR is 0.50. Assuming we subtracted the ARR of Drug C from that of placebo, the ES z-score is a positive 0.50. This means that 69% of the Drug C group would have a lower ARR than the placebo group. Why is this? Because in a normal distribution with mean = 0 and standard deviation = 1, 30.9% of the distribution would be higher than this z-score and 69.1% would be at 0.50 or lower (this can be found from any table or calculator of the standard normal distribution). This z-score gives us a way to interpret what the ES actually means in terms of the population of participants under study. Similarly, a very large effect size of 1.645 would tell us that 95% of the Drug C population had ARRs lower than those in the placebo group. Thus, Cohen's rules of thumb for small, medium and large correspond to overlaps or separations in the population of about 60% for small, 70% for medium and 80% for large effect sizes. Interpreted this way, the effect size can have a more easily interpreted clinical meaning for the practitioner.
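A one-line check with the standard normal distribution (an illustrative sketch using SciPy) reproduces these percentages.

```python
# Hedged sketch: reading Cohen's d as a z-score on the standard normal scale.
from scipy.stats import norm

for es in (0.20, 0.50, 0.80, 1.645):
    # Proportion of the treated group expected to fall below the comparator mean,
    # following the interpretation described in the text.
    print(f"ES = {es:.3f}: {norm.cdf(es):.1%} below the comparator mean")
# ES = 0.50 gives about 69%, and ES = 1.645 gives about 95%, as described above.
```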
One of the benefits of the ES is that it does not depend on the sample size, as it uses the mean difference divided by the standard deviation. However, one wonders about the precision of this measure. Certainly a difference calculated from a very large sample would intuitively be more precise than one calculated from a smaller sample. This is likely to be correct and suggests that some margin of error might be useful or appropriate when interpreting and publishing the ES. This leads to the idea of a confidence interval for the ES. Hedges and Olkin (Hedges and Olkin, 1985) have provided a confidence interval for the ES, which is a relatively straightforward calculation from the sample sizes (NE and NC of the two treatments) and the estimated ES:

The 95% CI for the ES is: ES ± sqrt{(NE + NC)/(NE × NC) + ES²/[2(NE + NC)]}

where NE and NC are the respective sample sizes and ES is the estimated effect size. For example, in the TRANSFORMS study (Cohen et al., 2010) of fingolimod, the mean change on the MSFC in the 0.5 mg dose group was 0.08 (sd = 0.42) and in the interferon beta-1a group was −0.03 (sd = 0.48), with sample sizes of 429 and 431, respectively. The ES can be calculated as
ES = [0.08 − (−0.03)]/pooled sd = 0.11/0.45 = 0.244

By Cohen's standards, this was a relatively small ES. We can also calculate the precision of, or a confidence interval for, the ES.
The 95% CI for the ES is: 0.244 ± sqrt{(429 + 431)/(429 × 431) + 0.244²/[2(429 + 431)]} = 0.244 ± sqrt(0.004686) = 0.244 ± 0.068 = (0.176, 0.312)

Thus, one could say that the 95% confidence interval for the effect size is 0.176 to 0.312 based on this approach, still in the small range. If there had been only 100 individuals in each treatment group, the 95% CI for the same effect size would be:
0.244 ± sqrt{(100 + 100)/(100 × 100) + 0.244²/[2(100 + 100)]} = 0.244 ± sqrt(0.020) = 0.244 ± 0.142 = (0.102, 0.386)

Here the 95% CI is wider (0.102 to 0.386 compared with 0.176 to 0.312) when the sample size is smaller, as we would expect.
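Both intervals can be reproduced with a small helper that applies the margin exactly as written above (a sketch; the function name is illustrative).

```python
# Hedged sketch reproducing the interval calculations above, using the margin
# as presented there: ES ± sqrt((NE + NC)/(NE*NC) + ES**2 / (2*(NE + NC))).
import math

def es_interval(es, n_e, n_c):
    margin = math.sqrt((n_e + n_c) / (n_e * n_c) + es ** 2 / (2 * (n_e + n_c)))
    return es - margin, es + margin

for n_e, n_c in [(429, 431), (100, 100)]:
    lo, hi = es_interval(0.244, n_e, n_c)
    print(f"n = ({n_e}, {n_c}): interval = ({lo:.3f}, {hi:.3f})")
# Reproduces roughly (0.176, 0.312) for the TRANSFORMS sample sizes and
# (0.102, 0.386) for 100 per arm, as in the text.
```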
It is reasonable to include the confidence interval in a publication to allow the reader to understand the precision of the effect size estimate, which does depend on the sample size. The tools of meta-analysis have used both of these concepts for decades, and they are often included in applied meta-analysis publications, but not so much in the reporting of trials. The recommendation to use and present effect sizes along with descriptive statistics and confidence intervals is a well-reasoned one. However, just as p-values have limitations and are subject to misuse, belief biases and obfuscation by statistics, the ES is not immune from similar problems. ESs for continuous variables are, as we said, not dependent on the units of measurement and are independent of the sample size, and thus can be compared from one study to another under certain circumstances. Those circumstances are not really different from those of indirect treatment comparisons. Indirect treatment comparisons assume a homogeneity of effects across studies over time, and while randomization ensures comparable treatment groups within a study, there is no guarantee of such balance within treatments across different studies. Thus, while it is tempting to compare effect sizes among studies, the same cautions as for network meta-analyses and indirect treatment comparisons certainly apply. Thus far, we have focused on the ES as defined by Cohen's d. What about measures such as hazard ratios, relative risks, rate ratios, odds ratios, percentages and percent differences? These are all measures of effect size, but for binary outcomes. A relative risk is similar to a hazard ratio, but the relative risk assesses the probability of the outcome event of interest cumulated to a particular point in time in one group compared to the other, and/or perhaps over the entire study. The hazard ratio is the risk of the outcome event in one group compared to the other at any point in time. The odds ratio is an approximation to the relative risk under certain assumptions and is generally a good approximation when the underlying frequency of the outcome is about 10% or less. When the event rate is over 10%, the odds ratio is biased (overestimating the relative risk), which, in effect, overestimates the effect size and/or benefit. Consider the publication of the DEFINE study by Gold et al. (Gold et al., 2012). To their credit, they published both the hazard ratio and the odds ratio (Table 2, page 1103) for patients having a relapse. The percentage of patients with a relapse in the placebo arm over two years was 46%, compared with 27% and 26% for twice-daily and thrice-daily BG-12, respectively. The hazard ratios were 0.51 and 0.50 for the two BG-12 groups relative to placebo, while the odds ratios (OR) were 0.42 and 0.41. Using the hazard ratio, one might say there was nearly a 50% reduction in the risk of a relapse, whereas using the OR, one might be tempted to say it was nearly a 60% reduction. The percentage of events occurring is well over 10%, and thus the odds ratio overestimates the benefit.
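A brief sketch (not taken from the DEFINE publication; computed directly from the quoted proportions) shows how the odds ratio exaggerates the apparent benefit relative to the relative risk when events are common.

```python
# Hedged sketch: relative risk versus odds ratio from two event proportions.
def risk_measures(p_treated, p_control):
    """Relative risk and odds ratio computed from two event proportions."""
    rr = p_treated / p_control
    or_ = (p_treated / (1 - p_treated)) / (p_control / (1 - p_control))
    return rr, or_

# DEFINE relapse proportions over two years: placebo 46%, twice-daily BG-12 27%.
rr, or_ = risk_measures(p_treated=0.27, p_control=0.46)
print(f"RR = {rr:.2f} (about a {1 - rr:.0%} relative reduction)")
print(f"OR = {or_:.2f} (looks like a {1 - or_:.0%} reduction if misread as an RR)")
# With event rates well above 10%, the OR (about 0.43) overstates the benefit
# compared with the RR (about 0.59).
```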
Other pitfalls abound in using purported effect sizes to illustrate results. Drawing data from the same paper by Gold et al., they reported the proportion of participants with 12-week confirmed disability progression as 27% for the placebo arm and 16% and 18% for the two BG-12 groups, respectively. They provide and discuss hazard ratios of 0.62 and 0.66, indicating 38% and 34% reductions in the risk of progression.
However, they could also have discussed the reduction from 27% to 16% in the percentage of occurrences, an absolute reduction of 11 percentage points. Further, they could have talked about the effect size in a way that makes this more impactful and stated that the percent reduction in progression was 41% ((27% − 16%)/27% = 0.407). Clearly, all of these numbers are close and informative, but they require careful reading to understand which effect sizes are being presented and described, and what their implications are. Searches for the etiologies of MS often use p-values and effect sizes to substantiate their findings. For example, in examining season of birth as a risk factor for MS, Rodríguez Cruz et al. (Rodríguez Cruz et al., 2016) found that there was an excess of MS births in the second quarter of the year, with a p-value for the seasonality effect of p < 0.001. While there are a
number of issues with the paper, the authors have nicely presented their findings and what they did. The findings indicated a strong seasonal effect based on p-values. However, the overall odds ratio was 1.0677, or a 6.77% increase in the number of MS births. The authors examined a 62- and a 42-year period of time, over which slightly more than 21,100 cases, or about 42 cases per month (21,138/(42 × 12)), were compared to the general population. Thus, the increase in risk represents about 3 extra MS cases at the peak in April, and the barely mentioned 9.01% lower risk in the November period corresponds to about 4 MS cases averted. This "strong" p-value is met with a very small effect size, perhaps arguing only that it is better to conceive on Valentine's Day than on vacation in August. In the end, articles are written to inform, but also to establish that what one has been doing is worthwhile. Pressure is placed on researchers to be productive. The need to justify that the money they have been given has gone to good use contributes, in part, to the pressure to publish important results, leading to the abuse of the meager p-value. P-values didn't create the terms "almost significant" or "there was a trend toward significance". The p-value is just a function of the summary statistic chosen to represent the results of the investment and, by its very calculation, indicates an a priori binary decision. Using p-values is still OK, but they should be coupled with other measures, such as effect sizes, to consider the value of the work. A p-value alone won't get you through an ethics review board, no matter what its value, and it shouldn't get you through a journal review either.
Ethical publication statement

I confirm that I have read the journal's position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.

References

American Psychological Association, 2009. Publication Manual of the American Psychological Association, 6th ed. American Psychological Association, Washington, DC.

Cohen, J., 1988. Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Erlbaum, Hillsdale, NJ, pp. 18–25.

Cohen, J.A., Barkhof, F., Comi, G., Hartung, H.P., Khatri, B.O., Montalban, X., Pelletier, J., Capra, R., Gallo, P., Izquierdo, G., Tiel-Wilck, K., de Vera, A., Jin, J., Stites, T., Wu, S., Aradhye, S., Kappos, L., TRANSFORMS Study Group, 2010. Oral fingolimod or intramuscular interferon for relapsing multiple sclerosis. N. Engl. J. Med. 362 (5), 402–415. https://doi.org/10.1056/NEJMoa0907839.

Fricker Jr., R.D., Burke, K., Han, X., Woodall, W.H., 2019. Assessing the statistical analyses used in Basic and Applied Social Psychology after their p-value ban. Am. Stat. 73 (S1), 374–384. https://doi.org/10.1080/00031305.2019.1537892.

Gold, R., Kappos, L., Arnold, D.L., Bar-Or, A., Giovannoni, G., Selmaj, K., Tornatore, C., Sweetser, M.T., Yang, M., Sheikh, S.I., Dawson, K.T., DEFINE Study Investigators, 2012. Placebo-controlled phase 3 study of oral BG-12 for relapsing multiple sclerosis. N. Engl. J. Med. 367 (12), 1098–1107. Erratum in: N. Engl. J. Med. 2012;367(24):2362.

Hedges, L.V., Olkin, I., 1985. Statistical Methods for Meta-Analysis. Academic Press, New York.

Kuffner, T.A., Walker, S.G., 2019. Why are p-values controversial? Am. Stat. 73 (1), 1–3. https://doi.org/10.1080/00031305.2016.1277161.

McCool, R., Wilson, K., Arber, M., Fleetwood, K., Toupin, S., Thom, H., Bennett, I., Edwards, S., 2019. Systematic review and network meta-analysis comparing ocrelizumab with other treatments for relapsing multiple sclerosis. Mult. Scler. Relat. Disord. 29, 55–61. https://doi.org/10.1016/j.msard.2018.12.040.

Rodríguez Cruz, P.M., Matthews, L., Boggild, M., Cavey, A., Constantinescu, C.S., Evangelou, N., Giovannoni, G., Gray, O., Hawkins, S., Nicholas, R., Oppenheimer, M., Robertson, N., Zajicek, J., Rothwell, P.M., Palace, J., 2016. Time- and region-specific season of birth effects in multiple sclerosis in the United Kingdom. JAMA Neurol. 73 (8), 954–960. https://doi.org/10.1001/jamaneurol.2016.1463.

Trafimow, D., Marks, M., 2015. Editorial. Basic Appl. Soc. Psych. 37 (1), 1–2.

Wasserstein, R.L., Lazar, N.A., 2016. The ASA statement on p-values: context, process, and purpose. Am. Stat. 70 (2), 129–133. https://doi.org/10.1080/00031305.2016.1154108.