
CHAPTER 10

Hypothesis Testing: The Null Hypothesis, Significance and Type I Error

HYPOTHESES

Statistical inference is often based on a test of significance, “a procedure by which one determines the degree to which collected data are consistent with a specific hypothesis…” (Matthews and Farewell, 1996). Hypotheses may be specific, for example, that the slope relating two variables is 1, the line of identity, or that the mean height is 150 cm. Often the hypothesis is that two (or more) sample statistics could have been drawn from the same population. This is the null hypothesis, or H0. The null hypothesis holds special importance in statistical inference. In many experiments two (or more) samples drawn from a population are given different treatments: two groups of patients with leukemia are given different agents to determine if one is better at prolonging life; a protein kinase inhibitor is given to one group of cells to find out if it decreases the production of an angiogenic factor; two groups of rats are fed different quantities of two types of fat to find out if the relationship of fat intake to serum cholesterol differs for the types of fat. Because of variation it is almost certain that the means or the slopes will differ even if the treatment had zero effect, so the question to be solved is how to calculate the effects of chance on the outcome. If the differences observed are likely to be due to chance, we would conclude that the treatment was unlikely to have been effective (provided the experiment had sufficient power; see Chapter 11). If the differences are unlikely to be due to chance, then we might want to reject the null hypothesis.

How do we test H0? The principles to be discussed are illustrated by referring to sample and population means but apply equally to all other types of statistics. Conventionally today we apply a test of significance, determine a P value, and if the P value is very small we reject the null hypothesis. This is known as null hypothesis significance testing (NHST), but the approach has been controversial since its inception.1

1. If we reject the null hypothesis but it is really correct, we have committed a Type I error, and we would like to keep this low. Conversely, if we accept the null hypothesis when it is incorrect, we have committed a Type II error, which is discussed in the next chapter.


The first person to attempt to determine if the difference between samples was or was not due to random error was Gosset (Ziliak, 2008). This was the impetus for developing the “Student” t-test. He did not, however, think it was appropriate to set a number for the probability of rejecting the null hypothesis. In 1904 he wrote: “Results are only valuable when the amount by which they probably differ from the truth is so small as to be insignificant for the purpose of the experiment. What the odds should be depends:
1. On the degree of accuracy which the nature of the experiment allows, and
2. On the importance of the issues at stake.”

He was followed by Fisher, who recommended calculating a probability, termed p or P, to decide whether a particular difference between two means allowed us to reject the null hypothesis of no difference. A decision of this sort has many implications, so we try to minimize the probability of rejecting the null hypothesis (of no difference) when it is actually correct. He calculated P by a “significance test”; the smaller the value of P, the higher the significance. Fisher originally suggested a critical P value of <0.02, but then settled on P < 0.05 as more practical. As used in this way, P represents the probability of obtaining an experimental mean value that differs from the observed control mean value by as much as, or more than, the difference observed, if the null hypothesis is true. Fisher used the P value to assess the weight of evidence against the null hypothesis, not as an absolute criterion. As proposed by Fisher, the statistical test is only one of the factors allowing us to decide whether or not to reject the null hypothesis. The magnitude of the difference and the plausibility of the hypothesis are other important factors. Fisher thus introduced a degree of subjectivity into the assessment, something quite the opposite of what happens when we slavishly accept a given level of significance.

Fisher wrote: “If p is between 0.1 and 0.9 there is no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be led astray if we draw a conventional line at 0.05 and consider that higher values of χ² indicate a real discrepancy.” Fisher (1956) was here referring to the null hypothesis.2 That he was not rigid in his approach is shown by his statements that “the state of opinion derived from a test of significance is provisional, and capable, not only of confirmation, but of revision” and “A test of significance… is intended to aid the process of learning by observational experience” (Fisher, 1956).2

2. The statement “Nothing is certain except death and taxes” applies here. Consider the numbers from 1 to 100. Draw two sets of 3 numbers, replace them, draw another two sets, and so on, and compare the two sets in each pair by a t-test. You are just as likely to get 1, 2, 3 and 98, 99, 100 as to get 44, 46, 48 and 45, 47, 49, and each combination will be repeated many times. Some P values will be high, others low: 5% of the comparisons will have t values beyond the 5% critical value, 1% beyond the 1% critical value, and 0.1% beyond the 0.1% critical value. The higher the t value and the smaller the P value, the more confidence you will have in rejecting the null hypothesis, but no matter how small P is you can never be certain. That is why Fisher insisted on biological plausibility and on being able to repeat the experiment.
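A minimal simulation of this thought experiment, written in Python with NumPy and SciPy (the trial count and random seed are arbitrary choices, not from the text), shows the behaviour the footnote describes:

```python
# Repeatedly draw two samples of 3 numbers from 1-100 and compare them with
# Student's t-test. Every comparison is a "null" comparison, yet a predictable
# fraction of them crosses any chosen significance threshold.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)       # arbitrary seed for reproducibility
n_trials = 20_000
p_values = np.empty(n_trials)

for i in range(n_trials):
    a = rng.choice(np.arange(1, 101), size=3, replace=False)
    b = rng.choice(np.arange(1, 101), size=3, replace=False)
    p_values[i] = stats.ttest_ind(a, b).pvalue

for alpha in (0.05, 0.01, 0.001):
    print(f"proportion of comparisons with P < {alpha}: {np.mean(p_values < alpha):.4f}")
```

The printed proportions come out close to 5%, 1%, and 0.1% (not exactly, because samples of 3 drawn from a uniform distribution are not normal), even though none of the “significant” results reflects a real difference.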


It is worthy of note that Gosset, one of Fisher’s mentors, disdained the use of a fixed value for significance and advised Fisher against using it. An alternative approach was recommended in 1928 by Neyman and Pearson, who selected in advance the value of α, the long-term risk of falsely rejecting the null hypothesis, usually 0.05 or 0.01, and postulated an alternative hypothesis, HA. (They introduced the term Type I error.) The differences between the two approaches are given in Table 10.1. More detailed but very readable accounts of the differences are given by Goodman (1999b), Lew (2012), and Hubbard and Bayarri (2003). Goodman used the analogy of a legal trial. In Fisher’s approach the accused person (who is actually innocent) would be judged guilty or not guilty with a 5% probability of being incorrectly found guilty. In the Neyman and Pearson approach, only 5% of a large group of such accused people would be incorrectly found guilty.

Table 10.1 Comparing hypotheses

                   Fisher                                     Neyman-Pearson
P value            Indicates weight of evidence against H0    If <0.05, reject H0
Added support      Required                                   Not required
HA                 Unspecified                                Specified
Number of tests    One test only                              Probability applies to several repetitions of the test

There is no certainty, merely an expression of confidence that one might accept the results and move on to the next experiment. I add three more objections.
1. For a given difference between the means, P is a function of sample variability and size. Consequently, two experiments with similar results but different P values are not necessarily in conflict. (See Fig. 4 in the excellent article by Motulsky, 2015.)
2. A given P value, even if low, is no guarantee that a similar P value will be obtained when the experiment is repeated. Cumming (2008) has shown that, for example, a P value of 0.05 has an 80% chance of being anywhere from 0.00008 to 0.44 on subsequent replications of the experiment.
3. The P value required for significance is not a fixed number. If the experiment is trivial and the outcome not important, P < 0.05 might suffice. If, however, the outcome is unexpected and, if true, of great importance, then a much smaller value of P is required. As Carl Sagan stated, “Extraordinary claims require extraordinary evidence.”

There is much confusion in textbooks and practice about the relationship between P and α. For Fisher, P (defined before) is the result of a single experiment, and it is not the same as α as defined by Neyman and Pearson, because α represents the long-term error


determined in advance. α might be set at 0.05, but in a single experiment P might be 0.035; they are obviously not the same. In addition, α is a conditional probability, conditional on the null hypothesis being correct.

John Tukey (1991) wrote: “The worst, i.e., most dangerous feature of “accepting the null hypothesis” is the giving up of explicit uncertainty: the attempt to paint with only the black of perfect equality and the white of demonstrated direction of inequality. Mathematics can sometimes be put in such black-and-white terms, but our knowledge or belief about the external world never can. The black of “accept the null hypothesis” is far too black. It treats “between −101 and +1,” “between −101 and +101,” and “between −1 and +1” all alike, when their practical meanings are often very, very different. The white of demonstrated direction of inequality is too white. On its face, it treats “between +1 and +101,” “between +1 and +3,” and “between +99 and +101” as if they were the same when their practical meaning is quite different.” He continued: “Long ago, Fisher (1926, foot of p. 504) recognized that truly solid knowledge did not come from analyzing a single experiment - even when that gave a confident direction with a very, very small error rate, like one in a million - but rather that solid knowledge came from a demonstrated ability to repeat experiments, each of which showed confident direction at a reasonable error rate, like 5%. This is unhappy for the investigator who would like to settle things once and for all, but consistent with the best accounts we have of scientific method, which emphasize repetition, preferably under varied circumstances.”

An example of this caveat was reported by Crease (2010) in discussing the search for dark matter by astrophysicists. Physicists collect over time the outputs from a detector, displayed in different energy levels (bins). If one particular bin shows an increased number of hits compared with what was expected at that energy level, it may provide evidence for a new particle. In one such experiment, the DAMA/LIBRA experiment at the Gran Sasso National Laboratory in Italy, the excess of energy in a particular bin had a confidence level of 8.2σ. This indicates a very low probability of falsely rejecting the null hypothesis, about 1 in ten million billion. Nevertheless, subsequent experiments failed to confirm this finding.

As a corollary to this, we cannot ignore specific circumstances when assessing the results of a study. Millard (Borenstein et al., 2009) provided an example by citing a report from The Guardian newspaper at http://www.guardian.co.uk/football/2010/jul/12/paul-psychic-octopus-wins-world-cup about an octopus named Paul who correctly predicted the winner of 8 successive World Cup (soccer) matches. Although the probability of this occurring by chance alone was 1/256 = 0.0039, to conclude that the octopus was able to predict the results of a soccer match defies credibility, no matter what the P value is.

Ioannidis (2005) published a provocative article entitled “Why most published research findings are false.” He pointed out that when the prior probability of a


relationship is low (and this is true for many relationships), a value of P < 0.05 for the Type I error α would almost certainly lead to an exaggerated tendency to reject the null hypothesis. From his analysis he set out some important corollaries, including:
1. The smaller the size of the studies conducted in a scientific field, the less likely the research findings are to be true.
2. The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
3. The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Although this last is an ethical and not a statistical matter, the bias is so pervasive (Chapter 2) that we should be cautious in rejecting the null hypothesis based on a borderline value of 0.05 for the Type I error.

Schuemie et al. (2014) introduced an empirical calibration to correct the error and estimated that up to 54% of tests deemed significant at the 0.05 level did not in fact favor rejecting the null hypothesis. The exact proportion of erroneous statistical conclusions is disputed (see the symposium in the journal Biostatistics, volume 15, issue 1, 2014), but it is almost certainly much greater than 5%. This bias was emphasized by Colquhoun (2014), who showed that with the 5% criterion now in vogue the actual proportion of false positive findings is about 36%, and that to keep it below 5% one should require a 3σ deviation from the mean (P < 0.001), not 2σ (P < 0.05) as is now used (see Appendix). A recent recommendation from a large group of statisticians has advised using P < 0.005 as the critical value for rejecting the null hypothesis and has provided additional arguments in favor of this choice (Benjamin et al., 2017). Sterne and Davey Smith (2001) agreed that the smaller the value of P the less the chance of a Type I error, but they (along with many others) regarded 0.05 as unsafe per se and preferred to see a value of P < 0.001 for safety. Fig. 10.1 summarizes their view.

Fig. 10.1 Degree of evidence against null hypothesis. (Based on Sterne, J.A., Davey Smith, G., 2001. Sifting the evidence-what’s wrong with significance tests? BMJ 322, 226–31.)

It is commonplace to see in the Methods section of a scientific publication the phrase: “Statistical significance was set at P < 0.05.” All that this means is that the reader is given the user’s definition of the term “statistical significance,” and little of substance is


provided. It would be more useful to abandon the term “statistical significance” and provide the reader with important information such as the exact P value and the confidence limits of the mean or other relevant parameter. The value of α = 0.05 is arbitrary, and values of 0.01 or 0.10 may be used. The apparent importance of 0.05 and 0.01 was in part due to the lack of computing facilities in the 1930s: statistical tables were difficult and expensive to produce, so initially only the 0.05 and 0.01 values were provided for certain tests (Barnard, 1992). Much of the time, however, we wish to evaluate how much uncertainty there is in rejecting the null hypothesis, and it is preferable to give the exact P value, whether it be 0.036 or 0.12, together with the associated confidence limits, without even mentioning the term “significant” (Colquhoun, 1971). In fact, quotations from statisticians indicate the unpopularity of the current usage of the term “statistical significance”:

…it is better to regard the level of significance as a measure of the strength of the evidence against the null hypothesis rather than showing whether the data are significant or not at a certain level. (Sterne and Davey Smith, 2001)

…the function of significance tests is to prevent you from making a fool of yourself, and not to make unpublishable results publishable. (Colquhoun, 1971)

Statistical “significance” by itself is not a rational basis for action. (Deming, 1943)

In biologic experimental work, for instance, a… common abuse is to use a statistical test to try to “prove” a hypothesis… The scandal is that the “significant” results are published as though they had meaning. (Feller, 1969)

Because of these and other caveats, it is better to treat p values as nothing more than useful exploratory tools or measures of surprise. In any search for new physics, a small p value should only be seen as a first step in the interpretation of the data, to be followed by a serious investigation of an alternative hypothesis. Only by showing that the latter provides a better explanation of the observations than the null hypothesis can one make a convincing case for discovery. (Demortier, 2008)

P values of 0.036, 0.047, or 0.063 are all important pieces of information, and an arbitrary designation of significance may conceal information. As the sports commentator Vin Scully remarked: “Statistics are used much like a drunk uses a lamppost: for support, not illumination.”

A final point to consider was raised by Binder (1963), who wrote “It is surely apparent that anyone who wants to obtain a significant difference enough can obtain one—if his only consideration is obtaining that significant difference.” (Significance can always be attained by a large enough sample size—see Chapter 11.)

In practice, the 0.05 value of P is used in at least two ways. It may be used as a critical value for making a decision (by deductive inference) to accept or reject the null hypothesis, in keeping with the Neyman-Pearson approach. If an investigator is testing 500 chemical


agents for their ability to reduce bacterial growth, there is no doubt about the population mean value, which is the control value of growth. What the investigator seeks are those agents worth pursuing in more detail. Therefore, in a large batch of experiments the 0.05 cutoff allows the investigator to concentrate on the more promising agents, knowing that less than 5% of the time the null hypothesis will be rejected falsely (Type I error). Furthermore, although the null hypothesis might not be true, the difference is unlikely to be important. In this usage there are multiple samples, and no inverse logic is used to determine a population mean. However, in most experiments investigators use P as support for a hypothesis, and the exact value of P does not have the same critical importance.

In summary, those who understand statistical methods and philosophy attach only minor importance to the calculated P values, whereas those who merely make the calculations and do not think about them overestimate their importance. Recently the American Psychological Association wrote: “Estimation based on effect sizes, confidence intervals, and meta-analysis usually provides a more informative analysis of empirical results than does statistical significance testing, which has long been the conventional choice in psychology” (Cumming et al., 2011).

This cautious approach to deciding about statistical significance has even been endorsed by the US Supreme Court (Ziliak, 2011). A pharmaceutical company was sued for failing to warn investors about adverse effects that were not statistically significant. The Supreme Court found on behalf of the plaintiffs. At one point in the judgment Justice Sotomayor wrote: “…argument rests on the premise that statistical significance is the only reliable evidence of causation. This premise is flawed,” and again, “medical professionals and researchers do not limit the data they consider to the results of randomized clinical trials or to statistically significant evidence… The FDA similarly does not limit the evidence it considers for purposes of assessing causation and taking regulatory action to statistically significant data. In assessing the safety risk posed by a product, the FDA considers factors such as ‘strength of the association,’ ‘temporal relationship of product use and the event,’ ‘consistency of findings across available data sources,’ ‘evidence of a dose-response for the effect,’ ‘biologic plausibility,’ ‘seriousness of the event relative to the disease being treated,’ ‘potential to mitigate the risk in the population,’ ‘feasibility of further study using observational or controlled clinical study designs,’ and ‘degree of benefit the product provides, including availability of other therapies.’ …[The FDA] does not apply any single metric for determining when additional inquiry or action is necessary.”

CRITICISM OF NHST

Fisher considered only the null hypothesis, but if the null hypothesis is rejected an alternative hypothesis must be considered, as suggested by Neyman and Pearson. Goodman (1993, 1999a) and Goodman and Royall (1988) discussed the philosophical differences


between these two approaches and emphasized that Fisher’s use of significance testing required confirmation by repeated experiments and biological plausibility. A low P value suggested that it was worthwhile to repeat the study; a high P value suggested that it was not worth repeating. Fisher did not regard the P value as indicating the frequency of error (Type I error) if the experiment were to be repeated, nor did he consider an alternative hypothesis. Now it is true that the single sample on which we base our decision to accept or reject the null hypothesis is either from the control population or else from another population. What the rest of the area under the curve from that point out to infinity represents has no real meaning. Whether the mean of another sample from the population being examined will be close to or far from the mean of the first sample is unknown. The best that we can do is to make a statement of belief that the sample does or does not come from the control population.

Cohen (1994) brought up another problem in the current usage of the P value. He pointed out that what we want from the test is “Given these data, what is the probability that H0 is true?” But as most of us know, what it tells us is “Given that H0 is true, what is the probability of these (or more extreme) data?” Our test gives us P(D | H0) (D = data, H0 = null hypothesis), but what we really want is P(H0 | D). These are not the same. By invoking Bayes’ theorem (Chapter 5),

P(H0 | D) = P(D | H0) × P(H0) / P(D)
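A small numerical sketch makes the distinction concrete; the likelihoods and the 50:50 prior below are invented for illustration and are not taken from the text:

```python
# Illustrative numbers only: how P(H0 | D) follows from Bayes' theorem.
p_d_given_h0 = 0.05   # P(D | H0): probability of data this extreme if H0 is true
p_d_given_ha = 0.50   # P(D | HA): probability of the same data if HA is true
prior_h0 = 0.5        # assumed prior belief that H0 is true
prior_ha = 1 - prior_h0

# P(D) by the law of total probability
p_d = p_d_given_h0 * prior_h0 + p_d_given_ha * prior_ha

# Bayes' theorem: P(H0 | D) = P(D | H0) P(H0) / P(D)
posterior_h0 = p_d_given_h0 * prior_h0 / p_d
print(f"P(H0 | D) = {posterior_h0:.3f}")   # about 0.09, not 0.05
```

Even with these invented numbers, P(H0 | D) is nearly twice the P value, underlining that the two probabilities answer different questions.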

Criticisms of the NHST approach abound. The P value is often used inappropriately. It quantifies the discrepancy between a given data set and the null hypothesis, but it does not give the probability that the null hypothesis is true, nor does it quantify a particular error. Failing to reject the null hypothesis does not mean that the null hypothesis is correct. It is better to deal with the null hypothesis the way that Scottish law deals with trials. There are three possible verdicts:
1. Guilty—the accused did it.
2. Not guilty—the accused did not do it.
3. Not proven—not enough evidence to make a decision.
Similarly, the null hypothesis should be assessed as rejected (very small P value), accepted (large P value, small effect size), or not rejected (P above the critical value and moderate effect size); in the last of these the null hypothesis cannot be rejected but is not accepted. Furthermore, the value of P depends on sample size; the same effect size gives different P values for different sample sizes (a numerical sketch of this follows the lists below). This important subject has been written about simply in several publications (Weinberg, 2001; Sterne and Davey Smith, 2001; Poole, 2001; Moran and Solomon, 2004; Goodman, 2008). Recently the American Statistical Association set out its views on the subject (Wasserstein and Lazar, 2016). Among their conclusions were:


1. P values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
2. Scientific conclusions and business or policy decisions should not be based only on whether a P value passes a specific threshold.
3. A P value, or statistical significance, does not measure the size of an effect or the importance of the result.
4. A P value does not provide a good measure of evidence for a model or hypothesis.
On the other hand, P values (but not the term “significance”) provide information if used carefully. Frost (2014) gave five guidelines for using P:
1. The exact P value matters. The lower the P value, the lower the error rate.
2. Replication matters.
3. The effect size matters.
4. The alternative hypothesis matters.
5. Subject area knowledge matters.
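As a numerical sketch of the earlier point that the same effect size yields different P values at different sample sizes (the mean difference, standard deviation, and sample sizes here are invented for illustration):

```python
# Same observed effect (mean difference 0.5, SD 1 in each group), different
# per-group sample sizes: the two-sample t statistic and its P value change.
import math
from scipy import stats

diff, sd = 0.5, 1.0                            # illustrative summary statistics
for n in (10, 25, 50, 100, 400):               # per-group sample size
    se = sd * math.sqrt(2 / n)                 # standard error of the difference
    t_stat = diff / se
    p = 2 * stats.t.sf(t_stat, df=2 * n - 2)   # two-sided P value
    print(f"n = {n:4d}  t = {t_stat:5.2f}  P = {p:.4g}")
```

The effect size never changes, yet P falls from roughly 0.28 at n = 10 per group to far below 0.001 at n = 400.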

HYPOTHESIS TESTING AND MAXIMUM LIKELIHOOD; THE BAYES FACTOR

One way out of this dilemma is to use confidence limits (Cumming et al., 2012), and another is to use mathematical likelihood, a concept also introduced by Fisher (Goodman, 1993). We can be more comfortable if we can compare our belief that the sample comes from the control population with our belief that it comes from another specified population. Consider Fig. 10.2. The Bayes factor, also known as the likelihood ratio, is the ratio:

Probability of data (null hypothesis) / Probability of data (alternative hypothesis)


Fig. 10.2 Likelihood ratios.


In the upper panel, with the two means close together, the probability of “a” is about 50% of the probability of “b.” The exact probability (the height of the curve at the observed value) is used, not the area beyond the critical value. In the lower panel, “a” is about 16% of “b.” Curve B has a better likelihood of being correct in the lower panel, even though in terms of a Type I error the null hypothesis would not be rejected. The calculations of likelihood are presented in Goodman’s article and in the Appendix.

Many experts recommend using the confidence interval to indicate the range of values compatible with the data. This approach meets the objections made by Tukey and provides a range of values for the investigator to consider; it is better than giving a single value. As Tukey (1969) observed, “The physical sciences have learned much by storing up amounts, not just directions. If, for example, elasticity had been confined to ‘When you pull on it, it gets longer,’ Hooke’s law, the elastic limit, plasticity, and many other important topics could not have appeared.”
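A sketch of this likelihood-ratio calculation in Python; the means, standard deviation, and observed value are made up for illustration, chosen so that the data are roughly half as probable under one curve as under the other, much as in the upper panel:

```python
# Compare the heights of two hypothesized sampling distributions at the
# observed value (the likelihoods), not the tail areas beyond it.
from scipy.stats import norm

mu_null, mu_alt, sd = 0.0, 3.0, 1.5   # hypothetical curves A (null) and B (alternative)
observed = 2.0                        # the observed sample mean

like_null = norm.pdf(observed, loc=mu_null, scale=sd)   # height of curve A at the data
like_alt = norm.pdf(observed, loc=mu_alt, scale=sd)     # height of curve B at the data

print(f"likelihood under H0 : {like_null:.4f}")
print(f"likelihood under HA : {like_alt:.4f}")
print(f"Bayes factor (H0/HA): {like_null / like_alt:.2f}")
```

Here the data are about twice as probable under the alternative as under the null, even though the tail-area P value under the null (about 0.18, two-sided, for these made-up numbers) would not lead to rejection of the null hypothesis.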

APPENDIX

Bayes Factor

We can relate P values to a minimal Bayes’ likelihood, defined as e^(−z²/2) (Table 10.2; Fig. 10.3, modified from Goodman, 1999b).

Fig. 10.3 Relationship between P and Bayes’ likelihood.

Table 10.2 Bayes factors

P value (z score)    Minimum Bayes’ likelihood    Strength of evidence
0.10 (1.64)          0.26                         Weak
0.05 (1.96)          0.15                         Moderate
0.01 (2.58)          0.036                        Moderately strong
0.001 (3.28)         0.005                        Strong


This can be interpreted as showing that when P = 0.05, the null hypothesis gets 15% as much support as does the alternative hypothesis. Therefore the evidence against the null hypothesis is not as strong as it might seem, given the alternative hypothesis. Even if P = 0.01 the null hypothesis is not clearly rejected. For details, see Goodman (1999b).
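A few lines of Python reproduce the minimum Bayes factors in Table 10.2 directly from the formula e^(−z²/2):

```python
# Minimum Bayes factor for the z scores corresponding to common two-sided P values.
import math

for p, z in [(0.10, 1.64), (0.05, 1.96), (0.01, 2.58), (0.001, 3.28)]:
    min_bf = math.exp(-z**2 / 2)
    print(f"P = {p:<5}  z = {z:<4}  minimum Bayes factor = {min_bf:.3f}")
```

The output (0.261, 0.146, 0.036, 0.005) matches the 0.26, 0.15, 0.036, and 0.005 given in Table 10.2.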

Terminology

If the term “statistical significance” has little meaning, how do we convey the idea that the difference between two (or more) groups is not just chance variation? Even if we redefine it, the term “significance” is too tainted and controversial to be retained. One alternative is to use a term that indicates we are referring to a difference likely to be more than chance variation and that suggests rejecting the null hypothesis without specifying a particular level of α; the terms “substantial” and “not due to chance” are appropriate. It is up to the investigator to decide what P value should lead to rejection of the null hypothesis. This level is often set at 0.05, although the arguments presented above suggest that 0.001 is safer.

Type I Error and α

Why can the Type I error be so much greater than α? An example based on Fig. 2 from Colquhoun (2014) provides an explanation.


The assumptions here are that 1000 tests are done to compare control and experimental groups. Of these, 10% or 100 have differences that are large and real (i.e., confirmed by subsequent work), whereas 90% or 900 have differences that are due to chance variation.

Consider the 100 real differences. Because almost all tests can produce false results, assume that these tests are 90% sensitive (Chapter 21); that is, they will identify the real difference in 90 (true positives) and mistakenly fail to reject the null hypothesis in 10 (false negatives). Then consider the 900 chance differences. If α = 0.05, then 5%, or 45, of these tests will reject the null hypothesis; these are false positives. The remaining 95%, or 855, will be correctly identified as unable to reject the null hypothesis; these are true negatives. Note that α = 0.05 is equivalent to a specificity of 95% (see Chapter 21).

Therefore in our 1000 tests there are 45 false positives out of a total of 45 + 90 = 135 positives, an error rate of 33%. The actual results will depend on the prevalence of real differences and on the sensitivity and specificity, but the Type I error will always exceed α. The current usage that expects 5% Type I errors if α = 0.05 applies only to the lower arm of the diagram. In practice, however, we do not know whether we have real or null differences, and so must include the upper arm as well. If α = 0.01, then the Type I error will be 9/99 or 9.1%. If α = 0.001, then the Type I error will be about 1/91, or approximately 1%. If α = 0.0048, the Type I error is about 5%. Colquhoun gives alternative derivations for this error, and his results are similar to those obtained by others using completely different methods.
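The arithmetic of this example can be wrapped in a short function; the prevalence of real effects, the sensitivity, and the α values below are the assumptions made above, and other choices will give other rates:

```python
# False discovery rate among "significant" results: 1000 tests, 10% real
# effects, 90% sensitivity (power), specificity = 1 - alpha.
def false_discovery_rate(n_tests=1000, prevalence=0.10, sensitivity=0.90, alpha=0.05):
    real = n_tests * prevalence          # tests where a real difference exists
    null = n_tests - real                # tests where only chance operates
    true_pos = real * sensitivity        # real differences correctly detected
    false_pos = null * alpha             # null cases wrongly declared significant
    return false_pos / (false_pos + true_pos)

for alpha in (0.05, 0.01, 0.001, 0.0048):
    print(f"alpha = {alpha}: false discovery rate = {false_discovery_rate(alpha=alpha):.1%}")
```

With α = 0.05 this gives 45/(45 + 90) = 33.3%, as in the text; with α = 0.01 and 0.001 it gives about 9.1% and 1%; with these particular assumptions α = 0.0048 gives about 4.6%, close to the 5% quoted above.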

REFERENCES

Barnard, G.A., 1992. Review of Statistical Inference and Analysis: Selected Correspondence of R.A. Fisher by J.T. Bennett. Stat. Sci. 7, 5–12.
Benjamin, D.J., et al., 2017. Redefine statistical significance. PsyArXiv, July 22.
Binder, A., 1963. Further considerations on testing the null hypothesis and the strategy and tactics of investigating theoretical models. Psychol. Rev. 70, 107–115.
Borenstein, M., Hedges, L.V., Higgins, J.P.T., Rothstein, H.R., 2009. Introduction to Meta-Analysis. John Wiley & Sons, Chichester.
Cohen, J., 1994. The earth is round (p < 0.05). Am. Psychol. 49, 997–1003.
Colquhoun, D., 1971. Lectures on Biostatistics. Clarendon Press, Oxford.
Colquhoun, D., 2014. An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1, 140216.
Crease, R.P., 2010. Discovery with statistics. Available: http://physicsworld.com/cws/article/indepth/43309.
Cumming, G., 2008. Replication and p intervals: P values predict the future only vaguely, but confidence intervals do much better. Perspect. Psychol. Sci. 3, 286–300.
Cumming, G., Fidler, F., Kalinowski, P., Lai, J., 2011. The statistical recommendations of the American Psychological Association Publication Manual: effect sizes, confidence intervals, and meta-analysis. Austral. J. Psychol. 177, 7–11.
Cumming, G., Fidler, F., Lai, J., 2012. The statistical recommendations of the American Psychological Association Publication Manual: effect sizes, confidence intervals, and meta-analysis. Austral. J. Psychol. 64, 138–146.
Deming, W.E., 1943. Statistical Adjustment of Data. John Wiley & Sons, New York.
Demortier, L., 2008. P values and nuisance parameters. In: Prosper, H.B., Lyons, L., De Roeck, A. (Eds.), PHYSTAT LHC Workshop on Statistical Issues for LHC Physics, 2007. CERN, Geneva.
Feller, W., 1969. Are life scientists overawed by statistics? Sci. Res. 4, 24–29.
Fisher, R.A., 1956. Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh.
Frost, J., 2014. Five guidelines for using P values. http://blog.minitab.com/blog/adventures-in-statistics-2/five-guidelines-for-using-p-values.
Goodman, S.N., 1993. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am. J. Epidemiol. 137, 485–496 (discussion 497–501).
Goodman, S.N., 1999a. Toward evidence-based medical statistics. 1: The P value fallacy. Ann. Intern. Med. 130, 995–1004.
Goodman, S.N., 1999b. Toward evidence-based medical statistics. 2: The Bayes factor. Ann. Intern. Med. 130, 1005–1013.
Goodman, S., 2008. A dirty dozen: twelve p-value misconceptions. Semin. Hematol. 45, 135–140.
Goodman, S.N., Royall, R., 1988. Evidence and scientific research. Am. J. Public Health 78, 1568–1574.
Hubbard, R., Bayarri, M.J., 2003. P-values are not error probabilities [online]. Available: http://ftp.stat.duke.edu/WorkingPapers/03-26.pdf.
Ioannidis, J.P., 2005. Why most published research findings are false. PLoS Med. 2, e124.
Lew, M.J., 2012. Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don’t know P. Br. J. Pharmacol. 166, 1559–1567.
Matthews, D.E., Farewell, V.T., 1996. Using and Understanding Medical Statistics. Karger, Basel.
Moran, J.L., Solomon, P.J., 2004. A farewell to P-values. Crit. Care Resusc. 6, 130–137.
Motulsky, H.J., 2015. Common misconceptions about data analysis and statistics. Br. J. Pharmacol. 172, 2126–2132.
Poole, C., 2001. Low P-values or narrow confidence intervals: which are more durable? Epidemiology 12, 291–294.
Schuemie, M.J., Ryan, P.B., Dumouchel, W., Suchard, M.A., Madigan, D., 2014. Interpreting observational studies: why empirical calibration is needed to correct p-values. Stat. Med. 33, 209–218.
Sterne, J.A., Davey Smith, G., 2001. Sifting the evidence-what’s wrong with significance tests? BMJ 322, 226–231.
Tukey, J.W., 1969. Analyzing data: sanctification or detective work? Am. Psychol. 24, 83–91.
Tukey, J.W., 1991. The philosophy of multiple comparisons. Stat. Sci. 6, 1.
Wasserstein, R.L., Lazar, N.A., 2016. The ASA’s statement on p-values: context, process, and purpose. Am. Stat. 70, 129–133.
Weinberg, C.R., 2001. It’s time to rehabilitate the P-value. Epidemiology 12, 288–290.
Ziliak, S.T., 2008. Guinnessometrics: the economic foundation of “Student’s” t. J. Econ. Perspect. 22, 199–216.
Ziliak, S.T., 2011. Matrixx v. Siracusano and Student v. Fisher: statistical significance on trial. Significance 8, 131–134.
