It has not been proven why or that most research findings are false

Medical Hypotheses 113 (2018) 27–29

John C. Ashton, Department of Pharmacology & Toxicology, Otago School of Biomedical Sciences, University of Otago, Dunedin, New Zealand

Keywords: Bias; Experiment; Null hypothesis testing; Research finding; Pre-study odds

Abstract

The claim has been made that it can be proven that most published findings in medical, biological, and allied sciences are false, and that the reason for this can be proven and explained with a mathematical model. It has not, however, been mathematically proven that most research findings are false, and this itself can be proven. The model used in the proof is incoherent and has been falsified. Furthermore, advice to researchers derived from the model is misleading and distracts from more important issues in experimental standards.

Introduction

In a very highly cited paper, Ioannidis [1] argued that intrinsic logical properties of the experimental process lead to very high false positive finding rates. To do this he first assumed a model of scientific investigation based on null hypothesis testing (NHT) and then combined the mathematics of NHT with the concept of pre-study odds (that a hypothesis is true) to produce a model with which to estimate false positive finding rates. This approach was derived from the work of Wacholder et al. [2], who used a very similar model to calculate the rate of false positive hits in microarray analysis. Wacholder et al. [2] applied their model to a finite and tightly enumerable domain defined by the possible outcomes constrained by the dimensions of microarrays. Ioannidis [1], by contrast, applied the model to experimental science as such, where no such enumeration of hypotheses is possible. This leads to a paradox, and as a result his analysis and conclusions are incoherent. Furthermore, they are falsified by data and are scientifically pernicious.

The model is incoherent

The critical parameter in the Ioannidis model is R, the pre-study odds that a hypothesis is true. Ioannidis also investigates the roles of bias and statistical power in his model, but these are minor factors compared with R (see Table 4 in Ioannidis [1]). The central assumption of the model is that R is a real number that can be estimated.
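
The formula at the centre of the model is worth stating explicitly. In the absence of a bias term, Ioannidis [1] gives the positive predictive value of a claimed finding as PPV = (1 − β)R/(R − βR + α), where α is the significance threshold and 1 − β the statistical power. The short sketch below is not from the original paper; it simply evaluates that expression in Python, with illustrative parameter values, to show how strongly PPV depends on R.

```python
def ppv(R, alpha=0.05, power=0.8):
    """Positive predictive value of a 'significant' finding under the
    Ioannidis (2005) model with no bias term:
    PPV = (1 - beta) * R / (R - beta * R + alpha)."""
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)

# PPV collapses as the pre-study odds R shrink, even at constant alpha and power.
for R in (1.0, 0.5, 0.1, 0.01, 0.001, 0.0001):
    print(f"R = {R:<8} PPV = {ppv(R):.4f}")
```

Across plausible ranges of α and power the output is dominated by R, which is the sense in which bias and power are treated as secondary parameters in the argument that follows.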

However, although R can indeed be estimated for a finite system such as the microarrays studied by Wacholder et al. [2], this is not the case for hypotheses across an experimental research field. To estimate R it is necessary first to estimate a denominator defined by the set of possible hypotheses in a field of enquiry (the numerator will consist of the true hypotheses within that set). But what exactly is this set? Is it all conceivable hypotheses? This cannot be correct, because such a set would be infinite. Perhaps, then, it is all hypotheses that are actually investigated? But how is that defined: by all hypotheses that are thought through and written down? Or is it restricted to all those hypotheses that are communicated to others? Or to those that undergo some preliminary experimental investigation, such as the tinkering that goes on prior even to something as formal as a pilot experiment? Or only to hypotheses that are subjected to a fully planned NHT experiment?

One answer to this conundrum is that the set of possible hypotheses is restricted to those that have sufficient explanatory power and logical consistency with other findings in the field (plausibility) to warrant experimental testing. Restricting the set of possible hypotheses to those that are plausible, in the sense of fitting the constraints set by existing knowledge in a research field, reduces the denominator for R, and thus R increases. If this were the end of the matter, it would lead immediately to some interesting results, because some experimental fields are much more tightly constrained in this sense than others. For example, theoretical physics is so highly developed that new hypotheses or theories are subject to very strong logical constraints before even being considered worthy of experimental testing. Einstein’s theory of General Relativity was remarkable even before empirical testing because it passed a very high bar set by the logical constraints of existing findings in the field, so much so that it justified high-cost experimental tests. By contrast, in a field with relatively weak theoretical and empirical constraints, such as social psychology, simple hypotheses consisting of little more than proposed associations between phenomena may be admitted for experimental testing. Furthermore, these may be relatively easily tested by low-cost experiments. Hence R in advanced physics would appear to be far higher than in social psychology.

However, other considerations about the logic of hypotheses lead to the opposite conclusion, and hence to a contradiction at worst, or at best to an unsolved paradox. A hypothesis that is consistent with other findings in a field, and that explains or predicts those findings in addition to making some new (and hence testable) predictions, will be more precise or specific than a hypothesis that does not. Vague hypotheses with relatively indeterminate predictions (such as are the norm in horoscopes, for example) are virtually untestable because they rule out few possible outcomes. The more informative a hypothesis, the less probable it is a priori. A tautology is not empirically informative at all, and has a probability of being true of 1; by contrast, a testable hypothesis rules out a number of conceivable outcomes and so is less probably true. This means that R for the more rigorously constrained hypothesis will be smaller than for a hypothesis in a field with weaker constraints. To use the Einstein example again, the precision of the predictions made by his theory and the universality of its scope mean that it rules out a great number of possible outcomes, so that R becomes vanishingly small. Indeed, several logicians have argued cogently that for any significant causal theory the prior probability of its being true approaches zero [3,4]. Hypotheses with greater explanatory power can therefore be argued to be both more and less probable than weaker hypotheses, and so the concept of pre-study odds is incoherent.
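
The step from informativeness to low prior probability can be made explicit. The following is my own gloss in elementary probability notation, not a formula that appears in [1] or in the works cited above: if a theory T entails a set of non-trivial predictions, its prior probability cannot exceed the probability of their conjunction.

```latex
% T entails p_1, ..., p_n, so P(T) cannot exceed the probability of the conjunction.
% Treating the predictions as independent for illustration, with each risky
% prediction satisfying P(p_i) <= c < 1, the bound shrinks geometrically with n:
\[
  P(T) \;\le\; P\!\left(\textstyle\bigwedge_{i=1}^{n} p_i\right)
       \;=\; \prod_{i=1}^{n} P(p_i) \;\le\; c^{\,n},
  \qquad c = \max_i P(p_i) < 1 .
\]
```

The more a theory forbids, the smaller this bound becomes, which is the sense in which R for the most powerful theories tends toward zero.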

The model is falsified by data

Table 4 in [1] makes explicit predictions about false positive rates in various types of investigation. These take the form of PPVs (positive predictive values; the false positive probability is the complement of the PPV). The predictions for exploratory research are alarming, with false positive rates predicted to be well over 99% even for relatively highly powered experiments with low bias. Hence, irreproducible results are explained by low pre-study odds irrespective of bias or experimental standards. Strangely, the fact that this has not been confirmed by replication studies has received little or no comment. Successful replication rates have been reported that are far in excess of those predicted by the Ioannidis model. In one highly cited study that sought to replicate 100 experimental or correlational studies [5], 36% of the findings tested were reproduced, a far greater rate than the model predicts, and in a field with relatively weak theoretical constraints (psychology). In a study that specifically selected highly unexpected findings to replicate [6], which according to Table 4 in Ioannidis [1] should be reproduced at a rate of 0.1–0.15%, the actual replication rate was 11%, roughly a hundred-fold difference. Hence, the Ioannidis model and its assumptions about pre-study odds have been falsified.
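
As a rough numerical check on the size of this discrepancy, the sketch below re-states the PPV expression from the earlier sketch in an equivalent form and evaluates it at a pre-study odds of 1:10,000 with α = 0.05 and 80% power. These are illustrative choices in the range considered for exploratory research, not the exact Table 4 entries, and a replication rate is only a loose proxy for PPV; the observed rates are the 36% and 11% reported in [5] and [6] and quoted above.

```python
def ppv(R, alpha=0.05, power=0.8):
    """Equivalent form of the Ioannidis (2005) PPV with no bias term:
    PPV = power * R / (power * R + alpha)."""
    return power * R / (power * R + alpha)

predicted = ppv(1 / 10_000)  # illustrative pre-study odds for exploratory research
observed = {
    "Open Science Collaboration [5]": 0.36,  # psychology replications
    "Begley & Ellis [6]": 0.11,              # 'landmark' preclinical findings
}

print(f"Model-predicted PPV at R = 1:10,000: {predicted:.4f}")
for study, rate in observed.items():
    print(f"{study}: observed rate {rate:.2f} (~{rate / predicted:.0f}x the prediction)")
```

Even allowing generously for imperfect replication power and publication practices, gaps of this size are difficult to reconcile with the model's assumptions about pre-study odds.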

The model is pernicious

There are many reasons why experiments are not reproduced. For example, in recent decades many researchers were unaware of the lack of quality control for many commercially produced antibody probes. As a result, papers were published based upon antibodies that were subsequently challenged (e.g. [7]). This was (and is) a clear problem with clear solutions: data sharing on quality control for antibodies among scientists, and greater awareness of potential problems. It also highlights the risks that come with scientific financial bubbles [8], where rapid production of new technologies in highly funded fields may lead to inferior quality control. Publication bias, data manipulation, and both low- and high-level scientific fraud are also definite problems with various solutions. The problems of insufficient checking and rechecking of results (experimental rigour) and of design flaws (such as lack of blinding when outcome measures have an element of subjectivity) are also well known. However, all of these are quite distinct from the core idea in the Ioannidis model, which is the overwhelming influence of pre-study odds on experimental reproducibility. This is pernicious because it casts scepticism on experimental science as such, rather than simply on badly carried out science. This is damaging to science, and such claims for the invalidity of the experimental enterprise have now extended beyond science, with potential consequences for the rationality of our society [9–11].

Making science more rigorous

The focus of the Ioannidis critique is null hypothesis testing. There are problems with NHT that have been widely discussed, such as the widespread confusion of the likelihood of data given a (null) model with the probability that a hypothesis is true given a set of data. A deeper and less discussed problem with NHT lies in its algorithmic nature. It is possible to follow the forms of NHT quite properly, but only to test a trivial hypothesis; this is an example of what Richard Feynman called “cargo cult science” [12]. In this way NHT can become an experimental box-ticking exercise, and thus a smokescreen for poor standards in other areas, such as quality control tests of reagents or the theoretical consistency of the hypothesis [13]. Despite this, NHT has a very clear purpose in experimental research, which is to test the alternative explanation that results are due to chance. NHT is not perfect for this, and other statistical approaches are hotly debated [14], but the problem that NHT addresses is very important and the need for NHT, or something like it, is very real. It is therefore extremely important to understand the limited role of NHT in science (testing for chance) and not to confuse it with the whole of experimental science.

A highly testable theory is one that rules out a large number of possible experimental outcomes. As discussed above, any attempt to reduce this to pre-study odds runs into a paradox, but all this means is that the application of probabilities to causal theories is misguided [15]. Attempts at experimental testing are properly aimed at such theories, not only at simple hypotheses about the outcomes of the particular experiments that such theories predict. Highly testable theories that over time pass tests and explain or solve research problems are thereby corroborated and considered to be provisionally true and scientifically significant. Ioannidis [1] acknowledges that negative results for “major concepts” are more informative than those for narrow questions, but does not recognise how this is related to testability as such. A greater acknowledgement of this, and of the limits on the role and scope of NHT in the scientific enterprise, could do much to correct the identification of NHT with the whole of experimental science, the analysis of NHT in terms of the incorrect Ioannidis model, and the crisis of public trust in science [9–11].

Nevertheless, there is a relatively straightforward way of making even the results of individual NHTs more reproducible, and that is to set stringent standards for what are considered to be meaningful effect sizes (and not just significant P values). Proposals based on reducing the value of P that is considered significant [16], in the absence of any consideration of effect sizes, run into a paradox discussed by Lindley [17] and Meehl [18], wherein increasing sample sizes can increase the chance of finding significant P values but for tiny and meaningless effect sizes. The importance of meaningful effect sizes has been borne out by replication studies: the Open Science Collaboration [5] found that replication success was better predicted by the strength of the original associations than by researcher status. Alternatively, for original studies, meaningful effect sizes can be set in advance of the experiment on theoretical and/or practical grounds [19].
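
The Lindley–Meehl point can be made concrete with a minimal numerical sketch (my own illustration, using a two-sample z-test with known unit variance and only standard-library functions): as the sample size grows, a true difference far too small to matter produces an arbitrarily impressive P value.

```python
import math

def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

effect_sd = 0.02  # a true difference of 0.02 standard deviations: practically negligible
for n_per_group in (100, 10_000, 1_000_000):
    z = effect_sd * math.sqrt(n_per_group / 2)  # two-sample z statistic
    print(f"n per group = {n_per_group:>9}: z = {z:5.2f}, two-sided P = {two_sided_p(z):.1e}")
```

At a million observations per group the result is 'significant' at any conventional or tightened threshold, yet the effect remains negligible, which is why stringent standards for meaningful effect sizes matter at least as much as smaller P value cut-offs.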

Refutational strength as a guiding principle for experimental research

Powerful experiments are those that are not only well constructed on experimental design principles, but that also test predictions that follow directly and logically from the hypothesis being tested. When this is not the case, loosely associated experiments can be performed without strongly testing the hypothesis. For example, a theory might predict that drugs targeting protein X can improve life expectancy in patients with disease Y. A well performed clinical trial with hard outcomes is a very direct test of this hypothesis.

However, the predicted result for some outcome measure in a cellular or animal model is a much less direct, and thus a weaker, test. This is because a greater number of auxiliary hypotheses are assumed in the latter case: for instance, that the outcome measure adequately predicts a corresponding outcome measure in patients, that the model adequately captures the human disease, that the time course of the study in the model system is relevant to the human disease, and so forth. A negative result in the well carried out clinical trial is a strong refutation of the hypothesis. However, a negative result in one of the model systems is only a weak refutation, because the hypothesis may be true but one or more of the auxiliary hypotheses incorrect. Although in the philosophy of science this (the so-called Duhem-Quine thesis) is discussed as a problem with falsification as a scientific principle as such [20], for working scientists it can be exploited as a guide to designing rigorous experiments. Good experiments are therefore those that (i) test predictions about clearly defined effect sizes that are meaningful with respect to a causal/mechanistic or practical hypothesis, and (ii) make predictions about the experimental outcome that follow with minimal auxiliary assumptions from the primary hypothesis under investigation. Where the second criterion cannot be met, the auxiliary assumptions require stringent testing before a tested hypothesis can be considered a research finding. If these principles were widely followed, and P values were considered only one aspect of the experimental process, then research findings would become more readily reproducible.
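
The role of auxiliary assumptions can be given a simple quantitative gloss (an illustrative calculation of my own, not an analysis from the paper): if a prediction bears on the primary hypothesis only via k auxiliary assumptions, each of which independently holds with probability q, then the chance that all the auxiliaries hold, which is the only case in which a negative result bears squarely on the hypothesis itself, is q^k.

```python
# Probability that all k auxiliary assumptions hold, assuming each holds
# independently with probability q; only then does a negative result reflect
# on the primary hypothesis rather than on the experimental setup.
for q in (0.95, 0.90, 0.80):
    for k in (1, 3, 5, 10):
        print(f"q = {q:.2f}, k = {k:>2}: P(all auxiliaries hold) = {q ** k:.2f}")
```

A hard-outcome clinical trial corresponds to small k; a cell-culture or animal surrogate corresponds to larger k (and often smaller q), which is the sense in which its negative result is the weaker refutation.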

Conclusion

The focus on pre-study probability in the Ioannidis model is destructive. It has long been known that researcher integrity, experimenter blinding, adequate detection methods relative to effect sizes, and so on are critical for good science [21]. However, Ioannidis [1] instead emphasizes the importance of improving pre-study odds, the very parameter in his model that is incoherent. The focus on pre-study probability reduces the importance of experimental standards to trivial status. If even well carried out experiments on novel findings are expected to be correct only about 0.1% of the time, it is hard to see why any rational society would fund such a disastrous rate of return on investment. Little wonder that various journalists have questioned the entire experimental science enterprise [9–11]. But this need not be the case: the Ioannidis model is incorrect and its central assumption incoherent. Reproducibility can be improved by continual re-attention to experimental standards, and by a more sophisticated view of what constitutes a research finding. But it cannot be improved if the experimental process itself is thought to be broken. At best, the Ioannidis analysis rediscovers some of the paradoxes of inductivist science and the difficulties of applying probability theory to questions of truth, and brings renewed attention to improving standards. At worst, its emphasis on pre-study odds is destructive to science itself.

Conflict of interest

The author has no conflict of interest to declare.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.mehy.2018.02.004.

References

[1] Ioannidis JP. Why most published research findings are false. PLoS Med 2005;2:e124.
[2] Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004;96:434–42.
[3] Carnap R. Logical foundations of probability. Chicago: University of Chicago Press; 1950.
[4] Popper KR. The logic of scientific discovery. New York: Basic Books; 1959.
[5] Open Science Collaboration. Estimating the reproducibility of psychological science. Science 2015;349:aac4716.
[6] Begley CG, Ellis LM. Drug development: raise standards for preclinical cancer research. Nature 2012;483:531–3.
[7] Grimsey NL, Goodfellow CE, Scotter EL, Dowie MJ, Glass M, et al. Specific detection of CB1 receptors; cannabinoid CB1 receptor antibodies are not all created equal! J Neurosci Methods 2008;171:78–86.
[8] Hendricks VF. Scientific research bubble – neuroscience risks being the next one. Science 2014;20.
[9] Freedman DH. Lies, damned lies, and medical science. Atlantic 2010;306:76–84.
[10] Lehrer J. The truth wears off. New Yorker 2010;13:229.
[11] Sarewitz D. Saving science. New Atlantis 2016;49:4–40.
[12] Feynman RP. Cargo cult science. Eng Sci 1974;37:10–3.
[13] Deutsch D. The beginning of infinity: explanations that transform the world. Penguin UK; 2011.
[14] Bayarri MJ, Berger JO. The interplay of Bayesian and frequentist analysis. Stat Sci 2004:58–80.
[15] Popper K. Realism and the aim of science: from the postscript to the logic of scientific discovery. Routledge; 2013.
[16] Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, et al. Redefine statistical significance. Nat Human Behav 2017;1.
[17] Lindley DV. A statistical paradox. Biometrika 1957;44:187–92.
[18] Meehl PE. Theory-testing in psychology and physics: a methodological paradox. Philos Sci 1967;34:103–15.
[19] Ashton JC. Experimental power comes from powerful theories – the real problem in null hypothesis testing. Nat Rev Neurosci 2013;14:585.
[20] Harding S. Can theories be refuted? Essays on the Duhem-Quine thesis. Springer Science & Business Media; 2012.
[21] Langmuir I. Pathological science. Res-Technol Manage 1989;32:11–7.
