TIPS 1335 No. of Pages 2
Letter
Reproducibility Crisis: Are We Ignoring Reaction Norms?

Bernhard Voelkl1,* and Hanno Würbel1

The ‘reproducibility crisis’ in preclinical biomedical research is making headlines at a staggering rate. Spectacular examples of translational failures and poor reproducibility [1,2] have been attributed to various aspects of poor experimental design and conduct, including small sample sizes [3], risks of bias [4,5], selective reporting [6], and publication bias [7]. In a recent review in this journal, Jarvis and Williams [8] drew a more nuanced picture, challenging the increasingly common view that irreproducibility reflects poor research conduct. While we applaud their effort to reset the frame and their suggestions for improving research practice, we feel that they have missed perhaps the most important point: ignorance of phenotypic plasticity in experimental design and analysis.
When studying living organisms we are faced with inherent biological variation that is distinct from random noise or measurement error and that is fundamental to the correct interpretation of experimental results. Biological variation occurs in response to environmental variation. Importantly, the response of an organism to an experimental treatment (e.g., a drug or a stressor) often depends not only on the properties of the treatment but also on the state of the organism, which is as much the product of past and present environmental influences as of the genetic architecture [9]. Such phenotypic plasticity of organisms is a result of gene-by-environment (G × E) interactions and its consequence is the reaction norm of an organism's response (Figure 1A). A mouse kept at 24 °C might respond to a drug treatment differently from a mouse kept at 20 °C. Food and light conditions can matter and so do rearing history, handling, housing conditions, and many other unmeasured and unmeasurable variables. For these reasons alone we should expect results to differ whenever an in vivo experiment is replicated. The magnitude of the expected difference is given by the reaction norm and by the domain of the reaction norm represented by the study populations of the replicate studies. This variation should not be mistaken for random variation, which is a different form of variation on top of phenotypic plasticity [9].

The concept of the reaction norm was introduced by Richard Woltereck more than a hundred years ago, and one would expect it to be well established and taken into account in animal research. However, out of 20 review or opinion articles on the ‘reproducibility crisis’ published in the past 6 years in top tier journals, none mentioned ‘reaction norm’, ‘phenotypic plasticity’, or ‘G × E interaction’ as potential sources of poor reproducibility.

Figure 1. Scheme of a Reaction Norm and Reproducibility Paradox. (A) Scheme of a reaction norm. The expected value E(Y) for a response to a treatment measured in a population of organisms can be influenced by an environmental variable X. The probability density function for E(Y) is a function of the probability with which a certain environment is encountered and the reaction norm. Individual studies conducted under environmental conditions X1 and X2 will show within-study variation for measurements of the response variable Y and allow making study-specific estimates Y1 and Y2. Sampling repeatedly from a range of the environmental parameter according to probability distributions A and B will result in respective distributions for observed values ΣYA and ΣYB. Variation in the expected parameter due to the reaction norm (Vrn) and individual variation (Vind) add up, giving the total observed variation (Vtot). (B) Reproducibility paradox. If a study is conducted under environmental condition X1 and researchers increase sample size and precision of measurements, then this will be reflected in a smaller confidence interval (CIN, blue) compared with a less precise study with a smaller sample size (CIW, red). A replication study performed under environmental condition X2 and precisely estimating the expected value of the response variable (with a narrow confidence interval CIR, grey) would successfully reproduce the imprecise study with the wide CI but fail to reproduce the precise study with the narrow CI.

Adopting a reaction norm perspective is not equivalent to just acknowledging biological variation. It requires a fundamental rethinking of parameter estimation and inference, and it can lead to seemingly paradoxical results. The gist of the reaction norm approach is to abandon the idea of a true population parameter, as it is construed by frequentist statistics. Bayesian statisticians have long emphasised that the parameter to be estimated is itself a random variable with a corresponding uncertainty. This statistical framework reflects the reality of animal experiments very well: each replicate experiment is conducted under slightly different environmental conditions, which means that their results represent different points on the norm of reaction and therefore different parameter values should be expected. However, while the discussion among statisticians as to whether or not a true population parameter exists is sometimes seen as a purely academic exercise, a steep reaction norm for a specific response of an organism means that between-experiment variation in the measured parameter can be substantial. Instead of indicating that a study was biased or underpowered, a failure to reproduce its results might rather indicate that the replication study was probing a different region of the reaction norm.
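The statement in the caption of Figure 1 that Vrn and Vind add up to Vtot can be written out with the law of total variance. The following is a sketch under assumed notation that does not appear in the original: f(X) = E(Y | X) stands for the reaction norm, and Var(Y | X) for the individual variation within a given environment X. The linear form in the second line is purely illustrative; real reaction norms need not be linear.

```latex
% Assumed notation: E(Y | X) is the reaction norm, Var(Y | X) the
% individual variation among animals sharing the same environment X.
\begin{align}
V_{\mathrm{tot}} = \operatorname{Var}(Y)
  &= \underbrace{\operatorname{Var}_X\!\big(\mathrm{E}(Y \mid X)\big)}_{V_{\mathrm{rn}}}
   + \underbrace{\mathrm{E}_X\!\big(\operatorname{Var}(Y \mid X)\big)}_{V_{\mathrm{ind}}} \\
% Illustrative special case: a linear reaction norm E(Y | X) = a + bX
V_{\mathrm{rn}} &= b^{2}\,\operatorname{Var}(X)
\end{align}
```

Under the linear assumption, the contribution of the reaction norm grows with the square of its slope b, which is one way to see why a steep reaction norm can make between-experiment variation substantial even when the environments differ only modestly.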
This is mirrored by the ‘standardisation fallacy’, the erroneous belief that reproducibility can be improved through ever more rigorous standardisation. Because many environmental factors resist standardisation between laboratories [10], animals within laboratories will be more homogeneous than animals between laboratories. Increasingly rigorous standardisation will therefore produce results that are increasingly distinct between laboratories and hence less reproducible [11,12]. Thus, instead of trying to spirit biological variation away through standardisation, researchers should start to embrace it with a view to improving the external validity and hence the reproducibility of their results.
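The effect of tighter within-laboratory standardisation on between-laboratory agreement can be explored with a small simulation. The sketch below is not taken from the letter or its references: the function name `replication_rate`, the normal distributions, the parameter values, and the criterion "the replicate estimate falls inside the original 95% interval" are all assumptions chosen for illustration, and the interval uses a known standard error rather than one estimated from the data.

```python
import numpy as np

def replication_rate(within_sd, between_sd=1.0, n=20, n_pairs=20000, seed=0):
    """Fraction of replicate estimates inside the original study's 95% CI.

    Hypothetical model: each laboratory sits at its own point on the
    reaction norm, so its expected response is drawn from a normal
    distribution with standard deviation `between_sd`. Within a
    laboratory, n animals vary with standard deviation `within_sd`.
    """
    rng = np.random.default_rng(seed)
    se = within_sd / np.sqrt(n)  # standard error of a laboratory's mean
    # Column 0: original laboratory, column 1: replicating laboratory.
    lab_means = rng.normal(0.0, between_sd, size=(n_pairs, 2))
    estimates = lab_means + rng.normal(0.0, se, size=(n_pairs, 2))
    original, replicate = estimates[:, 0], estimates[:, 1]
    half_width = 1.96 * se  # 95% interval, assuming the SE is known
    return float(np.mean(np.abs(replicate - original) <= half_width))

# Stricter standardisation is modelled as a smaller within-laboratory SD.
for sd in (4.0, 1.0, 0.25):
    print(f"within-lab SD {sd:>4}: replication rate {replication_rate(sd):.3f}")
```

In this toy model the replication rate drops as the within-laboratory standard deviation shrinks while the between-laboratory spread stays fixed. Increasing `n` has the same effect, because both changes act only through the standard error of the laboratory mean.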
Adopting a reaction norm perspective also helps to explain what we call the ‘reproducibility paradox’. By focusing on internal validity, rigorous standardisation, and statistical power, researchers obtain effect size estimates with increasingly narrow confidence intervals. As a consequence, the estimates of replicate studies in randomly selected laboratories, where the animals are at other points on the norm of reaction, are increasingly unlikely to fall within the confidence interval of the original study (Figure 1B). In other words, a more precise study conducted at a random point along the reaction norm is less likely to be reproducible than a less precise one.

Acknowledgments
This work was supported by an ERC Advanced Grant REFINE to H.W.

1Animal Welfare Division, Veterinary Public Health Institute, University of Bern, Bern, Switzerland
*Correspondence: [email protected] (B. Voelkl).
http://dx.doi.org/10.1016/j.tips.2016.05.003
Trends in Pharmacological Sciences, Month Year, Vol. xx, No. yy
References
1. Prinz, F. et al. (2011) Believe it or not: how much can we rely on published data on potential drug targets? Nat. Rev. Drug Discov. 10, 328–329
2. Begley, C.G. and Ellis, L. (2012) Drug development: raise standards for preclinical research. Nature 483, 531–533
3. Button, K.S. et al. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376
4. Macleod, M.R. et al. (2015) Risk of bias in reports of in vivo research: a focus for improvement. PLoS Biol. 13, e1002273
5. Bailoo, J.D. et al. (2014) Refinement of experimental design and conduct in laboratory animal research. ILAR J. 55, 383–391
6. Hutton, J.L. and Williamson, P.R. (2000) Bias in meta-analysis with variable selection within studies. Appl. Stat. 49, 359–370
7. Sena, E.S. et al. (2010) Publication bias in reports of animal stroke studies leads to major overstatement of efficacy. PLoS Biol. 8, e1000344
8. Jarvis, M.F. and Williams, M. (2016) Irreproducibility in preclinical biomedical research: perceptions, uncertainties, and knowledge gaps. Trends Pharmacol. Sci. 37, 290–302
9. Schoener, T.W. (2011) The newest synthesis: understanding the interplay of evolutionary and ecological dynamics. Science 331, 426–429
10. Crabbe, J.C. et al. (1999) Genetics of mouse behavior: interactions with laboratory environment. Science 284, 1670–1672
11. Würbel, H. (2000) Behaviour and the standardization fallacy. Nat. Genet. 26, 263
12. Richter, S.H. et al. (2009) Environmental standardization: cure or cause of poor reproducibility in animal experiments? Nat. Methods 6, 257–261