Pragmatic statistical issues in biological research: Introduction to special series

Robert S. Danziger, Michael L. Berbaum

Journal of Molecular and Cellular Cardiology
PII: S0022-2828(18)30084-1; DOI: 10.1016/j.yjmcc.2018.03.013; Reference: YJMCC 8704
Received: 20 February 2018; Accepted: 17 March 2018
Robert S. Danziger, MD, MBA1,* ([email protected]), Michael L. Berbaum, PhD2

1Departments of Medicine, Physiology and Biophysics, and Pharmacology, University of Illinois at Chicago
2Institute of Health Research and Policy, University of Illinois at Chicago
*Corresponding author at: Department of Medicine, University of Illinois at Chicago, 840 S Wood St, Chicago, IL 60612
Abstract: This article introduces a special series on pragmatic statistical issues in biological research.
Keywords: statistics; reproducibility; quantitative analysis; comparisons; power analyses
Statistical methods, "ways of dealing with the collection, analysis, interpretation, presentation and organization of data," may provide one of the most robust means of addressing reproducibility. Within the biosciences, we perceive that amid all this activity there has been limited attention to the frankly statistical aspects of improving reproducibility. In a survey by Nature [1], nearly 90% of responding scientists reported that reproducibility would be improved by "more robust experimental design and better statistics." The purpose of this Series, which will be published over the next 1-2 years, is to address some of the key elements believed to contribute to the lack of reproducibility in the biomedical sciences and, specifically, to focus on statistical questions relevant to the types of studies published in cardiology and cardiovascular research. Topics include:

1. When a single animal provides a series of measurements, how many animals must be tested in treatment and comparison groups?
Typical Question: We are doing a very difficult and expensive experiment. Each animal we study requires several weeks to prepare. We can make multiple measurements on each animal. How do we determine how many animals should be studied and the number of measurements to be made on each one? How do we allow for retention, or its mirror image, dropout, and even death?
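One way to frame the animals-versus-measurements trade-off is through the design effect for correlated measurements. The sketch below assumes a simple exchangeable-correlation model; the intraclass correlation (ICC) value is purely illustrative.

```python
def effective_n(n_animals, m_measurements, icc):
    """Effective sample size under the design effect 1 + (m - 1) * ICC.

    Repeated measurements on one animal are correlated (ICC), so they
    carry less information than measurements on separate animals.
    """
    return n_animals * m_measurements / (1 + (m_measurements - 1) * icc)

# With ICC = 0.5, 40 measurements spread over 20 animals carry more
# information than the same 40 measurements taken on only 10 animals.
print(effective_n(10, 4, 0.5))   # 16.0 effective observations
print(effective_n(20, 2, 0.5))   # ~26.7 effective observations
```

Under this model, adding animals almost always buys more power than adding within-animal measurements; anticipated dropout can be handled by inflating the number of animals accordingly.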
2. When is exclusion of wild or outlying data points permissible, and what statistical precautions avoid misleading results?
Typical Question: We are studying the effect of some drugs on cardiac function in an animal model. The effect is very different in some animals. How do we best handle this statistically?
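A common precaution is to flag, rather than silently delete, extreme points using a robust rule. This sketch uses the median/MAD-based modified z-score; the threshold of 3.5 is a convention, not a law, and the data are invented.

```python
import numpy as np

def mad_outliers(x, thresh=3.5):
    """Flag points whose modified z-score (median/MAD based) exceeds thresh."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))     # median absolute deviation
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > thresh

# Illustrative cardiac-output-like values; the last animal is extreme.
data = [5.1, 4.9, 5.3, 5.0, 5.2, 12.8]
flags = mad_outliers(data)
print(flags)   # only the 12.8 is flagged
```

Flagged points should be reported and, ideally, analyzed both with and without exclusion, so readers can judge the sensitivity of the conclusions.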
3. Appropriate use of statistics in hypothesis testing (null hypothesis testing): One reads apparently strong criticisms of the way statistical testing is conducted and suggestions for revised procedures.

Typical Question: What are the underlying issues, and do these "reforms" have merit? Should I conduct testing (inference)?
4. Straw-man hypothesis testing: superiority versus equivalence. With low power, a difference may never be detected. Nowadays we may want to test for more than differences, e.g., superiority (or inferiority) or an order (monotone, umbrella).
Typical Questions: How are these tests to be conducted statistically, and what are the pros and cons? We have a new intervention (drug) and want to compare it with an existing care regime or drug. How should we conduct such a test?
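Equivalence, as opposed to difference, is commonly tested with two one-sided tests (TOST). The sketch below is a minimal pooled-variance version; the equivalence margin must be chosen on scientific grounds, and the data here are synthetic.

```python
import numpy as np
from scipy import stats

def tost_equivalence(a, b, margin):
    """Two one-sided t-tests of H0: |mean(a) - mean(b)| >= margin.

    Returns the larger one-sided p-value; a small value supports equivalence.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    diff = a.mean() - b.mean()
    sp2 = (((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
           / (na + nb - 2))                      # pooled variance
    se = (sp2 * (1 / na + 1 / nb)) ** 0.5
    df = na + nb - 2
    p_lower = stats.t.sf((diff + margin) / se, df)   # H0: diff <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)  # H0: diff >= +margin
    return max(p_lower, p_upper)

old_drug = np.arange(20) * 0.1          # synthetic responses
new_drug = np.arange(20) * 0.1 + 0.02   # nearly identical responses
p = tost_equivalence(new_drug, old_drug, margin=1.0)
print(p)   # small p-value: equivalent within the chosen margin
```

Note the reversal of roles relative to ordinary testing: here a small p-value supports sameness, so an underpowered study cannot back into a claim of equivalence.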
5. What is learned from extending or repeating the same experiment? When is extending a study legitimate? How can one learn from repeated experiments?
Typical Question: If I am testing for superiority (1-sided) and don't reach significance, can I just infer that there is no difference? Can I combine the results with those of other studies?
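Combining results across repeated experiments is the province of meta-analysis. A minimal inverse-variance fixed-effect pooling, on invented effect sizes, looks like this:

```python
import numpy as np
from scipy import stats

def fixed_effect_meta(effects, ses):
    """Inverse-variance-weighted pooled effect, its SE, and a two-sided p-value."""
    effects = np.asarray(effects, float)
    w = 1.0 / np.asarray(ses, float) ** 2    # weight = 1 / variance
    pooled = np.sum(w * effects) / np.sum(w)
    pooled_se = (1.0 / np.sum(w)) ** 0.5
    p = 2 * stats.norm.sf(abs(pooled / pooled_se))
    return pooled, pooled_se, p

# Three hypothetical repeats, each individually non-significant (p ~ 0.10):
pooled, se, p = fixed_effect_meta([0.5, 0.5, 0.5], [0.3, 0.3, 0.3])
print(pooled, se, p)   # pooled evidence is much stronger than any single study
```

Pooling is legitimate only when the decision to combine does not depend on the results; repeating an experiment until the pooled p-value crosses 0.05 is itself a multiplicity problem.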
6. Multiplicity. How many tests can a single design support? What are the limits on conducting multiple analyses and tests?

Typical Question: Sometimes there are a variety of ways to make comparisons on the same data (parameter); indeed, there are conceptually more comparisons than the data can distinguish (identify). This is termed multiplicity, and there are a number of solutions: multiple comparison techniques. How do I recognize the situation and pick a good technique?
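Among family-wise corrections, Holm's step-down procedure is never less powerful than plain Bonferroni and is just as easy to compute. A sketch with illustrative p-values:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down adjusted p-values (controls the family-wise error rate)."""
    p = np.asarray(pvals, float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, next by m-1, ...; enforce monotonicity.
        running = max(running, (m - rank) * p[idx])
        adj[idx] = min(1.0, running)
    return adj

raw = [0.001, 0.02, 0.04, 0.30]   # four comparisons on the same data
adj = holm_adjust(raw)
print(adj)                        # [0.004, 0.06, 0.08, 0.30]
```

After adjustment, only the first comparison survives at the 0.05 level; the two marginal raw p-values no longer qualify.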
7. Multiple outcomes from the same subjects (or animals). Often a number of aspects of a biological process are measured, and all of these are potentially interacting responses from the same subjects or animals. How should we approach such data statistically (and practically)?

Typical Question: We are trying to determine the effects of a drug on cardiac function and toxicity. We have looked at 26 different parameters, e.g., ejection fraction, diastolic function, blood pressure, etc. We find that two of these parameters are affected by the drug. How do we interpret this?
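A back-of-the-envelope calculation shows why two "significant" parameters out of 26 call for caution: even a drug with no effect at all would be expected to produce some positives by chance.

```python
# If 26 outcomes are each tested at alpha = 0.05 and the drug truly has no
# effect, chance alone produces about 1.3 "significant" results, and (treating
# the tests as independent) at least one false positive is more likely than not.
alpha, k = 0.05, 26
expected_false_positives = alpha * k
prob_at_least_one = 1 - (1 - alpha) ** k
print(expected_false_positives)        # 1.3
print(round(prob_at_least_one, 3))     # roughly 0.74
```

Two hits out of 26 is therefore close to the chance expectation, which is exactly the situation the multiplicity corrections of the previous topic are designed to handle.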
8. When can/should missing data be filled in, and what methods are acceptable?

Typical Question: We are correlating cardiac function with molecular changes in heart tissue. However, for some animals we have measurements only of function, since the assay on the tissue did not work and/or the animal died before the time at which the heart was to be harvested for study. Alternatively, in some animals we were able to do the assay on the heart tissue but do not have the functional data.
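Whatever imputation strategy is eventually chosen, a useful first step is to see how much data a complete-case analysis discards. A numpy sketch on hypothetical values, with NaN marking the failed measurement:

```python
import numpy as np

# Hypothetical data: cardiac function and a tissue assay, with NaN for
# animals where one of the two measurements is missing.
func  = np.array([1.2, 1.5, np.nan, 1.8, 2.1, 1.4])
assay = np.array([0.9, 1.1, 1.3,    np.nan, 1.7, 1.0])

# Complete-case analysis: keep only animals with both measurements.
keep = ~np.isnan(func) & ~np.isnan(assay)
r = np.corrcoef(func[keep], assay[keep])[0, 1]
print(keep.sum(), round(r, 3))   # animals retained, and their correlation
```

Complete-case analysis is unbiased only when data are missing completely at random; when missingness is related to the outcome (e.g., animals dying early), model-based approaches such as multiple imputation are generally preferable.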
9. Heterogeneity among subjects. Randomization makes groups of subjects (animals) equivalent (between groups). But the subjects still differ among themselves within groups. What are worthwhile ways to reduce the impact of such individual differences on results?

Typical Question: We are determining the effect of aortic banding on blood pressure. In each rat, baseline blood pressure is a little different. Do these initial differences matter, and how best can we handle them?
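One standard remedy is to adjust for each animal's baseline in the analysis (ANCOVA-style) rather than ignoring it. A least-squares sketch on fabricated, noise-free numbers so the arithmetic is transparent:

```python
import numpy as np

# Hypothetical: blood pressure after aortic banding, adjusted for each rat's
# baseline via a linear model: post ~ intercept + group + baseline.
baseline = np.array([118.0, 122.0, 125.0, 119.0, 131.0, 127.0])
group    = np.array([0, 0, 0, 1, 1, 1])        # 0 = sham, 1 = banded
post     = baseline + 2.0 + 15.0 * group       # banding adds 15 mmHg here

X = np.column_stack([np.ones_like(baseline), group, baseline])
coef, *_ = np.linalg.lstsq(X, post, rcond=None)
print(round(coef[1], 2))   # adjusted group effect, recovered exactly: 15.0
```

By absorbing baseline variation into the model, the residual variance shrinks and the group effect is estimated with far greater precision than in an unadjusted comparison of post-banding means.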
10. When we actually do the experiments, we find that the requisite "n" to show statistical significance is greater or less than that predicted by our power analysis. How do we interpret this? Was the SD of the response different from that anticipated? Was the distribution of the response different than assumed?

Typical Question: I have performed an experiment ten times and detect near significance (just above P = 0.05, e.g., P = 0.07). Should I keep repeating it? What can I learn from the data?
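The sensitivity of the required n to the assumed SD can be read directly from the standard normal-approximation sample-size formula; the effect size and SDs below are illustrative.

```python
import math
from scipy import stats

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation n per group for a two-sample mean comparison."""
    za = stats.norm.ppf(1 - alpha / 2)   # two-sided alpha
    zb = stats.norm.ppf(power)
    return math.ceil(2 * ((za + zb) * sd / delta) ** 2)

print(n_per_group(delta=10, sd=15))   # SD as anticipated -> 36 per group
print(n_per_group(delta=10, sd=25))   # SD larger than anticipated -> 99
```

An SD 1.67 times larger than planned nearly triples the required n, which is often why the "requisite n" observed in practice disagrees with the power analysis. Repeating an experiment until P drops below 0.05, by contrast, inflates the type I error rather than answering the question.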
11. Parametric versus non-parametric analyses. There are two kinds of statistical tests, parametric and non-parametric. How should we decide which ones to use? What are the trade-offs?

Typical Question: I am measuring the BMI of individuals and trying to determine the relationship of exercise and calories consumed with BMI. What test should I use?
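For a skewed outcome, the parametric and rank-based tests can be run side by side to see how they differ; the data below are simulated, and a real analysis would choose the test before looking at results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.lognormal(mean=0.0, sigma=0.8, size=30)   # skewed control values
b = rng.lognormal(mean=1.0, sigma=0.8, size=30)   # shifted treated values

t_res = stats.ttest_ind(a, b)                               # parametric
u_res = stats.mannwhitneyu(a, b, alternative="two-sided")   # rank-based
print(t_res.pvalue, u_res.pvalue)
```

Rank tests trade a little efficiency under normality for robustness to skew and outliers; transforming the data (e.g., log) and then using a parametric test is a third option worth considering for lognormal-looking measurements.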
12. What are principles and best practices for managing research data?

Typical Question: We use Excel spreadsheets for our data entry and record keeping and to make plots. What could go wrong? What are alternative and better platforms for saving and analyzing data?
13. Oversight and review of data: When there are doubts about the origin of a dataset, i.e., was it "faked?", what tools can help decide the matter rigorously (non-subjectively)?

Typical Questions: My post-doc has just given me a spreadsheet with 526 protein measurements. I really question whether he actually measured all of these or made some up. The graduate student has generated some data points that just look "too good to be true." How can I check if they are fabricated? I wonder if a dataset I was given has been filtered and outliers removed?
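One rigorous screening tool is a terminal-digit analysis: genuinely measured values usually have roughly uniform last digits, whereas invented numbers tend to over-use digits such as 0 and 5. A sketch on deliberately suspicious synthetic data:

```python
import numpy as np
from scipy import stats

def terminal_digit_pvalue(values, decimals=2):
    """Chi-square p-value for uniformity of the last recorded digit (0-9)."""
    v = np.round(np.asarray(values, float) * 10 ** decimals).astype(int)
    observed = np.bincount(v % 10, minlength=10)
    return stats.chisquare(observed).pvalue   # uniform expected counts

# Synthetic "measurements" that all end in 0 or 5 -- a classic red flag.
suspicious = [1.10, 2.25, 3.50, 4.75, 5.20, 6.45, 7.30, 8.55, 9.90, 1.35] * 5
print(terminal_digit_pvalue(suspicious))   # tiny p: digits far from uniform
```

A small p-value here is a prompt for a conversation and an audit of the raw records, not a verdict: legitimate rounding conventions and coarse instruments can also produce non-uniform terminal digits.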
14. Studying dose-response relationships. A change is introduced into a biological system. We would like to determine how much change results in certain features of the system that are of interest. In the case of a drug, this is a dose-response relationship. There are a number of issues to consider:
Typical Questions: How many different doses should be considered? How should we determine the time window for observing the response? Should we observe the response repeatedly, and how should we summarize the time course of the response? Are there multiple kinds of response, perhaps occurring at different time lapses and with different durations?
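Once doses and time windows are fixed, the dose-response relationship itself is commonly summarized by fitting a model such as the Emax (hyperbolic) equation. A sketch on synthetic data; all parameter values here are invented.

```python
import numpy as np
from scipy.optimize import curve_fit

def emax_model(dose, e0, emax, ed50):
    """Baseline-plus-Emax dose-response curve."""
    return e0 + emax * dose / (ed50 + dose)

dose = np.array([0.0, 1.0, 3.0, 10.0, 30.0, 100.0])
true = emax_model(dose, 5.0, 20.0, 8.0)                      # e0, Emax, ED50
resp = true + np.array([0.2, -0.3, 0.1, -0.2, 0.3, -0.1])    # small "noise"

params, _ = curve_fit(emax_model, dose, resp, p0=[4.0, 15.0, 5.0])
print(params)   # approximately [5, 20, 8]: baseline, maximal effect, ED50
```

Fitting the whole curve uses all doses jointly and yields interpretable quantities (maximal effect, ED50) instead of a string of pairwise dose-versus-control tests, which would raise the multiplicity issues discussed above.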
15. Sometimes it seems useful to categorize a continuous response into discrete categories: present versus absent, or low, medium, high. What statistical principles should be followed in this process, and what tools are available to carry out this work?

Typical Question: We are measuring blood pressure to determine if there is a relationship between renal function, as measured by creatinine, and blood pressure. How should we analyze creatinine measurements and blood pressures, i.e., as low, normal, or high versus as continuous variables?
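The main statistical cost of categorizing a continuous measurement is information loss, which a small simulation makes visible; all numbers below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
creatinine = rng.normal(1.0, 0.3, size=200)               # mg/dL-like values
bp = 110 + 20 * creatinine + rng.normal(0, 8, size=200)   # linked blood pressure

# Analyze creatinine on its continuous scale...
r_continuous = np.corrcoef(creatinine, bp)[0, 1]
# ...versus after a median split into "low" and "high".
high = (creatinine > np.median(creatinine)).astype(float)
r_dichotomized = np.corrcoef(high, bp)[0, 1]
print(round(r_continuous, 2), round(r_dichotomized, 2))
```

Dichotomizing at the median typically discards information equivalent to a substantial fraction of the sample; a reasonable practice is to analyze on the continuous scale and reserve categories for presentation or clinical decision rules.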
16. When results cannot be replicated, either materials and techniques are at variance (lack of reproducibility) or there are unknown factors at play. Either way, the investigation will yield worthwhile discoveries.

Truly irreproducible results may provide significant insights. "Replication can increase certainty when findings are reproduced and promote innovation when they are not." If 1) there is adequate reporting of all known variables and methodological details believed to be relevant, and 2) robust and appropriate statistics have been used, then greater insights may be discovered when results are not reproducible [4]. Prominent among these is that there are unappreciated factors or contributory components to the results.
For example, the temperature at which mice are housed was not normally reported in the past, since it was not believed to be a significant variable. However, it has recently been found to be one [5]. Thus, lack of reproducibility of results in mice, especially in cancer models, may be due, at least in part, to lack of control of this variable. In general, the larger the sample size required to demonstrate a significant difference, the greater the number of unappreciated variables. Similarly, 'passenger mutations' [6] and non-uniform antibodies [7] have been discovered to confound results. But it is only with rigorous reporting and statistics that truly irreproducible results may lead to these insights.
The goal of this series is to address commonly encountered statistical issues in the normal contexts in which they arise. Both theoretical background and websites that can perform the analyses discussed will be included in each article. Since this is a series, we encourage readers to recommend topics of interest or current questions in their laboratories.
References

[1] M. Baker, 1,500 scientists lift the lid on reproducibility, Nature 533(7604) (2016) 452-4.
[2] N.A. Vasilevsky, M.H. Brush, H. Paddock, L. Ponting, S.J. Tripathy, G.M. Larocca, M.A. Haendel, On the reproducibility of science: unique identification of research resources in the biomedical literature, PeerJ 1 (2013) e148.
[3] C. Kilkenny, N. Parsons, E. Kadyszewski, M.F. Festing, I.C. Cuthill, D. Fry, J. Hutton, D.G. Altman, Survey of the quality of experimental design, statistical analysis and reporting of research using animals, PLoS One 4(11) (2009) e7824.
[4] A. Ward, T.O. Baldwin, P.B. Antin, Research data: Silver lining to irreproducibility, Nature 532(7598) (2016) 177.
[5] K.M. Kokolus, M.L. Capitano, C.T. Lee, J.W. Eng, J.D. Waight, B.L. Hylander, S. Sexton, C.C. Hong, C.J. Gordon, S.I. Abrams, E.A. Repasky, Baseline tumor growth and immune control in laboratory mice are significantly influenced by subthermoneutral housing temperature, Proc Natl Acad Sci U S A 110(50) (2013) 20176-81.
[6] T. Vanden Berghe, P. Hulpiau, L. Martens, R.E. Vandenbroucke, E. Van Wonterghem, S.W. Perry, I. Bruggeman, T. Divert, S.M. Choi, M. Vuylsteke, V.I. Shestopalov, C. Libert, P. Vandenabeele, Passenger mutations confound interpretation of all genetically modified congenic mice, Immunity 43(1) (2015) 200-9.
[7] M. Baker, Reproducibility crisis: Blame it on the antibodies, Nature 521(7552) (2015) 274-6.