Editorial
Editor’s Note
In this issue of JFAS we present a recent Letter to the Editor, and my reply to that letter, with the aim of expressing some opinions related to the important concepts of statistical and clinical significance, and to point out the importance of analytically considering the content of published reports. We hope that our readers find this interesting and informative.
D. Scot Malay, DPM, MSCE, FACFAS
Editor
The Journal of Foot & Ankle Surgery®
1067-2516/$ - see front matter © 2016 by the American College of Foot and Ankle Surgeons. All rights reserved. http://dx.doi.org/10.1053/j.jfas.2016.07.022
Editorial / The Journal of Foot & Ankle Surgery 55 (2016) 903–905
Clinical Importance Versus Statistical Significance, and Correcting the Scientific Literature
To the Editor:
Although I do want to first congratulate the authors for undertaking the type of high level of evidence clinical trial that is often lacking within our literature, I also did want to respectfully question, and ask for clarification on, their definition and use of the term clinically significant [Jay RM, Malay DS, Landsman AS, Jennato N, Huish J, Younger M. Dual-component intramedullary implant versus Kirschner wire for proximal interphalangeal joint fusion: a randomized controlled clinical trial. J Foot Ankle Surg 2016;55:697–708]. In my opinion this was a well-written manuscript, which clearly outlined the primary and secondary outcomes of interest, in addition to the statistical plan. I appreciate that the authors did explicitly define in the “Patients and Methods” section what they considered to be a clinically significant finding of 10%, albeit without reference or justification. And I also appreciate and agree with their conversation to begin the “Discussion” section on the differences between a statistically significant finding and a clinically significant one. This is an important topic that I often discuss with my students and residents, and one which I think all critical readers should be familiar with.
My concern is that the authors concluded and defined some of their results as having clinical significance after first reporting no statistical significance. This specifically occurred for the 1-week time point in favor of the intramedullary implant for the Bristol Foot Score, and for the 1-week, 3-week, 3-month, and 6-month time points in favor of the intramedullary implant for the Foot Function Index. In the discussion, the authors also concluded this for the 3-month time point for the Bristol Foot Score, but their data in Table 7 actually indicates that the Kirschner wire performed clinically significantly better at this time point based on their definition.
The concern I have is two-fold. First, it has been my understanding that an author or critical reader should evaluate to see if a specific result is clinically significant only after first establishing statistical significance. A good example of this would be an investigation into a specific type of osteotomy for correction of the hallux abductovalgus deformity. An osteotomy that decreases the first intermetatarsal angle from 19° preoperatively to 13° postoperatively could likely easily demonstrate statistical significance with a reasonable sample size, but most people would agree that this finding would lack clinical significance as the postoperative result remains in an abnormal range. In contrast, an osteotomy that decreases the first intermetatarsal angle from 19° to 4° would likely achieve both statistical and clinical significance. But if we make a determination of clinical significance completely independent of statistical significance, then it utterly undermines the statistical calculation! There would almost be no point in calculating any comparative statistical test in any paper if authors could simply conclude that their findings were clinically significant. Clinical significance is primarily a subjective determination, and one of the reasons why statistical significance in the medical literature is important is because it can decrease this subjectivity. I agree with the authors that statistical significance has inherent limitations and is likely overvalued within our current evidence-based medicine paradigm, but it is the best way we have to objectify our investigations. I do not think that statistical significance should represent the end of a discussion of a finding, but rather initiate the beginning of one.
My second concern is that this was an investigation into, and partially funded by, a specific commercial product. Because the authors concluded that they observed “significantly better” outcomes with the intramedullary device, whether clinical or statistical, I would fear the potential for industry to blur the lines between these 2 in subsequent advertisements. I respect and trust the physician scientists who completed this investigation, but unfortunately I can’t say with confidence that I feel the same way about most commercial industries that try to sell me products. I frankly worry about the precedent set by this conclusion.
The intention of this correspondence is not to be critical of the authors (in fact, I am standing on their shoulders in my current position) but instead to hopefully start a conversation within our profession about the inherent strengths and weaknesses of both clinical and statistical significance. In this way, we can push forward the progress of foot and ankle science.
Andrew J. Meyr, DPM, FACFAS
Clinical Associate Professor
Temple University School of Podiatric Medicine
Philadelphia, Pennsylvania
Program Director
Temple University Hospital Podiatric Surgical Residency Program
Philadelphia, Pennsylvania
Clinical Significance, Clarified
Reply:
Once again, Dr. Meyr has raised some very good questions. In essence, he is asking what it means to say that something is clinically significant, and how does clinical significance compare with statistical significance? This is an important question for all of us to consider, and it is one that I often think about when I edit submissions to JFAS and when I consider theories of statistical inference. It is for this precise reason that we, in our report of the dual-component intramedullary fixation device for proximal interphalangeal arthrodesis, decided to even point out the distinction between clinical and statistical significance. This has been a concern for statisticians over the past century (1), and we
have discussed the matter in this journal in previous articles (2,3). The question arises because statistical and scientific reasoning don’t always directly overlap, and both can be used incorrectly if we aren’t careful. I think it’s important for us to keep both statistical and scientific reasoning (including effect size) in mind, and to bring the two together as closely as we can. In fact, I believe that the p value may be relied upon too heavily when it comes to trying to determine what is and what is not true, and what is or is not clinically important. In other words, if the test is statistically significant, we far too often conclude that the difference we are seeing is clinically significant; if the test is not statistically significant, we far too often conclude that there is no difference and end our investigation right there (4). In our journal club sessions, I always stress the importance of analytically comparing proportional differences in the sizes of effects along with the p values related to such comparisons when we inspect the results of a report. Large proportional differences, even if they lack statistical significance, might warrant further consideration because they might be clinically important. Surgeons, investigators, and users of the medical literature need to keep in mind that, when all that we consider is the p value, all that we focus on is the probability of a result (the summary of the data in question), or any more extreme result, given a certain hypothesis (the null, non-inferiority, or something else). But, to a clinician or any other scientist, the probability of the data given the null is not as interesting as the scientific credibility of the conclusions based on the results. Interestingly, the probability doesn’t actually speak to the credibility of an investigator’s conclusions, and it is not a part of a mathematical formula that measures credibility (plausibility or believability of the result). 
When all we do is focus on the p value, the base questions become whether the data are likely given the null hypothesis, or whether the observed result occurred by chance, and these are misguided queries since what is really important to us is the scientific credibility of the result and whether or not the null itself is likely and our conclusions valid. If more papers that claimed statistical significance were truly clinically important, then I think the results of more clinical investigations would be reproducible. This concept has even been put forth by the American Statistical Association in its statement pertaining to the context, process, and purpose of p values (5). As it is, more often than not, re-evaluation of a clinical investigation is not undertaken, and even when it is, previously reported results are not reproducible. I have been lobbying for reproducibility a lot lately. The p value, in and of itself, doesn’t measure the credibility of a result, even though many investigators, sponsors, and readers suffer under this illusion. The extension of this misconception is that the p value is the dividing line between scientifically justified and unjustified work. Statistical significance is all about the likelihood of a chance finding that won’t hold up in future efforts to replicate an experiment or clinical investigation. The p value that we typically employ for intercomparison of sample data was popularized by a number of statisticians, including Galton (6), Edgeworth (7), Gosset (8), and Fisher (9), and the history of the concept has been clearly and concisely summarized by Stigler (10). The p value is based on verifiable frequencies of repeatable events and the conventional willingness to be incorrect just once in 20 estimates (the 5%, or p ≤ .05, level).
A statistically significant result, as indicated by the p value, merely means that the result warrants further attention and follow up, and it doesn’t necessarily mean that something was substantially different (better or worse than), or practically or clinically important, or even true. Computation of the probability
doesn’t obviate the need for scientific reasoning, with attention to measurement, prior evidence (like Thomas Bayes’ probability of an event based on measured prior probability (11)), limitations due to biases associated with experimental design, understanding the biological or pathological processes in question, spontaneous blunders, and the credibility of the presumptions used to inform the statistical analyses. In fact, repetition of studies and results is needed, and that requires more work, which is usually difficult to carry out and expensive in terms of time and money. Repeating an experiment or clinical investigation is a far cry from the all too prevalent practice of making a claim based on a single demonstration of statistical significance. Nonetheless, large proportional differences in effect sizes, in my opinion, should stimulate further interest and, perhaps, further investigation, even in the absence of statistical significance. So, what makes a reasonable surgeon think a result is important enough to change practice, or to at least reconsider the comparison (try to repeat the experiment)? The answer to this question, in my opinion, is clinical significance, or the difference that would make a clinician consider changing practice, or an investigator try to reproduce the work. Statistical significance does not directly tell us the magnitude of a difference, whereas proportional differences between outcomes provide some evidence of its size. Assuming that the investigation had reasonable control of biases and an adequate sample size, substantial proportional differences should warrant clinical interest, and perhaps attempts at reproduction.
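To illustrate the point that the p value does not convey the size of a difference, the following minimal sketch holds a 10% proportional difference fixed and varies only the sample size. Everything in it is an assumption made for illustration: the scores (33 versus 30), the common standard deviation of 10, the group sizes, the function names, and the use of a simple two-sample z test with a known standard deviation. None of it is taken from the hammertoe trial.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_a, mean_b, sd, n_per_group):
    """Two-sided p value from a two-sample z test with a common, known SD."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = (mean_a - mean_b) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def percent_difference(mean_a, mean_b):
    """Difference of mean_a from mean_b, as a percentage of mean_b."""
    return 100.0 * (mean_a - mean_b) / mean_b


# Hypothetical outcome scores: the same 10% difference in every scenario.
mean_a, mean_b, sd = 33.0, 30.0, 10.0

for n in (20, 100, 500):
    p = two_sided_p(mean_a, mean_b, sd, n)
    diff = percent_difference(mean_a, mean_b)
    print(f"n per group = {n:3d}: difference = {diff:.1f}%, p = {p:.4f}")
```

Under these assumptions the proportional difference is identical in all three runs, yet the p value falls on either side of the conventional .05 level depending only on how many patients were enrolled, which is why the two quantities are best read together.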
I think that most of us probably would not alter therapy if the results were not statistically significantly different, assuming that the experimental or investigational methodology was reasonably unbiased, since the result could have been a spurious observation. However, lack of a statistically significant result doesn’t mean that we shouldn’t consider re-evaluating the comparison (say, between interventions or diagnostic tests), if the sample size was substantial and the difference between the effect sizes or outcomes was also substantial (10% to 15%). The 10% to 15% minimum difference for a proportional comparison to warrant further consideration is a generally accepted guideline (12,13). The magnitude of a proportional difference may or may not be considered important, depending on the outcome (clinically satisfied versus dissatisfied, short versus long recovery period, limb salvage versus amputation, life or death). In other words, the meaning of an investigator’s results needs to be considered in the context of existing knowledge, experimental design (limitation of biases), and the potential usefulness of the conclusions derived from the results. Even a small difference may be considered important if the outcome influences morbidity and mortality or if the intervention is extremely costly, whereas a large difference may be required for most clinicians to consider a difference between less serious (less important) outcomes. This really comes down to a question of clinical judgment, and that takes into consideration many factors, including what our patients are willing to do and what makes them subjectively happy or less aggravated, side effects, quality of life years, costs, and the like. Using a value like the “number needed to treat” or describing a precise confidence interval about an estimate might help us better appreciate the clinical meaning of a result, but it still comes down to our clinical judgment and the magnitude of the difference. 
All we were trying to say in the hammertoe paper was that we felt that differences ranging from 10% to 50% in subjective patient-oriented outcomes warranted further consideration.
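The arithmetic behind a proportional difference can be sketched as follows, using the 3-month Bristol Foot Score means quoted in this reply (30.81 for the K-wire and 33.94 for the intramedullary device) and the 10% cutoff described in the report. The function names are hypothetical, and taking the K-wire mean as the denominator is an assumption; it is the choice that reproduces the 10.16% figure given in this reply.

```python
def proportional_difference(reference, comparison):
    """Percentage by which `comparison` differs from `reference`."""
    return 100.0 * (comparison - reference) / reference


def warrants_consideration(reference, comparison, threshold_pct=10.0):
    """True when the absolute proportional difference meets the cutoff."""
    return abs(proportional_difference(reference, comparison)) >= threshold_pct


# Mean 3-month Bristol Foot Scores quoted in the reply (lower is better).
k_wire_bfs = 30.81
implant_bfs = 33.94

diff = proportional_difference(k_wire_bfs, implant_bfs)
print(f"Proportional difference: {diff:.2f}%")  # prints 10.16%
print("Meets 10% cutoff:", warrants_consideration(k_wire_bfs, implant_bfs))
```

Had the device mean been used as the denominator instead, the same 3.13-point gap would work out to about 9.2%, so the choice of reference group should be stated whenever a fixed cutoff is applied.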
In response to Dr. Meyr’s comments regarding our description of the Bristol Foot Score (BFS) results at 3 months postoperative, my review of the manuscript, and then the original data and our results, showed that he is indeed correct, and that although he meant to direct our attention to Table 6 (and not Table 7, as noted in his letter), the BFS for the K-wire was 30.81 (28.10 to 33.41), whereas that for the dual-component intramedullary device was 33.94 (29.46 to 38.43), for a probability given the null hypothesis of 0.4169 and a proportional difference of 10.16%. Because a lower BFS indicates better foot-related quality of life, the K-wire results were, at the 3-month follow up visit, marginally clinically significantly different (perhaps better), based on our use of 10% as the minimum cutoff for considering a difference clinically important in regard to foot-related quality of life and suggesting to surgeons that the relationship warrants further consideration. We are pleased to correct the public record, and have asked our publisher to publish an appropriate erratum to make this clear. Dr. Meyr’s observation also makes clear the fact that readers are our ultimate peer reviewers, and widespread, analytic consideration of a written report can pick up such errors, even after a manuscript has gone through peer review and editorial scrutiny. As a coauthor of the report in question, I deferred all editorial consideration and decision making to another editor who selected peers to comment on the report and made recommendations for revision and, ultimately, publication. As an author of the report, I take full responsibility for the error, and appreciate the fact that Dr. Meyr pointed it out so that we can correct the public record. Finally, in regard to Dr. Meyr’s comments related to sponsorship and the threat of bias, this is a real concern that all readers should consider. 
It is for this reason that we require our authors, peer reviewers, and editors to disclose financial support and real or perceived conflicts of interest. There is no doubt that sponsorship can impart biases, especially when the investigators are paid to carry out a particular study. In regard to the dual-component, intramedullary fusion device used in our study, 3 of the authors, namely Drs. Jay and Landsman and myself, gave this concern a great deal of consideration when we designed the study. In particular, we used a number of bias-reducing methods: we used random treatment allocation to minimize selection biases, compared use of the intramedullary device to use of the standard K-wire, employed health measurement instruments known to produce valid information (BFS, Foot Function Index, and pain on a 10-cm visual analog scale), measured outcomes that most reasonable clinicians would consider to be clinically important (pain, radiographic findings, foot-related quality of life), and employed the patients and unbiased assessors to produce and measure the outcomes. Furthermore, the statistician (me) abstained from including any patients upon whom I had operated in the investigation. Still further, by contract, the clinical investigators were given sole control over the design and execution of the investigation, management and analysis of the data, and production and publication of the report. Although these methods do not guarantee that bias did not infiltrate our investigation and influence our conclusions, they were aimed at minimizing biases and were fully disclosed in the report. My review of the peer reviewers’ and editor’s comments that led to 2
post-submission revisions of our report also shows strong concern for bias reduction and disclosure of potential conflicts of interest. In this day and age, when institutional and society funding for clinical research is limited and difficult to procure, and drug and device company interest and funding is available for investigators with legitimate clinical questions related to the therapies that they employ, designing a study and seeking funding from a manufacturer is commonplace. In my experience, moreover, commercial funding is very difficult to procure when the investigational protocol employs scientific methods that limit biases. In regard to the study in question, negotiations with the sponsor were funneled through the principal investigator (RMJ), and while complicated in regard to certain issues, the sponsor remained true to the terms of our contract and, from my perspective, seemed genuinely interested in letting us conduct the investigation without restrictions or interference. And even though I was aware that the sponsor believed in their implant, at no time did I feel like they were pressuring us relative to experimental design, measurement of the outcomes, or interpretation of the results.
Once again, I appreciate the opportunity to reply to Dr. Meyr’s Letter to the Editor and the critical points that he raised. It is my hope that all readers analytically consider the information published in JFAS, and that authors and readers alike carefully consider the relationships between clinical and statistical significance, and use such results to determine whether or not research findings warrant further attention.
D. Scot Malay, DPM, MSCE, FACFAS
Editor
The Journal of Foot & Ankle Surgery®
References
1. Boring EG. Mathematical vs. scientific significance. Psychol Bull 16:335–338, 1919.
2. Jupiter DC. Mind your p values. J Foot Ankle Surg 52:138–139, 2012.
3. Malay DS. Some thoughts about data type, distribution, and statistical significance. J Foot Ankle Surg 45:357–359, 2006.
4. Jupiter DC. Turning a negative into a positive. J Foot Ankle Surg 52:556–557, 2013.
5. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. Am Stat 70:129–133, 2016.
6. Galton F. Statistics by intercomparison, with remarks on the law of frequency of error. Phil Mag, 4th series 49:33–46, 1875.
7. Edgeworth FY. On methods of ascertaining variations in the rate of births, deaths and marriages. J Roy Stat Soc 48:628–649, 1885.
8. Gosset WS. The probable error of a mean. Biometrika 6:1–24, 1908.
9. Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10:507–521, 1915.
10. Stigler SM. The Seven Pillars of Statistical Wisdom, Harvard University Press, Cambridge, MA, 2016, pp. 87–106.
11. Bayes T. An essay towards solving a problem in the doctrine of chances [offprint title: “A method of calculating the exact probability of all conclusions founded on induction”]. Phil Trans Roy Soc Lond 53:370–418, 1764.
12. Cohen J. A power primer. Psychol Bull 112:155–159, 1992.
13. Lipsey MW, Puzio K, Yun C, Hebert MA, Steinka-Fry K, Cole MW, Roberts M, Anthony KS, Busick MD. Translating the statistical representation of the effects of education interventions into more readily interpretable forms (NCSER 2013-3000), National Center for Special Education Research, Institute of Education Sciences, US Department of Education, Washington, DC. Available at: https://ies.ed.gov/ncser/pubs/20133000/pdf/20133000.pdf; 2012. Accessed July 20, 2016.