The Journal of Socio-Economics 33 (2004) 547–550

Evaluating significance: comments on “size matters”

Graham Elliott, Clive W.J. Granger
Department of Economics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508, USA
Corresponding author: G. Elliott. Tel.: +1 858 534 3383; fax: +1 858 534 7040. E-mail addresses: [email protected] (G. Elliott), [email protected] (C.W.J. Granger).
doi:10.1016/j.socec.2004.09.025

Abstract

The authors argue that economics as a science is held back by the use of “asterisk” reporting. We concur that many authors do a poor job of disseminating their results; however, we consider the call to reject the tools themselves a little overblown. The tools are very useful for many of the most important parts of economics as a science, such as testing theories. Also, the argument that poor reporting holds back science is made by showing that the ‘X’ variable is low, not that it has an impact on the ‘Y’ of scientific progress. The impact may be zero and insignificant!
© 2004 Elsevier Inc. All rights reserved.

We agree with the main point made in this paper that many applied economic papers do not treat the question of the relevance of estimated parameters in a satisfactory way, even in papers published in major journals. The authors make an important point when they complain that economic significance is often completely overlooked in applied econometrics. Papers such as that by Solon, lauded by the authors, are rare; more often one can gauge the economic significance only through some detective work. Typically a “summary statistics” table is the first table included in an applied study, with means and standard deviations of the data used in the analysis. From this and the tables of regression estimates, along with some approximate multiplication, the economic significance can be determined. It is obviously ridiculous for an author to force the reader to do this when the main point of many applied papers that report such tables and regressions is the economic size of the effect under study and the precision of this estimated size. When this is the point of the exercise, it is fairly clear that “reasonable ranges” of the effect in economically understandable terms are the most relevant numbers to report. For a classical statistician this would be a confidence interval for the effect; Bayesians could choose between highest posterior densities or some other reporting of the estimated distribution of the effect. Either way, statements of the form “the estimated effect is around US$ 10 to US$ 30 a week in extra earnings” convey a lot more to the reader than “the point estimate on parental income is 0.02 (S.E. 0.005)”; the arithmetic behind such a range is sketched below.

We feel that the complaint is less about the use of asterisks per se than about the misuse of tools in general. Reacting to a poor choice of statistical tools by stating that “significance testing as used has no theoretical justification” is a case of throwing the baby out with the bathwater. Rather than invalidating the use of statistical significance, perhaps it simply lays bare a poor choice of size? The authors often claim that “loss” matters, but, like nearly all of the statistical and empirical literature, this is glossed over as though the only reason for conducting an econometric procedure is the quantification of marginal effects. The authors suggest that there is no room in “economics as a science” for the researcher to take seriously the question of how to distinguish between economic models. Further, what does constitute empirical verification (or lack thereof) for any hypothesis? An example of what does not constitute empirical verification is given below.

The basic point made by the authors is that paying attention to “asterisk” economics, that is, to effects that are statistically significant at commonly accepted significance levels, whilst ignoring economic significance, holds back economics as a science. They present empirical evidence that the number of published empirical papers that ignore economic significance is high as a proportion of total papers and has risen further recently. We agree that for the many papers that are interested in the effect of one variable on another, simply paying attention to asterisks is unreasonable. As noted, we should care more about the size of the estimated economic effect and the uncertainty surrounding this estimate. This can be presented in a way that is clear to the reader with hardly any additional work over and above simply reporting estimates. It really does not matter whether the effect is statistically significantly different from zero; this will be clear from the reported range by considering whether the confidence interval includes zero or not.

However, for many papers this is more of a “literary critique” than a deep point about whether or not the paper is misleading science. Provided the table of summary statistics is available, one can go between the tables to work out the economic sizes of effects, understand the uncertainty, and so forth. If we step back and allow that results reported in economic journals are a scientific discussion between economists, and that all the tables mean the results, economic and statistical, are readily available from the paper to those trained well enough to interpret the relevance and correctness of the study anyway, then there is really no problem. This remains true even when the author misunderstands the true importance of relative effects; their discussion of the results may be virtually empty of content (focusing on precisely estimated but economically irrelevant effects), but the reader can still piece together all that the reported regression has to offer.
We could criticize the author (and should criticize the editor), but if the readers who are making decisions based on the results are misled with all the evidence in front of them, then it would be just as reasonable to complain about their statistical comprehension skills. For papers that do not have the supporting evidence to enable interpretation of the results in this way, all bets are off (but we do not know why a decision maker would consider such papers, even if they are published in the AER).
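
As a sketch of the arithmetic referred to above, take the illustrative coefficient of 0.02 with standard error 0.005 and suppose, purely for illustration, that the coefficient measures extra weekly earnings in dollars per dollar of parental income and that the economically relevant comparison is a US$ 1,000 difference in parental income (both the interpretation and the scaling are ours, chosen only to make the numbers concrete). Using the usual 95% normal critical value,

\[
0.02 \pm 1.96 \times 0.005 \;=\; [0.0102,\; 0.0298], \qquad
1000 \times [0.0102,\; 0.0298] \;\approx\; \text{US\$ 10 to US\$ 30 a week.}
\]

All the reader needs is the reported coefficient, its standard error, and a sense of what constitutes a meaningful change in the regressor; the “asterisk” then translates into an economically interpretable range.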

If we were to complain about simply examining statistical significance, our complaint would not be with reporting a number as statistically significant or not, but rather with the choice of size. Such a choice is needed for a test or for a confidence interval. This is not unrelated to the authors’ point, but it is a better way to think about why the globally accepted significance levels are typically irrelevant to most empirical papers. There are a number of ways to think about this, although they all boil down to similar issues. We typically consider statistically insignificant estimates to be zero, and statistically significant estimates to be nonzero. This leads to a strange dichotomy. When estimates are relatively imprecise (relative to economically significant differential effects), we regard as zero effects that a confidence interval would suggest might be zero but might also be very important (we know of results “spun” in this way: although insignificantly different from zero, the effect is also insignificantly different from twice the point estimate). The “precise estimate of an insignificant effect” is the opposite of this, regarding as non-zero something that really does not matter economically (a numerical sketch of both cases is given below). The essential point is that when estimates are expected to be relatively imprecise, choosing a larger size is warranted; in situations where we expect precise estimates (typically when the sample size is very large), small sizes are reasonable. Even if the effect is not so significant economically, that there is an effect is interesting.

It is not the case, as one might gather from the comments of the authors, that all we are interested in is the marginal effect of some covariate on an outcome. Whilst this is often the focus, and provides a practically relevant and useful part of applied econometrics, presumably if we take economics seriously as a science then we might actually take economic models seriously as well. This means that we would be interested in testing the models. It is perfectly reasonable that there be two models that have very different implications for many variables that we measure poorly, yet very similar but different effects for a variable whose effect can be precisely estimated. Does it make no sense to examine the statistical difference between the models (if we accept classical methods, choosing the reigning model as the null) even when the economic significance is uninteresting? (In physics, the original motivating experiment differentiating Einstein’s theories from Newton’s was the slight bending of light in an eclipse, a result not predicted by Newton’s theories. This was surely not a physically significant effect for anything the theories were being used for at the time, but the implications of rejecting Newton for Einstein were of course enormous, even though much of what followed had nothing to do with eclipses. It is in this sense that we suggested above that even when researchers find statistically significant effects that are small economically, the effects still hold interest. Those whose theories do explain that area may still want to know this and be able to account for the effect; it may well help decide between competing theories.)

One should note a difference between the suggestions of the authors and our “agreement.” First, we agree only when the problem is restricted to reporting the magnitude of an effect, and not when it concerns testing.
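
As a purely hypothetical sketch of the two cases described above (the numbers are ours, not the authors’, and use the 95% normal critical value):

\begin{align*}
\text{Imprecise case: } & \hat{\beta} = 10,\ \text{s.e.} = 6 \;\Rightarrow\; t \approx 1.67,\quad 95\%\ \text{CI} \approx [-1.8,\ 21.8],\\
& \text{``insignificant,'' yet the interval contains both } 0 \text{ and } 2\hat{\beta} = 20;\\
\text{Precise case: } & \hat{\beta} = 0.001,\ \text{s.e.} = 0.0001 \;\Rightarrow\; t = 10,\\
& \text{``significant,'' yet possibly negligible in economic terms.}
\end{align*}

In the first case a conventional 5% size treats as zero an effect that may be economically large; in the second it certifies an effect that may not matter at all. This is the sense in which the appropriate size depends on the precision one expects rather than on convention.
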
However, if reporting magnitudes were all that the field were about, so that we were never interested in testing between different theories, then we would have very little to say. We would be stepping away from attempting to answer any question that does not come along with an attendant data set (history) that allows estimation of the particular effect we are interested in. What would be the effect on economically interesting quantities of a never before constructed program? No answer? Or perhaps we could use theory and plausible parameters to get some reasonable understanding.

This criticism is real: if the authors really believe that testing is irrelevant, then they have a far bigger complaint with empirical methods in all of science. The second difference is that the authors appear to take the estimated effect at face value, rather than, as we have suggested, employing some range that conveys both the magnitude and the uncertainty of the estimates. Their statement in the conclusion, that seldom is the magnitude of the sampling error the chief scientific issue, suggests that they reject ranges, and this is interesting. Surely the statement is not to be taken too literally. However, it makes no sense theoretically, to a classical statistician or to a Bayesian, to report an estimate without considering the undeniable fact that the estimate is not the truth. This is a fundamental issue in understanding what a sample of data tells us. Without some idea of the sampling uncertainty we would expect greater problems in policy making than the ones the authors find so exciting.

The authors seem to think that the Friedman story backs up their claim that a result from a well-fitting model can mislead by focusing on statistical significance. A clearer interpretation might be that Friedman was not led astray by statistical fit, but by not thinking clearly about the uncertainty of the estimated duration time of the alloy. Had the confidence interval been computed well, so that attention was paid to the sampling error rather than just to the point estimate of the magnitude, then perhaps in this situation the mistake would not have been made. The quantitative effect on the actual economy of poor or mixed-quality models is very unclear, particularly when dealing with policy questions. However, if the models are used for forecasting, the effects of using a well-specified versus a poorly specified model can easily be evaluated.

The criticisms made in the paper never discuss the question of the purpose of the models. It is unclear to us whether exactly the same criticisms apply to all purposes. If one wanted explicitly to test an economic theory that said that a particular explanatory variable did not enter a given equation, then whether or not its parameter is zero is an appropriate question. In other cases, it is difficult to say why testing one variable that is related to other explanatory variables is of particular interest; what matters is the quality of the whole model for its purpose.

The bottom line of the article is that the emphasis on statistical rather than economic significance holds back the science. The authors go a little further, stating that jobs are lost, perhaps even lives, and that science has stopped. The counterfactual, where economics as a science would be if we had better reporting and consideration of the results, is tough to get at. One approach might be to give anecdotal or statistical evidence on situations where poorly interpreted results have been misused in such a way as to lead to a high proportion of bad outcomes, either for policy or for modeling. The difference between these outcomes and those under better reporting could give us an estimate we could use to examine the economic significance of the effect of the reporting. The empirical evidence presented by the authors amounts to examining the variation in what would be the “X” variable in such an analysis. They examine in close detail whether or not reporting is done well and how it has changed over time.
But no effort has gone into examining the relationship between this emphasis, or the lack of it, and how quickly our understanding progresses.
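
In the authors’ own “X” and “Y” terms, the exercise that is never carried out can be written, purely as a hypothetical illustration and not as anything estimated in the paper or here, as

\[
\text{scientific progress}_i \;=\; \alpha + \beta \,(\text{quality of reporting})_i + \varepsilon_i .
\]

The evidence presented documents variation in the regressor, the quality of reporting, but says nothing about the size or precision of β, which, as the abstract notes, may well be zero and insignificant.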