By the Numbers
Meta-analysis, Part II

This is the second of two columns on meta-analysis, a technique of secondary analysis that helps interpret a collection of published studies addressing a single research issue. The point of view expressed here is statistical in nature and treats the research issue as a hypothesis addressed by the collection of studies. In the previous column, the process of identifying and culling candidate articles for the meta-analysis was examined. It is assumed here that those tasks have been completed and that the results (addressing the hypothesis to be investigated) of the articles to be included in the meta-analysis have been tabulated.

This column describes 3 types of techniques that provide a meta-analytic "summing up" of the evidence carried by those results: techniques that focus on the p values of the individual studies, vote-counting techniques that tally the findings of the studies, and parametric techniques. These 3 types of techniques differ in the detail with which results are considered and in the detail with which they provide meta-analytic findings. In describing the techniques, it is assumed that the research issue is whether an experimental method is superior to a standard treatment; the statistical convention is to term as the null hypothesis the notion that the two methods are equivalent. Use will also be made of the term "effect size," which, for quantitative outcomes, is defined as the difference in mean values (for those treated in the experimental vs. control groups) divided by the pooled standard deviation. This transformation is used to ensure the validity of the statistical methods applied. Finally, it is prudent to seek the assistance of a biostatistician with experience in meta-analysis when performing a meta-analysis that will be submitted for publication.

The first set of techniques to be described includes those that appeared first historically. They are sometimes termed "combined significance" techniques because they take as input the p values from the individual studies in the meta-analysis and yield as output an overall significance level for rejecting the null hypothesis. Because few details of the summary statistics of the individual studies are needed, all acceptable articles may be included in the meta-analysis; if an article failed to report a standard deviation, for example, it could still be included. First used in the 1930s, these techniques actually predate the formal development of meta-analysis. Although the problem had been considered by Karl Pearson and Ronald A. Fisher, two giants of the statistical developments of that time, the first published treatment was due to L. H. C. Tippett in 1931, whose method is easily described because of its simplicity.1 Based on the uniform probability distribution, the procedure is to reject the null hypothesis at the α level of significance if the smallest p value of the k studies in the meta-analysis is smaller than 1 – (1 – α)^(1/k). Strictly speaking, this procedure should be applied without corrections only to tests of continuous data, but it gives a rough approximation even if this requirement is violated.
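As an illustration, a minimal Python sketch of Tippett's procedure might look as follows (the function name and the p values are invented for the example):

```python
def tippett_test(p_values, alpha=0.05):
    """Tippett's combined-significance test: reject the null hypothesis
    if the smallest of the k p values falls below 1 - (1 - alpha)**(1/k)."""
    k = len(p_values)
    threshold = 1 - (1 - alpha) ** (1 / k)
    return min(p_values) < threshold

# Example: five studies, one of which is individually quite significant.
p_values = [0.04, 0.20, 0.15, 0.008, 0.30]
print(tippett_test(p_values))  # True: 0.008 < 1 - 0.95**(1/5) ≈ 0.0102
```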
More comprehensive tests are described in Hedges and Olkin,2 including the Fisher inverse chi-square test, which uses all the p values. In the Fisher procedure, the test statistic is calculated and evaluated in 3 steps:
1. Take the product of the k p values.
2. Take the natural logarithm of that product and multiply it by –2 to obtain the test statistic.
3. Under the null hypothesis the test statistic has a chi-square distribution with 2k degrees of freedom; compare it with tabulated values and reject the null hypothesis (at the α level of significance) if the (1 – α) critical value is exceeded.
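A corresponding sketch of the Fisher procedure, with SciPy's chi-square quantile function standing in for the tabulated critical value (the p values are again invented):

```python
import math
from scipy import stats

def fisher_combined_test(p_values, alpha=0.05):
    """Fisher's inverse chi-square test: under the null hypothesis,
    -2 * sum(ln p_i) follows a chi-square distribution with 2k df."""
    k = len(p_values)
    statistic = -2 * sum(math.log(p) for p in p_values)
    critical = stats.chi2.ppf(1 - alpha, df=2 * k)
    return statistic, critical, statistic > critical

p_values = [0.04, 0.20, 0.15, 0.008, 0.30]
stat, crit, reject = fisher_combined_test(p_values)
print(f"X2 = {stat:.2f}, critical value = {crit:.2f}, reject = {reject}")
```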
More complicated combined significance tests differ only in the extent to which the various p values are used; the specific rules for applying these procedures should be noted before citing results. A great strength of the combined significance procedures is that their validity is not undermined by "publication bias," whereby studies showing significant results may be more likely to appear in print and thus to be eligible for inclusion.

The second set of techniques are termed "vote-counting" techniques because they consider only the "votes" of the studies: the tallies of the number of studies with significant findings favoring the experimental treatment, the number with significant findings favoring the standard treatment, and the number without significant differences. There is debate about the use of these techniques, but they may be useful when there are not many candidate studies and, of those, relatively few have adequate summary data. The specific level of significance to be used in a vote-counting meta-analysis depends on the data reported; in most cases, the studies will report findings for significance at the 5% level or give information on an overall direction (equivalent to the 50% level). The power of vote-counting meta-analysis has been shown to decrease as more studies are considered, but this is not really problematic: when more studies are available, the parametric procedures, described next, are preferable in providing a strong, detailed analysis, and the vote-counting procedures may then be included as a footnote to address the overall thrust of information from the complete body of acceptable studies. The specifics of the vote-counting procedures are somewhat complicated, and some versions involve nomographs that must be read visually. Still, the output of these procedures, especially when there are few acceptable studies, may be surprisingly detailed.
For a research question addressing the probability that an experimental treatment will provide better results than the control treatment, it is possible to derive 95% confidence limits for that probability; when the research issue deals with the strength of an effect, it is possible to derive 95% confidence limits for the effect size. Necessarily, the rules for conducting a vote-counting meta-analysis must be followed strictly for these outputs to be meaningful. Both Hedges and Olkin2 and Light and Smith3 provide guidance on the rules that allow such detailed output to arise from such apparently minimal input.
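As a deliberately simplified illustration of the direction-only (50% level) case, the sketch below estimates the probability that a study favors the experimental treatment and attaches a normal-approximation confidence interval; the full Hedges and Olkin machinery is more involved, and the study counts here are invented:

```python
import math

def direction_vote_ci(n_favor_experimental, n_studies, z=1.96):
    """Simplified vote count on direction only: estimate the probability
    that a study favors the experimental treatment, with an approximate
    95% confidence interval (normal approximation to the binomial)."""
    p_hat = n_favor_experimental / n_studies
    se = math.sqrt(p_hat * (1 - p_hat) / n_studies)
    return p_hat, (p_hat - z * se, p_hat + z * se)

# Example: 9 of 12 studies favored the experimental treatment.
p_hat, (lo, hi) = direction_vote_ci(9, 12)
print(f"estimate = {p_hat:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```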
The third type of meta-analytic procedure comprises the parametric techniques, so called because they result in precise estimates of the strength of the effect under consideration. The procedures generate results in terms of effect sizes, but it is not difficult to transform back to values of the parameter of interest. One problem occurs often enough that a quick remedy is suggested here. Because the results are taken from research papers, it is not uncommon for the mean values to be given (for experimental and control treatment groups) but for the standard deviations, as well as the pooled standard deviation, to be missing. Often, however, the test statistic or its p value will be reported, and it may also be true that the sizes of the groups are similar. If, further, the data in the studies are approximately normal and suitable for analysis by the Student t test, it is possible to impute the effect size as follows:
1. Determine the sign of the effect size from the paper (positive when the mean of the experimental group exceeds that of the control group, negative when the reverse is true).
2. Extract the sample sizes in the experimental and control groups, n1 and n2, respectively, from the paper.
3. Extract the value of the corresponding t statistic (for a two-sided test), denoted t, from the paper or, if only the p value is given, infer t as the corresponding two-sided critical value of the t statistic (with n1 + n2 – 2 degrees of freedom).
4. Impute the effect size to be t √(1/n1 + 1/n2), taking care that the sign from the first step is correct.
A conservative version of this method is to impute the effect size as 0 whenever the result in the study is nonsignificant.
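A sketch of this imputation, with SciPy used to invert the p value when the t statistic itself is not reported (the function and argument names are illustrative):

```python
import math
from scipy import stats

def impute_effect_size(n1, n2, t=None, p=None, sign=1):
    """Impute a standardized effect size d = t * sqrt(1/n1 + 1/n2)
    when only a t statistic (or a two-sided p value) is reported."""
    if t is None:
        # Infer |t| as the two-sided critical value at the reported p value.
        t = stats.t.ppf(1 - p / 2, df=n1 + n2 - 2)
    return sign * abs(t) * math.sqrt(1 / n1 + 1 / n2)

# Study reporting only p = 0.03 with groups of 25 and 27 patients,
# experimental mean above control mean (sign = +1).
print(impute_effect_size(25, 27, p=0.03, sign=1))  # ≈ 0.62
```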
Parametric procedures are best described in terms of either a fixed effect model or a random effects model. The former refers to the situation in which there is really only one value of the effect size and in which the different estimates of its value may be explained by random differences among the studies. The latter refers to the situation in which the value of the effect size itself varies: for each study, a different effect size is assumed to underlie the experiment. Consequently, the variability among studies of the resulting effect size estimates is due not only to random differences among the studies but also to variation among the true effect sizes they estimate. If there are no clear systematic differences that could be recognized, before examining any data, to create natural categorizations for effect sizes, the question of whether the fixed effect or the random effects model is appropriate is addressed by conducting a test of homogeneity on the effect sizes. The best-known homogeneity test is a form of chi-square test whose rejection is evidence that the random effects model is the more appropriate. For either model, the output is a confidence interval addressing the strength of the effect size. For the fixed effect model, this would typically be the 95% confidence interval for the value of the true effect size. For the random effects model, it would typically be the 95% confidence interval for the true mean of the random variable whose value (for any given study) is that study's true effect size.
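The column does not prescribe a particular implementation, but a minimal sketch of an inverse-variance fixed effect summary together with the familiar chi-square (Q) homogeneity test might read as follows (the effect sizes and variances are invented):

```python
import numpy as np
from scipy import stats

def fixed_effect_summary(effects, variances, alpha=0.05):
    """Inverse-variance pooled effect size with a homogeneity (Q) test.
    Under homogeneity, Q ~ chi-square with k - 1 df; rejection is
    evidence that a random effects model is more appropriate."""
    d = np.asarray(effects)
    w = 1.0 / np.asarray(variances)        # inverse-variance weights
    d_bar = np.sum(w * d) / np.sum(w)      # pooled (fixed effect) estimate
    se = np.sqrt(1.0 / np.sum(w))
    z = stats.norm.ppf(1 - alpha / 2)
    q = np.sum(w * (d - d_bar) ** 2)       # homogeneity statistic
    p_hom = stats.chi2.sf(q, df=len(d) - 1)
    return d_bar, (d_bar - z * se, d_bar + z * se), q, p_hom

effects = [0.35, 0.52, 0.18, 0.41]         # effect sizes from 4 studies
variances = [0.04, 0.09, 0.05, 0.06]       # their estimated variances
d_bar, ci, q, p_hom = fixed_effect_summary(effects, variances)
print(f"pooled d = {d_bar:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f}), "
      f"Q = {q:.2f}, homogeneity p = {p_hom:.2f}")
```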
There are various approaches taken by practitioners of meta-analysis when the outcome data are counts, but an acceptable method is to consider the proportion having the attribute of concern as the basic outcome variable and to compute the effect size accordingly. Briefly, this is a binomial-type model: the difference in means becomes the difference of the proportions in the experimental and control groups, p1 – p2, which is divided by the standard deviation of the difference of proportions, the binomial pooled standard error, √[p1(1 – p1)/n1 + p2(1 – p2)/n2], where n1 and n2 denote the respective sample sizes.
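A one-line computation of this count-data effect size (the responder proportions are invented):

```python
import math

def binomial_effect_size(p1, n1, p2, n2):
    """Effect size for count data: difference of proportions divided by
    the binomial pooled standard error sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

# Example: 30/50 responders in the experimental group, 20/50 in the control.
print(binomial_effect_size(0.60, 50, 0.40, 50))  # ≈ 2.04
```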
When a significant meta-analytic result is indicated, a final consideration should focus on the strength of that finding. One measure of strength has to do with whether there might be enough unpublished, nonsignificant studies sitting in the file drawers of researchers to undo the significance found. The question posed by Robert Rosenthal,4 "How many unpublished, nonsignificant studies sitting in the file drawers of researchers would there have to be in order to just undo the significance found?", has as its answer a quantity referred to as the "Rosenthal file drawer index." Heuristic methods may be used to solve this problem for significant meta-analyses performed with either of the first two types of techniques, but Rosenthal suggested explicit ways to solve it for a significant meta-analysis performed with parametric procedures.2 Rosenthal suggests that a value of the index of at least 5 times the number of studies in the meta-analysis plus 10 is sufficient to make a significant finding "fail-safe."
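One common form of Rosenthal's calculation, for combined one-tailed z scores, is N_fs = (ΣZ)²/z_α² – k; a sketch under that assumption, with invented study z values:

```python
import math
from scipy import stats

def fail_safe_n(z_scores, alpha=0.05):
    """Rosenthal's fail-safe N: the number of unpublished studies
    averaging z = 0 needed to pull the combined result back to the
    one-tailed significance level alpha."""
    k = len(z_scores)
    z_alpha = stats.norm.ppf(1 - alpha)    # 1.645 for alpha = 0.05
    n_fs = (sum(z_scores) ** 2) / z_alpha ** 2 - k
    return math.floor(n_fs)

z_scores = [2.05, 1.44, 2.58, 1.96, 1.20]  # illustrative study z values
n = fail_safe_n(z_scores)
print(n, "vs. tolerance level", 5 * len(z_scores) + 10)  # 26 vs. 35
```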
Different interpretations are appropriate for values of this index, depending on the research problem. Another way to measure the strength of a significant meta-analytic result is to calculate the number of patients who would need to be treated before a single patient's status should be expected to change; here, too, the nature of the research problem shapes the meaning of potential values of this quantity.

A final word on meta-analysis is that it is an area of methodology in flux. This column, along with the previous one, has attempted to describe the main thrusts of the discipline, but it can do no more than touch on the main approaches. Many researchers have suggested alternatives, including graphical methods, Bayesian methods, and dynamic methods whereby the publication of a new study can update an ongoing meta-analysis, much as in control charts; the latter practice was championed by Thomas Chalmers.5 With any of these, it would be wise to involve a biostatistician at an early stage of the research, because the process of meta-analysis is quite labor intensive.

REFERENCES
1. Tippett LHC. The method of statistics. London: Williams & Norgate; 1931.
2. Hedges LV, Olkin I. Statistical methods for meta-analysis. San Diego: Academic Press; 1985.
3. Light RJ, Smith PV. Accumulating evidence: procedures for resolving contradictions among different research studies. Harv Educ Rev 1971;41:429-71.
4. Rosenthal R. The "file drawer problem" and tolerance for null results. Psychol Bull 1979;86:638-41.
5. Lau J, Schmid CH, Chalmers TC. Cumulative meta-analysis of clinical trials builds evidence for exemplary medical care. J Clin Epidemiol 1995;48:45-57.
Douglas Y. Rowland, PhD
Sara M. Debanne, PhD
Cleveland, Ohio

doi:10.1067/mge.2002.122331