Statistical methods can be improved within Cochrane pregnancy and childbirth reviews




Journal of Clinical Epidemiology 64 (2011) 608-618

Richard D. Riley a,*, Simon Gates b, James Neilson c, Zarko Alfirevic c

a Department of Public Health, Epidemiology and Biostatistics, Public Health Building, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK
b Warwick Clinical Trials Unit, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK
c Division of Perinatal and Reproductive Medicine, School of Reproductive & Developmental Medicine, Cochrane Pregnancy & Childbirth Group, University of Liverpool, Liverpool, UK

Accepted 2 August 2010

Abstract

Objectives: To assess statistical methods within systematic reviews of the Cochrane Pregnancy and Childbirth Group (CPCG).

Study Design and Setting: We extracted details about statistical methods within 75 reviews containing at least 10 studies.

Results: The median number of forest plots per review was 52 (min = 5; max = 409). Seven of the 75 reviews assessed publication bias or explained why not. Forty-four of the 75 reviews performed random-effects meta-analyses; just 1 of these justified the approach clinically, and none interpreted its pooled result correctly. Of the 31 reviews not using random-effects, 26 assumed a fixed effect given potentially moderate or large heterogeneity (I² > 25%). In their Methods section, 25 (33%) of the 75 reviews said I² was used to decide between fixed-effect and random-effects; however, in 12 of these (48%) reviews, this was not carried out in their Results section. Of 72 reviews with moderate or large heterogeneity, 47 (65%) did not explore the causes of heterogeneity or justify why not.

Conclusion: Within CPCG reviews, publication bias is rarely addressed; heterogeneity is often not appropriately considered; and random-effects analyses are incorrectly interpreted. How these shortcomings affect existing review conclusions needs further investigation, but regardless of this, we recommend that The Cochrane Collaboration increase "hands-on" statistical support. © 2011 Elsevier Inc. All rights reserved.

Keywords: Meta-analysis; Systematic review; Cochrane Collaboration; Heterogeneity; Publication bias; Random-effects

Authors' contributions: Z.A. and J.N. obtained funding for the project and set the review questions in conjunction with R.D.R. R.D.R. performed the entire review (i.e., identification of relevant articles, data extraction, and summary of findings) independently of the Cochrane Pregnancy and Childbirth Group and wrote the first draft of the article with specific examples. All authors commented on the first draft and helped revise it. R.D.R. and S.G. developed the stated recommendations for improvement based on the Cochrane Handbook. Funding: This work was funded by the United Nations Development Programme/United Nations Population Fund/World Health Organization (WHO)/World Bank Special Programme of Research, Development and Research Training in Human Reproduction, WHO, Geneva, Switzerland. Competing interests: J.N. and Z.A. are joint coordinating editors for the Cochrane Pregnancy and Childbirth Group. S.G. and (subsequent to the completion of this review) R.D.R. are statistical advisors for the Cochrane Pregnancy and Childbirth Group. * Corresponding author. Tel.: +44-121-414-7508; fax: +44-121-414-7878. E-mail address: [email protected] (R.D. Riley). 0895-4356/$ - see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jclinepi.2010.08.002

1. Introduction

There is an abundance of systematic reviews in the medical literature [1], and many are undertaken within The Cochrane Collaboration. This international organization supports evidence-based health care by preparing, maintaining, and making available systematic reviews of the effects of health care interventions. Within The Cochrane Collaboration, there are over 50 review groups, each specializing in a particular disease or health condition of interest. The Cochrane Pregnancy and Childbirth Review Group (CPCG) conducts and maintains systematic reviews of interventions aimed at improving health outcomes in pregnancy and childbirth. The CPCG was the first group to register with The Cochrane Collaboration, in October 1992, and its members are highly active, having published over 350 reviews within The Cochrane Database of Systematic Reviews by the end of 2009.

If appropriate, at the end of a systematic review, the quantitative evidence from the studies identified (or the subset of better-quality studies) can be combined using statistical methods for meta-analysis [2], so as to produce results that inform clinical decision making and health care policies. For example, in a typical Cochrane review, a meta-analysis will synthesize the estimates of a particular treatment effect, so as to produce an average ("pooled")


What is new?

There are deficiencies in the use of statistical methods within Cochrane Pregnancy and Childbirth Group (CPCG) reviews. The issue of publication bias is rarely addressed; the process of measuring, investigating, and accounting for heterogeneity is often limited or inadequate; and random-effects analyses are not correctly interpreted. The large number of meta-analyses per review also raises the concern of multiple testing. These problems need to be urgently addressed.

What does this study add?

No formal review of statistical methods has previously been conducted within Cochrane reviews, as far as we are aware. This is the first article to show the problems identified.

What is now recommended?

Improved use of statistical methods is urgently needed within Cochrane reviews. Although we have only assessed CPCG reviews in this article, our findings have general implications for all Cochrane reviews. Such reviews tend to be conducted by nonstatisticians, who rely heavily on the Cochrane Handbook and RevMan. The Cochrane Collaboration must seek to engage more statisticians and methodologists within individual reviews, not just at the Cochrane Handbook or advisory level.

treatment effect estimate (and its confidence interval) based on all the evidence. This result then indicates whether or not the treatment is effective, and by how much, and is subsequently used within decision models and cost-effectiveness analyses that influence health and treatment policies. Thus, the correct choice, implementation, and interpretation of meta-analysis methods are crucial, as otherwise biased and misleading clinical decisions may be made.

But meta-analysis methods are nontrivial and require statistical expertise. In particular, related studies can vary in important clinical factors (e.g., patient characteristics and dose of treatment) and/or methodological factors (e.g., study design and analysis methods), leading to between-study heterogeneity in the intervention effect of interest. Where possible, such heterogeneity must be quantified, accounted for, and examined using, for example, I² [3], a random-effects model [4,5], and subgroup analyses [6]. There is also the threat of publication bias [7], where studies that do not identify statistically or clinically significant results are not published, potentially leading to an asymmetric funnel plot [8] and overly optimistic (biased) meta-analysis results.
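To make the quantities discussed throughout this article concrete, the sketch below computes the fixed-effect pooled estimate, Cochran's Q, I², and the heterogeneity P-value for a handful of invented study results. RevMan produces all of these automatically; the code is purely illustrative and is not taken from any CPCG review.

```python
import math

def fixed_effect_summary(effects, variances):
    """Fixed-effect pooled estimate, Cochran's Q, I^2 (as a percentage), and
    the chi-square heterogeneity P-value. Illustrative only: 'effects' are
    study estimates (e.g., log odds ratios) with known within-study variances."""
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Chi-square survival function via the closed form for even df (df = 4
    # below); a P-value below 0.1 (not 0.05) is the usual cut-off because
    # the test has low power.
    assert df % 2 == 0, "closed-form P-value shown only for even df"
    total, term = 0.0, 1.0
    for i in range(df // 2):
        total += term
        term *= (q / 2.0) / (i + 1)
    p_value = math.exp(-q / 2.0) * total
    return pooled, q, i2, p_value

# Five hypothetical trials (invented log odds ratios and variances):
pooled, q, i2, p = fixed_effect_summary(
    [-0.4, -0.1, 0.2, -0.6, 0.1], [0.04, 0.09, 0.05, 0.12, 0.06])
```

With these invented numbers, I² is roughly 41% (potentially moderate heterogeneity) while the chi-square P-value is about 0.15, illustrating the low power that motivates the 0.1 threshold.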


With regard to statistical methods, Cochrane review authors are greatly assisted by The Cochrane Handbook [9], which gives detailed advice about the rationale and methodology for synthesizing results in a systematic review. Furthermore, many statistical methods can be implemented directly within RevMan (The Nordic Cochrane Centre, Copenhagen, Denmark) [10], the freely available software used for preparing and maintaining systematic reviews within The Cochrane Collaboration. Thus, meta-analysis methods are readily accessible, even for authors with limited statistical support. Given this, it is important to check that statistical methods are being correctly implemented and interpreted in Cochrane reviews. This is particularly important for CPCG reviews, which usually contain multiple meta-analyses, as often several treatments and outcomes are of interest (for both the mother and the baby).

In this article, we review statistical methods used within systematic reviews of the CPCG, to assess current practice, identify areas for improvement, and draw attention to ongoing challenges. Our review focuses primarily on two aforementioned statistical issues: between-study heterogeneity and publication bias. These were chosen after consultation with the editors of the CPCG, who identified them as two of the key issues faced. In the Methods section, we describe the methods of our review, and in the Results section, we summarize our findings. In the Discussion section, we conclude with some discussion and provide recommendations in accordance with The Cochrane Handbook [6]. We note that this article assumes a basic knowledge of meta-analysis and statistical terminology; but to increase the accessibility of our article, we provide a brief exposition of such terminology within Web Supplement 1 (see appendix on the journal's Web site at www.elsevier.com).

2. Methods

In this section, we describe the methods we used to identify, classify, and critically appraise the statistical methods of CPCG reviews.

2.1. Identification and classification of CPCG reviews

The Cochrane Library was searched on 28th March 2008 using the search term "pregnancy and childbirth group," which is given on the cover page of all CPCG reviews. All identified articles were imported into the EndNote reference manager software (Thomson ResearchSoft, Berkeley, CA, USA) [11]. The title and abstract of each article were then read by one investigator, and each article was classed as "included," "excluded," or "unsure." "Included" articles were CPCG reviews that contained 10 or more studies. The restriction to 10 or more studies was based partly on time and funding constraints, and partly because meta-analysis methods for investigating between-study heterogeneity (e.g., meta-regression) or publication bias (e.g., tests for funnel



plot asymmetry) are not generally recommended when there are fewer than 10 studies [12,13]. We recognized that a CPCG review with 10 or more included studies may still contain meta-analyses with far fewer studies, but this criterion allowed us to focus on those reviews with the broadest potential scope of meta-analysis methods for our assessment. "Excluded" articles were non-CPCG reviews or CPCG reviews with fewer than 10 studies. All "included" and "excluded" classifications were double-checked by Lynn Hampson, Trials Search Coordinator, who also reclassified any "unsure" articles as "included" or "excluded" as appropriate. Any discrepancies in the classification of articles between the two reviewers were resolved on further discussion with Sonja Henderson, the CPCG coordinator.

2.2. Extraction of information

The full text of each included review was obtained. A data extraction form was then developed to extract explicit details from each review; it is summarized in Web Supplement 2 (see appendix on the journal's Web site at www.elsevier.com). The information sought included general details (e.g., the year of publication) and more specific details about the use and interpretation of methods for assessing between-study heterogeneity and publication bias. In terms of between-study heterogeneity, this involved identifying what methods were proposed in the Methods section; whether these were described correctly to the reader; if and how heterogeneity was assessed and measured in the Results section; if and how random-effects meta-analysis models were justified; how the results from a random-effects meta-analysis were interpreted; and how the causes of heterogeneity were explored. In terms of publication bias, this involved identifying what methods were proposed in the Methods section; whether these were described correctly to the reader; if and how publication bias was assessed in the Results section; whether such assessments were interpreted appropriately; and whether, in the Discussion section, the main conclusions of the review were placed in the context of whether publication bias was deemed a potential concern. A statistical reviewer (R.D.R.), with extensive experience of applying and developing meta-analysis methods, performed all the data extraction and summarized the findings.

3. Results The search within The Cochrane Library identified 365 potential reviews. After assessment by the two reviewers, a final set of 75 included reviews was identified. A flow chart showing the identification and then inclusion or exclusion of reviews is shown in Fig. 1. We now summarize our appraisal of these 75 reviews.

3.1. Number of meta-analyses

Within the 75 included reviews, a large number of meta-analyses were presented. For example, the median number of forest plots provided was 52 per review (min = 5; max = 409), with each forest plot displaying the pooled result from at least one meta-analysis (Fig. 2). These high numbers reflect the large number of outcomes, interventions, and subgroups of interest within CPCG reviews. However, they also raise the concern of multiple testing, such that significant meta-analysis results are to be expected by chance because of the large number of analyses considered. Seventy (93%) of the reviews did not mention the potential for multiple testing to affect their conclusions; these presented a median of 50 forest plots. Another issue is that, although primary and secondary outcomes were usually distinguished in the Methods section, within the Results section and forest plots it was often difficult to distinguish between them.
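The multiple-testing concern can be quantified crudely: if every underlying effect were truly null, the expected number of falsely "significant" meta-analyses grows linearly with the number of analyses performed. A back-of-the-envelope sketch (the Bonferroni adjustment shown is one conservative option, not a method used by the reviews assessed):

```python
def expected_chance_findings(n_analyses, alpha=0.05):
    """Expected number of 'significant' results under the global null:
    one spurious finding per 1/alpha analyses, on average."""
    return n_analyses * alpha

def bonferroni_alpha(n_analyses, family_alpha=0.05):
    """A (conservative) Bonferroni-corrected per-analysis significance level."""
    return family_alpha / n_analyses

# The median CPCG review presented 52 forest plots:
expected = expected_chance_findings(52)  # about 2.6 chance findings at the 5% level
adjusted = bonferroni_alpha(52)          # per-analysis level of roughly 0.001
```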

3.2. Heterogeneity

3.2.1. Planned and reported assessments of heterogeneity

Forty-three (57%) of the 75 reviews stated in the Methods section how they would assess heterogeneity. The most commonly planned method was the I² statistic (30 reviews), followed by the chi-square test of heterogeneity (12 reviews). Of the 30 reviews that planned to use I², 15 (50%) referred to it as a test for heterogeneity, when it is rather a measure of the percentage of variability in the observed effect estimates that is due to between-study heterogeneity rather than within-study sampling error [3]. Eight of the 30 reviews (27%) did not define what values of I² they deemed to be (clinically or statistically) important. The chi-square test uses the Q-statistic to test the null hypothesis that the underlying intervention effect is fixed across studies. Of the 12 reviews that stated the chi-square test would be used, 5 (42%) did not state what P-value from the test was deemed statistically significant, and 2 others defined a P-value less than 0.05 as significant. It is usually recommended that a P-value less than 0.1 indicates significant evidence of heterogeneity (as the test is known to have low power), or alternatively a large Q-statistic relative to the degrees of freedom [12].

All 75 reviews reported heterogeneity measures (e.g., I²) and/or assessments (e.g., the chi-square test) within the forest plots for each meta-analysis; these are produced automatically by the RevMan software. Forty-five (60%) of the reviews discussed the interpretation of these measures within their Results section, and 25 (33%) stated in their Methods section if and how the reported I² values were used to make decisions about the meta-analysis method used (e.g., fixed-effect or random-effects). Most of these 25 reviews stated that with I² greater than 50%, a random-effects meta-analysis and/or a subgroup analysis should be used.
However, in the Results section, 12 of the 25 (48%) reviews did not ultimately carry out their proposed plan; for example, 15 stated in their Methods section that, given a large I², they would explore heterogeneity and perform a random-effects meta-analysis, yet 6 of these still performed some fixed-effect meta-analyses even when a high I² was present. Only 1 of the 75 CPCG reviews provided an estimate of the between-study variance ("tau-squared") for one or more meta-analyses, and none interpreted its clinical relevance in the text. This is perhaps understandable given that RevMan, which is used for all CPCG meta-analyses, did not produce an estimate of tau-squared until the recent update in 2008. Such consequences of relying on RevMan, rather than more flexible software, are considered further in the Discussion section.

Fig. 1. Flow chart showing the identification and inclusion/exclusion of reviews during the search strategy. An included review was defined as a Cochrane Pregnancy and Childbirth Group (CPCG) review that contained 10 or more studies. * These two reviews were wrongly excluded by the first reviewer, who classed them as CPCG reviews with fewer than 10 studies; † on inspection, this review had subsequently been withdrawn from The Cochrane Library.

Forty-four (59%) of the 75 reviews performed one or more random-effects meta-analyses. Of these, only 8 (19%) mentioned in the Methods section that they planned to use a random-effects approach, and just 1 (2%) justified clinically why a random-effects model was an appropriate method. This review [14] stated that, although heterogeneity existed across studies, there was much clinical overlap and the effect sizes were in the same direction across studies, making a random-effects model both appropriate and clinically meaningful. None of the 44 articles defined or interpreted the pooled result from

a random-effects meta-analysis as the average of the intervention effects across studies.

Of the 31 CPCG reviews that did not present a random-effects meta-analysis, 5 justified why a random-effects meta-analysis was not used (e.g., a small I²). The other 26 (84%) still presented fixed-effect meta-analyses even when potentially moderate (e.g., I² > 25%) or large (e.g., I² > 50%) heterogeneity was present, without justification of why the fixed-effect approach was still deemed appropriate.
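Only one review reported tau-squared, yet it is central to the fixed-effect versus random-effects choice discussed later. A sketch of the DerSimonian and Laird method-of-moments estimator (the usual estimator behind RevMan's random-effects analyses), again with invented data:

```python
def dersimonian_laird(effects, variances):
    """DerSimonian-Laird estimate of tau^2 (the between-study variance) and
    the random-effects pooled result, whose point estimate is the AVERAGE
    intervention effect across studies, not a single common effect.
    Illustrative sketch assuming known within-study variances."""
    w = [1.0 / v for v in variances]
    pooled_fe = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - pooled_fe) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                 # truncated at zero
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled_re = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se_re = (1.0 / sum(w_re)) ** 0.5
    return tau2, pooled_re, se_re

# Invented log odds ratios and variances from five hypothetical trials:
tau2, pooled_re, se_re = dersimonian_laird(
    [-0.4, -0.1, 0.2, -0.6, 0.1], [0.04, 0.09, 0.05, 0.12, 0.06])
```

Note how accounting for a nonzero tau-squared widens the confidence interval relative to the fixed-effect analysis, which is why the random-effects approach is the more conservative of the two.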

3.2.2. Planned and reported explorations of heterogeneity

Thirty-three (44%) reviews stated in the Methods section how they would examine the causes of heterogeneity, if it existed. All 33 planned to use either a subgroup analysis or a sensitivity analysis to identify the causes of heterogeneity. Subgroup analyses meta-analyze sets of potentially homogeneous studies (e.g., trials using the same dose of a drug); sensitivity analyses exclude studies (e.g., those with a high risk of bias) from the original meta-analysis to assess whether heterogeneity is reduced and how the pooled effect changes. For both such analyses, it is important that, where possible, the factors to be considered are prespecified [12], and encouragingly 25 of the 33 (76%) reviews did this.
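Computationally, a subgroup analysis is just a set of separate meta-analyses, one per prespecified subgroup. A hypothetical sketch (the dose groups and all numbers are invented for illustration):

```python
from collections import defaultdict

def subgroup_fixed_effect(effects, variances, subgroup_labels):
    """Fixed-effect pooled estimate within each prespecified subgroup
    (e.g., trials using the same drug dose). If heterogeneity largely
    disappears within subgroups, the grouping factor may explain it.
    Hypothetical helper for illustration only."""
    groups = defaultdict(list)
    for y, v, label in zip(effects, variances, subgroup_labels):
        groups[label].append((y, v))
    pooled = {}
    for label, studies in groups.items():
        weights = [1.0 / v for _, v in studies]
        pooled[label] = (sum(w * y for w, (y, _) in zip(weights, studies))
                         / sum(weights))
    return pooled

# Pooled (invented) log odds ratios by a made-up dose grouping:
by_dose = subgroup_fixed_effect(
    [0.1, 0.3, 0.5, 0.7],
    [0.04, 0.04, 0.04, 0.04],
    ["low", "low", "high", "high"])
```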



Fig. 2. Number of forest plots presented within the 75 included articles. Each forest plot provides the pooled result from at least one meta-analysis.

In the Results section and within the forest plots, potentially moderate (I² > 25% [3]) or potentially large (I² > 50% [3]) heterogeneity was observed in at least one meta-analysis for 72 of the 75 reviews. Forty-seven (65%) of these 72 did not explore the potential causes of heterogeneity or justify why not. Of the remaining 25, 22 used subgroup analyses explicitly to explore the causes of heterogeneity, and the other 3 correctly justified that subgroup analyses to assess heterogeneity were not possible because of the small number of studies available. Of the 22 reviews that specifically used subgroup/sensitivity analyses to explore heterogeneity, 14 prespecified the subgroups they would look at; only 1 of these 22 reviews noted the potential for multiple testing because of the assessment of multiple subgroups, and 1 raised the issue that subgroup analyses are prone to confounding because they are observational in nature (i.e., when looking at relationships across studies, there may be study-level confounding). Meta-regression was not used within any of the 75 reviews, but this is again not surprising given that RevMan does not implement it.

3.3. Publication bias

Just 6 (8%) of the 75 reviews stated in their Methods section how they would assess publication bias; only 7 (9%) described a publication bias assessment in their Results or Discussion section or justified why publication bias assessments were not possible; and only 3 reviews both described a publication bias assessment plan in their Methods section and subsequently reported an assessment in their Results or Discussion section. Just 1 of the 75 reviews presented a funnel plot (although one additional review stated that funnel plots were not appropriate because of the small number of studies). Only 12 (16%) of the 75 reviews raised the issue of publication bias in their Discussion

section, to place their conclusion in the context of this potential problem.

Of the six reviews that stated how they would assess publication bias in their Methods section, all proposed to visually assess funnel plot asymmetry, with one also stating it would additionally use Egger's test for publication bias (i.e., a test of funnel plot asymmetry or "small-study effects") [8]. Of the seven reviews that discussed publication bias in their Results or Discussion section, three stated that they looked visually at funnel plots and then discussed whether asymmetry was present; one stated that "graphs" and "tests" showed no evidence of publication bias but did not define what these graphs or tests were; one stated that smaller studies showed larger effects than the bigger studies, and thus publication bias was a concern; one said that the older studies showed larger effects than the newer studies, indicating potential publication bias in older studies; and one argued that funnel plot assessments were not sensible because of the small number of studies in the meta-analyses. Of the three reviews that assessed funnel plots, two found evidence of asymmetry, and both appropriately noted that such asymmetry may be because of reasons other than publication bias.
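Egger's test, mentioned by one review, can be sketched as a simple regression of the standardized effect on precision. The function below is illustrative only: it returns just the intercept, whereas a real test would also report the intercept's standard error and P-value.

```python
def egger_intercept(effects, std_errors):
    """Egger's regression: standardized effect (y/SE) against precision (1/SE).
    An intercept far from zero suggests funnel plot asymmetry
    ('small-study effects'). Illustrative sketch only."""
    ys = [y / se for y, se in zip(effects, std_errors)]
    xs = [1.0 / se for se in std_errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x  # the intercept

# Invented studies with a constant effect of 0.2, i.e., a perfectly
# symmetric funnel by construction, so the intercept should be ~0:
intercept = egger_intercept([0.2, 0.2, 0.2, 0.2], [0.1, 0.2, 0.3, 0.4])
```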

4. Discussion

The appropriate use and interpretation of statistical methods is fundamental to the validity and reliability of meta-analysis results. In this article, we have reviewed the application of statistical methods within CPCG reviews and identified a number of deficiencies. In particular, the issue of publication bias is not adequately addressed in most reviews, and the process of measuring, modeling, and accounting for between-study heterogeneity is often limited or inadequate. A summary of the main findings is presented in Fig. 3, and the key deficiencies are illustrated further in Web Supplement 3 (see appendix on the journal's Web site at www.elsevier.com) within a few (anonymized) CPCG reviews. In the following sections, we suggest how to address the shortcomings identified, starting with the implications for The Cochrane Collaboration as a whole and then moving on to specific statistical advice for each particular issue.

4.1. Implications for The Cochrane Collaboration

Although we have only assessed CPCG reviews in this article, the issues identified are likely to occur in reviews by other Cochrane Review Groups (although this clearly needs to be substantiated by further audits), and implications arise for The Cochrane Collaboration as a whole. In particular, Cochrane reviews tend to be conducted by nonstatisticians, and thus statistical incoherence and misunderstanding are inevitable. Statisticians are often only involved at the protocol and referee stage but not when the data are being extracted, analyzed, or interpreted. Thus, there is



Fig. 3. Summary of main findings from the assessment of the 75 included articles.

a pressing need for The Cochrane Collaboration to involve more statisticians within each review group, not just at each stage of the editorial and refereeing process but also "hands-on" within each individual review. This will inevitably require funding, and statisticians with expertise in meta-analysis are clearly not abundant, but the deficiency in statistical expertise needs to be addressed.

Perhaps RevMan makes it too easy for statistical expertise to be ignored. One could argue that it enables the production of summary statistics, meta-analyses, forest plots, and subgroup analyses with little thought, when statistical expertise is still required to inform the choice and interpretation of analyses. It also constrains the choice of analysis method, as it does not currently allow some advanced statistical methods, such as meta-regression or the calculation of prediction intervals,

and statistical expertise is thus required to facilitate analyses in more sophisticated software packages.

Another related issue facing the Collaboration is when to update existing reviews in The Cochrane Library to address ongoing advances in statistical methods. The Cochrane Handbook is regularly updated to account for methodological developments, but should review groups be expected to reanalyze and reinterpret their existing reviews to keep up to date with the latest methodological guidelines? For example, traditionally the main focus of a random-effects meta-analysis was the pooled effect and its confidence interval. However, recent guidelines now suggest also producing a 95% prediction interval for the underlying effect size in a single study, so as to describe the range of underlying intervention effects across studies



[15], as exemplified in Web Supplement 3 (see appendix on the journal's Web site at www.elsevier.com). General guidance is needed for authors and editors in such situations, and statistical support is again fundamental when updates are necessary.

Furthermore, editors, review authors, and those refereeing during the review process may not always agree with the statistical guidelines in the Cochrane Handbook. For example, the Handbook suggests looking at a small number of primary and secondary outcomes, but CPCG reviews present a median of 52 forest plots per review. Thus review authors, alongside editors and those refereeing the review's protocol and final report, are either unaware of the Handbook's recommendation here or are aware of it but disagree with its advice. Indeed, our experience is that some researchers believe that all outcomes considered within individual studies should be assessed in a Cochrane review. Such tensions between methodologists and those conducting the reviews must be ironed out.

4.2. The choice between fixed-effect and random-effects meta-analysis

In Fig. 4, we summarize our key recommendations for improving the assessment and reporting of heterogeneity

within CPCG reviews. These draw on those within The Cochrane Handbook.

A major dilemma is when to perform a random-effects rather than a fixed-effect meta-analysis. A fixed-effect approach assumes that the intervention effect is common (fixed) across studies, and thus that all studies are estimating the same effect. A random-effects approach allows the intervention effect to vary across studies because of between-study heterogeneity, so each study is estimating a different underlying effect. At the moment, review authors are selecting one of these two methods based on either a "large" I² value (e.g., greater than 50%) or the P-value from a chi-square test for heterogeneity. We recommend that, to decide between the two approaches, or even whether meta-analysis is appropriate at all, review authors must consider both statistical and clinical reasoning.

In terms of statistical reasoning, this is an ongoing area of debate. First, it is important to note that the fixed-effect and random-effects approaches coincide when there is no between-study heterogeneity, that is, where the between-study variance equals zero. When the between-study variance is nonzero, it is more appropriate to perform a random-effects meta-analysis to properly account for

Fig. 4. Recommendations for improving heterogeneity assessments.


the between-study variance. The decision thus seems clear cut, based on whether the between-study variance is zero or nonzero. Yet the decision is more complicated because: (1) the between-study variance ("tau-squared") must be estimated, and this estimate often has large uncertainty because of the small number of studies in the analysis; and (2) tau-squared may be nonzero but small, such that the fixed-effect and random-effects analyses may not differ greatly. An estimate of tau-squared is produced alongside the forest plots that RevMan generates for a random-effects meta-analysis. Generally, if tau-squared is estimated to be nonzero, we prefer to adopt a random-effects approach (assuming clinical justification; see the following paragraph), as it is more conservative: accounting for the between-study variance produces wider confidence intervals for the pooled effect and allows the calculation of prediction intervals to explicitly show how the intervention effects vary across studies. Rücker et al. [16] concur that the choice of meta-analysis approach is best guided by tau-squared rather than I², and the Cochrane Handbook states that "the choice between a fixed-effect and a random-effects meta-analysis should never be made on the basis of a statistical test of heterogeneity" (e.g., the chi-square statistic).

In addition, clinical justification must also be made in light of heterogeneity. Section 9.5.1 of The Cochrane Handbook [12] notes that "A common analogy is that systematic reviews bring together apples and oranges, and that combining these can yield a meaningless result. This is true if apples and oranges are of intrinsic interest on their own, but may not be if they are used to contribute to a wider question about fruit.
For example, a meta-analysis may reasonably evaluate the average effect of a class of drugs by combining results from trials where each evaluates the effect of a different drug from the class." The random-effects approach facilitates this broad outlook, as it allows one to summarize the distribution of the intervention effect across studies, to estimate the average of the intervention effects and its confidence interval, and to calculate a prediction interval for the underlying effect in a single study [15]. Review authors must therefore decide whether this broader summary is clinically meaningful in their particular setting. For example, one CPCG review [14] stated that a random-effects approach was not only statistically appropriate, as heterogeneity existed, but also clinically meaningful, as there was much clinical overlap across studies. Where there are arguments in favor of either a fixed-effect or a random-effects approach, there is no harm in presenting both analyses to show transparently how they affect estimates and conclusions.

Note that if a random-effects approach is used, then the pooled result must be properly interpreted as the average intervention effect across studies. None of the CPCG reviews that presented a random-effects meta-analysis did this. There was also no recognition that the underlying intervention effect could differ across studies, but the newly proposed prediction interval will hopefully address this [15].
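A 95% prediction interval is straightforward to compute from random-effects output using the approximate formula of reference [15]. In the sketch below, the t critical value is supplied by hand (the Python standard library has no t quantile function), and all inputs are invented:

```python
import math

def prediction_interval_95(pooled, se_pooled, tau2, t_crit):
    """Approximate 95% prediction interval for the underlying effect in a
    new study: pooled +/- t * sqrt(tau^2 + SE(pooled)^2), where t is the
    97.5th percentile of a t distribution on (number of studies - 2)
    degrees of freedom [15]. Illustrative sketch only."""
    half_width = t_crit * math.sqrt(tau2 + se_pooled ** 2)
    return pooled - half_width, pooled + half_width

# Invented random-effects results from 5 studies (df = 3, so t_crit = 3.182):
lo, hi = prediction_interval_95(-0.13, 0.15, 0.044, t_crit=3.182)
```

Note how much wider the prediction interval is than the confidence interval for the pooled effect: it describes where the effect in a single new study may lie, not the uncertainty in the average.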


4.3. Examining heterogeneity

In light of observed heterogeneity, it is important to assess what is causing it, if possible [6]. Although a random-effects meta-analysis accounts for between-study heterogeneity, it does not explain it and "is not a substitute for a thorough investigation of heterogeneity" [12]. However, most of the CPCG reviews we assessed failed to describe in their Methods section if and how the causes of heterogeneity would be explored; even fewer actually investigated the causes of heterogeneity when it was found to exist, or justified why this was not possible. Such an investigation can be carried out using subgroup analyses, where subgroups of studies with related characteristics are analyzed separately to see if the observed heterogeneity is removed, or sensitivity analyses, where some studies are removed (e.g., according to their risk of bias) to see if the heterogeneity disappears. Another option is meta-regression, where the intervention effect estimates across studies are regressed against study-level covariates to see if and how they are important. This approach also provides an estimate (with confidence interval) of the differences between subgroups and of the effect of study-level covariates on the intervention effect, although it has low power when the number of studies is small. Meta-regression is currently not available within RevMan but is available in other software, such as Stata (StataCorp, College Station, TX, USA).

All such investigations of heterogeneity should ideally be kept to a minimum and prespecified in the Methods section, and we were encouraged that 25 of the 31 CPCG reviews that used subgroup/sensitivity analyses to explore heterogeneity did prespecify the factors to be used. Post hoc investigations of the cause of heterogeneity should be labeled as such. Usually, only a few investigations will be possible because of the small number of studies.
A general rule is that 10 studies are required per study-level covariate assessed. The Cochrane Handbook says [12]: "If more than one or two characteristics are investigated it may be sensible to adjust the level of significance to account for making multiple comparisons." It is also important to be aware that subgroup analyses and meta-regression results are observational by nature and thus subject to confounding. For example, those studies that use a higher dose of a drug may happen to be those studies that are most poorly designed.
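As an illustration of the subgroup approach described above (the data, subgroup split, and function names are hypothetical, not taken from any CPCG review), each subgroup can be pooled separately and the subgroup estimates compared with the overall estimate via a between-subgroup Q statistic:

```python
import math

def fixed_effect(effects, variances):
    """Inverse-variance fixed-effect pooled estimate and its total weight."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    return sum(wi * yi for wi, yi in zip(w, effects)) / sw, sw

def subgroup_q(groups):
    """Test for subgroup differences: Q_between compares each subgroup's
    pooled estimate with the overall pooled estimate; under the null of
    no subgroup effect it follows (approximately) a chi-squared
    distribution with (number of subgroups - 1) degrees of freedom."""
    pooled = [fixed_effect(y, v) for y, v in groups]
    all_y = [yi for y, _ in groups for yi in y]
    all_v = [vi for _, v in groups for vi in v]
    overall, _ = fixed_effect(all_y, all_v)
    q_between = sum(sw * (mu - overall) ** 2 for mu, sw in pooled)
    return q_between, len(groups) - 1

# Hypothetical trials split by drug dose (log odds ratios, variances)
low  = ([0.05, 0.10, -0.02], [0.04, 0.06, 0.05])
high = ([0.60, 0.75, 0.50],  [0.08, 0.10, 0.07])
q, df = subgroup_q([low, high])
```

With these constructed data, Q_between is about 7.3 on 1 degree of freedom, exceeding the 5% critical value of 3.84, so dose would be flagged as a possible explanation of the heterogeneity; being observational, such a finding would still need cautious interpretation.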

4.4. Publication bias

Given that publication bias is known to be a common threat to the validity of meta-analyses, it was surprising that few CPCG reviews investigated publication bias, explained why publication bias assessments were not feasible, or even discussed the potential issue at all. It is clear that CPCG reviews must now consider the issue of publication bias in more detail, both when planning their review and when interpreting their results. This is particularly important for
their primary analyses, as otherwise misleading or overly strong conclusions may be drawn. Section 10 of The Cochrane Handbook [13] gives detailed advice on assessing publication and reporting biases, suggesting when specific tests, such as those of Harbord et al. [17] or Peters et al. [18], should be used, and we draw on this in our recommendations in Fig. 5. These recommendations recognize that publication bias assessments are not always feasible (e.g., because of a small number of studies) or reliable (e.g., because of large heterogeneity), and that funnel plot asymmetry ("small-study effects" [17]) may arise for reasons other than publication bias.

4.5. Multiple testing

Another important issue identified during our review is that of multiple testing, which occurs where multiple

meta-analyses are performed across different outcomes, interventions, and so forth [19]. Figure 2 shows that CPCG reviews presented a large number of forest plots and meta-analyses, because of the wide variety of outcomes and interventions of interest, together with the large number of subgroup and sensitivity analyses undertaken. This process leads to more meta-analyses involving very few studies, thus increasing the potential impact of dissemination biases (such as publication bias and outcome reporting bias) if they exist. It also increases the chance of finding a "significant" meta-analysis result simply by chance. For example, using the conventional significance level of 5%, it is expected that 1 in 20 independent significance tests will be "statistically significant" even when there is truly no difference between the interventions being compared; likewise, it is expected that 1 in 20 independent 95% confidence intervals will not include the true

Fig. 5. Recommendations for improving publication bias assessments.
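For illustration, the classical regression test of Egger et al. [8], on which the Harbord [17] and Peters [18] modifications build, can be sketched as follows. The data and function name are hypothetical, and for binary outcomes the Handbook prefers the modified tests; the idea is to regress the standardized effect on precision and inspect the intercept.

```python
import math

def egger_test(effects, std_errors):
    """Classical Egger regression for funnel plot asymmetry:
    regress the standardized effect (y/se) on precision (1/se).
    An intercept far from zero suggests small-study effects.
    Returns the intercept and its t statistic (n - 2 df)."""
    n = len(effects)
    z = [y / s for y, s in zip(effects, std_errors)]  # standardized effects
    x = [1.0 / s for s in std_errors]                 # precisions
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    slope = sxz / sxx
    intercept = zbar - slope * xbar
    # Residual variance and standard error of the intercept
    resid = [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    se_int = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    return intercept, intercept / se_int

# Hypothetical studies where smaller studies show larger effects
intercept, t_stat = egger_test([0.05, 0.15, 0.30, 0.45, 0.60],
                               [0.1, 0.2, 0.3, 0.4, 0.5])
```

In this constructed example, the smaller (less precise) studies report larger effects, so the intercept is clearly positive, indicating funnel plot asymmetry; as noted above, such asymmetry may arise for reasons other than publication bias.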


parameter value. To address this problem in the future, we encourage review authors to focus on a small number of primary outcomes and a small number of secondary outcomes, and to keep the number of subgroup and sensitivity analyses to a minimum; all analyses should be clearly prespecified and of clinical interest. Section 16 of The Cochrane Handbook [9] recommends that "the protocol for the review states which analyses and outcomes are of particular interest (the fewer the better)" and that one should "interpret cautiously any findings that were not hypothesized in advance, even when they are 'statistically significant'." We also suggest that further methodological and empirical research is needed to deal with, and to investigate the impact of, multiple testing in meta-analysis, especially as multiple meta-analyses within the same review may not be independent (e.g., because of outcomes being correlated).

A related issue is that it was often difficult to tell in the Results section, analyses, forest plots, and Discussion section whether the findings being discussed related to primary or secondary outcomes. This concurs with a previous assessment [20], which found that only 47 of 100 Cochrane reviews made a distinction between primary and secondary outcomes. For improved clarity, we suggest structuring the abstract and subsequent sections according to "primary outcomes" and "secondary outcomes," so that secondary analyses are not interpreted too strongly. Examples of good conduct within CPCG reviews include Anim-Somuah et al. [21], who report which meta-analyses relate to primary outcomes, and Boulvain et al. [22], who report which analyses were not prespecified.

4.6. Impact of findings on CPCG reviews

This review was prompted by the editors of the CPCG itself, who want to ensure high statistical standards in CPCG reviews. We have identified numerous statistical shortcomings, which clearly weaken the methodological rigor of existing CPCG reviews.
Further investigation is required to establish if and how these shortcomings impact on the existing conclusions and recommendations within individual CPCG reviews. Regardless, the findings of this review have led to immediate changes in the CPCG to improve methodological rigor. The CPCG has updated both the procedure and the checklists for providing statistical input into protocols and reviews as part of its editorial process; similarly, it has updated the statistical guidelines it provides to authors, which complement those in the Handbook. In particular, (i) CPCG review authors are now being encouraged to better justify random-effects analyses (both clinically and statistically), to interpret random-effects analyses more appropriately (e.g., with prediction intervals), and to assess the potential for publication bias where appropriate; and (ii) the CPCG facilitates "hands-on" statistical support for advanced statistical requirements identified through the editorial process by offering authors access to a statistical consultant if the review team has no in-house statistical
expertise themselves. This statistical work, and that of the CPCG as a whole, is funded by infrastructure and programme grants from the National Institute for Health Research Health Technology Assessment Programme in the United Kingdom and a grant from the National Health and Medical Research Council in Australia.

Supplementary material

Supplementary material can be found in the online version at doi:10.1016/j.jclinepi.2010.08.002.

References

[1] Egger M, Davey Smith G, Altman DG. Systematic reviews in health care: meta-analysis in context. London, UK: BMJ Publishing Group; 2001.
[2] Glass GV. Primary, secondary, and meta-analysis of research. Educational Researcher 1976;5:3-8.
[3] Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ 2003;327:557-60.
[4] DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;7:177-88.
[5] Berkey CS, Hoaglin DC, Antczak-Bouckoms A, Mosteller F, Colditz GA. Meta-analysis of multiple outcomes by regression with random effects. Stat Med 1998;17:2537-50.
[6] Thompson SG, Egger M, Davey Smith G, Altman DG. Why and how sources of heterogeneity should be investigated. London, UK: BMJ Publishing Group; 2001. pp. 157-75.
[7] Rothstein HR, Sutton AJ, Borenstein M. Publication bias in meta-analysis. Chichester, UK: Wiley; 2005.
[8] Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997;315:629-34.
[9] Sutton AJ, Higgins JP. Recent developments in meta-analysis. Stat Med 2008;27:625-50.
[10] Review Manager (RevMan) [Computer Program]. Version 5. Copenhagen, Denmark: The Nordic Cochrane Centre, The Cochrane Collaboration; 2008.
[11] ResearchSoft TI. EndNote Version 5. Berkeley, CA: ResearchSoft TI; 2004.
[12] Deeks JJ, Higgins JPT, Altman DG. Chapter 9: Analysing data and undertaking meta-analyses. In: Higgins JPT, Green S, editors. Cochrane Handbook for Systematic Reviews of Interventions Version 5.0.1. The Cochrane Collaboration; 2008 (updated September 2008; available from www.cochrane-handbook.org; accessed May 2009).
[13] Sterne JAC, Egger M, Moher D. Chapter 10: Addressing reporting biases. In: Higgins JPT, Green S, editors. Cochrane Handbook for Systematic Reviews of Interventions Version 5.0.1. The Cochrane Collaboration; 2008 (updated September 2008; available from www.cochrane-handbook.org; accessed May 2009).
[14] King J, Flenady V, Cole S, Thornton S. Cyclo-oxygenase (COX) inhibitors for treating preterm labour. Cochrane Database Syst Rev 2005;(2):CD001992.
[15] Higgins JP, Thompson SG, Spiegelhalter DJ. A re-evaluation of random-effects meta-analysis. J R Stat Soc Ser A 2009;172:137-59.
[16] Rucker G, Schwarzer G, Carpenter JR, Schumacher M. Undue reliance on I2 in assessing heterogeneity may mislead. BMC Med Res Methodol 2008;8:79.
[17] Harbord RM, Egger M, Sterne JA. A modified test for small-study effects in meta-analyses of controlled trials with binary endpoints. Stat Med 2006;25:3443-57.
[18] Peters JL, Sutton AJ, Jones DR, Abrams KR, Rushton L. Comparison of two methods to detect publication bias in meta-analysis. JAMA 2006;295:676-80.
[19] Bender R, Bunce C, Clarke M, Gates S, Lange S, Pace NL, et al. Attention should be given to multiplicity issues in systematic reviews. J Clin Epidemiol 2008;61:857-65.
[20] Biester K, Lange S. The multiplicity problem in systematic reviews. XIII Cochrane Colloquium, Melbourne, Program and Abstracts. 2005:153. Available at http://www.cochrane.org/colloquia/abstracts/melbourne/P155.htm. Accessed May 2009.
[21] Anim-Somuah M, Smyth R, Howell C. Epidural versus non-epidural or no analgesia in labour. Cochrane Database Syst Rev 2005;(4):CD000331.
[22] Boulvain M, Kelly A, Irion O. Intracervical prostaglandins for induction of labour. Cochrane Database Syst Rev 2008;(1):CD006971.