Preventive Veterinary Medicine 113 (2014) 313–322
Meta-analyses including data from observational studies☆

Annette M. O'Connor a,∗, Jan M. Sargeant b

a Department of Veterinary Diagnostic and Production Animal Medicine, College of Veterinary Medicine, Iowa State University, Ames, IA, USA
b Centre for Public Health and Zoonoses and Department of Population Medicine, University of Guelph, Guelph, Ontario, Canada

☆ This paper is a summary of a presentation given at the Schwabe Symposium in December 2012 honoring the lifetime achievements in veterinary epidemiology and preventive medicine of Dr. Ian Dohoo.
∗ Corresponding author. Tel.: +1 515 294 5012; fax: +1 515 294 1072. E-mail address: [email protected] (A.M. O'Connor).
Article history: Received 25 January 2013; received in revised form 9 October 2013; accepted 19 October 2013.

Keywords: Meta-analysis; Observational studies; Randomized controlled trials
Abstract

Observational studies represent a wide group of studies in which the disease or condition of interest is naturally occurring and the investigator does not control allocation to interventions or exposures. Observational studies are used to test hypotheses about the efficacy of interventions or about exposure-disease relationships, to estimate the incidence or prevalence of conditions, and to assess the sensitivity and specificity of diagnostic assays. Experimental-study designs and randomized controlled trials (RCTs) can also contribute to the body of evidence about such questions. Meta-analyses (either with or without systematic reviews) aim to combine information from primary research studies to better describe the entire body of work. The aim of a meta-analysis may be to obtain a summary effect size, or to understand factors that affect effect sizes. In this paper, we discuss the role of observational studies in meta-analysis questions and some factors to consider when deciding whether a meta-analysis should include results from such studies. Our suggestion is that only studies that are not at high risk of inherent bias should be included when calculating a summary effect size. Study design, however, can be a meaningful variable in the assessment of outcome heterogeneity.
1. Introduction

Meta-analysis is a quantitative approach to combining information from multiple primary research studies (Deeks et al., 2011). When the use of observational studies in meta-analysis is questioned, the concern, although rarely stated explicitly, usually relates to the incorporation of results from observational studies into meta-analyses of comparative efficacy for the assessment of interventions (Norris et al., 2010; Higgins et al., 2013; Reeves et al., 2013). The concern is that the results of the meta-analysis might be misleading due to systematic bias that might be introduced by observational studies.
Although such studies are prone to systematic bias, it is overly simplistic to propose a blanket ban on observational studies in meta-analysis; not all meta-analysis questions relate to comparative efficacy, and even when considering the effects of interventions, not all observational studies are biased beyond utility. A more generalizable approach is to consider several eligibility questions in the context of the particular meta-analysis question:
• What primary study designs are likely to generate results that address the meta-analysis question?
• To what biases are these designs susceptible?
• Is it likely that inclusion of the results from such designs will introduce bias into the calculation of a summary measure obtained from the meta-analysis?
• Will the inclusion of the results of these studies in the meta-analysis provide important information about
factors that affect the effect measure, including study design?
The answers to these questions will provide a basis for the decision about the validity of incorporating the results from observational studies (or any other study designs) into a meta-analysis. The remainder of our paper is divided into two sections. First, we describe the types of meta-analysis questions and the different aims of meta-analysis. We then propose an approach to considering bias from observational study designs for each type of meta-analysis question. We also make brief mention of biases from other designs and their possible effects on meta-analysis.

2. Meta-analysis

A meta-analysis is an observational study in which the unit of concern is the individual study included. A meta-analysis is conducted on a single outcome of interest. A systematic review may have a single outcome or multiple outcomes and therefore single or multiple meta-analyses. Meta-analyses may also be conducted outside a systematic-review framework. A meta-analysis can be used to calculate a summary effect size and/or to explore factors that affect the study effect measure (heterogeneity assessment). We contend that a meta-analysis should always evaluate heterogeneity and sometimes calculate a summary effect size. Like any observational study, a meta-analysis should be designed a priori. For purposes of internal and external validity, the a priori design must incorporate features to reduce systematic bias.
When planning a meta-analysis (regardless of whether or not it is conducted within the framework of a systematic review), the investigators should first determine which meta-analysis question type is applicable. It will then be clear which study designs provide results relevant to the meta-analysis question. From knowledge of the study designs will come knowledge of the potential sources of bias to which each design is subject. Knowledge of potential sources of bias will enable the meta-analyst to determine whether a study design should be included in the assessment of heterogeneity and the calculation of the summary effect. As with any research endeavor, these decisions should be made a priori and described in the protocol. It is not justifiable to select post hoc the particular studies that will or will not be included in the calculation of a summary effect size or subjected to heterogeneity assessment. Such an approach tends to introduce bias and is inconsistent with best practices in research synthesis.

3. Types of questions

Meta-analyses might address four types of questions (EFSA, 2010).
• Questions about the comparative efficacy of therapeutic or preventive interventions. These are often referred to as "PICO(S) questions". The acronym PICO(S) refers to
population (P), intervention (I), comparator (C), outcome (O) and study design (S).
• Questions about the effect of an exposure on the incidence or prevalence of disease, i.e., etiology. These questions are often referred to as "PECO(S) questions." The acronym PECO(S) refers to population (P), exposure (E), comparator (C), outcome (O) and study design (S). The exposure factor may be either protective or a risk factor for the outcome.
• Questions about the prevalence of a condition or state. Such questions are often referred to as "PO questions." The acronym PO refers to population (P) and outcome of interest (O).
• Questions about diagnostic-test accuracy. Such questions are often referred to as "PIT questions." The acronym PIT refers to population (P), index test (I) and target condition (T).

4. Example of a meta-analysis

The forest plot in Fig. 1 contains an example of a graphical approach to reporting a meta-analysis that has calculated a summary effect measure for a PICO or PECO question. The figure is a forest plot representing the results of a pairwise meta-analysis using a random-effects model. Each row represents the results of a single treatment comparison from a primary research study, presented by trial arm. In the plot on the right-hand side, the box in each row represents the study effect size; the horizontal line represents the confidence interval of the effect size. At the lower part of the plot is the summary effect size, represented in this instance by a large diamond. The vertical axis of the diamond indicates the summary risk ratio of 1.01; the horizontal limits of the diamond indicate the confidence interval (CI), in this case a 95% CI from 0.96 to 1.05. Other aspects of the meta-analysis reported include, from left to right, three measures of heterogeneity: I-squared (I²), tau-squared (τ²) and the p value for Cochran's Q test. I² represents the approximate proportion of total variability in the point estimates that can be attributed to heterogeneity. Negative values of I² are converted to zero, so that I² lies between 0% and 100%; a value of 0% indicates no observed heterogeneity, and larger values show increasing heterogeneity. The following guidelines have been suggested for interpreting I² values: 0 to 40%: unimportant; 30% to 60%: moderate heterogeneity; 50% to 90%: substantial heterogeneity; and 75% to 100%: considerable heterogeneity (Deeks et al., 2011). Tau-squared describes the between-study variance, as this is a random-effects model meta-analysis. For Cochran's Q chi-square test of homogeneity, the null hypothesis is that there is homogeneity of effect sizes in the studies included in the analysis; rejection of the null hypothesis therefore implies that heterogeneity is present. This fictitious example would be interpreted as indicating unimportant to moderate heterogeneity and that the intervention does not change the risk of disease.
A meta-analysis should also include an assessment of publication bias. There are numerous ways to study publication bias (Macaskill et al., 2001; Ioannidis and Trikalinos, 2007). Some graphical approaches include production of either a funnel plot or a radial plot (Fig. 2a and 2b). The presence of symmetry in both plots suggests little or no evidence of publication bias.
[Fig. 1 forest plot: 11 fictitious RCTs with events/total per arm, study-level risk ratios with 95% CIs and random-effects weights; random-effects summary RR = 1.01 [0.96; 1.05]; heterogeneity: I² = 38.6%, τ² = 0.0024, p = 0.0915. Axis note: RR < 1 favors the control arm; RR > 1 favors the active arm.]
Fig. 1. Graphical method of presenting a fictitious meta-analysis that aims to calculate a summary effect measure. In this example, a summary risk ratio calculated using a random-effects model is presented.
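As a concrete companion to Fig. 1, the following minimal sketch (our illustration, not the code behind the figure) computes the reported quantities, Cochran's Q, I², the DerSimonian-Laird τ², and the random-effects summary risk ratio with its 95% CI, from the 2×2 counts of the first three fictitious trials.

```python
# Minimal sketch: DerSimonian-Laird random-effects pooling of risk
# ratios, reproducing the kinds of statistics shown in Fig. 1
# (Cochran's Q, I-squared, tau-squared, summary RR). Illustrative only.
import numpy as np
from scipy import stats

a_events = np.array([25, 37, 244])   # active-arm events
a_total  = np.array([31, 40, 289])   # active-arm totals
c_events = np.array([24, 44, 246])   # control-arm events
c_total  = np.array([29, 44, 289])   # control-arm totals

# per-study log risk ratio and its approximate variance
log_rr = np.log((a_events / a_total) / (c_events / c_total))
var = 1 / a_events - 1 / a_total + 1 / c_events - 1 / c_total

# Cochran's Q from the fixed-effect (inverse-variance) weights
w = 1 / var
mu_fixed = np.sum(w * log_rr) / np.sum(w)
Q = np.sum(w * (log_rr - mu_fixed) ** 2)
k = len(log_rr)
p_q = stats.chi2.sf(Q, df=k - 1)

# DerSimonian-Laird tau-squared and I-squared (both truncated at 0)
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / C)
i2 = max(0.0, (Q - (k - 1)) / Q) * 100

# random-effects summary risk ratio and 95% CI
w_re = 1 / (var + tau2)
mu = np.sum(w_re * log_rr) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
rr, lo, hi = np.exp([mu, mu - 1.96 * se, mu + 1.96 * se])
print(f"RR = {rr:.2f} [{lo:.2f}; {hi:.2f}], I2 = {i2:.1f}%, "
      f"tau2 = {tau2:.4f}, p(Q) = {p_q:.4f}")
```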
5. Aims of meta-analysis

Within the meta-analysis question, the aim of a meta-analysis can differ. A meta-analysis might have the explicit aim of calculating a summary effect: a single point estimate that combines the effect size from each study, together with the associated uncertainty (a confidence interval or credibility interval). The summary effect may describe the expected effect of the intervention or the exposure expressed as an odds ratio, risk ratio or mean difference, the prevalence, or the sensitivity and specificity. A meta-analysis can also be used to understand sources of heterogeneity in the effect size, i.e., to understand patterns in the outcome (Shmueli, 2002; Beyer, 2006). These two aims are closely related. For example, prior to calculating a summary effect, heterogeneity should always be assessed.
Heterogeneity might be contextual or methodological. Contextual heterogeneity refers to demographic characteristics that modify the effect size in studies. Examples of factors often investigated as potential sources of contextual heterogeneity include region, average age, and production system. In clinical settings, contextual heterogeneity may also be referred to as clinical heterogeneity; to capture the same concept in broader settings, such as food safety or animal welfare, the term contextual heterogeneity is preferable. For example, an aim of a meta-analysis about prevalence may be to explore whether breed or region is a source of contextual heterogeneity. Such information might be important for designing surveillance programs. Methodological heterogeneity arises from differences in design features within study designs or from differences across study designs. In the context of this paper, i.e., the role of observational studies in meta-analysis, we discuss only study design as a potential source of methodological heterogeneity.
Fig. 2. Graphical approaches to presenting a fictitious meta-analysis that aims to explore publication bias. The plot on the left-hand side is a funnel plot; the plot on the right-hand side is a radial plot.
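As a quantitative companion to the graphical checks in Fig. 2, the sketch below implements one widely used asymmetry test, Egger's regression (among the methods compared by Macaskill et al., 2001): the standardized effect is regressed on precision, and an intercept far from zero suggests funnel-plot asymmetry. The study data are fictitious; this is our illustration, not the paper's analysis.

```python
# Sketch of Egger's regression test for funnel-plot asymmetry:
# regress (effect / SE) on (1 / SE); the intercept estimates
# small-study asymmetry. Fictitious log risk ratios and SEs.
import numpy as np
from scipy import stats

log_rr = np.array([-0.03, -0.08, -0.01, -0.08, 0.08, 0.19,
                   0.06, 0.05, 0.00, -0.12, 0.10])
se = np.array([0.12, 0.05, 0.03, 0.08, 0.08, 0.10,
               0.09, 0.05, 0.06, 0.06, 0.05])

y = log_rr / se                                    # standardized effect
X = np.column_stack([np.ones_like(se), 1.0 / se])  # intercept, precision

# ordinary least squares with a t-test on the intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
n, p = X.shape
sigma2 = resid @ resid / (n - p)
cov = sigma2 * np.linalg.inv(X.T @ X)
t_stat = beta[0] / np.sqrt(cov[0, 0])
p_val = 2 * stats.t.sf(abs(t_stat), df=n - p)
print(f"Egger intercept = {beta[0]:.2f} (p = {p_val:.3f}); "
      "symmetry is plausible when the intercept is near zero")
```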
[Fig. 3 forest plot, subgroup analysis by study design: "Other design" subgroup (8 studies), pooled RR = 0.98 [0.93; 1.03], I² = 0%; RCT subgroup (11 studies), pooled RR = 0.99 [0.95; 1.04], I² = 25.3%; overall random-effects RR = 0.99 [0.96; 1.02], I² = 9.4%, τ² = 0.0004, p = 0.3402. Axis note: RR < 1 favors the active arm; RR > 1 favors the control arm.]
Fig. 3. This fictitious meta-analysis provides an example where study design does not appear to be a source of methodological heterogeneity.
Examples of graphical and quantitative approaches to exploring sources of heterogeneity related to study design for a PECO or PICO question are illustrated in Figs. 3 and 4. Both figures present a subgroup analysis based on study design, and the effect-size measure is the risk ratio (RR). In Fig. 3, for the subgroup "other design" the summary risk ratio is 0.98 (95% CI = 0.93, 1.03). For the randomized controlled trial (RCT) subgroup the summary risk ratio is 0.99 (95% CI = 0.95, 1.04), and the summary RR (sRR) from both designs is 0.99 (95% CI = 0.96, 1.02). In this example, study design does not seem to affect the effect size; the estimates for each study design are quite similar, and the summary effect seems to reflect the combined data well (albeit with a narrower CI). The subgroup effect can be tested formally with a hypothesis test, using a chi-square test similar to Cochran's Q (Deeks et al., 2011). In this example, the p value for the chi-square test comparing the subgroups is 0.6 (not included on the figure). In contrast, in Fig. 4 the effect in the "other design" subgroup suggests a protective effect (sRR = 0.76, 95% CI = 0.70, 0.82) whereas the data from RCTs suggest no effect (sRR = 1.00, 95% CI = 0.95, 1.06). The non-overlapping 95% confidence intervals suggest these differences are statistically significant, and this can be tested using a chi-square test comparing the subgroups (p < 0.0001). The summary effect measure combining data from both designs does not reflect either design-specific estimate (sRR = 0.92, 95% CI = 0.86, 0.99). For the example in Fig. 4, using a meta-analysis to calculate a summary effect measure across all study designs would be misleading. It is this situation that should be avoided. However, if one of the aims of the meta-analysis was to understand whether study design was a source of methodological heterogeneity, then including data from observational studies achieves that aim. The reasons that study design was associated with different effect measures should then be explored. Another approach to exploring heterogeneity is meta-regression (van Houwelingen et al., 2002). Meta-regression is a regression analysis with each study result as the outcome and study characteristics, including design if multiple designs are included, as explanatory variables.
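A minimal sketch of the formal subgroup comparison described above is given below: each design subgroup is pooled with a DerSimonian-Laird random-effects model, and the pooled estimates are then compared with a chi-square statistic on (number of subgroups − 1) degrees of freedom. The study-level inputs are fictitious and are not the data behind Figs. 3 and 4.

```python
# Sketch: chi-square test for subgroup differences (analogous to
# Cochran's Q; see Deeks et al., 2011). Each subgroup is pooled with
# a DerSimonian-Laird random-effects model, then the pooled estimates
# are compared. Fictitious log risk ratios and variances.
import numpy as np
from scipy import stats

def dl_pool(log_rr, var):
    """Return the random-effects pooled log-RR and its standard error."""
    w = 1 / var
    mu_f = np.sum(w * log_rr) / np.sum(w)
    q = np.sum(w * (log_rr - mu_f) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(log_rr) - 1)) / c)
    w_re = 1 / (var + tau2)
    mu = np.sum(w_re * log_rr) / np.sum(w_re)
    return mu, np.sqrt(1 / np.sum(w_re))

pooled = {
    "RCT": dl_pool(np.array([-0.03, -0.08, -0.01, 0.05]),
                   np.array([0.015, 0.003, 0.001, 0.004])),
    "other design": dl_pool(np.array([-0.27, -0.19, -0.29, -0.24]),
                            np.array([0.010, 0.018, 0.012, 0.008])),
}

mus = np.array([mu for mu, _ in pooled.values()])
ws = np.array([1 / s ** 2 for _, s in pooled.values()])
mu_all = np.sum(ws * mus) / np.sum(ws)
q_between = np.sum(ws * (mus - mu_all) ** 2)
p = stats.chi2.sf(q_between, df=len(mus) - 1)
print(f"Q_between = {q_between:.2f}, p = {p:.4f}")
```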
6. Summary effect-size calculation for PICO questions

Study designs that might provide data for a meta-analysis to calculate a summary effect measure for a comparative-efficacy question include randomized controlled trials (RCTs) and non-randomized studies (NRS). Non-randomized study designs include non-randomized controlled trials (trials that used methods other than randomization to allocate subjects to the intervention), observational studies that associate disease incidence with risk factors (incidence case-control studies and cohort studies), and observational studies that associate disease prevalence with risk factors (prevalence case-control studies and cross-sectional studies), in which the investigator has no control over allocation to intervention groups.
[Fig. 4 forest plot, subgroup analysis by study design: "Other design" subgroup (8 studies), pooled RR = 0.76 [0.70; 0.82], I² = 0%; RCT subgroup (11 studies), pooled RR = 1.00 [0.95; 1.05], I² = 42.9%; overall random-effects RR = 0.92 [0.86; 0.99], I² = 72%, τ² = 0.0142, p < 0.0001. Axis note: RR < 1 favors the active arm; RR > 1 favors the control arm.]
Fig. 4. This fictitious meta-analysis provides an example where study design does appear to be a source of methodological heterogeneity.
Whether the results of all, some, or only one design should be included in the calculation of a summary effect size will be topic-specific and based on the potential for bias and relevance (Reeves et al., 2011; Higgins et al., 2013; Reeves et al., 2013). If the decision is made to include data from NRS in a meta-analysis, separate meta-analyses for each design should be presented alongside the summary effect size; Figs. 3 and 4 illustrate this approach. This enables others to assess the validity of the decision.
One of the main reasons for considering the inclusion of designs other than RCTs in a meta-analysis is that few relevant RCTs might be available. The potential to add data from quasi-randomized trials and observational studies is appealing because this often bolsters the number of studies available for calculating the summary effect and exploring heterogeneity. Doing so might narrow the confidence interval (or credibility interval) of the summary effect measure. Another reason to use designs other than RCTs is that RCTs might be unavailable because random allocation is not possible or would be unethical. In this situation, the only possible sources of data for a summary effect-size calculation would be non-randomized controlled trials or observational studies. However, even within those designs, the potential impact of bias should be considered; only certain designs might be appropriate for inclusion. Finally, the outcome of interest might be an adverse event or harm, which randomized controlled trials are rarely designed to detect. To provide a comprehensive analysis of the effect of the intervention,
adverse events or harms should be included; excluding data about harms because they do not arise from RCTs may present a biased overall picture of the intervention.

6.1. Potential for bias in calculation of summary effect size

Because the purpose of randomization is to reduce the potential for selection bias (which will manifest itself as confounding), the concern about including NRS is that confounded effect sizes will bias the summary effect measure. For designs where the potential for selection bias to confound the outcome is considered high, results for that study design should be excluded from the calculation of the summary effect measure. Confounding by indication is a common source of selection bias in observational studies but may be less relevant for some topics in non-randomized controlled trials. A veterinary example where the potential for selection bias in some observational study designs might be too great for the study to be included in the calculation of the summary effect measure is the effect of pre-weaning vaccination programs on health outcomes in feedlot calves. The evidence base for this area contains a few small RCTs and numerous observational studies. The observational studies are cohort, incidence case-control, prevalence case-control and cross-sectional designs. It might be tempting to include the observational data in a meta-analysis because they are numerous; however, this is also an intervention
where the potential for selection bias to confound the outcome is high. Producers who vaccinate calves pre-weaning might also have other characteristics that positively influence health outcomes; therefore, the use of data from studies with poor control for potential confounding variables could be expected to inflate the positive effects of pre-weaning vaccination. Alternatively, it could be decided to include any studies with clearly reported approaches to controlling for important known confounding variables, whether by random allocation, restricted randomization or multivariable analyses. Such decisions are topic-specific, and require subject-matter expertise (to identify known and anticipated confounders), expertise in study design (to anticipate sources of selection bias), documentation of the decisions prior to the start of the meta-analysis, and transparent reporting (Stroup et al., 2000; Verbeek, 2008; Liberati et al., 2009; Vandenbroucke, 2009).
In other situations, the potential for selection bias due to non-randomization might be considered minimal. For example, if a meta-analysis were assessing the impact of a carcass wash on the prevalence of Salmonella on carcasses, it might be reasonable to plan a priori to include quasi-randomized trials and perhaps even observational studies. For this topic, a non-randomized controlled trial might be used to assess the comparative effect of carcass washes in reducing microbial contamination on poultry carcasses. For this intervention, it may only be practical to wash carcasses on one line with one product and carcasses on another line with a different product. Although allocation to group is based on non-random methods, the extent of selection bias might be judged to be minimal. Of course, this assumes that the authors of the non-randomized controlled trials described the method of allocation in a manner that enables assessment; this is often not the case (O'Connor et al., 2006; Sargeant et al., 2006; Denagamage et al., 2007; Burns and O'Connor, 2008; Sargeant et al., 2009, 2010). As mentioned, post hoc selection of studies for inclusion in the meta-analysis after data extraction is not consistent with the systematic review process.

6.2. Multiple outcomes of interventions including adverse events and harms

Although each meta-analysis assesses only one outcome, a systematic review and/or guideline-development process may include multiple outcomes so that the entire effect of the intervention is considered. For example, a systematic review of the effect of bovine somatotropin considered the effects of the intervention on production, animal health and animal welfare (Dohoo et al., 2003a, 2003b). The eligibility of data from non-randomized studies must be assessed at the protocol-design stage for each outcome for each intervention. Some outcomes of intervention use, particularly harmful or adverse outcomes, are rare, and the only source of information is likely to be non-randomized studies. Therefore, a systematic review with multiple outcomes must assess the eligibility of non-randomized studies for each outcome. Excluding non-randomized studies may be valid for intended outcomes where the potential for selection
bias is high. However, for the same intervention, excluding data from non-randomized studies about adverse events and harms would remove important information from the review process. In human clinical medicine, it has even been proposed that the potential for selection bias for adverse outcomes is low (Vandenbroucke and Psaty, 2008). It is unclear whether this inference is valid for interventions in animal health and food safety.

6.3. Challenge studies in PICO meta-analysis

Challenge studies (where allocation to exposure and the disease outcome are deliberate and under the control of the investigator) are a unique and common design used to assess interventions in veterinary medicine. Two topics relevant to challenge studies are selection bias and applicability. Randomization can and should be used to allocate animals to groups in challenge designs. However, the group sample sizes are often insufficient to allow the researcher to be confident that the groups are likely exchangeable. In these situations, even randomized challenge studies may be considered to have a high potential for selection bias unless other design tools are used that specifically aim to limit confounding. For example, restricting the population to a very narrow spectrum and the use of blocking and stratification may be needed to limit selection bias. When correctly executed, these design tools provide results with high internal validity and few sources of contextual and methodological heterogeneity. However, even well-executed challenge studies can have low applicability to the disease of interest. Incorporation of estimates of intervention efficacy from challenge studies might introduce bias into the meta-analysis if the disease model does not accurately represent the effect of the intervention in either the population (P) or the outcome (O) of interest. As such, the inclusion of results from challenge studies in a meta-analysis can bias the summary effect size and erroneously increase certainty about the summary effect estimate. As with other designs, if the decision is made to include challenge-trial data in a meta-analysis, separate meta-analyses for challenge studies and for RCTs should be presented alongside the summary effect size.
Although there is no empirical evidence to support the assertion, we hypothesize that the more similar the challenge disease model is to the naturally occurring disease, and the more similar the study population is to the target population, the more relevant the results from challenge studies are likely to be. Returning to the example of a meta-analysis of the effect of pre-weaning vaccination on the health of calves in feedlots: numerous challenge studies describe the effect of different vaccines in bovine respiratory-disease models. Such models are rarely able to capture the complexity of bovine respiratory-disease causation and rarely use feedlot-age, beef-breed calves in high-stress situations. A reasonable argument could be made, therefore, for excluding the results of such challenge studies from the meta-analysis. For other topics, challenge studies might provide a very reasonable representation of the expected effect of an intervention and should be included in a meta-analysis. For example, the results of challenge studies that describe the efficacy of tetanus vaccines in equids might accurately represent the effects seen in an RCT.
[Fig. 5 schematic, three panels:
a. A hypothesis-testing study in which the exposure-disease (E-D) relationship is of interest to both the study and the review; potential confounding variables and effect modifiers are deliberately collected. The magnitude of effect observed for the E-D relationship is not a by-product of the study; the estimate of disease prevalence is a by-product of the original study.
b. A hypothesis-testing study of an E-D relationship other than that of interest to the review; the potential confounding variables and effect modifiers deliberately collected include the exposure variable of interest to the meta-analysis. The magnitude of effect observed for the E-D relationship of interest, and the estimate of prevalence, are by-products of the original study.
c. A survey designed to estimate the prevalence of a specific condition(s). Associations observed are likely hypothesis-generating because purposeful adjustment for confounders did not occur; the magnitudes of effect observed for E-D relationships are likely confounded.]
Fig. 5. Schematic diagrams depicting approaches to obtaining estimates of exposure-disease associations from observational studies.
Decisions about the inclusion and exclusion of challenge-study results must be topic-specific. As with observational studies, such decisions require topic expertise, expertise in study design, documentation of decisions, and transparent reporting.

7. Heterogeneity assessment for PICO(S) questions

Not all meta-analyses aim to calculate a summary effect measure. Another reason to conduct a meta-analysis is to study sources of outcome heterogeneity formally (either the extent to which heterogeneity occurs in the population or the causes of differences in the outcome between studies). Because study design is a potential source of methodological heterogeneity, results from NRS should be included in meta-analyses if the aim is to understand whether study design is a source of methodological heterogeneity. Study design can be included as a subgroup or as a risk factor in meta-regression. The results of such analyses can provide insight into the reasons for heterogeneity.
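A minimal meta-regression sketch with study design as the explanatory variable follows; it fits a weighted least-squares regression with inverse-variance weights and, for brevity, treats the between-study variance τ² as known rather than estimating it. The inputs and the assumed τ² are fictitious.

```python
# Sketch: meta-regression (van Houwelingen et al., 2002) with a
# study-design indicator as the explanatory variable. Weights are
# 1 / (within-study variance + tau2); tau2 is assumed known here.
import numpy as np
from scipy import stats

log_rr = np.array([-0.03, -0.08, -0.01, -0.27, -0.19, -0.24])
var = np.array([0.015, 0.002, 0.001, 0.010, 0.018, 0.008])
is_rct = np.array([1, 1, 1, 0, 0, 0])  # 1 = RCT, 0 = other design
tau2 = 0.002                            # assumed between-study variance

w = 1 / (var + tau2)
X = np.column_stack([np.ones_like(log_rr), is_rct])

# weighted least squares via the square-root-of-weights transformation
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(X * sw[:, None], log_rr * sw, rcond=None)

# covariance with the study variances treated as known (fixed scale)
cov = np.linalg.inv(X.T @ (w[:, None] * X))
z = beta[1] / np.sqrt(cov[1, 1])
p = 2 * stats.norm.sf(abs(z))
print(f"log-RR difference for RCT vs. other = {beta[1]:.2f}, p = {p:.4f}")
```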
8. Summary effect-size calculation for PECO(S) questions

In veterinary science, the study designs that can be used to answer questions about exposure-disease (E-D) relationships are RCTs, non-randomized controlled trials, and observational studies (incidence case-control, cohort, prevalence case-control, cross-sectional). These are the same designs available for PICO questions, which highlights that the distinction between PECO and PICO studies is somewhat artificial and difficult to delineate (Higgins et al., 2013). It might be tempting to consider that PICO questions relate to factors that can be manipulated (such as vaccines), and PECO questions to factors that cannot be manipulated (such as climate, region, barns, or processing plants). However, this distinction is not always useful, because exposures that at one time are studied as "risk factors" are often re-evaluated as "interventions." For example, consider a meta-analysis aimed at evaluating the association between exercise and cholesterol levels. In the body of work available, one source of data might be a population-based cohort study evaluating lack of exercise as a risk factor for high cholesterol. In the same body of work could be an RCT that allocated participants to exercise or no exercise to study the potential to reduce cholesterol. The results of both studies could be relevant to the meta-analysis. Due to this overlap, much of what is relevant for PICO meta-analyses is relevant to PECO meta-analyses. Regardless of the study design, if the results from a specific study design are likely to be so biased as to be misleading, that study design should not be included in the calculation of the summary effect measure.
8.1. Potential for bias in calculation of summary effect size

For PECO questions, the majority of studies, if not all, will be observational. Meta-analysis of observational studies for causal relationships is controversial because of the potential for confounding bias. As before, decisions about the inclusion of observational studies require expertise in the topic, study design, and data analysis. One useful approach to evaluating the potential for confounding bias for questions about etiology is to consider the possible ways that observational studies can come to report the association of interest (Fig. 5). Ideally, the aim of the original primary study would be to test a hypothesis about the specific E-D relationship of interest to the meta-analyst (Fig. 5a). In this scenario, the study should have been designed to obtain the most accurate "adjusted" estimate (taking into account known confounders). In the absence of RCTs or non-randomized controlled trials, the meta-analyst might be confident that the potential for confounding bias has been adjusted for to the greatest extent possible, and might reasonably expect that data may be combined from multiple similarly designed hypothesis-testing observational studies. An example of this situation would be a group of observational studies of the association between living near a confined animal-feeding operation (CAFO) and the prevalence of childhood asthma, a question that cannot be answered by an RCT or a non-randomized controlled trial (O'Connor et al., 2010). For this topic, the only evidence available will come from observational studies. Some experts might argue it is possible to identify the major confounders that should be controlled: atopy, parental atopy, smoking in the household, presence of pets, and wood fires. Matching, restriction, or analytic approaches can be employed to control the impact of these confounders on the study effect size. The meta-analyst could limit the eligibility of studies included in the calculation of the summary effect size to studies that were purposefully designed to test the E-D question and adjusted for the known confounders. Despite the prospective and purposeful design of the observational studies, the meta-analyst should still acknowledge the possibility of residual confounding in the estimate of the summary effect size. Regrettably, the need to determine whether purposeful control of confounding was used leaves the meta-analyst at the mercy of the approach to reporting used by the authors, reviewers, and editors of the original paper. Currently, approaches to adjusting for confounding are often poorly reported (Groenwold et al., 2008).
Another way that an observational study could report the E-D association of interest to a PECO meta-analysis is as a by-product of a different hypothesis (Fig. 5b). The aim of the original observational study might have been to test a hypothesis about an alternative E-D relationship. Information about the E-D relationship of interest to the meta-analysis is then available only as a by-product of controlling for confounding for a different hypothesis. In this situation, the observed association between the E-D of interest to the meta-analyst has a strong potential to be biased because the study was not purposefully designed to
address all the confounders of this particular E-D relationship. Therefore, in this situation, the meta-analyst might reasonably decide that the results of such studies should not be used in the calculation of a summary effect measure because of the strong potential for confounding bias.
Other situations that potentially could be included in a meta-analysis of an E-D relationship are described in Fig. 5c. These are hypothesis-generating studies not purposefully designed to study any particular relationship. In veterinary science, this approach is common. For example, a dataset may be "mined" for hypothesis-generating exposure-disease associations (as may occur when working with production record-keeping systems such as swine- or dairy-health improvement datasets). Because no particular E-D relationship is identified a priori, it is likely that the association of interest to the review does not control for the known confounders relevant to the meta-analyst's question. It is necessary to consider the design and model-building approach to determine whether the estimate should be incorporated. If the study design and model-building approach were such that all possible confounders were included in the design and model, then it might be appropriate to include the association in the meta-analysis. This is most likely to occur when datasets are large, multiple variables are available, the authors did not use any variable-exclusion processes, and all known confounders are present in the model. Alternatively, if the authors report an association derived from algorithm-driven approaches to variable selection (such as backwards stepwise selection), then the impact of confounding variables has not been considered for any reported E-D associations. Such E-D associations should not be included in the meta-analysis.

9. Heterogeneity for PECO(S) questions

For PECO questions, the formal exploration of study design as a source of publication bias or methodological heterogeneity can be very insightful. For example, if there is concern about selection bias in case-control studies compared to cohort studies, this can be studied formally using subgroup analysis and meta-regression. The results of such investigations might contribute importantly to the understanding of causal relationships by providing empirical evidence of differences in effect estimates; the results might also support hypotheses about sources of bias.

10. Summary effect-size calculation for PO questions

The study designs that can be used to answer questions about the prevalence or incidence of an outcome are surveys. However, it is also common for estimates of prevalence or incidence to be reported as a by-product of hypothesis-testing observational studies (incidence case-control, cohort, cross-sectional), hypothesis-generating observational studies, and even the control arms of non-randomized controlled trials and RCTs. Here again, we distinguish between studies that provide estimates of prevalence or incidence because that was the purpose of the study and studies where the prevalence or incidence estimate is a by-product.
Surveys designed to estimate the prevalence or incidence of the outcome of interest will likely have addressed potential sources of bias (such as volunteer bias) by using random recruitment into the study to ensure a representative population. However, hypothesis-testing and hypothesis-generating observational studies will not necessarily have taken any steps to ensure that the estimate of prevalence or incidence was obtained from a representative population (because such a criterion is not necessary for a valid assessment of an E-D relationship or intervention efficacy). To use an example from human disease, a study might be designed to estimate the association between smoking and lung cancer over 20 years in men aged 50–60 who live on a Greek island. Such a study, if well designed, might give a valid estimate of the E-D relationship in this population, but the estimates of lung-cancer incidence and smoking prevalence are likely not representative of a broader target population.

11. Heterogeneity assessment for PO questions

Frequently, the aim of a meta-analysis of a PO question is not to calculate a summary effect size (i.e., a summary estimate of prevalence) but instead to describe the extent to which the prevalence or incidence estimates vary in a broadly defined population. Understanding the sources of variation in prevalence or incidence can be critical. For example, a single summary measure of the prevalence of Salmonella culture-positive pigs in the United States is likely not very useful. However, knowledge that most farms are culture-negative and a few farms are strongly culture-positive, or that different production stages have different infection prevalences, may be very useful. A meta-analyst might want to quantify this heterogeneity. In this situation, if the study population in the primary study falls within the target population, then estimates of prevalence from surveys, observational studies designed for other purposes, and the control arms of randomized trials all provide naturally occurring estimates. Assuming that information bias is equal across all studies, these estimates are likely to be important sources of data and should not be excluded; their exclusion would lead to an underestimation of variation. Both study design and population demographic information could be evaluated and presented as potential sources of heterogeneity.
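To make this concrete, the sketch below pools fictitious farm-level prevalence estimates on the logit scale and reports the between-study variance (τ²) and I² rather than emphasizing a single summary prevalence. The counts and the 0.5 continuity correction are illustrative assumptions.

```python
# Sketch: heterogeneity of prevalence estimates for a PO question.
# Study-level prevalences are pooled on the logit scale; tau2 and
# I-squared describe the spread across farms/studies. Fictitious data.
import numpy as np
from scipy import stats

positives = np.array([2, 0, 45, 3, 60, 1])        # e.g. culture-positive pigs
sampled = np.array([100, 80, 120, 90, 110, 95])   # pigs sampled per study

# logit prevalence with a 0.5 continuity correction for zero counts
p_adj = (positives + 0.5) / (sampled + 1.0)
logit = np.log(p_adj / (1 - p_adj))
var = 1 / (positives + 0.5) + 1 / (sampled - positives + 0.5)

w = 1 / var
mu = np.sum(w * logit) / np.sum(w)
Q = np.sum(w * (logit - mu) ** 2)
k = len(logit)
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / C)
i2 = max(0.0, (Q - (k - 1)) / Q) * 100
print(f"tau2 (logit scale) = {tau2:.2f}, I2 = {i2:.0f}%, "
      f"p(Q) = {stats.chi2.sf(Q, k - 1):.2g}")
```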
12. Summary effect-size calculation and heterogeneity assessment for PIT questions

A meta-analysis of diagnostic tests might aim to calculate a summary measure of sensitivity and specificity or a summary ROC curve, or the aim may be to explore sources of variation in study effects. By their nature, evaluations of diagnostic-test accuracy require observational studies in which the purpose is to assess the sensitivity and specificity of a test to detect the naturally occurring disease or condition. The test could be a clinical examination, a questionnaire, or a diagnostic assay. For PIT questions, assessment of the diagnostic tests in situations with naturally occurring disease in the target animal species is strongly preferred for realistic representation of the assay's performance.
Diagnostic test-assessment studies should be distinguished from diagnostic-validation studies; it is unlikely that the latter designs, which are experimental, are relevant to PIT meta-analysis. Diagnostic-validation studies determine the limit of analytical sensitivity or specificity for detection of a pathogen or toxin in laboratory settings, and are used in the development stages for assays. Such studies use samples of known concentrations in very controlled settings to determine the detection limit. Alternatively, diagnostic test-accuracy studies assess the ability of the test to detect infection (or disease) in groups of animals with naturally occurring exposure. As such, these animals may have characteristics that affect the diagnostic accuracy of the test, such as concurrent disease or vaccination. The relevance of data from diagnostic-validation studies to the calculation of a summary measure of diagnostic-test accuracy is analogous to the relevance of challenge studies and RCTs. In some situations, the two might provide similar results, but in other circumstances the highly controlled study (the diagnostic-validation study and the challenge study) will overestimate the effect size because it has reduced sources of heterogeneity. Consequently, the meta-analyst will need to consider the relevance of data from the more-controlled experiment to the review question. This issue is largely limited to diagnostic tests for pathogens. For example, although a diagnostic-validation study can be conducted for detection of E. coli O157 in cattle feces by polymerase chain reaction, there is no diagnostic-validation assay for the ability of ultrasound to detect a liver pathology, or of a questionnaire to detect poor biosecurity.
For PECO and PO questions that aim to calculate a summary effect measure, we suggested that it is necessary to be careful about the incorporation of results that were a by-product of other research endeavors. However, for PIT questions, given the need to test specimens purposefully using two tests to conduct a study of diagnostic-test accuracy, we do not anticipate that estimates of sensitivity and specificity are frequent by-products of other studies.
Designs used for diagnostic-test assessment (DTA) are less well defined than designs used to understand the epidemiology of disease. DTA designs are cross-sectional in nature; i.e., study subjects have the disease or not and are tested using two or more assays. Within this general framework, DTA study designs are described as one-gate or two-gate designs; this nomenclature differentiates studies based on the method of enrolling study subjects (Rutjes, 2006). A one-gate design has a single set of inclusion criteria for all study participants, whereas a two-gate design has different criteria for those with and without the target condition. As with observational studies for etiological questions, the choice of design might introduce bias; therefore, decisions about the inclusion of the results from a particular design in a meta-analysis should be made a priori. Just as for etiological and intervention studies, the potential for bias associated with a one-gate or a two-gate design differs by topic area.
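To illustrate the PIT effect measures, the sketch below computes per-study sensitivity and specificity from 2×2 counts and pools each on the logit scale; this univariate pooling is a simplification of the bivariate models usually preferred for diagnostic-test-accuracy meta-analysis. The counts are fictitious.

```python
# Sketch: pooling sensitivity and specificity for a PIT question.
# Each study contributes a 2x2 table [TP, FN, TN, FP]; Se and Sp are
# pooled separately on the logit scale (a univariate simplification).
import numpy as np

counts = np.array([[45, 5, 90, 10],
                   [38, 12, 85, 15],
                   [50, 2, 70, 20]])
tp, fn, tn, fp = counts.T

def pool_logit(x, n):
    """Inverse-variance pooled proportion on the logit scale."""
    p = (x + 0.5) / (n + 1.0)            # continuity-corrected proportion
    logit = np.log(p / (1 - p))
    var = 1 / (x + 0.5) + 1 / (n - x + 0.5)
    mu = np.sum(logit / var) / np.sum(1 / var)
    return 1 / (1 + np.exp(-mu))         # back-transform to a proportion

sens = pool_logit(tp, tp + fn)   # pooled sensitivity
spec = pool_logit(tn, tn + fp)   # pooled specificity
print(f"pooled sensitivity = {sens:.2f}, pooled specificity = {spec:.2f}")
```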
There are numerous sources of bias that can occur within studies of diagnostic-test accuracy, including spectrum bias, observer bias, and reference-test bias (Begg, 1987, 1988, 2008); it is beyond the scope of this paper to discuss these biases. Because of the number of biases that can occur, an expert in diagnostic-test-accuracy design and bias should be consulted. Further, it seems that a lack of general knowledge of diagnostic-test-accuracy study design and sources of bias means that reports of diagnostic tests are frequently inadequate for making meaningful assessments of bias (Ell, 2003; Coppus et al., 2006; Christopher, 2007).

13. Conclusion

Depending upon the meta-analysis question and aim, it might be necessary or desirable to include data from observational studies when calculating a summary effect size. Regardless of the question or the primary research design, the major concern is the incorporation of biased study-level effect sizes into the summary effect size, leading to misleading results. Prior to conducting a meta-analysis, it is necessary to determine whether the purpose is to explore sources of heterogeneity or to calculate a summary effect size. If the objective is to calculate a summary effect size, then it is necessary to identify the study designs that can contribute information to the meta-analysis and the biases that can occur within each design. A priori decisions about design-based inclusion criteria should be made. Topic-specific experts and study-design experts will be needed for this process. The rationale and criteria for inclusion of any study design should be included in a comprehensively reported meta-analysis.

Conflict of interest statement

The authors declare that they have no personal or financial conflicts of interest that might have inappropriately influenced this paper.

References

Begg, C.B., 1987. Biases in the assessment of diagnostic tests. Stat. Med. 6, 411–423.
Begg, C.B., 1988. Methodologic standards for diagnostic test assessment studies. J. Gen. Intern. Med. 3, 518–520.
Begg, C.B., 2008. Meta-analysis methods for diagnostic accuracy. J. Clin. Epidemiol. 61, 1081–1082, discussion 1083–1084.
Beyer, W.E., 2006. Heterogeneity of case definitions used in vaccine effectiveness studies and its impact on meta-analysis. Vaccine 24, 6602–6604.
Burns, M.J., O'Connor, A.M., 2008. Assessment of methodological quality and sources of variation in the magnitude of vaccine efficacy: a systematic review of studies from 1960 to 2005 reporting immunization with Moraxella bovis vaccines in young cattle. Vaccine 26, 144–152.
Christopher, M.M., 2007. Improving the quality of reporting of studies of diagnostic accuracy: let's STARD now. Vet. Clin. Pathol. 36, 6.
Coppus, S.F., van der Veen, F., Bossuyt, P.M., Mol, B.W., 2006. Quality of reporting of test accuracy studies in reproductive medicine: impact of the Standards for Reporting of Diagnostic Accuracy (STARD) initiative. Fertil. Steril. 86, 1321–1329.
Deeks, J.J., Higgins, J.P.T., Altman, D.G., 2011. Chapter 9: Analysing data and undertaking meta-analyses. In: Higgins, J., Green, S. (Eds.), Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0. Available from http://www.cochrane-handbook.org.
Denagamage, T.N., O'Connor, A.M., Sargeant, J.M., Rajic, A., McKean, J.D., 2007. Efficacy of vaccination to reduce Salmonella prevalence in live and slaughtered swine: a systematic review of literature from 1979 to 2007. Foodborne Pathog. Dis. 4, 539–549.
Dohoo, I.R., DesCoteaux, L., Leslie, K., Fredeen, A., Shewfelt, W., Preston, A., Dowling, P., 2003a. A meta-analysis review of the effects of recombinant bovine somatotropin. 2. Effects on animal health, reproductive performance, and culling. Can. J. Vet. Res. 67, 252–264.
Dohoo, I.R., Leslie, K., DesCoteaux, L., Fredeen, A., Dowling, P., Preston, A., Shewfelt, W., 2003b. A meta-analysis review of the effects of recombinant bovine somatotropin. 1. Methodology and effects on production. Can. J. Vet. Res. 67, 241–251.
EFSA, 2010. Application of systematic review methodology to food and feed safety assessments to support decision making. EFSA J. 8, 1–90.
Ell, P.J., 2003. STARD and CONSORT: time for reflection. Eur. J. Nucl. Med. Mol. Imaging 30, 803–804.
Groenwold, R.H., Van Deursen, A.M., Hoes, A.W., Hak, E., 2008. Poor quality of reporting confounding bias in observational intervention studies: a systematic review. Ann. Epidemiol. 18, 746–751.
Higgins, J.P.T., Ramsay, C., Reeves, B.C., Deeks, J.J., Shea, B., Valentine, J.C., Tugwell, P., Wells, G., 2013. Issues relating to study design and risk of bias when including non-randomized studies in systematic reviews on the effects of interventions. Res. Synth. Methods 4, 12–25.
Ioannidis, J.P., Trikalinos, T.A., 2007. The appropriateness of asymmetry tests for publication bias in meta-analyses: a large survey. CMAJ 176, 1091–1096.
Liberati, A., Altman, D.G., Tetzlaff, J., Mulrow, C., Gotzsche, P.C., Ioannidis, J.P., Clarke, M., Devereaux, P.J., Kleijnen, J., Moher, D., 2009. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. J. Clin. Epidemiol. 62, e1–e34.
Macaskill, P., Walter, S.D., Irwig, L., 2001. A comparison of methods to detect publication bias in meta-analysis. Stat. Med. 20, 641–654.
Norris, S., Atkins, D., Bruening, W., 2010. Selecting observational studies for comparing medical interventions. In: Agency for Healthcare Research and Quality, Methods Guide for Comparative Effectiveness Reviews [posted June 2010]. Rockville, MD.
O'Connor, A.M., Auvermann, B., Bickett-Weddle, D., Kirkhorn, S., Sargeant, J.M., Ramirez, A., Von Essen, S.G., 2010. The association between proximity to animal feeding operations and community health: a systematic review. PLoS One 5, e9530.
O'Connor, A.M., Wellman, N.G., Evans, R.B., Roth, D.R., 2006. A review of randomized clinical trials reporting antibiotic treatment of infectious bovine keratoconjunctivitis in cattle. Anim. Health Res. Rev. 7, 119–127.
Reeves, B.C., Higgins, J.P.T., Ramsay, C., Shea, B., Tugwell, P., Wells, G.A., 2013. An introduction to methodological issues when including non-randomised studies in systematic reviews on the effects of interventions. Res. Synth. Methods 4, 1–11.
Sargeant, J.M., Elgie, R., Valcour, J., Saint-Onge, J., Thompson, A., Marcynuk, P., Snedeker, K., 2009. Methodological quality and completeness of reporting in clinical trials conducted in livestock species. Prev. Vet. Med. 91, 107–115.
Sargeant, J.M., Thompson, A., Valcour, J., Elgie, R., Saint-Onge, J., Marcynuk, P., Snedeker, K., 2010. Quality of reporting of clinical trials of dogs and cats and associations with treatment effects. J. Vet. Intern. Med. 24, 44–50.
Sargeant, J.M., Torrence, M.E., Rajic, A., O'Connor, A.M., Williams, J., 2006. Methodological quality assessment of review articles evaluating interventions to improve microbial food safety. Foodborne Pathog. Dis. 3, 447–456.
Shmueli, A., 2002. Reporting heterogeneity in the measurement of health and health-related quality of life. Pharmacoeconomics 20, 405–412.
Stroup, D.F., Berlin, J.A., Morton, S.C., Olkin, I., Williamson, G.D., Rennie, D., Moher, D., Becker, B.J., Sipe, T.A., Thacker, S.B., 2000. Meta-analysis of observational studies in epidemiology: a proposal for reporting. Meta-analysis Of Observational Studies in Epidemiology (MOOSE) group. JAMA 283, 2008–2012.
van Houwelingen, H.C., Arends, L.R., Stijnen, T., 2002. Advanced methods in meta-analysis: multivariate approach and meta-regression. Stat. Med. 21, 589–624.
Vandenbroucke, J.P., 2009. STREGA, STROBE, STARD, SQUIRE, MOOSE, PRISMA, GNOSIS, TREND, ORION, COREQ, QUOROM, REMARK... and CONSORT: for whom does the guideline toll? J. Clin. Epidemiol. 62, 594–596.
Vandenbroucke, J.P., Psaty, B.M., 2008. Benefits and risks of drug treatments: how to combine the best evidence on benefits with the best data about adverse effects. JAMA 300, 2417–2419.
Verbeek, J., 2008. Moose Consort Strobe and Miame Stard Remark or how can we improve the quality of reporting studies. Scand. J. Work Environ. Health 34, 165–167.