Journal of Clinical Epidemiology 59 (2006) 342–353
REVIEW ARTICLES
A systematic review identifies a lack of standardization in methods for handling missing variance data

Natasha Wiebe a,*, Ben Vandermeer b, Robert W. Platt c,d, Terry P. Klassen b, David Moher e,f, Nicholas J. Barrowman e,f,g

a Department of Medicine, Division of Nephrology, University of Alberta, Rm. 4048, Research Transition Facility, 8308 114 Street, Edmonton, Alberta, Canada T6G 2E1
b Alberta Research Centre for Child Health Evidence, Department of Pediatrics, University of Alberta, Aberhart Centre One, 11402 University Ave, Edmonton, Alberta, Canada T6G 2J3
c Department of Pediatrics, McGill University, Montreal Children's Hospital, 2300 Tupper St, Montreal, Quebec, Canada H3H 1P3
d Department of Epidemiology and Biostatistics, McGill University, 1020 Pine Ave West, Montreal, Quebec, Canada H3A 1A2
e Chalmers Research Group, Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, Ontario, Canada K1H 8L1
f Department of Pediatrics, Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, Ontario, Canada K1H 8M5
g School of Mathematics and Statistics, Carleton University, 1125 Colonel By Drive, Ottawa, Ontario, Canada K1S 5B6

* Corresponding author. Tel.: 780-407-1532; fax: 780-407-1921. E-mail address: [email protected] (N. Wiebe).

Accepted 29 August 2005
Abstract

Background and Objectives: To describe and critically appraise available methods for handling missing variance data in meta-analysis (MA).

Methods: Systematic review. MEDLINE, EMBASE, Web of Science, MathSciNet, Current Index to Statistics, BMJ SearchAll, The Cochrane Library and Cochrane Colloquium proceedings, MA texts, and references were searched. Any form of text was included: MA, method chapter, or otherwise. Descriptions of how to implement each method, the theoretic basis and/or ad hoc motivation(s), and the input and output variable(s) were extracted and assessed. Methods may be true imputations, methods that obviate the need for a standard deviation (SD), or methods that recalculate the SD.

Results: Eight classes of methods were identified: algebraic recalculations, approximate algebraic recalculations, imputed study-level SDs, imputed study-level SDs from nonparametric summaries, imputed study-level correlations (e.g., for change-from-baseline SD), imputed MA-level effect sizes, MA-level tests, and no-impute methods.

Conclusion: This work aggregates the ideas of many investigators. The abundance of methods suggests a lack of consistency within the systematic review community. Appropriate use of methods is sometimes suspect; consulting a statistician early in the review process is recommended. Further work is required to optimize method choice, to alleviate any potential for bias, and to improve accuracy. Improved reporting is also encouraged. © 2006 Elsevier Inc. All rights reserved.

Keywords: Meta-analysis; Data collection; Data analysis, statistical; Publication bias
1. Introduction

Meta-analysis is an efficient way of analytically combining the results of individual studies to provide an overall estimate of an intervention's effectiveness along with a measure of its precision. Selective reporting, present in many forms (e.g., omission of publications, outcomes, or subgroups), will decrease statistical power and may distort the magnitude, and possibly the direction, of meta-analysed results. In a recent study that compared Danish
protocols of randomized controlled trials to their published manuscripts, Chan et al. [1] found incompletely reported results for 92% of efficacy outcomes and 81% of harm outcomes. Similar results have been reported for other jurisdictions [2]. Incomplete reporting of continuous outcomes is often manifested in the omission of standard deviations (SDs). Streiner and Joffe [3] found that only 9 out of 69 randomized controlled trials in their systematic review (SR) of antidepressants reported standard deviations; Song et al. [4] found 20 of 33 studies reported SDs in their SR of selective serotonin reuptake inhibitors for depression. To the best of our knowledge, a comprehensive review on the effects of omitting studies that did not report SDs has not been reported. Regardless, omitting studies that fail
to report SDs is wasteful. SDs are used in the calculation of the precision of the overall pooled estimate; thus, missing SDs reduce the power of the meta-analysis (MA). Additionally, SDs are used in the weighted calculation of the overall pooled estimate, so omitting studies with systematically missing SDs may bias this estimate. In primary studies (e.g., surveys, randomized controlled trials), Schafer [5] suggests that ignoring a case loss of 5% or less because of missing data is a reasonable solution to the missing data problem. In secondary studies (i.e., meta-analyses), where the number of included studies tends to be small, any omitted study may amount to a significant loss. Meta-analysts are therefore forced to improvise and/or use other summary measures to derive an SD estimate so that studies with missing SDs can be included. When SDs cannot be algebraically recalculated from reported data, meta-analysts have suggested and used a myriad of methods to impute SDs (fill in SDs with plausible values) to attenuate any loss in power and to avoid bias. However, many of these methods have not been theoretically derived or empirically tested. It has not been known how many such methods exist, whether different methods provide different variance imputation "corrections," and whether different imputation methods influence the overall conclusions of a meta-analysis.

2. Objective

To describe, and critically appraise, methods for handling missing variance data in meta-analysis.

3. Methods

3.1. Search strategy

We developed a comprehensive search strategy to identify all relevant documents regardless of publication status. Combinations of the following sets of search terms were used: variance, standard deviation, standard error and imput*, miss*, deriv*. We searched the following electronic databases: MEDLINE (1966 to May 17, 2002), EMBASE (1988 to May 17, 2002), Web of Science (May 3, 2002), MathSciNet (May 17, 2002), and Current Index to Statistics (September 17, 2002). We performed full-text searches of BMJ SearchAll (October 23, 2002) and The Cochrane Library (May 1, 2002) to specifically retrieve systematic reviews suggesting variance imputation methods. For BMJ SearchAll, we specified that systematic review* or one of the various spellings of meta-analysis had to be in the title. In addition, the Cochrane Colloquium proceedings (No. 2–10), the Statistics Methods Group listserve (June 2002 to August 2004), and personally identified meta-analysis texts were hand searched. We also reviewed the reference lists of included documents.

3.2. Inclusion criteria

A document (paper, chapter, email, newsletter, etc.) was included if: (1) it provided a method to impute or to
algebraically derive the actual nonreported variance or a variation thereof (standard error, standard deviation); (2) it provided a method to combine study effects (mean difference, standardized mean difference, P-values, etc.) for a continuous outcome without using individual study variances or a variation thereof; or (3) it provided a method to impute a study effect size (e.g., standardized mean difference) without using the individual study variance (or a variation thereof), where the individual study variance could be extracted from the effect size.

3.3. Exclusion criteria

A document was excluded if it only provided a method to impute a statistic other than variance or a variation thereof, from which a variance could not be extracted (e.g., correlation as an effect size measure).

3.4. Methods

Two authors independently screened the results of the search strategies to identify potentially relevant documents. The systematic reviews were screened online or as full-text hard copies. Two authors independently assessed the documents for inclusion in the review using the eligibility criteria. We resolved any disagreements through discussion or by consulting a third party. Two authors independently extracted data from the included documents. We requested unpublished data from authors where necessary. We developed a data extraction form to extract descriptions of how to implement each method, the theoretic basis and/or ad hoc motivation(s), and the input and output variable(s). We resolved any disagreements by referring to the document and through discussion, or by consulting a third party. Background texts, authors, and experts were sought to fully assess the descriptions of each method, since fully implementable descriptions were generally lacking.
4. Results

The search yielded 822 documents (Fig. 1), 6% from references of included documents. One hundred fifty-three reports were included: 73% were SRs, and 27% were method papers or textbooks (a complete listing of citations is available from N.W.). Documents were excluded if they were duplicate papers, not retrievable, or did not contain a relevant method.

We classified the methods into eight groups: algebraic recalculation, approximate algebraic recalculation, study-level imputation, study-level imputation from nonparametric summaries, study-level imputation of correlation (for change-from-baseline or crossover SD and to calculate the design effect for cluster studies), MA-level imputation of overall effect, MA-level tests, and no-imputation methods. The last three classes of methods involve no study-level SD. The MA-level tests provide no overall estimates of effect, just a P-value or a vote. Also, the last class, no imputation, omits from pooling the data of studies that do not report an SD.

Fig. 1. Flow diagram of documents* considered for inclusion in the review.
- Potentially relevant documents identified and screened for retrieval: n = 771, plus potentially relevant references of included studies: n = 51
- Documents excluded, with reasons: n = 590 (duplicate, n = 6; not retrievable, n = 11; no variance imputation method, n = 573)
- Documents retrieved for formal inclusion/exclusion: n = 232 (not intended for meta-analysis**, n = 27)
- Documents excluded, with reasons: n = 79 (not retrievable, n = 1; not transferable to meta-analysis, n = 22; no applicable method, n = 56)
- Documents included: n = 153 (systematic reviews, n = 112; method papers, n = 41; of these, not intended for meta-analysis**, n = 5)

* Documents include: systematic reviews, chapters, method papers, emails, newsletters, etc.
** The objective for imputing variance in these papers was not meta-analysis; however, papers containing methods that use the range and linear regression to impute variance were transferable to meta-analysis.

Table 1 shows the prevalence of method choices in the 101 included SRs. Table 2 gives the methods that were not identified in the SRs but were suggested in method texts. Table 3 describes in brief all the methods for variance imputation we found (excluding algebraic recalculations and MA-level tests), including the theoretic basis and/or motivation (and assumptions) for each method, the required input parameters, and what is produced (the output). The following sections, 4.1 through 4.8, expand upon this table.

4.1. Algebraic recalculation

Methods that algebraically recalculate an omitted SD use reported parametric summary statistics. Note that they often assume that summary statistics have exact parametric distributions. Along with reported means and sample sizes, the following statistics were used or suggested for use: t- and F-statistics from t-tests, ANOVAs, linear regressions, factorial and repeated-measures analyses; P-values from the same analyses; confidence and prediction intervals; and power. The most comprehensive indices for algebraic recalculation were found in Lipsey [6] and Glass [7]. A few of these methods were either not found in practice (i.e., factorial analyses [6], prediction intervals [8]), were rarely used in practice (i.e., power [9]), or perhaps should not be used in practice (i.e., repeated-measures analyses [9]). Repeated-measures analyses can take on a variety of forms; given the brevity of method reporting, it would be unusual for sufficient details to be provided to recalculate the appropriate SD. Other issues identified in various articles and studies included inexactness of computed SDs due to rounding [10], methods for binned data using maximum likelihood estimation [11], and incorrect calculations using percent change from baseline data [12].
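To make the simplest of these recalculations concrete, the sketch below (our own illustration, not taken from the cited indices; Python with SciPy assumed, function names and example numbers purely illustrative) recovers a pooled SD from a reported two-sample t-statistic, or from an exact two-sided P-value via the inverse t distribution.

```python
from scipy.stats import t

def sd_from_t(mean_tx, mean_ctr, n_tx, n_ctr, t_stat):
    """Pooled SD from a reported two-sample (equal-variance) t-statistic:
    t = (mean_tx - mean_ctr) / (SD * sqrt(1/n_tx + 1/n_ctr))."""
    return (mean_tx - mean_ctr) / (t_stat * (1 / n_tx + 1 / n_ctr) ** 0.5)

def sd_from_p(mean_tx, mean_ctr, n_tx, n_ctr, p_two_sided):
    """Pooled SD from an exact two-sided P-value, inverting the t
    distribution with df = n_tx + n_ctr - 2. Feeding an upper-bound
    P-value instead yields a larger, hence conservative, SD."""
    df = n_tx + n_ctr - 2
    t_stat = t.ppf(1 - p_two_sided / 2, df)  # |t| implied by the P-value
    return abs(mean_tx - mean_ctr) / (t_stat * (1 / n_tx + 1 / n_ctr) ** 0.5)

# e.g., means 22.1 vs. 20.3 with n = 25 per arm and a reported P of .04
print(sd_from_p(22.1, 20.3, 25, 25, 0.04))  # ~3.0
```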
4.2. Approximate algebraic recalculation

The statistical significance of hypothesis tests has been reported in various ways: significant, not significant, an upper bound on a significant P-value, an interval of significant P-values, or an exact P-value. In practice, the critical P-value or an upper bound P-value has been used to recalculate the SD. This SD will be larger, and the study will thus be weighted less in the meta-analysis, a conservative approach for efficacy questions. For a P-value interval, Follmann [13] suggests using the median value; this approach may be conservative or liberal. For studies that simply report nonsignificant outcomes, authors have suggested using a range of values in a sensitivity analysis, from a P-value of 1 to something just outside the critical value.

Table 1. Methods retrieved from SRs and their prevalence within SRs. Each class is given with its SR count (%), followed by its methods and their SR counts (%).

Algebraic recalculation: 29 (28.7)
- Using t- or F-statistics, P-values, confidence intervals, power, or repeated measures: 29 (28.7)

Approximate algebraic recalculation: 7 (6.9)
- Using upper bound P-values, critical P-values, and nonsignificant P-values: 7 (6.9)

Study-level imputation: 34 (33.7)
- Direct substitution with a baseline or other treatment SD, another included study's SD or the maximum thereof, or an SD from another literature source: 13 (12.9)
- Arithmetic mean (four were nonuniform weights, likely sample size weights): 10 (9.9)
- Coefficient of variation (one weighted by sample size): 5 (5.0)
- Regression with covariates: 3 (3.0)
- Using maximum likelihood estimation on "binned" data: 1 (1.0)
- Using numerator and denominator SD to estimate the SD of a ratio: 4 (4.0)

Study-level imputation from nonparametric summaries: 12 (11.9)
- Range width (divided by 4, tabled value, 3, 5.88): 8 (7.9)
- Interquartile width (divided by 1.35 or by 2): 2 (2.0)
- Mann-Whitney P-values: 1 (1.0)
- Median: 1 (1.0)

Study-level imputation of correlation: 29 (28.7)
- Direct substitution with 0.5: 5 (5.0)
- Direct substitution with 0: 4 (4.0)
- Direct substitution with a value from another included study or the minimum thereof: 11 (10.9)
- Sensitivity analyses using at least two different values: 6 (5.9)
- Direct substitution, not clear what value was used: 6 (5.9)

MA-level imputation of overall effect: 4 (4.0)
- Multiple imputation: 1 (1.0)
- Weight by some function of sample size: 2 (2.0)
- Bootstrapping: 1 (1.0)

MA-level tests: 2 (2.0)
- Combining P-values: 1 (1.0)
- Vote counting: 1 (1.0)

No-imputation methods: 21 (20.8)
- Exclude study from systematic review: 9 (8.9)
- Exclude study from meta-analysis: 9 (8.9)
- Narrative review: 4 (4.0)

Unclear: 9 (8.9)

Overall: 101 (100)

4.3. Study-level imputation

Methods that impute study-level SDs either assume that the SD is associated with other study-level variables or with the treatment or control mean (i.e., linear or Buck's regression [14], coefficient of variation [15]), or they assume simply that the missing SD is similar to other study SDs (e.g., substitution, arithmetic mean).

Substituted SDs come from various sources: same-study baseline values, same-study other treatment/control values, other included studies' end point values, the maximum of other included studies' end point values, and reference population or external study values. Using the arithmetic mean SD from the other included studies was usually done using uniform weights, but at least one study used sample size weights in this calculation [16]. Three other studies used weights but did not report which weights.

The motivation for using the coefficient of variation to impute SD is unclear. Early references are found in the quality control industry in outcomes quantifying strength and height [17]; note that both are bounded below at zero. Outcomes in SRs using the coefficient of variation were length, head circumference, and weight gain [18], and one SR applied it to any outcome [19]. The coefficient of variation is the SD divided by its mean [15]. Its use here implies a relationship between means and SDs. A relationship between means and SDs can also exist when data are skewed [20].

In 1960, Buck [14,21] suggested using regression with available cases to impute missing data values in the context of multivariable primary datasets. Buck's purpose was not to increase power, but to estimate variances and covariances consistently regardless of which set of variables were used.
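As an illustration of the two most commonly used single-imputation rules just described, here is a minimal sketch (our own, in plain Python; all names and numbers are illustrative) of arithmetic-mean and coefficient-of-variation imputation.

```python
def mean_sd(donor_sds, donor_ns=None):
    """Impute a missing SD as the mean of other included studies' SDs.
    With sample sizes, pool on the variance scale weighted by (n - 1);
    without them, average the SDs directly (the variant in Table 3)."""
    if donor_ns is None:
        return sum(donor_sds) / len(donor_sds)
    num = sum(sd ** 2 * (n - 1) for sd, n in zip(donor_sds, donor_ns))
    den = sum(n - 1 for n in donor_ns)
    return (num / den) ** 0.5

def cv_sd(donor_sds, donor_means, target_mean):
    """Impute a missing SD from donor studies' mean coefficient of
    variation, assuming the SD-to-mean ratio is similar across studies."""
    cvs = [sd / m for sd, m in zip(donor_sds, donor_means)]
    return (sum(cvs) / len(cvs)) * target_mean

# Donors with SDs 1.4 and 2.1 at means 7.0 and 9.5; target study mean 8.0
print(mean_sd([1.4, 2.1], donor_ns=[15, 39]))  # ~1.94
print(cv_sd([1.4, 2.1], [7.0, 9.5], 8.0))      # ~1.68
```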
Table 2. Methods retrieved in method texts and not found in SRs

Algebraic recalculation:
- Using factorial ANOVA summaries; R2 and other statistics from ANOVA or regression summaries; Tukey's Q, Dunn's statistic, or Dunnett's statistic; prediction intervals

Approximate algebraic recalculation:
- Using interval P-values

Imputed study-level SD, from nonparametric data summaries:
- Using empirical intervals
- Any nonparametric P-value
- Combining skewed data (investigation ongoing)

Imputed MA-level overall effects:
- MLE to impute an overall effect size from the proportion of positive or positive significant studies

MA-level tests:
- Combining P-values, T-statistics, or Z-statistics, with or without weighting by sample size

Unclear:
- Bayesian (not clearly reported)
Sutton [22] suggests using regression to impute missing variables in meta-analysis, but does not indicate this method specifically for SDs. Two reviews [23,24] used regression to impute an SD. One was a special case in which the data had a Poisson distribution, where the mean closely approximates the SD. The other review imputed the treatment SD by regressing the treatment SD on the control SD and the length of follow-up. A few authors suggest a Bayesian solution [22,25] to missing SDs, but provide no further details. Maximum likelihood estimation (MLE) has been suggested for missing data in general and for meta-analysis, but has not been specifically proposed for missing SDs.

4.4. Study-level imputation from nonparametric summaries

Using the interquartile range, the range (or other empirical intervals), or P-values from nonparametric tests to impute SDs requires that the distribution of the outcome be normal and not, in general, skewed. The width of the interquartile range of a standard normal distribution is 1.35 SDs; dividing the interquartile width by 1.35 therefore estimates the SD [26]. When using the range to impute SDs, several reviewers have divided the range width by both 3 and 4. The Cochrane Handbook recommends dividing the width of the range by 4 to impute an SD, although Mendenhall et al. [27] may have made the original recommendation, from their survey environment. Other fixed divisors have also been suggested [28]. Despite these fixed conversion factors, the range width increases as sample size increases; the relationship between range width and SD is therefore not fixed, but varies with sample size. As far back as 1932, Pearson published, for the purposes of the quality control industry, a table of simulated results that approximates this relationship [29]. Dividing by 3 or 4 would be appropriate for sample sizes of approximately 9 and 27, respectively. Note that Pearson suggests optimal use for sample sizes of 10 or smaller [17], since the extreme values in a sample become less stable as sample size increases.

One methodologist [30] has suggested substituting t-test P-values with nonparametric P-values and then approximating the SD using reported means (or medians). In practice, the Wilcoxon (a.k.a. Mann-Whitney) test P-value has been used as a conservative replacement for the t-test P-value when the sample distributions are normal [31]. In theory, other nonparametric tests such as the sign test could also provide P-values leading to conservative SD imputations. One review [32] directly used the median to impute SDs.

When outcome data are skewed [20], the central limit theorem suggests that, except in very extreme cases (such as a nonfinite variance), the mean of a large dataset will be nearly normally distributed, and thus the standard error would be appropriate. What sample size this requires is not in general clear; it depends on the sample size and the magnitude of the skewness, and no empirical work is available. Alternatively, investigations into combining nonparametric or skewed data are currently underway [33].
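A small sketch of these conversions follows (our own; plain Python). The range divisors used here are the commonly quoted quality-control constants for the expected range of a normal sample, which approximate Pearson's table [29]; treat the exact values as assumptions of this sketch rather than as the tabled entries.

```python
# Approximate conversion factors: expected range of a standard normal
# sample of size n (the quality-control "d2" constants). Assumed values.
D2 = {2: 1.128, 5: 2.326, 10: 3.078, 15: 3.472, 20: 3.735, 25: 3.931}

def sd_from_iqr(q1, q3):
    """SD from the interquartile range of (assumed) normal data: the IQR
    of a normal distribution spans about 1.35 SDs."""
    return (q3 - q1) / 1.35

def sd_from_range(minimum, maximum, n):
    """SD from the range, dividing by the tabulated factor for the
    nearest available sample size (crude nearest-neighbour lookup)."""
    nearest = min(D2, key=lambda k: abs(k - n))
    return (maximum - minimum) / D2[nearest]

print(sd_from_iqr(3.1, 6.4))        # 3.3 / 1.35  ~ 2.44
print(sd_from_range(1.0, 9.0, 10))  # 8 / 3.078   ~ 2.60
```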
4.5. Study-level imputation of correlation

Imputed study-level correlations are used in change-from-baseline SDs and crossover SDs to account for within-person correlation, and in cluster study SDs to account for the unit of analysis not equalling the unit of randomization (i.e., the design effect). For many SRs, it is unclear what value of correlation was used. Other authors have used 0.5 as suggested by Follmann [13], another included study's correlation, the minimum correlation from the included studies, a known or accepted value of correlation, or assumed no correlation. The last approach is the most conservative, as it gives the largest SD. Other authors suggest a sensitivity analysis and so use a range of plausible correlations.

Follmann [13] argues for a minimum correlation of 0.5 for imputing change-from-baseline SDs, because the design would be less efficient if the correlation were anything smaller: the change-from-baseline SD would then be greater than the final value SD. For crossover SDs, any nonnegative correlation would give a more efficient design. From our limited results, few reviewers reported the correlations they "borrowed" from other studies. Furthermore, the few available correlations (reported or gathered from included authors) for change-from-baseline SDs did not all exceed the critical value of 0.5, although most did: 0.07 to 0.56 [34], 0.46 to 0.74 [35], 0.63 to 0.90 [36], 0.75 [37], 0.82 [38], 0.9 [39]. Three sets of sensitivity analyses were used: 0.3, 0.4, and 0.5; 0.25, 0.5, and 0.75; and 0 and 0.5. All of these sets included values less than 0.5.
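The arithmetic behind these correlation imputations is the standard variance formula for a difference of correlated measurements. A minimal sketch follows (our own; plain Python, illustrative numbers), including the kind of sensitivity analysis over plausible correlations suggested above.

```python
def change_sd(sd_baseline, sd_final, r):
    """SD of change from baseline given an (imputed) within-person
    correlation r: Var(F - B) = Var(B) + Var(F) - 2*r*SD(B)*SD(F)."""
    var = sd_baseline ** 2 + sd_final ** 2 - 2 * r * sd_baseline * sd_final
    return var ** 0.5

# Sensitivity analysis over plausible correlations; r = 0 is the most
# conservative choice (it yields the largest change-from-baseline SD).
for r in (0.0, 0.25, 0.5, 0.75):
    print(r, round(change_sd(2.0, 2.2, r), 2))
```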
Table 3. Description of methods. Each entry gives the method (with reference), a description, the theoretical basis and/or motivation and assumptions, the required input, and the output.

Direct substitution ([6], p. 186)
- Description: Substitute the missing SD with any of the "required input" SDs suggested. When selecting from several SDs, one may choose the maximum in order to be the most conservative.
- Basis and assumptions: All of the "required input" SDs should be similar. Missing SDs are assumed to be missing completely at random (MCAR).
- Required input: Baseline, other time point, other treatment, another included study's, or another nonincluded (yet similar in study population) study's SD.
- Output: SD.

Arithmetic mean [13]
- Description: Substitute the missing SD with the mean of the other included studies' SDs. The mean calculation may or may not be weighted by sample size:
  $\mathrm{SD} = \sqrt{\sum_{i=1}^{N} \mathrm{SD}_i^2 (n_i - 1) \Big/ \sum_{i=1}^{N} (n_i - 1)}$
  A variant of this calculation is to average the SDs directly:
  $\mathrm{SD} = \sum_{i=1}^{N} \mathrm{SD}_i \big/ N$
- Basis and assumptions: All of the other included studies' SDs should be similar. Pooling assumes the treatment and control SDs are similar, but this may not always be the case; some may prefer to select same-treatment SDs. MCAR.
- Required input: All other available included studies' SDs.
- Output: SD.

Linear regression ([14,21], p. 168)
- Description: Regress the same-treatment SDs onto other study covariates that the investigator specifies as related to the missing SD, such as the baseline SD. The following example assumes that the baseline SD and some other study covariate, $X_2$, are related linearly to the missing SD:
  $\mathrm{SD} = b_0 + b_1 \mathrm{SD}_{\mathrm{baseline}} + b_2 X_2$
- Basis and assumptions: All of the "required input" SDs should be similar; however, some study covariate may also modify this relationship (e.g., length of follow-up). The relationship between the dependent SD and the independent SD is expected to be linear. Missing SDs are assumed to be missing at random (MAR).
- Required input: All other available included studies' same-treatment SDs and other specified covariates; this generally includes either the baseline or control group's SD.
- Output: SD.

Coefficient of variation ([15], p. 17)
- Description: May use the mean and SD estimates from the most similar study to calculate the coefficient of variation (CV), or may use all the studies to calculate a mean CV; this latter route could be weighted by sample size:
  $\mathrm{CV} = \frac{\mathrm{SD}}{\mathrm{mean}} \times 100$
  The CV and the mean from the study with the missing SD are used to impute the missing SD:
  $\mathrm{SD} = \frac{\mathrm{CV} \times \mathrm{mean}_{\mathrm{missing\,SD}}}{100}$
- Basis and assumptions: The SD-to-mean ratio should be similar among included studies (e.g., clinically similar scales). MAR.
- Required input: Any other or all other available included studies' same-treatment SDs and means (other included studies' treatment sample sizes are optional).
- Output: SD.

Interquartile range [26]
- Description: $\mathrm{SD} = (\mathrm{IQ}_{\mathrm{upper}} - \mathrm{IQ}_{\mathrm{lower}})/1.35$
- Basis and assumptions: The distribution of the outcome data is assumed to be normal.
- Required input: Interquartile range.
- Output: SD.

Range [29]
- Description: $\mathrm{SD} = (\mathrm{Max} - \mathrm{Min})/\mathrm{table\ value}$. Table values can be found in Pearson (1932) and require only the sample size to find the desired entry. This relationship is a one-to-one mapping insofar as the normality assumption is correct.
- Basis and assumptions: The distribution of the outcome data is assumed to be normal. The relationship between the width of the range and the SD varies with sample size; for the same normal distribution, as sample size increases, the range also increases.
- Required input: Range and same-treatment sample size.
- Output: SD.

Mann-Whitney P-value ([7], p. 130–1)
- Description:
  $df = n_{tx} + n_{ctr} - 2$
  $t = T^{-1}(P\mathrm{-value},\, df)$
  $\mathrm{SD} = \frac{\mathrm{tx\ mean} - \mathrm{ctr\ mean}}{t \sqrt{1/n_{tx} + 1/n_{ctr}}}$
- Basis and assumptions: The distribution of the outcome data is assumed to be normal. The result is conservative; the SD will generally be larger than the exact sample SD.
- Required input: Mann-Whitney P-value, treatment means, and treatment sample sizes.
- Output: SD.

Sample size weights [40]
- Description: Pool mean differences weighting by a function of sample size. The following calculation is the sample weight for study i:
  $W_i = \frac{n_{tx,i}\, n_{ctr,i}}{n_{tx,i} + n_{ctr,i}}$
  A standard error (SE) for the weighted mean difference would have to be calculated through other means.
- Basis and assumptions: SDs of the same outcome and same treatment in similar study populations should be similar; assuming equal SDs, the SDs cancel out algebraically.
- Required input: All included studies' treatment means and treatment sample sizes.
- Output: Weighted mean difference (pooled) without an SE.

Bootstrap [41,42]
- Description: Both the weighted mean difference and its SE are estimated using the bootstrap method. Mean differences may be weighted by sample size. Datasets of mean differences, randomly sampled with replacement, are created; an overall SE or confidence interval is then generated.
- Basis and assumptions: This nonparametric procedure can be used for any distribution.
- Required input: All included studies' treatment means (included studies' treatment sample sizes are optional).
- Output: Weighted mean difference (pooled) with an SE.

Multiple imputation [43,44]
- Description: A number of datasets, M (if 30% of the data are missing, M = 3 would suffice; if 50% are missing, M = 5 is preferable; Rubin 1991 [44]), are created by imputing SDs via random sampling of SDs from the other included studies. The weighted mean difference (WMD) and SE are calculated for each of the M datasets. The final estimates of the WMD and SE are calculated from the M datasets in the following way:
  $\overline{\mathrm{WMD}} = \frac{\sum_{i=1}^{M} \mathrm{WMD}_i}{M}$
  $\mathrm{SE} = \sqrt{\frac{\sum_{i=1}^{M} \mathrm{SE}_i^2}{M} + \left(1 + \frac{1}{M}\right) \frac{\sum_{i=1}^{M} (\mathrm{WMD}_i - \overline{\mathrm{WMD}})^2}{M - 1}}$
- Basis and assumptions: This nonparametric procedure can be used for any distribution. This is the only method that accounts for the uncertainty implicit in the missing SD.
- Required input: All possible included studies' treatment means, treatment SDs, and treatment sample sizes.
- Output: Weighted mean difference (pooled) with an SE.

Overall standardized mean difference estimation [45,46]
- Description: A likelihood is set up based on the required proportion. There is no closed-form solution; tables are provided.
- Basis and assumptions: Studies are repetitions; a fixed-effects scenario is assumed.
- Required input: Proportion of positive studies, or proportion of positive statistically significant studies, and sample size.
- Output: Standardized mean difference with confidence intervals.

Exclude from systematic review
- Description: The study is deemed ineligible and excluded from the entire systematic review.
- Basis and assumptions: Missing SDs are unrelated to the size of the treatment effect. Power is lost, but no bias is assumed. MCAR.
- Required input: All other possible included studies' treatment means, treatment SDs, and treatment sample sizes.
- Output: Effect size with an SE.

Exclude from meta-analysis
- Description: The study is eligible and included in the systematic review, but not in the particular meta-analysis where the SD is missing.
- Basis and assumptions: Missing SDs may be related to the size of the treatment effect. Power is lost; bias may exist but is offset to a small extent by narrative in the text.
- Required input: All other possible included studies' treatment means, treatment SDs, and treatment sample sizes.
- Output: Effect size with an SE.

Narrative review
- Description: The study is eligible and included in the systematic review, but no meta-analysis is performed. Review data are not pooled, but described study by study.
- Basis and assumptions: Missing SDs may be related to the size of the treatment effect. All power due to pooling data is lost; interpretation is difficult because results may be confounded with sample size or SE.
- Required input: Any relevant reported summary data.
- Output: Reported individual study results.
4.6. MA-level imputation of overall effect

Four methods do not impute individual study-level estimates of SD or standardized mean differences. One method weights the mean differences by some function of sample size [40]. A second bootstraps an overall standard error and mean difference from a dataset of only the study-level mean differences [41,42]; no weighting mechanism was reported. The third method uses multiple imputation, in which multiple study-level SDs are imputed, resulting in multiple datasets [43,44]; results from these multiple datasets are combined to give an overall summary (mean difference and standard error). The fourth method estimates an overall standardized mean difference using either the proportion of positive studies or the proportion of positive statistically significant studies, together with sample size [45,46].
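Because multiple imputation is the only class that propagates the uncertainty of the imputed SDs, a compact sketch may help. This is our own simplified illustration (plain Python; fixed-effect inverse-variance pooling, donor SDs sampled with replacement, and Rubin's rules as in Table 3), not the exact procedure of [43,44]; all names are illustrative.

```python
import random

def mi_pooled_wmd(mds, ses, donor_sds, missing, M=5, seed=1):
    """Multiple imputation for missing SDs in a meta-analysis of mean
    differences (MDs), combined with Rubin's rules.

    mds, ses  -- MDs and standard errors of the complete studies
    donor_sds -- pool of observed SDs to sample imputations from
    missing   -- list of (md, n_tx, n_ctr) for studies lacking SDs
    """
    rng = random.Random(seed)
    estimates, errors = [], []
    for _ in range(M):
        y, se = list(mds), list(ses)
        for md, n_tx, n_ctr in missing:
            sd = rng.choice(donor_sds)            # one imputation draw
            y.append(md)
            se.append(sd * (1 / n_tx + 1 / n_ctr) ** 0.5)
        w = [1 / s ** 2 for s in se]              # inverse-variance weights
        estimates.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
        errors.append((1 / sum(w)) ** 0.5)
    # Rubin's rules: within-imputation + (1 + 1/M) * between-imputation
    qbar = sum(estimates) / M
    within = sum(s ** 2 for s in errors) / M
    between = sum((q - qbar) ** 2 for q in estimates) / (M - 1)
    return qbar, (within + (1 + 1 / M) * between) ** 0.5
```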
4.7. MA-level tests

Meta-analytic tests determining whether the experimental treatment is superior to the control can be performed without any SDs. Together, Rosenthal [47,48] and Sutton et al. [22] provide a comprehensive listing of these tests using summary statistics such as the P-value, the Z-statistic, and the t-statistic. One systematic review [49] used Fisher's inverse chi-square test (a.k.a. Pearson and Pearson's inverse chi-square test) when many of the SDs had been omitted from reported studies. Most of these tests apply uniform weighting, thereby ignoring the sample size and precision of the studies; Mosteller and Bush [50] suggest weighting with degrees of freedom.

Vote counting using statistical significance or direction of effect is also a form of testing. In addition to the plurality rule, the sign test using the proportion of positive studies and the binomial test using the proportion of significant positive studies are most often suggested. These methods ignore any weighting of the included studies and thus easily mislead; numerous small studies can easily and erroneously overturn a large study's result.
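For concreteness, here is a minimal sketch of Fisher's inverse chi-square combination (our own; Python with SciPy assumed, illustrative one-sided P-values all oriented in the same direction). Note what the method does not provide: no effect size, and no weighting by study size or precision.

```python
from math import log
from scipy.stats import chi2

def fisher_combined_p(p_values):
    """Fisher's inverse chi-square test: under the joint null,
    -2 * sum(ln p_i) is chi-square distributed with 2k df."""
    statistic = -2 * sum(log(p) for p in p_values)
    return chi2.sf(statistic, df=2 * len(p_values))

# Illustrative one-sided P-values; every study counts equally here,
# regardless of its size or precision.
print(fisher_combined_p([0.003, 0.0001, 0.04, 0.2, 0.0005]))
```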
4.8. No-imputation methods

Three methods were identified whereby, when the SD is not reported, the study is not included in the meta-analytic estimate. The study may be excluded from the entire systematic review; it may be excluded from just the meta-analysis, with the available study information reported (or not) in text or in a table; or the review may forgo meta-analysis entirely, with all available information from the studies relayed in text or tables.
4.9. Example

Here, an example is presented to illustrate the behavior of four different methods (Table 4), each selected from a different method class. Our meta-analysis example compares glucocorticoids to placebo for the treatment of croup using a validated score (change-from-baseline at 6 hr) [51]. This published meta-analysis found a weighted mean difference of −1.22 (95% CI −1.62 to −0.82), indicating greater benefit in the glucocorticoid groups. Seven trials were included in this meta-analysis; two trials had three arms, so the control results were split for the purposes of this meta-analysis. We selected the two entries with the smallest mean differences to simulate missing SDs, because outcomes with statistical nonsignificance are more likely to report the least information [1,2].
Table 4. Example of four methods: Westley score (change-from-baseline at 6 hr), any glucocorticoid versus placebo. Each row gives glucocorticoid n, mean (SD); placebo n, mean (SD); mean difference (95% CI); P-value.

Inpatient
- Study 1: 27, −2.21 (1.39); 15, −1.08 (1.40); −1.13 (−2.01, −0.25); .0050
- Study 2: 23, −2.73 (1.37); 15, −1.08 (1.40); −1.65 (−2.55, −0.75); .0002
- Study 3: 44, −2.96 (2.25); 39, −1.74 (2.95); −1.22 (−2.36, −0.08); .0009
- Study 4: 20, −3.86 (3.19); 16, −0.88 (2.09); −2.98 (−4.71, −1.25); <.0001
- Study 5: 9, −1.10 (1.00a); 8, −1.20 (1.88a); 0.10 (−1.36, 1.56); .8648

Outpatient
- Study 6: 17, −2.00 (1.08); 21, −1.00 (0.74); −1.00 (−1.60, −0.40); .0026
- Study 7: 47, −2.80 (1.37); 25, −1.40 (1.40); −1.40 (−2.07, −0.73); <.0001
- Study 8: 48, −2.00 (1.39a); 24, −1.40 (1.40a); −0.60 (−1.28, 0.08); .0458
- Study 9: 27, −3.00 (1.93); 27, −1.00 (2.39); −2.00 (−3.16, −0.84); <.0001

Random effects weighted mean difference (95% CI)
- Observed results: −1.22 (−1.62, −0.82)
- Exclude from MA results: −1.38 (−1.74, −1.03)
- Fisher's method results: P < .0001
- Weighted arithmetic mean resultsb: −1.26 (−1.63, −0.89)
- Multiple imputation resultsb: −1.25 (−1.64, −0.85)

a SD omitted for the purposes of method simulation.
b By treatment group and by inpatient/outpatient status.
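Rows such as "observed results" require a random-effects pooled weighted mean difference. The authors do not state their software, so the sketch below uses the standard DerSimonian-Laird estimator (our own, in plain Python) on two-group summaries like those in Table 4; computing the within-study variances this way reproduces the study-level confidence intervals shown in the table.

```python
def md_and_var(m1, sd1, n1, m2, sd2, n2):
    """Mean difference and its variance from two-group summaries."""
    return m1 - m2, sd1 ** 2 / n1 + sd2 ** 2 / n2

def dersimonian_laird(y, v):
    """Random-effects pooling of effects y with within-study variances v."""
    k = len(y)
    w = [1 / vi for vi in v]                          # fixed-effect weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                # between-study variance
    w_re = [1 / (vi + tau2) for vi in v]              # random-effects weights
    est = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = (1 / sum(w_re)) ** 0.5
    return est, (est - 1.96 * se, est + 1.96 * se)

# Studies 1 and 2 from Table 4; each MD and 1.96*sqrt(v) matches its
# tabled CI: -1.13 (-2.01, -0.25) and -1.65 (-2.55, -0.75).
y1, v1 = md_and_var(-2.21, 1.39, 27, -1.08, 1.40, 15)
y2, v2 = md_and_var(-2.73, 1.37, 23, -1.08, 1.40, 15)
print(dersimonian_laird([y1, y2], [v1, v2]))
```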
The first method, excluding studies from the MA when their SDs are missing, gave larger (stronger) effects than the original (−1.38, −1.74 to −1.03); unsurprisingly, because we selected it to be so. The second method, Fisher's inverse chi-square method, resulted in a P-value less than .0001. Again, this is not surprising, because we removed the only two studies with nonsignificant P-values. Our third method, direct substitution with the arithmetic mean, found results more similar to the observed results (−1.26, −1.63 to −0.89). The fourth method, multiple imputation, found the most similar results (−1.25, −1.64 to −0.85). The slightly wider confidence interval accounts for the uncertainty of the missing data; multiple imputation is the only method that widens the confidence intervals.
5. Discussion

This work aggregates the ideas of many researchers. The abundance of methods suggests a lack of communication across the systematic review community. At least two matters should be considered when selecting a method of imputation: (1) what information is already known about the missing SDs, and (2) the mechanism that caused the data to be missing. Harrell [52], among others, discusses three mechanisms for missing data. Missing completely at random (MCAR) means that the missingness is unrelated to any observed variable, including the missing item itself; in this case, omitting the study with the missing SDs would cause no bias, and only power would be lost. Missing at random (MAR) occurs when the missingness is related to some observed variable; Buck's regression and the coefficient of variation are two examples of methods that assume this mechanism. Nonignorable missing data are occurrences where the missingness is related to the value of the missing item itself.

Little empirical work exists to substantiate reasons for missing SDs. Chan [1,2] found in two small author surveys that entire outcomes were omitted for three reasons: (1) lack of statistical significance, (2) journal space restrictions, and (3) lack of clinical importance. For example, Chan [1,2] found that the odds that an outcome was fully reported were more than doubled when the outcome was
statistically significant. The reasons for incompletely reporting outcomes may be similar. Particular to missing SDs, we would propose three further reasons based on observations gleaned during this methodologic review: (1) older, pre-CONSORT reporting styles, often upper-bound or nonsignificant P-values; (2) a preference for nonparametric summaries, regardless of whether the data were normal or not; and (3) advanced modeling (e.g., repeated measures, regressions) from which it is difficult to extract the simple data summaries required for meta-analysis. The relationship with statistical significance, however, is the greatest cause for concern: not only is power lost, but bias may also occur. This provides compelling justification for attempting to impute SDs.

As evidenced by Tables 1 and 2, we have found many imputation methods, some readily applied in meta-analysis and some not, for a myriad of situations. Figure 2 provides brief guidelines (not rules!) on how to select a method of handling missing variance data. The majority of these methods are not difficult to apply. Perform algebraic recalculations whenever possible; seeking a statistician or consulting a text [6,7] that details the necessary computations may be advisable. When this approach is not successful, it is reasonable to contact study authors for the missing SDs. When the exact SD cannot be regained, reviewers should choose a method of imputation: multiple imputation, imputation using nonparametric summaries, or single imputation. Single imputation can be any of the following: single borrowing, using the arithmetic mean, or imputing the SDs based on an assumed relationship to another variable such as the treatment mean or duration of follow-up. None of these single imputation methods account for the fact that imputed information was used. Multiple imputation accounts for this added uncertainty by inflating the standard error of the overall pooled estimate of effect; single imputation methods are comparatively poor because they proceed as if the imputed values were the true observed values. The use of nonparametric summaries for imputing SDs is also a form of single imputation. However, their estimates are based on the actual data and not borrowed
Attempt in the following order:
1. Use algebraic recalculation to recover the missing SD.
2. Contact study authors to obtain the missing SD.
3. If a sufficient number of included studies with complete information exists, use multiple imputation.
4. If the distribution is not skewed and nonparametric summaries are reported, use nonparametric summaries.
5. If at least one sufficiently similar study exists, use a method of single imputation.
6. Summarize non-pooled data alongside the meta-analysed result in the text.
7. Use meta-analytic tests.
Perform sensitivity analyses in steps 3 through 5.

Fig. 2. Guidelines for handling missing variance data.
information. Their performance will depend on how close to normally distributed their data actually are. When many of the studies indicate substantial skewness [20], summarizing with a mean and SD may not be advisable; presently, medians, interquartile ranges, and other such statistics are usually reported unpooled in text or tables. Most importantly, an accompanying sensitivity analysis should be conducted (e.g., using another method of imputation, choosing multiple levels for single-value borrowing, running the multiple imputation method more than once). Approximate algebraic recalculations can also be a form of sensitivity analysis; although they may be unduly conservative, they do provide verification when we decide to perform imputation.

Further empirical or simulation work is required to test the performance of these imputation methods. Nonetheless, we would recommend multiple imputation when a sufficient number of included studies with complete information exists. Robertson et al. [53] have further suggested using multiple imputation in a variety of sensitivity analyses for handling missing SDs in meta-analysis. Alternatively, the entire process could be formalized and expanded within a Bayesian framework. Last, with regard to imputation methods, we do not suggest applying methods that make demonstrably false assumptions, such as equating mean ratios to the mean of the numerator divided by the mean of the denominator. Additionally, it is preferable to use numerical relationships based on empirical evidence rather than "rules of thumb" (e.g., use Pearson's table [29] for imputing SDs from ranges).

When only a few studies are missing complete information and no similar studies are available, a combination of meta-analysis and summarizing the unusable results may prove the better choice: power is lost, but some estimate of effect size is gained. Omitting study results from the entire systematic review should be avoided. Furthermore, when complete summary information is sparse for all or most studies but P-values and sample sizes are available, weighted meta-analytic tests may be used; regrettably, meta-analytic tests give no estimate of treatment effect.

Although considerable energy has been focused on how best to handle missing data, it is unlikely that the problem will ever be solved entirely. Perhaps the best avenue would be for authors to redouble their efforts to improve the quality of reporting of their research.
6. Limitations Our prevalence statistics (Table 1) are limited by the simple search strategy we used in the Cochrane Library and BMJ SearchAll; a hand search of an entire issue of the Cochrane Library and a highly sensitive search strategy to find systematic reviews in BMJ SearchAll may prove more productive. Also, our findings are limited by how
comprehensively the authors (given the constraints imposed by journals) reported their strategy for handling missing SD. In addition, the psychologic literature has shown itself to be quite important, suggesting that it might have been advisable to search ERIC and PsychLit.
Acknowledgments

We were supported through an operating grant (MOP57880) from the Canadian Institutes of Health Research. We thank Ellen Crumley and Margaret Sampson for assistance with literature searching. We also thank Jennifer Houseman, John Russell, and Philip Berry for their assistance with text retrieval and citation management.

References

[1] Chan A, Hróbjartsson A, Haahr MT, Gøtzsche PC, Altman DG. Empirical evidence for selective reporting of outcomes in randomized trials: comparison of protocols to published articles. JAMA 2004;291:2457–65.
[2] Chan A, Krleza-Jerić K, Schmid I, Altman DG. Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. Can Med Assoc J 2004;171:735–40.
[3] Streiner DL, Joffe R. The adequacy of reporting randomized, controlled trials in the evaluation of antidepressants. Can J Psychiatry 1998;43:1026–30.
[4] Song F, Freemantle N, Sheldon TA, House A, Watson P, Long A. Selective serotonin reuptake inhibitors: meta-analysis of efficacy and acceptability. Br Med J 1993;306:683–7.
[5] Schafer JL. Analysis of incomplete multivariate data. New York: Chapman & Hall; 1997.
[6] Lipsey MW, Wilson DB. Practical meta-analysis. Thousand Oaks, CA: Sage Publications; 2001.
[7] Glass GV, McGaw B, Smith ML. Measuring study findings. In: Glass GV, McGaw B, Smith ML, editors. Meta-analysis in social research. Beverly Hills, CA: Sage Publications; 1981. pp. 93–152.
[8] Deeks J. Are you sure that's a standard deviation? Part I & II. Cochrane News 1998;11:9–12.
[9] Martins S, Logan S, Gilbert R. Iron therapy for improving psychomotor development and cognitive function in children under the age of three with iron deficiency anaemia (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[10] Greenland S. Quantitative methods in the review of epidemiologic literature. Epidemiol Rev 1987;9:1–30.
[11] Allison D, Mentore J, Heo M, Chandler LP, Cappelleri JC, Infante MC, Weiden PJ. Antipsychotic-induced weight gain: a comprehensive research synthesis. Am J Psychiatry 1999;156:1686–96.
[12] Brion L, Primhak R. Intravenous or enteral loop diuretics for preterm infants with (or developing) chronic lung disease (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[13] Follmann D, Elliott P, Suh I, Cutler J. Variance imputation for overviews of clinical trials with continuous response. J Clin Epidemiol 1992;45:769–73.
[14] Buck SF. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J R Stat Soc 1960;22:302–3.
[15] Bracken MB. Statistical methods for analysis of effects of treatment in overviews of randomized trials. In: Sinclair JC, Bracken MB, editors. Effective care of the newborn infant. Oxford: Oxford University Press; 1992. pp. 13–20.
[16] Spooner CS, Saunders LD, Rowe BH. Nedocromil sodium for preventing exercise-induced bronchoconstriction (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[17] Pearson ES. The application of statistical methods to industrial standardization and quality control. London, UK: B.S. 600, British Standards Institution; 1960.
[18] Vickers A, Ohlsson A, Lacy JB, Horsley A. Massage for promoting growth and development of preterm and/or low birth weight infants (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[19] Suarez-Almazor M, Spooner C, Belseck E. Penicillamine for treating rheumatoid arthritis (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[20] Altman DG, Bland JM. Detecting skewness from summary information. Br Med J 1996;313:1200.
[21] Pigott TD. Methods for handling missing data in research synthesis. In: Cooper H, Hedges LV, editors. The handbook of research synthesis. New York: Sage Publications, Inc.; 1994. pp. 163–76.
[22] Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Missing data. In: Methods for meta-analysis in medical research. New York: John Wiley; 2000. pp. 199–204.
[23] Cranney A, Tugwell P, Wells G, Guyatt G. Meta-analyses of therapies for postmenopausal osteoporosis. I. Systematic reviews of randomized trials in osteoporosis: introduction and methodology. Endocr Rev 2002;23:496–507.
[24] Marinho V, Higgins J, Logan S, Sheiham A. Fluoride gels for preventing dental caries in children and adolescents (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[25] Stangl DK, Berry DA. Meta-analysis: past and present challenges. In: Stangl DK, Berry DA, editors. Meta-analysis in medicine and health policy. New York: Marcel Dekker, Inc.; 2000. pp. 1–28.
[26] Deeks JJ, Higgins JPT, Altman DG, Cochrane Statistical Methods Group. Analysing and presenting results; Section 8. In: Alderson P, Green S, Higgins JPT, editors. Cochrane reviewer's handbook 4.2.2 [updated March 2004]; The Cochrane Library, Issue 1. Chichester, UK: John Wiley & Sons, Ltd.; 2004.
[27] Mendenhall W, Ott L, Scheaffer RL. Elementary survey sampling. Belmont, CA: Duxbury Press; 1971. pp. 7–41.
[28] Norris SL, Lau J, Smith SJ, Schmid CH, Engelgau MM. Self-management education for adults with type 2 diabetes: a meta-analysis of the effect on glycemic control. Diabetes Care 2002;25:1159–71.
[29] Pearson ES. The percentage limits for the distribution of range in samples from a normal population. Biometrika 1932;24:404–17.
[30] Glass GV. Integrating findings: the meta-analysis of research. Rev Res Ed 1977;5:351–79.
[31] Ausejo M, Saenz A, Pham B, Kellner JD, Johnson DW, Moher D, Klassen TP. The effectiveness of glucocorticoids in treating croup: meta-analysis. Br Med J 1999;319:595–600.
[32] Sauerland S, Lefering R, Neugebauer E. Laparoscopic versus open surgery for suspected appendicitis (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[33] O'Rourke K. Re: [smglist] Cochrane help: skewed data. E-mail to: Vail A, SMGlist. 2001 Jul 23.
[34] Hooper L, Bartlett C, Davey Smith G, Ebrahim S. Systematic review of long term effects of advice to reduce dietary salt in adults. Br Med J 2002;325:628.
[35] Thompson R, Summerbell C, Hooper L, Higgins JPT, Little PS, Talbot D, Ebrahim S. Dietary advice given by a dietician versus other health professional or self-help resources to reduce blood cholesterol (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[36] Craig JV, Lancaster GA, Williamson PR, Smyth RL. Temperature measured at the axilla compared with rectum in children and young people: systematic review. Br Med J 2000;320:1174–8.
[37] Sestini P, Renzoni E, Robinson S, Poole P, Ram F. Short-acting beta 2 agonists for stable chronic obstructive pulmonary disease (Cochrane Review). In: The Cochrane Library, Issue 3. Chichester, UK: John Wiley & Sons, Ltd.; 2002.
[38] Brown L, Rosner B, Willett WW, Sacks FM. Cholesterol-lowering effects of dietary fiber: a meta-analysis. Am J Clin Nutr 1999;69:30–42.
[39] Hazell P, O'Connell D, Heathcote D, Robertson J, Henry D. Efficacy of tricyclic drugs in treating child and adolescent depression: a meta-analysis. Br Med J 1995;310:897–901.
[40] Sanchez-Meca J, Marin-Martinez F. Weighting by inverse variance or by sample size in meta-analysis: a simulation study. Educ Psychol Measure 1998;58:211–20.
[41] Kelley GA, Kelley KS, Tran ZV. Exercise and bone mineral density in men: a meta-analysis. J Appl Physiol 2000;88:1730–6.
[42] Zhu W. Making bootstrap statistical inferences: a tutorial. Res Q Exerc Sport 1997;68:44–55.
[43] Kasiske B, Lakatua J, Ma J, Louis T. A meta-analysis of the effects of dietary protein restriction on the rate of decline in renal function. Am J Kidney Dis 1998;31:954–61.
[44] Rubin DB, Schenker N. Multiple imputation in health-care databases: an overview and some applications. Stat Med 1991;10:585–98.
[45] Hedges LV, Olkin I. Vote counting methods in research synthesis. Psychol Bull 1980;88:359–69.
[46] Bushman BJ. Vote-counting procedures in meta-analysis. In: Cooper H, Hedges LV, editors. The handbook of research synthesis. New York: Sage Publications, Inc.; 1994. pp. 193–213.
[47] Rosenthal R. Meta-analytic procedures for social research. Newbury Park, CA: Sage Publications; 1991.
[48] Rosenthal R. Combining results of independent studies. Psychol Bull 1978;85:185–93.
[49] Aker PD, Gross AR, Goldsmith CH, Peloso P. Conservative management of mechanical neck pain: systematic overview and meta-analysis. Br Med J 1996;313:1291–6.
[50] Mosteller FM, Bush RR. Selected quantitative techniques. In: Lindzey G, editor. Handbook of social psychology: volume 1. Theory and method. Cambridge, MA: Addison-Wesley; 1954. pp. 289–334.
[51] Russell K, Wiebe N, Saenz A, Ausejo Segura M, Johnson DW, Hartling L, Klassen TP. Glucocorticoids for croup (Cochrane Review). In: The Cochrane Library, Issue 1. Chichester, UK: John Wiley & Sons, Ltd.; 2004.
[52] Harrell FE Jr. Regression modeling strategies. New York: Springer (Springer Series in Statistics); 2001. pp. 41–2.
[53] Robertson C, Idris NR, Boyle P. Beyond classical meta-analysis: can inadequately reported studies be included? Drug Discov Today 2004;9:924–31.