J Clin Epidemiol 1995; 48(1): 133-146. Copyright © 1995 Elsevier Science Ltd.
STATISTICAL AND THEORETICAL CONSIDERATIONS IN META-ANALYSIS

INGRAM OLKIN

Department of Statistics, Stanford University, CA 94305-4065, U.S.A.

Abstract: An historical perspective on statistical methods for combining the results of independent studies is provided. The information explosion subsequent to 1950 has been a critical factor in the development of the field. An example of a meta-analysis is used to illustrate alternative analyses that should be considered when conducting a meta-analytic study.

INTRODUCTION

The New York Times (7 January 1994), in a report on the effect of aspirin therapy for the prevention of recurrence of heart attacks or strokes, provides a definition of meta-analysis: "A meta-analysis aims at gleaning more information from existing data by pooling the results of many smaller studies and applying one or more statistical techniques. The benefits or hazards that might not be detected in small studies can be found in a meta-analysis that uses data from thousands of patients." The National Library of Medicine definition is similar: "A quantitative method of combining the results of independent studies (usually drawn from the published literature) and synthesizing summaries and conclusions which may be used to evaluate therapeutic effectiveness, plan new studies, etc., with application chiefly in the areas of research and medicine." These definitions are reasonable in that they capture the spirit and intent of meta-analysis, though one might argue with one or two details. However, the recent introduction of these definitions into the literature may also convey that meta-analysis is a new idea. Before launching into some of the statistical issues, I would like to provide some history that may serve to put meta-analysis into a broader perspective.

SOME HISTORY

One of the earliest meta-analyses, perhaps the first, was conducted by Karl Pearson [1], who combined the results of five studies on the effectiveness of inoculation against enteric fever. In his book on research methods, Sir Ronald A. Fisher [2, p. 99] discussed the problem more globally:

21.1. The Combination of Probabilities from Tests of Significance

"When a number of quite independent tests of significance have been made, it sometimes happens that although few or none can be claimed individually as significant, yet the aggregate gives an impression that the probabilities are on the whole lower than would often have been obtained by chance. It is sometimes desired, taking account only of these probabilities, and not of the detailed composition of the data from which they are derived, which may be of very different kinds, to obtain a single test of the significance of the aggregate, based on the product of the probabilities individually observed."

Fisher's interest in combining results arose from an agricultural context, and his method is based on the product of the p-values. Motivated by industrial applications, Tippett [3] proposed an alternative method for combining p-values based on the minimum p-value.


Somewhat later, in a major series of studies in social psychology in World War II, Stouffer et al. [4, p. 45] proposed yet another method in which each p-value is converted to a normal score. This method was stated in a footnote that could easily be missed; it was later explained explicitly in Mosteller and Bush [5], who discussed other methods as well. These three methods [3-5] are based on the chi-square, uniform and normal distributions, respectively, and comprise the standard procedures. General discussions of combining p-values are provided by Hedges and Olkin [6] and by Becker [7].

The problem of combining information surfaced sporadically. Egon Pearson [8] writes:

"In statistical practice it often happens that we wish to combine the results of a number of independent experiments which have all been planned to test a common hypothesis. Thus, for example, several experiments comparing two treatments may have been carried out, but owing to differences in error variance or to other changes in conditions between experiments, it is not possible to pool all the data together. The overall test calls therefore for the combination of a number of independent tests of significance."

In a psychological context, data on an intelligence test given to 13 similarly selected samples of boys and girls of age 13-17 in various cities were reported in McNemar and Terman [9]. Gordon et al. [10] use this example to combine results using Fisher's method. The introduction of their paper is a reiteration of the need for combining results:

"Duplicate studies, and studies bearing on the same essential problem but based on samples from non-identical populations, often lead to tests of significance whose associated probabilities are quite unequal. If each sample is small, moreover, the null hypothesis may not be rejected with any great certainty on the basis of any one study. But if the results of the several studies are all or nearly all in the same direction, there is additional evidence against the null hypothesis. In such cases we may wish to base a new test of significance on the combined data from all the studies."

Agriculture was a primary catalyst during the period 1930-1950. For example, Cochran [11, 12] entitled his papers "Problems arising in the analysis of a series of similar experiments" and "The combination of estimates from different experiments." Other papers relating to combining results are those of Cochran [13-15] and Yates and Cochran [16].

Problems relating to the inflation of statistical measures when aggregating census tract data were noted early by Gehkle and Biehel [17]. Later Yule and Kendall [18] examined the correlation coefficient between potato and wheat yields in 48 counties in England, and showed the inflation of the correlation when combining contiguous counties. In the social sciences there were attempts to impute group data results to individuals. The fallacy in doing this was noted by Robinson [19], and a variety of model-based solutions were subsequently proposed (e.g., Goodman [20]). A general discussion of issues in aggregation and disaggregation is given by Hannan [21].

In a certain sense, convolution and deconvolution in physics are examples in which combining and decombining take place. Herein, there is a distinct difference in that the combined frequencies are not replicates, but rather are complementary. Also, in many cases in physics the frequencies are deterministic. However, in time series random components of different types are aggregated, and we observe only the aggregated result. This is not to suggest that meta-analyses and convolutions are the same, but rather that a broader view of combinations may suggest methodology from one field that is useful in another field. An example wherein methods from disparate fields, engineering and medicine, led to some commonality is the study of "survival analysis." The genesis of survival analysis in engineering is from life distributions of metals undergoing fatigue, whereas its genesis in medicine is from life distributions of individuals undergoing a treatment. Although some techniques are unique to engineering or to medicine, others are common to both.

There was not much research on meta-analysis procedures during the period 1950-1976. A renewed interest was kindled in education with the work of Glass [22], in which he combined results on the effectiveness of class size on achievement (see also Glass et al. [23]). Glass coined the term "meta-analysis," which he viewed as a process that included a search and preparation of the results of independent studies, as well as a quantifiable combination of effect sizes.


I would like to reinforce that meta-analysis is not solely a synonym for the statistical procedures that produce a combined effect size; it includes the labors of a search, coding, assessing quality, modeling missing data, accounting for biases, and so on. Actually, the effort required prior to carrying out the statistical analyses is usually the most formidable part of a meta-analysis. (For an excellent discussion of issues relating to the pre-statistical analysis, see Light and Pillemer [24].)

The label "meta-analysis" has not met with universal appeal. Indeed, it is interesting to note that some coined names such as black box, wavelets, fractals, jackknife, bootstrap, and game theory have enjoyed instant appeal. Others, such as fuzzy logic and, alas, meta-analysis, have generated antipathy among some researchers. However, regardless of whether a term is or is not appealing, the use of a new term often helps to popularize techniques or ideas. Some researchers prefer the rubric "synthesis," "combining," "pooling," "overview," and so on. Although these terms may be more euphonious, they each suffer from not being explicit, in the sense that meta-analysis implies the use of quantification, which these other terms do not. "Aggregation" is an alternative name that has a long history and does carry a note of quantification; however, its use has been confined primarily to the social sciences. In any case, meta-analysis has flourished, for better or for worse, and in 1989 the National Library of Medicine included it as an item on MEDLINE.


AN INFORMATION EXPLOSION

To what is the popularity of meta-analysis due? In short, I believe meta-analysis has become an essential ingredient in coping with the current information explosion. In 1940 there were 2300 biomedical journals; there are now close to 25,000. There are approximately 9000 randomized clinical trials per year. It has been noted that the 10 leading journals in internal medicine contain over 200 articles and 70 editorials per month! In fields other than medicine there has been a similar information explosion. As of 1985 there were over 750 studies on weather modification (cloud seeding). Review papers in psychology have mushroomed since 1980, with an increase in the number of journals from 91 in 1951 to 1195 in 1992. The number of articles in mathematics increased from 3379 in 1940 to 58,208 in 1992; the number of journals has risen from 91 in 1953 to 920 in 1992. There appears to have been a 10-fold increase during the last 40-50 years.

Medical information arises from different types of data. David Byar, formerly of the National Institutes of Health, prepared a hierarchy of evidence (1-8):

1. Anecdotal case reports
2. Case series without controls
3. Series with literature controls
4. Analyses using computer data bases
5. Case-control observational studies
6. Series based on historical control groups
7. Simple randomized controlled trials
8. Confirmed randomized controlled clinical trials, including META-ANALYSIS

to which I have added a ninth:

9. Meta-analysis with original data

Meta-analysis from individual patient data is an ideal goal, but it must remain in the realm of a luxury. There are currently close to 2000 meta-analyses in medicine, and it is unrealistic to think that we have the personnel and financial resources to carry out a meta-analysis on individual patient data in each instance. Thus, it behooves us to examine what is gained and lost at each step of this hierarchy in order to assess the credibility of each type of data, as well as how much of our resources we should put into each type. I am suggesting a series of studies that will demonstrate the gain in information and the corresponding increase in cost in comparing various levels of evidence, as for example "case series without controls" vs "case-control observational studies."

A recent meta-analysis of ovarian cancer risk was based on 12 U.S. case-control studies from the period 1956-1986, representing 3000 cases and 10,000 controls [25]. Clearly, individual patient data provide more information than summary data. Thus, an important question is how much is lost if we analyze the data using summary data. In particular, we need to keep account of the time and cost of the two approaches. This will permit a rationale for deciding which areas will profit from an analysis of individual data. The Centers for Disease Control and Prevention are currently pursuing this question.

DEALING WITH UNCERTAINTY


Uncertainty, variability, and diversity are the raison d'être of the statistical sciences. But I don't believe that we have been successful in conveying uncertainty. I question whether significance levels, confidence intervals, or power serve to clarify uncertainty for the user. Studies by Tversky and Kahneman [26] suggest that people have trouble understanding even simple probabilities, and that they may think in degrees of truth rather than likelihoods. For example, in their well-known experiment they presented subjects with a description of a woman as being outspoken, concerned about social justice, and of age 31. Subjects were asked whether it was more likely that she was (A) a bank teller or (B) a bank teller and a feminist. More respondents chose (B), even though (A) is less restrictive, and therefore more likely. Experiments conducted with other groups such as physicians appear to yield similar results.

The communication of uncertainty takes various forms. Weather forecasters give probabilities of rain, though it is unclear how such a probability should be interpreted. Is it the forecaster's prior, is it a vote count over a number of forecasters, or is it a relative frequency in which the denominator is the number of similar weather maps at the same time of year? We also quantify linguistically by using a verbal scale, as in the case of juvenile, youngster, adult, grown-up, middle-aged, mature, senior, elder. The Likert seven-point scale (disagree strongly, disagree, disagree mildly, no opinion, agree mildly, agree, agree strongly) is also an attempt to introduce gradations. How is the public to interpret headlines used by the media? I quote from two current headlines (New York Times, 2 March 1994).

Drug Reported to Slow Lou Gehrig’s Disease

An experimental new drug appears to be the first to slow the fatal progression of Lou Gehrig's disease, the illness that until now has defied all attempts at treatment. The medicine, called riluzole, is not a cure. But a study financed by the maker of riluzole found it seemed to delay crippling symptoms and death. Survival almost doubled in some patients. But some experts are skeptical and worry that the initial study's seemingly positive findings could have been a statistical fluke.

Tuberculosis Vaccine Found Surprisingly Effective

A vaccine to prevent tuberculosis that is used only infrequently in this country, largely because Federal Health officials consider it unreliable, is surprisingly effective, a new statistical study has discovered. It was found to reduce the risk of full-fledged tuberculosis of the lung by 50 percent and death by 71 percent.

This article was accompanied by a chart (see Fig. 1) exhibiting the range of results obtained from different levels of evidence: prospective trials, case-control studies, laboratory confirmed cases in case-control studies. The best estimates vary from 50 to 80%. Although this chart is an accurate reflection of the results, there may be considerable variability in readers’ interpretations. Uncertainty is inherent in the notion of a confidence interval, in the range of results, or even in a complete distribution. Perhaps a statement such as “this treatment is currently 0.73 effective” might be more meaningful in conveying that at the present state of research we believe that there is considerable evidence of its efficacy.

STATISTICAL DIAGNOSTICS

Some authors have incorporated quality scores in their meta-analyses in order to address the issue of aggregating studies of poor and good quality. This method has also generated the criticism that it introduces subjective criteria into what purports to be an objective procedure. I propose several alternatives, designed to detect disparities among the studies.

1. If quality scores are used, analyses with and without quality scores should be reported. The discovery of differences will suggest further analyses to account for the difference. When using quality scores it is important to determine inter-rater reliability.

2. Carry out multiple meta-analyses by omitting one study at a time. This will permit the detection of particularly influential studies (a small sketch of this diagnostic follows the list).

3. Omit the most significant study or studies (to be conservative), or the least significant studies (to be more certain in detecting differences). Because we are omitting ordered values, the methodology normally used needs to be modified. In some instances the modifications require modest changes; in others, the changes required may be formidable.

4. Use multiple statistical methods of analysis. In particular, covariate analysis (called subgroup analysis in medicine) is particularly important. For example, in the case of thrombolytic therapy, an important covariate is the time at which the therapy is administered. Thus, a plot of effect size (odds ratio, p-value, etc.) against time should be most informative.

5. Note that proportions can be fragile when the numerator of a fraction is small, or when the denominator is small. In Figs 2 and 3 we should consider the effect of a change of one person in the experimental or control group. Also, by deleting one study at a time we can observe the effect of each study (Figs 2 and 3). (For a general discussion of fragility see Ref. [27].)
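To make points 2 and 5 concrete, here is a minimal sketch of the delete-one-study diagnostic, written in Python (the article itself gives no code). The counts are those shown for the five oxytocics studies in Fig. 2; the studies are pooled by a simple fixed-effect, inverse-variance weighting of log odds ratios, which will not reproduce the figure's combined value exactly if a different estimator was used there. The function names are illustrative only.

    import math

    # Each study: (events, total) in the experimental group and in the control group,
    # taken from the five oxytocics studies displayed in Fig. 2.
    studies = [(24, 963, 25, 470), (1, 150, 5, 50), (45, 490, 80, 510),
               (34, 346, 42, 278), (24, 717, 2, 177)]

    def log_or_and_var(a, n1, c, n2):
        """Log odds ratio and its large-sample variance for one 2 x 2 table."""
        b, d = n1 - a, n2 - c
        return math.log(a * d / (b * c)), 1/a + 1/b + 1/c + 1/d

    def pooled_or(tables):
        """Fixed-effect (inverse-variance) pooled odds ratio."""
        pairs = [log_or_and_var(*t) for t in tables]
        weights = [1 / v for _, v in pairs]
        mean = sum(w * y for w, (y, _) in zip(weights, pairs)) / sum(weights)
        return math.exp(mean)

    print("all studies:", round(pooled_or(studies), 2))
    # Delete one study at a time to see how influential each study is (cf. Fig. 3),
    # and hence how fragile the combined result is to any single study.
    for i in range(len(studies)):
        print("omitting study", i + 1, "->",
              round(pooled_or(studies[:i] + studies[i + 1:]), 2))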

STATISTICAL PROCEDURES

1. p-Values

Although p-values are inadequate measures for cumulating results, used properly they may serve as indicators of aberrations. As such they can be useful in determining the consistency of patterns.

It is relatively easy to use the Fisher (chi-square distribution) or Stouffer (normal distribution) methods for combining results. I suggest omitting one p-value at a time and computing a combined value. A frequency distribution of the combined p-values is a useful diagnostic. Figure 4 provides data on 70 studies of the effectiveness of thrombolytic therapy in acute myocardial infarction. I owe thanks to Drs Tom Chalmers and Joseph Lau for the use of these data. The data provide odds ratios and 95% confidence intervals (Fig. 5). A frequency distribution of the p-values, a smoothed histogram and an empirical distribution are given in Figs 6-8. The p-values are plotted against sample sizes in Fig. 9.

If the p-values are ordered in ascending order, then the deletion of 5% from each tail has the effect of not giving heavy weight to the extremes. This may serve as a check on the robustness of the results. However, the Fisher and Stouffer methods for combining no longer apply. What is needed is the distribution of

L = -2 Σ_{i=a}^{b} log p_(i),

where p_(1) ≤ p_(2) ≤ ... ≤ p_(k) are the ordered p-values. The distribution of L is chi-square when a = 1, b = k, but not otherwise. Instead, it is the distribution of a weighted sum of exponentials; the required distribution is given in Olkin and Saner [28]. If one uses normal order statistics, other methods are required; these are given in Saner [29]. Under the null hypothesis, the range of the p-values from k studies has a well-known beta distribution with parameters k - 1 and 2, from which lower- and upper-tail critical values can be determined. Critical values can be obtained directly from a table of the incomplete beta function.

There has been a well-earned emphasis on publication bias. Some papers provide a fail-safe number. Although the fail-safe number is a simple diagnostic, it is not based on a statistical model. Iyengar and Greenhouse [30] provide more sophisticated models to estimate publication bias. For a general discussion of publication bias see Begg and Berlin [31]. Another procedure is to use the largest p-value, which is the sufficient statistic when we assume that we observe the k most significant values out of k + u studies. From the distribution of the largest p-value a lower confidence bound for the number u of unpublished studies can be obtained [32].
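For concreteness, the following Python sketch (not code from the article; it assumes SciPy is available and that the p-values are one-sided) implements the Fisher and Stouffer combinations and the delete-one combined p-value suggested above. The p-values are made up for illustration; they are not the 70 thrombolytic studies of Fig. 4.

    import math
    from scipy import stats

    def fisher_combined(pvals):
        """Fisher: -2 * sum(log p) is chi-square with 2k degrees of freedom under H0."""
        statistic = -2 * sum(math.log(p) for p in pvals)
        return stats.chi2.sf(statistic, df=2 * len(pvals))

    def stouffer_combined(pvals):
        """Stouffer: each p becomes a normal score; their sum divided by sqrt(k) is N(0, 1) under H0."""
        z = sum(stats.norm.isf(p) for p in pvals) / math.sqrt(len(pvals))
        return stats.norm.sf(z)

    pvals = [0.01, 0.04, 0.20, 0.65, 0.03, 0.12]   # hypothetical one-sided p-values
    print("Fisher:  ", fisher_combined(pvals))
    print("Stouffer:", stouffer_combined(pvals))

    # Delete-one diagnostic: the spread of these combined values indicates how much
    # any single study drives the overall result.
    for i, p in enumerate(pvals):
        rest = pvals[:i] + pvals[i + 1:]
        print("omitting p =", p, "->", round(fisher_combined(rest), 4))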

2. Multivariate methods

Although most studies have more complex designs than just treatment and control, little has been done either in the design of an experiment or in the analysis. Many studies have multiple treatment groups. For example, suppose there are a number of studies of the binding percentage characteristics of five widely used antibiotics: P = penicillin G, T = tetracycline, S = streptomycin, E = erythromycin, C = chloramphenicol. The design might be as follows, where an asterisk indicates that a treatment (or the control) appears in a given study.

[Design table: studies 1-3 in rows; columns P, T, S, E, C and Control; asterisks mark the treatments included in each study. Not every antibiotic appears in every study.]

Regardless of what method of analysis is used, a dependency is introduced by virtue of the fact that a common control is compared with each treatment. This dependency can be taken into account when using standardized mean differences or differences of proportions. It becomes more complicated when using odds ratios. In effect, what is required is the determination of the correlation between two odds ratios that contain a common control. A second aspect of the design is that not every antibiotic is used in each study. This can be treated by a regression model, and also as a missing-data model. A discussion of multiple treatment groups is given in Gleser and Olkin [33].

Multiple end-points are perhaps more the norm than has been indicated, and the correlations between effects, such as odds ratios, are generally ignored. One way to deal with correlations and avoid using multivariate procedures is to alter the significance levels by the use of Bonferroni inequalities. The use of multivariate procedures is discussed in Gleser and Olkin [33].
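As an illustration of the dependency introduced by a shared control group (and not the full multivariate treatment of Gleser and Olkin [33]), the following Python fragment uses the standard large-sample result that the control arm contributes the terms 1/c + 1/d to the variance of each log odds ratio and hence to their covariance. The counts are hypothetical.

    import math

    def shared_control_cov(a1, b1, a2, b2, c, d):
        """Two treatments compared with one common control.
        (a_i, b_i): events and non-events in treatment i; (c, d): events and non-events in the control.
        Returns the variances of the two log odds ratios, their covariance, and their correlation."""
        var1 = 1/a1 + 1/b1 + 1/c + 1/d
        var2 = 1/a2 + 1/b2 + 1/c + 1/d
        cov = 1/c + 1/d                      # terms contributed by the shared control arm
        return var1, var2, cov, cov / math.sqrt(var1 * var2)

    # Hypothetical counts: two antibiotic arms sharing one control arm.
    print(shared_control_cov(a1=12, b1=38, a2=18, b2=32, c=25, d=25))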

3. Analysis of variance/regression models

Because subgroups, rather than covariate measures, are commonly reported in medical papers, analysis of variance models may be useful. Frequently, one-way classifications will be reported, whereas two-way (or multi-way) classifications will provide more reliable information. For example, in a multi-country study, country might be a category. Other categories frequently included are age, such as under and over 60, or blood pressure categories. Thus, it is more informative to use a two-way analysis of variance, rather than two one-way analyses based on the marginal distributions for age, with proportions p1+, p2+, and for blood pressure, with proportions p+1, p+2, p+3:

                     Blood pressure
    Age        <105     105-125    ≥125
    <60        p11      p12        p13      p1+
    ≥60        p21      p22        p23      p2+
               p+1      p+2        p+3

We can also invoke different methodologies such as fixed and random effects models, or hierarchical models.

4. Sequential models

There are many ways to make use of sequential or cumulative meta-analysis. In particular, it permits a display in which a covariate can be taken into account. This is exemplified by the cumulative meta-analysis of the data given in Fig. 4. Here we cumulate using sample size (Fig. 10). The cumulation is made using odds ratios and could also be made using p-values. An example of a stopping procedure for meta-analyses of the efficacy of new drugs has been proposed by Whitehead [34, 35]. This formal method is based on a triangular stopping rule in which trials are plotted with an ultimate crossing of a boundary. Software using this procedure is called PEST (Planning and Evaluation of Sequential Trials).
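A cumulative meta-analysis of the kind shown in Fig. 10 can be sketched in a few lines: order the studies by some characteristic (here total sample size) and recompute the pooled estimate as each study is added. The Python fragment below uses a simple fixed-effect pooling of log odds ratios rather than the D&L random effects model of the figure, and the data are hypothetical.

    import math

    # Hypothetical studies: (events, total) for the treatment group, then for the control group.
    studies = [(4, 50, 7, 50), (20, 150, 31, 150), (15, 300, 22, 290), (60, 900, 85, 880)]

    def log_or_and_var(a, n1, c, n2):
        b, d = n1 - a, n2 - c
        return math.log(a * d / (b * c)), 1/a + 1/b + 1/c + 1/d

    # Cumulate in order of increasing total sample size.
    studies.sort(key=lambda s: s[1] + s[3])
    ys, ws = [], []
    for k, s in enumerate(studies, start=1):
        y, v = log_or_and_var(*s)
        ys.append(y)
        ws.append(1 / v)
        pooled = sum(w * y for w, y in zip(ws, ys)) / sum(ws)
        se = 1 / math.sqrt(sum(ws))
        print(f"after {k} studies: OR = {math.exp(pooled):.2f} "
              f"(95% CI {math.exp(pooled - 1.96 * se):.2f}-{math.exp(pooled + 1.96 * se):.2f})")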

5. Logistic regression with summary data

Logistic regression is well suited to take account of subgroup variation when individual patient data are available. However, logistic regression can also be used with summary data provided the number of studies is ample. The idea is as follows. If p denotes the proportion of successes (different for the treatment and control groups), and x_1j, ..., x_mj denote the means of the covariates (e.g. age, time of treatment, length of treatment, etc.) in the jth study, then the logistic models for the control and treatment groups are

y_j^C = log[p_j^C / (1 - p_j^C)] = b_0^C + b_1^C x_1j + ... + b_m^C x_mj,

y_j^T = log[p_j^T / (1 - p_j^T)] = b_0^T + b_1^T x_1j + ... + b_m^T x_mj.

The difference

y_j = log(OR)_j = b_0 + b_1 x_1j + ... + b_m x_mj

provides a regression model for predicting the log odds ratio. Once the beta weights are obtained using ordinary least squares, a confidence interval can be generated for the mean of the log odds ratio. By exponentiating, this confidence interval can be converted to a confidence interval for the geometric mean of the odds ratios.
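As a sketch of the summary-data regression just described (hypothetical numbers, unweighted ordinary least squares as in the text, and a normal-approximation confidence interval), the following Python fragment regresses study-level log odds ratios on study-level mean covariates and exponentiates a confidence interval for the fitted mean to obtain one for the geometric mean of the odds ratios.

    import numpy as np

    # Hypothetical summary data, one row per study: mean age and mean time to treatment (hours).
    X = np.array([[58.0, 2.0], [61.0, 3.5], [63.0, 5.0], [57.0, 1.5],
                  [65.0, 6.0], [60.0, 4.0], [59.0, 2.5], [62.0, 3.0]])
    y = np.array([-0.45, -0.30, -0.10, -0.55, 0.05, -0.20, -0.40, -0.35])  # observed log odds ratios

    # Ordinary least squares for y_j = b0 + b1 * age_j + b2 * time_j.
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

    # Fitted mean log odds ratio at the average covariate values, with a normal-approximation 95% CI.
    x0 = np.concatenate([[1.0], X.mean(axis=0)])
    resid = y - Xd @ beta
    n, p = Xd.shape
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * x0 @ np.linalg.inv(Xd.T @ Xd) @ x0)
    centre = x0 @ beta
    lo, hi = centre - 1.96 * se, centre + 1.96 * se

    # Exponentiating converts the interval to one for the geometric mean of the odds ratios.
    print("geometric mean OR:", np.exp(centre), "95% CI:", np.exp(lo), np.exp(hi))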

CONCLUDING RECOMMENDATIONS

The following are some recommendations aimed at improving the quality of meta-analyses. Several of these recommendations might also serve as a basis for standards for meta-analyses.

I. Improve the quality of statistical reporting in medical journals. The general principle that should prevail is that enough data should be presented so that the resulting computations can be verified. For example, the statement "the odds ratio is 1.2" is inadequate because it can arise from an infinite number of scenarios. Age descriptions such as 30-89 are not particularly helpful, and should be broken down into more usable intervals.

II. Create a repository in which data from each published paper can be stored. It would be wonderful to have original data, but such a proposal is currently unrealistic unless a governmental (national or international) agency assumes responsibility. Original data also raise uncomfortable confidentiality issues that need to be resolved. Summary data do not suffer from issues of confidentiality, and should be provided in considerable detail.

III. Provide guidelines for good studies. These should include discussion of design, randomization, measurement of end points, etc.

IV. Meta-analysis should be used to plan new studies. This requires more use of covariates (subgroups) in studies. For example, in the use of radium (Ra) or plutonium (Pu) in the treatment of bone cancer there are studies in the categories: humans-Ra; dogs-Ra, Pu; rats-Pu.

                  Ra     Pu
    Humans        *      ?
    Dogs          *      *
    Rats          ?      *
    New species   ?      ?

The design question is whether a new experiment should fill the missing gaps, or whether it would be preferable to use another species. (This example is due to William DuMouchel.)


V. Encourage the use of covariates (subgroups). However, it is important to distinguish between exploratory and confirmatory analysis. (Subgroups imply a separation into categories, whereas covariates are frequently actual measurements. The use of measurements permits better regression analyses.)

VI. Undertake research to better determine the effects of publication, selection, and other forms of bias.

VII. Software for meta-analysis is being developed by a number of groups. At one level, software should be written so that it is easy to use. On the other hand, when it is too user-friendly there is a danger that it will become an automatic procedure, and subject to potential abuse. I would like to see an opening statement such as: meta-analysis is not an automatic procedure; it is a time-consuming process requiring the joint skills of a statistician and a subject matter specialist.

VIII. Statistical procedures for meta-analysis are still in an early state of development. Little theory has been developed in multivariate analysis, longitudinal analysis, Bayes procedures, hierarchical models, and computer procedures.

IX. Develop and improve statistical graphics for the display of meta-analyses.

X. Cumulative meta-analyses should be carried out using different characteristics (e.g. chronologically, by effect size, by sample size, by covariates, etc.). There is a need to develop a theory of stopping, together with an analysis of the consequences of using different stopping rules.


Fig. 1. Combining studies to analyze a vaccine's effectiveness: percent protection provided by BCG vaccine (range and best estimate) for TB cases in prospective trials, TB cases in case-control studies, laboratory-confirmed cases in case-control studies, and TB deaths in prospective trials. Source: Harvard School of Public Health.

Fig. 2. Effect of prophylactic oxytocics in the third stage of labour on postpartum haemorrhage: events and odds ratios with 95% confidence intervals for five studies (0.43, 0.04, 0.55, 0.61, 2.19), with combined odds ratio 0.57 (0.44-0.73).

Fig. 3. Graph of confidence intervals of delete-one odds ratios.

Fig. 4. Thrombolytic therapy in acute myocardial infarction meta-analysis: 70 studies, random effects model (D&L); overall odds ratio 0.73 (95% CI 0.66-0.80), z = 9.95, 2p < 0.00001, 48,154 patients in total.

Fig. 5. Thrombolytic therapy in acute myocardial infarction: odds ratios with 95% confidence intervals (Mantel-Haenszel method, fixed effects model).

Fig. 6. p-Values, 70 thrombolytic studies.

Fig. 7. Smoothed histogram, 70 thrombolytic studies.

Fig. 8. Empirical cdf for p-values, thrombolytic studies.

Fig. 9. Plot of sample size vs p-values, thrombolytic studies.

Fig. 10. Thrombolytic therapy in acute myocardial infarction: cumulative rate difference with 95% CI, D&L method (random effects model), studies ordered by sample size.


REFERENCES

1. Pearson K. Report on certain enteric fever inoculation statistics. Br Med J 1904; 3: 1243-1246.
2. Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1932.
3. Tippett LHC. The Methods of Statistics. London: Williams & Norgate; 1931.
4. Stouffer SA, Suchman EA, DeVinney LC, Star SA, Williams RM Jr. The American Soldier: Adjustment during Army Life, Vol. I. Princeton, NJ: Princeton University Press; 1949.
5. Mosteller F, Bush RR. Selected quantitative techniques. In: Lindzey G, Ed. Handbook of Social Psychology. Cambridge, MA: Addison-Wesley; 1954: Chap. 8, 289-334.
6. Hedges LV, Olkin I. Statistical Methods for Meta-analysis. Orlando, FL: Academic Press; 1985.
7. Becker BJ. Combining significance levels. In: Cooper HM, Hedges LV, Eds. The Handbook of Research Synthesis. New York: Russell Sage Foundation; 1993.
8. Pearson ES. On questions raised by the combination of tests based on discontinuous distributions. Biometrika 1950; 37: 383-398.
9. McNemar Q, Terman LM. Sex differences in variational tendency. Genet Psychol Monogr 1936; 18(1): 31.
10. Gordon MH, Loveland EH, Cureton EE. An extended table of chi-square for two degrees of freedom, for use in combining probabilities from independent samples. Psychometrika 1952; 17: 311-316.
11. Cochran WG. Problems arising in the analysis of a series of similar experiments. J R Stat Soc 1937; Suppl. 4: 102-118.
12. Cochran WG. The combination of estimates from different experiments. Biometrics 1954; 10: 101-129.
13. Cochran WG. Some difficulties in the statistical analysis of replicated experiments. Empire J Exp Agric 1938; 6: 157-175.
14. Cochran WG. Summarizing the results of a series of experiments. Paper 112. In: Cochran WG, Ed. Contributions to Statistics. New York: Wiley; 1980.
15. Cochran WG. In: Moses LE, Mosteller F, Eds. Planning and Analysis of Observational Studies. New York: Wiley; 1983.
16. Yates F, Cochran WG. The analysis of groups of experiments. J Agric Sci 1936; 28(Pt. IV): 556-580.
17. Gehkle C, Biehel R. Certain effects of grouping upon the size of the correlation coefficient in census tract material. J Am Stat Assoc 1934; Suppl. 29: 169-170.
18. Yule GU, Kendall MG. An Introduction to the Theory of Statistics. London: Charles Griffin; 1950.
19. Robinson WS. Ecological correlations and the behavior of individuals. Am Sociol Rev 1950; 15: 351-357.
20. Goodman L. Ecological regression and the behavior of individuals. Am J Sociol 1953; 64: 610-625.
21. Hannan MT. Aggregation and Disaggregation in Sociology. Lexington, MA: D. C. Heath & Co.; 1971.
22. Glass GV. Primary, secondary, and meta-analysis of research. Educ Res 1976; 5: 3-8.
23. Glass GV, McGaw B, Smith ML. Meta-analysis in Social Research. Beverly Hills, CA: Sage; 1981.
24. Light RJ, Pillemer DB. Summing Up: The Science of Reviewing Research. Cambridge, MA: Harvard University Press; 1984.
25. Whittemore AS, Harris R, Intyre J, Halpern J and the Collaborative Ovarian Cancer Group. Characteristics relating to ovarian cancer risk: collaborative analysis of 12 U.S. case-control studies. I. Methods. Am J Epidemiol 1992; 136: 1175-1183.
26. Tversky A, Kahneman D. Probability, representativeness, and the conjunction fallacy. Psychol Rev 1983; 90: 293-315.
27. Feinstein AR. The unit fragility index: an additional appraisal of "statistical significance" for a contrast of two proportions. J Clin Epidemiol 1990; 43(2): 201-209.
28. Olkin I, Saner H. A robust procedure for combining p-values in integrative research. Technical Report. Stanford, CA: Stanford University; 1993.
29. Saner HL. Robust Methods in Meta-analysis. Ph.D. dissertation. Stanford, CA: Stanford University; 1991.
30. Iyengar S, Greenhouse JB. Selection models and the file drawer problem (with discussion). Stat Sci 1988; 3: 109-135.
31. Begg C, Berlin J. Publication bias: a problem in interpreting medical data (with discussion). J R Stat Soc A 1988; 151: 419-463.
32. Gleser LJ, Olkin I. Models for estimating the fail-safe number in the file drawer problem. Technical Report No. 284. Stanford, CA: Stanford University; 1991.
33. Gleser LJ, Olkin I. Stochastically dependent effect sizes. In: Cooper H, Hedges LV, Eds. The Handbook of Research Synthesis. New York: Russell Sage Foundation; 1993: Chap. 22, 339-355.
34. Whitehead J. Sequential designs for pharmaceutical clinical trials. Pharmaceutical Med 1992; 6: 179-191.
35. Whitehead J. Application of sequential methods to a phase III clinical trial in stroke. Drug Inf J 1993; 27: 733-740.