The Likelihood Ratio Versus the p Value in Meta-analysis: Where Is the Evidence? Comment on the Paper by S. N. Goodman

David Zucker, PhD and Salim Yusuf, MRCP, DPhil

Biostatistics Research Branch and Clinical Trials Branch, Division of Epidemiology and Clinical Applications, National Heart, Lung, and Blood Institute, Bethesda, Maryland

Combining information from several clinical trials that address the same broad question (often referred to as meta-analysis, overviews, or pooling) has been a topic of intense interest in recent years. The article by Steven N. Goodman in this issue of Controlled Clinical Trials proposes use of the likelihood ratio (LR) for combining evidence from several studies. Goodman argues that the LR is superior to the p value as a measure of the evidence regarding treatment effect provided by a series of trials. We agree that the LR may represent an additional and useful way of displaying data from several clinical trials. However, we disagree with Goodman's implication that the LR approach should replace the more conventional approaches currently in use for meta-analyses. In particular, we consider many of Goodman's arguments suggesting superiority of the LR to the p value to be ill-founded.

It is useful to note preliminarily that there are two main parts to the presentation in Goodman's article. The first part is a general discourse (drawn from other work in the statistical literature) on the defects of the p value and the virtues of the LR. The second part is a description of the straightforward way in which the LRs for several individual studies can be combined to obtain an overall LR. Our comments focus on the first of these two parts, because the thrust of Goodman's criticisms of the p value is general in nature rather than specific to meta-analysis.

To begin with, Goodman overstates the case in suggesting that previous work in meta-analysis has not dealt with the concept of evidence. The p value clearly represents an attempt to measure evidence against the null hypothesis. Because the considerations underlying its use in meta-analysis are similar to those underlying its use for an individual study, meta-analysts have not felt a need to elaborate on the justification of the p value as an evidentiary measure.
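The combination step in the second part of Goodman's article is indeed straightforward: the overall LR is just the product of the per-study LRs (equivalently, the sum of the log-LRs). The following is a minimal sketch of that arithmetic under the usual assumption of a normal likelihood for each study's effect estimate; the function name, trial summaries, and alternative value of theta are hypothetical illustrations of ours, not data from Goodman's paper.

```python
import math

def study_lr(estimate, se, theta_alt, theta_null=0.0):
    """LR for H1: theta = theta_alt versus H0: theta = theta_null,
    assuming the study's effect estimate is normal with known SE."""
    def loglik(theta):
        return -0.5 * ((estimate - theta) / se) ** 2
    return math.exp(loglik(theta_alt) - loglik(theta_null))

# Hypothetical summary data from three trials: (effect estimate, standard error)
trials = [(0.30, 0.20), (0.15, 0.10), (0.25, 0.15)]

# The overall LR is simply the product of the per-study LRs.
overall = 1.0
for est, se in trials:
    overall *= study_lr(est, se, theta_alt=0.25)
print(overall)
```

Working on the log scale (summing log-LRs) is numerically safer when many studies are combined, but the multiplicative form above matches the description in the text.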
Of course, one could debate whether the p value is an ideal measure of evidence. The merits of the p value in fact have been extensively debated in the statistical literature (see, e.g., Goodman's references), and the argument that Goodman presented is essentially a synopsis of that debate.

Address correspondence and reprint requests to: David Zucker, PhD, National Heart, Lung, and Blood Institute, Federal Building, 7550 Wisconsin Avenue, Rm. 2A11, Bethesda, MD 20892.

Controlled Clinical Trials 10:205-208 (1989) © David Zucker, 1989


Nevertheless, most practicing statisticians consider the p value to be a useful inferential measure. This is because the p value is a well-defined, objective measure of the degree to which the observed data are inconsistent with the null hypothesis. Alternative approaches to inference are fraught with problems of their own. The Bayesian approach depends critically on subjective specification of a prior distribution. The approach of drawing inference directly from the LR avoids this problem, but leaves in its place the problem of interpreting the numerical value of the LR in terms of "strength of evidence." Goodman, in his Table 1, offers an interpretive guide to LR values, but without adequate justification. In view of Goodman's emphasis on "intrinsic error," one should note that although the Birnbaum [1] article that Goodman cites discussed intrinsic error, it did not advocate the use of intrinsic error in actual practice. Also, the correspondence that Goodman points out between the 1% intrinsic error criterion and Peto's z = 3.0 rule for statistical evidence seems to be nothing more than coincidence.

Given the broad agreement among practicing statisticians regarding the usefulness of the p value, Goodman's statement that the p value "cannot form a basis for induction" is most puzzling. We wonder what Goodman's definition of "induction" is. He does not seem to be using the term in the familiar sense, referring to the process of drawing conclusions from data. Rather, he seems to be referring to some special, but unspecified, idealized conception of this process.
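For a normally distributed test statistic, the LR favoring the best-supported alternative over H0 is exp(z²/2), so the numbers behind the z = 3.0 discussion can be checked directly: z = 3.0 corresponds to a maximized LR of about 90 and a two-sided p value of about 0.003. The quick computation below is our own illustration, not a calculation from Goodman's paper.

```python
import math
from statistics import NormalDist

z = 3.0

# LR of the best-supported alternative (theta-hat = observed z) versus H0
max_lr = math.exp(z * z / 2.0)

# Conventional two-sided p value for the same z score
two_sided_p = 2.0 * (1.0 - NormalDist().cdf(z))

print(max_lr, two_sided_p)  # roughly 90 and 0.0027
```

The comparison makes concrete why an interpretive guide for LR values needs justification: the mapping between a familiar z cutoff and an LR magnitude is a computation, not an argument.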
Some specific comments are in order regarding several of Goodman's criticisms of the p value: (1) the criticism of the p value's focus on tail areas implicit in Goodman's discussion of relative support and intrinsic error; (2) the criticism that the same p value means different things in trials of different sizes; and (3) the criticism of the p value's vulnerability to multiple comparisons problems.

In contrasting the p value's focus on "the area beyond z = 1.96 in the tails of a normal distribution" with the LR's focus on the "relative probability of two models at the point observed," Goodman suggests that the focus on tail areas results in an exaggeration of the evidence against the null hypothesis. This argument was presented in greater detail in an earlier version of Goodman's paper, drawing on mathematical results of Berger and Sellke [2]. In response, we note that the use of tail areas is readily justified by the interpretation of the p value as the "probability of declaring there to be evidence against H0 when it is in fact true and the data under analysis are regarded as just decisive" [3]. Moreover, the correspondence drawn by Berger and Sellke between the p value and the "inappropriate" posterior probability P(H0 | test statistic ≥ observed value) (as contrasted with P(H0 | test statistic = observed value)) depends on the questionable assumption that H0 has prior probability 1/2, and is of dubious relevance in any case because the p value is not intended to be interpreted as a posterior probability.

Goodman criticizes the p value on the ground that "under the null hypothesis, a small effect in a large trial can be as rare as a big effect in a small trial, producing the same z score and p value." But in fact, in the two situations described, the LR curves also are identical. This is clear from Goodman's eqs. 1 and 2. Indeed, just below these equations Goodman cites
this property, highlighted as a defect of the p value, as a virtue of the LR, showing how well the LR reproduces "our intuition about how evidence should behave." Thus, Goodman's criticism clearly provides no basis for claiming that the LR is superior to the p value.

Regarding the point about multiple comparisons, it is true that multiple comparisons can be a serious problem in meta-analysis. Problems can arise in several ways. For instance, a multiple comparisons problem can arise, just as it can in an individual trial, by testing many subgroup hypotheses (and, contrary to a statement by Goodman, it is hardly "very unusual" for investigators to be able, after the fact, to come up with biological evidence in support of an enticing subgroup result). In addition, a multiple comparisons problem related to the multiple-looks problem in sequential trials can arise in a situation where a series of trials is conducted by various investigators over time and an investigator's decision to do or not to do a further trial is influenced in part by previous trials (this is to be contrasted with the situation where various investigators independently initiate similar trials at about the same time).

The problem of multiple subgroup hypotheses is well recognized, and is conventionally safeguarded against by restricting the number of subgroups analyzed, testing for treatment-by-subgroup interaction, and generally viewing subgroup results with caution. The "multiple looks" problem in meta-analysis is not well recognized and requires further attention, especially in view of the complexities involved in how the planning of trials is influenced by prior trials. In certain situations where several trials have been initiated over an extended period of time (e.g., thrombolysis in myocardial infarction), it may be useful to present the results of "early" trials and "late" trials separately.

Goodman contends that the LR is not affected by multiple comparisons. This is plainly wrong.
Although the computation of the LR is not affected by the number of subgroup hypotheses or by the reasons the trials were conducted, its interpretation is. If enough hypotheses are tested, or the same hypothesis is tested often enough, a big LR is sure to come up eventually. One cannot escape the need to be careful about multiplicity. There is no difference between the LR and the p value in this respect.

The p value often is criticized on the ground that clinicians have difficulty interpreting it or are likely to misinterpret it (e.g., Diamond and Forrester [4]). It is worth noting that the LR (without integrating in a prior distribution to obtain posterior probabilities) is subject to the same concern. This does not imply that the LR is a bad tool, but it does imply that clinicians need to be instructed carefully on the LR's use and interpretation.

Finally, Goodman's description of how p values are used in practice is misleading. Scientists who find the p value useful do not at all attempt to use a single p value as the sole criterion of evidence without recognizing the importance of "results from animal trials, reasoning based on physiologic or biologic principles, analogies to similar treatments, epidemiology, etc." Ultimately, the conclusion that a treatment is effective rests on the concordance of multiple sources of information and consistency among various reasonable analytical approaches. No single number, neither a p value nor an LR, can tell the whole story.
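A small simulation (our own sketch, not from Goodman's paper) illustrates the multiplicity point: even when the null hypothesis is true for every test, the largest LR observed across repeated tests can only grow as more hypotheses are examined.

```python
import math
import random

random.seed(1)  # fixed seed so the illustration is reproducible

# Under H0 every z-statistic is standard normal.  The LR favoring the
# best-supported alternative over H0 is exp(z**2 / 2), so the running
# maximum of these LRs is nondecreasing in the number of null tests.
zs = [random.gauss(0.0, 1.0) for _ in range(200)]

largest = 0.0
for k, z in enumerate(zs, start=1):
    largest = max(largest, math.exp(z * z / 2.0))
    if k in (1, 20, 200):
        print(f"largest LR after {k:3d} null tests: {largest:.1f}")
```

With 200 null tests the largest LR is typically substantial even though no effect exists anywhere, which is exactly why the interpretation of an LR, like that of a p value, must account for how many hypotheses were examined.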


REFERENCES

1. Birnbaum A: On the foundations of statistical inference. J Am Stat Assoc 57:269-306, 1962
2. Berger JO, Sellke T: Testing a point null hypothesis: the irreconcilability of P values and evidence. J Am Stat Assoc 82:112-139, 1987
3. Cox DR: Comment on the paper by J.O. Berger and M. Delampady. Stat Sci 2:335-336, 1987
4. Diamond GA, Forrester JS: Clinical trials and statistical verdicts: probable grounds for appeal. Ann Intern Med 98:385-394, 1983