Is the p-value a good measure of evidence? Asymptotic consistency criteria

Statistics and Probability Letters 82 (2012) 1116–1119

M. Grendár ∗

Department of Mathematics, FPV UMB, 974 01 Banská Bystrica, Slovakia
Institute of Mathematics and Computer Science of the Slovak Academy of Sciences (SAS), and UMB, Banská Bystrica, Slovakia
Institute of Measurement Sciences SAS, Bratislava, Slovakia

Article info

Article history:
Received 28 November 2011
Received in revised form 18 February 2012
Accepted 20 February 2012
Available online 3 March 2012

MSC: 62A01

Abstract

What are the criteria that a measure of statistical evidence should satisfy? It is argued that a measure of evidence should be consistent. Consistency is an asymptotic criterion: the probability that a hypothesis H is indeed not true, given that the measure of evidence in the data strongly testifies against H, should go to 1 as more and more data appear. The p-value is not consistent, while the ratio of likelihoods is. The same conclusions hold with respect to the unconditional consistency criterion.

© 2012 Elsevier B.V. All rights reserved.

Keywords: Statistical evidence; Consistency; Unconditional consistency; p-value; Ratio of likelihoods

1. Introduction

The p-value is commonly used as a measure of evidence in data $X_1^n$ against a hypothesis $H_1$: the smaller the p-value, the stronger the evidence against $H_1$ in the data. Recall that the p-value is the smallest level at which a test $T(X_1^n)$ rejects $H_1$. According to the typical calibration (Cox and Hinkley, 1974; Wasserman, 2004), a p-value smaller than 0.01 suggests very strong evidence against $H_1$.

Unlike the p-value, which measures evidence against a single hypothesis, the ratio of likelihoods¹ measures evidence in data for a simple hypothesis $H_1$, relative to a simple hypothesis $H_2$. For a parametric model $f_X(x \mid \theta)$, the ratio of likelihoods $r_{12} = f(X_1^n \mid H_1)/f(X_1^n \mid H_2)$ measures evidence for $H_1$ relative to $H_2$, in data $X_1^n$. A value of $r_{12}$ above a certain threshold $k > 1$ is taken as evidence in favor of $H_1$, and against $H_2$. Values of $k$ around 30 are suggested as a threshold above which the evidence is considered very strong (see Royall (1997) and Barnard et al. (1962)).

Statistics abounds with criteria for assessing the quality of estimators, tests, forecasting rules, and classification algorithms, but besides the likelihood principle discussions (see Berger and Wolpert (1988)), there seems to be almost nothing on what criteria a good measure of evidence should satisfy. Schervish, in a notable exception² (Schervish, 1996),



Correspondence to: Department of Mathematics, FPV UMB, 974 01 Banská Bystrica, Slovakia. E-mail address: [email protected].

¹ The likelihood ratio is used in Neyman–Pearson hypothesis testing. To distinguish the evidential use of the likelihood ratio from its use in decision making, the former is referred to as the ratio of likelihoods (RL). The RL has a rich history: see Fisher (1956), Barnard et al. (1962), Hacking (1965), Edwards (1992), Lindsey (1999a,b), and Royall (1997, 2000), among others.
² See also Section 3.2 in Edwards's monograph (Edwards, 1992), and a recent work (Bickel, in press).

0167-7152/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.spl.2012.02.018

considers a requirement of coherence, borrowed from multiple comparisons theory (Gabriel, 1969). If $H : \theta \in \Theta$ implies $H' : \theta \in \Theta'$ (i.e., $\Theta \subset \Theta'$), then a coherent measure of evidence gives at least as strong evidence to $H'$ as it gives to $H$. The p-value is not coherent; see Schervish (1996).

In this note, an asymptotic criterion of consistency is introduced, and it is demonstrated that the p-value is not consistent, while the ratio of likelihoods satisfies the consistency requirement. The same holds with respect to a related criterion of unconditional consistency, which is introduced as well.

2. Measure of evidence

To set a formal framework, let $X \in \mathbb{R}^K$ be a random variable with probability density function (pdf) (or probability mass function (pmf)) $f_X(x \mid \theta)$, parameterized by $\theta \in \Theta \subseteq \mathbb{R}^L$, and such that, if $\theta \neq \theta'$, then $f_X(\cdot \mid \theta) \neq f_X(\cdot \mid \theta')$. Let $\Theta_1, \Theta_2$ form a partition of $\Theta$, and associate $\Theta_j$ with the hypothesis $H_j$, $j = 1, 2$. Let $X_1^n \triangleq (X_1, \ldots, X_n) \sim f_X(x \mid \theta)$ be a random sample from $f_X(x \mid \theta)$.

A measure of evidence $\epsilon(H_1, H_2, X_1^n)$, in data $X_1^n$, for the hypothesis $H_1 : X_1^n \sim f_X(x \mid \theta)$, where $\theta \in \Theta_1$, relative to $H_2 : X_1^n \sim f_X(x \mid \theta)$, where $\theta \in \Theta_2$, is a mapping $\epsilon(H_1, H_2, X_1^n) : \Theta_1 \times \Theta_2 \times (\mathbb{R}^K)^n \to \mathbb{R}$. It usually comes with a calibration that partitions the values of $\epsilon(\cdot)$ into intervals, or categories. In what follows, our interest concentrates on the category $S$ of the most extreme values of the evidence measure $\epsilon(\cdot)$, which correspond to the strongest evidence. Finally, the measure of evidence against a hypothesis $H_1$, relative to $H_2$, in data $X_1^n$, will be denoted $\epsilon(\neg H_1, H_2, X_1^n)$.

3. Consistency requirements

In Sellke et al. (2001), the authors stress that in applications of an evidence measure, data sets may come from either $H_1$ or $H_2$. The authors illustrate this important point by an example of testing drugs $D_1, D_2, D_3, \ldots$, for an illness, in a series of independent experiments.
The measure of evidence applied to a data set from the $i$th experiment is used to differentiate between the hypothesis $H_1$ that the drug $D_i$ has a negligible effect, and the alternative $H_2$ that the drug $D_i$ has a non-negligible effect. Some drugs have negligible effects, some have non-negligible ones. In other words, some experimental data $X_1^n$ come from $H_1$, and other data sets come from $H_2$. This key aspect of applications of an evidence measure can be captured by the following two-level sampling mechanism.

1. First, $\theta$ is drawn from a pdf (or pmf) $p(\theta)$.
2. Given $\theta$, a random sample $X_1^n$ is drawn from $f_X(x \mid \theta)$.

As the sample size $n$ increases, it should hold, informally put, that among the data sets which, according to the measure of evidence, strongly testify against $H_1$, the relative number of those which in fact come from $H_1$ should go to zero. This motivates the following requirement of consistency.³

We say that a measure of evidence $\epsilon(\neg H_1, H_2, X_1^n)$ against $H_1$, relative to $H_2$, is consistent if

$$\lim_{n\to\infty} \Pr(H_1 \mid \epsilon(\neg H_1, H_2, X_1^n) \in S) = 0.$$

The probability that $\theta$ is in $\Theta_1$, given that the measure of evidence $\epsilon(\neg H_1, H_2, X_1^n)$ strongly testifies against $H_1$, relative to $H_2$, should go to zero as the sample size $n$ grows beyond any limit.

The above requirement can be viewed as a conditional consistency criterion. In addition, an unconditional 'frequentist' criterion of consistency can be introduced. We say that a measure of evidence is unconditionally consistent if

$$\lim_{n\to\infty} \Pr(X_1^n \sim f_X(x \mid \theta), \theta \in \Theta_1 : \epsilon(\neg H_1, H_2, X_1^n) \in S) = 0.$$

Informally put, for data $X_1^n$ that come from $H_1$, it should be asymptotically impossible for the measure of evidence to strongly testify against $H_1$. The two criteria are closely related: if a measure of evidence is unconditionally inconsistent, then it is also inconsistent.

4. Is the p-value consistent?

The p-value is $\pi \triangleq \inf\{\alpha : T(X_1^n) \in R_\alpha\}$, where $T$ is a test statistic, $\alpha$ is the size of the test, and $R_\alpha$ is the rejection region for $H_1$. In this section, it is assumed that $X$ is a continuous random variable and that the test statistic $T$ is such that it rejects $H_1$ when the observed value $t$ of $T$ is large. Then the p-value is $\pi = \sup_{\Theta_1} \Pr(T > t \mid \theta)$. The p-value $\pi(\neg H_1, \cdot, X_1^n)$, as a measure of evidence against $H_1$, does not take $H_2$ into account. Let $S = [0, \alpha_S)$ be the interval of values that indicate very strong evidence against $H_1$.

Before addressing the question of consistency of the p-value in general, consider an illustrative example of a Gaussian random variable $X$ with variance $\sigma^2 = 1$, and let $\Theta_1 = \{\theta_1\}$, $\Theta_2 = \{\theta_1 + \delta\}$, $\delta > 0$. Let $w = p(\Theta_1)$, $w \in (0, 1)$. Also, let

³ In Sellke et al. (2001), the authors use a Monte Carlo simulation to estimate the probability $\Pr(\Theta_1 \mid \pi(\neg H_1, \cdot, X_1^n) \approx 0.05)$, for a point set $\Theta_1$, in small samples, for the p-value, and relate it to the analogous probability for the Bayes factor, which in the studied setting is the same as the ratio of likelihoods. The authors do not propose an asymptotic criterion for a measure of evidence.

$T(X_1^n) = \sqrt{n}(\bar{x} - \theta_1)$ be the test statistic, and $R_\alpha = \{X_1^n : T(X_1^n) > z_{1-\alpha}\}$ be the rejection region, with $z_{1-\alpha}$ denoting the $1-\alpha$ quantile of the standard normal distribution. Under $H_1$, the p-value is a uniform random variable, so $\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_1) = \alpha_S$. Under $H_2$, the power of the test is $\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_2) = 1 - \Phi(z_{1-\alpha_S} - \sqrt{n}\,\delta)$, where $\Phi(\cdot)$ is the distribution function of the standard normal random variable. Note that $\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_2)$ converges to 1, for $\delta > 0$. Taken together, $\lim_{n\to\infty} \Pr(H_1 \mid \pi(\neg H_1, \cdot, X_1^n) \in S) = \frac{\alpha_S w}{1 - w(1 - \alpha_S)}$. Thus, in this simple example, the p-value is not a consistent measure of evidence against $H_1$.
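This example is easy to check by simulation. The sketch below follows the two-level sampling mechanism of Section 3; the constants $\theta_1 = 0$, $\delta = 1$, $w = 0.5$, $\alpha_S = 0.01$, and the sample size are illustrative choices, not fixed by the text:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative constants (not fixed in the text).
theta1, delta = 0.0, 1.0      # Theta_1 = {theta1}, Theta_2 = {theta1 + delta}
w, alpha_S = 0.5, 0.01        # prior weight of H1; "very strong evidence" cutoff
n, reps = 200, 200_000

# Two-level sampling: first draw theta (H1 with probability w),
# then summarize a sample X_1^n by its sufficient statistic, the mean.
from_H1 = rng.random(reps) < w
theta = np.where(from_H1, theta1, theta1 + delta)
xbar = rng.normal(theta, 1.0 / np.sqrt(n))
T = np.sqrt(n) * (xbar - theta1)                   # test statistic
pval = 1.0 - norm.cdf(T)                           # p-value against H1

strong = pval < alpha_S                            # p-value lands in S
est = from_H1[strong].mean()                       # estimate of Pr(H1 | pi in S)
limit = alpha_S * w / (1.0 - w * (1.0 - alpha_S))  # limit from Proposition 1
print(est, limit)                                  # both close to 0.0099
```

For this $n$ the power under $H_2$ is already essentially 1, so the estimated conditional probability sits near the positive limit rather than near zero.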

Following the reasoning in the above example, it can be demonstrated that the p-value is inconsistent.⁴

Proposition 1. Let $\Theta_1, \Theta_2$ form a partition of $\Theta$. Let $p(\theta)$ be such that $w \triangleq \int_{\Theta_1} p(\theta)\,d\theta \in (0, 1)$. Also, let $T$ and $R_\alpha$ be such that $\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \Theta_2) \to 1$ as $n \to \infty$ (i.e., for $\theta \in \Theta_2$, the power of the test statistic $T$ converges to 1). Then it holds that

$$\lim_{n\to\infty} \Pr(H_1 \mid \pi(\neg H_1, \cdot, X_1^n) \in S) = \frac{\alpha_S w}{1 - w(1 - \alpha_S)}. \qquad (1)$$

Proof. Under $H_1$, the p-value is uniformly distributed, so $\Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \theta) = \alpha_S$, for $\theta \in \Theta_1$. Thus, $\int_{\Theta_1} \Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \theta)\,p(\theta)\,d\theta = \alpha_S w$. Next, under the assumption that the power of the test statistic $T$ goes to 1 as $n \to \infty$, the probability $\int_{\Theta_2} \Pr(\pi(\neg H_1, \cdot, X_1^n) \in S \mid \theta)\,p(\theta)\,d\theta \to 1 - w$. Taken together, these two facts prove the proposition. □
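The limit in (1) is easy to evaluate numerically; the short computation below, a simple sketch of the formula rather than anything from the paper itself, reproduces the values discussed next:

```python
def limit_prob(alpha_S: float, w: float) -> float:
    """Limit value of Pr(H1 | p-value in S) from Eq. (1)."""
    return alpha_S * w / (1.0 - w * (1.0 - alpha_S))

alpha_S = 0.01
for w in (0.5, 0.9, 0.999):
    print(w, round(limit_prob(alpha_S, w), 4))
# w = 0.5 gives alpha_S / (1 + alpha_S) ~ 0.0099; w = 0.9 gives ~ 0.0826;
# w = 0.999 gives ~ 0.9090, and the limit tends to 1 as w -> 1.
```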

Since the right-hand side of (1) is positive, the p-value is not a consistent measure of evidence. The limit of the probability becomes zero only in the extreme, uninteresting case of $w = 0$, i.e., when no $X_1^n$ come from $H_1$. For the typical value $\alpha_S = 0.01$ and $w = 0.5$, the limit value of the probability is $\alpha_S/(1 + \alpha_S) = 0.0099$. For $w = 0.9$, the probability is 0.0826. For $w = 0.999$, the probability is 0.9090, and it converges to 1 as $w \to 1$. The greater the relative presence of data sets from $H_1$, the higher the asymptotic probability that the data come from $H_1$ when the p-value strongly testifies against $H_1$.

Since $\lim_{n\to\infty} \Pr(X_1^n \sim f_X(x \mid \theta), \theta \in \Theta_1 : \pi(\neg H_1, \cdot, X_1^n) \in S) = \alpha_S$, the p-value is inconsistent also in the unconditional sense.

5. Is the ratio of likelihoods consistent?

For point sets $\Theta_1, \Theta_2$, the ratio of likelihoods $r_{12}$ of $H_1$ relative to $H_2$ is $r_{12} \triangleq f_1/f_2$, where $f_j \triangleq f_{X_1^n}(x_1^n \mid \Theta_j)$, for $j = 1, 2$. The ratio $r_{12}$ measures the evidence in favor of $H_1$ (and against $H_2$), in data $X_1^n$. The larger $r_{12}$, the stronger the evidence in favor of $H_1$ (and against $H_2$), so $S = [k_S, \infty)$, $k_S > 1$.

First, consider the ratio of likelihoods $r_{21}$ in the example given above. Clearly, $\Pr(r_{21}(\neg H_1, H_2, X_1^n) \in S \mid \Theta_1) = 1 - \Phi(\log k_S/(\delta\sqrt{n}) + \sqrt{n}\,\delta/2)$, which, under the assumption that $\delta > 0$, converges to 0 as $n \to \infty$. Also, $\Pr(r_{21}(\neg H_1, H_2, X_1^n) \in S \mid \Theta_2) = 1 - \Phi(\log k_S/(\delta\sqrt{n}) - \sqrt{n}\,\delta/2)$, which, under the assumption that $\delta > 0$, converges to 1 as $n \to \infty$. Thus, $\lim_{n\to\infty} \Pr(H_1 \mid r_{21}(\neg H_1, H_2, X_1^n) \in S) = 0$. Hence, the ratio of likelihoods is a consistent measure of evidence in this example.

This consistency is not accidental, as stated in the following proposition.

Proposition 2. For point sets $\Theta_1, \Theta_2$, and $p(\Theta_1) \in (0, 1)$, the ratio of likelihoods $r_{21}(\neg H_1, H_2, X_1^n)$ is a consistent measure of evidence, i.e.,

$$\lim_{n\to\infty} \Pr(H_1 \mid r_{21}(\neg H_1, H_2, X_1^n) \in S) = 0.$$

Proof. First, $\Pr(r_{21}(\cdot) \in S \mid \Theta_1) < \Pr(\log r_{21}(\cdot) > 0 \mid \Theta_1)$, and, by the law of large numbers (LLN), $\frac{1}{n}\log r_{21}(\cdot) \mid \Theta_1$ converges in probability to $\int f_1 \log(f_2/f_1)$, the negative of the Kullback–Leibler divergence, which is negative. Thus, $\lim_{n\to\infty} \Pr(\log r_{21}(\cdot) > 0 \mid \Theta_1) = 0$. In a similar vein, the LLN applied to $\frac{1}{n}\log r_{21} \mid \Theta_2$ implies that $\lim_{n\to\infty} \Pr(r_{21}(\cdot) \in S \mid \Theta_2) = 1$. The claim then follows from the Slutsky theorem. □

The proof implies that the ratio of likelihoods is also consistent in the unconditional sense.

Recently, Bickel (in press) proposed an extension of the ratio of likelihoods (see also Lavine and Schervish (1999) and Zhang (2009)) to the case of general $\Theta_1, \Theta_2$: $r^e_{12} \triangleq \sup_{\Theta_1} f(X_1^n \mid \theta)/\sup_{\Theta_2} f(X_1^n \mid \theta)$, and suggested its use as a measure of evidence. The extended ratio of likelihoods reduces to the ratio of likelihoods when $\Theta_1$ and $\Theta_2$ are point sets. Under additional assumptions, $r^e_{21}$ is a consistent measure of evidence. Before stating the result, recall that the maximum likelihood (ML) estimator $\hat\theta(\tilde\Theta)$ of $\theta$, restricted to $\tilde\Theta \subset \Theta$, is $\hat\theta(\tilde\Theta) \triangleq \arg\sup_{\theta\in\tilde\Theta} f_{X_1^n}(x_1^n \mid \theta)$.
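The closed-form probabilities in the Gaussian example of Section 5 can be tabulated directly. The sketch below (with the illustrative choices $\delta = 1$, $k_S = 32$, $w = 0.5$, none of which are fixed by the text) shows $\Pr(r_{21} \in S \mid \Theta_1)$ shrinking to 0, the power $\Pr(r_{21} \in S \mid \Theta_2)$ rising to 1, and hence $\Pr(H_1 \mid r_{21} \in S)$ vanishing:

```python
import numpy as np
from scipy.stats import norm

# Illustrative constants; Theta_1 = {theta1}, Theta_2 = {theta1 + delta}.
delta, k_S, w = 1.0, 32.0, 0.5

for n in (10, 50, 200, 1000):
    s = np.sqrt(n)
    # Pr(r21 in S | Theta_1) = 1 - Phi(log k_S/(delta*sqrt(n)) + sqrt(n)*delta/2)
    p1 = 1.0 - norm.cdf(np.log(k_S) / (delta * s) + s * delta / 2.0)
    # Pr(r21 in S | Theta_2) = 1 - Phi(log k_S/(delta*sqrt(n)) - sqrt(n)*delta/2)
    p2 = 1.0 - norm.cdf(np.log(k_S) / (delta * s) - s * delta / 2.0)
    post = w * p1 / (w * p1 + (1.0 - w) * p2)   # Pr(H1 | r21 in S) by Bayes' rule
    print(n, p1, p2, post)
# p1 and post shrink toward 0 while p2 rises toward 1 as n grows.
```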

⁴ Proposition 1 holds also for a p-value that is valid in the sense of Mudholkar and Chaubey (2009).

Proposition 3. Let $f_X(x \mid \theta)$ and $\Theta_1, \Theta_2$ be such that the maximum likelihood estimators $\hat\theta_j(\Theta_j)$, restricted to $\Theta_j$, are consistent estimators of $\theta$, $j = 1, 2$. And let the maximum likelihood estimators $\hat\theta_j(\Theta_i)$, restricted to $\Theta_i$, converge in probability to some finite $\bar\theta_j$, $i, j \in \{1, 2\}$, $i \neq j$. Let $p(\theta)$ be such that $\int_{\Theta_1} p(\theta)\,d\theta \in (0, 1)$. Then the extended ratio of likelihoods $r^e_{21}(\neg H_1, H_2, X_1^n)$ is a consistent measure of evidence against $H_1$, relative to $H_2$.

Proof. Under the assumed consistency and convergence of the constrained ML estimators, the claim follows from the LLN and the positivity of the Kullback–Leibler divergence between two different distributions, applied to the probability $\Pr(r^e_{21} > k_S \mid \theta)$ in the upper bound

$$\frac{\sup_{\Theta_1} \Pr(r^e_{21} > k_S \mid \theta) \int_{\Theta_1} p(\theta)\,d\theta}{\int_{\Theta_2} \Pr(r^e_{21} > k_S \mid \theta)\,p(\theta)\,d\theta}$$

and the lower bound $\inf_{\Theta_1} \Pr(r^e_{21} > k_S \mid \theta) \int_{\Theta_1} p(\theta)\,d\theta$ of $\Pr(\theta \in \Theta_1 \mid r^e_{21}(\neg H_1, H_2, X_1^n) \in S)$. □
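A small simulation can illustrate the behavior described by Proposition 3. The sketch below is an illustrative setup not taken from the paper: a Gaussian model with $\Theta_1 = (-\infty, 0]$, $\Theta_2 = (0, \infty)$, true $\theta = 0.5$, and $k_S = 32$. It computes the extended ratio of likelihoods via the constrained ML estimators and checks that, for data from $H_2$, the evidence against $H_1$ eventually becomes very strong:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: X ~ N(theta, 1), Theta_1 = (-inf, 0], Theta_2 = (0, inf).
theta_true, k_S = 0.5, 32.0      # the data come from H2 here

def log_r21_ext(x: np.ndarray) -> float:
    """Log of the extended ratio of likelihoods sup_{Theta_2} f / sup_{Theta_1} f."""
    n, xbar = len(x), x.mean()
    th1 = min(xbar, 0.0)         # constrained ML estimator over Theta_1
    th2 = max(xbar, 0.0)         # constrained ML estimator over Theta_2
    # Gaussian log-likelihood depends on theta only through -n*(xbar - theta)^2/2.
    return n * ((xbar - th1) ** 2 - (xbar - th2) ** 2) / 2.0

for n in (10, 50, 200):
    hits = sum(log_r21_ext(rng.normal(theta_true, 1.0, n)) > np.log(k_S)
               for _ in range(2000))
    print(n, hits / 2000)        # fraction with very strong evidence against H1
# the fraction rises toward 1 as n grows
```

The constrained ML estimators here have the closed forms $\min(\bar{x}, 0)$ and $\max(\bar{x}, 0)$; in a general model they would be obtained by constrained numerical optimization.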

6. Is the Bayes factor consistent?

It is open to debate whether a measure of evidence can depend on prior information. Bayesians usually measure evidence for $H_1$ relative to $H_2$ by the Bayes factor $b_{12} \triangleq \int_{\Theta_1} f(X_1^n \mid \theta)\,q(\theta)\,d\theta / \int_{\Theta_2} f(X_1^n \mid \theta)\,q(\theta)\,d\theta$, where $q(\cdot)$ is the prior distribution. A Bayes factor above 150 is usually considered (see Kass and Raftery (1995)) very strong evidence for $H_1$. However, Lavine and Schervish (1999) note that the Bayes factor does not satisfy the coherence requirement, while the posterior odds ratio is coherent.

Both the Bayes factor $b_{21}$ and the posterior odds $p_{21}(H_2, H_1, X_1^n) \triangleq b_{21}\,q(\Theta_2)/q(\Theta_1)$ are consistent measures of evidence against $H_1$, relative to $H_2$. Also, in analogy with Proposition 3, consistency of the ratio of posterior modes can be established.

7. Conclusions

There are several measures of statistical evidence in use. Among them are the Fisherian p-value and its extensions, likelihood-based measures, such as the ratio of likelihoods and the extended ratio of likelihoods, as well as the Bayes factor and the posterior odds. What are the criteria that a measure of evidence should satisfy? Coherence (see Section 1) is one such criterion; it is a logical criterion. In this note, the asymptotic criterion of consistency was introduced, together with the related criterion of unconditional consistency. Besides being incoherent, the p-value is also inconsistent and unconditionally inconsistent. The ratio of likelihoods and its extension are consistent, unconditionally consistent, and coherent measures. Among the Bayesian measures of evidence, the posterior odds ratio is both coherent and consistent.

Acknowledgments

Valuable feedback from George Judge, Lukáš Lafférs, Lenka Mackovičová, Ján Mačutek, Jana Majerová, Andrej Pázman, František Rublík, Vladimír Špitalský, František Štulajter, and Viktor Witkovský is gratefully acknowledged. Thanks are extended to an anonymous reviewer for her/his constructive suggestions.
Supported by VEGA grants 1/0077/09 and 2/0038/12.

References

Barnard, G.A., Jenkins, G.M., Winsten, C.B., 1962. Likelihood inference and time series (with discussion). J. Roy. Statist. Soc. Ser. A 125, 321–372.
Berger, J.O., Wolpert, R.L., 1988. The Likelihood Principle: A Review, Generalizations and Statistical Implications. Inst. Math. Stat., Hayward.
Bickel, D.R., 2008. The strength of statistical evidence for composite hypotheses with an application to multiple comparisons. COBRA preprint series, paper 49. Statist. Sinica (in press).
Cox, D.R., Hinkley, D.V., 1974. Theoretical Statistics. Chapman & Hall, London.
Edwards, A.W.F., 1992. Likelihood (Expanded Edition). Johns Hopkins University Press, Baltimore.
Fisher, R.A., 1956. Statistical Methods and Scientific Inference. Oliver and Boyd, Edinburgh.
Gabriel, K.R., 1969. Simultaneous test procedures—some theory of multiple comparisons. Ann. Math. Statist. 40, 224–250.
Hacking, I., 1965. Logic of Statistical Inference. Cambridge University Press, New York.
Kass, R., Raftery, A., 1995. Bayes factors. J. Amer. Statist. Assoc. 90, 773–795.
Lavine, M., Schervish, M.J., 1999. Bayes factors: what they are and what they are not. Amer. Statist. 53 (2), 119–122.
Lindsey, J.K., 1999a. Some statistical heresies (with discussion). Statistician 48 (1), 1–40.
Lindsey, J.K., 1999b. Relationships among sample size, model selection and likelihood regions, and scientifically important differences. Statistician 48 (3), 401–411.
Mudholkar, G.S., Chaubey, Y.P., 2009. On defining p-values. Statist. Probab. Lett. 79, 1963–1971.
Royall, R., 1997. Statistical Evidence (A Likelihood Paradigm). Chapman & Hall, London.
Royall, R., 2000. On the probability of observing misleading statistical evidence (with discussion). J. Amer. Statist. Assoc. 95 (451), 760–768.
Schervish, M., 1996. P values: what they are and what they are not. Amer. Statist. 50, 203–206.
Sellke, T., Bayarri, M.J., Berger, J.O., 2001. Calibration of p values for testing precise null hypotheses. Amer. Statist. 55 (1), 62–71.
Wasserman, L., 2004. All of Statistics (A Concise Course in Statistical Inference). Springer-Verlag, New York.
Zhang, Z., 2009. A law of likelihood for composite hypotheses. Available online at http://arxiv.org/abs/0901.0463.