Journal of Statistical Planning and Inference 137 (2007) 43–56
Bayes factors for goodness of fit testing

Fulvio Spezzaferri (a), Isabella Verdinelli (a,b,*), Massimo Zeppieri (a)

(a) Università di Roma, La Sapienza, Italy
(b) Carnegie Mellon University, USA

Received 26 May 2004; accepted 13 September 2005. Available online 9 November 2005.
Abstract

We propose the use of the generalized fractional Bayes factor for testing fit in multinomial models. This is a non-asymptotic method that can be used to quantify the evidence for or against a sub-model. We give expressions for the generalized fractional Bayes factor and study its properties. In particular, we show that the generalized fractional Bayes factor has better properties than the fractional Bayes factor.

Keywords: Generalized fractional Bayes factor; Dirichlet process; Beta-Stacy process
1. Introduction

In this paper we propose a Bayesian method for testing fit in multinomial models. Specifically, we use the Bayes factor to evaluate the evidence for or against a null sub-model of the multinomial. The advantages of a Bayesian approach for this problem are that it does not rely on large sample asymptotics and that it provides a measure of evidence in favor of the null model, in contrast to p-values, which are usually regarded only as measures of evidence against the null. The main disadvantage of the Bayesian method is that it requires specifying a prior distribution.

When a defensible, subjective prior is available, the Bayesian approach is straightforward: we simply compute the Bayes factor, the posterior odds in favor of the null model divided by the prior odds. When a subjective prior is not available, one possibility is to use an objective Bayesian procedure. These include the work on intrinsic Bayes factors and on Bayes factors from intrinsic priors, introduced by Berger and Pericchi (1996a,b) and examined by several authors (for example, O'Hagan, 1997; Moreno et al., 1998), and the work on the fractional Bayes factor (O'Hagan, 1991, 1995, 1997) and the generalized fractional Bayes factor (De Santis and Spezzaferri, 1997, 2001). The intrinsic Bayes factor has been shown to have good properties, but it is computationally intensive and, in the case of discrete data, the presence of empty cells might pose some difficulties. We thus focus on the fractional Bayes factor and the generalized fractional Bayes factor.

The main goals of this paper are (i) to give closed form expressions for the generalized fractional Bayes factor and (ii) to show that the generalized fractional Bayes factor has better statistical behavior than the fractional Bayes factor.

Goodness of fit testing in the Bayesian framework for continuous data has been examined by a number of authors (for example, Verdinelli and Wasserman, 1998; Berger and Guglielmi, 2001; Robert and Rousseau, 2003). For discrete data,
the paper by Albert (1997) illustrates procedures for testing and estimating association structures in contingency tables. A review of Bayesian and non-Bayesian alternatives to the $\chi^2$ method is presented in Conigliani et al. (2000). Our work concentrates on discrete data and, to a large degree, builds on and improves on Conigliani, Castro and O'Hagan (CCO, 2000). The present paper also highlights ways of using hierarchical prior distributions based on the Dirichlet process, as well as on the Beta-Stacy process considered in Walker and Muliere (1997).

In Section 2 we briefly illustrate the CCO procedure for goodness of fit testing, based on the fractional Bayes factor and on a Dirichlet distribution used to model a two-stage hierarchical prior on the alternative model. We also introduce the generalized fractional Bayes factor and present examples and general results showing that it has good statistical properties. In Section 3 we consider a hierarchical prior based on the Dirichlet model and obtain an explicit expression for the generalized fractional Bayes factor. Examples are also considered for comparing the two methods presented in the paper. In Section 4, an extension using a Beta-Stacy process (Walker and Muliere, 1997) to model the hierarchical prior is considered. Section 5 contains a brief discussion.
2. Goodness of fit testing: Bayesian alternatives to the $\chi^2$

2.1. Introduction

Let $x = (x_1, x_2, \ldots, x_n)$ be a sample of $n$ iid observations, each classified into one of $k + 1$ groups $G_j$ ($j = 0, \ldots, k$), and let $r = (r_0, \ldots, r_k)$ be the vector whose entries are the numbers of $x_i$'s falling in each group. CCO (2000) presented the following Bayesian approach to goodness of fit testing. Let $\mathrm{MN}(\pi)$ denote a multinomial distribution with parameters $\pi = (\pi_0, \ldots, \pi_k)$, $\sum_j \pi_j = 1$, and let $\Pr(x_s \in G_j) = \pi_j$, $s = 1, \ldots, n$, so that $r \sim \mathrm{MN}(\pi)$. They assumed that under $M_1$ the $\pi_j$'s belong to a specific parametric family $F = \{\pi_j(\theta),\ \theta \in \Theta\}$, characterized by $\pi_j(\theta)$ depending on a parameter $\theta$, while under $M_2$ no assumptions on the form of the $\pi_j$'s were made. The alternative model $M_2$ is thus equivalent to the alternative hypothesis implicitly used in the $\chi^2$ method, where the data can be generated by any probability distribution. The Bayesian hypothesis testing procedure selects a prior for the parameters of both models. The resulting procedure, which depends on the assumptions made, can be seen as a Bayesian alternative to the $\chi^2$ goodness of fit test. The two models that CCO compared are

$$M_1:\ P(x_s \in G_j \mid \theta) = \pi_j(\theta), \qquad M_2:\ P(x_s \in G_j \mid \pi) = \pi_j. \qquad (1)$$
The parameters $\pi_j(\theta)$ in $M_1$ depend on $\theta$, and CCO considered $\theta$ to be distributed according to a non-informative prior $\pi_1(\theta)$. They also assumed that, under $M_2$, the vector $\pi$ had a Dirichlet prior distribution with parameters $(c_0, \ldots, c_k)$ and density

$$\pi_2(\pi \mid c) = \frac{\Gamma(c)}{\prod_{j=0}^{k} \Gamma(c_j)} \prod_{j=0}^{k} \pi_j^{c_j - 1}, \qquad (2)$$
where $c = \sum_j c_j$ ($c_j > 0$; $j = 0, \ldots, k$). The assumption $E[\pi_j] = \pi_j(\theta)$ was also made, to express the idea that the alternative model $M_2$ is centered on the null model $M_1$; since $E[\pi_j] = c_j / c$, it follows that $c_j = c\,\pi_j(\theta)$. As they did for model $M_1$, CCO considered a non-informative prior distribution $\pi_2(\theta)$ for $\theta$, while they assumed $c$ to be a fixed constant.

2.2. Bayes factors

The Bayes factor (Jeffreys, 1961) is used in Bayesian hypothesis testing and model comparison. Assume that $f_i(x \mid \theta_i)$, $i = 1, 2$, are the likelihoods for $x$ under two competing models $M_1$ and $M_2$, with parameters $\theta_i$, and let $\pi_i(\theta_i)$ be their prior distributions.
The Bayes factor for $M_2$ against $M_1$,

$$B_{21} = \frac{\int f_2(x \mid \theta_2)\,\pi_2(\theta_2)\,d\theta_2}{\int f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1}, \qquad (3)$$
represents the weight of evidence from the data in favor of $M_2$ against $M_1$. The Bayes factor (3) is indeterminate when non-informative improper prior distributions are used, as in the assumptions made by CCO for the models of Section 2.1. A number of authors (Aitkin, 1991; O'Hagan, 1995; Berger and Pericchi, 1996a,b; De Santis and Spezzaferri, 1999) have suggested the use of the partial Bayes factor when weak prior information is represented by improper distributions. The partial Bayes factor is defined by

$$B_{21}^{P}(x(\ell)) = \frac{\int f_2(x \mid \theta_2)\,\pi_2(\theta_2)\,d\theta_2}{\int f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1} \cdot \frac{\int f_1(x(\ell) \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int f_2(x(\ell) \mid \theta_2)\,\pi_2(\theta_2)\,d\theta_2}, \qquad (4)$$

where a subset $x(\ell)$, $0 < \ell < n$, of the data is used as a training sample for updating the priors into proper posterior distributions, and the models are compared using the remainder of the data. To avoid arbitrariness in the choice of a specific training sample $x(\ell)$, O'Hagan (1995) proposed using the whole likelihood raised to the power $b = \ell/n$ for training. This choice was motivated by the approximation

$$f_i(x(\ell) \mid \theta_i) \approx [f_i(x \mid \theta_i)]^{b}, \qquad (5)$$
which is valid for iid data when $n$ and $\ell$ are large. O'Hagan defined the fractional Bayes factor as

$$B_{21}^{F}(x, b) = \frac{q_2(x, b)}{q_1(x, b)}, \qquad (6)$$

where

$$q_i(x, b) = \frac{\int f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\int [f_i(x \mid \theta_i)]^{b}\,\pi_i(\theta_i)\,d\theta_i}. \qquad (7)$$
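Before specializing to the multinomial setting, it may help to see (6) and (7) in computational form. The sketch below is a minimal numerical evaluation of $B_{21}^{F}$ for the models of Section 2.1, assuming a binomial null family, the improper prior $\pi_1(\theta) = \pi_2(\theta) \propto \theta^{-1}(1-\theta)^{-1}$ used later in Section 3.3, a fixed Dirichlet mass $c$, and made-up counts; it anticipates the closed-form integrands given in (8) below.

```python
# Numerical sketch of (6)-(7) for the models of Section 2.1 (binomial null,
# Dirichlet alternative with c_j(theta) = c * pi_j(theta)); all inputs are
# illustrative. Integrals over theta are done by quadrature.
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln, comb

r = np.array([2, 5, 9, 5, 2])      # hypothetical counts over k+1 groups
k, n = len(r) - 1, int(r.sum())
b = 1.0 / n                        # b = ell/n with minimal ell = 1
c = 20.0                           # Dirichlet total mass, fixed as in CCO

def log_pi(theta):                 # log pi_j(theta) for the binomial family
    j = np.arange(k + 1)
    return (np.log(comb(k, j)) + j * np.log(theta)
            + (k - j) * np.log(1.0 - theta))

prior = lambda t: 1.0 / (t * (1.0 - t))    # improper prior on theta

def q1(b):                         # q_1(x, b) of (7): multinomial likelihood
    num = quad(lambda t: np.exp(np.dot(r, log_pi(t))) * prior(t), 0, 1)[0]
    den = quad(lambda t: np.exp(b * np.dot(r, log_pi(t))) * prior(t), 0, 1)[0]
    return num / den

def q2(b):                         # q_2(x, b) of (7): Dirichlet-multinomial
    def g(t, counts):
        cj = c * np.exp(log_pi(t))
        return np.exp(np.sum(gammaln(counts + cj) - gammaln(cj))) * prior(t)
    num = quad(lambda t: g(t, r), 0, 1)[0]
    den = quad(lambda t: g(t, b * r), 0, 1)[0]
    return np.exp(gammaln(b * n + c) - gammaln(n + c)) * num / den

print(q2(b) / q1(b))               # fractional Bayes factor (6)
```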
A non-asymptotic justification for using (6) was given in De Santis and Spezzaferri (1997). A different method for eliminating the influence of training samples on the partial Bayes factor was suggested by Berger and Pericchi (1996a) and led to the intrinsic Bayes factor approach. For the models in (1), under the prior assumptions of Section 2.1, CCO (2000) obtained the following expressions for the $q_i$ in (7):

$$q_1(x, b) = \frac{\int \prod_{j=0}^{k} [\pi_j(\theta)]^{r_j}\,\pi_1(\theta)\,d\theta}{\int \prod_{j=0}^{k} [\pi_j(\theta)]^{b r_j}\,\pi_1(\theta)\,d\theta},$$

$$q_2(x, b) = \frac{\Gamma(bn + c) \int \prod_{j=0}^{k} \{\Gamma(r_j + c_j(\theta)) / \Gamma(c_j(\theta))\}\,\pi_2(\theta)\,d\theta}{\Gamma(n + c) \int \prod_{j=0}^{k} \{\Gamma(b r_j + c_j(\theta)) / \Gamma(c_j(\theta))\}\,\pi_2(\theta)\,d\theta}, \qquad (8)$$

and they used the fractional Bayes factor (6) for Bayesian goodness of fit testing. In CCO's paper the expression of $q_2$ in (8), and consequently the value of the fractional Bayes factor, depends on the value chosen for the Dirichlet parameter $c$. This left some ambiguity in the interpretation of the goodness of fit test proposed by these authors.

2.3. The generalized fractional Bayes factor

We examine here a proposal from De Santis and Spezzaferri (1997, 2001) that leads to a variation of the alternative Bayes factors used in model comparison when the prior information is weak. Like the fractional and the intrinsic Bayes factors, the method avoids arbitrary choices of specific training samples. This method is a generalization of the fractional Bayes factor, and we will show that there are cases of model comparison in the Bayesian framework where it outperforms the fractional Bayes factor.
De Santis and Spezzaferri (1997, 2001) suggested replacing the likelihood $f_i(x(\ell) \mid \theta_i)$ with the geometric training likelihood

$$L_i(\theta_i) = \Bigl[\prod_{x(\ell) \in X} f_i(x(\ell) \mid \theta_i)\Bigr]^{1/L}, \qquad (9)$$

where $X = \{x_1(\ell), \ldots, x_h(\ell), \ldots, x_L(\ell)\}$ is the set of all possible training samples of size $\ell$ and $L$ is the total number of samples considered. When the observations are iid, the geometric training likelihood $L_i(\theta_i)$ reduces to $[f_i(x \mid \theta_i)]^{b}$, $b = \ell/n$ (De Santis and Spezzaferri, 2001), that is, the fractional likelihood in (5) used for defining the fractional Bayes factor. But when the data are dependent, not identically distributed, or both, the geometric training likelihood (9) can be used to define the generalized fractional Bayes factor

$$B_{21}^{GF} = \frac{q_2^{G}(x, \ell)}{q_1^{G}(x, \ell)}, \qquad (10)$$

where

$$q_i^{G}(x, \ell) = \frac{\int f_i(x \mid \theta_i)\,\pi_i(\theta_i)\,d\theta_i}{\int L_i(\theta_i)\,\pi_i(\theta_i)\,d\theta_i}.$$
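For iid data the reduction of (9) to the fractional likelihood can be verified directly: each observation belongs to a fraction $\ell/n$ of the $\binom{n}{\ell}$ training samples, so the geometric mean reproduces $[f(x \mid \theta)]^{\ell/n}$ exactly. A minimal sketch, assuming a Bernoulli model and made-up data:

```python
# Check that the geometric training likelihood (9) equals the fractional
# likelihood (5) for iid data; the Bernoulli model and sample are assumptions.
import numpy as np
from itertools import combinations

x = np.array([1, 0, 0, 1, 1, 0, 1, 1])       # hypothetical iid Bernoulli data
n, ell, theta = len(x), 3, 0.6

def lik(sample, theta):                       # Bernoulli likelihood
    s = np.sum(sample)
    return theta**s * (1.0 - theta)**(len(sample) - s)

subsets = list(combinations(range(n), ell))   # all training samples of size ell
L = len(subsets)
geo = np.prod([lik(x[list(s)], theta) for s in subsets]) ** (1.0 / L)

print(geo, lik(x, theta) ** (ell / n))        # the two values coincide
```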
De Santis and Spezzaferri (2001) showed that, when comparing nested linear models with independent but not identically distributed data, the generalized fractional Bayes factor always retains consistency, while the fractional Bayes factor may not. Lack of consistency in a model choice criterion means that, as the sample size increases, the probability of choosing the correct model does not approach 1.

2.4. Advantages of the generalized fractional Bayes factor

This section presents some examples of model comparison where the generalized fractional Bayes factor $B_{21}^{GF}$ behaves more reasonably than the fractional Bayes factor $B_{21}^{F}$.
Example 1. Consider the two equicorrelation models:

$$M_1:\ x \sim N(\mathbf{0},\ (1 - \rho) I + \rho J), \qquad M_2:\ x \mid \theta \sim N(\theta \mathbf{1},\ (1 - \rho) I + \rho J),$$

where the value of the correlation $\rho$ is assumed known and the only unknown parameter is the mean $\theta$ in $M_2$. The vector $x$ consists of $n$ observations, and $\mathbf{0}$, $\mathbf{1}$, $I$, $J$ are, respectively, vectors of $n$ zeros and $n$ ones, the $n \times n$ identity matrix, and an $n \times n$ matrix of ones. Let the prior distribution for $\theta$ be improper, $\pi(\theta) \propto \text{constant}$, and consider the fractional and generalized fractional Bayes factors for $M_2$ against $M_1$, with $\ell = 1$, the size of the minimal training sample:

$$B_{21}^{F} = \frac{1}{\sqrt{n}} \exp\left\{\frac{(n - 1)\,\bar{x}^2}{2[1 + (n - 1)\rho]}\right\},$$

$$B_{21}^{GF} = \frac{\sqrt{1 + (n - 1)\rho}}{\sqrt{n}} \exp\left\{\frac{(n - 1)(1 - \rho)\,\bar{x}^2}{2[1 + (n - 1)\rho]}\right\}.$$

When $\rho = 0$, $B_{21}^{F}$ and $B_{21}^{GF}$ coincide, since the observations are independent and $L_i(\theta_i) = [f_i(x \mid \theta_i)]^{b}$. When, instead, $\rho \neq 0$, the data are strongly dependent. No Bayes factor is a consistent model selector here, since $\bar{x}$ is not even a consistent estimator of $\theta$. However, $B_{21}^{F} \overset{p}{\to} 0$ whether $M_1$ or $M_2$ is true. Thus, for large $n$, whatever the evidence from the data, the fractional Bayes factor will always choose the null model $M_1$. The generalized fractional Bayes factor, instead, is asymptotically equivalent to $\exp\{(2\rho)^{-1}(1 - \rho)\bar{x}^2\}$, and it allows one to make a reasonable choice between $M_1$ and $M_2$ based on the distance of $\bar{x}$ from 0. It can be shown that the behavior of the fractional Bayes factor is due to the fact that, under $M_2$, the fraction of the likelihood that is supposed to provide information about $\theta$ fails to depend on $\theta$ when $n$ is large enough.
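A few lines of code make the contrast visible. The sketch below evaluates the two closed forms above (as reconstructed here) on a single draw of $\bar{x}$ under $M_2$, using the fact that $\bar{x} \sim N(\theta, (1 - \rho)/n + \rho)$ in this model; the values of $\rho$ and $\theta$ are illustrative assumptions.

```python
# Under M2, B^F_21 still drifts to 0 as n grows, while B^GF_21 stabilizes.
import numpy as np

rng = np.random.default_rng(0)
rho, theta = 0.5, 1.0                          # known correlation, true mean

for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(theta, np.sqrt((1 - rho) / n + rho))  # law of x-bar under M2
    v = 1 + (n - 1) * rho
    bf_f = np.exp((n - 1) * xbar**2 / (2 * v)) / np.sqrt(n)
    bf_gf = np.sqrt(v / n) * np.exp((n - 1) * (1 - rho) * xbar**2 / (2 * v))
    print(n, bf_f, bf_gf)
```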
Example 2. This example shows another case of model comparison with dependent data, where the generalized fractional Bayes factor performs more reasonably than the fractional Bayes factor. Consider the following autoregressive models of order 1:

$$M_1:\ x \sim N\Bigl(\mathbf{0},\ \frac{1}{1 - \gamma^2}\,W\Bigr), \qquad M_2:\ x \mid \theta \sim N\Bigl(\theta \mathbf{1},\ \frac{1}{1 - \gamma^2}\,W\Bigr),$$

where the matrix $W = \{w_{ss'};\ s, s' = 1, \ldots, n\}$ has elements $w_{ss'} = \gamma^{|s - s'|}$, $\gamma$ is a known constant and $\theta$ is unknown. Let the prior distribution for $\theta$ be $\pi(\theta) \propto \text{constant}$. Setting $\ell = 1$, the minimal training sample size, the two fractional Bayes factors for $M_2$ against $M_1$ are

$$B_{21}^{F} = \frac{1}{\sqrt{n}} \exp\left\{\frac{(1 - \gamma)(n - 1)(n\bar{x} - \gamma S)^2}{2n[n - (n - 2)\gamma]}\right\},$$

$$B_{21}^{GF} = \frac{\sqrt{1 + \gamma}}{\sqrt{n - (n - 2)\gamma}} \exp\left\{\frac{1 - \gamma}{2}\left[\frac{(n\bar{x} - \gamma S)^2}{n - (n - 2)\gamma} - (1 + \gamma)\bar{x}^2\right]\right\}, \qquad (11)$$

where $S = \sum_{s=2}^{n-1} x_s$. Both Bayes factors in (11) are consistent and, as in the previous example, they coincide when $\gamma = 0$. Consider now the limiting case in which, keeping everything else fixed, $\gamma$ increases up to 1; this is equivalent to dealing with a single observation, no matter what the value of $n$ is. In this case

$$\lim_{\gamma \to 1} B_{21}^{F} = \frac{1}{\sqrt{n}}, \qquad \lim_{\gamma \to 1} B_{21}^{GF} = 1.$$

Thus, for increasing $\gamma$, as the data become more and more dependent, the fractional Bayes factor will favor $M_1$ whatever the data. The generalized fractional Bayes factor instead suggests, more appropriately, that no choice can be made between the two models in this limiting case.
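The two limits can be checked directly from the expressions in (11), as reconstructed here; the data vector below is an arbitrary illustration.

```python
# Numerical check of the gamma -> 1 limits in Example 2.
import numpy as np

x = np.array([0.3, 1.2, -0.4, 0.8, 0.1, 0.9])    # hypothetical observations
n, xbar, S = len(x), x.mean(), x[1:-1].sum()     # S = sum_{s=2}^{n-1} x_s

def bayes_factors(g):
    d = n - (n - 2) * g
    q = (n * xbar - g * S) ** 2
    bf_f = np.exp((1 - g) * (n - 1) * q / (2 * n * d)) / np.sqrt(n)
    bf_gf = np.sqrt((1 + g) / d) * np.exp(0.5 * (1 - g) * (q / d - (1 + g) * xbar**2))
    return bf_f, bf_gf

for g in [0.9, 0.99, 0.999]:
    print(g, bayes_factors(g))                   # approaches (1/sqrt(n), 1)
print(1 / np.sqrt(n))                            # the limiting value of B^F_21
```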
Next we compare the two fractional Bayes factors when hierarchical priors are used. Recall that this is the assumption made for the prior of the parameters of $M_2$ in (1). Let the data be iid and let the prior distribution on the parameter of one of the two models, $\theta_2$ in $M_2$ say, be set in two stages: $\pi_2(\theta_2 \mid \psi_2)$ and $\pi_2(\psi_2)$, where $\psi_2$ is a hyperparameter on which the prior information is usually weak. We argue that in this case there is not a unique way of defining the fractional and the generalized fractional Bayes factors.

Let us start from expression (4), which defines the partial Bayes factor, and note that $\pi_2(\theta_2) = \int \pi_2(\theta_2 \mid \psi_2)\,\pi_2(\psi_2)\,d\psi_2$. Then, using (5) and (9), we compute, respectively, the fractional and the generalized fractional Bayes factors. Since the observations are iid, the two Bayes factors coincide:

$$B_{21}^{F} = B_{21}^{GF} = \frac{\int\!\!\int f_2(x \mid \theta_2)\,\pi_2(\theta_2 \mid \psi_2)\,\pi_2(\psi_2)\,d\psi_2\,d\theta_2}{\int f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1} \cdot \frac{\int [f_1(x \mid \theta_1)]^{b}\,\pi_1(\theta_1)\,d\theta_1}{\int\!\!\int [f_2(x \mid \theta_2)]^{b}\,\pi_2(\theta_2 \mid \psi_2)\,\pi_2(\psi_2)\,d\psi_2\,d\theta_2}. \qquad (12)$$

However, an equivalent expression for (4) is the following:

$$\frac{\int f_2(x \mid \psi_2)\,\pi_2(\psi_2)\,d\psi_2}{\int f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1} \cdot \frac{\int f_1(x(\ell) \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int f_2(x(\ell) \mid \psi_2)\,\pi_2(\psi_2)\,d\psi_2}, \qquad (13)$$

where the marginal distributions of the data $x$ and of a training sample $x(\ell)$ are obtained, respectively, as $f_2(x \mid \psi_2) = \int f_2(x \mid \theta_2)\,\pi_2(\theta_2 \mid \psi_2)\,d\theta_2$ and $f_2(x(\ell) \mid \psi_2) = \int f_2(x(\ell) \mid \theta_2)\,\pi_2(\theta_2 \mid \psi_2)\,d\theta_2$. As usual, we need to eliminate the arbitrariness in the choice of the training sample $x(\ell)$ from (13). Typically the observations $x$ and $x(\ell)$, with densities $f_2(x \mid \psi_2)$ and $f_2(x(\ell) \mid \psi_2)$, are dependent. Thus the fractional Bayes factor cannot be computed, approximation (5) being valid only for iid data.
In this case, we propose using the generalized fractional Bayes factor (10), which takes the following form:

$$\tilde{B}_{21}^{GF} = \frac{\int f_2(x \mid \psi_2)\,\pi_2(\psi_2)\,d\psi_2}{\int f_1(x \mid \theta_1)\,\pi_1(\theta_1)\,d\theta_1} \cdot \frac{\int [f_1(x \mid \theta_1)]^{b}\,\pi_1(\theta_1)\,d\theta_1}{\int L_2(\psi_2)\,\pi_2(\psi_2)\,d\psi_2}, \qquad (14)$$
where $L_2(\psi_2) = [\prod_{x(\ell) \in X} f_2(x(\ell) \mid \psi_2)]^{1/L}$ and $f_2(x(\ell) \mid \psi_2) = \int f_2(x(\ell) \mid \theta_2)\,\pi_2(\theta_2 \mid \psi_2)\,d\theta_2$.

We now illustrate an argument in favor of using $\tilde{B}_{21}^{GF}$ instead of $B_{21}^{F}$ for comparing two models $M_1$ and $M_2$ when the prior distribution for $M_2$ is set through a two-stage hierarchy. Lemma 1 below shows that the generalized fractional Bayes factor $\tilde{B}_{21}^{GF}$ gives less support to $M_2$ than the fractional Bayes factor $B_{21}^{F}$ does, regardless of which model is true. Lemma 1 also shows that $\tilde{B}_{21}^{GF}$ supports $M_2$ less strongly when the information from the data about the unknown parameter $\psi_2$ in $M_2$ is in disagreement. This supports our point that $\tilde{B}_{21}^{GF}$ is to be preferred to $B_{21}^{F}$.

Consider the following definition of disagreement, or contradictory evidence, on a parameter $\psi$ given by two samples $x_1$ and $x_2$ from a random variable $X$ with density function $f(x \mid \psi)$.

Definition 1. We say that two samples $x_1$ and $x_2$ give "contradictory evidence" about $\psi$ if

$$\int [\pi(\psi \mid x_1)\,\pi(\psi \mid x_2)]^{1/2}\,d\psi \quad \text{is small}, \qquad (15)$$

where $\pi(\psi \mid x_i)$, $i = 1, 2$, denote the two posterior distributions obtained from $x_1$ and from $x_2$, when a prior distribution $\pi(\psi)$ is considered for $\psi$. Condition (15) states that the probability masses of the two posterior distributions $\pi(\psi \mid x_1)$ and $\pi(\psi \mid x_2)$ are concentrated on different values of $\psi$. Note that (15) requires that the affinity between the two posterior distributions be small, or that their Hellinger distance be large (see, for example, Le Cam, 1986, Chapter 4, p. 47). Indeed, a large distance between the posterior distributions based on two samples is a way of expressing the idea that the two samples present contradictory evidence about the parameter $\psi$.

In order to characterize a sample $x$ that contains contradictory evidence about a parameter $\psi$, we propose the following natural extension of Definition 1 to the set $X = \{x_1(\ell), \ldots, x_h(\ell), \ldots, x_L(\ell)\}$ of all training samples of size $\ell$ extracted from $x$.

Definition 2. Let $\pi(\psi \mid x(\ell))$ denote the posterior distribution for $\psi$ obtained from a training sample $x(\ell) \in X$ of size $\ell$. We say that $x$ contains "contradictory evidence" about $\psi$ if

$$\int \Bigl[\prod_{x(\ell) \in X} \pi(\psi \mid x(\ell))\Bigr]^{1/L}\,d\psi \quad \text{is small}. \qquad (16)$$
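To make (15) concrete: for Normal samples of size $m$ with known variance $\sigma^2$ and a flat prior on $\psi$, each posterior is $N(\bar{x}_i, \sigma^2/m)$ and the affinity has the closed form $\exp\{-(\bar{x}_1 - \bar{x}_2)^2 m / (8\sigma^2)\}$, so it decays quickly as the two sample means separate. A minimal sketch under these assumptions:

```python
# Affinity (15) between two Normal posteriors, computed numerically and
# compared with its closed form; the model and prior are assumptions.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma, m = 1.0, 5                      # known sd, size of each sample
for gap in [0.0, 1.0, 3.0, 6.0]:       # distance between the two sample means
    p1 = norm(0.0, sigma / np.sqrt(m))
    p2 = norm(gap, sigma / np.sqrt(m))
    aff = quad(lambda t: np.sqrt(p1.pdf(t) * p2.pdf(t)), -20, 20 + gap)[0]
    print(gap, aff, np.exp(-gap**2 * m / (8 * sigma**2)))   # closed form
```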
The smaller (16) is, the more contradictory the evidence about $\psi$ in the sample. Note that (15) is a special case of (16) with $L = 2$, and that (16) is $\leq 1$, by the Cauchy–Schwarz inequality.

Lemma 1. In the hierarchical prior setup specified above:

(a) For any $x$, the generalized fractional Bayes factor $\tilde{B}_{21}^{GF}$ gives less support to $M_2$ than the fractional Bayes factor $B_{21}^{F}$ does:

$$\frac{\tilde{B}_{21}^{GF}}{B_{21}^{F}} = \frac{\int\!\!\int [f_2(x \mid \theta_2)]^{\ell/n}\,\pi_2(\theta_2 \mid \psi_2)\,\pi_2(\psi_2)\,d\theta_2\,d\psi_2}{\int L_2(\psi_2)\,\pi_2(\psi_2)\,d\psi_2} \leq 1. \qquad (17)$$

(b) Consider a sample $x$ that contains contradictory evidence about $\psi_2$ as in (16). The more contradictory the evidence in $x$ about $\psi_2$, the less support $\tilde{B}_{21}^{GF}$ gives to $M_2$ with respect to $B_{21}^{F}$.
Proof. To prove part (a) we use the fact that the data, distributed as $f_2(x \mid \theta_2)$, are independent, together with the Cauchy–Schwarz inequality:

$$\int [f_2(x \mid \theta_2)]^{\ell/n}\,\pi_2(\theta_2 \mid \psi_2)\,d\theta_2 = \int \Bigl[\prod_{x(\ell) \in X} f_2(x(\ell) \mid \theta_2)\Bigr]^{1/L} \pi_2(\theta_2 \mid \psi_2)\,d\theta_2 \leq L_2(\psi_2), \qquad (18)$$

and (17) follows. To prove part (b), note that the ratio between the Bayes factors in (17) can be written as

$$\frac{\tilde{B}_{21}^{GF}}{B_{21}^{F}} = \frac{\int\!\!\int \Bigl[\prod_{x(\ell) \in X} f_2(x(\ell) \mid \theta_2)\Bigr]^{1/L} \pi_2(\theta_2 \mid \psi_2)\,\pi_2(\psi_2)\,d\theta_2\,d\psi_2}{\int \Bigl[\prod_{x(\ell) \in X} f_2(x(\ell) \mid \psi_2)\Bigr]^{1/L} \pi_2(\psi_2)\,d\psi_2}$$

and, multiplying and dividing by $\prod_{x(\ell) \in X} [f_2(x(\ell))]^{1/L}$, where $f_2(x(\ell)) = \int\!\!\int f_2(x(\ell) \mid \theta_2)\,\pi_2(\theta_2 \mid \psi_2)\,\pi_2(\psi_2)\,d\theta_2\,d\psi_2$ is the marginal distribution, we obtain

$$\frac{\tilde{B}_{21}^{GF}}{B_{21}^{F}} = \frac{\int \prod_{x(\ell) \in X} (\pi_2(\theta_2 \mid x(\ell)))^{1/L}\,d\theta_2}{\int \prod_{x(\ell) \in X} (\pi_2(\psi_2 \mid x(\ell)))^{1/L}\,d\psi_2}.$$

Hence the more contradictory the evidence that $x$ contains about $\psi_2$, as in (16), the smaller the ratio in (17). $\square$
Lemma 1 shows that the generalized fractional Bayes factor $\tilde{B}_{21}^{GF}$ has a smaller chance than $B_{21}^{F}$ of choosing a model on whose parameter the sample evidence is contradictory. We argue that this is the appropriate behavior for a model choice criterion.

As an illustration of part (b) of Lemma 1, consider the following simple example, where $x$ is a sample of just two observations from a Normal distribution $N(\theta_2, \sigma^2)$ under $M_2$, with $\sigma^2$ known. Let the prior distribution for $\theta_2$ be set in two stages, with $\theta_2 \mid \psi_2 \sim N(0, \psi_2)$ at the first stage and $\pi(\psi_2)$ at the second stage. Using $\ell = 1$, the ratio between the Bayes factors is easily shown to be a monotone decreasing function of the sample variance $s^2 = 2^{-1} \sum (x_i - \bar{x})^2$:

$$\frac{\tilde{B}_{21}^{GF}}{B_{21}^{F}} = \frac{\int (\sigma^2 + \psi_2)^{-1/2} \exp\{-\bar{x}^2 / 2(\sigma^2 + \psi_2)\}\,\pi(\psi_2)\,d\psi_2}{\int (\sigma^2 + \psi_2)^{-1/2} \exp\{-(\sigma^2 \bar{x}^2 - s^2 \psi_2) / 2\sigma^2(\sigma^2 + \psi_2)\}\,\pi(\psi_2)\,d\psi_2}.$$

This shows that as the two observations $x_1$ and $x_2$ get farther apart, the evidence about $\psi_2$ becomes more contradictory, and $\tilde{B}_{21}^{GF}$ gives less support to $M_2$ with respect to $B_{21}^{F}$.
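The monotonicity is easy to verify numerically; in the sketch below the proper hyperprior $\pi(\psi_2) = (1 + \psi_2)^{-2}$ is a purely illustrative assumption.

```python
# The ratio B~^GF_21 / B^F_21 of the two-observation Normal example,
# as a function of s^2 (expressions as reconstructed above).
import numpy as np
from scipy.integrate import quad

sig2, xbar = 1.0, 0.5                  # known variance, fixed sample mean

def ratio(s2):
    pri = lambda p: 1.0 / (1.0 + p) ** 2        # hypothetical prior on psi_2
    num = quad(lambda p: (sig2 + p) ** -0.5
               * np.exp(-xbar**2 / (2 * (sig2 + p))) * pri(p), 0, np.inf)[0]
    den = quad(lambda p: (sig2 + p) ** -0.5
               * np.exp(-(sig2 * xbar**2 - s2 * p) / (2 * sig2 * (sig2 + p)))
               * pri(p), 0, np.inf)[0]
    return num / den

for s2 in [0.1, 1.0, 4.0, 9.0]:
    print(s2, ratio(s2))               # decreases monotonically in s^2
```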
3. Hierarchical prior specified through a Dirichlet process

3.1. Introduction

Let us now go back to the model selection problem of Section 2.1. We assume, as in CCO, that the prior distribution of $\pi$ under $M_2$ is set in two stages. At the first stage, given $\theta$ and $c$, let $\pi$ be distributed as in (2), with $c_j = c\,\pi_j(\theta)$, $j = 0, \ldots, k$. At the second stage, in contrast to CCO, we assume that $\pi_2(\theta, c)$ is a non-informative prior distribution for both hyperparameters $\theta$ and $c$, where $(\theta, c)$ plays the role of $\psi_2$ in the notation of Section 2.4. Deriving the marginal distribution $f_2(x \mid \theta, c)$ for $M_2$, the two models to be compared are

$$M_1:\ f_1(x \mid \theta) = \prod_{j=0}^{k} [\pi_j(\theta)]^{r_j}, \qquad M_2:\ f_2(x \mid \theta, c) = \frac{\prod_{j=0}^{k} \prod_{h=0}^{r_j - 1} (c_j(\theta) + h)}{\prod_{h=0}^{n - 1} (c + h)}. \qquad (19)$$
In (19), we take the inner product over $h$ in the numerator of $f_2(x \mid \theta, c)$ to be 1 if $r_j = 0$. Data under $M_2$ are not iid, so we use the generalized fractional Bayes factor approach of Section 2.4 for model selection. The two models in (19) are nested; in fact, $\lim_{c \to \infty} f_2(x \mid \theta, c) = f_1(x \mid \theta)$. We interpret $c > 0$ as a parameter measuring the distance between the two models, since the smaller $c$ is, the further apart $M_2$ is from $M_1$. The generalized fractional Bayes factor for the models in (19) is

$$\tilde{B}_{21}^{GF} = \frac{\int\!\!\int f_2(x \mid \theta, c)\,\pi_2(\theta, c)\,d\theta\,dc}{\int f_1(x \mid \theta)\,\pi_1(\theta)\,d\theta} \cdot \frac{\int [f_1(x \mid \theta)]^{\ell/n}\,\pi_1(\theta)\,d\theta}{\int\!\!\int L_2(\theta, c)\,\pi_2(\theta, c)\,d\theta\,dc}. \qquad (20)$$

Lemma 2. The explicit expression for $L_2(\theta, c)$ in (20) is given by

$$L_2(\theta, c) = \frac{\prod_{j=0}^{k} \prod_{h=0}^{\min(r_j, \ell) - 1} (c_j(\theta) + h)^{\lambda_{h,j}}}{\prod_{h=0}^{\ell - 1} (c + h)}, \qquad (21)$$

where

$$\lambda_{h,j} = \frac{1}{L} \sum_{t = \max(h + 1,\ \ell + r_j - n)}^{\min(r_j, \ell)} \binom{r_j}{t} \binom{n - r_j}{\ell - t}.$$
The sum on the right-hand side of the expression for $\lambda_{h,j}$ is the total number of training samples with at least $h + 1$ observations in group $G_j$.

Proof. From (19), the likelihood of one training sample $x(\ell) \in X$ under $M_2$ is

$$f_2(x(\ell) \mid \theta, c) = \frac{\prod_{j=0}^{k} \prod_{h=0}^{r_j(x(\ell)) - 1} (c_j(\theta) + h)}{\prod_{h=0}^{\ell - 1} (c + h)}, \qquad (22)$$
where $r_j(x(\ell))$ is the number of observations in $x(\ell)$ falling in group $G_j$ and, as in (19), if $r_j(x(\ell)) = 0$ the inner product over $h$ in the numerator is set to 1. From (9) and (22), $L_2(\theta, c)$ is given by

$$L_2(\theta, c) = \Biggl[\prod_{x(\ell) \in X} \frac{\prod_{j=0}^{k} \prod_{h=0}^{r_j(x(\ell)) - 1} (c_j(\theta) + h)}{\prod_{h=0}^{\ell - 1} (c + h)}\Biggr]^{1/L} = \frac{\Bigl[\prod_{x(\ell) \in X} \prod_{j=0}^{k} \prod_{h=0}^{r_j(x(\ell)) - 1} (c_j(\theta) + h)\Bigr]^{1/L}}{\prod_{h=0}^{\ell - 1} (c + h)}. \qquad (23)$$
For all $x(\ell) \in X$, we have $r_j(x(\ell)) \leq \min(r_j, \ell)$. Now let $I[r_j(x(\ell)) - 1 \geq h]$ denote the indicator function of the event $\{r_j(x(\ell)) - 1 \geq h\}$, so the innermost product in the numerator of (23) can be written as

$$\prod_{h=0}^{r_j(x(\ell)) - 1} (c_j(\theta) + h) = \prod_{h=0}^{\min(r_j, \ell) - 1} (c_j(\theta) + h)^{I[r_j(x(\ell)) - 1 \geq h]}. \qquad (24)$$
Substituting (24) in (23) and collecting the indicator functions in (24), we obtain

$$L_2(\theta, c) = \frac{\prod_{j=0}^{k} \prod_{h=0}^{\min(r_j, \ell) - 1} (c_j(\theta) + h)^{\frac{1}{L} \sum_{x(\ell)} I[r_j(x(\ell)) - 1 \geq h]}}{\prod_{h=0}^{\ell - 1} (c + h)},$$

from which (21) follows. $\square$
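The closed form (21) can be checked against brute-force enumeration of the training samples, following the proof. In the sketch below the counts, the values standing in for $c_j(\theta)$, and $c$ are illustrative assumptions.

```python
# Verify (21) against direct enumeration of all training samples via (22).
import numpy as np
from itertools import combinations
from math import comb, log

r = np.array([3, 5, 2])               # hypothetical counts over k+1 = 3 groups
n, ell = int(r.sum()), 2
c_j = np.array([2.0, 4.0, 1.5])       # plays the role of c_j(theta)
c = c_j.sum()
L = comb(n, ell)

# closed form (21)
log_num = 0.0
for j, rj in enumerate(r):
    rj = int(rj)
    for h in range(min(rj, ell)):
        lam = sum(comb(rj, t) * comb(n - rj, ell - t)
                  for t in range(max(h + 1, ell + rj - n),
                                 min(rj, ell) + 1)) / L
        log_num += lam * log(c_j[j] + h)
closed = np.exp(log_num) / np.prod([c + h for h in range(ell)])

# brute force: geometric mean of (22) over all training samples
labels = np.repeat(np.arange(len(r)), r)
logs = []
for idx in combinations(range(n), ell):
    rs = np.bincount(labels[list(idx)], minlength=len(r))
    val = sum(log(c_j[j] + h) for j in range(len(r)) for h in range(rs[j]))
    logs.append(val - sum(log(c + h) for h in range(ell)))
brute = np.exp(np.mean(logs))

print(closed, brute)                  # the two values agree
```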
From Lemma 2 it can be checked that when $\ell = 1$, $L_2(\theta, c)$ reduces to $[f_1(x \mid \theta)]^{1/n}$, which implies that if the training samples have size $\ell = 1$ the geometric likelihood $L_2(\theta, c)$ fails to update the prior on $c$. This was to be expected since, in the case considered here, the size of the minimal training sample is 2.

Remark. So far, the number of groups $G_j$ has been assumed to be finite. When the observations fall in one of a countable set of groups, as for example in the Poisson case, then in $M_2$ we assume that $P(x_s \in G_j \mid \pi) = \pi_j = F(j) - F(j - 1)$, where $F(\cdot)$ is a discrete time Dirichlet process, centered on $M_1$, with parameter $c\,H(\cdot \mid \theta)$, where $H(\cdot \mid \theta)$ is a cumulative distribution function with jumps $\pi_j(\theta)$ at the points $j = 0, 1, \ldots$. If $k - 1$ denotes the last group index where observations fall, then the results and notation of this section still hold, with $\pi_k(\theta) = 1 - \sum_{j=0}^{k-1} \pi_j(\theta)$ and $r_k = 0$.

3.2. Prior distributions for $\theta$ and $c$

CCO treated $c$ as a known constant and chose the same non-informative prior distribution $\pi_1(\theta)$ for $\theta$ under both models. The form of $\pi_1(\theta)$ was chosen according to the specific parametric family $F = \{\pi_j(\theta),\ \theta \in \Theta\}$ considered. We will now specify a prior $\pi_1(\theta)$ in $M_1$ and a joint prior $\pi_2(\theta, c)$ in $M_2$. An independence assumption between $\theta$ and $c$ under $M_2$ appears reasonable, given that $c$ represents the distance between $M_1$ and $M_2$ in (19), while $\theta$ specifies the location of the two models in (19). Hence $\pi_2(\theta, c) = \pi_2(\theta)\,\pi_2(c)$, and we choose $\pi_2(\theta) = \pi_1(\theta)$, as in CCO.

The standard non-informative prior distribution for the scale parameter $c$, $\pi_2(c) \propto 1/c$, cannot be used in this context, since the integrals in (20) do not converge. This problem has been considered in the literature by Good (1967), who suggested two possible solutions. Instead of looking for a proper distribution approximating $\pi_2(c) \propto 1/c$ as in Good, we propose an approach that exploits the fact that the correlation between any two observations under $M_2$ is a function of $c$. From $f_2(x \mid \theta, c)$ in (19),

$$r(x_s, x_{s'}) = \frac{E[x_s x_{s'}] - E[x_s] E[x_{s'}]}{\sqrt{V[x_s]\,V[x_{s'}]}} = \frac{1}{c + 1},$$

showing that the correlation between two observations does not depend on $\theta$ and varies between 0 and 1. If no prior information is available for $r(x_s, x_{s'})$, we assume $r(x_s, x_{s'})$ to be uniformly distributed between 0 and 1, and obtain from this the corresponding proper prior distribution for $c$:

$$\pi_2(c) = \frac{1}{(c + 1)^2}. \qquad (25)$$
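Sampling from (25) is immediate: since $r = 1/(c + 1)$ is uniform on $(0, 1)$ under (25), one can draw $u \sim U(0, 1)$ and set $c = 1/u - 1$. A minimal sketch:

```python
# Draws from the proper prior (25) via the uniform correlation r = 1/(c+1).
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=100_000)
c = 1.0 / u - 1.0                      # c ~ pi_2(c) = 1/(c+1)^2

print(np.mean(c <= 1.0))               # sanity check: P(c <= 1) = 1/2 under (25)
```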
3.3. Examples and comparisons with CCO's procedure

The fractional Bayes factor in (6) and the generalized fractional Bayes factor in (20) can be computed once the parametric family $F = \{\pi_j(\theta),\ \theta \in \Theta\}$ that specifies $\pi_j(\theta)$ under $M_1$ is selected. We consider the case where the $\pi_j(\theta)$'s have the form of a binomial distribution, so that

$$\pi_j(\theta) = \binom{k}{j} \theta^{j} (1 - \theta)^{k - j}, \qquad j = 0, 1, \ldots, k, \qquad (26)$$

and, as in CCO, we choose the improper prior distribution for $\theta$ to be $\pi_1(\theta) \propto \theta^{-1}(1 - \theta)^{-1}$.

We first examine data from Hoaglin et al. (1985), giving the number of females in 100 queues of size 10 observed in a London underground station, also examined by CCO to test the hypothesis that the data were in fact from a binomial distribution. The data are below.

Number of females:  0   1   2   3   4   5   6   7   8   9   10
Number of queues:   1   3   4   23  25  19  18  5   1   1   0
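For readers wishing to reproduce this type of computation, the sketch below assembles (19), (21), (25) and (26) into a direct quadrature evaluation of $\tilde{B}_{21}^{GF}$ in (20) for the queue data, with $\ell = 2$ and the substitution $c = 1/u - 1$ absorbing the prior (25). It is illustrative only: the integrands take very small values, and the quadrature tolerances are not tuned.

```python
# A quadrature sketch of the generalized fractional Bayes factor (20).
import numpy as np
from scipy.integrate import quad, dblquad
from math import comb, log

r = np.array([1, 3, 4, 23, 25, 19, 18, 5, 1, 1, 0])   # queue data above
k, n, ell = len(r) - 1, int(r.sum()), 2
L = comb(n, ell)

def log_pij(th):                       # log pi_j(theta), binomial family (26)
    j = np.arange(k + 1)
    return (np.log([comb(k, int(t)) for t in j]) + j * np.log(th)
            + (k - j) * np.log1p(-th))

def log_f1(th, frac=1.0):              # log [f1(x|theta)]^frac, see (19)
    return frac * float(np.dot(r, log_pij(th)))

def log_f2(th, c):                     # log f2(x|theta,c), see (19)
    cj = c * np.exp(log_pij(th))
    num = sum(log(cj[j] + h) for j in range(k + 1) for h in range(r[j]))
    return num - sum(log(c + h) for h in range(n))

def log_L2(th, c):                     # log geometric likelihood (21)
    cj = c * np.exp(log_pij(th))
    out = -sum(log(c + h) for h in range(ell))
    for j in range(k + 1):
        rj = int(r[j])
        for h in range(min(rj, ell)):
            lam = sum(comb(rj, t) * comb(n - rj, ell - t)
                      for t in range(max(h + 1, ell + rj - n),
                                     min(rj, ell) + 1)) / L
            out += lam * log(cj[j] + h)
    return out

pri = lambda th: 1.0 / (th * (1.0 - th))          # improper prior on theta
I1  = quad(lambda th: np.exp(log_f1(th)) * pri(th), 0, 1)[0]
I1b = quad(lambda th: np.exp(log_f1(th, ell / n)) * pri(th), 0, 1)[0]
I2  = dblquad(lambda th, u: np.exp(log_f2(th, 1 / u - 1)) * pri(th), 0, 1, 0, 1)[0]
I2b = dblquad(lambda th, u: np.exp(log_L2(th, 1 / u - 1)) * pri(th), 0, 1, 0, 1)[0]
print((I2 / I1) * (I1b / I2b))                    # B~^GF_21 of (20)
```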
The fractional Bayes factor computed by CCO depends on the parameter $c$. The values obtained are not stable, and they get closer to 1 for large $c$, when in fact the two models $M_1$ and $M_2$ coincide. Table 1 shows the values of the fractional Bayes factor reported by CCO for a few values of $c$, together with the value of the generalized fractional Bayes factor computed from (21), both based on training samples of minimal size ($\ell = 1$ and $\ell = 2$, respectively).
Table 1
Fractional and generalized fractional Bayes factors for the queue data.

              c = 2         c = 20    c = 100
B^F_21        8.0 × 10^-7   0.355     0.676
B~^GF_21      4.68 × 10^-6  (does not depend on c)
Table 2
Data generated from the binomial (26) with k = 6, theta = 0.5.

n    r0  r1  r2  r3  r4  r5  r6    B^F_21 (c=2)  B^F_21 (c=20)  B^F_21 (c=100)  B~^GF_21
60   0   3   14  22  15  6   0     0.0775        0.7437         1.0954          3.94 × 10^-5
40   0   4   7   19  8   1   1     0.1225        1.6461         1.6419          0.0002
20   1   1   8   5   3   2   0     0.0493        0.7192         1.0655          0.0002
Table 3
Data generated from the binomial (26) with k = 6, theta = 0.25.

n    r0  r1  r2  r3  r4  r5  r6    B^F_21 (c=2)  B^F_21 (c=20)  B^F_21 (c=100)  B~^GF_21
60   10  22  22  6   0   0   0     0.4755        1.5874         1.4471          4.35 × 10^-5
40   11  10  12  4   3   0   0     0.1540        1.3050         1.5214          6.54 × 10^-5
20   2   10  3   3   1   1   0     0.1520        1.1507         1.3288          0.0004
Table 4
Data generated from the U-shaped distribution pi = (0.31, 0.16, 0.03, 0.03, 0.16, 0.31), k = 5.

n    r0  r1  r2  r3  r4  r5    B^F_21 (c=2)   B^F_21 (c=20)  B^F_21 (c=100)  B~^GF_21
60   15  10  3   4   8   20    7.73 × 10^24   9.92 × 10^21   1.56 × 10^14    1.73 × 10^19
40   15  6   0   0   5   14    2.68 × 10^26   2.32 × 10^21   9.58 × 10^13    2.49 × 10^20
20   7   3   0   0   3   7     2.69 × 10^11   9.9 × 10^7     10977.76        207570.6
The value obtained for the generalized fractional Bayes factor in this example supports the conclusion from the fractional Bayes factor, but avoids the need to guess which value of $c$ might be appropriate and why.

In the example above, the sample size was $n = 100$. For further comparison, we selected a few examples with decreasing sample size, and data generated either from a binomial distribution or from a U-shaped distribution. Reducing the sample size makes it harder for any model selection method to choose the correct model. Table 2 shows three data sets generated with $\pi_j(\theta)$ defined in (26), with $k = 6$, $\theta = 0.5$ and decreasing sample size. Similarly, Table 3 considers three data sets generated with $\pi_j(\theta)$ defined in (26), with $k = 6$, $\theta = 0.25$ and decreasing sample size.

The main advantage of the generalized fractional Bayes factor is that it avoids the instability caused by the arbitrary choice of the Dirichlet parameter $c$. For all values of $c$ chosen in these examples, $B_{21}^{F}$ favors model $M_2$ more than $\tilde{B}_{21}^{GF}$ does. This is even clearer in Table 3, where data generated from an asymmetric binomial with small $n$ are also likely to be from a non-binomial model.

To check how well the fractional Bayes factors detect data sampled from a distribution that is obviously not binomial, Table 4 considers three data sets with decreasing values of $n$, $k = 5$ and data sampled from $\pi = (0.31, 0.16, 0.03, 0.03, 0.16, 0.31)$, a U-shaped distribution. Table 4 shows that the fractional Bayes factor is able to detect better than $\tilde{B}_{21}^{GF}$ that the data are not binomial, in accordance with our general results of Section 2.

The case where the $\pi_j(\theta)$'s have the form of a Poisson distribution leads to similar results, not reported here.
4. Extension to a Beta-Stacy process

This section generalizes the approach of Section 3. We still consider the two models $M_1:\ P(x_s \in G_j \mid \theta) = \pi_j(\theta)$ and $M_2:\ P(x_s \in G_j \mid \pi) = \pi_j$, where under $M_1$ the $\pi_j$'s belong to a specific parametric family $F = \{\pi_j(\theta),\ \theta \in \Theta\}$, with the number of groups $G_j$ being either finite or countable. In particular, we illustrate how the method can be extended to the Beta-Stacy process, a generalization of the Dirichlet process. This section is included to offer an alternative for modeling the prior through a larger class of distributions. As appears from the specification in (31), made for the hyperparameters of the Beta-Stacy model, this extension can be especially useful when the groups are naturally ordered.

As in Section 3, the prior distribution for $\pi$ under $M_2$ is set in two stages. We assume that $\pi$ is a realization of a discrete Beta-Stacy process, with appropriate hyperparameters on which a second stage prior distribution is specified. The Beta-Stacy process is a generalization of the Dirichlet process treated by Walker and Muliere (1997) (see also Damien and Walker, 2002). The main features of the Beta-Stacy distribution and of the discrete Beta-Stacy process, obtained by Walker and Muliere (1997), are summarized in the Appendix.

The first stage prior distribution for $\pi_j$ in $M_2$ takes $\pi_j$ to be the width of the $j$th jump $F(j) - F(j - 1)$ of a discrete Beta-Stacy process $F(\cdot)$ as in Definition A1. We constrain the discrete Beta-Stacy process parameters $\{\alpha_j, \beta_j\}$ so that the model $M_2$ is centered on $M_1$; thus, as in Section 3, $E[\pi_j] = \pi_j(\theta)$. Let $H(\cdot \mid \theta)$ be the cumulative distribution function specified by the parametric family $F$ in $M_1$. From Lemma A1, $M_2$ is centered on $M_1$ if $\alpha_0 = c_0 H(0 \mid \theta)$, $\alpha_j = c_j (H(j \mid \theta) - H(j - 1 \mid \theta))$, and $\beta_j = c_j (1 - H(j \mid \theta))$, where $c_j > 0$. Requiring that $M_2$ be centered on $M_1$ reduces the hyperparameters of the first stage prior to $\theta$ and $c = (c_0, c_1, \ldots, c_j, \ldots)$.

As in Section 3, we can compare $M_1$ and $M_2$ using the generalized fractional Bayes factor $\tilde{B}_{21}^{GF}$ defined in (14). With the Beta-Stacy assumptions, the expression for $f_2(x \mid \theta, c)$ analogous to (19) takes the form

$$f_2(x \mid \theta, c) = \prod_{j=0}^{k} \frac{\prod_{h=0}^{r_j - 1} (\alpha_j + h)\ \prod_{h=0}^{m_j - 1} (\beta_j + h)}{\prod_{h=0}^{r_j + m_j - 1} (\alpha_j + \beta_j + h)}, \qquad (27)$$
where $r_j$ and $m_j$ are defined in Lemma A3 in the Appendix. An alternative expression for $f_2(x \mid \theta, c)$ is obtained using the conditional densities of the elements of $x$:

$$f_2(x \mid \theta, c) = f_2(x_1 \mid \theta, c)\,f_2(x_2 \mid x_1, \theta, c) \cdots f_2(x_n \mid x_1, \ldots, x_{n-1}, \theta, c), \qquad (28)$$
where each factor in (28) can be obtained through (34), the predictive probability derived in Lemma A3. The marginal distribution of each training sample $x(\ell) \in X$, needed to calculate the generalized fractional Bayes factor, can be obtained similarly to (27) or (28). Note that, denoting by $G_k$ the last group containing observations, $f_2(x \mid \theta, c)$ depends only on the hyperparameters $\theta$ and $c_0, c_1, \ldots, c_k$, on which we need to specify a prior distribution.

A further reduction of the number of hyperparameters is obtained by requiring that the distribution for $\pi$ be such that the correlation $\rho(\pi_s, \pi_r)$ between any two parameters is negative. This is an appropriate default requirement because of the constraint $\sum_j \pi_j = 1$, and it is satisfied by the Dirichlet prior of Section 3.

Lemma 3. A sufficient condition for $\rho(\pi_s, \pi_r)$ to be negative for all $s \neq r$ is

$$c_0 \leq c_1 \leq \cdots \leq c_j \leq \cdots \leq c_0 + 1. \qquad (29)$$
Proof. Lemma A2 in the Appendix can be used to derive the expressions of the covariances of the random probability masses $\{\pi_0, \pi_1, \ldots, \pi_k\}$ assigned to the groups $\{G_0, G_1, \ldots, G_k\}$ by a Beta-Stacy process centered on $H(\cdot \mid \theta)$.
Assuming $s < t$, we have

$$\pi_t = V_t \prod_{j=0}^{t-1} (1 - V_j) = \frac{\pi_s}{V_s}\,V_t \prod_{j=s}^{t-1} (1 - V_j)$$

and

$$\pi_s \pi_t = V_s V_t \prod_{j=0}^{s-1} (1 - V_j)^2 \prod_{j=s}^{t-1} (1 - V_j).$$

Thus $E(\pi_t \pi_s)$ can be obtained as a product of first and second moments of Beta random variables, and the covariance between $\pi_s$ and $\pi_t$ is given by

$$C(\pi_s, \pi_t) = \pi_s(\theta)\,\pi_t(\theta) \left[\frac{c_s}{c_0 + 1} \prod_{j=0}^{s-1} \frac{c_j [1 - H(j \mid \theta)] + 1}{c_{j+1} [1 - H(j \mid \theta)] + 1} - 1\right]. \qquad (30)$$
The proof follows by noting that if (29) holds, all the factors in the product in the expression of $C(\pi_s, \pi_t)$ in (30) are less than 1. $\square$

We suggest the following specification of $c_0, \ldots, c_k$, which satisfies (29) in Lemma 3:

$$c_j = c + \frac{j}{k}\,d, \qquad 0 \leq d \leq 1, \quad j = 0, \ldots, k. \qquad (31)$$

With (31), the prior distribution for $\pi$ depends only on the hyperparameters $\theta$, $c$, $d$, where $\theta$ and $c$ have the same meaning as in Section 3, and $d$ can be interpreted as the distance between the Dirichlet and Beta-Stacy processes; in fact, the two processes coincide when $d = 0$. The prior distribution for $\pi$ depends on the way the groups have been ordered. This makes the Beta-Stacy hierarchical prior best suited to situations where there is a natural ordering of the groups. The prior distributions for the hyperparameters $\theta$, $c$ and $d$ can be assigned by adopting the same priors of Section 3 for $\theta$ and $c$, and a uniform $U(0, 1)$ prior for $d$. The generalized fractional Bayes factor can then be computed as in Section 3.
5. Discussion

We have presented an approach to Bayesian goodness of fit testing for multinomial data. Our method is based on the generalized fractional Bayes factor. The advantages of this method are that it does not rely on asymptotic justifications, it does not require a subjective prior, and it behaves well in the examples. We have also shown that the generalized fractional Bayes factor has some advantages over the fractional Bayes factor.

We used hierarchical prior distributions built either through a Dirichlet or through a Beta-Stacy model, both expressing alternatives to a null sub-model of the multinomial. We derived a closed form expression for the generalized fractional Bayes factor under the Dirichlet model prior. The extension to the Beta-Stacy case in Section 4 allows for more flexibility in model specification, but has greater computational complexity. The Beta-Stacy hierarchical prior is useful when the groups are ordered. However, it can also be used when there is no such order; this requires symmetrizing the prior by averaging over group orderings. We leave the details for future work.
Acknowledgements

The authors would like to thank Larry Wasserman, two referees and the associate editor for their help in improving previous drafts of this paper, and Stephen Walker for a useful conversation about properties of the Beta-Stacy distribution.
Appendix. The discrete Beta-Stacy process

All the results in this appendix are taken from Walker and Muliere (1997). A random variable $Y$ is said to have a Beta-Stacy distribution $\mathrm{BetaSt}(\alpha, \beta, \omega)$, with parameters $\alpha > 0$, $\beta > 0$ and $0 < \omega \leq 1$, if $Y$ has the density function

$$f(y \mid \alpha, \beta, \omega) = \frac{1}{B(\alpha, \beta)}\,\frac{1}{\omega^{\alpha + \beta - 1}}\,y^{\alpha - 1} (\omega - y)^{\beta - 1}\,I_{(0, \omega)}(y), \qquad (32)$$

where $B(\alpha, \beta)$ is the beta function. For $\omega = 1$ the density in (32) reduces to the density function of a $\mathrm{Beta}(\alpha, \beta)$ random variable. For $0 < \omega < 1$ the density in (32) is a Beta density that assigns mass to the interval $(0, \omega]$, rather than $(0, 1]$. It can be seen that $E[Y] = \omega \alpha / (\alpha + \beta)$.

Consider the sequence of points $\{0, 1, \ldots, j, \ldots\}$ and associate a Beta-Stacy random variable with each $j$:

$$Y_0 \sim \mathrm{BetaSt}(\alpha_0, \beta_0, 1),$$
$$Y_1 \mid Y_0 \sim \mathrm{BetaSt}(\alpha_1, \beta_1, 1 - Y_0),$$
$$\vdots$$
$$Y_j \mid Y_0, Y_1, \ldots, Y_{j-1} \sim \mathrm{BetaSt}\Bigl(\alpha_j, \beta_j, 1 - \sum_{s=0}^{j-1} Y_s\Bigr), \qquad (33)$$
$$\vdots$$

where $\alpha_j > 0$, $\beta_j > 0$. Let $F(t) = \sum_{j \leq t} Y_j$. Walker and Muliere (1997) showed that, if $\prod_{j=0}^{\infty} [\beta_j / (\alpha_j + \beta_j)] = 0$, then, with probability 1, $F$ is a random cumulative distribution function.
Definition A1 (Walker and Muliere, 1997). The random cumulative distribution function $F$ is a discrete Beta-Stacy process with parameters $\{\alpha_j, \beta_j\}$ and jumps of random size $Y_0, Y_1, \ldots, Y_j, \ldots$ at the points $\{0, 1, \ldots, j, \ldots\}$.

Walker and Muliere (1997) also proved a number of properties of the discrete Beta-Stacy process $F$, among which we list the following three, as lemmas that we use in Section 4.

Lemma A1. Let $H(\cdot)$ be a discrete cumulative distribution function with jumps at the points $\{0, 1, \ldots, j, \ldots\}$. If $\alpha_0 = c_0 H(0)$, $\alpha_j = c_j (H(j) - H(j - 1))$, and $\beta_j = c_j (1 - H(j))$, with $c_j > 0$, the discrete Beta-Stacy process $F$ is centered on $H$, in the sense that $E[Y_j] = H(j) - H(j - 1)$.

Remark A1. If $c_j = c$ for all $j$'s, then $\alpha_j + \beta_j = \beta_{j-1}$ and $F$ is a discrete Dirichlet process with parameter $cH(\cdot)$.

Lemma A2. If $V_j = Y_j / (1 - \sum_{s=0}^{j-1} Y_s)$, then for any $m > 0$ the random variables $V_0, V_1, \ldots, V_m$ are independent and marginally distributed as $\mathrm{Beta}(\alpha_j, \beta_j)$. Furthermore, inverting the relation between the $Y$'s and the $V$'s gives the representation $Y_0 = V_0$, $Y_1 = V_1(1 - V_0)$, $\ldots$, $Y_m = V_m \prod_{s=0}^{m-1} (1 - V_s)$.
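Lemma A2 also gives a direct way to simulate a trajectory of the process. A minimal sketch, centering the process on a Poisson(2) cdf truncated at an assumed last group:

```python
# Sample one discrete Beta-Stacy trajectory via the V_j of Lemma A2.
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
j_max, c = 25, 3.0
H = poisson.cdf(np.arange(j_max + 1), 2.0)
H[-1] = 1.0                                    # close the last group
alpha = c * np.diff(np.concatenate([[0.0], H]))
beta = c * (1.0 - H)

V = np.array([rng.beta(a, b) if b > 0 else 1.0     # V at j_max = 1 ends the process
              for a, b in zip(alpha, beta)])
Y = V * np.concatenate([[1.0], np.cumprod(1.0 - V)[:-1]])
print(Y.sum())                                 # the jumps sum to 1
```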
Lemma A3. Let $x_1, \ldots, x_n$, with each $x_s \in \{0, 1, \ldots, j, \ldots\}$, be an iid sample from an unknown distribution function $F$ on $[0, \infty)$. If the prior for $F$ is chosen to be a discrete Beta-Stacy process with parameters $\alpha_j$, $\beta_j$ and jumps at $\{0, 1, \ldots, j, \ldots\}$, then:

(i) Given $x_1, \ldots, x_n$, the posterior for $F$ is also a discrete Beta-Stacy process, with parameters $\alpha_j^{*} = \alpha_j + r_j$ and $\beta_j^{*} = \beta_j + m_j$, where $m_j = \sum_{s > j} r_s$ and $r_j$ is the number of observations at the $j$th jump point.

(ii) The predictive probability $P(x_{n+1} = j \mid x_1, \ldots, x_n)$ is given by

$$E[F(j) - F(j - 1) \mid x_1, \ldots, x_n] = \frac{\alpha_j^{*}}{\alpha_j^{*} + \beta_j^{*}} \prod_{s=0}^{j-1} \frac{\beta_s^{*}}{\alpha_s^{*} + \beta_s^{*}}. \qquad (34)$$
References

Aitkin, M., 1991. Posterior Bayes factors. J. Roy. Statist. Soc. B 53, 111–142.
Albert, J.H., 1997. Bayesian testing and estimation of association in a two-way contingency table. J. Am. Statist. Assoc. 92, 685–693.
Berger, J.O., Guglielmi, A., 2001. Bayesian testing of a parametric model versus nonparametric alternatives. J. Am. Statist. Assoc. 96, 542–554.
Berger, J.O., Pericchi, L., 1996a. The intrinsic Bayes factor for model selection and prediction. J. Am. Statist. Assoc. 91, 109–122.
Berger, J.O., Pericchi, L., 1996b. The intrinsic Bayes factor for linear models. In: Bernardo, J.M., et al. (Eds.), Bayesian Statistics, vol. 5. Oxford University Press, Oxford, pp. 23–42.
Conigliani, C., Castro, J.I., O'Hagan, A., 2000. Bayesian assessment of goodness of fit against nonparametric alternatives. Canad. J. Statist. 28, 327–342.
Damien, P., Walker, S., 2002. A Bayesian non-parametric comparison of two treatments. Scand. J. Statist. 29, 51–56.
De Santis, F., Spezzaferri, F., 1997. Alternative Bayes factors for model selection. Canad. J. Statist. 25, 503–515.
De Santis, F., Spezzaferri, F., 1999. Methods for robust and default Bayesian model comparison: the fractional Bayes factor approach. Internat. Statist. Rev. 67, 267–286.
De Santis, F., Spezzaferri, F., 2001. Consistent fractional Bayes factor for nested normal linear models. J. Statist. Plann. Inference 97, 305–321.
Good, I.J., 1967. A Bayesian significance test for multinomial distributions. J. Roy. Statist. Soc. B 29, 399–431.
Hoaglin, D.C., Mosteller, F., Tukey, J.W., 1985. Exploring Data Tables, Trends, and Shapes. Wiley, New York.
Jeffreys, H., 1961. Theory of Probability. Oxford University Press, New York.
Le Cam, L.M., 1986. Asymptotic Methods in Statistical Decision Theory. Springer, New York.
Moreno, E., Bertolino, F., Racugno, W., 1998. An intrinsic limiting procedure for model selection and hypothesis testing. J. Am. Statist. Assoc. 93, 1451–1460.
O'Hagan, A., 1991. Discussion of Aitkin's "Posterior Bayes factors". J. Roy. Statist. Soc. B 53, 111–142.
O'Hagan, A., 1995. Fractional Bayes factors for model comparison (with discussion). J. Roy. Statist. Soc. B 57, 99–138.
O'Hagan, A., 1997. Properties of intrinsic and fractional Bayes factors. Test 6, 101–118.
Robert, C.P., Rousseau, J., 2003. A mixture approach to Bayesian goodness of fit. Preprint, Ceremade, Université Paris Dauphine.
Verdinelli, I., Wasserman, L., 1998. Bayesian goodness of fit testing using infinite dimensional exponential families. Ann. Statist. 26, 1215–1241.
Walker, S., Muliere, P., 1997. Beta-Stacy processes and a generalization of the Polya-urn scheme. Ann. Statist. 25, 1762–1780.