Robust Bayesian analysis with partially exchangeable priors

Journal of Statistical Planning and Inference 102 (2002) 99–107


Malay Ghosh a,∗, Dal Ho Kim b

a Department of Statistics, University of Florida, 103 Griffin-Floyd Hall, P.O. Box 118545, Gainesville, FL 32611-8545, USA
b Department of Statistics, Kyungpook National University, Taegu, 702-701, South Korea

Received 1 February 1998; received in revised form 1 September 1998

Research partially supported by NSF Grant Numbers SBR-9423996 and SBR-9810968.
∗ Corresponding author. E-mail address: [email protected] (M. Ghosh).

Abstract

In this paper, we consider a general Bayesian model which allows multiple grouping of parameters, where the components within a subgroup are exchangeable. The general idea is then illustrated for the normal means estimation problem under priors which are scale mixtures of normals. We also discuss implementation of the Bayes procedure via Markov chain Monte Carlo integration techniques, and we illustrate the proposed methods with a numerical example. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Partial exchangeability; Multiple shrinkage; Robust Bayesian; Scale mixtures of normals; Model averaging

1. Introduction

Often, in statistical practice, it is useful to combine results from several experiments or observational studies. One objective is to make reliable inferences for each component experiment (or study) by "borrowing strength" from other related experiments (or studies); this is standard practice in small area estimation (cf. Ghosh and Rao, 1994). A second objective is to combine inferential summaries from several studies into a single analysis. The latter, popularly known as "meta-analysis", has been discussed in Hedges and Olkin (1985) and DuMouchel (1990) from the frequentist and Bayesian perspectives, respectively. One simple Bayesian approach for the general problem, in the absence of covariates, is to use a prior that builds exchangeability among the component problems.




This allows the experimenter to "borrow strength" in a sensible way. An interesting discussion with examples appears in DuMouchel (1990); see also Morris and Normand (1992). Malec and Sedransk (1992) have pointed out the weakness of Bayesian models based solely on the exchangeability assumption. As an example, the population means are often clustered into two or more subgroups, as opposed to being clustered in a single group. Clearly, shrinking all the means towards a common weighted average is inappropriate in such cases, and a modified analysis which allows different shrinkage points is called for. A useful substitute for exchangeability in this situation is partial exchangeability, where the components within a subgroup are exchangeable, but the different subgroups are not.

Partial exchangeability is often dictated by the problem at hand. To cite an example (cf. Efron and Morris, 1973), for estimating the batting averages of baseball players, it is natural to cluster the players into two groups: (i) right-handed and (ii) left-handed. In many applications, however, a natural clustering of the parameters from prior considerations is not immediate; instead, an adaptive clustering dictated by the data seems more appropriate. Malec and Sedransk (1992) initiated such a study for the normal means estimation problem. Their method allowed partition of the parameters into different plausible clusters, and prior probabilities were assigned to the different partitions. For a given partition, within each cluster, a normal-normal hierarchical model built exchangeability among the component parameters, while parameters in different clusters were assigned independent priors. This idea was pursued further by Consonni and Veronese (1995), who considered combining results from several binomial experiments.

The objective of this paper is to outline, in Section 2, a general Bayesian model which allows multiple grouping of parameters. The basic idea is then illustrated with the aid of scale mixtures of normal priors for the normal means estimation problem. Priors which are scale mixtures of normals have, by construction, tails that are flatter than those of the normal, and are very suitable for robust Bayesian analysis. Also, this class of priors is sufficiently rich, since it includes the Student t family, the double exponential, the logistic, and the exponential power family of Box and Tiao (1973), among others. These priors are also very useful for the detection of outliers, as pointed out in Dawid (1973) and O'Hagan (1979, 1988). Section 2 also discusses implementation of the Bayes procedure via Markov chain Monte Carlo (MC²) integration techniques (cf. Geman and Geman, 1984; Gelfand and Smith, 1990). This is in contrast to the approximation of posterior pdfs as done, for example, in Consonni and Veronese (1995). Also, unlike Malec and Sedransk (1992), we do not have closed-form formulas for the posterior moments, so that numerical integration becomes a necessity. Finally, in this section, we obtain posterior means and variances by "model averaging" as in Malec and Sedransk (1992), Draper (1995) or Consonni and Veronese (1995). Section 3 contains a numerical example with multivariate t priors to illustrate the methods of Section 2.

As a consequence of multiple grouping, estimates of parameters within different groups are shrunk towards different points.


In this way, the present procedure can be aptly described as a "multiple shrinkage" procedure. It is, however, distinct from the multiple shrinkage of George (1986), who considered as the prior a weighted average of several normal distributions with different means. That prior results in shrinking each parameter towards a common weighted average of prior means, with weights governed by the prior weights as well as the data; unlike our procedure, George (1986) does not shrink the estimates towards different group means.

2. The Bayesian analysis

Let $Y_i$, given $\theta_i$ and $\lambda$, be independently distributed with pdfs $f(y_i \mid \theta_i, \lambda)$, $i = 1, \ldots, L$. We denote by $G$ the number of partitions of $\{1, \ldots, L\}$ that we want to consider in a given context; clearly $1 \le G \le 2^L - 1$. A typical partition is denoted by $g$, comprising $d(g)$ subsets $S_k(g)$ of sizes $p_k(g)$, $k = 1, \ldots, d(g)$. For example, if $L = 4$ and one considers only the partitions $g_1 = \{\{1,2\},\{3,4\}\}$ and $g_2 = \{\{1,2\},\{3\},\{4\}\}$, then $G = 2$; $d(g_1) = 2$ with $S_1(g_1) = \{1,2\}$, $S_2(g_1) = \{3,4\}$ and $p_1(g_1) = p_2(g_1) = 2$; similarly, $d(g_2) = 3$ with $S_1(g_2) = \{1,2\}$, $S_2(g_2) = \{3\}$, $S_3(g_2) = \{4\}$ and $p_1(g_2) = 2$, $p_2(g_2) = p_3(g_2) = 1$.

Write $\theta = (\theta_1, \ldots, \theta_L)$. We consider the following partially exchangeable prior for $(\theta, \lambda)$. For a given partition $g$, we denote by $\pi(\theta^{(S_k(g))})$ the joint prior of the $\theta_i$, $i \in S_k(g)$, where $\theta^{(S_k(g))}$ is the vector of the $\theta_i$ belonging to $S_k(g)$ with suffixes arranged in ascending order. It is assumed that the $\theta_i$ belonging to different groups under a given partition are independently distributed. Thus, corresponding to the partition $g$, $\theta$ has the joint prior $\pi_g(\theta) = \prod_{k=1}^{d(g)} \pi(\theta^{(S_k(g))})$. Also, using a prior $\psi(\lambda)$ for $\lambda$, independent of $\theta$, one arrives at the joint prior $\pi_g(\theta)\,\psi(\lambda)$ for $(\theta, \lambda)$ under the partition $g$. Finally, one assigns the prior probability $p(g)$ to the partition $g$, where $\sum_{g=1}^{G} p(g) = 1$.

Now, for a given partition $g$, one gets the posterior

$$\pi(\theta, \lambda \mid g, y) \propto \Bigl[\prod_{k=1}^{d(g)} \Bigl\{\prod_{i \in S_k(g)} f(y_i \mid \theta_i, \lambda)\Bigr\}\, \pi(\theta^{(S_k(g))})\Bigr]\, \psi(\lambda). \tag{2.1}$$
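To fix ideas, the following sketch (helper code of our own, not from the paper) encodes the $L = 4$ example above, storing each candidate partition as a list of index sets and recovering $d(g)$ and the block sizes $p_k(g)$.

```python
# A minimal sketch of the partition bookkeeping described above (our own
# helper code).  A partition g is stored as a list of blocks S_k(g);
# d(g) is its length and p_k(g) the size of each block.
g1 = [{1, 2}, {3, 4}]        # g1 = {{1,2},{3,4}}
g2 = [{1, 2}, {3}, {4}]      # g2 = {{1,2},{3},{4}}
partitions = [g1, g2]        # the G = 2 candidate partitions

for g in partitions:
    d_g = len(g)                      # d(g): number of blocks
    p_g = [len(S) for S in g]         # p_k(g), k = 1, ..., d(g)
    print(d_g, p_g)                   # prints 2 [2, 2] then 3 [2, 1, 1]
```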

If now $p(g \mid y)$ denotes the posterior probability of $g$ given $y$, the joint posterior of $\theta$ and $\lambda$ given $y$ is

$$\pi(\theta, \lambda \mid y) = \sum_{g=1}^{G} p(g \mid y)\, \pi(\theta, \lambda \mid g, y). \tag{2.2}$$

The posterior moments of the $\theta_i$ are now obtained as

$$E(\theta_i \mid y) = E[E\{\theta_i \mid g, y\} \mid y] = \sum_{g=1}^{G} p(g \mid y)\, E(\theta_i \mid g, y), \tag{2.3}$$

$$V(\theta_i \mid y) = E[V\{\theta_i \mid g, y\} \mid y] + V[E\{\theta_i \mid g, y\} \mid y] = \sum_{g=1}^{G} p(g \mid y)\, V(\theta_i \mid g, y) + \sum_{g=1}^{G} p(g \mid y)\, \{E(\theta_i \mid g, y)\}^2 - \{E(\theta_i \mid y)\}^2, \tag{2.4}$$

$$\mathrm{Cov}(\theta_i, \theta_j \mid y) = \sum_{g=1}^{G} p(g \mid y)\, \mathrm{Cov}(\theta_i, \theta_j \mid g, y) + \sum_{g=1}^{G} p(g \mid y)\, E(\theta_i \mid g, y)\, E(\theta_j \mid g, y) - E(\theta_i \mid y)\, E(\theta_j \mid y). \tag{2.5}$$
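As a minimal illustration of the model-averaging formulas (2.3) and (2.4), the sketch below combines per-partition posterior moments with the weights $p(g \mid y)$; the numerical inputs are placeholders standing in for the MCMC output discussed later in this section.

```python
import numpy as np

# Sketch of (2.3)-(2.4) for a single component theta_i.  The weights and
# per-partition moments below are placeholder values, not results from the paper.
p_g_y   = np.array([0.10, 0.60, 0.30])   # p(g | y), one weight per partition
E_given = np.array([0.90, 0.40, 0.70])   # E(theta_i | g, y)
V_given = np.array([0.80, 0.50, 0.60])   # V(theta_i | g, y)

post_mean = np.sum(p_g_y * E_given)                          # (2.3)
post_var = (np.sum(p_g_y * V_given)                          # E[V{theta_i | g, y} | y]
            + np.sum(p_g_y * E_given ** 2) - post_mean ** 2) # + V[E{theta_i | g, y} | y]
print(post_mean, post_var)
```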

Note that if $\theta_i$ and $\theta_j$ belong to different clusters under the partition $g$, then $\mathrm{Cov}(\theta_i, \theta_j \mid g, y) = 0$.

The above methodology is now illustrated with an example. Suppose $Y_i \mid \theta_i, r \stackrel{\mathrm{ind}}{\sim} N(\theta_i, r^{-1})$, $i = 1, \ldots, L$. For a given partition $g$, $\theta^{(S_k(g))}$ has a multivariate $t$ pdf with location parameter $\mu_k(g)$, scale parameter $\sigma_k(g)$ and degrees of freedom $\nu_k(g)$. Symbolically (see, e.g., Zellner, 1971 or Press, 1972),

$$\pi(\theta^{(S_k(g))}) \propto \Bigl[\nu_k(g) + \sum_{i \in S_k(g)} (\theta_i - \mu_k(g))^2/\sigma_k^2(g)\Bigr]^{-(\nu_k(g) + p_k(g))/2}, \tag{2.6}$$

where one may recall that $p_k(g)$ is the size of $S_k(g)$ under the partition $g$. For $r$ we assign the gamma pdf $\psi(r) \propto \exp(-(a/2)r)\, r^{b/2 - 1}$, with $a > 0$ and $b > 0$; this is referred to as a Gamma$(\tfrac{1}{2}a, \tfrac{1}{2}b)$ prior.

Direct evaluation of the joint posterior of $\theta$ given $y$ involves high-dimensional numerical integration and is analytically intractable. Instead, we adopt MC² numerical integration methods. For a given partition $g$, the MC² method requires finding the full conditionals for $\theta_i$ given $\theta_j$ ($j \ne i$), $r$, $g$ and $y$ ($i = 1, \ldots, L$), and for $r$ given $\theta$, $g$ and $y$. The calculations are greatly facilitated by parameter augmentation, as described below. We write the pdf given in (2.6) in two stages: conditional on a parameter $u_k(g)$, the $\theta_i$ ($i \in S_k(g)$) are iid $N(\mu_k(g), u_k^{-1}(g)\sigma_k^2(g))$, and $u_k(g)$ has the marginal Gamma$(\tfrac{1}{2}\nu_k(g), \tfrac{1}{2}\nu_k(g))$ pdf. The full conditionals are given as follows:

(i) $\theta_i\ (i \in S_k(g)) \mid y, u_1(g), \ldots, u_{d(g)}(g), \theta_j\ (j \ne i), r, g \ \stackrel{\mathrm{ind}}{\sim}\ N\!\Bigl(\dfrac{r y_i + \mu_k(g)\, u_k(g)/\sigma_k^2(g)}{r + u_k(g)/\sigma_k^2(g)},\ \dfrac{1}{r + u_k(g)/\sigma_k^2(g)}\Bigr)$;

(ii) $r \mid y, \theta, u_1(g), \ldots, u_{d(g)}(g), g \sim$ Gamma$\bigl(\tfrac{1}{2}\{a + \sum_{i=1}^{L}(y_i - \theta_i)^2\},\ \tfrac{1}{2}(L + b)\bigr)$;

(iii) $u_k(g) \mid y, \theta, r, g \sim$ Gamma$\bigl(\tfrac{1}{2}\{\nu_k(g) + \sum_{i \in S_k(g)}(\theta_i - \mu_k(g))^2/\sigma_k^2(g)\},\ \tfrac{1}{2}(p_k(g) + \nu_k(g))\bigr)$.

Also, the full conditional for the partition variable $g$ is given by

$$p(g \mid y, \theta, r, u_1(g), \ldots, u_{d(g)}(g)) = \frac{p(g) \prod_{k=1}^{d(g)} \sigma_k(g)^{-p_k(g)}\, u_k(g)^{\frac{1}{2}\{p_k(g)+\nu_k(g)\}-1} \exp\Bigl[-\dfrac{u_k(g)}{2}\Bigl\{\nu_k(g) + \sum_{i \in S_k(g)}(\theta_i - \mu_k(g))^2/\sigma_k^2(g)\Bigr\}\Bigr]}{\sum_{g'=1}^{G} p(g') \prod_{k=1}^{d(g')} \sigma_k(g')^{-p_k(g')}\, u_k(g')^{\frac{1}{2}\{p_k(g')+\nu_k(g')\}-1} \exp\Bigl[-\dfrac{u_k(g')}{2}\Bigl\{\nu_k(g') + \sum_{i \in S_k(g')}(\theta_i - \mu_k(g'))^2/\sigma_k^2(g')\Bigr\}\Bigr]}. \tag{2.7}$$
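A minimal sketch of how the full conditionals (i)-(iii) and the partition weight in (2.7) could be coded for a single fixed partition is given below. The function names, the gamma-parametrisation helper and the toy driver (which reuses the Section 3 data) are our own constructions under the assumptions stated in the comments, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_rate(shape, rate):
    """Draw from a gamma density proportional to exp(-rate*x) * x**(shape-1),
    matching the Gamma(rate, shape) convention used in the text."""
    return rng.gamma(shape, 1.0 / rate)

def gibbs_sweep(theta, r, u, y, g, mu, sig2, nu, a, b):
    """One sweep of the full conditionals (i)-(iii) for a fixed partition g,
    given as a list of 0-based index blocks S_k(g).  mu, sig2, nu hold
    mu_k(g), sigma_k^2(g), nu_k(g) for each block (our own data layout)."""
    L = len(y)
    # (i) theta_i | rest ~ N((r*y_i + mu_k*u_k/sig2_k)/(r + u_k/sig2_k),
    #                        1/(r + u_k/sig2_k))
    for k, S in enumerate(g):
        prec = r + u[k] / sig2[k]
        for i in S:
            mean = (r * y[i] + mu[k] * u[k] / sig2[k]) / prec
            theta[i] = rng.normal(mean, np.sqrt(1.0 / prec))
    # (ii) r | rest ~ Gamma(rate = (a + sum_i (y_i - theta_i)^2)/2, shape = (L + b)/2)
    r = gamma_rate((L + b) / 2.0, (a + np.sum((y - theta) ** 2)) / 2.0)
    # (iii) u_k | rest ~ Gamma(rate = (nu_k + sum_{i in S_k}(theta_i - mu_k)^2/sig2_k)/2,
    #                          shape = (p_k + nu_k)/2)
    for k, S in enumerate(g):
        ss = sum((theta[i] - mu[k]) ** 2 for i in S) / sig2[k]
        u[k] = gamma_rate((len(S) + nu[k]) / 2.0, (nu[k] + ss) / 2.0)
    return theta, r, u

def partition_weight(theta, u_g, g, mu, sig2, nu, prior_pg):
    """Unnormalised full-conditional weight of one partition, as in (2.7):
    p(g) * prod_k sigma_k^{-p_k} * u_k^{(p_k+nu_k)/2 - 1}
         * exp(-u_k/2 * (nu_k + sum_{i in S_k}(theta_i - mu_k)^2 / sig2_k))."""
    w = prior_pg
    for k, S in enumerate(g):
        ss = sum((theta[i] - mu[k]) ** 2 for i in S) / sig2[k]
        w *= (sig2[k] ** (-len(S) / 2.0)
              * u_g[k] ** ((len(S) + nu[k]) / 2.0 - 1.0)
              * np.exp(-0.5 * u_g[k] * (nu[k] + ss)))
    return w

# Toy driver: the Section 3 data, run under the single partition {{1,2},{3}}.
y = np.array([1.1, 1.2, 10.0])
g = [[0, 1], [2]]                                  # 0-based blocks
mu, sig2, nu = [0.0, 0.0], [1.0, 1.0], [2.0, 1.0]
theta, r, u = y.copy(), 1.0, [1.0, 1.0]
for _ in range(2000):
    theta, r, u = gibbs_sweep(theta, r, u, y, g, mu, sig2, nu, a=0.005, b=0.005)
print(theta, r, u, partition_weight(theta, u, g, mu, sig2, nu, prior_pg=0.25))
```

Sampling the partition indicator itself would normalise such weights across all $G$ candidate partitions, each carrying its own $u_k(g)$ variables, exactly as in (2.7).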

Inferences about $\theta$ will be based on the posterior moments (2.3)-(2.5). To compute the posterior pdf $p(\theta \mid g, y)$ of $\theta$ given $g$ and $y$, one typically first derives $p(\theta \mid g, y, u_k(g), r)$ and then $p(u_k(g), r \mid g, y)$. It is immediate to verify that for $i \in S_k(g)$,

$$\theta_i \mid g, y_i, u_k(g), r \ \stackrel{\mathrm{ind}}{\sim}\ N\!\Bigl(\frac{u_k(g)}{u_k(g)+r\sigma_k^2(g)}\,\mu_k(g) + \frac{r\sigma_k^2(g)}{u_k(g)+r\sigma_k^2(g)}\,y_i,\ \frac{r\sigma_k^2(g)}{u_k(g)+r\sigma_k^2(g)}\Bigr). \tag{2.8}$$

On the other hand, the distribution of $u_k(g)$ and $r$ given $g$ and $y$ is

$$p(u_k(g), r \mid g, y) \propto u_k(g)^{(\nu_k(g)/2)-1}\, r^{(b/2)-1} \prod_{i \in S_k(g)} \bigl(r^{-1} + u_k^{-1}(g)\sigma_k^2(g)\bigr)^{-1/2} \exp\Bigl[-\frac{1}{2}\Bigl\{a r + \nu_k(g)\, u_k(g) + \sum_{i \in S_k(g)} \frac{(y_i - \mu_k(g))^2}{r^{-1} + u_k^{-1}(g)\sigma_k^2(g)}\Bigr\}\Bigr]. \tag{2.9}$$

Hence, conditional on the partition $g$, it can be shown that for $i \in S_k(g)$,

$$E(\theta_i \mid g, y) = E[B(u_k(g), r) \mid y, g]\,\mu_k(g) + \bigl(1 - E[B(u_k(g), r) \mid y, g]\bigr)\, y_i = y_i - (y_i - \mu_k(g))\, E[B(u_k(g), r) \mid y, g], \tag{2.10}$$

where $B(u_k(g), r) = u_k(g)/(u_k(g) + r\sigma_k^2(g))$. Also, we obtain, for $i \in S_k(g)$,

$$V(\theta_i \mid g, y) = 1 - E[B(u_k(g), r) \mid y, g] + (y_i - \mu_k(g))^2\, V\{B(u_k(g), r) \mid y, g\}. \tag{2.11}$$

Finally, for $i, j \in S_k(g)$, $i \ne j$,

$$\mathrm{Cov}(\theta_i, \theta_j \mid g, y) = (y_i - \mu_k(g))(y_j - \mu_k(g))\, V\{B(u_k(g), r) \mid y, g\}. \tag{2.12}$$
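Given Monte Carlo draws of $(u_k(g), r)$ for a fixed partition, the conditional moments (2.10)-(2.11) reduce to moments of the shrinkage factor $B(u_k(g), r)$. A small sketch with placeholder draws and our own variable names:

```python
import numpy as np

# Sketch of (2.10)-(2.11) from draws of the shrinkage factor
# B(u_k(g), r) = u_k / (u_k + r * sigma_k^2).  The draws are placeholders;
# in practice they would be Gibbs output for the block S_k(g) containing i.
u_draws = np.array([0.8, 1.1, 0.9, 1.3])
r_draws = np.array([1.0, 0.7, 1.2, 0.9])
sig2_k, mu_k, y_i = 1.0, 0.0, 1.1

B = u_draws / (u_draws + r_draws * sig2_k)
E_B, V_B = B.mean(), B.var()

E_theta_i = y_i - (y_i - mu_k) * E_B                    # (2.10)
V_theta_i = 1.0 - E_B + (y_i - mu_k) ** 2 * V_B         # (2.11)
print(E_theta_i, V_theta_i)
```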

For implementing the Gibbs sampler, Gelman and Rubin (1992) recommended running $m\ (\ge 2)$ parallel chains, each for $2d$ iterations, with starting points drawn from an overdispersed distribution. To diminish the effect of the starting distribution, the first $d$ iterations of each chain are discarded. Hence, after $d$ iterations, we retain all of the subsequent iterates, both for finding the posterior moments given in (2.3)-(2.5) and for monitoring the convergence of the Gibbs sampler.
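A sketch of that chain layout, together with a basic between/within-chain convergence check, is given below; run_chains and gelman_rubin are our own names, and the scale-reduction statistic shown is the standard one rather than code from the paper.

```python
import numpy as np

def run_chains(one_sweep, initial_state, m=10, d=2000):
    """Run m parallel chains for 2d sweeps each and keep only the last d
    draws of a scalar summary.  one_sweep(state) returns the next state;
    initial_state(chain) supplies overdispersed starting points."""
    kept = np.empty((m, d))
    for l in range(m):
        state = initial_state(l)
        for t in range(2 * d):
            state = one_sweep(state)
            if t >= d:                       # discard the burn-in half
                kept[l, t - d] = state
    return kept

def gelman_rubin(chains):
    """Potential scale reduction factor for an (m, d) array of retained draws."""
    m, d = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    B = d * means.var(ddof=1)                # between-chain variance
    var_hat = (d - 1) / d * W + B / d
    return np.sqrt(var_hat / W)

# toy usage: a simple autoregressive "sweep" just to exercise the helpers
rng = np.random.default_rng(1)
draws = run_chains(lambda s: 0.5 * s + rng.normal(),
                   lambda l: 10.0 * (l + 1), m=4, d=500)
print(gelman_rubin(draws))
```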


To estimate the posterior moments from the Gibbs output, we use the Rao-Blackwellized estimates as in Gelfand and Smith (1991). Let $\theta_{lt}$, $r_{lt}$ and $u_{klt}(g)$ denote the values of $\theta$, $r$ and $u_k(g)$ retained at iteration $t$ ($t = d+1, \ldots, 2d$) of chain $l$ ($l = 1, \ldots, m$), and write

$$\bar B_k(g) = (md)^{-1}\sum_{l=1}^{m}\sum_{t=d+1}^{2d} B(u_{klt}(g), r_{lt}), \qquad \overline{B^2}_k(g) = (md)^{-1}\sum_{l=1}^{m}\sum_{t=d+1}^{2d} B^2(u_{klt}(g), r_{lt}).$$

Also let $\hat p(g \mid y)$ denote the Gibbs approximation to $p(g \mid y)$ given in (2.16) below and, for each partition $g$, let $k$ be the index of the block $S_k(g)$ containing the component under consideration. Then, using (2.10), $E(\theta_i \mid y)$ is approximated by

$$\sum_{g=1}^{G} \bigl[y_i - (y_i - \mu_k(g))\,\bar B_k(g)\bigr]\,\hat p(g \mid y). \tag{2.13}$$

Similarly, one uses (2.10) and (2.11) to approximate $V(\theta_i \mid y)$ by

$$\sum_{g=1}^{G}\bigl[1 - \bar B_k(g) + (y_i - \mu_k(g))^2\bigl\{\overline{B^2}_k(g) - \bar B_k(g)^2\bigr\}\bigr]\,\hat p(g \mid y) + \sum_{g=1}^{G}\bigl[y_i - (y_i - \mu_k(g))\,\bar B_k(g)\bigr]^2\,\hat p(g \mid y) - \{E(\theta_i \mid y)\}^2. \tag{2.14}$$

Also, using (2.10) and (2.12), $\mathrm{Cov}(\theta_i, \theta_j \mid y)$ is approximated by

$$\sum_{g=1}^{G}(y_i - \mu_k(g))(y_j - \mu_k(g))\bigl\{\overline{B^2}_k(g) - \bar B_k(g)^2\bigr\}\,\hat p(g \mid y) + \sum_{g=1}^{G}\bigl[y_i - (y_i - \mu_k(g))\,\bar B_k(g)\bigr]\bigl[y_j - (y_j - \mu_k(g))\,\bar B_k(g)\bigr]\,\hat p(g \mid y) - E(\theta_i \mid y)\, E(\theta_j \mid y). \tag{2.15}$$

Finally, note that, using the Gibbs sampler, one uses (2.7) to approximate $p(g \mid y)$ by

$$\hat p(g \mid y) = (md)^{-1}\sum_{l=1}^{m}\sum_{t=d+1}^{2d} p\bigl(g \mid y, \theta = \theta_{lt}, r = r_{lt}, u_1(g) = u_{1lt}(g), \ldots, u_{d(g)}(g) = u_{d(g)lt}(g)\bigr). \tag{2.16}$$
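As a sketch of (2.16), the posterior probability of each partition can be approximated by normalising the full-conditional weights (2.7) within each retained draw and then averaging over draws; the helper below assumes those unnormalised weights have already been computed (for instance with the hypothetical partition_weight function sketched earlier).

```python
import numpy as np

def estimate_p_g_y(raw_weights):
    """Approximate p(g | y) as in (2.16).  raw_weights has shape
    (n_draws, G): the unnormalised weights from (2.7), one row per retained
    Gibbs draw.  Each row is normalised to p(g | y, theta, r, u) and the
    rows are then averaged over the m*d retained draws."""
    w = np.asarray(raw_weights, dtype=float)
    cond = w / w.sum(axis=1, keepdims=True)
    return cond.mean(axis=0)

# toy usage with G = 4 partitions and three retained draws
print(estimate_p_g_y([[0.1, 0.5, 0.2, 0.2],
                      [0.2, 0.4, 0.2, 0.2],
                      [0.1, 0.6, 0.1, 0.2]]))
```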

3. A numerical example

We illustrate the methods of the previous section with a numerical example. Suppose the given data are $y_1 = 1.1$, $y_2 = 1.2$ and $y_3 = 10.0$, so that $L = 3$. Suppose the four partitions of interest are $g_1 = \{\{1\},\{2\},\{3\}\}$, $g_2 = \{\{1,2\},\{3\}\}$, $g_3 = \{\{1\},\{2,3\}\}$ and $g_4 = \{\{1,2,3\}\}$. To maintain neutrality, we assign equal prior probabilities $\tfrac{1}{4}$ to these partitions. The choice of these partitions is consistent with the recommendation of Consonni and Veronese (1995), who advocate always retaining the two extreme partitions $g_1$ and $g_4$, namely those corresponding to independence and exchangeability. The data at hand suggest $g_2$ as the best possible partition, but $g_3$ is also considered in order to examine the effect of a seemingly wrong partition. As we shall see in this section, the posterior probability of $g_2$ is not overwhelmingly larger than the posterior probabilities of the other partitions, so that model averaging still seems reasonable.

In deriving the posterior moments of $\theta_i$ ($i = 1, 2, 3$) given $y$, as well as the posterior probabilities $p(g \mid y)$, we have used $\mu_k(g) = 0$ throughout, and $\sigma_k^2(g) = \sigma^2 = 0.5$, 1.0, 5.0 and 10.0. Also, to avoid fully exchangeable priors, we have chosen different degrees of freedom for the $t$-priors of different subgroups. In particular, we have taken $\nu_1(g_1) = 2$, $\nu_2(g_1) = 1$, $\nu_3(g_1) = 3$; $\nu_1(g_2) = 2$, $\nu_2(g_2) = 1$; $\nu_1(g_3) = 1$, $\nu_2(g_3) = 2$; and $\nu_1(g_4) = 3$. Clearly, the choice of degrees of freedom is ad hoc, but it will illustrate the main points that we are going to make. To implement and monitor the convergence of the Gibbs sampler, we follow the basic approach of Gelman and Rubin (1992): we consider 10 independent sequences, each with a retained sample of size 2000 after a burn-in sample of another 2000. We have used $a = b = 0.005$ as the parameters of the gamma prior for $r$. This configuration is summarised in the sketch below.

Table 1 illustrates that, although $g_2$ is the clear winner in all cases, consistent with the data, the other partitions also have non-negligible posterior probabilities. Second, we notice that the larger the value of $\sigma^2$, the better we are able to identify the correct partition, namely $g_2$. This phenomenon becomes apparent from (2.8) once we observe that for large $\sigma^2$, the full conditional for the $\theta_i$ becomes approximately $N(y_i, 1)$ ($i = 1, 2, 3$). The wide discrepancy between $(y_1, y_2)$ and $y_3$ is then reflected in the generation of the $\theta_i$ ($i = 1, 2, 3$) from (2.8), and in turn affects the calculation of the partition probabilities as given in (2.7).

Table 2 provides the posterior means and standard errors of the $\theta_i$ via the model-averaging formulas (2.3) and (2.4).
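For concreteness, the setup just described can be collected as follows (0-based indices and the container names are ours; this is a configuration sketch, not the authors' code).

```python
import numpy as np

y = np.array([1.1, 1.2, 10.0])                   # observed data, L = 3

# the four candidate partitions, each with prior probability 1/4
partitions = [[[0], [1], [2]],                   # g1: all components separate
              [[0, 1], [2]],                     # g2
              [[0], [1, 2]],                     # g3
              [[0, 1, 2]]]                       # g4: full exchangeability
p_g = [0.25, 0.25, 0.25, 0.25]

mu_k = 0.0                                       # mu_k(g) = 0 throughout
sig2_values = [0.5, 1.0, 5.0, 10.0]              # sigma^2 settings examined
nu = [[2, 1, 3], [2, 1], [1, 2], [3]]            # nu_k(g) per block of g1,...,g4
a = b = 0.005                                    # gamma prior parameters for r
m, d = 10, 2000                                  # 10 chains; burn-in 2000, keep 2000
```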


Table 1
Posterior probabilities p(g | y) for the selected collection of partitions

Partition          g    σ² = 0.5   σ² = 1.0   σ² = 5.0   σ² = 10.0
{{1},{2},{3}}      1    0.068      0.057      0.038      0.043
{{1,2},{3}}        2    0.347      0.409      0.582      0.596
{{1},{2,3}}        3    0.285      0.246      0.186      0.195
{1,2,3}            4    0.300      0.288      0.194      0.166

Table 2
Data, estimates E(θ_i | y), and standard errors (in parentheses)

i    y_i     σ² = 0.5        σ² = 1.0        σ² = 5.0        σ² = 10.0
1    1.1     0.290 (0.648)   0.430 (0.763)   0.762 (0.918)   0.852 (0.941)
2    1.2     0.335 (0.695)   0.490 (0.806)   0.846 (0.939)   0.940 (0.956)
3    10.0    3.772 (4.500)   5.017 (4.579)   7.758 (3.562)   8.371 (3.020)

Once again, the discrepancy between $(y_1, y_2)$ and $y_3$ is reflected in the posterior means of the $\theta_i$; that is, $E(\theta_3 \mid y)$ is much larger than $E(\theta_1 \mid y)$ and $E(\theta_2 \mid y)$, as suggested by the data. Second, the larger the value of $\sigma^2$, the closer the posterior mean $E(\theta_i \mid y)$ is to $y_i$ ($i = 1, 2, 3$), as one may anticipate from (2.8).

References

Box, G.E.P., Tiao, G.C., 1973. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, MA.
Consonni, G., Veronese, P., 1995. A Bayesian method for combining results from several binomial experiments. J. Amer. Statist. Assoc. 90, 935-944.
Dawid, A.P., 1973. Posterior expectations for large observations. Biometrika 60, 664-667.
Draper, D., 1995. Assessment and propagation of model uncertainty (with discussion). J. Roy. Statist. Soc. Ser. B 57, 45-97.
DuMouchel, W., 1990. Bayesian metaanalysis. In: Berry, D.A. (Ed.), Statistical Methodology in the Pharmaceutical Sciences. Dekker, New York, pp. 509-529.
Efron, B., Morris, C., 1973. Combining possibly related estimation problems. J. Roy. Statist. Soc. Ser. B 35, 379-421.
Gelfand, A.E., Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398-409.
Gelfand, A.E., Smith, A.F.M., 1991. Gibbs sampling for marginal posterior expectations. Comm. Statist. Theory Methods 20, 1747-1766.
Gelman, A., Rubin, D.B., 1992. Inference from iterative simulation using multiple sequences (with discussion). Statist. Sci. 7, 457-511.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.
George, E.I., 1986. Combining minimax shrinkage estimators. J. Amer. Statist. Assoc. 81, 437-445.
Ghosh, M., Rao, J.N.K., 1994. Small area estimation: an appraisal (with discussion). Statist. Sci. 9, 55-93.


Hedges, L.V., Olkin, I., 1985. Statistical Methods for Meta-Analysis. Academic Press, Orlando.
Malec, D., Sedransk, J., 1992. Bayesian methodology for combining the results from different experiments when the specifications for pooling are uncertain. Biometrika 79, 593-601.
Morris, C.N., Normand, S.L., 1992. Hierarchical methods for combining information and for meta-analysis. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics, Vol. 4. Oxford Science Publications, Oxford, pp. 321-335.
O'Hagan, A., 1979. On outlier rejection phenomena in Bayes inference. J. Roy. Statist. Soc. Ser. B 41, 358-367.
O'Hagan, A., 1988. Modeling with heavy tails. In: Bernardo, J.M., DeGroot, M.H., Lindley, D.V., Smith, A.F.M. (Eds.), Bayesian Statistics, Vol. 3. Oxford University Press, Oxford, pp. 349-359.
Press, S.J., 1972. Applied Multivariate Analysis. Holt, Rinehart and Winston, New York.
Zellner, A., 1971. An Introduction to Bayesian Inference in Econometrics. Wiley, New York.