Computational Statistics & Data Analysis 29 (1999) 411–427
Estimation of posterior density functions from a posterior sample

Man-Suk Oh
Statistics Department, Ewha Womans University, Seoul 120-750, South Korea

Received 1 December 1997; received in revised form 1 June 1998
Abstract

The joint posterior density function of parameters and marginal posterior density functions of subsets of parameters are key quantities in Bayesian inference. Even when the posterior densities are unknown, there are many cases where Markov chain Monte Carlo methods can generate samples from the joint posterior distribution. This paper proposes a simple and efficient method of estimating the posterior density functions at various points simultaneously by using a posterior sample. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Bayesian analysis; Markov Chain Monte Carlo; Simulation; Density estimation
1. Introduction

For a vector of parameters $\theta$, the posterior density function of $\theta$ at $\theta = \theta^*$ is given by

$$\pi(\theta^* \mid x) = \frac{l(\theta^* \mid x)\,\pi(\theta^*)}{\int l(\theta \mid x)\,\pi(\theta)\, d\theta},$$

where $l(\theta \mid x)$ is the likelihood function of $\theta$ given data $x$ and $\pi(\theta)$ is a prior density function of $\theta$. The likelihood and the prior are assumed to be given in closed-form expressions. Let $\theta = (\theta_1, \ldots, \theta_B)$, where $\theta_i$ is the $i$th block of $\theta$, which can be a single element or a vector of elements. Let $\theta_{(j)} = (\theta_1, \ldots, \theta_j)$ and $\theta_{-(j)} = (\theta_{j+1}, \ldots, \theta_B)$.
This work was supported by KOSEF 981-0105-025-1, Korea.
Then the marginal posterior density function of $\theta_{(j)}$ at $\theta_{(j)}^*$ is

$$\pi(\theta_{(j)}^* \mid x) = \frac{\int l(\theta_{(j)}^*, \theta_{-(j)} \mid x)\,\pi(\theta_{(j)}^*, \theta_{-(j)})\, d\theta_{-(j)}}{\int l(\theta \mid x)\,\pi(\theta)\, d\theta}.$$

The joint posterior density function of $\theta$ is a special case of the above expression with $j = B$. If we let $m(x)$ denote the marginal likelihood function of $x$, then from the relation

$$m(x) = \frac{l(\theta^* \mid x)\,\pi(\theta^*)}{\pi(\theta^* \mid x)} \tag{1}$$
for an arbitrary $\theta^*$, the marginal likelihood function can be estimated from the joint posterior density at a single point of $\theta$. Obviously, the joint and the marginal posterior density functions play an important role in Bayesian inference. In addition, the marginal likelihood obtained from the joint posterior density is a key quantity in many areas of Bayesian inference, such as hypothesis testing and model selection. However, in many practical cases the joint and/or the marginal posterior densities are not analytically tractable because of the complexity of the likelihood and the prior.

Even though the posterior density function is unknown, in many situations a sample from the joint posterior distribution can be generated by Markov chain Monte Carlo (MCMC) methods such as the Gibbs sampler (Gelfand and Smith, 1990), the Metropolis–Hastings algorithm (Hastings, 1970), the hit-and-run algorithm (Chen and Schmeiser, 1993), and other hybrid MCMC methods (Mueller, 1991). Thus, one may be able to use the posterior sample to estimate the posterior densities. A simple way of doing this is the Rao–Blackwellization proposed by Gelfand and Smith (1990), which estimates $\pi(\theta_{(j)}^* \mid x)$ by

$$\hat{\pi}_g(\theta_{(j)}^* \mid x) = \frac{1}{n} \sum_{i=1}^{n} \pi(\theta_{(j)}^* \mid \theta_{-(j)i}, x), \tag{2}$$
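To make the Rao–Blackwellization estimate (2) concrete, the following sketch estimates the marginal density of one component of a toy bivariate normal "posterior" from Gibbs output. The correlation value and sample size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 20000          # illustrative correlation and sample size
s = np.sqrt(1 - rho**2)      # conditional standard deviation

# Gibbs sampler for a toy "posterior": (theta1, theta2) ~ N(0, [[1, rho], [rho, 1]])
theta1 = theta2 = 0.0
draws = np.empty((n, 2))
for i in range(n):
    theta1 = rng.normal(rho * theta2, s)   # theta1 | theta2
    theta2 = rng.normal(rho * theta1, s)   # theta2 | theta1
    draws[i] = theta1, theta2

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def marginal_hat(t):
    """Estimate (2): average the known full conditional pi(theta1* | theta2_i)
    over the posterior draws of theta2."""
    return normal_pdf(t, rho * draws[:, 1], s).mean()
```

Since the true marginal of $\theta_1$ here is $N(0,1)$, `marginal_hat(0.0)` should be close to $1/\sqrt{2\pi} \approx 0.399$.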
where $\pi(\theta_{(j)}^* \mid \theta_{-(j)}, x)$ is the full conditional posterior density function of $\theta_{(j)}^*$ given $\theta_{-(j)}$, and $\{\theta_i = (\theta_{(j)i}, \theta_{-(j)i});\ i = 1, \ldots, n\}$ is a sample of $\theta$ from the joint posterior distribution. The method is simple and efficient, but its applicability is limited because it requires the full conditional density function of $\theta_{(j)}$, which may not be available, especially when $\theta_{(j)}$ is high dimensional. Moreover, the method cannot be used for estimation of the joint posterior density function.

Chen (1994) proposed the importance-weighted method for estimating the marginal posterior densities from a posterior sample. As an estimate of $\pi(\theta_{(j)}^* \mid x)$, he suggested

$$\hat{\pi}_c(\theta_{(j)}^* \mid x) = \frac{1}{n} \sum_{i=1}^{n} w(\theta_{(j)i} \mid \theta_{-(j)i})\, \frac{\pi(\theta_{(j)}^*, \theta_{-(j)i} \mid x)}{\pi(\theta_i \mid x)},$$

where $w(\theta_{(j)} \mid \theta_{-(j)})$, called the weighting conditional density, is a conditional density function of $\theta_{(j)}$ given $\theta_{-(j)}$. The density $w$ need not equal the true conditional posterior density of $\theta_{(j)}$, but for efficiency a density close to the true conditional posterior density is optimal for $w$. The ratio of posterior density functions $\pi(\theta_{(j)}^*, \theta_{-(j)} \mid x)/\pi(\theta \mid x)$ is known because it does not require the normalizing constant. The main advantage of Chen's method is that neither the joint nor the conditional densities need be known: once the weighting conditional density $w$ is chosen and a posterior sample is available, the method can be routinely implemented. However, a disadvantage is that it requires a careful selection of $w$ for efficiency, and in some cases there may be no good guidelines for selecting $w$.

In many situations, the full conditional density functions of each block of $\theta$ are given in closed form, and a random sample from the full conditional distributions can be generated easily. Note that a block of $\theta$ can be a single element or a vector of elements. Under these conditions on the full conditional distributions, Chib (1995) proposed to estimate the joint posterior density function at $\theta^*$ by

$$\hat{\pi}_h(\theta^* \mid x) = \pi(\theta_1^* \mid \theta_2^*, \ldots, \theta_B^*, x)\, \hat{\pi}_h(\theta_2^* \mid \theta_3^*, \ldots, \theta_B^*, x) \cdots \hat{\pi}_h(\theta_B^* \mid x),$$

where

$$\hat{\pi}_h(\theta_j^* \mid \theta_{j+1}^*, \ldots, \theta_B^*, x) = \frac{1}{n} \sum_{i=1}^{n} \pi(\theta_j^* \mid \theta_{1i}^{(j)}, \ldots, \theta_{j-1,i}^{(j)}, \theta_{j+1}^*, \ldots, \theta_B^*, x)$$

for $j = 2, \ldots, B$, and $\{(\theta_{1i}^{(j)}, \ldots, \theta_{ji}^{(j)});\ i = 1, \ldots, n\}$ is a sample from the conditional posterior distribution of $(\theta_1, \ldots, \theta_j)$ given $(\theta_{j+1}^*, \ldots, \theta_B^*)$. Specifically, $\{(\theta_{1i}^{(j)}, \ldots, \theta_{ji}^{(j)});\ i = 1, \ldots, n\}$ is generated by the Gibbs sampler using the conditional densities $\pi(\theta_1 \mid \theta_2, \ldots, \theta_j, \theta_{-(j)}^*),\ \pi(\theta_2 \mid \theta_1, \theta_3, \ldots, \theta_j, \theta_{-(j)}^*),\ \ldots,\ \pi(\theta_j \mid \theta_1, \ldots, \theta_{j-1}, \theta_{-(j)}^*)$. Note that a new set of samples, which depends on $(\theta_{j+1}^*, \ldots, \theta_B^*)$, is required to estimate each $\pi_h(\theta_j^* \mid \theta_{j+1}^*, \ldots, \theta_B^*, x)$. With the estimate $\hat{\pi}_h(\theta^* \mid x)$ for a given $\theta^*$, one can estimate the marginal likelihood from relation (1).

Chib's method is very simple and requires only the full conditional density functions of each block of $\theta$. However, it requires sets of posterior samples from repeated runs of the Gibbs sampler to estimate the posterior density function at a single point. For the marginal likelihood, computing the joint posterior density at one point is sufficient; but one usually wants to compute other marginal posterior density functions at various points, and hence the method can be very inefficient for estimating the marginal posterior density functions.

In this paper, a simple and efficient method of estimating the joint and the marginal posterior density functions from a joint posterior sample is proposed. The rest of the paper is organized as follows. Section 2 describes the method and compares it with other methods. Two examples are given in Section 3. Summary and conclusions are given in Section 4.

2. The method

For simplicity, the data $x$ will be dropped from here on in expressions of the posterior densities unless it is needed to avoid confusion, and $\{\theta_i = (\theta_{1i}, \ldots, \theta_{Bi});\ i = 1, \ldots, n\}$ denotes a sample from the joint posterior distribution.
2.1. When all the full conditionals are known

Consider the marginal posterior density function of a block, for example $\theta_1$. From the equation

$$\pi(\theta_1^*) = \int \pi(\theta_1^* \mid \theta_{-(1)})\,\pi(\theta_{-(1)})\, d\theta_{-(1)} = E[\pi(\theta_1^* \mid \theta_{-(1)})],$$

an estimate of $\pi(\theta_1^*)$ is given by

$$\hat{\pi}(\theta_1^*) = \frac{1}{n} \sum_{i=1}^{n} \pi(\theta_1^* \mid \theta_{-(1)i}).$$
This is equivalent to the Rao–Blackwellization estimate (2). Next, consider the marginal posterior density function of two blocks, for example $\theta_{(2)} = (\theta_1, \theta_2)$. The following equation can be derived:

$$
\begin{aligned}
\pi(\theta_{(2)}^*) &= \int \pi(\theta_{(2)}^*, \theta_{-(2)})\, d\theta_{-(2)} \\
&= \int \pi(\theta_1^* \mid \theta_2^*, \theta_{-(2)})\,\pi(\theta_2^*, \theta_{-(2)})\, d\theta_{-(2)} \\
&= \int\!\!\int \pi(\theta_1^* \mid \theta_2^*, \theta_{-(2)})\,\pi(\theta_1, \theta_2^*, \theta_{-(2)})\, d\theta_1\, d\theta_{-(2)} \\
&= \int\!\!\int \pi(\theta_1^* \mid \theta_2^*, \theta_{-(2)})\,\pi(\theta_2^* \mid \theta_1, \theta_{-(2)})\,\pi(\theta_1, \theta_{-(2)})\, d\theta_1\, d\theta_{-(2)} \\
&= E[\pi(\theta_1^* \mid \theta_2^*, \theta_{-(2)})\,\pi(\theta_2^* \mid \theta_1, \theta_{-(2)})].
\end{aligned}
$$

Therefore, $\pi(\theta_{(2)}^*)$ can be estimated by

$$\hat{\pi}(\theta_{(2)}^*) = \frac{1}{n} \sum_{i=1}^{n} \pi(\theta_1^* \mid \theta_2^*, \theta_{-(2)i})\,\pi(\theta_2^* \mid \theta_{1i}, \theta_{-(2)i}).$$
In general, the marginal posterior density function $\pi(\theta_{(j)}^*)$, for $1 \le j \le B$, can be expressed as

$$\pi(\theta_{(j)}^*) = E\!\left[\prod_{k=1}^{j} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)})\right], \tag{3}$$

where the expectation is taken over $(\theta_{(j-1)}, \theta_{-(j)}) = (\theta_1, \ldots, \theta_{j-1}, \theta_{j+1}, \ldots, \theta_B)$. Thus, an appropriate estimate of it is given by

$$\hat{\pi}(\theta_{(j)}^*) = \frac{1}{n} \sum_{i=1}^{n} \prod_{k=1}^{j} \pi(\theta_k^* \mid \theta_{(k-1)i}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)i}). \tag{4}$$
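To make estimator (4) concrete, here is a sketch for the two-block marginal of a toy trivariate normal "posterior", where every full conditional is Gaussian and available in closed form. The covariance matrix and the use of i.i.d. draws in place of MCMC output are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.5
S = np.array([[1.0, rho, rho],
              [rho, 1.0, rho],
              [rho, rho, 1.0]])    # toy "posterior" covariance (assumed)
n = 20000
draws = rng.multivariate_normal(np.zeros(3), S, size=n)  # i.i.d. stand-in for MCMC draws

def cond(k, x, vals):
    """Gaussian full conditional density of component k at x, given the other
    two components (their values read from the rows of `vals`)."""
    others = [j for j in range(3) if j != k]
    S22 = S[np.ix_(others, others)]
    S12 = S[k, others]
    w = np.linalg.solve(S22, S12)
    mu = vals[:, others] @ w              # conditional mean for each draw
    sd = np.sqrt(S[k, k] - S12 @ w)       # conditional s.d. (constant)
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def marginal2_hat(t1, t2):
    """Estimate (4) for pi(theta1*, theta2*): average the product
    pi(theta1* | theta2*, theta3_i) * pi(theta2* | theta1_i, theta3_i)."""
    z1 = draws.copy()
    z1[:, 1] = t2                          # plug in the starred theta2*
    return (cond(0, t1, z1) * cond(1, t2, draws)).mean()
```

The true bivariate marginal of $(\theta_1, \theta_2)$ is $N(0, S_{(2)})$ with $\det S_{(2)} = 1 - \rho^2$, so `marginal2_hat(0.0, 0.0)` should be near $1/(2\pi\sqrt{0.75}) \approx 0.184$.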
Since one can permute the blocks of $\theta$, any marginal posterior density function can be estimated from the above equation.

A key feature of estimate (4) is that it replaces the unknown conditional density function of $\theta_{(j)} = (\theta_1, \ldots, \theta_j)$ by the product of known full conditionals of $\theta_k$, $k = 1, \ldots, j$, but in exchange takes the expectation over a larger set of variables $(\theta_{(j-1)}, \theta_{-(j)})$, compared with the Rao–Blackwellization method (2). This is similar to the idea of conditional Monte Carlo described in Rubinstein (1981): by including extra variables the problem becomes feasible and, in spite of the extra variability, the gain in computational convenience and efficiency can be great. Whereas Chib (1995)'s method expresses the marginal density as a product of posterior expectations of full conditional densities and estimates each expectation from a different posterior sample, the proposed method expresses the marginal density as a single posterior expectation of the product of full conditional density functions and estimates that expectation from one posterior sample. Because the proposed method requires only a single set of joint posterior samples to estimate any marginal density function at any point, it can estimate all the desired marginal density functions at various points simultaneously. This makes the proposed method very efficient, particularly when generation of the posterior sample is expensive.

2.2. When some of the full conditionals are unknown

In some cases, the full conditional posterior densities of blocks of $\theta$ are unknown or hard to evaluate. In such cases, one may implement the importance-weighted method of Chen (1994) for the unknown conditionals. In the previous subsection, only the full conditionals of $\theta_k$, $k = 1, \ldots, j$, are required to estimate the marginal density function of $\theta_{(j)}$.
Therefore, for a given $j$, if all the full conditionals of $\theta_k$, $k = 1, \ldots, j$, are known, then the marginal density $\pi(\theta_{(j)}^*)$ can be estimated as in the previous subsection. When some of the full conditionals of $\theta_k$, $k = 1, \ldots, j$, are unknown, let the blocks having the unknown full conditionals constitute the last block $\theta_j$ and adjust $j$ accordingly, without loss of generality. Then, from Eq. (3) and derivations in Chen (1994),

$$
\begin{aligned}
\pi(\theta_{(j)}^*) &= E\!\left[\prod_{k=1}^{j} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)})\right] \\
&= \int\!\!\int \prod_{k=1}^{j} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)})\,\pi(\theta_{(j-1)}, \theta_{-(j)})\, d\theta_{(j-1)}\, d\theta_{-(j)} \\
&= \int\!\!\int \prod_{k=1}^{j-1} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)})\,\pi(\theta_j^* \mid \theta_{(j-1)}, \theta_{-(j)})\,\pi(\theta_{(j-1)}, \theta_{-(j)})\, d\theta_{(j-1)}\, d\theta_{-(j)} \\
&= \int\!\!\int \prod_{k=1}^{j-1} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)})\,\pi(\theta_{(j-1)}, \theta_j^*, \theta_{-(j)})\, d\theta_{(j-1)}\, d\theta_{-(j)} \\
&= \int\!\!\int \prod_{k=1}^{j-1} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)}) \left[\int w(\theta_j \mid \theta_{(j-1)}, \theta_{-(j)})\, d\theta_j\right] \pi(\theta_{(j-1)}, \theta_j^*, \theta_{-(j)})\, d\theta_{(j-1)}\, d\theta_{-(j)} \\
&= \int \prod_{k=1}^{j-1} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)})\, w(\theta_j \mid \theta_{(j-1)}, \theta_{-(j)})\, \frac{\pi(\theta_{(j-1)}, \theta_j^*, \theta_{-(j)})}{\pi(\theta)}\, \pi(\theta)\, d\theta \\
&= E\!\left[\prod_{k=1}^{j-1} \pi(\theta_k^* \mid \theta_{(k-1)}, \theta_{k+1}^*, \ldots, \theta_j^*, \theta_{-(j)})\, w(\theta_j \mid \theta_{(j-1)}, \theta_{-(j)})\, \frac{\pi(\theta_{(j-1)}, \theta_j^*, \theta_{-(j)})}{\pi(\theta)}\right],
\end{aligned} \tag{5}
$$
where the expectation is taken over $\theta$ and $w(\theta_j \mid \theta_{(j-1)}, \theta_{-(j)})$ is the weighting conditional density function of $\theta_j$. Therefore, $\pi(\theta_{(j)}^*)$ can be estimated by the sample average of the function inside the expectation in Eq. (5). Following Chen (1994), it can easily be shown that a desirable choice of $w(\theta_j \mid \theta_{(j-1)}, \theta_{-(j)})$ is a density close to the true conditional posterior density $\pi(\theta_j \mid \theta_{(j-1)}, \theta_{-(j)})$.

This method utilizes all the known full conditionals of $\theta_1, \ldots, \theta_{j-1}$ and handles only the unknown full conditional of $\theta_j$ by the importance-weighted method. By contrast, a direct implementation of Chen's method would be to choose a weighting conditional density function $v(\theta_{(j)} \mid \theta_{-(j)})$ of $\theta_{(j)}$ and estimate $\pi(\theta_{(j)}^*)$ by the sample average of

$$v(\theta_{(j)} \mid \theta_{-(j)})\, \frac{\pi(\theta_{(j)}^*, \theta_{-(j)})}{\pi(\theta)}. \tag{6}$$

It is desirable to select $v$ close to the true conditional posterior density function of $\theta_{(j)}$, but this may be difficult, especially in high-dimensional cases. Comparing the two methods, the dimension of $\theta_j$ can be much lower than the dimension of $\theta_{(j)}$, and hence selecting the weighting conditional density $w$ for $\theta_j$ in Eq. (5) can be much easier than selecting the weighting conditional density $v$ for $\theta_{(j)}$ in Eq. (6). Thus, by making maximum use of the known full conditional density functions, the proposed method reduces the dimension of the weighting conditional density in the importance-weighted method, which can result in better performance than the direct importance-weighted method.
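To illustrate the weighting-density device of (5) and (6) in the simplest one-block case, the sketch below estimates a marginal density using only the unnormalized posterior kernel and a weighting conditional density $w$, so the normalizing constant cancels. The bivariate normal target and the choice of $w$ (here the exact conditional, i.e. the near-optimal case) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n = 0.8, 20000
s = np.sqrt(1 - rho**2)

# Draws from a toy bivariate normal "posterior" (stand-in for MCMC output)
draws = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
t1, t2 = draws[:, 0], draws[:, 1]

def kernel(a, b):
    """Unnormalized posterior kernel of N(0, [[1, rho], [rho, 1]])."""
    return np.exp(-(a**2 - 2 * rho * a * b + b**2) / (2 * (1 - rho**2)))

def w(a, b):
    """Weighting conditional density of theta1 given theta2: here the true
    conditional N(rho*b, 1 - rho^2), an assumed (near-optimal) choice."""
    return np.exp(-0.5 * ((a - rho * b) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def marginal_hat(t):
    """Importance-weighted estimate of pi(theta1*) as in (6): only the
    ratio of posterior kernels is needed."""
    return (w(t1, t2) * kernel(t, t2) / kernel(t1, t2)).mean()
```

With $w$ equal to the true conditional, each summand reduces to the Rao–Blackwellized term $\pi(\theta_1^* \mid \theta_{2i})$, so `marginal_hat(0.0)` should again be near the $N(0,1)$ density at zero, about $0.399$.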
2.3. Convergence of the estimate

To estimate the marginal posterior density function of $\theta_{(j)}$ at $\theta_{(j)}^*$, the proposed method represents the marginal density as the posterior expectation of an appropriate function $h_{\theta_{(j)}^*}(\theta)$ (see Eqs. (3) and (5)) and estimates it by the sample average of $h_{\theta_{(j)}^*}(\theta)$. Thus, under some regularity conditions, the estimate $\hat{\pi}(\theta_{(j)}^*)$ converges to the true $\pi(\theta_{(j)}^*)$ almost surely as $n$ increases (Chen, 1994; Tierney, 1994).

The variance and the standard error of $\hat{\pi}(\theta_{(j)}^*)$ can also be estimated in some cases. When the samples $\theta_i$ are independent, the variance of $\hat{\pi}(\theta_{(j)}^*)$ can be estimated by

$$\widehat{\mathrm{var}}(\hat{\pi}(\theta_{(j)}^*)) = \frac{1}{n}\left[\frac{1}{n}\sum_{i=1}^{n} h^2_{\theta_{(j)}^*}(\theta_i) - \hat{\pi}^2(\theta_{(j)}^*)\right].$$
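A sketch of these variance estimates: `var_hat_iid` implements the independent-draws formula above, and `var_hat_gibbs` the lag-window (autocorrelation-consistent) form (7) used below for Gibbs output. The test values of $h$ are synthetic.

```python
import numpy as np

def var_hat_iid(h):
    """Variance of the density estimate for independent draws:
    (1/n) [ (1/n) sum h_i^2 - (mean h)^2 ]."""
    h = np.asarray(h, float)
    return ((h**2).mean() - h.mean()**2) / len(h)

def var_hat_gibbs(h, q):
    """Lag-window variance estimate as in (7):
    (1/n) [ gamma_0 + 2 sum_{s=1..q} (1 - s/(1+q)) gamma_s ],
    with gamma_s the lag-s autocovariance of h."""
    h = np.asarray(h, float)
    n = len(h)
    m2 = h.mean() ** 2
    gamma = [(h[: n - s] * h[s:]).sum() / (n - s) - m2 for s in range(q + 1)]
    return (gamma[0] + 2 * sum((1 - s / (1 + q)) * gamma[s]
                               for s in range(1, q + 1))) / n

# Synthetic h values standing in for h_{theta*}(theta_i)
rng = np.random.default_rng(3)
h = rng.normal(0.4, 0.05, size=10000)
se = np.sqrt(var_hat_iid(h))   # standard error of the density estimate
```

For independent $h_i$ the two estimators agree at $q = 0$, and `var_hat_iid(h)` is close to $\mathrm{var}(h)/n$.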
When the samples are not independent, the correlation structure should be incorporated into the variance. There is no general rule for doing this, since samples from different Markov chain methods have different correlation structures. But if the samples are obtained from the Gibbs sampler, then following Chib (1995) and Geweke (1992) the variance can be estimated by

$$\widehat{\mathrm{var}}(\hat{\pi}(\theta_{(j)}^*)) = \frac{1}{n}\left[\gamma_0 + 2\sum_{s=1}^{q} \left(1 - \frac{s}{1+q}\right)\gamma_s\right], \tag{7}$$

where

$$\gamma_s = \frac{1}{n-s}\sum_{i=1}^{n-s} h_{\theta_{(j)}^*}(\theta_i)\, h_{\theta_{(j)}^*}(\theta_{i+s}) - \hat{\pi}^2(\theta_{(j)}^*),$$
and $q$ is the lag size beyond which the autocorrelation of $h_{\theta_{(j)}^*}(\theta_i)$ is negligible (Newey and West, 1987). Because one usually wants to estimate various marginal densities at many different points, there would be many $h$'s, and it would be difficult to use an appropriate $q$ for each $h$. One way to handle this problem is to compute the autocorrelation coefficient of each element of the posterior sample $\{\theta_i\}$, find the lag size at which all the autocorrelation coefficients become negligible, and use it as $q$ for all $h$'s.

3. Examples

3.1. Example 1

Pyndyck and Rubinfeld (1981) provided data on rent paid, number of rooms rented, number of occupants, gender, and distance from campus in blocks for undergraduates at the University of Michigan. Denoting rent paid per person by $y_i$, rooms per person by $r_i$, distance from campus in blocks by $d_i$, and gender by $s_i$ (one for male and zero for female), they used the linear regression model

$$y_i = \beta_1 + \beta_2 s_i r_i + \beta_3 (1 - s_i) r_i + \beta_4 s_i d_i + \beta_5 (1 - s_i) d_i + \epsilon_i, \qquad i = 1, \ldots, 32,$$
where the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$. Letting $y = (y_1, \ldots, y_{32})'$, $\beta = (\beta_1, \ldots, \beta_5)'$, $\epsilon = (\epsilon_1, \ldots, \epsilon_{32})'$, and

$$X = \begin{pmatrix} 1 & s_1 r_1 & (1-s_1) r_1 & s_1 d_1 & (1-s_1) d_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & s_{32} r_{32} & (1-s_{32}) r_{32} & s_{32} d_{32} & (1-s_{32}) d_{32} \end{pmatrix}, \tag{8}$$

the model can be expressed as $y = X\beta + \epsilon$. As in Geweke (1986), the region $A = \{\beta:\ \beta_2 \ge 0,\ \beta_3 \ge 0,\ \beta_4 \le 0,\ \beta_5 \le 0\}$ is considered a reasonable constraint on the coefficients $\beta = (\beta_1, \ldots, \beta_5)$ in this example. With prior $\pi(\beta, \sigma^2) = (1/\sigma^2)\, I_A(\beta)$, where $I_A(\beta) = 1$ if $\beta \in A$ and $0$ otherwise, the posterior density function of $\theta = (\beta, \sigma^2)$ is proportional to

$$f(\theta) = (\sigma^2)^{-17} \exp\!\left[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right] I_A(\beta).$$

From this, the full conditional distributions of $\beta_j$ and $\sigma^2$ are given, respectively, as

$$\beta_j \mid \beta_{k, k \ne j}, \sigma^2, \text{data} \sim N(\mu_j, \eta_j^2)\, I_A(\beta_j), \qquad j = 1, \ldots, 5,$$

$$\sigma^2 \mid \beta, \text{data} \sim \mathrm{IG}\bigl(16,\ 2/(y - X\beta)'(y - X\beta)\bigr),$$

where $\mu_j$ and $\eta_j^2$ are the conditional mean and variance of $\beta_j$ when $\beta$ follows the $N((X'X)^{-1}X'y,\ \sigma^2 (X'X)^{-1})$ distribution, and $\mathrm{IG}(\cdot, \cdot)$ denotes the inverse gamma distribution. Note that the conditional mean $\mu_j$ is a function of $(\beta_{k, k \ne j})$ and $\eta_j^2$ is a function of $\sigma^2$. Because all the full conditional distributions of the elements of $\theta$ are known, implementing the Gibbs sampler to generate a posterior sample $\{\theta_i = (\beta_{1i}, \ldots, \beta_{5i}, \sigma_i^2)\}$ is straightforward.

In this example, quantities of interest would be the marginal density functions of $\beta_2 - \beta_3$, $\beta_4 - \beta_5$, $(\beta_2, \beta_3)$, $(\beta_4, \beta_5)$, $(\beta_2, \beta_4)$, $(\beta_3, \beta_5)$, and $(\beta_2, \beta_3, \beta_4, \beta_5)$. Taking the constraint into account, the full conditional posterior density functions of the $\beta_j$ are

$$\pi(\beta_1 \mid \beta_{k, k \ne 1}, \sigma^2, \text{data}) = \phi((\beta_1 - \mu_1)/\eta_1)/\eta_1,$$

$$\pi(\beta_j \mid \beta_{k, k \ne j}, \sigma^2, \text{data}) = \frac{\phi((\beta_j - \mu_j)/\eta_j)/\eta_j}{1 - \Phi(-\mu_j/\eta_j)}, \qquad j = 2, 3,$$

$$\pi(\beta_j \mid \beta_{k, k \ne j}, \sigma^2, \text{data}) = \frac{\phi((\beta_j - \mu_j)/\eta_j)/\eta_j}{\Phi(-\mu_j/\eta_j)}, \qquad j = 4, 5,$$
where $\phi$ and $\Phi$ are the standard normal density and distribution functions, respectively. Therefore, the proposed method can be routinely implemented to estimate all the desired marginals and their standard errors. For instance, the marginal posterior density function of $\beta_4 - \beta_5$ at $t$ is estimated by

$$\hat{\pi}(\beta_4 - \beta_5 = t \mid \text{data}) = \frac{1}{n} \sum_{i=1}^{n} \pi(\beta_4 = t + \beta_{5i} \mid \beta_{1i}, \beta_{2i}, \beta_{3i}, \beta_{5i}, \sigma_i^2, \text{data}), \tag{9}$$

and the marginal density function of $(\beta_2, \beta_4)$ at $(\beta_2^*, \beta_4^*)$ is estimated by

$$\hat{\pi}(\beta_2^*, \beta_4^* \mid \text{data}) = \frac{1}{n} \sum_{i=1}^{n} \pi(\beta_2^* \mid \beta_{1i}, \beta_{3i}, \beta_4^*, \beta_{5i}, \sigma_i^2, \text{data})\, \pi(\beta_4^* \mid \beta_{1i}, \beta_{2i}, \beta_{3i}, \beta_{5i}, \sigma_i^2, \text{data}). \tag{10}$$

Note that the full conditional density function of $\beta_4$ given $(\beta_{1i}, \beta_{2i}, \beta_{3i}, \beta_{5i}, \sigma_i^2)$ is used in both Eqs. (9) and (10), and some necessary quantities, such as $\mu_{ji}$, $\eta_{ji}$, and $\Phi(-\mu_{ji}/\eta_{ji})$, $j = 2, 4$, have already been computed in the Gibbs sampler.

All the computations in this paper were done in Fortran on a personal computer with a Pentium processor. Random numbers were generated with the IMSL subroutines RNUN for the uniform distribution and RNGAM for the gamma distribution. For estimation, a posterior sample of size 10 000 was generated from the Gibbs sampler with a warm-up of 200 iterations. All the sample autocorrelation coefficients of the elements of the parameters appeared to become negligible around lag 30; thus $q = 30$ was chosen, and all the standard errors in this example were computed as the square root of Eq. (7) with this $q$.

Fig. 1 presents, as a solid line, the marginal density function of $\beta_2 - \beta_3$ estimated at 90 evenly spaced grid points for sample sizes 2000 and 10 000. Connected lines of the upper (lower) bounds of approximate 95% confidence intervals for the marginal density at each grid point, called the confidence band, are superimposed in Fig. 1 as dotted lines. Fig. 2 presents the analogous curves for $\beta_4 - \beta_5$. Convergence of the estimated marginals is clear, since the upper and lower bounds of the confidence band are very close when $n = 10\,000$. In addition, the estimated marginals are almost identical to those given in Geweke (1986). Figs. 3 and 4 show the marginal density functions of $(\beta_2, \beta_4)$ and $(\beta_3, \beta_5)$ estimated on a $100 \times 100$ grid of evenly spaced points from the same posterior sample of size 10 000. There is only a slight positive correlation between $\beta_2$ and $\beta_4$, and between $\beta_3$ and $\beta_5$.
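The Example 1 computations can be sketched as follows. Since the Michigan rent data are not reproduced in this paper, synthetic data of the same structure are used, and scipy's truncated-normal and inverse-gamma samplers stand in for the paper's Fortran/IMSL routines; everything except the model, the constraint, and formulas (8) and (9) is an illustrative assumption.

```python
import numpy as np
from scipy.stats import truncnorm, invgamma, norm

rng = np.random.default_rng(4)

# Synthetic stand-in for the 32 observations of the Michigan rent data
m = 32
s = rng.integers(0, 2, m).astype(float)    # gender indicator
r = rng.uniform(0.5, 3.0, m)               # rooms per person
d = rng.uniform(0.0, 12.0, m)              # distance in blocks
X = np.column_stack([np.ones(m), s * r, (1 - s) * r, s * d, (1 - s) * d])
y = X @ np.array([60.0, 30.0, 25.0, -2.0, -1.5]) + rng.normal(0, 10.0, m)

XtX = X.T @ X
C = np.linalg.inv(XtX)                     # (X'X)^{-1}
bhat = C @ X.T @ y                         # (X'X)^{-1} X'y
lo = np.array([-np.inf, 0.0, 0.0, -np.inf, -np.inf])   # constraint region A
hi = np.array([np.inf, np.inf, np.inf, 0.0, 0.0])

def cond_moments(j, beta, sig2):
    """Conditional mean/variance (mu_j, eta_j^2) of beta_j given the other
    coefficients under beta ~ N(bhat, sig2 * C)."""
    o = [k for k in range(5) if k != j]
    w = np.linalg.solve(C[np.ix_(o, o)], C[o, j])
    mu = bhat[j] + w @ (beta[o] - bhat[o])
    var = sig2 * (C[j, j] - C[o, j] @ w)
    return mu, var

# Gibbs sampler: truncated-normal updates for beta_j, inverse gamma for sig2
n_keep, burn = 2000, 200
beta, sig2 = bhat.copy(), 100.0
draws = np.empty((n_keep, 6))
for it in range(n_keep + burn):
    for j in range(5):
        mu, var = cond_moments(j, beta, sig2)
        sd = np.sqrt(var)
        a, b = (lo[j] - mu) / sd, (hi[j] - mu) / sd
        beta[j] = truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng)
    resid = y - X @ beta
    sig2 = invgamma.rvs(m / 2, scale=resid @ resid / 2, random_state=rng)
    if it >= burn:
        draws[it - burn] = np.concatenate([beta, [sig2]])

def dens_diff(t):
    """Estimate (9): density of beta4 - beta5 at t, averaging the truncated-
    normal full conditional of beta4 evaluated at t + beta5_i."""
    vals = []
    for row in draws[::5]:                 # thin for speed
        b, s2 = row[:5], row[5]
        mu, var = cond_moments(3, b, s2)   # condition on beta1,2,3,5 and sig2
        sd = np.sqrt(var)
        x = t + b[4]
        vals.append(norm.pdf(x, mu, sd) / norm.cdf(-mu / sd) if x <= 0 else 0.0)
    return float(np.mean(vals))
```

On this synthetic data, $\hat{\pi}(\beta_4 - \beta_5 = t \mid \text{data})$ evaluated over a grid of $t$ values integrates to approximately one, as a density estimate should.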
Fig. 1. Estimated marginal posterior density function of $\beta_2 - \beta_3$ in Example 1. (—): estimate; (· · ·): approximate 95% confidence band.

Fig. 2. Estimated marginal posterior density function of $\beta_4 - \beta_5$ in Example 1. (—): estimate; (· · ·): approximate 95% confidence band.
Let $y_i$ denote the BOD and $x_i$ the time. Based on exponential decay with a fixed rate constant, Bates and Watts (1988) assumed the nonlinear model

$$y_i = \theta_1 (1 - e^{-\theta_2 x_i}) + \epsilon_i,$$

where the $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$. Under the model, the likelihood function of $\theta = (\theta_1, \theta_2, \sigma^2)$ is

$$l(\theta \mid \text{data}) \propto \sigma^{-6} \exp\!\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{6} \bigl(y_i - \theta_1 (1 - e^{-\theta_2 x_i})\bigr)^2\right].$$
Fig. 3. Estimated marginal posterior density function of $(\beta_2, \beta_4)$ in Example 1. (—): 0.0005; (· · ·): 0.001; (- - -): 0.0015.

Fig. 4. Estimated marginal posterior density function of $(\beta_3, \beta_5)$ in Example 1. (—): 0.001; (· · ·): 0.003; (- - -): 0.005; (– –): 0.007.

Table 1
Biochemical oxygen demand vs. time

Time (days)    1      2      3      4      5      7
BOD (mg/l)     8.3    10.3   19.0   16.0   15.6   19.8
If a constant prior is assumed for $(\theta_1, \theta_2)$ over the whole two-dimensional space, the posterior density function becomes improper. Thus, for illustrative purposes, I restricted $(\theta_1, \theta_2)$ to the region $(0, 50) \times (0, 6)$ and assumed the noninformative prior $\pi(\theta) = (1/\sigma^2)\, I_{(0,50) \times (0,6)}(\theta_1, \theta_2)$. Then the full conditional distributions of $\theta_1$ and $\sigma^2$ are, respectively,

$$\theta_1 \mid \theta_2, \sigma^2, \text{data} \sim N\!\left(\frac{\sum_{i=1}^{6} (1 - e^{-\theta_2 x_i})\, y_i}{\sum_{i=1}^{6} (1 - e^{-\theta_2 x_i})^2},\ \frac{\sigma^2}{\sum_{i=1}^{6} (1 - e^{-\theta_2 x_i})^2}\right) I_{(0,50)}(\theta_1),$$

$$\sigma^2 \mid \theta_1, \theta_2, \text{data} \sim \mathrm{IG}\!\left(3,\ \frac{2}{\sum_{i=1}^{6} \bigl(y_i - \theta_1 (1 - e^{-\theta_2 x_i})\bigr)^2}\right).$$
But the full conditional posterior density of $\theta_2$ is not given in convenient form. Posterior samples of $\theta$ can be generated in various ways. For instance, the Griddy–Gibbs sampler of Ritter and Tanner (1992) can be used: at each cycle of the Gibbs sampler it generates $\theta_1$ and $\sigma^2$ from the conditional distributions given above, while $\theta_2$ is generated by approximating the conditional cdf of $\theta_2$ and applying the cdf-inversion method to the approximated cdf. Another way is to reparameterize $\eta_1 = 1/\theta_1$, $\eta_2 = \log(\theta_2)$, since this makes the likelihood close to a normal density (Racine-Poon, 1992), use the Metropolis–Hastings algorithm or the resampling method (Rubin, 1988) to generate $(\eta_1, \eta_2)$, and transform back to $(\theta_1, \theta_2)$. In this example, the Griddy–Gibbs sampler with a warm-up of 50 iterations was used to generate the posterior sample.

To use the proposed method, one needs to choose a weighting conditional density $w_2^c(\theta_2 \mid \theta_1, \sigma^2)$ for $\theta_2$. Three candidate densities were considered in this example. First, the density $g_u$ of the uniform distribution over $(0, 6)$; this is the simplest choice for $w_2^c$. Second, the density $g_m$ of a normal approximation to the conditional posterior distribution of $\theta_2$ given $\sigma^2$: the mode of $(\theta_1, \theta_2)$ given $\sigma^2$ and minus the inverse Hessian at the mode were found, and the $\theta_2$ component of the mode and the second diagonal element of minus the inverse Hessian were used as the mean and variance, respectively, of the normal distribution. Third, the density $g_c$ of a normal approximation to the full conditional posterior distribution of $\theta_2$: the conditional mode and minus the inverse of the conditional Hessian at the mode, obtained from the conditional posterior kernel of $\theta_2$ given $(\theta_1, \sigma^2)$, were used, respectively, as the mean and variance of the normal distribution.
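A sketch of how the parameters of such a $g_c$ could be computed for given $(\theta_1, \sigma^2)$, using the BOD data of Table 1. A bounded one-dimensional optimizer and a finite-difference Hessian are used here in place of the paper's bisection method, and the conditioning values of $(\theta_1, \sigma^2)$ are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# BOD data from Table 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])
y = np.array([8.3, 10.3, 19.0, 16.0, 15.6, 19.8])

def neg_log_cond(theta2, theta1, sig2):
    """Negative log of the full conditional kernel of theta2 given
    (theta1, sig2), restricted to (0, 6)."""
    resid = y - theta1 * (1 - np.exp(-theta2 * x))
    return resid @ resid / (2 * sig2)

def gc_params(theta1, sig2, h=1e-4):
    """Mean and variance of the normal approximation g_c to the full
    conditional of theta2: the conditional mode, and minus the inverse
    Hessian at the mode (central finite differences)."""
    res = minimize_scalar(neg_log_cond, bounds=(1e-6, 6.0), method="bounded",
                          args=(theta1, sig2))
    m = res.x
    hess = (neg_log_cond(m + h, theta1, sig2) - 2 * res.fun
            + neg_log_cond(m - h, theta1, sig2)) / h**2
    return m, 1.0 / hess

mode, var = gc_params(theta1=19.0, sig2=6.0)   # illustrative (theta1, sig2)
```

For conditioning values in the bulk of the posterior, the mode lies well inside $(0, 6)$ and the curvature is positive, so $g_c$ is a proper normal density.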
This density would be the closest to the true conditional posterior density of $\theta_2$ among the three. The conditional mode and the conditional Hessian of $\theta_2$ can be obtained easily because only one-dimensional maximization is required; in this example, a simple bisection method was used.

In estimating the marginal density function of $\theta_2$, the proposed method is equivalent to the importance-weighted method, because the conditional posterior density of $\theta_2$ is unknown. Fig. 5a–c presents the estimated marginal density function of $\theta_2$ from $n = 1000$ posterior samples for each of the three candidate densities used as $w_2^c$. In each figure, the estimated marginal density function is presented as the solid line, an approximate 95% confidence band (described in Example 1) as the dotted line, and the true marginal density function, calculated from an extensive grid method, as the dashed line.

Fig. 5. Estimated marginal posterior density function of $\theta_2$ in Example 2 for different choices of the weighting conditional density $w_2^c$ for $\theta_2$ and the sample size $n$. (—): estimate; (· · ·): approximate 95% confidence band; (- - -): true density.

The figure shows that the estimate is very sensitive to the choice of the weighting conditional density $w_2^c$. In particular, Fig. 5a shows that the uniform density $g_u$ gives very inaccurate, even misleading, results, since the true posterior density is far outside the confidence band; this indicates the importance of a good $w_2^c$. Fig. 5b shows that the normal approximation $g_m$ to the conditional posterior density of $\theta_2$ given $\sigma^2$ yields a better fit than $g_u$, but it is still somewhat unsatisfactory, since the true posterior density runs almost along the upper bound of the confidence band. Fig. 5c shows that the normal approximation $g_c$ to the full conditional posterior density of $\theta_2$ gives a very good fit: the estimated and the true marginal posterior density functions are almost indistinguishable. Fig. 5d shows the case of $w_2^c = g_c$ with $n = 4000$ samples; the estimate gives an almost perfect fit.

In estimating the marginal density function of $(\theta_1, \theta_2)$, the proposed method differs from the direct importance-weighted method of Chen (1994). The direct importance-weighted method requires one to choose a conditional density $w(\theta_1, \theta_2 \mid \sigma^2)$. As suggested in Chen (1994), one way of doing this is to let $w(\theta_1, \theta_2 \mid \sigma^2) = w_1^c(\theta_1 \mid \theta_2, \sigma^2)\, w_2^m(\theta_2 \mid \sigma^2)$ and choose $w_1^c$ and $w_2^m$. The obvious choice for $w_1^c$ is $\pi(\theta_1 \mid \theta_2, \sigma^2)$, because it is known. For $w_2^m(\theta_2 \mid \sigma^2)$, a common choice would be $w_2^m = g_m$, the normal approximation to the conditional posterior density of $\theta_2$ given $\sigma^2$. In contrast, the proposed method requires only the full conditional weighting density $w_2^c$ of $\theta_2$, and the same $w_2^c$ used for estimating the marginal posterior density of $\theta_2$ can be reused for that of $(\theta_1, \theta_2)$; the density $g_c$ was chosen for $w_2^c$ because it is close to the true full conditional posterior density.

To compare the proposed method with the direct importance-weighted method, the difference between the estimated and the true marginal density functions was computed for each of the two methods, where the estimates were obtained from 1000 samples and the true marginal density was computed from an extensive grid method. Contour graphs of the differences are presented in Fig. 6.

Fig. 6. Contour of the difference between the estimated and the true marginal posterior density functions of $(\theta_1, \theta_2)$ in Example 2. (—): −0.005; (· · ·): −0.025; (- - -): −0.065.

In addition, the $L_2$ norms of the differences were estimated, yielding 0.041 for the proposed method and 0.084 for the direct importance-weighted method. Both the figure and the $L_2$ norms indicate that the proposed method provides a significantly better fit than the direct importance-weighted method. The reason might be that the posterior kernel has a banana-shaped contour, as shown in Fig. 7, so that the normal approximation to the conditional posterior density of $\theta_2$ given $\sigma^2$ is poor. In contrast, the proposed method used an approximation to the full conditional posterior density of $\theta_2$, exploiting the information contained in the full conditional posterior distribution, and this resulted in better performance.

Fig. 7. Contour of the true marginal posterior density function of $(\theta_1, \theta_2)$ in Example 2. (—): 0.005; (· · ·): 0.05; (- - -): 0.1; (– –): 0.3.

4. Summary and conclusions

In this paper, a simulation-based approach to estimating the joint and the marginal posterior density functions of parameters has been proposed. The method expresses a desired posterior density function as the expectation of the product of full conditional posterior density functions of blocks of the parameter. When some of the full conditionals are unknown, it replaces them by an arbitrary conditional density times the ratio of the posterior kernels, implementing the importance-weighted method of Chen (1994). The estimate of the posterior density is given as the sample average of an appropriate function, where the sample is obtained from the joint posterior distribution. Under some regularity conditions, the posterior density estimate converges to the true posterior density as the sample size increases. The standard error of the estimate can be estimated when the samples are independent or are generated from the Gibbs sampler.

A great feature of the proposed method is that it can estimate all the desired posterior density functions at various points simultaneously by using a single set of posterior samples, and it can be run mechanically when all the full conditionals are known. This contrasts with most of the recently developed simulation-based
approaches, which often require cumbersome selection of functions, complicated approximations, or a new set of samples for each marginal density and each point.

Obviously, the proposed method can be applied to any permutation of the parameter blocks whose posterior density is desired. For instance, one can switch the roles of $\theta_1$ and $\theta_2$ in estimating the posterior density of $(\theta_1, \theta_2)$. The order may affect the convergence rate of the estimate, but I presume that the effect is not great, since all the full conditionals are used. The two examples presented in this paper gave similar estimation results for different orderings of the blocks, lending some support to this presumption. Because the method utilizes all the known full conditional posterior density functions, it is best suited to situations where all or most of the conditionals of the blocks of the parameter are known.

Acknowledgements

The author would like to thank two anonymous referees for helpful comments.

References

Bates, D., Watts, D., 1988. Nonlinear Regression Analysis and its Applications. Wiley, New York, p. 270.
Chen, M.-H., 1994. Importance-weighted marginal Bayesian posterior density estimation. J. Amer. Statist. Assoc. 89 (427), 818–824.
Chen, M.-H., Schmeiser, B.W., 1993. Performance of the Gibbs, hit-and-run, and Metropolis samplers. J. Comput. Graph. Statist. 2 (3), 1–22.
Chib, S., 1995. Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90 (432), 1313–1321.
Gelfand, A.E., Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85 (410), 398–409.
Geweke, J., 1986. Exact inference in the inequality constrained normal linear regression model. J. Appl. Econom. 1 (1), 127–141.
Geweke, J., 1992. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In: Bernardo, J.M. (Ed.), Bayesian Statistics, vol. 4. Oxford Univ. Press, Oxford, pp. 169–194.
Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), 97–109.
Ritter, C., Tanner, M.A., 1992. Facilitating the Gibbs sampler: the Gibbs stopper and the Griddy–Gibbs sampler. J. Amer. Statist. Assoc. 87 (419), 861–868.
Marske, D., 1967. Biochemical oxygen demand data interpretation using sum of squares surfaces. Ph.D. Thesis, Department of Statistics, University of Wisconsin, Madison.
Mueller, P., 1991. A generic approach to posterior integration and Gibbs sampling. Technical Report 91-09, Department of Statistics, Purdue University, West Lafayette, IN.
Newey, W.K., West, K.D., 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55 (3), 700–703.
Pyndyck, R.S., Rubinfeld, D.L., 1981. Econometric Models and Economic Forecasts, 2nd ed. McGraw-Hill, New York, p. 44.
Racine-Poon, A., 1992. SAGA: sample assisted graphical analysis. In: Bernardo, J.M. (Ed.), Bayesian Statistics, vol. 4. Oxford Univ. Press, Oxford, pp. 389–404.
Rubin, D., 1988. Using the SIR algorithm to simulate posterior distributions. In: Bernardo, J.M. (Ed.), Bayesian Statistics, vol. 3. Oxford Univ. Press, Oxford, pp. 395–402.
Rubinstein, R.Y., 1981. Simulation and the Monte Carlo Method. Wiley, New York, pp. 141–143.
Tierney, L., 1994. Markov chains for exploring posterior distributions. Ann. Statist. 22 (4), 1701–1728.