Journal of Statistical Planning and Inference 138 (2008) 1210–1225
On convergence of properly weighted samples to the target distribution

Sonia Malefaki, George Iliopoulos
Department of Statistics and Insurance Science, University of Piraeus, 80 Karaoli & Dimitriou str., 18534 Piraeus, Greece

Received 30 March 2006; received in revised form 30 November 2006; accepted 14 March 2007. Available online 25 May 2007.
Abstract

We consider importance sampling as well as other properly weighted samples with respect to a target distribution $\pi$ from a different point of view. By considering the associated weights as sojourn times until the next jump, we define appropriate jump processes. When the original sample sequence forms an ergodic Markov chain, the associated jump process is an ergodic semi-Markov process with stationary distribution $\pi$. In this respect, properly weighted samples behave very similarly to standard Markov chain Monte Carlo (MCMC) schemes in that they exhibit convergence to the target distribution as well. Indeed, some standard MCMC procedures like the Metropolis–Hastings algorithm are included in this context. Moreover, when the samples are independent and the mean weight is bounded above, we describe a slight modification in order to achieve exact (weighted) samples from the target distribution.

MSC: 65C05; 60K15

Keywords: Importance sampling; Properly weighted samples; Markov chain Monte Carlo; Semi-Markov process; Limit distribution
1. Introduction

Importance sampling (IS) (Marshall, 1956) is a well-known Monte Carlo method that is useful in any discipline where integral approximations are needed. For a measurable space $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ and a probability distribution $\pi$ defined on it, the method attempts to estimate the integral
$$E_\pi(h) := \int_{\mathcal{X}} h(x)\,\pi(dx),$$
for $h \in L^1(\pi)$, by drawing independent samples $x_1, \ldots, x_n$ from a trial distribution $g$ with support at least that of $\pi$. Assuming that the distributions have densities $\pi(x)$, $g(x)$ with respect to a $\sigma$-finite measure $\mu$ [i.e., $\pi(dx) = \pi(x)\mu(dx)$, $g(dx) = g(x)\mu(dx)$], $E_\pi(h)$ is approximated by
$$\bar h_n^{\mathrm{IS}} := \frac{1}{n}\sum_{i=1}^n w(x_i)h(x_i), \qquad (1)$$
or by
$$\hat h_n^{\mathrm{IS}} := \frac{\sum_{i=1}^n w(x_i)h(x_i)}{\sum_{i=1}^n w(x_i)}, \qquad (2)$$
where $w(x_i) := \pi(x_i)/g(x_i)$. In standard terminology, $g$ is called "the importance distribution" and the $w(x_i)$'s "the importance weights". The validity of $\bar h_n$ and $\hat h_n$ as approximations of $E_\pi(h)$ is justified by the strong law of large numbers, which ensures that for all $h \in L^1(\pi)$ it holds that $n^{-1}\sum w(X_i)h(X_i) \to E_g\{w(X)h(X)\} = E_\pi(h)$ and $n^{-1}\sum w(X_i) \to E_g\{w(X)\} = 1$ with probability one. Note that $\hat h_n$ can be used in more general settings than $\bar h_n$, e.g. when the importance weights $w(x)$ are known only up to a multiplicative constant. In this case the second limit in general differs from one, whilst the first becomes $E_g\{w(X)\}E_\pi(h)$. Also note that the assumption of independent samples can be relaxed, since there are more general contexts under which the above IS estimators converge to $E_\pi(h)$. For example, if the sequence of $X$'s forms a Harris ergodic Markov chain with stationary distribution $g$, the above limits still hold due to the Ergodic Theorem.

Besides integral approximations, simulation methods often aim to generate samples from a target distribution. For instance, this is the case when the aim is to gain knowledge of features of the distribution that cannot be expressed as expectations, such as quantiles or the number of modes. In this respect, IS seems at first glance to fail to provide samples from $\pi$, since all draws are made from the importance distribution $g$. The sampling/importance resampling method (SIR, Rubin, 1987) is an attempt to circumvent this drawback by weighted resampling with replacement from the generated $g$-sample. The weight assigned to $x_i$ is its normalized importance weight $w(x_i)/\sum_{i=1}^n w(x_i)$. As $n \to \infty$, this approach produces a sample which is approximately $\pi$-distributed. Smith and Gelfand (1992) revisited the approach and proved the assertion by considering the resampling procedure as weighted bootstrap.

The aim of the present paper is to show that, under certain conditions, when the $g$-sample is properly weighted it converges in a sense to the target distribution $\pi$. Actually, this is true for a jump process associated with the weighted sample. From this point of view, importance weighting does not differ much from MCMC sampling schemes. In fact, some of them are special cases of the above mentioned jump processes (e.g. the Metropolis–Hastings algorithm, see Section 3.1). It turns out that in order to obtain approximate samples from $\pi$, resampling is not needed at all.

The rest of the paper is organized as follows: In Section 2 we define the jump process associated with a weighted sample and prove that under certain conditions it converges weakly to the target distribution. Section 3 contains some examples of known sampling schemes which are special cases of this context. Moreover, we give an example of how one can modify a target distribution in order to apply a Gibbs sampler. Section 4 deals with the case of stationary weighted sequences. In Section 5 we provide upper bounds for the total variation distance to stationarity for some special cases of weighted samples, including the standard importance sampling procedure. As a result of independent interest, we give an upper bound for the total variation distance to stationarity of Markov jump processes with embedded Doeblin chains when the mean sojourn time is bounded above. We conclude with a discussion followed by an Appendix containing proofs of the stated propositions.
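Before proceeding, here is a minimal Python sketch of estimators (1) and (2). The target, trial density and all names below are our illustrative choices, not the paper's: a standard normal target and a Student-$t$ trial distribution with heavier tails.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_density(x):   # illustrative target: standard normal (our assumption)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def g_density(x):    # illustrative trial: Student-t with 3 df (heavier tails)
    return (2 / (np.pi * np.sqrt(3))) / (1 + x**2 / 3) ** 2

n = 10_000
x = rng.standard_t(df=3, size=n)          # iid draws from g
w = pi_density(x) / g_density(x)          # importance weights w(x_i)
h = x**2                                  # h(x) = x^2, so E_pi(h) = 1

h_bar = np.mean(w * h)                    # estimator (1)
h_hat = np.sum(w * h) / np.sum(w)         # estimator (2)
print(h_bar, h_hat)                       # both should be close to 1
```

If the weights were known only up to a multiplicative constant, only the ratio estimator (2) would remain directly usable, as noted above.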
2. Jump processes associated with weighted sequences

The concept of a properly weighted sample was introduced by Liu and Chen (1998) (see also Liu, 2001) as a generalization of the standard IS method.

Definition 2.1 (Liu and Chen (1998)). A set of weighted random samples $(X_i, \omega_i)_{1 \le i \le n}$ is called proper with respect to $\pi$ if for any square integrable function $h$,
$$E\{\omega_i h(X_i)\} = c\,E_\pi\{h(X_i)\} \quad \text{for } i = 1, \ldots, n,$$
for some positive constant $c$.

As Liu (2001) points out, an equivalent definition is the following:
Definition 2.2. A set of weighted random samples $(X_i, \omega_i)_{1 \le i \le n}$ is called proper with respect to $\pi$ if
$$E\{\omega_i | X_i = x\} = c\,\pi(x)/g(x) \quad \text{for } i = 1, \ldots, n,$$
for some positive constant $c$, where $X_i \sim g$.

In the sequel, we will associate with any infinite weighted random sequence a jump process in the following sense.

Definition 2.3. Consider a weighted sequence $(X_n, \omega_n)_{n \in \mathbb{Z}_+} := ((X_0, \omega_0), (X_1, \omega_1), \ldots)$, where the $\omega$'s are strictly positive weights. Define $S_0 = 0$, $S_n = \sum_{i=0}^{n-1}\omega_i$, $n \ge 1$, and let
$$N_t := \sup\{n : S_n \le t\}, \quad t \ge 0.$$
Then, the stochastic process $Y = (Y_t)_{t \ge 0}$ defined by $Y_t := X_{N_t}$, $t \ge 0$, will be called the jump process associated with the weighted sequence $(X_n, \omega_n)_{n \in \mathbb{Z}_+}$.

The definition ensures that the process $Y$ has right continuous sample paths which also have left-hand limits. However, if the support of the $\omega_n$'s is a subset of $\mathbb{N} = \{1, 2, \ldots\}$, we will consider the process $Y$ only for $t \in \mathbb{Z}_+$, i.e. we set $Y = (Y_0, Y_1, Y_2, \ldots)$. If this is the case, limits of quantities related to $Y_t$ should be suitably interpreted.

Proposition 2.1. Assume that the sequence $X = (X_n)_{n \in \mathbb{Z}_+}$ is a homogeneous Harris ergodic Markov chain with state space $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$ having an invariant probability distribution $g$, and that the distribution of $\omega_n$ depends solely on $X_n$ with $E\{\omega_n | X_n = x\} = c\,w(x) = c\,\pi(x)/g(x)$ for some $c > 0$. Then, for the jump process $(Y_t)_{t \ge 0}$ associated with the weighted sequence $(X_n, \omega_n)_{n \in \mathbb{Z}_+}$ it holds that
$$\lim_{t \uparrow \infty} P\{Y_t \in A\} = \pi(A) \quad \forall A \in \mathcal{B}(\mathcal{X}).$$

Proof. The result follows from the standard theory of semi-Markov processes (cf. Limnios and Oprişan, 2001). Under the above assumptions, $Y$ is an ergodic semi-Markov process with embedded Markov chain $X$ and respective sojourn times $(\omega_n)_{n \in \mathbb{Z}_+}$. Thus,
$$\lim_{t \uparrow \infty} P\{Y_t \in A\} = \frac{\int_A E\{\omega|x\}g(x)\,\mu(dx)}{\int_{\mathcal{X}} E\{\omega|x\}g(x)\,\mu(dx)} = \frac{\int_A w(x)g(x)\,\mu(dx)}{\int_{\mathcal{X}} w(x)g(x)\,\mu(dx)} = \frac{\int_A \pi(x)\,\mu(dx)}{\int_{\mathcal{X}} \pi(x)\,\mu(dx)} = \pi(A),$$
as is claimed. □
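The construction in Definition 2.3 is straightforward to express in code. Below is a minimal sketch (our illustration; the function name is arbitrary) that evaluates $Y_t$ for a given weighted sequence:

```python
import numpy as np

# Sketch of Definition 2.3: given states x_0, x_1, ... and strictly positive
# weights w_0, w_1, ..., evaluate the jump process Y_t = X_{N_t}, where
# S_0 = 0, S_n = w_0 + ... + w_{n-1} and N_t = sup{n : S_n <= t}.
def jump_process_value(xs, ws, t):
    s = np.cumsum(np.concatenate(([0.0], np.asarray(ws, dtype=float))))
    if t >= s[-1]:
        raise ValueError("need more terms of the sequence: S_n must exceed t")
    n_t = np.searchsorted(s, t, side="right") - 1   # largest n with S_n <= t
    return xs[n_t]
```

For instance, with `ws` the importance weights of iid draws `xs` from $g$, `jump_process_value(xs, ws, 10.0)` returns a realization of $Y_{10}$.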
Setting deterministically $\omega_n \equiv w(X_n)$, we have the following:

Corollary 2.1. If $(X_n)_{n \in \mathbb{Z}_+}$ forms a Harris ergodic Markov chain with stationary distribution $g$, then the jump process associated with the weighted sequence $(X_n, w(X_n))_{n \in \mathbb{Z}_+}$ has $\pi$ as limit distribution.

Any sequence of independent $g$-distributed random variables trivially forms an ergodic Markov chain with stationary distribution $g$. Thus, Corollary 2.1 also covers the original importance weighted sequence.

Example 2.1. Let the target distribution be the normal mixture $\pi \sim \frac{1}{3}N(0, 3^2) + \frac{1}{3}N(5, 1) + \frac{1}{3}N(15, 2^2)$. We have taken as importance distribution a centered Cauchy with scale parameter 10 and have run the standard IS algorithm $m = 10\,000$ times independently. For each run, we have recorded the values $Y_1$, $Y_3$ and $Y_{10}$ of the associated jump process as well as $N_{10}$, i.e. the number of Cauchy variates needed in order to obtain $Y_{10}$. The corresponding histograms in Fig. 1 clearly illustrate the distributional convergence of the jump process to $\pi$. For instance, the histogram of the values of $Y_{10}$ is practically indistinguishable from histograms of iid samples from $\pi$. We also note that the Monte Carlo estimate of $E(N_{10})$ was about 11.8 with 95% confidence limits (11.7, 11.9).

As mentioned in the Introduction, another method that generates an approximately $\pi$-distributed sample based on a properly weighted sample with respect to $\pi$ is the SIR algorithm. SIR screens the weighted sample by discarding the "bad" (unimportant) values and repeating many times the "good" (important) ones. So, the method in fact assigns new
Fig. 1. Histograms of the values of $Y_1$, $Y_3$, $Y_{10}$ and $N_{10}$ obtained from $m = 10\,000$ independent IS runs with target distribution $\pi \sim \frac{1}{3}N(0, 9) + \frac{1}{3}N(5, 1) + \frac{1}{3}N(15, 4)$ and importance distribution $g \sim C(0, 10)$.
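A sketch of the simulation behind Fig. 1 follows (our code; in particular, the convention that $N_{10}$ counts the Cauchy draws used up to time 10 is our reading of the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def pi_density(x):   # the mixture (1/3)N(0,9) + (1/3)N(5,1) + (1/3)N(15,4)
    comp = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return (comp(0, 3) + comp(5, 1) + comp(15, 2)) / 3

def g_density(x):    # Cauchy C(0, 10) density
    return 1 / (np.pi * 10 * (1 + (x / 10) ** 2))

def one_run(t=10.0):
    s, draws = 0.0, 0
    while True:
        x = 10 * rng.standard_cauchy()
        draws += 1
        w = pi_density(x) / g_density(x)   # importance weight = sojourn time
        if s + w > t:                      # x's sojourn interval covers time t
            return x, draws                # Y_10 and the number of draws used
        s += w

out = [one_run() for _ in range(10_000)]
y10 = np.array([y for y, _ in out])        # approximately pi-distributed
n10 = np.array([n for _, n in out])        # mean(n10) should be near 11.8
```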
integer-valued weights to the original sample. The conditional expectations of the new weights given the whole output are proportional to the corresponding old weights divided by their sum. This implies that, provided the original sample is large, SIR also generates a roughly properly weighted sample with respect to the target distribution. However, it is clear that if the original sample does not contain "good" values, resampling will not contribute much. On the other hand, if the weighted sample provides a reasonable approximation of the target distribution, the same will happen for the SIR output as well. Hence, in general, the quality of the samples obtained by the two methods is expected to be roughly similar. Therefore, if the purpose is to gain knowledge about the target distribution rather than to store a set of good values, SIR does not offer more than IS.

Example 2.2. In order to compare the quality of the weighted sample generated by IS and the (unweighted) sample provided by SIR, we have considered again the distributions of Example 2.1. A single iid sample of size 10 000 from $C(0, 10)$ has been properly weighted with respect to the target distribution and 10 000 values have been resampled with replacement with resampling probabilities proportional to these weights. Fig. 2 shows the histogram of the resampled values and the part of the histogram of the weighted sample corresponding to the same range. This range includes about 64% of the weighted sample, with the total mass excluded being about $5 \times 10^{-5}$. Moreover, the corresponding QQ-plots are shown under each histogram. Following the suggestion of a referee, we have also included the histogram of an iid sample of the same size from the target distribution for comparison purposes. As can be seen, the first two graphs barely differ from the last one.

Fig. 2. Histograms and QQ-plots of samples generated by a single run of IS and SIR with target and proposal distributions the same as in Example 2.1. The third histogram and QQ-plot correspond to an iid sample from the target distribution.

It is well-known (cf. Liu, 2001) that, under the assumptions of Proposition 2.1, the weighted average
$$\hat h_n := \frac{\sum_{i=0}^{n-1}\omega_i h(X_i)}{\sum_{i=0}^{n-1}\omega_i} \qquad (3)$$
converges almost surely to $E_\pi(h)$ for any $h \in L^1(\pi)$. In light of Definition 2.3, estimator (3) can be expressed as
$$\hat h_n = \frac{1}{S_n}\int_0^{S_n} h(Y_s)\,\lambda(ds),$$
where $\lambda$ is either the Lebesgue or the counting measure, depending on whether the weights are continuous or discrete. Note that in the case of the standard IS estimator (i.e. when $\omega_n \equiv w(X_n)$), this gives another justification for the use of $\hat h_n^{\mathrm{IS}}$ in (2) instead of $\bar h_n^{\mathrm{IS}}$ in (1).

Proposition 2.1 establishes a stronger result than the convergence of the Cesàro averages
$$\hat h_t := t^{-1}\int_0^t h(Y_s)\,\lambda(ds)$$
to $E_\pi(h)$. It states that distributional convergence of the generated sequence takes place, analogous to that of MCMC schemes. Since the almost sure convergence of $\hat h_t$ (and $\hat h_n$) to $E_\pi(h)$ is a consequence of this fact, it turns out that Proposition 2.1 provides the actual type of convergence of properly weighted samples to the target distribution. Thus, properly weighted samples and, in particular, the output of the standard IS method, are in general serious competitors of MCMC methods.

In the case where the Central Limit Theorem holds for $\hat h_n$, its asymptotic variance is
$$\sigma^2(h) = \sigma^2_{\mathrm{IS}}(h) + \frac{1}{c^2}E_g[\mathrm{Var}\{\omega|X\}\{h(X) - E_\pi(h)\}^2].$$
Here, $\sigma^2_{\mathrm{IS}}(h)$ is the asymptotic variance of the IS estimator $\hat h_n^{\mathrm{IS}}$ in (2) (which arises using $\omega_n \equiv w(X_n)$ deterministically). This follows immediately from the identity
$$E_g\{\omega_i h(X_i)\,\omega_j f(X_j)\} = \begin{cases} c^2 E_g\{w(X_i)^2 h(X_i)f(X_i)\} + E_g\{\mathrm{Var}(\omega_i|X_i)h(X_i)f(X_i)\}, & i = j,\\ c^2 E_g\{w(X_i)h(X_i)w(X_j)f(X_j)\}, & i \ne j, \end{cases}$$
and the formula for the asymptotic variance of a ratio estimator. As is intuitively reasonable, randomization of the weights increases the variance of the estimator. Hence, for estimation purposes, the IS estimator is the most accurate and should be preferred whenever possible. Otherwise, the less variable the weights are, the more accuracy is achieved.

The requirement that the distribution of $\omega_n$ depends only on $x_n$ seems rather restrictive. However, $x_n$ could be a block of specific size, allowing $\omega_n$ to depend on more than one term of the original sequence. (Note that the standard definition of a semi-Markov process allows the sojourn time of $X_n$ to depend on both $X_n$ and $X_{n+1}$.) This is illustrated via two examples in Section 3.2.

Another potential application of Proposition 2.1 is the following: Let $\pi$ be a target distribution of which some full conditional distributions are difficult to sample from. Instead of hybridizing the Gibbs sampler using suitable Metropolis steps, one can replace $\pi$ by another target distribution $g$ of which all full conditional distributions are easily handled,
run the Gibbs sampler and finally weight the output. The weighted sample will be the realization of a converging jump process (see Section 3.3).

3. Examples

In this section we connect some known simulation schemes with the context of jump processes. Moreover, we give a simple example of how one can facilitate MCMC algorithms by discretization of the state space followed by proper weighting of the sample, so that the associated jump process converges to the target distribution.

3.1. The Metropolis–Hastings algorithm

Consider an arbitrary Metropolis–Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970) with target distribution $\pi$ and proposal $q(\cdot|\cdot)$; that is, at time $t+1$, given $Y_t = y$, draw $Z \sim q(z|y)$ and set $Y_{t+1} = z$ with probability
$$a(y, z) = \min\left\{1, \frac{\pi(z)q(y|z)}{\pi(y)q(z|y)}\right\}, \qquad (4)$$
or $Y_{t+1} = y$ with probability $1 - a(y, z)$. Although it is well-known that the algorithm defines a reversible Markov chain with stationary distribution $\pi$, let us consider it from a different point of view. Let $X = (X_n)_{n \in \mathbb{Z}_+}$ be a Markov chain with transition density
$$g(x_i|x_{i-1}) = \frac{a(x_{i-1}, x_i)\,q(x_i|x_{i-1})}{\int a(x_{i-1}, z)\,q(z|x_{i-1})\,\mu(dz)} = \frac{\min\{\pi(x_{i-1})q(x_i|x_{i-1}),\ \pi(x_i)q(x_{i-1}|x_i)\}}{\int \min\{\pi(x_{i-1})q(z|x_{i-1}),\ \pi(z)q(x_{i-1}|z)\}\,\mu(dz)}.$$
(Notice that this is exactly the density of the accepted states of the above MH algorithm.) It can be easily verified that $g(x_i|x_{i-1})$ satisfies the detailed balance condition $g(x_{i-1})g(x_i|x_{i-1}) = g(x_i)g(x_{i-1}|x_i)$, where
$$g(x) \propto \int \min\{\pi(x)q(z|x),\ \pi(z)q(x|z)\}\,\mu(dz).$$
This function, when normalized, results in a probability density function, since $\int g(x)\,\mu(dx) \le \int\!\!\int \pi(x)q(z|x)\,\mu(dz)\,\mu(dx) = 1$; hence it is the stationary distribution of the Markov chain $X$. Weight now $x_i$ by $\omega_i$ drawn from the geometric distribution with probability mass function
$$p(\ell|x_i) = \left(\int a(x_i, z)q(z|x_i)\,\mu(dz)\right)\left(1 - \int a(x_i, z)q(z|x_i)\,\mu(dz)\right)^{\ell-1}, \quad \ell = 1, 2, \ldots.$$
Since
$$E\{\omega|x_i\} = \left(\int a(x_i, z)q(z|x_i)\,\mu(dz)\right)^{-1} \propto \pi(x_i)/g(x_i), \qquad (5)$$
the sequence $(X_n, \omega_n)_{n \in \mathbb{Z}_+}$ is properly weighted with respect to $\pi$. It is immediately seen that the associated jump process is the original MH output $(Y_t)_{t \in \mathbb{Z}_+}$, which is known to be a pure Markov chain (rather than a general discrete time semi-Markov process). The above analysis suggests that we are allowed to use any distribution (beyond the geometric) for the weights, provided (5) is satisfied. In particular, if $\omega_i \propto w(x_i) = \pi(x_i)/g(x_i)$ is chosen, the variance of estimators of certain expectations of interest will be minimized. However, direct calculation of this importance weight is in general computationally costly or even infeasible, making such a task hard to accomplish. Moreover, the geometric distribution comes out naturally, since each simulation from $g(\cdot|\cdot)$ automatically generates the corresponding geometric weight.
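The correspondence is easy to see in code. The sketch below (our toy illustration: a standard normal target and random-walk proposal, both our assumptions) stores each accepted state once and lets each rejection increment that state's geometric sojourn time:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_pi(x):                  # illustrative target: standard normal
    return -0.5 * x**2

def mh_weighted(n_iter=50_000, step=2.5):
    y = 0.0
    states, weights = [y], [1]   # sojourn time = 1 + rejections at that state
    for _ in range(n_iter):
        z = y + step * rng.standard_normal()     # symmetric random-walk proposal
        if np.log(rng.uniform()) < log_pi(z) - log_pi(y):
            states.append(z); weights.append(1)  # accept: new state X_n
            y = z
        else:
            weights[-1] += 1                     # reject: geometric weight grows
    return np.array(states), np.array(weights)

xs, ws = mh_weighted()
# The weighted average over (X_n, omega_n) equals the usual MH ergodic average:
print(np.sum(ws * xs**2) / np.sum(ws))           # approx E_pi(X^2) = 1
```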
Example 3.1. The exact values of the underlying importance weights can be evaluated only for certain toy examples such as the following one. Consider the independence MH algorithm with target distribution $\pi(x) = e^{-x}I(x > 0)$ and proposal $q(z|x) = q(z) = \theta e^{-\theta z}I(z > 0)$. Then, after straightforward calculations it can be seen that
$$g(x) \propto e^{-(\theta+1)x}\min\{\theta - 1 + e^{\theta x},\ 1 - \theta + \theta e^{x}\}. \qquad (6)$$
The above minimum depends on whether $\theta < 1$ or $\theta > 1$. Hence, for any accepted state $x_i$, the use of $w(x_i) = \pi(x_i)/g(x_i)$ instead of its original sojourn time $\omega_i$ reduces the variance of any ergodic mean. Moreover, the (continuous time) jump process associated with the weighted sequence $(x_n, w(x_n))_{n \in \mathbb{Z}_+}$ has $\pi$ as limit distribution.

Although the underlying importance weights cannot in general be calculated, they can be approximated using the original MH output. Observe that
$$\int a(x, z)q(z|x)\,\mu(dz) = \int \min\left\{\frac{q(z|x)}{\pi(z)}, \frac{q(x|z)}{\pi(x)}\right\}\pi(z)\,\mu(dz) = E_\pi\left[\min\left\{\frac{q(Z|x)}{\pi(Z)}, \frac{q(x|Z)}{\pi(x)}\right\}\right],$$
hence, setting $A_x := \{z : \pi(x)q(z|x) \le \pi(z)q(x|z)\}$, $w(x_i)$ may be estimated by
$$\hat w(x_i) = \left\{\frac{1}{\sum_{j=1}^n \omega_j}\left[\sum_{j=1}^n \omega_j\frac{q(x_j|x_i)}{\pi(x_j)}I(x_j \in A_{x_i}) + \frac{1}{\pi(x_i)}\sum_{j=1}^n \omega_j q(x_i|x_j)I(x_j \in A^c_{x_i})\right]\right\}^{-1}.$$
In fact, $\hat w(x_i)$ approximates a quantity proportional to $w(x_i)$, but clearly this is all we need. In particular, if the proposal distribution is symmetric, i.e. if $q(z|x) = q(x|z)$ for all $x, z$, then the integral simplifies to
$$\int a(x, z)q(z|x)\,\mu(dz) = \int_{A_x} q(z|x)\,\mu(dz) + \frac{1}{\pi(x)}\int_{A^c_x} q(x|z)\pi(z)\,\mu(dz).$$
There are cases where the first integral can be analytically evaluated, provided that $A_x$ is available in closed form. For example, this happens when $q(z|x)$ is a well-known distribution such as the normal distribution with mean $x$. Therefore, $w(x_i)$ may be approximated by
$$\hat w(x_i) = \left\{\int_{A_{x_i}} q(z|x_i)\,\mu(dz) + \frac{\sum_{j=1}^n \omega_j q(x_i|x_j)I(x_j \in A^c_{x_i})}{\pi(x_i)\sum_{j=1}^n \omega_j}\right\}^{-1}.$$

Example 3.1 (Continued). In this toy example it holds that $A_x = (0, x]$ or $[x, \infty)$ according to whether $\theta < 1$ or $\theta > 1$, and so the estimates of the $w(x_i)$'s become
$$\hat w(x_i) = \left\{\frac{1}{\sum_{j=1}^n \omega_j}\left[\sum_{j=1}^n \omega_j\theta e^{(1-\theta)x_j}I(x_j \in A_{x_i}) + \theta e^{(1-\theta)x_i}\sum_{j=1}^n \omega_j I(x_j \in A^c_{x_i})\right]\right\}^{-1}, \quad i = 1, \ldots, n.$$
We have run the MH algorithm with $\theta = 0.5$ and have generated a stationary chain of length $n = 10\,000$. Fig. 3 shows the convergence of $\sum_{i=1}^n \omega_i x_i^2/\sum_{i=1}^n \omega_i$, i.e. the original MH output estimate, and of $\sum_{i=1}^n \hat w(x_i)x_i^2/\sum_{i=1}^n \hat w(x_i)$ to the second moment of the target distribution (which equals two). We have not included here the sequence of the IS estimates arising from using the exact weights in (6), since it is indistinguishable from that using the estimated ones. Moreover, the histogram plotted using the estimated weights shows quite a good fit to the target distribution.

3.2. Importance weighting an MCMC output

Let $y_0, y_1, y_2, \ldots$ be the output of any MCMC updating scheme having target distribution $\pi$ and updating distribution $\kappa(y_n|y_{n-1})$. A crude method for diagnosing convergence of the chain to $\pi$ is checking the convergence of many estimators of some quantities of interest $E_\pi(h)$ (cf. Robert and Casella, 1999, p. 382).
Fig. 3. The left graph shows the convergence of the weighted means $\sum_{i=1}^n \omega_i x_i^2/\sum_{i=1}^n \omega_i$ (dotted line) and $\sum_{i=1}^n \hat w(x_i)x_i^2/\sum_{i=1}^n \hat w(x_i)$ (solid line) to the second moment of the target distribution $\pi(x) = e^{-x}$ in Example 3.1. The right graph shows the histogram of the weighted sample $(x_i, \hat w(x_i))_{1 \le i \le 10\,000}$ together with $\pi(x)$.
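A sketch of the simulation behind Example 3.1 and Fig. 3 (our code, with $\theta = 0.5$; as a simplification we start the chain from a draw from the proposal rather than exactly in stationarity, and we compare the geometric-weight estimate against the exact weights from (6) rather than against $\hat w$):

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 0.5                                   # proposal rate, as in the example

def w_exact(x):                               # pi(x)/g(x) up to a constant, from (6)
    return np.exp(theta * x) / np.minimum(theta - 1 + np.exp(theta * x),
                                          1 - theta + theta * np.exp(x))

n_iter = 10_000
x = rng.exponential(1 / theta)                # initial state drawn from q (assumption)
states, sojourn = [x], [1]
for _ in range(n_iter):
    z = rng.exponential(1 / theta)
    # independence-sampler acceptance ratio: [pi(z)/q(z)] / [pi(x)/q(x)]
    if rng.uniform() < np.exp((theta - 1) * (z - x)):
        states.append(z); sojourn.append(1)
        x = z
    else:
        sojourn[-1] += 1

xs, om = np.array(states), np.array(sojourn)
print(np.sum(om * xs**2) / np.sum(om))                      # MH estimate of E(X^2) = 2
print(np.sum(w_exact(xs) * xs**2) / np.sum(w_exact(xs)))    # exact-weight estimate
```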
One particular estimator used in this context is the IS estimator
$$\tilde h_n := \sum_{i=1}^n \frac{\pi(y_i)}{\kappa(y_i|y_{i-1})}h(y_i)\Big/\sum_{i=1}^n \frac{\pi(y_i)}{\kappa(y_i|y_{i-1})},$$
provided that the ratios $\pi(y_i)/\kappa(y_i|y_{i-1})$ can be calculated up to a constant. By setting $\omega_i := \pi(y_i)/\kappa(y_i|y_{i-1})$, it can be seen that the jump process associated with the weighted sequence $(y_i, \omega_i)_{i \in \mathbb{N}}$ also converges to $\pi$. Indeed, let $x_i = (x_i^{(1)}, x_i^{(2)}) := (y_{i-1}, y_i)$. Then, the sequence $(x_n)_{n \in \mathbb{N}}$ forms a Markov chain with transition density
$$g(x_i|x_{i-1}) = \delta_{x_{i-1}^{(2)}}(x_i^{(1)})\,\kappa(x_i^{(2)}|x_i^{(1)}),$$
where $\delta_x(\cdot)$ denotes the Dirac measure at $x$, and limit distribution $g(x) = \pi(x^{(1)})\kappa(x^{(2)}|x^{(1)})$. Setting
$$\omega_i = \frac{\pi(x_i^{(1)})\pi(x_i^{(2)})}{\pi(x_i^{(1)})\kappa(x_i^{(2)}|x_i^{(1)})} = \frac{\pi(y_i)}{\kappa(y_i|y_{i-1})}, \quad i = 1, 2, \ldots,$$
we are led to the sequence $(x_i, \omega_i)_{i \in \mathbb{N}}$, which is properly weighted with respect to the product $\pi(x^{(1)})\pi(x^{(2)})$. Thus, the jump process associated with the (marginal) sequence $(x_i^{(2)}, \omega_i)_{i \in \mathbb{N}} = (y_i, \omega_i)_{i \in \mathbb{N}}$ has $\pi$ as limit distribution.

Robert and Casella (1999) discuss another approach providing convergent estimators in the context of MH algorithms. As in Section 3.1, consider a general MH algorithm with target distribution $\pi$ and proposal $q(\cdot|\cdot)$. Denote now by $y_0, y_1, y_2, \ldots$ the whole simulated output (containing also the rejected proposals). Define $\tau_i := \sup\{j < i : y_j \text{ was accepted}\}$ and notice that $y_{\tau_i}$ is the last accepted value before time $i$, implying that $y_i$ has been generated from $q(\cdot|y_{\tau_i})$. The IS estimator of $E_\pi(h)$ used in this context is
$$\tilde h_n := \sum_{i=1}^n \frac{\pi(y_i)}{q(y_i|y_{\tau_i})}h(y_i)\Big/\sum_{i=1}^n \frac{\pi(y_i)}{q(y_i|y_{\tau_i})}.$$
By setting $\omega_i = \pi(y_i)/q(y_i|y_{\tau_i})$, it can be seen that the jump process associated with the weighted sequence $(y_i, \omega_i)_{i \in \mathbb{N}}$ converges to $\pi$. To see this, define $x_i = (x_i^{(1)}, x_i^{(2)}) := (y_{\tau_i}, y_i)$. If $\tau_i < i - 1$, then $\tau_i = \tau_{i-1}$ and thus $x_i^{(1)} = x_{i-1}^{(1)}$. On the other hand, $\tau_i = i - 1$ implies that $x_i^{(1)} = x_{i-1}^{(2)}$. The latter occurs with probability $a(x_{i-1}^{(1)}, x_{i-1}^{(2)})$ (given in (4)), whereas the former with probability $1 - a(x_{i-1}^{(1)}, x_{i-1}^{(2)})$. Hence, the sequence $x_1, x_2, \ldots$ forms a Markov chain with transition density
$$g(x_i|x_{i-1}) = \left\{[1 - a(x_{i-1}^{(1)}, x_{i-1}^{(2)})]\,\delta_{x_{i-1}^{(1)}}(x_i^{(1)}) + a(x_{i-1}^{(1)}, x_{i-1}^{(2)})\,\delta_{x_{i-1}^{(2)}}(x_i^{(1)})\right\}q(x_i^{(2)}|x_i^{(1)}).$$
The stationary distribution of this chain is $\pi(x^{(1)})q(x^{(2)}|x^{(1)})$, since $y_{\tau_1}, y_{\tau_2}, \ldots$ is the original MH output. Weighting each $x_i$ by
$$\omega_i = \frac{\pi(x_i^{(1)})\pi(x_i^{(2)})}{\pi(x_i^{(1)})q(x_i^{(2)}|x_i^{(1)})} = \frac{\pi(y_i)}{q(y_i|y_{\tau_i})},$$
we obtain the desired result for the marginal sequence $(x_i^{(2)}, \omega_i)_{i \in \mathbb{N}} = (y_i, \omega_i)_{i \in \mathbb{N}}$.

3.3. Modification of the target distribution: Dugongs data set

Besides its theoretical interest, Proposition 2.1 can be used in order to facilitate MCMC approximations. According to this proposition, one is allowed to run a more convenient MCMC algorithm with a different target distribution and then weight the output properly. The approach is illustrated below on the well known dugongs data set, which was originally analyzed by Ratkowsky (1983). This particular data set is among the standard WinBUGS examples and has been used by many authors in order to illustrate and compare several sampling techniques. The data consist of length ($y$) and age ($x$) measurements for $n = 27$ dugongs (sea cows) captured near Townsville, Queensland. Carlin and Gelfand (1991) modelled the data using a nonlinear growth curve with no inflection point and an asymptote as $x$ tends to infinity. Specifically, they assumed that
$$y_i \sim N(\alpha - \beta\gamma^{x_i}, \tau^{-1}), \quad i = 1, \ldots, n,$$
where $\alpha, \beta, \tau > 0$ and $0 < \gamma < 1$. We consider the following relatively vague prior distributions for the parameters:
$$\alpha \sim N(0, \lambda^{-1})I(\alpha > 0), \quad \beta \sim N(0, \lambda^{-1})I(\beta > 0), \quad \gamma \sim U(0, 1), \quad \tau \sim G(k, k),$$
with $\lambda = 10^{-4}$ and $k = 10^{-3}$. The posterior distribution of $\theta = (\alpha, \beta, \gamma, \tau)$,
$$\pi(\theta|\text{data}) \propto f(\text{data}|\theta)f(\alpha)f(\beta)f(\gamma)f(\tau),$$
can be described using the output of an appropriate Gibbs sampler. Sampling from the full conditional (posterior) distributions of $\alpha$, $\beta$ (truncated normal) and $\tau$ (gamma) is a straightforward task, but this is not the case for $\gamma$. Instead of using a Metropolis step, we can adopt the following strategy. We choose a different target distribution, namely,
$$g(\theta|\text{data}) \propto f(\text{data}|\alpha, \beta, \gamma_{[m]}, \tau)f(\alpha)f(\beta)f(\gamma)f(\tau),$$
where $\gamma_{[m]} := (\lfloor m\gamma\rfloor + 0.5)/m$ for some positive integer $m$. Here, the sample space of $\gamma$, i.e. the interval $(0, 1)$, is discretized into $m$ equal length bins, with $\gamma$ being uniformly distributed within each bin. Sampling from the full conditional distribution of $\gamma$ is now an easy task: one can first draw a bin from the discrete distribution
$$p(j|\alpha, \beta, \tau, \text{data}) \propto g(\alpha, \beta, (j - 0.5)/m, \tau|\text{data}), \quad j = 1, \ldots, m, \qquad (7)$$
and then simulate $U \sim U(0, 1)$ and set $\gamma = (j + U - 1)/m$. In the above scheme the importance weights are
$$w(\theta) \propto \frac{\pi(\theta|\text{data})}{g(\theta|\text{data})},$$
and, according to Proposition 2.1, the jump process associated with the weighted output converges to $\pi$.

In order to compare the proposed approach with standard MCMC algorithms, we implemented additionally two Gibbs samplers with target distribution $\pi$. In both of them, the full conditional distribution of $\gamma$ has been sampled by incorporating corresponding Metropolis steps. In the first scheme, the updating values of $\gamma$ have been proposed from the discretized version of its full conditional distribution in (7) (with $\pi$ in the place of $g$), followed by the corresponding random perturbations. The second scheme involves a normal random walk on the logit scale as proposal; that is, at each step the logit of the proposed value has been generated from a normal distribution with mean the logit of the present one. We have tried several values for the number of bins as well as for the variance of the random walk proposal. Here, we report the results for $m = 50$ bins and variance equal to 0.25. Fig. 4(a) shows the convergence of the weighted mean $\hat\gamma_n^{\mathrm{IS}} = \sum_{i=1}^n w(\theta_i)\gamma_i/\sum_{i=1}^n w(\theta_i)$ (solid line) and of the ergodic means of $\gamma$ generated by the two Gibbs samplers to the posterior mean $E\{\gamma|\text{data}\}$ (the dashed line corresponds to the scheme with the random walk proposal). They all have been computed from the outputs of 10 000 updates after burn-in periods of 1000 iterations. As we can see, the estimates of the posterior mean of $\gamma$ obtained by the proposed method and the scheme with the discretized proposal behave similarly, whereas the one corresponding to the scheme involving the random walk seems to exhibit considerably slower convergence. The graphs for the rest of the parameters are similar. It is also worth mentioning that after many runs of the above algorithms we concluded that the proposed method needs slightly less CPU time than the scheme with the discretized proposal. This is in agreement with what is expected to happen, since the calculation of the MH acceptance probabilities requires the computation of two "importance weights" rather than one.
Fig. 4. (a) Convergence of $\hat\gamma_n^{\mathrm{IS}}$ (solid line) and of the ergodic means obtained by using a discretized proposal (dotted line) and a random walk proposal (dashed line) to the posterior expectation of $\gamma$. (b) Histogram of the weighted sample $(\gamma_i, w(\theta_i))_{1 \le i \le 10\,000}$.
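A sketch of the discretized $\gamma$-update in (7) follows (our code; variable names and the vectorization are ours, and only this one Gibbs step is shown — the $\alpha$, $\beta$, $\tau$ updates are standard and omitted):

```python
import numpy as np

rng = np.random.default_rng(4)

# One Gibbs step for gamma under the modified target g of Section 3.3.
# x, y: dugong ages and lengths; alpha, beta, tau: current parameter values.
def sample_gamma_discretized(x, y, alpha, beta, tau, m=50):
    mids = (np.arange(1, m + 1) - 0.5) / m             # bin midpoints (j - 0.5)/m
    # log-likelihood at each midpoint under y_i ~ N(alpha - beta*gamma^x_i, 1/tau);
    # the U(0,1) prior on gamma is flat and drops out
    resid = y[None, :] - (alpha - beta * mids[:, None] ** x[None, :])
    logp = -0.5 * tau * np.sum(resid**2, axis=1)
    p = np.exp(logp - logp.max())
    j = rng.choice(m, p=p / p.sum()) + 1               # draw a bin, as in (7)
    return (j + rng.uniform() - 1) / m                 # uniform within the bin
```

Each completed draw $\theta_i$ would then be weighted by the ratio of the likelihood at $\gamma_i$ to the likelihood at the bin midpoint $\gamma_{i,[m]}$, which is what $w(\theta) \propto \pi(\theta|\text{data})/g(\theta|\text{data})$ reduces to here, since the priors cancel.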
4. Stationary weighted samples

Hereafter we denote by $p(v|x)\lambda(dv)$ the conditional distribution of $\omega$ given $x$, with $\lambda$ being either the Lebesgue measure (continuous weights) or the counting measure (discrete weights). Moreover, we set
$$\bar P(u|x) := \int_{[u,\infty)} p(v|x)\,\lambda(dv) = P(\omega \ge u|x).$$
In this section, we will discuss a slight modification of the standard weighting method under which one can obtain stationary weighted samples from the target distribution $\pi$. We will first need some facts from the theory of semi-Markov processes. Let $(Y_t)_{t \ge 0}$ be a semi-Markov process with embedded Markov chain $(X_n)_{n \in \mathbb{Z}_+}$ and respective sojourn times $(\omega_n)_{n \in \mathbb{Z}_+}$. Let also $S_n$ and $N_t$ be as in Definition 2.3. Then, $(X_n, S_n)_{n \in \mathbb{Z}_+}$ is a Markov renewal process and $(N_t)_{t \ge 0}$ is its counting process. The corresponding excess life (residual age) process is defined by $(V_t)_{t \ge 0}$, where $V_t := S_{N_t+1} - t$. Then, $(Y_t, V_t)_{t \ge 0}$ is a Markov process with stationary distribution $\pi(y)p_e(v|y)$, where
$$p_e(v|y) := \frac{\bar P(v|y)}{w(y)}$$
(cf. McDonald, 1977). In the context of weighted samples, this suggests the following. Assume that it is possible to generate $X_0 \sim \pi$. Then, if $X_0$ is weighted by $\omega_0$ drawn from $p_e(\cdot|x_0)$ (instead of $p(\cdot|x_0)$), the excess life process $(Y_t, V_t)_{t \ge 0}$ starts in equilibrium and thus $Y_t \sim \pi$ for all $t \ge 0$. As a consequence, the estimator
$$\hat h_t := t^{-1}\int_0^t h(Y_s)\,\lambda(ds) = t^{-1}\Big\{\sum_{j=0}^{N_t-1}\omega_j h(X_j) + (t - S_{N_t})h(X_{N_t})\Big\}$$
is exactly unbiased for $E_\pi(h)$ for any fixed time $t > 0$. In the case of IS, we have that $p_e(v|y) \sim U(0, w(y))$. Hence, if $X_0 \sim \pi$, $U \sim U(0, 1)$ and $X_1, X_2, \ldots$ are iid draws from $g$ (or an ergodic Markov chain with stationary distribution $g$), the estimator $\hat h_t$ becomes
$$\hat h_t = t^{-1}\Big\{U w(X_0)h(X_0) + \sum_{j=1}^{N_t-1} w(X_j)h(X_j) + (t - S_{N_t})h(X_{N_t})\Big\}.$$
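A minimal sketch of this exactly unbiased estimator in the IS case follows (our code; the exact draw $X_0 \sim \pi$ is produced by rejection sampling, which is possible here under the assumption that $w(x) \le w^*$ for a known bound `w_star`):

```python
import numpy as np

rng = np.random.default_rng(5)

# Stationary-start IS estimator h_hat_t: X_0 ~ pi with sojourn U*w(X_0) (the
# p_e distribution), then X_1, X_2, ... iid from g with sojourns w(X_j).
def h_hat_t(t, h, pi_d, g_d, g_rvs, w_star):
    w = lambda x: pi_d(x) / g_d(x)
    while True:                              # rejection sampling: X_0 ~ pi
        x0 = g_rvs()
        if rng.uniform() < w(x0) / w_star:
            break
    s = rng.uniform() * w(x0)                # omega_0 ~ U(0, w(X_0)) = p_e
    total = min(s, t) * h(x0)
    while s < t:                             # accumulate int_0^t h(Y_s) ds
        x = g_rvs()
        total += min(w(x), t - s) * h(x)     # truncate the last sojourn at t
        s += w(x)
    return total / t                         # unbiased for E_pi(h) at fixed t
```

The truncation of the final sojourn at time $t$ is exactly the $(t - S_{N_t})h(X_{N_t})$ term of the display above.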
In general, simulation from $p_e(v|y)$ can be completed in two stages. First generate a random variate $T \sim q(t|y) \propto t\,p(t|y)$ and then draw $V$ from the uniform distribution on $(0, T)$ (continuous case) or on $\{1, \ldots, T\}$ (discrete case). This is quite an easy task for many standard distributions. For example, if $p$ is a gamma distribution with shape $a$, then $q$ is also a gamma distribution but with shape $a + 1$. Or, if $p$ is Poisson, then $q$ is a truncated Poisson with the same parameter.

5. Rates of convergence in some special cases

An important aspect of MCMC simulation is the convergence rate of the resulting chain to the target distribution. There is an extensive literature on this subject and several authors have provided bounds for the total variation distance between the probability law of the current state and the limit distribution. For instance, in the context of the independence MH algorithm, i.e. when $q(y|z) = q(y)$ (using the notation of Section 3), Mengersen and Tweedie (1996) showed that if $\tilde w^* = \sup_x \pi(x)/q(x) < \infty$, the algorithm is uniformly ergodic with
$$\|P\{Y_t \in \cdot\} - \pi\| \le (1 - 1/\tilde w^*)^t, \quad t \in \mathbb{Z}_+,$$
where $\|\nu\| = \sup_{A \in \mathcal{B}(\mathcal{X})}|\nu(A)|$ denotes as usual the total variation of the signed measure $\nu$. In general, this is the case if $\sup_{x,z}\pi(x)/q(x|z) < \infty$. On the other hand, little is known about convergence rates of semi-Markov processes. In this section, we attempt to give some first results in certain special cases.

It seems intuitively clear that the rate of convergence of a semi-Markov process to its stationary distribution depends on the rate of convergence of the embedded Markov chain and on the distribution of the sojourn times. We first provide a result for the special case of exponential sojourn times, i.e. when the semi-Markov process is a pure Markov jump process. We prove that when the embedded Markov chain is Doeblin and $w(x)$ is bounded above, the process exhibits a uniformly ergodic behavior.

Theorem 5.1. Let $(X_n, \omega_n)_{n \in \mathbb{Z}_+}$ be a weighted sequence satisfying the following:

1. The sequence $X = (X_n)_{n \in \mathbb{Z}_+}$ is an ergodic Markov chain with stationary distribution $g$ and its transition distribution $g(\cdot|\cdot)$ satisfies a Doeblin condition; equivalently, there exist a probability distribution $g_0$ and a nonnegative constant $\varepsilon$ such that
$$g(z|y) \ge \varepsilon g_0(z), \quad \forall z, y \in \mathcal{X}. \qquad (8)$$

2. The conditional distribution of $\omega_n$ given $X_n = x$ is exponential with mean $w(x) = \pi(x)/g(x)$, where $\pi$ is a probability distribution, and $w^* = \sup_{x \in \mathcal{X}} w(x) < \infty$.

Then, for the associated Markov jump process $Y = (Y_t)_{t \ge 0}$ it holds that
$$\|P(Y_t \in \cdot) - \pi\| \le \exp\{-\varepsilon t/w^*\}, \quad t \ge 0. \qquad (9)$$
The proof of the theorem can be found in the Appendix. In the case where the $X_n$'s are independent $g$-distributed random variates, (8) holds with $\varepsilon = 1$ and $g_0 = g$; thus the total variation distance in (9) is no greater than $\exp\{-t/w^*\}$. This bound is comparable to that of the previous cases, although the actual importance distribution is different (since here no rejection control has been taken into account).

Example 5.1. Consider again the distributions of Example 2.1, for which it can be verified that $w^* \approx 6.905$. Let $X = (X_n)_{n \in \mathbb{Z}_+}$ be iid Cauchy $C(0, 10)$ random variates and $\omega_n$ be exponential with mean $w(x_n)$. Then, by Theorem 5.1, for the associated Markov jump process $Y$ it holds that $\|P(Y_t \in \cdot) - \pi\| \le \exp\{-t/6.905\}$, $t \ge 0$. For instance, for $t \ge 31.8$, the total variation distance will be less than 0.01.

When $(X_n)_{n \in \mathbb{Z}_+}$ is an iid sequence from $g$ and the hazard rate $p(v|y)/\bar P(v|y)$ is uniformly bounded away from zero, i.e. if
$$\rho^* := \inf_{v,y}\frac{p(v|y)}{\bar P(v|y)} > 0, \qquad (10)$$
an accept–reject argument can instantly give a stationary semi-Markov process. Indeed, let $(x, \omega)$ be a draw from $g(x)p(\omega|x)$. Then,
$$\frac{\pi(x)p_e(\omega|x)}{g(x)p(\omega|x)} = \frac{\bar P(\omega|x)}{p(\omega|x)} \le \frac{1}{\rho^*},$$
and consequently, if $U$ is an independent $U(0, 1)$ random variate and $U \le \rho^*\bar P(\omega|x)/p(\omega|x)$ holds, $(x, \omega)$ can be considered as $\pi(x)p_e(\omega|x)$-distributed.

Lemma 5.1. If (10) holds, then (i) $w^* = \sup_{x \in \mathcal{X}} w(x) < \infty$ and (ii) there exists an $a > 1$ such that $E_g\{a^\omega\} := \int\!\!\int a^v g(y)p(v|y)\,\lambda(dv)\,\mu(dy) < \infty$.

According to part (i) of Lemma 5.1, the above approach does not offer much, since the accept–reject method applies also to the $x_n$'s. However, part (ii) can be used to obtain a crude bound for the total variation distance to stationarity for the associated semi-Markov process.

Theorem 5.2. Assume that $(X_n)_{n \in \mathbb{Z}_+}$ is an iid sequence from $g$ and (10) holds. Then for the associated jump process $Y = (Y_t)_{t \ge 0}$ we have the following: (a) There is an almost surely finite time $\tau \ge 0$ such that $Y_t \sim \pi$ for $t \ge \tau$. (b) The total variation distance between the law of $Y_t$ and $\pi$ converges to zero exponentially fast in $t$.

Theorem 5.2 does not cover the case of standard IS, since in this case $p(v|x) \sim \delta_{w(x)}(v)$ whereas $p_e(v|x) \sim U(0, w(x))$. Nevertheless, a similar result also holds.

Theorem 5.3. Assume that $(X_n)_{n \in \mathbb{Z}_+}$ is an iid sequence from $g$ weighted by $\omega_i = w(x_i) = \pi(x_i)/g(x_i)$, $i = 1, \ldots, n$. If $w^* = \sup_x w(x) < \infty$, then for the jump process $Y = (Y_t)_{t \ge 0}$ it holds that $\|P\{Y_t \in \cdot\} - \pi\| \to 0$ at an exponential rate.

6. Discussion

The aim of this paper is to stress that proper weighting of a Markov chain's output can be used in order to obtain samples from the target distribution. This is done by considering it from a different point of view, namely, associating the weighted sample with an appropriate weakly convergent jump process. Many well known simulation schemes, including the Metropolis–Hastings algorithm, fall within this context. Moreover, contrary to what is commonly thought, this is the case for the standard importance sampling output. Hence, in order to obtain (approximate) samples from the target distribution, the time-consuming SIR algorithm may be bypassed.

The independence MH algorithm with proposal distribution $g$ (i.e., the importance distribution) is a natural competitor to standard IS. From the point of view of this paper, this competition is naturally transferred to the Markov chain generated by the MH and the jump process associated with the importance weighted sequence. The comparison of these stochastic processes is similar to that of SIR and IS. When a "good" state is visited by any of these stochastic processes, then both remain there for a long time. On the other hand, "bad" states are in general rejected by the MH algorithm and consequently they are avoided by the generated Markov chain, whereas the jump process stays on them for a negligible time. However, in Section 3.1 we have seen that the MH chain is a properly weighted sequence with an importance distribution which differs from the proposal. Therefore, the comparison of this particular Markov chain with the jump process associated with the importance weighted sequence is rather unfair. The true competitors of the MH chain should be the jump processes associated with properly weighted sequences whose underlying importance weights are given in (5).

Since semi-Markov processes are extensions of Markov chains, many aspects concerning the rate of their convergence are expected to be similar. However, their study seems to be more complicated, since the behavior of a semi-Markov process depends on its sojourn times' distribution as well. Things are much easier when the mean sojourn time is bounded above. In the context of simulation, this is equivalent to boundedness of the importance ratio.

A possible extension of the weighting procedure seems to arise when $X_{n+1}$ is drawn given $X_n$ and its (possibly random) weight $\omega_n$. Alternatively, this can be done by sampling first $X_{n+1}$ given $X_n$ and then weighting $X_n$ by $\omega_n$
generated from its conditional distribution given $X_n$ and $X_{n+1}$. In order for this scheme to be valid, it must hold that
$$E\{\omega_n|x_n, x_{n+1}\} \propto w(x_n, x_{n+1}) = \frac{\pi(x_n)f(x_{n+1}|x_n)}{g(x_n)g(x_{n+1}|x_n)},$$
where $g(\cdot|\cdot)$ is the transition distribution of the Markov chain $(X_n)_{n \in \mathbb{Z}_+}$, $g(\cdot)$ is its limit distribution and $f(\cdot|\cdot)$ is any probability density function which is absolutely continuous with respect to the underlying measure. However, this is a special case of the current semi-Markov process context as well, since it can be considered as giving rise to the Markov chain $(Z_n)_{n \in \mathbb{Z}_+}$ with $Z_n = (Z_n^{(1)}, Z_n^{(2)}) = (X_n, X_{n+1})$, which has limit distribution $g(z^{(1)})g(z^{(2)}|z^{(1)})$ and is properly weighted with respect to $\pi(z^{(1)})f(z^{(2)}|z^{(1)})$.

Another interesting simulation scheme arises if both $X_{n+1}$ and its weight $\omega_{n+1}$ depend on $X_n$ and $\omega_n$. For instance, this is the case for the dynamic weighting approach of Liu et al. (2001). However, the jump processes associated with such weighted sequences are not semi-Markov, raising doubts about their convergence.

Besides its theoretical interest, the benefits of the proposed point of view have not yet been explored to their full extent. We feel that it can prove quite useful when applied in conjunction with discretization of the state space (as in Section 3.3). This is the subject of our current research.
Acknowledgments The authors wish to thank the two anonymous referees for their comments and suggestions which have improved the presentation.
Appendix

In order to prove Theorem 5.1 we will need the following lemma.

Lemma A.1. Under the assumptions of Theorem 5.1 it holds that
$$\lim_{t \downarrow 0}\left[1 - \varepsilon(1 - e^{-t/w^*})\int_{\mathcal{X}} e^{-t/w(z)}g_0(z)\,\mu(dz)\right]^{1/t} = \exp\{-\varepsilon/w^*\}. \qquad (11)$$

Proof. Define first
$$a(t) := 1 - \varepsilon(1 - e^{-t/w^*})\int_{\mathcal{X}} e^{-t/w(z)}g_0(z)\,\mu(dz) = 1 - \varepsilon(1 - e^{-t/w^*}) + \varepsilon(1 - e^{-t/w^*})\int_{\mathcal{X}}(1 - e^{-t/w(z)})g_0(z)\,\mu(dz),$$
and observe that (8) implies $g(z) \ge \varepsilon g_0(z)$, $\forall z$. Hence,
$$\varepsilon\int_{\mathcal{X}}(1 - e^{-t/w(z)})g_0(z)\,\mu(dz) \le \int_{\mathcal{X}}(1 - e^{-t/w(z)})g(z)\,\mu(dz) = P_g(\omega_0 \le t) = o(1).$$
Since $1 - e^{-t/w^*} = O(t)$, we have that $\varepsilon(1 - e^{-t/w^*})\int_{\mathcal{X}}(1 - e^{-t/w(z)})g_0(z)\,\mu(dz) = o(t)$ and thus
$$1 - \varepsilon(1 - e^{-t/w^*}) \le a(t) \le 1 - \varepsilon(1 - e^{-t/w^*}) + o(t).$$
Raising to the power of $1/t$ and taking the limits as $t \downarrow 0$, we obtain the desired result. □

Proof of Theorem 5.1. Since $(Y_t)_{t \ge 0}$ is a Markov process, every $\Delta$-skeleton, i.e. any sequence $(Y_{n\Delta})_{n \in \mathbb{Z}_+}$ for fixed $\Delta > 0$, forms a Markov chain with transition kernel $P_\Delta(y, A) = P\{Y_\Delta \in A|Y_0 = y\}$. But then,
$$P_\Delta(y, A) = P\{\omega_0 > \Delta, y \in A|X_0 = y\} + \sum_{m=1}^{\infty} P\{S_m \le \Delta < S_{m+1}, X_m \in A|X_0 = y\}$$
$$\ge P\{\omega_0 \le \Delta, \omega_1 > \Delta, X_1 \in A|X_0 = y\} = \{1 - e^{-\Delta/w(y)}\}\int_A e^{-\Delta/w(z)}g(z|y)\,\mu(dz)$$
$$\ge \varepsilon\{1 - e^{-\Delta/w^*}\}\int_A e^{-\Delta/w(z)}g_0(z)\,\mu(dz) = \beta_\Delta Q_\Delta(A) \quad \forall y,$$
where
$$\beta_\Delta := \varepsilon\{1 - e^{-\Delta/w^*}\}\int_{\mathcal{X}} e^{-\Delta/w(z)}g_0(z)\,\mu(dz) \quad \text{and} \quad Q_\Delta(A) := \frac{\int_A e^{-\Delta/w(z)}g_0(z)\,\mu(dz)}{\int_{\mathcal{X}} e^{-\Delta/w(z)}g_0(z)\,\mu(dz)}.$$
Thus, the $\Delta$-skeleton chain satisfies a Doeblin condition. It is well known that this implies
$$\|P(Y_{n\Delta} \in \cdot) - \pi\| \le (1 - \beta_\Delta)^n \quad \forall n \in \mathbb{Z}_+;$$
hence, writing $t = n\Delta$, we conclude that $\|P(Y_t \in \cdot) - \pi\| \le \{(1 - \beta_{t/n})^{n/t}\}^t$ for all $n \in \mathbb{Z}_+$. However, $\lim_{n \uparrow \infty}(1 - \beta_{t/n})^{n/t} = \lim_{t \downarrow 0}(1 - \beta_t)^{1/t}$ and the result follows from Lemma A.1. □
Proof of Lemma 5.1. (i) We have that
$$\frac{p(v|y)}{\bar P(v|y)} > \rho^* \ \forall v, y \iff p(v|y) > \rho^*\bar P(v|y) \ \forall v, y$$
$$\Rightarrow \int p(v|y)\,\lambda(dv) > \rho^*\int \bar P(v|y)\,\lambda(dv) \ \forall y \iff 1 > \rho^* w(y) \ \forall y \iff w(y) < 1/\rho^* \ \forall y$$
(using $\int \bar P(v|y)\,\lambda(dv) = E\{\omega|y\} = w(y)$); thus $w^* = \sup_y w(y) \le 1/\rho^* < \infty$.

(ii) We will first prove that $E\{\omega^m|y\} \le m!/\rho^{*m}$, $m \ge 1$, by induction. By (i) the assertion holds for $m = 1$. Assuming that it holds for some $m$, we have
$$E\{\omega^m|y\} = \int v^m p(v|y)\,\lambda(dv) \ge \rho^*\int v^m \bar P(v|y)\,\lambda(dv) = \rho^*\int v^m\left(\int_{[v,\infty)} p(u|y)\,\lambda(du)\right)\lambda(dv)$$
$$= \rho^*\int\left(\int_{(0,u]} v^m\,\lambda(dv)\right)p(u|y)\,\lambda(du) \ge \rho^*\int\left(\int_{v=0}^{u} v^m\,dv\right)p(u|y)\,\lambda(du)$$
$$= \frac{\rho^*}{m+1}\int u^{m+1}p(u|y)\,\lambda(du) = \frac{\rho^*}{m+1}E\{\omega^{m+1}|y\} \quad \forall y,$$
and thus $E\{\omega^{m+1}|y\} \le (m+1)!/\rho^{*(m+1)}$. Now,
$$E\{a^\omega|y\} = E\{e^{\omega\log a}|y\} = \sum_{m=0}^{\infty}\frac{(\log a)^m}{m!}E\{\omega^m|y\} \le \sum_{m=0}^{\infty}\frac{(\log a)^m}{m!}\frac{m!}{\rho^{*m}} = \frac{1}{1 - \log a/\rho^*} < \infty$$
for any $a \in [1, \exp\{\rho^*\})$, the same holding true also for $E_g\{a^\omega\} = E_g[E\{a^\omega|Y\}]$. □

Proof of Theorem 5.2. (a) After $(X_n, \omega_n)$ has been drawn, generate an independent random variate $U_n \sim U(0, 1)$. Let $N$ be the first index $n \in \mathbb{Z}_+$ at which $U_n \le \rho^*\bar P(\omega_n|x_n)/p(\omega_n|x_n)$ occurs. Then $(Y_{S_N}, V_{S_N}) = (X_N, \omega_N) \sim \pi(\cdot)p_e(\cdot|\cdot)$. Since $\pi p_e$ is stationary for $(Y_t, V_t)_{t \ge 0}$, we have that $Y_t \sim \pi$ for every $t \ge \tau := S_N$. Now, by the standard theory of rejection sampling, $N + 1 \sim G(\rho^*)$. Moreover, the "rejected" $(X, \omega)$'s have the residual distribution
$$r(x, v) := \frac{g(x)p(v|x) - \rho^*\pi(x)p_e(v|x)}{1 - \rho^*} = g(x)\frac{p(v|x) - \rho^*\bar P(v|x)}{1 - \rho^*}.$$
Notice that for any $a$ such that $\phi_g(a) := E_g\{a^\omega\} < \infty$ it also holds that $\phi_r(a) := E_r\{a^\omega\} < \infty$, since $r(x, v) \le g(x)p(v|x)/(1 - \rho^*)$ for all $x, v$. The function $\phi_r(a)$ is continuous and increasing in $a \ge 1$ with $\phi_r(1) = 1$, hence there exists an $\alpha > 1$ such that $\phi_r(\alpha)(1 - \rho^*) < 1$. Then,
$$P(\tau > t) = P(S_N > t) \le \alpha^{-t}E\{\alpha^{S_N}\} = \alpha^{-t}E\{\phi_r(\alpha)^N\} = \frac{\rho^*\alpha^{-t}}{1 - \phi_r(\alpha) + \rho^*\phi_r(\alpha)}, \qquad (12)$$
which converges to zero as $t \to \infty$, implying $P(\tau < \infty) = 1$.

(b) Let $A \in \mathcal{B}(\mathcal{X})$. Then,
$$|P\{Y_t \in A\} - \pi(A)| = |P\{Y_t \in A|\tau > t\}P\{\tau > t\} + P\{Y_t \in A|\tau \le t\}P\{\tau \le t\} - \pi(A)|$$
$$\le P\{\tau > t\}\,|P\{Y_t \in A|\tau > t\} - \pi(A)| \le P\{\tau > t\},$$
since $P\{Y_t \in A|\tau \le t\} = \pi(A)$ by (a), and $|P\{Y_t \in A|\tau > t\} - \pi(A)| \le 1$. Thus,
$$\|P\{Y_t \in \cdot\} - \pi\| = \sup_{A \in \mathcal{B}(\mathcal{X})}|P\{Y_t \in A\} - \pi(A)| \le P\{\tau > t\} = O(\alpha^{-t})$$
by (12). □
Proof of Theorem 5.3. Similarly to the proof of Theorem 5.2, after $X_n$ is drawn, generate an independent random variate $U_n \sim U(0, 1)$ and denote by $N$ the first time at which $U_n \le w(x_n)/w^*$ occurs. At time $N$, draw additionally $U' \sim U(0, 1)$ and set $\tau = S_N + w(X_N)U'$. Then $Y_\tau = X_N \sim \pi$ (by the accept–reject argument) and $V_\tau = w(X_N)(1 - U')$, which conditional on $Y_\tau = y$ is uniformly distributed on $(0, w(y))$ (see Fig. 5). Hence $(Y_\tau, V_\tau)$ has the stationary distribution of the Markov process $(Y_t, V_t)_{t \ge 0}$ and consequently $Y_t \sim \pi$ for $t \ge \tau$ by the strong Markov property.

Observe now that, conditional on $N$, $S_N + w(X_N)U'$ is stochastically smaller than $w^*\sum_{j=0}^{N} U_j$, where $U_0, U_1, \ldots, U_N$ are independent $U(0, 1)$ random variables. This is clear for $n = 0$, whereas for $n \ge 1$ it follows from the fact that the first $n - 1$ $X$'s have been "rejected", implying that $U_j > w(X_j)/w^*$, $j = 0, 1, \ldots, n - 1$. Thus,
$$P\{\tau > t\} = P\{S_N + w(X_N)U' > t\} \le P\Big\{w^*\sum_{j=0}^{N} U_j > t\Big\} = P\Big\{\sum_{j=0}^{N} U_j > t/w^*\Big\}.$$
But for fixed $n \ge 0$ and $0 \le x \le n + 1$,
$$P\Big\{\sum_{j=0}^{n} U_j > x\Big\} = 1 - \frac{1}{(n+1)!}\sum_{m=0}^{\lfloor x\rfloor}(-1)^m\binom{n+1}{m}(x - m)^{n+1}$$
Fig. 5. Definition of the time $\tau$ for the importance weighted sequence.
(cf. Feller, 1971, p. 27), where $\lfloor x\rfloor$ denotes the integer part of $x$. Thus,
$$P\Big\{\sum_{j=0}^{N} U_j > t/w^*\Big\} = 1 - \frac{1}{w^*}\sum_{n=0}^{\infty}\Big(1 - \frac{1}{w^*}\Big)^n\frac{1}{(n+1)!}\sum_{m=0}^{\lfloor t/w^*\rfloor}(-1)^m\binom{n+1}{m}(t/w^* - m)^{n+1}. \qquad (13)$$
The quantity in (13) is an upper bound for the total variation distance of the bivariate process $(Y_t, V_t)_{t \ge 0}$ to its stationary distribution, since the latter distance is less than or equal to $P\{\tau > t\}$. Thus, this is also the case for $\|P\{Y_t \in \cdot\} - \pi\|$. Finally, by the obvious inequality $\sum_{j=0}^{N} U_j \le N + 1$,
$$P\Big\{\sum_{j=0}^{N} U_j > t/w^*\Big\} \le P\{N + 1 > t/w^*\} = P\{N > t/w^* - 1\} = \Big(1 - \frac{1}{w^*}\Big)^{\lfloor t/w^*\rfloor},$$
which implies that the quantity in (13) vanishes at least exponentially with $t$. □
References

Carlin, B.P., Gelfand, A.E., 1991. An iterative Monte Carlo method for nonconjugate Bayesian analysis. Statist. Comput. 1, 119–128.
Feller, W., 1971. An Introduction to Probability Theory and its Applications, vol. II. Wiley, New York.
Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.
Limnios, N., Oprişan, G., 2001. Semi-Markov Processes and Reliability. Statistics for Industry and Technology. Birkhäuser, Basel.
Liu, J.S., 2001. Monte Carlo Strategies in Scientific Computing. Springer, New York.
Liu, J.S., Chen, R., 1998. Sequential Monte Carlo methods for dynamic systems. J. Amer. Statist. Assoc. 93, 1032–1044.
Liu, J.S., Liang, F., Wong, W.H., 2001. A theory for dynamic weighting in Monte Carlo computation. J. Amer. Statist. Assoc. 96, 561–573.
Marshall, A.W., 1956. The use of multi-stage sampling schemes in Monte Carlo computations. In: Meyer, M.A. (Ed.), Symposium on Monte Carlo Methods. Wiley, New York, pp. 123–140.
McDonald, D.R., 1977. Equilibrium measures for semi-Markov processes. Ann. Probab. 5, 818–822.
Mengersen, K.L., Tweedie, R.L., 1996. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Statist. 24, 101–121.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1091.
Ratkowsky, D., 1983. Nonlinear Regression Modeling. Marcel Dekker, New York.
Robert, C.P., Casella, G., 1999. Monte Carlo Statistical Methods. Springer, New York.
Rubin, D.B., 1987. A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when the fraction of missing information is modest: the SIR algorithm. J. Amer. Statist. Assoc. 82, 543–546.
Smith, A.F.M., Gelfand, A.E., 1992. Bayesian statistics without tears. Amer. Statist. 46, 84–88.