Statistics & Probability Letters 38 (1998) 323-328
Hastings-Metropolis algorithms and reference measures

Pei-de Chen*

Department of Statistics, Colorado State University, USA

Received November 18, 1997
Abstract

In this note, we discuss the possible existence and effect of different reference measures on the Hastings-Metropolis (H-M) algorithm. We prove that the standard H-M algorithm as a Markov chain does not depend on the choice of the reference measure, and we obtain a sufficient and necessary condition for the existence of a reference measure. © 1998 Elsevier Science B.V. All rights reserved

AMS classification: 60J05; 65C05
Keywords: H-M algorithms; Reference measure
1. Introduction

The Hastings-Metropolis (H-M) algorithms allow simulation of a probability distribution Π which is known only up to a constant (normalising) factor. This is surprisingly widely relevant, occurring especially when Π is a Bayesian posterior distribution, but in many other contexts also (Gilks et al., 1996). In the standard construction (Metropolis et al., 1953; Hastings, 1970) of the H-M algorithm on a space S, one first considers a candidate transition kernel Q(x, ·), x ∈ S, which generates potential transitions for a discrete-time Markov chain evolving on S. Traditionally, S is either a countable space or ℝ^k equipped with the Borel σ-field ℬ(S), and both Π and Q(x, ·) have densities π(y) and q(x, y) with respect to a reference measure taken either as counting measure or as Lebesgue measure. Much more general formulations are possible (see Tierney, 1994; Gilks et al., 1996) on a general space (S, ℬ(S)), provided some reference measure μ is available on the space.

The role of the reference measure is fundamental in the definition of the H-M algorithm, which works as follows. A "candidate transition" to y, generated according to the density q(x, y), is accepted with probability¹

¹ From a dissertation submitted to the Academic Faculty of Colorado State University in partial fulfillment of the requirements for the degree of Ph.D., supervised by Prof. R. Tweedie. Work supported in part by NSF Grant DMS 9504561.
* Correspondence address: Institute of Applied Mathematics, Academia Sinica, Beijing 100080, People's Republic of China. Email: [email protected].
α(x, y), given by

    α(x, y) = min{ π(y)q(y, x) / [π(x)q(x, y)], 1 },  if π(x)q(x, y) > 0,
    α(x, y) = 1,                                      if π(x)q(x, y) = 0.     (1)
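One iteration of the standard algorithm draws a candidate y ~ Q(x, ·) and accepts it with probability (1). A minimal Python sketch (the function and density names are illustrative, not from the paper):

```python
import math
import random

def mh_step(x, pi, q_density, q_sample):
    """One iteration of the standard H-M algorithm.

    pi        : unnormalised target density w.r.t. the reference measure
    q_density : candidate density q(x, y)
    q_sample  : draws y ~ Q(x, .)
    (All names are illustrative, not from the paper.)
    """
    y = q_sample(x)
    den = pi(x) * q_density(x, y)
    if den > 0:
        accept = min(pi(y) * q_density(y, x) / den, 1.0)  # first branch of (1)
    else:
        accept = 1.0                                      # second branch of (1)
    return y if random.random() < accept else x

# Usage sketch: random-walk candidate targeting an unnormalised N(0, 1) density.
random.seed(0)
pi = lambda z: math.exp(-0.5 * z * z)
q_density = lambda a, b: math.exp(-0.5 * (b - a) ** 2)    # symmetric candidate
q_sample = lambda a: a + random.gauss(0.0, 1.0)
x = 0.0
for _ in range(1000):
    x = mh_step(x, pi, q_density, q_sample)
```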
Thus, actual transitions of the H-M chain take place according to a law P(x, ·) with transition densities p(x, y) = q(x, y)α(x, y), y ≠ x, and with probability of remaining at the same point given by

    r(x) = P(x, {x}) = ∫ q(x, y)[1 − α(x, y)] dy.

The crucial property of the H-M algorithm is that, with this choice of α, the target Π is invariant for the operator P. It is then standard (Meyn and Tweedie, 1993; Roberts and Tweedie, 1996) that under some weak conditions, including irreducibility and aperiodicity, the n-step transition probabilities Pⁿ(x, A) converge to Π in the total variation norm.

Recently, Tierney (1998) extended the classical H-M acceptance probability defined by (1) to more general rules which still lead to the property that Π is the invariant measure for the resulting chain. His construction does not require reference measures at all.

In this paper our goal is two-fold: firstly, to see what effect a change in the choice of reference measure has on the H-M algorithm (or on Tierney's more general extension); and secondly, to find conditions under which such a reference measure exists for the algorithm. To achieve these goals we consider the standard H-M algorithm in the following way. We start from a target probability Π and a candidate probability measure λ on the product state space (S × S, ℬ(S) × ℬ(S)). We call a measure μ an available reference measure if there is a measurable function q such that for all measurable sets A, B ∈ ℬ(S) we have

    λ(A × B) = ∫_A ∫_B q(x, y) μ(dy) μ(dx);

the candidate kernel Q is then given by Q(x, A) = ∫_A q(x, y) μ(dy) as usual. This approach allows us to keep the right to choose an available reference measure (if any) freely, instead of assuming a fixed one as a prerequisite.

Remark. The target distribution does not affect the existence of an available reference measure. In fact, if μ
is available for Q, then given any target distribution Π, ν = ½(μ + Π) is available for Q and Π. So in discussing the existence of a reference measure we may ignore the target Π.

An easy example shows that an available reference measure does not always exist, and in Section 3 we find a sufficient and necessary condition for the existence of an available reference measure. Clearly, if available reference measures exist they form a large class (if μ is a member, any σ-finite measure ν such that μ ≪ ν is also a member). But fortunately, the standard H-M algorithm as a Markov chain (or its transition probability kernel) does not depend on the choice of the reference measure, and we now prove this.
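The Remark rests on a one-line change of measure. With ν = ½(μ + Π), we have μ ≤ 2ν, so φ = dμ/dν exists, and (a brief verification sketch, not part of the original text):

```latex
\lambda(A \times B)
  = \int_A\!\!\int_B q(x,y)\,\mu(\mathrm{d}y)\,\mu(\mathrm{d}x)
  = \int_A\!\!\int_B q(x,y)\,\varphi(x)\varphi(y)\,\nu(\mathrm{d}y)\,\nu(\mathrm{d}x),
\qquad \varphi = \frac{\mathrm{d}\mu}{\mathrm{d}\nu}.
```

Hence q̃(x, y) = q(x, y)φ(x)φ(y) is a density for λ with respect to ν × ν, and since Π ≤ 2ν the target also has a density with respect to ν; so ν is available for Q and Π.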
2. The effect of the reference measure
Before discussing the existence of available reference measures, we first prove that the algorithm essentially does not depend on the choice of available reference measure.

Theorem 2.1. Suppose that an available reference measure exists for a given target Π and candidate transition law Q. Then the Hastings algorithm essentially does not depend on the choice of reference measure. More exactly, for different reference measures μ1 and μ2, the corresponding transition kernels P1 and P2
coincide except on a Π-negligible subset A:

    P1(x, ·) = P2(x, ·)  for all x ∉ A ∈ ℬ(S),  Π(A) = 0.
Proof. Without loss of generality we may assume that μ1 ≪ μ2; otherwise let μ3 = (μ1 + μ2)/2, so that μ1 ≪ μ3 and μ2 ≪ μ3, and if μ1 and μ3 provide the same chain, and so do μ2 and μ3, then μ1 and μ2 also provide the same chain. Let

    π1 = dΠ/dμ1,    π2 = dΠ/dμ2,    φ = dμ1/dμ2;    then π2 = π1φ.
Denote

    q1(x, ·) = dQ(x, · \ {x})/dμ1,    q2(x, ·) = dQ(x, · \ {x})/dμ2.

We have q2(x, y) = q1(x, y)φ(y). Thus,
    α2(x, y) = [π2(y)q2(y, x)] / [π2(x)q2(x, y)] ∧ 1
             = [π1(y)φ(y)q1(y, x)φ(x)] / [π1(x)φ(x)q1(x, y)φ(y)] ∧ 1
             = [π1(y)q1(y, x)] / [π1(x)q1(x, y)] ∧ 1
             = α1(x, y),

whenever π2(x)q2(x, y) > 0 (which implies both φ(x) > 0 and φ(y) > 0). Now suppose π2(x)q2(x, y) = 0 and also both φ(x) > 0 and φ(y) > 0. Then we must have π1(x)q1(x, y) = 0, so we still have α2(x, y) = 1 = α1(x, y). Therefore,
    P1(x, A) = 1_A(x)P1(x, {x}) + ∫_A p1(x, y) μ1(dy)
             = 1_A(x){ ∫_S q1(x, y)[1 − α1(x, y)] μ1(dy) + Q(x, {x}) } + ∫_A α1(x, y)q1(x, y) μ1(dy)
             = 1_A(x){ ∫_{S∩{φ>0}} q2(x, y)[1 − α2(x, y)] μ2(dy) + Q(x, {x}) } + ∫_{A∩{φ>0}} α2(x, y)q2(x, y) μ2(dy)
             = P2(x, A),

for all x ∈ {φ > 0}. Here we use

    q1(x, y) μ1(dy) = q1(x, y)φ(y) μ2(dy) = q2(x, y) μ2(dy),
and since the integrands are zero on {y : φ(y) = 0}, we may restrict the integration to {φ > 0}, while the replacement of α1(x, y) by α2(x, y) is valid for all x such that φ(x) > 0. Hence,
    A := {x : P1(x, ·) ≠ P2(x, ·)} ⊂ {x : φ(x) = 0}

is μ1-negligible (since μ1({φ = 0}) = ∫_{{φ=0}} φ dμ2 = 0), and so is Π-negligible, as required. □
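On a finite state space the invariance in Theorem 2.1 can be checked numerically: the factors of φ cancel in the acceptance ratio, so α (and hence P) computed from two equivalent reference measures coincide. A small NumPy sketch (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
Pi = rng.random(n); Pi /= Pi.sum()              # target distribution on {0, ..., n-1}
Q = rng.random((n, n)); np.fill_diagonal(Q, 0.0)
Q /= Q.sum(axis=1, keepdims=True)               # candidate kernel, no self-moves
mu1 = np.ones(n)                                # reference measure 1: counting measure
mu2 = rng.random(n) + 0.5                       # reference measure 2: equivalent weights

def alpha(mu):
    """Acceptance probabilities (1) computed with densities w.r.t. mu."""
    pi = Pi / mu                                # pi = dPi/dmu
    q = Q / mu[None, :]                         # q(x, y) = Q(x, {y}) / mu({y})
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (pi[None, :] * q.T) / (pi[:, None] * q)   # pi(y)q(y,x) / (pi(x)q(x,y))
    a = np.minimum(r, 1.0)
    np.fill_diagonal(a, 0.0)                    # diagonal irrelevant: q(x, x) = 0
    return a

# The mu-dependent factors cancel, so the two chains coincide:
assert np.allclose(alpha(mu1), alpha(mu2))
```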
3. The existence of reference measures
In this section we seek conditions under which a reference measure may exist, enabling the formulation (1). The following simple example shows that available reference measures may fail to exist.

Example 3.1. Let S = ℝ = (−∞, ∞) and Q(x, {x + 1}) = 1 for all x ∈ ℝ. Then any available reference measure μ would have to have every x ∈ ℝ as an atom, and such a μ cannot be σ-finite. So no available reference measure exists for this candidate transition kernel.

We first consider the independent case, where it is normally assumed that the transition kernel satisfies Q(x, ·) = Q(·) for all x. We relax this slightly by assuming we have three factors, none of which is related to a pre-assigned reference measure: (i) an arbitrary target distribution Π; (ii) a common staying probability p = Q(x, {x}), 0 ≤ p ≤ 1; and (iii) a common candidate measure Q⁰ with Q⁰({x}) = 0 for all x and Q⁰(S) = 1 − p, so that

    Q(x, A) = p·1_A(x) + Q⁰(A).

If p = 1, then Q⁰ is the zero measure and there is nothing interesting to study: any target distribution becomes a stationary distribution, but the chain is not irreducible, and so Pⁿ(x, ·) does not converge to any stationary distribution. However, assuming 0 ≤ p < 1, μ = ½Π + [2(1 − p)]⁻¹Q⁰ is an available reference measure.
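The availability of μ = ½Π + [2(1 − p)]⁻¹Q⁰ can be checked on a finite state space: μ is a probability measure dominating both Π and Q⁰, and the candidate kernel is recovered from the density q(x, y) = (dQ⁰/dμ)(y). A NumPy sketch under these assumptions (on a finite space Q⁰ necessarily has atoms, so the q(x, x) = 0 convention is relaxed; the point is only the domination and the reconstruction of Q):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
Pi = rng.random(n); Pi /= Pi.sum()              # arbitrary target distribution
p = 0.3                                         # common staying probability
Q0 = rng.random(n); Q0 *= (1 - p) / Q0.sum()    # common candidate measure, Q0(S) = 1 - p
mu = 0.5 * Pi + Q0 / (2 * (1 - p))              # the proposed reference measure
assert np.isclose(mu.sum(), 1.0)                # mu is a probability measure
q = Q0 / mu                                     # q(x, y) = (dQ0/dmu)(y), the same for every x
# Reassemble Q(x, A) = p 1_A(x) + Q0(A) from the density representation:
Qxy = p * np.eye(n) + q[None, :] * mu[None, :]
assert np.allclose(Qxy, p * np.eye(n) + Q0[None, :])
assert np.allclose(Qxy.sum(axis=1), 1.0)        # each row is a probability
```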
In general, given a reference measure μ and a candidate kernel Q with density part q(x, y), define a probability measure λ on the product space by

    λ(B × A) = ∫_B Q(x, A) μ(dx) = ∫_B ∫_A q(x, y) μ(dy) μ(dx) + ∫_{A∩B} p(x) μ(dx),
where 0 ≤ p(x) = 1 − Q(x, S \ {x}) ≤ 1 is the extra probability of not moving. We will regard λ as the admissible measure of the Metropolis algorithm.

A probability measure λ defined on a self-product σ-field of the form ℬ(S) × ℬ(S) is called symmetric if any product measurable set G ∈ ℬ(S) × ℬ(S) has the same λ-value as its transpose G′ = {(x, y) : (y, x) ∈ G}, i.e., λ(G) = λ(G′). We are interested in characterising which symmetric measures can be regarded as the admissible measure of a Metropolis algorithm in the above sense.

Any probability measure λ on ℬ(S) × ℬ(S) can be decomposed into two parts, the diagonal part λ_d and the off-diagonal part λ_o, as follows:

    λ_d(B × A) = λ((B × A) ∩ Δ),

with the diagonal Δ = {(x, x) : x ∈ S} as its support, and
    λ_o(B × A) = λ((B × A) \ Δ),

with (S × S) \ Δ as its support. Now put μ(B) = λ(B × S). This defines a probability measure on ℬ(S), known as the marginal probability measure on the first factor space. If we project the product space S × S onto the first factor space, λ_d induces a measure on ℬ(S), denoted by λ̃_d, which is absolutely continuous with respect to μ: λ̃_d(B) = λ_d(B × S) ≤ μ(B). Denote dλ̃_d/dμ by p; then
    λ_d(B × A) = λ_d((B ∩ A) × (B ∩ A)) = λ̃_d(B ∩ A) = ∫_{B∩A} p(x) μ(dx).
Theorem 3.1. A symmetric probability measure λ on ℬ(S) × ℬ(S) can be regarded as the admissible measure of a Metropolis algorithm if and only if its off-diagonal part satisfies λ_o ≪ μ × μ, where μ is the marginal probability measure of λ (on either factor space, by the symmetry of λ). In this case, μ is an available reference probability measure, p = dλ̃_d/dμ defined as above is the staying probability, and q(x, y) = dλ_o/d(μ × μ)(x, y) is the density part of the candidate.

Proof. For necessity, notice that λ is naturally decomposed into the diagonal part λ_d, with λ_d(B × A) = ∫_{B∩A} p(x) μ(dx), and the off-diagonal part λ_o, with λ_o(B × A) = ∫_B ∫_A q(x, y) μ(dy) μ(dx); obviously λ_o ≪ μ × μ, and the reference measure μ of the algorithm is just the marginal probability measure of λ. Remember that in the definition of H-M algorithms q(x, x) = 0 for all x ∈ S, so we need to remove the diagonal from the region of integration.

For sufficiency, if λ_o ≪ μ × μ, then q(x, y) = dλ_o/d(μ × μ)(x, y) can be regarded as the density part of the candidate of a Metropolis algorithm with μ as the reference measure, as required. □
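The decomposition used in Theorem 3.1 is easy to exhibit on a finite state space: starting from a symmetric probability matrix λ, one recovers the marginal μ, the staying probability p, and the off-diagonal density q, and reassembles λ from them. A NumPy sketch (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
L = rng.random((n, n))
L = L + L.T                                     # symmetrise
L /= L.sum()                                    # lambda: a symmetric probability on S x S
Ld = np.diag(np.diag(L))                        # diagonal part lambda_d
Lo = L - Ld                                     # off-diagonal part lambda_o
mu = L.sum(axis=1)                              # marginal mu(B) = lambda(B x S)
p = np.diag(L) / mu                             # staying probability p = d(lambda_d~)/d(mu)
q = Lo / (mu[:, None] * mu[None, :])            # q = d(lambda_o)/d(mu x mu), with q(x, x) = 0
# Reassemble lambda from (mu, p, q):
assert np.allclose(q * mu[:, None] * mu[None, :] + np.diag(p * mu), L)
# Symmetry makes the two marginals coincide (Theorem 3.1 can use either):
assert np.allclose(L.sum(axis=0), mu)
assert np.allclose(q, q.T)                      # the candidate density is symmetric
```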
Non-symmetry does not cause much more trouble. We have:

Theorem 3.2. A probability measure λ on ℬ(S) × ℬ(S) can be regarded as the admissible measure of a Hastings algorithm if and only if λ_o ≪ μ1 × μ1, where μ1 is the marginal probability measure of λ on the first factor space.

Proof. Since

    λ(B × A) = λ_o(B × A) + λ_d(B × A) = ∫_B ∫_A q(x, y) μ1(dy) μ1(dx) + ∫_{B∩A} p(x) μ1(dx)
still defines a probability measure on ℬ(S) × ℬ(S) with μ1 as its marginal measure on the first factor space (without symmetry, μ1 need not be the marginal measure on the second factor space), the condition is necessary. The same argument as in the proof of Theorem 3.1 shows the sufficiency. □

The main difference from the symmetric case is that the marginal probability measure μ2 of λ on the second factor space may differ from μ1, since

    μ2(A) = λ(S × A) = ∫_A [ ∫_S q(x, y) μ1(dx) + p(y) ] μ1(dy),

but obviously μ2 ≪ μ1 (on the condition λ_o ≪ μ1 × μ1).
References

Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.), 1996. Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E., 1953. Equations of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-1091.
Meyn, S.P., Tweedie, R.L., 1993. Markov Chains and Stochastic Stability. Springer, London.
Roberts, G.O., Tweedie, R.L., 1996. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika 83.
Tierney, L., 1994. Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701-1762.
Tierney, L., 1998. A note on Metropolis-Hastings kernels for general state spaces (submitted for publication).