Automatica 44 (2008) 1676–1685 www.elsevier.com/locate/automatica
Optimal smoothing of non-linear dynamic systems via Monte Carlo Markov chains✩

Gianluigi Pillonetto a,∗, Bradley M. Bell b

a Department of Information Engineering, University of Padova, Italy
b Applied Physics Laboratory, University of Washington, Seattle, USA

✩ This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor George Yin under the direction of Editor Ian Petersen.
∗ Corresponding author. Tel.: +39 0498277607; fax: +39 0498277699.
E-mail addresses: [email protected] (G. Pillonetto), [email protected] (B.M. Bell).
0005-1098/$ - see front matter. doi:10.1016/j.automatica.2007.10.028
Received 3 August 2006; received in revised form 4 April 2007; accepted 22 October 2007 Available online 18 April 2008
Abstract

We consider the smoothing problem of estimating a sequence of state vectors given a nonlinear state space model with additive white Gaussian noise, and measurements of the system output. The system output may also be nonlinearly related to the system state. Often, obtaining the minimum variance state estimates conditioned on output data is analytically intractable. To tackle this difficulty, a Markov chain Monte Carlo technique is presented. The proposal density for this method efficiently draws samples from the Laplace approximation of the posterior distribution of the state sequence given the measurement sequence. This proposal density is combined with the Metropolis–Hastings algorithm to generate realizations of the state sequence that converge in distribution to the posterior. The minimum variance estimate and confidence intervals are approximated using these realizations. Simulations of a fed-batch bioreactor model are used to demonstrate that the proposed method can obtain significantly better estimates than the iterated Kalman–Bucy smoother.
© 2008 Elsevier Ltd. All rights reserved.
Keywords: Bayesian estimation; State estimation; Nonlinear time series; Stochastic processes; Iterated Kalman smoothing filter; Stochastic fed-batch bioreactor
1. Introduction

Estimation of complex nonlinear and non-Gaussian systems often requires integration over high-dimensional probability distributions. Unfortunately, this problem may turn out to be analytically intractable. To tackle this difficulty, several stochastic sampling algorithms have been introduced in recent years. In particular, Markov chain Monte Carlo (MCMC) is a powerful approach (see Hastings (1970) and Gilks, Richardson, and Spiegelhalter (1996) for a general introduction and Spall (2003) for an introduction oriented to the controls audience). The idea underlying MCMC is to first exploit the Metropolis–Hastings algorithm to generate a Markov chain converging in distribution to the posterior probability density. Next, using this approximation for the posterior in sampled
form, a Monte Carlo integration step is performed to obtain quantities of interest such as minimum variance estimates and confidence intervals. This approach has proven successful in many areas such as biomedicine, computer vision and signal processing (Green, 1996; Magni, Bellazzi, & De Nicolao, 1998; Pillonetto & Bell, 2007).

We consider the smoothing problem of estimating a sequence of state vectors given a nonlinear state space model with additive white Gaussian noise and measurements of the system output. The system output may also be nonlinearly related to the system state. The integral necessary for the minimum variance estimate can be computed analytically in the case where both the system dynamics and measurements are linear. In general, it is not analytically tractable to compute this high-dimensional integral. One alternative is to use an iterated Kalman–Bucy smoother (IKS), which avoids the integral and instead solves a high-dimensional optimization problem. It computes the maximum a posteriori estimate instead of the minimum variance estimate (Bell, 1994). Classical numeric integration methods are typically subject to the 'curse of dimensionality', but stochastic integration methods can
circumvent this problem and obtain approximations for the minimum variance estimate. Recently, a number of stochastic smoothing algorithms have been proposed and, notably, most of them are alternatives to MCMC. In particular, following the seminal work (Gordon, Salmond, & Smith, 1993) in which the bootstrap filter was described, Monte Carlo strategies relying on sequential importance sampling – called particle filters – have been designed. Such approaches use stochastic samples to represent the state posterior distribution by generating adaptive grids whose values are updated randomly according to a simulation-based rule. This enables one to track the distribution over time (see e.g. Crisan (2001) and Del Moral (1998) for a convergence analysis of these methods). Smoothing methods that use these techniques to approximate the posterior marginal distribution of the state at each time point have been developed, e.g., Doucet, Godsill, and Andrieu (2000), Hürzeler and Künsch (1998) and Kitagawa (1996). A method for sampling trajectories from the posterior joint distribution of the state at all time points is described in Godsill, Doucet, and West (2004). This method first performs a forward-filtering step that generates a particle-based approximation to the filtering density at each time step. Then a backward "simulation smoothing" step is performed to obtain the desired state realizations.

To date, MCMC samplers for smoothing of nonlinear state-space models have had limited application. This is due to the difficulty of designing an effective proposal density for sampling the state sequence. Under mild assumptions, the Metropolis–Hastings algorithm ensures that any proposal distribution will result in a chain that converges to the posterior distribution (e.g. see Meyn and Tweedie (1993)). On the other hand, a well chosen proposal density can greatly reduce the number of samples needed to reconstruct the posterior distribution with a prescribed accuracy. The high-dimensional and complex nature of the posterior distribution of the state space sequence makes the choice of the proposal density, and the algorithm for generating its samples, crucial.

We present a proposal density together with an algorithm for generating its samples. We use the IKS to find a set of local maximizers of the joint density of the state and the observations. Near each of these local maximizers, the posterior is approximated by a normal with a Hessian that is proportional to the Hessian of the joint density at the local maximizer. The state space model structure is used to obtain a method that samples from these approximate distributions and that scales linearly with the number of time points (where measurements were collected). Given a current state sequence in the MCMC chain, the closest local maximizer is determined. Then the corresponding approximate distribution is scaled and shifted so that its mean is the current state sequence in the chain. This scaled and shifted distribution is used to generate the proposal sample for the next point in the chain. This can be interpreted as the nonlinear extension of the linear smoothing algorithm presented in Carter and Kohn (1994).

This paper is organized as follows: Section 2 defines our estimation problem in mathematical terms. Section 3 presents an efficient algorithm that samples from a second-order approximation of the smoothing posterior near its local
maximizers. Section 4 applies the sampler from the previous section in an MCMC scheme and obtains a chain that converges to the posterior distribution of the state sequence. Section 5 tests this MCMC approach using simulated examples. Section 6 summarizes the results of this paper.

2. Problem formulation and notation

Unless otherwise stated, all vectors are column vectors. We use N(s|µ, Σ) to denote the value at s of the Gaussian density with mean µ and covariance Σ. We are given the information displayed in Fig. 1, where the index k is between one and N inclusive.

Fig. 1. Information that is input for the problem.

In order to avoid special cases, our estimate for the initial state vector x_1 is given by g_1(x_0), which does not depend on x_0; i.e., the derivative of g_1(x_0) is identically zero. The variance of this initial state estimate is Q_1. In addition, we use the following extended definitions for k = N + 1:

g_{N+1}(x_N) = 0,    Q_{N+1} = I_n    (1)

where I_n ∈ R^{n×n} is the identity matrix. Our state space model is defined by Eq. (2) and Assumption 1:

x_k = g_k(x_{k−1}) + ω_k
z_k = h_k(x_k) + v_k,    k = 1, . . . , N.    (2)

Assumption 1. The measurement noise sequence {v_k}_{k=1}^N and the transition noise sequence {ω_k}_{k=1}^N are independent, Gaussian, with mean zero and variance {R_k}_{k=1}^N and {Q_k}_{k=1}^N, respectively.

We define the vector X ∈ R^{nN} corresponding to the state sequence {x_k}_{k=1}^N, and the vector Z ∈ R^{mN} corresponding to the measurement sequence {z_k}_{k=1}^N, by

X = (x_1^T, . . . , x_N^T)^T,    Z = (z_1^T, . . . , z_N^T)^T.
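To make the setup concrete, the following Python sketch simulates one realization of the model (2). It is illustrative only; the callables g(k, x) and h(k, x) and the lists Q and R of covariance matrices are hypothetical stand-ins for the quantities in Fig. 1.

    import numpy as np

    def simulate_model(g, h, Q, R, N, seed=0):
        # Simulate x_k = g_k(x_{k-1}) + w_k and z_k = h_k(x_k) + v_k
        # for k = 1, ..., N, with w_k ~ N(0, Q_k) and v_k ~ N(0, R_k).
        rng = np.random.default_rng(seed)
        n = Q[0].shape[0]
        x_prev = np.zeros(n)  # x_0 is arbitrary: g_1 ignores its argument
        xs, zs = [], []
        for k in range(1, N + 1):
            x = g(k, x_prev) + rng.multivariate_normal(np.zeros(n), Q[k - 1])
            m = R[k - 1].shape[0]
            z = h(k, x) + rng.multivariate_normal(np.zeros(m), R[k - 1])
            xs.append(x)
            zs.append(z)
            x_prev = x
        return np.array(xs), np.array(zs)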
The functions h : R^{nN} → R^{mN} and g : R^{nN} → R^{nN} are defined by

h(X) = (h_1(x_1)^T, . . . , h_N(x_N)^T)^T,    g(X) = (g_1(x_0)^T, . . . , g_N(x_{N−1})^T)^T.

The matrices Q ∈ R^{(nN)×(nN)} and R ∈ R^{(mN)×(mN)} are the block diagonal matrices

Q = blockdiag(Q_1, . . . , Q_N),    R = blockdiag(R_1, . . . , R_N).

If v ∈ R^p and W ∈ R^{p×p} is a symmetric positive definite matrix, we define

‖v, W‖² = v^T W v.

The function L : R^{nN} × R^{mN} → R is defined by

L(X, Z) = (1/2) log det(2πR) + (1/2)‖Z − h(X), R^{−1}‖² + (1/2) log det(2πQ) + (1/2)‖X − g(X), Q^{−1}‖².
Lemma 2. If Assumption 1 holds, L(X, Z) is equal to the negative log of the joint density of Z and X.

Proof. If Assumption 1 holds, the negative log of the density of z_k given x_k is

(1/2) log det(2πR_k) + (1/2)‖z_k − h_k(x_k), R_k^{−1}‖².

The negative log of the density of x_k given x_{k−1} is

(1/2) log det(2πQ_k) + (1/2)‖x_k − g_k(x_{k−1}), Q_k^{−1}‖².

It now follows from the independence assumption that the negative log of the density of z_k and x_k given x_{k−1} is L_k(x_k, z_k|x_{k−1}), where L_k : R^n × R^m × R^n → R is defined by

L_k(x_k, z_k|x_{k−1}) = (1/2) log det(2πR_k) + (1/2) log det(2πQ_k) + (1/2)‖z_k − h_k(x_k), R_k^{−1}‖² + (1/2)‖x_k − g_k(x_{k−1}), Q_k^{−1}‖².    (3)

Using induction on k, it follows from the independence assumption that the negative log of the joint density of the measurement sequence {z_k}_{k=1}^N and the state sequence {x_k}_{k=1}^N is equal to

Σ_{k=1}^N L_k(x_k, z_k|x_{k−1}) = L(X, Z).

This completes the proof.
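The proof translates directly into an evaluator for L(X, Z). The minimal sketch below sums the L_k terms of Eq. (3); it assumes the same hypothetical g(k, x), h(k, x), Q, R interfaces as the simulation sketch in Section 2.

    import numpy as np

    def L_joint(X, Z, g, h, Q, R):
        # L(X, Z) = sum over k of L_k(x_k, z_k | x_{k-1}), Eq. (3).
        # X: (N, n) state sequence, Z: (N, m) measurement sequence.
        N, n = X.shape
        total = 0.0
        x_prev = np.zeros(n)                # ignored by g_1
        for k in range(N):
            rz = Z[k] - h(k + 1, X[k])      # measurement residual
            rx = X[k] - g(k + 1, x_prev)    # transition residual
            total += 0.5 * np.log(np.linalg.det(2.0 * np.pi * R[k]))
            total += 0.5 * rz @ np.linalg.solve(R[k], rz)
            total += 0.5 * np.log(np.linalg.det(2.0 * np.pi * Q[k]))
            total += 0.5 * rx @ np.linalg.solve(Q[k], rx)
            x_prev = X[k]
        return total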
We use p(X, Z) to denote the joint density of X and Z and p(X|Z) to denote the posterior density of X given Z, i.e.,

p(X, Z) = exp[−L(X, Z)]    (4)
p(X|Z) = p(X, Z) / ∫_{−∞}^{+∞} p(X, Z) dX.

The optimal smoothing problem considered in this paper is:

Problem 3. Given the information in Fig. 1, and a function θ(X), compute E[θ(X)|Z], the lower γ-confidence limit λ, and the upper limit ρ, that are defined by

E[θ(X)|Z] = ∫_{−∞}^{+∞} θ(X) p(X|Z) dX
γ/2 = ∫_{ {X : θ(X) ≤ λ} } p(X|Z) dX = ∫_{ {X : ρ ≤ θ(X)} } p(X|Z) dX.
If θ(X) = X, E[θ(X)|Z] is the minimum variance estimate for the state sequence given the measurement sequence. If θ(X) = x_k and γ = 0.05, [λ, ρ] is the 95% confidence interval for the kth component of the state vector given the measurement sequence. In general, a direct numerical approximation for the integrals above is computationally expensive because one has to integrate over a space of dimension nN.

3. Sampling a local posterior approximation

The maximum a posteriori estimate for X given Z (the minimizer of L(X, Z) with respect to X) can be computed by an iterated Kalman smoother (IKS) (e.g., see Bell (1994)). The IKS exploits the special state space structure of L(X, Z) so that the number of floating point computations scales linearly with the number of time points. For a vector v ∈ R^p, we use the notation |v| for the standard Euclidean norm. Suppose that a local minimizer of L(X, Z) has been computed and denote it by X̂, i.e. there is an ε > 0 such that

L(X̂, Z) = min{L(X, Z) : |X − X̂| < ε}.

In this section, we use the special state space structure of L(X, Z) to obtain an algorithm that samples from an approximation of the posterior p(X|Z) near X̂. Like the IKS, this algorithm scales linearly with the number of time points.

3.1. Likelihood derivative and Hessian

We begin by noting the special structure for the derivative of L(X, Z) with respect to X (also noted elsewhere; e.g., Bell (2000)). We use the notation ∂_k to denote the partial derivative of a function f(X) with respect to x_k, i.e.

∂_X f(X) = [∂_1 f(X), . . . , ∂_N f(X)].

Note that for a scalar valued function f(X) ∈ R, the partial derivative ∂_k f(X) and the total derivative ∂_X f(X) are row vectors. It follows from Eq. (3) that

∂_k L_j(x_j, z_j|x_{j−1}) = 0,    if j ≠ k and j ≠ k + 1
∂_k L_k(x_k, z_k|x_{k−1}) = [h_k(x_k) − z_k]^T R_k^{−1} ∂_k h_k(x_k) + [x_k − g_k(x_{k−1})]^T Q_k^{−1}
∂_k L_{k+1}(x_{k+1}, z_{k+1}|x_k) = [g_{k+1}(x_k) − x_{k+1}]^T Q_{k+1}^{−1} ∂_k g_{k+1}(x_k).

It follows that for k ≠ N,

∂_k L(X, Z) = Σ_{j=1}^N ∂_k L_j(X, Z) = ∂_k L_k(X, Z) + ∂_k L_{k+1}(X, Z)
= [h_k(x_k) − z_k]^T R_k^{−1} ∂_k h_k(x_k) + [x_k − g_k(x_{k−1})]^T Q_k^{−1} + [g_{k+1}(x_k) − x_{k+1}]^T Q_{k+1}^{−1} ∂_k g_{k+1}(x_k).    (5)

We also consider Eq. (5) for the case k = N. In this case, L_{N+1}(X, Z) is not included in the summation and the term

[g_{N+1}(x_N) − x_{N+1}]^T Q_{N+1}^{−1} ∂_N g_{N+1}(x_N)

is not included in the expression for ∂_N L(X, Z). It follows from Eq. (1) that ∂_N g_{N+1}(x_N) = 0 and Q_{N+1} is invertible. Hence Eq. (5) is also correct, and can be evaluated, for the case k = N.

The special structure for the derivative of L(X, Z) with respect to X leads to a sparse structure for its Hessian. Given a matrix A, we define rvec(A) as the column vector containing the elements of the matrix in row major order, i.e. the transpose of the first row, above the transpose of the second row, and so on. We use the convention in Marlow (1993) where the derivative of a matrix valued function A(x_k) is ∂_k rvec[A(x_k)]. Using this convention, Eq. 6.78 in Marlow (1993), and the fact that variance matrices are symmetric, the diagonal blocks of the Hessian of L(X, Z) are given by

∂_k ∂_k L(X, Z) = ∂_k h_k(x_k)^T R_k^{−1} ∂_k h_k(x_k) + Q_k^{−1} + ∂_k g_{k+1}(x_k)^T Q_{k+1}^{−1} ∂_k g_{k+1}(x_k)
    − ({[z_k − h_k(x_k)]^T R_k^{−1}} ⊗ I_n) ∂_k² h_k(x_k)
    − ({[x_{k+1} − g_{k+1}(x_k)]^T Q_{k+1}^{−1}} ⊗ I_n) ∂_k² g_{k+1}(x_k).    (6)

The cross partials in the Hessian of L(X, Z) are given by

∂_j ∂_k L(X, Z) = 0,    if j < k − 1 or j > k + 1    (7)
∂_{k−1} ∂_k L(X, Z) = −Q_k^{−1} ∂_{k−1} g_k(x_{k−1})    (8)
∂_{k+1} ∂_k L(X, Z) = −∂_k g_{k+1}(x_k)^T Q_{k+1}^{−1}.    (9)

Given a sequence of symmetric matrices B_k ∈ R^{n×n} for k = 1, . . . , N, and a sequence of matrices C_k ∈ R^{n×n} for k = 2, . . . , N, we define the corresponding symmetric block tridiagonal matrix

tridiag(B_1, C_2, . . . , C_N, B_N) =
[ B_1   C_2^T  0      · · ·   0
  C_2   B_2    C_3^T  . . .   ⋮
  0     C_3    B_3    . . .   0
  ⋮     . . .  . . .  . . .   C_N^T
  0     · · ·  0      C_N     B_N ].

Using this notation, the Hessian of L(X, Z) is

∂_X² L(X, Z) = tridiag[B_1(X), C_2(X), . . . , C_N(X), B_N(X)]    (10)

where the matrix valued function sequences B_k : R^{nN} → R^{n×n} and C_k : R^{nN} → R^{n×n} are given by

B_k(X) = ∂_k ∂_k L(X, Z),    k = 1, . . . , N
C_k(X) = ∂_{k−1} ∂_k L(X, Z),    k = 2, . . . , N.

3.2. The local posterior approximation

We use X̂ to denote a local maximizer of p(X|Z) with respect to X (minimizer of L(X, Z)). Note that ∂_X L(X̂, Z) = 0, hence

L(X, Z) = (1/2)‖X − X̂, ∂_X² L(X̂, Z)‖² + L(X̂, Z) + o(|X − X̂|²).

It follows from the equation above and Eq. (4) that there is a value ζ(X̂) (that scales p(X|Z) so its integral with respect to X is one) such that

− log p(X|Z) = (1/2)‖X − X̂, ∂_X² L(X̂, Z)‖² + ζ(X̂) + o(|X − X̂|²).

We define the approximate posterior density near X̂ by

− log p̃(X|Z, X̂) = (1/2)‖X − X̂, ∂_X² L(X̂, Z)‖² + (1/2) log det[2π ∂_X² L(X̂, Z)^{−1}].

Using the fact that the integral of p(X|Z) with respect to X is one, we obtain

− log p(X|Z) = − log p̃(X|Z, X̂) + o(|X − X̂|²).
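Before turning to the factorization algorithm, we note that the blocks of Eq. (10) can be assembled directly from model Jacobians. A minimal Python sketch follows, using the Gauss–Newton simplification discussed in Remark 8 below (i.e. dropping the second-derivative terms of Eq. (6)); the Jacobian callables g_jac(k, x) and h_jac(k, x) are hypothetical stand-ins.

    import numpy as np

    def hessian_blocks_gn(X, g_jac, h_jac, Q, R):
        # Gauss-Newton versions of B_k(X) and C_k(X) in Eq. (10):
        #   B_k ~ H_k^T R_k^{-1} H_k + Q_k^{-1} + G_{k+1}^T Q_{k+1}^{-1} G_{k+1}
        #   C_k = -Q_k^{-1} G_k
        # where H_k = d h_k / d x_k and G_{k+1} = d g_{k+1} / d x_k.
        N = X.shape[0]
        Qinv = [np.linalg.inv(Qk) for Qk in Q]
        B, C = [], []
        for k in range(N):                       # k is the 0-based time index
            Hk = h_jac(k + 1, X[k])
            Bk = Hk.T @ np.linalg.solve(R[k], Hk) + Qinv[k]
            if k + 1 < N:
                G = g_jac(k + 2, X[k])           # d g_{k+1} / d x_k at X[k]
                Bk = Bk + G.T @ Qinv[k + 1] @ G
                C.append(-Qinv[k + 1] @ G)       # C_{k+2} in 1-based notation
            B.append(Bk)
        return B, C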
3.3. Block tridiagonal Cholesky factorization algorithm

We use chol(A) to denote a Cholesky factorization of the positive-definite matrix A, i.e., chol(A) is a lower triangular matrix such that A = chol(A) chol(A)^T. We use A^{−T} to denote the matrix inverse of A^T.

Algorithm 4. The input to this algorithm is a sequence of symmetric matrices B_k ∈ R^{n×n} for k = 1, . . . , N, a sequence of matrices C_k ∈ R^{n×n} for k = 2, . . . , N, and a sequence of vectors e_k ∈ R^n for k = 1, . . . , N. We use the notation T ∈ R^{nN×nN} for the matrix

T = tridiag(B_1, C_2, . . . , C_N, B_N).

The output is a state sequence s_k ∈ R^n for k = 1, . . . , N such that

S = chol(T)^{−T} E
where S ∈ R^{nN} and E ∈ R^{nN} are the vectors corresponding to {s_k}_{k=1}^N and {e_k}_{k=1}^N.

F_1 = chol(B_1)
for k = 2, . . . , N do
    G_k = C_k F_{k−1}^{−T}
    F_k = chol(B_k − G_k G_k^T)
s_N = F_N^{−T} e_N
for k = N − 1, . . . , 1 do
    s_k = F_k^{−T}(e_k − G_{k+1}^T s_{k+1})

The following two results guarantee the correctness of Algorithm 4.

Lemma 5. Suppose the input matrix T to Algorithm 4 is positive definite. Define G_1 = 0, D_1 = B_1, and for k = 2, . . . , N, define D_k = B_k − G_k G_k^T. It follows that for k = 1, . . . , N, the following matrix is positive definite:

H_k = tridiag(D_k, C_{k+1}, B_{k+1}, C_{k+2}, . . . , C_N, B_N).

Hence D_k is also positive definite.

Proof. The conclusion of the lemma is true by assumption for the case k = 1. Suppose that the conclusion is true for index j. We complete this proof by showing that it is true for index j + 1. We first note that

D_{k+1} = B_{k+1} − G_{k+1} G_{k+1}^T = B_{k+1} − C_{k+1} D_k^{−1} C_{k+1}^T.

Define the matrix

A_k = [ F_k^{−1}           0    0
        −C_{k+1} D_k^{−1}  I_n  0
        0                  0    I_{n(N−k−1)} ].

It follows that

A_k H_k A_k^T = [ I_n  0
                  0    H_{k+1} ].

The matrix A_k is non-singular and H_k is positive definite, hence the block matrix on the right side of the last equality is positive definite. It follows that H_{k+1} is positive definite, which completes the proof of the lemma.

Proposition 6. If the Algorithm 4 input matrix T is positive definite, all the chol arguments in the algorithm are positive definite. In addition, the output state sequence satisfies the equation S = chol(T)^{−T} E.

Proof. The conclusion about the chol arguments follows directly from the conclusion at the end of Lemma 5. We note that the algorithm provides a Cholesky factorization of T. To be specific, define

F = [ F_1  0     · · ·  0
      G_2  F_2   . . .  ⋮
      ⋮    . . .  . . .  0
      0    · · ·  G_N   F_N ].

Multiplying out F F^T, we obtain

F F^T = [ F_1 F_1^T  F_1 G_2^T              0          · · ·           0
          G_2 F_1^T  G_2 G_2^T + F_2 F_2^T  F_2 G_3^T  . . .           ⋮
          0          G_3 F_2^T              . . .      . . .           0
          ⋮          . . .                  . . .      . . .           F_{N−1} G_N^T
          0          · · ·                  0          G_N F_{N−1}^T   G_N G_N^T + F_N F_N^T ].

Using the definitions of F_k and G_k in the algorithm and T = tridiag(B_1, C_2, . . . , C_N, B_N), it follows that

T = F F^T.

Hence F is a Cholesky factorization of T. We now show that F^T S = E, i.e.,

F_N^T s_N = e_N    and, for k = N − 1, . . . , 1,    F_k^T s_k + G_{k+1}^T s_{k+1} = e_k.

These equations follow directly from the definition of the algorithm. Thus we obtain S = F^{−T} E, which completes the proof of this proposition.
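For reference, a direct Python transcription of Algorithm 4 is sketched below, with SciPy supplying the dense Cholesky factorizations and triangular solves; it is a sketch rather than production code (no pivoting or error handling).

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def tridiag_chol_solve(B, C, e):
        # Algorithm 4: with T = tridiag(B_1, C_2, ..., C_N, B_N), return
        # s_1, ..., s_N such that S = chol(T)^{-T} E.
        # Conventions: B[k-1] holds B_k, C[k-2] holds C_k, e[k-1] holds e_k.
        N = len(B)
        F = [cholesky(B[0], lower=True)]             # F_1 = chol(B_1)
        G = [None]                                   # G_1 is unused
        for k in range(1, N):
            # G_{k+1} = C_{k+1} F_k^{-T}: solve F_k Y = C_{k+1}^T, then G = Y^T
            Gk = solve_triangular(F[k - 1], C[k - 1].T, lower=True).T
            G.append(Gk)
            F.append(cholesky(B[k] - Gk @ Gk.T, lower=True))
        s = [None] * N
        s[N - 1] = solve_triangular(F[N - 1].T, e[N - 1], lower=False)
        for k in range(N - 2, -1, -1):               # back substitution
            rhs = e[k] - G[k + 1].T @ s[k + 1]
            s[k] = solve_triangular(F[k].T, rhs, lower=False)
        return s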
Lemma 7. Suppose we execute Algorithm 4 with the corresponding symmetric block tridiagonal positive definite matrix T. Further suppose that the input vector E is simulated from a multi-normal distribution with mean zero and covariance equal to the identity matrix. It follows that the algorithm output vector S is distributed N[S|0, T^{−1}]. This method computes S with computational complexity O(N n³). Furthermore, if T = ∂_X² L(X̂, Z) as expressed in Eq. (10), S is a sample from the approximate distribution p̃(X − X̂|Z, X̂).

Remark 8. It is worth noting that, when approximating the Hessian during optimization of L(X, Z) with respect to X, the Gauss–Newton method excludes the terms

({[x_{k+1} − g_{k+1}(x_k)]^T Q_{k+1}^{−1}} ⊗ I_n) ∂_k² g_{k+1}(x_k)
({[z_k − h_k(x_k)]^T R_k^{−1}} ⊗ I_n) ∂_k² h_k(x_k)

which are included in Eq. (6). Including these terms in our simulations guarantees that the samples are drawn from a second order approximation to the posterior distribution (see Section 3.2). Furthermore, ∂_X² L(X̂^i, Z) positive definite is the second order sufficient condition for X̂^i to be a local minimizer of L(X, Z). On the other hand, the Gauss–Newton approximation for the Hessian is positive definite at all values of X. In some cases it may be preferable to use this approximation in Eq. (10) when computing B_k(X). In the numerical examples described in the next section, including or excluding these terms did not lead to significant differences in the performance of the MCMC smoothing scheme.
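In code, Lemma 7 amounts to feeding a standard normal vector through the factorization routine; a minimal sketch, reusing the hypothetical tridiag_chol_solve from the previous sketch:

    import numpy as np

    def sample_local_approx(B, C, n, N, rng):
        # Draw S ~ N(0, T^{-1}) with T = tridiag(B_1, C_2, ..., C_N, B_N),
        # e.g. T equal to the Hessian of Eq. (10) at a local minimizer.
        e = [rng.standard_normal(n) for _ in range(N)]   # E ~ N(0, I_{nN})
        return tridiag_chol_solve(B, C, e)               # S = chol(T)^{-T} E

Since cov(S) = chol(T)^{−T} chol(T)^{−1} = T^{−1}, adding X̂ to S yields a draw from the Laplace approximation p̃(X|Z, X̂).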
4. An MCMC smoothing method

The simulation scheme in Lemma 7 is used to sample from the proposal density for our MCMC smoothing method. The method combines this proposal density with the symmetric random-walk Metropolis algorithm (RWM) (Roberts & Rosenthal, 2001). This results in accurate approximations for the integrals in Problem 3.

The iterated Kalman smoother (IKS) is used to obtain a set of local minimizers of L(X, Z) with respect to X (maximizers of p(X, Z)), which we denote by X̂^i for i = 1, . . . , ℓ. For example, these might be obtained by running the IKS from a set of different starting points. For each of the local minimizers X̂^i, we associate a variance matrix Σ^i that is a multiple of the covariance corresponding to the distribution p̃(X|Z, X̂^i), i.e.,

Σ^i = α^i ∂_X² L(X̂^i, Z)^{−1}    (11)

where each α^i ∈ R is a suitable scalar factor whose choice is discussed below. The Σ^i matrices are used to define the proposal densities {q^i(X|Y)}_{i=1}^ℓ. To be specific,

q^i(X|Y) = N(X|Y, Σ^i).

Using the method in Lemma 7, with T = (Σ^i)^{−1}, we can simulate samples for X − Y that correspond to q^i(X|Y). Each simulated sample requires O(N n³) floating point operations. For i = 1, . . . , ℓ, simulation of X − Y requires memory storage of 2N − 1 matrices of size n × n, to be specific, {F_k^i}_{k=1}^N and {G_k^i}_{k=2}^N. Note that we have indexed the matrices in Algorithm 4 by the local minimizer index i because the algorithm input matrix T = (Σ^i)^{−1} depends on the local minimizer X̂^i.

We now consider which proposal index i ∈ {1, . . . , ℓ} to use for each MCMC iteration. Let X^{(k−1)} denote the state of the MCMC chain that was selected for the previous iteration. One method is to randomly choose a proposal index i according to a fixed set of weights on the set {1, . . . , ℓ} (see e.g. Section 1.4.3 in Gilks et al. (1996)). This choice of i would be independent of the previous state X^{(k−1)}. We instead choose the index i that is closest (in a well defined statistical sense) to the previous state X^{(k−1)}. To be specific, we define the weights {η_i}_{i=1}^ℓ by

β_i = p(X̂^i, Z) / N[X̂^i | X̂^i, ∂_X² L(X̂^i, Z)^{−1}]
η_i = β_i / Σ_{j=1}^ℓ β_j.

The proposal density index that is statistically closest to a state vector X is defined by

î(X) = arg max_j η_j N[X | X̂^j, ∂_X² L(X̂^j, Z)^{−1}].

Algorithm 9. The input to this algorithm includes a set of local minimizers of L(X, Z) denoted by {X̂^i}_{i=1}^ℓ. It also includes a number of MCMC iterations M, the scaling factors {α^i}_{i=1}^ℓ, and the variance matrices {Σ^i}_{i=1}^ℓ (defined in Eq. (11)). The information in Fig. 1 is used to evaluate L(X, Z) and p(X, Z), which are used by this algorithm. The output of this algorithm is the sequence of state sequence vectors {X^{(k)}}_{k=1}^M.

Set X^{(0)} so that L(X^{(0)}, Z) = min_{i=1,...,ℓ} L(X̂^i, Z)
for k = 1, . . . , M do
    Determine i(k) = î(X^{(k−1)})
    Sample S^{(k)} from q^{i(k)}(S|X^{(k−1)})
    Determine j(k) = î(S^{(k)})
    Sample u^{(k)} from a uniform on [0, 1]
    Set r^{(k)} = [p(S^{(k)}, Z) q^{j(k)}(X^{(k−1)}|S^{(k)})] / [p(X^{(k−1)}, Z) q^{i(k)}(S^{(k)}|X^{(k−1)})]
    Set X^{(k)} = S^{(k)} if u^{(k)} ≤ r^{(k)}, and X^{(k)} = X^{(k−1)} otherwise
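A condensed Python sketch of Algorithm 9 follows. The helpers L_of (evaluating L(X, Z)), sample_q, log_q and i_hat are hypothetical stand-ins for the quantities constructed above, and the acceptance ratio is computed in log space for numerical stability.

    import numpy as np

    def mcmc_smoother(X0, M, L_of, sample_q, log_q, i_hat, rng):
        # Algorithm 9: Metropolis-Hastings with mode-centred proposals.
        chain = []
        X = X0
        for k in range(M):
            i = i_hat(X)              # proposal index closest to the chain state
            S = sample_q(i, X)        # candidate drawn from q^i(.|X)
            j = i_hat(S)              # index used for the reverse move
            # log r = log p(S,Z) - log p(X,Z) + log q^j(X|S) - log q^i(S|X)
            log_r = -L_of(S) + L_of(X) + log_q(j, X, S) - log_q(i, S, X)
            if np.log(rng.uniform()) <= log_r:
                X = S                 # accept the candidate
            chain.append(X)           # otherwise keep the previous state
        return chain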
In the algorithm above, S^{(k)} is the candidate sample drawn from the proposal density q^{i(k)}, while X^{(k−1)} is the chain state at index k − 1. In addition, r^{(k)} is the probability of accepting the candidate which, according to the Metropolis–Hastings algorithm, is the product of two ratios. The first ratio is the posterior evaluated at S^{(k)} divided by the posterior evaluated at X^{(k−1)}; see Eq. (4). The second ratio is the likelihood of candidate X^{(k−1)} given chain state S^{(k)} divided by the likelihood of candidate S^{(k)} given chain state X^{(k−1)}. We also note that, if the candidate is not accepted, the state of the chain at index k is the same as at index k − 1.

For i = 1, . . . , ℓ, we define the corresponding acceptance rate for the algorithm above as the number of k for which i = î(X^{(k−1)}) and u^{(k)} ≤ r^{(k)}, divided by the number of k for which i = î(X^{(k−1)}). A pilot run of the algorithm above is repeated, and the values {α^i}_{i=1}^ℓ in Eq. (11) are adjusted, until the corresponding acceptance rates for the pilot run are between 20 and 30 percent (as suggested in Roberts, Gelman, and Gilks (1997)).

In some situations it is difficult to obtain a suitable acceptance rate and allow the chain to rapidly move between the local minimizers. For these situations, the single component Metropolis–Hastings scheme may be more efficient than direct application of Algorithm 9. This scheme divides the state space into groups of components which are updated separately (see e.g. page 10 in Gilks et al. (1996)). Details about combining the single component Metropolis–Hastings scheme with Algorithm 9 are included in the Appendix.

5. Simulation results

We use two examples from the literature to illustrate the performance of the proposed MCMC smoothing scheme. The required number of MCMC samples, M in Algorithm 9, is determined using the Raftery–Lewis criteria (RLC) (Raftery & Lewis, 1996). For each of the unknown variables to be integrated, θ(X) in Problem 3, the RLC input parameters are the quantiles 0.025, 0.25, 0.5, 0.75, 0.975, with corresponding precisions 0.02, 0.05, 0.01, 0.05, 0.02, and with requested probability 0.95.

5.1. Example 1

In this example, the state vector oscillates rapidly with respect to the time index k, the IKS finds multiple local minima, and the single component Metropolis–Hastings version of Algorithm 9 is applied, i.e., Algorithm 10. The model for this example is a classical benchmark problem from the literature; see e.g. Godsill et al. (2004) and Kitagawa (1996):

x_k = x_{k−1}/2 + 25 x_{k−1}/(1 + x_{k−1}²) + 8 cos(1.2k) + ω_k
z_k = x_k²/20 + v_k.    (12)
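The following sketch simulates a realization of the benchmark (12) with the settings given in the next paragraph (N = 100, g_1(x_0) = 0, Q_k = 10, R_k = 1); it is illustrative only.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 100
    x = np.zeros(N)
    z = np.zeros(N)
    x_prev = 0.0
    for k in range(1, N + 1):
        if k == 1:
            mean = 0.0                           # g_1(x_0) = 0
        else:
            mean = (x_prev / 2 + 25 * x_prev / (1 + x_prev ** 2)
                    + 8 * np.cos(1.2 * k))
        x[k - 1] = mean + np.sqrt(10.0) * rng.standard_normal()  # Q_k = 10
        z[k - 1] = x[k - 1] ** 2 / 20 + rng.standard_normal()    # R_k = 1
        x_prev = x[k - 1]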
As in Godsill et al. (2004), our simulated number of time points is N = 100, the initial state estimate is g_1(x_0) = 0, the variance of ω_k is Q_k = 10 (k = 1 corresponds to the initial estimate), and the variance of v_k is R_k = 1. The problem is to reconstruct a state sequence realization {x_k}_{k=1}^N from a corresponding realization of the measurement sequence. The measurement sequence {z_k}_{k=1}^N is displayed as circles in Fig. 2.

Fig. 2. Output for Example 1, noiseless output (thick line), measurement sequence (circles), estimated noiseless output (thin line).

For this example, three distinct posterior modes {X̂^i}_{i=1}^3 were obtained using an IKS algorithm, i.e., ℓ = 3. Using Eq. (10) to express ∂_X² L(X̂^i, Z), the corresponding matrices {B_k(X̂^i)}_{k=1}^N and {C_k(X̂^i)}_{k=2}^N were computed for i = 1, 2, 3. To improve efficiency, the single component Metropolis–Hastings version of Algorithm 9 was used where the state space was divided into 3 groups of components with corresponding dimensions 40, 30 and 30, i.e., h = 3, K(1) = 40, K(2) = 30, and K(3) = 30 in Algorithm 10. The number of iterations required by the RLC was less than our choice M = 30,000. The algorithm obtained samples {X^{(j)}}_{j=1}^M from the posterior distribution of X given Z. Then, both point estimates and confidence limits were computed by approximating the integrals in Problem 3 for different functions θ(X) using the formula

E[θ(X)|Z] ≈ (1/M) Σ_{j=1}^M θ(X^{(j)}).

For example, with θ(X) = x_k, the minimum variance estimate of x_k given {z_k}_{k=1}^N is approximated by

E[x_k|{z_k}_{k=1}^N] ≈ (1/M) Σ_{j=1}^M x_k^{(j)}.    (13)

In Fig. 2, the circles are the measurement sequence {z_k}_{k=1}^N, the thick line is the noiseless output {x_k²/20}_{k=1}^N, and the thin line is the estimated noiseless output

{(1/M) Σ_{j=1}^M (x_k^{(j)})²/20}_{k=1}^N.

Example MCMC realizations of the state component at two successive time points,

{x_k^{(j)}, x_{k+1}^{(j)}}_{j=1}^M,

are plotted as points in Fig. 3 (for k = 7, 44, 57, 99). Note that the posterior joint distribution of the state vector components has clusters of high density and is not well approximated by a Gaussian distribution. These results are similar to the particle filtering results obtained by Godsill et al. (2004) for this example.

Fig. 3. Realizations of state components pairs for Example 1. Bivariate posterior densities are displayed.

In Fig. 4, the thick line is the true state sequence {x_k}_{k=1}^N, the thin line is the MCMC estimate for the state sequence, Eq. (13), and the dot-dashed lines are the MCMC approximations of the upper and lower 95% confidence limits {λ_k}_{k=1}^N and {ρ_k}_{k=1}^N defined as follows. The approximate confidence limits are obtained by replacing the integrals in Problem 3 by their MCMC approximations, i.e., the confidence limits λ_k and ρ_k for the state at time index k are given by

0.025 = (1/M) Σ_{j=1}^M (x_k^{(j)} ≤ λ_k) = (1/M) Σ_{j=1}^M (ρ_k ≤ x_k^{(j)})

where, used with an arithmetic operation,

(c ≤ d) = 1 if c ≤ d, and 0 otherwise.

Exact equality in the equation above is usually not possible, so as close to equality as possible is obtained.

Fig. 4. State estimation for Example 1, true state (thick line), estimated state (thin line), 95% confidence limits (dash-dot line).
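Given the stored realizations, the estimate (13) and the limits λ_k, ρ_k reduce to a sample mean and empirical quantiles; a sketch, with X_samples an (M, N) array of hypothetical MCMC draws:

    import numpy as np

    x_hat = X_samples.mean(axis=0)               # Eq. (13), minimum variance estimate
    lam = np.quantile(X_samples, 0.025, axis=0)  # lower 95% confidence limits
    rho = np.quantile(X_samples, 0.975, axis=0)  # upper 95% confidence limits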
5.2. Example 2

Consider the fed-batch bioreactor model in Kristensen, Madsen, and Jorgensen (2004, Example 1). For each time index k = 1, . . . , N, x_k = (a_k, b_k, c_k)^T is the corresponding state vector, where a_k is biomass concentration, b_k is substrate concentration, and c_k is volume. The growth rate is given by

µ(B) = µ_max B / (B²/K_2 + B + K_1)

where µ_max = K_1 = K_2 = 0.5. The Euler approximation discrete time version of the stochastic differential equation is

a_k = a_{k−1} + µ(b_{k−1}) a_{k−1} ∆t − (F/c_{k−1}) a_{k−1} ∆t + ω_{k,1}
b_k = b_{k−1} − (µ(b_{k−1}) a_{k−1}/Y) ∆t + (F/c_{k−1})(B_f − b_{k−1}) ∆t + ω_{k,2}
c_k = c_{k−1} + F ∆t + ω_{k,3}
z_{k,1} = a_k + b_k + v_{k,1}
z_{k,2} = b_k + c_k + v_{k,2}

where F = 0.5 is the feed flow rate, Y = 0.5 is the yield coefficient, B_f = 10 is the feed concentration, and ∆t = 0.4 is the time between index k and index k + 1. The variance matrices for this example are given by Q_k = 0.2 I_3 and R_k = I_2. Our initial state estimate is given by

g_1(x_0) = (a_1, b_1, c_1)^T = (4, 5, 3)^T.

This is also used as the true value for the state at time index k = 1 in the simulations. Our estimation problem consists of reconstructing the state sequence {x_k}_{k=1}^N using the simulated measurement sequences {z_{k,1}}_{k=1}^N and {z_{k,2}}_{k=1}^N, where N = 100; see Fig. 5.

Fig. 5. Two output components for Example 2, noiseless output (thick line), measurement sequence (circles), estimated noiseless output (thin line).
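A sketch of the discretized bioreactor simulation follows (the state is kept noise-free at k = 1, matching the text; all constants are those listed above):

    import numpy as np

    rng = np.random.default_rng(2)
    F, Y, Bf, dt, N = 0.5, 0.5, 10.0, 0.4, 100
    mu_max = K1 = K2 = 0.5
    mu = lambda b: mu_max * b / (b ** 2 / K2 + b + K1)  # growth rate
    x = np.array([4.0, 5.0, 3.0])                       # (a_1, b_1, c_1)
    states, meas = [], []
    for k in range(1, N + 1):
        if k > 1:
            a, b, c = x
            x = x + np.array([
                mu(b) * a * dt - (F / c) * a * dt,       # biomass
                -mu(b) * a / Y * dt + (F / c) * (Bf - b) * dt,  # substrate
                F * dt,                                  # volume
            ]) + np.sqrt(0.2) * rng.standard_normal(3)   # Q_k = 0.2 I_3
        states.append(x.copy())
        meas.append(np.array([x[0] + x[1], x[1] + x[2]])
                    + rng.standard_normal(2))            # R_k = I_2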
An IKS was first employed and only one posterior mode was found. The proposed MCMC scheme was then used to draw 20,000 samples from the posterior smoothing density, without resorting to the single component Metropolis–Hastings scheme. Fig. 5 shows true vs MCMC estimated noiseless output profiles (thick and thin line, respectively). Fig. 6 displays linearly interpolated true and MCMC estimated state values (thick and thin line, respectively) together with 95% confidence intervals (dash-dot line). Notice how all the estimates are close to the truth. The average root mean square errors (RMSE) turn out to be 1.44, 1.34 and 1.3 for the first, second and third state sequence, respectively.

Fig. 6. Three state components for Example 2, true state (thick line), MCMC estimated state (thin line), MCMC 95% confidence limits (dash-dot line).

For comparison, Fig. 7 plots state estimates obtained by the IKS. In particular, we display maximum a posteriori estimates (thin line) together with approximated 95% confidence intervals (dash-dot line), computed as the estimated profile ±1.96 SD. Here, SD denotes the standard deviation of the error estimated by the IKS after computing the last affine approximation. All RMSE now increase, turning out to be 2.13, 2.14 and 2 for the first, second and third state sequence, respectively. This demonstrates how, when dealing with models that exhibit strong nonlinearities, minimum variance estimates obtained by MCMC are much more robust than point estimates attained as the modes of the joint posterior. Notice also how the MCMC confidence intervals are more informative than those achieved by the IKS.

Fig. 7. Three state components for Example 2, true state (thick line), IKS estimated state (thin line), IKS 95% confidence limits (dash-dot line).

6. Conclusions

We have proposed a Markov chain Monte Carlo (MCMC) approach for sampling state sequences from their posterior smoothing density. The method takes advantage of the special structure of a state-space model to efficiently generate a random walk that converges to the posterior density. This enables one to compute the minimum variance estimates as well as confidence intervals. Simulated data was used to corroborate the effectiveness of this approach. These simulations also point out advantages of the proposed method over smoothing via mode-finding algorithms, such as the iterated Kalman smoother.

The proposed method is distinct from those based on particle filters and reported in Doucet et al. (2000), Hürzeler and Künsch (1998) and Kitagawa (1996). Those sample from the marginal posterior of the state at each of the time points, while the proposed technique samples from the posterior of the entire state sequence. The method in Godsill et al. (2004) also samples from the posterior for the entire state sequence, but employs a different methodology based on sequential sampling and particulate representations. In addition, we explicitly suggest a proposal density for MCMC computation, while the machinery described in Godsill et al. (2004) relies upon an importance distribution (to be used in the filtering mode) that has to be selected according to the particular system under study. Another advantage of MCMC over particle filters is the availability of aids that determine the number of samples required to reconstruct the posterior with a certain accuracy. Finally, we also point out that the proposed algorithm is not just an alternative but can also be used jointly with other MCMC smoothing and particle filter based schemes. In fact, it is well known that efficiency in the reconstruction of general target filtering/smoothing distributions can be improved by combining importance sampling and MCMC moves. This kind of algorithmic combination could turn out to be particularly suitable when dealing with models that exhibit strong nonlinearities.

Acknowledgment
This research was partially supported by the Italian Ministry of University and Research through the PRIN Project entitled 'Tecniche ed applicazioni innovative di identificazione e controllo adattativo'.

Appendix

In this section we present a single component Metropolis–Hastings modification to Algorithm 9. At each step, the Metropolis–Hastings algorithm is used to update a subvector of the current state of the chain (Gilks et al., 1996). To be specific, we divide the vector X into h subvectors that we denote by X_{[j]} for j = 1, . . . , h, i.e.,

(x_1^T, . . . , x_N^T) = X^T = (X_{[1]}^T, . . . , X_{[h]}^T)

where X_{[j]} ∈ R^{nK(j)} and K(1) + · · · + K(h) = N. We use the notation ∂_{[j]} for the partial of L(X, Z) with respect to X_{[j]}. It follows that

∂_{[1]}² L(X, Z) = tridiag[B_1(X), C_2(X), . . . , B_{K(1)}(X)]

where B_k(X) and C_k(X) are defined as in Eq. (10). For i = 1, . . . , ℓ and j = 1, . . . , h, we define the variance matrix Σ_j^i and proposal density q_j^i by

Σ_j^i = α_j^i ∂_{[j]}² L(X̂^i, Z)^{−1}
q_j^i(S|X) = N(S|X_{[j]}, Σ_j^i)

where Σ_j^i ∈ R^{nK(j)×nK(j)}, S ∈ R^{nK(j)}, X ∈ R^{nN}, and α_j^i ∈ R. Given X ∈ R^{nN}, an index j ∈ {1, . . . , h}, and a vector S ∈ R^{nK(j)}, we define V(j, S, X) ∈ R^{nN} by

V(j, S, X)^T = (X_{[1]}^T, . . . , X_{[j−1]}^T, S^T, X_{[j+1]}^T, . . . , X_{[h]}^T).
Algorithm 10. The input to this algorithm includes a set of local minimizers of L(X, Z) denoted by {X̂^i}_{i=1}^ℓ. It also includes a number of MCMC iterations M, the component group sizes {K(j)}_{j=1}^h, the scaling factors {{α_j^i}_{j=1}^h}_{i=1}^ℓ, and the variance matrices {{Σ_j^i}_{j=1}^h}_{i=1}^ℓ. The information in Fig. 1 is used to evaluate L(X, Z) and p(X, Z), which are used by this algorithm. The output of this algorithm is the sequence of state sequence vectors {X^{(k)}}_{k=1}^M.

Set X^{(0)} so that L(X^{(0)}, Z) = min_{i=1,...,ℓ} L(X̂^i, Z)
for k = 1, . . . , M do
    Set V_{(0)} = X^{(k−1)}
    for j = 1, . . . , h do
        Determine i(j) = î(V_{(j−1)})
        Sample S_{(j)} from q_j^{i(j)}(S|V_{(j−1)})
        Set W_{(j)} = V(j, S_{(j)}, V_{(j−1)})
        Determine τ(j) = î(W_{(j)})
        Sample u_{(j)}^{(k)} from a uniform on [0, 1]
        Set r_{(j)}^{(k)} = [p(W_{(j)}, Z) q_j^{τ(j)}(X_{[j]}^{(k−1)}|W_{(j)})] / [p(V_{(j−1)}, Z) q_j^{i(j)}(S_{(j)}|V_{(j−1)})]
        Set V_{(j)} = W_{(j)} if u_{(j)}^{(k)} ≤ r_{(j)}^{(k)}, and V_{(j)} = V_{(j−1)} otherwise
    Set X^{(k)} = V_{(h)}

For i = 1, . . . , ℓ and j = 1, . . . , h, we define the (i, j)th acceptance rate for the algorithm above as the number of indices k for which i = î(V_{(j−1)}) and u_{(j)}^{(k)} ≤ r_{(j)}^{(k)}, divided by the number of indices k for which i = î(V_{(j−1)}). A pilot run of the algorithm above is repeated, and the values {{α_j^i}_{j=1}^h}_{i=1}^ℓ are adjusted, until all the acceptance rates for the pilot run are between 20% and 30%.
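One sweep of Algorithm 10 can be sketched as follows; blocks is a list of index arrays partitioning the time indices into the h groups, and the remaining helpers are hypothetical analogues of those used for the Algorithm 9 sketch, restricted to one subvector.

    import numpy as np

    def single_component_sweep(X, blocks, L_of, sample_qj, log_qj, i_hat, rng):
        # Inner loop of Algorithm 10: update the h subvectors of X in turn.
        V = X.copy()
        for j, idx in enumerate(blocks):
            i = i_hat(V)
            S = sample_qj(i, j, V)    # candidate for subvector j
            W = V.copy()
            W[idx] = S                # W = V(j, S, V)
            tau = i_hat(W)
            # log r = log p(W,Z) - log p(V,Z)
            #         + log q_j^tau(V_[j]|W) - log q_j^i(S|V)
            log_r = (-L_of(W) + L_of(V)
                     + log_qj(tau, j, V[idx], W) - log_qj(i, j, S, V))
            if np.log(rng.uniform()) <= log_r:
                V = W
        return V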
References

Bell, B. M. (1994). The iterated Kalman smoother as a Gauss–Newton method. SIAM Journal on Optimization, 4, 626–636.
Bell, B. M. (2000). The marginal likelihood for parameters in a discrete Gauss–Markov process. IEEE Transactions on Signal Processing, 48, 870–873.
Carter, C. K., & Kohn, R. (1994). On Gibbs sampling for state-space models. Biometrika, 81, 541–553.
Crisan, D. (2001). Particle filters: A theoretical perspective. Springer-Verlag.
Del Moral, P. (1998). Measure-valued processes and interacting particle systems. The Annals of Applied Probability, 8, 438–495.
Doucet, A., Godsill, S. J., & Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10, 197–208.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov chain Monte Carlo in practice. London: Chapman and Hall.
Godsill, S. J., Doucet, A., & West, M. (2004). Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99, 156–168.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings, 140, 107–113.
Green, P. J. (1996). MCMC in image analysis. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 381–400). London: Chapman and Hall.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109.
Hürzeler, M., & Künsch, H. R. (1998). Monte Carlo approximations for general state space models. Journal of Computational and Graphical Statistics, 7, 175–193.
Kitagawa, G. (1996). Monte Carlo filter and smoother for nonlinear non-Gaussian state-space models. Journal of Computational and Graphical Statistics, 5, 1–25.
Kristensen, N. R., Madsen, H., & Jorgensen, S. B. (2004). Parameter estimation in stochastic grey-box models. Automatica, 40, 225–237.
Magni, P., Bellazzi, R., & De Nicolao, G. (1998). Bayesian function learning using MCMC methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1319–1331.
Marlow, W. H. (1993). Mathematics for operations research. New York: Dover.
Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. Springer-Verlag.
Pillonetto, G., & Bell, B. M. (2007). Bayes and empirical Bayes semi-blind deconvolution using eigenfunctions of a prior covariance. Automatica, 43(10), 1698–1712.
Raftery, A. E., & Lewis, S. M. (1996). Implementing MCMC. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 115–130). London: Chapman and Hall.
Roberts, G., Gelman, A., & Gilks, W. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7, 110–120.
Roberts, G. O., & Rosenthal, J. S. (2001). Optimal scaling for various Metropolis–Hastings algorithms. Statistical Science, 16(4), 351–367.
Spall, J. C. (2003). Estimation via Markov chain Monte Carlo. IEEE Control Systems Magazine, 23(2), 34–45.

Gianluigi Pillonetto was born on January 21, 1975 in Montebelluna (TV), Italy. He received the Doctoral degree in Computer Science Engineering cum laude from the University of Padova in 1998 and the Ph.D. degree in Biomedical Engineering from the Polytechnic of Milan in 2002. In 2000 and 2002 he was visiting scholar and visiting scientist, respectively, at the Applied Physics Laboratory, University of Washington, Seattle. From 2002 to 2005 he was a Research Associate at the Department of Information Engineering, University of Padova. Since 2005 he has been Assistant Professor of Control and Dynamic Systems at the Department of Information Engineering, University of Padova. His research interests are in the field of system identification, stochastic systems, deconvolution problems, nonparametric regularization techniques, learning theory and randomized algorithms.

Dr. Bradley M. Bell received his BA in math and physics from Saint Lawrence University in 1973, his MA in mathematics from the University of Washington in 1976, and his Ph.D. in mathematics from the University of Washington in 1984. Currently he is employed by the Applied Physics Laboratory of the University of Washington. Much of his work is in collaboration with the Bioengineering Department as a member of the Resource Facility for Population Kinetics. He is also the project manager for the COIN-OR algorithmic differentiation project CppAD. His current research is focused on statistically motivated numerical methods for scientific modeling and data analysis. This includes optimization, Monte Carlo simulation, function estimation (Tikhonov regularization), nonlinear mixed effects modeling, Kalman–Bucy filtering and smoothing, and algorithmic differentiation.