Pattern Recognition Letters 3 (1985) 369-373, North-Holland
December 1985

Baum's forward-backward algorithm revisited

Pierre A. DEVIJVER
Philips Research Laboratory, Avenue Van Becelaere 2, Box 8, B-1170 Brussels, Belgium

Received 19 August 1985

Abstract: In this note, we examine the forward-backward algorithm from the computational viewpoint of the underflow problem inherent in Baum's (1972) original formulation. We demonstrate that the conversion of Baum's computation of joint likelihoods into the computation of posterior probabilities results in essentially the same algorithm, except for the presence of a scaling factor suggested by Levinson et al. (1983) on rather heuristic grounds. The resulting algorithm is immune to the underflow problem, and Levinson's scaling method is given a theoretical justification. We also investigate the relationship between Baum's algorithm and the recent algorithms of Askar and Derin (1981) and Devijver (1984).

Key words: Hidden Markov chains, maximum likelihood estimation, maximum a posteriori probability estimation, forward-backward algorithm.
1. Introduction

Baum's (1972) forward-backward algorithm is a very elegant technique for solving the Bayes smoothing problem under the so-called 'hidden Markov chain' assumption (see also Baum et al. (1972)). It is being used quite extensively for learning purposes in speech recognition applications (see, e.g., Jelinek et al. (1982), Levinson et al. (1983), Rabiner et al. (1983) and the references therein). As noted by Levinson et al. (1983), it has also proved to be convenient in a number of other contexts. It is well known, however, that the implementation of the forward-backward technique obtained by a mere translation of Baum's formulas into computer programs would be marred by severe underflow problems on any existing computer for all but the most trivial applications. Levinson et al. (1983) have proposed a rather heuristic way to remedy the situation, which consists in re-scaling the forward and backward probabilities whenever underflow is likely to occur. Their scaling operation has the effect of converting the forward likelihoods into forward a posteriori probabilities. However, the way that the other quantities involved are affected by the scaling has not been clarified.

The first goal of this note is to demonstrate that the conversion of Baum's computation of joint likelihoods into the computation of a posteriori probabilities results in essentially the same forward-backward algorithm, except for the presence of a scaling factor which turns out to be the same as the one suggested by Levinson et al. The resulting algorithm is - by nature - immune to the underflow problem, and Levinson's scaling method is given a theoretical justification.

An alternative solution to the Bayes smoothing problem has been proposed by Askar and Derin (1981) and Devijver (1984), in terms of a posteriori probability computations. Quite recently, this approach was used by Derin et al. (1984) for the purpose of picture segmentation. Hereafter, we refer to this algorithm as Derin's algorithm. So far, the relationship between Baum's and Derin's algorithms has not been clarified either. Therefore, our second goal will be to show that both approaches differ only in the way that the (scaled) backward probabilities are computed.
The hidden Markov chain model and the associated problem of computing the joint likelihoods required for solving the smoothing problem are introduced in Section 2. Baum's forward-backward algorithm is summarized in Section 3. The solution to Baum's problem in terms of a posteriori probabilities and its relation to Levinson's scaling method is outlined in Section 4. Derin's algorithm is examined in Section 5.
2. Statement of the problem
Let $\Omega = \{\omega_1, \ldots, \omega_c\}$ designate the state-space of a discrete parameter, discrete time, first order, homogeneous Markov chain. We write $\omega^r = \omega_j$ to indicate that the process is in state $\omega_j$ at time $r$. The Markov chain is specified in terms of an initial state distribution $P_j = P(\omega^1 = \omega_j)$, $j = 1, \ldots, c$, and a matrix of stationary state transition probabilities $P_{jk} = P(\omega^{r+1} = \omega_k \mid \omega^r = \omega_j)$, $r \ge 1$ and $j, k \in \{1, \ldots, c\}$. The random process associated with the states is represented by $c$ probability distributions $p_j(X) = p(X \mid \omega_j)$, $j = 1, \ldots, c$. Given a sequence $\mathcal{X}_1^N = \{X^1, \ldots, X^N\}$ of observations of the random variable $X$ - where $X^r$ is the observation at time $r$ - we make the assumption that $X^1, \ldots, X^N$ are state-conditionally independent (or that the $X$'s are observed in memoryless noise). More precisely, we assume that

$$P(X^1, \ldots, X^N \mid \omega^1, \ldots, \omega^N) = \prod_{r=1}^{N} p(X^r \mid \omega^r).$$

One of the problems addressed by Baum (1972) was to compute an estimate of $\omega^r$ for each $r = 1, \ldots, N$, based on a realization of the observation sequence $\mathcal{X}_1^N$. Thus, under a minimum error-probability criterion, it was necessary to determine either the joint likelihoods $\mathcal{L}_r(\omega_j) = P(\omega^r = \omega_j, \mathcal{X}_1^N)$, $j = 1, \ldots, c$, or the a posteriori probabilities $\bar{\mathcal{L}}_r(\omega_j) = P(\omega^r = \omega_j \mid \mathcal{X}_1^N)$, $j = 1, \ldots, c$, for $r = 1, \ldots, N$. Baum's algorithm is a very elegant scheme for computing the joint likelihoods.

3. Baum's algorithm

Baum's algorithm is based on the following, simple decomposition:

$$\mathcal{L}_r(\omega_j) = P(\omega^r = \omega_j, \mathcal{X}_1^N) = P(\omega^r = \omega_j, \mathcal{X}_1^r, \mathcal{X}_{r+1}^N) = P(\omega^r = \omega_j, \mathcal{X}_1^r)\, P(\mathcal{X}_{r+1}^N \mid \omega^r = \omega_j) = \mathcal{F}_r(\omega_j)\, \mathcal{B}_r(\omega_j), \qquad (1)$$

for $j = 1, \ldots, c$ and $1 \le r \le N$. It is customary to call $\mathcal{F}_r(\omega_j)$ and $\mathcal{B}_r(\omega_j)$ the forward and backward probabilities respectively. The forward probabilities can be computed inductively by the recurrence

$$\mathcal{F}_1(\omega_j) = P_j\, p_j(X^1),$$
$$\mathcal{F}_r(\omega_j) = \sum_i \mathcal{F}_{r-1}(\omega_i)\, P_{ij}\, p_j(X^r), \qquad r = 2, \ldots, N, \qquad (2)$$

for we have

$$\mathcal{F}_r(\omega_j) = \sum_i P(\omega^{r-1} = \omega_i, \omega^r = \omega_j, \mathcal{X}_1^{r-1}, X^r) = \sum_i P(\omega^{r-1} = \omega_i, \mathcal{X}_1^{r-1})\, P_{ij}\, p_j(X^r), \qquad (3)$$

for $r > 1$, and the last case of (2) follows by identifying $\mathcal{F}_{r-1}(\omega_i)$ in the right hand side of (3). Likewise, the backward probabilities can be computed inductively by the recurrence

$$\mathcal{B}_N(\omega_j) = 1,$$
$$\mathcal{B}_r(\omega_j) = \sum_k P_{jk}\, p_k(X^{r+1})\, \mathcal{B}_{r+1}(\omega_k), \qquad r = N-1, \ldots, 1. \qquad (4)$$

The last case of (4) is readily verified for $r = N-1$. For $r \le N-2$, we have

$$\mathcal{B}_r(\omega_j) = \sum_k P(X^{r+1}, \mathcal{X}_{r+2}^N, \omega^{r+1} = \omega_k \mid \omega^r = \omega_j) = \sum_k P_{jk}\, p_k(X^{r+1})\, P(\mathcal{X}_{r+2}^N \mid \omega^{r+1} = \omega_k). \qquad (5)$$

Again, (4) follows by identifying $\mathcal{B}_{r+1}(\omega_k)$ in the right hand side of (5). In view of (1), Baum's forward-backward algorithm is quite simple: the $\mathcal{F}_r$'s are computed recursively forward and stored for use during the backward stage, which amounts to computing the $\mathcal{B}_r$'s using (4) and using (1) to get the desired likelihoods $\mathcal{L}_r(\omega_j)$ for $j = 1, \ldots, c$ and $r = N, \ldots, 1$. Let us also note that

$$\sum_j \mathcal{L}_r(\omega_j) = \sum_j \mathcal{F}_N(\omega_j) = P(\mathcal{X}_1^N).$$
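As a concrete illustration of (1), (2) and (4) - a sketch of our own, not part of the original paper, with illustrative function and array names (baum_forward_backward, P_init, P_trans, b) - the recurrences translate into a few lines of NumPy:

```python
import numpy as np

def baum_forward_backward(P_init, P_trans, b):
    """Unscaled forward-backward recurrences, eqs. (1), (2) and (4).

    P_init  : (c,)   initial state probabilities P_j
    P_trans : (c, c) stationary transition probabilities P_jk
    b       : (N, c) emission likelihoods, b[t, j] = p_j(X^{t+1}) for the
              observed sequence X^1, ..., X^N (0-based time index t)
    Returns F, B, L with L[t, j] = P(omega^{t+1} = omega_j, X_1^N), eq. (1).
    """
    N, c = b.shape
    F = np.zeros((N, c))                        # forward likelihoods, eq. (2)
    B = np.zeros((N, c))                        # backward likelihoods, eq. (4)

    F[0] = P_init * b[0]                        # F_1(omega_j) = P_j p_j(X^1)
    for t in range(1, N):
        F[t] = (F[t - 1] @ P_trans) * b[t]      # sum_i F_{r-1}(omega_i) P_ij p_j(X^r)

    B[N - 1] = 1.0                              # B_N(omega_j) = 1
    for t in range(N - 2, -1, -1):
        B[t] = P_trans @ (b[t + 1] * B[t + 1])  # sum_k P_jk p_k(X^{r+1}) B_{r+1}(omega_k)

    return F, B, F * B                          # eq. (1): L_r(omega_j) = F_r B_r
```

On a toy example, `(F * B)[t].sum()` equals $P(\mathcal{X}_1^N)$ for every $t$, in agreement with the identity above; for long sequences, however, both `F` and `B` underflow, which is precisely the problem discussed next.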
Levinson et al. (1983) pointed out that $\mathcal{F}_r(\omega_j) \to 0$ and $\mathcal{B}_r(\omega_j) \to 0$ at an exponential rate, thereby causing underflow on any real computer. To remedy the situation, they suggested to multiply both $\mathcal{F}_r(\omega_j)$ and $\mathcal{B}_r(\omega_j)$ by a scaling coefficient $c_r$ (in their formulation, the inverse of the sum over the states of the forward probabilities at time $r$) whenever underflow is likely to occur. They also observed that, in practice, the scaling operation need not be performed at every observation time and that one can use any scaling interval for which underflow does not occur. In that case, scaling coefficients corresponding to values of $r$ within any interval are set to unity. It will be shown hereafter that if scaling is performed at every observation time, and if scaled probabilities are substituted for the unscaled ones in Baum's forward recurrence (2), then

$$c_r = [P(X^r \mid \mathcal{X}_1^{r-1})]^{-1}, \qquad (6)$$

for $1 \le r \le N$, whereas, if scaling is performed only at selected instants $t_1 < t_2 < \cdots < t_n$, then

$$c_{t_i} = [P(\mathcal{X}_{t_{i-1}+1}^{t_i} \mid \mathcal{X}_1^{t_{i-1}})]^{-1}, \qquad (7)$$

for $i = 1, \ldots, n$ (with $t_0 = 0$). In both cases, there follows

$$\prod_r c_r = [P(\mathcal{X}_1^N)]^{-1}, \qquad (8)$$

where $r$ is running over the sequence of time instants at which scaling actually occurs. A simple, though indirect, proof of (6) and (7) will appear as a byproduct of the results in the next section.
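As an aside (our remark, not part of the paper), when scaling is performed at every time instant the identity (8) is nothing but the chain rule of probability applied to the factors in (6):

$$\prod_{r=1}^{N} c_r^{-1} = \prod_{r=1}^{N} P(X^r \mid \mathcal{X}_1^{r-1}) = P(\mathcal{X}_1^N), \qquad \text{i.e.} \qquad \log P(\mathcal{X}_1^N) = -\sum_{r=1}^{N} \log c_r.$$

The same argument, applied blockwise to the factors in (7), gives (8) when scaling occurs only at selected instants; in either case the log-likelihood of the whole observation sequence is recovered from the scaling coefficients without ever forming the underflowing product.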
4. Baum's algorithm for posterior probabilities
In this section, we derive an equally simple computational procedure for computing the a posteriori probabilities $\bar{\mathcal{L}}_r(\omega_j) = P(\omega^r = \omega_j \mid \mathcal{X}_1^N)$. The derivation is based on the following decomposition, analogous to (1): for $1 \le r \le N$,

$$\bar{\mathcal{L}}_r(\omega_j) = P(\omega^r = \omega_j \mid \mathcal{X}_1^N) = \frac{P(\omega^r = \omega_j, \mathcal{X}_1^r, \mathcal{X}_{r+1}^N)}{P(\mathcal{X}_1^r, \mathcal{X}_{r+1}^N)} = \frac{P(\omega^r = \omega_j, \mathcal{X}_1^r)}{P(\mathcal{X}_1^r)} \times \frac{P(\mathcal{X}_{r+1}^N \mid \omega^r = \omega_j)}{P(\mathcal{X}_{r+1}^N \mid \mathcal{X}_1^r)} = \bar{\mathcal{F}}_r(\omega_j)\, \bar{\mathcal{B}}_r(\omega_j), \qquad (9)$$

where $\bar{\mathcal{F}}_r(\omega_j) = P(\omega^r = \omega_j \mid \mathcal{X}_1^r)$ and $\bar{\mathcal{B}}_r(\omega_j) = P(\mathcal{X}_{r+1}^N \mid \omega^r = \omega_j)/P(\mathcal{X}_{r+1}^N \mid \mathcal{X}_1^r)$, for $j = 1, \ldots, c$. We note that both $\bar{\mathcal{L}}_r(\omega_j)$ and $\bar{\mathcal{F}}_r(\omega_j)$ are posterior probabilities, whereas $\bar{\mathcal{B}}_r(\omega_j)$ does not seem to be amenable to any natural interpretation. The forward probabilities can be computed inductively by the recurrence

$$\bar{\mathcal{F}}_1(\omega_j) = c_1\, P_j\, p_j(X^1),$$
$$\bar{\mathcal{F}}_r(\omega_j) = c_r \sum_i \bar{\mathcal{F}}_{r-1}(\omega_i)\, P_{ij}\, p_j(X^r), \qquad r = 2, \ldots, N, \qquad (10)$$

where

$$c_1 = \Big[\sum_j P_j\, p_j(X^1)\Big]^{-1},$$
$$c_r = \Big[\sum_j \sum_i \bar{\mathcal{F}}_{r-1}(\omega_i)\, P_{ij}\, p_j(X^r)\Big]^{-1}, \qquad r = 2, \ldots, N. \qquad (11)$$

To establish (10) and (11), we first note that, for $1 \le r \le N$,

$$\bar{\mathcal{F}}_r(\omega_j) = P(\omega^r = \omega_j \mid X^r, \mathcal{X}_1^{r-1}) = \frac{P(\omega^r = \omega_j, X^r \mid \mathcal{X}_1^{r-1})}{P(X^r \mid \mathcal{X}_1^{r-1})}. \qquad (12)$$

For the numerator in (12), we proceed as in (3):

$$P(\omega^r = \omega_j, X^r \mid \mathcal{X}_1^{r-1}) = \sum_i P(\omega^{r-1} = \omega_i, \omega^r = \omega_j, X^r \mid \mathcal{X}_1^{r-1}) = \sum_i \bar{\mathcal{F}}_{r-1}(\omega_i)\, P_{ij}\, p_j(X^r), \qquad (13)$$

as required. The case of the denominator can be quickly made by noting that

$$[c_r]^{-1} = P(X^r \mid \mathcal{X}_1^{r-1}) = \sum_j P(\omega^r = \omega_j, X^r \mid \mathcal{X}_1^{r-1}) \qquad (14)$$

and remarking that (13) applies to each term in the sum in (14), thereby giving the second case of (11). Finally, substitution in (12) yields the second case of (10). It is readily seen that the recurrence prescribed by (10) and (11) is equivalent to using scaled likelihoods at each step of (2), and performing the scaling as suggested by Levinson et al. (1983). This proves (6). Equation (7) can be proved in very much the same way.
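As an illustration - again a sketch of our own with hypothetical names (scaled_forward, Fbar, cr), not code from the paper - the scaled forward pass (10)-(11) is the unscaled one with a per-step normalization, and the normalizing constants are exactly the $c_r$ of (6), so that (8) yields the log-likelihood as a byproduct:

```python
import numpy as np

def scaled_forward(P_init, P_trans, b):
    """Scaled forward recurrence, eqs. (10)-(11).

    Returns Fbar with Fbar[t, j] = P(omega^{t+1} = omega_j | X_1^{t+1}),
    the scaling coefficients cr (eq. (6): cr[t] = 1 / P(X^{t+1} | X_1^t)),
    and log P(X_1^N) obtained from eq. (8).
    """
    N, c = b.shape
    Fbar = np.zeros((N, c))
    cr = np.zeros(N)

    u = P_init * b[0]                         # unnormalized r = 1 term
    cr[0] = 1.0 / u.sum()                     # eq. (11), r = 1
    Fbar[0] = cr[0] * u                       # eq. (10), r = 1
    for t in range(1, N):
        u = (Fbar[t - 1] @ P_trans) * b[t]    # sum_i Fbar_{r-1}(omega_i) P_ij p_j(X^r)
        cr[t] = 1.0 / u.sum()                 # eq. (11), r = 2, ..., N
        Fbar[t] = cr[t] * u                   # eq. (10): a posteriori, sums to one

    log_likelihood = -np.log(cr).sum()        # eq. (8)
    return Fbar, cr, log_likelihood
```

Every row of `Fbar` is a probability distribution over the $c$ states, so no underflow can occur whatever the value of $N$.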
The case of the backward probabilities is even simpler and equally interesting. For $r \le N-2$, we have

$$\bar{\mathcal{B}}_r(\omega_j) = \frac{\sum_k P(X^{r+1}, \mathcal{X}_{r+2}^N, \omega^{r+1} = \omega_k \mid \omega^r = \omega_j)}{P(X^{r+1} \mid \mathcal{X}_1^r)\, P(\mathcal{X}_{r+2}^N \mid \mathcal{X}_1^{r+1})} = c_{r+1} \sum_k P_{jk}\, p_k(X^{r+1})\, \bar{\mathcal{B}}_{r+1}(\omega_k). \qquad (15)$$

Hence the backward recurrence

$$\bar{\mathcal{B}}_N(\omega_j) = 1,$$
$$\bar{\mathcal{B}}_r(\omega_j) = c_{r+1} \sum_k P_{jk}\, p_k(X^{r+1})\, \bar{\mathcal{B}}_{r+1}(\omega_k), \qquad r = N-1, \ldots, 1, \qquad (16)$$

where $c_{r+1}$ is given by (11). At this stage, it can be observed that the recurrences (10) and (16) expressed in terms of posterior probabilities are formally identical to the recurrences (2) and (4) expressed in terms of joint likelihoods, except for the presence of the normalizing factor $c_r$. In this way, we have shown that it is the right scaling factor to use if one wishes to convert Baum's computation of joint likelihoods into an equivalent computation of a posteriori probabilities in order to eliminate the underflow problem.

Our presentation in this section was intended to enhance the similarity between the two kinds of computation. However, it should be obvious that, as stated in (10) and (11), the forward recurrence does not lead to an efficient implementation. A much more efficient, though equivalent, one is as follows:

$$\mathcal{P}_1(\omega_j) = P_j,$$
$$\mathcal{P}_r(\omega_j) = \sum_i \bar{\mathcal{F}}_{r-1}(\omega_i)\, P_{ij}, \qquad r = 2, \ldots, N,$$
$$\bar{\mathcal{F}}_r(\omega_j) = c_r\, \mathcal{P}_r(\omega_j)\, p_j(X^r), \qquad r = 1, \ldots, N, \qquad (17)$$

where $c_r$ is given by (11). With this formulation, $\mathcal{P}_r(\omega_j)$ is the a priori probability $P(\omega^r = \omega_j \mid \mathcal{X}_1^{r-1})$ of the process being in state $\omega_j$ at time $r$, given only the sequence of previous observations $\mathcal{X}_1^{r-1}$, and the scheme prescribed by (17) is nothing but the sequential compound algorithm first proposed by Abend (1968) and Raviv (1967) for computing $\bar{\mathcal{F}}_r(\omega_j) = P(\omega^r = \omega_j \mid \mathcal{X}_1^r)$. (See Devijver and Kittler (1982) for a simpler formulation.) In the signal processing literature, our $\mathcal{P}$, $\bar{\mathcal{F}}$ and $\bar{\mathcal{L}}$ probabilities frequently go by the names of predicted, filtered, and smoothed probabilities respectively.

5. Derin's algorithm

While the derivation of Baum's algorithm is based on a decomposition in forward and backward probabilities, the first step in the derivation of Derin's algorithm (see Askar and Derin (1981) or Devijver (1984)) consists in establishing a backward recurrence between the smoothed a posteriori probabilities at times $r+1$ and $r$, viz.,

$$\bar{\mathcal{L}}_r(\omega_j) = \sum_k P(\omega^r = \omega_j, \omega^{r+1} = \omega_k \mid \mathcal{X}_1^N) = \sum_k P(\omega^r = \omega_j \mid \omega^{r+1} = \omega_k, \mathcal{X}_1^N)\, \bar{\mathcal{L}}_{r+1}(\omega_k) = \sum_k \frac{P(\omega^r = \omega_j, \omega^{r+1} = \omega_k, \mathcal{X}_1^N)}{P(\omega^{r+1} = \omega_k, \mathcal{X}_1^N)}\, \bar{\mathcal{L}}_{r+1}(\omega_k), \qquad (18)$$

for $r \le N-1$. Now,

$$P(\omega^r = \omega_j, \omega^{r+1} = \omega_k, \mathcal{X}_1^N) = P(\mathcal{X}_1^r)\, P(\omega^r = \omega_j \mid \mathcal{X}_1^r)\, P_{jk}\, P(\mathcal{X}_{r+1}^N \mid \omega^{r+1} = \omega_k) \qquad (19)$$

and

$$P(\omega^{r+1} = \omega_k, \mathcal{X}_1^N) = P(\mathcal{X}_1^r)\, P(\omega^{r+1} = \omega_k \mid \mathcal{X}_1^r)\, P(\mathcal{X}_{r+1}^N \mid \omega^{r+1} = \omega_k), \qquad (20)$$

respectively. By substitution in (18), both the first and last factors in the right hand sides of (19) and (20) can be factored out and cancelled - with the consequence that the resulting scheme is also immune to the underflow problem - and two out of the remaining factors are readily identified as $\bar{\mathcal{F}}$ and $\mathcal{P}$ probabilities. Thus, we are left with
$$\bar{\mathcal{L}}_N(\omega_j) = \bar{\mathcal{F}}_N(\omega_j),$$
$$\bar{\mathcal{L}}_r(\omega_j) = \bar{\mathcal{F}}_r(\omega_j) \sum_k P_{jk}\, \frac{\bar{\mathcal{L}}_{r+1}(\omega_k)}{\mathcal{P}_{r+1}(\omega_k)}, \qquad r = N-1, \ldots, 1. \qquad (21)$$
From the derivation, it is clear that we have to define $\bar{\mathcal{L}}_{r+1}(\omega_k)/\mathcal{P}_{r+1}(\omega_k) = 0$ whenever $\mathcal{P}_{r+1}(\omega_k) = 0$. As evidenced by (21), Derin's computational scheme is just as simple as Baum's: the $\mathcal{P}$ and $\bar{\mathcal{F}}$ probabilities are first computed recursively forward using (17); they have to be stored for use in the backward recurrence (21). In some sense, the essential difference resides in the fact that Derin's method offers an alternative way of computing (scaled) backward probabilities. A minute examination of both approaches reveals that Baum's algorithm - with scaling, as in Section 4 - requires $c = |\Omega|$ additional operations per step. Aside from their formal differences, both approaches also differ slightly in their storage requirements: Baum's algorithm requires storage for the $X$ and $\bar{\mathcal{F}}$ values whereas, with Derin's, the $\mathcal{P}$ and $\bar{\mathcal{F}}$ values are needed. Whether one or the other should be superior in this respect depends on the dimensionality of $X$ and the size $c$ of the state space.
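To make the comparison concrete, the sketch below (our own illustration with hypothetical names, not the authors' code) completes the smoothing pass in the two ways just discussed: with the scaled backward recurrence (16) combined with (9), and with Derin's recurrence (21), which reuses the predicted probabilities of (17) instead of re-reading the observations:

```python
import numpy as np

def smooth_baum(Fbar, cr, P_trans, b):
    """Scaled backward pass, eq. (16), combined with eq. (9)."""
    N, c = Fbar.shape
    Bbar = np.ones((N, c))                    # Bbar_N(omega_j) = 1
    for t in range(N - 2, -1, -1):
        # c_{r+1} sum_k P_jk p_k(X^{r+1}) Bbar_{r+1}(omega_k)
        Bbar[t] = cr[t + 1] * (P_trans @ (b[t + 1] * Bbar[t + 1]))
    return Fbar * Bbar                        # smoothed probabilities, eq. (9)

def smooth_derin(Fbar, Pred, P_trans):
    """Derin's backward recurrence, eq. (21).

    Pred : (N, c) predicted probabilities of eq. (17), i.e.
           Pred[0] = P_init and Pred[t] = Fbar[t - 1] @ P_trans for t >= 1.
    """
    N, c = Fbar.shape
    Lbar = np.zeros((N, c))
    Lbar[N - 1] = Fbar[N - 1]                 # r = N case of (21)
    for t in range(N - 2, -1, -1):
        ratio = np.divide(Lbar[t + 1], Pred[t + 1],
                          out=np.zeros(c), where=Pred[t + 1] > 0)  # 0/0 := 0
        Lbar[t] = Fbar[t] * (P_trans @ ratio)  # sum_k P_jk Lbar_{r+1} / Pred_{r+1}
    return Lbar
```

Both routines return the same smoothed posteriors $\bar{\mathcal{L}}_r(\omega_j)$; note that `smooth_baum` re-reads the emission likelihoods `b` during the backward pass, whereas `smooth_derin` relies on the stored predicted probabilities instead, which mirrors the storage comparison made above.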
References

Abend, K. (1968). Compound decision procedures for unknown distributions and for dependent states of nature. In: L.N. Kanal, Ed., Pattern Recognition. Thompson, Washington DC, pp. 207-249.

Askar, M. and H. Derin (1981). A recursive algorithm for the Bayes solution of the smoothing problem. IEEE Trans. Automatic Control 26, 558-560.

Baum, L.E. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1-8.

Baum, L.E., T. Petrie, G. Soules, and N. Weiss (1972). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41, 164-171.

Derin, H., H. Elliot, R. Christi, and D. Geman (1984). Bayes smoothing algorithms for segmentation of binary images modeled by Markov random fields. IEEE Trans. Pattern Anal. Machine Intell. 6, 707-720.

Devijver, P.A. (1984). Classification in Markov chains for minimum symbol error rate. Proc. 7th Intern. Conf. Pattern Recognition, Montreal, pp. 1334-1336.

Devijver, P.A. and J. Kittler (1982). Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ.

Jelinek, F., R.L. Mercer, and L.R. Bahl (1982). Continuous speech recognition: Statistical methods. In: P.R. Krishnaiah and L.N. Kanal, Eds., Handbook of Statistics 2. North-Holland, Amsterdam, pp. 549-573.

Levinson, S.E., L.R. Rabiner, and M.M. Sondhi (1983). An introduction to the application of the theory of probabilistic functions of a Markov process in automatic speech recognition. B.S.T.J. 62, 1035-1074.

Rabiner, L.R., S.E. Levinson, and M.M. Sondhi (1983). On the application of vector quantization and hidden Markov models to speaker independent, isolated word recognition. B.S.T.J. 62, 1075-1105.

Raviv, J. (1967). Decision making in Markov chains applied to the problem of pattern recognition. IEEE Trans. Inform. Theory 3, 536-551.