MRF parameter estimation by an accelerated method


Pattern Recognition Letters 24 (2003) 1251–1259 www.elsevier.com/locate/patrec

Yihua Yu *, Qiansheng Cheng
Department of Information Science, Key Laboratory of Pure and Applied Mathematics, School of Mathematical Sciences, Peking University, Beijing 100871, People's Republic of China
Received 9 May 2002; received in revised form 9 August 2002

Abstract

Markov random field (MRF) modelling is a popular method in pattern recognition and computer vision, and MRF parameter estimation is of particular importance to MRF modelling. In this paper, a new approach based on the Metropolis–Hastings algorithm and the gradient method is presented for estimating MRF parameters. With a properly chosen proposal distribution for the Metropolis–Hastings algorithm, the Markov chain constructed by the method converges quickly to its stationary distribution and gives good estimation results. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Markov random field; Metropolis–Hastings method; Parameter estimation

1. Introduction

In the computer vision literature, statistical techniques have received considerable interest for modelling and processing image data. Markov random field (MRF) modelling is a very popular such technique and plays an important role in pattern recognition and computer vision, for example in the classification and segmentation of textured images. Although these techniques have demonstrated substantial success for a wide variety of images, they have some limitations. One of the main difficulties is estimating the parameters of the model from specific realizations.

* Corresponding author. Tel.: +86-10-62759902. E-mail addresses: [email protected] (Y. Yu), [email protected] (Q. Cheng).

During the past years, many authors have presented various methods for estimating MRF parameters. Besag (1974) suggested the coding method. This procedure is cumbersome and difficult to use reliably in experiments because it depends on algorithms that numerically solve nonlinear equations. In view of this, Derin and Elliott (1987) developed a new approach that allows the use of a standard linear least-squares (LS) method. However, LS is not accurate. The maximum likelihood method (e.g., Winkler, 1995) is frequently used for MRF parameter estimation, but it usually carries a high computational cost. Wang et al. (2000) proposed a Markov chain Monte Carlo (MCMC) method, but the computational requirements remain prohibitive because the Markov chain constructed by that method converges slowly to its stationary distribution. Other recent methods include iterated conditional estimation (e.g., Giordana and Pieczynski, 1997).



In this paper, a new approach for MRF parameter estimation is developed, based on the Metropolis–Hastings algorithm and the gradient method. The main difference between our approach and previous ones is that we exploit the merits of the two methods while overcoming their respective defects: the gradient method converges fast but usually only to a local optimum, whereas the Metropolis–Hastings algorithm can reach the global optimum but converges slowly. The main idea is to choose the proposal distribution of the Metropolis–Hastings algorithm so that the fast convergence of the gradient method is incorporated while global convergence is still guaranteed.

This paper is organized as follows. In Section 2, we present some background material on MRFs and describe the particular class of MRF used in this paper. Section 3 presents our algorithms for MRF parameter estimation. Some experimental results appear in Section 4, followed by concluding remarks in Section 5.

2. Background on MRF

In this section we present the basic definitions of MRF models and the particular class of MRF used in this paper. More details can be found in (Derin and Elliott, 1987) or (Winkler, 1995).

Let S be a finite N_1 × N_2 rectangular lattice of pixels defined as
S = \{(i, j) : 1 \le i \le N_1, 1 \le j \le N_2\}.
For every site s ∈ S, let X_s be a finite space of states x_s. The product X = \prod_{s \in S} X_s is the space of finite configurations x = (x_s)_{s \in S}.

Definition 2.1. A collection ∂ = \{∂(s) : s ∈ S\} of subsets of S is called a neighborhood system if (1) s ∉ ∂(s), and (2) s ∈ ∂(t) if and only if t ∈ ∂(s). The sites s ∈ ∂(t) are called neighbors of t.

Definition 2.2. A subset c of S is called a clique if any two different elements of c are neighbors. We denote the set of cliques by C.

Definition 2.3. A random field p defined on S is a Gibbs random field with respect to ∂ if and only if its joint distribution is of the form
p(x) = \frac{1}{Z} \exp(-U(x)),
where U(x) = \sum_{c \in C} V_c(x) is the energy function, V_c(x) is the potential associated with clique c, and Z = \sum_{x \in X} \exp(-U(x)) is the partition function.

Example 2.1. The index set is a finite lattice S = \{(i, j) \in Z \times Z : 1 \le i, j \le m\}, and
∂(i, j) = \{(k, l) : 0 < (k - i)^2 + (l - j)^2 \le 2\},   (1)
which consists of the eight pixels neighboring (i, j), is known as the second-order neighborhood system. The neighborhood structure is shown in Fig. 1 and the clique types are shown in Fig. 2.

In this paper, we only discuss the second-order neighborhood system (1). The extension to larger neighborhood systems and the restriction to smaller ones are evident. It is assumed that the random field is homogeneous, that is, the clique potentials depend only on the clique type and the pixel values in the clique, but not on the position of the clique in the lattice S. Here, we only consider a particular class of potentials, namely V_c(x) = 0 if |c| ≠ 2. Among the clique types in Fig. 2, a parameter β_i (i = 1, ..., 4) is assigned to each of the four two-site clique types (horizontal, vertical and the two diagonal pairs); the parameters assigned to all other clique types are set to zero. The associated clique potentials are defined as


Fig. 1. The second-order neighbors of a site.


Fig. 2. The clique types of the second-order neighborhood system.

V_c(x; \theta) = \begin{cases} -\beta_i & \text{if all } x_s \text{ in } c \text{ are equal}, \\ \beta_i & \text{otherwise}, \end{cases}

where β_i is the parameter specified for the clique type of c and θ = (β_1, ..., β_4). The distribution of x is rewritten as
p(x; \theta) = \frac{1}{Z(\theta)} \exp(-U(x; \theta)),
where U(x; \theta) = \sum_{c \in C} V_c(x; \theta), Z(\theta) = \sum_{x \in X} \exp(-U(x; \theta)) and \theta \in R^4.

Note that for positive β_i, equal grey values of the pixels in c are favorable, while different ones are favorable for negative β_i. Small values of |β_i| correspond to weak coupling and large values of |β_i| to strong coupling. By a suitable choice of the parameters, more or less ordered patterns and attraction–repulsion effects can be incorporated into the model. This is even more striking in texture models, where different parameters characterize distinctly different textures (Winkler, 1995). So, given a realization of an image, all these parameters should be estimated.

The problem is now as follows. Let x be a realization of X whose distribution is p(x; θ*). The 'true' parameter θ* is not known and needs to be determined, or at least approximated. The only available information is hidden in the realization x. Hence we need a method for choosing some θ̃ as a substitute for θ*. Under the mean-squared error (MSE) criterion, the optimal solution of this estimation problem is the Bayes estimator, which has the form
E(\theta \mid x) = \int_{\theta \in R^4} \theta \, p(\theta \mid x) \, d\theta.   (2)
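As an illustration of this model, the following minimal sketch computes the energy U(x; θ) of a configuration on a rectangular lattice. The pairing of β_1, ..., β_4 with the horizontal, vertical and two diagonal clique directions is an assumption made here for concreteness, since the paper fixes it only through Fig. 2.

```python
import numpy as np

# Offsets of the four two-site clique types of the second-order neighborhood
# system; the order (horizontal, vertical, two diagonals) and its pairing
# with beta_1..beta_4 is an illustrative assumption.
CLIQUE_OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]

def energy(x, theta):
    """U(x; theta) = sum over pair cliques of V_c(x; theta),
    with V_c = -beta_i if the two pixels are equal and +beta_i otherwise."""
    x = np.asarray(x)
    h, w = x.shape
    u = 0.0
    for beta, (di, dj) in zip(theta, CLIQUE_OFFSETS):
        # slice ranges so that (i, j) and (i + di, j + dj) both stay in the lattice
        i0, i1 = max(0, -di), min(h, h - di)
        j0, j1 = max(0, -dj), min(w, w - dj)
        a = x[i0:i1, j0:j1]
        b = x[i0 + di:i1 + di, j0 + dj:j1 + dj]
        equal = (a == b)
        u += -beta * equal.sum() + beta * (~equal).sum()
    return u

# Tiny usage example on a random binary configuration.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(8, 8))
print(energy(x, theta=(1.0, 1.0, -0.5, -0.5)))
```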

In the next section, we discuss how to compute the integral (2).

3. An accelerated method for MRF parameter estimation

Because the discrete space X is very large, the computation involved in the partition function Z(θ) and in the integral (2) is so difficult that one has to resort to numerical approaches. Monte Carlo methods are perhaps the most versatile and powerful ones (Liu, 2001). Suppose that we can generate N random samples θ_1, ..., θ_N from the posterior distribution p(θ|x); then the integral (2) can be approximated by
\tilde{E}(\theta \mid x) = \frac{1}{N} \sum_{i=1}^{N} \theta_i.

The Metropolis–Hastings algorithm has been widely used to generate such random samples. Metropolis et al. (1953) introduced the fundamental idea and Hastings (1970) provided the following more general algorithm.

Algorithm 3.1 (Metropolis–Hastings algorithm).
1. Initialize. Start with θ_0; for t = 1, 2, ..., do step 2.
2. (1) From the current state θ_t, a new value θ' is proposed by sampling from a proposal distribution q(θ'|θ_t).
   (2) θ' is accepted with probability
   \alpha(\theta', \theta_t) = \min\left\{1, \frac{p(\theta' \mid x) \, q(\theta_t \mid \theta')}{p(\theta_t \mid x) \, q(\theta' \mid \theta_t)}\right\}.
   (3) If θ' is accepted, set θ_{t+1} = θ'; otherwise, set θ_{t+1} = θ_t.
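For reference, a minimal generic Metropolis–Hastings sampler of this form might look as follows. The target log-density and the symmetric Gaussian random-walk proposal used here are placeholders for illustration, not the specific choices developed below.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iter, rng, step=1.0):
    """Generic Metropolis-Hastings with a symmetric Gaussian random-walk
    proposal q(theta'|theta_t) = N(theta'; theta_t, step^2 I), for which the
    proposal ratio q(theta_t|theta') / q(theta'|theta_t) cancels to 1."""
    theta = np.asarray(theta0, dtype=float)
    samples = [theta.copy()]
    for _ in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.shape)
        log_alpha = log_target(proposal) - log_target(theta)
        if np.log(rng.uniform()) < min(0.0, log_alpha):
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# Example: sampling a 2-D standard normal as the target distribution.
rng = np.random.default_rng(0)
chain = metropolis_hastings(lambda t: -0.5 * np.sum(t**2),
                            theta0=[3.0, -3.0], n_iter=2000, rng=rng)
print(chain.mean(axis=0))
```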


The basic idea behind the algorithm is that one can sample from the target distribution by running a Markov chain whose stationary distribution is exactly p(θ|x), but in general its convergence is slow. New constructions are therefore needed to enable fast exploration of the high-probability regions of p(θ|x). It is well known that the gradient method converges fast, but usually only to a local optimum. In this paper, a new method is proposed, based on the Metropolis–Hastings algorithm and the gradient method, such that the fast convergence of the gradient method is incorporated into the Metropolis–Hastings algorithm while global convergence is still guaranteed.

According to Bayes' theorem, the posterior distribution p(θ|x) is
p(\theta \mid x) = \frac{p(\theta) \, p(x \mid \theta)}{\int p(\theta) \, p(x \mid \theta) \, d\theta} \propto p(\theta) \, p(x \mid \theta).   (3)

To make the problem computationally feasible, the likelihood function p(x|θ) is replaced by the (log) pseudo-likelihood function
PL(x \mid \theta) = \ln \Big( \prod_{s \in S} p(x_s \mid x_{S \setminus s}, \theta) \Big),   (4)
where p(x_s | x_{S\s}, θ) is the conditional probability defined as
p(x_s \mid x_{S \setminus s}, \theta) = \frac{\exp(-U(x_s, x_{S \setminus s}; \theta))}{\sum_{z_s \in X_s} \exp(-U(z_s, x_{S \setminus s}; \theta))}.
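A sketch of the pseudo-likelihood computation for this model is given below. Here local_energy evaluates the sum of pair-clique potentials involving site s for a candidate value; the clique-direction/β pairing is again an illustrative assumption, as is the naming of the helpers.

```python
import numpy as np

CLIQUE_OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]  # assumed pairing with beta_1..beta_4

def local_energy(x, i, j, value, theta):
    """Sum of pair-clique potentials V_c involving site (i, j),
    with x[i, j] replaced by `value`."""
    h, w = x.shape
    u = 0.0
    for beta, (di, dj) in zip(theta, CLIQUE_OFFSETS):
        for sign in (1, -1):                      # both cliques along this direction
            k, l = i + sign * di, j + sign * dj
            if 0 <= k < h and 0 <= l < w:
                u += -beta if x[k, l] == value else beta
    return u

def pseudo_log_likelihood(x, theta, n_levels):
    """PL(x|theta) = sum over sites s of log p(x_s | x_{S\\s}, theta)."""
    pl = 0.0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            energies = np.array([local_energy(x, i, j, z, theta)
                                 for z in range(n_levels)])
            log_z = np.log(np.exp(-energies).sum())   # local normalizing constant
            pl += -local_energy(x, i, j, x[i, j], theta) - log_z
    return pl
```

Only the cliques containing site s contribute to the local conditional, so each factor in (4) is cheap to evaluate even though the global partition function Z(θ) is intractable.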

Since no prior information is available, the prior p(θ) can be assumed to be flat. Then, with (3) and (4), we have
\alpha(\theta', \theta_t) = \min\left\{1, \frac{\exp(PL(x \mid \theta')) \, q(\theta_t \mid \theta')}{\exp(PL(x \mid \theta_t)) \, q(\theta' \mid \theta_t)}\right\}.

The choice of the proposal distribution q(θ'|θ_t) is almost arbitrary. We choose q(θ'|θ_t) as follows. Since the positive gradient direction of PL(x|θ) is (approximately) the fastest ascent direction towards the high-probability regions of p(x|θ), we may choose the new value θ' along the positive gradient direction of PL(x|θ) at the current state θ_t. On the other hand, we must guarantee that the Markov chain constructed by the method is irreducible and converges to the stationary distribution. The following rule is used to sample the new value θ':

1. Given η (0 < η < 1), we choose the positive gradient direction of PL(x|θ) at θ_t with probability η. Denote
Ω(θ) = \{θ' : θ' is a point on the positive gradient direction of PL(x|θ) at θ\}.
2. If the positive gradient direction is chosen, a new value θ' is sampled from Ω(θ_t) whose distance d = ||θ' − θ_t|| from the current value θ_t follows the distribution N(d; 0, 1) I(d > 0); otherwise, a new value θ' is sampled from the normal distribution N(θ'; θ_t, I_4). Here, N(·; μ, Σ) is the normal probability density function with mean μ and covariance Σ, I(·) denotes the indicator function whose value is 1 when its argument is true and 0 otherwise, and I_4 is the 4-dimensional identity matrix. So
q(\theta' \mid \theta_t) = \eta \, I(\theta' \in \Omega(\theta_t)) \, 2N(\lVert \theta' - \theta_t \rVert; 0, 1) + (1 - \eta) \, I(\theta' \notin \Omega(\theta_t)) \, N(\theta'; \theta_t, I_4).   (5)

If the current state is θ', a new value θ'' is proposed by the same rule:
q(\theta'' \mid \theta') = \eta \, I(\theta'' \in \Omega(\theta')) \, 2N(\lVert \theta'' - \theta' \rVert; 0, 1) + (1 - \eta) \, I(\theta'' \notin \Omega(\theta')) \, N(\theta''; \theta', I_4).

Because the measure of Ω(θ') in the total state space of θ is zero, we have P(θ_t ∈ Ω(θ')) = 0 and P(θ_t ∉ Ω(θ')) = 1. Then
q(\theta_t \mid \theta') = (1 - \eta) \, N(\theta_t; \theta', I_4).   (6)

Note that N(θ_t; θ', I_4) = N(θ'; θ_t, I_4), so the acceptance probability α(θ', θ_t) reduces to
\alpha(\theta', \theta_t) = \min\left\{1, \exp(PL(x \mid \theta') - PL(x \mid \theta_t)) \, \frac{(1 - \eta) N(\theta'; \theta_t, I_4)}{\eta I(\theta' \in \Omega(\theta_t)) \, 2N(\lVert \theta' - \theta_t \rVert; 0, 1) + (1 - \eta) I(\theta' \notin \Omega(\theta_t)) \, N(\theta'; \theta_t, I_4)}\right\}
= \min\left\{1, \exp(PL(x \mid \theta') - PL(x \mid \theta_t)) \left[ I(\theta' \in \Omega(\theta_t)) \, \frac{(1 - \eta) N(\theta'; \theta_t, I_4)}{2\eta N(\lVert \theta' - \theta_t \rVert; 0, 1)} + I(\theta' \notin \Omega(\theta_t)) \right] \right\}
= \min\left\{1, \exp(PL(x \mid \theta') - PL(x \mid \theta_t)) \left[ I(\theta' \in \Omega(\theta_t)) \, \frac{1 - \eta}{2\eta (\sqrt{2\pi})^3} + I(\theta' \notin \Omega(\theta_t)) \right] \right\}.


From (5) and (6), the choice of the proposal distribution satisfies the property that q(θ'|θ_t) > 0 if and only if q(θ_t|θ') > 0, so the method leads to an irreducible Markov chain that converges to the stationary distribution. For details on the irreducibility and convergence of Markov chains, we refer the reader to Van Laarhoven and Aarts (1987). The algorithm described above can be summarized as follows.

Algorithm 3.2 (Accelerated algorithm for MRF parameter estimation).
Step 0. Initialize. Choose values θ_0, η and the maximum number of iterations N_1. For t = 1, ..., N_1, do the following steps.
Step 1. Sample a uniform (0, 1) random variable m. If m ≤ η, compute the gradient of PL(x|θ) at θ_t and sample a new value θ' from Ω(θ_t) whose distance d = ||θ' − θ_t|| from θ_t follows the distribution N(d; 0, 1) I(d > 0). Otherwise, sample a new value θ' from N(θ'; θ_t, I_4).
Step 2. If θ' was sampled from Ω(θ_t), accept it with probability
\alpha(\theta', \theta_t) = \min\left\{1, \exp(PL(x \mid \theta') - PL(x \mid \theta_t)) \, \frac{1 - \eta}{2\eta (\sqrt{2\pi})^3}\right\};
otherwise, accept it with probability
\alpha(\theta', \theta_t) = \min\{1, \exp(PL(x \mid \theta') - PL(x \mid \theta_t))\}.
If θ' is accepted, set θ_{t+1} = θ'; otherwise set θ_{t+1} = θ_t.
Step 3. Increment t. If t ≤ N_1, go to Step 1; otherwise, stop.

Because the Metropolis–Hastings algorithm is preferable at low temperature (Winkler, 1995), and experimental evidence shows that in the early iterations the value exp(PL(x|θ') − PL(x|θ_t)) is either very large (when PL(x|θ') > PL(x|θ_t)) or nearly zero (when PL(x|θ') < PL(x|θ_t)), so that the acceptance probability is essentially 0 or 1, we apply a modified simulated annealing before Algorithm 3.2, described as follows.

Algorithm 3.3 (Modified simulated annealing).
Step 0. Initialize. Choose the temperature T_0, θ_0, c and the maximum number of iterations N_2. For t = 1, ..., N_2, do the following steps.
Step 1. A new value θ' is sampled from the proposal distribution N(θ'; θ_t, I_4).
Step 2. θ' is accepted with probability
\alpha(\theta', \theta_t) = \min\left\{1, \exp\!\left(\frac{PL(x \mid \theta') - PL(x \mid \theta_t)}{T_t}\right)\right\}.
If θ' is accepted, set θ_{t+1} = θ'; otherwise set θ_{t+1} = θ_t.
Step 3. Increment t. If t ≥ N_2, stop; otherwise, set T_t = c T_{t-1} and go to Step 1.

In Algorithm 3.3, we choose c such that it satisfies T_0 c^{N_2} = 1. At the final iteration of Algorithm 3.3 the temperature is therefore T_{N_2} = 1, which differs from standard simulated annealing. Since the Metropolis–Hastings algorithm can be regarded as running at temperature T = 1, the switch from Algorithm 3.3 to Algorithm 3.2 at the end of Algorithm 3.3 is appropriate. Another reason for applying Algorithm 3.3 before Algorithm 3.2 is the acceptance ratio. In the early iterations the acceptance ratio is usually high, so we apply Algorithm 3.3 and the computation of the gradient is avoided. After a number of iterations, the state of the Markov chain is close to the optimal point and the acceptance ratio becomes low; we then apply Algorithm 3.2 to increase the acceptance ratio while global convergence is still guaranteed. In conclusion, the algorithm for MRF parameter estimation is summarized as follows.

Algorithm 3.4 (Algorithm for MRF parameter estimation, full form).
Step 1. Choose a value θ_0 and run Algorithm 3.3.
Step 2. Set the final value of Algorithm 3.3 as the θ_0 of Algorithm 3.2 and run Algorithm 3.2.


We take the final M (M ≤ N_1) values θ̃_1, ..., θ̃_M from Step 2 of Algorithm 3.4, and the integral (2) is approximated by
\tilde{E}(\theta \mid x) = \frac{1}{M} \sum_{i=1}^{M} \tilde{\theta}_i.   (7)
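Putting Algorithms 3.2–3.4 and the estimator (7) together, a minimal sketch might look as follows. It assumes a callable log_pl implementing PL(x|θ) (for instance, the pseudo_log_likelihood above with x fixed) and replaces the analytic gradient of PL by a finite-difference approximation; the default constants mirror the values used in Section 4.

```python
import numpy as np

def numerical_grad(f, theta, eps=1e-4):
    # Finite-difference stand-in for the analytic gradient of PL(x|theta).
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        g[k] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

def estimate_parameters(log_pl, theta0, rng, n2=100, t0=1000.0, n1=300,
                        eta=0.4, n_keep=100):
    theta = np.asarray(theta0, dtype=float)
    dim = theta.size
    const = (1 - eta) / (2 * eta * (2 * np.pi) ** 1.5)  # (1-eta)/(2*eta*(sqrt(2*pi))^3)

    # Algorithm 3.3: modified simulated annealing ending at temperature T = 1.
    c = t0 ** (-1.0 / n2)                # so that T0 * c**N2 = 1
    temp = t0
    for _ in range(n2):
        prop = theta + rng.standard_normal(dim)
        log_a = (log_pl(prop) - log_pl(theta)) / temp
        if np.log(rng.uniform()) < min(0.0, log_a):
            theta = prop
        temp *= c

    # Algorithm 3.2: accelerated Metropolis-Hastings.
    samples = []
    for _ in range(n1):
        if rng.uniform() <= eta:
            g = numerical_grad(log_pl, theta)
            direction = g / (np.linalg.norm(g) + 1e-12)
            d = abs(rng.standard_normal())           # N(d; 0, 1) restricted to d > 0
            prop = theta + d * direction             # point on the positive gradient ray
            log_a = log_pl(prop) - log_pl(theta) + np.log(const)
        else:
            prop = theta + rng.standard_normal(dim)
            log_a = log_pl(prop) - log_pl(theta)
        if np.log(rng.uniform()) < min(0.0, log_a):
            theta = prop
        samples.append(theta.copy())

    # Estimator (7): average of the last n_keep iterates.
    return np.mean(samples[-n_keep:], axis=0)
```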

Fig. 3. Some texture realizations.

Table 1
Specified and estimated parameters of MRF

Figure        Method           β1       β2       β3       β4
Texture (a)   Specified        1        1        −0.5     −0.5
              MCMC             0.988    0.990    −0.508   −0.508
              Algorithm 3.4    1.004    0.998    −0.511   −0.489
Texture (b)   Specified        1        −0.8     0.5      −0.5
              MCMC             1.009    −0.859   0.557    −0.452
              Algorithm 3.4    0.979    −0.791   0.503    −0.487
Texture (c)   Specified        0.3      0.3      0.3      0.3
              MCMC             0.348    0.276    0.296    0.288
              LS               0.326    0.308    0.265    0.307
              Algorithm 3.4    0.290    0.294    0.296    0.320
Texture (d)   Specified        0.5      1        −0.5     0.7
              MCMC             0.553    1.039    −0.595   0.681
              Algorithm 3.4    0.470    0.995    −0.506   0.712
Texture (e)   Specified        1.0      1.0      −1.0     1.0
              LS               0.752    0.805    −0.762   0.909
              Algorithm 3.4    1.043    1.016    −1.023   1.018
Texture (f)   Specified        2.0      2.0      −1.0     −1.0
              LS               1.728    1.785    −0.856   −0.851
              Algorithm 3.4    1.970    1.971    −0.995   −0.981


4. Experiments

In this section, we present several examples of MRF parameter estimation on realizations that are generated with specified parameter values. Fig. 3 presents four 128 × 128 textures (a)–(d) and two 256 × 256 textures (e)–(f) with the specified parameters shown in Table 1. The first four textures come from Wang et al. (2000) and the last two are from Derin and Elliott (1987) (texture (c) is also discussed in (Derin and Elliott, 1987)). The first two textures are sampled with two grey levels and the remaining four with four grey levels. All of the textures are generated using the Gibbs sampler presented by Geman and Geman (1984).

In our experiments, we randomly chose four different initial values θ_0^1 = (2, 1, 1, 2), θ_0^2 = (4, 4, 6, 6), θ_0^3 = (6, 6, 8, 7) and θ_0^4 = (10, 10, 12, 12). The required iteration number depends on the initial value, and deciding on the iteration number is an important practical matter. We choose the iteration number such that the solutions from the four initial values agree adequately. In our runs, the iteration numbers N_1 = 300 and N_2 = 100 were sufficient. The initial temperature T_0 of Algorithm 3.3 is set to 1000, so c ≈ 0.933 since it satisfies T_0 c^{N_2} = 1. The probability constant η of Algorithm 3.2 is set to 0.4. After the iterations are completed, the final 100 values are used to compute (7). Fig. 4 presents the Markov chain for estimating the parameters of texture (d). The numerical results are presented in Table 1, together with the corresponding results of the MCMC approach (Wang et al., 2000) and the LS method (Derin and Elliott, 1987), which are included for comparison.
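The texture realizations themselves can be drawn with a Gibbs sampler of the kind cited above. The sketch below is one straightforward version; the local_energy helper and the clique-direction/β pairing are the same illustrative assumptions used in Section 3, and the lattice size and sweep count are chosen arbitrarily to keep the example small.

```python
import numpy as np

CLIQUE_OFFSETS = [(0, 1), (1, 0), (1, 1), (1, -1)]  # assumed pairing with beta_1..beta_4

def local_energy(x, i, j, value, theta):
    h, w = x.shape
    u = 0.0
    for beta, (di, dj) in zip(theta, CLIQUE_OFFSETS):
        for sign in (1, -1):
            k, l = i + sign * di, j + sign * dj
            if 0 <= k < h and 0 <= l < w:
                u += -beta if x[k, l] == value else beta
    return u

def gibbs_sample(shape, theta, n_levels, n_sweeps, rng):
    """Draw an approximate sample from p(x; theta) by repeatedly resampling
    each site from its local conditional distribution."""
    x = rng.integers(0, n_levels, size=shape)
    for _ in range(n_sweeps):
        for i in range(shape[0]):
            for j in range(shape[1]):
                e = np.array([local_energy(x, i, j, z, theta)
                              for z in range(n_levels)])
                p = np.exp(-(e - e.min()))           # local conditional, unnormalized
                x[i, j] = rng.choice(n_levels, p=p / p.sum())
    return x

# Example: a small binary realization with the parameters of texture (a) in Table 1.
rng = np.random.default_rng(0)
x = gibbs_sample((32, 32), theta=(1.0, 1.0, -0.5, -0.5),
                 n_levels=2, n_sweeps=50, rng=rng)
```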

Fig. 4. Markov chain for texture (d).


Table 2
The CPU time of MCMC and Algorithm 3.4 (s)

Figure        Method           θ_0^1    θ_0^2    θ_0^3    θ_0^4
Texture (a)   MCMC             23.2     23.2     23.2     23.2
              Algorithm 3.4    8.9      9.1      9.1      9.2
Texture (b)   MCMC             23.2     23.2     23.2     23.2
              Algorithm 3.4    9.0      9.0      9.1      9.3
Texture (c)   MCMC             40.6     40.5     40.7     40.6
              Algorithm 3.4    15.6     15.5     15.7     15.7
Texture (d)   MCMC             40.4     40.5     40.5     40.5
              Algorithm 3.4    15.7     15.8     15.8     15.9
Texture (e)   MCMC             161.0    160.9    161.0    161.0
              Algorithm 3.4    62.2     61.9     62.7     62.5
Texture (f)   MCMC             162.5    162.7    162.5    162.8
              Algorithm 3.4    63.6     65.0     64.0     67.3

From Table 1, it can be seen that LS is not accurate, especially for textures (e)–(f). Compared with the MCMC method, our numerical results are better or at least no worse. On the other hand, during the iterations it is not necessary to compute the gradient of PL(x|θ) at every step: the gradient only needs to be recomputed when a new value is accepted. In running our algorithms on these six textures, the numbers of gradient computations were only 28, 23, 11, 18, 13, 45 for the initial value (10, 10, 12, 12) and 15, 19, 9, 7, 15, 10 for the initial value (6, 6, 8, 7), respectively (the results for the other two initial values are smaller or similar), while the total number of iterations is 400, compared with 1000 for the MCMC approach (Wang et al., 2000), where the proposal distribution is a normal distribution centered on the current state θ_t. The total computational requirements of our algorithms are therefore decreased considerably. All computations were performed on a Pentium 4 1.7 GHz PC with Visual C++ 6.0. The CPU times of MCMC and Algorithm 3.4 are shown in Table 2.

5. Concluding remarks

MRF modelling is a popular method for pattern recognition and computer vision, and MRF parameter estimation is of particular importance to MRF modelling. In this paper, a new method for estimating MRF parameters has been presented, based on the Metropolis–Hastings algorithm and the gradient method. With the properly chosen proposal distribution, the Markov chain constructed by the method converges quickly to its stationary distribution, so the computational requirements are decreased considerably.

Acknowledgement This work was supported by the National Natural Science Foundation of China.

References

Besag, J., 1974. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. B 36, 192–225.
Derin, H., Elliott, H., 1987. Modeling and segmentation of noisy and textured images using Gibbs random fields. IEEE Trans. Pattern Anal. Machine Intell. 9 (1), 39–55.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6 (6), 721–741.
Giordana, N., Pieczynski, W., 1997. Estimation of generalized multisensor hidden Markov chains and unsupervised image segmentation. IEEE Trans. Pattern Anal. Machine Intell. 19 (5), 465–475.

Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109.
Liu, J.S., 2001. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, New York.
Metropolis, N., Rosenbluth, A.W., Teller, A.H., Teller, E., 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1091.


Van Laarhoven, P.J., Aarts, E.H., 1987. Simulated Annealing: Theory and Applications. Reidel, Dordrecht, Holland.
Wang, L., Liu, J., Li, S.Z., 2000. MRF parameter estimation by MCMC method. Pattern Recognition 33, 1919–1925.
Winkler, G., 1995. Image Analysis, Random Fields and Dynamic Monte Carlo Methods. Springer-Verlag, Berlin.