Unsupervised image segmentation using Markov random field models

Pattern Recognition 33 (2000) 587–602

S.A. Barker*, P.J.W. Rayner
Signal Processing and Communications Group, Cambridge University Engineering Department, Cambridge CB2 1PZ, UK
Received 15 March 1999
* Corresponding author. E-mail address: [email protected] (S.A. Barker)

Abstract

We present two unsupervised segmentation algorithms based on hierarchical Markov random field models for segmenting both noisy images and textured images. Each algorithm finds the most likely number of classes, their associated model parameters and generates a corresponding segmentation of the image into these classes. This is achieved according to the maximum a posteriori criterion. To facilitate this, an MCMC algorithm is formulated to allow the direct sampling of all the above parameters from the posterior distribution of the image. To allow the number of classes to be sampled, a reversible jump is incorporated into the Markov chain. Experimental results are presented showing rapid convergence of the algorithm to accurate solutions. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Markov random field; Unsupervised segmentation; Reversible jump; Markov chain; Monte Carlo; Simulated annealing

1. Introduction

The segmentation of noisy or textured images into a number of different regions comprises a difficult optimisation problem. This is compounded when the number of regions into which the image is to be segmented is also unknown. If each region within an image is described by a different model distribution then the observed image may be viewed as a realisation from a map of these model distributions. This underlying map divides the image into regions which are labelled with different classes. Image segmentation can therefore be treated as an incomplete data problem [1], in which the intensity data is observed, the class map is missing and the model parameters associated with each class need to be estimated. In the unsupervised case, the number of model classes is also unknown.

The unsupervised segmentation problem has been approached by several authors. Of these, most propose


algorithms comprising two steps [2–4]. The image is assumed to be composed of an unknown number of regions, each modelled as an individual Markov random field. The first of these steps is a coarse segmentation of the image into the most 'likely' number of regions. This is achieved by dividing the image into windows, calculating features or estimating model parameters, then using a measure to combine closely related windows. The resulting segmentation is then used to estimate model parameters for each of the classes, before a supervised high-resolution segmentation is carried out via some form of relaxation algorithm.

A similar methodology is used by Geman et al. [5], but the measure used, the Kolmogorov–Smirnov distance, is a direct measure of similarity of the distributions of grayscale values (in the form of histograms) between adjacent windows. Windows are then combined into a single region if the distance between their distributions is relatively small. A variant on this algorithm [6] is based on the same distance function, but the distribution of grayscales in each window is compared with the distribution functions of the samples comprising each class over the complete image. If the distribution of one class is


found to be close enough to that of the window then it is designated as being a member of that class. Otherwise, a new outlier class is created. When the field stabilises, usually after several iterations, new classes are created from the outliers if they constitute more than one percent of the image. If not, the algorithm is re-run.

A split and merge algorithm is proposed by Panjwani and Healey [7]. The image is initially split into large square windows, but these are then re-split to form four smaller windows if a uniformity test for each window is not met. The process ends when windows as small as 4×4 pixels are reached. The windows are then merged to form regions using a distance measure based on the pseudo-likelihood.

Won and Derin [8] obtain segmentations and parameter estimates by alternately sampling the label field and calculating maximum pseudo-likelihood estimates of the parameter values. The process is repeated over differing numbers of label classes and the resulting estimates are applied to a model fitting criterion to select the optimum number of classes and hence the image segmentation. The criterion used compensates the likelihood of the optimised model with a penalty term that offsets image size against the number of independent model parameters used. The penalty term and its associated parameter values were selected arbitrarily. This method of exhaustive search over a varying number of classes was developed further by Langan et al. [9]. Here, an EM algorithm is first used to estimate parameters while alternately segmenting the image. To select between the resulting optimisations of the differing models, the function of increasing likelihood against increasing model order is fitted to a rising exponential model. The exponential model parameters are selected in a least-squares sense and the optimum model order is then found at a pre-specified knee point in the exponential curve.

The approach to unsupervised segmentation presented here comprises a Markov chain Monte Carlo (MCMC) algorithm to sample from the posterior distribution so that simulated annealing may be used to estimate the MAP solution. This methodology is similar to that used in [10] to segment an image using a known MRF model. Here the method is extended so that the sampling scheme, and hence the MAP estimate, is not just over the segmentation of the image into classes but also the number of classes and their respective model parameters. The algorithm differs from those reviewed in that no windowing is required to estimate region model parameters. The algorithm's MCMC methodology removes the necessity for an exhaustive search over a subsection of the parameter space. This ensures an improvement in efficiency over algorithms that require separate optimisations to be carried out for each model before a model order selection is made.

The remainder of this paper is divided as follows: Section 2 specifies the image models used throughout the paper. The posterior distributions for the noisy image and texture models are derived in Sections 2.1 and 2.2. Section 3 describes the algorithms employed to sample from these distributions. The segmentation process, or allocation of class labels to pixel sites, is given, as is the sampling scheme for noise and MRF model parameters from their conditional densities. The method by which reversible jumps are incorporated into Markov chains, to enable sampling of the number of classes into which an image might be segmented, is then described. This process is then detailed for both noisy and textured image models. Experimental results for the resulting algorithms are presented in Section 4 and the paper is concluded in Section 5.

2. Image models

Let Ω denote an M×N lattice indexed by (i, j) so that Ω = {(i, j); 1 ≤ i ≤ M, 1 ≤ j ≤ N}. Let Y = {Y_s = y_s; s ∈ Ω} be the observed grayscale image, where pixels take values from the interval (0, 1]. Then let X = {X_s = x_s; s ∈ Ω} correspond to the labels of the underlying Markov random field, which take values from Λ = {0, 1, ..., k−1}. If g_s defines a neighbourhood at site s, then let the vector of labels comprising that neighbourhood be x_{g_s}. Similarly, let o_s define a second neighbourhood structure at site s, but let this be defined on the observed image Y, so that y_{o_s} is the vector of pixel grayscale values over that neighbourhood. Finally, let all model parameters be included in the parameter vector Ψ.

If a Gibbs distribution is used to model the likelihood of observing the image Y given the label field X, and to model all a priori knowledge of spatial correlations within the image and label fields, then the conditional density for an observed pixel grayscale value and class label at site s, given the two neighbourhoods, is

p(Y_s = y_s, X_s = x_s \mid y_{o_s}, x_{g_s}, \Psi) = \frac{1}{Z(y_{o_s}, x_{g_s}, \Psi)} \, e^{-U(Y_s = y_s, X_s = x_s \mid y_{o_s}, x_{g_s}, \Psi)},   (1)

where U(·) is the energy function and Z(·) is the normalising function of the conditional distribution. If we divide the parameter vector so that Ψ = [φ_c; {c ∈ Λ}, γ], where φ_c corresponds to a vector of model parameters defining the likelihood of observing the pixel value y_s, given its neighbourhood and label, and γ corresponds to a vector of hyper-parameters defining the prior on the label field X, then the conditional distribution may be factorised so that

p(Y_s = y_s, X_s = x_s \mid y_{o_s}, x_{g_s}, \Psi) = \frac{1}{Z(y_{o_s}, \phi_{x_s}, x_{g_s}, \gamma)} \, e^{-U_1(Y_s = y_s \mid y_{o_s}, X_s = x_s, \phi_{x_s}) - U_2(X_s = x_s \mid x_{g_s}, \gamma)}.   (2)

When considering the complete Gibbs distribution for the entire image, the partition function (or normalising function) becomes far too complex to evaluate, making it unfeasible to compare the relative probabilities of two different MRF realisations. An approximation to the Gibbs distribution that allows an absolute probability for an MRF realisation to be calculated is the Pseudo-Likelihood, introduced by Besag [11]. The Pseudo-Likelihood is simply the product of the full conditionals, each given by Eq. (2), over the complete image Ω:

PL(Y = y, X = x \mid \Psi) = \prod_{s \in \Omega} \frac{\exp\{-U_1(Y_s = y_s \mid y_{o_s}, X_s = x_s, \phi_{x_s})\}}{Z(\phi_{x_s})} \times \frac{\exp\{-\sum_{s \in \Omega} U_2(X_s = x_s \mid x_{g_s}, \gamma)\}}{\prod_{s \in \Omega} \sum_{c \in \Lambda} \exp\{-U_2(X_s = c \mid x_{g_s}, \gamma)\}},   (3)

where Z(φ_c) is the normalising constant of the likelihood distribution for the observed pixel value given its neighbourhood and label. By applying Bayes law, an approximation to the posterior distribution for the MRF image model may be formed from the Pseudo-Likelihood. To make this a function of the model order (or number of label classes) k, proper priors must be defined for all the model parameters. The distribution will then be given by

p(X = x, \Psi, k \mid Y = y) \propto PL(Y = y, X = x \mid \Psi)\, p_r(k)\, p_r(\gamma) \prod_{c=0}^{k-1} p_r(\phi_c),   (4)

where p_r(·) are the prior distributions. It is possible to incorporate various information criteria [8,9] into the posterior distribution by adding compensatory terms to the prior on model order k.

The Isotropic and Gaussian MRF models used as the basis for segmentation algorithms throughout the remainder of this paper both take Potts models as their prior distribution on the label field X. The differences between the two models occur in their modelling of the likelihood of observing pixel grayscale values given the label field. The principal difference comprises the lack of conditioning on neighbouring pixel grayscale values in the Isotropic model. The two models are described in more detail in the following two subsections.

2.1. The Isotropic Markov random field model

The Isotropic MRF model is used to model an image consisting of regions of constant but different grayscales, corrupted with an i.i.d. noise process. For each pixel, the likelihood of its grayscale value given its underlying label is given by a Gaussian distribution whose parameters are dependent on the label class. Hence, the grayscale values of the pixels comprising a region labelled as a single class c may be considered a realisation of an i.i.d. Gaussian noise process whose parameter vector is given by φ_c = [μ_c, σ_c].

The Potts model chosen to model a priori knowledge of spatial correlations within the label field incorporates potential functions defined using both singular and pairwise cliques on a nearest-neighbour type neighbourhood. If the hyper-parameter vector is γ = [α_c; {c ∈ Λ}, β] then the approximation to the posterior density given in Eq. (4) may be written

p(X = x, \Psi, k \mid Y = y) \propto \prod_{s \in \Omega} \frac{1}{\sqrt{2\pi\sigma_{x_s}^2}} \exp\Big\{-\frac{1}{2\sigma_{x_s}^2}(y_s - \mu_{x_s})^2\Big\} \times \frac{\exp\{-\sum_{s \in \Omega} (\alpha_{x_s} + \beta V(x_s, x_{g_s}))\}}{\prod_{s \in \Omega} \sum_{c \in \Lambda} \exp\{-(\alpha_c + \beta V(c, x_{g_s}))\}} \times p_r(\beta)\, p_r(k) \prod_{c \in \Lambda} p_r(\mu_c)\, p_r(\sigma_c)\, p_r(\alpha_c),   (5)

where V(c, x_{g_s}) is the potential function at site s when it is allocated to class c. Throughout this paper the potential function is defined as V(c, x_{g_s}) = \frac{1}{4} \sum_{t \in g_s} (c \ominus x_t), where ⊖ is an operator defined to take the value −1 if its arguments are equal and +1 otherwise. Non-informative reference priors are chosen for the noise model parameters to ensure that the observed intensity data dominates any prior information incorporated into the model. Uniform priors are selected for {μ_c, α_c; c ∈ Λ}, β and k. For {σ_c, ∀c ∈ Λ} the reference priors may be found using Jeffreys' formula for non-informative priors [12]. Normally these priors are improper, but here their range is restricted to allow normalisation and ensure that the criterion for model selection, described later, is valid.
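As a concrete illustration of the local terms in Eq. (5), the following Python/NumPy sketch evaluates the Potts potential V(c, x_{g_s}) and the un-normalised log contribution of a single pixel under the Isotropic model. It is an illustrative sketch only, not the authors' implementation; the function names and the four-neighbour layout are assumptions.

```python
import numpy as np

def potts_potential(c, neighbour_labels):
    """V(c, x_gs) = (1/4) * sum_t (c (-) x_t), where (-) is -1 for equal labels, +1 otherwise."""
    neighbour_labels = np.asarray(neighbour_labels)
    signs = np.where(neighbour_labels == c, -1.0, 1.0)
    return 0.25 * signs.sum()

def isotropic_local_log_term(y_s, c, neighbour_labels, mu, sigma, alpha, beta):
    """Un-normalised log of the pixel-level factor in Eq. (5) for label c:
    Gaussian log-likelihood of y_s plus the negated prior energy alpha_c + beta*V."""
    log_lik = -0.5 * np.log(2.0 * np.pi * sigma[c] ** 2) \
              - 0.5 * ((y_s - mu[c]) / sigma[c]) ** 2
    log_prior = -(alpha[c] + beta * potts_potential(c, neighbour_labels))
    return log_lik + log_prior

# Example: a pixel with grayscale 0.62 whose four neighbours carry labels 1, 1, 0, 1.
mu = np.array([0.2, 0.6]); sigma = np.array([0.05, 0.05]); alpha = np.zeros(2)
print(isotropic_local_log_term(0.62, 1, [1, 1, 0, 1], mu, sigma, alpha, beta=1.5))
```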

2.2. The Gaussian Markov random field model

The Gaussian MRF (GMRF) is commonly used to model images comprising textures characterised by their spatial correlation. The image model used in this paper is hierarchical, with individual textures modelled by GMRFs and the interaction between the regions comprising these textures by a Potts model (as was used in the Isotropic case).

The cth GMRF may be parameterised by its covariance matrix Σ_c and mean vector μ_c. If Y_c is the pixel grayscale vector and its dimension is N_c, then this leads to a joint distribution

p(Y_c = y_c \mid \mu_c, \Sigma_c) = \frac{\exp\{-\frac{1}{2}[y_c - \mu_c]^T \Sigma_c^{-1} [y_c - \mu_c]\}}{(2\pi)^{N_c/2} |\Sigma_c|^{1/2}}.   (6)

If we assign a neighbourhood to the pixel located at site s and denote it o_s, then the above process may be written in terms of a non-causal AR representation with correlation coefficients θ_c^{(q)},

y_s = \sum_{q: s+q \in o_s} \theta_c^{(q)} y_{s+q} + e_s,   (7)

where e_s is a zero-mean Gaussian noise process with autocorrelation given by

E[e_s e_{s+q}] = \begin{cases} \sigma_c^2, & q = 0, \\ -\theta_c^{(q)} \sigma_c^2, & s + q \in o_s, \\ 0, & \text{otherwise}. \end{cases}   (8)

It is possible, with no loss of generality, to halve the number of correlation parameters by making θ_c^{(q)} = θ_c^{(−q)}. Letting Θ_c comprise the vector of correlation parameters for the GMRF labelled as c, the conditional distribution for a single pixel may be written

p(y_s \mid \mu_c, \sigma_c, \Theta_c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\Big\{-\frac{1}{2\sigma_c^2}\Big((y_s - \mu_c) - \sum_{q: s+q \in o_s} \theta_c^{(q)} [(y_{s+q} - \mu_c) + (y_{s-q} - \mu_c)]\Big)^2\Big\}.   (9)
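The single-pixel GMRF density of Eq. (9) can be evaluated directly once the half-neighbourhood offsets and their correlation coefficients are fixed. The sketch below is illustrative only; the offset list, parameter values and function name are assumptions rather than anything specified in the paper.

```python
import numpy as np

def gmrf_pixel_log_density(img, s, offsets, theta, mu_c, sigma_c):
    """Log of Eq. (9): Gaussian density of the centred pixel minus the symmetric
    AR prediction from its neighbourhood. `offsets` holds one representative q per
    symmetric pair; `theta` holds the matching correlation coefficients theta_c^(q)."""
    i, j = s
    resid = img[i, j] - mu_c
    for (di, dj), th in zip(offsets, theta):
        resid -= th * ((img[i + di, j + dj] - mu_c) + (img[i - di, j - dj] - mu_c))
    return -0.5 * np.log(2.0 * np.pi * sigma_c ** 2) - resid ** 2 / (2.0 * sigma_c ** 2)

# Example: a first-order (two-offset) neighbourhood on a small random image.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
offsets = [(0, 1), (1, 0)]          # east/west and north/south pairs
theta = np.array([0.2, 0.15])       # hypothetical theta_c^(q) values
print(gmrf_pixel_log_density(img, (4, 4), offsets, theta, mu_c=0.5, sigma_c=0.1))
```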

The interaction between regions is described by a Potts model. As with the Isotropic MRF, this models a priori knowledge of spatial correlations within the label field. The Potts model chosen here is identical to that used in the Isotropic case (given in the previous section) except that there are no single clique parameters. These are omitted to reduce the complexity of the model order sampling step (see Section 3.2) and to weaken the prior on X. Hence, the hyper-parameter vector simply consists of one term, β. The posterior distribution for the complete hierarchical image model may now be approximated using the expression given in Eq. (4) and the conditional distributions given by Eq. (9):

p(X = x, \Psi, k \mid Y = y) \propto \prod_{s \in \Omega} p(y_s \mid \mu_{x_s}, \sigma_{x_s}, \Theta_{x_s}) \times \frac{\exp\{-\sum_{s \in \Omega} \beta V(x_s, x_{g_s})\}}{\prod_{s \in \Omega} \sum_{c \in \Lambda} \exp\{-\beta V(c, x_{g_s})\}} \times p_r(\beta)\, p_r(k) \prod_{c \in \Lambda} p_r(\mu_c)\, p_r(\sigma_c)\, p_r(\Theta_c),   (10)

where V(c, x_{g_s}) is the potential function as defined in the previous section. Priors for all parameters are as defined earlier excepting the θ^{(q)} parameters. These are assigned conjugate priors for reasons given later, in Section 3.2. The priors will therefore consist of a family of normal distributions, N(0, λ²), with λ being a common hyper-parameter.

3. MCMC sampling from the posterior distribution

The image segmentation, parameter estimates and model order are all estimated to maximise the a posteriori probability (known as the MAP criterion) of the model given the observed image. This comprises an optimisation problem over the MRF's label map, parameter space and model order. The approach adopted here was developed by Geman and Geman [10]: to construct a Markov chain to sample from the target distribution and then perform a process of stochastic relaxation (i.e. simulated annealing). Various shortcuts, for example the iterative conditional modes algorithm [13], using the EM algorithm to estimate model parameters [14], or MAP-based parameter estimation algorithms [8], have not been adopted in this paper. This is because when sampling model order (using the reversible jump algorithms described in Sections 3.1 and 3.2) jumps to lower probability areas of the model space occur frequently and hence, to adopt any deterministic elements in the algorithm would appear inconsistent.

The sampling scheme in this paper is based on the Gibbs sampler [10], but Metropolis–Hastings sub-chains [15] are incorporated to enable the model parameters and number of classes to be sampled. The sampling process follows a predetermined sequential scan, updating the pixel sites and various model parameters in a specific order. The scan consists of the following sampling processes:

(i) re-segment the image,
(ii) sample noise model parameters,
(iii) sample MRF model parameters,
(iv) sample the number of classes.

The first three steps are relatively straightforward. All labels and parameters are sampled from their conditional distributions. The first step consists of Gibbs sampling the label field; a minimal sketch of one such sweep follows Eq. (12). The conditional distributions may be found by applying Bayes law to Eq. (5). The resulting distribution for the Isotropic MRF described in Section 2 is given by

p(x_s = c \mid y_s, x_{g_s}, \mu_c, \sigma_c, \alpha_c, \beta) \propto \frac{1}{\sqrt{2\pi\sigma_c^2 T_t}} \exp\Big\{-\frac{1}{T_t}\Big[\frac{1}{2}\Big(\frac{y_s - \mu_c}{\sigma_c}\Big)^2 + (\alpha_c + \beta V(c, x_{g_s}))\Big]\Big\},   (11)

where T_t is the annealing temperature at iteration t of the algorithm. The distribution for the GMRF is similar,

p(x_s = c \mid y_s, y_{o_s}, x_{g_s}, \mu_c, \sigma_c, \Theta_c, \beta) \propto \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\Big\{-\frac{1}{T_t}\Big[\frac{1}{2\sigma_c^2}\Big((y_s - \mu_c) - \sum_{q: s+q \in o_s} \theta_c^{(q)} [(y_{s+q} - \mu_c) + (y_{s-q} - \mu_c)]\Big)^2 + \beta V(c, x_{g_s})\Big]\Big\}.   (12)
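A minimal sketch of the label-field Gibbs sweep implied by Eq. (11) is given below. It is not the authors' code: the raster-scan order, boundary handling and parameter values are assumptions, and the Potts potential uses the four nearest neighbours with no single-clique terms omitted.

```python
import numpy as np

def gibbs_sweep_isotropic(labels, img, mu, sigma, alpha, beta, T, rng):
    """One raster-scan Gibbs sweep of the label field under Eq. (11).
    For each site the tempered conditional over all k classes is normalised
    and a new label is drawn; boundary sites use the available neighbours only."""
    M, N = labels.shape
    k = len(mu)
    for i in range(M):
        for j in range(N):
            nbrs = np.asarray([labels[p, q]
                               for p, q in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                               if 0 <= p < M and 0 <= q < N])
            log_p = np.empty(k)
            for c in range(k):
                V = 0.25 * np.where(nbrs == c, -1.0, 1.0).sum()   # Potts potential
                log_p[c] = (-0.5 * np.log(2 * np.pi * sigma[c] ** 2 * T)
                            - (0.5 * ((img[i, j] - mu[c]) / sigma[c]) ** 2
                               + alpha[c] + beta * V) / T)
            p = np.exp(log_p - log_p.max())
            labels[i, j] = rng.choice(k, p=p / p.sum())
    return labels

# Example usage on a toy 16x16 image with two classes.
rng = np.random.default_rng(1)
img = rng.random((16, 16))
labels = rng.integers(0, 2, size=(16, 16))
mu = np.array([0.3, 0.7]); sigma = np.array([0.1, 0.1]); alpha = np.zeros(2)
labels = gibbs_sweep_isotropic(labels, img, mu, sigma, alpha, beta=1.5, T=1.0, rng=rng)
```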

Metropolis–Hastings sampling is used to update noise and MRF model parameters. The proposal densities used are zero-mean Gaussian with variances dependent on the parameter being sampled. The conditional distributions for the noise model parameters are found by multiplying the appropriate pseudo-likelihood terms by the model parameter priors. As specified in Sections 2.1 and 2.2, for both models, non-informative priors are used for the μ and σ parameters. However, conjugate priors are used for the θ^{(q)} parameters of the GMRF. The resulting conditional distribution for the Isotropic case is

p(\mu_c, \sigma_c \mid Y, X) \propto \prod_{s: x_s = c} p(y_s \mid \mu_c, \sigma_c)\, p_r(\mu_c)\, p_r(\sigma_c) = \frac{1}{\sigma_c (2\pi\sigma_c^2 T_t)^{N_c/2}} \exp\Big\{-\sum_{s: x_s = c} \frac{1}{2 T_t}\Big(\frac{y_s - \mu_c}{\sigma_c}\Big)^2\Big\},   (13)

where N_c = #(s : x_s = c). For the cth GMRF texture model, the model parameter conditional distribution is given by

p(\mu_c, \sigma_c, \Theta_c \mid Y, X) \propto \prod_{s: x_s = c} p(y_s \mid \mu_c, \sigma_c, \Theta_c)\, p_r(\mu_c)\, p_r(\sigma_c)\, p_r(\Theta_c)
 = \frac{1}{\sigma_c (2\pi\sigma_c^2 T_t)^{N_c/2}} \exp\Big\{-\frac{1}{2 T_t \sigma_c^2} \sum_{s: x_s = c} \Big((y_s - \mu_c) - \sum_{q: s+q \in o_s} \theta_c^{(q)} [(y_{s+q} - \mu_c) + (y_{s-q} - \mu_c)]\Big)^2\Big\}.   (14)
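The following sketch illustrates one random-walk Metropolis–Hastings update of (μ_c, σ_c) against the tempered conditional of Eq. (13), assuming a uniform prior on μ_c and a Jeffreys 1/σ_c prior. The proposal step sizes are arbitrary and the function names are hypothetical.

```python
import numpy as np

def log_cond_mu_sigma(y_c, mu, sigma, T):
    """Log of Eq. (13) up to a constant: Jeffreys prior 1/sigma times the
    tempered Gaussian likelihood of the pixels y_c currently labelled c."""
    if sigma <= 0.0:
        return -np.inf
    n = y_c.size
    return (-np.log(sigma) - 0.5 * n * np.log(2 * np.pi * sigma ** 2 * T)
            - np.sum(((y_c - mu) / sigma) ** 2) / (2 * T))

def mh_update_mu_sigma(y_c, mu, sigma, T, rng, step_mu=0.02, step_sigma=0.01):
    """One random-walk Metropolis-Hastings update of (mu_c, sigma_c)."""
    mu_p = mu + step_mu * rng.standard_normal()
    sigma_p = sigma + step_sigma * rng.standard_normal()
    log_ratio = log_cond_mu_sigma(y_c, mu_p, sigma_p, T) - log_cond_mu_sigma(y_c, mu, sigma, T)
    if np.log(rng.random()) < log_ratio:
        return mu_p, sigma_p
    return mu, sigma

# Example: pixels of one class drawn around 0.6.
rng = np.random.default_rng(2)
y_c = 0.6 + 0.05 * rng.standard_normal(500)
mu, sigma = 0.5, 0.1
for _ in range(200):
    mu, sigma = mh_update_mu_sigma(y_c, mu, sigma, T=1.0, rng=rng)
print(mu, sigma)
```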

To sample the hyper-parameters for the prior on the label field, the Metropolis–Hastings algorithm is used to sample an approximation to the full conditional distribution. The approximation used is again the pseudo-likelihood. The conditional distributions for the external field parameters {α_c, c ∈ Λ} used in the Isotropic model take uniform priors and are given by

p(\alpha_c, c \in \Lambda \mid X) = p(X \mid \alpha_c, c \in \Lambda)\, p_r(\alpha_c, c \in \Lambda) = \prod_{(c \in \Lambda, \forall x_g)} \Bigg(\frac{\exp\{-\frac{1}{T_t}[\alpha_c + \beta V(c, x_g)]\}}{\sum_{i \in \Lambda} \exp\{-\frac{1}{T_t}[\alpha_i + \beta V(i, x_g)]\}}\Bigg)^{N_{(c, x_g)}},   (15)

where N_{(c, x_g)} = #(s : x_s = c, x_{g_s} = x_g). It is theoretically possible to sample the β parameter from its conditional distribution

p(\beta \mid X) = p(X \mid \beta)\, p_r(\beta) = \prod_{(c \in \Lambda, \forall x_g)} \Bigg(\frac{\exp\{-\frac{1}{T_t}\beta V(c, x_g)\}}{\sum_{i \in \Lambda} \exp\{-\frac{1}{T_t}[\alpha_i + \beta V(i, x_g)]\}}\Bigg)^{N_{(c, x_g)}}.   (16)

Unfortunately, under particular underlying label map configurations the posterior density for β will not be proper:

\lim_{\beta \to -\infty} p(X = x', \Psi \mid Y = y) = d,   (17)

where d is a positive constant. This is a direct consequence of approximating the likelihood by the pseudo-likelihood and results in estimates for β following a random walk toward −∞. It may be possible to limit this problem by choosing a suitable prior, for example a Normal distribution centred on a rough estimate of β. This will be discussed in more detail in Section 4.

To sample the model order, reversible jumps are incorporated into the Markov chain. Reversible jumps were developed by Green [16] to allow a Metropolis–Hastings based algorithm to sample the model order, i.e. the number of classes, from the posterior distribution in a typical incomplete data problem [1]. The algorithm is designed to preserve detailed balance, thus ensuring the ergodicity of the Markov chain. For a Metropolis–Hastings sampling algorithm to maintain detailed balance, the old and new model spaces must be identical. But here a change of model order occurs, making the dimension of the model's parameter vector either increase or decrease, depending on whether the number of classes has risen or fallen. Such a difference in dimensionality means the model spaces are different. This is overcome by padding the parameter vectors with random variables, thus equalising their dimensions. The parameter spaces are still different and so, to allow for this, the mapping functions between old and new parameters must be such that the Radon–Nikodym derivative [17] of one model's measure with respect to the other is non-zero.


For example, if two models are considered, labelled m1 and m2, and Ψ^{(1)} and Ψ^{(2)} are their associated parameter vectors of dimension n_1 and n_2 respectively, then to jump between their parameter spaces requires the generation of random vectors u^{(1)} and u^{(2)} such that d(Ψ^{(1)}) + d(u^{(1)}) = d(Ψ^{(2)}) + d(u^{(2)}), where d(·) indicates the dimension of the enclosed vector. A mapping must also be defined so that [Ψ^{(2)}, u^{(2)}] = f([Ψ^{(1)}, u^{(1)}]), and an inverse must exist. Once the new parameters have been generated, then any label variables may be reallocated, segmenting the data into the new number of classes. Again, this allocation must be time reversible. The steps just described complete the proposal generation step of the reversible-jump algorithm. The acceptance ratio is now calculated in the traditional Metropolis–Hastings manner, to preserve detailed balance. However, the Radon–Nikodym derivative (taking the form of a Jacobian determinant) is included, which compensates for the difference in probability measures between the proposed and original models. Proceeding with the above example, if Y is the observed data, and x^{(1)} and x^{(2)} are the two allocations of the hidden data labels, then the acceptance ratio for the reversible jump transition from model m1 to m2 is given by

\min\Bigg\{1, \frac{\pi_{m2}(X^{(2)}, \Psi^{(2)} \mid Y)}{\pi_{m1}(X^{(1)}, \Psi^{(1)} \mid Y)}\, \frac{q(X^{(1)} \mid X^{(2)}, \Psi^{(1)})}{q(X^{(2)} \mid X^{(1)}, \Psi^{(2)})}\, \frac{q(u^{(1)})}{q(u^{(2)})}\, \left|\frac{\partial(\Psi^{(2)}, u^{(2)})}{\partial(\Psi^{(1)}, u^{(1)})}\right|\Bigg\},   (18)

where π_{m2}(X^{(2)}, Ψ^{(2)} | Y)/π_{m1}(X^{(1)}, Ψ^{(1)} | Y) is the ratio between the posterior probabilities of the two models under consideration, q(X^{(1)} | X^{(2)}, Ψ^{(1)})/q(X^{(2)} | X^{(1)}, Ψ^{(2)}) is the ratio of reverse and forward label allocation probabilities, and q(u^{(1)})/q(u^{(2)}) is the ratio of the probabilities of generating the random vectors corresponding to the reverse and forward mapping functions. The final term |∂(Ψ^{(2)}, u^{(2)})/∂(Ψ^{(1)}, u^{(1)})| is the Radon–Nikodym derivative between the two model probability measures. This corresponds to the Jacobian determinant generated from the mapping function [Ψ^{(1)}, u^{(1)}] → [Ψ^{(2)}, u^{(2)}].

This methodology was applied by Richardson and Green [18] to the problem of fitting a Gaussian mixture model with an unknown number of components to observed data. Here, the posterior distribution was sampled using Gibbs and Metropolis–Hastings algorithms which updated the label allocation and model parameters in a given sequence. At the end of this scan through the model space, a reversible jump was incorporated which updated the number of mixture components. At each sweep the algorithm would randomly choose between proposing to increment or decrement the number of classes.

If the number of classes was to be incremented then a class was randomly chosen to be split in two. New parameters were generated for each of the two new classes. The observed data samples previously allocated to that class were then reallocated to one of the two new classes by Gibbs sampling from their conditional distributions. If the number of classes was to decrease then two classes were randomly chosen to be merged. A new parameter vector for the merged class was then generated as a function of the two old parameter vectors. All data samples allocated to either of the two classes being merged were then automatically assigned to the new merged class with probability one.

To summarise, a reversible jump algorithm consists of four steps: (i) decide whether to increase or decrease the number of classes; (ii) generate the new model parameter vectors; (iii) reallocate the hidden data labels; (iv) accept or reject this proposal based on an acceptance ratio generated to ensure detailed balance is preserved. The accept/reject step is sketched below.
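In log form, the accept/reject decision of Eq. (18) reduces to summing the log posterior ratio, the two proposal ratios and the log absolute Jacobian. The sketch below shows only this generic structure; all numerical values in the example are made up and the function signature is an assumption.

```python
import numpy as np

def rj_accept(log_post_new, log_post_old,
              log_q_alloc_rev, log_q_alloc_fwd,
              log_q_u_rev, log_q_u_fwd,
              log_abs_jacobian, rng):
    """Accept/reject a reversible-jump proposal following the structure of Eq. (18):
    posterior ratio x (reverse/forward) allocation proposal ratio x
    (reverse/forward) random-vector ratio x |Jacobian| of the parameter mapping."""
    log_ratio = ((log_post_new - log_post_old)
                 + (log_q_alloc_rev - log_q_alloc_fwd)
                 + (log_q_u_rev - log_q_u_fwd)
                 + log_abs_jacobian)
    return np.log(rng.random()) < min(0.0, log_ratio)

# Example with made-up quantities for a split proposal.
rng = np.random.default_rng(3)
accepted = rj_accept(log_post_new=-1204.2, log_post_old=-1210.5,
                     log_q_alloc_rev=0.0, log_q_alloc_fwd=-35.1,
                     log_q_u_rev=0.0, log_q_u_fwd=-2.3,
                     log_abs_jacobian=1.7, rng=rng)
print(accepted)
```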

The following two subsections describe how reversible jump algorithms may be used to allow the number of different label states (model order) to be sampled for the Isotropic and Gaussian MRF models described in Section 2.

3.1. Reversible jumps for the Isotropic MRF

The approach to sampling the model of an Isotropic Markov random field adopted here is similar to that adopted by Richardson and Green [18] when sampling the number of components of a mixture distribution. As discussed in the previous section, the problem revolves around proposing to split a region labelled by a single class, c, into a region composed of two classes, c_1 and c_2. Each class is defined by a parameter vector Ψ_c = {μ_c, σ_c, α_c}. When splitting one class into two, three new parameters need to be created. To ensure the model spaces remain of equal dimension, the parameter space of the original model is supplemented by three random variables, u_1, u_2, u_3. Thus when splitting state c into c_1 and c_2 a transform between [μ_c, σ_c, α_c] and [μ_{c1}, σ_{c1}, α_{c1}, μ_{c2}, σ_{c2}, α_{c2}] may be defined with three degrees of freedom. Hence, the three parameters for each of the two new classes may be derived from the old values μ_c, σ_c, α_c and the random variables u_1, u_2, u_3. The new parameters are calculated to preserve the 0th, 1st and 2nd order moments across the transformation. The resulting mapping functions are given in the following set of equations:

\alpha_{c1} = \alpha_c - T_t \ln(u_1), \qquad \alpha_{c2} = \alpha_c - T_t \ln(1 - u_1),
\mu_{c1} = \mu_c - u_2 \sigma_c \sqrt{\frac{1 - u_1}{u_1}}, \qquad \mu_{c2} = \mu_c + u_2 \sigma_c \sqrt{\frac{u_1}{1 - u_1}},   (19)
\sigma_{c1}^2 = u_3 (1 - u_2^2)\, \sigma_c^2\, \frac{1}{u_1}, \qquad \sigma_{c2}^2 = (1 - u_3)(1 - u_2^2)\, \sigma_c^2\, \frac{1}{1 - u_1}.

The choice of the random variables u_1, u_2, u_3 must be such that u_1, u_2, u_3 ∈ (0, 1]. For this reason, and to allow a bias towards splitting the data into roughly equal partitions, beta distributions are used to sample u_1, u_2, u_3. The Jacobian determinant of these mapping functions, needed in the calculation of the acceptance ratio, is

\left|\frac{\partial(\Psi_{c1}, \Psi_{c2})}{\partial(\Psi_c, u_1, u_2, u_3)}\right| = \frac{T_t \sigma_c^2}{u_1^2 (1 - u_1)^2 \sqrt{u_3 (1 - u_3)}}.   (20)
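A sketch of the split proposal of Eq. (19), together with the Jacobian as reconstructed in Eq. (20), is given below. The Beta(2, 2) shape parameters for u_1, u_2, u_3 are assumptions (the paper only states that beta distributions are used), and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def split_isotropic_class(mu, sigma, alpha, T, rng, a=2.0, b=2.0):
    """Moment-matching split of one Isotropic class (Eq. (19)).
    u1, u2, u3 ~ Beta(a, b) bias the proposal towards roughly equal halves;
    the Beta shape parameters here are assumptions, not the paper's values.
    Returns the two proposed parameter sets, the u's and the Jacobian of Eq. (20)."""
    u1, u2, u3 = rng.beta(a, b, size=3)
    alpha1 = alpha - T * np.log(u1)
    alpha2 = alpha - T * np.log(1.0 - u1)
    mu1 = mu - u2 * sigma * np.sqrt((1.0 - u1) / u1)
    mu2 = mu + u2 * sigma * np.sqrt(u1 / (1.0 - u1))
    var1 = u3 * (1.0 - u2 ** 2) * sigma ** 2 / u1
    var2 = (1.0 - u3) * (1.0 - u2 ** 2) * sigma ** 2 / (1.0 - u1)
    jac = T * sigma ** 2 / (u1 ** 2 * (1.0 - u1) ** 2 * np.sqrt(u3 * (1.0 - u3)))
    new1 = (mu1, np.sqrt(var1), alpha1)
    new2 = (mu2, np.sqrt(var2), alpha2)
    return new1, new2, (u1, u2, u3), jac

# Example: split a class with mu=0.5, sigma=0.1, alpha=0.0 at temperature T=1.
rng = np.random.default_rng(4)
print(split_isotropic_class(0.5, 0.1, 0.0, 1.0, rng))
```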

The pixel sites labelled by the class or classes selected to be split or merged must be reallocated on the basis of the new parameters generated. If a merge is being proposed, then all sites allocated to the two old classes are relabelled by the new merged class label, with probability one. The difficulty occurs when splitting one class into two. If a reasonable probability of acceptance is to be maintained, the proposed reallocation of labels to sites needs to be completed in such a way as to ensure the posterior probability of that particular segmentation is relatively high. To achieve this it would be desirable to propose a reallocation of labels by Gibbs sampling from the conditional distributions given by Eq. (11). Unfortunately, this is not possible since the allocation of classes in the neighbourhood g_s on which the probabilities will be conditioned is not available. To overcome this problem, a deterministic 'guess' is made at the future allocation of each pixel labelled by the class to be split. The 'guess' is made at each pixel site using a distance measure between the model distribution functions of the new classes and the histogram of observed grayscale values of the surrounding region of pixels. The measure used here has a precedent in this type of algorithm: the Kolmogorov–Smirnov distance was used by Geman et al. [5] to allocate pixel sites between classes based on grayscale values or particular transformations of grayscale, indicative of texture type.

The Kolmogorov–Smirnov distance is a measure of the closeness of two distribution functions. It may be applied to two samples of data to ascertain whether they have been drawn independently from the same distribution. If \hat{F}_1(i) and \hat{F}_2(i) are two independent sample distribution functions (i.e. histograms), defined

\hat{F}(i) = \frac{1}{n}\, \#(i : y_i \le i),   (21)

where n is the number of data samples y_i, 1 ≤ i ≤ n, then the Kolmogorov–Smirnov distance is the maximum difference between the distributions over all i,

d(y^{(1)}, y^{(2)}) = \max_i |\hat{F}_1(i) - \hat{F}_2(i)|.   (22)

The Kolmogorov–Smirnov distance is useful for two reasons: its value is independent of the underlying distribution function and it is unaffected by outlying data values. Hence, it may be expected to closely model the label field of an Isotropic MRF, as was found by Geman et al. [5]. The deterministic estimation of the new label allocation may now be used as the basis for the Gibbs sampling of pixels from their full conditional distributions.
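Eqs. (21) and (22) amount to comparing two histogram-based empirical distribution functions. The sketch below assumes grayscales in (0, 1] and a 40-bin histogram, mirroring the experimental setup described in Section 4; the function name is hypothetical.

```python
import numpy as np

def ks_distance(sample1, sample2, n_bins=40):
    """Kolmogorov-Smirnov distance of Eqs. (21)-(22): maximum absolute difference
    between the two empirical (histogram-based) distribution functions."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)            # grayscales lie in (0, 1]
    cdf1 = np.cumsum(np.histogram(sample1, bins=edges)[0]) / len(sample1)
    cdf2 = np.cumsum(np.histogram(sample2, bins=edges)[0]) / len(sample2)
    return np.max(np.abs(cdf1 - cdf2))

# Example: two windows of pixels drawn from slightly different distributions.
rng = np.random.default_rng(5)
w1 = np.clip(rng.normal(0.40, 0.05, 81), 0, 1)   # e.g. a 9x9 window
w2 = np.clip(rng.normal(0.55, 0.05, 81), 0, 1)
print(ks_distance(w1, w2))
```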

The acceptance ratio for splitting region c into c1 and c2, thus increasing the number of classes from k to k^+, may now be given by

\frac{p(X = x^+, \Psi^+, k^+ \mid Y = y)}{p(X = x, \Psi, k \mid Y = y)}\, \frac{1}{q_b(u_1)\, q_b(u_2)\, q_b(u_3)}\, \frac{1}{q(\text{re-segmentation})}\, \left|\frac{\partial(\Psi_{c1}, \Psi_{c2})}{\partial(\Psi_c, u_1, u_2, u_3)}\right|,   (23)

where p(X = x, Ψ, k | Y = y) is the approximation to the posterior density defined by Eq. (5), q_b(u) is the probability of proposing the random variable u drawn from a beta distribution, q(re-segmentation) is the probability of the reallocation of the pixels labelled by c into regions labelled by c1 and c2, and the final term is the Jacobian determinant given by Eq. (20). The acceptance ratio for the jump combining two states into one is simply the inverse of that given in Eq. (23). In this case q(re-segmentation) is found in retrospect and u_1, u_2, u_3 are calculated by back-substitution into Eq. (19).

3.2. Reversible jumps for the Gaussian MRF

Proposing a move when using a reversible jump to sample the model order of a hierarchical GMRF is more complex than generating a similar proposal when sampling the Isotropic MRF, for the simple reason that the model parameter vector [μ, σ, Θ] is longer.

When splitting a GMRF class, two new GMRF parameter vectors need to be generated. The number of degrees of freedom for the generation of the new vectors is therefore dependent on the size of the old parameter vector and, in particular, upon the size of the GMRF neighbourhood. If the dimension of the correlation coefficient vector associated with this neighbourhood is denoted N_θ, then the number of random variables needed to equalise the two model parameter vectors will be N_θ + 2, which is equivalent to the number of degrees of freedom in generating the new model parameters. If a large degree of freedom is allowed in generating a large number of parameters, then the likelihood of proposing a reversible jump that has a reasonable chance of being accepted will be small, and so convergence of the Markov chain will be slow. To alleviate this problem the model order is sampled from a marginal density and the method of composition sampling [1] is then employed to obtain the remaining parameters. Composition sampling requires the availability of factorisations of the posterior density. The marginal we propose using eliminates all but one correlation parameter from the posterior density,

p(\Theta, \mu, \sigma, X \mid Y) = p(\Theta^{(-i)} \mid \Theta^{(i)}, \mu, \sigma, X, Y)\, p(\Theta^{(i)}, \mu, \sigma, X \mid Y),   (24)

where Θ^{(−i)} = [Θ_c^{(−i)}; c ∈ Λ] and Θ_c^{(−i)} = [θ_c^{(1)}, ..., θ_c^{(i−1)}, θ_c^{(i+1)}, ..., θ_c^{(N_{AR})}]. Similarly Θ^{(i)} = [Θ_c^{(i)}; c ∈ Λ], μ = [μ_c; c ∈ Λ] and σ = [σ_c; c ∈ Λ].

The above factorisation of the posterior density can only be achieved by incorporating the pseudo-likelihood approximation into the marginal. Applying Bayes law, the marginal may be expressed

p(\theta^{(i)}, \mu, \sigma, X \mid Y) \propto PL(Y \mid \lambda, \mu, \sigma, \theta^{(i)}, X) \Big(\prod_{s \in \Omega} p(x_s \mid x_{g_s})\Big)\, p_r(\theta^{(i)}, \mu, \sigma).   (25)

All the terms in the above equation are easily defined except the marginal of the pseudo-likelihood, PL(Y | λ, μ, σ, θ^{(i)}, X). This is found by first constructing the full pseudo-likelihood and multiplying it by priors on all the parameters to be eliminated on forming the marginal. These parameters are then integrated out of the resulting expression to form that marginal. To facilitate the analytical integration over the correlation parameter space, conjugate priors are used which take the form of Normal distributions, N(0, λ²). The pseudo-likelihood is therefore marginalised by evaluating the following product of integrals:

PL(Y \mid \lambda, \mu, \sigma, \theta^{(i)}, X) = \prod_{c \in \Lambda} \int_{\Theta_c^{(-i)}} \Big(\prod_{s: x_s = c} p(y_s \mid \mu_c, \sigma_c, \Theta_c, y_{g_s})\Big)\, p(\Theta_c^{(-i)})\, d\Theta_c^{(-i)}
 = \prod_{c \in \Lambda} (2\pi\sigma_c^2 T_t)^{(-N_c + N_\theta - 1)/2}\, (2\pi\lambda^2 T_t)^{(-N_\theta + 1)/2}\, \frac{1}{|D_c|^{1/2}} \exp\Big\{-\frac{1}{2\sigma_c^2 T_t}\, h(\theta^{(i)} \mid \mu_c, \lambda, Y)\Big\},   (26)

where

h(\theta^{(i)} \mid \mu_c, Y) = -v^T [D_c]^{-1} v + \sum_{s: x_s = c} (y'_s - \theta_c^{(i)}[y'_{s+q_i} + y'_{s-q_i}])^2,   (27)

v = \Big[\sum_{s: x_s = c} (y'_s - \theta_c^{(i)}[y'_{s+q_i} + y'_{s-q_i}])\, y_{(o-i)_s}\Big],   (28)

D_c = \Big[\lambda I + \sum_{s: x_s = c} y^T_{(o-i)_s}\, y_{(o-i)_s}\Big],   (29)

y^T_{(o-i)_s} = [(y'_{s+q_1} + y'_{s-q_1}), \ldots, (y'_{s+q_{i-1}} + y'_{s-q_{i-1}}), (y'_{s+q_{i+1}} + y'_{s-q_{i+1}}), \ldots, (y'_{s+q_{N_\theta}} + y'_{s-q_{N_\theta}})],   (30)

y'_s = y_s - \mu_c.   (31)
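The matrix D_c and vector v of Eqs. (27)-(29) are simple accumulations over the sites of class c. The sketch below illustrates that accumulation for one retained coefficient θ^(i); it restricts itself to interior sites and a two-offset neighbourhood purely to keep the example short, and the function name and arguments are assumptions.

```python
import numpy as np

def marginal_pl_statistics(img, labels, c, mu_c, theta_i, offsets, keep_idx, lam):
    """Accumulate the quantities of Eqs. (27)-(29) for class c: the matrix D_c,
    the vector v and the residual sum entering h(theta^(i) | mu_c, lambda, Y).
    `offsets` lists one representative q per symmetric coefficient pair,
    `keep_idx` is the index i of the coefficient kept out of the marginalisation."""
    M, N = img.shape
    n_out = len(offsets) - 1                       # number of marginalised coefficients
    D = lam * np.eye(n_out)
    v = np.zeros(n_out)
    resid_sq = 0.0
    for i in range(1, M - 1):                      # interior sites only, for simplicity
        for j in range(1, N - 1):
            if labels[i, j] != c:
                continue
            yp = img[i, j] - mu_c
            sym = [(img[i+di, j+dj] - mu_c) + (img[i-di, j-dj] - mu_c)
                   for di, dj in offsets]
            resid = yp - theta_i * sym[keep_idx]   # y'_s - theta^(i) [y'_{s+qi} + y'_{s-qi}]
            y_rest = np.array([sym[m] for m in range(len(offsets)) if m != keep_idx])
            D += np.outer(y_rest, y_rest)
            v += resid * y_rest
            resid_sq += resid ** 2
    h = -v @ np.linalg.solve(D, v) + resid_sq      # Eq. (27)
    return D, v, h

# Example on a toy image with a single class everywhere.
rng = np.random.default_rng(6)
img = rng.random((12, 12)); labels = np.zeros((12, 12), dtype=int)
D, v, h = marginal_pl_statistics(img, labels, c=0, mu_c=0.5, theta_i=0.1,
                                 offsets=[(0, 1), (1, 0)], keep_idx=0, lam=1.0)
print(h)
```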

We now have an expression for the marginal likelihood which, when multiplied by the relevant priors, gives the basis for the reversible jump sampling of the model order. To complete the algorithm, methods to propose parameters for the new classes and to propose new segmentations for the marginalised GMRF are required.

To allow a one-to-one mapping between old and new parameter vectors, random variables are introduced. Transforms need to be defined for both the splitting of one state into two, and the inverse, the combining of two into one. The parameter vector for a single state, including the additional random variables, is [μ_x, σ_x, θ^{(i)}_x, u_{m,m}, u_{m,e}, u_{s,m}, u_{s,e}, u_{t,m}, u_{t,e}]. This may be split into two, giving [μ_{x1}, μ_{x2}, σ_{x1}, σ_{x2}, θ^{(i)}_{x1}, θ^{(i)}_{x2}, a_m, a_s, a_t]. In the above vectors u_{m,m}, u_{m,e}, u_{s,m}, u_{s,e}, u_{t,m}, u_{t,e}, a_m, a_s, a_t are zero-mean Gaussian random variables with differing variances. The split move is defined by the transformations

\mu_{x1} = \mu_x - u_{m,m}, \qquad \mu_{x2} = \mu_x + u_{m,m} + u_{m,e},
\sigma_{x1} = \sigma_x - u_{s,m}, \qquad \sigma_{x2} = \sigma_x + u_{s,m} + u_{s,e},   (32)
\theta^{(i)}_{x1} = \theta^{(i)}_x - u_{t,m}, \qquad \theta^{(i)}_{x2} = \theta^{(i)}_x + u_{t,m} + u_{t,e}.

The combine move is designed to preserve the 1st and 2nd order central moments, in a similar way to that used in the Isotropic case and by Richardson and Green [18] when sampling mixture distributions. In addition, a small perturbation in the parameters is allowed, given by the random variables a_m, a_s, a_t. The resulting transforms are

\mu_x = \frac{1}{N_x}[\mu_{x1} N_{x1} + \mu_{x2} N_{x2}] + a_m,
\sigma_x^2 = \frac{1}{N_x}[(\sigma_{x1}^2 + \mu_{x1}^2) N_{x1} + (\sigma_{x2}^2 + \mu_{x2}^2) N_{x2}] - \mu_x^2 - a_s,   (33)
\theta^{(i)}_x = \frac{1}{N_x}[\theta^{(i)}_{x1} N_{x1} + \theta^{(i)}_{x2} N_{x2}] + a_t,

where N_x is the number of pixels assigned to class x, N_{x1} to class x_1 and N_{x2} to class x_2.

The accepted reversible jump framework [16,18] for proposing a split is to begin by generating new parameters for the two new states, then Gibbs sample the data allocated to the split state into each of the two new states based on their full conditionals. In our case we may generate new parameters in the traditional manner. However, we cannot Gibbs sample the data labels because of the non-causal nature of the GMRF: the conditionals are dependent on data labels yet to be allocated. To overcome this problem we use a refinement of the methodology described in the previous section for sampling the model order of the Isotropic MRF. A likely reallocation of the label field is estimated in a deterministic fashion. This is then used to condition the full conditional distributions for the label parameters to allow a Gibbs sampling sweep of all sites labelled with the class being split.

To estimate a likely label field configuration for the GMRF, a square window, denoted W_s, is considered around each pixel in the image. The pseudo-likelihoods of allocating all the pixels in the window to each of the two new classes are calculated and the pixel is allocated to the class with the larger pseudo-likelihood (a minimal sketch of this step is given below). The pseudo-likelihood for each window is calculated as in Eq. (26), except that the product is over all pixels in the window rather than all pixels labelled by state. Hence, the pseudo-likelihood of the pixel at site s being labelled by class c, given its surrounding window, is

PL_s(\{y_r; r \in W_s\} \mid \lambda, \mu_c, \sigma_c, \theta_c^{(i)}, \{y_{g_r}; r \in W_s\}) = \Big(\frac{1}{2\pi\sigma_c^2 T_t}\Big)^{(N_{W_s} - (N_\theta - 1))/2} \frac{1}{|D_{W_s}|^{1/2}} \exp\Big\{-\frac{1}{2\sigma_c^2 T_t}\, h_{W_s}(\theta^{(i)} \mid \mu_c, Y)\Big\},   (34)

where

h_{W_s}(\theta^{(i)} \mid \mu_c, Y) = -v^T [D_{W_s}]^{-1} v + \sum_{r \in W_s} (y'_r - \theta_c^{(i)}[y'_{r+q_i} + y'_{r-q_i}])^2,   (35)

v = \Big[\sum_{r \in W_s} (y'_r - \theta_c^{(i)}[y'_{r+q_i} + y'_{r-q_i}])\, y_{(o-i)_r}\Big], \qquad D_{W_s} = \Big[\lambda I + \sum_{r \in W_s} y^T_{(o-i)_r}\, y_{(o-i)_r}\Big],   (36)

N_{W_s} is the size of the window surrounding pixel s, and y_{(o−i)_r} and y'_r are defined as in Eqs. (30) and (31).
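The deterministic 'guess' can be sketched as follows: for each pixel of the class being split, sum a per-pixel log-density over the surrounding window under each proposed class and keep the larger. For brevity the sketch scores windows with the plain single-pixel density of Eq. (9) rather than the marginalised window pseudo-likelihood of Eq. (34), so it is a simplification of the paper's procedure; names and parameter values are assumptions.

```python
import numpy as np

def gmrf_log_density(img, i, j, offsets, theta, mu, sigma):
    """Log of the single-pixel GMRF density, Eq. (9)."""
    resid = img[i, j] - mu
    for (di, dj), th in zip(offsets, theta):
        resid -= th * ((img[i+di, j+dj] - mu) + (img[i-di, j-dj] - mu))
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2)

def guess_label(img, site, half, params_c1, params_c2, offsets):
    """Deterministic 'guess' for one pixel of the class being split: sum the
    log-densities of all pixels in the surrounding window under each of the two
    proposed classes and pick the larger."""
    i0, j0 = site
    scores = []
    for mu, sigma, theta in (params_c1, params_c2):
        total = 0.0
        for i in range(i0 - half, i0 + half + 1):
            for j in range(j0 - half, j0 + half + 1):
                total += gmrf_log_density(img, i, j, offsets, theta, mu, sigma)
        scores.append(total)
    return 0 if scores[0] >= scores[1] else 1      # index of the winning new class

# Example: score a small window around the centre of a toy image.
rng = np.random.default_rng(7)
img = rng.random((32, 32))
offsets = [(0, 1), (1, 0)]
c1 = (0.4, 0.1, np.array([0.2, 0.1]))
c2 = (0.6, 0.1, np.array([0.0, 0.0]))
print(guess_label(img, (16, 16), 3, c1, c2, offsets))
```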

Using this estimation of the new label field, X^{(e)}, the proposal label field can be Gibbs sampled using the marginalised conditional distributions. The conditional distributions are given by

p(x_s = c \mid \lambda, \mu_c, \sigma_c, y_s, y_{o_s}, x^{(e)}_{g_s}) \propto p(y_s \mid \lambda, \mu_c, \sigma_c, \theta_c^{(i)}, y_{o_s})\, p(x_s = c \mid \beta^{(1)}, x^{(e)}_{g_s}),   (37)

where g_s is the neighbourhood of interactions within the label field. The marginalised likelihood term in this equation may be evaluated as

p(y_s \mid \lambda, \mu_c, \sigma_c, \theta_c^{(i)}, y_{o_s}) = \int_{\Theta_c^{(-i)}} p(y_s \mid \mu_c, \sigma_c, \Theta_c, y_{o_s})\, p(\Theta_c^{(-i)})\, d\Theta_c^{(-i)} = \frac{(2\pi\sigma_c^2)^{(N_\theta - 2)/2}}{(2\pi\lambda^2)^{(N_\theta - 1)/2}\, |D_s|^{1/2}} \exp\Big\{-\frac{1}{2\sigma_c^2}\, h_s(\theta^{(i)} \mid \mu_c, y_{o_s})\Big\},   (38)

where

h_s(\theta^{(i)} \mid \mu_c, y_{g_s}) = [y'_s - \theta^{(i)}[y'_{s+q_i} + y'_{s-q_i}]]^2\, [1 - y^T_{(o-i)_s} [D_s]^{-1} y_{(o-i)_s}], \qquad D_s = [y^T_{(o-i)_s}\, y_{(o-i)_s} + \lambda I],   (39)

and y_{(o−i)_s} and y'_s are as defined in Eqs. (30) and (31). The prior on X comprises an Ising model, hence the conditional for x_s, given its vector of neighbourhood labels x^{(e)}_{g_s}, is a conditional Gibbs distribution,

p(x_s = c \mid \beta, x^{(e)}_{g_s}) = \frac{\exp\{-\beta[\delta V_f(x_s = c \mid x^{(e)}_{g_s}) + V_b(x_s = c \mid x^{(e)}_{g_s})]\}}{\sum_{x \in \Lambda} \exp\{-\beta[\delta V_f(x_s = x \mid x^{(e)}_{g_s}) + V_b(x_s = x \mid x^{(e)}_{g_s})]\}}.   (40)

The acceptance ratio for the resulting segmentation and parameter estimates for splitting region c into c1 and c2 is given by min[1, R], where

R = \frac{p(X = x^+, \Psi^+, k^+ \mid Y = y)}{p(X = x, \Psi, k \mid Y = y)}\, \frac{q(a_m)\, q(a_s)\, q(a_t)}{q(u_{m,m})\, q(u_{m,e})\, q(u_{s,m})\, q(u_{s,e})\, q(u_{t,m})\, q(u_{t,e})}\, \frac{1}{q(\text{re-segmentation})}\, \left|\frac{\partial(\Psi_{c1}, \Psi_{c2}, a_m, a_s, a_t)}{\partial(\Psi_c, u_{m,m}, u_{m,e}, u_{s,m}, u_{s,e}, u_{t,m}, u_{t,e})}\right|,   (41)

where p(X = x, Ψ, k | Y = y) is the approximation to the posterior density defined by Eq. (10), q(u) is the probability of proposing the random variable u drawn from a Normal distribution, q(a) is the retrospective probability of drawing the random variable a from a Gaussian distribution, q(re-segmentation) is the probability of the reallocation of the pixels labelled by c into regions labelled by c1 and c2, and the final term is the Jacobian determinant given by Eq. (20). If the proposed split (or merge) is accepted, then the remaining θ parameters are repeatedly sampled from their full conditional distributions given by Eq. (14).

4. Experimental results

There has been much debate on the subject of how convergence might relate to the annealing schedule [19,20]. Although slowly cooling logarithmic schedules are generally considered more robust, we have adopted a geometric schedule. The principal reason for this choice was simply that the complexity of the algorithms makes only a relatively small number of iterations possible within a reasonable time-span, hence a fast annealing schedule is

Fig. 1. 200 iteration experiment on an image with four classes and beginning the algorithm with an arbitrary initial guess of two classes.


necessary. The schedule is given by

T_t = (1 + a_1)\, a_2^{(1 - t/N_t)},   (42)

where t is the iteration number, N_t is the total number of iterations, and a_1 and a_2 are constants (which were arbitrarily set to 0.1 and 10.0, respectively).
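A minimal sketch of this schedule, using the exponent form reconstructed in Eq. (42), is given below; the constants default to the values quoted above.

```python
def geometric_temperature(t, n_iter, a1=0.1, a2=10.0):
    """Geometric annealing schedule as reconstructed in Eq. (42):
    T_t = (1 + a1) * a2 ** (1 - t / n_iter)."""
    return (1.0 + a1) * a2 ** (1.0 - t / n_iter)

# Example: temperatures at the start, middle and end of a 200-iteration run.
print([round(geometric_temperature(t, 200), 3) for t in (0, 100, 200)])
```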

The isotropic segmentation algorithm, described previously in Section 3.1, has been applied to various computer-synthesised mosaics. For the purposes of these experiments β, the double pixel clique coefficient, which prescribes the interaction strength between pixels within the image, was set a priori to a value of 1.5. In proposing a reversible jump, the Kolmogorov–Smirnov distances were calculated using 9×9 windows to generate 40-bin histograms of the pixel grayscale distribution functions.

Fig. 1 demonstrates the convergence of a 200 iteration experiment on a four-class grayscale image with additive Gaussian noise. The classes are well separated


and the algorithm can be seen to converge quickly to the correct segmentation. Fig. 2 shows the convergence of a 200 iteration run on a six-class image whose within-class densities overlap more fully. The converged segmentation gives a good representation of the image after so few iterations and the number of classes has been correctly diagnosed. To demonstrate the stability of this algorithm with respect to starting parameter values, Fig. 3 shows graphically the convergence of the algorithm, in terms of number of states, from starting values of between 2 and 10. For reference, the plot is overlaid with the annealing schedule and it can be observed that the algorithm converges within 150 iterations from all starting points.

As described in Section 3, it is possible to estimate the MRF hyper-parameter β from its posterior distribution. Unfortunately the use of the pseudo-likelihood approximation for the MRF partition function makes this distribution improper under certain MRF configurations.

Fig. 2. 200 iteration experiment on an image with six classes and beginning the algorithm with an arbitrary initial guess of two classes.


Fig. 3. Graph to demonstrate convergence of the algorithm from various starting configurations.

Fig. 4. (a) Graph showing convergence or otherwise of the beta parameter; (b) converged segmentation using the constrained algorithm.

This is shown in Fig. 4a, where the unconstrained graph shows the drift of the β parameter towards infinity as its posterior distribution becomes improper. Since this only occurs at around 300 iterations, after the MRF has converged to the correct number of states and the segmentation has become largely stable, it is only necessary to limit this drift to ensure the MAP estimate remains stable as the temperature tends to zero. The drift may be constrained simply by imposing a prior on the β parameter: in this case a Normal distribution was chosen,

N[1.0, 0.3]. Although the eventual estimate for β will be inaccurate, due to the use of the pseudo-likelihood estimate, the segmentation will remain stable until converged or a minimum temperature is reached. The converged segmentation for the constrained algorithm is shown in Fig. 4b.

The results of a 500 iteration run on a more complex grayscale image are shown in Fig. 5. The original grayscale distributions of the five classes are far closer than in the previous example (as shown by the grayscale


Fig. 5. 500 iteration experiment on an image with five classes and beginning the algorithm with an arbitrary initial guess of six classes.

histogram) but the algorithm has still converged to a good estimate of the underlying image. A greyscale image of a house is segmented in Fig. 6. The algorithm was repeated from various starting conditions and the results reached remained consistent. It could be argued that a better segmentation was arrived at in the intermediate steps, but the algorithm then fits to a statistically, if not visually, better model in the final

iterations. The possibility of using an information criterion to arrive at a more useful segmentation could be considered.

The hierarchical GMRF segmentation algorithm described in Section 3.2 has been tested on various computer-generated mosaics of GMRFs. The neighbourhood size of each GMRF required ten θ coefficients and so was of type n = 4 [21].


Fig. 6. 400 iteration experiment on a greyscale image of a house. The algorithm begins with an arbitrary initial guess of six classes and consistently converges to five.

An example of such a test is shown in Fig. 7. Again the β parameter has been set a priori to 1.5. The algorithm was run for 600 iterations from a starting temperature of 2 to a finishing temperature of 0.1. The temperature schedule for the experiment was simply linear,

T_t = \frac{N_t - t}{N_t}(T_{\max} - T_{\min}) + T_{\min},   (43)

where t is the iteration number, N_t is the total number of iterations, and T_min and T_max are the minimum and maximum temperatures of the schedule. When estimating a likely reallocation of the label field during reversible jump proposal generation, windows of 16×16 pixels were used.

For all these simulations, it is interesting to observe how at high temperatures the algorithms fit the model to


Fig. 7. 600 iteration experiment on a four state synthetic GMRF image. The algorithm begins with an arbitrary initial guess of three classes and converges to four.

the gray level histogram of the images. Only as the temperature falls to around the critical temperature [22] of the system do the spatial correlations in the image

begin to be modelled. This is a direct consequence of using a Potts model for the configuration of the label field.


5. Conclusion

In this paper we have presented unsupervised segmentation algorithms for both noisy and textured images. The model parameters, number of classes and pixel labels are all directly sampled from the posterior distribution using an MCMC algorithm. A single parameter, which defines the overall strength of neighbouring pixel interactions within the image, is defined prior to the data (or at least an informative prior distribution needs to be defined for it). Excepting this, the algorithms are completely unsupervised. Experimental results have been presented in which synthesised images have been rapidly segmented to give accurate estimates of the original underlying image.

References

[1] M.A. Tanner, Tools for Statistical Inference, Springer, Berlin, 1993.
[2] B.S. Manjunath, R. Chellappa, Unsupervised texture segmentation using Markov random fields, IEEE Trans. Pattern Anal. Machine Intell. 13 (5) (1991) 478–482.
[3] F.S. Cohen, Z. Fan, Maximum likelihood unsupervised texture segmentation, CVGIP: Graphical Models and Image Processing 54 (3) (1992) 239–251.
[4] H.H. Nguyen, P. Cohen, Gibbs random fields, fuzzy clustering, and the unsupervised segmentation of images, CVGIP: Graphical Models and Image Processing 55 (1) (1993) 1–19.
[5] C. Graffigne, D. Geman, S. Geman, P. Dong, Boundary detection by constrained optimization, IEEE Trans. Pattern Anal. Machine Intell. 12 (7) (1990) 609–628.
[6] C. Kervrann, F. Heitz, A Markov random field model-based approach to unsupervised texture segmentation using local and global statistics, IEEE Trans. Image Processing 4 (6) (1995) 856–862.
[7] D.K. Panjwani, G. Healey, Markov random field models for unsupervised segmentation of textured color images, IEEE Trans. Pattern Anal. Machine Intell. 17 (10) (1995) 939–954.
[8] C.S. Won, H. Derin, Unsupervised segmentation of noisy and textured images using Markov random fields, CVGIP: Graphical Models and Image Processing 54 (4) (1992) 308–328.
[9] J.W. Modestino, D.A. Langan, J. Zhang, Cluster validation for unsupervised stochastic model-based image segmentation, Proc. ICIP94, 1994, pp. 197–201.
[10] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Machine Intell. 6 (6) (1984) 721–741.
[11] J. Besag, Statistical analysis of non-lattice data, The Statistician 24 (3) (1975) 179–195.
[12] J.M. Bernardo, A.F.M. Smith, Bayesian Theory, Wiley, New York, 1994.
[13] J. Besag, On the statistical analysis of dirty pictures, J. Roy. Statist. Soc. Ser. B 48 (1986) 259–302.
[14] B. Chalmond, An iterative Gibbsian technique for reconstruction of m-ary images, Pattern Recognition 22 (6) (1989) 747–761.
[15] L. Tierney, Markov chains for exploring posterior distributions, Ann. Statist. 22 (5) (1994) 1701–1762.
[16] P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (4) (1995) 711–732.
[17] P. Billingsley, Measure Theory, Addison-Wesley, Reading, MA, 1960.
[18] S. Richardson, P.J. Green, On Bayesian analysis of mixtures with an unknown number of components, J. Roy. Statist. Soc. Ser. B 59 (4) (1997) 731–792.
[19] R. Srichander, Efficient schedules for simulated annealing, Eng. Optim. 24 (1995) 161–176.
[20] P.N. Strenski, S. Kirkpatrick, Analysis of finite length annealing schedules, Algorithmica 6 (1991) 346–366.
[21] R. Chellappa, Two-dimensional discrete Gaussian Markov random field models for image processing, in: L.M. Kanal, A. Rosenfeld (Eds.), Progress in Pattern Recognition, vol. 2, Elsevier, Amsterdam, 1985, pp. 79–112.
[22] R.J. Baxter, Exactly Solved Models in Statistical Mechanics, Academic Press, London, 1982.

About the Author – SIMON A. BARKER received his M.Eng. degree from the University of Surrey, UK, in 1989 and recently completed studying towards his Ph.D. degree at Cambridge University Engineering Department. He is now a member of the Core Research Group of Lernout & Hauspie Speech Products, Burlington, Massachusetts, US. His current research interests include image segmentation, Markov random fields, model selection techniques and acoustic modelling.

About the Author – PETER J.W. RAYNER received the M.A. degree from Cambridge University, UK, in 1968 and the Ph.D. degree from Aston University in 1969. Since 1968 he has been with the Department of Engineering at Cambridge University and is Head of the Signal Processing Research Group. In 1998 he was appointed to an ad-hominem Professorship in Signal Processing. He teaches courses in random signal theory, digital signal processing, and image processing. His current research interests include image sequence restoration, audio restoration, non-linear estimation and detection, and time series modelling and classification.