Contextual estimators of mixing probabilities for Markov chain random fields


Pattern Recognition, Vol. 26, No. 5, pp. 763-769, 1993

0031-3203/93 $6.00+.00 Pergamon Press Ltd © 1993 Pattern Recognition Society

Printed in Great Britain

CONTEXTUAL ESTIMATORS OF MIXING PROBABILITIES FOR MARKOV CHAIN RANDOM FIELDS

A. VEIJANEN†

Ari Veijanen Inc., Kellonsoittajantie 47, 05400 Jokela, Finland

(Received 22 February 1990; in revised form 1 October 1992; received for publication 13 October 1992)

Abstract--This paper discusses the estimation of proportions of classes from an image. Classification methods are compared with likelihood methods and the importance of contextual information is discussed. The joint distribution of classes at neighbouring sites is modelled by a Markov chain random field. The class attributes are estimated from training sets and unclassified observations. The effect of biased class means is reduced with a stochastic model of the bias. Contextual likelihood methods yield better results than non-contextual methods.

† Currently with the Academy of Finland.

Key words: Image analysis, Markov chain, Random field, Prior probabilities, Estimation, Bias, Training set

1. INTRODUCTION

Consider an image of a true scene whose pixels can be grouped into classes, prescribed categories of similar pixels. Land-use categories, such as crop and forest types, are examples of classes in remote sensing. The number of classes is fixed and the class attributes are usually known in advance. An observation associated with a pixel has a distribution that depends on the class of the pixel, the classes being characterized at least by the expectation of observations. Using such information, we classify each observation. Observations assigned to the kth class are given a label k.

The proportions of classes have often been estimated by the proportions of labels. Acreages of crop types, for example, are estimated from satellite images after classification.(1-4) Good classification methods are then essential. In non-contextual methods each observation is classified independently of the other observations. However, the observations are seldom independent; on the contrary, neighbouring pixels often belong to the same class. This property of the true scene is used in contextual classification: we classify a pixel using not only the observation associated with the pixel but also the observations or labels in the pixel's neighbourhood. Contextual classification based on a model of the true scene is preferable to non-contextual classification.(5-7) Yet the labels are not necessarily in correct proportions. Pixels in a rare class are often misclassified, as the image is strongly smoothed and details are eliminated. Corrections can be made with an estimate of a confusion matrix,(8) but a small training set yields inaccurate corrections. We therefore consider contextual estimation methods that do not involve classification.

Instead of classifying the pixels, we apply likelihood methods. A non-contextual likelihood method is based on the marginal distributions of observations. Such likelihoods are mixtures of distributions whose mixing proportions can be estimated with maximum likelihood techniques.(9) In a contextual likelihood method the joint distribution of neighbouring observations is specified. It is not, however, necessary to compute the joint likelihood of all the observations. Using pseudolikelihoods we estimate the joint distribution of classes at two or four neighbouring sites under a random field model.

The true scene is often modelled as a Markov random field.(5,6) Unfortunately, it is difficult to estimate the field's parameters from an image, and recently introduced estimators(10-12) may not be ideal for the estimation of class proportions. It is easier to derive methods for certain simple random fields with known marginal distributions. Such random fields are used as approximations of more general Markov random fields.(13) Several authors have used Markov mesh fields(14) and the Pickard random field.(15-20) Estimators have been derived for these and related models.(21-23) We discuss Markov chain random fields whose distribution is determined by a 2 x 2 Markov random field.(24,25)

The class attributes, such as the expectation and variance of observations, are estimated from a training set of observations with a known class. The accuracy of estimates depends on the size and reliability of the training set. Estimates from a small sample are inaccurate. Observations in the training set and in our data may even follow different distributions. We study the effect of bias in class means. The bias of a class mean in the training set is defined as the difference between the expectation of the mean and the corresponding true expectation in the image. If the class means are biased, we also call the training set biased. Bias in class means yields errors in estimated class proportions, which do not necessarily tend to the correct proportions as the size of the image and the training set increase.


To reduce such errors, one should update the estimates of class attributes using unclassified observations.(21,22) Contextual likelihood methods are expected to yield better results than non-contextual likelihood methods.(21-23,26) Moreover, the reliability of a training set can be described by a misclassification model, for example.(3) We attempt to combine these ideas. Our main contribution is to model both the true scene and the bias in class means. With a suitable choice of the variance of the bias we reduced errors due to a biased training set. We derived contextual pseudolikelihood methods that yielded clearly better results than maximum likelihood classification.

In Sections 2.1 and 2.2 we review Markov random fields and Markov chain random fields. For readers interested in the estimation theory of these models, Section 2.3 discusses the identifiability of parameters. Section 3 describes the reliability of a training set and introduces the bias model. The pseudolikelihood estimators are defined in Section 4. Our experiments are described in Sections 5.1 and 5.2. Section 5.3 summarizes the conclusions.

2. RANDOM FIELD MODELS

2.1. Markov random fields

Suppose each pixel (i, j) belongs to a class denoted by X_ij. Their joint distribution is specified by a random field model. The number of contiguous sites with equal classes is used in simple Markov random field (MRF) models.(5,6) More generally, we describe the co-occurrence of classes. Classes a and b, for example, may occur at neighbouring sites more often than expected. We count the frequency n_a(x) of each class and the frequencies n^t_{ab}(x) of pairs of classes in four directions:

n_a(x) = card{(i, j) in L : x_ij = a},
n^1_{ab}(x) = card{((i, j), (i, j+1)) in L x L : x_ij = a, x_{i,j+1} = b},
n^2_{ab}(x) = card{((i, j), (i+1, j)) in L x L : x_ij = a, x_{i+1,j} = b},
n^3_{ab}(x) = card{((i, j), (i+1, j+1)) in L x L : x_ij = a, x_{i+1,j+1} = b},
n^4_{ab}(x) = card{((i, j), (i+1, j-1)) in L x L : x_ij = a, x_{i+1,j-1} = b}.

The probability mass function of the MRF is

P{X = x; θ} = Z_θ^{-1} exp{-U(x; θ)}   (1)

where θ is the parameter vector, Z_θ is the normalizing constant and U(x; θ) is the energy function

U(x; θ) = - Σ_{a=1}^{τ} n_a(x) α_a - Σ_{t=1}^{4} Σ_{a=1}^{τ} Σ_{b=1}^{τ} n^t_{ab}(x) β^t_{ab}.   (2)

We use the following constraints for the parameters α_a and β^t_{ab}:

Σ_{a=1}^{τ} α_a = 0,   Σ_{b=1}^{τ} β^t_{ab} = 0,   β^t_{ab} = β^t_{ba}   and   β^4_{ab} = β^3_{ab}   (3)

(a, b = 1, 2, ..., τ; t = 1, 2, 3). Under these conditions, the parameters α_a and β^t_{ab} (b ≤ a < τ; t = 1, 2, 3) are linearly independent.

2.2. Markov chain random fields

To define our model we denote R_i = (X_{i1}, X_{i2}, ..., X_{im}) and C_j = (X_{1j}, X_{2j})'. X is a Markov chain random field (MCRF) if (C_1, ..., C_m) and (R_1, ..., R_n) are homogeneous Markov chains.(15,16,24) The distribution of X is determined by a generating set

G = ( X_{11}  X_{12}
      X_{21}  X_{22} ).   (4)

The transition probabilities of (C_j) are first derived from the distribution of G. The distribution of the first two rows then determines the joint distribution of all the rows. Suppose G (4) is stationary and X_{12} is conditionally independent of X_{21} given X_{11}. Then the X generated by G has been called the Pickard random field.(15,16) Its regions are often unrealistically rectangular (Fig. 1).

Consider a 2 x 2 MRF G (4) with energy function (2) under the constraints (3). The field's distribution is invariant under reflections with respect to the x- and y-axes. For a MCRF X generated by G, (C_1, ..., C_m) and (R_1, ..., R_n) are reversible Markov chains. All the 2 x p subsets of X (p = 1, 2, ..., m) are identically distributed, but the distribution of X is not necessarily invariant under rotations. Still, realizations of MCRFs appear more realistic than realizations of Pickard random fields (Fig. 2).

Fig. 1. A realization of a Pickard random field with P{X_{12} = 1 | X_{11} = 1} = P{X_{12} = 2 | X_{11} = 2} = P{X_{21} = 1 | X_{11} = 1} = P{X_{21} = 2 | X_{11} = 2} = 0.8.

Fig. 2. A realization of a binary MCRF generated by a 2 x 2 MRF with α_1 = α_2 = 0, β^t_{11} = β^t_{22} = 0.3, and β^t_{12} = β^t_{21} = -0.3 for t = 1, 2, 3, 4.
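To make the counts and the energy (2) concrete, the following sketch (our own illustrative code, not the author's; classes are coded 1, ..., τ) computes n_a(x), n^t_{ab}(x) and U(x; θ) for a lattice of labels:

```python
import numpy as np

def energy(x, alpha, beta):
    """Energy U(x; theta) of eq. (2) for a lattice of class labels.

    x     : n x m integer array with labels 1..tau
    alpha : length-tau vector of the class parameters alpha_a
    beta  : (4, tau, tau) array of pair parameters beta^t_ab for the four
            directions t = 1..4 (right, down, down-right, down-left).
    Illustrative sketch only, not the author's code.
    """
    x = np.asarray(x)
    tau = len(alpha)
    # n_a(x): frequency of each class on the lattice
    n_a = np.array([(x == a + 1).sum() for a in range(tau)])
    u = -float(np.dot(n_a, alpha))
    offsets = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for t, (di, dj) in enumerate(offsets):
        # neighbour in direction t; pairs that would wrap around the border are masked
        shifted = np.roll(np.roll(x, -di, axis=0), -dj, axis=1)
        valid = np.ones_like(x, dtype=bool)
        if di:
            valid[-di:, :] = False
        if dj > 0:
            valid[:, -dj:] = False
        elif dj < 0:
            valid[:, :(-dj)] = False
        for a in range(tau):
            for b in range(tau):
                # n^t_ab(x): number of (a, b) pairs in direction t
                n_ab = ((x == a + 1) & (shifted == b + 1) & valid).sum()
                u -= n_ab * beta[t, a, b]
    return u
```

The unnormalized probability of a realization is exp{-U(x; θ)} as in (1); computing Z_θ exactly requires a sum over all label configurations and is feasible only for very small lattices such as the 2 x 2 generating set.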


2.3. The identifiability of parameters

The identifiability of parameters is an important theoretical issue when a statistical model is defined. Unless the identifiability conditions hold, it is not possible to obtain unique maximum likelihood estimates. We first discuss identifiability for noise-free realizations of a Markov random field.

Identifiability problems arise if we attempt to estimate all the parameters α_a and β^t_{ab} in the model (1) without any constraints. There is no unique estimate θ maximizing the likelihood function P{X = x; θ} (1) given an observation x. As an example, the parameters α_a + ρ and α_a yield identical probability mass functions for any constant ρ. Therefore, estimates α̂_a can always be replaced by estimates α̂_a + ρ. To resolve this difficulty we reduce the dimension of the parameter space by expressing certain parameters as functions of the other parameters. Under the conditions (3), for example, only the first τ - 1 values of α_a are estimated and α_τ = -α_1 - α_2 - ... - α_{τ-1}.

The parameters are called identifiable if different parameters always yield different distributions. In other words, P{X = x; θ} = P{X = x; ϑ} for every x only if θ = ϑ. To verify identifiability for a chosen parameterization, we apply Theorem 1 or Corollary 1, proved in the Appendix. Let Δ(θ) = {U(x_1; θ) - U(x_2; θ) : x_1, x_2 in Ω^n}, where U(x; θ) is the energy function (2), Ω is the range of the X_ij and n is the number of sites.

Theorem 1. If Δ(θ) ≠ Δ(ϑ) for all θ ≠ ϑ, then the parameter θ in (1) is identifiable.

Corollary 1. Suppose there exists an invertible matrix A and a vector ψ(θ) whose components belong to Δ(θ) such that ψ(θ) = Aθ for all θ. Then θ in (1) is identifiable.

We first consider a 2 x 2 MRF with energy function U(X_{11}, X_{12}, X_{21}, X_{22}) (2) under the conditions (3). Corollary 1 holds for τ = 2 with

ψ(θ) = ( U(2,2,2,2) - U(1,1,1,1),
         U(2,2,2,2) - U(2,1,1,2),
         U(2,2,2,2) - U(2,2,2,1),
         U(2,2,2,2) - U(2,2,1,1) )'   and   θ = (α_1, β^1_{11}, β^2_{11}, β^3_{11})'.

In the same way, the author has verified identifiability for τ = 3, 4, ..., 25 under the conditions (3). Identifiability for a generating set on a 2 x 2 lattice implies identifiability for a MCRF and for a MRF on a larger lattice (Theorem 2). If the observations in the kth class follow the normal distribution N(μ_k, σ_k²) and μ_k ≠ μ_j for all k ≠ j, then the parameters are identifiable from noisy observations as well.

Theorem 2. If the parameters of a MRF (1) are identifiable on a 2 x 2 lattice, they are identifiable on any n x m lattice (n, m > 2).

Proof. See Appendix.
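As a numerical illustration of Corollary 1 for τ = 2 (a sketch under the parameterization reconstructed above, i.e. α_2 = -α_1, β^t_{12} = β^t_{21} = -β^t_{11}, β^t_{22} = β^t_{11} and β^4 = β^3; the code and names are ours, not the author's), one can evaluate ψ at the unit vectors of θ = (α_1, β^1_{11}, β^2_{11}, β^3_{11})' to recover the matrix A and check that it is invertible:

```python
import numpy as np

# tau = 2; theta = (alpha_1, beta^1_11, beta^2_11, beta^3_11).  Under the
# reconstructed constraints (3): alpha_2 = -alpha_1, beta^t_12 = beta^t_21 =
# -beta^t_11, beta^t_22 = beta^t_11, and beta^4 = beta^3.  Illustrative only.

def U(config, theta):
    """Energy (2) of a 2 x 2 configuration (x11, x12, x21, x22)."""
    a1, b1, b2, b3 = theta
    x11, x12, x21, x22 = config
    alpha = {1: a1, 2: -a1}
    same = lambda a, b: 1.0 if a == b else -1.0   # beta^t_ab = b_t * same(a, b)
    u = -sum(alpha[x] for x in config)
    u -= b1 * (same(x11, x12) + same(x21, x22))   # direction 1: horizontal pairs
    u -= b2 * (same(x11, x21) + same(x12, x22))   # direction 2: vertical pairs
    u -= b3 * same(x11, x22)                      # direction 3: main diagonal
    u -= b3 * same(x12, x21)                      # direction 4: anti-diagonal (beta^4 = beta^3)
    return u

def psi(theta):
    """The vector psi(theta) used in Corollary 1 for tau = 2."""
    ref = (2, 2, 2, 2)
    others = [(1, 1, 1, 1), (2, 1, 1, 2), (2, 2, 2, 1), (2, 2, 1, 1)]
    return np.array([U(ref, theta) - U(c, theta) for c in others])

# psi is linear in theta, so its values at the unit vectors give the columns of A
A = np.column_stack([psi(e) for e in np.eye(4)])
print(A)
print("det(A) =", np.linalg.det(A))   # non-zero determinant => theta is identifiable
```

A non-zero determinant means that ψ(θ) = Aθ determines θ, which is the content of Corollary 1.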

3. A M O D E L OF A BIASED T R A I N I N G SET

To estimate the proportions of classes we first have to know the distribution of observations in each class. It is straightforward to estimate the distributions from a training set of observations that have already been classified. In satellite image analysis the class of some observations is often determined from aerial photographs, for example.

A training set is not always representative. If the sample has been subjectively chosen or some observations have been misclassified,(3) estimates from the training set may be biased. Our image and the training set are sometimes from different regions. The area studied may be so heterogeneous that a single training set is not reliable everywhere. Thus the observations in the training set do not necessarily have the same distribution as in our data.

Assume an observation belonging to the kth class follows the normal distribution N(μ_k, σ_k²) in the image and N(μ_k^{tr}, σ_k²) in the training set. For the bias b_k = μ_k^{tr} - μ_k we define a prior distribution

b_k = μ_k^{tr} - μ_k ~ N(0, δ²).   (5)

This distribution arises when μ_k^{tr} and μ_k are independent random variables following N(μ̃_k, δ²/2) for some μ̃_k. A small bias variance δ² implies that the expectation μ_k^{tr} of observations in the training set is close to the expectation μ_k in our image. When δ² is large, the mean of observations in the kth class of the training set is an unreliable estimate of μ_k. We choose δ empirically or subjectively to express our confidence in the training set.
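A minimal simulation sketch of the bias model (5) (values and names are illustrative, not from the paper): under (5) the class mean of a biased training set varies around μ_k with variance δ² + σ_k²/n_k, the quantity that appears in the marginal likelihood derived below.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_k, sigma_k, delta, n_k = 2.0, 1.0, 0.3, 50   # illustrative values only

# Draw a bias b_k ~ N(0, delta^2), then n_k training observations
# Y_kj = mu_k + b_k + e_kj with e_kj ~ N(0, sigma_k^2), and record the class mean.
means = []
for _ in range(20000):
    b_k = rng.normal(0.0, delta)
    y = mu_k + b_k + rng.normal(0.0, sigma_k, size=n_k)
    means.append(y.mean())

# The empirical variance of the class mean should be close to
# delta^2 + sigma_k^2 / n_k, the marginal variance used below in (7).
print(np.var(means), delta**2 + sigma_k**2 / n_k)
```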


We next derive two likelihood functions of observations in a training set. Denote the jth observation in class k by Y^{tr}_{kj} = μ_k + b_k + e_{kj}, where e_{kj} ~ N(0, σ_k²), j = 1, 2, ..., n_k. The observations are assumed conditionally independent given the classes, and the bias terms b_k are independent of the errors e_{kj}. The joint density function of the observations and the b_k is then

φ_tr(Y^{tr}, b; γ, δ) = Π_{k=1}^{τ} (1/δ) φ(b_k/δ) Π_{j=1}^{n_k} (1/σ_k) φ((Y^{tr}_{kj} - μ_k - b_k)/σ_k)   (6)

where φ denotes the density function of N(0, 1). If we do not attempt to estimate the bias terms, we use the marginal density function of the sufficient statistics

Ȳ^{tr}_k = (1/n_k) Σ_{j=1}^{n_k} Y^{tr}_{kj}   and   S_k²(tr) = (1/n_k) Σ_{j=1}^{n_k} (Y^{tr}_{kj} - Ȳ^{tr}_k)².

It is easy to prove that they are independent, that Ȳ^{tr}_k follows N(μ_k, δ² + σ_k²/n_k) and that n_k S_k²(tr)/σ_k² follows the chi-square distribution with n_k - 1 degrees of freedom. Thus an alternative to (6) is

φ_tr(Ȳ^{tr}; γ, δ) = Π_{k=1}^{τ} (δ² + σ_k²/n_k)^{-1/2} φ((Ȳ^{tr}_k - μ_k)/√(δ² + σ_k²/n_k)) f_χ(n_k S_k²(tr)/σ_k²)   (7)

where f_χ denotes the density function of the chi-square distribution with n_k - 1 degrees of freedom. The first term in (7) implies that with a large training set and a small δ we force the estimate of μ_k to be close to Ȳ^{tr}_k.

Our Bayesian approach is easily generalized to multinormal distributions. The covariance matrices Σ_k and Σ_k^{tr} of each class k can be considered as random variables. As an example, one may assume that λ [Σ_k^{tr} Σ_k^{-1}]_{ij} follows a chi-square distribution for some λ > 0.
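A short sketch of evaluating the logarithm of one factor of the reconstructed marginal likelihood (7) with standard SciPy densities (names and values are illustrative; this is not the author's implementation):

```python
import numpy as np
from scipy import stats

def log_phi_tr_k(ybar, s2, n, mu, sigma, delta):
    """Log of the kth factor of (7): a normal density for the training-set class
    mean, with variance delta^2 + sigma^2/n, times the chi-square density
    (n - 1 degrees of freedom) evaluated at n*S^2/sigma^2."""
    scale = np.sqrt(delta**2 + sigma**2 / n)
    term_mean = stats.norm.logpdf(ybar, loc=mu, scale=scale)
    term_var = stats.chi2.logpdf(n * s2 / sigma**2, df=n - 1)
    return term_mean + term_var

# made-up training-set statistics for one class
print(log_phi_tr_k(ybar=2.1, s2=0.9, n=50, mu=2.0, sigma=1.0, delta=0.3))
```

In the full estimator this term is multiplied over the classes and combined with the image pseudolikelihood of Section 4; a small δ makes the first term dominate and pulls the estimate of μ_k towards Ȳ^{tr}_k, as noted above.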

4. PSEUDOLIKELIHOOD ESTIMATORS

Since we consider the true scene as a realization of a random field, we derive here methods for estimating the prior probabilities of classes. A randomly chosen site (i, j) belongs to class k with probability π_k = P{X_{ij} = k}. Their estimates π̂_k are used as estimates of class proportions. If our training set is small or biased, we update the estimates of class attributes using unclassified data.(21,22) However, it is difficult to estimate the parameters of a mixture of normal distributions given only the number of classes. The maximum likelihood method may fail since the likelihood function often has several local maxima.(9) If there are no distinct clusters in the feature space, the estimates are unreliable. In segmentation, analogous problems with the K-means algorithm are solved by finding regions of uniform intensity.(27) We therefore use contextual likelihood methods. A random field model provides us with a constraint: neighbouring observations are usually in the same class.

We define three estimation methods (M1)-(M3) that differ in their emphasis on contextual information. (M1) is a non-contextual method.(22) (M2) is a nonparametric contextual method without specific random field assumptions; it is similar to other methods for estimating the prior and transition probabilities simultaneously.(21,23,26) With (M3) we estimate the distribution of the generating set (4) of a MCRF under the constraints (3) and β^t_{ab} = β^2_{ab} (t = 2, 3, 4). To use both the image Y and the training set Y^{tr}, we maximize φ_tr(Y^{tr}, b; γ, δ) φ_p(Y; θ_p, γ) or φ_tr(Ȳ^{tr}; γ, δ) φ_p(Y; θ_p, γ), where

(M1)   φ_1(Y; θ_1, γ) = Π_{i=1}^{n} Π_{j=1}^{m} f_1(Y_{ij}; θ_1, γ) = Π_{i=1}^{n} Π_{j=1}^{m} Σ_{a=1}^{τ} π_a g(Y_{ij} | X_{ij} = a; γ),   (8)

(M2)   φ_2(Y; θ_2, γ) = Π_{i=1}^{n} Π_{j=1}^{m-1} f_2(Y_{ij}, Y_{i,j+1}; θ_2, γ)   (j odd),   (9)

(M3)   φ_3(Y; θ_3, γ) = Π_{i=1}^{n-1} Π_{j=1}^{m-1} f_3(Y_{ij}, Y_{i+1,j}, Y_{i,j+1}, Y_{i+1,j+1}; θ_3, γ)   (i, j odd).   (10)

Here g(Y_{ij} | X_{ij} = a; γ) denotes the conditional density function of an observation Y_{ij} given that the pixel (i, j) is in class X_{ij} = a. The functions f_2 and f_3 denote the density functions of two and four neighbouring observations, respectively. These pseudolikelihood functions are not, in general, genuine density functions. Nevertheless, the estimators are consistent under weak conditions.(25)

With method (M2) we first estimate the marginal probabilities P_ab = P{X_{ij} = a, X_{i,j+1} = b} under the reversibility constraints P_ab = P_ba (a, b = 1, 2, ..., τ). In (9)

f_2(Y_{ij}, Y_{i,j+1}; θ_2, γ) = Σ_{a,b} P_ab g(Y_{ij} | X_{ij} = a; γ) g(Y_{i,j+1} | X_{i,j+1} = b; γ).

To simplify the numerical maximization of our likelihood we used a parameter vector θ_2 with components α_a and β^1_{ab} satisfying (3):

P_ab = exp{α_a + β^1_{ab} + α_b} / Σ_{a,b} exp{α_a + β^1_{ab} + α_b}.

The estimates of prior probabilities are then π̂_a = Σ_b P̂_ab. In (M3), the prior probabilities were computed in the same way from the estimated distribution of the generating set. If we have observed X_{ij} = a, then f_1(Y_{ij}; θ_1, γ) = g(Y_{ij} | X_{ij} = a; γ) in (8), and the other functions f_2 and f_3 are modified in the same way.

The likelihoods are maximized numerically. If the distribution of observations in each class were known, we could use simple estimation methods.(26,28) To estimate the class attributes and prior probabilities simultaneously, we used Powell's maximization algorithm.(29) If the likelihood functions have local maxima, the final estimates may depend on the initial values. They were chosen here as follows. For class k, μ_k and σ_k² were first estimated from a training set by the mean Ȳ^{tr}_k and the sample variance S_k²(tr), respectively. With these estimates we classified each observation Y_{ij} of our image by maximizing the conditional density function g(Y_{ij} | X_{ij} = k; γ̂) with respect to k. After this maximum likelihood classification, the prior probabilities π_a were estimated by the proportions of labels. In (M2) and (M3), we used initial values α_a = log(π̂_a) and β^t_{ab} = 0.
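To show the structure of the contextual method (M2), here is a condensed sketch (our own illustrative code: one-dimensional observations, a plainly symmetrized β matrix instead of the full constraints (3), SciPy's Powell optimizer standing in for the routine of reference (29), and the training-set factor (6)/(7) omitted):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

TAU = 3   # number of classes, as in the experiments of Section 5

def pair_probs(alpha, beta):
    """Reversible pair probabilities P_ab proportional to exp{alpha_a + beta_ab + alpha_b}."""
    beta = (beta + beta.T) / 2.0                  # enforce P_ab = P_ba
    w = np.exp(alpha[:, None] + beta + alpha[None, :])
    return w / w.sum()

def neg_log_pl(params, y_left, y_right):
    """Negative log pseudolikelihood (M2) over horizontal pairs of observations.
    params packs alpha (TAU), beta (TAU*TAU), class means (TAU), log-sigmas (TAU)."""
    alpha = params[:TAU]
    beta = params[TAU:TAU + TAU * TAU].reshape(TAU, TAU)
    mu = params[TAU + TAU * TAU:TAU + TAU * TAU + TAU]
    sigma = np.exp(params[-TAU:])
    P = pair_probs(alpha, beta)
    g_l = norm.pdf(y_left[:, None], loc=mu, scale=sigma)    # g(y_ij | class a)
    g_r = norm.pdf(y_right[:, None], loc=mu, scale=sigma)   # g(y_i,j+1 | class b)
    # f_2(y_ij, y_i,j+1) = sum_{a,b} P_ab g(y_ij | a) g(y_i,j+1 | b), as in (9)
    f2 = np.einsum('ia,ab,ib->i', g_l, P, g_r)
    return -np.log(f2 + 1e-300).sum()

def estimate_m2(y_left, y_right, init):
    """Maximize the pseudolikelihood with Powell's method; return pi_hat_a = sum_b P_hat_ab."""
    res = minimize(neg_log_pl, init, args=(y_left, y_right), method='Powell')
    alpha = res.x[:TAU]
    beta = res.x[TAU:TAU + TAU * TAU].reshape(TAU, TAU)
    return pair_probs(alpha, beta).sum(axis=1)

# initial values as described in the text, e.g.
# init = np.concatenate([np.log(pi0), np.zeros(TAU * TAU), mu0, np.log(sigma0)])
```

The estimated prior probabilities are the row sums of the estimated P, exactly as π̂_a = Σ_b P̂_ab above.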


5. EXPERIMENTS

5.1. The data

We made two experiments using artificial 64 x 64 images with three classes. The true scenes X were simulated with 50 sweeps of the Gibbs sampler(5) for parameters α_1 = -0.05, α_2 = 0.05, α_3 = 0, β^t_{aa} = 0.25 and β^t_{ab} = -0.125 (t = 1, 2, 3, 4; a, b = 1, 2, 3; b ≠ a). On average, 20% of the pixels were in class 1 and 55% in class 2. Conditionally independent observations Y_{ij} were simulated from N(μ_k, 1) for X_{ij} = k (μ_1 = 0, μ_2 = 2 and μ_3 = 4). In the simulated training sets, observations in the kth class followed N(μ_k^{tr}, 1). Denote the true proportion of class k in the ith image by p_k(i) and the corresponding estimate by π̂_k(i). To evaluate the estimates we define the mean error of π̂_k in n images by (1/n) Σ_{i=1}^{n} (π̂_k(i) - p_k(i)) and the mean absolute estimation error MAE(π̂) by

MAE(π̂) = (1/n) Σ_{i=1}^{n} Σ_{k=1}^{τ} |π̂_k(i) - p_k(i)|.   (11)

MAE(μ̂) is defined in the same way.

In the first experiment, we compared training sets with different sizes and expectations μ_k^{tr} (Table 1). Twenty images were simulated as explained above, and for each image we simulated four training sets with n_tr observations in each class. In the unbiased training sets T-1000u the expectations μ_k^{tr} were equal to μ_k, whereas the sets T-50b and T-1000b were slightly biased. Our estimates maximized φ_tr(Y^{tr}, b; γ, δ) φ_p(Y; θ_p, γ) (p = 1, 2, 3). The bias terms b_k = μ_k^{tr} - μ_k were estimated only for δ > 10^{-15} in (6).
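A sketch of how such true scenes can be simulated with a single-site Gibbs sampler for the MRF (1)-(2) (the parameter values are those quoted above; the code itself is illustrative, not the author's, and uses a small lattice and few sweeps so that it runs quickly):

```python
import numpy as np

def gibbs_scene(n=64, m=64, tau=3, sweeps=50, seed=0):
    """Simulate a true scene from the MRF (1)-(2) by single-site Gibbs sampling,
    with alpha = (-0.05, 0.05, 0), beta^t_aa = 0.25 and beta^t_ab = -0.125 (a != b)
    for all four directions t, as in the first experiment."""
    rng = np.random.default_rng(seed)
    alpha = np.array([-0.05, 0.05, 0.0])
    beta = np.where(np.eye(tau, dtype=bool), 0.25, -0.125)   # same matrix for t = 1..4
    x = rng.integers(tau, size=(n, m))                       # classes coded 0..tau-1 internally
    nbrs = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    for _ in range(sweeps):
        for i in range(n):
            for j in range(m):
                # with beta symmetric and equal over the directions, each of the eight
                # neighbours adds beta[class, neighbour class] to the local log-probability
                lp = alpha.copy()
                for di, dj in nbrs:
                    ii, jj = i + di, j + dj
                    if 0 <= ii < n and 0 <= jj < m:
                        lp += beta[:, x[ii, jj]]
                p = np.exp(lp - lp.max())
                x[i, j] = rng.choice(tau, p=p / p.sum())
    return x + 1   # classes coded 1..tau as in the text

# small lattice and few sweeps for a quick check; observations Y_ij ~ N(mu_k, 1)
scene = gibbs_scene(n=16, m=16, sweeps=5)
y = np.array([0.0, 2.0, 4.0])[scene - 1] + np.random.default_rng(1).normal(size=scene.shape)
```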

Table 1. The training sets of the first experiment

Set        n_tr    μ_1^{tr}   μ_2^{tr}   μ_3^{tr}
T-1000u    1000    0          2          4
T-20b        20    0.5        1.5        4
T-50b        50    0.2        2.2        4
T-1000b    1000    0.2        2.2        4


Fig. 3. Average MAE(π̂) in three replications as a function of the bias b_2 (b_1 = b_3 = 0) for maximum likelihood classification (●), (M1) (○), (M2) (△) and (M3) (□), with (a) δ = 10^{-5} and (b) δ = 0.3.

For infinite bias variance δ² = ∞, we first initialized the estimates using T-1000b and then ignored the training set by maximizing φ_p(Y; θ_p, γ). Table 2 presents the results.

In the second experiment we studied how large a bias b_2 = μ_2^{tr} - μ_2 can be tolerated when b_1 = b_3 = 0. For each b_2 = -1.8, -1.6, ..., 1.8, we simulated three images and three training sets with 100 observations in each class. To save time, we estimated the parameters from 25% of the observations by maximizing the product of φ_tr(Ȳ^{tr}; γ, δ) (7) and φ_p(Y; θ_p, γ) (p = 1, 2, 3), with i, j = 1, 3, 5, ..., 63 in (8), i = 1, 3, 5, ..., 63 and j = 1, 5, 9, ..., 61 in (9), and i, j = 1, 5, 9, ..., 61 in (10). Figure 3 shows the averages of MAE(π̂) for δ = 10^{-5} and δ = 0.3.

Table 2. The mean error of π̂_1, MAE(π̂) and MAE(μ̂) in 20 images with the training sets of Table 1

                        Mean error of π̂_1 (%)        MAE(π̂) (%)                MAE(μ̂)
Set        δ            (M1)     (M2)     (M3)        (M1)    (M2)    (M3)      (M1)    (M2)    (M3)
T-1000u    10^{-15}     -0.33    -0.30    -0.30        0.82    0.85    0.81      0.03    0.03    0.02
T-1000u    0.3          -0.34    -0.05    -0.14        1.48    1.08    1.10      0.07    0.05    0.05
T-20b      0.3           3.82     2.11     0.90        6.52    3.69    2.15      0.16    0.10    0.07
T-50b      10^{-5}       2.44     1.96     1.33        3.31    2.89    1.92      0.12    0.10    0.08
T-50b      0.3           0.33     0.21     0.00        3.48    2.92    1.96      0.10    0.08    0.06
T-1000b    10^{-15}      4.63     4.48     4.23        3.09    2.98    2.83      0.13    0.13    0.12
T-1000b    ∞            -1.35    -0.12     1.74       15.27    5.12    1.46      0.33    0.12    0.06
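The mean error and MAE columns of Table 2 are averages of the per-image quantities defined in (11); a minimal sketch of computing them from hypothetical arrays of true and estimated proportions (shapes and values made up):

```python
import numpy as np

def mean_error(p_hat, p_true, k):
    """Mean error of the kth class proportion over n images: (1/n) sum_i (pi_hat_k(i) - p_k(i))."""
    return np.mean(p_hat[:, k] - p_true[:, k])

def mae(p_hat, p_true):
    """MAE as in the reconstructed (11): average over images of absolute errors summed over classes."""
    return np.mean(np.abs(p_hat - p_true).sum(axis=1))

# hypothetical true and estimated proportions for 20 images and 3 classes
rng = np.random.default_rng(2)
p_true = rng.dirichlet(np.ones(3), size=20)
p_hat = p_true + rng.normal(0.0, 0.01, size=p_true.shape)
print(mean_error(p_hat, p_true, k=0), mae(p_hat, p_true))
```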


5.2. Results

Classification methods yielded inaccurate estimates. With any training set of Table 1, MAE(π̂) was larger than 7% for maximum likelihood (ML) classification and larger than 2.8% for ICM,(6) even if we used the true values of α_a and β^t_{ab} in the ICM algorithm. With the biased training sets T-50b and T-1000b too many pixels in the second class were classified into the first class. As a result, the proportion of class 1 was too large; the mean error of π̂_1 was 9.2% for ML and 5.8% for ICM.

Using the likelihood methods (M1)-(M3) we obtained better results (Table 2; Fig. 3), since we were able to update the estimates of class expectations and variances with unclassified pixels. The performance of our estimators depended on the size and quality of the training sets. When the training set was large and unbiased, the non-contextual method (M1) was fast and accurate. (M1) was, however, sensitive to bias in class means. It was not always possible to update the estimates with unclassified observations: the estimates μ̂_k were often too close to the class means Ȳ^{tr}_k in the training sets, and accordingly the estimates of class proportions were biased. The contextual methods (M2) and (M3) were more robust than (M1), with (M3) yielding slightly better results than (M2) (Table 2; Fig. 3). These methods are time-demanding, but as in the second experiment, time can be saved by using a subset of observations.

The choice of the bias variance δ² affected the results for biased training sets. When we erroneously used δ² ≈ 0 with the biased sets T-50b and T-1000b, the estimates were inaccurate. The estimates of class expectations were determined by the training set, and π̂_1 was biased. With δ = 0.3, in contrast, the training set had less effect. We sometimes even discarded prior information in T-20b: we identified classes 1 and 2 with μ̂_1 ≈ μ_2 and μ̂_2 ≈ μ_1, for example, and the labels had to be interchanged before evaluation. When the bias b_2 was large, estimates were more accurate with δ = 0.3 than with δ = 10^{-5} (Fig. 3). For a training set with no bias, δ > 0 implies a pessimistic attitude and a possible loss of efficiency. Nevertheless, (M2) and (M3) yielded good results for T-1000u with δ = 0.3. Too large values of δ should, of course, be avoided. With δ = ∞ we ignored all the information in T-1000b and consequently obtained large errors with (M2).

5.3. Conclusions

Our experiments imply that likelihood methods should be preferred to classification. Using a simple yet realistic model of the true scene, we obtained better estimates of class proportions than with non-contextual methods. In this respect, our results agree with earlier studies.(21-23,26) Furthermore, we suggest that bias in the class means of a training set be modelled as a normally distributed random variable with variance δ². Our approach may be more generally applicable than a misclassification model.(3) The choice of the bias variance δ² expresses our confidence in the training set. Instead of rejecting a training set with slightly biased class means, we used a reasonable value of δ² to reduce the effect of the bias. Evidently it is important to model the reliability of a training set.

6. SUMMARY

We studied how to use contextual information when estimating class proportions from an image. With both training sets and unclassified observations we also estimated the expectation and variance of observations in each class. The true scene was modelled as a Markov chain random field whose distribution is determined by a 2 x 2 Markov random field. Conditions were derived for the identifiability of parameters. We estimated the joint distribution of classes at two or four neighbouring sites. In simulation experiments we used biased training sets of observations that did not have the same expectation as in the image. This bias was modelled as a normally distributed stochastic parameter whose variance describes the reliability of a training set. With a suitable choice of the bias variance we reduced errors due to a biased training set. Maximum likelihood classification and the ICM algorithm yielded less accurate estimates than the likelihood methods. For large and reliable training sets, a non-contextual estimator maximizing the product of marginal likelihoods of observations was accurate enough. For training sets with bias, in contrast, contextual likelihood methods yielded the best results.

Acknowledgements--I thank the referees for commenting on the manuscript. This work was supported by the Academy of Finland and the Emil Aaltonen Foundation.

REFERENCES

1. R. S. Chhikara, ed., Commun. Statist. 13(23) (special issue on crop surveys using satellite data) (1984).
2. S. Poso, T. Häme and R. Paananen, A method of estimating the stand characteristics of a forest compartment using satellite imagery, Silva Fenn. 18, 261-292 (1984).
3. R. S. Chhikara, Error analysis of crop acreage estimation using satellite data, Technometrics 28, 73-80 (1986).
4. G. E. Battese, R. M. Harter and W. A. Fuller, An error-components model for prediction of county crop areas using survey and satellite data, J. Am. Statist. Ass. 83, 28-36 (1988).
5. S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Analysis Mach. Intell. 6, 721-741 (1984).
6. J. Besag, On the statistical analysis of dirty pictures, J. R. Statist. Soc. Ser. B 48, 259-302 (1986).
7. J. Besag, Towards Bayesian image analysis, J. Appl. Statist. 16, 395-407 (1989).
8. C. R. O. Lawoko and G. J. McLachlan, Bias associated with the discriminant analysis approach to the estimation of mixing proportions, Pattern Recognition 22, 763-766 (1989).
9. D. M. Titterington, A. F. M. Smith and U. E. Makov, Statistical Analysis of Finite Mixture Distributions. Wiley, New York (1987).
10. B. Chalmond, An iterative Gibbsian technique for reconstruction of m-ary images, Pattern Recognition 22, 747-761 (1989).
11. L. Younes, Parametric inference for imperfectly observed Gibbsian fields, Probab. Theory Related Fields 82, 625-645 (1989).
12. A. Veijanen, A simulation-based estimator for hidden Markov random fields, IEEE Trans. Pattern Analysis Mach. Intell. 13, 825-830 (1991).
13. J. Goutsias, Unilateral approximation of Gibbs random field images, CVGIP: Graph. Models Image Process. 53, 240-257 (1991).
14. K. Abend, T. J. Harley and L. N. Kanal, Classification of binary random patterns, IEEE Trans. Inf. Theory 11, 538-544 (1965).
15. D. K. Pickard, A curious binary lattice process, J. Appl. Probab. 14, 717-731 (1977).
16. D. K. Pickard, Unilateral Markov fields, Adv. Appl. Probab. 12, 655-671 (1980).
17. H. Derin, H. Elliott, R. Cristi and D. Geman, Bayes smoothing algorithms for segmentation of binary images modeled by Markov random fields, IEEE Trans. Pattern Analysis Mach. Intell. 6, 707-720 (1984).
18. P. A. Devijver, Probabilistic labelling in a hidden second order Markov mesh, Philips Research Lab. Report (1985).
19. J. Haslett, Maximum likelihood discriminant analysis on the plane using a Markovian model of spatial context, Pattern Recognition 18, 287-296 (1985).
20. E. Mohn, N. L. Hjort and G. O. Storvik, A simulation study of some contextual classification methods for remotely sensed data, IEEE Trans. Geosci. Remote Sens. 25, 796-804 (1987).
21. N. L. Hjort, Estimating parameters in neighbourhood based classifiers for remotely sensed data, using unclassified vectors, in Contextual Classification of Remotely Sensed Data, Statistical Methods and Development of a System, H. V. Sæbø, K. Bråten, N. L. Hjort, B. Llewellyn and E. Mohn, eds. Norwegian Computing Center, Report 768 (1985).
22. N. L. Hjort, Notes in the Theory of Statistical Symbol Recognition. Norwegian Computing Center, Report 778 (1986).
23. G. R. Dattatreya, Unsupervised context estimation in a mesh of pattern classes for image recognition, Pattern Recognition 24, 685-694 (1991).
24. W. Qian and D. M. Titterington, Multi-dimensional Gibbs Chain Texture Models. Preprint, University of Glasgow (1989).
25. A. Veijanen, On Estimation of Parameters of Partially Observed Random Fields and Mixing Processes. Tilastotieteellisiä tutkimuksia 9, Finnish Statistical Society, Helsinki. Dissertation (1989).
26. G. R. Dattatreya, Estimation of prior and transition probabilities in multiclass finite Markov mixtures, IEEE Trans. Syst. Man Cybern. 21, 418-426 (1991).
27. T. N. Pappas, An adaptive clustering algorithm for image segmentation, IEEE Trans. Signal Process. 40, 901-913 (1992).
28. G. R. Dattatreya, Estimation of mixing probabilities in multiclass finite mixtures, IEEE Trans. Syst. Man Cybern. 20, 149-158 (1990).
29. W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling, Numerical Recipes in C. Cambridge University Press, London (1988).


APPENDIX

Proof of Theorem 1. If P{X = x; θ} = P{X = x; ϑ} with probability one, then Z_θ^{-1} exp{-U(x; θ)} = Z_ϑ^{-1} exp{-U(x; ϑ)} for every x in Ω^n. This equality yields U(x; θ) = U(x; ϑ) + log(Z_ϑ/Z_θ) for every x. Therefore, Δ(θ) = Δ(ϑ), which implies θ = ϑ. □

Proof of Corollary 1. Suppose θ ≠ ϑ. Then ψ(θ) = Aθ ≠ Aϑ = ψ(ϑ), since A is invertible. Hence Δ(θ) ≠ Δ(ϑ). By Theorem 1, the parameter θ is then identifiable. □

Proof of Theorem 2. Denote by U_W(A, B, C, D) the energy function of

W = ( A  B
      C  D ).

Suppose P{X = x; θ} = P{X = x; ϑ} for all realizations x. We prove that then P{W = w; θ} = P{W = w; ϑ} for every w.

(1) Consider first a 3 x 2 MRF

X = ( A  B
      C  D
      E  F )

with an energy function

U(X; θ) = U_W(A, B, C, D; θ) + f_1(C, D, E, F; θ) = U_W(C, D, E, F; θ) + f_2(A, B, C, D; θ),

where f_1 and f_2 are certain linear combinations of the interaction functions α and β^t. Introduce the following realizations:

g_1 = ( a  b
        c  d
        e  f )

and

g_2 = ( e  f
        a  b
        c  d ).

Since there exists δ = δ(θ, ϑ) such that U(g_1; θ) - U(g_1; ϑ) = δ for any g_1 (cf. the proof of Theorem 1), we obtain

U_W(a, b, c, d; θ) - U_W(a, b, c, d; ϑ) = f_1(c, d, e, f; ϑ) - f_1(c, d, e, f; θ) + δ.

This difference is independent of (a, b). Using U(g_2; θ) we show that it is independent of (c, d) as well:

U_W(a, b, c, d; θ) - U_W(a, b, c, d; ϑ) = f_2(e, f, a, b; ϑ) - f_2(e, f, a, b; θ) + δ.

Hence U_W(a, b, c, d; θ) - U_W(a, b, c, d; ϑ) is constant for every a, b, c, d. This result also holds for a 2 x 3 and an r x 2 MRF (r > 3).

(2) Consider an r x s MRF (r, s > 2) with X_{11} = A, X_{12} = B, X_{21} = C, and X_{22} = D. Then

U(X; θ) = U_W(A, B, C, D; θ) + f(X'; θ)   (A1)

where X' denotes the set of random variables X_{uv}, (u, v) ≠ (1, 1), and f is a certain linear combination of interaction functions. The value of A contributes only to U_W(A, B, C, D; θ), not to f(X'; θ). Let δ = U(X; θ) - U(X; ϑ). By (A1)

U_W(A, B, C, D; θ) + f(X'; θ) = U_W(A, B, C, D; ϑ) + f(X'; ϑ) + δ.

Therefore, U_W(A, B, C, D; θ) - U_W(A, B, C, D; ϑ) = f(X'; ϑ) - f(X'; θ) + δ is independent of the value of A. In the same way we could derive, using the other corners of the lattice, that δ_W = U_W(A, B, C, D; θ) - U_W(A, B, C, D; ϑ) is also independent of the values of B, C and D. Now Z_W(θ) = exp{-δ_W} Z_W(ϑ). Therefore, P{W = w; θ} = P{W = w; ϑ} for every w. Under the assumptions, the parameters of W were identifiable. Hence θ = ϑ. This concludes the proof. □

About the Author--ARI VEIJANEN received an M.Sc. degree in 1986 and a doctoral degree in statistics in 1989 from the University of Helsinki. He is currently with the Academy of Finland. His research activities include estimation and segmentation problems in image analysis.