Use of probability entropy for the estimation and graphical representation of the accuracy of maximum likelihood classifications


Volume 49, number 2, 1994


Fabio Maselli 1, Claudio Conese 1 and Ljiljana Petkov 2

A method is proposed for statistically evaluating the accuracy levels of maximum likelihood classifications and representing them graphically. Based on the concept that the heterogeneity of maximum likelihood membership probabilities can be taken as an indicator of the confidence in the classification, this heterogeneity is estimated for all pixels as relative probability entropy and represented in a separate channel. After a brief presentation of its statistical basis, the methodology is applied to one conventional and two modified maximum likelihood classifications in a case study using Landsat TM scenes. The results demonstrate the efficiency of the approach and, particularly, its usefulness for operational applications.

1. Introduction

Classifications of high-resolution remotely sensed data based on the maximum likelihood (ML) approach are now extremely widespread. This approach, which relies on Bayes' theorem of joint probabilities, has several advantages, mainly connected with its theoretical simplicity and its statistical stability and robustness (Swain and Davis, 1978; Strahler, 1980; Schowengerdt, 1983; Yool et al., 1986). In addition, the maximum likelihood strategy allows some modifications to be inserted into the basic algorithm, substantially improving its performance (Mather, 1985; Bolstad and Lillesand, 1991; Conese and Maselli, 1992; Maselli et al., 1992).

Despite these positive features, the conventional versions of the ML classifier suffer from the fundamental drawback of all "hard" classification procedures: most spectral information is lost in the process of condensing and transforming the remotely sensed data to generate a thematic map (Foody et al., 1992). Even in the frequent case of mixed, not spectrally defined pixels, the final output reports a unique attribution of each pixel to a cover class. All the other information concerning class membership properties which is generated during the discrimination process is lost, and only the global level of classification accuracy is usually provided.

It can be noted that, from a statistical viewpoint, this is not the most efficient way to exploit the potential of the procedure. The ML classifier can in fact produce membership probabilities which, suitably processed and represented, could provide significant information about the confidence for the attribution of each pixel to the most likely class (Foody et al., 1992). It is in fact clear that the more defined the ML membership probability (when a pixel can only be attributed to a single class), the higher the expected confidence for the relevant assignment, whilst uncertain membership probabilities (when a pixel has approximately the same likelihood of belonging to two or more classes) can be supposed to lead to unreliable assignment.

A simple, straightforward strategy for utilizing this information is described here, based on the definition of a measure of confidence for the assignment of each pixel. Following the above discussion, this definition utilizes the heterogeneity of ML membership probabilities quantified as probability entropy (Kullback, 1959). Once such a measure of reliability has been defined, it can be calibrated on the scene under investigation and converted into classification accuracy to be represented graphically in an auxiliary map.

In the present paper the statistical background of this strategy is first presented and examined. Subsequently, a case study using multitemporal Landsat TM scenes is employed to illustrate the performance of the method with various ML classifications. Finally, the results are analysed and discussed together with possible future extensions.

1 I.A.T.A.-C.N.R., P. le delle Cascine 18, 50144 Firenze, Italy.
2 CESIA-Accademia dei Georgofili, Logge Uffizi Corti, 50122 Firenze, Italy.

ISPRS Journal of Photogrammetry and Remote Sensing, 49(2): 13-20. 0924-2716/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved.

2. Statistical background

As is well known, the membership probability (Pr) of a pixel with respect to a cover category can be computed following the maximum likelihood approach as (Swain and Davis, 1978; Strahler, 1980):

$$\mathrm{Pr} = (2\pi)^{-n/2}\,|C|^{-1/2}\,\exp\left[-\tfrac{1}{2}\,(X - M)^{t}\,C^{-1}\,(X - M)\right] \qquad (1)$$

where: n = number of channels; C = variance-covariance matrix of the class considered; M = mean vector of the class considered; X = pixel vector. The rescaling of Pr between 0 and 1 according to Bayes' theorem yields the ML a posteriori probabilities. The formula, or its logarithmic transformation, is generally used with equal priors, which is operationally simple but statistically incorrect and inefficient.

Recently, two modifications of the conventional procedure have been proposed by our research group to obviate the most evident shortcomings of this approach: the tendency towards biased area estimates and the lack of flexibility in the presence of strongly irregular spectral distributions. The first modification is based on the simultaneous transformation of the membership probabilities by means of a transition process. As explained in Conese and Maselli (1992), the transition matrix is derived from a contingency table compared to some reference points. In some case studies, this method has been shown to greatly reduce biases in area estimates. The second method (Maselli et al., 1992) defines nonparametric prior probabilities, derived from the frequency histograms of the training sets, to be inserted into the maximum likelihood process. When applied to the classification of complex rural areas, the use of these priors tended to render the whole process far more flexible and efficient.

In any case, ML probabilities are usually employed simply to produce a "hard" classification: only the class with maximum likelihood is retained and reported in the final map. This would be correct if each pixel could be completely attributed to a certain cover category. Instead, most pixels are actually a composition of several classes, and the attribution of such mixed pixels to a category is somewhat unreliable. In practice, during a "hard" classification, all the information concerning the uncertainty of the attributions is usually lost.

A method which aims at retaining all the class membership information expressed during an ML classification for subsequent use is given by the "fuzzy" approach (Wang, 1990a, b; Blonda et al., 1991; Foody, 1992). By this strategy, the membership probabilities of every pixel are kept in the form of a fuzzy representation for all classes. The approach yields very interesting results, but has the fundamental drawback of requiring a great deal of data for the representation of the classification output. For example, N channels are necessary for fully reporting the output of a discrimination process with N categories. When the number of categories increases, the procedure may become too expensive. Moreover, the interpretation of such a large amount of data can be problematic and inefficient.

An alternative strategy is suggested in the current work. The basic idea is that the inter-class heterogeneity of the maximum likelihood membership probabilities found for each pixel can express information on the confidence for the relevant assignment. From a statistical viewpoint, such heterogeneity can be quantified as probability entropy (Kullback, 1959). The concept of entropy has been used in remote sensing studies for different purposes (Davis and Dozier, 1990; Conese and Maselli, 1993). Mathematically, entropy expresses the amount of statistical information of a system described by N discrete levels and is computed as:

$$H = -\sum_{i=1}^{N} \mathrm{Pr}(i)\,\ln \mathrm{Pr}(i) \qquad (2)$$

where: H = entropy of the system; Pr(i) = probability of occurrence of level i. When applied to the classification of a pixel, the Pr(i) values are the maximum likelihood membership probabilities with respect to class i. If a pixel is found to belong to a single class with certainty, Pr for that class will be 1 and that of all other classes will be 0. Consequently, the probability entropy (PH) will equal 0 (taking 0 ln 0 = 0). On the other hand, if the membership probabilities of all categories have similar values, PH will approach its maximum level. In the first case, maximum confidence for the pixel assignment can be expected, whilst no hypothesis can be formulated about the assignment in the second case. For a better representation, PH can be expressed as a percentage by the formula:

$$\mathrm{RPH} = \frac{\mathrm{PH}}{\mathrm{PH}_{\max}} \cdot 100 \qquad (3)$$

where: RPH = relative probability entropy; PHmax = maximum PH (reached when all Pr = 1/N, that is, PHmax = ln N). According to the above discussion, the RPH found for each pixel can be utilized as an efficient measure of confidence for the pixel attribution.

This hypothesis is investigated in the present work by means of a suitable case study. A rural area of Tuscany was chosen as a test site and sensed by multitemporal Landsat Thematic Mapper (TM) acquisitions. The method was applied to the original and the two modified versions of the ML classifier (i.e. with the transition matrix and with nonparametric priors). In this way an analysis of their behaviours was also possible. The preliminary results show the potential of the method and, above all, its utility for practical applications.
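The chain from Eq. 1 to Eq. 3 can be illustrated with a short NumPy sketch. This is only a minimal illustration under stated assumptions, not the software used in the study: the function names (`ml_densities`, `posteriors`, `relative_probability_entropy`) and the array layout (pixels in rows, classes in columns) are hypothetical, and equal priors are assumed unless others are supplied.

```python
import numpy as np

def ml_densities(X, means, covs):
    """Eq. 1 for every pixel and class.

    X: (P, n) pixel vectors; means: K mean vectors of length n;
    covs: K variance-covariance matrices (n, n). Returns a (P, K) array.
    """
    n = X.shape[1]
    out = np.empty((X.shape[0], len(means)))
    for k, (M, C) in enumerate(zip(means, covs)):
        d = X - M
        # Quadratic form (X - M)^t C^-1 (X - M), one value per pixel.
        maha = np.einsum("pi,ij,pj->p", d, np.linalg.inv(C), d)
        norm = (2 * np.pi) ** (-n / 2) * np.linalg.det(C) ** -0.5
        out[:, k] = norm * np.exp(-0.5 * maha)
    return out

def posteriors(dens, priors=None):
    """Rescale densities to sum to 1 per pixel (Bayes' theorem).

    Assumes at least one class has a nonzero density for every pixel.
    """
    K = dens.shape[1]
    priors = np.full(K, 1.0 / K) if priors is None else np.asarray(priors, float)
    weighted = dens * priors
    return weighted / weighted.sum(axis=1, keepdims=True)

def relative_probability_entropy(post):
    """Eqs. 2-3: entropy of each row of `post`, as a percentage of ln(K)."""
    K = post.shape[1]
    # Replace zeros by 1 inside the log so that 0 * ln(0) contributes 0.
    safe = np.where(post > 0, post, 1.0)
    ph = -(post * np.log(safe)).sum(axis=1)
    return 100.0 * ph / np.log(K)

# Two illustrative pixels with five classes: one certain, one fully ambiguous.
example = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                    [0.2, 0.2, 0.2, 0.2, 0.2]])
rph = relative_probability_entropy(example)
```

For the two illustrative pixels, the certain one receives an RPH of 0 and the fully ambiguous one an RPH of 100 (up to floating-point rounding), matching the two limiting cases discussed in the text.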

3. Case study

3.1. Data collection and preliminary processing

A rural area east of Florence (43°41'N, 11°33'E) where extensive ground references were available was selected for the research. As the relevant data set had already been used for other investigations, only the essential information is provided here; a more complete description can be found in Maselli et al. (1992). The terrain ranges in altitude between 200 and 1400 m. Its main cover categories are coniferous and deciduous woods, olive groves, cereal crops and urban areas (classes 1-5, respectively, in the present work). Extensive ground surveys and interpretation of aerial photographs provided a ground reference map which was entirely digitized. The data processing was carried out using a software package developed at the I.A.T.A.-C.N.R. Institute which runs on a Digital VAX 11/750 computer system.

Two TM scenes taken on 26 May and 14 August 1988 from the frame 192/30 were utilized in the case study. These two dates were chosen to exploit the multitemporal information connected with the changing phenology of vegetation (Conese and Maselli, 1991). The scenes were georeferenced in the same system as the ground data by a bilinear interpolation procedure trained on ground control points. They were then compressed by a Principal Component transformation which produced three informative, uncorrelated channels for each original scene.

3.2. Image classifications and error evaluation

The transformed images were classified by the three procedures mentioned (original and modified ML classifiers). For comparison, the same training and test pixels were used in the three trials. About 3% of the digitized pixels were selected as training points using a stratified random sampling scheme. All the pixels which were not used for training were kept for final verification.

After the training phase, the three classifications were applied to the study scenes. During the classification processes, the membership probabilities were also employed to generate three separate channels with RPH levels by means of formulas 1-3. For the classifier using the transition matrix, the probabilities before the transition process were considered, so as to avoid confusion due to high misclassification rates. The output images were divided into five levels of RPH corresponding to 0% and four quartiles (1-25%, 26-50%, 51-75%, 76-100%). Narrower RPH intervals were not selected, in order to maintain a sufficient number of pixels in each interval for accuracy evaluation.

For each level, the classification accuracy was computed as overall precision and Kappa coefficient of agreement compared to the test pixels (Congalton et al., 1983; Rosenfield and Fitzpatrick-Lins, 1986). Overall precision is a widely used measure of accuracy, but it does not take into account the presence of pixels classified in the correct categories by chance. The Kappa coefficient of agreement, which does account for chance agreement, is therefore a more complete statistic. In Figs. 1-3 the number of pixels and the accuracy for each RPH level of the three classifications are shown. The classified images with relevant RPH levels are shown in Figs. 4-6. Overall classification accuracies were derived from the error matrices of Tables 1-3, together with producer's and user's accuracy for each cover category.
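The per-level evaluation described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, class labels are assumed to be integers starting at 0, and RPH is assumed to be stored as (or rounded to) integer percentages. The Kappa formula is the standard one, (po - pe) / (1 - pe), with pe computed from the row and column totals of the error matrix. The counts in `table2` are those printed in Table 2.

```python
import numpy as np

def error_matrix(classified, reference, n_classes):
    """Count pixels per (classified, reference) pair; labels are 0..n_classes-1."""
    m = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(m, (classified, reference), 1)
    return m

def overall_precision(m):
    """Fraction of pixels on the diagonal of the error matrix."""
    return np.trace(m) / m.sum()

def kappa(m):
    """Kappa coefficient of agreement: (po - pe) / (1 - pe)."""
    total = m.sum()
    po = np.trace(m) / total
    pe = (m.sum(axis=1) * m.sum(axis=0)).sum() / total**2
    # Kappa is undefined when chance agreement is already perfect.
    return float("nan") if pe == 1 else (po - pe) / (1 - pe)

def rph_level(rph):
    """Map RPH (%) to levels 0..4: 0%, 1-25%, 26-50%, 51-75%, 76-100%."""
    return np.ceil(np.rint(rph) / 25).astype(int)

def accuracy_by_level(classified, reference, rph, n_classes):
    """Return {level: (pixel count, overall precision, Kappa)} for non-empty levels."""
    levels = rph_level(rph)
    result = {}
    for level in range(5):
        sel = levels == level
        if sel.any():
            m = error_matrix(classified[sel], reference[sel], n_classes)
            result[level] = (int(sel.sum()), overall_precision(m), kappa(m))
    return result

# Counts as printed in Table 2 (ML classifier with the transition matrix).
table2 = np.array([
    [6279,  2340,    0,   125,   8],
    [2457, 39265,    1,  3472, 475],
    [  10,   362, 2277,   645, 207],
    [ 187,  3902,  370, 10033, 951],
    [  48,   464,  413,   673, 644],
])
print(round(overall_precision(table2), 3), round(kappa(table2), 3))
```

Applied to the Table 2 counts, the two statistics agree with the values printed under the table (0.773 and 0.604) to within about 0.002, which suggests the standard definitions are the ones used in the paper.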

Figure 1. Number of pixels (A) and accuracy values (B) for each RPH level from the conventional ML classifier (rPr/RPH = 0.700, rK/RPH = 0.845). [Plots not reproduced; x axis: relative probability entropy.]

3.3. Results

As can be seen from Figs. 1-3, an inverse relationship is generally present between the probability entropy and the accuracy of the three classifiers. Thus, even though there was no statistical reason to assume linearity in this relationship, linear correlation analyses were applied to these variables to quantify their interdependence objectively.

The conventional ML classifier yields few pixels with low RPH and high accuracy (Fig. 1). Most of the pixels fall in the 26-50% RPH level, which corresponds to very low overall precision and Kappa accuracy. Beyond this level, the classification accuracy remains almost unvaried and practically nonsignificant. Consequently, a linear regression between precision and entropy (Pr/H) or between Kappa and entropy (K/H) cannot be established, and the correlation coefficients are not high. Also, the overall accuracy of this classification is rather low (Pr = 0.557 and K = 0.374, Table 1). The main sources of error are related to confusion between classes 1 and 2 (coniferous and deciduous forests) and between classes 4 and 5 (olive groves and urban areas), which are in fact spectrally similar and partly composed of mixed pixels.

As regards the second procedure, the main effect of the utilisation of the transition matrix is a notable improvement in accuracy for medium-high RPH levels (Fig. 2). In particular, the level with the greatest number of pixels (26-50%) has medium precision and Kappa accuracy (0.813 and 0.524, respectively). Good regressions Pr/H and K/H can be found, with correlation coefficients of 0.93 and 0.99, respectively. Globally, classification accuracy is markedly increased with respect to the original procedure (Table 2). The confusion between classes 1 and 2 is considerably reduced, as well as that between classes 4 and 5.

Finally, the ML classifier using nonparametric priors produces a large number of pixels with low RPH and high accuracy (Fig. 3). A good linear regression can be drawn, with high correlation coefficients (rPr/RPH = 0.98, rK/RPH = 0.97). In this case also the classification is fairly accurate (Table 3), with error levels similar to those of the previous procedure. This indicates that pixel identification has been notably improved by this classifier for practically all cover types.

In general, the utility of having an error map together with a classification can be appreciated from a visual analysis of Figs. 4-6. Low error rates usually correspond to spectrally homogeneous, well defined areas (woods, cereals). High error rates, on the other hand, are present for mixed, complex surfaces such as those partly covered by olive groves and urban areas.

Figure 2. Number of pixels (A) and accuracy values (B) for each RPH level from the ML classifier with the transition matrix (rPr/RPH = 0.926, rK/RPH = 0.985). [Plots not reproduced; x axis: relative probability entropy; series in B: overall precision and Kappa.]

Figure 3. Number of pixels (A) and accuracy values (B) for each RPH level from the ML classifier with nonparametric prior probabilities (rPr/RPH = 0.981, rK/RPH = 0.974). [Plots not reproduced; x axis: relative probability entropy; series in B: overall precision and Kappa.]

TABLE 1
Error matrix of the conventional ML classifier compared to the test pixels (U.A. = user's accuracy, P.A. = producer's accuracy)

Class      1      2      3      4      5   U.A.
1       7350  18365      0    718     44  0.278
2       1246  22590      0   2236    315  0.856
3          9    405   2496    810    377  0.609
4        142   3180    125   8854    556  0.689
5        187   2077    474   2360   1002  0.164
P.A.   0.823  0.485  0.806  0.591  0.437

Overall precision = 0.557; Kappa = 0.374.

TABLE 2
Error matrix of the ML classifier with the transition matrix compared to the test pixels (U.A. = user's accuracy, P.A. = producer's accuracy)

Class      1      2      3      4      5   U.A.
1       6279   2340      0    125      8  0.710
2       2457  39265      1   3472    475  0.860
3         10    362   2277    645    207  0.650
4        187   3902    370  10033    951  0.650
5         48    464    413    673    644  0.287
P.A.   0.699  0.846  0.744  0.671  0.282

Overall precision = 0.773; Kappa = 0.604.

TABLE 3
Error matrix of the ML classifier with nonparametric priors compared to the test pixels (U.A. = user's accuracy, P.A. = producer's accuracy)

Class      1      2      3      4      5   U.A.
1       6611   3755      0    161      9  0.627
2       2065  37874      1   3183    450  0.869
3          9    299   2234    636    346  0.634
4        168   3840    525   9957    991  0.643
5         56    593    296    913    457  0.197
P.A.   0.742  0.817  0.731  0.671  0.203

Overall precision = 0.757; Kappa = 0.586.

4. Discussion and conclusions

The utilisation of high-resolution remotely sensed data for the automatic creation of thematic maps is becoming extremely widespread. In particular, land cover classifications based on probabilistic methods such as maximum likelihood are now routinely used in many image processing packages and GIS (Geographic Information Systems). A critical issue in all these applications is error evaluation and representation, the importance of which has often been underestimated up to now.

In the current work it is noted that ML classification procedures can be used not only to assign each pixel to the most likely cover category, but also to provide membership probabilities which can serve for the estimation and the graphical representation of classification confidence. In particular, by means of the concept of entropy, the heterogeneity of ML membership probabilities can be employed as an indicator of classification uncertainty. From the experimental results, probability entropy has been found to be strongly related to accuracy levels, so that it can serve to represent these in a thematic map once the relationship has been calibrated on some reference points.

The advantage of this representation for practical applications is apparent. For each area classified, the end user can get an idea of the classification reliability, which may help in making operational decisions. Also, the ambiguity in cases of high RPH can be resolved by the use of external information, such as that provided by the consideration of context or by the inclusion of auxiliary data in a GIS.

Our case study has also confirmed the different performances of the three classifiers considered. The modified algorithms behave considerably better than the conventional one, with error rates which are notably reduced. Specifically, the ML classifier with the transition matrix tends to increase classification accuracy for medium-high RPH levels, whilst the classifier with nonparametric priors yields very few pixels with low accuracy and high RPH.
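As a rough illustration of the calibration step mentioned in this section, a linear relationship between RPH and accuracy could be fitted on the per-level statistics and then applied to the RPH channel to obtain an estimated-accuracy map. The sketch below is an assumption-laden example rather than the procedure used in the study: the per-level Kappa values are hypothetical placeholders (the paper reports them only graphically), the use of level midpoints as the regression abscissa is an assumption, and `estimated_accuracy` is a hypothetical helper name.

```python
import numpy as np

# Midpoints of the five RPH levels (0%, 1-25%, 26-50%, 51-75%, 76-100%).
# Using midpoints as the x values is an assumption made for this sketch.
level_midpoints = np.array([0.0, 13.0, 38.0, 63.0, 88.0])

# HYPOTHETICAL per-level Kappa values, for illustration only;
# they are not taken from the paper.
kappa_per_level = np.array([0.95, 0.80, 0.55, 0.35, 0.15])

# Least-squares line and linear correlation coefficient.
slope, intercept = np.polyfit(level_midpoints, kappa_per_level, 1)
r = np.corrcoef(level_midpoints, kappa_per_level)[0, 1]

def estimated_accuracy(rph):
    """Convert per-pixel RPH (%) to an estimated accuracy, clipped to [0, 1]."""
    return np.clip(intercept + slope * np.asarray(rph, dtype=float), 0.0, 1.0)

# Applying the calibration to a small, made-up RPH channel.
rph_channel = np.array([[0.0, 20.0], [45.0, 90.0]])
accuracy_map = estimated_accuracy(rph_channel)
```

With an inverse relationship such as the one reported, the fitted slope is negative, so pixels with higher RPH are mapped to lower estimated accuracy. Whether a linear form is adequate would have to be checked on the scene at hand, as the text notes for the conventional classifier.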

This analysis brings out the importance of advanced discrimination procedures to decrease uncertain attributions and increase classification accuracies. The problem is also connected with the properties of the satellite imagery used. Image data with higher spatial resolution than the TM (30 × 30 m) could match the patterns of the sensed landscapes better and reduce the proportion of mixed pixels. Similarly, a more complete spectral coverage of the data could improve the identification of complex cover types. Even in these cases, however, mixed and uncertainly attributed pixels will be present in heterogeneous environments, and methods for characterizing them, such as that proposed, will remain crucial in most applications.

At present, the methodology for accuracy evaluation and representation has been tested only with ML classifiers, but it can be applied to other probabilistic discrimination procedures as well. Further experiments in other areas and with other high-resolution imagery are currently being planned to fully verify the potential of the methodology.

Figure 4. (A) Classified image obtained by the first procedure (blue = coniferous forest, green = deciduous forest, yellow = olive grove, sky blue = cereal crops, red = urban area). (B) Relevant RPH levels (blue = 0%, violet = 1-25%, green = 26-50%, yellow = 51-75%, red = 76-100%).

Figure 5. (A) Classified image obtained by the second procedure and (B) relevant RPH levels (colours as in Fig. 4).

Figure 6. (A) Classified image obtained by the third procedure and (B) relevant RPH levels (colours as in Fig. 4).

Acknowledgements

This research has been partly funded by an ASI (Agenzia Spaziale Italiana) grant. The authors want to thank Dr. Resti for his contribution to the collection of the ground references. Special thanks are due to two anonymous referees whose comments improved the quality of the paper.

References

Blonda, P., Pasquariello, G., Losito, S., Mori, A., Posa, F. and Ragno, D., 1991. An experiment for the interpretation of multitemporal remotely sensed images based on a fuzzy logic approach. Int. J. Remote Sensing, 12: 463-476.
Bolstad, P.V. and Lillesand, T.M., 1991. Rapid maximum likelihood classification. Photogramm. Eng. Remote Sensing, 57: 67-74.
Conese, C. and Maselli, F., 1991. Use of multitemporal information to improve classification performance of TM scenes in complex terrain. ISPRS J. Photogramm. Remote Sensing, 46: 187-197.
Conese, C. and Maselli, F., 1992. Use of error matrices to improve area estimates with maximum likelihood classification procedures. Remote Sensing Environ., 40: 113-124.
Conese, C. and Maselli, F., 1993. Selection of optimum bands from TM scenes through mutual information analysis. ISPRS J. Photogramm. Remote Sensing, 48 (3): 2-11.
Congalton, R.G., Mead, R.A. and Oderwald, R.G., 1983. Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques. Photogramm. Eng. Remote Sensing, 49: 1671-1678.
Davis, F.W. and Dozier, J., 1990. Information analysis of a spatial database for ecological land classification. Photogramm. Eng. Remote Sensing, 56: 605-613.
Foody, G.M., 1992. A fuzzy sets approach to the representation of vegetation continua from remotely sensed data: an example from Lowland Heath. Photogramm. Eng. Remote Sensing, 58: 221-225.
Foody, G.M., Campbell, N.A., Trodd, N.M. and Wood, T.F., 1992. Derivation and applications of probabilistic measures of class membership from the maximum-likelihood classification. Photogramm. Eng. Remote Sensing, 58: 1335-1341.
Kullback, S., 1959. Information Theory and Statistics. Wiley, New York, N.Y.
Maselli, F., Conese, C., Petkov, L. and Resti, R., 1992. Inclusion of prior probabilities derived from a nonparametric process into the maximum likelihood classifier. Photogramm. Eng. Remote Sensing, 58: 201-207.
Mather, P.M., 1985. A computationally efficient maximum-likelihood classifier employing prior probabilities for remotely sensed data. Int. J. Remote Sensing, 6: 369-376.
Rosenfield, G.H. and Fitzpatrick-Lins, K., 1986. A coefficient of agreement as a measure of thematic classification accuracy. Photogramm. Eng. Remote Sensing, 52: 223-227.
Schowengerdt, R.A., 1983. Techniques for Image Processing and Classification in Remote Sensing. Academic Press, New York, N.Y., 249 pp.
Strahler, A.H., 1980. The use of prior probabilities in maximum likelihood classification of remotely sensed data. Remote Sensing Environ., 10: 135-163.
Swain, P.H. and Davis, S.M., 1978. Remote Sensing: The Quantitative Approach. McGraw Hill, New York, N.Y., pp. 166-174.
Wang, F., 1990a. Fuzzy supervised classification of remote sensing images. IEEE Trans. Geosci. Remote Sensing, 28: 194-202.
Wang, F., 1990b. Improving remote sensing image analysis through fuzzy information representation. Photogramm. Eng. Remote Sensing, 56: 1163-1169.
Yool, S.R., Star, J.L., Estes, J.E., Botkin, D.B., Eckhardt, D.W. and Davis, F.W., 1986. Performance analysis of image processing algorithms for classification of natural vegetation in the mountains of Southern California. Int. J. Remote Sensing, 7: 683-702.

(Received December 13, 1992; revised and accepted April 22, 1993)