
0031-3203/83/020193-07 $03.00/0  Pergamon Press Ltd.  Pattern Recognition Society

Pattern Recognition, Vol. 16, No. 2, pp. 193-199, 1983. Printed in Great Britain.

A GENERALIZED HISTOGRAM CLUSTERING SCHEME FOR MULTIDIMENSIONAL IMAGE DATA

STEPHEN W. WHARTON
Earth Resources Branch, NASA Goddard Space Flight Center, Greenbelt, Maryland, U.S.A.

(Received 16 December 1981; received for publication 5 May 1982)

Abstract--A clustering procedure called HICAP (Histogram Cluster Analysis Procedure) was developed to perform an unsupervised classification of multidimensional image data. The clustering approach used in HICAP is based upon an algorithm described by Narendra and Goldberg to classify four-dimensional Landsat Multispectral Scanner data. HICAP incorporates two major modifications to the scheme by Narendra and Goldberg. The first modification is that HICAP is generalized to process up to 32-bit data with an arbitrary number of dimensions. The second modification is that HICAP uses more efficient algorithms to implement the clustering approach described by Narendra and Goldberg.(1) This means that the HICAP classification requires less computation, although it is otherwise identical to the original classification. The computational savings afforded by HICAP increase with the number of dimensions in the data.

Cluster analysis    Image processing    Histograms    K-d trees

INTRODUCTION

This paper describes a non-parametric cluster analysis program called HICAP (Histogram Cluster Analysis Procedure) which was developed to perform an unsupervised classification of multidimensional image data. The clustering approach used in HICAP is based upon the methodology given in a recent cluster analysis program described by Narendra and Goldberg,(1) which used a multidimensional histogram to perform an unsupervised classification of Landsat Multispectral Scanner data. The histogram, in a geometrical sense, represents a partitioning of the data space into a number of non-overlapping regions or cells. Each cell is associated with a particular feature vector and has a counter which records the number of times that the feature vector occurs in the data. The histogram summarizes the distribution and approximates the probability density function of the data. Clusters in the data are identified by locating peaks (i.e. cells with relatively high frequency counts) in the histogram.

The program by Narendra and Goldberg was designed to rapidly classify data as part of an interactive system on a small-scale computer with limited core storage. Due to its intended use as a classifier for Landsat data, and to conserve memory, the program is limited to processing data that can be packed into a 24-bit storage location (e.g. four-dimensional Landsat data). Despite this limitation on the data, the program by Narendra and Goldberg has a number of advantages for cluster analysis.

(1) The number of computations needed to identify clusters in the histogram is independent of the number of samples.


(2) No underlying assumptions about the probability distribution of the clusters are required.
(3) The program functions more or less automatically, since the user does not have to specify parameters to control the clustering process.
(4) The scheme is very efficient. By searching for clusters in the histogram rather than on each original feature vector separately, it is possible to reduce the number of computations by a large factor and thus efficiently handle the large sample sizes found in remote sensing applications. For example, in many Landsat frames, 95% of the 7.6 million pixels contain less than 6000 distinct intensity vectors.(2)

The objective in developing HICAP is to retain (and perhaps augment) the advantages of the clustering scheme given by Narendra and Goldberg(1) in a generalized program which can process data with more than four dimensions and/or 6 bits (64 gray levels). HICAP was designed for batch use on a large-scale, general-purpose computer in which the constraint on memory was not prohibitive. Thus, it was possible to avoid the data length restrictions by storing the data in a vector format. Since HICAP is not limited by the amount of information which can be packed into one storage location, it can process data with up to 32 bits in each dimension and an arbitrary number of dimensions. Extensive modifications to the original methodology given by Narendra and Goldberg were made in HICAP to accommodate the change to the vector data storage convention. In addition, several recently developed computational techniques (discussed later) were incorporated into HICAP, which significantly increased the efficiency of the clustering process.


The HICAP program is discussed in a subsequent section. The following section outlines the functional characteristics of the program described by Narendra and Goldberg,(1) which forms the basis for HICAP.

REVIEW OF THE PROGRAM BY NARENDRA AND GOLDBERG

Portions of the following discussion were condensed from several papers.(1-5) The clustering scheme has three basic steps: (1) construct a histogram which consists of the distinct data vectors and their frequency of occurrence; (2) cluster the distinct vectors, using the histogram as a probability density estimate, by a non-parametric clustering algorithm and store the cluster labels in a look-up table; (3) classify the original data using the look-up table and produce a thematic map of the results.
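A minimal sketch of this three-step pipeline, written here in Python with assumed names (classify_image and cluster_fn are illustrative stand-ins, not the paper's routines), might look as follows:

```python
# Minimal sketch of the three-step scheme.  `cluster_fn` stands for the
# valley-seeking clustering routine described in the next sections; it must
# return one cluster label per distinct vector.

import numpy as np

def classify_image(pixels, cluster_fn):
    """pixels: (num_pixels, K) array of integer feature vectors."""
    # Step 1: histogram of the distinct vectors and their frequencies.
    vectors, counts = np.unique(pixels, axis=0, return_counts=True)

    # Step 2: cluster the distinct vectors, using the counts as a
    # probability density estimate, and store the labels in a look-up table.
    labels = cluster_fn(vectors, counts)
    lookup = {tuple(v): lab for v, lab in zip(vectors, labels)}

    # Step 3: classify every original pixel by table look-up.
    return np.array([lookup[tuple(p)] for p in pixels])
```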

Histogram construction

The dimensionality of the histogram is equal to the number of dimensions in the data. The histogram is computed by partitioning the data space into discrete cells with equal volume and counting the number of feature vectors which occur within each cell. The major difficulty in this approach is to find a suitable method of storing the histogram so that any particular cell can be accessed with a minimum of searching. An efficient way to store and access the vector list is by use of a hashing function, which maps a feature vector into a particular position in the histogram table. The hashing method gives quick access to any vector and uses a relatively small amount of computer memory.

The hashing function used by Narendra and Goldberg(1) concatenates the four 6-bit Landsat radiance values to form a 24-bit number and divides the result by a prime number. The remainder after the division is used as the storage address. Since the hashing function maps a potentially large number of vectors into a smaller sized table, it is possible for several distinct vectors to be mapped into the same location. A collision occurs when two different vectors are mapped into the same location. Collisions are resolved by a trial and error process which iteratively applies a different hashing function until either a vector match or an empty location is found. As the load factor (percentage of occupied table locations) increases, the likelihood of a collision will also increase. This renders the hashing procedure less efficient at high load factors, since the number of probes (table look-ups and comparisons) needed to enter a vector into the table will increase. The hashing procedure used by Narendra and Goldberg(1) avoids this inefficiency by only allowing the table to become 75% filled. Thus, a table size of at least 8000 entries is required to efficiently process 6000 distinct vectors.
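The 24-bit key construction and the division-remainder addressing can be illustrated with a short sketch. The table size and the linear re-probing rule below are assumptions made for the illustration; the original program used its own sequence of secondary hash functions to resolve collisions.

```python
# Illustrative open-addressing histogram table in the spirit of the
# Narendra-Goldberg scheme: pack four 6-bit values into a 24-bit key,
# take the remainder modulo a prime table size, and re-probe on collision.

TABLE_SIZE = 8009            # a prime somewhat larger than the ~6000 expected vectors
table_keys = [None] * TABLE_SIZE
table_counts = [0] * TABLE_SIZE

def pack(vector):
    """Concatenate four 6-bit radiance values into one 24-bit key."""
    key = 0
    for v in vector:
        key = (key << 6) | (v & 0x3F)
    return key

def insert(vector):
    """Record one occurrence of `vector` in the histogram table."""
    key = pack(vector)
    addr = key % TABLE_SIZE                       # primary hash: remainder by a prime
    while table_keys[addr] is not None and table_keys[addr] != key:
        addr = (addr + 1) % TABLE_SIZE            # collision: probe another location
    table_keys[addr] = key
    table_counts[addr] += 1

insert((12, 33, 40, 7))
insert((12, 33, 40, 7))
print(table_counts[pack((12, 33, 40, 7)) % TABLE_SIZE])   # prints 2
```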

The clustering algorithm

The histogram vectors are clustered according to a non-parametric, valley-seeking algorithm of Koontz et al.,(6) which uses the histogram cell counts as a probability density estimate. Clusters in the data are associated with peaks in the probability density. The boundaries between clusters are defined by the valleys in the probability density which surround each peak. The clustering algorithm is non-iterative and does not require a priori knowledge of the number of clusters. The clustering scheme is very efficient: the computational requirements for the clustering stage (excluding neighborhood computation) are approximately linearly dependent on the number of unique vectors in the histogram.(1)

Narendra and Goldberg(1) implement the clustering algorithm of Koontz et al. in a four-step process. The first step (described in a subsequent section) is to compute a list of neighboring vectors for each vector in the histogram. The second step is to examine the list of neighbors and to place a directed link or pointer between each vector and its immediate neighbor having the maximum positive density gradient. The gradient between two vectors is the difference in frequency counts divided by the distance between the vectors. The distance measure used is the city block metric,

$$\mathrm{DIST}(X, Y) = \sum_{i=1}^{K} \lvert X(i) - Y(i) \rvert, \qquad (1)$$

which, in addition to being simple to compute, will always yield integer distances for integer vectors. Figure 1 shows a two-dimensional histogram to illustrate the clustering process. The numbers in the boxes are the frequency counts corresponding to hypothetical vectors in the histogram. The arrows in Fig. 1 link each cell to its neighbor with the maximum positive density gradient.

The third step is to identify the cluster centroids by locating peaks in the histogram. A peak is defined to be any vector whose frequency count is greater than the frequency count of its neighbors. Boxes A and B in Fig. 1 represent centroids. The term centroid, as used here, is intended to convey the idea that the peak forms a nucleus around which the other points in the cluster are grouped, and not to imply that the peak is at the center of a cluster.

The fourth step in clustering the histogram is to use the directed links to assign the remaining, non-peak vectors to the appropriate cluster. The directed links form a series of paths which follow the maximum uphill gradient and terminate at a local maximum which, by definition, is also a centroid. Since the links are always directed towards higher density neighbors, the vectors near cluster boundaries are linked to the centroid of their respective clusters. Thus, even adjacent clusters can be discriminated if they are separated by a low density valley, as in Fig. 1. The vectors belonging to each cluster are identified by tracing a path through the directed links until either a peak or a previously clustered vector is encountered. All of the vectors on the path are then assigned to the appropriate cluster; each resulting cluster is unimodal in the histogram.
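Assuming the histogram is held as a mapping from distinct integer vectors to frequency counts, and that a neighbor list is already available for each vector, the four steps might be sketched as follows (function and variable names are illustrative, not taken from the original programs):

```python
# Valley-seeking clustering of a histogram (illustrative sketch).
# `hist` maps each distinct vector (a tuple of ints) to its frequency count;
# `neighbors` maps each vector to a list of its immediate neighbors
# (vectors differing by at most 1 in every dimension).

def city_block(x, y):
    """Equation (1): the city block (L1) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cluster_histogram(hist, neighbors):
    # Step 2: link each vector to the neighbor with the maximum positive
    # density gradient; a vector with no higher neighbor is a peak (step 3).
    link = {}
    for v in hist:
        best, best_grad = None, 0.0
        for n in neighbors[v]:
            grad = (hist[n] - hist[v]) / city_block(v, n)
            if grad > best_grad:
                best, best_grad = n, grad
        link[v] = best                    # None marks a peak / cluster centroid

    # Step 4: trace each vector's uphill path to a peak (or to an already
    # labelled vector) and give every vector on the path that cluster label.
    labels, next_label = {}, 0
    for start in hist:
        path, v = [], start
        while v not in labels and link[v] is not None:
            path.append(v)
            v = link[v]
        if v in labels:
            root = labels[v]
        else:                             # an unlabelled peak: new cluster
            root = next_label
            next_label += 1
            labels[v] = root
        for p in path:
            labels[p] = root
    return labels
```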


Neighborhood computation

Before the directed link assignment can be made, it is necessary to compute a list of the neighboring vectors for each vector in the histogram. Vector X is a neighbor of vector Y if, in each dimension i, X(i) and Y(i) differ by at most 1. Narendra and Goldberg describe a method of searching an ordered vector list such that at most 2NP distance computations are required to build the neighbor lists for the N distinct vectors, where P is the number of possible neighbors,

$$P = 3^{K} - 1. \qquad (2)$$

For discrete, four-dimensional data such as Landsat, the number of possible neighbors is P = 3^4 - 1 = 80.
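A small sketch makes the origin of equation (2) and the neighbor test concrete; the function names are chosen for the illustration only:

```python
# Every immediate neighbor of a cell differs from it by an offset drawn from
# {-1, 0, +1} in each of the K dimensions; excluding the all-zero offset
# (the cell itself) leaves P = 3**K - 1 possible neighbors.

from itertools import product

def possible_neighbor_offsets(K):
    return [d for d in product((-1, 0, 1), repeat=K) if any(d)]

def is_neighbor(x, y):
    """True if distinct vectors x and y differ by at most 1 in every dimension."""
    return x != y and all(abs(a - b) <= 1 for a, b in zip(x, y))

print(len(possible_neighbor_offsets(4)))   # 80   (= 3**4 - 1)
print(len(possible_neighbor_offsets(7)))   # 2186 (= 3**7 - 1)
```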

The algorithm for this search strategy is given in Narendra and Goldberg.(1) The neighborhood search stage incurs the major computational cost involved in the histogram clustering program.

THE HICAP PROGRAM

The methodology given by Narendra and Goldberg(1) is based upon the principal assumption that each feature vector can be compressed into one storage location. As such, this methodology cannot be used for features stored in a vector format without extensive modifications. HICAP includes a number of major modifications in the histogram construction, neighbor computation and table look-up classification phases of the clustering, to accommodate the change to the vector cell storage convention. HICAP also incorporates a more efficient hashing algorithm for use in the histogram construction and table look-up stages, and a more efficient method to locate cell neighbors in the histogram. In addition, HICAP automatically compresses the histogram by increasing the cell size whenever the histogram is filled. These modifications and enhancements are described below.

Histogram generation

The hashing scheme used in the original procedure requires a table size at least 33% greater than the expected number of unique vectors to efficiently build the histogram table.


To minimize unused space, a more sophisticated hashing algorithm by Brent(7) was used in HICAP. Brent's algorithm is considerably more efficient at high load factors. Brent showed, on a theoretical basis, that the maximum expected number of probes to insert an entry with his method was less than 2.5, even for a nearly full table. Narendra and Goldberg found that the average number of probes for a 75% full table was 2.21.

The hashing function in the original program consisted of taking the remainder of the 24-bit composite value after division by a prime number. The hashing function used with HICAP performs a series of remainder operations, one for each feature. The exact method used depends upon whether a maximum of 16 or 32-bit data is allowed. The HICAP implementation described in this paper allows a 16-bit maximum. The hashing function first computes the remainder of the first feature. This remainder is then concatenated with the second feature. The hashing function then computes the remainder of this result, and concatenates it to the third feature. This process continues until all the features have been considered. The final remainder is used as the storage address. This concatenation scheme limits the histogram table size to a maximum of 32749 cells (the largest prime number that fits in a two-byte integer) to insure that the result of the remainder operation will fit within a two-byte integer.

To compare the performance of Shlien's and Brent's hashing algorithms, 4737 frequency vectors derived from synthetic data were hashed into tables using both methods. The experiment was repeated with table load factors varying from 0.5 to 0.999. The number of probes needed for each method is noted in Table 1. Brent's method was more efficient than Shlien's method over all table loadings. The difference in efficiency is most pronounced at high table loadings. Brent's method requires additional computations to enter a vector and may rearrange vectors in the table in order to minimize the number of probes for subsequent table look-ups. Results from timing both methods in making 0.5 million entries showed that, despite the additional computations, Brent's method is faster over all table loadings, and over twice as fast at the 99.9% loading.
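The chained-remainder address computation might be sketched as below. Interpreting "concatenation" as a 16-bit shift of the running remainder is an assumption made for this sketch, as is the example vector; Brent's reorganization of colliding entries is not reproduced here.

```python
# Sketch of a HICAP-style address computation for 16-bit features: take the
# remainder of the first feature modulo the (prime) table size, concatenate
# the next feature onto that remainder, take the remainder again, and so on.
# The 16-bit shift used for "concatenation" is an assumption for the sketch.

TABLE_SIZE = 32749            # largest prime fitting in a signed two-byte integer

def hicap_address(vector, table_size=TABLE_SIZE):
    addr = vector[0] % table_size
    for feature in vector[1:]:
        addr = ((addr << 16) | feature) % table_size
    return addr

# Example: a seven-dimensional 16-bit feature vector (values are arbitrary).
print(hicap_address((1021, 33000, 14, 9, 512, 60001, 77)))
```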


Fig. 1. A two-dimensional illustration of the histogram clustering scheme. Each square denotes a histogram cell with the frequency count in the lower right. The arrows link cells to neighbors with the maximum positive gradient in frequency count. Cells A and B represent local maxima.

Table 1. Average number of probes to enter 4737 vectors into a histogram at various table load factors for two hashing methods.

Load factor (%)    Brent (1973)    Shlien (1975)
50                 1.13            1.18
75                 1.34            1.60
85                 1.43            1.76
90                 1.54            2.22
95                 2.02            3.36
97                 1.97            3.60
98                 1.95            3.70
99                 2.42            3.70
99.9               2.32            6.03


Table 2. Results for testing HICAP with synthetic data, showing the average number of comparisons per cell to compute the list of immediate cell neighbors for each cell in the histogram. For comparison, the average number of neighbors found per cell and the average number of comparisons needed by the original method are also given.

Number of     Number of    Average number of     Average number of comparisons
dimensions    cells        neighbors (F)         Original*        HICAP
4             219          13                    147              33
4             1050         33                    127              60
4             2107         38                    122              79
4             3197         37                    123              80
7             1902         53                    4319             158
7             3631         71                    4301             272
7             4338         66                    4306             260

*Computed by the formula NC = P + (P - F), where NC is the computed number of comparisons, P is the possible number of neighbors (for N = 4, P = 80; for N = 7, P = 2186) and F is the actual number of neighbors found.

Histogram compression

At the finest resolution, each cell in the histogram represents a single distinct feature vector, and the number of histogram cells needed to represent the data is equal to the number of distinct feature vectors. It is necessary to compress the histogram (i.e. reduce the number of cells needed to represent the data) if the number of distinct feature vectors exceeds the maximum number of cells in the histogram. The original clustering scheme does not specify the action to be taken if the histogram table is filled. Shlien(2) suggests that the histogram be compressed by deleting cells with a relatively small number of entries. HICAP avoids the necessity of cell deletion if the histogram is filled by automatically increasing the cell size, thereby allowing two or more feature vectors to be mapped into one cell. This compression scheme is implemented by dividing all feature vectors in the histogram by a cell size factor, equal to the length of a side of the cell, prior to their insertion into the histogram. The cell size factor starts at one and is doubled each time the histogram is compressed. The transformed vectors are then rehashed into a new table. Processing then continues, and every subsequent feature vector is divided by the same factor before being hashed into the table. This scheme allows the histogram to be compressed without reprocessing data. Thus, with HICAP, relatively small table sizes can be used for situations involving much larger numbers of distinct vectors.
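A dictionary-based sketch of the compression step is shown below; it assumes cell sizes are powers of two, so that stored keys can simply be halved when the cell size doubles, and the names are illustrative rather than taken from HICAP.

```python
# Sketch of HICAP-style histogram compression.  The histogram is held as a
# dict mapping quantized vectors to counts (a stand-in for the hashed table).
# When the table is full, the cell size is doubled, every stored key is
# halved, and the entries are re-entered into a fresh table.

def compress(histogram, cell_size):
    cell_size *= 2                                   # double the cell side length
    new_hist = {}
    for key, count in histogram.items():
        coarse = tuple(k // 2 for k in key)          # keys were already divided by the
                                                     # old factor, so halve them again
        new_hist[coarse] = new_hist.get(coarse, 0) + count
    return new_hist, cell_size

def insert(histogram, vector, cell_size, max_cells):
    key = tuple(v // cell_size for v in vector)
    if key not in histogram and len(histogram) >= max_cells:
        histogram, cell_size = compress(histogram, cell_size)
        key = tuple(v // cell_size for v in vector)  # re-quantize with the new size
    histogram[key] = histogram.get(key, 0) + 1
    return histogram, cell_size
```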

Neighborhood computation

Narendra and Goldberg showed that, with their method, the number of comparisons needed to find all immediate neighbors for a given cell in the sorted histogram is proportional to P, the number of possible neighbors. However, as equation (2) shows, P increases exponentially with the number of dimensions in the data. It was also observed in testing with synthetic data that A, the actual number of neighbors per cell, is usually much less than P, particularly for large P. As a result, the performance of the original method will tend towards the worst case, since the algorithm would have to backtrack frequently in searching for neighbors which do not exist.

The neighbor search procedure used in HICAP first forms the data into a K-dimensional binary tree or 'K-d tree'.(8) (Additional detail on K-d trees is given in the Appendix.) The K-d tree data structure provides an efficient mechanism for examining only those cells closest to the cell under consideration, thereby greatly reducing the number of comparisons needed to find the list of neighboring cells.(9) Bentley and Friedman(10) show that the number of comparisons needed with the K-d tree is proportional to A, the actual number of neighbors found.

Synthetic data sets of various sizes and with four and seven dimensions were used to illustrate the computational savings afforded by the K-d tree search strategy. The average number of comparisons needed to compute the neighbor list for each cell with the K-d tree was noted for each case. The number of comparisons needed for the original method, according to the analysis by Narendra and Goldberg,(1) is equal to P + (P - A). P is 80 for four dimensions and 2186 for seven dimensions. Table 2 lists the average number of comparisons per cell for both methods for each data set, and also shows the average number of neighbors per cell that were found. Since A is usually much less than P, and since the number of comparisons with the K-d tree is proportional to A, searching with the K-d tree achieves a considerable computational saving, especially for large P, as in the case with seven dimensions.

Implementation

HICAP was implemented in Fortran IV on an IBM 370/3033 computer for batch processing. HICAP has five processing steps: histogram construction, computation of the K-d tree, searching for cell neighbors, histogram clustering and table look-up classification. A sixth step for histogram smoothing (not described here, see Narendra and Goldberg(1)) is also available. Each step is implemented as a separate program and each program requires 140K 8-bit bytes of memory to process a 5000 cell four-dimensional histogram. The intermediate results between successive programs are saved on disk storage. The final classification results may be output in the form of a line printer map or stored on computer compatible tape. The amount of CPU time used by HICAP to classify a 480 × 512 pixel image with four-dimensional 16-bit data is 91 seconds for a histogram with 2299 cells and 132 seconds for a histogram with 4382 cells. The same sized image with seven-dimensional 16-bit data takes 164 seconds for a 2345 cell histogram and 218 seconds for a 3631 cell histogram.

Applications

Like the original program, HICAP is intended for use with relatively large data sets in which there is a high ratio of pixels to distinct features, to insure that the resulting histogram is a good estimate of the probability density of the feature space. Since both HICAP and the original program use the same clustering criteria, their respective classification results are also identical. Thus, to avoid redundancy, we refer the reader to Narendra and Goldberg(1) for a detailed description of the application of their program to the classification of multispectral data.

HICAP was used in a study to classify five spectrally heterogeneous land use classes (e.g. residential, commercial) in a 7.5 m resolution remotely sensed image.(11) The spectral data were first preprocessed to extract the local frequency distribution of ground cover types (e.g. lawn, forest, pavement) surrounding each pixel. The underlying assumption in this approach is that different types of land use are comprised of a characteristic distribution of cover types and can be recognized on that basis in high resolution remotely sensed data. HICAP was used to classify the 16-bit preprocessed data. The test results showed that the overall classification accuracy depends upon the number of cover types identified in the image and upon the size of the neighborhoods used to compute the local frequency distributions. The highest overall classification accuracy of 75.8% was achieved with four ground cover types and 31 × 31 pixel neighborhoods.(11) HICAP was also used to classify synthetic data developed to simulate the distribution of land cover classes within four hypothetical land use classes.(12) The highest overall classification accuracy was 98%.

SUMMARY

A non-parametric cluster analysis program was developed to perform an unsupervised classification of multidimensional image data. The program, called HICAP, is based upon the methodology described by Narendra and Goldberg.(1) HICAP incorporates two major modifications to the original scheme. The first modification is that HICAP is generalized to process up to 32-bit data and an arbitrary number of dimensions. HICAP achieves this generality by storing each histogram cell in a vector format, unlike the original program which, due to the storage restrictions on a small-scale computer, is limited to processing four-dimensional 6-bit data which can be packed into a 24-bit storage location. HICAP can also process a large number of distinct vectors with a relatively small number of histogram cells, since HICAP automatically compresses the histogram (by increasing the cell size) when the histogram is filled.

The second modification is that HICAP uses more efficient algorithms to implement the clustering approach described by Narendra and Goldberg. This means that the HICAP classification requires less computation, although it is otherwise identical to the original classification. HICAP incorporates a more efficient hashing algorithm for building the histogram and for the table look-up classification, which requires fewer probes to locate a cell in the histogram. HICAP also uses a more efficient method to locate the cell neighbors in the histogram, in which the number of comparisons needed is proportional to the actual number of neighbors rather than the possible number of neighbors, as in the original method. Tests with synthetic data showed that the number of comparisons was reduced by one-third or more in discrete four-dimensional space, and by a factor of 16 or more in seven-dimensional space.

Acknowledgements--The author wishes to acknowledge the assistance of the Canada Center for Remote Sensing for providing source code and program documentation for their non-parametric histogram cluster analysis procedure.

REFERENCES

1. P. M. Narendra and M. Goldberg, A non-parametric clustering scheme for Landsat, Pattern Recognition 9, 207-215 (1977).
2. S. Shlien, Practical aspects related to automated classification of Landsat imagery using look-up tables, Research Report 75-2 of the Canada Center for Remote Sensing, Ottawa, Canada (1975).
3. M. Goldberg and S. Shlien, A clustering scheme for multispectral images, IEEE Trans. Syst. Man Cybernet. SMC-8, 86-92 (1978).
4. S. Shlien and A. Smith, A rapid method to generate spectral theme classification of Landsat imagery, Remote Sensing Envir. 4, 67-77 (1975).
5. M. Goldberg and S. Shlien, Computer implementation of a four-dimensional clustering algorithm, Research Report 76-2 of the Canada Center for Remote Sensing, Ottawa, Canada (1976).
6. W. Koontz, P. M. Narendra and K. Fukunaga, A graph theoretic approach to non-parametric cluster analysis, IEEE Trans. Comput. C-25, 936-944 (1976).
7. R. P. Brent, Reducing the retrieval time of scatter storage techniques, Communs Ass. comput. Mach. 16, 105-109 (1973).
8. J. L. Bentley, Multidimensional binary search trees for associative searching, Communs Ass. comput. Mach. 18, 509-517 (1975).
9. J. H. Friedman, J. L. Bentley and R. A. Finkel, An algorithm for finding best matches in logarithmic expected time, Ass. comput. Mach. Trans. Math. Software 3, 209-226 (1977).
10. J. L. Bentley and J. H. Friedman, Data structures for range searching, Computg Surv. 11, 397-409 (1979).
11. S. W. Wharton, A context based land use classification algorithm for high resolution remotely sensed data, J. Appl. Photogr. Engng 8, 46-50 (1982).
12. S. W. Wharton, A contextual classification method for recognizing land use patterns in high resolution remotely sensed data, Pattern Recognition 15, 317-324 (1982).
13. D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, MA (1973).

APPENDIX. DESCRIPTION OF K-d TREES

The following discussion of K-d trees was adapted from Friedman et al.,(9) who describe the K-d tree as a generalization of the one-dimensional binary tree(13) used for sorting and searching. The K-d tree is a binary tree in which each node represents a subfile of the records in the file. The root of the tree (i.e. node A in Fig. 2b) represents the entire file. Each non-terminal node (i.e. nodes A, B and C in Fig. 2b) has two sons or successor nodes which represent the two subfiles defined by the partitioning at that node. The terminal nodes (i.e. nodes D, E, F and G in Fig. 2b) represent mutually exclusive small subsets of the data, called buckets, which form a partition of the record space.

In the case of one-dimensional searching, each record is represented by a single feature or key and a partition is defined by some value of that key. The records in a subfile with keys less than or equal to the partition value belong to the left son or successor node. Those records with keys greater than the partition value belong to the right son. In K dimensions, each record has K keys, any one of which can serve as the discriminator for partitioning the subfile represented by a particular node in the tree. Friedman et al.(9) give a methodology for choosing both the discriminator and partition value for each subfile, as well as the bucket size, to minimize the number of comparisons needed for searching. The algorithm for building a K-d tree is to choose at every non-terminal node the coordinate j with the largest spread (e.g. variance) of attribute values for the subfile represented by that node. The partitioning value is chosen to be the median value of the j-th attribute. This algorithm is applied recursively to the two subfiles represented by the two sons of the node just partitioned, until the number of records in each subfile is less than or equal to the maximum bucket size. Friedman et al.(9) recommend that the bucket size be between 8 and 16 records.

The objective of the search procedure is to locate all neighbors in the K-d tree whose coordinate values in each dimension differ from the coordinate of the cell under consideration by, at most, a distance threshold of 1. In searching for these cells we are, in effect, making a query which asks for the identity of all cells in the K-d tree which satisfy certain specified characteristics (i.e. proximity to the cell under consideration). To answer the query, we first start at the root node of the K-d tree and recursively search as follows.(10) When visiting a node that discriminates by the j-th dimension, the j-th range of the query is compared to the partition value. If the query range is entirely above (or below) that value, then it is necessary to search only the right (respectively, left) successor node or subtree of that node; the other subtree can be ignored, since none of the records which it represents satisfy the query in the j-th dimension. If the query range overlaps the discriminator value at a node, then both subtrees must be searched.

Figure 2a is a list of hypothetical coordinates for 15 cells in a two-dimensional histogram. Figure 2b shows the K-d tree representation of the 15 cell coordinates. The root of the tree is node A, which is a y-discriminator since the y-coordinate has the largest variance. The partitioning value is 6, the median of the y dimension. The data set is split such that all cells whose y-coordinate is less than or equal to 6 are placed in the left subfile (represented by node B); all cells whose y-coordinate is greater than 6 are placed in the right subfile (represented by node C). Each subfile is then split, and the process continues recursively until each bucket contains fewer than 3 cells (the bucket criterion used for this example). The search procedure is illustrated by identifying all cells in Fig. 2b that differ from the cell with coordinates (3,8) by, at most, a distance threshold of 1 in each coordinate. The search starts at the root node A. The query range of 7, 8, 9 in the y-coordinate is entirely above the partition value of 6. Thus, only the right subtree (represented by node C) need be considered further. The perpendicular line in the A-B link of Fig. 2b identifies that the left subtree can be pruned from the search tree. The search continues and both sons of C must be examined. A total of 3 buckets (identified in Fig. 2b by an 'S' underneath) are examined to find the 2 neighbors of cell (3,8), i.e. (3,7) and (4,9).
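A compact, illustrative reconstruction of the build-and-search procedure is given below. It is not the Fortran code used in HICAP: the bucket size, the spread measure (range rather than variance) and the example cell list are assumptions made for the sketch.

```python
# Illustrative K-d tree with buckets, in the spirit of Friedman, Bentley
# and Finkel.(9)  Names, bucket size and example data are chosen for the sketch.

BUCKET_SIZE = 2

class Node:
    def __init__(self, dim=None, value=None, left=None, right=None, bucket=None):
        self.dim, self.value = dim, value        # discriminator and partition value
        self.left, self.right = left, right
        self.bucket = bucket                     # list of cells (terminal nodes only)

def build(cells):
    if len(cells) <= BUCKET_SIZE:
        return Node(bucket=cells)
    K = len(cells[0])
    # Split on the dimension with the largest spread, at its median value.
    dim = max(range(K), key=lambda d: max(c[d] for c in cells) - min(c[d] for c in cells))
    values = sorted(c[dim] for c in cells)
    median = values[(len(values) - 1) // 2]
    left = [c for c in cells if c[dim] <= median]
    right = [c for c in cells if c[dim] > median]
    if not left or not right:                    # degenerate split: keep as a bucket
        return Node(bucket=cells)
    return Node(dim, median, build(left), build(right))

def range_query(node, cell, threshold=1):
    """All stored cells, other than `cell`, within `threshold` of it in every dimension."""
    if node.bucket is not None:
        return [c for c in node.bucket
                if c != cell and all(abs(a - b) <= threshold for a, b in zip(c, cell))]
    lo, hi = cell[node.dim] - threshold, cell[node.dim] + threshold
    found = []
    if lo <= node.value:                         # query range reaches the left subtree
        found += range_query(node.left, cell, threshold)
    if hi > node.value:                          # query range reaches the right subtree
        found += range_query(node.right, cell, threshold)
    return found

# A small two-dimensional example (not identical to the Fig. 2a data).
cells = [(3, 7), (4, 2), (9, 8), (8, 6), (6, 5), (2, 10), (4, 9),
         (6, 7), (1, 12), (3, 8), (8, 5), (10, 1), (9, 4), (5, 4)]
tree = build(cells)
print(range_query(tree, (3, 8)))                 # [(3, 7), (4, 9)]
```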


Fig. 2. Illustration of a K-d tree. (a) Tabular representation of hypothetical vectors for 15 cells in a two-dimensional histogram. (b) The K-d tree representation, in which the circles denote nodes and the boxes denote buckets.

About the Author--STEPHEN WHARTON was born in Cambridgeshire, England, on 14 January 1956. He received a BS and an MS in Forest Science from The Pennsylvania State University and an MS in Computer Science from the University of Maryland. Currently, he is working in the Earth Resources Branch of the Earth Survey Applications Division at the NASA Goddard Space Flight Center on various non-supervised classification algorithms for remotely sensed data. He is a member of the American Society of Photogrammetry.
