Clustering of homogeneous subsets


Pattern Recognition Letters 12 (1991) 401-408 North-Holland

July 1991

Donald E. Brown, Christopher L. Huntley and Paul J. Garvey

Institute for Parallel Computation and Department of Systems Engineering, University of Virginia, Charlottesville, VA 22901, USA

Received 7 March 1991
Revised 16 April 1991

Abstract

Brown, D.E., C.L. Huntley and P.J. Garvey, Clustering of homogeneous subsets, Pattern Recognition Letters 12 (1991) 401-408.

The homogeneous clustering problem involves grouping data when neither the number of clusters nor the number of elements in a cluster is known. Instead, there is a threshold requirement on group or cluster homogeneity. The major existing approach to this problem is the leader algorithm. We present a new approach, called Clustering of Homogeneous Subsets (CHS). In tests on an important example of the homogeneous clustering problem, sensor fusion for military surveillance, CHS outperforms the leader algorithm.

Keywords: Partitional clustering, leader algorithm, sensor fusion.

1. Introduction

Clustering is one of the most widely used procedures in exploratory data analysis and machine learning. Because clustering is heavily driven by applications, new algorithms typically arise from new applications for which previous procedures are not appropriate. This paper presents a new algorithm for clustering that was motivated by a military surveillance application. The key feature of this application that requires a new approach is the lack of prior knowledge of the number and size of clusters in the data. Instead, there is threshold information about the level of homogeneity that should exist within clusters. The approach presented in this paper is effective for this type of clustering problem, and it is quick.

This work was supported in part by the Jet Propulsion Laboratory under grant number 95722, Loral WDL under grant number ST-922473-ES, and the Center for Innovative Technology under grant number INF-89-006.

Clustering algorithms are categorized into two types: hierarchical and partitional. Hierarchical algorithms develop tree structures out of the data and show partitions with sizes from one to n, where n is the number of entities in the data set. Hierarchical algorithms are the most common approach to clustering; examples of popular hierarchical algorithms are in (Jain and Dubes, 1988). If a single partition is desired then some additional procedure must be used to select one from the range of possibilities. Even without this procedure, typical hierarchical algorithms are O(n³), because cluster-to-cluster distance calculations are O(n²) and there are n partitions formed. Hence, hierarchical methods have limited applicability to large problems. Partitional methods produce a single partition of the entities to be clustered.


Examples are k-means and its variations (Hartigan, 1975), and p-medians (Kaufman and Rousseeuw, 1987), both of which require knowledge about the number of clusters in the data. ISODATA (Gnanadesikan, 1977), MacQueen's k-means (MacQueen, 1967), and Wishart's k-means (Anderberg, 1973) require the minimum distance between clusters, the maximum variance within clusters, and the minimum number of reports in a cluster, respectively. All of these also require an initial estimate of the number of clusters in the data. One algorithm which does not require this type of information is the leader algorithm (Hartigan, 1975), which is discussed in greater detail in Section 3.

The clustering algorithm presented in this paper is partitional, but requires neither the number of clusters in the data nor the number of elements in a cluster. It does require a minimum level of homogeneity for the entities within a cluster. The need to cluster given only this type of information is a problem we encountered in practice and refer to as the homogeneous clustering problem (HCP). The approach presented in this paper addresses this specific problem with an algorithm that provides a quick and effective partitioning of the data.

The next section describes both the general problem of clustering and the specific formulation of the HCP. Section 3 presents our approach for homogeneous clustering and Section 4 describes an example from the domain of sensor fusion that led us to develop this approach. Conclusions are in Section 5.

2. The homogeneous clustering problem

The objective of clustering is to partition entities into groups in a way that will reveal any underlying structure. There are many approaches to accomplish this objective, but they are all based on relatively few central concepts. This section introduces the general concepts of clustering necessary for understanding our approach and then formulates the specific HCP considered in this paper.

The partitioning produced by a clustering algorithm is based on three choices by the algorithm developer: the similarity (or distance) function, the clustering criterion, and the improvement or (more loosely) optimization procedure.


The similarity function measures the closeness between the entities to be clustered. Generally, a distance instead of a similarity function could also be used simply by changing the objective of our improvement procedure from maximize to minimize. There might, however, be technical reasons for preferring similarity to distance (or vice versa). This paper uses similarity functions.

The clustering criterion evaluates the partitions found by the clustering algorithm. Most clustering algorithms seek to maximize the similarity within an element of the partition (a subset of the set of entities to be clustered) and minimize the similarity between elements of the partition. The clustering criterion seeks to balance these two competing objectives in a manner that will produce partitions with interesting structure.

The improvement procedure drives the clustering algorithm to find partitions with better criterion scores. Ideally, we would like to obtain an optimal solution to this problem. For realistic-sized problems, optimality is usually not possible. Hence, clustering procedures make improvements to the clustering criterion until no further improvements can be found without expending unacceptable computational effort.

The choice of similarity function, clustering criterion, and improvement procedure determines the type of structure produced by the clustering algorithm. Holding two of the choices constant, while varying the third, can produce markedly different partitions of the data. The remainder of this section formulates the HCP as an optimization problem in the context of a clustering criterion, which we call a homogeneity score, and a threshold parameter. The following section provides details on both the improvement procedure in our algorithm and our choice of similarity function.

Let Ξ be the set of entities, {ξ₁, ξ₂, ..., ξₙ}, to be clustered and C be a partitioning of these entities into clusters, {c₁, c₂, ..., cₘ}. Ψ is the set of all partitionings of Ξ and S is the set of all similarity functions, with s : Ξ × Ξ → ℝ for s ∈ S. Then the homogeneity score (clustering criterion) is h : C × ℝ × S → ℝ.


Now for our purposes, neither the number of clusters (k) nor the number of entities in each cluster (|cᵢ|) are known a priori. Instead, there is a similarity threshold, τ, for all clusters, so that clusters with homogeneity less than τ are not representative of the underlying structure. With these definitions, the homogeneous clustering problem (HCP) is:

    minimize   k = |C|                                                  (1)

    subject to

    ⋃_{i=1}^{|C|} cᵢ = Ξ,                                               (2)

    cᵢ ∩ cⱼ = ∅,   ∀ cᵢ, cⱼ ∈ C and cᵢ ≠ cⱼ,                            (3)

    h(cᵢ, τ, s) − h(cᵢ − {ξⱼ}, τ, s) > 0,   ∀ ξⱼ ∈ cᵢ.                  (4)

Since the minimization is over all C ∈ Ψ, constraints (2) and (3) ensure the solution is a partition of Ξ. Constraint (4) requires that the addition of any entity to a cluster improve that cluster's homogeneity score. Hence, the HCP, as formulated here, seeks to force cluster formation whenever the homogeneity score can be improved by enlarging the cluster. Clearly there are other ways to formulate the HCP. However, this formulation captures the desire for clusters with a minimum level of similarity (given by τ), while not requiring knowledge of either the number of clusters present or the number of entities in each cluster.
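The formulation above can be read as a feasibility condition plus an objective. As a minimal sketch, assuming the entities are hashable Python objects and that a similarity function s and a homogeneity score h are supplied by the caller (the function and variable names below are ours, not the paper's), a candidate partition can be checked against constraints (2)-(4) as follows; minimizing (1) then amounts to searching for a feasible partition with the fewest clusters.

```python
# Sketch: checking a candidate partition against HCP constraints (2)-(4).
# The names is_hcp_feasible, clusters and entities are illustrative only;
# h and s are caller-supplied functions as defined in Section 2.

def is_hcp_feasible(clusters, entities, h, tau, s):
    """Return True if `clusters` (a list of sets) satisfies (2)-(4)."""
    # Constraint (2): the clusters must cover every entity.
    if set().union(*clusters) != set(entities):
        return False
    # Constraint (3): clusters must be pairwise disjoint.
    if sum(len(c) for c in clusters) != len(entities):
        return False
    # Constraint (4): every member's presence must improve homogeneity,
    # i.e., removing any member would lower the cluster's score.
    for c in clusters:
        if len(c) == 1:
            continue  # a singleton has no member whose removal is defined here
        for xi in c:
            if h(c, tau, s) - h(c - {xi}, tau, s) <= 0:
                return False
    return True
```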

3. A procedure for clustering homogeneous subsets

Ideally, we would like to obtain the optimal solution to the HCP in (1)-(4). Unfortunately, this optimization is likely to be intractable as the number of entities to cluster increases. In practice, the optimal solution to the HCP is not obtainable within realistic time limits for even moderately sized problems. Thus, most hierarchical and partitional approaches use algorithms that obtain good answers to (1)-(4), even though they might not be optimal. However, as noted in Section 1, most existing procedures do not directly address the HCP as we have formulated it here. An important exception is the leader algorithm (Hartigan, 1975). The leader algorithm is a simple and quick clustering method.


It serves as the basis for operational sensor fusion algorithms that we will discuss in Section 4, although the developers of fusion algorithms introduced them without knowledge of the leader algorithm's prior existence in the clustering literature.

The basic steps of the leader algorithm are shown in Figure 1. Essentially, the algorithm proceeds sequentially through the set of entities to be clustered, Ξ, and forms clusters based on a threshold test. The first entity, ξ₁, forms the leader, L(c₁), for the first cluster, c₁. If s(ξ₁, ξ₂) > τ, then the second entity joins the first in c₁; otherwise ξ₂ becomes the leader for cluster 2. In general, ξⱼ joins the first cluster for which its similarity to the cluster leader exceeds the threshold; otherwise ξⱼ forms a new cluster and becomes its leader. Clearly, the leader algorithm provides a solution to the HCP, because the only prior information required is the threshold. The leader algorithm is also quick, because once similarities between entities are found, which is an O(n²) procedure, only a single pass through the entities forms the clusters. Hence, the algorithm is O(n²). However, the algorithm is order-dependent (simply starting with a different entity can produce markedly different clusters), and, therefore, leader-dependent. A simple modification, to reduce the latter impact, is to recompute a cluster centroid when new members are added to a cluster. While this modification reduces the impact of leader selection on the resulting partition, it does not reduce order dependence.

Our approach, clustering of homogeneous subsets (CHS), is an improvement on the leader algorithm. Our development was motivated by an application in sensor fusion, where the extreme order dependence of the leader algorithm cannot be tolerated. CHS is also order-dependent, but much less so than the leader algorithm.

    c₁ ← {ξ₁}; L(c₁) ← ξ₁; k ← 1
    FOR i = 2 to n and j = 1 to k
        IF s(L(cⱼ), ξᵢ) > τ
            THEN cⱼ ← cⱼ ∪ {ξᵢ}
            ELSE IF j = k THEN k ← k + 1; cₖ ← {ξᵢ}; L(cₖ) ← ξᵢ

    Figure 1. Leader algorithm.
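For concreteness, a short Python sketch of the single-pass procedure in Figure 1 follows. It is our paraphrase rather than the authors' code; `entities`, `s` and `tau` are assumed inputs (an ordered sequence of entities, a similarity function, and the threshold τ).

```python
# Sketch of the leader algorithm in Figure 1 (a paraphrase, not the authors'
# code). The names leader_algorithm, entities, s and tau are ours.

def leader_algorithm(entities, s, tau):
    """Single-pass leader clustering; returns a list of clusters (lists)."""
    if not entities:
        return []
    clusters = [[entities[0]]]   # c1 <- {xi_1}
    leaders = [entities[0]]      # L(c1) <- xi_1
    for xi in entities[1:]:
        for j, leader in enumerate(leaders):
            if s(leader, xi) > tau:        # join the first cluster whose
                clusters[j].append(xi)     # leader is similar enough
                break
        else:                              # no leader passed the test:
            clusters.append([xi])          # xi starts a new cluster
            leaders.append(xi)             # and becomes its leader
    return clusters
```

Because each entity is committed to the first leader that passes the threshold test, permuting `entities` can change the partition, which is exactly the order dependence discussed above.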


Additionally, CHS maintains the advantage of the leader algorithm in requiring only threshold information to obtain a partition. The availability of only threshold information is a key characteristic of the sensor fusion problem.

Figure 2 shows the steps in CHS.

    c₁ ← Random{Ξ}; k ← 1
    REPEAT
        ξⱼ ← MOSTSIMILAR(cₖ, Ξ − ⋃_{i=1}^{k} cᵢ)
        IF h(cₖ ∪ {ξⱼ}, τ, s) − h(cₖ, τ, s) > 0
            THEN cₖ ← cₖ ∪ {ξⱼ}
            ELSE cₖ₊₁ ← {ξⱼ}; k ← k + 1
    UNTIL Ξ − ⋃_{i=1}^{k} cᵢ = ∅

    Figure 2. CHS.

The procedure begins with the random selection of an entity from Ξ. This entity acts as an initial cluster of size one. Then CHS selects the entity in Ξ that is most similar to the initial cluster. If the addition of this entity to the cluster causes the homogeneity score to improve, then the entity is added to the cluster. The remaining elements of Ξ are then examined for the most similar entity to the cluster, and the change in cluster homogeneity is calculated with the new entity. If the addition of an entity to the current cluster does not improve cluster homogeneity, then the new entity starts a new cluster, and the procedure is repeated with this new cluster. No other elements are added to the previous cluster. In summary, entities are added to clusters until the addition of the most similar entity causes no improvement in the cluster's homogeneity. The process continues until each entity in Ξ is a member of a cluster.

CHS is O(n²). Construction of the similarity matrix is O(n²). Each of the n entities is chosen once by the MOSTSIMILAR operation, and the MOSTSIMILAR operation is O(n). Hence, the overall procedure is O(n²).

CHS's homogeneity score is

    h(c, τ, s) = Σ_{ξᵢ,ξⱼ ∈ c, i<j} 2(τ − s(ξᵢ, ξⱼ)) / ( |c| (|c| − 1) )          (5)


and was first proposed in (Barker, 1989). This score uses a similarity function and a threshold. The similarity function is problem-dependent (we used the chi-square density function for the sensor fusion problem described in Section 4). Clearly, other homogeneity scores can be used with the general CHS procedure. We found the one in (5) to be particularly well-suited for our application.
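A compact Python sketch of Figure 2 is given below. It is our paraphrase under two assumptions that the text does not fix: how MOSTSIMILAR aggregates similarity to a cluster (here, the largest summed similarity to the cluster's members) and the particular homogeneity score h, which is passed in by the caller (for example, a pairwise score built from s and τ as in (5)). All names are illustrative.

```python
import random

# Sketch of the CHS procedure in Figure 2 (our paraphrase). The names chs,
# most_similar, h, s and tau are illustrative; h(cluster, tau, s) is any
# homogeneity score supplied by the caller.

def most_similar(cluster, candidates, s):
    """Entity in `candidates` with the largest total similarity to `cluster`."""
    return max(candidates, key=lambda xi: sum(s(xi, c) for c in cluster))

def chs(entities, s, h, tau, seed=None):
    rng = random.Random(seed)
    remaining = set(entities)
    first = rng.choice(list(remaining))     # random seed entity
    clusters = [{first}]
    remaining.discard(first)
    while remaining:
        current = clusters[-1]
        xi = most_similar(current, remaining, s)
        # add xi only if it improves the current cluster's homogeneity
        if h(current | {xi}, tau, s) - h(current, tau, s) > 0:
            current.add(xi)
        else:
            clusters.append({xi})           # xi starts (and seeds) a new cluster
        remaining.discard(xi)
    return clusters
```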

4. A sensor fusion application

This section describes the sensor fusion problem which motivated our development of CHS, gives our evaluation procedure, explains the simulation used to generate sensor fusion test cases, and then gives results comparing CHS to its competitors.

Sensor fusion involves the combination of data from multiple sensors into a single coherent description of the sensed environment. The specific sensor problem of concern to us consists of a large region (approximately 80 × 100 km) over which electronic sensors operate to produce location estimates for entities in the sensed region. Sensors report both a location and a 95% elliptic error probable (EEP). The latter is an elliptical region which is expected to contain the true entity location in 95 out of 100 random trials. Under Gaussian assumptions, the reports completely specify a Gaussian density. The problem is to combine all the reports produced by the sensors into an accurate estimate for the number and locations of the entities in the environment.

Clustering algorithms are used on this problem to group reports that apparently belong to the same underlying entity. Reports that are grouped in this manner are said to be correlated. After correlation the next step is to use the correlated reports to estimate the locations of the entities. Our concern in this paper is with the correlation or clustering process, and all algorithms will use the same estimation procedure (see (Spillane et al., 1989) for details).
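As a sketch of how a report defines a density: assuming the three EEP parameters are the semi-major axis, semi-minor axis and orientation of the 95% ellipse, and using the fact that for a bivariate Gaussian the 95% ellipse satisfies xᵀΣ⁻¹x = χ²₂(0.95) ≈ 5.991, the report's covariance matrix can be recovered as follows. This is our construction, not the paper's; the names are illustrative.

```python
import numpy as np

# Sketch (ours): recovering the covariance matrix implied by a 95% elliptic
# error probable (EEP). We assume the three EEP parameters are the semi-major
# axis a, semi-minor axis b, and the orientation angle theta of the major axis.

CHI2_2_95 = 5.991  # 95% quantile of a chi-square with 2 degrees of freedom

def eep_to_covariance(a, b, theta):
    """Covariance of the Gaussian whose 95% ellipse has semi-axes a, b
    (same length units) and major-axis orientation theta (radians)."""
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    axis_var = np.diag([a**2, b**2]) / CHI2_2_95
    return rot @ axis_var @ rot.T
```

A report (location, EEP) then specifies a Gaussian density with that mean and covariance, which is what the correlation and estimation steps operate on.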


It is not a simple matter to evaluate algorithms on this type of sensor fusion problem. Here we explain only the fundamentals needed to understand the evaluation of CHS; additional details on the basic principles underlying the evaluation of sensor fusion algorithms are in (Spillane et al., 1989). A standard of comparison for sensor fusion algorithms is the Perfect Association Representation (PAR). PAR comes from correctly correlating (clustering) all sensor reports; thus, PAR is the best estimate obtainable from the available sensor reports. We measure a clustering algorithm's performance by comparing the algorithm's estimate to PAR. An objective score for this comparison is the Perfect Association Minimum Distance Score (PAMDS), which is defined as


    PAMDS(R) = Σ_{(r,p) ∈ P} d(r, p)
             + Σ_{r ∈ R: (r, δ*(r)) ∉ P} d(r, δ*(r))
             + Σ_{p ∈ PAR: (δ(p), p) ∉ P} d(δ(p), p)                     (6)

where

    R = the algorithm's estimate of the environment,
    r = an estimate (Gaussian density) in R of a single entity,
    p = an estimate (Gaussian density) in PAR of a single entity,
    d = a distance function between densities,
    δ(p) = the element of R closest to p in PAR,
    δ*(r) = the element of PAR closest to r in R,
    P = {(r, p): δ*(r) = p and δ(p) = r, for p ∈ PAR, r ∈ R}.

In our analysis here, we use divergence for d, which is defined as

    J(r, p) = ∫_{−∞}^{∞} r(x) ln( r(x)/p(x) ) dx + ∫_{−∞}^{∞} p(x) ln( p(x)/r(x) ) dx.

A justification for divergence is in (Brown et al., 1990).

Obviously, we can only produce PAR if we know the true locations for the entities in the environment and associate these with sensor reports. While in real sensor fusion problems these data are difficult to obtain, simulations can produce such data.
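For Gaussian report estimates the divergence has a closed form, which makes (6) straightforward to compute. The sketch below is ours: it uses the standard formula for the symmetric divergence between multivariate normals and a direct implementation of the mutual-nearest-neighbour set P from the definition above. Estimates are (mean, covariance) pairs, and all function names are illustrative.

```python
import numpy as np

# Sketch (ours): divergence J(r, p) between two Gaussian estimates, via the
# standard closed form for multivariate normals, and PAMDS as in (6).
# Each estimate is a (mean, covariance) pair; names are illustrative.

def j_divergence(r, p):
    (m1, S1), (m2, S2) = r, p
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = m2 - m1
    k = len(m1)
    # J = KL(r||p) + KL(p||r); the log-determinant terms cancel.
    return 0.5 * (np.trace(S2i @ S1) + np.trace(S1i @ S2) - 2 * k
                  + dm @ (S1i + S2i) @ dm)

def pamds(R, PAR, d=j_divergence):
    """Perfect Association Minimum Distance Score, equation (6)."""
    # delta*(r): index of PAR element closest to r; delta(p): index of R element closest to p.
    d_star = {i: min(range(len(PAR)), key=lambda j: d(R[i], PAR[j])) for i in range(len(R))}
    delta = {j: min(range(len(R)), key=lambda i: d(R[i], PAR[j])) for j in range(len(PAR))}
    # P: mutually closest (r, p) pairs.
    P = {(i, j) for i, j in d_star.items() if delta[j] == i}
    score = sum(d(R[i], PAR[j]) for (i, j) in P)
    score += sum(d(R[i], PAR[d_star[i]]) for i in range(len(R)) if (i, d_star[i]) not in P)
    score += sum(d(R[delta[j]], PAR[j]) for j in range(len(PAR)) if (delta[j], j) not in P)
    return score
```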

For our evaluation of CHS, we simulated the operation of airborne sensor systems with flight paths parallel to one axis of a rectangular region that contains 200 entities. Each entity is indistinguishable from another, except by location. The three parameters in the EEP of a sensor report are found by sampling from beta distributions, which model the sensor characteristics. The coordinate values of the sensor reports are given by draws from a bivariate Gaussian with variance-covariance matrix determined by the EEP and centered at the true location of the entity. Additionally, entities might not be sensed during the operation of a sensor system. Fourteen days of sensor reporting are simulated with the entities moving randomly throughout the region. Figure 3 shows all sensor reports in the region after two days of sensing. Figure 4 shows the estimate produced by CHS with these reports.

Figure 3. 48 Hours of unclustered reports.

Figure 4. CHS clusters.
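To make the report-generation step concrete, a hedged sketch follows. The beta-distribution shape parameters and the scaling of the EEP axes are placeholders of our own; the paper does not give the actual sensor-model values.

```python
import numpy as np

# Sketch (ours) of generating one simulated sensor report for an entity at
# true location `truth`. The beta shapes and axis scalings below are
# placeholders, not the sensor-model values used in the paper.

rng = np.random.default_rng(0)
CHI2_2_95 = 5.991

def simulate_report(truth, rng=rng):
    # Draw the three EEP parameters from beta distributions (placeholder values).
    a = 1.0 + 4.0 * rng.beta(2.0, 5.0)      # semi-major axis, km
    b = a * rng.beta(5.0, 2.0)              # semi-minor axis, km (<= a)
    theta = np.pi * rng.beta(2.0, 2.0)      # orientation, radians
    # Covariance implied by the 95% EEP under the Gaussian assumption.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    cov = rot @ (np.diag([a**2, b**2]) / CHI2_2_95) @ rot.T
    # Reported coordinates: a draw from the bivariate Gaussian centred at truth.
    location = rng.multivariate_normal(truth, cov)
    return location, (a, b, theta)
```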



Figure 5. PAMDS vs. time for leader algorithm L2.


Figure 6. PAMDS vs. time for CHS and leader algorithm A1.


Three algorithms were tested using this simulation: CHS and two variants of the leader algorithm. The first variant, A1, updates an entity location estimate (the cluster leader) after each correlation. The second variant, L2, waits to find all current leaders that correlate with a new report and then clusters (correlates) all of them into one cluster. The values for the threshold, τ, in each algorithm were found by analyzing the performance of each algorithm on 8 trials (14 days per trial) over a range of τ values. The best τ value for each algorithm was used in comparison testing. We ran the comparison tests of the algorithms on an iPSC/2 hypercube, which allowed us to generate 32 independent trials for each algorithm. The results over the 14-day sensing period are shown in Figures 5 and 6. The performance of L2 was so poor that it required a separate plot. A1 outperforms CHS only on the first day. Paired t-tests at 48, 192 and 336 hours reject the null hypothesis (no 'significant' difference in performance) with a significance level of more than 0.99.
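The paired comparison at a fixed hour can be reproduced with a standard paired t-test, for example with SciPy; `pamds_chs` and `pamds_a1` below are assumed to hold the PAMDS values of the 32 matched trials (variable names are ours).

```python
from scipy import stats

# Sketch (ours) of the paired comparison reported above: for a fixed hour
# (48, 192 or 336), pamds_chs and pamds_a1 hold the PAMDS values of the
# 32 matched trials for CHS and A1.

def compare_at_hour(pamds_chs, pamds_a1):
    t_stat, p_value = stats.ttest_rel(pamds_chs, pamds_a1)
    # A small p-value rejects the null hypothesis of no difference in
    # mean PAMDS between the two algorithms on matched trials.
    return t_stat, p_value
```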

5. Conclusions

Homogeneous clustering requires neither the number of clusters nor the number of entities in the clusters. Military surveillance is an example of the homogeneous clustering problem that also requires algorithms capable of quickly handling large data rates. CHS is particularly effective for this class of problems. In tests with simulated sensor data, CHS outperformed its chief competitor, the leader algorithm.

References

Anderberg, M. (1973). Cluster Analysis for Applications. Academic Press, New York.

Barker, A. (1989). Neural Network Applications to Sensor Data Fusion. M.S. Thesis, University of Virginia, Charlottesville, VA.

Brown, D.E., C.L. Pittard and A.R. Spillane (1990). ASSET: A simulation test bed for evaluating data association algorithms. Research Report IPC-TR-90-002, Institute for Parallel Computation, University of Virginia, Charlottesville, VA.

Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. Wiley, New York.

Hartigan, J. (1975). Clustering Algorithms. Wiley, New York.

Jain, A.K. and R.C. Dubes (1988). Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.

Kaufman, L. and P. Rousseeuw (1987). Clustering by means of medoids. In: Dodge, Y., Ed., Statistical Data Analysis Based on the L1-Norm and Related Methods. North-Holland, Amsterdam.

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley Symposium on Mathematical Statistics and Probability, 281-297.

Spillane, A.R., C.L. Pittard and D.E. Brown (1989). A method for the evaluation of correlation algorithms. Research Report IPC-TR-89-004, Institute for Parallel Computation, University of Virginia, Charlottesville, VA.