Pattern Recognition, Pergamon Press, 1976, Vol. 8, pp. 107-114. Printed in Great Britain.
CLASS: A NONPARAMETRIC CLUSTERING ALGORITHM

FREDERICK R. FROMM and RICHARD A. NORTHOUSE*

Bell Telephone Laboratories, Inc., Naperville, IL 60540, U.S.A.

* Robotics and Artificial Intelligence Laboratory (RAIL), Dept. of Electrical Engineering and Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI 53201, U.S.A.

(Received 12 March 1973 and in revised form 7 July 1974)
Abstract--The paper describes a nonparametric method for clustering of large data problems. The algorithm, based on the ISODATA technique, calculates all required thresholds from the actual data, thus eliminating a priori estimates. Empirical derivation of the set of rules for calculating these parameters is presented. Results of using the technique on a number of artificial and real data samples are discussed.

Clustering    Grouping    Threshold calculation    Nonparametric
INTRODUCTION

The principal task of clustering is to group multidimensional data with little, if any, a priori knowledge about the underlying structure of the data. The importance and applications of the various clustering techniques are described by Ball (1966). A serious problem associated with many clustering algorithms is the requirement of some a priori knowledge of the structure of the data under consideration (such as a priori knowledge of split and lump thresholds (Ball and Hall, 1967)). This paper describes a nonparametric clustering method which, for most practical purposes, has no sample-size limit and requires a very limited amount of a priori information. The algorithm presented in this paper is a means for the experienced cluster analyst to eliminate many of the iterations needed to determine the clustering parameters. If the analysis is to be based on the best linear discriminant function criteria (as presented by Anderberg, 1973 and Sneath and Sokal, 1973), the algorithm is shown to be effective. The inexperienced cluster analyst should use caution in applying these techniques, since they could mask out data violations of the underlying criteria and assumptions, and should perhaps first become familiar with the ISODATA algorithm itself.
THE BASIC ALGORITHM
The basic structure of the clustering algorithm is patterned after ISODATA (Ball and Hall, 1967). The basic approach of ISODATA is to define, a priori, a starting point and some thresholds. The logic flow is as follows (see Fig. 1):

1. Guess at the number of clusters and their mean values.
2. Group each sample of the feature space to its nearest (Euclidean) cluster center.
3. After all samples have been grouped, compute the coordinate values of each cluster center.
4. If any "split threshold" is exceeded, split that cluster into two clusters. The "split threshold" is one of the a priori established thresholds.
5. If a split occurred in the previous step, regroup the data and compute new coordinates for the cluster centers.
6. If any cluster has too few members (where "too few" is established a priori), eliminate that cluster.
7. If a cluster was eliminated in the previous step, regroup (by nearest prototype point) and compute new coordinates for the cluster centers.
8. If the distance between any two cluster centers is smaller than some predetermined distance, combine the two clusters.
9. If any two clusters were combined in the previous step, compute the new coordinates for the cluster center.
10. Repeat Steps 2-9 until a predetermined number of iterations is performed.

[Fig. 1. ISODATA algorithm.]

The ISODATA program requires a priori information on when to split, combine, or delete a cluster.
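For concreteness, the control loop above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' program: the split, eliminate, and lump subroutines are left as parameters standing in for the a priori threshold tests of Steps 4, 6 and 8, and all names are ours.

```python
import numpy as np

def isodata_sketch(X, centers, max_iter, split_fn, eliminate_fn, lump_fn):
    """Minimal sketch of the ISODATA control loop (Steps 2-10).

    X: (N, n) data matrix; centers: (m, n) initial guesses (Step 1).
    split_fn, eliminate_fn, lump_fn take (X, labels, centers) and return
    updated centers, standing in for the threshold tests of Steps 4, 6, 8.
    """
    labels = None
    for _ in range(max_iter):                               # Step 10
        # Step 2: group each sample with its nearest (Euclidean) center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each center from its members (an empty
        # cluster keeps its old coordinates).
        centers = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                            else centers[k] for k in range(len(centers))])
        # Steps 4-9: split, eliminate, and lump; regrouping happens on
        # the next pass through Step 2.
        centers = split_fn(X, labels, centers)
        centers = eliminate_fn(X, labels, centers)
        centers = lump_fn(X, labels, centers)
    return centers, labels
```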
THE CLASS ALGORITHM

The primary goal in the development of the CLASS algorithm was to eliminate the need for a priori estimates on (1) the initial number of clusters and their locations, (2) the split parameter, and (3) the lumping threshold. This was felt necessary since very often the data analyst has no feel for these parameters in his data and, therefore, spends much of his time in a "try and see" approach to his problem, eventually settling on some intuitively correct result. Since this intuitive result appears to be a widely accepted approach, a set of rules which were developed in an intuitive sense from two- and three-dimensional problems of known results have been generalized to an n-dimensional algorithm and imposed as the clustering criterion. The technique manifests itself in a modification to the ISODATA algorithm that we call CLASS, as shown in Fig. 2. The additions to the algorithm, i.e. the parameter calculations, are described next.

[Fig. 2. CLASS algorithm.]
THE STARTING VECTOR
One possible set of starting points, or starting vectors, for the clustering process, which was determined experimentally (Fromm and Northouse, 1973), effectively scatters the data about the n-dimensional space. That is, a cluster center is placed at the center of the space, and then 2^n other cluster centers are placed one standard deviation from the mean, on each side of the mean, on each dimension. The approach stems from a speed criterion rather than final clustering results. Investigations show (Fromm and Northouse, 1973) that starting with a single cluster at the center of the data space eventually results in the same clustering result as using the starting vector calculated in equations (1)-(4). However, using a single starting point necessitates a number of split iterations, each requiring that the N(N-1)/2-element interdistance matrix of the N data samples be calculated. Using the calculated vector of equations (1)-(4) usually results in starting with more clusters than actually exist, usually eliminating and always reducing the need for split iterations, and relying more heavily on the lumping of clusters, a technique requiring only the calculation of the M(M-1)/2-element interdistance matrix of cluster centers, where usually M << N. The starting vector is calculated as follows:

1. Compute x̄_i, the grand mean value of the ith dimension, for the entire sample population consisting of N samples.
2. Compute s_i, the standard deviation of the ith dimension, for the entire sample population.
3. Define the starting vectors Y_k. There will be 2^n such vectors, each having n elements, as
$$Y_k = (y_{k1}, y_{k2}, \ldots, y_{kn}), \qquad k = 1, 2, \ldots, 2^n \tag{1}$$

$$y_{ki} = \bar{x}_i \pm s_i \tag{2}$$

(one starting vector for each of the 2^n sign combinations), where

$$\bar{x}_i = \frac{1}{N}\sum_{j=1}^{N} x_{ij} \qquad \text{and} \qquad s_i = \left[\frac{1}{N}\sum_{j=1}^{N}(x_{ij} - \bar{x}_i)^2\right]^{1/2} \tag{3}$$
4. Include X̄, the grand mean of the entire sample population, as the (2^n + 1)th point:
$$Y_{2^n+1} = \bar{X} \tag{4}$$
Figure 3 shows a sample data base, while Fig. 4 shows the calculated starting vector as found by CLASS.

[Fig. 3. Sample data problem.]

[Fig. 4. Starting clusters for the sample problem in Fig. 3, placed at (x̄_1 ± s_1, x̄_2 ± s_2).]
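As an illustration, the starting vector of equations (1)-(4) can be computed directly. This sketch (our names, not the paper's) enumerates every sign combination of ±s_i about the grand mean, which is practical only for modest n since 2^n points are generated.

```python
import itertools
import numpy as np

def class_starting_vectors(X):
    """Starting vectors of eqs. (1)-(4): the 2^n points one standard
    deviation from the grand mean on each side of each dimension, plus
    the grand mean itself as the (2^n + 1)th point."""
    xbar = X.mean(axis=0)                 # grand mean of each dimension
    s = X.std(axis=0)                     # standard deviation, eq. (3)
    n = X.shape[1]
    corners = [xbar + np.asarray(signs) * s
               for signs in itertools.product((1.0, -1.0), repeat=n)]
    return np.vstack(corners + [xbar])    # shape (2**n + 1, n)
```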
THE SPLITTING THRESHOLD
Although the described starting vector eliminates much of the need for splitting, an automatic means of determining cluster splits is provided by CLASS. In most algorithms this parameter is static; that is, it does not change during the clustering process. CLASS calculates it dynamically by the following procedure. For each successive iteration k, we can determine the new splitting threshold as

$$S_k = S_{k-1} + \frac{1 - S_0}{\gamma} \tag{5}$$

where γ is the maximum number of iterations and S_0 was determined as follows. For a normalized data set X*, the range on any dimension can be forced to -1 ≤ x*_{ij} ≤ 1 by letting

$$x^*_{ij} = \frac{x_{ij}}{\max_j |x_{ij}|}, \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, N \tag{6}$$

Considering either half of the interval,

$$-1 \le x^* < 0 \tag{7}$$

$$0 \le x^* \le 1 \tag{8}$$

we can determine a split parameter a_1 for the negative interval and a_2 for the positive interval (described later) such that a cluster is split when either a_1 > S_k or a_2 > S_k. Intuitively, clusters whose PDF is uniform do not require splitting (see Fig. 5(a)). For a uniform P(x*),

$$|a_1| = |a_2| = 0.5 \tag{9}$$

As P(x*) tends to Gaussian (with a mean of zero),

$$|a_1| = |a_2| \to 0.2{-}0.3 \tag{10}$$

and again a split is not required (see Fig. 5(b)). But now, as P(x*) tends towards some skew distribution,

$$|a_1| \to 1 \quad \text{or} \quad |a_2| \to 1 \tag{11}$$

and the cluster should be split (see Fig. 5(c)). Therefore, the worst case for |a_1| and |a_2| for a cluster that should not be split is a uniform distribution and, therefore,

$$S_0 > 0.5 \tag{12}$$

S_k is increased with every iteration to insure convergence by γ iterations; that is, S_k = 1 at the γth iteration. Therefore, it is appealing to let S_0 be just slightly greater than 0.5, and it has been found that S_0 = 0.6 yields good results.

[Fig. 5. Values of a_i for (a) a uniform distribution, (b) a normal distribution, and (c) a skew distribution, showing the negative moment arm a_1 and the positive moment arm a_2.]
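As a sketch, the schedule of equation (5) and the normalization of equation (6) can be written down directly. The linear increment (1 - S_0)/γ is our reconstruction from the requirement that S_k reach 1 at the γth iteration.

```python
import numpy as np

def split_threshold(k, gamma, s0=0.6):
    """Dynamic splitting threshold of eq. (5): starts at S_0 = 0.6
    (> 0.5, eq. 12) and rises linearly so that S_k = 1 at iteration gamma."""
    return min(1.0, s0 + k * (1.0 - s0) / gamma)

def normalize(X):
    """Per-dimension normalization of eq. (6): forces -1 <= x*_ij <= 1."""
    return X / np.abs(X).max(axis=0)
```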
DETERMINING CLUSTER SPLITS

To determine if a cluster is to be split, we can use a nearest-neighbor criterion. We first calculate the cluster means, η_i^k, for each cluster k in every dimension i:

$$\eta_i^k = \frac{1}{N_k}\sum_{p=1}^{N_k} x_{ip}^k, \qquad i = 1, 2, \ldots, n \tag{13}$$

for the samples x_{ip}^k in cluster k, where N_k is the number of samples in cluster k. From this we can calculate the average distances D_{i1}^k, the average distance of points > η_i^k in the ith dimension from the mean η_i^k, and D_{i2}^k, the average distance of points < η_i^k in the ith dimension from the mean η_i^k, where

$$D_{i1}^k = \frac{1}{N^{k1}}\sum_{p=1}^{N^{k1}} (x_{ip}^k - \eta_i^k) \tag{14}$$

and

$$D_{i2}^k = \frac{1}{N^{k2}}\sum_{p=1}^{N^{k2}} (x_{ip}^k - \eta_i^k) \tag{15}$$

for N^{k1} = the number of points in cluster k with x_{ip}^k > η_i^k and N^{k2} = the number of points in cluster k with x_{ip}^k < η_i^k. That is, N^{k1} is the number of points on one side of the mean on a dimension and N^{k2} the number of points on the other side (see Fig. 6).

[Fig. 6. Determination of N^{k1} and N^{k2}: the points of cluster k to the left and right of η_i^k.]
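The per-cluster quantities of equations (13)-(15) translate directly into NumPy. A minimal sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def one_sided_arms(Xk):
    """Cluster mean and one-sided average distances, eqs. (13)-(15).

    Xk: (N_k, n) samples of cluster k.  Returns eta (eq. 13) and, per
    dimension, D1 averaged over the N^{k1} points right of the mean and
    D2 over the N^{k2} points left of it (see Fig. 6).
    """
    eta = Xk.mean(axis=0)                            # eq. (13)
    dev = Xk - eta
    right, left = dev > 0, dev < 0
    D1 = (dev * right).sum(axis=0) / np.maximum(right.sum(axis=0), 1)  # eq. (14)
    D2 = (dev * left).sum(axis=0) / np.maximum(left.sum(axis=0), 1)    # eq. (15)
    return eta, D1, D2
```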
Now define

$$a_1^k = \max_i \left[ \frac{D_{i1}^k}{\max_p x_{ip}^k - \eta_i^k} \right] \tag{16}$$

and

$$a_2^k = \max_i \left[ \frac{D_{i2}^k}{\eta_i^k - \min_p x_{ip}^k} \right] \tag{17}$$

for k = 1, 2, ..., m and i = 1, 2, ..., n. Any cluster for which

$$a_1^k > S_k \quad \text{or} \quad a_2^k > S_k \tag{18}$$

is split in the dimension of the corresponding maximum D_{ip}^k (p = 1 or 2).

THE LUMPING PARAMETER

Another possible cluster reconfiguration is the combining, or lumping, of two or more clusters when they become too close to one another. The problem comes in defining the minimum distance allowed before two clusters are combined. This distance, τ, will be referred to as the lumping parameter. Algorithms such as ISODATA need to specify this parameter a priori. It was found empirically, however, that this threshold can be dynamically calculated from several parameters, i.e. the dimension of the data, the present number of clusters, and the average minimum distance between clusters. It was found experimentally that τ can be represented by

$$\tau = \frac{1}{nm}\sum_{k=1}^{m} D_k \tag{19}$$

where n is the dimension of the sample space, m is the present number of clusters, and D_k is the minimum distance between cluster k and the other m-1 clusters; that is,

$$D_k = \min_j \|H_{kj}\|, \qquad j = 1, 2, \ldots, m, \quad k \ne j \tag{20}$$

where H_{kj} = η^k - η^j, η^k is the mean vector (η_1^k, ..., η_n^k) of cluster k, and

$$\|H_{kj}\| = \left[\sum_{i=1}^{n} (\eta_i^k - \eta_i^j)^2\right]^{1/2} \tag{21}$$

Hence

$$\frac{1}{m}\sum_{k=1}^{m} D_k \tag{22}$$

is the average minimum distance between any cluster and its nearest neighbouring cluster. Figure 7 shows an example of these calculations, where each H_{ij} is the distance between the ith and jth clusters. The D_k are found by finding the minimum distance between that cluster and the other m-1 clusters. For the example of Fig. 7,

$$D_1 = H_{12}, \quad D_2 = H_{23}, \quad D_3 = H_{34}, \quad D_4 = H_{34}.$$
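A sketch of the split test of equations (16)-(18) and the lumping parameter of equations (19)-(21), reusing one_sided_arms from above. The normalization of each moment arm by the one-sided range is our reading of the partly garbled originals, so treat this as an assumption rather than the authors' exact formula.

```python
import numpy as np

def split_scores(Xk):
    """Moment-arm ratios a1, a2 of eqs. (16)-(17); split the cluster when
    either exceeds the current threshold S_k (eq. 18)."""
    eta, D1, D2 = one_sided_arms(Xk)
    span_hi = np.maximum(Xk.max(axis=0) - eta, 1e-12)   # range right of mean
    span_lo = np.maximum(eta - Xk.min(axis=0), 1e-12)   # range left of mean
    return np.abs(D1 / span_hi).max(), np.abs(D2 / span_lo).max()

def lumping_parameter(centers):
    """Lumping parameter tau of eq. (19) from the nearest-neighbour
    distances D_k of eqs. (20)-(21)."""
    m, n = centers.shape
    H = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(H, np.inf)        # exclude j == k
    D = H.min(axis=1)                  # D_k, eq. (20)
    return D.sum() / (n * m)           # eq. (19)
```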
[Fig. 7. Example of cluster distances.]

[Fig. 8. Sample determination of lumping two clusters.]

[Fig. 9. Example of two clusters not requiring lumping.]
Now, for any D_k < τ, the kth cluster is eliminated and its members are reassigned to neighboring clusters by the nearest-prototype-point policy (Nilsson, 1965). The technique is exemplified in Fig. 8, where we see two clusters that intuitively should be combined: H_{12} < τ, and the clusters are combined. Note that if the two clusters were sufficiently far apart (intuitively), the means would be farther apart, H_{12} > τ, and the clusters would not be lumped (see Fig. 9).
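A sketch of this elimination-and-regroup step follows. For brevity it dissolves every cluster whose D_k < τ in one pass, whereas the text eliminates a cluster and regroups before re-testing; this is an illustrative simplification, not the authors' exact procedure.

```python
import numpy as np

def lump_step(X, centers, tau):
    """Dissolve clusters whose nearest-neighbour distance D_k < tau and
    regroup all samples by the nearest-prototype-point policy."""
    H = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(H, np.inf)
    survivors = centers[H.min(axis=1) >= tau]
    dist = np.linalg.norm(X[:, None, :] - survivors[None, :, :], axis=2)
    return survivors, dist.argmin(axis=1)
```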
[Fig. 10. Nine-cluster cube configuration.]
RESULTS

Tables 1-3 show the results of three experiments comparing CLASS and ISODATA. The three data sets are an artificially generated Gaussian data set of nine clusters in a cube configuration (see Fig. 10), the IRIS data of Fisher (1936), and channels 1, 6, 10 and 12 of the multi-spectral scanner data taken from aircraft over Tippecanoe County in Indiana by Fu et al. (1969).
[Fig. 11. Nagy (1968) has illustrated the major difficulties in cluster analysis: (a) and (b) nonspherical clusters, (c) bridges between clusters, (d) linearly nonseparable clusters, and (e) unequal cluster populations.]
Table 1. Statistical results of ISODATA and CLASS on the 9-cluster cube

Cluster  Points  Actual means           Actual SD          ISODATA              CLASS
1        105     8.52,  8.44,  8.45     1.12, 1.15, 1.02   105; same as actual  105; same as actual
2        48      13.82, 13.84, 14.07    0.84, 0.86, 0.96   48; same as actual   48; same as actual
3        50      2.58,  13.99, 13.86    0.91, 0.90, 0.83   50; same as actual   50; same as actual
4        44      13.84, 2.96,  14.10    1.21, 1.20, 1.13   44; same as actual   44; same as actual
5        48      3.09,  2.83,  14.07    0.84, 0.96, 0.92   48; same as actual   48; same as actual
6        46      13.79, 14.04, 2.92     0.89, 0.99, 0.98   46; same as actual   46; same as actual
7        53      3.00,  14.15, 3.03     1.01, 1.04, 0.92   53; same as actual   53; same as actual
8        58      13.93, 2.86,  3.26     0.98, 0.99, 0.96   58; same as actual   58; same as actual
9        48      2.88,  3.10,  2.92     1.10, 1.04, 0.88   48; same as actual   48; same as actual

(For every cluster, both ISODATA and CLASS recovered the actual point counts, means, and standard deviations.)
Table 2. Statistical results of ISODATA and CLASS on the IRIS data

         Cluster  Points  Means                     SD
ACTUAL   1        50      5.01, 3.43, 1.46, 0.25    0.35, 0.38, 0.17, 0.11
         2        50      5.94, 2.77, 4.30, 1.33    0.52, 0.31, 0.47, 0.20
         3        50      6.59, 2.97, 5.55, 2.03    0.64, 0.32, 0.55, 0.27
ISODATA  1        50      5.01, 3.43, 1.46, 0.25    0.35, 0.38, 0.17, 0.10
         2        37      5.68, 2.68, 4.09, 1.27    0.39, 0.30, 0.41, 0.18
         3        63      6.60, 2.99, 5.38, 1.92    0.54, 0.30, 0.59, 0.33
CLASS    1        50      5.01, 3.43, 1.46, 0.25    0.35, 0.38, 0.17, 0.11
         2        23      5.49, 2.57, 3.88, 1.19    0.32, 0.25, 0.37, 0.17
         3        40      6.17, 2.85, 4.71, 1.55    0.37, 0.27, 0.31, 0.22
         4        25      6.54, 3.06, 5.49, 2.14    0.26, 0.22, 0.27, 0.24
         5        12      7.47, 3.12, 6.30, 2.05    0.27, 0.40, 0.36, 0.25
Table 3. Statistical results of ISODATA and CLASS on the C-1 data

Cluster            Points  Means                              SD
1  ISODATA         370     170.30, 168.05, 183.28, 175.91     1.97, 1.60, 3.07, 4.70
   CLASS           339     170.62, 167.91, 181.78, 174.54     1.95, 1.95, 4.09, 4.16
2  ISODATA         221     175.36, 161.35, 153.02, 184.48     1.65, 2.64, 4.36, 4.19
   CLASS           190     174.21, 163.73, 159.98, 185.84     2.66, 2.62, 11.54, 2.72
3  ISODATA         264     172.97, 146.61, 132.10, 171.57     2.57, 4.76, 5.93, 3.49
   CLASS           192     173.60, 147.82, 132.69, 172.69     1.97, 2.50, 7.60, 3.33
4  ISODATA         434     167.79, 161.17, 168.20, 175.71     3.19, 1.75, 2.78, 3.30
   CLASS           386     166.74, 161.34, 168.77, 176.11     1.88, 1.74, 3.80, 3.00
5  ISODATA         420     175.34, 155.01, 139.56, 177.34     1.69, 1.75, 4.11, 3.31
   CLASS           279     175.46, 154.91, 144.25, 180.47     1.60, 2.48, 8.96, 2.47
6  ISODATA         134     178.28, 171.08, 175.75, 149.72     2.20, 1.46, 1.72, 5.41
   CLASS           132     178.31, 171.14, 175.79, 149.51     2.17, 1.33, 1.70, 5.16
7  ISODATA         110     170.22, 136.06, 118.98, 166.35     2.92, 4.99, 7.74, 3.17
   CLASS           167     170.56, 138.53, 124.91, 167.41     2.64, 4.16, 10.34, 3.16
8  ISODATA         47      165.62, 117.70, 97.21,  158.89     3.47, 6.10, 10.46, 4.48
   CLASS           55      165.62, 118.84, 100.42, 159.56     3.21, 6.16, 11.77, 4.52
9  CLASS           260     175.29, 157.10, 144.25, 180.47     1.63, 2.40, 7.56, 1.90

(Cluster 9 was found by CLASS only; the point counts of each column sum to the 2000 samples of the data set.)
The results on the C-1 data compare with the ground truth presented by Fu et al. (1969) and with those presented by Elgen et al. (1974) using a different approach. The CLASS and ISODATA programs were tested on a number of data sets, as described in Fromm et al. (1972). The results of these studies showed that the CLASS algorithm matched the results of ISODATA in every case except the IRIS data of Fisher (1936), where CLASS improved the results of ISODATA. This was done with no a priori information about starting vectors or split and lump thresholds. The nonparametric methods described were found to be successful in eliminating the need for a priori information in cluster analysis of most multivariate data.
REMARKS
Heuristics have been presented for calculating the parameters necessary for clustering multivariate data using techniques similar to ISODATA. These rules are based on a number of intuitive observations made in analyzing known two- and three-dimensional data problems. Nagy (1968) points out that intuition breaks down in many cases, such as those shown in Fig. 11. However, by applying rules that correctly analyze known results, it is hoped that some of the arbitrariness of cluster analysis can be removed.
SUMMARY
A clustering algorithm is presented which requires no a priori information about the data to perform the analysis. The algorithm itself is an extension of the ISODATA algorithm of Ball and Hall (1967). The primary goal in the development of CLASS was the elimination of a priori estimates on (1) the number of clusters to start with and their locations, (2) when a cluster is to be split into two, and (3) when two clusters are to be joined into one. The paper describes the development of heuristics that are derived from intuitive and factual information about 2- and 3-dimensional problems and then generalized to n-dimensional problems. Finally, results of using the algorithm on a number of test data sets are presented. The algorithm is thus shown to be a viable tool for the experienced cluster analyst to analyze new data.

Acknowledgements--The authors are indebted to the National Aeronautics and Space Administration (Grant NAS9-12931), the National Science Foundation (Grant GK-37418), and the University of Wisconsin-Milwaukee Graduate School (Grants 7728 and 7030) for their support of this project.
REFERENCES
1. M. R. Anderberg, Cluster Analysis for Applications. Academic Press, NY (1973).
2. G. H. Ball, A comparison of some cluster-seeking techniques, TR Report No. RADC-TR-66-514 (1966).
3. G. H. Ball and D. J. Hall, A clustering technique for summarizing multivariate data, Behavioral Sci. 12, 153-155 (1967).
4. D. J. Elgen and R. A. Northouse, Initial considerations of unsupervised discrete clustering, TR-CS-72-1, University of Wisconsin, Milwaukee (1972).
5. D. J. Elgen, F. R. Fromm and R. A. Northouse, Cluster analysis based on dimensional information with applications to feature selection and classification, IEEE Trans. Syst., Man, Cybern. 4, 284-294 (1974).
6. R. A. Fisher, The use of multiple measurements in taxonomic problems, Human Genetics 6, 179-188 (1936).
7. F. R. Fromm, D. J. Elgen and R. A. Northouse, A comparison of clustering algorithms, TR-CS-72-3, University of Wisconsin, Milwaukee (1972).
8. F. R. Fromm and R. A. Northouse, Some results on non-parametric clustering of large data problems, Proc. 1st Intnl. Jt. Conf. Pattern Recognition, pp. 18-21 (1973).
9. K. S. Fu, D. A. Landgrebe and T. L. Phillips, Information processing of remotely sensed agricultural data, Proc. IEEE, 639-653 (1969).
10. G. Nagy, State of the art in pattern recognition, Proc. IEEE 56, 836-882 (1968).
11. N. J. Nilsson, Learning Machines. McGraw-Hill, NY (1965).
12. A. Ralston and H. S. Wilf, Eds., Mathematical Methods for Digital Computers. Wiley, NY (1960).
About the Author--FREDERICK R. FROMM was born in Milwaukee, Wisconsin, on 28 March 1949. He received the B.S. degree in applied science and engineering and the M.S. degree in electrical engineering in 1972 and 1973, respectively, from the University of Wisconsin-Milwaukee. During 1971-1973 he was both a Graduate Teaching and Research Assistant for the Electrical Engineering and Computer Science Department at the University of Wisconsin-Milwaukee. He is currently a Member of the Technical Staff of Bell Telephone Laboratories, Naperville, Illinois, where he is engaged in the design and implementation of fault recognition and recovery software for peripherals of electronic switching systems. Mr. Fromm is a member of the IEEE.

About the Author--RICHARD A. NORTHOUSE was born in Lanesboro, Minnesota, on 2 April 1938. He received the B.S. and M.S. degrees in electrical engineering from the University of Wisconsin, Madison, in 1966 and 1968, respectively, and the M.S. degree in computer science and the Ph.D. degree in electrical engineering from Purdue University, Lafayette, Indiana, in 1970 and 1971, respectively. From 1966 to 1968 he was an Instructor of Electrical Engineering at the University of Wisconsin-Milwaukee. He was a National Science Foundation Summer Fellow at Santa Clara University, Santa Clara, California, in 1967 and at Worcester Polytechnic Institute, Worcester, Massachusetts, in 1968. From 1968 to 1971 he was a Research Instructor at the School of Electrical Engineering, Purdue University. During the summers of 1970 and 1971 he was a Visiting Scientist at the NASA Manned Spacecraft Center, Houston, Texas. He is presently an Associate Professor of Electrical Engineering and Computer Science at the University of Wisconsin-Milwaukee, where he is engaged in both teaching and research in image processing, artificial intelligence, pattern recognition, computers, and controls. Dr. Northouse is a member of the IEEE, the Association for Computing Machinery, and the Pattern Recognition Society. He is also an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics.