Regional Science and Urban F_conom~cs 10 (1980) 123-151 O North-Holland
APPLICATIONS OF CLUSTER ANALYSIS ALGORITHM TO GEOSTATIS'rlCAL SERIES* Shai EVER.HADANI Hebrew Umverslty, Jerusalem, Israel Untvers,ty of Cahfornta, Berkeley, CA, USA t
Received Aprd 1977 The common problems of cluster a, alysls are briefly revtewed and discussed A cluster analysis algorithm for pamUonmg geostamtJcal senes Ls extenswely descnbed and apphed to examples of various visual and statistical types The utd~zauon of the algorithm m measuring the d~vergence between geosta~tical ¢hstributions is shown. The usefulness of the algorithm is demonstrated m solving the problem of optimal locatmn of m fac~hties A method for reconstructing territories by simple geometric shapes via cluste ~ analys~s is also 8~ven
1. Introduction Cluster analysis is the name given to the class of statistical procedt res which seek to group data points that are m some sense 'close'. The need for partitioning geostatistical series via cluster analysis arises from (a) the inability of the usual geostatistical parameters to describe the set satisfactory in the case when the distribution is fragmented or gapped, ~,nd (b) when the sizes of the sets under study are so large that It is w~ctually impossible to compare them geostatistically using their original points. The purpose of this paper is: (a) to supply a comprehensive brief survey of the problems involving the ase of cluster analysis, and (b) to construct, test and employ a cluster analysts algorithm for geostatistical .~eries. The problems underlying the utilization of cluster analysis procedures ~re discussed in section 2. The proposed multiple algorithm is descrtbed lrl secttons 3 and 4. Section 5 is devoted to apphcattons of the algorithm m geostausucal series and territories. *The second poruon of this paper ~s part of the author's Ph D thes~s done under the supervision of Professor Roberto Bach~, and the author wishes to thank Professor Bach~ for h~s kind gmdance Thts research was supported in part by a scholarship given to the author by Lady Daws Fellov,shtp Trust of the Hebrew Umvers]ty. Jerusalem
124
S Ever-Hadam, Cluster analys)s and geostatlstwal series
Since a tremendous amount of study has been done over the last fifteen years ,n cluster analysis, resultmg m several hundreds of apphed works on th~s subject, we do not w~sh to overburden the b~bhographical hst given m th~s paper The reader is referred to the detailed bibliography by Duran and Odeli (197/.), and by Eventt (1974)
2. The problems of cluster analysis Let P = {p~, . ,PN} denote N entitles from some conceptual population P It is assumed that there exists a set of an observable K characteristics (variables) X~,.. ,Xh on each p, m P. Let m be an integer less than N. The problem is to determine m clusters (subsets) of P, say re1,.., n,,, such that each p, m P belongs to one and only one of the rb such that entities which are asstgned to the same cluster are s~mdar m all the K characteristics, yet entities from different clusters are not smular m all the K charactensucs Altogether there are seven problems involved m the use of such a procedure (1) Choosmg the "best' dtstance function and the "best' optimahty crttenon by means of which the cluster scheme would take place. (2) The 'real' number of clusters present m the data. (3) Standardizauon of the data 14) Detecting the outllers. (5) Convergence of the algortthm. (6) The problem of local minima, (7) The choice of a good mapping procedure into a Euchdean space of a few d~mens~ons, m e~ther the case of qualitative data, or m the more general case of reducing d~menslonahty. The following ts a discussion of the above problems.
2.1. Let x,, x~ and Xk denote any three points (vectors) in a K-dimensional Euchdean space E K. A non-negatwe real valued function d(x,,xj) is said to be a distance function (metric) if and only if
lal d('q.x~)>O for all x, and xj m El,, (b) d(x,,x~)=O
if and only If
(c) d(x,, xs)=d(xj, x,), (d) d(x,,xj)<=d(x,,xl,)+d(xk,
xj).
x,=',cj,
S Et er-Had{m#, ( luster analt ~l~ and geo~tatl~ttc al ~t'rte~
123
Since the final clusters are relative to the particular measure of d~stance function (metric) chosen, ~t is ~mportant to choose a 'rational' metric In the present study we analyze two-dimensional series which sho~ distributions over a given terntory Igeostat~stlcal series). Therefore obx~ously the natural choice of metric would be the I2 Euchdean metric defined by (1)
d(x,.x,)=C~=~
(x,~,-x,~,)2)~,
where x,h and xjk denote the kth component of x, and
(1)
xj, respectively
In the process of grouping data points there are m general two kinds of schemes: (a) The hterarchicals, and (b) The optimizers. Although some methods are, strictly speaking, neither h~erarchlcals nor optimizers, we ignore them here for the sake of slmphclty The hierarchical algorithms group the data points palrwlse so as to form a tree structure (dendogram) using some palrwlse measure. The optimizers group the data points so as to minimize (or maximize) some fixed m advance objective function. Our a~m is to split a geostat~stical set mto several 'compact' subsets Therefore a rational optimization function ts the one which minimizes the total within group distance varmnce and maximizes the total between group d~stance varmnce. 2.2. The problem of the 'real' numbers of clusters (subgroupst presented m the data is one of the baste problems of cluster analysis The question i,, whether we impose a structure on the data or (hopelully) find tt This topic was treated by several authors [see, for example, Beal (1969). Cahnsk~ and Harabasz (1971), Fortier and Solomon (1966)], suggesting methods ranging from choosing the number of clusters a priora to techmques based on some known statistical distributions The strategy we propose to use regarding this question ts to compare somc objective function computed from the given set of s~ze N to the distribution of the same function calculated from a large number of random sets of size N, assuming that such random sets are essentially not clustered An cxtenssxe descriplion of this procedure is gwen in subsection 4 3. 2.3. Since most metrics are ,~armnt with the scale of the ~anables. tt is Important to scale the variables before the clustering procedure takes place 'For an exact definition of the dJ,,tance ~,arlanee ,,ee Flachl 11962)
126
S Et,er-Hadam, Cluster analrsls and geostatlsttcal sertes
Fortunately m the case of geostaUsttcal study such a scahng ts not necessary espectally when the set points are given by their X and Y coordinates along the prmc~pal axes of the set. However m the more general case we highly recoLl,mend to standardize the data by the range rather than by centering and then dividing ~t by a scale factor, smce the former scahng ts much more sens~ttve to most of the cluster analys~s techniques than the latter one. 2.4. By definition an outher is an isolated data point, i e., in using almost any cluster analysis scheme this point v, ould form a cluster (subgroup) by ttself The exmtence of such outlier points ~s of ~mportance especially in the multivariate cases where it can badly affect the whole procedure [Gnanadeslkan and Wilk (1969)'1. However, m the visual two-dnnenstonai case those pomts can be lmmedtately detected and therefore removed even before the clustermg scheme takes place 2.5. One of the questions involving cluster analysts is whether the algorithm used converges The problem mtght occur only with regard to the opumizers, smce the hterarch~cal algorithms always converge. Although the term 'convergence' is somehow vague, a practical approach to this problem is to consider the algorithm as a function of the computer-time t. Let ~b(t) denote the objecttve funcuon we wish to maxtmtze. Then if for each t and a reasonable At > 0, cp(t + At)> qb(t) we say that the algorithm converges. Once attained then a stopping rule must be introduced to stop the computer program in a reasonable ttme. 2.6. In the optimizer scheme a direct way of solwng the problem is to evaluate the objective function for each clustering alternative and then choose the partltton y~eldmg the optimal value of the objectwe function. Hox ever th~s procedure ~s wrtually impossible unless N is very small since the number of clustering alternatives behaves hke m N as N - - , ~ . 2 S:nce total enumer&tion is practically impossible one must make do w~th semt optimal solutions. An alternative to the complete enumeration is to ut~hze the techniques called Dynamtc Programming [see for instance Rao (1971) and Jensen (1969)]. But even with those techniques it is still lmpracttcal to work w,th sets of a fair size, say over 30 points. However to evaluate a g~ven algorithm one may compare its performance, even on a small set of data, to the exact solutton attamed by a dynamic programming algortthm If the results are sattsfactory, one can conclude that the algorithm used is reasonable and can be generalized to practical data. Such an approach is used in our study and is descrtbed m subsection 4.2. 2For an extenstve dlscusslon on thts subject, see Duran and Odell (1974)
S Et'er-Hadam, Cluster anahsts and geostattstwal serws
127
2.7. The most mterestmg aspect m cluster analysis is the problem of choosmg the "correct' mappmg procedure of the data into a Euchdean space of a few dimensions Although th~s subject has no direct influence on our study, since we are deahng already w,th two-dJmens~onal series and indeed no mappmg procedure is needed, we feel that a brief overview of this topic ,s worthwhile. The problem arises either when all the varmbles are quahtatwe (categorical), or in the more general case when a reduction of d~mens~onahty ~s sought while preserwng the mare structure of the original data Having K qualitative variables a measure of similarity, or dlss~mdanty, between each two individuals p,, p~ m P can be calculated [for example, see Sokal and Sheath (1963) and Gower (1971)]. Let D = [ D , j ] denote the slmdar,ty (dJsslmdanty) matr,x between the (~) paxrs of elements m P One method of representing graphically such a matrix is by uuhzmg the ~dea of two-dimensional hierarchical "dendogram' or 'tree structure' [ H a m g a n (19671 and Johnson (1967)]. In the hierarchical tree dmgram the elements are usually labeled verucally on the left and the clustering results are indicated to the right. The levels of simdanty at which new clusters are formed are indicated horizontally across the top of the dtagram Another method for mapping a s~mdanty (d~ss~mdanty) matrix into a Euclidean space is due to Shepard (1962), Kruskal (1964a, b) and Guttman (1968). This techmque endeavours to find a set of pomts PI, ~= 1. ,N. m a reduced number of d~menslons, such that the d~stances m th~s lower dimensional space are monotomcally related to the slmdarmes (dissimilarities) D,~, i e., for each quartet of md~ces ~, j, k, l If
D,j>D~t
then
d,j
(21
where the d o and d~ denote the Euclidean d~stances between the points PI, P~ and P;,, P~, respectively. 4 In general the monotonicity property (2) cannot be completely satisfied For this purpose Guttman (1968) mtroduces a measure called 'coefficient of alienation" defined by
6=(1 _#2)½
(3)
where N
// N
N
,,,
,,
d"d'*'/iX
\
d;'*l :
aln case D~I is a d,ss~mdanty measure the ,nequahty sign between d q and dkj ~s reversed '*The method is called 'Smallest 3pace Anal)sis' or S S A ,n short
(4)
S Erer.Hadant Cluqer analysts and geostatlstwal serws
128
where the ~,,,j ~a,~ are connected to ftd,jj by N
d,3=
N
Y k=l
{5)
1=1
where J,.ikt denotes some permutation function such that whenever (2) exists for all ~, I, k and I then/~ = 1. When ~ ~s less than say 0 15 ~t ~s assumed that the representation of the t matrix D by the corresponding N points ~P,} ~s sumclent to draw some valid conclusmns about the structure of the data A further technique which may be used to obtain a metric representation of a similarity matrix D ~s described by Gower (1966, 1967), who showed that if the matrix D ~s posmve sem~-defimte then c , ~ = ( 2 ( I - D , j ) ) t can function as the Euchdean dmtance between the points P~ and P'j representing the objects P, and Pj. The primary objectwe of these strategies is the same, namely mapping the elements into a Euchdean space of a low d~mensmnahty (usually twodimensional), while preservmg the mare features of the data. However the S.S.A. method ~s more general since ~t does not require that the N pomts PI would form the nodes of a dendogram and it does not pose restrictions on the matrix D In the other case where all the K variables are quantttatwe there are basically two strateg,es of reducmg dimenstonahty One is based on the principal component theory [see for example Rao {1964)] and the other is based on going through the procedure described above, namely calculating some s~mflanty (dlssimflantyj matrix D and mapping it m some way into two-dlmensional Euchdean space. To summarize the present paragraph two pomts should be mentioned: {a) The key point to the mapping scheme, hawng gone through the matrix D, lies mewtably m a rauonal choice of this matrix. Ibj One should be cautious in using the principal components approach especmlly m the case of existing outliers.
3. The proposed algorithm In the following we consider a muluple algorithm which enables one to spht a geostatlst~cal series so as to minimize the total distance variance (sum of squares)'wRhm', and maximize the total distance variance 'between' While domg so specml attention will be focused upon most of the problems discussed m section 2. Let there be gwen a set of N points m a two-dimensional Euclidean space.
S Et e r - H a d a m . C l u s t e r analts~s and g e o s t a t ~ t w a l serw~
129
each point P~ hawng coordinates (xrya) and posture weight ~ The coordinates are g~ven m terms of the prmmpal axes of the set The problem ~s the determmaUon of m addmonal points Qo, a = 1. ,m. each having coordinates (~,,, t/,,) in the same Euchdean space, such that those points would divide the N points P~ into m mutually excluswe and exhausuve non empty subsets, such that the 'wRhm' sum of squared d~stances ~s minimized, and the 'between" sum of squared d~stances ~s maximized Since we are interested m diwdmg the whole set into a relanvely 'small' number of subsets we restrict ourselves to small m's between 2 and 9 Let d~,. be the Euchdean d~stance between point P~ and point Qo Let b:/:a, b = 1, m, %o=wj ff d~ <,./2 ..,~,.~ =0
otherwise.
Let Iv,
m
---
~ j a d ja
(6)
•
1=1 a= 1
The problem is to determine the m points Qo such that U is minimized D~fferentmtmg U by ~,~ and th (ignoring the %,/s since they are step functions), we get
~'U/?~,, =
2f~ ( ~o - .~),
(7)
~U/~h = 2J~ (~h - Yo),
(8)
where
E w,.. j=l
E w,o,,.,f.. 1=1
(9)
J=l
We assume t h a t / o > 0 for all a. However, m case j~,=0 for some t,, we oral, that particular partition. Equating (7) and (8) to zero does not solve the problem since Ca, qo, ,~o and .~ are unknowns. Therefore an ~terauve method is used as follows: Let (~'J,r/~ ~}) be some 'first approxlmauon' for the points Qa. Those points determine a value of the funcuon U denoted by U ~ A smaller value for U, say U c2~, and therefore a better solution for the problem is attained by ~a1~(2}~2}~,J'1o which are connected to the fust approximation by the formulae
d>0. la
--)la
--
~lncasetlesexlst. le d j2. = d , ~2 b e define ~ , a = ~ j h = ~
"1
-
S Ever.Hadanf, Cluster anal~s~s and geostat,sttcai serws
130
and after derivation, ~(2)_r(I)
(1)
(1o) 0
n(x)_.c,,a
--
(11)
"/a
To ensure proper convergence the value of c=½ is used. The iteration process (10) and (11) is ropeated untd either (12) is satisfied
[U,,+1 ) _ U(,)[/U.)
(12)
where 0 < e ~s less than say 0 05, or after a fixed number of iterations, say 10. Since the m points Qa determine a umque partition of the original set, finding them ~s equivalent to finding a partition of the latter. This procedure developed first by Guttman (1969) gwes some solution to the problem. However, it suffers heavily from the dependency of the final solution to the inmal approximation (~tl), r/~l)), a = 1,..., m. In order to determine a 'ratmnar first approximation to U we propose three alternatives as followsThe first one is based on a hierarchical scheme m which the most adjacent points P,1 and P~x are fused into a new point P,. whose coordinates are the coordinates of the center of gravity of the prewous two points, and whose weight is the sum of the weights of points P,t and Pj,. Then we ehmmate points P,I and PjI and add the new point P,.. Thus the number of points is reduced from N to N - 1 The next step ts to repeat the above procedure wRh the N - 1 'new' points, and so on untd we end up with m points. These m points determine a first approximation for U. The second alternative ~s based on a two-phase procedure in which at the beginning m initial 'centers' are chosen. To these centers we add points from the set - one at a time - so as to mmtm~ze a loss function U' analogous to the function U. The adding of those points causes a change m the m 'centers', thus creating new ones. To these we continue to add points until all the points from the set have been accounted for. The final m 'centers' are the ones statable for the first approximation of the algorithm based on U. Here should be noted the problem of choosing the initial m 'centers'. However the dependency of the final solution on these chosen 'centers' Is not so crmcal as in the procedure based on the minimization of the function U. Let i
wjo = ~ =0
)
if 'Pj belongs to region a , otherwise
S Et er-Hadam. Cluster anal;sis and geostattsHcal serzes
131
The term "Pj belongs to region a' shall be explained below Let ~'
m
u'=ZZ
' ' a, wjad~
(13)
1=1°=1
djo is the Euclidean distance of point Pj from the center of region a The criterion used in assigning points from the set to the regions is based on minimizing the differences: U'(l+ I ) - U'(i), I= 1,.. ,N-re. i is the phase number. Here it should be pointed out that U ' ( / = I ) = 0 , since In the first phase the m points coincide with the m centers. It can be proved [Ever-Hadam (1975)] that according to the last criterion it ms necessary and sufficient to add point Pj to regmon (sub-group} b in the I+ 1 phase, if and only if the mdlces 0,b) fulfil the condmon (14),
Mm ~" f'~(ll)~ d2(i ) ,= t . . ~ " {f'.(l) + w, a = 1,
f'b(i)wj dfbll)
f'b(I) +)v~
(14)
.rn
in the lth phase. N
r
f'~(I)= ~, ~%o(I) I=1
ts the sum of weights of 'region' a in the Ith phase A rational choice for the lnmal m 'centers' is based on the hmrarchmal process described previously The third alternative for the first approximation for U is based on a 'combmatorlal' approach by dtvldmg the area on which the N points P, are spread into L x L(m
mmk(U~). A more detailed d,scussion of this method ,s g~ven m subse:tmon 4 1 We shall denote the final solutions attamed by the first alternative as the hierarchical (method I). by the second alternative as the tterat.e (method IIL and by the third alternative as the combmatortal {method III) Here mt should be mentioned that method IlI can be. applied only to sets of two dimensions However methods I and II are applicable to sets m any fimte Euclidean space.
132
S Eter-Hadam, Cluste~ analysis and geostat~st~cal series
4. Problems in the application of the algorithm In applying the proposed multiple algorithm one should consider four problems" (1) How to choose the best value of L m method IlI? (2) Which of the three alternatives gwes the best partition in terms of minimizing U [eq (6)]? (3) How to determine the 'real' optimal number of regions (clusters) present m the geostaUsucal series '~ (4) How to take into account th: she,pe and size of the territory on which the points of the series are spread'~ 4.1. Since we hm~t ourselves to m's between 2 and 9, clearly L>=3 m order the keep the mequalmes L2 > m for each m = 2 , . . , 9. The question is what ~s the practical upper bound for L '~ To answer th~s question let us look at an hypothetical geostatistlcal series, fig. 1, for which L = 3 and the corresponding grid which forms nine rectangles (boxes). Let m be fixed, 2 < m < 9.
D
q
X __
g•
O•
•
•
Q
F~g 1
The geostatlstlcal set consists of 30 eqm-welghted points given in terms of thetr principal axes. The rectangle area on which the set points are drawn is bounded by the values, mm(x,), max(xi), min(yi), max()',), and is the mmtmal rectangle containing the set. This fact implies that the probabdtty of finding empty boxes is low Let f~l~ be defined by (9) m the initial step (the first approximation). Due to the low probabdlty of finding empty boxes we may assume that the corresponding Jai1~,s for each one of the (~.') L' possible combinations would be positive. However ff such empty boxes are found, and as a consequence we might have ~f~l~=0 for some a, a = l , , m , w e omit that particular initial combination
S Eter-Hadam, Cluster anahsls and geostatlstlcal ~erles
133
On the other hand tf f ~ t l = 0 for all a=l,. ,m we omit altogether that pamcular partRmn. A sRuation hke th~s indicates that we are trying to ~mpose the ' "ong pamtmn on the set. For example if the series is defimtely two-clustered and we try to spht ~t into three dusters. Let U~(m), U2(m)and U3{rn,L ) denote the final values of U ¢aicula:ed by methods I, il and III, respectively, using the ~terat~ve processes {10), Ill) and {12) where c - ~ . Obviously U3{m,L) is a decreasing function of L,
U3(m, L)>- U3(nl,L + 1),
(151
for each L. Let
Olm)=min(Ul(mL U2(ml}.
(16}
Let U{m) denote the exact optimal value of U Then
U(m)
(171
U{m)
I!81
U(m)<-U3(m,L) for all L
(191
However, !1 ~s clear that hm U3(m,L)=UIm).
(20)
L~t
The strategy of deciding what the practical value of L should be depends upon O(m} and U3(m,L). Due to (20) there exists Lo minimal such that
O(m)~_ Ua(m,Lo).
(21}
The value of L 0 may be found by equating U(ml to U3(m.L} for different values of L, starting from L = 3 and increasing ~t each time by unity For example ff
O(mj
{22)
O(m)>_U3(m,4).
123}
but
then Ln = 4
134
S Ever-Hadam, Cluster analvszs and geostatlst~cal ser,es
The mequahty (21) implies that neither method I nor method II can do better than method III with L o Thus by employing this method we increase substantmlly the probabd~ty of attaining the exact optimal solution To conclude three pomts should be noted" (a) In practice Lo is 3, 4 or at most 5 (b) Inequalmes such as (22) and (23) are functions of m. That is for some m (22) exists, for another m (23) may hold (c) In reahty methods I and I][ are first employed. Then method III until (21) is satisfied for each m. Although this procedure ensures closeness to the exact optimal solution, ~t ~s costly. An alternatwe ~s to e~ther employ raethods I and II separately, or use method III with a fixed L regardless of (21) 4.2. As was lnd~cated m subsecuon 2.6, an evaluation of a cluster analys~s algorithm can be done by comparmg ~ts performance to an exact algorithm on a small set of data Using th~s tactic 50 two-dimensional random sets uniformly d~strtbuted over the umt square were generated, Each such set, of size N = 16, was dw~ded into 2, 3 and 4 clusters, and the corresponding U functions U~(m), U2(m) and U3(m,L=3) were calculated In addition a dynam,c programming method was employed. The results showed that method I was correct m 53°,° of the cases, method II was correct in 68°°, whde method III with L = 3 was correct m 100°'o of the cases. With regard to computer time, method I was the most rapid. Method III was the slowest, while method 1I was mtermedmte, slower than 1, yet faster than III Here it should be stressed thai an exact optimal solution for partmoning a set of N elements into m = 2 subsets is possible, by choosing all the possible pairs of points ~P, Pj), whose numbers are (~1 for the mltal approximation for the function U. As a matter of fact the combmatorml approach, method III, is an approxlmauon to the latte~ notion. 4.3. As was menuoned m subsection 2.2, the tactics we use to solve the problem of the 'real' (optimal) numl:er of clusters (regions) presented m a geostat~st~cal series is based on comparmg our particular objective function t~ [eq 10)] to its d~strlbut~on calculated from a large number of geostatistlcal random sets To th~s end define the following notations' Let d~ denote the squared Euchdean d~stance from pomt P./to the center of the set Let
Uq! )= ~ ./=1
~,,dj2c
(24)
S Eter-Hadam, Cluster anal'~ts and geostatlsttcal series
135
Let U(m) denote the final value of U after employing the multiple algorithm wnh L defined by (21) Let tl2(m ) = 1 - U ( m ) / U
(1).
(25)
Clearly r/Z(m) is a dtmenslonless parameter which saUsfies 0 - - < r / 2 ( m ) -<- 1 ant. ~i2(m)
100 a theoretical model set is considered, in which the d~stance between each pa~r of pomts assigned to a g~ven subgroup is a constant value X, and the distance between each pmr of centers of two subgroups is a constant value Y, Y > X In such a set t/Zlm) = (1 + (X2/Y 2)[(m/(m- 1))(1 - m/N )] )- t
1261
Let r/~ 9s(m) denote the "95 percentde' of the empirical dtstrlbutlon of tl~ for random sets of N eqm-wetghted po,nts (N < 100), each divided into m subgroups (clusters). Let r/2(m) denote an actual value of tl z for a glx,en parucular set of N s elements dwided into m subsets. The stausttcally significant (optimal) number of subsets for such a set is determined by mo satisfying
max
2~m_s9
tl=(m)/tl~
gsim)=t/2(mo)/r/~, ~s(mo),
1,12(mo)/~l~ 95(mo)>1.
(27) 128)
if an opposite lnequahty to (28) exists for each m =2. ,9. that partIcular set is considered to be "statlsucally ind~x~s~ble' In such a s~tuat~on a weaker criterion may be mtroduced by using the means of the corresponding empirical dlstributmns mstead of the 0 95 pomts For sets of N > 100 points let
Q(m)=tl~oo 9~(m) "The restnctton on N _>_25 ts of no praettcal s*gmficance 8We assume for the moment that N < 100
~29)
136
S EL'er-Hadam Cluster analvszs and geostatlstwal serws
By equating (29) and (26) and msertmg N = 100 lr (26) we get an estimate of the ratio X2,'Y 2 Let /~ denote that estimate Then for a particular set of N > 100 the optimal number of clusters ts determmed by mo sausfymg max m=2.
{~/2(,n)[1
1))(l-m/N)}}
9
(30)
= r/2 (mo){ 1 + R (mo/(m o - 1 ))( 1- too~N)},
provided that the right side of (30) is greater than, or equal to, umty Here ~t should be stressed that theorettcaily a more precise way to determine the optimal number of clusters for a given two-dimensional set ts to combine the above procedure with another random-oriented scheme as follows: Let nj, j = l , . ,m, denote the number of pomts that fall in the jth cluster after applymg the proposed algorithm. The random techmque we propose is to spht the set Into m subsets randomly with no optimal considerations whatsoever, keeping the sizes of the subsets fixed such that nj points are asstgned to the jth subset. Each such division determines a value of ~l2 say r/'2(m). By repeating this process many times the random empirical distribution of tl '2 for the set under study can be estimated. Let D(m) denote the Euclidean dtstance calculated between the empirical dlstrlbuuon of r/'2 and the optmlal value r/21m). Then if m0 satisfies (27) and (28) [or (30)], and
ID(m)~,=D(mo),
max t¢l = 2 ,
131)
.9
we can conclude that mo is the 'real' optimal number of clusters typical to lhe set. To summarize the present discussion four pomts should be considered: (a) Since the above random procedure has to be repeated for each set under study the method is costly and can be applied only in studies in which the computer budget is generous. For this reason we refer to it only theoretically (b) In case the ratio defined by (27) Js greater than, or equal to, umty for almost each m = 2 ..... 9, we conclude that the set under study has a contmuous elongated sausage-hke shape and no apparent clustering ~s needed (c) The empirical dmtnbutton of t/2, as described calculated by generating 500 random sets which consist pomts dtstnbuted randomly over the umt square. In geostattsttcal set is not distributed over the umt square and
previously, was of eqm-weighted reahty a given its points are not
S Eter-Hadam. Cluster anal)sis and geostatistJcal ~er~e~
137
es~entmlly eqm-we~ghted. However smce the Euchdean dmtance is mvarmnt to any hnear transformation the first restriction causes no problem The second restriction casts doubt on the ~,ahdlty of (27) and (28) [or (30)] However m pracuce the varmbfl~ty of the weights has no apparent effect on r/a(m) as was tested by Monte Carlo methods. (d) Although theoretmcally we expect (27), (28) [or (30)] and (31) to y~eld the same 'real' mo for almost all practical cases, one has to consider the posslbdity of an amb~gmty which may arise when the set is e~ther too oblong or contains too many outliers.
4.4. Any geostat~stical set is g~ven m terms of both ~ts points and the temtory on which those pomts are spread In order to take into consideration the shape of the territory contammg the set we consider the following notations. Let a~,.p, a~ p, a~,,, tr~,t represent the standard devmtlons along the principal axes of the geostatmtlcai set and the terntor,,, respectively." Let 1321 Let tr~ po, tr~ ~, tr,, ,,,, tr~ ,a represent the above standard deviations for region (cluster, subgroup) a, a = 1..... m Let (33) Let l., be g~ven by (9). Let
S'(m)=~
loSp,a/ ~
a=l
(~,
134)
~ a=l
and R(m )= (Sm - S'(m))/Sr~.
135)
The higher the R(m) the more compact are the points of the set m the different regions, in comparison to the territorial shapes of the corresponding regions. By virtue of Rim) we may proceed to a more prec~sc definitIon of d partR1on of a geostatlst~cal series Let (36)
~,(m )= R(rn )d/(m ), " F o r a n o p e r a t v , e definltton of
o,,
o, ,. a~ e ,r, p see Bach1 1197"~ 19761
S Ever-Hadam, Clu~ter analysis and geostatistrcal ser,es
138
where 2,
,1
@(m)=v/ (m)/~¢ 95(m).
(37)
If there exists mo such that maXmd/(m)=d/(mo)>=l and max,,.R(m)-R(mo) >0, and therefore max,~(m)=~(mo), then m0 is the number of optimal subgroups (regions) for the set, 'too ~s opumal'. This state shall be denoted symbohcally by the 'partmon parameter' p, which is assigned its maximum value 4 (p = 4). If p ¢ 4 but there exists mo such that max,,@(m)=~,(mo)> 1 and R(mo)>0, then mo ~s optimal and p = 3. If p ~ 3 , 4 but there exists mo such that ~,(mo)> 1 and R(mo)>0, then mo is opumal and p = 2. If p¢:2, 3,4 but there exists mo such that max,.~,(m)=@(mo) and R(mo)>O, then mo ~s optimal and p = 1. If p ¢ 1, 2, 3, 4 then the set ~s 'stausUcally lndw~s~ble' and p = 0. In such a case a weaker criterion may be used, e.g. instead of using the 0.95 points, the means of the empmcal distributions of r/2 may be used. Here ~t should be noted that attaining h~gh values of p ~s not equivalent to the minimizing of U [eq. (6)1, because of the term R(m) [eq. (35)]. However m most practical cases we expect a positwe correlation between ~,(m) and R(m). In general one can overcome this d~fficulty, if one w~shes to ignore the lnflue~,ce of the territory, by reducing the territory on which the set points are spread to the Convex Hull of the set defined by the minimum convex area containing the set points. Such a procedure ensures the equivalence between maximizing p and maximizing ~,(m) [eq. (37)].
5. Applications Th~s section ~s devoted to apphcat~ons of the algorithm m geostatistlcal series and territories. 5.1. The first apphed question that arises ~s whether the proposed algorithm y~elds the correct partmon whenever such a partmon can be detected easdy To answer this question a 'rat~onahzation test" was performed m which m-clustered sets were purposely generated for each m - 2 . . . . 9. The outcome was exactly as expected In sets dw~ded a prior~ into m regions over the umt square m was optimal and slgmficant satisfying (25) and (28) [EverHadanl (1975)] 5.2. The tlext geostatlst~cal apphcat~on ~s devoted to three examples displaying data on
S Et er-Hadam. Cluster analysts and geostatzstwal serws
139
(1) U.SA. The &stnbutlon of cotton mills, 1935. [Source White, C L and Foscue, E.J (1943) Regional geography of Anglo-America, Prentice-Hall, p 169.] (2) Tasmania. The &stributlon of aboriginal man m Tasmama, archaeological remains (after Jones, 1968). [Source. Mulvaney, D.J. and Golson, J. (1971), Aboriginal man and enwronment m Austraha. Australian National Umversity Press, p 276.] (3) Brazil. The distribution of coffee harvest in Sa6 Paulo between 1934-35 to 1939-40. [Source: Monbeig, P. (1952). Pionmers et planteurs de Sao Paulo, Cahlers de la Fondation Nationale des Sciences Polmques Librarie Armand Colin, 28, p. 233.] In the left of fig. 2 the usual geostat~stlcal parameters are presented Those parameters are: (i) The center of gravity whtch indicates the general location (ii) The principal axes which mdtcate &rectlon and the principal elhpse which measures spread. (rio The ellipses have been enlarged so as to include at least 75°-,0 of the whole set. The right maps m fig. 2 represent the geostatlst~cal series after the rnult~ple algorithm has been apphed In each regton (cluster) a graphical rational pattern is given [see Bacbt (1968)-I which m&cates the percentage of points contained in the region. In all those examples the partmon parameter p defined in subsection 4 4 attained the high value of p = 3 However ~t should be noted that in the third example of Brazil the partmon into two regions was not 'statlsUcally significant' and the computer program has switched to the means of the empirical &stnbution of r/2. The latter result indicates that the partition was then imposed on the data as seems to be quite clear due to the fact that the description of the entire set by the usual parameters ~s good In all those examples we could not obtain the original hst of data which indicate the exact location of the set points. Therefore we haxe esumated their location by using an automatic dlgitmzer attzched to a punch card machine. In the first example of U.S.A. we estimated the number of points as N = 393. In the second example of Australia (Tasmama) the original map includes two d~stnbut[ons: (a) distnbutton of aboriginal man m Tasmama, archaeological remains after Jones, and (b) dtrect ethnographic observations after Hmtt
140
S Ever-Hadau,, Cluster anahs~s and geostattst~cal serws
,'
)
,
,,
,,
7Te:L
~ ..... - ....
,'
~,.. ,,
r---',,
'
-.
r
~\:'-,-d
""'N
.-"'.
-_.,~;__/..-'~
~
">
,'---:-. "
=~,-
!___"?'::~_""-~_..7~., W ~ .]..).". "~ j~l.~ "~'~'~
, . , q ~ ,•, ~ ~-.
I 5.
~
~,
"~'
• I
L:-
\,,
\.
~d
_1
. - ~ i -.,.¢j
, ~ . , :.T~.~,-~:~
k)\
.~ ~".'...v..~
w_~] i . . . . . --tL"
c F~g 2 Usual geostahst]cal parameters (left-hand szde) and results of the partition [right-hand side) for the U S A, Tasmania, and Brazd
In our example we refer only to (a) with N = 2 0 5 points, ignoring the points outside the ~sland boundaries for obwous reasons. In the last example of Brazil the N = 158 points, unhke the other examples, have different weights. 5.3. A further quesnon concerning the use of the algorithm on geostatisucal data is the measure to what extent the m 'centro~ds" (the m centers of the subgroups) represent statistically the N original points of the set.
S Et'er-Hadam, Cluster anal~ sts and geostattsttcal sertes
141
In znswer to th~s question two series of tests were performed In the first series 100 random sets d,stnbuted uniformly over the umt square were generated. Those sets of size N = 2 5 , 50, 75 and 100 were dw~ded mto m = 3 , 4 ..... 9 subsets. ~° Then the exact geostatlst~cal parameters based on the N original points were compared to the est,mated parameters which were based on the m weighted centrolds. in the second series of tests the same comparisons were run. usmg data on seven Jewish population distributions in Israel, 1961. The chosen geostatlstical parameters for those tests are (1) the standard dewation along the prmctpal axes - a~, a~. (2) the angle of rotation between the usual axes and the principal axes - ~. and (3) the oblongtty of the set [Bachl {1976)] defined by
B=½1o' ,/a> +or,/a ].
138 )
The center of grawty ~s equal ,n both calculations, therefore ~t ~s omitted Results showed that m the random sets, where there ~s naturally no defimte pattern (no natural clusters), the h~gher the proport,on re~N, the closer were the estimated parameters to the real ones However m the sexen actual sets, the results demonstrated that the estimated parameters. calculated accordmg to the m optamal weighted centro~ds, were very close to the exact ones, regardless of the sizes of the sets, as ~s shown m table 1 [Ever-Hadani (1975)] 5.4. Another geostattstical apphcatlon of the proposed method ~s ItS use m measurmg the mm~mum d~stance between two geostat~st,cal series P = [p,~_ J -I and Q={qj}~=t, distrtbuted over the same territory and having the same total weights. (The latter assumption ~s easdy attained by standard,zauon ) Let ~-~'.V,j,-_~Nt and ~"q,~J=~' ~N" ~ denote the weights of P and Q. respectwely We assume that N
N
139~ a:l
j=l
Let d, denote the Euchdean d~stance between po,nt p, e P to one and only one pomt q~eQ The avera,,.e minimum distance between P and Q is defined by mm(d)e.Q= mm
~v,t~ I=1
t"Clearl.~ the case of m = 2 ~ a s ormtted
wv, ~ 1=1
1401
142
S EL,er-Hadan:, Cluster analys,s and geostatzstlcal series
IIIII1~
m
"0
I l l l t l ~
~j
~J
=_=
2
q.
VII
==
oo CA
r" eo
r.4
v
~ ~ ~ ~.~ m
~,~ 0
VII
Z
Q
,
.
.
<
E
A m
A
VII u e,j
S Ecer-Hadam, Cluster analysis and geostatJstwal serzes
143
v,here the mJmmum ~s over all the N! possible connections [see Bach~ (19621]. Due to (39) either the weights wp, or the weights w~, may be used m (40). Although there are some Reratwe methods which enable one to achieve a very close solution to (40) [Yam (1969)], they cannot be considered to be practical for N large, say over 50, and a rational reduction m N is necessary using for instance clustering techmques. An example of the use of the proposed algorithm m analyzing empirical data using (40) is given by calculating the minimum distance between each pair of the seven populatmn distributmns described In table 1. The resulting matrix [minld)~j] which was calculated using the opumal ~' centro~ds for each di, tribution is given in fig. 3. This matrix was analyzed using the Smallest Space Analysis (S.S.A.) method described m subsection 2.7 [Guttm.,n (1968)]. The results are summarized m a map (fig 4), m which the points P~ represent the seven population distributions under study - P,, l = 1. ,7 The very low value (0.0076) of the coefficient of ahenatlon 6 [eq. (31] indicates a very good representation of the matrix. Actually there ~s only one
I 2 3 4 5 6
2
3
4
5
6
7
31 73 ---. . . .
3390 1369 --. .
2208 2379 2663 -. .
4635 3497 3694 38 20
3967 1972 1732 31 Ol 32 70
31 92 1500 1931 22 59 30 13 22 16
.
Fig 3 The matrix mm(d), j m kilometer,.
® ®
(D
®
Q
®
C F~g 4 S S A between seven populatmn groups in Israel. coeffictent of ahenatmn = 0 0 0 7 6
~tln all these d~strlbutmns the optimal nt attained tts h~ghesl pracllcal ~alue ol 9. ¢,tth the part~tmn parameter p = 3
144
S Ever-Hadant. Cluster analysts and geostattst;,al series
violation at the monotonlcity property (2~ as nun(d),,.7 d4,2
Altl-.ough the geographic and derlographlc interpretation of the map is typ;cal only to Israel and is beyond the scope of the present paper, It should bt: pointed out that those 'innocent' seven points actually represent, in a most compact way, the processm,2 of a fairly large amount of data involving more than 3,000 two-dimens|oaal points. Due to the official partlt,on of Israel into 36 'natural' regions it was made possible to calculate the same distance matrix usmg the 36 centers of gravity of each region Gomg through the same procedure we have arrived at the same structure [Ever-Had~m (1975)]. Although the reformation usea in the former procedure was based on m = 9 points for each set as compared to 36 points in the latter procedure, the general conclusion is the same indicating that the rationale ~s correct. 5.5. A further geostatlstlcai example of the use of the proposed algorithm is in solving the problem of locating m facdmes serwng a population distributed over a given territory. Let there be N points given in a two-dimensional space, each point Pj having positive weight (population) wj The basic assumption is that each pa~r of points P. and Pj is connected by a minimal route (road) %, which is not essentially straight. The problem is to locate m central faclhtles Q,,, a = 1,. ,m, such that each mdiwdual from any point Pj can obtain service from one and only one of the m facilities Q,,, such that the overall cost C ~s minimized. C is p r o p o m o n a i to m and to the total distances S where m
S= Z
Z wjSr,
(41)
a= I t~.Qa
S~a is the minimal route-distance from point P~ to facihty Qa. Let at be the cost of travel per unit distance, mde, kilometer, etc. Let fl be the overall cost of budding and maintaining one facility. Then m
C=~ Z Z wjSja+mfl
(42)
a= I J~Qa
For a gwen ~ and fl, the minimum of C as a funcuon of the m facdmes can easdy be found. In order to solve the problem we intend to use the S S A method so as to map the original pomts into a two-dimensional space Then the algorithm is applied resulting m several partitions which enable us to calculate the
S E,'er-Hadam, Clu~ter am~#~s~s and geostat~stwal ser~es
14'5
i 14~
m
e-
Fa
~_
~
~
I _
E eO
"O
t'It
"~ f "
,,~ I " - P ' i ~% '4", ~'~ ~'~ L
e'. w m
~
m
e-
-J
.o C
...
E _'t:
~ m
.,-j ¢'~1 o , I
,-,
¢~
r"
e"4 ¢ N
p,.,. ¢"1
¢',e I e¢~
¢.,% ¢.~
146
S Ever-a ¢ad~.,., Cluster analysis and geostatistlcai series
corresponding median centers ~z by means of an ~terative procedure, PracUcal results are achieved by going back to the original map and locating the m pomts on the corresponding routes. In the illustrauve example gwen m th~s paper the locations of m facihues m Israel are found. The N = 2 6 points Pj are 26 selected towns in Israel, 1961, and the weights % are the corresponding populations. We summarize the matrix [Co] - fig. 5, and the eight equations for the cost C as a function of 0t,//for m = 2,..., 9 - eq. (43). An illustration for the proposed locatmn is given in the case of m - 3 - fig. 6.
F~g 6 The proposed locations, marked by ® m the case of m=3.
C (m = 2 ) = ~t 27,250,773 + 2fl, C ( m = 3 ) =~t 17,232,792 + 3fl,
C(m = 4 ) =~x 11,717,543 +4fl, C(m=5) = •
9,621,230 + 5fl,
C(m = 6 ) = ~
7,504,232 + 6fl,
C(m=7)=ot
5,720,675 + 7fl,
(43)
C(m=8)=a 4,935,9 l0 = 8fl, C(m = 9 ) = a
4,049,515 + 9ft.
12The median center is the point (r,.. ~..) which minimizes the function G(x. vl=~.~'=t w,((x~ -xl2+(~.-y)zl t
S Ever-Hadam, Cluster anal)sts and geostat,stzcal sertes
147
It should be pointed out that the eight equations (43) are calculated after the locations are found. As an example let ~ =$0.3 and fl=$3,000,000 Then mm,,. = 2, .9C(m)=C(3). If ct = $0.5 and fl = $1,000,000 then nun.=z. .9C(m)=C(6). 5.6. The last applicable part of the present work deals with the geostatistical analysis of territories, with the purpose of representing them graphically by simply and compact shapes (models) [Bachl (1976)] The criterion by which this process is carried out is based on the geostat~st~cal parameters of the territory T such as: (1)
(2) (3) (4) (5) (6)
A - the area, (~, .#) - the center of gravity, ~', a~.z_ the variance of the coordinates along the usual axes, coy(x, y) - the covariance, d" = ~ +o'2 - the distance variance, 0t - the angle of rotation between the usual axes and the principal axes
Methods of qmck calculation of the above parameters are gwen m Bach~ (1973, 1975). The procedure starts with the 'measure of spread' S, defined by
S, = (Ox.Oy/A )4n,
(44)
where a~. and ay. are the standard devtaUon computed along the principal axes of T. If
1~ S,6 4x/(6~/3),
(45)
then there exists a simple geometric model M 0.e., circle, square, elhpse, rectangle, rhombus, trapezoid, etc.) which is closest to T m shape [Bach~ (1976)]. Such a model M can be stretched, squeezed, moved, etc., using other geostatistical parameters of 7", so as to represent accurately the original territory T. When this is done a measure of discrepancy Dn.r can be evaluated by putting M and T m the same posmon by making sure that they have the same (~,y), ct and other geostatistlcal parameters. Once attained DM.r ~s defined by Dn. T = (~$a~d× dy)/(A u + A r),
(46)
148
S Ever-Hadant. Cluster analys,s and geostatzstwal series
where AM and Ar denote the areas of M and T, respectively. A1 is defined by
(x,y)eAlc,,(x,y)eM, (x,y)eT,
(x,y)¢T, (x, y ) ¢ M .
or (47)
Clearly 0 < DM.r <-1. In case (45) ~s not satisfied the sphtting of T into m more compact subterrRones is ~ecessary, prowded that for each sub-terrttory T~ there exists
1 < S,<4n/(6x/3 ).
(48)
If this happens then the original territory T can be represented by the m corresponding models for which the overall &screpancy D is the lowest,
D---m=2.m'n,9 ~Dr%'r')/m'=,
(49)
The partmon of T Is possible by spreading a large number of eqmdistributed points over the entire area of T by using a hexagon grid. The proposed algorithm is then apphed [Ever-Hadam (1975)]. The rational behind such a partiuon should be based on ach~evmg lnequalmes (48) rather than on minim~zmg the functton U Ieq. (6)'1. However there IS a very strong correspondence between the two aims as the more convex and compact Is T the higher ms the probabd~ty of attaining (45) In figs. 7 and 8 several examples of the above process are given, dlustratmg the construcuon of: Australia, Israel, Italy, South America and the U.S.A., usmg the trapezoid model. Each one of the sub-territories T~, 3 - 1..... m, ~s &stmguished by a different number assigned to its corresponding inter-points. The original drawings have been done by the computer. The ones which are presented here are their hand-drawn duphcates. In the drawings the parameter D represents the goodness of fit of the procedure, i.e., 1 - D of eq. (49).
S EIt'r-Hudam. Cluster anul)sls ~md £t'ost~Hstlc.! ~erles
i I
ii
149
II Illij
~nto
Au~trdlva - Reconstruction m - 7 trapezold~, C 88
l~rael - Partltlon~nq m 5 ~ubterrltorles
Into
I s r a e l - R e c o n ~ t r u c t l o n by m 5 trapezoid5 O o'J
Italy - Partltlonlnq .i 7 $ubte'rltOr|es
into
Italy - Reconstrurtton m 7 tr3Dezoid~, D
Austral14 - Partltlonlnq m • 7 ~ubterrl~.ortes
L
F~g
7
by ~b
by
1 ~0
S Ever-Hadam, Cluster analysis and geostat,sttcal sertes
Q~ UO ~m~
"R
~ E
"I:':F •1,
9
V'
~
"
"
Q
~
O
Ila
qD
0
V,J
0
Q
.L.
,,0
~eD u
d
,,d "
~
i7
S Et er-Hadam, Cluster anal) sis and geostatlsttcal sertes
1.51
References Bachl, R, 1962, Standard dtstance measures and related methods for spaual analysts, Regmnal Science Assoclauon Paper, Xth Zurich Congress, 83-132 Bacht, R., 1968, Graplucal rauonal patterns. A new approach to graphical presentation of staUsUcs (Umverstty Press, Jerusalem) Bachl, R., 1973, Geostausucal analysis of territories, Proceedings of 39th Session, Vmnna, Bulletin of the International Stattsttcal lnsUtute 1,121-133 Bacht, R., 1975, Geostaustlcal analysis of territories and populattons, First draft of a book Bacht, R., 1976, Apphcattons of parameters of shape to statlsUcal d~strlbuttons. Regtonal Sctence and Urban Economics 6, 205-227. ikal, E M L., 1969, Cluster analysis (Sctentdic Control Systems, London) Calmskl, T. and J. Harabast, 1971, A dendrite method for cluster analysts. Unpubhshed manuscnpt. Duran, B.S. and P.L. Odell, 1974, Cluster analysts A survey, Lecture notes m economics and mathemattcal systems, Economemcs (Sprmger-Verlag, Berl,n/Tleldelberg;~New York~ Ever-Hadam, S, 1975, An optimal parutson of geostat~stlcal and multtvanate series Methodology and apphcauon, Thesis submRted for Ph D (Hebrew) (Hebre~ Um~erslty. Jerusalem) Everitt, B, 1974, Cluster analysts (Hememann EducaUonai Books, London) Foruer, JJ. and H. Solomon, 1966, Clustenng procedures, Proceedings of Symposmm on Multtvarmte Analysts, Dayton, OH (Academic Press, New York) 493-506 Gnanadeslkan, R. and M B Wdk, 1969, Data analyuc methods m multwanate stattst~cal analysts Muluvanate analysts II (Acadermc Press. New York) 593-638 Gower, J C, 1966, Some distance properUes of latent root and vector methods used m muluvanate analysts. Bmmemka 53, 325-338 Gower, J C, 1967, Muluvanate analysis and multldtmenslonal geometry, The Stausttcmn 17. 1325 Gower, J C, 1971, Statlsucal methods of comparing dtfferent multivariate analyses of the same data, Mathemattcs m the archaeolo~cal and htstoncal scmnce~ (Umverslty Press. Edinburgh) Guttman, L, 1968, A general nonmetnc techmque for finding the smallest coordinate space for a configurauon of points, Psychometnka 33, 469-506 Guttman, L., 1969, An algorithm for partmonmg N pomts into m regions of mm~mal standard dtstance, Unpubhshed manuscript Harttgan, J.A., 1967, Representation of stmdanty matrices by trees, Jomnal of the American Staust~cal Assocmtmn 62, 1140-1158 Jensen, R.E., 1969, A dynamic programming algorithm for cluster analysis, Operations Research 12, 1034--1057. Johnson, S C, 1967, Hlerarchtcal clustenng schemes, Psychometnka 32, no 3, 241-254 Kruskal, J B, 1964a` Multtdimensmnal scahng by opt~mlzmg goodness of fit to nonmetnc hypothests, Psychomemka 29, 1-27 Kruskal, J B, 1964b, Nonmetnc mulUdlmenslonal scahng A numerical method. Psychometrlka 29, 115-129. Rao, C.R., 1964, The use and mterpretatmn of pr,nc,pal components analysis m apphed research, Sankhy/i A26, 329-358. Rao, M R, 1971, Cluster analysts and mathemattcal programming, Journal of the American Statistical AssoctaUon 66, 622-626 Shepard, R N, 1962, The analys~s of proxtm,t,es Multld~mens,onai scahng w~tt. an unknown dtstance funcUon, I and II, Psychomemka 27. 125--139 and 219-246 Sokal, R R. and P H.A. Sneath, 1963, Prmctples of numerical taxonomy (Freeman. Lo, don) Yam, J, 1969, The analys,s of dtvergence m geostat~sttcal d~stnbut~ons Methodolog~ and apphcauons, The~ls subm,tted for Ph D {Hebrew) {Hebrew Umvcrslty. Jerusalem)