Pattern Recognition Vol. 19, No. 1, pp. 49-53, 1986
0031-3203/86 Pergamon Press Ltd. © 1986 Pattern Recognition Society
Printed in Great Britain
REMARKS ON SOME STATISTICAL PROPERTIES OF THE MINIMUM SPANNING FOREST
RICHARD C. DUBES and RICHARD L. HOFFMAN
Computer Science Department, Michigan State University, East Lansing, MI 48824, U.S.A.
(Received 24 July 1984; in revised form 23 April 1985; received for publication 30 July 1985)
Abstract--The paper by Di Gesù and Sacco proposed a test for uniformity (UT) that compares the number of clusters observed in a set of points to the number expected under randomness as a function of distance. This interesting idea is marred by some misstatements about the distribution of the number of clusters which, we believe, invalidate their UT.
Uniformity tests    Near neighbor distributions    Minimum spanning tree
The paper by Di Gesù and Sacco (1) suggests a uniformity test (UT) for deciding whether the points in a given set are uniformly, or randomly, positioned. The UT selects a regularly spaced sequence of distances, plots the number of clusters observed in the data at each distance and compares the resulting curve with a 'theoretical' curve, their equation (4). Clusters are obtained from the single link clustering method. (2) The 'theoretical' curve is derived under a hypothesis of random positioning of points in D dimensions. Unfortunately, the derivation of the theoretical curve contains errors which invalidate the test.
The null hypothesis for the UT is that the observed points are uniformly distributed over some sampling window. This is equivalent to the hypothesis that the points are a realization of a Poisson process, conditioned on the number of points. Di Gesù and Sacco (1) assume that the lengths of all edges in a minimum spanning tree (MST) formed on such a sample are independent and that the distributions of all are equivalent to the near-neighbor
Fig. 1. Simulated and theoretical results for 100 random points in the unit interval. [Plot not reproduced. Horizontal axis: distance. Curves: theoretical results and Monte Carlo results.]
Fig. 2. Simulated and theoretical results for 100 random points in two dimensions. (a) Sampling window is the unit square. (b) Sampling window is the unit circle. [Plots not reproduced. Horizontal axis: distance. Curves: theoretical results and Monte Carlo results.]
distribution for any two points in a Poisson process. This incorrect assumption leads to misstatements about the distribution of the number of clusters, which invalidate their UT.
One of the key issues in the application of any test for uniformity is the state of knowledge about the sampling window. Figure 2 in Di Gesù and Sacco provides an illustration of this point. If the borders of the figure constitute the sampling window, it is clear that the points are not randomly distributed; they
Fig. 3. Simulated and theoretical results for 100 random points in three dimensions. (a) Sampling window is the unit cube. (b) Sampling window is the unit sphere. [Plots not reproduced. Horizontal axis: distance.]
cluster in the middle, with too much empty space around them to conclude that their positions are random. What is an appropriate sampling window? A convex hull might be used (3, 4) or a square or circle might be assumed. (5) Since the MST can be found
without reference to any sampling window, the UT ignores this question. To demonstrate the inaccuracy of their assumption, we reproduced their Fig. 1 by Monte Carlo means. Specifically, we generated 100 points in a sampling
Fig. 4. Simulated and theoretical results for 100 random points in four dimensions. (a) Sampling window is the unit hypercube. (b) Sampling window is the unit hypersphere. [Plots not reproduced. Horizontal axis: distance. Curves: theoretical results and Monte Carlo results.]
window, computed an MST using Euclidean distance as proximity, and counted the number of clusters obtained under the single link paradigm as the distance threshold increased. This was repeated for dimensionality from 1 to 4 for two sampling windows, a hypercube of side 1 and a hypersphere of volume 1. We ran 100 Monte Carlo trials for each situation. The results are plotted in Figs 1-4, along with plots of equation (4) in Di Gesù and Sacco, for dimensionalities from 1 to 4. The (a) figure is for a unit volume
hypercube and the (b) figure, for a unit volume hypersphere in each case. In all cases our simulations demonstrate a dramatic difference between the actual behavior of the statistic and that predicted by their equation (4) under randomness.
Di Gesù and Sacco neatly avoid the issue of estimating an explicit sampling window by using near-neighbor information. However, errors invalidate their approach. Experiences reported in the literature do not bode well for statistics based on near-neighbor information or for statistics based on the MST in D dimensions. The ideal statistic for assessing randomness has yet to be discovered.
SUMMARY
Di Gesù and Sacco (1) propose a test for spatial uniformity of points in D dimensions based on the number of single link clusters as a function of distance between points. Improper assumptions lead to an incorrect null distribution for their test statistic and invalidate their test of hypothesis. Our Monte Carlo simulations demonstrate dramatic differences between their distributions and the actual number of single link clusters when points are generated in D-dimensional hypercubes and hyperspheres for D from one to four.
Acknowledgement--This work was supported by National Science Foundation Grant ECS-83002004.
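The Monte Carlo procedure described in the text (generate 100 points in a sampling window, form an MST under Euclidean distance, count single link clusters as the distance threshold grows) can be sketched in Python. This is a minimal illustration under stated assumptions, not the program used to produce Figs 1-4. All function names are hypothetical. The threshold grid (0 to 0.3 in steps of 0.05) and the averaging of counts over trials are assumptions, as is the choice of Prim's algorithm for the MST. The sketch relies on the known relation between the MST and single link clustering (2): the number of clusters at threshold d equals the number of points minus the number of MST edges of length at most d. It does not evaluate equation (4) of Di Gesù and Sacco.

```python
import bisect
import math
import random


def sample_hypercube(n, dim, rng):
    """Return n points uniform in the unit hypercube [0, 1]^dim."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def sample_hypersphere(n, dim, rng):
    """Return n points uniform in a dim-dimensional ball of unit volume."""
    # The radius r solves pi^(dim/2) / Gamma(dim/2 + 1) * r^dim = 1.
    radius = (math.gamma(dim / 2 + 1) / math.pi ** (dim / 2)) ** (1 / dim)
    points = []
    for _ in range(n):
        # A Gaussian vector gives a uniform direction; U^(1/dim) gives the radial law.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in g))
        scale = radius * rng.random() ** (1 / dim) / norm
        points.append([x * scale for x in g])
    return points


def mst_edge_lengths(points):
    """Sorted Euclidean edge lengths of an MST, via Prim's algorithm in O(n^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from each point to the tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if step > 0:  # the first point starts the tree and contributes no edge
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(lengths)


def cluster_counts(edge_lengths, thresholds):
    """Single link cluster count at each threshold d: n minus MST edges <= d."""
    n = len(edge_lengths) + 1
    return [n - bisect.bisect_right(edge_lengths, d) for d in thresholds]


def monte_carlo(sampler, n=100, dim=2, trials=100, thresholds=None, seed=0):
    """Average cluster-count curve over independent trials (averaging is an assumption)."""
    rng = random.Random(seed)
    if thresholds is None:
        thresholds = [0.05 * k for k in range(7)]  # assumed grid: 0.00, 0.05, ..., 0.30
    totals = [0.0] * len(thresholds)
    for _ in range(trials):
        edges = mst_edge_lengths(sampler(n, dim, rng))
        counts = cluster_counts(edges, thresholds)
        totals = [t + c for t, c in zip(totals, counts)]
    return [t / trials for t in totals]
```

A call such as `monte_carlo(sample_hypersphere, n=100, dim=3, trials=100)` would give one simulated curve of the kind compared against the theoretical curve; the same call with `sample_hypercube` covers the other sampling window.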
REFERENCES
1. V. Di Gesù and B. Sacco, Some statistical properties of the minimum spanning forest, Pattern Recognition 16, 525 (1983).
2. J. C. Gower and G. J. S. Ross, Minimum spanning trees and single linkage cluster analysis, Appl. Statist. 18, 54 (1969).
3. R. Hoffman and A. K. Jain, A test of randomness based on the minimal spanning tree, Pattern Recognition Lett. 1, 175 (1983).
4. S. P. Smith and A. K. Jain, Testing for uniformity in multidimensional data, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6, 73 (1984).
5. E. Panayirci and R. C. Dubes, A test for multidimensional clustering tendency, Pattern Recognition 16, 433 (1983).
About the Author--RICHARD C. DUBES was born in Chicago, Illinois. He received a B.S. degree from the University of Illinois, Urbana, and M.S. and Ph.D. degrees from Michigan State University, East Lansing, all in Electrical Engineering. He has been a Professor in the Computer Science Department of Michigan State University since 1970, having served as Assistant and Associate Professor in the Electrical Engineering Department. His areas of technical interest include pattern recognition, exploratory data analysis, signal analysis and image processing. He is an Associate Editor of Pattern Recognition and a member of IEEE, Sigma Xi and the Classification Society.
About the Author--RICHARD L. HOFFMAN was born in Mt. Pleasant, Michigan, in 1956. He received a B.S. degree in Mathematics/Physics from Central Michigan University and M.S. degrees in Mathematics and Computer Science from Michigan State University, where he is currently pursuing a Ph.D. in Computer Science. Since 1982 he has been a Graduate Research Assistant in the Pattern Recognition and Image Processing Laboratory in the Department of Computer Science at M.S.U. During the summer of 1983 he was a researcher at Northrop Research and Technology Center, Palos Verdes, CA, working on problems of image processing. His areas of research interest include pattern recognition, exploratory data analysis and graph theory. He is a member of IEEE and Sigma Xi.