Remarks on some statistical properties of the minimum spanning forest


Pattern Recognition Vol. 19, No. 1, pp. 49-53, 1986

0031-3203/86 $3.00 + .00 Pergamon Press Ltd. © 1986 Pattern Recognition Society

Printed in Great Britain

REMARKS ON SOME STATISTICAL PROPERTIES OF THE MINIMUM SPANNING FOREST

RICHARD C. DUBES and RICHARD L. HOFFMAN

Computer Science Department, Michigan State University, East Lansing, MI 48824, U.S.A.

(Received 24 July 1984; in revised form 23 April 1985; received for publication 30 July 1985)

Abstract--The paper by Di Gesù and Sacco proposed a test for uniformity (UT) that compares the number of clusters observed in a set of points to the number expected under randomness as a function of distance. This interesting idea is marred by some misstatements about the distribution of the number of clusters which, we believe, invalidate their UT.

Uniformity tests    Near neighbor distributions    Minimum spanning tree

The paper by Di Gesù and Sacco (1) suggests a uniformity test (UT) for deciding whether the points in a given set of points are uniformly, or randomly, positioned. The UT selects a regularly spaced sequence of distances, plots the number of clusters observed in the data at each distance and compares the resulting curve with a 'theoretical' curve, their equation (4). Clusters are obtained from the single link clustering method (2). The 'theoretical' curve is derived under a hypothesis of random positioning of points in D dimensions. Unfortunately, the derivation of the theoretical curve contains errors which invalidate the test.

The null hypothesis for the UT is that the observed points are uniformly distributed over some sampling window. This is equivalent to the hypothesis that the points are a realization of a Poisson process, conditioned on the number of points. Di Gesù and Sacco (1) assume that the distributions of the lengths of all edges in an MST formed on such a sample are independent and that all are equivalent to the near-neighbor

Fig. 1. Simulated and theoretical results for 100 random points in the unit interval. [Plot not reproduced; horizontal axis: distance; curves: theoretical results and Monte Carlo results.]

Fig. 2. Simulated and theoretical results for 100 random points in two dimensions. (a) Sampling window is the unit square. (b) Sampling window is the unit circle. [Plots not reproduced; horizontal axis: distance; curves: theoretical results and Monte Carlo results.]

distribution for any two points in a Poisson process. This incorrect assumption leads to misstatements about the distribution of the number of clusters, which invalidates their UT.

One of the key issues in the application of any test for uniformity is the state of knowledge about the sampling window. Figure 2 in Di Gesù and Sacco provides an illustration of this point. If the borders of the figure constitute the sampling window, it is clear that the points are not randomly distributed; they

Fig. 3. Simulated and theoretical results for 100 random points in three dimensions. (a) Sampling window is the unit cube. (b) Sampling window is the unit sphere. [Plots not reproduced; horizontal axis: distance; legend: Monte Carlo results.]

cluster in the middle, with too much empty space around them to conclude that their positions are random. What is an appropriate sampling window? A convex hull might be used (3,4) or a square or circle might be assumed (5). Since the MST can be found without reference to any sampling window, the UT ignores this question.

To demonstrate the inaccuracy of their assumption, we reproduced their Fig. 1 by Monte Carlo means. Specifically, we generated 100 points in a sampling

Fig. 4. Simulated and theoretical results for 100 random points in four dimensions. (a) Sampling window is the unit hypercube. (b) Sampling window is the unit hypersphere. [Plots not reproduced; horizontal axis: distance; curves: theoretical results and Monte Carlo results.]

window, computed an MST using Euclidean distance as proximity, and counted the number of clusters obtained under the single link paradigm as the distance threshold increased. This was repeated for dimensionality from 1 to 4 for two sampling windows, a hypercube of side 1 and a hypersphere of volume 1. We ran 100 Monte Carlo trials for each situation. The results are plotted in Figs 1-4, along with plots of equation (4) in Di Gesù and Sacco, for dimensionalities from 1 to 4. The (a) figure is for a unit volume hypercube and the (b) figure, for a unit volume hypersphere in each case. In all cases our simulations demonstrate a dramatic difference between the actual behavior of the statistic and that predicted by their equation (4) under randomness.

Di Gesù and Sacco neatly avoid the issue of estimating an explicit sampling window by using near-neighbor information. However, errors invalidate their approach. Experiences reported in the literature do not bode well for statistics based on near-neighbor information or for statistics based on the MST in D dimensions. The ideal statistic for assessing randomness has yet to be discovered.

SUMMARY

Di Gesù and Sacco (1) propose a test for spatial uniformity of points in D dimensions based on the number of single link clusters as a function of distance between points. Improper assumptions lead to an incorrect null distribution for their test statistic and invalidate their test of hypothesis. Our Monte Carlo simulations demonstrate dramatic differences between their distributions and the actual number of single link clusters when points are generated in D-dimensional hypercubes and hyperspheres for D from one to four.

Acknowledgement--This work was supported by National Science Foundation Grant ECS-83002004.
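For readers who wish to repeat the Monte Carlo experiment described in the text, the procedure can be sketched roughly as follows. This is an illustrative sketch, not the code used for Figs 1-4. All function names are hypothetical, the threshold grid is chosen only for illustration, and averaging the cluster counts over trials is an assumption, since the text does not say how the 100 trials were summarized. The sketch relies on two standard facts: a D-dimensional ball of radius r has volume pi^(D/2) r^D / Gamma(D/2 + 1), which fixes the radius of the unit volume hypersphere, and the number of single link clusters at threshold d equals n minus the number of MST edges of length at most d (2).

```python
import bisect
import math
import random


def unit_volume_radius(dim):
    """Radius of the dim-dimensional ball whose volume equals 1."""
    # Ball volume: pi^(dim/2) * r^dim / Gamma(dim/2 + 1); solve for r at volume 1.
    return (math.gamma(dim / 2 + 1) / math.pi ** (dim / 2)) ** (1 / dim)


def sample_hypercube(n, dim, rng):
    """n points uniform in the hypercube of side 1."""
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


def sample_hypersphere(n, dim, rng):
    """n points uniform in the unit volume hypersphere (rejection sampling)."""
    r = unit_volume_radius(dim)
    points = []
    while len(points) < n:
        p = [rng.uniform(-r, r) for _ in range(dim)]
        if sum(x * x for x in p) <= r * r:
            points.append(p)
    return points


def mst_edge_lengths(points):
    """Sorted Euclidean MST edge lengths, by Prim's algorithm in O(n^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known link from each point to the tree
    best[0] = 0.0
    lengths = []
    for step in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the first point joins the tree without an edge
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(lengths)


def cluster_counts(points, thresholds):
    """Number of single link clusters at each distance threshold."""
    lengths = mst_edge_lengths(points)
    # Deleting every MST edge longer than d leaves the single link clusters.
    return [len(points) - bisect.bisect_right(lengths, d) for d in thresholds]


def mean_cluster_counts(sampler, n, dim, thresholds, trials, seed=0):
    """Average cluster counts over independent trials (averaging is assumed)."""
    rng = random.Random(seed)
    totals = [0] * len(thresholds)
    for _ in range(trials):
        counts = cluster_counts(sampler(n, dim, rng), thresholds)
        totals = [t + c for t, c in zip(totals, counts)]
    return [t / trials for t in totals]


if __name__ == "__main__":
    grid = [0.05 * k for k in range(1, 7)]  # illustrative thresholds only
    for name, sampler in [("hypercube", sample_hypercube),
                          ("hypersphere", sample_hypersphere)]:
        means = mean_cluster_counts(sampler, n=100, dim=2,
                                    thresholds=grid, trials=100)
        print(name, [round(m, 1) for m in means])
```

Curves produced this way could then be set beside any proposed null curve. Note that the sketch needs a sampling window only to generate the points; the MST and the cluster counts are computed without reference to it, as remarked in the text.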

REFERENCES

1. V. Di Gesù and B. Sacco, Some statistical properties of the minimum spanning forest, Pattern Recognition 16, 525 (1983).
2. J. C. Gower and G. J. S. Ross, Minimum spanning trees and single linkage cluster analysis, Appl. Statist. 18, 54 (1969).
3. R. Hoffman and A. K. Jain, A test of randomness based on the minimal spanning tree, Pattern Recognition Lett. 1, 175 (1983).
4. S. P. Smith and A. K. Jain, Testing for uniformity in multidimensional data, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6, 73 (1984).
5. E. Panayirci and R. C. Dubes, A test for multidimensional clustering tendency, Pattern Recognition 16, 433 (1983).

About the Author--RICHARD C. DUBES was born in Chicago, Illinois. He received a B.S. degree from the University of Illinois, Urbana, and M.S. and Ph.D. degrees from Michigan State University, East Lansing, all in Electrical Engineering. He has been a Professor in the Computer Science Department of Michigan State University since 1970, having served as Assistant and Associate Professor in the Electrical Engineering Department. His areas of technical interest include pattern recognition, exploratory data analysis, signal analysis and image processing. He is an Associate Editor of Pattern Recognition and a member of IEEE, Sigma Xi and the Classification Society.

About the Author--RICHARD L. HOFFMAN was born in Mt. Pleasant, Michigan, in 1956. He received a B.S. degree in Mathematics/Physics from Central Michigan University and M.S. degrees in Mathematics and Computer Science from Michigan State University, where he is currently pursuing a Ph.D. in Computer Science. Since 1982 he has been a Graduate Research Assistant in the Pattern Recognition and Image Processing Laboratory in the Department of Computer Science at M.S.U. During the summer of 1983 he was a researcher at Northrop Research and Technology Center, Palos Verdes, CA, working on problems of image processing. His areas of research interest include pattern recognition, exploratory data analysis and graph theory. He is a member of IEEE and Sigma Xi.