Pattern Recognition Letters 12 (1991) 199-202 North-Holland
April 1991
A cluster detection algorithm based on percolation theory
Craig Gotsman*
Dept. of Computer Science, The Hebrew University, Jerusalem 91904, Israel
Received 28 August 1990

Abstract

Gotsman, C., A cluster detection algorithm based on percolation theory, Pattern Recognition Letters 12 (1991) 199-202.

We describe a novel algorithm for the detection of clusters of points embedded in background noise in the plane. The algorithm is based on the percolation phenomena found in random graphs obtained from a planar Poisson process. We estimate the time complexity of the algorithm and its expected performance.

Keywords. Cluster, percolation, random graph, Poisson process.
1. Introduction

The problem of detecting a two-dimensional diffuse cluster of points embedded in a noisy background has important applications in many fields of science, in particular X-ray astronomy and SAR (synthetic-aperture radar) imaging. A more precise definition of the problem is: given a Poisson process in a finite region of the plane with density (rate) $\lambda$, detect whether there are clusters of points of rate $\geq \lambda'$ embedded in it, where $\lambda' \geq c > 0$ (valid clusters). Some of the existing approaches to solving this problem are based on constructing the nearest-neighbor graph (NNG), the random graph whose vertices are the given points, with an edge connecting each point to its nearest neighbor, or the minimal spanning tree (MST) (e.g. Zahn (1971), De Biase et al. (1986)).

* Partially supported by an Eshkol fellowship, administered by the National Council for Research and Development - Israel Ministry of Science and Development.
0167-8655/91/$03.50 © 1991 Elsevier Science Publishers B.V. (North-Holland)
Volume 12, Number 4

We present a similar, yet novel, algorithm for solving this problem. The algorithm has two parameters: $r$ and $t$. A graph is constructed by connecting each point to all other points within radius $r$. All connected-components of the resulting graph of size $\leq t$ are discarded. The remaining components are valid clusters.

The correctness of the algorithm is based on a phase-transition phenomenon in random graphs constructed in this fashion from planar Poisson processes, known as percolation (Grimmett (1989, p. 10.20)): There exists a critical radius $r_c$ (depending on $\lambda$ alone) such that for $r < r_c$, there are many small connected-components. For $r > r_c$, there almost surely exists a unique large connected-component containing a large fraction of the points. Hartigan (1981) has observed that this phenomenon is relevant to cluster analysis, but analyzed it from a statistical viewpoint.

The idea behind our algorithm is to carefully choose $r$ just below $r_c$, so that the background noise will create very small connected-components. The valid clusters, whose local density (together with the noise) exceeds the critical density for this $r$, will create very large connected-components. An appropriate choice of $t$ will then eliminate the background noise, leaving only the valid clusters.
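The phase transition described above can be observed numerically. The following Python sketch is an illustration only and is not part of the original work. It scatters a fixed number of uniform points in the unit square (a stand-in for a Poisson process conditioned on its point count), joins all pairs within distance r, and reports the fraction of points in the largest connected-component. The function names, the sample size, the seed, and the two values 0.5 and 4.0 of the product density × r² are assumptions, chosen to sit well below and well above the transition.

```python
import math
import random
from collections import Counter


def largest_component_fraction(points, r):
    """Fraction of points lying in the largest component of the radius-r graph."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Bucket points into square cells of side r: any pair within distance r
    # then lies in the same cell or in two adjacent cells.
    cells = {}
    for idx, (x, y) in enumerate(points):
        cells.setdefault((math.floor(x / r), math.floor(y / r)), []).append(idx)

    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) <= r:
                            parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(n))
    return max(sizes.values()) / n


def demo(n=2000, seed=0):
    """Largest-component fraction for a small and a large value of density * r^2."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    density = n  # n points in a region of unit area
    result = {}
    for product in (0.5, 4.0):  # illustrative values of density * r^2
        r = math.sqrt(product / density)
        result[product] = largest_component_fraction(points, r)
    return result
```

Calling `demo()` returns the two fractions keyed by the product. With the smaller radius the largest component should hold only a tiny share of the points, while with the larger radius it should hold most of them.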
2. Some continuum percolation theory

Let $P(\lambda)$ be a Poisson process of density (rate) $\lambda$ on the infinite plane, $d(x,y)$ the Euclidean $l_2$ metric and $r$ a real number. Construct an (infinite) random graph $G$ from the points of $P$ by connecting all pairs $x,y$ for which $d(x,y) \leq r$. For a fixed $x$, define the random variable $S(x)$ as the size of the connected-component to which $x$ belongs. Because of scaling, the quantity which uniquely determines the distribution of $S$ is $\lambda r^2$, so, without loss of generality, we may assume that $r = 1$.¹

Gilbert (1961) first observed empirically the existence of an absolute critical density $\lambda_c$, separating two fundamentally different forms of the distribution of $S$: For $\lambda < \lambda_c$, [...]; for $\lambda > \lambda_c$, $E(S)$ is infinite and there exists a unique infinite connected-component. Much effort, both analytic (exact and approximate) and empirical (computer simulation), has been expended in determining the exact value of $\lambda_c$. The best analytic results to date are those of Hall (1985), who proved that $0.696 \leq \lambda_c \leq 3.372$. Simulation and approximation methods (see summary in Gawlinski and Stanley (1981)) yield the estimate

$$\lambda_c = 2.872 \pm 0.012. \qquad (1)$$

As for the exact form of the distribution of $S$ in the range $0 < \lambda < \lambda_c$, [...]

$$\lim_{n \to \infty} -\frac{1}{n} \log \mathrm{Prob}[S = n] \qquad (2)$$

exists and is strictly positive, implying that the distribution is asymptotically exponentially decreasing.

¹ Alternatively, $\lambda$ may be held fixed and $r$ varied.

3. The algorithm and its complexity

The following algorithm is proposed for cluster detection:

Algorithm CD

Input. Planar coordinates of $n$ points distributed randomly in a rectangle $A$ and coordinates of clusters of $O(\log n)$ points distributed randomly in small subregions of $A$.

Output. A collection of subsets of the input points: the valid clusters.

Method

Step 1. Count $N$ = the number of input points (including the clusters). Estimate the density of the background noise: $\lambda = N/\mathrm{Area}(A)$.

Step 2. Denote $\alpha = 0.95$, $r = \alpha \sqrt{\lambda_c/\lambda}$ ($\lambda_c$ as in (1)). Construct the graph $G$ on the input points by connecting any pair $(x,y)$ such that $d(x,y) \leq r$.

Step 3. Denote $t = \log n$. Calculate the sizes of the connected-components of $G$. Output all connected-components of size $> t$.

Step 2 may be implemented using a recent algorithm of Dickerson and Drysdale (1990). The algorithm runs in $O(N+I)$ time with an $O(N \log N)$ preprocessing phase, where $I$ is the number of edges in $G$. The preprocessing phase basically constructs the Voronoi diagram (see Edelsbrunner (1987, Ch. 13)) of the $N$ points, and the algorithm itself processes this. Note also that $G$ may be constructed for any other $r$, using the same Voronoi diagram. Step 3 may be done in $O(N)$ time by depth-first search.
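Steps 1-3 of Algorithm CD can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper. A uniform grid of cell side r replaces the Voronoi-based fixed-radius search of Dickerson and Drysdale (1990), the logarithm in t is taken to be natural, and the total count N stands in for n in the threshold because the input does not separate background points from cluster points. All function names are hypothetical. The parameter `lambda_c` defaults to the estimate quoted in (1) and is exposed so that a different constant can be supplied.

```python
import math


def cd_parameters(num_points, area, lambda_c=2.872, alpha=0.95):
    """Estimated density, connection radius r and size threshold t."""
    density = num_points / area                 # Step 1: lambda = N / Area(A)
    r = alpha * math.sqrt(lambda_c / density)   # Step 2: r = alpha * sqrt(lambda_c / lambda)
    t = math.log(num_points)                    # Step 3: t = log n (assumption: N used for n)
    return density, r, t


def radius_graph(points, r):
    """Adjacency lists of the graph joining all pairs with d(x, y) <= r."""
    # Assumption: grid bucketing instead of the Voronoi-based search in the text.
    cells = {}
    for idx, (x, y) in enumerate(points):
        cells.setdefault((math.floor(x / r), math.floor(y / r)), []).append(idx)
    adjacency = [[] for _ in points]
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(points[i], points[j]) <= r:
                            adjacency[i].append(j)
                            adjacency[j].append(i)
    return adjacency


def large_components(points, r, t):
    """Connected components of the radius-r graph containing more than t points."""
    adjacency = radius_graph(points, r)
    seen = [False] * len(points)
    clusters = []
    for start in range(len(points)):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:  # iterative depth-first search
            v = stack.pop()
            component.append(points[v])
            for w in adjacency[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        if len(component) > t:
            clusters.append(component)
    return clusters


def cluster_detect(points, area, lambda_c=2.872, alpha=0.95):
    """Algorithm CD end to end, under the assumptions listed above."""
    _, r, t = cd_parameters(len(points), area, lambda_c, alpha)
    return large_components(points, r, t)
```

A call such as `cluster_detect(points, area)` takes a list of (x, y) tuples and the area of the enclosing rectangle, and returns the surviving components as lists of points. Splitting the parameter choice from the component search keeps it easy to rerun the search with other values of r and t.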
4. Probabilistic analysis of algorithm performance

The performance of the algorithm is measured by two quantities:
$P_d$ = the probability of detecting a valid input cluster.
FAR (false alarm rate) = the density of invalid clusters (the average number of invalid clusters produced by the algorithm per unit area).

We now estimate these quantities for Algorithm CD. In the vicinity of each of the valid clusters, the local density $\lambda + \lambda'$ satisfies $(\lambda + \lambda') r^2 > \lambda_c$, therefore percolation occurs. This guarantees the existence of a connected-component of size $O(\log n)$, which will be detected with probability $P_d = 1 - o(1)$. In an area containing only background noise, the local density is $\lambda$, satisfying $\lambda r^2 < \lambda_c$. By (2), the expected number of invalid clusters of size $\geq t$ per unit area is

$$\mathrm{FAR} = \lambda \sum_{i=t}^{\infty} \frac{a e^{-bi}}{i} \leq \frac{\lambda}{t} \sum_{i=t}^{\infty} a e^{-bi} = O\!\left(\frac{e^{-bt}}{t}\right). \qquad (3)$$

For $t = O(\log n)$, we obtain $\mathrm{FAR} = o(n^{-b})$.

5. Experimental results

We implemented algorithm CD and tested it on a variety of simulated point maps. The background points were generated as Poisson processes, on which a small number of clusters were superimposed. Figures 1-2 show the results of representative simulations. In general, the algorithm detected all valid clusters with few false alarms. Inevitably, the detected clusters were larger than the originals.

Figure 1. Results of the algorithm on simulated data: (a) simulation of background; (b) simulation of clusters; (c) sum of (a) and (b) - input to algorithm CD; (d) clusters detected by algorithm CD.

6. Conclusion

We have presented an algorithm for the detection of planar diffuse clusters in the presence of random background noise. The percolation property on which the algorithm is based exists also in the analogous situation in Euclidean space of dimension $> 2$ (but not for $d = 1$), albeit with different critical radii (see Pike and Seager (1974) for empirical results in higher dimensions), implying that the same algorithm may be used in that case too.

The algorithm is mainly effective in detecting diffuse clusters, without any pronounced geometric shape (such as elongated clusters). To detect clusters with such shapes, other properties of the connected-components besides their size (diameter etc.) should be used. Alternatively, this algorithm may be used
in a transform domain, such as that of the Hough transform (see Illingworth and Kittler (1987)), to detect clusters indicating the presence of specific geometric structures. This enables detection of (nearly) straight lines among the input points.

It is possible to speed up the run-time of the algorithm by using probabilistic techniques, such as those reported in Berger and Shvaytser (1990), which effectively dilute the points, taking care not to lose important information. These techniques deserve separate investigation.

Figure 2. As Figure 1. Note the invalid cluster (marked) detected by the algorithm in (d).
Acknowledgement

I would like to thank Eli Shamir for introducing me to percolation theory.
References

Berger, J.R. and H. Shvaytser (1990). A probabilistic algorithm for computing Hough transforms. To appear in J. Algorithms.
De Biase, G.A., V. Di Gesu and B. Sacco (1986). Detection of diffuse clusters in noise background. Pattern Recognition Letters 4, 39-44.
Dickerson, M. and R. Drysdale (1990). Fixed-radius near neighbors search algorithms for points and segments. Inform. Process. Lett. 35, 269-273.
Edelsbrunner, H. (1987). Algorithms in Combinatorial Geometry. Springer, Berlin.
Gawlinski, E.T. and H.E. Stanley (1981). Continuum percolation in two dimensions: Monte Carlo tests of scaling and universality for non-interacting discs. J. Physics A: Math. and Gen. 14, L291-L299.
Gilbert, E.N. (1961). Random plane networks. J. Soc. Industr. Appl. Math. 9, 533-543.
Grimmett, G. (1989). Percolation. Springer, Berlin.
Hall, P. (1985). On continuum percolation. Ann. Probab. 13 (4), 1250-1266.
Hartigan, J.A. (1981). Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc. 76 (374), 388-394.
Illingworth, J. and J. Kittler (1987). The adaptive Hough transform. IEEE Trans. Pattern Anal. Machine Intell. 9 (5), 690-698.
Pike, G.E. and C.H. Seager (1974). Percolation and conductivity: a computer study. Physical Review B 10, 1421-1434.
Zahn, C.T. (1971). Graph-theoretic method for detecting and describing gestalt clusters. IEEE Trans. Computers 20, 68-86.