Pattern Recognition Letters 11 (1990) 589-594, North-Holland, September 1990

A deterministic annealing approach to clustering

Kenneth ROSE, Eitan GUREWITZ* and Geoffrey FOX

Caltech Concurrent Computation Program, California Institute of Technology, Mail Stop 206-49, Pasadena, CA 91125, USA

Received 29 January 1990
Revised 19 April 1990

* On a leave of absence from the Department of Physics, NRCN, P.O. Box 9001, Beer-Sheva, Israel.
Abstract: A deterministic annealing technique is proposed for the nonconvex optimization problem of clustering. Deterministic annealing is used in order to avoid local minima of the given cost function which trap traditional techniques. A set of temperature parametrized Gibbs probability density functions relate each data point to each cluster. An effective cost function is defined and minimized at each temperature. It is shown that as the temperature approaches zero, the algorithm becomes the basic ISODATA algorithm. The method is independent of the initial choice of cluster means. Simulation results are given and show how the method succeeds in avoiding local minima.
Key words: Pattern classification, clustering, fuzzy clustering, statistical mechanics.
Introduction

Clustering methods are major tools for the analysis of data without knowledge of a priori distributions. The problem of clustering is that of partitioning the data into subgroups, each of which should be as homogeneous as possible. This is usually made mathematically precise by defining a cost function or criterion to be minimized (or maximized). All useful cost functions to our knowledge are not convex and have several local minima. Clustering techniques such as the K-means algorithm (see Duda and Hart, 1974), basic ISODATA (simplified version from Ball and Hall, 1967), and fuzzy ISODATA (Dunn, 1974, and Bezdek, 1980) are 'descent' algorithms in the sense that at each iteration the cost is reduced. For this reason they tend to get trapped in a local minimum. Thus, their performance depends on the number of local minima, on how cleverly one can choose the initial configuration, or on how much one may assume about the a priori probability distribution. Our main concern here will be to avoid nonglobal minima. We shall restrict our discussion to the case where the number of clusters is known in advance. There exist heuristic methods and measures to 'optimize' the number of clusters by defining cluster validity (Duda and Hart, 1974; Gath and Geva, 1989), and in the discussion we shall mention some possible advantages of our method when generalized.

For attempting global minimization we apply an approach based on principles of statistical mechanics. The interest in the application of statistical mechanics to nonconvex optimization problems has been growing recently, for example Simic (1990) on the travelling salesman problem. A known technique for nonconvex optimization is stochastic relaxation, or simulated annealing (Kirkpatrick et al., 1983). This stochastic iterative method is such that with some small probability a configuration may be of higher cost than the previous one. The 'temperature' parameter makes this probability go to zero as it is lowered. As the cost is not monotone decreasing, local minima can be avoided. However, one must be very careful with the annealing schedule, the rate at which the temperature is lowered. In their work on image restoration, Geman and Geman (1984) have shown that in theory the global minimum can be achieved
if the schedule obeys T ∝ 1/log n, where n is the number of the current iteration. Such schedules are not realistic in many applications, since in order to reduce T by a factor of two one has to square the number of iterations (halving 1/log n requires doubling log n, i.e., replacing n by n²). Unlike stochastic relaxation, where random moves are made on the given cost surface, deterministic annealing can be viewed as incorporating the 'randomness' into the cost function. We obtain an effective cost function depending on the temperature. This cost function is deterministically optimized at each temperature sequentially, starting at high temperature and going down. We shall show how this method is applied to the clustering problem. The resulting algorithm is intuitively understood as a sort of fuzzy ISODATA algorithm where the membership probability is modelled by a parametrized Gibbs distribution. The procedure is started with 'high fuzziness' (i.e., equal membership in all clusters) at high temperatures and converges to the nonfuzzy basic ISODATA algorithm at low temperatures. At this point we cannot prove that the method will always converge to the global minimum, and it is not unlikely that it will fail in some circumstances. However, in a series of simulations we performed (some of which are presented in this paper), where the hard K-means algorithm failed for reasonable initial configurations, our method did find the global minimum independently of the initial configuration. We therefore conjecture that the conditions for obtaining the global minimum should not be very restrictive. We are currently working on the determination of sufficient conditions for global minimization.
Statistical mechanics formulation of clustering

For each point in the data there exists a cost (or energy) for associating it with a given cluster. We therefore define a Gibbs distribution for all possible associations as follows:

Pr(x ∈ C_j) = exp(-β E_x(j)) / Z_x    (1)

where x is a data point, C_j is the j-th cluster, E_x(j) is the energy when associating x with C_j, and β is a parameter (1/temperature). Z_x is the partition function to normalize the expression and is defined by

Z_x = Σ_{k=1}^{c} exp(-β E_x(k))    (2)

where c is the number of clusters. As β gets larger, the associations become less fuzzy. In fact, for β = 0 each point is equally associated with all clusters, while for β → ∞ each point belongs to exactly one cluster with probability one. Now we may define the expected energy of x as

E_x = Σ_{k=1}^{c} Pr(x ∈ C_k) E_x(k).    (3)

Our goal is therefore to minimize the expected total energy given by

E = Σ_x E_x.    (4)

For a given set of {y_j}, it is assumed that the probabilities relating different x's to their clusters are independent. Hence the total partition function is

Z = Π_x Z_x.    (5)

Now let us define the effective cost function F (free energy in the statistical mechanics analogy), in terms of the partition function of (5),

Z = exp(-β F)    (6)

or

F = -(1/β) log Z = -(1/β) Σ_x log Z_x.    (7)

It is not difficult to show that for β → ∞ the effective cost equals the expected energy (F = E). In order to proceed and minimize the effective cost, we have to specify E_x(j), the cost for associating a point with a given cluster.

The squared distance cost

In this work we choose the cost for associating x with C_j as

E_x(j) = |x - y_j|²    (8)

where y_j is the representative of C_j. The probability that x belongs to C_j is thus

Pr(x ∈ C_j) = exp(-β |x - y_j|²) / Σ_{k=1}^{c} exp(-β |x - y_k|²).    (9)
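As a concrete illustration of (1), (7) and (9), the following Python sketch (ours, not the authors' code; the array shapes and function names are assumptions for illustration) computes the Gibbs association probabilities and the free energy F for a given set of representatives y_j at inverse temperature β.

```python
import numpy as np

def gibbs_memberships(X, Y, beta):
    """Association probabilities Pr(x in C_j) of eq. (9) for data X of shape
    (n, d) and cluster representatives Y of shape (c, d)."""
    # Squared distances |x - y_j|^2, shape (n, c).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the per-point minimum improves numerical stability and
    # does not change the normalized probabilities.
    P = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
    return P / P.sum(axis=1, keepdims=True)

def free_energy(X, Y, beta):
    """Effective cost F = -(1/beta) * sum_x log Z_x of eq. (7)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # log Z_x via the log-sum-exp trick.
    m = (-beta * d2).max(axis=1, keepdims=True)
    logZx = m[:, 0] + np.log(np.exp(-beta * d2 - m).sum(axis=1))
    return -logZx.sum() / beta
```

At β = 0 the returned memberships are uniform over the clusters, and as β grows they approach the hard assignments of (9) in the β → ∞ limit.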
In the limit β → ∞, minimizing E of (4) is the minimization of the standard sum of squared distances criterion. From (2), (7) and (8) the effective cost is

F = -(1/β) Σ_x log Σ_{j=1}^{c} exp(-β |x - y_j|²).    (10)

The optimum of the effective cost is for

∂F/∂y_j = 0,  ∀j.    (11)

It should be noted that the y_j are vectors and this is shorthand notation implying differentiation with respect to each component to yield the 0 vector. Substituting (10) into (11) we get

Σ_x (x - y_j) exp(-β |x - y_j|²) / Σ_{k=1}^{c} exp(-β |x - y_k|²) = 0,  ∀j.    (12)

The optimal set of cluster means then obeys

y_j = [Σ_x x exp(-β |x - y_j|²) / Σ_k exp(-β |x - y_k|²)] / [Σ_x exp(-β |x - y_j|²) / Σ_k exp(-β |x - y_k|²)],  ∀j.    (13)

This means that y_j is indeed the center of mass or the average of the samples in C_j, where the expectation is taken by assigning to each point its relative weight in the cluster, and in fact

y_j = Σ_x x Pr(x ∈ C_j) / Σ_x Pr(x ∈ C_j),  ∀j.    (14)

All this naturally leads us to propose a fixed point iteration algorithm, i.e.,

y^(n+1) = f(y^(n))    (15)

where f is based on

y_j^(n+1) = [Σ_x x exp(-β |x - y_j^(n)|²) / Σ_k exp(-β |x - y_k^(n)|²)] / [Σ_x exp(-β |x - y_j^(n)|²) / Σ_k exp(-β |x - y_k^(n)|²)],  ∀j.    (16)

One may notice the similarity between the above results and the maximum likelihood estimate of means in normal mixtures (Duda and Hart, 1974). An important distinction to keep in mind is that here there is no prior knowledge or assumption on the probability densities of the data. The parametrized probability distribution involved is derived directly from the cost function to be minimized and is indeed the corresponding Gibbs distribution. Amongst all probability distributions which yield a given average energy (cost), this is the one of maximum entropy. Since we choose the squared distance cost, our association probability distribution becomes Gaussian. If the data happens to be a normal mixture, then at each temperature we will be performing the maximum likelihood estimation. Still, our annealing technique will enable avoiding local minima which may trap the traditional method in this case as well.
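A minimal sketch of the fixed point iteration (16) at a fixed β, reusing the gibbs_memberships helper defined above (the convergence tolerance and iteration cap are our own choices, not the paper's):

```python
def update_means(X, Y, beta, tol=1e-6, max_iter=500):
    """Fixed point iteration (16): find a local minimum of F at fixed beta."""
    for _ in range(max_iter):
        P = gibbs_memberships(X, Y, beta)           # Pr(x in C_j), shape (n, c)
        Y_new = (P.T @ X) / P.sum(axis=0)[:, None]  # weighted means, eq. (14)
        if np.linalg.norm(Y_new - Y) < tol:
            return Y_new
        Y = Y_new
    return Y
```

Each pass recomputes the memberships and then replaces every y_j by the membership-weighted average of the data, exactly the form of (14) and (16).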
The deterministic annealing algorithm

In the previous section we derived the algorithm for obtaining a local minimum of the free energy at a given β. Annealing will be introduced by varying β, starting at β = 0 and ending at the appropriate value. In order to gain intuition as to how annealing helps avoid local minima, let us consider the range of influence of each data point. In hard K-means or basic ISODATA clustering, each data point only affects the mean nearest to it, and this results in multiple stable minima of the energy. This effect exists but is less pronounced in fuzzy clustering, as distant means are weakly affected by a data point. The annealing process starts at β = 0, where each data point influences all the means equally (our Gibbs distribution is uniform), and gradually increases the bias in favor of near means. The influence of each data point is gradually localized. At low β, the cost function clearly has only one (i.e., global) minimum, which is the center of mass of the distribution, and all means are located at this point. Excluding imaginable pathologies, at each step the new global minimum will be close to the previous one, and the annealing process is in fact a tracking of the global minimum while increasing β. The graduated localization of influence can further be viewed as a multiscale process. It is not difficult to realise that in the annealing process, first the coarse aspects of the structure determine the location of the means. Only when the data influence becomes localized enough do the fine aspects of the structure emerge. This leads to the natural application
of our method to hierarchical clustering, which will be derived in a coming paper.

To recapitulate, the following is the outline of the deterministic annealing algorithm:
1. Set β = 0.
2. Choose an arbitrary set of initial cluster means.
3. Iterate according to (16) until converged.
4. Increase β.
5. If β < β_max go to 3.

We have intentionally not been precise on certain issues, such as the final value for β and the rate of its increase. The reason is that the former depends on the exact goal of the clustering and the latter remains an open question under current investigation. The process can be stopped at different stages for different reasons. It is clear that for large β it is aiming at optimizing the hard K-means clustering criterion. This may be best for problems where the clusters do not overlap. However, when the clusters do overlap, fuzzy clustering is more appropriate, as data points have fuzzy memberships in clusters. This is done by our algorithm if we stop at an intermediate β. Indeed, β directly controls the fuzziness, as it goes from totally fuzzy (β = 0) to hard clustering (β → ∞). The annealing schedule, or the rate at which β is increased, remains open at the moment. We are currently trying to find whether there is a way of increasing β which ensures convergence to the global minimum. A practical comment is due concerning the use of perturbations in the method. At many stages we have a number of means located at the same mathematical point (e.g., all means at the center of mass in the beginning). These are always local optima of the cost function but will become unstable (local maxima) at the appropriate β. Thus, the use of arbitrarily small perturbations ensures the separation of grouped means when they no longer minimize the cost.
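The outline above translates into a short driver loop. The sketch below is ours, not the authors' code: the starting value and the ten percent increase of β are taken from the simulation section, while the perturbation size and the initialization at the data centroid are assumptions.

```python
def deterministic_annealing(X, n_clusters, beta0=1e-4, beta_max=0.1,
                            growth=1.1, perturb=1e-3, seed=0):
    """Deterministic annealing clustering following outline steps 1-5."""
    rng = np.random.default_rng(seed)
    # Step 2: initial means; here, copies of the data centroid.
    Y = np.tile(X.mean(axis=0), (n_clusters, 1))
    beta = beta0                      # step 1 (a small beta stands in for beta = 0)
    while beta < beta_max:            # step 5
        # Arbitrarily small perturbation so that coincident means can
        # separate once they stop minimizing the free energy.
        Y = Y + perturb * rng.standard_normal(Y.shape)
        Y = update_means(X, Y, beta)  # step 3: fixed point iteration (16)
        beta *= growth                # step 4: increase beta by ten percent
    return Y, gibbs_memberships(X, Y, beta)
```

Stopping earlier (a smaller beta_max) returns fuzzy memberships, while a large beta_max approaches hard K-means clustering, as discussed above.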
Further extensions of the algorithm

In this section a number of ideas for the extension of the method are given briefly. These are currently under research and results will be described in full in a coming paper. It seems that the method will naturally generalize to clustering when the number of clusters is not known a priori. The method in fact performs unsupervised hierarchical clustering. Recall that for very small β, all the cluster means practically merge at the center of mass. When β becomes larger, they separate into sub-groups, which will themselves separate at higher β. Thus, if we start with a large enough number of means, we get a natural hierarchical clustering. Unlike traditional techniques, the split happens by itself and no heuristics are needed to place extra means at the 'right' initial locations, etc. Each split occurs when β reaches a critical value with respect to the variance within a specific cluster.

Another extension is to account for clusters of different size, orientation and population in the criterion. This is required in applications where the mean squared distance is not the right cost function and should be modified. The cluster prior probability is given by the cluster relative mass and, if added when computing Pr(x ∈ C_j) (according to the Bayes rule), will account for clusters of differing populations.
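One hedged reading of the last remark, as a sketch only (the paper does not give this formula explicitly, and the helper below is our interpretation, not the authors' published formulation): weight each cluster's Gibbs factor by a prior p_j proportional to the cluster's relative mass and renormalize.

```python
def weighted_memberships(X, Y, beta, priors):
    """Gibbs memberships with cluster priors p_j (Bayes-rule weighting).
    Illustrative assumption about the 'population' extension."""
    P = gibbs_memberships(X, Y, beta) * np.asarray(priors)[None, :]
    return P / P.sum(axis=1, keepdims=True)
```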
Simulation results

We present some examples to demonstrate the performance of the proposed algorithm. In all these examples we generate isotropic random clusters of equal size, centered at the sites marked by '×'. The deterministic annealing algorithm was initialized with β = 0.0001, and β was increased by ten percent at each step. In example 1 (Figures 1(a), 1(b)), each generated cluster consists of eighty sample points. In Figure 1(a) we see the output of the basic ISODATA algorithm for a set of four clusters. The initial means were equally distributed on a small circle around the center of mass of the data. This example represents a typical 'hidden cluster' trap. In Figure 1(b), the data is optimally clustered by the deterministic annealing algorithm. Our algorithm achieves this clustering regardless of the initial means. This example illustrates how the global minimum may be much lower than a local minimum trapping a traditional method. We stopped at β = 0.1, which is practically an infinite β for the clustering output.
Figure 1. (a) K-means clustering of four clusters. The lines are the decision boundaries. The energy is 54.43. (b) Deterministic annealing clustering of four clusters. The lines are the decision boundaries. The final beta is 0.1, and the final energy is 30.05. o = computed cluster mean. × = center of cluster random generator.
Figure 2. (a) K-means clustering of six clusters. The lines are the decision boundaries. The energy is 18.8. (b) Deterministic annealing clustering of six clusters. The lines are the decision boundaries. The final beta is 0.1, and the final energy is 15.9. o = computed cluster mean. × = center of cluster random generator.
The second example demonstrates the performance with six clusters of eighty points each. Some of these clusters overlap. In Figure 2(a), the hard clustering method finds a suboptimal solution. Here again the means are initialized on a small circle around the center of mass. The output of our method is shown in Figure 2(b). The difference in cost is not as extreme as in the previous example. Note that the actual centers of the distributions have been detected.

In the third example we show how fuzzy clustering is obtained for an intermediate β. There are six overlapping clusters of eighty sample points (see Figure 3). The fuzzy clustering obtained at β = 0.01 is shown. Here, instead of decision boundaries, we show for each cluster contours of equal probability (p = 1/3), so as to emphasize the fuzzy nature of the result.

Figure 3. Deterministic annealing clustering of six clusters. Intermediate beta = 0.01, producing fuzzy clustering. Equiprobability lines of p = 1/3 are shown. o = computed cluster mean. × = center of cluster random generator.
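For readers who want to reproduce the flavor of these experiments, the snippet below (ours; the cluster centers, spread and random seed are illustrative assumptions, not the paper's exact data) generates isotropic clusters of eighty points each and runs the sketches from the previous sections with the same β schedule.

```python
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Four isotropic clusters of eighty points each (illustrative centers).
    centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]])
    X = np.vstack([c + 0.6 * rng.standard_normal((80, 2)) for c in centers])
    Y, P = deterministic_annealing(X, n_clusters=4)
    print("estimated means:\n", Y)
    # Expected total energy of eq. (4) under the final memberships.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    print("final energy:", (P * d2).sum())
```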
Summary

A new clustering technique is presented, which is motivated by an analogy to principles of statistical mechanics. In its essence it is a deterministic annealing algorithm which attempts to find the global minimum of the cost function. It is independent of the initial choice of configuration and shows itself to be insensitive to local minima. The method is applicable for either hard or fuzzy clustering and may be intuitively viewed as a generalization of both. Although it is demonstrated here for the simpler case of a known number of isotropic clusters of equal size, it is extendable, and such extensions will be discussed in a coming paper.

References
Ball, G., and D. Hall (1967). A clustering technique for summarizing multivariate data. Behavioral Science 12, 153-155.
Bezdek, J.C. (1980). A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Trans. Pattern Anal. Machine Intell. 2, 1-8.
Duda, R.O., and P.E. Hart (1974). Pattern Classification and Scene Analysis. Wiley-Interscience, New York.
Dunn, J.C. (1974). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybern. 3 (3), 32-57.
Gath, I., and A.B. Geva (1989). Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. 11, 773-781.
Geman, S., and D. Geman (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6, 721-741.
Kirkpatrick, S., C.D. Gelatt, and M.P. Vecchi (1983). Optimization by simulated annealing. Science 220, 671-680.
Simic, P.D. (1990). Statistical mechanics as the underlying theory of 'elastic' and 'neural' optimization. Network: Computation in Neural Systems 1, 89-103.