Pairwise clustering using a Monte Carlo Markov Chain

Borko D. Stošić

Departamento de Estatística e Informática, Universidade Federal Rural de Pernambuco, Rua Dom Manoel de Medeiros s/n, Dois Irmãos, 52171-900 Recife-PE, Brazil

Physica A 388 (2009) 2373–2382

Article history: Received 28 September 2008; Received in revised form 1 February 2009; Available online 28 February 2009

Keywords: Clustering; MCMC; Quenching

Abstract

In this work an application of MCMC is proposed for unsupervised data classification, in conjunction with a novel pairwise objective function, which is shown to work well in situations where clusters to be identified have a strong overlap, and the centroid oriented methods (such as K-means) fail by construction. In particular, an exceptionally simple but difficult situation is addressed when cluster centroids coincide, and one can differentiate between the clusters only on the basis of their variance. Performance of the proposed approach is tested on synthetic and real datasets.

1. Introduction

Over the past couple of decades, pattern recognition, clustering of data, and related issues have evolved into central questions, addressed by a significant portion of the current scientific community. At present, it has been widely recognized that reliable techniques in this area may lead to significant advances in a wide spectrum of areas of knowledge (see e.g. Ref. [1] and references therein). It therefore seems paramount to continuously question existing methodologies, compare their performance, and propose alternative methods and algorithms, which may contribute to the overall state of the art.

Many of the existing methods have drawbacks in that they are highly dependent on initialization options, they may converge to solutions close to the boundary of the parameter space, or they may fail to find the global maximum of the objective function (e.g. of the maximum likelihood). The Monte Carlo Markov Chain (MCMC) with a variable disorder parameter (temperature) may be used in conjunction with such methods to help alleviate the above-mentioned problems. As a method in itself, the MCMC has been used for pattern recognition to implement model selection criteria [2,3] for mixture model inference, or in a truly Bayesian sense to sample from the full a posteriori distribution with the number of groups k considered unknown [4,5]. Finally, a simulated annealing version of the MCMC has been proposed [6] for finding the global minima of the minimum sum of squares clustering (MSSC) problem, and has been applied [7,8] in diverse contexts.

In this work an application of the MCMC is proposed for unsupervised data classification (clustering) in conjunction with a novel pairwise objective function, which is shown to work well in situations where clusters to be identified have a strong overlap, and the centroid oriented methods (such as K-means) fail by construction. In particular, an exceptionally difficult situation is addressed when cluster centroids coincide, and one can differentiate between the clusters only on the basis of their variance.

The paper is organized as follows. In the next Section the problem is stated explicitly, an objective function is defined without using the concept of centroids, and an MCMC procedure for convergence towards solutions that minimize this objective function is described, in both quenching and annealing versions. In Section 3 clustering results are provided for several artificial and real data sets, and finally in Section 4 the conclusions are drawn.

doi:10.1016/j.physa.2009.02.025


2. Centroid free clustering

Clustering methods based on the concept of centroids by construction cannot distinguish between clusters with coinciding centers, which differ only in the variance of the data point distributions. In what follows, situations that are particularly simple to set up but difficult to resolve are considered, where clusters of different densities are embedded within each other, with coinciding centroids. In order to deal with such situations, first an adequate objective function is proposed, and then the suitable MCMC dynamics for convergence towards global solutions that minimize this objective function is described.

2.1. The objective function

To fix the notation, consider a set of N points xi, i = 1 . . . N in d-dimensional space R^d, where the scalar coordinates of point xi are denoted by xiℓ, ℓ = 1 . . . d, and the Euclidean distance between points xi and xj is given by

$$ d_{ij} = \left[ \sum_{\ell=1}^{d} \left( x_{i\ell} - x_{j\ell} \right)^{2} \right]^{1/2}. \qquad (1) $$

Each point xi, i = 1 . . . N belongs to one of the K clusters µk, k = 1 . . . K, so that clusters represent collections (unordered sets) of nk points each, such that $\sum_{k=1}^{K} n_k = N$. The fact that point xi belongs to the k-th cluster shall be from here on denoted by i ∈ µk, that is, µk is understood as a set of indices of those points that belong to the k-th cluster. For any given spatial distribution of points xi, i = 1 . . . N there are altogether K^N possibilities of assigning a cluster label si ∈ {1, 2, . . . , K} to each point, so identification of some particular assignments (e.g. those that minimize a given objective function) represents an exponential time problem (or more precisely, examining all possible such assignments by brute force becomes computationally infeasible already for rather small values of K and N). In what follows, a given assignment of labels to points shall be termed "configuration", represented by an N-dimensional vector s ≡ (s1, s2, . . . , sN)^T, and the set of all the K^N such configurations shall be denoted by ΠK.

Defining cluster centroids $c^k \equiv (c_1^k, \ldots, c_d^k)^T$, where $c_\ell^k = \sum_{i\in\mu_k} x_{i\ell}/n_k$, greatly simplifies this assignment task, as was cleverly implemented in the K-means (probably the most widely used) and other related methods. On the other hand, in situations where cluster centroids coincide (or lack sufficient separation in the d-dimensional parameter space) this concept cannot be used to differentiate between the clusters. In particular, the commonly used Minimum Sum of Squares Clustering (MSSC) objective function defined as

$$ M = \sum_{k=1}^{K} \sum_{i\in\mu_k} \sum_{\ell=1}^{d} \left( x_{i\ell} - c_\ell^k \right)^2, \qquad (2) $$

which should be minimized over the set ΠK in order to identify the "ideal" clustering configuration(s), loses meaning if the cluster centroids coincide. An alternative objective function, which minimizes the sum of distances between points within the same cluster and simultaneously maximizes the sum of distances between pairs of points belonging to different clusters, may be defined without use of the cluster centroid concept as

$$ E = \sum_{k=1}^{K} \sum_{\substack{i,j\in\mu_k \\ i<j}} d_{ij} \;-\; \sum_{k=1}^{K} \sum_{k'=k+1}^{K} \sum_{i\in\mu_k} \sum_{j\in\mu_{k'}} d_{ij}. \qquad (3) $$
In the above definition the sum of distances between observations belonging to different clusters is subtracted from the sum of intracluster (within-cluster) distances. Minimizing this function over the set ΠK should lead to solutions (partitions) similar to those of the centroid oriented MSSC objective function (2) in cases where the centroids are well separated. The principal difference from MSSC is that here all the relative pairwise distances enter the calculations irrespective of the cluster centroid positions, making it possible to handle situations where cluster centers coincide. On the other hand, there seems to be no simple iterative procedure, such as that employed by K-means, that would lead in several steps from an arbitrary initial configuration (cluster assignment) to an optimal solution (be it local or global), so in the next subsection the focus is shifted to a stochastic dynamical MCMC procedure, which may be used in the quest for such solutions.

2.2. The Monte Carlo Markov Chain

The Monte Carlo Markov Chain has been introduced in the area of statistical physics [9] as a tool for finding free energy minima of systems of interacting particles, and has subsequently been used in many other areas of research for attacking NP-complete problems (see e.g. Refs. [10,11] and references therein).
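Before turning to how the chain explores ΠK, the following is a minimal sketch of evaluating the pairwise objective (3) for a given label vector. It is written in Python/NumPy rather than the dedicated C program mentioned later in the paper; the function name pairwise_objective and the brute-force double loop over pairs are illustrative choices, not part of the paper.

```python
import numpy as np

def pairwise_objective(X, labels):
    """Pairwise objective E of Eq. (3): each within-cluster distance enters
    with a plus sign, each between-cluster distance with a minus sign."""
    E = 0.0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = np.linalg.norm(X[i] - X[j])          # Euclidean distance, Eq. (1)
            E += d_ij if labels[i] == labels[j] else -d_ij
    return E

# Tiny usage example: the well-separated grouping gives the lower (better) value.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(pairwise_objective(X, np.array([0, 0, 1, 1])))    # well-separated grouping
print(pairwise_objective(X, np.array([0, 1, 0, 1])))    # mixed grouping, higher E
```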


In the current context, one is interested in finding a particular assignment of N d-dimensional data points to K clusters (a configuration), which minimizes the objective function (3). The philosophy of the MCMC is to devise a dynamical iterative process, starting from (an arbitrary) initial configuration, which explores the parameter space (in the current case the set ΠK) in search of the global minimum of the objective function. Such a dynamical process should be reversible, and this requirement is embodied in the Detailed Balance Principle

$$ P(A)\, W(A \to B) = P(B)\, W(B \to A), \qquad (4) $$

where P(A) is the probability of finding the system in configuration A ∈ ΠK, and W(A → B) is the probability of transition from A to B. It may be of interest to note here that if one identifies the transition probability W(A → B) with the conditional probability P(B|A) of finding the system in configuration B given it was found in A, the Detailed Balance Principle (4) translates into the Bayes rule. In analogy with model systems of statistical mechanics, in order to ensure that equilibrium configurations of the dynamical process W correspond to the minimum of the objective function (3), for P(A) one can use the Boltzmann factor

$$ P(A) = \frac{e^{-(E_A - E_0)/T}}{Z}, \qquad (5) $$

where T ∈ R+ is the (continuous) disorder parameter (equivalent of temperature), EA (equivalent of energy) is the value of the objective function (3) for the configuration A, E0 is the global minimum of the objective function (the ground state), and

$$ Z = \sum_{s \in \Pi_K} e^{-(E_s - E_0)/T} \qquad (6) $$

is the normalizing factor (partition function), which guarantees that the probabilities of all possible configurations sum to unity. Finally, for the transition probability between any two configurations one may adopt the (probably most widely used) Metropolis choice [9]

$$ W(A \to B) = \min\left\{ 1, \frac{P(B)}{P(A)} \right\}. \qquad (7) $$

It should be noted that for the current choices (5)–(7), for a given pair of configurations A and B, at a given temperature T, the transition probability reduces to

$$ W(A \to B) = \min\left\{ 1, e^{(E_A - E_B)/T} \right\}, \qquad (8) $$

consequently one does not need to know explicitly the probability values, and therefore neither the ground state E0, nor the value of the partition function (which would require calculation over all the K^N possible configurations). More precisely, from (8) it follows that it is sufficient to know the difference between the values of the objective function for the two configurations. By construction of the Metropolis algorithm (other choices that satisfy (4) are also possible), transitions to configurations with a lower value of the objective function are always accepted, while transitions to configurations with higher values are accepted with a probability which decays exponentially with the increment of the objective function, on the scale defined by T. The sequence of accepted configurations represents a Markov Chain, as the transition probability between two configurations does not depend on previous history (therefore the method bears the name Monte Carlo Markov Chain – MCMC). Implementing this dynamics for a sufficiently long (possibly infinite) time for a given value of T leads to equilibrium configurations which correspond to the minimum of the quantity (equivalent of free energy)

$$ F = -T \log Z, \qquad (9) $$

rather than the ground state. More precisely, the higher the value of T, the more likely configurations with higher values of E are to be realized, and only as T → 0+ does the free energy approach the ground state, so that the global minimum is found. To implement the algorithm, one starts from an arbitrary configuration, then attempts to change the cluster assignment of one or more points (usually chosen at random). The objective function is calculated for both the initial and target configurations, and the new configuration is accepted with the probability given by (8). This procedure is repeated until some convergence criterion is met. It is important to note here that starting from a given configuration A for which the objective function EA given by Eq. (3) is already known, changing the assignment of a single point xi from cluster k to cluster k′ to form configuration B induces a change of the objective function given by

$$ E_B = E_A - \sum_{j\in\mu_k} d_{ij} + \sum_{j\in\mu_{k'}} d_{ij}. \qquad (10) $$

Therefore, the computational effort necessary for every Monte Carlo Step (MCS) is proportional to the sum of the sizes of clusters k and k′, that is, nk + nk′ = O(N), so that the total computational effort is of the order O(MN), where M is the total number of MCS needed to meet the chosen convergence criterion.
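A minimal sketch of a single Monte Carlo step built from the incremental update (10) and the Metropolis rule (8) is given below. Python/NumPy is used instead of the author's C implementation, a precomputed distance matrix D is assumed purely as a convenience, and the names mc_step, labels and rng are illustrative; the later sketches for the synthetic and Iris experiments reuse this helper.

```python
import numpy as np

def mc_step(D, labels, K, T, rng):
    """Attempt to move one randomly chosen point to a randomly chosen cluster.
    Returns True if the move was accepted."""
    i = rng.integers(len(labels))
    k_old = labels[i]
    k_new = rng.integers(K)
    if k_new == k_old:
        return False
    old = labels == k_old
    old[i] = False                                  # exclude the point itself
    new = labels == k_new
    # Eq. (10): change of the objective when point i moves from k_old to k_new
    delta = D[i, new].sum() - D[i, old].sum()       # E_B - E_A
    # Eq. (8): accept decreases always, increases with probability exp(-delta/T);
    # at T = 0 this reduces to the quenching rule of Eq. (11)
    if delta <= 0 or (T > 0 and rng.random() < np.exp(-delta / T)):
        labels[i] = k_new
        return True
    return False
```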


Fig. 1. Typical clustering results obtained using the MCMC (left) and the K-means method (right), for clusters of different density and coinciding centroids (see text for details).

It should be mentioned here that in the literature the term "Monte Carlo Step" often refers to a sequence of individual point updates of the order of magnitude of the system as a whole. Also, the updates can be done at random or sequentially, on individual observation points or on groups of (e.g. adjacent) points in order to speed up the convergence towards global minima solutions, but the overall philosophy of the approach remains unchanged.

There are several possibilities as to the choice of the disorder parameter T. For the choice T = 0 (called "quenching") the transition probability (8) acquires a particularly simple form

$$ W(A \to B) = \begin{cases} 1, & E_A \ge E_B \\ 0, & E_A < E_B, \end{cases} \qquad (11) $$

yielding a greedy algorithm which accepts only changes that reduce the objective function (similar to the K-means method), where maximization of the algorithm speed is countered by the possibility of ending up in a local minimum. Another option is known as Simulated Annealing (a rather popular variant of MCMC sampling introduced in Ref. [12]), where one starts from a high T value and then successively reduces T (usually by a multiplicative factor τ < 1) at given intervals; this option is slower, but ensures finding the global minimum. Finally, starting from an arbitrary initial configuration (supposedly far from equilibrium), and allowing the MCMC algorithm ample equilibration time with either a quenching or annealing option, rather than reaching T = 0 (which corresponds to either a local or a global minimum of the objective function) one may perform measurements (record cluster belonging) at small constant values of T. For overlapping clusters at (reasonably) low temperatures, the observation points from the overlap region (whose reallocation from one cluster to another does not substantially increase the value of the objective function) should change cluster belonging with considerable frequency, which may be used to construct a quantitative probability estimate (similar to fuzzy methods), or one may simply opt for a majority count to assign the final cluster label to a given point.

3. Results

In this Section results are presented of the application of the MCMC method using the proposed objective function, on several artificial and real datasets. It turns out that the (fast and simple) quenching MCMC version yields rather satisfactory results in most cases.

3.1. Synthetic data

3.1.1. 2-d overlapping normal clusters

First, to illustrate the performance of the current approach as compared with the K-means method for overlapping clusters, artificial datasets were considered consisting of two clusters with 300 points each, in two-dimensional parameter space. Each of the (mutually uncorrelated) features was distributed normally N(0, σi), with σ1 = 1.0 and σ2 = 0.3 for clusters µ1 and µ2, respectively. A total of 100 samples were generated, for each of these 100 random initial configurations were chosen, and for each of these (altogether 10 000) choices, both the K-means and the MCMC quenching (T = 0) algorithms were implemented.

In Fig. 1 the typical results of the application of the two methods are displayed. It is seen that for the current example the performance of the MCMC method using the objective function (3) is far superior (∼85% of points correctly classified) to that of the K-means method (∼55%). Moreover, visual inspection of the figure reveals that the complete failure of the K-means method (a totally random choice would yield a result with a similar error rate) may be associated with the slight anisotropy (or angular distribution) of the data, rather than the (relevant) radial concentration gradient, whereupon the K-means algorithm ends up slicing the data into two half-planes.
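A scaled-down sketch of the 2-d experiment just described, under stated assumptions: it reuses the hypothetical mc_step helper from the earlier sketch, stops the quench once a full sweep accepts no move (neither this criterion nor the 200-sweep cap is prescribed by the paper), and uses scikit-learn's KMeans purely for the comparison.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two 300-point clusters with coinciding centroids: sigma = 1.0 and sigma = 0.3
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
               rng.normal(0.0, 0.3, size=(300, 2))])
true = np.repeat([0, 1], 300)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # all pairwise distances

def quench(D, K, rng, max_sweeps=200):
    """T = 0 (quenching) MCMC run from a random initial configuration."""
    labels = rng.integers(K, size=D.shape[0])
    for _ in range(max_sweeps):
        accepted = sum(mc_step(D, labels, K, 0.0, rng) for _ in range(len(labels)))
        if accepted == 0:
            break
    return labels

def rate(pred, true):
    """Correct classification rate for two clusters, up to label permutation."""
    return max(np.mean(pred == true), np.mean(pred != true))

labels = quench(D, 2, rng)
km = KMeans(n_clusters=2, n_init=10).fit(X)
print("MCMC quenching:", rate(labels, true))
print("K-means:       ", rate(km.labels_, true))
```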


Fig. 2. Frequency histograms of clustering results obtained using the MCMC quenching algorithm (left) and the K-means method (right), for clusters of different density and coinciding centroids (see text for details).

Table 1
Average and standard deviation of correct classification rates (%) for the MCMC quenching algorithm (MA and MS) and for the K-means method (KA and KS), for clusters with coinciding centroids and different dimensionality d of the parameter space.

d      2      3      4      5      8      16     32     64     128    256
MA     75.55  86.34  87.88  88.60  89.70  90.87  91.70  92.23  92.61  92.84
MS     12.28   3.43   0.80   0.30   0.22   0.16   0.12   0.09   0.08   0.03
KA     59.82  59.91  60.89  60.96  61.79  63.22  64.78  67.73  70.37  72.25
KS      6.93   7.34   7.60   7.59   7.93   8.40   9.21   9.87  10.05   9.74

Summary results of these simulations are displayed in Fig. 2 in the form of frequency histograms, binned with respect to the percentage of correct classification. It follows that the current MCMC approach, in its simplest version (quenching, T = 0), yields rather satisfactory results (as opposed to K-means), with over 60% of the runs accumulated at a correct classification level above 80%. Therefore, while the choice of a decaying temperature schedule for annealing ensures finding the global minimum, one may still opt for performing several runs with the simpler (and faster) quenching MCMC version for different initial configurations (cluster assignments), and then retaining only the best result (with the smallest value of the objective function attained), as is the usual strategy adopted with the K-means method.
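The restart strategy just mentioned takes only a couple of lines; this sketch reuses the hypothetical quench and pairwise_objective helpers (and the arrays X, D, rng) from the earlier sketches, and the choice of 20 restarts is arbitrary.

```python
# Several quenching runs from random initial configurations; keep the labeling
# that attains the smallest value of the objective function (3).
best = min((quench(D, 2, rng) for _ in range(20)),
           key=lambda lab: pairwise_objective(X, lab))
```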

3.1.2. Higher dimensions

To assess the performance of the current approach and compare it to that of the K-means method in a more general context, additional simulations were performed on data sets of embedded clusters with 3, 4, 5, 8, 16, 32, 64, 128 and 256 features. Again, 300 observations for each of the two clusters were generated using Gaussian distributions N(0, σi) for the (mutually uncorrelated) individual features, with σ1 = 1.0 and σ2 = 0.3 for clusters µ1 and µ2, respectively, with a total of 100 samples and 100 random initial configurations each. The summary of the results of the implementation of the K-means method and the MCMC quenching (T = 0) algorithm for these tests is presented in Table 1, demonstrating the overwhelming superiority of the MCMC algorithm in comparison with the K-means method, for the current choice of clusters with coinciding centroids. The somewhat high value of the standard deviation for d = 2 corresponds to the bimodal behavior of the success rate observed in Fig. 2, which is dramatically suppressed with the increase of the dimension of the parameter space (number of features). This finding suggests that in the current setup there is no need for the more elaborate annealing MCMC version, as already for d = 2 two out of three runs may be expected to yield a valid classification rate, and the percentage of correct classification runs rapidly increases with the increase of the dimension of the parameter space, while the actual rates also show a steady (albeit slow) increase. It should be noted at this point that increasing the dimension of the parameter space does not result in a (perhaps expected) curse of dimensionality effect. On the contrary, for the current choice for generating data, a fixed number of observation points becomes more sparse as the dimension is increased, and the chance of an observation belonging to the more sparse cluster occupying the "central" volume (populated with the denser cluster's observations) diminishes. In comparison with the MCMC quenching, while the correct classification rate of the K-means method also demonstrates a tendency of gradual increase, its standard deviation also increases with the dimension of the parameter space (number of features). To highlight the tendencies of the correct classification rates of the two methods, the average rates with error bars from Table 1 are also displayed in Fig. 3 on a semi-logarithmic scale.


Fig. 3. Average correct classification rates (%) for the MCMC quenching algorithm (full circles) and the K-means method (open circles), for clusters with coinciding centroids and different dimensions of the parameter space (the lines connecting the simulation data points are displayed to guide the eye).

Table 2
Average and standard deviation of correct classification rates (%) for the MCMC quenching algorithm (MA and MS) and for the K-means method (KA and KS), for k overlapping clusters in two-dimensional parameter space.

k      2      3      4      5      6      7      8      9      10
MA     86.20  74.60  65.94  58.55  51.97  49.58  43.68  41.51  38.03
MS      5.62  13.13  12.94  20.25  17.41  20.65  16.18  14.45  11.92
KA     84.07  70.87  60.79  53.82  48.38  45.02  38.72  36.61  33.17
KS     16.42  38.33  54.39  42.44  36.44  52.11  63.89  71.28  73.88

3.1.3. Multiple clusters

To compare the performance of the current approach with that of the K-means method in a scenario with multiple clusters, additional simulations were performed on data sets with 2 to 10 clusters, in two-dimensional parameter space. To this end, rather than considering clusters with strictly overlapping centroids (which may be considered rare for real world situations, and unfairly difficult for the K-means method), here we shall consider sets of overlapping clusters with uniformly scattered centroids, and varying dispersion. More precisely, for each sample 100 observations were generated using Gaussian distributions N(rk, σk) centered at points rk = (xk, yk), with the two (mutually uncorrelated) individual features xk, yk drawn from a uniform distribution U(0, 3), and standard deviation σk drawn from U(0.3, 1). For each number of clusters k ∈ {2, 3, . . . , 10}, a total of 100 samples were generated, and for 100 random initial configurations for each sample, the K-means and the MCMC quenching (T = 0) were implemented (a total of 10 000 runs for each k, for each of the two methods). The summary results of these tests are presented in Table 2, demonstrating the systematic superiority of the MCMC algorithm in comparison with the K-means method, both from the point of view of a higher correct classification rate, and the stability of the results (considerably lower standard deviation).

3.2. Real data

To test the current approach (and compare it with the K-means method) on real data, the classical Fisher's Iris dataset was chosen, as it has been used for testing a variety of methods for over seven decades, and its properties are well known. The data, consisting of 150 instances of measurements of four features (sepal and petal width and length) for three flower species (50 each for Iris Setosa, Iris Versicolor and Iris Virginica), was obtained from the UCI Machine Learning Repository [13] (correcting the two documented errors). In Fig. 4 all six different projections onto two-dimensional subspaces are shown, where it is seen that the species Setosa is well separated from the other two species, which in turn demonstrate a considerable overlap (to a different degree in different parameter subspaces).

3.2.1. Quenching

First, both the MCMC quenching algorithm and the K-means method were performed on 10 000 initial configurations of the Iris data. A success rate of 90.667% with a standard deviation of 0.016% was obtained for the MCMC quenching, while the K-means algorithm ended up identifying only two (rather than three) clusters for 2136 initial configurations. These results were discarded, and for the remaining 7864 runs a slightly worse success rate of 88.698% and a larger deviation of 0.140% were obtained. Although these summary results are quite similar, it is instructive to take a closer look at the individual runs, as shown in Table 3. For all the runs the values of both M and E, given by Eqs. (2) and (3), respectively, were calculated.


Fig. 4. Different possible two-dimensional projections of the Iris dataset.

Table 3
The number of runs (#) with given final values of M and E, together with the corresponding correct classification rate (%) and number of correctly classified instances (N), for the Iris data.

MCMC
#      M          E            %      N
9994   80.226343  −21620.087   90.67  136
1      80.522703  −21619.618   91.33  137
5      80.621394  −21619.469   91.33  137

K-means
#      M          E            %      N
365    78.851440  −21380.868   89.33  134
7499   78.855664  −21441.161   88.67  133

Table 3 lists the number of runs attaining each final value of M and E, together with the corresponding correct classification rate and the number of correctly classified instances, for both methods. It is seen from Table 3 that the MCMC performs better than the K-means method, both from the point of view of the classification rate, and the stability of the solution. The MCMC quenching displays extremely high stability, with only 6 runs out of 10 000 ending up in configurations with values of the objective function slightly higher than the minimum attained, and (surprisingly) a somewhat better classification rate. On the other hand, the K-means method identified the theoretical optimum solution [14] in only 3.65% of the runs; in another 74.99% of the cases it reached slightly higher values of the objective function (with a slightly worse classification rate), and in the rest of the cases only two of the three clusters were found with a nonzero number of instances, and these runs were discarded. It should be noted that the best case scenario of the K-means method [14] attains a worse classification rate than the worst case of the MCMC, although the difference is not large.

The current example on the Iris dataset again demonstrates the superiority of the proposed MCMC approach with the pairwise objective function (3) in comparison with the K-means method. On the other hand, as the K-means method remains probably the most widely used method because of its intuitive simplicity, it should be noted that the MCMC retains the same conceptual simplicity and intuitive appeal, although it is more computationally intensive. The latter fact should not represent a serious impediment to the use of the proposed approach, as computing resources nowadays may be considered probably the most accessible of scientific tools. In particular, a single MCMC run with several thousand Monte Carlo updates, necessary to reach the minimum of the objective function with several-digit precision, typically takes under a second on the Celeron 1.5 GHz machine used for testing, with a dedicated program written in the C programming language. While the K-means method is faster, the difference in speed is countered by the stability in performance, the better classification rate, and the ability to deal with overlapping centroid situations.
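A scaled-down sketch of the Iris comparison, under stated assumptions: the dataset is loaded through scikit-learn rather than from the UCI repository files used in the paper, the hypothetical quench, mc_step and pairwise_objective helpers from the earlier sketches are reused, and only 50 restarts are performed instead of the 10 000 runs reported above.

```python
import numpy as np
from itertools import permutations
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
rng = np.random.default_rng(1)

# Best of 50 quenching runs, judged by the pairwise objective (3)
labels = min((quench(D, 3, rng) for _ in range(50)),
             key=lambda lab: pairwise_objective(X, lab))

def best_rate(pred, true, K=3):
    """Classification rate under the most favourable label permutation."""
    return max(np.mean(np.asarray(p)[pred] == true) for p in permutations(range(K)))

print("MCMC quenching:", best_rate(labels, y))
print("K-means:       ", best_rate(KMeans(n_clusters=3, n_init=10).fit(X).labels_, y))
```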


Fig. 5. Classification fractions for the first two instances of the Virginica species.

3.2.2. Simulated annealing

In the previous sections, quenching (a greedy algorithm, with disorder parameter T = 0) was found to work well in comparison with the K-means method, producing quite stable results. The reason for this is that the landscape of the objective function (in all the considered scenarios) apparently was not an extremely complicated hypersurface with numerous local minima (as is the case in many situations in statistical physics, such as e.g. spin glasses), but rather a well-behaved surface with a pronounced global minimum.

On the other hand, rather than searching for a global minimum, simulated annealing may be used in the current setup in order to enhance the information on cluster classification. More precisely, identifying the global minimum (whether by quenching or annealing) leads to a Boolean response for each instance belonging to a cluster. However, instances close to overlapping cluster boundaries by construction have a lesser influence on the value of the objective function, as changing the cluster belonging of these points produces a smaller contribution than that from the points deep in the interior of the clusters. Therefore, increasing the disorder parameter T ("temperature") increases the probability of flipping (changing cluster belonging) of these points, and permits their identification. The idea here is then to systematically decrease T (e.g. by a multiplicative factor α < 1) starting from some "high" value, and for each instance record the number of times that it was classified as belonging to each of the clusters. The entries in the frequency table, with rows corresponding to instances and columns to clusters, may then be used to extract a "fuzzy" measure of cluster belonging for each instance, as the relative frequency at some convenient crossover temperature.

To illustrate this procedure, MCMC simulated annealing runs were performed on the Iris dataset starting from the disorder parameter value T = 50, reducing T systematically by a factor of α = 50^{1/50} = 1.081383 and recording cluster belonging for each instance during 1 500 000 updates at each T (approximately 10 000 per instance). In Fig. 5 the cluster belonging fraction as a function of T is displayed for the first two instances of Iris Virginica (entries 101 and 102 in the database), where the first was correctly classified, and the second was erroneously classified as Iris Versicolor. It is seen that for T > 25 no conclusions can be drawn (all three classification fractions are close to 1/3), in the interval 10 < T < 25 both instances are clearly distinguished from the Setosa cluster, but can be equally classified as Versicolor or Virginica, and only in the region T < 10 may the actual classification be performed. Moreover, it is seen that the first (correctly classified) instance displays a clear separation of the two branches, while the second (erroneously classified) instance displays a much slower separation of the branches with the decay of the disorder parameter T.
To quantify the cluster belonging probability one may introduce a "fuzzy" measure as the classification fraction at T = 5 (the middle of the lowest separation interval), with the result $P^{101}_{Setosa} = 0$, $P^{101}_{Versicolor} = 0.002$, $P^{101}_{Virginica} = 0.998$, and $P^{102}_{Setosa} = 0$, $P^{102}_{Versicolor} = 0.586$, $P^{102}_{Virginica} = 0.414$ for the samples in Fig. 5. In Table 4, the fuzzy probability measures are given for all the 150 instances of the three species, where it is seen that the misclassified instances (six of Iris Versicolor, and nine of Virginica) display different degrees of misclassification, ranging from 0.112 to 0.496, where the value 0.5 signifies that the two sub-species are indistinguishable.
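A sketch of the annealing-with-recording procedure described above, with the schedule deliberately scaled down (far fewer updates per temperature than the 1 500 000 quoted) to keep the example small; it reuses the hypothetical mc_step helper and the Iris distance matrix D from the earlier sketches, and, unlike the paper, which reads the classification fraction at the single crossover temperature T = 5, it aggregates the recorded labels over all temperatures below record_below.

```python
import numpy as np

def anneal_fractions(D, K, rng, T_start=50.0, T_stop=1.0, alpha=1.081383,
                     updates_per_T=10_000, record_below=5.0):
    """Simulated annealing that records, for each instance, how often it is
    assigned to each cluster at low temperatures ('fuzzy' membership)."""
    n = D.shape[0]
    labels = rng.integers(K, size=n)
    counts = np.zeros((n, K))
    T = T_start
    while T >= T_stop:
        for _ in range(updates_per_T):
            mc_step(D, labels, K, T, rng)
            if T <= record_below:                    # record the whole labeling
                counts[np.arange(n), labels] += 1
        T /= alpha                                   # geometric cooling, alpha = 50**(1/50)
    return counts / counts.sum(axis=1, keepdims=True)

fractions = anneal_fractions(D, 3, np.random.default_rng(2))
print(fractions[100], fractions[101])                # instances 101 and 102 (0-based)
```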

4. Conclusions

In this work a new approach (which could be termed K-mcmc) is proposed for unsupervised data classification, which implements Monte Carlo Markov Chain dynamics in search of global minima of a novel pairwise objective function. The principal advantage of this approach is that it is free of the concept of centroids, and is shown to work well in situations where clusters to be identified have a strong overlap, and the centroid oriented methods (such as the K-means method) fail by construction.


Table 4
Fuzzy measure of cluster belonging for the Iris data, determined through MCMC simulated annealing (see text for more details). Misclassified instances are marked with an asterisk.

Setosa                          Versicolor                       Virginica
Inst.  PSet.   PVers.  PVirg.   Inst.  PSet.   PVers.  PVirg.    Inst.  PSet.   PVers.  PVirg.
1      1.0000  0.0000  0.0000   51*    0.0000  0.2229  0.7771    101    0.0000  0.0020  0.9980
2      1.0000  0.0000  0.0000   52     0.0001  0.6044  0.3955    102*   0.0000  0.5858  0.4142
3      1.0000  0.0000  0.0000   53*    0.0000  0.1545  0.8455    103    0.0000  0.0026  0.9974
4      1.0000  0.0000  0.0000   54     0.0002  0.9975  0.0023    104    0.0000  0.1086  0.8914
5      1.0000  0.0000  0.0000   55     0.0000  0.7528  0.2472    105    0.0000  0.0072  0.9928
6      1.0000  0.0000  0.0000   56     0.0001  0.9902  0.0097    106    0.0000  0.0000  1.0000
7      1.0000  0.0000  0.0000   57*    0.0000  0.3817  0.6183    107*   0.0001  0.9839  0.0160
8      1.0000  0.0000  0.0000   58     0.0167  0.9824  0.0009    108    0.0000  0.0062  0.9938
9      1.0000  0.0000  0.0000   59     0.0000  0.8123  0.1877    109    0.0000  0.0915  0.9085
10     1.0000  0.0000  0.0000   60     0.0003  0.9966  0.0031    110    0.0000  0.0009  0.9991
11     1.0000  0.0000  0.0000   61     0.0038  0.9944  0.0017    111    0.0000  0.0250  0.9750
12     1.0000  0.0000  0.0000   62     0.0000  0.9562  0.0438    112    0.0000  0.1213  0.8787
13     0.9997  0.0003  0.0000   63     0.0001  0.9964  0.0035    113    0.0000  0.0051  0.9949
14     1.0000  0.0000  0.0000   64     0.0000  0.9323  0.0677    114*   0.0000  0.6558  0.3442
15     1.0000  0.0000  0.0000   65     0.0005  0.9959  0.0036    115    0.0000  0.1029  0.8972
16     0.9999  0.0001  0.0000   66     0.0000  0.5746  0.4254    116    0.0000  0.0062  0.9938
17     1.0000  0.0000  0.0000   67     0.0001  0.9677  0.0322    117    0.0000  0.0456  0.9544
18     1.0000  0.0000  0.0000   68     0.0005  0.9958  0.0037    118    0.0000  0.0002  0.9998
19     1.0000  0.0000  0.0000   69     0.0000  0.9562  0.0438    119    0.0000  0.0006  0.9994
20     1.0000  0.0000  0.0000   70     0.0001  0.9985  0.0014    120*   0.0000  0.9470  0.0530
21     1.0000  0.0000  0.0000   71*    0.0000  0.4409  0.5591    121    0.0000  0.0008  0.9992
22     1.0000  0.0000  0.0000   72     0.0000  0.9857  0.0143    122*   0.0001  0.5959  0.4040
23     1.0000  0.0000  0.0000   73     0.0000  0.8725  0.1275    123    0.0000  0.0011  0.9989
24     1.0000  0.0000  0.0000   74     0.0000  0.9766  0.0234    124    0.0000  0.4261  0.5739
25     1.0000  0.0000  0.0000   75     0.0000  0.9348  0.0652    125    0.0000  0.0040  0.9960
26     1.0000  0.0000  0.0000   76     0.0001  0.7347  0.2652    126    0.0000  0.0046  0.9954
27     1.0000  0.0000  0.0000   77     0.0001  0.5440  0.4559    127    0.0000  0.4888  0.5112
28     1.0000  0.0000  0.0000   78*    0.0000  0.1121  0.8879    128    0.0000  0.3770  0.6230
29     1.0000  0.0000  0.0000   79     0.0000  0.9455  0.0545    129    0.0000  0.0241  0.9759
30     1.0000  0.0000  0.0000   80     0.0015  0.9963  0.0022    130    0.0000  0.0162  0.9838
31     1.0000  0.0000  0.0000   81     0.0004  0.9982  0.0014    131    0.0000  0.0034  0.9966
32     1.0000  0.0000  0.0000   82     0.0005  0.9987  0.0008    132    0.0000  0.0007  0.9993
33     1.0000  0.0000  0.0000   83     0.0003  0.9951  0.0046    133    0.0000  0.0159  0.9841
34     1.0000  0.0000  0.0000   84     0.0001  0.8203  0.1796    134*   0.0000  0.7084  0.2916
35     1.0000  0.0000  0.0000   85     0.0001  0.9760  0.0239    135*   0.0000  0.8298  0.1702
36     1.0000  0.0000  0.0000   86     0.0000  0.5951  0.4049    136    0.0000  0.0005  0.9995
37     1.0000  0.0000  0.0000   87*    0.0000  0.3442  0.6558    137    0.0000  0.0040  0.9960
38     1.0000  0.0000  0.0000   88     0.0000  0.9809  0.0191    138    0.0000  0.0534  0.9466
39     1.0000  0.0000  0.0000   89     0.0004  0.9878  0.0118    139*   0.0000  0.5041  0.4959
40     1.0000  0.0000  0.0000   90     0.0000  0.9976  0.0024    140    0.0000  0.0039  0.9961
41     1.0000  0.0000  0.0000   91     0.0000  0.9964  0.0036    141    0.0000  0.0016  0.9984
42     0.9974  0.0026  0.0000   92     0.0000  0.9302  0.0698    142    0.0000  0.0026  0.9974
43     1.0000  0.0000  0.0000   93     0.0001  0.9958  0.0041    143*   0.0000  0.5852  0.4148
44     1.0000  0.0000  0.0000   94     0.0088  0.9903  0.0009    144    0.0000  0.0013  0.9987
45     1.0000  0.0000  0.0000   95     0.0004  0.9939  0.0057    145    0.0000  0.0007  0.9993
46     1.0000  0.0000  0.0000   96     0.0004  0.9905  0.0091    146    0.0000  0.0045  0.9955
47     1.0000  0.0000  0.0000   97     0.0002  0.9957  0.0040    147    0.0000  0.3729  0.6271
48     1.0000  0.0000  0.0000   98     0.0000  0.9700  0.0300    148    0.0000  0.0249  0.9751
49     1.0000  0.0000  0.0000   99     0.0120  0.9871  0.0009    149    0.0000  0.0098  0.9902
50     1.0000  0.0000  0.0000   100    0.0001  0.9960  0.0039    150    0.0000  0.4677  0.5323

Extensive numerical simulations are performed on synthetic datasets with coinciding centroids and different variance, in parameter spaces of up to 256 dimensions, where the overwhelming superiority of the new method in comparison with the K-means method is demonstrated. Simulations performed on synthetic data with up to ten clusters with two features also demonstrate the superiority of the MCMC quenching over the K-means method. The performance is also tested on the classical Iris dataset, where both the K-mcmc and the K-means methods work well, but the former still demonstrates superiority from the point of view of stability and classification rate. In addition, it is shown how simulated annealing, with a gradual decrease of the disorder parameter, may be used for "fuzzy" cluster classification.

The current setup is not adequate for situations where the number of clusters is not known in advance, as the objective function (3) does not contain a term that penalizes an increase in the number of clusters (by construction, the objective function (3) is minimal when the number of clusters is equal to the number of observations). On the other hand, the stochastic MCMC minimizing procedure is independent of the particular choice of the functional form of the objective function, and the current approach may be extended for this purpose, by making different choices for the objective function.


Acknowledgments

The author would like to thank the anonymous referee for constructive criticism, and acknowledge financial support of the Brazilian agencies CAPES and CNPq.

References

[1] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: A review, ACM Computing Surveys 31 (1999) 264–323.
[2] H. Bensmail, G. Celeux, A. Raftery, C. Robert, Inference in model-based cluster analysis, Statistics and Computing 7 (1997) 1–10.
[3] K. Roeder, L. Wasserman, Practical Bayesian density estimation using mixtures of normals, Journal of the American Statistical Association 92 (1997) 894–902.
[4] C. Rasmussen, The infinite Gaussian mixture model, in: S. Solla, T. Leen, K.R. Muller (Eds.), Advances in Neural Information Processing Systems 12, MIT Press, 2000, pp. 554–560.
[5] S. Richardson, P. Green, On Bayesian analysis of mixtures with unknown number of components, Journal of the Royal Statistical Society, Series B 59 (1997) 731–792.
[6] R.W. Klein, R.C. Dubes, Experiments in projection and clustering by annealing, Pattern Recognition 22 (1989) 213–220.
[7] D.E. Brown, C.L. Huntley, A practical application of simulated annealing to clustering, Pattern Recognition 25 (1992) 401–412.
[8] H. Yuan, S. Khorram, X.L. Dai, Applications of simulated annealing minimization technique to unsupervised classification of remotely sensed data, in: Proceedings IGARSS '99, vol. 1, 1999, pp. 134–136.
[9] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equations of state calculations by fast computing machines, Journal of Chemical Physics 21 (6) (1953) 1087–1092.
[10] R.M. Neal, Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993.
[11] D.P. Landau, K. Binder, A Guide to Monte Carlo Simulations in Statistical Physics, second ed., Cambridge University Press, Cambridge, 2005.
[12] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[13] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, www.ics.uci.edu/mlearn/MLRepository.html, University of California, School of Information and Computer Science, Irvine, CA, 2007.
[14] O. du Merle, P. Hansen, B. Jaumard, N. Mladenović, An interior point algorithm for minimum sum-of-squares clustering, SIAM Journal on Scientific Computing 21 (2000) 1485–1505.