Accepted Manuscript
A Fast and Accurate Algorithm for Unsupervised Clustering Around Centroids Giuseppe M. Mazzeo, Elio Masciari, Carlo Zaniolo PII: DOI: Reference:
S0020-0255(17)30576-5 10.1016/j.ins.2017.03.002 INS 12783
To appear in:
Information Sciences
Received date: Revised date: Accepted date:
3 June 2016 2 January 2017 3 March 2017
Please cite this article as: Giuseppe M. Mazzeo, Elio Masciari, Carlo Zaniolo, A Fast and Accurate Algorithm for Unsupervised Clustering Around Centroids, Information Sciences (2017), doi: 10.1016/j.ins.2017.03.002
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT
A Fast and Accurate Algorithm for Unsupervised Clustering Around Centroids
University of California, Los Angeles
Elio Masciari ICAR-CNR, Rende
Carlo Zaniolo
AN US
University of California, Los Angeles
CR IP T
Giuseppe M. Mazzeo
Abstract
CE
PT
ED
M
A centroid-based clustering algorithm is proposed that works in a totally unsupervised fashion and is significantly faster and more accurate than existing algorithms. The algorithm, named CLUBS+ (for CLustering Using Binary Splitting), achieves these results by combining features of hierarchical and partitionbased algorithms. Thus, CLUBS+ consists of two major phases, i.e., a divisive phase and an agglomerative phase, each followed by a refinement phase. Each major phase consists of successive steps in which the samples are repartitioned using a criterion based on least quadratic distance. This criterion possesses unique analytical properties that are elucidated in the paper and exploited by the algorithm to achieve a very fast computation. The paper presents the results of the extensive experiments performed: these confirm that the new algorithm is fast, impervious to noise, and produces results of better quality than other algorithms, such as BOOL, BIRCH, and k-means++, even when the analyst can determine the correct number of clusters—a very difficult task from which users are spared by CLUBS+ .
1. Introduction
AC
The clustering challenge. Cluster analysis represents a fundamental and widely used method of knowledge discovery, for which many approaches and algorithms have been proposed over the years [5]. An incomplete list of methods includes partition-based (e.g., k-means [37]), density-based (e.g., DBScan [19]), hierarchical (e.g., BIRCH [50]), and grid-based (e.g., STING [49]) methods. The Email addresses:
[email protected] (Giuseppe M. Mazzeo),
[email protected] (Elio Masciari),
[email protected] (Carlo Zaniolo)
Preprint submitted to Elsevier
March 3, 2017
ACCEPTED MANUSCRIPT
AC
CE
PT
ED
M
AN US
CR IP T
continuing stream of clustering algorithms proposed over the years underscores the importance and difficulty of the problem, and the fact that the logical and algorithmic complexities of this many-facet problem have yet to be tamed and can still benefit from major research advances. Foremost among the remaining issues is that clustering algorithms do not live up to their claim of providing unsupervised learning: current algorithms are effective only when the user provides several parameters that she needs to determine via a difficult exploratory process. In this paper, we present CLUBS+ , that besides overcoming this major limitation by achieving unsupervised learning, it also improves the state-of-the art on clustering around centroids by achieving better speed, accuracy and robustness in the presence of noise. These improvements have been tested through an extensive set of experiments, which include those discussed later1 . In terms of speed of computation, a good standard for comparison is provided by the k-means algorithm [37], that aims at minimizing the distance of the samples from the centroid of their nearest cluster. Owing to its efficiency and logical simplicity, k-means became the most widely used clustering algorithm in practical applications. But k-means suffers from serious shortcomings: in particular, the algorithm does not assure repeatability of outcomes, since the results it produces depend on the initialization step in which the number of centroids and their initial placement must be specified. Much research work has in fact been published to reduce the variability of the results produced by k-means, and bringing it closer to the ‘unsupervised learning’ archetype by carefully selecting the k initial points [9, 13, 43]. This work has produced the widely used k-means++ algorithm [9] which we will thus employ in our comparative experiments. In addition to those k-means refinements, several algorithms have been proposed anew to cluster the points around centroids while avoiding the problems of k-means. Hierarchical clustering algorithms are of relevance for CLUBS+ , and come in two flavors: divisive and agglomerative. Divisive clustering methods, such as DIANA [35], start from single cluster and then recursively divide the cluster into smaller and smaller ones. In the agglomerative approach, instead, every point is initially a cluster and then pairs of clusters are merged until the desired number of clusters is reached. Among the most popular such algorithms we find BIRCH [50], CURE [24], ROCK [25] and Chameleon [34]. The hierarchical clustering approaches tend to produce clusters of high quality, but they lose out to other methods in terms of performance and scalability. In this respect, the hierarchical clustering algorithms BIRCH [50] and BOOL [48] represent remarkable exceptions, since, besides being accurate, they are fast and scalable to very large data sets. Several variants of the clustering problem have been studied in the last few years. For instance, the use of constraints (e.g., some points must/cannot belong to the same cluster) was shown to improve performances of hierarchical clustering algorithms [47]. The prob1 Additional experimental results are available in [1], where the CLUBS+ code is also available for downloading and testing
2
ACCEPTED MANUSCRIPT
CR IP T
lem of hierarchical ensemble clustering investigated in [36] entails deriving an optimal dendrogram (i.e., the tree representing the hierarchy of data) given different input dendrograms of the same data set. Since both the objectives and the assumptions used by these algorithms are different from those of CLUBS+ , our comparison will focus on algorithms such as BIRCH and BOOL, that have the same objectives as CLUBS+ (no constraints in data are assumed and the hierarchical approach is not aimed at yielding a dendrogram).
AC
CE
PT
ED
M
AN US
A High-Level Overview of CLUBS+ The first phase of CLUBS+ is divisive, since the original data set is recursively partitioned through binary splits into several clusters. The number of clusters produced by this phase can be larger than the final number produced by the algorithm, since their final number is determined by hierarchical/agglomerative step that is executed later in the algorithm. Thus, CLUBS+ combines the two hierarchical clustering approaches, i.e., the divisive and the agglomerative ones, and inheriting the high-quality traits of the hierarchical methods while avoiding their performance problems by exploiting the unique analytical properties of squared sum of distances. The divisive phase is followed by a first refinement step, which is followed by the agglomerative phase; then a second refinement step concludes the algorithm. In the two refinement steps, noise is separated from the rest of the data which is then partitioned into well-rounded high-quality clusters. The refinement steps represent a major improvement of CLUBS+ over its predecessor clubs [38] that used a hierarchical phase and an agglomerative phase, but did not use any refinements step and thus could not separate noise from data. Along with the refinement steps mentioned above, several major improvements were introduced during the last two years, producing an algorithm that is fast, stable, impervious to noise and highly accurate, besides being effective in a totally unsupervised mode, a quality that sets it apart from other clustering algorithms. In fact, since it can determine the number of clusters and their centroids with great speed and without any guidance by users, CLUBS+ can also be utilized to initialize algorithms such as k-means that require this information—a use of CLUBS+ that is also briefly described and experimentally tested in the paper. The paper is organized as follows. The formal definition of the problem and the mathematical properties that lead to our fast algorithm are discussed in the next section. Section 3 presents a detailed discussion of the various steps of the CLUBS+ algorithm. Then, Section 4 presents a proposal for improving the efficiency of the divisive step of the algorithm. An extensive experimental evaluation is presented in Sections 5 and 6. In these sections, we assess the performances of CLUBS+ on several data sets, evaluating its accuracy, speed, and robustness with respect to noise and dimensionality. Section 7 presents a brief overview of the huge existing literature about clustering [5]. Section 8 concludes the paper, except for the appendix where the formal properties declared in Section 2 are proven, a statistical test to prove the significance of the experimental results and the adopted clustering-quality measures are presented. 3
ACCEPTED MANUSCRIPT
2. Problem Definition
AN US
CR IP T
Given a set D = {p1 , . . . , pn } containing n objects, our objective is to derive a partition C = {C1 , · · · , Ck } for the objects in D, such that objects in the same cluster are maximally similar and objects in different clusters are minimally similar, according to a given similarity function. Each Ci will be called a cluster, whereas C will be called the cluster set. Consequently, each object p ∈ D is assigned to exactly one cluster. In this paper, we assume that objects can be represented as d-dimensional points of an Euclidean space, i.e., pi ∈ Rd (1 ≤ i ≤ n), whereby D will be treated as a multi-set of n d-dimensional points and their Euclidean distance will be used as the measure of similarity between pairs of points: the lower the distance, the larger the similarity. We will represent multidimensional points in boldfont; thus, if p is a ddimensional point, pi denotes its i-th coordinate. The Euclidean distance between point p and point q will be denoted as kp − qk, where v u d uX (1) kp − qk = t (pi − qi )2 i=1
M
In the following, we provide definitions and analytical properties that will be used through the rest of the paper. For sake of brevity, their proofs are presented in Appendix A.
ED
Definition 1. Let C be an individual cluster containing n d-dimensional points. The centroid (or center of mass) of C is the point c = (c1 , . . . , cd ), where ci is the average of the i-th coordinates of all points in C: ci =
X 1 × pi , (1 ≤ i ≤ d) n
(2)
p∈C
CE
PT
Definition 2. Let C be an individual cluster containing n d-dimensional points and let c be the centroid of C. The W CSS ( Within Cluster Sum of Squares) of C is defined as: W CSS(C) =
X
p∈C
kp − ck2 =
d XX (pi − ci )2
(3)
p∈C i=1
AC
If some aggregate information (i.e., the sum of coordinates and their squares for each dimension and the number of points) is stored for a cluster C, its W CSS can be efficiently computed exploiting the following proposition:
Proposition 2.1. Let C be an individual cluster containing n d-dimensional points. Then, d X Si2 W CSS(C) = Qi − (4) n i=1 4
ACCEPTED MANUSCRIPT
P P where, Si = p∈C pi and Qi = p∈C p2i respectively denote the sum of the i-th coordinates and the sum of the squares of i-th coordinates of the points in C.
i=1
CR IP T
Definition 3. Let C = {C1 , . . . , Ck } be a set of clusters. The W CSS of C is defined as follows: k X W CSS(C) = W CCS(Ci ) (5)
Definition 4. Let C = {C1 , . . . , Ck } be a set of clusters over a dataset D = {p1 , . . . , pn }, where the i-th (1 ≤ i ≤ k) cluster has ni points and centroid ci , and let m denote the centroid for the whole dataset D. The BCSS ( Between Clusters Sum of Squares) of C is defined as: k X
ni × kci − mk2
AN US
BCSS(C) =
i=1
(6)
The next proposition states that for any cluster set C on D (i.e., any partition of D) the sum W CSS + BCSS is equal to W CSS(D) (i.e., the WCSS of D as a single cluster) and it is therefore constant over different cluster sets. Proposition 2.2. For any arbitrary cluster set C over D we have that:
M
W CSS(C) + BCSS(C) = W CSS(D)
(7)
ED
We will next study what happens when a pair of clusters is merged (or a conversely a single cluster is split into a pair). Proposition 2.3. Let Ci and Cj be two distinct clusters in C = {C1 , . . . , Ck } and let Cij be the cluster set obtained from C by replacing Ci and Cj by their union Ci ∪ Cj . Then,
PT
∆W CSS(Cij ) = W CSS(Cij ) − W CSS(C)
CE
can be computed as follows: ∆W CSS(Cij ) =
ni · nj kci − cj k2 ni + nj
(8)
AC
where ni = |Ci | and nj = |Cj |; moreover ci and cj denote the respective centroids of Ci and Cj , whereby we can rewrite Equation 8, as follows: ∆W CSS(Cij ) =
2 d ni · nj X Si,k Sj,k − ni + nj ni nj
(9)
k=1
where Si,k (Sj,k ) is the sum of the k-th coordinate (i.e., dimension) of all the points belonging to the cluster Ci (Cj ). This equation can be used to compute the reduction of W CSS obtained when a cluster is split into a pair of clusters or, 5
ACCEPTED MANUSCRIPT
… …
CR IP T
…
…
Figure 1: The four Steps of the CLUBS+ Algorithm
AN US
conversely, the increase of W CSS when a pair of clusters is merged into a single one. We remark that Equation 9 is at the basis of the efficient strategy adopted by CLUBS+ during the divisive phase, and, to the best of our knowledge, it is new and not used by any previous clustering algorithm. Dual properties hold for BCSS defined in Equation 6. Therefore, keeping the same notation used in Proposition 2.3, above, we obtain that: Corollary 2.1. If Ci and Cj are two distinct clusters in C, each consisting respectively of ni and nj d-dimensional points, then: ∆BCSS(Cij ) = BCSS(C) − BCSS(Cij )
M
where ∆BCSS(Cij ) = ∆W CSS(Cij ) is given by Equation 8.
ED
3. The New Clustering Algorithm CLUBS+
AC
CE
PT
CLUBS+ introduces a fast hierarchical approach, that combines the benefits of the divisive and agglomerative approaches. Figure 1 illustrates the operations performed by CLUBS+ . The input dataset D is first partitioned into a set of rectangular blocks {B1 , . . . , Bt } by means of a top-down divisive procedure. Those blocks are then refined in the next step, whose result is a new partitioning of the points of D, {R0 , R1 , . . . , Ru }, where R0 contains points that are classified as outliers, and each of the remaining u sets represents a cluster. These u clusters are processed in the third step, through an a bottom-up agglomerative procedure, whose result is a new partitioning of the points in R1 ∪ . . . ∪ Ru into the sets A1 , . . . , Ak , where k ≤ u, obtained by possibly merging groups of clusters in {R1 , . . . , Ru } into one only cluster of {A1 , . . . , Ak }. Finally, the new clusters in output from the agglomerative step and the outliers R0 in output from the previous refinement step, are refined, thus obtaining the final set of well-rounded clusters, {C1 , . . . , Ck }, and the set of outliers C0 . In our running example of Figure 2, we show how the successive steps of CLUBS+ applied to the two-dimensional data distribution of Figure 2(a), produce the results shown in Figure 2(b–e). Thus, Figure 2(b), (c), (d), and (e) show the results produced, respectively, by the steps of Figure 1, which are described in detail in the following. 6
ACCEPTED MANUSCRIPT
5
9 B 1
C
B
C
3
F
H
I
K
D E
E 8
F
L
G
G
C I
D 7
E
B
10 4
D
A
A
J
CR IP T
A
F
2
G
12 11
6
(a)
H
H
M
(b)
(c)
A
A
B
B
C
C
E
AN US
D
D
E
F
F
G
G H
(d)
H
(e)
M
Figure 2: A two-dimensional data set with 8 clusters (a), its partitioning after the divisive step (b), the intermediate refinement (c), the agglomerative step (d), and the final refinement (e).
AC
CE
PT
ED
3.1. The Divisive Step The divisive step of CLUBS+ performs a top-down partitioning of the data set to isolate blocks whose points are as much as possible close to each other: this is equivalent to minimizing the W CSS of the blocks, an objective also pursued by k-means and many other algorithms. Since finding the partition which minimizes a measure such as the W CSS is NP-hard even for two-dimensional points [40], CLUBS+ uses a greedy approach where splitting hyperplanes are orthogonal to the dimension axes, and blocks are recursively split into pairs of blocks optimizing specific clustering criteria. The procedure used in this step is shown in Algorithm 1. Given n ddimensional points as input, the algorithm begins with a single cluster S containing the whole data set, that is entered into a priority queue, Q; thus, S is split into a pair of clusters, which thus replace S in Q, and this splitting and replacement process is repeated throughout the divisive phase. The clusters in Q are ordered according to their W CSS: the larger the W CSS, the higher the priority. Prioritizing the splitting of the blocks according to their W CSS reduces the differences between the W CSS of intermediate blocks providing the best conditions for the application of the CH-index discussed in the next section. Furthermore, it was proven that when a greedy strategy is used for driving the construction of a hierarchical binary space partition with the objective of min7
ACCEPTED MANUSCRIPT
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
CR IP T
2 3
Procedure divisiveStep(D) Input: D = {p1 , . . . pn }: a data set of n points Output: B = {B1 , . . . Bt }: a partitioning of D into t blocks begin S ← a new cluster containing all the points in D; B ← {}; // an empty set of blocks Q ← {S}; // a queue of blocks containing, initially, only S t ← 1; // the current number of blocks wcss ← W CSS(D); // the WCSS of the whole dataset bcss ← 0; // the BCSS of one only cluster is 0 n ← |D|; // the number of points of the dataset ch ← ⊥; // the CH-index of the current partition, undefined for one block maxCh ← 0; // the maximum value that the CH-index gets during the iterations while Q is not empty do // at least one block is eligible for split S ← Q.pop(); hd, p, ∆i ← computeBestSplit(S ); if isSplitEffective(S, ch, maxCh, wcss, bcss, t, n, ∆) then hS1 , S2 i ← makeSplit(S, d, p); Q.push(S1 ); Q.push(S2 ); wcss ← wcss − ∆; bcss ← bcss + ∆; // see Prop. 2.2 t ← t + 1; bcss × n−t ch ← wcss t−1 ; if ch > maxCh then maxCh ← ch; end else B.add(S) // B is added to the list of blocks that will be returned end end return B; end
AN US
1
M
Algorithm 1: The algorithm for the divisive step
PT
ED
imizing the overall variance with a limited number of splits, choosing at each iteration the block with the largest variance leads to the best results [21] (which is also quite intuitive). Thus, at each step, the top cluster in Q is evaluated for splitting. If the splitting of the top cluster S is considered effective, according to a quality index, namely the CH-index, that will be discussed in the following, S is split into a pair of sub clusters that replace it in Q; otherwise S is added to the list of the final cluster-set produced by the divisive step (B in Figure 1). We next discuss the two crucial points of this divisive phase, which are:
CE
1. the computation of the the best split (computeBestSplit); 2. the evaluation of the effectiveness of the split (isSplitEffective).
AC
3.1.1. Computing the Best Split The criterion which guides the choice of the best split is the minimization of W CSS. Therefore, given a cluster C, the objective is to split it into two clusters such that ∆W CSS in Equation 9 is maximized. Since we split our blocks using hyperplanes that are orthogonal to some dimension, each possible split is defined by a pair hk, hi where k is the splitting dimension and h is the splitting position. The split hk, hi partitions the points of a cluster C into the dichotomy {Ci , Cj }, where Ci contains all the points of C whose k th coordinate is less than or equal to h, and Cj contains the remaining points. 8
ACCEPTED MANUSCRIPT
p∈S∧pi =h
CR IP T
The processing of a block S considered for splitting begins with the computation of its cumulative marginal distributions on each dimension, via a linear scan of the data. The marginal distributions for each dimension i are computed by grouping data with the same value of the i-th dimension, and summing (marginal sum) or counting (marginal count) the points in the same group. In more details, given a set of d-dimensional points S = {p1 , . . . , pn }, the marginal sum of the i-th dimension of S can be seen as a function msi : R → Rd defined as X msi (h) = p and the marginal count as a function mci : R → N defined as
AN US
mci (h) = {p : p ∈ S ∧ pi = h}
Thus, the cumulative marginal distribution at coordinate h of dimension i is simply computed by adding the marginal distribution sum of points with i-th dimension’s coordinate not greater than h, i.e. X cmsi (h) = p p∈S∧pi ≤h
and
M
cmci (h) = {p : p ∈ S ∧ pi ≤ h}
CE
PT
ED
The values of each of these functions can be computed by scanning the data in S and progressively updating vectors (or hash maps) representing the above functions. Once we have precomputed the cumulative marginals, it is possible to compute in constant time the ∆W CSS yielded by a split performed at each position using Equation 9. The best split can be thus computed by finding the pair dimension/position that maximizes ∆W CSS. Indeed, In Sec. 4 we will show that this computation can be optimized by reducing the number of positions that need to be checked in many dimensions.
AC
3.1.2. Evaluating the Effectiveness of Splits A difficult challenge of any divisive algorithm is to decide when to stop because any further splitting of current clusters does not improve the overall quality of the current cluster set. Many criteria and indexes were proposed in the literature for estimating the quality and naturalness of a cluster set [8]. We have completed an extensive evaluation of the effectiveness of these criteria in predicting the quality of clusters, and concluded that the Calinski-Harabasz index, CH-index for short, is the most suitable to our needs [12]. In fact, this index can be computed efficiently and on our testbed of the 648 experiments discussed in Section 5, the CH-index proved to be a reliable quality predictor.
9
ACCEPTED MANUSCRIPT
40
30
i
k valley
20
j
down-hill
up-hill
0 0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
CR IP T
10
17
18
19
Figure 3: Example of marginal distribution satysfing the valley criterion
The CH-index for a set of k clusters C = {C1 , . . . , Ck } over a dataset D, with |D| = n, is defined as follows: BCSS(C) n−k × W CSS(C) k−1
AN US
CH-index(C) =
(10)
AC
CE
PT
ED
M
Thus, the overall objective of the divisive process is to maximize the CH-index. Although the split that maximizes the ∆W CSS at each iteration, chosen as described in Section 3.1.1, also maximizes the variation of the CH-index, this variation could be negative. In fact, the CH-index is the product of two factors: BCSS(C) n−k W CSS(C) and k−1 . At every step, the former increases, whereas the latter 2 decreases . In an ideal sequence of k − 1 splits, which perfectly separates all the actual k clusters of a data set, the curve of the CH-index vs. the number of splits would reach a local maximum, corresponding with the global maximum. Due to the constrained partitioning scheme adopted by CLUBS+ , there are many situations where such an ideal sequence of split does not exist. An example is depicted in Figure 2(b), where no initial split, according to our partitioning scheme, can be performed without splitting a cluster. In our example, the first split separate the points of cluster B. This is not a big issue for CLUBS+ , since portions of the same cluster that are separate into different blocks now, can be be rejoined by later steps. However, CLUBS+ must be prepared to handle cases in which the sequence of binary splits leading to the largest CH-index contains a few intermediate steps where the index decreases rather than increasing. Algorithm 2 reports the procedure that CLUBS+ uses to evaluate the effectiveness of a split. First of all, since the CH-index is not defined for k = 1, we can use the variation of CH-index for evaluating the effectiveness of a split only if k > 1. Assuming that k > 1, the strategy adopted is the following (lines 10–18 of Algorithm 2): (i) when the CH-index increases, then the split is considered to be effective (lines 12–14), but (ii) the split is not considered effective CH-index if it drops to a level that is below the 70% of the maximum it had 2 Here, we can note the importance of choosing the block with the larger W CSS at each step, since it is also the one that can maximize the ∆W CSS, and, thus, the increase of the former factor.
10
ACCEPTED MANUSCRIPT
6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
CE
41
CR IP T
5
AN US
4
M
3
ED
2
Procedure isSplitEffective(S, ch, maxCh, wcss, bcss, k, n, ∆) Input: S: the cluster candidate for split ch: the CH-index of the current cluster set maxCh: the maximum got by CH-index so far wcss: the WCSS of the current cluster set bcss: the BCSS of the current cluster set k: the cardinality of the current cluster set n: the cardinality of data set ∆: the WCSS reduction yielded by the candidate split Output: a boolean value stating whether the split is effective begin if k > 1 then // the CH-index for 1 cluster is not defined bcss+∆ × n−k−1 ; newCh ← wcss−∆ k if newCh ≥ ch then return true; end if newCh < 0.7 × maxCh then return false; end end // try to find local discontinuities using the valley-criterion for d ← 1 : S.dimensionality do m ← smoothMarginalCount(S, d); if S.parent <> null ∧ max(m) < avg(S.parent.marginalCountd ) then continue; end gap ← avg(m) × 0.1; max ← m[0]; p ← 1; while p < m.length ∧ max − m[p] < gap do if m[p] > max then max ← m[p]; end p ← p + 1; end if p == m.length then continue; end // a downhill satisfying the criterion was found min ← m[p]; p ← p + 1; while p < m.length ∧ m[p] − min < gap do if m[p] < min then min ← m[p]; end p ← p + 1; end if p ¡ m.length then return true; // an uphill satisfying the criterion was found end end return false; // no dimension has satisfied the valley-criterion end
PT
1
42 43 44
45
46
AC
47 48 49
Algorithm 2: The algorithm for evaluating the effectiveness of the split
11
ACCEPTED MANUSCRIPT
CR IP T
reached in the previous iterations3 (lines 15–17). Finally, (iii) if k = 1 or the drop of the CH-index does not take the CH-index below 70% of the maximum it had reached in previous iterations, then the decision on whether to split or not to split is taken according to the valley-criterion (lines 19–47). The rationale is that, even if a split yields an acceptable reduction of the overall CH-index, the split is still effective if it separates two sub-regions of S belonging to different clusters. Such two regions are intuitively denser than the region separating them, and this situation can be detected by finding a ‘valley’ in the marginal count distribution, i.e., a significant decrease in the count (‘downhill’) followed by a significant increase (‘uphill’), as depicted in Figure 3. Being m the array representing the marginal count for a given dimension, we assume that the valley criterion is satisfied if and only if there exist three indexes i, j, k, i < j < k, such that
AN US
min(m[i], m[k]) − m[j] > 0.1 × avg(m)
ED
M
where avg(m) is the average of the values in m4 . Furthermore, the maximum value of m must be at least equal to the average value of the parent block of b (unless b is the root of the partitioning tree). This avoids generating clusters in regions of very low density. The complexity of evaluating the valley criterion is linear w.r.t. the size of marginal distributions, and thus, it does not increase the complexity of the splitting algorithm, given that marginals are also scanned for finding the best split. However, now the marginals must be scanned with a different purpose: finding a local discontinuity. We remark that the dimension/position of a local discontinuity does not necessarily coincides with the dimension/position of the best split, thus, all the marginals must be rescanned. From the cost of the operations performed at each split, we can now derive the overall cost of the complete divisive phase.
CE
PT
Complexity of the Divisive Step The initialization of the divisive step requires a scan of the whole data set and performs operations with constant complexity for each point whereby the overall complexity of this phase is O(n), in the number of points n.5 Assuming that k clusters are produced, their generation requires 2 × k − 1
AC
3 Our experiments show that correct results are obtained even if we use more stringent criteria, e.g. 80%. But since the successive agglomerative phase compensates for occasional over-splitting, we use 70%, since this introduces an additional safety margin, without incurring in the higher computational overhead that a more radical splitting could cause. 4 Indeed, we perform a smoothing of the marginal count, in order to remove local variations due to noise 5 Here we focus on the complexity w.r.t. to the number of points, since this tends to dominate by several orders of magnitudes other measures, such as the number of dimensions and the number of clusters. The impact of these other parameters is reviewed in Section 3.5 where we address the combined complexity of the various phases of CLUBS+ .
12
ACCEPTED MANUSCRIPT
PT
ED
M
AN US
CR IP T
iterations6 . At each iteration, the pop operation is executed with complexity O(log k); moreover, the amortized complexity of push operation is O(1) (since the W CSS of clusters decreases as new splits are done, clusters with lower and lower priority are added to Q). The procedure which computes the best splits requires a linear scan of all the data in the cluster, thus its complexity is O(n). Also the procedure which evaluates the effectiveness of the split has O(n) complexity (the number of iterations needed to evaluate the effectiveness of the split is equal to the number of distinct coordinates, which is usually much smaller than n). Since in practical cases k n, the dominant operation of each iteration is the computation of the best split. Thus, we can say that the complexity of the divisive step is O(n × k). However, if we partition the data belonging to each cluster between the two clusters resulting from the split, the number of data to be accessed decreases, as clusters become smaller. In typical cases, the bulk of the data tends to be evenly partitioned between the pair of new clusters produced by the split that maximizes the ∆W CSS of Equation 8 7 . As a result, the binary trees produced are nearly balanced and have a depth that is logarithmic in the number of clusters generated: i.e., O(log k). Since the complexity for computing the best split for the nodes at each level is O(n), we can thus estimate that the complexity of the divisive step is O(n × log k) on the average. In the worst case, however, the complexity of this step remains O(n × k), matching that of the next phase of the CLUBS+ algorithm, discussed below. The computation of the cumulative marginal distribution can be performed through a linear scan of the data inside the cluster. Furthermore, after choosing the best split, it is necessary to compute the cumulative marginal distribution for just one of the two blocks obtained from the split8 , since the cumulative marginal distributions for the other block can be obtained by difference. The size of the cumulative marginal distribution never exceeds the number of points and it is often smaller than the size of the data set, since many points have the same coordinate in a given dimension9 . Furthermore, it will be shown in Sec. 4, that the number of positions to be tested can be reduced even further while retaining the ability to find the best split.
CE
3.2. The Intermediate Refinement
AC
The divisive phase has partitioned the overall space into (A) several blocks containing clusters, and (B) zero or more blocks that only contain outliers and
6 k clusters are generated by k − 1 splitting steps that replace a cluster by a pair; then, the failure of splitting conditions is tested for each of these k clusters. 7 The factor ni ·nj in the equation tends to avoid splitting into pairs of clusters that contain ni +nj widely different numbers of points. 8 Clearly, it is better to compute the marginal distribution for the block containing less points. 9 This may not be the case for continuous values. However, a quantization can be easily performed without compromising the quality of the results.
13
ACCEPTED MANUSCRIPT
noise points. We also see that blocks of type A will only contain a single cluster since blocks containing several clusters were split according to the “when in doubt split” philosophy used in the divisive phase.
4 5 6 7 8 9 10 11 12 13 14 15
CR IP T
2 3
Procedure intermediateRefinement(B) Input: B = {B1 , . . . , Bt }: a set of t blocks S Output: R = {R0 , R1 , . . . , Ru }: a clustering of the points in ti=1 Bi begin o ← identifyOutlierBlocks(B); C ← {}; O ← {}; for i = 1 : t do if o[i] then O.add(Bi ); else C.add(Bi ); end end R ← reassignPoints(C, O); return R; end
AN US
1
Algorithm 3: The algorithm of the intermediate refinement step
3 4 5 6 7 8 9 10 11 12
CE
13
ED
2
Procedure identifyOutlierBlocks(B) Input: B = {B1 , . . . , Bt }: a set of t blocks Output: o: an array of t boolean values, where oi = true ⇔ Bi is classified as outlier block begin o ← an array [o1 , . . . , ot ] of t boolean; den ← an array [d1 , . . . , dt ]containing the densities of blocks in B; v ← an array [v1 , . . . , vt−1 ]containing t-1 numbers; sort(den); for i ← 1 to t − 1 do v[i] ← den[i + 1]/den[i]; end vt ← average(v ) + standardDeviation(v ); vi ← 1; while vi < t ∧ v[vi] ≤ vt do vi ← vi + 1; end dt ← den[vi]; // the maximum density an outlier block may have for i ← 1 to t do o[i] ←density(Bi )≤ dt ∧ aroundCentroidDensity(Bi )< 2×density(Bi ); end return o; end
PT
1
M
Thus, in this intermediate refinement step, whose pseudo-code is reported in Algorithm 3, CLUBS+ seeks to realize the objectives of (I) identifying the blocks that contain only outliers, and (II) generating well-rounded clusters for the remaining blocks. Task (I) is performed by using the observation that (a) the density of blocks containing only outliers is low, and (b) the density of such blocks is rather uniform because of the random nature of noise and outliers.
14 15 16
17 18
AC
19
Algorithm 4: The algorithm for identifying outlier blocks
The pseudo-code is reported in Algorithm 4. CLUBS+ checks condition (a) first by sorting the densities of blocks and detecting jumps in density between blocks that are contiguous in that ordering. That is, if den is an array containing the density values of the t blocks yielded by the divisive step, sorted in ascending order, we compute an array v with t−1 values, such that v[i] = den[i+1]/den[i]. 14
CR IP T
ACCEPTED MANUSCRIPT
(a)
(b)
Figure 4: A block containing only outliers (a) and a block containing a cluster (b).
PT
ED
M
AN US
Then, if vi outlier-blocks are present, the first vi − 1 values of v will be approximately equal to 1, while v[vi] will be ‘quite large’. Through a number of experiments, we have determined that a good threshold for value of v to be considered ‘quite large’ is provided by average value of v plus its standard deviation. Thus, with vi the smallest value index for which the v-value exceeds this threshold, we classify all blocks with density above dt = den[vi] as cluster blocks. Then, we test the other blocks using condition (b), i.e., we test the blocks that are suspected to be outlier blocks because of their low density. This additional verification step is needed because large density differences between cluster blocks cannot be always excluded (e.g., because of large differences in the volume of those blocks). Therefore, a small hypercube is taken around the centroid of each block being checked for condition (b), as depicted in Figure 4, and if the density in the hypercube is less than twice the density of the whole block, this is classified as an outlier block (Figure 4(a)); otherwise it is classified as a cluster block (Figure 4(b)). The size of the hypercube used for this check is such that, for each dimension, the edge of the hypercube is 1/10 of that of the whole block on the same dimension. The rationale of this the criterion, which was extensively tested and proven to be effective in several data sets containing white noise with different density (Section 5), is the following:
AC
CE
• in an outlier-block, where data are randomly distributed, it is very unlikely that the density around centroid is twice the global density; • in a cluster-block, the density around centroid is very likely to be twice the global density.
For instance, these tests applied to our running example in Figure 2(b) have classified blocks “J”, “K”, “L” and “M” as outlier-blocks. Having thus identified all the cluster blocks, we can now proceed with task (II) and materialize the centroids of their clusters and assign each point to the cluster with the nearest centroid. To this end, we use a weighted function of the distance between points on each dimension. Specifically, we measure
15
ACCEPTED MANUSCRIPT
the distance of a point p = (p1 , . . . , pd ) from a cluster-blocks C with centroid (c1 , . . . , cd ) as 2 d X kpi − ci k i=1
ri
CR IP T
dist? (p, C) =
5 6 7 8 9 10 11 12 13 14 15 16 17 18
CE
19
M
3 4
ED
2
Procedure reassignPoints(C, O) Input: C = {C1 , . . . , Ck }: a set of k cluster blocks O = {O1 , . . . , Op }: a set of p outlier blocks Output: R = {R0 , R1 , . . . Rk }: a set of blocks, where R0 is the only block containing outliers begin R ← a set {R0 , . . . , Rk } of k+1 empty clusters; foreach B ∈ C ∪ O do N B ← computeNearbyBlocks(B, C);// the indices of the blocks in C from which B can be reached foreach p ∈ B do c ← 0; dmin ← 1; foreach j ∈ N B do d ← dist? (p, Cj ); if d ≤ dmin then dmin ← d;c ← j; end end addPointToCluster(Rc , p); end end return R; end
PT
1
AN US
where ri is the radius of the cluster along the i-th dimension, which is estimated by the usual 3×σ criterion [50] used to separate core points from outliers: r W CSSi (C) ri = 3 · n where n is the number of points inside C and W CSSi (C) is the variance of the i-th coordinates of the points in C. Observe that dist? (p, C) = 1 defines an ellipsoid whose center is the centroid of C and whose radius on the i-th dimension is ri . Therefore, we can now classify each point in our sample set as either belonging to a particular cluster or as an outlier by assigning each point p to the cluster C which minimizes dist(p, C), provided that this minimum distance is less than or equal to 1; otherwise p is classified as outlier.
20
Algorithm 5: The algorithm for reassigning points to clusters
AC
This step, whose pseudo-code is reported in Algorithm 5, produces wellrounded clusters that restore the natural clusters that might have been shavedoff by rigid partitioning scheme employed in the divisive step. For instance, the block of data in Figure 2(b), which contains the points of the cluster“E” in Figure 2(a), contains also some points of the original cluster “F” which is now reassembled around its centroid. The fact that CLUBS+ uses ellipsoids instead of spheres offers several advantages. Indeed, even if assume that natural clusters are of spherical shape, this is no longer true when we stretch or shrink some dimensions to arrange the 16
ACCEPTED MANUSCRIPT
CR IP T
data in a hypercube as required by many clustering algorithms; this simple step could reduce the effectiveness of any algorithm that is designed to find spheres. However, a stretching or shrinking along any dimension, will turn spheres into ellipsoids and recast the problem into the domain where CLUBS+ is most effective. Even when all natural clusters are spherical, k-means has issues when clusters of different radii are in close proximity, since points belonging to the cluster with the larger radius are often assigned to the nearby cluster of smaller radius. CLUBS+ does not suffer from this problem, since each point is assigned to the centroid of the cluster that minimizes its distance relative to the radius of the cluster.
PT
ED
M
AN US
Complexity of the Intermediate Refinement Algorithm 4 has complexity O(n), where n is the size of data, since, at most, the data of all the blocks will be scanned for computing the density around centroids. Instead, the complexity of Algorithm 5 is O(n × k), since for each of the n points of the data set, the distance from at most k clusters must be found, and the distance between a point and a cluster can be computed in constant time. In practice, this step is expedited by exploiting the existing partitioning of data into blocks. Specifically, before processing the points of each block Bi , we first compute the set of nearby blocks, denoted N B(Bi ), that can absorb points of Bi . To this end, we consider every cluster block Cj 6= Bi , and we find the point bij on the border of Bi that is nearest to the centroid of Cj ; then, we add Cj to N B(Bi ) if dist? (bij , Cj ) ≤ 1. If Bi is a cluster block we also include it in N B(Bi ). Finally, we also add Cj to N B(Bi ) if Bi is an outlier block and its range contains the centroid of Cj . This situation is not possible after the divisive step, but has to be considered in the final refinement (see Algorithm 7), where the outliers in output from the intermediate refinement are given as input to Algorithm 5 as one only block, that could possibly span the whole data domain. Then, it is easy to see that for each point in Bi only its distance to centroids of blocks N B(Bi ) needs to be computed in order to assign it to the cluster with least-distance centroid. While this is practical optimization, the worst case complexity of the reassignment is still O(n × k), which determines the overall complexity of the refinement steps, i.e.: O(n × k).
CE
3.3. The Agglomerative Step
AC
In the agglomerative step, the clusters produced by the previous step are merged whenever this improves the overall clustering quality. The process of merging is symmetric to that of splitting: at each step we find the pair of clusters whose merge produces the least increase of W CSS and, if CH-index resulting from their merge increases, the merge is actually performed by replacing the two clusters by their union. If we determine that merging these two clusters will decrease the CH-index, the merge is not performed and the whole agglomerative phase ends, since no merging of other cluster pair can improve CH-index, as proven in the following proposition.
17
ACCEPTED MANUSCRIPT
CR IP T
Proposition 3.1. Let C be a cluster set over D, with cardinality greater than 1, and let C1 and C2 be the clusters whose merge produces the least increase of W CSS(C). If the merge of C1 and C2 produces a decrease of the CH-index of the cluster set, then no pair Ci and Cj in C exists such that their merging increases the CH-index.
4 5 6 7 8 9 10 11 12 13 14 15 16 17
CE
18
M
3
ED
2
Procedure agglometariveStep(R) Input: R = {R1 , . . . , Ru }: a set of u clusters Output: A = {A1 , . . . , Ak }: a set of k clusters (k ≤ u) begin A ← R; if u ≤ 2 then return A; end Q ←createQueueWithAdmissiblePairs(A); // a queue of pairs of clusters wcss ← W CSS(R); // the WCSS of the current cluster set bcss ← BCSS(R); // the BCSS of the current cluster set k ← |R|; // the current number of clusters S n←| k i=1 Ri |; // the size of clustered data bcss × n−k ch ← wcss k−1 ; // the CH-index of the current cluster set hC1 , C2 , ∆i ← Q.pop(); bcss−∆ × n−k+1 newCh ← wcss+∆ k−2 ; while newCh ≥ ch do M ← makeMerge(C1 , C2 ); k ← k + 1; removeElement(A, C1 ); removeElement(A, C2 ); addElement(A, M ); updateQueue(Q, M, C1 , C2 ); k ← k − 1; if k==2 then break; end hC1 , C2 , ∆i ← Q.pop(); bcss−∆ newCh ← wcss+∆ × n−k+1 k−2 ; end return C; end
PT
1
AN US
We remark that we do not impose any other constraint on the choice of pairs of clusters to be merged. This is unlike the divisive step, where clusters are bound to be hyper-rectangles and each split yields two hyper-rectangles whose union is the hyper-rectangle corresponding to the split node: in the agglomerative step there is no constraint on the shape that a cluster resulting from a merge can have. In practice, however, the only clusters considered for merging are the contiguous ones, since their merge yields a smaller increase of the W CSS (i.e., a larger increase of the CHindex). For example, Figure 2(d) shows that the cluster “B” is obtained as the merge of the clusters “B” and “I” of Figure 2(c). The pseudo-code for this step is reported in Algorithm 6. The procedure takes as input a set of u clusters R, and initialize the result A with the clusters in R. Then, a (heap-based) priority queue Q is created by adding pairs of clusters.
19
20
21
22 23
AC
24
25 26 27 28
Algorithm 6: The algorithm of the agglomerative step
We add to the priority queue only those pairs of clusters that overlap, i.e., the distance from their centroids is not larger than the sum of their radii. This 18
ACCEPTED MANUSCRIPT
AN US
CR IP T
reduces significantly the number of pairs to be processed, while making still possible to effectively rejoin points from the same cluster that had to be separated during the divisive step (e.g., “B” and “I” in Figure 2(b)). For each pair considered, we compute the ∆W CSS using Equation 9, and we use this value to prioritize the pairs in our queue: the smaller the ∆W CSS, the larger the priority. Therefore, if clusters Ci and Cj are merged into the cluster M , first we remove Ci and Cj from the set of clusters A and we add the new cluster M . Then, we remove all the pairs containing either Ci or Cj from Q, and we add all the admissible pairs containing M combined with any other cluster in A. The priority queue can be updated in time O(log u) for each pair removed, and, since the number of pairs to be removed is O(u), the complexity for updating the priority queue is O(u × log u). This is the dominant operation for each iteration.
M
Complexity of the Agglomerative Step Since the number of iterations is O(u), and the dominant operation at each iteration is the removal from the priority, whose complexity is O(u · log u), as discussed above, the complexity of the agglomerative step is O(u2 ×log u), where u is the number of clusters in input. We remark that in most practical cases the number of iterations required is small. For instance, in Figure 2(d) only two clusters are merged, that is, one iteration is performed. 3.4. The Final Refinement
AC
CE
PT
ED
The agglomerative step often produces clusters of slightly irregular shapes due to the approximate and greedy criteria used in the previous steps. For instance, Figure 2(d) shows that cluster “B” has been obtained by merging clusters “B” and “I” of Figure 2(c). The evident asymmetry is due to the fact that the block “K” of Figure 2(b) has been considered an outlier-block during the intermediate refinement, and the points at its upper-left corned have been asymmetrically adsorbed (due to their different position and radius) by blocks “B” and “I”. The final refinement step improves the quality of clusters and also identifies the great majority of the outlier points. A complete detection of outliers is not possible since points that fall in the thick of a cluster must be assigned to the cluster, even if they are the product of noise. However, our experiments show that the accuracy achieved by CLUBS+ is excellent for both cluster points and outlier points. Algorithm 7 is used to perform final refinement, and it is basically a simplified version of the one used in the intermediate refinement. In fact, in the intermediate refinement the density of blocks was evaluated in order to separate them into cluster-blocks and outlier-blocks, an operation that is no longer needed in this last step.
19
ACCEPTED MANUSCRIPT
2 3 4 5 6 7 8
Procedure finalRefinement(R0 , A1 , . . . , Ak ) Input: R0 : a set of points previously classified as outliers A1 , . . . , Ak : a set of clusters Output: C = {C0 , . . . , Ck }: the k final clusters and the set of final outliers (C0 ) begin C ← {A1 , . . . Ak }; O ← {R0 }; R ← reassignPoints(C, O); return R; end
CR IP T
1
Algorithm 7: The algorithm of the final refinement
Complexity of the Final Refinement The complexity of the final refinement step is the same as that of Algorithm 5, needed to reassign points to clusters, which is O(n × k).
AN US
3.5. CLUBS+ : Overall Complexity We can now combine and compare the time complexities of the four steps of CLUBS+ and conclude the following:
Proposition 3.2. Algorithm CLUBS+ works in time O(n × k) where n is the number of points and k is the number of clusters automatically detected after the divisive step.
M
Proof. The complexity of CLUBS+ is determined by the steps with the largest complexity, i.e. the refinement steps. 2
AC
CE
PT
ED
For what concerns the spatial complexity, during the divisive step CLUBS+ materializes the marginal distributions, which, in the worst case, could have size equal to the input data. This situation would arise if no pair of points shares the same coordinate on any dimension. Thus, the space complexity of CLUBS+ is O(n), where n is the number of points in the input data set. Finally let us return to the issue of dimensionality, which impacts the divisive phase as follows: for each iteration of this phase, d marginals have to be computed, and the computation of each marginal requires summing d dimensional points. Thus, the complexity at each divisive iteration is quadratic in d, whereby the complexity of the divisive phase is O(n × d2 ). Thus, for large d values this will dominate the O(n × k) cost of the refinement phase. However, it is also known that clustering algorithms that are based on Euclidean distance might not produce reliable results for higher dimensionalities [4], whereas we seek a clustering algorithm which performs with the flawless reliability and repeatability that users demand from an unsupervised algorithm. Therefore in our experiments we focused on values of d in the range between 2 and 12, and verified CLUBS+ reliability and repeatability for values of d in that range. In those experiments, discussed in the next section and in [1], the divisive phase is typically 20% of the overall execution time and never exceeds 30% of that. With the cost of the divisive phase dominated by that of the other phases, the overall cost of CLUBS+ scaled linearly in the number points and in the number of dimensions. 20
CR IP T
ACCEPTED MANUSCRIPT
Figure 5: Example where clubs fails to detect the correct number of clusters
3.6. clubs, CLUBS+ and Beyond
AC
CE
PT
ED
M
AN US
The divisive phase of CLUBS+ is similar to other clustering algorithms, and in particular to BIRCH, as discussed in Section 7. But CLUBS+ operates in multiple stages, and this is also the main reason for it being effective without any user guidance. In fact we have that the (i) techniques used at one stage compensate for the shortcomings of the previous stages since (ii) well-tuned criteria are used to assure the quality of the results produced at each stage. In particular, the benefits of (i), are illustrated by the fact that a certain degree of over-splitting in the divisive phase can be tolerated since the successive stages that include an agglomerative step can then compensate for that. Using multiple phases, and in particular combining divisive and agglomerative phase, is an idea that CLUBS+ has taken from its predecessor clubs [38], which proved its usefulness in several applications [39]. Yet CLUBS+ provides major improvements with respect to clubs, including the addition of the two refinement steps that achieve (i) the detection of outliers, which is critical in distinguishing regions of high-density noise from low-density clusters, and (ii) the creation of well-rounded ellipsoidal clusters around the centroids. The first point can be illustrated by Figure 5 that visualizes the clustering results obtained on synthetic data set with noise level of 8%, where clubs which partitions the data in nine blocks each containing a cluster. On this very data set, CLUBS+ basically generates the same blocks as clubs, but it then recognizes the bottomright block as a noise block for which no cluster is generated. Of course, in other situations, e.g., lower noise levels, the agglomerative phase of club could merge this noise block with the neighbor cluster and produce the correct number of clusters. But then the centroid of the neighbor would move and result in a lower accuracy.
Modifying the behavior of CLUBS+ . The ability to operate with speed and accuracy in an unsupervised mode represents a critical advantage of CLUBS+ over previous algorithms. However, when faced with complex and unusual situations, expert analysts might want to exercise more control, and e.g., explore if a larger or smaller number of clusters can produce results that better match their application requirements and the data they have at hand. CLUBS+ can be easily 21
ACCEPTED MANUSCRIPT
modified for this purpose by introducing an α parameter in Equation 10, which becomes CH-indexα (C) =
n−α×k BCSS(C) × W CSS(C) α×k−1
AN US
CR IP T
and varying it as needed to generate fewer or more clusters. More specifically, by replacing the two occurrences of k in Equation 10 with α × k, where α is a positive real number, the agglomerative step will stop earlier or later. Thus, with α < 1 we obtain more clusters and with α > 1 we obtain fewer clusters. Naturally, when seeking a larger number of clusters, the stopping condition in the divisive step should also be modified to allow for larger drops of the CH-index from its max (i.e., allow it to fall below 70%). While extensions of CLUBS+ to support fuzzy clustering are outside scope of this paper, we observe the results of the first three phases of CLUBS+ provide an effective initialization for the centroids of K-means, which can also be used as centroids for its variation for fuzzy clustering, such as c-means algorithm [10, 17]. 4. Improving the Efficiency of the Divisive Step
ED
M
As mentioned in Sec. 3.1.1, after computing the cumulative marginal distributions (one for each dimension) of a block of data, we (i) compute the maximum ∆W CSS (Equation 9) in each dimension and (ii) we use the largest of these maxima to determine on which axis to split, and at which point on that axis the split should be performed. The speed of the whole process can be greatly improved by interleaving (i) and (ii) and estimating the maxima in each dimension via progressive sampling. While the sampling progresses we view the current maximum in each dimension as the position of competitors in a race described next.
AC
CE
PT
1. In the first iteration, we compute the ∆W CSS for a 5% sample of all the possible positions each dimension. At this point, all dimensions are still viewed as competing. 2. In each successive iteration, we add in an additional 5% sample of points for each dimension that is still competing, and recompute its maximum ∆W CSS. Now we classify as uncompetitive each dimension where (a) the maximum ∆W CSS value has not changed, and (b) it is less than 90% of the current overall maximum. Then we repeat this step for the positions that are still competitive.
Thus, we perform the complete computation of ∆W CSS over each position only for those dimensions where ∆W CSS keeps growing or is at least 90% of the overall maximum ∆W CSS, as per the detailed description shown in Algorithm 8. Since the algorithm requires taking progressively larger and larger samples from the arrays that keep the cumulative marginal distributions in each dimension, our first step consists in shuffling these arrays so that every sub-array can 22
ACCEPTED MANUSCRIPT
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
CR IP T
2
Procedure computeBestSplit(S ) Input: S = {p1 , . . . pn }: a block of data of n d-dimensional points Output: hd, p, ∆i: the best splitting dimension and position, and its ∆W CSS begin C ← cumulativeMarginals(S ); // array of d cumulative marginal distributions C ← shuffle(C ); I ← {1, . . . , d}; // the set of the competitive dimensions bs ← [⊥, . . . , ⊥]; // array of d pairs hposition, ∆WCSSi ∆s ← 0.05; // the sampling step for i ∈ I do bs[i] ←computeMaxReduction(C, i, 0, ∆s ); end s ← [∆s , . . . , ∆s ]; // d values representing the portion of tested positions for each dimension hd? , p? , ∆? i ←maxDelta(bs); while max{s} < 1 ∧ I is not empty do bs0 ← [⊥, . . . , ⊥]; // array of d pairs hposition, ∆WCSSi foreach i ∈ I do if bs0 [i].∆ ≤ bs[i].∆ && bs[i].∆ < 0.9 × ∆? then I.remove(i); else bs[i] ← bs0 [i]; end s[i] ← s[i] + ∆s ; end ? hd , p? , ∆? i ←maxDelta(bs); end return hd? , p? , ∆? i; end
Algorithm 8: The algorithm for efficiently computing the best split
be considered a random sample of the whole array. Then, at each iteration we process the sub-array starting at the first unprocessed position and covering a new 5% portion of the whole array. This sub-array is processed for each dimension that is still competitive, using the procedure computeMaxReduction, which computes Equation 9 for all the positions of the sub-array. While the worst-case complexity of finding the best split remains that discussed in Sec. 3.1.2, i.e., O(n), this technique significantly reduces the computational cost in all practical cases, especially when the dimensionality of the data is large. In particular, in the experiments performed on data sets with 12 dimensions we found that the number of positions for which the split effectiveness is computed is around 1/7 of the possible splitting positions, while for 6-dimensional data it is around 1/3. Figure 6 shows an example of how the maximum ∆WCSS varies for each of the dimensions of a 6-dimensional block of data as new splitting positions are taken into account. First, for all 6 dimensions, we compute the maximum ∆WCSS for two subsequent samples of 5% of the positions. Since for dimensions 4 and 5 the second sample does not increase the ∆WCSS, and their maximum ∆WCSS is under 90% of the overall maximum ∆WCSS (which is obtained for dimension 6), we stop the computation on those dimensions. We then compute the ∆WCSS for an additional 5% sample for dimensions 1, 2, 3, and 6, and see that only for dimensions 2 and 3 we find an improvement of ∆WCSS. Therefore, we stop the computation on dimension 1, since its maximum ∆WCSS is under 90% of the overall maximum. In the next step, even dimension 3
Figure 6: ∆WCSS produced by progressive sampling on the six dimensions of a data block.
reaches a steady state, and the computation on this dimension is stopped as well. The remaining two dimensions, i.e., 2 and 6, remain competitive till the very end, i.e., until all the positions are evaluated. Thus, for this example, we avoided evaluating 57.5% of all possible splits.
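To make the race concrete, the following sketch emulates the progressive-sampling search in Python. It assumes that each cumulative marginal stores, for every candidate position along its axis, the running point count and the running sum of every coordinate, so that the centroids of the two halves, and hence ∆WCSS via Proposition 2.3, can be recovered; the function names and the NumPy-based representation are illustrative and are not taken from the CLUBS+ implementation.

```python
import numpy as np

def delta_wcss(counts, coord_sums, pos):
    """DeltaWCSS for splitting a block at candidate position `pos` of one
    cumulative marginal.  counts[p] is the number of points at or below
    position p; coord_sums[p] is the d-vector with the sum of every
    coordinate of those points (assumption: the marginals keep this
    information, which is enough to recover both centroids)."""
    n1, n = counts[pos], counts[-1]
    n2 = n - n1
    if n1 == 0 or n2 == 0:
        return 0.0
    c1 = coord_sums[pos] / n1
    c2 = (coord_sums[-1] - coord_sums[pos]) / n2
    # Proposition 2.3: n1*n2/(n1+n2) * ||c1 - c2||^2
    return n1 * n2 / (n1 + n2) * float(np.sum((c1 - c2) ** 2))

def best_split(marginals, step=0.05, keep_ratio=0.9, rng=None):
    """Progressive-sampling race over the d candidate dimensions.
    marginals[i] is a pair (counts, coord_sums) for dimension i."""
    rng = rng or np.random.default_rng(0)
    d = len(marginals)
    # Shuffle candidate positions so every prefix is a random sample.
    order = [rng.permutation(len(m[0])) for m in marginals]
    best = [(-1.0, None)] * d            # (max DeltaWCSS, position) per dimension
    done = [0] * d                       # number of positions tested per dimension
    alive = set(range(d))                # still-competitive dimensions
    while any(done[i] < len(order[i]) for i in alive):
        overall = max(b[0] for b in best)
        for i in list(alive):
            if done[i] >= len(order[i]):
                continue
            counts, sums = marginals[i]
            size = max(1, int(step * len(order[i])))
            chunk = order[i][done[i]: done[i] + size]   # next 5% sample
            done[i] += len(chunk)
            new_best = max((delta_wcss(counts, sums, p), p) for p in chunk)
            improved = new_best[0] > best[i][0]
            best[i] = max(best[i], new_best)
            if not improved and best[i][0] < keep_ratio * overall:
                alive.discard(i)         # stalled and clearly dominated: drop it
    dim = max(range(d), key=lambda i: best[i][0])
    return dim, best[dim][1], best[dim][0]
```

The 5% step and the 90% threshold mirror the values used above; both are plain parameters of the sketch.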
5. Experimental Evaluation: Synthetic Dataset
Extensive experiments were conducted to evaluate the performance of CLUBS+ in terms of (i) running time, (ii) quality of results, (iii) impact of dimensionality, and (iv) effect of noise. We compared CLUBS+ with BOOL [48], BIRCH [50], k-means [37] and k-means++ [9]. We focused our attention on these algorithms, since they are the best known among the centroid-based ones with linear complexity in the size of data. We also tested a variation of k-means which uses CLUBS+ for seeding: we first run the first three phases of CLUBS+, and then use the centroids of the clusters obtained after the agglomerative step as seeds for the classical k-means algorithm10. In the following, we will refer to this variation of k-means as k-meansCS. We implemented CLUBS+, k-means, k-means++, and k-meansCS in Java [1], whereas for BOOL we used the author's C++ implementation11 and for BIRCH the C implementation found in the NU-MineBench suite v. 3.0.1 [41]12.
10 We remark that in the subsequent iterations of k-means any metric (e.g., the Manhattan distance) could be used for the refinement. 11 https://github.com/mahito-sugiyama/BOOL-clustering 12 This is the original program modulo the minor revisions needed to make it compilable by current gcc.
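As an illustration of the k-meansCS seeding just described, any off-the-shelf k-means implementation can be used once its own initialization is disabled; the snippet below uses scikit-learn (not the Java code used in our experiments) and assumes the CLUBS+ centroids are already available as an array.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_cs(points, clubs_centroids):
    """k-meansCS sketch: run classical k-means seeded with the centroids
    produced by the first three phases of CLUBS+ (passed in by the caller)."""
    seeds = np.asarray(clubs_centroids)
    km = KMeans(n_clusters=len(seeds), init=seeds, n_init=1)  # no random restarts needed
    labels = km.fit_predict(np.asarray(points))
    return labels, km.cluster_centers_
```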
We generated several synthetic data sets by means of a generator that produces non-overlapping clusters whose points follow a multivariate normal distribution13. Data sets were generated and tested for all the possible combinations of the following parameters:
• Number of dimensions: 2, 4, 6, 8, 10, and 12;
• Total number of points: 10^5, 10^6, and 10^7;
• Number of clusters: 1, 2, 4, 8, 16, and 32;
• Percentage of noise points: 0%, 2%, 4%, 6%, 8%, and 10%.
This produced a total of 648 different data sets, one for each combination above14. A key advantage of a collection of synthetic data sets is that it enables us to study accuracy against the ground truth, and to analyze the sensitivity to each parameter while keeping the others unvaried. Furthermore, knowing the actual number of clusters, say K, enables us to correctly initialize the other algorithms, which require this parameter as input. For BIRCH, we used the same parameters used by the authors in their original work [50], and we set the number of clusters to K—as allowed by the original implementation, and actually needed to optimize the performance of the algorithm.
In our experiments we measure the accuracy of clustering algorithms by comparing the results of clustering with the ground truth, which is available to us since the data are synthetically generated and define (i) the number of clusters K, (ii) the noise points, if any, and (iii) the cluster to which each non-noise point belongs. This knowledge allows us to evaluate the error rate of the results produced by the algorithms using the F-measure (F-m) and the adjusted Rand error (aRe). These measure the similarity and the differences between the original clustering and the one produced by the clustering algorithm, whereby perfect similarity and no differences are respectively denoted by F-m = 1 and aRe = 0. More details about these measures are reported in Appendix C.
Regarding running times, for all the algorithms we measured the complete execution on each data set, excluding the time needed to load data into main memory and to store the results on disk. We will now present the results obtained on the generated data sets, grouped by each of the parameters listed above: we show tables reporting averages, and graphs describing the range and variance of the values obtained. All experiments were run on a workstation equipped with an Intel i7-4770 CPU (3.40GHz) and 32 GB RAM. Thus, Table 1 shows the average performance obtained on the datasets grouped by data size (i.e., number of points).
13 The minimum distance between the centroid of one cluster and the points of other clusters always exceeds 3σ.
14 The code of the generator is available at https://github.com/gmmazzeo/clugen
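As an illustration of the kind of data produced by the generator, the sketch below builds one synthetic data set with Gaussian clusters and uniform noise. It only mirrors the stated constraints (non-overlapping clusters, a chosen noise percentage) and is not the clugen code referenced in footnote 14; in particular, the centroid-separation threshold and the bounding box are assumptions of the sketch.

```python
import numpy as np

def make_dataset(n_points, n_dims, n_clusters, noise_pct, sigma=1.0, rng=None):
    """Illustrative generator: Gaussian clusters with far-apart centroids
    plus uniformly distributed noise points (label -1)."""
    rng = rng or np.random.default_rng(7)
    side = 20.0 * sigma * n_clusters ** (1.0 / n_dims)   # box large enough for rejection sampling
    centroids = []
    while len(centroids) < n_clusters:
        c = rng.uniform(0.0, side, size=n_dims)
        # 8*sigma centroid separation is an assumption standing in for the
        # "> 3*sigma from other clusters' points" constraint of footnote 13.
        if all(np.linalg.norm(c - o) > 8.0 * sigma for o in centroids):
            centroids.append(c)
    n_noise = int(n_points * noise_pct)
    n_clustered = n_points - n_noise
    sizes = np.full(n_clusters, n_clustered // n_clusters)
    sizes[: n_clustered % n_clusters] += 1
    points, labels = [], []
    for lbl, (c, s) in enumerate(zip(centroids, sizes)):
        points.append(rng.normal(c, sigma, size=(s, n_dims)))
        labels += [lbl] * s
    points.append(rng.uniform(0.0, side, size=(n_noise, n_dims)))   # noise points
    labels += [-1] * n_noise
    return np.vstack(points), np.array(labels)
```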
Figure 7: Distribution of datasets w.r.t. the F-measure (a) and aRe (b).
                    10^5 points               10^6 points               10^7 points
Algorithm       F-m     aRe    Time       F-m     aRe    Time       F-m     aRe     Time
CLUBS+          0.998    0.76   0.27      0.999    0.72   2.23      0.999    0.65    20.99
BOOL            0.978    2.52   0.06      0.949    5.72   1.38      0.963    3.66    23.30
BIRCH           0.973    7.23   0.35      0.945   11.81   3.97      0.877   20.29    44.67
k-means         0.847   33.37   1.31      0.846   33.46  12.77      0.849   33.03   126.64
k-means++       0.907   24.84   0.89      0.906   25.05   8.97      0.906   24.93    88.67
k-meansCS       0.934   20.77   0.70      0.934   20.76   6.60      0.934   20.75    64.29

Table 1: Accuracy and Time w.r.t. the data set size (Time in seconds, aRe in percentages).
Figure 8: Execution times—the out-of-scale value in (b) is 381 s.
Figure 7 depicts the cumulative distribution of the results obtained, measured using the F-measure (a) and the aRe (b). The most relevant fact that emerges from these two graphs is that CLUBS+ achieves F-m = 1 and aRe = 0 in 40% of the datasets, whereas its F-m and aRe were respectively better than 0.90 and 0.25 in all cases15. BOOL and BIRCH can also deliver good accuracy in general, but their reliability is well below that of CLUBS+. In fact, BOOL's F-measure is under 0.9 and its aRe is over 15% in about 10% of the tested data sets, while BIRCH's F-measure is under 0.85 and its aRe is over 24% in about 20% of the tested data sets. We also adopted the Friedman test [14] for assessing the statistical significance of the experimental results; the results are reported in Appendix B. Finally, observe how k-meansCS uniformly improves upon k-means++, which uniformly improves upon k-means, following similar trajectories; all three trajectories share a final step due to the outliers that the three algorithms are not designed to detect.
The running times with respect to the number of points are shown in Table 1 and in Figure 8(a). Running times grow about linearly with the number of points for all the algorithms. CLUBS+ and BOOL exhibit comparable running times, closely followed by BIRCH. However, a comparison based on absolute running times is not very meaningful, since the algorithms are implemented in different programming languages16. It is more relevant to notice how the running times increase as the data set size increases. If we measure the ratio between the average running times over 10^6 and 10^5 points, we find 8.3 for CLUBS+, 23 for BOOL, and 11.3 for BIRCH; the ratio between the average running times over 10^7 and 10^6 points is 9.4 for CLUBS+, 17.7 for BOOL, and again 11.3 for BIRCH. This suggests that CLUBS+ scales somewhat better than BOOL and BIRCH w.r.t. the size of data. Much in the same way that we observed an improvement in accuracy, here we observe an improvement in time performance as we move from k-means to k-means++ and from this to k-meansCS. Our study suggests that this is due to the good seeding provided by k-means++ and the even better one provided to k-meansCS by CLUBS+, which determines both the number of centroids K and their location, whereas K in the other algorithms must be divined by the user.
Table 2 and Figure 8(b) summarize the impact of dimensionality, i.e., the number of dimensions of each point, for the data sets of size 10^7. The running times were found to increase linearly w.r.t. the number of dimensions for all the algorithms, whereas accuracy is almost unaffected by dimensionality. The only exception was BOOL on 2-dimensional data sets, on which it was unexpectedly slower and less accurate than it was for larger dimensionalities.
Table 3 and Figure 9(a) summarize the impact of the number of clusters,
15 Indeed, only in two cases CLUBS+ achieved F-m and aRe worse than 0.98 and 0.1.
16 We also tested a classical k-means version implemented in C++, whose running times were on average 4 times faster than the Java implementation.
                   2 dimensions             4 dimensions             8 dimensions            12 dimensions
Algorithm       F-m     aRe    Time      F-m     aRe    Time      F-m     aRe    Time      F-m     aRe     Time
CLUBS+          0.997    1.34   8.91     0.998    1.12  13.06     1.000    0.38  22.89     1.000    0.09    34.58
BOOL            0.843   13.56  26.49     0.984    1.87  18.35     0.989    1.55  19.99     0.986    1.86    25.57
BIRCH           0.906   22.73  36.17     0.853   25.74  39.48     0.881   18.95  45.91     0.865   18.25    54.06
k-means         0.864   30.33  51.72     0.840   33.78  77.65     0.853   32.57 135.70     0.849   33.35   207.92
k-means++       0.911   24.11  34.50     0.901   25.60  49.46     0.903   25.36  97.94     0.910   24.70   158.00
k-meansCS       0.941   19.97  26.13     0.932   20.90  38.76     0.933   20.92  70.85     0.934   20.88   105.72

Table 2: Accuracy and Time Performances w.r.t. the data dimensionality (aRe in %, Time in seconds)

                    2 clusters               8 clusters              32 clusters
Algorithm       F-m     aRe    Time      F-m     aRe    Time      F-m     aRe     Time
CLUBS+          0.998    0.48  16.13     1.000    0.06  20.69     0.999    0.12    31.84
BOOL            0.987    1.58  27.10     0.976    2.43  21.95     0.968    2.68    29.07
BIRCH           0.972    4.95  45.57     0.808   25.81  43.77     0.811   26.56    44.77
k-means         0.934    9.64  22.82     0.813   25.59 130.72     0.776   29.79   332.48
k-means++       0.934    9.58  21.98     0.895   13.73  84.46     0.878   16.87   237.31
k-meansCS       0.935    9.23  27.01     0.937    6.83  56.73     0.930   10.43   153.21

Table 3: Accuracy and Time Performances w.r.t. the number of clusters (aRe in %, Time in seconds)
Figure 9: Execution times for different numbers of clusters (a) and noise ratio (b)—the out-of-scale values in (a) and (b) are, respectively, 423 s and 302 s.
                    0% noise                 2% noise                 6% noise                10% noise
Algorithm       F-m     aRe    Time      F-m     aRe    Time      F-m     aRe    Time      F-m     aRe     Time
CLUBS+          0.999    0.07  22.17     1.000    0.47  19.68     0.999    0.74  20.92     0.997    1.12    21.85
BOOL            0.980    0.16  11.16     0.981    1.38  22.32     0.984    1.87  25.10     0.872   14.44    29.12
BIRCH           0.924   22.66  47.25     0.907   16.04  44.51     0.883   17.49  44.18     0.825   24.34    43.85
k-means         0.869   17.59 107.50     0.877   32.94 116.58     0.847   35.45 132.53     0.813   39.59   145.00
k-means++       0.991    1.25  71.96     0.951   22.20  75.32     0.886   29.88  91.88     0.833   37.02   110.33
k-meansCS       0.999    0.07  62.79     0.972   18.98  62.64     0.920   24.63  64.69     0.872   31.30    66.27

Table 4: Accuracy and Time Performances w.r.t. the noise ratio (aRe in %, Time in seconds)
for data sets of size 10^7. A first observation is that, for two clusters, k-means performs as well as k-means++ and k-meansCS, a fact that confirms the well-known property that the seeding problem of k-means disappears when there are only two clusters. Then, as the number of clusters increases, the running times of these algorithms grow linearly, as expected. On the other hand, the running time of CLUBS+ increases sub-linearly w.r.t. the number of clusters, whereby its running time, which was about 3/4 that of k-means for 2 clusters, becomes about 1/11 for 32 clusters. Indeed, the divisive phase of CLUBS+ has practically logarithmic complexity w.r.t. the number of clusters (see Section 3.1.2), and the optimization of the refinement steps effectively reduces the number of clusters to be checked for each point in order to find the nearest cluster17. Finally, while BOOL and BIRCH were always slower than CLUBS+, they also demonstrated the remarkable property that their running times are not much affected by the number of clusters. We also see that the accuracy of CLUBS+ is basically unaffected by the number of clusters, whereas the accuracy of the other algorithms, except k-meansCS, decreases as the number of clusters increases. We attribute this to the fact that, while k-meansCS is given the actual centroids of the clusters, k-means, k-means++, BOOL, and BIRCH are only given the number of clusters and still need to find good positions for the clusters' centroids—a task that becomes harder as their number increases.
Table 4 and Figure 9(b) show the performances obtained on the datasets with 10^7 points for increasing percentages of outliers due to noise or any other source that distributes data randomly throughout the space. We can observe that CLUBS+ outperforms the other algorithms in terms of time and accuracy, and its accuracy remains largely unaffected by noise: for 10% noise the F-m index for CLUBS+ remains 0.997, whereas it falls below 0.9 for the other algorithms. This is not surprising for k-means++ and k-meansCS, which are not designed to detect and separate out noise, but higher noise levels also lower the accuracy of BIRCH, although it is designed to recognize outliers. It is also worth noticing that BOOL, on noiseless data, is very accurate, and its running times are better than those of
17 Moreover, the agglomerative phase of CLUBS+, which in theory is super-linear w.r.t. the number of clusters, in practice always takes only a few milliseconds.
ACCEPTED MANUSCRIPT
CR IP T
CLUBS+. However, we can also notice that, differently from CLUBS+, BOOL is substantially affected by noise, both in accuracy and in execution time. Table 4 also shows that the running times of CLUBS+, BIRCH, and k-meansCS are not considerably impacted by noise, whereas that of k-means++ is. A possible explanation for this slow-down is that the probabilistic criterion that k-means++ uses for seeding becomes less effective in the presence of noise, which results in slower convergence18.
Number of Clusters in Synthetic Datasets. Synthetically generated data sets provide the ultimate test of the ability of unsupervised clustering algorithms to detect the correct number of centroids. CLUBS+ passes this test with flying colors, since it always detected exactly the number of centroids used by the scripts generating the synthetic data. On the same synthetic data, we then tested BIRCH without specifying the number of clusters, as allowed by the original implementation: most of the time, the number of clusters returned by BIRCH was much larger than the correct value. Even when we provided them with the correct number of clusters, there were several situations where highly respected algorithms such as BOOL, BIRCH and k-means failed to correctly separate the points and reconstruct the original synthetic clusters.
Accelerated K-means. The popularity of k-means has led to the introduction of several optimizations to accelerate its computation. A key optimization technique that eliminates many distance calculations by exploiting the triangle inequality was introduced by Elkan [18]; this was followed by improvements proposed by Hamerly [26] and a lesser known one proposed by Drake [16]. These algorithms have been the subject of a thorough study [27], which measures the acceleration obtained by the various techniques over the standard Lloyd algorithm used as a baseline—a measure that is preferable to actual running times because it is less sensitive to differences in hardware, compilers, and the number of points in the dataset. The study presented in [27] shows that the above-mentioned techniques can deliver significant speedups on synthetic data sets involving high dimensionality and a large number of clusters. When that is not the case, gains are much more modest, to the point that we hardly see any improvement for the two-cluster situation19. We will thus concentrate on the intermediate cases of synthetic datasets generated with 8 clusters and 32 clusters, for which the results of CLUBS+ are reported in Table 3: this shows an acceleration of 130.72/20.69 = 6.31 for 8 clusters and an acceleration of 332.48/31.84 = 10.44 for 32 clusters. Figure 2.3 in [27] shows that for 8 clusters and 32 clusters the Drake technique achieves an acceleration of, approximately, 3 and 5, respectively, whereas smaller accelerations are achieved by the Hamerly and Elkan algorithms. While these are smaller accelerations than those achieved by CLUBS+,
18 Additional experimental results are available in [1], where the CLUBS+ code is also available.
19 See Figure 2.3 in [27].
the authors of [27] then propose additional improvements, based on K-heaps and sorting of centers, with which they measure a 5-fold acceleration for 8 clusters and a 14-fold acceleration for 32 clusters. The latter acceleration being better than that delivered by CLUBS+ suggests that highly accelerated k-means variations can indeed execute faster than CLUBS+ in particular situations. Even so, CLUBS+ remains one of the fastest all-around algorithms, and it makes fast turn-arounds possible because of its speed and its deterministic nature, whereas with k-means analysts often have to rely on repeated executions for enhanced confidence and accuracy20.

6. A Real Life Data Set
After using synthetic data sets to validate the effectiveness of our clustering algorithm, we decided to test it on a real-life big-data application for which our readers can evaluate the quality of results using their common sense and real-world knowledge. The NYC Taxi Trips data set represents a natural choice in this respect [2]. The data set contains information on approximately 173 million taxi rides made in New York in 2013; each record shows the time and location of each pick-up and drop-off, the trip distance and duration, and the number of passengers, and contains information about the taxi and the driver. We focused our attention on the pickup location and time, thus obtaining a 3-dimensional data set describing latitude, longitude, and time of pickups. We only considered rides between 6:00am and midnight, and quantized time into 10-minute intervals; we represented location using latitude and longitude coordinates specified with a precision of 1/1000 of a degree. After discarding rides with coordinates that denote places very far from New York, we ended up with 148 million rides for our experiments (over 85% of the raw data set). From this 3-dimensional data set, called TAXI3D, we then derived a 2-dimensional data set, called TAXI2D, that describes latitude and longitude. CLUBS+ performed the clustering of the 148 million 3-dimensional points of TAXI3D in 6 minutes, while the clustering of TAXI2D took 5 minutes. The results for TAXI2D are shown in Figure 10, which is conducive to a very "practical" interpretation. Indeed, the clusters found by CLUBS+ have their centroids at the J.F.K. airport (the blue cluster), La Guardia airport (the orange cluster), Downtown Manhattan (the green cluster), and Midtown Manhattan (the red cluster). These are actually the areas with the largest number of taxi pickups. The locations of the pick-ups that were classified by CLUBS+ as outliers are represented in gray in Figure 10; the intensity of these gray points has been enhanced to make them visible in the figure, whereas their actual density is quite
20 This conclusion is supported by our experimental evaluation of other proposed accelerations of k-means that were not evaluated in [27]. We tested the algorithms Swap, EZ-Hybrid and Hybrid [33] using the code by the authors. Our results, reported at http://goo.gl/Hlf4LT, show that the acceleration they provide w.r.t. Lloyd is non-uniform and never exceeds 2.1.
Figure 10: Clustering results for the TAXI2D data set.
low: indeed, (i) altogether they amount to only 0.5% of the whole dataset, and (ii) their locations are distributed over a large area. The four clusters shown in the figure remain valid for most of the day, whereby TAXI3D is clustered by CLUBS+ into ellipsoids that extend those of Figure 10 along the time dimension.
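For reference, a minimal sketch of the preprocessing described above is given below; the column names and the bounding box used to discard far-away pickups are assumptions, since only the quantization rules are stated in the text.

```python
import pandas as pd

def build_taxi_datasets(csv_path):
    """Illustrative derivation of TAXI3D / TAXI2D from the raw trip records."""
    df = pd.read_csv(csv_path,
                     usecols=["pickup_datetime", "pickup_latitude", "pickup_longitude"],
                     parse_dates=["pickup_datetime"])
    # keep rides between 6:00am and midnight
    df = df[df.pickup_datetime.dt.hour >= 6]
    # drop pickups far outside the New York area (rough, assumed bounding box)
    df = df[df.pickup_latitude.between(40.4, 41.1) &
            df.pickup_longitude.between(-74.3, -73.5)]
    # quantize time into 10-minute slots and coordinates to 1/1000 of a degree
    minutes = df.pickup_datetime.dt.hour * 60 + df.pickup_datetime.dt.minute
    taxi3d = pd.DataFrame({
        "lat":  df.pickup_latitude.round(3),
        "lon":  df.pickup_longitude.round(3),
        "time": (minutes // 10).astype(int),   # index of the 10-minute interval
    })
    taxi2d = taxi3d[["lat", "lon"]]
    return taxi3d, taxi2d
```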
7. Related Work
The clustering problem, despite being quite old, continues to attract the interest of researchers: even if we restrict our attention only to the most popular algorithm of this field, proposed almost 60 years ago [32], we can find the term 'k-means' in the title of over 9000 published papers21. Since the introduction of k-means, significant progress has been made in the field of clustering, with the proposal of hundreds of new techniques. Furthermore, in the last two decades variants of the classical problem have been studied. For instance, the use of constraints (e.g., some points must/cannot belong to the same cluster) was shown to improve the performance of hierarchical clustering algorithms [22, 30, 47]. The problem of hierarchical ensemble clustering investigated in [36] entails deriving an optimal dendrogram (i.e., the tree representing the hierarchy of data) given
21 Google Scholar, September 2016.
different input dendrograms of the same data set. CLUBS+, however, belongs to the class of general centroid-based clustering algorithms; therefore, in the following, we briefly describe some of the most popular algorithms that share the objective of CLUBS+, i.e., creating a flat partition of the data without the need of providing an optimal dendrogram, and that do not assume constraints on the data (i.e., must-link/cannot-link constraints).
Hierarchical Clustering Approaches. In agglomerative clustering approaches, such as AGNES [35] and UPGMA [35], every object initially represents a cluster, and then pairs of nearest clusters are merged until the desired number of clusters is reached. Divisive clustering methods such as DIANA [35] work in the opposite way: they assign all objects to a single cluster and then divide the cluster into smaller and smaller ones until the expected number of clusters is obtained. Other popular hierarchical clustering algorithms are BIRCH [50], CURE [24], ROCK [25] and Chameleon [34]. BIRCH, in particular, is a very popular algorithm, due to its efficiency. BIRCH creates a CF-tree (cluster-feature tree) while scanning the data, which can be seen as an index of the data. Each node stores summary information (the count, the sum, and the sum of the squared coordinates of the points) about the data points associated with the node. After building the CF-tree, the leaves are considered as micro-clusters, which are re-aggregated using an input parameter that can be the desired number of clusters or the maximum radius. BIRCH can optionally classify outliers. While the divisive step of CLUBS+ may recall the initial step of BIRCH, which creates micro-clusters, and the two algorithms share the idea of first partitioning data and then reassembling them, the approaches used and their results are very different. In fact, while BIRCH creates micro-clusters by progressively adding the data to an index, in a way that recalls bottom-up techniques (if the variance of a leaf node becomes 'too large', then the node is split), CLUBS+ creates a top-down partition, and the clusters yielded by the divisive step cannot be considered micro-clusters, since the excess number of blocks is only due to the 'few' splits that are needed to completely separate parts of different clusters that were only partially separated by previous splits. Furthermore, while the agglomerative step of CLUBS+ is driven by the maximization of the CH-index, with the aim of finding the ideal number of clusters, the aggregation of BIRCH's micro-clusters is driven by an input parameter.
In [20] the authors propose a fast agglomerative clustering method that uses an approximate nearest neighbor graph to reduce the number of distance calculations. In [11] TIC (Top-down Induction of Clustering trees) is proposed, which creates a partitioning of the data by recursively splitting the blocks in such a way that the distance between the centroids of the two new blocks is maximized, and uses as stopping criterion a test based on the significance of the variance reduction yielded by the split. In [31] the authors exploited hierarchical clustering for feature selection. Finally, in [15] hierarchical clustering is performed by repeatedly splitting and merging intermediate clusters. When performing a hierarchical clustering, there are four different methods to measure the distance between clusters: centroid distance, average distance, single-link
distance and complete-link distance. Centroid distance is the distance between the centroids of two clusters. Average distance is the average of the distances between every pair of objects from two clusters. Single-link distance, also known as minimum distance, is the distance between the two nearest objects from two clusters. Complete-link distance, also referred to as maximum distance, is the distance between the two objects from two clusters which are the farthest from each other. Unfortunately, these methods tend to be slower than other techniques, such as the partitioning methods, even though the latter may require multiple runs to achieve optimal accuracy.
Partitioning Clustering Approaches. The key feature of partitioning clustering approaches is that they initially partition the data into k groups and then try to improve the quality of clustering by assigning objects previously belonging to a group to a more appropriate one. The most widely known method for partitioning clustering is k-means [37], which randomly selects k objects as cluster centroids, assigns the remaining objects to the nearest cluster centroid, and then improves the clustering by iteratively updating the cluster centroids and reassigning the objects to the newly computed cluster centroids. Hundreds of studies have been proposed to improve the efficiency of k-means, by optimizing the initial seeding and the cluster assignment of points at each iteration. The most popular in the first group is k-means++ [9], which uses a distance-based probabilistic approach for choosing seeds that are distant from one another, with an O(log k) guaranteed approximation ratio. In the second group we find several approaches, including some based on the use of an index [33] or of the triangle inequality [18] to reduce the set of clusters against which the distance of a given point must be computed. k-medoids is a variation of k-means where the medoid (i.e., the object closest to the center) is used to represent a cluster. Other widely used partitioning methods are PAM and CLARA, proposed in [35], CLARANS [42], and the k-means variation proposed in [7], which is based on a k-dimensional tree structure. In [3] the usage of metric and non-metric distance measures is discussed for the generalization of the k-median problem. The main disadvantage of partitioning methods is that the result of clustering depends on a correct selection of the initial cluster centers, and it may produce a local optimum rather than a global one. A traditional way to address this problem is to run k-means multiple times with different initial centers and then choose the best clustering result as output. Significant progress has recently been made on this problem: in [43] the authors propose a probabilistic seeding for center assignment that guarantees good accuracy for "well-clusterable" instances.
Grid-Based Clustering Approaches. Grid-based clustering works by partitioning the data space into cells arranged in a grid fashion and then merging them to build clusters. Many grid-based methods, such as STING [49], WaveCluster [46] and CLIQUE [6], exploit regular grids to partition the data space. On the contrary, some proposals, such as adaptive grid [23] and optimal grid [28], exploit adaptive or irregular grids. STING [49] is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells and organized into a statistical information cell hierarchy. In WaveCluster [46]
a signal processing perspective is adopted for dealing with multi-dimensional data. CLIQUE [6] partitions each dimension into intervals and computes the dense units in all dimensions; once the basic dense units are obtained, they are combined to generate the higher-dimensional ones. Its running time is very low, because it is independent of the number of objects; however, it depends on the number of cells. Its disadvantage is that the quality of clustering depends on the granularity of the cells and that the number of cells increases exponentially with the dimensionality of the data. To overcome the latter limitation, adaptive grid and optimal grid approaches have been introduced. MAFIA [23] exploits adaptive grids to partition each dimension depending on the distribution of the data in it, thus producing a more accurate partition than a uniform one. OptiGrid [28] uses contracting projections of the data to determine the optimal separation hyper-planes for data partitioning; as a result, the grid is "arbitrary", as compared with equidistant, axis-parallel grids. Finally, BOOL [48] adopts an approach based on a multi-step binary discretization of the data. Data are sorted according to their binary representation, which allows the time needed to aggregate them into clusters to be dramatically reduced, enabling BOOL to achieve linear complexity. Furthermore, BOOL is capable of detecting clusters of general shapes.

8. Conclusions
The naturalness of the hierarchical approach for clustering objects is widely recognized and supported by psychological studies of children's cognitive behaviors [44]. CLUBS+ has provided the analytical and algorithmic advances that turn this intuitive approach into a data mining method of great accuracy, robustness and speed for clustering samples around centroids. However, CLUBS+ is not without limitations, since it does not support categorical attributes and it cannot handle high-dimensionality problems well22. Naturally, clustering remains a many-facet problem, and other clustering methods, such as density-based clustering, seek objectives other than grouping sample points around centroids. However, clustering around centroids remains the objective of many algorithms—which now come second to CLUBS+ in terms of accuracy and performance. In particular, this is true for the widely used k-means and its many improvements introduced over its 60 years of existence, including very significant ones such as those proposed in [9]. Unlike other algorithms, CLUBS+ does not require any supervision, and yet it outperforms the rest in terms of speed, accuracy of results, and robustness w.r.t. noise. The speed achieved by our approach is primarily due to the ability of CLUBS+ to exploit the analytical properties of its quadratic distance functions to simplify the computation. We conjecture that similar benefits might be at hand for situations where the samples are in data streams or in secondary storage. These situations were not studied
22 Indeed, we were able to run CLUBS+ on the 41-dimensional KDD Cup 1999 Data, which contains some categorical attributes, obtaining the correct number of clusters (5), aRe ∼ 9%, and F-m ∼ 0.94 [1].
in this paper, but they represent a promising topic for future research. Furthermore, we are already working on a parallel implementation of CLUBS+, which will let users run the algorithm on arbitrarily large data sets.
Acknowledgements
The authors would like to thank Prof. Wei Wang and the anonymous reviewers for the many insightful comments they have provided.
References
[1] CLUBS+ website. http://yellowstone.cs.ucla.edu/clubs/. Accessed: 2016-01-20.
[2] NYC taxi trips. http://www.andresmh.com/nyctaxitrips/. Accessed: 2016-01-20.
[3] M. R. Ackermann, J. Blömer, and C. Sohler. Clustering for metric and nonmetric distance measures. ACM Transactions on Algorithms, 6(4), 2010.
[4] Charu C. Aggarwal, Hinneburg Alexander, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. In ICDT 2001, pages 420–434, 2001. [5] Charu C. Aggarwal and Chandan K. Reddy, editors. Data Clustering: Algorithms and Applications. CRC Press, 2014.
[6] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data. Data Mining and Knowledge Discovery, 11(1):5–33, 2005.
[7] K. Alsabti, S. Ranka, and V. Singh. An efficient parallel algorithm for high dimensional similarity join. In IPPS. IEEE Computer Society Press, 1998.
[8] Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona. An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256, January 2013.
[9] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In SODA, pages 1027–1035, 2007.
[10] James C. Bezdek, Robert Ehrlich, and William Full. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2-3):191–203, 1984.
[11] Hendrik Blockeel, Luc De Raedt, and Jan Ramon. Top-down induction of clustering trees. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 55–63, 1998.
[12] T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974. [13] Y.M. Cheung. k*-means: A new generalized k-means clustering algorithm. Pattern Recognition Letters, 24(15):2883–2893, 2003.
[14] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30, 2006.
[15] C. Ding and X. He. Cluster merging and splitting in hierarchical clustering algorithms. In In Proc. of IEEE International Conference on Data Mining (ICDM), page 139, Washington, DC, USA, 2002. IEEE Computer Society.
[16] Jonathan Drake. Faster k-means clustering. Master’s thesis, Baylor University, 2013. [17] Joseph C Dunn. A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. 1973.
[18] Charles Elkan. Using the triangle inequality to accelerate k-means. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pages 147–153, 2003.
[19] M. Ester, H.P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.
[20] P. Franti, O. Virmajoki, and V. Hautamaki. Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell., 28:1875–1881, November 2006.
[21] F. Furfaro, G. M. Mazzeo, D. Sacca, and C. Sirangelo. Compressed hierarchical binary histograms for summarizing multi-dimensional data. Knowledge and Information Systems, 15(3):335–380, 2008.
[22] Sean Gilpin and Ian Davidson. Incorporating sat solvers into hierarchical clustering algorithms: An efficient and flexible approach. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1136–1144, 2011.
[23] S. Goil, H. Nagesh, and A. Choudhary. Mafia: Efficient and scalable subspace clustering for very large data sets. Technical report, Deparment of Electrical and Computer Engineering, 1999. [24] S. Guha, R. Rastogi, and K. Shim. Cure: an efficient clustering algorithm for large databases. In SIGMOD, pages 73–84, 1998. [25] S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. In ICDE, pages 512–521, 2000.
[26] Greg Hamerly. Making k-means even faster. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA, pages 130–140, 2010.
[27] Greg Hamerly and Jonathan Drake. Accelerating Lloyd’s Algorithm for k-Means Clustering, pages 41–78. Springer International Publishing, 2014. [28] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In VLDB, pages 506–517, 1999. [29] L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
[30] Davidson Ian and Ravi S. S. Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Mining and Knowledge Discovery, 18(2):257–282, 2009.
[31] D. Ienco and R. Meo. Exploration and reduction of the feature space by hierarchical clustering. In SDM, pages 577–587, 2008.
[32] Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, (8):651–666, 2010.
[33] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24(7), 2002.
[34] G. Karypis, E.H. Han, and V. Kumar. Chameleon: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68–75, 1999.
[35] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[36] Zheng Li, Li Tao, and Ding Chris H. Q. Hierarchical ensemble clustering. In ICDM 2010, The 10th IEEE International Conference on Data Mining, Sydney, Australia, 14-17 December 2010, pages 1199–1204, 2010.
[37] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In 5-th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
[38] E. Masciari, G. M. Mazzeo, and C. Zaniolo. A new, fast and accurate algorithm for hierarchical clustering on euclidean distances. In PAKDD (2), pages 111–122, 2013.
[39] E. Masciari, G. M. Mazzeo, and C. Zaniolo. Analysing microarray expression data through effective clustering. Information Sciences, 262(0):32–45, 2014.
[40] S. Muthukrishnan, V. Poosala, and T. Suel. On rectangular partitionings in two dimensions: Algorithms, complexity, and applications. In ICDT, pages 236–256, 1999.
[41] R. Narayanan, B. Ozisikyilmaz, J. Zambreno, J. Pisharath, G. Memik, and A. Choudhary. Minebench: A benchmark suite for data mining workloads. In International Symposium on Workload Characterization, pages 182–188, 2006.
[42] R.T. Ng and J. Han. Clarans: A method for clustering objects for spatial data mining. TKDE, 14(5):1003–1016, 2002.
[43] R. Ostrovsky, Y. Rabani, L. J. Schulman, and C. Swamy. The effectiveness of lloyd-type methods for the k-means problem. In FOCS, 2006.
[44] Jodie M. Plumert. Flexibility in children’s use of spatial and categorical organizational strategies in recall. Developmental Psychology, 30(5):738– 747, September 1994. [45] C. J. Van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979.
[46] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: a wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3-4):289–304, 2000.
[47] Jan Struyf and Saso Dzeroski. Clustering trees with instance level constraints. In Machine Learning: ECML 2007, 18th European Conference on Machine Learning, Warsaw, Poland, September 17-21, 2007, Proceedings, pages 359–370, 2007.
[48] Mahito Sugiyama and Akihiro Yamamoto. A fast and flexible clustering algorithm using binary discretization. In 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011, pages 1212–1217, 2011.
[49] W. Wang, J. Yang, and R. R. Muntz. Sting: A statistical information grid approach to spatial data mining. In VLDB, pages 186–195, 1997.
[50] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In SIGMOD, pages 103–114, 1996.
Giuseppe M. Mazzeo received the PhD degree in computer science and system engineering from the University of Calabria, Italy, in 2007. Currently, he is a Postdoctoral Scholar at the Computer Science Department of the University of California, Los Angeles. His research interests include Data and Text Mining, Natural Language Processing, Information Retrieval, Data Streams, and compression techniques for multidimensional data.
Elio Masciari is currently a senior researcher at the Institute for High Performance Computing and Networks (ICAR-CNR) of the National Research Council of Italy. His research interests include Database Management, Semistructured Data and Big Data. He has been the advisor of several master theses at the University of Calabria and at the University Magna Graecia in Catanzaro, and the advisor of PhD theses in computer engineering at the University of Calabria. He has served as a member of the program committee of several international conferences, and as a reviewer for several scientific journals of international relevance. He is the author of more than 80 publications in journals and in both national and international conferences. He also holds the "Abilitazione Scientifica Nazionale" for the Associate Professor role.
Carlo Zaniolo holds an MS Degree in EE from Padua University, Italy, and a PhD Degree in Computer Science from the University of California, Los Angeles. Carlo spent the first twenty years of his career in industrial R&D, working at Unisys, Bell Labs, and MCC. In 1991, he returned to UCLA as a professor of computer science and holder of the N.E. Friedmann chair in Knowledge Science. His research interests include Database Evolution and Archiving, Data Stream Mining, Non-monotonic Reasoning, and Semantic Computing.
Appendix A. Propositions and their Proofs
Proposition 2.1 Let C be a cluster containing n d-dimensional points. Let $S_i = \sum_{p \in C} p_i$ be the sum of the i-th coordinates of all the points in C, and let $Q_i = \sum_{p \in C} p_i^2$ be the sum of the squares of the i-th coordinates of all the points in C. Then WCSS(C) can be computed as

$$WCSS(C) = \sum_{i=1}^{d} \left( Q_i - \frac{S_i^2}{n} \right). \qquad (A.4)$$
Proof. From the definition of WCSS(C) (Def. 2) we have

$$WCSS(C) = \sum_{p \in C} \|p - c\|^2 = \sum_{i=1}^{d} \sum_{p \in C} (p_i - c_i)^2 = \sum_{i=1}^{d} \sum_{p \in C} \left( p_i^2 - 2\,c_i\,p_i + c_i^2 \right) = \sum_{i=1}^{d} \left( \sum_{p \in C} p_i^2 - 2\,c_i \sum_{p \in C} p_i + n\,c_i^2 \right).$$

Recalling that $c_i = \frac{1}{n} \sum_{p \in C} p_i$ (Equation 2), and the definitions of $S_i$ and $Q_i$, we obtain

$$WCSS(C) = \sum_{i=1}^{d} \left( Q_i - 2 \cdot \frac{S_i}{n} \cdot S_i + \left( \frac{S_i}{n} \right)^2 \cdot n \right) = \sum_{i=1}^{d} \left( Q_i - \frac{S_i^2}{n} \right). \qquad \square$$
Proposition 2.2 Let $D = \{p_1, \ldots, p_n\}$ be a dataset. For any clustering C defined over D, it holds that

$$WCSS(C) + BCSS(C) = WCSS(D), \qquad (A.7)$$

that is, the sum of WCSS and BCSS for any clustering of D is constant and equal to the WCSS of D.
Proof. Using Equation 4, Equation 5, and Equation 6, we obtain

$$WCSS(C) + BCSS(C) = \sum_{i=1}^{k} \sum_{j=1}^{d} \left( Q_{i,j} - \frac{S_{i,j}^2}{n_i} \right) + \sum_{i=1}^{k} n_i \, \|c_i - m\|^2,$$

where k is the number of clusters in C, d is the number of dimensions, $n_i$ is the number of points in the i-th cluster, $Q_{i,j}$ is the sum of the squares of the j-th coordinates of all the points in the i-th cluster, $S_{i,j}$ is the sum of the j-th coordinates of all the points in the i-th cluster, and m is the centroid of the points in D. By recalling Equation 2, and denoting the sum of the j-th coordinates of all the points in D as $S_{*,j}$ and the sum of the squared j-th coordinates of all the points in D as $Q_{*,j}$, we can write

$$WCSS(C) + BCSS(C) = \sum_{j=1}^{d} \sum_{i=1}^{k} \left( Q_{i,j} - \frac{S_{i,j}^2}{n_i} \right) + \sum_{j=1}^{d} \sum_{i=1}^{k} n_i \left( \frac{S_{i,j}}{n_i} - \frac{S_{*,j}}{n} \right)^2$$
$$= \sum_{j=1}^{d} \left[ \sum_{i=1}^{k} \left( Q_{i,j} - \frac{S_{i,j}^2}{n_i} \right) + \sum_{i=1}^{k} \left( \frac{S_{i,j}^2}{n_i} - 2\,S_{i,j}\,\frac{S_{*,j}}{n} + n_i\,\frac{S_{*,j}^2}{n^2} \right) \right]$$
$$= \sum_{j=1}^{d} \left[ \sum_{i=1}^{k} Q_{i,j} - 2\,\frac{S_{*,j}}{n} \sum_{i=1}^{k} S_{i,j} + \frac{S_{*,j}^2}{n^2} \sum_{i=1}^{k} n_i \right] = \sum_{j=1}^{d} \left( Q_{*,j} - 2\,\frac{S_{*,j}^2}{n} + \frac{S_{*,j}^2}{n} \right) = \sum_{j=1}^{d} \left( Q_{*,j} - \frac{S_{*,j}^2}{n} \right) = WCSS(D). \qquad \square$$
Proposition 2.3 Let $C_1$ and $C_2$ be two (disjoint) clusters containing $n_1$ and $n_2$ d-dimensional points, respectively, and let $c_1$ and $c_2$ be their centroids. Let $C = C_1 \cup C_2$ be the cluster obtained by the union of $C_1$ and $C_2$. Then

$$WCSS(C) = WCSS(C_1) + WCSS(C_2) + \Delta WCSS,$$

where $\Delta WCSS$ is a non-negative quantity given by

$$\Delta WCSS = \frac{n_1 \cdot n_2}{n_1 + n_2} \, \|c_1 - c_2\|^2. \qquad (A.8)$$

Proof. By exploiting Equation 4, we can write

$$\Delta WCSS = WCSS(C) - WCSS(C_1) - WCSS(C_2) = \sum_{i=1}^{d} \left( Q_{*,i} - \frac{S_{*,i}^2}{n} \right) - \sum_{i=1}^{d} \left( Q_{1,i} - \frac{S_{1,i}^2}{n_1} \right) - \sum_{i=1}^{d} \left( Q_{2,i} - \frac{S_{2,i}^2}{n_2} \right),$$

where $Q_{*,i}$, $Q_{1,i}$, and $Q_{2,i}$ are the sums of the squared i-th coordinates of all the points in C, $C_1$, and $C_2$, and $S_{*,i}$, $S_{1,i}$, and $S_{2,i}$ are the sums of the i-th coordinates of all the points in C, $C_1$, and $C_2$. Since $C_1$ and $C_2$ are disjoint, it holds that $Q_{*,i} = Q_{1,i} + Q_{2,i}$ and $S_{*,i} = S_{1,i} + S_{2,i}$ for $i \in \{1, \ldots, d\}$, and $n = n_1 + n_2$; therefore

$$\Delta WCSS = \sum_{i=1}^{d} \left( \frac{S_{1,i}^2}{n_1} + \frac{S_{2,i}^2}{n_2} - \frac{(S_{1,i} + S_{2,i})^2}{n_1 + n_2} \right).$$

By multiplying both sides by $\frac{n_1+n_2}{n_1 \cdot n_2}$, expanding the squares, and simplifying, we obtain

$$\frac{n_1+n_2}{n_1 \cdot n_2} \cdot \Delta WCSS = \sum_{i=1}^{d} \left( \frac{S_{1,i}^2}{n_1^2} + \frac{S_{2,i}^2}{n_2^2} - 2\,\frac{S_{1,i} \cdot S_{2,i}}{n_1 \cdot n_2} \right) = \sum_{i=1}^{d} \left( \frac{S_{1,i}}{n_1} - \frac{S_{2,i}}{n_2} \right)^2.$$

Thus,

$$\Delta WCSS = \frac{n_1 \cdot n_2}{n_1 + n_2} \sum_{i=1}^{d} \left( \frac{S_{1,i}}{n_1} - \frac{S_{2,i}}{n_2} \right)^2,$$

and, observing that $\frac{S_{1,i}}{n_1}$ and $\frac{S_{2,i}}{n_2}$ are the i-th coordinates of the centroids of $C_1$ and $C_2$, that is, of $c_1$ and $c_2$, we obtain

$$\Delta WCSS = \frac{n_1 \cdot n_2}{n_1 + n_2} \, \|c_1 - c_2\|^2. \qquad \square$$
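The analytical properties above can be checked numerically; the short sketch below verifies Proposition 2.1 and Proposition 2.3 on randomly generated clusters (an illustration only, not part of the CLUBS+ code).

```python
import numpy as np

rng = np.random.default_rng(42)
C1 = rng.normal(loc=0.0, scale=1.0, size=(500, 3))   # first cluster of 3-D points
C2 = rng.normal(loc=5.0, scale=1.0, size=(300, 3))   # second, well-separated cluster
C = np.vstack([C1, C2])

def wcss(X):
    """Definition of WCSS: sum of squared distances from the centroid."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def wcss_from_sums(X):
    """Proposition 2.1: WCSS from per-coordinate sums and sums of squares."""
    n, S, Q = len(X), X.sum(axis=0), (X ** 2).sum(axis=0)
    return float((Q - S ** 2 / n).sum())

# Proposition 2.1: the two formulas agree.
assert np.isclose(wcss(C), wcss_from_sums(C))

# Proposition 2.3: merging two disjoint clusters increases WCSS by
# n1*n2/(n1+n2) * ||c1 - c2||^2.
n1, n2 = len(C1), len(C2)
delta = n1 * n2 / (n1 + n2) * float(((C1.mean(0) - C2.mean(0)) ** 2).sum())
assert np.isclose(wcss(C), wcss(C1) + wcss(C2) + delta)
print("Propositions 2.1 and 2.3 verified numerically.")
```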
Proposition 3.1 Let C be a cluster set over D, with cardinality greater than 1, and let $C_1$ and $C_2$ be the clusters whose merge produces the least increase of WCSS(C). If the merge of $C_1$ and $C_2$ produces a decrease of the CH-index of the cluster set, then no pair $C_i$ and $C_j$ in C exists such that their merging increases the CH-index.
Proof. The CH-index is given by Equation 10, which is the product of two fractions. When we merge a pair, the second fraction increases, since its denominator decreases by one and its numerator increases by one, and this is the same for any pair. However, the value of the first fraction decreases by an amount that depends on the pair chosen, and depending on this amount the new CH-index can be larger or smaller than the previous one. Now, if $\Delta WCSS$ denotes the increase of WCSS produced by the merging of the pair of clusters $\langle C_i, C_j \rangle$, then $\Delta WCSS$ is also the drop of BCSS for that pair (see Prop. 2.2). Thus, to compute the new CH-index from the current one, $\Delta WCSS$ must be added to the denominator of the first fraction and subtracted from its numerator. As we consider different pairs to merge, the CH-index is therefore monotonically decreasing with $\Delta WCSS$ (since so is the first fraction, whereas the second is the same for all pairs). Thus, if the new CH-index decreases for the pair $\langle C_1, C_2 \rangle$ that produces the least $\Delta WCSS$, the same is true for every other pair $\langle C_i, C_j \rangle$. $\square$
Appendix B. Friedman test over experimental results

We adopted the Friedman test [14] for assessing the statistical significance of the experimental results. On the overall set of 648 experiments, we found the following average ranks for the aRe of each of the 6 studied algorithms:

Algorithm      CLUBS+   BOOL   k-meansCS   BIRCH   k-means++   k-means
Average rank    1.46    1.87     3.39       3.51     4.09       5.04

In order to apply the Friedman test, we first compute

$$\chi^2_F = \frac{12N}{k(k+1)} \sum_{j=1}^{k} (R_j - \mu)^2 = 1688.17,$$

where N is the number of experiments (648 in our case), k is the number of tested algorithms (6 in our case), $R_j$ is the average rank of the j-th algorithm ($1 \le j \le k$), and $\mu$ is the average rank. Then, we compute the $F_F$ statistic as

$$F_F = \frac{(N-1)\,\chi^2_F}{N(k-1) - \chi^2_F} = 703.84,$$

which is distributed according to the F-distribution with $k-1$ and $(k-1) \times (N-1)$ degrees of freedom. In our case, the critical value of $F(6-1, (6-1) \times (648-1)) = F(5, 3235)$ for $\alpha = 0.05$ is 2.27. Therefore, we can reject the null-hypothesis stating that all the algorithms are equivalent, and we can perform the Nemenyi test, to check whether the null-hypothesis on the differences between pairs of algorithms can be rejected. To this end, the critical difference (CD) between the average ranks of two algorithms, needed for rejecting the null-hypothesis, is given by

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} = 0.296,$$

where $q_\alpha = 2.85$ is the critical value for the two-tailed Nemenyi test with $\alpha = 0.05$ and 6 classifiers (see [14]). Therefore, we can reject the null-hypothesis for all the pairs of algorithms, except for BIRCH vs k-meansCS.
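The computation above can be reproduced directly from the reported average ranks; the sketch below does so, with q_α = 2.85 taken from the text. Small differences w.r.t. the reported 1688.17 and 703.84 may arise from the rounding of the average ranks.

```python
import math

# Reproduce the Appendix B computations from the reported average ranks.
avg_ranks = {"CLUBS+": 1.46, "BOOL": 1.87, "k-meansCS": 3.39,
             "BIRCH": 3.51, "k-means++": 4.09, "k-means": 5.04}
N, k = 648, len(avg_ranks)                 # experiments, algorithms
mu = sum(avg_ranks.values()) / k           # average rank

chi2_f = 12 * N / (k * (k + 1)) * sum((r - mu) ** 2 for r in avg_ranks.values())
f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
cd = 2.85 * math.sqrt(k * (k + 1) / (6 * N))   # Nemenyi critical difference

print(f"chi2_F = {chi2_f:.2f}, F_F = {f_f:.2f}, CD = {cd:.3f}")
```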
Appendix C. Clustering Quality Measures
The use of synthetic data allows us to evaluate the quality of the resulting clusters using rigorous accuracy measures based on the F-measure [45] and the Adjusted Rand Error (aRe) [29], described next.
The F-measure [45] (F-m) evaluates the success of retrieving the original clusters in terms of the precision and recall of IR, whereby F-m = 1 denotes perfect retrieval. Thus, let D be a data distribution of n points which are partitioned into l subsets $\{D_1, \ldots, D_l\}$. Now say that a given clustering algorithm applied to the data set D produces the clustering $C = \{C_1, \ldots, C_k\}$. If we denote as
$m_{i,j} = |C_i \cap D_j|$ the number of points in $D_j$ assigned to $C_i$,
$m_{i,all} = |C_i|$ the total number of points in $C_i$,
$m_{all,j} = |D_j|$ the total number of points in $D_j$,
$m_{all,all} = |D|$ the total number of points in D,
then we have that $precision(i,j) = \frac{m_{i,j}}{m_{i,all}}$ and $recall(i,j) = \frac{m_{i,j}}{m_{all,j}}$. Thus, with $F(i,j) = 2 \times \frac{precision(i,j) \times recall(i,j)}{precision(i,j) + recall(i,j)}$, we have:

$$\text{F-m} = \sum_{j=1}^{l} \frac{m_{all,j}}{m_{all,all}} \times \max_i \{F(i,j)\}.$$
A second important criterion is provided by the Adjusted Rand Error (aRe), which is derived from the Adjusted Rand Index described next. The Rand Index measures the number of pairs of points that were matched (or unmatched) in both C and D. Thus let:
• a be the number of pairs of points that were matched in both C and D,
• b be the number of pairs of points that were neither matched in D nor in C,
• c be the number of pairs of points that were matched in D but not in C,
• d be the number of pairs of points that were not matched in D but are now matched in C.
The Rand Index is defined as

$$RI = \frac{a + b}{a + b + c + d}.$$

The Adjusted Rand Index (ARI) was introduced to compensate for the problem that, for a random clustering, RI tends to increase as the number of clusters increases [29]:

$$ARI = \frac{\binom{n}{2}(a + b) - \left[(a+c)(a+d) + (b+c)(b+d)\right]}{\binom{n}{2}^2 - \left[(a+c)(a+d) + (b+c)(b+d)\right]}.$$
It has been shown that the expected value of ARI for a random clustering is 0, whereas it is 1 for a clustering identical to the original one. The Adjusted Rand Error (aRe) is then simply defined as 1 − ARI. Thus, for a perfect clustering we have F-m = 1 and aRe = 0.
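For concreteness, the two measures can be computed as follows from ground-truth and predicted labels; this is a minimal illustration of the definitions above (with an O(n²) pair enumeration for aRe), not the evaluation code used in our experiments.

```python
from collections import Counter
from itertools import combinations

def f_measure(true_labels, pred_labels):
    """F-m: for each ground-truth subset D_j, take the best F(i,j) over
    clusters C_i and weight it by |D_j| / |D|."""
    m = Counter(zip(pred_labels, true_labels))   # m[(i, j)] = |C_i intersect D_j|
    size_c = Counter(pred_labels)                # m_{i,all}
    size_d = Counter(true_labels)                # m_{all,j}
    n = len(true_labels)
    fm = 0.0
    for j, dj in size_d.items():
        best = 0.0
        for i, ci in size_c.items():
            mij = m.get((i, j), 0)
            if mij:
                p, r = mij / ci, mij / dj
                best = max(best, 2 * p * r / (p + r))
        fm += dj / n * best
    return fm

def adjusted_rand_error(true_labels, pred_labels):
    """aRe = 1 - ARI, with ARI computed from the pair counts a, b, c, d."""
    a = b = c = d = 0
    # enumerate every pair of points once (fine only for small examples)
    for (t1, p1), (t2, p2) in combinations(zip(true_labels, pred_labels), 2):
        same_d, same_c = t1 == t2, p1 == p2
        if same_d and same_c:
            a += 1
        elif not same_d and not same_c:
            b += 1
        elif same_d and not same_c:
            c += 1
        else:
            d += 1
    m_pairs = a + b + c + d
    expected = (a + c) * (a + d) + (b + c) * (b + d)
    ari = (m_pairs * (a + b) - expected) / (m_pairs ** 2 - expected)
    return 1.0 - ari

print(f_measure([0, 0, 1, 1], [0, 0, 1, 1]))            # 1.0 for a perfect clustering
print(adjusted_rand_error([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0 for a perfect clustering
```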