Journal of Computational Science 4 (2013) 219–231
Competitive clustering algorithms based on ultrametric properties

S. Fouchal a,∗,1, M. Ahat b,1, S. Ben Amor c,2, I. Lavallée a,1, M. Bui a,1

a University of Paris 8, France
b Ecole Pratique des Hautes Etudes, France
c University of Versailles Saint-Quentin-en-Yvelines, France
Article info
Article history: Received 30 July 2011; Received in revised form 16 October 2011; Accepted 24 November 2011; Available online 12 January 2012.
Keywords: Clustering; Ultrametric; Complexity; Amortized analysis; Average analysis; Ordered space
Abstract
We propose in this paper two new competitive unsupervised clustering algorithms. The first algorithm deals with ultrametric data and has a computational cost of O(n). The second algorithm has two strong features: it is fast and flexible with respect to the processed data type as well as in terms of precision. Its computational cost is O(n²) in the worst case and O(n) in the average case. These complexities are due to the exploitation of ultrametric distance properties. In the first method, we use the order induced by an ultrametric in a given space to show how data proximity can be explored quickly. In the second method, we create an ultrametric space from a sample of the data, chosen uniformly at random, in order to obtain a global view of the proximities in the data set according to the similarity criterion. Then, we use this proximity profile to cluster the whole set. We present an example of our algorithms and compare their results with those of a classic clustering method.
Crown Copyright © 2012 Published by Elsevier B.V. All rights reserved.
1. Introduction

Clustering is a useful process which aims to divide a set of elements into a finite number of groups. These groups are organized such that the similarity between elements of the same group is maximal, while the similarity between elements of different groups is minimal [15,17]. There are several approaches to clustering (hierarchical, partitioning, density-based), which are used in a large variety of fields, such as astronomy, physics, medicine, biology, archaeology, geology, geography, psychology, and marketing [24]. Clustering aims to group the objects of a data set into meaningful subclasses, so it can be used as a stand-alone tool to gain insight into the distribution of data [1,24].

The clustering of high-dimensional data is an open problem encountered by clustering algorithms in different areas [31]. Since the computational cost increases with the size of the data set, feasibility cannot be fully guaranteed.
∗ Corresponding author.
E-mail addresses: [email protected] (S. Fouchal), [email protected] (M. Ahat), [email protected] (S. Ben Amor), [email protected] (I. Lavallée), [email protected] (M. Bui).
1 Laboratoire d'Informatique et des Systèmes Complexes, 41 rue Gay Lussac, 75005 Paris, France, http://www.laisc.net.
2 Laboratoire PRiSM, 45 avenue des Etats-Unis, F-78 035 Versailles, France, http://www.prism.uvsq.fr/.
We suggest in this paper two novel unsupervised clustering algorithms. The first is devoted to ultrametric data. It aims to show rapidly the inner closeness of the data set by providing a general view of the proximities between elements. It has a computational cost of O(n), and thus guarantees the feasibility of clustering high-dimensional data in ultrametric spaces. It can also be used as a preprocessing algorithm to get a quick idea of how the data behave with the similarity measure used.

The second method is general: it is applicable to all kinds of data, since it uses a metric measure of proximity. This algorithm rapidly provides a view of the proximities between the elements of a data set with the desired accuracy. It is based on a sampling approach (see details in [1,15]) and on ultrametric spaces (see details in [20,23,25]). The computational complexity of the second method is, in the worst case (which is rare), O(n²) + O(m²), where n is the size of the data and m the size of the sample. The cost in the average case, which is frequent, is O(n) + O(m²). In both cases m is insignificant; we give proofs of these complexities in Proposition 9. Therefore, we write O(n²) + ε and O(n) + ε to denote the two complexities. This algorithm guarantees the clustering of high-dimensional data sets with the desired precision, by giving more flexibility to the user.

Our approaches are based in particular on properties of ultrametric spaces. Ultrametric spaces are ordered spaces in which data from one cluster are "equidistant" to those of another (e.g. in genealogy: two species belonging to the same family, "brothers", are at the same distance from species of another family, "cousins") [8,9]. We use ultrametric properties in the first algorithm to cluster data without calculating all mutual similarities.
Fig. 1. Steps of clustering process.
The structure induced by the ultrametric distance allows us to obtain general information about the proximity between all elements from just one data point; consequently we reduce the computational cost. Ultrametric spaces can be found in many kinds of data sets, such as phylogeny, where the distance of evolution is constant [16], genealogy, library and information science, and social sciences, to name a few. In the second algorithm, we use an ultrametric to acquire a first insight of the proximity between elements in just a sample of the data w.r.t. the similarity used. Once this view is obtained, we expand it to cluster the whole data set.

The rest of the text is organized as follows. In Section 2, we present a brief overview of clustering strategies. In Section 3, we introduce the notions of metric and ultrametric spaces, distances and balls. Our first algorithm is presented in Section 4. We present an example of this first algorithm in Section 5. The second algorithm is introduced in Section 6. In Section 7, we present an example of the second algorithm and compare our results with those of a classic clustering algorithm. Finally, in Section 8, we give our conclusion and future work.

2. Related work

The term "clustering" is used in several research communities to describe methods for grouping unlabeled data. The typical pattern of this process can be summarized by the three steps of Fig. 1 [21].

Feature selection and extraction are preprocessing techniques which can be used, either separately or together, to obtain an appropriate set of features for clustering. Pattern proximity is calculated by a similarity measure defined on the data set between pairs of objects. This proximity measure is fundamental to clustering; the calculation of mutual measures between elements is essential to most clustering procedures. The grouping step can be carried out in different ways; the best known strategies are defined below [21].

Hierarchical clustering is either agglomerative ("bottom-up") or divisive ("top-down"). The agglomerative approach starts with each element as a cluster and merges them successively until a unique cluster (i.e. the whole set) is formed (e.g. WPGMA [9,10], UPGMA [14]). The divisive approach begins with the whole set and divides it iteratively until it reaches the elementary data. The outcome of hierarchical clustering is generally a dendrogram, which is difficult to interpret when the data set size exceeds a few hundred elements. The complexity of these clustering algorithms is at least O(n²) [28].

Partitional clustering creates clusters by dividing the whole set into k subsets. It can also be used as a divisive step in hierarchical clustering. Among the typical partitional algorithms we can name K-means [5,6,17] and its variants K-medoids, PAM, CLARA and CLARANS. In this kind of algorithm the results depend on the k selected data. Since the number of clusters is defined upstream of the clustering, some clusters can be empty.

Density-based clustering is a process where clusters are regarded as dense regions, leading to the elimination of noise. DBSCAN, OPTICS and DENCLUE are typical algorithms based on this strategy [1,4,7,18,24].

Since the major clustering algorithms calculate similarities between all data prior to the grouping phase (for all types of similarity measure used), the computational cost already reaches O(n²) before the execution of the clustering algorithm itself.
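To make the cost of this pattern-proximity step concrete, the following minimal Python sketch (ours, on synthetic data; not taken from the paper) computes all pairwise distances with SciPy. This is the O(n²) operation that the algorithms proposed below avoid.

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((1000, 16))                 # n = 1000 synthetic feature vectors

condensed = pdist(X, metric="euclidean")   # n*(n-1)/2 mutual distances: the O(n^2) step
D = squareform(condensed)                  # full n x n proximity matrix
print(D.shape)                             # (1000, 1000)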
Our first approach deals with ultrametric spaces: we propose the first (to the best of our knowledge) unsupervised clustering algorithm on this kind of data that does not calculate the similarities between all pairs of data. We thus reduce the computational cost of the process from O(n²) to O(n). We give proofs that, since the processed data are described by an ultrametric distance, we do not need to calculate all mutual distances to obtain information about proximity in the data set (cf. Section 4).

Our second approach is a new flexible and fast unsupervised clustering algorithm which costs mostly O(n) + ε and seldom O(n²) + ε (in the rare worst case), where n is the size of the data set and ε is equal to O(m²), m being the size of an insignificant part (sample) of the global set. Even if the size of the data increases, the complexity of the second proposed algorithm, in the amortized sense [30,32,34], remains O(n) + ε in the average case and O(n²) + ε in the worst case; thus it can process dynamic data, such as those of the Web and social networks, with the same features.

The two approaches can provide overlapping clusters, where one element can belong to more than one or more than two clusters (more general than a weak hierarchy); see [2,4,7] for detailed definitions of overlapping clustering.

3. Definitions

Definition 1. A metric space is a set endowed with a distance between its elements. It is a particular case of a topological space.

Definition 2. We call a distance on a given set E an application d: E × E → R+ which has the following properties for all x, y, z ∈ E:
1 (Symmetry) d(x, y) = d(y, x),
2 (Positive definiteness) d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y,
3 (Triangle inequality) d(x, z) ≤ d(x, y) + d(y, z).

Example 1. The most familiar metric space is the Euclidean space of dimension n, which we will denote by R^n, with the standard formula for the distance:
d((x1, . . ., xn), (y1, . . ., yn)) = ((x1 − y1)² + · · · + (xn − yn)²)^(1/2)   (1)

Definition 3. Let (E, d) be a metric space. If the metric d satisfies the strong triangle inequality:
∀x, y, z ∈ E, d(x, y) ≤ max{d(x, z), d(z, y)},
then it is called an ultrametric on E [19]. The couple (E, d) is an ultrametric space [11,12,29].

Definition 4. We name open ball centered on a ∈ E with radius r ∈ R+ the set {x ∈ E : d(x, a) < r} ⊂ E; it is denoted Br(a) or B(a, r).

Definition 5. We name closed ball centered on a ∈ E with radius r ∈ R+ the set {x ∈ E : d(x, a) ≤ r} ⊂ E; it is denoted Bf(a, r).

Remark 1. Let d be an ultrametric on E. The closed ball on a ∈ E with radius r > 0 is the set B(a, r) = {x ∈ E : d(x, a) ≤ r}.

Proposition 1. Let d be an ultrametric on E; the following properties are true [11]:
1 If a, b ∈ E, r > 0, and b ∈ B(a, r), then B(a, r) = B(b, r),
2 If a, b ∈ E, 0 < i ≤ r, then either B(a, r) ∩ B(b, i) = ∅ or B(b, i) ⊆ B(a, r). This is not true for every metric space,
3 Every ball is clopen (closed and open) in the topology defined by d (i.e. every ultrametrizable topology is zero-dimensional). Thus, the parts are disconnected in this topology. Hence, the space defined by d is homeomorphic to a subspace of a countable product of discrete spaces (cf. Remark 2) (see the proof in [11]).
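As an illustration of Definition 3, the following hedged Python helper (the name, tolerance and example matrix are ours, not the authors') checks the strong triangle inequality on a finite distance matrix; in the 3-point example the two largest mutual distances coincide, so the matrix is ultrametric.

import numpy as np

def is_ultrametric(D: np.ndarray, tol: float = 1e-12) -> bool:
    """Check d(x, y) <= max(d(x, z), d(z, y)) for all triples (O(n^3); illustration only)."""
    n = D.shape[0]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if D[i, j] > max(D[i, k], D[k, j]) + tol:
                    return False
    return True

# Three points whose two largest mutual distances coincide.
D = np.array([[0.0, 0.2, 0.4],
              [0.2, 0.0, 0.4],
              [0.4, 0.4, 0.0]])
print(is_ultrametric(D))   # True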
Remark 2. A topological space is ultrametrizable if and only if it is homeomorphic to a subspace of a countable product of discrete spaces [11].

Definition 6. Let E be a finite set endowed with a distance d. E is classifiable for d if, for all α ∈ R+, the relation Rα on E defined by

∀x, y ∈ E, x Rα y ⇔ d(x, y) ≤ α

is an equivalence relation. Thus, we can derive a partition of E as [33]:

• d(x, y) ≤ α ⇔ x and y belong to the same cluster, or,
• d(x, y) > α ⇔ x and y belong to two distinct clusters.

Example 2. Let x, y and z be three points of the plane endowed with a Euclidean distance d, with d(x, y) = 2, d(y, z) = 3 and d(x, z) = 4. The set is not classifiable for α = 3: x and y are grouped, y and z are grouped, but x and z are not, so the classification leads to an inconsistency.

Proposition 2. A finite set E endowed with a distance d is classifiable if and only if d is an ultrametric distance on E [11].

Proposition 3. An ultrametric distance generates an order in a set: taken three by three, the points form isosceles triangles. Thus, the representation of all data is fixed whatever the angle of view (see Fig. 2).

Proof 1. Let (E, d) be an ultrametric space and let x, y, z ∈ E. Suppose, without loss of generality, that

d(x, y) ≤ d(x, z)   (1)
d(x, y) ≤ d(y, z)   (2)

Then (1) and (2) imply (3) and (4):

d(x, z) ≤ max{d(x, y), d(y, z)} ⇒ d(x, z) ≤ d(y, z)   (3)
d(y, z) ≤ max{d(x, y), d(x, z)} ⇒ d(y, z) ≤ d(x, z)   (4)

(3) and (4) ⇒ d(x, z) = d(y, z).

Fig. 2. Illustration of some ultrametric distances in the plane.

4. First approach: an O(n) unsupervised clustering method on ultrametric data

We propose here the first (to the best of our knowledge) approach to clustering in ultrametric spaces that does not compute the similarity matrix. Ultrametric spaces represent many kinds of data sets, such as genealogy, library and information science, and social sciences data, to name a few. Our proposed method has a computational cost of O(n), which makes it tractable to cluster very large databases. It provides an insight into the proximities between all data.

The idea consists in using the ultrametric space properties, in particular the order induced by the ultratriangular inequality, to cluster all elements with respect to only one of them, chosen uniformly at random. Since the structure of an ultrametric space is frozen (cf. Proposition 3), we do not need to calculate all mutual distances: the distances to just one element are sufficient to determine the clusters, as clopen balls (or closed balls; cf. Proposition 1.3). Consequently, we avoid an O(n²) computation. Our objective is to propose a solution to the problem of computational cost, notably the feasibility of clustering algorithms, on large databases.

4.1. Principle

Hypothesis 1. Consider E a set of data endowed with an ultrametric distance d.

Our method is composed of the 3 following steps:
Step 1. Choose uniformly at random one data point from E (see Fig. 3);
Step 2. Calculate the distances between the chosen data point and all the others;
Step 3. Define thresholds and represent the distribution of all data according to these thresholds and the distances calculated in Step 2 (see Fig. 4).

Fig. 3. Step 1: choosing uniformly at random one element A. We use the same scheme as in Fig. 2 to depict how the elements are structured among themselves, but with our algorithm we do not need to calculate all the distances (i.e. Step 2).
Proposition 4. The clusters are distinguished by the closed balls (or thresholds) around the chosen data point.

Proof 2. An ultrametric space is a classifiable set (cf. Proposition 2). Thus, comparing mutual distances with a given threshold shows whether or not elements belong to the same cluster. Then, the distribution of the data around any element of the set entirely reflects their proximity.

Proposition 5. The random choice of the initial point does not affect the resulting clusters.

Proof 3. Since an ultrametric distance generates an order in a set, the representation of all data is fixed whatever the angle of view. Hence, for every chosen data point the same isosceles triangles are formed.

Proposition 6. The computational cost of this algorithm is equal to O(n).

Proof 4. The algorithm consists in calculating the distances between the chosen data point and the n − 1 other data.

4.2. Algorithm
Procedure UCMUMD:
• Variables:
1. Ultrametric space:
   a. data set E of size n;
   b. ultrametric distance du.
• Begin:
1. Choose uniformly at random one element x ∈ E;
2. Determine the intervals of the clusters; // flexibility depending on context
3. For i = 1 to n: compute du(x, ei) and add ei to the cluster whose interval contains du(x, ei).
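A minimal runnable sketch of UCMUMD, assuming (as the paper does) that an ultrametric du is already available; the function and variable names, and the toy 3-point usage, are ours.

import random
from collections import defaultdict

def ucmumd(elements, du, intervals):
    """elements: data points; du(a, b): ultrametric distance;
    intervals: user-chosen list of (low, high) thresholds (closed balls)."""
    x = random.choice(elements)          # step 1: one origin, chosen uniformly at random
    clusters = defaultdict(list)
    for e in elements:                   # single O(n) pass over the data
        d = du(x, e)                     # step 2: distance to the origin only
        for low, high in intervals:      # step 3: assign e to the ball(s) it falls in
            if low <= d <= high:
                clusters[(low, high)].append(e)
    return clusters

# Toy usage on a 3-point ultrametric.
points = ["x", "y", "z"]
dist = {("x", "y"): 0.2, ("x", "z"): 0.4, ("y", "z"): 0.4}
du = lambda a, b: 0.0 if a == b else dist[tuple(sorted((a, b)))]
print(dict(ucmumd(points, du, [(0.0, 0.3), (0.3, 0.5)])))

Since the number of intervals is a small constant, the loop performs O(n) distance evaluations, in line with Proposition 6.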
5. First approach: example

We present in this section a comparison of our first algorithm with WPGMA. The Weighted Paired Group Method using Averages (WPGMA) is a hierarchical clustering algorithm developed by McQuitty in 1966; it has a computational complexity of O(n²) [27]. We have tested the two methods on sets of 34 and 100 data, respectively.

Let us consider the ultrametric distance matrix of 34 data (see Fig. 5), calculated from a similarity matrix based on the Normalized Information Distance (NID) [10]. The resulting clusters with WPGMA, on the 34 data, are shown in the dendrogram of Fig. 6. To test our method we determine (arbitrarily) the following intervals for the desired clusters (Step 2): [0.000, 0.440], [0.440, 0.443], [0.443, 0.445], [0.445, 0.453], [0.453, 0.457], [0.457, 0.460], [0.463, 0.466], [0.466, 0.469], [0.469, 0.470], [0.470, 0.900].
Fig. 4. Representation of all elements around A, according to the calculated distances and thresholds: the distances are those between clusters and the thresholds are the radii of the balls.
The results with our algorithm (Figs. 7 and 8), using two different data as starting points, show the same proximities between data as those of the WPGMA clustering (cf. Fig. 6). Since the ultrametric distance induces an ordered space, the inter-cluster and intra-cluster inertia required for clustering are guaranteed. We see that even if we change the starting point, the proximity in the data set remains the same (cf. Proposition 5); the clusters are identical, as shown in the figures. Fig. 9 summarizes the results of the two methods on the same dendrogram: it shows that the clusters generated by our algorithm (balls) are similar to those of WPGMA.

We have also compared the two methods on 100 words chosen randomly from a dictionary. The WPGMA results are shown in Fig. 10. Our results are the same whatever the chosen data (as we have seen in the first example); we choose here the data "toit", and Fig. 11 shows the resulting clusters. We see in the comparisons (Fig. 12) that, first, the results of our method are similar to those of the hierarchical clustering (WPGMA), hence the consistency of our algorithm. Second, the clusters remain unchanged whatever the selected data.

NB: A few differences between the results are due to the (arbitrarily) chosen threshold values, just as slightly different classes are obtained with different cuts in the dendrograms.

Our first objective was to propose a method which provides a general view of data proximity, in particular in large databases. This objective is achieved by giving a consistent result in just O(n) iterations, which makes clustering feasible on high-dimensional data.

6. Second approach: fast and flexible unsupervised clustering algorithm

Our second approach is a new unsupervised clustering algorithm which is valuable for any kind of data, since it uses a distance to measure the proximity between objects. The computational cost of this algorithm is O(n) + ε in the best and average case and O(n²) + ε in the worst case, which is rare. The value ε = O(m²) (where m is the size of a sample) is negligible compared to n.
Fig. 5. Ultrametric distance matrix of 34 data.

Fig. 6. Clustering of the 34 data with WPGMA.
Fig. 7. Results using fin-whale as the chosen origin.
Thus, our algorithm has the capability to treat very large databases, and the user can obtain the desired accuracy on the proximities between elements. The idea consists in exploiting ultrametric properties, in particular the order induced by the ultratriangular inequality on a given space. First, we deduce the behavior of all data relative to the distance used, by creating an ultrametric (ordered) space from just a sample of the data (a subset chosen uniformly at random, of size m, small compared to n). Then, we cluster the global data set according to this order information. The construction of the ordered space from the sample costs ε = O(m²) operations.

Once this ultrametric space (structured sample) is built, we use the ultrametric distances to fix the thresholds (or intervals) of the clusters, which depend on the choice of the user. This freedom of choice makes our algorithm useful for all kinds of data and makes the user an actor in the construction of the algorithm. After determining the clusters and their intervals, we select uniformly at random one representative per cluster, then we compare them with the rest of the data in the global set (of size n − m), one by one. Next, we add each data point to the cluster of the closest representative according to the intervals.
If the compared data point is far from all the representatives, we create a new cluster.

Remark 3. This step costs O(x · n) + ε operations in the average case, where x is the number of clusters, which is generally insignificant compared to n; therefore we keep only O(n) + ε. In the worst case, where the clustering produces only singletons (which is rare, since the aim of clustering is to group objects), x is equal to n, and consequently the computational cost is O(n²) + ε.

6.1. Principle

Hypothesis 2.
Consider E a set of data and a distance d.
The method is composed of the following steps:

Step 1. Choose uniformly at random a sample of data from E, of size m (e.g. m = n/10,000 if n = 1 M).

Remark 4. The choice of the sample depends on the processed data and on the expert of the domain. The user is an actor in the execution of the algorithm: he can intervene in the choice of the input and refine the output (see details in Step 4).

Remark 5. The value of m depends on the value of n compared to the bounds of d [3].

Example 3. Consider a set of n = 100,000 data and a distance d ∈ [0, 0.5]. If we use a sample of 50 data chosen uniformly at random, we can easily get a broad idea of how the data behave with respect to d. But if n = 500 and d ∈ [0, 300], and we choose a sample of 5 data, we cannot be sure to get enough information about how the data behave with d.

Remark 6. The larger n is, the smaller m is compared to n [3].

Step 2. Execute a classic hierarchical clustering algorithm (e.g. WPGMA) with the distance d on the chosen sample.

Step 3. Represent the distances in the resulting dendrogram; the ultrametric space is thus built (a code sketch of Steps 2 and 3 is given below, after Proposition 9).

Step 4. Deduce the cluster intervals (closed balls or thresholds), which depend on the clustering sought after and on d, for a broad or a precise view of the proximities between elements (this step must be specified by the user or a specialist of the domain).

Step 5. Choose uniformly at random one representative per cluster from the result of Step 2.

Step 6. Pick the rest of the data one by one and compare them, according to d, with the cluster representatives of the previous step:
• If the compared data point is close to one (or more) of the representatives w.r.t. the defined thresholds, then add it to the cluster of that representative.
• Else, create a new cluster.

Remark 7. If the compared data point is close to more than one representative, then it will be added to more than one cluster, consequently generating an overlapping clustering [4,7].

Proposition 7. The clusters are distinguished by the closed balls (or thresholds).

Proof 5. An ultrametric space is a classifiable set (cf. Proposition 2). Thus, comparing mutual distances with a given threshold shows whether or not elements belong to the same cluster.

Proposition 8. The random choice of the initial points does not affect the resulting clusters.

Proof 6. Since the generated space is ultrametric, it is endowed with a strong intra-cluster inertia that allows this choice.

Proposition 9. The computational cost of our algorithm is, in the average case, equal to O(n), and in the rare worst case, equal to O(n²).
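As a hedged illustration of Steps 2 and 3 (our sketch, not the authors' implementation): SciPy's "weighted" linkage computes WPGMA, and the cophenetic distances of the resulting dendrogram define an ultrametric on the sample. The sample below is synthetic and its size is arbitrary.

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
sample = rng.random((50, 16))              # m = 50 points drawn from the data set (synthetic here)

condensed = pdist(sample)                  # O(m^2) pairwise distances, on the sample only
Z = linkage(condensed, method="weighted")  # step 2: WPGMA dendrogram of the sample
ultra = squareform(cophenet(Z))            # step 3: cophenetic (ultrametric) distance matrix
print(ultra.shape)                         # (50, 50)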
Fig. 8. Results using horse as the chosen origin.
Fig. 9. Clustering with our method and WPGMA.
Table 1
Example 1: clustering of the 34 data in O(n) + ε. The left column gives the thresholds (representative and distance interval); the right column gives the resulting clusters.

Thresholds: Pig: closest items by d ≤ 0.881; Pig: d ∈ [0.881, 0.889]; Pig: d ∈ [0.889, 0.904]; Pig: d ∈ [0.904, 0.919]; Pig: d ∈ [0.919, 0.929]; Bluewhale: d ≤ 0.881; Bluewhale: d ∈ [0.881, 0.889]; Bluewhale: d ∈ [0.889, 0.904]; Bluewhale: d ∈ [0.904, 0.919]; Bluewhale: d ∈ [0.919, 0.929]; Aardvark: d ≤ 0.881; Aardvark: d ∈ [0.881, 0.889]; Aardvark: d ∈ [0.889, 0.904]; Aardvark: d ∈ [0.904, 0.919]; Aardvark: d ∈ [0.919, 0.929]; Guineapig: d ≤ 0.881; Guineapig: d ∈ [0.881, 0.889]; Guineapig: d ∈ [0.889, 0.904]; Guineapig: d ∈ [0.904, 0.919]; Guineapig: d ∈ [0.919, 0.929]; Platypus: d ∈ [0.881, 0.929].

Clusters: Whiterhinoceros, cow; Indian-rhinoceros, gray-seal; harbor-seal, cat, sheep; Dog; Pigmy-chimpanzee, orangutan, gibbon, human, gorilla; Chimpanzee, baboon; Fin-whale; Donkey, horse; Rat, mouse, hippopotamus, fat-dormouse, armadillo, fruit-bat, squirrel, rabbit; Elephant, opossum, wallaroo.
Fig. 10. Clustering of 100 words with WPGMA.
Proof 7.
• In both cases we ignore the complexity of clustering the sample and constructing the ultrametric space (i.e. Steps 2 and 3), because the size of the sample is insignificant compared to the cardinality of the global data set.
• The worst case, where the clustering produces only singletons, is rare, because this process aims to group similar data into clusters whose number is usually small compared to the data set size.
• Since there is homogeneity between data according to the proximity measure used, the number of clusters is unimportant compared to the cardinality of the data set; consequently the complexity of our algorithm is frequently O(n).

6.2. Algorithm
Procedure FFUCA
• Variables:
1. Metric space:
   a. data set E of size n;
   b. distance d.
• Begin:
1. Choose a sample of data of size adjustable according to the data type (we use root(|E|) as the sample size); // flexibility for the expert of the data
2. Cluster the sample with a classic hierarchical clustering algorithm; // complexity O(m²), where m = sample size (O(n) here)
3. Deduce an ultrametric space from the clustered sample; // to exploit ultrametric properties in the clustering, complexity O(n)
4. Determine intervals from the hierarchy of step 3; // flexibility for the user to refine the size of the output clusters
5. Choose uniformly at random one representative per cluster from the clusters of step 4;
6. For i = 1 to n − m: compare data point ei with the representatives and add it to the cluster(s) of every representative within the defined intervals; if it is far from all representatives, create a new cluster.
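A runnable sketch of FFUCA under stated assumptions: SciPy's WPGMA ("weighted" linkage) stands in for the classic hierarchical clustering of the sample, a single outer threshold replaces the full list of user-defined intervals, and all names, sizes and the synthetic data are ours, not the authors'.

import math
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, cdist

def ffuca(X, threshold, metric="euclidean", rng=None):
    """X: (n, p) data matrix; threshold: outer cut-off distance standing in
    for the user-defined intervals."""
    rng = rng or np.random.default_rng()
    n = len(X)
    m = max(2, int(math.sqrt(n)))                       # sample size root(|E|), as in step 1
    sample_idx = rng.choice(n, size=m, replace=False)   # step 1: uniform random sample
    sample = X[sample_idx]

    Z = linkage(pdist(sample, metric=metric), method="weighted")   # step 2: WPGMA on the sample
    labels = fcluster(Z, t=threshold, criterion="distance")        # steps 3-4: cut the dendrogram

    # Step 5: one representative per sample cluster, chosen uniformly at random.
    reps = np.array([sample[rng.choice(np.flatnonzero(labels == c))]
                     for c in np.unique(labels)])
    clusters = {r: [] for r in range(len(reps))}

    # Step 6: compare every remaining point with the representatives only (O(nb * n)).
    rest = np.setdiff1d(np.arange(n), sample_idx)
    for i, row in zip(rest, cdist(X[rest], reps, metric=metric)):
        close = np.flatnonzero(row <= threshold)        # representatives within the threshold
        if close.size == 0:
            clusters[len(clusters)] = [int(i)]          # far from all representatives: new cluster
        else:
            for r in close:                             # may join several clusters (overlap)
                clusters[int(r)].append(int(i))
    return clusters

# Toy usage on synthetic data; the threshold value is illustrative.
data = np.random.default_rng(2).random((200, 8))
print({k: len(v) for k, v in ffuca(data, threshold=0.8).items()})

As in Remark 3, only the comparisons with the nb representatives are performed for the n − m remaining points, which is where the O(n) + ε average cost comes from.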
Fig. 11. Clustering of 100 words with the chosen data "toit".
7. Example

This section is devoted to a comparison of our algorithm with WPGMA (cf. Section 5). We have tested the two methods on the same set of 34 mitochondrial DNA sequences as in the first example, but here we used the metric (not ultrametric) Normalized Information Distance (NID) [10] to calculate the proximities between elements. For WPGMA, we calculate a similarity matrix, which costs O(n²) (see Appendix A). With our proposed method, we do not need to calculate the matrix: we only calculate the distances between the chosen representatives and the rest of the elements of the whole set, a step which costs O(nb · n) operations (where nb is the number of representatives). The resulting dendrogram with WPGMA on the 34 data is shown in Fig. 6 (cf. Section 5).

We execute FFUCA twice with different inputs chosen randomly. The first step consists in choosing uniformly at random a sample of the data, clustering this sample with a classic clustering algorithm according to the similarity used, and then creating an ultrametric space from the resulting clusters. In this first example we choose arbitrarily the following data: bluewhale, hippopotamus, pig, aardvark, guineapig and chimpanzee. We clustered the chosen data and established the ordered (ultrametric) space using WPGMA; the result is depicted in Fig. 13. The second step is the deduction of the intervals (clusters); thus we use the valuation of the dendrogram of Fig. 13. We keep here five thresholds: [0, 0.881], [0.881, 0.889], [0.889, 0.904], [0.904, 0.919], [0.919, 0.929]. Since the size of the data set is small (34), we use the chosen data themselves as cluster representatives. Table 1 shows the resulting clusters with the first inputs; the intervals are represented in the left column and the clusters in the right column.

NB: The chosen elements belong only to the first lines (e.g. pig ∈ {whiterhinoceros, cow}). The last lines in the table represent the new clusters, which are far from all other clusters; they are considered as outliers.

In the second example we choose donkey, gray-seal, pig, fat-dormouse and gorilla as starting data. After the clustering and the construction of the ultrametric space from this chosen sample, we keep the following intervals: [0, 0.8827], [0.8827, 0.8836], [0.8836, 0.8900], [0.8900, 0.900].
Fig. 12. Clustering of 100 words with FFUCA and WPGMA.
Table 2
Example 2: clustering of the 34 data in O(n) + ε. The left column gives the thresholds (representative and distance interval); the right column gives the resulting clusters.

Thresholds: Donkey: closest items by d ≤ 0.8827; Donkey: d ∈ [0.8827, 0.8836]; Donkey: d ∈ [0.8836, 0.8900]; Donkey: d ∈ [0.8900, 0.9000]; Gray-seal: d ≤ 0.8827; Gray-seal: d ∈ [0.8827, 0.8836]; Gray-seal: d ∈ [0.8836, 0.8900]; Gray-seal: d ∈ [0.8900, 0.9000]; Pig: d ≤ 0.8827; Pig: d ∈ [0.8827, 0.8836]; Pig: d ∈ [0.8836, 0.8900]; Pig: d ∈ [0.8900, 0.9000]; Fat-dormouse: d ≤ 0.8827; Fat-dormouse: d ∈ [0.8827, 0.8836]; Fat-dormouse: d ∈ [0.8836, 0.8900]; Fat-dormouse: d ∈ [0.8900, 0.9000]; Gorilla: d ≤ 0.8827; Gorilla: d ∈ [0.8827, 0.8836]; Gorilla: d ∈ [0.8836, 0.8900]; Gorilla: d ∈ [0.8900, 0.9000]; Armadillo: d ∈ [0.8827, 0.9000]; Platypus: d ∈ [0.8827, 0.9000]; Aardvark: d ∈ [0.8827, 0.9000]; Mouse: d ∈ [0.8827, 0.9000]; Wallaroo: d ∈ [0.8827, 0.9000]; Elephant: d ∈ [0.8827, 0.9000]; Guineapig: d ∈ [0.8827, 0.9000].

Clusters: Indian-rhinoceros, whiterhinoceros, horse; Opossum; Harbor-seal, cat; Dog; Cow, fin-whale, bluewhale; Sheep; Hippopotamus; Squirrel, rabbit; Chimpanzee, pigmy-chimpanzee, human, orangutan, gibbon; Baboon; Rat.

Fig. 13. Clustering of the chosen data: construction of the ordered space.

We do not list the details of the different steps for this second example; the results are shown in Table 2. We see in the tables that, first, our results are similar to those of the hierarchical clustering (WPGMA), hence the consistency of our algorithm. Second, the clusters remain unchanged whatever the selected data.

The main objective of this section was to propose a flexible unsupervised clustering method which rapidly provides the desired view of data proximity. This objective is achieved by giving consistent results, from the point of view of intra-cluster and inter-cluster inertia for the purpose of clustering (i.e. indicating which elements are closest and which are furthest according to the similarity), in generally O(n) + ε and rarely O(n²) + ε iterations, which makes clustering feasible on high-dimensional data. This algorithm can be applied to large data sets thanks to its complexity. It can be used easily thanks to its simplicity. It can treat dynamic data while keeping a competitive computational cost: the amortized complexity [30,32,34] is O(n) + ε on average and rarely O(n²) + ε. Our method can also be used as a hierarchical clustering method, by inserting the elements into the first (sample) dendrogram according to the intervals.

8. Conclusion
We proposed in this paper two novel competitive unsupervised clustering algorithms which can overcome the problem of feasibility in large databases thanks to their calculation times. The first algorithm treats ultrametric data and has complexity O(n). We gave proofs that in an ultrametric space we can avoid the calculation of all mutual distances without losing information about the proximities in the data set. The second algorithm deals with all kinds of data. It has two strong particularities. The first feature is flexibility: the user is an actor in the construction and the execution of the algorithm, and can manage the input and refine the output. The second characteristic is the computational cost, which is O(n) in the average case and O(n²) in the rare worst case. We showed that our methods treat data faster while preserving the information about proximities, and that the result does not depend on a specific parameter.

Our future work aims first to test our methods on real data and compare them with other clustering algorithms by using validation measures, such as the V-measure and the F-measure. Second, we intend to generalize the second method to treat dynamic data such as those of social networks and the detection of relevant information on web sites.
Appendix A. Metric distance matrix See Fig. A.1.
Fig. A.1. Metric distance matrix of 34 data.
References

[1] M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, in: ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999.
[2] J.P. Barthélemy, F. Brucker, Binary clustering, Journal of Discrete Applied Mathematics 156 (2008) 1237–1250.
[3] K. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is nearest neighbor meaningful? in: Proceedings of the 7th International Conference on Database Theory (ICDT), LNCS 1540, Springer-Verlag, London, UK, 1999, pp. 217–235.
[4] F. Brucker, Modèles de classification en classes empiétantes, PhD Thesis, Equipe d'accueil: Dep. IASC de l'École Supérieure des Télécommunications de Bretagne, France, 2001.
[5] M. Chavent, Y. Lechevallier, Empirical comparison of a monothetic divisive clustering method with the Ward and the k-means clustering methods, in: Proceedings of the IFCS Data Science and Classification, Springer, Ljubljana, Slovenia, 2006, pp. 83–90.
[6] G. Cleuziou, An extended version of the k-means method for overlapping clustering, in: Proceedings of the Nineteenth International Conference on Pattern Recognition, 2008.
[7] G. Cleuziou, Une méthode de classification non-supervisée pour l'apprentissage de règles et la recherche d'information, PhD Thesis, Université d'Orléans, France, 2006.
[8] S. Fouchal, M. Ahat, I. Lavallée, M. Bui, An O(N) clustering method on ultrametric data, in: Proceedings of the 3rd IEEE Conference on Data Mining and Optimization (DMO), Malaysia, 2011, pp. 6–11.
[9] S. Fouchal, M. Ahat, I. Lavallée, Optimal clustering method in ultrametric spaces, in: Proceedings of the 3rd IEEE International Workshop on Performance Evaluation of Communications in Distributed Systems and Web-based Service Architectures (PEDISWESA), Greece, 2011, pp. 123–128.
[10] S. Fouchal, M. Ahat, I. Lavallée, M. Bui, S. Benamor, Clustering based on Kolmogorov complexity, in: Proceedings of the 14th International Conference on Knowledge-based and Intelligent Information and Engineering Systems (KES), Part I, Cardiff, UK, Springer, 2010, pp. 452–460.
[11] L. Gajić, Metric locally constant function on some subset of ultrametric space, Novi Sad Journal of Mathematics (NSJOM) 35 (1) (2005) 123–125.
[12] L. Gajić, On ultrametric space, Novi Sad Journal of Mathematics (NSJOM) 31 (2) (2001) 69–71.
[14] I. Gronau, S. Moran, Optimal implementations of UPGMA and other common clustering algorithms, Information Processing Letters 104 (6) (2007) 205–210.
[15] S. Guha, R. Rastogi, K. Shim, An efficient clustering algorithm for large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, No. 28, 1998, pp. 73–84.
[16] S. Guindon, Méthodes et algorithmes pour l'approche statistique en phylogénie, PhD Thesis, Université Montpellier II, 2003.
[17] A. Hatamlou, S. Abdullah, Z. Othman, Gravitational search algorithm with heuristic search for clustering problems, in: 3rd IEEE Conference on Data Mining and Optimization (DMO), Malaysia, 2011, pp. 190–193.
[18] A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD), New York, USA, 1998, pp. 58–65.
[19] L. Hubert, P. Arabie, J. Meulman, Modeling dissimilarity: generalizing ultrametric and additive tree representations, British Journal of Mathematical and Statistical Psychology 54 (2001) 103–123.
[20] L. Hubert, P. Arabie, Iterative projection strategies for the least squares fitting of tree structures to proximity data, British Journal of Mathematical and Statistical Psychology 48 (1995) 281–317.
[21] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computing Surveys 31 (3) (1999) 264–323.
[23] M. Krasner, Espace ultramétrique et valuation, Séminaire Dubreil Algèbre et théorie des nombres, Tome 1, exp. no. 1, 1–17, 1947–1948.
[24] H.P. Kriegel, P. Kroger, A. Zymek, Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Transactions on Knowledge Discovery from Data 3 (1) (2009), Article 1.
[25] B. Leclerc, Description combinatoire des ultramétriques, Mathématiques et Sciences Humaines 73 (1981) 5–37.
[27] F. Murtagh, Complexities of hierarchic clustering algorithms: state of the art, Computational Statistics Quarterly 1 (2) (1984) 101–113.
[28] F. Murtagh, A survey of recent advances in hierarchical clustering algorithms, The Computer Journal 26 (4) (1983) 354–359.
[29] K.P.R. Rao, G.N.V. Kishore, T.R. Rao, Some coincidence point theorems in ultra metric spaces, International Journal of Mathematical Analysis 1 (18) (2007) 897–902.
[30] D.D. Sleator, R.E. Tarjan, Amortized efficiency of list update and paging rules, Communications of the ACM 28 (1985) 202–208.
[31] M. Sun, L. Xiong, H. Sun, D. Jiang, A GA-based feature selection for high-dimensional data clustering, in: Proceedings of the 3rd International Conference on Genetic and Evolutionary Computing (WGEC), 2009, pp. 769–772.
[32] R.E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1983.
[33] M. Terrenoire, D. Tounissoux, Classification hiérarchique, Université de Lyon, 2003.
[34] A.C.-C. Yao, Probabilistic computations: toward a unified measure of complexity, in: Proceedings of the 18th IEEE Conference on Foundations of Computer Science, 1977, pp. 222–227.

Said Fouchal is a PhD student at the University of Paris 8. The topic of his research work is the clustering of complex objects. His research is based on ultrametrics, computational complexity, Kolmogorov complexity and clustering.