Accelerating k-medoid-based algorithms through metric access methods


Maria Camila N. Barioni, Humberto L. Razente, Agma J.M. Traina, Caetano Traina Jr.
Computer Sciences Department, ICMC/USP, Caixa Postal 668, 13560-970 São Carlos, SP, Brazil
Available online 5 July 2007
A preliminary version of this paper was presented in Barioni et al. (2006).

Abstract

Scalable data mining algorithms have become crucial to efficiently support KDD processes on large databases. In this paper, we address the task of scaling up k-medoid-based algorithms through the utilization of metric access methods, allowing clustering algorithms to be executed by database management systems in a fraction of the time usually required by the traditional approaches. We also present an optimization strategy that can be applied as an additional step of the proposed algorithm in order to achieve better clustering solutions. Experimental results based on several datasets, including synthetic and real ones, show that the proposed algorithm can reduce the number of distance calculations by a factor of more than three thousand times when compared to existing algorithms, while producing clusters of equivalent quality. © 2007 Elsevier Inc. All rights reserved.

1. Introduction

Clustering is one of the key techniques in the KDD (Knowledge Discovery in Databases) process. It is usually applied to uncover hidden structures underlying a collection of objects. Briefly, clustering is the process of dividing the data into groups of similar objects according to a similarity measure. The goal is that each group, or cluster, be composed of objects that are similar to each other and dissimilar to objects of other groups (Han and Kamber, 2001).
In recent decades, several clustering algorithms have been developed for a large spectrum of applications. They can be divided into two main groups: partitioning and hierarchical clustering algorithms. Hierarchical algorithms produce a cluster hierarchy consisting of several levels of nested partitions of the dataset.



On the other hand, partitioning algorithms try to find the best k partitions of a dataset, thus creating a single-level partition that divides the data into k clusters. Examples of hierarchical algorithms are Single-Link methods (Sibson, 1973), CURE (Clustering Using Representatives) (Guha et al., 1998) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) (Zhang et al., 1996). Partitioning algorithms include k-means (Hartigan and Wong, 1979) and k-medoid (Kaufman and Rousseeuw, 2005). A description of several clustering algorithms can be found in Jain et al. (1999).
The k-means algorithm is the most popular among the algorithms mentioned above due to its simplicity and efficiency. However, k-medoid-based algorithms have been shown to be more robust, since they are less sensitive to outliers, do not impose limitations on attribute types, and produce a clustering that does not depend on the input order of the dataset. Moreover, they are invariant to translations and orthogonal transformations of objects (Kaufman and Rousseeuw, 2005). Another limitation of the k-means algorithm is that it only works on multi-dimensional datasets, where new points can be created by setting values for each coordinate, even when no object lies at that point. Therefore, the cluster centers obtained by this algorithm do not correspond to objects in the dataset.


On the other hand, k-medoid-based algorithms can be employed to find clusters in multi-dimensional datasets or in any other kind of dataset where a distance function is defined, so they are applicable to a much broader range of applications. The drawback of k-medoid-based algorithms is that they are very time consuming and, therefore, cannot be efficiently applied to large datasets. This drawback has motivated the development of several approaches aiming at reducing the computational effort needed to execute these algorithms (Ester et al., 1995; Chu et al., 2002; Zhang and Couloigner, 2005). A common strategy is to extract a sample or to perform a selection of representative objects before applying the whole or part of the clustering algorithm to the resulting subset of objects. The quality of the resulting clusters is usually dependent on the selection of a relevant subset of objects.
In this paper, we present a technique that improves the efficiency of k-medoid-based algorithms using a Metric Access Method (MAM) to select the representative objects over which these algorithms will be applied. We also take advantage of the MAM to define a strategy that employs nearest-neighbor search operations to analyze the neighborhood of the medoids returned by the proposed algorithm, improving the quality of the clusters obtained. In particular, we illustrate our technique using the MAM Slim-tree (Traina et al., 2002). It is worth noting that any other dynamic MAM based on the use of representative objects to partition the data space, such as the M-tree (Ciaccia et al., 1997), could equally be used. Although multidimensional access methods have already been used in other clustering approaches (Ester et al., 1995; Hwang et al., 2004), to the best of our knowledge, no MAM has yet been applied to reduce the time complexity of k-medoid-based algorithms.
The remainder of the paper is organized as follows. Section 2 describes the existing k-medoid-based algorithms and the Slim-tree. Section 3 presents our new algorithm, developed to scale up k-medoid-based algorithms and thus allow fast clustering of large databases, together with its optimization strategy. An experimental evaluation of the approaches presented herein is shown in Section 4. Finally, Section 5 gives the conclusions of this paper and suggestions for future work.

2. Basic concepts

This section presents the existing k-medoid-based algorithms and a brief description of the main aspects of the MAM Slim-tree. The symbols and definitions presented in Table 1 are used throughout this paper.

2.1. Clustering algorithms

The objective of k-medoid-based algorithms is to find a non-overlapping set of clusters, so that each cluster has one representative object (the medoid), i.e., an object that is the most centrally located in the cluster, considering a dissimilarity measure expressed as a distance function d(si, sj), where si and sj are elements of a dataset S.

Table 1. Summary of symbols and definitions

Symbol    Definition
k         Number of clusters
n         Number of objects in the dataset
S         Set of objects to be clustered
sj        An object ∈ S
ri, rc    Objects ∈ R
R         Set of objects ∈ S selected as medoids
d( )      Dissimilarity measure (distance function) between two objects

Given a set S of objects and a number k of clusters to be found, these algorithms perform two main steps:
• An initialization step, where an initial set R ⊂ S of k objects is selected as medoids;
• An evaluation step, which aims at minimizing an objective function, usually based on the sum of the distances between the non-selected objects and their medoids, i.e., the evaluation step minimizes

\sum_{i=1}^{k} \sum_{j=1}^{n} d(r_i, s_j),    (1)

where sj ∈ S and d(ri, sj) < d(rc, sj) ∀ ri, rc ∈ R, ri ≠ rc. The smaller the sum of distances between the medoids and the other objects of their clusters, the better the clustering.
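As an illustration of Eq. (1), the minimal C++ sketch below computes the objective value for a given set of medoids by assigning each object to its nearest medoid and summing the distances. The Object type and the Euclidean (L2) distance are assumptions made only for this example, not part of the paper's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical object representation and L2 distance; any metric d( ) would do.
using Object = std::vector<double>;

double dist(const Object& a, const Object& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Eq. (1): every object sj contributes its distance to the closest medoid ri.
double clusteringCost(const std::vector<Object>& S,
                      const std::vector<std::size_t>& medoids) {
    double total = 0.0;
    for (const Object& s : S) {
        double nearest = std::numeric_limits<double>::max();
        for (std::size_t m : medoids) nearest = std::min(nearest, dist(s, S[m]));
        total += nearest;
    }
    return total;
}
```

Any metric distance (for instance L1, or the metric-histogram distance used for the MedHisto dataset) can replace dist without changing the rest of the computation.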

The three best known k-medoid-based algorithms in the literature are PAM (Partitioning Around Medoids), CLARA (Clustering LARge Applications) and CLARANS (Clustering Large Applications based upon RANdomized Search) (Kaufman and Rousseeuw, 2005). The main aspects of each of these algorithms are described in the following subsections.

2.1.1. PAM algorithm

PAM (Kaufman and Rousseeuw, 1987) is one of the earliest k-medoid-based algorithms. It is based on an iterative optimization process that evaluates the effect of swapping a medoid with a non-medoid object, relocating objects between candidate clusters. It can be expressed as follows.
(1) Build phase: an initial set R of k representative objects is selected. The first selected object is the one for which the sum of the dissimilarities to all other objects is as small as possible; thus, the first selected object is the dataset medoid. The other (k − 1) medoids are selected subsequently, one at a time, choosing the objects that most decrease the objective function.
(2) Swap phase: this phase computes the total cost Tih for all pairs of objects ri and sh, where ri ∈ R is currently selected and sh ∈ S is not selected.


(3) Selection phase: this phase selects the pair (ri, sh) that minimizes Tih. If the minimum Tih is negative, the swap is carried out and the algorithm re-iterates Step (2). Otherwise, for each non-selected object the most similar medoid is found, and the algorithm stops.
The guiding principle of the PAM clustering process resides in Step (2). As can be seen, it requires evaluating all objects that are currently not medoids, and thus it has a very high computational cost, O(k(n − k)²) per iteration (Ng and Han, 1994). The PAM algorithm results in high-quality clusters, as it tries every possible combination, working effectively for small datasets (e.g., 100 objects in 5 clusters). However, due to its computational complexity, it is not practical for clustering large datasets.
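A compact sketch of one swap iteration is given below, reusing the Object, dist and clusteringCost helpers from the sketch after Eq. (1). For clarity it recomputes the full objective for every candidate swap, whereas PAM maintains the costs Tih incrementally; the selection rule (apply the most negative Tih, otherwise stop) is the one described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One PAM iteration: evaluate Tih for every (medoid ri, non-medoid sh) pair and apply
// the best swap if it decreases the objective. Returns true when a swap was made, so
// the caller re-iterates; false means no improving swap exists and the algorithm stops.
bool pamSwapIteration(const std::vector<Object>& S, std::vector<std::size_t>& medoids) {
    const double current = clusteringCost(S, medoids);
    double bestDelta = 0.0;
    std::size_t bestI = 0, bestH = 0;
    for (std::size_t i = 0; i < medoids.size(); ++i) {
        for (std::size_t h = 0; h < S.size(); ++h) {
            if (std::find(medoids.begin(), medoids.end(), h) != medoids.end()) continue;
            std::vector<std::size_t> candidate = medoids;
            candidate[i] = h;                                          // swap ri and sh
            const double tih = clusteringCost(S, candidate) - current; // total cost Tih
            if (tih < bestDelta) { bestDelta = tih; bestI = i; bestH = h; }
        }
    }
    if (bestDelta < 0.0) { medoids[bestI] = bestH; return true; }
    return false;
}
```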

2.1.2. CLARA algorithm

The computational complexity of the PAM algorithm motivated the development of CLARA (Kaufman and Rousseeuw, 1986), a clustering algorithm based on sampling. CLARA draws multiple samples of the dataset and applies PAM to each sample. Thereafter, each object of the entire dataset is assigned to the resulting medoids, the objective function is computed, and the best set of medoids is returned as the output. Experiments described in Kaufman and Rousseeuw (2005) indicated that five samples of size 40 + 2k give satisfactory results. The computational complexity of each iteration of CLARA is O(kb² + k(n − k)), where b is the size of the sample.

2.1.3. CLARANS algorithm

CLARANS (Ng and Han, 1994) was developed in the context of spatial data mining. It uses a randomized search strategy in order to improve on both PAM and CLARA in terms of efficiency (computational complexity or time) and effectiveness (average distortion over the distances), respectively. When searching for a better medoid in the evaluation step, CLARANS randomly chooses objects from the remaining (n − k) objects. The number of objects tried in this step is restricted by a user-provided parameter (maxNeighbor). If no better solution is found after maxNeighbor attempts, a local optimum is assumed to have been reached. The procedure continues until numLocal local optima have been found. It was recommended that the parameters maxNeighbor and numLocal be set to max(250, 1.25% of k(n − k)) and 2, respectively (Ng and Han, 2002). The computational complexity of CLARANS is O(n²) in terms of the number of objects.
Three focusing techniques were proposed in Ester et al. (1995), employing R*-trees (Beckmann et al., 1990), in order to make CLARANS more efficient for large spatial databases. However, these techniques cannot be extended to non-dimensional databases.
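A minimal sketch of the CLARANS search loop follows, again assuming the Object, dist and clusteringCost helpers from the sketch after Eq. (1). The parameter names maxNeighbor and numLocal follow the description above; the seeding of the random-number generator and the brute-force cost evaluation are simplifications for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

std::vector<std::size_t> clarans(const std::vector<Object>& S, std::size_t k,
                                 int numLocal, int maxNeighbor) {
    std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<std::size_t> anyObject(0, S.size() - 1);
    std::uniform_int_distribution<std::size_t> anyMedoid(0, k - 1);
    std::vector<std::size_t> best;
    double bestCost = std::numeric_limits<double>::max();
    for (int restart = 0; restart < numLocal; ++restart) {
        // Random initial set of k medoids.
        std::vector<std::size_t> indices(S.size());
        std::iota(indices.begin(), indices.end(), 0);
        std::shuffle(indices.begin(), indices.end(), rng);
        std::vector<std::size_t> current(indices.begin(), indices.begin() + k);
        double currentCost = clusteringCost(S, current);
        int failedAttempts = 0;
        while (failedAttempts < maxNeighbor) {         // randomized neighbor search
            std::size_t h = anyObject(rng);
            if (std::find(current.begin(), current.end(), h) != current.end()) continue;
            std::vector<std::size_t> neighbor = current;
            neighbor[anyMedoid(rng)] = h;              // swap one random medoid for sh
            double cost = clusteringCost(S, neighbor);
            if (cost < currentCost) { current = neighbor; currentCost = cost; failedAttempts = 0; }
            else ++failedAttempts;
        }
        if (currentCost < bestCost) { bestCost = currentCost; best = current; }  // local optimum
    }
    return best;
}
```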

2.2. The MAM Slim-tree

The basic structure of any metric tree – such as the well-known M-tree (Ciaccia et al., 1997), the Slim-tree (Traina et al., 2002) and the DBM-tree (Vieira et al., 2004) – divides the data space into regions, using representatives with which the other objects in each region are associated. The objects of each region are stored in a node that has a covering radius; only objects within this radius are associated with the representative. The Slim-tree stores the objects in the leaves and creates an appropriate hierarchy on top. An important factor that affects the search performance of this kind of tree is the degree of overlap among the nodes. The Slim-tree is a dynamic structure developed to reduce the overlap among regions at each level and to optimize disk accesses for nearest-neighbor and range queries.
The Slim-tree is built following a bottom-up approach, as the B-tree is. Objects are inserted into a node up to its capacity. When another object must be inserted, the node splits and the objects are distributed between the two new nodes. One object of each split node is elected to go to the immediately upper level, and this is done recursively. The elected objects are the node representatives, and they are associated with a radius that covers the respective nodes. Fig. 1 presents an example of a Slim-tree of 15 objects {s1, ..., s15} with a maximum node capacity of 3 objects.
The representatives at a given level and their corresponding radii define regions that can overlap, even within the same tree level. This leads to the problem of more than one node qualifying to host a new object.


Fig. 1. A Slim-tree representation with 15 objects (a) and its logic structure (b). The white circles in (a) represent the leaf nodes, while the gray circles represent the index nodes. The objects in black are the representatives.


In order to build a Slim-tree, a policy is needed for when a new object is inserted and more than one node qualifies to receive it (the ChooseSubtree algorithm) (Traina et al., 2002). The Slim-tree has three options for the ChooseSubtree algorithm: (1) random: randomly choose one of the qualifying nodes; (2) minDist: choose the qualifying node whose center is closest to the new object; (3) minOccup: choose the node with the minimum occupancy among the qualifying ones. It is important to note that the choice of policy affects the features of the resulting tree regarding the degree of overlap among nodes. For example, the minOccup option generates shallow trees with higher node occupation rates, leading to a smaller number of disk accesses for queries; however, it also leads to a higher degree of overlap among the nodes. On the other hand, the minDist option generates taller trees with lower node occupation rates and a lower degree of overlap among nodes.
By the procedure adopted to build a Slim-tree, each level of the tree indirectly divides the metric space into a number of clusters equal to the number of objects stored at that level. Unfortunately, this number of clusters is not a parameter given by the user, so it is not possible to let the Slim-tree perform the whole clustering. On the other hand, it is reasonable to suppose that the representative objects stored in the index nodes are, in a sense, approximate cluster centers at each level of the tree. Although the use of this information does not allow the direct selection of an arbitrary number k of clusters, it can be used as a convenient starting point to cluster the data. This paper employs this reasoning to develop a fast algorithm to cluster datasets by similarity, based on the k-medoid approach.

3. The proposed algorithm

As seen in the previous section, the computational complexity of the k-medoid-based algorithms depends on n and k. In order to reduce the complexity, a common approach is to reduce the number of objects that are submitted to the clustering algorithms, selecting a sample of the original database. For this approach, the quality of the selection step is crucial for the quality of the resulting clusters.
In this paper, we propose to speed up k-medoid-based algorithms using a MAM. To improve the scalability of k-medoid-based algorithms and achieve high quality clusters, our strategy is to perform sampling operations based on inherent features of a MAM, which hierarchically divides the data space into regions, assigning a representative to each region. As the Slim-tree meets this requirement, we use it as the basis to describe our technique in this paper. The strategy is based on the assumption that, by construction, the node centers at each level of the Slim-tree are reasonably well distributed over the data domain and are the natural centers of the regions they represent.

Therefore, instead of considering all objects of a dataset when computing the clustering algorithm, only the objects of a chosen tree level are considered. This selection strategy regards the distance of an object to the center of a subtree as being approximately the average of the distances for every object stored in that subtree.
By construction, each level of a Slim-tree represents a division of the data space with a granularity that grows from the root to the leaves, eventually storing all the objects of a dataset. One question we need to answer is: which level of the tree has enough information about the data distribution to lead to a clustering with lower computational cost while keeping a reasonable quality? The upper levels of a tree (closer to the root) do not contain much information about the data distribution, because the data are grouped by a small number of representatives. On the other hand, the lower levels (close to the leaves) tend to have too much information, slowing down the algorithm. Intuitively, the middle levels of the tree may contain enough information about the data distribution to lead to a suitable clustering. The experiments we have performed indicated that the middle level of the tree is indeed a good choice.
Although the proposed approach may also be applied to other k-medoid-based algorithms, we considered only the PAM algorithm in this paper, as it presents the best cluster quality, since it evaluates every object as a possible medoid. The proposed sampling algorithm using the Slim-tree applied to PAM will be referred to as PAM-SLIM, and it proceeds as follows (a sketch of the initialization and clustering phases is given after the list):
(1) Pre-processing phase. In this phase, the parameters for the construction of the Slim-tree are set, and the tree is built.
(a) Choose the parameters for the tree construction: the page size and the ChooseSubtree algorithm.
(b) Build the Slim-tree.
(2) Initialization phase. In this phase, the objects that should be considered in the clustering phase are selected.
(a) Find the middle level of the tree as hm = ⌈hmax/2⌉, where hmax is the height of the tree. If level hm does not have at least k objects, then select the next level down until finding a level with at least k representatives.
(b) Let S_hm be the set of objects in level hm.
(3) Clustering phase. In this phase, the objects S_hm selected in Step (2) are submitted to PAM.
(a) Apply the PAM algorithm over S_hm to select the best k medoids in S_hm.
(b) Assign each object of the entire dataset to the set of medoids R returned by PAM.
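The sketch below covers the initialization and clustering phases. The Slim-tree is abstracted here as a vector of levels, each holding the indices (into S) of the node representatives stored at that level (level 1 being the root); this interface, and the runPam function, are assumptions for the example — the real structure is the disk-resident Slim-tree of the Arboretum library, and runPam stands for the PAM routine of Section 2.1.1 restricted to the given candidate indices. The Object alias is the one from the sketch after Eq. (1).

```cpp
#include <cstddef>
#include <vector>

using Level = std::vector<std::size_t>;  // indices of node representatives at one tree level

// PAM over a restricted candidate set (see the sketches in Section 2.1); declaration only.
std::vector<std::size_t> runPam(const std::vector<Object>& S,
                                const Level& candidates, std::size_t k);

// Initialization + clustering phases of PAM-SLIM (assumes a non-empty tree).
std::vector<std::size_t> pamSlimMedoids(const std::vector<Object>& S,
                                        const std::vector<Level>& treeLevels,  // [0] = root
                                        std::size_t k) {
    const std::size_t hmax = treeLevels.size();    // height of the tree
    std::size_t hm = (hmax + 1) / 2;               // middle level: hm = ceil(hmax / 2)
    std::size_t level = hm - 1;                    // 1-based level number -> 0-based index
    // If the middle level holds fewer than k representatives, descend towards the leaves.
    while (level + 1 < hmax && treeLevels[level].size() < k) ++level;
    // Clustering phase: PAM runs only over the representatives S_hm of the chosen level;
    // afterwards every object of the dataset is assigned to its nearest returned medoid
    // (the assignment loop of the cost sketch in Section 2.1).
    return runPam(S, treeLevels[level], k);
}
```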


The PAM-SLIM algorithm is divided into three main phases. The first phase is responsible for the construction of the Slim-tree. This phase allows specifying the parameters needed for the construction of the tree, such as the ChooseSubtree algorithm and the node page size. As mentioned in Section 2.2, Slim-trees built with the different options provided for the ChooseSubtree algorithm tend to present different degrees of overlap among their nodes. Thus, the possibility of setting the ChooseSubtree option in the pre-processing phase allows evaluating the results obtained by the PAM-SLIM algorithm under different configurations of the Slim-tree. Another parameter that deserves attention in the construction of a Slim-tree is the node page size. As the PAM-SLIM algorithm chooses its medoids based on the node representatives, the node size affects the behavior of the algorithm; thus, this parameter must be set according to the dataset to be clustered.
Once the Slim-tree has been constructed, the subset of objects that must be submitted to PAM is selected in the second phase. This subset is composed of the node representatives stored in the middle level of the tree. The last phase applies the PAM algorithm over the objects selected in the previous step and assigns each object of the entire dataset to the set of medoids returned by PAM.

3.1. The optimization phase

Although the strategy adopted by the PAM-SLIM algorithm indeed reduces the number of distance calculations computed in the clustering, the best medoids that represent the clusters may not always be found, as the clustering is performed on a sample. As the experiments described in Section 4 show, this can lead to a small decrease in the clustering quality. This is due to the fact that the clustering quality is measured in terms of the average distance of the resulting clustering: a slight variation among the objects near the cluster centers chosen as medoids can cause objects near the borders of the clusters to be misgrouped, making their distances to their medoids large.
In order to improve the set of medoids found by the PAM-SLIM algorithm, we propose to apply a new strategy that analyzes the neighborhood of the medoids returned by PAM-SLIM. This strategy can be applied as an additional step of the PAM-SLIM algorithm. It is a desirable feature, as it enables analyzing more or fewer neighbors according to the time available. Using this optimization, the algorithm quickly reaches a suitable approximation of the k medoids and thereafter improves the clusters found.


This additional step can be executed as a fourth phase of the PAM-SLIM algorithm, as follows (a sketch of this phase is given at the end of this subsection):
(4) Optimization phase. In this phase, the clustering quality is improved.
(a) Choose the number of neighbors kN, called the neighborhood factor, to be analyzed. For each medoid in the set R returned by PAM-SLIM, find its kN nearest neighbors. Let S_kN be the set of selected neighbors.
(b) Apply the PAM algorithm, modifying its swap phase in the following way: compute the total cost Tih for all pairs of objects ri and sh, where ri ∈ R is currently selected by PAM-SLIM and sh ∈ S_kN.
Notice that R ⊂ S_kN, so the inclusion of the optimization phase never worsens the clusters obtained, but it can improve them significantly. The main aspect of this approach is the reduction in the number of objects considered in the swap phase of the PAM algorithm. Only the kN neighbors of each medoid returned by PAM-SLIM are considered for the swap, as illustrated in Fig. 2, where the black stars represent the medoids that would be found by PAM, the white stars represent the medoids found by PAM-SLIM, the white circles represent the other dataset objects, and the dashed circles enclose the four kN neighbors that are considered in the swap. It is important to note that, although only the kN neighbors are considered in the swap, all the dataset objects are used in the computation of the total cost Tih of each configuration (as PAM does); that is, the complexity reduction is achieved by considering a smaller set of objects for the swap. Thus, the optimization phase has a computational cost of O(k·kN·(n − k)) per iteration, instead of O(k(n − k)²) for the original PAM. It is also worth noting that the PAM-SLIM algorithm can benefit from the MAM employed to execute the selection of the kN neighbors.
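A minimal sketch of this optimization phase follows, assuming the Object, dist and clusteringCost helpers from the sketch after Eq. (1). Two simplifications are made for brevity: the kN-nearest-neighbor candidates are gathered with a linear scan rather than with the MAM's nearest-neighbor query, and any improving swap is accepted instead of always taking the single best one.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Candidate set S_kN: the kN nearest neighbors of each medoid (brute force here).
std::vector<std::size_t> neighborhood(const std::vector<Object>& S,
                                      const std::vector<std::size_t>& medoids,
                                      std::size_t kN) {
    std::vector<std::size_t> candidates;
    for (std::size_t m : medoids) {
        std::vector<std::size_t> idx(S.size());
        std::iota(idx.begin(), idx.end(), 0);
        std::size_t take = std::min(kN, idx.size());
        std::partial_sort(idx.begin(), idx.begin() + take, idx.end(),
                          [&](std::size_t a, std::size_t b) {
                              return dist(S[a], S[m]) < dist(S[b], S[m]);
                          });
        candidates.insert(candidates.end(), idx.begin(), idx.begin() + take);
    }
    std::sort(candidates.begin(), candidates.end());
    candidates.erase(std::unique(candidates.begin(), candidates.end()), candidates.end());
    return candidates;
}

// Restricted swap phase: replacements sh are drawn only from S_kN, but the total cost
// is still computed over the whole dataset, as in PAM's swap phase.
void optimizeMedoids(const std::vector<Object>& S,
                     std::vector<std::size_t>& medoids, std::size_t kN) {
    const std::vector<std::size_t> candidates = neighborhood(S, medoids, kN);
    bool improved = true;
    while (improved) {
        improved = false;
        double current = clusteringCost(S, medoids);
        for (std::size_t i = 0; i < medoids.size(); ++i) {
            for (std::size_t h : candidates) {
                if (std::find(medoids.begin(), medoids.end(), h) != medoids.end()) continue;
                std::vector<std::size_t> trial = medoids;
                trial[i] = h;
                double cost = clusteringCost(S, trial);
                if (cost < current) { medoids = trial; current = cost; improved = true; }
            }
        }
    }
}
```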


Fig. 2. Illustration of the optimization phase considering a neighborhood factor kN equal to 3.


4. Experimental evaluation

In order to show the effectiveness and the efficiency of our proposed algorithms, we present in this section four representative sets of experiments chosen from the ones we have performed. To evaluate the clusters obtained by the techniques under analysis (effectiveness), we compute the average distance of the resulting clustering, i.e., the average distance of all objects from their medoids (smaller values indicate better clustering). This is the standard measure employed in the literature to evaluate the quality of the clustering obtained by k-medoid-based algorithms. The efficiency was measured by the number of distance calculations.
In the first set of experiments, we aimed at comparing the proposed PAM-SLIM algorithm with the traditional k-medoid-based algorithms PAM, CLARA and CLARANS. As PAM is not suitable for large databases due to its high computational complexity (see Section 2.1.1), it was executed only in the first set of experiments. The second series of experiments aimed at measuring the scalability of our approach as the database grows; these experiments compared our algorithm to CLARANS and CLARA. The third set of experiments showed the performance of our algorithm on a large, real-world dataset. Lastly, the fourth series of experiments presented the performance of the PAM-SLIM algorithm when using the optimization phase described in Section 3.1; it considered the best results obtained by the PAM-SLIM algorithm in the previous experiments, to show that the clustering quality can be improved even further when the optimization phase is employed. It is important to note that, in the first three sets of experiments, the efficiency graphs use a log scale on the number-of-distance-calculations axis, due to the large difference among the results obtained by the tested algorithms.
In every experiment, we considered two configurations of the proposed approach: PAM-SLIM-MD, which uses the minDist option of the Slim-tree, and PAM-SLIM-MO, which uses the minOccup option. This was done mainly because the Slim-trees built with these two options present significant differences in the degree of overlap among their nodes (see Section 2.2). Thus, we intended to find out which option is more appropriate for the construction of the tree that is employed to sample the objects submitted to PAM. In order to find the ideal page size, the experiments were run varying the page size for the Slim-tree methods.

The node page sizes were chosen based on the size of the objects in each dataset. It is important to note that the algorithms PAM, CLARA and CLARANS run only in main memory and thus are not influenced by the page size.
The five clustering algorithms evaluated – PAM, CLARA, CLARANS, PAM-SLIM-MD and PAM-SLIM-MO – were implemented on the same platform, using the C++ language and the Arboretum MAM library (GBDI-ICMC-USP, 2007), in order to obtain a fair comparison. CLARA and CLARANS were configured using their recommended setups, as described in Section 2.1. It is important to note that the algorithms PAM and PAM-SLIM perform the clustering in a deterministic way, so the same result is achieved every time they are run. On the other hand, the strategies used by CLARANS and CLARA are based on random samples; thus, their results are presented as the average of 10 executions in all sets of experiments. The experiments were performed on a PC with an Intel P4 2.4 GHz CPU, 1 GB of RAM and 60 GB of disk space.
Nine datasets were employed in the experiments. Eight were composed of synthetic data and were built to enable a thorough evaluation of the algorithms. The ninth dataset was composed of real non-dimensional data and was used to analyze the behavior of our algorithm in a real-world situation. The datasets are described in Table 2, along with their names, the total number of objects (# Obj.), the object size in bytes, the number of clusters (k) used to generate each dataset, the dimensionality of the dataset (D), and the distance function used (d( )). It is worth noting that different distance functions were employed in the experiments, with the intent of evaluating the adequacy of the proposed method in several situations.

4.1. Evaluating PAM-SLIM vs k-medoid-based algorithms

This first set of experiments was carried out in order to compare the two configurations of our method, PAM-SLIM-MD and PAM-SLIM-MO, with the existing k-medoid-based algorithms PAM, CLARA and CLARANS. In these experiments, the optimization was not employed, so only the basic PAM-SLIM algorithm was evaluated.

Table 2. Description of the synthetic and real-world datasets used in the experiments

Name              # Obj.    Object size (bytes)   k    D    d( )
Rand10_5k_5d      10,000    24                    5    5    L2
Rand10_10k_5d     10,000    24                    10   5    L2
Rand10_15k_5d     10,000    24                    15   5    L2
Rand10_20k_5d     10,000    24                    20   5    L2
Rand30_10k_10d    30,000    44                    10   10   L1
Rand60_10k_10d    60,000    44                    10   10   L1
Rand90_10k_10d    90,000    44                    10   10   L1
Rand120_10k_10d   120,000   44                    10   10   L1
MedHisto          40,000    44–380 (avg. 280)     –    –    LM

Description: the Rand10_* datasets contain synthetic vector data with Gaussian distribution in a 5-d unit hypercube (a); the Rand30/60/90/120_10k_10d datasets contain synthetic vector data with Gaussian distribution in a 10-d unit hypercube (a); MedHisto contains metric histograms of medical gray-level images. The MedHisto dataset is non-dimensional and was generated from real medical images provided by the University Hospital at USP; for more details on this dataset and the distance function used, see Traina et al. (2002).
(a) The process used to generate these datasets is described in Ciaccia et al. (1997).



Fig. 3. Efficiency (a–d) and effectiveness (e–h) comparison of PAM, CLARANS, CLARA, PAM-SLIM-MD and PAM-SLIM-MO for Rand10_5k_5d, Rand10_10k_5d, Rand10_15k_5d and Rand10_20k_5d datasets.

Fig. 3 shows the efficiency (a–d) and effectiveness (e–h) comparison of PAM, CLARANS, CLARA, PAM-SLIM-MD and PAM-SLIM-MO for the datasets Rand10_5k_5d, Rand10_10k_5d, Rand10_15k_5d and Rand10_20k_5d, where 5, 10, 15 and 20 medoids were selected, respectively.

Notice that, as the traditional clustering algorithms – PAM, CLARANS and CLARA – are executed in main memory, their results are shown as the first three bars of the graphs, without a page size label.


PAM-SLIM-MD and PAM-SLIM-MO are disk-based, employing node page sizes of 1024, 2048 and 4096 bytes. Among the node page sizes employed in the tests for both PAM-SLIM algorithms, the node page size of 4096 bytes was the one that presented the best trade-off between efficiency and effectiveness. If we observe only the graphs related to the clustering quality (Fig. 3e–h), it is possible to verify that, in general, the quality of the clusters obtained by the PAM-SLIM algorithms with the node page size of 1024 bytes was better than with the node page sizes of 2048 and 4096 bytes. However, if we observe the corresponding efficiency graphs (Fig. 3a–d), presented with a log scale on the vertical axis, the PAM-SLIM executions with a node page size of 4096 bytes were the ones that required the smallest number of distance calculations when compared to the same algorithms executed with the other node page sizes employed in this set of experiments. Therefore, the node page size of 4096 bytes presented a better overall result, even though this configuration resulted in a slight decrease in clustering quality when compared to the configuration that employed the node page size of 1024 bytes.
For PAM-SLIM-MO, we observed efficiency improvements ranging from 70 to 137 times when compared to CLARANS, whereas the loss of effectiveness varied only from 2.1% to 5.6%. Compared to PAM, the efficiency was improved by a factor of 3577 to 6695 times (more than three thousand times faster), although we observed a decrease of effectiveness ranging from only 7.4% to 11.8%. Compared to CLARA, our improvement of effectiveness ranged from 9.7% to 26%, and our improvement of efficiency ranged from 11% to 33%.
Table 3 shows the time to execute a clustering on the datasets, measured in hours:minutes:seconds. The time comparison among PAM, CLARANS, PAM-SLIM-MD and PAM-SLIM-MO considered the page size that presented the best trade-off for PAM-SLIM-MD and PAM-SLIM-MO (4096 bytes). As the experiments showed, PAM-SLIM was more than 3000 times faster than PAM while producing clusters of comparable quality. Notice that the total time closely follows the number of distance calculations required.

Table 3. Time comparison of PAM, CLARANS, PAM-SLIM-MD and PAM-SLIM-MO for the Rand10_5k_5d, Rand10_10k_5d, Rand10_15k_5d and Rand10_20k_5d datasets (PAM-SLIM-MD and PAM-SLIM-MO run with a page size of 4096 bytes; time measured in hours:minutes:seconds)

Dataset          PAM        CLARANS    PAM-SLIM-MD   PAM-SLIM-MO
Rand10_5k_5d     00:42:48   00:01:31   00:00:04      00:00:02
Rand10_10k_5d    07:49:42   00:07:18   00:00:10      00:00:05
Rand10_15k_5d    21:27:41   00:23:03   00:00:33      00:00:15
Rand10_20k_5d    43:44:47   00:44:10   00:01:14      00:00:39

Although the CLARA algorithm presented the best efficiency (in terms of the number of distance calculations) in most of the tests shown herein, its clustering quality was worse, not being comparable to that of PAM, considering its recommended configuration. In order to improve its clustering quality, it is necessary to increase the number of samples and/or their sizes; however, this results in a large reduction of its efficiency. This makes PAM-SLIM-MO the algorithm that presents the best trade-off between efficiency and effectiveness for the datasets tested here. In the next experiments, we did not compare the results with the PAM algorithm, because of its prohibitive computational cost.

4.2. Evaluating the scalability of the PAM-SLIM algorithms

The datasets Rand30_10k_10d, Rand60_10k_10d, Rand90_10k_10d and Rand120_10k_10d were used in the second set of experiments, aiming at measuring the scalability of our approach regarding the increasing number of data items (see Table 2). In these experiments, 10 medoids were requested when clustering each dataset. Fig. 4 shows the efficiency and effectiveness comparison of CLARANS, CLARA, PAM-SLIM-MD and PAM-SLIM-MO. The node page sizes employed in these experiments for the PAM-SLIM-MD and PAM-SLIM-MO algorithms varied from 2048 to 8192 bytes. As in the first set of experiments, the optimization phase was not employed.
As can be seen in the graphs related to the clustering quality (Fig. 4e–h), the results obtained by the several configurations of the PAM-SLIM algorithms did not present a significant change (the difference observed among the quality results was 0.0033). On the other hand, observing the corresponding efficiency graphs (Fig. 4a–d) for the PAM-SLIM-MO algorithm, the number of distance calculations needed decreased as the node page size increased. Thus, the node page size that resulted in the best trade-off between efficiency and effectiveness for the PAM-SLIM-MO algorithm was 8192 bytes. Taking this page size into account, we observed that PAM-SLIM-MO obtained an improvement of efficiency ranging from 337 to 527 times compared to CLARANS (the clustering algorithm that presented the best clustering quality), as can be seen from Fig. 4a–d, whereas the loss of effectiveness varied from 0.6% to 2.5%.
In the second set of experiments, although the CLARA algorithm presented the best efficiency among the tested algorithms, its loss of effectiveness reached 13.3% when compared to CLARANS and 12.1% when compared to the best configuration employed for PAM-SLIM-MO.



Fig. 4. Efficiency (a–d) and effectiveness (e–h) comparison of CLARANS, CLARA, PAM-SLIM-MD and PAM-SLIM-MO for Rand30_10k_10d, Rand60_10k_10d, Rand90_10k_10d and Rand120_10k_10d datasets.

4.3. Exploring a real database

This set of experiments used the real-world non-dimensional dataset MedHisto, in order to observe the behavior of our algorithms in a real situation. In these experiments, 5, 10 and 15 medoids were selected from the 40,000 objects of the MedHisto dataset. The node page sizes employed for the PAM-SLIM algorithms varied from 8192 to 32,768 bytes. Again, the optimization phase was not employed.

Fig. 5 shows the efficiency (a–c) and effectiveness (d–f) comparison of CLARANS, CLARA, PAM-SLIM-MD and PAM-SLIM-MO for this dataset.



Fig. 5. Efficiency (a–c) and effectiveness (d–f) comparison of CLARANS, CLARA, PAM-SLIM-MD and PAM-SLIM-MO for the MedHisto dataset.

The best clustering quality obtained for PAM-SLIM-MO employed the node page size of 8192 bytes, whereas for the PAM-SLIM-MD algorithm the best clustering quality was achieved with the page size configurations of 16,384 and 32,768 bytes. Observing the best configuration of PAM-SLIM-MO, the efficiency improvement ranged from 7 to 13 times when compared to CLARANS, while the loss of effectiveness ranged from 2.8% to 7.5%. For PAM-SLIM-MD, taking into account the page size configuration that resulted in the smallest number of distance calculations among the ones that obtained the best clustering quality (page size of 32,768 bytes), the efficiency improvement ranged from 164 to 221 times compared to CLARANS, whereas the loss of effectiveness ranged from 2.2% to 7.6%. Thus, the impressive gain in time compensates for the small decrease in effectiveness; moreover, such a gain makes the proposed algorithm suitable for use in real applications.
It has also been noted that the loss of effectiveness presented by the PAM-SLIM algorithms in this set of experiments was much lower than that presented by CLARA when compared to CLARANS. The loss of effectiveness presented by CLARA exceeded 43.4% when compared to CLARANS and 41.7% when compared to the best configuration of PAM-SLIM-MD.

Finally, considering the first three sets of experiments, it is possible to conclude that the node page size that led to the best configuration of the PAM-SLIM algorithm was the one in which the maximum number of objects that fit in a page was around 180.

4.4. Employing the optimization approach

The last set of experiments considered the best configurations employed for the PAM-SLIM algorithms in the previous three series of experiments, in order to evaluate the efficiency and the effectiveness of the PAM-SLIM algorithm when applying the optimization phase to all the datasets tested. In all the following graphs, the neighborhood factor kN employed in the optimization varies from 5 up to 30. It is important to note that the results presented for the value zero on the kN axis correspond to the execution of the PAM-SLIM algorithm without the optimization phase, i.e., these values correspond to the ones shown previously in the graphs of the other sets of experiments.
Fig. 6 shows the efficiency (a) and effectiveness (b) comparison of PAM-SLIM-MO with a node page size of 4096 bytes using the optimization approach for the Rand10_5k_5d, Rand10_10k_5d, Rand10_15k_5d and Rand10_20k_5d datasets. Considering the neighborhood factor kN of 5, we already observed improvements of effectiveness varying from 0.05% to 1.3% when compared to CLARANS for the Rand10_5k_5d, Rand10_10k_5d and Rand10_20k_5d datasets.



Fig. 6. Efficiency (a) and effectiveness (b) comparison of the PAM-SLIM-MO using the optimization approach for Rand10_5k_5d, Rand10_10k_5d, Rand10_15k_5d and Rand10_20k_5d datasets.

For the Rand10_15k_5d dataset, the previous loss of effectiveness of 5.6% decreased to 1.1%, while maintaining an efficiency improvement ranging from 22 to 42 times. Compared to PAM, the loss of effectiveness that previously varied from 7.4% to 11.8% – without the optimization phase – now did not exceed 7.6%, while the algorithm still ran 1497 times faster. Compared to CLARA, the improvement of effectiveness ranged from 12.1% to 32.1%. It is important to point out that the PAM, CLARANS and CLARA results employed in all the comparisons presented in this section are the ones obtained and shown in the previous experiments.
For subsequent values of the neighborhood factor, we observed that as the kN value increased, the clustering quality also increased. The solutions found became increasingly better than the ones found by CLARANS and CLARA, until they reached almost the same clustering quality presented by PAM for kN equal to 30, where the loss of effectiveness varied only from 0.9% to 1.4% while still running from 164 to 227 times faster. Regarding the computational effort (Fig. 6a), although the number of distance calculations grew as kN increased, the efficiency graph exhibited linear behavior, as predicted by the computational complexity described in Section 3.1.
In general, beyond a kN of 15, the solutions found by PAM-SLIM presented only slightly better quality. Since the improvement of effectiveness was accompanied by an increase in the number of distance calculations, we concluded that, for the datasets employed in this series of experiments, a neighborhood factor of 15 presented a suitable trade-off. Considering this factor, we observed improvements of effectiveness varying from 3.0% to 6.0% and great improvements of efficiency, since our algorithm is 7–12 times faster than CLARANS. Compared to PAM, the loss of effectiveness varied only from 1.6% to 3.3%, while still presenting an efficiency improvement ranging from 325 to 454 times. Compared to CLARA, the improvement of effectiveness reached 38.6%.

Fig. 7 shows the efficiency (a) and effectiveness (b) comparison of PAM-SLIM-MO with a node page size of 8192 bytes using the optimization approach for the Rand30_10k_10d, Rand60_10k_10d, Rand90_10k_10d and Rand120_10k_10d datasets. We can draw the following observations from this figure. In order to surpass the clustering quality of CLARANS, it was necessary to employ neighborhood factors equal to 5, 25, 15 and 10 for the datasets Rand30_10k_10d, Rand60_10k_10d, Rand90_10k_10d and Rand120_10k_10d, respectively. Thus, considering the neighborhood factor kN equal to 25 (which was sufficient to surpass the clustering quality of CLARANS for all the datasets considered in this figure), we observed improvements of effectiveness varying from 0.4% to 1.6% when compared to CLARANS, while maintaining an efficiency improvement ranging from 20 to 68 times. Compared to CLARA, the improvement of effectiveness reached up to 17.1%.
Fig. 8 shows the efficiency (a) and effectiveness (b) comparison of PAM-SLIM-MD with a node page size of 32,768 bytes using the optimization approach for the MedHisto dataset. Analyzing the results presented in the graph of Fig. 8b, we observed that beyond a neighborhood factor kN of 5 there was no significant difference in the improvement of clustering quality as kN increased. Regarding the results for the MedHisto dataset when requesting the selection of 5 clusters, a kN equal to 10 was enough for PAM-SLIM to surpass the clustering quality of CLARANS, achieving a 0.14% effectiveness improvement while still presenting an efficiency improvement of 48 times. For the selection of 10 and 15 clusters, although a kN equal to 30 was not enough to surpass the clustering quality of CLARANS, it allowed the previous loss of effectiveness to decrease from 7.6% to 5.5% for the selection of 10 clusters, and from 7.6% to 5.3% for the selection of 15 clusters, while maintaining efficiency improvements of 23 and 18 times, respectively. In this case, a kN greater than 30 is required in order to reach the solution found by CLARANS.



Fig. 7. Efficiency (a) and effectiveness (b) comparison of the PAM-SLIM-MO using the optimization approach for Rand30_10k_10d, Rand60_10k_10d, Rand90_10k_10d and Rand120_10k_10d datasets.


Fig. 8. Efficiency (a) and effectiveness (b) comparison of the PAM-SLIM-MD using the optimization approach for the MedHisto dataset.

Compared to CLARA, the improvement of effectiveness achieved in this series of experiments exceeded 75.3%. Analyzing the results of this series of experiments, we concluded that the optimization phase always improves the quality of the clusters obtained, as predicted. They also showed that a neighborhood factor of 15 almost always allows obtaining clusters with better quality than those obtained by the CLARA and CLARANS algorithms, yet running in a fraction of the time required by them.

5. Conclusions

This paper presented a new algorithm, PAM-SLIM, which employs metric access methods to scale up k-medoid-based algorithms. The efficiency of k-medoid-based algorithms relies on the initial selection of the medoids. Our proposed algorithm assumes that a metric tree tends to naturally choose suitable medoids as its node representatives, and the experiments we performed confirmed that this assumption indeed holds.

The strategy employed by this algorithm can be efficiently applied to cluster multi-dimensional as well as non-dimensional datasets. We also presented an optimization strategy that can be applied as an additional step of the proposed algorithm, in order to analyze the neighborhood of the medoids returned by the basic algorithm according to a neighborhood factor. This strategy allowed the clustering solution found by PAM-SLIM to be improved, while maintaining better computational efficiency than the algorithms that presented comparable clustering quality, i.e., PAM and CLARANS. The experiments we performed showed that as the neighborhood factor increases, better clusters are found.
Our experiments have also shown that the proposed algorithm presented suitable results for several configurations of the Slim-tree, using different node page sizes and ChooseSubtree policies. When compared to PAM and CLARANS, the PAM-SLIM algorithms presented an impressive improvement in efficiency (being more than three thousand times faster) while maintaining a comparable clustering quality, thus offering a very good trade-off between efficiency and effectiveness.


The new algorithm also presented much better clustering quality than CLARA for all the tested datasets. Moreover, notice that our proposed algorithm runs on datasets stored on disk, whereas the other algorithms run only in main memory; even so, our algorithm remains much faster than PAM and CLARANS. Hence, the efficiency presented by the PAM-SLIM algorithm allows the inclusion of clustering algorithms in database management systems.
There are a few issues that deserve further research. Although the two configurations employed by the PAM-SLIM algorithms (PAM-SLIM-MD and PAM-SLIM-MO) obtained suitable results, they presented different behaviors on each dataset tested. Future work includes the identification of parameters to be measured in a dataset in order to define beforehand which configuration is the best choice.

Acknowledgements

This work has been supported by FAPESP (São Paulo State Research Foundation), by CAPES (Brazilian Coordination for Improvement of Higher Level Personnel) and by CNPq (Brazilian National Council for Supporting Research).

References

Barioni, M.C.N., Razente, H., Traina, A.J.M., Traina Jr., C., 2006. An efficient approach to scale up k-medoid-based algorithms in large databases. In: Brazilian Symposium on Databases (SBBD), Florianópolis, SC, Brazil. SBC, pp. 265–279.
Beckmann, N., Kriegel, H.-P., Schneider, R., Seeger, B., 1990. The R*-tree: an efficient and robust access method for points and rectangles. In: ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ. ACM, pp. 322–331.
Chu, S.-C., Roddick, J.F., Pan, J.S., 2002. An efficient k-medoids-based algorithm using previous medoid index, triangular inequality elimination criteria, and partial distance search. In: International Conference on Data Warehousing and Knowledge Discovery (DaWaK). Springer-Verlag, London, UK, pp. 63–72.
Ciaccia, P., Patella, M., Zezula, P., 1997. M-tree: an efficient access method for similarity search in metric spaces. In: International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann, Athens, Greece, pp. 426–435.
Ester, M., Kriegel, H.-P., Xu, X., 1995. Knowledge discovery in large spatial databases: focusing techniques for efficient class identification. In: International Symposium on Advances in Spatial Databases, vol. 951. Springer, Portland, ME, pp. 67–82.
GBDI-ICMC-USP, 2007. GBDI Arboretum Library.


Guha, S., Rastogi, R., Shim, K., 1998. CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD International Conference on Management of Data, Seattle, WA, USA, pp. 73–84.
Han, J., Kamber, M., 2001. Data Mining: Concepts and Techniques. Academic Press, San Diego, CA.
Hartigan, J., Wong, M., 1979. Algorithm AS136: a k-means clustering algorithm. Applied Statistics 28, 100–108.
Hwang, J.-J., Whang, K.-Y., Moon, Y.-S., Lee, B.-S., 2004. A top-down approach for density-based clustering using multidimensional indexes. Journal of Systems and Software (JSS) 73 (1), 169–180. doi:10.1016/j.jss.2003.08.237.
Jain, A., Murty, M., Flynn, P., 1999. Data clustering: a review. ACM Computing Surveys 31 (3), 264–323. doi:10.1145/331499.331504.
Kaufman, L., Rousseeuw, P.J., 1986. Clustering large data sets (with discussion). In: Gelsema, E.S., Kanal, L.N. (Eds.), Pattern Recognition in Practice II. Elsevier, North-Holland, Amsterdam, pp. 425–437.
Kaufman, L., Rousseeuw, P.J., 1987. Clustering by means of medoids. In: Statistical Data Analysis Based on the L1 Norm. Elsevier, Berlin, pp. 405–416.
Kaufman, L., Rousseeuw, P.J., 2005. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons.
Ng, R.T., Han, J., 1994. Efficient and effective clustering methods for spatial data mining. In: International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann, Santiago, Chile, pp. 144–155.
Ng, R.T., Han, J., 2002. CLARANS: a method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering (TKDE) 14 (5), 1003–1016. doi:10.1109/TKDE.2002.1033770.
Sibson, R., 1973. SLINK: an optimally efficient algorithm for the single-link cluster method. Computer Journal 16, 30–34.
Traina, A.J.M., Traina Jr., C., Bueno, J.M., de A. Marques, P.M., 2002. The metric histogram: a new and efficient approach for content-based image retrieval. In: IFIP Working Conference on Visual Database Systems (VDB), Australia, pp. 297–311.
Traina Jr., C., Traina, A.J.M., Faloutsos, C., Seeger, B., 2002. Fast indexing and visualization of metric datasets using Slim-trees. IEEE Transactions on Knowledge and Data Engineering (TKDE) 14 (2), 244–260. doi:10.1109/69.991715.
Vieira, M.R., Chino, F.J.T., Traina Jr., C., Traina, A.J.M., 2004. DBM-tree: a metric access method sensitive to local density data. In: Brazilian Symposium on Databases (SBBD), Brasília, DF. SBC, pp. 163–177.
Zhang, Q., Couloigner, I., 2005. A new and efficient k-medoid algorithm for spatial clustering. In: International Conference on Computational Science and Its Applications, Lecture Notes in Computer Science, vol. 3482. Springer-Verlag, Singapore, pp. 181–189.
Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada. ACM, pp. 103–114.