On distributing the clustering process

B. Boutsinas a,b,c,*, T. Gnardellis c

a Department of Business Administration, University of Patras, GR-26500 Patras, Greece
b University of Patras Artificial Intelligence Research Center (UPAIRC), Greece
c IS & AI Lab, Department of Computer Engineering and Informatics, GR-26500 Patras, Greece

Received 20 November 2000; received in revised form 10 October 2001

Pattern Recognition Letters 23 (2002) 999–1008
www.elsevier.com/locate/patrec

* Corresponding author. Tel.: +30-610-997845; fax: +30-610-996327. E-mail address: [email protected] (B. Boutsinas).

Abstract

Clustering algorithms require a large number of computations of distances among patterns and centers of clusters. Hence, their complexity is dominated by the number of patterns. On the other hand, there is an explosive growth of business and scientific databases storing huge volumes of data. One of the main challenges of today's knowledge discovery systems is their ability to scale up to very large data sets. In this paper, we present a clustering methodology for scaling up any clustering algorithm. It is an iterative process that is based on partitioning a sample of data into subsets. We also present extensive empirical tests that demonstrate that the proposed methodology reduces the time complexity and, at the same time, may maintain the accuracy that would be achieved by a single clustering algorithm supplied with all the data. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Data mining; Clustering; Meta-learning; Parallel processing; Distributed computation

1. Introduction

Clustering is a widely used technique whose goal is to partition a set of patterns into disjoint and homogeneous clusters. Clustering algorithms have been widely studied in various fields including Machine Learning, Neural Networks and Statistics. They have also been utilized in many areas including data mining, engineering, taxonomy, statistical data analysis and business applications. Clustering algorithms can be classified as either


partitional clustering or hierarchical clustering algorithms. K-means (Jain and Dubes, 1988; MacQueen, 1967) along with its variants (e.g., Alsabti et al., 1995; Huang, 1998; Ruspini, 1969; Vrahatis et al., 2002), hill-climbing (Anderberg, 1973) and the density-based DBSCAN (Ester et al., 1996) are among the most popular partitional clustering algorithms. Complete-link and single-link algorithms (Dubes and Jain, 1980; Johnson, 1967) are the most popular hierarchical clustering algorithms. Clustering algorithms require a large number of computations of distances among patterns and centers of clusters. Hence, their complexity is dominated by the number of patterns. On the other hand, there is an explosive growth of business and scientific databases storing huge volumes



of data. One of the main challenges of today's data mining systems is their ability to scale up to very large data sets. There is increasing interest in clustering algorithms that handle data sets substantially larger than the main memory available on a single processor. There are several approaches to the problem in the literature. An obvious approach is sampling of the data set (e.g., Bradley and Fayyad, 1983; Bradley et al., 1998; Zhang et al., 1996), although it seems that, as in classification algorithms (Provost and Kolluri, 1997), increasing the size of the training set typically increases the accuracy, while decreasing the size leads to over-fitting. Another approach is to parallelize the clustering algorithms. This approach requires the transformation of the clustering algorithm into an optimized parallel one for a specific architecture. Several earlier parallel clustering algorithms have been proposed in the literature, both for partitional clustering (e.g., Li and Fang, 1989; Ni and Jain, 1985) and for hierarchical clustering (e.g., Rasmussen and Willett, 1989; Olson, 1995). But these earlier attempts still assume that all patterns reside in main memory at the same time, and they need a large number of processors. Of course, there are recent attempts that overcome the latter problems, as in (Pfitzner et al., 1997) and (Jäger and Kriegel, 1999). In this paper, we present a clustering methodology for scaling up any clustering algorithm. It is an iterative process that is based on partitioning a sample of data into subsets. In a first phase, each subset is given as an input to a clustering algorithm. The partial results form a dataset that is partitioned into clusters, the meta-clusters, during a second phase. Under certain circumstances, the meta-clusters are considered as the final clusters. We also present extensive empirical tests that demonstrate that the proposed methodology reduces the time complexity and, at the same time, may maintain the accuracy that would be achieved by a single clustering algorithm supplied with all the data. In the rest of the paper we first present the proposed methodology and then the experimental results. The paper ends with some concluding remarks.

2. The proposed methodology

The concept of the proposed methodology, which we call the Proseggisis methodology, is to divide a data set into equally sized subsets and to apply the clustering algorithm in parallel to these fragments of the original data set. Thus, we can achieve significantly faster execution times without sacrificing much of the accuracy of the results. The proposed methodology is shown in the diagram of Fig. 1 and is described below.

Proseggisis methodology.
input the data set D = {a_1, ..., a_n}
input s, the number of splits of the initial dataset
input inc, the size of the fraction of the dataset that is added to a split in each iteration
input alg(), the clustering algorithm to be applied to the sub-datasets and for the meta-clustering
input k, the number of clusters to be extracted
input d, the threshold indicating the maximum distance between a cluster of a sub-dataset and a meta-cluster, so that the similarity criterion is satisfied
input p, the threshold indicating the percentage of clusters in a single cluster set that must satisfy the similarity criterion w.r.t. the meta-clusters, in order to accept these meta-clusters as appropriate to represent the specific cluster set (covered)
input r, the threshold indicating the percentage of cluster sets that must be covered

1: for each distinct sub-dataset D_i of D, 1 ≤ i ≤ s, do in parallel
     apply alg() to D_i, to extract the cluster set c_i = {c_i1, ..., c_ik}
2: define the Meta-table T = {c_11, ..., c_1k, ..., c_s1, ..., c_sk}
3: apply alg() to T, to extract the meta-clusters mc_1, ..., mc_k
4: for each cluster set c_i, 1 ≤ i ≤ s, do
     if at least (k·p)/100 clusters of c_i have a distance of at most d from some mc_j
     then characterize c_i as covered
5: if there are at least (s·r)/100 covered cluster sets then EXIT
   else increase the size of the sub-datasets by inc records and GOTO 1
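For concreteness, the listing above can be read as the following driver loop. This is a minimal illustrative sketch in Python (ours, not the paper's Borland C++ Builder implementation); it assumes a pluggable alg(subset, k) that returns k cluster centres and a dist(a, b) dissimilarity between centres, and the way the growing splits overlap is also a simplification of ours:

import numpy as np

def proseggisis(data, s, inc, alg, dist, k, d, p, r):
    """Illustrative sketch of the Proseggisis loop (steps 1-5).

    data : 2-D numpy array of patterns (one row per record)
    alg  : callable alg(subset, k) -> array of k cluster centres
    dist : callable dist(a, b) -> scalar dissimilarity between two centres
    """
    n = len(data)
    base = n // s                       # initial sub-dataset size, n/s records
    size = base
    for _ in range(s):                  # after s iterations every split holds n records
        # Step 1: cluster each sub-dataset (run on separate processors in the paper).
        cluster_sets = []
        for i in range(s):
            # grow the i-th split with wrap-around indexing (a simplification of ours)
            idx = np.arange(i * base, i * base + size) % n
            cluster_sets.append(alg(data[idx], k))

        # Step 2: merge the s*k centres into the meta-table.
        meta_table = np.concatenate(cluster_sets)

        # Step 3: cluster the meta-table to obtain the k meta-cluster centres.
        meta_centres = alg(meta_table, k)

        # Step 4: a cluster set is "covered" if at least p% of its centres lie
        # within distance d of their nearest meta-cluster centre.
        covered = 0
        for centres in cluster_sets:
            close = sum(min(dist(c, m) for m in meta_centres) <= d for c in centres)
            if close >= k * p / 100.0:
                covered += 1

        # Step 5: accept the meta-clusters if at least r% of the cluster sets are
        # covered; otherwise enlarge every sub-dataset by inc records and repeat.
        if covered >= s * r / 100.0:
            return meta_centres
        size = min(size + inc, n)
    return meta_centres

With alg bound to a k-modes routine and dist to one of the dissimilarity measures of Section 3, this mirrors the flow of Fig. 1; in a real deployment the step-1 loop would of course run on separate processors rather than sequentially.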


Fig. 1. The Proseggisis methodology.

Firstly, we divide the dataset into s equal-size pieces. We apply the same clustering algorithm with the same parameters to all of the sub-datasets, which results in s sets of cluster centers (step 1). Our algorithm is based on the fact that these s sets are statistically similar to some extent, since they were extracted from different pieces of the same dataset. In order to assess that similarity, we merge all the centers of the cluster sets (step 2) into a new table that we call the Meta-table. We then apply the clustering algorithm to the new table and extract the meta-cluster centers (step 3). If the centers of the cluster sets are similar enough, that similarity will emerge through the meta-cluster centers. Therefore, we compare the centers of each of the s cluster sets with the meta-cluster centers. The comparison is done by calculating the distance of each cluster center of the set to its corresponding meta-cluster center, where the corresponding meta-cluster center is the one that is closest to the cluster center of the set. If these two centers, i.e. the cluster center of the set and its corresponding meta-cluster center, are found to be close enough, then the similarity criterion for these two centers is satisfied. In order to state that the algorithm has succeeded for a specific cluster set, which we then call covered, similarity must occur for a significant percentage (p) of the cluster centers of that set (step 4). If the algorithm succeeds for a significant percentage (r) of the cluster center sets, then we accept

the meta-cluster centers as the final results (step 5). If not, the procedure is repeated for the same number of sub-datasets, only now they contain a larger amount of data (inc) from the original dataset. Initially, we had divided the original dataset so that each subset contained the same amount of data. Naturally, since the number of subsets remains the same while their size increases, some of the records that are included in one subset will be included in others as well, thus making the subsets more similar. More similar subsets will undoubtedly produce more similar clusters, and the process will have a better chance of succeeding. The rate at which we augment the subsets is the following. If the original dataset contains n records and we divide it into s subsets, then each subset will contain n/s records. If the process does not succeed, we increase the number of records in each of the s subsets to 2(n/s) records, the third time to 3(n/s) records each, and so on. The process will surely succeed at some point, and convergence will definitely occur as the subsets grow and become more similar each time. Even if r and p are set to be very high, in the sth iteration each subset will be identical to the original one and will contain s(n/s) = n records. We will therefore obtain s identical sets of cluster centers, and the same distinct cluster centers will come out as the final meta-cluster centers. Of course, this case represents the worst-case scenario and it would result in an increase in time complexity with respect to the corresponding analog non-distributed


clustering process. However, our experience is that in almost all the tests we have made, the subset augmentation process has to be repeated at most three times. Since the clustering of the subsets takes place simultaneously for all the subsets, we significantly decrease the execution time. The extra steps of the process, i.e., the creation and clustering of the meta-table, take an insignificant amount of time to complete, so the actual time for the whole process is essentially the execution time for clustering a single subset. Even if the process has to be repeated three or four times, which is rarely the case, the execution time gain is great.
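To make this argument concrete, a rough cost model (ours, not the paper's) can be written down. It assumes a per-run clustering cost that is linear in the number of records, as for a fixed number of k-means/k-modes passes, and it ignores the meta-table steps, so it is only indicative:

\[
T_{\mathrm{Proseggisis}} \approx \sum_{t=1}^{T} C\!\Big(t\,\frac{n}{s}\Big) \approx c\,\frac{n}{s}\,\frac{T(T+1)}{2},
\qquad
\mathrm{speedup} \approx \frac{C(n)}{T_{\mathrm{Proseggisis}}} \approx \frac{2s}{T(T+1)},
\]

where C(m) ≈ c·m is the cost of one clustering run over m records, T is the number of iterations needed before the stopping criterion is met, and the step-1 runs of each iteration execute concurrently. For T = 1 the estimated speedup is about s, while for T = 3 it drops to about s/6; this is only a back-of-envelope estimate, but it is in line with the behaviour reported in Section 3.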

3. Empirical results

In order to evaluate the Proseggisis methodology, we have implemented a system in Borland C++ Builder (http://www.isai.ceid.upatras.gr/voutsinas/forma.htm). Using this system, we have applied the Proseggisis methodology both to a real-world data set and to a test database supplied by the UCI machine learning repository (Blake and Merz, 1998). The real-world data set, with a size of 30 000 records, describes the behavior of the customer base of a big telecommunications company (telecom dataset). It contains five attributes of profit-related behavior of the customer base. The Car Evaluation database is a test dataset (car dataset), with a size of 1728 records, that was derived from a simple hierarchical decision model with six attributes (we excluded the class attribute). Because of the known underlying concept structure, this database is particularly useful for testing constructive induction and structure discovery methods. We have used two different versions of the k-modes clustering algorithm (Huang, 1998) for clustering the sub-datasets and for the meta-clustering. The two versions differ in the dissimilarity measure that is used for allocating objects to the nearest cluster and in the initial mode selection method. The first version, the BR-version, selects the first k distinct records from the data set as the initial modes. Moreover, it defines the dissimilarity

measure between two categorical objects as the total number of mismatches of the corresponding attribute categories of the two objects (Huang, 1998). Formally, if X, Y are two categorical objects described by m categorical attributes, then the dissimilarity measure is

d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j),

where \delta(x_j, y_j) = 0 if x_j = y_j and \delta(x_j, y_j) = 1 if x_j \neq y_j.

The second version, the FF-version, selects the initial modes by calculating the frequencies of all categories for all attributes and assigning the most frequent categories equally to the initial k modes (Huang, 1998). Moreover, it defines the dissimilarity measure as

d_{\chi^2}(X, Y) = \sum_{j=1}^{m} \frac{n_{x_j} + n_{y_j}}{n_{x_j} n_{y_j}} \, \delta(x_j, y_j),

where n_{x_j}, n_{y_j} are the numbers of objects in the data set that have categories x_j and y_j for attribute j (Huang, 1998). The above variations have a great impact on the performance of the algorithms. The FF-version is a robust algorithm, in contrast to the BR-version, which has a non-deterministic behavior, although it has a reduced time complexity (Huang, 1998). However, the empirical results were similar for both versions, so we can presume that the particular clustering algorithm does not have an impact on the proposed methodology. The empirical tests aim at evaluating the proposed methodology with respect to an analog non-distributed clustering process. Therefore, for each version of k-modes, we have compared the performance of the Proseggisis methodology with an application of the same version of k-modes to all the data, using the same parameters. We have used two measures in this comparison. The first measure, the similarity of clusters (SOC), refers to how similar the final allocation of objects to clusters is. It measures the ratio of objects that are allocated, during the Proseggisis methodology, to clusters similar to those of the analog non-distributed clustering process. We consider as similar clusters a meta-cluster and its corresponding cluster center that satisfy the similarity criterion. The second measure, the accuracy of clustering (AOC), refers


to the average of the distances between the cluster center and the objects that are allocated to it. Thus,

AOC = \frac{1}{|C_j|} \sum_{o_i \in C_j} d(o_i, c_j),

where o_i is the ith object allocated to cluster C_j, c_j is the center of C_j and |C_j| denotes the cardinality of C_j. We applied the Proseggisis methodology to the two datasets for four clusters, for each version and for the analog non-distributed clustering process. We tested several numbers of splits (5, 10 and 20 for the telecom dataset and 2, 5 and 10 for the car dataset), searching for four clusters (k = 4). In each iteration we incremented each sub-dataset by inc = |D|/s records. Finally, we set d to the minimum distance between the meta-clusters, p = 60% and r = 50%. In Fig. 2 the speedup is depicted as a function of the number of splits, for the telecom dataset and the car dataset, respectively. We can conclude that the speedup is linear in the number of splits. The gap observed in the BR-version for the telecom dataset is due to the larger number of iterations (three) needed in the case of 10 splits. It also seems that the speedup is independent of the size of the dataset, as shown in Fig. 3, for which we also used the Nursery dataset (Blake and Merz, 1998) with 12 960 records.
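To make these definitions concrete, the sketch below spells out the two dissimilarity measures and the AOC measure in illustrative Python (the helper names are ours; this is not the Borland C++ Builder code used for the experiments):

def simple_matching(x, y):
    """BR-version dissimilarity: number of mismatching attribute categories."""
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def chi_square_weighted(x, y, category_counts):
    """FF-version dissimilarity: each mismatch on attribute j is weighted by
    (n_xj + n_yj) / (n_xj * n_yj), where category_counts[j][c] is the number of
    objects in the data set having category c on attribute j."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        if xj != yj:
            nx, ny = category_counts[j][xj], category_counts[j][yj]
            total += (nx + ny) / (nx * ny)
    return total

def aoc(cluster_objects, centre, dist):
    """Accuracy of clustering for one cluster: average dissimilarity between the
    cluster centre and the objects allocated to it (lower is better)."""
    return sum(dist(o, centre) for o in cluster_objects) / len(cluster_objects)

Note that the weight (n_xj + n_yj)/(n_xj · n_yj) = 1/n_xj + 1/n_yj is small for frequent categories, so the FF-version penalizes mismatches on rare categories more heavily.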


In Fig. 4 the similarity of clusters measure is depicted as a function of the number of splits, for the customer base (telecom) dataset and the car dataset, respectively. It seems that this measure is independent of the number of splits and that it is closely related to the training set and the clustering algorithm. In the first chart of Fig. 5 the accuracy of clustering measure is depicted, for different numbers of splits and for the non-distributed clustering process (one split). In the second chart of Fig. 5 the improvement in the accuracy of clustering with respect to the analog non-distributed clustering process is depicted as a function of the number of splits. It seems that AOC is also independent of the number of splits. The stopping criteria of the program depend on the values of d, p and r. However, as a result of the empirical tests, we can conclude that d, p and r are not critical for the performance of the proposed methodology. In fact, we used these parameters for testing purposes. In most of the cases, after the first iteration, more than 80% of the clusters in the cluster sets satisfy the similarity criterion, and in most of the cases more than 90% of the cluster sets were covered. As a consequence, the process needs only a few iterations to be completed (from one to three). In support of this argument, we tested the performance of the Proseggisis methodology with respect to each of these parameters.
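In the experiments the threshold d was tied to the data by setting it to the minimum pairwise distance between the meta-cluster centers. A small helper in the same illustrative spirit as the earlier sketches (the function name is ours) would be:

def min_meta_distance(meta_centres, dist):
    """Threshold d used in the experiments: the smallest pairwise dissimilarity
    between meta-cluster centres (Table 1 also probes min/2 and min/3)."""
    return min(dist(a, b)
               for i, a in enumerate(meta_centres)
               for b in meta_centres[i + 1:])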

Fig. 2. Speedup as a function of the number of splits.



Fig. 3. Speedup as a function of the dataset size.

Fig. 4. SOC as a function of the number of splits.

Fig. 5. AOC as a function of the number of splits.


The tests were made using the car dataset and the FF-version for k = 4. We used different values of d, p, and r, keeping fixed p and r, or d and r, or d and p, respectively. The results are summarized in Table 1.

3.1. Related work

The proposed Proseggisis methodology is a scalable framework for any clustering algorithm, either partitional or hierarchical. To our knowledge, the

Table 1
Tests for different values of d, p, and r

p = 60%, r = 50%
  d = min/3    BR: 5th iteration, SOC = 100%,   AOC = 3.229, speedup < 1
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615
  d = min/2    BR: 3rd iteration, SOC = 52.35%, AOC = 3.157, speedup = 1.627907
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615
  d = min      BR: 1st iteration, SOC = 57.94%, AOC = 3.258, speedup = 4.666667
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615

d = min, r = 50%
  p = 30%      BR: 1st iteration, SOC = 57.94%, AOC = 3.258, speedup = 4.666667
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615
  p = 60%      BR: 1st iteration, SOC = 57.94%, AOC = 3.258, speedup = 4.666667
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615
  p = 90%      BR: 3rd iteration, SOC = 52.35%, AOC = 3.157, speedup = 1.627907
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615

d = min, p = 60%
  r = 25%      BR: 1st iteration, SOC = 57.94%, AOC = 3.258, speedup = 4.666667
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615
  r = 50%      BR: 1st iteration, SOC = 57.94%, AOC = 3.258, speedup = 4.666667
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615
  r = 75%      BR: 3rd iteration, SOC = 52.35%, AOC = 3.157, speedup = 1.627907
               FF: 1st iteration, SOC = 69.5%,  AOC = 0.148, speedup = 5.384615


methodologies proposed in the literature for scaling clustering algorithms to large databases accommodate a specific clustering algorithm. Of course, this paper is focused on partitional clustering. However, to apply the Proseggisis methodology to hierarchical clustering, the only requirement is to extract clusters, in a similar manner, from the dendrograms obtained from the splits and from the meta-table. This requirement is due to the fact that hierarchical clustering does not try to find "best" clusters. Some other scalable clustering schemes impose constraints on the input set of patterns. For instance, the CLARA algorithm presented in (Kaufman and Rousseeuw, 1990) can be applied to a maximum of 3500 patterns, while CLARANS (Ng and Han, 1994) and DBSCAN (Ester et al., 1996) can be applied to spatial data. The BIRCH (Zhang et al., 1996) and ScaleKM (Bradley et al., 1998) partitional clustering algorithms, like the proposed Proseggisis methodology, are based on the fact that a subset of the data typically requires significantly fewer iterations to cluster. However, they are both incremental algorithms and not distributed ones. They try to cluster very large databases, those too large to reside in main memory, with the available resources. They both operate within the confines of a limited memory buffer. They reduce time complexity by reducing the number of scans to one or two, in contrast to k-means, which needs many more. Their key idea is to identify regions of the data that are compressible, regions that must be maintained in memory and regions that are discardable. The methodologies based on parallelization are the ones most relevant to the Proseggisis methodology that we are aware of. However, they require the transformation of a specific clustering algorithm into an optimized parallel one for a specific architecture. Of course, such parallel clustering algorithms achieve a great performance gain over serial ones. For instance, the parallel partitional clustering algorithm presented in (Ni and Jain, 1985) has a potential performance gain of 1300 times over a serial algorithm, while the one presented in (Li and Fang, 1989) reduces the time complexity of one scan of the data to a logarithmic one with respect to the number of patterns (k log(n)).

However, they both need a large number of processors and a specific architecture. The first is based on a VLSI systolic architecture and the latter on a hypercube SIMD computer model with a vast number of processing elements, equal to the product of the number of patterns and the number of attributes. On the other hand, parallel hierarchical clustering algorithms (e.g., Li and Fang, 1989; Olson, 1995) may decrease the O(n^2) time required by the serial algorithms to O(n). However, all the above partitional and hierarchical clustering algorithms need a very large number of processors. Additionally, they assume that all input patterns can reside in main memory at the same time. The recent PDBSCAN algorithm presented in (Jäger and Kriegel, 1999) overcomes these problems. However, although its performance is similar to that of the Proseggisis methodology, it accommodates the specific DBSCAN (Ester et al., 1996) partitional clustering algorithm.

4. Conclusions and further research

We presented the Proseggisis methodology, a novel methodology for distributing the clustering process. The key idea is akin to running the clustering algorithm s times on subsets of the data and then taking an average of their results. We presented empirical results which demonstrate that convergence usually occurs while the subsets are still small. The empirical results also demonstrate that the proposed methodology can significantly speed up the analog non-distributed clustering process. Notice that the proposed methodology does not require a substantial pre-processing phase and does not suffer from any communication overhead. There are several variations of the proposed methodology regarding the implementation of the similarity criterion and the stopping condition. For instance, following the idea of one of the referees of this paper, we have implemented a version of the Proseggisis methodology where the meta-table is formed in a different manner (second step). In order to construct the meta-table, the centers of the cluster sets found in the first step are weighted by the percentage of data points they cover in their

Table 2
Comparing the two versions

Version        Speedup    SOC (%)   AOC
k-modes        1          100       0.142
Normal         1.77215    57.9      0.145
With weights   1.77215    63.09     0.214

cluster. We tested this version using the car dataset and the FF-version. We set p = 60%, r = 50%, s = 5 and k = 4. Instead of using 5 × 4 = 20 centers of cluster sets to construct the meta-table, we used 5 × 10 = 50. This is done by inserting into the meta-table 10 cluster centers for each set. The 10 centers are actually repetitions of the four cluster centers obtained from each set during the first step. The number of repetitions of each cluster center is proportional to the percentage of data points it covers in its cluster. The results are summarized in Table 2, where the speedup, SOC and AOC are depicted for the two versions as well as for the plain k-modes algorithm. In the version with weights, while the execution time remains exactly the same, there is a satisfactory increase in SOC. However, surprisingly, there is a decrease in accuracy as measured by AOC. In general, although more sophisticated implementations may improve the performance, the basic strategy presented in this paper performs well on numerous different datasets.
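A possible rendering of this weighted construction, as a rough sketch in the same illustrative style as before (the exact rounding scheme is our assumption; the paper does not specify it):

def weighted_meta_table(cluster_sets, coverages, rows_per_set=10):
    """Build a meta-table in which each cluster centre is repeated roughly in
    proportion to the fraction of its sub-dataset's points that it covers.

    cluster_sets[i] : the k centres extracted from sub-dataset i
    coverages[i][j] : fraction of sub-dataset i's records falling in cluster j
    rows_per_set    : rows contributed per cluster set (10 in the test above)
    """
    table = []
    for centres, fractions in zip(cluster_sets, coverages):
        for centre, frac in zip(centres, fractions):
            # at least one copy per centre; the counts may not sum exactly to
            # rows_per_set because of rounding (a simplification of ours)
            copies = max(1, round(frac * rows_per_set))
            table.extend([centre] * copies)
    return table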

Acknowledgements This work is supported by Enterprise Programmes for Research and Technology II, General Secretariat for Research and Technology, Hellenic Ministry of Development.

References

Alsabti, K., Ranka, S., Singh, V., 1995. An efficient k-means clustering algorithm. In: Proc. of the First Workshop on High Performance Data Mining.
Anderberg, M., 1973. Cluster Analysis for Applications. Academic Press, New York.


Blake, C.L., Merz, C.J., 1998. UCI Repository of machine learning databases [http://www.ics.uci.edu/mlearn/MLRepository.html]. University of California, Department of Information and Computer Science, Irvine, CA.
Bradley, P.S., Fayyad, U.M., 1983. Refining initial points for k-means clustering. In: Proc. of the IJCAI-93, San Mateo, CA, pp. 1058–1063.
Bradley, P.S., Fayyad, U.M., Reina, C., 1998. Scaling clustering algorithms to large databases. In: Proc. of the 4th Internat. Conf. on Knowledge Discovery and Data Mining, pp. 9–15.
Dubes, R.C., Jain, A.K., 1980. Clustering methodologies in exploratory data analysis. Adv. Comput. 19, 113–228.
Ester, M., Kriegel, H.P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 2nd Internat. Conf. on Knowledge Discovery and Data Mining, pp. 226–231.
Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowledge Discovery 2, 283–304.
Jäger, J., Kriegel, H.P., 1999. A fast parallel clustering algorithm for large spatial databases. Data Mining Knowledge Discovery 3 (3), 263–290.
Jain, A.K., Dubes, R.C., 1988. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ.
Johnson, S., 1967. Hierarchical clustering schemes. Psychometrika 23, 241–254.
Kaufman, L., Rousseeuw, P.J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York.
Li, X., Fang, Z., 1989. Parallel clustering algorithms. Parallel Comput. 11, 275–290.
MacQueen, J.B., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. of the 5th Berkeley Symposium on Mathematics Statistics and Probability, pp. 281–297.
Ng, R., Han, J., 1994. Efficient and effective clustering methods for spatial data mining. In: Proceedings of VLDB94.
Ni, L., Jain, A., 1985. A VLSI systolic architecture for pattern clustering. IEEE Trans. Pattern Anal. Machine Intell. 7 (1), 80–89.
Olson, C.F., 1995. Parallel algorithms for hierarchical clustering. Parallel Comput. 21 (8), 1313–1325.
Pfitzner, D.W., Salmon, J.K., Sterling, T., 1997. Halo world: tools for parallel cluster finding in astrophysical N-body simulations. Data Mining Knowledge Discovery 1 (4), 419–438.
Provost, F., Kolluri, V., 1997. Scaling up inductive algorithms: an overview. In: Proc. of the 3rd Internat. Conf. on Knowledge Discovery and Data Mining, pp. 239–242.
Rasmussen, E.M., Willett, P., 1989. Efficiency of hierarchical agglomerative clustering using ICL distributed array processor. J. Doc. 45 (1), 1–24.
Ruspini, E.H., 1969. A new approach to clustering. Inf. Control 15, 22–32.


Vrahatis, M.N., Boutsinas, B., Alevizos, P., Pavlides, G., 2002. The new k-windows algorithm for improving the k-means clustering algorithm. J. Complexity, to appear.

Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering method for very large databases. In: SIGMOD’96, Montreal, Canada, pp. 103–114.