A Distributed Data Clustering Algorithm in P2P Networks

Rasool Azimi a, Hedieh Sajedi b,*, Mohadeseh Ghayekhloo a

a Young Researchers and Elite Club, Qazvin Branch, Islamic Azad University, Qazvin, Iran
b Department of Computer Science, School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran

* Corresponding author at: Department of Computer Science, College of Science, University of Tehran, Tehran, Iran. Tel.:+982161112915 E-mail addresses: [email protected] (Hedieh Sajedi), [email protected] (Rasool Azimi), [email protected] (Mohadeseh Ghayekhloo)

ABSTRACT
Clustering is one of the important data mining issues, especially for large and distributed data analysis. Distributed computing environments such as Peer-to-Peer (P2P) networks involve separated and scattered data sources distributed among the peers. Owing to the unpredictable growth and dynamic nature of P2P networks, the data of peers are constantly changing. Due to the high volume of computation and communication, as well as privacy concerns, such data should be processed in a distributed way and without central management. Today, most applications of P2P systems focus on unstructured P2P systems. In unstructured P2P networks, gossiping is a simple and efficient method of communication that can adapt to the dynamic conditions of these networks. Recently, some algorithms with different pros and cons have been proposed for data clustering in P2P networks. In this paper, by combining a novel method for extracting representative data, a gossip-based protocol, and a new centralized clustering method, a Gossip-Based Distributed Clustering algorithm for P2P networks, called GBDC-P2P, is proposed. The GBDC-P2P algorithm is suitable for data clustering in unstructured P2P networks and adapts to the dynamic conditions of these networks. In the GBDC-P2P algorithm, peers perform the data clustering operation in a distributed manner, only through communication with their neighbours. The GBDC-P2P does not need to rely on a central server and operates asynchronously. Evaluation results demonstrate the superior performance of the GBDC-P2P algorithm. Also, a comparative analysis with other well-established methods illustrates the efficiency of the proposed method.
Keywords: Distributed data mining, Data clustering, Gossiping, Overlay, Peer-to-Peer network.

1. Introduction and Related Works
Peer-to-Peer (P2P) computing or networking is a distributed application architecture that is used as a common method for applications involving data exchange between distributed resources. In such applications, large volumes of data are distributed across several data sources. Due to privacy concerns, bandwidth limits, and the large amount of data, it is difficult to collect the peers' data on a central server and perform data mining over the whole data centrally. In fact, the processing of this distributed data should be done in a distributed way, without concentrating the whole data in a central place [1], [8]. In P2P networks, we face challenges such as the dynamics of the network, the complex structure of the network, and distributed data. The data at each peer are constantly changing, and some peers may not be present in the network all the time. New peers can join the network while old peers may leave at any time [19]. Generally, several techniques known as data mining methods have been proposed for extracting data, finding unsuspected relationships between data, and transforming them into an understandable structure for further use. Traditional data mining algorithms assume that all data are concentrated in one location. Nowadays, with the increasing development of communication systems, a high volume of data is distributed, and this volume of data is increasing gradually [3]. In P2P networks, the data are placed on the peers in a distributed way. Analysing these distributed data sources requires data mining technology designed for distributed applications, called Distributed Data Mining (DDM) [2].


Overall, DDM operations can be performed with four approaches. A common approach is to send the data to a central site and apply centralized data mining operations to it. The second approach performs local data mining operations at each site and produces a local model; all of the local models are then sent to a central site to produce a global model [10], [11], [18]. In the third approach, a small set of representative data is selected at each site and sent to a central site, which combines the local representatives to form a global one. The data mining operation is then performed on the global representative data set [3], [13], [16]. The fourth approach, unlike the previous three, does not include a central site for facilitating the data mining operation. This approach belongs to P2P networks and is called P2P Data Mining (P2PDM) [3], [17]. In fact, P2PDM is a special type of DDM where there is no notion of centralization in the mining process; all participants are regarded as equal peers. P2PDM scenarios are typically considered when no single peer owns the whole data set. In this case, the data are naturally distributed over a large number of peers, and the goal is to find a clustering solution that considers the entire dataset. Since P2PDM is a type of DDM, many of the challenges that arise in P2PDM stem from the distributed environment. Some of those challenges are: the communication model, the complexity and quality of the global model, and the privacy of local data [3]. Clustering is an important data mining technique for data analysis [1]. Most clustering approaches rely on centralized data structures and/or cannot cope with dynamically changing data. This means that these techniques cannot be employed for distributed data sets. However, the current use of distributed resources calls for distributed approaches to data clustering [4]. Real-world applications of distributed clustering include: clustering different media metadata (e.g. word documents, pictures, music files, etc.) from different machines; clustering nodes' activity history data (allocated resources, issued queries, upload and download rates, etc.); clustering books from a distributed network of libraries; and clustering scientific achievements from different academic institutions and publishers [8]. Among the methods that have been proposed for clustering in unstructured P2P networks, LSP2P K-means [5] is a partition-based distributed clustering algorithm. In summary, it consists of frequent iterations of the K-means algorithm [14] in every peer. At each iteration, each peer receives the centroids and the number of clusters of the same iteration from its neighbour peers. The related peer then produces the centroids for the next iteration by integrating this information with its local data. If the distance between the centroids of the new clusters and the centroids of the previous iteration is greater than a threshold, the next iteration is started; otherwise, that peer reaches the termination state. However, the communication complexity of the LSP2P K-means method increases linearly with the network size. In [5], a distributed clustering algorithm called USP2P K-means is proposed, considering the fact that the clustering accuracy of the LSP2P K-means algorithm cannot be guaranteed. However, the accuracy of USP2P K-means is guaranteed only if the data and the network do not change from the beginning to the end of the algorithm.
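To make the LSP2P K-means update described above more concrete, the following Python sketch shows how a single peer might merge its own centroids with those received from its neighbours and then re-run one local K-means step. The function name, the count-weighted merging, and the termination test are our own reading of [5], not the published implementation.

```python
import numpy as np

def lsp2p_local_update(local_data, own_centroids, own_counts,
                       neighbour_centroids, neighbour_counts, threshold=1e-3):
    """One hypothetical LSP2P K-means iteration at a single peer.

    Centroids received from neighbours are merged with the peer's own
    centroids, weighted by the number of points behind each centroid, and
    the peer then reassigns its local data.  Returns the new centroids, the
    new cluster sizes, and a flag telling whether the peer may terminate.
    """
    local_data = np.asarray(local_data, dtype=float)
    own_centroids = np.asarray(own_centroids, dtype=float)
    k = own_centroids.shape[0]
    merged = np.zeros(own_centroids.shape, dtype=float)
    weights = np.zeros(k)
    # Count-weighted average of this peer's centroids and its neighbours' centroids.
    for centroids, counts in [(own_centroids, own_counts)] + list(
            zip(neighbour_centroids, neighbour_counts)):
        centroids = np.asarray(centroids, dtype=float)
        counts = np.asarray(counts, dtype=float)
        merged += centroids * counts[:, None]
        weights += counts
    merged /= np.maximum(weights[:, None], 1.0)

    # Standard local K-means step: reassign local data to the merged centroids.
    dists = np.linalg.norm(local_data[:, None, :] - merged[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_counts = np.bincount(labels, minlength=k).astype(float)
    new_centroids = merged.copy()
    for j in range(k):
        if new_counts[j] > 0:
            new_centroids[j] = local_data[labels == j].mean(axis=0)

    # Terminate when the centroids barely move between iterations.
    converged = np.linalg.norm(new_centroids - own_centroids) < threshold
    return new_centroids, new_counts, converged
```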
In [1], a fully decentralized clustering method based on the DBSCAN algorithm [6], called GoSCAN, is proposed. It is capable of clustering dynamic and distributed datasets without needing central control or message flooding. The GoSCAN algorithm can be broken into two major tasks: 1. identifying core points, and 2. forming the actual clusters. These two operations are executed in parallel, employing gossip-based communication. GoSCAN allows each peer to find an individual trade-off between the quality of the discovered clusters and the transmission costs. However, the GoSCAN algorithm yields higher accuracy in networks with fewer peers than in networks with a larger number of peers. In [7], the Epidemic K-means algorithm has been proposed to solve the distributed random K-means issue. Each peer may be in one of two states, ACTIVE and CONVERGED. When active, the corresponding peer performs two phases at each K-means iteration. In the first phase, the calculation is performed on the local dataset, as in centralized K-means: for any given point x, the nearest centroid is calculated and the local cluster partitions are determined. In the second phase, the global cluster statistics are computed using gossip-based aggregation protocols. Iterations of the K-means algorithm continue until the total error is less than a previously set threshold value; below this threshold, the state of the peer changes to CONVERGED. The statistical guarantee of the gossip-based protocol ensures that the results at each peer are within a bounded approximation error, provided the network is static and faultless. However, the performance of this approach decreases when there are message losses and peer failures in asynchronous networks. In [8], a decentralized partition-based algorithm, named GDCluster, has been proposed for data clustering in unstructured P2P networks. This algorithm has two aspects: 1) each peer produces its local model in the form of representatives and shares them during interactions with other peers; in this way, after a while, the representatives of each peer are distributed over the whole network. 2) Each peer should obtain detailed information about the network through its local data. To do so, in addition to exchanging its representatives, each peer collects data of other peers in the range of its local data and updates its representatives using them. This makes peers informed about their data after several iterations and, thus, increases the precision of the exchanged representatives. However, due to the high computational overhead, the efficiency of GDCluster decreases in large-scale networks as the internal data of peers increase. In addition, applying the classical K-means method may lead to wrong clustering results. The performance of K-means clustering depends on the initial centroids, which are randomly selected in the first phase of the algorithm, and it is often trapped in local minima due to its hill-climbing approach [32].

In general, the suggested methods lack the desired features for data clustering in the distributed environment of P2P networks, namely avoiding synchronization and pervasive communication, compatibility with dynamic systems, and scalability. In this paper, we propose a novel distributed data clustering algorithm that is suitable for unstructured P2P networks. The contributions of the paper are fourfold, as follows:
• A novel method is proposed to extract representative data with a high data accumulation rate.
• An improved version of the K-means algorithm is proposed to provide more accurate clustering results in each peer.
• A novel distributed approach for P2P networks is proposed by combining our method for extracting representative data, a gossip-based protocol, and the proposed centralized clustering method, without the need for a central server or synchronized operations between peers.
• The proposed distributed method, as a soft computing approach, provides a summary model of the whole network data in each peer and thereby leads to optimal clustering results with a low-cost solution. Access to the summarized model makes it unnecessary for the peers to access the whole network data, and reduces the cost of computation and communication as well as the peers' memory usage.
The rest of the paper is organized as follows. In Section 2, our distributed clustering algorithm for unstructured P2P networks, called GBDC-P2P, is introduced. In Section 3, a centralized data clustering algorithm, named Persistent K-means, is proposed, which is used in the third phase of the GBDC-P2P algorithm. In Section 4, the Persistent K-means algorithm is first assessed, and then the GBDC-P2P algorithm is evaluated in both static and dynamic networks. Finally, in Section 5, conclusions and future work are presented.
2. The proposed GBDC-P2P algorithm
2.1. General operation of the GBDC-P2P
The GBDC-P2P algorithm is an adaptive approach that uses the K-medoids [15] and K-means [14] algorithms to extract the representative data and discover the final clustering results, and the CYCLON algorithm [9] to manage interactions between peers. Assume there are two peers P and Q. In the proposed method, during each interaction of P and Q, two data packets are exchanged between these peers along with the gossip message. The first packet includes the most important data, selected by a novel approach from the peer's local data. The second packet includes the data that each peer has received from its neighbours before the current round of the gossip operation. From now on, the local data of each peer are called internal data, and the data that each peer has received from other peers are called external data. In each peer, the size of the external data at the beginning of the algorithm is zero. By spreading representative data across the network peers, after a while all peers reach the same pattern of the whole network data. Thus, each peer obtains an overall pattern (view) of the whole network data. In the final phase, each peer runs a centralized clustering algorithm over its received data and obtains the final cluster centroids independently. The significant data selected during each gossip round are spread using the CYCLON approach. Forwarding external data alongside the representatives to neighbouring peers certainly increases the communication load. However, it is an improvement of the algorithm, because by doing so peers are able to access the pattern of the dataset in a shorter span of time.
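Before going into the details, the following sketch models the per-peer bookkeeping implied by the description above: internal data, extracted representatives, external data with ages, and a CYCLON view of neighbours. The class and attribute names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class PeerState:
    """Minimal per-peer state assumed by the GBDC-P2P description."""

    def __init__(self, peer_id, internal_data):
        self.peer_id = peer_id
        self.internal_data = np.asarray(internal_data)  # local data, never sent in full
        self.representatives = None   # M representatives extracted by M-Represents
        self.external_data = []       # list of (data_point, age) received from other peers
        self.neighbours = {}          # peer_id -> descriptor age (CYCLON view)

    def add_external(self, points):
        """Store newly received representatives with age 0 (see Section 2.5)."""
        for p in points:
            self.external_data.append((np.asarray(p), 0))

    def age_external(self, max_age):
        """Increment ages each gossip round and drop entries past their lifetime."""
        self.external_data = [(p, age + 1) for p, age in self.external_data
                              if age + 1 <= max_age]
```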
Now, we discuss the details of the proposed GBDC-P2P algorithm by breaking its workflow into separate parts.
2.2. Start of the proposed GBDC-P2P
In the beginning, as shown in Fig. 1, each peer selects M important data among its internal data using the M-Represents algorithm. The important data are located in regions with a high data accumulation rate. Then, the corresponding peer sends the important data to its neighbour peers based on the CYCLON algorithm. In each gossip round, each peer sends the extracted representative data in a separate packet to some of its neighbours based on the CYCLON protocol. The same peer also sends the external data received from other peers in previous rounds to each neighbour peer interacting with it. By aggregating this representative data, each peer achieves a good summarized view of the entire network data. This summarized view covers all clusters. If the representative data are broadcast among the network peers, each peer gains access to an appropriate summarized view of all the network data. The list of parameters used in the M-Represents algorithm is shown in Table 1.

The process of the M-Represents algorithm for selecting M representatives from D candidate data in each peer is as follows (assuming that the number of final clusters, K, is determined from the beginning; a code sketch of this procedure appears after the three case studies below):

(i) Determine the Kl value according to the values of K and D (to run the Kl-medoids algorithm Kr times):
A. If D ≤ K then Kl = D − 1.
B. If K < D ≤ 2K then Kl = K.
C. If D > 2K then Kl = 2K.
(ii) Run the Kl-medoids algorithm Kr times on the D candidate data (by default, Kr = K).
(iii) Assign a score (+1) to each representative datum returned by the Kl-medoids algorithm in each run of this algorithm.
(iv) Select the M representatives that obtain the highest weights over the Kr runs of the Kl-medoids algorithm on the candidate data. The weight of each datum is the number of times it has been returned as a representative during the Kr runs of the Kl-medoids algorithm.
Now, by considering three different datasets, corresponding to the three states A, B and C above, we examine different values of Kl in each of the three states. A setting with 128 peers is considered. Each peer individually runs the M-Represents algorithm once on its internal data. Then all the representative data from all peers are aggregated in a single place; for M = 1 this yields 128 data points, which present a summarized view of the entire dataset used in each of the three states. It should be noted that in this setting the assumptions of a P2P network are not considered and the gossip-based protocol is not used; the sole purpose is to assess the quality of the summarized view for various values of Kl.
A. If D ≤ K then Kl = D − 1

Consider a dataset consisting of 1024 data points and 10 clusters (K = 10), and assume M = 1 in a network consisting of 128 peers, each of which has 8 (D = 8) local data. A view of the main dataset containing 600 2-dimensional (2-d) data and its summarized view for different values (1, 3, 5, and 7) of Kl is shown in Fig. 2. According to Fig. 2, in the case Kl = (D − 1) = 7, the summarized view covers the entire set of clusters better than the other cases.
B. If K < D ≤ 2K then Kl = K

Consider a dataset consisting of 2560 data points and 10 clusters (K = 10), and assume M = 1 in a network consisting of 128 peers, each of which has 20 (D = 20) local data. A view of a dataset containing 2560 data and its summarized view for different values of Kl is shown in Fig. 3. According to Fig. 3, in the case Kl = K = 10, the summarized view covers the entire set of clusters better than the other cases.
C. If D > 2K then Kl = 2K

Consider a dataset consisting of 10240 data points and 10 clusters (K = 10), and assume M = 1 in a network consisting of 128 peers, each of which has 80 (D = 80) local data. A view of a dataset containing 5000 data and its summarized view for different values of Kl is shown in Fig. 4. According to Fig. 4, in the case Kl = 2K = 20, the summarized view covers the entire set of clusters better than the other cases. A question may arise: why is Kr presumed to be equal to K (Kr = K) in the M-Represents algorithm? The answer is that a more precise selection of representatives by M-Represents simplifies the correct discovery of the K final clusters (since the external data is produced by collecting representatives). As Kr increases, the chance of more significant data being selected in the peers increases, due to the higher number of iterations of the Kl-medoids algorithm, as shown in Fig. 5. The patterns shown in Fig. 5 are obtained by performing the M-Represents algorithm for different values of Kr. We performed experiments on a network consisting of 128 peers, with a dataset of 10752 2-d data and 10 clusters (K = 10), where the share of each peer is 84 internal data. The peers perform the M-Represents algorithm with M = 1 and Kl = 2K = 20 on their internal data, and the extracted representatives of the peers (consisting of 100 data) are aggregated. As shown in Fig. 5, as the value of Kr increases, the aggregation level of the representative data around the centroids increases. In addition, this helps to separate the clusters more easily and increases the chance of obtaining more precise results. As can be seen, for Kr = 10 and Kr = 14 the obtained results represent a better pattern. However, since an increased Kr reduces the convergence speed of the algorithm, it is better to prefer Kr = K = 10 over higher values of Kr.
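The selection rule of steps (i)-(iv), together with the Kl cases A-C above, can be sketched as follows. A minimal ad-hoc K-medoids routine is included so the example is self-contained; all function names, the exact case boundaries at D = K, and the tie-breaking details are our own assumptions rather than the authors' code.

```python
import numpy as np

def k_medoids(data, k, n_iter=20, rng=None):
    """Very small K-medoids: returns the indices of k medoid points."""
    rng = rng or np.random.default_rng()
    data = np.asarray(data, dtype=float)
    n = len(data)
    medoids = rng.choice(n, size=k, replace=False)
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                # medoid = member minimising the total distance to the other members
                new_medoids[j] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

def m_represents(data, K, M, Kr=None, rng=None):
    """Select M representatives from the D = len(data) candidates (steps i-iv)."""
    data = np.asarray(data, dtype=float)
    D = len(data)
    Kr = Kr or K                     # by default Kr = K
    # (i) choose Kl according to D and K (cases A, B, C)
    if D <= K:
        Kl = D - 1
    elif D <= 2 * K:
        Kl = K
    else:
        Kl = 2 * K
    # (ii)-(iii) run Kl-medoids Kr times and give every returned medoid a +1 score
    scores = np.zeros(D)
    for _ in range(Kr):
        for idx in k_medoids(data, Kl, rng=rng):
            scores[idx] += 1
    # (iv) keep the M highest-scoring candidates as representatives
    return data[np.argsort(scores)[::-1][:M]]
```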

The gossip-based interaction process between peers is discussed in the next section.
2.3. Gossip-based interactions
As shown in Fig. 6, the process of interactions between peers, based on the CYCLON algorithm, to spread representative data in the network is as follows (a sketch of one such exchange is given after these steps):
(i) The active peer P increments the age of all its neighbours by one unit (the neighbours of each peer are peer descriptors or, in other words, its view components).
(ii) The active peer P selects one of the neighbours with the highest age as the passive peer Q for the gossip-based interaction.

(iii) The active peer P also selects g − 1 other random neighbours and adds its own address with age zero (age = 0), forming a gossip-based message to be sent to the passive peer Q. The value of g is equal to half the total number of peer descriptors.
(iv) The active peer P removes the passive peer's address from its view.
(v) The active peer P sends the pRepresentatives and pExtdata data to the passive peer, along with the gossip-based message.
(vi) The passive peer Q also selects g random neighbours to form a gossip message to be sent to the active peer P.
(vii) The passive peer Q sends the qRepresentatives and qExtdata data to peer P with the gossip-based message.
(viii) Finally, peers P and Q, after integrating the received data with their local data, add them to their local memory as their external data.
A description of the functions used in the gossip interactions between the active peer P and the passive peer Q is given in Table 2.
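A rough sketch of one such exchange between an active peer P and a passive peer Q, reusing the PeerState fields introduced in Section 2.1, is given below. The helper name gossip_round and the message layout are illustrative assumptions on top of the CYCLON protocol [9], not the authors' implementation.

```python
import random

def gossip_round(P, Q, g, max_age):
    """One CYCLON-style exchange between active peer P and passive peer Q (steps i-viii)."""
    # (i) the active peer ages all of its neighbour descriptors
    for nid in P.neighbours:
        P.neighbours[nid] += 1
    # (ii) is assumed to have happened already: Q was chosen as P's oldest neighbour
    # (iv) the passive peer's address is removed from P's view
    P.neighbours.pop(Q.peer_id, None)

    # (iii) P picks g-1 random descriptors and adds its own address with age 0
    p_subset = random.sample(list(P.neighbours), min(g - 1, len(P.neighbours)))
    p_message = {pid: P.neighbours[pid] for pid in p_subset}
    p_message[P.peer_id] = 0
    # (v) P also sends its representatives and previously received external data
    p_payload = (P.representatives, [d for d, _ in P.external_data])

    # (vi) Q picks g random descriptors for its reply
    q_subset = random.sample(list(Q.neighbours), min(g, len(Q.neighbours)))
    q_message = {qid: Q.neighbours[qid] for qid in q_subset}
    # (vii) Q also sends its representatives and external data
    q_payload = (Q.representatives, [d for d, _ in Q.external_data])

    # (viii) both peers merge what they received into their external data and views
    for peer, (reps, ext), view in ((P, q_payload, q_message), (Q, p_payload, p_message)):
        received = ([] if reps is None else list(reps)) + ext
        peer.add_external(received)
        peer.neighbours.update(view)
        peer.age_external(max_age)
```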

2.4. Summarization
Whenever a peer's memory becomes full, it is necessary to perform a summarization operation over the peer's external data. Besides clustering, the K-medoids algorithm can also be used for data summarization. For example, suppose that the K-medoids algorithm is performed on a dataset with a volume of 4000, with a K value of, for instance, 2000. One can then obtain 2000 significant data out of the 4000 data, and the rest of the data can be removed from the peer's memory. By doing so, the less significant part of the external data is removed and the problem of limited peer memory is overcome. In addition, considering a lifetime for external data in dynamic networks facilitates the summarization of the peers' memories and, to some extent, makes it unnecessary for peers to use the K-medoids algorithm for this purpose. In P2P networks, we are faced with the dynamics of the network, the network structure, and distributed data. The next section describes how the proposed method adapts to dynamic network conditions.
2.5. Adapting to dynamic network conditions
In order to adapt the algorithm to the dynamics of the network, an AGE variable is considered for each external data item of the peers during the gossip-based interactions. Thus, in the first round of gossip-based interactions, when a data item is given for the first time as a representative of one peer to another peer, the receiving peer assigns the data item an age variable. At each round of the gossip-based operations, all peers increment the age of their external data by one unit. Therefore, after a while, old external data are excluded from the external data of the peers at the end of their lifetime, and they are replaced by new data. If the external data of a peer contain duplicate data with the same age, the duplicate is removed; if duplicate data have different ages, the older one is removed. The aim is to cope with the dynamics of the network.
2.6. Calculating the final clustering results
Whenever a peer needs the clustering results, it can simply run one of the popular partitioning-based centralized clustering algorithms over its collected external data. By doing so, the corresponding peer obtains the final cluster centroids and then assigns its own internal data to the nearest centroid. Consequently, the peers' internal data are clustered properly. In the next section, a novel centralized clustering algorithm, named Persistent K-means, is proposed, which can be used in each peer to achieve the final clustering results. Persistent K-means provides more deterministic and accurate clustering results than the K-means algorithm and its improved versions.
3. Persistent K-means: an improved version of the centralized K-means algorithm
The K-means algorithm is one of the most commonly used clustering algorithms, which uses the data reassignment method to repeatedly optimize the clustering (Lloyd, 1982). The K-means algorithm has features such as simplicity and a high convergence speed. However, the performance of K-means is totally dependent on the initial centroids, which are randomly selected in the first phase of the algorithm, and it is often trapped in local minima due to its hill-climbing approach [32]. Persistent K-means uses an iterative approach to discover the best result over consecutive iterations of the K-means algorithm. The accuracy (AC) assessment criterion [20], which will be discussed briefly in the following, is used to explain the working process of the Persistent K-means centralized algorithm and to evaluate its performance.
3.1. Assessment criterion of clustering accuracy (AC)
The AC criterion, given in Eq. (1), yields a number in the range [0, 1]; the closer this number is to one, the higher the accuracy of the concerned clustering algorithm [20].

AC = ( Σ_{d∈D} δ(C(d), map(C_P(d))) ) / |D|    (1)
where C denotes the real cluster centroids and, at the end of each round of the centralized clustering algorithm, C_P = {C_P^1, C_P^2, C_P^3, ..., C_P^K} are the K calculated clusters. |D| is the total amount of network data. The map(C_P(d)) function maps a calculated cluster C_P to the real cluster C. The δ(x, y) function equals 1 if x = y; otherwise, it returns zero.
3.2. Details of the proposed Persistent K-means
Consider a (K × N) matrix named Best_Dist_{K×N}, where K is the number of clusters, N is the total size of the data, and each matrix element Best_Dist(k, n) equals the Euclidean distance of the k-th centroid (1 ≤ k ≤ K) of the discovered clusters from the n-th data element (1 ≤ n ≤ N). Using this matrix, any data element can be assigned to the nearest centroid. Consider another (K × 3) matrix, named M_Dist_{K×3}, whose rows 1 to K represent the cluster centroids in a sequential fashion. The three columns of the M_Dist_{K×3} matrix contain the number of data elements assigned to the k-th centroid, the total Euclidean distance, and the average distance of the data assigned to the corresponding cluster, respectively. Eq. (2) defines the values of the third column of this matrix:

M_Dist(k, 3) = M_Dist(k, 2) / M_Dist(k, 1)    (2)

Consider the Ave_Dist value as the mean of M_Dist(k, 3). Over consecutive runs of the K-means algorithm, a lower value of Ave_Dist indicates a more accurate clustering result. This is the "first proposed constraint" for achieving the best result among the different iterations of the algorithm. Generally, to achieve a better clustering result with the K-means algorithm, we need to run this algorithm for R iterations, where R is defined by the user. The result of the iteration with the minimum Ave_Dist value is returned as the final clustering result. This is illustrated by the following example. The dataset used includes 25620 2-d data and 10 clusters. It should be noted that the synthetic data used in this section are also used in [2], [8]. In Table 3, the results of 10 consecutive iterations are presented, and the minimum value of Ave_Dist corresponds to iteration 8. The accuracy evaluation results based on the AC measure show that the clustering accuracy in this iteration is higher than in the other iterations. Table 4 presents more details regarding the M_Dist_{K×3} matrix in the eighth iteration of the K-means algorithm using the synthetic dataset of the experiments. Fig. 7 shows the result of the eighth iteration of the algorithm. As shown in Fig. 7, the cluster centroids are chosen correctly. In addition, the detailed results of the M_Dist_{K×3} matrix in the eighth iteration are shown in Table 4. Furthermore, Fig. 8 displays the clustering results in iteration 10, which has an accuracy of 90.05%. Satisfying the "first constraint" over consecutive iterations of the K-means algorithm leads to results that are more accurate. However, in some cases, in which some clusters are close to and some are far from each other, the chance of making mistakes increases, and even by considering the "first constraint", it is not possible to provide the most accurate result. For instance, assume a synthetic dataset consisting of 10752 2-d records and 10 clusters. Table 5 displays the results of 10 consecutive runs of the algorithm on this dataset. As can be seen in Table 5, even though the minimum value of Ave_Dist corresponds to iteration 10, the accuracy of iteration 10 is less than that of iteration 6.
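As a minimal sketch of the Best_Dist/M_Dist bookkeeping and Eq. (2), the following function computes M_Dist and Ave_Dist for one K-means result; the function name and array layout are our own choices, not the authors' implementation.

```python
import numpy as np

def ave_dist(data, centroids):
    """Compute M_Dist and Ave_Dist for one K-means result (Eq. (2)).

    Best_Dist[k, n] is the Euclidean distance of the k-th centroid to the
    n-th data element; every data element is assigned to its nearest centroid.
    """
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    best_dist = np.linalg.norm(centroids[:, None, :] - data[None, :, :], axis=2)
    labels = best_dist.argmin(axis=0)                 # nearest centroid per data element

    K = len(centroids)
    m_dist = np.zeros((K, 3))
    for k in range(K):
        assigned = best_dist[k, labels == k]
        m_dist[k, 0] = len(assigned)                  # number of assigned data elements
        m_dist[k, 1] = assigned.sum()                 # total distance to the centroid
        m_dist[k, 2] = m_dist[k, 1] / m_dist[k, 0] if m_dist[k, 0] else 0.0
    return m_dist, m_dist[:, 2].mean()                # Ave_Dist = mean of the third column
```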
Considering only Ave_Dist, the result of iteration 10 of the algorithm would have to be returned as the final result. However, according to Fig. 9, in iteration 10 the algorithm specifies some cluster centroids by mistake, and based on the AC accuracy measure the best result is obtained in iteration 6. The reason for the wrong answer of iteration 10 is the wrongly placed cluster centroids in areas with a low data concentration. In the following, a solution is proposed for this problem. Table 6 presents the results of M_Dist_{K×3} in iteration 10. In an ideal situation, for the relevant datasets, the sizes of the clusters should be almost the same. However, in the tenth iteration (Table 6), the number of data assigned to the centroid of the fifth cluster (K = 5) is almost twice the expected value, since one cluster centroid is selected for two adjacent clusters (Fig. 9). Moreover, the number of data records assigned to the centroid of the ninth cluster (K = 9) is very low, since this centroid is selected at a point far from the concentrated area of the data space. As mentioned, the third column of M_Dist_{K×3} represents the average distance of a data record from the corresponding cluster centroid. These values are presented in the fourth column of Table 6. In cases where cluster centroids are falsely selected far from the concentrated areas, the average distance of the data assigned to these cluster centroids is much larger than that of the cluster centroids selected within the concentrated areas. For a better representation of this difference, each element of the fourth column in Table 6 is divided by the mean of its values, and the results are presented in the fifth column of Table 6. According to the results, the difference between the elements of the fifth cluster (K = 5) and the ninth cluster (K = 9) is abnormal. The reason behind this is the large difference between the mean and the median of the elements in the third column of M_Dist_{K×3}. To prevent these problems, we propose a condition called the second constraint. Accordingly, if Eq. (3) is satisfied by one of the iterations, the result of that iteration is preferred to the results of the iterations that only satisfy the "first constraint." In cases where the second constraint is satisfied, the result of that iteration is selected as the result of the algorithm. In fact, for discovering the final cluster centroids, the "second constraint" precedes the "first constraint".

(2 × Mean_Dist) > Med_Dist    (3)

In Eq. (3), Mean_Dist equals the mean of the values in the third column of M_Dist_{K×3}, and Med_Dist equals the median of the elements in the third column of M_Dist_{K×3}. The Mean_Dist and Med_Dist values of iteration 10 of K-means in the previous example are: Med_Dist = 468881682.9461516 and Mean_Dist = 144981138.2457970. These values show that Eq. (3) is not satisfied in this iteration of K-means. Therefore, it is necessary to apply some conditions to select the best result as the result of the algorithm in all cases. The process of the proposed Persistent K-means algorithm is presented in Fig. 10. According to Fig. 10, after determining the dataset (D), the number of clusters (K), and the user-defined number of iterations R, the matrix M_Dist_{K×3} is created based on the results of Best_Dist_{K×N} (one of the outputs of K-means). Subsequently, Ave_Dist is computed and compared with a default value, i.e. infinity (the initial value of Min_Dist), and the lower value is recorded as the new value of Min_Dist. It should be noted that in the next iterations, the value of Ave_Dist is recorded as a new minimum in the Min_Dist variable only if the value of Ave_Dist in the corresponding iteration is less than Min_Dist. Otherwise, the result of that iteration is ignored and a new iteration of the algorithm is executed. After updating Min_Dist, the second constraint is checked for the corresponding iteration. If it is satisfied, its result is stored in Temp2 and the value of the Counter variable is incremented; otherwise, the clustering result of that iteration is stored in Temp1. Subsequently, the algorithm checks whether there are more iterations to run. If yes, it runs the next iteration. Otherwise, it checks the Counter variable and, if its value is larger than 0, this means that the second constraint has been satisfied by one of the iterations; the result of that iteration, stored in Temp2, is then selected as the final output of the algorithm. If its value is not larger than 0, the last result stored in Temp1 is selected as the final result and the algorithm stops. In what follows, the results of the proposed Persistent K-means clustering in the previous example are investigated. Table 7 displays the results of the Persistent K-means (assuming R = 10) on the corresponding dataset. The minimum value of Ave_Dist corresponds to iteration 10 of the algorithm. However, since the second constraint is not satisfied in iteration 10, the result of the fifth iteration is returned as the final result of the algorithm. The reason is that in iteration 5, according to Table 7, besides reducing the Ave_Dist value, the second constraint is also satisfied. Since this constraint precedes the first one, the result of this iteration is selected as the final output. The evaluation results based on the accuracy (AC) measure [20] for 10 consecutive iterations of the algorithm show the clustering precision of this iteration of the Persistent K-means algorithm. Fig. 11 and Fig. 12 show an overview of the results of the Persistent K-means for iterations 5 and 10, respectively. According to Table 7, the clustering accuracy for iteration 5 based on the AC accuracy measure equals 99.9%, which is the maximum accuracy for the 10 consecutive iterations of the K-means algorithm. According to Fig. 12, despite the fact that the minimum value of Ave_Dist is obtained in iteration 10, the algorithm does not have an acceptable performance in detecting the cluster centroids.
More specifically, by selecting a cluster centroid from a less concentrated area, as well as a single centroid for two distinct clusters, its accuracy is 90.3%, which is less than that of iteration 5. The reason for this wrong selection is ignoring the second constraint. According to Table 8, the Mean_Dist and Med_Dist values of iteration 5 are as follows: Med_Dist = 101626067.103523 and Mean_Dist = 79006641.8234663.

These values show that the second constraint is satisfied in iteration 5. The accuracy evaluation results of the Persistent K-means clustering algorithm using real datasets indicate the absolute superiority of the proposed algorithm in comparison with the classic version of the K-means algorithm and other improved versions of this algorithm. In the third phase of running the GBDC-P2P algorithm, each peer separately runs the Persistent K-means algorithm on its external data to obtain the final cluster centroids. Using the centroids obtained from this algorithm, the corresponding peer assigns its internal data to the closest cluster centroid. Eventually, the peer clusters its data independently of the other peers, and thus its local clusters are formed.
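Summarizing Section 3, a compact sketch of the Persistent K-means outer loop (in the spirit of Fig. 10) is given below; it uses scikit-learn's standard KMeans as the inner routine and our own reading of the two constraints, so it should be taken as an illustration under those assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def persistent_kmeans(data, K, R=10, seed=None):
    """Persistent K-means sketch: run K-means R times and keep the 'best' run.

    First constraint: prefer the run with the smallest Ave_Dist.
    Second constraint (Eq. (3)): a run that also satisfies
    2 * Mean_Dist > Med_Dist takes precedence over runs that do not.
    """
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    min_dist = np.inf
    temp1 = temp2 = None              # best run overall / best run satisfying Eq. (3)
    counter = 0
    for _ in range(R):
        km = KMeans(n_clusters=K, init="random", n_init=1,
                    random_state=int(rng.integers(1 << 31))).fit(data)
        centroids = km.cluster_centers_
        # Third column of M_Dist: average point-to-centroid distance per cluster.
        best_dist = np.linalg.norm(centroids[:, None, :] - data[None, :, :], axis=2)
        labels = best_dist.argmin(axis=0)
        col3 = np.array([best_dist[k, labels == k].mean() if (labels == k).any() else 0.0
                         for k in range(K)])
        a_dist = col3.mean()                      # Ave_Dist of this run
        if a_dist >= min_dist:
            continue                              # first constraint: keep only improvements
        min_dist = a_dist
        if 2 * col3.mean() > np.median(col3):     # second constraint, Eq. (3)
            temp2, counter = centroids, counter + 1
        else:
            temp1 = centroids
    return temp2 if counter > 0 else temp1
```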

Consider a network consisting of 100 peers (numbered 0, ..., 99). Fig. 13 shows the process of gossip interactions between an active peer (Peer 37) and a passive peer (Peer 41), and the discovery of the final clustering results in each peer based on the proposed GBDC-P2P algorithm.
4. Performance evaluation of the algorithms
In this section, we evaluate the proposed GBDC-P2P algorithm and examine its quality under both static and dynamic conditions. Before addressing the evaluation of the proposed GBDC-P2P algorithm, we first examine the performance of the proposed Persistent K-means centralized clustering algorithm; the proposed GBDC-P2P algorithm is then discussed in detail in the following sections. It should be noted that synthetic and real data were used to assess the efficiency of the algorithms. The synthetic datasets used for evaluating the proposed algorithm consist of data generated from several Gaussian distributions. It should be mentioned that some synthetic datasets contain 5% random noise data.
4.1. Evaluation results of the Persistent K-means algorithm
This section focuses on evaluating the performance of the proposed Persistent K-means centralized clustering algorithm. For this purpose, the results of the proposed Persistent K-means algorithm are compared with the results of other algorithms on the Ruspini [27], IRIS, Wine, and New Thyroid [28] real datasets. It should be noted that the accuracy (AC) assessment criterion [20] is used to evaluate the proposed Persistent K-means centralized clustering algorithm. In addition, more information about the other types of datasets examined in this section is provided in Fig. 14.
4.2. Persistent K-means evaluation results on real datasets
In this section, based on the AC criterion, we compare the clustering accuracy of the proposed Persistent K-means algorithm with K-means with random initialization, K-medoids, and Fuzzy C-means. We also compare the results of the proposed Persistent K-means with several improved versions of the K-means algorithm, including the Forgy, MacQueen, and Kaufman [22] variants, Refinement [23], MDC [24], Improved Pillar K-means [25], and the Min-Max K-means algorithm [26]. The real datasets used in this section include:
(i) Ruspini [27]: the Ruspini dataset includes 75 points in four groups and is popular for illustrating clustering techniques.
(ii) IRIS [28]: the Iris flower dataset is one of the best-known datasets in the pattern recognition literature. The dataset consists of 50 samples from each of three species of Iris (Setosa, Versicolour, and Virginica).
(iii) Newthyroid [28]: this dataset is one of several databases about thyroid disease. The task is to detect whether a given patient is normal (1) or suffers from hyperthyroidism (2) or hypothyroidism (3).
(iv) Wine [28]: this dataset includes the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars.
Figures 15-18 show the comparison of clustering accuracy on the four real datasets Ruspini, IRIS, Newthyroid, and Wine. The evaluation results indicate the absolute superiority of the proposed Persistent K-means algorithm compared to the classical version of K-means, and its relative advantages in comparison with the improved versions of the K-means algorithm. The evaluation results of the proposed algorithm are provided for the default condition (R = 10).
4.3. Evaluation criteria for distributed data clustering
To assess the performance of the algorithm, two evaluation criteria, AC and RandI, are used. The average of all results over the peers is considered as the overall evaluation of the proposed GBDC-P2P algorithm. The synthetic datasets used in [1] and [8] are also used here.
4.3.1. Accuracy (AC) criterion for distributed data clustering algorithms

Suppose that C shows the actual centroids. By running the proposed GBDC-P2P algorithm, each peer P extracts K centroids C_P = {C_P^1, C_P^2, C_P^3, ..., C_P^K} at the end of each round of the algorithm; these centroids are called the calculated clusters of peer P (C_P). According to Eq. (1), the AC criterion yields a number in the range [0, 1]; the closer this number is to one, the higher the clustering accuracy at the concerned peer [20]. The assessment with the AC criterion must be performed separately on the clustering results of all peers, and the average of the AC results over all peers is considered as the final AC accuracy value.
4.3.2. Rand index (RandI) criterion

According to Eq. (4), the RandI criterion [21] is used to measure the similarity between two clusterings, where a is the number of pairs of elements that are in the same real cluster and also in the same estimated cluster, while b is the number of pairs of elements that are in different real clusters and in different estimated clusters:

RandI = (a + b) / C(n, 2)    (4)

where C(n, 2) = n(n − 1)/2 is the total number of pairs among the n elements.
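A direct pair-counting implementation of Eq. (4) could look as follows; the label vectors are assumed to hold, for each element, its real and its estimated cluster.

```python
from itertools import combinations

def rand_index(real_labels, estimated_labels):
    """Rand index (Eq. (4)): (a + b) / C(n, 2) over all pairs of elements."""
    n = len(real_labels)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_real = real_labels[i] == real_labels[j]
        same_est = estimated_labels[i] == estimated_labels[j]
        if same_real and same_est:
            a += 1                      # together in both clusterings
        elif not same_real and not same_est:
            b += 1                      # apart in both clusterings
    return (a + b) / (n * (n - 1) / 2)
```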

4.3.3. The costs of communications

Every peer, in each round of the gossip-based operations, measures the cost of communications based on the average amount of transmitted data, which includes the gossip packages containing the addresses of neighbour peers, the packages containing external data, and, in addition, the internal data of each peer [8].
4.4. Evaluation results of the proposed GBDC-P2P
This section investigates the proposed GBDC-P2P algorithm and examines its quality in both static and dynamic situations. We first introduce the evaluation criteria; then, the performance of the proposed GBDC-P2P algorithm is analysed in different situations. In order to assess the performance of the algorithm, the correct centroids of the whole dataset (called the actual centroids) are compared with the results of each of the peers (called the calculated centroids), separately. The average of these results is considered as the overall result of evaluating the proposed GBDC-P2P algorithm. For the simulation, the proposed model was implemented in the PeerSim simulator [31]. PeerSim is a peer-to-peer simulation framework for P2P protocols. The simulation parameters used to evaluate the proposed GBDC-P2P algorithm are presented in Table 9. To evaluate the results of the proposed GBDC-P2P algorithm, we first discuss the evaluation results in a static network; then, the evaluation results are examined in a network with dynamic conditions.
4.4.1. Evaluation results of the proposed GBDC-P2P on a static network
In a static network, changing the number of peers is among the factors expected to cause changes in the efficiency of the algorithm. The number of peers studied in this section ranges from 128 to 1024. At the beginning, we assume a constant value of 10 for the internal data of each peer. Thus, we use 2-d datasets with sizes of 1280, 2560, 5120, and 10240 for the networks of 128, 256, 512, and 1024 peers. To assign the data to the various peers, two data allocation approaches are used:
(i) Random data allocation (RA): in this method, data randomly selected from the dataset are allocated to each peer.
(ii) Cluster-aware data allocation (CA): in this method, the data of other clusters are not allocated to peers until the data of the current cluster are exhausted [1].
The working process of the proposed method is as follows. In the first round, each peer performs the M-Represents algorithm once on its internal data. Then these data are distributed over the whole network through the gossiping operation; thus, after several rounds of gossiping interactions, the data sent by each peer should be distributed over the whole network. The evaluation results of running the algorithm for 10 rounds of gossiping operations with random and cluster-aware data allocation are presented in Fig. 19. According to the results for the two criteria AC and RandI, the proposed algorithm reaches a high accuracy in detecting the distributed clusters after the end of the third round of the gossip-based operation. The reason for the different results in the first two rounds is the fact that the data of the peers have not yet been fully distributed over the entire network; thus, some peers do not have an accurate view of the whole network dataset. It is noteworthy that here each peer only sends one data item of its internal dataset to the rest of the peers in the first round of gossiping (M = 1).
By sending more data, the clustering accuracy increases, but this has a negative effect on the communication and memory overhead of the peers. As shown in Fig. 19, for both random and cluster-aware data allocation, the communication overhead reaches a constant trend after the third round. This shows that the representative data sent by each peer have been broadcast over the whole network by the end of the third round. The evaluation results of the proposed algorithm under dynamic network conditions are presented in the next section.
4.4.2. Evaluation of the proposed GBDC-P2P under dynamic network conditions
In this section, in order to evaluate the proposed GBDC-P2P algorithm under dynamic network conditions, we assume a constant value of 10 for the internal data of each peer. For this reason, we use 2-d datasets of sizes 1280, 2560, 5120, and 10240 for the networks consisting of 128, 256, 512, and 1024 peers. To simulate the dynamics of the network, the internal data of the peers change at a rate of 10% in each round. In addition, in each round of the algorithm, each peer runs the M-Represents algorithm once on its internal data and, using the gossip-based operation, sends five data items to its neighbour peers as the representative data. Thus, after several rounds of gossiping interactions, the data sent by each peer should be distributed over the whole network. In this case, we assume an age limit of AGE = 5 for the external data. If the data collected by a peer come from the more recent gossiping rounds, the results provided by the algorithm have higher accuracy, in line with the changes applied to the internal data of each of the peers. To evaluate the results of the algorithm in this case, we evaluate the results under changed network sizes for both random and cluster-aware data allocation. The results of running the algorithm with random data allocation over 20 rounds of gossiping operations are presented in Fig. 20.

According to the evaluation results shown in Fig. 20(a), with random data allocation, the clustering accuracy based on the AC criterion reaches an average of 95.32%. In addition, the clustering accuracy based on the RandI criterion reaches an average of 99.38%. As the size of the network increases, the average communication overhead maintains a constant trend, which is due to the age limit determined for the external data of the peers; the same mechanism also controls the communication overhead of the peers well. According to the evaluation results shown in Fig. 20(b), with cluster-aware data allocation, the clustering accuracy reaches an average of 96.68%. In addition, the RandI value reaches an average of 99.10%. In the case of cluster-aware data allocation, the average communication overhead of the peers also maintains a constant trend as the network size increases, due to the age restriction applied to the external data of the peers. Thus, it was shown that by sending more representatives from each peer into the whole network, and by taking some measures to control the overhead, acceptable clustering results can be achieved under dynamic network conditions. The next section is devoted to comparing the proposed GBDC-P2P algorithm with the LSP2P K-means and GDCluster algorithms.
4.4.3. Comparison with the LSP2P K-means and GDCluster algorithms
This section compares the proposed GBDC-P2P algorithm with the LSP2P K-means and GDCluster algorithms in two cases: in the first case, the test data are allocated randomly, and in the second case, they are allocated in a cluster-aware manner. In this context, considering networks composed of 128, 256, 512, and 1024 peers, the results of the proposed GBDC-P2P algorithm are compared with those of the GDCluster and LSP2P K-means algorithms. The LSP2P K-means algorithm is among the best-known clustering algorithms for peer-to-peer networks, and the GDCluster algorithm is one of the newest methods presented in the field of distributed data clustering in peer-to-peer networks. As shown in Fig. 21, when using random and cluster-aware datasets, the average clustering accuracy of the proposed algorithm is higher than that of the two other algorithms. This is, first, because of the appropriate summarized view of the internal data of the peers distributed over the whole network and, second, due to the proper functioning of the proposed Persistent K-means centralized clustering algorithm for obtaining the final centroids from the summarized view. According to Fig. 21, for both random and cluster-aware data allocation, the LSP2P algorithm has a lower communication overhead than both the GDCluster and the proposed GBDC-P2P algorithms. However, the higher overhead of the proposed GBDC-P2P algorithm is the price for achieving an accurate summarized view of the entire network and improving the clustering accuracy. It can be controlled by assuming an age for the external data and by further summarization of the external data with the methods described in Sections 2.4 and 2.5. The efficiency of the proposed method over different types of data is discussed in the next section, to present a more detailed assessment of the proposed method.
4.4.4. Evaluation of the proposed GBDC-P2P algorithm on other types of datasets
This section evaluates the proposed algorithm on other types of synthetic and real datasets, as follows:
(i) MAGIC Gamma Telescope (Magic) [28]: the UCI MAGIC Gamma Telescope dataset.
(ii) Statlog (Shuttle) [28]: the UCI Statlog (Shuttle) dataset, with 9 numerical features.
(iii) Pen-based recognition of handwritten digits (Pendigits) [28]: the UCI pen-based recognition of handwritten digits dataset, originally with 16 dimensions.
(iv) Birch1 [29], [30]: synthetic 2-d data where the clusters are arranged in a regular grid structure.
(v) Birch2 [29], [30]: synthetic 2-d data where the clusters are located along a sine curve.
(vi) Birch3 [29], [30]: synthetic 2-d data with random-sized clusters in random locations.

This section presents the average results of the proposed distributed algorithm on the above-mentioned datasets over 20 rounds of running this algorithm. In this evaluation, each of the datasets was divided into approximately equal subsets that were distributed among the peers, and the clustering was then performed using the proposed GBDC-P2P algorithm. Table 10 shows the average evaluation results of 20 runs of the proposed GBDC-P2P algorithm along with the average results of 10 runs of the K-means algorithm. As shown in Table 10, the results of the GBDC-P2P clustering algorithm on the synthetic and real datasets are very close to the results of the centralized K-means algorithm.
4.4.5. Evaluation of the proposed GBDC-P2P on high-volume data
Given the importance of big data analytics, the next section is devoted to evaluating the proposed method on high-volume data.
4.4.6. Evaluation of the proposed GBDC-P2P on high-volume data
In this section, we evaluate the proposed method on a series of high-volume data. For this purpose, we use a synthetic dataset consisting of 5,120,000 3-dimensional (3-d) data records. A view of this dataset is shown in Fig. 22. The data were located on a network of 128 peers, where the share of each peer was a fixed amount of 40,000 data records. The data were allocated to the peers either randomly or in a cluster-aware manner. We evaluated four rounds of running the GBDC-P2P algorithm on this dataset. Given that the volume of the internal data of the peers is very high, running the M-Represents algorithm over the whole internal dataset is not cost effective in terms of processing cost and time. As a solution, we first apply a random sampling method to the internal dataset of each peer, by which 1,000 data records are randomly sampled. Then, using the default conditions mentioned in Table 11, the M-Represents algorithm is executed on this sample in each of the peers to extract the representative data and, thereby, the important data among the selected samples. According to case C of the M-Represents algorithm, and since D > 2K, Kl is set equal to 2K = 20.
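The sampling step just described can be sketched as follows, reusing the m_represents function outlined in Section 2.2; the sample size of 1,000 mirrors the text, while the function name and parameters are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def representatives_from_sample(internal_data, K, M, sample_size=1000, seed=None):
    """Run M-Represents on a random sample instead of the full internal data.

    With D = sample_size > 2K, case C applies and Kl = 2K inside m_represents
    (the sketch from Section 2.2).
    """
    internal_data = np.asarray(internal_data)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(internal_data), size=sample_size, replace=False)
    sample = internal_data[idx]
    return m_represents(sample, K=K, M=M, rng=rng)
```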

Given the high volume of the internal data of each peer, the M-Represents algorithm is run on a random sample of the whole internal data to discover the important data. In this case, the value of M is considered as the total number of clusters, so that this random sampling of the whole data does not lead to incorrect results of the algorithm. Thus, using the M-Represents algorithm, every peer selects its important data from a set consisting of 1,000 sampled records of its own internal memory and distributes them over the entire network. Fig. 23 presents a view of the entire dataset and the summarized view of these data in four different peers for four different rounds of execution of the proposed algorithm. According to Fig. 23, since the age limit condition for external data is not needed here, the size of the summarized view increases considerably as the number of rounds of the algorithm increases. In each round, each peer obtains the final centroids by running the proposed Persistent K-means centralized data clustering algorithm on its own collection of external data. The corresponding peer then allocates its own internal data to the nearest centroid; thus, it performs its local data clustering operation independently of the other peers. The results of four rounds of running the proposed algorithm on this dataset with random and cluster-aware allocation, based on the AC and RandI criteria, are shown in Fig. 24. According to the evaluation results shown in Fig. 24, the proposed algorithm achieves an acceptable data clustering accuracy in the early rounds. This shows that it provides a proper summarized view of the entire data collection in the whole network, despite the use of the M-Represents algorithm on a sample of the peers' internal data. Here, the use of the sampling method, with its savings in time and computational cost for the peers, has significantly increased the speed at which peers access the clustering results. In addition, based on the evaluation results presented in this section, the clustering accuracy increases over each of the four rounds of the algorithm execution.
4.4.7. Evaluation of model features
Considering the challenges of clustering in P2P networks, such as the dynamics of P2P networks, scalability, synchronization, and guaranteeing the clustering accuracy [8], Table 12 compares the proposed method with the algorithms reviewed in Section 1.

Although no criterion is proposed for guaranteeing the clustering accuracy, using the M-Represents algorithm leads all peers to a summarized view of the total network data, and the proposed GBDC-P2P algorithm provides an acceptable clustering accuracy. Considering the reliability of P2P networks, the proposed algorithm largely supports the dynamicity of the network with respect to both the omission/addition of data and the omission/addition of peers. In addition, the proposed algorithm does not need synchronization and supports the scalability of P2P networks. Moreover, according to the evaluation results, increasing the number of peers in the network has no impact on the distributed clustering.

5. Conclusions

This paper proposed a novel distributed data clustering method for unstructured P2P networks that avoids the drawbacks of previous methods as far as possible. In the proposed method, data clustering is fully distributed, and data exchange takes place only through gossip-based interactions between peers, without the need for any synchronization procedure. The main goal of the proposed method is to give each peer access to a good summarized view of the entire network data in order to discover the final centroids. This global summarized view is built up through the interactions of peers, in which they exchange their locally important data with each other; the representative data received by each peer are collected as its external data. Finally, whenever a peer needs the clustering results, it performs a centralized clustering operation over these external data to obtain the final centroids. The peer then assigns its own internal data to the nearest final centroid. In this way, the internal data of the peers are clustered independently. Evaluation results show the superior performance of the proposed GBDC-P2P algorithm. This paper also proposed a centralized data clustering algorithm called the Persistent K-means algorithm. This centralized clustering algorithm was used in the final phase of the proposed GBDC-P2P algorithm to provide each peer with access to the final centroids. In fact, the proposed Persistent K-means algorithm is an improved version of the K-means clustering algorithm that, by imposing constraints on the convergence of K-means, yields more accurate results. Developing a novel method for data distribution in P2P networks with lower communication and computation overhead, using soft computing techniques, can be considered as future work.

References
[1] Mashayekhi, H., Habibi, J., Voulgaris, S., and van Steen, M., "GoSCAN: Decentralized scalable data clustering," Computing, vol. 95, no. 9, pp. 759-784, 2013.
[2] Park, B.H., and Kargupta, H., "Distributed data mining: Algorithms, systems, and applications," in Nong Ye (ed.), Data Mining Handbook, 2002.
[3] Hammouda, K.M., and Kamel, M.S., "Models of distributed data clustering in peer-to-peer environments," Knowledge and Information Systems, vol. 38, no. 2, pp. 303-329, 2014.
[4] Dos Santos, D.S., and Bazzan, A.L.C., "Distributed clustering for group formation and task allocation in multiagent systems: A swarm intelligence approach," Applied Soft Computing, vol. 12, pp. 2123-2131, 2012.
[5] Datta, S., Giannella, C.R., and Kargupta, H., "Approximate distributed K-means clustering over a peer-to-peer network," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1372-1388, 2009.
[6] Ester, M., Kriegel, H.P., Sander, J., and Xu, X., "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of Knowledge Discovery in Databases (KDD), pp. 226-231, 1996.
[7] Di Fatta, G., Blasa, F., Cafiero, S., and Fortino, G., "Fault tolerant decentralised K-means clustering for asynchronous large-scale networks," Journal of Parallel and Distributed Computing, vol. 73, no. 3, pp. 317-329, 2013.
[8] Mashayekhi, H., Habibi, J., Khalafbeigi, T., Voulgaris, S., and van Steen, M., "GDCluster: A general decentralized clustering algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 1892-1905, 2015.
[9] Voulgaris, S., Gavidia, D., and van Steen, M., "CYCLON: Inexpensive membership management for unstructured P2P overlays," Journal of Network and Systems Management, vol. 13, 2005.
[10] Samatova, N.F., Ostrouchov, G., Geist, A., and Melechko, A.V., "RACHET: An efficient cover-based merging of clustering hierarchies from distributed datasets," Distributed and Parallel Databases, vol. 11, no. 2, pp. 157-180, 2002.
[11] Merugu, S., and Ghosh, J., "Privacy-preserving distributed clustering using generative models," in Proceedings of the Third IEEE International Conference on Data Mining (ICDM '03), pp. 211-218, 2003.
[12] Da Silva, J., Giannella, C., Bhargava, R., Kargupta, H., and Klusch, M., "Distributed data mining and agents," Engineering Applications of Artificial Intelligence, vol. 18, no. 7, pp. 791-807, 2005.
[13] Hammouda, K.M., and Kamel, M.S., "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, pp. 681-698, 2009.
[14] MacQueen, J.B., "Some methods for classification and analysis of multivariate observations," in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, vol. 1, pp. 281-297, 1967.
[15] Kaufman, L., and Rousseeuw, P.J., "Clustering by means of medoids," in Y. Dodge (ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods, North-Holland, pp. 405-416, 1987.
[16] Klusch, M., Lodi, S., and Moro, G., "Agent-based distributed data mining: The KDEC scheme," in Proceedings of AgentLink, pp. 104-122, 2003.
[17] Datta, S., Bhaduri, K., Giannella, C., Wolff, R., and Kargupta, H., "Distributed data mining in peer-to-peer networks," IEEE Internet Computing, special issue on distributed data mining, vol. 10, no. 4, pp. 18-26, 2006.
[18] Januzaj, E., Kriegel, H.-P., and Pfeifle, M., "Scalable density-based distributed clustering," in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Springer-Verlag, Berlin, pp. 231-244, 2004.
[19] Yang, M., and Yang, Y., "An efficient hybrid peer-to-peer system for distributed data sharing," in Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium, Miami, 2008.
[20] Xu, W., Liu, X., and Gong, Y., "Document clustering based on non-negative matrix factorization," in Proceedings of the International Conference on Research and Development in Information Retrieval, pp. 267-273, 2003.
[21] Hubert, L., and Arabie, P., "Comparing partitions," Journal of Classification, vol. 2, no. 1, pp. 193-218, 1985.
[22] Pena, J.M., Lozano, J.A., and Larrañaga, P., "An empirical comparison of the initialization methods for the K-means algorithm," Pattern Recognition Letters, vol. 20, no. 10, pp. 1027-1040, 1999.

[23] Bradley, P.S., and Fayyad, U.M., "Refining initial points for K-means clustering," in Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), pp. 91-99, 1998.
[24] Barakbah, A.R., and Kiyoki, Y., "A pillar algorithm for K-means optimization by distance maximization for initial centroid designation," in IEEE Symposium on Computational Intelligence and Data Mining, Nashville, 2009.
[25] Bhusare, B.B., and Bansode, S.M., "Centroids initialization for K-means clustering using improved pillar algorithm," International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), vol. 3, no. 4, 2014.
[26] Tzortzis, G., and Likas, A., "The MinMax k-means clustering algorithm," Pattern Recognition, vol. 47, no. 7, pp. 2505-2516, 2014.
[27] Ruspini, E.H., "A new approach to clustering," Information and Control, vol. 15, no. 1, pp. 22-32, 1969.
[28] UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html
[29] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: A new data clustering algorithm and its applications," Data Mining and Knowledge Discovery, vol. 1, no. 2, pp. 141-182, 1997.
[30] http://cs.uef.fi/sipu/datasets
[31] peersim.sourceforge.net
[32] Serapião, A.B.S., Corrêa, G.S., Gonçalves, F.B., and Carvalho, V.O., "Combining K-means and K-harmonic with Fish School Search algorithm for data clustering task on graphics processing units," Applied Soft Computing, vol. 41, pp. 290-304, 2016.

Biographies

Rasool Azimi received his B.Sc. degree in Software Engineering from Mehrastan University, Guilan, Iran, in 2011 and the M.Sc. degree from the Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2014. His research interests include overlay networks, peer-to-peer systems, distributed data mining, data clustering, artificial intelligence and their applications in power systems.

Hedieh Sajedi received her B.Sc. degree in Computer Engineering from Amirkabir University of Technology in 2003, and M.Sc. and Ph.D. degrees in Computer Engineering (Artificial Intelligence) from Sharif University of Technology, Tehran, Iran, in 2006 and 2010, respectively. She is currently an Assistant Professor at the Department of Computer Science, University of Tehran, Iran. Her research interests include machine learning, image processing, and distributed systems.

Mohadeseh Ghayekhloo received her B.Sc. degree in Computer Engineering from Mazandaran University of Science and Technology, Babol, Iran, and the M.Sc. degree from the Science and Research Branch, Islamic Azad University, Qazvin, Iran, in 2011 and 2014, respectively. Her research interests include optimization algorithms, artificial neural networks, computational intelligence and their applications in power systems.

Start
Load XL = {X1, X2, …, XD}: internal data; K: number of final clusters; Kr: number of Kl-medoids iterations (Kr = K); M: number of representatives.
If D < K then Kl = D − 1; if K < D < 2K then Kl = K; if D ≥ 2K then Kl = 2K.
Set i = 1.
Repeat:
    Perform the Kl-medoids (XL, Kl) algorithm (outputs: CL = {C1, C2, …, CKl}) and add a weight attribute (initialized to 1) to the CL data.
    For j = 1 : Kl
        If Cj already exists in REP_List, increase the weight attribute of the existing entry by one; otherwise, add Cj to REP_List.
    i = i + 1.
Until i > Kr.
Sort REP_List in descending order of the weight attribute.
Select the top M rows of the sorted REP_List as the M representatives.
Remove the weight attribute from the M selected representatives.
Return R = {R1, R2, …, RM} as the representatives.
End

Fig. 1: Flowchart of the M-Represents algorithm for extracting M representative data points from the internal data of a peer
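To make the flowchart concrete, the following Python sketch mirrors its selection loop; the kl_medoids helper (a plain K-medoids pass) and its iteration count are assumptions for illustration, not the authors' implementation.

import random
import numpy as np

def kl_medoids(data, kl, iters=10):
    # Hypothetical helper: a basic K-medoids pass that returns kl medoid points.
    medoids = data[random.sample(range(len(data)), kl)]
    for _ in range(iters):
        # assign every point to its nearest medoid
        dists = np.linalg.norm(data[:, None, :] - medoids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each medoid to the member minimizing the total intra-cluster distance
        for c in range(kl):
            members = data[labels == c]
            if len(members):
                intra = np.linalg.norm(members[:, None, :] - members[None, :, :], axis=2).sum(axis=1)
                medoids[c] = members[intra.argmin()]
    return medoids

def m_represents(internal_data, k, m):
    # Sketch of the M-Represents selection loop shown in Fig. 1.
    data = np.asarray(internal_data, dtype=float)
    d = len(data)
    kl = d - 1 if d < k else (k if d < 2 * k else 2 * k)   # choice of Kl
    kr = k                                                 # number of Kl-medoids repetitions
    weights = {}                                           # REP_List: candidate point -> weight
    for _ in range(kr):
        for c in kl_medoids(data, kl):
            key = tuple(c)
            weights[key] = weights.get(key, 0) + 1         # a new entry starts at weight 1
    ranked = sorted(weights, key=weights.get, reverse=True)
    return [np.array(p) for p in ranked[:m]]               # top-M representatives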

Fig. 2: View of the original dataset (1024 data points, blue x-marks) and the summarized view (red stars) for different values of Kl (Kl = 1, 3, 5, 7).

Fig. 3: View of the original dataset (2560 data points, blue x-marks) and the summarized view (red stars) for different values of Kl (Kl = 2, 4, 6, 8, 10).

Fig. 4: View of the original dataset (10240 data points, blue x-marks) and the summarized view (red stars) for different values of Kl (Kl = 1, 5, 10, 13, 17, 20).

Fig. 5: View of the original dataset (10752 data points, blue x-marks) and the summarized view (red stars) for different values of Kr (Kr = 1, 4, 7, 10, 14).

do forever {
    wait (T time units)
    increment all descriptors' age by 1
    Q ← selectPeer()
    remove Q from view
    buf_send ← selectToSend()
    send buf_send to Q
    receive buf_recv from Q
    view ← selectToKeep()
    // messages for data clustering:
    pRepresentatives ← M-Represents()
    addTo pExtdata (pRepresentatives)
    send pExtdata to Q
    receive qExtdata from Q
    merge pExtdata and qExtdata as new pExtdata
    remove duplicate data in pExtdata
    addToMemory(pExtdata)
    if memory overload
        summarizeMemory()
}
(Active Peer)

Fig. 6: Gossip interactions between the Active and Passive Peer

Fig. 7: Results of the eighth iteration of the K-means algorithm

Fig. 8: Results of the K-means algorithm in the tenth iteration

Fig. 9: Results of the K-means algorithm in the tenth round

do forever {
    receive buf_recv from P
    buf_send ← selectToSend()
    send buf_send to P
    view ← selectToKeep()
    // messages for data clustering:
    receive pExtdata from P
    qRepresentatives ← M-Represents()
    addTo qExtdata (qRepresentatives)
    send qExtdata to P
    merge qExtdata and pExtdata as new qExtdata
    remove duplicate data in qExtdata
    addToMemory(qExtdata)
    if memory overload
        summarizeMemory()
}
(Passive Peer)
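As a complement to the two pseudocode fragments above, the following Python sketch shows the data-clustering part of one gossip exchange between an active and a passive peer; the dict-based peer representation and the extract_reps and summarize callables are illustrative assumptions, and the membership (view) exchange is omitted.

def gossip_exchange(p, q, extract_reps, summarize, memory_limit):
    # Each peer is assumed to be a dict with 'internal' and 'external' lists of points.
    # Both sides extract their local representatives and append them to their external data.
    for peer in (p, q):
        reps = extract_reps(peer['internal'])
        peer['external'].extend(map(tuple, reps))
    # Exchange external data and merge, removing duplicates while preserving order.
    merged = list(dict.fromkeys(p['external'] + q['external']))
    p['external'], q['external'] = list(merged), list(merged)
    # Summarize the external memory of a peer if it overflows.
    for peer in (p, q):
        if len(peer['external']) > memory_limit:
            peer['external'] = summarize(peer['external'])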

Inputs: number of clusters (k) = K; number of iterations (r) = R.
Initialize r = 1, Counter = 0, Min_Ave = Inf.
Step 1: Perform the K-means algorithm; r = r + 1.
Step 2: Build the M_Dist (K × 2) matrix from the Best_Dist (K × N) matrix and compute the average value of M_Dist(K, 3) as Ave_Dist.
Step 3: If Ave_Dist < Min_Ave, set Min_Ave = Ave_Dist and calculate Mean_Dist and Med_Dist for M_Dist(K, 2); then
    if 2 × Mean_Dist > Med_Dist, add the centroids to Temp2 and set Counter = Counter + 1;
    otherwise, add the centroids to Temp1.
Step 4: If r < R, return to Step 1.
Step 5: If Counter > 0, set Temp2 as the final centroids; otherwise, set Temp1 as the final centroids.
End

Fig. 10: The Persistent K-means flowchart
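The Python sketch below follows our reading of the flowchart; the run_kmeans helper and the interpretation of M_Dist(K, 1), M_Dist(K, 2) and M_Dist(K, 3) as cluster size, summed within-cluster distance and their ratio (consistent with Tables 4, 6 and 8) are assumptions, not the authors' code.

import numpy as np

def run_kmeans(data, k, iters=100, rng=None):
    # Hypothetical helper: one plain K-means run with random initialization.
    rng = rng or np.random.default_rng()
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = data[labels == c].mean(axis=0)
    # final assignment and per-point distance to the assigned centroid
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    best_dist = dists[np.arange(len(data)), labels]
    return centroids, labels, best_dist

def persistent_kmeans(data, k, r):
    # Sketch of Persistent K-means (Fig. 10): among R runs, keep the centroids of the
    # best run, preferring runs whose per-cluster distance sums satisfy 2*mean > median.
    data = np.asarray(data, dtype=float)
    min_ave, counter = np.inf, 0
    temp1 = temp2 = None
    for _ in range(r):
        centroids, labels, best_dist = run_kmeans(data, k)
        sizes = np.array([max(np.sum(labels == c), 1) for c in range(k)])
        sums = np.array([best_dist[labels == c].sum() for c in range(k)])
        ave_dist = (sums / sizes).mean()                    # average of M_Dist(K, 3)
        if ave_dist < min_ave:
            min_ave = ave_dist
            if 2 * sums.mean() > np.median(sums):
                temp2, counter = centroids, counter + 1
            else:
                temp1 = centroids
    return temp2 if counter > 0 else temp1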

Fig. 11: Persistent K-means algorithm results (r=5)

Fig. 12: Persistent K-means algorithm results (r=10)

Fig. 13: Flowchart of the gossip-based interactions between peers (active peer P and passive peer Q) and the discovery of the final clustering results at each peer in the proposed GBDC-P2P algorithm
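A minimal Python sketch of the final step depicted in Fig. 13, assuming a centroid_finder routine such as the persistent_kmeans sketch above (or any routine returning K centroids); the function and variable names are illustrative, not the authors' implementation.

import numpy as np

def cluster_internal_data(internal_data, external_data, k, centroid_finder):
    # Obtain the final centroids from the peer's external data (e.g. with the
    # Persistent K-means routine), then assign every internal data point to its
    # nearest final centroid, so the peer's local clustering is done independently.
    internal = np.asarray(internal_data, dtype=float)
    centroids = np.asarray(centroid_finder(external_data, k), dtype=float)
    dists = np.linalg.norm(internal[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return labels, centroids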

Fig. 14: The datasets used for evaluating the proposed algorithms

Fig. 15: Comparing the clustering results, in terms of AC (%), on the Ruspini dataset for K-means with random, Forgy, MacQueen, Kaufman, Refinement and MDC initializations, K-medoids, Fuzzy C-means, Improved Pillar K-means, Min-Max K-means, and the proposed Persistent K-means.

Fig. 16: Comparing the clustering results, in terms of AC (%), on the IRIS dataset for the same set of algorithms.

Fig. 17: Comparing the clustering results, in terms of AC (%), on the Newthyroid dataset for the same set of algorithms.

Fig. 18: Comparing the clustering results, in terms of AC (%), on the Wine dataset for the same set of algorithms.

Fig. 19: Evaluation of the GBDC-P2P algorithm in a static network, in two cases: (a) dedicated random data and (b) dedicated dense (cluster-aware) data, in terms of changing network size.

Fig. 20: Evaluation of the GBDC-P2P algorithm in a dynamic network, in two cases: (a) dedicated random data and (b) dedicated cluster-aware data, in terms of changing network size.

Fig. 21: Comparison results of the algorithms in a static network, in two cases: (a) dedicated random data and (b) dedicated cluster-aware data, by changing the network size.

Fig. 22: View of the synthetic dataset consisting of 5,120,000 3-d data points

Fig. 23: The summarized view over four rounds of execution (first to fourth round) for four different peers

Fig. 24: Evaluation of the GBDC-P2P algorithm on the high-volume dataset over four rounds of execution, in two cases: (a) dedicated random data and (b) dedicated cluster-aware data.

Table 1: The parameters used in the M-Represents algorithm

Parameter    Description
D            Number of internal data in each peer
M            Number of elected data as representative data
K            Number of final clusters
Kl           Number of clusters for the M-Represents algorithm
Kr           Number of Kl-medoids iterations
R            Representative data

Table 2: Description of the functions used in the proposed GBDC-P2P during the gossip interactions between the Active Peer (P) and the Passive Peer (Q)

Hook                  Peer            Action taken
selectPeer()          Active Peer     Select the descriptor with the oldest age.
selectToSend()        Active Peer     Select g−1 random descriptors; add own descriptor with own profile and age 0.
                      Passive Peer    Select g random descriptors.
selectToKeep()        Active Peer     Keep all g received descriptors, replacing (if needed) the descriptor selected by selectPeer() and then the g−1 ones selected by selectToSend().
                      Passive Peer    Keep all g received descriptors, replacing (if needed) the g ones selected by selectToSend().
addToMemory(Extdata)  Both peers      Add the received external data (Extdata) to local memory.
summarizeMemory()     Both peers      Summarize the external data in memory if the peer's memory is full.
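A small Python sketch of the membership hooks in Table 2 follows; the Descriptor record and the list-based view are illustrative simplifications of the gossip-based membership exchange (in the style of CYCLON [9]), not the authors' implementation.

import random
from dataclasses import dataclass

@dataclass
class Descriptor:
    peer_id: int
    age: int = 0

def select_peer(view):
    # selectPeer(): the descriptor with the oldest age.
    return max(view, key=lambda d: d.age)

def active_select_to_send(own_id, view, g):
    # Active peer: age the view, pick the target, then send g-1 random
    # descriptors plus its own descriptor with age 0.
    for d in view:
        d.age += 1
    target = select_peer(view)
    view.remove(target)
    to_send = random.sample(view, min(g - 1, len(view))) + [Descriptor(own_id, 0)]
    return target, to_send

def passive_select_to_send(view, g):
    # Passive peer: reply with g random descriptors.
    return random.sample(view, min(g, len(view)))

def select_to_keep(view, sent, received, max_size):
    # Keep the received descriptors, replacing (if needed) the ones just sent.
    view.extend(received)
    for d in sent:
        if len(view) > max_size and d in view:
            view.remove(d)
    del view[max_size:]
    return view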

Table 3: K-means algorithm results in 10 rounds

Iteration No.    Ave_Dist              AC (%)
1                655401294525.546      80.12
2                338521198628.016      90.00
3                327507683280.353      90.06
4                349949420687.898      90.04
5                827217507568.969      70.24
6                327507683280.353      90.06
7                2117958554025.93      79.48
8                242094107823.816      99.96
9                321337928867.259      90.01
10               349517225397.660      90.05

Table 4: Results of M_Dist(K, 3) in the eighth iteration of K-means

K     M_Dist(K, 1)    M_Dist(K, 2)         M_Dist(K, 3)
1     2588            256373187510.78      99062282.654861
2     2643            353934790719.30      133914033.56765
3     2664            357797992679.24      134308555.81052
4     2514            153390625395.99      61014568.574379
5     2524            228026460091.37      90343288.467262
6     2496            147624903712.54      59144592.833550
7     2470            150166385057.56      60796107.310754
8     2607            301044717531.40      115475534.15090
9     2571            284512324697.62      110662125.51444
10    2543            188069690842.32      73955835.958446

Table 5: K-means algorithm results in 10 iterations

Iteration No.    Ave_Dist              AC (%)
1                110170616864.106      90.30
2                343677320109.839      79.28
3                204069771177.445      70.36
4                167025068278.392      79.99
5                233767050297.790      79.97
6                86531408864.7107      99.90
7                117312819963.335      80.67
8                199513262433.238      80.16
9                205711134156.617      70.35
10               76159749961.1842      90.33

Table 6: Results of M_Dist(K, 3) in the tenth iteration of K-means

K     M_Dist(K, 1)    M_Dist(K, 2)       M_Dist(K, 3)       Ave
1     1124            108927851257.2     96910899.69504     0.66
2     1057            51326998672.24     48559128.35595     0.33
3     1039            33126823778.59     31883372.26043     0.21
4     1080            74642337439.19     69113275.40665     0.47
5     2052            163558171755.2     79706711.38171     0.54
6     1086            55592604154.58     51190243.23626     0.35
7     1067            56332747047.03     52795451.77791     0.36
8     1046            36021771024.52     34437639.60279     0.23
9     105             95117399331.34     905879993.6318     6.24
10    1096            86950795151.80     79334667.10931     0.56

Table 7: Results of the Persistent K-means algorithm in 10 rounds

Iteration No.    Ave_Dist              AC (%)
1                317873556933.486      79.50
2                238935794873.245      70.50
3                256760408954.824      79.51
4                255168615089.361      79.82
5                86531408864.7107      99.90
6                163652149307.042      79.98
7                428409853900.782      78.98
8                259409970278.423      80.03
9                278407541804.449      79.37
10               76159749961.1842      90.33

Table 8: Results of the M_Dist matrix (r = 5)

K     M_Dist(K, 1)    M_Dist(K, 2)       M_Dist(K, 3)       Ave
1     1019            21799480091.22     21393012.84712     0.27
2     1083            82748709007.44     76406933.52488     0.96
3     1113            123911945535.1     111331487.4529     1.40
4     1110            201863624709.5     181859121.3599     2.30
5     1134            144924755005.2     127799607.5883     1.61
6     1039            33126823778.59     31883372.26043     0.40
7     1070            101072219006.8     94460017.76343     1.19
8     1047            37120242601.35     35453908.88381     0.44
9     1039            25739118546.10     24772972.61415     0.31
10    1098            93007170365.61     84705983.93953     1.07

Table 9: The simulation parameters

Parameter    Description                                                                   Range of values
N            Number of Peers                                                               100-1024
D            Number of internal data in each Peer                                          10-40000
M            Number of elected representatives from each Peer's internal data per round    1-10
K            Number of real clusters (final clusters)                                      2-20
Age          Lifetime of external data                                                     5

Table 10: Accuracy assessment results of the proposed GBDC-P2P algorithm in comparison with the K-means and the proposed Persistent K-means algorithms, based on the AC and RandI evaluation criteria

                        AC (%)                                     RandI (%)                                  Average Communication    Average Storage
Dataset      GBDC-P2P   Persistent K-means   K-means     GBDC-P2P   Persistent K-means   K-means              Overhead (KB)            Overhead (KB)
Shuttle      80.14      74.67                73.09       82.23      83.10                81.63                7.1                      30.3
Magic        72.48      64.83                64.83       63.26      54.44                54.44                19.5                     47.8
Pendigits    70.90      74.38                65.78       84.93      93.53                91.15                12.5                     15.5
Birch1       69.10      76.94                74.94       81.74      95.15                89.24                16.3                     42.4
Birch2       74.25      81.78                79.25       78.84      87.13                82.59                15.2                     40.7
Birch3       67.01      75.89                70.15       87.12      97.45                91.67                17.4                     45.1

Table 11: The simulation parameters for the high-volume dataset experiment

Parameter    Description                                    Value
N            Number of Peers                                128
D            Number of internal data in each Peer           40000
M            Number of elected representative data          1
K            Number of real clusters (final clusters)       10

Table 12: Comparison of distributed data clustering algorithms

Compared approaches: LSP2P K-means [5], USP2P K-means [5], Epidemic K-means [7], GoSCAN [1], GDCluster [8], and the proposed GBDC-P2P.
Comparison criteria (marked Yes/No for each approach): dynamics of P2P networks, scalability, synchronization, and guarantee of clustering accuracy.