Enhanced K-means re-clustering over dynamic networks




Expert Systems With Applications 132 (2019) 126–140



AmirHosein Fadaei, Seyed Hossein Khasteh∗
Department of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran

Article history: Received 17 October 2018; Revised 5 March 2019; Accepted 27 April 2019; Available online 30 April 2019.
Keywords: Clustering; Dynamic networks; K-means; Re-clustering

Abstract: This paper presents a preliminary algorithm designed to reduce the processing cost of continuous clustering in dynamic networks. The algorithm assumes that various types of changes (insertions and deletions) may affect the clustered data over time, and it aims to provide a reliable, up-to-date answer to the clustering problem at all times. Built by altering the well-known K-means algorithm, this enhanced version has three parts. The Initializer and Sorter initializes the algorithm and stores some data that can be used to reduce later calculations. The Dynamic Modifier applies the modifications to the clusters and updates the centroids and related information to keep the clusters valid. The Detector identifies the potent nodes that might need to swap clusters after the changes applied by the Dynamic Modifier. The algorithm reduces the amount of computation by using data retained from the last state of the clustered network to detect the potent nodes, so that only these nodes need to be checked for further modifications. Simulation results indicate that the number of checked nodes and the total time consumed in each iteration are reduced significantly compared to the traditional K-means algorithm. © 2019 Elsevier Ltd. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (S.H. Khasteh).
https://doi.org/10.1016/j.eswa.2019.04.061

1. Introduction

Clustering is a crucial part of Machine Learning (Nguyen, Thuy, & Armitage, 2008; Sebastiani, 2002), Pattern Recognition (Duda & Hart, 1973), Data Mining and Data Discovery (Piatetsky-Shapiro, 1996), and Data Compression and Vector Quantization (Gersho & Gray, 1992). Without it, many active trends in computer science and Big Data would not exist; clustering is the tool that can turn Big Data into Tiny Data (Feldman, Schmidt, & Sohler, 2013). In the past fifty years, several types of clustering algorithms have been proposed, including hierarchical algorithms such as SLINK (Sibson, 1973) and CLINK (Defays, 1977), centroid-based algorithms such as K-means (Lloyd, 1982), density-based algorithms (Kriegel, Kroger, Sander, & Zimek, 2011) such as DBSCAN (Ester, Kriegel, Sander, & Xu, 1996), OPTICS (Ankerst, Breunig, Kriegel, & Sander, 1999) and GDBSCAN (Sander, Ester, Kriegel, & Xu, 1998) for spatial databases, and other algorithms (Aggarwal, 2013; Can & Ozkarahan, 1990). At the same time, the use of dynamic network systems is increasing rapidly in applications such as telecommunication systems and social networks. Even with all the current studies on this topic, clustering remains an issue on large-scale dynamic networks with a high rate of data change. These systems need clustering algorithms that stay reliable and available at all times and that can respond to changes in real time. There are many dynamic networks around us, such as mobile and wireless networks (Lin & Gerla, 1997; Yu & Chong, 2005), search engines (like Google), and Twitter (Becker, Naaman, & Gravano, 2011). Notably, the information of some of these networks is usable for understanding and categorizing others (Banerjee, Ramanathan, & Gupta, 2007). We cluster data in these networks to extract relational information from them, to maintain better performance and stability, and for many other applications (Mobasher, Cooley, & Srivastava, 1999). Since these networks are dynamic, we must keep the data up to date, which means we have to reconsider how the data is clustered after each update of the network. One way to do this is to rerun the whole clustering algorithm from scratch on the whole network after applying each change. A better way is to consider only the potent nodes that might change their clusters. Such an algorithm must be implementable on dynamic networks with fast or real-time changes. To achieve that, the algorithm must decrease the computation needed to correct the clusters after each change. This is possible by saving more information in the initialization phase and using it when the algorithm corrects the clusters later. In other words, this method shifts the consumed time to the initialization phase instead of doing everything online, a well-known technique in other areas of computer science.


By doing this, we decrease the total consumed time in the long run. K-means clustering is a well-studied and reputable centroid-based clustering algorithm. It was previously used to determine optimal clusters for a static representation of the distributed data in a large network. The original version cannot be used on dynamic networks, as it needs a static dataset to iterate over. However, updated versions of K-means have been proposed that can be used on dynamic and more complicated networks (Datta, Giannella, & Kargupta, 2006; Huang, 1998; Likas, Vlassis, & Verbeek, 2003; Sculley, 2010). The algorithm proposed in this paper is a faster method that is always available and reliable. It does not re-initialize the original clustering method to apply and account for a change in the network; instead, it updates the clusters by examining only the portion of nodes that might move to a new cluster, rather than iterating over the whole network after each change.

The remainder of this paper is organized as follows. Section 2 reviews related work on clustering and discusses some proposed enhancements. Section 3 presents the proposed algorithm and its features. Section 4 presents the experimental results and analysis. Section 5 presents related discussion of our proposed method and, finally, Section 6 draws conclusions and outlines future work.

2. Related works

Many methods have been proposed for clustering on dynamic or incremental datasets, such as recently developed density-based algorithms (Kriegel et al., 2011). In these types of methods, an updated list of cluster densities must be generated after applying new changes. Even though this can be done in linear time, it is not optimal, because these algorithms cannot use previously calculated data without approximations (Wong & Lane, 1981), which means that all of the calculations must be repeated in order to generate fully accurate clusters. Most current methods do not use their own previous results as a source of data for clustering the dataset after submitting the changes, because they are not designed for re-clustering (applying the new changes dynamically and updating the clusters). Some modifications must be made to these algorithms to let them use differential data for updating the results instead of recalculating them (Yi, Guiling, Weixiang, & Yong, 2009).

One of the main types of clustering algorithms in common use is centroid-based algorithms (Jain, 2010), of which K-means is the most widely used. The basic K-means method was proposed by Lloyd in 1957 (Lloyd, 1982). Since then, many researchers have worked to enhance the K-means clustering algorithm for different purposes. Some tried to accelerate the algorithm by reducing redundant centroid calculations with mathematical approaches, for example by using the triangle inequality (Elkan, 2003; Phillips, 2002), KD-trees (Bentley, 1975), or blacklisting methods (Pelleg & Moore, 1999). Elkan's method (Elkan, 2003) defines an upper bound for each node based on its distance to its current cluster centroid. It reduces the calculations by skipping the clusters to which the node cannot move, which accelerates K-means. Hamerly proposed a similar algorithm (Kanungo et al., 2002) that also uses the second nearest centroid for each node to determine a lower bound. Later, Drake and Hamerly proposed a better method for accelerating K-means using adaptive distance bounds (Hamerly, 2010). Another approach to accelerating K-means is to filter out and ignore the nodes that are not going to move to a new cluster anymore (Kanungo et al., 2002; Pelleg & Moore, 1999). These methods use an extra dataset to keep track of the nodes closer to the centroids. It is not possible to determine


which version will work faster on a given dataset before testing it, as the initialization factors (how the first centroids are chosen and the number of clusters K) and the node density (the structure of the dataset) directly affect the efficiency of these methods. However, the published test results show that Hamerly's enhancements generally do better on some typical datasets (Drake & Hamerly, 2012; Hamerly, 2010).

Another type of proposed enhancement to K-means tries to increase the algorithm's accuracy (Hamerly & Elkan, 2002). For that, one first needs to define what accuracy means, since "clustering is in the eyes of the beholder" (Estivill-Castro, 2002): one's definition of a "correct" clustering of a specific dataset depends on what one expects from the clustering system. Other work focuses on applying K-means to data with specific requirements, for example data with more dimensions (Aggarwal, Wolf, Yu, Procopiuc, & Park, 1999) or peer-to-peer and mobile networks (Datta et al., 2006; Lua, Crowcroft, Pias, Sharma, & Lim, 2005). Since we cannot always rely on a central server to do the calculations, these enhancements rely on their predictions and local synchronization to achieve an acceptable result. These algorithms try to converge on a set of centroids that is as close as possible to the centroids that would have been produced if the data from all the nodes had first been centralized and K-means then run (Datta et al., 2006). In other words, they use a nearly accurate cluster map to cover the peer-to-peer relationships between the separated nodes in the graph. Recently, many developments, protocols, and algorithms have been proposed that follow the same trend (Tang, Hong, Zhou, & Miao, 2016; Reddy & Chaitany, 2016). Furthermore, to make clustering or re-clustering algorithms usable in practice (for example in image processing and computer vision (Mohammed, Abbood, & Yousif, 2016; Williams, Wantland, Ramos, & Sibley, 2016) or for routing (Pukhrambam, Bhattacharjee, & Das, 2017)), some modifications have been made to them; for their specific applications, these variants are far better than the older, original theoretical versions.

Another enhancement improves K-means so that it keeps producing results in a dynamic network (Bilgin & Yener, 2006). When we use the term "dynamic network" we refer to a network that changes over time. Most of these enhanced algorithms apply a change only to the cluster that originally contained the node (if the change is a node removal) or to the cluster that receives the new node (if the change is a node addition), and they ignore the impact of that change on other clusters, because they want to re-cluster dynamically (Cormode, Muthukrishnan, & Zhuang, 2007). The only update these algorithms perform is to notify all clusters about the centroid movements to keep the whole clustering system valid, but the result is not optimal: since the impact of the movement on neighboring clusters is ignored, there may be nodes in that neighborhood that are not assigned to the cluster with the nearest centroid (Aaron, Tamir, Rishe, & Kandel, 2014; Chakraborty & Nagwani, 2011; Datta et al., 2006). The number of nodes that are not optimally clustered grows over time, which increases the clustering error after a long stream of changes has been applied to the network (Dudoit & Fridlyand, 2003). That is why making sure the algorithm stays valid over time is another challenge in this topic.
Several methods for working on dynamic incremental networks have been proposed (Aaron et al., 2014; Pham, Dimov, & Nguyen, 2004). The core idea of all of these approaches is to use the last valid clustered network and its set of centroids to re-initialize the original algorithm on the whole network. Re-clustering and determining the best cluster for the nodes in an online data stream is the state of the art. The suggested algorithms work on any cluster shape, but they can only work



on inserted nodes and won’t consider node removals so these algorithms can’t consider the changing in the location of nodes or node links (For structures like graph) (Papadopoulos, Kompatsiaris, Vakali, & Spyridonos, 2012). In addition, we can use the mathematical basis of the still fairly used centroid based clustering algorithms to achieve a better result. In this paper, we are trying to accelerate re-clustering by using something similar to Hemerly method (Hamerly, 2010) and using a filtering algorithm like Kanungo et al. (2002). We’re proposing a new enhancement on K-means which is accelerating the reclustering process. This method is trying to reach the optimal clustering over time while it’s keeping the clustered network valid at all times.

3. Proposed method

The main goal of this algorithm is to update only the centroids nearest to the changed nodes and to detect all the potent nodes that might change their clusters in further iterations. It changes the whole clustered system with a greedy approach to make sure all nodes end up in their optimal cluster (distance-wise). The algorithm has three main parts: the Initializer and Sorter, the Dynamic Modifier, and the Detector. In the first part, we run the basic clustering algorithm (an enhanced K-means) and save some extra information that accelerates the later parts of the algorithm; this step is executed just once, on a single snapshot. The next part, the Dynamic Modifier, is executed after applying changes (node removals or additions) to keep the data valid; it updates the centroids and all other details related to the changes. Finally, the Detector, which is the main addition of this algorithm over the classical approaches, runs after the Dynamic Modifier and detects the potent nodes that might change their cluster because of the recent changes. The enhanced re-clustering algorithm runs the modification phase again whenever there are unapplied changes or detected potent nodes that must be rechecked, to ensure that all clusters are valid.
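As a rough sketch of how the three components fit together (our own illustrative Python skeleton, not code from the paper; all class and method names are hypothetical), the structure could look like this:

```python
from collections import deque

class EnhancedReclustering:
    """Skeleton of the three components described above (structure assumed by us)."""

    def __init__(self, k, m):
        self.k = k                     # number of clusters
        self.m = m                     # changes applied between Detector runs
        self.queue = deque()           # Dynamic Modifier queue: changes + potent nodes
        self._since_detect = 0

    def initialize_and_sort(self, snapshot):
        """Part 1: run enhanced K-means once on a static snapshot, compute each
        node's MCCT and build the per-cluster sorted MCCT lists."""

    def dynamic_modifier(self):
        """Part 2: pull queued insertions/removals/potent nodes and update the
        affected centroids, bounds and centroid errors."""

    def detector(self):
        """Part 3: scan the sorted MCCT lists and enqueue nodes that satisfy
        inequality (6) so the modifier re-checks them."""

    def on_change(self, change):
        self.queue.append(change)
        self.dynamic_modifier()
        self._since_detect += 1
        if self._since_detect >= self.m:
            self.detector()
            self._since_detect = 0
```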

3.1. Initializer and sorter

This part needs to be executed at least once, on a static, frozen snapshot of the dataset. Its main objective is to initialize the algorithm and to store some data that can be used to reduce the calculations later on. The core of this part is the original K-means algorithm with some required modifications, as in some other methods (Pena, Lozano, & Larranaga, 1999). Assuming we want k clusters, we choose k random centroids. The Forgy method, which uses random nodes for initialization, is the most common initializing method (Hamerly & Elkan, 2002). There may be other methods with better performance (Bradley & Fayyad, 1998), but they rely heavily on the shape of the dataset and the initialization. We calculate the distances between all the nodes and these centroids to find the closest centroid to each node, and every node joins the closest cluster. We then recalculate the position of each centroid as the average of the nodes assigned to that cluster (MacKay, 2003). In this approach, the centroids are not themselves nodes; if it is a constraint that the centroid must be one of the nodes, we should choose the node nearest to the calculated centroid, which introduces an initial error (Kaufman & Rousseeuw, 1987). For calculating the centroids in K-means we use Eq. (1), though, as mentioned, other enhanced versions of K-means use different equations to calculate the centroids.

$$c_i^{(t+1)} = \frac{1}{\left|S_i^{(t)}\right|} \sum_{x_j \in S_i^{(t)}} x_j \tag{1}$$

where $c_i^{(t+1)}$ is the centroid at the start of iteration $t+1$, $|S_i^{(t)}|$ is the number of nodes inside the cluster after $t$ iterations, and the sum runs over the positions of the nodes inside that cluster. The algorithm continues until no node changes its cluster. Because of the geometric and mathematical meaning of K-means clustering, we can determine lower and upper bounds to reduce the calculations in the initialization part, as Elkan (2003) and Hamerly (2010) did. In our implementation, we used Hamerly's enhanced K-means clustering method as the initializing algorithm.

At this step, we have the network clustered by K-means over a static representation of the data. To progress, we need extra information in the form of the minimum cluster change threshold (MCCT) for each node. MCCT is the difference between the distance from a node to its closest centroid and the distance to its second closest centroid. We can compute it by tracking the second minimum during the K-means iterations, at the same time as we compute the closest centroid. This variable shows how close a node is to the edge of its cluster and is useful when we want to detect, during the detection phase, the potent nodes that might change clusters. To make an efficient and easily traceable structure that keeps the nodes sorted by their desire to change cluster (MCCT), the algorithm uses a KD-tree: it puts the nodes closest to the edge (with lower MCCT) on the right side of the tree so we can traverse them first. We sort the nodes in each cluster after fixing their positions (in our final iteration) based on their MCCT, using a KD-tree map structure (Bentley, 1975); this can be implemented in the last iteration, where we calculate the two closest centroids. Sorting is O(log n) per node, where n is the total number of nodes in the network at that instant. The whole initialization phase is O(ndk + n log n), where d is the number of dimensions of each node and the ndk term comes from Lloyd's algorithm (Hamerly, 2010; Lloyd, 1982). By paying this extra cost up front, we reduce the calculations in the dynamic phase and make the whole algorithm beneficial overall, provided enough dynamic changes are applied; we discuss this further in Section 5. Table 1 lists the parameters used in our pseudo-code and equations. The initializer pseudo-code, in which we determine and update the bounds based on geometric equations to limit the calculations, is mostly taken from Hamerly's optimized K-means algorithm (Hamerly, 2010). After running the initializer, we sort all the nodes in each cluster based on their MCCT values and keep them in a tree-based structure that we call MCCT_List(j) (Algorithms 1–5).
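To make the MCCT bookkeeping concrete, here is a minimal NumPy sketch (our own illustration with our own function and variable names; a plain sorted list stands in for the KD-tree structure used in the paper):

```python
import numpy as np

def assign_and_mcct(points, centers):
    """Assign each point to its nearest center and compute MCCT:
    distance to the 2nd-closest center minus distance to the closest."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    order = np.argsort(dists, axis=1)
    assign = order[:, 0]
    rows = np.arange(len(points))
    mcct = dists[rows, order[:, 1]] - dists[rows, order[:, 0]]
    return assign, mcct

def recompute_centers(points, assign, k):
    """Eq. (1): each center becomes the mean of its members (assumes no empty cluster)."""
    return np.array([points[assign == j].mean(axis=0) for j in range(k)])

def per_cluster_mcct_lists(assign, mcct, k):
    """Per-cluster (MCCT, node index) lists sorted ascending:
    nodes closest to the cluster edge come first."""
    lists = {j: [] for j in range(k)}
    for i, j in enumerate(assign):
        lists[int(j)].append((float(mcct[i]), i))
    for j in lists:
        lists[j].sort()
    return lists
```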

Table 1. Parameters.

Parameters for centroids:
c(j) — Cluster center j (where 1 ≤ j ≤ k)
c′(j) — Vector sum of all points in cluster j
q(j) — Number of points assigned to cluster j
p(j) — Distance that c(j) last moved

Parameters for each node:
x(i) — Data point i (where 1 ≤ i ≤ n)
a(i) — Index of the cluster to which x(i) is assigned
s(i) — Index of the cluster whose center is the closest to x(i) but is not c(a(i))
u(i) — Upper bound on the distance between x(i) and its assigned center c(a(i))
l(i) — Lower bound on the distance between x(i) and its second closest center, i.e., the closest center to x(i) that is not c(a(i))
MCCT(i) — d(x(i), c(a(i))) − d(x(i), s(a(i)))
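Purely as an illustration of the bookkeeping summarized in Table 1, the per-center and per-node state could be grouped as below; this is a hypothetical Python layout of ours, not something prescribed by the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CenterState:
    c: np.ndarray          # c(j): current center position
    c_sum: np.ndarray      # c'(j): vector sum of all points in cluster j
    q: int = 0             # q(j): number of points assigned to cluster j
    p: float = 0.0         # p(j): distance that c(j) last moved
    ce: np.ndarray = None  # CE(j): accumulated centroid-error vector (Section 3.2)

@dataclass
class NodeState:
    x: np.ndarray          # x(i): data point
    a: int = -1            # a(i): index of the assigned cluster
    u: float = np.inf      # u(i): upper bound to the assigned center
    l: float = np.inf      # l(i): lower bound to the second-closest center
    mcct: float = np.inf   # MCCT(i): minimum cluster change threshold
```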


Algorithm 1. Initializer(dataset x, initial centers c)
    Initialize(c, x, q, c′, u, l, a)
    while not converged do
        for j = 1 to |c| do   {update s}
            s(j) ← min_{j′ ≠ j} d(c(j′), c(j))
        for i = 1 to |x| do
            m ← max(s(a(i))/2, l(i))
            if u(i) > m then   {first bound test}
                u(i) ← d(x(i), c(a(i)))
                if u(i) > m then   {second bound test}
                    a′ ← a(i)
                    Point-All-Ctrs(x(i), c, a(i), u(i), l(i))
                    if a′ ≠ a(i) then
                        Update-Bounds(q(a′), q(a(i)), c′(a′), c′(a(i)))
                    MCCT(i) ← d(x(i), c(a(i))) − d(x(i), s(a(i)))
        Move-Centers(c′, q, c, p)
        Update-Bounds(p, a, u, l)
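Before the remaining routines, the sketch below illustrates the two bound tests used in Algorithm 1 (and reused later in Check-Modify): a point is skipped whenever its upper bound is already below max(s(a(i))/2, l(i)). This is our own hedged Python reading of the Hamerly-style test, not the authors' code, and the array names are ours.

```python
import numpy as np

def maybe_reassign(i, points, centers, assign, upper, lower):
    """Bound tests for a single point; returns the (possibly new) assignment."""
    a = assign[i]

    # Half the distance from the assigned center to its nearest other center.
    d_cc = np.linalg.norm(centers - centers[a], axis=1)
    d_cc[a] = np.inf
    m = max(d_cc.min() / 2.0, lower[i])

    if upper[i] <= m:                                   # first bound test
        return a

    upper[i] = np.linalg.norm(points[i] - centers[a])   # tighten the upper bound
    if upper[i] <= m:                                   # second bound test
        return a

    # Bounds failed: compute all distances and reassign if needed.
    dists = np.linalg.norm(centers - points[i], axis=1)
    new_a = int(dists.argmin())
    upper[i] = dists[new_a]
    dists[new_a] = np.inf
    lower[i] = dists.min()
    assign[i] = new_a
    return new_a
```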


Algorithm 2. Initialize(c, x, q, c′, u, l, a)
    for j = 1 to |c| do
        q(j) ← 0
        c′(j) ← 0
    for i = 1 to |x| do
        Point-All-Ctrs(x(i), c, a(i), u(i), l(i))
        q(a(i)) ← q(a(i)) + 1
        c′(a(i)) ← c′(a(i)) + x(i)

Algorithm 3. Point-All-Ctrs(x(i), c, a(i), u(i), l(i))
    a(i) ← argmin_j d(x(i), c(j))
    u(i) ← d(x(i), c(a(i)))
    l(i) ← min_{j ≠ a(i)} d(x(i), c(j))

Algorithm 4. Move-Centers(c′, q, c, p)
    for j = 1 to |c| do
        c* ← c(j)
        c(j) ← c′(j)/q(j)
        p(j) ← d(c*, c(j))

Algorithm 5. Update-Bounds(p, a, u, l)
    r ← argmax_j p(j)
    r′ ← argmax_{j ≠ r} p(j)
    for i = 1 to |u| do
        u(i) ← u(i) + p(a(i))
        if r = a(i) then
            l(i) ← l(i) − p(r′)
        else
            l(i) ← l(i) − p(r)

3.2. Dynamic modifier

The Dynamic Modifier's job is to apply modifications to the nodes and the centroids in order to keep the clusters valid. It has a queue that holds nodes previously detected as potent to change their clusters (the third part of the algorithm detects the potent nodes and fills this queue) together with a stream of not-yet-applied changes (node additions or removals) that it should apply to the network (Algorithm 6). We assume that the only types of modifications on nodes are removals and insertions, and we treat a node movement as two actions (a removal and an insertion). Furthermore, we want to keep all the clusters reliable: we keep each centroid as the average of the nodes in its cluster, and we move toward the point where all the nodes are in their optimal cluster (Algorithm 7).

Algorithm 6. Modifier()
    while Modifier-Queue is not empty and sleep = false do
        counter ← counter + 1
        node ← Modifier-Queue.pull()
        if node.type is insert then Insert(i)
        if node.type is remove then Remove(i)
        if node.type is potent then Check-Modify(i)
        if counter > m then
            Detector(x(i))
            sleep ← true
            counter ← 0

Algorithm 7. Check-Modify(i)
    s(a(i)) ← min_{j ≠ a(i)} d(c(j), c(a(i)))
    m ← max(s(a(i))/2, l(i))
    if u(i) > m then   {first bound test}
        u(i) ← d(x(i), c(a(i)))
        if u(i) > m then   {second bound test}
            a′ ← a(i)
            Point-All-Ctrs(x(i), c, a(i), u(i), l(i))
            if a′ ≠ a(i) then
                Update-Bounds(q(a′), q(a(i)), c′(a′), c′(a(i)))
                MCCT(i) ← d(x(i), c(a(i))) − d(x(i), s(a(i)))
                for each affected center j do
                    c* ← c(j)
                    c(j) ← c′(j)/q(j)
                    p(j) ← d(c*, c(j))
                    CE(j) ← CE(j) + d(c*, c(j))
                MCE ← max(|CE(j)|)

The dynamic modifier entity is responsible for changing a node's cluster as it becomes eligible for the change (based on the K-means algorithm). To do that, it reads the stored nodes in its queue one by one and checks whether their nearest centroid has changed; it then applies that change to the node, the related clusters, and the affected centroids. If a node is removed from a cluster, we simply calculate the new location of the centroid (the new average) from the number of nodes inside the cluster of the removed node and the location of the removed node, using the equation below (the cluster has n nodes before applying the change):

$$c_i^{r} = \frac{c_i \cdot n - x_r}{n - 1} \tag{2}$$

where $c_i^{r}$ represents the new centroid after removing $x_r$ from a cluster with n nodes (Algorithm 8).

Algorithm 8. Remove(i)
    a(i) ← argmin_j d(x(i), c(j))
    u(i) ← d(x(i), c(a(i)))
    l(i) ← min_{j ≠ a(i)} d(x(i), c(j))
    c* ← c(a(i))
    c′(a(i)) ← c′(a(i)) − x(i)
    c(a(i)) ← (c(a(i)) · q(a(i)) − x(i)) / (q(a(i)) − 1)
    CE(a(i)) ← CE(a(i)) + d(c*, c(a(i)))
    MCE ← max(|CE(j)|)

Just as with removing, when inserting a new node we use the equation below, which can also be calculated easily:

$$c_i^{in} = \frac{c_i \cdot n + x_{in}}{n + 1} \tag{3}$$

where $c_i^{in}$ represents the new centroid after inserting $x_{in}$ inside a cluster with n nodes (Algorithm 9).
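For concreteness, the incremental centroid updates of Eqs. (2) and (3) can be written as below. This is a hedged NumPy sketch with our own function names; it shows only the running-mean update, not the bound and centroid-error bookkeeping of Algorithms 8 and 9.

```python
import numpy as np

def centroid_after_removal(c, n, x_r):
    """Eq. (2): new centroid after removing x_r from a cluster of n nodes."""
    assert n > 1, "cannot remove the last node of a cluster"
    return (c * n - x_r) / (n - 1)

def centroid_after_insertion(c, n, x_in):
    """Eq. (3): new centroid after inserting x_in into a cluster of n nodes."""
    return (c * n + x_in) / (n + 1)

# Tiny usage example:
c = np.array([1.0, 1.0])                      # centroid of a cluster with n = 4 nodes
x_new = np.array([3.0, 1.0])
print(centroid_after_insertion(c, 4, x_new))  # -> [1.4 1. ]
```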



Algorithm 9. Insert(i)
    a(i) ← argmin_j d(x(i), c(j))
    u(i) ← d(x(i), c(a(i)))
    l(i) ← min_{j ≠ a(i)} d(x(i), c(j))
    c* ← c(a(i))
    c′(a(i)) ← c′(a(i)) + x(i)
    c(a(i)) ← (c(a(i)) · q(a(i)) + x(i)) / (q(a(i)) + 1)
    CE(a(i)) ← CE(a(i)) + d(c*, c(a(i)))
    MCE ← max(|CE(j)|)

By using the equations above, all the centroids remain the mean of their clusters, but some nodes might no longer be in their optimal cluster. This is because, as we change the locations of the centroids, the distances between nodes and the corresponding centroids change, so some nodes may have new nearest centroids. We want to assign each node to its optimal cluster. To do this we use the already-calculated data from the previous iterations: a detection algorithm detects only the potent nodes that might change their clusters after applying the addition or removal. To find the potent nodes we use the data we sorted and stored in the initializing step (MCCT). To use that sorted list, we need to know how far the centroids have moved from their initial locations to their present locations, which tells us how valid our previously stored knowledge about the nodes still is. We call this variable the centroid error (the displacement of the centroid from its earlier location); it represents how much the MCCT values of the nodes in that cluster are affected. We calculate and update the centroid error whenever we change a centroid location during the modification process. The centroid error CE(j) is a vector (like MCCT), not a scalar value, so its magnitude might decrease after new changes are applied. The behavior of this variable is discussed further in Section 5.

3.3. Detector

The main goal of this part of the algorithm is to reduce the number of nodes that the modifier entity needs to check in each iteration. The detector entity does this by finding the nodes that have the potential to change their clusters because of the recently applied node insertions or removals. Assuming all the centroid errors are zero, a node is potent to change its cluster if and only if

$$MCCT(i) < \left| c_i^{new} - c_i \right| \tag{4}$$

That is, a node is potent to change its cluster if its distance to another centroid becomes less than its distance to the centroid of its current cluster. We denote the centroid error of the centroid of the changed cluster (the cluster that had a node removal or insertion) by $\overrightarrow{CE_{c_i}}$ and the centroid error of the secondary cluster, which the potent node would join after leaving its main cluster, by $\overrightarrow{CE_{c_s}}$. Because we do not want to lose any potent nodes, we assume the worst case, in which both errors act to decrease the left-hand side. So Inequality (4) changes to:

$$MCCT(i) - \left|\overrightarrow{CE_{c_i}}\right| - \left|\overrightarrow{CE_{c_s}}\right| < \left| c_i^{new} - c_i \right| \tag{5}$$

Furthermore, as we do not know for sure which centroid will be the secondary centroid that must be considered for calculating $\overrightarrow{CE_{c_s}}$ (we do not recalculate them during the detection phase), we use the maximum centroid error, $MCE = \max_j(|CE(j)|)$, instead of $\overrightarrow{CE_{c_s}}$. This assumption results in detecting more nodes as potent nodes, without losing any potent nodes:

$$MCCT(i) - \left|\overrightarrow{CE_{c_i}}\right| - \max_{c}\left(\left|\overrightarrow{CE_{c}}\right|\right) < \left| c_i^{new} - c_i \right| \tag{6}$$

We already keep a sorted list based on MCCT for all the nodes in each cluster, and we have the other parameters of the inequality above ($|\overrightarrow{CE_{c_i}}|$, $\max(|\overrightarrow{CE_{c}}|)$ and $|c_i^{new} - c_i|$); since they are the same for all the nodes in the same cluster, a simple iteration over the sorted list detects all the potent nodes, which are then added to the modifier queue (Algorithm 10).

Algorithm 10. Detector(n)
    for j = 1 to k do
        for each i in order of MCCT_List(j) do
            if MCCT(i) − |CE(a(i))| − MCE < p(a(n)) then
                Modifier-Queue.push(x(i))
                Modifier.poke()
            else
                break

Using the above inequalities, we visit every potent node. If a node does not fulfill Inequality (6), then we know that it is closer to its current centroid than to any other centroid, and thus is not potent.

3.4. Implementation

To implement the algorithm, we run the Initializer and Sorter at least once on a frozen, static snapshot of the network. When the system demands a change (node addition or removal), the change is queued in the Dynamic Modifier queue and the Modifier is informed of it. The Modifier must always be online to process the changes listed in its queue. After every m applied changes, the Detector checks all the sorted MCCT lists to detect the potent nodes and adds them to the Modifier queue so the Dynamic Modifier can check them later (Algorithm 11).

Algorithm 11. Main(dataset x)
    c ← generate random centers
    Initializer(x, c)
    counter ← 0
    for each change do
        Modifier-Queue.push(change)
        Modifier()
        counter ← counter + 1
        if counter = m then
            Detector()
            counter ← 0
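To illustrate how the sorted MCCT lists keep the detector cheap (inequality (6) plus early exit once the list entries become large enough), here is a minimal Python sketch under our own simplified data layout; it is not the authors' implementation and it omits the bound bookkeeping of Algorithms 6–9.

```python
def detect_potent_nodes(mcct_lists, ce_norm, mce, center_shift, modifier_queue):
    """Push onto the modifier queue every node that satisfies inequality (6).

    mcct_lists[j] : list of (mcct, node_id) sorted ascending for cluster j
    ce_norm[j]    : |CE(j)|, magnitude of the accumulated centroid-error vector
    mce           : max_j |CE(j)|
    center_shift  : distance the recently changed centroid just moved (p in Alg. 10)
    """
    for j, entries in mcct_lists.items():
        for mcct, node_id in entries:
            if mcct - ce_norm[j] - mce < center_shift:
                modifier_queue.append(("potent", node_id))
            else:
                # The list is sorted by MCCT, so no later node in this
                # cluster can satisfy the inequality either.
                break

def main_loop(changes, m, apply_change, run_detector):
    """Skeleton of Algorithm 11: apply changes, run the detector every m changes."""
    counter = 0
    for change in changes:
        apply_change(change)      # Insert / Remove / Check-Modify via the modifier
        counter += 1
        if counter == m:
            run_detector()
            counter = 0
```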

3.5. Time and memory analysis

In this section, we analyze the computational time and memory space required by our proposed algorithm. As mentioned before, the time consumed in the initialization phase is O(ndk + n log n). Each execution of the detector entity needs at most O(dk log n) computations, as we search a sorted list in each of the k clusters. One of the most important parameters in our algorithm is the number of times we execute the detector entity. In one scenario we run it after every single change in the modifier queue, which means the detector entity is executed at most n times; in that scenario the algorithm consumes O(nk log n) time per change. In another scenario, we run the detector entity after finishing the modification of the entire current stream in the modifier queue. In that case the algorithm consumes O(ndk + k log n) time per change, where d is the number of dimensions: the ndk term comes from processing at most n nodes in the modifier queue (in the worst case, each against k centers in d dimensions) and O(k log n) is the detector cost. The second scenario is more similar to Hamerly's approach of applying a static clustering method to each snapshot. Hamerly's method uses O(ndk + dk²) time for initializing and O(ndk + dk²) time for each iteration after applying each change (Hamerly, 2010).

As mentioned in Hamerly (2010), Hamerly's method uses O(n) memory space. We have two more structures than Hamerly. One is the modifier queue, which uses at most O(n) memory (we can avoid adding redundant nodes to the queue by adding an extra flag to each node). The other is the per-cluster list of nodes sorted by MCCT built in the initialization phase. Since the number of nodes over all clusters is n and we sort the data over all d dimensions, the required space for these sorted lists is O(dn). As a result, the total memory overhead of our algorithm is O(dn).

Fig. 1. Initializing time of Hamerly's normal clustering and our algorithm for different numbers of clusters on the kddcup dataset.
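As a back-of-the-envelope consequence of these costs (our own illustrative arithmetic, not a result stated in the paper), the extra initialization cost is amortized once enough changes arrive: if the proposed method has a larger initialization time, $T^{init}_{ours} > T^{init}_{H}$, but a smaller per-change cost, $t_{ours} < t_{H}$, then it is cheaper in total after roughly

$$C > \frac{T^{init}_{ours} - T^{init}_{H}}{t_{H} - t_{ours}}$$

changes. This matches the qualitative behavior in Fig. 3 and Table 4, where the total-time curves cross between roughly 100 and 1000 changes for most of the tested datasets.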

4. Simulation and test results

To evaluate the performance of our proposed method, we compared it with normal incremental K-means re-clustering as described by Hamerly (Drake & Hamerly, 2012), where the normal K-means iterations are continued after data insertion. We did not compare our method with other similar well-known methods; instead we relied on the comparison presented in Hamerly (2010) (reproduced in Table A1 in the Appendix), which compares Hamerly's method with other similar methods and shows its better performance in clustering a static dataset. We executed both methods on some popular datasets, including Birch (Zhang, Ramakrishnan, & Livny, 1996), covtype (Blackard, Dean, & Anderson, 1998) and kddcup (Portnoy, Eskin, & Stolfo, 2001), and applied random but identical streams of changes to both algorithms. We also ran both algorithms (Hamerly's and our enhanced version of it) on 10 randomly generated datasets (10 times on each random dataset) and averaged all of the results. Note that we tested the algorithms on random datasets with 100,000 nodes (labeled "Random I") and 125,000 nodes (labeled "Random II"). We also ran a simulation on embedded versions of the Friendster (Yang & Leskovec, 2012) and Facebook friendships (Bimal, Mislove, Cha, & Gummadi, 2009; Facebook 2017) datasets, using the data timestamps, data pooling and data embedding techniques (node2vec) (Goyal & Ferrara, 2018), and forced normally distributed random changes whenever timestamps for the changes were not provided.

To get a more accurate and complete picture, we tested our algorithm for node addition only, node removal only, and a randomized combination of additions and removals, separately. The results for addition-only and removal-only changes are presented in the Appendix. Both algorithms were executed on a single machine with a Core i7 1.8 GHz processor and 8 GB of RAM. During the simulations we halted all other system processes as far as possible, and no hardware error, software error or exception occurred. Fig. 1 shows how the initialization time differs between the two algorithms on the same dataset (kddcup); a comparison of the initialization times on the other datasets is given in Table 2. As the results show, our initialization time is greater than that of the other algorithm, which is expected because we save an extra sorted list. The initialization time of the smoothed running of the standard K-means algorithm is polynomial (Arthur & Vassilvitskii, 2006; Arthur, Manthey, & Röglin, 2009), but because of the added sorting required by our proposed method, the consumed time increases. The extra cost of the initialization phase pays off during the modification phase: as shown in Fig. 2 (for the kddcup dataset) and Table 3 (for all tested datasets), the number of checked nodes (nodes checked at least once) drops significantly in our algorithm, because we keep a list of the potent nodes (the only nodes that may change their clusters) in the modifier queue instead of checking all the nodes.
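The measurement loop described here can be reproduced with a harness along these lines; this is a hedged sketch, and the algorithm API (initialize/insert/remove and the checked_nodes counter) is an assumption of ours, not an artifact shipped with the paper.

```python
import random
import time

def run_benchmark(algorithm, points, changes):
    """Apply a stream of insert/remove changes and record the metrics
    reported in Tables 3-6: wall-clock time and number of node checks."""
    t0 = time.perf_counter()
    algorithm.initialize(points)            # assumed API of the compared methods
    init_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    for op, node in changes:
        if op == "insert":
            algorithm.insert(node)
        else:
            algorithm.remove(node)
    recluster_time = time.perf_counter() - t1

    return {
        "init_time": init_time,
        "recluster_time": recluster_time,
        "total_time": init_time + recluster_time,
        "checked_nodes": algorithm.checked_nodes,   # assumed counter
    }

def random_change_stream(points, n_changes, seed=0):
    """Random mixture of insertions and removals drawn from a list of points."""
    rng = random.Random(seed)
    return [(rng.choice(["insert", "remove"]), rng.choice(points))
            for _ in range(n_changes)]
```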

Table 2. Initialize time.

| Dataset | Details | Algorithm | k = 3 | k = 10 | k = 100 | k = 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Birch | 2 dims, 100,000 nodes | Hamerly | 0.46 | 0.51 | 0.86 | 8.9 |
| | | E. Re-clustering | 1.58 | 1.94 | 2.3 | 14.68 |
| covtype | 54 dims, 150,000 nodes | Hamerly | 3.27 | 8.7 | 41.26 | 170.1 |
| | | E. Re-clustering | 5.38 | 9.33 | 67.54 | 268.9 |
| kddcup | 56 dims, 95,412 nodes | Hamerly | 4.21 | 8.36 | 25.1 | 223.7 |
| | | E. Re-clustering | 8.12 | 32.61 | 72.67 | 278.1 |
| Friendster* | 2 dims, 65,608,366 nodes | Hamerly | 194.2 | 501.1 | 1647 | 6207 |
| | | E. Re-clustering | 272.9 | 695.7 | 2191 | 8872 |
| Facebook* | 2 dims, 63,731 nodes | Hamerly | 0.55 | 0.95 | 2.5 | 6.42 |
| | | E. Re-clustering | 1.98 | 2.3 | 5.69 | 15.9 |
| Random I | 2 dims, 100,000 nodes | Hamerly | 0.89 | 1.99 | 3.76 | 17.15 |
| | | E. Re-clustering | 2.13 | 5.64 | 6.12 | 29.87 |
| Random II | 2 dims, 125,000 nodes | Hamerly | 2.9 | 6.43 | 15.14 | 206.3 |
| | | E. Re-clustering | 5.1 | 7.17 | 25.12 | 298.6 |

The calculated initializing time of running the two algorithms (until they finish the first clustering attempt) for different values of K. * Without the embedding time.



Fig. 2. The number of total checked nodes while applying dynamic changes (insertions or removals) on the kddcup dataset. K = 100, m = 10.

Fig. 3 (for the kddcup dataset) and Table 4 (for all tested datasets) show the total amount of consumed time (including the initialization time). Because of the extra cost in the initialization phase, our algorithm does not show promising results for the first few changes, but as the total number of changes increases, it starts to outperform Hamerly's method. Fig. 4 (for kddcup) and Table 5 (for all datasets) show the time consumed after the initialization phase; it is clear that our method performs better than Hamerly's during the re-clustering phase. Fig. 5 (for kddcup) and Table 6 (for all datasets) show the total number of times nodes were processed. The results clearly show how powerful this method is on larger and more dynamic networks (with more changes), but it is also clear that the extra cost of running this algorithm (such as the initialization time) is not worth paying on some datasets. This depends on n (the number of nodes), k (the number of clusters), the number of changes, and other factors, which we discuss in Section 5.

Fig. 3. Total consumed time after x node changes (insertions or removals), including the initializing time, for the kddcup dataset. K = 100, m = 10.

Table 3. Total checked nodes.

| Dataset | Details | Algorithm | 10 changes | 50 changes | 100 changes | 1000 changes | 10,000 changes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Birch | 2 dims, 100,000 nodes | Hamerly | 73,045 | 74,960 | 75,889 | 82,334 | 89,675 |
| | | E. Re-clustering | 10,661 | 14,224 | 22,342 | 41,908 | 70,063 |
| covtype | 54 dims, 150,000 nodes | Hamerly | 120,524 | 120,960 | 130,453 | 141,886 | 146,981 |
| | | E. Re-clustering | 17,012 | 34,169 | 72,771 | 107,882 | 110,445 |
| kddcup | 56 dims, 95,412 nodes | Hamerly | 65,332 | 67,812 | 74,087 | 82,160 | 89,912 |
| | | E. Re-clustering | 8330 | 17,452 | 50,008 | 61,031 | 71,663 |
| Friendster | 2 dims, 65,608,366 nodes | Hamerly | 54,000,719 | 59,212,983 | 61,749,112 | 61,991,263 | 62,811,174 |
| | | E. Re-clustering | 839,042 | 1,841,109 | 8,200,774 | 48,781,260 | 61,856,100 |
| Facebook | 2 dims, 63,731 nodes | Hamerly | 40,871 | 45,222 | 56,769 | 61,210 | 62,096 |
| | | E. Re-clustering | 5026 | 6703 | 11,678 | 32,581 | 41,937 |
| Random I | 2 dims, 100,000 nodes | Hamerly | 64,074 | 79,992 | 80,184 | 83,442 | 90,621 |
| | | E. Re-clustering | 4229 | 11,056 | 22,764 | 51,620 | 89,691 |
| Random II | 2 dims, 125,000 nodes | Hamerly | 71,015 | 72,626 | 78,197 | 83,881 | 87,408 |
| | | E. Re-clustering | 6735 | 17,442 | 25,720 | 55,490 | 70,019 |

The number of checked nodes after applying X inserts or removes on different datasets. K = 100, m = 10.

Table 4. Total time.

| Dataset | Details | Algorithm | 10 changes | 50 changes | 100 changes | 1000 changes | 10,000 changes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Birch | 2 dims, 100,000 nodes | Hamerly | 0.88 | 1.07 | 1.92 | 6.1 | 13.43 |
| | | E. Re-clustering | 2.55 | 2.79 | 2.84 | 2.86 | 3.68 |
| covtype | 54 dims, 150,000 nodes | Hamerly | 61.34 | 69.64 | 73.45 | 89 | 127.13 |
| | | E. Re-clustering | 69.88 | 80.11 | 84.99 | 86.71 | 92.2 |
| kddcup | 56 dims, 95,412 nodes | Hamerly | 30.16 | 32.46 | 38.55 | 53.07 | 76.72 |
| | | E. Re-clustering | 34.37 | 35.88 | 36.91 | 45.32 | 58.69 |
| Friendster* | 2 dims, 65,608,366 nodes | Hamerly | 2883.5 | 4200.7 | 12,908.2 | 59,844.2 | 259,912.3 |
| | | E. Re-clustering | 5661.4 | 6830.9 | 13,851 | 57,566.5 | 227,126.9 |
| Facebook | 2 dims, 63,731 nodes | Hamerly | 4.73 | 7.61 | 13.7 | 19.53 | 28.84 |
| | | E. Re-clustering | 11.07 | 15.98 | 19.28 | 24.12 | 28.92 |
| Random I | 2 dims, 100,000 nodes | Hamerly | 4.78 | 5.79 | 7.82 | 15.79 | 26.93 |
| | | E. Re-clustering | 7.7 | 8.14 | 8.67 | 12.65 | 19.98 |
| Random II | 2 dims, 125,000 nodes | Hamerly | 16.08 | 17.85 | 24.44 | 37.5 | 64.57 |
| | | E. Re-clustering | 26.1 | 26.74 | 27.81 | 36.12 | 42.43 |

Total consumed time for different numbers of changes (inserts or removes) on different datasets. K = 100, m = 10. * Without the embedding time.


Fig. 4. Total consumed time for x node changes (insertions or removals) after initialization, for the kddcup dataset. K = 100, m = 10.


Fig. 5. The absolute number of total node checks while applying dynamic changes (insertions or removals) over the kddcup dataset. K = 100, m = 10.

Table 5. Re-clustering time.

| Dataset | Details | Algorithm | 10 changes | 50 changes | 100 changes | 1000 changes | 10,000 changes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Birch | 2 dims, 100,000 nodes | Hamerly | 0.02 | 0.21 | 1.06 | 5.24 | 12.57 |
| | | E. Re-clustering | 0.25 | 0.49 | 0.54 | 0.56 | 1.38 |
| covtype | 54 dims, 150,000 nodes | Hamerly | 20.11 | 28.41 | 32.22 | 47.77 | 85.9 |
| | | E. Re-clustering | 2.34 | 12.57 | 17.45 | 19.17 | 24.66 |
| kddcup | 56 dims, 95,412 nodes | Hamerly | 5.06 | 7.36 | 13.45 | 27.97 | 51.62 |
| | | E. Re-clustering | 1.7 | 3.21 | 4.24 | 12.65 | 26.02 |
| Friendster | 2 dims, 65,608,366 nodes | Hamerly | 1236.5 | 2553.7 | 11,261.2 | 58,197.2 | 258,265.3 |
| | | E. Re-clustering | 3470.4 | 4639.9 | 11,660 | 55,375.5 | 224,935.9 |
| Facebook | 2 dims, 63,731 nodes | Hamerly | 2.23 | 5.11 | 11.2 | 17.03 | 26.34 |
| | | E. Re-clustering | 5.38 | 10.29 | 13.59 | 18.43 | 23.23 |
| Random I | 2 dims, 100,000 nodes | Hamerly | 1.02 | 2.03 | 4.06 | 12.03 | 23.17 |
| | | E. Re-clustering | 1.58 | 2.02 | 2.55 | 6.53 | 13.86 |
| Random II | 2 dims, 125,000 nodes | Hamerly | 0.94 | 2.71 | 9.3 | 22.36 | 49.43 |
| | | E. Re-clustering | 0.98 | 1.62 | 2.69 | 11 | 17.31 |

Re-clustering time for different numbers of changes (inserts or removes) on different datasets. K = 100, m = 10.

Table 6. Total number of node checks.

| Dataset | Details | Algorithm | 10 changes | 50 changes | 100 changes | 1000 changes | 10,000 changes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Birch | 2 dims, 100,000 nodes | Hamerly | 431,487 | 965,988 | 2,119,529 | 2,866,757 | 3,469,043 |
| | | E. Re-clustering | 81,929 | 176,150 | 383,187 | 960,334 | 1,513,104 |
| covtype | 54 dims, 150,000 nodes | Hamerly | 716,361 | 1,609,988 | 3,647,321 | 5,010,629 | 5,703,977 |
| | | E. Re-clustering | 164,492 | 435,435 | 1,391,767 | 2,609,684 | 2,401,508 |
| kddcup | 56 dims, 95,412 nodes | Hamerly | 385,209 | 865,916 | 2,069,073 | 2,860,493 | 3,478,286 |
| | | E. Re-clustering | 51,626 | 218,114 | 936,507 | 1,438,409 | 1,763,293 |
| Friendster | 2 dims, 65,608,366 nodes | Hamerly | 89,231,770 | 243,007,664 | 393,164,027 | 779,193,212 | 2,894,408,221 |
| | | E. Re-clustering | 680,704 | 2,857,280 | 15,225,169 | 84,500,714 | 918,079,237 |
| Facebook | 2 dims, 63,731 nodes | Hamerly | 80,004 | 422,906 | 1,126,860 | 2,340,996 | 3,481,212 |
| | | E. Re-clustering | 31,206 | 106,774 | 364,981 | 989,964 | 1,843,265 |
| Random I | 2 dims, 100,000 nodes | Hamerly | 377,661 | 1,036,436 | 2,239,789 | 2,906,645 | 3,505,937 |
| | | E. Re-clustering | 54,413 | 134,966 | 391,627 | 1,203,134 | 2,213,993 |
| Random II | 2 dims, 125,000 nodes | Hamerly | 419,307 | 933,312 | 2,184,153 | 2,922,449 | 3,380,630 |
| | | E. Re-clustering | 86,991 | 217,984 | 450,747 | 1,299,884 | 1,722,193 |

The total number of node checks after applying X inserts or removes on different datasets. K = 100, m = 10.

5. Discussion

The efficiency of our algorithm is affected by multiple variables. The first and most important one is the length of the queue of the modifier entity. The length of the modifier queue shows how many nodes we are checking; it is the main computing cost of the algorithm and therefore the most important factor in evaluating efficiency. This parameter is determined by the number of changes to the whole network and by the number of nodes detected by the detector entity. The number of changes is not under our control and is determined by the nature of the problem. The number of nodes detected by the detector entity depends on two other factors: the number of nodes in the targeted cluster and the distance of the changed nodes from their centroid (based on Eqs. (2) and (3)). If we assume that the number of changes and the density of nodes do not change, then based on Eqs. (2) and (3), when n (the number of nodes in a cluster) is higher, the change in the centroid's position



is smaller, which reduces the resulting centroid error and increases the efficiency of each iteration of the detector entity, based on Inequality (6). Something similar can be said about the distance of the removed/inserted nodes from their centroid: based on the presented equations, as this distance decreases (nodes closer to the centroid of each cluster are inserted or removed), less centroid error is created, which makes each detection iteration more efficient. With that in mind, we can conclude that this enhanced algorithm is generally more efficient with fewer clusters, denser clusters, and clusters with more nodes, which makes sense since clusters with those properties are more robust.

The most important concern with this algorithm is its high initialization cost. As explained above, the algorithm needs a sorted list of nodes based on how strongly they are attracted to the next closest centroid, which imposes a large cost during the initialization phase. With that in mind, this algorithm is not a good choice for systems with a low change rate, for which normal K-means re-clustering takes considerably less time. During the modification phase, the number of nodes we have to check drops drastically (Fig. 2), as we already have a sorted list of the differences between each node's distances to its first and second closest centroids. If the error induced by a change in the network is less than the value saved in the sorted list for a node, we do not have to check that node. So the initialization cost is paid off, as evidenced in Fig. 3.

Another parameter that affects the efficiency of the algorithm is the spatial pattern of changed (added/removed) nodes. This factor affects how well the algorithm remains valid and efficient in the long run. The centroid errors are vectors, not scalar values, which means that newly created centroid errors from recent changes might neutralize older ones (Fig. 6). From a purely theoretical point of view, the chance of having centroid errors in all directions, which causes them to neutralize each other, is high if the changed (added/removed) nodes are selected randomly with a normal distribution. For removed nodes, since the initial centroids are the average of the initial nodes in the clusters, we can estimate from basic probability that the centroid errors will remain within a fixed range in the long run. For newly added nodes, as the new nodes are added randomly in all directions around the centroids with the same probability, the centroid errors they cause will likely neutralize each other; this is shown in Fig. 4 in the Simulation and test results section. With this in mind, the need to re-initialize the algorithm to remove the centroid errors (centroid error = 0) decreases significantly if the spatial pattern of changed nodes is random with a normal distribution.

Fig. 6. Centroid errors can be neutralized after new modifications.

Another factor with a direct impact on the efficiency of our algorithm is the number of nodes the modifier entity checks in one iteration before running the detector entity. If we set this parameter to one (running the detector entity after each modification), we always have a correctly clustered network and never need to wait for enough nodes in the modifier queue before running the detector entity. If we set this parameter greater than one, we run the detector entity fewer times, but the task of the modifier entity becomes more complex. If we run the detector entity only once at the end of each iteration, the algorithm is just an enhanced version of the original clustering algorithm (in our case, Hamerly's (Drake & Hamerly, 2012; Hamerly, 2010)). Because the network is dynamic, nodes are added and removed over time and we do not know the configuration of the network in the next step, so we cannot determine the optimal value for this parameter. There is a trade-off in the number of times we run the detector entity: having compared the pros and cons of running it as often as possible versus only once, we cannot select a single number as the optimal value. Even though the answer is always mathematically correct, the resulting clustered networks differ depending on how often the detector entity runs. By the nature of the modifier entity, regardless of how often we run the detector entity, the centroids are always the mathematical average of the positions of the nodes in their cluster, based on Eqs. (2) and (3). Also, the number of clusters (K) is a constant value.

If we increase the dimension of the nodes, storing a sorted list of MCCT values might become a challenge. The time consumed for sorting the data by MCCT in the tree format is O(nd log n) (as we must sort the data based on every dimension). Another issue that arises as the number of dimensions grows is the speed of the detector entity (and of recalling the stored data in the sorted list); as is also evident from the test results, increasing the number of dimensions reduces the algorithm's efficiency.

In a normal modifier iteration, some nodes change their clusters, so the detector entity should be executed for these changes as well; these changes will in turn put some new nodes in the modifier queue. One might ask why this algorithm never gets trapped in an infinite loop. The reason is the same reason that K-means never gets trapped in an infinite loop (Arthur et al., 2009). If we stop the dynamic changes (adding or removing nodes) from outside, the previously applied changes continue to affect the clusters in exactly the manner K-means does, and because there are no additions or removals, the modifier entity working on a steady network progresses like an enhanced version of K-means. We do not get trapped in an infinite loop after applying each change from outside the network, so as long as the number of changes is finite, we cannot get trapped in an infinite loop. Also, as the nodes that are not checked cannot change their cluster at all, considering or ignoring them does not affect the correctness of the solution.

An important feature of this algorithm is that all of the variables of the detector and the modifier entities are local, and it is not necessary to run the algorithm centrally, which means we can run it in a distributed system such as a peer-to-peer network. For example, we can run our algorithm on a mobile phone network where each node of the network is a mobile phone. The modification phase of this algorithm can be enhanced by a better selection of the initial seeds (Krishna & Murty, 1999; Mobasher et al., 1999; Pena et al., 1999) and made more efficient on a dynamic peer-to-peer network (e.g., for real-time media streaming (Tran, Hua, & Do, 2004)). The changed cluster sends a message to its neighbors telling them to run their detector entity, and the modifier entities communicate with each other to inform the other modifier entities about the changes in their clusters. The only exception among these variables, which are otherwise all local, is the maximum centroid error, which can be replaced with a local maximum, since only the neighboring clusters can affect each other's nodes in one iteration of our proposed method or of K-means.
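Returning to the earlier point that centroid errors are vectors and may cancel out, a tiny numerical illustration of our own (not from the paper) shows two opposite insertions leaving the accumulated error vector near zero while a scalar accumulation of the same shifts keeps growing:

```python
import numpy as np

centroid = np.array([0.0, 0.0])
n = 100                                  # nodes currently in the cluster
ce_vector = np.zeros(2)                  # accumulated centroid error as a vector
ce_scalar = 0.0                          # what a scalar accumulation would give

for x_new in [np.array([5.0, 0.0]), np.array([-5.0, 0.0])]:
    new_centroid = (centroid * n + x_new) / (n + 1)   # Eq. (3)
    shift = new_centroid - centroid
    ce_vector += shift
    ce_scalar += np.linalg.norm(shift)
    centroid, n = new_centroid, n + 1

print(np.linalg.norm(ce_vector))   # ~0 (up to rounding): opposite shifts cancel
print(ce_scalar)                   # ~0.099: scalar magnitudes only accumulate
```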


6. Conclusion and future works

We proposed an algorithm for applying K-means clustering over large-scale dynamic networks that can also be used on peer-to-peer networks. The proposed algorithm has a costly initialization phase but a very low computation cost during re-clustering. By using a detector entity, it can restrict its calculations to a limited set of potent nodes, allowing us to reuse the data from the last instance of the clustered network; in other words, the algorithm does not recheck and recalculate all of the nodes in the system after each change in the data. The results are promising, as we see roughly a 70% decrease in the number of checked nodes in our experiments, and the introduced centroid errors, which are used as an internal parameter of the algorithm, appear to approximately neutralize each other over time, which keeps the results promising in the long run.

The algorithm could become even more efficient if the detector entity listed only the potent nodes from the changed cluster and its neighbor clusters instead of applying Inequality (6) to all clusters, because clusters can only force a change on their own nodes and their neighbor clusters' nodes. For that, we need an updated neighborhood list for the clusters, because these relationships may change during modifications. Developing this algorithm naturally leads to related topics that should be explored, such as finding the best starting centroid positions and number of clusters (k) in order to maximize the neutralization rate of the centroid errors, and a method to estimate when it is better to re-initialize the algorithm to reset all the errors rather than hoping for future neutralizing changes. There has been a lot of research on the optimum k for K-means clustering, such as the K-MACE algorithm (Nidoy), but it mostly targets static, traditional clustering; such algorithms might be adapted and merged with the enhanced dynamic re-clustering algorithm introduced in this paper to adjust the number of clusters dynamically. Furthermore, if we can somehow sort the changes in the modification queue, or optimize its behavior so that the changes that reduce the centroid error are applied before running the detector, we will reduce the total number of checked nodes. Data structures better than a KD-tree for the MCCT values, or a better modification queue for streaming the changes, might also increase the performance of this algorithm. Another direction for future research is the optimal value of m, which, as mentioned, affects the algorithm fundamentally. The question is: after processing how many changes should the detector run? This is a trade-off between clustering accuracy and speed. If we run the detector late (increase m), some nodes may not be in their best cluster for a while, but the speed increases because fewer potent nodes are detected in total; if we set this parameter to a small value (for example one, running the detector entity after each modification), we always have a correctly clustered network, but the computational cost increases and the speed decreases.

Conflict of interest statement

None.

Appendix

Fig. A1. The number of total checked nodes while applying dynamic changes (removals) over the kddcup dataset. K = 100, m = 10.

Fig. A2. Difference of total consumed time after x node changes (removals), including the initializing time, for the kddcup dataset. K = 100, m = 10.

Fig. A3. Difference of total consumed time for x node changes (removals) after initializing, for the kddcup dataset. K = 100, m = 10.



Fig. A4. The absolute number of total node checking while applying dynamic changes (removals) over the kddcup dataset. K = 100, m = 10.

Fig. A5. The number of total checked nodes while applying dynamic changes (inserts) over the kddcup dataset. K = 100, m = 10.

Fig. A6. Difference of total consumed time after x node changes (inserts), including the initializing time, for the kddcup dataset. K = 100, m = 10.

Fig. A7. Difference of total consumed time for x node changes (inserts) after initializing, for the kddcup dataset. K = 100, m = 10.

Fig. A8. The absolute number of total node checking while applying dynamic changes (inserts) over the kddcup dataset. K = 100, m = 10.



Table A1. Comparison of Hamerly's algorithm with some other algorithms on the same databases during clustering [Hamerly G. (2010), "Making k-means even faster." In Proceedings of the 2010 SIAM International Conference on Data Mining: 130–140]. Entries are total user CPU seconds (user CPU seconds per iteration).

| Dataset | Metric | k = 3 | k = 20 | k = 100 | k = 500 |
| --- | --- | --- | --- | --- | --- |
| Uniform random, n = 1,250,000, d = 2 | Iterations | 44 | 227 | 298 | 710 |
| | lloyd | 4.0 (0.058) | 61.4 (0.264) | 320.2 (1.070) | 3486.9 (4.909) |
| | kd-tree | 3.5 (0.006) | 11.8 (0.035) | 34.6 (0.102) | 338.8 (0.471) |
| | elkan | 7.2 (0.133) | 75.2 (0.325) | 353.1 (1.180) | 2771.8 (3.902) |
| | hamerly | 2.7 (0.031) | 14.6 (0.058) | 28.2 (0.090) | 204.2 (0.286) |
| Uniform random, n = 1,250,000, d = 8 | Iterations | 121 | 353 | 312 | 1405 |
| | lloyd | 21.8 (0.134) | 178.9 (0.491) | 660.7 (2.100) | 13,854.4 (9.857) |
| | kd-tree | 117.5 (0.886) | 622.6 (1.740) | 2390.8 (7.633) | 46,731.5 (33.254) |
| | elkan | 14.1 (0.071) | 130.6 (0.354) | 591.8 (1.879) | 11,827.9 (8.414) |
| | hamerly | 10.9 (0.045) | 40.4 (0.099) | 169.8 (0.527) | 1395.6 (0.989) |
| Uniform random, n = 1,250,000, d = 32 | Iterations | 137 | 4120 | 2096 | 2408 |
| | lloyd | 66.4 (0.323) | 5479.5 (1.325) | 12,543.8 (5.974) | 68,967.3 (28.632) |
| | kd-tree | 208.4 (1.324) | 29,719.6 (7.207) | 74,181.3 (35.380) | 425,513.0 (176.697) |
| | elkan | 48.1 (0.189) | 1370.1 (0.327) | 2624.9 (1.242) | 14,245.9 (5.907) |
| | hamerly | 46.9 (0.180) | 446.4 (0.103) | 1238.9 (0.581) | 9886.9 (4.097) |
| birch, n = 100,000, d = 2 | Iterations | 52 | 179 | 110 | 99 |
| | lloyd | 0.53 (0.004) | 4.60 (0.024) | 11.80 (0.104) | 48.87 (0.490) |
| | kd-tree | 0.41 (<0.001) | 0.96 (0.003) | 2.67 (0.021) | 17.68 (0.173) |
| | elkan | 0.58 (0.005) | 4.35 (0.023) | 11.80 (0.104) | 54.28 (0.545) |
| | hamerly | 0.44 (0.002) | 0.90 (0.003) | 1.86 (0.014) | 7.81 (0.075) |
| covtype, n = 150,000, d = 54 | Iterations | 19 | 204 | 320 | 111 |
| | lloyd | 3.52 (0.048) | 48.02 (0.222) | 322.25 (0.999) | 564.05 (5.058) |
| | kd-tree | 6.65 (0.205) | 266.65 (1.293) | 2014.03 (6.285) | 3303.27 (29.734) |
| | elkan | 3.07 (0.022) | 11.58 (0.044) | 70.45 (0.212) | 152.15 (1.347) |
| | hamerly | 2.95 (0.019) | 7.40 (0.024) | 42.83 (0.126) | 169.53 (1.505) |
| kddcup, n = 95,412, d = 56 | Iterations | 39 | 55 | 169 | 142 |
| | lloyd | 4.74 (0.032) | 12.35 (0.159) | 116.63 (0.669) | 464.22 (3.244) |
| | kd-tree | 9.68 (0.156) | 58.55 (0.996) | 839.31 (4.945) | 3349.47 (23.562) |
| | elkan | 4.13 (0.012) | 6.24 (0.049) | 32.27 (0.169) | 132.39 (0.907) |
| | hamerly | 3.95 (0.011) | 5.87 (0.042) | 28.39 (0.147) | 197.26 (1.364) |
| mnist50, n = 60,000, d = 50 | Iterations | 37 | 249 | 190 | 81 |
| | lloyd | 2.92 (0.018) | 23.18 (0.084) | 75.82 (0.387) | 162.09 (1.974) |
| | kd-tree | 4.90 (0.069) | 100.09 (0.393) | 371.57 (1.943) | 794.51 (9.780) |
| | elkan | 2.42 (0.005) | 7.02 (0.019) | 21.58 (0.101) | 55.61 (0.660) |
| | hamerly | 2.41 (0.004) | 4.54 (0.009) | 21.95 (0.104) | 77.34 (0.928) |
Table A2. The number of checked nodes for applying X removes on different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 removes   50 removes   100 removes   1000 removes   10,000 removes
Birch (2 dimensions, 100,000 nodes)      Hamerly            76,294       76,351       77,782        82,891         83,412
                                         E. Re-clustering   6067         19,689       27,670        26,892         56,782
covtype (54 dimensions, 150,000 nodes)   Hamerly            105,524      107,223      115,212       127,890        130,031
                                         E. Re-clustering   10,054       36,900       87,245        111,343        119,672
kddcup (56 dimensions, 95,412 nodes)     Hamerly            72,589       73,345       84,462        84,681         84,005
                                         E. Re-clustering   8051         34,572       65,552        72,432         74,210
Random I (2 dimensions, 100,000 nodes)   Hamerly            63,294       69,561       77,340        86,628         81,045
                                         E. Re-clustering   328          7450         23,725        52,231         48,952
Random II (2 dimensions, 125,000 nodes)  Hamerly            67,564       67,602       68,457        73,451         74,312
                                         E. Re-clustering   1964         8453         31,429        62,005         64,210

Table A3. Total consumed time for different number of changes (removes) for different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 removes   50 removes   100 removes   1000 removes   10,000 removes
Birch (2 dimensions, 100,000 nodes)      Hamerly            0.93         1.12         1.79          5.34           12.37
                                         E. Re-clustering   2.39         2.41         2.44          2.67           3.5
covtype (54 dimensions, 150,000 nodes)   Hamerly            52.47        57.76        68.89         86.9           112.34
                                         E. Re-clustering   73.31        75.49        78.13         84.21          88.32
kddcup (56 dimensions, 95,412 nodes)     Hamerly            27.23        31.12        38.34         52.66          72.4
                                         E. Re-clustering   33.89        35.01        37.37         43.12          47.52
Random I (2 dimensions, 100,000 nodes)   Hamerly            4.03         5.43         7.28          15.32          28.88
                                         E. Re-clustering   6.76         7.11         7.96          9.43           17.76
Random II (2 dimensions, 125,000 nodes)  Hamerly            15.96        18.07        24.74         32.49          58.72
                                         E. Re-clustering   25.4         26.77        27.67         31.98          39.71


Table A4. Re-clustering time for different number of changes (removes) for different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 removes   50 removes   100 removes   1000 removes   10,000 removes
Birch (2 dimensions, 100,000 nodes)      Hamerly            0.07         0.26         0.93          4.48           11.51
                                         E. Re-clustering   0.09         0.11         0.14          0.37           1.2
covtype (54 dimensions, 150,000 nodes)   Hamerly            11.24        16.53        27.66         45.67          71.11
                                         E. Re-clustering   5.77         7.95         10.59         16.67          20.78
kddcup (56 dimensions, 95,412 nodes)     Hamerly            2.13         6.02         13.24         27.56          47.3
                                         E. Re-clustering   1.22         2.34         4.7           10.45          14.85
Random I (2 dimensions, 100,000 nodes)   Hamerly            0.27         1.67         3.52          11.56          25.12
                                         E. Re-clustering   0.64         0.99         1.84          3.31           11.64
Random II (2 dimensions, 125,000 nodes)  Hamerly            0.82         2.93         9.6           17.35          43.58
                                         E. Re-clustering   0.28         1.65         2.55          6.86           14.59

Table A5. The total number of node checking for applying X removes on different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 removes   50 removes   100 removes   1000 removes   10,000 removes
Birch (2 dimensions, 100,000 nodes)      Hamerly            450,981      985,462      2,172,533     2,886,809      3,224,786
                                         E. Re-clustering   22,207       247,195      489,747       584,934        1,220,922
covtype (54 dimensions, 150,000 nodes)   Hamerly            626,361      1,417,670    3,220,573     4,506,773      5,042,927
                                         E. Re-clustering   74,038       470,938      1,681,247     2,696,209      2,604,502
kddcup (56 dimensions, 95,412 nodes)     Hamerly            428,751      943,378      2,359,573     2,951,249      3,247,913
                                         E. Re-clustering   47,999       440,674      1,247,387     1,723,434      1,826,968
Random I (2 dimensions, 100,000 nodes)   Hamerly            372,981      890,402      2,160,157     3,021,341      3,132,473
                                         E. Re-clustering   3700         88,088       410,847       1,218,409      1,195,518
Random II (2 dimensions, 125,000 nodes)  Hamerly            398,601      862,976      1,911,433     2,546,969      2,869,886
                                         E. Re-clustering   24,968       101,127      564,927       1,462,759      1,576,968

Table A6. The number of checked nodes for applying X inserts on different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 inserts   50 inserts   100 inserts   1000 inserts   10,000 inserts
Birch (2 dimensions, 100,000 nodes)      Hamerly            73,293       75,120       75,631        79,838         88,444
                                         E. Re-clustering   9016         18,337       26,712        43,769         61,069
covtype (54 dimensions, 150,000 nodes)   Hamerly            101,420      117,223      122,318       131,890        139,981
                                         E. Re-clustering   27,354       31,288       62,771        100,911        124,560
kddcup (56 dimensions, 95,412 nodes)     Hamerly            74,504       76,775       80,427        83,266         88,902
                                         E. Re-clustering   6211         13,991       45,421        67,981         73,790
Random I (2 dimensions, 100,000 nodes)   Hamerly            65,070       71,299       75,666        87,483         89,771
                                         E. Re-clustering   3447         6658         23,670        49,052         61,669
Random II (2 dimensions, 125,000 nodes)  Hamerly            68,348       67,890       73,890        84,233         86,768
                                         E. Re-clustering   8912         16,834       26,770        58,395         69,451

Table A7. Total consumed time for different number of changes (inserts) for different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 inserts   50 inserts   100 inserts   1000 inserts   10,000 inserts
Birch (2 dimensions, 100,000 nodes)      Hamerly            0.92         1.18         1.84          5.22           12.67
                                         E. Re-clustering   2.61         2.63         2.67          2.83           3.74
covtype (54 dimensions, 150,000 nodes)   Hamerly            67.33        67.74        71.9          88.16          124.86
                                         E. Re-clustering   74.65        77.11        80.69         85.55          91.27
kddcup (56 dimensions, 95,412 nodes)     Hamerly            29.87        33.06        41.31         54.4           78.19
                                         E. Re-clustering   36.98        37.13        37.96         48.84          56.13
Random I (2 dimensions, 100,000 nodes)   Hamerly            5.2          5.71         7.44          16.61          30.06
                                         E. Re-clustering   7.92         8.12         8.31          9.92           18.37
Random II (2 dimensions, 125,000 nodes)  Hamerly            16.78        18.21        25.69         35.48          61.26
                                         E. Re-clustering   26.53        27.91        28            33.49          41.75

Table A8. Re-clustering time for different number of changes (inserts) for different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 inserts   50 inserts   100 inserts   1000 inserts   10,000 inserts
Birch (2 dimensions, 100,000 nodes)      Hamerly            0.06         0.32         0.98          4.36           11.81
                                         E. Re-clustering   0.31         0.33         0.37          0.53           1.44
covtype (54 dimensions, 150,000 nodes)   Hamerly            26.1         26.51        30.67         46.93          83.63
                                         E. Re-clustering   7.11         9.57         13.15         18.01          23.73
kddcup (56 dimensions, 95,412 nodes)     Hamerly            4.77         7.96         16.21         29.3           53.09
                                         E. Re-clustering   4.31         4.46         5.29          16.17          23.46
Random I (2 dimensions, 100,000 nodes)   Hamerly            1.44         1.95         3.68          12.85          26.3
                                         E. Re-clustering   1.8          2            2.19          3.8            12.25
Random II (2 dimensions, 125,000 nodes)  Hamerly            1.64         3.07         10.55         20.34          46.12
                                         E. Re-clustering   1.41         2.79         2.88          8.37           16.63


Table A9. The total number of node checking for applying X inserts on different datasets. K = 100, m = 10.

Dataset (details)                        Algorithm          10 inserts   50 inserts   100 inserts   1000 inserts   10,000 inserts
Birch (2 dimensions, 100,000 nodes)      Hamerly            432,975      968,228      2,112,305     2,776,901      3,421,034
                                         E. Re-clustering   60,544       229,619      470,587       1,006,859      1,315,236
covtype (54 dimensions, 150,000 nodes)   Hamerly            601,737      1,557,670    3,419,541     4,650,773      5,430,977
                                         E. Re-clustering   298,938      397,982      1,191,767     2,435,409      2,712,038
kddcup (56 dimensions, 95,412 nodes)     Hamerly            428,751      943,378      2,246,593     2,900,309      3,438,896
                                         E. Re-clustering   24,079       173,121      844,767       1,612,159      1,816,468
Random I (2 dimensions, 100,000 nodes)   Hamerly            383,637      914,734      2,113,285     3,052,121      3,472,787
                                         E. Re-clustering   44,247       77,792       409,747       1,138,934      1,513,443
Random II (2 dimensions, 125,000 nodes)  Hamerly            403,305      867,008      2,063,557     2,935,121      3,355,670
                                         E. Re-clustering   115,292      210,080      471,747       1,372,509      1,707,993

References

Aaron, B., Tamir, D. E., Rishe, N. D., & Kandel, A. (2014). Dynamic incremental K-means clustering. In Proceedings of computational science and computational intelligence (CSCI): 1 (pp. 308–313).
Aggarwal, C. C. (2013). A survey of stream clustering algorithms. In Data clustering: Algorithms and applications (pp. 231–258). Taylor & Francis.
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms for projected clustering. ACM SIGMOD Record, 28(2), 61–72.
Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49–60.
Arthur, D., & Vassilvitskii, S. (2006). How slow is the k-means method? In Proceedings of the twenty-second annual symposium on computational geometry (pp. 144–153).
Arthur, D., Manthey, B., & Röglin, H. (2009). k-means has polynomial smoothed complexity. In Proceedings of the 50th annual IEEE symposium on foundations of computer science (FOCS '09) (pp. 405–414).
Banerjee, S., Ramanathan, K., & Gupta, A. (2007). Clustering short texts using Wikipedia. In Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval (pp. 787–788).
Becker, H., Naaman, M., & Gravano, L. (2011). Beyond trending topics: Real-world event identification on Twitter. In Proceedings of ICWSM 2011 (pp. 438–441).
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), 509–517.
Bilgin, C. C., & Yener, B. (2006). Dynamic network evolution: Models, clustering, anomaly detection. IEEE Network.
Bimal, V., Mislove, A., Cha, M., & Gummadi, K. P. (2009). On the evolution of user interaction in Facebook. In Proceedings of the workshop on online social networks (pp. 37–42).
Blackard, J. A., Dean, D. J., & Anderson, C. W. (1998). The forest covertype dataset. http://archive.ics.uci.edu/ml/machine-learning-databases/covtype
Bradley, P. S., & Fayyad, U. M. (1998). Refining initial points for K-means clustering. ICML, 98, 91–99.
Can, F., & Ozkarahan, E. A. (1990). Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems (TODS), 15(4), 483–517.
Chakraborty, S., & Nagwani, N. K. (2011). Analysis and study of incremental k-means clustering algorithm. In Proceedings of high performance architecture and grid computing (pp. 338–341).
Cormode, G., Muthukrishnan, S., & Zhuang, W. (2007). Conquering the divide: Continuous clustering of distributed data streams. In Proceedings of the international conference on data engineering (ICDE 2007) (pp. 1036–1045).
Datta, S., Giannella, C., & Kargupta, H. (2006). K-means clustering over a large, dynamic network. In Proceedings of SDM 2006 (pp. 153–164).
Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4), 364–366.
Drake, J., & Hamerly, G. (2012). Accelerated K-means with adaptive distance bounds. In Proceedings of the 5th NIPS workshop on optimization for machine learning (pp. 1–4).
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. Wiley.
Dudoit, S., & Fridlyand, J. (2003). Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19(9), 1090–1099.
Elkan, C. (2003). Using the triangle inequality to accelerate K-means. In Proceedings of ICML 2003 (pp. 147–153).
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD-96 (pp. 226–231).
Estivill-Castro, V. (2002). Why so many clustering algorithms: A position paper. ACM SIGKDD Explorations Newsletter, 4(1), 65–75.
Facebook (2017). Facebook friendships network dataset – KONECT, April 2017.
Feldman, D., Schmidt, M., & Sohler, C. (2013). Turning big data into tiny data: Constant-size coresets for K-means, PCA and projective clustering. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics.
Gersho, A., & Gray, R. (1992). Vector quantization I: Structure and performance. In Vector quantization and signal compression (pp. 309–343). Springer US.
Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78–94.
Hamerly, G. (2010). Making K-means even faster. In Proceedings of the 2010 SIAM international conference on data mining (pp. 130–140).
Hamerly, G., & Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the eleventh international conference on information and knowledge management (pp. 600–607).
Huang, Z. (1998). Extensions to the K-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283–304.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651–666.
Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., & Wu, A. Y. (2002). An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 881–892.
Kaufman, L., & Rousseeuw, P. J. (1987). Clustering by means of medoids. In Y. Dodge (Ed.), Statistical data analysis based on the L1 norm (pp. 405–416). Amsterdam: North Holland/Elsevier.
Kriegel, H. P., Kroger, P., Sander, J., & Zimek, A. (2011). Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3), 231–240.
Krishna, K., & Murty, M. N. (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(3), 433–439.
Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451–461.
Lin, C. R., & Gerla, M. (1997). Adaptive clustering for mobile wireless networks. IEEE Journal on Selected Areas in Communications, 15(7), 1265–1275.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
Lua, E. K., Crowcroft, J., Pias, M., Sharma, R., & Lim, S. (2005). A survey and comparison of peer-to-peer overlay network schemes. IEEE Communications Surveys & Tutorials, 7(2), 72–93.
MacKay, D. (2003). Chapter 20: An example inference task: Clustering. In Information theory, inference and learning algorithms (pp. 284–292). Cambridge University Press.
Mobasher, B., Cooley, R., & Srivastava, J. (1999). Creating adaptive web sites through usage-based clustering of URLs. In Proceedings of the knowledge and data engineering exchange workshop (KDEX '99) (pp. 19–26).
Mohammed, R. S., Abbood, F. H., & Yousif, I. A. (2016). Image encryption technique using clustering and stochastic standard map. In Proceedings of multidisciplinary in IT and communication science and applications (AIC-MITCSA) (pp. 1–6).
Nguyen, J., Thuy, T. T., & Armitage, G. (2008). A survey of techniques for internet traffic classification using machine learning. IEEE Communications Surveys & Tutorials, 10(4), 56–76.
Nidoy, E. W. k-MACE clustering for Gaussian clusters.
Papadopoulos, S., Kompatsiaris, Y., Vakali, A., & Spyridonos, P. (2012). Community detection in social media. Data Mining and Knowledge Discovery, 24(3), 515–554.
Pelleg, D., & Moore, A. (1999). Accelerating exact k-means algorithms with geometric reasoning. In Proceedings of the fifth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 277–281).
Pena, J. M., Lozano, J. A., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10), 1027–1040.
Pham, D. T., Dimov, S. S., & Nguyen, C. D. (2004). An incremental K-means algorithm. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 218(7), 783–795.
Phillips, S. J. (2002). Acceleration of k-means and related clustering algorithms. In Proceedings of the workshop on algorithm engineering and experimentation (pp. 166–177).
Piatetsky-Shapiro, G. (1996). In U. M. Fayyad, P. Smyth, & R. Uthurusamy (Eds.), Advances in knowledge discovery and data mining (Vol. 21). Menlo Park: AAAI Press.
Portnoy, L., Eskin, E., & Stolfo, S. (2001). Intrusion detection with unlabeled data using clustering. In Proceedings of the ACM CSS workshop on data mining applied to security (DMSA-2001).


Pukhrambam, P., Bhattacharjee, S., & Das, H. S. (2017). A multi-level weight based routing algorithm for prolonging network lifetime in cluster based sensor networks. In Proceedings of the international conference on signal, networks, computing, and systems (pp. 193–203).
Reddy, R., & Chaitany, V. V. K. (2016). Reducing current availability and re-clustering time in sensor nets. IJITR, 4(6), 4656–4658.
Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (1998). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2), 169–194.
Sculley, D. (2010). Web-scale k-means clustering. In Proceedings of the 19th ACM international conference on World Wide Web (pp. 1177–1178).
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1), 1–47.
Sibson, R. (1973). SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1), 30–34.
Tang, H., Hong, W., Zhou, L., & Miao, G. (2016). A new unequal clustering protocol with local re-clustering mechanism for wireless sensor networks. International Journal of Computational Science and Engineering, 12(4), 276–286.
Tran, D. A., Hua, K. A., & Do, T. T. (2004). A peer-to-peer architecture for media streaming. IEEE Journal on Selected Areas in Communications, 22(1), 121–133.
Williams, S., Wantland, T., Ramos, G., & Sibley, P. G. (2016). Point of interest (POI) data positioning in image. U.S. Patent No. 9,406,153, 2 Aug. 2016.
Wong, M. A., & Lane, T. (1981). A kth nearest neighbour clustering procedure. In Computer science and statistics: Proceedings of the 13th symposium on the interface (pp. 308–311).
Yang, J., & Leskovec, J. (2012). Defining and evaluating network communities based on ground-truth. In Proceedings of ICDM 2012 (pp. 181–213).
Yi, G., Guiling, S., Weixiang, L., & Yong, P. (2009). Recluster-LEACH: A recluster control algorithm based on density for wireless sensor network. In Proceedings of the 2nd international conference on power electronics and intelligent transportation system (PEITS): 3 (pp. 198–202).
Yu, J. Y., & Chong, P. H. J. (2005). A survey of clustering schemes for mobile ad hoc networks. IEEE Communications Surveys & Tutorials, 7(1), 32–48.
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Record, 25(2), 103–114.